sparklearning

When Should the Next Batch Run? The Story of Trigger Types

This is the story of triggers—the mechanism in Structured Streaming that controls when each micro-batch starts. The trigger answers a simple but consequential question: “how often should Spark check for new data and process it?” The answer shapes everything about a streaming query’s behavior: its latency, its throughput, its cost, and its interaction with sources and sinks. Different trigger types are designed for different operational goals, from near-real-time processing to controlled batch processing. Understanding what each trigger does—and what happens inside Spark when it fires—lets you choose the right one for your workload.


The role of the trigger in the micro-batch loop

In Structured Streaming’s micro-batch engine, the StreamExecution thread runs a continuous loop: determine input → run the batch → commit progress → wait → repeat. The trigger governs the “wait” step: how long does the loop pause between completing one batch and starting the next?

The trigger is set when you start a streaming query:

query = df.writeStream \
    .format("delta") \
    .trigger(processingTime="10 seconds") \
    .start(path)

If no trigger is specified, the default is processingTime="0 seconds"—which means: start the next batch immediately after the previous one commits, with no intentional wait.

The trigger is like the pace setter for a relay race. A fast trigger says “hand off the baton immediately after each leg.” A slow trigger says “wait at the handoff zone for 30 seconds before the next runner starts, even if you arrived early.” The pace setter doesn’t do the running—it just controls the timing between legs.


Trigger 1: ProcessingTime (interval-based)

.trigger(processingTime="30 seconds")  # or Trigger.ProcessingTime("30 seconds")

This is the most common trigger for production streaming jobs. It sets a fixed interval between batch starts.

How it works:

When to use it: near-real-time processing where you need a predictable latency bound. A processingTime="10 seconds" trigger means output is at most 10 seconds behind the source (plus processing time). Good for dashboards, alerting, and continuous ETL.

ProcessingTime is like a TV news channel that updates the ticker every 30 seconds. If the latest news came in at second 8, it appears at second 30. If preparing the update takes 40 seconds (longer than the interval), the next update starts immediately after—the ticker doesn’t wait for the next exact 30-second boundary. But it never publishes faster than the interval if data is sparse.

Key properties:


Trigger 2: default (no trigger / processingTime=”0 seconds”)

.trigger(processingTime="0 seconds")  # or no .trigger() call at all

This is ProcessingTime with a zero interval: as soon as one batch commits, the next starts immediately, with no pause. The streaming loop runs as fast as possible.

When to use it: maximum throughput scenarios where you want to minimize lag at the cost of continuous cluster resource consumption. If your source delivers data continuously and you want to process it as fast as Spark can go, this is the right choice.

Caution: at maximum speed, a streaming query with zero trigger interval never pauses. The cluster is continuously occupied. Small batches of data (e.g., a few records per second from Kafka) result in many tiny Spark jobs, each with scheduling overhead that dominates the actual processing work. In this case, a small positive interval (e.g., 1 second) produces better overall throughput by batching data between launches.


Trigger 3: Once

.trigger(once=True)  # or Trigger.Once()

The query processes all available data as a single batch, then stops automatically.

How it works:

When to use it: incremental batch processing. You have a Kafka topic, Delta table, or file directory as a source, and you want to process it periodically (e.g., once per hour via a job scheduler) rather than continuously. Each invocation processes the backlog since the last run, then exits. The cost is pay-per-run rather than continuous cluster occupancy.

Trigger.Once turns a streaming query into an incremental ETL job. It’s like a mail carrier who only delivers when called—they pick up all mail that has accumulated since their last visit, deliver it, and go home. The next call starts where they left off. Contrast with ProcessingTime, where the carrier lives in the mailbox and delivers continuously.

Key properties:

Limitation: one single batch for all available data. If the source has accumulated a very large backlog, this one batch may be very large and slow. For large backlogs, AvailableNow is better.


Trigger 4: AvailableNow

.trigger(availableNow=True)  # or Trigger.AvailableNow()

Like Once, but may run multiple micro-batches until all currently available data is consumed, then stops.

How it works:

AvailableNow is like a mail carrier who processes a large accumulated pile of mail in multiple trips. If 5,000 letters have piled up, they don’t try to carry all 5,000 in one go (that would be Trigger.Once). They take 500 per trip, make multiple trips until the pile is gone, then go home. The pace is controlled (rate limiting); the result is the same (all letters delivered); the approach is more manageable.

When to use it: incremental batch processing with rate limiting. If you schedule a job every hour but in some hours the source produces 10 million records, Trigger.Once would try to process 10 million records in one giant batch. AvailableNow processes them in smaller, rate-limited micro-batches, keeping each batch’s resource usage predictable.

Key properties:


Trigger 5: Continuous (experimental)

.trigger(continuous="1 second")  # or Trigger.Continuous("1 second")

The experimental Continuous processing engine breaks the micro-batch model entirely. Instead of a batch of tasks launched per interval, Continuous mode runs long-lived tasks that continuously read from sources and write to sinks, checkpointing progress every checkpoint_interval seconds.

How it works:

Latency: millisecond-scale end-to-end latency, vs. seconds for micro-batch mode.

Limitations (why it’s still experimental):

Continuous processing is like a live phone interpreter who translates each sentence as it’s spoken, rather than taking notes during the whole meeting and translating everything at the end. Ultra-low latency, but much harder to maintain state (you can’t summarize across sentences), and if the interpreter loses the connection for a moment, you lose a few seconds of translation.

When to use it: only when millisecond latency is a hard requirement and your query is simple (stateless filtering and transformation). For most production workloads, micro-batch with a short ProcessingTime trigger (1–5 seconds) achieves sufficient latency with much better reliability.


Choosing the right trigger

Goal Trigger
Near-real-time continuous processing processingTime("5 seconds")
Minimum possible latency (stateless) continuous("1 second") (experimental)
Maximum throughput, don’t care about per-batch latency processingTime("0 seconds")
Hourly incremental load, moderate data volume once=True
Hourly incremental load, large backlog possible availableNow=True
Scheduled backfill job availableNow=True

Bringing it together

Triggers control when each micro-batch starts in Structured Streaming’s execution loop. ProcessingTime runs batches on a fixed schedule—start immediately, complete, wait until the next interval, repeat. Zero interval / default runs batches back-to-back with no pause—maximum throughput, continuous resource use. Once processes all available data in a single batch and exits—ideal for incremental scheduled pipelines. AvailableNow processes all available data in multiple rate-limited batches and exits—the safer version of Once for large backlogs. Continuous (experimental) abandons micro-batches entirely for millisecond latency but only supports stateless operations. The choice of trigger is a fundamental design decision that determines latency, throughput, cost, and operational complexity. Most production workloads should use ProcessingTime with a short interval for continuous queries, or AvailableNow for scheduled incremental loads.