Apache Spark Structured Streaming is a near real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data. The Structured Streaming engine performs the computation incrementally and continuously updates the result as streaming data arrives.
For a step-by-step tutorial, see Run your first Structured Streaming workload.
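
The incremental model described above can be illustrated without Spark. The following is a minimal pure-Python sketch, not the Structured Streaming API: `batch_word_count` and `IncrementalWordCount` are hypothetical names, and the input lines are made up. It shows that folding micro-batches into a running result gives the same answer as one batch computation over all the data.

```python
from collections import Counter


def batch_word_count(lines):
    """Batch view: count every word in the complete, static input."""
    return Counter(word for line in lines for word in line.split())


class IncrementalWordCount:
    """Streaming view: keep a running result and fold in each micro-batch."""

    def __init__(self):
        self.result = Counter()

    def process_batch(self, lines):
        # Only the new lines are scanned; earlier data is never re-read.
        self.result.update(word for line in lines for word in line.split())
        return self.result


# Illustrative input: three micro-batches arriving over time.
micro_batches = [["spark streams data"], ["data arrives"], ["spark updates"]]

stream = IncrementalWordCount()
for batch in micro_batches:
    stream.process_batch(batch)

# The incremental result matches a batch computation over all the data.
all_lines = [line for batch in micro_batches for line in batch]
print(stream.result == batch_word_count(all_lines))
```

In Structured Streaming, the engine performs this bookkeeping itself, so the query is written once in batch style.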
Read from a data stream
Use Structured Streaming to incrementally ingest data from supported data sources.
| Feature | Description |
|---|---|
| Auto Loader | Incrementally and efficiently process new data files as they arrive in cloud storage. |
| Delta table streaming reads and writes | Use Delta Lake tables as streaming sources and sinks with exactly-once processing guarantees. |
| Standard connectors | Connect to message buses, queues, and enterprise applications using standard connectors. |
| Micro-batch size | Limit input rates to maintain consistent batch sizes and prevent processing delays. |
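
To make the idea of limiting input rates concrete, here is a small pure-Python sketch under stated assumptions. The function `plan_micro_batches` and the file names are hypothetical and are not part of any Spark API. In Spark, the cap is set through source options such as `maxFilesPerTrigger` (file-based and Delta sources) or `maxOffsetsPerTrigger` (Kafka) rather than through code like this.

```python
def plan_micro_batches(backlog, max_per_trigger):
    """Yield successive slices of the backlog, each at most max_per_trigger long."""
    if max_per_trigger <= 0:
        raise ValueError("max_per_trigger must be positive")
    for start in range(0, len(backlog), max_per_trigger):
        yield backlog[start:start + max_per_trigger]


# Hypothetical backlog of ten unprocessed files.
backlog = [f"file-{i}.json" for i in range(10)]

batches = list(plan_micro_batches(backlog, max_per_trigger=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Capping each batch keeps batch duration predictable when a large backlog builds up, at the cost of taking more triggers to catch up.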
Write to a data sink
Configure how Structured Streaming delivers data to target systems.
| Feature | Description |
|---|---|
| Checkpoints | Store processing state to enable fault tolerance and exactly-once delivery semantics. |
| Output mode | Choose between append, update, and complete modes for stateful streaming queries. |
| Trigger intervals | Set trigger intervals to balance latency and cost for your processing requirements. |
| Real-time mode in Structured Streaming | Process data for real-time workloads with end-to-end latency as low as five milliseconds. |
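
The role of checkpoints in exactly-once delivery can be sketched with a toy model. This is a simplified illustration, not Spark's implementation: `Checkpoint`, `IdempotentSink`, and `run` are hypothetical names, and the "durable" state is an in-memory object. The assumed pattern is that the offset range for a batch is recorded before processing, and the sink overwrites by batch ID, so replaying a batch after a crash does not duplicate rows.

```python
class Checkpoint:
    """Stand-in for a durable checkpoint location (in memory here)."""

    def __init__(self):
        self.planned = {}    # batch_id -> (start, end), recorded before processing
        self.committed = -1  # highest batch_id known to be fully written


class IdempotentSink:
    """Sink keyed by batch_id, so writing the same batch twice is harmless."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        self.batches[batch_id] = list(rows)

    def rows(self):
        return [r for b in sorted(self.batches) for r in self.batches[b]]


def run(source, checkpoint, sink, batch_size, crash_before_commit_of=None):
    """Process the source in batches, resuming from the checkpoint."""
    while True:
        batch_id = checkpoint.committed + 1
        if batch_id in checkpoint.planned:
            # A previous run planned this batch but never committed it: replay it.
            start, end = checkpoint.planned[batch_id]
        else:
            start = checkpoint.planned[batch_id - 1][1] if batch_id > 0 else 0
            end = min(start + batch_size, len(source))
            if start == end:
                return  # nothing new to process
            checkpoint.planned[batch_id] = (start, end)
        sink.write(batch_id, source[start:end])
        if batch_id == crash_before_commit_of:
            raise RuntimeError("simulated crash after write, before commit")
        checkpoint.committed = batch_id


source = list(range(7))
checkpoint, sink = Checkpoint(), IdempotentSink()

try:
    run(source, checkpoint, sink, batch_size=3, crash_before_commit_of=1)
except RuntimeError:
    pass  # the "cluster" went down mid-batch

run(source, checkpoint, sink, batch_size=3)  # restart with the same checkpoint
print(sink.rows() == source)
```

Batch 1 is written twice in this run, yet each source row appears in the sink exactly once because the replay covers the same offset range and overwrites the same batch.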
Stateful and stateless processing
Stateless queries process rows without retaining state. Stateful queries maintain intermediate state for aggregations, joins, and deduplication.
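
The distinction can be shown with a short pure-Python sketch. The names `stateless_filter` and `Deduplicator`, and the record fields `id` and `amount`, are hypothetical and chosen only for illustration. The filter looks at each row in isolation, while deduplication must remember which keys earlier batches already produced.

```python
def stateless_filter(batch):
    """Stateless: each row is judged on its own, so no memory is needed."""
    return [row for row in batch if row["amount"] > 0]


class Deduplicator:
    """Stateful: remembers every id seen in earlier batches."""

    def __init__(self):
        self.seen_ids = set()

    def process_batch(self, batch):
        fresh = []
        for row in batch:
            if row["id"] not in self.seen_ids:
                self.seen_ids.add(row["id"])
                fresh.append(row)
        return fresh


batch_1 = [{"id": "a", "amount": 5}, {"id": "b", "amount": -1}]
batch_2 = [{"id": "a", "amount": 5}, {"id": "c", "amount": 2}]

dedup = Deduplicator()
print(stateless_filter(batch_1))     # keeps only the row with id "a"
print(dedup.process_batch(batch_1))  # both rows are new
print(dedup.process_batch(batch_2))  # "a" was seen before, only "c" remains
```

Note that `seen_ids` grows without bound in this sketch, which is the problem that watermarks address in stateful queries.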
| Feature | Description |
|---|---|
| Stateless streaming queries | Optimize queries that process data without maintaining intermediate state. |
| Watermarks | Control how long Structured Streaming waits for late-arriving data in stateful operations. |
| Stateful streaming | Manage aggregations, stream-stream joins, and deduplication using stateful operators. |
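
As a rough illustration of how a watermark bounds lateness, the following pure-Python sketch tracks the threshold across micro-batches. It is a simplified model, not Spark's implementation: `WatermarkTracker` is a hypothetical class, event times are plain integers in seconds, and the threshold is applied to individual events, whereas the engine applies it to the windows or keys held by a stateful operator. In PySpark the delay is declared with `DataFrame.withWatermark`.

```python
class WatermarkTracker:
    """Toy model: watermark = max event time seen so far - allowed delay."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self.max_event_time = None
        self.watermark = None  # no threshold until the first batch completes

    def process_batch(self, event_times):
        """Split a batch into accepted and too-late events, then advance the watermark."""
        accepted, dropped = [], []
        for t in event_times:
            if self.watermark is not None and t < self.watermark:
                dropped.append(t)  # older than the threshold set by earlier batches
            else:
                accepted.append(t)
        if event_times:
            newest = max(event_times)
            if self.max_event_time is None or newest > self.max_event_time:
                self.max_event_time = newest
            self.watermark = self.max_event_time - self.delay_seconds
        return accepted, dropped


tracker = WatermarkTracker(delay_seconds=10)
first = tracker.process_batch([100, 105, 95])   # watermark becomes 95 afterwards
second = tracker.process_batch([90, 96, 120])   # 90 is behind the watermark
print(first, second, tracker.watermark)
```

Because the watermark only moves forward, state for anything older than it can be discarded safely, which keeps stateful queries from growing without limit.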
Monitor and manage
Track query performance, apply optimizations, and govern data access for production Structured Streaming workloads.
| Feature | Description |
|---|---|
| Monitor with StreamingQueryListener | Track query progress and performance metrics using the Spark UI and listener API. |
| Govern with Unity Catalog | Configure Unity Catalog for streaming workloads with governance and access control. |
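
One common use of progress metrics is to detect a query that is falling behind its input. The sketch below is a pure-Python illustration under stated assumptions: `falling_behind` is a hypothetical helper, and the sample values are made up. The field names mirror those reported in Spark's streaming query progress (`batchId`, `numInputRows`, `inputRowsPerSecond`, `processedRowsPerSecond`); in a real workload they would come from a listener's `onQueryProgress` callback rather than from a hand-written list.

```python
def falling_behind(progress_events):
    """Return the batchIds where rows arrived faster than they were processed."""
    return [
        p["batchId"]
        for p in progress_events
        if p["inputRowsPerSecond"] > p["processedRowsPerSecond"]
    ]


# Made-up sample values, for illustration only.
events = [
    {"batchId": 0, "numInputRows": 1000,
     "inputRowsPerSecond": 100.0, "processedRowsPerSecond": 250.0},
    {"batchId": 1, "numInputRows": 4000,
     "inputRowsPerSecond": 400.0, "processedRowsPerSecond": 300.0},
    {"batchId": 2, "numInputRows": 1500,
     "inputRowsPerSecond": 150.0, "processedRowsPerSecond": 310.0},
]

print(falling_behind(events))  # [1]
```

A sustained run of flagged batches suggests that the query needs more resources or a lower input rate per batch.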