Handling Duplicates In Streaming Pipeline
Three ways to handle duplicate data in streaming pipelines. Learn the benefits, use cases and more in this article.
Streaming (or real-time) pipelines are on the rise: teams are switching from Lambda to Kappa Architecture and making streaming the source of truth. The benefits of streaming can be great, but they come with new challenges. Today, I am going to focus on duplicate handling in a streaming pipeline architecture.
Let’s look at three ways to handle duplicates in a streaming pipeline. Which one fits depends on your use case and requirements, and understanding each in detail will help you choose the best one.
💡My core experience is with Spark and Delta, so my knowledge may be limited.
Asynchronous Job
An async job that runs on a schedule to dedupe the data. This typically happens in the warehouse, e.g. as a scheduled function or query (a sketch follows the pros and cons below).
Pros
- Works best when latency is more important than duplicate handling (latency > dedupe).
- Related to latency, the async job keeps the streaming job free of dedupe complexity, and it is easy to rerun since it's a batch job.
- Works well when duplicates are rare and are mostly handled at the source.
- Helps when querying is mostly done on non-recent data, e.g. last week or last month, as duplicates in older data have already been handled.
- An append-only sink is faster.
Cons
- Recent data would still have duplicates, and consumers would need to handle them. E.g. if the dedupe job runs every midnight, users would see duplicates for up to 24 hours until the next run.
- Requires additional setup, like monitoring and alerting for the batch job, to keep the data up to date.
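Here is a minimal sketch of what such a scheduled job could look like with Spark and Delta. The table path, the `event_id` key, and the `ingested_at` column are assumptions for illustration, and you would want to schedule the overwrite for a quiet window so it does not conflict with the streaming writer.

```python
# A minimal sketch of an async dedupe job, assuming a Delta table at
# /data/events with a (hypothetical) unique event_id and an ingested_at
# timestamp. Run it from any scheduler (cron, Airflow, a Databricks job).
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("nightly-dedupe").getOrCreate()

events = spark.read.format("delta").load("/data/events")

# Keep only the most recently ingested row per event_id.
latest_first = Window.partitionBy("event_id").orderBy(F.col("ingested_at").desc())
deduped = (events
    .withColumn("rn", F.row_number().over(latest_first))
    .filter("rn = 1")
    .drop("rn"))

# Overwrite the table; consumers see duplicates only until the next run.
deduped.write.format("delta").mode("overwrite").save("/data/events")
```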
Upsert
Upsert, also known as a MERGE INTO statement in SQL, is a very popular way to handle duplicates. It works well in batch; in streaming, however, it depends on data size and latency requirements (a sketch follows the pros and cons below).
Pros
- Works best when duplicate handling is more important than latency (dedupe > latency).
- Easy to maintain, as it is one unified job that handles both processing and dedupe logic.
- This approach makes sure zero duplicates are present in the sink.
- Works well when duplicates are frequent and could cause issues in downstream use cases.
Cons
- Upserts can slow down queries and can sometimes be the bottleneck when writing to the sink, compared to append-only writes.
- Depends on the upsert support between the stream processing engine and the warehouse, e.g. Spark Streaming has `merge` support for Delta Lake, and Flink has `merge` support for Iceberg.
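As a sketch of the Spark + Delta flavour: the usual pattern is foreachBatch plus a MERGE on a key. The table path and the `event_id` key below are assumptions, and `events` stands for an already-parsed streaming DataFrame.

```python
# A minimal sketch of a streaming upsert with Spark Structured Streaming and
# Delta Lake. Assumes `spark` is an active session, `events` is a streaming
# DataFrame with a (hypothetical) event_id column, and the target Delta
# table already exists at /data/events.
from delta.tables import DeltaTable

def upsert_batch(micro_batch_df, batch_id):
    # Dedupe within the micro-batch first, then MERGE into the target.
    batch = micro_batch_df.dropDuplicates(["event_id"])
    target = DeltaTable.forPath(spark, "/data/events")
    (target.alias("t")
        .merge(batch.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(events.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/chk/events_upsert")
    .start())
```

Deduping within the micro-batch first matters because Delta's MERGE errors out if multiple source rows match the same target row.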
Watermark
The watermark approach requires state to be maintained inside the streaming app, and incoming data is deduplicated by looking it up against that state (a sketch follows the pros and cons below).
Pros
- Works best when both duplicate handling and latency are important (dedupe = latency).
- This approach makes sure zero duplicates are present in the sink (within the watermark window).
- Works well when duplicates are frequent but deduplication does not need to look up the full history; feasibility depends on data size.
- An append-only sink is faster.
Cons
- The application requires state to be maintained through watermarking, which could lead to issues (slowness, OOM) depending on the size of the state.
- With watermarking, finding the right window is important, otherwise duplicates may still flow through, e.g. when deduping only over the last 24 hours.
- Without watermarking, the state would contain the whole dataset, which could lead to slowness depending on its size. It would be similar to doing an upsert.
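For Spark Structured Streaming, a minimal sketch of this is withWatermark plus dropDuplicates. The `event_id`/`event_time` columns, the 24-hour window, and the paths are assumptions, and `events` again stands for an already-parsed streaming DataFrame.

```python
# A minimal sketch of in-stream dedupe with a watermark in Spark Structured
# Streaming. Assumes `events` is a streaming DataFrame with (hypothetical)
# event_id and event_time columns.
deduped = (events
    # State older than 24 hours is dropped; duplicates arriving later than
    # that window can still slip through (see the cons above).
    .withWatermark("event_time", "24 hours")
    .dropDuplicates(["event_id", "event_time"]))

(deduped.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/chk/events_dedupe")
    .start("/data/events"))
```

Keeping the event-time column in the dedupe keys is what lets the watermark actually purge old state; on recent Spark versions you can also use dropDuplicatesWithinWatermark instead.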
💡These approaches can also be used together if needed, e.g. the Watermark approach for duplicates within the last week, and an Async job for historical duplicates.
Choosing the Dedupe Approach: if latency matters more than dedupe, go with the Async Job; if dedupe matters more than latency, go with Upsert; if both matter, go with Watermark.