My Journey with Data Pipeline Orchestrators
Sharing my journey across the different data pipeline orchestrators I have used for large-scale batch pipelines, discussing migrations, reasoning, and much more.
As a Data Engineer, I have worked with many different orchestrators, from open source to fully managed ones. Some were owned by my team, so I managed the data infra; in other cases I simply leveraged what the Data Platform team provided.
Today, I will share the different types of orchestrators I have used, what the pain points were, and why we migrated or adopted a new one.
I will not focus on pros and cons from a general perspective; rather, this is geared towards my experience of moving to a new orchestrator.
My notes reflect the time I used these tools, so they may have improved or changed since.
AWS Data Pipeline
I started with this one at my first job at Earnest Analytics, knowing nothing about what a data orchestrator does. It was a pretty basic tool and my first one; I used it a long time ago, so I don't remember all the details.
A lot of teams actually leveraged AWS Data Pipeline in the early days, before the more modern players came into the picture.
This was the orchestrator my team was already using for daily scheduling of SQL jobs on Redshift, plus some scripting for bootstrapping and data movement between S3 buckets using EC2.
We faced several challenges while building pipelines with AWS Data Pipeline, which led us to look out for newer alternatives. It was also around the same time we moved from SQL on Redshift to Spark on EMR.
No parameter passing support
Schedule-based triggers only
Manual reruns for partial pipeline failures
Hard to debug and maintain
Luigi
Within the first few months of my first job, we looked into Luigi, a data orchestrator open sourced by Spotify, and migrated to it. It was a big improvement over the last one, as it solved most of the problems we had.
Luigi was the only choice we looked into, because we had team members with experience using it.
Luigi requires compute to run on; we used EC2 for the main scheduler and EMR for the Spark jobs. It enabled us to pass parameters and build event-driven pipelines.
A pipeline was triggered by a command sent from AWS Lambda upon a file drop in S3.
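As a minimal sketch of that trigger, assuming a hypothetical `ProcessFile` Luigi task: the Lambda handler reads the S3 event and builds the Luigi command to run. Actually shipping the command to the scheduler host (e.g. over SSH or a queue) is out of scope here.

```python
import shlex


def handler(event, context=None):
    """Lambda-style handler: turn an S3 file-drop event into a Luigi CLI command.

    The module and task names are hypothetical; real ones depend on the pipeline.
    """
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # Luigi tasks accept parameters on the command line as --param-name value
    cmd = [
        "luigi", "--module", "pipeline.tasks", "ProcessFile",
        "--bucket", bucket,
        "--key", key,
    ]
    return shlex.join(cmd)
```

This is what made the pipelines event driven: the S3 drop, not a clock, decided when a run started, and the file location flowed into the task as parameters.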
However, after working with Luigi for a few years, we realized there was a better alternative that could provide much more, and that Luigi came with its own problems, like:
UI was not interactive
UI ran only while the job was running
Scaling was a bit challenging
No containerization support
Development experience was not great
Hard to test and deploy
Airflow
Yes, we finally migrated to the popular one. I believe we might have adopted it earlier, instead of Luigi, had we had the knowledge; however, Airflow at that time was still gaining popularity.
This time we looked into alternatives as well, but Airflow was popular and a clear winner.
We leveraged Kubernetes to host Airflow, along with the KubernetesPodOperator to scale out the workloads. We also moved to a containerized approach, which kept the scheduler very lightweight. The same Spark jobs on EMR were now orchestrated through Airflow.
Since then, I have been using Airflow on Kubernetes. At KHealth, I leveraged Google Cloud Composer to schedule BigQuery jobs, while at Socure I currently use Airflow to schedule Spark-based workloads on Kubernetes.
Airflow definitely comes with its pros and cons; some important ones are:
Interactive and easy to use UI
Concept of Sensors, allowing us to eliminate extra tooling like AWS Lambda
Easy to scale through Kubernetes Operator
Docker Support through Kubernetes Operator
Lightweight scheduler through decoupling of logic
Parameter passing as a config
Hard to test DAGs
Challenging to pass data between tasks
Easy to integrate with tools like Slack
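The containerized, config-driven setup above can be sketched as a small helper that turns a config entry into KubernetesPodOperator arguments. The config schema and names here are hypothetical; the operator parameters themselves (`task_id`, `image`, `cmds`, `arguments`, `get_logs`) are real KubernetesPodOperator arguments.

```python
def pod_task_kwargs(job: dict) -> dict:
    """Render keyword arguments for Airflow's KubernetesPodOperator from a
    config entry, so the DAG file stays a thin, logic-free wrapper."""
    return {
        "task_id": job["name"],
        "name": job["name"],
        "namespace": job.get("namespace", "data-pipelines"),
        "image": job["image"],                # all business logic lives in the image
        "cmds": ["spark-submit"],
        "arguments": ["--class", job["entrypoint"], *job.get("args", [])],
        "get_logs": True,                     # stream pod logs back into the Airflow UI
    }

# In a DAG file this would be used roughly as:
#   for job in load_config("pipeline.yaml"):   # load_config is hypothetical
#       KubernetesPodOperator(dag=dag, **pod_task_kwargs(job))
```

This is the decoupling that keeps the scheduler lightweight: the scheduler only launches pods from config, while the heavy Spark logic ships inside the container image.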
AWS Step Functions
I have worked with Step Functions twice: first when we were migrating from Luigi to Airflow, and second at my current company.
Follow these Large Scale Migration Best Practices for a smooth process.
We had to leverage Step Functions to automate some upstream tasks that we could not handle through Airflow, due to limitations of our current Data Platform.
Step Functions is a fully managed service (no compute needs to be provisioned), so from that side it was pretty smooth. The goal was to automate two tasks: an SFTP-to-S3 service that runs on AWS Batch, and an AWS Lambda function that dynamically builds a config that we pass to Airflow for our config-driven pipelines.
Easy to plug in different AWS services
Easy to maintain infra through IaC tools like Terraform
Lacks flexibility with partial reruns
No built-in scheduling; requires an external service like EventBridge
Ideally, if we had the capability, we would have kept everything within Airflow.
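As a rough sketch, the two-task state machine described above can be expressed in Amazon States Language (ASL), built here as a plain Python dict. All ARNs, job names, and function names below are placeholders, not the real ones; the `batch:submitJob.sync` and `lambda:invoke` service integrations are real ASL resources.

```python
import json

definition = {
    "Comment": "SFTP-to-S3 via AWS Batch, then build the Airflow config via Lambda",
    "StartAt": "SftpToS3",
    "States": {
        "SftpToS3": {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for the Batch job to finish
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "sftp-to-s3",
                "JobDefinition": "sftp-to-s3-jobdef",
                "JobQueue": "default-queue",
            },
            "Next": "BuildAirflowConfig",
        },
        "BuildAirflowConfig": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "build-airflow-config"},
            "End": True,
        },
    },
}

asl_json = json.dumps(definition, indent=2)  # what you would hand to Terraform / the AWS API
```

Because the definition is just JSON, it slots naturally into IaC tools like Terraform, which is part of why maintaining the infra was easy.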
Mage
Mage came into the picture recently, claiming to be a modern Airflow replacement. I never used it professionally, but I did test and experiment with it.
I set it up locally by following the documentation, and also used their free hosted version.
Mage has some great features, especially for enabling Data Scientists to do everything in one place:
A scheduler with an integrated editor
Easier to set up, comparatively
Pre-built boilerplate code
And many more …
Read my full story: My Two Cents on Mage
Today, Airflow is most likely still on top, with Dagster coming second while Mage is still gaining popularity. No tool can solve all the problems; it depends on the use cases and the tradeoffs.
One good lesson here: I did not work on so many different technologies on purpose. As a young engineer, I really wanted to just hop onto the shiny new one every other year, but this came naturally as part of the journey. We experienced difficulties and moved to a better alternative to solve them, giving me experience while learning the migration process.