Luigi data pipelines for batch processing
Luigi is a data pipeline library written in Python for batch processing jobs. It handles dependency management, workflow management, visualization, and failures with just a few lines of code and commands.
Luigi was developed by Spotify, which later open sourced it, and it is now used by a variety of companies to automate their day-to-day tasks. Luigi aims to offer more flexibility than cloud platforms that provide data pipelines as a service: those can be very easy to use, but they come with fewer features and less functionality, and they lack portability.
It also comes with a web portal that lets you track your tasks, dependencies, and failures. The portal shows a dependency graph, which makes it easy to see what is going on and how far along your job is.
How to use Luigi?
Luigi is easy to use. You have several options for where to run it: locally, or on a cluster such as Hadoop, for example on an EC2 instance on AWS. Start by installing the most recent Luigi release, then launch a job from the command line:
pip install luigi
luigi --module <path to scheduler/job> <scheduler/job class name> --param1 <param1> ... --paramN <paramN>
You can have as many parameters as you want; you need to declare and handle all of them on the scheduler/job side, in your Luigi code.
What does the Python code look like?
The following is a simple example of what Luigi Python code looks like. It is a class containing four easy-to-understand methods; consider it to live under the luigi.jobs module used in the run command further down.
class MyJob(luigi.Task):
    name = luigi.Parameter(default=None)
    ...
        return <input path>
    ...
        return <output path>
To run the above code, you can use something like this:
luigi --module luigi.jobs MyJob --name Junaid
How to check the dependency graph?
The dependency graph is a great way to visualize the jobs in your pipelines. You can see it in the web GUI by visiting server-ip:port, where the port is set as follows when starting the scheduler daemon:
luigid --background --port <port>
Why should one consider moving to Luigi?
One of the most important reasons is portability: you can take your Luigi work and run it on different servers. For example, if you plan to move away from AWS, your Luigi pipelines can still work on other platforms such as Google Cloud. Another reason is parameter passing: reusing the same pipeline for different jobs is easy in Luigi because parameters are handled at runtime.
Want more on Luigi?
Find detailed docs on Luigi at