Apache Airflow: A Real-life Use Case
In this post, I will guide you through writing an Airflow scheduler for a practical use case.
Overview
Airflow is simply a tool that lets us programmatically schedule and monitor our workflows. When dealing with complicated pipelines, in which many parts depend on each other, Airflow helps us write a clean scheduler in Python, along with a web UI to visualize pipelines, monitor progress, and troubleshoot issues when needed.
Real-life Example
The best way to grasp the power of Airflow is to write a simple pipeline scheduler. A common use case for Airflow is to periodically check the current file directories and run bash jobs based on those directories. In this post, I will write an Airflow scheduler that checks HDFS directories and runs simple bash jobs according to the existing HDFS files. The high-level pipeline can be illustrated as below:
As we can see, we first check today's dir1 and dir2; if one of them does not exist (due to failed jobs, corrupted data…), we fall back to yesterday's directory. We also have a rule for job2 and job3: they depend on job1, so if job1 fails, the expected outcome is that both job2 and job3 should also fail. This is one of the common pipeline patterns that can be easily expressed in Airflow.
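To make that job1/job2/job3 rule concrete before we build the real pipeline, here is a minimal sketch of how such a dependency structure looks in Airflow (assuming Airflow 1.x to match the setup commands below; the DAG name, task IDs, and bash commands are placeholders, not the actual jobs we write later in this post):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="dependency_pattern_sketch",  # placeholder DAG name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    job1 = BashOperator(task_id="job1", bash_command="echo job1")
    job2 = BashOperator(task_id="job2", bash_command="echo job2")
    job3 = BashOperator(task_id="job3", bash_command="echo job3")

    # With the default "all_success" trigger rule, job2 and job3 only run
    # when job1 succeeds; if job1 fails, both downstream jobs are not run.
    job1 >> [job2, job3]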
Setting Up
There are a lot of good sources on Airflow installation and troubleshooting. Here, I will just briefly show you how to set up Airflow on your local machine.
Install Airflow using pip:
pip install apache-airflow
Initialize the Airflow database:
airflow initdb
Start the webserver:
airflow webserver -p 8080
Run the scheduler:
airflow scheduler
If everything runs successfully, you can check out the Airflow UI at http://localhost:8080/
OK, let’s write it!
First, we need to define a set of default parameters that our pipeline will use. Since our pipeline needs to check directory 1 and directory 2, we also need to specify those as variables. This is how we define it:
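As a rough sketch, such a definition could look like the following (the owner, start date, and HDFS paths here are placeholders, not the post's actual values):

from datetime import datetime, timedelta

# Default parameters shared by every task in the DAG.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2020, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# HDFS directories the pipeline checks before running the bash jobs
# (placeholder paths).
dir1 = "/data/dir1"
dir2 = "/data/dir2"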