Live Course Module: Apache Airflow Course for Data Engineering
Total Duration: 40 Hours (4 Weeks)
Week 1: Introduction to Apache Airflow & Core Concepts
Duration: 8 hours (4 sessions × 2 hrs)
Topics:
- Introduction to Workflow Orchestration (2 hrs)
  - What is orchestration?
  - Role of Airflow in Data Engineering
  - Airflow vs Luigi vs Prefect comparison
  - Airflow architecture: Scheduler, Executor, Worker, Web Server
- Airflow Installation & Environment Setup (2 hrs)
  - Installing Airflow using pip and Docker
  - Understanding Airflow components
  - Navigating the Airflow UI (DAGs, Logs, Tasks, Graphs)
- Understanding DAGs & Tasks (2 hrs)
  - Creating a simple DAG in Python (see the sketch after this week's topic list)
  - Operators: PythonOperator, BashOperator, DummyOperator
  - Dependencies: set_upstream(), set_downstream(), >> and <<
- Mini Project + Q&A (2 hrs)
  - Build a simple ETL DAG to extract and transform CSV data
  - Schedule and run it through the Airflow UI
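For the "Creating a simple DAG in Python" session, the sketch below shows the shape of DAG built in class, assuming Airflow 2.x. The dag_id, task ids, and commands are illustrative placeholders rather than the exact course files.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform_csv():
    # Placeholder transform step; in the mini project this reads and reshapes a CSV file.
    print("transforming CSV data...")


with DAG(
    dag_id="simple_etl_demo",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting CSV data...'",
    )
    transform = PythonOperator(
        task_id="transform",
        python_callable=transform_csv,
    )

    # ">>" declares the dependency; extract.set_downstream(transform) is equivalent.
    extract >> transform

Dropping a file like this into the dags/ folder makes it appear in the Airflow UI, where it can be triggered manually or left to its daily schedule.
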
Week 2: Building & Managing Complex DAGs
Duration: 10 hours (5 sessions × 2 hrs)
Topics:
- Advanced DAG Design (2 hrs)
  - DAG parameters, default_args, retries, SLAs
  - Dynamic task generation
  - Branching and SubDAGs
- Using Airflow Operators (2 hrs)
  - FileSensor, EmailOperator, SimpleHttpOperator, PostgresOperator
  - Working with external APIs and SQL databases
- XComs and Data Sharing (2 hrs)
  - Passing data between tasks (see the sketch after this week's topic list)
  - Using XComs effectively in data pipelines
- Error Handling & Task Monitoring (2 hrs)
  - Handling task failures and retries
  - Alerting & notifications (Slack/Email integration)
- Mini Project + Q&A (2 hrs)
  - Build a multi-stage DAG integrating API extraction + data transformation + DB loading
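The sketch below combines two Week 2 threads: default_args with retries (Advanced DAG Design / Error Handling) and XCom-based data sharing between tasks. It assumes Airflow 2.x, where a task's return value is pushed to XCom automatically; the dag_id and payload are hypothetical.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "owner": "data-eng",                   # illustrative owner
    "retries": 2,                          # rerun a failed task up to two times
    "retry_delay": timedelta(minutes=5),
}


def extract(**context):
    # The return value is pushed to XCom under the key "return_value".
    return {"rows": 42}


def transform(**context):
    # Pull the upstream task's return value back out of XCom.
    payload = context["ti"].xcom_pull(task_ids="extract")
    print(f"transforming {payload['rows']} rows")


with DAG(
    dag_id="xcom_demo",                    # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,                # run only when triggered from the UI
    default_args=default_args,
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)

    t_extract >> t_transform

XComs live in the metadata database, so they suit small control values such as row counts, file paths, or S3 keys rather than full datasets.
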
Week 3: Airflow with Big Data & Cloud Integration
Duration: 10 hours (5 sessions × 2 hrs)
Topics:
- Airflow with Apache Spark (2 hrs)
  - Submitting Spark jobs using Airflow
  - Using SparkSubmitOperator for batch data pipelines (see the sketch after this week's topic list)
- Airflow with Hadoop & HDFS (2 hrs)
  - Managing data in HDFS
  - Using Airflow for daily ingestion & transformation jobs
- Airflow with AWS / GCP / Azure (2 hrs)
  - AWS S3, Redshift, BigQuery, and Azure Blob Storage integrations
  - Using Airflow Hooks and Connections
- Airflow with Kafka & Streaming Data (2 hrs)
  - Triggering workflows from Kafka topics
  - Real-time batch pipeline simulation
- Mini Project + Q&A (2 hrs)
  - Build a batch pipeline integrating Airflow + Spark + S3
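A sketch of submitting a Spark batch job from Airflow with SparkSubmitOperator. It assumes the apache-airflow-providers-apache-spark package is installed and a Spark connection named spark_default exists; the script path, bucket, and arguments are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_batch_demo",             # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform_sales = SparkSubmitOperator(
        task_id="transform_sales",
        application="/opt/jobs/transform_sales.py",   # hypothetical PySpark script
        conn_id="spark_default",                      # Spark connection configured in the Airflow UI
        application_args=[
            "--input", "s3a://demo-bucket/raw/",      # hypothetical S3 paths
            "--output", "s3a://demo-bucket/curated/",
        ],
    )

The Week 3 mini project chains a task like this after an extraction step that lands raw files in S3.
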
Week 4: Airflow in Production, Scaling & Capstone Project
Duration: 12 hours (6 sessions × 2 hrs)
Topics:
- Scheduling, Triggers, and Backfills (2 hrs)
  - Airflow scheduling and cron expressions (see the sketch after this week's topic list)
  - Manual triggers and backfilling DAG runs
- Airflow in Production Environments (2 hrs)
  - Airflow Executors: Sequential, Local, Celery, Kubernetes
  - Configuring Airflow for scalability and high availability
- CI/CD and Version Control (2 hrs)
  - DAG versioning using Git
  - Deploying Airflow pipelines through CI/CD tools (GitHub Actions, Jenkins)
- Monitoring, Logging & Security (2 hrs)
  - Airflow metrics, logging, Prometheus, and Grafana integration
  - Authentication & Role-Based Access Control (RBAC)
- Capstone Project Development (2 hrs)
  - Design and build an end-to-end data pipeline using Airflow and Cloud Storage
- Capstone Presentation & Feedback (2 hrs)
  - Present the final DAG and pipeline workflow
  - Instructor feedback and best practices discussion
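For the scheduling and backfill session, the sketch below pairs a cron expression with catchup so the scheduler creates runs for every missed interval since start_date. It assumes Airflow 2.x; the dag_id and command are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_report",               # illustrative name
    start_date=datetime(2024, 1, 1),       # earliest date the scheduler will backfill
    schedule_interval="30 2 * * *",        # cron expression: every day at 02:30
    catchup=True,                          # create runs for every missed interval since start_date
) as dag:
    build_report = BashOperator(
        task_id="build_report",
        bash_command="echo 'building report for {{ ds }}'",   # {{ ds }} renders to the run's logical date
    )

A specific window can also be rerun from the CLI, e.g. airflow dags backfill -s 2024-01-01 -e 2024-01-07 nightly_report.
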
🧩 Capstone Project Example
Project Title: Automated Data Pipeline for E-Commerce Analytics
Goal:
Extract transactional data from APIs → Load into AWS S3 → Transform using Spark → Load into Redshift, with the whole flow orchestrated by Airflow
Tech Stack: Airflow, Python, Spark, AWS S3, Redshift
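A high-level skeleton of how the capstone stages could be wired into one DAG. Every name here (API endpoint, buckets, connection ids, schema, script path) is a hypothetical placeholder, and it assumes the Amazon and Apache Spark provider packages are installed; the real capstone fills in the extraction logic and the PySpark job.

import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def extract_orders_to_s3(**context):
    # Pull transactional data from the (hypothetical) orders API and land it raw in S3.
    orders = requests.get("https://example.com/api/orders", timeout=30).json()
    S3Hook(aws_conn_id="aws_default").load_string(
        string_data=json.dumps(orders),
        key=f"raw/orders_{context['ds']}.json",
        bucket_name="ecommerce-raw",                  # hypothetical bucket
        replace=True,
    )


with DAG(
    dag_id="ecommerce_analytics_pipeline",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_api_to_s3",
        python_callable=extract_orders_to_s3,
    )

    transform = SparkSubmitOperator(
        task_id="transform_with_spark",
        application="/opt/jobs/clean_orders.py",       # hypothetical PySpark job writing Parquet back to S3
        conn_id="spark_default",
        application_args=["--run-date", "{{ ds }}"],
    )

    load = S3ToRedshiftOperator(
        task_id="load_to_redshift",
        schema="analytics",                            # hypothetical Redshift schema and table
        table="orders",
        s3_bucket="ecommerce-curated",                 # hypothetical bucket written by the Spark job
        s3_key="orders/{{ ds }}/",
        redshift_conn_id="redshift_default",
        copy_options=["FORMAT AS PARQUET"],
    )

    extract >> transform >> load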