Live Course Module: Perfect Course for Data Engineering
Total Duration: 52 Hours (5 Weeks)
Week 1: Introduction to Data Engineering & Core Concepts
Duration: 10 hours (5 sessions × 2 hrs)
Topics:
- Introduction to Data Engineering (2 hrs)
  - What is Data Engineering?
  - Data Engineer vs Data Scientist vs Data Analyst
  - Overview of the Data Lifecycle & the Modern Data Stack
- Data Architecture & Ecosystem Overview (2 hrs)
  - OLTP vs OLAP systems
  - Data pipelines & ETL/ELT concepts
  - Data warehousing, data lakes, and the lakehouse
- Relational Databases & SQL Basics (2 hrs)
  - SQL fundamentals
  - Data extraction using SQL queries
  - Hands-on: querying sample datasets
- Data Modeling & Schema Design (2 hrs)
  - Normalization, star/snowflake schemas
  - Primary/foreign keys
  - Dimensional modeling in warehouses
- Mini Project + Q&A (2 hrs)
  - Design a small database schema
  - Query and load data into it (see the schema sketch after this list)
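As a taste of the Week 1 mini project, here is a minimal sketch using Python's built-in sqlite3 module; the star-schema table names and sample rows are illustrative, not part of the course materials:

```python
import sqlite3

# In-memory database; swap in a file path to persist the schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A tiny star schema: one dimension table and one fact table
# linked by a foreign key.
cur.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    category   TEXT
);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    amount     REAL
);
""")

# Load a few rows, then query across the join.
cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 2, 1998.0), (2, 2, 1, 250.0)])
conn.commit()

# Revenue per category: the kind of query covered in the SQL sessions.
for row in cur.execute("""
    SELECT p.category, SUM(f.amount) AS revenue
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category
"""):
    print(row)
```

The same pattern (dimension table, fact table, join-and-aggregate query) scales up to the warehouse schemas covered in Week 3.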
Week 2: Data Collection, Ingestion & Processing
Duration: 10 hours (5 sessions × 2 hrs)
Topics:
- Data Ingestion Techniques (2 hrs)
  - Batch vs streaming ingestion
  - ETL vs ELT pipelines
- Apache Kafka for Streaming Data (2 hrs)
  - Kafka fundamentals
  - Building real-time ingestion pipelines
  - Hands-on: produce & consume data streams (see the sketch after this list)
- Apache NiFi / Airbyte / Fivetran (2 hrs)
  - Low-code data ingestion tools
  - Connecting APIs & databases
- Data Transformation using Apache Spark (2 hrs)
  - Spark architecture & RDD/DataFrame concepts
  - Data cleaning & transformation tasks
- Mini Project + Q&A (2 hrs)
  - Build a streaming ingestion pipeline (Kafka + Spark)
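A minimal produce/consume sketch for the Kafka hands-on, assuming the kafka-python package and a broker listening on localhost:9092; the "orders" topic and message shape are illustrative:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Assumes a Kafka broker running locally; the "orders" topic is illustrative.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 1, "amount": 99.5})
producer.flush()

# Consume from the beginning of the topic and print each event.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop polling after 5 s of silence
)
for message in consumer:
    print(message.value)
```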
Week 3: Data Storage, Warehousing & Orchestration
Duration: 10 hours (5 sessions × 2 hrs)
Topics:
- Data Storage Systems (2 hrs)
  - HDFS, S3, Azure Data Lake, GCS
  - File formats: CSV, Parquet, Avro, ORC
- Data Warehousing Concepts (2 hrs)
  - Amazon Redshift, Google BigQuery, Snowflake
  - Partitioning, clustering, and query optimization
- Workflow Orchestration with Apache Airflow (2 hrs)
  - DAGs, Operators, scheduling
  - Building and monitoring pipelines
- Data Quality & Testing (2 hrs)
  - Great Expectations / dbt tests
  - Data validation and lineage tracking
- Mini Project + Q&A (2 hrs)
  - Build a batch pipeline orchestrated with Airflow (see the DAG sketch after this list)
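A minimal sketch of the kind of DAG built in the Week 3 mini project, assuming the Airflow 2.x API; the task logic here is a placeholder for real extract/load steps:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling source data")

def load():
    print("loading into the warehouse")

# A two-task batch pipeline: extract runs before load, once per day.
with DAG(
    dag_id="batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # set the dependency between tasks
```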
Week 4: Infrastructure as Code & Cloud Data Engineering
Duration: 10 hours (5 sessions × 2 hrs)
Topics:
- Introduction to Cloud Platforms (2 hrs)
  - AWS / GCP / Azure overview
  - Managed data services comparison
- Infrastructure as Code with Terraform (2 hrs)
  - Basics of IaC
  - Deploying storage and compute resources
- Containerization with Docker (2 hrs)
  - Dockerizing data applications
  - Docker Compose for multi-container setups
- Kubernetes for Data Pipelines (2 hrs)
  - Basics of pods, deployments, services
  - Running Spark jobs on Kubernetes
- Mini Project + Q&A (2 hrs)
  - Deploy a data pipeline on cloud infrastructure (see the S3 upload sketch after this list)
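Terraform, Docker, and Kubernetes each use their own configuration formats, so as a Python-side sample of the cloud sessions, here is a small boto3 sketch that pushes pipeline output to S3; the bucket name and file paths are assumptions, and AWS credentials must already be configured:

```python
import boto3

# Assumes AWS credentials are set up (e.g. via `aws configure`)
# and that the bucket already exists; names here are illustrative.
s3 = boto3.client("s3")

# Upload a local Parquet file produced by the pipeline.
s3.upload_file("output/sales.parquet", "my-data-lake-bucket", "raw/sales.parquet")

# List what landed under the raw/ prefix to confirm the upload.
response = s3.list_objects_v2(Bucket="my-data-lake-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```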
Week 5: Advanced Topics & Capstone Project
Duration: 12 hours (6 sessions × 2 hrs)
Topics:
- Data Governance & Security (2 hrs)
  - Data cataloging, lineage, encryption, access control
- Monitoring, Logging & Performance Tuning (2 hrs)
  - Prometheus, Grafana, CloudWatch (see the instrumentation sketch after this list)
  - Optimizing ETL performance
- Data Engineering with dbt (2 hrs)
  - Modular SQL transformations
  - Version control & testing in dbt
- Real-Time Data Processing with Flink (2 hrs)
  - Stream processing fundamentals
  - Integrating Flink with Kafka
- Capstone Project Development (2 hrs)
  - Design & build an end-to-end data pipeline
  - Incorporate ingestion, transformation, storage, and orchestration
- Capstone Presentation & Review (2 hrs)
  - Present the final pipeline
  - Instructor feedback & career guidance
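For the monitoring session, here is a minimal sketch of instrumenting an ETL step with the prometheus_client library; the metric names and the simulated batch work are illustrative:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics Prometheus can scrape from http://localhost:8000/metrics.
ROWS_PROCESSED = Counter("etl_rows_processed_total",
                         "Rows processed by the ETL step")
BATCH_SECONDS = Histogram("etl_batch_duration_seconds",
                          "Time spent per batch")

def process_batch():
    # Stand-in for a real transformation step.
    with BATCH_SECONDS.time():
        time.sleep(random.uniform(0.1, 0.5))
        ROWS_PROCESSED.inc(1000)

if __name__ == "__main__":
    start_http_server(8000)  # expose the /metrics endpoint
    while True:
        process_batch()
```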
🧩 Capstone Project Example
Project Title: Real-Time Analytics Pipeline for E-Commerce
Stack: Kafka → Spark → Airflow → S3 → Snowflake → Power BI
Goal: Build and orchestrate a scalable pipeline to process and analyze real-time sales data.
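As a sketch of the first hops of this stack (Kafka → Spark → S3), assuming PySpark with the spark-sql-kafka connector available; the topic, bucket, and paths are illustrative:

```python
from pyspark.sql import SparkSession

# Requires the Kafka connector on the classpath, e.g.
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 ...
spark = SparkSession.builder.appName("sales-stream").getOrCreate()

# Read raw sales events from Kafka as a streaming DataFrame.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sales")
    .load()
)

# Kafka delivers bytes; cast the payload to a string before writing out.
# Downstream, Airflow would load these Parquet files into Snowflake.
query = (
    events.selectExpr("CAST(value AS STRING) AS raw_event", "timestamp")
    .writeStream.format("parquet")
    .option("path", "s3a://my-data-lake-bucket/sales/")
    .option("checkpointLocation", "s3a://my-data-lake-bucket/checkpoints/sales/")
    .start()
)
query.awaitTermination()
```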