Live Course Module: Apache Spark for Data Engineering
Total Duration: 40 Hours (5 Weeks)
Week 1: Introduction to Apache Spark and Big Data Fundamentals
Total Time: 8 hours
- Introduction to Big Data Ecosystem (1 hr)
  - What is Big Data?
  - Data Engineering vs Data Science
  - Role of Spark in Modern Data Architecture
- Overview of Apache Spark (1 hr)
  - Spark’s core concepts
  - Spark vs Hadoop MapReduce
  - Spark components and architecture
- Spark Cluster Overview (1.5 hrs)
  - Spark driver, executors, and cluster managers
  - Standalone, YARN, and Mesos modes
- Spark Installation and Setup (2 hrs – Lab)
  - Local setup using PySpark / Databricks
  - Running a first Spark job
- Hands-On & Assignment (2.5 hrs)
  - Word Count example (see the sketch below)
  - Explore the Spark UI
  - Assignment: Build and run a Spark application locally
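The Week 1 hands-on builds the classic Word Count job. A minimal PySpark sketch, assuming a local installation and a hypothetical input.txt file in the working directory:

```python
# Word Count sketch for the Week 1 lab (local PySpark assumed;
# "input.txt" is an illustrative file name).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()

# Read lines, split into words, and count occurrences with the RDD API.
lines = spark.sparkContext.textFile("input.txt")
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

# Bring a small sample back to the driver and print it.
for word, count in counts.take(10):
    print(word, count)

spark.stop()
```

While the job runs, the Spark UI is available at http://localhost:4040 by default, which is where the "Explore the Spark UI" exercise picks up.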
Week 2: Spark Core and RDD Programming
Total Time: 8 hours
- Understanding RDDs (1 hr)
  - What are RDDs?
  - Lazy evaluation & DAGs
- Transformations and Actions (2 hrs)
  - map, filter, reduceByKey, flatMap, join
  - Common actions: collect, count, take
- Persistence and Caching (1 hr)
  - Memory management and optimization
- Working with Key-Value Pairs (1.5 hrs)
  - Pair RDDs and aggregations
- Hands-On & Assignment (2.5 hrs)
  - RDD transformations practice (see the sketch below)
  - Assignment: Build a data processing pipeline using RDDs
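A minimal sketch of the Week 2 patterns (lazy transformations, a pair-RDD aggregation, caching, and actions), using a small in-memory dataset; all names and values are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("RDDBasics").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Illustrative (category, amount) records.
sales = sc.parallelize([
    ("electronics", 1200), ("books", 35), ("electronics", 800), ("books", 60),
])

# Transformations are lazy; nothing executes until an action is called.
high_value = sales.filter(lambda kv: kv[1] > 50)        # transformation
totals = high_value.reduceByKey(lambda a, b: a + b)     # pair-RDD aggregation

# Cache the result if several actions will reuse it.
totals.persist(StorageLevel.MEMORY_ONLY)

print(totals.collect())   # action: brings results to the driver
print(totals.count())     # action: number of distinct keys

spark.stop()
```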
Week 3: Spark SQL and DataFrames for Data Engineering
Total Time: 8 hours
- Introduction to Spark SQL (1 hr)
  - Structured APIs overview
  - Catalyst optimizer and Tungsten engine
- DataFrame Operations (2 hrs)
  - Creating DataFrames from JSON, CSV, and Parquet
  - Schema inference and transformations
- Spark SQL Queries (1.5 hrs)
  - Registering temporary views
  - Writing SQL queries in Spark
- Data Sources and Connectors (1.5 hrs)
  - Working with JDBC, S3, and Delta Lake
- Hands-On & Assignment (2 hrs)
  - Build an ETL job using DataFrames (see the sketch below)
  - Assignment: Transform and load structured data
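A small DataFrame-based ETL sketch in the spirit of the Week 3 hands-on; the file paths and column names are placeholders to adapt:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrameETL").getOrCreate()

# Extract: read a CSV file with header and schema inference.
orders = (
    spark.read.option("header", True)
              .option("inferSchema", True)
              .csv("orders.csv")
)

# Transform: drop incomplete rows and add a derived column.
cleaned = (
    orders.dropna(subset=["order_id"])
          .withColumn("total", F.col("quantity") * F.col("unit_price"))
)

# Register a temporary view and query it with Spark SQL.
cleaned.createOrReplaceTempView("orders")
daily = spark.sql("""
    SELECT order_date, SUM(total) AS revenue
    FROM orders
    GROUP BY order_date
""")

# Load: write the result as Parquet (S3, Delta Lake, or JDBC targets
# follow the same write API with a different format and options).
daily.write.mode("overwrite").parquet("output/daily_revenue")

spark.stop()
```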
Week 4: Advanced Spark Concepts and Optimization
Total Time: 8 hours
- Spark Streaming (1.5 hrs)
  - Introduction to Structured Streaming
  - Working with real-time data sources
- Performance Tuning and Optimization (2 hrs)
  - Partitioning, caching, and shuffle operations
  - Broadcast variables and accumulators
- Spark Joins and Aggregations (1.5 hrs)
  - Efficient join strategies
  - Window and group operations
- Monitoring and Debugging (1 hr)
  - Spark UI metrics and troubleshooting
- Hands-On & Assignment (2 hrs)
  - Streaming job with a Kafka data source (see the sketch below)
  - Assignment: Optimize a Spark ETL job
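A sketch of a Structured Streaming job reading from Kafka, along the lines of the Week 4 hands-on. The broker address and topic name are placeholders, and running it requires the spark-sql-kafka connector package on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("KafkaStream").getOrCreate()

# Subscribe to a Kafka topic (broker and topic are illustrative).
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "events")
         .load()
)

# Kafka values arrive as bytes; cast to string before further parsing.
counts = (
    events.select(F.col("value").cast("string").alias("raw"))
          .groupBy("raw")
          .count()
)

# Write running counts to the console every 10 seconds.
query = (
    counts.writeStream
          .outputMode("complete")
          .format("console")
          .trigger(processingTime="10 seconds")
          .start()
)
query.awaitTermination()
```

For the optimization assignment, one common first lever is broadcasting the smaller side of a join with pyspark.sql.functions.broadcast, which avoids a shuffle; the Spark UI's SQL and Stages tabs show whether the plan picked it up.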
Week 5: Integrations, Workflows, and Project
Total Time: 8 hours
- Integration with Data Engineering Tools (1.5 hrs)
  - Spark with Airflow, Kafka, and Delta Lake
  - Spark on Databricks and AWS EMR
- Deployment and Productionization (1.5 hrs)
  - Packaging Spark applications
  - Job scheduling and CI/CD pipelines
- Capstone Project (3 hrs)
  - End-to-end ETL pipeline (see the skeleton after this list)
  - Load raw data → Clean → Transform → Store in Data Lake/Warehouse
- Final Review and Q&A (2 hrs)
  - Project presentation
  - Certification guidance and career tips
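A skeleton of the capstone flow (load raw data → clean → transform → store); the S3 paths, column names, and output format are assumptions to replace with the project's own:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("CapstoneETL").getOrCreate()

# 1. Load raw data from the landing zone (path is illustrative).
raw = spark.read.json("s3a://my-bucket/landing/events/")

# 2. Clean: drop duplicates and records missing required fields.
clean = raw.dropDuplicates(["event_id"]).dropna(subset=["event_id", "event_time"])

# 3. Transform: type conversions and a business-level aggregation.
enriched = clean.withColumn("event_date", F.to_date("event_time"))
daily_metrics = enriched.groupBy("event_date", "event_type").count()

# 4. Store: write partitioned output to the lake/warehouse zone.
(daily_metrics.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3a://my-bucket/warehouse/daily_metrics/"))

spark.stop()
```

In production the same script would typically be packaged and launched with spark-submit (for example, spark-submit --master yarn etl_job.py) and scheduled from an orchestrator such as Airflow, tying back to the deployment topics above.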
🧩 Final Deliverables
- Mini Projects: 3 (RDD, SQL, Streaming)
- Capstone Project: 1 End-to-End Data Engineering Pipeline
- Assessments: Weekly quizzes + project evaluation




