Live Course Module: Apache Hadoop for Data Engineering
Total Duration: 40 Hours (5 Weeks)
Week 1: Big Data Fundamentals and Hadoop Ecosystem Overview
Total Time: 8 hours
- Introduction to Big Data (1 hr)
  - What is Big Data?
  - 3Vs of Big Data (Volume, Velocity, Variety)
  - Role of Data Engineering
- Overview of Hadoop Ecosystem (1.5 hrs)
  - Hadoop history and evolution
  - Core components: HDFS, YARN, MapReduce
  - Ecosystem tools: Hive, Pig, Sqoop, Flume, Oozie
- Hadoop Architecture (2 hrs)
  - NameNode, DataNode, Secondary NameNode
  - Hadoop cluster topology and setup
  - Block storage mechanism
- Setting up Hadoop Environment (1.5 hrs – Lab)
  - Single-node cluster setup using a local VM or Docker
  - Basic Hadoop commands
- Hands-On & Assignment (2 hrs)
  - Explore HDFS shell commands
  - Upload and retrieve files from HDFS (see the sketch after this week's outline)
  - Assignment: Simulate HDFS data flow
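For the Week 1 hands-on, the following is a minimal Java sketch of uploading a file to HDFS and reading it back through the FileSystem API. The NameNode address (hdfs://localhost:9000) and the file paths are placeholders for a single-node lab setup, not fixed course values.

    // Minimal sketch: copy a local file into HDFS and read it back.
    // Assumes a single-node cluster reachable at hdfs://localhost:9000
    // (host, port, and paths are placeholders for your own setup).
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder NameNode address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path local = new Path("data/sample.txt");       // hypothetical local file
                Path remote = new Path("/user/student/sample.txt");

                // Upload: equivalent to `hdfs dfs -put data/sample.txt /user/student/`
                fs.copyFromLocalFile(local, remote);

                // Retrieve: stream the file back and print it line by line
                try (BufferedReader reader =
                         new BufferedReader(new InputStreamReader(fs.open(remote)))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        System.out.println(line);
                    }
                }
            }
        }
    }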
Week 2: Hadoop Distributed File System (HDFS) and Data Management
Total Time: 8 hours
- HDFS Deep Dive (1.5 hrs)
  - Architecture and components
  - Read/Write operations
  - Fault tolerance and replication
- HDFS Commands and API (2 hrs – Lab)
  - File operations with the CLI and Java API
  - Permissions, quotas, and configuration
- Data Ingestion into HDFS (1.5 hrs)
  - Tools: Flume and Sqoop basics
  - Importing data from relational sources
- Hands-On & Assignment (3 hrs)
  - Load data using Sqoop and Flume
  - Validate replication and data recovery (see the sketch after this week's outline)
  - Assignment: Design a data ingestion workflow
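To support the replication-validation exercise in Week 2, here is a small sketch (assuming the same placeholder single-node setup and a hypothetical ingested file) that reads a file's replication factor and block locations, then requests a higher replication factor.

    // Minimal sketch: inspect replication and block placement for an HDFS file,
    // then raise its replication factor. Paths and the NameNode address are
    // placeholders; adjust them to your cluster.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/student/ingested/orders.csv"); // hypothetical file

                FileStatus status = fs.getFileStatus(file);
                System.out.println("Replication factor: " + status.getReplication());
                System.out.println("Block size: " + status.getBlockSize());

                // List which DataNodes hold each block
                for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                    System.out.println("Block at offset " + block.getOffset()
                            + " on hosts: " + String.join(", ", block.getHosts()));
                }

                // Ask the NameNode to re-replicate the file with 3 copies
                fs.setReplication(file, (short) 3);
            }
        }
    }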
Week 3: MapReduce for Data Engineering
Total Time: 8 hours
- Introduction to MapReduce (1.5 hrs)
  - Programming model: Mapper, Reducer, Combiner
  - InputFormat and OutputFormat
- Developing MapReduce Programs (2 hrs – Lab)
  - Writing MapReduce jobs in Java and Python
  - Running jobs on a Hadoop cluster
- Advanced MapReduce Concepts (2 hrs)
  - Custom InputFormat and Partitioner
  - Counters, DistributedCache, and job optimization
- Hands-On & Assignment (2.5 hrs)
  - WordCount and Log Analysis projects (see the WordCount sketch after this week's outline)
  - Assignment: Build and optimize a MapReduce ETL job
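The WordCount project can start from a sketch like the one below, which shows the Mapper, the Reducer (reused as a Combiner), and the driver covered this week; input and output directories come from the command line.

    // Classic WordCount as a minimal sketch of the Mapper/Reducer model.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emit (word, 1) for every token in the input line
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer (also usable as a Combiner): sum the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, it runs on the lab cluster as: hadoop jar wordcount.jar WordCount <input dir> <output dir> (the jar name here is only an example).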
Week 4: Hive, Pig, and Data Processing Tools
Total Time: 8 hours
- Apache Hive for Data Warehousing (2 hrs)
  - Hive architecture and metastore
  - Creating databases, tables, and partitions
  - Writing HiveQL queries
- Apache Pig for Data Flow Processing (1.5 hrs)
  - Pig architecture and execution modes
  - Pig Latin scripts for data transformation
- Integrating Hive and Pig with HDFS (1.5 hrs – Lab)
  - Loading HDFS data into Hive and Pig
  - Using SerDe and UDFs
- Hands-On & Assignment (3 hrs)
  - ETL pipeline using Hive and Pig (see the HiveQL-over-JDBC sketch after this week's outline)
  - Assignment: Transform raw log data into analytics-ready tables
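One way to drive the Hive side of the ETL pipeline from Java is over JDBC to HiveServer2. The sketch below assumes a HiveServer2 endpoint at localhost:10000 plus hypothetical table and HDFS path names, and only illustrates the kind of HiveQL statements covered this week.

    // Minimal sketch: run HiveQL from Java over JDBC (HiveServer2).
    // The connection URL, user, and table/path names are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveLogTable {
        public static void main(String[] args) throws Exception {
            // Hive JDBC driver; requires hive-jdbc on the classpath
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "student", "");
                 Statement stmt = conn.createStatement()) {

                // External table over raw logs already sitting in HDFS
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs ("
                        + "ts STRING, level STRING, message STRING) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                        + "LOCATION '/user/student/raw_logs'");

                // Simple aggregation in HiveQL
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT level, COUNT(*) AS cnt FROM raw_logs GROUP BY level")) {
                    while (rs.next()) {
                        System.out.println(rs.getString("level") + " -> " + rs.getLong("cnt"));
                    }
                }
            }
        }
    }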
Week 5: Hadoop Ecosystem, Workflow, and Project Implementation
Total Time: 8 hours
- Workflow Management and Orchestration (1.5 hrs)
  - Introduction to Oozie
  - Building workflows for Hadoop jobs (see the Oozie client sketch after this week's outline)
- Hadoop Integration with Other Tools (1.5 hrs)
  - Connecting Hadoop with Spark, Kafka, and HBase
  - Hadoop on the cloud: AWS EMR, GCP Dataproc
- Performance Tuning and Troubleshooting (2 hrs)
  - Cluster monitoring and resource optimization
  - Log analysis and debugging
- Capstone Project (3 hrs)
  - Build a complete data engineering pipeline using Hadoop tools
  - Ingest → Process → Store → Analyze
  - Example: Retail or IoT data pipeline
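For the Oozie topic, here is a sketch of submitting a workflow with the Java Oozie client. The Oozie URL, the HDFS application path holding workflow.xml, and the nameNode/jobTracker properties (read by the workflow definition) are all placeholders for the lab environment.

    // Minimal sketch: submit a workflow to Oozie from Java using the Oozie client API.
    // The Oozie URL, HDFS application path, and NameNode/ResourceManager addresses are
    // placeholders; the workflow.xml itself must already exist in HDFS.
    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://localhost:11000/oozie"); // placeholder URL

            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH,
                    "hdfs://localhost:9000/user/student/workflows/etl"); // dir containing workflow.xml
            conf.setProperty("nameNode", "hdfs://localhost:9000");
            conf.setProperty("jobTracker", "localhost:8032"); // YARN ResourceManager address

            // Submit and start the workflow, then poll its status
            String jobId = client.run(conf);
            System.out.println("Submitted workflow: " + jobId);

            Thread.sleep(10_000);
            WorkflowJob job = client.getJobInfo(jobId);
            System.out.println("Current status: " + job.getStatus());
        }
    }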
🧩 Final Deliverables
- Mini Projects: 3 (HDFS, MapReduce, Hive)
- Capstone Project: 1 end-to-end data engineering workflow
- Assessments: Weekly quizzes + final project review
- Tools Covered: HDFS, YARN, MapReduce, Hive, Pig, Sqoop, Flume, Oozie