Live Course Module: Hadoop for Data Science
Total Duration: 40 Hours (5 Weeks)
Week 1: Introduction to Big Data & Hadoop Ecosystem (6 Hours)

- Understanding Big Data (1 hr)
  - What is Big Data? Characteristics (Volume, Velocity, Variety, Veracity, Value).
  - Role of Big Data in Data Science.
- Introduction to the Hadoop Framework (1 hr)
  - History of Hadoop; why Hadoop?
  - Key advantages and limitations.
- Hadoop Ecosystem Overview (2 hrs)
  - HDFS, YARN, MapReduce, Hive, Pig, HBase, Sqoop, Flume.
  - Role of Hadoop in Data Science workflows.
- HDFS (Hadoop Distributed File System) Deep Dive (2 hrs)
  - Architecture: blocks, replication, NameNode & DataNode.
  - Hands-on: storing & retrieving files in HDFS.
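To build intuition before the hands-on session, the block/replication idea can be sketched in plain Python. This is a toy model, not the real HDFS API; the block size and replication factor below match common HDFS defaults (128 MB, 3 replicas), and the DataNode names are made up.

```python
# Toy model of HDFS storage: a file is split into fixed-size blocks,
# and each block is replicated across several DataNodes.
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the common default block size
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size_bytes):
    """Return the sizes of the blocks a file of this size would occupy."""
    full, rest = divmod(file_size_bytes, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes):
    """Round-robin replica placement (real HDFS is rack-aware)."""
    nodes = itertools.cycle(datanodes)
    return [[next(nodes) for _ in range(REPLICATION)]
            for _ in range(num_blocks)]

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                              # 3 blocks: 128 + 128 + 44 MB
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

The point of the sketch: a 300 MB file does not live anywhere as one object; it becomes three blocks, each stored on three different machines, which is what makes HDFS both fault-tolerant and parallelizable.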
Week 2: Hadoop Core Components (8 Hours)

- YARN Architecture (2 hrs)
  - ResourceManager, NodeManager, job scheduling.
  - Monitoring jobs on YARN.
- MapReduce Framework (4 hrs)
  - Map & Reduce concepts.
  - The classic WordCount example.
  - Writing custom MapReduce jobs.
  - Hands-on: running MR jobs on a Hadoop cluster.
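The WordCount flow can be simulated in a few lines of plain Python: the map phase emits (word, 1) pairs, the shuffle groups pairs by key, and the reduce phase sums each group. This is a conceptual sketch only; real jobs run on the cluster via the Java API or Hadoop Streaming.

```python
# Pure-Python simulation of MapReduce WordCount:
# map emits (word, 1), shuffle groups by key, reduce sums the group.
from collections import defaultdict

def map_phase(line):
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

On a real cluster the same three phases run distributed: mappers process HDFS blocks in parallel, the framework sorts and groups the intermediate pairs, and reducers aggregate them.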
- Hadoop Cluster Setup & Administration Basics (2 hrs)
  - Local (standalone) vs. pseudo-distributed vs. fully distributed mode.
  - Configuration files (core-site.xml, hdfs-site.xml).
Week 3: Hadoop Ecosystem Tools for Data Science (10 Hours)

- Apache Hive (3 hrs)
  - Hive architecture & the metastore.
  - HiveQL for querying big data.
  - Hands-on: creating tables, loading & querying data.
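Because HiveQL is SQL-like, the create/load/query workflow it enables can be previewed with Python's built-in sqlite3 standing in for Hive. Real Hive adds LOAD DATA, partitioning, and the metastore, and runs over HDFS; the table and column names below are invented for illustration.

```python
# sqlite3 stands in for Hive here, purely to show the SQL-style
# workflow that HiveQL gives you over big data. Not the Hive API.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# Comparable HiveQL: SELECT region, SUM(amount) FROM sales GROUP BY region;
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```

The aggregation query is near-identical in HiveQL; the difference is that Hive compiles it into distributed jobs over files in HDFS rather than reading a local database.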
- Apache Pig (2 hrs)
  - Pig Latin scripting.
  - Data transformations with Pig.
  - Hands-on: filtering, grouping, and joining data.
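The three hands-on operations map onto Pig Latin's FILTER, GROUP, and JOIN operators; here they are sketched over small Python lists of dicts so the semantics are visible. Field names are invented for the example; real Pig compiles these operators into distributed jobs over HDFS data.

```python
# FILTER / GROUP / JOIN, as in Pig Latin, sketched with plain Python.
from collections import defaultdict

users  = [{"id": 1, "name": "ana"}, {"id": 2, "name": "bo"}]
clicks = [{"user_id": 1, "page": "home"},
          {"user_id": 1, "page": "cart"},
          {"user_id": 2, "page": "home"}]

# Pig: home = FILTER clicks BY page == 'home';
home = [c for c in clicks if c["page"] == "home"]

# Pig: grouped = GROUP clicks BY user_id;  then count per group
groups = defaultdict(list)
for c in clicks:
    groups[c["user_id"]].append(c)
counts = {uid: len(cs) for uid, cs in groups.items()}

# Pig: joined = JOIN users BY id, clicks BY user_id;
joined = [(u["name"], c["page"]) for u in users
          for c in clicks if u["id"] == c["user_id"]]

print(len(home))    # 2
print(counts[1])    # 2
print(len(joined))  # 3
```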
- HBase (2 hrs)
  - NoSQL basics.
  - HBase data model: tables, column families, regions.
  - Hands-on: CRUD operations in HBase.
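HBase's data model is essentially a sorted map of row key → column family → qualifier → value, which a nested Python dict can mimic. The table, family, and row-key names below are illustrative; real CRUD would go through the HBase shell or a client library such as happybase.

```python
# Toy model of HBase's layout (row key -> family -> qualifier -> value)
# with the four CRUD operations. Not a real HBase client.
table = {}

def put(row, family, qualifier, value):          # Create / Update
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):                 # Read
    return table.get(row, {}).get(family, {}).get(qualifier)

def delete(row):                                 # Delete the whole row
    table.pop(row, None)

put("user#1", "info", "name", "ana")
put("user#1", "info", "city", "pune")
put("user#1", "info", "city", "mumbai")          # an update overwrites
print(get("user#1", "info", "city"))             # mumbai
delete("user#1")
print(get("user#1", "info", "name"))             # None
```

Note how update and create are the same operation (`put`), exactly as in HBase, where a put simply writes a newer version of a cell.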
- Sqoop & Flume (3 hrs)
  - Sqoop: import/export between Hadoop & relational databases.
  - Flume: data ingestion from logs & social media feeds.
  - Hands-on: importing MySQL data into HDFS & Hive.
Week 4: Data Science with Hadoop (8 Hours)

- Data Preprocessing with Hadoop (2 hrs)
  - Cleaning, transforming, and handling missing values.
  - Using Hive & Pig for preprocessing.
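The cleaning steps above (type conversion, imputing missing values, defaulting empty fields) can be previewed in plain Python before doing them at scale with Hive or Pig. The field names and the mean-imputation rule below are illustrative choices, not part of the course material.

```python
# Minimal missing-value handling: impute a missing age with the
# mean of the observed ages, and default an empty city.
raw = [
    {"age": "34", "city": "Pune"},
    {"age": "",   "city": "Delhi"},    # missing age
    {"age": "29", "city": ""},         # missing city
]

def clean(records, default_city="unknown"):
    ages = [int(r["age"]) for r in records if r["age"]]
    mean_age = round(sum(ages) / len(ages))
    return [{"age":  int(r["age"]) if r["age"] else mean_age,
             "city": r["city"] or default_city}
            for r in records]

cleaned = clean(raw)
print(cleaned[1]["age"])   # 32 (mean of 34 and 29, rounded)
print(cleaned[2]["city"])  # unknown
```

In Hive the same fill rule would be a `COALESCE`/`CASE` expression over the raw table; the logic is identical, only the execution engine changes.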
- Integrating Hadoop with R/Python (3 hrs)
  - Hadoop Streaming with Python.
  - The Pydoop and mrjob libraries.
  - R with Hadoop connectors.
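Hadoop Streaming runs any executable that reads stdin and writes stdout, so the classic Python pair is a mapper emitting tab-separated `word\t1` lines and a reducer summing consecutive keys (Streaming sorts between the two stages). The sketch below wraps that logic in functions over line iterables so it can be tested without a cluster; on a real cluster the same scripts would be submitted with the hadoop-streaming jar.

```python
# Streaming-style WordCount: mapper and reducer as line transforms.
# map -> sort (Streaming's shuffle) -> reduce
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    pairs = (line.rsplit("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

mapped = sorted(mapper(["hi there", "hi again"]))
print(list(reducer(mapped)))  # ['again\t1', 'hi\t2', 'there\t1']
```

The reducer only works on sorted input, exactly as in Streaming, where the framework guarantees that all lines for a given key arrive consecutively.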
- Machine Learning with Hadoop (3 hrs)
  - Introduction to Apache Mahout & Spark MLlib.
  - Building simple recommendation systems.
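The idea behind the collaborative-filtering recommenders in Mahout and MLlib can be shown in miniature: score user similarity (cosine here), find the most similar user, and suggest items they rated that the target user has not seen. The ratings are invented toy data; production systems run this at cluster scale.

```python
# Minimal user-based collaborative filtering with cosine similarity.
from math import sqrt

ratings = {
    "ana":  {"m1": 5, "m2": 3, "m3": 4},
    "bo":   {"m1": 4, "m2": 3, "m3": 5, "m4": 4},
    "cara": {"m2": 5, "m4": 1},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (sqrt(sum(u[i] ** 2 for i in common)) *
                  sqrt(sum(v[i] ** 2 for i in common)))

def recommend(user):
    """Items the most similar other user rated that `user` has not."""
    _, best = max((cosine(ratings[user], ratings[o]), o)
                  for o in ratings if o != user)
    return sorted(set(ratings[best]) - set(ratings[user]))

print(recommend("ana"))  # ['m4']
```

A real recommender would also weight suggestions by the neighbor's rating and pool several neighbors; this sketch keeps only the core similarity-then-suggest step.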
Week 5: Advanced Topics & Capstone Project (8 Hours)

- Spark vs. Hadoop: Modern Big Data Tools (2 hrs)
  - Why Spark gained popularity.
  - Hadoop + Spark hybrid use cases.
- Hadoop in Real-World Data Science Projects (2 hrs)
  - Use cases in finance, healthcare, and retail.
  - Industry best practices.
- Capstone Project (4 hrs)
  - An end-to-end real-world project:
    - Import data with Sqoop.
    - Store & process it in HDFS using MapReduce.
    - Query it with Hive/Pig.
    - Apply ML with Mahout, or integrate with Spark.
  - Presentation & evaluation.
✅ Final Outcome: After completing the course, learners will:

- Understand the Hadoop ecosystem & its role in data science.
- Perform data ingestion, storage, processing, and querying using Hadoop tools.
- Build end-to-end Big Data pipelines for analytics & machine learning.