Apache Spark for Data Science – Live Course Module
Total Duration: 36 Hours (6 Weeks)
Week 1: Introduction & Foundations (6 hrs)
- Introduction to Big Data & Spark (2 hrs)
  - Evolution from Hadoop to Spark
  - Why Spark for Data Science?
  - Spark ecosystem overview (Spark Core, SQL, MLlib, Streaming, GraphX)
  - Real-world use cases
- Spark Architecture & Setup (2 hrs)
  - Spark architecture (Driver, Executors, Cluster Manager)
  - RDD vs DataFrames vs Datasets
  - Installing & running Spark (Standalone, YARN, Databricks, Google Colab, Jupyter)
- Hands-on with Spark Shell & PySpark (2 hrs)
  - Spark Shell (Scala/Python) basics
  - Using PySpark with Jupyter Notebook
  - Simple Spark applications
Week 2: Spark Core – RDD Operations (6 hrs)
- RDD Basics (2 hrs)
  - Creating RDDs
  - Transformations & Actions
  - Lazy evaluation & DAG
- Advanced RDD Operations (2 hrs)
  - map, flatMap, filter, reduceByKey, groupByKey
  - Joins & Aggregations
  - Persisting & caching RDDs
- Hands-on RDD Case Study (2 hrs)
  - Word Count Example
  - Log File Analysis
  - Performance tuning with RDDs
Week 3: DataFrames & Spark SQL (6 hrs)
- Introduction to DataFrames (2 hrs)
  - Creating DataFrames from files (CSV, JSON, Parquet)
  - Schema & Data types
  - DataFrame operations (select, filter, groupBy, join, agg)
- Spark SQL (2 hrs)
  - Registering DataFrames as SQL tables
  - Writing SQL queries in Spark
  - Integration with BI tools
- Hands-on Data Analysis with Spark SQL (2 hrs)
  - Case study: Analyzing a large dataset with DataFrames & SQL
  - Optimization techniques (Catalyst Optimizer, Tungsten)
Week 4: Machine Learning with MLlib (6 hrs)
- Introduction to Spark MLlib (2 hrs)
  - Machine Learning in Spark
  - MLlib vs scikit-learn
  - Pipelines & Transformers
- Supervised Learning with MLlib (2 hrs)
  - Regression & Classification (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
  - Model training & evaluation
- Unsupervised Learning with MLlib (2 hrs)
  - Clustering (K-Means, Gaussian Mixture)
  - Dimensionality Reduction (PCA)
  - Hands-on project with MLlib
Week 5: Spark Streaming & Real-Time Analytics (6 hrs)
- Introduction to Spark Streaming (2 hrs)
  - Batch vs Streaming
  - DStreams & Structured Streaming basics
  - Streaming architecture
- Structured Streaming Operations (2 hrs)
  - Reading real-time data (Kafka, Socket, Files)
  - Window operations
  - Aggregations & checkpoints
- Hands-on Streaming Project (2 hrs)
  - Real-time Twitter sentiment analysis / Log monitoring
  - Building a streaming pipeline
Week 6: Capstone Project & Deployment (6 hrs)
- GraphX & Advanced Topics (2 hrs)
  - Basics of GraphX
  - Graph analysis use cases in Data Science
- Capstone Project Work (2 hrs)
  - End-to-end project (e.g., Movie Recommendation, Customer Churn Prediction, Real-time Fraud Detection)
  - Data ingestion → Processing → ML pipeline → Results
- Deployment & Wrap-up (2 hrs)
  - Deploying Spark jobs (Standalone / Cluster)
  - Integrating with Hadoop, AWS EMR, Databricks
  - Best practices & course recap
✅ Outcome:
By the end of this course, learners will be able to:
- Build and optimize Spark applications
- Perform large-scale data analysis using Spark SQL
- Train ML models using Spark MLlib
- Work with streaming data in real time
- Deploy Spark solutions in production




