Program Summary:
Session 1: HDFS, MapReduce, Hive
- Introduction to Big Data and the Hadoop ecosystem
- Why the industry needs Big Data; advantages of Big Data over a traditional RDBMS
- Introduction to Big Data ecosystems
- Understanding various data formats and transformation techniques
- HDFS and YARN architecture
- MapReduce
- Understanding Hadoop and Hive
- HDFS and Hive
- Hive and data types
- Advanced Hive features for performance
- Uses of Hive in real-life projects
- Project 1
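The MapReduce model covered above can be sketched in plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a minimal single-process sketch of the classic word-count example, not Hadoop API code:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit (word, 1) for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big wins", "data pipelines"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In real Hadoop, the map and reduce functions run in parallel across the cluster and the shuffle moves data over the network; the logic per record is the same.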
Session 2: Impala, Oozie, Shell Scripting, Linux
- Uses of shell scripting in Big Data projects
- Shell and Hive exercises
- Introduction to Impala
- Architecture of Impala
- Uses of Hive and Impala in real-life projects
- Understanding Oozie as a scheduler
- Oozie Coordinator
- Setting up an Oozie job
- Introduction to Sqoop
- Understanding the capabilities of Sqoop and the underlying MapReduce
- Using Sqoop to ingest data from a traditional database into HDFS/Hive
- Project 2
Session 3: Spark, Scala
- Introduction to the Scala programming language
- Scala from a functional perspective
- Scala features for Big Data transformations
- Spark as a fast, general-purpose data processing engine
- Spark architecture
- Deep dive into Spark's data transformation capabilities
- Spark SQL with HDFS, Hive, and Impala
- Dealing with various data formats: JSON, XML, CSV, Parquet, text
- Project 3
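Spark's core transformations (map, filter, reduceByKey) are ordinary functional operations applied to distributed collections. As a rough single-machine analogy in plain Python (this is not the Spark API; the sales data and aggregation are illustrative):

```python
from functools import reduce
from itertools import groupby

# A local stand-in for an RDD of (product, amount) records
sales = [("book", 12.0), ("pen", 1.5), ("book", 8.0), ("pen", 2.5)]

# filter: keep only sales above 2.0 (like rdd.filter(...))
big_sales = [s for s in sales if s[1] > 2.0]

# reduceByKey: sum amounts per product (like rdd.reduceByKey(add));
# groupby needs the data sorted by key first
by_key = sorted(big_sales, key=lambda kv: kv[0])
totals = {
    key: reduce(lambda a, b: a + b, (v for _, v in group))
    for key, group in groupby(by_key, key=lambda kv: kv[0])
}
print(totals)  # {'book': 20.0, 'pen': 2.5}
```

In Spark the same chain of transformations is lazily built into a DAG and executed in parallel across partitions; the per-record logic is identical.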
Session 4: Spark Streaming, Flume
- Introduction to streaming, a new era of data analytics
- Introduction to Kafka
- Deep dive into Kafka architecture
- Setting up Kafka for message generation
- Spark Streaming with Kafka
- Kafka performance-tuning considerations
- Considerations for zero-data-loss streaming pipelines
- Dealing with the small-files issue and compaction
- Flume architecture
- Using Flume to set up a streaming pipeline
- Exercises on Flume agent setup
- Kafka and Flume
- Project 4
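Spark Streaming treats an unbounded stream as a sequence of small batches processed one at a time. A toy single-process sketch of that micro-batch idea (no Kafka or Spark involved; the event names and batch size are illustrative assumptions):

```python
from collections import Counter

def micro_batches(stream, batch_size):
    # Cut an (in principle unbounded) iterator into fixed-size batches,
    # the way Spark Streaming discretizes a stream into small RDDs
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# Running event counts, updated once per micro-batch
events = ["click", "view", "click", "click", "view", "buy", "click"]
running = Counter()
for batch in micro_batches(iter(events), batch_size=3):
    running.update(batch)  # per-batch aggregation folded into running state

print(running["click"])  # 4
```

In a real pipeline each batch would be an RDD consumed from Kafka, and the running state would be kept with checkpointed stateful operations so a failure does not lose data.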
Session 5: Big Data on Cloud
- Understanding Big Data technologies on AWS
- Using Kinesis, Firehose, and Data Streams
- Using DynamoDB
- Using Lambda, Hive, and Glue
- Understanding Elastic MapReduce (EMR)
- Spark on the cloud
- Project 5
Session 6: Big Data on Cloud, Python, Introduction to Data Science
- Flume on the cloud
- Hue and Splunk on the cloud
- Introduction to Python for data science: scikit-learn, Pandas, NumPy
- Using Jupyter notebooks with Python
- Understanding data science and its real-life use cases
- Overview of common data science algorithms: regression and classification
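Regression, the first algorithm family listed above, can be demonstrated with ordinary least squares for a single feature: the best-fit slope is cov(x, y)/var(x). A dependency-free sketch (in practice scikit-learn's LinearRegression handles this, with many features and much more robustly; the toy data here is made up):

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Perfectly linear toy data: y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

Classification follows the same train-then-predict pattern, but predicts a discrete label instead of a continuous value.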
Session 7: Data Science
- Using PySpark and Spark MLlib for data science
- Using Spark with Scala and MLlib for data science
- Understanding features and training models
- Data preparation for model training
- Machine learning on the cloud: SageMaker
- Project 6
Session 8: Docker
- Understanding microservices
- Introduction to Docker and its uses
- Docker installation and configuration
- Understanding and working with containers
- Inter-container communication; exposing services through ports
- Understanding Dockerfiles
- Container-based deployment
- Docker Compose
- Introduction to Kubernetes
- Introduction to Helm charts
- Using Kubernetes
- Deploying Docker images to Kubernetes using Helm charts
- Managing Pods
Project 7: Create a data science environment using microservices
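As one concrete illustration of the Dockerfile and container-deployment topics, a minimal image for a Jupyter-based data science environment might look like the following; the base image tag, package list, and port are illustrative assumptions, not prescribed by the program:

```dockerfile
# Hypothetical base image and package set -- adjust to the project's needs
FROM python:3.11-slim

# Install the Python data science stack covered in Sessions 6-7
RUN pip install --no-cache-dir jupyter pandas numpy scikit-learn

WORKDIR /workspace

# Expose the notebook port so other containers (or the host) can reach it
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```

An image like this can be built with `docker build`, composed with other services via Docker Compose, and ultimately deployed to Kubernetes through a Helm chart.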
Final Project
Prerequisites for the program: Python, Linux, SQL
Job roles: Data Engineer, Data Scientist, Big Data Developer, Database Administrator