Big Data/Data Science

Duration: 8 weeks

Classes only on Weekends

Program Summary:

Session 1: HDFS, MapReduce, Hive

  • Introduction to Big Data and the Hadoop ecosystem
  • Why industry needs Big Data; advantages of Big Data over traditional RDBMS
  • Introduction to Big Data ecosystems
  • Understanding various data formats and transformation techniques
  • HDFS, YARN architecture
  • MapReduce
  • Understanding Hadoop and Hive
  • HDFS and Hive
  • Hive and datatypes
  • Advanced Hive features for performance
  • Uses of Hive in real-life projects
  • Project-1

Session 2: Impala, Oozie, Shell Scripting, Linux

  • Uses of shell scripting in Big Data projects
  • Shell and Hive exercises
  • Introduction to Impala
  • Architecture of Impala
  • Uses of Hive and Impala in real-life projects
  • Understanding Oozie as a scheduler
    • Oozie Coordinator
    • Setting up an Oozie job
  • Introduction to Sqoop
  • Understanding the capabilities of Sqoop and its underlying MapReduce
  • Using Sqoop to ingest data from a traditional database into HDFS/Hive
  • Project-2

Session 3: Spark, Scala

  • Introduction to the Scala programming language
  • Scala from a functional perspective
  • Scala features for Big Data transformations
  • Spark, a fast in-memory data processing engine
  • Spark architecture
  • Deep dive into Spark data transformation capabilities
  • Spark SQL with HDFS, Hive and Impala
  • Dealing with various data formats: JSON, XML, CSV, Parquet, text
  • Project-3

Session 4: Spark Streaming, Flume

  • Introduction to streaming, a new era of data analytics
  • Introduction to Kafka
  • Deep dive into Kafka architecture
  • Setting up Kafka for message generation
  • Spark Streaming with Kafka
  • Kafka performance tuning considerations
  • Considerations for zero-data-loss streaming pipelines
  • Dealing with the small-files issue and compaction
  • Flume architecture
  • Using Flume to set up a streaming pipeline
  • Exercises on Flume agent setup
  • Kafka and Flume
  • Project-4

Session 5: Big Data on Cloud

  • Understanding Big Data technologies on the AWS cloud
  • Using Kinesis (Data Streams, Firehose)
  • Using DynamoDB
  • Using Lambda, Hive, Glue
  • Understanding Elastic MapReduce (EMR)
  • Spark on Cloud
  • Project-5

Session 6: Big Data on Cloud, Python, Introduction to Data Science

  • Flume on Cloud
  • Hue, Splunk on Cloud
  • Introduction to Python for data science: scikit-learn, pandas, NumPy
  • Using Jupyter Notebook with Python
  • Understanding data science and its real-life use cases
  • Going over various data science algorithms: regression and classification

Session 7: Data Science

  • Using PySpark and Spark MLlib for data science
  • Using Spark with Scala and MLlib for data science
  • Understanding features and training models
  • Data preparation for model training
  • Machine learning on cloud: SageMaker
  • Project-6

Session 8: Docker

  • Understanding microservices
  • Introduction to Docker and its uses
  • Docker installation and configuration
  • Understanding and working with containers
  • Inter-container communication; exposing services through ports
  • Understanding the Dockerfile
  • Container-based deployment
  • Docker Compose
  • Introduction to Kubernetes
  • Introduction to Helm charts
  • Using Kubernetes
  • Deploying Docker images to Kubernetes using Helm charts
  • Managing Pods

Project-7: Create a data science environment using microservices

Final Project

Familiarity with:

Core Java, SQL, Linux

The training material is developed using real-world use cases that are designed to give students a competitive career advantage.