Big Data/Data Science

Duration: 40 Hours

The major focus of the training is to develop and improve industry-level skills for working in the Data Science and Data Engineering domains: less theory, more exercises and projects.

  1. Data preparation
    1. Working with data pipelines and preparing data using technologies such as Spark, Spark Streaming, Kafka, Hadoop, HDFS, HQL, and SQL, plus cloud technologies such as Dataproc, Cloud Storage, and BigQuery
    2. Dealing with various data formats, including converting unstructured data to structured data, and data types such as JSON, XML, and logs
  2. Applied Data Science
    1. Building the skill set to handle data science use cases through widely used programming languages, data structures, tools, libraries, and algorithms, including cloud technologies


Session 1 [1]: HDFS, MapReduce, Hive

  • Introduction to Big Data and the Hadoop ecosystem
  • Why the industry needs Big Data; advantages of Big Data over a traditional RDBMS
  • Understanding various data formats and transformation techniques
  • HDFS and YARN architecture
  • Understanding how HDFS, Hadoop, MapReduce, and Hive fit together (see the word-count sketch below)
  • Advanced Hive features for performance
  • Project 1
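
For a first taste of MapReduce, here is a minimal word-count sketch written as Hadoop Streaming scripts in Python; the file names and paths are illustrative, and the session's hands-on work goes well beyond this.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper: emit "word<TAB>1" for each word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming reducer: Hadoop sorts map output by key, so
# counts for the same word arrive together and can be summed in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

On a cluster these run through the Hadoop Streaming jar, along the lines of: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out (the jar path and HDFS paths depend on the installation).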


Session 2 [.5]: Impala, Oozie, Sqoop, Shell Scripting, Linux

  • Uses of shell scripting in Big Data projects
  • Using the shell to combine various Hadoop technologies
  • Architecture of Impala
  • Uses of Hive and Impala in real-life projects
  • Understanding Oozie as a scheduler; the Oozie Coordinator
  • Introduction to Sqoop
  • Understanding the capabilities of Sqoop and its underlying MapReduce jobs
  • Using Sqoop to ingest data from a traditional database into HDFS/Hive (see the sketch below)
  • Project 2
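
A sketch of the Sqoop ingestion pattern covered here, wrapped in Python purely for consistency with the other examples in this outline; the JDBC URL, credentials, table, and paths are all placeholders.

```python
# sqoop_ingest.py -- compose and run a Sqoop import that lands a MySQL table
# in HDFS and also registers it as a Hive table. Every connection detail and
# path below is a placeholder; adjust them for your cluster.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/sales",  # hypothetical source DB
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",     # keeps the secret off the command line
    "--table", "orders",
    "--target-dir", "/data/raw/orders",              # HDFS landing directory
    "--num-mappers", "4",                            # parallel map tasks doing the copy
    "--hive-import",                                 # also create/load a Hive table
    "--hive-table", "staging.orders",
]
subprocess.run(sqoop_cmd, check=True)
```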


Session 3 [.5]: Spark, Scala

  • Introduction to the Scala programming language
  • Scala from a functional perspective
  • Scala features for Big Data transformations
  • Spark as a fast, general-purpose engine for large-scale data processing
  • Spark architecture
  • Deep dive into Spark's data transformation capabilities
  • Spark SQL with HDFS, Hive, and Impala
  • Dealing with various data formats: JSON, XML, CSV, Parquet, and text (see the sketch below)
  • Project 3
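
The course teaches Spark in Scala; purely to keep one language across the examples in this outline, here is the same style of work sketched in PySpark. Paths and column names are illustrative.

```python
# spark_formats.py -- read JSON, transform, query with Spark SQL, write Parquet.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("formats-demo").getOrCreate()

# Read semi-structured JSON into a DataFrame (schema is inferred).
orders = spark.read.json("/data/raw/orders.json")

# Typical transformations: filter, derive a column, aggregate.
daily = (orders
         .filter(F.col("status") == "COMPLETE")
         .withColumn("order_date", F.to_date("created_at"))
         .groupBy("order_date")
         .agg(F.sum("amount").alias("revenue")))

# The same DataFrame is also queryable through Spark SQL.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT status, COUNT(*) AS n FROM orders GROUP BY status").show()

# Write the result as columnar Parquet for downstream jobs.
daily.write.mode("overwrite").parquet("/data/curated/daily_revenue")
```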


Session 4 [.5]: Streaming with Kafka and Spark Streaming

  • Introduction to streaming, a new era of data analytics
  • Introduction to Kafka
  • Deep dive into Kafka architecture
  • Setting up Kafka for message generation
  • Spark Streaming with Kafka (see the sketch below)
  • Kafka performance-tuning considerations
  • Considerations for zero-data-loss streaming pipelines
  • Dealing with the small-files problem and compaction
  • Project 4
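
A minimal Structured Streaming sketch reading from Kafka and writing Parquet. It assumes the spark-sql-kafka connector package is available; the broker, topic, and paths are placeholders. The checkpoint location is what enables recovery without data loss, and the micro-batch trigger is one lever against the small-files problem.

```python
# kafka_stream.py -- Spark Structured Streaming: Kafka source -> Parquet sink.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
          .option("subscribe", "events")                     # placeholder topic
          .option("startingOffsets", "earliest")
          .load())

# Kafka delivers raw bytes; cast the message value to a string before parsing.
lines = events.select(F.col("value").cast("string").alias("raw"))

query = (lines.writeStream
         .format("parquet")
         .option("path", "/data/stream/events")
         .option("checkpointLocation", "/checkpoints/events")  # offsets tracked here
         .trigger(processingTime="1 minute")  # larger micro-batches, fewer small files
         .start())
query.awaitTermination()
```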


Session 5 [1.5]: Big Data in the Cloud: GCP (Google Cloud Platform)

  • GCP orchestration
  • GCP tools and services used by Canadian companies to develop data pipelines (a BigQuery sketch follows this list)
  • GCP tools and services used by Canadian companies to prepare data for ML models
  • Hadoop on GCP
  • Cloud migration: the lift-and-shift strategy with minimal code changes
  • Streaming in the cloud
  • Project 5
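
As one concrete example of the GCP tooling covered here, a minimal BigQuery query from Python using the google-cloud-bigquery client. The project, dataset, and table names are placeholders; credentials are picked up from the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS).

```python
# bq_query.py -- run a SQL query in BigQuery and iterate over the results.
from google.cloud import bigquery

client = bigquery.Client(project="my-demo-project")  # placeholder project

sql = """
    SELECT order_date, SUM(amount) AS revenue
    FROM `my-demo-project.sales.orders`
    GROUP BY order_date
    ORDER BY order_date
"""
for row in client.query(sql).result():  # result() blocks until the job finishes
    print(row.order_date, row.revenue)
```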


Session 6 [1.5]: Introduction to Data Science and Python

  • Basics of data science
    • How data science differs from other use cases
  • Python for data science
    • Using Anaconda, Jupyter Notebook, Docker, and VS Code
  • Applying Python libraries for
    • Plotting
    • Data science: scikit-learn, pandas, NumPy
  • Applying data science in real-life use cases
  • Overview of various data science algorithms, supervised and unsupervised
  • Understanding feature engineering and the training and testing of models (see the sketch below)
  • Data preparation for model training
  • Project 6
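
A minimal train/test sketch with scikit-learn, using its built-in breast-cancer dataset so it runs with no external data.

```python
# train_eval.py -- hold-out evaluation with a preprocessing + model pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set so the score reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A pipeline keeps preprocessing and the model together, so the scaler is fit
# only on the training split -- no information leaks into the test set.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
```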


Session 7 [2]: Google Cloud for Data Science

  • GCP tools and services for data science
  • Vertex AI and Vertex AI Workbench
  • Model Garden
  • BigQuery ML (BQML); a sketch follows this list
  • AutoML
  • Generative AI
  • Project 7
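
A minimal BQML sketch: training and evaluating a logistic-regression model entirely inside BigQuery, driven from Python. The project, dataset, table, and column names are placeholders.

```python
# bqml_train.py -- BigQuery ML: CREATE MODEL runs as an ordinary SQL job,
# so BigQuery does the training server-side.
from google.cloud import bigquery

client = bigquery.Client(project="my-demo-project")  # placeholder project

client.query("""
    CREATE OR REPLACE MODEL `my-demo-project.demo.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_charges, churned
    FROM `my-demo-project.demo.customers`
""").result()

# Evaluate the trained model with ML.EVALUATE.
for row in client.query("""
    SELECT * FROM ML.EVALUATE(MODEL `my-demo-project.demo.churn_model`)
""").result():
    print(dict(row))
```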


Session 8 [.5]: Docker and Kubernetes

  • Understanding microservices
  • Introduction to Docker and its uses
  • Docker installation and configuration
  • Understanding and working with containers
  • Inter-container communication; exposing services through ports
  • Understanding the Dockerfile
  • Container-based deployment
  • Docker Compose
  • Introduction to Kubernetes
  • Introduction to Helm charts
  • Deploying Docker images to Kubernetes using a Helm chart
  • Managing Pods

Project 8: Create a data science environment using microservices (see the sketch below)
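
A minimal sketch of the project idea using the Docker SDK for Python (pip install docker): two containers on a shared network, one with a port published to the host. Image names, ports, and the password are illustrative, and the same environment is usually expressed as a docker-compose file.

```python
# lab_env.py -- spin up a tiny data-science environment as two cooperating
# containers on a private bridge network.
import docker

client = docker.from_env()

# A private network so containers can reach each other by container name.
client.networks.create("ds-lab", driver="bridge")

# A Postgres container acting as the lab's database service.
db = client.containers.run(
    "postgres:16", name="lab-db", detach=True, network="ds-lab",
    environment={"POSTGRES_PASSWORD": "lab"})  # demo-only credential

# A Jupyter container; inside the network it can reach the DB at host "lab-db",
# and its HTTP port is published so the host browser can reach it.
nb = client.containers.run(
    "jupyter/scipy-notebook", name="lab-notebook", detach=True, network="ds-lab",
    ports={"8888/tcp": 8888})  # host:8888 -> container:8888

print("db:", db.short_id, "notebook:", nb.short_id)
```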

Final Project


Prerequisites for the program:

  • Familiarity with a programming language such as Python, C, or C++
  • Familiarity with RDBMS and SQL
  • Nice to have: knowledge of Linux and shell scripting
  • Must be available for 8 hours of class per week, plus at least 2 hours a day for learning and projects beyond class hours

Job roles: Data Engineer, Data Scientist, Big Data Developer

The training material is developed using real-world use cases that are designed to give students a competitive career advantage.