- Data preparation
- Working with data pipelines and preparing data using technologies such as Spark, Spark Streaming, Kafka, Hadoop, HDFS, HQL, and SQL, plus cloud services such as Dataproc, Cloud Storage, and BigQuery
- Dealing with various data formats, including transforming unstructured data into structured data, and data types such as JSON, XML, and logs
- Applied Data Science
- Build the skill set to tackle data science use cases using widely used programming languages, data structures, tools, libraries, and algorithms, including cloud technologies
Session 1 [1]: HDFS, MapReduce, Hive
- Introduction to big data and the Hadoop ecosystem
- Why the industry needs big data; advantages of big data over a traditional RDBMS
- Introduction to big data ecosystems
- Understanding various data formats and transformation techniques
- HDFS, YARN architecture
- Understanding HDFS, Hadoop, and Hive (a MapReduce sketch follows this session's outline)
- Advanced Hive features for performance
- Project 1
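To give a concrete feel for the MapReduce model in this session, here is a minimal Hadoop Streaming word-count sketch in Python; the script name, input/output paths, and streaming JAR location in the comment are illustrative assumptions, and the class itself may equally use Java MapReduce or Hive for the same task.

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming word count (illustrative sketch).

Run as mapper:  wordcount.py map
Run as reducer: wordcount.py reduce
Example submission (paths are assumptions, adjust to your cluster):
  hadoop jar hadoop-streaming.jar \
    -input /data/raw/*.txt -output /data/wordcount \
    -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
    -file wordcount.py
"""
import sys
from itertools import groupby

def mapper(stream):
    # Emit "<word>\t1" for every word on every input line.
    for line in stream:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

def reducer(stream):
    # Hadoop sorts mapper output by key, so equal words arrive together.
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{word}\t{total}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if mode == "map" else reducer)(sys.stdin)
```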
Session 2 [0.5]: Impala, Oozie, Shell Scripting, Linux
- Uses of shell scripting in big data projects
- Using the shell to combine various Hadoop technologies
- Architecture of Impala
- Uses of Hive and Impala in real-life projects
- Understanding Oozie as a scheduler; the Oozie Coordinator
- Introduction to Sqoop
- Understanding Sqoop's capabilities and the underlying MapReduce jobs
- Using Sqoop to ingest data from a traditional database into HDFS/Hive (see the sketch after this session's outline)
- Project 2
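To illustrate the Sqoop ingestion topic above, this sketch drives a `sqoop import` from Python; the JDBC URL, credentials, table, and Hive target are placeholder assumptions, and in practice the same command is often run directly from a shell script or an Oozie workflow.

```python
import subprocess

# Placeholder connection details -- adjust to your environment.
jdbc_url = "jdbc:mysql://db-host:3306/sales"
sqoop_cmd = [
    "sqoop", "import",
    "--connect", jdbc_url,
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_pwd",  # avoid passwords on the command line
    "--table", "orders",
    "--hive-import",                            # load straight into a Hive table
    "--hive-table", "staging.orders",
    "--num-mappers", "4",                       # run 4 parallel map tasks
]

# Run the import and fail loudly if Sqoop returns a non-zero exit code.
subprocess.run(sqoop_cmd, check=True)
```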
Session 3 [0.5]: Spark, Scala
- Introduction to the Scala programming language
- Scala from a functional programming perspective
- Scala features for big data transformations
- Spark as a fast, distributed data processing engine
- Spark architecture
- Deep dive into Spark's data transformation capabilities
- Spark SQL with HDFS, Hive, and Impala
- Dealing with various data formats: JSON, XML, CSV, Parquet, and text (see the sketch after this session's outline)
- Project 3
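This session works primarily in Scala, but the same DataFrame API is available from PySpark; the sketch below (paths and table names are assumptions) shows reading several of the formats listed above and querying a Hive table through Spark SQL.

```python
from pyspark.sql import SparkSession

# enableHiveSupport lets Spark SQL read and write Hive metastore tables.
spark = (SparkSession.builder
         .appName("format-demo")
         .enableHiveSupport()
         .getOrCreate())

# Read a few of the formats covered in the session (paths are placeholders).
orders_json = spark.read.json("/data/raw/orders/*.json")
orders_csv = spark.read.option("header", True).csv("/data/raw/orders_csv/")
orders_parquet = spark.read.parquet("/data/curated/orders/")

# Register a DataFrame as a temporary view and join it with a Hive table.
orders_parquet.createOrReplaceTempView("orders")
daily_totals = spark.sql("""
    SELECT c.region, o.order_date, SUM(o.amount) AS total_amount
    FROM orders o
    JOIN default.customers c ON o.customer_id = c.customer_id
    GROUP BY c.region, o.order_date
""")

# Write the result back out as Parquet, partitioned by date.
daily_totals.write.mode("overwrite").partitionBy("order_date").parquet("/data/marts/daily_totals/")
```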
Session 4 [0.5]: Streaming – Kafka and Spark Streaming
- Introduction to streaming: a new era of data analytics
- Introduction to Kafka
- Deep dive into Kafka architecture
- Setting up Kafka for message production
- Spark Streaming with Kafka (see the sketch after this session's outline)
- Kafka performance tuning considerations
- Considerations for zero-data-loss streaming pipelines
- Dealing with the small-files problem and compaction
- Project 4
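A minimal PySpark Structured Streaming sketch for the Kafka integration covered above; the broker address, topic, schema, and checkpoint path are assumptions, and zero-data-loss behaviour in practice also depends on the checkpointing, replication, and sink semantics discussed in class.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

# Note: running this requires the Spark-Kafka connector package on the Spark classpath.
spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

# Expected JSON payload of each Kafka message (schema is an assumption).
schema = (StructType()
          .add("order_id", StringType())
          .add("amount", DoubleType()))

# Subscribe to a Kafka topic; Spark tracks offsets in the checkpoint directory,
# which is what allows restarts without losing data.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker-1:9092")
       .option("subscribe", "orders")
       .option("startingOffsets", "latest")
       .load())

orders = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("o"))
          .select("o.*"))

# Write micro-batches to Parquet; longer trigger intervals help avoid
# the small-files problem mentioned above.
query = (orders.writeStream
         .format("parquet")
         .option("path", "/data/streaming/orders/")
         .option("checkpointLocation", "/checkpoints/orders/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```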
Session 5 [1.5]: Big Data in the Cloud: GCP (Google Cloud Platform)
- GCP orchestration
- GCP tools and services used by Canadian companies to develop data pipelines (see the BigQuery sketch after this session's outline)
- GCP tools and services used by Canadian companies to prepare data for ML models
- Hadoop on GCP
- Cloud migration: lift-and-shift strategy with minimal code changes
- Streaming in the cloud
- Project 5
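As a small taste of the GCP tooling in this session, the sketch below runs a query against BigQuery with the official Python client; the project, dataset, and table names are placeholder assumptions, and Dataproc, Cloud Storage, and the other services covered have similar client libraries.

```python
from google.cloud import bigquery

# The client picks up credentials from the environment
# (e.g. GOOGLE_APPLICATION_CREDENTIALS); the project id here is a placeholder.
client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT region, SUM(amount) AS total_amount
    FROM `my-analytics-project.sales.orders`
    WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY region
    ORDER BY total_amount DESC
"""

# Run the query and iterate over the result rows.
for row in client.query(query).result():
    print(row.region, row.total_amount)
```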
Session 6 [1.5]: Introduction to Data Science and Python
- Basics of Data Science
- How data science differs from other use cases
- Python for data science
- Using Anaconda, Jupyter Notebook, Docker, VS Code
- Applying Python libraries for:
- Plotting
- Data science: scikit-learn, pandas, NumPy
- Applying data science in real-life use cases
- Going over various data science algorithms: supervised and unsupervised
- Understanding feature engineering, training, and testing models (see the sketch after this session's outline)
- Data preparation for model training
- Project 6
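A minimal scikit-learn sketch of the feature engineering / training / testing workflow listed above, using a small synthetic dataset so it runs on its own; the features, label, and model choice are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data standing in for a prepared dataset (assumption for illustration).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "monthly_spend": rng.normal(100, 30, 1000),
    "visits": rng.poisson(5, 1000),
})
# A simple rule creates a binary label to predict.
df["churned"] = ((df["monthly_spend"] < 80) & (df["visits"] < 4)).astype(int)

X = df[["monthly_spend", "visits"]]
y = df["churned"]

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling + model in one pipeline, so preprocessing is fit only on training data.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```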
Session 7 [2]: Google Cloud for Data Science
- GCP tools and services for data science
- Vertex AI, Workbench
- Model Garden
- BQML (BigQuery ML; see the sketch after this session's outline)
- AutoML
- Generative AI
- Project 7
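To make BQML concrete, the sketch below trains and evaluates a logistic regression model entirely inside BigQuery from Python; the project, dataset, and column names are assumptions, and Vertex AI and AutoML are driven through their own client libraries and the console rather than SQL.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # placeholder project id

# Train a logistic regression model directly on a BigQuery table (BQML).
client.query("""
    CREATE OR REPLACE MODEL `my-analytics-project.sales.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT monthly_spend, visits, churned
    FROM `my-analytics-project.sales.customers`
""").result()

# Evaluate the trained model; returns metrics such as precision, recall, and ROC AUC.
for row in client.query("""
    SELECT *
    FROM ML.EVALUATE(MODEL `my-analytics-project.sales.churn_model`)
""").result():
    print(dict(row.items()))
```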
Session 8 [0.5]: Docker
- Understanding microservices
- Introduction to Docker and its uses
- Docker installation and configuration
- Understanding and working with containers
- Inter-container communication; exposing services through ports
- Understanding the Dockerfile
- Container-based deployment
- Docker compose
- Introduction to Kubernetes
- Introduction to Helm charts
- Deploying Docker images to Kubernetes using Helm charts
- Managing pods
Project 8: Create a data science environment using microservices (see the sketch below)
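As a starting point for Project 8, this sketch launches a containerised Jupyter environment with the Docker SDK for Python; the image name, port mapping, and host path are assumptions, and in class the same setup might instead be built with the docker CLI, Docker Compose, or a Helm chart on Kubernetes.

```python
import docker

# Talk to the local Docker daemon using environment defaults.
client = docker.from_env()

# Run a Jupyter data-science image in the background and publish port 8888.
container = client.containers.run(
    "jupyter/scipy-notebook:latest",   # image name is an assumption
    detach=True,
    ports={"8888/tcp": 8888},          # container port -> host port
    volumes={"/home/user/notebooks": {"bind": "/home/jovyan/work", "mode": "rw"}},
    name="ds-notebook",
)

print("started container:", container.short_id)
# Tail the startup logs, which include the notebook access token.
print(container.logs(tail=20).decode())
```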
Final Project
Prerequisites for the program:
- Familiarity with a programming language such as Python, C, or C++
- Familiarity with RDBMS and SQL
- Linux and shell scripting knowledge is nice to have
- Must be available for 8 hours of class per week, plus at least 2 hours a day for learning and projects outside class hours
Job roles: Data Engineer, Data Scientist, Big Data Developer