- Data preparation
- Working with data pipelines and preparing data using technologies such as Spark, Spark Streaming, Kafka, Hadoop, HDFS, HQL, and SQL, plus cloud services such as Dataproc, Cloud Storage, and BigQuery
- Dealing with various data formats, including transforming unstructured data into structured data, and data types such as JSON, XML, and logs
- Applied Data Science
- Build the skill set to tackle data science use cases using widely used programming languages, data structures, tools, libraries, and algorithms, including cloud technologies
Session 1 [1]: HDFS, MapReduce, Hive
- Introduction to big data and the Hadoop ecosystem
- Why the industry needs big data; advantages of big data over a traditional RDBMS
- Introduction to big data ecosystems
- Understanding various data formats and transformation techniques
- HDFS, YARN architecture
- Understanding HDFS, Hadoop, and Hive (a MapReduce sketch follows this session's outline)
- Advanced Hive features for performance
- Project 1
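To give a concrete feel for the MapReduce model in this session, here is a minimal Hadoop Streaming word-count sketch in Python; the script name, input/output paths, and streaming JAR location in the comment are illustrative assumptions, and the class itself may equally use Java MapReduce or Hive for the same task.

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming word count (illustrative sketch).

Run as mapper:  wordcount.py map
Run as reducer: wordcount.py reduce
Example submission (paths are assumptions, adjust to your cluster):
  hadoop jar hadoop-streaming.jar \
    -input /data/raw/*.txt -output /data/wordcount \
    -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
    -file wordcount.py
"""
import sys
from itertools import groupby

def mapper(stream):
    # Emit "<word>\t1" for every word on every input line.
    for line in stream:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

def reducer(stream):
    # Hadoop sorts mapper output by key, so equal words arrive together.
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{word}\t{total}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if mode == "map" else reducer)(sys.stdin)
```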
Session 2 [0.5]: Impala, Oozie, Shell Scripting, Linux
- Uses of shell scripting in big data projects
- Using the shell to combine various Hadoop technologies
- Architecture of Impala
- Uses of Hive and Impala in real-life projects
- Understanding Oozie as a scheduler; the Oozie Coordinator
- Introduction to Sqoop
- Understanding Sqoop's capabilities and the underlying MapReduce jobs
- Using Sqoop to ingest data from a traditional database into HDFS/Hive (see the sketch after this session's outline)
- Project 2
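To illustrate the Sqoop ingestion topic above, this sketch drives a `sqoop import` from Python; the JDBC URL, credentials, table, and Hive target are placeholder assumptions, and in practice the same command is often run directly from a shell script or an Oozie workflow.

```python
import subprocess

# Placeholder connection details -- adjust to your environment.
jdbc_url = "jdbc:mysql://db-host:3306/sales"
sqoop_cmd = [
    "sqoop", "import",
    "--connect", jdbc_url,
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_pwd",  # avoid passwords on the command line
    "--table", "orders",
    "--hive-import",                            # load straight into a Hive table
    "--hive-table", "staging.orders",
    "--num-mappers", "4",                       # run 4 parallel map tasks
]

# Run the import and fail loudly if Sqoop returns a non-zero exit code.
subprocess.run(sqoop_cmd, check=True)
```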
Session 3 [0.5]: Spark, Scala
- Introduction to the Scala programming language
- Scala from a functional programming perspective
- Scala features for big data transformations
- Spark as a fast, distributed data processing engine
- Spark architecture
- Deep dive into Spark's data transformation capabilities
- Spark SQL with HDFS, Hive, and Impala
- Dealing with various data formats: JSON, XML, CSV, Parquet, and text (see the sketch after this session's outline)
- Project 3
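This session works primarily in Scala, but the same DataFrame API is available from PySpark; the sketch below (paths and table names are assumptions) shows reading several of the formats listed above and querying a Hive table through Spark SQL.

```python
from pyspark.sql import SparkSession

# enableHiveSupport lets Spark SQL read and write Hive metastore tables.
spark = (SparkSession.builder
         .appName("format-demo")
         .enableHiveSupport()
         .getOrCreate())

# Read a few of the formats covered in the session (paths are placeholders).
orders_json = spark.read.json("/data/raw/orders/*.json")
orders_csv = spark.read.option("header", True).csv("/data/raw/orders_csv/")
orders_parquet = spark.read.parquet("/data/curated/orders/")

# Register a DataFrame as a temporary view and join it with a Hive table.
orders_parquet.createOrReplaceTempView("orders")
daily_totals = spark.sql("""
    SELECT c.region, o.order_date, SUM(o.amount) AS total_amount
    FROM orders o
    JOIN default.customers c ON o.customer_id = c.customer_id
    GROUP BY c.region, o.order_date
""")

# Write the result back out as Parquet, partitioned by date.
daily_totals.write.mode("overwrite").partitionBy("order_date").parquet("/data/marts/daily_totals/")
```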
Session 4 [0.5]: Streaming – Kafka and Spark Streaming
- Introduction to streaming: a new era of data analytics
- Introduction to Kafka
- Deep dive into Kafka architecture
- Setting up Kafka for message production
- Spark Streaming with Kafka (see the sketch after this session's outline)
- Kafka performance tuning considerations
- Considerations for zero-data-loss streaming pipelines
- Dealing with the small-files problem and compaction
- Project 4
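A minimal PySpark Structured Streaming sketch for the Kafka integration covered above; the broker address, topic, schema, and checkpoint path are assumptions, and zero-data-loss behaviour in practice also depends on the checkpointing, replication, and sink semantics discussed in class.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

# Note: running this requires the Spark-Kafka connector package on the Spark classpath.
spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

# Expected JSON payload of each Kafka message (schema is an assumption).
schema = (StructType()
          .add("order_id", StringType())
          .add("amount", DoubleType()))

# Subscribe to a Kafka topic; Spark tracks offsets in the checkpoint directory,
# which is what allows restarts without losing data.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker-1:9092")
       .option("subscribe", "orders")
       .option("startingOffsets", "latest")
       .load())

orders = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("o"))
          .select("o.*"))

# Write micro-batches to Parquet; longer trigger intervals help avoid
# the small-files problem mentioned above.
query = (orders.writeStream
         .format("parquet")
         .option("path", "/data/streaming/orders/")
         .option("checkpointLocation", "/checkpoints/orders/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```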
Session 5 [1.5]: Big Data in the Cloud: GCP (Google Cloud Platform)
- GCP orchestration
- GCP tools and services used by Canadian companies to develop data pipelines (see the BigQuery sketch after this session's outline)
- GCP tools and services used by Canadian companies to prepare data for ML models
- Hadoop on GCP
- Cloud migration: lift-and-shift strategy with minimal code changes
- Streaming in the cloud
- Project 5
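As a small taste of the GCP tooling in this session, the sketch below runs a query against BigQuery with the official Python client; the project, dataset, and table names are placeholder assumptions, and Dataproc, Cloud Storage, and the other services covered have similar client libraries.

```python
from google.cloud import bigquery

# The client picks up credentials from the environment
# (e.g. GOOGLE_APPLICATION_CREDENTIALS); the project id here is a placeholder.
client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT region, SUM(amount) AS total_amount
    FROM `my-analytics-project.sales.orders`
    WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY region
    ORDER BY total_amount DESC
"""

# Run the query and iterate over the result rows.
for row in client.query(query).result():
    print(row.region, row.total_amount)
```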
Session 6 [1.5]: Introduction to Data Science and Python
- Basics of Data Science
- How data science differs from other use cases
- Python for data science
- Using Anaconda, Jupyter Notebook, Docker, VS Code
- Applying Python libraries for:
- Plotting
- Data science: scikit-learn, pandas, NumPy
- Applying data science in real-life use cases
- Going over various data science algorithms: supervised and unsupervised
- Understanding feature engineering, training, and testing models (see the sketch after this session's outline)
- Data preparation for model training
- Project 6
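A minimal scikit-learn sketch of the feature engineering / training / testing workflow listed above, using a small synthetic dataset so it runs on its own; the features, label, and model choice are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data standing in for a prepared dataset (assumption for illustration).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "monthly_spend": rng.normal(100, 30, 1000),
    "visits": rng.poisson(5, 1000),
})
# A simple rule creates a binary label to predict.
df["churned"] = ((df["monthly_spend"] < 80) & (df["visits"] < 4)).astype(int)

X = df[["monthly_spend", "visits"]]
y = df["churned"]

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling + model in one pipeline, so preprocessing is fit only on training data.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```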
Session 7 [2]: Google Cloud for Data Science
- GCP tools and services for data science
- Vertex AI, Workbench
- Model Garden
- BQML (BigQuery ML; see the sketch after this session's outline)
- AutoML
- Generative AI
- Project 7
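To make BQML concrete, the sketch below trains and evaluates a logistic regression model entirely inside BigQuery from Python; the project, dataset, and column names are assumptions, and Vertex AI and AutoML are driven through their own client libraries and the console rather than SQL.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # placeholder project id

# Train a logistic regression model directly on a BigQuery table (BQML).
client.query("""
    CREATE OR REPLACE MODEL `my-analytics-project.sales.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT monthly_spend, visits, churned
    FROM `my-analytics-project.sales.customers`
""").result()

# Evaluate the trained model; returns metrics such as precision, recall, and ROC AUC.
for row in client.query("""
    SELECT *
    FROM ML.EVALUATE(MODEL `my-analytics-project.sales.churn_model`)
""").result():
    print(dict(row.items()))
```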
Session 8 [0.5]: Docker
- Understanding microservices
- Introduction to Docker and its uses
- Docker installation and configuration
- Understanding and working with containers
- Inter-container communication; exposing services through ports
- Understanding the Dockerfile
- Container-based deployment
- Docker compose
- Introduction to Kubernetes
- Introduction to Helm charts
- Deploying Docker images to Kubernetes using Helm charts
- Managing pods
Project 8: Create a data science environment using microservices (see the sketch below)
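As a starting point for Project 8, this sketch launches a containerised Jupyter environment with the Docker SDK for Python; the image name, port mapping, and host path are assumptions, and in class the same setup might instead be built with the docker CLI, Docker Compose, or a Helm chart on Kubernetes.

```python
import docker

# Talk to the local Docker daemon using environment defaults.
client = docker.from_env()

# Run a Jupyter data-science image in the background and publish port 8888.
container = client.containers.run(
    "jupyter/scipy-notebook:latest",   # image name is an assumption
    detach=True,
    ports={"8888/tcp": 8888},          # container port -> host port
    volumes={"/home/user/notebooks": {"bind": "/home/jovyan/work", "mode": "rw"}},
    name="ds-notebook",
)

print("started container:", container.short_id)
# Tail the startup logs, which include the notebook access token.
print(container.logs(tail=20).decode())
```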
Final Project
Prerequisites for the program:
- Familiarity with a programming language such as Python, C, or C++
- Familiarity with RDBMS and SQL
- Linux and shell scripting knowledge is nice to have
- Must be available for 8 hours of class per week, plus at least 2 hours a day for learning and projects outside class hours
Job roles: Data Engineer, Data Scientist, Big Data Developer