A Big Data Hadoop and Spark project for absolute beginners
What you’ll learn
Big Data, Hadoop and Spark from scratch by solving a real-world use case using Python and Scala
Spark Scala and PySpark real-world coding framework
Real-world coding best practices: logging, error handling and configuration management using both Scala and Python
Serverless big data solution using AWS Glue, Athena and S3
Requirements
Students should have some programming background and some knowledge of SQL queries.
Description
This course will prepare you for a real-world Data Engineer role! Data Engineering is a crucial component of data-driven organizations, as it encompasses the processing, management, and analysis of large-scale data sets, which is essential for staying competitive.
This course provides an opportunity to get started quickly with Big Data by using free cloud clusters to solve a practical use case. You will learn the fundamental concepts of Hadoop, Hive, and Spark, using both Python and Scala. The course aims to develop your Spark Scala and PySpark coding abilities to those of a professional developer by introducing you to industry-standard coding practices such as logging, error handling, and configuration management.
Additionally, you will understand the Databricks Lakehouse Platform and learn how to conduct analytics using Python and Scala with Spark, apply Spark SQL and Databricks SQL for analytics, develop a data pipeline with Apache Spark, and manage a Delta table by accessing version history, restoring data, and utilizing time travel features. You will also learn how to optimize query performance using Delta Cache, work with Delta Tables and the Databricks File System, and gain insights into real-world scenarios from our experienced instructor.
What you will learn:
Big Data, Hadoop concepts
How to create a free Hadoop and Spark cluster using Google Dataproc
Hadoop hands-on – HDFS, Hive
Python basics
PySpark RDD – hands-on
PySpark SQL, DataFrame – hands-on
Project work using PySpark and Hive
Scala basics
Spark Scala DataFrame
Project work using Spark Scala
Developing a practical comprehension of Databricks Delta Lake Lakehouse concepts through hands-on experience
Learning to operate a Delta table by accessing its version history, recovering data, and utilizing time travel functionality
Spark Scala real-world coding framework and development using Winutils, Maven and IntelliJ
Python Spark Hadoop Hive coding framework and development using PyCharm
Building a data pipeline using Hive, PostgreSQL and Spark
Logging, error handling and unit testing of PySpark and Spark Scala applications
Spark Scala Structured Streaming
Applying Spark transformations on data stored in AWS S3 using Glue and viewing data using Athena
How to become a productive data engineer leveraging ChatGPT
Prerequisites:
This course is designed for Data Engineering beginners; no prior knowledge of Python or Scala is required. However, some familiarity with databases and SQL is necessary to succeed in this course. Upon completion, you will have the skills and knowledge required to succeed in a real-world Data Engineer role.
Overview
Section 1: Introduction
Lecture 1 Introduction
Lecture 2 New addition – Databricks Delta Lake Lakehouse
Section 2: Big Data Hadoop concepts and hands-on
Lecture 3 Big Data concepts
Lecture 4 Hadoop concepts
Lecture 5 Hadoop Distributed File System (HDFS)
Lecture 6 Understanding Google Cloud (GCP) Dataproc
Lecture 7 Signing up for a Google Cloud free trial
Lecture 8 Storing a file in HDFS
Lecture 9 MapReduce and YARN
Lecture 10 Hive
Lecture 11 Querying HDFS data using Hive
Lecture 12 Deleting the Cluster
Lecture 13 Analyzing a billion records with Hive
Lecture 14 Fast queries with Hive Partitioning
Lecture 15 Fast queries with Hive Bucketing
Section 3: Spark concepts and hands-on
Lecture 16 What is Spark?
Lecture 17 Spark Hello World on Dataproc
Lecture 18 Running Python Spark 3 on Google Colab
Lecture 19 Spark for data transformation
Lecture 20 What is a DataFrame?
Lecture 21 RDDs – The fundamental building block
Lecture 22 Python basics
Lecture 23 PySpark – Creating RDDs
Lecture 24 Python functions and lambda expressions
Lecture 25 RDD – Transformation & Action
Lecture 26 PySpark – SparkSQL and DataFrame
Section 4: Project – Bank prospects marketing data cleansing using Hadoop and Spark
Lecture 27 Project problem statement
Lecture 28 Project solution using PySpark on Colab
Lecture 29 Project solution using PySpark on a Dataproc cluster
Lecture 30 Rapid Revision – Big Data, Hadoop and Spark concepts
Section 5: Running the project in Scala
Lecture 31 Scala basics
Lecture 32 Spark SQL DataFrame using Scala
Lecture 33 Bank prospects marketing project in Scala
Section 6: Learning Apache Spark on Databricks
Lecture 34 What is Databricks
Lecture 35 Creating a Databricks Community Edition account to practice Spark
Lecture 36 Saving data to Databricks DBFS and Delta tables
Lecture 37 Exporting and importing Notebooks
Lecture 38 Sample transformations on Databricks using PySpark
Lecture 39 Sample transformations on Databricks using Spark Scala
Lecture 40 Spark User defined functions (UDF)
Lecture 41 Joining Datasets using DataFrame APIs and Spark SQL
Lecture 42 More join operations using Spark
Section 7: Deep dive into Databricks Delta Lake Lakehouse Platform
Lecture 43 Understanding Data Warehouse, Data Lake and Data Lakehouse
Lecture 44 Databricks Lakehouse Architecture and Delta Lake
Lecture 45 Delta tables
Lecture 46 Storing data in a Delta table, Databricks SQL and time travel
Lecture 47 Databricks SQL vs Spark SQL
Lecture 48 Delta Table caching
Lecture 49 Delta Table partitioning
Lecture 50 Delta Table Z-ordering
Section 8: Being a productive Data Engineer with ChatGPT
Lecture 51 Leveraging ChatGPT for faster development
Lecture 52 Spark Performance tuning using Spark Submit leveraging ChatGPT
Section 9: Spark Scala real world coding framework and best practices
Lecture 53 Spark Scala real world coding introduction
Lecture 54 Installing JDK 11 on a Windows Machine
Lecture 55 Installing IntelliJ and Winutils for Spark Scala Hive programming on Windows
Lecture 56 For Mac users – JDK, IntelliJ installation and Spark Scala Hive Hello World
Lecture 57 Scala basics using IntelliJ
Lecture 58 Installing PostgreSQL
Lecture 59 psql command line interface for PostgreSQL
Lecture 60 Fetching PostgreSQL data to a Spark DataFrame
Lecture 61 Importing a project into IntelliJ
Lecture 62 Organizing code with Objects and Methods
Lecture 63 Implementing Log4j SLF4J Logging
Lecture 64 Exception Handling with try, catch, Option, Some and None
Section 10: A Data Pipeline with Spark Scala Hadoop PostgreSQL
Lecture 65 Reading from Hive and Writing to Postgres
Lecture 66 Reading Configuration from JSON using Typesafe
Lecture 67 Reading command-line arguments and debugging in IntelliJ
Lecture 68 Writing data to a Hive Table
Lecture 69 Managing input parameters using a Scala Case Class
Lecture 70 IntelliJ Maven troubleshooting tips
Section 11: Spark Scala Unit Testing using ScalaTest
Lecture 71 Scala Unit Testing using JUnit & ScalaTest
Lecture 72 Spark Transformation unit testing using ScalaTest
Lecture 73 Unit testing to catch an Exception
Lecture 74 Catching Exception using assertThrows
Lecture 75 Throwing Custom Error and Intercepting Error Message
Lecture 76 Testing with assertResult
Lecture 77 Testing with Matchers
Lecture 78 Failing tests intentionally
Lecture 79 Sharing fixtures
Section 12: Exporting the Project and Spark Submit
Lecture 80 Exporting the project to an uber jar
Lecture 81 Doing spark-submit locally
Section 13: Spark Scala – Structured Streaming
Lecture 82 Structured Streaming concepts
Lecture 83 Streaming data from files
Lecture 84 Batch vs Streaming code
Lecture 85 Writing streaming data to a Hive table
Lecture 86 Streaming Aggregation
Lecture 87 Filtering Stream
Lecture 88 Adding timestamp to streaming data
Lecture 89 Aggregation in a time window
Lecture 90 Tumbling window and Sliding window
Section 14: Creating a PySpark real world coding framework
Lecture 91 PySpark Hadoop Hive development environment using PyCharm and Winutils
Lecture 92 Instructions for Mac users
Lecture 93 Creating a project in the main Python environment
Lecture 94 Structuring code with classes and methods
Lecture 95 How does Spark work?
Lecture 96 Creating and reusing SparkSession
Lecture 97 Spark DataFrame
Lecture 98 Quick tips – Winutils permission
Lecture 99 Separating out Ingestion, Transformation and Persistence code
Section 15: PySpark Logging and Error Handling
Lecture 100 Python Logging
Lecture 101 Managing log level through a configuration file
Lecture 102 Having custom logger for each Python class
Lecture 103 Error Handling with try except and raise
Lecture 104 Logging using log4p and log4python packages
Section 16: Creating a Data Pipeline with Hadoop PySpark and PostgreSQL
Lecture 105 Ingesting data from Hive
Lecture 106 Transforming ingested data
Lecture 107 Installing PostgreSQL
Lecture 108 PySpark PostgreSQL interaction with Psycopg2 adapter
Lecture 109 Spark PostgreSQL interaction with JDBC driver
Lecture 110 Persisting transformed data in PostgreSQL
Section 17: PySpark – Reading Configuration from properties file
Lecture 111 Organizing code further
Lecture 112 Reading configuration from a property file
Section 18: Unit testing PySpark application and spark-submit
Lecture 113 Python unittest framework
Lecture 114 Unit testing PySpark transformation logic
Lecture 115 Unit testing an error
Lecture 116 PySpark – spark submit
Section 19: Bank prospects data transformation using AWS S3, Glue and Athena
Lecture 117 Introduction to AWS data lake use case
Lecture 118 Signing up for Amazon Web Services (AWS)
Lecture 119 A Data Lake with AWS S3
Lecture 120 A data catalog with AWS Glue
Lecture 121 Querying data using Amazon Athena
Lecture 122 Running Spark transformation jobs on AWS Glue
Lecture 123 An automated data pipeline using Lambda, S3 and Glue
Lecture 124 Bank prospects data transformation solution using PySpark, Glue, S3 and Athena
Who this course is for
Beginners who want to learn Big Data or experienced people who want to transition to a Big Data role
Big Data beginners who want to learn how to code in the real world
Course Information:
Udemy | English | 12h 18m | 6.39 GB
Created by: FutureX Skills