A Big Data Hadoop and Spark project for absolute beginners

Data Engineering, Spark, Hive, Python, PySpark, Scala, Coding Framework, Testing, IntelliJ, Maven, Glue, Databricks, Delta Lake
File Size: 6.39 GB
Total length: 12h 18m



FutureX Skills







What you’ll learn

Big Data, Hadoop and Spark from scratch, by solving a real-world use case using Python and Scala
A real-world coding framework for Spark Scala and PySpark
Real-world coding best practices: logging, error handling and configuration management, using both Scala and Python
A serverless big data solution using AWS Glue, Athena and S3



Students should have some programming background and some knowledge of SQL queries.


This course will prepare you for a real-world Data Engineer role! Data Engineering is a crucial component of data-driven organizations, as it encompasses the processing, management, and analysis of large-scale data sets, which is essential for staying competitive. This course provides an opportunity to get started with Big Data quickly through the use of free cloud clusters, and to solve a practical use case. You will learn the fundamental concepts of Hadoop, Hive, and Spark, using both Python and Scala. The course aims to develop your Spark Scala and PySpark coding abilities to those of a professional developer by introducing you to industry-standard coding practices such as logging, error handling, and configuration management.

Additionally, you will understand the Databricks Lakehouse Platform and learn how to conduct analytics using Python and Scala with Spark, apply Spark SQL and Databricks SQL for analytics, develop a data pipeline with Apache Spark, and manage a Delta table by accessing version history, restoring data, and utilizing time travel features. You will also learn how to optimize query performance using Delta Cache, work with Delta Tables and the Databricks File System, and gain insights into real-world scenarios from our experienced instructor.

What you will learn:

Big Data and Hadoop concepts
How to create a free Hadoop and Spark cluster using Google Dataproc
Hadoop hands-on – HDFS, Hive
Python basics
PySpark RDD – hands-on
PySpark SQL, DataFrame – hands-on
Project work using PySpark and Hive
Scala basics
Spark Scala DataFrame
Project work using Spark Scala
Developing a practical comprehension of Databricks Delta Lake Lakehouse concepts through hands-on experience
Learning to operate a Delta table by accessing its version history, recovering data, and utilizing time travel functionality
Spark Scala real-world coding framework and development using Winutils, Maven and IntelliJ
Python Spark Hadoop Hive coding framework and development using PyCharm
Building a data pipeline using Hive, PostgreSQL and Spark
Logging, error handling and unit testing of PySpark and Spark Scala applications
Spark Scala Structured Streaming
Applying Spark transformations on data stored in AWS S3 using Glue, and viewing the data using Athena
How to become a productive data engineer leveraging ChatGPT

Prerequisites:

This course is designed for Data Engineering beginners; no prior knowledge of Python or Scala is required. However, some familiarity with databases and SQL is necessary to succeed in this course. Upon completion, you will have the skills and knowledge required to succeed in a real-world Data Engineer role.


Section 1: Introduction

Lecture 1 Introduction

Lecture 2 New addition – Databricks Delta Lake Lakehouse

Section 2: Big Data Hadoop concepts and hands-on

Lecture 3 Big Data concepts

Lecture 4 Hadoop concepts

Lecture 5 Hadoop Distributed File System (HDFS)

Lecture 6 Understanding Google Cloud (GCP) Dataproc

Lecture 7 Signing up for a Google Cloud free trial

Lecture 8 Storing a file in HDFS

Lecture 9 MapReduce and YARN

Lecture 10 Hive

Lecture 11 Querying HDFS data using Hive

Lecture 12 Deleting the Cluster

Lecture 13 Analyzing a billion records with Hive

Lecture 14 Fast queries with Hive Partitioning

Lecture 15 Fast queries with Hive Bucketing

Section 3: Spark concepts and hands-on

Lecture 16 What is Spark?

Lecture 17 Spark Hello World on Dataproc

Lecture 18 Running Python Spark 3 on Google Colab

Lecture 19 Spark for data transformation

Lecture 20 What is a DataFrame?

Lecture 21 RDDs – The fundamental building block

Lecture 22 Python basics

Lecture 23 PySpark – Creating RDDs

Lecture 24 Python functions and lambda expressions

Lecture 25 RDD – Transformation & Action

Lecture 26 PySpark – SparkSQL and DataFrame
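
Lecture 25 draws the line between RDD transformations, which are lazy, and actions, which trigger execution. A Spark cluster isn't needed to see the idea: Python generators are lazy in the same way, so this plain-Python analogue (the variable names are illustrative, not PySpark API) sketches it:

```python
# Transformations (map, filter) are lazy: nothing runs until an action asks
# for results. Python generators behave the same way, making a handy analogue.
data = range(1, 6)                        # like sc.parallelize([1, 2, 3, 4, 5])
mapped = (x * 10 for x in data)           # "transformation": map, not yet executed
filtered = (x for x in mapped if x > 20)  # "transformation": filter, still lazy

# The "action" forces evaluation, like RDD.collect() would in PySpark.
result = list(filtered)
print(result)  # [30, 40, 50]
```

In real PySpark the same shape appears as `rdd.map(...).filter(...).collect()`, with the cluster doing the work only when the action runs.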

Section 4: Project – Bank prospects marketing data cleansing using Hadoop and Spark

Lecture 27 Project problem statement

Lecture 28 Project solution using PySpark on Colab

Lecture 29 Project solution using PySpark on a Dataproc cluster

Lecture 30 Rapid Revision – Big Data, Hadoop and Spark concepts
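
The course solves the bank-prospects cleansing project in PySpark on Colab and Dataproc. Purely to illustrate the shape of a cleansing step, here is a plain-Python stand-in; the column names and default values are made up for the example, not taken from the course dataset:

```python
def cleanse(prospects):
    """Replace missing values with defaults - the core of a typical
    marketing-data cleansing job."""
    cleaned = []
    for row in prospects:
        fixed = dict(row)                   # copy so the input is untouched
        if not fixed.get("country"):
            fixed["country"] = "Unknown"    # assumed default, for illustration
        if fixed.get("age") is None:
            fixed["age"] = 0                # assumed default, for illustration
        cleaned.append(fixed)
    return cleaned

rows = [{"name": "Ann", "age": 30, "country": "Germany"},
        {"name": "Raj", "age": None, "country": ""}]
print(cleanse(rows))
```

In the course itself this logic is expressed over Spark DataFrames so it scales beyond a single machine.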

Section 5: Running the project in Scala

Lecture 31 Scala basics

Lecture 32 Spark SQL DataFrame using Scala

Lecture 33 Bank prospects marketing project in Scala

Section 6: Learning Apache Spark on Databricks

Lecture 34 What is Databricks?

Lecture 35 Creating a Databricks Community Edition account to practice Spark

Lecture 36 Saving data to Databricks DBFS and Delta tables

Lecture 37 Exporting and importing Notebooks

Lecture 38 Sample transformations on Databricks using PySpark

Lecture 39 Sample transformations on Databricks using Spark Scala

Lecture 40 Spark user-defined functions (UDFs)

Lecture 41 Joining Datasets using DataFrame APIs and Spark SQL

Lecture 42 More join operations using Spark

Section 7: Deep dive into Databricks Delta Lake Lakehouse Platform

Lecture 43 Understanding Data Warehouse, Data Lake and Data Lakehouse

Lecture 44 Databricks Lakehouse Architecture and Delta Lake

Lecture 45 Delta tables

Lecture 46 Storing data in a Delta table, Databricks SQL and time travel

Lecture 47 Databricks SQL vs Spark SQL

Lecture 48 Delta Table caching

Lecture 49 Delta Table partitioning

Lecture 50 Delta Table Z-ordering

Section 8: Being a productive Data Engineer with ChatGPT

Lecture 51 Leveraging ChatGPT for faster development

Lecture 52 Spark Performance tuning using Spark Submit leveraging ChatGPT

Section 9: Spark Scala real world coding framework and best practices

Lecture 53 Spark Scala real world coding introduction

Lecture 54 Installing JDK 11 on a Windows Machine

Lecture 55 Installing IntelliJ and Winutils for Spark Scala Hive programming on Windows

Lecture 56 For Mac users – JDK , IntelliJ installation and Spark Scala Hive Hello World

Lecture 57 Scala basics using IntelliJ

Lecture 58 Installing PostgreSQL

Lecture 59 psql command line interface for PostgreSQL

Lecture 60 Fetching PostgreSQL data to a Spark DataFrame

Lecture 61 Importing a project into IntelliJ

Lecture 62 Organizing code with Objects and Methods

Lecture 63 Implementing Log4j SLf4j Logging

Lecture 64 Exception Handling with try, catch, Option, Some and None

Section 10: A Data Pipeline with Spark Scala Hadoop PostgreSQL

Lecture 65 Reading from Hive and Writing to Postgres

Lecture 66 Reading Configuration from JSON using Typesafe

Lecture 67 Reading command-line arguments and debugging in IntelliJ

Lecture 68 Writing data to a Hive Table

Lecture 69 Managing input parameters using a Scala Case Class

Lecture 70 IntelliJ Maven troubleshooting tips

Section 11: Spark Scala Unit Testing using ScalaTest

Lecture 71 Scala Unit Testing using JUnit & ScalaTest

Lecture 72 Spark Transformation unit testing using ScalaTest

Lecture 73 Unit testing to catch an Exception

Lecture 74 Catching Exception using assertThrows

Lecture 75 Throwing Custom Error and Intercepting Error Message

Lecture 76 Testing with assertResult

Lecture 77 Testing with Matchers

Lecture 78 Failing tests intentionally

Lecture 79 Sharing fixtures

Section 12: Exporting the Project and Spark Submit

Lecture 80 Exporting the project to an uber jar

Lecture 81 Doing spark-submit locally

Section 13: Spark Scala – Structured Streaming

Lecture 82 Structured Streaming concepts

Lecture 83 Streaming data from files

Lecture 84 Batch Vs Streaming code

Lecture 85 Writing streaming data to a Hive table

Lecture 86 Streaming Aggregation

Lecture 87 Filtering Stream

Lecture 88 Adding timestamp to streaming data

Lecture 89 Aggregation in a time window

Lecture 90 Tumbling window and Sliding window
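
Lectures 89–90 aggregate streaming data in time windows. The tumbling-window idea (fixed, non-overlapping buckets keyed by event timestamp) can be sketched without Spark; the five-second window below is an arbitrary example:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, value) events into fixed, non-overlapping windows
    and count events per window - the same bucketing Structured Streaming
    performs for a tumbling window."""
    counts = defaultdict(int)
    for ts, _value in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

events = [(0, "a"), (3, "b"), (5, "c"), (9, "d"), (12, "e")]
print(tumbling_window_counts(events, 5))  # {0: 2, 5: 2, 10: 1}
```

A sliding window differs only in that each event can fall into several overlapping windows instead of exactly one.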

Section 14: Creating a PySpark real world coding framework

Lecture 91 PySpark Hadoop Hive development environment using PyCharm and Winutils

Lecture 92 Instructions for Mac users

Lecture 93 Creating a project in the main Python environment

Lecture 94 Structuring code with classes and methods

Lecture 95 How Spark works

Lecture 96 Creating and reusing SparkSession

Lecture 97 Spark DataFrame

Lecture 98 Quick tips – winutil permission

Lecture 99 Separating out Ingestion, Transformation and Persistence code
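
Lecture 99's separation of ingestion, transformation and persistence can be sketched as three small classes wired together by a pipeline object. The class names are illustrative, and the in-memory "persistence" stands in for the Hive and PostgreSQL pieces the course actually uses:

```python
class Ingest:
    def read(self):
        # Placeholder for reading from Hive; returns sample rows instead.
        return [{"name": "alice", "age": 34}, {"name": "bob", "age": 41}]

class Transform:
    def apply(self, rows):
        # Business logic lives in one place, making it easy to unit test.
        return [{**row, "name": row["name"].title()} for row in rows]

class Persist:
    def write(self, rows):
        # Placeholder for writing to PostgreSQL; keeps rows in memory.
        self.saved = rows

class Pipeline:
    def __init__(self):
        self.ingest, self.transform, self.persist = Ingest(), Transform(), Persist()

    def run(self):
        rows = self.transform.apply(self.ingest.read())
        self.persist.write(rows)
        return rows

print(Pipeline().run())  # [{'name': 'Alice', 'age': 34}, {'name': 'Bob', 'age': 41}]
```

The payoff of this structure is that each stage can be swapped or tested in isolation, which the unit-testing sections rely on.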

Section 15: PySpark Logging and Error Handling

Lecture 100 Python Logging

Lecture 101 Managing log level through a configuration file

Lecture 102 Having custom logger for each Python class

Lecture 103 Error Handling with try except and raise

Lecture 104 Logging using log4p and log4python packages
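
The pattern behind Lectures 100–102 (a named logger per class, with the level normally driven by a configuration file) looks roughly like this in plain Python; the class name and hard-coded level are illustrative:

```python
import logging

# The level would normally come from a configuration file (Lecture 101);
# it is hard-coded here for brevity.
LOG_LEVEL = "DEBUG"

class Ingestor:
    def __init__(self):
        # One logger named after the class (Lecture 102) makes log lines traceable.
        self.logger = logging.getLogger(self.__class__.__name__)
        self.logger.setLevel(getattr(logging, LOG_LEVEL))

    def ingest(self):
        self.logger.debug("starting ingestion")
        return ["row1", "row2"]

ing = Ingestor()
print(ing.logger.name)  # Ingestor
print(ing.ingest())
```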

Section 16: Creating a Data Pipeline with Hadoop PySpark and PostgreSQL

Lecture 105 Ingesting data from Hive

Lecture 106 Transforming ingested data

Lecture 107 Installing PostgreSQL

Lecture 108 PySpark PostgreSQL interaction with Psycopg2 adapter

Lecture 109 Spark PostgreSQL interaction with JDBC driver

Lecture 110 Persisting transformed data in PostgreSQL

Section 17: PySpark – Reading Configuration from properties file

Lecture 111 Organizing code further

Lecture 112 Reading configuration from a property file
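
Reading settings from a properties file, as in Lecture 112, can be done with Python's standard configparser; the section and key names below are made-up examples:

```python
import configparser

# Properties-style configuration the pipeline might read; in a real project
# this text would live in a .properties/.ini file loaded with parser.read(path).
PROPERTIES = """
[SPARK]
app_name = BankPipeline
shuffle_partitions = 8
"""

parser = configparser.ConfigParser()
parser.read_string(PROPERTIES)

app_name = parser.get("SPARK", "app_name")
partitions = parser.getint("SPARK", "shuffle_partitions")
print(app_name, partitions)  # BankPipeline 8
```

Keeping such values out of the code means the same jar or script can run against development and production databases unchanged.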

Section 18: Unit testing PySpark application and spark-submit

Lecture 113 Python unittest framework

Lecture 114 Unit testing PySpark transformation logic

Lecture 115 Unit testing an error

Lecture 116 PySpark – spark submit
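
Lectures 113–115 unit test transformation logic with Python's unittest module. A minimal, self-contained sketch of that idea (the transformation itself is a stand-in, not the course's code):

```python
import unittest

def transform(rows):
    """Stand-in transformation: upper-case every name."""
    return [row.upper() for row in rows]

class TransformTest(unittest.TestCase):
    def test_upper_cases_names(self):
        self.assertEqual(transform(["ann", "raj"]), ["ANN", "RAJ"])

    def test_bad_input_raises(self):
        # Lecture 115's idea: assert that bad input raises the expected error.
        with self.assertRaises(AttributeError):
            transform([None])

suite = unittest.TestLoader().loadTestsFromTestCase(TransformTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```

In the course, the function under test would be a PySpark transformation, with a SparkSession created in the test setup.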

Section 19: Bank prospects data transformation using AWS S3, Glue and Athena

Lecture 117 Introduction to AWS data lake use case

Lecture 118 Signing up for Amazon web services (AWS)

Lecture 119 A Data Lake with AWS S3

Lecture 120 A data catalog with AWS Glue

Lecture 121 Querying data using Amazon Athena

Lecture 122 Running Spark transformation jobs on AWS Glue

Lecture 123 An automated data pipeline using Lambda, S3 and Glue

Lecture 124 Bank prospects data transformation solution using PySpark , Glue, S3 and Athena

Beginners who want to learn Big Data, experienced professionals who want to transition to a Big Data role, and Big Data beginners who want to learn how to code in the real world.

Course Information:

Udemy | English | 12h 18m | 6.39 GB
Created by: FutureX Skills


