Azure Databricks Spark For Data Engineers PySpark SQL

Real World Project on Formula1 Racing using Azure Databricks, Delta Lake, Unity Catalog, Azure Data Factory [DP203]
File Size: 7.72 GB
Total length: 19h 54m
Instructor: Ramesh Retnasamy
Last update: 5/2023
Ratings: 4.6/5

What you’ll learn

You will learn how to build a real world data project using Azure Databricks and Spark Core. This course has been taught using real world data.
You will acquire professional level data engineering skills in Azure Databricks, Delta Lake, Spark Core, Azure Data Lake Gen2 and Azure Data Factory (ADF)
You will learn how to create notebooks, dashboards, clusters, cluster pools and jobs in Azure Databricks
You will learn how to ingest and transform data using PySpark in Azure Databricks
You will learn how to transform and analyse data using Spark SQL in Azure Databricks
You will learn about Data Lake architecture and Lakehouse Architecture. Also, you will learn how to implement a Lakehouse architecture using Delta Lake.
You will learn how to create Azure Data Factory pipelines to execute Databricks notebooks
You will learn how to create Azure Data Factory triggers to schedule pipelines as well as monitor them.
You will gain the skills required around Azure Databricks and Data Factory to pass the Azure Data Engineer Associate certification exam DP203
You will learn how to connect to Azure Databricks from Power BI to create reports
You will gain a comprehensive understanding of Unity Catalog and the data governance capabilities it offers.
You will learn to implement a data governance solution using a Unity Catalog enabled Databricks workspace.

Requirements

All the code and step-by-step instructions are provided, but the skills below will greatly benefit your journey
Basic Python programming experience will be required
Basic SQL knowledge will be required
Knowledge of cloud fundamentals will be beneficial, but not necessary
An Azure subscription will be required. If you don’t have one, we will create a free account in the course

Description

Major updates to the course since the launch

May 2023 – New sections 25, 26 and 27 added to include Unity Catalog. Unity Catalog is a recent addition to Databricks that offers a unified data governance solution for a Data Lakehouse. These sections cover all aspects of Unity Catalog and its implementation using a project.
March 2023 – New sections 6 and 7 added. Section 8 updated. These changes reflect the latest Databricks recommendations around accessing Azure Data Lake. They also provide a better way to complete the course project for students using an Azure Student Subscription or Corporate Subscriptions with limited access to Azure Active Directory.
December 2022 – Sections 3, 4 & 5 updated to reflect recent UI changes to Azure Databricks. Also included lessons on additional functionality that Databricks recently added to Databricks clusters.

Welcome! I am looking forward to helping you learn one of the in-demand data engineering tools in the cloud, Azure Databricks! This course is taught by implementing a data engineering solution using Azure Databricks and Spark Core for a real world project of analysing and reporting on Formula1 motor racing data.

This is like no other course on Udemy for Azure Databricks. Once you have completed the course, including all the assignments, I strongly believe that you will be in a position to start a real world data engineering project on your own and will be proficient in Azure Databricks.

I have also included lessons on Azure Data Lake Storage Gen2, Azure Data Factory as well as Power BI. The primary focus of the course is Azure Databricks and Spark Core, but it also covers the relevant concepts and connectivity to the other technologies mentioned. Please note that the course doesn’t cover other aspects of Spark such as Spark Streaming and Spark ML. Also, the course is taught using PySpark and Spark SQL; it doesn’t cover Scala or Java.

The course follows the logical progression of a real world project implementation, with technical concepts being explained and the Databricks notebooks being built at the same time. Even though this course is not specifically designed to teach you the skills required for passing the Azure Data Engineer Associate Certification Exam DP203, it can greatly help you get most of the skills required for the exam.

I value your time as much as I do mine. So, I have designed this course to be fast-paced and to the point. Also, the course is taught in simple English with no jargon. I start the course from the basics, and by the end of the course you will be proficient in the technologies used.

Currently the course teaches you the following:

Azure Databricks
Building a solution architecture for a data engineering solution using Azure Databricks, Azure Data Lake Gen2, Azure Data Factory and Power BI
Creating and using Azure Databricks service and the architecture of Databricks within Azure
Working with Databricks notebooks as well as using Databricks utilities, magic commands etc
Passing parameters between notebooks as well as creating notebook workflows
Creating, configuring and monitoring Databricks clusters, cluster pools and jobs
Mounting Azure Storage in Databricks using secrets stored in Azure Key Vault
Working with Databricks Tables, Databricks File System (DBFS) etc
Using Delta Lake to implement a solution using Lakehouse architecture
Creating dashboards to visualise the outputs
Connecting to the Azure Databricks tables from Power BI

Spark (Only PySpark and SQL)
Spark architecture, Data Sources API and Dataframe API
PySpark – Ingestion of CSV, simple and complex JSON files into the data lake as parquet files/ tables
PySpark – Transformations such as Filter, Join, Simple Aggregations, GroupBy, Window functions etc
PySpark – Creating local and global temporary views
Spark SQL – Creating databases, tables and views
Spark SQL – Transformations such as Filter, Join, Simple Aggregations, GroupBy, Window functions etc
Spark SQL – Creating local and global temporary views
Implementing full refresh and incremental load patterns using partitions

Delta Lake
Emergence of Data Lakehouse architecture and the role of Delta Lake
Read, Write, Update, Delete and Merge to Delta Lake using both PySpark and SQL
History, Time Travel and Vacuum
Converting Parquet files to Delta files
Implementing incremental load pattern using Delta Lake

Unity Catalog
Overview of Data Governance and Unity Catalog
Create Unity Catalog Metastore and enable a Databricks workspace with Unity Catalog
Overview of 3 level namespace and creating Unity Catalog objects
Configuring and accessing external data lakes via Unity Catalog
Development of a mini project using Unity Catalog and seeing the key data governance capabilities offered by Unity Catalog such as Data Discovery, Data Audit, Data Lineage and Data Access Control

Azure Data Factory
Creating pipelines to execute Databricks notebooks
Designing robust pipelines to deal with unexpected scenarios such as missing files
Creating dependencies between activities as well as pipelines
Scheduling the pipelines using Data Factory triggers to execute at regular intervals
Monitoring the triggers/ pipelines to check for errors/ outputs
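To make the "full refresh and incremental load patterns using partitions" topic mentioned above a little more concrete, here is a minimal sketch of the idea in plain Python. It is not the course's code and it does not use the PySpark or Delta Lake APIs. The table is modelled as a dictionary keyed by a partition column, and the `race_id` column, the sample rows and both function names are assumptions made up for illustration.

```python
from collections import defaultdict


def full_refresh(new_rows):
    """Discard whatever was stored before and rebuild every partition."""
    table = defaultdict(list)
    for row in new_rows:
        table[row["race_id"]].append(row)
    return dict(table)


def incremental_load(table, new_rows):
    """Replace only the partitions present in new_rows; keep the rest as-is."""
    incoming = defaultdict(list)
    for row in new_rows:
        incoming[row["race_id"]].append(row)
    updated = dict(table)     # shallow copy so the input table is not mutated
    updated.update(incoming)  # matching partitions are overwritten wholesale
    return updated


# Hypothetical data: races 1 and 2 are already loaded. A later delivery
# re-sends race 2 with a correction and adds race 3 for the first time.
existing = full_refresh([
    {"race_id": 1, "driver": "A", "points": 25},
    {"race_id": 2, "driver": "A", "points": 18},
])
delivery = [
    {"race_id": 2, "driver": "A", "points": 15},
    {"race_id": 3, "driver": "B", "points": 25},
]
result = incremental_load(existing, delivery)
print(sorted(result))  # partition keys stored after the load
```

Because whole partitions are replaced rather than appended to, loading the same delivery twice leaves the table unchanged, which is the main reason this pattern is attractive for reruns.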

Overview

Section 1: Introduction

Lecture 1 Course Introduction

Lecture 2 Course Structure

Lecture 3 Course Resources Download

Lecture 4 Course Slides Download

Section 2: Azure Subscription (Optional)

Lecture 5 Creating Azure Free Account

Lecture 6 Azure Portal Overview

Section 3: Azure Databricks Overview

Lecture 7 Introduction to Azure Databricks

Lecture 8 Creating Azure Databricks Service

Lecture 9 Databricks User Interface Overview

Lecture 10 Azure Databricks Architecture Overview

Section 4: Databricks Clusters

Lecture 11 Section Overview

Lecture 12 Please Read – Important Note for Free and Student Subscription

Lecture 13 Azure Databricks Cluster Types

Lecture 14 Azure Databricks Cluster Configuration

Lecture 15 Creating Azure Databricks Cluster

Lecture 16 Azure Databricks Pricing

Lecture 17 Azure Cost Control

Lecture 18 Azure Databricks Cluster Pool

Lecture 19 Azure Databricks Cluster Policy

Section 5: Databricks Notebooks

Lecture 20 Section Overview

Lecture 21 Azure Databricks Notebooks Introduction

Lecture 22 Magic commands

Lecture 23 Databricks Utilities

Lecture 24 Project Solution Download – Databricks Notebooks

Lecture 25 Project Solution Download – Python/SQL Files

Section 6: Accessing Azure Data Lake from Databricks

Lecture 26 Accessing Azure Data Lake Overview

Lecture 27 Creating Azure Data Lake Storage Gen2

Lecture 28 Azure Storage Explorer Overview

Lecture 29 Access Azure Data Lake using Access Keys

Lecture 30 Access Azure Data Lake using SAS Token

Lecture 31 Access Azure Data Lake using Service Principal

Lecture 32 Cluster Scoped Authentication

Lecture 33 Access Azure Data Lake using Credential Passthrough

Lecture 34 Recommended Approach for Course Project

Section 7: Securing Access to Azure Data Lake

Lecture 35 Securing Secrets Overview

Lecture 36 Creating Azure Key Vault

Lecture 37 Creating Secret Scope

Lecture 38 Databricks Secrets Utility

Lecture 39 Using Secrets to Access Azure Data Lake using notebooks

Lecture 40 Using Secrets to Access Azure Data Lake using notebooks (Assignment)

Lecture 41 Using Secrets Utility in Clusters

Section 8: Mounting Data Lake Container to Databricks

Lecture 42 Section Overview

Lecture 43 Databricks File System (DBFS)

Lecture 44 Databricks Mount overview

Lecture 45 Mounting Azure Data Lake Storage Gen2

Lecture 46 Mounting Azure Data Lake Storage Gen2 (Assignment)

Section 9: Formula1 Project Overview

Lecture 47 Section Overview

Lecture 48 Formula1 Data Overview

Lecture 49 Upload Formula1 Data to Data Lake

Lecture 50 Project Requirement Overview

Lecture 51 Solution Architecture Overview

Section 10: Spark Introduction

Lecture 52 Spark Cluster Architecture

Lecture 53 Dataframe & Data Source API Overview

Section 11: Data Ingestion – CSV

Lecture 54 Data Ingestion Overview

Lecture 55 Circuits File – Requirements

Lecture 56 Circuits File – Dataframe Reader

Lecture 57 Circuits File – Specify Schema

Lecture 58 Circuits File – Select Columns

Lecture 59 Circuits File – WithColumnRenamed

Lecture 60 Circuits File – WithColumn

Lecture 61 Circuits File – Dataframe Writer

Lecture 62 Races File – Requirements

Lecture 63 Races File – Spark Program (Assignment)

Lecture 64 Races File – Partitioning

Section 12: Data Ingestion – JSON

Lecture 65 Constructors File – Requirements

Lecture 66 Constructors File – Read Data

Lecture 67 Constructors File – Transform & Write Data

Lecture 68 Drivers File – Requirements

Lecture 69 Drivers File – Spark Program

Lecture 70 Results File – Requirements

Lecture 71 Results File – Spark Program (Assignment)

Lecture 72 Pitstops File – Requirements

Lecture 73 Pitstops File – Spark Program

Section 13: Data Ingestion – Multiple Files

Lecture 74 Lap Times – Requirements

Lecture 75 Lap Times – Spark Program

Lecture 76 Qualifying – Requirements

Lecture 77 Qualifying – Spark Program (Assignment)

Section 14: Databricks Workflows

Lecture 78 Section Overview

Lecture 79 Including a Child Notebook

Lecture 80 Passing Parameters to Notebooks

Lecture 81 Notebook Workflows

Lecture 82 Databricks Jobs

Section 15: Filter & Join Transformations

Lecture 83 Section Overview

Lecture 84 Filter Transformation

Lecture 85 Join Transformation – Inner Join

Lecture 86 Join Transformation – Outer Join

Lecture 87 Join Transformation – Semi, Anti & Cross Joins

Lecture 88 Join Race Results – Requirement

Lecture 89 Set-up Presentation Layer (Assignment)

Lecture 90 Join Race Results – Solution (Assignment)

Section 16: Aggregations

Lecture 91 Section Overview

Lecture 92 Simple Aggregate functions

Lecture 93 Group By

Lecture 94 Window Functions

Lecture 95 Driver Standings

Lecture 96 Constructor Standings (Assignment)

Section 17: Using SQL in Spark Applications

Lecture 97 Local Temp View

Lecture 98 Global Temp View

Section 18: Spark SQL – Databases/ Tables/ Views

Lecture 99 Spark SQL – Introduction

Lecture 100 Databases

Lecture 101 Managed Tables

Lecture 102 External Tables

Lecture 103 Views

Lecture 104 Formula1 Project SQL Requirement

Lecture 105 Create Table – CSV Source

Lecture 106 Create Table – JSON Source

Lecture 107 Create Table – Multi Files Source

Lecture 108 Create Table – Parquet Source (Processed Data)

Lecture 109 Create Table – Parquet Source (Presentation Data) – Assignment

Section 19: Spark SQL – Filters/ Joins/ Aggregations

Lecture 110 Section Overview

Lecture 111 SQL DML Basics

Lecture 112 SQL Simple Functions

Lecture 113 SQL Aggregates/ Window functions

Lecture 114 SQL Joins

Section 20: Spark SQL – Analysis

Lecture 115 Introduction

Lecture 116 Create Race Results table

Lecture 117 Dominant Drivers – Analysis

Lecture 118 Dominant Teams – Analysis

Lecture 119 Dominant Drivers – Visualisation

Lecture 120 Dominant Teams – Visualisation

Lecture 121 Create dashboards – Drivers

Lecture 122 Create dashboards – Teams

Section 21: Incremental Load

Lecture 123 Section Overview

Lecture 124 Data Loading Design Patterns

Lecture 125 Formula1 Project Scenario

Lecture 126 Formula1 Project Data Set-up

Lecture 127 Full Refresh Implementation

Lecture 128 Incremental Load – Method 1

Lecture 129 Incremental Load – Method 2

Lecture 130 Incremental Load Improvements – Assignment

Lecture 131 Incremental Load Improvements – Solution

Lecture 132 Incremental Load – Notebook Workflows

Lecture 133 Incremental Load – Race Results

Lecture 134 Incremental Load – Driver Standings

Lecture 135 Incremental Load – Constructor Standings (Assignment)

Section 22: Delta Lake

Lecture 136 Section Overview

Lecture 137 Pitfalls of Data Lakes

Lecture 138 Data Lakehouse Architecture

Lecture 139 Read & Write to Delta Lake

Lecture 140 Updates and Deletes on Delta Lake

Lecture 141 Merge/ Upsert to Delta Lake

Lecture 142 History, Time Travel, Vacuum

Lecture 143 Delta Lake Transaction Log

Lecture 144 Convert from Parquet to Delta

Lecture 145 Data Ingestion – Circuits File

Lecture 146 Data Ingestion – Results File

Lecture 147 Data Ingestion – Results File Improvements

Lecture 148 Data Ingestion – All Other Files (Assignment)

Lecture 149 Data Ingestion – Fix Duplicates in Results Data

Lecture 150 Data Transformation – All PySpark Notebooks

Lecture 151 Data Transformation – SQL Notebook

Section 23: Azure Data Factory

Lecture 152 Section Overview

Lecture 153 Azure Data Factory Overview

Lecture 154 Create Azure Data Factory Service

Lecture 155 Azure Data Factory Components

Lecture 156 Create Pipeline – Circuits File Ingestion

Lecture 157 Debugging a Pipeline

Lecture 158 Update Pipeline – Ingest All Other Files

Lecture 159 Improve Pipeline – Handle Missing Files

Lecture 160 Create Pipeline – Transformation Notebooks

Lecture 161 Create ADF Trigger

Section 24: Connect to Other Services

Lecture 162 Power BI

Section 25: Unity Catalog – Introduction

Lecture 163 Section Overview

Lecture 164 Unity Catalog Sections Resource Download

Lecture 165 Unity Catalog Overview

Lecture 166 Unity Catalog Set-up Overview

Lecture 167 Create Unity Catalog Metastore – Prerequisites

Lecture 168 Create Unity Catalog Metastore

Lecture 169 Cluster Configurations for Unity Catalog

Lecture 170 Unity Catalog Object Model Overview

Lecture 171 Unity Catalog Object Model Demo (Data Explorer UI)

Lecture 172 Unity Catalog Object Model Demo (SQL/ Python)

Lecture 173 Accessing External Data Lake Overview

Lecture 174 Create Storage Credential

Lecture 175 Create External Location

Section 26: Unity Catalog – Mini Project

Lecture 176 Project Overview

Lecture 177 Create External Location

Lecture 178 Create Catalogs and Schema

Lecture 179 Create External Tables

Lecture 180 Create Managed Tables

Lecture 181 Create Databricks Workflow

Section 27: Unity Catalog – Key Benefits

Lecture 182 Section Overview

Lecture 183 Data Discovery

Lecture 184 Data Audit

Lecture 185 Data Lineage

Lecture 186 Data Access Control Overview

Lecture 187 Data Access Control Demo

Section 28: Next Steps

Lecture 188 Good Luck

Lecture 189 Bonus Lecture
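Sections 16 and 20 of the outline build driver standings with grouping and window functions. As a rough illustration of that kind of computation, the sketch below groups race results by driver, sums points, counts wins and assigns a rank in which ties share a position. It is plain Python rather than PySpark or Spark SQL. The field names, the sample results and the ordering rule (total points, then wins) are assumptions for the example, not taken from the course.

```python
from collections import defaultdict


def driver_standings(results):
    """Group by driver, aggregate points and wins, then rank the drivers."""
    totals = defaultdict(lambda: {"points": 0, "wins": 0})
    for r in results:
        entry = totals[r["driver"]]
        entry["points"] += r["points"]
        if r["position"] == 1:
            entry["wins"] += 1

    # Order by total points, then wins, both descending (assumed rule).
    ordered = sorted(
        totals.items(),
        key=lambda kv: (-kv[1]["points"], -kv[1]["wins"]),
    )

    standings = []
    previous_key = None
    for index, (driver, entry) in enumerate(ordered):
        key = (entry["points"], entry["wins"])
        # Tied drivers share a rank and the following rank is skipped.
        rank = standings[-1]["rank"] if key == previous_key else index + 1
        standings.append({"driver": driver, "rank": rank, **entry})
        previous_key = key
    return standings


# Hypothetical results for two races.
results = [
    {"race_id": 1, "driver": "A", "position": 1, "points": 25},
    {"race_id": 1, "driver": "B", "position": 2, "points": 18},
    {"race_id": 1, "driver": "C", "position": 3, "points": 15},
    {"race_id": 2, "driver": "B", "position": 1, "points": 25},
    {"race_id": 2, "driver": "A", "position": 2, "points": 18},
    {"race_id": 2, "driver": "C", "position": 3, "points": 15},
]
standings = driver_standings(results)
for row in standings:
    print(row["rank"], row["driver"], row["points"], row["wins"])
```

In Spark the same shape would typically be a group-by aggregation followed by a ranking window function over the aggregated rows; the tie handling here mirrors a rank that leaves gaps rather than a dense rank.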

Who this course is for

University students looking for a career in Data Engineering
IT developers working in other disciplines trying to move to Data Engineering
Data Engineers/ Data Warehouse Developers currently working on on-premises technologies, or other cloud platforms such as AWS or GCP, who want to learn Azure Data Technologies
Data Architects looking to gain an understanding of the Azure Data Engineering stack

Course Information:

Udemy | English | 19h 54m | 7.72 GB
Created by: Ramesh Retnasamy
