Mastering AWS Elastic Map Reduce EMR for Data Engineers

Build PySpark and Spark SQL Applications on AWS EMR, Orchestrate using Step Functions, Manage EMR using Boto3 and more
File Size: 5.54 GB
Total length: 11h 18m
Instructor: Durga Viswanatha Raju Gadiraju
Language: English
Last update: 8/2022
Ratings: 4.3/5

What you’ll learn

Creating Clusters using AWS Elastic Map Reduce Web Console
Set up Remote Application Development using AWS Elastic Map Reduce (EMR) and Visual Studio Code
Develop and Validate Simple Spark Application using Visual Studio Code and AWS Elastic Map Reduce (EMR)
Deploy Spark Application as Step to AWS Elastic Map Reduce (EMR)
Manage AWS Elastic Map Reduce (EMR) based Pipelines using Boto3 and Python
Build End to End AWS Elastic Map Reduce (EMR) based Pipelines using AWS Step Functions
Develop Applications using Spark SQL on AWS EMR Cluster
Build State Machine or Pipeline using AWS Step Functions using Spark SQL Script on AWS EMR Cluster
Understand how to pass parameters to Spark SQL Scripts deployed on EMR
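As a rough illustration of the last few items, the sketch below builds the step definitions that Boto3 expects when a Spark application or a Spark SQL script is deployed as a step. It is a minimal sketch under assumptions, not the course's own code: the helper names, step names, bucket, and script paths are hypothetical. It relies on `command-runner.jar` (the EMR launcher for CLI commands) and on the `-f` and `-d key=value` options of the `spark-sql` CLI, where a variable defined with `-d` is referenced in the script as `${bucket_name}`.

```python
from typing import Dict, List


def spark_submit_step(name: str, entry_point: str, py_files: str,
                      app_args: List[str]) -> Dict:
    """Build an EMR step that runs a PySpark application in cluster deploy mode."""
    args = [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--py-files", py_files,   # zip file holding the application modules
        entry_point,              # driver script, e.g. an app.py stored in s3
        *app_args,
    ]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def spark_sql_step(name: str, script_uri: str, variables: Dict[str, str]) -> Dict:
    """Build an EMR step that runs a Spark SQL script with runtime parameters."""
    args = ["spark-sql", "-f", script_uri]
    for key, value in variables.items():
        # Each -d key=value pair becomes ${key} inside the SQL script.
        args += ["-d", f"{key}={value}"]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def submit_steps(cluster_id: str, steps: List[Dict], emr_client=None) -> List[str]:
    """Submit steps to a running cluster and return the new step ids."""
    if emr_client is None:
        import boto3  # imported lazily so the builders work without AWS access
        emr_client = boto3.client("emr")
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
    return response["StepIds"]
```

A call such as `spark_sql_step("daily-revenue", "s3://example-bucket/scripts/daily_revenue.sql", {"bucket_name": "example-bucket"})` yields a dictionary that can be passed to `submit_steps` together with the id of an existing cluster.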

Requirements

A computer science or IT Degree or 1 or 2 years of IT Experience
Basic Linux Skills with ability to run commands using Terminal
Programming Skills using Python are required
Valid AWS Account to use the AWS Services to learn how to build Data Pipelines using AWS Lambda Functions

Description

AWS Elastic Map Reduce (EMR) is one of the key AWS Services used for building large-scale data processing solutions leveraging Big Data Technologies such as Apache Hadoop, Apache Spark, Hive, etc. As part of this course, you will learn AWS Elastic Map Reduce (EMR) by building end-to-end data pipelines leveraging Apache Spark and AWS Step Functions.

Here is the detailed outline of the course.

First, you will learn how to Get Started with AWS Elastic Map Reduce (EMR) by understanding how to use the AWS Web Console to create and manage EMR Clusters. You will also learn about the key features of the Web Console, how to connect to the master node of the cluster, and how to validate the important CLI interfaces such as spark-shell, pyspark, hive, etc., as well as the hdfs and aws CLI commands.

Once you understand how to get started with AWS EMR, you will go through the details related to Setting up Development Cluster using AWS EMR. There are quite a few advantages to using AWS EMR Clusters for development purposes and most enterprises do so.

After setting up a development cluster using AWS EMR, you will go through the Development Life Cycle of Spark Applications using AWS EMR Development Cluster. You will be using Visual Studio Code Remote Development on top of the AWS EMR Development Cluster to go through the details.

Once the development is done, you will go through the details related to Deploying Spark Application on AWS EMR Cluster. You will build the zip file and understand how to run it using the CLI in both client and cluster deployment modes. You will also understand how to deploy the Spark application as a step on AWS EMR Clusters, and how to troubleshoot issues related to Spark Applications by going through the relevant logs.

Typically we run Spark Applications programmatically. After going through the details related to deploying Spark applications on AWS EMR Clusters, you will learn how to Manage AWS EMR Clusters using Python Boto3. You will learn not only how to create clusters programmatically but also how to deploy Spark Applications as Steps programmatically using Python Boto3.

End to End Data Pipelines using AWS EMR are built using AWS Step Functions. Once you understand how to manage EMR Clusters using Python Boto3 and how to deploy Spark Applications on EMR Clusters using the same, it is important to learn how to Build EMR-based Workflows or Pipelines using AWS Step Functions. You will learn how to create the cluster, deploy a Spark Application as a Step on the cluster, and then terminate the cluster as part of a basic pipeline or State Machine using AWS Step Functions.

You will also learn how to perform validations as part of State Machines by Enhancing AWS EMR-based State Machine or Pipeline. As part of the validations, you will check whether the specified files already exist.

We can also build Data Processing Applications or Pipelines using Spark SQL on AWS EMR. First, you will learn how to design and develop solutions using a Spark SQL Script and how to validate it by running the appropriate commands with the relevant runtime arguments.

Once you understand the development process of implementing solutions using Spark SQL on AWS EMR, you will learn how to build a Data Pipeline using AWS Step Functions to deploy the Spark SQL Script on an EMR Cluster. You will also learn the concept of Boto3 Waiters to make sure the steps are executed one after another.
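The basic pipeline described above (create the cluster, run a Spark application as a step, terminate the cluster) could be sketched as a Step Functions state machine definition like the one below. This is a hedged illustration rather than the course's own definition. The three `elasticmapreduce` resource ARNs are the Step Functions service integrations for EMR. The cluster name, release label, instance types, default role names, and the `step_args` input field are assumptions chosen for the example.

```python
import json
from typing import Dict


def build_emr_pipeline_definition(release_label: str = "emr-6.7.0") -> Dict:
    """Return a state machine definition: create cluster, add step, terminate."""
    return {
        "Comment": "Sketch of an EMR pipeline (all names are illustrative)",
        "StartAt": "CreateCluster",
        "States": {
            "CreateCluster": {
                "Type": "Task",
                # .sync makes the state wait until the cluster is ready
                "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
                "Parameters": {
                    "Name": "demo-cluster",
                    "ReleaseLabel": release_label,
                    "Applications": [{"Name": "Spark"}],
                    "ServiceRole": "EMR_DefaultRole",
                    "JobFlowRole": "EMR_EC2_DefaultRole",
                    "Instances": {
                        "KeepJobFlowAliveWhenNoSteps": True,
                        "InstanceGroups": [
                            {"Name": "Master", "InstanceRole": "MASTER",
                             "InstanceType": "m5.xlarge", "InstanceCount": 1},
                            {"Name": "Core", "InstanceRole": "CORE",
                             "InstanceType": "m5.xlarge", "InstanceCount": 2},
                        ],
                    },
                },
                "ResultPath": "$.cluster",  # keeps ClusterId for later states
                "Next": "RunSparkStep",
            },
            "RunSparkStep": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.cluster.ClusterId",
                    "Step": {
                        "Name": "spark-app",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            # step arguments come from the execution input
                            "Args.$": "$.step_args",
                        },
                    },
                },
                "ResultPath": "$.step",
                "Next": "TerminateCluster",
            },
            "TerminateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
                "Parameters": {"ClusterId.$": "$.cluster.ClusterId"},
                "End": True,
            },
        },
    }


definition_json = json.dumps(build_emr_pipeline_definition(), indent=2)
```

The resulting JSON string can be pasted into the Step Functions console or passed as the `definition` argument of the Boto3 `create_state_machine` call. The role attached to the state machine needs permissions to create, modify, and terminate EMR clusters.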

Overview

Section 1: Introduction to Mastering AWS Elastic Map Reduce for Data Engineers

Lecture 1 Introduction to Mastering AWS Elastic Map Reduce for Data Engineers

Section 2: Getting Started on Windows with Required Tools

Lecture 2 Overview of PowerShell on Windows 10 or Windows 11

Lecture 3 Install Visual Studio Code on Windows

Lecture 4 Install Remote Development Extension Kit for Visual Studio Code

Section 3: Getting Started with AWS EMR

Lecture 5 Planning of EMR Cluster

Lecture 6 Create EC2 Key Pair

Lecture 7 Setup EMR Cluster with Spark

Lecture 8 Understanding Summary of AWS EMR Cluster

Lecture 9 Review EMR Cluster Application User Interfaces

Lecture 10 Review EMR Cluster Monitoring

Lecture 11 Review EMR Cluster Hardware and Cluster Scaling Policy

Lecture 12 Review EMR Cluster Configurations

Lecture 13 Review EMR Cluster Events

Lecture 14 Review EMR Cluster Steps

Lecture 15 Review EMR Cluster Bootstrap Actions

Lecture 16 Connecting to EMR Master Node using SSH

Lecture 17 Disabling Termination Protection and Terminating the Cluster

Lecture 18 Clone and Create New Cluster

Lecture 19 Listing AWS S3 Buckets and Objects using AWS CLI on EMR Cluster

Lecture 20 Listing AWS S3 Buckets and Objects using HDFS CLI on EMR Cluster

Lecture 21 Managing Files in AWS s3 using HDFS CLI on EMR Cluster

Lecture 22 Review Glue Catalog Databases and Tables

Lecture 23 Accessing Glue Catalog Databases and Tables using EMR Cluster

Lecture 24 Accessing spark-sql CLI of AWS EMR Cluster

Lecture 25 Accessing pyspark CLI of AWS EMR Cluster

Lecture 26 Accessing spark-shell CLI of AWS EMR Cluster

Lecture 27 Create AWS EMR Cluster for Notebooks

Section 4: Setup Development Cluster using AWS EMR

Lecture 28 Create bootstrap script for AWS EMR Cluster

Lecture 29 Provision Elastic IP for Master Node of AWS EMR Cluster

Lecture 30 Create AWS EMR for Development

Lecture 31 Troubleshooting Issues related to Bootstrap of EMR Cluster

Lecture 32 Fix Bootstrap Script for AWS EMR Cluster

Lecture 33 Validate AWS EMR Cluster with Bootstrap Action with updated script

Lecture 34 Setup Python Virtual Environment as part of VS Code Workspace

Lecture 35 Getting Started with Boto3 to Manage AWS EMR Clusters

Lecture 36 Setup boto3 to explore APIs to manage AWS EMR Clusters

Lecture 37 Set AWS Profile using env file in Visual Studio Code

Lecture 38 Get Cluster Details of AWS EMR Development Cluster using boto3

Lecture 39 Getting Instance Id of the Master Node of AWS EMR Cluster using boto3

Lecture 40 Getting Allocation Id of the Elastic Ip using AWS boto3

Lecture 41 Associating Elastic Ip with AWS EMR Master Node using Boto3

Lecture 42 Setup Notebook Environment for EMR Cluster using IAM User

Section 5: Development Life Cycle using AWS EMR Development Cluster

Lecture 43 Open Remote Window on AWS EMR Master Node using VS Code

Lecture 44 Setup Workspace on AWS EMR Master using Git Repository

Lecture 45 Best Practices and Advantages of using AWS EMR Cluster for Team Development

Lecture 46 Install VSCode Extensions in remote Workspace for Python

Lecture 47 Review Python and Pyspark details on EMR Cluster

Lecture 48 Running Applications using local and yarn during development

Lecture 49 Getting Started with Development of Spark Applications on EMR Cluster

Lecture 50 Create Function for Spark Session

Lecture 51 Upload Files to AWS s3 for the development using AWS EMR Cluster

Lecture 52 Develop read logic for the Spark Application

Lecture 53 Process Data Frame using Spark APIs

Lecture 54 Write Data to Files using Spark APIs

Lecture 55 Productionize the Code and setup required data sets for validation

Lecture 56 Resize the AWS EMR Cluster using Web Console

Lecture 57 Validate Changes to productionize the Application Code

Lecture 58 Take the backup and terminate the cluster

Section 6: Deploy Spark Application on AWS EMR Cluster

Lecture 59 Recreate the AWS EMR Cluster to deploy Spark Applications

Lecture 60 Setup Code Repository on the AWS EMR Master Node

Lecture 61 Resize the AWS EMR Cluster to validate application on larger data sets

Lecture 62 Build Zip File for the Spark Application

Lecture 63 Validate the Spark Application using zip file and client as deploy mode

Lecture 64 Run Spark Application on EMR using Cluster Deployment Mode

Lecture 65 Run Spark Application copied to s3 on EMR using Cluster Deployment Mode

Lecture 66 Deploy Spark Application as Step to the AWS EMR Cluster

Lecture 67 Setup Multiple Files to Manage AWS s3 Objects using State Machines

Lecture 68 Validate Spark Application Deployed as Step on AWS EMR Cluster

Section 7: Manage AWS EMR Clusters using Python Boto3

Lecture 69 Update Material related to Managing AWS EMR using Boto3

Lecture 70 Create AWS EMR Cluster using AWS CLI Command

Lecture 71 Manage AWS EMR Clusters using AWS CLI Commands

Lecture 72 Overview of AWS boto3 to Manage AWS EMR Clusters

Lecture 73 Overview of Run Job Flow API to create AWS EMR Cluster

Lecture 74 Create AWS EMR Cluster or Job Flow Cluster using AWS Boto3

Lecture 75 Prepare Data Sets to add Spark Application as Step to AWS EMR Cluster

Lecture 76 Add Spark Application as Step to AWS EMR Cluster using Boto3

Lecture 77 Exercise to add Spark Application as Step to EMR Cluster using boto3

Lecture 78 Terminate the AWS EMR Cluster used for adding Steps

Lecture 79 Exercise to Create AWS EMR Cluster with Steps for Spark Application

Section 8: Build EMR based Workflows or Pipelines using AWS Step Functions

Lecture 80 Review of Development Environment for AWS Step Functions and EMR

Lecture 81 Quick Overview of Important Terms of AWS Step Functions

Lecture 82 Getting Started with EMR based Pipeline using AWS Step Functions

Lecture 83 Overview of AWS IAM Role associated with State Machine copy

Lecture 84 Overview of Creating EMR Cluster using AWS Step Functions

Lecture 85 Parameters to Create EMR Cluster using AWS Step Functions

Lecture 86 Attach Permissions to Step Function Role to Create AWS EMR Cluster

Lecture 87 Add Step to AWS EMR Cluster using AWS Step Function

Lecture 88 Validate Adding Step to AWS EMR Cluster using Step Functions

Lecture 89 Add Action to State Machine to Terminate the AWS EMR Cluster

Lecture 90 Validate the execution of State Machine to run Spark Application on AWS EMR

Lecture 91 Terminate AWS EMR Clusters Created to Validate State Machine copy

Section 9: Develop State Machine using AWS Step Functions to manage s3

Lecture 92 Review the current state of AWS EMR based Pipeline or State Machine copy

Lecture 93 Create State Machine using AWS Step Function to Validate s3 copy

Lecture 94 Attach Policy with Permissions on AWS s3 to Step Function Role copy

Lecture 95 Setup File in AWS s3 and Validate State Machine to list objects copy

Lecture 96 Relationship between AWS Boto3 and Actions in Step Functions copy

Lecture 97 Add State to Delete Object from AWS s3 copy

Lecture 98 Fix Permissions and Run State Machine to Delete Object from AWS s3 copy

Lecture 99 Passing Input to States in AWS Step Functions State Machine copy

Lecture 100 Setup Multiple Files to Manage AWS s3 Objects using State Machines copy

Lecture 101 Process AWS s3 Objects using Map in State Machine

Lecture 102 Extract Key of AWS s3 Objects using Step Functions Pass

Lecture 103 Add State to AWS Step Function Delete s3 Object

Lecture 104 Develop AWS Lambda Function to customise State Machine Data

Lecture 105 Add AWS Lambda Function to State Machine to Pass s3 Details for delete

Lecture 106 Add Condition to State Machine to avoid Key Error on AWS s3 List Objects

Lecture 107 Overview of Map Concurrency in State Machines of AWS Step Functions

Lecture 108 Invoking AWS Step Function State Machine from Other State Machines

Lecture 109 Overview of integration of s3 based State Machine with EMR State Machine

Section 10: Adding s3 Validation Logic to AWS EMR based State Machine

Lecture 110 Taking back up of AWS Step Functions State Machines

Lecture 111 Grant Permissions between AWS Step Functions State Machines via IAM Role

Lecture 112 Update AWS Step Function State Machine with EMR to validate s3

Lecture 113 Pass EMR Step Details to AWS Step Functions State

Lecture 114 Validate AWS Step Function EMR based State Machine Execution

Lecture 115 Run AWS Step Function State Machine to validate logic to delete AWS s3 Objects

Lecture 116 Exercise to add validation of source s3 location in AWS Step Function StateMach

Lecture 117 Update AWS Step Function State Machine to Validate Source s3 Location

Lecture 118 Run AWS Step Function State Machine with source s3 Validation Logic

Lecture 119 Develop AWS Lambda Function to check number of files in source s3

Lecture 120 Attach Policy to State Machine Role to Invoke AWS Lambda Function

Lecture 121 Run Updated State Machine to validate source count

Lecture 122 Best Practices to Run AWS Step Functions State Machines

Section 11: Develop Applications using Spark SQL on AWS EMR Cluster

Lecture 123 Setup AWS EMR Cluster to develop applications using Spark SQL

Lecture 124 Setup Visual Studio Code Workspace using AWS EMR Master Node

Lecture 125 Update PYTHONPATH to access Pyspark Libraries or Modules on AWS EMR Master Node

Lecture 126 Setup Required Data Sets for Spark SQL

Lecture 127 Upload Retail DB Files to AWS s3 using AWS CLI commands

Lecture 128 Getting Started with Spark SQL and Temporary Views using Spark SQL on AWS EMR C

Lecture 129 Create Spark SQL Temporary Views for Orders and Order Items

Lecture 130 Join and Aggregate using Spark SQL on AWS EMR Cluster

Lecture 131 Write Query Results back to AWS s3 using Spark SQL on AWS EMR Cluster

Lecture 132 Develop Script using Spark SQL Commands

Lecture 133 Parameterize Bucket Name in Spark SQL Script

Lecture 134 Deploy Spark SQL Script in s3 and Run using CLI on AWS EMR Master Node

Lecture 135 Deploy Spark SQL Script as Step on AWS EMR Cluster

Lecture 136 Conclusion to Develop Spark SQL Applications on EMR Cluster

Section 12: Develop AWS Step Function to deploy Spark SQL Script on EMR Cluster

Lecture 137 Create State Machine to Deploy Spark SQL Script on AWS EMR Cluster

Lecture 138 Overview of Managing AWS EMR Clusters using Boto3

Lecture 139 Overview of AWS boto3 to Manage AWS EMR Clusters

Lecture 140 Create AWS EMR Job Flow Cluster using Python Boto3

Lecture 141 Add Spark SQL Script as Step to AWS EMR Cluster using Boto3

Lecture 142 Overview of AWS EMR Waiters using Python Boto3

Lecture 143 Terminate AWS EMR Cluster using waiters and Python Boto3

Lecture 144 Overview of AWS Step Functions State Machine to execute Spark SQL on EMR

Lecture 145 Create State Machine using AWS Step Function to create EMR Cluster

Lecture 146 Grant Permissions to State Machine via Role to Create AWS EMR Cluster

Lecture 147 Add Spark SQL Script as Step to AWS EMR Cluster using AWS Step Functions

Lecture 148 Add Terminate AWS EMR Cluster Step to AWS Step Functions State Machine

Lecture 149 Pass AWS EMR Step Details as Input to State Machine at Execution Time

Lecture 150 Validate Spark SQL Script Execution as AWS EMR Step using State Machine
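The waiter-based flow covered in the last section can be sketched roughly as follows. This is an illustrative sketch, not the instructor's code: the function name is hypothetical, and the EMR client is passed in so the logic can be exercised without AWS access. `add_job_flow_steps`, `terminate_job_flows`, and the `step_complete` and `cluster_terminated` waiters are part of the Boto3 EMR client. Termination protection is assumed to be disabled on the cluster.

```python
from typing import Dict


def run_step_then_terminate(emr_client, cluster_id: str, step: Dict) -> str:
    """Add one step, block until it completes, then terminate the cluster."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    step_id = response["StepIds"][0]
    try:
        # Blocks until the step completes; raises if it fails or is cancelled.
        emr_client.get_waiter("step_complete").wait(
            ClusterId=cluster_id, StepId=step_id
        )
    finally:
        # Terminate even when the step failed, so the cluster stops billing.
        emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
        emr_client.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)
    return step_id


# Usage with a real client (requires AWS credentials and an existing cluster;
# the cluster id and step below are placeholders):
#   import boto3
#   emr = boto3.client("emr")
#   run_step_then_terminate(emr, "j-XXXXXXXXXXXXX", step_definition)
```

Because each waiter call blocks until its condition is met, the step submission, the wait, and the termination run strictly one after another.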

Who this course is for

University Students who want to learn AWS Elastic Map Reduce to process heavy volumes of data with hands-on and real-time examples
Aspiring Data Engineers and Data Scientists who want to master building data pipelines using AWS Elastic Map Reduce for large-scale Data Processing
Experienced Application Developers who would like to explore how to build end-to-end Data Pipelines using Python and AWS Services such as AWS Elastic Map Reduce
Experienced Data Engineers who want to build end-to-end data pipelines using Python and AWS Elastic Map Reduce
Any IT Professional who is keen to deep dive into AWS Elastic Map Reduce (EMR) for heavyweight Data Processing

Course Information:

Udemy | English | 11h 18m | 5.54 GB
Created by: Durga Viswanatha Raju Gadiraju
