The Ultimate Web Scraping With Python Bootcamp 2023

Learn to extract data from the web with python in just one course, covering selectolax, playwright, scrapy and more
File size: 6.77 GB
Total length: 17h 30m
Instructor: Andy Bek
Language: English
Last update: 2/2023
Ratings: 0/5

What you’ll learn

Understand the fundamentals of web scraping in python from absolute scratch
Scrape information from static and dynamic websites and extract it to a variety of formats
Intercept and emulate hidden APIs to identify highly productive alternatives to getting your data
Master the requests library for working with HTTP
Parse and extract content from HTML using beautifulsoup, selectolax, and Microsoft Playwright
Master complex CSS selectors including descendant, child, sibling combinators
Understand how the web works, including HTTP, HTML, CSS, and JavaScript
Create scrapy crawlers and practice items, itemloaders and custom pipelines
Integrate scrapy with playwright for highly performant, fine-tuned dynamic website crawling
Practice processing and extracting data to a variety of formats including csv, json, xml, and SQL
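
As a rough illustration of the last point, exporting scraped records to csv and json could look like the sketch below. It uses only the python standard library; the records and the helper names `to_csv` and `to_json` are hypothetical and are not taken from the course materials.

```python
import csv
import io
import json

# Hypothetical records, standing in for data a scraper has already extracted.
records = [
    {"name": "Corner Cafe", "city": "Lisbon"},
    {"name": "Bean There", "city": "Porto"},
]


def to_csv(rows):
    """Serialize a list of dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same rows to indented JSON text."""
    return json.dumps(rows, indent=2)


csv_text = to_csv(records)
json_text = to_json(records)
```

Writing to an in-memory buffer keeps the sketch free of file paths; swapping `io.StringIO()` for an opened file would write the same output to disk.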

Requirements

No programming experience needed – I’ll teach you everything you need to know
No paid software required – we’ll be using open-source python libraries
A computer with access to the internet
Prepare to learn real skills you can put into practice right away

Description

Welcome to the Ultimate Web Scraping With Python Bootcamp, the only course you need to go from a complete beginner in python to a very competent web scraper.

Web scraping is the process of programmatically extracting data from the web. Scraping agents visit a web resource, extract content from it, and then process the resulting data to pull out the specific information of interest. Scraping is the kind of programming skill that offers immediate feedback, and it can be used to automate a wide variety of data collection and processing tasks.

Over the next 17+ hours, we will methodically cover everything you need to know to write web scraping agents in python. This bootcamp is organized in three parts of increasing difficulty, designed to help you progressively build your skills.

Part I – Begin

We’ll start by understanding how the web works by taking a closer look at HTTP, the key application layer communication protocol of the modern web. Next, we’ll explore HTML, CSS, and JavaScript from first principles to get a deeper understanding of how websites are built. Finally, we’ll learn how to use python to send HTTP requests and parse the resulting HTML, CSS, and JavaScript to extract the data we need. Our goal in the first part of the course is to build a solid foundation in both web scraping and python, and put those skills into practice by building functional web scrapers from scratch. Selected topics include:

a detailed overview of the request-response cycle
understanding user-agents, HTTP verbs, headers and statuses
understanding why custom headers can often be used to bypass paywalls
mastering the requests library to work with HTTP in python
what stateless means and how cookies work
exploring the role of proxies in modern web architectures
mastering beautifulsoup for parsing and data extraction

Part II – Refine

In the second part of the course, we’ll build on the foundation we’ve already laid to explore more advanced topics in web scraping. We’ll learn how to scrape dynamic websites that use JavaScript to render their content, by setting up Microsoft Playwright as a headless browser to automate this process. We’ll also learn how to identify and emulate API calls to scrape data from websites that don’t have formally public APIs. Our projects in this section will include an image scraper that can download a set number of high-resolution images given some keyword, as well as another scraping agent that extracts the price and content of discounted video games from a dynamically rendered website. Topics include:

identifying and using hidden APIs and understanding the benefits they offer
emulating headers, cookies, and body content with ease
automatically generating python code from intercepted API requests using postman and httpie
working with the highly performant selectolax parsing library
mastering CSS selectors
introducing Microsoft Playwright for headless browsing and dynamic rendering

Part III – Master

In the final part of the course, we’ll introduce scrapy. This will give us an excellent, time-tested framework for building more complex and robust web scrapers. We’ll learn how to set up scrapy within a virtual environment and how to create spiders and pipelines to extract data from websites in a variety of formats. Having learned how to use scrapy, we’ll then explore how to integrate it with Playwright so that we can tackle the challenge of scraping dynamic websites from right within scrapy. We’ll conclude this section by building a scraping agent that executes custom JavaScript code before returning the resulting HTML to scrapy. Some topics from this section:

learning how to set up scrapy and explore its command line interface (“the scrapy tool”)
dynamically exploring response objects using scrapy shell
understanding and defining item schemas and loading data using itemloaders and input/output processors
integrating Playwright into scrapy to tackle dynamically rendered JavaScript sites
writing PageMethods to give highly specific instructions to the headless browser from right within scrapy
defining custom pipelines for saving into SQL databases and highly customized output formats

In this bootcamp, I will take you step-by-step through engaging video lectures and teach you everything you need to know to get started with web scraping in python. By the end of this course, you will have a complete toolset to conceptualize and implement scraping agents for any website you can imagine.

See you inside!
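
To make the "visit, extract, process" idea from the description concrete, here is a minimal sketch. It is not the course's code: the course works with requests, beautifulsoup, selectolax and scrapy, while this sketch sticks to the standard library's `html.parser` so it runs without installing anything. The sample HTML, the `game` and `price` class names, and the names `GameParser` and `extract_games` are all hypothetical, and the "visit" step is replaced by an inline string standing in for a fetched response body.

```python
from html.parser import HTMLParser

# Hypothetical sample page, standing in for the body of an HTTP response.
SAMPLE_HTML = """
<html><body>
  <div class="game"><h2>Alpha Quest</h2><span class="price">$9.99</span></div>
  <div class="game"><h2>Beta Racer</h2><span class="price">$14.50</span></div>
</body></html>
"""


class GameParser(HTMLParser):
    """Collect title and price text from <h2> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self._field = None   # field that the current text node belongs to
        self._current = {}   # record being assembled
        self.records = []    # finished records

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._field = "title"
        elif tag == "span" and dict(attrs).get("class") == "price":
            self._field = "price"

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag in ("h2", "span"):
            self._field = None
        elif tag == "div" and self._current:
            # A closing </div> ends one game container.
            self.records.append(self._current)
            self._current = {}


def extract_games(html):
    """Extract raw records, then process the price text into a float."""
    parser = GameParser()
    parser.feed(html)
    return [
        {"title": r["title"], "price": float(r["price"].lstrip("$"))}
        for r in parser.records
    ]


games = extract_games(SAMPLE_HTML)
```

In a real scraper the HTML would come from an HTTP request rather than a constant, and a dedicated parsing library with CSS selector support would likely replace the hand-written parser class.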

Overview

Section 1: Introduction

Lecture 1 Prerequisites

Lecture 2 A Useful Mental Model

Lecture 3 All Code Resources

Section 2: The HTTP Protocol

Lecture 4 What Is HTTP?

Lecture 5 The Request-Response Cycle

Lecture 6 Extra: But, This Website Remembers Me

Lecture 7 User-Agents

Lecture 8 HTTP Verbs

Lecture 9 Status Codes

Lecture 10 Headers

Lecture 11 Extra: Headers Do Lie

Lecture 12 Proxies

Section 3: HTML, CSS, And JavaScript

Lecture 13 The Ingredients

Lecture 14 Markup

Lecture 15 Attributes

Lecture 16 Presentation

Lecture 17 Some More Rules

Lecture 18 Behaviour

Lecture 19 More JavaScript

Lecture 20 JavaScript In Web Scraping

Lecture 21 Comments

Lecture 22 Embedded

Section 4: Web Requests In Python

Lecture 23 Urllib

Lecture 24 Requests

Lecture 25 Setting Headers

Lecture 26 Query Parameters

Lecture 27 Authentication And Authorization

Lecture 28 Aside From GET

Lecture 29 POSTing Data

Section 5: Parsing And Extraction

Lecture 30 BeautifulSoup

Lecture 31 Tags

Lecture 32 Parents, Children, And Descendants

Lecture 33 Siblings

Lecture 34 Extracting Text

Lecture 35 All Strings

Lecture 36 Search

Lecture 37 Challenge

Lecture 38 Solution

Lecture 39 Solution Refinement

Lecture 40 An Extra: pandas

Lecture 41 Functional Search Patterns

Lecture 42 Text Search

Lecture 43 Searching By CSS

Lecture 44 Just One Tag

Section 6: Project 1 – Portfolio Valuation With Google Finance

Lecture 45 Scope Statement

Lecture 46 An Extra: Some Finance Concepts

Lecture 47 Parsing Price

Lecture 48 Non-USD Prices

Lecture 49 Adding Structure With Dataclasses

Lecture 50 Position And Portfolio

Lecture 51 Tabular Display

Section 7: APIs: The Hidden Gems

Lecture 52 Befriend The Network Tab

Lecture 53 Case Study: Coffee Shop Locations

Lecture 54 The Advantages Of APIs

Lecture 55 Full Header Emulation

Lecture 56 An Extra: Postman

Lecture 57 Code Generation

Lecture 58 Challenge

Lecture 59 Solution: Interacting With The API

Lecture 60 Solution: Processing The Data

Lecture 61 Solution: Adding Geocode

Section 8: Selectolax And Advanced CSS Selectors

Lecture 62 Introduction

Lecture 63 What Is selectolax?

Lecture 64 CSS Combinators

Lecture 65 Sibling Combinators

Lecture 66 Selector Types

Section 9: Project 2 – Image Scraper

Lecture 67 Scope Statement

Lecture 68 Prospecting

Lecture 69 Scraping HTML

Lecture 70 Filtering Relevant URLs

Lecture 71 Extracting High-Res Image URLs

Lecture 72 Saving The Images

Lecture 73 Stepping It Up With Logging

Lecture 74 Back To The API

Lecture 75 Filtered Canonical URLs

Lecture 76 Pagination Prospecting

Lecture 77 Wrapping Up

Section 10: Tackling JavaScript With Microsoft Playwright

Lecture 78 What You See vs. What You Get

Lecture 79 Rendering JavaScript

Lecture 80 Playwright Over Selenium

Lecture 81 Case Study: Show Me The Money

Section 11: Project 3 – Building A Configurable Scraping Pipeline

Lecture 82 Scope Statement

Lecture 83 Initial Setup

Lecture 84 Fully Loaded Site

Lecture 85 Selecting Game Containers

Lecture 86 More Robust Render Thresholds

Lecture 87 Extracting Title And Thumbnail

Lecture 88 Game Category Tags

Lecture 89 Release Date And Reviews

Lecture 90 Original And Discount Price

Lecture 91 Refactoring

Lecture 92 Introducing Config

Lecture 93 Configuration Integrated

Lecture 94 Parsing Pipeline

Lecture 95 Parameterized Extraction

Lecture 96 Functional Post-Processing

Lecture 97 Date Formatting

Lecture 98 Regular Expressions

Lecture 99 Saving To Disk

Lecture 100 Integrating HTMLParser With The Generic Parser

Lecture 101 Finishing Touches

Section 12: The Scrapy Framework

Lecture 102 Introduction

Lecture 103 Virtual Environments And Scrapy

Lecture 104 First Project And Spider

Lecture 105 Scraping Elements

Lecture 106 Extracting Specific Attributes

Lecture 107 An Extra: Scrapy Shell

Lecture 108 Rewriting Using XPath Selectors

Lecture 109 Outputting Data

Lecture 110 Defining Scrapy Items

Lecture 111 Introducing Itemloaders

Lecture 112 Fine-Tuned Post-Processing

Lecture 113 Pipelined Data Validation

Lecture 114 Saving To Databases

Lecture 115 Challenge

Lecture 116 Solution: Defining NoDuplicateCountryPipeline

Section 13: Boosting Scrapy With scrapy-playwright

Lecture 117 The JavaScript Wrench In The Works

Lecture 118 Integrating scrapy-playwright

Lecture 119 PageMethods

Lecture 120 Pagination And Infinite Scroll

Lecture 121 Playwright, Do This

Lecture 122 Improved Snippet As PageMethod

Lecture 123 Scraping Location, Department, And Posted Date

Section 14: Project 4 – Scraping Dynamic Sites With Scrapy And Playwright

Lecture 124 Scope Statement

Lecture 125 New Project And Spider

Lecture 126 Item And Itemloading

Lecture 127 Pipelining To Database

Lecture 128 Quick Fix

Lecture 129 Grouped Elements JSON Export

Section 15: Closing Thoughts

Lecture 130 Try To Respect robots.txt

Lecture 131 Thank You

Lecture 132 My Other Courses

Section 16: Appendix – Python Fundamentals

Lecture 133 A Quick Note + Section Resources

Lecture 134 Data Types

Lecture 135 Variables

Lecture 136 Arithmetic And Augmented Assignment Operators

Lecture 137 Ints And Floats

Lecture 138 Booleans And Comparison Operators

Lecture 139 Strings

Lecture 140 Methods

Lecture 141 Containers I – Lists

Lecture 142 Lists vs. Strings

Lecture 143 List Methods And Functions

Lecture 144 Containers II – Tuples

Lecture 145 Containers III – Sets

Lecture 146 Containers IV – Dictionaries

Lecture 147 Dictionary Keys And Values

Lecture 148 Membership Operators

Lecture 149 Controlling Flow With if, else, And elif

Lecture 150 Truth Value Of Non-Booleans

Lecture 151 For Loops

Lecture 152 The range() Immutable Sequence

Lecture 153 While Loops

Lecture 154 Break And Continue

Lecture 155 Zipping Iterables

Lecture 156 List Comprehensions

Lecture 157 Defining Functions

Lecture 158 Function Arguments: Positional vs Keyword

Lecture 159 Lambdas

Lecture 160 Importing Modules

Who this course is for

Anyone who wants to learn how to collect data from the web programmatically
Students with or without web scraping experience looking to level up
Complete beginners with no experience

Course Information:

Udemy | English | 17h 30m | 6.77 GB
Created by: Andy Bek
