Data Cleaning in Python

Preprocessing, structuring and normalizing data
File Size :
1.69 GB
Total length :
5h 43m


Taimoor khan



What you’ll learn

Data cleaning or cleansing as a preprocessing step towards making the data more consistent and high quality before training predictive models.



Basics of Python


Data cleaning, or data cleansing, is essential to building intelligent automated systems. It is a preprocessing step that improves the validity, accuracy, completeness, consistency and uniformity of data, and it is a prerequisite for building reliable machine learning models. Without it, no matter how good the model is, its results cannot be trusted.

Beginners in machine learning usually start with publicly available datasets that have already been thoroughly checked for such issues and are therefore ready to be used for training models and getting good results. Real-world data is far from that. Common problems include missing values, noise values (univariate outliers), multivariate outliers, duplicated data, features that need standardizing or normalizing, and categorical features that need special handling. Raw datasets with these issues cannot be put to use without knowing the data cleaning and preprocessing steps, and data acquired directly from multiple online sources to build useful applications is even more exposed to such problems. Learning data cleansing skills therefore helps users perform useful analysis of their business data. The phrase 'garbage in, garbage out' captures the point: unless the issues in the data are sorted out, the results will be unreliable no matter how efficient the model is.

In this course, we discuss the common problems with data coming from different sources, and we discuss and implement ways to resolve them. Each concept has three components: theoretical explanation, mathematical evaluation and code. Lectures numbered *.1.* cover the theory and mathematical evaluation of a concept, while lectures numbered *.2.* cover its practical code. In *.1.*, the first (*) refers to the section number and the second (*) refers to the lecture number within a section. All the code is written in Python using Jupyter Notebook.
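To give a feel for the kind of problems listed above, here is a minimal sketch of detecting missing values and univariate outliers, replacing them with the median, and normalizing the result. It is not taken from the course material: the toy `ages` list, the function names and the 1.5 × IQR threshold are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import statistics

def find_missing(values):
    """Return the positions of missing entries (represented here as None)."""
    return [i for i, v in enumerate(values) if v is None]

def find_iqr_outliers(values, k=1.5):
    """Return positions of values outside [Q1 - k*IQR, Q3 + k*IQR].

    Missing entries are ignored when computing the quartiles.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]

def impute_with_median(values, bad_indices):
    """Replace the entries at bad_indices with the median of the others.

    bad_indices must cover every missing (None) position.
    """
    bad = set(bad_indices)
    fill = statistics.median(v for i, v in enumerate(values) if i not in bad)
    return [fill if i in bad else v for i, v in enumerate(values)]

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical toy feature: one missing entry and one implausible value.
ages = [25, 27, None, 29, 31, 120, 28, 27]

missing = find_missing(ages)
outliers = find_iqr_outliers(ages)
cleaned = impute_with_median(ages, missing + outliers)
scaled = min_max_scale(cleaned)

print("missing positions:", missing)
print("outlier positions:", outliers)
print("cleaned:", cleaned)
print("scaled:", scaled)
```

Replacing flagged values with the median is only one of several possible strategies; deleting the affected rows or imputing with the mean or mode are alternatives, and the right choice depends on the data.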


Section 1: Introduction

Lecture 1 Introduction

Lecture 2 Quality of Data

Lecture 3 Missing Values, Noise and Outliers

Lecture 4 Examples of Anomalies

Lecture 5 Instructor

Section 2: Detecting Missing and Noise Values (Univariate Outliers)

Lecture 6 2.1.1 Anomaly Detection (Median)

Lecture 7 2.2.1 Implementing Detection of Missing Values

Lecture 8 2.2.2 Implementing Median based Detection (Global Context)

Lecture 9 2.2.3 Implementing Median based Detection (Local Context)

Lecture 10 2.1.2 Anomaly Detection (Mean)

Lecture 11 2.2.4 Implementing Mean based Detection of Noise values

Lecture 12 2.1.3 Anomaly Detection (Z-score)

Lecture 13 2.2.5 Implementing Z-score based Detection

Lecture 14 2.1.4 Anomaly Detection (Interquartile Range)

Lecture 15 2.2.6 Implementing Interquartile Range for Noise Detection

Section 3: Handling Missing and Noise Values (Univariate Outliers)

Lecture 16 3.1.1 Approaches to Handle Anomalies

Lecture 17 3.1.2 Deletion Strategy

Lecture 18 3.2.1 Deleting Missing Values

Lecture 19 3.1.3 Global and Local Context

Lecture 20 3.1.4 Replacement Strategy

Lecture 21 3.1.5 Statistical Measures

Lecture 22 3.2.2 Implementing Imputation with Mode

Lecture 23 3.2.3 Implementing Imputation with Median and Mean

Section 4: Multivariate Outliers

Lecture 24 4.1.1 Multivariate Outliers

Lecture 25 4.1.2 Local Outlier Factor

Lecture 26 4.2.1 Implementing LOF for Outlier Detection

Lecture 27 4.1.3 Clustering for Multivariate Outlier Detection

Lecture 28 4.2.2 Implementing DBSCAN Clustering for Outlier Detection

Lecture 29 4.1.3 Data Visualization for Outlier Detection

Lecture 30 4.2.3 Implementing Data Visualization

Section 5: Anomalies in Textual data

Lecture 31 5.1.1 Normalizing Text Anomalies

Lecture 32 5.2.1 Lowercase, Whitespaces, Punctuations

Lecture 33 5.2.2 Stopwords Removal

Lecture 34 5.1.2 Regular Expressions

Lecture 35 5.2.4 Implementing Regular Expressions for Filtering stopwords

Lecture 36 5.2.3 Stemming and Lemmatization

Lecture 37 Parts-of-speech (POS) Tagging

Lecture 38 5.2.6 Text Segmentation and Tokenization

Section 6: Structuring Textual Documents

Lecture 39 6.1.1 Structuring Textual Data

Lecture 40 6.1.2 Bag-of-Words (BoW) Approach

Lecture 41 6.1.3 Binary and TF-IDF Representation

Lecture 42 6.2.1 Implementing One Document Corpus Representation

Lecture 43 6.2.2 Implementing Multi-doc Corpus Representation

Lecture 44 6.2.3 Tuning Parameters to Improve Representation

Lecture 45 6.2.4 Implementing TF-IDF Representation Scheme

Lecture 46 6.2.5 Implementing Dummy Dataset Representation

Lecture 47 6.2.6 Implementing UCI Repository Dataset Representation

Section 7: Feature Scaling (Normalization)

Lecture 48 7.1.1 Why Feature Scaling

Lecture 49 7.1.2 Feature Normalization (Min Max Scaler)

Lecture 50 7.2.1 Implementing Feature Normalization

Lecture 51 7.1.3 Feature Standardization (Standard Scaler)

Lecture 52 7.2.2 Implementing Feature Standardization

Lecture 53 7.1.4 Robust Feature Scaler

Lecture 54 7.2.3 Implementation of Robust Scaler

Section 8: Handling Categorical Features

Lecture 55 8.1.1 Types of Features

Lecture 56 8.2.1 Handling Categorical Ordinal Features

Lecture 57 8.2.2 Categorical Nominal Features

Lecture 58 8.2.3 Text Sequence Encoding (for Deep Learning Models)

Section 9: Machine Learning Overview

Lecture 59 Deductive Learning and Inductive Learning

Lecture 60 Learning from Features

Lecture 61 Machine Learning (Introduction)

Lecture 62 Supervised and Unsupervised Learning

Lecture 63 Pattern Recognition

Lecture 64 Machine Learning Project Pipeline

Section 10: Data Acquisition

Lecture 65 Data Acquisition from Webpages

The target students are beginners to data science and machine learning.

Course Information:

Udemy | English | 5h 43m | 1.69 GB
Created by: Taimoor khan
