Data Cleaning in Python
What you’ll learn
Data cleaning (or data cleansing) as a preprocessing step that makes data more consistent and higher in quality before training predictive models.
Requirements
Basics of Python
Description
Data cleaning, also called data cleansing, is essential for building intelligent automated systems. It is a preprocessing step that improves the validity, accuracy, completeness, consistency and uniformity of data. Reliable machine learning models depend on it: however good a model is, its results cannot be trusted if the issues in its data have not been sorted out. This is what the phrase ‘garbage in, garbage out’ refers to.
Beginners in machine learning usually start with publicly available datasets that have already been thoroughly analyzed for such issues and are therefore ready to be used for training models and getting good results. Real-world data is far from that state. Common problems include missing values, noise values (univariate outliers), multivariate outliers, duplicated data, features that have not been standardized or normalized, and categorical features that need special handling. Raw datasets with these issues are of little use without knowing the data cleaning and preprocessing steps, and data acquired directly from multiple online sources to build an application is even more exposed to such problems. Learning data cleansing skills therefore helps users carry out useful analysis of their business data.
In this course, we discuss the common problems with data coming from different sources, and we discuss and implement ways to resolve them effectively. Each concept has three components: theoretical explanation, mathematical evaluation and code. Lectures numbered *.1.* cover the theory and mathematical evaluation of a concept, while lectures numbered *.2.* cover the practical code for that concept. In *.1.*, the first (*) refers to the section number, while the second (*) refers to the lecture number within a section. All code is written in Python using Jupyter Notebook.
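As a rough illustration of two of the problems named above (missing values and duplicated data), here is a minimal sketch using only the Python standard library. The records, the column names and the helper functions `is_missing`, `count_missing` and `drop_duplicates` are hypothetical and are not taken from the course material.

```python
import math

# Hypothetical toy records; None and NaN stand in for missing values.
records = [
    {"id": 1, "age": 34, "city": "A"},
    {"id": 2, "age": None, "city": "B"},
    {"id": 3, "age": 29, "city": None},
    {"id": 1, "age": 34, "city": "A"},            # exact duplicate of the first row
    {"id": 4, "age": float("nan"), "city": "C"},
]


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(rows):
    """Count the missing entries in each column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + int(is_missing(value))
    return counts


def drop_duplicates(rows):
    """Keep only the first occurrence of each fully identical row."""
    seen, unique = set(), []
    for row in rows:
        # repr() makes every value hashable and comparable, including NaN.
        key = tuple(sorted((column, repr(value)) for column, value in row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


missing_counts = count_missing(records)
deduplicated = drop_duplicates(records)
print(missing_counts)
print(len(deduplicated))
```

In this sketch a row counts as a duplicate only when every column matches; whether near-duplicates should also be merged depends on the dataset.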
Overview
Section 1: Introduction
Lecture 1 Introduction
Lecture 2 Quality of Data
Lecture 3 Missing Values, Noise and Outliers
Lecture 4 Examples of Anomalies
Lecture 5 Instructor
Section 2: Detecting Missing and Noise Values (Univariate Outliers)
Lecture 6 2.1.1 Anomaly Detection (Median)
Lecture 7 2.2.1 Implementing Detection of Missing Values
Lecture 8 2.2.2 Implementing Median based Detection (Global Context)
Lecture 9 2.2.3 Implementing Median based Detection (Local Context)
Lecture 10 2.1.2 Anomaly Detection (Mean)
Lecture 11 2.2.4 Implementing Mean based Detection of Noise values
Lecture 12 2.1.3 Anomaly Detection (Z-score)
Lecture 13 2.2.5 Implementing Z-score based Detection
Lecture 14 2.1.4 Anomaly Detection (Interquartile Range)
Lecture 15 2.2.6 Implementing Interquartile Range for Noise Detection
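The z-score and interquartile-range detection named in this section can be sketched as follows. This is a minimal standard-library illustration, not the course's own code: the sample values, the cut-offs (3 standard deviations, 1.5 times the IQR) and the function names are assumptions chosen for the example.

```python
import statistics

# Hypothetical sample with one obvious noise value (95).
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]


def zscore_outliers(data, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(data)
    std = statistics.pstdev(data)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be flagged
    return [x for x in data if abs(x - mean) / std > threshold]


def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


print(zscore_outliers(values))
print(iqr_outliers(values))
```

Note that the mean and standard deviation are themselves pulled towards the noise value, while the quartiles are not, so on small samples the IQR rule tends to be the more robust of the two.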
Section 3: Handling Missing and Noise Values (Univariate Outliers)
Lecture 16 3.1.1 Approaches to Handle Anomalies
Lecture 17 3.1.2 Deletion Strategy
Lecture 18 3.2.1 Deleting Missing Values
Lecture 19 3.1.3 Global and Local Context
Lecture 20 3.1.4 Replacement Strategy
Lecture 21 3.1.5 Statistical Measures
Lecture 22 3.2.2 Implementing Imputation with Mode
Lecture 23 3.2.3 Implementing Imputation with Median and Mean
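The replacement strategy with mode, median and mean can be sketched as below. This is a minimal standard-library example in which missing values are represented as `None` and the fill value is computed over the whole column; the `impute` helper and the sample columns are hypothetical.

```python
import statistics

# Map each strategy name to the statistic used as the fill value.
STRATEGIES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}


def impute(values, strategy="median"):
    """Replace None entries with a statistic computed from the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in values]


ages = [25, None, 31, 40, None, 28]        # numeric column
colors = ["red", None, "blue", "red"]      # categorical column

print(impute(ages, "median"))
print(impute(ages, "mean"))
print(impute(colors, "mode"))
```

The mode is the usual choice for categorical columns, since a mean or median is not defined for them; for numeric columns the median is less affected by noise values than the mean.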
Section 4: Multivariate Outliers
Lecture 24 4.1.1 Multivariate Outliers
Lecture 25 4.1.2 Local Outlier Factor
Lecture 26 4.2.1 Implementing LOF for Outlier Detection
Lecture 27 4.1.3 Clustering for Multivariate Outlier Detection
Lecture 28 4.2.2 Implementing DBSCAN Clustering for Outlier Detection
Lecture 29 4.1.4 Data Visualization for Outlier Detection
Lecture 30 4.2.3 Implementing Data Visualization
Section 5: Anomalies in Textual Data
Lecture 31 5.1.1 Normalizing Text Anomalies
Lecture 32 5.2.1 Lowercase, Whitespaces, Punctuations
Lecture 33 5.2.2 Stopwords Removal
Lecture 34 5.1.2 Regular Expressions
Lecture 35 5.2.4 Implementing Regular Expressions for Filtering stopwords
Lecture 36 5.2.3 Stemming and Lemmatization
Lecture 37 Parts-of-speech (POS) Tagging
Lecture 38 5.2.6 Text Segmentation and Tokenization
Section 6: Structuring Textual Documents
Lecture 39 6.1.1 Structuring Textual Data
Lecture 40 6.1.2 Bag-of-Words (BoW) Approach
Lecture 41 6.1.3 Binary and TF-IDF Representation
Lecture 42 6.2.1 Implementing One Document Corpus Representation
Lecture 43 6.2.2 Implementing Multi-doc Corpus Representation
Lecture 44 6.2.3 Tuning Parameters to Improve Representation
Lecture 45 6.2.4 Implementing TF-IDF Representation Scheme
Lecture 46 6.2.5 Implementing Dummy Dataset Representation
Lecture 47 6.2.6 Implementing UCI Repository Dataset Representation
Section 7: Feature Scaling (Normalization)
Lecture 48 7.1.1 Why Feature Scaling
Lecture 49 7.1.2 Feature Normalization (Min Max Scaler)
Lecture 50 7.2.1 Implementing Feature Normalization
Lecture 51 7.1.3 Feature Standardization (Standard Scaler)
Lecture 52 7.2.2 Implementing Feature Standardization
Lecture 53 7.1.4 Robust Feature Scaler
Lecture 54 7.2.3 Implementation of Robust Scaler
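The three scaling schemes in this section can be sketched with plain Python as follows. This is an illustration under stated assumptions rather than the course's implementation: the function names and sample data are hypothetical, standardization uses the population standard deviation, and the robust scaler is taken to mean centering on the median and dividing by the interquartile range.

```python
import statistics


def min_max_scale(data):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(data), max(data)
    if hi == lo:
        return [0.0 for _ in data]  # constant feature, no spread to rescale
    return [(x - lo) / (hi - lo) for x in data]


def standardize(data):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(data)
    std = statistics.pstdev(data)
    if std == 0:
        return [0.0 for _ in data]
    return [(x - mean) / std for x in data]


def robust_scale(data):
    """Center on the median and scale by the interquartile range."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in data]
    return [(x - median) / iqr for x in data]


feature = [1, 2, 3, 4, 100]  # hypothetical feature with one extreme value

print(min_max_scale(feature))
print(standardize(feature))
print(robust_scale(feature))
```

With an extreme value such as 100 in the data, min-max scaling squeezes the remaining values close to 0, whereas the median and IQR used by the robust variant are unaffected by it.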
Section 8: Handling Categorical Features
Lecture 55 8.1.1 Types of Features
Lecture 56 8.2.1 Handling Categorical Ordinal Features
Lecture 57 8.2.2 Categorical Nominal Features
Lecture 58 8.2.3 Text Sequence Encoding (for Deep Learning Models)
Section 9: Machine Learning Overview
Lecture 59 Deductive Learning and Inductive Learning
Lecture 60 Learning from Features
Lecture 61 Machine Learning (Introduction)
Lecture 62 Supervised and Unsupervised Learning
Lecture 63 Pattern Recognition
Lecture 64 Machine Learning Project Pipeline
Section 10: Data Acquisition
Lecture 65 Data Acquisition from Webpages
The target students are beginners to data science and machine learning.
Course Information:
Udemy | English | 5h 43m | 1.69 GB
Created by: Taimoor Khan