Information Retrieval and Mining Massive Data Sets

Learn various techniques to build a Google-scale information retrieval system.
File size: 4.46 GB
Total length: 39h 9m



Omkar Deshpande


Last updated 4/2014




What you’ll learn

The course is primarily divided into 6 parts.
Part 1: Building an Information Retrieval System
Part 2: Mining Frequent Patterns and Associations
Part 3: Classification and Clustering
Part 4: Web Mining
Part 5: Recommendation Systems



Knowledge of probability and linear algebra.
Good grasp of graduate-level algorithms.
Experience with a programming language (C, Python, Java).


The goal is to introduce the techniques required to build an IR system. In this course we will explore various methods for solving big data problems and evaluate alternative solutions and their trade-offs. In the later part of the course we will discuss data mining algorithms for making sense of massive data sets.


Section 1: Introduction To a Boolean Search Engine

Lecture 1 What is Data Mining

Lecture 2 Structured Data, Unstructured data and Information Retrieval

Lecture 3 Term-Document Incidence Matrix (1)

Lecture 4 Term-Document Incidence Matrix (2)

Lecture 5 Inverted Index

Lecture 6 Tradeoffs in implementing an Inverted Index

Lecture 7 Processing AND, OR, NOT queries

Lecture 8 Overview of Index Construction Pipeline

Lecture 9 Query optimization using Document Frequency (1)

Lecture 10 Query Optimization Using Document Frequency (2)

Lecture 11 Boolean Retrieval Model

Lecture 12 Example of a Boolean Retrieval Model

Lecture 13 Limitations of Boolean Retrieval Model

Lecture 14 How to evaluate performance of an IR System

Lecture 15 Google Zeitgeist

Section 2: Dictionary Data Structure. Tolerant retrieval

Lecture 16 Parsing Documents and Issues Associated with it

Lecture 17 Tokenization Process in an IR System

Lecture 18 Normalization to Terms

Lecture 19 Faster Postings Merges With Skip Pointers

Lecture 20 How to Handle Phrase Query

Lecture 21 Phrase Query Using Positional Index

Lecture 22 How to handle proximity query

Lecture 23 Discussion on Positional Index Size

Section 3: Index construction. Postings size estimation, sort-based indexing, dynamic index

Lecture 24 Dictionary Data Structure Implementation

Lecture 25 Wild card queries

Lecture 26 Questions on Wild Card Queries

Lecture 27 Wild Card Query Handling Using Permuterm Index

Lecture 28 Wild Card Query Handling Using K-Gram Index

Lecture 29 Soundex Algorithm

Lecture 30 Spelling Correction Techniques in an IR System

Lecture 31 Question On Soundex Algorithm

Lecture 32 Spelling Correction (Part 2)

Lecture 33 Introduction To Dynamic Programming

Lecture 34 How To Calculate Edit Distance Between Two Strings

Lecture 35 Spelling Correction Using Weighted Edit Distance

Lecture 36 Spelling Correction Using Ngram Overlap Technique

Lecture 37 Calculating Jaccard Coefficient (An Example)

Lecture 38 Context Sensitive Spell Correction

Section 4: Dictionary Compression, Posting Compression

Lecture 39 Introduction to Index Construction

Lecture 40 Index Construction Using InMemory Sorting

Lecture 41 Index Construction Using BSBI Algorithm

Lecture 42 Index Construction Using SPIMI Algorithm

Lecture 43 Introduction To Distributed Indexing

Lecture 44 How To build distributed indexes

Lecture 45 Q & A on Distributed Index

Lecture 46 Map Reduce

Lecture 47 Dynamic indexing using naive approach

Lecture 48 Dynamic indexing using logarithmic merge

Lecture 49 Issues With Multiple Indexes

Section 5: Scoring, term weighting, and the vector space model

Lecture 50 Why do we compress indexes

Lecture 51 Important Statistics about RCV Collection

Lecture 52 Various Dictionary Compression Techniques

Lecture 53 Various Dictionary Compression Techniques Part 2

Lecture 54 Various Posting Compression Techniques

Section 6: Efficient vector space scoring. Nearest neighbor techniques

Lecture 55 Ranked Retrieval Model

Lecture 56 Jaccard Score

Lecture 57 Term Frequency Weighting And Bag Of Words Model

Lecture 58 Inverse Document Frequency

Lecture 59 TF-IDF Score

Lecture 60 Documents As TF-IDF Vectors

Lecture 61 Length Normalization

Lecture 62 Cosine Similarity Example

Lecture 63 Computing Cosine Scores On Index

Lecture 64 Variants of TF IDF Weights

Section 7: Evaluating search engines. User happiness, precision, recall, F-measure

Lecture 65 Term at a Time Scoring

Lecture 66 Efficient Cosine Ranking

Lecture 67 Generic Approach For Speeding up Cosine Similarity

Lecture 68 Index Elimination

Lecture 69 Champion Lists

Lecture 70 Static Quality Score

Lecture 71 High And Low Lists

Lecture 72 Impact Ordered Posting

Lecture 73 Cluster Pruning

Lecture 74 Parametric Zone Tiered Index

Lecture 75 Query Term Proximity And Query Parsing

Lecture 76 How A Search Engine Works

Section 8: Advertisement System. Google AdSense. Search Engine Optimization

Lecture 77 Performance of a Search Engine Part 1

Lecture 78 Performance of a Search Engine Part 2

Lecture 79 Performance of a Search Engine Part 3

Lecture 80 Performance of a Search Engine Part 4

Lecture 81 Performance of a Search Engine Part 5

Section 9: Supervised Learning. Text Classification. Naive-Bayes Text Classification

Lecture 82 ECommerce Vs. Traditional Businesses

Lecture 83 Pricing Models For Online Advertisement

Lecture 84 AdWords and AdSense

Lecture 85 SEM And SEO

Section 10: Link analysis. Web as a graph. PageRank

Lecture 86 Classification System

Lecture 87 Document Classification

Lecture 88 Manual Classification Methods

Lecture 89 Naive Bayes Classifiers

Lecture 90 Bayes Rules Of Text Classification

Lecture 91 Various Classification Methods

Lecture 92 Example of Multivariate Bernoulli Model

Lecture 93 Second Version of Naive Bayes

Lecture 94 Example of Second Version of Naive Bayes

Section 11: Clustering. Introduction to the problem. Partitioning methods: k-means clustering

Lecture 95 Reputation System

Lecture 96 Examples of Reputation System

Lecture 97 Limitations of Reputation System

Lecture 98 Page Rank Calculation

Section 12: Web Crawler

Lecture 99 What is Clustering

Lecture 100 Applications of Clustering in IR Systems

Lecture 101 Issues For Clustering

Lecture 102 Introduction to Clustering Algorithms

Lecture 103 K-Means Clustering Algorithms

Lecture 104 Rocchio Algorithms

Lecture 105 K Nearest Neighbor Algorithms

Lecture 106 Discussion on K Nearest Neighbor

Lecture 107 Proof of Rocchio’s Algorithm as linear classifier

Lecture 108 Worked out Example On Rocchio Algorithms

Lecture 109 Examples On Bigram Index

Section 13: Association Rules. Market Basket Model and Frequent Item Sets. A Priori Algorithm

Lecture 110 How a Web Crawler Works

Lecture 111 Complications in Crawling

Lecture 112 Advanced Crawler Architecture

Lecture 113 URL Frontier

Section 14: Association Rules. Market Basket Model and Frequent Item Sets. A Priori Algorithm

Lecture 114 Association Rule Introduction

Lecture 115 Market Basket Model and Frequent Item Sets

Lecture 116 A formal approach to Association Rules

Lecture 117 How to find association Rules

Lecture 118 Storage Considerations for Market Basket

Lecture 119 Memory Bottleneck in Storage of Market Basket

Lecture 120 A Naive Algorithm to discover Association Rules Part1

Lecture 121 A Naive Algorithm to discover Association Rules Part2

Lecture 122 A Priori Algorithm

Lecture 123 Extension of A Priori Algorithm

Big Data Enthusiasts, Data Scientists

Course Information:

Udemy | English | 39h 9m | 4.46 GB
Created by: Omkar Deshpande
