Project Overview

The objective of the Johns Hopkins University Data Science capstone project is to create a “next word” predictor, similar to how a mobile phone or Google search “suggests” the next word when you are typing.

The Next Word Predictor is a web-based application through which a user can interact with the prediction model.

A structured approach is taken to build the Next Word Predictor.

Exploratory Data Analysis - What We’ve Learned

To build our prediction model, extracts of three (3) English text sources are used to create the words and phrases used in the model:
- news feeds
- blogs
- Twitter conversations

The table below shows a brief analysis of the text extracts:

          Before Sampling     Sample Before Pre-Processing              Sample After Pre-Processing
Document  MB      Lines       Pct   No.Lines   No.Tokens    No.Unique   Tokens      Unique
News      196.3   1,010,242   10%   101,024    3,657,373    67,302      1,971,816   58,251
Blogs     200.4   899,288     10%   89,929     3,976,838    69,431      1,941,122   57,704
Twitter   159.4   2,360,148   10%   236,015    3,521,649    73,458      1,690,589   64,881
TOTAL     556.1   4,269,678   10%   426,968    11,155,860   141,661     5,603,527   121,382
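
The sampling and token counts summarized in the table above could be reproduced roughly as follows. This is only a sketch under assumptions: the function names `sample_lines` and `token_stats`, the fixed random seed, and the simple regex tokenizer are illustrative choices, not the project's actual pre-processing.

```python
import random
import re


def sample_lines(lines, pct=0.10, seed=42):
    """Return a random sample holding roughly `pct` of the input lines."""
    rng = random.Random(seed)          # fixed seed (assumed) for repeatability
    k = round(len(lines) * pct)
    return rng.sample(lines, k)


def token_stats(lines):
    """Return (total tokens, unique tokens) for a list of text lines.

    Tokens are lowercase runs of letters and apostrophes; this tokenizer
    is an assumption made for illustration.
    """
    tokens = []
    for line in lines:
        tokens.extend(re.findall(r"[a-z']+", line.lower()))
    return len(tokens), len(set(tokens))
```

Calling `token_stats(sample_lines(lines))` on each source would give the kind of per-document line, token, and unique-token counts reported in the table.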

For the “ALL” category, a small number of tokens account for a large percentage of the occurrences, as seen in the table below:

Using text analysis tools, the top 20 tokens (“words”) of each data source have been identified.
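
One way to identify the top tokens and to measure how much of the text they cover is sketched below. The helper names `top_tokens` and `coverage` are hypothetical, and the input is assumed to be a flat list of already-tokenized words.

```python
from collections import Counter


def top_tokens(tokens, n=20):
    """Return the n most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n)


def coverage(tokens, n):
    """Fraction of all occurrences accounted for by the n most frequent tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total
```

A `coverage` value that rises quickly for small `n` is what "a small number of tokens account for a large percentage of the occurrences" means in practice.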

N-grams

N-grams represent sequences of tokens (words) which occur within text extracts. 2-gram and 3-gram sequences have been generated. The resulting top sequences (words separated by a “/”) are displayed in the table below.

The corpus of all the text sources is so large that we subset it to use only words that occur at least 10 times. The following figure shows the top 20 2-grams and 3-grams:
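
The n-gram generation and the frequency cut-off described in this section can be sketched as follows. The names `frequent_words` and `ngram_counts` are hypothetical, and the choice to drop any n-gram containing a word below the threshold is one assumed reading of the subsetting step.

```python
from collections import Counter


def frequent_words(tokens, min_count=10):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}


def ngram_counts(tokens, n, vocab=None):
    """Count n-grams, joining the words with "/" as in the report.

    If `vocab` is given, n-grams containing a word outside it are skipped
    (an assumed interpretation of the frequency subsetting).
    """
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if vocab is None or all(word in vocab for word in gram):
            counts["/".join(gram)] += 1
    return counts
```

With these helpers, `ngram_counts(tokens, 2, frequent_words(tokens)).most_common(20)` would list the top 20 2-grams, and `n=3` would do the same for 3-grams.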

Creating Prediction Algorithm and Web-Based Application

The current plan is to use a Markov model as the prediction engine. Using the probabilities associated with the n-grams, the prediction algorithm will suggest the top 3 “next words” (see figure below).

Illustrative Markov Model
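
A minimal sketch of such a Markov-style predictor is shown below. It is an illustration under assumptions, not the planned implementation: `build_model` and `predict` are hypothetical names, probabilities are plain relative frequencies of the n-grams, and unseen contexts simply return no suggestions.

```python
from collections import Counter, defaultdict


def build_model(tokens, order=2):
    """Map each `order`-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        model[context][tokens[i + order]] += 1
    return model


def predict(model, context, k=3):
    """Return up to k (word, probability) pairs for the given context."""
    followers = model.get(tuple(context))
    if not followers:
        return []                      # context never seen in the training text
    total = sum(followers.values())
    return [(word, count / total) for word, count in followers.most_common(k)]
```

For example, after `model = build_model(tokens, order=2)`, calling `predict(model, ["i", "love"])` returns the most likely next words for the phrase "i love", ordered by estimated probability.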