The objective of the Johns Hopkins University Data Science capstone project is to create a “next word” predictor, similar to the way a mobile phone keyboard or Google search suggests the next word as you type.
The Next Word Predictor is a web-based application through which a user can interact with the prediction model.
A structured approach is taken to build the Next Word Predictor.
To build the prediction model, extracts of three English text sources are used to create the words and phrases on which the model is based (a sampling sketch follows the list):

- news feeds
- blogs
- Twitter conversations
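
As a minimal sketch of the sampling step, the following Python reads each source and keeps roughly 10% of its lines. The file names and the `sample_lines` helper are illustrative assumptions, not the project's actual code (the capstone itself is typically built in R).

```python
import random

# Illustrative file names; any plain-text files with one document per line work.
SOURCES = ["en_US.news.txt", "en_US.blogs.txt", "en_US.twitter.txt"]
SAMPLE_RATE = 0.10  # keep roughly 10% of the lines, matching the table below

def sample_lines(path, rate, seed=42):
    """Return a reproducible ~`rate` sample of the lines in `path`."""
    rng = random.Random(seed)
    with open(path, encoding="utf-8", errors="ignore") as f:
        return [line.strip() for line in f if rng.random() < rate]

corpus = []
for path in SOURCES:
    corpus.extend(sample_lines(path, SAMPLE_RATE))
```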
The table below shows a brief analysis of the text extracts:
| Source | Size (MB) | Total lines | Sample % | Sampled lines | Tokens | Unique tokens | Tokens (cleaned) | Unique (cleaned) |
|---|---|---|---|---|---|---|---|---|
| News | 196.3 | 1,010,242 | 10% | 101,024 | 3,657,373 | 67,302 | 1,971,816 | 58,251 |
| Blogs | 200.4 | 899,288 | 10% | 89,929 | 3,976,838 | 69,431 | 1,941,122 | 57,704 |
| Twitter | 159.4 | 2,360,148 | 10% | 236,015 | 3,521,649 | 73,458 | 1,690,589 | 64,881 |
| TOTAL | 556.1 | 4,269,678 | 10% | 426,968 | 11,155,860 | 141,661 | 5,603,527 | 121,382 |
For the combined (“TOTAL”) corpus, a small number of tokens accounts for a large percentage of all occurrences, as seen in the table below:
Using text-analysis tools, the top 20 tokens (“words”) in each data source have been identified.
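
As a rough illustration of how such token counts can be produced, the sketch below uses a deliberately simple regex tokenizer; the `tokenize` helper is an assumption, not the project's actual text-analysis tooling, and `corpus` is the sampled line list from the earlier sketch.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case and split into runs of letters/apostrophes; intentionally simple."""
    return re.findall(r"[a-z']+", text.lower())

token_counts = Counter()
for line in corpus:  # `corpus` comes from the sampling sketch above
    token_counts.update(tokenize(line))

for token, n in token_counts.most_common(20):  # the top 20 tokens, per the text
    print(f"{token:>12}  {n:,}")
```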
N-grams are sequences of tokens (words) that occur within the text extracts. 2-gram and 3-gram sequences have been generated; the resulting top sequences (words separated by a “/”) are displayed in the table below.
The corpus of all the text sources is so large that we subset it, keeping only words that occur at least 10 times. The following figure shows the top 20 2-grams and 3-grams:
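
A sketch of the n-gram step, reusing `tokenize`, `corpus`, and `token_counts` from the sketches above, might look like this. The word-frequency cutoff of 10 comes from the text; everything else is an illustrative assumption.

```python
from collections import Counter

MIN_WORD_COUNT = 10  # keep only words occurring at least 10 times, as stated above
vocab = {w for w, c in token_counts.items() if c >= MIN_WORD_COUNT}

def ngrams(tokens, n):
    """Yield every length-n sliding window over a token list."""
    return zip(*(tokens[i:] for i in range(n)))

bigram_counts, trigram_counts = Counter(), Counter()
for line in corpus:
    # Dropping out-of-vocabulary words is a simplification: it can pair
    # words that were not actually adjacent in the original text.
    toks = [t for t in tokenize(line) if t in vocab]
    bigram_counts.update(ngrams(toks, 2))
    trigram_counts.update(ngrams(toks, 3))

print(bigram_counts.most_common(20))   # top 20 2-grams
print(trigram_counts.most_common(20))  # top 20 3-grams
```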
The current plan is to use a Markov model as the prediction engine. Using the probabilities associated with the n-grams, the prediction algorithm will suggest the top 3 “next words” (see figure below).
Illustrative Markov Model
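
As a minimal sketch of the idea (not the app's actual implementation), a first-order Markov predictor can be built directly from the bigram counts above; a fuller version would back off from 3-grams to 2-grams when the longer context is unseen.

```python
from collections import Counter, defaultdict

# First-order Markov table: previous word -> Counter of possible next words.
transitions = defaultdict(Counter)
for (w1, w2), count in bigram_counts.items():  # from the n-gram sketch above
    transitions[w1][w2] += count

def predict_next(word, k=3):
    """Return the k most probable next words after `word`, with probabilities."""
    followers = transitions.get(word.lower())
    if not followers:
        return []
    total = sum(followers.values())
    return [(w, c / total) for w, c in followers.most_common(k)]

print(predict_next("thank"))  # top 3 suggested next words after "thank"
```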