Final Project - Capstone: Next Word Prediction

Next Word Prediction

Ken Peters
date: 9/7/2020
autosize: true

A special thanks to our instructors

Course Instructors:

  • Jeff Leek - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
  • Roger Peng - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
  • Brian Caffo - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health

Project Overview:

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models.

The corpora were collected from publicly available sources by a web crawler and consist of three files, one each of blog posts, news articles, and tweets, all provided by SwiftKey.

We will use these for Next Word Prediction.

install.packages("kableExtra")
install.packages("stringi")
library(kableExtra)
library(stringi)

First we explore the datasets and determine each file's size, number of lines, and number of words (a code sketch follows the table below).

Table: Size, number of lines, and number of words for the four datasets

          Size (MB)   Number of lines   Number of words
  Blogs         200           899,288        38,154,238
  Twitter       159         2,360,148        30,218,125
  News          196            77,259         2,693,898
  All           555         3,336,695        71,066,261
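The figures above can be computed along these lines. This is a minimal sketch; the file names below are the usual SwiftKey download names and are assumed, not confirmed by the slides.

library(stringi)

files <- c(Blogs   = "en_US.blogs.txt",
           Twitter = "en_US.twitter.txt",
           News    = "en_US.news.txt")

summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(Size_in_MB      = round(file.info(path)$size / 1024^2),
    Number_of_Lines = length(lines),
    Number_of_Words = sum(stri_count_words(lines)))
}

stats <- t(sapply(files, summarize_file))
stats <- rbind(stats, All = colSums(stats))
knitr::kable(stats)   # kableExtra can then style the table for the slide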

Because the data is so large, we work with a random sample of only 1% of it.
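One simple way to draw that 1% sample is sketched below; the slides do not show the exact sampling code, and the object names blogs, news, and twitter are assumed to hold the full text of each file.

set.seed(1234)   # so the sample is reproducible
sample_lines <- function(lines, rate = 0.01) {
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}
sample_text <- c(sample_lines(blogs),     # blogs, news, twitter: character
                 sample_lines(news),      # vectors read in with readLines()
                 sample_lines(twitter))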
Next we clean and pre-process the sampled data (a code sketch follows the list below):

  • Convert to lower case
  • Remove Punctuation
  • Remove numbers
  • Remove whitespaces
  • We did not remove profanity because, in all our exploration of the data, none was found. Also, some profanity lists contain words such as “beer” or “weed”, and we did not want those removed
  • We did not remove stopwords, such as “the”, “and”, “a”, etc., because these are often exactly the next word in a phrase
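A minimal sketch of these cleaning steps using the tm package follows; tm is one reasonable choice, and the slides do not name the package actually used.

library(tm)

corpus <- VCorpus(VectorSource(sample_text))            # sample_text from the sampling step
corpus <- tm_map(corpus, content_transformer(tolower))  # convert to lower case
corpus <- tm_map(corpus, removePunctuation)             # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                 # remove numbers
corpus <- tm_map(corpus, stripWhitespace)               # collapse extra whitespace
# No profanity filter and no stopword removal, for the reasons listed above.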

Here's what a sample search might look like:

A screen capture of the Prediction page.

Give it a try

Link to Next Word prediction
A description of the algorithm is on the next slide

PREDICTION ALGORITHM

  1. First we tokenize the data. Tokenization is breaking a chunk of text into smaller parts; for us, it means breaking the text into words.
  2. Next we form, order, and assign probabilities to n-grams, i.e. phrases of length n (see the sketch after this list).
    • We use unigrams, bigrams, trigrams, quadgrams, fivegrams, and sixgrams
  3. We use Kneser-Ney smoothing, a method for calculating the probability distribution of n-grams in a document based on their histories. It is widely considered the most effective smoothing method because of its use of absolute discounting: a fixed discount is subtracted from each observed n-gram count, and the probability mass freed up is redistributed through lower-order (continuation) distributions. The approach works well for both higher- and lower-order n-grams. The method was proposed in a 1994 paper by Reinhard Kneser, Ute Essen and Hermann Ney. See this reference.
  4. We store the resulting n-gram data frames locally and load them for our online predictions, to save computation time and space.
  5. The user can choose how many predicted next words to see, for n = 5, 6, 7, 8, 9, or 10.
  6. And to “jazz” it up a little, the user can also view a radar plot of the top n predicted next words, for n = 5, 10, 15, 20, 25, or 30.
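For concreteness, in the bigram case the interpolated Kneser-Ney estimate has the standard textbook form (not copied from the app's code):

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - d,\ 0)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{cont}}(w_i)$$

where d is the fixed discount, lambda(w_{i-1}) is the weight that redistributes the discounted mass, and P_cont(w_i) is proportional to the number of distinct words that precede w_i.

The sketch below illustrates steps 1, 2, 4 and 5 with the quanteda and data.table packages and a plain frequency-based backoff lookup. The deployed app uses Kneser-Ney smoothed probabilities rather than raw counts, and the object and file names here are illustrative assumptions.

library(quanteda)
library(data.table)

# Step 1: tokenize the cleaned sample into words
cleaned_text <- sapply(corpus, as.character)   # character vector from the cleaning sketch
toks <- tokens(cleaned_text, what = "word")

# Step 2: build n-gram frequency tables (shown here for bigrams and trigrams)
make_ngram_dt <- function(toks, n) {
  ng    <- tokens_ngrams(toks, n = n, concatenator = " ")
  freqs <- colSums(dfm(ng))
  dt    <- data.table(ngram = names(freqs), count = as.numeric(freqs))
  dt[, history  := sub(" [^ ]+$", "", ngram)]   # first n-1 words
  dt[, nextword := sub("^.* ", "", ngram)]      # last word
  setorder(dt, -count)
  dt
}
bigrams  <- make_ngram_dt(toks, 2)
trigrams <- make_ngram_dt(toks, 3)

# Step 4: store the tables locally so the online app can load them quickly
saveRDS(bigrams,  "bigrams.rds")
saveRDS(trigrams, "trigrams.rds")

# Step 5 (simplified): top n candidates for a phrase, backing off from
# trigrams to bigrams when the longer history has no match
predict_next <- function(phrase, n = 5) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  hits  <- trigrams[history == paste(words, collapse = " ")]
  if (nrow(hits) == 0) hits <- bigrams[history == tail(words, 1)]
  head(hits$nextword, n)
}
predict_next("thanks for the", n = 5)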