Final Project - Capstone: Next Word Prediction

Next Word Prediction

Ken Peters
date: 9/7/2020
autosize: true

A special thanks to our instructors

Course Instructors:

  • Jeff Leek - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
  • Roger Peng - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
  • Brian Caffo - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health

Project Overview:

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models.

The corpora were collected from publicly available sources by a web crawler and consist of three files, one each of blog posts, news articles, and tweets, all provided by SwiftKey.

We will use these for Next Word Prediction.

install.packages("kableExtra")
install.packages("stringi")
library(kableExtra)
library(stringi)

First we explore the datasets and determine each file's size, number of lines, and number of words (a code sketch follows the table below).

Table: Size, number of lines, and number of words for the four datasets

          Size (MB)   Number of lines   Number of words
  Blogs         200           899,288        38,154,238
  Twitter       159         2,360,148        30,218,125
  News          196            77,259         2,693,898
  All           555         3,336,695        71,066,261
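The figures above can be computed along these lines. This is a minimal sketch; the file names below are the usual SwiftKey download names and are assumed, not confirmed by the slides.

library(stringi)

files <- c(Blogs   = "en_US.blogs.txt",
           Twitter = "en_US.twitter.txt",
           News    = "en_US.news.txt")

summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(Size_in_MB      = round(file.info(path)$size / 1024^2),
    Number_of_Lines = length(lines),
    Number_of_Words = sum(stri_count_words(lines)))
}

stats <- t(sapply(files, summarize_file))
stats <- rbind(stats, All = colSums(stats))
knitr::kable(stats)   # kableExtra can then style the table for the slide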

Because the data is so large, we work with a random sample of only 1% of it.
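One simple way to draw that 1% sample is sketched below; the slides do not show the exact sampling code, and the object names blogs, news, and twitter are assumed to hold the full text of each file.

set.seed(1234)   # so the sample is reproducible
sample_lines <- function(lines, rate = 0.01) {
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}
sample_text <- c(sample_lines(blogs),     # blogs, news, twitter: character
                 sample_lines(news),      # vectors read in with readLines()
                 sample_lines(twitter))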
Next we clean and pre-process the sampled data (a code sketch follows the list below):

  • Convert to lower case
  • Remove Punctuation
  • Remove numbers
  • Remove whitespaces
  • We did not remove profanity because, in all our exploration of the data, none was found. Also, some profanity lists contain words such as “beer” or “weed”, and we did not want those removed
  • We did not remove stopwords, such as “the”, “and”, “a”, etc., because these are often exactly the next word in a phrase
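A minimal sketch of these cleaning steps using the tm package follows; tm is one reasonable choice, and the slides do not name the package actually used.

library(tm)

corpus <- VCorpus(VectorSource(sample_text))            # sample_text from the sampling step
corpus <- tm_map(corpus, content_transformer(tolower))  # convert to lower case
corpus <- tm_map(corpus, removePunctuation)             # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                 # remove numbers
corpus <- tm_map(corpus, stripWhitespace)               # collapse extra whitespace
# No profanity filter and no stopword removal, for the reasons listed above.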

Here's what a sample search might look like:

A screen capture of the Prediction page.

Give it a try

Link to Next Word prediction
A description of the algorithm is on the next slide

PREDICTION ALGORITHM

  1. First we tokenize the data. Tokenization is breaking a chunk of text into smaller parts; for us, it means breaking the text into words.
  2. Next we form, order, and assign probabilities to n-grams, i.e. phrases of length n (see the sketch after this list).
    • We use unigrams, bigrams, trigrams, quadgrams, fivegrams, and sixgrams
  3. We use Kneser-Ney smoothing, a method for calculating the probability distribution of n-grams in a document based on their histories. It is widely considered the most effective smoothing method because of its use of absolute discounting: a fixed discount is subtracted from each observed n-gram count, and the probability mass freed up is redistributed through lower-order (continuation) distributions. The approach works well for both higher- and lower-order n-grams. The method was proposed in a 1994 paper by Reinhard Kneser, Ute Essen and Hermann Ney. See this reference.
  4. We store the resulting n-gram data frames locally and load them for our online predictions, to save computation time and space.
  5. The user can choose how many predicted next words to see, for n = 5, 6, 7, 8, 9, or 10.
  6. And to “jazz” it up a little, the user can also view a radar plot of the top n predicted next words, for n = 5, 10, 15, 20, 25, or 30.
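For concreteness, in the bigram case the interpolated Kneser-Ney estimate has the standard textbook form (not copied from the app's code):

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - d,\ 0)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{cont}}(w_i)$$

where d is the fixed discount, lambda(w_{i-1}) is the weight that redistributes the discounted mass, and P_cont(w_i) is proportional to the number of distinct words that precede w_i.

The sketch below illustrates steps 1, 2, 4 and 5 with the quanteda and data.table packages and a plain frequency-based backoff lookup. The deployed app uses Kneser-Ney smoothed probabilities rather than raw counts, and the object and file names here are illustrative assumptions.

library(quanteda)
library(data.table)

# Step 1: tokenize the cleaned sample into words
cleaned_text <- sapply(corpus, as.character)   # character vector from the cleaning sketch
toks <- tokens(cleaned_text, what = "word")

# Step 2: build n-gram frequency tables (shown here for bigrams and trigrams)
make_ngram_dt <- function(toks, n) {
  ng    <- tokens_ngrams(toks, n = n, concatenator = " ")
  freqs <- colSums(dfm(ng))
  dt    <- data.table(ngram = names(freqs), count = as.numeric(freqs))
  dt[, history  := sub(" [^ ]+$", "", ngram)]   # first n-1 words
  dt[, nextword := sub("^.* ", "", ngram)]      # last word
  setorder(dt, -count)
  dt
}
bigrams  <- make_ngram_dt(toks, 2)
trigrams <- make_ngram_dt(toks, 3)

# Step 4: store the tables locally so the online app can load them quickly
saveRDS(bigrams,  "bigrams.rds")
saveRDS(trigrams, "trigrams.rds")

# Step 5 (simplified): top n candidates for a phrase, backing off from
# trigrams to bigrams when the longer history has no match
predict_next <- function(phrase, n = 5) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  hits  <- trigrams[history == paste(words, collapse = " ")]
  if (nrow(hits) == 0) hits <- bigrams[history == tail(words, 1)]
  head(hits$nextword, n)
}
predict_next("thanks for the", n = 5)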