Word Prediction Capstone Project

Jeff B
31 May 2020

Introduction

This application was built for the final Capstone project of the Johns Hopkins University Data Science: Statistics and Machine Learning Specialization on Coursera. This presentation provides a brief overview of the application, which is accessible here. It is structured as follows:

  1. The problem
  2. The data, including loading, tidying, and exploratory analysis
  3. The solution, namely the logic of the algorithm that turns the data into predictions
  4. The app, including features beyond the algorithm

The Problem

  • We've been tasked by SwiftyKey, a virtual keyboard app developer, to build a tool that predicts the next word in a string of text
  • For example, given the input “United States of”, the app can be expected to return “America”
  • In addition to predictive accuracy, the tool needs to balance size and speed, because users have limited storage and expect a prediction near-instantly as they type
  • To get started, three large sets of unprocessed text data have been provided, drawn from Twitter, news stories, and blogs

Summary of Data Sets for Corpus

Metric                       Twitter         News        Blogs
Total Lines (#)            2,360,148    1,010,242      899,288
Total Words (#)           30,093,413   34,762,395   37,546,239
Longest Line (chars)             140       11,384       40,833
Avg. Line Length (chars)       68.68       201.16       229.99
Unique Words (#)           1,554,362    1,066,687    1,352,044
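
For reference, statistics like these can be computed along the following lines. This is a sketch only: the raw file name (en_US.twitter.txt) and the use of the stringi package are assumptions, not necessarily how the original numbers were produced.

library(stringi)

# read one raw file (repeat for news and blogs); skipNul guards against embedded nulls
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

length(twitter)                                           # total lines
sum(stri_count_words(twitter))                            # total words
max(nchar(twitter))                                       # longest line, in characters
mean(nchar(twitter))                                      # average line length, in characters
length(unique(unlist(stri_extract_all_words(twitter))))   # unique word forms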

The Data

  • 80,000 lines randomly sampled from each full set to balance speed and accuracy
  • Sampled data combined into a single corpus and cleaned of profanity, punctuation, numbers, symbols, and URLs (see the sketch after this list)
  • quanteda package used to turn the corpus into 5 document-feature matrix (DFM) objects, for unigrams (1-word strings) up through five-grams (5-word strings)
  • Each DFM trimmed to remove extremely rare observations (appearing fewer than 2 or 3 times in total, depending on the n-gram)
  • quanteda then used to turn the DFMs into data.table frequency tables – the final model objects, which are much smaller and faster to query
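
As a rough sketch of the sampling and cleaning steps described above (the object names twitter, news, and blogs, the seed value, and the profanity word list are assumptions):

library(quanteda)
library(magrittr)

set.seed(2020)   # any fixed seed works; this value is an assumption
sampled <- c(sample(twitter, 80000),
             sample(news, 80000),
             sample(blogs, 80000))

tokensClean <- corpus(sampled) %>%
  tokens(remove_punct   = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE,
         remove_url     = TRUE) %>%        # strip punctuation, numbers, symbols, and URLs
  tokens_remove(pattern = profanity)       # profanity: assumed character vector of banned words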

Example 1: Creating a DFM from the corpus

dfm_trigram <- tokensClean %>%
  tokens_ngrams(n = 3) %>%          # form trigram (3-word) tokens
  dfm(tolower = TRUE) %>%           # build the document-feature matrix
  dfm_trim(min_termfreq = 3)        # drop trigrams appearing fewer than 3 times


Example 2: Creating a data.table object of a frequency table

gramfreq_tri <- data.table(textstat_frequency(dfm_trigram))[, 1:2]   # keep only the feature and frequency columns


Example of a frequency table:

  feature    frequency
1 one_of_the      2564
2 a_lot_of        2281
3 to_be_a         1245

The Solution

  • The data.table frequency tables enable a relatively simple solution
  • The input string is converted into a regex-formatted search term
  • The model then searches for occurrences of the regex in the frequency table whose n-grams are one word longer than the input
  • Simple probabilities for the results: each candidate's frequency over the total observations, discounted by 25%
  • An additional 25% discount is applied for each backoff
  • If there are no results, the first word of the input is removed and the search is rerun on the next table (a “back off” model)
  • After running through all frequency tables and/or all words of the input, the model returns a table of the 4 most frequent words overall, plus 1 randomly chosen word (see the sketch after the examples below)

Example: Search for “a case of”

First, the model converts the input into a regex-formatted n-gram search term

convertInput("a case of")
[1] "^a_case_of_"


If it doesn't find a match, it “backs off” the term and searches again

backoff_ngram("a case of") %>% convertInput
[1] "^case_of_"


It returns the results in a simple table with discounted probabilities

predGram("a case of")
# A tibble: 5 x 2
  nextword probability
  <chr>          <dbl>
1 the            0.214
2 a              0.107
3 beer           0.107
4 what           0.107
5 mistaken       0.107
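
Putting the pieces together, here is a minimal sketch of how the helpers and the table lookup might work. The function bodies below are reconstructions from the behavior shown above, not the app's actual code, and gramfreq_quad is an assumed name for the 4-gram frequency table.

library(data.table)

convertInput <- function(x) {
  # lowercase, split on whitespace, join with "_", and anchor at the start
  words <- tolower(strsplit(trimws(x), "\\s+")[[1]])
  paste0("^", paste(words, collapse = "_"), "_")
}

backoff_ngram <- function(x) {
  # drop the first word so the search can "back off" to a shorter n-gram
  words <- strsplit(trimws(x), "\\s+")[[1]]
  paste(words[-1], collapse = " ")
}

# search the table one word longer than the input (4-grams for a 3-word input)
hits <- gramfreq_quad[grepl(convertInput("a case of"), feature)]
hits[, nextword    := sub(".*_", "", feature)]             # last token is the predicted word
hits[, probability := 0.75 * frequency / sum(frequency)]   # 25% discount per backoff level

The fixed 25% discount per backoff step acts like a “stupid backoff” score: it keeps matches from longer n-grams ranked above backed-off matches without computing fully smoothed probabilities.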

The App

  • Interface is simple and self-explanatory – the user types in a string, and the app returns a table of results
  • Option to include or exclude stopwords
  • First click of each session is somewhat slow while the entire app loads – subsequent clicks are quite fast
  • Overall accuracy of the model is so-so – even including the full dataset could not make every test prediction correctly

App in action