Word Prediction Capstone Project

Jeff B
31 May 2020

Introduction

This application was built for the final Capstone project of the Johns Hopkins University Data Science: Statistics and Machine Learning Specialization on Coursera. This presentation provides a brief overview of the application, which is accessible here. It is structured as follows:

  1. The problem
  2. The data, including loading, tidying, and exploratory analysis
  3. The solution, namely the logic of the algorithm that turns the data into predictions
  4. The app, including features beyond the algorithm

The Problem

  • We've been tasked by SwiftyKey, a virtual keyboard app developer, to build a tool that predicts the next word in a string of text
  • For example, given the input “United States of”, the app can be expected to return “America”
  • In addition to predictive accuracy, the tool needs to balance size and speed, because users have limited storage and expect a prediction near-instantly as they type
  • To get started, three large sets of unprocessed text data have been provided, drawn from Twitter, news stories, and blogs

Summary of Data Sets for Corpus

Metric                       Twitter         News        Blogs
Total Lines (#)            2,360,148    1,010,242      899,288
Total Words (#)           30,093,413   34,762,395   37,546,239
Longest Line (chars)             140       11,384       40,833
Avg. Line Length (chars)       68.68       201.16       229.99
Unique Words (#)           1,554,362    1,066,687    1,352,044
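
For reference, statistics like these can be computed along the following lines. This is a sketch only: the raw file name (en_US.twitter.txt) and the use of the stringi package are assumptions, not necessarily how the original numbers were produced.

library(stringi)

# read one raw file (repeat for news and blogs); skipNul guards against embedded nulls
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

length(twitter)                                           # total lines
sum(stri_count_words(twitter))                            # total words
max(nchar(twitter))                                       # longest line, in characters
mean(nchar(twitter))                                      # average line length, in characters
length(unique(unlist(stri_extract_all_words(twitter))))   # unique word forms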

The Data

  • 80,000 lines randomly sampled from each full set to balance speed and accuracy
  • Sampled data combined into a single corpus and cleaned of profanity, punctuation, numbers, symbols, and URLs (see the sketch after this list)
  • quanteda package used to turn the corpus into 5 document-feature matrix (DFM) objects, for unigrams (1-word strings) up through five-grams (5-word strings)
  • Each DFM trimmed to remove extremely rare observations (appearing fewer than 2 or 3 times in total, depending on the n-gram)
  • quanteda then used to turn the DFMs into data.table frequency tables – the final model objects, which are much smaller and faster to query
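
As a rough sketch of the sampling and cleaning steps described above (the object names twitter, news, and blogs, the seed value, and the profanity word list are assumptions):

library(quanteda)
library(magrittr)

set.seed(2020)   # any fixed seed works; this value is an assumption
sampled <- c(sample(twitter, 80000),
             sample(news, 80000),
             sample(blogs, 80000))

tokensClean <- corpus(sampled) %>%
  tokens(remove_punct   = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE,
         remove_url     = TRUE) %>%        # strip punctuation, numbers, symbols, and URLs
  tokens_remove(pattern = profanity)       # profanity: assumed character vector of banned words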

Example 1: Creating a DFM from the corpus

dfm_trigram <- tokensClean %>%
  tokens_ngrams(n = 3) %>%          # form trigram (3-word) tokens
  dfm(tolower = TRUE) %>%           # build the document-feature matrix
  dfm_trim(min_termfreq = 3)        # drop trigrams appearing fewer than 3 times


Example 2: Creating a data.table object of a frequency table

gramfreq_tri <- data.table(textstat_frequency(dfm_trigram))[, 1:2]   # keep only the feature and frequency columns


Example of a frequency table:

  feature    frequency
1 one_of_the      2564
2 a_lot_of        2281
3 to_be_a         1245

The Solution

  • The data.table frequency tables enable a relatively simple solution
  • The input string is converted into a regex-formatted search term
  • The model then searches for occurrences of the regex in the frequency table whose n-grams are one word longer than the input
  • Simple probabilities for the results: each candidate's frequency over the total observations, discounted by 25%
  • An additional 25% discount is applied for each backoff
  • If there are no results, the first word of the input is removed and the search is rerun on the next table (a “back off” model)
  • After running through all frequency tables and/or all words of the input, the model returns a table of the 4 most frequent words overall, plus 1 randomly chosen word (see the sketch after the examples below)

Example: Search for “a case of”

First, the model converts the input into a regex-formatted n-gram search term

convertInput("a case of")
[1] "^a_case_of_"


If it doesn't find a match, it “backs off” the term and searches again

backoff_ngram("a case of") %>% convertInput
[1] "^case_of_"


It returns the results in a simple table with discounted probabilities

predGram("a case of")
# A tibble: 5 x 2
  nextword probability
  <chr>          <dbl>
1 the            0.214
2 a              0.107
3 beer           0.107
4 what           0.107
5 mistaken       0.107
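
Putting the pieces together, here is a minimal sketch of how the helpers and the table lookup might work. The function bodies below are reconstructions from the behavior shown above, not the app's actual code, and gramfreq_quad is an assumed name for the 4-gram frequency table.

library(data.table)

convertInput <- function(x) {
  # lowercase, split on whitespace, join with "_", and anchor at the start
  words <- tolower(strsplit(trimws(x), "\\s+")[[1]])
  paste0("^", paste(words, collapse = "_"), "_")
}

backoff_ngram <- function(x) {
  # drop the first word so the search can "back off" to a shorter n-gram
  words <- strsplit(trimws(x), "\\s+")[[1]]
  paste(words[-1], collapse = " ")
}

# search the table one word longer than the input (4-grams for a 3-word input)
hits <- gramfreq_quad[grepl(convertInput("a case of"), feature)]
hits[, nextword    := sub(".*_", "", feature)]             # last token is the predicted word
hits[, probability := 0.75 * frequency / sum(frequency)]   # 25% discount per backoff level

The fixed 25% discount per backoff step acts like a “stupid backoff” score: it keeps matches from longer n-grams ranked above backed-off matches without computing fully smoothed probabilities.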

The App

  • Interface is simple and self-explanatory – the user types in a string, and the app returns a table of results
  • Option to include or exclude stopwords
  • First click of each session is somewhat slow while the entire app loads – subsequent clicks are quite fast
  • Overall accuracy of the model is so-so – even including the full dataset could not make every test prediction correctly

App in action