Word Prediction Algorithm

September 4th 2017

Coursera Data Science Capstone Project

Motivation and Approach

Natural Language Processing is a field of data science concerned with building models and software to improve computer-human interaction. A commercial example is a next-word prediction app for mobile devices, such as SwiftKey (https://swiftkey.com/en).

As the capstone project for the Coursera Data Science Certification, a Shiny app has been developed that uses a simple n-gram model to predict the next word in a sentence. Emphasis has been placed on the speed of the prediction algorithm, since a user who is typing will not wait several seconds for a suggestion.

N-Gram Prediction Tables

  1. Data is imported, cleaned, and sampled.
  2. N-gram tables are generated by creating document-feature matrices and tabulating next-word frequency (example below).
  3. N-gram tables are split into initial/final words, compiled, and exported as a .csv file to be read in by the app.
# Packages used in this excerpt:
library(quanteda)    # dfm()
library(data.table)  # data.table()
# Create a document-feature matrix of 5-word phrases
# (the ngrams argument is accepted by older quanteda releases; newer ones build n-grams with tokens_ngrams()):
five_grams <- dfm(all_sample, ngrams = 5, verbose = FALSE)
# Sum occurrences of each 5-word phrase across all documents, and store as a data.table:
five_freq <- as.data.frame(colSums(five_grams, na.rm = TRUE))
fivegrams_Frequency <- data.table(NGram = rownames(five_freq), Frequency = five_freq[, 1])
# Keep n-grams that appear more than once:
fivegrams_Frequency <- fivegrams_Frequency[Frequency > 1]
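
Step 3 above (splitting each n-gram into its initial words and its final word) is not shown in the excerpt. A minimal sketch of how it might look follows, using base R only so that it runs without extra packages. The helper name split_ngrams, the "_" separator (quanteda's default n-gram concatenator), the space-joined Initial column, and the toy counts are all assumptions made for illustration, not the author's code or data.

```r
# Hypothetical helper: split "w1_w2_..._wn" into the first n-1 words (Initial)
# and the last word (Final). The "_" separator is an assumption.
split_ngrams <- function(freq_table, sep = "_") {
  parts <- strsplit(freq_table$NGram, sep, fixed = TRUE)
  data.frame(
    # All words except the last, re-joined with spaces
    Initial = vapply(parts, function(p) paste(head(p, -1), collapse = " "), character(1)),
    # The last word of the n-gram
    Final = vapply(parts, function(p) tail(p, 1), character(1)),
    Frequency = freq_table$Frequency,
    stringsAsFactors = FALSE
  )
}

# Toy frequency table with made-up counts, for illustration only
toy <- data.frame(
  NGram = c("it_would_mean_the_world", "i_get_what_i_want"),
  Frequency = c(3L, 2L),
  stringsAsFactors = FALSE
)
ngram_split <- split_ngrams(toy)
print(ngram_split)
```

The resulting table could then be written out with write.csv(ngram_split, "ngrams.csv", row.names = FALSE) so that the app can read the Initial, Final, and Frequency columns.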

Model Implementation

  1. Read in the ngrams.csv table
  2. Clean the user text input, and compare the last 4 words against the ngrams data table
  3. Return the most frequent 5th word. If there are no results, repeat this algorithm using only the last 3 words of the user input. Continue dropping an initial word until a result is found.
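# An excerpt of the model function predictword()
library(data.table)  # fread()
library(tm)          # removePunctuation(), stripWhitespace()
library(quanteda)    # char_tolower()
library(stringr)     # word()
# Read the n-gram table and normalise the user input:
ngramtables <- fread("ngrams.csv")
input <- removePunctuation(input)
input <- stripWhitespace(input)
input <- char_tolower(input)
# Take the last four words of the input and look up the matching 5-grams:
fourwordsample <- word(input, start = -4, end = -1, sep = " ")
return5 <- ngramtables[Initial == fourwordsample]
# Keep the most frequent final word:
return5 <- head(return5[order(-Frequency), Final], 1)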
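
The excerpt covers only the four-word lookup. The back-off described in step 3 (dropping an initial word until a match is found) could be sketched as below. This is an illustration in base R, not the app's actual implementation: the function name predict_backoff, the max_words parameter, and the toy table with made-up counts are assumptions, and the input is assumed to be already cleaned and lower-cased.

```r
# Hypothetical back-off lookup: try the last 4 words, then 3, 2, 1.
# ngramtables is assumed to have columns Initial, Final and Frequency.
predict_backoff <- function(input, ngramtables, max_words = 4) {
  words <- strsplit(trimws(input), " +")[[1]]
  words <- words[nzchar(words)]
  n <- min(max_words, length(words))
  while (n >= 1) {
    # Build the lookup key from the last n words of the input
    key <- paste(tail(words, n), collapse = " ")
    hits <- ngramtables[ngramtables$Initial == key, , drop = FALSE]
    if (nrow(hits) > 0) {
      # Return the most frequent final word for this key
      return(hits$Final[which.max(hits$Frequency)])
    }
    n <- n - 1  # no match: drop the earliest word and retry
  }
  NA_character_   # nothing matched at any n-gram order
}

# Toy table with made-up counts, for illustration only
toy_table <- data.frame(
  Initial   = c("it would mean the", "mean the", "mean the", "the"),
  Final     = c("world", "same", "world", "end"),
  Frequency = c(3L, 5L, 2L, 9L),
  stringsAsFactors = FALSE
)

predict_backoff("it would mean the", toy_table)   # matched on four words
predict_backoff("this does mean the", toy_table)  # backs off to "mean the"
predict_backoff("zzz", toy_table)                 # no match at any order
```

Because every order of n-gram shares one Initial column in this sketch, each back-off step is a single equality filter on the same table, which keeps the lookup simple.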

Shiny App

The published Shiny app can be found at this link: https://dboucher.shinyapps.io/n-gram_word_prediction/

Below is an example of the model prediction:

predictword("it would mean the")

Read 77.7% of 2186699 rows
Read 98.3% of 2186699 rows
Read 2186699 rows and 3 (of 3) columns from 0.050 GB file in 00:00:04
[1] "world"
predictword("can i get what i")
[1] "want"