Introduction

The objective of the Coursera Data Science Capstone Project is to design a predictive text model using n-grams. Ultimately, a Shiny web application will be produced that can predict the next word the user is most likely to type, based on previously-typed words. This report lays the groundwork for development of the Shiny application.

N-grams are sequences of words in a body of text, or corpus. The “n” in n-gram refers to the number of words in the n-gram. Thus a 2-gram is a sequence of two words, a 3-gram is a sequence of three words, and so on. The n-gram predictive text model reduces the complexity of natural language to the most recent few words typed by a user, and uses those words to predict the next word in the sequence (see n-gram).
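
To make this concrete, here is a minimal sketch (using an invented sentence rather than text from the corpus) of how the quanteda package, used later in this report, builds n-grams from tokens:

library(quanteda)

# Invented example sentence, used only to illustrate what n-grams look like.
toks <- tokens("thanks so much for the follow")

# 2-grams: "thanks_so" "so_much" "much_for" "for_the" "the_follow"
tokens_ngrams(toks, n = 2)

# 3-grams: "thanks_so_much" "so_much_for" "much_for_the" "for_the_follow"
tokens_ngrams(toks, n = 3)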

The core of our model is a data.table with three columns: (1) the first (n-1) words of the n-gram (the predictor), (2) the predicted word (the last word of the n-gram), and (3) the frequency of occurrence of the n-gram in the corpus. The data.table was processed so that each row contains the most frequently occurring predicted word for each (n-1)-gram, and the column containing the (n-1)-grams was indexed (keyed) to provide near-instantaneous searching. We created 2- through 5-grams for the model, allowing up to four previously typed words to be used in predicting the next word. The model uses the Stupid Backoff algorithm to handle the situation in which the exact phrase typed by the user is not found: the algorithm “backs off” to the next lower n-gram and searches again, continuing until it is determined that no match is possible, at which point the most common word in the English language (“the”) is returned as a “best guess” prediction.
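
For concreteness, the sketch below (with made-up rows and counts, not actual corpus frequencies) illustrates the general shape of this data.table and how an indexed (keyed) lookup behaves:

library(data.table)

# Hypothetical rows illustrating the three columns: predictor ((n-1)-gram),
# predicted (last word of the n-gram), and frequency (count in the corpus).
ngrams <- data.table(
  predictor = c("thanks for the", "at the end of", "one of"),
  predicted = c("follow",         "the",           "the"),
  frequency = c(1234,             987,             4321)
)

# Index (key) the predictor column so lookups use a fast binary search.
setkey(ngrams, predictor)

ngrams["at the end of"]   # match found: predicted word is "the"
ngrams["no such phrase"]  # no match: non-key columns are returned as NA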

R packages used in this project include quanteda, readtext, knitr, pander, readr, data.table, and dplyr.
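
The code chunks that follow assume these packages have already been loaded, for example with a setup chunk along these lines:

library(quanteda)
library(readtext)
library(knitr)
library(pander)
library(readr)
library(data.table)
library(dplyr)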

Data Processing, Construction of N-grams, and Exploratory Analysis

The data provided for this project consisted of text files in several languages downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Three English-language files located in the sub-directory ./final/en_US/ of the downloaded zip archive were used to build the n-grams for the predictive text model. The files are: en_US.blogs.txt (blog postings), en_US.news.txt (news feeds), and en_US.twitter.txt (Twitter posts).

As a first step towards exploration of the text data sets we used the Unix wc utility to produce Table 1, which shows the number of lines, words, and characters in each file.

system("wc -mwl ./final/en_US/*.txt > table1.txt")
table1 <- read_table2("table1.txt", 
                      col_names = c("Lines", "Words", "Characters", "Text File"), n_max = 3)
table1 <- table1[c("Text File", "Lines", "Words", "Characters")]
kable(table1, caption = "Table 1. Preliminary analysis of text files used in building the corpus.")
Table 1. Preliminary analysis of text files used in building the corpus.
Text File                         Lines      Words       Characters
./final/en_US/en_US.blogs.txt     899288     37334147    208623081
./final/en_US/en_US.news.txt      1010242    34372530    205243643
./final/en_US/en_US.twitter.txt   2360148    30373603    166816544

We next read the three files into a text corpus using the quanteda package. Table 2 shows summary statistics for each text in the corpus.

# Create corpus and produce a summary table.
# Write results to data files to save future processing time.

if (!file.exists("theCorpus.rds")) {
    saveRDS(corpus(readtext("./final/en_US/*.txt")), file = "theCorpus.rds") 
} 

theCorpus <- readRDS("theCorpus.rds")

if (!file.exists("summaryOfCorpus.rds")) {
    saveRDS(summary(theCorpus, showmeta = TRUE, tolower = TRUE), file = "summaryOfCorpus.rds")
} 

summaryOfCorpus <- readRDS("summaryOfCorpus.rds")
kable(summaryOfCorpus, caption = "Table 2. Summary of text datasets as contained in the quanteda corpus.")
Table 2. Summary of text datasets as contained in the quanteda corpus.
Text                 Types     Tokens      Sentences
en_US.blogs.txt      403130    42840140    2072941
en_US.news.txt       380001    39918314    1867522
en_US.twitter.txt    442205    36719658    2588551

From Table 2 we can see that the full corpus contains a total of 6,529,014 sentences.

One of the requirements of this project was to produce a text prediction algorithm implemented as a Shiny app running on the free version of shinyapps.io. Because of memory limits on the Shiny web platform (the app and its data must fit into approximately 1 GB of memory), we decided to work with samples of the full corpus when constructing our n-gram model. Also, because natural language is organized into sentences, we restructured the text corpus into sentences to facilitate construction of n-grams. Working with sentences as the basic units of text ensured that n-grams were constructed within sentences and did not cross sentence boundaries, which would have produced nonsense n-grams. We also chose not to eliminate stop words or filter profanity from the corpus, again because doing so would break the natural flow of language and interfere with the construction of n-grams.
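
As a small illustration of why the reshaping matters (the two-sentence snippet below is invented for demonstration and is not taken from the corpus), tokenizing each sentence as its own document keeps n-grams from spanning sentence boundaries:

library(quanteda)

# Invented two-sentence text, used only to illustrate sentence reshaping.
demo <- corpus("We had a late dinner. The movie started at nine.")

# Reshape so that each sentence becomes its own document.
demoSentences <- corpus_reshape(demo, to = "sentences")

# 2-grams are now built within sentences only; without the reshape, a
# nonsense 2-gram joining "dinner" and "The" would cross the boundary.
tokens_ngrams(tokens(demoSentences, remove_punct = TRUE), n = 2)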

The remainder of this report presents and discusses two R code chunks, labeled Code Chunk #1 and Code Chunk #2 in the text. The first chunk creates an .rds data file containing the data.table discussed in the Introduction. The second code chunk implements the n-gram word prediction model.

Code Chunk #1: The following code produces a data.table for use by the n-gram prediction model.

# Remove `eval=FALSE` to run the code when the document is knit.

# This code produces an Rdata file named `ngrams.rds`, containing a data.table 
# with (n-1)-grams, their predicted words, and frequencies. The data.table is 
# loaded by the word prediction model for use in n-gram lookup.

quanteda_options(threads = 4)

# Function to create the predictor from an n-gram.
# Trims the final word from a string of words; a single word is returned unchanged.
# Parameter s is a string containing one or more words.
notLast <- function(s) {
  sp <- unlist(strsplit(s, " "))
  if (length(sp) == 1){
    return(sp)
  } else {
    return(paste(sp[-length(sp)], collapse = " "))
  }          
}

# Timing.
start_time <- Sys.time()
print(paste0("Start time: ", start_time))

# Create the corpus.
message(Sys.time(), " Creating corpus.")
theCorpus <- corpus(readtext("./final/en_US/*.txt"))

# Remove non-ASCII characters.
message(Sys.time(), " Removing non-ASCII characters.")
texts(theCorpus) <- iconv(texts(theCorpus), from = "UTF-8", to = "ASCII", sub = "")

# Reshape the corpus into sentences.
message(Sys.time(), " Reshaping the corpus into sentences.")
theCorpus <- corpus_reshape(theCorpus, to = "sentences")

# Create a 3% random sample of sentences in the corpus.
message(Sys.time(), " Creating a random sample of sentences.")
set.seed(42) # See goo.gl/Y1K9xV.
theCorpus <- corpus_sample(theCorpus,
                           size = 0.03 * ndoc(theCorpus))

# Tokenize the corpus, and do some cleaning.
message(Sys.time(), " Tokenizing.")
tk <- tokens(theCorpus, what = 'word',
             remove_numbers = TRUE,
             remove_punct = TRUE,
             remove_symbols = TRUE,
             remove_separators = TRUE,
             remove_twitter = TRUE,
             remove_hyphens = TRUE,
             remove_url = TRUE)

# Convert tokens to lowercase.
message(Sys.time(), " Converting tokens to lowercase.")
tk <- tokens_tolower(tk)

# Make n-grams.
message(Sys.time(), " Making n-grams.")
ng <- tokens_ngrams(tk, n = 2:5, concatenator = " ")

# DFM of the n-grams.
message(Sys.time(), " Making document-feature matrix.")
dfmNGrams <- dfm(ng)

# Data frame containing n-gram frequencies.
message(Sys.time(), " Computing n-gram frequencies.")
dfmNGramsDF <- textstat_frequency(dfmNGrams)

# Convert the data frame to a data.table.
message(Sys.time(), " Creating data.table.")
ngrams <- as.data.table(dfmNGramsDF)

# Extract predicted word from n-grams.
message(Sys.time(), " Extracting predicted words.")
ngrams$predicted <- unlist(lapply(strsplit(ngrams[[1]], split=" "), last))

# Create predictor column from n-grams.
message(Sys.time(), " Creating predictor column and creating data.table key.")
ngrams$predictor <- sapply(strsplit(ngrams[[1]], split=" "), notLast)

# Drop unneeded columns and set data.table key.
message(Sys.time(), " Dropping unneeded columns, setting data.table key.")
ngrams <- select(ngrams, predictor, predicted, frequency)
setkey(ngrams, predictor)

# For each n-gram, select rows having the predicted word 
# with the highest frequency. Cast the resulting tbl as a data.table.
message(Sys.time(), " Selecting highest frequency rows.")
ngrams <- ngrams %>%
  group_by(predictor) %>% 
  arrange(desc(frequency)) %>%
  slice(1) %>%
  ungroup()

message(Sys.time(), " Converting tbl to data.table, setting key.")
ngrams <- as.data.table(ngrams)
setkey(ngrams, predictor)

# Save the data.table to disk.
message(Sys.time(), " Writing data.table to disk.")
saveRDS(ngrams, file = "ngrams.rds")

# Timing.
end_time <- Sys.time()
print(paste0("End time: ", end_time))
print(end_time - start_time)

# Cleanup.
# rm(list=ls())
# gc()
# .rs.restartR()

If this code chunk is run, it will eventually produce a binary .rds data file called ngrams.rds containing the data.table used by the prediction algorithm. The code produces a time-stamped message at each processing step so that progress can be tracked. Table 3 shows the time sequence of a typical processing run producing 2- through 5-grams from a 3% sample of sentences in the corpus. Table 3 also serves as a summary of the steps leading from the full text corpus to a final data.table that can be loaded by the prediction algorithm, and eventually, by the Shiny app.

Table 3. Typical processing run to produce a data.table containing
(n-1)-grams, predicted words, and frequencies for use by the word
prediction algorithm. This run produced 2- through 5-grams, based 
on a 3% sample of sentences in the corpus.

[1] "Start time: 2019-03-28 15:16:26"
2019-03-28 15:16:26 Creating corpus.
2019-03-28 15:16:54 Removing non-ASCII characters.
2019-03-28 15:16:59 Reshaping the corpus into sentences.
2019-03-28 15:19:07 Creating a random sample of sentences.
2019-03-28 15:19:07 Tokenizing.
2019-03-28 15:19:15 Converting tokens to lowercase.
2019-03-28 15:19:15 Making n-grams.
2019-03-28 15:19:40 Making document-feature matrix.
2019-03-28 15:19:56 Computing n-gram frequencies.
2019-03-28 15:20:35 Creating data.table.
2019-03-28 15:20:36 Extracting predicted words.
2019-03-28 15:25:06 Creating predictor column and creating data.table key.
2019-03-28 15:26:57 Dropping unneeded columns, setting data.table key.
2019-03-28 15:27:04 Selecting highest frequency rows.
2019-03-28 15:29:23 Converting tbl to data.table, setting key.
2019-03-28 15:29:24 Writing data.table to disk.
[1] "End time: 2019-03-28 15:29:33"
Time difference of 13.11839 mins

Keeping in mind the memory limits of the Shiny web platform, we tried producing data files for different sizes of random samples of sentences. A 30% sample caused out-of-memory errors on the desktop iMac used for this project. By trial and error we produced a series of data files from smaller random samples of sentences, listed in Table 4.

Table 4. Summary of processing runs for different sizes of random samples of sentences 
from the full corpus, including production of 2- through 5-grams.

Sample   Processing time   N-grams      File size   Load time
1%        5.9 min           1,875,528    15.6 MB      1.5 sec
3%       12.1 min           5,082,029    42.4 MB      9.2 sec
5%       18.6 min           8,047,876    67.2 MB     13.1 sec
7%       28.3 min          10,886,419    90.8 MB     19.1 sec
10%      40.1 min          14,964,752   124.8 MB     31.0 sec

While none of the sample sizes in Table 4 caused out-of-memory errors on our iMac, we still needed to determine which files would load successfully on shinyapps.io. Using a prototype version of our Shiny app, we tried each of the data files listed in Table 4 on the free version of shinyapps.io. Files larger than the 3% sample (42.4 MB) failed to load, giving out-of-memory errors. The data file based on a 3% sample of sentences from the corpus was therefore the largest file that would load successfully on shinyapps.io, and is the file we ended up using for our Shiny app. That file contains 5,082,029 n-grams and takes around 9 seconds to load from disk into memory, which we considered reasonable from a user-experience standpoint.
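
The file sizes and load times reported in Table 4 can be checked with a few lines of base R; the sketch below (shown for the 3% file) is one way such measurements might be made:

# Rough measurements for a given data file (here the 3% sample from Code Chunk #1).
file.size("ngrams.rds") / 1e6                             # approximate file size in MB
system.time(ngrams <- readRDS("ngrams.rds"))["elapsed"]   # load time in seconds
nrow(ngrams)                                              # number of (n-1)-gram rows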

Figures 1 through 4 show the most frequent words and n-grams in the 3% sample of sentences.

tk <- readRDS("tk.rds")
textplot_wordcloud(dfm(tk), max_words = 100, min_size = 2, max_size = 6)

Figure 1. Wordcloud of the most frequent 100 words from the 3% sample of sentences in the corpus.

ng <- readRDS("ng.rds")
textplot_wordcloud(dfm(ng), max_words = 100, min_size = 1.3, max_size = 5)

Figure 2. Wordcloud of the most frequent 100 two- through five-grams from the 3% sample of sentences in the corpus.

tk <- readRDS("tk.rds")
barplot(topfeatures(dfm(tk), n = 20), las=2)

Figure 3. The 20 most frequent words in the 3% sample of sentences in the corpus.

ng <- readRDS("ng.rds")
barplot(topfeatures(dfm(ng), n = 20), las=2)

Figure 4. The 20 most frequent n-grams in the 3% sample of sentences in the corpus.

Our decision not to remove stop words and profanity from the text corpus is reflected in Figures 1 through 4. The most common word is “the”, followed by other short connective words such as “to”, “and”, and “a”. All of the most common n-grams are 2-grams, the most frequent being “of the” and “in the”. Higher-order n-grams are less frequent, which is not surprising: it reflects the tradeoff between the greater predictive power of higher-order n-grams and their sparser occurrence in the corpus.

Model Building and Word Prediction

Code Chunk #2: The following code implements the n-gram word prediction model, using Stupid Backoff.

# Remove `eval=FALSE` to run the code when the document is knit.

# Predicts next word from n-grams contained
# in a data.table produced by the previous code chunk.
# If you execute this code chunk it will run in
# the R console.

# Function parseWords accepts words typed by the user,
# converts them to lower case, and
# returns only up to the last 4 words.
# Parameter s is a character string
# typed by the user.
parseWords <- function(s) {
  sp <- unlist(strsplit(tolower(s), " "))
  if (length(sp) > 4){
    sp <- sp[(length(sp) - 3):length(sp)]
  } 
  return(paste(sp, collapse = " "))
}

# Function leftTrim removes the first word of an n-gram.
# Parameter s is a character string.
leftTrim <- function(s) {
  sp <- unlist(strsplit(tolower(s), " "))
  if (length(sp) > 1){
    sp <- sp[-1]
  } 
  return(paste(sp, collapse = " "))
}

cat("\014") # Clear the console.

if (!exists("ngrams")) { 
  message("Initializing, please wait (~15 seconds).")
  ngrams <- readRDS("ngrams.rds") # 3% sample, 2- through 5-grams.
}

words <- readline("Type some words, then hit Enter: ")

while(words == "") {
  message("\nPlease enter at least one word.\n")
  words <- readline("Type some words, then hit Enter: ")
}

words <- gsub("[[:punct:]]", "", words) # Remove punctuation.
words <- parseWords(words)
print(paste0("Entered words: ", words))

# Stupid Backoff. See goo.gl/EzvZcy .
repeat{
  if (!is.na(ngrams[words][1, 2])) { # Found a prediction.
    print(paste0("For: ", words))
    print(paste0("Prediction is: ", ngrams[words][1, 2]))
    break
  } else if (is.na(ngrams[words][1, 2]) &
             length(unlist(strsplit(words, " "))) == 1) { # Prediction not found.
    print("the")
    break
  } else { # Back off to next lower n-gram and search again.
    words <- leftTrim(words)
    print(paste0("Backed off to: ", words))
    next
  }
} # repeat

Chunk #2 uses two custom functions, parseWords and leftTrim, to aid in processing n-grams input by the user. Function parseWords accepts words typed by the user, converts them to lower case, and returns only up to the last 4 words. This ensures that user input is compatible with the 5-gram word prediction algorithm. Function leftTrim removes the first word of an n-gram and is central to our implementation of the Stupid Backoff algorithm.
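
For example, given the definitions above, the two helpers behave as follows:

# Invented input, used only to illustrate the helper functions.
parseWords("The Quick Brown Fox Jumps")   # "quick brown fox jumps" (lowercased, last 4 words kept)
leftTrim("quick brown fox jumps")         # "brown fox jumps" (first word removed)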

The code then loads the data.table from the disk file ngrams.rds, which takes around 15 seconds on average on the free version of shinyapps.io. The user is then prompted for input; if the user hits Enter without typing any words, they are prompted again until a non-empty string is submitted. Upon submission of valid input, punctuation is removed and the user-entered text is passed to the parseWords function. The resulting n-gram, which is at most a 4-gram, is then processed by the Stupid Backoff algorithm: a repeat loop that searches the ngrams data.table for a matching (n-1)-gram. If a match is found, the corresponding predicted word is returned and the backoff history displayed. If no match is found, the n-gram is trimmed by the leftTrim function and the search repeated (the backoff step). This continues until either a match is found or the n-gram has been reduced to a single word with no match, at which point the best-guess word “the” is returned.

Next Steps

The main work that remains is to implement the n-gram word prediction model as a Shiny app. We will also experiment with different sizes of data files to determine the best balance between the size of the sample of corpus sentences and the loading time of the Shiny app, so as to ensure a reasonable user experience.