Executive Summary

This report outlines exploratory data analysis for the Coursera Capstone project in the Data Science Specialization. The utilized data set can was downloaded from the link below.

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

In this article, the Data Import and Tokenization methods are explained and some key features of the text files are evaluated. Finally, the proposed prediction model methodology is explained and previewed. All the developed functions used in this study are listed at the end of this article as an appendix.

Data Import

The “getImportDt” function below imports all three text files from the working directory into a data.table.

allLinesDt<- getImportDt(NULL, NULL)

Number of lines observed for each imported text file
Text Doc	Number of Lines
en_US.blogs.txt	899,288
en_US.news.txt	1,010,242
en_US.twitter.txt	2,360,148

Based on the table above we can see that the twitter text file has a lot more lines than the news file, followed by the blogs file. Considering the large size of these text files, a smaller subset of the data is required to further evaluate them, as follows.

linesnmax = 50000
nsamp = 5000
set.seed(34341)
dtSubset <- getImportDt(linesnmax = linesnmax, nsamp = nsamp)
docsOrig <- dtToQcorp(dtSubset)

For testing purposes and due to RAM limitations, only the first 50000 lines were read from each input file and then only a small subset (5,000 samples per input file) was selected from those lines, with equal number of samples per input file.

The “dtToQcorp” function converts the data.table into a corpus using the quanteda package. The tm package was initially utilized for text processing/analysis but the switch to quanteda was made due to it’s faster speed.

Below, a document feature matrix is created using the quanteda package and the number of total words per document is sumamrized.

dfmWrds <- dfm(docsOrig)
numwords <- ntoken(dfmWrds)

Number of lines observed for each imported text file
Text Doc File	No. of words	No. of lines read
en_US.blogs.txt	237,319	5,000
en_US.news.txt	200,046	5,000
en_US.twitter.txt	76,987	5,000

Based on the table above we can see that blogs have a lot more words per lines read, followed by news articles, followed by tweets. Typically, blogs tend to have longer texts compared to news. And tweets are limited in characters and so would be expected to have the least number of words per line.

Tokenization

To further analyze the text in each text file, n-grams were extracted from the document corpus. The “Get_qdfm” function receives a quanteda corpus and returns a n-gram tokenized document-feature matrix. Here are some assumptions made during tokenization of ngrams:

The text is first divided into individual sentences before tokenizing into ngrams, in order to avoid tokenizing words from different sentences.
Stopwords are not removed for our analysis because our goal is word prediction (including stopword prediction) and not so much understanding context.
Punctuation and extra whitespace is removed from the ngrams.
Profanity (only explicit swear words) are removed from the ngrams.

For further manipulation, the document-feature matrices are converted into ordered frequency data.tables using the “getOrderedFreqDt”. All three text documents are summarized into one document-feature data table using this function. A spell check option is built into this custom function such that ngrams with any misspelled words are removed from the data table.

Misspelled and Non-english Words

It is possible some lines in the text corpora are misspelled or that they are not in English. For prediction model development it is not desirable to predict misspelled or non-English words for the user.

Using the hunspell package, it was found that out of 35,366 unique words, 12,299 were not in the English dictionary.
In addition, the cld3 package was used to evaluate each line from the dataset. 12,898 of the 15,000 total lines were detected to be English. However, the accuracy of the cld3 package functions would require further validation.

Based on the results above, going forward, all n-grams that contain words that are not in the English dictionary based on the hunspell package, will be removed from the document-term data tables.

Analysis of Single words, 2-grams, and 3-grams

# Generate ngrams for subset of data
freqWordsDt <- getOrderedFreqDt(Get_qdfm(docsOrig, n = 1, removeStpwrds = F), spellCheck = T)
freqBigramDt <- getOrderedFreqDt(Get_qdfm(docsOrig, n = 2, removeStpwrds = F), spellCheck = T)
freqTrigramDt <- getOrderedFreqDt(Get_qdfm(docsOrig, n = 3, removeStpwrds = F), spellCheck = T)

# Extract stems from words
stem_doc <- stemDocument(freqWordsDt$term)
stems <- unique(stem_doc)

# Determine unique words required for 50% and 90% of all words
freqWordsDt[,coverage:=cumsum(freq)/sum(freq)]
nwords.5 = nrow(freqWordsDt[coverage<.5])
nwords.9 = nrow(freqWordsDt[coverage<.9])

The frequencies of the top single, 2-grams, and 3-grams of the text of all three documents combined are summarized in the graphs above. Based on this analysis:

There are a total of 23,458 unique words in the subset of this corpora
The coverage graph shows the number of unique words vs the number of word instances that they cover. The top 117 unique words cover 50% of all word instances and the top 4,665 unique words cover 90% of all word instances
By stemming, the number of unique words can be reduced down to 14,110 stem words. However, considering our goal is prediction of words and not prediction of stems or related words, stemming will not be used going forward in the the prediction models.
We notice that the majority of the top words and n-grams are actually considered “stopwords” and this is acceptable because prediction of stopwords is also within the scope of our prediction model.

Comparison of Text Documents

The term frequencey graphs below compare the words used between the three texts document. To evaluate the context of the words used here, stopwords are removed from the frequencies shown in the graphs below.

It’s also interesting to note that the top word for news articles is ‘said’ and the top 2 words for tweets are ‘just’ and ‘like’

n-gram prediction model development

The function “getProbMatrix” was developed to generate a list of ngram probability matrices by:

Generating ngram frequencies
Determining probability of the last word occuring in each n-gram term given the string of words prior to the last word.
Storing these probabilities in a list for each n-gram. For example if the maximum n is determined to be 3, a list variable of 3 ngram probability matrices are returned: a unigram matrix, a bigram matrix, and a trigrammatrix.
A “coverage” option is provided to the user if only a subset of the ngrams is required to cover x percent of all the word instances. This criteria will be tweaked during the final prediction model development.
Efficiently return the resulting matrices as “data.table” to be stored with minimum size.

The function “predictNxtWrd” was developed to predict the next word provided a series of words by:

Receiving a stored probability matrix containing n-gram probabilities.
Reformatting the phrase provided by the user as input (i.e., spell check, select the last (n-1) number of words in phrase, ensure selected words start from beginning of sentence)
By searching for the phrase in the n-gram probabilities and determining the most probable next word.
Using the back-off model, if the input phrase is not found in the n-gram, the (n-1)-gram probability is searched for a matching term and if not the (n-2)-gram and so on.
Returning a list of the top 5 list of next words in order starting with the most probable.

Model Evaluation and Future Development

Going forward I intend to take the following steps to further evaluate and develop the prediction model:

Calculate “perplexity” based on a given test sample set.
Evaluate effect of parameters (highest n-gram to use and % coverage to use) on perplexity.
Evaluate effect of parameters on the size of resulting probability matrix and the run time of the prediction model.
Incorporate the “stupid back-off” model into determining n-gram probabilities
Incorporate “smoothing” of probabilities for unobserved n-grams in the test data.

Appendix: All custom functions used in analysis

# Function used to load text files in current directory as data.table
getImportDt <- function(linesnmax = 50000, nsamp = 2000, txtfilename = NULL){
  # linesnmax and nsamp can take on NULL values which mean they are infinity in this function
  # If txtfilename is not defined, all files in current directory are read
  if (is.null(txtfilename)) {
    filenames <- list.files(".") #read filenames from directory
  } else {
    filenames <- txtfilename
  }
  dtread <- NULL
  dtread <- data.table(file = character(), lines = character())
  # Only read the first linesnmax lines of each text doc
  for (filename in filenames) {
    if (is.null(linesnmax)){
      lins <- readr::read_lines(filename) 
    } else {
      lins <- readr::read_lines(filename, n_max = linesnmax)
    }
    newdt <- data.table(file = filename, lines = lins)
    dtread <- rbind(dtread, newdt)
  }
  dtread$file<-as.factor(dtread$file)
  # Take sample of nsamp size from the read data 
  if (!is.null(nsamp)) dtread <- dtread[, .SD[sample(.N, nsamp)], by = file]
  return(dtread)
}

# This function converts the data table generated from getImportDt to a quanteda corpus
dtToQcorp <- function(dtinput){
  docs <- NULL
  for (filename in levels(dtinput$file)){
    subdt <- dtinput[file == filename]
    doc_str <- paste(subdt$lines, collapse = "\n")
    tempdoc <- corpus(doc_str, docnames = filename)
    if (is.null(docs)) {docs <- tempdoc
    } else {docs <- c(docs, tempdoc)}
  }
  return(docs)
}

# This function generates a document-feature matrix based on a quanteda corpus
# To avoid tokenizing words from different sentences, first all lines are tokenized into sentences
Get_qdfm <- function(qcorp, n = 1, removeStpwrds = F){
  sens <- unlist(lapply(tokenize_lines(as.String(qcorp)), tokenize_sentences),
                 use.names = F)
  # Generate a list of explicit swear words
  stpwrds <- readLines("http://www.bannedwordlist.com/lists/swearWords.txt", warn = F)
  if (removeStpwrds == T) stpwrds <- c(stpwrds, stopwords())
  # The tokenize_ngrams function automatically removes punct and extra whitespace
  ngrams <- unlist(tokenize_ngrams(sens, n=n, lowercase = T, stopwords = stpwrds))
  ngrams <- ngrams[!is.na(ngrams)]
  dfm(as.tokens(list(ngrams)))
}

# This function generates an ordered term-frequency data table based on a quanteda dfm
# The function combines all docs in dfm, info about individual docs is lost
getOrderedFreqDt <- function(dfminput, spellCheck = T){
  dt <- data.table(convert(dfminput, "data.frame"))
  dt <- dt[,-c(1)] # remove first "document" column
  freqv <- colSums(dt)
  freqdt <- data.table(term = names(freqv), freq = freqv)
  if (spellCheck == T) {
    freqdt[, wrongTerms := hunspell(term)]
    freqdt[, correctSpell := identical(wrongTerms[[1]], character(0)), by= 1:nrow(freqdt)]
    freqdt <- freqdt[correctSpell==T, c("term", "freq")]
  }
  freqdt[!grepl("[0-9]", term)] #removes all numbers from ngrams
  setorder(freqdt, -freq)
  return(freqdt)
}

# Function for plotting terms vs occurrences
plot_occurrences <- function(freqDt, nwords = 20){
  wf <- data.frame(term = freqDt$term,
                   occurrences = freqDt$freq,
                   row.names = 1:nrow(freqDt))[1:nwords,]
  ggplot(wf, aes(term, occurrences)) +
    geom_bar(stat="identity") +
    coord_flip() +
    scale_x_discrete(limits=wf$term)
}

# This function splits sentence by their last n number of words and is used in following funcs
getSplitSent <- function(sen, nwrds){
  wrds <- unlist(strsplit(sen, split = " "))
  lastNwrds <- paste(tail(wrds,nwrds), collapse = " ")
  remaining <- paste(head(wrds,length(wrds)-nwrds), collapse = " ")
  c(remaining, lastNwrds)
}

# This function generates a list of freq/probabilitty matrices based on input quanteda corpus
getProbMatrix <- function(inputDocs, maxngram = 3, bareMatOnly = F, coverage = 1.0) {
  freqList <- list()
  freqList[[1]] <- maxngram
  # Convert frequency vectors into data tables and split words in ngram cols
  for (num in 2:maxngram){
    dt <- getOrderedFreqDt(Get_qdfm(inputDocs, n=num), spellCheck = T)
    if (coverage > 0.0 & coverage < 1.0) {
      dt[,sumFreq:=cumsum(freq)]
      dt <- dt[sumFreq<coverage*sum(freq)]
      dt <- dt[,-c("sumFreq")]
    }
    dt[,remainingTerm := getSplitSent(term, 1)[1], by= 1:nrow(dt)]
    dt[,lastWrd := getSplitSent(term, 1)[2], by= 1:nrow(dt)]
    dt[,rTermFreq := sum(freq), by = .(remainingTerm)]
    # Calculate probabilities and logs
    dt[,p := freq / rTermFreq]
    dt[,logp := log(p)]
    if (bareMatOnly == T) dt[,c("term", "freq", "rTermFreq", "p"):=NULL]
    freqList[[num]] <- dt
  }
  return(freqList)
}

# This function cleans and reformats an input sentences used to clean user input before predicting
Clean_Str <- function(inputstr, removeStpwrds = F){
  corpus <- VCorpus(VectorSource(inputstr),
                       readerControl = list(reader=readPlain, language = "en"))
  # Lowercase
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove numbers
  corpus <- tm_map(corpus, removeNumbers)
  # Remove explicitly profane words
  profanity <- readLines("http://www.bannedwordlist.com/lists/swearWords.txt", warn = F)
  corpus <- tm_map(corpus, removeWords, profanity)
  # Remove extra whitespace BUT maintain \n line breaks
  whitespaceFUN <- content_transformer(function(x) gsub("[ ]+", " ",as.String(x)))
  corpus <- tm_map(corpus, whitespaceFUN)
  # Remove stop words if applicable
  if (removeStpwrds == T) {
    corpus <- tm_map(corpus, removeWords, words = stopwords("en"))
    }
  return(corpus[[1]]$content)
}

# This function suggest the most probable words to the user based on their input phrase
predictNxtWrd <- function(inputpmat, inputsent) {
  # determine starting n based on length of input sentence and maxngram in prob matrix
  numWrds <- length(strsplit(inputsent, split = " ")[[1]])
  maxngram <- inputpmat[[1]]
  n <- numWrds + 1
  if (n > maxngram) n <- maxngram
  # Use "Backoff" to determine probabilities
  #TODO: develop an actual "stupid backoff" model uses 0.4 x probability of next ngram down to calculate probabilities
  for (i in n:2){
    lastNWrds_str <- getSplitSent(inputsent, i - 1)[2]
    # format input sentence
    lastNWrds_str <- Clean_Str(lastNWrds_str)
    # select corresponding ngram matrix
    pdt <- inputpmat[[i]]
    subpdt <- pdt[remainingTerm == lastNWrds_str]
    # if no match  for last-n-words then use next ngram down, otherwise break loop
    if (nrow(subpdt)!=0) break
  }
  
  #TODO: optimize so whole list doesn't have to be ordered just to get top words
  setorderv(subpdt, c("logp"), c(-1))
  predictTop <- subpdt$lastWrd[1:5]
}

Text Prediction using n-gram Models: Part 1 - Exploratory Data Analysis

Amir Manafpour

5/20/2020