Executive Summary

This report outlines exploratory data analysis for the Coursera Capstone project in the Data Science Specialization. The utilized data set can was downloaded from the link below.

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

In this article, the Data Import and Tokenization methods are explained and some key features of the text files are evaluated. Finally, the proposed prediction model methodology is explained and previewed. All the developed functions used in this study are listed at the end of this article as an appendix.

Data Import

The “getImportDt” function below imports all three text files from the working directory into a data.table.

allLinesDt<- getImportDt(NULL, NULL)
Number of lines observed for each imported text file
Text Doc Number of Lines
en_US.blogs.txt 899,288
en_US.news.txt 1,010,242
en_US.twitter.txt 2,360,148

Based on the table above we can see that the twitter text file has a lot more lines than the news file, followed by the blogs file. Considering the large size of these text files, a smaller subset of the data is required to further evaluate them, as follows.

linesnmax = 50000
nsamp = 5000
set.seed(34341)
dtSubset <- getImportDt(linesnmax = linesnmax, nsamp = nsamp)
docsOrig <- dtToQcorp(dtSubset)

For testing purposes and due to RAM limitations, only the first 50000 lines were read from each input file and then only a small subset (5,000 samples per input file) was selected from those lines, with equal number of samples per input file.

The “dtToQcorp” function converts the data.table into a corpus using the quanteda package. The tm package was initially utilized for text processing/analysis but the switch to quanteda was made due to it’s faster speed.

Below, a document feature matrix is created using the quanteda package and the number of total words per document is sumamrized.

dfmWrds <- dfm(docsOrig)
numwords <- ntoken(dfmWrds)
Number of lines observed for each imported text file
Text Doc File No. of words No. of lines read
en_US.blogs.txt 237,319 5,000
en_US.news.txt 200,046 5,000
en_US.twitter.txt 76,987 5,000

Based on the table above we can see that blogs have a lot more words per lines read, followed by news articles, followed by tweets. Typically, blogs tend to have longer texts compared to news. And tweets are limited in characters and so would be expected to have the least number of words per line.

Tokenization

To further analyze the text in each text file, n-grams were extracted from the document corpus. The “Get_qdfm” function receives a quanteda corpus and returns a n-gram tokenized document-feature matrix. Here are some assumptions made during tokenization of ngrams:

For further manipulation, the document-feature matrices are converted into ordered frequency data.tables using the “getOrderedFreqDt”. All three text documents are summarized into one document-feature data table using this function. A spell check option is built into this custom function such that ngrams with any misspelled words are removed from the data table.

Misspelled and Non-english Words

It is possible some lines in the text corpora are misspelled or that they are not in English. For prediction model development it is not desirable to predict misspelled or non-English words for the user.

  • Using the hunspell package, it was found that out of 35,366 unique words, 12,299 were not in the English dictionary.
  • In addition, the cld3 package was used to evaluate each line from the dataset. 12,898 of the 15,000 total lines were detected to be English. However, the accuracy of the cld3 package functions would require further validation.

Based on the results above, going forward, all n-grams that contain words that are not in the English dictionary based on the hunspell package, will be removed from the document-term data tables.

Analysis of Single words, 2-grams, and 3-grams

# Generate ngrams for subset of data
freqWordsDt <- getOrderedFreqDt(Get_qdfm(docsOrig, n = 1, removeStpwrds = F), spellCheck = T)
freqBigramDt <- getOrderedFreqDt(Get_qdfm(docsOrig, n = 2, removeStpwrds = F), spellCheck = T)
freqTrigramDt <- getOrderedFreqDt(Get_qdfm(docsOrig, n = 3, removeStpwrds = F), spellCheck = T)

# Extract stems from words
stem_doc <- stemDocument(freqWordsDt$term)
stems <- unique(stem_doc)

# Determine unique words required for 50% and 90% of all words
freqWordsDt[,coverage:=cumsum(freq)/sum(freq)]
nwords.5 = nrow(freqWordsDt[coverage<.5])
nwords.9 = nrow(freqWordsDt[coverage<.9])

The frequencies of the top single, 2-grams, and 3-grams of the text of all three documents combined are summarized in the graphs above. Based on this analysis:

Comparison of Text Documents

The term frequencey graphs below compare the words used between the three texts document. To evaluate the context of the words used here, stopwords are removed from the frequencies shown in the graphs below.

It’s also interesting to note that the top word for news articles is ‘said’ and the top 2 words for tweets are ‘just’ and ‘like’

n-gram prediction model development

The function “getProbMatrix” was developed to generate a list of ngram probability matrices by:

The function “predictNxtWrd” was developed to predict the next word provided a series of words by:

Model Evaluation and Future Development

Going forward I intend to take the following steps to further evaluate and develop the prediction model:

Appendix: All custom functions used in analysis

# Function used to load text files in current directory as data.table
getImportDt <- function(linesnmax = 50000, nsamp = 2000, txtfilename = NULL){
  # linesnmax and nsamp can take on NULL values which mean they are infinity in this function
  # If txtfilename is not defined, all files in current directory are read
  if (is.null(txtfilename)) {
    filenames <- list.files(".") #read filenames from directory
  } else {
    filenames <- txtfilename
  }
  dtread <- NULL
  dtread <- data.table(file = character(), lines = character())
  # Only read the first linesnmax lines of each text doc
  for (filename in filenames) {
    if (is.null(linesnmax)){
      lins <- readr::read_lines(filename) 
    } else {
      lins <- readr::read_lines(filename, n_max = linesnmax)
    }
    newdt <- data.table(file = filename, lines = lins)
    dtread <- rbind(dtread, newdt)
  }
  dtread$file<-as.factor(dtread$file)
  # Take sample of nsamp size from the read data 
  if (!is.null(nsamp)) dtread <- dtread[, .SD[sample(.N, nsamp)], by = file]
  return(dtread)
}

# This function converts the data table generated from getImportDt to a quanteda corpus
dtToQcorp <- function(dtinput){
  docs <- NULL
  for (filename in levels(dtinput$file)){
    subdt <- dtinput[file == filename]
    doc_str <- paste(subdt$lines, collapse = "\n")
    tempdoc <- corpus(doc_str, docnames = filename)
    if (is.null(docs)) {docs <- tempdoc
    } else {docs <- c(docs, tempdoc)}
  }
  return(docs)
}
# This function generates a document-feature matrix based on a quanteda corpus
# To avoid tokenizing words from different sentences, first all lines are tokenized into sentences
Get_qdfm <- function(qcorp, n = 1, removeStpwrds = F){
  sens <- unlist(lapply(tokenize_lines(as.String(qcorp)), tokenize_sentences),
                 use.names = F)
  # Generate a list of explicit swear words
  stpwrds <- readLines("http://www.bannedwordlist.com/lists/swearWords.txt", warn = F)
  if (removeStpwrds == T) stpwrds <- c(stpwrds, stopwords())
  # The tokenize_ngrams function automatically removes punct and extra whitespace
  ngrams <- unlist(tokenize_ngrams(sens, n=n, lowercase = T, stopwords = stpwrds))
  ngrams <- ngrams[!is.na(ngrams)]
  dfm(as.tokens(list(ngrams)))
}

# This function generates an ordered term-frequency data table based on a quanteda dfm
# The function combines all docs in dfm, info about individual docs is lost
getOrderedFreqDt <- function(dfminput, spellCheck = T){
  dt <- data.table(convert(dfminput, "data.frame"))
  dt <- dt[,-c(1)] # remove first "document" column
  freqv <- colSums(dt)
  freqdt <- data.table(term = names(freqv), freq = freqv)
  if (spellCheck == T) {
    freqdt[, wrongTerms := hunspell(term)]
    freqdt[, correctSpell := identical(wrongTerms[[1]], character(0)), by= 1:nrow(freqdt)]
    freqdt <- freqdt[correctSpell==T, c("term", "freq")]
  }
  freqdt[!grepl("[0-9]", term)] #removes all numbers from ngrams
  setorder(freqdt, -freq)
  return(freqdt)
}

# Function for plotting terms vs occurrences
plot_occurrences <- function(freqDt, nwords = 20){
  wf <- data.frame(term = freqDt$term,
                   occurrences = freqDt$freq,
                   row.names = 1:nrow(freqDt))[1:nwords,]
  ggplot(wf, aes(term, occurrences)) +
    geom_bar(stat="identity") +
    coord_flip() +
    scale_x_discrete(limits=wf$term)
}
# This function splits sentence by their last n number of words and is used in following funcs
getSplitSent <- function(sen, nwrds){
  wrds <- unlist(strsplit(sen, split = " "))
  lastNwrds <- paste(tail(wrds,nwrds), collapse = " ")
  remaining <- paste(head(wrds,length(wrds)-nwrds), collapse = " ")
  c(remaining, lastNwrds)
}

# This function generates a list of freq/probabilitty matrices based on input quanteda corpus
getProbMatrix <- function(inputDocs, maxngram = 3, bareMatOnly = F, coverage = 1.0) {
  freqList <- list()
  freqList[[1]] <- maxngram
  # Convert frequency vectors into data tables and split words in ngram cols
  for (num in 2:maxngram){
    dt <- getOrderedFreqDt(Get_qdfm(inputDocs, n=num), spellCheck = T)
    if (coverage > 0.0 & coverage < 1.0) {
      dt[,sumFreq:=cumsum(freq)]
      dt <- dt[sumFreq<coverage*sum(freq)]
      dt <- dt[,-c("sumFreq")]
    }
    dt[,remainingTerm := getSplitSent(term, 1)[1], by= 1:nrow(dt)]
    dt[,lastWrd := getSplitSent(term, 1)[2], by= 1:nrow(dt)]
    dt[,rTermFreq := sum(freq), by = .(remainingTerm)]
    # Calculate probabilities and logs
    dt[,p := freq / rTermFreq]
    dt[,logp := log(p)]
    if (bareMatOnly == T) dt[,c("term", "freq", "rTermFreq", "p"):=NULL]
    freqList[[num]] <- dt
  }
  return(freqList)
}

# This function cleans and reformats an input sentences used to clean user input before predicting
Clean_Str <- function(inputstr, removeStpwrds = F){
  corpus <- VCorpus(VectorSource(inputstr),
                       readerControl = list(reader=readPlain, language = "en"))
  # Lowercase
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove numbers
  corpus <- tm_map(corpus, removeNumbers)
  # Remove explicitly profane words
  profanity <- readLines("http://www.bannedwordlist.com/lists/swearWords.txt", warn = F)
  corpus <- tm_map(corpus, removeWords, profanity)
  # Remove extra whitespace BUT maintain \n line breaks
  whitespaceFUN <- content_transformer(function(x) gsub("[ ]+", " ",as.String(x)))
  corpus <- tm_map(corpus, whitespaceFUN)
  # Remove stop words if applicable
  if (removeStpwrds == T) {
    corpus <- tm_map(corpus, removeWords, words = stopwords("en"))
    }
  return(corpus[[1]]$content)
}

# This function suggest the most probable words to the user based on their input phrase
predictNxtWrd <- function(inputpmat, inputsent) {
  # determine starting n based on length of input sentence and maxngram in prob matrix
  numWrds <- length(strsplit(inputsent, split = " ")[[1]])
  maxngram <- inputpmat[[1]]
  n <- numWrds + 1
  if (n > maxngram) n <- maxngram
  # Use "Backoff" to determine probabilities
  #TODO: develop an actual "stupid backoff" model uses 0.4 x probability of next ngram down to calculate probabilities
  for (i in n:2){
    lastNWrds_str <- getSplitSent(inputsent, i - 1)[2]
    # format input sentence
    lastNWrds_str <- Clean_Str(lastNWrds_str)
    # select corresponding ngram matrix
    pdt <- inputpmat[[i]]
    subpdt <- pdt[remainingTerm == lastNWrds_str]
    # if no match  for last-n-words then use next ngram down, otherwise break loop
    if (nrow(subpdt)!=0) break
  }
  
  #TODO: optimize so whole list doesn't have to be ordered just to get top words
  setorderv(subpdt, c("logp"), c(-1))
  predictTop <- subpdt$lastWrd[1:5]
}