Introduction

This document addresses the Milestone task of the JHU/Coursera Data Science Capstone. In it, the data necessary for building a word prediction algorithm are acquired, cleaned, and analysed for interesting patterns.

Acquisition and summary

The data downloaded from the course website were composed of twelve .txt files: tweets, blog entries and news articles, from four different countries. Only those samples originating from the US were considered in this analysis. The acquisition procedure is reproduced in the appendix to this document.

The data were first analysed to determine, for each of the three data sets, the number of entries (lines of text), the size in memory, and the number of words. The results are displayed below.
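The basicSummary object printed below is a helper built for the report. A minimal reconstruction of it, assuming the three corpora have been read into the character vectors twitter, blogs, and newsText (summariseSet() is a hypothetical name introduced here), might look like this:

library(stringi)
summariseSet<-function(x){
        # entries, in-memory size, and total word count of one corpus
        c(numEntries = length(x), 
          sizes = format(object.size(x), units = "Mb"), 
          numWords = sum(stri_count_words(x)))
}
basicSummary<-t(sapply(list(twitter = twitter, 
                            blogs = blogs, 
                            newsText = newsText), 
                       summariseSet))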

basicSummary
##          numEntries    sizes numWords
## twitter     2360148 301.4 Mb 30373605
## blogs        899288 248.5 Mb 37334149
## newsText    1010242 249.6 Mb 34372814

Sample Size Selection

From the summary, we can see that the three data sets contain, on average, about \(3.4026856\times10^{7}\) individual words each. To improve the accuracy of the text prediction app that is the eventual goal of this Capstone, it is desirable to capture as many of those words as possible while using the smallest sample possible. However, many of those words are bound to be extraordinarily rare, or misspellings.

Arguably, the best way to select the sample size would be to observe that word frequencies in a natural language obey Zipf’s law, take enough samples to estimate its parameters, and derive the required sample size from them. A much lazier approach (taken here) is to simply take increasingly large samples of the data sets, compute the number of unique words in each sample, and plot the results. From this, an educated guess can be made about the number of unique words in very large samples of text.
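A minimal sketch of that lazier approach, assuming the full corpora are loaded as twitter, blogs, and newsText (countUnique() is a name introduced here purely for illustration):

library(stringi)
countUnique<-function(text, fraction){
        # sample a fraction of the lines and count the distinct words in them
        idx<-sample(seq_along(text), size = floor(fraction*length(text)))
        words<-unlist(stri_extract_all_words(tolower(text[idx])))
        length(unique(words))
}
allText<-c(twitter, blogs, newsText)
fractions<-seq(0.01, 0.20, by = 0.01)
uniqueCounts<-sapply(fractions, function(f){countUnique(allText, f)})
plot(fractions, uniqueCounts, type = "b", 
     xlab = "sample fraction", ylab = "unique words")
abline(h = 210000, col = "red")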

From the plot, it appears that the number of unique words even in extremely large samples does not exceed 210000 (indicated by the red line); 95% of that threshold is \(1.995\times10^{5}\) and 90% is \(1.89\times10^{5}\), which are in turn achieved by sample sizes of 199413 and 189833 respectively. The lower threshold is easily obtainable with a ten percent sample of each of the three data sets; when this sample is cleaned of profanities and filtered of punctuation, the resulting 184514 unique words are 87.86% of 210000. Stratified sampling (a ten percent sample drawn from each of the three data sets separately; see appendix) was used to ensure that the sample was as representative of all the text as possible. The samples were stored to three Rdata objects that were used for further analysis.

Cleaning and n-gram construction

Following Eng and Eisner, and in recognition of the fact that sentence-initial words do not actually follow sentence-terminal words but rather a full stop, periods in the sample were replaced by an end-of-phrase token prior to cleaning and n-gram construction.
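For example, the eop() helper defined in the appendix turns each full stop into the token:

eop("I came. I saw. I left.")
# expected result: "I came EOP I saw EOP I left EOP"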

A list of profanities was obtained from https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en . It was used to replace profanities in the samples with blanks. Each profanity was wrapped in word-boundary markers to ensure that non-profane words that contain the spelling of a swear word, such as “assume” or “class”, were not filtered. The resulting clean text was then further filtered to remove URLs, Twitter handles, and, even though they do carry meaning, emoticons. The resulting list was stored to an Rdata object textData.Rdata, and was henceforth considered to be the primary data set.
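To illustrate the word-boundary wrapping with a single (mild) pattern:

# "ass" is removed only when it stands alone; "assume", "class" and "pass" 
# survive because the match must begin and end at a word boundary
gsub(pattern = "\\bass\\b", replacement = "", x = "assume the class will pass")
# expected result: "assume the class will pass" (unchanged)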

The data were converted into a corpus, which was then deconstructed into \(n\)-grams (where \(n=1,2,3\)) and arranged into three data frames. Each observation in each data frame is a unique phrase of \(n\) words together with its frequency. The twenty most commonly occurring \(n\)-grams are displayed below.

The packages used for cleaning, tokenization, and display were tm, SnowballC, qdap, qdapDictionaries, dplyr, RWeka, stringi, stringr, data.table, and Rgraphviz.
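A sketch of how the three data frames could be built from the cleaned sample, using the getNGramDF() helper defined in the appendix (the exact corpus-construction call is an assumption):

library(tm)
load("textData.Rdata")
corpus<-VCorpus(VectorSource(unlist(data)))
unigramDF<-getNGramDF(corpus, n = 1)
bigramDF<-getNGramDF(corpus, n = 2)
trigramDF<-getNGramDF(corpus, n = 3)
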
displayCounts(unigramDF)

displayCounts(bigramDF)

displayCounts(trigramDF)

Analysis

By far, the most surprising observation is that the number of end-of-phrase tokens (eop) is significantly higher than that of any other token. The frequency of ellipses (represented by eop eop eop trigrams) eclipses the frequency of the next most frequent trigram, “one of the”. This seems to be an overwhelmingly strong artifact of modern written indirect style…you know what I mean? Otherwise, the results do not violate any expectations about the English language: articles and conjunctions are far more frequent than any other words, and prepositional phrases make up the bulk of the most frequent bigrams and trigrams.

Steps towards a predictive algorithm

The structure of the data immediately suggests a probabilistic approach to word prediction. For example, the unigramDF data set asserts that given no words, the most likely following words are: eop, the, and, for, that, you.

To understand the remaining data sets, the phrases were divided into a history and a following word. The probability of a phrase occurring given a particular history was then calculated; the result for the first 10 bigram phrases is shown below.

bigramHist<-createHistory(bigramDF[1:10, ])
melted<-melt(data = bigramHist, id.vars = c("history", "toPredict", "prob"))
dcast(melted, history+prob~toPredict)
##    history       prob be eop i the
## 1      at  0.04720306  0   0 0   2
## 2     eop  0.06875480  0   0 0   2
## 3     eop  0.09121542  0   0 2   0
## 4     eop  0.24915035  0   2 0   0
## 5     for  0.06814084  0   0 0   2
## 6      in  0.13793987  0   0 0   2
## 7      of  0.14405538  0   0 0   2
## 8      on  0.06611522  0   0 0   2
## 9      to  0.05506862  2   0 0   0
## 10     to  0.07235643  0   0 0   2

These data tell us, for example, that in this subset of the data, given a history of “to”, the phrase “to be” occurs with probability of about 0.06, and the phrase “to the” occurs with probability of about 0.07. Thus, given the input “to”, “the” would be suggested as a first choice and “be” as a second, while words such as “i” and “eop” would not be suggested at all.

If the user then extends the history to include one of the suggested words, the trigram data frame suggests:

## Aggregation function missing: defaulting to length
##    history       prob a as be dont eop eops have i is know of the think to
## 29  to be  0.02144156 2  0  0    0   0    0    0 0  0    0  0   0     0  0
##    was you
## 29   0   0

It can be observed that the probability of a particular following word given this richer history differs from its probability given the shorter history.

The initial idea for a text prediction algorithm follows from these observations. Despite the massive size of the data matrices, their sparsity should allow for efficient storage. Each subsequent word suggestion should be informed by the preceding history. This is not the only option, however: a statistical approach such as a tree-based method or multiclass logistic regression also seems appropriate for automating learning in this problem. It is clear, though, that further data engineering needs to take place before that can be done.
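A minimal sketch of the look-up idea, assuming bigramHist and trigramHist are history/toPredict/prob data frames produced by createHistory() (see appendix); predictNext() and its arguments are names introduced here only for illustration:

predictNext<-function(input, bigramHist, trigramHist, nSuggestions = 3){
        words<-strsplit(tolower(trimws(input)), "\\s+")[[1]]
        # try the last two words of the input against the trigram histories
        if(length(words) >= 2){
                hist2<-paste(tail(words, 2), collapse = " ")
                hits<-trigramHist[trigramHist$history == hist2, ]
                if(nrow(hits) > 0){
                        hits<-hits[order(hits$prob, decreasing = TRUE), ]
                        return(head(hits$toPredict, nSuggestions))
                }
        }
        # back off to the last word of the input against the bigram histories
        hits<-bigramHist[bigramHist$history == tail(words, 1), ]
        hits<-hits[order(hits$prob, decreasing = TRUE), ]
        head(hits$toPredict, nSuggestions)
}
# e.g. predictNext("to", bigramHist, trigramHist) should suggest "the" then "be"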

Appendix

  1. Functions used for data obtention
getdata<-function(fileUrl, dir, filename, ext){
        # create directory, if it is not already present
        dirName<-paste(dir, sep = "")
        if(!file.exists(dirName)){
                dir.create(path = dirName)
        }
        # Get the data, unless this step has already been done
        dest<-paste("./", 
                    dirName,"/", 
                    filename, 
                    ext, 
                    sep = "")
        if(!file.exists(dest)){
                download.file(url = fileUrl, 
                              destfile = dest, 
                              method = "curl") 
                datedownloaded<-date()
        }
        print(dest)
        dest
}
# data obtention
fileURL1 <- 
        "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip" 
dataset<-getdata(fileUrl = fileURL1, 
                 dir = "swiftkey", 
                 filename = "dataset", 
                 ext = ".zip")
if(!exists("swiftKey")){
        swiftKey<-unzip(zipfile = dataset)
}
save(swiftKey, file = "swiftKey.Rdata")

# profanity obtention
fileURL2<-
        "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
if(!file.exists("./badwords/badwords.txt")){
        profanities<-getdata(fileUrl = fileURL2, 
                             dir = "badwords",
                             filename = "badwords", 
                             ext = ".txt")
} else {
        profanities<-"./badwords/badwords.txt"
}
badwords<-fread(profanities, 
                header = F, 
                stringsAsFactors = T, 
                sep = "\n")
save(badwords, file = "badwords.Rdata")
  2. Procedure used to get a sample of the data
set.seed(123)
smallTwitter<-twitter[sample(1:length(twitter), 
                             size = floor(.1*length(twitter)), 
                             replace = F)]
set.seed(321)
smallBlogs<-blogs[sample(1:length(blogs), 
                        size = floor(.1*length(blogs)), 
                        replace = F)]
set.seed(213)
smallNews<-newsText[sample(1:length(newsText), 
                           size = floor(.1*length(newsText)), 
                           replace = F)]
  3. Procedure used to filter profanities, URLs, Twitter handles and emoticons
text<-list("twitter"=smallTwitter, 
           "blogs"=smallBlogs, 
           "news"=smallNews)
load("badwords.Rdata")
badwordsList<-paste("\\b", 
                    badwords$V1[-grep(pattern = " ", badwords$V1)], 
                    "\\b",
                    sep = "", 
                    collapse = "|")
cleanText<-sapply(X = text, 
                  function(x){gsub(pattern = badwordsList, 
                                   replacement = "", 
                                   x = x)})
# replace periods with an end-of-phrase token
eop<-function(text){
        gsub(pattern = ".", replacement = " EOP", x = text, fixed = T)
}
eopText<-sapply(X = cleanText, FUN = eop)

# remove all non-ASCII characters from the EOP-tagged text
text<-sapply(X = eopText, 
             function(x){iconv(x, 
                               "latin1", 
                               "ASCII", 
                               sub = "")})

# url filtering
filterURL<-function(text){
        # replace anything that looks like a URL with a space
        gsub(pattern = "https?://[^[:space:]]+", replacement = " ", x = text)
}
noURLs<-sapply(X = text, FUN = filterURL)

# twitter name filtering
filterTwittNames<-function(text){
        # replace @handles with a space
        gsub(pattern = "@[^[:space:]]+", 
             replacement = " ", 
             x = text)
}
noHandles<-sapply(X = noURLs, FUN = filterTwittNames)



# emoticon filtering

# the emoticon lexicon comes from the qdapDictionaries package; each emoticon 
# is quoted with \Q...\E so that its punctuation is matched literally
emoticons<-emoticon$emoticon
emoticonList<-paste("\\Q", 
                    emoticons, 
                    "\\E", 
                    sep = "", 
                    collapse = "|")
data<-sapply(X = noHandles, 
             function(x){gsub(pattern = emoticonList, 
                              replacement = "", 
                              x = x, 
                              perl = T, 
                              ignore.case = T)})
save(data, file = "textData.Rdata")
  4. Functions used for tokenization and the display of \(n\)-gram frequencies
# getNGramDF() returns a data frame of the frequencies of occurrences of n-grams 
# in the corpus, in descending order
getNGramDF<-function(corpus, n){
        corpus<-tm_map(corpus, removeNumbers)
        tokenizer<-function(x){
                NGramTokenizer(x, 
                               Weka_control(min = n, max = n))
        }
        options(mc.cores=1)
        tdm <- TermDocumentMatrix(corpus, 
                                  control = list(tokenize = tokenizer))
        freq<-rowSums(as.matrix(tdm))
        orderedNGrams<-freq[order(freq, decreasing = T)]
        NGramDF<-as.data.frame(orderedNGrams)
        data.frame(phrase=rownames(NGramDF), 
                   freq=NGramDF$orderedNGrams)
}

# displayCounts() returns a bar chart displaying the top 20 most frequently occurring n-grams

displayCounts<-function(NGramDF){
        print(smallDF<-NGramDF[1:20, ])
        # order the phrases by decreasing frequency so the bars are sorted
        smallDF$phrase<-factor(x = smallDF$phrase, 
                               levels = smallDF$phrase[order(smallDF$freq, decreasing = T)])
        ggplot(data = smallDF, aes(x = phrase, y = freq)) +
                geom_bar(stat="identity") + 
                theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
  5. Function for dividing an \(n\)-gram into a history and word to predict
createHistory<-function(DF){
        require(stringr)
        # number of words in the history (all but the last word of each phrase)
        histNum<-length(strsplit(as.character(DF[1, 1]), " ")[[1]])-1
        # the word to predict is the last word of each phrase
        DF$toPredict<-sapply(X = DF[, 1], function(x){word(x, -1)})
        # the history is the phrase with its last word removed
        DF$history<-str_trim(sapply(X = DF[, 1], 
                                    function(x){paste(word(x, 1:histNum), 
                                                      collapse = " ")}), 
                             side = "right")
        # conditional probability of each phrase given its history
        DF$prob<-0
        for (h in unique(DF$history)){
                shared<-which(DF$history == h)
                DF$prob[shared]<-DF$freq[shared]/sum(DF$freq[shared])
        }
        DF[, c("history", "toPredict", "prob")]
}