This document addresses the Milestone task of the JHU/Coursera Data Science Capstone. In it, the data necessary for building a word prediction algorithm are acquired, cleaned, and analysed for interesting patterns.
The data downloaded from the course website were composed of twelve .txt files: tweets, blog entries and news articles, from four different countries. Only those samples originating from the US were considered in this analysis. The acquisition procedure is reproduced in the appendix to this document.
The data were initially analysed to determine, for each of the three data sets, the number of lines of text, the size in memory, and the number of words; the results are displayed below.
basicSummary
##          numEntries    sizes numWords
## twitter     2360148 301.4 Mb 30373605
## blogs        899288 248.5 Mb 37334149
## newsText    1010242 249.6 Mb 34372814
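The basicSummary object itself is not reproduced in the appendix; a minimal sketch of how it could be assembled, assuming the three character vectors twitter, blogs, and newsText have already been read in with readLines(), is:

library(stringi)
# number of lines, in-memory size, and total word count for each data set
basicSummary<-data.frame(
  numEntries = sapply(list(twitter, blogs, newsText), length),
  sizes      = sapply(list(twitter, blogs, newsText),
                      function(x){format(object.size(x), units = "Mb")}),
  numWords   = sapply(list(twitter, blogs, newsText),
                      function(x){sum(stri_count_words(x))}),
  row.names  = c("twitter", "blogs", "newsText"))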
From the summary, we can see that, on average, each of the three data sets contains about \(3.4026856\times 10^{7}\) individual words. To improve the accuracy of the text prediction app that is the eventual goal of this Capstone, it is desirable to capture as many of those words as possible, using the smallest sample possible. However, among those roughly 34 million words, many are bound to be extraordinarily rare, or misspellings.
Arguably, the best way to select the sample size would be to observe that word frequencies in a natural language obey Zipf’s law, draw enough samples to estimate its parameters, and derive the desired sample threshold from them. A much lazier approach (taken here) is to simply take increasingly large samples of the data sets, compute the number of unique words in each sample, and plot them, as sketched below. From this, an educated guess can be made about the number of unique words in very large samples of text.
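A sketch of that lazier approach is given below, assuming the three raw data sets have been concatenated into a single character vector allText (the object name and the sample sizes are illustrative, not the exact ones used):

library(stringi)
set.seed(1234)
sampleSizes<-seq(from = 10000, to = 200000, by = 10000)
# count the unique (lower-cased) words in samples of increasing size
uniqueWords<-sapply(X = sampleSizes,
                    function(n){
                      samp<-sample(allText, size = n)
                      words<-unlist(stri_extract_all_words(tolower(samp)))
                      length(unique(words))
                    })
plot(sampleSizes, uniqueWords, type = "b",
     xlab = "lines sampled", ylab = "unique words")
abline(h = 210000, col = "red")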
From the plot, it appears that the number of unique words does not exceed 210000 (indicated by the red line) even in extremely large samples; 95% of that figure is \(1.995\times 10^{5}\) and 90% is \(1.89\times 10^{5}\), which are in turn reached at sample sizes of 199413 and 189833 lines respectively. This lower threshold is easily attainable with a ten percent sample of each of the three data sets: when this sample is cleaned of profanities and filtered of punctuation, the resulting 184514 unique words amount to roughly 87.9% of 210000. Stratified sampling was used to ensure that the sample was as representative of all the text as possible. The samples were stored in three Rdata objects that were used for further analysis.
Following Eng and Eisner, and in recognition of the fact that sentence-initial words do not actually follow sentence-terminal words but rather a full stop, periods in the sample were replaced by an end-of-phrase token prior to cleaning and n-gram construction.
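For example, using the eop() helper defined in the appendix:

eop("I came. I saw. I conquered.")
## [1] "I came EOP I saw EOP I conquered EOP"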
A list of profanities was obtained from https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en . It was used to replace profanities in the samples with blank spaces. The profanities were wrapped in word-boundary markers to ensure that non-profane words that merely contain the spelling of a swear, such as “assume” or “class”, were not filtered. The resulting clean text was then further filtered to remove URLs, Twitter handles, and, even though they contain syntactic value, emoticons. The resulting list was stored in an Rdata object, textData.Rdata, and was henceforth considered to be the primary data set.
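To illustrate the word-boundary behaviour with an innocuous stand-in word:

grepl(pattern = "\\bclass\\b", x = c("classic", "subclass", "a class act"))
## [1] FALSE FALSE  TRUE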
The data were converted into a corpus, which was then deconstructed into \(n\)-grams (where \(n=1,2,3\)) and arranged into three data frames. Each observation in each data frame is a unique phrase of \(n\) words together with its frequency. The twenty most commonly occurring \(n\)-grams are displayed below.
The following packages were used for this step and were all available: tm, SnowballC, qdap, qdapDictionaries, dplyr, RWeka, stringi, stringr, data.table, and Rgraphviz.
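The corpus and the three n-gram data frames could be constructed along the following lines, using the getNGramDF() helper reproduced in the appendix; the exact preprocessing chunk is not echoed in this report, so the transformations shown here are a sketch rather than the precise ones used:

library(tm)
load("textData.Rdata")
# build a corpus from the cleaned samples and normalise it
corpus<-VCorpus(VectorSource(unlist(data)))
corpus<-tm_map(corpus, content_transformer(tolower))
corpus<-tm_map(corpus, removePunctuation)
corpus<-tm_map(corpus, stripWhitespace)
unigramDF<-getNGramDF(corpus = corpus, n = 1)
bigramDF<-getNGramDF(corpus = corpus, n = 2)
trigramDF<-getNGramDF(corpus = corpus, n = 3)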
displayCounts(unigramDF)
displayCounts(bigramDF)
displayCounts(trigramDF)
By far, the most surprising observation is that the number of end-of-phrase tokens (eop) is significantly higher than that of any other token. The frequency of ellipses (represented by eop eop eop trigrams) eclipses the frequency of the next most frequent trigram, “one of the”. This seems to be an overwhelmingly strong artifact of modern written indirect style…you know what I mean? Otherwise, the results do not violate any expectations of the English language, where one may reasonably expect articles and conjunctions to occur far more often than any other words, and prepositional phrases to make up the bulk of the most frequent bigrams and trigrams.
The structure of the data immediately suggests a probabilistic approach to word prediction. For example, the unigramDF data set asserts that given no words, the most likely following words are: eop, the, and, for, that, you.
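In its simplest form, this amounts to normalising the unigram frequencies into probabilities; a minimal sketch (unigramProbs is an illustrative name, not part of the analysis code):

# convert frequencies to probabilities; unigramDF is already in descending order
unigramProbs<-transform(unigramDF, prob = freq/sum(freq))
head(unigramProbs, 6)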
To understand the remaining data sets, each phrase was divided into a history and a following word. The probability of each phrase occurring given a particular history was then calculated; this is shown below for the first 10 bigram phrases.
bigramHist<-createHistory(bigramDF[1:10, ])
melted<-melt(data = bigramHist, id.vars = c("history", "toPredict", "prob"))
dcast(melted, history+prob~toPredict)
##    history       prob be eop i the
## 1       at 0.04720306  0   0 0   2
## 2      eop 0.06875480  0   0 0   2
## 3      eop 0.09121542  0   0 2   0
## 4      eop 0.24915035  0   2 0   0
## 5      for 0.06814084  0   0 0   2
## 6       in 0.13793987  0   0 0   2
## 7       of 0.14405538  0   0 0   2
## 8       on 0.06611522  0   0 0   2
## 9       to 0.05506862  2   0 0   0
## 10      to 0.07235643  0   0 0   2
These data tell us, for example, that in this subset of the data, given a history of “to”, the phrase “to be” occurs with probability about 0.06, and the phrase “to the” occurs with probability about 0.07. Thus, given the input “to”, “the” would be suggested as a first choice and “be” as a second, while words such as “i” and “eop” would not be suggested at all.
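That suggestion rule can be expressed as a small helper; predictNext() below is a hypothetical illustration rather than part of the analysis code:

# rank the candidate next words for a given history, most probable first
predictNext<-function(historyDF, h, k = 3){
  cands<-historyDF[historyDF$history == h, ]
  cands<-cands[order(cands$prob, decreasing = TRUE), ]
  head(cands[, c("toPredict", "prob")], k)
}
predictNext(bigramHist, "to")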
If the history is updated by the user to include one of the suggested words, the trigram data frame suggests the following:
## Aggregation function missing: defaulting to length
## history prob a as be dont eop eops have i is know of the think to
## 29 to be 0.02144156 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## was you
## 29 0 0
It can be observed that the probability of a particular word following a richer history differs from its probability under the shorter history; longer histories therefore refine the predictions.
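For example, building a history table from a slice of the trigram data frame (the slice size here is only illustrative) and querying it for the history “to be”:

trigramHist<-createHistory(trigramDF[1:40, ])
subset(trigramHist, history == "to be")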
The initial idea for a text prediction algorithm follows from these observations. Despite the massive size of the data matrices, their sparsity should allow for efficient storage, and each subsequent word suggestion should be informed by the preceding history. However, this is not the only option: a statistical learning approach, such as tree-based methods or multiclass logistic regression, also seems appropriate for automating learning in this problem. It is clear, though, that further data engineering needs to take place before this can be done.
getdata<-function(fileUrl, dir, filename, ext){
# create directory, if it is not already present
dirName<-paste(dir, sep = "")
if(!file.exists(dirName)){
dir.create(path = dirName)
}
# Get the data, unless this step has already been done
dest<-paste("./",
dirName,"/",
filename,
ext,
sep = "")
if(!file.exists(dest)){
download.file(url = fileUrl,
destfile = dest,
method = "curl")
datedownloaded<-date()
}
print(dest)
dest
}
# data obtention
fileURL1 <-
"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dataset<-getdata(fileUrl = fileURL1,
dir = "swiftkey",
filename = "dataset",
ext = ".zip")
if(!exists("swiftKey")){
swiftKey<-unzip(zipfile = dataset)
}
save(swiftKey, file = "swiftKey.Rdata")
# profanity obtention
fileURL2<-
"https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
# getdata() already skips the download when the file is present,
# so the path can be obtained unconditionally
profanities<-getdata(fileUrl = fileURL2,
dir = "badwords",
filename = "badwords",
ext = ".txt")
badwords<-fread(profanities,
header = F,
stringsAsFactors = T,
sep = "\n")
save(badwords, file = "badwords.Rdata")
set.seed(123)
smallTwitter<-twitter[sample(1:length(twitter),
size = floor(.1*length(twitter)),
replace = F)]
set.seed(321)
smallBlogs<-blogs[sample(1:length(blogs),
size = floor(.1*length(blogs)),
replace = F)]
set.seed(213)
smallNews<-newsText[sample(1:length(newsText),
size = floor(.1*length(newsText)),
replace = F)]
text<-list("twitter"=smallTwitter,
"blogs"=smallBlogs,
"news"=smallNews)
load("badwords.Rdata")
badwordsList<-paste("\\b",
badwords$V1[-grep(pattern = " ", badwords$V1)],
"\\b",
sep = "",
collapse = "|")
cleanText<-sapply(X = text,
function(x){gsub(pattern = badwordsList,
replacement = "",
x = x)})
# replace periods with an end-of-phrase token
eop<-function(text){
gsub(pattern = ".", replacement = " EOP", x = text, fixed = T)
}
eopText<-sapply(X = cleanText, FUN = eop)
# remove all non-roman characters
text<-sapply(X = eopText,
function(x){iconv(x,
"latin1",
"ASCII",
sub="")})
# url filtering
filterURL<-function(text){
pattern<-"http[s]?://[^[:space:]]+"
gsub(pattern = pattern, replacement = " ", x = text)
}
noURLs<-sapply(X = text, FUN = filterURL)
# twitter name filtering
filterTwittNames<-function(text){
gsub(pattern = "@[^\\s]+",
replacement = " ",
x = text)
}
noHandles<-sapply(X = noURLs, FUN = filterTwittNames)
# emoticon filtering
# the emoticons contain regex metacharacters, so they are removed with
# fixed (literal) matching rather than a single regular expression
emoticons<-emoticon$emoticon
data<-sapply(X = noHandles,
function(x){
for(e in emoticons){
x<-gsub(pattern = e, replacement = "", x = x, fixed = T)
}
x
})
save(data, file = "textData.Rdata")
# getNGramDF() returns a data frame of the frequencies of occurrences of ngrams in the corpus, in descending order
getNGramDF<-function(corpus, n){
corpus<-tm_map(corpus, removeNumbers)
tokenizer<-function(x){
NGramTokenizer(x,
Weka_control(min = n, max = n))
}
options(mc.cores=1)
tdm <- TermDocumentMatrix(corpus,
control = list(tokenize = tokenizer))
freq<-rowSums(as.matrix(tdm))
head(orderedNGrams<-freq[order(freq, decreasing = T)], 20)
NGramDF<-as.data.frame(orderedNGrams)
data.frame(phrase=rownames(NGramDF),
freq=NGramDF$orderedNGrams)
}
# displayCounts() returns a barchart displaying the top 20 most frequently occurring ngrams
displayCounts<-function(NGramDF){
print(smallDF<-NGramDF[1:20,])
# order the factor levels by descending frequency so the bars plot left to right
smallDF$phrase<-factor(x = smallDF$phrase,
levels = smallDF$phrase[order(smallDF$freq, decreasing = T)],
ordered = T)
ggplot(data = smallDF, aes(x = phrase, y = freq)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
createHistory<-function(DF){
require(stringr)
histNum<-length(strsplit(as.character(DF[1,1]), " ")[[1]])-1
# totalFreq<-sum(DF[,2])
# DF$prob<-sapply(X = DF[,2], function(x){x/totalFreq})
DF$toPredict<-sapply(X = DF[,1], function(x){word(x, -1)})
# the history is the phrase minus its final word
DF$history<-sapply(X = DF[, 1],
function(x){paste(word(x, 1:histNum),
collapse = " ")})
histories<-paste("^", unique(DF$history), "$", sep = "")
DF$prob<-0
for (i in (1:length(histories))){
sharedHistory<-DF[grep(histories[i], DF$history),]
probSum<-sum(sharedHistory$freq)
DF[grep(histories[i], DF$history),]$prob<-(DF[grep(histories[i], DF$history),]$freq)/probSum
}
DF<-DF[, c(4,3, 5)]
allWords<-unique(c(DF$history, DF$toPredict))
notInHist<-setdiff(x = allWords, y=DF$history)
notInPredictions<-setdiff(x = allWords, y = DF$toPredict)
DF
}