The following is an exploratory analysis of the JHU SwiftKey data set, particularly with respect to building a model for next-word text prediction; this report fulfills the milestone requirement of the 2015 Summer Capstone project in Data Science, hosted by Coursera. In this document you will find 5 sections: an Introduction that poses the problem, followed by 4 other sections that give abridged information on Downloading and Opening the Data, Summarizing the Data, Cleaning the Data, and Future Plans for the Data. For the programming savvy, note that the lion's share of the code used to make this report possible and reproducible can be found in the Appendix included at the end. Ultimately, this analysis finds the SwiftKey data conducive to deriving an n-gram-based statistical model for next-word text prediction, though some additional processing will be required.
It is certainly cliché to say that "hindsight is 20/20," for indeed one can usually determine an error, and its corresponding remedy, with perfect clarity after the fact; what has always been of more value to mankind is foresight, so that errors do not occur in the first place. Contemporary autocorrect systems, in the realm of text messaging, are an example of computerized hindsight. Annoyingly, hindsight for these automatons is still far poorer than man's 20/20, but is it wise to focus on improving it further? Would they not be of more use to mankind if we gave them some form of foresight? In other words, what if your phone could predict the next word that you were about to type instead of haphazardly correcting what you did type? That is the primary question that motivates research on text prediction, and all we have to go about it is a very large body of English text.
A common approach to forecasting future events is to couple statistical techniques with high-quality data that details the past. For us, the SwiftKey data set is essentially that "high-quality" data. It is part of a larger multilingual corpus called the "HC Corpora," which contains natural language samples from blogs, news sites, and Twitter feeds. One can download the data via the above inline link (SwiftKey data set) and unzip it manually to view parts in a text editor, or one can use the following snippets of code to do everything within an R environment:
## Assuming this code is run in RStudio on a Windows 8.1 platform, fetch the SwiftKey data, if not present, and unzip it into the user's working directory
myzip1 <- "Coursera-SwiftKey.zip"
## If the data file has not yet been downloaded, fetch it
if (!file.exists(myzip1)) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = myzip1)
}
unzip(myzip1)
After running the above, use setwd() to change the working directory to the "…/final/en_US" subdirectory of the previous working directory. Then, load each text file into memory by applying the following code:
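For example, assuming the archive was unzipped into the current working directory (which is what the code above does by default), a call along these lines should get you there:
## Move into the English-language subdirectory of the unzipped archive
setwd(file.path(getwd(), "final", "en_US"))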
## Load each file into memory, if they do not already exist, using scan(): Blogs, News, and then Twitter
blogs_file <- "en_US.blogs.txt"
news_file <- "en_US.news.txt"
twitter_file <- "en_US.twitter.txt"
if (!exists("Blogs")) {
Blogs <- scan(blogs_file, character(0), sep = "\n")
}
if (!exists("News")) {
News <- scan(news_file, character(0), sep = "\n")
}
if (!exists("Twitter")) {
Twitter <- scan(twitter_file, character(0), sep = "\n")
}
The relevant SwiftKey data consists of three files, each containing English text separated by newlines: en_US.blogs, en_US.news, and en_US.twitter.
| File | Size in Megabytes | Lines | Words | Avg. Number of Words per Line |
|---|---|---|---|---|
| en_US.blogs.txt | 260.56 | 899288 | 36893516 | 41.06 |
| en_US.news.txt | 20.11 | 77259 | 2579113 | 33.40 |
| en_US.twitter.txt | 316.04 | 2360148 | 29430648 | 12.47 |
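The table was compiled within R; a minimal sketch of how such a summary could be produced from the loaded character vectors is given below (summarize_corpus is a hypothetical helper, and its whitespace-based word count is only an approximation, so the totals may differ slightly from those reported above):
## Rough per-file summary: size in memory, line count, word count, and words per line
summarize_corpus <- function(x) {
    words <- sum(lengths(strsplit(x, "\\s+")))
    c(Size.in.Megabytes = round(as.numeric(object.size(x)) / 2^20, 2),
      Lines = length(x),
      Words = words,
      Avg.Words.per.Line = round(words / length(x), 2))
}
summarize_corpus(Blogs)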
The blog data is approximately 261 MB in memory. It contains 899,288 lines of text and 36,893,516 words. Here are three random, but reproducible, samples of the blog data:
set.seed(9950)
Blogs[sample(1:length(Blogs), 3, replace=F)]
## [1] "4. Geoff Earhart is organized and methodical in his approach to everything. Heâs been ordered to recover a vital piece of equipment from a capsized ship. Force of nature: Lightning storm"
## [2] "John Smithâs Grand National Record: 1992 Whats The Crack (13th); 1997 Full Of Oats (Fell 1st); 2003 Chives (PU bef 12th), Maximize (Fell 19th), Southern Star (14th); 2004 Southern Star (PU 18th); 2011 Calgary Bay (Fell 4th)"
## [3] "Once you know this you can plan picture-taking around this."
The news data is approximately 20 MB in memory. It contains 77,259 lines of text and 2,579,113 words. Here are three random, but reproducible, samples of the news data:
set.seed(9950)
News[sample(1:length(News), 3, replace=F)]
## [1] "Sixteen instructional assistants, 11 special education instructional assistants, two outreach consultants and one of the two district gardeners also were given pink slips."
## [2] "\"Then I'll sit down and think about whether I want to continue,\" he said this week."
## [3] "What's needed is some less inflammatory language and confrontation and some more dialogue and quiet reflection. The federal government has an important role to play in promoting public health and the greater good. The church leadership has a right to its beliefs."
The Twitter data is approximately 316 MB in memory. It contains 2,360,148 lines of text and 29,430,648 words. Here are three random, but reproducible, samples of the Twitter data:
set.seed(9950)
Twitter[sample(1:length(Twitter), 3, replace=F)]
## [1] "nope! Won't have it back till Friday...ah!!!"
## [2] "happy birthday"
## [3] "God I am drunk! Are the NFL games today?"
As can be seen from the samples, the text is less than ideal as raw material for building a predictive model; non-ASCII characters, capitalization, punctuation, and idiosyncratic spacing keep it from being truly "high-quality" data. Also, not apparent in the samples are random bits of profanity that must be removed. Here is a brief synopsis of the steps taken to clean the data; for the actual code, please refer to the Appendix:
- Convert the text to ASCII, dropping any characters that cannot be converted.
- Convert all text to lower case.
- Remove punctuation.
- Remove numbers.
- Remove profanity, using a list of bad words read from a separate file.
- Strip extra whitespace.
- Due to memory limitations, split the Twitter data into six smaller samples and clean each in the same fashion.
One of the most widely used techniques in statistical natural language processing is the n-gram model. Under this technique, language is modeled as sequences of n-grams, each composed of n words, and each n-gram has an observed frequency within a body of text. These frequencies are used to estimate the probability of a word conditioned on some number of previous words, and in this way one can find the most likely next word.
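As a toy illustration of the idea (the tiny data frame and the predict_next helper below are purely hypothetical and not part of the pipeline described next), a bigram frequency table can be queried for the most frequent word that followed a given word:
## Toy example: predict the next word from a small bigram frequency table
bigrams <- data.frame(word = c("i am", "i was", "am here", "was there"),
                      freq = c(3, 1, 2, 1), stringsAsFactors = FALSE)
predict_next <- function(prev, bigram_df) {
    hits <- bigram_df[grepl(paste0("^", prev, " "), bigram_df$word), ]
    if (nrow(hits) == 0) return(NA_character_)
    ## the prediction is the second word of the most frequent matching bigram
    strsplit(hits$word[which.max(hits$freq)], " ")[[1]][2]
}
predict_next("i", bigrams)   ## returns "am"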
After the initial processing, each file was loaded back into memory and converted into multiple Term Document Matrices: one each for unigrams, bigrams, trigrams, and quadrigrams. N-gram frequency data frames were derived from these, and finally all data frames were merged under their respective n-gram type. Unfortunately, due to memory limitations, not all terms could be retained; at best, only terms with at least 10 counts within their samples made it past the culling. Nevertheless, the data is certainly more useful now. Displayed below are bar graphs and word clouds that illustrate the cleaned n-gram frequency data:
Clearly, the goal is to use the data to create an n-gram language model, which will ultimately be used to build a Shiny app for text prediction. Although my current understanding of natural language processing is quite limited, I have briefly researched the subject of n-gram models and found these essential concepts associated with their creation:
1. Provide start and stop contexts for the training-set sentences. In practice, a model trained without such delimiters assumes sentences never end, and thus a next word will always be predicted (see the sketch after this list).
2. Maximum Likelihood Estimates (MLEs) need to be computed for all n-grams. The common method is to normalize each term by dividing the observed frequency of a particular sequence by the observed frequency of its prefix. For example, say the bigram "I am" appears twice in a corpus and the unigram "I" appears 3 times; the MLE for "am" following "I" is then 2/3.
3. Smoothing is required to deal with errors caused by unseen terms (Chen and Goodman). For instance, say one is using a bigram model to calculate the probability of a sentence like "Austin loves you," and suppose that the bigram "Austin loves" does not appear in the training data. Without smoothing, the probability of that sentence under our language model is zero, which makes it a poor model for English text prediction. Common remedies fall into one of two categories: interpolation and backoff smoothing.
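Regarding the first point, here is a minimal sketch of what adding sentence delimiters might look like (the add_delimiters helper and the <s>/</s> symbols are illustrative placeholders, not something present in the current data):
## Wrap each cleaned sentence in start and stop symbols before tokenizing
add_delimiters <- function(sentences) paste("<s>", sentences, "</s>")
add_delimiters(c("austin loves you", "so do i"))
## [1] "<s> austin loves you </s>" "<s> so do i </s>"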
Number one is a bit of an issue given the data in its current form, so either I will need to process the raw corpora again, this time including start and stop symbols, or find some other remedy for never-ending next-word prediction. Finding maximum likelihood estimates, on the other hand, is a straightforward operation that can be done on the data as it stands. Concerning smoothing, I plan to use some form of backoff, where a zero probability causes the model to back off to a lower-order model (i.e., quadrigrams back off to trigrams, trigrams to bigrams, which in turn fall back on unigrams); a rough sketch of this idea appears below.
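To make that plan concrete, here is an illustrative sketch of count-based backoff over the merged frequency data frames produced in the appendix (predict_backoff and the tables argument are hypothetical names; no discounting is applied, so this is not yet a properly smoothed model):
## Try the highest-order n-gram table first, then back off to lower orders
predict_backoff <- function(prefix, ngram_tables, orders = c(4, 3, 2)) {
    words <- strsplit(tolower(prefix), "\\s+")[[1]]
    for (i in seq_along(ngram_tables)) {
        n <- orders[i]
        if (length(words) < n - 1) next             ## not enough history for this order
        context <- paste(tail(words, n - 1), collapse = " ")
        tab <- ngram_tables[[i]]
        hits <- tab[grepl(paste0("^", context, " "), tab$word), ]
        if (nrow(hits) > 0) {
            ## maximum-likelihood pick within this order: the most frequent continuation
            best <- hits$word[which.max(hits$freq)]
            return(tail(strsplit(best, " ")[[1]], 1))
        }
    }
    NA_character_                                   ## no match at any order
}
## Usage, assuming the merged data frames (each with 'word' and 'freq' columns) are loaded:
## tables <- list(quadrigram_df, trigram_df, bigram_df)
## predict_backoff("thanks for the", tables)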
This section contains the code used to clean the data, create the n-gram frequency matrices, merge those matrices, and create the data visualizations.
Below you will find the code used to clean the raw SwiftKey data. Note that, due to memory limitations, the Twitter text had to be broken up; the Twitter code is therefore further down.
require(tm)
## First, convert the Blogs and News data to ASCII.
Blogs <- iconv(Blogs, "latin1", "ASCII", "")
News <- iconv(News, "latin1", "ASCII", "")
##Using the tm package, turn both into a corpus
BlogsCorpus <- Corpus(VectorSource(Blogs))
NewsCorpus <- Corpus(VectorSource(News))
##Convert to lower case
BlogsCorpus <- tm_map(BlogsCorpus, content_transformer(tolower))
NewsCorpus <- tm_map(NewsCorpus, content_transformer(tolower))
##remove punctuation
BlogsCorpus <- tm_map(BlogsCorpus, removePunctuation)
NewsCorpus <- tm_map(NewsCorpus, removePunctuation)
##remove numbers
BlogsCorpus <- tm_map(BlogsCorpus, removeNumbers)
NewsCorpus <- tm_map(NewsCorpus, removeNumbers)
##remove profanity
badwords <- scan("badwords.txt", character(0), sep = "\n")
BlogsCorpus <- tm_map(BlogsCorpus, removeWords, badwords)
NewsCorpus <- tm_map(NewsCorpus, removeWords, badwords)
##remove extra white space
BlogsCorpus <- tm_map(BlogsCorpus, stripWhitespace)
NewsCorpus <- tm_map(NewsCorpus, stripWhitespace)
##Set the working directory to a directory of your choice and issue one of the following commands to save that particular corpus.
writeCorpus(BlogsCorpus)
writeCorpus(NewsCorpus)
##Twitter was split into 6 random, but reproducible, samples.
set.seed(69)
Twitter <- split(Twitter, sample(rep(1:6, ((length(Twitter))/6))))
write(Twitter$'1', file ="a", sep = "\n")
write(Twitter$'2', file ="b", sep = "\n")
write(Twitter$'3', file ="c", sep = "\n")
write(Twitter$'4', file ="d", sep = "\n")
write(Twitter$'5', file ="e", sep = "\n")
write(Twitter$'6', file ="f", sep = "\n")
##All samples were then cleaned in the same fashion as the Blogs and News data
a <- scan("a", character(0), sep = "\n")
a <- iconv(a, "latin1", "ASCII", "")
parta <- Corpus(VectorSource(a))
parta <- tm_map(parta, content_transformer(tolower))
parta <- tm_map(parta, removePunctuation)
parta <- tm_map(parta, removeNumbers)
badwords <- scan("badwords.txt", character(0), sep = "\n")
parta <- tm_map(parta, removeWords, badwords)
parta <- tm_map(parta, stripWhitespace)
b <- scan("b", character(0), sep = "\n")
b <- iconv(b, "latin1", "ASCII", "")
partb <- Corpus(VectorSource(b))
partb <- tm_map(partb, content_transformer(tolower))
partb <- tm_map(partb, removePunctuation)
partb <- tm_map(partb, removeNumbers)
badwords <- scan("badwords.txt", character(0), sep = "\n")
partb <- tm_map(partb, removeWords, badwords)
partb <- tm_map(partb, stripWhitespace)
c <- scan("c", character(0), sep = "\n")
c <- iconv(c, "latin1", "ASCII", "")
partc <- Corpus(VectorSource(c))
partc <- tm_map(partc, content_transformer(tolower))
partc <- tm_map(partc, removePunctuation)
partc <- tm_map(partc, removeNumbers)
badwords <- scan("badwords.txt", character(0), sep = "\n")
partc <- tm_map(partc, removeWords, badwords)
partc <- tm_map(partc, stripWhitespace)
d <- scan("d", character(0), sep = "\n")
d <- iconv(d, "latin1", "ASCII", "")
partd <- Corpus(VectorSource(d))
partd <- tm_map(partd, content_transformer(tolower))
partd <- tm_map(partd, removePunctuation)
partd <- tm_map(partd, removeNumbers)
badwords <- scan("badwords.txt", character(0), sep = "\n")
partd <- tm_map(partd, removeWords, badwords)
partd <- tm_map(partd, stripWhitespace)
e <- scan("e", character(0), sep = "\n")
e <- iconv(e, "latin1", "ASCII", "")
parte <- Corpus(VectorSource(e))
parte <- tm_map(parte, content_transformer(tolower))
parte <- tm_map(parte, removePunctuation)
parte <- tm_map(parte, removeNumbers)
badwords <- scan("badwords.txt", character(0), sep = "\n")
parte <- tm_map(parte, removeWords, badwords)
parte <- tm_map(parte, stripWhitespace)
f <- scan("f", character(0), sep = "\n")
f <- iconv(f, "latin1", "ASCII", "")
partf <- Corpus(VectorSource(f))
partf <- tm_map(partf, content_transformer(tolower))
partf <- tm_map(partf, removePunctuation)
partf <- tm_map(partf, removeNumbers)
badwords <- scan("badwords.txt", character(0), sep = "\n")
partf <- tm_map(partf, removeWords, badwords)
partf <- tm_map(partf, stripWhitespace)
Here is the code used to create the n-gram frequency matrices. Only one of the Twitter subsets, parta, is shown as an example; the same procedure was applied to the blogs, news, and remaining Twitter subsets, changing only the input corpus and the output file names.
## Load the RWeka package
require(RWeka)
## Define tokenizers
QuadrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
##Unigram frequency matrix
tdm <- TermDocumentMatrix(parta, control = list(tokenize = UnigramTokenizer))
tdm <- removeSparseTerms(tdm, 0.99)
b <- as.matrix(tdm)
v <- sort(rowSums(b),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
write.csv2(d, file = "unigram_news")
##Bigram frequency matrix
tdm <- TermDocumentMatrix(parta, control = list(tokenize = BigramTokenizer))
tdm <- removeSparseTerms(tdm, 0.99)
b <- as.matrix(tdm)
v <- sort(rowSums(b),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
write.csv2(d, file = "bigram_news")
##Trigram frequency matrix
tdm <- TermDocumentMatrix(parta, control = list(tokenize = TrigramTokenizer))
tdm <- removeSparseTerms(tdm, 0.999)
b <- as.matrix(tdm)
v <- sort(rowSums(b),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
write.csv2(d, file = "trigram_news")
##Quadrigram frequency matrix
tdm <- TermDocumentMatrix(parta, control = list(tokenize = QuadrigramTokenizer))
tdm <- removeSparseTerms(tdm, 0.9999)
b <- as.matrix(tdm)
v <- sort(rowSums(b),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
write.csv2(d, file = "quadrigram_news")
First the Twitter data was unified by n-gram type. After that, all SwiftKey n-gram frequency data was merged by type.
## Merging data frames
## First the Twitter data is unified
uni_a <- read.csv2("unigram_a", stringsAsFactors = F)
uni_b <- read.csv2("unigram_b", stringsAsFactors = F)
uni_c <- read.csv2("unigram_c", stringsAsFactors = F)
uni_d <- read.csv2("unigram_d", stringsAsFactors = F)
uni_e <- read.csv2("unigram_e", stringsAsFactors = F)
uni_f <- read.csv2("unigram_f", stringsAsFactors = F)
list.of.data.frames = list(uni_a, uni_b, uni_c, uni_d, uni_e, uni_f)
tmp <- Reduce(function(...) merge(..., all=T), list.of.data.frames)
tmp2 <- aggregate(freq ~ word, data = tmp , FUN = sum )
write.csv2(tmp2, file = "unigram_twitter")
##Twitter bigrams
bi_a <- read.csv2("bigram_a", stringsAsFactors = F)
bi_b <- read.csv2("bigram_b", stringsAsFactors = F)
bi_c <- read.csv2("bigram_c", stringsAsFactors = F)
bi_d <- read.csv2("bigram_d", stringsAsFactors = F)
bi_e <- read.csv2("bigram_e", stringsAsFactors = F)
bi_f <- read.csv2("bigram_f", stringsAsFactors = F)
list.of.data.frames = list(bi_a, bi_b, bi_c, bi_d, bi_e, bi_f)
tmp <- Reduce(function(...) merge(..., all=T), list.of.data.frames)
tmp2 <- aggregate(freq ~ word, data = tmp , FUN = sum )
write.csv2(tmp2, file = "bigram_twitter")
##Twitter trigrams
tri_a <- read.csv2("trigram_a", stringsAsFactors = F)
tri_b <- read.csv2("trigram_b", stringsAsFactors = F)
tri_c <- read.csv2("trigram_c", stringsAsFactors = F)
tri_d <- read.csv2("trigram_d", stringsAsFactors = F)
tri_e <- read.csv2("trigram_e", stringsAsFactors = F)
tri_f <- read.csv2("trigram_f", stringsAsFactors = F)
list.of.data.frames = list(tri_a, tri_b, tri_c, tri_d, tri_e, tri_f)
tmp <- Reduce(function(...) merge(..., all=T), list.of.data.frames)
tmp2 <- aggregate(freq ~ word, data = tmp , FUN = sum )
write.csv2(tmp2, file = "trigram_twitter")
##Twitter Quadrigrams
qu_a <- read.csv2("quadrigram_a", stringsAsFactors = F)
qu_b <- read.csv2("quadrigram_b", stringsAsFactors = F)
qu_c <- read.csv2("quadrigram_c", stringsAsFactors = F)
qu_d <- read.csv2("quadrigram_d", stringsAsFactors = F)
qu_e <- read.csv2("quadrigram_e", stringsAsFactors = F)
qu_f <- read.csv2("quadrigram_f", stringsAsFactors = F)
list.of.data.frames = list(qu_a, qu_b, qu_c, qu_d, qu_e, qu_f)
tmp <- Reduce(function(...) merge(..., all=T), list.of.data.frames)
tmp2 <- aggregate(freq ~ word, data = tmp , FUN = sum )
write.csv2(tmp2, file = "quadrigram_twitter")
##Now to merge the entire swiftkey data set
##Unigram
uni_blogs <- read.csv2("unigram_blogs", stringsAsFactors = F)
uni_news <- read.csv2("unigram_news", stringsAsFactors = F)
uni_twitter <- read.csv2("unigram_twitter", stringsAsFactors = F)
list.of.data.frames = list(uni_blogs, uni_news, uni_twitter )
tmp <- Reduce(function(...) merge(..., all=T), list.of.data.frames)
tmp2 <- aggregate(freq ~ word, data = tmp , FUN = sum )
write.csv2(tmp2, file = "unigram_swiftkey")
##bigram
bi_blogs <- read.csv2("bigram_blogs", stringsAsFactors = F)
bi_news <- read.csv2("bigram_news", stringsAsFactors = F)
bi_twitter <- read.csv2("bigram_twitter", stringsAsFactors = F)
list.of.data.frames = list(bi_blogs, bi_news, bi_twitter )
tmp <- Reduce(function(...) merge(..., all=T), list.of.data.frames)
tmp2 <- aggregate(freq ~ word, data = tmp , FUN = sum )
write.csv2(tmp2, file = "bigram_swiftkey")
##trigram
tri_blogs <- read.csv2("trigram_blogs", stringsAsFactors = F)
tri_news <- read.csv2("trigram_news", stringsAsFactors = F)
tri_twitter <- read.csv2("trigram_twitter", stringsAsFactors = F)
list.of.data.frames = list(tri_blogs, tri_news, tri_twitter )
tmp <- Reduce(function(...) merge(..., all=T), list.of.data.frames)
tmp2 <- aggregate(freq ~ word, data = tmp , FUN = sum )
write.csv2(tmp2, file = "trigram_swiftkey")
##quadrigram
qu_blogs <- read.csv2("quadrigram_blogs", stringsAsFactors = F)
qu_news <- read.csv2("quadrigram_news", stringsAsFactors = F)
qu_twitter <- read.csv2("quadrigram_twitter", stringsAsFactors = F)
list.of.data.frames = list(qu_blogs, qu_news, qu_twitter )
tmp <- Reduce(function(...) merge(..., all=T), list.of.data.frames)
tmp2 <- aggregate(freq ~ word, data = tmp , FUN = sum )
write.csv2(tmp2, file = "quadrigram_swiftkey")
Here is the code used to produce the bar graphs and word clouds for the merged n-gram data.
require(wordcloud)
require(RColorBrewer)  ## brewer.pal() comes from the RColorBrewer package
##Colors
pal2 <- brewer.pal(8,"Dark2")
colors = c("red", "yellow", "green", "violet", "orange", "blue", "pink", "cyan", "grey", "black")
## Merged unigrams
uni <- read.csv2("unigram_swiftkey", stringsAsFactors = F)
j_uni <- uni[order(uni$freq, decreasing = T),]
y_uni <- j_uni[1:10,]
##Bargraph first
barplot(y_uni$freq, names = y_uni$word,xlab = "Unigram", ylab = "Frequency", col=colors, main = "Top 10 Swiftkey Unigrams")
## Then the wordcloud
wordcloud(words = uni$word, freq = uni$freq, scale=c(8,0.5), max.words=400, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, random.color = F, colors=pal2)
## Merged Bigrams
bi <- read.csv2("bigram_swiftkey", stringsAsFactors = F)
j_bi <- bi[order(bi$freq, decreasing = T),]
y_bi <- j_bi[1:10,]
barplot(y_bi$freq, names = y_bi$word,xlab = "Bigram", ylab = "Frequency", col=colors, main = "Top 10 Swiftkey Bigrams")
wordcloud(words = bi$word, freq = bi$freq, scale=c(8,0.5), max.words=400, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, random.color = F, colors=pal2)
## Merged Trigrams
tri <- read.csv2("trigram_swiftkey", stringsAsFactors = F)
j_tri <- tri[order(tri$freq, decreasing = T),]
y_tri <- j_tri[1:10,]
barplot(y_tri$freq, names = y_tri$word,xlab = "Trigram", ylab = "Frequency", col=colors, main = "Top 10 Swiftkey Trigrams")
wordcloud(words = tri$word, freq = tri$freq, scale=c(8,0.5), max.words=400, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, random.color = F, colors=pal2)
## Merged Quadrigrams
qu <- read.csv2("quadrigram_swiftkey", stringsAsFactors = F)
j_qu <- qu[order(qu$freq, decreasing = T),]
y_qu <- j_qu[1:10,]
barplot(y_qu$freq, names = y_qu$word, xlab = "Quadrigram", ylab = "Frequency", col=colors, cex.names = .7, main = "Top 10 Swiftkey Quadrigrams")
wordcloud(words = qu$word, freq = qu$freq, scale=c(8,0.3), max.words=400, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, random.color = F, colors=pal2)
Chen, Stanley F., and Joshua Goodman. “An Empirical Study of Smoothing Techniques for Language Modeling.” Aug. 1998. Web. 24 July 2015.