The Coursera Data Science program culminates in a capstone project in which we use provided text data to produce predictions akin to those found on cell phone keyboards: predicting the next word based on the previous word(s) typed by the user.
This first report outlines the data exploration and initial development performed for that purpose.
Data can be downloaded from the following source and unpacked for analysis. The download is quite large and takes significant space when unpacked, so choose a working directory with room for it.
#NOT RUN
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, "./Coursera-SwiftKey.zip")
unzip("./Coursera-SwiftKey.zip", exdir = ".")
#the archive unpacks to ./final; rename it to ./data
file.rename("./final", "./data")
With the file downloaded and extracted, we can see that it contains three large English text files (as well as Russian, Finnish, and German language files, which we’ll leave aside; it could be interesting to play with them later). The English files are compiled blog posts, news reports, and Twitter tweets.
Our text files are quite big:
#file.size/1000000 to get MB
fs<-file.size(c("./data/en_US/en_US.blogs.txt", "./data/en_US/en_US.news.txt", "./data/en_US/en_US.twitter.txt"))/1000000
fs
## [1] 210.1600 205.8119 167.1053
So with three files near 200 MB each, we should do some subsetting for our exploratory work. We’ll read each file, then see how many lines each has. The readr package makes this easier.
library(readr)
en_blogs<-read_lines("./data/en_US/en_US.blogs.txt", progress=FALSE)
en_news<-read_lines("./data/en_US/en_US.news.txt", progress=FALSE)
en_twitter<-read_lines("./data/en_US/en_US.twitter.txt", progress=FALSE)
blogs_lines<-length(en_blogs)
news_lines<-length(en_news)
twitter_lines<-length(en_twitter)
So, our blog file has 899288 lines, the news file has 1010242 lines, and the Twitter file has 2360148 lines. The number of words per line likely differs between the files; Twitter is an obvious example (no line should be above 140 characters) compared with news or blogs. We’ll investigate these distributions.
en_blogs_word<-strsplit(gsub("[[:punct:]]","",en_blogs), " ")
en_news_word<-strsplit(gsub("[[:punct:]]","",en_news), " ")
en_twitter_word<-strsplit(gsub("[[:punct:]]","",en_twitter), " ")
blogs_wordcount<-length(unlist(en_blogs_word))
blogs_un_wordcount<-length(unique(unlist(en_blogs_word)))
news_wordcount<-length(unlist(en_news_word))
news_un_wordcount<-length(unique(unlist(en_news_word)))
twitter_wordcount<-length(unlist(en_twitter_word))
twitter_un_wordcount<-length(unique(unlist(en_twitter_word)))
The distributions of words per line for each file are plotted below.
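The chunk that builds the plot object g isn’t echoed above; here is a minimal sketch of how it might be constructed, assuming ggplot2 and the word splits from the previous chunk (the 150-word axis cap is our assumption, which would account for the ‘removed rows’ warning that follows):
library(ggplot2)
#words per line for each source, in one long data frame
line_words <- data.frame(
  words = c(lengths(en_blogs_word), lengths(en_news_word), lengths(en_twitter_word)),
  source = c(rep("Blogs", blogs_lines), rep("News", news_lines), rep("Twitter", twitter_lines))
)
g <- ggplot(line_words, aes(x = words, fill = source)) +
  geom_density(alpha = 0.5) +   #densities, consistent with the stat_density warning below
  xlim(0, 150) +                #assumed cap; lines beyond it are dropped from the plot
  labs(x = "Words per line", y = "Density", title = "Words per Line by Source")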
g
## Warning: Removed 26980 rows containing non-finite values (stat_density).
A summary of all of the discovered data is below:
| File | File Size (MB) | Lines | Words | Unique Words | Prop. Unique |
|---|---|---|---|---|---|
| Blogs | 210.1600 | 899288 | 37320843 | 518021 | 0.0138802 |
| News | 205.8119 | 1010242 | 34368164 | 422939 | 0.0123061 |
| Twitter | 167.1053 | 2360148 | 30163333 | 649081 | 0.0215189 |
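For reference, this table can be assembled from the quantities already computed; a minimal sketch, assuming knitr::kable for rendering:
summary_df <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  File.Size = fs,   #MB, from file.size()/1000000 above
  Lines = c(blogs_lines, news_lines, twitter_lines),
  Words = c(blogs_wordcount, news_wordcount, twitter_wordcount),
  Unique.Words = c(blogs_un_wordcount, news_un_wordcount, twitter_un_wordcount)
)
summary_df$Prop.Unique <- summary_df$Unique.Words / summary_df$Words
knitr::kable(summary_df, col.names = c("File", "File Size (MB)", "Lines", "Words", "Unique Words", "Prop. Unique"))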
For the rest of our exploration we’ll use only 5% of each file, randomly selected, to keep computation cost down. We’ll also look at each file separately, to see if there are any differences.
set.seed(1)
fraction<-0.05
en_blogs<-en_blogs[sample(c(1:blogs_lines), blogs_lines*fraction)]
en_news<-en_news[sample(c(1:news_lines), news_lines*fraction)]
en_twitter<-en_twitter[sample(c(1:twitter_lines), twitter_lines*fraction)]
One of the things we will be doing is building ‘n-gram’ models. However, they should be built once the data is cleaned. To clean the data effectively, we’ll use the quanteda package in R, along with stringi to clear up any string formatting issues. We’ll clean the blogs text, then discuss the cleaning steps and compare them to other language processing that might be used in other situations.
library(quanteda)
library(stringi)
#stokens<-tokens(c_blogs, what='sentence', removeNumbers=TRUE, removePunc=TRUE, removeSymbols=TRUE, removeSeparators=TRUE, removeTwitter=TRUE, removeHyphens=TRUE, removeURL=TRUE)
#Prepare the text data
#Fix accented characters (Not applicable for other languages) & some symbols associated
en_blogs <- stri_trans_general(en_blogs,"Latin-ASCII")
en_blogs <- gsub("[µºˆ_]+", '', en_blogs)
#set to lowercase
en_blogs <- char_tolower(en_blogs)
#Remove URLS
en_blogs <- gsub("(f|ht)tp(s?):\\/\\/(.*)[.][a-z=\\?\\/]+", "", en_blogs)
#Remove Twitter 'Via @username' and '@usernames' and 'rt' and hashtags
en_blogs <- gsub("via\\s*@[a-z_]{1,20}","", en_blogs)
en_blogs <- gsub("\\s@[a-z_]{1,20}", "", en_blogs)
en_blogs <- gsub("\\brt\\b", "", en_blogs)
en_blogs <- gsub("#[a-z0-9_]+", "", en_blogs)
#Get a 'bad words' list to remove from the corpus
badwords_url <- "https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
badwords <- unlist(strsplit(RCurl::getURL(badwords_url), "\n"))
#last 'bad word' is an emoji, remove
badwords <- badwords[-length(badwords)]
#replacement builder code from getAnywhere(tm::removeWords.character)
en_blogs <- gsub(sprintf("(*UCP)\\b(%s)\\b", paste(badwords, collapse = "|")), "", en_blogs, perl = TRUE)
#gsub(sprintf("(*UCP)\\b(%s)\\b", paste(stopwords("en"), collapse = "|")), "", en_blogs, perl = TRUE)
So what all happened there? We converted the text to plain ASCII characters (a step not needed for languages that rely on accented characters), set it to lower case, removed URLs, stripped some Twitter noise (usernames, ‘via @username’ mentions, hashtags, and standalone ‘rt’, but not the ‘rt’ inside words like ‘art’), and finally removed the bad words. We’ll do the same to the other text inputs in the background.
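For the ‘in the background’ part, the same steps can be wrapped in a small helper and applied to the news and Twitter text. This is a sketch: the cleanText name is ours, and it simply repeats the substitutions shown above (it assumes the badwords vector already exists):
cleanText <- function(x, badwords) {
  x <- stri_trans_general(x, "Latin-ASCII")                      #accents to ASCII
  x <- gsub("[µºˆ_]+", "", x)                                    #stray symbols
  x <- char_tolower(x)
  x <- gsub("(f|ht)tp(s?):\\/\\/(.*)[.][a-z=\\?\\/]+", "", x)    #URLs
  x <- gsub("via\\s*@[a-z_]{1,20}", "", x)                       #'via @username'
  x <- gsub("\\s@[a-z_]{1,20}", "", x)                           #'@username'
  x <- gsub("\\brt\\b", "", x)                                   #standalone 'rt'
  x <- gsub("#[a-z0-9_]+", "", x)                                #hashtags
  gsub(sprintf("(*UCP)\\b(%s)\\b", paste(badwords, collapse = "|")), "", x, perl = TRUE)
}
en_news <- cleanText(en_news, badwords)
en_twitter <- cleanText(en_twitter, badwords)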
Now onto the next step, where we really get into the text analysis with quanteda.
#Create a 'Corpus' of words
corpus_blog <- corpus(en_blogs)
corpus_news <- corpus(en_news)
corpus_twitter <- corpus(en_twitter)
docnames(corpus_blog) <- paste('blog',1:ndoc(corpus_blog))
docnames(corpus_news) <- paste('news',1:ndoc(corpus_news))
docnames(corpus_twitter) <- paste('tweet',1:ndoc(corpus_twitter))
corpus_tot <- corpus_blog+corpus_news+corpus_twitter
dfm1 <- dfm(corpus_tot, removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE, removeURL = TRUE, removeTwitter = TRUE, removeHyphens=TRUE, ngrams=1)
dfm2 <- dfm(corpus_tot, removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE, removeURL = TRUE, removeTwitter = TRUE, removeHyphens=TRUE, ngrams=2)
dfm3 <- dfm(corpus_tot, removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE, removeURL = TRUE, removeTwitter = TRUE, removeHyphens=TRUE, ngrams=3)
dfm4 <- dfm(corpus_tot, removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE, removeURL = TRUE, removeTwitter = TRUE, removeHyphens=TRUE, ngrams=4)
We then turned the blog, news, and Twitter data into corpora, assigned document names to each, and combined them into one corpus. We then produced document feature matrices (essentially term-document matrices), breaking each document up into 1-, 2-, 3- and 4-gram chunks (e.g. for the string “I want to go.”, the 3-grams are “I want to” and “want to go.”). While doing so, we cleaned out any numbers, punctuation, symbols, separators, and missed URLs or Twitter crud. With this, we can see which n-grams are the most popular!
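As a quick illustration of the chunking, here is a minimal sketch using quanteda’s tokens_ngrams(); the underscore-joined form matches the feature names we see in the dfms:
toks <- tokens("I want to go")
tokens_ngrams(toks, n = 3)   #yields the 3-grams "I_want_to" and "want_to_go"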
topfeatures(dfm3)
## one_of_the a_lot_of thanks_for_the going_to_be to_be_a
## 1790 1499 1192 883 870
## out_of_the it_was_a some_of_the as_well_as the_end_of
## 722 709 708 702 698
Similarly, the single most common 4-gram from the set is “the_end_of_the”. Seems a bit… pessimistic.
Typical text processing begins to diverge from this point. Sometimes you may be interested in removing all of the common words such as ‘a’, ‘an’, ‘the’, ‘not’, ‘is’, etc. These are called ‘stop words’, and they are easily removed in the dfm creation step with dfm(remove = stopwords()). That won’t work for us, as we want to know when to predict a stop word. Another manipulation of the text is called ‘stemming’, where words are shortened to their common stem. For example, ‘analyse’, ‘analysis’, ‘analysing’, and ‘analysed’ would all be reduced to ‘analys’, allowing analysis that doesn’t depend on plurality, tense (past/present/future), or other variations. Again, this would have been as easy as dfm(stem = TRUE), but we want to predict the full word in our tool.
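For reference only (we don’t use this for the prediction tables), both options are single arguments in the same older-style dfm() call used throughout this report; the object name here is just illustrative:
#illustration only: stop-word removal and stemming in one dfm() call
dfm_nostop_stem <- dfm(corpus_tot, remove = stopwords("english"), stem = TRUE,
                       removeNumbers = TRUE, removePunct = TRUE)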
There are a few other interesting analyses we can do with text data if we want:
One is ‘sentiment analysis’: looking at the texts and pulling out words that imply happiness, sadness, anger, etc. This can be done quite easily with the R package syuzhet, so I’ll demonstrate with the blogs data:
library(syuzhet)
sentiments <- get_nrc_sentiment(en_blogs)
sentimentTotals <- data.frame("count" = colSums(sentiments))
sentimentTotals$sentiment <- rownames(sentimentTotals)
rownames(sentimentTotals) <- NULL
ggplot(data=sentimentTotals, aes(x=sentiment, y=count)) +
geom_bar(aes(fill=sentiment), stat='identity') +
theme_bw() + theme(legend.position = 'none', axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ggtitle("Sentiment Counts for Blogs") + xlab("Sentiment") + ylab("Total Count")
Similarly, word clouds are popular yet powerful visual tools for analyzing text data. We’ll make a word cloud for our combined corpus with the built-in textplot_wordcloud function, which interfaces to the wordcloud package. This works better after removing stop words and stemming. I’ll pull a few extra stop words out too, since some don’t get caught, either because our preprocessing removed their punctuation or because they’re not in the base stop words list.
wordcloud_dfm<-dfm(corpus_tot, remove=stopwords('english'), stem=TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE, removeSeparators = TRUE, removeURL = TRUE, removeTwitter = TRUE)
textplot_wordcloud(wordcloud_dfm, max.words = 100, random.order = FALSE, colors = RColorBrewer::brewer.pal(9, 'PuBuGn'))
Back to the problem at hand. With our n-gram models, we’ll want to be able to predict which words come next after a user has entered a few words. Things like ‘I want to go to the …’ should result in proposed words such as ‘store’, ‘park’, ‘mall’. We can do this by manipulating our n-gram data.
We’ll split our n-grams into their first n-1 words and their final word. For a 4-gram “I want to go”, that splits into “I want to” and “go”. We can then compare the counts of “go” against “eat”, “sleep”, and any other completions we saw in the corpus. The stringr package will help here. I’ve made a function called ‘splitGrams’ that takes our document feature matrices as input and returns a table of ‘pregram’ values and the expected word after each.
library(stringr)
splitGrams <- function(docfeatmatrix){
  #total count of each n-gram across all documents
  freq_table <- colSums(docfeatmatrix)
  #features are n-grams joined by '_'; convert back to space-separated words
  grams <- colnames(docfeatmatrix)
  grams <- trimws(gsub("_", " ", grams))
  #n-gram length, taken from the first feature
  n <- length(strsplit(grams[1], split = ' ')[[1]])
  if (n > 1){
    grams <- str_split_fixed(grams, ' ', n)
    #pregram = first n-1 words, postgram = final word
    gram_table <- data.frame(pregram = apply(grams, 1, function(x) paste(x[1:(n - 1)], collapse = '_')),
                             postgram = grams[, n], count = freq_table, stringsAsFactors = FALSE)
  } else if (n == 1){
    gram_table <- data.frame(pregram = '', postgram = grams, count = freq_table, stringsAsFactors = FALSE)
  } else {
    stop("n-gram input not valid: < 1 n-gram length")
  }
  #order by pregram, then most frequent postgram first
  gram_table <- gram_table[order(gram_table$pregram, -gram_table$count, gram_table$postgram), ]
  rownames(gram_table) <- NULL
  return(gram_table)
}
gt1<-splitGrams(dfm1)
gt2<-splitGrams(dfm2)
gt3<-splitGrams(dfm3)
gt4<-splitGrams(dfm4)
With our gram popularity now calculated, we can look at the top results for each n. For example, with our ‘3-gram’ model, the pregram ‘to the’ yields:
head(gt3[gt3$pregram == 'to_the', ])
## pregram postgram count
## 2946775 to_the public 98
## 2946776 to_the point 97
## 2946777 to_the next 96
## 2946778 to_the top 71
## 2946779 to_the new 68
## 2946780 to_the state 64
Likewise, for our ‘go to the …’ case, we get:
head(gt4[gt4$pregram == 'go_to_the', ])
## pregram postgram count
## 1283763 go_to_the gym 12
## 1283764 go_to_the beach 10
## 1283765 go_to_the hospital 8
## 1283766 go_to_the store 8
## 1283767 go_to_the bathroom 6
## 1283768 go_to_the movies 6
These are low frequencies, but using more data should provide better results.
With our n-grams, the plan is to produce predictions of what could follow, in much the same way as above. When the user provides ‘go to the’, the model should suggest ‘gym’, ‘beach’, ‘hospital’, etc. If a provided n-gram has not been seen before, or the count for a specific pregram is very low, it may be beneficial to back off to the (n-1)-gram, i.e., test ‘to the’ if ‘go to the’ does not provide enough hits.
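A minimal sketch of that lookup-with-backoff idea, assuming the gt1 to gt4 tables built above; the predictNext name and the min_count cutoff are illustrative choices, not the final design:
predictNext <- function(phrase, n_suggest = 3, min_count = 2) {
  words <- strsplit(char_tolower(trimws(phrase)), "\\s+")[[1]]
  tables <- list(gt4, gt3, gt2)   #try the 4-gram table first, then back off
  pre_len <- c(3, 2, 1)           #pregram length used by each table
  for (i in seq_along(tables)) {
    k <- pre_len[i]
    if (length(words) < k) next
    pre <- paste(tail(words, k), collapse = "_")
    hits <- tables[[i]][tables[[i]]$pregram == pre & tables[[i]]$count >= min_count, ]
    #tables are already sorted by descending count within each pregram
    if (nrow(hits) > 0) return(head(hits$postgram, n_suggest))
  }
  #nothing matched: fall back to the most common single words
  head(gt1$postgram[order(-gt1$count)], n_suggest)
}
#e.g. predictNext("go to the") should return something like c("gym", "beach", "hospital")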
These gram tables make for large data frames.
library(pryr)
list(gt1=object_size(gt1), gt2=object_size(gt2), gt3=object_size(gt3), gt4=object_size(gt4))
## $gt1
## 8.93 MB
##
## $gt2
## 42.9 MB
##
## $gt3
## 171 MB
##
## $gt4
## 310 MB
However, they can be cut down quickly by keeping only n-grams seen more than once:
list(gt2.small=object_size(gt2[gt2$count > 1, ]), gt3.small=object_size(gt3[gt3$count > 1, ]), gt4.small=object_size(gt4[gt4$count > 1, ]))
## $gt2.small
## 12.3 MB
##
## $gt3.small
## 17.1 MB
##
## $gt4.small
## 9.84 MB
Or, by removing pregrams that have only one observed postgram:
g2unique<-data.frame(table(gt2$pregram))
g2unique<-g2unique[g2unique$Freq > 1,]
g3unique<-data.frame(table(gt3$pregram))
g3unique<-g3unique[g3unique$Freq > 1,]
g4unique<-data.frame(table(gt4$pregram))
g4unique<-g4unique[g4unique$Freq > 1,]
list(gt2.unique_pregram=object_size(gt2[gt2$pregram %in% g2unique$Var1,]), gt3.unique_pregram=object_size(gt3[gt3$pregram %in% g3unique$Var1,]), gt4.unique_pregram=object_size(gt4[gt4$pregram %in% g4unique$Var1,]))
## $gt2.unique_pregram
## 47.1 MB
##
## $gt3.unique_pregram
## 85.1 MB
##
## $gt4.unique_pregram
## 53.7 MB
It may be worth working through the analysis of a larger fraction of the text (25% or more) while keeping only n-grams with count > 1, and then building a ‘parts of speech’ model (e.g. suggest a noun after an adjective, based on noun popularity). Backoff models will help with size reduction as well, and this approach would also leave room for introducing proper nouns (capitalized words).