The Coursera Data Science program culminates in a capstone project in which we use provided text data to produce predictions akin to those found on cell phone keyboards: predicting the next word based on the previous word(s) typed by the user.
This first report outlines the data exploration and initial development performed for that purpose.
Data can be downloaded from the following source and unpacked for analysis. The download is quite large and takes significant space when unpacked, so choose a working directory with room for it.
#NOT RUN
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, "./Coursera-SwiftKey.zip")
unzip("./Coursera-SwiftKey.zip", exdir = ".")
#the archive unpacks to ./final; rename it to ./data
file.rename("./final", "./data")
With the file downloaded and extracted, we can see that it contains three large English text files (as well as Russian, Finnish, and German language files, which we’ll leave aside; it could be interesting to play with them later). The English files are compiled blog posts, news reports, and Twitter tweets.
Our text files are quite big:
#file.size/1000000 to get MB
fs<-file.size(c("./data/en_US/en_US.blogs.txt", "./data/en_US/en_US.news.txt", "./data/en_US/en_US.twitter.txt"))/1000000
fs
## [1] 210.1600 205.8119 167.1053
So with three files near 200 MB each, we should do some subsetting for our exploratory work. We’ll read each file, then see how many lines each has. The readr package makes this easier.
library(readr)
en_blogs<-read_lines("./data/en_US/en_US.blogs.txt", progress=FALSE)
en_news<-read_lines("./data/en_US/en_US.news.txt", progress=FALSE)
en_twitter<-read_lines("./data/en_US/en_US.twitter.txt", progress=FALSE)
blogs_lines<-length(en_blogs)
news_lines<-length(en_news)
twitter_lines<-length(en_twitter)
So, our blog file has 899288 lines, the news file has 1010242 lines, and the Twitter file has 2360148 lines. The number of words per line likely differs between the files; Twitter is an obvious example (no line should be above 140 characters) compared with news or blogs. We’ll investigate these distributions.
en_blogs_word<-strsplit(gsub("[[:punct:]]","",en_blogs), " ")
en_news_word<-strsplit(gsub("[[:punct:]]","",en_news), " ")
en_twitter_word<-strsplit(gsub("[[:punct:]]","",en_twitter), " ")
blogs_wordcount<-length(unlist(en_blogs_word))
blogs_un_wordcount<-length(unique(unlist(en_blogs_word)))
news_wordcount<-length(unlist(en_news_word))
news_un_wordcount<-length(unique(unlist(en_news_word)))
twitter_wordcount<-length(unlist(en_twitter_word))
twitter_un_wordcount<-length(unique(unlist(en_twitter_word)))
The distributions of words per line for each file are plotted below.
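The chunk that builds the plot object g isn’t echoed above; here is a minimal sketch of how it might be constructed, assuming ggplot2 and the word splits from the previous chunk (the 150-word axis cap is our assumption, which would account for the ‘removed rows’ warning that follows):
library(ggplot2)
#words per line for each source, in one long data frame
line_words <- data.frame(
  words = c(lengths(en_blogs_word), lengths(en_news_word), lengths(en_twitter_word)),
  source = c(rep("Blogs", blogs_lines), rep("News", news_lines), rep("Twitter", twitter_lines))
)
g <- ggplot(line_words, aes(x = words, fill = source)) +
  geom_density(alpha = 0.5) +   #densities, consistent with the stat_density warning below
  xlim(0, 150) +                #assumed cap; lines beyond it are dropped from the plot
  labs(x = "Words per line", y = "Density", title = "Words per Line by Source")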
g
## Warning: Removed 26980 rows containing non-finite values (stat_density).
A summary of all of the discovered data is below:
| File | File Size (MB) | Lines | Words | Unique Words | Prop. Unique |
|---|---|---|---|---|---|
| Blogs | 210.1600 | 899288 | 37320843 | 518021 | 0.0138802 |
| News | 205.8119 | 1010242 | 34368164 | 422939 | 0.0123061 |
| Twitter | 167.1053 | 2360148 | 30163333 | 649081 | 0.0215189 |
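For reference, this table can be assembled from the quantities already computed; a minimal sketch, assuming knitr::kable for rendering:
summary_df <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  File.Size = fs,   #MB, from file.size()/1000000 above
  Lines = c(blogs_lines, news_lines, twitter_lines),
  Words = c(blogs_wordcount, news_wordcount, twitter_wordcount),
  Unique.Words = c(blogs_un_wordcount, news_un_wordcount, twitter_un_wordcount)
)
summary_df$Prop.Unique <- summary_df$Unique.Words / summary_df$Words
knitr::kable(summary_df, col.names = c("File", "File Size (MB)", "Lines", "Words", "Unique Words", "Prop. Unique"))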
For the rest of our exploration we’ll use only 5% of each file, randomly selected, to keep computation cost down. We’ll also look at each file separately, to see if there are any differences.
set.seed(1)
fraction<-0.05
en_blogs<-en_blogs[sample(c(1:blogs_lines), blogs_lines*fraction)]
en_news<-en_news[sample(c(1:news_lines), news_lines*fraction)]
en_twitter<-en_twitter[sample(c(1:twitter_lines), twitter_lines*fraction)]
One of the things we will be doing is building ‘n-gram’ models. However, they should be built once the data is cleaned. To clean the data effectively, we’ll use the quanteda package in R, along with stringi to clear up any string formatting issues. We’ll clean the blogs text, then discuss the cleaning steps and compare them to other language processing that might be used in other situations.
library(quanteda)
library(stringi)
#stokens<-tokens(c_blogs, what='sentence', removeNumbers=TRUE, removePunc=TRUE, removeSymbols=TRUE, removeSeparators=TRUE, removeTwitter=TRUE, removeHyphens=TRUE, removeURL=TRUE)
#Prepare the text data
#Fix accented characters (Not applicable for other languages) & some symbols associated
en_blogs <- stri_trans_general(en_blogs,"Latin-ASCII")
en_blogs <- gsub("[µºˆ_]+", '', en_blogs)
#set to lowercase
en_blogs <- char_tolower(en_blogs)
#Remove URLS
en_blogs <- gsub("(f|ht)tp(s?):\\/\\/(.*)[.][a-z=\\?\\/]+", "", en_blogs)
#Remove Twitter 'Via @username' and '@usernames' and 'rt' and hashtags
en_blogs <- gsub("via\\s*@[a-z_]{1,20}","", en_blogs)
en_blogs <- gsub("\\s@[a-z_]{1,20}", "", en_blogs)
en_blogs <- gsub("\\brt\\b", "", en_blogs)
en_blogs <- gsub("#[a-z0-9_]+", "", en_blogs)
#Get a 'bad words' list to remove from the corpus
badwords_url <- "https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
badwords <- unlist(strsplit(RCurl::getURL(badwords_url), "\n"))
#last 'bad word' is an emoji, remove
badwords <- badwords[-length(badwords)]
#replacement builder code from getAnywhere(tm::removeWords.character)
en_blogs <- gsub(sprintf("(*UCP)\\b(%s)\\b", paste(badwords, collapse = "|")), "", en_blogs, perl = TRUE)
#gsub(sprintf("(*UCP)\\b(%s)\\b", paste(stopwords("en"), collapse = "|")), "", en_blogs, perl = TRUE)
So what all happened there? We converted the text to plain ASCII characters (a step not needed for languages that rely on accented characters), set it to lower case, removed URLs, stripped some Twitter noise (usernames, ‘via @username’ mentions, hashtags, and standalone ‘rt’, but not the ‘rt’ inside words like ‘art’), and finally removed the bad words. We’ll do the same to the other text inputs in the background.
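For the ‘in the background’ part, the same steps can be wrapped in a small helper and applied to the news and Twitter text. This is a sketch: the cleanText name is ours, and it simply repeats the substitutions shown above (it assumes the badwords vector already exists):
cleanText <- function(x, badwords) {
  x <- stri_trans_general(x, "Latin-ASCII")                      #accents to ASCII
  x <- gsub("[µºˆ_]+", "", x)                                    #stray symbols
  x <- char_tolower(x)
  x <- gsub("(f|ht)tp(s?):\\/\\/(.*)[.][a-z=\\?\\/]+", "", x)    #URLs
  x <- gsub("via\\s*@[a-z_]{1,20}", "", x)                       #'via @username'
  x <- gsub("\\s@[a-z_]{1,20}", "", x)                           #'@username'
  x <- gsub("\\brt\\b", "", x)                                   #standalone 'rt'
  x <- gsub("#[a-z0-9_]+", "", x)                                #hashtags
  gsub(sprintf("(*UCP)\\b(%s)\\b", paste(badwords, collapse = "|")), "", x, perl = TRUE)
}
en_news <- cleanText(en_news, badwords)
en_twitter <- cleanText(en_twitter, badwords)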
Now onto the next step, where we really get into the text analysis with quanteda.
#Create a 'Corpus' of words
corpus_blog <- corpus(en_blogs)
corpus_news <- corpus(en_news)
corpus_twitter <- corpus(en_twitter)
docnames(corpus_blog) <- paste('blog',1:ndoc(corpus_blog))
docnames(corpus_news) <- paste('news',1:ndoc(corpus_news))
docnames(corpus_twitter) <- paste('tweet',1:ndoc(corpus_twitter))
corpus_tot <- corpus_blog+corpus_news+corpus_twitter
dfm1 <- dfm(corpus_tot, removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE, removeURL = TRUE, removeTwitter = TRUE, removeHyphens=TRUE, ngrams=1)
dfm2 <- dfm(corpus_tot, removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE, removeURL = TRUE, removeTwitter = TRUE, removeHyphens=TRUE, ngrams=2)
dfm3 <- dfm(corpus_tot, removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE, removeURL = TRUE, removeTwitter = TRUE, removeHyphens=TRUE, ngrams=3)
dfm4 <- dfm(corpus_tot, removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE, removeURL = TRUE, removeTwitter = TRUE, removeHyphens=TRUE, ngrams=4)
We then turned the blog, news, and Twitter data into corpora, assigned document names to each, and combined them into one corpus. We then produced document feature matrices (essentially term-document matrices), breaking each document up into 1-, 2-, 3- and 4-gram chunks (e.g. for the string “I want to go.”, the 3-grams are “I want to” and “want to go.”). While doing so, we cleaned out any numbers, punctuation, symbols, separators, and missed URLs or Twitter crud. With this, we can see which n-grams are the most popular!
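As a quick illustration of the chunking, here is a minimal sketch using quanteda’s tokens_ngrams(); the underscore-joined form matches the feature names we see in the dfms:
toks <- tokens("I want to go")
tokens_ngrams(toks, n = 3)   #yields the 3-grams "I_want_to" and "want_to_go"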
topfeatures(dfm3)
## one_of_the a_lot_of thanks_for_the going_to_be to_be_a
## 1790 1499 1192 883 870
## out_of_the it_was_a some_of_the as_well_as the_end_of
## 722 709 708 702 698
Similarly, the single most common 4-gram from the set is “the_end_of_the”. Seems a bit… pessimistic.
Typical text processing begins to diverge from this point. Sometimes you may be interested in removing all of the common words such as ‘a’, ‘an’, ‘the’, ‘not’, ‘is’, etc. These are called ‘stop words’, and they are easily removed in the dfm creation step with dfm(remove = stopwords()). That won’t work for us, as we want to know when to predict a stop word. Another manipulation of the text is called ‘stemming’, where words are shortened to their common stem. For example, ‘analyse’, ‘analysis’, ‘analysing’, and ‘analysed’ would all be reduced to ‘analys’, allowing analysis that doesn’t depend on plurality, tense (past/present/future), or other variations. Again, this would have been as easy as dfm(stem = TRUE), but we want to predict the full word in our tool.
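For reference only (we don’t use this for the prediction tables), both options are single arguments in the same older-style dfm() call used throughout this report; the object name here is just illustrative:
#illustration only: stop-word removal and stemming in one dfm() call
dfm_nostop_stem <- dfm(corpus_tot, remove = stopwords("english"), stem = TRUE,
                       removeNumbers = TRUE, removePunct = TRUE)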
There are a few other interesting analyses we can do with text data if we want:
One is ‘sentiment analysis’: looking at the texts and pulling out words that imply happiness, sadness, anger, etc. This can be done quite easily with the R package syuzhet, so I’ll demonstrate with the blogs data:
library(syuzhet)
sentiments <- get_nrc_sentiment(en_blogs)
sentimentTotals <- data.frame("count" = colSums(sentiments))
sentimentTotals$sentiment <- rownames(sentimentTotals)
rownames(sentimentTotals) <- NULL
ggplot(data=sentimentTotals, aes(x=sentiment, y=count)) +
geom_bar(aes(fill=sentiment), stat='identity') +
theme_bw() + theme(legend.position = 'none', axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ggtitle("Sentiment Counts for Blogs") + xlab("Sentiment") + ylab("Total Count")
Similarly, word clouds are popular yet powerful visual tools for analyzing text data. We’ll make a word cloud for our combined corpus with the built-in textplot_wordcloud function, which interfaces to the wordcloud package. This works better after removing stop words and stemming. I’ll pull a few extra stop words out too, since some don’t get caught, either because our preprocessing removed their punctuation or because they’re not in the base stop words list.
wordcloud_dfm<-dfm(corpus_tot, remove=stopwords('english'), stem=TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE, removeSeparators = TRUE, removeURL = TRUE, removeTwitter = TRUE)
textplot_wordcloud(wordcloud_dfm, max.words = 100, random.order = FALSE, colors = RColorBrewer::brewer.pal(9, 'PuBuGn'))
Back to the problem at hand. With our n-gram models, we’ll want to be able to predict which words come next after a user has entered a few words. Things like ‘I want to go to the …’ should result in proposed words such as ‘store’, ‘park’, ‘mall’. We can do this by manipulating our n-gram data.
We’ll split our n-grams into their first n-1 words and their final word. For a 4-gram “I want to go”, that splits into “I want to” and “go”. We can then compare the counts of “go” against “eat”, “sleep”, and any other completions we saw in the corpus. The stringr package will help here. I’ve made a function called ‘splitGrams’ that takes our document feature matrices as input and returns a table of ‘pregram’ values and the expected word after each.
library(stringr)
splitGrams <- function(docfeatmatrix){
  #total count of each n-gram across all documents
  freq_table <- colSums(docfeatmatrix)
  #features are n-grams joined by '_'; convert back to space-separated words
  grams <- colnames(docfeatmatrix)
  grams <- trimws(gsub("_", " ", grams))
  #n-gram length, taken from the first feature
  n <- length(strsplit(grams[1], split = ' ')[[1]])
  if (n > 1){
    grams <- str_split_fixed(grams, ' ', n)
    #pregram = first n-1 words, postgram = final word
    gram_table <- data.frame(pregram = apply(grams, 1, function(x) paste(x[1:(n - 1)], collapse = '_')),
                             postgram = grams[, n], count = freq_table, stringsAsFactors = FALSE)
  } else if (n == 1){
    gram_table <- data.frame(pregram = '', postgram = grams, count = freq_table, stringsAsFactors = FALSE)
  } else {
    stop("n-gram input not valid: < 1 n-gram length")
  }
  #order by pregram, then most frequent postgram first
  gram_table <- gram_table[order(gram_table$pregram, -gram_table$count, gram_table$postgram), ]
  rownames(gram_table) <- NULL
  return(gram_table)
}
gt1<-splitGrams(dfm1)
gt2<-splitGrams(dfm2)
gt3<-splitGrams(dfm3)
gt4<-splitGrams(dfm4)
With our gram popularity now calculated, we can look at the top results for each n. For example, with our ‘3-gram’ model, the pregram ‘to the’ yields:
head(gt3[gt3$pregram == 'to_the', ])
## pregram postgram count
## 2946775 to_the public 98
## 2946776 to_the point 97
## 2946777 to_the next 96
## 2946778 to_the top 71
## 2946779 to_the new 68
## 2946780 to_the state 64
Likewise, for our ‘go to the …’ case, we get:
head(gt4[gt4$pregram == 'go_to_the', ])
## pregram postgram count
## 1283763 go_to_the gym 12
## 1283764 go_to_the beach 10
## 1283765 go_to_the hospital 8
## 1283766 go_to_the store 8
## 1283767 go_to_the bathroom 6
## 1283768 go_to_the movies 6
These are low frequencies, but using more data should provide better results.
With our n-grams, the plan is to produce predictions of what could follow, in much the same way as above. When the user provides ‘go to the’, the model should suggest ‘gym’, ‘beach’, ‘hospital’, etc. If a provided n-gram has not been seen before, or the count for a specific pregram is very low, it may be beneficial to back off to the (n-1)-gram, i.e., test ‘to the’ if ‘go to the’ does not provide enough hits.
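A minimal sketch of that lookup-with-backoff idea, assuming the gt1 to gt4 tables built above; the predictNext name and the min_count cutoff are illustrative choices, not the final design:
predictNext <- function(phrase, n_suggest = 3, min_count = 2) {
  words <- strsplit(char_tolower(trimws(phrase)), "\\s+")[[1]]
  tables <- list(gt4, gt3, gt2)   #try the 4-gram table first, then back off
  pre_len <- c(3, 2, 1)           #pregram length used by each table
  for (i in seq_along(tables)) {
    k <- pre_len[i]
    if (length(words) < k) next
    pre <- paste(tail(words, k), collapse = "_")
    hits <- tables[[i]][tables[[i]]$pregram == pre & tables[[i]]$count >= min_count, ]
    #tables are already sorted by descending count within each pregram
    if (nrow(hits) > 0) return(head(hits$postgram, n_suggest))
  }
  #nothing matched: fall back to the most common single words
  head(gt1$postgram[order(-gt1$count)], n_suggest)
}
#e.g. predictNext("go to the") should return something like c("gym", "beach", "hospital")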
These gram tables make for large data frames.
library(pryr)
list(gt1=object_size(gt1), gt2=object_size(gt2), gt3=object_size(gt3), gt4=object_size(gt4))
## $gt1
## 8.93 MB
##
## $gt2
## 42.9 MB
##
## $gt3
## 171 MB
##
## $gt4
## 310 MB
However, they can be cut down quickly by keeping only n-grams seen more than once:
list(gt2.small=object_size(gt2[gt2$count > 1, ]), gt3.small=object_size(gt3[gt3$count > 1, ]), gt4.small=object_size(gt4[gt4$count > 1, ]))
## $gt2.small
## 12.3 MB
##
## $gt3.small
## 17.1 MB
##
## $gt4.small
## 9.84 MB
Or, by removing pregrams that have only one observed postgram:
g2unique<-data.frame(table(gt2$pregram))
g2unique<-g2unique[g2unique$Freq > 1,]
g3unique<-data.frame(table(gt3$pregram))
g3unique<-g3unique[g3unique$Freq > 1,]
g4unique<-data.frame(table(gt4$pregram))
g4unique<-g4unique[g4unique$Freq > 1,]
list(gt2.unique_pregram=object_size(gt2[gt2$pregram %in% g2unique$Var1,]), gt3.unique_pregram=object_size(gt3[gt3$pregram %in% g3unique$Var1,]), gt4.unique_pregram=object_size(gt4[gt4$pregram %in% g4unique$Var1,]))
## $gt2.unique_pregram
## 47.1 MB
##
## $gt3.unique_pregram
## 85.1 MB
##
## $gt4.unique_pregram
## 53.7 MB
It may be worth working through the analysis of a larger fraction of the text (25% or more) while keeping only n-grams with count > 1, and then building a ‘parts of speech’ model (e.g. suggest a noun after an adjective, based on noun popularity). Backoff models will help with size reduction as well, and this approach would also leave room for introducing proper nouns (capitalized words).