Milestone report

Introduction

This is the milestone report for the SwiftKey capstone project. In this report we load, clean and explore the data.

The data set consists of feeds from blogs, Twitter and news in English, German, Finnish and Russian. For exploration and cleaning, I consider only the English feeds. We will also build an n-gram model from the data set.

The data is available for download from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Because the data set is large, I store it locally in RDS format.

Loading the data

We could load the data from the URL above using R's readLines function. For performance reasons, however, we load it from a locally saved RDS file.
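The read-once, cache-as-RDS approach described above could look roughly like the following. This is a minimal sketch rather than the code used for this report: the function name load_lines_cached and the temporary file paths are hypothetical, and the small demo file stands in for one of the real text files.

```r
# Hypothetical helper: read a text file once, then reuse a cached RDS copy.
load_lines_cached <- function(txt_path, rds_path) {
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))          # fast path: cached copy exists
  }
  con <- file(txt_path, open = "rb")   # binary mode avoids early EOF on odd bytes
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  saveRDS(lines, rds_path)             # cache for the next run
  lines
}

# Demo on a throwaway file (paths are placeholders, not the real data files)
tmp_txt <- tempfile(fileext = ".txt")
tmp_rds <- tempfile(fileext = ".rds")
writeLines(c("first line", "second line"), tmp_txt)
first  <- load_lines_cached(tmp_txt, tmp_rds)  # reads the text file
second <- load_lines_cached(tmp_txt, tmp_rds)  # reads the RDS cache
```

The second call never touches the text file, which is where the time saving comes from on files of several hundred megabytes.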

The file size, number of lines and number of words for each data file are shown below:

##    source file_size_MB   lines    words
## 1   blogs     200.4242  899288 37546246
## 2    news     196.2775 1010242 34762395
## 3 twitter     159.3641 2360148 30093410
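The report does not show how this table was computed. One way to produce a row of it is sketched below, under two assumptions of mine: words are counted by splitting each line on whitespace, and a megabyte is taken as 1024^2 bytes. The function name summarise_source is hypothetical.

```r
# Hypothetical helper: one summary row per source.
# Assumes whitespace-separated words and MB = 1024^2 bytes.
summarise_source <- function(name, lines, path = NULL) {
  data.frame(
    source       = name,
    file_size_MB = if (is.null(path)) NA_real_ else file.size(path) / 1024^2,
    lines        = length(lines),
    words        = sum(lengths(strsplit(trimws(lines), "\\s+"))),
    stringsAsFactors = FALSE
  )
}

# Tiny demo in place of the real feeds
demo_summary <- summarise_source("demo", c("hello world", "one two three"))
```

Rows for the three sources can then be stacked with rbind to get a table of the same shape as the one above.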

We sample 100,000 lines from each source so that the rest of the analysis runs faster.

set.seed(1)
ENGDATA$blogs <- sample(ENGDATA$blogs, 100000)
ENGDATA$news <- sample(ENGDATA$news, 100000)
ENGDATA$twitter <- sample(ENGDATA$twitter, 100000)

Cleaning the data

We use the “tm” package to create a corpus from the data set, then convert the text to lower case, collapse extra whitespace, and remove punctuation, numbers, English stopwords and profanity. The profanity list is taken from Google’s list of banned words.

library(tm)

# Profanity list: one word per row in the first column (V1)
badwords <- read.csv('full-list-of-bad-words-banned-by-google-csv-file_2013_11_26_04_52_30_695.csv', header=FALSE, stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(ENGDATA))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords('en'))
corpus <- tm_map(corpus, removeWords, badwords$V1)
# Collapse whitespace last, since the removals above leave gaps behind
corpus <- tm_map(corpus, stripWhitespace)
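To see what these transformations do to a single string, here is a base-R approximation. It is only an illustration and not how tm implements them: the function clean_text is hypothetical, and it assumes the words to drop are plain words with no regular-expression metacharacters.

```r
# Hypothetical base-R approximation of the cleaning steps (not tm's own code).
# Assumes drop_words contains plain words without regex metacharacters.
clean_text <- function(x, drop_words = character(0)) {
  x <- tolower(x)                                   # lower case
  x <- gsub("[[:punct:]]+", "", x, perl = TRUE)     # remove punctuation
  x <- gsub("[[:digit:]]+", "", x, perl = TRUE)     # remove numbers
  if (length(drop_words) > 0) {
    pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)          # remove whole words only
  }
  x <- gsub("\\s+", " ", x, perl = TRUE)            # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_text("Hello, World 123! The cat.", drop_words = c("the"))
```

Collapsing whitespace as the final step matters, because every earlier removal can leave a gap where the removed text used to be.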

Explore Data

Now that we have the corpus, we can build frequency tables for unigrams, bigrams and trigrams.

library(slam)  # provides rowapply_simple_triplet_matrix

# removeSparseTerms was too slow on this corpus, so sum each term's row directly
# unigram <- removeSparseTerms(TermDocumentMatrix(corpus), 0.9999)
unigram <- rowapply_simple_triplet_matrix(TermDocumentMatrix(corpus), sum)
unigram_freq <- getFreq(unigram)

# bigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = function(x) {
#   ngramsTokenizer(x, 2)
# })), 0.9999)
bigram <- rowapply_simple_triplet_matrix(TermDocumentMatrix(corpus, control = list(tokenize = function(x) {
  ngramsTokenizer(x, 2)
})), sum)

bigram_freq <- getFreq(bigram)

# trigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = function(x){
#   ngramsTokenizer(x, 3)
# })), 0.9999)
trigram <- rowapply_simple_triplet_matrix(TermDocumentMatrix(corpus, control = list(tokenize = function(x){
  ngramsTokenizer(x, 3)
})), sum)


trigram_freq <- getFreq(trigram)
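The helpers getFreq and ngramsTokenizer are not defined in this report. The sketch below shows what functions of this kind might look like in base R. It is a guess at their behaviour, not the originals: the names ngrams_tokenizer_sketch and get_freq_sketch are hypothetical, and get_freq_sketch assumes its input is a named numeric vector of counts and returns the word and freq columns that the plotting code expects.

```r
# Hypothetical tokenizer: split on whitespace and join every run of n words.
ngrams_tokenizer_sketch <- function(x, n) {
  words <- unlist(strsplit(trimws(as.character(x)), "\\s+"))
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical frequency helper: named counts -> data frame, most frequent first.
get_freq_sketch <- function(counts) {
  counts <- sort(counts, decreasing = TRUE)
  data.frame(word = names(counts), freq = as.numeric(counts),
             stringsAsFactors = FALSE, row.names = NULL)
}

# Demo: bigram counts for a short string
tokens <- ngrams_tokenizer_sketch("a b c a b", 2)
tab <- table(tokens)
demo_freq <- get_freq_sketch(setNames(as.numeric(tab), names(tab)))
```

In the demo, the bigram "a b" occurs twice and ends up in the first row of demo_freq.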

Plot Data

Using the frequency tables, we can draw a word cloud of the unigrams, which shows the 100 most frequently used words.

library(wordcloud)
library(RColorBrewer)  # provides brewer.pal

pal <- brewer.pal(9, "Blues")

wordcloud(unigram_freq$word, unigram_freq$freq, max.words = 100, colors = pal)

Because only a small subset of the data is used, the bigram and trigram frequency tables are very small, and each entry has a frequency of one.

library(ggplot2)

ggplot(head(bigram_freq, 25), aes(reorder(word, -freq), freq)) +
         labs(x = "Words", y = "Frequency") +
         ggtitle('Bigram histogram') +
         theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
         geom_bar(stat = "identity")

ggplot(head(trigram_freq, 25), aes(reorder(word, -freq), freq)) +
         labs(x = "Words", y = "Frequency") +
         ggtitle('Trigram histogram') +
         theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
         geom_bar(stat = "identity")

Summary

Because processing a larger data set requires a great deal of computation, it is proving difficult to come up with a prediction model. The next step is to explore prediction models that work well with this data set. The stopword removal step may also need to be dropped, since stopwords may themselves need to be predicted. Trigrams appear to be the most useful for predicting the next word, but this needs to be explored further.
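As a starting point for that exploration, a trigram frequency table can be used for a simple lookup: given the two previous words, return the most frequent third words. The sketch below is only an illustration of the idea and not a finished model. The function predict_next_sketch and the toy table are hypothetical, and it assumes a data frame with a word column holding space-separated trigrams and a freq column holding counts.

```r
# Hypothetical lookup: most frequent third words after the pair (w1, w2).
# Assumes trigram_freq has columns word ("w1 w2 w3") and freq (count).
predict_next_sketch <- function(trigram_freq, w1, w2, k = 3) {
  prefix <- paste0(w1, " ", w2, " ")
  hits <- trigram_freq[startsWith(trigram_freq$word, prefix), , drop = FALSE]
  hits <- hits[order(-hits$freq), , drop = FALSE]   # most frequent first
  head(substring(hits$word, nchar(prefix) + 1), k)  # keep only the third word
}

# Toy table in place of the real trigram_freq
toy <- data.frame(word = c("i love you", "i love it", "you love me"),
                  freq = c(5, 2, 1),
                  stringsAsFactors = FALSE)
suggestions <- predict_next_sketch(toy, "i", "love")
```

A lookup like this returns nothing for word pairs that never appear in the sample, so a usable model would also need a fallback to bigrams or unigrams.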