This is the milestone report for the SwiftKey capstone project. In this report we load, clean and explore the data.
The data set consists of feeds from blogs, Twitter and news in English, German, Finnish and Russian. For exploration and cleaning, I consider only the English feeds. We also build an n-gram model from the data set.
The data is available for download from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Due to the large size of the data set, I store it locally in RDS format.
Once the archive is downloaded and unzipped, the files can be read with R's readLines routine. However, for performance reasons we load the data from a pre-saved local RDS file.
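As a rough illustration of this caching step, the sketch below reads a text file with readLines once and stores the result as RDS so that later runs can skip the slow parse. The helper name `load_cached` and the file paths are hypothetical, and a small temporary file stands in for the real data.

```r
# Hypothetical helper: parse a text file once, then reuse a cached RDS copy.
load_cached <- function(txt_path, rds_path) {
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))           # fast path: cached copy already exists
  }
  con <- file(txt_path, open = "rb")    # binary mode, so odd bytes do not end the read early
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  saveRDS(lines, rds_path)              # write the cache for the next run
  lines
}

# Demo: a temporary file stands in for one of the English data files.
txt <- tempfile(fileext = ".txt")
rds <- tempfile(fileext = ".rds")
writeLines(c("first line", "second line"), txt)
first_load  <- load_cached(txt, rds)    # parses the text and writes the cache
second_load <- load_cached(txt, rds)    # served from the RDS file
```

The second call never touches the text file, which is where the time saving on the full data set would come from.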
The file size, number of lines and number of words for each data file are shown below.
## source file_size_MB lines words
## 1 blogs 200.4242 899288 37546246
## 2 news 196.2775 1010242 34762395
## 3 twitter 159.3641 2360148 30093410
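A table like the one above could be produced along the following lines. This is only a sketch: the helper `summarise_file` is hypothetical, and it counts words as whitespace-delimited tokens, which may not match the method used for the figures in this report.

```r
# Hypothetical helper: file size in MB, line count and word count for one file.
summarise_file <- function(name, path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # Assumption: a "word" is any whitespace-delimited token.
  words <- sum(lengths(strsplit(trimws(lines), "\\s+")))
  data.frame(source = name,
             file_size_MB = file.size(path) / 1024^2,
             lines = length(lines),
             words = words)
}

# Demo on a small temporary file rather than the real data.
path <- tempfile(fileext = ".txt")
writeLines(c("hello world", "one two three", ""), path)
stats <- summarise_file("demo", path)
```

Calling the helper once per source and combining the results with rbind would give one row per file.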
To keep processing time manageable, we draw a random sample of 100,000 lines from each source.
set.seed(1)
ENGDATA$blogs <- sample(ENGDATA$blogs, 100000)
ENGDATA$news <- sample(ENGDATA$news, 100000)
ENGDATA$twitter <- sample(ENGDATA$twitter, 100000)
We use the “tm” package to create a corpus from the data set, then convert the text to lower case and remove extra whitespace, punctuation, numbers, English stopwords and profanity. The profanity list is taken from Google’s list of banned words.
badwords <- read.csv('full-list-of-bad-words-banned-by-google-csv-file_2013_11_26_04_52_30_695.csv', header=FALSE, stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(ENGDATA))
# Normalise case first so that the word matching below is case-insensitive
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords('en'))
corpus <- tm_map(corpus, removeWords, badwords$V1)
# Strip whitespace last: the removals above leave runs of spaces behind
corpus <- tm_map(corpus, stripWhitespace)
Now that we have the corpus, we can build frequency tables for unigrams, bigrams and trigrams.
# removeSparseTerms is taking too long, so those calls are left commented out.
# rowapply_simple_triplet_matrix (slam package) sums each term's counts across documents.
# unigram <- removeSparseTerms(TermDocumentMatrix(corpus), 0.9999)
unigram <- rowapply_simple_triplet_matrix(TermDocumentMatrix(corpus), sum)
unigram_freq <- getFreq(unigram)

# bigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = function(x) {
#   ngramsTokenizer(x, 2)
# })), 0.9999)
bigram <- rowapply_simple_triplet_matrix(TermDocumentMatrix(corpus, control = list(tokenize = function(x) {
  ngramsTokenizer(x, 2)
})), sum)
bigram_freq <- getFreq(bigram)

# trigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = function(x) {
#   ngramsTokenizer(x, 3)
# })), 0.9999)
trigram <- rowapply_simple_triplet_matrix(TermDocumentMatrix(corpus, control = list(tokenize = function(x) {
  ngramsTokenizer(x, 3)
})), sum)
trigram_freq <- getFreq(trigram)
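The helpers getFreq and ngramsTokenizer are called above but their definitions are not shown in this report. The sketch below gives hypothetical base-R stand-ins to illustrate what such helpers might do. It assumes the tokenizer receives plain character strings and that getFreq receives a named numeric vector of counts. The report's actual definitions may differ.

```r
# Hypothetical stand-in for ngramsTokenizer(): split each string on whitespace
# and join every run of n consecutive words.
ngramsTokenizer <- function(x, n) {
  as.character(unlist(lapply(as.character(x), function(line) {
    w <- strsplit(trimws(line), "\\s+")[[1]]
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))   # too short to form an n-gram
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }), use.names = FALSE))
}

# Hypothetical stand-in for getFreq(): turn a named vector of counts into a
# data frame sorted by decreasing frequency.
getFreq <- function(counts) {
  counts <- sort(counts, decreasing = TRUE)
  data.frame(word = names(counts), freq = as.numeric(counts),
             row.names = NULL, stringsAsFactors = FALSE)
}

# Demo on two toy sentences.
toks   <- ngramsTokenizer(c("the quick brown fox", "the quick red fox"), 2)
counts <- table(toks)
bigram_demo <- getFreq(setNames(as.numeric(counts), names(counts)))
```

In the demo, "the quick" is the only bigram that occurs twice, so it ends up in the first row of `bigram_demo`.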
Using the frequency tables, we can draw a word cloud of the unigrams, which shows us the 100 most frequently used words.
pal = brewer.pal(9,"Blues")
wordcloud(unigram_freq$word, unigram_freq$freq, max.words = 100, colors = pal)
Because only a small subset of the data is used, the bigram and trigram frequency tables are very small, and each n-gram has a frequency of one. The bar charts below show the first 25 bigrams and trigrams.
ggplot(head(bigram_freq, 25), aes(reorder(word, -freq), freq)) +
  labs(x = "Words", y = "Frequency") +
  ggtitle('Bigram histogram') +
  theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
  geom_bar(stat = "identity")
ggplot(head(trigram_freq, 25), aes(reorder(word, -freq), freq)) +
  labs(x = "Words", y = "Frequency") +
  ggtitle('Trigram histogram') +
  theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
  geom_bar(stat = "identity")
Because processing a larger data set is computationally expensive, it is proving difficult to build a prediction model. The next step is to explore prediction models that work well with this data set. The stopword removal step may also need to be dropped, since stopwords themselves may need to be predicted. Trigrams appear to be the most useful for predicting the next word, but this needs further exploration.
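To make the trigram idea concrete, here is a minimal sketch of how a trigram frequency table could be used to suggest the next word. It is not the model for this project, which is still to be chosen. The toy table, the column names `prefix` and `next_word`, and the helper `predict_next` are all hypothetical.

```r
# Toy trigram table in the same word/freq layout as the frequency tables above.
tri <- data.frame(
  word = c("i want to", "i want a", "want to go", "want to be"),
  freq = c(5, 2, 4, 3),
  stringsAsFactors = FALSE
)

# Split each trigram into its two-word prefix and its final word.
tri$prefix    <- sub("\\s+\\S+$", "", tri$word)
tri$next_word <- sub("^.*\\s+", "", tri$word)

# Hypothetical helper: most frequent continuation of a two-word prefix,
# or NA when the prefix was never seen.
predict_next <- function(prefix, tbl) {
  hits <- tbl[tbl$prefix == tolower(trimws(prefix)), ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$next_word[which.max(hits$freq)]
}

guess_1 <- predict_next("i want", tri)
guess_2 <- predict_next("want to", tri)
guess_3 <- predict_next("no such", tri)   # unseen prefix
```

A practical model would also need a fallback for unseen prefixes, for example backing off to the bigram or unigram tables, which this sketch does not attempt.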