Milestone Report

library(stringi)
library(tm)
library(slam)
library(ggplot2)
library(wordcloud)
library(RWeka)
library(reshape2)
library(R.utils)

Summary

The goal of this report is to show that we are comfortable working with the data and are on track to build our prediction algorithm. The report is published on RPubs and explains our exploratory analysis and our goals for the eventual app and algorithm. It is a concise document that covers only the major features of the data we have identified and briefly summarizes our plans for creating the prediction algorithm and Shiny app, in a way that a non-data-scientist manager can follow. Tables and plots illustrate important summaries of the data set. The motivation for this project is to:

Demonstrate that we’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that we have amassed so far.
Get feedback on our plans for creating a prediction algorithm and Shiny app.

About the Data

All the data has been downloaded from the link provided in Week 1 of the Data Science Capstone on Coursera. There are four sets of files, one per language (German, English, Finnish and Russian), each containing samples of blogs, news and tweets. This project uses only the English dataset.

Traits and Characteristics of the Data

blogs <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
news <- readLines("./Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul=TRUE)
twitter <- readLines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)

# Check the length/line count of the files

summary(blogs)

##    Length     Class      Mode 
##    899288 character character

summary(news)

##    Length     Class      Mode 
##     77259 character character

summary(twitter)

##    Length     Class      Mode 
##   2360148 character character
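
The news count is far smaller than the other two. Before relying on it, it may be worth ruling out an early stop in the read, since on some platforms a text-mode read can end at an embedded control character. The sketch below shows one possible workaround, reading through a binary-mode connection. It has not been run against the report's files: `read_lines_binary` is a hypothetical helper name, and the demonstration uses a small temporary file rather than the real data.

```r
# Hypothetical helper (not part of the report): read every line of a file
# through a binary-mode connection so embedded control bytes do not end the read.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a temporary file whose second line starts with a SUB byte (0x1A).
demo_path <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("first line\n"), as.raw(0x1a), charToRaw("second line\nthird line\n")),
  demo_path
)
demo_lines <- read_lines_binary(demo_path)
length(demo_lines)
```

If the binary read returns more lines than the text-mode read for the same file, the smaller count was a truncation rather than a property of the data.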

# Check the word count of the files

blogs_words <- sum(stri_count_words(blogs))
news_words <- sum(stri_count_words(news))
twitter_words <- sum(stri_count_words(twitter))
print(blogs_words)

## [1] 37546246

print(news_words)

## [1] 2674536

print(twitter_words)

## [1] 30093410
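
The line and word counts above can be collected into a single table, which is easier to read for a non-technical audience. The following is a minimal sketch on toy vectors, not the report's code: `corpus_summary` is a hypothetical helper, and it counts words by splitting on whitespace in base R, which is an assumption and will not match `stri_count_words` exactly.

```r
# Hypothetical helper: one row per text source with line, word and
# longest-line counts. Words are approximated by splitting on whitespace.
corpus_summary <- function(...) {
  sources <- list(...)
  labels <- names(sources)
  sources <- unname(sources)
  count_words <- function(x) sum(lengths(strsplit(trimws(x), "\\s+")))
  data.frame(
    source    = labels,
    lines     = vapply(sources, length, integer(1)),
    words     = vapply(sources, count_words, integer(1)),
    max_chars = vapply(sources, function(x) max(nchar(x)), integer(1)),
    stringsAsFactors = FALSE
  )
}

# Toy stand-ins for the real character vectors.
toy_blogs   <- c("The quick brown fox", "jumps over the lazy dog")
toy_twitter <- c("so tired", "good morning all", "lol")

toy_summary <- corpus_summary(blogs = toy_blogs, twitter = toy_twitter)
print(toy_summary)
```

With the real data the call would take the `blogs`, `news` and `twitter` vectors as named arguments.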

Aggregating the Data

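After taking the above traits into consideration, each dataset is sampled so that later processing runs faster.

## Fix the random seed so the sample is reproducible (the value 1234 is arbitrary)
set.seed(1234)
sampleTwitter <- sample(twitter, 75000)
sampleNews <- sample(news, 75000)
sampleBlogs <- sample(blogs, 75000)
textSample <- c(sampleTwitter, sampleNews, sampleBlogs)

## Save sample (create the output folder first if it does not exist)
dir.create("./Sample Data", showWarnings = FALSE)
writeLines(textSample, "./Sample Data/BigTextSample.txt")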
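
A fixed sample size of 75000 only works while every file has at least that many lines, and the news file is close to that limit. The sketch below is one way to guard against this, under stated assumptions: `sample_lines` is a hypothetical helper, the seed value is arbitrary, and the data is a toy vector.

```r
# Hypothetical helper: draw up to n lines without replacement.
# The sample size is capped at the number of available lines, and the
# seed is set inside the function so repeated calls return the same draw.
sample_lines <- function(x, n, seed = 1234) {
  set.seed(seed)
  x[sample.int(length(x), size = min(n, length(x)))]
}

small_file <- paste("line", 1:10)

first_draw  <- sample_lines(small_file, 5)
second_draw <- sample_lines(small_file, 5)      # same seed, same lines
capped_draw <- sample_lines(small_file, 75000)  # asks for more than exists

length(first_draw)
length(capped_draw)
```

Setting the seed inside the helper resets the global random number stream on every call, which is convenient for a report but may not be wanted in a larger program.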

Cleaning the Data

We clean the sample data with the ‘tm’ package: we convert the text to lower case, remove punctuation, numbers, URLs and profanity, and stem the words.

SampleCleanse <- VCorpus(DirSource("./Sample Data/", encoding = "UTF-8"), readerControl = list(reader = readPlain))

inspect(SampleCleanse)

ProfanityFilter <- read.table("./Profanitywords.txt", header=FALSE)

## Converting to lower case
SampleCleanse <- tm_map(SampleCleanse, content_transformer(tolower))

## Stop words are kept for now because they may improve the predictive model. This will be tested when the algorithm is built.
## Remove punctuation, numbers, URLs and profanity, then stem the words:
SampleCleanse <- tm_map(SampleCleanse, content_transformer(removePunctuation))
SampleCleanse <- tm_map(SampleCleanse, content_transformer(removeNumbers))
## Punctuation is already gone at this point, so each URL has collapsed into a single token starting with "http"
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) 
SampleCleanse <- tm_map(SampleCleanse, content_transformer(removeURL))
SampleCleanse <- tm_map(SampleCleanse, stripWhitespace)
## removeWords expects a character vector, so pass the first column of the data frame rather than the data frame itself
SampleCleanse <- tm_map(SampleCleanse, removeWords, as.character(ProfanityFilter[[1]]))
SampleCleanse <- tm_map(SampleCleanse, stemDocument)
SampleCleanse <- tm_map(SampleCleanse, stripWhitespace)

## Save the cleaned and final sample
saveRDS(SampleCleanse, file = "./FinalSample.RData")

## Load the final sample
FinalData <- readRDS(file = "./FinalSample.RData")
sampleTDM <- TermDocumentMatrix(FinalData)
saveRDS(sampleTDM, file = "./sampleTDM.RData")
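
To make the effect of each cleaning step easier to see, the sketch below applies comparable steps to a single string in base R. It is an illustration under stated assumptions, not the ‘tm’ pipeline used above: `clean_text` is a hypothetical helper, URLs are removed first while their punctuation is still intact, stemming is left out, and the banned word is a harmless placeholder.

```r
# Hypothetical base R stand-in for the cleaning steps (no stemming).
clean_text <- function(x, banned = character(0)) {
  x <- tolower(x)
  # Remove URLs before punctuation so "://" and dots are still recognisable.
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)
  x <- gsub("[[:punct:]]+", "", x, perl = TRUE)   # drop punctuation
  x <- gsub("[[:digit:]]+", "", x, perl = TRUE)   # drop numbers
  if (length(banned) > 0) {
    # Match banned words only as whole words.
    pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  # Collapse runs of whitespace and trim the ends.
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

cleaned <- clean_text(
  "Visit http://example.com NOW, darn it! 42 times.",
  banned = "darn"
)
print(cleaned)
```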

Building a Basic n-gram Model

An n-gram model is a type of probabilistic language model that predicts the next item in a sequence using an (n-1)-order Markov model: the next word is assumed to depend only on the previous n-1 words.

n-gram models are widely used in statistical natural language processing.

Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability - with larger n, a model can store more context with a well-understood space-time tradeoff, enabling small experiments to scale up efficiently.
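
As a small illustration of the idea, the sketch below builds bigrams from a toy sentence in base R and estimates the probability of a word given the previous word as a ratio of counts. The sentence and the helper name `make_ngrams` are assumptions made for this example only.

```r
# Hypothetical helper: slide a window of n tokens across the text.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

tokens  <- strsplit("the cat sat on the mat and the cat slept", " ", fixed = TRUE)[[1]]
bigrams <- make_ngrams(tokens, 2)

# Count each distinct bigram and keep the counts as a plain named vector.
tab <- table(bigrams)
bigram_counts <- setNames(as.vector(tab), names(tab))

# Estimate P(cat | the) as count("the cat") / count of bigrams starting with "the".
starts_with_the <- bigram_counts[startsWith(names(bigram_counts), "the ")]
p_cat_given_the <- bigram_counts[["the cat"]] / sum(starts_with_the)
print(p_cat_given_the)
```

Here "the" is followed by "cat" twice and by "mat" once, so the estimate is 2/3.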

BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

smpdata <- readRDS("./FinalSample.RData")

# n-gram generation
unigram <- removeSparseTerms(TermDocumentMatrix(smpdata), 0.9999)
bigram  <- removeSparseTerms(TermDocumentMatrix(smpdata, control = list(tokenize = BigramTokenizer)), 0.9999)
trigram <- removeSparseTerms(TermDocumentMatrix(smpdata, control = list(tokenize = TrigramTokenizer)), 0.9999)

Plots for the n-gram

To create plots for the n-grams, it would be helpful to extract the frequency from each of them.

freq_ngm <- function(x){
  # row_sums() from 'slam' sums the sparse matrix directly, avoiding a dense as.matrix() copy
  freq <- sort(row_sums(x), decreasing=TRUE)
  data.frame(word=names(freq), freq=freq, stringsAsFactors=FALSE)
}

unifreq <- freq_ngm(unigram)
bifreq <- freq_ngm(bigram)
trifreq <- freq_ngm(trigram)

plotfrequency <- function(data, title) {
  # head() avoids empty NA rows when fewer than 30 n-grams are available
  ggplot(head(data, 30), aes(reorder(word, -freq), freq)) +
         labs(x = "Words/Phrases", y = "Frequency") +
         ggtitle(title) +
         theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
         geom_bar(stat = "identity")
}


plotfrequency(unifreq, "Top 30 Unigrams")

plotfrequency(bifreq, "Top 30 Bigrams")

plotfrequency(trifreq, "Top 30 Trigrams")

Interesting Observations

Loading and processing the data takes a long time because of its size, so the data had to be sampled to keep processing time manageable.
Although removing stop words is commonly advised, they have been kept here because they are an essential part of any language and may improve the prediction model later on.

Next Steps

The next step is to create a predictive algorithm and make it as efficient and accurate as possible.
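
One simple direction, offered here only as a possible starting point and not as the final design, is a back-off lookup: try the last two words against the trigram table, fall back to the last word against the bigram table, and finally fall back to the most frequent unigram. The sketch below assumes frequency tables with `word` and `freq` columns like those built earlier; the helper names and the toy tables are hypothetical.

```r
# Hypothetical helper: most frequent n-gram that starts with the prefix,
# returned as its last word, or NA when nothing matches.
best_completion <- function(prefix, freq_table) {
  hits <- freq_table[startsWith(freq_table$word, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  sub(".* ", "", hits$word[which.max(hits$freq)])
}

# Back off from trigrams to bigrams to the single most frequent word.
predict_next <- function(phrase, trifreq, bifreq, unifreq) {
  tokens <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  n <- length(tokens)
  guess <- NA_character_
  if (n >= 2) guess <- best_completion(paste(tokens[n - 1], tokens[n]), trifreq)
  if (is.na(guess) && n >= 1) guess <- best_completion(tokens[n], bifreq)
  if (is.na(guess)) guess <- unifreq$word[which.max(unifreq$freq)]
  guess
}

# Toy frequency tables, invented for this example.
toy_tri <- data.frame(word = c("thanks for the", "thanks for your", "one of the"),
                      freq = c(5, 2, 4), stringsAsFactors = FALSE)
toy_bi  <- data.frame(word = c("of course", "of the", "in the"),
                      freq = c(9, 3, 8), stringsAsFactors = FALSE)
toy_uni <- data.frame(word = c("the", "to"),
                      freq = c(20, 15), stringsAsFactors = FALSE)

predict_next("Thanks for", toy_tri, toy_bi, toy_uni)
predict_next("a piece of", toy_tri, toy_bi, toy_uni)
predict_next("zebra", toy_tri, toy_bi, toy_uni)
```

This version picks the raw most frequent match at each level. Whether smoothing or weighting across levels improves accuracy is something that would need to be tested when the algorithm is built.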