Executive Summary

This is the milestone report for the Data Science Capstone project. The project involves developing an algorithm that predicts the next word in real time as the user types. The algorithm learns from a corpus of news articles, blog posts, and Twitter data. This report provides some exploratory analysis of the text data and briefly outlines the next steps of the project: building the prediction algorithm and the data product.

Summary Statistics of the Datasets

First, we analyze the given datasets to explore their size, number of lines, and number of words. A helper function is written to return these statistics for each dataset.

As the first step, we load each dataset:

blog    <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

## Return the number of lines, total word count, and length of the longest line
stat_summary <- function(x){
  num_lines    <- length(x)
  num_words    <- sum(sapply(gregexpr("\\S+", x), length))
  maxchar_line <- nchar(x[which.max(nchar(x))])
  tot_stats <- cbind(num_lines, num_words, maxchar_line)
  return(tot_stats)
}

Using this function, we obtain summary statistics for each dataset:

##           Lines    Words Max_Chars
## Blog     899288 37334131     40833
## Twitter 2360148 30373583       140
## News      77259  2643969      5760
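
For reference, a minimal sketch of how the table above could be assembled from stat_summary (the row and column labels are assumptions chosen to match the output shown):

all_stats <- rbind(stat_summary(blog), stat_summary(twitter), stat_summary(news))
rownames(all_stats) <- c("Blog", "Twitter", "News")
colnames(all_stats) <- c("Lines", "Words", "Max_Chars")
all_stats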

Since we are assessing the text documents together, we combine the three datasets into one. Because the next steps would be time-consuming on the full dataset, we derive a small sample of about 0.1% of the combined data.
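A minimal sketch of how the combined corpus and the 0.1% sample could be created (the object name sample_corpora matches the cleaning code below; the seed and the exact sampling call are assumptions):

set.seed(123)                                ## for reproducibility (assumed)
full_corpora <- c(blog, news, twitter)       ## combine the three datasets
sample_corpora <- sample(full_corpora,       ## random ~0.1% sample of lines
                         round(0.001 * length(full_corpora)))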

Data Cleaning

The next step is to clean the dataset by removing punctuation, numbers, profanity, and so on. For the word cloud below, we also remove stop words such as “the”, “you”, and “his”, just to visualize which words are most common apart from these. The stop words will be put back into the dataset for the next step of generating n-grams and building the prediction algorithm, as they are quite important in predicting the next word.

library(tm)  ## text-mining framework used for the cleaning steps below

sample_corpora <- gsub("[^\x20-\x7E]", "", sample_corpora)  ## remove non-ASCII characters
review_source <- VectorSource(sample_corpora)
corpus <- Corpus(review_source)

bad_words <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")  ## profane words list

corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("en"))  ## drop English stop words (restored later)
corpus <- tm_map(corpus, removeWords, bad_words)        ## profanity filtering

This cleaned-up corpus is used to visualize the most common words as a word cloud. To do that, a document-term matrix is generated, from which the frequency of each word can be computed.

library(wordcloud)     ## wordcloud()
library(RColorBrewer)  ## brewer.pal()

dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
freq <- colSums(dtm2)                  ## total frequency of each term
freq <- sort(freq, decreasing = TRUE)
f <- data.frame(freq)
f$word <- row.names(f)
pal2 <- brewer.pal(8, "Dark2")
wordcloud(f$word, f$freq, scale = c(4, 0.2), min.freq = 50,
          random.order = FALSE, rot.per = 0.15, colors = pal2)

High-Frequency Sets of Words

The word cloud shows the high-frequency words in the sample text (minus the stop words). Next, we put the stop words back and generate n-grams, which are contiguous sequences of N words. The RWeka package is used to set up tokenizer functions for 1-grams, 2-grams, 3-grams, and 4-grams.

oneGram <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=1, max=1))}
biGram <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=2, max=2))}
triGram <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=3, max=3))}
quadGram <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=4, max=4))}

We then use the TermDocumentMatrix function with each tokenizer, convert the resulting term-document matrices to sparse matrices, and sum across documents to obtain the frequency of each word set.

library(Matrix)  ## sparseMatrix()
library(dplyr)   ## arrange(), desc()

tdm1 <- TermDocumentMatrix(corpus, control = list(tokenize = oneGram))
tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = biGram))
tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize = triGram))
tdm4 <- TermDocumentMatrix(corpus, control = list(tokenize = quadGram))

## Row sums of each sparse term-document matrix give the total frequency of each n-gram
gram1freq <- data.frame(word = tdm1$dimnames$Terms, freq = rowSums(sparseMatrix(i = tdm1$i, j = tdm1$j, x = tdm1$v)))
gram1freq <- arrange(gram1freq, desc(freq))

gram2freq <- data.frame(word = tdm2$dimnames$Terms, freq = rowSums(sparseMatrix(i = tdm2$i, j = tdm2$j, x = tdm2$v)))
gram2freq <- arrange(gram2freq, desc(freq))

gram3freq <- data.frame(word = tdm3$dimnames$Terms, freq = rowSums(sparseMatrix(i = tdm3$i, j = tdm3$j, x = tdm3$v)))
gram3freq <- arrange(gram3freq, desc(freq))

gram4freq <- data.frame(word = tdm4$dimnames$Terms, freq = rowSums(sparseMatrix(i = tdm4$i, j = tdm4$j, x = tdm4$v)))
gram4freq <- arrange(gram4freq, desc(freq))

From these frequency tables, we can list the top 10 word chains in the sample dataset:

##    1-Gram  2-Gram             3-Gram                4-Gram
## 1     the  of the     thanks for the thanks for the follow
## 2     and  in the         one of the       the rest of the
## 3     you  to the           a lot of    for the first time
## 4     for for the          i have to        i was going to
## 5    that   to be       cant wait to        i cant wait to
## 6    with  on the        the rest of       of a trade mark
## 7     was  it was        going to be     thanks for the rt
## 8    this  at the looking forward to         to be able to
## 9     are    in a            to be a          as much as i
## 10   have   for a        im going to      cant wait to see
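
The table above can be assembled by taking the first ten rows of each frequency data frame, for example (a sketch using the objects defined above):

top10 <- data.frame(`1-Gram` = head(gram1freq$word, 10),
                    `2-Gram` = head(gram2freq$word, 10),
                    `3-Gram` = head(gram3freq$word, 10),
                    `4-Gram` = head(gram4freq$word, 10),
                    check.names = FALSE)
top10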

We can plot the n-gram frequencies to see how they are distributed.
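
For example, the bigram frequencies could be plotted with ggplot2 (a sketch; the plotting package and styling here are assumptions, since the original plotting code is not shown):

library(ggplot2)

## Bar chart of the 15 most frequent bigrams in the sample
ggplot(head(gram2freq, 15), aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Frequency", title = "Top 15 Bigrams in the Sample")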

It can be observed that the frequency of word sets decreases for higher n-grams: the most frequent 1-gram occurs about 3,000 times in the sample, whereas the most frequent 4-gram chain occurs only about 8 times. This is an expected result.

Future Work

As outlined in the executive summary, the next steps are to use these n-gram frequency tables to build the next-word prediction algorithm and to package it as a data product.

Challenges

The main issue I have been experiencing is the size of the corpus. To construct the actual model, I will need to use most of the text data, which is very large for my machine! I would appreciate any suggestions for dealing with this.