Load the data

First, I load the raw data into character vectors:

dir <- file.path("~", "Coursera", "Capstone", "final", "en_US")
news_data    <- readLines(file.path(dir, "en_US.news.txt"))
blogs_data   <- readLines(file.path(dir, "en_US.blogs.txt"))
twitter_data <- readLines(file.path(dir, "en_US.twitter.txt"))
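
Note: on some systems readLines() warns about embedded nul characters in the Twitter file; if that happens, the skipNul argument in base R drops them:

# Optional: suppress embedded-nul warnings when reading the Twitter file
twitter_data <- readLines(file.path(dir, "en_US.twitter.txt"), skipNul = TRUE)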

Summary Stats

Next, I look at how many lines and words each set of documents contains:

news_line_count <- length(news_data)
blogs_line_count <- length(blogs_data)
twitter_line_count <- length(twitter_data)

# Approximate word counts: count the runs of non-word characters (separators)
# in each line and add one. This slightly overcounts lines with leading or
# trailing punctuation, but is adequate for summary purposes.
news_word_count    <- sum(sapply(gregexpr("\\W+", news_data), length) + 1)
blogs_word_count   <- sum(sapply(gregexpr("\\W+", blogs_data), length) + 1)
twitter_word_count <- sum(sapply(gregexpr("\\W+", twitter_data), length) + 1)

word_counts <- c(twitter_word_count, news_word_count, blogs_word_count)
line_counts <- c(twitter_line_count, news_line_count, blogs_line_count)
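
As a quick sanity check of the separator-based approximation on a single string:

length(gregexpr("\\W+", "the quick brown fox")[[1]]) + 1
## [1] 4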

m <- matrix(c(word_counts, line_counts), byrow=TRUE, ncol=3, nrow=2)
rownames(m) <- c("Word count", "Line count")
colnames(m) <- c("Twitter", "News", "Blogs")
m
##             Twitter     News    Blogs
## Word count 32793399 36721087 39120549
## Line count  2360148  1010242   899288

To analyze the term frequencies efficiently, I take a 5% random sample of each source:

sample_size <- 0.05
set.seed(1234)
news_subset    <- sample(news_data, floor(length(news_data) * sample_size))
blogs_subset   <- sample(blogs_data, floor(length(blogs_data) * sample_size))
twitter_subset <- sample(twitter_data, floor(length(twitter_data) * sample_size))
# Pool the three samples and shuffle so the sources are interleaved
combined_doc <- sample(c(news_subset, blogs_subset, twitter_subset))
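
The combined sample contains the expected number of lines (5% of the 4,269,678 total), matching the document count reported by dfm() below:

length(combined_doc)
## [1] 213483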

Uni-Grams

Using the quanteda package, I tokenize the combined sample into uni-grams and build a document-feature matrix:

library(quanteda)

tokenized_doc <- tokenize(toLower(combined_doc), 
                          removePunct = TRUE,
                          removeNumbers = TRUE,
                          removeTwitter = TRUE,
                          ngrams = 1)
my_dfm <- dfm(tokenized_doc)
## 
##    ... indexing documents: 213,483 documents
##    ... indexing features: 142,584 feature types
##    ... created a 213483 x 142584 sparse dfm
##    ... complete. 
## Elapsed time: 3.193 seconds.
top_features <- topfeatures(my_dfm, 10)
top_features
##    the     to    and      a     of     in      i    for     is   that 
## 238371 137627 120771 118170 100250  82450  82010  54459  53668  51858
barplot(top_features, horiz = TRUE)

Without Stop Words

Many of these are stop words. It is more informative to see which words are most common after stop words are removed:

tokenized_doc <- removeFeatures(tokenized_doc, stopwords("english"))
top_features <- topfeatures(dfm(tokenized_doc), 10)
top_features
## 
##    ... indexing documents: 213,483 documents
##    ... indexing features: 142,410 feature types
##    ... created a 213483 x 142410 sparse dfm
##    ... complete. 
## Elapsed time: 2.55 seconds.
##  will  said  just   one  like   can   get  time   new  good 
## 15853 15405 15012 14643 13484 12274 11360 10568  9771  9001
barplot(top_features, horiz = TRUE)

Bi-Grams

tokenized_doc <- tokenize(toLower(combined_doc), 
                          removePunct = TRUE,
                          removeNumbers = TRUE,
                          removeTwitter = TRUE,
                          ngrams = 2)
top_features <- topfeatures(dfm(tokenized_doc))
## 
##    ... indexing documents: 213,483 documents
##    ... indexing features: 1,576,113 feature types
##    ... created a 213483 x 1576113 sparse dfm
##    ... complete. 
## Elapsed time: 6.097 seconds.
top_features
##   of_the   in_the   to_the  for_the   on_the    to_be   at_the  and_the 
##    21816    20681    10553     9978     9957     8103     7120     6427 
##     in_a with_the 
##     6008     5376
barplot(top_features, horiz = TRUE)

Tri-Grams

tokenized_doc <- tokenize(toLower(combined_doc), 
                          removePunct = TRUE,
                          removeNumbers = TRUE,
                          removeTwitter = TRUE,
                          ngrams = 3)
top_features <- topfeatures(dfm(tokenized_doc))
## 
##    ... indexing documents: 213,483 documents
##    ... indexing features: 3,358,881 feature types
##    ... created a 213483 x 3358881 sparse dfm
##    ... complete. 
## Elapsed time: 7.982 seconds.
top_features
##     one_of_the       a_lot_of thanks_for_the        to_be_a    going_to_be 
##           1772           1410           1273            905            866 
##     the_end_of     out_of_the      i_want_to    some_of_the       it_was_a 
##            783            724            724            721            704
barplot(top_features, horiz = TRUE)

Implementation Plan

Model Development

We are interested in the probable next word in a sequence of words. I will build a prediction model that takes an arbitrary sequence of tokens and, using n-gram frequencies like those explored above, returns the most likely next word.

The modeling steps are, in general, as follows (a minimal sketch of the core lookup appears after the list):

  1. Split the data into test/training sets.
  2. Build a predictive model on the training set.
  3. Test the model against the held-out test set, refining the model by repeating step #2 as necessary.
  4. Select the most accurate model from step #3.
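
As an illustration, here is a minimal sketch of the core tri-gram lookup. It assumes a hypothetical data frame trigrams with columns prefix (the first two words, joined by "_" as in the dfm output above), word (the continuation), and count; the names and structure are illustrative, not final.

# Minimal sketch only: `trigrams` is a hypothetical frequency table built
# from the tri-gram counts above. The full model would also smooth counts
# and back off to bi-grams/uni-grams when no tri-gram matches.
predict_next_word <- function(input, trigrams) {
  words <- strsplit(tolower(input), "\\s+")[[1]]
  prefix <- paste(tail(words, 2), collapse = "_")  # last two words as the key
  matches <- trigrams[trigrams$prefix == prefix, ]
  if (nrow(matches) == 0) return(NA_character_)    # back off here in practice
  matches$word[which.max(matches$count)]           # most frequent continuation
}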

Shiny App

The Shiny app will provide an interactive interface for testing the model developed in the above steps. It will take a string of words as input and return a single response: the word that completes the most likely n-gram. For a two-word input, for example, the app will return the third word of the most likely tri-gram.
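
A minimal sketch of what the app might look like, reusing the hypothetical predict_next_word() and trigrams objects sketched in the previous section:

library(shiny)

# Illustrative only: the real app will wrap the final model from step #4
ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    predict_next_word(input$phrase, trigrams)
  })
}

shinyApp(ui = ui, server = server)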