Milestone Report

Our project uses three sample data files to build a text model that predicts the next word when a user enters a string of text. Our first step is an exploratory look at the data sets.

Summary of the blog, news and twitter files

We begin by looking at a summary of the three text files provided. We use Unix commands (like wc -w to perform word counts) on the three files. Here is a summary:

file.name	line.count	byte.count	char.count	word.count	avg.word.per.line
en_US.blogs.txt	899288	210160014	208623081	37334690	41.5
en_US.news.txt	1010242	205811889	205243643	34372720	34.0
en_US.twitter.txt	2360148	167105338	166816544	30374206	12.9

The file sizes range from 167 megabytes to 210 megabytes, and the three files combined contain over 4.2 million lines of text. The Twitter file stands out: it has the largest number of lines but averages only 12.9 words per line, compared with 34.0 and 41.5 for the news and blogs files respectively. That is not surprising given the 140-character limit on Twitter messages, but it is probably an early indication that the three data sets could have very different predictive attributes.
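The counts above came from Unix commands, but similar figures can be computed inside R. The following is only a minimal sketch under stated assumptions: the helper `summarize_text_file` and the toy file are hypothetical and not part of the original analysis, and words are counted as whitespace-separated tokens, which is roughly what `wc -w` does.

```r
# Hypothetical helper (not from the report): wc-style counts for one text file
summarize_text_file <- function(path) {
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  # whitespace-separated tokens per line; an empty line counts as zero words
  words.per.line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    file.name = basename(path),
    line.count = length(lines),
    word.count = sum(words.per.line),
    avg.word.per.line = round(mean(words.per.line), 1)
  )
}

# Toy file standing in for one of the real data files
tmp <- tempfile(fileext = ".txt")
writeLines(c("see you all at the expo", "wow", "this reminds me to upgrade"), tmp)
toy.summary <- summarize_text_file(tmp)
print(toy.summary)
```

Reading a whole 200 megabyte file this way needs a fair amount of memory, which is one reason to prefer the Unix tools or chunked reads for the real files.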

Data features

Next we’ll look at the features of the data sets; for this report we’ll use the Twitter file. We start by loading a sample data set for exploratory purposes. Given the size of the file and our available computing resources, we’ll loop through the file in chunks of 10,000 lines and keep a random 0.1% of each chunk, for a sample of 2,360 lines. Here are the first few Twitter messages from our sampled set:

# Loop through the twitter file in chunks and keep a random sample of lines
set.seed(334455)
con <- file("~/R/data/final/en_US/en_US.twitter.txt", "r") # open file connection

# number of full 10,000-line chunks; df is presumably the summary table above (row 3 = twitter)
n.chunks <- as.integer(df$line.count[3]/10000)

ds <- character(0) # sampled lines accumulate here
for (i in seq_len(n.chunks)){
  d <- readLines(con, 10000) # read 10,000 lines at a time
  d <- sample(d, length(d)*.001) # keep a random 0.1%, i.e. 10 of the 10,000
  ds <- c(ds, d)
}

close(con) # close connection
rm(d, i, con, n.chunks) # clean up

head(ds)

## [1] "see you all at #b2bexpo !"                                                                     
## [2] "George Hill stopping LeBron and Wade's fast break! Wow is all i have to say...."               
## [3] "\"Do we strive for excellence or perfection?\" Whats the difference? Your thoughts."           
## [4] "everyone please follow my dear buddy > she is very talented, sweet, cool, and fun to chat with"
## [5] "I'm Mr. Piggy"                                                                                 
## [6] "This reminds me to upgrade my FF when I get home. :)"

To find the most common words in the sample, we’ll first remove numbers, punctuation and stop words (like “the” and “a”):

# load tm package
library(tm)

# build corpus from sample
corp <- Corpus(VectorSource(ds))

# build term document matrix and remove numbers, punctuation and stopwords
tdm <- TermDocumentMatrix(corp, control = list(removeNumbers = TRUE, removePunctuation = TRUE, stopwords = TRUE))

# convert Term Document Matrix to R matrix object
mtrx <- as.matrix(tdm)

# sort descending by most frequent words
wordfreq <- sort(rowSums(mtrx), decreasing=TRUE)

head(wordfreq, 12)

##   just   like    get   love   good thanks    can    day   will   dont 
##    146    112    107    105     95     94     91     89     84     80 
##  great   time 
##     77     77

Here is a word cloud visualization of the words appearing at least 40 times in the sample:

# load wordcloud package
library(wordcloud)

# create data frame of words and frequencies
dm <- data.frame(word = names(wordfreq), freq = wordfreq)

# wordcloud viz
wordcloud(dm$word, dm$freq, min.freq = 40, colors = brewer.pal(8,"Dark2"))

[Figure: word cloud of words appearing at least 40 times in the Twitter sample]

We found a large proportion of individual words appear either once or twice in the dataset:

library(ggplot2)
ggplot(dm, aes(freq)) +
  geom_histogram() +
  ggtitle("Frequency of Words: Twitter Sample") +
  ylab("Count of Unique Words") +
  xlab("Frequency of Unique Words")

[Figure: histogram "Frequency of Words: Twitter Sample"; x-axis: Frequency of Unique Words, y-axis: Count of Unique Words]

Restricting the histogram to words that appear at least five times is more informative, but it still shows a large number of distinct words that we can’t expect to see very often, even though these less frequent words, taken together, account for many more occurrences than the most common words do.

library(dplyr)
ggplot(filter(dm, freq > 4), aes(freq)) +
  geom_histogram() +
  ggtitle("Frequency of Words appearing 5+ times: Twitter Sample") +
  ylab("Count of Unique Words") +
  xlab("Frequency of Unique Words")

Because most of the data set consists of words that show up very infrequently, it will be difficult to train the model to deal accurately with infrequent words. But a good model can still provide value by suggesting a plausible next word; it won’t have to predict every combination that could ever arise.
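One way to put numbers on this long tail is to ask what share of distinct words is rare, and how many of the top-ranked words are needed to cover a given share of all word occurrences. The sketch below is illustrative only: the counts in `toy.freq` are made up, and `words_needed` is a hypothetical helper, not part of the analysis above. The same function could be applied to the `wordfreq` vector built earlier.

```r
# Toy frequency table (made-up counts, not the report's data)
toy.freq <- c(just = 40, like = 25, get = 15, love = 10, piggy = 2,
              expo = 2, upgrade = 2, lebron = 2, wade = 2)

# Hypothetical helper: smallest number of top-ranked words whose counts add up
# to at least `target` (a share between 0 and 1) of all word occurrences
words_needed <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)
  unname(which(coverage >= target)[1])
}

rare.share <- mean(toy.freq <= 2)        # share of distinct words seen at most twice
half <- words_needed(toy.freq, 0.50)     # words needed to cover 50% of occurrences
most <- words_needed(toy.freq, 0.95)     # words needed to cover 95% of occurrences
print(c(rare.share = rare.share, half = half, most = most))
```

If a small number of words covers most occurrences, a model that handles those words well can still be useful even when it ignores much of the tail.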

Next Steps

Our next steps will be to create a prediction algorithm and an accompanying app that lets a user enter a string of text and receive a predicted next word. This will involve more research into the following:

Tokenization and n-grams: look at sets of 2, 3 and 4 words and explore (and eventually model) the frequency of different sets and which words follow them.
Explore helper packages: there is a long list of R packages that help with tokenization, n-gram analysis and modeling, but our initial research and early testing have shown many of them to be very memory intensive and poorly suited to our available computing resources.
Add stop words and possibly numbers back to the training set when building a model as these should have predictive value and their removal could cause inaccuracies in our final model.
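As a rough illustration of the n-gram idea in the steps above, here is a minimal base R sketch of a bigram (2-word) next-word lookup. It is not the planned model: the training text is made up, the helpers `tokenize`, `count_bigrams` and `predict_next` are hypothetical names, and stop words are deliberately kept, in line with the last point above.

```r
# Toy training text (made up for illustration)
toy.text <- c("thanks for the follow", "thanks for the love", "thanks for coming")

# Hypothetical helper: lower-case, replace anything other than letters,
# apostrophes and spaces with a space, then split on whitespace
tokenize <- function(x) {
  x <- gsub("[^a-z' ]", " ", tolower(x))
  strsplit(trimws(x), "\\s+")
}

# Hypothetical helper: data frame of (first, second) word pairs with counts n
count_bigrams <- function(lines) {
  tokens <- tokenize(lines)
  tokens <- tokens[lengths(tokens) >= 2]   # a bigram needs at least two words
  pairs <- do.call(rbind, lapply(tokens, function(w) {
    data.frame(first = head(w, -1), second = tail(w, -1),
               stringsAsFactors = FALSE)
  }))
  aggregate(n ~ first + second, data = cbind(pairs, n = 1), FUN = sum)
}

# Hypothetical helper: most frequent follower of `word`, or NA if unseen
predict_next <- function(bigrams, word) {
  candidates <- bigrams[bigrams$first == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  as.character(candidates$second[which.max(candidates$n)])
}

bigrams <- count_bigrams(toy.text)
print(predict_next(bigrams, "for"))
```

The same counting approach extends to 3- and 4-word sets by pairing each word with the two or three words before it, at the cost of a much larger table, which is where the memory concerns noted above come in.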

Andy Rosa

November 16, 2014
