Our project involves taking three sample data files and building a predictive text model that predicts the next word when a user enters a string of text. Our first step is an exploratory look at the data sets.

Summary of the blog, news and twitter files

We begin by looking at a summary of the three text files provided. We use Unix commands (like wc -w to perform word counts) on the three files. Here is a summary:

file.name          line.count  byte.count  char.count  word.count  avg.word.per.line
en_US.blogs.txt        899288   210160014   208623081    37334690               41.5
en_US.news.txt        1010242   205811889   205243643    34372720               34.0
en_US.twitter.txt     2360148   167105338   166816544    30374206               12.9

The file sizes range from 167 megabytes to 210 megabytes, and the three files combined contain over 4.2 million lines of text. The Twitter file stands out: it has the largest number of lines but averages only 12.9 words per line, compared with 34.0 and 41.5 for the news and blogs files respectively. That is not surprising given the 140-character limit on Twitter messages, but it’s probably an early indication that these three data sets could have very different predictive attributes.
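The counts above came from Unix commands, but a similar summary can be sketched in R. The helper below is hypothetical (the name `summarise_text_file` and the temporary demo file are assumptions for illustration, not part of the original analysis). It reads the whole file into memory, so it is only a rough stand-in for `wc` on files this large, and its word counts may differ slightly from `wc -w`.

```r
# Hypothetical helper: line count, word count and words per line for one file.
summarise_text_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Split each line on runs of whitespace and count the non-empty tokens
  words_per_line <- vapply(
    strsplit(trimws(lines), "\\s+"),
    function(w) sum(nzchar(w)),
    integer(1)
  )
  data.frame(
    file.name         = basename(path),
    line.count        = length(lines),
    word.count        = sum(words_per_line),
    avg.word.per.line = round(sum(words_per_line) / length(lines), 1),
    stringsAsFactors  = FALSE
  )
}

# Demo on a small temporary file (made-up lines, not the real data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("see you all at the expo", "I'm Mr. Piggy", ""), tmp)
smry <- summarise_text_file(tmp)
print(smry)
```

Calling the helper once per file and combining the rows with `rbind()` would give a table in the same shape as the one above.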

Data features

Next we’ll look at the features of the data sets. For this report we’ll use the Twitter file. We start by loading a sample data set for exploratory purposes. Given the size of the file and our available computing resources, we’ll loop through the file 10,000 lines at a time and keep a random 0.1% of each chunk, which gives a sample of 2,360 lines. Here are the first few Twitter messages from our sampled set:

# LOOP through twitter file and read sample of lines
set.seed(334455)
con <- file("~/R/data/final/en_US/en_US.twitter.txt", "r") # open file connection
for (i in 1:as.integer(df$line.count[3]/10000)){ # number of full 10,000-line chunks; df is assumed to be the summary table above (row 3 = twitter)
  d <- readLines(con, 10000) # read 10,000 lines at a time
  d <- sample(d, length(d)*.001) # keep a random 0.1% of the chunk (10 of the 10,000 lines)
  
  if (i == 1){
    ds <- d
  }
  else {
    ds <- append(ds, d)
  }
}

close(con) # close connection
rm(d, i, con) # clean up

head(ds)
## [1] "see you all at #b2bexpo !"                                                                     
## [2] "George Hill stopping LeBron and Wade's fast break! Wow is all i have to say...."               
## [3] "\"Do we strive for excellence or perfection?\" Whats the difference? Your thoughts."           
## [4] "everyone please follow my dear buddy > she is very talented, sweet, cool, and fun to chat with"
## [5] "I'm Mr. Piggy"                                                                                 
## [6] "This reminds me to upgrade my FF when I get home. :)"
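The loop above reads 236 full chunks, so the final 148 lines of the file are never read. As a sketch only, the same chunked sampling could be wrapped in a function that keeps reading until the connection is exhausted. The function name `sample_lines` and its arguments are assumptions for illustration, and because it draws random numbers differently it would not reproduce the exact sample shown above.

```r
# Hypothetical helper: read a file in chunks and keep a random fraction of each.
sample_lines <- function(path, chunk_size = 10000, frac = 0.001) {
  con <- file(path, "r")
  on.exit(close(con))                      # always close the connection
  out <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break          # end of file reached
    k <- floor(length(chunk) * frac)       # lines to keep from this chunk
    out <- c(out, sample(chunk, k))
  }
  out
}

# Demo on a small temporary file: 25 lines, chunks of 10, keep half of each chunk
demo_lines <- paste("message", 1:25)
tmp <- tempfile(fileext = ".txt")
writeLines(demo_lines, tmp)
set.seed(1)
s <- sample_lines(tmp, chunk_size = 10, frac = 0.5)
length(s)  # 5 + 5 + 2 = 12 lines
```

Sampling within each chunk keeps memory use bounded by the chunk size, at the cost of a sample that is stratified by position in the file rather than drawn uniformly from all lines.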

To find the most common words in the subset, we’ll first remove numbers, punctuation and stop words (like “the” and “a”), then count how often each remaining word appears:

# load tm package
library(tm)

# build corpus from sample
corp <- Corpus(VectorSource(ds))

# build term document matrix and remove numbers, punctuation and stopwords
tdm <- TermDocumentMatrix(corp, control = list(removeNumbers = TRUE, removePunctuation = TRUE, stopwords = TRUE))

# convert Term Document Matrix to R matrix object
mtrx <- as.matrix(tdm)

# sort descending by most frequent words
wordfreq <- sort(rowSums(mtrx), decreasing=TRUE)

head(wordfreq, 12)
##   just   like    get   love   good thanks    can    day   will   dont 
##    146    112    107    105     95     94     91     89     84     80 
##  great   time 
##     77     77
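To make the preprocessing steps concrete, here is a rough base-R sketch of the same idea: lowercase the text, strip punctuation and digits, drop stop words, and count what remains. The function name `word_freq` and the tiny stop word list are assumptions for illustration. The tm package uses a much longer stop word list and its own tokenizer, so this sketch would not reproduce the counts above.

```r
# Hypothetical helper: word frequencies after basic cleaning.
word_freq <- function(texts, stop_words = c("the", "a", "and", "to", "is")) {
  txt <- tolower(texts)
  txt <- gsub("[[:punct:]]+", "", txt)    # remove punctuation
  txt <- gsub("[[:digit:]]+", "", txt)    # remove numbers
  tokens <- unlist(strsplit(txt, "\\s+")) # split on whitespace
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
  sort(table(tokens), decreasing = TRUE)  # most frequent first
}

# Demo on two made-up messages
freq <- word_freq(c("I love the day", "Love is a great day 2 day!"))
print(freq)
```

In this demo, "day" is counted three times and "love" twice, while "the", "is", "a" and the digit are dropped.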

Here is a word cloud visualization of the words appearing at least 40 times in the dataset:

# load wordcloud package
library(wordcloud)

# create data frame of words and frequencies
dm <- data.frame(word = names(wordfreq), freq = wordfreq)

# wordcloud viz
wordcloud(dm$word, dm$freq, min.freq = 40, colors = brewer.pal(8,"Dark2"))

[Figure: word cloud of words appearing at least 40 times in the Twitter sample]

We found a large proportion of individual words appear either once or twice in the dataset:

library(ggplot2)
ggplot(dm, aes(freq)) +
  geom_histogram() +
  ggtitle("Frequency of Words: Twitter Sample") +
  ylab("Count of Unique Words") +
  xlab("Frequency of Unique Words")

[Figure: histogram "Frequency of Words: Twitter Sample"; x-axis: Frequency of Unique Words, y-axis: Count of Unique Words]

Restricting the histogram to words that appear at least five times is more interesting, but it still shows a large number of unique words that we can’t expect to show up very often, even if the combined count of the less frequently used words is much greater than that of the more commonly used words.

library(dplyr)
ggplot(filter(dm, freq > 4), aes(freq)) +
  geom_histogram() +
  ggtitle("Frequency of Words appearing 5+ times: Twitter Sample") +
  ylab("Count of Unique Words") +
  xlab("Frequency of Unique Words")

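[Figure: histogram "Frequency of Words appearing 5+ times: Twitter Sample"; x-axis: Frequency of Unique Words, y-axis: Count of Unique Words]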
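The histograms describe the shape of the distribution. The two quantities discussed above (how many of the unique words are rare, and how much of the total word count they account for) can also be computed directly. The sketch below uses a hypothetical helper, `rare_word_share`, and made-up toy frequencies rather than the real `wordfreq` vector, so its numbers say nothing about the Twitter sample itself.

```r
# Hypothetical helper: share of unique words that are rare, and the share of
# all word occurrences those rare words account for.
rare_word_share <- function(freq, max_freq = 2) {
  rare <- freq <= max_freq
  c(unique_words = mean(rare),
    occurrences  = sum(freq[rare]) / sum(freq))
}

# Toy frequencies (made up for illustration)
toy_freq <- c(just = 6, like = 4, love = 3, upgrade = 2, piggy = 1, expo = 1, wow = 1)
shares <- rare_word_share(toy_freq)
print(round(shares, 3))
```

Applied to the named vector `wordfreq` built earlier, the same call would quantify the claim about words that appear only once or twice.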

Because the largest part of the data set consists of words that show up very infrequently, it will be difficult to train the model to deal accurately with infrequent words. But a good model only needs to suggest a useful next word to provide value; it won’t have to predict every combination that could ever arise.
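As an illustration of a model that suggests a next word without covering every combination, here is a minimal bigram sketch with a fallback for unseen words. It is one simple baseline under stated assumptions, not the prediction algorithm this project will use. The function names `build_bigrams` and `predict_next`, the demo messages, and the choice of "just" as the fallback are all hypothetical.

```r
# Hypothetical helper: collect (word, following word) pairs from messages.
build_bigrams <- function(texts) {
  pairs <- lapply(texts, function(t) {
    clean <- tolower(gsub("[^[:alpha:][:space:]]", "", t))  # keep letters and spaces
    w <- strsplit(clean, "\\s+")[[1]]
    w <- w[nzchar(w)]
    if (length(w) < 2) return(NULL)                         # no pair to extract
    data.frame(first = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

# Hypothetical helper: most frequent follower of `word`, or `fallback` if unseen.
predict_next <- function(bigrams, word, fallback) {
  cand <- bigrams$nxt[bigrams$first == tolower(word)]
  if (length(cand) == 0) return(fallback)
  names(which.max(table(cand)))
}

# Demo on made-up messages
bg <- build_bigrams(c("thanks for the follow", "thanks for the love",
                      "thanks for following", "good day"))
p_seen   <- predict_next(bg, "for", fallback = "just")
p_unseen <- predict_next(bg, "piggy", fallback = "just")
```

Here "for" is followed by "the" twice and "following" once, so the sketch suggests "the"; "piggy" never appears as a first word, so it falls back to the supplied default.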

Next Steps

Our next steps will be to create a prediction algorithm and an accompanying app that allows a user to enter a string of text and receive a predicted next word. This will involve more research into the following:

———