Our project involves taking three sample data files and building a predictive text model that predicts the next word when a user enters a string of text. Our first step is an exploratory look at the data sets.
We begin by looking at a summary of the three text files provided. We use Unix commands (like wc -w to perform word counts) on the three files. Here is a summary:
| file.name | line.count | byte.count | char.count | word.count | avg.word.per.line |
|---|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 210160014 | 208623081 | 37334690 | 41.5 |
| en_US.news.txt | 1010242 | 205811889 | 205243643 | 34372720 | 34.0 |
| en_US.twitter.txt | 2360148 | 167105338 | 166816544 | 30374206 | 12.9 |
The file sizes range from 167 megabytes to 210 megabytes, and there are over 4.2 million lines of text across the three files combined. One interesting note is that the Twitter file has the largest number of lines but only 12.9 words per line, compared to 34.0 and 41.5 for the news and blogs files respectively. This is not surprising given the 140-character limit on Twitter messages, but it is probably an early indication that these three data sets could have very different predictive attributes.
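The sampling code later in this report reads the Twitter line count from a data frame called df, whose construction is not shown. A minimal sketch of how such a data frame could be assembled, assuming it simply holds the columns of the table above (the counts are copied from the table; total.lines is a name introduced here for illustration):

```r
# Assumed reconstruction of the summary data frame; values copied from the table
df <- data.frame(
  file.name  = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
  line.count = c(899288, 1010242, 2360148),
  byte.count = c(210160014, 205811889, 167105338),
  char.count = c(208623081, 205243643, 166816544),
  word.count = c(37334690, 34372720, 30374206)
)

# average words per line, rounded to one decimal as in the table
df$avg.word.per.line <- round(df$word.count / df$line.count, 1)

# combined line count across the three files (illustrative name)
total.lines <- sum(df$line.count)
```

Deriving the average from the raw counts, rather than typing it in, keeps the column consistent if the counts are ever regenerated.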
Next we’ll look at the features of the data sets; for this report we’ll use the Twitter file. We start by loading a sample for exploratory purposes. Given the size of the file and our available computing resources, we loop through the file in chunks of 10,000 lines and keep a random 10 lines from each chunk, for a sample of 2,360 lines. Here are the first few Twitter messages from our sampled set:
# Loop through the Twitter file and read a sample of lines
set.seed(334455)
con <- file("~/R/data/final/en_US/en_US.twitter.txt", "r") # open file connection
# df is assumed to be the data frame behind the summary table above (row 3 = Twitter)
for (i in 1:as.integer(df$line.count[3]/10000)){ # number of loops needed: total Twitter lines / 10,000 = 236
  d <- readLines(con, 10000) # read 10,000 lines at a time
  d <- sample(d, length(d)*.001) # keep a random 0.1%, i.e. 10 of each 10,000 lines
  if (i == 1){
    ds <- d
  }
  else {
    ds <- append(ds, d)
  }
}
close(con) # close connection
rm(d, i, con) # clean up
head(ds)
## [1] "see you all at #b2bexpo !"
## [2] "George Hill stopping LeBron and Wade's fast break! Wow is all i have to say...."
## [3] "\"Do we strive for excellence or perfection?\" Whats the difference? Your thoughts."
## [4] "everyone please follow my dear buddy > she is very talented, sweet, cool, and fun to chat with"
## [5] "I'm Mr. Piggy"
## [6] "This reminds me to upgrade my FF when I get home. :)"
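The loop above reads 236 full chunks, so the final 148 lines of the Twitter file are never read. That hardly matters for an exploratory sample, but a variant that reads until the connection is exhausted avoids depending on a precomputed line count. The following is a sketch only: sample_chunks is a hypothetical helper that was not part of the original analysis, and a small in-memory text connection stands in for the real file.

```r
# Hypothetical helper: sample a fixed fraction of lines from each chunk
# until the connection has no more lines to give.
sample_chunks <- function(con, chunk.size = 10000, frac = 0.001) {
  out <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk.size, skipNul = TRUE)
    if (length(chunk) == 0) break            # end of input reached
    n.keep <- floor(length(chunk) * frac)    # lines to keep from this chunk
    # index-based sampling avoids sample()'s special case for length-one input
    out <- c(out, chunk[sample.int(length(chunk), n.keep)])
  }
  out
}

set.seed(334455)
demo.con <- textConnection(sprintf("line %d", 1:2500))  # stand-in for the Twitter file
demo <- sample_chunks(demo.con, chunk.size = 1000, frac = 0.01)
close(demo.con)
length(demo)
```

With the real file, the same call would take the opened file connection and the defaults of 10,000 lines per chunk and a 0.1% sampling fraction.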
To find the most common words in the sample, we first remove numbers, punctuation and stop words (like “the” and “a”):
# load tm package
library(tm)
# build corpus from sample
corp <- Corpus(VectorSource(ds))
# build term document matrix and remove numbers, punctuation and stopwords
tdm <- TermDocumentMatrix(corp, control = list(removeNumbers = TRUE, removePunctuation = TRUE, stopwords = TRUE))
# convert Term Document Matrix to R matrix object
mtrx <- as.matrix(tdm)
# sort descending by most frequent words
wordfreq <- sort(rowSums(mtrx), decreasing=TRUE)
head(wordfreq, 12)
##   just   like    get   love   good thanks    can    day   will   dont
##    146    112    107    105     95     94     91     89     84     80
##  great   time
##     77     77
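To make the cleaning steps concrete, here is a base R sketch of the same idea on three made-up messages. It is an approximation for illustration only: the stop list is a tiny hypothetical one rather than the list tm uses, and tm's own tokenization and cleaning rules differ in detail.

```r
# Made-up messages and a tiny illustrative stop list (not tm's stop words)
toy <- c("Thanks for the follow!", "I love the new day", "thanks, love it 2 much")
stops <- c("the", "a", "for", "i", "it")

# lower-case, then strip punctuation and digits
clean <- gsub("[[:punct:][:digit:]]", "", tolower(toy))

# split on whitespace, drop empty strings and stop words
tokens <- unlist(strsplit(clean, "\\s+"))
tokens <- tokens[nzchar(tokens) & !(tokens %in% stops)]

# count each word and sort from most to least frequent
toy.freq <- sort(table(tokens), decreasing = TRUE)
toy.freq
```

Here "thanks" and "love" each appear twice and the remaining four words once, which mirrors in miniature the long tail seen in the real sample.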
Here is a word cloud visualization of the words appearing at least 40 times in the sample:
# load wordcloud package
library(wordcloud)
# create data frame of words and frequencies
dm <- data.frame(word = names(wordfreq), freq = wordfreq)
# wordcloud viz
wordcloud(dm$word, dm$freq, min.freq = 40, colors = brewer.pal(8,"Dark2"))
We found a large proportion of individual words appear either once or twice in the dataset:
library(ggplot2)
ggplot(dm, aes(freq)) +
geom_histogram() +
ggtitle("Frequency of Words: Twitter Sample") +
ylab("Count of Unique Words") +
xlab("Frequency of Unique Words")
Restricting the plot to words that appear at least five times in the dataset is more interesting, but it still shows a large number of unique words that we would not expect to show up very often, even if the combined count of the less frequently used words is much greater than the combined count of the more commonly used words.
library(dplyr)
ggplot(filter(dm, freq > 4), aes(freq)) +
geom_histogram() +
ggtitle("Frequency of Words appearing 5+ times: Twitter Sample") +
ylab("Count of Unique Words") +
xlab("Frequency of Unique Words")
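The histograms show the shape of the distribution; the claims about rare words can also be checked numerically. The sketch below uses a hypothetical helper, rare_word_summary, and made-up counts. With the real data it could be called on dm$freq.

```r
# Hypothetical helper: summarise how much of a frequency vector is rare words
rare_word_summary <- function(freq, cutoff = 5) {
  c(
    share.once.or.twice = mean(freq <= 2),        # share of unique words seen once or twice
    tokens.rare   = sum(freq[freq < cutoff]),     # occurrences from words below the cutoff
    tokens.common = sum(freq[freq >= cutoff])     # occurrences from words at or above it
  )
}

toy.counts <- c(9, 6, 2, 1, 1, 1)  # made-up frequencies, not taken from the sample
rare_word_summary(toy.counts)
```

Comparing tokens.rare with tokens.common on the real sample would show directly whether the infrequent words outweigh the frequent ones in total occurrences.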
The fact that the largest part of the data set consists of words that show up very infrequently will make it difficult to train the model to deal accurately with infrequent words. But a good model can provide value just by suggesting a plausible next word; it won’t have to predict every combination that could ever arise.
Our next steps will be creating a prediction algorithm and accompanying app that allows a user to enter a string of text and receive a predicted next word. This will involve more research into the following:
Tokenization and n-grams: look at sets of 2, 3 and 4 words and explore (and eventually model) the frequency of different sets and which words follow them.
Explore helper packages: there is a long list of R packages that help with tokenization, n-gram analysis and modeling, but our initial research and early testing have shown many to be very memory intensive and poorly suited to our available computing resources.
Add stop words and possibly numbers back to the training set when building a model as these should have predictive value and their removal could cause inaccuracies in our final model.
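As a rough illustration of the n-gram idea in the steps above, the following base R sketch builds trigrams from three made-up messages and looks up the most frequent word following a two-word prefix. Both make_ngrams and predict_next are hypothetical helpers written for this example, not a chosen design; a real model would also need smoothing or back-off for prefixes it has never seen. Note that stop words are kept here, in line with the last point above.

```r
# Hypothetical helper: all n-grams from one vector of tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up messages; stop words are deliberately kept
toy <- c("thanks for the follow", "thanks for the love", "thanks for reading")
tokens.list <- strsplit(tolower(toy), "\\s+")
trigrams <- unlist(lapply(tokens.list, make_ngrams, n = 3))
tri.freq <- sort(table(trigrams), decreasing = TRUE)

# Hypothetical lookup: most frequent last word among trigrams starting with prefix
predict_next <- function(prefix, tri.freq) {
  hits <- tri.freq[startsWith(names(tri.freq), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)   # unseen prefix
  sub(".* ", "", names(hits)[which.max(hits)])   # keep only the final word
}

predict_next("thanks for", tri.freq)
```

Storing n-grams as a plain frequency table keeps memory use easy to reason about, which matters given the resource limits noted above, though it would likely need pruning of rare n-grams on the full data.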