This project involves creating a prediction model from an English-language corpus to predict the next word given an input phrase. The data was downloaded from the course website. The first problem encountered was that the file en_US.twitter.txt contained nul characters (0x00). This was remedied by reading the file as raw bytes with readBin and replacing the nul characters with spaces, using the function nukeNul below.
nukeNul <- function(fp) {
  # fp is the file path without extension; the original file is assumed to have
  # been renamed to <fp>.bin, and the cleaned text is written to <fp>.txt
  infile <- paste(fp, ".bin", sep="")
  outfile <- paste(fp, ".txt", sep="")
  # read the raw bytes and replace nul characters (0x00) with spaces (0x20)
  bf <- readBin(infile, "raw", file.info(infile)$size)
  bf[bf == as.raw(0)] <- as.raw(0x20)
  writeBin(bf, outfile)
}
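For example, assuming the raw Twitter file has first been renamed to en_US.twitter.bin (that renaming step is not shown here), the call would look like:
nukeNul("final/en_US/en_US.twitter") # writes the cleaned final/en_US/en_US.twitter.txt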
After this was done, we can use readLines to read in the three files, as follows:
blogs <- readLines("final/en_US/en_US.blogs.txt")
news <- readLines("final/en_US/en_US.news.txt")
tweets <- readLines("final/en_US/en_US.twitter.txt")
Now we have the data, ready for cleaning and exploration.
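The code that follows assumes the packages used in this report have already been loaded:
library(tm)      # corpus construction and cleaning
library(stringi) # word counts
library(RWeka)   # n-gram tokenization
library(ggplot2) # plotting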
We’ll use the package tm to assist in cleaning the data. Cleaning will involve the following: removing URLs and Twitter handles, converting to lowercase, removing English stop words, removing punctuation and numbers, and collapsing extra whitespace.
The data is large and the tm functions require significant time, so we’ll build our corpus from a 1% sample of the lines in each file.
set.seed(42)
blogs.sample <- sample(blogs, length(blogs)*0.01)
news.sample <- sample(news, length(news)*0.01)
tweets.sample <- sample(tweets, length(tweets)*0.01)
Next, we’ll load the text from the samples into our corpus.
corpus <- VCorpus(VectorSource(c(blogs.sample, news.sample, tweets.sample)))
Next, we’ll define a custom content transformer that removes all matches of a regular expression from a string (we’ll use this with the tm function tm_map below).
killPattern <- content_transformer(function(x, p) gsub(p, "", x))
Now we’ll perform the desired cleaning steps. Note that lowercasing comes before stop-word removal so that capitalized stop words are also caught.
corpus <- tm_map(corpus, killPattern, "(f|ht)tp(s?)://(.*)[.][a-z]+") # remove URLs
corpus <- tm_map(corpus, killPattern, "@[^\\s]+") # remove Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower)) # convert to lowercase
corpus <- tm_map(corpus, removeWords, stopwords("en")) # remove English stop words
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, stripWhitespace) # collapse extra whitespace
Now, we’re ready to perform an exploratory data analysis (EDA).
There are two components to our EDA. First, we’ll calculate basic summary statistics of the data. Then, we’ll plot basic histograms giving relevant information about the data.
Let’s look at some basic summary statistics of the sampled data (before cleaning). We’ll use the function stri_count_words from the package stringi to count the number of words in each line of the samples.
blogs.words <- stri_count_words(blogs.sample)
news.words <- stri_count_words(news.sample)
tweets.words <- stri_count_words(tweets.sample)
Let’s look at the total number of words in each sample, the number of lines in each sample, and the mean and standard deviation of the number of words in a line for each sample.
data.frame(data=c("blogs", "news", "tweets"),
tot.lines=c(length(blogs.sample), length(news.sample), length(tweets.sample)),
tot.words=c(sum(blogs.words), sum(news.words), sum(tweets.words)),
mean.words=c(mean(blogs.words), mean(news.words), mean(tweets.words)),
sd.words=c(sd(blogs.words), sd(news.words), sd(tweets.words)))
## data tot.lines tot.words mean.words sd.words
## 1 blogs 8992 380761 42.34442 52.369632
## 2 news 10102 347630 34.41200 22.090465
## 3 tweets 23601 301252 12.76437 6.937734
Now, let’s look at some histograms that give a better idea of what the data looks like.
First, let’s find the most common words in the sample data. These are the same as 1-grams.
# get the term document matrix and remove sparse terms
tdm.corpus <- removeSparseTerms(TermDocumentMatrix(corpus), 0.9999)
# find word frequencies
word.freq <- sort(rowSums(as.matrix(tdm.corpus)), decreasing=TRUE)
# make frequency data frame
df.word.freq <- data.frame(word=names(word.freq), freq=word.freq)
# plot top 50 most frequent words (1-grams)
ggplot(df.word.freq[1:50,], aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat="identity", fill="#2b8cbe") +
labs(title="50 most frequent words", x="Word", y="Frequency") +
theme(axis.text.x=element_text(angle=60, hjust=1))
Next, we’ll use the package RWeka to find 2-grams. A 2-gram is just a sequence of two consecutive words in the corpus.
# first get TDM for 2-grams
twograms <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
tdm.twograms <- removeSparseTerms(TermDocumentMatrix(corpus, control=list(tokenize=twograms)), 0.9999)
# now find 2-gram frequencies and put into a data frame
twogram.freq <- sort(rowSums(as.matrix(tdm.twograms)), decreasing=TRUE)
df.twogram.freq <- data.frame(twogram=names(twogram.freq), freq=twogram.freq)
# now plot top 50
ggplot(df.twogram.freq[1:50,], aes(x=reorder(twogram, -freq), y=freq)) +
geom_bar(stat="identity", fill="#e6550d") +
labs(title="50 most common 2-grams", x="2-gram", y="Frequency") +
theme(axis.text.x=element_text(angle=60, hjust=1))
Finally, we’ll find 3-grams using a similar method and graph their frequencies. A 3-gram is just a sequence of three consecutive words in the corpus.
# first get TDM for 3-grams
threegrams <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
tdm.threegrams <- removeSparseTerms(TermDocumentMatrix(corpus, control=list(tokenize=threegrams)), 0.9999)
# now find 3-gram frequencies and put into a data frame
threegram.freq <- sort(rowSums(as.matrix(tdm.threegrams)), decreasing=TRUE)
df.threegram.freq <- data.frame(threegram=names(threegram.freq), freq=threegram.freq)
# now plot top 50
ggplot(df.threegram.freq[1:50,], aes(x=reorder(threegram, -freq), y=freq)) +
geom_bar(stat="identity", fill="#31a354") +
labs(title="50 most common 3-grams", x="3-gram", y="Frequency") +
theme(axis.text.x=element_text(angle=60, hjust=1))
Now that the EDA is complete, we’ll build a predictive model and deploy it as a Shiny app with an explanatory slide presentation. We will need to consider both the size of the model and its runtime (the time it takes to make a prediction). Ideally, the model would be small and fast enough to work on a mobile phone. Questions we’ll need to think about include how to efficiently store a model based on n-grams and what n we should choose. Is n=3 enough? These are the kinds of questions we’ll consider in the remainder of the capstone.
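As a rough first answer to the storage question, one option is simply to prune rare n-grams from the frequency tables built above and save only what remains for the Shiny app to load. Here is a minimal sketch; the pruning threshold of 1 and the file names are placeholders, not final choices.
df.twogram.pruned <- df.twogram.freq[df.twogram.freq$freq > 1, ]       # drop 2-grams seen only once
df.threegram.pruned <- df.threegram.freq[df.threegram.freq$freq > 1, ] # drop 3-grams seen only once
saveRDS(df.twogram.pruned, "twograms.rds")                             # compact files for the app to load
saveRDS(df.threegram.pruned, "threegrams.rds")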
The goal is to create a small, fast model to predict the next word in a sequence. One method I’ll try is the n-gram approach: the model stores the n-grams observed in the corpus for some small n, and the next word is predicted as the most frequent word that follows the preceding words in that distribution of n-grams. A rough sketch of this idea appears below.
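To make this concrete, here is a minimal sketch of a simple backoff predictor built on the 2-gram and 3-gram frequency tables above. The function name predictNext and the backoff strategy are illustrative only; the final model will need proper smoothing and better handling of unseen words.
predictNext <- function(phrase) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  # try the 3-gram table first: match 3-grams that start with the last two input words
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- df.threegram.freq[grepl(paste0("^", prefix, " "), df.threegram.freq$threegram), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$threegram[which.max(hits$freq)])
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # back off to the 2-gram table: match 2-grams that start with the last input word
  hits <- df.twogram.freq[grepl(paste0("^", words[n], " "), df.twogram.freq$twogram), ]
  if (nrow(hits) > 0) {
    best <- as.character(hits$twogram[which.max(hits$freq)])
    return(tail(strsplit(best, " ")[[1]], 1))
  }
  NA_character_  # no match found in either table
}
predictNext("thanks for the") # returns the predicted next word, or NA if no n-gram matches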
Another method I’d like to try is a neural network. With this approach, the corpus serves as training data and the network learns to predict the next word in a sequence. Training may be prohibitively expensive on my home computer, but I want to at least investigate the approach.