Reading in the Data

To begin exploring the data, the text files are read in using the read_lines function in the readr package.

library(readr)
blog <- read_lines("en_US.blogs.txt")
news <- read_lines("en_US.news.txt")
twitter <- read_lines("en_US.twitter.txt")

A quick examination shows that the files are quite large, with 899,288 lines in the blog file, 2,360,148 lines in the twitter file, and 1,010,242 lines in the news file.

length(blog)
## [1] 899288
length(twitter)
## [1] 2360148
length(news)
## [1] 1010242

Taking a 1% sample of each file will allow us to examine features of the data more easily.

s <- .01
set.seed(123)
blog_sample <- sample(blog,s*length(blog),replace=FALSE)
news_sample <- sample(news,s*length(news),replace=FALSE)
twitter_sample <- sample(twitter,s*length(twitter),replace=FALSE)

Creating a Corpus

Using the quanteda package, a corpus can be created by combining the samples of each file.

library(quanteda)
blog_corp <- corpus(blog_sample)
news_corp <- corpus(news_sample)
twitter_corp <- corpus(twitter_sample)
corp <- blog_corp + news_corp + twitter_corp

Top N-Grams

Document-frequency matrices are created to find the top unigrams, bigrams, and trigrams. Other options, such as removing stop words or stemming, could be applied when creating the dfms, but the default settings are left in place for now for the sake of exploring the data.

dfm1 <- dfm(corp,ngrams=1)
dfm2 <- dfm(corp,ngrams=2)
dfm3 <- dfm(corp,ngrams=3)
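
The ngrams argument (and the ignoredFeatures argument used further below) reflects an older quanteda interface. On a more recent release of quanteda (version 2 or later is assumed here, not the version used for the results above), a roughly equivalent construction would go through tokens(), for example:

toks <- tokens(corp)
dfm1 <- dfm(toks)                          # unigrams
dfm2 <- dfm(tokens_ngrams(toks,n=2))       # bigrams
dfm3 <- dfm(tokens_ngrams(toks,n=3))       # trigrams
# stop words could likewise be dropped with tokens_remove(toks,stopwords("english"))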

Using this sample, there are 57,764 unique unigrams, 442,306 unique bigrams, and 768,405 unique trigrams across 42,695 documents. The topfeatures function allows us to quickly view which unigrams, bigrams, and trigrams appear most frequently in the data.
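
For example, the 20 most frequent features of each dfm can be listed as follows (output not reproduced here):

topfeatures(dfm1,20)
topfeatures(dfm2,20)
topfeatures(dfm3,20)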

Not surprisingly, these tend to be prepositions, pronouns, and other common English words. We would see more nouns and verbs if English stop words were removed, but stop words may be worth keeping when trying to predict a grammatically correct string of words.

dfm_stop <- dfm(corp,ngrams=1,ignoredFeatures=stopwords("english"))
topfeatures(dfm_stop,20)
##   will   just   said    one   like    can   time    get    new    now 
##   3131   3126   2926   2742   2680   2485   2217   2178   2022   1817 
##   good    day people   know   love   back  first   make    see     go 
##   1759   1747   1639   1630   1557   1405   1372   1369   1350   1335

Coverage

We can estimate how many unique words are needed to cover 50% or 90% of all word instances in the sample. The document-frequency matrix shows that there are 841,068 total word instances in the sample.

To do this, the words are sorted in decreasing order of frequency and a cumulative sum column is added. Subsetting the resulting data frame shows that the top 236 unique words (0.4% of total unique words) account for 50% of the total word instances. Around 9,156 unique words (16% of total unique words) account for 90% of total word instances.

freq1 <- docfreq(dfm1)
df1 <- data.frame(ngram=names(freq1),freq=freq1)
rownames(df1) <- c()
df1 <- df1[order(df1$freq,decreasing=TRUE),]

goal50 <- sum(df1$freq)*0.5
goal90 <- sum(df1$freq)*0.9

library(dplyr)
df1 <- mutate(df1,cumsum=cumsum(df1$freq))
test50 <- df1[df1$cumsum<goal50,]
test90 <- df1[df1$cumsum<goal90,]

tail(test50)
##       ngram freq cumsum
## 231    done  421 418734
## 232    hard  420 419154
## 233 without  420 419574
## 234    free  414 419988
## 235    used  406 420394
## 236  making  405 420799
tail(test90)
##           ngram freq cumsum
## 9151 parenthood    7 757848
## 9152     geared    7 757855
## 9153   cheating    7 757862
## 9154     greene    7 757869
## 9155  brother's    7 757876
## 9156    belated    7 757883

The cumulative sums, with horizontal lines at 50% and 90% of total word instances, are plotted below.
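
A plot of this kind can be produced with base graphics, for example:

plot(df1$cumsum,type="l",xlab="Unique words (ranked by frequency)",ylab="Cumulative word instances")
abline(h=goal50,lty=2)
abline(h=goal90,lty=2)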

Algorithm

A prediction model can be built using the Katz back-off model, which involves the following steps.