To begin exploring the data, the text files are read in with the read_lines function from the readr package.
library(readr)
blog <- read_lines("en_US.blogs.txt")
news <- read_lines("en_US.news.txt")
twitter <- read_lines("en_US.twitter.txt")
A quick examination shows that the files are quite large, with 899,288 lines in the blog file, 2,360,148 lines in the twitter file, and 1,010,242 lines in the news file.
length(blog)
## [1] 899288
length(twitter)
## [1] 2360148
length(news)
## [1] 1010242
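As a rough check on memory use (not shown in the original output), the in-memory size of each vector could be inspected with base R's object.size function; a minimal sketch:
# Approximate in-memory size of each character vector (values depend on the system)
format(object.size(blog), units = "Mb")
format(object.size(news), units = "Mb")
format(object.size(twitter), units = "Mb")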
Taking a 1% sample of each file will allow us to examine features of the data more easily.
s <- .01
set.seed(123)
blog_sample <- sample(blog,s*length(blog),replace=FALSE)
news_sample <- sample(news,s*length(news),replace=FALSE)
twitter_sample <- sample(twitter,s*length(twitter),replace=FALSE)
Using the quanteda package, a corpus can be created by combining the samples of each file.
library(quanteda)
blog_corp <- corpus(blog_sample)
news_corp <- corpus(news_sample)
twitter_corp <- corpus(twitter_sample)
corp <- blog_corp + news_corp + twitter_corp
Document-frequency matrices are created to find the top unigrams, bigrams, and trigrams. Other options, such as removing stop words and stemming, could be applied when creating the dfms, but the default parameters are left in place for now for the sake of exploring the data (a brief sketch of the stemming option follows the code below).
dfm1 <- dfm(corp,ngrams=1)
dfm2 <- dfm(corp,ngrams=2)
dfm3 <- dfm(corp,ngrams=3)
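As an illustration of one of those alternatives, stemming could be applied to an existing dfm with quanteda's dfm_wordstem function; this is only a sketch, and the stemmed dfm is not used in the analysis below.
# Stemmed version of the unigram dfm (illustration only, not used below)
dfm1_stem <- dfm_wordstem(dfm1)
topfeatures(dfm1_stem, 20)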
Using this sample, there are 57,764 unique unigrams, 442,306 unique bigrams, and 768,405 unique trigrams across 42,695 documents. The topfeatures function allows us to quickly view which unigrams, bigrams, and trigrams appear most frequently in the data.
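Those counts and the most frequent n-grams could be checked with calls like the following (a sketch; output omitted here):
dim(dfm1)  # documents x unique unigrams
dim(dfm2)  # documents x unique bigrams
dim(dfm3)  # documents x unique trigrams
topfeatures(dfm1, 20)
topfeatures(dfm2, 20)
topfeatures(dfm3, 20)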
Not surprisingly, these tend to be prepositions, pronouns, and other common English words. We would get a different result, with more nouns and verbs, if English stop words were removed, but stop words may be helpful to include when trying to predict a grammatically correct string of words.
dfm_stop <- dfm(corp,ngrams=1,ignoredFeatures=stopwords("english"))
topfeatures(dfm_stop,20)
## will just said one like can time get new now
## 3131 3126 2926 2742 2680 2485 2217 2178 2022 1817
## good day people know love back first make see go
## 1759 1747 1639 1630 1557 1405 1372 1369 1350 1335
We can estimate how many unique words are needed to cover 50% or 90% of all word instances in the sample. The document-frequency matrix shows that there are 841,068 total word instances in the sample.
To do this, the words are sorted in decreasing order of frequency and a cumulative sum column is added. Subsetting the resulting data frame shows that the top 236 unique words (0.4% of total unique words) account for 50% of the total word instances. Around 9,156 unique words (16% of total unique words) account for 90% of total word instances.
freq1 <- docfreq(dfm1)
df1 <- data.frame(ngram=names(freq1),freq=freq1)
rownames(df1) <- NULL
df1 <- df1[order(df1$freq,decreasing=TRUE),]
goal50 <- sum(df1$freq)*0.5
goal90 <- sum(df1$freq)*0.9
library(dplyr)
df1 <- mutate(df1,cumsum=cumsum(freq))
test50 <- df1[df1$cumsum<goal50,]
test90 <- df1[df1$cumsum<goal90,]
tail(test50)
## ngram freq cumsum
## 231 done 421 418734
## 232 hard 420 419154
## 233 without 420 419574
## 234 free 414 419988
## 235 used 406 420394
## 236 making 405 420799
tail(test90)
## ngram freq cumsum
## 9151 parenthood 7 757848
## 9152 geared 7 757855
## 9153 cheating 7 757862
## 9154 greene 7 757869
## 9155 brother's 7 757876
## 9156 belated 7 757883
The cumulative sums, with horizontal lines at 50% and 90% of total word instances, are plotted below.
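A minimal sketch of how such a plot could be produced with base R graphics, using the df1, goal50, and goal90 objects defined above:
# Cumulative word-instance coverage by number of unique words (ranked by frequency)
plot(df1$cumsum, type = "l",
     xlab = "Unique words (ranked by frequency)",
     ylab = "Cumulative word instances")
abline(h = goal50, lty = 2)  # 50% of total word instances
abline(h = goal90, lty = 2)  # 90% of total word instances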
A prediction model can be built using the Katz back-off approach, which involves the following steps.