To begin exploring the data, the text files are read in with the read_lines function from the readr package.
library(readr)
blog <- read_lines("en_US.blogs.txt")
news <- read_lines("en_US.news.txt")
twitter <- read_lines("en_US.twitter.txt")
A quick examination shows that the files are quite large, with 899,288 lines in the blog file, 2,360,148 lines in the twitter file, and 1,010,242 lines in the news file.
length(blog)
## [1] 899288
length(twitter)
## [1] 2360148
length(news)
## [1] 1010242
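As a rough check on memory use (not shown in the original output), the in-memory size of each vector could be inspected with base R's object.size function; a minimal sketch:
# Approximate in-memory size of each character vector (values depend on the system)
format(object.size(blog), units = "Mb")
format(object.size(news), units = "Mb")
format(object.size(twitter), units = "Mb")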
Taking a 1% sample of each file will allow us to examine features of the data more easily.
s <- .01
set.seed(123)
blog_sample <- sample(blog,s*length(blog),replace=FALSE)
news_sample <- sample(news,s*length(news),replace=FALSE)
twitter_sample <- sample(twitter,s*length(twitter),replace=FALSE)
Using the quanteda package, a corpus can be created by combining the samples of each file.
library(quanteda)
blog_corp <- corpus(blog_sample)
news_corp <- corpus(news_sample)
twitter_corp <- corpus(twitter_sample)
corp <- blog_corp + news_corp + twitter_corp
Document-frequency matrices are created to find the top unigrams, bigrams, and trigrams. Other options, such as removing stop words and stemming, could be applied when creating the dfms, but the default parameters are left in place for now for the sake of exploring the data (a brief sketch of the stemming option follows the code below).
dfm1 <- dfm(corp,ngrams=1)
dfm2 <- dfm(corp,ngrams=2)
dfm3 <- dfm(corp,ngrams=3)
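As an illustration of one of those alternatives, stemming could be applied to an existing dfm with quanteda's dfm_wordstem function; this is only a sketch, and the stemmed dfm is not used in the analysis below.
# Stemmed version of the unigram dfm (illustration only, not used below)
dfm1_stem <- dfm_wordstem(dfm1)
topfeatures(dfm1_stem, 20)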
Using this sample, there are 57,764 unique unigrams, 442,306 unique bigrams, and 768,405 unique trigrams across 42,695 documents. The topfeatures function allows us to quickly view which unigrams, bigrams, and trigrams appear most frequently in the data.
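Those counts and the most frequent n-grams could be checked with calls like the following (a sketch; output omitted here):
dim(dfm1)  # documents x unique unigrams
dim(dfm2)  # documents x unique bigrams
dim(dfm3)  # documents x unique trigrams
topfeatures(dfm1, 20)
topfeatures(dfm2, 20)
topfeatures(dfm3, 20)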
Not surprisingly, these tend to be prepositions, pronouns, and other common English words. We would get a different result, with more nouns and verbs, if English stop words were removed, but stop words may be helpful to include when trying to predict a grammatically correct string of words.
dfm_stop <- dfm(corp,ngrams=1,ignoredFeatures=stopwords("english"))
topfeatures(dfm_stop,20)
## will just said one like can time get new now
## 3131 3126 2926 2742 2680 2485 2217 2178 2022 1817
## good day people know love back first make see go
## 1759 1747 1639 1630 1557 1405 1372 1369 1350 1335
We can estimate how many unique words are needed to cover 50% or 90% of all word instances in the sample. The document-frequency matrix shows that there are 841,068 total word instances in the sample.
To do this, the words are sorted in decreasing order of frequency and a cumulative sum column is added. Subsetting the resulting data frame shows that the top 236 unique words (0.4% of total unique words) account for 50% of the total word instances. Around 9,156 unique words (16% of total unique words) account for 90% of total word instances.
freq1 <- docfreq(dfm1)
df1 <- data.frame(ngram=names(freq1),freq=freq1)
rownames(df1) <- NULL
df1 <- df1[order(df1$freq,decreasing=TRUE),]
goal50 <- sum(df1$freq)*0.5
goal90 <- sum(df1$freq)*0.9
library(dplyr)
df1 <- mutate(df1,cumsum=cumsum(freq))
test50 <- df1[df1$cumsum<goal50,]
test90 <- df1[df1$cumsum<goal90,]
tail(test50)
## ngram freq cumsum
## 231 done 421 418734
## 232 hard 420 419154
## 233 without 420 419574
## 234 free 414 419988
## 235 used 406 420394
## 236 making 405 420799
tail(test90)
## ngram freq cumsum
## 9151 parenthood 7 757848
## 9152 geared 7 757855
## 9153 cheating 7 757862
## 9154 greene 7 757869
## 9155 brother's 7 757876
## 9156 belated 7 757883
The cumulative sums, with horizontal lines at 50% and 90% of total word instances, are plotted below.
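A minimal sketch of how such a plot could be produced with base R graphics, using the df1, goal50, and goal90 objects defined above:
# Cumulative word-instance coverage by number of unique words (ranked by frequency)
plot(df1$cumsum, type = "l",
     xlab = "Unique words (ranked by frequency)",
     ylab = "Cumulative word instances")
abline(h = goal50, lty = 2)  # 50% of total word instances
abline(h = goal90, lty = 2)  # 90% of total word instances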
A prediction model can be built using the Katz back-off approach, which involves the following steps.