Executive Summary

This report documents exploratory analysis of text data that will later be used to build a model for predicting the next word given a text input. The first step is to understand the distribution of words and the relationships between words in the corpora. We examine the distribution and frequencies of single words (unigrams), word pairs (bigrams), and word triplets (trigrams). We conclude by suggesting next steps for building a model and text prediction app.

Data Acquisition and Overview

The data come from HC Corpora, specifically three files of English text collected by a web crawler from Twitter, blogs, and news sites in the United States. The data are downloaded from the link provided for the capstone project. Table 1 shows an overview of the raw data.
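For completeness, the acquisition step can be sketched as below. This is a sketch only: dataUrl is a placeholder for the actual download link, and the folder layout inside the zip archive is assumed to match the paths used when loading the files.

# Sketch of the download step (skipped if the data are already present)
# NOTE: dataUrl is a placeholder -- replace it with the actual download link
dataUrl<-"<HC Corpora download link>"
if(!dir.exists("./capstoneData/en_US")){
  download.file(dataUrl,destfile="capstoneData.zip",mode="wb")
  unzip("capstoneData.zip",exdir="./capstoneData")
}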

# Load necessary libraries
library(stringi);library(tm);library(quanteda)
library(ggplot2);library(gridExtra)
library(magrittr)

# Load data
blogs<-readLines("./capstoneData/en_US/en_US.blogs.txt",encoding="UTF-8",skipNul=TRUE)
news<-readLines("./capstoneData/en_US/en_US.news.txt",encoding="UTF-8",skipNul=TRUE)
twitter<-readLines("./capstoneData/en_US/en_US.twitter.txt",encoding="UTF-8",skipNul=TRUE)

# Calculate size, number of words, number of lines, max number of characters per line of each dataset
size<-c(format(object.size(blogs),"MB"),format(object.size(news),"MB"),format(object.size(twitter),"MB"))
words<-c(stri_stats_latex(blogs)["Words"], 
          stri_stats_latex(news)["Words"], 
          stri_stats_latex(twitter)["Words"])
lines<-c(length(blogs),length(news),length(twitter))
chars<-c(max(nchar(blogs)),max(nchar(news)),max(nchar(twitter)))

# Gather calculations in dataframe and display in table
summary<-data.frame(size,words,lines,chars,row.names=c("Blogs","News","Twitter"))
colnames(summary)<-c("Size (in MB)",
                      "Number of words",
                      "Number of lines",
                      "Max characters in a line")
knitr::kable(summary,digits=2,caption="Table 1. Overview of Datasets")
Table 1. Overview of Datasets

           Size (in MB)   Number of words   Number of lines   Max characters in a line
Blogs      248.5 Mb       37,570,839        899,288           40,833
News       249.6 Mb       34,494,539        1,010,242         11,384
Twitter    301.4 Mb       30,451,170        2,360,148         140

Data Sampling and Cleaning

As the overview table above shows, the datasets are quite large. For efficiency, we take a random sample of 5% of each dataset.

set.seed(3723)
sampleSize<-.05
sample<-c(sample(blogs,length(blogs)*sampleSize,replace=FALSE),
          sample(news,length(news)*sampleSize,replace=FALSE),
          sample(twitter,length(twitter)*sampleSize,replace=FALSE))

Next we need to clean the data. The function below tokenizes the sampled data into unigrams, bigrams, and trigrams, folding case (i.e. converting everything to lower case) and removing punctuation, numbers, symbols, and stop words (e.g. the, a, and, of). The resulting tokens are then recast as document-feature matrices.

# Tokenizer: split into word tokens, remove numbers/punctuation/symbols,
# fold case before dropping stop words (so capitalized stop words such as
# "The" are also removed), then form n-grams of length x
tokenizer<-function(data,x){
  tokens(data,what="word",
         remove_numbers=TRUE,
         remove_punct=TRUE,
         remove_symbols=TRUE)%>%
    tokens_tolower()%>%
    tokens_remove(stopwords("english"))%>%
    tokens_ngrams(n=x,concatenator=" ")
}

# Tokenize sample data for unigram, bigram, and trigram
uniTok<-tokenizer(sample,1)
biTok<-tokenizer(sample,2)
triTok<-tokenizer(sample,3)

# Create dfms
uniDfm<-dfm(uniTok)
biDfm<-dfm(biTok)
triDfm<-dfm(triTok)

Analysis

We explore word distribution by plotting the frequencies of single words (unigrams), word pairs (bigrams), and word triplets (trigrams).

# Get most frequent n-grams and put in dataframe
unigramDF<-data.frame(word=names(topfeatures(uniDfm,15)),
                      freq=topfeatures(uniDfm,15),
                      row.names=NULL)
bigramDF<-data.frame(word=names(topfeatures(biDfm,15)),
                      freq=topfeatures(biDfm,15),
                      row.names=NULL)
trigramDF<-data.frame(word=names(topfeatures(triDfm,15)),
                      freq=topfeatures(triDfm,15),
                      row.names=NULL)

# Plot function: horizontal bar chart of n-gram frequencies with value labels
# (bars are drawn first so the white labels remain visible on top)
fPlot<-function(data,label){
  ggplot(data,aes(reorder(word,-freq),freq,fill=freq,label=freq))+
    geom_bar(stat="identity")+
    geom_text(size=4,position=position_stack(vjust=0.8),colour="white")+
    labs(x=label,y="Frequency")+
    scale_fill_gradient(low="lightblue",high="darkblue")+
    coord_flip()+
    theme(legend.position="none")
}
# Plot n-grams
uniPlot<-fPlot(unigramDF,"15 Most Common Unigrams")
biPlot<-fPlot(bigramDF,"15 Most Common Bigrams")
triPlot<-fPlot(trigramDF,"15 Most Common Trigrams")
grid.arrange(uniPlot,biPlot,triPlot,ncol=3)

Next Steps: Modeling & Application

The challenge will be to build a model that maximizes coverage while remaining lightweight and fast enough to run in a prediction app. We see two strategies for approaching this challenge:

  1. Increase coverage by refining the sample (see the refinement sketch after this list). This could be accomplished by:
  • Stemming words. Reduce words to their stems by clipping suffixes (e.g. -ed, -ing)
  • Removing all words that occur only once (frequency = 1)
  • Identifying and clustering common misspellings (e.g. happy mothers day, happy mother’s day)
  • Clustering synonyms
  2. Efficient modeling (see the back-off sketch after this list). Some ideas for doing so:
  • Kneser-Ney smoothing. Discount observed n-gram counts and redistribute the probability mass to lower-order n-grams, weighting words by the number of distinct contexts in which they appear rather than by their raw frequency.
  • Katz back-off. Use conditional probabilities from the longest matching n-gram (3-grams? 4-grams? 5-grams?) and back off to shorter n-grams when no match is found.
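As a first illustration of the sample-refinement ideas above (stemming and dropping words that occur only once), the sketch below uses quanteda's tokens_wordstem() and dfm_trim() on the unigram objects built earlier. The min_termfreq threshold is illustrative only, and whether to stem before or after building higher-order n-grams is still an open design choice.

# Sketch: shrink the unigram vocabulary by stemming and dropping singletons
# (the min_termfreq threshold of 2 is illustrative, not a final choice)
uniTokStem<-tokens_wordstem(uniTok,language="english")
uniDfmStem<-dfm(uniTokStem)
uniDfmTrim<-dfm_trim(uniDfmStem,min_termfreq=2)

# Compare vocabulary sizes before and after refinement
c(original=nfeat(uniDfm),refined=nfeat(uniDfmTrim))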
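To make the back-off idea above concrete, the sketch below implements a simplified frequency-based back-off lookup over the trigram and bigram matrices built earlier. It omits the discounting and back-off weights that full Katz back-off requires, and predictNext() is a hypothetical helper for illustration, not the final model.

# Sketch: simplified back-off lookup over observed n-gram frequencies
triFreq<-colSums(triDfm)
biFreq<-colSums(biDfm)

predictNext<-function(w1,w2,n=3){
  # Try trigrams that start with "w1 w2 "
  hits<-triFreq[startsWith(names(triFreq),paste(w1,w2,""))]
  if(length(hits)==0){
    # Back off to bigrams that start with "w2 "
    hits<-biFreq[startsWith(names(biFreq),paste(w2,""))]
  }
  head(sort(hits,decreasing=TRUE),n)
}

# Example: most likely continuations of "happy mothers"
predictNext("happy","mothers")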