Executive Summary

This report documents exploratory analysis of text data that will later be used to build a model for predicting the next word given a text input. The first step is to understand the distribution of words and the relationships between words in the corpora. We examine the distribution and frequencies of single words (unigrams), word pairs (bigrams), and word triplets (trigrams). We conclude by suggesting next steps for building a model and text prediction app.

Data Acquisition and Overview

The data come from HC Corpora, specifically three files of English text collected by a web crawler from Twitter, blogs, and news sites in the United States. The data are downloaded from the link provided for the capstone project. Table 1 shows an overview of the raw data.
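For completeness, the acquisition step can be sketched as below. This is a sketch only: dataUrl is a placeholder for the actual download link, and the folder layout inside the zip archive is assumed to match the paths used when loading the files.

# Sketch of the download step (skipped if the data are already present)
# NOTE: dataUrl is a placeholder -- replace it with the actual download link
dataUrl<-"<HC Corpora download link>"
if(!dir.exists("./capstoneData/en_US")){
  download.file(dataUrl,destfile="capstoneData.zip",mode="wb")
  unzip("capstoneData.zip",exdir="./capstoneData")
}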

# Load necessary libraries
library(stringi);library(tm);library(quanteda)
library(ggplot2);library(gridExtra)
library(magrittr)

# Load data
blogs<-readLines("./capstoneData/en_US/en_US.blogs.txt",encoding="UTF-8",skipNul=TRUE)
news<-readLines("./capstoneData/en_US/en_US.news.txt",encoding="UTF-8",skipNul=TRUE)
twitter<-readLines("./capstoneData/en_US/en_US.twitter.txt",encoding="UTF-8",skipNul=TRUE)

# Calculate size, number of words, number of lines, max number of characters per line of each dataset
size<-c(format(object.size(blogs),"MB"),format(object.size(news),"MB"),format(object.size(twitter),"MB"))
words<-c(stri_stats_latex(blogs)["Words"], 
          stri_stats_latex(news)["Words"], 
          stri_stats_latex(twitter)["Words"])
lines<-c(length(blogs),length(news),length(twitter))
chars<-c(max(nchar(blogs)),max(nchar(news)),max(nchar(twitter)))

# Gather calculations in dataframe and display in table
summary<-data.frame(size,words,lines,chars,row.names=c("Blogs","News","Twitter"))
colnames(summary)<-c("Size (in MB)",
                      "Number of words",
                      "Number of lines",
                      "Max characters in a line")
knitr::kable(summary,digits=2,caption="Table 1. Overview of Datasets")
Table 1. Overview of Datasets

           Size (in MB)   Number of words   Number of lines   Max characters in a line
Blogs      248.5 Mb       37,570,839        899,288           40,833
News       249.6 Mb       34,494,539        1,010,242         11,384
Twitter    301.4 Mb       30,451,170        2,360,148         140

Data Sampling and Cleaning

As the overview table above shows, the datasets are quite large. For efficiency, we take a random sample of 5% of each dataset.

set.seed(3723)
sampleSize<-.05
sample<-c(sample(blogs,length(blogs)*sampleSize,replace=FALSE),
          sample(news,length(news)*sampleSize,replace=FALSE),
          sample(twitter,length(twitter)*sampleSize,replace=FALSE))

Next we need to clean the data. The function below tokenizes the sampled data into unigrams, bigrams, and trigrams, folding case (i.e. converting everything to lower case) and removing punctuation, numbers, symbols, and stop words (e.g. the, a, and, of). The resulting tokens are then recast as document-feature matrices.

# Tokenizer: split into word tokens, remove numbers/punctuation/symbols,
# fold case before dropping stop words (so capitalized stop words such as
# "The" are also removed), then form n-grams of length x
tokenizer<-function(data,x){
  tokens(data,what="word",
         remove_numbers=TRUE,
         remove_punct=TRUE,
         remove_symbols=TRUE)%>%
    tokens_tolower()%>%
    tokens_remove(stopwords("english"))%>%
    tokens_ngrams(n=x,concatenator=" ")
}

# Tokenize sample data for unigram, bigram, and trigram
uniTok<-tokenizer(sample,1)
biTok<-tokenizer(sample,2)
triTok<-tokenizer(sample,3)

# Create dfms
uniDfm<-dfm(uniTok)
biDfm<-dfm(biTok)
triDfm<-dfm(triTok)

Analysis

We explore word distribution by plotting the frequencies of single words (unigrams), word pairs (bigrams), and word triplets (trigrams).

# Get most frequent n-grams and put in dataframe
unigramDF<-data.frame(word=names(topfeatures(uniDfm,15)),
                      freq=topfeatures(uniDfm,15),
                      row.names=NULL)
bigramDF<-data.frame(word=names(topfeatures(biDfm,15)),
                      freq=topfeatures(biDfm,15),
                      row.names=NULL)
trigramDF<-data.frame(word=names(topfeatures(triDfm,15)),
                      freq=topfeatures(triDfm,15),
                      row.names=NULL)

# Plot function: horizontal bar chart of n-gram frequencies with value labels
# (bars are drawn first so the white labels remain visible on top)
fPlot<-function(data,label){
  ggplot(data,aes(reorder(word,-freq),freq,fill=freq,label=freq))+
    geom_bar(stat="identity")+
    geom_text(size=4,position=position_stack(vjust=0.8),colour="white")+
    labs(x=label,y="Frequency")+
    scale_fill_gradient(low="lightblue",high="darkblue")+
    coord_flip()+
    theme(legend.position="none")
}
# Plot n-grams
uniPlot<-fPlot(unigramDF,"15 Most Common Unigrams")
biPlot<-fPlot(bigramDF,"15 Most Common Bigrams")
triPlot<-fPlot(trigramDF,"15 Most Common Trigrams")
grid.arrange(uniPlot,biPlot,triPlot,ncol=3)

Next Steps: Modeling & Application

The challenge will be to build a model that maximizes coverage while remaining lightweight and fast enough to run in a prediction app. We see two strategies for approaching this challenge:

  1. Increase coverage by refining the sample (see the refinement sketch after this list). This could be accomplished by:
  • Stemming words. Reduce words to their stems by clipping suffixes (e.g. -ed, -ing)
  • Removing all words that occur only once (frequency = 1)
  • Identifying and clustering common misspellings (e.g. happy mothers day, happy mother’s day)
  • Clustering synonyms
  2. Efficient modeling (see the back-off sketch after this list). Some ideas for doing so:
  • Kneser-Ney smoothing. Discount observed n-gram counts and redistribute the probability mass to lower-order n-grams, weighting words by the number of distinct contexts in which they appear rather than by their raw frequency.
  • Katz back-off. Use conditional probabilities from the longest matching n-gram (3-grams? 4-grams? 5-grams?) and back off to shorter n-grams when no match is found.
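As a first illustration of the sample-refinement ideas above (stemming and dropping words that occur only once), the sketch below uses quanteda's tokens_wordstem() and dfm_trim() on the unigram objects built earlier. The min_termfreq threshold is illustrative only, and whether to stem before or after building higher-order n-grams is still an open design choice.

# Sketch: shrink the unigram vocabulary by stemming and dropping singletons
# (the min_termfreq threshold of 2 is illustrative, not a final choice)
uniTokStem<-tokens_wordstem(uniTok,language="english")
uniDfmStem<-dfm(uniTokStem)
uniDfmTrim<-dfm_trim(uniDfmStem,min_termfreq=2)

# Compare vocabulary sizes before and after refinement
c(original=nfeat(uniDfm),refined=nfeat(uniDfmTrim))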
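To make the back-off idea above concrete, the sketch below implements a simplified frequency-based back-off lookup over the trigram and bigram matrices built earlier. It omits the discounting and back-off weights that full Katz back-off requires, and predictNext() is a hypothetical helper for illustration, not the final model.

# Sketch: simplified back-off lookup over observed n-gram frequencies
triFreq<-colSums(triDfm)
biFreq<-colSums(biDfm)

predictNext<-function(w1,w2,n=3){
  # Try trigrams that start with "w1 w2 "
  hits<-triFreq[startsWith(names(triFreq),paste(w1,w2,""))]
  if(length(hits)==0){
    # Back off to bigrams that start with "w2 "
    hits<-biFreq[startsWith(names(biFreq),paste(w2,""))]
  }
  head(sort(hits,decreasing=TRUE),n)
}

# Example: most likely continuations of "happy mothers"
predictNext("happy","mothers")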