This report documents exploratory analysis of textual data that will later be used to build a model for predicting the next word given a text input. The first step is to understand the distribution of words and the relationships between words in the corpora. We examine the distributions and frequencies of single words (unigrams), word pairs (bigrams), and word triplets (trigrams). We conclude by suggesting next steps for building a model and text prediction app.
The data come from the HC Corpora: three files of text collected by a web crawler from Twitter, blogs, and news sites in the United States. The data can be downloaded here. Table 1 shows an overview of the raw data.
# Load necessary libraries
library(stringi);library(tm);library(quanteda)
library(ggplot2);library(gridExtra)
library(magrittr)
# Load data
blogs<-readLines("./capstoneData/en_US/en_US.blogs.txt",encoding="UTF-8",skipNul=TRUE)
news<-readLines("./capstoneData/en_US/en_US.news.txt",encoding="UTF-8",skipNul=TRUE)
twitter<-readLines("./capstoneData/en_US/en_US.twitter.txt",encoding="UTF-8",skipNul=TRUE)
# Calculate size, number of words, number of lines, max number of characters per line of each dataset
size<-c(format(object.size(blogs),"MB"),format(object.size(news),"MB"),format(object.size(twitter),"MB"))
words<-c(stri_stats_latex(blogs)["Words"],
stri_stats_latex(news)["Words"],
stri_stats_latex(twitter)["Words"])
lines<-c(length(blogs),length(news),length(twitter))
chars<-c(max(nchar(blogs)),max(nchar(news)),max(nchar(twitter)))
# Gather calculations in dataframe and display in table
summary<-data.frame(size,words,lines,chars,row.names=c("Blogs","News","Twitter"))
colnames(summary)<-c("Size (in MB)",
"Number of words",
"Number of lines",
"Max characters in a line")
knitr::kable(summary,digits=2,caption="Table 1. Overview of Datasets")
Table 1. Overview of Datasets

|         | Size (in MB) | Number of words | Number of lines | Max characters in a line |
|---------|--------------|-----------------|-----------------|--------------------------|
| Blogs   | 248.5 Mb     | 37570839        | 899288          | 40833                    |
| News    | 249.6 Mb     | 34494539        | 1010242         | 11384                    |
| Twitter | 301.4 Mb     | 30451170        | 2360148         | 140                      |
As evidenced in the overview table above, the datasets are quite large. For efficiency, we will take a random sample of 5% of each dataset.
# Take a reproducible 5% random sample of each dataset and combine
set.seed(3723)
sampleSize<-.05
sample<-c(sample(blogs,length(blogs)*sampleSize,replace=FALSE),
sample(news,length(news)*sampleSize,replace=FALSE),
sample(twitter,length(twitter)*sampleSize,replace=FALSE))
Next, we clean the data. The function below tokenizes the sampled text into unigrams, bigrams, and trigrams, while also folding case (i.e., converting to lower case) and removing punctuation, numbers, symbols, and stop words (e.g., the, a, and, of). The resulting tokens are then recast as document-feature matrices (dfms).
# Tokenizer: lower-cases, removes numbers, punctuation, symbols, and stop words, then builds n-grams
tokenizer<-function(data,n){
tokens(data,what="word",
remove_numbers=TRUE,
remove_punct=TRUE,
remove_symbols=TRUE)%>%
tokens_tolower()%>%
tokens_remove(stopwords("english"))%>%
tokens_ngrams(n=n,concatenator=" ")
}
# Tokenize sample data for unigram, bigram, and trigram
uniTok<-tokenizer(sample,1)
biTok<-tokenizer(sample,2)
triTok<-tokenizer(sample,3)
# Create dfms
uniDfm<-dfm(uniTok)
biDfm<-dfm(biTok)
triDfm<-dfm(triTok)
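As a quick sanity check (not part of the original analysis), quanteda's nfeat() can report how many unique features each dfm retains after cleaning; a minimal sketch, assuming the dfms created above:
# Number of unique unigrams, bigrams, and trigrams retained after cleaning
nfeat(uniDfm);nfeat(biDfm);nfeat(triDfm)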
We explore the word distribution by plotting the 15 most frequent unigrams (single words), bigrams (word pairs), and trigrams (word triplets).
# Get most frequent n-grams and put in dataframe
unigramDF<-data.frame(word=names(topfeatures(uniDfm,15)),
freq=topfeatures(uniDfm,15),
row.names=NULL)
bigramDF<-data.frame(word=names(topfeatures(biDfm,15)),
freq=topfeatures(biDfm,15),
row.names=NULL)
trigramDF<-data.frame(word=names(topfeatures(triDfm,15)),
freq=topfeatures(triDfm,15),
row.names=NULL)
# Plot function (bars drawn first so the frequency labels sit on top of them)
fPlot<-function(data,label){
ggplot(data,aes(reorder(word,-freq),freq,fill=freq,label=freq))+
geom_col()+
geom_text(size=4,position=position_stack(vjust=0.8),colour="white")+
labs(x=label,y="Frequency")+
scale_fill_gradient(low="lightblue",high="darkblue")+
coord_flip()+
theme(legend.position="none")
}
# Plot n-grams
uniPlot<-fPlot(unigramDF,"15 Most Common Unigrams")
biPlot<-fPlot(bigramDF,"15 Most Common Bigrams")
triPlot<-fPlot(trigramDF,"15 Most Common Trigrams")
grid.arrange(uniPlot,biPlot,triPlot,ncol=3)
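Before turning to model building, it is useful to check how many unique unigrams are needed to cover a given share of all word occurrences in the sample, since this bears directly on the trade-off discussed below. The following is a minimal sketch, not part of the original analysis; it assumes the uniDfm object created above and uses colSums() on the dfm to obtain term counts.
# How many unique unigrams cover 50% and 90% of all word instances in the sample?
freqs<-sort(colSums(uniDfm),decreasing=TRUE)
coverage<-cumsum(freqs)/sum(freqs)
c(words50=which(coverage>=0.5)[1],words90=which(coverage>=0.9)[1])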
The challenge will be to build a model that maximizes coverage while remaining lightweight and fast enough to run in a prediction app. Two strategies for approaching this challenge: