In this document I will discuss the steps taken to load, process, and analyze three sets of English-language documents that will be used for a predictive text application:
* US News Articles
* English Blog Posts
* Tweets from Twitter
The goals for this report are to:
* Create a basic report of summary statistics.
* Report any interesting findings.
* Get feedback on my plans for creating a prediction algorithm and Shiny app.
My report will show some basic information on the files we’ll use for the analysis, as well as the relative distributions of N-grams of length 1 to 3. I will also discuss some ideas for cutting down the total number of words needed for the predictive text application.
Due to the size of the data set, it was downloaded prior to writing this report from the link provided on the course website. After downloading, the compressed file was extracted with the tar utility.
Since the data was downloaded ahead of time, let's jump right into loading it in R and generating some basic statistics on file size, line count, and word count using shell commands invoked through system().
# Check the basic length of each record
{
# The files appear to contain embedded NUL characters; skipNul = TRUE tells readLines() to skip them.
en_blogs <- readLines('../data/final/en_US/en_US.blogs.txt', skipNul=TRUE)
en_news <- readLines('../data/final/en_US/en_US.news.txt', skipNul=TRUE)
en_twitter <- readLines('../data/final/en_US/en_US.twitter.txt', skipNul=TRUE)
# And some basic stats
blogs_lines <- system("wc -l ../data/final/en_US/en_US.blogs.txt | awk '{print $1}'", intern=TRUE)
blogs_words <- system("wc -w ../data/final/en_US/en_US.blogs.txt | awk '{print $1}'", intern=TRUE)
blogs_size <- system("du -hs ../data/final/en_US/en_US.blogs.txt | awk '{print $1}'", intern=TRUE)
news_lines <- system("wc -l ../data/final/en_US/en_US.news.txt | awk '{print $1}'", intern=TRUE)
news_words <- system("wc -w ../data/final/en_US/en_US.news.txt | awk '{print $1}'", intern=TRUE)
news_size <- system("du -hs ../data/final/en_US/en_US.news.txt | awk '{print $1}'", intern=TRUE)
twitter_lines <- system("wc -l ../data/final/en_US/en_US.twitter.txt | awk '{print $1}'", intern=TRUE)
twitter_words <- system("wc -w ../data/final/en_US/en_US.twitter.txt | awk '{print $1}'", intern=TRUE)
twitter_size <- system("du -hs ../data/final/en_US/en_US.twitter.txt | awk '{print $1}'", intern=TRUE)
summary_dat <- data.frame(FILE = c("Blogs", "News", "Twitter"),
LINES = c(blogs_lines, news_lines, twitter_lines),
WORDS = c(blogs_words, news_words, twitter_words),
FILE_SIZE = c(blogs_size, news_size, twitter_size))
kable(summary_dat, caption="Summary Information of Each Data Set")
}
| FILE | LINES | WORDS | FILE_SIZE |
|---|---|---|---|
| Blogs | 899288 | 37334690 | 200M |
| News | 1010242 | 34372720 | 196M |
| Twitter | 2360148 | 30374206 | 159M |
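As a side note, the same summary can be produced without shell utilities. Below is a rough pure-R sketch; the count_words helper is mine, and its whitespace-based word count may differ slightly from wc -w.
# A pure-R sketch of the same summary, using the vectors loaded above.
# count_words() splits on whitespace, so it may not match wc -w exactly.
count_words <- function(lines) sum(lengths(strsplit(lines, "\\s+")))
summary_dat_r <- data.frame(
  FILE  = c("Blogs", "News", "Twitter"),
  LINES = c(length(en_blogs), length(en_news), length(en_twitter)),
  WORDS = c(count_words(en_blogs), count_words(en_news), count_words(en_twitter)),
  SIZE_MB = round(file.size(c('../data/final/en_US/en_US.blogs.txt',
                              '../data/final/en_US/en_US.news.txt',
                              '../data/final/en_US/en_US.twitter.txt')) / 1024^2))
kable(summary_dat_r, caption="Summary Information Computed in R")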
I parsed the files with the following functions and methods in this sequence:
1. Sample the data set (30% for this report)
2. Remove all punctuation other than apostrophes and periods
3. Remove numbers
4. Convert text into a Corpus
5. Convert to lower-case
6. Remove extra whitespace
7. Expand Contractions (i'm -> i am), then remove the remaining punctuation
8. Remove Profanity (George Carlin's 7 Dirty Words)
9. Convert the Corpus to Plaintext
10. Tokenize to generate 1, 2, and 3-grams
The code to achieve these steps is below. To keep the document a bit shorter, the code for the punct_cleaner, nonword_cleaner, fix_contractions, tdm_gen, and freqs_from_tdm functions can be found in the appendix at the end of this document. The filthy_words object used in the profanity step is simply a character vector holding that list of seven words.
# Take a sample of the data set
{
#I'll set a different seed for each source so any of the three samples can be reproduced later. createDataPartition() comes from the caret package.
set.seed(1234)
blog_split.idx <- createDataPartition(1:length(en_blogs), times=1, p=0.30, list=F)
set.seed(2345)
news_split.idx <- createDataPartition(1:length(en_news), times=1, p=0.30, list=F)
set.seed(3456)
twitter_split.idx <- createDataPartition(1:length(en_twitter), times=1, p=0.30, list=F)
#Combine the three samples into one character vector
test_set <- append(en_blogs[blog_split.idx], en_news[news_split.idx])
test_set <- append(test_set, en_twitter[twitter_split.idx])
}
# Now let's do some formatting of this set.
{
test_set.cleaned <- lapply(test_set, punct_cleaner)
test_set.cleaned <- lapply(test_set.cleaned, nonword_cleaner)
test_set.cleaned <- unlist(test_set.cleaned)
test_set.corp <- Corpus(VectorSource(test_set.cleaned))
test_set.corp <- tm_map(test_set.corp, tolower)
test_set.corp <- tm_map(test_set.corp, stripWhitespace)
test_set.corp <- tm_map(test_set.corp, fix_contractions)
test_set.corp <- tm_map(test_set.corp, removePunctuation)
test_set.corp <- tm_map(test_set.corp, removeWords, filthy_words)
test_set.corp <- tm_map(test_set.corp, PlainTextDocument)
}
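One caveat on the chunk above: in tm version 0.6 and later, mapping plain functions such as tolower or fix_contractions over a corpus can break the corpus structure, and the usual fix is to wrap them in content_transformer(). A sketch of the equivalent calls:
# Equivalent calls for tm >= 0.6: content_transformer() keeps the result a valid corpus.
test_set.corp <- tm_map(test_set.corp, content_transformer(tolower))
test_set.corp <- tm_map(test_set.corp, content_transformer(fix_contractions))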
# First we generate TDMs
unigram_tdm <- tdm_gen(n=1, corp=test_set.corp)
bigram_tdm <- tdm_gen(n=2, corp=test_set.corp)
trigram_tdm <- tdm_gen(n=3, corp=test_set.corp)
# Now we generate term frequencies
testset_unigram_termFreqs <- freqs_from_tdm(tdm=unigram_tdm)
testset_bigram_termFreqs <- freqs_from_tdm(tdm=bigram_tdm)
testset_trigram_termFreqs <- freqs_from_tdm(tdm=trigram_tdm)
Here are the top 10 most common N-grams from our sample, obtained by joining the top 10 rows of each sorted term-frequency table (a sketch of that step follows the table):
| Unigram | Unigram Count | Bigram | Bigram Count | Trigram | Trigram Count |
|---|---|---|---|---|---|
| the | 1430671 | of the | 130149 | i do not | 13936 |
| to | 824605 | in the | 123631 | one of the | 10446 |
| and | 721208 | i am | 77304 | a lot of | 8957 |
| of | 602528 | to the | 63913 | thanks for the | 7084 |
| in | 493774 | for the | 60173 | i am not | 6893 |
| that | 332372 | on the | 59191 | i ca not | 6284 |
| for | 329676 | to be | 48543 | to be a | 5465 |
| it | 328777 | do not | 43097 | i have been | 5322 |
| is | 328647 | at the | 42915 | going to be | 5211 |
| you | 302574 | i have | 41958 | i did not | 4980 |
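A minimal sketch of how a table like this can be assembled; the top10 object and the column relabeling are mine:
# Bind the first ten rows of each frequency table (already sorted by count)
# side by side, then relabel the columns for display.
top10 <- cbind(head(testset_unigram_termFreqs[, c("term", "count")], 10),
               head(testset_bigram_termFreqs[, c("term", "count")], 10),
               head(testset_trigram_termFreqs[, c("term", "count")], 10))
names(top10) <- c("Unigram", "Unigram Count", "Bigram", "Bigram Count",
                  "Trigram", "Trigram Count")
kable(top10, caption="Top 10 Most Frequent N-Grams")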
Many of the most frequent unigrams are stopwords. This is to be expected, and is the reason why stopwords are frequently removed in topic modeling and sentiment analysis. However, I will be performing text prediction with this data set, and so I purposely chose to leave stopwords in. They’ll need to be present if I want to perform maximum likelihood estimation on ngrams, or even have the ability to predict a stopword.
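To make the maximum likelihood point concrete: the MLE of P(next word | previous word) is just the bigram count divided by the unigram count of the previous word. Here is a small sketch using the frequency tables above; the bigram_mle helper is hypothetical, not part of my pipeline:
# Hypothetical helper: MLE estimate of P(word | prev) from the frequency tables above.
bigram_mle <- function(prev, word) {
  bigram_count  <- testset_bigram_termFreqs$count[testset_bigram_termFreqs$term == paste(prev, word)]
  unigram_count <- testset_unigram_termFreqs$count[testset_unigram_termFreqs$term == prev]
  if (length(bigram_count) == 0 || length(unigram_count) == 0) return(0)
  bigram_count / unigram_count
}
# Using the counts in the table above: P("the" | "of") = 130149 / 602528, about 0.22.
bigram_mle("of", "the")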
The distribution of the number of times each N-gram is observed in the data set is shown in the plots below. The histogram is split into two X-axis ranges so that we can see how quickly the number of N-grams falls off as the observation count grows. On the left are unigrams observed between 1 and 25 times, and on the right unigrams observed between 25 and 500 times, across all of the roughly 1.2 million documents in our Corpus sample. Such low counts indicate that most words are seen infrequently, and are probably not very useful for text prediction.
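The histogram code was left out of the appendix, so here is a rough sketch of how the two panels could be produced; the binwidths and the gridExtra layout are my choices, not the original plotting code:
# Sketch of the two unigram-count histograms: counts of 1-25 on the left,
# 25-500 on the right. grid.arrange() comes from the gridExtra package.
low_counts  <- subset(testset_unigram_termFreqs, count >= 1  & count <= 25)
high_counts <- subset(testset_unigram_termFreqs, count >= 25 & count <= 500)
p1 <- qplot(count, data = low_counts,  binwidth = 1,
            xlab = "Times Unigram Was Observed (1-25)",   ylab = "Number of Unigrams")
p2 <- qplot(count, data = high_counts, binwidth = 25,
            xlab = "Times Unigram Was Observed (25-500)", ylab = "Number of Unigrams")
gridExtra::grid.arrange(p1, p2, ncol = 2)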
The plot below shows the cumulative proportion of all N-grams versus the number of times they’ve been seen. Nearly 60% of unigrams are seen only once, and roughly 95% of unigrams have been seen 500 times or fewer. Nearly all bigrams and trigrams are seen fewer than 100 times in our sample of 1.2 million documents!
The final piece of analysis I wanted to discuss is how we can shrink the vocabulary. Specifically, we will look at the smallest set of words that accounts for 75% and 90% of all words in our sample of the Corpus. To accomplish this we need to calculate the cumulative frequencies and coverage of all N-grams, ranked from most to least frequently observed.
# Calculate the cumulative frequencies of each term from most frequent to least frequent.
{
testset_unigram_termFreqs <- testset_unigram_termFreqs %>%
arrange(desc(count)) %>%
mutate(Cumulative = 100*round(cumsum(count)/sum(count),4),
Rank = rank(desc(count), ties.method='first'))
testset_bigram_termFreqs <- testset_bigram_termFreqs %>%
arrange(desc(count)) %>%
mutate(Cumulative = 100*round(cumsum(count)/sum(count),4),
Rank = rank(desc(count), ties.method='first'))
testset_trigram_termFreqs <- testset_trigram_termFreqs %>%
arrange(desc(count)) %>%
mutate(Cumulative = 100*round(cumsum(count)/sum(count),4),
Rank = rank(desc(count), ties.method='first'))
}
| NGRAM | N_UNIQUE_NGRAMS | NGRAMS_75PCT | FRACTION_NGRAMS_75PCT | NGRAMS_90PCT | FRACTION_NGRAMS_90PCT |
|---|---|---|---|---|---|
| Unigram | 408176 | 1582 | 0.00 | 8085 | 0.02 |
| Bigram | 5920319 | 499906 | 0.08 | 3033124 | 0.51 |
| Trigram | 15996978 | 9098987 | 0.57 | 13236954 | 0.83 |
Even though most terms are seen infrequently, less than 1% of the unique words in the vocabulary accounts for nearly 75% of all words observed in the sample. (See the code for the cumulative frequency table in the appendix.) Part of this is because I’ve left stopwords in the analysis, and they occur in almost every sentence. Higher-order N-grams behave very differently, however: covering 90% of all bigram observations requires about half of the unique bigrams in the data set, and covering 90% of trigram observations requires 83% of the unique trigrams. (These figures count whole bigrams and trigrams, not the individual words they are built from.)
The above analyses make it fairly clear that a relatively small number of words make up the bulk of the text in our training Corpus, but most combinations of words are seen infrequently. I am planning to cut the vocabulary down significantly by keeping only the most frequently observed terms and ignoring any term that has been seen just a handful of times. I have built a basic prediction model that takes a string of text and returns predictions for the next word along with each word's likelihood, and I will be building the first pieces of my Shiny app to serve as an interface to the prediction algorithm. Cutting back the vocabulary should improve the responsiveness of the prediction algorithm while maintaining the accuracy of its predictions.
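To give a flavor of the kind of lookup I have in mind (a rough placeholder, not the final model), the sketch below pulls candidate next words from the trigram table, backs off to the bigram table when the two-word context is unseen, and reports a normalized score for each candidate; the predict_next name and structure are mine:
# Placeholder sketch of a next-word lookup with a simple back-off to bigrams.
# Assumes the term columns hold space-separated, lower-cased n-grams as built above.
predict_next <- function(text, n = 5) {
  words   <- tail(strsplit(tolower(text), "\\s+")[[1]], 2)
  context <- paste(words, collapse = " ")
  hits <- subset(testset_trigram_termFreqs, startsWith(term, paste0(context, " ")))
  if (nrow(hits) == 0) {  # back off to the last word only
    hits <- subset(testset_bigram_termFreqs, startsWith(term, paste0(tail(words, 1), " ")))
  }
  hits <- head(hits[order(-hits$count), ], n)
  data.frame(word  = sub(".* ", "", hits$term),
             score = hits$count / sum(hits$count))
}
predict_next("thanks for")  # should rank "the" near the top, per the trigram table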
Thanks for reading and I look forward to your feedback!
Preprocessing Functions to Format Corpus, Generate TDM, and Obtain Frequency Tables from TDMs
# This gets rid of all punctuation other than apostrophes and periods.
punct_cleaner <- function(sent) { gsub("[^[:alnum:][:space:]'.]","",sent)}
# This removes numbers (any run of digits).
nonword_cleaner <- function(sent) { gsub("[0-9]+","",sent)}
# In my processing I realized I wanted to expand contractions, since apostrophes get removed
# in tm's preprocessing. I found this function pre-written by the user MKFS at the following blog:
# http://entrenchant.blogspot.com/2013_06_01_archive.html
# I've made some slight modifications to that function to suit my needs.
fix_contractions <- function(text) {
text <- gsub("won't", "will not", text)
text <- gsub("can't", "can not", text)
text <- gsub("n't", " not", text)
text <- gsub("'ll", " will", text)
text <- gsub("'re", " are", text)
text <- gsub("'ve", " have", text)
text <- gsub("'m", " am", text)
text <- gsub("'s", "", text)
return(text)
}
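As a quick sanity check, here is what these helpers do to a made-up example sentence (the leftover double space is collapsed later by the stripWhitespace step):
# Made-up example run through the cleaning helpers defined above.
ex <- "I can't wait, I'm buying 3 books!"
ex <- punct_cleaner(ex)    # drops the comma and exclamation point, keeps the apostrophes
ex <- nonword_cleaner(ex)  # drops the "3"
ex <- fix_contractions(tolower(ex))
ex
# [1] "i can not wait i am buying  books"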
# This function creates a TDM from the corpus
tdm_gen <- function(n = 1, corp=NULL) {
Tokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = n, max = n))
testset_tdm <- TermDocumentMatrix(x = corp, control = list(
tokenize=Tokenizer,
removePunctuation=TRUE,
removeNumbers=TRUE,
wordLengths= c(2,20)))
return(testset_tdm)
}
# This function returns a full list of term frequencies from the TDM
freqs_from_tdm <- function(tdm = NULL) {
testset_termFreqs <- slam::row_sums(tdm, na.rm = TRUE)
testset_termFreqs <- as.data.frame(testset_termFreqs)
testset_termFreqs$term <- row.names(testset_termFreqs)
row.names(testset_termFreqs) <- NULL
names(testset_termFreqs)[1] <- "count"
testset_termFreqs <- testset_termFreqs %>% arrange(desc(count))
return(testset_termFreqs)
}
Code to Generate Cumulative Frequency Table
uni_75 <- testset_unigram_termFreqs %>% filter(Cumulative >= 75) %>%
filter(Rank == min(Rank))
uni_90 <- testset_unigram_termFreqs %>% filter(Cumulative >= 90) %>%
filter(Rank == min(Rank))
bi_75 <- testset_bigram_termFreqs %>% filter(Cumulative >= 75) %>%
filter(Rank == min(Rank))
bi_90 <- testset_bigram_termFreqs %>% filter(Cumulative >= 90) %>%
filter(Rank == min(Rank))
tri_75 <- testset_trigram_termFreqs %>% filter(Cumulative >= 75) %>%
filter(Rank == min(Rank))
tri_90 <- testset_trigram_termFreqs %>% filter(Cumulative >= 90) %>%
filter(Rank == min(Rank))
# Column names match the summary table shown in the body of the report.
term_summary <- data.frame(NGRAM = c("Unigram","Bigram","Trigram"),
N_UNIQUE_NGRAMS = c(nrow(testset_unigram_termFreqs),
nrow(testset_bigram_termFreqs),
nrow(testset_trigram_termFreqs)),
NGRAMS_75PCT = c(uni_75$Rank, bi_75$Rank, tri_75$Rank),
FRACTION_NGRAMS_75PCT = c(round(uni_75$Rank/nrow(testset_unigram_termFreqs),2),
round(bi_75$Rank/nrow(testset_bigram_termFreqs),2),
round(tri_75$Rank/nrow(testset_trigram_termFreqs),2)),
NGRAMS_90PCT = c(uni_90$Rank, bi_90$Rank, tri_90$Rank),
FRACTION_NGRAMS_90PCT = c(round(uni_90$Rank/nrow(testset_unigram_termFreqs),2),
round(bi_90$Rank/nrow(testset_bigram_termFreqs),2),
round(tri_90$Rank/nrow(testset_trigram_termFreqs),2)))
kable(term_summary)
Code for the N-Gram Distribution Summary
unigram_summary <- testset_unigram_termFreqs %>% arrange(count) %>% filter(count <= 10000) %>%
group_by(count) %>% summarize(TOT_NGRAMS = n()) %>% ungroup() %>%
mutate(PERCENTAGE = cumsum(TOT_NGRAMS)/sum(TOT_NGRAMS), NGRAM="unigram") %>% filter(count < 5000)
bigram_summary <- testset_bigram_termFreqs %>% arrange(count) %>% filter(count <= 10000) %>%
group_by(count) %>% summarize(TOT_NGRAMS = n()) %>% ungroup() %>%
mutate(PERCENTAGE = cumsum(TOT_NGRAMS)/sum(TOT_NGRAMS), NGRAM="bigram") %>% filter(count < 5000)
trigram_summary <- testset_trigram_termFreqs %>% arrange(count) %>% filter(count <= 10000) %>%
group_by(count) %>% summarize(TOT_NGRAMS = n()) %>% ungroup() %>%
mutate(PERCENTAGE = cumsum(TOT_NGRAMS)/sum(TOT_NGRAMS), NGRAM="trigram") %>% filter(count < 5000)
# bind_rows() replaces the now-deprecated dplyr::rbind_all()
ngram_summary <- bind_rows(unigram_summary, bigram_summary, trigram_summary)
ngram_summary$NGRAM <- factor(ngram_summary$NGRAM, levels=c("unigram","bigram","trigram"))
qplot(count, PERCENTAGE, data=ngram_summary, color=NGRAM, geom='line', xlim=c(1,500),
xlab="Times N-Gram Was Observed", ylab="Fraction of All Ngrams")