This report summarizes an exploratory analysis of the data used to train an n-gram prediction model. Statistical properties of unigram, bigram, and trigram tokens are shown in the plots below. Considerations for the final prediction model, such as timing and the handling of unknown tokens, are listed in the last section.
The training corpus includes about 1 gigabyte of text from news, Twitter, and blog sources. The raw data are loaded using the read_lines function from the readr package. The number of lines and the in-memory size of the news, twitter, and blogs files are printed below.
## num_lines object_megabytes
## news 1010242 269.8410
## twitter 2360148 334.4847
## blogs 899288 267.7586
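The loading step itself is not shown; a minimal sketch of how the summary above could be produced with readr is given below. The file names and the data/ directory are assumptions and may need to be adjusted to the local layout.
library(readr)
# read each source file as a character vector, one element per line
# (file paths below are assumptions; adjust to the local data layout)
news <- read_lines("data/en_US.news.txt")
twit <- read_lines("data/en_US.twitter.txt")
blog <- read_lines("data/en_US.blogs.txt")
# summarize the number of lines and the in-memory size in megabytes
data.frame(num_lines = c(length(news), length(twit), length(blog)),
           object_megabytes = as.numeric(c(object.size(news), object.size(twit), object.size(blog))) / 1E6,
           row.names = c("news", "twitter", "blogs"))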
The corpora are combined and tokenized using the quanteda package, with preprocessing applied during tokenization: punctuation, symbols, numbers, URLs, and separators are removed. Standard stopwords are not removed; their effect on timing and model size will be analyzed in later steps.
For this analysis, 50% of the data is randomly drawn as the training set; the remaining 50% may be used to test the model.
library(quanteda)
## Package version: 2.1.2
## Parallel computing: 2 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
# randomly draw indices for 50% of the training data
set.seed(123)
news_indices <- sample(1:length(news), length(news) * 0.5, replace = FALSE)
twit_indices <- sample(1:length(twit), length(twit) * 0.5, replace = FALSE)
blog_indices <- sample(1:length(blog), length(blog) * 0.5, replace = FALSE)
# create a corpus object from the sampled character vectors
combined_corpus <- corpus(c(news[news_indices], twit[twit_indices], blog[blog_indices]))
# corp_size <- c(object.size(news_corpus), object.size(twit_corpus), object.size(blog_corpus)) / 1E6
corp_size <- object.size(combined_corpus) / 1E6
# remove the large raw data objects from memory
rm(news, twit, blog); gc_hide <- gc(verbose = FALSE)
# tokenize, removing punctuation, symbols, numbers, URLs, and separators
combined_toks <- tokens(combined_corpus, remove_punct = TRUE, remove_symbols = TRUE,
                        remove_numbers = TRUE, remove_url = TRUE, remove_separators = TRUE)
# stopword removal (not applied in this analysis)
# combined_toks <- tokens_remove(combined_toks, stopwords("en"))
rm(combined_corpus); gc_hide <- gc(verbose = FALSE)  # remove the large corpus from memory
Unigrams, bigrams, and trigrams are generated with quanteda. First, a wordcloud is shown to compare the relative representation of the top 100 tokens for each n-gram. Then, the cumulative distribution of token frequencies and a histogram of feature frequencies are displayed. Trigrams and bigrams have many tokens with only one instance, whereas unigrams have fewer; a short sketch quantifying the singleton share follows the trigram plots.
unigram_toks <- tokens_ngrams(combined_toks, n = 1)
unigram_dfm <- dfm(unigram_toks)
unigram_freq <- textstat_frequency(unigram_dfm)
# plot wordcloud to show the most frequent unigrams
textplot_wordcloud(unigram_dfm, max_words = 100,
                   ordered_color = TRUE)
par(mfrow = c(1, 2))
# plot cumulative distribution of feature frequencies
plot(cumsum(unigram_freq$frequency) / sum(unigram_freq$frequency),
     type = "l", xlab = "Number Features", ylab = "CDF", main = "Unigram Feature CDF",
     panel.first = grid())
# plot histogram of feature frequencies
unigram_hist <- hist(unigram_freq$frequency, plot = FALSE)
plot(unigram_hist$counts, ylab = "Frequency", xlab = "Feature Frequency",
     main = "Unigram Feature Frequency", log = "y", type = "h")
bigram_toks <- tokens_ngrams(combined_toks, n = 2)
bigram_dfm <- dfm(bigram_toks)
bigram_freq <- textstat_frequency(bigram_dfm)
# plot wordcloud to show the most frequent bigrams
textplot_wordcloud(bigram_dfm, max_words = 100,
                   ordered_color = TRUE)
par(mfrow = c(1, 2))
# plot cumulative distribution of feature frequencies
plot(cumsum(bigram_freq$frequency) / sum(bigram_freq$frequency),
     type = "l", xlab = "Number Features", ylab = "CDF", main = "Bigram Feature CDF",
     panel.first = grid())
# plot histogram of feature frequencies
bigram_hist <- hist(bigram_freq$frequency, plot = FALSE)
plot(bigram_hist$counts, ylab = "Frequency", xlab = "Feature Frequency",
     main = "Bigram Feature Frequency", log = "y", type = "h")
trigram_toks <- tokens_ngrams(combined_toks, n = 3)
trigram_dfm <- dfm(trigram_toks)
trigram_freq <- textstat_frequency(trigram_dfm)
# plot wordcloud to show the most frequent trigrams
textplot_wordcloud(trigram_dfm, max_words = 100,
                   ordered_color = TRUE)
par(mfrow = c(1, 2))
# plot cumulative distribution of feature frequencies
plot(cumsum(trigram_freq$frequency) / sum(trigram_freq$frequency),
     type = "l", xlab = "Number Features", ylab = "CDF", main = "Trigram Feature CDF",
     panel.first = grid())
# plot histogram of feature frequencies
trigram_hist <- hist(trigram_freq$frequency, plot = FALSE)
plot(trigram_hist$counts, ylab = "Frequency", xlab = "Feature Frequency",
     main = "Trigram Feature Frequency", log = "y", type = "h")
The size of the objects containing the frequency data, as well as the number of features in each n-gram table, is printed below. Eliminating bigrams and trigrams with only one instance should save a significant amount of memory and search time.
ngram_megabytes <- c(object.size(unigram_freq), object.size(bigram_freq), object.size(trigram_freq)) / 1E6
names(ngram_megabytes) <- c("Unigram", "Bigram", "Trigram")
ngram_features <- c(length(unigram_freq$feature), length(bigram_freq$feature), length(trigram_freq$feature))
ngram_size <- rbind(ngram_megabytes, ngram_features)
ngram_size
## Unigram Bigram Trigram
## ngram_megabytes 93.92868 1477.868 4510.352
## ngram_features 580551.00000 8831228.000 25228165.000
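As a rough illustration of how the singleton pruning could be done, low-frequency n-grams could be dropped either at the dfm level with quanteda's dfm_trim() (min_termfreq = 2) or directly from the frequency tables. The sketch below takes the latter approach; the pruned object names are placeholders introduced for this illustration.
# keep only bigrams and trigrams observed more than once
bigram_freq_pruned <- bigram_freq[bigram_freq$frequency > 1, ]
trigram_freq_pruned <- trigram_freq[trigram_freq$frequency > 1, ]
# compare object size (MB) and feature count after pruning singletons
pruned_megabytes <- c(Bigram = object.size(bigram_freq_pruned), Trigram = object.size(trigram_freq_pruned)) / 1E6
pruned_features <- c(Bigram = nrow(bigram_freq_pruned), Trigram = nrow(trigram_freq_pruned))
rbind(pruned_megabytes, pruned_features)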