This report summarizes an exploratory analysis of the data used to train an n-gram prediction model. Statistical properties of unigram, bigram, and trigram tokens are shown in the plots below. Considerations for the final prediction model, such as timing and the handling of unknown tokens, are listed in the last section.
The training corpus includes about 1 gigabyte of text from news, Twitter, and blog sources. The raw data are loaded using the read_lines function from the readr package. The number of lines and the in-memory size of the news, twitter, and blogs files are printed below.
## num_lines object_megabytes
## news 1010242 269.8410
## twitter 2360148 334.4847
## blogs 899288 267.7586
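The loading step itself is not shown; a minimal sketch of how the summary above could be produced with readr is given below. The file names and the data/ directory are assumptions and may need to be adjusted to the local layout.
library(readr)
# read each source file as a character vector, one element per line
# (file paths below are assumptions; adjust to the local data layout)
news <- read_lines("data/en_US.news.txt")
twit <- read_lines("data/en_US.twitter.txt")
blog <- read_lines("data/en_US.blogs.txt")
# summarize the number of lines and the in-memory size in megabytes
data.frame(num_lines = c(length(news), length(twit), length(blog)),
           object_megabytes = as.numeric(c(object.size(news), object.size(twit), object.size(blog))) / 1E6,
           row.names = c("news", "twitter", "blogs"))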
The corpora are combined and tokenized using the quanteda package, with preprocessing applied during tokenization: punctuation, symbols, numbers, URLs, and separators are removed. Standard stopwords are not removed; their effect on timing and model size will be analyzed in later steps.
For this analysis, 50% of the data is randomly drawn as the training set; the remaining 50% may be used to test the model.
library(quanteda)
## Package version: 2.1.2
## Parallel computing: 2 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
# randomly draw indices for 50% of the training data
set.seed(123)
news_indices <- sample(1:length(news), length(news) * 0.5, replace = FALSE)
twit_indices <- sample(1:length(twit), length(twit) * 0.5, replace = FALSE)
blog_indices <- sample(1:length(blog), length(blog) * 0.5, replace = FALSE)
# create a corpus object from the sampled character vectors
combined_corpus <- corpus(c(news[news_indices], twit[twit_indices], blog[blog_indices]))
# corp_size <- c(object.size(news_corpus), object.size(twit_corpus), object.size(blog_corpus)) / 1E6
corp_size <- object.size(combined_corpus) / 1E6
# remove the large raw data objects from memory
rm(news, twit, blog); gc_hide <- gc(verbose = FALSE)
# tokenize, removing punctuation, symbols, numbers, URLs, and separators
combined_toks <- tokens(combined_corpus, remove_punct = TRUE, remove_symbols = TRUE,
                        remove_numbers = TRUE, remove_url = TRUE, remove_separators = TRUE)
# stopword removal (not applied in this analysis)
# combined_toks <- tokens_remove(combined_toks, stopwords("en"))
rm(combined_corpus); gc_hide <- gc(verbose = FALSE)  # remove the large corpus from memory
Unigrams, bigrams, and trigrams are generated with quanteda. First, a wordcloud is shown to compare the relative representation of the top 100 tokens for each n-gram. Then, the cumulative distribution of token frequencies and a histogram of feature frequencies are displayed. Trigrams and bigrams have many tokens with only one instance, whereas unigrams have fewer; a short sketch quantifying the singleton share follows the trigram plots.
unigram_toks <- tokens_ngrams(combined_toks, n = 1)
unigram_dfm <- dfm(unigram_toks)
unigram_freq <- textstat_frequency(unigram_dfm)
# plot wordcloud to show the most frequent unigrams
textplot_wordcloud(unigram_dfm, max_words = 100,
                   ordered_color = TRUE)
par(mfrow = c(1, 2))
# plot cumulative distribution of feature frequencies
plot(cumsum(unigram_freq$frequency) / sum(unigram_freq$frequency),
     type = "l", xlab = "Number Features", ylab = "CDF", main = "Unigram Feature CDF",
     panel.first = grid())
# plot histogram of feature frequencies
unigram_hist <- hist(unigram_freq$frequency, plot = FALSE)
plot(unigram_hist$counts, ylab = "Frequency", xlab = "Feature Frequency",
     main = "Unigram Feature Frequency", log = "y", type = "h")
bigram_toks <- tokens_ngrams(combined_toks, n = 2)
bigram_dfm <- dfm(bigram_toks)
bigram_freq <- textstat_frequency(bigram_dfm)
# plot wordcloud to show the most frequent bigrams
textplot_wordcloud(bigram_dfm, max_words = 100,
                   ordered_color = TRUE)
par(mfrow = c(1, 2))
# plot cumulative distribution of feature frequencies
plot(cumsum(bigram_freq$frequency) / sum(bigram_freq$frequency),
     type = "l", xlab = "Number Features", ylab = "CDF", main = "Bigram Feature CDF",
     panel.first = grid())
# plot histogram of feature frequencies
bigram_hist <- hist(bigram_freq$frequency, plot = FALSE)
plot(bigram_hist$counts, ylab = "Frequency", xlab = "Feature Frequency",
     main = "Bigram Feature Frequency", log = "y", type = "h")
trigram_toks <- tokens_ngrams(combined_toks, n = 3)
trigram_dfm <- dfm(trigram_toks)
trigram_freq <- textstat_frequency(trigram_dfm)
# plot wordcloud to show the most frequent trigrams
textplot_wordcloud(trigram_dfm, max_words = 100,
                   ordered_color = TRUE)
par(mfrow = c(1, 2))
# plot cumulative distribution of feature frequencies
plot(cumsum(trigram_freq$frequency) / sum(trigram_freq$frequency),
     type = "l", xlab = "Number Features", ylab = "CDF", main = "Trigram Feature CDF",
     panel.first = grid())
# plot histogram of feature frequencies
trigram_hist <- hist(trigram_freq$frequency, plot = FALSE)
plot(trigram_hist$counts, ylab = "Frequency", xlab = "Feature Frequency",
     main = "Trigram Feature Frequency", log = "y", type = "h")
The size of the objects containing the frequency data, as well as the number of features in each n-gram table, is printed below. Eliminating bigrams and trigrams with only one instance should save a significant amount of memory and search time.
ngram_megabytes <- c(object.size(unigram_freq), object.size(bigram_freq), object.size(trigram_freq)) / 1E6
names(ngram_megabytes) <- c("Unigram", "Bigram", "Trigram")
ngram_features <- c(length(unigram_freq$feature), length(bigram_freq$feature), length(trigram_freq$feature))
ngram_size <- rbind(ngram_megabytes, ngram_features)
ngram_size
## Unigram Bigram Trigram
## ngram_megabytes 93.92868 1477.868 4510.352
## ngram_features 580551.00000 8831228.000 25228165.000
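As a rough illustration of how the singleton pruning could be done, low-frequency n-grams could be dropped either at the dfm level with quanteda's dfm_trim() (min_termfreq = 2) or directly from the frequency tables. The sketch below takes the latter approach; the pruned object names are placeholders introduced for this illustration.
# keep only bigrams and trigrams observed more than once
bigram_freq_pruned <- bigram_freq[bigram_freq$frequency > 1, ]
trigram_freq_pruned <- trigram_freq[trigram_freq$frequency > 1, ]
# compare object size (MB) and feature count after pruning singletons
pruned_megabytes <- c(Bigram = object.size(bigram_freq_pruned), Trigram = object.size(trigram_freq_pruned)) / 1E6
pruned_features <- c(Bigram = nrow(bigram_freq_pruned), Trigram = nrow(trigram_freq_pruned))
rbind(pruned_megabytes, pruned_features)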