In this document I will discuss the steps taken to load, process, and analyze three sets of English-language documents that will be used for a predictive text application:
* US News Articles
* English Blog Posts
* Tweets from Twitter
The goals for this report are to:
* Create a basic report of summary statistics.
* Report any interesting findings.
* Get feedback on my plans for creating a prediction algorithm and Shiny app.
My report will show some basic information on the files we’ll use for the analysis, as well as the relative distributions of N-grams of length 1 to 3. I will also discuss some ideas for cutting down the total number of words needed for the predictive text application.
Due to the size of the data set, it was downloaded prior to writing this report from the link provided on the course website. After downloading, the compressed file was extracted with the tar utility.
Since the data was downloaded ahead of time, let's jump right into loading it in R and generating some basic statistics on file size, line count, and word count using shell commands invoked through system().
# Check the basic length of each record
{
# The files appear to contain embedded NUL characters; skipNul = TRUE tells readLines() to skip them.
en_blogs <- readLines('../data/final/en_US/en_US.blogs.txt', skipNul=TRUE)
en_news <- readLines('../data/final/en_US/en_US.news.txt', skipNul=TRUE)
en_twitter <- readLines('../data/final/en_US/en_US.twitter.txt', skipNul=TRUE)
# And some basic stats
blogs_lines <- system("wc -l ../data/final/en_US/en_US.blogs.txt | awk '{print $1}'", intern=TRUE)
blogs_words <- system("wc -w ../data/final/en_US/en_US.blogs.txt | awk '{print $1}'", intern=TRUE)
blogs_size <- system("du -hs ../data/final/en_US/en_US.blogs.txt | awk '{print $1}'", intern=TRUE)
news_lines <- system("wc -l ../data/final/en_US/en_US.news.txt | awk '{print $1}'", intern=TRUE)
news_words <- system("wc -w ../data/final/en_US/en_US.news.txt | awk '{print $1}'", intern=TRUE)
news_size <- system("du -hs ../data/final/en_US/en_US.news.txt | awk '{print $1}'", intern=TRUE)
twitter_lines <- system("wc -l ../data/final/en_US/en_US.twitter.txt | awk '{print $1}'", intern=TRUE)
twitter_words <- system("wc -w ../data/final/en_US/en_US.twitter.txt | awk '{print $1}'", intern=TRUE)
twitter_size <- system("du -hs ../data/final/en_US/en_US.twitter.txt | awk '{print $1}'", intern=TRUE)
summary_dat <- data.frame(FILE = c("Blogs", "News", "Twitter"),
LINES = c(blogs_lines, news_lines, twitter_lines),
WORDS = c(blogs_words, news_words, twitter_words),
FILE_SIZE = c(blogs_size, news_size, twitter_size))
kable(summary_dat, caption="Summary Information of Each Data Set")
}
| FILE | LINES | WORDS | FILE_SIZE |
|---|---|---|---|
| Blogs | 899288 | 37334690 | 200M |
| News | 1010242 | 34372720 | 196M |
| Twitter | 2360148 | 30374206 | 159M |
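As a side note, the same summary can be produced without shell utilities. Below is a rough pure-R sketch; the count_words helper is mine, and its whitespace-based word count may differ slightly from wc -w.
# A pure-R sketch of the same summary, using the vectors loaded above.
# count_words() splits on whitespace, so it may not match wc -w exactly.
count_words <- function(lines) sum(lengths(strsplit(lines, "\\s+")))
summary_dat_r <- data.frame(
  FILE  = c("Blogs", "News", "Twitter"),
  LINES = c(length(en_blogs), length(en_news), length(en_twitter)),
  WORDS = c(count_words(en_blogs), count_words(en_news), count_words(en_twitter)),
  SIZE_MB = round(file.size(c('../data/final/en_US/en_US.blogs.txt',
                              '../data/final/en_US/en_US.news.txt',
                              '../data/final/en_US/en_US.twitter.txt')) / 1024^2))
kable(summary_dat_r, caption="Summary Information Computed in R")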
I parsed the files with the following functions and methods in this sequence:
1. Sample the data set (30% for this report)
2. Remove all punctuation other than apostrophes and periods
3. Remove numbers
4. Convert text into a Corpus
5. Convert to lower-case
6. Remove extra whitespace
7. Expand Contractions (i'm -> i am), then remove the remaining punctuation
8. Remove Profanity (George Carlin's 7 Dirty Words)
9. Convert the Corpus to Plaintext
10. Tokenize to generate 1, 2, and 3-grams
The code to achieve these steps is below. To keep the document a bit shorter, the code for the punct_cleaner, nonword_cleaner, fix_contractions, tdm_gen, and freqs_from_tdm functions can be found in the appendix at the end of this document. The filthy_words object used in the profanity step is simply a character vector holding that list of seven words.
# Take a sample of the data set
{
#I'll set a different seed for each source so any of the three samples can be reproduced later. createDataPartition() comes from the caret package.
set.seed(1234)
blog_split.idx <- createDataPartition(1:length(en_blogs), times=1, p=0.30, list=F)
set.seed(2345)
news_split.idx <- createDataPartition(1:length(en_news), times=1, p=0.30, list=F)
set.seed(3456)
twitter_split.idx <- createDataPartition(1:length(en_twitter), times=1, p=0.30, list=F)
#Combine the three samples into one character vector
test_set <- append(en_blogs[blog_split.idx], en_news[news_split.idx])
test_set <- append(test_set, en_twitter[twitter_split.idx])
}
# Now let's do some formatting of this set.
{
test_set.cleaned <- lapply(test_set, punct_cleaner)
test_set.cleaned <- lapply(test_set.cleaned, nonword_cleaner)
test_set.cleaned <- unlist(test_set.cleaned)
test_set.corp <- Corpus(VectorSource(test_set.cleaned))
test_set.corp <- tm_map(test_set.corp, tolower)
test_set.corp <- tm_map(test_set.corp, stripWhitespace)
test_set.corp <- tm_map(test_set.corp, fix_contractions)
test_set.corp <- tm_map(test_set.corp, removePunctuation)
test_set.corp <- tm_map(test_set.corp, removeWords, filthy_words)
test_set.corp <- tm_map(test_set.corp, PlainTextDocument)
}
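One caveat on the chunk above: in tm version 0.6 and later, mapping plain functions such as tolower or fix_contractions over a corpus can break the corpus structure, and the usual fix is to wrap them in content_transformer(). A sketch of the equivalent calls:
# Equivalent calls for tm >= 0.6: content_transformer() keeps the result a valid corpus.
test_set.corp <- tm_map(test_set.corp, content_transformer(tolower))
test_set.corp <- tm_map(test_set.corp, content_transformer(fix_contractions))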
# First we generate TDMs
unigram_tdm <- tdm_gen(n=1, corp=test_set.corp)
bigram_tdm <- tdm_gen(n=2, corp=test_set.corp)
trigram_tdm <- tdm_gen(n=3, corp=test_set.corp)
# Now we generate term frequencies
testset_unigram_termFreqs <- freqs_from_tdm(tdm=unigram_tdm)
testset_bigram_termFreqs <- freqs_from_tdm(tdm=bigram_tdm)
testset_trigram_termFreqs <- freqs_from_tdm(tdm=trigram_tdm)
Here are the top 10 most common N-grams from our sample, obtained by joining the top 10 rows of each sorted term-frequency table (a sketch of that step follows the table):
| Unigram | Unigram Count | Bigram | Bigram Count | Trigram | Trigram Count |
|---|---|---|---|---|---|
| the | 1430671 | of the | 130149 | i do not | 13936 |
| to | 824605 | in the | 123631 | one of the | 10446 |
| and | 721208 | i am | 77304 | a lot of | 8957 |
| of | 602528 | to the | 63913 | thanks for the | 7084 |
| in | 493774 | for the | 60173 | i am not | 6893 |
| that | 332372 | on the | 59191 | i ca not | 6284 |
| for | 329676 | to be | 48543 | to be a | 5465 |
| it | 328777 | do not | 43097 | i have been | 5322 |
| is | 328647 | at the | 42915 | going to be | 5211 |
| you | 302574 | i have | 41958 | i did not | 4980 |
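A minimal sketch of how a table like this can be assembled; the top10 object and the column relabeling are mine:
# Bind the first ten rows of each frequency table (already sorted by count)
# side by side, then relabel the columns for display.
top10 <- cbind(head(testset_unigram_termFreqs[, c("term", "count")], 10),
               head(testset_bigram_termFreqs[, c("term", "count")], 10),
               head(testset_trigram_termFreqs[, c("term", "count")], 10))
names(top10) <- c("Unigram", "Unigram Count", "Bigram", "Bigram Count",
                  "Trigram", "Trigram Count")
kable(top10, caption="Top 10 Most Frequent N-Grams")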
Many of the most frequent unigrams are stopwords. This is to be expected, and is the reason why stopwords are frequently removed in topic modeling and sentiment analysis. However, I will be performing text prediction with this data set, and so I purposely chose to leave stopwords in. They’ll need to be present if I want to perform maximum likelihood estimation on ngrams, or even have the ability to predict a stopword.
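To make the maximum likelihood point concrete: the MLE of P(next word | previous word) is just the bigram count divided by the unigram count of the previous word. Here is a small sketch using the frequency tables above; the bigram_mle helper is hypothetical, not part of my pipeline:
# Hypothetical helper: MLE estimate of P(word | prev) from the frequency tables above.
bigram_mle <- function(prev, word) {
  bigram_count  <- testset_bigram_termFreqs$count[testset_bigram_termFreqs$term == paste(prev, word)]
  unigram_count <- testset_unigram_termFreqs$count[testset_unigram_termFreqs$term == prev]
  if (length(bigram_count) == 0 || length(unigram_count) == 0) return(0)
  bigram_count / unigram_count
}
# Using the counts in the table above: P("the" | "of") = 130149 / 602528, about 0.22.
bigram_mle("of", "the")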
The distribution of the number of times each N-gram is observed in the data set is shown in the plots below. The histogram is split into two X-axis ranges so that we can see how quickly the number of N-grams falls off as the observation count grows. On the left are unigrams observed between 1 and 25 times, and on the right unigrams observed between 25 and 500 times, across all of the roughly 1.2 million documents in our Corpus sample. Such low counts indicate that most words are seen infrequently, and are probably not very useful for text prediction.
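The histogram code was left out of the appendix, so here is a rough sketch of how the two panels could be produced; the binwidths and the gridExtra layout are my choices, not the original plotting code:
# Sketch of the two unigram-count histograms: counts of 1-25 on the left,
# 25-500 on the right. grid.arrange() comes from the gridExtra package.
low_counts  <- subset(testset_unigram_termFreqs, count >= 1  & count <= 25)
high_counts <- subset(testset_unigram_termFreqs, count >= 25 & count <= 500)
p1 <- qplot(count, data = low_counts,  binwidth = 1,
            xlab = "Times Unigram Was Observed (1-25)",   ylab = "Number of Unigrams")
p2 <- qplot(count, data = high_counts, binwidth = 25,
            xlab = "Times Unigram Was Observed (25-500)", ylab = "Number of Unigrams")
gridExtra::grid.arrange(p1, p2, ncol = 2)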
The plot below shows the cumulative proportion of all N-grams versus the number of times they’ve been seen. Nearly 60% of unigrams are seen only once, and roughly 95% of unigrams have been seen 500 times or fewer. Nearly all bigrams and trigrams are seen fewer than 100 times in our sample of 1.2 million documents!
The final piece of analysis I wanted to discuss is how we can shrink the vocabulary. Specifically, we will look at the smallest set of words that accounts for 75% and 90% of all words in our sample of the Corpus. To accomplish this we need to calculate the cumulative frequencies and coverage of all N-grams, ranked from most to least frequently observed.
# Calculate the cumulative frequencies of each term from most frequent to least frequent.
{
testset_unigram_termFreqs <- testset_unigram_termFreqs %>%
arrange(desc(count)) %>%
mutate(Cumulative = 100*round(cumsum(count)/sum(count),4),
Rank = rank(desc(count), ties.method='first'))
testset_bigram_termFreqs <- testset_bigram_termFreqs %>%
arrange(desc(count)) %>%
mutate(Cumulative = 100*round(cumsum(count)/sum(count),4),
Rank = rank(desc(count), ties.method='first'))
testset_trigram_termFreqs <- testset_trigram_termFreqs %>%
arrange(desc(count)) %>%
mutate(Cumulative = 100*round(cumsum(count)/sum(count),4),
Rank = rank(desc(count), ties.method='first'))
}
| NGRAM | N_UNIQUE_NGRAMS | NGRAMS_75PCT | FRACTION_NGRAMS_75PCT | NGRAMS_90PCT | FRACTION_NGRAMS_90PCT |
|---|---|---|---|---|---|
| Unigram | 408176 | 1582 | 0.00 | 8085 | 0.02 |
| Bigram | 5920319 | 499906 | 0.08 | 3033124 | 0.51 |
| Trigram | 15996978 | 9098987 | 0.57 | 13236954 | 0.83 |
Even though most terms are seen infrequently, less than 1% of the unique words in the vocabulary accounts for nearly 75% of all words observed in the sample. (See the code for the cumulative frequency table in the appendix.) Part of this is because I’ve left stopwords in the analysis, and they occur in almost every sentence. Higher-order N-grams behave very differently, however: covering 90% of all bigram observations requires about half of the unique bigrams in the data set, and covering 90% of trigram observations requires 83% of the unique trigrams. (These figures count whole bigrams and trigrams, not the individual words they are built from.)
The above analyses make it fairly clear that a relatively small number of words make up the bulk of the text in our training Corpus, but most combinations of words are seen infrequently. I am planning to cut the vocabulary down significantly by keeping only the most frequently observed terms and ignoring any term that has been seen just a handful of times. I have built a basic prediction model that takes a string of text and returns predictions for the next word along with each word's likelihood, and I will be building the first pieces of my Shiny app to serve as an interface to the prediction algorithm. Cutting back the vocabulary should improve the responsiveness of the prediction algorithm while maintaining the accuracy of its predictions.
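To give a flavor of the kind of lookup I have in mind (a rough placeholder, not the final model), the sketch below pulls candidate next words from the trigram table, backs off to the bigram table when the two-word context is unseen, and reports a normalized score for each candidate; the predict_next name and structure are mine:
# Placeholder sketch of a next-word lookup with a simple back-off to bigrams.
# Assumes the term columns hold space-separated, lower-cased n-grams as built above.
predict_next <- function(text, n = 5) {
  words   <- tail(strsplit(tolower(text), "\\s+")[[1]], 2)
  context <- paste(words, collapse = " ")
  hits <- subset(testset_trigram_termFreqs, startsWith(term, paste0(context, " ")))
  if (nrow(hits) == 0) {  # back off to the last word only
    hits <- subset(testset_bigram_termFreqs, startsWith(term, paste0(tail(words, 1), " ")))
  }
  hits <- head(hits[order(-hits$count), ], n)
  data.frame(word  = sub(".* ", "", hits$term),
             score = hits$count / sum(hits$count))
}
predict_next("thanks for")  # should rank "the" near the top, per the trigram table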
Thanks for reading and I look forward to your feedback!
Preprocessing Functions to Format Corpus, Generate TDM, and Obtain Frequency Tables from TDMs
# This gets rid of all punctuation other than apostrophes and periods.
punct_cleaner <- function(sent) { gsub("[^[:alnum:][:space:]'.]","",sent)}
# This removes numbers (any run of digits).
nonword_cleaner <- function(sent) { gsub("[0-9]+","",sent)}
# In my processing I realized I wanted to expand contractions, since apostrophes get removed
# in tm's preprocessing. I found this function pre-written by the user MKFS at the following blog:
# http://entrenchant.blogspot.com/2013_06_01_archive.html
# I've made some slight modifications to that function to suit my needs.
fix_contractions <- function(text) {
text <- gsub("won't", "will not", text)
text <- gsub("can't", "can not", text)
text <- gsub("n't", " not", text)
text <- gsub("'ll", " will", text)
text <- gsub("'re", " are", text)
text <- gsub("'ve", " have", text)
text <- gsub("'m", " am", text)
text <- gsub("'s", "", text)
return(text)
}
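As a quick sanity check, here is what these helpers do to a made-up example sentence (the leftover double space is collapsed later by the stripWhitespace step):
# Made-up example run through the cleaning helpers defined above.
ex <- "I can't wait, I'm buying 3 books!"
ex <- punct_cleaner(ex)    # drops the comma and exclamation point, keeps the apostrophes
ex <- nonword_cleaner(ex)  # drops the "3"
ex <- fix_contractions(tolower(ex))
ex
# [1] "i can not wait i am buying  books"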
# This function creates a TDM from the corpus
tdm_gen <- function(n = 1, corp=NULL) {
Tokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = n, max = n))
testset_tdm <- TermDocumentMatrix(x = corp, control = list(
tokenize=Tokenizer,
removePunctuation=TRUE,
removeNumbers=TRUE,
wordLengths= c(2,20)))
return(testset_tdm)
}
# This function returns a full list of term frequencies from the TDM
freqs_from_tdm <- function(tdm = NULL) {
testset_termFreqs <- slam::row_sums(tdm, na.rm = TRUE)
testset_termFreqs <- as.data.frame(testset_termFreqs)
testset_termFreqs$term <- row.names(testset_termFreqs)
row.names(testset_termFreqs) <- NULL
names(testset_termFreqs)[1] <- "count"
testset_termFreqs <- testset_termFreqs %>% arrange(desc(count))
return(testset_termFreqs)
}
Code to Generate Cumulative Frequency Table
uni_75 <- testset_unigram_termFreqs %>% filter(Cumulative >= 75) %>%
filter(Rank == min(Rank))
uni_90 <- testset_unigram_termFreqs %>% filter(Cumulative >= 90) %>%
filter(Rank == min(Rank))
bi_75 <- testset_bigram_termFreqs %>% filter(Cumulative >= 75) %>%
filter(Rank == min(Rank))
bi_90 <- testset_bigram_termFreqs %>% filter(Cumulative >= 90) %>%
filter(Rank == min(Rank))
tri_75 <- testset_trigram_termFreqs %>% filter(Cumulative >= 75) %>%
filter(Rank == min(Rank))
tri_90 <- testset_trigram_termFreqs %>% filter(Cumulative >= 90) %>%
filter(Rank == min(Rank))
# Column names match the summary table shown in the body of the report.
term_summary <- data.frame(NGRAM = c("Unigram","Bigram","Trigram"),
N_UNIQUE_NGRAMS = c(nrow(testset_unigram_termFreqs),
nrow(testset_bigram_termFreqs),
nrow(testset_trigram_termFreqs)),
NGRAMS_75PCT = c(uni_75$Rank, bi_75$Rank, tri_75$Rank),
FRACTION_NGRAMS_75PCT = c(round(uni_75$Rank/nrow(testset_unigram_termFreqs),2),
round(bi_75$Rank/nrow(testset_bigram_termFreqs),2),
round(tri_75$Rank/nrow(testset_trigram_termFreqs),2)),
NGRAMS_90PCT = c(uni_90$Rank, bi_90$Rank, tri_90$Rank),
FRACTION_NGRAMS_90PCT = c(round(uni_90$Rank/nrow(testset_unigram_termFreqs),2),
round(bi_90$Rank/nrow(testset_bigram_termFreqs),2),
round(tri_90$Rank/nrow(testset_trigram_termFreqs),2)))
kable(term_summary)
Code for the N-Gram Distribution Summary
unigram_summary <- testset_unigram_termFreqs %>% arrange(count) %>% filter(count <= 10000) %>%
group_by(count) %>% summarize(TOT_NGRAMS = n()) %>% ungroup() %>%
mutate(PERCENTAGE = cumsum(TOT_NGRAMS)/sum(TOT_NGRAMS), NGRAM="unigram") %>% filter(count < 5000)
bigram_summary <- testset_bigram_termFreqs %>% arrange(count) %>% filter(count <= 10000) %>%
group_by(count) %>% summarize(TOT_NGRAMS = n()) %>% ungroup() %>%
mutate(PERCENTAGE = cumsum(TOT_NGRAMS)/sum(TOT_NGRAMS), NGRAM="bigram") %>% filter(count < 5000)
trigram_summary <- testset_trigram_termFreqs %>% arrange(count) %>% filter(count <= 10000) %>%
group_by(count) %>% summarize(TOT_NGRAMS = n()) %>% ungroup() %>%
mutate(PERCENTAGE = cumsum(TOT_NGRAMS)/sum(TOT_NGRAMS), NGRAM="trigram") %>% filter(count < 5000)
# bind_rows() replaces the now-deprecated dplyr::rbind_all()
ngram_summary <- bind_rows(unigram_summary, bigram_summary, trigram_summary)
ngram_summary$NGRAM <- factor(ngram_summary$NGRAM, levels=c("unigram","bigram","trigram"))
qplot(count, PERCENTAGE, data=ngram_summary, color=NGRAM, geom='line', xlim=c(1,500),
xlab="Times N-Gram Was Observed", ylab="Fraction of All Ngrams")