This Milestone Report forms the Week 2 deliverable for the course.
Due to the large size of the source files, a small sample (2.5% of each) is taken from each of the three document sets.
These samples are then combined into a single corpus for further analysis. This approach should provide a diverse range of text samples for the predictions.
library(quanteda)

len.prop <- 0.025 # Length proportion - i.e. 2.5% sample

# Randomly sample the data; the seed makes the sample reproducible
set.seed(1234)
corp_blogs  <- corpus(blogs[sample(length(blogs), length(blogs) * len.prop)])
corp_news   <- corpus(news[sample(length(news), length(news) * len.prop)])
corp_tweets <- corpus(tweets[sample(length(tweets), length(tweets) * len.prop)])
# Summarise the number of sampled documents from each source
corp_summary <- data.frame(source = c("Blogs", "News", "Tweets"),
                           documents = c(ndoc(corp_blogs), ndoc(corp_news),
                                         ndoc(corp_tweets)))
corp_summary
##   source documents
## 1  Blogs     22493
## 2   News     25267
## 3 Tweets     59014
The three corpora are then combined for easier comparison, and n-grams are calculated for n = 1, 2, 3, and 4.
# Combine the three corpora into one
multi_corp <- corp_blogs + corp_news + corp_tweets

# Document-feature matrices for unigrams through quadgrams (quanteda v1 syntax)
dfmat_multi.1 <- dfm(multi_corp, ngrams = 1, remove_numbers = TRUE, remove_punct = TRUE)
dfmat_multi.2 <- dfm(multi_corp, ngrams = 2, remove_numbers = TRUE, remove_punct = TRUE)
dfmat_multi.3 <- dfm(multi_corp, ngrams = 3, remove_numbers = TRUE, remove_punct = TRUE)
dfmat_multi.4 <- dfm(multi_corp, ngrams = 4, remove_numbers = TRUE, remove_punct = TRUE)
N-grams are an important part of word prediction, since they provide more context than predicting from the single previous word alone. There is a trade-off between n-gram length and processing speed; the quadgram (4-gram) seems a reasonable upper limit to work with.
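To make this concrete, the decomposition of a sentence into n-grams looks like the following (a small sketch using quanteda's tokens_ngrams(); the example sentence is invented):

toks <- tokens("thanks for the follow", remove_punct = TRUE)
tokens_ngrams(toks, n = 2) # bigrams: "thanks_for" "for_the" "the_follow"
tokens_ngrams(toks, n = 3) # trigrams: "thanks_for_the" "for_the_follow"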
There are a total of 97,438 unique words in the combined unigram corpus.
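This count can be read directly off the unigram document-feature matrix, e.g. with quanteda's nfeat():

nfeat(dfmat_multi.1) # number of unique word features
## [1] 97438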
The top 5 most frequent word features in the unigram corpus are presented below.
topfeatures(dfmat_multi.1, 5)
##    the    to   and     a    of
## 118916 68624 60594 59501 50471
The following plots compare feature frequencies across the four n-gram matrices for the combined data set.
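The plots are produced along these lines (a sketch, assuming ggplot2 is available; the plot_top_features() helper is illustrative rather than the exact code used):

library(ggplot2)

# Bar chart of the n most frequent features in a dfm
plot_top_features <- function(dfmat, n = 20, title = "") {
  freq <- topfeatures(dfmat, n)
  dat <- data.frame(feature = reorder(names(freq), freq), frequency = freq)
  ggplot(dat, aes(x = feature, y = frequency)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}

plot_top_features(dfmat_multi.2, title = "Top 20 bigrams")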
For fun, I created a word cloud showing the 100 most frequent words in the combined unigram corpus.
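The cloud can be drawn straight from the unigram dfm; a minimal sketch using quanteda's textplot_wordcloud() (in quanteda 2.x and later this function lives in the quanteda.textplots package):

# Word cloud of the 100 most frequent unigrams
textplot_wordcloud(dfmat_multi.1, max_words = 100)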
I decided to use the combined corpus, as I think this will provide a good cross-section of writing styles. There may be some bias towards the larger Twitter data set, since the sampling is proportional to the size of each original document set.
The final model will retain stopwords: they form part of general language and must be captured for the model to predict words accurately. Stopwords will therefore not be removed.
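For comparison only, removing stopwords would be a one-argument change to the dfm() call; this is shown to illustrate the choice and is not applied here:

# NOT applied: this variant would drop common English stopwords
dfm(multi_corp, ngrams = 1, remove = stopwords("english"),
    remove_numbers = TRUE, remove_punct = TRUE)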