This Milestone Report forms the Week 2 deliverable for the course.
Due to the large size of the source files, a small sample (2.5% of each) is taken from each of the three document sets.
These samples are then combined into a single corpus for further analysis. This approach should provide a diverse range of text samples for the predictions.
library(quanteda)

len.prop <- 0.025 # Length proportion - i.e. 2.5% sample

# Randomly sample the data; the seed makes the sample reproducible
set.seed(1234)
corp_blogs  <- corpus(blogs[sample(length(blogs), length(blogs) * len.prop)])
corp_news   <- corpus(news[sample(length(news), length(news) * len.prop)])
corp_tweets <- corpus(tweets[sample(length(tweets), length(tweets) * len.prop)])
# Summarise the number of sampled documents from each source
corp_summary <- data.frame(source = c("Blogs", "News", "Tweets"),
                           documents = c(ndoc(corp_blogs), ndoc(corp_news),
                                         ndoc(corp_tweets)))
corp_summary
##   source documents
## 1  Blogs     22493
## 2   News     25267
## 3 Tweets     59014
The three corpora are then combined for easier comparison, and n-grams are calculated for n = 1, 2, 3, and 4.
# Combine the three corpora into one
multi_corp <- corp_blogs + corp_news + corp_tweets

# Document-feature matrices for unigrams through quadgrams (quanteda v1 syntax)
dfmat_multi.1 <- dfm(multi_corp, ngrams = 1, remove_numbers = TRUE, remove_punct = TRUE)
dfmat_multi.2 <- dfm(multi_corp, ngrams = 2, remove_numbers = TRUE, remove_punct = TRUE)
dfmat_multi.3 <- dfm(multi_corp, ngrams = 3, remove_numbers = TRUE, remove_punct = TRUE)
dfmat_multi.4 <- dfm(multi_corp, ngrams = 4, remove_numbers = TRUE, remove_punct = TRUE)
N-grams are an important part of word prediction, since they provide more context than predicting from the single previous word alone. There is a trade-off between n-gram length and processing speed; the quadgram (4-gram) seems a reasonable upper limit to work with.
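To make this concrete, the decomposition of a sentence into n-grams looks like the following (a small sketch using quanteda's tokens_ngrams(); the example sentence is invented):

toks <- tokens("thanks for the follow", remove_punct = TRUE)
tokens_ngrams(toks, n = 2) # bigrams: "thanks_for" "for_the" "the_follow"
tokens_ngrams(toks, n = 3) # trigrams: "thanks_for_the" "for_the_follow"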
There are a total of 97,438 unique words in the combined unigram corpus.
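This count can be read directly off the unigram document-feature matrix, e.g. with quanteda's nfeat():

nfeat(dfmat_multi.1) # number of unique word features
## [1] 97438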
The top 5 most frequent word features in the unigram corpus are presented below.
topfeatures(dfmat_multi.1, 5)
##    the    to   and     a    of
## 118916 68624 60594 59501 50471
The following plots compare feature frequencies across the four n-gram matrices for the combined data set.
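The plots are produced along these lines (a sketch, assuming ggplot2 is available; the plot_top_features() helper is illustrative rather than the exact code used):

library(ggplot2)

# Bar chart of the n most frequent features in a dfm
plot_top_features <- function(dfmat, n = 20, title = "") {
  freq <- topfeatures(dfmat, n)
  dat <- data.frame(feature = reorder(names(freq), freq), frequency = freq)
  ggplot(dat, aes(x = feature, y = frequency)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}

plot_top_features(dfmat_multi.2, title = "Top 20 bigrams")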
For fun, I created a word cloud showing the 100 most frequent words in the combined unigram corpus.
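The cloud can be drawn straight from the unigram dfm; a minimal sketch using quanteda's textplot_wordcloud() (in quanteda 2.x and later this function lives in the quanteda.textplots package):

# Word cloud of the 100 most frequent unigrams
textplot_wordcloud(dfmat_multi.1, max_words = 100)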
I decided to use the combined corpus, as I think this will provide a good cross-section of writing styles. There may be some bias towards the larger Twitter data set, since the sampling is proportional to the size of each original document set.
The final model will retain stopwords: they form part of general language and must be captured for the model to predict words accurately. Stopwords will therefore not be removed.
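For comparison only, removing stopwords would be a one-argument change to the dfm() call; this is shown to illustrate the choice and is not applied here:

# NOT applied: this variant would drop common English stopwords
dfm(multi_corp, ngrams = 1, remove = stopwords("english"),
    remove_numbers = TRUE, remove_punct = TRUE)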