Introduction

The purpose of the milestone report is to ensure that students have loaded and explored the datasets. In this report, I show some of the initial analysis done on the Twitter, blog, and news datasets. First, I load the libraries and read in the files that will be used.

library(quanteda)
library(stringr)
library(ggplot2)

con <- file('C:/Users/dmartin/Documents/Capstone/final/en_US/en_US.twitter.txt',open="r")
twitdata <- readLines(con, skipNul = TRUE)
close(con)

con <- file('C:/Users/dmartin/Documents/Capstone/final/en_US/en_US.blogs.txt',open="r")
blogdata <- readLines(con, skipNul = TRUE)
close(con)

# opening in binary mode due to the error readLines throws when this file is opened in text mode
con <- file('C:/Users/dmartin/Documents/Capstone/final/en_US/en_US.news.txt',open="rb")
newsdata <- readLines(con, skipNul = TRUE)
close(con)
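
The same open/read/close pattern repeats for each file, so a small helper along these lines could keep the loading step compact. This is a sketch only; it reuses the same paths as above, and the binary-mode workaround for the news file carries over.

# Hypothetical helper: read one corpus file into a character vector of lines.
read_corpus <- function(path, mode = 'r') {
  con <- file(path, open = mode)
  on.exit(close(con))  # make sure the connection is closed even on error
  readLines(con, skipNul = TRUE)
}

base <- 'C:/Users/dmartin/Documents/Capstone/final/en_US'
twitdata <- read_corpus(file.path(base, 'en_US.twitter.txt'))
blogdata <- read_corpus(file.path(base, 'en_US.blogs.txt'))
newsdata <- read_corpus(file.path(base, 'en_US.news.txt'), mode = 'rb')  # binary mode, as above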

Data Summary

Now that the data has been read in, I will summarize the major features of each dataset: line count, word count, and file size (in MB).

datasumm <- data.frame(
  dataset    = c('twitter', 'blogs', 'news'),
  numLines   = c(length(twitdata), length(blogdata), length(newsdata)),
  numWords   = c(sum(str_count(twitdata, boundary('word'))),
                 sum(str_count(blogdata, boundary('word'))),
                 sum(str_count(newsdata, boundary('word')))),
  fileSizeMb = c(object.size(twitdata) / 1000000,
                 object.size(blogdata) / 1000000,
                 object.size(newsdata) / 1000000)
)
datasumm
##   dataset numLines numWords fileSizeMb
## 1 twitter  2360148 30218166   316.0376
## 2   blogs   899288 38154238   260.5643
## 3    news  1010242 35010782   261.7590

Sampling Data

As seen in the summary above, the files are quite large. This makes data exploration difficult due to memory issues and long processing times. To alleviate some of this, I will take a data sample. I will sample 5% of each dataset to explore, then combine the three samples into one for analysis.

set.seed(29482)
twitsamp <- sample(twitdata, length(twitdata) * 0.05, replace = FALSE)
blogsamp <- sample(blogdata, length(blogdata) * 0.05, replace = FALSE)
newssamp <- sample(newsdata, length(newsdata) * 0.05, replace = FALSE)
allsamp <- c(twitsamp, blogsamp, newssamp)

Data Exploration

Now that the data has been reduced to an easier-to-manage size, I'll convert it into a document-feature matrix (dfm), which essentially lists all the words present in the data and counts how many times each appears. I'll also do some cleaning in the process, namely:

* Remove punctuation
* Remove numbers
* Remove symbols
* Use stemming (groups all forms of a root word together)
* Remove stopwords (“a”, “the”, etc.)

sampdfm <- dfm(allsamp, remove = c(stopwords('english'), 'â'),
               remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE,
               remove_twitter = TRUE, stem = TRUE)
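
Note that this dfm() call uses the older quanteda (1.x) interface, where the cleaning options are passed directly to dfm(). In quanteda 3.x these options have moved to tokens(), so a roughly equivalent pipeline would look like the sketch below (a sketch only; stemming and stopword behaviour may differ slightly between versions).

# Rough equivalent for quanteda >= 3.0, where dfm() no longer accepts these options.
toks <- tokens(allsamp, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, c(stopwords('english'), 'â'))
toks <- tokens_wordstem(toks)
sampdfm <- dfm(toks)
# For the bigram and trigram dfms further below, tokens_ngrams(toks, n = 2) or
# tokens_ngrams(toks, n = 3) would be applied before calling dfm().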

With this data, I now want to see what the most frequent unigrams (single words) are. I'll look at the top 15 words (more than that makes the plots crowded) and save the results into a data frame for plotting.

unigram_top_df <- data.frame(unigram = names(topfeatures(sampdfm, 15)),
                             frequency = topfeatures(sampdfm, 15), row.names = NULL)
p <- ggplot(unigram_top_df, aes(reorder(unigram, frequency), frequency)) +
  geom_bar(stat = 'identity', aes(fill = I('maroon'))) +
  labs(x = 'unigrams', y = 'frequency') +
  coord_flip()
p

Next, I want to see what the top bigrams (pairs of words) are in the data. I'll need to build a new dfm, this time specifying ngrams = 2.

samp_bigram_dfm <- dfm(allsamp, remove = c(stopwords('english'), 'â'),
                       remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE,
                       remove_twitter = TRUE, stem = TRUE, ngrams = 2)
bigram_top_df <- data.frame(bigram = names(topfeatures(samp_bigram_dfm, 15)),
                            frequency = topfeatures(samp_bigram_dfm, 15), row.names = NULL)
p <- ggplot(bigram_top_df, aes(reorder(bigram, frequency), frequency)) +
  geom_bar(stat = 'identity', aes(fill = I('darkgreen'))) +
  labs(x = 'bigrams', y = 'frequency') +
  coord_flip()
p

Out of curiosity, I also want to see the most frequent trigrams (groups of 3 words).

samp_trigram_dfm <- dfm(allsamp, remove = c(stopwords('english'), 'â'),
                        remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE,
                        remove_twitter = TRUE, stem = TRUE, ngrams = 3)
trigram_top_df <- data.frame(trigram = names(topfeatures(samp_trigram_dfm, 15)),
                             frequency = topfeatures(samp_trigram_dfm, 15), row.names = NULL)
p <- ggplot(trigram_top_df, aes(reorder(trigram, frequency), frequency)) +
  geom_bar(stat = 'identity', aes(fill = I('navy'))) +
  labs(x = 'trigrams', y = 'frequency') +
  coord_flip()
p

Strategy for Prediction

The dfms get larger as the ngram order increases, so trigrams are probably as high as I will go. My strategy is to use a three-level back-off model. Ideally, the model would use the trigrams to predict the next word. However, a trigram match may not exist for a given input, so I'd like the model to fall back on the bigrams when a trigram prediction can't be found. Likewise, if the word can't be predicted using bigrams, the ultimate fallback would be the most frequent unigram.
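
As a rough illustration of that back-off idea, the sketch below assumes the trigram and bigram frequencies have already been reshaped into lookup tables with a prefix column and a predicted last word; that reshaping is not done in this report, and the function and variable names here are assumptions for illustration only.

# Hypothetical back-off lookup. tri_tab and bi_tab are assumed data frames with
# columns `prefix`, `word`, and `freq`, built from the n-gram dfms above (quanteda
# joins ngram parts with '_'); uni_top is the single most frequent unigram.
# In practice the input phrase would need the same cleaning/stemming as the training text.
predict_next <- function(phrase, tri_tab, bi_tab, uni_top) {
  words <- tail(unlist(strsplit(tolower(phrase), '\\s+')), 2)

  # 1. Try the trigram table using the last two words as the prefix.
  if (length(words) == 2) {
    hit <- tri_tab[tri_tab$prefix == paste(words, collapse = '_'), ]
    if (nrow(hit) > 0) return(hit$word[which.max(hit$freq)])
  }

  # 2. Fall back to the bigram table using the last word.
  hit <- bi_tab[bi_tab$prefix == tail(words, 1), ]
  if (nrow(hit) > 0) return(hit$word[which.max(hit$freq)])

  # 3. Last resort: the most frequent unigram overall.
  uni_top
}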