This report explores the SwiftKey blog/news/Twitter text corpus. We will look at file sizes, numbers of entries, the frequencies of unigrams, bigrams, and trigrams, and other summary statistics. The purpose is to become familiar with the dataset in preparation for an eventual prediction model, which will attempt to predict a user's next word from the words they have typed so far.
The files included in the dataset are as follows:
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
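The code in this report assumes these three files have already been read into the character vectors blogs, news, and twitter, and combined into allCorp. A minimal sketch of that loading step, assuming the files sit in a local final/en_US/ directory (a hypothetical path, not shown in the original), together with file sizes and line counts, might look like this:
# Assumed location of the unzipped SwiftKey files (hypothetical path)
data_dir <- "final/en_US"

blogs   <- readLines(file.path(data_dir, "en_US.blogs.txt"),   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(file.path(data_dir, "en_US.news.txt"),    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(file.path(data_dir, "en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
allCorp <- c(blogs, news, twitter)   # combined corpus used throughout the report

# File sizes (MB) and number of entries (lines) per source
data.frame(
  size_MB = round(file.size(file.path(data_dir,
              c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))) / 1024^2, 1),
  lines   = c(length(blogs), length(news), length(twitter)),
  row.names = c("blogs", "news", "twitter"))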
We can look at some basic summary statistics with the following code:
library(stringi)   # stri_stats_latex() for per-line word counts

source_list <- c("blogs", "twitter", "news", "allCorp")

# Create an empty data frame to hold the summary statistics
summaryStats <- data.frame(matrix(NA,
                                  nrow = 4,
                                  ncol = 3,
                                  dimnames = list(source_list,
                                                  c("Median", "Max", "Total"))))

# Words per line for each source ("Words" is element 4 of stri_stats_latex())
blogs_words   <- unlist(lapply(blogs,   function(x) stri_stats_latex(x)[4]))
twitter_words <- unlist(lapply(twitter, function(x) stri_stats_latex(x)[4]))
news_words    <- unlist(lapply(news,    function(x) stri_stats_latex(x)[4]))
allCorp_words <- unlist(lapply(allCorp, function(x) stri_stats_latex(x)[4]))

# Populate the summary statistics data frame
for (source in source_list) {
  words <- get(paste0(source, "_words"))
  summaryStats[source, "Median"] <- median(words, na.rm = TRUE)
  summaryStats[source, "Max"]    <- max(words, na.rm = TRUE)
  summaryStats[source, "Total"]  <- sum(words, na.rm = TRUE)
}
summaryStats
##         Median  Max    Total
## blogs       28 6454 37570839
## twitter     12   47 30451170
## news        31  539  2651432
## allCorp     14 6454 70673441
To see these distributions visually, we can create a histogram for each source. Note: some charts have constrained x-axes, because a small number of extreme values would otherwise distort the view.
library(ggplot2)   # qplot()

qplot(blogs_words, xlim = c(0, 250), binwidth = 1)
## Warning: Removed 3869 rows containing non-finite values (stat_bin).
qplot(twitter_words, xlim = c(0,35), binwidth = 1)
## Warning: Removed 55 rows containing non-finite values (stat_bin).
qplot(news_words, xlim = c(0,150), binwidth = 1)
## Warning: Removed 166 rows containing non-finite values (stat_bin).
Finally, let’s take a look at word frequency.
library(tm)   # provides stopwords("en")

# Split the combined corpus on non-word characters, lower-case the tokens,
# and drop very short tokens (one or two characters)
allCorp <- unlist(strsplit(allCorp, split = "\\W+"))
allCorp <- tolower(allCorp)
allCorp <- allCorp[nchar(allCorp) > 2]

# Frequency table of tokens, excluding English stop words
corpFreq <- table(allCorp[!allCorp %in% stopwords("en")])
topWords <- as.data.frame(corpFreq[order(corpFreq, decreasing = TRUE)][1:20])
names(topWords)[1] <- "Word"

ggplot(topWords, aes(x = Word, y = Freq)) + geom_bar(stat = "identity")
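By default, ggplot2 orders the bars alphabetically by the levels of Word. If a frequency-sorted chart is preferred, one option (not used above) is to reorder the factor on the fly:
# Optional variant: order bars by descending frequency instead of alphabetically
ggplot(topWords, aes(x = reorder(Word, -Freq), y = Freq)) +
  geom_bar(stat = "identity") +
  xlab("Word")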
Moving forward, in order to build an accurate prediction algorithm, we will need to tokenize the corpus into 2-gram and 3-gram word tuples (bigrams and trigrams); a rough sketch of that step follows below.
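As a preview, the following base-R sketch assumes the cleaned allCorp token vector from above. It ignores sentence and line boundaries as well as the earlier removal of very short words, so it is illustrative only; a production tokenizer (for example from the quanteda or tm packages) would handle boundaries and punctuation more carefully.
# Work on a sample of tokens to keep the frequency tables manageable (illustrative only)
toks <- head(allCorp, 1e6)
n    <- length(toks)

# Build n-grams by pasting adjacent tokens together
bigrams  <- paste(toks[-n], toks[-1])
trigrams <- paste(toks[1:(n - 2)], toks[2:(n - 1)], toks[3:n])

# Most frequent bigrams and trigrams in the sample
head(sort(table(bigrams),  decreasing = TRUE), 10)
head(sort(table(trigrams), decreasing = TRUE), 10)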