Introduction

This report explores the SwiftKey Twitter/blog/news text corpus. We will look at file size, number of entries, word counts, and word (unigram) frequencies; bigram and trigram frequencies are left for a later milestone. The purpose of this report is to become familiar with the dataset in preparation for an eventual prediction model, which will attempt to predict a user's next word from the words they have already typed.

Exploratory Analysis

The files included in the dataset are as follows:

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Word Count

We can look at some basic summary statistics with the following code:

library(stringi)   # stri_stats_latex() for per-line word counts

# blogs, twitter and news are assumed to be character vectors read in with
# readLines(); allCorp is the three sources combined into a single vector
source_list <- c("blogs", "twitter", "news", "allCorp")

# Create empty data frame to hold the summary statistics
summaryStats <- data.frame(matrix(NA,
                                  nrow = 4,
                                  ncol = 3,
                                  dimnames = list(source_list,
                                                  c("Median", "Max", "Total"))))

# Word count of each line ("Words" is the 4th element returned by stri_stats_latex)
blogs_words   <- unlist(lapply(blogs,   function(x) stri_stats_latex(x)[4]))
twitter_words <- unlist(lapply(twitter, function(x) stri_stats_latex(x)[4]))
news_words    <- unlist(lapply(news,    function(x) stri_stats_latex(x)[4]))
allCorp_words <- unlist(lapply(allCorp, function(x) stri_stats_latex(x)[4]))

# Populate summary stats data frame
for (source in source_list) {
  words <- get(paste0(source, "_words"))
  summaryStats[source, "Median"] <- median(words, na.rm = TRUE)
  summaryStats[source, "Max"]    <- max(words, na.rm = TRUE)
  summaryStats[source, "Total"]  <- sum(words, na.rm = TRUE)
}

summaryStats
##         Median  Max    Total
## blogs       28 6454 37570839
## twitter     12   47 30451170
## news        31  539  2651432
## allCorp     14 6454 70673441

To see these distributions visually, we can create a histogram for each source. Note: some charts have constrained x-axes, because a handful of extreme values would otherwise distort the view.

Blogs

library(ggplot2)   # qplot()

qplot(blogs_words, xlim = c(0, 250), binwidth = 1)
## Warning: Removed 3869 rows containing non-finite values (stat_bin).

Twitter

qplot(twitter_words, xlim = c(0,35), binwidth = 1)
## Warning: Removed 55 rows containing non-finite values (stat_bin).

News

qplot(news_words, xlim = c(0,150), binwidth = 1)
## Warning: Removed 166 rows containing non-finite values (stat_bin).
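
The "Removed ... rows" warnings correspond to the lines that fall outside each constrained x-axis. As a rough sanity check, we can count how many lines exceed each cutoff (the counts should roughly match the warnings):

# Lines longer than each plot's x-axis limit
sum(blogs_words   > 250, na.rm = TRUE)
sum(twitter_words > 35,  na.rm = TRUE)
sum(news_words    > 150, na.rm = TRUE)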

Word Frequency

Finally, let’s take a look at word frequency.

library(tm)   # stopwords()

# Split the combined corpus into lowercase words and drop very short tokens
allCorp <- unlist(strsplit(allCorp, split = "\\W+"))
allCorp <- tolower(allCorp)
allCorp <- allCorp[nchar(allCorp) > 2]

# Frequency table of the 20 most common words, excluding English stopwords
corpFreq <- table(allCorp[!allCorp %in% stopwords("en")])
topWords <- as.data.frame(corpFreq[order(corpFreq, decreasing = TRUE)][1:20])
names(topWords)[1] <- "Word"

ggplot(topWords, aes(x = Word, y = Freq)) + geom_bar(stat = "identity")
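
One small refinement, not shown above: ggplot orders the bars alphabetically by default, so sorting them by frequency makes the ranking easier to read.

# Same plot, with bars ordered by descending frequency
ggplot(topWords, aes(x = reorder(Word, -Freq), y = Freq)) +
  geom_bar(stat = "identity") +
  labs(x = "Word")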

Next Step

Moving forward, in order to build an accurate prediction algorithm, we will need to tokenize the corpus into 2-gram and 3-gram word tuples; a rough sketch of one approach follows.
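
A minimal base R sketch of that tokenization, assuming allCorp is the cleaned word vector built in the word-frequency step above (the make_ngrams helper is hypothetical, not part of the analysis so far):

# Build n-grams by pasting together each run of n consecutive words
make_ngrams <- function(words, n) {
  len <- length(words)
  if (len < n) return(character(0))
  idx  <- seq_len(len - n + 1)
  cols <- lapply(seq_len(n) - 1L, function(k) words[idx + k])
  do.call(paste, cols)   # paste()'s default separator is a single space
}

bigrams  <- make_ngrams(allCorp, 2)
trigrams <- make_ngrams(allCorp, 3)

# Most frequent 2-grams and 3-grams
head(sort(table(bigrams),  decreasing = TRUE), 10)
head(sort(table(trigrams), decreasing = TRUE), 10)

Note that this simple approach ignores line and sentence boundaries, so some n-grams will span unrelated entries; a fuller tokenizer would need to handle that.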