This report explores the SwiftKey blog/news/Twitter text corpus. We will look at file sizes, numbers of entries, the frequencies of unigrams, bigrams, and trigrams, and other summary statistics. The purpose is to become familiar with the dataset in preparation for an eventual prediction model, which will attempt to predict a user's next word from the words they have typed so far.
The files included in the dataset are as follows:
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
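The code in this report assumes these three files have already been read into the character vectors blogs, news, and twitter, and combined into allCorp. A minimal sketch of that loading step, assuming the files sit in a local final/en_US/ directory (a hypothetical path, not shown in the original), together with file sizes and line counts, might look like this:
# Assumed location of the unzipped SwiftKey files (hypothetical path)
data_dir <- "final/en_US"

blogs   <- readLines(file.path(data_dir, "en_US.blogs.txt"),   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(file.path(data_dir, "en_US.news.txt"),    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(file.path(data_dir, "en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
allCorp <- c(blogs, news, twitter)   # combined corpus used throughout the report

# File sizes (MB) and number of entries (lines) per source
data.frame(
  size_MB = round(file.size(file.path(data_dir,
              c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))) / 1024^2, 1),
  lines   = c(length(blogs), length(news), length(twitter)),
  row.names = c("blogs", "news", "twitter"))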
We can look at some basic summary statistics with the following code:
library(stringi)   # stri_stats_latex() for per-line word counts

source_list <- c("blogs", "twitter", "news", "allCorp")

# Create an empty data frame to hold the summary statistics
summaryStats <- data.frame(matrix(NA,
                                  nrow = 4,
                                  ncol = 3,
                                  dimnames = list(source_list,
                                                  c("Median", "Max", "Total"))))

# Words per line for each source ("Words" is element 4 of stri_stats_latex())
blogs_words   <- unlist(lapply(blogs,   function(x) stri_stats_latex(x)[4]))
twitter_words <- unlist(lapply(twitter, function(x) stri_stats_latex(x)[4]))
news_words    <- unlist(lapply(news,    function(x) stri_stats_latex(x)[4]))
allCorp_words <- unlist(lapply(allCorp, function(x) stri_stats_latex(x)[4]))

# Populate the summary statistics data frame
for (source in source_list) {
  words <- get(paste0(source, "_words"))
  summaryStats[source, "Median"] <- median(words, na.rm = TRUE)
  summaryStats[source, "Max"]    <- max(words, na.rm = TRUE)
  summaryStats[source, "Total"]  <- sum(words, na.rm = TRUE)
}
summaryStats
##         Median  Max    Total
## blogs       28 6454 37570839
## twitter     12   47 30451170
## news        31  539  2651432
## allCorp     14 6454 70673441
To see these distributions visually, we can create a histogram for each source. Note: some charts have constrained x-axes, because a small number of extreme values would otherwise distort the view.
library(ggplot2)   # qplot()

qplot(blogs_words, xlim = c(0, 250), binwidth = 1)
## Warning: Removed 3869 rows containing non-finite values (stat_bin).
qplot(twitter_words, xlim = c(0,35), binwidth = 1)
## Warning: Removed 55 rows containing non-finite values (stat_bin).
qplot(news_words, xlim = c(0,150), binwidth = 1)
## Warning: Removed 166 rows containing non-finite values (stat_bin).
Finally, let’s take a look at word frequency.
library(tm)   # provides stopwords("en")

# Split the combined corpus on non-word characters, lower-case the tokens,
# and drop very short tokens (one or two characters)
allCorp <- unlist(strsplit(allCorp, split = "\\W+"))
allCorp <- tolower(allCorp)
allCorp <- allCorp[nchar(allCorp) > 2]

# Frequency table of tokens, excluding English stop words
corpFreq <- table(allCorp[!allCorp %in% stopwords("en")])
topWords <- as.data.frame(corpFreq[order(corpFreq, decreasing = TRUE)][1:20])
names(topWords)[1] <- "Word"

ggplot(topWords, aes(x = Word, y = Freq)) + geom_bar(stat = "identity")
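By default, ggplot2 orders the bars alphabetically by the levels of Word. If a frequency-sorted chart is preferred, one option (not used above) is to reorder the factor on the fly:
# Optional variant: order bars by descending frequency instead of alphabetically
ggplot(topWords, aes(x = reorder(Word, -Freq), y = Freq)) +
  geom_bar(stat = "identity") +
  xlab("Word")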
Moving forward, in order to build an accurate prediction algorithm, we will need to tokenize the corpus into 2-gram and 3-gram word tuples (bigrams and trigrams); a rough sketch of that step follows below.
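As a preview, the following base-R sketch assumes the cleaned allCorp token vector from above. It ignores sentence and line boundaries as well as the earlier removal of very short words, so it is illustrative only; a production tokenizer (for example from the quanteda or tm packages) would handle boundaries and punctuation more carefully.
# Work on a sample of tokens to keep the frequency tables manageable (illustrative only)
toks <- head(allCorp, 1e6)
n    <- length(toks)

# Build n-grams by pasting adjacent tokens together
bigrams  <- paste(toks[-n], toks[-1])
trigrams <- paste(toks[1:(n - 2)], toks[2:(n - 1)], toks[3:n])

# Most frequent bigrams and trigrams in the sample
head(sort(table(bigrams),  decreasing = TRUE), 10)
head(sort(table(trigrams), decreasing = TRUE), 10)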