Synopsis

The goal of this paper is to do exploratory data analysis on the data we’ll use to create a text prediction algorithm. There are three data files (blogs, news, and twitter); each is very large and will need to be partitioned into manageable pieces. We’ll also analyze the ngrams in the files to get an idea of the most common terms. Finally, we’ll discuss the initial strategy for creating the text prediction algorithm.

Load Libraries

library(readr)
library(ngram)
library(textreg)
library(tm)
library(caret)
library(SnowballC)

Data Summary

The three files we will use as the basis of our analysis are the US versions of the blogs, news, and twitter files from our data source. We first want to get an idea of how large these files are, so we’ll start with a simple summary table.

#loads data file onto computer if it isn't already there
if(!file.exists("final")) {
        url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
        download.file(url, "data.zip")
        unzip("data.zip")
}
#read data into R
blogs <- read_lines(file("final/en_US/en_US.blogs.txt"), skip = 0)
news <- read_lines(file("final/en_US/en_US.news.txt"), skip = 0)
twitter <- read_lines(file("final/en_US/en_US.twitter.txt"), skip = 0)


#table summation of the data
sum_table <- data.frame(matrix(nrow = 3, ncol = 4))
colnames(sum_table) <- c("File", "Size", "Lines", "Words")
sum_table$File <- c("Blogs", "News", "Twitter")
sum_table$Size <- c("255.4 MB", "257.3 MB", "319.0 MB")
sum_table$Lines <- c(length(blogs), length(news), length(twitter))
sum_table$Words <- c(wordcount(blogs), wordcount(news), wordcount(twitter))
sum_table
##      File     Size   Lines    Words
## 1   Blogs 255.4 MB  899288 37334131
## 2    News 257.3 MB 1010242 34372530
## 3 Twitter 319.0 MB 2360148 30373543
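
The Size column above is entered by hand. A minimal sketch of computing it from the objects already in memory instead, using base R’s object.size() (this reports in-memory size, which may differ slightly from the hand-entered values):

sum_table$Size <- sapply(list(blogs, news, twitter),
                         function(x) format(object.size(x), units = "MB"))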

As you can see, these files are huge, which makes analysis on the full files impractical, so we will need to partition the data into more usable chunks.

Data Partitioning

To partition the data, I split each of the original files into chunks of 5% each and saved them to my computer. This makes it easy to add more data to the corpus when building the prediction model without having to redo the partitioning. The code below shows how I split each file into 20 chunks, but I would strongly suggest finding a more efficient approach: the blogs and news files each took 8-9 hours to partition, which I deemed acceptable since I had other things to do and a dedicated computer running the R code, while the twitter file took over a day and a half. Since everything is already saved, I won’t rework this step here, but a simpler alternative is sketched after the partitioning functions below.

blogsPartition <- function() {
        library(readr)
        library(caret)
        #splits blogs into manageable chunks
        start <- Sys.time()
        blogs <- read_lines(file("final/en_US/en_US.blogs.txt"), skip = 0)
        set.seed(485)
        folds <- createFolds(blogs, 20)
        for(i in 1:20) {
                partition <- blogs[folds[[i]]]
                name <- paste("split_data/blogs_", i, ".txt", sep = "")
                write_lines(partition, name)
        }
        finish <- Sys.time()
        print(finish - start)
}


newsPartition <- function() {
        library(readr)
        library(caret)
        #splits news into manageable chunks
        start <- Sys.time()
        news <- read_lines(file("final/en_US/en_US.news.txt"), skip = 0)
        set.seed(123)
        folds <- createFolds(news, 20)
        for(i in 1:20) {
                partition <- news[folds[[i]]]
                name <- paste("split_data/news_", i, ".txt", sep = "")
                write_lines(partition, name)
        }
        finish <- Sys.time()
        print(finish - start)
}


twitterPartition <- function() {
        library(readr)
        library(caret)
        #splits twitter into manageable chunks
        start <- Sys.time()
        twitter <- read_lines(file("final/en_US/en_US.twitter.txt"), skip = 0)
        set.seed(987)
        folds <- createFolds(twitter, 20)
        for(i in 1:20) {
                partition <- twitter[folds[[i]]]
                name <- paste("split_data/twitter_", i, ".txt", sep = "")
                write_lines(partition, name)
        }
        finish <- Sys.time()
        print(finish - start)
}
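
A faster alternative (a sketch, not what was actually run for this report) is to assign each line to one of 20 chunks with sample() rather than caret::createFolds(). The helper name fastPartition is introduced here for illustration and assumes the split_data directory already exists.

fastPartition <- function(infile, prefix, seed = 485, k = 20) {
        library(readr)
        #read the full file, then randomly assign each line to one of k chunks
        lines <- read_lines(infile)
        set.seed(seed)
        chunk <- sample(rep_len(1:k, length(lines)))
        for (i in 1:k) {
                write_lines(lines[chunk == i],
                            paste0("split_data/", prefix, "_", i, ".txt"))
        }
}
#example use: fastPartition("final/en_US/en_US.blogs.txt", "blogs")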

Data Processing

For processing the data and making some plots, we will look at only one 5% chunk each of the blogs, news, and twitter files. After combining the files into a corpus, we will strip whitespace, convert everything to lowercase, remove punctuation, and remove numbers. Since the goal of this project is to predict the next word of text, we won’t remove stopwords. I also decided not to stem the words or remove profanity, as I don’t think those steps will help the eventual predictor, but I may have to revisit them later if the predictor isn’t as accurate as I’d like (both steps are sketched after the processing code below).

#data processing
blogs1 <- read_lines(file("split_data/blogs_1.txt"), skip = 0)
news1 <- read_lines(file("split_data/news_1.txt"), skip = 0)
twitter1 <- read_lines(file("split_data/news_1.txt"), skip = 0)
string <- concatenate(blogs1, news1, twitter1)
corpus <- VCorpus(VectorSource(string))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
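
If stemming or profanity filtering turn out to be needed later, they would slot into the same tm pipeline. A sketch, left commented out since it isn’t run here; profanity_words is a hypothetical character vector, not a real lexicon used in this report:

#profanity_words <- c("badword1", "badword2")       #placeholder list of terms to drop
#corpus <- tm_map(corpus, removeWords, profanity_words)
#corpus <- tm_map(corpus, stemDocument)             #stemming via SnowballC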

Plots of Ngrams

Finally, let’s plot some ngrams to see the most popular ones. The first step is to build document term matrices for unigrams, bigrams, trigrams, quadragrams, and pentagrams. After those are built, we’ll pull out the ten most frequent terms of each and plot them.

#ngrams
#unigram document term matrix (default single-word tokenizer)
dtm <- DocumentTermMatrix(corpus)
unigram <- findMostFreqTerms(dtm, n = 10L)
#bigram tokenizer and matrix
dos <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = dos))
bigram <- findMostFreqTerms(dtm2, n = 10L)
#trigram tokenizer and matrix
tres <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = tres))
trigram <- findMostFreqTerms(dtm3, n = 10L)
#quadragram (4-gram) tokenizer and matrix
cuatro <- function(x) unlist(lapply(ngrams(words(x), 4), paste, collapse = " "), use.names = FALSE)
dtm4 <- DocumentTermMatrix(corpus, control = list(tokenize = cuatro))
quadragram <- findMostFreqTerms(dtm4, n = 10L)
#pentagram (5-gram) tokenizer and matrix
cinco <- function(x) unlist(lapply(ngrams(words(x), 5), paste, collapse = " "), use.names = FALSE)
dtm5 <- DocumentTermMatrix(corpus, control = list(tokenize = cinco))
pentagram <- findMostFreqTerms(dtm5, n = 10L)
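
The five blocks above repeat the same pattern, so the tokenizer and matrix construction could be wrapped in a single helper. A sketch; top_ngrams is a name introduced here for illustration and is not used elsewhere in this report:

top_ngrams <- function(corpus, n, top = 10L) {
        #tokenizer that glues together runs of n consecutive words
        tok <- function(x) unlist(lapply(ngrams(words(x), n), paste, collapse = " "),
                                  use.names = FALSE)
        if (n == 1) {
                dtm <- DocumentTermMatrix(corpus)
        } else {
                dtm <- DocumentTermMatrix(corpus, control = list(tokenize = tok))
        }
        findMostFreqTerms(dtm, n = top)
}
#example use: trigram <- top_ngrams(corpus, 3)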

#plot ngrams
unigram <- as.data.frame(unigram)
par(mar=c(4, 4, 2, 1))
barplot(unigram$X1, horiz = TRUE, names.arg = row.names(unigram), las = 1, xlab = "Occurrences",
        main = "Top Ten Unigrams", col = "green")

The top ten unigrams contain a lot of stopwords. We can also see that “the” and “and” far outpace every other word when it comes to occurrences in our corpus.

bigram <- as.data.frame(bigram)
par(mar=c(4, 5, 2, 1))
barplot(bigram$X1, horiz = TRUE, names.arg = row.names(bigram), las = 1, xlab = "Occurrences",
        main = "Top Ten Bigrams", col = "purple")

The bigrams again show the popularity of the word “the”, which appears in eight of the top ten.

trigram <- as.data.frame(trigram)
par(mar=c(4, 6, 2, 1))
barplot(trigram$X1, horiz = TRUE, names.arg = row.names(trigram), las = 1, xlab = "Occurrences",
        main = "Top Ten Trigrams", col = "red")

The trigrams feature many occurrences of the word “of”. You can also see that many of the trigrams are phrases leading into words which occur less often and are more variable.

quadragram <- as.data.frame(quadragram)
par(mar=c(4, 9, 2, 1))
barplot(quadragram$X1, horiz = TRUE, names.arg = row.names(quadragram), las = 1, xlab = "Occurrences",
        main = "Top Ten Quadragrams", col = "yellow")

pentagram <- as.data.frame(pentagram)
par(mar=c(4, 9, 2, 1))
barplot(pentagram$X1, horiz = TRUE, names.arg = row.names(pentagram), las = 1, xlab = "Occurrences",
        main = "Top Ten Pentagrams", col = "blue")

For quadragrams and pentagrams, the number of occurrences falls off sharply compared to the shorter ngrams; only three pentagrams occur more than 100 times in this sample. Quadragrams and pentagrams should be very useful for the prediction algorithm, but because there are relatively few frequent ones, many input phrases simply won’t match any of them.

Prediction Algorithm Plan

My plan for the model is to look first at pentagrams and see whether any of them match the end of the input phrase, which would predict the next word. If there are no matches, I’ll back off one level at a time (quadragrams, then trigrams, then bigrams) and look for a match there. If even the bigrams don’t produce a prediction, the model will give the most common word as its guess (most likely “the”). From there, I’ll work on corpus adjustments to see whether removing unlikely ngrams speeds up the model, and on how much data the corpus needs to balance accuracy and speed.
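
A minimal sketch of that backoff logic, assuming the ngram frequency tables have been converted to data frames with “ngram” and “count” columns and ordered from pentagrams down to bigrams (the names predict_next and ngram_tables are introduced here for illustration):

predict_next <- function(phrase, ngram_tables, default = "the") {
        #lowercase the input phrase and split it into words
        tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
        for (tbl in ngram_tables) {
                n <- length(strsplit(tbl$ngram[1], " ")[[1]])        #order of this table
                if (length(tokens) < n - 1) next
                prefix <- paste(tail(tokens, n - 1), collapse = " ")
                hits <- tbl[startsWith(tbl$ngram, paste0(prefix, " ")), ]
                if (nrow(hits) > 0) {
                        best <- hits$ngram[which.max(hits$count)]
                        return(tail(strsplit(best, " ")[[1]], 1))    #last word of best match
                }
        }
        default        #no match at any level: fall back to the most common word
}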

R Session Info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] SnowballC_0.6.0 caret_6.0-84    ggplot2_3.2.1   lattice_0.20-38
## [5] tm_0.7-6        NLP_0.2-0       textreg_0.1.5   ngram_3.0.4    
## [9] readr_1.3.1    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2         lubridate_1.7.4    class_7.3-15      
##  [4] assertthat_0.2.1   zeallot_0.1.0      digest_0.6.21     
##  [7] ipred_0.9-9        foreach_1.4.7      slam_0.1-46       
## [10] R6_2.4.0           plyr_1.8.4         backports_1.1.5   
## [13] stats4_3.6.1       evaluate_0.14      pillar_1.4.2      
## [16] rlang_0.4.0        lazyeval_0.2.2     data.table_1.12.4 
## [19] rpart_4.1-15       Matrix_1.2-17      rmarkdown_1.16    
## [22] splines_3.6.1      gower_0.2.1        stringr_1.4.0     
## [25] munsell_0.5.0      compiler_3.6.1     xfun_0.10         
## [28] pkgconfig_2.0.3    htmltools_0.4.0    nnet_7.3-12       
## [31] tidyselect_0.2.5   tibble_2.1.3       prodlim_2018.04.18
## [34] codetools_0.2-16   crayon_1.3.4       dplyr_0.8.3       
## [37] withr_2.1.2        ModelMetrics_1.2.2 MASS_7.3-51.4     
## [40] recipes_0.1.7      grid_3.6.1         nlme_3.1-140      
## [43] gtable_0.3.0       magrittr_1.5       scales_1.0.0      
## [46] stringi_1.4.3      reshape2_1.4.3     timeDate_3043.102 
## [49] xml2_1.2.2         generics_0.0.2     vctrs_0.2.0       
## [52] lava_1.6.6         iterators_1.0.12   tools_3.6.1       
## [55] glue_1.3.1         purrr_0.3.3        hms_0.5.1         
## [58] parallel_3.6.1     survival_2.44-1.1  yaml_2.2.0        
## [61] colorspace_1.4-1   knitr_1.25