Below is a basic exploratory analysis of the training data set that will be used to develop a text prediction algorithm similar to the one used by SwiftKey. Also included are observations from exploring the data and goals for the eventual prediction algorithm and Shiny app.
Dataset https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
setwd("C:/Users/Alex/Desktop/en_US") #store files in working directory
library(tm)
library(stringi)
library(ggplot2)
con <- file("en_US.blogs.txt", "r")
blog <- readLines(con, skipNul = TRUE)
close(con)
con <- file("en_US.news.txt", "r")
news <- readLines(con, skipNul = TRUE)
close(con)
con <- file("en_US.twitter.txt", "r")
twit <- readLines(con, skipNul = TRUE)
close(con)
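One thing worth checking: readLines() on a text-mode connection can stop early if a file contains an embedded control character, which would undercount the news file. If that appears to be the case, one possible workaround (not applied above) is to open the connection in binary mode, along these lines:
con <- file("en_US.news.txt", "rb")   # binary mode avoids stopping at an embedded control character
news <- readLines(con, skipNul = TRUE)
close(con)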
The number of words in each of the files:
sum(stri_count_words(blog))
## [1] 38154238
sum(stri_count_words(news))
## [1] 2693898
sum(stri_count_words(twit))
## [1] 30218166
The number of lines in each of the files:
length(blog)
## [1] 899288
length(news)
## [1] 77259
length(twit)
## [1] 2360148
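For reference, these counts can be pulled together with the on-disk file sizes into one summary table. This is just a convenience sketch; it assumes the three files are still in the working directory set above:
summary_df <- data.frame(
    file    = c("blogs", "news", "twitter"),
    size_MB = round(file.size(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")) / 1024^2, 1),
    lines   = c(length(blog), length(news), length(twit)),
    words   = c(sum(stri_count_words(blog)), sum(stri_count_words(news)), sum(stri_count_words(twit)))
)
summary_df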
Here is the process for finding the most frequent groups of 2 and 3 consecutive words (bigrams and trigrams) across all three files combined. The charts below show the top 25 most frequent bigrams and trigrams in the sampled dataset.
The first step is to take a sample of each of the blog, news and twitter text files and then combine them into one corpus.
setwd("C:/Users/Alex/Desktop/en_US")
dir.create("./Sample")
con <- file("en_US.blogs.txt", "r")
blog <- readLines(con, n = round(length(blog) * 0.01))   # keep the first 1% of lines
write(blog, file = "./Sample/blog.txt")
close(con)
con <- file("en_US.news.txt", "r")
news <- readLines(con, n = round(length(news) * 0.01))
write(news, file = "./Sample/news.txt")
close(con)
con <- file("en_US.twitter.txt", "r")
twit <- readLines(con, n = round(length(twit) * 0.01))
write(twit, file = "./Sample/twit.txt")
close(con)
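Note that readLines(con, n) as used above keeps the first 1% of lines of each file rather than a random 1%. For a more representative sample, one alternative (a sketch of the idea, not what was run here) is to draw a random subset of the full line vectors read in the first section, before they are overwritten by the readLines calls above, for example:
set.seed(1234)                                            # reproducible sample
blog_sample <- sample(blog, round(length(blog) * 0.01))   # random 1% of blog lines
write(blog_sample, file = "./Sample/blog.txt")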
docs <- VCorpus(DirSource("./Sample"))
The next step is to clean the data so that the word groupings are as accurate as possible. This means converting all text to lower case and removing special symbols, punctuation, numbers and extra whitespace. The code below also removes English stop words such as in, an and the; the purpose at this stage is simply to explore the more distinctive word groupings.
docs <- tm_map(docs, content_transformer(function(x) iconv(enc2utf8(x), sub = "bytes")))  # re-encode to UTF-8, substituting any bytes that cannot be converted
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)
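To check that the transformations behaved as expected, it can help to glance at a few lines of one of the cleaned documents, for example:
writeLines(head(as.character(docs[[1]]), 3))   # first few cleaned lines of the first document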
# Tokenizer that splits each document into overlapping two-word sequences
# (ngrams() and words() come from the NLP package, loaded with tm)
BigramTokenizer <- function(x) {
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
dtm <- DocumentTermMatrix(docs, control=list(tokenize = BigramTokenizer))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
df <- data.frame(word=names(freq), freq=freq)
ggplot(head(df, 25), aes(reorder(word, -freq), freq)) +
    geom_col(fill = "steelblue4") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = "Bigrams", y = "Frequency")
# Same tokenizer pattern for three-word sequences
TrigramTokenizer <- function(x) {
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
dtm <- DocumentTermMatrix(docs, control=list(tokenize = TrigramTokenizer))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
df <- data.frame(word=names(freq), freq=freq)
ggplot(head(df, 25), aes(reorder(word, -freq), freq)) +
    geom_col(fill = "firebrick4") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = "Trigrams", y = "Frequency")