Introduction

The goal of this paper is to demonstrate that I am on track to create my prediction algorithm for the Johns Hopkins University Data Science Capstone Project with SwiftKey. The assignment is to build a text prediction algorithm on the provided dataset, similar to the way a text message application recommends words as you type. It aims to solidify knowledge of tidying and modeling data in R, along with the basics of natural language processing.

Getting set up

There are a few components to the dataset. These instructions assume you have already downloaded it and set your working directory to where you saved it. The link is here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Because I am an English speaker and expect my audience to be in the US, I will use the three text files starting with “en_US.” Each file contains text from a different source: blog posts (“blogs”), news articles (“news”), and tweets (“twitter”). You will then need to load the necessary libraries and read the data into your R session.
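
If you have not downloaded the data yet, something along these lines should fetch and unpack it. This is only a sketch, and it assumes the zip expands into the “final” directory that the file paths below use.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("./final")) {
        download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
        unzip("Coursera-SwiftKey.zip")
}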

set.seed(37)

library(tidyverse) ## because we love it
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.3
## -- Conflicts ---------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(stringi) ## for getting word counts
library(NLP)
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(SnowballC) ## for stemming later on
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.4.3
library(tm) ## for building a corpus
## Warning: package 'tm' was built under R version 3.4.3
blogs <- readLines(con <- file("./final/en_US/en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

news <- readLines(con <- file("./final/en_US/en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con <- file("./final/en_US/en_US.news.txt"), encoding
## = "UTF-8", : incomplete final line found on './final/en_US/en_US.news.txt'
close(con)

twitter <- readLines(con <- file("./final/en_US/en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

Summary statistics

Next we need to get an idea of what size dataset we’re working with. From the Environment pane we can see that the loaded objects take roughly 249, 19, and 301 MB respectively. (The news object is far smaller in memory than its roughly 196 MB file on disk, which is likely related to the incomplete-final-line warning above: readLines appears to have stopped before the end of the file.) Taking the file size on disk, the line count, and the word count of each file will give us a better sense of how to slice and dice it.
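
The table below can be reproduced along these lines with file.size(), length(), and stri_count_words(); this is a minimal sketch, and the object names simply mirror the row and column labels in the output.

sizes <- file.size("./final/en_US/en_US.blogs.txt",
                   "./final/en_US/en_US.news.txt",
                   "./final/en_US/en_US.twitter.txt") / 1024^2  ## file sizes in MB
lens <- c(length(blogs), length(news), length(twitter))         ## line counts
wordct <- c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))                     ## word counts
data.frame(sizes, lens, wordct,
           row.names = c("size_blogs", "size_news", "size_twitter"))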

##                 sizes    lens   wordct
## size_blogs   200.4242  899288 37546246
## size_news    196.2775   77259  2674536
## size_twitter 159.3641 2360148 30093410

Now that we have a handle on the scope of the problem, we need to tidy up the data and take a sample.

## take a random tenth of each text file
sample_blogs <- sample(blogs, size = round(0.1 * length(blogs)))
sample_news <- sample(news, size = round(0.1 * length(news)))
sample_twitter <- sample(twitter, size = round(0.1 * length(twitter))) ## adjust the 0.1 to change the sample size

## iconv converts character vectors to whatever encoding you provide; here we keep only plain ASCII characters and drop everything else.

sample_blogs <- iconv(sample_blogs, "UTF-8", "ASCII", sub = "")
sample_news <- iconv(sample_news, "UTF-8", "ASCII", sub = "")
sample_twitter <- iconv(sample_twitter, "UTF-8", "ASCII", sub = "")

## drop any elements that came back NA from the conversion
sample_blogs <- sample_blogs[!is.na(sample_blogs)]
sample_news <- sample_news[!is.na(sample_news)]
sample_twitter <- sample_twitter[!is.na(sample_twitter)]

## pool the three sources and draw 10,000 lines for the corpus
sample <- sample(c(sample_blogs, sample_news, sample_twitter), size = 10000, replace = TRUE)

Here we create the corpus, or body of natural language. A corpus can be a paragraph or simply a random collection of words or phrases. To get a better picture of our corpus, however, we first have to cleanse the data.

## Be careful in this section not to pass a character vector through the "tm" functions or else they will throw an error.

corpus <- VCorpus(VectorSource(sample))

options(mc.cores = 1) ## keep tm single-threaded; the RWeka tokenizers can misbehave when run in parallel

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm_map(corpus, content_transformer(stripWhitespace))    ## collapse repeated whitespace
corpus <- tm_map(corpus, content_transformer(removeNumbers))
corpus <- tm_map(corpus, content_transformer(stemDocument))   ## reduce each word to its stem
corpus <- tm_map(corpus, content_transformer(removeWords), stopwords(kind="SMART")) ## remove common English stop words (the SMART list)
corpus <- tm_map(corpus, content_transformer(PlainTextDocument)) ## convert to txt doc because of corpora mappings

saveRDS(corpus, file = "./finalCorpus.RData") #for reference

Tokenization and UniGram Frequency

Now that our data is cleansed and prepared for analysis, we have to perform a few steps. Tokenization is a kind of abstraction of natural language: a token can be a word, phrase, or chunk of words that carries meaning. Tokens are like strings in that they are character vectors, but they differ in that they have a confined structure and meaning. Below, we take the unigrams, or single words, and plot the most frequent ones in a bar chart. That gives us a clear image of the most common tokens in this dataset.

#Tokenize Data

data <- TermDocumentMatrix(corpus)

uniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

data <- removeSparseTerms(data,.9998)

freq <- sort(rowSums(as.matrix(data)), decreasing=TRUE) ## terms are rows in a TermDocumentMatrix

triGramMatrix <- as.matrix(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = triGramTokenizer)), .9998))
# use rm(list = ls()) and rerun if you get an error; you may also need to lower the .9998 sparsity threshold.
biGramMatrix <- as.matrix(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = biGramTokenizer)), .9998))
uniGramMatrix <- as.matrix(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = uniGramTokenizer)),.9998))

trifreqterm <- as.data.frame(rowSums(triGramMatrix))
bifreqterm <- as.data.frame(rowSums(biGramMatrix))
unifreqterm <- as.data.frame(rowSums(uniGramMatrix))

trifreqterm$trigram <- row.names(trifreqterm)
bifreqterm$bigram <- row.names(bifreqterm)
unifreqterm$unigram <- row.names(unifreqterm)

tritermtbl <- as.tibble(data.frame(trifreqterm[,2],trifreqterm[,1]))
bitermtbl <- as.tibble(data.frame(bifreqterm[,2],bifreqterm[,1]))
unitermtbl <- as.tibble(data.frame(unifreqterm[,2],unifreqterm[,1]))

names(unitermtbl) <- c("unigram","count")
names(bitermtbl) <- c("bigram","count")
names(tritermtbl) <- c("trigram","count")

uniplot <- subset(unitermtbl, count>2000)

x <- head(unitermtbl)
print(x)
## # A tibble: 6 x 2
##   unigram count
##   <fct>   <dbl>
## 1 aarhus   6.00
## 2 aaron    9.00
## 3 aback    6.00
## 4 abandon 48.0 
## 5 abbey   18.0 
## 6 abbi    15.0
uniplotR <- ggplot(uniplot, aes(reorder(unigram, -count), count))
uniplotR <- uniplotR + geom_bar(stat = "identity") + 
        xlab("unigram") + 
        ylab("frequency") + 
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
        labs(title = "Unigram")
uniplotR

BiGram Frequency

We do the same thing for two-word phrases.

#Plot BiGram Frequency
biplot <- transform(bitermtbl,bigram=reorder(bigram, count))

x <- head(bitermtbl)
print(x)
## # A tibble: 6 x 2
##   bigram         count
##   <fct>          <dbl>
## 1 aarhus denmark  6.00
## 2 abil lead       6.00
## 3 abl accomplish  6.00
## 4 abl advertis    6.00
## 5 abl afford     12.0 
## 6 abl back        6.00

biplotR <- ggplot(subset(biplot, count>100), aes(reorder(bigram, -count), count))
biplotR <- biplotR + geom_bar(stat = "identity") + 
        xlab("bigram") + 
        ylab("frequency") + 
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
        labs(title = "Bigram")
biplotR

TriGram Frequency

Finally we set up and plot the three-word phrase element of the model.

#Plot TriGram Frequency
triplot <- transform(tritermtbl,trigram=reorder(trigram, -count))

x <- head(triplot)
print(x)
##                 trigram count
## 1  aarhus denmark space     6
## 2      abil lead reader     6
## 3      abl advertis ani     6
## 4         abl put brand     6
## 5        abov abov abov     2
## 6 absolut awesom inspir     6
triplotR <- ggplot(subset(triplot, count>18), aes(reorder(trigram, -count), count))
triplotR <- triplotR + geom_bar(stat = "identity") + 
        xlab("trigram") + 
        ylab("frequency") + 
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
        labs(title = "Trigram")
triplotR

Going forward

Going forward, my aim is to build on this analysis with a predictive model and a Shiny app that recommends the next word. Uni-, bi-, and trigram words and phrases will be ranked by frequency. Other variables, such as grammatical factors, may also be incorporated into the predictive model. I intend to make the Shiny app as simple as possible, with the look and feel of Google.com’s plain text box. Either way, the goal is to make the recommendations feel as natural as possible.
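
As a rough illustration of the frequency-ranked lookup I have in mind, the sketch below backs off from trigrams to bigrams to the most frequent unigrams using the tables built above. The function name and its internals are placeholders rather than the final model, and because the corpus was stemmed, the predictions here are stems; the real model will likely be built from a less aggressively cleansed corpus.

predictNext <- function(phrase, n = 3) {
        words <- unlist(strsplit(tolower(phrase), "\\s+"))
        last2 <- paste(tail(words, 2), collapse = " ")
        last1 <- tail(words, 1)
        ## trigrams whose first two words match the end of the phrase
        hits <- tritermtbl %>%
                mutate(trigram = as.character(trigram)) %>%
                filter(startsWith(trigram, paste0(last2, " "))) %>%
                arrange(desc(count)) %>%
                mutate(prediction = word(trigram, 3))
        ## back off to bigrams, then to the most frequent unigrams
        if (nrow(hits) == 0) {
                hits <- bitermtbl %>%
                        mutate(bigram = as.character(bigram)) %>%
                        filter(startsWith(bigram, paste0(last1, " "))) %>%
                        arrange(desc(count)) %>%
                        mutate(prediction = word(bigram, 2))
        }
        if (nrow(hits) == 0) {
                hits <- unitermtbl %>%
                        arrange(desc(count)) %>%
                        mutate(prediction = as.character(unigram))
        }
        head(hits$prediction, n)
}

predictNext("i was abl")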