Introduction

This milestone report provides an exploratory analysis of the training data and describes the goals of the text prediction app, including summary statistics from the dataset.

For this Capstone Project, I’ll build a Shiny app that suggests words based on text entered by the user. The app will use a probabilistic model, based on Markov chains and trigrams, trained on a corpus of text from blogs, news sites, and tweets.

library(tm)
library(wordcloud)

setwd("/home/antonio/Downloads/Coursera-Swiftkey/en_US")
con1 = file("en_US.news.txt", "r")
con2 = file("en_US.blogs.txt", "r")
con3 = file("en_US.twitter.txt", "r")

For convenience, only a subset of the training data (the first 2,500 lines of each file) is used during this exploratory analysis.

newsData <- readLines(con1, n = 2500, encoding = "UTF-8", warn = FALSE)
blogsData <- readLines(con2, n = 2500, encoding = "UTF-8", warn = FALSE)
tweetsData <- readLines(con3, n = 2500, encoding = "UTF-8", warn = FALSE)
close(con1)
close(con2)
close(con3)

Since all three datasets (blogs, news, and tweets) will be used to train the text prediction model, we merge them before exploring.

# merging data
newsData <- paste(newsData, collapse = " ")
blogsData <- paste(blogsData, collapse = " ")
tweetsData <- paste(tweetsData, collapse = " ")
textData <- paste(tweetsData, blogsData, newsData)

# a glimpse of the content before cleaning
str(textData)
##  chr "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long. When you meet som"| __truncated__

Punctuation, whitespace, and numbers will be omitted from the text prediction model. I use the tm package to clean the data.

# setting up source and corpus
review_source_text <- VectorSource(textData)
textCorpus <- Corpus(review_source_text)

#turn to lowercase
textCorpus <- tm_map(textCorpus, content_transformer(tolower))

#remove all punctuation
textCorpus <- tm_map(textCorpus, removePunctuation)

#collapse extra whitespace
textCorpus <- tm_map(textCorpus, stripWhitespace)

#a second version of the corpus has stopwords removed; the original keeps them, as stopwords are often useful while completing typed text
textCorpusNoStopwords <- tm_map(textCorpus, removeWords, stopwords("english"))

#Document term matrix
dtm <- DocumentTermMatrix(textCorpus)
dtm2 <- as.matrix(dtm)

#find the most frequent words (stopwords included)
frequency <- sort(colSums(dtm2), decreasing=TRUE)
frequency[1:10]
##   the   and  that   for  with   you   was  this  have   but 
## 10767  5675  2373  2324  1650  1619  1413  1130  1121  1110
#Document term matrix without stopwords
dtmNoStopwords <- DocumentTermMatrix(textCorpusNoStopwords)
dtm2NoStopwords <- as.matrix(dtmNoStopwords)

#find the most frequent words (stopwords removed)
frequencyNoStopwords <- sort(colSums(dtm2NoStopwords), decreasing=TRUE)
frequencyNoStopwords[1:10]
## said will  one like just  can time  new  get  now 
##  756  638  631  579  578  491  444  415  409  365
words <- names(frequency)
wordsNoStopwords <- names(frequencyNoStopwords)

Trigrams

Trigrams are groups of three adjacent words derived from the training set. The more frequent a trigram is, the more likely its final word is to appear as the suggested completion of a phrase.

library("RWeka")

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigramDtm <- DocumentTermMatrix(Corpus(VectorSource(textData)),
                                 control = list(tokenize = TrigramTokenizer))

trigramDtm2 <- as.matrix(trigramDtm)
#find the most frequent trigrams 
trigramFrequency <- sort(colSums(trigramDtm2), decreasing=TRUE)
trigramFrequency[1:12]
##    a lot of  one of the     i don t     i can t    it s not    i didn t 
##          83          80          79          45          43          42 
##      it s a going to be     i m not     the u s  don t know   i want to 
##          42          40          38          37          36          35
trigrams <- names(trigramFrequency)

Visual Exploration

Word clouds provide a convenient way to explore the training dataset visually: the most common words appear larger.

Trigram Cloud

Trigrams will form the basis of the Markov prediction model.
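The cloud is drawn from the trigram frequencies computed above; a minimal sketch using the wordcloud package loaded earlier (the max.words and scale values are illustrative choices, not necessarily the ones used for the figure) looks like this:

# trigram cloud from the frequency table built earlier
wordcloud(words = trigrams, freq = trigramFrequency,
          max.words = 50, random.order = FALSE, scale = c(3, 0.5))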

Wordcloud With Stopwords

Although stopwords are usually filtered out during text analysis, they are kept in the training set for this model because they occur so frequently: predicting such common words quickly is a desirable feature for a text predictor.
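A sketch of the corresponding call, using the frequency vector that keeps stopwords (the palette and word limit are illustrative; brewer.pal is available because RColorBrewer loads with wordcloud):

# word cloud including stopwords
wordcloud(words = words, freq = frequency,
          max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))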

Wordcloud Without Stopwords

The exploration without stopwords is shown only out of curiosity, since stopwords will remain part of the training data for the Shiny app.
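The stopword-free cloud can be drawn the same way from the filtered frequencies:

# word cloud with stopwords removed
wordcloud(words = wordsNoStopwords, freq = frequencyNoStopwords,
          max.words = 100, random.order = FALSE)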

Next Steps

The next step is to implement a Markov chain for text prediction using bigrams or trigrams. A profanity filter will also be implemented to exclude curse words from the training set.
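To make this concrete, here is a rough sketch (not the final implementation) of how the trigram counts above could drive a next-word lookup; the predictNextWord helper and its fallback behavior are illustrative assumptions:

# split each trigram into a two-word prefix and its final word
trigramParts <- strsplit(trigrams, " ")
prefix <- sapply(trigramParts, function(x) paste(x[1:2], collapse = " "))
lastWord <- sapply(trigramParts, function(x) x[3])

# suggest the most frequent completion seen for a given two-word prefix
predictNextWord <- function(typedPrefix) {
  matches <- which(prefix == tolower(typedPrefix))
  if (length(matches) == 0) return(NA_character_)
  lastWord[matches[which.max(trigramFrequency[matches])]]
}

predictNextWord("one of")  # likely "the", since "one of the" is the most frequent match above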