Introduction

The objective of this capstone project is to create an app that accepts a string of text and predicts the next most likely word, based on a dataset from HC Corpora at http://www.corpora.heliohost.org/aboutcorpus.html. This milestone report performs an initial exploratory data analysis of the dataset in order to better understand its characteristics and to devise a prediction strategy moving forward.

Data Preprocessing

The training dataset (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available.

Loading The Dataset

fileURL <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, destfile = "Dataset.zip", method = "curl")
unzip("Dataset.zip")
unlink("Dataset.zip")  # remove the archive once its contents are extracted
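The sampling step below assumes the three English files have been read into character vectors. A minimal sketch, assuming the archive extracts to the usual final/en_US/ layout:

twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

Setting skipNul = TRUE skips embedded NUL characters, which would otherwise interrupt readLines.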

Aggregating Data Sample

To enable faster processing, a random sample of 5,000 lines from each of the three sources was generated and aggregated.

set.seed(1234)  # seed chosen arbitrarily, fixed for reproducibility
sampleTwitter <- sample(twitter, 5000)
sampleNews    <- sample(news, 5000)
sampleBlogs   <- sample(blogs, 5000)
textSample    <- c(sampleTwitter, sampleNews, sampleBlogs)

Exploratory Analysis

Data Summary

The following table provides an overview of the imported data: the size of each file along with its line and word counts.

File Name            File Size (MB)   Line Count   Word Count
Blogs                        200.42       899288     37334147
News                         196.28      1010242     34372530
Twitter                      159.36      2360148     30373603
Aggregated Sample              2.42        15000        15000
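For reference, these figures can be computed along the following lines (a sketch, assuming the full files are loaded as shown earlier; stri_count_words comes from the stringi package):

library(stringi)

# Size in megabytes, line count and word count for the blogs file;
# the news and twitter files follow the same pattern.
file.info("final/en_US/en_US.blogs.txt")$size / 1024^2
length(blogs)
sum(stri_count_words(blogs))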

A word cloud provides a quick visualization of word frequencies. The cloud below is built from the aggregated sample.

library(tm)
library(wordcloud)
library(RColorBrewer)

# finalCorpus is the cleaned sample corpus built in the next section.
tdm <- TermDocumentMatrix(finalCorpus)
m <- as.matrix(tdm)
# Total frequency of each term across the corpus, sorted descending.
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
wordcloud(d$word, d$freq,
          scale = c(5, .3), min.freq = 50,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

Generating Text Corpus

The sample data is cleaned using the tm package: the text is converted to lower case; punctuation, numbers and URLs are removed; stop words and profanity are filtered out; and the remaining words are stemmed. The result is a clean text corpus that simplifies all subsequent processing.

## content_transformer() makes the transformations work with the newer tm API.
cleanSample <- Corpus(VectorSource(textSample))
cleanSample <- tm_map(cleanSample, content_transformer(function(x) iconv(x, to = "UTF-8", sub = "byte")))
cleanSample <- tm_map(cleanSample, content_transformer(tolower))
cleanSample <- tm_map(cleanSample, content_transformer(removePunctuation))
cleanSample <- tm_map(cleanSample, content_transformer(removeNumbers))
# Strip URLs before removing stop words.
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
cleanSample <- tm_map(cleanSample, content_transformer(removeURL))
cleanSample <- tm_map(cleanSample, stripWhitespace)
cleanSample <- tm_map(cleanSample, removeWords, stopwords("english"))
# profanityWords: a character vector of terms to filter, loaded beforehand.
cleanSample <- tm_map(cleanSample, removeWords, profanityWords)
cleanSample <- tm_map(cleanSample, stemDocument)  # requires the SnowballC package
cleanSample <- tm_map(cleanSample, stripWhitespace)

N-gram Tokenization

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. For example, the sentence "thanks for the follow" contains the bigrams "thanks for", "for the" and "the follow".

The following function is used to extract 1-grams (unigrams), 2-grams (bigrams) and 3-grams (trigrams) from the corpus.

library(RWeka)

# Returns the ten most frequent n-grams of the given order.
# theCorpus should be a plain character vector of cleaned text.
ngramTokenizer <- function(theCorpus, ngramCount) {
        ngrams <- NGramTokenizer(theCorpus,
                                 Weka_control(min = ngramCount, max = ngramCount,
                                              delimiters = " \\r\\n\\t.,;:\"()?!"))
        ngrams <- data.frame(table(ngrams))
        ngrams <- ngrams[order(ngrams$Freq, decreasing = TRUE), ][1:10, ]
        colnames(ngrams) <- c("String", "Count")
        ngrams
}
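Applying the tokenizer to the cleaned corpus produces the frequency tables summarized below. A minimal sketch, assuming cleanSample is the tm corpus built above:

# Flatten the corpus into a single character vector for the tokenizer.
corpusText <- unlist(lapply(cleanSample, as.character))

unigrams <- ngramTokenizer(corpusText, 1)
bigrams  <- ngramTokenizer(corpusText, 2)
trigrams <- ngramTokenizer(corpusText, 3)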

Top Occurring Unigrams

Top Occurring Bigrams

Top Occurring Trigrams
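The bar charts under these headings can be reproduced from the frequency tables computed above. A minimal sketch using ggplot2:

library(ggplot2)

# Horizontal bar chart of an n-gram frequency table.
plotNgrams <- function(ngramData, plotTitle) {
        ggplot(ngramData, aes(x = reorder(String, Count), y = Count)) +
                geom_col() +
                coord_flip() +
                labs(title = plotTitle, x = NULL, y = "Frequency")
}

plotNgrams(unigrams, "Top Occurring Unigrams")
plotNgrams(bigrams, "Top Occurring Bigrams")
plotNgrams(trigrams, "Top Occurring Trigrams")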

Prediction Strategies

The next steps in the capstone project are:

  1. Developing the n-gram model to perform next word prediction
  2. Testing and verifying the prediction model
  3. Building the Shiny App
  4. Creating the product pitch slide deck

An n-gram model with frequency tables can be used to predict the next probable word. A straightforward algorithm starts with the trigram model to find the most likely word following the last two words of the input. If no match is found, the algorithm backs off to the bigram model; if there is still no match, it falls back to the unigram model, i.e. the most frequent word regardless of context. The user's text entry may need the same cleaning that was applied to the training data to increase the chance of a match.
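A minimal sketch of this back-off lookup, assuming trigram, bigram and unigram tables in the String/Count format returned by ngramTokenizer, sorted by decreasing Count (a production model would keep all n-grams, not just the top ten):

# Predict the next word by backing off from trigrams to bigrams to unigrams.
predictNextWord <- function(phrase, trigrams, bigrams, unigrams) {
        words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
        lastWord <- function(s) sub(".* ", "", s)
        if (length(words) == 0) return(as.character(unigrams$String[1]))
        if (length(words) == 2) {
                # Match trigrams starting with the last two words.
                hits <- trigrams[grepl(paste0("^", words[1], " ", words[2], " "),
                                       trigrams$String), ]
                if (nrow(hits) > 0) return(lastWord(as.character(hits$String[1])))
        }
        # Back off: match bigrams starting with the last word.
        hits <- bigrams[grepl(paste0("^", tail(words, 1), " "), bigrams$String), ]
        if (nrow(hits) > 0) return(lastWord(as.character(hits$String[1])))
        # Final fallback: the most frequent unigram.
        as.character(unigrams$String[1])
}

Since the tables are already sorted by decreasing Count, the first match is the most frequent continuation.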

The Shiny app will have a simple user interface where the user can enter a string of English text. The prediction model will then echo the text entered by the user and suggest the most likely next word.
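A minimal sketch of such an app, assuming the predictNextWord function and frequency tables from the previous section are available:

library(shiny)

ui <- fluidPage(
        textInput("userText", "Enter some English text:"),
        textOutput("echo"),
        textOutput("prediction")
)

server <- function(input, output) {
        # Echo the raw input back to the user.
        output$echo <- renderText(input$userText)
        # Show the predicted next word once something has been typed.
        output$prediction <- renderText({
                if (nchar(input$userText) == 0) return("")
                predictNextWord(input$userText, trigrams, bigrams, unigrams)
        })
}

shinyApp(ui, server)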