The goal of the capstone project is to create a predictive text model using a large text corpus of documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.

This document describes the major features identified in the SwiftKey data and briefly summarizes our plans for creating the prediction algorithm and Shiny app.

The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this report is to understand the basic relationships observed in the data and to prepare for building our first linguistic models.

Getting the data

First we need to download and extract the data:

if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
                  destfile = "Coursera-SwiftKey.zip")
}
if (!dir.exists("data")) {
    unzip("Coursera-SwiftKey.zip", exdir = "data")
}

The data sets consist of text in four languages: German, English, Finnish, and Russian. Each language contains data from three sources: news, blogs, and Twitter. In this project we focus only on the English data sets:

blogs <- readLines("data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("data/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- iconv(twitter, to = "UTF-8", sub="")
news <- iconv(news, to = "UTF-8", sub="")
blogs <- iconv(blogs, to = "UTF-8", sub="")

First, let's explore the datasets in terms of file size:

mb <- 1024*1024
blogs.size <- file.info("data/final/en_US/en_US.blogs.txt")$size/mb
news.size <- file.info("data/final/en_US/en_US.news.txt")$size/mb
twitter.size <- file.info("data/final/en_US/en_US.twitter.txt")$size/mb

Then let us examine the datasets in terms of word count:

library(stringi)

blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
library(knitr)
summary <- data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
kable(summary, caption = "Data Summary", col.names = c("Dataset", "File Size (MB)", 
                                                      "Number of Lines", "Number of Words",
                                                      "Mean Number of Words"))
Data Summary

Dataset   File Size (MB)   Number of Lines   Number of Words   Mean Number of Words
blogs     200.4242         899288            37541795          41.74613
news      196.2775         1010242           34762303          34.40988
twitter   159.3641         2360148           30092907          12.75043

The full datasets are large and slow to process, so we work with a small random sample (0.5% of the lines from each source):

set.seed(1234)  # fix the random seed so the sample is reproducible
blogs.sample <- sample(blogs, length(blogs) * 0.5 / 100)
news.sample <- sample(news, length(news) * 0.5 / 100)
twitter.sample <- sample(twitter, length(twitter) * 0.5 / 100)

sample <- c(blogs.sample,news.sample,twitter.sample)

# remove temporary variables
rm(twitter,news,blogs,blogs.sample,news.sample,twitter.sample)

After examining the size of the datasets and getting a feel for the amount of data they contain, we take an empirical approach to cleaning the data.

Cleaning the Data

Using the tm package, the sampled data is used to create a corpus. Subsequently, the following transformations are performed:

  1. Strip extra whitespace;
  2. Transform to lowercase;
  3. Remove punctuation;
  4. Remove numbers;
  5. Convert to plain text documents;
  6. Remove English stopwords.

library(tm)
library(RWeka)
library(ggplot2)

sample.corpus <- Corpus(VectorSource(sample))

sample.corpus <- tm_map(sample.corpus, content_transformer(stripWhitespace))                 
sample.corpus <- tm_map(sample.corpus, content_transformer(tolower))       
sample.corpus <- tm_map(sample.corpus, content_transformer(removePunctuation))           
sample.corpus <- tm_map(sample.corpus, content_transformer(removeNumbers))                 
sample.corpus <- tm_map(sample.corpus, content_transformer(PlainTextDocument))                  
sample.corpus <- tm_map(sample.corpus, removeWords, stopwords("en"))

Exploratory Sample Data Analysis

N-gram analysis

An n-gram is a contiguous sequence of n elements from a text or speech sample. These elements can be words, syllables, or letters, and n-grams are typically collected from a text or speech corpus. An n-gram of size 1 is referred to as a unigram; size 2 is a bigram (or, less commonly, a digram); size 3 is a trigram.
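
As a toy illustration (the sentence below is made up and not part of the corpus), RWeka's NGramTokenizer shows what the bigrams of a short phrase look like:

library(RWeka)

# Hypothetical example sentence, used only to show what bigrams are
example.text <- "the quick brown fox jumps"

# Extract every contiguous two-word sequence (bigram)
NGramTokenizer(example.text, Weka_control(min = 2, max = 2))
# "the quick"  "quick brown"  "brown fox"  "fox jumps"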

The data structure used for this analysis is the term-document matrix (TermDocumentMatrix), a matrix that relates each term (unigram, bigram, or trigram) to the documents in which it appears. The analysis below filters each term-document matrix to select the most frequent terms. We divide the analysis according to the number of words in the n-grams.
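
To make the structure concrete, here is a minimal sketch using a two-document toy corpus (invented purely for illustration): rows of the resulting matrix are terms, columns are documents, and cells hold occurrence counts.

library(tm)

# Two made-up documents, for illustration only
toy.corpus <- Corpus(VectorSource(c("the cat sat", "the cat ran fast")))
toy.tdm <- TermDocumentMatrix(toy.corpus)

# Rows = terms, columns = documents, cells = counts
inspect(toy.tdm)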

Unigram analysis

options(mc.cores = 1)

uniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))

uniGramMatrix <- TermDocumentMatrix(sample.corpus, control = list(tokenize = uniGramTokenizer))

After collecting the unigrams and assembling the term-document matrix, we can rank the terms (unigrams) by their frequency:

freqTerms <- findFreqTerms(uniGramMatrix, lowfreq = 1000)
termFrequency <- rowSums(as.matrix(uniGramMatrix[freqTerms,]))
termFrequency <- data.frame(unigram=names(termFrequency), frequency=termFrequency)

Then we can plot the result as a bar chart, where each bar represents the number of occurrences of the term in the sample:

g <- ggplot(termFrequency, aes(x=reorder(unigram, frequency), y=frequency)) +
 geom_bar(stat = "identity")  +
 theme(legend.title=element_blank()) +
 xlab("Unigram") + ylab("Frequency") +
 labs(title = "Top Unigrams by Frequency")
print(g)

Bigram analysis

The same procedure is repeated for bigrams:

biGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

biGramMatrix <- TermDocumentMatrix(sample.corpus, control = list(tokenize = biGramTokenizer))

freqTerms <- findFreqTerms(biGramMatrix, lowfreq = 50)
termFrequency <- rowSums(as.matrix(biGramMatrix[freqTerms,]))
termFrequency <- data.frame(bigram=names(termFrequency), frequency=termFrequency)

g <- ggplot(termFrequency, aes(x=reorder(bigram, frequency), y=frequency)) +
 geom_bar(stat = "identity")  +
 theme(legend.title=element_blank()) +
 xlab("Bigram") + ylab("Frequency") +
 labs(title = "Top Bigrams by Frequency")
print(g)

Trigram analysis

Finally, the same procedure is repeated for trigrams:

triGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

triGramMatrix <- TermDocumentMatrix(sample.corpus, control = list(tokenize = triGramTokenizer))

freqTerms <- findFreqTerms(triGramMatrix, lowfreq = 10)
termFrequency <- rowSums(as.matrix(triGramMatrix[freqTerms,]))
termFrequency <- data.frame(trigram=names(termFrequency), frequency=termFrequency)

g <- ggplot(termFrequency, aes(x=reorder(trigram, frequency), y=frequency)) +
 geom_bar(stat = "identity")  +
 theme(legend.title=element_blank()) +
 xlab("Trigram") + ylab("Frequency") +
 labs(title = "Top Trigrams by Frequency")
print(g)

Word Cloud

Another popular way of summarizing the data is a word cloud:

library(wordcloud)
wordcloud(sample.corpus, max.words = 200, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE)
## (Several terms could not be fit on the page and were not plotted.)
title("Wordcloud")

Next Steps

Now that we have performed some exploratory analysis, we are ready to start building the predictive model(s) and eventually the data product. Below are high-level plans to achieve this goal:

  1. Experiment with different values of N when building n-grams.
  2. Use n-grams to generate tokens of one to four words.
  3. Summarize token frequencies and find associations between tokens.
  4. Build predictive model(s) using the tokens (a rough sketch follows this list).
  5. Develop a data product (a Shiny app) that recommends (i.e., predicts) the next word based on user input.
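
As a rough, hypothetical sketch of step 4 (not the final model), the code below builds a bigram frequency table from the sampled text and looks up the most frequent continuations of a given word. Here predict.next.word is an illustrative helper name, and the sketch assumes the sample vector created earlier is still in memory; the final model will likely add smoothing and a backoff strategy across n-gram orders.

library(RWeka)

# Build a bigram frequency table from the (uncleaned, lower-cased) sample text
bigrams <- NGramTokenizer(tolower(sample), Weka_control(min = 2, max = 2))
bigram.freq <- sort(table(bigrams), decreasing = TRUE)

# Hypothetical helper: return the n most frequent words observed after 'word'
predict.next.word <- function(word, freq = bigram.freq, n = 3) {
    prefix <- paste0(tolower(word), " ")
    matches <- freq[startsWith(names(freq), prefix)]   # bigrams starting with 'word'
    head(substring(names(matches), nchar(prefix) + 1), n)
}

predict.next.word("thank")   # might return "you" among its suggestions, depending on the sample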