The goal of the capstone project is to create a predictive text model using a large text corpus of documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.

This document describes the major features identified in the SwiftKey data and briefly summarizes our plans for creating the prediction algorithm and Shiny app.

The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this report is to understand the basic relationships observed in the data and to prepare for building our first linguistic models.

Getting the data

First we need to download and extract the data:

# Download the zip file only if it is not already present
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
                  destfile = "Coursera-SwiftKey.zip")
}
# Extract only if the data directory does not already exist
if (!dir.exists("data")) {
    unzip("Coursera-SwiftKey.zip", exdir = "data")
}

The datasets consist of text in four languages: 1) German, 2) English, 3) Finnish and 4) Russian. Each language contains data from three sources: 1) News, 2) Blogs and 3) Twitter. In this project we focus only on the English datasets:

# Read the English datasets; skipNul = TRUE skips embedded NUL characters
blogs   <- readLines("data/final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("data/final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Drop any characters that cannot be converted to UTF-8
twitter <- iconv(twitter, to = "UTF-8", sub = "")
news    <- iconv(news,    to = "UTF-8", sub = "")
blogs   <- iconv(blogs,   to = "UTF-8", sub = "")

First, let's explore the datasets in terms of file size:

# File sizes in megabytes
mb <- 1024 * 1024
blogs.size   <- file.info("data/final/en_US/en_US.blogs.txt")$size   / mb
news.size    <- file.info("data/final/en_US/en_US.news.txt")$size    / mb
twitter.size <- file.info("data/final/en_US/en_US.twitter.txt")$size / mb

Then let us examine the datasets in terms of word count:

library(stringi)

# Word counts per line in each dataset
blogs.words   <- stri_count_words(blogs)
news.words    <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
library(knitr)
summary <- data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
kable(summary,caption = "Data Summary", col.names = c("Dataset", "File Size (MB)", 
                                                      "Number of Lines", "Number of Words",
                                                      "Mean Number of Words"))
Data Summary

Dataset   File Size (MB)   Number of Lines   Number of Words   Mean Number of Words
blogs     200.4242         899288            37541795          41.74613
news      196.2775         1010242           34762303          34.40988
twitter   159.3641         2360148           30092907          12.75043

The datasets made available are large and hard to process, so we decided to work with a random sample of 0.5% of each dataset:

blogs.sample   <- sample(blogs,   round(length(blogs)   * 0.5 / 100))
news.sample    <- sample(news,    round(length(news)    * 0.5 / 100))
twitter.sample <- sample(twitter, round(length(twitter) * 0.5 / 100))

sample <- c(blogs.sample, news.sample, twitter.sample)

# remove the full datasets and the per-source samples to free memory
rm(twitter, news, blogs, blogs.sample, news.sample, twitter.sample)
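Because sample() is random, the sampled lines (and all counts below) change on every run. A minimal, optional addition for reproducibility is to fix the RNG seed before the sampling above; the specific seed value here is an arbitrary choice:

set.seed(1234)  # run before the sample() calls above so repeated runs draw the same lines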

After examining the size of the datasets and getting a feel for the amount of data they contain, we take an empirical approach to cleaning the data.

Cleaning the Data

Using the tm package, the sampled data is used to create a corpus. Subsequently, the following transformations are performed:

  1. Strip extra whitespace;
  2. Transform to lowercase;
  3. Remove punctuation;
  4. Remove numbers;
  5. Convert documents to plain text;
  6. Remove English stopwords.

library(tm)
library(RWeka)
library(ggplot2)

sample.corpus <- Corpus(VectorSource(sample))

sample.corpus <- tm_map(sample.corpus, content_transformer(stripWhitespace))   # strip extra whitespace
sample.corpus <- tm_map(sample.corpus, content_transformer(tolower))           # transform to lowercase
sample.corpus <- tm_map(sample.corpus, content_transformer(removePunctuation)) # remove punctuation
sample.corpus <- tm_map(sample.corpus, content_transformer(removeNumbers))     # remove numbers
sample.corpus <- tm_map(sample.corpus, PlainTextDocument)                      # convert documents to plain text
sample.corpus <- tm_map(sample.corpus, removeWords, stopwords("en"))           # remove English stopwords
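As a quick sanity check (not part of the cleaning pipeline itself), a couple of cleaned documents can be inspected to confirm the transformations behaved as expected:

lapply(sample.corpus[1:2], as.character)  # print the cleaned text of the first two documents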

Exploratory Sample Data Analysis

N-gram analysis

An n-gram is a contiguous sequence of n items from a text or speech; the items can be words, syllables or letters. N-grams are typically collected from a text or speech corpus. An n-gram of size 1 is referred to as a unigram, size 2 is a bigram (or, less commonly, a digram), and size 3 is a trigram.
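For example (an illustration only, using the same RWeka tokenizer applied later in this report), the bigrams of a short sentence are simply the pairs of adjacent words:

library(RWeka)
# Bigrams (n = 2) of a toy sentence
NGramTokenizer("the quick brown fox", Weka_control(min = 2, max = 2))
# e.g. "the quick"  "quick brown"  "brown fox"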

The data structure used for this analysis is the term-document matrix, which relates each term (unigram, bigram or trigram) to the documents in which it appears. The analysis below filters each term-document matrix, selecting the most frequent terms. We chose to split the analysis by the number of words in the n-grams.

Unigram analysis

# RWeka's Java-based tokenizers can clash with tm's parallel processing, so force a single core
options(mc.cores = 1)

# Tokenizer that produces single words (unigrams)
uniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))

uniGramMatrix <- TermDocumentMatrix(sample.corpus, control = list(tokenize = uniGramTokenizer))

After collecting the unigrams and assembling a term-document matrix, we can rank the terms (unigrams) by their frequency:

# Keep only unigrams that appear at least 1000 times in the sample
freqTerms <- findFreqTerms(uniGramMatrix, lowfreq = 1000)
termFrequency <- rowSums(as.matrix(uniGramMatrix[freqTerms, ]))
termFrequency <- data.frame(unigram = names(termFrequency), frequency = termFrequency)

Then, we can plot the result as a bar chart, where each bar represents the number of occurrences of the term in the sampled data.

g <- ggplot(termFrequency, aes(x=reorder(unigram, frequency), y=frequency)) +
 geom_bar(stat = "identity")  +
 theme(legend.title=element_blank()) +
 xlab("Unigram") + ylab("Frequency") +
 labs(title = "Top Unigrams by Frequency")
print(g)

Bigram analysis

biGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

biGramMatrix <- TermDocumentMatrix(sample.corpus, control = list(tokenize = biGramTokenizer))

freqTerms <- findFreqTerms(biGramMatrix, lowfreq = 50)
termFrequency <- rowSums(as.matrix(biGramMatrix[freqTerms,]))
termFrequency <- data.frame(bigram=names(termFrequency), frequency=termFrequency)

g <- ggplot(termFrequency, aes(x=reorder(bigram, frequency), y=frequency)) +
 geom_bar(stat = "identity")  +
 theme(legend.title=element_blank()) +
 xlab("Bigram") + ylab("Frequency") +
 labs(title = "Top Bigrams by Frequency")
print(g)

Trigram analysis

triGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

triGramMatrix <- TermDocumentMatrix(sample.corpus, control = list(tokenize = triGramTokenizer))

freqTerms <- findFreqTerms(triGramMatrix, lowfreq = 10)
termFrequency <- rowSums(as.matrix(triGramMatrix[freqTerms,]))
termFrequency <- data.frame(trigram=names(termFrequency), frequency=termFrequency)

g <- ggplot(termFrequency, aes(x=reorder(trigram, frequency), y=frequency)) +
 geom_bar(stat = "identity")  +
 theme(legend.title=element_blank()) +
 xlab("Trigram") + ylab("Frequency") +
 labs(title = "Top Trigrams by Frequency")
print(g)

Word Cloud

Another popular way of summarizing the data is a word cloud:

library(wordcloud)

# Word cloud of the 30 most frequent terms in the cleaned sample corpus
wordcloud(sample.corpus, max.words = 30, scale = c(4, 0.2), random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE)

Next Steps

Now that we have performed some exploratory analysis, we are ready to start building the predictive model(s) and eventually the data product. Below are high-level plans to achieve this goal:

  1. Test additional values of N when building n-grams.
  2. Use n-grams to generate tokens of one to four words (see the sketch below).
  3. Summarize token frequencies and find associations between tokens.
  4. Build predictive model(s) using the tokens.
  5. Develop a data product (i.e. a Shiny app) that recommends (i.e. predicts) the next word based on user input.
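As an illustration of step 2, the RWeka approach used above extends directly to four-word tokens. This is a sketch only; the quadGramTokenizer name and the lowfreq threshold are illustrative choices rather than part of the report's pipeline:

# Sketch: tokenize the cleaned sample corpus into 4-grams, mirroring the uni/bi/trigram code above
quadGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
quadGramMatrix <- TermDocumentMatrix(sample.corpus, control = list(tokenize = quadGramTokenizer))
findFreqTerms(quadGramMatrix, lowfreq = 5)  # 4-grams appearing at least 5 times; the threshold is arbitrary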