This is the milestone report for week 2 of the Johns Hopkins University Data Science Capstone project on Coursera. The overall goal of the Capstone is to build a predictive text model using Natural Language Processing (NLP), along with a predictive text application that determines the most likely next word when a user inputs a word or phrase.
The purpose of this milestone report is to demonstrate how the data was downloaded, imported into R, and cleaned. The report also contains an exploratory analysis of the data, including summary statistics for the three separate data sets (blogs, news, and tweets), interesting findings discovered along the way, and an outline of the next steps toward building the predictive application.
library(tm)
## Loading required package: NLP
library(stringi)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(pryr)
##
## Attaching package: 'pryr'
## The following object is masked from 'package:tm':
##
## inspect
library(RColorBrewer)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RWeka)
The SwiftKey dataset has been downloaded and unzipped manually from here.
This project will look only at the English-language files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
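For reproducibility, the download and extraction could also be scripted. The sketch below is an assumption rather than part of the original workflow: the URL is assumed to be the dataset link from the course materials, and the extracted en_US folder may need to be moved so it matches the ./en_US paths used in the next code chunk.
zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"  # assumed course link
zipFile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFile)) {
    download.file(zipUrl, destfile = zipFile, mode = "wb")  # binary mode so the zip is not corrupted on Windows
}
if (!dir.exists("./en_US")) {
    unzip(zipFile)  # then move/rename the extracted en_US folder to ./en_US if necessary
}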
blogs <- readLines("./en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
stats <- data.frame(
    FileName = c("blogs", "news", "twitter"),
    FileSize = sapply(list(blogs, news, twitter), function(x) {format(object.size(x), "MB")}),
    t(rbind(sapply(list(blogs, news, twitter), stri_stats_general),
            Words = sapply(list(blogs, news, twitter), stri_stats_latex)[4, ]))
)
stats
## FileName FileSize Lines LinesNEmpty Chars CharsNWhite Words
## 1 blogs 255.4 Mb 899288 899288 206824382 170389539 37570839
## 2 news 19.8 Mb 77259 77259 15639408 13072698 2651432
## 3 twitter 319 Mb 2360148 2360148 162096241 134082806 30451170
From the summary, we can see that the data sets are quite large once loaded into R (the largest, the Twitter data, takes up nearly 320 Mb in memory). So, we are going to subset the data into three new data sets containing a 1% sample of each of the originals and check the size of the VCorpus (Virtual Corpus) object that will be loaded into memory.
We will set a seed so the sampling will be reproducible. Before building the corpus, we will create a combined sample file and once again check the summary statistics to make sure the file sizes are not too large.
set.seed(10101)
sampleSize <- 0.01
blogsSub <- sample(blogs, length(blogs) * sampleSize)
newsSub <- sample(news, length(news) * sampleSize)
twitterSub <- sample(twitter, length(twitter) * sampleSize)
sampleData <- c(blogsSub, newsSub, twitterSub)
sampleStats <- data.frame(
    FileName = c("blogsSub", "newsSub", "twitterSub", "sampleData"),
    FileSize = sapply(list(blogsSub, newsSub, twitterSub, sampleData), function(x) {format(object.size(x), "MB")}),
    t(rbind(sapply(list(blogsSub, newsSub, twitterSub, sampleData), stri_stats_general),
            Words = sapply(list(blogsSub, newsSub, twitterSub, sampleData), stri_stats_latex)[4, ]))
)
sampleStats
## FileName FileSize Lines LinesNEmpty Chars CharsNWhite Words
## 1 blogsSub 2.5 Mb 8992 8992 2000296 1647992 362415
## 2 newsSub 0.2 Mb 772 772 146989 122964 24869
## 3 twitterSub 3.2 Mb 23601 23601 1611688 1333030 303066
## 4 sampleData 5.9 Mb 33365 33365 3758973 3103986 690350
Build the corpus.
corpus <- VCorpus(VectorSource(sampleData))
Check the size of the corpus in memory using the object.size function.
format(object.size(corpus), "MB")
## [1] "137.3 Mb"
The VCorpus object is quite large (137.3 Mb), even though the sample size is only 1%. This may cause memory problems when it comes time to build the predictive model, but we will start here and see where this approach leads us.
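Since the pryr package is already loaded, we can also keep an eye on the session's overall memory footprint while experimenting with sample sizes, for example:
pryr::object_size(corpus)      # pryr's estimate of the corpus size
pryr::object_size(sampleData)  # the raw sample text, for comparison
pryr::mem_used()               # total memory currently used by the R session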
We next need to clean the corpus data using functions from the tm package. Common text-mining cleaning tasks include:
Converting everything to lower case
Removing punctuation marks, numbers, and extra whitespace
Removing stopwords (common words like “and”, “or”, “is”, “in”, etc.)
Filtering out unwanted words
At this early stage, I am not sure whether I want to remove the stopwords, even though they may have an adverse effect on the bigrams and trigrams. I would like to see how the predictive model works before removing them.
I did notice that the source data contains some profanity. I am also not sure whether I want to filter it out yet, as doing so could leave sentences in the data that make no sense. I first want to see whether the final application ever suggests a profane word to the user; if it does, I will go back and remove those words before finalizing the application.
cleanCorpus <- corpus %>%
    tm_map(content_transformer(tolower)) %>%  # Convert all text to lower case
    tm_map(removePunctuation) %>%             # Remove punctuation marks
    tm_map(removeNumbers) %>%                 # Remove numbers
    tm_map(stripWhitespace) %>%               # Remove extra whitespace
    tm_map(PlainTextDocument)                 # Convert back to plain text documents
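If we later decide to drop the stopwords and filter profanity, both steps can be added to the same pipeline with tm's removeWords transformation. The sketch below is illustrative only; profanity.txt is a hypothetical word list that is not part of the current workflow.
profanityList <- readLines("profanity.txt", warn = FALSE)  # hypothetical list of banned words

filteredCorpus <- cleanCorpus %>%
    tm_map(removeWords, stopwords("english")) %>%  # drop common English stopwords
    tm_map(removeWords, profanityList) %>%         # drop terms from the hypothetical profanity list
    tm_map(stripWhitespace)                        # tidy up the whitespace left behind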
We next need to tokenize the cleaned corpus (i.e., break the text up into words and short phrases) and construct a set of N-grams. We will start with the following three:
Unigram - a term-document matrix of single words
Bigram - a term-document matrix of two-word sequences
Trigram - a term-document matrix of three-word sequences
uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
uniMatrix <- TermDocumentMatrix(cleanCorpus, control = list(tokenize = uniTokenizer))
biMatrix <- TermDocumentMatrix(cleanCorpus, control = list(tokenize = biTokenizer))
triMatrix <- TermDocumentMatrix(cleanCorpus, control = list(tokenize = triTokenizer))
We now need to calculate the frequencies of the N-Grams and see what these look like.
uniCorpus <- findFreqTerms(uniMatrix, lowfreq = 20)
biCorpus <- findFreqTerms(biMatrix, lowfreq = 20)
triCorpus <- findFreqTerms(triMatrix, lowfreq = 20)
uniCorpusFreq <- rowSums(as.matrix(uniMatrix[uniCorpus,]))
uniCorpusFreq <- data.frame(word = names(uniCorpusFreq), frequency = uniCorpusFreq)
uniCorpusFreq <- arrange(uniCorpusFreq, desc(frequency))
head(uniCorpusFreq)
## word frequency
## the the 28572
## and and 15271
## you you 8578
## for for 7612
## that that 6986
## with with 4599
biCorpusFreq <- rowSums(as.matrix(biMatrix[biCorpus,]))
biCorpusFreq <- data.frame(word = names(biCorpusFreq), frequency = biCorpusFreq)
biCorpusFreq <- arrange(biCorpusFreq, desc(frequency))
head(biCorpusFreq)
## word frequency
## of the of the 2532
## in the in the 2461
## for the for the 1360
## to the to the 1292
## on the on the 1220
## to be to be 1196
triCorpusFreq <- rowSums(as.matrix(triMatrix[triCorpus,]))
triCorpusFreq <- data.frame(word = names(triCorpusFreq), frequency = triCorpusFreq)
triCorpusFreq <- arrange(triCorpusFreq, desc(frequency))
head(triCorpusFreq)
## word frequency
## one of the one of the 237
## thanks for the thanks for the 235
## a lot of a lot of 197
## to be a to be a 133
## i want to i want to 128
## going to be going to be 122
The next step will be to create visualizations of the data.
uniBar <- ggplot(data = uniCorpusFreq[1:20, ], aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "tomato2") +
    xlab("Words") +
    ylab("Frequency") +
    ggtitle("Top 20 Unigrams") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(axis.text.x = element_text(angle = 50, hjust = 1))
biBar <- ggplot(data = biCorpusFreq[1:20, ], aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "darkgreen") +
    xlab("Words") +
    ylab("Frequency") +
    ggtitle("Top 20 Bigrams") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(axis.text.x = element_text(angle = 50, hjust = 1))
triBar <- ggplot(data = triCorpusFreq[1:20, ], aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "deepskyblue") +
    xlab("Words") +
    ylab("Frequency") +
    ggtitle("Top 20 Trigrams") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(axis.text.x = element_text(angle = 50, hjust = 1))
uniBar
biBar
triBar
uniCorpusFreq$cum <- cumsum(uniCorpusFreq$frequency/length(uniMatrix$i))
# Number of unique words
uniMatrix$nrow
## [1] 45815
# Number of words needed to cover 50% of the corpora
which(uniCorpusFreq$cum >= 0.5)[1]
## [1] 160
# Number of words needed to cover 90% of the corpora
which(uniCorpusFreq$cum >= 0.9)[1]
## [1] 2924
We need 160 of the 45815 unique words to cover 50% of all word instances in the corpus sample, while we need 2924 of the 45815 to cover 90%.
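Note that these counts are based only on the terms that appear at least 20 times (the findFreqTerms cut-off). As a cross-check, coverage could also be computed over the full unigram matrix without densifying it, using the slam package that tm builds on. The coverage helper below is my own sketch, not part of the original analysis.
coverage <- function(tdm, p) {
    # Term frequencies across all documents, sorted from most to least common
    freqs <- sort(slam::row_sums(tdm), decreasing = TRUE)
    # Rank of the first term at which the cumulative share reaches p
    which(cumsum(freqs) / sum(freqs) >= p)[1]
}
coverage(uniMatrix, 0.5)
coverage(uniMatrix, 0.9)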
One open question is whether a 1% sample of the data is enough. I may find I need to increase the sample size, but doing so could affect the performance of the application.
As noted above, the VCorpus object is quite large (137.3 Mb) even with a sample size of only 1%, which may create memory problems when it comes time to build the predictive model.
We may need to try different sample sizes to strike a balance between having enough data, memory consumption, and acceptable performance.
We also need to decide whether the stopwords should be removed, and to add a filter in case profane words are suggested when a user enters a word or phrase.
Build and test different prediction models and evaluate each on its performance (a first sketch of a simple frequency-based lookup follows this list).
Make and test any modifications needed to resolve issues encountered during modeling.
Build, test and deploy a Shiny app with a simple user interface that has acceptable run time and reliably and accurately predicts the next word based on a word or phrase entered by the user.
Decide whether to remove the stopwords and filter out profanity, if necessary.
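As a starting point for the modeling step, the bigram and trigram frequency tables above can already drive a naive next-word lookup: take the last one or two words the user typed, find the most frequent N-gram that starts with them, and return its final word. The sketch below only illustrates that idea using the objects defined above; predictNext is a name introduced here, and the eventual model will need proper back-off and smoothing rather than this simple rule.
predictNext <- function(phrase, triFreq = triCorpusFreq, biFreq = biCorpusFreq, uniFreq = uniCorpusFreq) {
    words <- tolower(unlist(strsplit(phrase, "\\s+")))
    n <- length(words)
    # Try the trigram table first, keyed on the last two words entered
    if (n >= 2) {
        prefix <- paste(words[n - 1], words[n])
        hits <- triFreq[startsWith(as.character(triFreq$word), paste0(prefix, " ")), ]
        if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
    }
    # Back off to the bigram table, keyed on the last word only
    hits <- biFreq[startsWith(as.character(biFreq$word), paste0(words[n], " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
    # Fall back to the single most frequent unigram
    as.character(uniFreq$word[1])
}
predictNext("thanks for")  # should suggest "the", given the trigram counts shown above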