This is a milestone report for week 2 of the Coursera Data Science Capstone Project. The overall goal of the project is to build a Shiny application that accepts a phrase or several words as input and, after the user hits submit, predicts the next word.
The objective of this report is to load, inspect, and perform exploratory data analysis on the datasets that will be used for the rest of the project.
The first step is to download the data and unzip it into a specific folder.
## Path
datafilepath <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## Create the data folder and download the file if it does not already exist
if(!dir.exists("./Data")) dir.create("./Data")
if(!file.exists("./Data/Coursera-SwiftKey.zip")) {
  download.file(datafilepath, destfile = "./Data/Coursera-SwiftKey.zip", method = "curl")
}
## Unzip the data file into ./Data so the file paths used below resolve
unzip("./Data/Coursera-SwiftKey.zip", exdir = "./Data")
The unzipping operation generates separate directories for German (de_DE), English (en_US), Finnish (fi_FI), and Russian (ru_RU) data files.
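We can verify the resulting layout with a quick directory listing (a quick check; paths assume the archive was extracted under ./Data as above, and the output is omitted here):
# List the extracted files to confirm the folder structure
list.files("./Data/final", recursive = TRUE)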
For this project we will focus only on the English (en_US) files for both analysis and prediction modeling.
The English (en_US) folder contains three files (Blogs, News, and Twitter), which will be loaded into character vectors in R.
# Blogs file
con <- file(".\\Data\\final\\en_US\\en_US.blogs.txt", "r")
resBlogs <- readLines(con, skipNul = TRUE, encoding="UTF-8")
close(con)
# News file
conNews <- file(".\\Data\\final\\en_US\\en_US.news.txt", "r")
resNews <- readLines(conNews, skipNul = TRUE, encoding="UTF-8")
## Warning in readLines(conNews, skipNul = TRUE, encoding = "UTF-8"):
## incomplete final line found on '.\Data\final\en_US\en_US.news.txt'
close(conNews)
# Twitter file
conTwitter <- file(".\\Data\\final\\en_US\\en_US.twitter.txt", "r")
resTwitter <- readLines(conTwitter, skipNul = TRUE, encoding="UTF-8")
close(conTwitter)
#######################################################################
TotalNoOfLinesBlogs <- length(resBlogs) # 899288
TotalNoOfLinesBlogs
## [1] 899288
sum(nchar(resBlogs)) ## Number of characters in the Blogs file
## [1] 206824505
TotalNoOfLinesNews <- length(resNews) # 77259
TotalNoOfLinesNews
## [1] 77259
TotalNoOfLinesTwitter <- length(resTwitter) # 2360148
TotalNoOfLinesTwitter
## [1] 2360148
The data files are quite large, and loading all of the data into my laptop's memory is a struggle. Therefore we will use a smaller sample of the data (5-10%) as the basis for the training data used to build the prediction model.
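As a quick check of that claim, the on-disk sizes can be inspected directly (a sketch; sizes are reported in MB and the exact values are not shown here):
# File sizes in MB for the three en_US source files
round(file.size(c(".\\Data\\final\\en_US\\en_US.blogs.txt",
                  ".\\Data\\final\\en_US\\en_US.news.txt",
                  ".\\Data\\final\\en_US\\en_US.twitter.txt")) / 1024^2, 1)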
Additionally, we will perform transformations both to reduce the size of the object residing in memory and to clean unwanted words and characters out of it. These steps help with data exploration but may not all be useful for the prediction model.
The cleanup is described further below.
set.seed(12345)
# Sample data %
PercentSample <- 0.05
SampleRows <- TotalNoOfLinesBlogs * PercentSample
SampleBlogs <- sample(resBlogs, size = SampleRows, replace = FALSE)
SampleRows <- TotalNoOfLinesNews * PercentSample
SampleNews <- sample(resNews, size = SampleRows, replace = FALSE)
SampleRows <- TotalNoOfLinesTwitter * PercentSample # use the Twitter line count here
SampleTwitter <- sample(resTwitter, size = SampleRows, replace = FALSE)
SampleAll <- c(SampleBlogs, SampleNews, SampleTwitter)
# TotalNoOfLinesSample <- length(SampleTwitter) + length(SampleBlogs) + length(SampleNews)
# Remove some variables to free up memory
rm(list = c('resNews','resBlogs','resTwitter','SampleBlogs','SampleNews','SampleTwitter'))
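After dropping the full datasets, an explicit garbage-collection call can be used to release the freed memory promptly (optional):
gc()  # reclaim memory released by the removed objects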
The R packages tm and RWeka have a number of helpful text mining functions that we will use to help clean up the data.
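The packages used in the remainder of this report are loaded below (shown here for completeness; they are assumed to be installed already):
library(tm)            # corpus handling and text transformations
library(RWeka)         # NGramTokenizer for n-gram tokenization
library(slam)          # row_sums on sparse term-document matrices
library(ggplot2)       # bar plots of token frequencies
library(wordcloud)     # word cloud plots
library(RColorBrewer)  # colour palettes (brewer.pal)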
First we convert the SampleAll variable to a corpus (a VCorpus object).
# Convert to Corpus
allCorpus <- VCorpus(VectorSource(SampleAll))
# str(allCorpus)
# head(allCorpus, 10)
# summary(allCorpus) #check what went in
We can list the transformations available directly through the tm package:
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
Then we perform the following transformations:
- Removing special characters
- Removing infrequent words
- Removing common words (stop words, e.g. the, to, a, and, of, in)
- Removing numbers
- Removing punctuation
- Removing extra whitespace
- Making all text lower case
- Stemming the document
Again, not all of the above transformations may be required for the prediction model. For the prediction model we will also remove profanity, but that is not required at this stage (a sketch of how this could be done is shown after the transformations below).
allCorpus <- tm_map(allCorpus, removeNumbers) # Removing Numbers
allCorpus <- tm_map(allCorpus, removePunctuation) # Removing punctuation
allCorpus <- tm_map(allCorpus, stripWhitespace) # Removing extra whitespace
# The Twitter file contains extended characters that throw errors in further processing
allCorpus <- tm_map(allCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
allCorpus <- tm_map(allCorpus, content_transformer(tolower)) # Making all text lowercase
allCorpus <- tm_map(allCorpus, removeWords, stopwords("english")) # Removing common words # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords
allCorpus <- tm_map(allCorpus, stemDocument, language = "english")
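As a preview of the profanity filtering mentioned above, it could be applied with the same removeWords transformation and an external word list (a sketch only, kept commented out for now; badwords.txt is a hypothetical placeholder for whatever list is chosen later):
# Hypothetical profanity filter, to be applied in the prediction-model phase
# profanity <- readLines("badwords.txt", skipNul = TRUE)
# allCorpus <- tm_map(allCorpus, removeWords, profanity)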
The dataset is now clean and we can begin data exploration.
We use the NGramTokenizer from the RWeka package to tokenize the words into 1-gram, 2-gram, and 3-gram tokens. An n-gram is a contiguous sequence of n words that appears in the corpus. This helps identify the frequencies of one word, two words, three words together, and so on.
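For illustration, this is roughly what the RWeka tokenizer returns for a short example sentence (assuming RWeka is loaded; the sentence is made up for this example):
# Tokenize a short sentence into bigrams (2-grams)
NGramTokenizer("this is a short example sentence", Weka_control(min = 2, max = 2))
# expected tokens: "this is", "is a", "a short", "short example", "example sentence"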
Next we will build a TermDocumentMatrix which will show the frequency of each term/word. For this analysis we will look at unigram (single word), bigram, and trigram terms.
For the final prediction algorithm we will also look at 4-gram combinations.
A TermDocumentMatrix is a matrix whose rows are the tokens (terms) and whose columns are the documents in the corpus. Each cell represents the frequency of a token in a document.
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
Generate a TermDocumentMatrix for each n-gram:
unigramTDM <- TermDocumentMatrix(allCorpus, control=list(tokenize=unigramTokenizer))
bigramTDM <- TermDocumentMatrix(allCorpus, control=list(tokenize=bigramTokenizer))
trigramTDM <- TermDocumentMatrix(allCorpus, control=list(tokenize=trigramTokenizer))
Remove sparse terms to significantly reduce the size of the TDM
unigramTDM <- removeSparseTerms(unigramTDM, 0.9999)
bigramTDM <- removeSparseTerms(bigramTDM, 0.9999)
trigramTDM <- removeSparseTerms(trigramTDM, 0.9999)
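To confirm the effect of dropping sparse terms, the dimensions of each TDM and a few frequent terms can be inspected (a quick sketch; output not shown):
# Dimensions (terms x documents) of each TDM after removing sparse terms
dim(unigramTDM)
dim(bigramTDM)
dim(trigramTDM)
# Terms that appear at least 100 times in the unigram TDM
findFreqTerms(unigramTDM, lowfreq = 100)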
Save TDM to file for reuse later on
saveRDS(object = unigramTDM, 'unigramTDM.RData')
saveRDS(object = bigramTDM, 'bigramTDM.RData')
saveRDS(object = trigramTDM, 'trigramTDM.RData')
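The saved matrices can be reloaded in a later session with readRDS, for example:
# Reload a saved TermDocumentMatrix in a later session
unigramTDM <- readRDS('unigramTDM.RData')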
The function below creates a data frame with word and frequency columns, and plots the top N tokens using ggplot2.
The function's parameters are:
1. ngram - the TermDocumentMatrix
2. topN - the number of top tokens to show in the plot
3. nGramLabel - the label used in the plot title, identifying the input n-gram
PlotNgram <- function (ngram, topN = 25, nGramLabel = "1 Gram") {
  # Sum term frequencies across all documents and build a word/freq data frame
  df_words <- as.data.frame(slam::row_sums(ngram, na.rm = TRUE))
  colnames(df_words) <- "freq"
  df_words <- cbind(word = rownames(df_words), df_words)
  rownames(df_words) <- NULL
  df_words <- df_words[order(df_words$freq, decreasing = TRUE), ]
  df_words_top <- head(df_words, topN)
  # Bar plot of the top N tokens, labelled with their frequencies
  ggplot(df_words_top, aes(reorder(word, freq), freq, fill = word)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = freq), vjust = -0.5, size = 2) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
    labs(x = "Words") +
    labs(y = "Frequency") +
    labs(title = paste("Top", topN, "Words for", nGramLabel))
}
The function below, similar to the one above, generates a word cloud using the wordcloud package.
WordcloudNgram <- function (ngram, topN = 30) {
  # Sum term frequencies across all documents and build a word/freq data frame
  df_words <- as.data.frame(slam::row_sums(ngram, na.rm = TRUE))
  colnames(df_words) <- "freq"
  df_words <- cbind(word = rownames(df_words), df_words)
  rownames(df_words) <- NULL
  df_words <- df_words[order(df_words$freq, decreasing = TRUE), ]
  df_words_top <- head(df_words, topN)
  # Word cloud of the top N tokens
  wordcloud(
    df_words_top$word,
    df_words_top$freq,
    max.words = 50,
    rot.per = 0.3,
    colors = brewer.pal(8, "Dark2")
    # vfont = c("serif", "plain")
  )
}
Here are bar plots of the top 20 words for the 1-gram, 2-gram, and 3-gram terms:
PlotNgram(unigramTDM, 20, "1 Gram")
PlotNgram(bigramTDM, 20, "2 Gram")
PlotNgram(trigramTDM, 20, "3 Gram")
The following are word clouds of the top 30 words (the function's default) for the 1-gram, 2-gram, and 3-gram terms.
#### 1-gram
WordcloudNgram(ngram = unigramTDM)
#### 2-gram
WordcloudNgram(ngram = bigramTDM)
#### 3-gram
WordcloudNgram(ngram = trigramTDM)
The next step is the design and implementation of the prediction algorithm, during which we will perform the remaining pre-processing and data cleanup, remove profanity and infrequent words, and determine whether word stemming is beneficial.
Finally, we will develop a data product and deploy it as a Shiny App to showcase the prediction algorithm.
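As a rough preview of how the n-gram frequency tables could feed that prediction step, a minimal lookup might take the last two words of the input phrase and return the most frequent matching continuations. This is a sketch only: PredictNextWord and trigram_df are hypothetical names, and trigram_df is assumed to be a data frame of trigram/frequency pairs built the same way as inside PlotNgram. The actual algorithm, including backoff to lower-order n-grams and profanity handling, will be designed in the next phase.
# Hypothetical sketch: predict the next word from a trigram frequency data frame
# trigram_df is assumed to have columns 'word' (e.g. "one of the") and 'freq'
PredictNextWord <- function(phrase, trigram_df, topN = 3) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  last2  <- paste(tail(tokens, 2), collapse = " ")  # last two words of the input
  # Keep trigrams whose first two words match the end of the phrase
  hits <- trigram_df[startsWith(as.character(trigram_df$word), paste0(last2, " ")), ]
  hits <- hits[order(hits$freq, decreasing = TRUE), ]
  # Return the third word of the top matching trigrams
  sapply(strsplit(as.character(head(hits$word, topN)), " "), function(w) w[3])
}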