Capstone Project Milestone Report

Exploratory Analysis of Project Training Data

Published By: Sachin Raje
Published On: 10 Feb 2017

Objectives of Milestone Report

The objectives of this report are to:
1. Demonstrate competency in working with the project's training data.
2. Explain the major features identified through exploratory analysis.
3. Summarise the plans for creating the prediction algorithm.
4. Present the findings in a manner accessible to a non-data-scientist audience.



1. Initialise and Setup

Load the required R libraries and create parallel computing clusters.

library(dplyr)
library(foreach)
library(iterators)
library(parallel)
library(doParallel)
library(stringi)
library(NLP)
library(tm)
library(slam)
library(ggplot2)
library(RColorBrewer)
library(wordcloud)
library(SnowballC)  # required by tm's stemDocument() used later
# Setup parallel clusters to accelerate execution time
jobcluster <- makeCluster(detectCores())
invisible(clusterEvalQ(jobcluster, library(tm)))
invisible(clusterEvalQ(jobcluster, library(slam)))
invisible(clusterEvalQ(jobcluster, library(stringi)))
invisible(clusterEvalQ(jobcluster, library(wordcloud)))
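The cluster above reserves one worker process per core for the rest of the session. Once all processing is complete it can be released; a minimal sketch using the parallel package loaded above (intended to run at the very end of the analysis, not here):

# Release the parallel workers when no further parallel work is needed
stopCluster(jobcluster)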

2. Download Data Files

Check whether the “US” training data file “Coursera-SwiftKey.zip” exists in the local directory; if not, download it, otherwise skip the download. Once the archive is available, check whether en_US.blogs.txt exists; if it does not, unzip the archive, otherwise skip unzipping.

# Check for zip file and download if necessary
if (!file.exists("/Users/Sachin/OneDrive - emiratesgroup/coursera/Capstone/Coursera-SwiftKey.zip")) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
        destfile = "/Users/Sachin/OneDrive - emiratesgroup/coursera/Capstone/Coursera-SwiftKey.zip")
}
# Check for data file and unzip if necessary
if (!file.exists("/Users/Sachin/OneDrive - emiratesgroup/coursera/Capstone/final/en_US/en_US.blogs.txt")) {
    unzip("/Users/Sachin/OneDrive - emiratesgroup/coursera/Capstone/Coursera-SwiftKey.zip", 
          exdir = "/Users/Sachin/OneDrive - emiratesgroup/coursera/Capstone/", list = TRUE)
}
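Since the same absolute path is repeated in several chunks, it could be stored once in a variable and combined with file.path(); a minimal sketch assuming the same directory layout (datadir, zipfile and blogfile are names introduced here for illustration only):

# Hypothetical refactor: build all paths from a single base directory
datadir  <- "/Users/Sachin/OneDrive - emiratesgroup/coursera/Capstone"
zipfile  <- file.path(datadir, "Coursera-SwiftKey.zip")
blogfile <- file.path(datadir, "final", "en_US", "en_US.blogs.txt")
if (!file.exists(zipfile)) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
        destfile = zipfile)
}
if (!file.exists(blogfile)) {
    unzip(zipfile, exdir = datadir)
}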

3. Load Raw Data

The project’s work revolves around three main files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt, all contained in the “Coursera-SwiftKey.zip” archive. Each file is loaded, one at a time, through a binary-mode connection in order to preserve all characters.

# Read blogs data in binary mode
conn <- file("/Users/Sachin/OneDrive - emiratesgroup/coursera/Capstone/final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(conn, encoding = "UTF-8")
close(conn)

# Read news data in binary mode
conn <- file("/Users/Sachin/OneDrive - emiratesgroup/coursera/Capstone/final/en_US/en_US.news.txt", open = "rb")
news <- readLines(conn, encoding = "UTF-8")
close(conn)

# Read twitter data in binary mode
conn <- file("/Users/Sachin/OneDrive - emiratesgroup/coursera/Capstone/final/en_US/en_US.twitter.txt", open = "rb")
twits <- readLines(conn, encoding = "UTF-8")
close(conn)

rm(conn)
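The three reads above follow an identical pattern, so the same logic could be wrapped in a small helper function; a minimal sketch (readBinaryLines is a name introduced here for illustration only):

# Hypothetical helper: open a connection in binary mode, read all lines as UTF-8, then close it
readBinaryLines <- function(path) {
    conn <- file(path, open = "rb")
    on.exit(close(conn))
    readLines(conn, encoding = "UTF-8")
}
# Example usage (same result as the blogs read above):
# blogs <- readBinaryLines("/Users/Sachin/OneDrive - emiratesgroup/coursera/Capstone/final/en_US/en_US.blogs.txt")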

4. Brief Analysis of Raw Data (Exploratory Analysis - Part 1)

For each of the three data files, some basic exploratory statistics are computed, including line, character and word counts, together with a Words Per Line (WPL) summary. Basic histograms depicting the WPL distributions are also plotted.

# Compute words per line info on each line for each data type
rawWPL<-lapply(list(blogs,news,twits),function(x) stri_count_words(x))

# Compute statistics and summary info for each data type
rawstats<-data.frame(
            File=c("blogs","news","twitter"), 
            t(rbind(sapply(list(blogs,news,twits),stri_stats_general),
                    TotalWords=sapply(list(blogs,news,twits),stri_stats_latex)[4,])),
            # Compute words per line summary
            WPL=rbind(summary(rawWPL[[1]]),summary(rawWPL[[2]]),summary(rawWPL[[3]]))
            )
print(rawstats)
##      File   Lines LinesNEmpty     Chars CharsNWhite TotalWords WPL.Min.
## 1   blogs  899288      899288 206824382   170389539   37570839        0
## 2    news 1010242     1010242 203223154   169860866   34494539        1
## 3 twitter 2360148     2360148 162096031   134082634   30451128        1
##   WPL.1st.Qu. WPL.Median WPL.Mean WPL.3rd.Qu. WPL.Max.
## 1           9         28    41.75          60     6726
## 2          19         32    34.41          46     1796
## 3           7         12    12.75          18       47

The observations are:
1. Words Per Line for Blogs: mean 41.75
2. Words Per Line for News: mean 34.41
3. Words Per Line for Tweets: mean 12.75

This may reflect the expected attention span of readers of each type of content.

# Plot histogram for each data type
qplot(rawWPL[[1]],geom="histogram",main="Histogram for US Blogs", xlab="No. of Words",ylab="Frequency",binwidth=10)

qplot(rawWPL[[2]],geom="histogram",main="Histogram for US News",  xlab="No. of Words",ylab="Frequency",binwidth=10)

qplot(rawWPL[[3]],geom="histogram",main="Histogram for US Twits", xlab="No. of Words",ylab="Frequency",binwidth=1)

rm(rawWPL, rawstats)

The observations from the histograms are:
1. Words Per Line is right-skewed (i.e. has a longer right tail) for all three data types.

This may indicate a general trend towards short, concise communication.
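The skew is easier to see when the word counts are plotted on a logarithmic scale; a minimal sketch for the Twitter data, recomputing the counts since rawWPL has already been removed (the choice of 30 bins is illustrative):

# Histogram of Twitter words per line on a log10 x-axis to highlight the right skew
qplot(stri_count_words(twits), geom = "histogram", main = "Histogram for US Twits (log scale)", 
    xlab = "No. of Words", ylab = "Frequency", bins = 30) + scale_x_log10()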

5. Sample Raw Data

Since the raw data sets are larger than required, 30,000 lines are sampled from each file. Cleaning and further exploratory analysis are performed on these samples.

samplesize <- 30000  # Assign sample size
set.seed(1234)  # Ensure reproducibility 

# Create raw data and sample vectors
data <- list(blogs, news, twits)
sample <- list()

# Iterate over each raw dataset to create a 'cleaned' sample for each
for (i in 1:length(data)) {
    # Create sample dataset
    Filter <- sample(1:length(data[[i]]), samplesize, replace = FALSE)
    sample[[i]] <- data[[i]][Filter]
    # Remove unconventional/non-ASCII characters (iconv is vectorised, so no inner loop is needed)
    sample[[i]] <- iconv(sample[[i]], "latin1", "ASCII", sub = "")
}

rm(blogs)
rm(news)
rm(twits)

6. Creating Corpus and Cleaning Data

Create a corpus for each data type and clean it. The following text-cleaning steps are performed:
1. Convert text to lowercase.
2. Remove numbers.
3. Remove English stopwords.
4. Remove punctuation.
5. Stem the documents so that related words (“win”, “wins”, “winning”) are normalised to a common root (“win”).
6. Strip extra white space.

A document-term matrix (DTM) is then created to record how often each term occurs in the documents.

# Create corpus and document term matrix vectors
corpus <- list()
dtMatrix <- list()

# Iterate each sample data to create corpus and DTM for each
for (i in 1:length(sample)) {
    # Create corpus dataset
    corpus[[i]] <- Corpus(VectorSource(sample[[i]]))
    # Cleaning/stemming the data
    # content_transformer() keeps each document a PlainTextDocument, so a separate
    # PlainTextDocument conversion step is not needed
    corpus[[i]] <- tm_map(corpus[[i]], content_transformer(tolower))
    corpus[[i]] <- tm_map(corpus[[i]], removeNumbers)
    corpus[[i]] <- tm_map(corpus[[i]], removeWords, stopwords("english"))
    corpus[[i]] <- tm_map(corpus[[i]], removePunctuation)
    corpus[[i]] <- tm_map(corpus[[i]], stemDocument)
    corpus[[i]] <- tm_map(corpus[[i]], stripWhitespace)
    # Calculate document term frequency for corpus
    dtMatrix[[i]] <- DocumentTermMatrix(corpus[[i]], control = list(wordLengths = c(0, Inf)))
}
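With the document-term matrices in place, the most frequent (stemmed) terms in each corpus can be listed directly; a minimal sketch using col_sums() from the slam package loaded above:

# Print the ten most frequent terms in each sample corpus
for (i in 1:length(dtMatrix)) {
    termFreq <- sort(col_sums(dtMatrix[[i]]), decreasing = TRUE)
    print(head(termFreq, 10))
}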

rm(data)
rm(sample)

7. Plotting Data in Word Cloud (Exploratory Analysis - Part 2)

The corpus data is explored using word clouds, which visualise the words and their frequencies: the more frequently a word is used, the larger and more central it appears. One word cloud is plotted for each file.

set.seed(1234)  # Ensure reproducibility
par(mfrow = c(1, 3))  # Establish Plotting Panel
headings = c("US Blogs Word Cloud", "US News Word Cloud", "US Twits Word Cloud")

# Iterate each corpus/DTM and plot word cloud for each
for (i in 1:length(corpus)) {
    wordcloud(words = colnames(dtMatrix[[i]]), freq = col_sums(dtMatrix[[i]]), 
        scale = c(3, 0.05), max.words = 75, random.order = FALSE, rot.per = 0.35, 
        use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
    title(headings[i])
}
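Ahead of the modelling work, associations between terms can also be explored with tm's findAssocs(), which reports terms whose document-level frequencies correlate with a chosen term; a minimal sketch in which the term "love" and the 0.1 correlation cut-off are illustrative choices, not results from the analysis above:

# Terms correlated with an example term in the Twitter sample corpus
findAssocs(dtMatrix[[3]], terms = "love", corlimit = 0.1)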

8. Plans Ahead for Project

The basic exploratory analysis has been completed; hopefully it has explained the data sufficiently and in a non-technical manner. This provides a good foundation for building a predictive model and, eventually, the data product. The high-level plan to achieve this goal is:

  1. Use N-grams to generate tokens of one to four words (a bigram tokenisation sketch is shown below).
  2. Summarise the frequency of tokens and find associations between tokens.
  3. Build predictive model(s) using the tokens.
  4. Develop a data product (i.e. a Shiny app) that recommends (i.e. predicts) the next word based on user input.
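As an illustration of the first step, bigram tokens can be generated with ngrams() and words() from the NLP package loaded above, following the approach documented in the tm FAQ; a minimal sketch (BigramTokenizer and bigramDTM are names introduced here for illustration, and newer tm versions may require an explicit VCorpus for custom tokenizers):

# Hypothetical bigram tokenizer built on NLP::ngrams and NLP::words
BigramTokenizer <- function(x) {
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
# Bigram document-term matrix for the blogs sample corpus
bigramDTM <- DocumentTermMatrix(corpus[[1]], control = list(tokenize = BigramTokenizer))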