The goal of this project is simply to demonstrate that you have become comfortable working with the data and that you are on track to create your prediction algorithm.
Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:

- demonstrate that the data have been downloaded and successfully loaded;
- report basic summary statistics and the major features identified in the data; and
- outline the plan for the prediction algorithm and the Shiny app.

The data provided for NLP (Natural Language Processing) consist of 3 “corpora” of English (en_US) text:

- Blogs (en_US.blogs.txt)
- News (en_US.news.txt)
- Twitter (en_US.twitter.txt)
## Load libraries and suppress messages for ease of reading report
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(LaF))
suppressMessages(library(quanteda))
suppressMessages(library(RColorBrewer))
suppressMessages(library(RWeka))
suppressMessages(library(slam))
suppressMessages(library(SnowballC))
suppressMessages(library(tau))
suppressMessages(library(tm))
suppressMessages(library(wordcloud))
# Download and extract the data (skip the download if the archive already exists)
source_file <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destination_file <- "Coursera-SwiftKey.zip"
if (!file.exists(destination_file)) {
  download.file(source_file, destination_file)
}
# Unzip the archive into the working directory
unzip(destination_file)
# Load the en_US data sets
dataBlogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
dataNews <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
dataTwitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Convert to ASCII (non-convertible characters are replaced with their byte codes)
dataNews <- iconv(dataNews, 'UTF-8', 'ASCII', "byte")
dataBlogs <- iconv(dataBlogs, 'UTF-8', 'ASCII', "byte")
dataTwitter <- iconv(dataTwitter, 'UTF-8', 'ASCII', "byte")
Since these files are large (based on the time taken to read them), a quick summary of each file will help determine a sampling approach.
# Assess the size (bytes), line count and longest line of all 3 files - blogs, news and Twitter
dataBlogs.filesize <- file.size("./final/en_US/en_US.blogs.txt")
dataNews.filesize <- file.size("./final/en_US/en_US.news.txt")
dataTwitter.filesize <- file.size("./final/en_US/en_US.twitter.txt")
dataframe.blogs <- c(dataBlogs.filesize, length(dataBlogs), max(nchar(dataBlogs)))
dataframe.news <- c(dataNews.filesize, length(dataNews), max(nchar(dataNews)))
dataframe.twitter <- c(dataTwitter.filesize, length(dataTwitter), max(nchar(dataTwitter)))
info <- data.frame(rbind(dataframe.blogs, dataframe.news, dataframe.twitter))
names(info) <- c("File Size (bytes)", "Line Count", "Longest Line (chars)")
row.names(info) <- c("Blogs", "News", "Twitter")
# Showcase table
info
## File Size (bytes) Line Count Longest Line (chars)
## Blogs 210160014 899288 40844
## News 205811889 77259 5760
## Twitter 167105338 2360148 589
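The sizes above are in bytes; for readability they can be converted to megabytes using the info table just built:
# Convert the byte counts in the summary table to megabytes
round(info["File Size (bytes)"] / 1024^2, 1)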
Since working with such large data sets is memory intensive, I will use basic random sampling to reduce the amount of text to mine through. This sample will also be used for the final predictive analysis.
The sample size has been arbitrarily set at 5% of the lines in each file. Based on the prediction results, this could later be increased or decreased; the exploratory analysis below uses this initial value.
# Assess maximum number of characters in a line of the files
summary(nchar(dataBlogs))[6]
## Max.
## 40840
summary(nchar(dataNews))[6]
## Max.
## 5760
summary(nchar(dataTwitter))[6]
## Max.
## 589
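Note that sample_lines draws a random sample, so the results below vary between runs; assuming it uses R's random number generator, a seed can be set before sampling for reproducibility (1234 below is an arbitrary value):
# Arbitrary seed so the same 5% sample is drawn on every run
set.seed(1234)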
# Sample 5% of the lines in each file to keep the working data manageable
dataBlogs_sample_size <- round(.05 * length(dataBlogs), 0)
dataNews_sample_size <- round(.05 * length(dataNews), 0)
dataTwitter_sample_size <- round(.05 * length(dataTwitter), 0)
# Draw a random sample of approximately 5% of the lines from each file
dataBlogs_sample <- sample_lines("./final/en_US/en_US.blogs.txt", n = dataBlogs_sample_size, nlines = NULL)
dataNews_sample <- sample_lines("./final/en_US/en_US.news.txt", n = dataNews_sample_size , nlines = NULL)
dataTwitter_sample <- sample_lines("./final/en_US/en_US.twitter.txt", n = dataTwitter_sample_size, nlines = NULL)
# Build a document-feature matrix (quanteda) for each of the 3 samples and inspect document frequencies
dataBlogs_word_freq <- dfm(dataBlogs_sample, verbose = FALSE)
dataNews_word_freq <- dfm(dataNews_sample, verbose = FALSE)
dataTwitter_word_freq <- dfm(dataTwitter_sample, verbose = FALSE)
docfreq(dataBlogs_word_freq)[1:11]
## todd breathed deeply , as if
## 11 14 66 27136 7354 4331
## restraining himself from clobbering me
## 5 304 5911 1 4989
docfreq(dataNews_word_freq)[1:11]
## such students now must pay $ 15 for
## 79 46 149 32 47 256 39 1084
## each of their
## 66 1776 303
docfreq(dataTwitter_word_freq)[1:11]
## thanks for the rt / mention !
## 4392 17618 36410 4150 4043 216 36847
## i am answering questions
## 28458 1982 24 192
The function below will be used to clean the data, including stemming. Stop words are deliberately not removed: they provide much-needed context and sentence fluidity in natural language, so they are retained.
# CleanR: apply a standard tm cleaning pipeline to a corpus
CleanR <- function(corpus){
  tm_map(corpus, removeNumbers) %>%           # remove digits
    tm_map(removePunctuation) %>%             # remove punctuation
    tm_map(content_transformer(tolower)) %>%  # convert to lower case
    tm_map(stripWhitespace) %>%               # collapse extra whitespace
    tm_map(stemDocument)                      # stem words (SnowballC)
}
# Create n-gram tokenizer functions (unigram, bigram, trigram) via RWeka
unigram_token <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_token <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_token <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
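As a quick illustration on a made-up phrase (not drawn from the corpus), the bigram tokenizer splits text into overlapping two-word sequences:
# Illustrative only: what the RWeka bigram tokenizer returns for a toy phrase
bigram_token("thanks for the mention")
# expected: "thanks for" "for the" "the mention"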
# Combine the three samples into one cleaned corpus for the n-gram work
corpus <- CleanR(VCorpus(VectorSource(c(dataBlogs_sample, dataNews_sample, dataTwitter_sample))))
# Build a unigram TermDocumentMatrix via the RWeka tokenizer
options(stringsAsFactors = FALSE)
options(mc.cores = 1)
unigram <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_token))
unigram.good <- rollup(unigram, 2, na.rm=TRUE, FUN = sum)
# Sort with decreasing frequency
unigram.tf <- findFreqTerms(unigram.good, lowfreq = 3)
unigram.tf <- sort(rowSums(as.matrix(unigram.good[unigram.tf, ])), decreasing = TRUE)
unigram.tf <- data.frame(word = names(unigram.tf), frequency = unigram.tf)
head(unigram.tf, 10)
## word frequency
## the the 57668
## and and 31435
## you you 16836
## for for 15543
## that that 14065
## with with 9598
## this this 8601
## was was 8312
## have have 7843
## are are 7264
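Since ggplot2 is already loaded, the same table can also be visualised; a minimal sketch of a bar chart of the ten most frequent unigrams:
# Bar chart of the 10 most frequent unigrams in the sampled corpus
ggplot(head(unigram.tf, 10), aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Top 10 unigrams")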
# Bigram counts via tau::textcnt (blogs sample)
bi.gram.dataBlogs <- textcnt(dataBlogs_sample, n = 2, method = "string")
bi.gram.dataBlogs <- bi.gram.dataBlogs[order(bi.gram.dataBlogs, decreasing = TRUE)]
bi.gram.dataBlogs[1:3] # top three, 2-Word combinations
## of the in the to the
## 9213 7612 4262
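The same approach extends to trigrams (via the trigram_token function above or tau::textcnt with n = 3); a sketch for the blogs sample:
# Trigram counts for the blogs sample, mirroring the bigram step above
tri.gram.dataBlogs <- textcnt(dataBlogs_sample, n = 3, method = "string")
tri.gram.dataBlogs <- tri.gram.dataBlogs[order(tri.gram.dataBlogs, decreasing = TRUE)]
tri.gram.dataBlogs[1:3] # top three three-word combinations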
blogs_corpus <- VCorpus(VectorSource(dataBlogs_sample))
news_corpus <- VCorpus(VectorSource(dataNews_sample))
twitter_corpus <- VCorpus(VectorSource(dataTwitter_sample))
rm(dataBlogs_sample); rm(dataNews_sample); rm(dataTwitter_sample)
blogs_corpus <- CleanR(blogs_corpus)
news_corpus <- CleanR(news_corpus)
twitter_corpus <- CleanR(twitter_corpus)
# Word clouds of the 90 most frequent words in each cleaned sample
pal <- brewer.pal(8, "Accent")
wordcloud(blogs_corpus, max.words = 90, random.order = FALSE, colors = pal)
wordcloud(news_corpus, max.words = 90, random.order = FALSE, colors = pal)
wordcloud(twitter_corpus, max.words = 90, random.order = FALSE, colors = pal)
The section below briefly explains the approach for the prediction algorithm and for creating the Shiny app. At the time of writing this report, a profanity-filtering approach has not yet been decided.
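Purely as an illustration of one option (not a decision), profanity could be stripped with tm::removeWords during cleaning; profanity_words below is a placeholder, not a chosen word list:
# Illustrative only: profanity_words stands in for a real profanity list (source not yet chosen)
profanity_words <- c("badword1", "badword2")
blogs_corpus <- tm_map(blogs_corpus, removeWords, profanity_words)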
The next steps in the project are:

- Use the sampled corpus to build bigram and trigram frequency tables; these data frames will be used to predict the next word from the n-gram frequency table (a sketch follows this list).
- Return the top two candidate words from the frequency table. Only the last word of the input will be used for prediction, even if the input contains more than one word.
- Build a Shiny app that takes the user's input as a character string, uses the last word of that input, and returns the top two words most likely to come next.
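A minimal sketch of how that lookup could work, assuming a bigram frequency table shaped like bi.gram.dataBlogs above (predict_next_word is an illustrative name, not part of the final app):
# Illustrative sketch: return the top two completions for the last word of the input
predict_next_word <- function(input, bigram_freq) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  last_word <- tail(words, 1)
  # keep bigrams whose first word matches the last input word
  matches <- bigram_freq[grepl(paste0("^", last_word, " "), names(bigram_freq))]
  matches <- matches[order(matches, decreasing = TRUE)]
  # return the second word of the two most frequent matching bigrams
  sapply(strsplit(names(head(matches, 2)), " "), `[`, 2)
}
predict_next_word("thanks for the", bi.gram.dataBlogs)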