Executive Summary

This is the initial exploratory data analysis for the Coursera Data Science Specialization Capstone. Here I begin working with the data: cleaning it, understanding the resources it uses, and experimenting with tokenization. Eventually, this work will become a Shiny app that predicts the “next word”.

Get the Data

First, we load in the data and select a 1% sample for analysis.

library(caret)  # createDataPartition for sampling
library(tm)     # corpus creation and text transformations

set.seed(459)

# read data
twit <- readLines("en_UStwitter.txt", encoding = "UTF-8", skipNul = TRUE)
blog <- readLines("en_USblogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_USnews.txt", encoding = "UTF-8", skipNul = TRUE)

### sample the data
inTrain <- createDataPartition(1:length(news),p = .01, list = FALSE)
news_train<-news[inTrain]
#write(news_train,"news_train.txt")

inTrain <- createDataPartition(1:length(blog),p =.01, list = FALSE)
blog_train<-blog[inTrain]
#write(blog_train,"blog_train.txt")

inTrain <- createDataPartition(1:length(twit),p =.01, list = FALSE)
twit_train<-twit[inTrain]
#write(twit_train,"twitter_train.txt")

# combine the three sources into one text object and sample it
# (paste() joins the sources line-by-line, recycling the shorter vectors)
text <- paste(blog, news, twit)
inTrain <- createDataPartition(1:length(text),p =.01, list = FALSE)
text_train<-text[inTrain]
#write(text_train,"text_train.txt")

Resources

The summaries below show the size of each dataset and the line counts of the full files and the 1% samples.

# summarize the data
summary(twit)
##    Length     Class      Mode 
##   2360148 character character
summary(blog)
##    Length     Class      Mode 
##    899288 character character
summary(news)
##    Length     Class      Mode 
##   1010242 character character
#full corpus
str(text)
##  chr [1:2360148] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”. He wasn't home alone, apparently. "| __truncated__ ...
# line counts of the full and sampled files
system("wc -l en_USblogs.txt")
system("wc -l en_USnews.txt")
system("wc -l en_UStwitter.txt")
system("wc -l blog_train.txt")
system("wc -l twit_train.txt")
system("wc -l news_train.txt")
system("wc -l text_train.txt")

Remove Profanities

A list of profane words was provided to the class. All of the cleaning transformations are performed with the tm package, which provides stopword removal, punctuation removal, conversion to lower case, and removal of whitespace and numbers. Profanity filtering removes profanity and other words we do not want the app to predict.

# load profanities from https://gist.github.com/tjrobinson/2366772
# (assumes one term per line with no header; removeWords needs a character vector)
swear <- read.csv("profanity.csv", header = FALSE, stringsAsFactors = FALSE)[[1]]

# prep data for tokenization
sample_corpus <- VCorpus(VectorSource(text_train))

# lowercase first so stopword and profanity matching also catches capitalized words
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, removePunctuation)
sample_corpus <- tm_map(sample_corpus, removeNumbers)
# remove English stopwords
sample_corpus <- tm_map(sample_corpus, removeWords, stopwords("english"))
# remove profanities
#sample_corpus <- tm_map(sample_corpus, removeWords, swear)
sample_corpus <- tm_map(sample_corpus, stripWhitespace)
#write(sample_corpus,"sample_corpus.txt")

Tokenization

Tokenization means identifying appropriate tokens such as words, punctuation, and numbers. The plan is to write a function that takes a file as input and returns a tokenized version of it. Only the sampled data will be used for this milestone report.
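
As a starting point, tokenization can be done with a small base-R function. The sketch below is illustrative only; the function name, file name, and cleaning rules are assumptions, and the final version may use RWeka’s n-gram tokenizers with tm instead.

# sketch: read a file and return a vector of lowercase word tokens
tokenize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines <- tolower(lines)
  lines <- gsub("[^a-z' ]", " ", lines)      # drop punctuation, numbers, symbols
  tokens <- unlist(strsplit(lines, "\\s+"))  # split on whitespace
  tokens[tokens != ""]                       # remove empty strings
}

# example: tokenize the sampled combined text
# tokens <- tokenize_file("text_train.txt")
# head(tokens)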

Goals for the App

I will create a Shiny app that accepts 1-3 words of input and displays a predicted next word. Right now, the plan is to display the word that most frequently follows the input (the mode) as the prediction; a rough sketch of that idea follows.
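
Here is a minimal sketch of that frequency-based lookup, using the word tokens from the tokenizer sketched above. The function name and data structure are assumptions, not the final design.

# sketch: return the word that most often follows input_word in a token vector
predict_next <- function(tokens, input_word) {
  idx <- which(tokens == tolower(input_word))  # positions of the input word
  idx <- idx[idx < length(tokens)]             # drop a match at the very end
  if (length(idx) == 0) return(NA_character_)
  followers <- tokens[idx + 1]                 # words that come next
  names(sort(table(followers), decreasing = TRUE))[1]  # most frequent follower
}

# example usage:
# tokens <- tokenize_file("text_train.txt")
# predict_next(tokens, "happy")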

Notes

I have been having major trouble with Java versioning and with the libraries and packages needed for this assignment, specifically RWeka and tm.