MILESTONE REPORT: COURSERA DATA SCIENCE SPECIALIZATION SWIFTKEY CAPSTONE

Summary

This milestone report presents an exploratory analysis of the three English-language HC Corpora datasets provided for the capstone. Because the files are large, a 1% random sample of each is used for the exploratory analysis, which covers unigrams, bigrams, and trigrams. A brief description of plans for creating the prediction algorithm and Shiny app is also provided.

Load Necessary Packages

library(tm)
library(RWeka)
library(qdap)
library(quanteda)
library(ggplot2)

TASK 1A: DATA ACQUISITION

Tasks to accomplish:

  1. Download and unzip the HC Corpora files if necessary.
  2. Download a list of profanity for profanity filtering.
  3. Read the necessary data files into R.
  4. Perform a basic assessment of dataset size.

The three English-language text files were downloaded to the project folder from the Coursera website. The URL is https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The three files are:

  1. en_US.blogs.txt
  2. en_US.news.txt
  3. en_US.twitter.txt

A list of obscene words (“badwords”) was downloaded from the website https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en

## Check for the necessary data files in the working directory; download and unzip only if missing.
setwd("~/Documents/Mandeeps Documents/WORK Related/Courses/Data Science Capstone")
url.swiftkey.data<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
url.badwords<-"https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
## Use the same file name in the existence check and in destfile so the download is not repeated
if (!file.exists("Coursera-SwiftKey.zip")) download.file(url.swiftkey.data, destfile="Coursera-SwiftKey.zip", mode="wb")
if (!file.exists("badwords.txt")) download.file(url.badwords, destfile="badwords.txt")
english.files<-c("final/en_US/en_US.blogs.txt", "final/en_US/en_US.news.txt", "final/en_US/en_US.twitter.txt")
## Extract with the archive paths intact so the files land in final/en_US/, where they are read from below
if (!all(file.exists(english.files))) unzip("Coursera-SwiftKey.zip", files=english.files, overwrite=TRUE)


##Read datafiles into R     
blogs.txt<-scan("final/en_US/en_US.blogs.txt", what="", sep="\n")
news.txt<-scan("final/en_US/en_US.news.txt", what="", sep="\n")
twitter.txt<-scan("final/en_US/en_US.twitter.txt", what="", sep="\n")
badwords.txt<-scan("badwords.txt", what="", sep="\n")

Here we summarize the number of lines and the in-memory size (in bytes) of the three data sets.

## Determine the size of the three data sets 
file.lines<-c(length(blogs.txt), length(news.txt), length(twitter.txt))
file.bytes<-c(object.size(blogs.txt), object.size(news.txt) , object.size(twitter.txt))
data.summary<-data.frame(file.lines, file.bytes)
row.names(data.summary)<-c("Blogs", "News", "Twitter")

print(data.summary)
##         file.lines file.bytes
## Blogs       899288  260564320
## News       1010242  261759048
## Twitter    2360148  316037344
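
Line counts and object sizes say little about how much text each file actually holds. As a rough complement, one could also count whitespace-separated words per line. The sketch below is illustrative only: `countWords` is a hypothetical helper, and `toy.lines` is a made-up vector standing in for `blogs.txt`, `news.txt`, or `twitter.txt`.

```r
# Hypothetical helper: count whitespace-separated words in each line.
countWords <- function(lines) {
  trimmed <- trimws(lines)                        # drop leading/trailing spaces
  tokens <- strsplit(trimmed, "[[:space:]]+")     # split on runs of whitespace
  counts <- lengths(tokens)                       # tokens per line
  counts[!nzchar(trimmed)] <- 0L                  # guard: blank lines count as zero words
  counts
}

toy.lines <- c("the quick brown fox", "  hello   world ", "")  # made-up data
word.counts <- countWords(toy.lines)
print(word.counts)        # words per line
print(sum(word.counts))   # total words in the toy vector
```

Applied to the real vectors, `sum(countWords(blogs.txt))` would give an approximate word total for the blogs file, although splitting on whitespace alone treats punctuation as part of a word.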

TASK 1B: DATA CLEANING

Tasks to accomplish:

  1. Data subsetting: to keep the exploratory analysis practical, a 1% random sample of each of the three files is drawn and the samples are combined.
  2. Data cleaning: text is converted to lowercase; numbers, punctuation, and English stop words are removed; and extra whitespace is collapsed.
  3. Profanity filtering: profane words are removed per the Capstone project instructions.

Subset Data

set.seed(1)
news.subset<-sample(news.txt, round(length(news.txt)*.01))
blogs.subset<-sample(blogs.txt, round(length(blogs.txt)*.01))
twitter.subset<-sample(twitter.txt,  round(length(twitter.txt)*.01))
combo.data<-c(news.subset, blogs.subset, twitter.subset)

The combined 1% data set contains 42696 lines and has a memory size of 8.401888 × 10^6 bytes (about 8.4 MB).
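
The subsetting step can also be wrapped in a small reusable function. This is a minimal sketch under stated assumptions: `sampleLines` is a hypothetical helper, and `toy.lines` is made-up data. Unlike the code above, which sets the seed once before all three draws, this helper resets the seed on every call, so it would not reproduce the exact same subsets.

```r
# Hypothetical helper: draw a reproducible random fraction of the lines.
sampleLines <- function(lines, fraction = 0.01, seed = 1) {
  set.seed(seed)                          # same seed gives the same sample
  n <- round(length(lines) * fraction)    # sample size, rounded as in the report
  sample(lines, n)                        # sampling without replacement
}

toy.lines <- paste("line", 1:1000)        # made-up stand-in for a corpus file
toy.subset <- sampleLines(toy.lines, fraction = 0.01)
print(length(toy.subset))                 # number of sampled lines
```

Keeping the fraction as a parameter makes it easy to try a larger sample later if memory allows.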

Clean the Data

#To make a volatile corpus, R needs to interpret each element in our vector of text as a document.  The tm package provides Source functions to do this.
combo.data.vector<-VectorSource(combo.data) 

##VCorpus(), creates our volatile corpus. The VCorpus object is a nested list. At each index of the VCorpus object, there is a PlainTextDocument object, which is essentially a list that contains the actual text data, as well as some corresponding metadata.
data.corpus<-VCorpus(combo.data.vector) ##Create a corpus 

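preprocessCorpus<-function(corpus) {
        corpus<-tm_map(corpus, content_transformer(tolower))      # convert to lowercase
        corpus<-tm_map(corpus, removeNumbers)                     # remove digits
        corpus<-tm_map(corpus, removePunctuation)                 # remove punctuation
        # Note: punctuation is already gone at this point, so contractions such as "don't" are now "dont" and are not matched by the stop word list
        corpus<-tm_map(corpus, removeWords, stopwords("english")) # remove English stop words
        corpus<-tm_map(corpus, removeWords, badwords.txt)         # remove profanity
        corpus<-tm_map(corpus, stripWhitespace)                   # collapse runs of whitespace
        return(corpus)
}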


data.corpus.proc<-preprocessCorpus(data.corpus)
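
To see what this sequence of steps does to a single line, the sketch below approximates it in base R. It is not the tm implementation: `cleanLine` is a hypothetical helper, `toy.stopwords` is a made-up stand-in for `stopwords("english")`, and the final `trimws()` is an extra convenience. It also shows one consequence of the ordering: because punctuation is stripped before stop words are removed, a contraction that appears in the stop word list no longer matches.

```r
# Base R approximation of the cleaning steps (not the tm implementation).
# toy.stopwords is a made-up stand-in for stopwords("english").
toy.stopwords <- c("the", "is", "don't")

cleanLine <- function(x, stopwords) {
  x <- tolower(x)                                   # lowercase
  x <- gsub("[[:digit:]]+", "", x)                  # drop numbers
  x <- gsub("[[:punct:]]+", "", x)                  # drop punctuation
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # drop whole-word stop words
  x <- gsub("[[:space:]]+", " ", x)                 # collapse runs of whitespace
  trimws(x)                                         # extra step: trim the ends
}

cleaned <- cleanLine("The cat is 3 years old; don't ask!", toy.stopwords)
print(cleaned)
# → "cat years old dont ask"
```

Here "the" and "is" are removed, but "don't" has already become "dont" and survives. Moving stop word removal ahead of punctuation removal would be one way to change that, at the cost of altering the n-gram counts reported here.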

TASK 2: EXPLORATORY DATA ANALYSIS

Key questions:

  - How frequently do certain words appear in the data set?
  - How frequently do certain pairs of words appear? How about triplets?

Tasks to accomplish:

  1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.

  2. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
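
As an illustration of the tokenization task, here is a minimal base R sketch of an n-gram tokenizer. It is not the tokenizer used in this report; `makeNgrams` is a hypothetical helper, and `toy.text` is made-up data. It assumes the text has already been cleaned and simply splits on whitespace.

```r
# Hypothetical base R tokenizer: split text into words, then build n-grams.
makeNgrams <- function(text, n = 2) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]                    # drop empty tokens
  if (length(words) < n) return(character(0))      # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)         # start index of each n-gram
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy.text <- c("the cat sat on the mat", "the cat ran")   # made-up data
all.bigrams <- unlist(lapply(toy.text, makeNgrams, n = 2))
bigram.counts <- sort(table(all.bigrams), decreasing = TRUE)
print(bigram.counts)
```

Tabulating the output of such a function gives the same kind of frequency table that the term-document matrices produce for the sampled corpus.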

Analyze Unigrams

#Create Unigrams
unigram_tdm<-TermDocumentMatrix(data.corpus.proc) #Create unigram TDM
unigram_tdm_sparse<-removeSparseTerms(unigram_tdm, .99)
unigram_m<-as.matrix(unigram_tdm_sparse)


#Plot most frequent 20 Unigrams
word.freq.uni<-sort(rowSums(unigram_m), decreasing=TRUE)
word.freq.uni.df<-data.frame(word=names(word.freq.uni), freq=word.freq.uni)
freq.words<-word.freq.uni.df[1:20,]
ggplot(freq.words, aes(x=reorder(word, -freq), y=freq))+geom_bar(stat="identity")+theme(axis.text=element_text(size=12), axis.text.x=element_text(angle=45, hjust=1), plot.title=element_text(size=12, face="bold"),axis.title=element_text(size=12))+labs(x="Word", y="Count")+ggtitle("Top 20 Most Frequent Unigrams")

Analyze Bigrams

options(mc.cores=1) # Sets the default number of threads to use. For some reason this is necessary for n-grams
tokenizer<-function(x) NGramTokenizer(x, Weka_control(min=2, max=2))  # uses the RWeka package to create bigram (two-word) tokens

bigram_tdm<-TermDocumentMatrix(data.corpus.proc, control=list(tokenize=tokenizer))
bigram_tdm_sparse<-removeSparseTerms(bigram_tdm, .999)
bigrams_m<-as.matrix(bigram_tdm_sparse)
bigrams.freq<-sort(rowSums(bigrams_m), decreasing=TRUE)
bigrams.freq.df<-data.frame(word=names(bigrams.freq), freq=bigrams.freq)
bigrams.top20.freq<-bigrams.freq.df[1:20,]
ggplot(bigrams.top20.freq, aes(x=reorder(word, -freq), y=freq))+geom_bar(stat="identity")+theme(axis.text.x=element_text(size=12, angle=90), plot.title=element_text(size=12, face="bold"),axis.title=element_text(size=12))+labs(x="Word", y="Count")+ggtitle("Top 20 Most Frequent Bigrams")

Analyze Trigrams

tokenizer<-function(x) NGramTokenizer(x, Weka_control(min=3, max=3)) 
trigram_tdm<-TermDocumentMatrix(data.corpus.proc, control=list(tokenize=tokenizer))
trigram_tdm_sparse<-removeSparseTerms(trigram_tdm, .9999)
trigrams_m<-as.matrix(trigram_tdm_sparse)
trigrams.freq<-sort(rowSums(trigrams_m), decreasing=TRUE)
trigrams.freq.df<-data.frame(word=names(trigrams.freq), freq=trigrams.freq)
trigrams.top20.freq<-trigrams.freq.df[1:20,]
ggplot(trigrams.top20.freq, aes(x=reorder(word, -freq), y=freq))+geom_bar(stat="identity")+theme(axis.text.x=element_text(size=12, angle=90, vjust=.5), plot.title=element_text(size=12, face="bold"),axis.title=element_text(size=12))+labs(x="Word", y="Count")+ggtitle("Top 20 Most Frequent Trigrams")

Strategies for Next Word Prediction Shiny App

The goal is to build an n-gram model that predicts the next word from the one, two, or three words the user entered most recently. The app will first search for matching trigrams and, if none are found, back off to bigrams and then unigrams. My app will probably be based on the combination of the three data sets, as above. An alternative is to let the user select which data set is used as input to the model, since n-gram frequencies likely vary between blogs, news, and Twitter. Because using all of the data at once will likely not be possible anyway due to computational constraints, having the user select a single data set will likely not reduce the amount of data available to the model.
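
The back-off idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the final algorithm: `predictNext` is a hypothetical function, the frequency tables are made-up named vectors of the form `c("the cat sat" = 3)`, and the sketch uses at most the last two words of the input because its highest-order model is a trigram.

```r
# Hypothetical back-off predictor over named frequency vectors.
# Each table maps an n-gram string to its count, e.g. c("the cat sat" = 3).
predictNext <- function(input, trigram.freq, bigram.freq, unigram.freq) {
  words <- strsplit(tolower(trimws(input)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  # Return the last word of the most frequent n-gram that starts with `prefix`
  bestMatch <- function(prefix, freq) {
    hits <- freq[startsWith(names(freq), paste0(prefix, " "))]
    if (length(hits) == 0) return(NULL)
    top <- names(hits)[which.max(hits)]
    sub(".* ", "", top)                           # keep only the final word
  }
  if (length(words) >= 2) {                       # try trigrams first
    guess <- bestMatch(paste(tail(words, 2), collapse = " "), trigram.freq)
    if (!is.null(guess)) return(guess)
  }
  if (length(words) >= 1) {                       # back off to bigrams
    guess <- bestMatch(tail(words, 1), bigram.freq)
    if (!is.null(guess)) return(guess)
  }
  names(unigram.freq)[which.max(unigram.freq)]    # last resort: top unigram
}

# Made-up frequency tables for illustration
toy.tri <- c("the cat sat" = 3, "the cat ran" = 1)
toy.bi  <- c("the cat" = 4, "cat sat" = 3, "on the" = 2)
toy.uni <- c("the" = 6, "cat" = 4, "sat" = 3)

print(predictNext("I saw the cat", toy.tri, toy.bi, toy.uni))  # trigram match
print(predictNext("sit on", toy.tri, toy.bi, toy.uni))         # bigram back-off
print(predictNext("hello", toy.tri, toy.bi, toy.uni))          # unigram fallback
```

A production version would likely need a faster lookup structure than a linear prefix scan, and some form of smoothing or discounting when backing off.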