The purpose of this report is to explain the analysis of the training text documents provided and to discover structure in the data through basic exploratory analysis. The dataset is acquired, cleaned, and then analysed for the most frequent words used in the documents. The analysis is performed on a sample of the data. The report also outlines the next steps for building a predictive text model and presenting it in a Shiny app.
The training dataset is downloaded from the following URL, unzipped, and copied into the working folder: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
There are 4 folders in the dataset representing 4 different languages - German (de_DE), English (en_US), Finnish (fi_FI), and Russian (ru_RU). We will be using the files under the English folder (en_US), which contains three files: 1) en_US.blogs.txt 2) en_US.news.txt 3) en_US.twitter.txt
library(stringi)
library(ggplot2)
library(dplyr)
library(tm)
library(SnowballC)
library(NLP)
We load the three files from the working directory into character vectors.
setwd("C:/Sugi_R/Data Science Capstone Project/Coursera-SwiftKey/final/en_US")
con1<-file("en_US.blogs.txt","r")
blogs<-readLines(con1)
close(con1)
con2<-file("en_US.twitter.txt","r")
twitter<-readLines(con2)
close(con2)
con3<-file("en_US.news.txt","r")
news<-readLines(con3)
close(con3)
Let's explore the three datasets for line counts and word counts.
data.frame(filename=c("blogs","news","twitter"), rbind(stri_stats_general(blogs),stri_stats_general(news),stri_stats_general(twitter)),Wordcount=sapply(list(blogs,news,twitter),stri_stats_latex)[4,])
## filename Lines LinesNEmpty Chars CharsNWhite Wordcount
## 1 blogs 899288 899288 208361438 171926076 37865888
## 2 news 77259 77259 15683765 13117038 2665742
## 3 twitter 2360148 2360148 162384825 134370864 30578891
Summary statistics of the words per line for each of the three datasets:
data.frame(filename=c("blogs","news","twitter"), rbind(summary(stri_count_words(blogs)),summary(stri_count_words(news)), summary(stri_count_words(twitter))))
## filename Min. X1st.Qu. Median Mean X3rd.Qu. Max.
## 1 blogs 0 9 29 42.43 61 6726
## 2 news 1 19 32 34.87 46 1123
## 3 twitter 1 7 12 12.80 18 60
For our exploratory analysis, since the full dataset is quite large, let's take a 0.5% sample from each dataset and build a corpus from it. We will clean the corpus by removing punctuation, profanity (a bad-words list downloaded into the working directory), extra white space, numbers, and stopwords, as well as converting the text to lowercase.
# Sample the data
set.seed(1000)
allsample <- c(sample(blogs, length(blogs) * 0.005),
               sample(news, length(news) * 0.005),
               sample(twitter, length(twitter) * 0.005))
# Create Corpus Data and Clean the data
sample_vcorp <- VCorpus(VectorSource(allsample), list(reader = readPlain))
# Function to replace unwanted characters with space
spaceadd<-content_transformer(function(x, pattern) gsub(pattern, " ", x))
sample_vcorp <- tm_map(sample_vcorp, spaceadd, "(f|ht)tp(s?)://(.*)[.][a-z]+")
sample_vcorp <- tm_map(sample_vcorp, spaceadd, "@[^\\s]+")
# Remove bad words from the corpus. The bad-words list was downloaded from
# https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
# and saved to the local working directory.
# The file is assumed to contain a single column with the header "bad_words"
bad_words <- read.csv2(file = 'bad_words.txt', header = TRUE, strip.white = TRUE, stringsAsFactors = FALSE)
sample_vcorp <- tm_map(sample_vcorp, removeWords, bad_words$bad_words)
sample_vcorp<-tm_map(sample_vcorp,content_transformer(tolower))
sample_vcorp<-tm_map(sample_vcorp,removePunctuation)
sample_vcorp<-tm_map(sample_vcorp,removeWords,stopwords("en"))
sample_vcorp<-tm_map(sample_vcorp,stripWhitespace)
sample_vcorp<-tm_map(sample_vcorp,removeNumbers)
sample_vcorp <- tm_map(sample_vcorp, PlainTextDocument)
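As a quick sanity check (optional), we can inspect one of the cleaned documents to verify the transformations took effect:
# Inspect the first cleaned document
writeLines(as.character(sample_vcorp[[1]]))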
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. Larger sizes are sometimes referred to simply by the value of n, e.g., “four-gram”, “five-gram”, and so on.
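As a small illustration (a toy example, not part of the analysis), the bigrams of a four-word phrase produced by NLP::ngrams() look like this:
# Toy example: bigrams of a short phrase
sapply(ngrams(strsplit("thanks for the follow", " ")[[1]], 2), paste, collapse = " ")
## [1] "thanks for" "for the"    "the follow"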
We will create unigrams, bigrams, and trigrams from the sample dataset in order to explore the data better. The tokenization functions below create the different n-grams using the NLP package.
UnigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)
BigramTokenizer  <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
Let's calculate the frequencies of the unigrams, bigrams, and trigrams and plot them to understand the most frequently used words. Removing sparse terms is a necessary step to keep the term-document matrix (TDM) at a manageable size.
options(mc.cores=1)
# Unigram frequencies
sample_vcorp.unigrams <- removeSparseTerms(
    TermDocumentMatrix(sample_vcorp, control = list(tokenize = UnigramTokenizer)), 0.9999)
uni_matrix <- sort(rowSums(as.matrix(sample_vcorp.unigrams)), decreasing = TRUE)
uni_ngram <- data.frame(uni_ngram = names(uni_matrix), freq = uni_matrix)
# Bigram frequencies
sample_vcorp.bigrams <- removeSparseTerms(
    TermDocumentMatrix(sample_vcorp, control = list(tokenize = BigramTokenizer)), 0.9999)
bi_matrix <- sort(rowSums(as.matrix(sample_vcorp.bigrams)), decreasing = TRUE)
bi_ngram <- data.frame(bi_ngram = names(bi_matrix), freq = bi_matrix)
# Trigram frequencies
sample_vcorp.trigrams <- removeSparseTerms(
    TermDocumentMatrix(sample_vcorp, control = list(tokenize = TrigramTokenizer)), 0.9999)
tri_matrix <- sort(rowSums(as.matrix(sample_vcorp.trigrams)), decreasing = TRUE)
tri_ngram <- data.frame(tri_ngram = names(tri_matrix), freq = tri_matrix)
Let's plot the frequencies of the unigrams.
ggplot(as.data.frame(head(uni_ngram, 20)),
       aes(y = freq, x = reorder(uni_ngram, -freq), fill = freq, label = freq)) +
    geom_bar(stat = "identity") +
    geom_label(aes(fill = freq), size = 3, colour = "white", fontface = "bold", show.legend = FALSE) +
    theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1, hjust = 1)) +
    ggtitle("Top 20 Unigrams") + xlab("Unigrams") + ylab("Frequencies") +
    guides(fill = FALSE) +
    theme(plot.title = element_text(hjust = 0.5)) +
    coord_flip()
Plot the frequencies of the bigrams.
ggplot(as.data.frame(head(bi_ngram, 20)),
       aes(y = freq, x = reorder(bi_ngram, -freq), fill = freq, label = freq)) +
    geom_bar(stat = "identity") +
    geom_label(aes(fill = freq), size = 3, colour = "white", fontface = "bold", show.legend = FALSE) +
    theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1, hjust = 1)) +
    ggtitle("Top 20 Bigrams") + xlab("Bigrams") + ylab("Frequencies") +
    guides(fill = FALSE) +
    theme(plot.title = element_text(hjust = 0.5)) +
    coord_flip()
Plot the frequencies of the trigrams.
ggplot(as.data.frame(head(tri_ngram, 20)),
       aes(y = freq, x = reorder(tri_ngram, -freq), fill = freq, label = freq)) +
    geom_bar(stat = "identity") +
    geom_label(aes(fill = freq), size = 3, colour = "white", fontface = "bold", show.legend = FALSE) +
    theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1, hjust = 1)) +
    ggtitle("Top 20 Trigrams") + xlab("Trigrams") + ylab("Frequencies") +
    guides(fill = FALSE) +
    theme(plot.title = element_text(hjust = 0.5)) +
    coord_flip()
Based on the analysis above, we have a better understanding of the frequencies of the words used and a good basis for building an n-gram model for our predictive algorithm. We can extend the model to four-grams and five-grams as necessary; a sketch of that extension is shown below.
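The following sketch shows how a four-gram table could be built following the same pattern as the trigram code above (hypothetical names, not run as part of this report):
# Hypothetical extension: four-gram tokenizer and frequency table, same pattern as above
QuadgramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 4), paste, collapse = " "), use.names = FALSE)
sample_vcorp.quadgrams <- removeSparseTerms(
    TermDocumentMatrix(sample_vcorp, control = list(tokenize = QuadgramTokenizer)), 0.9999)
quad_matrix <- sort(rowSums(as.matrix(sample_vcorp.quadgrams)), decreasing = TRUE)
quad_ngram <- data.frame(quad_ngram = names(quad_matrix), freq = quad_matrix)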
The next steps for this project are to create a predictive algorithm and deploy it as a Shiny app.
The predictive algorithm will be based on an n-gram model, i.e., a Markov probabilistic model that assigns a probability to each word depending on the previous (N-1) words, with a frequency lookup that predicts the next word from its frequency in the trigram and bigram tables, by summarizing token frequencies and finding associations between tokens. I might also handle unseen words by applying smoothing techniques that assign them non-zero probabilities.
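As an illustration of the frequency lookup combined with one possible smoothing technique (add-one/Laplace smoothing), a hypothetical helper could estimate the probability of a word given the previous two words from the tables built above:
# Hypothetical helper: add-one (Laplace) smoothed trigram probability P(w3 | w1, w2),
# using the tri_ngram, bi_ngram and uni_ngram frequency tables built above
trigram_prob <- function(w1, w2, w3) {
    tri_count <- tri_ngram$freq[as.character(tri_ngram$tri_ngram) == paste(w1, w2, w3)]
    bi_count  <- bi_ngram$freq[as.character(bi_ngram$bi_ngram) == paste(w1, w2)]
    if (length(tri_count) == 0) tri_count <- 0
    if (length(bi_count) == 0)  bi_count  <- 0
    V <- nrow(uni_ngram)                 # vocabulary size taken from the unigram table
    (tri_count + 1) / (bi_count + V)     # unseen trigrams still get non-zero probability
}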
I'll use a simple backoff strategy: if there is no suggestion for the last four words (using 5-grams and their associated probabilities), then use the last three, if not, the last two, if not, the last one, and, if still not, use the highest 1-gram probabilities (“the”, “end”, etc.).
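A rough sketch of that backoff, limited to the tables already built in this report (so it falls back trigram → bigram → top unigram rather than starting from 5-grams), might look like the following; predict_next() and its internals are hypothetical names:
# Hypothetical sketch of the backoff lookup: try the trigram table first,
# then the bigram table, then fall back to the most frequent unigram
predict_next <- function(phrase) {
    toks <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
    lookup <- function(tbl, col, prefix) {
        terms <- as.character(tbl[[col]])
        hits  <- tbl[startsWith(terms, paste0(prefix, " ")), ]
        if (nrow(hits) == 0) return(NA_character_)
        best <- as.character(hits[[col]][which.max(hits$freq)])
        substring(best, nchar(prefix) + 2)   # keep only the predicted word
    }
    out <- NA_character_
    if (length(toks) == 2) out <- lookup(tri_ngram, "tri_ngram", paste(toks, collapse = " "))
    if (is.na(out) && length(toks) >= 1) out <- lookup(bi_ngram, "bi_ngram", tail(toks, 1))
    if (is.na(out)) out <- as.character(uni_ngram$uni_ngram[1])   # most frequent unigram
    out
}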
We will develop a data product, i.e., a Shiny app with a simple interface: a text input box that predicts the next word based on the user's input.
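A minimal sketch of such an app, assuming the hypothetical predict_next() helper above and the n-gram tables are loaded in the app's environment:
# Minimal Shiny sketch (assumes predict_next() and the n-gram tables are available)
library(shiny)
ui <- fluidPage(
    titlePanel("Next-Word Prediction"),
    textInput("phrase", "Enter a phrase:"),
    textOutput("prediction")
)
server <- function(input, output) {
    output$prediction <- renderText({
        if (nchar(trimws(input$phrase)) == 0) return("")
        predict_next(input$phrase)
    })
}
shinyApp(ui = ui, server = server)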