The major objective of the capstone project is to build a predictive text model. The data used for this project include blogs, news articles, and Twitter posts (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). The final product will be a Shiny application that predicts the next word or words as the user types a phrase. This milestone report presents the data preprocessing, the exploratory data analysis, and the next steps for the Data Science Capstone project.
The required libraries for this analysis are first loaded:
library(tm) # Framework for text mining.
## Loading required package: NLP
library(SnowballC) # Provides wordStem() for stemming.
library(RColorBrewer) # Generate palette of colours for plots.
library(ggplot2) # Plot word frequencies.
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
library(magrittr) # Pipe operator (%>%)
library(Rgraphviz) # Correlation plots.
## Loading required package: graph
## Loading required package: grid
library(Hmisc) # Miscellaneous data-handling utilities
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(stringi) # Character string analysis
library(wordcloud) # Plot Word Cloud
library(RWeka) # NGramTokenizer() for n-gram tokenization
First, the data used for the analysis are downloaded and their basic attributes and properties are checked. The download commands below are commented out to save the reader the time and computer resources needed to fetch and store the data files.
# Download the files
# download.file(source_file, destination_file)
# extract the files from the zip file
# unzip(destination_file)
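# A minimal sketch of these steps (also commented out; the destination file
# name "Coursera-SwiftKey.zip" is an assumption, the URL is the course link given above):
# source_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# download.file(source_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
# unzip("Coursera-SwiftKey.zip")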
# Explore general information about the files
setwd("F:/corpus/txt/final/en_US/fullData")
# list.files() # list the files
# Read the files
blog <- readLines("en_US.blogs.txt", encoding="UTF-8")
news <- readLines("en_US.news.txt", encoding="UTF-8")
## Warning in readLines("en_US.news.txt", encoding = "UTF-8"): incomplete
## final line found on 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 167155
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 268547
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line
## 1274086 appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line
## 1759032 appears to contain an embedded nul
# General statistics and counts
stri_stats_general(blog)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(news)
## Lines LinesNEmpty Chars CharsNWhite
## 77259 77259 15639408 13072698
stri_stats_general(twitter)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
The summary statistics above show that all three files are large: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt contain 899,288, 77,259, and 2,360,148 lines of text, respectively. Sampling from each file will therefore be essential for the subsequent analyses.
Because the full dataset is so large, a subset consisting of roughly the first half of the lines in each file was selected for the exploratory data analysis.
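An alternative to taking the first portion of each file is to draw a random sample of lines, which avoids any ordering bias within the files. A minimal sketch, assuming the full blog, news and twitter vectors read above are still in memory (the 50% sampling fraction and the helper name sample_lines are assumptions):
set.seed(123) # for reproducibility
sample_lines <- function(x, fraction = 0.5) {
  # keep each line with the given probability (assumed 50% here)
  x[rbinom(length(x), size = 1, prob = fraction) == 1]
}
# blog_sample <- sample_lines(blog)
# news_sample <- sample_lines(news)
# twitter_sample <- sample_lines(twitter)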
###########################################################################################
# Subset the text files by selecting roughly the first 50% of the lines
####################################################################################################
setwd("F:/corpus/txt/final/en_US")
con <- file("en_US.twitter.txt", "r")# replace with your path
twit <-readLines(con, 1200000) ## Read ~50% lines of text
## Warning in readLines(con, 1200000): line 167155 appears to contain an
## embedded nul
## Warning in readLines(con, 1200000): line 268547 appears to contain an
## embedded nul
writeLines(twit, "./subset1/twit1.txt") # write the sample as plain text (save() would create a binary .RData file)
close(con)
con <- file("en_US.blogs.txt", "r")# replace with your path
blogs <-readLines(con, 400000) ## Read ~50% lines of text
save(blogs, file = "./subset1/blogs1.txt")
close(con)
con<- file("en_US.news.txt", "r")
news<-readLines(con, 50000) ## Read ~50% lines of text
save(news, file = "./subset1/news1.txt")
close(con) ## It's important to close the connection when you are done
The next step creates a corpus from the sampled data and applies data transformation and cleaning, mainly with the tm package. Transformations such as converting to lower case and removing punctuation, numbers, and common stop words prepare the text for the next stage of the analysis.
###########################################################################################
# Creating the corpus for further exploration and analysis
###########################################################################################
cname <- file.path("F:/corpus/txt/final", "en_US", "subset1")
docs <- Corpus(DirSource(cname))
###########################################################################################
# Data Cleaning - Applying Transformations
###########################################################################################
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "[/@$:*&!?_#-]") ## replace special characters with a space
docs <- tm_map(docs, content_transformer(tolower)) # convert to lower case
docs <- tm_map(docs, removePunctuation) # remove punctuation
docs <- tm_map(docs, removeWords, stopwords("english")) # remove common English stop words (for, very, and, of, are, ...)
docs <- tm_map(docs, removeWords, c("the", "will", "also", "that", "and", "for", "in", "is", "it", "not", "to")) # additional frequent words (several overlap with the stop word list)
docs <- tm_map(docs, removeNumbers) # remove numbers
docs <- tm_map(docs, stripWhitespace) # collapse extra whitespace
docs <- tm_map(docs, stemDocument) # stem words by stripping common English word endings
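To verify that the transformations behaved as expected, a few lines of the first cleaned document can be inspected before building the document-term matrix; a minimal sketch using base tm accessors:
# Show the first three lines of the first cleaned document (sketch only)
head(as.character(docs[[1]]), 3)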
This step prepares the data for a histogram of the counts of unique words and for a word cloud. The frequency of each word across the three datasets is calculated with the tm package by building the document-term matrix and its transpose, the term-document matrix.
###################################################
# The document term matrix
####################################################
dtm <- DocumentTermMatrix(docs)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 14863)>>
## Non-/sparse entries: 19443/25146
## Sparsity : 56%
## Maximal term length: 42
## Weighting : term frequency (tf)
#########################################################
# The transpose of the document term matrix
####################################################
tdm <- TermDocumentMatrix(docs)
tdm
## <<TermDocumentMatrix (terms: 14863, documents: 3)>>
## Non-/sparse entries: 19443/25146
## Sparsity : 56%
## Maximal term length: 42
## Weighting : term frequency (tf)
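The most frequent terms can also be listed directly from the document-term matrix with tm's findFreqTerms(); for example, the terms occurring at least 250 times (the same threshold used for the histogram below):
# Terms that appear at least 250 times across the three documents
findFreqTerms(dtm, lowfreq = 250)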
The histogram showing the frequency of occurrence of unique words is constructed using the following code.
# Frequency
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wf <- data.frame(word=names(freq), freq=freq)
# Plot Histogram
subset(wf, freq>250) %>%
ggplot(aes(word, freq)) +
geom_bar(stat="identity", fill="darkred", colour="darkgreen") +
theme(axis.text.x=element_text(angle=45, hjust=1))
# Word Cloud
library(wordcloud)
set.seed(100)
wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
### N-gram analysis
In the following steps, an n-gram analysis was carried out by tokenizing the data into uni-grams, bi-grams, and tri-grams, building frequency tables, and plotting the results for the combined dataset to gain a better understanding of the data. The uni-gram, bi-gram, and tri-gram plots are shown below.
##############################################################################
# N-gram tokenization of the Corpus
##################################################################################
OnegramTokenizer <- function(x) NGramTokenizer(x,
Weka_control(min = 1, max =1))
dtm <- DocumentTermMatrix(docs, control = list(tokenize = OnegramTokenizer))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wof <- data.frame(word=names(freq), freq=freq)
pl <- ggplot(subset(wof, freq > 250) ,aes(word, freq))
pl <- pl + geom_bar(stat="identity", fill="darkred", colour="blue")
pl + theme(axis.text.x=element_text(angle=45, hjust=1)) + ggtitle("Uni-Gram Frequency")
BigramTokenizer <- function(x) NGramTokenizer(x,
Weka_control(min = 2, max = 2))
dtm <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wof <- data.frame(word=names(freq), freq=freq)
pl <- ggplot(subset(wof, freq >25) ,aes(word, freq))
pl <- pl + geom_bar(stat="identity", fill="darkgreen", colour="blue")
pl + theme(axis.text.x=element_text(angle=45, hjust=1)) + ggtitle("Bi-Gram Frequency")
TrigramTokenizer <- function(x) NGramTokenizer(x,
Weka_control(min = 3, max = 3))
dtm <- DocumentTermMatrix(docs, control = list(tokenize = TrigramTokenizer))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wof <- data.frame(word=names(freq), freq=freq)
pl <- ggplot(subset(wof, freq > 3) ,aes(word, freq))
pl <- pl + geom_bar(stat="identity", fill="darkred", colour="green")
pl + theme(axis.text.x=element_text(angle=45, hjust=1)) + ggtitle("Tri-Gram Frequency")
## Warning in grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font width unknown for character 0x9d
## Warning in grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y,
## : font width unknown for character 0x9d
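The font-width warnings above indicate that some non-printable characters (such as the 0x9d byte) survived the cleaning step. A minimal sketch of one way to drop such characters from the corpus before tokenization, assuming docs is the corpus built above:
# Remove characters that cannot be represented in ASCII (e.g. 0x9d);
# sub = "" drops them rather than substituting a placeholder (sketch only)
docs <- tm_map(docs, content_transformer(function(x) iconv(x, from = "", to = "ASCII", sub = "")))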
Further analysis and synthesis will be carried out on the uni-gram, bi-gram, and tri-gram frequencies, converting them into n-gram frequency matrices. Once this step is completed successfully, building the predictive model will follow. The model will then be evaluated for accuracy and speed, refined as needed, and used to develop and deploy the Shiny application.
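As an illustration of how the tri-gram frequencies could feed the prediction step, a minimal sketch of a next-word lookup is shown below (the helper predict_next is hypothetical and assumes wof still holds the tri-gram frequency table built above):
# Hypothetical helper: return the last word of the most frequent tri-gram
# that starts with the final two words of the user's phrase (sketch only)
predict_next <- function(phrase, trigram_freq) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2) # last two words typed
  prefix <- paste(words, collapse = " ")
  hits <- trigram_freq[grepl(paste0("^", prefix, " "), trigram_freq$word), ]
  if (nrow(hits) == 0) return(NA_character_) # no matching tri-gram observed
  top <- as.character(hits$word[which.max(hits$freq)]) # most frequent match
  tail(unlist(strsplit(top, " ")), 1)
}
# Example call: predict_next("thanks for the", wof)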