Milestone Report

Introduction and Objectives

The main goal of this report is to demonstrate the level of competency achieved in working with unstructured data in order to produce a structured set of records that can then be used for statistical modeling. The first step in any such task is to understand, as much as possible, what the raw data (the document corpus) contains and to separate the useful from the not-so-useful information. Because running the code while preparing the document for publication on RPubs.com took an unreasonably long time, I present this report based on a 10% sample of the data that was provided. The idea is to provide evidence of what I will do with the entire data set at the end of the capstone project.

Methods

The first task is to download the raw resources used for the analysis, the main ones being the three data sources en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. In addition, a list of bad/profane words was obtained (later used to exclude such words from the analysis). The raw data were downloaded in compressed form from “http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip” and uncompressed locally. The bad/profane words were downloaded from “https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en” and stored locally as “en_bws.txt”. The following chunk of code shows how the files were acquired.

dtsrc  <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("coursera-swiftkey.zip")) {
  download.file(dtsrc, destfile="coursera-swiftkey.zip")
  unzip("coursera-swiftkey.zip")
}
## list of bad/profane words downloaded from GitHub
bwsrc1<-"https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
if (!file.exists("en_bws.txt")){download.file(bwsrc1, destfile="en_bws.txt")}

Examining and Understanding the Corpus

The next step is to read each of the downloaded files into R and collect some basic information so that the files can be described in terms of size and characteristics. The number of lines and the number of words are extracted for this purpose. Other characteristics, such as file size, are determined by observing the global environment pane in RStudio.

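## Open each file in binary mode ("rb") so readLines() does not stop early
## at embedded control characters; text is read assuming latin1 encoding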
blg<- file("en_US.blogs.txt", open="rb")
blDat<- readLines(blg, encoding="latin1")
close(blg)

nws<- file("en_US.news.txt", open="rb")
nwsDat<- readLines(nws, encoding="latin1")
close(nws)

twts<- file("en_US.twitter.txt", open="rb")
twtsDat<- readLines(twts, encoding="latin1")
## Warning in readLines(twts, encoding = "latin1"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(twts, encoding = "latin1"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(twts, encoding = "latin1"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(twts, encoding = "latin1"): line 1759032 appears to
## contain an embedded nul
close(twts)


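## Estimate the word count of each file by counting spaces per line (and
## adding one per line); line counts are taken with NROW()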
blgWordCnt<- sum((nchar(blDat) - nchar(gsub(' ','',blDat))) + 1)
blgLinesCnt<-NROW(blDat)

nWsWordCnt<- sum((nchar(nwsDat) - nchar(gsub(' ','',nwsDat))) + 1)
nwsLinesCnt<-NROW(nwsDat)

twtWordCnt<- sum((nchar(twtsDat) - nchar(gsub(' ','',twtsDat))) + 1)
twtLinesCnt<-NROW(twtsDat)

Based on the above code, the following was learned. Blogs data: the blogs file is 248.5 MB in size and contains 37,334,131 words in 899,288 lines. News data: the news file is 249.6 MB and contains 343,725,530 words in 1,010,242 lines. Twitter data: the twitter file is 301.4 MB and contains 30,373,545 words in 2,360,148 lines. In the next step, this meta-data (the information gathered about the data) is collected into a two-dimensional matrix so that the summary can be shown with bar graphs.

docProperties<-matrix(c(blgWordCnt, nWsWordCnt, twtWordCnt, blgLinesCnt, nwsLinesCnt, twtLinesCnt), nrow=3, ncol=2)
rownames(docProperties)<- c("blog","news","twitter")
colnames(docProperties)<- c("# of words","# of lines")
barplot(docProperties[,1], main="Number of Words",
        xlab="data", col=c("lightblue","lightgreen", "yellow"),
        log="y", beside=TRUE)

barplot(docProperties[,2], main="Number of Lines",
        xlab="data", col=c("lightblue","lightgreen", "yellow"),
        log="y", beside=TRUE)

The two bar graphs above show that the “blogs” file contains the most words, followed by the “news” file, with the twitter data having the lowest word count. In terms of line count, however, twitter outnumbers both news (which comes second) and blogs. This conforms to the known characteristics of the three media venues and can be taken as validation that the files were read correctly.
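
As a quick cross-check, the average number of words per line can be derived directly from the docProperties matrix built above; a minimal computation (twitter should show by far the shortest lines, consistent with its character limit):

## average words per line for each source, derived from the matrix above
round(docProperties[, "# of words"] / docProperties[, "# of lines"], 1)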

Data Pre-Processing

The next task is to pre-process the data so that only the most pertinent and accurate data is used for building the predictive model. For this report, since posting a large dataset to RPubs would be very challenging, I decided to randomly select 10% of each document and demonstrate the pre-processing on that sample. The pre-processing follows these steps: (a) sample 10 percent of each file, (b) combine the three samples, and (c) use regular expressions to remove non-ASCII text, numbers, and extra white space, change cases to lower case, and so on. When all of this is done, the clean text is converted to a “Corpus” document. Recommendations gathered from different online sources were used to build the search/replace strategy, so the approach chosen is probably not the best or simplest one possible. The following steps show all the pre-processing up to conversion to “tm”’s Corpus form.

##Use the line counts to randomly sample 10% of each file to work with.

blogSamp<-sample(blDat, blgLinesCnt*0.1)
newsSamp<-sample(nwsDat,nwsLinesCnt * 0.1)
twtsSamp<-sample(twtsDat,twtLinesCnt * 0.1)

##Combine samples to work with

combSamp<- c(blogSamp, newsSamp, twtsSamp)

# remove words with non-ASCII characters

# split the combined text into comma-separated pieces so that pieces
# containing non-ASCII characters can be identified and dropped
spSamp<- unlist(strsplit(combSamp, split=", "))

# find indices of pieces with non-ASCII characters: iconv() inserts the
# marker string "spSamp" wherever a character cannot be converted to ASCII,
# and grep() then flags the affected pieces
nonAscIDX<- grep("spSamp", iconv(spSamp, "latin1", "ASCII", sub="spSamp"))

# subset the original vector to exclude pieces with non-ASCII characters,
# guarding against the case where none were found
if (length(nonAscIDX) > 0) {
  ascVec<- spSamp[ - nonAscIDX]
} else {
  ascVec<- spSamp
}

# convert vector back to string
ascSamp<- paste(ascVec, collapse = ", ")
#remove numbers and punctuation

clnSamp<- gsub('[[:digit:]]+', '', ascSamp)
clnSamp<- gsub('[[:punct:]]+', '', clnSamp)
clnSamp<- gsub("http[[:alnum:]]*", "", clnSamp)
clnSamp<- gsub("([[:alpha:]])\1+", "", clnSamp)

Removing Known Patterns in TM

Now that the raw text has been somewhat cleaned, the “tm” package is loaded into memory and the data is converted into a corpus. Further cleaning is then done by taking advantage of the tm package's capabilities for recognizing and removing known general patterns such as URLs, e-mail addresses, twitter tags and usernames, profane words, and English stop words. The following code performs all of these tasks:

## Make into corpus
library(tm)
SampCrps<- Corpus(VectorSource(clnSamp))

##Convert characters to lower case
SampCrps<- tm_map(SampCrps, tolower)

##Remove punctuation
SampCrps<- tm_map(SampCrps, removePunctuation)

##Remove numbers
SampCrps<- tm_map(SampCrps, removeNumbers)


# Create patterns to eliminate special codes and other known patterns
# URLs
urlPat<-function(x) gsub("(ftp|http)(s?)://.*\\b", "", x)
SampCrps<-tm_map(SampCrps, urlPat)

# Emails
emlPat<-function(x) gsub("\\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b", "", x)
SampCrps<- tm_map(SampCrps, emlPat)

# Twitter tags
tt<-function(x) gsub("\\bRT\\b|\\bvia\\b", "", x)
SampCrps<- tm_map(SampCrps, tt)

# Twitter Usernames
tun<-function(x) gsub("@[a-zA-Z0-9_]{1,15}", "", x)
SampCrps<- tm_map(SampCrps, tun)

#White Space
SampCrps<- tm_map(SampCrps, stripWhitespace)


#Remove profane words
#First get the list of bad words
bwdat<-read.table("en_bws.txt", header=FALSE, sep="\n", strip.white=TRUE)
names(bwdat)<-"Bad Words"

SampCrps<- tm_map(SampCrps, removeWords, bwdat[,1])

SampCrps<-tm_map(SampCrps, removeWords, stopwords("english"))
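
A quick way to confirm that the cleaning worked is to peek at the first few hundred characters of the corpus:

## inspect the start of the cleaned text
writeLines(substr(as.character(SampCrps[[1]]), 1, 300))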

Tokenizing and nGram Generation

Now that the document is in the desired shape, the next step is to build one-gram, two-gram, and three-gram tokens through extraction. At this point I load the “stylo” package since, from my reading, it appears to be the most robust package for the job:

# Create nGrams
library(stylo)
## Warning: package 'stylo' was built under R version 3.1.3
## stylo version: 0.5.9
myCrps<- txt.to.words(SampCrps)

#create frequency tables (as data frames) of one-, two- and three-grams

tblUniGrm<-data.frame(table(make.ngrams(myCrps, ngram.size = 1)))
tbldiGrm<-data.frame(table(make.ngrams(myCrps, ngram.size = 2)))
tbltriGrm<-data.frame(table(make.ngrams(myCrps, ngram.size = 3)))

#Create sorted tables "stbl*" by descending frequency count

stblUnigrm<-tblUniGrm[order(tblUniGrm$Freq, decreasing = TRUE),]
stblDigrm<-tbldiGrm[order(tbldiGrm$Freq, decreasing = TRUE),]
stbltrigrm<-tbltriGrm[order(tbltriGrm$Freq, decreasing = TRUE),]


top20unig<-stblUnigrm[1:20,]
colnames(top20unig)<-c("UniGram","Frequency")

top20dig<-stblDigrm[1:20,]
colnames(top20dig)<-c("DiGram","Frequency")

top20trig<-stbltrigrm[1:20,]
colnames(top20trig) <- c("TriGram","Frequency")
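
Before plotting, the top rows of each table can be printed as a quick check:

## quick look at the top rows of each table
head(top20unig, 5)
head(top20dig, 5)
head(top20trig, 5)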

Showing the Results

It is now time to show the most frequent n-grams in the document corpus. I use ggplot2 graphs to show the top 20 single words, two-word combinations, and three-word combinations identified through the tokenization process above.

library(ggplot2)


ggplot(top20unig, aes(x = reorder(UniGram, -Frequency), y = Frequency)) +
  geom_bar(stat = "identity", fill = "magenta") +
  geom_text(aes(label = Frequency), vjust = -0.20, size = 3) +
  xlab("UniGram List") +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(top20dig, aes(x = reorder(DiGram, -Frequency), y = Frequency)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = Frequency), vjust = -0.20, size = 3) +
  xlab("DiGram List") +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(top20trig, aes(x = reorder(TriGram, -Frequency), y = Frequency)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  geom_text(aes(label = Frequency), vjust = -0.20, size = 3) +
  xlab("TriGram List") +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusion

As can be seen from the presentation above, among the one-gram words, “will”, “just”, and “said” are the three most frequent. Among the di-grams, “right now”, “can’t wait”, and “last year” are the three most frequent consecutive word pairs. Among the tri-grams, “can’t wait see”, “happy mothers day”, and “let us know” come out on top. Note that “can’t wait see” reads this way because “to” was removed as an English stop word.
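
This can be checked directly against tm's English stop word list:

## "to" is on the stop word list that was removed from the corpus above
"to" %in% stopwords("english")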

I have gone through the entire process of producing a structured set of records from the unstructured text corpus, albeit on a 10% sample. What needs to be done next is to apply the same code to the entire data set and compile the n-grams in the same manner as done here. Given how much this portion of the data challenged the computing power of my laptop, I will likely need a server or a data center to process the full data. Once that is done, a prediction model can be built from the compiled n-grams. The final step will be creating the Shiny app, and at the end I plan to create a presentation of how everything is put together.
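
As a rough sketch of where this is heading, the sorted trigram table built above could already drive a naive next-word lookup. The predictNext() helper below is hypothetical and only illustrates the idea, assuming the trigrams are stored as space-separated strings in stbltrigrm$Var1 (as produced by data.frame(table(...))):

## hypothetical helper: given two words, return the most frequent third word
## from the sorted trigram table stbltrigrm (NA if no trigram starts with them)
predictNext<- function(w1, w2) {
  tg<- as.character(stbltrigrm$Var1)
  hits<- tg[grepl(paste0("^", w1, " ", w2, " "), tg)]
  if (length(hits) == 0) return(NA_character_)
  strsplit(hits[1], " ")[[1]][3]
}

predictNext("happy", "mothers")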

Acknowledgement

To achieve the above goal of this capstone project I relied on information from a diverse set of sources, including the CRAN documentation for the various packages used. Individuals who have taken the capstone course furnished valuable guiding information, and the course faculty also provided guidance (while keeping the work an individual effort).