Data Science Capstone - 1st Milestone Report

Executive Summary


The purpose of this document is to read in the provided data and perform some exploratory analysis. Using tools such as the “tm” package for text mining and RWeka for creating n-grams, I will move on from here to building the next-word prediction algorithm that the assignment requires (a sketch of such a lookup appears at the end of this report).

library(tm)
## Loading required package: NLP
suppressMessages(library(R.utils))
## Warning: package 'R.utils' was built under R version 3.2.5
fp <- file.path("/","Users","Maria","Documents","Coursera","Data Science Specialization", "Developing Data Products", "Capstone", "texts")
#docs <- Corpus(DirSource(fp))

Count the lines in the files

Blogs

countLines(file.path(fp, "en_US.blogs.txt"))
## [1] 899288
## attr(,"lastLineHasNewline")
## [1] TRUE

News

countLines(file.path(fp, "en_US.news.txt"))
## [1] 1010242
## attr(,"lastLineHasNewline")
## [1] TRUE

Twitter

countLines(file.path(fp, "en_US.twitter.txt"))
## [1] 2360148
## attr(,"lastLineHasNewline")
## [1] TRUE

Sample the files

When running the script on the full files, it took a very long time and/or hit out-of-memory errors even with the Java parameters tweaked, so I reduced the sample written out for each file to 2% of its lines, which runs in a reasonable amount of time.

set.seed(1234)
sampath <- file.path(fp, "samples")
rconn <- file(file.path(fp, "en_US.blogs.txt"), "r")
file <- suppressWarnings(readLines(rconn))  
samp <- sample(file, size = (.02 * length(file)))
samp <- iconv(samp, "UTF-8", sub = "")
close(rconn)
wconn <- file(file.path(sampath, "en_US.blogs.sample.txt"))
writeLines(samp, con = wconn)
close(wconn)
head(samp)
## [1] "#70...Babs"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [2] "“I don’t know. Maybe they’re getting too much sun. I think I’m going to cut them way back.” I replied."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [3] "The reason could be anything. Maybe you violated some arcane, meaningless regulation among the hundreds of thousands of pages of US Code (ignorance of the law is NOT an excuse!). Maybe you were at the wrong place at the wrong time. Or maybe they had no real reason at all other than mere suspicion."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [4] "Last but certainly far from least, I want to talk about the magnetic triggers that was mentioned yesterday. I had seen for a couple of weeks various people just waking up one day and walking out of their lives. I had not talked about it because it was really strange. It looked almost zombie like… blank stares just leaving. I had no clue where they were going, I was too transfixed on the blank facial expressions… some even had older children along side of them, equally with the same blank look on their face. I am sure, if I had really looked at the expression on my own face as I moved out of my family’s life to New Mexico, I would have looked the same. Had no clue why I was doing it, or what would happen…. I just had to go. I am more than grateful that I did!!"
## [5] "I think I can believe that, though it’s hard"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [6] "Josef Strauss: Delirien waltz"
rconn <- file(file.path(fp, "en_US.news.txt"), "r")
file <- suppressWarnings(readLines(rconn))  
samp <- sample(file, size = (.02 * length(file)))
samp <- iconv(samp, "UTF-8", sub = "")
close(rconn)
wconn <- file(file.path(sampath, "en_US.news.sample.txt"))
writeLines(samp, con = wconn)
close(wconn)
head(samp)
## [1] "In Illinois, legislators are aiming to make current anti-bullying laws even more stringent. A bill to that effect passed the House late last month and now rests with the Senate."                                                                                                                                     
## [2] "\"No. I think we can be an underdog. We haven't been (at the NCAA tourney) in nine years. We haven't won a game since '96 or '97, whatever it was when Chauncey (Billups) was here,\" Boyle said."                                                                                                                     
## [3] "The scientists developed an avatar of the future Ms. Price by using special software to \"age-morph\" a recent photograph until the young woman's eyes became heavily lined, her smile faded and her blond hair went steel gray. Less than four years out of high school, Ms. Price has suddenly become a grandmother."
## [4] "So he keeps charging, and hoping. But man, that was some dismal defensive display in the first and fourth quarters against Minnesota. Even the players' wives were buzzing in the hallway after the game, saying things such as, \"I can't remember the last time Minnesota beat us.\""                                
## [5] "9143 Pine Av, $700,000"                                                                                                                                                                                                                                                                                                
## [6] "“Right now, I’m a little bothered about leaving Jersey,’’ said Brooks, a rookie shooting guard from Providence College. “We lost. We didn’t really finish like we wanted to down the stretch. But you know, Brooklyn-ready. I’ve got a long offseason ahead to think about before playing in Brooklyn.’’"
rconn <- file(file.path(fp, "en_US.twitter.txt"), "r")
file <- suppressWarnings(readLines(rconn))  
samp <- sample(file, size = (.02 * length(file)))
samp <- iconv(samp, "UTF-8", sub = "")
close(rconn)
wconn <- file(file.path(sampath, "en_US.twitter.sample.txt"))
writeLines(samp, con = wconn)
close(wconn)
head(samp)
## [1] "“: I think you have the wrong number”oops I thought this was Brain Barton"                                                                   
## [2] "duh bitch \U0001f48d"                                                                                                                        
## [3] "Yeah baby pat urself on the back for some sweet counter surveillance & proceed directly to installing back ups to the wrong partition...Doh!"
## [4] "lets follow for follow."                                                                                                                     
## [5] "ok cool:) we have a Homegame tuesday against springhill, but idk if its home thursday yet, but I'll let you know!"                           
## [6] "Hey Big Papi sweet sun glasses bro. You look like a douche with those colored lenses. It's not 1996 anymore"

Build the corpus and clean it

While thinking through the assignment, I decided not to stem the words or remove stop words, since the goal is to present the user with a likely next word based on their input, and stop words such as “the” or “for” are often exactly the word to predict (see the check after the bigram table below).

vc <- VCorpus(DirSource(directory = sampath))
summary(vc)
##                          Length Class             Mode
## en_US.blogs.sample.txt   2      PlainTextDocument list
## en_US.news.sample.txt    2      PlainTextDocument list
## en_US.twitter.sample.txt 2      PlainTextDocument list
#vc <- tm_map(vc, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
#vc <- tm_map(vc, content_transformer(function(x) iconv(x, "UTF-8", sub = "" )))
#vc <- tm_map(vc, content_transformer(function(x) gsub("[ãåâ]", x = x, replacement = "")))
vc <- tm_map(vc, removePunctuation)
vc <- tm_map(vc, removeNumbers)
vc <- tm_map(vc, stripWhitespace)
vc <- tm_map(vc, content_transformer(tolower))  # wrapped so the documents stay PlainTextDocuments
#vc <- tm_map(vc, stemDocument)
#vc <- tm_map(vc, removeWords, stopwords("english"))

Produce the Term Document Matrix

tdm <- TermDocumentMatrix(vc)
#summary(tdm)
tdm <- as.matrix(tdm)
tdm <- sort(rowSums(tdm), decreasing = TRUE)
tdm <- data.frame(word = names(tdm), freq = tdm)
#head(tdm)

Produce the Document Term Matrix

dtm <- DocumentTermMatrix(vc)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 84389)>>
## Non-/sparse entries: 126830/126337
## Sparsity           : 50%
## Maximal term length: 79
## Weighting          : term frequency (tf)

Look at word frequencies

#dtms <- removeSparseTerms(dtm, 0.1)
#freq <- colSums(as.matrix(dtms))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
##      word  freq
## the   the 96141
## and   and 48777
## for   for 22241
## that that 20911
## you   you 19270
## with with 14241

Plot the word frequencies

library(ggplot2)   
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
p <- ggplot(subset(wf, freq > 2500), aes(word, freq))
p <- p + geom_bar(stat="identity")   
p <- p + theme(axis.text.x=element_text(angle=90, hjust=1))   
p

View the Word Cloud

library(RColorBrewer)
library(wordcloud)
wordcloud(words = tdm$word, freq = tdm$freq, min.freq = 3000, max.words = 100, random.order = TRUE, colors = brewer.pal(6, "Dark2"), rot.per = 0.4)

Using RWeka to create n-gram tokens

options(java.parameters = "-Xmx4g")  # must be set before rJava loads to take effect
suppressWarnings(library(rJava))
## 
## Attaching package: 'rJava'
## The following object is masked from 'package:R.oo':
## 
##     clone
suppressWarnings(library(RWeka))
twoG   <- NGramTokenizer(vc, Weka_control(min=2, max=2))
threeG <- NGramTokenizer(vc, Weka_control(min=3, max=3))
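
For reference, an RWeka tokenizer can also be plugged into tm directly through the tokenize control option, which keeps the n-gram counts tied to the corpus documents. This is only a sketch of that alternative, not what was run above:

# Sketch: build a bigram term-document matrix directly with tm + RWeka
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2 <- TermDocumentMatrix(vc, control = list(tokenize = BigramTokenizer))
# findFreqTerms(tdm2, lowfreq = 2000) would then list the most common bigrams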

Bigrams

bi <- data.frame(table(twoG))
bi <- bi[sort.list(bi$Freq, decreasing = TRUE),]
bi <- head(bi, 10)
bi
##            twoG Freq
## 491087   of the 8471
## 349075   in the 8443
## 732747   to the 4401
## 263185  for the 4011
## 499731   on the 3944
## 727160    to be 3219
## 69041    at the 2947
## 47198   and the 2571
## 344515     in a 2401
## 802690 with the 2096
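
As a quick check on the earlier decision to keep stop words: every one of the top ten bigrams above ends in a stop word (“the”, “be”, or “a”). A small sketch that confirms this from the bi table, using tm’s English stop word list:

# What fraction of the ten most frequent bigrams end in an English stop word?
last_words <- sapply(strsplit(as.character(bi$twoG), " "), tail, 1)
mean(last_words %in% stopwords("english"))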

Trigrams

tri <- data.frame(table(threeG))
tri <- tri[sort.list(tri$Freq, decreasing = TRUE),]
tri <- head(tri, 10)
tri
##                 threeG Freq
## 969905      one of the  684
## 17198         a lot of  582
## 1265296 thanks for the  466
## 528784     going to be  351
## 1401688        to be a  348
## 1300943     the end of  319
## 635740       i want to  297
## 152808      as well as  296
## 994976      out of the  290
## 713592        it was a  286
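
Next steps

As described in the executive summary, these n-gram tables will back the prediction algorithm. The sketch below is only an illustration of the intended approach, not code that was run for this report: predictNext() is a hypothetical helper, and bi_full / tri_full stand for the full (untruncated) bigram and trigram frequency tables, i.e. data.frame(table(twoG)) and data.frame(table(threeG)) before taking head().

# Hypothetical frequency-based next-word lookup with a simple backoff:
# try the last two words against the trigram table, then the last word
# against the bigram table, then fall back to the overall top word.
predictNext <- function(input, bigrams, trigrams) {
    words <- unlist(strsplit(tolower(input), "\\s+"))
    n <- length(words)
    if (n >= 2) {
        prefix <- paste(words[n - 1], words[n])
        hits <- trigrams[grepl(paste0("^", prefix, " "), trigrams$threeG), ]
        if (nrow(hits) > 0) {
            best <- as.character(hits$threeG[which.max(hits$Freq)])
            return(tail(strsplit(best, " ")[[1]], 1))
        }
    }
    hits <- bigrams[grepl(paste0("^", words[n], " "), bigrams$twoG), ]
    if (nrow(hits) > 0) {
        best <- as.character(hits$twoG[which.max(hits$Freq)])
        return(tail(strsplit(best, " ")[[1]], 1))
    }
    "the"  # most frequent unigram in the sample
}
# e.g. predictNext("thanks for", bi_full, tri_full) should return "the"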