The purpose of this document is to read in the provided data and perform some exploratory analysis. Using tools such as the “tm” package for text mining and RWeka for creating n-grams, I will move on from here to start putting together the word-prediction algorithm that the assignment requires.
library(tm)
## Loading required package: NLP
suppressMessages(library(R.utils))
## Warning: package 'R.utils' was built under R version 3.2.5
fp <- file.path("/","Users","Maria","Documents","Coursera","Data Science Specialization", "Developing Data Products", "Capstone", "texts")
#docs <- Corpus(DirSource(fp))
countLines(file.path(fp, "en_US.blogs.txt"))
## [1] 899288
## attr(,"lastLineHasNewline")
## [1] TRUE
countLines(file.path(fp, "en_US.news.txt"))
## [1] 1010242
## attr(,"lastLineHasNewline")
## [1] TRUE
countLines(file.path(fp, "en_US.twitter.txt"))
## [1] 2360148
## attr(,"lastLineHasNewline")
## [1] TRUE
Running the script took a very long time and/or hit out-of-memory errors even with the Java parameters tweaked, so I further reduced the sample sizes written to the sample files until it ran in a reasonable amount of time.
set.seed(1234)
sampath <- file.path(fp, "samples")
rconn <- file(file.path(fp, "en_US.blogs.txt"), "r")
file <- suppressWarnings(readLines(rconn))
samp <- sample(file, size = (.02 * length(file)))
samp <- iconv(samp, "UTF-8", sub = "")
close(rconn)
wconn <- file(file.path(sampath, "en_US.blogs.sample.txt"))
writeLines(samp, con = wconn)
close(wconn)
head(samp)
## [1] "#70...Babs"
## [2] "“I don’t know. Maybe they’re getting too much sun. I think I’m going to cut them way back.” I replied."
## [3] "The reason could be anything. Maybe you violated some arcane, meaningless regulation among the hundreds of thousands of pages of US Code (ignorance of the law is NOT an excuse!). Maybe you were at the wrong place at the wrong time. Or maybe they had no real reason at all other than mere suspicion."
## [4] "Last but certainly far from least, I want to talk about the magnetic triggers that was mentioned yesterday. I had seen for a couple of weeks various people just waking up one day and walking out of their lives. I had not talked about it because it was really strange. It looked almost zombie like… blank stares just leaving. I had no clue where they were going, I was too transfixed on the blank facial expressions… some even had older children along side of them, equally with the same blank look on their face. I am sure, if I had really looked at the expression on my own face as I moved out of my family’s life to New Mexico, I would have looked the same. Had no clue why I was doing it, or what would happen…. I just had to go. I am more than grateful that I did!!"
## [5] "I think I can believe that, though it’s hard"
## [6] "Josef Strauss: Delirien waltz"
rconn <- file(file.path(fp, "en_US.news.txt"), "r")
file <- suppressWarnings(readLines(rconn))
samp <- sample(file, size = (.02 * length(file)))
samp <- iconv(samp, "UTF-8", sub = "")
close(rconn)
wconn <- file(file.path(sampath, "en_US.news.sample.txt"))
writeLines(samp, con = wconn)
close(wconn)
head(samp)
## [1] "In Illinois, legislators are aiming to make current anti-bullying laws even more stringent. A bill to that effect passed the House late last month and now rests with the Senate."
## [2] "\"No. I think we can be an underdog. We haven't been (at the NCAA tourney) in nine years. We haven't won a game since '96 or '97, whatever it was when Chauncey (Billups) was here,\" Boyle said."
## [3] "The scientists developed an avatar of the future Ms. Price by using special software to \"age-morph\" a recent photograph until the young woman's eyes became heavily lined, her smile faded and her blond hair went steel gray. Less than four years out of high school, Ms. Price has suddenly become a grandmother."
## [4] "So he keeps charging, and hoping. But man, that was some dismal defensive display in the first and fourth quarters against Minnesota. Even the players' wives were buzzing in the hallway after the game, saying things such as, \"I can't remember the last time Minnesota beat us.\""
## [5] "9143 Pine Av, $700,000"
## [6] "“Right now, I’m a little bothered about leaving Jersey,’’ said Brooks, a rookie shooting guard from Providence College. “We lost. We didn’t really finish like we wanted to down the stretch. But you know, Brooklyn-ready. I’ve got a long offseason ahead to think about before playing in Brooklyn.’’"
rconn <- file(file.path(fp, "en_US.twitter.txt"), "r")
file <- suppressWarnings(readLines(rconn))
samp <- sample(file, size = (.02 * length(file)))
samp <- iconv(samp, "UTF-8", sub = "")
close(rconn)
wconn <- file(file.path(sampath, "en_US.twitter.sample.txt"))
writeLines(samp, con = wconn)
close(wconn)
head(samp)
## [1] "“: I think you have the wrong number”oops I thought this was Brain Barton"
## [2] "duh bitch \U0001f48d"
## [3] "Yeah baby pat urself on the back for some sweet counter surveillance & proceed directly to installing back ups to the wrong partition...Doh!"
## [4] "lets follow for follow."
## [5] "ok cool:) we have a Homegame tuesday against springhill, but idk if its home thursday yet, but I'll let you know!"
## [6] "Hey Big Papi sweet sun glasses bro. You look like a douche with those colored lenses. It's not 1996 anymore"
While thinking through the assignment, I decided not to stem the text or remove stop words, since the goal is to present the user with a prediction of a likely next word based on their input, and stop words are perfectly valid predictions.
vc <- VCorpus(DirSource(directory = sampath))
summary(vc)
## Length Class Mode
## en_US.blogs.sample.txt 2 PlainTextDocument list
## en_US.news.sample.txt 2 PlainTextDocument list
## en_US.twitter.sample.txt 2 PlainTextDocument list
#vc <- tm_map(vc, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
#vc <- tm_map(vc, content_transformer(function(x) iconv(x, "UTF-8", sub = "" )))
#vc <- tm_map(vc, content_transformer(function(x) gsub("[ãåâ]", x = x, replacement = "")))
vc <- tm_map(vc, removePunctuation)
vc <- tm_map(vc, removeNumbers)
vc <- tm_map(vc, stripWhitespace)
vc <- tm_map(vc, content_transformer(tolower)) # wrap base tolower so the documents stay PlainTextDocuments
#vc <- tm_map(vc, stemDocument)
#vc <- tm_map(vc, removeWords, stopwords("english"))
#vc <- tm_map(vc, PlainTextDocument) # not needed once tolower is wrapped in content_transformer
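The commented-out transformations above illustrate the general pattern for custom cleaning steps: wrap a plain character-to-character function in content_transformer so the documents keep their PlainTextDocument class. As a hypothetical example (not applied in this pipeline), a transformer that strips URLs would look like:
removeURLs <- content_transformer(function(x) gsub("http\\S+", "", x))
#vc <- tm_map(vc, removeURLs)  # hypothetical; not run here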
tdm <- TermDocumentMatrix(vc)
#summary(tdm)
tdm <- as.matrix(tdm)
tdm <- sort(rowSums(tdm), decreasing = TRUE)
tdm <- data.frame(word = names(tdm), freq = tdm)
#head(tdm)
dtm <- DocumentTermMatrix(vc)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 84389)>>
## Non-/sparse entries: 126830/126337
## Sparsity : 50%
## Maximal term length: 79
## Weighting : term frequency (tf)
#dtms <- removeSparseTerms(dtm, 0.1)
#freq <- colSums(as.matrix(dtms))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
## word freq
## the the 96141
## and and 48777
## for for 22241
## that that 20911
## you you 19270
## with with 14241
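For reference, the removeSparseTerms step commented out in the chunk above would prune terms that are missing from most documents. A quick sketch of its effect, for illustration only: with just three documents, a sparse threshold of 0.1 keeps only the terms that appear in all three sub-corpora.
dtms <- removeSparseTerms(dtm, 0.1)  # illustration only; not used downstream
dtms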
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
p <- ggplot(subset(wf, freq > 2500), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=90, hjust=1))
p
View the Word Cloud
library(RColorBrewer)
library(wordcloud)
wordcloud(words = tdm$word, freq = tdm$freq, min.freq = 3000, max.words = 100, random.order = TRUE, colors = brewer.pal(6, "Dark2"), rot.per = 0.4)
Using RWeka to create n-gram tokens
options(java.parameters = "-Xmx4g") # must be set before rJava loads, since the JVM reads it at startup
suppressWarnings(library(rJava))
##
## Attaching package: 'rJava'
## The following object is masked from 'package:R.oo':
##
## clone
suppressWarnings(library(RWeka))
twoG <- NGramTokenizer(vc, Weka_control(min=2, max=2))
threeG <- NGramTokenizer(vc, Weka_control(min=3, max=3))
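NGramTokenizer ultimately operates on character data, so if passing the corpus object directly ever causes trouble, flattening it to a character vector first is a safer equivalent. A sketch, assuming the vc built above:
txt <- unlist(lapply(seq_len(length(vc)), function(i) as.character(vc[[i]])), use.names = FALSE)
#twoG <- NGramTokenizer(txt, Weka_control(min = 2, max = 2))
#threeG <- NGramTokenizer(txt, Weka_control(min = 3, max = 3))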
Bigrams
bi <- data.frame(table(twoG))
bi <- bi[sort.list(bi$Freq, decreasing = TRUE),]
bi <- head(bi, 10)
bi
## twoG Freq
## 491087 of the 8471
## 349075 in the 8443
## 732747 to the 4401
## 263185 for the 4011
## 499731 on the 3944
## 727160 to be 3219
## 69041 at the 2947
## 47198 and the 2571
## 344515 in a 2401
## 802690 with the 2096
Trigrams
tri <- data.frame(table(threeG))
tri <- tri[sort.list(tri$Freq, decreasing = TRUE),]
tri <- head(tri, 10)
tri
## threeG Freq
## 969905 one of the 684
## 17198 a lot of 582
## 1265296 thanks for the 466
## 528784 going to be 351
## 1401688 to be a 348
## 1300943 the end of 319
## 635740 i want to 297
## 152808 as well as 296
## 994976 out of the 290
## 713592 it was a 286
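Finally, to sketch where this is heading: even the raw bigram table can suggest a next word with a simple frequency lookup. This is only an illustration (predictNext is my own placeholder, and it should be run against the full bigram table, before it was truncated with head above); the real algorithm will need smoothing and backoff to lower-order n-grams.
predictNext <- function(prev, bigrams, n = 3) {
  # keep bigrams whose first word matches, then return the top continuations
  pattern <- paste0("^", prev, " ")
  hits <- bigrams[grepl(pattern, bigrams$twoG), ]
  hits <- hits[order(hits$Freq, decreasing = TRUE), ]
  sub(pattern, "", head(as.character(hits$twoG), n))
}
#predictNext("of", bi)  # given the table above, the top continuation would be "the"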