Milestone Report for Coursera Data Science Capstone Project

What word comes next? That is the question I will attempt to answer using natural language processing techniques. The training data set comprises a large number of documents drawn from blogs, news articles, and Twitter feeds.

This report carries out an exploratory analysis of the data and outlines a plan for building a predictive model.

Setting Global Options and Loading Libraries

I first set the working directory and load the R libraries needed for the analysis.

setwd ("C:/Data Science/Capstone")
library(stringi)
library(magrittr)
library(tm)
## Loading required package: NLP
library(SnowballC)     # word stemming
library(RColorBrewer)  # color palettes for the word cloud
library(wordcloud)     # word cloud plotting
library(ggplot2)       # bar charts of n-gram frequencies
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(RWeka)         # n-gram tokenizers
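Although the heading mentions global options, none are set above. One option that can matter for RWeka (an assumption on my part, not something this report sets) is the Java heap size, which must be configured before RWeka loads rJava:

options(java.parameters = "-Xmx2g")  # assumed tweak: enlarge the JVM heap; must run before library(RWeka)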

Reading the Training Data

I read the three training files (Twitter, news, and blogs) into memory.

t <- readLines(file("c:/Data Science/Capstone/Data/en_US.twitter.txt"), n = -1, encoding = "UTF-8")
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
n <- readLines(file("c:/Data Science/Capstone/Data/en_US.news.txt"), n = -1, encoding = "UTF-8")
## Warning in readLines(file("c:/Data Science/Capstone/Data/
## en_US.news.txt"), : incomplete final line found on 'c:/Data Science/
## Capstone/Data/en_US.news.txt'
b <- readLines(file("c:/Data Science/Capstone/Data/en_US.blogs.txt"), n = -1, encoding = "UTF-8")
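The warnings above are benign, but they can be silenced by opening each file as a binary connection and letting readLines drop embedded nuls. A sketch for the Twitter file (the other two follow the same pattern):

# sketch: binary mode avoids the incomplete-final-line warning,
# and skipNul = TRUE drops the embedded nuls
con <- file("c:/Data Science/Capstone/Data/en_US.twitter.txt", open = "rb")
t <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)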

Basic Stats on Training Data

I then compute basic summary statistics: the number of lines, total words, and mean words per line for each source.

b.words <- stri_count_words(b)
n.words <- stri_count_words(n)
t.words <- stri_count_words(t)
# Put information in a data frame
data.frame(source = c("blogs", "news", "twitter"),
           num.lines = c(length(b), length(n), length(t)),
           num.words = c(sum(b.words), sum(n.words), sum(t.words)),
           mean.num.words = c(mean(b.words), mean(n.words), mean(t.words)))
##    source num.lines num.words mean.num.words
## 1   blogs    899288  38154238       42.42716
## 2    news     77259   2693898       34.86840
## 3 twitter   2360148  30218125       12.80349
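To back up the claim below that the files are large, a quick size check can be added; a sketch using the same paths:

# sketch: raw file sizes in megabytes
files <- c("c:/Data Science/Capstone/Data/en_US.blogs.txt",
           "c:/Data Science/Capstone/Data/en_US.news.txt",
           "c:/Data Science/Capstone/Data/en_US.twitter.txt")
round(file.size(files) / 1024^2, 1)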

Sampling the Data

The files are large (see the size check above), so I work with a 0.1% sample of each source.

set.seed(1234)
tsamp <- sample.int(length(t), floor(length(t) * 0.001))
nsamp <- sample.int(length(n), floor(length(n) * 0.001))
bsamp <- sample.int(length(b), floor(length(b) * 0.001))
m <- c(t[tsamp], n[nsamp], b[bsamp])   # combine the three samples into one character vector
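Since everything downstream depends on this sample, it may be worth caching it to disk so later sessions can skip the expensive reads; a sketch (the output file name is my assumption):

writeLines(m, "c:/Data Science/Capstone/Data/en_US.sample.txt")  # hypothetical cache file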

Cleaning the Data

The data needs to be cleaned both before and after it is placed in a corpus. I first remove some stray mojibake characters, then build the corpus, and finally apply further cleaning transformations, including removal of profanity.

m=gsub("â"," ",m,fixed=TRUE)
m=gsub("€"," ",m,fixed=TRUE)
m=gsub("™"," ",m,fixed=TRUE)
mcorp <- VCorpus(VectorSource(m))
mcorp <- tm_map(mcorp, removePunctuation)
mcorp <- tm_map(mcorp, content_transformer(tolower))  # wrap base tolower so the corpus structure is preserved
mcorp <- tm_map(mcorp, removeNumbers)
mcorp <- tm_map(mcorp, removeWords, stopwords("english"))
# remove profane words (list stored in stopwords.txt)
badwords <- readLines("stopwords.txt")
mcorp <- tm_map(mcorp, removeWords, badwords)
mcorp <- tm_map(mcorp, stripWhitespace)
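Targeting individual mojibake characters with gsub is brittle; a broader alternative (not what this report runs) is to transliterate the sample to ASCII in one pass before building the corpus:

# sketch: replace every non-ASCII sequence with a space
m <- iconv(m, from = "UTF-8", to = "ASCII", sub = " ")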

Creating a Document-Term Matrix

Before creating a document-term matrix, I restrict attention to words between 4 and 20 characters long.

dtm <- DocumentTermMatrix(mcorp, control=list(wordLengths=c(4,20)))
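For this small sample the matrix is manageable, but with a larger sample one common option (not applied here) is to drop very sparse terms before any dense conversion:

# sketch: keep only terms appearing in at least ~1% of documents
dtm_small <- removeSparseTerms(dtm, sparse = 0.99)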

Inspecting the Most Frequent Terms

freqr <- colSums(as.matrix(dtm))   # total frequency of each term across documents
ord <- order(freqr, decreasing = TRUE)
freqr[head(ord)]
##  will  said  just first  time  like 
##  1068   604   481   426   410   403
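Note that as.matrix densifies the whole matrix, which is fine here but memory-hungry at scale. Two lighter alternatives (my assumptions, not used above): slam, which tm builds on, sums the sparse matrix directly, and findFreqTerms thresholds without any conversion:

freqr2 <- slam::col_sums(dtm)       # sparse column sums, no dense copy
findFreqTerms(dtm, lowfreq = 400)   # terms occurring at least 400 times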

Creating a Word Cloud

wordcloud(names(freqr), freqr, min.freq = 110, colors = brewer.pal(6, "Dark2"))

Tokenizing into N-grams

bitoken  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))  # bigrams
tritoken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))  # trigrams
qtoken   <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))  # quadgrams
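A quick sanity check of a tokenizer on a toy string (the expected output is my own, not from the report):

bitoken("thanks for the follow")
# expected: "thanks for" "for the" "the follow"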

# bigram Term-Document Matrix
tdm_bigram <-
  mcorp %>%
  TermDocumentMatrix( control = list( tokenize = bitoken)
  )

# aggregate frequencies
tdm_bigram %>%
  as.matrix %>%
  rowSums -> freq_bigram

ord <- order(freq_bigram,decreasing=TRUE)

freq_bigram[head(ord,20)]
##        im going       last week     high school      food waste 
##              96              95              94              93 
##  close together       right now       dont know       will play 
##              90              81              69              64 
##        now know             ˜ ˜            ” tr  came something 
##              63              62              62              62 
##       civil war    corn mixture           don t      dont think 
##              62              62              62              62 
##         hotel s         loss im       love life school division 
##              62              62              62              62
wf <- data.frame(term = names(freq_bigram), occurrences = freq_bigram)
wf2 <- tail(wf[order(wf$occurrences), ], 10)   # ten most frequent bigrams
ggplot(wf2, aes(x = reorder(term, occurrences), y = occurrences)) +
  geom_bar(stat = "identity")

# trigram Term-Document Matrix
tdm_trigram <-
  mcorp %>%
  TermDocumentMatrix( control = list(tokenize = tritoken)
  )

# aggregate frequencies
tdm_trigram %>%
  as.matrix %>%
  rowSums -> freq_trigram

ord <- order(freq_trigram,decreasing=TRUE)


freq_trigram[head(ord,20)]
##     will spend million        st patricks day        lets go rangers 
##                     62                     33                     32 
##              ˜ ˜ reach              ˜ ˜ thats      ˜ reach potential 
##                     31                     31                     31 
##            ˜ thats big          “ two richard          ” though will 
##                     31                     31                     31 
##           ” tr braaten              ” tr buck          help achieve 
##                     31                     31                     31 
##          pun intended          said birches     ability users able 
##                     31                     31                     31 
##      able enjoyed last     able find question    able marry jaejoong 
##                     31                     31                     31 
##          able set free academic challenge don 
##                     31                     31
wf <- data.frame(term = names(freq_trigram), occurrences = freq_trigram)
wf2 <- tail(wf[order(wf$occurrences), ], 10)   # ten most frequent trigrams
ggplot(wf2, aes(x = reorder(term, occurrences), y = occurrences)) +
  geom_bar(stat = "identity")
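qtoken was defined above but not yet used; a quadgram matrix would mirror the bigram and trigram steps exactly:

# sketch: quadgram Term-Document Matrix, following the same pattern
tdm_quadgram <-
  mcorp %>%
  TermDocumentMatrix(control = list(tokenize = qtoken))

tdm_quadgram %>%
  as.matrix %>%
  rowSums -> freq_quadgram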

Modeling and Prediction

I am going to take the trigram frequencies and construct a model that predicts the third word following a given two-word sequence, and I will build a Shiny app around it.
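A minimal sketch of the intended lookup step, assuming the freq_trigram table computed above (predict_next and its backoff comment are my illustration, not the final model):

# split each trigram into a two-word prefix and its completion
tri <- data.frame(ngram = names(freq_trigram), freq = freq_trigram,
                  stringsAsFactors = FALSE)
parts <- strsplit(tri$ngram, " ")
ok <- lengths(parts) == 3                        # guard against stray tokens
tri <- tri[ok, ]
parts <- parts[ok]
tri$prefix <- vapply(parts, function(w) paste(w[1], w[2]), character(1))
tri$word   <- vapply(parts, `[`, character(1), 3)

# predict the most frequent completion of a two-word prefix
predict_next <- function(two_words) {
  cand <- tri[tri$prefix == two_words, ]
  if (nrow(cand) == 0) return(NA_character_)     # unseen prefix: back off to bigrams (not shown)
  cand$word[which.max(cand$freq)]
}

predict_next("last week")                        # hypothetical query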