What word comes next? That is the question I will attempt to answer using natural language processing techniques. The training data set comprises a large number of documents drawn from blogs, news articles, and Twitter feeds.
This report carries out exploratory analysis and describes a plan for creating a predictive model.
I first load the R libraries needed for the analysis.
setwd("C:/Data Science/Capstone")
library(stringi)
library(magrittr)
library(tm)
## Loading required package: NLP
library(SnowballC)
library(RColorBrewer)
library(wordcloud)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RWeka)
I read the training data.
t <- readLines(file("c:/Data Science/Capstone/Data/en_US.twitter.txt"),encoding="UTF-8", -1)
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
n <- readLines(file("c:/Data Science/Capstone/Data/en_US.news.txt"),encoding="UTF-8", -1)
## Warning in readLines(file("c:/Data Science/Capstone/Data/
## en_US.news.txt"), : incomplete final line found on 'c:/Data Science/
## Capstone/Data/en_US.news.txt'
b <- readLines(file("c:/Data Science/Capstone/Data/en_US.blogs.txt"),encoding="UTF-8", -1)
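The warnings flag embedded nul characters and a missing final newline. They do not stop the read, but they could be suppressed at read time if desired; skipNul and warn are standard readLines() arguments (a minimal sketch, shown for the Twitter file only):
t <- readLines("c:/Data Science/Capstone/Data/en_US.twitter.txt",
               encoding = "UTF-8", skipNul = TRUE, warn = FALSE)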
I also compute some basic summary statistics.
b.words <- stri_count_words(b)
n.words <- stri_count_words(n)
t.words <- stri_count_words(t)
# Put information in a data frame
data.frame(source = c("blogs", "news", "twitter"),
num.lines = c(length(b), length(n), length(t)),
num.words = c(sum(b.words), sum(n.words), sum(t.words)),
mean.num.words = c(mean(b.words), mean(n.words), mean(t.words)))
## source num.lines num.words mean.num.words
## 1 blogs 899288 38154238 42.42716
## 2 news 77259 2693898 34.86840
## 3 twitter 2360148 30218125 12.80349
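As a side note (not part of the table above), the in-memory size of each data set can also be checked; object.size() is a standard utils function:
sapply(list(blogs = b, news = n, twitter = t),
       function(x) format(object.size(x), units = "MB"))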
The files are large, so I work with a small sample (0.1% of each source).
set.seed(1234)
tsamp <- sample.int(length(t), length(t)*0.001)
nsamp <- sample.int(length(n), length(n)*0.001)
bsamp <- sample.int(length(b), length(b)*0.001)  # draw the blog sample from the blogs file
m <- c(t[tsamp], n[nsamp], b[bsamp])             # combine the sampled lines into one vector
The data needs to be cleaned before and after it is placed in a corpus.
I first remove some stray characters, then place the data in a corpus, and finally apply further cleaning transformations, including removing profane words.
m=gsub("â"," ",m,fixed=TRUE)
m=gsub("€"," ",m,fixed=TRUE)
m=gsub("™"," ",m,fixed=TRUE)
mcorp <- VCorpus(VectorSource(m))
mcorp <- tm_map(mcorp, removePunctuation)
mcorp <- tm_map(mcorp, content_transformer(tolower))  # wrap tolower so the corpus structure is preserved
mcorp <- tm_map(mcorp, removeNumbers)
mcorp <- tm_map(mcorp, removeWords, stopwords('english'))
# remove profane words;
badwords=data.frame(readLines(file("stopwords.txt"), -1))
colnames(badwords)[1]="badwords"
mcorp <- tm_map(mcorp, removeWords, badwords$badwords)
## Warning: closing unused connection 5 (stopwords.txt)
mcorp <- tm_map(mcorp, stripWhitespace)
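The "closing unused connection" warning appears because file() opens a connection that is never explicitly closed; passing the path directly to readLines() closes it automatically. An equivalent version of the profanity-removal step (assuming stopwords.txt holds one profane word per line in the working directory) would be:
badwords <- readLines("stopwords.txt")
mcorp <- tm_map(mcorp, removeWords, badwords)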
Before creating a document-term matrix, I drop words shorter than 4 characters or longer than 20 characters.
dtm <- DocumentTermMatrix(mcorp, control=list(wordLengths=c(4,20)))
freqr <- colSums(as.matrix(dtm))
ord <- order(freqr,decreasing=TRUE)
freqr[head(ord)]
## will said just first time like
## 1068 604 481 426 410 403
wordcloud(names(freqr),freqr,min.freq=110,colors=brewer.pal(6,"Dark2"))
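If the cloud becomes crowded, the number of words displayed can be capped; max.words is a standard wordcloud() argument, and the limit of 100 used here is arbitrary:
wordcloud(names(freqr), freqr, min.freq = 110, max.words = 100,
          colors = brewer.pal(6, "Dark2"))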
bitoken <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tritoken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
qtoken <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
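As a quick sanity check of the tokenizers, applying the bigram tokenizer to a short string should produce overlapping two-word sequences (the output shown as a comment is what I would expect, not verified here):
bitoken("to be or not to be")
# "to be" "be or" "or not" "not to" "to be"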
# bigram Term-Document Matrix
tdm_bigram <- mcorp %>%
  TermDocumentMatrix(control = list(tokenize = bitoken))
# aggregate frequencies
tdm_bigram %>%
as.matrix %>%
rowSums -> freq_bigram
ord <- order(freq_bigram,decreasing=TRUE)
freq_bigram[head(ord,20)]
## im going last week high school food waste
## 96 95 94 93
## close together right now dont know will play
## 90 81 69 64
## now know tr came something
## 63 62 62 62
## civil war corn mixture don t dont think
## 62 62 62 62
## hotel s loss im love life school division
## 62 62 62 62
wf=data.frame(term=names(freq_bigram),occurrences=freq_bigram)
wf2=wf[order(wf$occurrences),]
wf2=tail(wf2,10)
ggplot(data=wf2,aes(x=term,y=occurrences)) + geom_bar(stat="identity")
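An optional tweak (not used in the plot above): ordering the bars by frequency and flipping the axes keeps the long bigram labels readable; reorder() and coord_flip() are standard stats/ggplot2 functions:
ggplot(data = wf2, aes(x = reorder(term, occurrences), y = occurrences)) +
  geom_bar(stat = "identity") +
  coord_flip()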
# trigram Term-Document Matrix
tdm_trigram <- mcorp %>%
  TermDocumentMatrix(control = list(tokenize = tritoken))
# aggregate frequencies
tdm_trigram %>%
as.matrix %>%
rowSums -> freq_trigram
ord <- order(freq_trigram,decreasing=TRUE)
freq_trigram[head(ord,20)]
## will spend million st patricks day lets go rangers
## 62 33 32
## reach thats reach potential
## 31 31 31
## thats big two richard though will
## 31 31 31
## tr braaten tr buck help achieve
## 31 31 31
## pun intended said birches ability users able
## 31 31 31
## able enjoyed last able find question able marry jaejoong
## 31 31 31
## able set free academic challenge don
## 31 31
wf=data.frame(term=names(freq_trigram),occurrences=freq_trigram)
wf2=wf[order(wf$occurrences),]
wf2=tail(wf2,10)
ggplot(data=wf2,aes(x=term,y=occurrences)) + geom_bar(stat="identity")
I will use the trigram frequencies to build a model that predicts the third word following a two-word sequence, and I will deploy the model as a Shiny app.
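As a first illustration of the planned approach (a minimal sketch only, built from the sampled trigram counts above; the function and column names below are placeholders rather than the final design), the most frequent third word for each two-word prefix can be stored in a lookup table:
tri <- data.frame(ngram = names(freq_trigram), freq = freq_trigram,
                  stringsAsFactors = FALSE)
parts <- strsplit(tri$ngram, " ")
tri   <- tri[lengths(parts) == 3, ]          # keep well-formed trigrams only
parts <- parts[lengths(parts) == 3]
tri$prefix    <- vapply(parts, function(w) paste(w[1], w[2]), character(1))
tri$next_word <- vapply(parts, `[`, character(1), 3)
tri <- tri[order(tri$prefix, -tri$freq), ]   # highest-frequency trigram first within each prefix
lookup <- tri[!duplicated(tri$prefix), c("prefix", "next_word")]
# predict the third word for a two-word input (NA if the prefix was never seen)
predict_next <- function(two_words) {
  hit <- lookup$next_word[lookup$prefix == tolower(two_words)]
  if (length(hit) == 0) NA_character_ else hit[1]
}
predict_next("st patricks")   # should return "day" given the counts above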