This report summarizes the work done so far for the capstone project. The following sections describe how the data for the project is loaded, analyzed and visualized. At the end, a plan for the prediction algorithm is proposed.
The data for the project consists of text corpora in different languages, gathered from blogs, news and tweets. This report uses the folder containing English text.
The work here is done on a Core i7 laptop running 64-bit Windows 7 with 8GB RAM.
Here is a summary of what the blogs, news and twitter files contain :
The amount of data is huge relative to the processing capability of the computer used. A major challenge is to find an optimal sample size so that :
By increasing the amount of data through trial and error, an attempt is currently being made to process between 15k and 20k lines of data from each of the files. Combined, this gives between 45k and 60k lines in total.
Moreover, the files contain various non-ASCII and foreign-language characters that need to be removed before processing can happen.
A function to load the files is first written :
ReadFile<-function(filename, nlines,encoding_option)
{
filehnd<-file(filename,open="r")
filelines<-readLines(filehnd,n=nlines,encoding=encoding_option)
close(filehnd)
return(filelines)
}
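As a quick check of how ReadFile behaves, the sketch below repeats its definition so that it runs on its own, writes three lines to a temporary file (the file and its contents are made up for illustration) and reads back only the first two.

```r
# Same definition as in the report, repeated so this sketch is self-contained.
ReadFile <- function(filename, nlines, encoding_option) {
  filehnd <- file(filename, open = "r")
  filelines <- readLines(filehnd, n = nlines, encoding = encoding_option)
  close(filehnd)
  return(filelines)
}

# Hypothetical sample file, created only for this illustration.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)

first_two <- ReadFile(tmp, 2, "UTF-8")  # read at most 2 lines
unlink(tmp)                              # remove the temporary file
print(first_two)
```

The nlines argument caps how many lines are read, which is what keeps the sample size under control for the large source files.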
A function to remove character encodings is then written :
RemoveCharEncodings<-function(lines)
{
cnv<-iconv(lines,"latin1","ASCII","byte")
cnv<-gsub("<[a-z0-9][a-z0-9]>+","",cnv)
cnv<-gsub("\u0097","",cnv)
return(cnv)
}
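The effect of RemoveCharEncodings can be seen on a single made-up line. The sketch below repeats the function so that it runs on its own; the input string is an assumption chosen for illustration ("caf" followed by the latin1 byte for e-acute). iconv with sub = "byte" first rewrites each non-ASCII byte as a hex marker such as <e9>, and the regular expression then deletes those markers.

```r
# Same definition as in the report, repeated so this sketch is self-contained.
RemoveCharEncodings <- function(lines) {
  cnv <- iconv(lines, "latin1", "ASCII", "byte")
  cnv <- gsub("<[a-z0-9][a-z0-9]>+", "", cnv)
  cnv <- gsub("\u0097", "", cnv)
  return(cnv)
}

# Made-up input: the bytes for "caf" plus 0xE9 (e-acute in latin1).
sample_line <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Intermediate step: the non-ASCII byte becomes a visible hex marker.
marked <- iconv(sample_line, "latin1", "ASCII", "byte")

# Full function: the marker is stripped, leaving only ASCII text.
cleaned <- RemoveCharEncodings(sample_line)
print(c(marked, cleaned))
```

One caveat of this design is that the pattern cannot tell a generated marker from literal text of the same shape, so a string such as "<b1>" that was already present in a line would be removed as well.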
The files are then read and the non-ASCII characters removed.
READLINES=20000
filehnd<-file("../final/en_US/en_US.blogs.txt",open="r")
# Read the file in UTF-8 format so that various control/escape characters can be decoded correctly.
# lines_blogs now contains a character vector of the text lines read from the file.
lines_blogs<-readLines(filehnd,n=READLINES,encoding="UTF-8")
close(filehnd)
lines_blogs<-RemoveCharEncodings(lines_blogs)
filehnd<-file("../final/en_US/en_US.news.txt",open="r")
lines_news<-readLines(filehnd,n=READLINES,encoding="UTF-8")
close(filehnd)
lines_news<-RemoveCharEncodings(lines_news)
filehnd<-file("../final/en_US/en_US.twitter.txt",open="r")
lines_twitter<-readLines(filehnd,n=READLINES)
close(filehnd)
lines_twitter<-RemoveCharEncodings(lines_twitter)
In this context, a total of 20,000 lines are read from each file.
The lines are then combined and formed into a corpus for the purpose of preprocessing the text.
library(tm)
## Loading required package: NLP
lines<-c(lines_blogs, lines_news,lines_twitter)
#Creates the corpus so that the text can be cleaned
corpus<-Corpus(VectorSource(lines))
After the corpus is created, preprocessing of the text can proceed. The following operations are performed :
removeChar <- content_transformer(function(x, pattern) {return (gsub(pattern, "", x))})
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
toAprostrophe <- content_transformer(function(x, pattern) {return (gsub(pattern, "'", x))})
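# Replace an isolated single letter with one space so that the neighbouring words are not joined together.
removeSingleChar <- content_transformer(function(x) {return (gsub(" [a-zA-Z] ", " ", x))})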
orphan_words<-c("\n't","\'re","\'m","\'s","\'ve","\'d","\'ll")
CleanText<-function(corpus)
{
profanities_filehnd <- file('../final/en_US/profanities.txt',open="r")
profanities<-readLines(profanities_filehnd)
close(profanities_filehnd)
corpus<-tm_map(corpus,removePunctuation)
corpus<-tm_map(corpus, toSpace, "“")
corpus<-tm_map(corpus, toSpace, "”")
# corpus<-tm_map(corpus, toSpace, "’")
corpus<-tm_map(corpus, toSpace, "‘")
corpus<-tm_map(corpus, toSpace, " -")
corpus<-tm_map(corpus, toSpace, "-")
corpus<-tm_map(corpus, toSpace, ":")
corpus<-tm_map(corpus, toSpace, "=")
corpus<-tm_map(corpus, removeNumbers)
corpus<-tm_map(corpus, content_transformer(tolower))
corpus<-tm_map(corpus, removeWords, profanities)
corpus<-tm_map(corpus, stripWhitespace)
corpus<-tm_map(corpus, removeWords, stopwords("english"))
corpus<-tm_map(corpus, removeWords, orphan_words)
corpus<-tm_map(corpus, removeSingleChar)
corpus<-tm_map(corpus, stripWhitespace)
return(corpus)
}
corpus<-CleanText(corpus)
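To see what the main cleaning steps do to a single line, the following is a minimal base-R approximation. It is not the tm-based CleanText used in the report: clean_line is a hypothetical helper, the input sentence is made up, and the profanity and stop-word removal steps are left out.

```r
# Hypothetical helper that approximates part of CleanText with base R only.
clean_line <- function(x) {
  x <- gsub("[[:punct:]]+", "", x)    # roughly what removePunctuation does
  x <- gsub("[[:digit:]]+", "", x)    # roughly what removeNumbers does
  x <- tolower(x)                     # lower-case everything
  x <- gsub("[[:space:]]+", " ", x)   # roughly what stripWhitespace does
  trimws(x)                           # extra step: drop leading/trailing spaces
}

cleaned <- clean_line("I'm sure it's 2016 -- don't wait!")
print(cleaned)
```

This also suggests why tokens such as "im" and "dont" appear in the frequency tables later in the report: removing punctuation deletes the apostrophe and leaves the rest of the word in place.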
The corpus now contains words that can be sent into the training model for prediction training.
The corpus can be converted into a data frame in which each text line is a row. Tokens (1-gram, 2-gram and 3-gram) can then be extracted from the text, and the token lists can be sorted and analyzed for word frequencies.
This approach allows a relatively large data set (45K-60K lines of text) to be analyzed without running into the memory limits of R.
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.4
text<-unlist(sapply(corpus, '[',"content"))
text_df <- data.frame(text,stringsAsFactors=FALSE)
onetoken<-NGramTokenizer(text_df, Weka_control(min = 1, max = 1))
twotoken<-NGramTokenizer(text_df, Weka_control(min = 2, max = 2))
threetoken<-NGramTokenizer(text_df, Weka_control(min = 3, max = 3))
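The idea behind n-gram extraction can be illustrated without RWeka. The sketch below is a base-R stand-in, not the NGramTokenizer implementation: make_ngrams is a hypothetical helper that splits one cleaned line on spaces and pastes together every run of n consecutive words.

```r
# Hypothetical helper: all n-grams of one line, in order of appearance.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " +")[[1]]
  words <- words[nzchar(words)]             # drop empty strings from stray spaces
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)  # start index of each n-gram
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

two_grams   <- make_ngrams("new york city last year", 2)
three_grams <- make_ngrams("new york city last year", 3)
print(two_grams)
print(three_grams)
```

A line of w words yields w - n + 1 n-grams, so higher-order n-grams are only slightly fewer in number but far more varied.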
It is now possible to visualize the frequency of the most common words in the token lists. The frequencies of the 20 most common terms are plotted in a bar chart, and the line graph shows the cumulative frequencies of the 4000 most common terms.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.4
plotTokens<-function(df, ntokhist, ntokaccum)
{
print("Plotting token statistics ")
print(as.character(Sys.time()))
colnames(df)<-c("token", "Freq")
ord<-order(df$Freq,decreasing=TRUE)
df<-df[ord,]
df$cumSum<-cumsum(df$Freq)
df$token<-factor(df$token,levels=df$token)
p<-ggplot(df[1:ntokhist,],aes(x=as.factor(token),y=Freq))
p<-p+geom_bar(stat="identity")
p1<-ggplot(df[1:ntokaccum,],aes(x=token,y=cumSum))
p1<-p1+geom_line(group=1)
grid.arrange(p, p1, ncol=2)
print("Plot done")
print(as.character(Sys.time()))
return(df)
}
onetoken_df<-data.frame(table(onetoken))
onetoken_df<-plotTokens(onetoken_df,20,4000)
## [1] "Plotting token statistics "
## [1] "2016-03-19 07:39:07"
## [1] "Plot done"
## [1] "2016-03-19 07:39:11"
The following shows the top 10 single-word tokens in the corpus and their respective frequencies.
onetoken_df[1:10,]
##       token Freq cumSum
## 58589  said 5881   5881
## 75082  will 5428  11309
## 47838   one 5063  16372
## 35541  just 4629  21001
## 38702  like 4203  25204
## 9687    can 4123  29327
## 68817  time 3693  33020
## 26882   get 3313  36333
## 32533    im 3224  39557
## 45767   new 3127  42684
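The table above is produced by three data steps inside plotTokens: tabulate the tokens, sort by decreasing frequency, and accumulate the counts. The toy example below repeats those steps on a made-up token vector, leaving out the plotting.

```r
# Made-up tokens, used only to illustrate the data preparation.
tokens <- c("new", "york", "new", "city", "new", "york")

df <- data.frame(table(tokens))            # one row per unique token
colnames(df) <- c("token", "Freq")
df <- df[order(df$Freq, decreasing = TRUE), ]
df$cumSum <- cumsum(df$Freq)               # running total of occurrences
print(df)
```

Because table() lists tokens alphabetically and the row labels are kept when the rows are re-ordered, the label in front of each token is its alphabetical position among the unique tokens, not its rank.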
The following plots show the bar chart and the cumulative frequency plot for two-word tokens.
twotoken_df<-data.frame(table(twotoken))
twotoken_df<-plotTokens(twotoken_df,20,4000)
## [1] "Plotting token statistics "
## [1] "2016-03-19 07:39:27"
## [1] "Plot done"
## [1] "2016-03-19 07:39:33"
twotoken_df[1:10,]
##              token Freq cumSum
## 362843   last year  373    373
## 447145    new york  364    737
## 562154   right now  327   1064
## 184499   dont know  308   1372
## 303099 high school  252   1624
## 765506   years ago  247   1871
## 240246  first time  224   2095
## 362826   last week  221   2316
## 231221   feel like  206   2522
## 362644  last night  199   2721
The plots for three-word tokens are given below.
threetoken_df<-data.frame(table(threetoken))
threetoken_df<-plotTokens(threetoken_df,20,4000)
## [1] "Plotting token statistics "
## [1] "2016-03-19 07:39:54"
## [1] "Plot done"
## [1] "2016-03-19 07:40:00"
threetoken_df[1:10,]
##                         token Freq cumSum
## 547231          new york city   45     45
## 113549          cant wait see   38     83
## 352088      happy mothers day   33    116
## 451523            let us know   29    145
## 547386         new york times   29    174
## 386894         im pretty sure   28    202
## 864446          two years ago   28    230
## 631971 president barack obama   27    257
## 285231       first time since   24    281
## 273395           feel like im   21    302
The number of unique tokens required to cover 50% and 90% of all token occurrences in the corpus can then be found.
showPercentile<-function(df,pct)
{
len<-nrow(df)
numwords<-sum(df$Freq)
print(paste("Number of unique words =", len))
print(paste("Number of words = ", numwords))
df$pctl<-(df$cumSum/numwords)*100
det<-which(df$pctl>pct)
return(det[1])
}
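The coverage calculation in showPercentile can be followed by hand on a small example. The frequencies below are made up; the steps (cumulative sum, conversion to a percentage, first index above the threshold) mirror the function.

```r
# Made-up token frequencies, already sorted in decreasing order.
freq <- c(the = 50, of = 30, cat = 10, dog = 6, emu = 4)

# Cumulative share of all occurrences, in percent.
pctl <- cumsum(freq) / sum(freq) * 100

# Number of top tokens needed to exceed 50% coverage (strict ">", as in the report).
n_strict <- which(pctl > 50)[1]

# With ">=" instead, a token that lands exactly on the threshold would count.
n_inclusive <- which(pctl >= 50)[1]

print(c(n_strict, n_inclusive))
```

Here the first token alone reaches exactly 50%, so the strict comparison asks for two tokens while the inclusive one asks for one. On a corpus of this size the difference is at most one token.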
For 1-token :
numwords1tok<-nrow(onetoken_df)
onetoken50<-showPercentile(onetoken_df,50)
## [1] "Number of unique words = 77242"
## [1] "Number of words = 959851"
print(onetoken50)
## [1] 1063
onetoken90<-showPercentile(onetoken_df,90)
## [1] "Number of unique words = 77242"
## [1] "Number of words = 959851"
print(onetoken90)
## [1] 16306
It can be seen that a mere 1063 of the 77242 unique 1-token words are sufficient to cover 50% of the 959851 word occurrences in the corpus, whereas covering 90% requires 16306 words.
For 2-token :
numwords2tok<-nrow(twotoken_df)
twotoken50<-showPercentile(twotoken_df,50)
## [1] "Number of unique words = 770882"
## [1] "Number of words = 959850"
print(twotoken50)
## [1] 290958
twotoken90<-showPercentile(twotoken_df,90)
## [1] "Number of unique words = 770882"
## [1] "Number of words = 959850"
print(twotoken90)
## [1] 674898
From the above results, 290958 of the 770882 unique 2-word tokens are required to cover 50% of the 959850 2-word token occurrences in the corpus. To cover 90%, 674898 tokens are needed.
For 3-token :
numwords3tok<-nrow(threetoken_df)
threetoken50<-showPercentile(threetoken_df,50)
## [1] "Number of unique words = 947372"
## [1] "Number of words = 959849"
print(threetoken50)
## [1] 467448
threetoken90<-showPercentile(threetoken_df,90)
## [1] "Number of unique words = 947372"
## [1] "Number of words = 959849"
print(threetoken90)
## [1] 851388
From the results above, 467448 of the 947372 unique 3-word tokens are required to cover 50% of the 959849 3-word token occurrences in the corpus. To cover 90%, 851388 tokens are needed.
It can be observed that two-gram and three-gram tokens require far more unique tokens to reach the same coverage, because most of them occur only once in the sample. This is even more pronounced for three-grams than for two-grams.
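One way to quantify this is the share of unique tokens among all token occurrences, computed from the counts printed above. The closer this share is to 100%, the fewer tokens are repeated.

```r
# Counts taken from the showPercentile output in this report.
unique_tokens <- c(one = 77242, two = 770882, three = 947372)
total_tokens  <- c(one = 959851, two = 959850, three = 959849)

# Unique tokens as a percentage of all token occurrences.
unique_share <- round(unique_tokens / total_tokens * 100, 1)
print(unique_share)
```

Roughly 8% of single-word occurrences are distinct words, against about 80% for two-grams and almost 99% for three-grams, which is why the cumulative curves for the longer n-grams rise so slowly.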
An alternative approach to exploratory data analysis forms the DocumentTermMatrix or its transpose (TermDocumentMatrix) from the corpus. The process of N-gram tokenization is built into the formation of the matrix. For large data sets like the one used in this project, it is not possible to simply convert the DocumentTermMatrix into an ordinary matrix for analysis. What should probably be done is to select the k most commonly occurring terms and find their correlations/associations within the corpus. Even with this approach, the time taken was found to be prohibitively long: it took hours to find the associations for 400+ 1-gram terms within a corpus comprising 10000 lines from each file (blogs, news and twitter).
The text data in the corpus will be separated into a training set and a test set, and a prediction model will be built using the training data. It is envisaged that the two-gram tokens can be used in training such that, given the first word, the predicted output is the second word. After the model is created, the test data will be used to evaluate the accuracy of the prediction. If time permits, the three-gram tokens may also be used for training, so that the first two words are used to predict the next word.
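The two-gram idea can be sketched as a simple lookup. The following is only one possible reading of the plan, not the final model: predict_next is a hypothetical helper, and the small table reuses five two-gram counts from the results above in place of a real training set.

```r
# Five two-gram counts copied from the frequency table in this report.
bigrams <- data.frame(
  token = c("last year", "new york", "right now", "last week", "last night"),
  Freq  = c(373, 364, 327, 221, 199),
  stringsAsFactors = FALSE
)

# Split each two-gram into its first and second word.
parts <- strsplit(bigrams$token, " ")
bigrams$first  <- vapply(parts, `[`, character(1), 1)
bigrams$second <- vapply(parts, `[`, character(1), 2)

# Hypothetical helper: most frequent second word seen after `word`,
# or NA when the word never starts a two-gram in the model.
predict_next <- function(word, model) {
  cand <- model[model$first == word, ]
  if (nrow(cand) == 0) return(NA_character_)
  cand$second[which.max(cand$Freq)]
}

print(predict_next("last", bigrams))
print(predict_next("zebra", bigrams))
```

A lookup like this returns nothing for words it has never seen as a first word, so a practical model would need some fallback, for example the most frequent single words.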
At this time, there is no plan to do prediction for word auto-completion (i.e. the user types the first few characters of a word and the prediction algorithm attempts to guess which word the user is trying to type).
The words are not stemmed during preprocessing, so stemming will also be a starting point in building the prediction algorithm.