Data Science Specialization Capstone – Milestone Report

This report summarizes the work done so far for the capstone project. The following sections describe how the data for the project is loaded, analyzed and visualized. At the end, a plan for the prediction algorithm is proposed.

The data for the project consists of text corpora in different languages, gathered from blogs, news and tweets. In this report the folder containing English text is used.

The work here is done on a Core i7 laptop running 64-bit Windows 7 with 8GB RAM.

1. Loading the text

Here is a summary of what the blogs, news and twitter files contain :

  1. en_US.blogs.txt (899288 lines, 248.5 MB)
  2. en_US.news.txt (77259 lines, 19.2 MB)
  3. en_US.twitter.txt (2360148 lines, 301.4 MB)

The amount of data is huge relative to the processing capability of the computer used. A major challenge is to find an optimal sample size so that :

  1. R does not run out of memory
  2. The processing completes in a reasonable amount of time
  3. The results are meaningful

The sample size was increased by trial and error. The current attempt processes between 15k and 20k lines from each of the files, which gives between 45k and 60k lines in total.
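
One way to keep the sample representative as it grows is to draw lines at random rather than take them from the top of each file. The following is a minimal sketch and not the procedure used in this report: SampleLines is a hypothetical helper, and the seed and the toy input are assumptions chosen only for illustration.

```r
# Hypothetical helper: draw a reproducible random sample of n lines,
# keeping the sampled lines in their original order.
SampleLines <- function(lines, n, seed = 1234) {
  set.seed(seed)
  n <- min(n, length(lines))          # never ask for more lines than exist
  lines[sort(sample.int(length(lines), n))]
}

toy <- paste("line", 1:100)           # stand-in for the lines of one file
smp <- SampleLines(toy, 10)
print(length(smp))
# Rough memory footprint of the sample before it is turned into a corpus
print(format(object.size(smp), units = "Kb"))
```

Checking object.size on a few candidate sample sizes gives an early indication of whether R is likely to run out of memory.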

Moreover, the files contain various non-ASCII and foreign-language characters that need to be removed before processing can begin.

A function to load the files is first written :

ReadFile<-function(filename, nlines,encoding_option)
{
    filehnd<-file(filename,open="r")
    filelines<-readLines(filehnd,n=nlines,encoding=encoding_option)
    close(filehnd)
    return(filelines)
}
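
As a quick check of how ReadFile behaves, the sketch below writes a small temporary file and reads back only its first two lines. The temporary file and its contents are assumptions made for illustration; the function body is repeated so that the example runs on its own.

```r
# ReadFile as defined above, repeated so the example is self-contained
ReadFile <- function(filename, nlines, encoding_option) {
  filehnd <- file(filename, open = "r")
  filelines <- readLines(filehnd, n = nlines, encoding = encoding_option)
  close(filehnd)
  return(filelines)
}

# Hypothetical three-line input file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)

first_two <- ReadFile(tmp, 2, "UTF-8")   # only the first 2 lines are read
print(first_two)
unlink(tmp)                              # remove the temporary file
```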

A function to strip the non-ASCII characters is then written :

RemoveCharEncodings<-function(lines)
{
  cnv<-iconv(lines,"latin1","ASCII","byte")
  cnv<-gsub("<[a-z0-9][a-z0-9]>+","",cnv)
  cnv<-gsub("\u0097","",cnv)
  return(cnv)
}
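
The two main steps of RemoveCharEncodings can be followed on a single string. In this sketch the input bytes are an assumption chosen for illustration: the word "café" stored in latin1, followed by " latte". With sub="byte", iconv replaces every byte it cannot convert with a marker of the form <xx>, and the gsub call then deletes those markers.

```r
# Build "caf\xe9 latte" from raw bytes so that the example does not depend
# on the encoding of this source file (0xe9 is e-acute in latin1).
raw_line <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x20,
                               0x6c, 0x61, 0x74, 0x74, 0x65)))

# Step 1: convert to ASCII; the unconvertible byte becomes a hex marker
marked <- iconv(raw_line, "latin1", "ASCII", "byte")

# Step 2: delete the markers, leaving only ASCII text
cleaned <- gsub("<[a-z0-9][a-z0-9]>+", "", marked)

print(marked)
print(cleaned)
```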

The files are then read and the non-ASCII characters are removed.

READLINES=20000
# Read the files in UTF-8 format so that various control/escape characters can be decoded correctly.
# Each lines_* variable holds a character vector of the text lines read from one file.
lines_blogs<-ReadFile("../final/en_US/en_US.blogs.txt",READLINES,"UTF-8")
lines_blogs<-RemoveCharEncodings(lines_blogs)

lines_news<-ReadFile("../final/en_US/en_US.news.txt",READLINES,"UTF-8")
lines_news<-RemoveCharEncodings(lines_news)

# The twitter file is read with the default encoding of readLines ("unknown").
lines_twitter<-ReadFile("../final/en_US/en_US.twitter.txt",READLINES,"unknown")
lines_twitter<-RemoveCharEncodings(lines_twitter)

Here, the first 20,000 lines are read from each file, for a total of 60,000 lines.

The lines are then combined and formed into a corpus for the purpose of preprocessing the text.

library(tm)
## Loading required package: NLP
lines<-c(lines_blogs, lines_news,lines_twitter)
#Creates the corpus so that the text can be cleaned
corpus<-Corpus(VectorSource(lines))

2. Preprocessing

After the corpus is created, preprocessing of the text can proceed. The following operations are performed :

  1. Remove punctuation.
  2. Remove numerals.
  3. Convert all characters to lowercase.
  4. Remove profanities.
  5. Remove common English words such as “the” and “we”.
  6. Remove “orphan words” such as ’ll, ’re and ’m.
  7. Remove all single-character words.

removeChar <- content_transformer(function(x, pattern) {return (gsub(pattern, "", x))})
toSpace    <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
toAprostrophe <- content_transformer(function(x, pattern) {return (gsub(pattern, "'", x))})
# Replace an isolated single letter with one space so that its neighbouring words stay separate.
removeSingleChar <- content_transformer(function(x) {return (gsub(" [a-zA-Z] ", " ", x))})
orphan_words<-c("n't","'re","'m","'s","'ve","'d","'ll")
CleanText<-function(corpus)
{
  profanities_filehnd <- file('../final/en_US/profanities.txt',open="r")
  profanities<-readLines(profanities_filehnd)
  close(profanities_filehnd)
  
  corpus<-tm_map(corpus,removePunctuation)
  corpus<-tm_map(corpus, toSpace, "“")
  corpus<-tm_map(corpus, toSpace, "”")
#  corpus<-tm_map(corpus, toSpace, "’")
  corpus<-tm_map(corpus, toSpace, "‘")
  corpus<-tm_map(corpus, toSpace, " -")
  corpus<-tm_map(corpus, toSpace, "-")
  corpus<-tm_map(corpus, toSpace, ":")
  corpus<-tm_map(corpus, toSpace, "=")
  corpus<-tm_map(corpus, removeNumbers)
  corpus<-tm_map(corpus, content_transformer(tolower))
  corpus<-tm_map(corpus, removeWords, profanities)
  corpus<-tm_map(corpus, stripWhitespace)
  corpus<-tm_map(corpus, removeWords, stopwords("english"))
  corpus<-tm_map(corpus, removeWords, orphan_words)
  corpus<-tm_map(corpus, removeSingleChar)
  corpus<-tm_map(corpus, stripWhitespace)
  return(corpus)
}

corpus<-CleanText(corpus)

The corpus now contains cleaned text that can be used to train the prediction model.
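
To make the effect of the cleaning steps concrete, the sketch below applies roughly the same sequence to one sentence using base R only. It is an approximation and not the tm pipeline itself: CleanLine is a hypothetical helper, and the three-word drop list is an assumption standing in for the full stopword and profanity lists.

```r
# Hypothetical base-R approximation of the cleaning steps for a single line
CleanLine <- function(x, drop_words) {
  x <- gsub("[[:punct:]]+", "", x, perl = TRUE)        # remove punctuation
  x <- gsub("[[:digit:]]+", "", x, perl = TRUE)        # remove numerals
  x <- tolower(x)                                      # convert to lowercase
  pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)               # remove listed words
  x <- gsub("\\b[a-z]\\b", "", x, perl = TRUE)         # remove single letters
  x <- gsub("\\s+", " ", x, perl = TRUE)               # collapse whitespace
  trimws(x)
}

cleaned_line <- CleanLine("I can't wait to see the 3 new films!",
                          c("i", "to", "the"))
print(cleaned_line)
```

Because punctuation is removed first, "can't" becomes "cant", which is why forms such as "dont" and "im" appear among the tokens.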

The corpus can be converted into a data frame in which each text line is a row. Tokens (1-gram, 2-gram and 3-gram) can then be extracted from the text, and the token lists can be sorted and analyzed for word frequencies.

This approach allows a relatively large data set (45K-60K lines of text) to be analyzed without running into the memory limits of R.

library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.4
text<-unlist(sapply(corpus, '[',"content"))
text_df <- data.frame(text,stringsAsFactors=FALSE)
onetoken<-NGramTokenizer(text_df, Weka_control(min = 1, max = 1))
twotoken<-NGramTokenizer(text_df, Weka_control(min = 2, max = 2))
threetoken<-NGramTokenizer(text_df, Weka_control(min = 3, max = 3))
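
For readers without RWeka, the idea of n-gram extraction can be sketched in base R. MakeNGrams below is a hypothetical helper, not the tokenizer used in this report; it assumes words are separated by spaces and it works on one line at a time, so it never forms an n-gram across a line boundary.

```r
# Hypothetical helper: split one cleaned line into overlapping n-grams
MakeNGrams <- function(line, n) {
  words <- strsplit(line, " +")[[1]]
  words <- words[nzchar(words)]                  # drop empty strings
  if (length(words) < n) return(character(0))    # line too short for an n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams  <- MakeNGrams("cant wait see new films", 2)
trigrams <- MakeNGrams("cant wait see new films", 3)
print(bigrams)
print(trigrams)
```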

3. Visualizing the Corpus

It is now possible to visualize the frequency of the most common words in the token lists. The frequencies of the 20 most common terms are plotted in a bar chart, and the line graph shows the cumulative frequencies of the first 4000 terms.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.4
plotTokens<-function(df, ntokhist, ntokaccum)
{
  print("Plotting token statistics ")
  print(as.character(Sys.time()))
  colnames(df)<-c("token", "Freq")
  ord<-order(df$Freq,decreasing=TRUE)
  df<-df[ord,]
  df$cumSum<-cumsum(df$Freq)
  df$token<-factor(df$token,levels=df$token)
  p<-ggplot(df[1:ntokhist,],aes(x=as.factor(token),y=Freq))
  p<-p+geom_bar(stat="identity")
  p1<-ggplot(df[1:ntokaccum,],aes(x=token,y=cumSum))
  p1<-p1+geom_line(group=1)
  grid.arrange(p, p1, ncol=2) 
  print("Plot done")
  print(as.character(Sys.time()))
  return(df)
}
onetoken_df<-data.frame(table(onetoken))
onetoken_df<-plotTokens(onetoken_df,20,4000)
## [1] "Plotting token statistics "
## [1] "2016-03-19 07:39:07"

## [1] "Plot done"
## [1] "2016-03-19 07:39:11"

The following shows the top 10 single-word tokens in the corpus and their respective frequencies.

onetoken_df[1:10,]
##       token Freq cumSum
## 58589  said 5881   5881
## 75082  will 5428  11309
## 47838   one 5063  16372
## 35541  just 4629  21001
## 38702  like 4203  25204
## 9687    can 4123  29327
## 68817  time 3693  33020
## 26882   get 3313  36333
## 32533    im 3224  39557
## 45767   new 3127  42684
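
The Freq and cumSum columns above come from counting the tokens, sorting the counts in decreasing order and taking a running total. A minimal sketch on a toy token vector (the six tokens are an assumption made for illustration):

```r
# Toy token list standing in for onetoken
tokens <- c("new", "york", "new", "time", "new", "york")

df <- data.frame(table(tokens))            # one row per unique token
colnames(df) <- c("token", "Freq")
df <- df[order(df$Freq, decreasing = TRUE), ]
df$cumSum <- cumsum(df$Freq)               # running total of occurrences
print(df)
```

The last value of cumSum always equals the total number of tokens, which is what the coverage calculation later relies on.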

The following plots show the 20 most frequent two-word tokens as a bar chart and the cumulative frequencies of the first 4000 two-word tokens.

twotoken_df<-data.frame(table(twotoken))
twotoken_df<-plotTokens(twotoken_df,20,4000)
## [1] "Plotting token statistics "
## [1] "2016-03-19 07:39:27"

## [1] "Plot done"
## [1] "2016-03-19 07:39:33"
twotoken_df[1:10,]
##              token Freq cumSum
## 362843   last year  373    373
## 447145    new york  364    737
## 562154   right now  327   1064
## 184499   dont know  308   1372
## 303099 high school  252   1624
## 765506   years ago  247   1871
## 240246  first time  224   2095
## 362826   last week  221   2316
## 231221   feel like  206   2522
## 362644  last night  199   2721

The plots for three-word tokens are given below.

threetoken_df<-data.frame(table(threetoken))
threetoken_df<-plotTokens(threetoken_df,20,4000)
## [1] "Plotting token statistics "
## [1] "2016-03-19 07:39:54"

## [1] "Plot done"
## [1] "2016-03-19 07:40:00"
threetoken_df[1:10,]
##                         token Freq cumSum
## 547231          new york city   45     45
## 113549          cant wait see   38     83
## 352088      happy mothers day   33    116
## 451523            let us know   29    145
## 547386         new york times   29    174
## 386894         im pretty sure   28    202
## 864446          two years ago   28    230
## 631971 president barack obama   27    257
## 285231       first time since   24    281
## 273395           feel like im   21    302

The number of unique tokens required to cover 50% and 90% of all token occurrences in the corpus can then be found.

showPercentile<-function(df,pct)
{
  len<-nrow(df)
  numwords<-sum(df$Freq)
  print(paste("Number of unique words =", len))
  print(paste("Number of words = ", numwords))
  df$pctl<-(df$cumSum/numwords)*100
  det<-which(df$pctl>pct)
  return(det[1])
}
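
The calculation can be checked by hand on a small example. CoverageIndex below is a hypothetical, print-free variant of the same idea that takes a plain vector of counts; the seven counts are an assumption made for illustration.

```r
# Hypothetical helper: how many of the most frequent tokens are needed
# before their combined count exceeds pct percent of all occurrences
CoverageIndex <- function(freq, pct) {
  freq <- sort(freq, decreasing = TRUE)
  covered <- cumsum(freq) / sum(freq) * 100   # cumulative coverage in percent
  which(covered > pct)[1]
}

freq <- c(8, 5, 3, 2, 1, 1, 1)    # 21 occurrences spread over 7 unique tokens
idx50 <- CoverageIndex(freq, 50)  # cumulative coverage: 38.1%, 61.9%, ...
idx90 <- CoverageIndex(freq, 90)
print(c(idx50, idx90))
```

Here the two most frequent tokens already pass 50% coverage, while five of the seven are needed to pass 90%.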

For 1-token :

numwords1tok<-nrow(onetoken_df)
onetoken50<-showPercentile(onetoken_df,50)
## [1] "Number of unique words = 77242"
## [1] "Number of words =  959851"
print(onetoken50)
## [1] 1063
onetoken90<-showPercentile(onetoken_df,90)
## [1] "Number of unique words = 77242"
## [1] "Number of words =  959851"
print(onetoken90)
## [1] 16306

It can be seen that a mere 1063 of the 77242 unique words are sufficient to cover 50% of the 959851 word occurrences in the corpus, whereas covering 90% requires 16306 words.

For 2-token :

numwords2tok<-nrow(twotoken_df)
twotoken50<-showPercentile(twotoken_df,50)
## [1] "Number of unique words = 770882"
## [1] "Number of words =  959850"
print(twotoken50)
## [1] 290958
twotoken90<-showPercentile(twotoken_df,90)
## [1] "Number of unique words = 770882"
## [1] "Number of words =  959850"
print(twotoken90)
## [1] 674898

From the above results, 290958 of the 770882 unique 2-word tokens are required to cover 50% of the 959850 2-word token occurrences in the corpus. To cover 90%, 674898 tokens are needed.

For 3-token :

numwords3tok<-nrow(threetoken_df)
threetoken50<-showPercentile(threetoken_df,50)
## [1] "Number of unique words = 947372"
## [1] "Number of words =  959849"
print(threetoken50)
## [1] 467448
threetoken90<-showPercentile(threetoken_df,90)
## [1] "Number of unique words = 947372"
## [1] "Number of words =  959849"
print(threetoken90)
## [1] 851388

From the results above, 467448 of the 947372 unique 3-word tokens are required to cover 50% of the 959849 3-word token occurrences in the corpus. To cover 90%, 851388 tokens are needed.

It can be observed that two-word and three-word tokens need a far larger share of their unique tokens to reach the same coverage, three-word tokens even more so than two-word tokens, because most of them occur only once in the corpus.

An alternative approach to exploratory data analysis is to form the DocumentTermMatrix or its transpose (TermDocumentMatrix) from the corpus. The process of N-gram tokenization is built into the formation of the matrix. For large data sets like the one used in this project, it is not possible to simply convert the DocumentTermMatrix into an ordinary matrix for analysis. What should probably be done is to select the k most commonly occurring terms and find their correlations/associations within the corpus. Even with this approach the time taken was found to be prohibitively long: it took hours to find the associations for 400+ 1-gram terms within a corpus comprising 10000 lines from each file (blogs, news and twitter).

4. Proposal for Text Prediction

The text data in the corpus will be separated into a training set and a test set, and a prediction model will be built using the training data. It is envisaged that the two-word tokens can be used for training such that, given the first word, the predicted output is the second word. After the model is created, the test data will be used to evaluate the accuracy of the prediction. If time permits, the three-word tokens may also be used for training, so that the first two words are used to predict the next word.
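
The two-word idea can be sketched as a simple lookup table: count every (first word, second word) pair in the training lines and predict the second word seen most often after a given first word. This is only an illustration of the proposal under stated assumptions, not the final algorithm: BuildBigramModel and PredictNext are hypothetical names, the four training lines are toy data, and no smoothing or back-off is attempted.

```r
# Hypothetical helper: count (first, second) word pairs in the training lines
BuildBigramModel <- function(lines) {
  pairs <- do.call(rbind, lapply(lines, function(line) {
    w <- strsplit(line, " +")[[1]]
    w <- w[nzchar(w)]
    if (length(w) < 2) return(NULL)              # no pair in a one-word line
    data.frame(first = w[-length(w)], second = w[-1],
               stringsAsFactors = FALSE)
  }))
  pairs$n <- 1
  aggregate(n ~ first + second, data = pairs, FUN = sum)
}

# Hypothetical helper: most frequent follower of a word, or NA if unseen
PredictNext <- function(model, word) {
  hits <- model[model$first == word, ]
  if (nrow(hits) == 0) return(NA_character_)
  as.character(hits$second[which.max(hits$n)])
}

train <- c("new york city", "new york times", "new year party", "last year")
model <- BuildBigramModel(train)
guess <- PredictNext(model, "new")
print(guess)
print(PredictNext(model, "zebra"))
```

Accuracy on the test set could then be estimated as the fraction of test pairs whose second word equals the prediction for their first word.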

At this time, there is no plan to implement word auto-completion (i.e. the user types the first few characters of a word and the prediction algorithm attempts to guess which word the user is trying to type).

The words are not stemmed during preprocessing, so stemming will also be a starting point in building the prediction algorithm.