Background

The ultimate goal of this Capstone Project is to build a Shiny App with an embedded algorithm that predicts the next word based on the text entered by the user.
In this Milestone Report, I present the findings from the data exploration, using a training dataset made up of a 1% random sample of the data provided to us. In addition, I briefly outline how I plan to build the prediction model.

Prepare the Corpus

The text files are quite large, so I randomly selected about 1% of the lines from each of the three files:

  1. en_US.blogs (number of lines selected: 8983)
  2. en_US.news (number of lines selected: 756)
  3. en_US.twitter (number of lines selected: 23389)

con<-file("Data/en_US/en_US.blogs.txt", "r") 
en_US_blog<-readLines(con); close(con)

con<-file("Data/en_US/en_US.twitter.txt", "r") 
en_US_twitter<-readLines(con); close(con)

con<-file("Data/en_US/en_US.news.txt", "r") 
en_US_news<-readLines(con); close(con)

## Select 1% from each of the three documents
set.seed(18082020)
indx<-rbinom(length(en_US_blog),1,prob=0.01)
sample1<-en_US_blog[indx==1]
write.table(sample1, "DocSample/en_blog_sample.txt", 
            row.names = FALSE, col.names=FALSE, quote=FALSE)

set.seed(19082020)
indx<-rbinom(length(en_US_twitter),1,prob=0.01)
sample2<-en_US_twitter[indx==1]
write.table(sample2, "DocSample/en_twitter_sample.txt", 
            row.names = FALSE, col.names=FALSE, quote=FALSE)

set.seed(19082020)
indx<-rbinom(length(en_US_news),1,prob=0.01)
sample3<-en_US_news[indx==1]
write.table(sample3, "DocSample/en_news_sample.txt", 
            row.names = FALSE, col.names=FALSE, quote=FALSE)

rm(en_US_blog, en_US_twitter, en_US_news)
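
The three sampling blocks above repeat the same steps. Below is a minimal sketch of how they could be wrapped in one helper, assuming a hypothetical function name `sample_lines` and a toy vector in place of the real files. It also contrasts the Bernoulli draw from `rbinom`, whose sample size varies around the target, with `sample`, which returns a fixed number of lines.

```r
# Hypothetical helper: keep each line independently with probability `prob`.
sample_lines <- function(lines, prob, seed) {
  set.seed(seed)
  keep <- rbinom(length(lines), size = 1, prob = prob) == 1
  lines[keep]
}

# Toy input standing in for one of the large text files.
toy_lines <- paste("line", 1:10000)

approx_sample <- sample_lines(toy_lines, prob = 0.01, seed = 18082020)
length(approx_sample)   # close to 100, but usually not exactly 100

# Alternative: draw exactly 1% of the lines without replacement.
set.seed(18082020)
exact_sample <- toy_lines[sort(sample(length(toy_lines), size = 100))]
length(exact_sample)    # always 100
```

Either approach gives a random subset. The fixed-size version is only useful if the exact number of sampled lines matters.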

The summary statistics for the three samples are given below.

sumstats<-rbind(cbind(length(sample1),length(sample2),length(sample3)),
cbind(sum(nchar(sample1)), sum(nchar(sample2)), sum(nchar(sample3))))
row.names(sumstats)<-c("number of lines","number of characters")
colnames(sumstats)<-c("blog","twitter","news")
sumstats
##                         blog twitter   news
## number of lines         8983   23389    756
## number of characters 2097726 1607935 159982

To perform text mining with the tm package, I create a corpus (a collection of documents) from the three sample files above.

library(tm)   # for text mining
library(SnowballC) # for text stemming
library(RWeka) # for n-gram tokenization

corpus<-Corpus(DirSource("DocSample/"), readerControl = list(language="en"))
summary(corpus)
##                       Length Class             Mode
## en_blog_sample.txt    2      PlainTextDocument list
## en_news_sample.txt    2      PlainTextDocument list
## en_twitter_sample.txt 2      PlainTextDocument list

Preprocessing

Preprocessing is done in eight steps, including profanity filtering (Step 6).

### Step1: remove white spaces
corpus <- tm_map(corpus, stripWhitespace)

### Step2: convert to lower case letters
corpus <- tm_map(corpus, content_transformer(tolower))

### Step3: remove stop words
corpus <- tm_map(corpus, removeWords, stopwords("en"))

### Step4: remove punctuation
corpus <- tm_map(corpus, removePunctuation)

### Step5: remove all special characters
rmschar <- content_transformer(function(x, pattern) gsub(pattern, "", x))
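# "$" and "^" are escaped so that they match the literal characters instead of acting as anchors
pattern2<-"(\\$)|(\\*)|(\\/)|(\\|)|(\\()|(\\))|(-)|(@)|(#)|(%)|(\\^)|(â)|(€)|(œ)|(“)|(~)|(&)|(™)|(¦)|(ã)|(©)|(”)|(˜)|(ð)|(•)|(ÿ)|(¥)|(‰)"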
corpus <- tm_map(corpus, rmschar, pattern2)

### Step6: profanity filtering
badword<-as.vector(unique(tolower(lexicon::profanity_alvarez)))
badword<-gsub(pattern2, "", badword)
corpus <- tm_map(corpus, removeWords, badword)

### Step7: remove all numbers
corpus <- tm_map(corpus, removeNumbers)

### Step8: apply the stemming algorithm,
### i.e. reduce each word to its stem by removing its suffixes.
corpus <- tm_map(corpus, stemDocument, language = "english")
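
To see what the steps do to a single line of text, here is a minimal base-R sketch on a toy sentence. It is a stand-in for the tm transformations, not the tm functions themselves. The sentence and the tiny stop-word list `toy_stopwords` are hypothetical, and stemming and profanity filtering are left out.

```r
# Toy sentence and a deliberately tiny stop-word list (both hypothetical).
x <- "I don't  think the 2 dogs aren't   HAPPY!!"
toy_stopwords <- c("i", "the", "don't", "aren't")

# Lower case first, so that the stop words match regardless of case.
x <- tolower(x)

# Remove stop words while apostrophes are still present, so that
# contractions such as "don't" can be matched as whole words.
stop_pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")
x <- gsub(stop_pattern, "", x, perl = TRUE)

# Remove punctuation, then digits.
x <- gsub("[[:punct:]]", "", x)
x <- gsub("[[:digit:]]", "", x)

# Removing words leaves gaps behind, so collapse runs of whitespace last.
x <- trimws(gsub("[[:space:]]+", " ", x))
x
```

The sketch collapses whitespace at the end because every removal step can leave extra spaces behind. In the pipeline above, stripWhitespace runs first, so running it once more after the last step may be worth considering.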

Making A Document-Term Matrix

A document-term matrix (DTM) is a matrix that describes how often each term occurs in a collection of documents. Rows correspond to the documents in the collection and columns correspond to terms.
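
To make the layout concrete, here is a small base-R sketch that builds such a matrix by hand. The three toy documents are hypothetical, and the code does not use the tm package.

```r
# Three toy documents (hypothetical).
docs <- c(doc1 = "good day good time",
          doc2 = "love day",
          doc3 = "good love love time")

# Split each document into words.
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary: every distinct term, in alphabetical order.
terms <- sort(unique(unlist(tokens)))

# Count each term in each document (terms x documents), then transpose
# so that rows are documents and columns are terms.
counts <- vapply(tokens,
                 function(w) as.integer(table(factor(w, levels = terms))),
                 integer(length(terms)))
toy_dtm <- t(counts)
colnames(toy_dtm) <- terms
toy_dtm

# Column sums give the overall frequency of each term.
sort(colSums(toy_dtm), decreasing = TRUE)
```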

cleandata<-corpus

# Create the dtm from the cleaned corpus: 
data_dtm<-DocumentTermMatrix(cleandata)
inspect(data_dtm)
## <<DocumentTermMatrix (documents: 3, terms: 31995)>>
## Non-/sparse entries: 44758/51227
## Sparsity           : 53%
## Maximal term length: 52
## Weighting          : term frequency (tf)
## Sample             :
##                        Terms
## Docs                    can  day  get good just like love  one time will
##   en_blog_sample.txt    988  670  908  510  911 1158  642 1331 1053 1200
##   en_news_sample.txt     57   31   60   29   31   45   10   58   48  100
##   en_twitter_sample.txt 879 1083 1411 1066 1525 1304 1193  881  846  985

The DTM needs to be converted to an ordinary matrix so that I can perform further analysis. The column sums of this matrix give the total frequency of each term across the three documents.

# Convert the dtm to a matrix and sum each column to get the term frequencies: 
datam <- sort(colSums(as.matrix(data_dtm)),decreasing=TRUE)
wordmatrix <- data.frame(word = names(datam),freq=datam)
summary(wordmatrix)
##      word                freq        
##  Length:31995       Min.   :   1.00  
##  Class :character   1st Qu.:   1.00  
##  Mode  :character   Median :   1.00  
##                     Mean   :  11.23  
##                     3rd Qu.:   3.00  
##                     Max.   :2507.00

The summary table above shows that 31,995 distinct terms were found, but at least half of them appear only once (the median frequency is 1). In addition, some garbled character sequences remain, each with a very low frequency.

tail(wordmatrix,5)
##                  word freq
## „¢                 „¢    1
## „²³\201             „²³\201    1
## †’                 †’    1
## ‡¬‡§„lolol ‡¬‡§„lolol    1
## …\235…               …\235…    1

Therefore, I decided to remove terms that appear fewer than 3 times in my data, as they are unlikely to provide useful information. The final table contains 9,869 unique terms.

# remove low-frequency terms (those appearing fewer than 3 times)
wordmatrix<-wordmatrix[wordmatrix$freq>2,]
dim(wordmatrix)
## [1] 9869    2

The figure below shows the top 20 most frequent words.

barplot(wordmatrix[1:20,]$freq, names.arg = wordmatrix[1:20,]$word,
        col ="lightblue", ylab = "Frequencies",main ="Top 20 Most Frequent Words")

n-gram Tokenization

With the help of the RWeka package, I can perform n-gram tokenization. Note that for the n-gram (n>=2) analysis I do not remove the stop words. These words appear quite often in English phrases, so it is reasonable to keep them.

For this report, I only cover 2-grams and 3-grams (the data mining process is very similar to the one above, so I do not show the R code again). For my prediction model, I may use a larger n.
A total of 640,305 2-grams and 617,456 3-grams are identified in the corpus. The figure below presents the twenty most frequent terms.
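
For readers unfamiliar with n-grams, here is a minimal base-R sketch of the idea. It is not the RWeka code used for this report. The helper `make_ngrams` and the toy sentence are hypothetical and only show how 2-grams and 3-grams are formed and counted.

```r
# Hypothetical helper: all runs of n consecutive words in one line of text.
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[[:space:]]+")[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_text <- "i want to go to the park and i want to sleep"
bigrams  <- make_ngrams(toy_text, 2)
trigrams <- make_ngrams(toy_text, 3)

# Frequency tables, most frequent first.
head(sort(table(bigrams), decreasing = TRUE), 3)
head(sort(table(trigrams), decreasing = TRUE), 3)
```

Stop words such as "to" and "the" are kept here, which is what lets phrases like "i want to" show up as frequent 3-grams.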

Model Prediction Plan

The final frequency table from the n-gram analysis will be used for model building, with the first (n-1) words of each n-gram as the predictors and the last word as the outcome.
I plan to try different model building methods and use cross validation to find the model with the highest accuracy.
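
As an illustration of the predictor/outcome split, here is a minimal sketch under stated assumptions. The trigram counts are made up, and `predict_next` is a hypothetical lookup that returns the most frequent outcome for a context. It is the simplest possible baseline, not the final model.

```r
# Toy trigrams with counts (hypothetical data, not taken from the corpus).
trigram_counts <- data.frame(
  ngram = c("i want to", "i want a", "want to go", "want to sleep", "want to be"),
  freq  = c(12, 5, 7, 3, 9),
  stringsAsFactors = FALSE
)

# Split each n-gram into its first (n-1) words and its last word.
trigram_counts$predictor <- sub(" [^ ]+$", "", trigram_counts$ngram)
trigram_counts$outcome   <- sub("^.* ", "", trigram_counts$ngram)

# Hypothetical lookup: the most frequent outcome seen after a given context.
predict_next <- function(context, counts) {
  hits <- counts[counts$predictor == context, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$outcome[which.max(hits$freq)]
}

predict_next("want to", trigram_counts)     # "be"
predict_next("i want", trigram_counts)      # "to"
predict_next("never seen", trigram_counts)  # NA: context not in the table
```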

Problems Encountered

It takes quite a long time to run the preprocessing code on the entire data set (in this report I only use about 1% of the data). I need to find a way to make it more efficient.

For words that are not included in the n-gram table, my model will not be able to provide predictions. Therefore, I need to think about how to deal with new words, e.g. by using synonyms.
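
One commonly used alternative to synonyms, shown here purely as an illustration and not as part of the plan above, is to back off to a shorter context when the longer one has not been seen. The sketch below assumes made-up lookup tables (`trigram_model`, `bigram_model`) and a hypothetical `fallback_word`.

```r
# Toy lookup tables (hypothetical): context -> most frequent next word.
trigram_model <- c("i want" = "to", "want to" = "be")
bigram_model  <- c("want" = "to", "to" = "the", "the" = "same")
fallback_word <- "the"   # e.g. the most frequent word overall

predict_backoff <- function(text) {
  words <- strsplit(tolower(trimws(text)), "[[:space:]]+")[[1]]
  n <- length(words)
  # Try the last two words first.
  if (n >= 2) {
    key <- paste(words[(n - 1):n], collapse = " ")
    if (key %in% names(trigram_model)) return(trigram_model[[key]])
  }
  # Back off to the last word only.
  if (n >= 1) {
    key <- words[n]
    if (key %in% names(bigram_model)) return(bigram_model[[key]])
  }
  # Nothing matched: return the default guess.
  fallback_word
}

predict_backoff("I want")           # found in the trigram table
predict_backoff("go to")            # falls back to the bigram table
predict_backoff("zebra xylophone")  # falls back to the default word
```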

My list of special characters is not comprehensive. I searched online, and people suggested using the expression [^[:alnum:]]. However, after applying it with the gsub function, some special characters still remained, and it slowed my whole program down significantly (the processing code took an extra 30 minutes to run).
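
One possible explanation, offered as an assumption and not a diagnosis, is that in a UTF-8 locale [:alnum:] can also match accented letters such as â or ã, so they are not removed. A minimal sketch of an alternative on a toy vector is to drop everything outside ASCII with iconv first and then remove the remaining non-alphanumeric characters:

```r
# Toy input containing accented letters, curly quotes and a trademark sign.
x <- c("caf\u00e9 \u201cgood\u201d day\u2122 100%", "plain text")

# Drop every character that cannot be represented in ASCII.
ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# Keep letters, digits and whitespace; drop everything else.
# Including [:space:] matters: without it the spaces between
# words would be deleted as well and the words would be merged.
cleaned <- gsub("[^[:alnum:][:space:]]", "", ascii_only)
cleaned
```

Whether this is faster than a long list of alternatives on the full corpus would need to be timed. The sketch only shows the pattern.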