Introduction

Text analytics is the process of retrieving unstructured data and transforming it into structured data by applying suitable algorithms to find patterns and trends and to classify texts into distinct groups. The text analytics pipeline can be burdensome for a small computer or workstation when dealing with large volumes of unstructured data. The feature space may explode with sparse data, taking a heavy toll on computational power and on the time required to build any machine learning model for prediction and classification.

In this research project, we have developed a methodology for text analytics that bypasses any deep learning architecture, making text analytics feasible for small machines on the go. In this experimental setting, we have collected a large corpus of spam and non-spam (ham) text messages for classification. We tokenized the text data and preprocessed it with the Quanteda library in R. We then built a document frequency matrix and prepared the data to fit our first decision tree model with a cross validation approach. Next, we wrote a Term Frequency-Inverse Document Frequency (TF-IDF) function from scratch and applied it to our DFM to normalize the documents while accounting for the weight each word carries in the prediction. Later, we adopted n-gram modelling to enrich the feature space with word-order information.

All of these can be considered fairly standard procedures in text analytics, but for any large dataset the feature space will explode to such a scale that training machine learning models on a small computer within a limited time frame becomes almost infeasible. To resolve this, we applied Latent Semantic Analysis: the TF-IDF-transformed DFM is projected onto a lower-dimensional vector space using Singular Value Decomposition. This addressed the curse of dimensionality while retaining the performance of the decision tree model. The projected vector space dataset contains a small number of the most important features, which makes it possible to run more sophisticated algorithms such as Random Forest to raise the accuracy. But that is not the end. We engineered new features on our vector space model to boost accuracy not only on the training data but also on the test data, thereby keeping the accuracy balanced at a high standard. Feature engineering has proven to be a true winner for optimization when all we are left with is the vector space model.

Let us load the required libraries for this text analytics project.

library(caret)
library(e1071)
library(quanteda)
library(irlba)
library(dplyr)
library(ggplot2)
library(randomForest)

Loading and Preparing the Dataset

raw_spam_data<-read.csv("spam.csv", stringsAsFactors = F)
head(raw_spam_data)
dim(raw_spam_data)
## [1] 5572    5

Renaming the columns

The raw spam data has the spam categories in column “V1” and the text in column “V2”. The other columns are redundant, so we remove them. We rename the columns and convert the categorical column (ham and spam) into a factor.

raw_spam_data<-raw_spam_data[,1:2] #Reading only the first two columns
names(raw_spam_data)<-c("Label","Text") #Naming the two columns
raw_spam_data$Label<-as.factor(raw_spam_data$Label) #turning Label columns in to factor of two categories
levels(raw_spam_data$Label)
## [1] "ham"  "spam"
length(which(complete.cases(raw_spam_data))) #checking that no rows have missing values
## [1] 5572
prop.table(table(raw_spam_data$Label)) #proportion of ham and spam text in the dataset
## 
##       ham      spam 
## 0.8659368 0.1340632

Now we create another column in raw_spam_data containing the length of each text and check its summary.

raw_spam_data$TextLength<-nchar(raw_spam_data$Text) #creating a new column for the text length
summary(raw_spam_data$TextLength)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   36.00   61.00   80.12  121.00  910.00

Plotting a Histogram

From the summary of the text length it is evident that the mean length differs substantially from the median length. Let’s plot a histogram of the text length.

raw_spam_data %>% ggplot(aes(TextLength, fill=Label))+
  theme_bw()+geom_histogram(binwidth = 5)

The histogram clearly shows that, on average, ham texts have a much lower character count than spam texts.

Preparing the Dataset for a Decision Tree Model

set.seed(32984, sample.kind = "Rounding")
index<-createDataPartition(raw_spam_data$Label, times = 1, p=0.7, list = F) #create data 
#partition function will always maintain the correct proportion of ham and spam in the train and test data.
test_set<-raw_spam_data[-index,]
train_set<-raw_spam_data[index,]
prop.table(table(test_set$Label)) #checking to see if the proportion of ham and spam data were 
## 
##       ham      spam 
## 0.8659485 0.1340515
#preserved in the test data

Tokenization

Now we will utilize the powerful Quanteda library to tokenize the text data.

train_tokens<-tokens(train_set$Text, what = "word", remove_punct = T, remove_numbers = T,
                    remove_symbols = T, split_hyphens = T) #tokenizing the text data in to words 
#and removing the punctuation, numbers, and symbols

Let’s check the 357th text from both train_set and train_tokens to see how it has been tokenized.

train_tokens[[357]]
##  [1] "Your"                      "credits"                  
##  [3] "have"                      "been"                     
##  [5] "topped"                    "up"                       
##  [7] "for"                       "http://www.bubbletext.com"
##  [9] "Your"                      "renewal"                  
## [11] "Pin"                       "is"                       
## [13] "tgxxrz"
train_set$Text[357]
## [1] "Your credits have been topped up for http://www.bubbletext.com Your renewal Pin is tgxxrz"

We can see that the whole text has been broken down into distinct words. Due to capitalization, otherwise identical words would be treated as different, so we convert all letters to lower case. Stop words such as auxiliary verbs and articles carry little information, so we remove them as well. Finally, we reduce all words to their stems. This is important because variants of the same word collapse into a single stem.

train_tokens<-tokens_tolower(train_tokens) #converts all tokens to lower case
train_tokens<-tokens_select(train_tokens, stopwords(), selection = "remove") #removes all the 
#stop words in the default stop word list
train_tokens<-tokens_wordstem(train_tokens, language = "english") #reduces all tokens to 
#their word stems

Let’s check the 357th text again to see how the words have been converted to their stems.

train_tokens[[357]]
## [1] "credit"                    "top"                      
## [3] "http://www.bubbletext.com" "renew"                    
## [5] "pin"                       "tgxxrz"

We can see that the stop words have been removed and the remaining words converted to their stemmed forms.

Document Frequency Matrix

Now it is time to generate the document frequency matrix from the tokens we have created so far.

train_tokens_dfm<-dfm(train_tokens, tolower = F)
train_tokens_matrix<-as.matrix(train_tokens_dfm)

Let’s view the first 15 rows and 15 columns of the document frequency matrix.

train_tokens_matrix[1:15,1:15]
##         features
## docs     go jurong point crazi avail bugi n great world la e buffet cine got
##   text1   1      1     1     1     1    1 1     1     1  1 1      1    1   1
##   text2   0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text3   0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text4   0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text5   0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text6   0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text7   0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text8   0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text9   0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text10  0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text11  0      0     0     0     0    0 1     0     0  0 0      0    0   0
##   text12  0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text13  0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text14  0      0     0     0     0    0 0     0     0  0 0      0    0   0
##   text15  0      0     0     0     0    0 0     0     0  0 0      0    0   0
##         features
## docs     amor
##   text1     1
##   text2     0
##   text3     0
##   text4     0
##   text5     0
##   text6     0
##   text7     0
##   text8     0
##   text9     0
##   text10    0
##   text11    0
##   text12    0
##   text13    0
##   text14    0
##   text15    0

We will merge the Label column from the train set with the newly formed train token matrix to produce a data frame. Many of the feature names in the data frame are not syntactically valid in R, so we construct valid names for the columns/features with make.names.

train_token_df<-cbind(Label=train_set$Label, as.data.frame(train_tokens_matrix))
names(train_token_df)<-make.names(names(train_token_df))

Cross Validation

We will run cross validation for the decision tree model on the train token data frame. Let’s create 30 cross validation folds (10 folds repeated 3 times) on the train set and build a train control object to govern the training method in the decision tree modelling.

cv_folds<-createMultiFolds(train_set$Label, k = 10,times = 3) #creating 30 folds in total 
train_control<-trainControl(method = "repeatedcv", number = 10, repeats = 3, index = cv_folds) #for 
#each repetition the train control function will get the index from cv_folds

Constructing the Decision Tree Model

library(doSNOW) #This library is required for multicore processing
start_time<-Sys.time()
clusters<-makeCluster(2, type = "SOCK") #This will instruct rstudio to use two cores simultaneously 
registerDoSNOW(clusters) #Clustering will begin
trained_model_01<-train(Label~., data = train_token_df, method="rpart", trControl=train_control,
                        tuneLength=7) #Training a decision tree model
stopCluster(clusters) #Ending the clustering and stopping multicore processing
total_time<-Sys.time()-start_time #Time required to train the whole decision tree model

The total time it took to train the model is 12.0869878172874. Now let’s check out our trained decision tree model.

trained_model<-trained_model_01
trained_model
## CART 
## 
## 3901 samples
## 5815 predictors
##    2 classes: 'ham', 'spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3510, 3511, 3511, 3511, 3510, 3511, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01912046  0.9451399  0.7256170
##   0.02294455  0.9415528  0.7034354
##   0.02868069  0.9359129  0.6670062
##   0.03059273  0.9333494  0.6486466
##   0.03824092  0.9299317  0.6244985
##   0.05098789  0.9155793  0.5117530
##   0.32695985  0.8860153  0.2185265
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01912046.

As you can see, using the cross validation resampling method we obtained the optimal model at a cp value of 0.0191 with an accuracy of 0.945, which is not bad at all. Later we will try to raise the accuracy of our model by applying the elegant term frequency-inverse document frequency (TF-IDF) weighting.

Term Frequency-Inverse Document Frequency (TF-IDF)

There still remain some issues that we need to take care of:
1. Longer documents will tend to have higher term counts.
2. Terms that appear frequently across the corpus aren’t as important.

To address these issues we refine our representation in two steps:
1. Normalize documents based on their length.
2. Penalize terms that occur frequently across the corpus.

That’s where the mighty TF-IDF weighting comes to the rescue. TF, or term frequency, normalizes the word counts by taking each word’s count as a proportion of the total word count in the document.

Inverse document frequency, or simply IDF, penalizes frequently occurring words. It divides the total number of documents by the number of documents in which a given word appears, and then takes the logarithm of that ratio. For example, if we have 1000 documents and the word “soccer” appears in all of them, then the IDF of “soccer” is exactly zero. IDF is therefore a counter-measure that puts less weight on words occurring in many documents. Finally, we multiply the TF of each word by its IDF to obtain the TF-IDF.
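In symbols, for a term t in document d, with N documents in the corpus, f_{t,d} the count of t in d, and n_t the number of documents containing t (log base 10, matching the functions built below):

$$\mathrm{TF}(t,d)=\frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad \mathrm{IDF}(t)=\log_{10}\frac{N}{n_t}, \qquad \text{TF-IDF}(t,d)=\mathrm{TF}(t,d)\times\mathrm{IDF}(t)$$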

Let’s build the TF and IDF function. We create these functions from scratch.

TF.function<-function(r){
  r/sum(r) #This will take the frequency of a particular word in a document and divide by the total 
  #word count in that document
}

IDF.function<-function(c){
  size<-length(c) #Count the total number of documents
  doc.count<-length(which(c>0)) # For a certain word, count the number of documents where it appeared
  log10(size/doc.count)
}

Now, we combine the TF and IDF function together to custom-build our TF-IDF function.

tf.idf<-function(tf,idf){
  tf*idf
}
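Before applying these functions to the full training matrix, here is a minimal sketch on a small made-up matrix (the toy documents and vocabulary are purely illustrative, not part of the project data) showing how the three functions fit together:

toy_dfm<-matrix(c(2, 1, 0,
                  1, 0, 3),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc1", "doc2"), c("win", "free", "hello"))) #hypothetical toy DFM
toy_tf<-apply(toy_dfm, 1, TF.function) #row-wise; note the result comes back transposed (terms in rows)
toy_idf<-apply(toy_dfm, 2, IDF.function) #"win" appears in both documents, so its IDF is log10(2/2) = 0
toy_tfidf<-apply(toy_tf, 2, tf.idf, idf = toy_idf) #multiply each document's TF column by the IDF vector
toy_tfidf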

To address the first issue, we normalize each document using the TF function so that the words in each document are measured proportionally, irrespective of the length of the document.

train.token.tf<-apply(train_tokens_matrix,1,TF.function) # '1' has been used for row operation
train.token.tf[1:10,1:10] #checking the first 10 rows and 10 columns
##         docs
## features  text1 text2 text3 text4 text5 text6 text7 text8 text9 text10
##   go     0.0625     0     0     0     0     0     0     0     0      0
##   jurong 0.0625     0     0     0     0     0     0     0     0      0
##   point  0.0625     0     0     0     0     0     0     0     0      0
##   crazi  0.0625     0     0     0     0     0     0     0     0      0
##   avail  0.0625     0     0     0     0     0     0     0     0      0
##   bugi   0.0625     0     0     0     0     0     0     0     0      0
##   n      0.0625     0     0     0     0     0     0     0     0      0
##   great  0.0625     0     0     0     0     0     0     0     0      0
##   world  0.0625     0     0     0     0     0     0     0     0      0
##   la     0.0625     0     0     0     0     0     0     0     0      0

One thing to remember here: the apply function produces a transposed matrix.

To address the second issue, we calculate the IDF vector which we will use both for our training data and test data.

train.token.idf<-apply(train_tokens_matrix,2,IDF.function) #'2' has been used for column operation

In the final step, we calculate the TF-IDF for the training data.

train.token.tf_idf<-apply(train.token.tf,2,tf.idf,idf=train.token.idf) #here the tf.idf function is 
#applied on train.token.tf and the idf values are taken from train.token.idf

Let’s take a look at the TF-IDF matrix of the training tokens.

train.token.tf_idf[1:14,1:12]
##         docs
## features      text1 text2 text3 text4 text5 text6 text7 text8 text9 text10
##   go     0.06953809     0     0     0     0     0     0     0     0      0
##   jurong 0.22444850     0     0     0     0     0     0     0     0      0
##   point  0.13934051     0     0     0     0     0     0     0     0      0
##   crazi  0.16480834     0     0     0     0     0     0     0     0      0
##   avail  0.15936145     0     0     0     0     0     0     0     0      0
##   bugi   0.17162987     0     0     0     0     0     0     0     0      0
##   n      0.10230834     0     0     0     0     0     0     0     0      0
##   great  0.10322854     0     0     0     0     0     0     0     0      0
##   world  0.13934051     0     0     0     0     0     0     0     0      0
##   la     0.18681975     0     0     0     0     0     0     0     0      0
##   e      0.11617389     0     0     0     0     0     0     0     0      0
##   buffet 0.20563412     0     0     0     0     0     0     0     0      0
##   cine   0.17581404     0     0     0     0     0     0     0     0      0
##   got    0.08635381     0     0     0     0     0     0     0     0      0
##         docs
## features    text11 text12
##   go     0.0000000      0
##   jurong 0.0000000      0
##   point  0.0000000      0
##   crazi  0.0000000      0
##   avail  0.0000000      0
##   bugi   0.0000000      0
##   n      0.1091289      0
##   great  0.0000000      0
##   world  0.0000000      0
##   la     0.0000000      0
##   e      0.0000000      0
##   buffet 0.0000000      0
##   cine   0.0000000      0
##   got    0.0000000      0

We need to transpose the train.token.tf_idf matrix back into the shape of a document frequency matrix, with terms in the columns and documents/texts in the rows.

train.token.tf_idf<-t(train.token.tf_idf)

There remains the possibility that, during the pre-processing stage, removing stop words, punctuation, numbers and symbols and then stemming left some text messages/documents with no tokens at all. For such documents the TF function divides by a total word count of zero, producing NaN values. So we need to check whether all the rows in our TF-IDF matrix are complete, identify the rows that are incomplete, and take a look at the corresponding texts.

incomplete.case.index<-which(!complete.cases(train.token.tf_idf))
train_set$Text[incomplete.case.index]
## [1] "What you doing?how are you?" "645"                        
## [3] ":) "                         "What you doing?how are you?"
## [5] ":( but your not here...."    ":-) :-)"

As suspected, we found the texts/documents that were completely stripped of tokens during the pre-processing stage.

We can easily fix those incomplete rows in the TF-IDF matrix by simply replacing all the values in those rows with zeroes. One might be tempted to delete the incomplete rows instead, but that would be unwise: from what we have seen so far, the documents that produced incomplete rows in the TF-IDF matrix are predominantly ham texts, so discarding them is not a good idea from a machine learning perspective.

train.token.tf_idf[incomplete.case.index,]<-rep(0.0,ncol(train.token.tf_idf))
sum(which(!complete.cases(train.token.tf_idf)))
## [1] 0

We obtained a sum of zero, which means none of the rows in the TF-IDF matrix are incomplete. Next, we create a clean data frame by combining the Label column from the train set with our TF-IDF matrix.

train.tf_idf.df<-cbind(train_set$Label, data.frame(train.token.tf_idf))
names(train.tf_idf.df)<-make.names(names = names(train.tf_idf.df)) #creating valid names for each column

Previously, we trained our decision tree model on the document frequency matrix. This time we will train it on the TF-IDF matrix. We recreate the cross validation folds and the train control object as before.

cv_folds<-createMultiFolds(train_set$Label, k = 10,times = 3) #creating 30 folds in total 
train_control<-trainControl(method = "repeatedcv", number = 10, repeats = 3, index = cv_folds) #for 
#each repetition the trainControl function will get the index from cv_folds

Constructing the Decision Tree Model on TF-IDF Transformed Matrix

library(doSNOW) #This library is required for multicore processing
start_time<-Sys.time()
clusters<-makeCluster(2, type = "SOCK") #This will instruct rstudio to use two cores simultaneously 
registerDoSNOW(clusters) #Clustering will begin
trained_model_tfidf<-train(train_set.Label~., data = train.tf_idf.df, method="rpart", trControl=train_control,
                        tuneLength=7) #Training a decision tree model
stopCluster(clusters) #Ending the clustering and stopping multicore processing
total_time<-Sys.time()-start_time #Time required to train the whole decision tree model

The model trained on the TF-IDF matrix yielded a slight improvement, with an accuracy of 0.94 and an optimal cp value of 0.017.

N-gram Modelling

Until now our model has been built on single terms, known as unigrams or 1-grams. There are other important varieties of N-grams, namely bigrams, trigrams, 4-grams and so on. Our model’s performance could be improved further if we broadened the tokenization to include N-grams, which extend the bag-of-words model to capture word ordering. We will add bigrams to our previously created tokens.

train.tokens.bigrams<-tokens_ngrams(train_tokens, n=1:2)
train_tokens[[357]]
## [1] "credit"                    "top"                      
## [3] "http://www.bubbletext.com" "renew"                    
## [5] "pin"                       "tgxxrz"
train.tokens.bigrams[[357]]
##  [1] "credit"                          "top"                            
##  [3] "http://www.bubbletext.com"       "renew"                          
##  [5] "pin"                             "tgxxrz"                         
##  [7] "credit_top"                      "top_http://www.bubbletext.com"  
##  [9] "http://www.bubbletext.com_renew" "renew_pin"                      
## [11] "pin_tgxxrz"

As seen from the output above, previously we had only single-word tokens from the 357th text, but now the tokens consist of both single and double words that preserve the word order, almost like groups of tuples.

We will apply the same pre-processing pipeline we used before to the newly formed bigram tokens. The token object is converted to a document frequency matrix and then to a regular matrix. The term frequency function normalizes the document frequency matrix, and the IDF function computes the IDF vector to be used for both the training and test data. We then construct the mighty TF-IDF for our training data. Now is not the time to get overwhelmed; that comes later.

train.token.bigram.dfm<-dfm(train.tokens.bigrams, tolower = F) #Creating document freq matrix 
train.token.bigram.matrix<-as.matrix(train.token.bigram.dfm) #saving dfm as matrix
rm(train.tokens.bigrams, train.token.bigram.dfm)
train.token.bigram.tf<-apply(train.token.bigram.matrix,1,TF.function) #normalizing dfm using tf function
train.token.bigram.idf<-apply(train.token.bigram.matrix,2,IDF.function) #creating idf vector
rm(train.token.bigram.matrix)
train.token.bigram.tf_idf<-apply(train.token.bigram.tf,2,tf.idf, idf=train.token.bigram.idf) 
train.token.bigram.tf_idf<-t(train.token.bigram.tf_idf) #transposing the tf-idf data frame
rm(train.token.bigram.tf)
incomplete.cases<-which(!complete.cases(train.token.bigram.tf_idf))
train.token.bigram.tf_idf[incomplete.cases,]<-rep(0.0,ncol(train.token.bigram.tf_idf))
train.bigram.tf_idf.df<-cbind(Label=train_set$Label, data.frame(train.token.bigram.tf_idf))
names(train.bigram.tf_idf.df)<-make.names(names = names(train.bigram.tf_idf.df))

Review

Let’s review what we have achieved so far.
1. We rendered unstructured textual data into a format suitable for analytics and machine learning.
2. We devised a de facto standard data pre-processing pipeline for text analytics.
3. We improved upon the bag-of-words model by transforming the text data with the mighty TF-IDF.
4. Last but not least, we unleashed the power of n-grams by extending the BOW model to incorporate word orderings.

Let’s also discuss the problems we have encountered along the way.
1. After applying N-grams, the document terms exploded, giving rise to thousands of new features.
2. Most of the new features do not carry much information, which creates a sparsity problem.
3. We are heading towards scalability issues such as shortage of RAM and computation power.
4. The curse of dimensionality.

Dimension Reduction for Latent Semantic Analysis

After adding bigrams, it is quite evident that the number of features has exploded. That, in turn, will simply increase the training time of our models without adding much accuracy. We need to shrink our feature space. To do so, we will apply Singular Value Decomposition to reduce and transform the feature space to only a handful of the most important features.

The irlba library allows us to apply SVD (singular value decomposition) to our bigram model, and we will reduce the dimensionality to the 300 most valuable columns for Latent Semantic Analysis.

library(irlba)
start.time<-Sys.time()
train.iralba<-irlba(t(train.token.bigram.tf_idf), nv = 300, maxit = 600)
total.time<-Sys.time()-start.time
total.time
## Time difference of 26.18933 mins

The train.iralba object has three components. The first component, u, contains the left singular vectors, which capture the term correlations. The second component, d, contains the singular values, and the third component, v, holds the right singular vectors, which are our desired representation of the documents in the semantic space.
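As a quick sanity check (a sketch based on the object above), the dimensions of the three components should reflect this structure:

dim(train.iralba$u) #left singular vectors: one row per term, 300 columns
length(train.iralba$d) #300 singular values
dim(train.iralba$v) #right singular vectors: 3901 documents in rows, 300 columns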

Let’s take a look at the document feature space created by SVD.

train.iralba$v[1:10,1:6]
##               [,1]          [,2]          [,3]          [,4]          [,5]
##  [1,] 1.415771e-05 -1.559756e-04 -0.0003491520 -4.271159e-17  6.793260e-18
##  [2,] 6.215661e-05 -5.998686e-04 -0.0006433967 -2.051033e-16 -9.825678e-17
##  [3,] 8.700574e-07 -1.412254e-04 -0.0003030183 -9.358290e-17 -3.239437e-17
##  [4,] 2.150650e-05 -6.839671e-03 -0.0003782779  2.988763e-17 -2.550531e-17
##  [5,] 5.059657e-07 -6.989342e-05 -0.0028715967  5.041339e-17  1.472977e-17
##  [6,] 3.730699e-06 -1.613618e-04 -0.0034781339  3.860688e-17  2.492099e-17
##  [7,] 7.905780e-06 -2.177330e-04 -0.0010205300  3.963421e-17  2.584425e-17
##  [8,] 6.645125e-07 -1.124602e-04 -0.0005403705  3.263164e-17  8.297269e-18
##  [9,] 1.924329e-06 -1.395741e-04 -0.0009554010 -1.172455e-16 -3.457649e-17
## [10,] 6.766047e-07 -4.358364e-05 -0.0005116495  5.120809e-17  6.512532e-18
##                [,6]
##  [1,]  9.250953e-17
##  [2,] -3.550752e-17
##  [3,] -2.084898e-17
##  [4,]  1.699646e-16
##  [5,] -2.681073e-17
##  [6,]  1.368668e-16
##  [7,] -3.172991e-17
##  [8,] -1.207722e-17
##  [9,]  2.853290e-16
## [10,]  1.740098e-19

One small complication with latent semantic analysis is that any new data will also have to be projected into this transformed feature space after its TF-IDF transformation.

Now we will perform the inverse operation of the SVD on a single TF-IDF-transformed document so we can make sure it lands on the same values as the corresponding SVD-transformed document.
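In matrix form this is the standard LSA fold-in: a TF-IDF document vector d is mapped into the 300-dimensional semantic space as

$$\hat{d}=\Sigma^{-1}U^{T}d$$

where U holds the left singular vectors (train.iralba$u) and Sigma the singular values (train.iralba$d). The code below computes exactly this product for the first training document.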

sigma.inverse<-1/train.iralba$d
u.transposed<-t(train.iralba$u)
doc.01<-train.token.bigram.tf_idf[1,]
doc01.inverse.trans<-sigma.inverse * u.transposed %*% doc.01

We will now check whether the newly projected document in the semantic space really matches the document previously transformed by the SVD. We are only looking at the first 10 values of each.

train.iralba$v[1,1:10]
##  [1]  1.415771e-05 -1.559756e-04 -3.491520e-04 -4.271159e-17  6.793260e-18
##  [6]  9.250953e-17  7.202770e-19  3.556815e-17 -1.795476e-18 -9.006774e-06
doc01.inverse.trans[1:10]
##  [1]  1.415771e-05 -1.559756e-04 -3.491520e-04 -6.092751e-17 -1.190469e-17
##  [6]  1.101069e-17  2.093775e-18  1.156389e-17 -9.216917e-18 -9.006774e-06

The values from the two computations look very similar; the small discrepancies occur only in values of the order of 1e-17, which is numerical noise at machine precision.
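Although the test set is not processed in this write-up, the same fold-in applies to it. The sketch below assumes a hypothetical test.token.tf_idf matrix (documents in rows, the same term columns as the training matrix, built with the training IDF vector); it is illustrative only and not run here.

project_to_lsa<-function(doc_tf_idf){
  as.numeric(sigma.inverse * (u.transposed %*% doc_tf_idf)) #Sigma^-1 %*% t(U) %*% d
}
#test.svd.features<-t(apply(test.token.tf_idf, 1, project_to_lsa)) #hypothetical test TF-IDF matrix
#test.svd<-data.frame(Labels=test_set$Label, test.svd.features)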

Now we create a data frame combining the ham/spam labels with the documents transformed into the semantic space for our training data.

train.svd<-data.frame(Labels=train_set$Label, train.iralba$v)

We will run the single decision tree model one last time. Since we have shrunk our feature space by orders of magnitude, we are now also ready to run the mighty random forest algorithm for subsequent training.

library(doSNOW) #This library is required for multicore processing
start_time<-Sys.time()
clusters<-makeCluster(2, type = "SOCK") #This will instruct rstudio to use two cores simultaneously 
registerDoSNOW(clusters) #Clustering will begin
trained_model.svd<-train(Labels~., data = train.svd, method="rpart", trControl=train_control,
                        tuneLength=7) #Training a decision tree model
stopCluster(clusters) #Ending the clustering and stopping multicore processing
total_time<-Sys.time()-start_time #Time required to train the whole decision tree model
total_time
## Time difference of 40.32775 secs
trained_model.svd
## CART 
## 
## 3901 samples
##  300 predictor
##    2 classes: 'ham', 'spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3512, 3510, 3512, 3511, 3511, 3511, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01465902  0.9330073  0.6810833
##   0.01682600  0.9289891  0.6490688
##   0.01720841  0.9257453  0.6301867
##   0.01816444  0.9252325  0.6215931
##   0.03059273  0.9238650  0.6136579
##   0.05353728  0.9212161  0.6089718
##   0.14212874  0.8955822  0.3648145
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01465902.

From the output of the trained single decision tree, we can see the optimal accuracy is hovering around 93.3%. But we can definitely do better with a random forest model.

Training a Random Forest Model

We need to remember that we run our models with 10-fold cross validation repeated 3 times, which means 30 resamples. Random forest builds 500 trees for each model by default, and we tune mtry over 7 different values. Putting it all together, the random forest procedure has to train (30 × 7 × 500) + 500 = 105,500 decision trees to arrive at one final optimal model. This will require a huge amount of time.
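A quick back-of-the-envelope check of that count:

(10 * 3) * 7 * 500 + 500 #30 resamples x 7 mtry values x 500 trees, plus the final 500-tree forest
## [1] 105500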

Let’s do this.

library(doSNOW) #This library is required for multicore processing
start_time<-Sys.time()
clusters<-makeCluster(2, type = "SOCK") #This will instruct rstudio to use two cores simultaneously 
registerDoSNOW(clusters) #Clustering will begin
rf.cv.1<-train(Labels~., data = train.svd, method="rf", trControl=train_control,
                        tuneLength=7) #Training a random forest model
stopCluster(clusters) #Ending the clustering and stopping multicore processing
total_time<-Sys.time()-start_time #Time required to train the whole random forest model
rf.cv.1 #printing the trained random forest model
## Random Forest 
## 
## 3901 samples
##  300 predictor
##    2 classes: 'ham', 'spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3511, 3510, 3511, 3511, 3511, 3511, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##     2   0.9618038  0.8128406
##    51   0.9675298  0.8446518
##   101   0.9675290  0.8448381
##   151   0.9675298  0.8450677
##   200   0.9664192  0.8404983
##   250   0.9662480  0.8396997
##   300   0.9664192  0.8405704
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 151.

The confusion matrix will give us a better idea not only of the accuracy of the model but also of its sensitivity and specificity.
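For reference, with ‘ham’ treated as the positive class (as in the output below):

$$\mathrm{Sensitivity}=\frac{TP}{TP+FN}, \qquad \mathrm{Specificity}=\frac{TN}{TN+FP}$$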

confusionMatrix(data = train.svd$Labels, reference = rf.cv.1$finalModel$predicted)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  ham spam
##       ham  3371    7
##       spam  117  406
##                                           
##                Accuracy : 0.9682          
##                  95% CI : (0.9622, 0.9735)
##     No Information Rate : 0.8941          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8497          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9665          
##             Specificity : 0.9831          
##          Pos Pred Value : 0.9979          
##          Neg Pred Value : 0.7763          
##              Prevalence : 0.8941          
##          Detection Rate : 0.8641          
##    Detection Prevalence : 0.8659          
##       Balanced Accuracy : 0.9748          
##                                           
##        'Positive' Class : ham             
## 

From the confusion matrix, we can observe an overall accuracy of 96.82%, a marked improvement over the previous decision tree model. We also see that the detection rates for ham (sensitivity) and spam (specificity) are 96.7% and 98.3% respectively, both much higher than for the single decision tree.

Feature Engineering

So far, we have been able to reach an accuracy close to 97% with a random forest model on our training data. But can we do better? From our earlier exploration, we saw that text length is a good discriminator between ham and spam. We can add that variable to our SVD-transformed training data and run the random forest algorithm again to check whether its inclusion boosts the accuracy.

train.svd$TextLength<-train_set$TextLength #adding the text length variable 
start_time<-Sys.time()
clusters<-makeCluster(2, type = "SOCK") #This will instruct rstudio to use two cores simultaneously 
registerDoSNOW(clusters) #Clustering will begin
rf.cv.2<-train(Labels~., data = train.svd, method="rf", trControl=train_control,
                        tuneLength=7) #Training a random forest model
stopCluster(clusters) #Ending the clustering and stopping multicore processing
total_time<-Sys.time()-start_time #Time required to train the whole random forest model
rf.cv.2 #printing the trained random forest model
## Random Forest 
## 
## 3901 samples
##  301 predictor
##    2 classes: 'ham', 'spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3511, 3510, 3511, 3511, 3511, 3511, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##     2   0.9621448  0.8147416
##    51   0.9705182  0.8595248
##   101   0.9706898  0.8609443
##   151   0.9702629  0.8587626
##   201   0.9700056  0.8576510
##   251   0.9689802  0.8535269
##   301   0.9679543  0.8486897
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 101.

Adding text length as a new variable indeed increased the accuracy of the model to about 97.1%.

confusionMatrix(data = train.svd$Labels, reference = rf.cv.2$finalModel$predicted)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  ham spam
##       ham  3375    3
##       spam  110  413
##                                           
##                Accuracy : 0.971           
##                  95% CI : (0.9653, 0.9761)
##     No Information Rate : 0.8934          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8634          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9684          
##             Specificity : 0.9928          
##          Pos Pred Value : 0.9991          
##          Neg Pred Value : 0.7897          
##              Prevalence : 0.8934          
##          Detection Rate : 0.8652          
##    Detection Prevalence : 0.8659          
##       Balanced Accuracy : 0.9806          
##                                           
##        'Positive' Class : ham             
## 

The confusion matrix reveals a clear improvement, detecting the spam data with 99.3% specificity and the ham data with 96.8% sensitivity. Still, some ham messages are being flagged as spam. We can check how well our new text length feature performs relative to the other features by plotting the variable importance graphs.

library(randomForest)
varImpPlot(rf.cv.1$finalModel)

varImpPlot(rf.cv.2$finalModel)

It is quite interesting to see that without the text length feature the variable x24 was the most prominent feature in the random forest model, but once text length is included it dwarfs all the other features on the importance scale.

We could do more feature engineering to investigate whether adding further interesting features improves the accuracy of our model; cosine similarity is a very good candidate. But we will stop right here.
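For illustration only, one hypothetical way to engineer such a feature would be the cosine similarity of each document’s LSA vector to the centroid of the spam training documents; the sketch below uses only objects created above but is not run or tuned as part of this project.

cosine_sim<-function(a,b){sum(a*b)/(sqrt(sum(a*a))*sqrt(sum(b*b)))} #cosine similarity of two vectors
spam_centroid<-colMeans(train.iralba$v[train_set$Label=="spam",]) #mean LSA vector of the spam documents
#train.svd$SpamSim<-apply(train.iralba$v, 1, cosine_sim, b=spam_centroid) #hypothetical new feature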