Document Classification

Introduction

For Project 4 we will try to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. For this project, we used kaggle spam/ham dataset, then predict the class of new documents. It can be useful to be able to classify new “test” documents using already classified “training” documents.

Spam Filter Data

I will be using kaggle Spam Filter dataset for this project. This dataset contains a csv file having spam and ham data. This csv file has following attributes: ‘text’ and ‘spam’. The variable ‘spam’ is target variable having values 1 (for spam) or 0 (for ham).

In this project, we will follow below steps:

Loading dataset from github
Data analysis
Text pre-preprocessing
Apply Model
Conclusion

Load packages

Let’s load necessary packages which will be needed here.

library(RCurl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tm)

## Loading required package: NLP

library(caTools)
library(e1071)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

library(SnowballC)

## Warning: package 'SnowballC' was built under R version 3.6.2

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

library(wordcloud)

## Loading required package: RColorBrewer

Load dataset and analysis

First we will load the dataset from github. The dataset has 5,728 observations and 2 columns text and spam indicator.

#github URL
url <- getURL("https://raw.githubusercontent.com/amit-kapoor/data607/master/project4/emails.csv")
# Read csv from github
spamham_df <- read.csv(text = url,stringsAsFactors = FALSE)
# glimpse data
glimpse(spamham_df)

## Observations: 5,728
## Variables: 2
## $ text <chr> "Subject: naturally irresistible your corporate identity  lt is …
## $ spam <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

# data dimension
dim(spamham_df)

## [1] 5728    2

In the next couple of setps we will first see the counts of spam attribute values and then draw the histogram to visualize its count.

spamham_df %>% group_by(spam) %>% tally()

## # A tibble: 2 x 2
##    spam     n
##   <int> <int>
## 1     0  4360
## 2     1  1368

# plot histogram for spam variable
hist(spamham_df$spam)

Corpus and text pre-processing

The text attribute contains unstructured data with upper/lower cases, stop words, punctuation, numbers and all. In this section we will address all this by following the standard steps to build and pre-process the corpus.

Build a new corpus variable called sh_corpus
Using tm_map, convert the text to lowercase. This will make it uniform.
Using tm_map, remove numbers.
Using tm_map, remove all punctuation from the corpus.
Using tm_map, remove all English stopwords from the corpus as they dont add any value.
Using tm_map, stem the words in the corpus. Word stemming reduces words to unify across documents.

Lets first build the corpus for the text column in our dataset.

sh_corpus <- Corpus(VectorSource(spamham_df$text))

Now we will use tm_map transformation.

# lower case
sh_corpus <- tm_map(sh_corpus, tolower)
# remove numbers
sh_corpus <- tm_map(sh_corpus, removeNumbers)
# remove punctuation
sh_corpus <- tm_map(sh_corpus, removePunctuation)
# remove stop words
sh_corpus <- tm_map(sh_corpus, removeWords, stopwords())
# stem document
sh_corpus <- tm_map(sh_corpus, stemDocument)

Next we will create a Document Term Matrix. We will now ready extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix() that provides a matrix where the rows correspond to documents and the columns correspond to words. The values in the matrix are the word frequencies in each document.

# Document term matrix
sh_dtm <- DocumentTermMatrix(sh_corpus)
sh_dtm

## <<DocumentTermMatrix (documents: 5728, terms: 25172)>>
## Non-/sparse entries: 462906/143722310
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)

We see a high sparcity percentage so will filter out few sparse words.

# remove sparsity
sh_dtm <- removeSparseTerms(sh_dtm,sparse = 0.99)
sh_dtm

## <<DocumentTermMatrix (documents: 5728, terms: 1429)>>
## Non-/sparse entries: 352457/7832855
## Sparsity           : 96%
## Maximal term length: 13
## Weighting          : term frequency (tf)

Now we’ll re-construct our dataset from newly created dtm, and add the target variable spam in it.

# reconstruct data frame from dtm
sh_finaldf <- as.data.frame(as.matrix(sh_dtm))
# add target variable
sh_finaldf$spam <- spamham_df$spam
# lets see dimension
dim(sh_finaldf)

## [1] 5728 1430

In the next couple of steps we will use wordcloud to see most frequest spam and ham words visualization.

# spam word cloud
spam_indices <- which(spamham_df$spam == "1")
suppressWarnings(wordcloud(sh_corpus[spam_indices], min.freq=50, max.words = 100, random.order = FALSE, random.color = TRUE,colors=palette()))

# ham word cloud
ham_indices <- which(spamham_df$spam == "0")
suppressWarnings(wordcloud(sh_corpus[ham_indices], min.freq=50, max.words = 100, random.order = FALSE, random.color = TRUE,colors=palette()))

Train/Test Split and Applying model

We will now use this data for out modeling. Before applying models we will split up our dataset into a training set and testing set. Our model(s) will learn from training data and using testing set we’ll test our model(s). It is commong to avoid overfitting. We’ll do a 75:25 for train and test split and use caTools to split it up.

set.seed(197)

sample <- sample.split(sh_finaldf, SplitRatio = 0.75)
train_df <- subset(sh_finaldf, sample == TRUE)
test_df  <- subset(sh_finaldf, sample == FALSE)

To apply model we will follow below steps:

Initialize each model classifier with the training set data and target variable in training data.
Make predictions on the test set.
Summarize model.
Check the resuts using confusion matrix.

Support Vector Machine

We will first start with Support Vector Machine model. A Support Vector Machine (SVM) is a classifier defined by a separating hyperplane i.e. given labeled training data the algorithm outputs an optimal hyperplane which categorizes new data. Lets use the above defined steps using svm.

# initialize svm
model_svm <- svm(train_df, as.factor(train_df$spam))
# make predictions using svm
predict_svm <- predict(model_svm, test_df)

#model summary
summary(model_svm)

## 
## Call:
## svm.default(x = train_df, y = as.factor(train_df$spam))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  1562
## 
##  ( 501 1061 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

# svm actuals vs predicted
table(`Actual Class` = test_df$spam, `Predicted Class` = predict_svm)

##             Predicted Class
## Actual Class    0    1
##            0 1093    0
##            1   20  321

# svm confusion matrix
confusionMatrix(data = predict_svm, reference = as.factor(test_df$spam),
                positive = "1", dnn = c("svm prediction", "svm actual"))

## Confusion Matrix and Statistics
## 
##               svm actual
## svm prediction    0    1
##              0 1093   20
##              1    0  321
##                                           
##                Accuracy : 0.9861          
##                  95% CI : (0.9785, 0.9915)
##     No Information Rate : 0.7622          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9607          
##                                           
##  Mcnemar's Test P-Value : 2.152e-05       
##                                           
##             Sensitivity : 0.9413          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9820          
##              Prevalence : 0.2378          
##          Detection Rate : 0.2238          
##    Detection Prevalence : 0.2238          
##       Balanced Accuracy : 0.9707          
##                                           
##        'Positive' Class : 1               
##

Random Forest

Next we will use another model RandomForest to make predictions and do comparison with svm. It is an ensemble tree-based learning algorithm. The Random Forest Classifier uses a set of decision trees from randomly selected subset of training data and aggregates the results from different decision trees to make decision of the final class of the test data.

# initialize random forest
model_rf <- randomForest(train_df, as.factor(train_df$spam))
# make predictions using random forest
predict_rf <- predict(model_rf, test_df)

#model summary
summary(model_rf)

##                 Length Class  Mode     
## call               3   -none- call     
## type               1   -none- character
## predicted       4294   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           8588   matrix numeric  
## oob.times       4294   -none- numeric  
## classes            2   -none- character
## importance      1430   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               4294   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL

# rf actuals vs predicted
table(`Actual Class` = test_df$spam, `Predicted Class` = predict_rf)

##             Predicted Class
## Actual Class    0    1
##            0 1093    0
##            1    0  341

# rf confusion matrix
confusionMatrix(data = predict_rf, reference = as.factor(test_df$spam), positive = "1", dnn = c("rf prediction", "rf actual"))

## Confusion Matrix and Statistics
## 
##              rf actual
## rf prediction    0    1
##             0 1093    0
##             1    0  341
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9974, 1)
##     No Information Rate : 0.7622     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.2378     
##          Detection Rate : 0.2378     
##    Detection Prevalence : 0.2378     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 1          
##

Conclusion

Looking at above models implemented on training dataset, predicted for our test dataset and the accuracy shown for both models it seems that Random Forest Model’s accuracy (~99%) is more accurate than SVM (~98% ) Model. Also Random Forest took more time to run as compared to SVM. We could also use hyper paramter tuning to improve our model performance.