Data607 - Project 4

Amit Kapoor

4/22/2020

Document Classification

Introduction

For Project 4 we classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether a new document is spam. For this project, we train on the Kaggle spam/ham dataset and then predict the class of held-out documents.

Spam Filter Data

I will be using the Kaggle Spam Filter dataset for this project. The dataset is a single CSV file of spam and ham e-mails with two attributes: ‘text’ and ‘spam’. The variable ‘spam’ is the target variable, with value 1 for spam and 0 for ham.

In this project, we will follow the steps below:

  1. Loading the dataset from GitHub
  2. Data analysis
  3. Text pre-processing
  4. Applying models
  5. Conclusion

Load packages

Let’s load the packages needed for this project.

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: NLP
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
## Warning: package 'SnowballC' was built under R version 3.6.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
## Loading required package: RColorBrewer

Load dataset and analysis

First we load the dataset from GitHub. It has 5,728 observations and 2 columns: the e-mail text and the spam indicator.

## Observations: 5,728
## Variables: 2
## $ text <chr> "Subject: naturally irresistible your corporate identity  lt is …
## $ spam <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## [1] 5728    2

In the next couple of steps we will first look at the counts of each value of the spam attribute and then draw a histogram to visualize them.

## # A tibble: 2 x 2
##    spam     n
##   <int> <int>
## 1     0  4360
## 2     1  1368
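
As a minimal sketch of the counting step, the snippet below tabulates the classes of a small made-up data frame (the `emails` object and its rows are hypothetical stand-ins for the real data) and then applies the same arithmetic to the counts reported above. It uses base R only; it is not necessarily the code that produced the tibble shown.

```r
# Toy stand-in for the real data frame (hypothetical values).
emails <- data.frame(
  text = c("Subject: win money now", "Subject: meeting at noon",
           "Subject: project update", "Subject: free offer"),
  spam = c(1L, 0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Count of each class and its share of all rows.
class_counts <- table(emails$spam)
class_share  <- prop.table(class_counts)
print(class_counts)
print(class_share)

# The same arithmetic applied to the counts reported above.
reported <- c(ham = 4360, spam = 1368)
reported_share <- round(reported / sum(reported), 4)
print(reported_share)
```

Roughly three quarters of the e-mails are ham, so a classifier that always predicts ham would already be right about 76% of the time. That is the baseline any model has to beat.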

Corpus and text pre-processing

The text attribute contains unstructured data: mixed upper and lower case, stop words, punctuation, numbers and so on. In this section we address this by following the standard steps to build and pre-process the corpus.

  1. Build a new corpus variable called sh_corpus
  2. Using tm_map, convert the text to lowercase. This will make it uniform.
  3. Using tm_map, remove numbers.
  4. Using tm_map, remove all punctuation from the corpus.
  5. Using tm_map, remove all English stopwords from the corpus, as they add little value for classification.
  6. Using tm_map, stem the words in the corpus. Stemming reduces inflected forms to a common stem, so that variants of the same word are counted as one term across documents.

Let’s first build the corpus from the text column in our dataset.

Now we will apply the tm_map transformations.
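
The six steps listed above can be sketched as follows. This is an illustration on two made-up documents, not the original chunk: the `texts` vector is hypothetical, and the use of `VCorpus` with `VectorSource` is an assumption about how sh_corpus was built.

```r
library(tm)         # corpus handling and tm_map transformations
library(SnowballC)  # stemmer used by stemDocument

# Toy documents standing in for the 'text' column (hypothetical values).
texts <- c("Subject: WIN 1000 dollars NOW!!!",
           "Subject: Meeting moved to 3 pm, see the attached notes.")

# 1. Build the corpus.
sh_corpus <- VCorpus(VectorSource(texts))

# 2. Lowercase (tolower is a base function, so it needs content_transformer).
sh_corpus <- tm_map(sh_corpus, content_transformer(tolower))
# 3. Remove numbers.
sh_corpus <- tm_map(sh_corpus, removeNumbers)
# 4. Remove punctuation.
sh_corpus <- tm_map(sh_corpus, removePunctuation)
# 5. Remove English stop words.
sh_corpus <- tm_map(sh_corpus, removeWords, stopwords("english"))
# 6. Stem the remaining words.
sh_corpus <- tm_map(sh_corpus, stemDocument)

# Pull the cleaned text back out as a character vector for inspection.
cleaned <- sapply(sh_corpus, as.character)
print(cleaned)
```

The order matters: lowercasing comes before stop-word removal because the stop-word list is lowercase.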

Next we create a document-term matrix, from which we extract the word frequencies used in our prediction problem. The tm package provides a function called DocumentTermMatrix() that returns a matrix in which the rows correspond to documents and the columns correspond to words. The values in the matrix are the word frequencies in each document.

## <<DocumentTermMatrix (documents: 5728, terms: 25172)>>
## Non-/sparse entries: 462906/143722310
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)

The sparsity is very high, so we filter out the sparsest terms.

## <<DocumentTermMatrix (documents: 5728, terms: 1429)>>
## Non-/sparse entries: 352457/7832855
## Sparsity           : 96%
## Maximal term length: 13
## Weighting          : term frequency (tf)
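
The sparsity figures in the two summaries can be reproduced from the reported entry counts: sparsity is the share of zero cells among all cells of the matrix. The short check below uses only the numbers printed above; the helper name `sparsity` is introduced here for illustration.

```r
# Sparsity = zero cells / all cells of the document-term matrix.
sparsity <- function(non_sparse, sparse) sparse / (non_sparse + sparse)

full_dtm    <- sparsity(462906, 143722310)  # 5728 documents x 25172 terms
reduced_dtm <- sparsity(352457, 7832855)    # 5728 documents x 1429 terms

# The entry counts add up to documents x terms in both cases.
stopifnot(462906 + 143722310 == 5728 * 25172,
          352457 + 7832855   == 5728 * 1429)

sparsity_summary <- round(c(full = full_dtm, reduced = reduced_dtm), 4)
print(sparsity_summary)
# full 0.9968, reduced 0.9569
```

The full matrix is about 99.7% zeros, which the summary rounds to 100%. After filtering, about 95.7% of the cells are zero, shown as 96%.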

Now we’ll reconstruct our dataset from the newly created dtm and add the target variable spam to it. The result has 1,430 columns: the 1,429 retained terms plus spam.

## [1] 5728 1430
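
A hedged sketch of the path from corpus to modeling data frame is shown below on four made-up documents. The sparsity threshold of 0.6 and the names `texts`, `labels` and `email_df` are hypothetical; the threshold used for the real data is not stated in this report.

```r
library(tm)

# Toy documents and labels (hypothetical values).
texts  <- c("free money offer", "meeting notes attached",
            "free offer today", "project meeting today")
labels <- c(1L, 0L, 1L, 0L)

corpus <- VCorpus(VectorSource(texts))

# Rows are documents, columns are terms, cells are term frequencies.
dtm <- DocumentTermMatrix(corpus)

# Drop terms whose sparsity is at or above the threshold (0.6 is hypothetical);
# with four documents this keeps terms that occur in at least two of them.
dtm_small <- removeSparseTerms(dtm, 0.6)

# Convert to a plain data frame and make the column names syntactically valid.
email_df <- as.data.frame(as.matrix(dtm_small))
colnames(email_df) <- make.names(colnames(email_df))

# Add the target variable as the last column.
email_df$spam <- labels

print(dim(email_df))
print(email_df)
```

The call to `make.names` is a precaution: some terms may not be valid R variable names, which can cause trouble in formula-based model interfaces.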

In the next couple of steps we will use wordcloud to visualize the most frequent spam and ham words.

Train/Test Split and Applying model

We will now use this data for our modeling. Before applying models we split the dataset into a training set and a testing set. Our model(s) learn from the training data, and we evaluate them on the testing set. Holding out a test set is a common way to detect overfitting, since the models are scored on documents they have not seen. We’ll do a 75:25 train/test split, using caTools.
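
A minimal sketch of a 75:25 split with caTools follows. The toy data frame, the seed and the name `test_df` are hypothetical; `sample.split` is assumed to be the caTools function used, since it splits within each class and so keeps the spam/ham ratio similar in both parts.

```r
library(caTools)

set.seed(123)  # hypothetical seed, for reproducibility only

# Toy term-count data with an imbalanced label (hypothetical values).
email_df <- data.frame(
  free    = rpois(200, 1),
  meeting = rpois(200, 1),
  spam    = rep(c(0L, 1L), times = c(160, 40))
)

# Logical vector: TRUE marks rows assigned to the training set.
in_train <- sample.split(email_df$spam, SplitRatio = 0.75)

train_df <- subset(email_df, in_train)
test_df  <- subset(email_df, !in_train)

# Sizes of the two parts and the class balance in each.
print(c(train = nrow(train_df), test = nrow(test_df)))
print(prop.table(table(train_df$spam)))
print(prop.table(table(test_df$spam)))
```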

To apply each model we will follow the steps below:

  1. Initialize each model classifier with the training set data and the target variable in the training data.
  2. Make predictions on the test set.
  3. Summarize the model.
  4. Check the results using a confusion matrix.

Support Vector Machine

We start with a Support Vector Machine model. A Support Vector Machine (SVM) is a classifier defined by a separating hyperplane: given labeled training data, the algorithm outputs an optimal hyperplane that categorizes new data. Let’s apply the steps defined above using svm.
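
The four steps can be sketched with e1071 as below, on simulated term counts rather than the real data. The toy columns, the seed and the names `predictors` and `svm_pred` are hypothetical. Unlike the call printed in the output that follows, this sketch passes only the predictor columns as x, so the label itself is not among the inputs.

```r
library(e1071)

set.seed(42)  # hypothetical seed

# Simulated term counts: "free" is common in spam, "meeting" in ham.
n    <- 200
spam <- rep(c(0L, 1L), each = n / 2)
toy  <- data.frame(
  free    = rpois(n, ifelse(spam == 1, 3, 0.3)),
  meeting = rpois(n, ifelse(spam == 1, 0.3, 3)),
  spam    = spam
)
idx      <- sample(n, 150)
train_df <- toy[idx, ]
test_df  <- toy[-idx, ]

# Every column except the target is a predictor.
predictors <- setdiff(names(train_df), "spam")

# 1. Fit the classifier (radial kernel and cost = 1 are the e1071 defaults).
svm_model <- svm(x = as.matrix(train_df[, predictors]),
                 y = as.factor(train_df$spam))

# 2. Predict on the held-out rows.
svm_pred <- predict(svm_model, as.matrix(test_df[, predictors]))

# 3. Summarize the model.
print(summary(svm_model))

# 4. Confusion matrix and accuracy.
print(table(test_df$spam, svm_pred,
            dnn = c("Actual Class", "Predicted Class")))
svm_accuracy <- mean(as.character(svm_pred) == as.character(test_df$spam))
print(svm_accuracy)
```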

## 
## Call:
## svm.default(x = train_df, y = as.factor(train_df$spam))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  1562
## 
##  ( 501 1061 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
##             Predicted Class
## Actual Class    0    1
##            0 1093    0
##            1   20  321
## Confusion Matrix and Statistics
## 
##               svm actual
## svm prediction    0    1
##              0 1093   20
##              1    0  321
##                                           
##                Accuracy : 0.9861          
##                  95% CI : (0.9785, 0.9915)
##     No Information Rate : 0.7622          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9607          
##                                           
##  Mcnemar's Test P-Value : 2.152e-05       
##                                           
##             Sensitivity : 0.9413          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9820          
##              Prevalence : 0.2378          
##          Detection Rate : 0.2238          
##    Detection Prevalence : 0.2238          
##       Balanced Accuracy : 0.9707          
##                                           
##        'Positive' Class : 1               
## 
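
The statistics in the output above follow directly from the four cells of the SVM confusion matrix, with spam (1) as the positive class. The short check below recomputes them in base R from the reported counts; the variable names are introduced here for illustration.

```r
# Cells of the SVM confusion matrix, positive class = spam (1).
tp <- 321   # spam predicted as spam
fn <- 20    # spam predicted as ham
tn <- 1093  # ham predicted as ham
fp <- 0     # ham predicted as spam

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of spam that is caught
specificity <- tn / (tn + fp)   # share of ham that is left alone
ppv         <- tp / (tp + fp)   # positive predictive value
npv         <- tn / (tn + fn)   # negative predictive value
balanced    <- (sensitivity + specificity) / 2

svm_stats <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                     specificity = specificity, ppv = ppv, npv = npv,
                     balanced = balanced), 4)
print(svm_stats)
# accuracy 0.9861, sensitivity 0.9413, specificity 1, ppv 1,
# npv 0.9820, balanced 0.9707
```

All 20 errors are spam messages classified as ham. No ham message was flagged as spam, which is usually the more costly mistake for a mail filter.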

Random Forest

Next we use another model, Random Forest, to make predictions and compare it with SVM. It is an ensemble tree-based learning algorithm. The Random Forest classifier builds a set of decision trees, each from a randomly selected subset of the training data, and aggregates their votes to decide the final class of each test document.
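
A comparable sketch with the randomForest package is given below, again on simulated term counts. The toy data, the seed and the names `predictors` and `rf_pred` are hypothetical, and the target column is deliberately left out of x. The default of 500 trees is kept.

```r
library(randomForest)

set.seed(42)  # hypothetical seed

# Simulated term counts: "free" is common in spam, "meeting" in ham.
n    <- 200
spam <- rep(c(0L, 1L), each = n / 2)
toy  <- data.frame(
  free    = rpois(n, ifelse(spam == 1, 3, 0.3)),
  meeting = rpois(n, ifelse(spam == 1, 0.3, 3)),
  spam    = spam
)
idx      <- sample(n, 150)
train_df <- toy[idx, ]
test_df  <- toy[-idx, ]

# Every column except the target is a predictor.
predictors <- setdiff(names(train_df), "spam")

# 1. Fit the forest; a factor response selects classification.
rf_model <- randomForest(x = train_df[, predictors],
                         y = as.factor(train_df$spam))

# 2. Predict on the held-out rows (majority vote over the trees).
rf_pred <- predict(rf_model, test_df[, predictors])

# 3. Summarize the model.
print(summary(rf_model))

# 4. Confusion matrix and accuracy.
print(table(test_df$spam, rf_pred,
            dnn = c("Actual Class", "Predicted Class")))
rf_accuracy <- mean(as.character(rf_pred) == as.character(test_df$spam))
print(rf_accuracy)
```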

##                 Length Class  Mode     
## call               3   -none- call     
## type               1   -none- character
## predicted       4294   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           8588   matrix numeric  
## oob.times       4294   -none- numeric  
## classes            2   -none- character
## importance      1430   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               4294   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL
##             Predicted Class
## Actual Class    0    1
##            0 1093    0
##            1    0  341
## Confusion Matrix and Statistics
## 
##              rf actual
## rf prediction    0    1
##             0 1093    0
##             1    0  341
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9974, 1)
##     No Information Rate : 0.7622     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.2378     
##          Detection Rate : 0.2378     
##    Detection Prevalence : 0.2378     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 1          
## 

Conclusion

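Both models were fitted on the training set and evaluated on the test set. On this split, Random Forest reached a higher test accuracy (100%) than SVM (~98.6%). Random Forest also took more time to run than SVM. Note that the SVM call shown above passes train_df, which still contains the spam column, as x, and that the Random Forest importance vector has 1,430 entries, one per column of the dataset. This suggests the target variable was included among the predictors, so both accuracies are likely optimistic. We could also use hyperparameter tuning to improve model performance.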

Recommendation

We could run a few more advanced models and compare the results for further improvement. We could also further refine the corpus by reducing data sparsity.