Project 4: Document Classification

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/

Dataset from Kaggle

The dataset is obtained from Kaggle. It contains 5171 labeled emails, a mix of spam and ham.

https://www.kaggle.com/datasets/venky73/spam-mails-dataset

Download

The CSV file is uploaded to my GitHub repository. Use read.csv() to read it in; the code below also drops the unused fourth column and strips the "Subject:" and "re:" prefixes from the message text.

kaggle <- read.csv("https://raw.githubusercontent.com/suswong/DATA-607-Project-4/main/spam_ham_dataset.csv")
kaggle <- kaggle[,-4]                              # drop the unused fourth column
kaggle$text <- gsub("Subject:", "", kaggle$text)   # remove the "Subject:" prefix
kaggle$text <- gsub("re:", "", kaggle$text)        # remove "re:" reply markers

Number of Spam and Ham

There are 3672 ham and 1499 spam observations in the Kaggle dataset.

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("DT")
ham_or_spam <- kaggle %>%
  count(label)
datatable(ham_or_spam)
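
For a quick visual check of the class imbalance, a simple base-R bar plot of the counts works (an optional sketch; the original report only tabulates the counts):

# Bar plot of the class counts computed above
barplot(ham_or_spam$n,
        names.arg = ham_or_spam$label,
        ylab = "Number of emails",
        main = "Ham vs. Spam")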

Create a Corpus and Document Term Matrix

To clean the corpus, we convert all text to lowercase, remove numbers, stopwords, and punctuation, strip extra whitespace, and stem each document.

#install.packages("tm")
library(tm)
## Loading required package: NLP
spam_ham_corpus <- Corpus(VectorSource(as.vector(kaggle$text)))
spam_ham_corpus <- tm_map(spam_ham_corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, content_transformer(tolower)):
## transformation drops documents
spam_ham_corpus <- tm_map(spam_ham_corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, removeNumbers): transformation
## drops documents
spam_ham_corpus <- tm_map(spam_ham_corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, removeWords,
## stopwords("english")): transformation drops documents
spam_ham_corpus <- tm_map(spam_ham_corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, removePunctuation):
## transformation drops documents
spam_ham_corpus <- tm_map(spam_ham_corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, stripWhitespace): transformation
## drops documents
spam_ham_corpus <- tm_map(spam_ham_corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, stemDocument): transformation
## drops documents
DTM <- DocumentTermMatrix(spam_ham_corpus)

# Drop sparse terms: keep only terms appearing in at least 1% of documents
DTM <- removeSparseTerms(DTM, 0.99)
spam_ham <- as.data.frame(as.matrix(DTM))

spam_ham$label <- kaggle$label    # reattach the class labels
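
To see which terms survive the sparsity filter, tm's findFreqTerms() lists terms above a frequency threshold (an optional check; the 1000 cutoff is an arbitrary choice):

# Terms that occur at least 1000 times across the corpus
findFreqTerms(DTM, lowfreq = 1000)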

Training and Testing

80% of the observations will be used for training and 20% for testing.

#install.packages("caTools")
library(caTools)
set.seed(1113)

# sample.split() expects the vector of labels, not the whole data frame
sample <- sample.split(spam_ham$label, SplitRatio = 0.8)
train  <- subset(spam_ham, sample == TRUE)
test   <- subset(spam_ham, sample == FALSE)
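
An equivalent stratified split could also be done with caret's createDataPartition(), which is loaded later for the confusion matrix anyway. This is an alternative sketch, not the split used in this report; train2 and test2 are hypothetical names.

# Alternative stratified split with caret (sketch, not used below)
library(caret)
idx    <- createDataPartition(spam_ham$label, p = 0.8, list = FALSE)
train2 <- spam_ham[idx, ]
test2  <- spam_ham[-idx, ]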

Count the Number of Spam and Ham in Each Set

We want to check the class proportions in the train and test data. The proportions in the two sets are similar.

train_count <- train %>%
  count(label)
datatable(train_count)
prop.table(train_count$n)
## [1] 0.7093023 0.2906977
test_count <- test %>%
  count(label)
datatable(test_count)
prop.table(test_count$n)
## [1] 0.7133269 0.2866731

Model and Prediction

We train a Naive Bayes model and use it to predict the labels of the test data.

#install.packages("naivebayes")
#install.packages("e1071")

library(naivebayes)
## naivebayes 0.9.7 loaded
library(e1071)
train$label <- factor(train$label)            # ensure the response is a factor
model <- naive_bayes(label ~ ., data = train)

prediction <- predict(model, test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
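
Before building the full confusion matrix, a one-line check of raw accuracy can catch obvious problems (an optional sanity check, consistent with the accuracy reported below):

# Fraction of test messages classified correctly
mean(prediction == test$label)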

Model Accuracy

#install.packages("caret")
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
## Loading required package: lattice
confusion_matrix <- table(test$label, prediction)   # rows = actual, columns = predicted

confusionMatrix(confusion_matrix)
## Confusion Matrix and Statistics
## 
##       prediction
##        ham spam
##   ham  490  254
##   spam   5  294
##                                           
##                Accuracy : 0.7517          
##                  95% CI : (0.7243, 0.7776)
##     No Information Rate : 0.5254          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5139          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9899          
##             Specificity : 0.5365          
##          Pos Pred Value : 0.6586          
##          Neg Pred Value : 0.9833          
##              Prevalence : 0.4746          
##          Detection Rate : 0.4698          
##    Detection Prevalence : 0.7133          
##       Balanced Accuracy : 0.7632          
##                                           
##        'Positive' Class : ham             
## 

Conclusion

Treating spam as the positive class, 5 spam messages were misclassified as ham (false negatives) and 254 ham messages were misclassified as spam (false positives). The model above has a 75.17% accuracy rate, and the balanced accuracy is 76.32%.

A larger dataset could help train the model better, or we could try other machine learning models.

Other R packages, such as randomForest and kernlab, also provide classification models; a brief random forest sketch follows.
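
As one illustration, a random forest could be trained on the same document-term features. This is a minimal sketch, assuming the train and test data frames from above; the ntree value is arbitrary, and the model was not run for this report.

#install.packages("randomForest")
library(randomForest)

train_rf <- train
test_rf  <- test
# randomForest's formula interface needs syntactically valid column names,
# so rename the term columns defensively
colnames(train_rf) <- make.names(colnames(train_rf))
colnames(test_rf)  <- make.names(colnames(test_rf))
train_rf$label <- factor(train_rf$label)    # factor response, as with naive_bayes()

rf_model <- randomForest(label ~ ., data = train_rf, ntree = 100)
rf_pred  <- predict(rf_model, test_rf)
table(test_rf$label, rf_pred)               # confusion matrix for the random forest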