It can be useful to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict
the class of new documents (either withheld from the training dataset or
from another source such as your own spam folder). One example corpus:
https://spamassassin.apache.org/old/publiccorpus/
The dataset is obtained from Kaggle and contains 5171 observations of spam and ham emails:
https://www.kaggle.com/datasets/venky73/spam-mails-dataset
The CSV file is uploaded to my GitHub; we use read.csv() to read it in.
kaggle <- read.csv("https://raw.githubusercontent.com/suswong/DATA-607-Project-4/main/spam_ham_dataset.csv")
# Drop the fourth column (the numeric duplicate of the label column).
kaggle <- kaggle[,-4]
# Strip the "Subject:" and "re:" prefixes so they do not dominate the term counts.
kaggle$text <- gsub("Subject:", "", kaggle$text)
kaggle$text <- gsub("re:", "", kaggle$text)
There are 3672 ham and 1499 spam emails in the Kaggle dataset.
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("DT")
ham_or_spam <- kaggle %>%
  count(label)
datatable(ham_or_spam)
To clean the corpus, we convert all text to lowercase, remove numbers, remove stopwords, remove punctuation, strip extra whitespace, and stem each word.
#install.packages("tm")
library(tm)
## Loading required package: NLP
spam_ham_corpus <- Corpus(VectorSource(as.vector(kaggle$text)))
spam_ham_corpus <- tm_map(spam_ham_corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, content_transformer(tolower)):
## transformation drops documents
spam_ham_corpus <- tm_map(spam_ham_corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, removeNumbers): transformation
## drops documents
spam_ham_corpus <- tm_map(spam_ham_corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, removeWords,
## stopwords("english")): transformation drops documents
spam_ham_corpus <- tm_map(spam_ham_corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, removePunctuation):
## transformation drops documents
spam_ham_corpus <- tm_map(spam_ham_corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, stripWhitespace): transformation
## drops documents
spam_ham_corpus <- tm_map(spam_ham_corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(spam_ham_corpus, stemDocument): transformation
## drops documents
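The repeated “transformation drops documents” warnings are a known quirk of tm_map() on a SimpleCorpus; no documents are actually removed. A quick optional check:
# The corpus should still contain all 5171 documents.
length(spam_ham_corpus)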
DTM <- DocumentTermMatrix(spam_ham_corpus)
# Remove sparse terms with threshold of 99%
DTM <- removeSparseTerms(DTM, 0.99)
spam_ham <- as.data.frame(as.matrix(DTM))
spam_ham$label <- kaggle$label
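Before modeling, it can help to confirm how much the sparsity filter pruned the vocabulary. A minimal optional check on the DTM built above:
# Rows = documents, columns = terms retained after removeSparseTerms().
dim(DTM)
# Terms appearing at least 500 times in the corpus (the cutoff here is arbitrary).
findFreqTerms(DTM, lowfreq = 500)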
80% of the observations will be used for training and 20% for testing.
#install.packages("caTools")
library(caTools)
set.seed(1113)
# Split on the label column so train and test keep a similar class balance.
sample <- sample.split(spam_ham$label, SplitRatio = 0.8)
train <- subset(spam_ham, sample == TRUE)
test <- subset(spam_ham, sample == FALSE)
We want to check the class proportions in the training and test data. The proportions in both datasets are similar.
train_count <- train %>%
  count(label)
datatable(train_count)
prop.table(train_count$n)
## [1] 0.7093023 0.2906977
test_count <- test %>%
  count(label)
datatable(test_count)
prop.table(test_count$n)
## [1] 0.7133269 0.2866731
We use a Naive Bayes model, trained on the training set, to predict the labels of the test data.
#install.packages("naivebayes")
#install.packages("e1071")
library(naivebayes)
## naivebayes 0.9.7 loaded
library(e1071)
model <- naive_bayes(label ~ ., train)
prediction <- predict(model, test)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
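The warning appears because the test set passed to predict() still contains the label column, for which the model has no probability table; the extra column is simply ignored. One optional way to avoid the warning (not part of the original run) is to drop the label before predicting:
# Predict from the term columns only; the label column is excluded.
prediction <- predict(model, test[, setdiff(names(test), "label")])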
#install.packages("caret")
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
## Loading required package: lattice
confusion_matrix <- table(test$label, prediction)
confusionMatrix(confusion_matrix)
## Confusion Matrix and Statistics
##
## prediction
## ham spam
## ham 490 254
## spam 5 294
##
## Accuracy : 0.7517
## 95% CI : (0.7243, 0.7776)
## No Information Rate : 0.5254
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5139
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9899
## Specificity : 0.5365
## Pos Pred Value : 0.6586
## Neg Pred Value : 0.9833
## Prevalence : 0.4746
## Detection Rate : 0.4698
## Detection Prevalence : 0.7133
## Balanced Accuracy : 0.7632
##
## 'Positive' Class : ham
##
5 observations were false negatives (spam predicted as ham) and 254 observations were false positives (ham predicted as spam). The model above has a 75.17% accuracy rate, and the balanced accuracy is 76.32%.
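Note that confusionMatrix() treats ham as the positive class by default (alphabetical order). If spam detection is the metric of interest, the same table can be re-summarized with spam as the positive class:
# Re-summarize the same confusion table with "spam" as the positive class,
# so sensitivity and specificity describe spam detection directly.
confusionMatrix(confusion_matrix, positive = "spam")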
A larger dataset could help train the model better, or we could try other machine learning models. Other R packages, such as randomForest and kernlab, can also be used to build a classifier; one possible random forest approach is sketched below.
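This sketch is not run here; it assumes the train and test objects created above, and randomForest requires syntactically valid column names and a factor response:
#install.packages("randomForest")
library(randomForest)
train_rf <- train
test_rf <- test
# DTM terms are not guaranteed to be valid R names, so sanitize them.
colnames(train_rf) <- make.names(colnames(train_rf), unique = TRUE)
colnames(test_rf) <- make.names(colnames(test_rf), unique = TRUE)
train_rf$label <- as.factor(train_rf$label)
rf_model <- randomForest(label ~ ., data = train_rf, ntree = 100)
rf_prediction <- predict(rf_model, test_rf)
table(test_rf$label, rf_prediction)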