For this project, we used the documents in the ham/spam dataset located at https://spamassassin.apache.org/old/publiccorpus/. We partitioned the dataset into training and testing sets and used the rpart package to fit a decision tree on the training portion. We then used the fitted tree to classify the held-out documents that were not used to build it, and drew our final conclusions from those test results.
Creating the Dataframe
For our dataset, we took the files from https://spamassassin.apache.org/old/publiccorpus/, converted them to a .csv file, and uploaded that file to our GitHub repository. From there, we read in the data by calling the read_csv function from the readr library on the raw GitHub link.
library(tidyverse)
# A tibble: 6 × 4
  file                                   label text                           id
  <chr>                                  <chr> <chr>                       <dbl>
1 00001.7c53336b37003a9286aba55d2945844c ham   "From exmh-workers-admin@r…     1
2 00002.9c4069e25e1ef370c078db7ee85ff9ac ham   "From Steve_Burt@cursor-sy…     2
3 00003.860e3c3cee1b42ead714c5c874fe25f7 ham   "From timc@2ubh.com Thu A…      3
4 00004.864220c5b6930b209cc287c361c99af1 ham   "From irregulars-admin@tb.…     4
5 00005.bf27cdeaf0b8c4647ecd61b1d09da613 ham   "From Stewart.Smith@ee.ed.…     5
6 00006.253ea2f9a9cc36fa0b1129b04b806608 ham   "From martin@srv0.ems.ed.a…     6
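The conversion and read-in step described above can be sketched roughly as follows. This is only an illustration under assumptions: the helper `read_corpus_dir`, the folder names, and the two tiny messages are hypothetical stand-ins, and a local temporary file takes the place of the raw GitHub link, which is not reproduced here.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read every file in `dir` into one row per email.
read_corpus_dir <- function(dir, label) {
  paths <- list.files(dir, full.names = TRUE)
  data.frame(
    file  = basename(paths),
    label = rep(label, length(paths)),
    text  = vapply(paths, read_file, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Two made-up messages in a temporary folder (NOT real corpus data).
demo_dir <- tempfile("corpus_")
dir.create(file.path(demo_dir, "ham"), recursive = TRUE)
dir.create(file.path(demo_dir, "spam"))
writeLines(c("From a@example.com", "Subject: lunch?"),
           file.path(demo_dir, "ham", "00001.aaa"))
writeLines(c("From b@example.com", "Subject: WIN $$$ NOW!!!"),
           file.path(demo_dir, "spam", "00001.bbb"))

# Stack both folders and add a running id, mirroring the columns shown above.
emails <- bind_rows(
  read_corpus_dir(file.path(demo_dir, "ham"), "ham"),
  read_corpus_dir(file.path(demo_dir, "spam"), "spam")
) %>%
  mutate(id = row_number())

# Write the .csv, then read it back the way the report reads its own file.
csv_path <- file.path(demo_dir, "emails.csv")
write_csv(emails, csv_path)
emails_back <- read_csv(csv_path, show_col_types = FALSE)
```

In the project itself, `csv_path` would be replaced by the raw GitHub link to the uploaded file.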
Email Model
Next, we created an email model. The variables used to classify emails as spam or ham are email length, number of words, number of exclamation marks, number of dollar signs, number of links, number of digits, and uppercase ratio.
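One possible way to derive these variables from the text column is sketched below with stringr. The helper name `add_email_features`, the column names, and the exact definitions are assumptions made for illustration: a "link" is counted as any occurrence of http:// or https://, and the uppercase ratio is taken as uppercase letters divided by all ASCII letters. The report does not state which definitions were actually used.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: adds one column per feature named in the text.
add_email_features <- function(emails) {
  emails %>%
    mutate(
      email_length = str_length(text),
      n_words      = str_count(text, "\\S+"),
      n_exclaim    = str_count(text, fixed("!")),
      n_dollar     = str_count(text, fixed("$")),
      n_links      = str_count(text, regex("https?://", ignore_case = TRUE)),
      n_digits     = str_count(text, "[0-9]"),
      # Assumed definition: uppercase letters divided by all ASCII letters.
      n_letters    = str_count(text, "[A-Za-z]"),
      upper_ratio  = if_else(n_letters > 0,
                             str_count(text, "[A-Z]") / n_letters, 0)
    ) %>%
    select(-n_letters)
}

# Made-up one-row example (not from the corpus).
toy <- data.frame(label = "spam",
                  text = "WIN $$$ now!!! http://x.io 42",
                  stringsAsFactors = FALSE)
toy_features <- add_email_features(toy)
```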
To train and test the model, we partitioned the data so that 70% was used for training the model and the remaining 30% was used for testing it.
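A minimal sketch of a 70/30 split followed by an rpart fit is given here. It is not the project's code: the data frame is synthetic, the seed and column names are arbitrary, and the split uses a plain random sample from base R. The report does not say how the partition was drawn, and a stratified split (for example with caret's createDataPartition) would keep the ham/spam proportions closer between the two sets.

```r
library(rpart)

set.seed(607)  # arbitrary seed so the split is reproducible

# Synthetic stand-in for the feature table (NOT the project's data).
n <- 200
features <- data.frame(
  label       = factor(rep(c("ham", "spam"), each = n / 2)),
  n_exclaim   = c(rpois(n / 2, 1), rpois(n / 2, 6)),
  upper_ratio = c(runif(n / 2, 0, 0.2), runif(n / 2, 0.1, 0.6))
)

# 70% of the rows go to training, the remaining 30% to testing.
train_idx <- sample(seq_len(nrow(features)), size = floor(0.7 * nrow(features)))
train <- features[train_idx, ]
test  <- features[-train_idx, ]

# Fit a classification tree on the training rows only.
tree <- rpart(label ~ ., data = train, method = "class")

# Classify the held-out rows and cross-tabulate predictions against truth.
pred <- predict(tree, newdata = test, type = "class")
toy_table <- table(Prediction = pred, Reference = test$label)
```

The counts in `toy_table` come from the synthetic rows only and say nothing about the real emails.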
Confusion Matrix and Statistics

          Reference
Prediction ham spam
      ham  660   79
      spam  90  339

               Accuracy : 0.8553
                 95% CI : (0.8338, 0.875)
    No Information Rate : 0.6421
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.687

 Mcnemar's Test P-Value : 0.4418

            Sensitivity : 0.8800
            Specificity : 0.8110
         Pos Pred Value : 0.8931
         Neg Pred Value : 0.7902
             Prevalence : 0.6421
         Detection Rate : 0.5651
   Detection Prevalence : 0.6327
      Balanced Accuracy : 0.8455

       'Positive' Class : ham
Conclusion
The decision tree model achieved an accuracy of 0.8553 on the testing data, meaning that about 85.5% of the emails were correctly classified as either ham or spam. The confusion matrix shows that the model correctly classified 660 ham emails and 339 spam emails. However, it misclassified 79 spam emails as ham and 90 ham emails as spam. The model's accuracy is higher than the no information rate of 0.6421 (P-value < 2e-16), which indicates that the model performs significantly better than simply predicting the majority class.
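As a check on the arithmetic, the accuracy and the no information rate can be recomputed from the four counts in the confusion matrix. The variable names here are illustrative only.

```r
# Counts from the confusion matrix: rows are predictions, columns are truth.
cm <- matrix(c(660, 90, 79, 339), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

# Accuracy: correctly classified emails over all test emails.
accuracy <- sum(diag(cm)) / sum(cm)

# No information rate: share of the largest true class (here, ham).
nir <- max(colSums(cm)) / sum(cm)

round(c(accuracy = accuracy, nir = nir), 4)
```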
Balanced accuracy, the average of the model's sensitivity (0.8800) and specificity (0.8110), was 0.8455. This is useful because the dataset has more ham emails than spam emails, so plain accuracy can be inflated by the majority class. The balanced accuracy shows that the model predicted both ham and spam reasonably well, although with ham as the positive class it identified ham (88.0%) more reliably than spam (81.1%).
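The balanced accuracy can be reproduced from the confusion matrix counts, treating ham as the positive class as the output does. The variable names are illustrative.

```r
# Counts from the confusion matrix: rows are predictions, columns are truth.
cm <- matrix(c(660, 90, 79, 339), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

# Sensitivity: share of true ham emails predicted as ham.
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])

# Specificity: share of true spam emails predicted as spam.
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])

# Balanced accuracy: plain average of the two rates.
balanced_accuracy <- (sensitivity + specificity) / 2
```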
The Kappa value was 0.687, which indicates fairly strong agreement between the predictions and the true labels beyond what would be expected by chance.
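Cohen's Kappa compares the observed agreement with the agreement expected by chance from the row and column totals. A short sketch of that calculation on the confusion matrix counts follows; the variable names are illustrative.

```r
# Counts from the confusion matrix: rows are predictions, columns are truth.
cm <- matrix(c(660, 90, 79, 339), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

n <- sum(cm)

# Observed agreement: the share of emails on the diagonal (same as accuracy).
p_observed <- sum(diag(cm)) / n

# Chance agreement: for each class, predicted share times true share, summed.
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Kappa rescales the observed agreement so that 0 means chance level.
kappa_hat <- (p_observed - p_expected) / (1 - p_expected)
```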