Project 4 Code Base Submission

Authors

Long Lin, Zihao Yu

Overview

For this project, we used the documents in the ham/spam dataset located at https://spamassassin.apache.org/old/publiccorpus/ . We used a portion of these documents to build a decision tree model: the caret package partitioned the dataset into training and testing sets, and the rpart package fit the decision tree on the training set. Once the model was built, we used it to classify the held-out documents that played no part in training, and we drew our final conclusions from those results.

Creating the Dataframe

For our dataset, we took the files from https://spamassassin.apache.org/old/publiccorpus/ , converted them to a .csv file, and uploaded that file to our GitHub repository. From there, we read in the data with the readr library's read_csv() function pointed at the raw GitHub link.
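
The conversion script itself is not included in this report. Below is a minimal sketch of what that step might look like; the local directory names "easy_ham" and "spam" and the helper read_corpus() are our assumptions about the layout, not code from the original pipeline.

# Sketch of the raw-corpus-to-CSV step (assumed layout: one message per file)
library(tidyverse)

read_corpus <- function(dir, label) {
  files <- list.files(dir, full.names = TRUE)
  tibble(
    file  = basename(files),    # e.g. 00001.7c53336b...
    label = label,              # "ham" or "spam"
    text  = map_chr(files, \(f) paste(read_lines(f), collapse = "\n"))
  )
}

emails_raw <- bind_rows(
  read_corpus("easy_ham", "ham"),
  read_corpus("spam", "spam")
) |>
  mutate(id = row_number())

write_csv(emails_raw, "emails.csv")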

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(rpart)
library(rpart.plot)
library(caret)
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
url <- "https://github.com/longflin/DATA-607-Project-4/raw/refs/heads/main/emails.csv"

emails <- read_csv(url, show_col_types = FALSE)

head(emails)
# A tibble: 6 × 4
  file                                   label text                           id
  <chr>                                  <chr> <chr>                       <dbl>
1 00001.7c53336b37003a9286aba55d2945844c ham   "From exmh-workers-admin@r…     1
2 00002.9c4069e25e1ef370c078db7ee85ff9ac ham   "From Steve_Burt@cursor-sy…     2
3 00003.860e3c3cee1b42ead714c5c874fe25f7 ham   "From timc@2ubh.com  Thu A…     3
4 00004.864220c5b6930b209cc287c361c99af1 ham   "From irregulars-admin@tb.…     4
5 00005.bf27cdeaf0b8c4647ecd61b1d09da613 ham   "From Stewart.Smith@ee.ed.…     5
6 00006.253ea2f9a9cc36fa0b1129b04b806608 ham   "From martin@srv0.ems.ed.a…     6

Email Model

Next, we created the data frame behind the email model. The features used to classify emails as spam or ham are text length, number of words, number of exclamation marks, number of dollar signs, number of links, number of digits, and uppercase ratio.

emails_model <- emails |>
  mutate(
    # Encode the outcome as a factor with ham as the reference level
    is_spam = factor(label, levels = c("ham", "spam")),
    # Guard against missing text and strip invalid bytes before computing features
    text = replace_na(text, ""),
    text = iconv(text, to = "UTF-8", sub = ""),
    text_length = str_length(text),
    num_words = str_count(text, "\\S+"),
    num_exclamation = str_count(text, fixed("!")),
    num_dollar = str_count(text, fixed("$")),
    # Rough link count: occurrences of "http" or "www"
    num_links = str_count(str_to_lower(text), "http|www"),
    num_digits = str_count(text, "\\d"),
    # Share of characters that are uppercase, avoiding division by zero
    uppercase_ratio = if_else(
      text_length > 0,
      str_count(text, "[A-Z]") / text_length,
      0
    )
  ) |>
  select(
    is_spam,
    text_length,
    num_words,
    num_exclamation,
    num_dollar,
    num_links,
    num_digits,
    uppercase_ratio
  )
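
Before partitioning, it is worth confirming the class balance, since ham outnumbers spam here (this is also why the no information rate later comes out to about 0.64). This quick check is not part of the original output:

# Count ham vs. spam; a naive "always ham" classifier would already score
# around the no information rate of 0.6421
table(emails_model$is_spam)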

Creating the Data Partition

For training and testing the model, we created a partition in which 70% of the data was used to train the model and the remaining 30% was held out for testing.

set.seed(8585)

index <- createDataPartition(
  emails_model$is_spam,
  p = 0.7,
  list = FALSE
)

train_data <- emails_model[index, ]
test_data <- emails_model[-index, ]
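
Because createDataPartition() stratifies on the outcome variable, both splits should keep nearly the same ham/spam mix. A quick check of this (output not reproduced here):

# Verify the stratified split preserved the class proportions
prop.table(table(train_data$is_spam))
prop.table(table(test_data$is_spam))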

Decision Tree

We created a decision tree using the 70% training partition.

spam_model <- rpart(
  is_spam ~ .,                         # use all engineered features
  data = train_data,
  method = "class",                    # classification tree
  control = rpart.control(cp = 0.01)
)

rpart.plot(
  spam_model,
  type = 5,      # show the split variable names in the interior nodes
  extra = 101,   # display counts and percentages at each node
  under = TRUE,  # place that extra text under the node boxes
  cex = 0.8,
  main = "Spam Detection Decision Tree"
)
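
The cp = 0.01 passed above is rpart's default complexity parameter. If a simpler tree were desired, the cross-validated error in the model's cp table could guide further pruning; the sketch below uses rpart's standard printcp()/prune() idiom and is not part of the original analysis.

# Inspect cross-validated error (xerror) for each candidate cp value
printcp(spam_model)

# Prune back to the cp with the lowest cross-validated error
best_cp <- spam_model$cptable[which.min(spam_model$cptable[, "xerror"]), "CP"]
spam_pruned <- prune(spam_model, cp = best_cp)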

Predictions

Next, we applied the decision tree to the 30% testing partition.

predictions <- predict(
  spam_model,
  newdata = test_data,
  type = "class"
)

confusionMatrix(predictions, test_data$is_spam)
Confusion Matrix and Statistics

          Reference
Prediction ham spam
      ham  660   79
      spam  90  339
                                         
               Accuracy : 0.8553         
                 95% CI : (0.8338, 0.875)
    No Information Rate : 0.6421         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.687          
                                         
 Mcnemar's Test P-Value : 0.4418         
                                         
            Sensitivity : 0.8800         
            Specificity : 0.8110         
         Pos Pred Value : 0.8931         
         Neg Pred Value : 0.7902         
             Prevalence : 0.6421         
         Detection Rate : 0.5651         
   Detection Prevalence : 0.6327         
      Balanced Accuracy : 0.8455         
                                         
       'Positive' Class : ham            
                                         

Conclusion

The decision tree model achieved an accuracy of 0.8553 on the testing data, meaning that about 85.5% of the emails were correctly classified as either ham or spam. The confusion matrix shows that the model correctly classified 660 ham emails and 339 spam emails. However, it misclassified 79 spam emails as ham and 90 ham emails as spam. The model’s accuracy is higher than the no information rate of 0.6421, which suggests that the model performs better than simply predicting the majority class.

Balanced accuracy was 0.8455, which averages the model's sensitivity and specificity. This metric is useful here because the dataset contains more ham emails than spam emails. A balanced accuracy of approximately 84.55% shows that the model predicted both the ham and the spam classes reasonably well, rather than simply doing well on the majority class.
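
That figure can be reproduced directly from the confusion matrix counts (recall that caret treats "ham" as the positive class in this output):

# Recompute the headline metrics from the raw confusion matrix counts
sens <- 660 / (660 + 90)   # sensitivity: ham correctly classified = 0.8800
spec <- 339 / (339 + 79)   # specificity: spam correctly classified ≈ 0.8110
(sens + spec) / 2          # balanced accuracy ≈ 0.8455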

The Kappa value was 0.687, which indicates fairly strong agreement between the predictions and the true labels beyond what chance alone would produce.