Assignment Overview and Data Import

The goal of this assignment is to develop a classification model that determines whether a text message is spam (i.e., dangerous) or ham (i.e., safe). I selected a spam/ham dataset of 5,572 text messages freely available via Kaggle (https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset), which I downloaded and then stored in my GitHub repository for ease of use. The data consists chiefly of two columns: the first contains the classification of each message (spam or ham), while the second contains the message text.

knitr::opts_chunk$set(echo = TRUE)

# Import libraries
library(tm)
## Warning: package 'tm' was built under R version 4.4.2
## Loading required package: NLP
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::annotate() masks NLP::annotate()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(RCurl)
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:tidyr':
## 
##     complete
library(quanteda)
## Warning: package 'quanteda' was built under R version 4.4.2
## Package version: 4.1.0
## Unicode version: 15.1
## ICU version: 74.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## 
## The following object is masked from 'package:tm':
## 
##     stopwords
## 
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
library(caTools)
## Warning: package 'caTools' was built under R version 4.4.2
library(pscl)
## Warning: package 'pscl' was built under R version 4.4.2
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
#read the CSV from GitHub and rename columns for clarity
getdata <- getURL('https://raw.githubusercontent.com/kr0710/Data607/refs/heads/main/spamham.csv')

df <- read.csv(text = getdata, header = TRUE)[,c(1,2)]
colnames(df) <- c('classification', 'text_messages')

#View the first ten entries in the data set
kable(df[c(1:10),])
|classification |text_messages |
|:--------------|:-------------|
|ham            |Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat… |
|ham            |Ok lar… Joking wif u oni… |
|spam           |Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s |
|ham            |U dun say so early hor… U c already then say… |
|ham            |Nah I don’t think he goes to usf, he lives around here though |
|spam           |FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv |
|ham            |Even my brother is not like to speak with me. They treat me like aids patent. |
|ham            |As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your callertune for all Callers. Press *9 to copy your friends Callertune |
|spam           |WINNER!! As a valued network customer you have been selected to receivea å£900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only. |
|spam           |Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030 |
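
As a quick sanity check before modeling, the snippet below confirms the row count and the ham/spam balance; it is not part of the core workflow, and it uses only base R functions on the df built above.

#optional sanity check: confirm 5,572 rows and inspect the class balance
nrow(df)
table(df$classification)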

Corpus generation for modeling

The text corpus is generated using VCorpus, then processed with a series of transformations via tm_map (whitespace stripping, punctuation removal, lowercasing, and English stop word removal). The processed corpus is converted into a document-term matrix (DTM), which is then filtered by term sparsity to focus on a subsample of terms. The DTM is converted to a data frame and recombined with the classifications (ham or spam) from the original data set using cbind. Finally, the ham and spam classifications are converted to a factor (ham = 0, spam = 1) to facilitate predictive modeling.

#generate and process corpus

msgcorpus <- VCorpus(VectorSource(df$text_messages))

clean_corpus <- msgcorpus |>
  tm_map(stripWhitespace) |>
  tm_map(removePunctuation) |>
  tm_map(content_transformer(tolower)) |>
  tm_map(removeWords, stopwords("english"))

#generate document term matrix and remove sparsest terms
dtm <- DocumentTermMatrix(clean_corpus)

dtm_nonsparse <- removeSparseTerms(dtm, .9975)

#combine with original classifications and factorize ham and spam to 0 and 1, respectively
dtm_df <- as.data.frame(as.matrix(dtm_nonsparse))

combined_df <- cbind(dtm_df, df$classification)

#column 613 is the appended classification column (612 retained terms + 1)
colnames(combined_df)[613] <- 'classification'

combined_df$classification <- factor(combined_df$classification,
                                     levels = c('ham', 'spam'),
                                     labels = c(0,1))
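
As a sketch of how one might verify what the sparsity filter kept, the snippet below inspects the filtered DTM; it relies only on the objects created above, and the 612-term count follows from the column index used in the renaming step.

#optional inspection of the filtered DTM: dimensions and most frequent terms
dim(dtm_nonsparse)
head(sort(colSums(dtm_df), decreasing = TRUE), 10)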

Modeling spam and ham messages using a random forest classifier

I originally sought to build a classifier using KNN, but the algorithm consistently raised errors due to an inability to resolve ties, even with K set to 1. I therefore switched gears and built a random forest classifier instead. The data is split 70/30: 70% of the rows are used for training and 30% for testing. The random forest grows 50 trees, whose majority vote classifies each training or test sample. Both the training and test results show that this classifier is quite good at identifying ham (training class error of about 0.8%, test class error of about 0.7%), but it misclassifies roughly 20% of spam as ham in both sets, an error rate that is probably too high. The spam error rate could potentially be reduced with a different model or by tuning the random forest further (as sketched at the end of this section), but different data features could also be informative. For example, retaining even sparser terms from the DTM might lead to better classification of spam.

#split the data 70/30 into training and test sets (seed fixed for reproducibility)
set.seed(123)
split <- sample.split(combined_df$classification, SplitRatio = .7)
rftrain <- subset(combined_df, split == TRUE)
rftest <- subset(combined_df, split == FALSE)

#fit a 50-tree random forest on the term columns (classification is column 613)
rfclassifier <- randomForest(x = rftrain[-613],
                             y = rftrain$classification,
                             ntree = 50)
rfclassifier

Call:
 randomForest(x = rftrain[-613], y = rftrain$classification, ntree = 50) 
               Type of random forest: classification
                     Number of trees: 50
No. of variables tried at each split: 24

        OOB estimate of  error rate: 3.18%
Confusion matrix:
     0   1 class.error
0 3350  28 0.008288928
1   96 427 0.183556405

#predict on the held-out test set and summarize performance with a confusion matrix
rfpredict <- predict(rfclassifier, newdata = rftest[-613])

rfcm <- confusionMatrix(data = rfpredict, reference = rftest[,613])

rfcm

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1437   47
         1   10  177
                                          
               Accuracy : 0.9659          
                 95% CI : (0.956, 0.9741) 
    No Information Rate : 0.8659          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.842           
                                          
 Mcnemar's Test P-Value : 1.858e-06       
                                          
            Sensitivity : 0.9931          
            Specificity : 0.7902          
         Pos Pred Value : 0.9683          
         Neg Pred Value : 0.9465          
             Prevalence : 0.8659          
         Detection Rate : 0.8600          
   Detection Prevalence : 0.8881          
      Balanced Accuracy : 0.8916          
                                          
       'Positive' Class : 0               
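
As one illustration of the tuning mentioned above, randomForest accepts a cutoff argument that shifts the vote threshold between classes. The sketch below is a starting point rather than a tuned result: the 0.7/0.3 values are hypothetical, and lowering the threshold for class 1 would trade some ham accuracy for better spam recall.

#sketch: re-fit with a vote cutoff favoring the spam class (values are illustrative)
rftuned <- randomForest(x = rftrain[-613],
                        y = rftrain$classification,
                        ntree = 50,
                        cutoff = c(0.7, 0.3))
confusionMatrix(data = predict(rftuned, newdata = rftest[-613]),
                reference = rftest[,613])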