The purpose of this project is to practice building document classification models to predict the class of a document, and to gain experience with the textrecipes and tidymodels packages.
For this project, I will be using the SMS Spam Collection Data Set, a public set of labeled SMS messages collected for mobile phone spam research, originally sourced from: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
library(tidyverse)
library(magrittr)
library(textrecipes)
library(tidymodels)
library(themis)
library(ranger)
Read in file to clean
text_data <- read.delim("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/data/SMSSpamCollection", header = FALSE, stringsAsFactors = FALSE, quote = NULL, sep = "\t")
Add column headers
text_data %<>% rename(classification = V1, text = V2)
text_data$classification <- as_factor(text_data$classification)
Here we check how many ham and how many spam observations we have. As you can see, the classes are highly imbalanced, so we will likely upsample or downsample when we model the data.
text_data %>%
ggplot(aes(classification)) +
geom_bar(fill = "#56B4E9") +
theme_minimal() +
labs(x = NULL,
y = "Count",
title = "Classification Counts for Spam/Ham Dataset")
I utilized two resources to complete this project. The first is the “Text Classification with Tidymodels” tutorial by Emil Hvitfeldt (https://www.hvitfeldt.me/blog/text-classification-with-tidymodels/). The second is the “Get started with tidymodels and classification of penguin data” YouTube tutorial by Julia Silge (https://www.youtube.com/watch?v=z57i2GVcdww).
First, we will split our data into training and testing sets so that once we develop our models, we can evaluate them. This is done using the rsample package from tidymodels.
set.seed(689)
text_split <- initial_split(text_data, strata = "classification", prop = 0.75)
train_data <- training(text_split)
test_data <- testing(text_split)
Next, we will prepare the data with the recipe function, performing preprocessing that includes downsampling the over-represented class (ham), tokenizing by word, stemming the tokens, filtering out infrequent tokens, and computing tf-idf weights.
text_recipe <- recipe(classification ~ ., data = train_data) %>%
  themis::step_downsample(classification, under_ratio = 1) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_tokenfilter(text, min_times = 10) %>%
  step_tfidf(text) %>%
  prep(training = train_data)
ready_train <- juice(text_recipe)
ready_test <- bake(text_recipe, test_data)
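As a quick sanity check (an optional step, not required for the modeling), you can inspect the dimensions of the processed training set; each retained token becomes a tfidf_text_* column:
# Rows are the downsampled messages; columns are the outcome plus one tf-idf feature per token
dim(ready_train)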
I decided to compare two models: a logistic regression model and a random forest classifier. For full transparency, I know far more about logistic regression and would not know if I were violating any assumptions of the random forest model, but I thought it was good practice to try multiple models.
First, we define the model specifications using the parsnip package from tidymodels.
glmnet_spec <- logistic_reg(mixture = 0, penalty = 0.1) %>%
set_engine("glmnet")
rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")
Now we can fit the models to the training data.
test_model_glmnet <- glmnet_spec %>%
fit(classification ~ ., data = ready_train)
test_model_rf <- rf_spec %>%
fit(classification ~ ., data = ready_train)
Set up evaluation tibbles using parsnip's prediction functions so that we can evaluate the performance of the models with the yardstick package.
eval_tibble_glmnet <- test_data %>%
  select(classification) %>%
  mutate(
    class_model_glmnet = predict(test_model_glmnet, ready_test, type = "class") %>% pull(.pred_class),
    prop_model_glmnet = predict(test_model_glmnet, ready_test, type = "prob") %>% pull(.pred_spam))
eval_tibble_rf <- test_data %>%
  select(classification) %>%
  mutate(
    class_model_rf = predict(test_model_rf, ready_test, type = "class") %>% pull(.pred_class),
    prop_model_rf = predict(test_model_rf, ready_test, type = "prob") %>% pull(.pred_spam))
I will use two approaches to evaluating the models.
First, evaluate the logistic regression model by looking at accuracy, precision, and recall.
accuracy_glmnet <- accuracy(eval_tibble_glmnet, truth = classification, estimate = class_model_glmnet)
precision_glmnet <- precision(eval_tibble_glmnet, truth = classification, estimate = class_model_glmnet)
recall_glmnet <- recall(eval_tibble_glmnet, truth = classification, estimate = class_model_glmnet)
accuracy_glmnet
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.897
precision_glmnet
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.992
recall_glmnet
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.888
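As an aside, yardstick can bundle these three calls into a single metric set; a minimal sketch, equivalent to the individual calls above and assuming the same evaluation tibble:
# Combine accuracy, precision, and recall into one reusable metric function
glmnet_metrics <- metric_set(accuracy, precision, recall)
glmnet_metrics(eval_tibble_glmnet, truth = classification, estimate = class_model_glmnet)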
Second, use a confusion matrix, a cross-tabulation of predicted and reference classes, to evaluate the model, and use the summary function to look at accuracy, specificity, and other metrics.
conf_matrix_glmnet <- conf_mat(eval_tibble_glmnet, truth = classification, estimate = class_model_glmnet, dnn = c("Predicted", "Reference"))
conf_matrix_glmnet
## Reference
## Predicted ham spam
## ham 1071 9
## spam 135 177
summary(conf_matrix_glmnet)
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.897
## 2 kap binary 0.653
## 3 sens binary 0.888
## 4 spec binary 0.952
## 5 ppv binary 0.992
## 6 npv binary 0.567
## 7 mcc binary 0.685
## 8 j_index binary 0.840
## 9 bal_accuracy binary 0.920
## 10 detection_prevalence binary 0.776
## 11 precision binary 0.992
## 12 recall binary 0.888
## 13 f_meas binary 0.937
I prefer the second method for assessing model performance, so I will run the confusion matrix and summary for the random forest model below.
conf_matrix_rf <- conf_mat(eval_tibble_rf, truth = classification, estimate = class_model_rf, dnn = c("Predicted", "Reference"))
conf_matrix_rf
## Reference
## Predicted ham spam
## ham 1128 8
## spam 78 178
summary(conf_matrix_rf)
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.938
## 2 kap binary 0.770
## 3 sens binary 0.935
## 4 spec binary 0.957
## 5 ppv binary 0.993
## 6 npv binary 0.695
## 7 mcc binary 0.784
## 8 j_index binary 0.892
## 9 bal_accuracy binary 0.946
## 10 detection_prevalence binary 0.816
## 11 precision binary 0.993
## 12 recall binary 0.935
## 13 f_meas binary 0.963
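The conf_mat object can also be plotted for a quick visual check; a minimal sketch using yardstick's autoplot method (not run as part of the original analysis):
# Heatmap view of the random forest confusion matrix
autoplot(conf_matrix_rf, type = "heatmap")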
This analysis shows that the random forest model is better at distinguishing spam from ham than the logistic regression model, with higher accuracy (0.938 vs. 0.897) and F1 score (0.963 vs. 0.937) on the test set.
This project covers setting up recipes and modeling, which could be powerful if you were running multiple datasets through the same process. Not explored in this project is the fact that you can also create workflows that bundle preprocessing and models, as sketched below. The tidymodels and textrecipes packages are efficient and can be scaled up to much larger projects.
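For illustration, here is a minimal workflow sketch. It was not run as part of this project; it assumes the train_data and rf_spec objects defined above, and note that the recipe handed to a workflow should be left un-prepped so that fit() can train it.
# Re-specify the preprocessing recipe without calling prep(); the workflow
# will train it when the model is fit
text_rec_wf <- recipe(classification ~ ., data = train_data) %>%
  themis::step_downsample(classification, under_ratio = 1) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_tokenfilter(text, min_times = 10) %>%
  step_tfidf(text)
# Bundle the recipe and the random forest specification into one workflow
text_wf <- workflow() %>%
  add_recipe(text_rec_wf) %>%
  add_model(rf_spec)
# Fitting the workflow preprocesses and fits the model in a single step
text_wf_fit <- fit(text_wf, data = train_data)
# Predictions on the raw test data; the workflow applies the recipe automatically
# predict(text_wf_fit, new_data = test_data, type = "class")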