The purpose of this project is to practice building document classification models to predict the class of a document, and to gain experience with the textrecipes and tidymodels packages.
For this project, I will be using the SMS Spam Collection Data Set, a public set of labeled SMS messages collected for mobile phone spam research, originally sourced from: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
library(tidyverse)
library(magrittr)
library(textrecipes)
library(tidymodels)
library(themis)
library(ranger)
Read in file to clean
text_data <- read.delim("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/data/SMSSpamCollection", header = FALSE, stringsAsFactors = FALSE, quote = NULL, sep = "\t")
Add column headers
text_data %<>% rename(classification = V1, text = V2)
text_data$classification <- as_factor(text_data$classification)
Here we check how many ham and how many spam observations we have. As you can see, the classes are highly imbalanced, so we will likely upsample or downsample when we model the data.
text_data %>%
ggplot(aes(classification)) +
geom_bar(fill = "#56B4E9") +
theme_minimal() +
labs(x = NULL,
y = "Count",
title = "Classification Counts for Spam/Ham Dataset")
I utilized two resources to complete this project. The first is the “Text Classification with Tidymodels” tutorial by Emil Hvitfeldt (https://www.hvitfeldt.me/blog/text-classification-with-tidymodels/). The second is the “Get started with tidymodels and classification of penguin data” YouTube tutorial by Julia Silge (https://www.youtube.com/watch?v=z57i2GVcdww).
First, we will split our data into training and testing sets so that once we develop our models, we can evaluate them. This is done using the rsample package from tidymodels.
set.seed(689)
text_split <- initial_split(text_data, strata = "classification", prop = 0.75)
train_data <- training(text_split)
test_data <- testing(text_split)
Next, we will prepare the data with the recipe function, performing preprocessing that includes downsampling the over-represented class (ham), tokenizing by word, stemming the tokens, filtering out infrequent tokens, and computing tf-idf weights.
text_recipe <- recipe(classification ~ ., data = train_data) %>%
  themis::step_downsample(classification, under_ratio = 1) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_tokenfilter(text, min_times = 10) %>%
  step_tfidf(text) %>%
  prep(training = train_data)
ready_train <- juice(text_recipe)
ready_test <- bake(text_recipe, test_data)
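As a quick sanity check (an optional step, not required for the modeling), you can inspect the dimensions of the processed training set; each retained token becomes a tfidf_text_* column:
# Rows are the downsampled messages; columns are the outcome plus one tf-idf feature per token
dim(ready_train)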
I decided to compare two models: a logistic regression model and a random forest classifier. For full transparency, I know far more about logistic regression and would not know if I were violating any assumptions of the random forest model, but I thought it was good practice to try multiple models.
First, we define the model specifications using the parsnip package from tidymodels.
glmnet_spec <- logistic_reg(mixture = 0, penalty = 0.1) %>%
set_engine("glmnet")
rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")
Now we can fit the models to the training data.
test_model_glmnet <- glmnet_spec %>%
fit(classification ~ ., data = ready_train)
test_model_rf <- rf_spec %>%
fit(classification ~ ., data = ready_train)
Set up evaluation tibbles using parsnip's prediction functions so that we can evaluate the performance of the models with the yardstick package.
eval_tibble_glmnet <- test_data %>%
  select(classification) %>%
  mutate(
    class_model_glmnet = predict(test_model_glmnet, ready_test, type = "class") %>% pull(.pred_class),
    prop_model_glmnet = predict(test_model_glmnet, ready_test, type = "prob") %>% pull(.pred_spam))
eval_tibble_rf <- test_data %>%
  select(classification) %>%
  mutate(
    class_model_rf = predict(test_model_rf, ready_test, type = "class") %>% pull(.pred_class),
    prop_model_rf = predict(test_model_rf, ready_test, type = "prob") %>% pull(.pred_spam))
I will use two approaches to evaluating the models.
First, evaluate the logistic regression model by looking at accuracy, precision, and recall.
accuracy_glmnet <- accuracy(eval_tibble_glmnet, truth = classification, estimate = class_model_glmnet)
precision_glmnet <- precision(eval_tibble_glmnet, truth = classification, estimate = class_model_glmnet)
recall_glmnet <- recall(eval_tibble_glmnet, truth = classification, estimate = class_model_glmnet)
accuracy_glmnet
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.897
precision_glmnet
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.992
recall_glmnet
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.888
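As an aside, yardstick can bundle these three calls into a single metric set; a minimal sketch, equivalent to the individual calls above and assuming the same evaluation tibble:
# Combine accuracy, precision, and recall into one reusable metric function
glmnet_metrics <- metric_set(accuracy, precision, recall)
glmnet_metrics(eval_tibble_glmnet, truth = classification, estimate = class_model_glmnet)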
Second, use a confusion matrix, a cross-tabulation of predicted and reference classes, to evaluate the model, and use the summary function to look at accuracy, specificity, and other metrics.
conf_matrix_glmnet <- conf_mat(eval_tibble_glmnet, truth = classification, estimate = class_model_glmnet, dnn = c("Predicted", "Reference"))
conf_matrix_glmnet
## Reference
## Predicted ham spam
## ham 1071 9
## spam 135 177
summary(conf_matrix_glmnet)
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.897
## 2 kap binary 0.653
## 3 sens binary 0.888
## 4 spec binary 0.952
## 5 ppv binary 0.992
## 6 npv binary 0.567
## 7 mcc binary 0.685
## 8 j_index binary 0.840
## 9 bal_accuracy binary 0.920
## 10 detection_prevalence binary 0.776
## 11 precision binary 0.992
## 12 recall binary 0.888
## 13 f_meas binary 0.937
I prefer the second method for assessing model performance, so I will run the confusion matrix and summary for the random forest model below.
conf_matrix_rf <- conf_mat(eval_tibble_rf, truth = classification, estimate = class_model_rf, dnn = c("Predicted", "Reference"))
conf_matrix_rf
## Reference
## Predicted ham spam
## ham 1128 8
## spam 78 178
summary(conf_matrix_rf)
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.938
## 2 kap binary 0.770
## 3 sens binary 0.935
## 4 spec binary 0.957
## 5 ppv binary 0.993
## 6 npv binary 0.695
## 7 mcc binary 0.784
## 8 j_index binary 0.892
## 9 bal_accuracy binary 0.946
## 10 detection_prevalence binary 0.816
## 11 precision binary 0.993
## 12 recall binary 0.935
## 13 f_meas binary 0.963
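The conf_mat object can also be plotted for a quick visual check; a minimal sketch using yardstick's autoplot method (not run as part of the original analysis):
# Heatmap view of the random forest confusion matrix
autoplot(conf_matrix_rf, type = "heatmap")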
This analysis shows that the random forest model is better at distinguishing spam from ham than the logistic regression model, with higher accuracy (0.938 vs. 0.897) and F1 score (0.963 vs. 0.937) on the test set.
This project covers setting up recipes and modeling, which could be powerful if you were running multiple datasets through the same process. Not explored in this project is the fact that you can also create workflows that bundle preprocessing and models, as sketched below. The tidymodels and textrecipes packages are efficient and can be scaled up to much larger projects.
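For illustration, here is a minimal workflow sketch. It was not run as part of this project; it assumes the train_data and rf_spec objects defined above, and note that the recipe handed to a workflow should be left un-prepped so that fit() can train it.
# Re-specify the preprocessing recipe without calling prep(); the workflow
# will train it when the model is fit
text_rec_wf <- recipe(classification ~ ., data = train_data) %>%
  themis::step_downsample(classification, under_ratio = 1) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_tokenfilter(text, min_times = 10) %>%
  step_tfidf(text)
# Bundle the recipe and the random forest specification into one workflow
text_wf <- workflow() %>%
  add_recipe(text_rec_wf) %>%
  add_model(rf_spec)
# Fitting the workflow preprocesses and fits the model in a single step
text_wf_fit <- fit(text_wf, data = train_data)
# Predictions on the raw test data; the workflow applies the recipe automatically
# predict(text_wf_fit, new_data = test_data, type = "class")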