library(tidyverse)
library(tidymodels)
library(parsnip)
It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, we will take a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
The corpus to be used was downloaded from: https://spamassassin.apache.org/old/publiccorpus/
Out of the available sample files, I downloaded two archives: one containing ham messages (easy_ham_2) and one containing spam messages (spam_2). I extracted the files from these archives and placed them in separate directories.
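The download and extraction can also be scripted from R. Below is a minimal sketch only, assuming the two archives are the dated easy_ham_2 and spam_2 tarballs listed on the corpus page (the exact file names are an assumption) and that base R’s download.file() and untar() are used:
# Sketch only: download and extract the two corpus archives.
# The dated file names below are assumed from the corpus index page.
base_url <- "https://spamassassin.apache.org/old/publiccorpus/"
archives <- c("20030228_easy_ham_2.tar.bz2", "20030228_spam_2.tar.bz2")
for (f in archives) {
  download.file(paste0(base_url, f), destfile = f, mode = "wb")
  untar(f)  # assumed to create the easy_ham_2/ and spam_2/ directories
}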
Read the data from the extracted files and store them in dataframes.
# the following function reads multiple files into a single data frame,
# one row per document
load_data <- function(path) {
  file_list <- list.files(path)
  docs <- NA
  for (i in seq_along(file_list)) {
    filepath <- file.path(path, file_list[i])
    # read each e-mail and collapse its lines into a single string
    text <- readLines(filepath, warn = FALSE)
    docs <- c(docs, list(paste(text, collapse = "\n")))
  }
  as.data.frame(unlist(docs))
}
# define directory paths where files were extracted
ham_dir <- 'C:\\Users\\Esteban\\OneDrive\\edu\\cuny\\Courses\\S1_2021_01_Spring\\DATA607\\Projects\\PRJ04_DocumentClassification\\easy_ham_2\\'
spam_dir <- 'C:\\Users\\Esteban\\OneDrive\\edu\\cuny\\Courses\\S1_2021_01_Spring\\DATA607\\Projects\\PRJ04_DocumentClassification\\spam_2\\'
# load the documents from the ham files
ham_df <- load_data(ham_dir) %>%
  drop_na() %>%
  mutate(doc_type = as.factor("ham"))

# load the documents from the spam files
spam_df <- load_data(spam_dir) %>%
  drop_na() %>%
  mutate(doc_type = as.factor("spam"))

# combine the ham documents with the spam ones into a single dataframe
ham_spam_df <- dplyr::bind_rows(ham_df, spam_df)

# rename the columns of the dataframe
colnames(ham_spam_df) <- c("doc_text", "doc_type")
Shuffle the rows of the dataframe to prevent any biases in the ordering of the dataset, which will be critical when we split the dataset into training data and test data.
# set a random seed so that our reordering work is reproducible
set.seed(2021)
# use the sample() function to shuffle the row indices of the ham_spam_df dataset
random_rows <- sample(nrow(ham_spam_df))
# use this random vector to reorder the ham_spam_df dataset
ham_spam_df <- ham_spam_df[random_rows, ]
Let’s take a peek at the combined and shuffled ham/spam data.
head(ham_spam_df)
## # A tibble: 6 x 2
## doc_text doc_type
## <chr> <fct>
## 1 "Return-Path: <exmh-workers-admin@spamassassin.taint.org>\nDelivered~ ham
## 2 "From ilug-admin@linux.ie Tue Aug 6 11:51:02 2002\nReturn-Path: <i~ spam
## 3 "From ilug-admin@linux.ie Tue Aug 6 11:51:02 2002\nReturn-Path: <i~ spam
## 4 "Return-Path: <exmh-workers-admin@spamassassin.taint.org>\nDelivered~ ham
## 5 "Return-Path: <exmh-workers-admin@spamassassin.taint.org>\nDelivered~ ham
## 6 "Return-Path: <exmh-workers-admin@spamassassin.taint.org>\nDelivered~ ham
Let’s create the training and test datasets for model fitting and evaluation, using 75% of the data to train the model and 25% to test it.
# Create data split object
ham_spam_split <- initial_split(ham_spam_df, prop = 0.75,
                                strata = doc_type)

# Create the training data
ham_spam_training <- ham_spam_split %>%
  training()

# Create the test data
ham_spam_test <- ham_spam_split %>%
  testing()

# Check the number of rows
nrow(ham_spam_training)
## [1] 2099
nrow(ham_spam_test)
## [1] 699
Let’s define a logistic regression model that uses the “glm” engine and runs in “classification” mode, since we are going to classify documents as ham or spam.
# Specify a logistic regression model
logistic_model <- parsnip::logistic_reg() %>%
  # Set the engine
  parsnip::set_engine('glm') %>%
  # Set the mode
  parsnip::set_mode('classification')
Using parsnip’s fit() function, let’s fit the logistic regression model to predict the ham/spam classification, using doc_text as the predictor variable and the ham_spam_training data.
# Fit to training data
logistic_fit <- logistic_model %>%
  parsnip::fit(doc_type ~ doc_text,
               data = ham_spam_training)

# Print model fit object
logistic_fit
## parsnip model object
##
## Fit time: 30ms
##
## Call: stats::glm(formula = doc_type ~ doc_text, family = stats::binomial,
## data = data)
##
## Coefficients:
## (Intercept) doc_text
## 26.57 -53.13
##
## Degrees of Freedom: 2098 Total (i.e. Null); 2097 Residual
## Null Deviance: 2910
## Residual Deviance: 1.218e-08 AIC: 4
Let’s evaluate our model’s performance on the test dataset.
Before calculating classification metrics such as sensitivity or specificity, let’s create a results tibble with the columns required by the yardstick metric functions.
# Predict outcome categories
class_preds <- predict(logistic_fit, new_data = ham_spam_test,
                       type = 'class')
class_preds
## # A tibble: 699 x 1
## .pred_class
## <fct>
## 1 spam
## 2 ham
## 3 ham
## 4 ham
## 5 ham
## 6 spam
## 7 ham
## 8 spam
## 9 ham
## 10 spam
## # ... with 689 more rows
# Obtain estimated probabilities for each outcome value
prob_preds <- predict(logistic_fit, new_data = ham_spam_test,
                      type = 'prob')
Let’s take a peek at the estimated probabilities.
prob_preds
## # A tibble: 699 x 2
## .pred_ham .pred_spam
## <dbl> <dbl>
## 1 2.90e-12 1.00e+ 0
## 2 1.00e+ 0 2.90e-12
## 3 1.00e+ 0 2.90e-12
## 4 1.00e+ 0 2.90e-12
## 5 1.00e+ 0 2.90e-12
## 6 2.90e-12 1.00e+ 0
## 7 1.00e+ 0 2.90e-12
## 8 2.90e-12 1.00e+ 0
## 9 1.00e+ 0 2.90e-12
## 10 2.90e-12 1.00e+ 0
## # ... with 689 more rows
# Combine test set results
ham_spam_results <- ham_spam_test %>%
  select(doc_type) %>%
  bind_cols(class_preds, prob_preds)

# View results tibble
ham_spam_results
## # A tibble: 699 x 4
## doc_type .pred_class .pred_ham .pred_spam
## <fct> <fct> <dbl> <dbl>
## 1 spam spam 2.90e-12 1.00e+ 0
## 2 ham ham 1.00e+ 0 2.90e-12
## 3 ham ham 1.00e+ 0 2.90e-12
## 4 ham ham 1.00e+ 0 2.90e-12
## 5 ham ham 1.00e+ 0 2.90e-12
## 6 spam spam 2.90e-12 1.00e+ 0
## 7 ham ham 1.00e+ 0 2.90e-12
## 8 spam spam 2.90e-12 1.00e+ 0
## 9 ham ham 1.00e+ 0 2.90e-12
## 10 spam spam 2.90e-12 1.00e+ 0
## # ... with 689 more rows
# Calculate the confusion matrix
yardstick::conf_mat(ham_spam_results, truth = doc_type,
                    estimate = .pred_class)
##            Truth
## Prediction  ham spam
##       ham   350    0
##       spam    0  349
# Calculate the accuracy
accuracy(ham_spam_results, truth = doc_type,
         estimate = .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 1
# Calculate the sensitivity
sensitivity(ham_spam_results, truth = doc_type,
            estimate = .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sens binary 1
# Calculate the specificity
specificity(ham_spam_results, truth = doc_type,
            estimate = .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 spec binary 1
# Create a custom metric function
ham_spam_metrics <- metric_set(accuracy, sens, spec)

# Calculate metrics using model results tibble
ham_spam_metrics(ham_spam_results, truth = doc_type,
                 estimate = .pred_class)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 1
## 2 sens binary 1
## 3 spec binary 1
# Create a confusion matrix
conf_mat(ham_spam_results,
         truth = doc_type,
         estimate = .pred_class) %>%
  # Pass to the summary() function
  summary()
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 1
## 2 kap binary 1
## 3 sens binary 1
## 4 spec binary 1
## 5 ppv binary 1
## 6 npv binary 1
## 7 mcc binary 1
## 8 j_index binary 1
## 9 bal_accuracy binary 1
## 10 detection_prevalence binary 0.501
## 11 precision binary 1
## 12 recall binary 1
## 13 f_meas binary 1
In this project I used a classification model to label documents as spam or ham, working from a corpus of documents downloaded from the Web. I used the “tidymodels” and “parsnip” packages to fit a logistic regression model on the training dataset, and then used that model to predict the class of the documents in the test dataset.
The evaluation showed that the predictions on the test set were essentially perfect, which suggests overfitting. I tried splitting the training and test sets with different proportions, but the results were about the same. It looks like I need to learn how to fine-tune the parameters of the model I used, and probably learn other techniques for dealing with overfitting.
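One possible next step is to replace the single train/test split with resampling, which gives a more robust estimate of out-of-sample performance. The sketch below is an illustration only: it assumes the vfold_cv() and fit_resamples() functions from the tidymodels packages (rsample and tune) and reuses the logistic_model specification and ham_spam_training data defined above.
# Sketch: estimate performance with 10-fold cross-validation
# (illustrative; not part of the original analysis)
set.seed(2021)
ham_spam_folds <- vfold_cv(ham_spam_training, v = 10, strata = doc_type)

cv_results <- fit_resamples(
  logistic_model,
  doc_type ~ doc_text,
  resamples = ham_spam_folds,
  metrics = metric_set(accuracy, sens, spec)
)

# Average the metrics across the ten folds
collect_metrics(cv_results)
If the cross-validated metrics were also perfect, that would suggest the issue lies in how the raw text is used as a single predictor rather than in the split proportions.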