Project 4 - Document Classification

library(tidyverse)
library(tidymodels)
library(parsnip)

Project Overview

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, we will take a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

The corpus to be used was downloaded from: https://spamassassin.apache.org/old/publiccorpus/

Out of the available sample files, I downloaded the following two:

20050311_spam_2.tar.bz2
20050311_spam_2.tar.bz2

Reading data from the files

I extracted the files from the test files and placed them in separate directories.

Read the data from the extracted files and store them in dataframes.

# the following function reads multiple files into a single data frame
load_data <- function(path) {
  file_list = list.files(path)
  docs<- NA
  for(i in 1:length(file_list))
  {
    filepath <- paste0(path, "/", file_list[1])  
    text <- readLines(filepath)
    list1<- list(paste(text, collapse="\n"))
    docs <- c(docs,list1)
  }
  as.data.frame(unlist(docs))
}

# define directory paths where files were extracted
ham_dir <- 'C:\\Users\\Esteban\\OneDrive\\edu\\cuny\\Courses\\S1_2021_01_Spring\\DATA607\\Projects\\PRJ04_DocumentClassification\\easy_ham_2\\'
spam_dir <- 'C:\\Users\\Esteban\\OneDrive\\edu\\cuny\\Courses\\S1_2021_01_Spring\\DATA607\\Projects\\PRJ04_DocumentClassification\\spam_2\\'

# load the documents from the ham files
ham_df <- load_data(ham_dir) %>%
  drop_na() %>%
  mutate(doc_type = as.factor("ham"))

# load the documents from the spam files
spam_df <- load_data(spam_dir)%>%
  drop_na() %>%
  mutate(doc_type = as.factor("spam"))

# combine the ham documents with the spam ones into a single dataframe
ham_spam_df <- dplyr::bind_rows(ham_df, spam_df)

# rename the columns of the dataframe
colnames(ham_spam_df) <- c("doc_text", "doc_type")

Shuffle the data

Shuffle the rows of the dataframe to prevent any biases in the ordering of the dataset, which will be critical when we split the dataset into training data and test data

# set a random seed so that our reordering work is reproducible
set.seed(2021)
# use the sample() function to shuffle the row indices of the ham_spam_df dataset
random_rows <- sample(nrow(ham_spam_df))
# use this random vector to reorder the ham_spam_df dataset
ham_spam_df <- ham_spam_df[random_rows,]

Let’s take a peak at the combined and shuffled ham/spam data

head(ham_spam_df)

## # A tibble: 6 x 2
##   doc_text                                                              doc_type
##   <chr>                                                                 <fct>   
## 1 "Return-Path: <exmh-workers-admin@spamassassin.taint.org>\nDelivered~ ham     
## 2 "From ilug-admin@linux.ie  Tue Aug  6 11:51:02 2002\nReturn-Path: <i~ spam    
## 3 "From ilug-admin@linux.ie  Tue Aug  6 11:51:02 2002\nReturn-Path: <i~ spam    
## 4 "Return-Path: <exmh-workers-admin@spamassassin.taint.org>\nDelivered~ ham     
## 5 "Return-Path: <exmh-workers-admin@spamassassin.taint.org>\nDelivered~ ham     
## 6 "Return-Path: <exmh-workers-admin@spamassassin.taint.org>\nDelivered~ ham

Data Resampling

Let’s create the training and test datasets for model fitting and evaluation.

Let’s use 75% of the data for training the model and 25% for testing the model.

# Create data split object
ham_spam_split <- initial_split(ham_spam_df, prop = 0.75,
                     strata = doc_type)

# Create the training data
ham_spam_training <- ham_spam_split %>%
   training()

# Create the test data
ham_spam_test <- ham_spam_split %>%
   testing()

# Check the number of rows
nrow(ham_spam_training)

## [1] 2099

nrow(ham_spam_test)

## [1] 699

Fitting a logistic regression model

Specifying a logistic regression model

Let’s define a logistic regression model, which should use the “glm” engine, which should work in “classification” mode since we are going to classify documents as ham or spam.

# Specify a logistic regression model
logistic_model <- parsnip::logistic_reg() %>%
# Set the engine
parsnip::set_engine('glm') %>%
# Set the mode
parsnip::set_mode('classification')

Model fitting

Using the parsnip’s fit function, let’s define a logistic regression object and train a model to predict ham/spam classification using the “doc_text” as predictor variable from the hamp_spam_df data.

# Fit to training data
logistic_fit <- logistic_model %>%
   parsnip::fit(doc_type ~ doc_text,
       data = ham_spam_training)

# Print model fit object
logistic_fit

## parsnip model object
## 
## Fit time:  30ms 
## 
## Call:  stats::glm(formula = doc_type ~ doc_text, family = stats::binomial, 
##     data = data)
## 
## Coefficients:
## (Intercept)     doc_text  
##       26.57       -53.13  
## 
## Degrees of Freedom: 2098 Total (i.e. Null);  2097 Residual
## Null Deviance:       2910 
## Residual Deviance: 1.218e-08     AIC: 4

Combining test dataset results

Let’s evaluate our model’s performance on the test dataset.

Predicting outcome categories

Before calculating classification metrics such as sensitivity or specificity, let’s create a results tibble with the required columns for yardstick metric functions.

# Predict outcome categories
class_preds <- predict(logistic_fit, new_data = ham_spam_test,
                        type = 'class')

class_preds

## # A tibble: 699 x 1
##    .pred_class
##    <fct>      
##  1 spam       
##  2 ham        
##  3 ham        
##  4 ham        
##  5 ham        
##  6 spam       
##  7 ham        
##  8 spam       
##  9 ham        
## 10 spam       
## # ... with 689 more rows

Estimated probabilities

# Obtain estimated probabilities for each outcome value -->
prob_preds <- predict(logistic_fit, new_data = ham_spam_test,
                      type = 'prob')

Let’s take a peak at the estimated probabilities

prob_preds

## # A tibble: 699 x 2
##    .pred_ham .pred_spam
##        <dbl>      <dbl>
##  1  2.90e-12   1.00e+ 0
##  2  1.00e+ 0   2.90e-12
##  3  1.00e+ 0   2.90e-12
##  4  1.00e+ 0   2.90e-12
##  5  1.00e+ 0   2.90e-12
##  6  2.90e-12   1.00e+ 0
##  7  1.00e+ 0   2.90e-12
##  8  2.90e-12   1.00e+ 0
##  9  1.00e+ 0   2.90e-12
## 10  2.90e-12   1.00e+ 0
## # ... with 689 more rows

Combining results

# Combine test set results
ham_spam_results <- ham_spam_test %>%
select(doc_type) %>%
bind_cols(class_preds, prob_preds)

View the results

# View results tibble
ham_spam_results

## # A tibble: 699 x 4
##    doc_type .pred_class .pred_ham .pred_spam
##    <fct>    <fct>           <dbl>      <dbl>
##  1 spam     spam         2.90e-12   1.00e+ 0
##  2 ham      ham          1.00e+ 0   2.90e-12
##  3 ham      ham          1.00e+ 0   2.90e-12
##  4 ham      ham          1.00e+ 0   2.90e-12
##  5 ham      ham          1.00e+ 0   2.90e-12
##  6 spam     spam         2.90e-12   1.00e+ 0
##  7 ham      ham          1.00e+ 0   2.90e-12
##  8 spam     spam         2.90e-12   1.00e+ 0
##  9 ham      ham          1.00e+ 0   2.90e-12
## 10 spam     spam         2.90e-12   1.00e+ 0
## # ... with 689 more rows

Assessing model fit

Calculate the confusion matrix

# Calculate the confusion matrix
yardstick::conf_mat(ham_spam_results, truth = doc_type,
   estimate = .pred_class)

##           Truth
## Prediction ham spam
##       ham  350    0
##       spam   0  349

Calculate the accuracy

# Calculate the accuracy
accuracy(ham_spam_results, doc_type,
   estimate = .pred_class)

## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary             1

Calculate the sensitivity

# Calculate the sensitivity
sensitivity(ham_spam_results, doc_type,
     .pred_class)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    binary             1

Calculate the specificity

# Calculate the specificity
specificity(ham_spam_results,doc_type,
.pred_class)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 spec    binary             1

Create a custom metric function

# Create a custom metric function
ham_spam_metrics <- metric_set(accuracy, sens, spec)

Calculate metrics using model results tibble

# Calculate metrics using model results tibble
ham_spam_metrics(ham_spam_results, truth = doc_type,
                estimate = .pred_class)

## # A tibble: 3 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary             1
## 2 sens     binary             1
## 3 spec     binary             1

Create a confusion matrix

# Create a confusion matrix
conf_mat(ham_spam_results,
         truth = doc_type,
         estimate = .pred_class) %>%
# Pass to the summary() function
   summary()

## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         1    
##  2 kap                  binary         1    
##  3 sens                 binary         1    
##  4 spec                 binary         1    
##  5 ppv                  binary         1    
##  6 npv                  binary         1    
##  7 mcc                  binary         1    
##  8 j_index              binary         1    
##  9 bal_accuracy         binary         1    
## 10 detection_prevalence binary         0.501
## 11 precision            binary         1    
## 12 recall               binary         1    
## 13 f_meas               binary         1

Conclusion

In this project I attempted to use a classification predictive model to classify documents as spam or ham from a corpus of documents downloaded from the Web. I used the “tidymodels” and “parsnip” packages to fit the training dataset to a model. Then I used that model to predict the classification of the documents stored in the test dataset.

The results of evaluating the model showed that the predictions were almost perfect, which indicates the presence of an overfitting situation here. I tried to split the training and test sets using different proportions but the result was about the same. It looks like I need to learn how to fine tune the parameters of the model that I used and probably learn other techniques to deal with overfitting.