library(tidyverse)
library(tidymodels)
library(tidytext)
library(textrecipes)
library(vip)
# Some initialization
#setwd("./Data607_Project4")
rm(list = ls())
path_ham <- "./easy_ham"
path_spam <- "./spam_2"
path_hamhard <- "./hard_ham"
myseed <- 8888
It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/
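The corpus page hosts each set as a bzip2-compressed tar archive. As a rough sketch of how the folders used below could be obtained (the exact archive name carries a date prefix and is an assumption here; check the corpus listing before running):
# Download and unpack one archive (file name is an assumption; verify it
# against the corpus listing)
url <- "https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2"
download.file(url, destfile = "easy_ham.tar.bz2", mode = "wb")
untar("easy_ham.tar.bz2")   # unpacks into ./easy_ham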
For this project we will divide the tasks as follows:
1. Read the email data from the downloaded folders.
2. Run a first analysis using the EASY HAM data set.
3. Merge the EASY HAM emails with the SPAM emails into a single data frame.
4. Do some basic exploratory analysis to see some characteristics of the data.
5. Using TIDYMODELS, develop the following steps:
5.1 Process the data using RECIPES. The recipe will clean up the data and TOKENIZE it. We will also split the data into training and testing sets.
5.2 Define a LASSO regression as a classifier to separate SPAM from HAM.
5.3 Define a WORKFLOW to blend the recipe and the regression into a single object.
5.4 Tune the model parameters for optimal results, then run performance metrics to see how our model performed.
5.5 Repeat all steps using the HARD HAM dataset.
We have three folders where the different emails are stored. The first step is to read the filenames of all the files in each folder. These lists are what we will later read and process.
ham_files_ls <- list.files(path = path_ham)
spam_files_ls <- list.files(path = path_spam)
hamhard_files_ls <- list.files(path = path_hamhard)
# Here we use lapply to read each file from the list.
ham_files_df <- lapply(ham_files_ls, function(x) {read_file(file = paste0(path_ham,"/",x))})
spam_files_df <- lapply(spam_files_ls, function(x) {read_file(file = paste0(path_spam,"/",x))})
hamhard_files_df <- lapply(hamhard_files_ls, function(x) {read_file(file = paste0(path_hamhard,"/",x))})
In this step we will combine the SPAM emails with the two types of HAM emails to create two data frames: one for EASY HAM and the other for HARD HAM.
# Combine them
combined_ham_df <- do.call("rbind", lapply(ham_files_df, as_tibble))
combined_spam_df <- do.call("rbind", lapply(spam_files_df, as_tibble))
combined_hamhard_df <- do.call("rbind", lapply(hamhard_files_df, as_tibble))
# Add the file names as an id column.
combined_ham_df$fileid <- ham_files_ls
combined_spam_df$fileid <- spam_files_ls
combined_hamhard_df$fileid <- hamhard_files_ls
# Add a SPAM/HAM indicator column to each
combined_ham2_df <- combined_ham_df %>%
mutate(type="ham")
combined_spam2_df <- combined_spam_df %>%
mutate(type="spam")
combined_hamhard2_df <- combined_hamhard_df %>%
mutate(type="ham")
# Merge the two data frames into a single one
# Put all data frames into a list
df_list <- list(combined_ham2_df, combined_spam2_df)
# Merge all data frames together
all_emails_df <- df_list %>% reduce(full_join)
## Joining, by = c("value", "fileid", "type")
# head(all_emails_df)
Now we will do the same for the hard ham emails.
# Let's do it for the hard ham set
dfhard_list <- list(combined_hamhard2_df, combined_spam2_df)
all_emailshard_df <- dfhard_list %>% reduce(full_join)
## Joining, by = c("value", "fileid", "type")
# head(all_emailshard_df)
At this point we have two data frames: all_emails_df, which has the spam and easy ham emails, and all_emailshard_df, which has the spam and hard ham emails.
Let’s do some basic counting of the emails we will analyze.
# Let's count them
all_emails_df %>%
group_by(type) %>%
summarize(messages = n())
We have almost 4,000 emails, and the split between ham and spam is pretty even, so we have a fairly balanced data set.
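To put a number on that balance, we can also compute the class proportions (a quick sketch on the same data frame):
all_emails_df %>%
  count(type) %>%
  mutate(prop = n / sum(n))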
words_per_review <- all_emails_df %>%
tidytext::unnest_tokens(word, value)%>%
count(fileid, name = "total_words")
words_per_review %>%
ggplot(aes(total_words)) +
geom_histogram(fill = "midnightblue", alpha = 0.8,bins = 60)
words_per_review %>%
ggplot(aes(total_words)) +
geom_histogram(fill = "midnightblue", alpha = 0.8,bins = 60) +
scale_x_log10()
The histogram shows that the distribution of words per email is heavily skewed toward small counts, but we also see a few very long emails.
If we plot the histogram on a logarithmic scale, we can better appreciate how emails are concentrated around 200 words each, though some emails contain over 10,000 words.
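A quick numeric summary backs this up (a sketch; exact values depend on the corpus snapshot):
words_per_review %>%
  summarize(median_words = median(total_words),
            max_words = max(total_words))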
We will run TWO MODELS to see how they perform with the easy and the hard ham data.
First we will divide the data into training and testing sets. We will train the model using only the training data, optimize using only the training data, and when done, see how it performs on the test data.
Before splitting, let's make some adjustments to the data using some REGEX.
all_emails_df <- all_emails_df %>%
mutate(value= str_remove_all(value, pattern = "<.*?>")) %>%
mutate(value = str_remove_all(value, pattern = "[:digit:]")) %>%
mutate(value = str_remove_all(value, pattern = "[:punct:]")) %>%
mutate(value = str_remove_all(value, pattern = "[\n]")) %>%
mutate(value = str_to_lower(value))
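As a quick sanity check of what these patterns do, here is a made-up example string (not from the corpus):
"<p>Call 555-1234 NOW!</p>" %>%
  str_remove_all("<.*?>") %>%      # strip HTML tags
  str_remove_all("[:digit:]") %>%  # strip digits
  str_remove_all("[:punct:]") %>%  # strip punctuation
  str_to_lower()                   # leaves roughly "call  now"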
Now we do a basic split into training and test data.
set.seed(myseed)
emails_split <- initial_split(all_emails_df)
emails_train <- training(emails_split)
emails_test <- testing(emails_split)
We will make use of tidymodels RECIPES to set up a recipe that processes the emails and converts them into tokens so we can perform a regression.
## Pre-process with RECIPES
emails_recipe <- recipe(type ~ value, data = all_emails_df) %>%
step_tokenize(value) %>%
step_stopwords(value) %>%
step_tokenfilter(value, max_tokens = 500) %>%
step_tfidf(value) %>%
step_normalize(all_predictors())
emails_prep <- prep(emails_recipe)
emails_prep
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 1
##
## Training data contained 3898 data points and no missing data.
##
## Operations:
##
## Tokenization for value [trained]
## Stop word removal for value [trained]
## Text filtering for value [trained]
## Term frequency-inverse document frequency with value [trained]
## Centering and scaling for tfidf_value_able, tfidf_value_ac, tfidf_value_a... [trained]
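To peek at the predictors the prepped recipe actually produces, we can bake it (a sketch; the columns are the tfidf_value_* features listed above):
emails_prep %>%
  bake(new_data = NULL) %>%   # NULL returns the processed training data
  select(1:5) %>%
  slice_head(n = 3)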
We will use lasso regression, which lets us regularize by adding a penalty parameter lambda. We will use the tune() function in tidymodels to find the optimal lambda: we try several values and see which one performs best on the training data.
##### Logistic regression
emails_log_reg <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
logistic_workflow <- workflow() %>%
add_recipe(emails_recipe) %>%
add_model(emails_log_reg)
logistic_workflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 5 Recipe Steps
##
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## * step_normalize()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
The grid_regular() function in tidymodels creates a grid of potential values for lambda that we can use to train and compare our model.
### Tune regression parameters
#####
set.seed(myseed)
lambda_grid <- grid_regular(penalty(), levels = 20)
## We will use bootstrapping to train our model
email_folds <- bootstraps(emails_train, strata = type)
email_folds
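For reference, penalty() defaults to a 10^-10 to 10^0 range on a log10 scale, so the 20 levels are spaced evenly on that scale. A quick check:
range(lambda_grid$penalty)   # roughly 1e-10 to 1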
Here we train our model using several values for the lambda parameter. We will measure the performance of the model for each value and later test the model with the lambda that gave us the best result during training.
set.seed(myseed)
doParallel::registerDoParallel()
regression_grid <- tune_grid(
logistic_workflow,
resamples = email_folds,
grid = lambda_grid,
metrics = metric_set(roc_auc, ppv, npv, precision, recall, f_meas)
)
We will plot how our model performs under several common performance metrics: ROC AUC, F1 score, precision, recall, negative predictive value (npv), and positive predictive value (ppv).
regression_grid %>%
collect_metrics()
regression_grid %>%
collect_metrics() %>%
ggplot(aes(penalty, mean, color = .metric)) +
geom_line(size = 1.5, show.legend = FALSE) +
facet_wrap(~.metric) +
scale_x_log10()
## Warning: Removed 2 row(s) containing missing values (geom_path).
We will use ROC AUC as the performance metric. We could have used others, like F1.
best_auc <- regression_grid %>%
select_best("roc_auc")
best_auc
####
best_f <- regression_grid %>%
select_best("f_meas")
best_f
Now we finalize the workflow with the best penalty and test our model on the test data, collecting metrics to measure performance.
# Finalize Model
#
final_regression <- finalize_workflow(logistic_workflow, best_auc)
final_regression
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 5 Recipe Steps
##
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## * step_normalize()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 1e-10
## mixture = 1
##
## Computational engine: glmnet
emails_final <- last_fit(final_regression, emails_split)
emails_final %>%
collect_metrics()
How did we do with unseen test data?
emails_final %>%
collect_predictions() %>%
conf_mat(type, .pred_class)
## Truth
## Prediction ham spam
## ham 603 3
## spam 2 367
Not bad at all: 99.5% accuracy and a 99.99% ROC-AUC. Out of almost 1,000 test emails, we misclassified only 5.
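That accuracy follows directly from the confusion matrix: (603 + 367) / (603 + 3 + 2 + 367) ≈ 0.995. yardstick can compute it, along with several other metrics, straight from the conf_mat object:
emails_final %>%
  collect_predictions() %>%
  conf_mat(type, .pred_class) %>%
  summary()   # accuracy, sens, spec, ppv, npv, f_meas, ...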
Let’s use the VIP package to measure the importance of individual words for classifying an email as spam or ham. It is not apparent why some of the words are important, though.
extract_fit_engine(emails_final) %>%
vi(lambda = best_auc$penalty) %>%   # use the penalty the final model was tuned with
group_by(Sign) %>%
top_n(20, wt = abs(Importance)) %>%
ungroup() %>%
mutate(
Importance = abs(Importance),
Variable = str_remove(Variable, "tfidf_value_"),
Variable = fct_reorder(Variable, Importance)
) %>%
ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Sign, scales = "free_y") +
labs(y = NULL)
Let’s see how the model performs with HARD HAM
Same as before, we will split the data into training and testing sets, this time using the HARD HAM dataset.
all_emailshard_df <- all_emailshard_df %>%
mutate(value= str_remove_all(value, pattern = "<.*?>")) %>%
mutate(value = str_remove_all(value, pattern = "[:digit:]")) %>%
mutate(value = str_remove_all(value, pattern = "[:punct:]")) %>%
mutate(value = str_remove_all(value, pattern = "[\n]")) %>%
mutate(value = str_to_lower(value))
set.seed(myseed)
emails_split <- initial_split(all_emailshard_df)
emails_train <- training(emails_split)
emails_test <- testing(emails_split)
Same as before, we will use RECIPES to tokenize our emails.
## Pre-process with RECIPES
emails_recipe <- recipe(type ~ value, data = all_emailshard_df) %>%
step_tokenize(value) %>%
step_stopwords(value) %>%
step_tokenfilter(value, max_tokens = 500) %>%
step_tfidf(value) %>%
step_normalize(all_predictors())
emails_prep <- prep(emails_recipe)
emails_prep
Same regression model as before, using lasso, but this time we want to see how well it does with hard ham.
##### Logistic regression
emails_log_reg <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
logistic_workflow <- workflow() %>%
add_recipe(emails_recipe) %>%
add_model(emails_log_reg)
logistic_workflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 5 Recipe Steps
##
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## * step_normalize()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
### Tune regression parameters
#####
set.seed(myseed)
lambda_grid <- grid_regular(penalty(), levels = 20)
email_folds <- bootstraps(emails_train, strata = type)
email_folds
set.seed(myseed)
regression_grid <- tune_grid(
logistic_workflow,
resamples = email_folds,
grid = lambda_grid,
metrics = metric_set(roc_auc, ppv, npv, precision, recall, f_meas)
)
regression_grid %>%
collect_metrics()
##################
regression_grid %>%
collect_metrics() %>%
ggplot(aes(penalty, mean, color = .metric)) +
geom_line(size = 1.5, show.legend = FALSE) +
facet_wrap(~.metric) +
scale_x_log10()
## Warning: Removed 6 row(s) containing missing values (geom_path).
best_auc <- regression_grid %>%
select_best("roc_auc")
best_auc
####
best_f <- regression_grid %>%
select_best("f_meas")
best_f
# Finalize Model
#
final_regression <- finalize_workflow(logistic_workflow, best_auc)
final_regression
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 5 Recipe Steps
##
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## * step_normalize()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.00233572146909012
## mixture = 1
##
## Computational engine: glmnet
How did we do this time with HARD HAM?
emails_final <- last_fit(final_regression, emails_split)
emails_final %>%
collect_metrics()
emails_final %>%
collect_predictions() %>%
conf_mat(type, .pred_class)
## Truth
## Prediction ham spam
## ham 49 4
## spam 10 349
Still not bad! Accuracy was 96.6% and ROC-AUC was 98%.
Because the number of hard ham emails was smaller, I would like to see how my model did classifying only the hard ham emails. We see that the model correctly classified 83% of even these hard ham emails. Pretty good, I'd say.
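A minimal sketch of that check, assuming the predictions collected from last_fit above (49 of the 49 + 10 = 59 true hard ham emails were classified correctly, about 83%):
emails_final %>%
  collect_predictions() %>%
  filter(type == "ham") %>%   # keep only the true hard ham emails
  summarize(hard_ham_accuracy = mean(.pred_class == type))   # ~0.83 (49/59)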
We saw that a simple logistic regression can be used to classify emails as long as we have a sufficiently large labeled dataset.
TIDYMODELS is awesome. It provides a structured way to approach a modeling problem and essentially guides us through the different steps, offering excellent and consistent tools along the way.
For my final project I will use something similar, but I will try several other models, such as neural networks, random forests, and KNN, to see and compare how they perform with the data.
THANK YOU!!!