STA 279 Lab 5

Complete all Questions.

Goal

We have been working with logistic regression models, and today we are going to explore and practice (1) threshold tuning and (2) feature selection with logistic regression.

Spam emails are unwanted emails, generally sent to masses of people by corporations attempting to sway you into purchasing a good or service. Want a cruise to Europe?? Have we got a deal for you!! Modern email clients like GMail work to identify spam messages and keep them from appearing in your inbox. This is called filtering. Filters take a look at each incoming email. Based on what the filter sees, it classifies the email either as spam, in which case it is tucked away in your spam folder, or as not spam, in which case the email is placed in your inbox. The better the filter, the fewer unwanted emails you have to sift through in your inbox each day.

How does a filter work? Basically, a filter scans an email and looks for a few key text features. For instance, if the email says “cruise”, are you more or less likely to assume it is spam? Filters take these ideas and convert them into statistical models that use these features to predict the probability that an email is spam. If this probability is high, the email is labeled spam and tucked into the spam folder. If the probability is low, the email appears in your inbox. As you probably know, such filters are not perfect, and sometimes spam sneaks into your inbox or important emails end up in the spam folder. The filter is a model, which means it can make mistakes. The goal of someone who designs filters is to make mistakes as infrequently as possible.

Today we will be working to build our own spam filter using logistic regression.

The Data

To load the data sets, put the following into a code chunk in your RMarkdown document and press play.

# Load the test data set
test <- read.csv("https://www.dropbox.com/scl/fi/98bqf6g9c8euz697pesiu/spamtest.csv?rlkey=ctajtv7420tgslq2clen50i2w&st=zmj7zf03&dl=1")

# Load the training data set 
train <- read.csv("https://www.dropbox.com/scl/fi/fzhrojxhozm7qieguy7dg/spamtrain.csv?rlkey=qmdzinhxq49cu5350ttgrq59h&st=h6bylhgh&dl=1")

There are 19 variables in each data set. We have the response variable

  • spam: Indicator for whether the email is spam.

and 18 text based features:

  • to_multiple: Indicator for whether the email was addressed to more than one recipient.
  • from: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
  • cc: Indicator for whether anyone was CCed.
  • sent_email: Indicator for whether the sender had been sent an email in the last 30 days.
  • image: Indicates whether any images were attached.
  • attach: Indicates whether any files were attached.
  • dollar: Indicates whether a dollar sign or the word ‘dollar’ appeared in the email.
  • winner: Indicates whether “winner” appeared in the email.
  • inherit: Indicates whether “inherit” (or an extension, such as inheritance) appeared in the email.
  • password: Indicates whether “password” appeared in the email.
  • num_char: The number of characters in the email, in thousands.
  • line_breaks: The number of line breaks in the email (does not count text wrapping).
  • format: Indicates whether the email was written using HTML (e.g. may have included bolding or active links) or plaintext.
  • re_subj: Indicates whether the subject started with “Re:”, “RE:”, “re:”, or “rE”.
  • exclaim_subj: Indicates whether there was an exclamation point in the subject.
  • urgent_subj: Indicates whether the word “urgent” was in the email subject.
  • exclaim_mess: The number of exclamation points in the email message.
  • number: Factor variable saying whether there was no number, a small number (under 1 million), or a big number.

Question 1

Which variable do you think might be most highly associated with an email being spam? Why? (You don’t need to look at the data yet, just reason it out. There is no specific correct answer!)

Our response variable is binary, where a 1 means that an email is spam and a 0 means that the email is not spam. Before we try to model the data, our first step is always to visualize or summarize our data in some way. With binary data, a good first step is to determine how many 0s and 1s we actually have in this data set. Are they all spam? Do we have a good mix? The table command is useful for this:

knitr::kable(table(train$spam))
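
If you would rather see proportions than raw counts, one optional shortcut is to wrap the table in prop.table(); this is just a sketch of one possibility:

# Convert the counts of 0s and 1s into proportions of the training data
knitr::kable( prop.table(table(train$spam)) )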

Question 2

What percent of emails in the training data set are spam?

It is important to check this before we begin an analysis to make sure we have enough emails that are spam to actually find relationships. If there are very few rows with \(y_i=1\) (in this case spam), modeling becomes more difficult.

The Full Model

Let’s see how well we can predict spam status if we use all 18 text-based features.

To fit the model using the training data and all the features, we use the following code:

fullmodel <- glm( spam ~ ., data = train, family = "binomial")

To see the coefficients (\(\hat{\beta}\) terms), we use

summary(fullmodel)$coefficients

This is what we call raw R output, and for any kind of formal report, we don’t want this. To make the output more professional, we can use

knitr::kable( summary(fullmodel)$coefficients )

The knitr::kable part of the code is a wrapper that formats any table output (including data sets!) nicely for knitting.
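
For example, to display the first few rows of the training data as a nicely formatted table, one possibility is:

# Format the first 6 rows of the training data for knitting
knitr::kable( head(train) )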

Question 3

Show the coefficient table professionally formatted and rounded to 3 decimal places.

Hint: If I want to round summary(fullmodel)$coefficients to two decimal places, I use round(summary(fullmodel)$coefficients , 2)
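
For example, combining the hint with knitr::kable, a formatted coefficient table rounded to two decimal places (adjust the rounding as the question requires) could be produced with:

# Round the coefficient table and format it for knitting
knitr::kable( round(summary(fullmodel)$coefficients, 2) )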

We are not actually interested in interpreting the coefficients today. Instead, we are interested in how well this model we built on our training data is able to make predictions on our test data. The test data consist of 900 new emails that were not in the training data.

Making predictions for logistic regression requires two steps. First, we need to find \(\hat{\pi}_i\) for all rows in the test data. This means we need to find the probability of being spam for all emails in the test data. To do this in R, we use:

pihat <- predict( fullmodel, type = "resp", newdata = test)

Here,

  • fullmodel is the model we want to use to make predictions.
  • type = "resp" tells R to predict the probabilities; without this statement, R predicts the log odds.
  • newdata = test tells R which data set we want to make predictions on. If you do not include this, R makes predictions using the training data.
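
The result pihat is a vector with one predicted probability per email in the test data. A couple of ways to peek at it (the specific index below is just an example):

# Look at the first few predicted probabilities
head(pihat)

# Or pull out a single probability by its position, e.g., the 10th email
pihat[10]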

Question 4

What is the predicted probability of being spam for the 4th email in the test data set?

Once we have predicted probabilities, we use a threshold to decide whether or not a probability is high enough for us to declare \(\hat{y}_i = 1\). A common choice for threshold is .5, so let’s start with that.

yhat <- ifelse( pihat > .5, 1 , 0)

The ifelse function is a handy one. If \(\hat{\pi}_i > .5\), then we set \(\hat{y}_i = 1\). If not, we set \(\hat{y}_i = 0\).
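
As a quick optional sanity check, you can count how many test emails get each label at this threshold:

# How many emails are labeled not spam (0) and spam (1) at threshold .5?
knitr::kable( table(yhat) )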

Question 5

What is the prediction \(\hat{y}_4\) for the 4th email in the data set?

The next step is to see how accurate our predictions are! We usually use a confusion matrix for this.

knitr::kable( table(yhat, test$spam) , col.names = c("Truth: Not Spam", "Truth: Spam"))

Question 6

Based on our current results, would you say the model is predicting well? Explain in 1 sentence.

When our model is not performing well in prediction, one step we can take is to change the threshold. We chose .5 arbitrarily. Is there a threshold that might be better?

Exploring Thresholds

Instead of just choosing a threshold at random, another option is to try a range of thresholds and then choose the one with the performance we like.

Let’s say we want to consider all thresholds \(\alpha = 0, .01, .02, \dots, .99, 1\). To tell R this is what we want, we use:

thresholds <- seq(from = 0 , to = 1, by = .01)

This creates a sequence (seq) of numbers from 0 (from = 0) to 1 (to = 1) in intervals of .01 (by = .01).

For each of these thresholds, we want to compute the F1-score. This combined metric is useful for comparing thresholds, as it gives us only one number to look at.
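
In symbols, writing TP, FP, and FN for the numbers of true positives, false positives, and false negatives, the F1-score is

\[F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}\]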

To compute the F1-score on the training data (here yhat must contain predictions made on the training data, not the test data), we need:

  # Count the true positives, false positives, and false negatives
  TP <- sum(yhat == 1 & train$spam == 1)
  FP <- sum(yhat == 1 & train$spam == 0)
  FN <- sum(yhat == 0 & train$spam == 1)
  
  # Combine them into the F1-score
  F1 <- TP/(TP + .5*(FP+FN))
  F1

Question 7

Find the F1-score for the training data if the threshold \(\alpha = .5\).

Hint: We haven’t made predictions to the training data yet, so you need to do that first!
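
As a starting point, here is a minimal sketch of making predictions on the training data (the names pihat_train and yhat_train are just suggestions):

# Predicted probabilities for the emails in the training data
pihat_train <- predict( fullmodel, type = "resp", newdata = train)

# Convert the probabilities to 0/1 predictions using the threshold .5
yhat_train <- ifelse( pihat_train > .5, 1, 0)

From there, the F1 code above can be reused with yhat_train in place of yhat (still comparing against train$spam).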

Question 8

Find the F1-score for the training data if the threshold \(\alpha = .6\).

Hint: You can use your code from Question 7 and just change one thing!

Question 9

Which threshold would you prefer right now, \(\alpha = .5\) or \(\alpha = .6\)? Why?

This works, but we don’t really want to compare all of the thresholds manually. Luckily, we can use something called a for loop to help us:

F1 <- rep(0,length(thresholds))

# For each possible threshold
for(i in 1:length(thresholds)){
  # Make predictions on the training data
  YHat <- predict(fullmodel, type = "resp", newdata = train)
  YHat <- ifelse(YHat > thresholds[i], 1, 0)
  
  # Compute the F1 score and store
  TP <- sum(YHat == 1 & train$spam == 1)
  FP <- sum(YHat == 1 & train$spam == 0)
  FN <- sum(YHat == 0 & train$spam == 1)
  F1[i] <- TP/(TP + .5*(FP+FN))
}

output <- data.frame(thresholds, F1)

The end result is output, which contains the thresholds in the first column and the F1-scores in the second. To find the highest F1-score and the threshold that produces it, we do:

library(dplyr)
library(tidyr)

output |>
  slice_max(F1, n = 1)

Question 10

What is the highest F1 score we can get on the training data, and what threshold do we need to use to get that F1 score?

Now that we have chosen a threshold, we go back to the test data. Notice the back and forth between the two data sets? That is because each data set has a different job: we tune the threshold on the training data, and we assess predictive performance on the test data.

Question 11

Make predictions for the test data using our chosen threshold. Show the confusion matrix as your answer to this question.

knitr::kable( table(yhat, test$spam) , col.names = c("Truth: Not Spam", "Truth: Spam"))
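
Note that yhat first needs to be recomputed on the test data using the chosen threshold. A minimal sketch, assuming pihat still holds the test-set probabilities from earlier and output holds the results of the threshold loop (best_threshold is just a suggested name):

# Pull out the threshold with the highest training F1-score (breaking ties arbitrarily)
best_threshold <- output |>
  slice_max(F1, n = 1, with_ties = FALSE) |>
  pull(thresholds)

# Re-classify the test emails using this threshold
yhat <- ifelse( pihat > best_threshold, 1, 0)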

Question 12

  1. What is the true positive rate (TPR) with the new threshold on test data?

  2. What is the F1 score on the test data with the new threshold?
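
One way to get both numbers, assuming yhat now holds the test-set predictions made with the new threshold, is a direct computation along these lines (F1_test is just a suggested name):

# True positives, false positives, and false negatives on the test data
TP <- sum(yhat == 1 & test$spam == 1)
FP <- sum(yhat == 1 & test$spam == 0)
FN <- sum(yhat == 0 & test$spam == 1)

# TPR: the proportion of truly spam emails that the filter catches
TPR <- TP / (TP + FN)

# F1-score on the test data
F1_test <- TP/(TP + .5*(FP+FN))

TPR
F1_test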

Feature Selection

Now that we have explored threshold tuning, let’s try out feature selection. Feature selection is the process of determining if any features in the full model (meaning the model with all the features) can be removed.

Question 13

What is one reason it might be beneficial to remove features?

For today, we will focus on forward selection.

Question 14

In forward selection, we start with a model with how many features?

In forward selection, a key part of the process is choosing the metric we will use to compare models. We typically use the AIC:

\[AIC = Deviance + 2~k\]

or the BIC:

\[BIC = Deviance + k~\log( n )\]

Here,

  • \(k\) is the number of coefficients (including the intercept) in the model.
  • \(n\) is the number of rows in the training data.

Both of these metrics involve the deviance. We will cover the specifics in class soon, but essentially the deviance measures how poorly the model matches the patterns of 0s and 1s we see in the data. Since both the AIC and BIC involve the deviance, we prefer models with low values of the AIC and BIC.

The only difference between the two metrics is the penalty. In the AIC, the penalty on the deviance is \(2k\). This means that in order to say it is “worth” adding a feature into the model, it has to improve the deviance by at least 2. In the BIC, the penalty on the deviance is \(k~\log(n)\). This means that in order to say it is “worth” adding a feature into the model, it has to improve the deviance by at least \(\log(n)\).
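
As a quick aside (not needed for the questions below), R has built-in AIC() and BIC() functions, so you can compute both metrics for a fitted model directly, for example using the full model from earlier:

# AIC and BIC of the full logistic regression model
AIC(fullmodel)
BIC(fullmodel)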

We will start by considering forward selection with AIC, and then we will compare the results to what we get using forward selection with BIC.

To do forward selection, copy and paste the following chunk into your RMarkdown and press play. DO NOT CHANGE ANYTHING IN THIS CHUNK!

forwardSelection <- function(train, Y, metric){ 
  J <- ncol(train)
  
  # Start with only the response column; features get added one at a time
  variables <- Y
  
  for(j in 1:(J - 1)){
    # Features not yet in the model
    tocheck <- setdiff(1:J, variables)
    AIChold <- rep(NA, length(tocheck))
    
    for(i in 1:length(tocheck)){
      # Candidate data set: the response (always column 1 of keep),
      # the features chosen so far, and the feature being tested
      keep  <- train[, c(variables, tocheck[i])]
      model <- glm(keep[,1] ~ ., data = data.frame(keep[,-1]),
                   family = "binomial")
      
      # Store the chosen metric for this candidate model
      if(metric == "AIC"){
        AIChold[i] <- AIC(model)
      }
      if(metric == "BIC"){
        AIChold[i] <- BIC(model)
      }
    }
    
    AICtest <- min(AIChold)
    
    if(j == 1){
      # Always keep the single best feature
      AICCheck  <- AICtest
      variables <- c(variables, tocheck[which.min(AIChold)])
    }
    if(j > 1){
      # Stop if the best remaining feature does not improve the metric
      if(AICtest >= AICCheck){
        break
      }
      AICCheck  <- AICtest
      variables <- c(variables, tocheck[which.min(AIChold)])
    }
  }
  
  # Return the response plus the selected features
  forwardselection_data <- train[, variables]
  forwardselection_data 
}

To use the function, you need to give it three inputs:

  • Input 1: The name of our training data (train)
  • Input 2: The number of the column of the train data where your \(Y\) variable is.
  • Input 3: “AIC” or “BIC”, depending on which metric you wish to use.

For example, if I have a training data set called datamine in which the Y variable is in the 5th column, and I want to use AIC for forward selection, I use:

AIC_forward <- forwardSelection(datamine, 5, "AIC")

The output AIC_forward is the data chosen by forward selection using AIC. It will contain the response variable and the features chosen using forward selection with AIC.
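
Input 2 is the column number of spam in train. A quick way to look it up, along with a sketch of the full call and a count of the selected features (the exact results will depend on the data):

# Find which column of the training data holds the response
which(names(train) == "spam")

# Run forward selection with AIC using that column number
AIC_forward <- forwardSelection(train, which(names(train) == "spam"), "AIC")

# Number of features kept = number of columns minus the response column
ncol(AIC_forward) - 1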

Question 15

Using forward selection with AIC, how many features do we keep in the model?

Question 16

We are now going to run forward selection using BIC. Do you expect that forward selection with BIC will keep (a) more features than with AIC, (b) fewer features than with AIC, or (c) the same number of features as AIC?

Briefly explain your choice. Hint: If you are stuck, look at the penalty terms for AIC and BIC.

Question 17

Run forward selection with BIC. How many features do we keep in the model?

Question 18

What features are kept with forward selection with AIC but not with BIC?
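
One way to make this comparison, assuming the BIC results are saved as BIC_forward, is to compare the column names of the two selected data sets:

# Features kept by AIC forward selection but not by BIC forward selection
setdiff(names(AIC_forward), names(BIC_forward))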

Okay, now we know what features both AIC and BIC recommend we keep in the model. Both metrics have resulted in a smaller model than the full model, meaning some features were removed.

Before we use these new data sets AIC_forward and BIC_forward to build our models, we need to revisit something from the previous section - threshold tuning!! We are now working with a different set of features than we were with the full model, and this means we have to tune the threshold again.
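
Here is a minimal sketch of what that re-tuning could look like for the AIC model, assuming AIC_forward has the response in its first column (which is how forwardSelection returns it); the BIC data set is handled the same way, and the object names are just suggestions:

# Refit the model using only the features chosen by AIC forward selection
aic_model <- glm( AIC_forward[,1] ~ ., data = data.frame(AIC_forward[,-1]), family = "binomial")

# Predicted probabilities on the training data used to fit the model
pihat_aic <- predict(aic_model, type = "resp")

# Compute the training F1-score at each candidate threshold
F1_aic <- rep(0, length(thresholds))
for(i in 1:length(thresholds)){
  yhat_aic <- ifelse(pihat_aic > thresholds[i], 1, 0)
  TP <- sum(yhat_aic == 1 & AIC_forward[,1] == 1)
  FP <- sum(yhat_aic == 1 & AIC_forward[,1] == 0)
  FN <- sum(yhat_aic == 0 & AIC_forward[,1] == 1)
  F1_aic[i] <- TP/(TP + .5*(FP+FN))
}

# The threshold with the highest training F1-score
data.frame(thresholds, F1_aic) |>
  slice_max(F1_aic, n = 1)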

Question 19

What threshold would you recommend using with the forward selection AIC model? With the forward selection BIC model?

Now that we have tuned our model, let’s assess the predictive abilities on the test data. To make this a little easier, I have created this handy function for you. As before, DO NOT CHANGE ANYTHING ABOUT THIS CODE. Just put it directly into a code chunk and press play.

get_F1_TPR <- function(train_data, test_data, Y, threshold){
  
  # Train the model on the supplied training data
  model1 <- glm( train_data[,Y] ~ ., data = train_data[,-Y], family = "binomial")
  
  # Make predictions on the supplied test data
  pihat <- predict( model1, type = "resp", newdata = test_data)
  yhat <- ifelse( pihat > threshold, 1, 0)
  
  # Find the F1 score
  TP <- sum(yhat == 1 & test_data[,Y] == 1)
  FP <- sum(yhat == 1 & test_data[,Y] == 0)
  FN <- sum(yhat == 0 & test_data[,Y] == 1)
  F1 <- TP/(TP + .5*(FP+FN))
  
  # Find the TPR 
  TPR <- TP / sum(test_data[,Y] == 1)
  
  # Find the TNR 
  TNR <- sum(yhat == 0 & test_data[,Y] == 0) / sum(test_data[,Y] == 0)
  
  # Save the results
  output <- data.frame("TPR" = TPR, "TNR" = TNR, "F1 Score" = F1)
  
  knitr::kable(output)
}

To use the function, you give it the following inputs:

  • train_data: the name of the data set you want to use to build the model.
  • test_data: the name of the test data set.
  • Y: the number of the column in both data sets where \(Y\) is.
  • threshold: the numeric value of the threshold you have chosen.

For example, if my training data is called train, my test data is called test, \(Y\) is in the 5th column and the threshold is .23, I use:

get_F1_TPR(train, test, 5, threshold = .23)

Question 20

Use the get_F1_TPR function and show the results using the data chosen by AIC forward selection.

Question 21

Use the get_F1_TPR function and show the results using the data chosen by BIC forward selection.

Question 22

If our goal is prediction, which model would you recommend using: (a) the full model, (b) the model chosen by AIC forward selection, or (c) the model chosen by BIC forward selection?

Briefly describe why.