STA 279 Lab 8

Complete all Questions.

Goal

We have been working with logistic regression models, and today we are going to practice feature selection with logistic regression.

The Data

We are going to work with the same data set as we did in Lab 7. To load the data sets, put the following into a chunk in Markdown and press play.

# Load the test data set
test <- read.csv("https://www.dropbox.com/scl/fi/98bqf6g9c8euz697pesiu/spamtest.csv?rlkey=ctajtv7420tgslq2clen50i2w&st=zmj7zf03&dl=1")

# Load the training data set 
train <- read.csv("https://www.dropbox.com/scl/fi/fzhrojxhozm7qieguy7dg/spamtrain.csv?rlkey=qmdzinhxq49cu5350ttgrq59h&st=h6bylhgh&dl=1")

for( i in c(2:11,14:17,19)){
  train[,i] <-as.factor(train[,i])
  test[,i] <-as.factor(test[,i])
}

for( i in c(2:11,14:17,19)){
  train[,i] <-as.factor(train[,i])
  test[,i] <-as.factor(test[,i])
}

train <- model.matrix(train[,1] ~ ., data = train)[,-1]
train <-data.frame(train)

test <- model.matrix(test[,1] ~ ., data = test)[,-1]
test <-data.frame(test)

There are 20 variables in each data set. We have the response variable

  • spam: Indicator for whether the email is spam.

and 19 text based features:

  • to_multiple: Indicator for whether the email was addressed to more than one recipient.
  • from: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
  • cc: Indicator for whether anyone was CCed.
  • sent_email: Indicator for whether the sender had been sent an email in the last 30 days.
  • image: Indicates whether any images were attached.
  • attach: Indicates whether any files were attached.
  • dollar: Indicates whether a dollar sign or the word ‘dollar’ appeared in the email.
  • winner: Indicates whether “winner” appeared in the email.
  • inherit: Indicates whether “inherit” (or an extension, such as inheritance) appeared in the email.
  • password: Indicates whether “password” appeared in the email.
  • num_char: The number of characters in the email, in thousands.
  • line_breaks: The number of line breaks in the email (does not count text wrapping).
  • format: Indicates whether the email was written using HTML (e.g. may have included bolding or active links) or plaintext.
  • re_subj: Indicates whether the subject started with “Re:”, “RE:”, “re:”, or “rE”.
  • exclaim_subj: Indicates whether there was an exclamation point in the subject.
  • urgent_subj: Indicates whether the word “urgent” was in the email subject.
  • exclaim_mess: The number of exclamation points in the email message.
  • numbernone: Factor variable saying whether there was no numbers in the email.
  • numbersmall: Factor variable saying there was only a small number in the email (under 1 million).

When we worked with this data in Lab 7, we used (1) only one feature and then (2) all of the features to model Y = whether or not an email is spam. The model with all features is called the full model.

fullmodel <- glm( spam ~ ., data = train, family = "binomial")

However, what if we want to see if we can use only some of the features rather than all of them? This makes the model more interpretable. It also makes it easier to use the model for prediction later because we have to collect less information on an email in order to predict whether or not it is spam.

In class we discussed one method for feature selection, forward selection. We are going to start with this method, but then we are going to discuss some limitations of the method and propose a few others to try instead.

Forward Selection

Question 1

In forward selection, we start with a model with how many features?

In forward selection, a key part of the process is choosing the metric we will use to compare models. We typically use the AIC:

\[AIC = Deviance + 2~k\]

or the BIC:

\[BIC = Deviance + k~log( n )\] Here,

  • \(k\) is the number of coefficients (including the intercept) in the model.
  • \(n\) is the number of rows in the training data.

The only difference in the two metrics is the penalty. In the AIC, the penalty on the deviance is \(2k\). This means that in order to say it is “worth” add a single new feature into the model, it has to improve the deviance by at least 2. In the BIC, the penalty on the deviance is \(k~ log(n)\). This means that in order to say it is “worth” add a single new feature into the model, it has to improve the deviance by at least \(log(n)\).

Question 2

What is the numeric value of the penalty term for our full model (meaning if we use all features) for (a) the AIC and (b) the BIC?

Question 3

  1. Which of the penalties computed in Question 2 is stronger (i.e., larger): the AIC penalty or the BIC penalty?

  2. Is the penalty that is strongest in Question 2 typically strongest in practice? Briefly explain your reasoning.

Now that we have our two penalty options, let’s proceed with forward selection. The first stage is to decide on the first feature to add into the model. To do this, we build a model using the train data using each feature in the data.

\[log\left( \frac{\pi}{1-\pi }\right) =\beta_0 + \beta_1 tomultiple_i\]

\[log\left( \frac{\pi}{1-\pi }\right) =\beta_0 + \beta_1 from_i\] \[\vdots \]

\[log\left( \frac{\pi}{1-\pi }\right) =\beta_0 + \beta_1 number\_none_i\] \[log\left( \frac{\pi}{1-\pi }\right) =\beta_0 + \beta_1 number\_small_i\]

Question 4

How many models in total would we build for Stage 1? In other words, how many possible models with one feature are there for us to consider?

Question 5

Suppose I only had two choices for features: formatPlain or ccyes.

  1. Using forward selection with AIC, which feature would you add to the model first?

  2. Using forward selection with BIC, which feature would you add to the model first?

Hint: The code for logistic is glm( Y ~ X, data = train, family = "binomial")

Stage 1 of forward selection concludes once we have added our first feature to the model. For our data, we will keep the sent_email feature.

Stage 2 of forward selection chooses a 2nd feature to add to the model, keeping the Stage 1 chosen features of sent_email in.

\[log\left( \frac{\pi}{1-\pi }\right) =\beta_0 + \beta_1 sentemail_i + \beta_2 tomultiple_i\]

\[log\left( \frac{\pi}{1-\pi }\right) =\beta_0 + \beta_1 sentemail_i + \beta_2 from_i\] \[\vdots \]

\[log\left( \frac{\pi}{1-\pi }\right) =\beta_0 + \beta_1 sentemail_i + \beta_2 numbersmall_i\]

Question 6

How many models in total would we build for Stage 2? In other words, how many possible models with one feature are there for us to consider?

We continue to move through stages, adding one feature at a time, until the AIC or BIC stops improving.

Question 7

We prefer smaller AIC and BIC values. Why? Explain without using the words deviance, likelihood, or log likelihood.

Let’s give this process a try in R. To teach R the function it needs to perform forward selection, copy and paste the following chunk into your RMarkdown and press play. DO NOT CHANGE ANYTHING IN THIS CHUNK!

forwardSelection <- function(dataset, Y, metric){
  Yhold <- dataset[,Y]
  dataset <-model.matrix(dataset[,Y] ~ ., data = dataset)[,-1]
  dataset<- data.frame(dataset)
  rm(Yhold)
  J <- ncol(dataset)
  AIChold <-rep(NA,J-1)
  variables <- c()
  for( j in 1:J){
  tocheck <- setdiff(c(1:J)[-Y],variables)
  AIChold <-rep(NA,length(tocheck))
  for( i in 1:length(tocheck)){
    keep <- dataset[,c(Y, variables,tocheck[i])]
    model <- glm( keep[,Y] ~ ., data = data.frame(keep[,-Y]),family = "binomial")
    if(metric=="AIC"){
    AIChold[i] <-AIC(model)
    }
    if(metric=="BIC"){
    AIChold[i] <-BIC(model)
    }
  }
  AICtest <- min(AIChold)
  if(j == 1){
    AICCheck = AICtest
    variables <- c(variables, which.min(AIChold)+1)
  }
  if( j > 1){
    if(AICtest >= AICCheck){
      break
    }
    if(AICtest < AICCheck){
      AICCheck = AICtest
      variables <- c(variables, tocheck[which.min(AIChold)])
    }

  }
}

  forwardselection_data <- dataset[,c(Y,variables)]

  forwardselection_data 
}

To use the function, you need to give it three inputs:

  • Input 1: The name of our training data (train)
  • Input 2: The number of the column of the train data where your \(Y\) variable is.
  • Input 3: “AIC” or “BIC”, depending on which metric you wish to use.

For example, if I have a training data set called datamine in which the Y variable is in the 5th column, and I want to use AIC for forward selection, I use:

AIC_forward <- forwardSelection(datamine, 5, "AIC")

The output AIC_forward is the data chosen by forward selection using AIC. It will contain the response variable and the features chosen using forward selection with AIC.

Question 8

Using forward selection with AIC, how many features do we keep in the model?

Question 9

We are now going to run forward selection using BIC. Do you expect that forward selection with BIC will keep (a) more features than with AIC, (b) fewer features than with AIC, or (c) the same number of features as AIC? Briefly explain your choice.

Question 10

Run forward selection with BIC, and store the results under BIC_forward. How many features do we keep in the model?

Question 11

What features are kept with forward selection with AIC but not with BIC?

Okay, now we know what features both AIC and BIC recommend we keep in the model. Forward selection with both metrics have resulted in a smaller model than the full model, meaning some features were removed.

Before we use these new data sets AIC_forward and BIC_forward to build our models, we need to revisit something from the previous lab - threshold tuning!! We are now working with a different set of features than we were with the full model, and this means we have to tune the threshold again.

Last time, I gave this code to you in a few chunks. Today, I will give you a function. Copy and paste the following into a chunk below and press play.

get_threshold <- function(data,Y){
  
  model <- glm(data[,Y]~.,data=data[-Y],family = "binomial")
  
  thresholds <- seq(from = 0.01 , to = .99, by = .01)
  
  F1 <- rep(0,length(thresholds))

# For each possible threshold
  for(i in 1:length(thresholds)){
  # Make predictions on the training data
  YHat <- predict(model, type ="resp",newdata = data)
  YHat <- ifelse(YHat > thresholds[i], 1, 0)
  
  # Compute the F1 score and store
  TP <- sum(YHat == 1 & data[,Y] == 1)
  FP <- sum(YHat == 1 & data[,Y] == 0)
  FN <- sum(YHat == 0 & data[,Y] == 1)
  F1[i] <- TP/(TP + .5*(FP+FN))
}

  output <- data.frame(thresholds, F1)

  suppressMessages(library(dplyr))
  suppressMessages(library(tidyr))

  output <- output |>
    slice_max(F1,n=1, with_ties= FALSE)

  knitr::kable(output)

}

To use the function, you need to give it two inputs:

  • Input 1: The name of our data set
  • Input 2: The number of the column in the dataset where your \(Y\) variable is.

For example, to tune the threshold for AIC for our data, I use::

get_threshold(AIC_forward, 1)

Question 12

  1. What threshold would you recommend using with the forward selection AIC model?

  2. What threshold would you recommend using with the forward selection BIC model?

Now that we have tuned our model, let’s assess the predictive abilities on the test data. To make this a little easier, I have created this handy function for you. As before, DO NOT CHANGE ANYTHING ABOUT THIS CODE. Just put it directly into a code chunk and press play.

get_F1_TPR <- function(train_data, test_data, Y, threshold){
  
  # Train the model 
  model1 <- glm( train[,Y] ~ ., data = train[,-Y], family = "binomial")
  
  # Make predictions
  pihat <- predict( model1, type = "resp", newdata = test)
  yhat <-ifelse( pihat > threshold, 1, 0)
  
  # Find the F1 score
  TP <- sum(yhat == 1 & test[,Y] == 1)
  FP <- sum(yhat == 1 & test[,Y] == 0)
  FN <- sum(yhat == 0 & test[,Y] == 1)
  F1 <- TP/(TP + .5*(FP+FN))
  
  # Find the TPR 
  TPR <- TP / sum(test[,Y]==1)
  
  # Find the TNR 
  TNR <- sum(yhat == 0 & test[,Y] == 0) / sum(test[,Y]==0)
  
  # Save the results
  output <- data.frame("TPR" = TPR, "TNR" = TNR, "F1 Score" = F1)
  
  knitr::kable(output)
}

To use the function, you feed the following inputs.

  • train_data: the name of the data set you want to use to build the model.
  • test_data: the name of the test data set.
  • Y: the number of the column in both data sets where \(Y\) is.
  • threshold: the numeric value of the threshold you have chosen.

For example, if my training data is called train, my test data is called test, \(Y\) is in the 5th column and the threshold is .23, I use:

get_F1_TPR(train, test, 5, threshold = .23)

Question 13

Use the get_F1_TPR function and show the results using the data chosen by AIC forward selection.

Question 14

Use the get_F1_TPR function and show the results using the data chosen by BIC forward selection.

Question 15

If our goal is prediction of spam (Y = 1), which model would you recommend using: (a) the full model, (b) the model chosen by AIC forward selection, or (c) the model chosen by BIC forward selection?

Briefly describe why.

Forward selection is often a popular technique people choose because it is fairly quick. However, it has some drawbacks. Specifically, it has a possibility of being biased.

Forward selection adds one variable at a time, but this ignores the relationships between the variables. What if Feature 1 has the lowest AIC when it is in the model alone, but Features 2 and 3 have the lowest AIC when put together?? Forward selection is not able to account for this, which means sometimes we do not actually end up with the model that is the best fit to the data when we do forward selection.

Is there another option?? Yes!!

BSS

The downside of forward selection is that it cannot account for scenarios when one combination of features has the best AIC/BIC with 3 features allowed but a completely different combination of features is best when 4 features is allowed. To fix this, we therefore consider best subset selection (BSS).

BSS finds the model with the lowest AIC/BIC with one feature, the model with the lowest AIC/BIC with two features, and so on, and the returns the model with the lowest AIC/BIC overall. This means it considers EVERY possible model we can make with our features.

# Teach R the BSS function
BSS <- function(dataset, Y, metric,maxin){
  J <- ncol(dataset)-1
  J <- min(J,maxin)
  variables <- c()
  Yhold <- dataset[,Y]
  dataset <- dataset[,-Y]
  Y <- Yhold
  rm(Yhold)
  for( j in 1:J){
    options <- combn(1:J,j)
    for( i in 1:ncol(options)){
      keep <- dataset[,options[,i]]
      keep <-data.frame(keep)
      model <- suppressWarnings(glm( Y ~ ., data = keep,family = "binomial"))
      if(i == 1 & j == 1 & metric=="AIC"){
        output <- keep
        AIChold <- AIC(model)
      }
      if(i == 1 & j == 1 & metric=="BIC"){
        output <- keep
        AIChold <- BIC(model)
      }
      if(i > 1 & metric=="AIC" & AIC(model) < AIChold){
        AIChold <-AIC(model)
        output <- keep
      }
      if(i > 1 & metric=="BIC" & BIC(model) < AIChold){
        AIChold <-BIC(model)
        output <- keep
      }
    }
 

  }

  cbind(Y,output)
}

Question 16

Use BSS to find the combination of 2 variables with lowest AIC. Use the code below:

AIC_BSS2 <- BSS(train, 1, "AIC", maxin = 3)
  1. What features did BSS choose? In other words, what combination of 2 features gives the lowest AIC overall?

  2. Does this match the 2 features chosen first with AIC in forward selection?

Question 17

Use BSS to find the combination of 5 features with lowest BIC. Does this match the 5 features chosen first with BIC in forward selection?

So far, it looks like there can be large differences between the models chosen by forward selection and BSS. If this is the case, and BSS finds the best overall model in terms of the chosen metric, why don’t we use BSS all the time??

Let’s see. Use the code below to run BSS with all features. Choose either AIC or BIC as you prefer, but do NOT run both.

# Run BSS 
AIC_BSS <- BSS(train, 1, "AIC",maxin = 19)
BIC_BSS <- BSS(train, 1, "BIC",maxin = 19)

Question 18

Based on what you just experienced, why don’t we use BSS all the time for feature selection?

The reason that BSS takes so long is that we have to try every combination of features. This means there are \(2^{p}\) possible models we have to consider, where \(p\) is the number of features in the model. With 19 features, this means you have to consider this many models:

\[2^{19} = 524288\]

So, if we cannot do the optimal thing, what can we do?? We will talk about another option in our next class!