STA 279 Lab 7

Complete all Questions.

Goal

How does a spam filter work? Basically, a filter scans an email and looks for a few key text features. For instance, if the email says “cruise”, are you more or less likely to assume it is spam? Filters take these ideas and convert them into statistical models that use these features to predict the probability that an email is spam. If this probability is high, the email is labeled spam and tucked into the spam folder. If the probability is low, the email appears in your inbox.

Today we will be working to build our own spam filter using logistic regression.

The Data

To load the data sets, put the following into a chunk in Markdown and press play.

# Load the test data set
test <- read.csv("https://www.dropbox.com/scl/fi/98bqf6g9c8euz697pesiu/spamtest.csv?rlkey=ctajtv7420tgslq2clen50i2w&st=zmj7zf03&dl=1")

# Load the training data set 
train <- read.csv("https://www.dropbox.com/scl/fi/fzhrojxhozm7qieguy7dg/spamtrain.csv?rlkey=qmdzinhxq49cu5350ttgrq59h&st=h6bylhgh&dl=1")

There are 19 variables in each data set. We have the response variable

spam: Indicator for whether the email is spam.

and 18 text based features:

to_multiple: Indicator for whether the email was addressed to more than one recipient.
from: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
cc: Indicator for whether anyone was CCed.
sent_email: Indicator for whether the sender had been sent an email in the last 30 days.
image: Indicates whether any images were attached.
attach: Indicates whether any files were attached.
dollar: Indicates whether a dollar sign or the word ‘dollar’ appeared in the email.
winner: Indicates whether “winner” appeared in the email.
inherit: Indicates whether “inherit” (or an extension, such as inheritance) appeared in the email.
password: Indicates whether “password” appeared in the email.
num_char: The number of characters in the email, in thousands.
line_breaks: The number of line breaks in the email (does not count text wrapping).
format: Indicates whether the email was written using HTML (e.g. may have included bolding or active links) or plaintext.
re_subj: Indicates whether the subject started with “Re:”, “RE:”, “re:”, or “rE”.
exclaim_subj: Indicates whether there was an exclamation point in the subject.
urgent_subj: Indicates whether the word “urgent” was in the email subject.
exclaim_mess: The number of exclamation points in the email message.
number: Factor variable saying whether there was no number, a small number (under 1 million), or a big number.

Our response variable is binary, where 1 means that an email is spam and a 0 means that the email is not spam. Before we try to model the data, our first step is always to visualize or summarize our data in some way. With binary data, a good first step is determine how many 0s and 1s we actually have in this data set. Are they all spam? Do we have a good mix? The table command is useful for this:

knitr::kable(table(train$spam))

Question 1

What percent of emails in the training data set are spam?

It is important to check this before we begin an analysis to make sure we have enough emails that are spam to actually find relationships. If there are very few rows with $y_i=1$ (in this case spam), modeling becomes more difficult.

Model 1: Numeric X

One feature we know how to create is the length of the email. In this data set, we have $X$ = the number of characters (num_char), where a character is a letter, space, symbol, punctuation mark, etc. Let’s see if spam emails tend to be longer or shorter than non-spam emails using logistic regression.

As we discussed in class, because we want to describe relationships in the data using our model, logistic regression is going to be a better choice than Naive Bayes. To fit a logistic regression model using the training data and $X$ = the number of characters, we use the following code:

Model1 <- glm( spam ~ num_char, data = train, family = "binomial")

Question 2

What code would you need to create a logistic regression ModelA using $X$ = the number of line breaks and $Y$ = spam?

Hint: If you want to show me code but not run it, put the code in the white space (NOT a chunk) and put ` at the beginning and end of the code.

Running the code above doesn’t actually print anything on the screen. Instead, it stores everything we need, ready to print it out when we ask it to. To see the coefficients ($\hat{\beta}$ terms), for instance, we use

summary(Model1)$coefficients

Running this shows us our estimates of $\hat{\beta}_0$ and $\hat{\beta}_1$, along with standard errors and other things we could use for inference. It is helpful output, but it is what we call raw R-output. For any kind of formal reports, we want formatted output.

To format the output and make it more professional, we can use

knitr::kable( summary(Model1)$coefficients )

The knitr::kable part of the code is a wrapper than formats any table output (including data sets!) nicely for knitting.

Question 3

Show the coefficient table for Model1 professionally formatted and rounded to 3 decimal places.

Hint: If I want to round summary(Model1)$coefficients to two decimal places, I use round(summary(Model1)$coefficients , 2)

Once we have our coefficients, we can write down the trained (fitted) model.

Question 4

Write down the trained logistic regression model in log odds form.

To do this, paste the following into the WHITE SPACE (not a chunk!!!) in your RMarkdown and finish:

$$log\left(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\right) = $$

Question 5

Based on the trained model, does an increase in the number of a characters suggest an increase or decrease in the probability of being spam? Explain your choice.

The answer to Question 5 is what we call an informal interpretation. It gives a general idea of the relationships shown in the model, but it does not actually use any numbers. To incorporate more information from the trained model, we move to a more formal interpretation.

Question 6

Interpret the slope formally in log odds form.

Question 7

Interpret the slope formally in odds form.

At this point, we can discuss in a few different ways how the length of the email is related to an email being spam. However, this is what happens with a numeric $X$. What happens when $X$ is categorical?

Model 2: Categorical X

Instead of $X$ = the number of characters, let’s try using $X$ = attach. This is a variable that tells us whether or not the email has an attachment. During many online safety trainings, we are taught to be wary of attachments, so let’s see if that is related to an email being spam!

Using a categorical variable as a feature requires converting to an indicator variable. This is a variable that takes on 0 or 1 depending on the value of X. The level (or value) of X that is coded as 0 is called the baseline of X.

Luckily, R creates indicator variables for us when we put a categorical $X$ in a model. To see this, build Model 2 using:

Model2 <- glm( spam ~ attach, data = train, family = "binomial")

Question 8

Show the coefficient table for Model2 professionally formatted and rounded to 3 decimal places.

In your answer to Question 8, you should see attachyes. This means that R creates an indicator where $X$ = 1 means the email has an attachment and $X= 0$ means the email does not have an attachment.

Question 9

Based on this, what is the baseline for $X$ = attach?

Question 10

Write down the trained logistic regression model for Model 2 in log odds form.

To do this, paste the following into the WHITE SPACE (not a chunk!!!) in your RMarkdown and finish:

$$log\left(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\right) = $$

Question 11

Based on the trained model, does the presence of an attachment suggest an increase or decrease in the probability of being spam? Explain your choice.

Now we are ready to interpret formally! Interpretation is slightly different with categorical features. We are no longer able to start with “If X increases by 1…” because that doesn’t make any sense when X is either an attachment or not. Instead, we start with “If the email has an attachment…”

Question 12

Interpret the slope of Model 2 formally in odds form.

Another way we sometimes discuss categorical features is in terms of the odds ratio. The odds ratio is

\[\frac{Odds(y_i=1|x_i = 1)}{Odds(y_i=1|x_i = 0)}\]

Suppose the odds ratio is 4. We then interpret the result as: “The odds of being spam are predicted to be 4 times higher if X=1”. We replace “X=1” in our interpretation with what that means for our data.

Question 13

Compute and interpret the odds ratio using $X$ = attachment. Round all values to 3 non-zero decimal places during your computation and in your answer.

Hint: You will notice some familiar numbers!

Comparing Models

Now we have two models, one using $X$ = the number of characters and the second using $X$ = whether or not the email had an attachment. How can we decide which feature is more related to an email being spam??

The tool we typically use for this is called the AIC. The lower the AIC, the stronger the relationship between $X$ and $Y$ and the better the model “fits” the data. This is because the AIC essentially measures how much of the patterns in Y are NOT reflected in the model. This means we want a low AIC.

Question 14

Which Model is a better fit to the train data, Model 1 or Model 2? Explain your choice.

Hint: To compute the AIC, use AIC(model) with the name of your model used in place of model

Do we have to choose only one feature or the other?? No! In logistic regression, just like Naive Bayes, we can use multiple features in our models. This means we can build a model that includes both features.

Question 15

Train (fit) a model for spam using all the possible features in the data set (all 18). Store the model under Model3. Show the coefficient table professionally formatted and rounded to 3 decimal places.

Hint: To add more than one feature to the model, we just use a plus sign. Example: glm( y ~ x1 + x2, data = dataset, family = "binomial"). If you want to use all the features in a data set, we use a dot, like glm( y ~ ., data = dataset, family = "binomial")

Question 16

Based on Model 3, after controlling for all other features, does the presence of an attachment suggest an increase or decrease in the probability of being spam?

Question 17

Based on Model 3, after controlling for all other features, does an increase in the number of characters suggest an increase or decrease in the probability of being spam?

The bold statements in Question 16 and 17 are a reminder from STA 112. When you put more than one feature in a model, interpreting for a single coefficient depends on the presence of other features in the model. We account for this in our interpretation by saying “after controlling for all other features…”

Another thing we should remember from STA 112 is that the AIC can be used to compare models with different numbers of features! The AIC contains a penalty term 2p which accounts for $p$, the number of coefficients in the model. This means we can us the AIC to compare models even when they do not have the same number of features.

Question 18

Which model is the best fit to the training data: Model 1, Model 2, Model 3? Explain.

NOTE: For the rest of the lab, you will work with the model you chose in Question 18.

Prediction

Once we have chosen our model, we can check to see how well that model performs at prediction. We used our training data train to build our model, and now we can see how well that model can predict for spam status on test, our test data set compromised of 900 new emails that were not in the training data.

Making predictions for logistic regression requires two steps. First, we need to find $\hat{\pi}_i$ for all rows in the test data. This means we need to find the probability of being spam for all emails in the test data. To do this in R, we use:

pihat <- predict( model , type = "resp", newdata = test)

Here,

model is the model we want to use to make predictions.
type = "resp" tells R to predict the probabilities; without this statement, R predicts the log odds.
newdata = test tells R which data set we want to make predictions on. If you do not include this, R makes predictions using the training data.

Question 19

Using the model you chose in Question 18, what is the predicted probability of being spam for the 3rd email in the test data set?

Once we have predicted probabilities, we use a threshold to decide whether or not a probability is high enough for us to declare $\hat{y}_i = 1$. A common choice for threshold is .5, so let’s start with that.

yhat <- ifelse( pihat > .5, 1 , 0)

The code ifelse is a handy one. If $\hat{\pi}_i > .5$, then we set $\hat{y}_i = 1$. If not, we set $\hat{y}_i = 0$.

Question 20

What is the prediction $\hat{y}_3$ for the 3rd email in the data set?

The next step is to see how accurate our predictions are! We usually use a confusion matrix for this.

knitr::kable( table(yhat, test$spam) , col.names = c("Truth: Not Spam", "Truth: Spam"))

Question 21

Based on our current results, how well would you say the model you chose in Question 18 is predicting? Explain in 1 sentence.

When our model is not performing well in prediction, one step we can take is to change the threshold. We chose .5 arbitrarily. Is there a threshold that might be better?

Exploring Thresholds

Instead of just choosing a threshold at random, another option is to try a range of thresholds and then choose the one with the performance we like.

Let’s say we want to consider all thresholds $\alpha = 0, .02, .03, \dots, .99, 1$. To tell R this is what we want, we use:

thresholds <- seq(from = 0 , to = 1, by = .01)

This creates a sequence (seq) of numbers from 0 (from = 0) to 1 (to = 1) in intervals of .01 (by = .01).

For each of these thresholds, we want to find and compute the F1-score. This combined metric is useful for comparing thresholds, as it gives us only one number to look at.

To compute the F1-score, we need:

  TP <- sum(yhat == 1 & train$spam == 1)
  FP <- sum(yhat == 1 & train$spam == 0)
  FN <- sum(yhat == 0 & train$spam == 1)
  F1 <- TP/(TP + .5*(FP+FN))
  F1

Question 22

Find the F1-score for the training data if the threshold $\alpha = .5$.

Hint: We haven’t made predictions to the training data yet, so you need to do that first!

We don’t really want to compare all of the thresholds manually. Luckily, we can use something called a for loop to help us:

F1 <- rep(0,length(thresholds))

# For each possible threshold
for(i in 1:length(thresholds)){
  # Make predictions on the training data
  YHat <- predict(Model1, type ="resp",newdata = train)
  YHat <- ifelse(YHat > thresholds[i], 1, 0)
  
  # Compute the F1 score and store
  TP <- sum(YHat == 1 & train$spam == 1)
  FP <- sum(YHat == 1 & train$spam == 0)
  FN <- sum(YHat == 0 & train$spam == 1)
  F1[i] <- TP/(TP + .5*(FP+FN))
}

output <- data.frame(thresholds, F1)

The end result is output, which contains the thresholds in the first column and the F1 scores in the 2nd. To find the highest, we do:

library(dplyr)
library(tidyr)

output |>
  slice_max(F1,n=1, with_ties= FALSE)

Question 23

What is the highest F1 score we can get on the training data, and what threshold do we need to use to get that F1 score?

Now that we have chosen a threshold, we go back to the test data. Do we notice the back and forth between data sets? That is because each data set has a different job!

Question 24

Make predictions for the test data using our chosen threshold. Show the confusion matrix as your answer to this question.

knitr::kable( table(yhat, test$spam) , col.names = c("Truth: Not Spam", "Truth: Spam"))

Question 25

Using our new threshold, has the predictive ability of the model improved?

Next Time

We have now seen that we can use logistic regression models for prediction, and that we can use one feature ore many features. However, what if we do not want to use all the features in our modeling? How can we decide which features to use and which to exclude? We will talk about this next time!