STA 112 Lab 5

Spam Emails

Spam emails are unwanted emails, generally sent to masses of people by corporations attempting to sway you into purchasing a good or service. Want a cruise to Europe?? Have we got a deal for you!! Modern email clients like Gmail work to identify spam messages and keep them from appearing in your inbox. This is called filtering. Filters take a look at each incoming email. Based on what the filter sees, it classifies an email as spam, and tucks it away in your spam folder, or not spam, in which case the email is placed in your inbox. The better the filter, the fewer unwanted emails you have to sift through in your inbox each day.

How does a filter work? Basically, a filter scans an email and looks for a few key features. A feature is a particular trait of an item that can be used to help classify the item. For instance, if the email says “cruise”, are you more or less likely to assume it is spam? Filters take these ideas and convert them into statistical models that use these features to predict the probability that an email is spam. If this probability is high, the email is labeled spam and tucked into the spam folder. If the probability is low, the email appears in your inbox. As you probably know, such filters are not perfect, and sometimes spam sneaks into your inbox or important emails end up in the spam folder. The filter is a model, which means it can make mistakes. The goal of someone who designs filters is to make mistakes as infrequently as possible.

Today we will be working to build our own spam filter using logistic regression. The data we will be using is a corpus of emails received by a single Gmail account over three months. In this case, someone has gone through the data and manually classified each piece of email as either spam or not spam. We will be using what we have learned about logistic regression models to see if we can predict whether or not a message should be labeled spam. To do this, we will build a model based on a variety of characteristics of the email (e.g. inclusion of words like winner, inherit, or password, the number of exclamation marks used, etc.). While the spam filters used by large corporations like Google and Microsoft are quite a bit more complex, the fundamental idea is the same - binary classification based on a set of predictors.

The Data

The data you need are on Canvas, under Lab 5. Remember that you need to load this data into RStudio, and then again into your Markdown file.

There are 19 variables in this data set. They are:

spam: Y variable; Indicator for whether the email was spam.
to_multiple: Indicator for whether the email was addressed to more than one recipient.
from: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
cc: Indicator for whether anyone was CCed.
sent_email: Indicator for whether the sender had been sent an email in the last 30 days.
image: Indicates whether any images were attached.
attach: Indicates whether any files were attached.
dollar: Indicates whether a dollar sign or the word ‘dollar’ appeared in the email.
winner: Indicates whether “winner” appeared in the email.
inherit: Indicates whether “inherit” (or an extension, such as inheritance) appeared in the email.
password: Indicates whether “password” appeared in the email.
num_char: The number of characters in the email, in thousands.
line_breaks: The number of line breaks in the email (does not count text wrapping).
format: Indicates whether the email was written using HTML (e.g. may have included bolding or active links) or plaintext.
re_subj: Indicates whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”.
exclaim_subj: Indicates whether there was an exclamation point in the subject.
urgent_subj: Indicates whether the word “urgent” was in the email subject.
exclaim_mess: The number of exclamation points in the email message.
number: Factor variable saying whether there was no number, a small number (under 1 million), or a big number.

Question 1

Which variable do you think might be most highly associated with an email being spam? Why? (You don’t need to look at the data yet, just reason it out. There is no specific correct answer!)

Our response variable is binary, where 1 means that an email is spam and 0 means that the email is not spam. Before we try to model the data, our first step is always to visualize or summarize our data in some way. With binary data, a good first step is to determine how many 0s and 1s we actually have in this data set. Are they all spam? Do we have a good mix? The table command is useful for this:

knitr::kable(table(email$spam))
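
Counts can also be turned into proportions. The sketch below is one way to do this with prop.table(), which divides each count by the total. It uses a small made-up vector (toy_spam is hypothetical, not the lab data) so that it runs on its own.

```r
# Hypothetical toy response: 1 = spam, 0 = not spam (NOT the lab data)
toy_spam <- c(0, 0, 1, 0, 1, 0, 0, 0, 1, 0)

# Counts of 0s and 1s
toy_counts <- table(toy_spam)

# Proportions: each count divided by the total number of emails
toy_props <- prop.table(toy_counts)

toy_counts
toy_props
```

Multiplying the proportions by 100 gives percentages.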

Question 2

How many emails are spam? What percent of the emails are spam?

We can also make a graph to visualize the distribution of Y.

Question 3

Create an appropriate graph of Y = spam. Label your graph Figure 1.

Model 1: Winner!

Now that we have explored Y, it is time to see if we can use an explanatory variable to help predict whether or not an email is spam. We are going to consider two possible variables that might be related to whether or not an email is spam: (1) whether or not the email contains the word “winner” and (2) how many line breaks are in the email. We will start with (1).

Question 4

Make a plot to explore the relationship between Y = whether or not an email is spam and X = whether or not the email contains the word “winner”. Make sure to label your plot appropriately. Hint: Look back at your slides on Odds to find the code you need.

Question 5

Based on this plot, does there seem to be a relationship between whether or not an email contains the word “winner” and whether or not it is spam?

Now that we have chosen our X variable, let’s build our spam filter! To do this we will fit a logistic regression model between spam and the variable winner. This is done using very similar code to simple or multiple linear regression, except we use the glm function instead of lm. Additionally, we have one extra argument, or input, for our code. We must indicate that we wish to fit a logistic regression model by including family = binomial.

spamModel1 <- glm(spam ~ winner, data = email, family = binomial)

To examine the coefficient estimates from our logistic regression model, we use the summary command:

summary(spamModel1)
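
If you want the estimates on their own rather than reading them off the summary table, coef() extracts them, and exp() moves them from the log odds scale to the odds scale. The sketch below is only an illustration on simulated data: toy, flag, y, and toy_model are hypothetical names, and the probabilities 0.2 and 0.6 are made up.

```r
set.seed(112)  # arbitrary seed so the simulated data are reproducible

# Hypothetical simulated data (NOT the lab data): a yes/no predictor and a
# binary response whose chance of being 1 depends on the predictor
toy <- data.frame(flag = rep(c("no", "yes"), each = 200))
toy$y <- rbinom(400, size = 1, prob = ifelse(toy$flag == "yes", 0.6, 0.2))

toy_model <- glm(y ~ flag, data = toy, family = binomial)

# Estimated intercept and slope, on the log odds scale
log_odds_coefs <- coef(toy_model)

# Exponentiating moves the estimates to the odds scale
odds_coefs <- exp(log_odds_coefs)

log_odds_coefs
odds_coefs
```

With a single yes/no predictor, the intercept is the log odds of Y = 1 in the baseline ("no") group, which you can check against qlogis(mean(toy$y[toy$flag == "no"])).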

Question 6

Write down the logistic regression line in log odds form. Copy and adapt the following code: $log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = \hat{\beta}_0 + \hat{\beta}_1 Winner$ .

Question 7

Interpret the slope in log odds form.

Question 8

Interpret the slope in odds form.

Question 9

Write down the fitted model in probability form. Hint: To make a fraction, use $\hat{\pi} = \frac{}{}$ . To make the odds, use $e^{\hat{\beta}_0 + \hat{\beta}_1 Winner}$ .

Question 10

According to your model, what is the probability that an email that contains the word “winner” is spam?

Model 2: Line Breaks

Now that we have seen how a logistic regression model could be used with a categorical X, let’s try a numeric X. We are going to explore whether the number of line breaks in an email relates to whether or not the email is spam. A line break just means a new line in an email.

Checking Conditions

Our X variable is now numeric. When we were working with LSLR and MR models (meaning when our Y was numeric), we generally needed to check the shape of the relationship between X and Y. We used a scatter plot of X versus Y to do this.

\[Y = \beta_0 + \beta_1 X + \epsilon\]

For logistic regression we need to check the shape of the relationship between X and the log odds that an email is spam.

\[log\left( \frac{\pi}{1-\pi} \right) = \beta_0 + \beta_1X \]

We are still going to use a scatter plot to check this assumption, but this time we will have a scatter plot with X on the x-axis and the log odds that Y is spam on the y-axis. This special scatter plot is called an empirical log odds plot. Page 474 in your book provides specific details on plot construction, but we are going to use a function in R to produce the plot.
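
The idea behind the plot can be sketched in a few lines: group the observations into bins by X, find the proportion of 1s in each bin, and convert each proportion to log odds. This is a simplified illustration on simulated data, not the exact procedure from the book or from the function used in this lab. The names toy_x and toy_y and the choice of 50 observations per bin are assumptions made for the example.

```r
set.seed(112)  # arbitrary seed so the simulated data are reproducible

# Hypothetical simulated data (NOT the lab data)
toy_x <- runif(500, min = 0, max = 10)
toy_y <- rbinom(500, size = 1, prob = plogis(-2 + 0.4 * toy_x))

# 1. Put the observations into bins of 50, ordered by x
bin <- ceiling(rank(toy_x, ties.method = "first") / 50)

# 2. Within each bin, find the mean of x and the proportion of 1s
bin_x <- tapply(toy_x, bin, mean)
bin_p <- tapply(toy_y, bin, mean)

# 3. Convert each proportion to log odds: qlogis(p) is log(p / (1 - p)).
#    A bin with proportion exactly 0 or 1 would give -Inf or Inf here.
bin_log_odds <- qlogis(bin_p)

# 4. Plot the empirical log odds against the bin means of x
plot(bin_x, bin_log_odds,
     xlab = "Mean of x in bin", ylab = "Empirical log odds")
```

If the points fall roughly along a straight line, a model that is linear in X on the log odds scale looks reasonable.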

Create an R chunk in your Markdown file, paste in the function below, and run it. You will notice that nothing seems to happen, but if you check the top right panel of your RStudio session, you should notice that a new function called EmpLogOddsPlot has appeared. R allows individual users to create functions as needed, and in essence you have just “taught” R a new function. This function was written by Alex Schell (http://alexschell.github.io/emplogit.html).

EmpLogOddsPlot <- function(x, y, binsize = NULL, ci = FALSE, probit = FALSE, prob = FALSE, main = NULL, xlab = "", ylab = "", lowess.in = FALSE){
  # x         vector with values of the independent variable
  # y         vector of binary responses
  # binsize   integer value specifying bin size (optional)
  # ci        logical value indicating whether to plot approximate
  #           confidence intervals (not supported as of 02/08/2015)
  # probit    logical value indicating whether to plot probits instead
  #           of logits
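  # prob      logical value indicating whether to plot probabilities
  #           without transforming (currently has no effect: prob is
  #           overwritten by the bin proportions below)
  # lowess.in logical value indicating whether to add a dashed loess
  #           curve to the plot
  #
  # the rest are the familiar plotting options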
  
  if (length(x) != length(y))
    stop("x and y lengths differ")
  if (any(y < 0 | y > 1))
    stop("y not between 0 and 1")
  if (length(x) < 100 && is.null(binsize))
    stop("Less than 100 observations: specify binsize manually")
  
  if (is.null(binsize)) binsize = min(round(length(x)/10), 50)
  
  if (probit){
    link = qnorm
    if (is.null(main)) main = "Empirical probits"
  } else {
    link = function(x) log(x/(1-x))
    if (is.null(main)) main = "Empirical logits"
  }
  
  sort = order(x)
  x = x[sort]
  y = y[sort]
  a = seq(1, length(x), by=binsize)
  b = c(a[-1] - 1, length(x))
  
  prob = xmean = ns = rep(0, length(a)) # ns is for CIs
  for (i in 1:length(a)){
    range = (a[i]):(b[i])
    prob[i] = mean(y[range])
    xmean[i] = mean(x[range])
    ns[i] = b[i] - a[i] + 1 # for CI 
  }
  
  extreme = (prob == 1 | prob == 0)
  prob[prob == 0] = min(prob[!extreme])
  prob[prob == 1] = max(prob[!extreme])
  
  g = link(prob) # logits (or probits if probit == TRUE)
  
  linear.fit = lm(g[!extreme] ~ xmean[!extreme])
  b0 = linear.fit$coef[1]
  b1 = linear.fit$coef[2]
  
  loess.fit = loess(g[!extreme] ~ xmean[!extreme])
  
  plot(xmean, g, main=main, xlab=xlab, ylab=ylab)
  abline(b0,b1)
  if (lowess.in) {
    lines(loess.fit$x, loess.fit$fitted, lwd=2, lty=2)
  }
}

To use the function, we need two things: a response variable and an explanatory variable. You can choose how many observations go into each bin (binsize =), or let the function decide automatically based on the number of observations.

EmpLogOddsPlot(x=email$line_breaks, y=email$spam, xlab = "Line Breaks", ylab = "Empirical log odds", main = "Figure 3")

Question 11

Create the empirical log odds plot using the code above. Do you feel comfortable claiming that the shape of the relationship between X and the log odds of being spam is linear? Why or why not?

In cases where the empirical log odds plot suggests that a line might not be the correct shape to use, we know that we have other shapes we can use for $f(X)$ rather than just a line! We are going to try a new one today: $f(X) = log(X)$.

Question 12

Write down the code you would need to use to create the empirical log odds plot with the explanatory variable as the log of the number of line breaks. Change the x-axis label accordingly and change the title to “Figure 4”.

Question 13

Run your code and show the plot. Which of the empirical log odds plots (using line_breaks or using log(line_breaks)) is more linear? Based on the plots, choose one of the two explanatory variables to use in the model.

Question 14

Fit your chosen model and call it spamModel2. Write down the logistic regression line in log odds form. Copy and adapt the following: $log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = $ .

Question 15

Interpret the slope in odds form.

Predictions

To make predictions using a logistic regression model, we use the predict.glm function in R. We know that to make a prediction, we need a particular value of the X variable. Suppose we have a message with 400 line breaks. Will our filter label it spam? That all depends on the probability suggested by the model. So let’s find out! The code below provides the predicted probability for a message with 400 line breaks according to our second model.

predict.glm(spamModel2, newdata = data.frame(line_breaks = 400), type = "response")
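
The same function can also return the prediction on the log odds scale, which is the scale of the fitted line. The sketch below shows both versions and how they relate. It uses simulated data: toy, x, y, toy_model, and new_obs are hypothetical names, and the coefficients used to simulate are made up.

```r
set.seed(112)  # arbitrary seed so the simulated data are reproducible

# Hypothetical simulated data (NOT the lab data)
toy <- data.frame(x = runif(300, min = 1, max = 100))
toy$y <- rbinom(300, size = 1, prob = plogis(1 - 0.03 * toy$x))

toy_model <- glm(y ~ x, data = toy, family = binomial)
new_obs <- data.frame(x = 40)

# Predicted log odds (the scale of the fitted line)
pred_log_odds <- predict(toy_model, newdata = new_obs, type = "link")

# Predicted probability
pred_prob <- predict(toy_model, newdata = new_obs, type = "response")

# plogis(z) computes exp(z) / (1 + exp(z)), so it converts log odds
# into a probability and should match pred_prob
plogis(pred_log_odds)
pred_prob
```

Here type = "link" gives the log odds and type = "response" gives the probability.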

Question 16

Run the function above, and interpret the result.

Question 17

Based on the prediction we have obtained so far, do you think a message with 400 line breaks is likely to be classified as spam by our second spam filter? Explain.

Question 18

What is the predicted probability of being spam for a message with 5 line breaks?

Question 19

Is our spam filter more or less likely to classify a message with 5 line breaks as spam than a message with 400 line breaks? Answer using both the probability and the log odds as evidence.

Now, a probability is not a decision. We have just determined the probability that a message is spam if it has 400 line breaks, or 5 line breaks. But this does not actually declare the message spam.

In order to make the ultimate decision, we essentially will be spinning a spinner where the probability of getting “spam” is defined by the model. The larger the probability, the larger the “piece” of the spinner that is labelled spam. Suppose the model says the probability of being spam for a particular email is 40%. To make a decision for this email using R, we use the following code:

sample(c("spam", "notspam"), 1, prob = c(0.4, 0.6))

Running this will spit out a decision for our email. Do we declare it spam, or not spam? Now remember, with logistic regression, we are modeling probabilities. If we run the code again, we are in essence taking another round at the spinner, and we may get a different answer. (Try it and see!) That is okay!! And better than okay, it’s expected.

In the code, the prob section holds the probabilities associated with the outcomes “spam” and “notspam”, respectively. These can be changed to suit the prediction you need to make.
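
To see what the spinner does over the long run, you can spin it many times and look at the share of “spam” decisions. The sketch below is an illustration only: the 10,000 spins and the use of set.seed() (which makes the random results repeatable) are assumptions added for the example, and one_decision, many_decisions, and share_spam are hypothetical names.

```r
set.seed(112)  # arbitrary seed; remove it to get different spins each run

# One spin of the spinner, with a 40% chance of "spam"
one_decision <- sample(c("spam", "notspam"), 1, prob = c(0.4, 0.6))

# Many spins: replace = TRUE lets the same outcome be drawn repeatedly
many_decisions <- sample(c("spam", "notspam"), 10000, replace = TRUE,
                         prob = c(0.4, 0.6))

# The share of "spam" decisions should be close to 0.4
share_spam <- mean(many_decisions == "spam")

one_decision
share_spam
```

Any single spin can go either way, but across many spins the proportion of “spam” decisions settles near the probability given by the model.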

Question 20

Suppose we want to make a decision about our email with 400 line breaks. We already know the probabilities. Adapt the code above. Show your code, and your result.

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2023 August 1.

The data for this lab were retrieved from stat.duke.edu/courses/Spring13/sta102.001/Lab/email.Rdata.