Complete all Questions, and submit final documents in PDF form on Canvas.

Spam Emails

Spam emails are unwanted emails, generally sent to masses of people in an attempt to sway you into purchasing a good or service. Want a cruise to Europe?? Have we got a deal for you!! Modern email clients like Gmail work to identify spam messages and keep them from appearing in your inbox. This is called filtering. Filters take a look at each email that arrives in your account. Based on what the filter sees, it classifies an email as spam, and tucks it away in your spam folder, or not spam, in which case the email is placed in your inbox. The better the filter, the fewer unwanted emails you have to sift through in your inbox each day.

How does a filter work? Basically, a filter scans an email and looks for a few key features. A feature is a particular trait of an item that can be used to help classify the item. For instance, if the email says "cruise", are you more or less likely to assume it is spam? Filters take these ideas and convert them into statistical models that use these features to predict the probability that an email is spam. If this probability is high, the email is labeled spam and tucked into the spam folder. If the probability is low, the email appears in your inbox. As you probably know, such filters are not perfect, and sometimes spam sneaks into your inbox or important emails end up in the spam folder. The filter is a model, which means it can make mistakes. The goal of someone who designs filters is to make mistakes as infrequently as possible.

Today we will be working to build our own spam filter using logistic regression. The data we will be using is a corpus of emails received by a single Gmail account over the first three months of 2012. In this case, someone has gone through the data and manually classified each email as either spam or not spam. We will be using what we have learned about logistic regression models to see if we can predict whether or not a message should be labeled spam. To do this, we will build a model based on a variety of characteristics of the email (e.g. inclusion of words like winner, inherit, or password, the number of exclamation marks used, etc.). While the spam filters used by large corporations like Google and Microsoft are quite a bit more complex, the fundamental idea is the same: binary classification based on a set of predictors.
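
To make the idea of "predict a probability, then classify" concrete, here is a minimal sketch in R. The coefficients, the single "cruise" feature, and the 0.5 cutoff are all made up for illustration; none of them come from the lab data, and a real filter would estimate the coefficients from emails.

```r
# Hypothetical log-odds model with one feature: does the email contain "cruise"?
# Both coefficients are invented for illustration only.
b0 <- -2.0   # log odds of spam when "cruise" is absent
b1 <-  3.0   # change in log odds when "cruise" is present

has_cruise <- c(0, 1)               # two example emails
log_odds   <- b0 + b1 * has_cruise
p_spam     <- plogis(log_odds)      # inverse logit: exp(z) / (1 + exp(z))

# Assumed decision rule: label an email spam when its probability exceeds 0.5
label <- ifelse(p_spam > 0.5, "spam", "not spam")
data.frame(has_cruise, p_spam, label)
```

The cutoff is a design choice: raising it sends fewer real emails to the spam folder but lets more spam through.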

The Data

The data you need are on Canvas. Remember that you need to load this data into RStudio, and then again into your Markdown file.

There are 19 variables in this data set. They are:

  • spam: Indicator for whether the email was spam.
  • to_multiple: Indicator for whether the email was addressed to more than one recipient.
  • from: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
  • cc: Indicator for whether anyone was CCed.
  • sent_email: Indicator for whether the sender had been sent an email in the last 30 days.
  • image: Indicates whether any images were attached.
  • attach: Indicates whether any files were attached.
  • dollar: Indicates whether a dollar sign or the word 'dollar' appeared in the email.
  • winner: Indicates whether "winner" appeared in the email.
  • inherit: Indicates whether "inherit" (or an extension, such as inheritance) appeared in the email.
  • password: Indicates whether "password" appeared in the email.
  • num_char: The number of characters in the email, in thousands.
  • line_breaks: The number of line breaks in the email (does not count text wrapping).
  • format: Indicator for whether the email contained active links (format = 1) or did not (format = 0).
  • re_subj: Indicates whether the subject started with "Re:", "RE:", "re:", or "rE:".
  • exclaim_subj: Indicates whether there was an exclamation point in the subject.
  • urgent_subj: Indicates whether the word "urgent" was in the email subject.
  • exclaim_mess: The number of exclamation points in the email message.
  • number: Factor variable saying whether there was no number, a small number (under 1 million), or a big number.

Don't worry, we won't be using all of these!!

EDA: Filter 1: Categorical Predictor

For our first filter, we are going to choose one categorical variable as our explanatory variable. We are going to consider the 4 possibilities below.

  • to_multiple: Indicator for whether the email was addressed to more than one recipient. (0 = no, 1 = yes)
  • attach: Indicator for whether any files were attached. (0 = no, 1 = yes)
  • winner: Indicator for whether "winner" appeared in the email. (0 = no, 1 = yes)
  • format: Indicator for whether the email contained active links. (0 = no, 1 = yes)
    1. Choose one of the 4 variables in the list above to use as your explanatory variable. There is no correct answer, just choose one you think might be interesting.

    Here, we are dealing with two categorical variables: a categorical response, and a categorical predictor. Our next step is to visualize the relationship...but how do we do that??

    To visualize the relationship between two categorical variables, we can use a plot called a mosaic plot. This plot shows us (1) what proportion of emails are spam, (2) what proportion of emails are not spam and (3) whether spam emails are more likely to have your X than non-spam emails. It helps us look for a relationship between the two variables. What you are looking for is a difference in the height of the bars.

    1. Make a plot to explore the relationship between whether or not an email is spam (as.factor(email$spam)) and your chosen predictor. Make sure you show your plot and label your axes. Hint: The code mosaicplot( as.factor(spam) ~ as.factor(x) , data = , xlab = "X Axis Label you want", ylab = "Y axis label you want") will prove helpful here. Replace x with the explanatory variable you want to plot.
    1. Based on your plot, describe the relationship between the two variables.
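
    The numbers behind a mosaic plot are just a two-way table of proportions. The sketch below uses a tiny invented data set (the data frame toy and its columns are hypothetical, not the lab's email data) to show how the table and the plot relate.

```r
# Toy data, invented for illustration (not the lab's email data)
toy <- data.frame(
  spam = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
  x    = c(0, 0, 0, 0, 1, 1, 0, 1, 1, 1)
)

# Counts: rows are levels of x, columns are levels of spam
counts <- table(x = toy$x, spam = toy$spam)

# Row proportions: within each level of x, what share is spam?
props <- prop.table(counts, margin = 1)
props

# The same table drawn as a mosaic plot
mosaicplot(counts, xlab = "x", ylab = "spam", main = "Toy example")
```

    If the spam share differs noticeably between the rows of props, the bars in the mosaic plot will split at different heights.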

    Fitting the Model: Filter 1: Categorical Predictor

    Now that we have chosen our explanatory variable, let's build our spam filter! To do this we will fit a logistic regression model. This is done using very similar code to LSLR, except we use the glm function instead of lm. Additionally, we have one extra argument (or input) for our code. We must indicate that we wish to fit a logistic model by including the argument family=binomial. A template for the code is below, but you will need to replace X with your chosen explanatory variable.

     spamModel <- glm(spam ~ X, data = email, family = binomial)

    To examine the coefficient estimates from our logistic regression model, we use the summary command:

      summary(spamModel)
    1. Write down the fitted model (regression line) in logistic form. To do this, paste the following into the white space in RMarkdown and adapt to reflect your line: $\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = \hat{\beta}_0 + \hat{\beta}_1 X$. Then, add one extra dollar sign to the beginning and end (so you have $$).
    1. Based on the model in the previous question, interpret both the slope and intercept in terms of the log odds.
    1. Interpret the slope in terms of the odds.
    1. Write down the fitted model (regression line) in probability form. To do this, paste the following into the white space in RMarkdown, and adapt to reflect your line : $\hat{\pi} = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1+ e^{ \hat{\beta}_0 + \hat{\beta}_1 X}}$. Then, add one extra dollar sign to the beginning and end (so you have $$).
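
    The three forms of the model (log odds, odds, probability) are linked by simple transformations of the coefficients. The sketch below illustrates this on simulated data; the data frame toy, the seed, and the values used to simulate it are assumptions made for the example, so the numbers will not match your model.

```r
# Simulated toy data, for illustration only (not the lab's email data)
set.seed(1)
x <- rbinom(500, size = 1, prob = 0.3)                    # binary predictor
y <- rbinom(500, size = 1, prob = plogis(-2 + 1.2 * x))   # binary response
toy <- data.frame(spam = y, x = x)

fit <- glm(spam ~ x, data = toy, family = binomial)
b <- coef(fit)                      # b[1] = intercept, b[2] = slope

# Log-odds form: the intercept is the log odds when x = 0,
# and the slope is the change in log odds when x = 1
log_odds_x0 <- b[1]
log_odds_x1 <- b[1] + b[2]

# Odds form: exponentiating the slope gives the odds ratio for x = 1 versus x = 0
odds_ratio <- exp(b[2])

# Probability form: exp(z) / (1 + exp(z)), which is what plogis() computes
p_x0 <- exp(log_odds_x0) / (1 + exp(log_odds_x0))
p_x1 <- plogis(log_odds_x1)
c(p_x0 = unname(p_x0), p_x1 = unname(p_x1), odds_ratio = unname(odds_ratio))
```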

    EDA: Filter 2: Numeric Predictor

    For our second filter, we are going to choose one numeric variable as our explanatory variable. We are going to consider the 2 possibilities below.

  • num_char: The number of characters in the email, in thousands.
  • line_breaks: The number of line breaks in the email (does not count text wrapping).
    1. Choose one of the two variables in the list above to use as your explanatory variable. There is no correct answer, just choose one you think might be interesting. What kind of plot would you create to explore the relationship between whether or not the email is spam and your chosen predictor?
    1. Make a plot to explore the relationship between whether or not an email is spam (as.factor(email$spam)) and your chosen predictor. Make sure you show your plot and label your axes.
    1. Based on your plot, describe the relationship between the two variables.

    When we are dealing with numeric variables, it is also important to determine whether a predictor meets the shape condition. This means the predictor should have a linear relationship with the log odds of an email being spam. If it does not, we need to use something like a transformation or a polynomial that appropriately reflects the shape. To check this, we use what is called an empirical logit plot.
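
    As a rough picture of what an empirical logit plot computes, the sketch below bins a numeric predictor, finds the proportion of 1s in each bin, and converts each proportion to a log odds. It runs on simulated data, and the choice of 10 bins is arbitrary. It is a simplified illustration, not the function used later in this lab.

```r
# Simulated toy data, for illustration only (not the lab's email data)
set.seed(2)
x <- runif(1000, min = 0, max = 10)
y <- rbinom(1000, size = 1, prob = plogis(-1 + 0.3 * x))

# Split x into 10 equal-count bins (the number of bins is an arbitrary choice)
breaks <- quantile(x, probs = seq(0, 1, length.out = 11))
bin <- cut(x, breaks = breaks, include.lowest = TRUE)

x_mean    <- tapply(x, bin, mean)          # average x in each bin
p_hat     <- tapply(y, bin, mean)          # proportion of 1s in each bin
emp_logit <- log(p_hat / (1 - p_hat))      # empirical log odds per bin

plot(x_mean, emp_logit, xlab = "Mean of x in bin", ylab = "Empirical logit")
abline(lm(emp_logit ~ x_mean))             # reference line for judging linearity
```

    If the points scatter around the line with no clear curve, the shape condition looks reasonable. Note that a bin whose proportion is exactly 0 or 1 would give an infinite logit, which this sketch does not handle.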

    We are going to use a function in R to produce the empirical logit plot. To do so, copy the LONG function below, paste it into an R chunk in your Markdown, and run it. You will notice that nothing seems to happen, but if you check the top right panel of your RStudio session, you should notice that a new function called emplogit has appeared. R allows individual users to create functions as needed, and in essence you have just "taught" R a new function. This function was written by Alex Schell (http://alexschell.github.io/emplogit.html).

     emplogit <- function(x, y, binsize = NULL, ci = FALSE, probit = FALSE,prob = FALSE, main = NULL, xlab = "", ylab = "", lowess.in = FALSE){
      # x         vector with values of the independent variable
      # y         vector of binary responses
      # binsize   integer value specifying bin size (optional)
      # ci        logical value indicating whether to plot approximate
      #           confidence intervals (not supported as of 02/08/2015)
      # probit    logical value indicating whether to plot probits instead
      #           of logits
      # prob      logical value indicating whether to plot probabilities
      #           without transforming
      #
      # the rest are the familiar plotting options
      
      if (length(x) != length(y))
        stop("x and y lengths differ")
      if (any(y < 0 | y > 1))
        stop("y not between 0 and 1")
      if (length(x) < 100 & is.null(binsize))
        stop("Less than 100 observations: specify binsize manually")
      
      if (is.null(binsize)) binsize = min(round(length(x)/10), 50)
      
      if (probit){
        link = qnorm
        if (is.null(main)) main = "Empirical probits"
      } else {
        link = function(x) log(x/(1-x))
        if (is.null(main)) main = "Empirical logits"
      }
      
      sort = order(x)
      x = x[sort]
      y = y[sort]
      a = seq(1, length(x), by=binsize)
      b = c(a[-1] - 1, length(x))
      
      prob = xmean = ns = rep(0, length(a)) # ns is for CIs
      for (i in 1:length(a)){
        range = (a[i]):(b[i])
        prob[i] = mean(y[range])
        xmean[i] = mean(x[range])
        ns[i] = b[i] - a[i] + 1 # for CI 
      }
      
      extreme = (prob == 1 | prob == 0)
      prob[prob == 0] = min(prob[!extreme])
      prob[prob == 1] = max(prob[!extreme])
      
      g = link(prob) # logits (or probits if probit == TRUE)
      
      linear.fit = lm(g[!extreme] ~ xmean[!extreme])
      b0 = linear.fit$coef[1]
      b1 = linear.fit$coef[2]
      
      loess.fit = loess(g[!extreme] ~ xmean[!extreme])
      
      plot(xmean, g, main=main, xlab=xlab, ylab=ylab)
      abline(b0,b1)
      if (lowess.in) {
        lines(loess.fit$x, loess.fit$fitted, lwd=2, lty=2)
      }
    }

    To use the function, we need three things: a response variable, an explanatory variable and the bin size, meaning the number of observations we want in each bin. Note, if you don't want to specify the bin size, the function chooses one automatically based on the number of observations.

    emplogit(x = , y = email$spam, binsize = 200, xlab = "  ", ylab = "Empirical Logit")
    1. Create the plot using the code above. Do you feel comfortable claiming that the shape condition is satisfied? Why or why not?

    In cases where the empirical logit plots suggest that the shape might not be linear, a good first step is to try transforming the explanatory variables. One common transformation is taking the log.

    1. Write down the code you would need to use to create the empirical logit plot with x = the log of your explanatory variable. Run your code and show the plot; make sure to label your axes! Based on the plot, would you prefer to use the log of your explanatory variable as your predictor?

    Fitting the Model: Filter 2: Numeric Predictor

    Now that we have chosen our explanatory variable, let's build our spam filter!

    1. Write down the fitted model (regression line) in logistic form. To do this, paste the following into the white space in RMarkdown and adapt to reflect your line: $\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = \hat{\beta}_0 + \hat{\beta}_1 X$. Then, add one extra dollar sign to the beginning and end (so you have $$).
    1. Based on the model in the previous question, interpret the slope in terms of the log odds.
    1. Interpret the slope in terms of the odds.
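
    The step from the log odds to the odds follows from exponentiating both sides of the logistic form. As a general reminder (stated in the same notation as the questions above, with no numbers from the lab data):

    $$\frac{\hat{\pi}}{1-\hat{\pi}} = e^{\hat{\beta}_0 + \hat{\beta}_1 X}$$

    Comparing the odds at $X + 1$ to the odds at $X$ gives

    $$\frac{e^{\hat{\beta}_0 + \hat{\beta}_1 (X+1)}}{e^{\hat{\beta}_0 + \hat{\beta}_1 X}} = e^{\hat{\beta}_1}$$

    so a one-unit increase in $X$ multiplies the estimated odds by $e^{\hat{\beta}_1}$, whatever the starting value of $X$.
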
    Creative Commons License
    This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2020 November 16.
    The data for this lab were retrieved from stat.duke.edu/courses/Spring13/sta102.001/Lab/email.Rdata.
    The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.