STA 112 Lab 5
Spam Emails
Spam emails are unwanted emails, generally sent to masses of people by corporations attempting to sway you into purchasing a good or service. Want a cruise to Europe?? Have we got a deal for you!! Modern email clients like Gmail work to identify spam messages and keep them from appearing in your inbox. This is called filtering. Filters take a look at each email that arrives in your account. Based on what the filter sees, it classifies an email as spam, and tucks it away in your spam folder, or not spam, in which case the email is placed in your inbox. The better the filter, the fewer unwanted emails you have to sift through in your inbox each day.
How does a filter work? Basically, a filter scans an email and looks for a few key features. A feature is a particular trait of an item that can be used to help classify the item. For instance, if the email says “cruise”, are you more or less likely to assume it is spam? Filters take these ideas and convert them into statistical models that use these features to predict the probability that an email is spam. If this probability is high, the email is labeled spam and tucked into the spam folder. If the probability is low, the email appears in your inbox. As you probably know, such filters are not perfect, and sometimes spam sneaks into your inbox or important emails end up in the spam folder. The filter is a model, which means it can make mistakes. The goal of someone who designs filters is to make mistakes as infrequently as possible.
Today we will be working to build our own spam filter using logistic regression. The data we will be using is a corpus of emails received by a single Gmail account over three months. In this case, someone has gone through the data and manually classified each email as either spam or not spam. We will be using what we have learned about logistic regression models to see if we can predict whether or not a message should be labeled spam. To do this, we will build a model based on a variety of characteristics of the email (e.g. inclusion of words like winner, inherit, or password, the number of exclamation marks used, etc.). While the spam filters used by large corporations like Google and Microsoft are quite a bit more complex, the fundamental idea is the same: binary classification based on a set of predictors.
The Data
The data you need are on Canvas, under Lab 5. Remember that you need to load this data into RStudio, and then again into your Markdown file.
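As a rough sketch of the loading step, the chunk below assumes the downloaded file is named email.Rdata and has been saved in your working directory; both the file name and the location are assumptions, so adjust them to match what you downloaded from Canvas.

```r
# Assumption: the file is called email.Rdata and sits in the working directory.
data_file <- "email.Rdata"

if (file.exists(data_file)) {
  # load() restores the objects saved in the file and returns their names
  loaded_objects <- load(data_file)
  print(loaded_objects)
} else {
  message("File not found. Check your working directory with getwd().")
}
```

Placing this chunk near the top of your Markdown file means the data are loaded again each time you knit.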
There are 19 variables in this data set. They are:
- spam: Response (Y) variable; indicator for whether the email was spam.
- to_multiple: Indicator for whether the email was addressed to more than one recipient.
- from: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
- cc: Indicator for whether anyone was CCed.
- sent_email: Indicator for whether the sender had been sent an email in the last 30 days.
- image: Indicates whether any images were attached.
- attach: Indicates whether any files were attached.
- dollar: Indicates whether a dollar sign or the word ‘dollar’ appeared in the email.
- winner: Indicates whether “winner” appeared in the email.
- inherit: Indicates whether “inherit” (or an extension, such as inheritance) appeared in the email.
- password: Indicates whether “password” appeared in the email.
- num_char: The number of characters in the email, in thousands.
- line_breaks: The number of line breaks in the email (does not count text wrapping).
- format: Indicates whether the email was written using HTML (e.g. may have included bolding or active links) or plaintext.
- re_subj: Indicates whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”.
- exclaim_subj: Indicates whether there was an exclamation point in the subject.
- urgent_subj: Indicates whether the word “urgent” was in the email subject.
- exclaim_mess: The number of exclamation points in the email message.
- number: Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
Question 1
Which variable do you think might be most strongly associated with an email being spam? Why? (You don’t need to look at the data yet, just reason it out. There is no specific correct answer!)
Our response variable is binary, where 1 means that an email is spam and 0 means that the email is not spam. Before we try to model the data, our first step is always to visualize or summarize our data in some way. With binary data, a good first step is to determine how many 0s and 1s we actually have in this data set. Are they all spam? Do we have a good mix? The table command is useful for this:
knitr::kable(table(email$spam))
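To turn counts into percentages, one option is prop.table. The sketch below uses a small made-up 0/1 vector (toy_spam is hypothetical and is not the lab data) just to show the mechanics.

```r
# Toy 0/1 response (hypothetical values, not the lab data)
toy_spam <- c(0, 0, 1, 0, 1, 0, 0, 0, 1, 0)

# Counts of 0s and 1s
counts <- table(toy_spam)

# Convert the counts to percentages, rounded to one decimal place
percents <- round(100 * prop.table(counts), 1)

counts
percents
```

The same two steps applied to email$spam give the counts and percentages for the lab data.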
Question 2
How many emails are spam? What percent of the emails are spam?
We can also make a graph to visualize the distribution of Y.
Question 3
Create an appropriate graph of Y = spam. Label your graph Figure 1.
Model 1: Winner!
Now that we have explored Y, it is time to see if we can use an explanatory variable to help predict whether or not an email is spam. We are going to consider two possible variables that might be related to whether or not an email is spam: (1) whether or not the email contains the word “winner” and (2) how many line breaks are in the email. We will start with (1).
Question 4
Make a plot to explore the relationship between Y = whether or not an email is spam and X = whether or not the email contains the word “winner”. Make sure to label your plot appropriately. Hint: Look back at your slides on Odds to find the code you need.
Question 5
Based on this plot, does there seem to be a relationship between whether or not an email contains the word “winner” and whether or not it is spam?
Now that we have chosen our X variable, let’s build our spam filter! To do this we will fit a logistic regression model with spam as the response and winner as the explanatory variable. This is done using very similar code to simple or multiple linear regression, except we use the glm function instead of lm. Additionally, we have one extra argument, or input, for our code. We must indicate that we wish to fit a logistic regression model by including family = binomial:
spamModel1 <- glm(spam ~ winner, data = email, family = binomial)
To examine the coefficient estimates from our logistic regression model, we use the summary command:
summary(spamModel1)
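If you want the estimates on their own, coef pulls them out of the fitted model, and exponentiating them moves from the log odds scale to the odds scale. The sketch below uses simulated data (toy_winner, toy_spam, and the probabilities 0.4 and 0.1 are all made up for illustration), so the numbers it prints are not the answers for the lab data.

```r
# Simulated data for illustration only (not the lab data)
set.seed(112)
n <- 500
toy_winner <- factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1)))
toy_spam <- rbinom(n, 1, ifelse(toy_winner == "yes", 0.4, 0.1))

# Logistic regression with a categorical X
toy_model <- glm(toy_spam ~ toy_winner, family = binomial)

# Estimates on the log odds scale: intercept and slope
coef(toy_model)

# Exponentiated estimates: baseline odds and the odds ratio for "yes" vs "no"
exp(coef(toy_model))
```

With a single two-level X, the intercept is the log odds of Y = 1 in the baseline group, and the exponentiated slope is the factor by which the odds change when moving to the other group.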
Question 6
Write down the logistic regression line in log odds form. Copy and adapt the following code: $\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = \hat{\beta}_0 + \hat{\beta}_1 Winner$.
Question 7
Interpret the slope in log odds form.
Question 8
Interpret the slope in odds form.
Question 9
Write down the fitted model in probability form. Hint: To make a fraction, use $\hat{\pi} = \frac{}{}$. To make the odds, use e^{\hat{\beta}_0 + \hat{\beta}_1 Winner}.
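The step from log odds to probability can also be checked numerically. The sketch below uses hypothetical coefficient values (b0 = -2 and b1 = 1.5 are invented, not estimates from the lab data) and compares the by-hand formula, odds divided by one plus odds, with the built-in plogis function.

```r
# Hypothetical estimates, for illustration only
b0 <- -2.0   # intercept on the log odds scale
b1 <- 1.5    # slope on the log odds scale
x  <- 1      # X = 1 means the feature is present

# Log odds -> odds -> probability, step by step
log_odds <- b0 + b1 * x
odds <- exp(log_odds)
prob_by_hand <- odds / (1 + odds)

# plogis() applies the same transformation in one step
prob_plogis <- plogis(log_odds)

c(by_hand = prob_by_hand, plogis = prob_plogis)
```

Both entries agree, which is a quick way to confirm that a probability-form equation has been written correctly.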
Question 10
According to your model, what is the probability that an email that contains the word “winner” is spam?
Model 2: Line Breaks
Now that we have seen how a logistic regression model can be used with a categorical X, let’s try a numeric X. We are going to explore whether the number of line breaks in an email is related to whether or not the email is spam. A line break just means a new line in an email.
Checking Conditions
Our X variable is now numeric. When we were working with LSLR and MR models (meaning when our Y was numeric), we generally needed to check the shape of the relationship between X and Y. We used a scatter plot of X versus Y to do this.
\[Y = \beta_0 + \beta_1 X + \epsilon\]
For logistic regression we need to check the shape of the relationship between X and the log odds that an email is spam.
\[log\left( \frac{\pi}{1-\pi} \right) = \beta_0 + \beta_1X \]
We are still going to use a scatter plot to check this assumption, but this time we will have a scatter plot with X on the x-axis and the log odds that Y is spam on the y-axis. This special scatter plot is called an empirical log odds plot. Page 474 in your book provides specific details on plot construction, but we are going to use a function in R to produce the plot.
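To see what such a plot is built from, here is a minimal sketch on simulated data: sort the observations into bins by X, compute the proportion of 1s in each bin, convert each proportion to log odds, and plot those against the bin means. The simulated x and y, the choice of ten bins, and the quantile-based binning are all assumptions made for illustration; the function used in this lab bins the data slightly differently.

```r
# Simulated data for illustration only (not the lab data)
set.seed(5)
n <- 1000
x <- runif(n, 0, 10)
y <- rbinom(n, 1, plogis(-2 + 0.4 * x))

# Split x into 10 bins with roughly equal numbers of observations
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.1)),
            include.lowest = TRUE)

# Within each bin: mean of x, and proportion of 1s
x_mean <- as.numeric(tapply(x, bins, mean))
p_hat  <- as.numeric(tapply(y, bins, mean))

# Empirical log odds for each bin
emp_logit <- log(p_hat / (1 - p_hat))

# Roughly linear points support a model that is linear in x
plot(x_mean, emp_logit, xlab = "Mean of x in bin", ylab = "Empirical log odds")
```

A bin whose proportion is exactly 0 or 1 would give an infinite log odds, which is why plotting functions usually adjust such bins before plotting.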
Create an R chunk in your Markdown, paste in the function below, and run it. You will notice that nothing seems to happen, but if you check the top right panel of your RStudio session, you should notice that a new function called EmpLogOddsPlot has appeared. R allows individual users to create functions as needed, and in essence you have just “taught” R a new function. This function was written by Alex Schell (http://alexschell.github.io/emplogit.html).
EmpLogOddsPlot <- function(x, y, binsize = NULL, ci = FALSE, probit = FALSE, prob = FALSE, main = NULL, xlab = "", ylab = "", lowess.in = FALSE){
  # x         vector with values of the independent variable
  # y         vector of binary responses
  # binsize   integer value specifying bin size (optional)
  # ci        logical value indicating whether to plot approximate
  #           confidence intervals (not supported as of 02/08/2015)
  # probit    logical value indicating whether to plot probits instead
  #           of logits
  # prob      logical value indicating whether to plot probabilities
  #           without transforming
  #
  # the rest are the familiar plotting options

  if (length(x) != length(y))
    stop("x and y lengths differ")
  if (any(y < 0 | y > 1))
    stop("y not between 0 and 1")
  if (length(x) < 100 & is.null(binsize))
    stop("Less than 100 observations: specify binsize manually")

  if (is.null(binsize)) binsize = min(round(length(x)/10), 50)

  if (probit){
    link = qnorm
    if (is.null(main)) main = "Empirical probits"
  } else {
    link = function(x) log(x/(1-x))
    if (is.null(main)) main = "Empirical logits"
  }

  sort = order(x)
  x = x[sort]
  y = y[sort]
  a = seq(1, length(x), by=binsize)
  b = c(a[-1] - 1, length(x))

  prob = xmean = ns = rep(0, length(a)) # ns is for CIs
  for (i in 1:length(a)){
    range = (a[i]):(b[i])
    prob[i] = mean(y[range])
    xmean[i] = mean(x[range])
    ns[i] = b[i] - a[i] + 1 # for CI
  }

  extreme = (prob == 1 | prob == 0)
  prob[prob == 0] = min(prob[!extreme])
  prob[prob == 1] = max(prob[!extreme])

  g = link(prob) # logits (or probits if probit == TRUE)

  linear.fit = lm(g[!extreme] ~ xmean[!extreme])
  b0 = linear.fit$coef[1]
  b1 = linear.fit$coef[2]

  loess.fit = loess(g[!extreme] ~ xmean[!extreme])

  plot(xmean, g, main=main, xlab=xlab, ylab=ylab)
  abline(b0,b1)
  if(lowess.in ==TRUE){
    lines(loess.fit$x, loess.fit$fitted, lwd=2, lty=2)
  }
}
To use the function, we need two things: a response variable and an explanatory variable. You can choose the number of observations placed in each bin (binsize = ), or you can leave it out and let the function decide based on those inputs how best to draw the plot.
EmpLogOddsPlot(x=email$line_breaks, y=email$spam, xlab = "Line Breaks", ylab = "Empirical log odds", main = "Figure 3")
Question 11
Create the empirical log odds plot using the code above. Do you feel comfortable claiming that the shape of the relationship between X and the log odds of being spam is linear? Why or why not?
In cases where the empirical log odds plot suggests that a line might not be the correct shape to use, we know that we have other shapes we can use for \(f(X)\) rather than just a line! We are going to try a new one today: \(f(X) = log(X)\).
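As a sketch of what changes when log(X) replaces X, the simulated example below fits a logistic regression with log(x) written directly in the model formula and checks how the slope is read: doubling x multiplies the predicted odds by 2 raised to the slope. The data, the variable names, and the true coefficients (1 and -0.6) are invented for illustration. Note that log(X) is only defined when X is greater than 0.

```r
# Simulated data for illustration only (not the lab data)
set.seed(12)
n <- 2000
x <- sample(1:500, n, replace = TRUE)         # positive counts, so log(x) is defined
y <- rbinom(n, 1, plogis(1 - 0.6 * log(x)))   # log odds are linear in log(x)

# The transformation can be written directly in the formula
toy_fit <- glm(y ~ log(x), family = binomial)
b1 <- coef(toy_fit)[["log(x)"]]

# Predicted odds at a given value of x
odds_at <- function(x_new) {
  exp(predict(toy_fit, newdata = data.frame(x = x_new), type = "link"))
}

# Doubling x (10 -> 20) multiplies the odds by 2^b1
unname(odds_at(20) / odds_at(10))
2^b1
```

The two printed values match, so with log(X) in the model the slope describes the effect of a multiplicative change in X rather than a one-unit increase.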
Question 12
Write down the code you would need to use to create the empirical log odds plot with the explanatory variable as the log of the number of line breaks. Change the x-axis label accordingly and change the title to “Figure 4”.
Question 13
Run your code and show the plot. Which of the empirical log odds plots (using line_breaks or using log(line_breaks)) is more linear? Based on the plots, choose one of the two explanatory variables to use in the model.
Question 14
Fit your chosen model and call it spamModel2. Write down the logistic regression line in log odds form. Copy and adapt the following: $\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = $.
Question 15
Interpret the slope in odds form.
Predictions
To make predictions using a logistic regression model, we use the predict.glm function in R. We know that to make a prediction, we need a particular value of the X variable. Suppose we have a message with 400 line breaks. Will our filter label it spam? That all depends on the probability suggested by the model. So let’s find out! The code below provides the predicted probability for a message with 400 line breaks according to our second model.
predict.glm(spamModel2, data.frame(line_breaks = 400), type = "response")
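The type argument controls the scale of the prediction: "link" returns the predicted log odds and "response" returns the predicted probability (a shortened value such as "resp" is matched to "response"). The sketch below fits a model to simulated data (the data frame toy and its coefficients are made up, not the lab data) and confirms that the two scales are connected by the logistic transformation.

```r
# Simulated data for illustration only (not the lab data)
set.seed(3)
n <- 800
toy <- data.frame(line_breaks = sample(1:1000, n, replace = TRUE))
toy$spam <- rbinom(n, 1, plogis(0.5 - 0.004 * toy$line_breaks))

toy_model <- glm(spam ~ line_breaks, data = toy, family = binomial)

# The new observation must use the same column name as the model's X
new_email <- data.frame(line_breaks = 400)

log_odds <- predict(toy_model, newdata = new_email, type = "link")      # log odds
prob     <- predict(toy_model, newdata = new_email, type = "response")  # probability

# Transforming the log odds by hand recovers the predicted probability
all.equal(unname(plogis(log_odds)), unname(prob))
```

Having both scales available is handy when a question asks for evidence in terms of both the probability and the log odds.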
Question 16
Run the function above, and interpret the result.
Question 17
Based on the prediction we have obtained so far, do you think a message with 400 line breaks is likely to be classified as spam by our second spam filter? Explain.
Question 18
What is the predicted probability of being spam for a message with 5 line breaks?
Question 19
Is our spam filter more or less likely to classify a message with 5 line breaks as spam than a message with 400 line breaks? Answer using both the probability and the log odds as evidence.
Now, a probability is not a decision. We have just estimated the probability that a message is spam if it has 400 line breaks, or 5 line breaks. But this does not actually declare the message spam.
In order to make the ultimate decision, we essentially will be spinning a spinner where the probability of getting “spam” is defined by the model. The larger the probability, the larger “piece” of the spinner that is labelled spam. Suppose the model says the probability of being spam for a particular email is 40%. To make a decision for this email using R, we use the following code:
sample(c("spam","notspam"),1, prob= c(.4,.6))
Running this will spit out a decision for our email. Do we declare it spam, or not spam? Now remember, with logistic regression, we are modeling probabilities. If we run the code again, we are in essence taking another round at the spinner, and we may get a different answer. (Try it and see!) That is okay!! And better than okay, it’s expected.
In the code, the prob argument holds the probabilities associated with the outcomes “spam” and “notspam”, respectively. These can be changed to suit the prediction you need to make.
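To see the spinner idea over many spins, the sketch below repeats the draw 10,000 times for the 40% example and checks the share of “spam” decisions. The seed and the number of spins are arbitrary choices made for illustration.

```r
# Repeat the spinner many times for an email with P(spam) = 0.4
set.seed(2023)
p_spam <- 0.4

decisions <- sample(c("spam", "notspam"), size = 10000, replace = TRUE,
                    prob = c(p_spam, 1 - p_spam))

# Any single spin can go either way, but the long-run share of "spam"
# decisions should be close to p_spam
mean(decisions == "spam")
```

This is why two runs of the single-draw code can disagree for the same email while still being consistent with the model.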
Question 20
Suppose we want to make a decision about our email with 400 line breaks. We already know the probabilities. Adapt the code above. Show your code, and your result.
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2023 August 1.
The data for this lab were retrieved from stat.duke.edu/courses/Spring13/sta102.001/Lab/email.Rdata.