STA 363 Lab 2
Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.
The Goal
In class, we have started to work on classification techniques, which apply in situations where our response variable \(Y\) is categorical. Today, we are going to practice using R to apply these classification techniques. We will also start to explore the idea of tuning a model.
The Data
Spam emails are unwanted emails, generally sent to masses of people by corporations or other entities attempting to sway you into purchasing a good or service. Want a cruise to Europe?? Have we got a deal for you!! Modern email clients like GMail work to identify spam messages and keep them from appearing in your inbox. This is called filtering. Filters take a look at each email that appears in your inbox. Based on what the filter sees, it classifies an email as spam, and tucks it away in your spam folder, or not spam, in which case the email is placed in your inbox. The better the filter, the fewer unwanted emails you have to sift through in your inbox each day.
Today, we have a client who asks you to build a spam filter for a particular set of emails. They are going to provide training data for you to use to build your model, as well as a test data set that you can use to evaluate how well your model identifies spam.
You can download the test data by putting the following into your browser:
https://drive.google.com/file/d/1Oe7PE3VIpR9fgNVJGmp1RONP1BKzNhZP
and the training data by putting the following into your browser:
https://drive.google.com/file/d/1OjQ-dz5AU7mZf9al4cKtmCVFBx8rFU9Y
You then need to load both of these data sets into R. If you have never loaded data from a csv file into R before, or if you need a refresher, follow these steps in the “Moving the data into RStudio” section.
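If it helps, here is a minimal sketch of the loading step. It assumes you saved the downloaded files as train.csv and test.csv in your working directory; the file names are placeholders, so adjust them to match whatever names you chose.

```r
# Read the training and test data from csv files
# (file names are placeholders; use the names you saved the files under)
train <- read.csv("train.csv")
test <- read.csv("test.csv")
```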
Question 1
How many rows are in the test data set? What about the training data set?
Question 2
The test data set is smaller than the training data set. This is almost always the case in statistical learning applications. Explain why you think this is the case.
The response variable of interest is \(Y=\) `spam`: "notspam" if the email is not spam and "spam" if the email is spam.
There are 18 possible features in this data set. They are:

- `to_multiple`: Indicator for whether the email was addressed to more than one recipient.
- `from`: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
- `cc`: Indicator for whether anyone was CCed.
- `sent_email`: Indicator for whether the sender had been sent an email in the last 30 days.
- `image`: Indicates whether any images were attached.
- `attach`: Indicates whether any files were attached.
- `dollar`: Indicates whether a dollar sign or the word "dollar" appeared in the email.
- `winner`: Indicates whether "winner" appeared in the email.
- `inherit`: Indicates whether "inherit" (or an extension, such as inheritance) appeared in the email.
- `password`: Indicates whether "password" appeared in the email.
- `num_char`: The number of characters in the email, in thousands.
- `line_breaks`: The number of line breaks in the email (does not count text wrapping).
- `format`: Indicates whether the email was written using HTML (e.g., may have included bolding or active links) or plain text.
- `re_subj`: Indicates whether the subject started with "Re:", "RE:", "re:", or "rE".
- `exclaim_subj`: Indicates whether there was an exclamation point in the subject.
- `urgent_subj`: Indicates whether the word "urgent" was in the email subject.
- `exclaim_mess`: The number of exclamation points in the email message.
- `number`: Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
Classification Approach 1: KNN
Our client has asked us to use two numeric features in our model: the number of characters in the email (`num_char`) and the number of line breaks in the email (`line_breaks`). The goal is to build a model that can use these two features (and only these two features) to predict whether or not an email is spam.
We are going to start by using k-nearest neighbors (kNN) to predict whether or not an email is spam. A big advantage of kNN is that there are not many conditions you need to check before using it. The method does not impose a particular shape on the relationship, as something like logistic regression does. The only thing we need is at least two numeric features, which we have. The first step is to visualize the relationship between these two features.
Question 3
(a) Create a scatter plot where line breaks are on the x axis and the number of characters is on the y axis. (b) Did you use your training or test data set to create your plot?
Hint: Look back at Lab 1 if you have questions on how to make the plot!
Now, this scatter plot is a little unusual because right now, our Y variable is not represented on the plot.
Question 4
Change the color and/or the shape of the points on the graph based on Y. In other words, spam emails should be one shape and/or color, and non-spam emails should be a different shape and/or color. Hint: Look back at Lab 1 if you have questions on how to do this!
Question 5
Suppose an email has 175 thousand characters (`num_char` = 175) and 3800 line breaks. Using 3-nearest neighbors, what value \(\hat{Y}\) would you predict for this email? Hint: You don't need any code for this; look at the graph from Question 4.
Okay, so with the graph, we can look at a point, find its nearest neighbors, and determine what value \(\hat{Y}\) we would predict for that point based on the value of \(Y\) for these neighbors. However, we don’t want to do this one point at a time, nor do we want to try to guess at which neighbors are closest when we have a lot of different points on the graph. This means we want to use R to help us use kNN to estimate \(Y\).
KNN in R
To use kNN in R, we need the `class` library. Go ahead and load the library you need. If you get a message indicating you do not have the `class` package, install it. Look back at Lab 1 if you have questions on how to do this!
```r
suppressMessages(library(class))
```
The `class` library contains a lot of functions. The actual function we will use for kNN is called (shockingly!!) `knn()`. It takes a few arguments (or inputs we need to provide to make the function run).
To run kNN, we will have a code structure like this:
```r
results <- knn(train = , test = , cl = , k = )
```
- `train =`: The first input you need is the training data. We are using the 12th and 13th columns of the training data for our features, so we use `train = train[,c(12,13)]`.
- `test =`: The second input you need is the test data: `test = test[,c(12,13)]`. Remember that the goal is to make predictions for the test data based on the nearest neighbors in the training data.
- `cl =`: Here, you give the response variable from the training data. Example: `cl = train$spam`. This is how the code identifies which emails in the training data are spam or not spam so it can predict spam status in the test data set.
- `k =`: Here, you provide the integer value you would like to use for k. How many nearest neighbors should we use? A complete call, with all of these pieces filled in, is sketched below.
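Putting the pieces together, a complete call looks like the following sketch. It assumes your data frames are named `train` and `test`, and it uses \(k = 1\) purely to illustrate the syntax; you will choose \(k\) yourself in the questions that follow.

```r
# kNN with num_char and line_breaks (columns 12 and 13) as features;
# k = 1 is only to illustrate the syntax
results <- knn(train = train[, c(12, 13)],
               test = test[, c(12, 13)],
               cl = train$spam,
               k = 1)
```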
You will notice that we have one more part of the code: `results <-`. The `<-` symbol in R means "store whatever is to the right of this symbol in an object whose name is given to the left of this symbol". In other words, `results <- knn(...)` means "please store the results of running kNN in an object called `results`". This stops all the predicted values for the test data from appearing on our screen, and also makes the results accessible for us to use later.
Question 6
Using 5 nearest neighbors, run kNN. Look at the object `results` you have created. Is this a scalar (one value), a vector (one column or one row), or a matrix (multiple rows and columns)?
Why is the question above important? Well, as we move through the semester, we will see that different types of objects in R (scalars, vectors, and matrices) are handled in slightly different ways.
Question 7
Make a table to show how many emails 5-nearest neighbors predicted to be spam and not spam.
Assessing Accuracy
Now that we have used 5-nearest neighbors to make predictions, we need to assess the accuracy of our predictions. We are making predictions on our test data, which has Y values in it, so we know what the values of Y are supposed to be. This means we can check to see if our predictions are correct!
One common way of comparing predictions of your response variable to the actual values of the response variable is to use a confusion matrix. To make a confusion matrix, you will make a table with the rows containing the predictions and the columns containing the true values of Y. To make your confusion matrix, you can use:
```r
knitr::kable(table(results, test$spam), caption = "Table 1",
             col.names = c("True Not Spam", "True Spam"))
```
Question 8
What is the sensitivity of your classification method? Do we prefer methods with larger or smaller sensitivity?
Question 9
What is the specificity of your classification method? Do we prefer methods with larger or smaller specificity?
Question 10
What is the false positive predictive rate (FPPR)? Do we prefer methods with larger or smaller FPPR?
Question 11
What is the false negative predictive rate (FNPR)? Do we prefer methods with larger or smaller FNPR?
Question 12
What is the accuracy of your classification approach? Hint: This is the proportion of the predictions that are correct.
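If you would like to organize these computations in R rather than by hand, here is a minimal sketch. It assumes the confusion matrix was built as above, with predictions in the rows and true values in the columns, and it treats "spam" as the positive class; check your table's labels before relying on the positions below.

```r
# Confusion matrix: rows are predictions, columns are true values
conf <- table(results, test$spam)

# Accuracy: proportion of all predictions that are correct
accuracy <- sum(diag(conf)) / sum(conf)

# Assumption: "spam" is the positive class (check your labels!)
sensitivity <- conf["spam", "spam"] / sum(conf[, "spam"])          # true positive rate
specificity <- conf["notspam", "notspam"] / sum(conf[, "notspam"]) # true negative rate

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```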
At this point, we have a fairly good idea of the accuracy of 5-nearest neighbors as a classification approach for these data and these features. However…where did we come up with 5?
Changing K
In kNN, we have a choice of k. We have to choose a k bigger than 0, but we could choose 2, 3, 4, 11, 21, 53, etc. So, how do we choose?
Choosing k in kNN is our first example of a tuning parameter. This is a numeric value in a statistical learning technique that we are able to choose. Sometimes we choose tuning parameters to make a technique easier to understand, or more efficient to run, or so that the technique has higher predictive accuracy.
Let’s see what happens when we change the choice of k.
Question 13
Run kNN using \(k = 3, 7, 9\). State the sensitivity and specificity that you get for each choice of \(k\).
Question 14
What trend do you notice in the specificity as \(k\) increases? Explain why you think this is happening as \(k\) increases.
This seems like it would be tedious for multiple choices of \(k\). Is there a faster way? Yes! We are going to learn how to write for loops soon, which will help us automate this process. A preview is sketched below.
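This is a sketch, not something you need to write yourself yet. It assumes the same `train` and `test` objects as above, and simply prints a confusion matrix for each candidate \(k\).

```r
# Candidate values of k to try
k_values <- c(3, 5, 7, 9)

for (k in k_values) {
  # Re-run kNN with the current value of k
  results <- knn(train = train[, c(12, 13)],
                 test = test[, c(12, 13)],
                 cl = train$spam,
                 k = k)
  # Confusion matrix for this k
  print(table(results, test$spam))
}
```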
Question 15
Looking at the sensitivity and specificity from \(k=3,5,7,9\), which \(k\) would you choose? Explain your choice.
As we can see, some choices of \(k\) produce higher sensitivity, and some higher specificity. Sometimes we are in a situation where we want to prioritize one over the other, and sometimes we are in a situation where we want a balance between both. This is why we have to think critically about the goals in any statistical learning setting. The goals not only help us choose models, but they also help us tune a particular model to suit our needs.
Now that we have explored one classification technique for these data, let’s try another.
Classification Approach 2: Logistic Regression
The next classification approach we are going to try is logistic regression. We have the same goal (prediction) and the same two features (number of characters and line breaks) to use to predict whether or not an email is spam.
When we switch from kNN to logistic regression, we change the modeling assumptions we are making. In logistic regression, we assume that the relationship between each numeric feature and the log odds of an email being spam is linear. To check this, we create a special scatter plot called an empirical logit plot.
To make an empirical logit plot in R, we need to teach R a special function. Create an R chunk, copy the function below, paste it into an R chunk in your Markdown, and run it. You will notice that nothing seems to happen, but if you check the top right panel of your RStudio session, you should notice that a new function called `emplogitPlot` has appeared. R allows individual users to create functions as needed, and in essence you have just "taught" R a new function. This function was written by Alex Schell (http://alexschell.github.io/emplogit.html).
```r
emplogitPlot <- function(x, y, binsize = NULL, ci = FALSE, probit = FALSE,
                         prob = FALSE, main = NULL, xlab = "", ylab = "",
                         lowess.in = FALSE){
  # x       vector with values of the independent variable
  # y       vector of binary responses
  # binsize integer value specifying bin size (optional)
  # ci      logical value indicating whether to plot approximate
  #         confidence intervals (not supported as of 02/08/2015)
  # probit  logical value indicating whether to plot probits instead
  #         of logits
  # prob    logical value indicating whether to plot probabilities
  #         without transforming
  #
  # the rest are the familiar plotting options

  if (class(y) == "character") {
    y <- as.numeric(as.factor(y)) - 1
  }

  if (length(x) != length(y))
    stop("x and y lengths differ")
  if (any(y < 0 | y > 1))
    stop("y not between 0 and 1")
  if (length(x) < 100 & is.null(binsize))
    stop("Less than 100 observations: specify binsize manually")
  if (is.null(binsize)) binsize = min(round(length(x)/10), 50)

  if (probit) {
    link = qnorm
    if (is.null(main)) main = "Empirical probits"
  } else {
    link = function(x) log(x/(1-x))
    if (is.null(main)) main = "Empirical logits"
  }

  sort = order(x)
  x = x[sort]
  y = y[sort]
  a = seq(1, length(x), by = binsize)
  b = c(a[-1] - 1, length(x))

  prob = xmean = ns = rep(0, length(a)) # ns is for CIs
  for (i in 1:length(a)) {
    range = (a[i]):(b[i])
    prob[i] = mean(y[range])
    xmean[i] = mean(x[range])
    ns[i] = b[i] - a[i] + 1 # for CI
  }

  extreme = (prob == 1 | prob == 0)
  prob[prob == 0] = min(prob[!extreme])
  prob[prob == 1] = max(prob[!extreme])

  g = link(prob) # logits (or probits if probit == TRUE)

  linear.fit = lm(g[!extreme] ~ xmean[!extreme])
  b0 = linear.fit$coef[1]
  b1 = linear.fit$coef[2]

  loess.fit = loess(g[!extreme] ~ xmean[!extreme])

  plot(xmean, g, main = main, xlab = xlab, ylab = ylab)
  abline(b0, b1)
  if (lowess.in == TRUE) {
    lines(loess.fit$x, loess.fit$fitted, lwd = 2, lty = 2)
  }
}
```
To use the function, we need two things: a response variable and an explanatory variable. For example, for line breaks, we can use:
```r
emplogitPlot(x = train$line_breaks, y = train$spam,
             xlab = "Line Breaks",
             ylab = "Log Odds of Being Spam",
             main = "Figure 3")
```
Question 16
Create the plot using the code above. Do you feel comfortable claiming that the shape condition is satisfied? Why or why not?
In cases where the empirical logit plot suggests that linearity might not be valid, a good first step is to try transforming the explanatory variables. One common transformation is taking the log.
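For example, to see the empirical logit plot for the log of line breaks, you can wrap the feature in `log()` inside the same call as before (the figure title here is just a placeholder):

```r
# Same plot as before, but with the natural log of line breaks on the x axis
emplogitPlot(x = log(train$line_breaks), y = train$spam,
             xlab = "Log(Line Breaks)",
             ylab = "Log Odds of Being Spam",
             main = "Figure 4")
```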
Question 17
Create an empirical logit plot where the log of line breaks is on the x axis. Which of the empirical logit plots (using line breaks or using log line breaks) is more linear? Based on the plots, choose whether you will use line breaks or log line breaks as a feature in the model.
Question 18
Repeat this process for number of characters.
Question 19
Train your chosen logistic regression model. Write down the trained logistic regression line in log odds form. Hint: For a template for writing out your model in Markdown, copy and paste the following into the white space (NOT a chunk), and adapt it to match your trained model:
$$\log\left(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\right) =$$
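To train the model itself, R's built-in `glm()` function with `family = binomial` fits a logistic regression. The sketch below assumes you chose the log of both features in Questions 17 and 18; swap in the untransformed versions if that is what your empirical logit plots supported.

```r
# Fit logistic regression on the training data
# (log-transformed features are an assumption; match your choices from Q17-18)
m1 <- glm(factor(spam) ~ log(num_char) + log(line_breaks),
          data = train, family = binomial)
summary(m1)
```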
Question 20
Use your trained model to make predictions \(\hat{Y}\) for each email in the test data set. Show a confusion matrix comparing the true values of \(Y\) in the test data to your predictions.
Question 21
What is the sensitivity of your classification approach?
Question 22
What is the specificity of your classification approach?
Question 23
What is the FPPR?
Question 24
What is the FNPR?
Question 25
What is the accuracy of your classification approach?
Comparing Approaches
We now have two different classification approaches, and we have applied them both to the same data set.
Question 26
Compare the two classification approaches (kNN with your chosen \(k\) and logistic regression) in terms of (1) sensitivity, (2) specificity, and (3) accuracy.
Hint: One useful way to compare is to make a table; a skeleton is shown below. I like to use this site (https://www.tablesgenerator.com/markdown_tables) to help create tables. Just type in the values as you want them to appear in the table, and then hit "Generate". The result can be copied from the website and then put into the white space (not a chunk) in your Markdown file. Knit, and you will see your table!
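For example, a skeleton like the following (paste it into the white space and fill in the cells with your own values) gives a side-by-side comparison:

| Metric      | kNN | Logistic Regression |
|-------------|-----|---------------------|
| Sensitivity |     |                     |
| Specificity |     |                     |
| Accuracy    |     |                     |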
We have now started the process of comparing different approaches to the same statistical learning task, as well as exploring the process of tuning models by choosing the values of certain constants, like \(k\) in kNN. We will continue this process next class!
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 August 27.
The data for this lab were retrieved from stat.duke.edu/courses/Spring13/sta102.001/Lab/email.Rdata.