STA 363/663 Lab 7

Goal

We have been doing a lot of work in our course in regression modeling, which means we are working with a numeric Y variable. However, there are also a lot of scenarios in which we want to predict categorical response variables. These tasks are called classification tasks, and they will be the focus of our lab today.

The Data

We are going to work with our familiar penguins data set with information on n = 333 penguins. To load the data, you need to use the following chunk of code:

library(palmerpenguins)
data(penguins)
penguins <- na.omit(penguins)

# Convert to a data frame
penguins <-data.frame(penguins)

We have a client who is interested in building a model for Y = the sex of the penguin. In addition to this response variable, we have information on 7 features.

species - the type penguin.
island - the island where the penguin lives.
bill_length_mm - the length of the penguin bill in millimeters.
bill_depth_mm - the depth of the penguin bill in millimeters.
flipper_length_mm - the flipper length of the penguin in millimeters.
body_mass_g - the body mass of the penguin in grams.
year - the year the penguin was measured.

Our Y variable is binary, and we are going to use three different models to try to predict \(Y\): logistic regression, KNN, and a classification tree. We are going to explore all three techniques and see how to tune each of them for optimal predictive ability.

Classification Method 1: Logistic Regression

We begin by using a classic model for binary response variables: logistic regression. If \(Y\) is a binary random variable and \(X\) is a single feature, our logistic regression model looks like this:

\[Y_i \sim Bernoulli(\pi_i)\]

\[log \left( \frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_i\]

The quantity on the left hand side is called the log-odds that Y = 1, and \(\pi_i\) is the probability that \(Y_i=1\).

\[\pi_i = P(Y_i = 1)\]

In our case, our response variable \(Y\) has two levels: female (0) and male (1), so \(\pi_i\) is the probability that penguin \(i\) is male. The reason why female = 0 and male = 1 is that in R, we always assign the numbers to categories in alphabetical order!

Checking Shape

When we build logistic regression models, we do make a shape assumption that the relationship between \(X\) and the log odds of the penguin being male is a line. To check this, we create a special scatter plot called an empirical log odds plot. To teach R the function it needs to make this plot, create an R chunk, copy the function below, paste it into an R chunk in your Markdown, and press play.

logOddsPlot <- function(x, y, xname, formulahere){

  if(class(y)=="factor"){
    baseline = levels(y)[1]
    y <- ifelse(y == baseline, 0, 1)
  }
  
  sort = order(x)
  x = x[sort]
  y = y[sort]
  a = seq(1, length(x), by=.05*length(y))
  b = c(a[-1] - 1, length(x))
  
  prob = xmean = ns = rep(0, length(a)) # ns is for CIs
  for (i in 1:length(a)){
    range = (a[i]):(b[i])
    prob[i] = mean(y[range])
    xmean[i] = mean(x[range])
  }
  
  extreme = (prob == 1 | prob == 0)
  prob[prob == 0] = .01
  prob[prob == 1] = .99
  
  g = log(prob/(1-prob))
  
  dataHere <- data.frame("x" = b, "LogOdds" = g)
  
  suppressMessages(library(ggplot2))
  
  ggplot(dataHere, aes(x =xmean, y = LogOdds)) + geom_point() + geom_smooth(formula = formulahere, method = "lm", se = FALSE) + labs(x = xname, y = "Log Odds")
}

To use the function, you will need to specify the inputs, meaning the pieces of information the computer needs in order to run the code.

x: the column containing the x variable; example: penguins[, "flipper_length_mm"]
y: the column containing the y variable; example: penguins[, "sex"]
xname: the label you want on the x axis, example: xname = "Bill Depth in mm". - formula: This is the shape of the relationship you want to plot. If you want a line, this is y ~ x.

Here is some of the code to create the plot. You will need to fill in the gaps.

logOddsPlot(x= , y= , xname = "Bill Depth (mm)", formulahere = y ~ x)

Question 1

Use the code above to create the empirical log odds plot to check shape between X = bill depth in mm and the log odds that the penguin is male.

Typically, we would need to check the shape condition for every numeric feature. For today, you may proceed as though the shape conditions is met. When you work on your final project, you will need to check shape if you choose to use logistic regression or linear regression. Be aware that you may find when you check shape that you need to transform X in order to make sure the shape of the relationship is accurately captured in the model.

Logistic Regression in R

The example logistic regression model only uses one feature, but logistic regression models can use a combination of categorical and numeric features, just like every other model in our course.

To build a logistic regression model in R using all of the features in our data set, we use the following code:

m1 <- glm( sex ~ . , data = penguins, family = "binomial")

If we want to see the coefficients from the model, we can use the same function we would use for linear regression:

summary(m1)

These coefficients help us use the model for both association and prediction tasks. They describe the relationship between the features and the log odds that a penguin is male.

Our goal for today, though, is prediction. This means we are not interested in interpreting the coefficients. Instead, we want to use the model to predict whether or not a penguin is male based on the features.

The first step to doing this is to use our trained model to find \(\hat{\pi}_i\) for each penguin. In order to find the predicted probability that each penguin in the data set is male, we use the predict function (again, standard in our course):

probabilities <- predict(m1, type = "response")

The only new part of our code is type = "response". This is important because in logistic regression we might want the predicted log odds or the predicted probabilities to be produced with this function. type = response tells R we can the predicted probabilities.

Question 2

What is the predicted probability that the 5th penguin in the data set is male?

Now, our goal for today is prediction, which means we need to convert those probabilities \(\hat{\pi}_i\) into predictions \(\hat{y}_i\) for each row \(i\) in the data set. This means we need to decide that probability \(\hat{\pi}_i = P(\hat{y}_i = 1)\) is large enough that we want to assign \(\hat{y}_i = 1\).

This probability that is “large enough” is called a threshold. This is a value such that if the probability that a penguin \(i\) is male (\(\hat{\pi}_i\)) is above the threshold, we predict \(\hat{y}_i = male\). If \(\hat{\pi}_i\) is less than or equal to the threshold, we predict \(\hat{y}_i = female\).

Question 3

Suppose the threshold is .5. What is the predicted value \(\hat{y}_5\) for the 5th penguin in the data set?

Once we know our threshold, we can use R to convert all the probabilities in the vector probabilities into predictions.

YHat_logistic <- ifelse( probabilities > .5, "male", "female")

At this point, we have vector YHat_logistic that holds a predicted value of \(Y\) for all rows in the data set. The next step is to determine how accurately these predictions reflect the true values in the data set!

Assessing Accuracy

There are a lot of tools we use to assess the predictive accuracy of a classification task, including the accuracy, the classification error rate (CER), the true positive rate (TPR), and the true negative rate (TNR). It is common to consider TPR and TNR when we think about classification tasks, so we will focus on those for today.

TPR is the true positive rate. This phrase makes sense when our response variable is “yes” and “no”, but it’s trickier when we have data like ours when \(Y\) is “male” or “female”. “Positive” always refers to whatever level of \(Y\) is classified as a 1. This means that for our data, the TPR is the percent of the male penguins in the original data set that we correctly predict as male using our logistic regression model with a specific threshold.

\[TPR = \frac{ \sum_{i=1}^n (y_i = male ~ and ~ \hat{y}_i = male)}{\sum_{i=1}^n (y_i = male)}\]

To compute the TPR, we can use the following function:

compute_TPR <- function( truth, predictions, isone){
  
  # Determine which rows in the true data have y = 1
  true1 <- which( truth == isone)
  
  # Determine which rows in the predictions have hat y = 1
  pred1 <- which( predictions == isone)
  
  # Find how many agree
  numerator <- intersect( true1, pred1)
  
  # Compute the TPR
  TPR <- length(numerator)/length(true1)
  
  TPR
  
  
}

This function requires 3 inputs:

truth: What is the actual \(Y\) variable we trying to predict?
predictions: what are the predicted values \(\hat{Y}\) from our model?
isone: what value of the response variable are we computing for? In this case, we want the percent of the male penguins that are correctly estimated, so isone = "male".

We use the function like so:

compute_TPR( truth = penguins[, "sex"], predictions = YHat_logistic, isone = "male")

Question 4

Using a threshold of .5, what is the TPR of our logistic regression model?

Question 5

Using a threshold of .001, what is the TPR of our logistic regression model?

In Question 5, we will notice something very odd. Our TPR is very high for very small choice of threshold. Why is that??

Well, the smaller the threshold, the easier we make it to declare that \(\hat{y}_i= male\). This means that the probability that a penguin that actually is male is predicted to be male is very high, and this is what the TPR measures!

However, this means that it is also very likely that we assign the value \(\hat{y}_i=male\) when the true value of \(y_i\) is female. This is why in addition to the TPR, we discuss the TNR. In our case, the TNR is the percent of the female penguins in the original data set that we correctly predict as female using our logistic regression model with a specific threshold.

\[TNR = \frac{ \sum_{i=1}^n (y_i = female ~ and ~ \hat{y}_i = female)}{\sum_{i=1}^n (y_i = female)}\]

Question 6

Using the threshold .001, compute the TNR. To do this, you can use the compute_TPR function with one small adaption in the inputs.

As we can see, though our TPR is high with a threshold of .001, our TNR is not. We typically need a balance between the TPR and the TNR - we want to predict well for both male and female penguins, so we want both values to be relatively high!

Because of this, there are a few different tools we can use that assess how well our model is doing at prediction that take both TPR and TNR into account. One such tool is the geometric mean of TPR and TNR:

\[\sqrt{TPR \times TNR}\]

The quantity above tends to be closer to 1 when both TPR and TNR are high, meaning both \(Y_i =0\) and \(Y_i = 1\) are being predicted well. The quantity is closer to 0 when at least one of \(Y_i =0\) and/or \(Y_i = 1\) is NOT being predicted well. This means the model is doing better at prediction the closer the geometric mean of TPR and TNR gets to 1.

Question 7

What is the geometric mean of TPR and TNR with a threshold of .5?

Another score that is sometimes used in classification is the F1 Score, which is defined as:

\[F1 = \frac{2(TPR)}{2(TPR) + FPR + FNR}\]

where

\(TPR\) = the proportion of male penguins that were correct predicted as males
\(FPR\) = the proportion of female penguins that were incorrectly predicted as males.

Question 8

What does the False negative rate (FNR) mean in words for these data? Use the information above as a guide!

The highest possible value for the F1 score is 1, while the lowest is 0, just like the geometric mean of TPR and TNR. This means models with higher F1 score have higher predictive accuracy.

This means we need TPR, FNR, FPR…and we really need a confusion matrix.

knitr::kable( table( YHat_logistic, penguins$sex) ,col.names = c("True Female", "True Male"))

Note: In this output, the TPR = sensitivity, the TNR = specificity.

Question 9

With a threshold of .5, what is the F1 score?

Hint: You will need to determine the FPR and FNR before you can obtain the F1-score.

One thing to note is that so far, all of the values we have computed tell about the in-sample predictive ability of our model. If we want to estimate how well our model will predict for test data, we need to use a cross validation technique like LOOCV or 10 fold-CV, or we need test data.

My folks using classification on your project, our goal is prediction, so you will need to use CV to estimate your predictive accuracy!!! This means your code will be a little different from what we do in this lab!!

Classification Method 1B: Penalized Regression with Classification

For a binary outcome, one other option for predicting \(Y\) is to use penalized regression, meaning ridge, lasso, or elastic net! The process is the same as what we did for a numeric \(Y\), except that we choose the values of \(\hat{\beta}\) that minimize the deviance (-2 the log likelihood) plus the penalty. The idea and the process are otherwise the same!

The code is also very similar. For lasso, for instance, we use the following

# Build the design matrix 
trainFeatures <- model.matrix( sex ~ . , data = penguins)

# Run 10-fold CV to choose lambda
library(glmnet)
set.seed(363663)

cv.out.lasso <- cv.glmnet(trainFeatures[,-1], penguins$sex, 
  alpha = 1,lambda = seq(from = 0, to = 1, by = .001),
  family="binomial" )

# Find the value of lambda
cv.out.lasso$lambda.min

Once you have chosen your value of \(\lambda\), you proceed with assessing the predictive accuracy just like we did for logistic regression! This means that you can use ridge, lasso, and elastic net with a binary \(Y\).

Question 10

Train a logistic lasso model using the \(\lambda\) value from the code above. Which variables are removed from the model?

Once the model is trained, we can obtain predicted probabilities using:

lasso_probabilities <- predict(lassomodel, type ="response", newx = trainFeatures[,-1])

Question 11

Using a threshold of .5, what is the geometric mean of TPR and TNR for this model?
Based on this, do we prefer this model (logistic lasso) or the logistic model in the previous section when making predictions on the training data? Explain your choice.

Classification Approach 2: KNN

The next classification technique we are going to try is KNN. KNN has a large advantage that there are not many conditions you need to check before using it. There is no determined shape defined by the method, as there is in something like logistic regression. The only thing we need is at least two features, which we have.

We have used KNN with numeric response variables, and we know that in order to choose the predicted value of \(Y\) we simply average together the \(Y\) values of the \(K\) nearest rows. For classification, this won’t work because we cannot average “male” and “female”. Instead, we choose the value \(\hat{y}_i\) to match the majority of the \(Y\) values for the \(K\) neighbors. For this reason, we tend to choose \(K\) values that are odd numbers when our \(Y\) variable is categorical.

Question 12

Suppose when we look at the three nearest neighbors to a row \(i\), 2 are male and the 3rd is female. What value do we predict for \(\hat{y}_i\)?

To use KNN with a categorical response variable in R, we need the class library. If you get a message indicating you do not have the class package, install it.

suppressMessages(library(class))

The class library contains a lot of functions. The actual function we will use for KNN is called (shockingly!!) knn( ). It takes a few arguments (or inputs we need to make the function run). To run KNN, we will have a code structure like this:

YHat_KNN <- knn(train = , test =  , cl = , k = )

train = : the training data (just like for KNN with numeric \(Y\))
test = : the test data (just like for KNN with numeric \(Y\))
cl = : the response variable from the training data. For us: cl = penguins[, "sex"].
k = : Here, you provide the integer value you would like to use for k. How many nearest neighbors should we use?

Before we run KNN, there is one step we need to take. We have some categorical features, which means we need to convert them into binary indicators before building the model. Luckily, we know how to do this from penalized regression:

trainFeatures <- model.matrix( sex ~ . , data = penguins)[,-1]

The [ ,-1] removes the intercept column, as we are currently not fitting a model and therefore do not need an intercept. This is what we should use for the training data for KNN.

YHat_KNN <- knn(train = trainFeatures, test =  , cl = , k = )

Question 13

Using \(K=5\), run KNN to make predictions on your training data (meaning your test and training data are the same). What is the predicted value for the 5th penguin in the data set?

Now, just like in logistic regression, we have a tuning parameter we need to choose for KNN. Just like in logistic regression, we can choose \(K\) to optimize, (1) the TPR, (2) the TNR, or (3) the balance between the two. There are other options for what we can use to choose our tuning parameter, but these are common ones.

Question 14

Consider all possible odd values of \(K\) from 3 to 29. Find the value of \(K\) that gives you the highest balance between the TPR and the TNR. State the choice of \(K\) and the corresponding TPR and TNR values. Set a seed of 363663 (this is just in case of ties).

Again, we are only considering in-sample prediction today. To know how this would predict in general, we would want code like we used in Project 1 to use test data to choose \(K\).

Classification Approach 3: Classification Tree

The 3rd and final approach we will consider today is a classification tree. We have learned how to build these in class, but we didn’t actually get a chance to do it yet!

Remember that to build trees, we need the rpart package and to visualize the trees we need rpart.plot and rattle.

library(rpart)
library(rpart.plot)
library(rattle)

Question 15

Build a tree called m3 to predict penguin sex using all the features. Use the default stopping rules and show a visual of your tree using the code rpart.plot(m3). Use a seed of 363663 (we’ll learn why we need a seed later!!)

The rpart.plot package actually allows quite a bit of room for personalizing your tree plot. If you are curious, you can use ?rpart.plot to see all the options.

One option I recommend you consider using if your graph is too hard to read or the text is too small is fallen.leaves = FALSE. You can also use cex to change the size of the font in your graph, and box.palette to change the colors in your tree. The colors available are:

Grays: Gy
Greens: Gn
Blues: Bu
Browns: Bn
Oranges: Or
Reds: Rd
Purples: Pu

You can also combine colors, so if you want red, yellow, and green, you use box.palette = RdYlGn.

Question 16

Personalize your graph from Question 15 by choosing a different color palette (any you want) and using fallen.leaves = FALSE. Show your plot and state what fallen.leaves = FALSE seems to do.

rpart.plot( m3, fallen.leaves = FALSE, box.palette = )

Question 17

What value \(\hat{y}_5\) does the tree predict for the 5th penguin in the data set?

Predicting with a tree

To obtain the predicted values from the tree, we use the following:

YHat_tree <- predict(m3, type = "class")

Question 18

What is the TPR and TNR of your tree?

Comparing Results

At this point, we have the results of 3 different approaches for predicting \(Y = sex\).

Question 19

Which approach (logistic, KNN, classification tree) would you recommend using for prediction on the training data? Explain your choice.

In our next class, we’ll learn a way to improve the predictive accuracy of trees by extending to forests.

This work was created by Nicole Dalzell is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 April 1.

The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218. .