STA 363/663 Lab 5

Complete all questions and submit your final PDF or HTML file under Assignments in Canvas.

The Goal

In our last class, we started to explore classification. This just means models where \(Y\) is categorical rather than numeric. We saw that both KNN and logistic regression can be used if \(Y\) is binary. Today, we are going to implement both of these approaches in R.

The Data

Start by loading the data. There are two data sets today, train and test. Please note that even though this is earthquake data like we used in class, you have a different sample today!! This means that you will likely get different results than we did in the slides.

train <- read.csv("https://www.dropbox.com/scl/fi/rbwypc2pi3fi9g9fuqyhv/train_earthquake.csv?rlkey=rio3x88gq040n0h8zk8s4hom9&st=h33zi0m2&dl=1")
test <- read.csv("https://www.dropbox.com/scl/fi/rsnyt5r94ql4jeqkc0acq/test_earthquake.csv?rlkey=hdy0z7ph9n4l38os2b1kbjc0g&st=kjj7vto0&dl=1")

The training data set train should have 2426 rows and 28 columns. The test data set test should have 1079 rows and 28 columns. The response variable is Damage, which is recorded as 0 or 1. A 0 indicates no to minimal damage from the earthquake and a 1 indicates moderate to severe damage. Our goal is to predict \(Y\) = Damage.

Question 1

We learned that there is a key first step we always do with classification tasks. Execute that step for these data and show your results.

Question 2

Why is this step so important to do before we try to predict \(Y\)?

KNN

Now that we have checked to see if prediction makes sense, it’s time to choose our method. We are going to try two different predictive approaches today, but we are going to start with KNN.

Question 3

Suppose the three rows shown below are the 3 nearest neighbors for a row in the test data. What value of \(Y\) do we predict for that row in the test data?

Table 1
Number of Floors   Age   Damage
3                  35    1
3                  32    1
3                  31    0

When we do KNN with a categorical \(Y\), we will use the caret package, as we did with KNN with a numeric \(Y\).

library(caret)

The code is also similar:

yhat <- knn3Train( traindata , testdata , Y, k = )

where:

  • traindata: replace with a data set containing the features in your training data.
  • testdata: replace with a data set containing the features in your test data.
  • Y: replace with the column holding the response variable (the class labels) in your training data.
  • k =: fill in a number after the = to tell R how many nearest neighbors to use.
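
As a minimal sketch of how the pieces fit together, the example below runs knn3Train on simulated data. The objects toy_train, toy_test, and toy_labels are hypothetical and are not part of the lab data, and the value k = 3 is arbitrary.

```r
library(caret)

set.seed(3635)  # seed reused from the lab; any seed would do

# Hypothetical toy data: two numeric features and a 0/1 response
toy_train <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
toy_labels <- as.numeric(toy_train$x1 + toy_train$x2 > 0)
toy_test <- data.frame(x1 = rnorm(10), x2 = rnorm(10))

# Features only in the first two arguments; the response goes third
yhat_toy <- knn3Train(toy_train, toy_test, toy_labels, k = 3)

# One predicted label per test row (returned as text such as "0" or "1")
length(yhat_toy)
head(yhat_toy)
```

Note that the response column is kept out of both feature data sets. The predictions come back as text, but comparisons such as yhat_toy == 1 still behave as expected in R.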

Question 4

We only use odd values of \(K\) when \(Y\) is binary. Why is that?

Question 5

Make predictions on the test data using \(K = 5\). As an answer to this question, create a confusion matrix using the code below:

holder <- table( yhat, test$Damage )
colnames(holder) <- c("True 0", "True 1")
rownames(holder) <- c("Predicted 0", "Predicted 1")

knitr::kable( holder, caption = "K = 5 Confusion Matrix")

Question 6

What is the true positive rate on the test data for KNN with \(K = 5\)? Round to 3 decimal places.

Question 7

What is the true negative rate on the test data for KNN with \(K = 5\)? Round to 3 decimal places.

Question 8

What is the geometric mean of TPR and TNR on the test data for KNN with \(K = 5\)? Round to 3 decimal places.

Question 9

What is the F1-score on the test data for KNN with \(K = 5\)? Note: You may NOT use a function - show me the math so you get practice (you need it for the exam).

So far, we have (1) used KNN to make predictions and (2) assessed the accuracy of those predictions using a few different metrics. However, we have yet to tune our choice of \(K\). Tuning is really important to help approaches like KNN predict as well as possible.

Question 10

Would you recommend tuning to optimize the F1-score or the geometric mean of TPR and TNR (GM)? Explain your choice.

Note: Whatever you choose, you will use in the next section!

Tuning KNN

Our goal now is to tune \(K\), meaning we want to choose \(K\) to maximize whatever metric you chose in Question 10.

It’s very tempting at this point to use our test data to tune \(K\). The goal is to have the highest predictive accuracy on the test data that we can, so it may seem logical to use the test data to choose \(K\). However, when we tune, it’s actually best to use techniques like LOOCV and k-fold CV, etc., to create validation data to help us choose \(K\).

That’s a lot of work!! Why do we do this when we have test data???

The reason is that test data is only for testing. We do not use it to do any part of training a model or algorithm, and choosing tuning parameters is part of training. The goal with test data is to see how our predictive approach would perform on data that was not seen or used in any way by our predictive method. This means that for the most accurate assessment of predictive ability, we cannot use test data for tuning.

If you have questions about this, this is the time to ask Dr. Dalzell!!
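
To make the idea of validation data concrete, here is a minimal sketch of splitting a data set into folds in base R. The data frame toy is simulated and hypothetical, and n_folds <- 4 is an arbitrary value chosen for illustration, not a recommendation for these data.

```r
set.seed(3635)  # seed suggested in the lab

# Hypothetical toy data with 20 rows
toy <- data.frame(x = rnorm(20), y = rbinom(20, 1, 0.5))

n_folds <- 4  # arbitrary value for illustration only

# Randomly assign each row to one of the folds, keeping fold sizes equal
fold_id <- sample(rep(1:n_folds, length.out = nrow(toy)))

# Fold 1 plays the role of validation data; the other folds are used for training
validation_rows <- toy[fold_id == 1, ]
training_rows <- toy[fold_id != 1, ]

table(fold_id)
```

In a full cross-validation, each fold would take a turn as the validation data, and the test data would never be touched during this step.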

Question 11

For these data, would you recommend 5-fold CV, 10-fold CV, validation/train split, or LOOCV? Clearly explain your choice.

Okay, so now we know what metric we want to optimize (you chose it in Question 10) and we know what method we are going to use to create validation data (you chose that in Question 11). Let’s do it!

I’m going to provide a little help to get you started with your code. To find the F1-score and geometric mean of the TPR and TNR, you can adapt the code below where you replace test with the name of the data set you are working with.

# Find the F1 score
TP <- sum(yhat == 1 & test$Damage == 1)
FP <- sum(yhat == 1 & test$Damage == 0 )
FN <- sum(yhat == 0 & test$Damage == 1 )
F1 <- TP/(TP + .5*(FP+FN))
F1

# Find the GM
TPR <- sum(yhat == 1 & test$Damage == 1) / sum(test$Damage==1)
TNR <- sum(yhat == 0 & test$Damage == 0) / sum(test$Damage==0)
GM <- sqrt( TPR*TNR)
GM 
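
Written as formulas, the quantities computed in the code above are shown below. Here TP, FP, and FN are the counts from the code, and TN (a symbol not named in the code) is assumed to denote the number of rows with both a predicted and a true value of 0.

```latex
\[
F_1 = \frac{TP}{TP + \tfrac{1}{2}(FP + FN)}, \qquad
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad
\mathrm{TNR} = \frac{TN}{TN + FP}, \qquad
\mathrm{GM} = \sqrt{\mathrm{TPR} \times \mathrm{TNR}}
\]
```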

Question 12

Using your choice from Question 11, write code that computes the F1 Score or GM when \(K\) = 3, 5, 7, 11, 13, 15 (this is just to simplify your coding, you can use more options if you like!). Your output should be a data frame with 2 columns - \(K\) and the metric (F1 or GM). If you need to set a seed, use 3635.

Hint: The easiest way to do this is using a loop. Start without the validation part - just write it so you loop over different choices of K with KNN and compute your metric. Then, go back and add the part where you create validation data.

As the answer to this question, show a plot of \(K\) versus your metric by adapting the plot below:

library(ggplot2)

# If you chose F1
ggplot(Storage, aes( K, F1) ) + 
      geom_line() + 
      # Mark the value of K with the highest F1 score
      geom_vline(xintercept = Storage$K[which.max(Storage$F1)]) + 
      labs(title = "Figure 1", x = "K", y = "F1 Score", 
      caption = paste("Maximum F1 Score achieved at K =", Storage$K[which.max(Storage$F1)]))

# If you chose GM
ggplot(Storage, aes( K, GM) ) + 
      geom_line() + 
      # Mark the value of K with the highest GM
      geom_vline(xintercept = Storage$K[which.max(Storage$GM)]) + 
      labs(title = "Figure 1", x = "K", y = "GM", 
      caption = paste("Maximum Geometric Mean of TPR and TNR achieved at K =", Storage$K[which.max(Storage$GM)]))

Question 13

  1. State which value of \(K\) you chose in Question 12 as optimizing your chosen metric.

  2. Make predictions on the test data using that choice of \(K\). As an answer to this question, create a confusion matrix using the code below:

holder <- table( yhat, test$Damage )
colnames(holder) <- c("True 0", "True 1")
rownames(holder) <- c("Predicted 0", "Predicted 1")

knitr::kable( holder, caption = "K = ?? Confusion Matrix")

This would be the final version you would show a client! You have tuned \(K\), and then you have illustrated how well the approach can predict \(Y\) using that chosen value of \(K\).

However, could logistic regression do better??

Logistic Regression

Unlike KNN, logistic regression is a model. This means it has the added benefit of being able to be used for association as well as prediction. It is also just a different approach to prediction, which means that it can perform better (or worse!) than KNN, depending on the structure of the data.

To train a logistic regression model in R, we use:

logistic <- glm( Damage ~ ., data = train, family = "binomial")

Once the model is trained, you can obtain the probability that \(y_{i^{*}}=1\) for each row in the test data using:

probabilities <- predict( logistic, newdata = test, type = "response")
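
As a small illustration of what type = "response" returns, the sketch below fits a logistic regression to simulated data. The objects toy, toy_fit, and new_rows are hypothetical and are not part of the lab data.

```r
set.seed(3635)

# Hypothetical toy data: one numeric feature and a 0/1 response
toy <- data.frame(x = rnorm(50))
toy$y <- rbinom(50, 1, plogis(0.5 * toy$x))

toy_fit <- glm(y ~ x, data = toy, family = "binomial")
new_rows <- data.frame(x = c(-1, 0, 1))

# type = "response" gives probabilities; type = "link" gives log odds
p_response <- predict(toy_fit, newdata = new_rows, type = "response")
log_odds <- predict(toy_fit, newdata = new_rows, type = "link")

# Applying the inverse logit to the log odds recovers the probabilities
all.equal(as.numeric(p_response), as.numeric(plogis(log_odds)))
```

Each value in p_response is the estimated probability that the response equals 1 for the corresponding row of new_rows.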

Question 14

What is the probability that the first row in the test data is a building with moderate to severe damage, according to the model?

Question 15

What is the probability that the second row in the test data is a building with no to minimal damage, according to the model?

The trick to making predictions with logistic regression is that we have to choose a threshold. Probabilities above the threshold are assigned a prediction of 1, while probabilities equal to or below the threshold are assigned a prediction of 0. Like \(K\) in KNN, the threshold is a tuning parameter…so we have to choose it!!

Tuning the threshold requires very similar coding to what we just used to tune \(K\). For these data, I checked for you and using a threshold of .5, which is standard, works very well here - so happy day, you don’t have to code it!
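
The thresholding rule described above can be sketched on a few made-up probabilities. The vector toy_probabilities is hypothetical and was chosen to include a value that sits exactly at the threshold.

```r
# Hypothetical probabilities, chosen to include a value exactly at the threshold
toy_probabilities <- c(0.20, 0.80, 0.50, 0.51)
threshold <- 0.5

# Strictly above the threshold -> 1; equal to or below -> 0
yhat_toy <- ifelse(toy_probabilities > threshold, 1, 0)
yhat_toy
```

The third value equals the threshold, so it is assigned a prediction of 0, which matches the rule stated above.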

Question 16

Using a threshold of .5, make predictions on the test data for logistic regression. As an answer to this question, create a confusion matrix using the code below:

holderL <- table( yhat_logistic, test$Damage )
colnames(holderL) <- c("True 0", "True 1")
rownames(holderL) <- c("Predicted 0", "Predicted 1")

knitr::kable( holderL, caption = "Logistic Regression Confusion Matrix")

Comparing Results

Question 17

We now have two different approaches for classification: logistic regression and KNN. Which would you recommend the client use for prediction for earthquake damage? Clearly explain your reasoning to the client.

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or HTML document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.

Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 December 21.

The data set used in this lab is a subset of the data from the Nepal Earthquake Data Portal. Retrieved from: ‘https://kathmandulivinglabs.org/our-work/earthquake-data-portal’ [Online Resource].