STA 363/663 Lab 2

Complete all Questions and submit your final PDF or HTML (either works!) under Assignments in Canvas.

The Goal

In our lab today, we are going to try out KNN as a predictive approach for numeric $Y$ variables. Pay attention to how we work with rows and columns today, because this is going to be very important as we move into the rest of the coding in our course!

The Data

Our data for today is the cell phone price data from our first class. Recall that the goal is to predict $Y$ = cell phone prices. Predicting prices is a goal that is very common in statistical learning and in the real world. If we can predict the price based on features, clients can determine how to competitively price their own products.

In addition to price, we have 14 features:

battery_power: Total energy a battery can store at one time, measured in mAh
clock_speed: speed at which microprocessor executes instructions
fc: Front Camera mega pixels
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of processor
pc: Primary Camera mega pixels
px_height: Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in Mega Bytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: longest time that a single battery charge will last

In our first class, we only had training data. However, we now know how important it is to have test data when our goal is prediction. To load both the test and training rows, put the code below into a chunk and press play.

test <- read.csv("https://www.dropbox.com/scl/fi/y1hjakf9e3pkkohkot4sf/test_phones.csv?rlkey=7ujlcxnuvu0lemcoit39t6mcd&st=awcl33xl&dl=1")
train  <- read.csv("https://www.dropbox.com/scl/fi/8iitmby2trytf65pvmk96/train_phones.csv?rlkey=y5rhlbd26d6b20mu56oba4xg7&st=50a35x9h&dl=1")

KNN: Step by Step

We are going to use K-nearest neighbors to predict price. When we choose to use KNN, there are two key decisions we have to make:

1. Which distance measure do we use to choose the neighbors?
1. How many neighbors do we want to use?

For the moment, we are going to choose $K = 3$ neighbors, and we will come back and refine that later in the lab. This means that our focus needs to be on choosing a distance measure.

The choice of distance metric is usually guided by the type of features we have. If we have a mix of categorical and numeric features, we can use something like Gower’s distance. If we have all numeric features, we often use Euclidean distance.

Question 1

Based on our data set, which distance (Gower’s or Euclidean) do you think we will use?

We learned about Gower’s distance in class. To illustrate Euclidean distance, consider an example with the following two rows with 4 features per row:

\[Row~1: 1, 2, 5, 4\]
\[Row~2: 2, 4, 5, 2\]

To find the Euclidean distance between Row 1 and Row 2, we first subtract each feature in Row 2 from the corresponding feature in Row 1. This tells us how far apart the two rows are on each feature.

\[1 - 2 = -1\] \[2 - 4 = -2\] \[5 - 5 = 0\] \[4 - 2 = 2\]

We then square these values to get rid of the negatives and add up all the squared values.

\[(-1)^2 + (-2)^2 +0^2 + 2^2= 9\]

At this point, we have a squared distance. As we do with RMSE, we take the square root to undo the squaring and get the final distance. This means the Euclidean distance between Row 1 and Row 2 in this example is $\sqrt{9} = 3$.

As with most distance measures, smaller values of the Euclidean distance mean that two rows are closer together.
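The arithmetic above can also be checked step by step in R. This is only a small sketch using the two example rows; the names row1, row2, differences, squared_sum, and distance are made up for illustration and are not part of the lab data.

```r
# The two example rows from above, stored as numeric vectors
row1 <- c(1, 2, 5, 4)
row2 <- c(2, 4, 5, 2)

# Step 1: subtract feature by feature
differences <- row1 - row2
differences

# Step 2: square the differences and add them up
squared_sum <- sum(differences^2)
squared_sum

# Step 3: take the square root to get the distance
distance <- sqrt(squared_sum)
distance
```

Each step prints its result, so you can compare the output to the hand calculation line by line.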

Question 2

Find the Euclidean distance between Row 1 and Row 2 in our train data set, using only the first 5 features. Show your work - you may not use any packages or functions here, I want you to practice the steps.

Now that we understand how the distance is computed, let’s write a function to speed up the process. You will notice this is a recurring theme in this course. We will work to understand how something works, and then we will talk about how to speed up the process to make it practically efficient to use.

A function in R has a fairly simple structure.

name <- function( inputs ){

  what we want the function to do
  
}

where

name is the name we want to call the function. This is what we type in R to tell R to execute the function. Some examples we have used before are mean, summary, table, lm, etc.
inputs are the pieces of information we need in order to execute the function.
what we want the function to do is just what it sounds like!

For instance, the function below computes the Euclidean distance between any two rows a and b.

euclidean <- function(a, b){
  sqrt(sum((a - b)^2))
}

euclidean is the name of the function
a and b are the two inputs we need (since we want to compute the distance between two things, we need the two things!)
sqrt(sum((a - b)^2)) computes the Euclidean distance between a and b.

To use the function, we put the following in a chunk and press play, after replacing a and b with the rows we want.

euclidean(a , b)
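For example, here is a sketch of the function applied to the two example rows from the hand calculation earlier. The vectors row1 and row2 are illustrative names, not objects from the phone data.

```r
# The function from above
euclidean <- function(a, b){
  sqrt(sum((a - b)^2))
}

# The two example rows, stored as numeric vectors
row1 <- c(1, 2, 5, 4)
row2 <- c(2, 4, 5, 2)

# Replace a and b with the two rows
euclidean(row1, row2)

# The order of the inputs does not matter, because the differences are squared
euclidean(row2, row1)
```

Both calls should match the distance we found by hand.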

Great…except we haven’t talked about how to tell R to only grab a specific row in a data set. Let’s do that now.

In R, we use this notation to specify rows and columns in a data set:

data[ row, column ]

This means that if we want to pull the first row in train, we use

train[ 1, ]

The blank space after the comma means we want to see ALL the columns. In other words, by not specifying a column in the column space, we tell R to show all the columns.
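To see the notation in action, here is a small sketch on a made-up data set. The data frame toy and its values are invented for practice; they are not the phone data.

```r
# A small made-up data set (NOT the phone data), just to practice indexing
toy <- data.frame(
  ram   = c(512, 1024, 2048),
  pc    = c(5, 8, 12),
  price = c(100, 150, 300)
)

toy[2, ]        # row 2, all columns
toy[ , 3]       # all rows, column 3
toy[2, 3]       # row 2, column 3 only
toy[c(1, 3), ]  # rows 1 and 3, all columns
```

Notice that c() lets us ask for more than one row (or column) at a time.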

Question 3

Pull row 3 from the test data set. Show the row as the answer to this question.

So, if we want to find the distance between row 1 in the test data and row 1 in the training data, we replace a and b with those rows!!

Except…in KNN, we only compute the distance between features. The response variable is not a feature, and therefore needs to be excluded when computing a distance. This means we want all columns except column 15 where the response variable is stored.

To print out the first row in the test data set EXCLUDING column 15, we use:

test[1, -15]
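The same idea works inside the euclidean function. The sketch below uses two tiny made-up data sets, toy_train and toy_test, where the response happens to be in column 3 (in the phone data it is column 15).

```r
# The function from above
euclidean <- function(a, b){
  sqrt(sum((a - b)^2))
}

# Small made-up data sets (NOT the phone data); the response y is in column 3
toy_train <- data.frame(x1 = c(1, 4), x2 = c(2, 6), y = c(10, 50))
toy_test  <- data.frame(x1 = 1, x2 = 2, y = 12)

# Row 1 of the toy test data, features only
toy_test[1, -3]

# Distance between toy test row 1 and toy training row 2, features only
euclidean(toy_test[1, -3], toy_train[2, -3])
```

The negative sign drops the response column from both rows, so only the features enter the distance.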

Question 4

Compute the Euclidean distance between Row 1 in the test data and Row 4 in the test data using the euclidean(a , b) function. Show the output of the function as the answer to this question.

Question 5

Which row is nearer (more similar) to Row 4 in the test data: Row 2 in the training data or Row 5 in the training data? Explain your choice.

At this point, we can find out which rows in the training set are closer to a certain row in the test data. In KNN, to make a prediction for $Y$ for a row in the test data set, we find the Euclidean distance between that row in the test data set and every row in the training data set. This means we have 1500 distances. We choose the rows with the $K$ smallest distances as the nearest neighbors to the test row. We then average the $Y$ values of these rows to create the prediction $\hat{y}_{i}^{*}$ for row $i$ in the test data.
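The steps in this paragraph can be sketched in a few lines of R. The example below uses made-up data (toy_train and toy_test are invented, with the response in column 3), and it uses sapply, which simply repeats the same calculation for each training row. Treat it as an illustration of the idea rather than the code we will use in practice.

```r
# The function from above
euclidean <- function(a, b){
  sqrt(sum((a - b)^2))
}

# Made-up data (NOT the phone data): 2 features, response y in column 3
toy_train <- data.frame(
  x1 = c(1, 2, 3, 10, 11),
  x2 = c(1, 2, 3, 10, 11),
  y  = c(10, 20, 30, 100, 110)
)
toy_test <- data.frame(x1 = 2, x2 = 2.1, y = 22)

K <- 3

# Distance from test row 1 to every training row (features only)
distances <- sapply(1:nrow(toy_train), function(i){
  euclidean(toy_test[1, -3], toy_train[i, -3])
})
distances

# Row numbers of the K smallest distances: these are the nearest neighbors
neighbors <- order(distances)[1:K]
neighbors

# Prediction: average the response of the K nearest neighbors
y_hat <- mean(toy_train[neighbors, 3])
y_hat
```

With 5 training rows we get 5 distances, and the prediction is the average response of the 3 closest rows.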

Question 6

Rows 481, 952, and 1279 in the training data are the 3 nearest neighbors to Row 1 in the test data. What value of $Y$ would we predict for Row 1 in the test data using these three neighbors?

Hint: We have just learned how to pull rows and columns in R. Use this information!!

Question 7

What is the residual for Row 1 in the test data? In other words, how far off was our estimate using 3-NN?

And that’s the process!! Naturally, we would like it if there was a faster way to do this, and luckily there is!

KNN: In Practice

In practice, we use the caret package to perform KNN in R. Remember, we learned how to install packages in Lab 1. If you have not used caret before, go ahead and install it now!

Once the package is installed, remember we have to load it every time we use R:

library(caret)

Inside the caret package is a function called knnregTrain that allows us to perform KNN with Euclidean distance. The function is used as follows:

knn_yhat <- knnregTrain( trainingfeatures , testfeatures , response, k = number of neighbors)

where

trainingfeatures: replace with the features in your training data.
testfeatures: replace with the features in your test data.
response: replace with the response variable in your training data.
number of neighbors: replace with the number of neighbors $K$ you want. Example: 3, 5, 7, etc.

Running the code produces an object knn_yhat with predictions $\hat{Y}$ for all the rows in the test data set.
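Here is a sketch of the function on a small made-up data set, so you can see what goes in each slot. The data frames toy_train and toy_test are invented (response in column 3), and the code assumes caret is already installed.

```r
library(caret)  # assumes caret is already installed

# Made-up data (NOT the phone data): 2 features, response y in column 3
toy_train <- data.frame(
  x1 = c(1, 2, 3, 10, 11),
  x2 = c(1, 2, 3, 10, 11),
  y  = c(10, 20, 30, 100, 110)
)
toy_test <- data.frame(
  x1 = c(2, 10.4),
  x2 = c(2.1, 10.5),
  y  = c(22, 95)
)

# Training features, test features, training response, then K
toy_yhat <- knnregTrain(toy_train[ , -3], toy_test[ , -3], toy_train[ , 3], k = 3)

toy_yhat          # one prediction per row of the toy test data
length(toy_yhat)  # matches the number of rows in the toy test data
```

Note that the response column is removed from both sets of features, and that only the training response is given to the function.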

Question 8

How many predictions should be in knn_yhat? (Your answer must be a number)

Question 9

What is the prediction for $Y$ for the 5th row in the test data?

Question 10

Adapt the code above Question 8 to make predictions for all rows in the test data using KNN with $K = 3$. As an answer to this question, state AND interpret the test RMSE for KNN with $K =3$.

Hint: The test RMSE function is in the notes from last class, and it is in Lab 1.

At this point, we can discuss the predictive ability of KNN with $K = 3$. However, $K = 3$ was an arbitrary choice. What if we could get better predictive ability with a different choice of $K$?

$K$ in KNN is an example of a tuning parameter in statistical learning. Tuning parameters are values that we get to choose in a statistical learning model or method. We can tune a model to be better at prediction, association, or to balance between the two. KNN is not a method we can use for association tasks, but we can tune $K$ to optimize our predictive ability on test data.
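One way to tune $K$ is to compute the test RMSE for several candidate values and compare them. The sketch below does this on simulated data, not the phone data. The helper rmse, the candidate values in k_values, and the toy data are all assumptions made for illustration; the RMSE function from the class notes may be written differently. The code assumes caret is installed.

```r
library(caret)  # assumes caret is already installed

# Simulated data (NOT the phone data): y depends on x1 and x2 plus noise
set.seed(363)
n <- 120
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 5 * toy$x1 + 3 * toy$x2 + rnorm(n, sd = 0.5)

toy_train <- toy[1:80, ]
toy_test  <- toy[81:120, ]

# Test RMSE: square root of the average squared prediction error
rmse <- function(truth, predicted){
  sqrt(mean((truth - predicted)^2))
}

# Candidate values of K (hypothetical choices for this toy example)
k_values <- c(1, 3, 5, 10)

# For each K: predict the test rows, then compute the test RMSE
test_rmse <- sapply(k_values, function(k){
  y_hat <- knnregTrain(toy_train[ , -3], toy_test[ , -3], toy_train[ , 3], k = k)
  rmse(toy_test[ , 3], y_hat)
})

# One row per candidate K
data.frame(K = k_values, test_RMSE = test_rmse)

# The candidate K with the smallest test RMSE
k_values[which.min(test_rmse)]
```

The value of $K$ with the smallest test RMSE is the one we would recommend for prediction on this toy data.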

Question 11

Which choice of $K$ would you recommend we use for the best predictive ability on the test data: 3, 5, 10, 15, or 20? Explain your choice.

Question 11 may have seemed a little tedious, but this tuning process is something we use constantly in statistical learning. Because of this, in our next course we will learn about for loops, which will help us do these repetitive tasks a lot more quickly!

Comparing Approaches

We now have two different possible methods to choose from for predicting price: regression or KNN. In statistical learning, this is often the case. We have multiple methods that are possible to use, and we have to make a recommendation for a client about which method to use and why.

Question 12

Build a linear regression model to predict $Y$ = price. State and interpret the test RMSE you get using linear regression.

Question 13

State and interpret the test RMSE you get using KNN with your choice of $K$ from Question 11.

Question 14

Based on the test RMSE, would you recommend using regression or KNN to predict price?

We know that in addition to RMSE, we use plots to assess the predictive abilities of models. In these plots, we put the predictions for $Y$ on the x axis and the true values of $Y$ on the y axis. We then add a 0-1 line, meaning a line with intercept 0 and slope 1. Any points that lie on the line are perfectly predicted. The skeleton of the code you need to make such a plot is provided below.

ggplot( test, aes( x = , y = )) +
    geom_point() + 
    labs(x = "Predicted Price", y = "Actual Price", title = "Title", 
    caption = "Caption") + 
    geom_abline(intercept = 0, slope = 1, col = "blue",lwd = 2)

However, before we make the plot, we are going to learn one more R trick. We are about to make two plots, one comparing the predictions from regression to the true prices and a second comparing the predictions from KNN to the true prices. Whenever we have more than one plot, it is a good idea to stack plots to save space and make comparisons easier.

To stack two plots, we use the following structure.

NOTE: You will need to fill in the missing pieces to make the plots!!

# First Plot
g1 <- ggplot( test, aes( x = , y = )) +
    geom_point() + 
    labs(x = "Predicted Price", y = "Actual Price", title = "Title", 
    caption = "Caption") + 
    geom_abline(intercept = 0, slope = 1, col = "blue",lwd = 2)
 
# Second Plot    
g2 <- ggplot( test, aes( x = , y = )) +
    geom_point() + 
    labs(x = "Predicted Price", y = "Actual Price", title = "Title", 
    caption = "Caption") + 
    geom_abline(intercept = 0, slope = 1, col = "blue",lwd = 2)
    
# Stack the plots
gridExtra::grid.arrange( g1, g2 )
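To show what the filled-in structure might look like, here is a sketch on made-up numbers. The data frame toy_test and its columns price, yhat_lm, and yhat_knn are invented for illustration, and the code assumes ggplot2 and gridExtra are installed.

```r
library(ggplot2)  # assumes ggplot2 and gridExtra are already installed

# Made-up true prices and predictions (NOT the phone data)
toy_test <- data.frame(
  price    = c(100, 150, 200, 250, 300),
  yhat_lm  = c(110, 140, 210, 240, 290),
  yhat_knn = c(130, 120, 180, 280, 310)
)

# First plot: predictions on the x axis, truth on the y axis
g1 <- ggplot(toy_test, aes(x = yhat_lm, y = price)) +
    geom_point() +
    labs(x = "Predicted Price", y = "Actual Price",
    title = "Toy example: regression") +
    geom_abline(intercept = 0, slope = 1, col = "blue", lwd = 2)

# Second plot: same layout, different predictions
g2 <- ggplot(toy_test, aes(x = yhat_knn, y = price)) +
    geom_point() +
    labs(x = "Predicted Price", y = "Actual Price",
    title = "Toy example: KNN") +
    geom_abline(intercept = 0, slope = 1, col = "blue", lwd = 2)

# Stack the plots
gridExtra::grid.arrange(g1, g2)
```

In both plots, points close to the blue line are predictions that are close to the truth.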

Question 15

Adapt the code above to create (a) a plot showing the truth versus prediction for KNN with your choice of K from Question 11 and (b) a plot showing the truth versus prediction for regression with the plots stacked and appropriate labels added. Show your plot!

Question 16

Based on everything we have done thus far, which predictive approach (KNN or linear regression) would you recommend to predict price? Briefly explain your choice.

Before you submit

A few last steps before we knit, and then you will be done with Lab 2!

1. Find the top of this file (the little tab), and look under it. You should see something with ABC and a check mark. This is for checking spelling! Click this to check your spelling before you do a final knit and submit.
1. If you are working with a partner, make sure their name and yours are at the top of the file.
1. Look at the very first chunk in your Markdown file. You should see something like:

knitr::opts_chunk$set(echo = TRUE)

Change this to:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

1. You must submit a PDF or HTML file. If you submit any other file type, it cannot be graded. Let me know if you have any questions.

Once you’ve done this, knit your file. This will create the PDF or HTML you need to submit. If you get stuck, let Dr. Dalzell know!

References

This work was created by Dr. Nicole Dalzell, Associate Teaching Professor at Wake Forest University, and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 December 18.

The Data

The data has been adapted from the source below. Please note that the data sets have been cleaned and a new feature, price, has been added from the original data set.

Citation: Abhishek, Sharma. (2017). Mobile Price Classification, Version 1. Retrieved December 20, 2025 from https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification