In this lab, we are going to use the knnreg function from the caret package. We will first review supervised learning, then apply KNN regression and KNN classification to the Breast Cancer Wisconsin data.
Split the data into a training set and a test set. Use different error metrics to describe model performance.
Fit a KNN regression model on the training set and calculate the test MSE on the test set. Use a for loop to choose the optimal \(K\) based on the test MSE.
Write your own KNN classification algorithm.
library(caret)
library(MASS)
library(ggplot2)
library(ggfortify)
library(GGally)
library(readr)
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. Regression is one popular supervised learning method. In this lab, we will apply KNN regression and KNN classification to the breast cancer data. More information on the data is available [here](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)).
Import “BreastCancerData.csv” as mydat
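With readr already loaded, one way to import the file (assuming it sits in your working directory; adjust the path otherwise) is:

mydat <- read_csv("BreastCancerData.csv")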
Use ggpairs for visualization. Here, we focus on the variables diagnosis, radius_mean, texture_mean, and compactness_mean.
mydat <- mydat[,c("diagnosis","radius_mean","texture_mean","compactness_mean")]
str(mydat)
## tibble [569 × 4] (S3: tbl_df/tbl/data.frame)
## $ diagnosis : chr [1:569] "M" "M" "M" "M" ...
## $ radius_mean : num [1:569] 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num [1:569] 10.4 17.8 21.2 20.4 14.3 ...
## $ compactness_mean: num [1:569] 0.2776 0.0786 0.1599 0.2839 0.1328 ...
ggpairs(mydat)
Select 70% of the data as the training set and the remaining 30% as the test set. Here, you may try different functions; one option is shown below, and an alternative follows the code.
#---- training and test
set.seed(10)
n = nrow(mydat)
index <- sample(1:n, size = 0.7 * n)
training <- mydat[index,]
test <- mydat[-index,]
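As an alternative to sample, the caret package (already loaded) provides createDataPartition, which draws a stratified split; a sketch with illustrative object names:

# stratified 70/30 split on the outcome (quantile groups for a numeric y)
index2 <- createDataPartition(mydat$compactness_mean, p = 0.7, list = FALSE)
training2 <- mydat[index2,]
test2 <- mydat[-index2,]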
Now, suppose we want to fit a model to predict a tumor's compactness from its radius and texture. Let's fit a linear regression on the training data. Here, the autoplot function in the ggfortify package will automatically print the diagnostic plots for linear regression in ggplot. Let's name this linear model mod1.
#---- Linear Regression on training set
mod1 <- lm(compactness_mean ~ radius_mean + texture_mean,data = training)
summary(mod1)
##
## Call:
## lm(formula = compactness_mean ~ radius_mean + texture_mean, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.076924 -0.034732 -0.007097 0.024522 0.195094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0037374 0.0119845 -0.312 0.7553
## radius_mean 0.0063791 0.0006807 9.372 <2e-16 ***
## texture_mean 0.0009663 0.0005512 1.753 0.0804 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04533 on 395 degrees of freedom
## Multiple R-squared: 0.224, Adjusted R-squared: 0.22
## F-statistic: 57 on 2 and 395 DF, p-value: < 2.2e-16
autoplot(mod1) # diagnostic plot
To evaluate the model performance, we pretend the response values in the test set are unknown, and get the predictions using the fitted linear model (predict(model, new predictors)). Be careful not to include the response variable when doing prediction.
#---- Calculate the MSE on test set
# predictors in the test set, used for prediction
new <- data.frame(test[,c(2,3)]) # columns 2-3 are radius_mean and texture_mean
pred1 <- predict(mod1,new)
Then calculate the MSE (mean squared error) on the test set. Recall the MSE is defined as \[
MSE = \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2
\] Use the function keyword to create your own function in R. See Chapter 3.6.7 for more information on writing functions.
# write a function
mse=function(actual, predicted){
return(mean((actual-predicted)^2))
}
# use the function on test set
mse(test$compactness_mean,pred1)
## [1] 0.00211997
The test MSE for linear regression is 0.00211997.
Next, let's fit the same prediction task, compactness from radius and texture, using KNN regression.
KNN regression is one type of nonparametric method. See the caret documentation for more information on the knnreg function.
The call is knnreg(train_x, train_y, k). The first input is the predictors in the training set, the second input is the response variable in the training set; then specify the value of \(K\). Notice there are 398 observations in the training set, so \(K\) can only take values from 1 to 398. Try \(K=1,20,40,100,150,398\).
training_x <- training[,c(2,3)]
training_y <- training$compactness_mean
mod2_1 <- knnreg(training_x,training_y,k=1)
mod2_20 <- knnreg(training_x,training_y,k=20)
mod2_40 <- knnreg(training_x,training_y,k=40)
mod2_100 <- knnreg(training_x,training_y,k=100)
mod2_150 <- knnreg(training_x,training_y,k=150)
mod2_398 <- knnreg(training_x,training_y,k=398)
For prediction, we can still use the predict function, as we did for linear regression: predict(model, new predictors). For example,
# predictors in the test set, used for prediction
new <- as.data.frame(test[,c(2,3)])
pred2_1 = predict(mod2_1, new)
Then we can calculate the test MSE.
mse(actual = test$compactness_mean, predicted = pred2_1)
## [1] 0.004130956
The test MSE is 0.004130956, while the MSE from linear regression is 0.00211997. In this case, linear regression outperforms KNN regression when \(K=1\). Let's compute the test MSE for all \(K=20,40,100,150,398\).
# K=20
pred2_20 = predict(mod2_20, new)
mse(actual = test$compactness_mean, predicted = pred2_20)
## [1] 0.002309002
# K=40
pred2_40 = predict(mod2_40, new)
mse(actual = test$compactness_mean, predicted = pred2_40)
## [1] 0.002265126
# K=100
pred2_100 = predict(mod2_100, new)
mse(actual = test$compactness_mean, predicted = pred2_100)
## [1] 0.002365433
# K=150
pred2_150 = predict(mod2_150, new)
mse(actual = test$compactness_mean, predicted = pred2_150)
## [1] 0.002452418
# K=398
pred2_398 = predict(mod2_398, new)
mse(actual = test$compactness_mean, predicted = pred2_398)
## [1] 0.003149367
Remember the U-shaped relationship between the test MSE and \(1/K\) (see Figure 3.19 in the textbook). Based on the results above, the optimal \(K\), that is, the \(K\) which minimizes the test MSE, should fall somewhere between 20 and 100. Now, let's introduce the for loop to calculate the test MSE for \(K=1,21,41,61,81,...,381\).
A for loop has the following structure:
for (val in sequence) { statement }
Here, sequence is a vector and val takes on each of its values during the loop. In each iteration, statement is evaluated. As a quick example, to calculate \(n=1+2+3+...+99+100\), we have
n=0
for(i in 1:100){n=n+i}
n
## [1] 5050
Now we are going to calculate the test MSE for each \(K\) in the grid.
# create an empty vector to store the test MSE
test_MSE <- c()
K = seq(from = 1, to = 398, by = 20)
# for each K[i], fit the KNN model, find the prediction,
# calculate the MSE, and store it in the ith element of the vector test_MSE
for(i in 1:length(K)){
predicted <- predict(knnreg(training_x,training_y,k=K[i]), new)
test_MSE[i] <- mse(test$compactness_mean, predicted)
}
plot(K,test_MSE, main = "Test MSE for Different K")
which.min(test_MSE) # index of the K with the minimum test MSE
## [1] 3
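Note that which.min returns the position within the vector test_MSE, not the value of \(K\) itself; index into the grid to recover it:

K[which.min(test_MSE)] # the K that minimizes the test MSE

## [1] 41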
Since K[3] = 41, the KNN regression achieves the smallest test MSE when \(K=41\).
Now, suppose we want to use radius for tumor diagnosis and apply KNN classification. Use the function keyword to write your own KNN classification algorithm called my.knn. In my.knn, use **Euclidean distance** to find the nearest k observations. To make our job easier, you only need to handle the situation where the response variable is binary and there is only one predictor. You don't need to use the predict function.
Consider the following inputs:
Independent variable from training set
Dependent variable from training set
The value of K
The new observation from the test set
The output of my.knn will be either “benign” or “malignant”. In your algorithm, you may want to follow these steps (a minimal sketch is given after the list):
Define the distance
Calculate the distance between the new observation and all the observations in the training set
Choose the nearest K observations in the training set
Calculate the frequency of each category among K observations
Return the category with the higher frequency
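For reference, here is a minimal sketch of one way to organize these steps. It assumes the diagnosis codes are “B”/“M” as shown by str(mydat), and the signature (train_x, train_y, k, new_x) is just an illustration; your own implementation may differ.

my.knn <- function(train_x, train_y, k, new_x){
  # Steps 1-2: Euclidean distance between new_x and every training observation
  # (with a single predictor this is just the absolute difference)
  d <- sqrt((train_x - new_x)^2)
  # Step 3: indices of the nearest k observations in the training set
  nearest <- order(d)[1:k]
  # Step 4: frequency of each category among the k neighbors
  freq <- table(train_y[nearest])
  # Step 5: return the category with the higher frequency
  lab <- names(freq)[which.max(freq)]
  ifelse(lab == "M", "malignant", "benign")
}

Wrapping a call such as system.time(my.knn(training$radius_mean, training$diagnosis, k = 5, test$radius_mean[1])) (the choice k = 5 is arbitrary here) is one simple way to measure the running time.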
In this lab, you don’t need to write the whole statistical report. You can skip the “abstract” and “introduction” sections and start from the “method” section. In your report, make sure to answer the following questions:
[4 pts] With set.seed(1), split the data as 90% training and 10% test. Use radius_mean as the predictor and compactness_mean as the response variable. Fit a linear regression model and a KNN regression model. Compare the test MSEs. Which one performs better?
[6 pts] Follow the steps to write your own my.knn function. Use the same training and test set as in Question 1. Fit the KNN classification model on the training set, then pick one observation from the test set to test your algorithm. Report the running time. See [here](https://www.r-bloggers.com/2017/05/5-ways-to-measure-running-time-of-r-code/) for 5 ways to measure the running time of R code.
(Bonus) [2 pts] In Question 1, after fitting a linear regression and a KNN regression, can you provide a comparison by plotting them on the same graph? For example:
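One possible starting point, assuming mod1 is the single-predictor linear model from Question 1 and knn_fit is the corresponding knnreg fit (both names and the grid below are illustrative):

# grid of radius values spanning the observed range
grid <- data.frame(radius_mean = seq(min(mydat$radius_mean),
                                     max(mydat$radius_mean), length.out = 200))
plot(training$radius_mean, training$compactness_mean,
     xlab = "radius_mean", ylab = "compactness_mean")
lines(grid$radius_mean, predict(mod1, grid), col = "blue")   # linear fit
lines(grid$radius_mean, predict(knn_fit, grid), col = "red") # KNN fit
legend("topleft", legend = c("linear", "KNN"), col = c("blue", "red"), lty = 1)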