Info

Objectives

By the end of this activity, you should

  1. Be familiar with basic plotting in ggplot2 (part of tidyverse)

  2. Implement \(k\)-fold cross-validation to estimate a test error by yourself

  3. Use built-in cross-validation in the library caret to estimate a test error.

  4. Choose the optimal value of a tuning parameter with \(k\)-fold cross-validation in the library caret.

  5. Review functions lapply and sapply.

Preliminaries

You should be familiar with basic regression and classification models — linear regression, logistic regression, KNN. These topics were covered in lab 2. You should also be familiar with the pipe operator and splitting data into training and test sets (lab 1).
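
As a one-line reminder (a minimal sketch, assuming library(tidyverse) has been loaded as below), the pipe passes its left-hand side as the first argument of the function on its right:

c(4, 9, 16) %>% sqrt() %>% mean()   # equivalent to mean(sqrt(c(4, 9, 16))), i.e. 3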

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer - type it right in this document; others ask you to modify R code - modify the code here. You can discuss your work with other students and with the instructors.

Please fill in the survey at the end - it will only take a couple of minutes.

Libraries

library(tidyverse) # for manipulation with data
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret) # for machine learning, including KNN
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(stargazer) # for printing regression tables
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

Data

Today we will work with a breast cancer dataset, predicting whether a tumor is benign or malignant from its measurements.

Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

B <- read.csv("breast_cancer.csv")
head(B)

The variables are

names(B)
##  [1] "id"                      "diagnosis"              
##  [3] "radius_mean"             "texture_mean"           
##  [5] "perimeter_mean"          "area_mean"              
##  [7] "smoothness_mean"         "compactness_mean"       
##  [9] "concavity_mean"          "concave.points_mean"    
## [11] "symmetry_mean"           "fractal_dimension_mean" 
## [13] "radius_se"               "texture_se"             
## [15] "perimeter_se"            "area_se"                
## [17] "smoothness_se"           "compactness_se"         
## [19] "concavity_se"            "concave.points_se"      
## [21] "symmetry_se"             "fractal_dimension_se"   
## [23] "radius_worst"            "texture_worst"          
## [25] "perimeter_worst"         "area_worst"             
## [27] "smoothness_worst"        "compactness_worst"      
## [29] "concavity_worst"         "concave.points_worst"   
## [31] "symmetry_worst"          "fractal_dimension_worst"
## [33] "X"

The response variable is diagnosis. Below are its values and their relative frequencies:

round(table(B$diagnosis) / nrow(B) , 2)
## 
##    B    M 
## 0.63 0.37

We see that the dataset is more-or-less balanced.

Now we will split the data into 400 training and 169 test observations. For the sake of this lab session, we will keep just three variables: the response diagnosis and the predictors radius_mean and texture_mean. We will also convert the variable diagnosis from character to factor because the function glm requires a factor (rather than character) response.

set.seed(42)
B <- mutate(B, diagnosis = as.factor(diagnosis))
ind <- sample(1:nrow(B), 400)
train_data <- B %>% slice(ind) %>% select(diagnosis, radius_mean, texture_mean)
test_data <- B %>% slice(-ind) %>% select(diagnosis, radius_mean, texture_mean)
cat("Training data dimensions =", dim(train_data), "\n")
## Training data dimensions = 400 3
cat("Test data dimensions =", dim(test_data), "\n")
## Test data dimensions = 169 3

Plotting

The plotting library ggplot2 is a part of the tidyverse ecosystem. It is one of the nicest plotting systems available in R.

Remark: plotly, with its interactive plots, is arguably even better than ggplot2. However, plotly does not work in static PDF reports. You are encouraged to explore interactive plots with plotly as a part of your project. If you do construct interactive plots for your project, you can submit your project report as HTML rather than PDF.
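
For example, a minimal sketch of making a ggplot2 figure interactive (this assumes the plotly package is installed; it is not used elsewhere in this lab, so the code is left commented out):

# library(plotly)
# p <- ggplot(data = B, aes(x = radius_mean, y = texture_mean, colour = diagnosis)) +
#   geom_point()
# ggplotly(p)   # renders the same scatterplot interactively (HTML output only)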

Scatterplot

In ggplot2, we always load our data into the function ggplot and then tell it how to visualize and modify the plot by adding extra functions (layers). It seems strange at first, but you will soon get used to it.

Here is a simple scatterplot:

ggplot(data = B, aes(x = radius_mean, y = texture_mean)) +
  geom_point()

It would be nice to colour our points according to diagnosis. Here is how we do it:

ggplot(data = B, aes(x = radius_mean, y = texture_mean, colour = diagnosis)) +
  geom_point()

We can add more stuff to such a plot, like regression lines (regression lines are not meaningful here, it’s just an example) or a plot title:

ggplot(data = B, aes(x = radius_mean, y = texture_mean, colour = diagnosis)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Breast cancer data")
## `geom_smooth()` using formula 'y ~ x'

Whisker and violin charts

If we want to show the distribution of radius_mean separately for benign and malignant tumors, we can begin by setting x in aes to diagnosis:

ggplot(data = B, aes(x = diagnosis, y = radius_mean)) +
  geom_point()

However, this plot is hard to read because the points overlap. To remedy this, we can make the points transparent as follows:

ggplot(data = B, aes(x = diagnosis, y = radius_mean)) +
  geom_point(alpha = 0.1)

An even better way to solve the issue is to use the boxplot geom. It produces a so-called box-and-whisker plot:

ggplot(data = B, aes(x = diagnosis, y = radius_mean)) +
  geom_boxplot()

It shows the first quartile, median, and third quartile, whiskers reaching to the most extreme non-outlier observations, and individual outliers, separately for benign and malignant tumors. A whisker chart can be coloured too:

ggplot(data = B, aes(x = diagnosis, y = radius_mean, colour = diagnosis)) +
  geom_boxplot()

Question 1

Change geom_boxplot to geom_violin to produce a violin plot. At the same time, change colour to fill for better contrast.

ggplot(data = B, aes(x = diagnosis, y = radius_mean, fill = diagnosis)) +
  geom_violin()

Histogram

To plot a histogram of, say, radius_mean, we just need to pass the x parameter to aes:

ggplot(data = B, aes(x = radius_mean)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can specify colours and the number of bins:

ggplot(data = B, aes(x = radius_mean)) +
  geom_histogram(colour = 'black', fill = 'orange', bins = 12)

We can also show histograms of radius_mean for benign and malignant tumors separately on the same plot:

ggplot(data = B, aes(x = radius_mean, group = diagnosis, fill = diagnosis)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Instead of a histogram, we can plot the estimated density:

ggplot(data = B, aes(x = radius_mean, group = diagnosis, fill = diagnosis)) +
  geom_density(alpha = 0.5)

Cross-validation: hands-on

Fold generation

Let \(N\) be the number of observations in our dataset and let \(k\) be the number of folds that we need to create. Our folds are just a list of numbers \[ 1,1,\dots,1,2,2,\dots,2,\dots,k,\dots,k \] where each number appears approximately the same number of times. Later we will shuffle this list.

To create folds, we begin with the vector of indices from 1 to \(N\):

N <- 20
k <- 4
1:N
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Fold numbers should be between 1 and \(k\). So let us divide each entry by \(N/k\):

(1:N) / (N/k)
##  [1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
## [20] 4.0

Applying the ceiling function will now yield what we want:

((1:N) / (N/k)) %>% ceiling
##  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

Note that it works even when \(N\) is not divisible by \(k\):

N <- 12
k <- 5
((1:N) / (N/k)) %>% ceiling
##  [1] 1 1 2 2 3 3 3 4 4 5 5 5

Now we need to shuffle this vector. This can be done with the function sample from the base package, by drawing a sample of size \(N\) from the vector without replacement:

((1:N) / (N/k)) %>% ceiling %>% sample(N, replace = FALSE)
##  [1] 3 3 2 4 4 3 1 5 5 5 1 2

Question 2

Come up with an alternative method to create folds based on the function rep:

rep(1:k, 5)
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

Your method should work even when \(N\) is not divisible by \(k\).

rep(1:k, ceiling(N/k))[1:N] %>% sample(N, replace = FALSE)
##  [1] 2 1 4 2 3 1 1 4 3 2 5 5

Adding folds to data

Now we will create a function that will add folds to a given dataset and we will apply it to create 5 folds for our training data.

k <- 5

set.seed(128)

add_folds_to_data <- function(dataset, k) {
  N <- nrow(dataset)
  folds <- ceiling((1:N)/(N/k))
  dataset %>% 
    mutate(fold = sample(folds, N))
}

train_data <- train_data %>%
  add_folds_to_data(k)

head(train_data)

To make sure that it worked, we will print the number of observations in each fold:

table(train_data$fold)
## 
##  1  2  3  4  5 
## 80 80 80 80 80
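
As an aside, the library caret has a built-in helper for fold creation. A minimal sketch (not used in the rest of this lab):

# createFolds returns a list with one vector of row indices per fold
folds_caret <- createFolds(train_data$diagnosis, k = 5)
sapply(folds_caret, length)   # roughly 80 observations in each fold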

Logistic polynomial regression

We will train a logistic polynomial regression and report its test error. Below are the coefficients of the model.

deg <- 2

mod_log <- glm(diagnosis ~ polym(radius_mean, texture_mean, degree = deg, raw = TRUE),
               family = "binomial", data = train_data)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
stargazer(mod_log, type = "text")
## 
## =========================================================================================
##                                                                   Dependent variable:    
##                                                               ---------------------------
##                                                                        diagnosis         
## -----------------------------------------------------------------------------------------
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.0           -2.215           
##                                                                         (1.515)          
##                                                                                          
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)2.0           0.091*           
##                                                                         (0.049)          
##                                                                                          
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.1            0.680           
##                                                                         (0.471)          
##                                                                                          
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.1            0.039           
##                                                                         (0.027)          
##                                                                                          
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.2          -0.025***         
##                                                                         (0.007)          
##                                                                                          
## Constant                                                                -2.074           
##                                                                        (12.145)          
##                                                                                          
## -----------------------------------------------------------------------------------------
## Observations                                                              400            
## Log Likelihood                                                          -96.296          
## Akaike Inf. Crit.                                                       204.593          
## =========================================================================================
## Note:                                                         *p<0.1; **p<0.05; ***p<0.01

And now we will construct predictions and report the confusion matrix:

probs <- predict(mod_log, test_data, type = "response")
preds <- ifelse(probs > 0.5, "M", "B") %>% as.factor
confusionMatrix(preds, test_data$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 102  12
##          M   8  47
##                                           
##                Accuracy : 0.8817          
##                  95% CI : (0.8232, 0.9262)
##     No Information Rate : 0.6509          
##     P-Value [Acc > NIR] : 7.189e-12       
##                                           
##                   Kappa : 0.7354          
##                                           
##  Mcnemar's Test P-Value : 0.5023          
##                                           
##             Sensitivity : 0.9273          
##             Specificity : 0.7966          
##          Pos Pred Value : 0.8947          
##          Neg Pred Value : 0.8545          
##              Prevalence : 0.6509          
##          Detection Rate : 0.6036          
##    Detection Prevalence : 0.6746          
##       Balanced Accuracy : 0.8619          
##                                           
##        'Positive' Class : B               
## 

Since cross-validation involves training many models, it will be convenient to write helper functions for training and validating such models of different degrees on different datasets. Below we do this and report the test errors for models of degrees 1 to 6:

train_poly_logistic_model <- function(deg, dataset = train_data) {
  glm(diagnosis ~ polym(radius_mean, texture_mean, degree = deg, raw = TRUE),
      family = "binomial", data = dataset)
}

error_rate <- function(mod_log, dataset = test_data) {
  probs <- predict(mod_log, dataset, type = "response")
  preds <- ifelse(probs > 0.5, "M", "B") %>% as.factor
  err <- 1 - confusionMatrix(preds, dataset$diagnosis)$overall['Accuracy'] 
  unname(err)
}

train_and_validate <- function(deg, train = train_data, test = test_data) {
  train_poly_logistic_model(deg, train) %>%
    error_rate(test)
}

vec_test_errors <- sapply(1:6, train_and_validate)
names(vec_test_errors) <- 1:6
vec_test_errors
##          1          2          3          4          5          6 
## 0.11242604 0.11834320 0.09467456 0.11242604 0.21301775 0.11834320
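
To pick out the degree with the smallest test error programmatically, a one-line sketch:

which.min(vec_test_errors)   # degree 3 gives the smallest test error here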

Cross-validation for a single model

Here we will calculate the cross-validation error for a polynomial logistic regression of, say, degree 2. This means that we need to train 5 models, where model \(i\) is trained on all folds except fold \(i\), and we report its error on fold \(i\).
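
The code below relies on lapply and sapply (objective 5). As a quick reminder, a minimal sketch: lapply always returns a list, while sapply simplifies the result to a vector or matrix when possible.

lapply(1:3, function(i) i^2)   # a list of length 3 containing 1, 4, 9
sapply(1:3, function(i) i^2)   # the numeric vector 1 4 9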

list_of_models <- 
  lapply(1:k, function(i) 
    train_poly_logistic_model(2, train_data %>% filter(fold != i)))

list_of_error_rates <- 
  sapply(1:k, function(i) 
    error_rate(list_of_models[[i]], train_data %>% filter(fold == i)))

data.frame(fold = 1:k,
           error = list_of_error_rates)

The average error, i.e., the cross-validation error, is

mean(list_of_error_rates)
## [1] 0.1025

Below are all models generated in the cross-validation process:

stargazer(list_of_models, type = "text")
## 
## ==============================================================================================================
##                                                                             Dependent variable:               
##                                                               ------------------------------------------------
##                                                                                  diagnosis                    
##                                                                  (1)       (2)       (3)      (4)       (5)   
## --------------------------------------------------------------------------------------------------------------
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.0  -1.605    -1.496   -4.603***  -1.274   -2.523  
##                                                                (1.655)   (1.762)   (1.602)  (1.733)   (1.565) 
##                                                                                                               
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)2.0   0.044     0.081   0.186***   0.064    0.093*  
##                                                                (0.054)   (0.057)   (0.054)  (0.053)   (0.050) 
##                                                                                                               
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.1   0.343    0.910*    1.033*    0.701     0.444  
##                                                                (0.521)   (0.475)   (0.535)  (0.597)   (0.538) 
##                                                                                                               
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.1  0.074**    0.018     0.045    0.027    0.051*  
##                                                                (0.035)   (0.027)   (0.031)  (0.031)   (0.031) 
##                                                                                                               
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.2 -0.028*** -0.023*** -0.034*** -0.021** -0.023***
##                                                                (0.009)   (0.008)   (0.010)  (0.008)   (0.008) 
##                                                                                                               
## Constant                                                       -2.363    -9.679     8.302    -8.131    2.706  
##                                                               (13.291)  (13.771)  (12.060)  (15.190) (12.948) 
##                                                                                                               
## --------------------------------------------------------------------------------------------------------------
## Observations                                                     320       320       320      320       320   
## Log Likelihood                                                 -81.671   -73.870   -66.366  -82.294   -76.364 
## Akaike Inf. Crit.                                              175.341   159.739   144.731  176.587   164.728 
## ==============================================================================================================
## Note:                                                                              *p<0.1; **p<0.05; ***p<0.01

Question 3

Note that while performing cross-validation, we saved all 5 models that were a part of the cross-validation process. In practice, this is not needed because in any case, the actual model that goes into production is going to be trained on the whole training data. Cross-validation is used just to estimate the test error.

Rewrite the code above so that a single pipeline prints the mean error rate over all 5 models:

1:k %>% sapply(
  function(i) train_and_validate(2, train_data %>% filter(fold != i),
                                 train_data %>% filter(fold == i))
) %>% mean
## [1] 0.1025

Cross-validation put together

Now we will make a function that performs cross-validation for a given degree and reports both the cross-validation error and the test error. First, we report these errors for the logistic polynomial regression of degree 2.

cv_and_test_error <- function(d, K = k) {
  # K is the number of folds, the default value is whatever is specified above
  # when we assigned k
  cv_error <- 1:K %>% sapply(
    function(i) train_and_validate(d, train_data %>% filter(fold != i),
                                 train_data %>% filter(fold == i))
    ) %>% mean
  
  test_error <- train_and_validate(d, train_data, test_data)
  
  c(deg = d, CV = cv_error, TEST = test_error)
}

cv_and_test_error(2)
##       deg        CV      TEST 
## 2.0000000 0.1025000 0.1183432

And now we will make a table of cross-validation and test errors for all degrees from 1 to 12:

all_degrees <- 1:12
final_errors <- sapply(all_degrees, cv_and_test_error) %>% t %>%
  as.data.frame

final_errors

Note that while the cross-validation and test errors are different, the optimal degree is the same for both. Also note that we transposed the output of sapply and converted it to a data frame: by default, sapply returns a matrix built column-by-column, with one column per degree, whereas we want one row per degree.
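
A small sketch of this behaviour of sapply:

sapply(1:3, function(d) c(deg = d, sq = d^2))        # each call becomes a column: a 2 x 3 matrix
sapply(1:3, function(d) c(deg = d, sq = d^2)) %>% t  # transposed: one row per value of d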

Visualizing the errors

We can use the familiar tools to plot cross-validation error rates for different degrees:

ggplot(data = final_errors, aes(x = deg, y = CV)) +
  geom_point()

However, it will look better if we plot lines instead:

ggplot(data = final_errors, aes(x = deg, y = CV)) +
  geom_line()

Question 4

Plot test errors with points joined by a line

ggplot(data = final_errors, aes(x = deg, y = TEST)) +
  geom_line() + geom_point()

Built-in cross-validation

Single model

Of course, in practice we do not implement cross-validation from scratch. We can use the function train from the library caret for this purpose.

set.seed(128)
mod_log_caret <- train(diagnosis ~ polym(radius_mean, texture_mean, degree = 2, raw = TRUE),
                  train_data, method = 'glm', 
                  family = binomial, 
                  trControl = trainControl(method = 'cv', number = k))

mod_log_caret
## Generalized Linear Model 
## 
## 400 samples
##   2 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 320, 320, 319, 320, 321 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9024961  0.7909036

The cross-validation error of this model is

1 - mod_log_caret$results$Accuracy
## [1] 0.09750391

A nice feature of the library caret is that the function predict, applied to a model trained by train, returns class labels directly, so we do not have to convert probabilities to labels manually:

mod_log_caret %>% predict(test_data) %>% head
## [1] B M M B B M
## Levels: B M
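
If we do need the probabilities, predict for a caret model also accepts type = "prob"; a minimal sketch:

mod_log_caret %>% predict(test_data, type = "prob") %>% head
# returns a data frame with one column of class probabilities per class (B and M)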

Here is the confusion matrix of our model on the test data:

mod_log_caret %>%
  predict(test_data) %>%
  confusionMatrix(test_data$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 102  12
##          M   8  47
##                                           
##                Accuracy : 0.8817          
##                  95% CI : (0.8232, 0.9262)
##     No Information Rate : 0.6509          
##     P-Value [Acc > NIR] : 7.189e-12       
##                                           
##                   Kappa : 0.7354          
##                                           
##  Mcnemar's Test P-Value : 0.5023          
##                                           
##             Sensitivity : 0.9273          
##             Specificity : 0.7966          
##          Pos Pred Value : 0.8947          
##          Neg Pred Value : 0.8545          
##              Prevalence : 0.6509          
##          Detection Rate : 0.6036          
##    Detection Prevalence : 0.6746          
##       Balanced Accuracy : 0.8619          
##                                           
##        'Positive' Class : B               
## 

Cross-validation to find the optimal degree

We will now use cross-validation to pick the optimal degree. Specifically, we will write a function that will

  1. Take the degree of a polynomial and the number of folds as inputs

  2. Train a logistic polynomial regression of the given degree with the built-in function train from the library caret.

  3. Return the cross-validation error of the model

Below we write such a function and use it to find the cross-validation errors for models of degrees from 1 to 12:

cv_caret_error <- function(d, K = k, dataset = train_data) {
  
  # The following two lines of code are somewhat obscure
  # Please talk to Fedor if you want to understand them
  # Otherwise, you can just reuse them
  template <- "diagnosis ~ polym(radius_mean, texture_mean, degree = DEG, raw = TRUE)"
  polynomial_formula <- template %>% gsub("DEG", d, .) %>% as.formula
  ##### There may be a better way to do it!!! ####
    
  mod <-
    train(polynomial_formula,
                  dataset, method = 'glm',
                  family = binomial,
                  trControl = trainControl(method = 'cv', number = K))
  1 - mod$results$Accuracy
}

caret_errors <- data.frame(
  deg = all_degrees,
  error_rate = sapply(all_degrees, cv_caret_error)
)

caret_errors
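
Regarding the "better way" flagged in the comments above, one possible alternative (a sketch, shown here for degree 2) is to build the formula string with sprintf and convert it with as.formula:

polynomial_formula <- as.formula(sprintf(
  "diagnosis ~ polym(radius_mean, texture_mean, degree = %d, raw = TRUE)", 2))
polynomial_formula
# inside cv_caret_error, the literal 2 would be replaced by the argument d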

Question 5

Rewrite the function cv_caret_error so that it returns a vector with 3 entries: the degree of the polynomial, the cross-validation error, and the test error. Then use it to produce a table of errors for different degrees, analogous to the one we calculated when we implemented cross-validation from scratch.

caret_errors <- function(d, K = k, dataset = train_data) {
  
  # The following two lines of code are somewhat obscure
  # Please talk to Fedor if you want to understand them
  # Otherwise, you can just reuse them
  template <- "diagnosis ~ polym(radius_mean, texture_mean, degree = DEG, raw = TRUE)"
  polynomial_formula <- template %>% gsub("DEG", d, .) %>% as.formula
  ##### There may be a better way to do it!!! ####
    
  mod <-
    train(polynomial_formula,
                  dataset, method = 'glm',
                  family = binomial,
                  trControl = trainControl(method = 'cv', number = K))
  
  cv_err <- 1 - mod$results$Accuracy
  
  test_cm <- mod %>%
    predict(test_data) %>%
    confusionMatrix(test_data$diagnosis)
  
  test_err <- unname(1 - test_cm$overall['Accuracy'])
  
  c(deg = d, CV = cv_err, TEST = test_err)
}


sapply(all_degrees, caret_errors) %>% t %>% as.data.frame()

Tuning parameter

When we selected the degree of a polynomial logistic regression that yields the smallest cross-validation error, we constructed a family of models \[ L(1), L(2),\dots,L(12), \] where \(L(d)\) is polynomial logistic regression of degree \(d\), i.e., log-odds of the probability is modelled as a polynomial of degree \(d\) in our two variables. For instance, \(L(2)\) is the model \[ \log\frac{P}{1-P}=b_0+b_1X_1+b_2X_2+b_3X_1^2+b_4X_1X_2+b_5X_2^2 \] etc.
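
As a quick check (a small sketch), polym with raw = TRUE generates exactly these monomials, and its column names encode the powers of the two variables:

polym(B$radius_mean, B$texture_mean, degree = 2, raw = TRUE) %>% head(2)
# the columns "1.0", "2.0", "0.1", "1.1", "0.2" correspond to
# X1, X1^2, X2, X1*X2, X2^2 - the same names appear in the regression tables above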

A large part of the difficulty in implementing this process was that we needed to apply the cross-validation procedure explicitly to every degree.

The good news is that many models implemented in caret have an explicit tuning parameter, or hyperparameter, and we can compute the cross-validation errors and pick the best model, i.e. the one minimizing the cross-validation error, in just one call of the function train. For example, here is how we do it for KNN, where K is the hyperparameter (unfortunately, this simple technique is not available for logistic regression):

mod_caret_knn <- train(diagnosis ~ radius_mean + texture_mean,
  train_data, 
  method = 'knn', 
  preProcess = c("center","scale"), tuneLength = 20,
  trControl = trainControl(method = 'cv', number = k))

mod_caret_knn
## k-Nearest Neighbors 
## 
## 400 samples
##   2 predictor
##   2 classes: 'B', 'M' 
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 320, 320, 321, 319, 320 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.8698972  0.7195185
##    7  0.8824289  0.7460886
##    9  0.8874922  0.7565798
##   11  0.8849922  0.7518760
##   13  0.8849605  0.7517160
##   15  0.8898980  0.7620522
##   17  0.8899605  0.7627737
##   19  0.8949922  0.7735728
##   21  0.8975238  0.7792575
##   23  0.8949605  0.7729157
##   25  0.8975238  0.7786323
##   27  0.9000238  0.7842236
##   29  0.9025238  0.7893252
##   31  0.9025238  0.7893252
##   33  0.9050555  0.7944510
##   35  0.9125246  0.8109089
##   37  0.9074930  0.8013067
##   39  0.8973980  0.7784069
##   41  0.8973664  0.7785244
##   43  0.8973664  0.7785244
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 35.
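
Instead of tuneLength, which lets caret choose the candidate values of K itself, we can supply an explicit grid through the tuneGrid argument. A minimal sketch (the grid of odd values below is our own choice):

mod_caret_knn_grid <- train(diagnosis ~ radius_mean + texture_mean,
  train_data,
  method = 'knn',
  preProcess = c("center", "scale"),
  tuneGrid = data.frame(k = seq(5, 51, by = 2)),  # candidate values of K
  trControl = trainControl(method = 'cv', number = k))
mod_caret_knn_grid$bestTune  # the value of K with the highest cross-validation accuracy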

Finally, we will report the confusion matrix for the best of our KNN models so that we can compare the performance of KNN with that of logistic regression:

mod_caret_knn %>%
  predict(test_data) %>% confusionMatrix(test_data$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 107   9
##          M   3  50
##                                           
##                Accuracy : 0.929           
##                  95% CI : (0.8793, 0.9628)
##     No Information Rate : 0.6509          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.84            
##                                           
##  Mcnemar's Test P-Value : 0.1489          
##                                           
##             Sensitivity : 0.9727          
##             Specificity : 0.8475          
##          Pos Pred Value : 0.9224          
##          Neg Pred Value : 0.9434          
##              Prevalence : 0.6509          
##          Detection Rate : 0.6331          
##    Detection Prevalence : 0.6864          
##       Balanced Accuracy : 0.9101          
##                                           
##        'Positive' Class : B               
## 

Note that for this problem KNN works really well.

Survey

There is a link to a simple survey after lab 3:

Answers

Here are the answers: