By the end of this activity, you should

- Be familiar with basic plotting in ggplot2 (part of tidyverse)
- Implement \(k\)-fold cross-validation to estimate a test error by yourself
- Use built-in cross-validation in the library caret to estimate a test error
- Choose the optimal value of a tuning parameter with \(k\)-fold cross-validation in the library caret
- Review the functions lapply and sapply
You should be familiar with basic regression and classification models — linear regression, logistic regression, KNN. These topics were covered in lab 2. You should also be familiar with the pipe operator and splitting data into training and test sets (lab 1).
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or ask you to modify R code (modify the code here). You can discuss your work with other students and with instructors.
Please fill in the survey at the end - it'll only take a couple of minutes.
library(tidyverse) # for manipulation with data
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret) # for machine learning, including KNN
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(stargazer) # for printing regression tables
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
Today we will work with the breast cancer dataset and we will predict whether a tumor is benign or malignant from its measurements.
Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
B <- read.csv("breast_cancer.csv")
head(B)
The variables are
names(B)
## [1] "id" "diagnosis"
## [3] "radius_mean" "texture_mean"
## [5] "perimeter_mean" "area_mean"
## [7] "smoothness_mean" "compactness_mean"
## [9] "concavity_mean" "concave.points_mean"
## [11] "symmetry_mean" "fractal_dimension_mean"
## [13] "radius_se" "texture_se"
## [15] "perimeter_se" "area_se"
## [17] "smoothness_se" "compactness_se"
## [19] "concavity_se" "concave.points_se"
## [21] "symmetry_se" "fractal_dimension_se"
## [23] "radius_worst" "texture_worst"
## [25] "perimeter_worst" "area_worst"
## [27] "smoothness_worst" "compactness_worst"
## [29] "concavity_worst" "concave.points_worst"
## [31] "symmetry_worst" "fractal_dimension_worst"
## [33] "X"
The response variable is diagnosis. Below are its values and their frequencies:
round(table(B$diagnosis) / nrow(B) , 2)
##
## B M
## 0.63 0.37
We see that the dataset is more-or-less balanced.
Now we will split the data into 400 training and 169 test observations. For the sake of this lab session, we will extract just three variables: the response diagnosis and the predictors radius_mean and texture_mean. We will also convert the variable diagnosis from character to factor because this is necessary for the function glm.
set.seed(42)
B <- mutate(B, diagnosis = as.factor(diagnosis))
ind <- sample(1:nrow(B), 400)
train_data <- B %>% slice(ind) %>% select(diagnosis, radius_mean, texture_mean)
test_data <- B %>% slice(-ind) %>% select(diagnosis, radius_mean, texture_mean)
cat("Training data dimensions =", dim(train_data), "\n")
## Training data dimensions = 400 3
cat("Test data dimensions =", dim(test_data), "\n")
## Test data dimensions = 169 3
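As a side note, not used in the rest of this lab, caret provides createDataPartition for splits that preserve the class proportions of the response; a hedged sketch:

# Stratified split sketch (ind_strat is a name introduced here):
# list = FALSE returns a one-column matrix of indices, so we take column 1
ind_strat <- createDataPartition(B$diagnosis, p = 400/nrow(B), list = FALSE)[, 1]
B %>% slice(ind_strat) %>% select(diagnosis, radius_mean, texture_mean) %>% dim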
The library ggplot2 for plotting is part of the tidyverse ecosystem. It is one of the nicest plotting systems available in R.
Remark: plotly, with its interactive plots, is arguably even better than ggplot2. However, plotly does not work in static PDF reports. You are encouraged to explore interactive plots with plotly as part of your project. If you do construct interactive plots for your project, you can submit your project report as HTML rather than PDF.
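For reference, here is a minimal plotly sketch (assuming the plotly package is installed; it is not loaded elsewhere in this lab, and the interactive output renders only in HTML):

library(plotly)
# Interactive analogue of the ggplot2 scatterplots below
plot_ly(B, x = ~radius_mean, y = ~texture_mean, color = ~diagnosis,
        type = "scatter", mode = "markers")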
In ggplot2, we always load our data into the function ggplot and then tell it how to visualize and modify the data by adding extra functions. It seems weird at first, but you will soon get used to it.
Here is a simple scatterplot:
ggplot(data = B, aes(x = radius_mean, y = texture_mean)) +
geom_point()
It would be nice to colour our points according to diagnosis. Here is how we do it:
ggplot(data = B, aes(x = radius_mean, y = texture_mean, colour = diagnosis)) +
geom_point()
We can add more stuff to such a plot, like regression lines (regression lines are not meaningful here, it’s just an example) or a plot title:
ggplot(data = B, aes(x = radius_mean, y = texture_mean, colour = diagnosis)) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle("Breast cancer data")
## `geom_smooth()` using formula 'y ~ x'
If we want to show the distribution of radius_mean separately for benign and malignant tumors, we can begin by setting x in aes to diagnosis:
ggplot(data = B, aes(x = diagnosis, y = radius_mean)) +
geom_point()
However, it is hard to understand this plot because points overlap. To remedy this, we can make points transparent as follows:
ggplot(data = B, aes(x = diagnosis, y = radius_mean)) +
geom_point(alpha = 0.1)
An even better way to solve the issue is to use the boxplot method for plotting. It produces a so-called box-and-whisker plot:
ggplot(data = B, aes(x = diagnosis, y = radius_mean)) +
geom_boxplot()
It shows outliers, minimum, first quartile, median, third quartile, and maximum separately for benign and malignant tumors. A whisker chart can be coloured too:
ggplot(data = B, aes(x = diagnosis, y = radius_mean, colour = diagnosis)) +
geom_boxplot()
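As a sanity check, the five-number summary that the boxplot visualizes can also be computed directly; here is a small dplyr sketch (the column names are just our choice):

# Numeric counterpart of the boxplot, one row per diagnosis
B %>%
  group_by(diagnosis) %>%
  summarise(min = min(radius_mean),
            Q1 = quantile(radius_mean, 0.25),
            median = median(radius_mean),
            Q3 = quantile(radius_mean, 0.75),
            max = max(radius_mean))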
Change geom_boxplot to geom_violin to produce a violin plot. At the same time, change colour to fill for better contrast.
ggplot(data = B, aes(x = diagnosis, y = radius_mean, fill = diagnosis)) +
geom_violin()
To plot a histogram of, say, radius_mean, we just need to pass the x parameter to aes:
ggplot(data = B, aes(x = radius_mean)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can specify colours and the number of bins:
ggplot(data = B, aes(x = radius_mean)) +
geom_histogram(colour = 'black', fill = 'orange', bins = 12)
We can also show histograms of radius_mean for benign and malignant tumors separately on the same plot:
ggplot(data = B, aes(x = radius_mean, group = diagnosis, fill = diagnosis)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Instead of a histogram, we can plot the estimated density:
ggplot(data = B, aes(x = radius_mean, group = diagnosis, fill = diagnosis)) +
geom_density(alpha = 0.5)
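Another option, shown here only as a side example, is to give each diagnosis its own panel with facet_wrap:

# One panel per diagnosis via faceting (a sketch)
ggplot(data = B, aes(x = radius_mean, fill = diagnosis)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ diagnosis)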
Let \(N\) be the number of observations in our dataset and let \(k\) be the number of folds that we need to create. Our folds are just a list of numbers \[ 1,1,\dots,1,2,2,\dots,2,\dots,k,\dots,k \] where each number appears approximately the same number of times. Later we will shuffle this list.
To create folds, we begin with the vector of indices 1:N:
N <- 20
k <- 4
1:N
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Fold numbers should be between 1 and \(k\). So let us divide each entry by \(N/k\):
(1:N) / (N/k)
## [1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
## [20] 4.0
Applying the ceiling function will now yield what we want:
((1:N) / (N/k)) %>% ceiling
## [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Note that it works even when \(N\) is not divisible by \(k\):
N <- 12
k <- 5
((1:N) / (N/k)) %>% ceiling
## [1] 1 1 2 2 3 3 3 4 4 5 5 5
Now we need to shuffle this vector. This can be done with the function sample from the base package: we just need to produce a sample of size \(N\) from this vector without replacement:
((1:N) / (N/k)) %>% ceiling %>% sample(N, replace = FALSE)
## [1] 3 3 2 4 4 3 1 5 5 5 1 2
Come up with an alternative method to create folds based on the function rep:
rep(1:k, 5)
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Your method should work even when \(N\) is not divisible by \(k\).
rep(1:k, ceiling(N/k))[1:N] %>% sample(N, replace = FALSE)
## [1] 2 1 4 2 3 1 1 4 3 2 5 5
Now we will create a function that will add folds to a given dataset and we will apply it to create 5 folds for our training data.
k <- 5
set.seed(128)
add_folds_to_data <- function(dataset, k) {
  N <- nrow(dataset)
  # fold labels 1,...,1, 2,...,2, ..., k in order; shuffled by sample() below
  folds <- ceiling((1:N)/(N/k))
  dataset %>%
    mutate(fold = sample(folds, N))
}
train_data <- train_data %>%
add_folds_to_data(k)
head(train_data)
To make sure that it worked, we will print the number of observations in each fold:
table(train_data$fold)
##
## 1 2 3 4 5
## 80 80 80 80 80
We will train a logistic polynomial regression and report its test error. Below are the coefficients of the model.
deg <- 2
mod_log <- glm(diagnosis ~ polym(radius_mean, texture_mean, degree = deg, raw = TRUE),
family = "binomial", data = train_data)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
stargazer(mod_log, type = "text")
##
## =========================================================================================
## Dependent variable:
## ---------------------------
## diagnosis
## -----------------------------------------------------------------------------------------
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.0 -2.215
## (1.515)
##
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)2.0 0.091*
## (0.049)
##
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.1 0.680
## (0.471)
##
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.1 0.039
## (0.027)
##
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.2 -0.025***
## (0.007)
##
## Constant -2.074
## (12.145)
##
## -----------------------------------------------------------------------------------------
## Observations 400
## Log Likelihood -96.296
## Akaike Inf. Crit. 204.593
## =========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
And now we will construct predictions and report the confusion matrix:
probs <- predict(mod_log, test_data, type = "response")
preds <- ifelse(probs > 0.5, "M", "B") %>% as.factor
confusionMatrix(preds, test_data$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 102 12
## M 8 47
##
## Accuracy : 0.8817
## 95% CI : (0.8232, 0.9262)
## No Information Rate : 0.6509
## P-Value [Acc > NIR] : 7.189e-12
##
## Kappa : 0.7354
##
## Mcnemar's Test P-Value : 0.5023
##
## Sensitivity : 0.9273
## Specificity : 0.7966
## Pos Pred Value : 0.8947
## Neg Pred Value : 0.8545
## Prevalence : 0.6509
## Detection Rate : 0.6036
## Detection Prevalence : 0.6746
## Balanced Accuracy : 0.8619
##
## 'Positive' Class : B
##
Since cross-validation involves training a lot of models, it will be convenient to write special functions for training and validating such models of different degrees on different datasets. Below we do it and report test errors for models of degrees from 1 to 6:
train_poly_logistic_model <- function(deg, dataset = train_data) {
  # returns the fitted glm object
  glm(diagnosis ~ polym(radius_mean, texture_mean, degree = deg, raw = TRUE),
      family = "binomial", data = dataset)
}
error_rate <- function(mod_log, dataset = test_data) {
probs <- predict(mod_log, dataset, type = "response")
preds <- ifelse(probs > 0.5, "M", "B") %>% as.factor
err <- 1 - confusionMatrix(preds, dataset$diagnosis)$overall['Accuracy']
unname(err)
}
train_and_validate <- function(deg, train = train_data, test = test_data) {
train_poly_logistic_model(deg, train) %>%
error_rate(test)
}
vec_test_errors <- sapply(1:6, train_and_validate)
names(vec_test_errors) <- 1:6
vec_test_errors
## 1 2 3 4 5 6
## 0.11242604 0.11834320 0.09467456 0.11242604 0.21301775 0.11834320
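Since one goal of this activity is to review lapply and sapply, here is a quick standalone refresher (a toy example, unrelated to the dataset): lapply always returns a list, while sapply simplifies the result to a vector or matrix when it can.

lapply(1:3, function(i) i^2) # a list with elements 1, 4, 9
sapply(1:3, function(i) i^2) # the numeric vector 1 4 9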
Here we will calculate the cross-validation error for a polynomial logistic regression of, say, degree 2. This means that we will train 5 models, where model \(i\) is trained on all folds except fold \(i\), and we will report its error on fold \(i\).
list_of_models <-
lapply(1:k, function(i)
train_poly_logistic_model(2, train_data %>% filter(fold != i)))
list_of_error_rates <-
sapply(1:k, function(i)
error_rate(list_of_models[[i]], train_data %>% filter(fold == i)))
data.frame(fold = 1:k,
error = list_of_error_rates)
The average error, i.e., the cross-validation error, is
mean(list_of_error_rates)
## [1] 0.1025
Below are all models generated in the cross-validation process:
stargazer(list_of_models, type = "text")
##
## ==============================================================================================================
## Dependent variable:
## ------------------------------------------------
## diagnosis
## (1) (2) (3) (4) (5)
## --------------------------------------------------------------------------------------------------------------
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.0 -1.605 -1.496 -4.603*** -1.274 -2.523
## (1.655) (1.762) (1.602) (1.733) (1.565)
##
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)2.0 0.044 0.081 0.186*** 0.064 0.093*
## (0.054) (0.057) (0.054) (0.053) (0.050)
##
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.1 0.343 0.910* 1.033* 0.701 0.444
## (0.521) (0.475) (0.535) (0.597) (0.538)
##
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.1 0.074** 0.018 0.045 0.027 0.051*
## (0.035) (0.027) (0.031) (0.031) (0.031)
##
## polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.2 -0.028*** -0.023*** -0.034*** -0.021** -0.023***
## (0.009) (0.008) (0.010) (0.008) (0.008)
##
## Constant -2.363 -9.679 8.302 -8.131 2.706
## (13.291) (13.771) (12.060) (15.190) (12.948)
##
## --------------------------------------------------------------------------------------------------------------
## Observations 320 320 320 320 320
## Log Likelihood -81.671 -73.870 -66.366 -82.294 -76.364
## Akaike Inf. Crit. 175.341 159.739 144.731 176.587 164.728
## ==============================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Note that while performing cross-validation, we saved all 5 models that were a part of the cross-validation process. In practice, this is not needed because in any case, the actual model that goes into production is going to be trained on the whole training data. Cross-validation is used just to estimate the test error.
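For illustration, once a degree is chosen, the production model can be refit on the whole training set with the helpers defined above; a minimal sketch (degree 3 is just an example here, picked because it minimized the test error earlier):

final_mod <- train_poly_logistic_model(3, train_data) # degree 3 is illustrative
error_rate(final_mod, test_data)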
Rewrite the code above so that a single pipeline prints the mean error rate over all 5 models:
1:k %>% sapply(
function(i) train_and_validate(2, train_data %>% filter(fold != i),
train_data %>% filter(fold == i))
) %>% mean
## [1] 0.1025
Now we will make a function that will perform cross-validation for a certain degree and report the cross-validation error and the test error. First, we report the cross-validation error and the test error for logistic polynomial regression of degree 2.
cv_and_test_error <- function(d, K = k) {
# K is the number of folds, the default value is whatever is specified above
# when we assigned k
cv_error <- 1:K %>% sapply(
function(i) train_and_validate(d, train_data %>% filter(fold != i),
train_data %>% filter(fold == i))
) %>% mean
test_error <- train_and_validate(d, train_data, test_data)
c(deg = d, CV = cv_error, TEST = test_error)
}
cv_and_test_error(2)
## deg CV TEST
## 2.0000000 0.1025000 0.1183432
And now we will make a table of cross-validation and test errors for all degrees from 1 to 12:
all_degrees <- 1:12
final_errors <- sapply(all_degrees, cv_and_test_error) %>% t %>%
as.data.frame
final_errors
Note that while the cross-validation and test errors are different, the optimal degree is the same for both of them. Also note that we transposed and converted the output of sapply to a data frame because the default behaviour is to produce a matrix built column-by-column rather than row-by-row.
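To see this default behaviour in isolation (a small side example): each function call becomes a column of the result, so we transpose to get one row per call.

sapply(1:3, function(i) c(deg = i, sq = i^2))      # a 2 x 3 matrix, one column per call
sapply(1:3, function(i) c(deg = i, sq = i^2)) %>% t # one row per call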
We can use the familiar tools to plot cross-validation error rates for different degrees:
ggplot(data = final_errors, aes(x = deg, y = CV)) +
geom_point()
However, it will look better if we plot lines instead:
ggplot(data = final_errors, aes(x = deg, y = CV)) +
geom_line()
Plot test errors with points joined by a line:
ggplot(data = final_errors, aes(x = deg, y = TEST)) +
geom_line() + geom_point()
Of course, in practice we do not implement cross-validation from scratch. We can use the function train from the library caret for this purpose.
set.seed(128)
mod_log_caret <- train(diagnosis ~ polym(radius_mean, texture_mean, degree = 2, raw = TRUE),
train_data, method = 'glm',
family = binomial,
trControl = trainControl(method = 'cv', number = k))
mod_log_caret
## Generalized Linear Model
##
## 400 samples
## 2 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 320, 320, 319, 320, 321
## Resampling results:
##
## Accuracy Kappa
## 0.9024961 0.7909036
The cross-validation error of this model is
1 - mod_log_caret$results$Accuracy
## [1] 0.09750391
A nice feature of the library caret is that the function predict for a model trained by train returns the labels, so we do not have to manually convert probabilities to labels:
mod_log_caret %>% predict(test_data) %>% head
## [1] B M M B B M
## Levels: B M
Here is the confusion matrix of our model on the test data:
mod_log_caret %>%
predict(test_data) %>%
confusionMatrix(test_data$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 102 12
## M 8 47
##
## Accuracy : 0.8817
## 95% CI : (0.8232, 0.9262)
## No Information Rate : 0.6509
## P-Value [Acc > NIR] : 7.189e-12
##
## Kappa : 0.7354
##
## Mcnemar's Test P-Value : 0.5023
##
## Sensitivity : 0.9273
## Specificity : 0.7966
## Pos Pred Value : 0.8947
## Neg Pred Value : 0.8545
## Prevalence : 0.6509
## Detection Rate : 0.6036
## Detection Prevalence : 0.6746
## Balanced Accuracy : 0.8619
##
## 'Positive' Class : B
##
We will now use cross-validation to pick the optimal degree. Specifically, we will write a function that will

- Take the degree of a polynomial and the number of folds as inputs
- Train a logistic polynomial regression of the given degree with the built-in function train from the library caret
- Return the cross-validation error of the model

Below we write such a function and use it to find the cross-validation errors for models of degrees from 1 to 12:
cv_caret_error <- function(d, K = k, dataset = train_data) {
# The following two lines of code are somewhat obscure
# Please talk to Fedor if you want to understand them
# Otherwise, you can just reuse them
template <- "diagnosis ~ polym(radius_mean, texture_mean, degree = DEG, raw = TRUE)"
polynomial_formula <- template %>% gsub("DEG", d, .) %>% as.formula
##### There may be a better way to do it!!! ####
mod <-
train(polynomial_formula,
dataset, method = 'glm',
family = binomial,
trControl = trainControl(method = 'cv', number = K))
1 - mod$results$Accuracy
}
caret_errors <- data.frame(
deg = all_degrees,
error_rate = sapply(all_degrees, cv_caret_error)
)
caret_errors
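As the comment in cv_caret_error hints, the formula could be built in other ways; one alternative sketch uses sprintf (build_poly_formula is a name introduced here, not part of the lab):

# Alternative formula builder, equivalent to the gsub trick above
build_poly_formula <- function(d) {
  as.formula(sprintf(
    "diagnosis ~ polym(radius_mean, texture_mean, degree = %d, raw = TRUE)", d))
}
build_poly_formula(2) # same formula as before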
Rewrite the function cv_caret_error so that it returns a vector with 3 entries: the degree of the polynomial, the cross-validation error, and the test error. Then use it to produce a table of errors for different degrees, analogous to what we already calculated when we implemented cross-validation from scratch.
caret_errors <- function(d, K = k, dataset = train_data) {
# The following two lines of code are somewhat obscure
# Please talk to Fedor if you want to understand them
# Otherwise, you can just reuse them
template <- "diagnosis ~ polym(radius_mean, texture_mean, degree = DEG, raw = TRUE)"
polynomial_formula <- template %>% gsub("DEG", d, .) %>% as.formula
##### There may be a better way to do it!!! ####
mod <-
train(polynomial_formula,
dataset, method = 'glm',
family = binomial,
trControl = trainControl(method = 'cv', number = K))
cv_err <- 1 - mod$results$Accuracy
test_cm <- mod %>%
predict(test_data) %>%
confusionMatrix(test_data$diagnosis)
test_err <- unname(1 - test_cm$overall['Accuracy'])
c(deg = d, CV = cv_err, TEST = test_err)
}
sapply(all_degrees, caret_errors) %>% t %>% as.data.frame()
When we selected the degree of a polynomial logistic regression that yields the smallest cross-validation error, we constructed a family of models \[ L(1), L(2),\dots,L(12), \] where \(L(d)\) is polynomial logistic regression of degree \(d\), i.e., log-odds of the probability is modelled as a polynomial of degree \(d\) in our two variables. For instance, \(L(2)\) is the model \[ \log\frac{P}{1-P}=b_0+b_1X_1+b_2X_2+b_3X_1^2+b_4X_1X_2+b_5X_2^2 \] etc.
A large part of the difficulty in implementing this process was that we needed to apply the cross-validation procedure explicitly to every degree.
The good news is that many models implemented in caret have an explicit tuning parameter, or hyperparameter, and we can compute cross-validation errors and pick the best model, i.e., the one minimizing the cross-validation error, in a single call of the function train. For example, here is how we do it for KNN, where K is the hyperparameter (unfortunately, this simple technique is not available for logistic regression):
mod_caret_knn <- train(diagnosis ~ radius_mean + texture_mean,
train_data,
method = 'knn',
preProcess = c("center","scale"), tuneLength = 20,
trControl = trainControl(method = 'cv', number = k))
mod_caret_knn
## k-Nearest Neighbors
##
## 400 samples
## 2 predictor
## 2 classes: 'B', 'M'
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 320, 320, 321, 319, 320
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8698972 0.7195185
## 7 0.8824289 0.7460886
## 9 0.8874922 0.7565798
## 11 0.8849922 0.7518760
## 13 0.8849605 0.7517160
## 15 0.8898980 0.7620522
## 17 0.8899605 0.7627737
## 19 0.8949922 0.7735728
## 21 0.8975238 0.7792575
## 23 0.8949605 0.7729157
## 25 0.8975238 0.7786323
## 27 0.9000238 0.7842236
## 29 0.9025238 0.7893252
## 31 0.9025238 0.7893252
## 33 0.9050555 0.7944510
## 35 0.9125246 0.8109089
## 37 0.9074930 0.8013067
## 39 0.8973980 0.7784069
## 41 0.8973664 0.7785244
## 43 0.8973664 0.7785244
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 35.
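The selected value of k can also be extracted programmatically: bestTune is a standard field of a caret train object, and caret provides a ggplot method for train objects that plots the tuning profile.

mod_caret_knn$bestTune # the k that caret selected
ggplot(mod_caret_knn)  # accuracy as a function of k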
Finally, we will report the confusion matrix for the best of our KNN models so that we can compare the performance of KNN to logistic regression:
mod_caret_knn %>%
predict(test_data) %>% confusionMatrix(test_data$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 107 9
## M 3 50
##
## Accuracy : 0.929
## 95% CI : (0.8793, 0.9628)
## No Information Rate : 0.6509
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.84
##
## Mcnemar's Test P-Value : 0.1489
##
## Sensitivity : 0.9727
## Specificity : 0.8475
## Pos Pred Value : 0.9224
## Neg Pred Value : 0.9434
## Prevalence : 0.6509
## Detection Rate : 0.6331
## Detection Prevalence : 0.6864
## Balanced Accuracy : 0.9101
##
## 'Positive' Class : B
##
Note that for this problem KNN works really well.