By the end of this activity, you should
- Be familiar with basic plotting in ggplot2 (part of tidyverse)
- Implement \(k\)-fold cross-validation yourself to estimate a test error
- Use built-in cross-validation in the library caret to estimate a test error
- Choose the optimal value of a tuning parameter with \(k\)-fold cross-validation in the library caret
- Review the functions map, map_vec, and map_df
You should be familiar with basic regression and classification models — linear regression, logistic regression, KNN. These topics were covered in lab 2. You should also be familiar with the pipe operator and splitting data into training and test sets (lab 1).
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer (type it right in this document), while others require modifying R code (modify the code here). You can discuss your work with other students and with instructors.
Please fill in the survey at the end; it will only take a couple of minutes.
library(tidyverse) # for manipulation with data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret) # for machine learning, including KNN
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(modelsummary) # for printing regression tables
Today we will work with the breast cancer dataset and we will predict whether a tumor is benign or malignant from its measurements.
Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
B <- read.csv("breast_cancer.csv")
head(B)
The variables are
names(B)
##  [1] "id"                      "diagnosis"              
##  [3] "radius_mean"             "texture_mean"           
##  [5] "perimeter_mean"          "area_mean"              
##  [7] "smoothness_mean"         "compactness_mean"       
##  [9] "concavity_mean"          "concave.points_mean"    
## [11] "symmetry_mean"           "fractal_dimension_mean" 
## [13] "radius_se"               "texture_se"             
## [15] "perimeter_se"            "area_se"                
## [17] "smoothness_se"           "compactness_se"         
## [19] "concavity_se"            "concave.points_se"      
## [21] "symmetry_se"             "fractal_dimension_se"   
## [23] "radius_worst"            "texture_worst"          
## [25] "perimeter_worst"         "area_worst"             
## [27] "smoothness_worst"        "compactness_worst"      
## [29] "concavity_worst"         "concave.points_worst"   
## [31] "symmetry_worst"          "fractal_dimension_worst"
## [33] "X"
The response variable is diagnosis. Below are its values and their frequencies:
round(table(B$diagnosis) / nrow(B), 2)
## 
##    B    M 
## 0.63 0.37
We see that the dataset is more-or-less balanced.
Now we will split the data into 400 training and 169 test observations. For this lab session, we will keep just three variables: the response diagnosis and the predictors radius_mean and texture_mean. We will also convert the variable diagnosis from character to factor, as required by the function glm.
set.seed(42)
B <- mutate(B, diagnosis = as.factor(diagnosis))
ind <- sample(1:nrow(B), 400)
train_data <- B %>% slice(ind) %>% select(diagnosis, radius_mean, texture_mean)
test_data <- B %>% slice(-ind) %>% select(diagnosis, radius_mean, texture_mean)
cat("Training data dimensions =", dim(train_data), "\n")## Training data dimensions = 400 3cat("Test data dimensions =", dim(test_data), "\n")## Test data dimensions = 169 3The library ggplot2 for plotting is a part of
tidyverse echosystem. It is one of nicest plot systems
available in R.
Remark: plotly, with its interactive plots, is arguably even better than ggplot2. However, plotly does not work with static PDF reports. You are encouraged to explore interactive plots with plotly as part of your project. If you do construct interactive plots for your project, you can submit your project report as HTML rather than PDF.
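For example, an interactive version of the scatterplot built below with ggplot2 could look roughly like this (a minimal sketch, assuming the plotly package is installed; it is not used elsewhere in this lab):
library(plotly) # install.packages("plotly") if you do not have it
plot_ly(B, x = ~radius_mean, y = ~texture_mean, color = ~diagnosis,
        type = "scatter", mode = "markers") # interactive scatterplot coloured by diagnosis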
In ggplot2, we always pass our data to the function ggplot and then describe how to visualize it by adding extra layers with +. It seems odd at first, but you will soon get used to it.
Here is a simple scatterplot:
ggplot(data = B, aes(x = radius_mean, y = texture_mean)) +
  geom_point()
It would be nice to colour our points according to diagnosis. Here is how we do it:
ggplot(data = B, aes(x = radius_mean, y = texture_mean, colour = diagnosis)) +
  geom_point()
We can add more elements to such a plot, like regression lines (regression lines are not meaningful here, it's just an example) or a plot title:
ggplot(data = B, aes(x = radius_mean, y = texture_mean, colour = diagnosis)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Breast cancer data")## `geom_smooth()` using formula = 'y ~ x'If we want to show the distribution of radius_mean
separately for benign and malignant tumors, can begin with setting
x in aes to be diagnosis:
ggplot(data = B, aes(x = diagnosis, y = radius_mean)) +
  geom_point()
However, it is hard to understand this plot because the points overlap. To remedy this, we can make the points transparent as follows:
ggplot(data = B, aes(x = diagnosis, y = radius_mean)) +
  geom_point(alpha = 0.1)
An even better way to solve the issue is to use the boxplot geometry, which produces a so-called box-and-whisker plot:
ggplot(data = B, aes(x = diagnosis, y = radius_mean)) +
  geom_boxplot()
It shows the outliers, minimum, first quartile, median, third quartile, and maximum separately for benign and malignant tumors. A box-and-whisker plot can be coloured too:
ggplot(data = B, aes(x = diagnosis, y = radius_mean, colour = diagnosis)) +
  geom_boxplot()
Change geom_boxplot to geom_violin to produce a violin plot. At the same time, change colour to fill for better contrast.
ggplot(data = B, aes(x = diagnosis, y = radius_mean, fill = diagnosis)) +
  geom_violin()
To plot a histogram of, say, radius_mean, we just need to pass the x parameter to aes:
ggplot(data = B, aes(x = radius_mean)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can specify colours and the number of bins:
ggplot(data = B, aes(x = radius_mean)) +
  geom_histogram(colour = 'black', fill = 'orange', bins = 12)
We can also show histograms of radius_mean for benign and malignant tumors separately on the same plot:
ggplot(data = B, aes(x = radius_mean, group = diagnosis, fill = diagnosis)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Instead of a histogram, we can plot the estimated density:
ggplot(data = B, aes(x = radius_mean, group = diagnosis, fill = diagnosis)) +
  geom_density(alpha = 0.5)
Let \(N\) be the number of observations in our dataset and let \(k\) be the number of folds that we need to create. Our folds are just a list of numbers \[ 1,1,\dots,1,2,2,\dots,2,\dots,k,\dots,k \] where each number appears approximately the same number of times. Later we will shuffle this list.
To create the folds, we begin with the vector of indices from 1 to \(N\):
N <- 20
k <- 4
1:N
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Fold numbers should be between 1 and \(k\), so let us divide each entry by \(N/k\):
(1:N) / (N/k)
##  [1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
## [20] 4.0
Applying the ceiling function will now yield what we want:
((1:N) / (N/k)) %>% ceiling
##  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Note that it works even when \(N\) is not divisible by \(k\):
N <- 12
k <- 5
((1:N) / (N/k)) %>% ceiling
##  [1] 1 1 2 2 3 3 3 4 4 5 5 5
Now we need to shuffle this vector. This can be done with the
function sample from the base package — we
just need to produce a sample from this vector without replacement of
size \(N\):
((1:N) / (N/k)) %>% ceiling %>% sample(N, replace = FALSE)
##  [1] 3 3 2 4 4 3 1 5 5 5 1 2
Come up with an alternative method to create folds based on the function rep:
rep(1:k, 5)
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Your method should work even when \(N\) is not divisible by \(k\).
rep(1:k, ceiling(N/k))[1:N] %>% sample(N, replace = FALSE)
##  [1] 2 1 4 2 3 1 1 4 3 2 5 5
Now we will create a function that adds folds to a given dataset, and we will apply it to create 5 folds for our training data.
k <- 5
set.seed(128)
add_folds_to_data <- function(dataset, k) {
  N <- nrow(dataset)
  folds <- ceiling((1:N)/(N/k))
  dataset %>% 
    mutate(fold = sample(folds, N))
}
train_data <- train_data %>%
  add_folds_to_data(k)
head(train_data)
To make sure that it worked, we will print the number of observations in each fold:
table(train_data$fold)
## 
##  1  2  3  4  5 
## 80 80 80 80 80
We will train a logistic polynomial regression and report its test error. Below are the coefficients of the model.
Notice that the regression output contains terms like polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.0. The suffix, such as 1.0, records the power of each predictor in that term (here radius_mean to the power 1 and texture_mean to the power 0). We set raw = TRUE so that the terms are the simple monomials \(x^2\), \(xy\), \(y^2\), and so on; by default (raw = FALSE) R instead uses an orthogonal polynomial basis, which reduces collinearity between predictors and improves numerical stability. For our purposes, you don't need to decode these labels line by line; just be aware of why they look different. If you are curious, you are encouraged to read more about multivariate orthogonal polynomials in regression.
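As a quick illustration of where these labels come from (a small sketch on toy vectors, not part of the analysis in this lab), we can inspect the column names that polym generates; the suffix of each column records the powers of the two inputs:
# with raw = TRUE the columns are the plain monomials x, x^2, y, xy, y^2
colnames(polym(1:10, 11:20, degree = 2, raw = TRUE))
# with the default raw = FALSE the naming is identical but the basis is orthogonal
colnames(polym(1:10, 11:20, degree = 2))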
deg <- 2
mod_log <- glm(diagnosis ~ 
                 polym(radius_mean, texture_mean, degree = deg, raw = TRUE),
               family = "binomial", data = train_data)## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurredmodelsummary(mod_log, fmt = 3, 
             statistic = NULL,
             estimate = "{estimate} ({std.error}){stars}")| (1) | |
|---|---|
| (Intercept) | -2.074 (12.145) | 
| polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.0 | -2.215 (1.515) | 
| polym(radius_mean, texture_mean, degree = deg, raw = TRUE)2.0 | 0.091 (0.049)+ | 
| polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.1 | 0.680 (0.471) | 
| polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.1 | 0.039 (0.027) | 
| polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.2 | -0.025 (0.007)*** | 
| Num.Obs. | 400 | 
| AIC | 204.6 | 
| BIC | 228.5 | 
| Log.Lik. | -96.296 | 
| F | 15.718 | 
| RMSE | 0.27 | 
And now we will construct predictions and report the confusion matrix:
probs <- predict(mod_log, test_data, type = "response")
preds <- ifelse(probs > 0.5, "M", "B") %>% as.factor
confusionMatrix(preds, test_data$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 102  12
##          M   8  47
##                                           
##                Accuracy : 0.8817          
##                  95% CI : (0.8232, 0.9262)
##     No Information Rate : 0.6509          
##     P-Value [Acc > NIR] : 7.189e-12       
##                                           
##                   Kappa : 0.7354          
##                                           
##  Mcnemar's Test P-Value : 0.5023          
##                                           
##             Sensitivity : 0.9273          
##             Specificity : 0.7966          
##          Pos Pred Value : 0.8947          
##          Neg Pred Value : 0.8545          
##              Prevalence : 0.6509          
##          Detection Rate : 0.6036          
##    Detection Prevalence : 0.6746          
##       Balanced Accuracy : 0.8619          
##                                           
##        'Positive' Class : B               
## 
Since cross-validation involves training a lot of models, it will be convenient to write special functions for training and validating such models of different degrees on different datasets. Below we do it and report test errors for models of degrees from 1 to 6:
train_poly_logistic_model <- function(deg, dataset = train_data) {
  mod_log <- glm(diagnosis ~ polym(radius_mean, 
                                   texture_mean, 
                                   degree = deg, raw = TRUE),
               family = "binomial", data = dataset)
}
error_rate <- function(mod_log, dataset = test_data) {
  probs <- predict(mod_log, dataset, type = "response")
  preds <- ifelse(probs > 0.5, "M", "B") %>% as.factor
  err <- 1 - confusionMatrix(preds, dataset$diagnosis)$overall['Accuracy'] 
  unname(err)
}
train_and_validate <- function(deg, train = train_data, test = test_data) {
  train_poly_logistic_model(deg, train) %>%
    error_rate(test)
}
vec_test_errors <- map_vec(1:6, train_and_validate)
names(vec_test_errors) <- 1:6
vec_test_errors
##          1          2          3          4          5          6 
## 0.11242604 0.11834320 0.09467456 0.11242604 0.21301775 0.11834320
Here we will calculate the cross-validation error for a polynomial logistic regression of, say, degree 2. It means that we will need to train 5 models, where model \(i\) will be trained on all folds except for fold \(i\), and we will report its error on fold \(i\).
list_of_models <- 
  map(1:k, function(i) 
    train_poly_logistic_model(2, train_data %>% filter(fold != i)))
list_of_error_rates <- 
  map_vec(1:k, function(i) 
    error_rate(list_of_models[[i]], train_data %>% filter(fold == i)))
data.frame(fold = 1:k,
           error = list_of_error_rates)
The average error, i.e., the cross-validation error, is
mean(list_of_error_rates)
## [1] 0.1025
Below are all models generated in the cross-validation process:
modelsummary(list_of_models, fmt = 3, 
             statistic = NULL,
             estimate = "{estimate} ({std.error}){stars}")| (1) | (2) | (3) | (4) | (5) | |
|---|---|---|---|---|---|
| (Intercept) | -2.363 (13.291) | -9.679 (13.771) | 8.302 (12.060) | -8.131 (15.190) | 2.706 (12.948) | 
| polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.0 | -1.605 (1.655) | -1.496 (1.762) | -4.603 (1.602)** | -1.274 (1.733) | -2.523 (1.565) | 
| polym(radius_mean, texture_mean, degree = deg, raw = TRUE)2.0 | 0.044 (0.054) | 0.081 (0.057) | 0.186 (0.054)*** | 0.064 (0.053) | 0.093 (0.050)+ | 
| polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.1 | 0.343 (0.521) | 0.910 (0.475)+ | 1.033 (0.535)+ | 0.701 (0.597) | 0.444 (0.538) | 
| polym(radius_mean, texture_mean, degree = deg, raw = TRUE)1.1 | 0.074 (0.035)* | 0.018 (0.027) | 0.045 (0.031) | 0.027 (0.031) | 0.051 (0.031)+ | 
| polym(radius_mean, texture_mean, degree = deg, raw = TRUE)0.2 | -0.028 (0.009)** | -0.023 (0.008)** | -0.034 (0.010)*** | -0.021 (0.008)* | -0.023 (0.008)** | 
| Num.Obs. | 320 | 320 | 320 | 320 | 320 | 
| AIC | 175.3 | 159.7 | 144.7 | 176.6 | 164.7 | 
| BIC | 198.0 | 182.3 | 167.3 | 199.2 | 187.3 | 
| Log.Lik. | -81.671 | -73.870 | -66.366 | -82.294 | -76.364 | 
| RMSE | 0.28 | 0.26 | 0.24 | 0.28 | 0.27 | 
Note that while performing cross-validation, we saved all 5 models that were part of the cross-validation process. In practice, this is not needed: the actual model that goes into production is trained on the whole training data, and cross-validation is used only to estimate the test error.
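To illustrate the point (a short sketch reusing the helper functions defined above, with degree 2 chosen purely as an example), the model that would actually be deployed is simply refit on the full training data and then, if desired, checked once on the test set:
final_model <- train_poly_logistic_model(2, train_data) # refit on all training observations
error_rate(final_model, test_data)                      # a single final check on the test set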
Rewrite the code above as a single pipeline that prints the mean error rate of all 5 models:
1:k %>% map_vec(
  function(i) train_and_validate(2, train_data %>% filter(fold != i),
                                 train_data %>% filter(fold == i))
) %>% mean
## [1] 0.1025
Now we will make a function that performs cross-validation for a given degree and reports the cross-validation error and the test error. First, we report both errors for the logistic polynomial regression of degree 2.
cv_and_test_error <- function(d, K = k) {
  # K is the number of folds, the default value is whatever is specified above
  # when we assigned k
  cv_error <- 1:K %>% map_vec(
    function(i) train_and_validate(d, train_data %>% filter(fold != i),
                                 train_data %>% filter(fold == i))
    ) %>% mean
  
  test_error <- train_and_validate(d, train_data, test_data)
  
  tibble(deg = d, CV = cv_error, TEST = test_error)
}
cv_and_test_error(2)
And now we will make a table of cross-validation and test errors for all degrees from 1 to 12:
all_degrees <- 1:12
final_errors <- map_df(all_degrees, cv_and_test_error)
final_errors
Note that while the cross-validation and test errors are different, the optimal degree is the same for both of them. Also note that we used map_df rather than map_vec: each call to cv_and_test_error returns a one-row tibble, and map_df binds these rows into a single data frame.
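As a quick check (a small sketch using the final_errors table above), we can extract the degree that minimizes each error and confirm that it is the same:
final_errors %>% slice_min(CV, n = 1)   # degree with the smallest cross-validation error
final_errors %>% slice_min(TEST, n = 1) # degree with the smallest test error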
We can use the familiar tools to plot cross-validation error rates for different degrees:
ggplot(data = final_errors, aes(x = deg, y = CV)) +
  geom_point()
However, it will look better if we plot lines instead:
ggplot(data = final_errors, aes(x = deg, y = CV)) +
  geom_line()
Plot the test errors with points joined by a line:
ggplot(data = final_errors, aes(x = deg, y = TEST)) +
  geom_line() + geom_point()
Of course, in practice we do not implement cross-validation from
scratch. We can use the function train from the library
caret for this purpose.
set.seed(128)
mod_log_caret <- train(diagnosis ~ 
                         polym(radius_mean, texture_mean, degree = 2, raw = TRUE),
                  train_data, method = 'glm', 
                  family = binomial, 
                  trControl = trainControl(method = 'cv', number = k))
mod_log_caret
## Generalized Linear Model 
## 
## 400 samples
##   2 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 320, 320, 319, 320, 321 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9024961  0.7909036
The cross-validation error of this model is
1 - mod_log_caret$results$Accuracy
## [1] 0.09750391
A nice feature of the library caret is that the function predict for a model trained by train returns the class labels directly, so we do not have to convert probabilities to labels by hand.
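If you do want the underlying class probabilities, predict for a caret model also accepts type = "prob" and returns one column per class (a quick sketch using mod_log_caret from above):
# predicted probabilities of the classes B and M on the test set
mod_log_caret %>% predict(test_data, type = "prob") %>% head
The predicted labels themselves look like this: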
mod_log_caret %>% predict(test_data) %>% head
## [1] B M M B B M
## Levels: B M
Here is the confusion matrix of our model on the test data:
mod_log_caret %>%
  predict(test_data) %>%
  confusionMatrix(test_data$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 102  12
##          M   8  47
##                                           
##                Accuracy : 0.8817          
##                  95% CI : (0.8232, 0.9262)
##     No Information Rate : 0.6509          
##     P-Value [Acc > NIR] : 7.189e-12       
##                                           
##                   Kappa : 0.7354          
##                                           
##  Mcnemar's Test P-Value : 0.5023          
##                                           
##             Sensitivity : 0.9273          
##             Specificity : 0.7966          
##          Pos Pred Value : 0.8947          
##          Neg Pred Value : 0.8545          
##              Prevalence : 0.6509          
##          Detection Rate : 0.6036          
##    Detection Prevalence : 0.6746          
##       Balanced Accuracy : 0.8619          
##                                           
##        'Positive' Class : B               
## 
We will now use cross-validation to pick the optimal degree. Specifically, we will write a function that will
- Take the degree of the polynomial and the number of folds as inputs
- Train a logistic polynomial regression of the given degree with the built-in function train from the library caret
- Return the cross-validation error of the model
Below we write such a function and use it to find the cross-validation errors for models of degrees from 1 to 12:
cv_caret_error <- function(d, K = k, dataset = train_data) {
  
  polynomial_formula <-
  "diagnosis ~ polym(radius_mean, texture_mean, degree = %d, raw = TRUE)" %>%
    sprintf(d) %>% as.formula
    
  mod <-
    train(polynomial_formula,
                  dataset, method = 'glm',
                  family = binomial,
                  trControl = trainControl(method = 'cv', number = K))
  1 - mod$results$Accuracy
}
caret_errors <- data.frame(
  deg = all_degrees,
  error_rate = map_vec(all_degrees, cv_caret_error)
)
caret_errors
Rewrite the function cv_caret_error so that it returns the degree of the polynomial, the cross-validation error, and the test error. Use it to produce a table of errors for different degrees, analogous to what we already calculated when we implemented cross-validation from scratch.
caret_errors <- function(d, K = k, dataset = train_data) {
  
  polynomial_formula <-
    "diagnosis ~ polym(radius_mean, texture_mean, degree = %d, raw = TRUE)" %>%
    sprintf(d) %>% as.formula
  mod <-
    train(polynomial_formula,
                  dataset, method = 'glm',
                  family = binomial,
                  trControl = trainControl(method = 'cv', number = K))
  
  cv_err <- 1 - mod$results$Accuracy
  
  test_cm <- mod %>%
    predict(test_data) %>%
    confusionMatrix(test_data$diagnosis)
  
  test_err <- unname(1 - test_cm$overall['Accuracy'])
  
  tibble(deg = d, CV = cv_err, TEST = test_err)
}
map_df(all_degrees, caret_errors)
When we selected the degree of a polynomial logistic regression that yields the smallest cross-validation error, we constructed a family of models \[ L(1), L(2),\dots,L(12), \] where \(L(d)\) is polynomial logistic regression of degree \(d\), i.e., the log-odds of the probability is modelled as a polynomial of degree \(d\) in our two variables. For instance, \(L(2)\) is the model \[ \log\frac{P}{1-P}=b_0+b_1X_1+b_2X_2+b_3X_1^2+b_4X_1X_2+b_5X_2^2 \] etc.
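For reference, since we used raw = TRUE, \(L(2)\) could equivalently be fit by spelling out the monomials by hand (a sketch; the fitted coefficients match the polym version up to the naming of the terms):
mod_L2 <- glm(diagnosis ~ radius_mean + texture_mean +
                I(radius_mean^2) + I(radius_mean * texture_mean) + I(texture_mean^2),
              family = "binomial", data = train_data) # same fit as polym(..., degree = 2, raw = TRUE)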
A large part of the difficulty in implementing this process was that we needed to explicitly apply the cross-validation procedure to every degree.
The good news is that many models implemented in caret have an explicit tuning parameter, or hyperparameter, and we can compute the cross-validation errors and pick the best model, i.e., the one minimizing the cross-validation error, in a single call to the function train. For example, here is how we do it for KNN, where K is the hyperparameter (unfortunately, this simple technique is not available for logistic regression):
mod_caret_knn <- train(diagnosis ~ radius_mean + texture_mean,
  train_data, 
  method = 'knn', 
  preProcess = c("center","scale"), tuneLength = 20,
  trControl = trainControl(method = 'cv', number = k))
mod_caret_knn
## k-Nearest Neighbors 
## 
## 400 samples
##   2 predictor
##   2 classes: 'B', 'M' 
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 320, 320, 321, 319, 320 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.8698972  0.7195185
##    7  0.8824289  0.7460886
##    9  0.8874922  0.7565798
##   11  0.8849922  0.7518760
##   13  0.8849605  0.7517160
##   15  0.8898980  0.7620522
##   17  0.8899605  0.7627737
##   19  0.8949922  0.7735728
##   21  0.8975238  0.7792575
##   23  0.8949605  0.7729157
##   25  0.8975238  0.7786323
##   27  0.9000238  0.7842236
##   29  0.9025238  0.7893252
##   31  0.9025238  0.7893252
##   33  0.9050555  0.7944510
##   35  0.9125246  0.8109089
##   37  0.9074930  0.8013067
##   39  0.8973980  0.7784069
##   41  0.8973664  0.7785244
##   43  0.8973664  0.7785244
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 35.
Finally, we will report the confusion matrix of the best of our KNN models so that we can compare the performance of KNN with that of logistic regression:
mod_caret_knn %>%
  predict(test_data) %>% confusionMatrix(test_data$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 107   9
##          M   3  50
##                                           
##                Accuracy : 0.929           
##                  95% CI : (0.8793, 0.9628)
##     No Information Rate : 0.6509          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.84            
##                                           
##  Mcnemar's Test P-Value : 0.1489          
##                                           
##             Sensitivity : 0.9727          
##             Specificity : 0.8475          
##          Pos Pred Value : 0.9224          
##          Neg Pred Value : 0.9434          
##              Prevalence : 0.6509          
##          Detection Rate : 0.6331          
##    Detection Prevalence : 0.6864          
##       Balanced Accuracy : 0.9101          
##                                           
##        'Positive' Class : B               
## 
Note that for this problem KNN works really well.
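If you want to look more closely at how the tuning parameter was chosen, the train object keeps the resampling profile (a brief sketch using mod_caret_knn from above; caret provides a ggplot method for train objects):
mod_caret_knn$bestTune # the selected number of neighbours
ggplot(mod_caret_knn)  # cross-validated accuracy as a function of k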
I used ChatGPT 5 to phrase a few lines of explanation about polynomial logistic regression.