\(K\)NN

Author

Nathan Butler

Published

February 10, 2025

Abstract:

This is a technical blog post, published as both an HTML file and a .qmd file hosted on GitHub Pages.

0. Quarto Type-setting

  • This document is rendered with Quarto and configured to embed images using the embed-resources option in the header.
  • If you wish to use a similar header, a sketch of the format specification is shown below.
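A minimal Quarto header that turns on embed-resources looks roughly like this; the title, author, and date fields below are illustrative rather than copied verbatim from this document's source:

---
title: "KNN"
author: "Nathan Butler"
date: "2025-02-10"
format:
  html:
    embed-resources: true
---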

1. Setup

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
Warning: package 'caret' was built under R version 4.4.2
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/pinot.rds")))

2. \(K\)NN Concepts

TODO: Explain how the choice of K affects the quality of your prediction when using a \(K\) Nearest Neighbors algorithm.

Explanation: The choice of K in \(K\)NN determines how the model generalizes. A small K (e.g., 1 or 3) makes predictions sensitive to noise, leading to overfitting. A large K smooths predictions by averaging over more neighbors but risks underfitting by ignoring local patterns. The best K balances bias and variance, and you can usually find that balance with cross-validation.
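To see this balance concretely, here is a minimal sketch of searching over K with cross-validation; it uses the built-in iris data and an arbitrary grid of odd K values purely for illustration, not the wine data used later in this post:

library(caret)

# Illustrative only: cross-validate a grid of K values and compare accuracy
set.seed(505)
k_fit <- train(Species ~ .,
               data = iris,
               method = "knn",
               tuneGrid = data.frame(k = seq(1, 31, by = 2)),
               trControl = trainControl(method = "cv", number = 5))

# Very small K tends to chase noise; very large K washes out local structure
k_fit$results[, c("k", "Accuracy")]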

3. Feature Engineering

  1. Create a version of the year column that is a factor (instead of numeric).
  2. Create dummy variables that indicate the presence of “cherry”, “chocolate” and “earth” in the description.
  • Take care to handle upper and lower case characters.
  3. Create 3 new features that represent the interaction between time and the cherry, chocolate and earth indicators.
  4. Remove the description column from the data.
wine2 <- wine %>% 
  # factor version of year, plus lowercase keyword indicators for the description
  mutate(fyear = as.factor(year), 
         cherry_dummy = as.integer(str_detect(tolower(description), "cherry")),
         earthy_dummy = as.integer(str_detect(tolower(description), "earth|earthy")),
         choc_dummy = as.integer(str_detect(tolower(description), "chocolate")),
         # interactions between vintage year and each keyword indicator
         cherry_time = as.integer(year) * cherry_dummy,
         earth_time = as.integer(year) * earthy_dummy,
         choc_time = as.integer(year) * choc_dummy) %>% 
  select(-description)

4. Preprocessing

  1. Preprocess the dataframe from the previous code block using Box-Cox, centering, and scaling of the numeric features.
  2. Create dummy variables for the year factor column.
pacman::p_load(fastDummies)
# one-hot encode the year factor, then drop the original numeric year
wine2 <- wine2 %>% 
  dummy_cols(select_columns = 'fyear', remove_selected_columns = TRUE) %>% 
  select(-year)
# Box-Cox transform, center, and scale the numeric features
box_wine <- wine2 %>% 
  preProcess(method = c("BoxCox", "center", "scale")) %>% 
  predict(wine2)
head(box_wine)
         id   province      price    points cherry_dummy earthy_dummy
1 -2.206642     Oregon  0.7146905 -1.033841   -0.8313474    2.1652968
2 -2.202427     Oregon -1.4139991 -1.033841   -0.8313474   -0.4617753
3 -2.198830 California  0.8225454 -1.033841   -0.8313474    2.1652968
4 -2.195581     Oregon  0.2408520 -1.367723   -0.8313474   -0.4617753
5 -2.192571     Oregon -1.2418658 -1.367723   -0.8313474    2.1652968
6 -2.189736     Oregon -1.0109945 -1.367723   -0.8313474   -0.4617753
  choc_dummy cherry_time earth_time  choc_time  fyear_1996  fyear_1997
1 -0.2657902  -0.8313464   2.165110 -0.2657899 -0.01092391 -0.01544966
2  3.7619168  -0.8313464  -0.461775  3.7646929 -0.01092391 -0.01544966
3 -0.2657902  -0.8313464   2.163804 -0.2657899 -0.01092391 -0.01544966
4 -0.2657902  -0.8313464  -0.461775 -0.2657899 -0.01092391 -0.01544966
5 -0.2657902  -0.8313464   2.161193 -0.2657899 -0.01092391 -0.01544966
6 -0.2657902  -0.8313464  -0.461775 -0.2657899 -0.01092391 -0.01544966
   fyear_1998  fyear_1999  fyear_2000  fyear_2001  fyear_2002  fyear_2003
1 -0.09177462 -0.04508349 -0.03278738 -0.01892302 -0.01892302 -0.01092391
2 -0.09177462 -0.04508349 -0.03278738 -0.01892302 -0.01892302 -0.01092391
3 -0.09177462 -0.04508349 -0.03278738 -0.01892302 -0.01892302 -0.01092391
4 -0.09177462 -0.04508349 -0.03278738 -0.01892302 -0.01892302 -0.01092391
5 -0.09177462 -0.04508349 -0.03278738 -0.01892302 -0.01892302 -0.01092391
6 -0.09177462 -0.04508349 -0.03278738 -0.01892302 -0.01892302 -0.01092391
   fyear_2004 fyear_2005 fyear_2006 fyear_2007 fyear_2008 fyear_2009 fyear_2010
1 -0.04508349 -0.1250304 -0.1412752 -0.1205243 -0.1676048 -0.2075134 -0.2524166
2 -0.04508349 -0.1250304 -0.1412752 -0.1205243 -0.1676048 -0.2075134 -0.2524166
3 -0.04508349 -0.1250304 -0.1412752 -0.1205243 -0.1676048 -0.2075134 -0.2524166
4 -0.04508349 -0.1250304 -0.1412752 -0.1205243 -0.1676048 -0.2075134  3.9612314
5 -0.04508349 -0.1250304 -0.1412752 -0.1205243 -0.1676048  4.8183900 -0.2524166
6 -0.04508349 -0.1250304 -0.1412752 -0.1205243 -0.1676048 -0.2075134 -0.2524166
  fyear_2011 fyear_2012 fyear_2013 fyear_2014 fyear_2015
1 -0.2731769  2.1371853 -0.5265085 -0.5683134 -0.3282074
2 -0.2731769 -0.4678493  1.8990778 -0.5683134 -0.3282074
3  3.6601949 -0.4678493 -0.5265085 -0.5683134 -0.3282074
4 -0.2731769 -0.4678493 -0.5265085 -0.5683134 -0.3282074
5 -0.2731769 -0.4678493 -0.5265085 -0.5683134 -0.3282074
6 -0.2731769 -0.4678493 -0.5265085 -0.5683134  3.0464899

5. Running \(K\)NN

  1. Split the dataframe into an 80/20 training and test set.
  2. Use Caret to run a \(K\)NN model that uses our engineered features to predict province.
  • Use 5-fold cross-validated subsampling.
  • Allow Caret to try 15 different values for \(K\).
  3. Display the confusion matrix on the test data.
set.seed(505)
# 80/20 split, stratified on province
wine_index <- createDataPartition(box_wine$province, p = 0.8, list = FALSE)
train <- box_wine[wine_index, ]
test <- box_wine[-wine_index, ]
# KNN with 5-fold cross-validation over 15 candidate values of K
fit <- train(province ~ .,
             data = train, 
             method = "knn",
             tuneLength = 15,
             trControl = trainControl(method = "cv", number = 5))
confusionMatrix(predict(fit, test), factor(test$province))
Confusion Matrix and Statistics

                   Reference
Prediction          Burgundy California Casablanca_Valley Marlborough New_York
  Burgundy                86         14                 3           6        0
  California              68        669                12          18       10
  Casablanca_Valley        0          0                 0           0        0
  Marlborough              0          0                 0           0        0
  New_York                 1          0                 3           1        1
  Oregon                  83        108                 8          20       15
                   Reference
Prediction          Oregon
  Burgundy              31
  California           246
  Casablanca_Valley      0
  Marlborough            0
  New_York               1
  Oregon               269

Overall Statistics
                                          
               Accuracy : 0.6127          
                 95% CI : (0.5888, 0.6361)
    No Information Rate : 0.4728          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3551          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity                  0.36134            0.8458                  0.00000
Specificity                  0.96237            0.5986                  1.00000
Pos Pred Value               0.61429            0.6540                      NaN
Neg Pred Value               0.90085            0.8123                  0.98446
Prevalence                   0.14226            0.4728                  0.01554
Detection Rate               0.05140            0.3999                  0.00000
Detection Prevalence         0.08368            0.6115                  0.00000
Balanced Accuracy            0.66186            0.7222                  0.50000
                     Class: Marlborough Class: New_York Class: Oregon
Sensitivity                      0.0000       0.0384615        0.4918
Specificity                      1.0000       0.9963570        0.7922
Pos Pred Value                      NaN       0.1428571        0.5348
Neg Pred Value                   0.9731       0.9849940        0.7624
Prevalence                       0.0269       0.0155409        0.3270
Detection Rate                   0.0000       0.0005977        0.1608
Detection Prevalence             0.0000       0.0041841        0.3007
Balanced Accuracy                0.5000       0.5174093        0.6420

6. Kappa

How do we determine whether a Kappa value represents a good, bad or some other outcome?

Explanation: Kappa assesses whether a model’s performance is meaningfully better than random chance. It does so by comparing the observed agreement (how often the model’s predictions match the actual values) to the expected agreement (the level of agreement we would expect by chance alone). While this is a simplified reading of the statistic, a high Kappa generally indicates strong model performance. In most cases, a Kappa above 0.6 is considered good, as it suggests substantial agreement. However, a Kappa above roughly 0.85 or 0.9 warrants caution, since it may indicate overfitting: a model too closely tailored to the training data that may not generalize well to new data.
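Concretely, writing \(p_o\) for the observed agreement and \(p_e\) for the agreement expected by chance,

\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]

so \(\kappa = 0\) means the model does no better than chance and \(\kappa = 1\) means perfect agreement. The Kappa of 0.3551 reported above is therefore commonly described as only fair-to-moderate agreement.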

7. Improvement

How can we interpret the confusion matrix, and how can we improve in our predictions?

Explanation: The confusion matrix shows that the model performs well on California wines but has difficulty distinguishing Burgundy from Oregon, frequently misclassifying one as the other. Additionally, Casablanca Valley and Marlborough are never predicted at all, suggesting the model fails to recognize these minority classes. To improve performance, we could balance the dataset, incorporate more informative features, tune the value of \(K\) more carefully, or use a more expressive model such as a Random Forest; one option is sketched below.
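As one possible direction, here is a sketch that rebalances the classes during resampling and swaps \(K\)NN for a Random Forest; the upsampling choice, the rf method (which requires the randomForest package), and the tuning settings are illustrative assumptions rather than results reported in this post:

# Sketch: upsample minority provinces within each fold and fit a Random Forest
ctrl <- trainControl(method = "cv",
                     number = 5,
                     sampling = "up")

rf_fit <- train(province ~ .,
                data = train,
                method = "rf",      # illustrative; requires the randomForest package
                tuneLength = 5,
                trControl = ctrl)

confusionMatrix(predict(rf_fit, test), factor(test$province))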