Homework 3 - KNN & Preprocessing & Kappa

Matt

2/10/2021

1. Setup (1pt)

Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘wine.rds’ data file from your own computer.

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(fastDummies)
library(knitr)
library(rmdformats)


## Global options
options(max.print="75")
opts_chunk$set(cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=75)

wine = read_rds("/Users/Rose/Downloads/pinot.rds")

2. KNN Concepts (5pts)

Explain how the choice of K affects the quality of your prediction when using a K Nearest Neighbors algorithm.

Answer:

When K is too small, you tend to model the noise (random neighbors that aren’t alike).
When K is too large, you tend to lose information available from those in the neighborhood and become to general (too reliant on the population average).

3. Feature Engineering (3pts)

Remove the taster_name column from the data
Create a version of the year column that is a factor (instead of numeric)
Create dummy variables that indicate the presence of “cherry”, “chocolate” and “earth” in the description
Create 3 new features that represent the interaction between time and the cherry, chocolate and earth inidicators
Remove the description column from the data

wino <- wine %>%
  select(-taster_name) %>% 
  mutate(year_f = as.factor(year)) %>% # Factor version of year
  mutate(cherry = str_detect(description,"cherry")) %>% 
  mutate(chocolate = str_detect(description,"chocolate")) %>%
  mutate(earth = str_detect(description,"earth")) %>% 
  mutate(ycherry = cherry*year) %>% # Interaction between indicator vars and the continuous time
  mutate(ychocolate = chocolate*year) %>%  
  mutate(yearth = earth*year) %>% 
  select(-description)

wino %>% head()

    province price points year year_f cherry chocolate earth ycherry ychocolate
1     Oregon    65     87 2012   2012  FALSE     FALSE  TRUE       0          0
2     Oregon    20     87 2013   2013  FALSE      TRUE FALSE       0       2013
3 California    69     87 2011   2011  FALSE     FALSE  TRUE       0          0
4     Oregon    50     86 2010   2010  FALSE     FALSE FALSE       0          0
5     Oregon    22     86 2009   2009  FALSE     FALSE  TRUE       0          0
6     Oregon    25     86 2015   2015  FALSE     FALSE FALSE       0          0
  yearth
1   2012
2      0
3   2011
4      0
5   2009
6      0

The interaction are simply multiplications, we are interested in the joint effect
Why might we be interested in this interaction? Cherry * Year
- If we are predicting price, we could see: maybe the influence of cherry on price has changed over time
  - Maybe back in the day there wasn’t a big difference in flavor profile between OR and CAL but this could have changed over time
  - Giving the model a degree of freedom, to specify that this change may be present in the model
  - Seeing Cherry in 2002 could be very different from seeing Cherry in 2012
Something to note… if the model doesn’t work in the future, change those T/F to 1/0

4. Preprocessing (3pts)

Preprocess the dataframe that you created in the previous question using BoxCox, centering and scaling of the numeric features
Create dummy variables for the year factor column

wino <- wino %>% preProcess(method = c("BoxCox", "center", "scale")) %>% predict(wino) %>% 
    dummy_cols(select_columns = c("year_f"), remove_most_frequent_dummy = T, remove_selected_columns = T)

# Why remove freq dummy? Instead of letting the algorithm select which variable
# to be the intercept, we could be 'over identifying' it, that most frequent
# dummy is then implied even tho it is dropped from the dataset

5. Running KNN (5pts)

Split your data into an 80/20 training and test set
Use Caret to run a KNN model that uses your engineered features to predict province

use 5-fold cross validated subsampling
allow Caret to try 15 different values for K

Display the confusion matrix on the test data

set.seed(504)

wine_index <- createDataPartition(wino$province, p = 0.8, list = FALSE)

train <- wino[wine_index, ]

test <- wino[-wine_index, ]

control <- trainControl(method = "cv", number = 5)


fit <- train(province ~ ., data = train, method = "knn", tuneLength = 15, trControl = control)
fit

k-Nearest Neighbors 

6707 samples
  28 predictor
   6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 5365, 5366, 5366, 5366, 5365 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.5995238  0.3621479
   7  0.6080218  0.3687602
   9  0.6139855  0.3733153
  11  0.6202495  0.3804611
  13  0.6220384  0.3804007
  15  0.6317284  0.3949415
  17  0.6281511  0.3863036
  19  0.6283002  0.3854595
  21  0.6306868  0.3886794
  23  0.6329226  0.3912692
  25  0.6314307  0.3885431
  27  0.6251684  0.3774905
  29  0.6262122  0.3787591
  31  0.6245723  0.3749179
  33  0.6244238  0.3737868

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 23.

confusionMatrix(predict(fit, test), factor(test$province))

Confusion Matrix and Statistics

                   Reference
Prediction          Burgundy California Casablanca_Valley Marlborough New_York
  Burgundy               114         25                 4           4        0
  California              63        672                 8          12       11
  Casablanca_Valley        0          0                 1           0        0
  Marlborough              0          0                 2           0        0
  New_York                 0          0                 0           0        1
  Oregon                  61         94                11          29       14
                   Reference
Prediction          Oregon
  Burgundy              36
  California           226
  Casablanca_Valley      0
  Marlborough            1
  New_York               1
  Oregon               283

Overall Statistics
                                          
               Accuracy : 0.6402          
                 95% CI : (0.6166, 0.6632)
    No Information Rate : 0.4728          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.408           
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity                  0.47899            0.8496                0.0384615
Specificity                  0.95192            0.6372                1.0000000
Pos Pred Value               0.62295            0.6774                1.0000000
Neg Pred Value               0.91678            0.8253                0.9850478
Prevalence                   0.14226            0.4728                0.0155409
Detection Rate               0.06814            0.4017                0.0005977
Detection Prevalence         0.10938            0.5929                0.0005977
Balanced Accuracy            0.71545            0.7434                0.5192308
                     Class: Marlborough Class: New_York Class: Oregon
Sensitivity                    0.000000       0.0384615        0.5174
Specificity                    0.998157       0.9993928        0.8144
Pos Pred Value                 0.000000       0.5000000        0.5752
Neg Pred Value                 0.973054       0.9850389        0.7765
Prevalence                     0.026898       0.0155409        0.3270
Detection Rate                 0.000000       0.0005977        0.1692
Detection Prevalence           0.001793       0.0011955        0.2941
Balanced Accuracy              0.499079       0.5189272        0.6659

6. Kappa (2pts)

Is this a good value of Kappa? Why or why not?

Answer: By rule of thumb, it’s pretty good. We’re explaining approximately 37% of the explainable variation in the data (i.e. better than chance)

7. Improvement (2pts)

Looking at the confusion matrix, where do you see room for improvement in your predictions?

Answer: Honestly, it looks like we’re really struggling with the three smaller provinces. We’re also mixing up California and Oregon a lot. Burgundy appears pretty good.