Homework 3 - KNN & Preprocessing & Kappa

Matt

2/10/2021

2. KNN Concepts (5pts)

Explain how the choice of K affects the quality of your prediction when using a K Nearest Neighbors algorithm.

Answer:

  • When K is too small, you tend to model the noise: each prediction hinges on a handful of neighbors that may not actually be alike (high variance / overfitting).

  • When K is too large, you tend to lose the information available from the local neighborhood and become too general, leaning on the population majority instead (high bias / underfitting). The toy sketch below shows both extremes.
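
A minimal toy illustration of both failure modes, using class::knn on the built-in iris data (the dataset and all object names here are purely for demonstration, not part of the assignment):

library(class)  # provides knn()

set.seed(42)
idx     <- sample(nrow(iris), 100)
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]
train_y <- iris$Species[idx]
test_y  <- iris$Species[-idx]

# k = 1 chases individual (possibly noisy) points; k = 99 uses nearly all
# 100 training rows, so every prediction collapses toward a majority vote
for (k in c(1, 15, 99)) {
  pred <- knn(train_x, test_x, train_y, k = k)
  cat("k =", k, " accuracy =", round(mean(pred == test_y), 3), "\n")
}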

3. Feature Engineering (3pts)

  1. Remove the taster_name column from the data
  2. Create a version of the year column that is a factor (instead of numeric)
  3. Create dummy variables that indicate the presence of “cherry”, “chocolate” and “earth” in the description
  4. Create 3 new features that represent the interaction between time and the cherry, chocolate and earth indicators
  5. Remove the description column from the data (a code sketch of these steps follows)
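
A minimal sketch of one way to do these steps with dplyr and stringr; the data frame name wine and the exact column names are assumptions chosen to match the output printed below:

library(dplyr)
library(stringr)

wine <- wine %>%
  select(-taster_name) %>%                                # 1. drop taster_name
  mutate(year_f = factor(year)) %>%                       # 2. factor version of year
  mutate(cherry    = str_detect(description, "cherry"),   # 3. flavor indicators
         chocolate = str_detect(description, "chocolate"),
         earth     = str_detect(description, "earth")) %>%
  mutate(ycherry    = year * cherry,                      # 4. time interactions
         ychocolate = year * chocolate,                   #    (logical coerces to 0/1)
         yearth     = year * earth) %>%
  select(-description)                                    # 5. drop description

head(wine)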
    province price points year year_f cherry chocolate earth ycherry ychocolate yearth
1     Oregon    65     87 2012   2012  FALSE     FALSE  TRUE       0          0   2012
2     Oregon    20     87 2013   2013  FALSE      TRUE FALSE       0       2013      0
3 California    69     87 2011   2011  FALSE     FALSE  TRUE       0          0   2011
4     Oregon    50     86 2010   2010  FALSE     FALSE FALSE       0          0      0
5     Oregon    22     86 2009   2009  FALSE     FALSE  TRUE       0          0   2009
6     Oregon    25     86 2015   2015  FALSE     FALSE FALSE       0          0      0
  • The interactions are simply multiplications; we are interested in the joint effect

  • Why might we be interested in this interaction? Cherry * Year (see the small lm() sketch below)
    • If we are predicting price, we might find that the influence of cherry on price has changed over time
      • Maybe back in the day there wasn’t a big difference in flavor profile between Oregon and California wines, but this could have changed over time
      • The interaction gives the model a degree of freedom to capture that change, if it is present
      • Seeing “cherry” in a 2002 description could mean something very different from seeing it in a 2012 description
  • Something to note… if the model doesn’t work later on, change those TRUE/FALSE columns to 1/0 (some modeling functions handle numeric indicators better than logicals)
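
To make the joint-effect idea concrete, here is a small sketch of the same interaction in a linear model (assuming the wine frame from above), where R expands cherry * year into main effects plus the product term:

# cherry * year expands to cherry + year + cherry:year, letting the
# effect of cherry on price vary by year
m <- lm(price ~ cherry * year, data = wine)
summary(m)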

5. Running KNN (5pts)

  1. Split your data into an 80/20 training and test set
  2. Use Caret to run a KNN model that uses your engineered features to predict province
     • use 5-fold cross-validated subsampling
     • allow Caret to try 15 different values for K
  3. Display the confusion matrix on the test data (a code sketch of these steps follows)
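
A minimal sketch of the fit, assuming the engineered wine frame from above with province stored as a factor; the seed, object names, and the use of createDataPartition for the stratified 80/20 split are assumptions:

library(caret)

set.seed(504)
idx        <- createDataPartition(wine$province, p = 0.8, list = FALSE)
wine_train <- wine[idx, ]
wine_test  <- wine[-idx, ]

fit <- train(province ~ .,           # formula interface dummy-codes the factors
             data = wine_train,
             method = "knn",
             tuneLength = 15,        # try 15 candidate values of k
             trControl = trainControl(method = "cv", number = 5))
fit

confusionMatrix(predict(fit, wine_test), wine_test$province)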
k-Nearest Neighbors 

6707 samples
  28 predictor
   6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 5365, 5366, 5366, 5366, 5365 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.5995238  0.3621479
   7  0.6080218  0.3687602
   9  0.6139855  0.3733153
  11  0.6202495  0.3804611
  13  0.6220384  0.3804007
  15  0.6317284  0.3949415
  17  0.6281511  0.3863036
  19  0.6283002  0.3854595
  21  0.6306868  0.3886794
  23  0.6329226  0.3912692
  25  0.6314307  0.3885431
  27  0.6251684  0.3774905
  29  0.6262122  0.3787591
  31  0.6245723  0.3749179
  33  0.6244238  0.3737868

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 23.
Confusion Matrix and Statistics

                   Reference
Prediction          Burgundy California Casablanca_Valley Marlborough New_York Oregon
  Burgundy               114         25                 4           4        0     36
  California              63        672                 8          12       11    226
  Casablanca_Valley        0          0                 1           0        0      0
  Marlborough              0          0                 2           0        0      1
  New_York                 0          0                 0           0        1      1
  Oregon                  61         94                11          29       14    283

Overall Statistics
                                          
               Accuracy : 0.6402          
                 95% CI : (0.6166, 0.6632)
    No Information Rate : 0.4728          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.408           
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Burgundy Class: California Class: Casablanca_Valley Class: Marlborough Class: New_York Class: Oregon
Sensitivity                  0.47899            0.8496                0.0384615           0.000000       0.0384615        0.5174
Specificity                  0.95192            0.6372                1.0000000           0.998157       0.9993928        0.8144
Pos Pred Value               0.62295            0.6774                1.0000000           0.000000       0.5000000        0.5752
Neg Pred Value               0.91678            0.8253                0.9850478           0.973054       0.9850389        0.7765
Prevalence                   0.14226            0.4728                0.0155409           0.026898       0.0155409        0.3270
Detection Rate               0.06814            0.4017                0.0005977           0.000000       0.0005977        0.1692
Detection Prevalence         0.10938            0.5929                0.0005977           0.001793       0.0011955        0.2941
Balanced Accuracy            0.71545            0.7434                0.5192308           0.499079       0.5189272        0.6659

6. Kappa (2pts)

Is this a good value of Kappa? Why or why not?

Answer: By rule of thumb, it’s reasonably good. The test-set Kappa of 0.408 means we’re capturing about 41% of the achievable improvement over chance agreement (i.e. clearly better than chance, though far from perfect).
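
For reference, Kappa compares the observed accuracy $p_o$ with the accuracy $p_e$ expected from the row and column marginals of the confusion matrix alone. Plugging in the test-set values above ($p_o = 0.6402$; $p_e \approx 0.3921$, computed from the marginals of the confusion matrix):

$$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.6402 - 0.3921}{1 - 0.3921} \approx 0.408$$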

7. Improvement (2pts)

Looking at the confusion matrix, where do you see room for improvement in your predictions?

Answer: Honestly, it looks like we’re really struggling with the three smaller provinces: Casablanca_Valley, Marlborough and New_York have sensitivities near zero, so we almost never predict them. We’re also mixing up California and Oregon a lot (226 Oregon wines were predicted as California, and 94 California wines as Oregon). Burgundy looks pretty good by comparison (balanced accuracy of about 0.72).