Wk 3 - KNN, Transformations, Human Understanding of Regression

Matt

2/4/2021

Reporting Impact from Regressions

Calculating correlation

# A tibble: 1 x 2
  cor_p cor_lp
  <dbl>  <dbl>
1 0.404  0.624
  • Why is the correlation of log price and points higher than regular price and points?

  • When you take the log of a price, youre allowing for more normally distributed data (think of a histogram)
    • Meaning we are going to have a more evenly distribution around your points in a graph, when there is less skew on either side it is going to be much easier to find a linear relationship
      • Justification of why we always take the log of price
      • The relationship has an exponential relationship… meaning when we go from 80 to 81 points, the increase is less than if i go from 81 to 82 points - The higher we get in the points scale, the more extreme the changes are in price - 98 to 99 much higher price change than 81 to 82 - We are going to have a stronger correlation with the Log because the relationship is exponential

Exercise

  1. Calculate the correlation between log(price) and points
  2. by variety
  3. for Oregon Chardonnay, Pinot Noir and Pinot Gris
  4. in the same tibble

Solution

# A tibble: 3 x 2
  variety    correlation
* <chr>            <dbl>
1 Chardonnay       0.642
2 Pinot Gris       0.490
3 Pinot Noir       0.591
  • How would I explain this?
    • If wine variety is chardonnay it the points that the wine is rated by raises the price more than the other 2
    • We can see this in the graph below

Visualizing these different correlations

- The line is the steepest for chardonnay and the least steep for Pinot gris

  • Explanation for why Pinot gris has different relationship

    • Typically, Pinot gris are usually priced lower in general as opposed to the other 2

Graphing residuals (bad)

- Larger SD as points go up - There are a lot of points (its dense) around 0 to -50 - Note how it changes when taking the log price

Graphing residuals (good)

- Residual above 0 look very similar to the residuals below 0

  • (in other words) our error looks very similar on both sides of our regression line (which is 0)

Interpreting the coefficients

Since we logged the DV, a 1 point ratings increase = 9.85% increase in price on average. - Think about it as a percentage

Note: \[ (e^x-1)*100 \]

Even moar awesomer with some code fu

[1] "For Chardonnay, a 1 point ratings increase leads to a 11.34% increase in price."
[1] "For Pinot Gris, a 1 point ratings increase leads to a 5.46% increase in price."
[1] "For Pinot Noir, a 1 point ratings increase leads to a 10.27% increase in price."

Logged feature

Taking the log of a feature


Call:
lm(formula = price ~ lpoints, data = filter(wine, province == 
    "Oregon") %>% mutate(lpoints = log(points)))

Coefficients:
(Intercept)      lpoints  
    -1419.0        324.4  

Since we logged the IV (feature), a 1% ratings increase = 3.24 increase in price on average.

Note: \[ x/100 \]

LogLog (also elasticity)

Log by a log


Call:
lm(formula = lprice ~ lpoints, data = filter(wine, province == 
    "Oregon") %>% mutate(lpoints = log(points)))

Coefficients:
(Intercept)      lpoints  
    -33.770        8.298  

…a 1% increase in ratings equals a 8.3% increase in price on average - Popular way of doing this because it is so easily interpretable

Summary

  • Only the dependent/response variable is log-transformed. Exponentiate the coefficient, subtract one from this number, and multiply by 100. This gives the percent increase (or decrease) in the response for every one-unit increase in the independent variable. Example: the coefficient is 0.198. (exp(0.198) – 1) * 100 = 21.9. For every one-unit increase in the independent variable, our dependent variable increases by about 22%.

  • Only independent/predictor variable(s) is log-transformed. Divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units. Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002.

  • Both dependent/response variable and independent/predictor variable(s) are log-transformed. Interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%.

Graphing points by variety

  • How do the means differ by variety?

- Chard has highest mean, PN has the same mean but the upper quartile for chard is much larger than PN

Summary

# A tibble: 3 x 2
  variety     mean
* <chr>      <dbl>
1 Chardonnay  89.7
2 Pinot Gris  88.5
3 Pinot Noir  89.5

Note:

  1. The difference between Pinot Gris and Chardonnay is -1.2402471
  2. The difference between Pinot Noir and Chardonnay is -0.2555915

Regression

# A tibble: 3 x 7
  term              estimate std_error statistic p_value lower_ci upper_ci
  <chr>                <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept           89.7       0.122    737.     0       89.5     90.0  
2 varietyPinot Gris   -1.24      0.177     -7.03   0       -1.59    -0.894
3 varietyPinot Noir   -0.256     0.132     -1.94   0.053   -0.515    0.003
  • What do we notice about these estimate coefficients?
    • They are the difference of Pinot gris and Pinot Noir from the intercept
    • They are also the same as above ( the estimate is the mean of chardonay )

Assumptions of linear regression

  1. Linearity of relationship between variables
  2. Independence of the residuals
  3. Normality of the residuals
  4. Equality of variance of the residuals

1. Linearity of relationship

Credit: Modern Dive

What would the residuals look like?

2. Independence

Given our original model of \(log(price)=m*Points+b\)

…are there any problems with independence?

No equality in the variance

Credit: Modern Dive

Preprocessing and BoxCox

Preprocessing

Box-Cox transformations use MLE to estimate \(\lambda\)

\(x^{*} = \frac{x^{\lambda}-1}{\lambda\: \tilde{x}^{\lambda-1}}\)

  • when \(\lambda=1\), there is no transformation
  • when \(\lambda=0\), it is log transformed
  • when \(\lambda=0.5\), it is square root
  • when \(\lambda=-1\), it is an inverse

Caret preprocessing is so easy!

    province      price    points        year    taster_name
1     Oregon  0.7146905 -1.033841 -0.03425331   Paul Gregutt
2     Oregon -1.4139991 -1.033841  0.33313680   Paul Gregutt
3 California  0.8225454 -1.033841 -0.40146088 Virginie Boone
4     Oregon  0.2408520 -1.367723 -0.76848588   Paul Gregutt
5     Oregon -1.2418658 -1.367723 -1.13532834   Paul Gregutt
6     Oregon -1.0109945 -1.367723  1.06846470   Paul Gregutt
  • Points, price & year are the only ones transformed

But wait… what is wrong here? - Shouldn’t be thinking of year as a continuous variable

Fixing this issue : Adding year as a factor/categorical

         year year_f
1 -0.03425331   2012
2  0.33313680   2013
3 -0.40146088   2011
4 -0.76848588   2010
5 -1.13532834   2009
6  1.06846470   2015
  • Factor year was left alone, continuous year was transformed
    • Now we created 16 new variables

The KNN Algorithm

Algorithm

  1. Load the data
  2. Initialize K to your chosen number of neighbors
  3. For each example in the data
  4. Calculate the distance between the query example and the current example from the data.
  5. Add the distance and the index of the example to an ordered collection
  6. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
  7. Pick the first K entries from the sorted collection
  8. Get the labels of the selected K entries
  9. If regression, return the mean of the K labels
  10. If classification, return the mode of the K labels

Engineering some features

    province      price    points        year year_f_1996 year_f_1997
1     Oregon  0.7146905 -1.033841 -0.03425331           0           0
2     Oregon -1.4139991 -1.033841  0.33313680           0           0
  year_f_1998 year_f_1999 year_f_2000 year_f_2001 year_f_2002 year_f_2003
1           0           0           0           0           0           0
2           0           0           0           0           0           0
  year_f_2004 year_f_2005 year_f_2006 year_f_2007 year_f_2008 year_f_2009
1           0           0           0           0           0           0
2           0           0           0           0           0           0
  year_f_2010 year_f_2011 year_f_2012 year_f_2013 year_f_2015
1           0           0           1           0           0
2           0           0           0           1           0
  taster_name_jim_gordon taster_name_matt_kettmann taster_name_roger_voss
1                      0                         0                      0
2                      0                         0                      0
  taster_name_virginie_boone taster_name_other cherry chocolate earth
1                          0                 0  FALSE     FALSE  TRUE
2                          0                 0  FALSE      TRUE FALSE
 [ reached getOption("max.print") -- omitted 4 rows ]

Confusion matrix

Confusion Matrix and Statistics

                   Reference
Prediction          Burgundy California Casablanca_Valley Marlborough New_York
  Burgundy               233          0                 0           0        0
  California               0        771                 0           0        0
  Casablanca_Valley        0          0                 8           4        1
  Marlborough              0          0                12          24       10
  New_York                 0          0                 0           9       11
  Oregon                   5         20                 6           8        4
                   Reference
Prediction          Oregon
  Burgundy               1
  California             0
  Casablanca_Valley      4
  Marlborough            4
  New_York               1
  Oregon               537

Overall Statistics
                                          
               Accuracy : 0.9468          
                 95% CI : (0.9349, 0.9571)
    No Information Rate : 0.4728          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9179          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity                   0.9790            0.9747                 0.307692
Specificity                   0.9993            1.0000                 0.994536
Pos Pred Value                0.9957            1.0000                 0.470588
Neg Pred Value                0.9965            0.9778                 0.989130
Prevalence                    0.1423            0.4728                 0.015541
Detection Rate                0.1393            0.4608                 0.004782
Detection Prevalence          0.1399            0.4608                 0.010161
Balanced Accuracy             0.9891            0.9874                 0.651114
                     Class: Marlborough Class: New_York Class: Oregon
Sensitivity                     0.53333        0.423077        0.9817
Specificity                     0.98403        0.993928        0.9618
Pos Pred Value                  0.48000        0.523810        0.9259
Neg Pred Value                  0.98706        0.990920        0.9909
Prevalence                      0.02690        0.015541        0.3270
Detection Rate                  0.01435        0.006575        0.3210
Detection Prevalence            0.02989        0.012552        0.3467
Balanced Accuracy               0.75868        0.708503        0.9718
  • Looking at how models predictions compare to the actual values in my holdout (test) sample
  • We do well with the wines that have the most frequency, but not so well with the provinces with lower frequency (NY, Casablanca, Marlborough)

What happens when Intern comes in “I have 95% accuracy”?

  • Do we have leakage? Looks like it
    • We only have the year var, features of the description and tasters, it must be the tasters

Kappa statistic

Compares observed accuracy against what would be expected by a random classifier.

  • < 0.2 (not so good)
  • 0.21 - 0.4 (ok)
  • 0.41 - 0.6 (pretty good)
  • 0.6 - 0.8 (great)
  • > 0.8 (almost perfect)

…whoa! What’s going on here?

Confusion matrix

Confusion Matrix and Statistics

                   Reference
Prediction          Burgundy California Casablanca_Valley Marlborough New_York
  Burgundy               118         40                 1           3        0
  California              52        587                 8          12        9
  Casablanca_Valley        2          4                 0           1        1
  Marlborough              3          6                 3           2        3
  New_York                 1          0                 0           2        2
  Oregon                  62        154                14          25       11
                   Reference
Prediction          Oregon
  Burgundy              46
  California           193
  Casablanca_Valley      2
  Marlborough           12
  New_York               1
  Oregon               293

Overall Statistics
                                         
               Accuracy : 0.5989         
                 95% CI : (0.575, 0.6225)
    No Information Rate : 0.4728         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.3625         
                                         
 Mcnemar's Test P-Value : 5.151e-05      

Statistics by Class:

                     Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity                  0.49580            0.7421                 0.000000
Specificity                  0.93728            0.6893                 0.993928
Pos Pred Value               0.56731            0.6818                 0.000000
Neg Pred Value               0.91809            0.7488                 0.984366
Prevalence                   0.14226            0.4728                 0.015541
Detection Rate               0.07053            0.3509                 0.000000
Detection Prevalence         0.12433            0.5146                 0.005977
Balanced Accuracy            0.71654            0.7157                 0.496964
                     Class: Marlborough Class: New_York Class: Oregon
Sensitivity                    0.044444        0.076923        0.5356
Specificity                    0.983415        0.997571        0.7638
Pos Pred Value                 0.068966        0.333333        0.5242
Neg Pred Value                 0.973844        0.985603        0.7720
Prevalence                     0.026898        0.015541        0.3270
Detection Rate                 0.001195        0.001195        0.1751
Detection Prevalence           0.017334        0.003586        0.3341
Balanced Accuracy              0.513930        0.537247        0.6497
  • More realistic representation, kappa of .36
    • Guessing the top well, but the smaller provinces are suffering
    • Explaining 36% of the explainable improvement

Basic model with parameter tuning

Using Caret framework to do some meta modeling (parameter tuning)

k-Nearest Neighbors 

6707 samples
  25 predictor
   6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon' 

No pre-processing
Resampling: Bootstrapped (1 reps) 
Summary of sample sizes: 6707 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.5523422  0.2997320
   7  0.5747454  0.3284182
   9  0.5804481  0.3316907
  11  0.5873727  0.3382247
  13  0.5951120  0.3494213
  15  0.5959267  0.3488393
  17  0.6036660  0.3567968
  19  0.6101833  0.3667269
  21  0.6077393  0.3612561
  23  0.6061100  0.3570110
  25  0.6061100  0.3562455
  27  0.6085540  0.3575035
  29  0.6142566  0.3671831
  31  0.6138493  0.3646175
  33  0.6085540  0.3555932

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 29.
  • Looking at the optimal numbers of K (amount of neighbors)

Confusion Matrix

Confusion Matrix and Statistics

                   Reference
Prediction          Burgundy California Casablanca_Valley Marlborough New_York
  Burgundy               118         22                 5           4        0
  California              65        684                 7          13       13
  Casablanca_Valley        0          1                 1           0        0
  Marlborough              0          0                 1           0        1
  New_York                 0          0                 0           1        0
  Oregon                  55         84                12          27       12
                   Reference
Prediction          Oregon
  Burgundy              41
  California           258
  Casablanca_Valley      0
  Marlborough            2
  New_York               0
  Oregon               246

Overall Statistics
                                          
               Accuracy : 0.627           
                 95% CI : (0.6033, 0.6502)
    No Information Rate : 0.4728          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3831          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity                  0.49580            0.8647                0.0384615
Specificity                  0.94983            0.5964                0.9993928
Pos Pred Value               0.62105            0.6577                0.5000000
Neg Pred Value               0.91908            0.8310                0.9850389
Prevalence                   0.14226            0.4728                0.0155409
Detection Rate               0.07053            0.4088                0.0005977
Detection Prevalence         0.11357            0.6216                0.0011955
Balanced Accuracy            0.72281            0.7306                0.5189272
                     Class: Marlborough Class: New_York Class: Oregon
Sensitivity                    0.000000       0.0000000        0.4497
Specificity                    0.997543       0.9993928        0.8313
Pos Pred Value                 0.000000       0.0000000        0.5642
Neg Pred Value                 0.973038       0.9844498        0.7567
Prevalence                     0.026898       0.0155409        0.3270
Detection Rate                 0.000000       0.0000000        0.1470
Detection Prevalence           0.002391       0.0005977        0.2606
Balanced Accuracy              0.498771       0.4996964        0.6405

With parameter tuning and subsampling

Finding the optimal Kappa finding the right amount of neighbors

k-Nearest Neighbors 

6707 samples
  25 predictor
   6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon' 

No pre-processing
Resampling: Bootstrapped (1 reps) 
Summary of sample sizes: 6707 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.5579592  0.3149479
   7  0.5755102  0.3370134
   9  0.5808163  0.3403683
  11  0.5853061  0.3397337
  13  0.5922449  0.3489705
  15  0.5910204  0.3455425
  17  0.5991837  0.3571511
  19  0.5971429  0.3519378
  21  0.5873469  0.3338707
  23  0.5897959  0.3390258
  25  0.5967347  0.3472139
  27  0.5959184  0.3456839
  29  0.5942857  0.3419408
  31  0.5938776  0.3389515
  33  0.6036735  0.3552427

Kappa was used to select the optimal model using the largest value.
The final value used for the model was k = 17.

Tuning plot

  • 17 may be due to some random sub sampling
    • Can and always should set.seed()
  • We look at a graph like this for that exact insight, “do i need 30 neighbors? or does it max out after 15”

Bonus: KNN for regression

k-Nearest Neighbors 

6707 samples
  25 predictor

No pre-processing
Resampling: Bootstrapped (1 reps) 
Summary of sample sizes: 6707 
Resampling results across tuning parameters:

  k   RMSE       Rsquared   MAE      
   5  0.7438022  0.4606165  0.5783852
   7  0.7380353  0.4654830  0.5745082
   9  0.7352568  0.4664593  0.5736572
  11  0.7321121  0.4699287  0.5695127
  13  0.7329237  0.4686406  0.5705231
  15  0.7311215  0.4714167  0.5696308
  17  0.7271077  0.4776711  0.5661544
  19  0.7282437  0.4764498  0.5672856
  21  0.7266817  0.4793160  0.5657744
  23  0.7250526  0.4826010  0.5629512
  25  0.7278982  0.4789688  0.5650174
  27  0.7287374  0.4787843  0.5661180
  29  0.7264588  0.4836403  0.5666414
  31  0.7253892  0.4867730  0.5668266
  33  0.7277723  0.4837948  0.5690161

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 23.