Agenda
- Review of Homework 2
- A human understanding of regression
- Preprocessing and BoxCox
- The KNN algorithm and the Confusion Matrix
- Vocabulary
Setup
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(moderndive)
library(knitr)
library(rmdformats)
## Global options
options(max.print="75")
opts_chunk$set(cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=75)
wine <- read_rds("/Users/Rose/Downloads/wine.rds") %>%
mutate(lprice = log(price))
Reporting Impact from Regressions
Correlation
Calculating correlation
# A tibble: 1 x 2
cor_p cor_lp
<dbl> <dbl>
1 0.404 0.624
Why is the correlation of log price and points higher than regular price and points?
- When you take the log of a price, youre allowing for more normally distributed data (think of a histogram)
- Meaning we are going to have a more evenly distribution around your points in a graph, when there is less skew on either side it is going to be much easier to find a linear relationship
- Justification of why we always take the log of price
- The relationship has an exponential relationship… meaning when we go from 80 to 81 points, the increase is less than if i go from 81 to 82 points - The higher we get in the points scale, the more extreme the changes are in price - 98 to 99 much higher price change than 81 to 82 - We are going to have a stronger correlation with the Log because the relationship is exponential
- Meaning we are going to have a more evenly distribution around your points in a graph, when there is less skew on either side it is going to be much easier to find a linear relationship
Exercise
- Calculate the correlation between log(price) and points
- by variety
- for Oregon Chardonnay, Pinot Noir and Pinot Gris
- in the same tibble
Solution
wine %>%
filter(province=="Oregon") %>%
filter(variety %in% c("Chardonnay","Pinot Noir","Pinot Gris")) %>% # Creating a list of chard, pinot, gris
# Is variety in chardonnay... etc.
group_by(variety) %>%
summarise(correlation=cor(lprice,points))
# A tibble: 3 x 2
variety correlation
* <chr> <dbl>
1 Chardonnay 0.642
2 Pinot Gris 0.490
3 Pinot Noir 0.591
- How would I explain this?
- If wine variety is chardonnay it the points that the wine is rated by raises the price more than the other 2
- We can see this in the graph below
Visualizing these different correlations
wine %>% filter(province == "Oregon") %>% filter(variety %in% c("Chardonnay", "Pinot Noir",
"Pinot Gris")) %>% ggplot(aes(points, lprice, color = variety)) + geom_point(alpha = 0.3) +
facet_wrap(~variety) + geom_smooth(method = lm)
- The line is the steepest for chardonnay and the least steep for Pinot gris
Explanation for why Pinot gris has different relationship
- Typically, Pinot gris are usually priced lower in general as opposed to the other 2
Graphing residuals (bad)
model <- lm(price ~ points, filter(wine, province == "Oregon")) # Relationship betwen price and points
get_regression_points(model) %>% ggplot(aes(points, residual)) + geom_point()
- Larger SD as points go up - There are a lot of points (its dense) around 0 to -50 - Note how it changes when taking the log price
Graphing residuals (good)
model <- lm(lprice ~ points, filter(wine, province == "Oregon"))
get_regression_points(model) %>% ggplot(aes(points, residual)) + geom_point()
- Residual above 0 look very similar to the residuals below 0
- (in other words) our error looks very similar on both sides of our regression line (which is 0)
Interpreting the coefficients
model <- lm(lprice ~ points, filter(wine, province == "Oregon"))
pct = (exp(coef(model)["points"]) - 1) * 100
# Take the exponent of the coefficient and translating it into a percentage
Since we logged the DV, a 1 point ratings increase = 9.85
% increase in price on average. - Think about it as a percentage
Note: \[ (e^x-1)*100 \]
Even moar awesomer with some code fu
# For loop OR just doing the code above for all of these different wines
for (v in c("Chardonnay", "Pinot Gris", "Pinot Noir")) {
m <- lm(lprice ~ points, filter(wine, province == "Oregon", variety == v))
pct <- round((exp(coef(m)["points"]) - 1) * 100, 2)
print(str_c("For ", v, ", a 1 point ratings increase leads to a ", pct, "% increase in price."))
}
[1] "For Chardonnay, a 1 point ratings increase leads to a 11.34% increase in price."
[1] "For Pinot Gris, a 1 point ratings increase leads to a 5.46% increase in price."
[1] "For Pinot Noir, a 1 point ratings increase leads to a 10.27% increase in price."
Logged feature
Taking the log of a feature
model <- lm(price ~ lpoints, filter(wine, province == "Oregon") %>% mutate(lpoints = log(points)))
model
Call:
lm(formula = price ~ lpoints, data = filter(wine, province ==
"Oregon") %>% mutate(lpoints = log(points)))
Coefficients:
(Intercept) lpoints
-1419.0 324.4
Since we logged the IV (feature), a 1% ratings increase = 3.24
increase in price on average.
Note: \[ x/100 \]
LogLog (also elasticity)
Log by a log
model <- lm(lprice ~ lpoints, filter(wine, province == "Oregon") %>% mutate(lpoints = log(points)))
model
Call:
lm(formula = lprice ~ lpoints, data = filter(wine, province ==
"Oregon") %>% mutate(lpoints = log(points)))
Coefficients:
(Intercept) lpoints
-33.770 8.298
…a 1% increase in ratings equals a 8.3
% increase in price on average - Popular way of doing this because it is so easily interpretable
Summary
Only the dependent/response variable is log-transformed. Exponentiate the coefficient, subtract one from this number, and multiply by 100. This gives the percent increase (or decrease) in the response for every one-unit increase in the independent variable. Example: the coefficient is 0.198. (exp(0.198) – 1) * 100 = 21.9. For every one-unit increase in the independent variable, our dependent variable increases by about 22%.
Only independent/predictor variable(s) is log-transformed. Divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units. Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002.
Both dependent/response variable and independent/predictor variable(s) are log-transformed. Interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%.
Graphing points by variety
- How do the means differ by variety?
wine %>% filter(province == "Oregon") %>% filter(variety %in% c("Chardonnay", "Pinot Noir",
"Pinot Gris")) %>% ggplot(aes(variety, points)) + geom_boxplot()
- Chard has highest mean, PN has the same mean but the upper quartile for chard is much larger than PN
Summary
(tmp <- wine %>% filter(province == "Oregon") %>% filter(variety %in% c("Chardonnay",
"Pinot Noir", "Pinot Gris")) %>% group_by(variety) %>% summarise(mean = mean(points)))
# A tibble: 3 x 2
variety mean
* <chr> <dbl>
1 Chardonnay 89.7
2 Pinot Gris 88.5
3 Pinot Noir 89.5
Note:
- The difference between Pinot Gris and Chardonnay is -1.2402471
- The difference between Pinot Noir and Chardonnay is -0.2555915
Regression
model <- lm(points~variety, # Points = dependent & is a categorical variable
filter(wine,province=="Oregon",variety %in% c("Chardonnay","Pinot Noir","Pinot Gris")))
# Each indpenedent variable (variety) is a yes/no variable
get_regression_table(model)
# A tibble: 3 x 7
term estimate std_error statistic p_value lower_ci upper_ci
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept 89.7 0.122 737. 0 89.5 90.0
2 varietyPinot Gris -1.24 0.177 -7.03 0 -1.59 -0.894
3 varietyPinot Noir -0.256 0.132 -1.94 0.053 -0.515 0.003
- What do we notice about these estimate coefficients?
- They are the difference of Pinot gris and Pinot Noir from the intercept
- They are also the same as above ( the estimate is the mean of chardonay )
Assumptions of linear regression
- Linearity of relationship between variables
- Independence of the residuals
- Normality of the residuals
- Equality of variance of the residuals
2. Independence
Given our original model of \(log(price)=m*Points+b\)…
…are there any problems with independence?
3. Normality
model <- lm(lprice ~ points, filter(wine, province == "Oregon"))
get_regression_points(model) %>% ggplot(aes(residual)) + geom_histogram(color = "white")
4. Equality of variance
No equality in the variance
Preprocessing and BoxCox
Setup
Preprocessing
Box-Cox transformations use MLE to estimate \(\lambda\)
\(x^{*} = \frac{x^{\lambda}-1}{\lambda\: \tilde{x}^{\lambda-1}}\)
- when \(\lambda=1\), there is no transformation
- when \(\lambda=0\), it is log transformed
- when \(\lambda=0.5\), it is square root
- when \(\lambda=-1\), it is an inverse
Caret preprocessing is so easy!
wine %>%
preProcess(method = c("BoxCox","center","scale")) %>% # Doing all of these transformations
predict(wine) %>%
select(-description) %>%
head()
province price points year taster_name
1 Oregon 0.7146905 -1.033841 -0.03425331 Paul Gregutt
2 Oregon -1.4139991 -1.033841 0.33313680 Paul Gregutt
3 California 0.8225454 -1.033841 -0.40146088 Virginie Boone
4 Oregon 0.2408520 -1.367723 -0.76848588 Paul Gregutt
5 Oregon -1.2418658 -1.367723 -1.13532834 Paul Gregutt
6 Oregon -1.0109945 -1.367723 1.06846470 Paul Gregutt
- Points, price & year are the only ones transformed
But wait… what is wrong here? - Shouldn’t be thinking of year as a continuous variable
Fixing this issue : Adding year as a factor/categorical
wino <- wine %>% mutate(year_f = as.factor(year))
wino <- wino %>% preProcess(method = c("BoxCox", "center", "scale")) %>% predict(wino)
head(wino %>% select(starts_with("year")))
year year_f
1 -0.03425331 2012
2 0.33313680 2013
3 -0.40146088 2011
4 -0.76848588 2010
5 -1.13532834 2009
6 1.06846470 2015
- Factor year was left alone, continuous year was transformed
- Now we created 16 new variables
The KNN Algorithm
Algorithm
- Load the data
- Initialize K to your chosen number of neighbors
- For each example in the data
- Calculate the distance between the query example and the current example from the data.
- Add the distance and the index of the example to an ordered collection
- Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
- Pick the first K entries from the sorted collection
- Get the labels of the selected K entries
- If regression, return the mean of the K labels
- If classification, return the mode of the K labels
Engineering some features
wino <- wino %>%
mutate(taster_name = fct_lump(taster_name,5)) %>% # Top 5 Tasters
dummy_cols(
select_columns = c("year_f","taster_name"), # Create dummy vars for year and those 5 tasters
remove_most_frequent_dummy = T,
remove_selected_columns = T) %>%
rename_all(funs(tolower(.))) %>%
rename_all(funs(str_replace_all(., "-", "_"))) %>%
rename_all(funs(str_replace_all(., " ", "_"))) %>%
mutate(cherry = str_detect(description,"cherry")) %>% # Now grabbing 3 different description based factors
mutate(chocolate = str_detect(description,"chocolate")) %>%
mutate(earth = str_detect(description,"earth")) %>%
select(-description)
head(wino)
province price points year year_f_1996 year_f_1997
1 Oregon 0.7146905 -1.033841 -0.03425331 0 0
2 Oregon -1.4139991 -1.033841 0.33313680 0 0
year_f_1998 year_f_1999 year_f_2000 year_f_2001 year_f_2002 year_f_2003
1 0 0 0 0 0 0
2 0 0 0 0 0 0
year_f_2004 year_f_2005 year_f_2006 year_f_2007 year_f_2008 year_f_2009
1 0 0 0 0 0 0
2 0 0 0 0 0 0
year_f_2010 year_f_2011 year_f_2012 year_f_2013 year_f_2015
1 0 0 1 0 0
2 0 0 0 1 0
taster_name_jim_gordon taster_name_matt_kettmann taster_name_roger_voss
1 0 0 0
2 0 0 0
taster_name_virginie_boone taster_name_other cherry chocolate earth
1 0 0 FALSE FALSE TRUE
2 0 0 FALSE TRUE FALSE
[ reached getOption("max.print") -- omitted 4 rows ]
Simple model
Predict which state/province a Pinot Noir belongs to:
set.seed(504)
wine_index <- createDataPartition(wino$province, p = 0.8, list = FALSE)
train <- wino[ wine_index, ]
test <- wino[-wine_index, ]
fit <- knn( # Knn function
train = select(train,-province), # Training data
test = select(test,-province), # Testing data
k=5, # 5 nearest neighbors
cl = train$province,
prob = T)
Confusion matrix
Confusion Matrix and Statistics
Reference
Prediction Burgundy California Casablanca_Valley Marlborough New_York
Burgundy 233 0 0 0 0
California 0 771 0 0 0
Casablanca_Valley 0 0 8 4 1
Marlborough 0 0 12 24 10
New_York 0 0 0 9 11
Oregon 5 20 6 8 4
Reference
Prediction Oregon
Burgundy 1
California 0
Casablanca_Valley 4
Marlborough 4
New_York 1
Oregon 537
Overall Statistics
Accuracy : 0.9468
95% CI : (0.9349, 0.9571)
No Information Rate : 0.4728
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9179
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity 0.9790 0.9747 0.307692
Specificity 0.9993 1.0000 0.994536
Pos Pred Value 0.9957 1.0000 0.470588
Neg Pred Value 0.9965 0.9778 0.989130
Prevalence 0.1423 0.4728 0.015541
Detection Rate 0.1393 0.4608 0.004782
Detection Prevalence 0.1399 0.4608 0.010161
Balanced Accuracy 0.9891 0.9874 0.651114
Class: Marlborough Class: New_York Class: Oregon
Sensitivity 0.53333 0.423077 0.9817
Specificity 0.98403 0.993928 0.9618
Pos Pred Value 0.48000 0.523810 0.9259
Neg Pred Value 0.98706 0.990920 0.9909
Prevalence 0.02690 0.015541 0.3270
Detection Rate 0.01435 0.006575 0.3210
Detection Prevalence 0.02989 0.012552 0.3467
Balanced Accuracy 0.75868 0.708503 0.9718
- Looking at how models predictions compare to the actual values in my holdout (test) sample
- We do well with the wines that have the most frequency, but not so well with the provinces with lower frequency (NY, Casablanca, Marlborough)
What happens when Intern comes in “I have 95% accuracy”?
- Do we have leakage? Looks like it
- We only have the year var, features of the description and tasters, it must be the tasters
Kappa statistic
Compares observed accuracy against what would be expected by a random classifier.
- < 0.2 (not so good)
- 0.21 - 0.4 (ok)
- 0.41 - 0.6 (pretty good)
- 0.6 - 0.8 (great)
- > 0.8 (almost perfect)
…whoa! What’s going on here?
Fixing the leak
wino <- select(wino, -starts_with("taster")) # get rid of the taster variables
set.seed(504)
wine_index <- createDataPartition(wino$province, p = 0.8, list = FALSE)
train <- wino[wine_index, ]
test <- wino[-wine_index, ]
fit <- knn(train = select(train, -province), test = select(test, -province), k = 5,
cl = train$province, prob = T)
Confusion matrix
Confusion Matrix and Statistics
Reference
Prediction Burgundy California Casablanca_Valley Marlborough New_York
Burgundy 118 40 1 3 0
California 52 587 8 12 9
Casablanca_Valley 2 4 0 1 1
Marlborough 3 6 3 2 3
New_York 1 0 0 2 2
Oregon 62 154 14 25 11
Reference
Prediction Oregon
Burgundy 46
California 193
Casablanca_Valley 2
Marlborough 12
New_York 1
Oregon 293
Overall Statistics
Accuracy : 0.5989
95% CI : (0.575, 0.6225)
No Information Rate : 0.4728
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3625
Mcnemar's Test P-Value : 5.151e-05
Statistics by Class:
Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity 0.49580 0.7421 0.000000
Specificity 0.93728 0.6893 0.993928
Pos Pred Value 0.56731 0.6818 0.000000
Neg Pred Value 0.91809 0.7488 0.984366
Prevalence 0.14226 0.4728 0.015541
Detection Rate 0.07053 0.3509 0.000000
Detection Prevalence 0.12433 0.5146 0.005977
Balanced Accuracy 0.71654 0.7157 0.496964
Class: Marlborough Class: New_York Class: Oregon
Sensitivity 0.044444 0.076923 0.5356
Specificity 0.983415 0.997571 0.7638
Pos Pred Value 0.068966 0.333333 0.5242
Neg Pred Value 0.973844 0.985603 0.7720
Prevalence 0.026898 0.015541 0.3270
Detection Rate 0.001195 0.001195 0.1751
Detection Prevalence 0.017334 0.003586 0.3341
Balanced Accuracy 0.513930 0.537247 0.6497
- More realistic representation, kappa of .36
- Guessing the top well, but the smaller provinces are suffering
- Explaining 36% of the explainable improvement
Basic model with parameter tuning
Using Caret framework to do some meta modeling (parameter tuning)
control <- trainControl(method = "boot", number = 1)
# Model tuning:- only on 1 number to make it run fast
fit <- train(province ~ .,
data = train, # Running on the training set
method = "knn",
tuneLength = 15, # Model tuning:- number of neighbors
trControl = control)
fit
k-Nearest Neighbors
6707 samples
25 predictor
6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon'
No pre-processing
Resampling: Bootstrapped (1 reps)
Summary of sample sizes: 6707
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.5523422 0.2997320
7 0.5747454 0.3284182
9 0.5804481 0.3316907
11 0.5873727 0.3382247
13 0.5951120 0.3494213
15 0.5959267 0.3488393
17 0.6036660 0.3567968
19 0.6101833 0.3667269
21 0.6077393 0.3612561
23 0.6061100 0.3570110
25 0.6061100 0.3562455
27 0.6085540 0.3575035
29 0.6142566 0.3671831
31 0.6138493 0.3646175
33 0.6085540 0.3555932
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 29.
- Looking at the optimal numbers of K (amount of neighbors)
Confusion Matrix
Confusion Matrix and Statistics
Reference
Prediction Burgundy California Casablanca_Valley Marlborough New_York
Burgundy 118 22 5 4 0
California 65 684 7 13 13
Casablanca_Valley 0 1 1 0 0
Marlborough 0 0 1 0 1
New_York 0 0 0 1 0
Oregon 55 84 12 27 12
Reference
Prediction Oregon
Burgundy 41
California 258
Casablanca_Valley 0
Marlborough 2
New_York 0
Oregon 246
Overall Statistics
Accuracy : 0.627
95% CI : (0.6033, 0.6502)
No Information Rate : 0.4728
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3831
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity 0.49580 0.8647 0.0384615
Specificity 0.94983 0.5964 0.9993928
Pos Pred Value 0.62105 0.6577 0.5000000
Neg Pred Value 0.91908 0.8310 0.9850389
Prevalence 0.14226 0.4728 0.0155409
Detection Rate 0.07053 0.4088 0.0005977
Detection Prevalence 0.11357 0.6216 0.0011955
Balanced Accuracy 0.72281 0.7306 0.5189272
Class: Marlborough Class: New_York Class: Oregon
Sensitivity 0.000000 0.0000000 0.4497
Specificity 0.997543 0.9993928 0.8313
Pos Pred Value 0.000000 0.0000000 0.5642
Neg Pred Value 0.973038 0.9844498 0.7567
Prevalence 0.026898 0.0155409 0.3270
Detection Rate 0.000000 0.0000000 0.1470
Detection Prevalence 0.002391 0.0005977 0.2606
Balanced Accuracy 0.498771 0.4996964 0.6405
With parameter tuning and subsampling
Finding the optimal Kappa finding the right amount of neighbors
fit <- train(province ~ ., data = train, method = "knn", tuneLength = 15, metric = "Kappa",
trControl = control)
fit
k-Nearest Neighbors
6707 samples
25 predictor
6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon'
No pre-processing
Resampling: Bootstrapped (1 reps)
Summary of sample sizes: 6707
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.5579592 0.3149479
7 0.5755102 0.3370134
9 0.5808163 0.3403683
11 0.5853061 0.3397337
13 0.5922449 0.3489705
15 0.5910204 0.3455425
17 0.5991837 0.3571511
19 0.5971429 0.3519378
21 0.5873469 0.3338707
23 0.5897959 0.3390258
25 0.5967347 0.3472139
27 0.5959184 0.3456839
29 0.5942857 0.3419408
31 0.5938776 0.3389515
33 0.6036735 0.3552427
Kappa was used to select the optimal model using the largest value.
The final value used for the model was k = 17.
Tuning plot
- 17 may be due to some random sub sampling
- Can and always should set.seed()
- We look at a graph like this for that exact insight, “do i need 30 neighbors? or does it max out after 15”
Bonus: KNN for regression
k-Nearest Neighbors
6707 samples
25 predictor
No pre-processing
Resampling: Bootstrapped (1 reps)
Summary of sample sizes: 6707
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 0.7438022 0.4606165 0.5783852
7 0.7380353 0.4654830 0.5745082
9 0.7352568 0.4664593 0.5736572
11 0.7321121 0.4699287 0.5695127
13 0.7329237 0.4686406 0.5705231
15 0.7311215 0.4714167 0.5696308
17 0.7271077 0.4776711 0.5661544
19 0.7282437 0.4764498 0.5672856
21 0.7266817 0.4793160 0.5657744
23 0.7250526 0.4826010 0.5629512
25 0.7278982 0.4789688 0.5650174
27 0.7287374 0.4787843 0.5661180
29 0.7264588 0.4836403 0.5666414
31 0.7253892 0.4867730 0.5668266
33 0.7277723 0.4837948 0.5690161
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 23.