1. Setup (1pt)
Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘wine.rds’ data file from your own computer.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(fastDummies)
library(knitr)
library(rmdformats)
## Global options
options(max.print="75")
opts_chunk$set(cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=75)
wine = read_rds("/Users/Rose/Downloads/pinot.rds")
2. KNN Concepts (5pts)
Explain how the choice of K affects the quality of your prediction when using a K Nearest Neighbors algorithm.
Answer:
When K is too small, you tend to model the noise (random neighbors that aren’t alike).
When K is too large, you tend to lose information available from those in the neighborhood and become to general (too reliant on the population average).
3. Feature Engineering (3pts)
- Remove the taster_name column from the data
- Create a version of the year column that is a factor (instead of numeric)
- Create dummy variables that indicate the presence of “cherry”, “chocolate” and “earth” in the description
- Create 3 new features that represent the interaction between time and the cherry, chocolate and earth inidicators
- Remove the description column from the data
wino <- wine %>%
select(-taster_name) %>%
mutate(year_f = as.factor(year)) %>% # Factor version of year
mutate(cherry = str_detect(description,"cherry")) %>%
mutate(chocolate = str_detect(description,"chocolate")) %>%
mutate(earth = str_detect(description,"earth")) %>%
mutate(ycherry = cherry*year) %>% # Interaction between indicator vars and the continuous time
mutate(ychocolate = chocolate*year) %>%
mutate(yearth = earth*year) %>%
select(-description)
wino %>% head()
province price points year year_f cherry chocolate earth ycherry ychocolate
1 Oregon 65 87 2012 2012 FALSE FALSE TRUE 0 0
2 Oregon 20 87 2013 2013 FALSE TRUE FALSE 0 2013
3 California 69 87 2011 2011 FALSE FALSE TRUE 0 0
4 Oregon 50 86 2010 2010 FALSE FALSE FALSE 0 0
5 Oregon 22 86 2009 2009 FALSE FALSE TRUE 0 0
6 Oregon 25 86 2015 2015 FALSE FALSE FALSE 0 0
yearth
1 2012
2 0
3 2011
4 0
5 2009
6 0
The interaction are simply multiplications, we are interested in the joint effect
- Why might we be interested in this interaction? Cherry * Year
- If we are predicting price, we could see: maybe the influence of cherry on price has changed over time
- Maybe back in the day there wasn’t a big difference in flavor profile between OR and CAL but this could have changed over time
- Giving the model a degree of freedom, to specify that this change may be present in the model
- Seeing Cherry in 2002 could be very different from seeing Cherry in 2012
- If we are predicting price, we could see: maybe the influence of cherry on price has changed over time
Something to note… if the model doesn’t work in the future, change those T/F to 1/0
4. Preprocessing (3pts)
- Preprocess the dataframe that you created in the previous question using BoxCox, centering and scaling of the numeric features
- Create dummy variables for the year factor column
wino <- wino %>% preProcess(method = c("BoxCox", "center", "scale")) %>% predict(wino) %>%
dummy_cols(select_columns = c("year_f"), remove_most_frequent_dummy = T, remove_selected_columns = T)
# Why remove freq dummy? Instead of letting the algorithm select which variable
# to be the intercept, we could be 'over identifying' it, that most frequent
# dummy is then implied even tho it is dropped from the dataset
5. Running KNN (5pts)
- Split your data into an 80/20 training and test set
- Use Caret to run a KNN model that uses your engineered features to predict province
- use 5-fold cross validated subsampling
- allow Caret to try 15 different values for K
- Display the confusion matrix on the test data
set.seed(504)
wine_index <- createDataPartition(wino$province, p = 0.8, list = FALSE)
train <- wino[wine_index, ]
test <- wino[-wine_index, ]
control <- trainControl(method = "cv", number = 5)
fit <- train(province ~ ., data = train, method = "knn", tuneLength = 15, trControl = control)
fit
k-Nearest Neighbors
6707 samples
28 predictor
6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 5365, 5366, 5366, 5366, 5365
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.5995238 0.3621479
7 0.6080218 0.3687602
9 0.6139855 0.3733153
11 0.6202495 0.3804611
13 0.6220384 0.3804007
15 0.6317284 0.3949415
17 0.6281511 0.3863036
19 0.6283002 0.3854595
21 0.6306868 0.3886794
23 0.6329226 0.3912692
25 0.6314307 0.3885431
27 0.6251684 0.3774905
29 0.6262122 0.3787591
31 0.6245723 0.3749179
33 0.6244238 0.3737868
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 23.
Confusion Matrix and Statistics
Reference
Prediction Burgundy California Casablanca_Valley Marlborough New_York
Burgundy 114 25 4 4 0
California 63 672 8 12 11
Casablanca_Valley 0 0 1 0 0
Marlborough 0 0 2 0 0
New_York 0 0 0 0 1
Oregon 61 94 11 29 14
Reference
Prediction Oregon
Burgundy 36
California 226
Casablanca_Valley 0
Marlborough 1
New_York 1
Oregon 283
Overall Statistics
Accuracy : 0.6402
95% CI : (0.6166, 0.6632)
No Information Rate : 0.4728
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.408
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity 0.47899 0.8496 0.0384615
Specificity 0.95192 0.6372 1.0000000
Pos Pred Value 0.62295 0.6774 1.0000000
Neg Pred Value 0.91678 0.8253 0.9850478
Prevalence 0.14226 0.4728 0.0155409
Detection Rate 0.06814 0.4017 0.0005977
Detection Prevalence 0.10938 0.5929 0.0005977
Balanced Accuracy 0.71545 0.7434 0.5192308
Class: Marlborough Class: New_York Class: Oregon
Sensitivity 0.000000 0.0384615 0.5174
Specificity 0.998157 0.9993928 0.8144
Pos Pred Value 0.000000 0.5000000 0.5752
Neg Pred Value 0.973054 0.9850389 0.7765
Prevalence 0.026898 0.0155409 0.3270
Detection Rate 0.000000 0.0005977 0.1692
Detection Prevalence 0.001793 0.0011955 0.2941
Balanced Accuracy 0.499079 0.5189272 0.6659
6. Kappa (2pts)
Is this a good value of Kappa? Why or why not?
Answer: By rule of thumb, it’s pretty good. We’re explaining approximately 37% of the explainable variation in the data (i.e. better than chance)
7. Improvement (2pts)
Looking at the confusion matrix, where do you see room for improvement in your predictions?
Answer: Honestly, it looks like we’re really struggling with the three smaller provinces. We’re also mixing up California and Oregon a lot. Burgundy appears pretty good.