Directions:
Please turn in both a knitted HTML file and your Rmd file on WISE.
Good luck!
Change the author of this Rmd file to be yourself and modify the code below so that you can successfully load the 'wine.rds' data file from your own computer.
# global chunk options: show code, hide messages and warnings
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)   # dplyr, stringr, ggplot2, etc.
library(caret)       # partitioning, preprocessing, model training
library(class)       # knn
library(fastDummies) # dummy_cols
wine <- readRDS("C:/Users/Kari/Downloads/pinot.rds")
Explain how the choice of K affects the quality of your prediction when using a K Nearest Neighbors algorithm.
Answer: K is the number of neighbors that vote on each classification, so it controls a bias-variance trade-off. If you grow K too large, the vote pulls in points far from the observation and predictions drift toward the overall class distribution, so K becomes less informative. Conversely, if K is too small, like k = 1, the prediction just copies the single nearest training point, which memorizes noise rather than picking up the broader trends.
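To make that concrete, here is a minimal sketch of the trade-off on simulated data (not the wine data; the toy columns x1, x2, and label are made up for illustration, and class is already loaded above):

# toy data: 80 points of class A near (0,0), 20 of class B near (3,3)
set.seed(1)
toy <- data.frame(
  x1 = c(rnorm(80, 0), rnorm(20, 3)),
  x2 = c(rnorm(80, 0), rnorm(20, 3)),
  label = factor(rep(c("A", "B"), times = c(80, 20)))
)

# k = 1: each point's nearest neighbor is itself, so the training data
# is memorized perfectly (and any noise would be memorized too)
pred_k1 <- knn(toy[, 1:2], toy[, 1:2], toy$label, k = 1)
mean(pred_k1 == toy$label)   # 1.0

# k = 99: the vote includes nearly every point, so the 80/20 class
# imbalance wins everywhere and everything is predicted "A"
pred_k99 <- knn(toy[, 1:2], toy[, 1:2], toy$label, k = 99)
table(pred_k99)              # all 100 predictions are "A"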
wino <- wine %>%
  # standardize column names: lowercase, underscores instead of dashes/spaces
  rename_all(~ tolower(.x)) %>%
  rename_all(~ str_replace_all(.x, "-", "_")) %>%
  rename_all(~ str_replace_all(.x, " ", "_")) %>%
  # keep year as a factor for dummy coding later
  mutate(year_f = as.factor(year)) %>%
  # flag tasting notes that mention each descriptor
  mutate(cherry = str_detect(description, "cherry")) %>%
  mutate(chocolate = str_detect(description, "chocolate")) %>%
  mutate(earth = str_detect(description, "earth")) %>%
  # descriptor-by-year interaction terms
  mutate(earthyr = earth * year) %>%
  mutate(cherryyr = cherry * year) %>%
  mutate(chocolateyr = chocolate * year) %>%
  select(-description, -taster_name)
# Box-Cox transform, then center and scale the numeric features
# (note: this runs before the train/test split, so test rows inform
# the preprocessing estimates)
wino <- wino %>%
  preProcess(method = c("BoxCox", "center", "scale")) %>%
  predict(wino)
# one-hot encode the year factor, dropping the most frequent level as
# the baseline and removing the original column
wino <- wino %>%
  dummy_cols(
    select_columns = "year_f",
    remove_most_frequent_dummy = TRUE,
    remove_selected_columns = TRUE
  )
# set a seed so the partition (and the results below) are reproducible
set.seed(505)
wine_index <- createDataPartition(wino$province, p = 0.8, list = FALSE)
train <- wino[wine_index, ]
test <- wino[-wine_index, ]
# 5-fold cross-validation, repeated twice
control <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

fit <- train(province ~ .,
             data = train,
             method = "knn",
             tuneLength = 15,   # try 15 values of k
             metric = "Kappa",  # case-sensitive: "kappa" silently falls back to Accuracy
             trControl = control)
confusionMatrix(predict(fit, test),factor(test$province))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Burgundy California Casablanca_Valley Marlborough New_York
## Burgundy 108 19 5 10 0
## California 68 682 7 14 14
## Casablanca_Valley 0 0 0 0 0
## Marlborough 1 1 5 0 2
## New_York 0 1 2 0 1
## Oregon 61 88 7 21 9
## Reference
## Prediction Oregon
## Burgundy 28
## California 266
## Casablanca_Valley 0
## Marlborough 2
## New_York 1
## Oregon 250
##
## Overall Statistics
##
## Accuracy : 0.6222
## 95% CI : (0.5985, 0.6455)
## No Information Rate : 0.4728
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3736
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Burgundy Class: California Class: Casablanca_Valley
## Sensitivity 0.45378 0.8622 0.00000
## Specificity 0.95679 0.5816 1.00000
## Pos Pred Value 0.63529 0.6489 NaN
## Neg Pred Value 0.91351 0.8248 0.98446
## Prevalence 0.14226 0.4728 0.01554
## Detection Rate 0.06455 0.4077 0.00000
## Detection Prevalence 0.10161 0.6282 0.00000
## Balanced Accuracy 0.70529 0.7219 0.50000
## Class: Marlborough Class: New_York Class: Oregon
## Sensitivity 0.000000 0.0384615 0.4570
## Specificity 0.993243 0.9975713 0.8348
## Pos Pred Value 0.000000 0.2000000 0.5734
## Neg Pred Value 0.972924 0.9850120 0.7599
## Prevalence 0.026898 0.0155409 0.3270
## Detection Rate 0.000000 0.0005977 0.1494
## Detection Prevalence 0.006575 0.0029886 0.2606
## Balanced Accuracy 0.496622 0.5180164 0.6459
# ggplot(fit, metric = "Kappa")  # just wanted to view for fun
Is this a good value of Kappa? Why or why not?
Answer: My Kappa is about 0.37 (0.3736 in the run above), which is the high end of “okay.” So good, no; adequate, if we’re feeling generous. A Kappa of 0.37 means the model achieves about 37% of the possible improvement over a classifier that agrees with the truth only by chance, given the class frequencies. That is not strong enough to be called good, but it is certainly more accurate than random guessing.
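For reference, Kappa can be computed by hand; here is a sketch on a small hypothetical 2x2 confusion matrix (the counts are made up):

# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected by chance from the marginal totals
m   <- matrix(c(40, 10, 15, 35), nrow = 2)      # hypothetical counts
p_o <- sum(diag(m)) / sum(m)                    # observed agreement: 0.75
p_e <- sum(rowSums(m) * colSums(m)) / sum(m)^2  # chance agreement: 0.50
(p_o - p_e) / (1 - p_e)                         # kappa: 0.50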
Interestingly, when I ran this without metric = “kappa” my Kappa was higher. Two things explain that: caret’s metric names are case-sensitive, so the lowercase “kappa” is not recognized and train() silently falls back to optimizing Accuracy (the warning is suppressed because warning = FALSE is set in the setup chunk), and without set.seed() the train/test partition changes on every knit, so Kappa varies between runs either way. The fix above (metric = "Kappa") makes the tuning actually target Kappa.
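A quick way to confirm which metric train() actually optimized, by inspecting the fitted object from above:

# prints "Accuracy" under the lowercase spelling, confirming the silent
# fallback; with metric = "Kappa" it prints "Kappa"
fit$metric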
Looking at the confusion matrix, where do you see room for improvement in your predictions?
Answer: Casablanca_Valley, Marlborough, and New_York are essentially never predicted correctly: their sensitivities are 0%, 0%, and about 4%, and their balanced accuracies sit near 0.5, which is no better than chance. They are also the rarest classes (each under 3% prevalence), so the KNN vote is dominated by California and Oregon. One option is to remove those provinces from the model; another is to rebalance the training data so the rare classes get a fair vote.
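A sketch of the removal idea, reusing the wino data frame from the earlier chunks (this assumes province is still a regular column at that point; the partition and train steps above would then be rerun on wino_major):

# drop the three provinces the model essentially never predicts correctly
wino_major <- wino %>%
  filter(!province %in% c("Casablanca_Valley", "Marlborough", "New_York")) %>%
  mutate(province = factor(province))  # re-factor so empty levels are dropped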