Directions:

Please turn in both a knitted HTML file and your Rmd file on WISE.

Good luck!

1. Setup (1pt)

Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘wine.rds’ data file from your own computer.

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(class)
library(fastDummies)
wine <- readRDS("C:/Users/Kari/Downloads/pinot.rds")

2. KNN Concepts (5pts)

Explain how the choice of K affects the quality of your prediction when using a K Nearest Neighbors algorithm.

Answer: K is the number of neighbors that vote on each prediction, and the predicted class is the majority among them. As K grows, each vote pulls in points that are farther away, and with a very large K the prediction drifts toward the most common class in the whole data set, so it becomes less informative. Conversely, if K is too small, like k = 1, each prediction depends on a single nearest point, so the model follows noise in individual observations instead of the broader trends.
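To make the trade-off concrete, here is a minimal sketch on the built-in iris data (not the wine data), using knn() from the class package that is already loaded above. The seed, the 100/50 split, and the grid of K values are arbitrary choices for illustration only.

```r
library(class)  # provides knn()

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
idx <- sample(nrow(iris), 100)

# Scale the predictors using the training rows' mean and sd
train_x <- scale(iris[idx, 1:4])
test_x <- scale(iris[-idx, 1:4],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
train_y <- iris$Species[idx]
test_y <- iris$Species[-idx]

# k = 100 uses every training row, so the vote is just the majority class
ks <- c(1, 5, 15, 45, 100)
acc <- sapply(ks, function(k) {
  mean(knn(train_x, test_x, train_y, k = k) == test_y)
})
result <- data.frame(k = ks, accuracy = acc)
result
```

At the largest K every prediction comes from the same vote over the whole training set, which is the "population average" effect described in the answer.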

3. Feature Engineering (3pts)

  1. Remove the taster_name column from the data
  2. Create a version of the year column that is a factor (instead of numeric)
  3. Create dummy variables that indicate the presence of “cherry”, “chocolate” and “earth” in the description
  4. Create 3 new features that represent the interaction between time and the cherry, chocolate and earth indicators
  5. Remove the description column from the data
wino <- wine %>%
  rename_all(tolower) %>%
  rename_all(~ str_replace_all(., "-", "_")) %>%
  rename_all(~ str_replace_all(., " ", "_")) %>%
  mutate(year_f = as.factor(year)) %>%
  mutate(cherry = str_detect(description,"cherry")) %>%
  mutate(chocolate = str_detect(description,"chocolate")) %>%
  mutate(earth = str_detect(description,"earth")) %>%
  mutate(earthyr = earth * year) %>%
  mutate(cherryyr = cherry * year) %>%
  mutate(chocolateyr = chocolate * year) %>%
  select(-description, -taster_name)
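One detail worth checking in the indicator columns: str_detect() is case-sensitive by default, so a description that starts with "Cherry" is not flagged by the pattern "cherry". The sketch below uses a made-up three-row tibble (the descriptions and years are invented for illustration) to compare the default match with a case-insensitive one and to show how the interaction column behaves.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data, not taken from the wine file
toy <- tibble(
  description = c("Cherry and earth on the nose",
                  "dark chocolate finish",
                  "bright red fruit"),
  year = c(2012, 2015, 2016)
)

toy <- toy %>%
  mutate(
    # default match: misses the capitalised "Cherry"
    cherry_cs = str_detect(description, "cherry"),
    # case-insensitive match
    cherry = str_detect(description, regex("cherry", ignore_case = TRUE)),
    # logical * numeric: the year where the indicator is TRUE, 0 elsewhere
    cherry_yr = cherry * year
  )
toy
```

Whether the case-insensitive version is wanted depends on how the descriptions are written, so this is a variation to consider rather than a required change.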

4. Preprocessing (3pts)

  1. Preprocess the dataframe that you created in the previous question using BoxCox, centering and scaling of the numeric features
  2. Create dummy variables for the year factor column
wino <- wino %>%
  preProcess(method = c("BoxCox", "center", "scale")) %>%
  predict(wino)

wino <- wino %>%
  dummy_cols(
    select_columns = "year_f",
    remove_most_frequent_dummy = TRUE,
    remove_selected_columns = TRUE)
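As a quick sanity check on what the preprocessing step does, the sketch below applies the same three methods to a few strictly positive columns of the built-in mtcars data (chosen only for illustration) and confirms that each column ends up centered and scaled.

```r
library(caret)

# Built-in data; all three columns are strictly positive, as BoxCox requires
toy <- mtcars[, c("mpg", "hp", "wt")]

# preProcess() only estimates the transformation; predict() applies it
pp <- preProcess(toy, method = c("BoxCox", "center", "scale"))
toy_pp <- predict(pp, toy)

colMeans(toy_pp)      # each close to 0
apply(toy_pp, 2, sd)  # each close to 1
```

Because the estimation and the application are separate calls, the same pattern could also be fit on the training rows only and then applied to the test rows, if keeping test data out of the scaling parameters matters for the analysis.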

5. Running KNN (5pts)

  1. Split your data into an 80/20 training and test set
  2. Use Caret to run a KNN model that uses your engineered features to predict province
  • use 5-fold cross validated subsampling
  • allow Caret to try 15 different values for K
  3. Display the confusion matrix on the test data
wine_index <- createDataPartition(wino$province, p = 0.8, list = FALSE)
train <- wino[ wine_index, ]
test <- wino[-wine_index, ]

control <- trainControl(method="repeatedcv", number=5, repeats=2)

fit <- train(province ~ .,
             data = train,
             method = "knn",
             tuneLength = 15,
             metric = "Kappa",  # caret metric names are case-sensitive
             trControl = control)

confusionMatrix(predict(fit, test),factor(test$province))
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Burgundy California Casablanca_Valley Marlborough New_York
##   Burgundy               108         19                 5          10        0
##   California              68        682                 7          14       14
##   Casablanca_Valley        0          0                 0           0        0
##   Marlborough              1          1                 5           0        2
##   New_York                 0          1                 2           0        1
##   Oregon                  61         88                 7          21        9
##                    Reference
## Prediction          Oregon
##   Burgundy              28
##   California           266
##   Casablanca_Valley      0
##   Marlborough            2
##   New_York               1
##   Oregon               250
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6222          
##                  95% CI : (0.5985, 0.6455)
##     No Information Rate : 0.4728          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3736          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Burgundy Class: California Class: Casablanca_Valley
## Sensitivity                  0.45378            0.8622                  0.00000
## Specificity                  0.95679            0.5816                  1.00000
## Pos Pred Value               0.63529            0.6489                      NaN
## Neg Pred Value               0.91351            0.8248                  0.98446
## Prevalence                   0.14226            0.4728                  0.01554
## Detection Rate               0.06455            0.4077                  0.00000
## Detection Prevalence         0.10161            0.6282                  0.00000
## Balanced Accuracy            0.70529            0.7219                  0.50000
##                      Class: Marlborough Class: New_York Class: Oregon
## Sensitivity                    0.000000       0.0384615        0.4570
## Specificity                    0.993243       0.9975713        0.8348
## Pos Pred Value                 0.000000       0.2000000        0.5734
## Neg Pred Value                 0.972924       0.9850120        0.7599
## Prevalence                     0.026898       0.0155409        0.3270
## Detection Rate                 0.000000       0.0005977        0.1494
## Detection Prevalence           0.006575       0.0029886        0.2606
## Balanced Accuracy              0.496622       0.5180164        0.6459
#ggplot(fit, metric="Kappa") Just wanted to view for fun

6. Kappa (2pts)

Is this a good value of Kappa? Why or why not?

Answer: My Kappa is 0.3736, which is at the high end of “okay”. So good, no; adequate, if we’re feeling generous. Kappa measures agreement beyond what chance alone would produce, so 0.3736 means the model captures about 37% of the possible improvement over chance-level agreement. It is not remarkable enough to be called good, but it is certainly more accurate than randomly guessing.

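Interestingly, when I ran this without metric = “kappa” my Kappa was higher. I am not sure why; since no seed is set, the train/test split and the folds change on every run, so some of the difference may just be randomness.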
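For reference, Kappa can be recomputed by hand from the confusion matrix printed in section 5. The sketch below copies those counts into a base R matrix and applies the usual definition, (observed agreement - expected agreement) / (1 - expected agreement); the variable names are my own.

```r
# Counts copied from the printed confusion matrix
# (rows = prediction, columns = reference)
classes <- c("Burgundy", "California", "Casablanca_Valley",
             "Marlborough", "New_York", "Oregon")
cm <- matrix(c(108,  19, 5, 10,  0,  28,
                68, 682, 7, 14, 14, 266,
                 0,   0, 0,  0,  0,   0,
                 1,   1, 5,  0,  2,   2,
                 0,   1, 2,  0,  1,   1,
                61,  88, 7, 21,  9, 250),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)
p_observed <- sum(diag(cm)) / n                     # plain accuracy
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa_hat <- (p_observed - p_expected) / (1 - p_expected)

round(c(accuracy = p_observed, expected = p_expected, kappa = kappa_hat), 4)
```

The result reproduces the Accuracy of 0.6222 and the Kappa of 0.3736 shown in the output, with an expected chance agreement of about 0.40, driven mostly by how often California appears in both the predictions and the reference labels.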

7. Improvement (2pts)

Looking at the confusion matrix, where do you see room for improvement in your predictions?

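Answer: Casablanca Valley, Marlborough, and New York are almost never predicted correctly. Sensitivity is 0 for Casablanca Valley and Marlborough and only about 0.04 for New York, and balanced accuracy for all three sits at roughly 50%. So potentially removing them from the model?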
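One way to act on that idea without throwing rows away is to lump the rare provinces into a single "Other" class before modeling. The sketch below rebuilds a province factor from the reference totals of the test confusion matrix (238, 791, 26, 45, 26, 547) and lumps every class under a 5% share; the 5% cutoff and the "Other" label are assumptions for illustration, not part of the assignment.

```r
classes <- c("Burgundy", "California", "Casablanca_Valley",
             "Marlborough", "New_York", "Oregon")

# Reference (column) totals from the test-set confusion matrix
province <- factor(rep(classes, times = c(238, 791, 26, 45, 26, 547)))

# Classes that make up less than 5% of the rows (arbitrary cutoff)
counts <- table(province)
rare <- names(counts)[as.vector(counts) / length(province) < 0.05]

# Option 1: lump the rare classes together
lumped <- factor(ifelse(province %in% rare, "Other", as.character(province)))
table(lumped)

# Option 2: drop the rare classes entirely
kept <- droplevels(province[!(province %in% rare)])
table(kept)
```

Either option changes what the model is being asked to predict, so the resulting Accuracy and Kappa would not be directly comparable with the six-class results above.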