Directions:

Please turn in both a knitted HTML file and your Rmd file on WISE.

Good luck!

1. Setup (1pt)

Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘pinot.rds’ data file from your own computer.

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(naivebayes)
library(tidytext)
library(SnowballC)
wine = read_rds("/cloud/project/resources/pinot.rds")

2. Logistic Concepts (3pts)

Why do we call it Logistic Regression even though we are using the technique for classification?

Answer: If I understood correctly, it’s because the output of probability is continuous (versus discrete classification). So instead of an simple answer of “yes” “no” or “cat” “car”… we’re looking for the probability of “yes/no” “cat/car”. Since probability is continuous it cannot be simply classified.

3. Modeling (4pts)

  1. Train a logistic regression algorithm to classify a whether a wine comes from Marlborough,
  2. using 80% of your data,
  3. three features engineered from the description
  4. and 5-fold cross validation.
  5. Report Kappa after using your model to predict the province in the holdout sample.
wine_words <- function(df, j = 200, stem=T){ 
  data(stop_words)

  words <- df %>%
    unnest_tokens(word, description) %>%
    anti_join(stop_words) %>%
    filter(!(word %in% c("wine","pinot","vineyard", "structure", "drink", "tannin")))
  
  if(stem){
    words <- words %>% 
      mutate(word = wordStem(word))
  }
  
  words <- words %>% 
    count(id, word) %>% 
    group_by(id) %>% 
    mutate(exists = (n>0)) %>% 
    ungroup %>% 
    group_by(word) %>% 
    mutate(total = sum(n)) %>% 
    filter(total > j) %>% 
    pivot_wider(id_cols = id, names_from = word, values_from = exists, values_fill = list(exists=0)) %>% 
    right_join(select(df,id,province)) %>% 
    drop_na() %>% 
    select(-id)
}

wino <- wine_words(wine) %>%
  mutate(province = as.numeric(province=="Marlborough"))
wino %>% 
  head(10) %>%
  select(1:5,province)
bottl earthi herbal tannic ag province
TRUE TRUE TRUE TRUE FALSE 0
FALSE FALSE FALSE FALSE TRUE 0
FALSE FALSE FALSE FALSE FALSE 0
FALSE FALSE TRUE FALSE FALSE 0
FALSE TRUE FALSE TRUE FALSE 0
FALSE FALSE FALSE FALSE FALSE 0
TRUE FALSE FALSE FALSE FALSE 0
FALSE FALSE FALSE FALSE FALSE 0
FALSE FALSE FALSE FALSE FALSE 0
FALSE FALSE FALSE FALSE FALSE 0
set.seed(504)
wine_index <- createDataPartition(wino$province, p = 0.80, list = FALSE)
train <- wino[ wine_index, ]
test <- wino[-wine_index, ]
table(train$province)
## 
##    0    1 
## 6518  186
control <- trainControl(method = "cv", number = 5)
fit <- train(province ~ .,
             data = train, 
             trControl = control,
             method = "glm",
             family = "binomial")

odds_ratio <- exp(coef(fit$finalModel))
data.frame(name = names(odds_ratio), odds_ratio = odds_ratio) %>%  
  arrange(desc(odds_ratio)) %>% 
  head(10)
name odds_ratio
floorTRUE floorTRUE 20725020.662
supplTRUE supplTRUE 3806041.699
crispTRUE crispTRUE 2258170.858
supportTRUE supportTRUE 114269.877
silkiTRUE silkiTRUE 51162.818
noirTRUE noirTRUE 21319.420
solidTRUE solidTRUE 13685.235
slightliTRUE slightliTRUE 7762.919
wineTRUE wineTRUE 5485.298
developTRUE developTRUE 4924.404
prob <- predict(fit, newdata=test)
pred <- ifelse(prob > 0.5, 1, 0)

confusionMatrix(factor(pred),factor(test$province))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1617   14
##          1   16   29
##                                           
##                Accuracy : 0.9821          
##                  95% CI : (0.9745, 0.9879)
##     No Information Rate : 0.9743          
##     P-Value [Acc > NIR] : 0.02224         
##                                           
##                   Kappa : 0.6499          
##                                           
##  Mcnemar's Test P-Value : 0.85513         
##                                           
##             Sensitivity : 0.9902          
##             Specificity : 0.6744          
##          Pos Pred Value : 0.9914          
##          Neg Pred Value : 0.6444          
##              Prevalence : 0.9743          
##          Detection Rate : 0.9648          
##    Detection Prevalence : 0.9732          
##       Balanced Accuracy : 0.8323          
##                                           
##        'Positive' Class : 0               
## 

4. Weighting (3pts)

Rerun the above model with a 15 to 1 weight on Marlborough

weight_train <- train %>% 
  mutate(weights=if_else(province==1,15,1))

fit <- train(province ~ .,
             data = train, 
             trControl = trainControl(method = "cv", number = 5),
             method = "glm",
             family = "binomial",
             weights = weight_train$weights)

prob_w <- predict(fit, newdata=test)
pred_w <- ifelse(prob_w > 0.5, 1, 0)

confusionMatrix(factor(pred_w),factor(test$province))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1608   11
##          1   25   32
##                                           
##                Accuracy : 0.9785          
##                  95% CI : (0.9704, 0.9849)
##     No Information Rate : 0.9743          
##     P-Value [Acc > NIR] : 0.15737         
##                                           
##                   Kappa : 0.6292          
##                                           
##  Mcnemar's Test P-Value : 0.03026         
##                                           
##             Sensitivity : 0.9847          
##             Specificity : 0.7442          
##          Pos Pred Value : 0.9932          
##          Neg Pred Value : 0.5614          
##              Prevalence : 0.9743          
##          Detection Rate : 0.9594          
##    Detection Prevalence : 0.9660          
##       Balanced Accuracy : 0.8644          
##                                           
##        'Positive' Class : 0               
## 

5. ROC Curves (5pts)

Display an ROC for the model you ran in the last question and use it to explain your model’s quality.

Answer: It appears that the quality of this model is good because it says that 91.3% of the model is under the curve. In other words, the steepness of the curve looks good. The only thing that I’m unsure of is the amount of “curvy-ness”? This one looks almost like a triangle with little bend. I’m not sure if that matters at this point.

library(pROC)

myRoc <- roc(test$province, prob_w)
plot(myRoc)

auc(myRoc)
## Area under the curve: 0.913

Note: You can find a tutorial on ROC curves here: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5