1. Setup (1pt)

Change the author of this RMD file to yourself and modify the code below so that you can successfully load the ‘pinot.rds’ data file from your own computer.

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(naivebayes)
library(pROC)
wine <- readRDS("C:/Users/Kari/Downloads/pinot.rds")
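
If you prefer not to hard-code an absolute path, a portable alternative (assuming pinot.rds sits in the same folder as the RMD, which knitr uses as the working directory) is:

wine <- readRDS("pinot.rds")  # path resolved relative to the RMD's folder when knitting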

2. Logistic Concepts (3pts)

Why do we call it Logistic Regression even though we are using the technique for classification?

Answer: The name comes from what the model actually fits: a regression. Logistic regression fits a linear model of the features to a continuous quantity, the log-odds (logit) of the probability that an observation belongs to the positive class. Classification only happens afterward, when we apply a cutoff (typically 0.5) to the predicted probability to assign a binary label, so the regression is the model and the classification is just a thresholding step on its output.
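
Concretely, the fitted regression is on the log-odds scale:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

where $p$ is the probability of the positive class (here, that a wine is from Marlborough). Inverting the logit recovers $p$, and comparing $p$ to a cutoff turns the regression output into a classification.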

3. Modeling (4pts)

  1. Train a logistic regression algorithm to classify whether a wine comes from Marlborough,
  2. using 80% of your data,
  3. three features engineered from the description
  4. and 5-fold cross validation.
  5. Report Kappa after using your model to predict the province in the holdout sample.

Encoding Marlborough and Engineering Features

wine <- wine %>%
  rowwise() %>%  # evaluate the string operations one description at a time
  mutate(province = as.numeric(province == "Marlborough"),  # binary target: 1 = Marlborough
         word_count = lengths(strsplit(description, " ")),  # number of words in the description
         fruit = str_detect(description, "fruit"),          # does the description mention "fruit"?
         avg_word_len = mean(nchar(unlist(strsplit(description, " "))))) %>%  # average word length
  ungroup() %>%  # drop the rowwise grouping before modeling
  select(-taster_name, -description)
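
A quick, optional sanity check that the engineered columns look reasonable (nothing downstream depends on this):

wine %>%
  select(province, word_count, fruit, avg_word_len) %>%
  summary()  # ranges for the numeric features, TRUE/FALSE counts for fruit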

Split the data

set.seed(317)  # make the partition reproducible
wine_index <- createDataPartition(wine$province, p = 0.80, list = FALSE)  # 80% train / 20% holdout
train <- wine[ wine_index, ]
test <- wine[-wine_index, ]
table(train$province)  # check the class balance in the training set
## 
##    0    1 
## 6519  185
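
Only 185 of the 6,704 training rows (about 2.8%) are Marlborough wines. That imbalance matters below: a model can score high accuracy by never predicting the minority class.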

Logistic Model

control <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation
fit <- train(province ~ .,
             data = train,
             trControl = control,
             method = "glm",
             family = "binomial")

prob <- predict(fit, newdata = test)  # predicted probability of Marlborough (glm response scale)
pred <- ifelse(prob > 0.5, 1, 0)      # classify at the 0.5 cutoff
confusionMatrix(factor(pred, levels = c(0, 1)),  # keep both levels even when a class is never predicted
                factor(test$province))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1632   44
##          1    0    0
##                                           
##                Accuracy : 0.9737          
##                  95% CI : (0.9649, 0.9809)
##     No Information Rate : 0.9737          
##     P-Value [Acc > NIR] : 0.54            
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 9.022e-11       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9737          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9737          
##          Detection Rate : 0.9737          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 
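
Kappa on the holdout sample is 0. The model never predicts Marlborough, so its 97.4% accuracy is exactly the no-information rate, no better than always guessing the majority class.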

4. Weighting (3pts)

Rerun the above model with a 15-to-1 weight on Marlborough.

weight_train <- train %>%
  mutate(weights = if_else(province == 1, 15, 1))  # 15x weight on each Marlborough row

fit <- train(province ~ .,
             data = train,
             trControl = control,
             method = "glm",
             family = "binomial",
             weights = weight_train$weights)  # case weights, aligned row-for-row with train

prob <- predict(fit, newdata = test)  # predicted probability of Marlborough
pred <- ifelse(prob > 0.5, 1, 0)      # classify at the 0.5 cutoff
confusionMatrix(factor(pred, levels = c(0, 1)), factor(test$province))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1426   17
##          1  206   27
##                                           
##                Accuracy : 0.8669          
##                  95% CI : (0.8497, 0.8829)
##     No Information Rate : 0.9737          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1577          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8738          
##             Specificity : 0.6136          
##          Pos Pred Value : 0.9882          
##          Neg Pred Value : 0.1159          
##              Prevalence : 0.9737          
##          Detection Rate : 0.8508          
##    Detection Prevalence : 0.8610          
##       Balanced Accuracy : 0.7437          
##                                           
##        'Positive' Class : 0               
## 
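
The 15-to-1 weights lift Kappa from 0 to 0.1577: the model now catches 27 of the 44 holdout Marlborough wines, at the cost of 206 false positives and lower overall accuracy. That trade-off is exactly what the weighting is for.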

5. ROC Curves (5pts)

Display an ROC for the model you ran in the last question and use it to explain your model’s quality.

myRoc <- roc(test$province, prob)  # ROC from holdout labels and predicted probabilities
plot(myRoc)

auc(myRoc)
## Area under the curve: 0.8222

Answer: The AUC is 0.822. An AUC of 0.5 would mean the model cannot distinguish Marlborough from non-Marlborough at all, while 1.0 would mean perfect separation, so 0.822 indicates genuinely useful, if not outstanding, discrimination. That is probably fine for sorting wine descriptions, but I would not trust it for a higher-risk application.
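
Another way to read the AUC: it is the probability that a randomly chosen Marlborough wine receives a higher predicted probability than a randomly chosen non-Marlborough wine. A quick sketch to verify that equivalence on our holdout predictions (reusing prob and test from above; ties count as half):

pos <- prob[test$province == 1]  # scores for the Marlborough wines
neg <- prob[test$province == 0]  # scores for everything else
mean(outer(pos, neg, ">")) + 0.5 * mean(outer(pos, neg, "=="))  # should match auc(myRoc)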

Note: You can find a tutorial on ROC curves here: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5