1. Setup (1pt)

Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘pinot.rds’ data file from your own computer.

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(naivebayes)
library(tidytext)
library(SnowballC)
wine = read_rds("/cloud/project/resources/pinot.rds")

2. Logistic Concepts (3pts)

Why do we call it Logistic Regression even though we are using the technique for classification?

Answer: If I understood correctly, it’s because the output of probability is continuous (versus discrete classification). So instead of an simple answer of “yes” “no” or “cat” “car”… we’re looking for the probability of “yes/no” “cat/car”. Since probability is continuous it cannot be simply classified.

3. Modeling (4pts)

Train a logistic regression algorithm to classify a whether a wine comes from Marlborough,
using 80% of your data,
three features engineered from the description
and 5-fold cross validation.
Report Kappa after using your model to predict the province in the holdout sample.

wine_words <- function(df, j = 200, stem=T){ 
  data(stop_words)

  words <- df %>%
    unnest_tokens(word, description) %>%
    anti_join(stop_words) %>%
    filter(!(word %in% c("wine","pinot","vineyard", "structure", "drink", "tannin")))
  
  if(stem){
    words <- words %>% 
      mutate(word = wordStem(word))
  }
  
  words <- words %>% 
    count(id, word) %>% 
    group_by(id) %>% 
    mutate(exists = (n>0)) %>% 
    ungroup %>% 
    group_by(word) %>% 
    mutate(total = sum(n)) %>% 
    filter(total > j) %>% 
    pivot_wider(id_cols = id, names_from = word, values_from = exists, values_fill = list(exists=0)) %>% 
    right_join(select(df,id,province)) %>% 
    drop_na() %>% 
    select(-id)
}

wino <- wine_words(wine) %>%
  mutate(province = as.numeric(province=="Marlborough"))

wino %>% 
  head(10) %>%
  select(1:5,province)

bottl	earthi	herbal	tannic	ag
TRUE	TRUE	TRUE	TRUE	FALSE
FALSE	FALSE	FALSE	FALSE	TRUE
FALSE	FALSE	FALSE	FALSE	FALSE
FALSE	FALSE	TRUE	FALSE	FALSE
FALSE	TRUE	FALSE	TRUE	FALSE
FALSE	FALSE	FALSE	FALSE	FALSE
TRUE	FALSE	FALSE	FALSE	FALSE
FALSE	FALSE	FALSE	FALSE	FALSE
FALSE	FALSE	FALSE	FALSE	FALSE
FALSE	FALSE	FALSE	FALSE	FALSE

set.seed(504)
wine_index <- createDataPartition(wino$province, p = 0.80, list = FALSE)
train <- wino[ wine_index, ]
test <- wino[-wine_index, ]
table(train$province)

## 
##    0    1 
## 6518  186

control <- trainControl(method = "cv", number = 5)
fit <- train(province ~ .,
             data = train, 
             trControl = control,
             method = "glm",
             family = "binomial")

odds_ratio <- exp(coef(fit$finalModel))
data.frame(name = names(odds_ratio), odds_ratio = odds_ratio) %>%  
  arrange(desc(odds_ratio)) %>% 
  head(10)

	name	odds_ratio
floorTRUE	floorTRUE	20725020.662
supplTRUE	supplTRUE	3806041.699
crispTRUE	crispTRUE	2258170.858
supportTRUE	supportTRUE	114269.877
silkiTRUE	silkiTRUE	51162.818
noirTRUE	noirTRUE	21319.420
solidTRUE	solidTRUE	13685.235
slightliTRUE	slightliTRUE	7762.919
wineTRUE	wineTRUE	5485.298
developTRUE	developTRUE	4924.404

prob <- predict(fit, newdata=test)
pred <- ifelse(prob > 0.5, 1, 0)

confusionMatrix(factor(pred),factor(test$province))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1617   14
##          1   16   29
##                                           
##                Accuracy : 0.9821          
##                  95% CI : (0.9745, 0.9879)
##     No Information Rate : 0.9743          
##     P-Value [Acc > NIR] : 0.02224         
##                                           
##                   Kappa : 0.6499          
##                                           
##  Mcnemar's Test P-Value : 0.85513         
##                                           
##             Sensitivity : 0.9902          
##             Specificity : 0.6744          
##          Pos Pred Value : 0.9914          
##          Neg Pred Value : 0.6444          
##              Prevalence : 0.9743          
##          Detection Rate : 0.9648          
##    Detection Prevalence : 0.9732          
##       Balanced Accuracy : 0.8323          
##                                           
##        'Positive' Class : 0               
##

4. Weighting (3pts)

Rerun the above model with a 15 to 1 weight on Marlborough

weight_train <- train %>% 
  mutate(weights=if_else(province==1,15,1))

fit <- train(province ~ .,
             data = train, 
             trControl = trainControl(method = "cv", number = 5),
             method = "glm",
             family = "binomial",
             weights = weight_train$weights)

prob_w <- predict(fit, newdata=test)
pred_w <- ifelse(prob_w > 0.5, 1, 0)

confusionMatrix(factor(pred_w),factor(test$province))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1608   11
##          1   25   32
##                                           
##                Accuracy : 0.9785          
##                  95% CI : (0.9704, 0.9849)
##     No Information Rate : 0.9743          
##     P-Value [Acc > NIR] : 0.15737         
##                                           
##                   Kappa : 0.6292          
##                                           
##  Mcnemar's Test P-Value : 0.03026         
##                                           
##             Sensitivity : 0.9847          
##             Specificity : 0.7442          
##          Pos Pred Value : 0.9932          
##          Neg Pred Value : 0.5614          
##              Prevalence : 0.9743          
##          Detection Rate : 0.9594          
##    Detection Prevalence : 0.9660          
##       Balanced Accuracy : 0.8644          
##                                           
##        'Positive' Class : 0               
##

5. ROC Curves (5pts)

Display an ROC for the model you ran in the last question and use it to explain your model’s quality.

Answer: It appears that the quality of this model is good because it says that 91.3% of the model is under the curve. In other words, the steepness of the curve looks good. The only thing that I’m unsure of is the amount of “curvy-ness”? This one looks almost like a triangle with little bend. I’m not sure if that matters at this point.

library(pROC)

myRoc <- roc(test$province, prob_w)
plot(myRoc)

auc(myRoc)

## Area under the curve: 0.913

Note: You can find a tutorial on ROC curves here: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

Homework 5

Rochelle Rafn

02/11/2022

1. Setup (1pt)

2. Logistic Concepts (3pts)

3. Modeling (4pts)

4. Weighting (3pts)

5. ROC Curves (5pts)