Change the author of this RMD file to yourself and modify the code below so that it successfully loads the ‘pinot.rds’ data file from your own computer.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(naivebayes)
library(pROC)
wine <- readRDS("C:/Users/Kari/Downloads/pinot.rds")
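If pinot.rds lives somewhere else on your machine, one option is a relative path with an interactive fallback. This is a minimal sketch, assuming you have placed pinot.rds next to this Rmd file:
# Sketch: read from the Rmd's folder if the file is there; otherwise ask.
# Assumes pinot.rds has been copied next to this Rmd file.
if (file.exists("pinot.rds")) {
  wine <- readRDS("pinot.rds")
} else {
  wine <- readRDS(file.choose())  # base R interactive file picker
}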
Why do we call it Logistic Regression even though we are using the technique for classification?
Answer: The “regression” part of the name describes how the model works: logistic regression fits a linear model to the log-odds of the outcome, so its raw output is a continuous probability. It still functions as a classifier because the outcome is binary: each observation either is the thing or it isn’t, and we use the features to predict the probability that it is, then threshold that probability to make the call.
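To make the log-odds point concrete, here is a minimal sketch (illustration only, not part of the assignment) of the logistic link turning a linear predictor into a probability:
# Sketch: the logistic (sigmoid) function maps any log-odds value into (0, 1).
log_odds <- c(-3, 0, 3)   # example values of a linear predictor
plogis(log_odds)          # 1 / (1 + exp(-log_odds)) -> approx 0.05, 0.50, 0.95
# Thresholding these probabilities (e.g., at 0.5) is the classification step.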
wine <- wine %>%
  rowwise() %>%
  mutate(province = as.numeric(province == "Marlborough"),  # 1 = Marlborough, 0 = everything else
         word_count = lengths(strsplit(description, " ")),  # number of words in the review
         fruit = str_detect(description, "fruit"),          # does the review mention "fruit"?
         avg_word_len = mean(nchar(unlist(strsplit(description, " "))))) %>%  # mean word length
  ungroup() %>%                                             # drop rowwise grouping before modeling
  select(-taster_name, -description)
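For reference, the same features can be built without rowwise() using vectorized stringr helpers. A sketch only, where wine_raw is a hypothetical fresh copy of the data re-read from pinot.rds:
# Sketch -- wine_raw is hypothetical (the untransformed pinot.rds data)
wine_vec <- wine_raw %>%
  mutate(province = as.numeric(province == "Marlborough"),
         word_count = str_count(description, "\\S+"),                      # vectorized word count
         fruit = str_detect(description, "fruit"),
         avg_word_len = nchar(str_remove_all(description, "\\s")) / word_count) %>%
  select(-taster_name, -description)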
set.seed(317)  # make the 80/20 split reproducible
wine_index <- createDataPartition(wine$province, p = 0.80, list = FALSE)
train <- wine[wine_index, ]
test <- wine[-wine_index, ]
table(train$province)
##
## 0 1
## 6519 185
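Those counts reveal a severe class imbalance: roughly 35 non-Marlborough wines for every Marlborough wine. A quick proportion check makes it explicit:
# Marlborough (class 1) is under 3% of the training data
prop.table(table(train$province))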
# Logistic model, estimated with 5-fold cross-validation
control <- trainControl(method = "cv", number = 5)
fit <- train(province ~ .,
             data = train,
             trControl = control,
             method = "glm",
             family = "binomial")
prob <- predict(fit, newdata = test)  # predicted probability of Marlborough
pred <- ifelse(prob > 0.5, 1, 0)      # classify at the 0.5 cutoff
confusionMatrix(factor(pred, levels = c(0, 1)), factor(test$province))  # set levels so both classes appear
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1632 44
## 1 0 0
##
## Accuracy : 0.9737
## 95% CI : (0.9649, 0.9809)
## No Information Rate : 0.9737
## P-Value [Acc > NIR] : 0.54
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 9.022e-11
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9737
## Neg Pred Value : NaN
## Prevalence : 0.9737
## Detection Rate : 0.9737
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
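Notice the model never predicts Marlborough, so its 97.4% accuracy is exactly the no-information rate (and Kappa is 0): always guessing the majority class would score the same. A quick check (sketch):
# Share of class 0 in the test set -- matches the reported accuracy of 0.9737
mean(test$province == 0)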
Rerun the above model with a 15-to-1 weight on Marlborough.
weight_train <- train %>%
  mutate(weights = if_else(province == 1, 15, 1))  # upweight Marlborough wines 15:1
fit <- train(province ~ .,
             data = train,
             trControl = control,
             method = "glm",
             family = "binomial",
             weights = weight_train$weights)
prob <- predict(fit, newdata = test)
pred <- ifelse(prob > 0.5, 1, 0)
confusionMatrix(factor(pred, levels = c(0, 1)), factor(test$province))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1426 17
## 1 206 27
##
## Accuracy : 0.8669
## 95% CI : (0.8497, 0.8829)
## No Information Rate : 0.9737
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1577
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8738
## Specificity : 0.6136
## Pos Pred Value : 0.9882
## Neg Pred Value : 0.1159
## Prevalence : 0.9737
## Detection Rate : 0.8508
## Detection Prevalence : 0.8610
## Balanced Accuracy : 0.7437
##
## 'Positive' Class : 0
##
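Accuracy falls, but the model now actually catches Marlborough wines. Because confusionMatrix() defaulted to class 0 as positive, the same result can be re-reported from Marlborough’s point of view (sketch):
# Sketch: sensitivity/specificity with Marlborough (1) as the positive class
confusionMatrix(factor(pred, levels = c(0, 1)),
                factor(test$province),
                positive = "1")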
Display an ROC curve for the model you ran in the last question and use it to explain your model’s quality.
myRoc <- roc(test$province, prob)  # ROC across every possible probability cutoff
plot(myRoc)
auc(myRoc)
## Area under the curve: 0.8222
Answer: The AUC is 0.822, which is reasonably good. An AUC of 0.5 would mean the model has no ability to tell the classes apart (no better than random guessing), and we want to be as close to 1 as possible. So this is a decent model: probably fine for telling apart wines, but not strong enough for a higher-risk application.
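The ROC curve can also suggest a better cutoff than the default 0.5. A sketch using pROC’s coords() to pull the threshold that maximizes sensitivity plus specificity (Youden’s J):
# Sketch: best threshold under Youden's J, with its sensitivity and specificity
coords(myRoc, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))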
Note: You can find a tutorial on ROC curves here: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5