Directions:
Please turn in both a knitted HTML file and your Rmd file on WISE.
Good luck!
Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘pinot.rds’ data file from your own computer.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(naivebayes)
library(tidytext)
library(SnowballC)
wine = read_rds("/cloud/project/resources/pinot.rds")
Why do we call it Logistic Regression even though we are using the technique for classification?
Answer: If I understood correctly, it’s because the output of probability is continuous (versus discrete classification). So instead of an simple answer of “yes” “no” or “cat” “car”… we’re looking for the probability of “yes/no” “cat/car”. Since probability is continuous it cannot be simply classified.
wine_words <- function(df, j = 200, stem=T){
data(stop_words)
words <- df %>%
unnest_tokens(word, description) %>%
anti_join(stop_words) %>%
filter(!(word %in% c("wine","pinot","vineyard", "structure", "drink", "tannin")))
if(stem){
words <- words %>%
mutate(word = wordStem(word))
}
words <- words %>%
count(id, word) %>%
group_by(id) %>%
mutate(exists = (n>0)) %>%
ungroup %>%
group_by(word) %>%
mutate(total = sum(n)) %>%
filter(total > j) %>%
pivot_wider(id_cols = id, names_from = word, values_from = exists, values_fill = list(exists=0)) %>%
right_join(select(df,id,province)) %>%
drop_na() %>%
select(-id)
}
wino <- wine_words(wine) %>%
mutate(province = as.numeric(province=="Marlborough"))
wino %>%
head(10) %>%
select(1:5,province)
| bottl | earthi | herbal | tannic | ag | province |
|---|---|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE | FALSE | 0 |
| FALSE | FALSE | FALSE | FALSE | TRUE | 0 |
| FALSE | FALSE | FALSE | FALSE | FALSE | 0 |
| FALSE | FALSE | TRUE | FALSE | FALSE | 0 |
| FALSE | TRUE | FALSE | TRUE | FALSE | 0 |
| FALSE | FALSE | FALSE | FALSE | FALSE | 0 |
| TRUE | FALSE | FALSE | FALSE | FALSE | 0 |
| FALSE | FALSE | FALSE | FALSE | FALSE | 0 |
| FALSE | FALSE | FALSE | FALSE | FALSE | 0 |
| FALSE | FALSE | FALSE | FALSE | FALSE | 0 |
set.seed(504)
wine_index <- createDataPartition(wino$province, p = 0.80, list = FALSE)
train <- wino[ wine_index, ]
test <- wino[-wine_index, ]
table(train$province)
##
## 0 1
## 6518 186
control <- trainControl(method = "cv", number = 5)
fit <- train(province ~ .,
data = train,
trControl = control,
method = "glm",
family = "binomial")
odds_ratio <- exp(coef(fit$finalModel))
data.frame(name = names(odds_ratio), odds_ratio = odds_ratio) %>%
arrange(desc(odds_ratio)) %>%
head(10)
| name | odds_ratio | |
|---|---|---|
| floorTRUE | floorTRUE | 20725020.662 |
| supplTRUE | supplTRUE | 3806041.699 |
| crispTRUE | crispTRUE | 2258170.858 |
| supportTRUE | supportTRUE | 114269.877 |
| silkiTRUE | silkiTRUE | 51162.818 |
| noirTRUE | noirTRUE | 21319.420 |
| solidTRUE | solidTRUE | 13685.235 |
| slightliTRUE | slightliTRUE | 7762.919 |
| wineTRUE | wineTRUE | 5485.298 |
| developTRUE | developTRUE | 4924.404 |
prob <- predict(fit, newdata=test)
pred <- ifelse(prob > 0.5, 1, 0)
confusionMatrix(factor(pred),factor(test$province))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1617 14
## 1 16 29
##
## Accuracy : 0.9821
## 95% CI : (0.9745, 0.9879)
## No Information Rate : 0.9743
## P-Value [Acc > NIR] : 0.02224
##
## Kappa : 0.6499
##
## Mcnemar's Test P-Value : 0.85513
##
## Sensitivity : 0.9902
## Specificity : 0.6744
## Pos Pred Value : 0.9914
## Neg Pred Value : 0.6444
## Prevalence : 0.9743
## Detection Rate : 0.9648
## Detection Prevalence : 0.9732
## Balanced Accuracy : 0.8323
##
## 'Positive' Class : 0
##
Rerun the above model with a 15 to 1 weight on Marlborough
weight_train <- train %>%
mutate(weights=if_else(province==1,15,1))
fit <- train(province ~ .,
data = train,
trControl = trainControl(method = "cv", number = 5),
method = "glm",
family = "binomial",
weights = weight_train$weights)
prob_w <- predict(fit, newdata=test)
pred_w <- ifelse(prob_w > 0.5, 1, 0)
confusionMatrix(factor(pred_w),factor(test$province))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1608 11
## 1 25 32
##
## Accuracy : 0.9785
## 95% CI : (0.9704, 0.9849)
## No Information Rate : 0.9743
## P-Value [Acc > NIR] : 0.15737
##
## Kappa : 0.6292
##
## Mcnemar's Test P-Value : 0.03026
##
## Sensitivity : 0.9847
## Specificity : 0.7442
## Pos Pred Value : 0.9932
## Neg Pred Value : 0.5614
## Prevalence : 0.9743
## Detection Rate : 0.9594
## Detection Prevalence : 0.9660
## Balanced Accuracy : 0.8644
##
## 'Positive' Class : 0
##
Display an ROC for the model you ran in the last question and use it to explain your model’s quality.
Answer: It appears that the quality of this model is good because it says that 91.3% of the model is under the curve. In other words, the steepness of the curve looks good. The only thing that I’m unsure of is the amount of “curvy-ness”? This one looks almost like a triangle with little bend. I’m not sure if that matters at this point.
library(pROC)
myRoc <- roc(test$province, prob_w)
plot(myRoc)
auc(myRoc)
## Area under the curve: 0.913
Note: You can find a tutorial on ROC curves here: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5