Abstract:

This technical blog post is published on GitHub Pages as both a rendered HTML file and its Quarto (.qmd) source.

0. Quarto Type-setting

1. Setup

Setup code:

# Alias to silence package startup messages
sh <- suppressPackageStartupMessages
sh(library(tidyverse))
sh(library(caret))
## Warning: package 'caret' was built under R version 4.4.2
sh(library(naivebayes))
## Warning: package 'naivebayes' was built under R version 4.4.2
# Load the pinot noir dataset
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/pinot.rds")))

2. Logistic Concepts

Why do we call it Logistic Regression even though we are using the technique for classification?

We call it regression because logistic regression models the log odds of the outcome as a linear combination of the input variables, which is a regression problem. Unlike a classifier that emits a direct True/False label, logistic regression produces a probability between 0 and 1; classification only happens when we apply a cutoff (typically 0.5) to that probability.
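
Written out, for features \(x_1, \dots, x_k\) the model is

\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,\]

and inverting the logit recovers the probability \(p = 1/(1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)})\).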

3. Modeling

We train a logistic regression algorithm to classify whether a wine comes from Marlborough using:

  1. An 80-20 train-test split.
  2. Three features engineered from the description.
  3. 5-fold cross validation.

We report Kappa after using the model to predict provinces in the holdout sample.
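
For reference, Cohen's Kappa compares observed accuracy \(p_o\) to the accuracy \(p_e\) expected by chance given the class frequencies:

\[\kappa = \frac{p_o - p_e}{1 - p_e}\]

When one class dominates, \(p_e\) is already close to 1, so a model can post high accuracy while \(\kappa\) stays near 0.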

# tidytext for tokenization; pROC for the ROC curve in section 5
pacman::p_load(tidytext,data.table,scales,pROC)
data(stop_words)
head(stop_words, 25)$word
##  [1] "a"           "a's"         "able"        "about"       "above"      
##  [6] "according"   "accordingly" "across"      "actually"    "after"      
## [11] "afterwards"  "again"       "against"     "ain't"       "all"        
## [16] "allow"       "allows"      "almost"      "alone"       "along"      
## [21] "already"     "also"        "although"    "always"      "am"
# Tokenize each description into one row per word
wino <- wine %>%
  unnest_tokens(word, description)
head(wino)
# Drop common English stop words
wino <- wino %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
# Drop domain-specific stop words that appear in nearly every review
wino <- wino %>%
  filter(!(word %in% c("wine","pinot","drink","noir","vineyard","palate","notes","flavors","bottling")))
head(wino)
# Most frequent remaining words among Marlborough wines
wino %>%
  filter(province == "Marlborough") %>% 
  count(province, word) %>%
  group_by(province) %>%
  top_n(5, n) %>%
  arrange(province, desc(n)) %>%
  head()
# Engineer binary word-presence features from the description.
# Note: the second `bodied =` assignment overwrites the first, so the
# final `bodied` column actually flags "medium", not "bodied".
wino_marl <- wine %>% 
  mutate(
    fruit = as.factor(str_detect(tolower(description), "fruit")),
    cherry = as.factor(str_detect(tolower(description), "cherry")),
    finish = as.factor(str_detect(tolower(description), "finish")),
    bodied = as.factor(str_detect(tolower(description), "bodied")),
    bodied = as.factor(str_detect(tolower(description), "medium")),
    tropical = as.factor(str_detect(tolower(description), "passion fruit|pineapple|guava|tropical")),
    # Target: is the wine from Marlborough?
    marl = as.factor(province == "Marlborough")
  ) %>%
  select(-c(description,id,province,points,year))
# Reproducible 80-20 train-test split, stratified on the target
set.seed(505) 
train_index <- createDataPartition(wino_marl$marl, p = 0.8, list = FALSE)
train_data <- wino_marl[train_index, ]
test_data <- wino_marl[-train_index, ]
# 5-fold cross validation
control <- trainControl(method = "cv", number = 5)
get_fit <- function(df) {
  train(marl ~ .,
        data = df, 
        trControl = control,
        method = "glm",
        family = "binomial",
        # Cap the IWLS solver at 5 iterations; this is what triggers
        # the "algorithm did not converge" warnings below.
        maxit = 5) 
}
fit <- get_fit(train_data)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
fit
## Generalized Linear Model 
## 
## 6705 samples
##    6 predictor
##    2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 5363, 5364, 5365, 5364, 5364 
## Resampling results:
## 
##   Accuracy   Kappa
##   0.9725579  0
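
The accuracy looks impressive, but a Kappa of 0 is a red flag. As a quick sanity check (a minimal sketch reusing the wino_marl data frame built above), we can inspect the class balance:

# Roughly 97% of wines are not from Marlborough, so always predicting
# FALSE already matches the cross-validated accuracy reported above.
prop.table(table(wino_marl$marl))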

4. Binary vs Other Classification

What is the difference between performing classification with logistic regression and with methods like \(K\)-NN and Naive Bayes, which we previously used for classification?

The main difference I see is that logistic regression is a discriminative model: it directly models the probability of an outcome as a function of the features via the log odds. \(K\)-NN, by contrast, is non-parametric and classifies a point by a majority vote of its nearest neighbors, while Naive Bayes is generative: it models how likely the features are under each class (assuming conditional independence) and applies Bayes' rule to recover class probabilities.
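
As a sketch of how the three approaches line up in practice, all of them fit under the same caret interface by swapping the method argument ("knn" and "naive_bayes" are the standard caret method names; the fit_knn and fit_nb objects below are illustrative, not part of the original analysis):

# Same formula, data, and 5-fold CV; only the algorithm changes.
fit_knn <- train(marl ~ ., data = train_data, trControl = control, method = "knn")
fit_nb  <- train(marl ~ ., data = train_data, trControl = control, method = "naive_bayes")
# Compare cross-validated Accuracy and Kappa across the three fits
summary(resamples(list(glm = fit, knn = fit_knn, nb = fit_nb)))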

5. ROC Curves

We can display an ROC curve for the model to assess its quality.

# Predicted probability of the positive (TRUE) class
prob <- predict(fit, newdata = test_data, type = "prob")[,2]
myRoc <- roc(test_data$marl, prob)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(myRoc)

auc(myRoc)
## Area under the curve: 0.8719

We have a high AUC of 0.8719, which indicates strong discriminatory power: the model ranks Marlborough wines above non-Marlborough wines most of the time. However, the Kappa of 0 suggests the model's actual classifications are no better than chance, almost certainly due to class imbalance: at the default 0.5 cutoff it predicts FALSE for essentially everything, matching the ~97% majority-class accuracy. In other words, our model ranks predictions well but, at the default threshold, fails to actually identify the positive class.
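
One way to make this concrete (a sketch reusing prob and test_data from above; the 0.2 cutoff is an arbitrary illustration, not a tuned value):

# At the default 0.5 cutoff nearly everything is predicted FALSE,
# so accuracy is high but Kappa collapses to 0.
preds_50 <- factor(prob > 0.5, levels = c(FALSE, TRUE))
confusionMatrix(preds_50, test_data$marl, positive = "TRUE")
# A lower cutoff trades overall accuracy for recall on the rare
# Marlborough class, which the high AUC suggests is a good trade.
preds_20 <- factor(prob > 0.2, levels = c(FALSE, TRUE))
confusionMatrix(preds_20, test_data$marl, positive = "TRUE")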