Abstract:
This is a technical blog post, available as both an HTML file and a .qmd file hosted on GitHub Pages.
embed-resources option in the header:
format:
  html:
    embed-resources: true
Set Up Code:
sh <- suppressPackageStartupMessages
sh(library(tidyverse))
sh(library(caret))
## Warning: package 'caret' was built under R version 4.4.2
sh(library(naivebayes))
## Warning: package 'naivebayes' was built under R version 4.4.2
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/pinot.rds")))
Why do we call it Logistic Regression even though we are using the technique for classification?
We call it regression because logistic regression models the log odds of the outcome as a linear combination of the input variables, so the quantity it fits is continuous. This differs from a classifier that outputs True/False directly: logistic regression produces a probability between 0 and 1, and that probability becomes a class label only once it is compared with a cutoff.
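To make the "regression" part concrete, the model can be written out as below. The notation is generic and not taken from the post: \(p\) is the probability that a wine is from Marlborough, \(x_1, \dots, x_k\) are the predictors, and \(\beta_0, \dots, \beta_k\) are the fitted coefficients.

\[
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\]

The left-hand form is an ordinary linear model for the log odds. Solving it for \(p\) gives the right-hand form, which always lies strictly between 0 and 1.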
We train a logistic regression model to classify whether a wine comes from Marlborough using an 80/20 train-test split, features engineered from the description text, and 5-fold cross-validation.
We report Kappa after using the model to predict provinces in the holdout sample.
pacman::p_load(tidytext,data.table,scales,pROC)
data(stop_words)
head(stop_words, 25)$word
## [1] "a" "a's" "able" "about" "above"
## [6] "according" "accordingly" "across" "actually" "after"
## [11] "afterwards" "again" "against" "ain't" "all"
## [16] "allow" "allows" "almost" "alone" "along"
## [21] "already" "also" "although" "always" "am"
wino <- wine %>%
  unnest_tokens(word, description)
head(wino)
wino <- wino %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
wino <- wino %>%
  filter(!(word %in% c("wine", "pinot", "drink", "noir", "vineyard",
                       "palate", "notes", "flavors", "bottling")))
head(wino)
wino %>%
  filter(province == "Marlborough") %>%
  count(province, word) %>%
  group_by(province) %>%
  top_n(5, n) %>%
  arrange(province, desc(n)) %>%
  head()
wino_marl <- wine %>%
  mutate(
    fruit = as.factor(str_detect(tolower(description), "fruit")),
    cherry = as.factor(str_detect(tolower(description), "cherry")),
    finish = as.factor(str_detect(tolower(description), "finish")),
    # Flags descriptions that mention "medium" (as in "medium-bodied").
    medium = as.factor(str_detect(tolower(description), "medium")),
    tropical = as.factor(str_detect(tolower(description), "passion fruit|pineapple|guava|tropical")),
    # Outcome: TRUE when the wine is from Marlborough.
    marl = as.factor(province == "Marlborough")
  ) %>%
  select(-c(description, id, province, points, year))
set.seed(505)
train_index <- createDataPartition(wino_marl$marl, p = 0.8, list = FALSE)
train_data <- wino_marl[train_index, ]
test_data <- wino_marl[-train_index, ]
control <- trainControl(method = "cv", number = 5)
get_fit <- function(df) {
  train(marl ~ .,
        data = df,
        trControl = control,
        method = "glm",
        family = "binomial",
        maxit = 5)  # caps glm.fit at 5 iterations (the glm default is 25)
}
fit <- get_fit(train_data)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
fit
## Generalized Linear Model
##
## 6705 samples
## 6 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 5363, 5364, 5365, 5364, 5364
## Resampling results:
##
## Accuracy Kappa
## 0.9725579 0
What is the difference between classifying with logistic regression and classifying with methods like \(K\)-NN and Naive Bayes?
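The main difference I see is that logistic regression fits an explicit parametric model of the relationship between the features and the probability of an outcome, and a class label appears only once that probability is thresholded. K-NN assigns the majority class among the nearest training points, and Naive Bayes picks the class with the highest posterior under a conditional independence assumption, so neither fits a regression equation for the probability.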
We can display an ROC curve to assess the model's quality.
prob <- predict(fit, newdata = test_data, type = "prob")[,2]
myRoc <- roc(test_data$marl, prob)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(myRoc)
auc(myRoc)
## Area under the curve: 0.8719
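As a rough sketch of what the AUC measures, it can be read as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The helper `auc_by_rank` and the six scores below are hypothetical illustrations, not the pROC implementation and not values from the wine data.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
# labels: logical vector, TRUE for the positive class.
# scores: predicted probabilities for the positive class.
auc_by_rank <- function(labels, scores) {
  r     <- rank(scores)            # average ranks are used for ties
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores: the positives tend to rank higher,
# yet every score is far below a 0.5 cutoff.
labels <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
scores <- c(0.01,  0.02,  0.03,  0.04, 0.05,  0.20)

auc_by_rank(labels, scores)   # 0.875: 7 of the 8 positive/negative pairs are ordered correctly
any(scores > 0.5)             # FALSE: no case would be labelled TRUE at a 0.5 cutoff
```

The AUC depends only on how the scores are ordered, so it can be high even when no score crosses the classification cutoff.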
We have a high AUC of 0.8719, which indicates strong discriminatory power. However, the cross-validated Kappa of 0 means the model's class predictions agree with the true labels no more than chance would. This is what happens when a model predicts the majority class for every wine, and it is likely due to class imbalance. I believe this means that our model ranks wines well but struggles to classify them accurately at the default 0.5 cutoff.
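To see how high accuracy and a Kappa of 0 can coexist, here is a minimal sketch that computes Cohen's Kappa from a confusion matrix. The helper name `cohen_kappa` and the counts (1000 wines, 30 of them from Marlborough) are assumptions made up for illustration. They are not the counts from the model above.

```r
# Cohen's Kappa from a square confusion matrix
# (rows = predicted class, columns = actual class).
cohen_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Hypothetical model that predicts FALSE for every wine.
always_false <- matrix(c(970, 0, 30, 0), nrow = 2,
                       dimnames = list(predicted = c("FALSE", "TRUE"),
                                       actual    = c("FALSE", "TRUE")))

sum(diag(always_false)) / sum(always_false)   # accuracy: 0.97
cohen_kappa(always_false)                     # Kappa: 0

# Hypothetical model that finds half of the positives.
finds_some <- matrix(c(960, 10, 15, 15), nrow = 2,
                     dimnames = dimnames(always_false))
cohen_kappa(finds_some) > 0                   # TRUE
```

With a rare positive class, always predicting FALSE matches the labels about as often as chance agreement alone would, so Kappa falls to 0 while accuracy stays high.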