sh <- suppressPackageStartupMessages
sh(library(tidyverse))
sh(library(caret))Warning: package 'caret' was built under R version 4.4.2
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/pinot.rds")))Abstract:
This is a technical blog post of both an HTML file and .qmd file hosted on GitHub pages.
embed-resources option in the header.format:
html:
embed-resources: trueSet Up Code:
sh <- suppressPackageStartupMessages
sh(library(tidyverse))
sh(library(caret))Warning: package 'caret' was built under R version 4.4.2
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/pinot.rds")))head(wine) id province price points year
1 1 Oregon 65 87 2012
2 2 Oregon 20 87 2013
3 3 California 69 87 2011
4 4 Oregon 50 86 2010
5 5 Oregon 22 86 2009
6 6 Oregon 25 86 2015
description
1 Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew.
2 A sleek mix of tart berry, stem and herb, along with a hint of oak and chocolate, this is a fair value in a widely available, drink-now Oregon Pinot. The wine oak-aged for six months, whether in neutral or re-staved is not indicated.
3 Oak and earth intermingle around robust aromas of wet forest floor in this vineyard-designated Pinot that hails from a high-elevation site. Small in production, it offers intense, full-bodied raspberry and blackberry steeped in smoky spice and smooth texture.
4 As with many of the Erath 2010 vineyard designates, this is strongly herbal. The notes of leaf and herb create somewhat unripe flavor impressions, with a touch of bitterness on the finish. The fruit just passes the ripeness of sweet tomatoes.
5 A stiff, tannic wine, this slowly opens and brings brambly berry flavors into play, along with notes of earthy herbs. There's a touch of bitterness to the tannins.
6 Some rosés are made simply by bleeding the juice from the fermenter, to concentrate the remaining wine. Whether or not that is the case here, the wine has the simple pleasant lightly candied strawberry flavors one might expect from such a technique. It's fruity and undemanding. Drink up.
Calculate the probability that a Pinot comes from Burgundy given it has the word ‘fruit’ in the description.
\[ P({\rm Burgundy}~|~{\rm Fruit}) \]
wino <- wine %>%
mutate(Fruit = str_detect(tolower(description),"fruit"))fruit_bur <- nrow(filter(wino, province=="Burgundy" & Fruit))/nrow(wino)
Fruit <- nrow(filter(wino, Fruit))/nrow(wino)
fruit_bur/Fruit[1] 0.2184909
We train a naive bayes algorithm to classify a wine’s province using: 1. An 80-20 train-test split. 2. Three features engineered from the description 3. 5-fold cross validation.
We report Kappa after using the model to predict provinces in the holdout sample.
wino2 <- wine %>%
mutate(
fruit = str_detect(tolower(description), "fruit"),
earthy = str_detect(tolower(description), "earth|earthy"),
tannins = str_detect(tolower(description), "tannins")
) %>%
select(-description)set.seed(505)
train_index <- createDataPartition(wino2$province, p = 0.8, list = FALSE)
train_data <- wino2[train_index, ]
test_data <- wino2[-train_index, ]fit <- train(province ~ .,
data = train_data,
method = "naive_bayes",
metric = "Kappa",
trControl = trainControl(method = "cv"))
fitNaive Bayes
6707 samples
7 predictor
6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6037, 6037, 6036, 6036, 6036, 6037, ...
Resampling results across tuning parameters:
usekernel Accuracy Kappa
FALSE 0.6150368 0.3954457
TRUE 0.6432067 0.4223953
Tuning parameter 'laplace' was held constant at a value of 0
Tuning
parameter 'adjust' was held constant at a value of 1
Kappa was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = TRUE
and adjust = 1.
confusionMatrix(predict(fit, test_data),factor(test_data$province))Confusion Matrix and Statistics
Reference
Prediction Burgundy California Casablanca_Valley Marlborough New_York
Burgundy 172 35 3 9 7
California 30 672 12 16 9
Casablanca_Valley 1 2 2 2 2
Marlborough 0 0 0 0 0
New_York 0 0 0 0 0
Oregon 35 82 9 18 8
Reference
Prediction Oregon
Burgundy 74
California 250
Casablanca_Valley 2
Marlborough 0
New_York 0
Oregon 221
Overall Statistics
Accuracy : 0.6378
95% CI : (0.6142, 0.6608)
No Information Rate : 0.4728
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4176
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity 0.7227 0.8496 0.076923
Specificity 0.9108 0.6406 0.994536
Pos Pred Value 0.5733 0.6795 0.181818
Neg Pred Value 0.9519 0.8260 0.985560
Prevalence 0.1423 0.4728 0.015541
Detection Rate 0.1028 0.4017 0.001195
Detection Prevalence 0.1793 0.5912 0.006575
Balanced Accuracy 0.8167 0.7451 0.535729
Class: Marlborough Class: New_York Class: Oregon
Sensitivity 0.0000 0.00000 0.4040
Specificity 1.0000 1.00000 0.8650
Pos Pred Value NaN NaN 0.5925
Neg Pred Value 0.9731 0.98446 0.7492
Prevalence 0.0269 0.01554 0.3270
Detection Rate 0.0000 0.00000 0.1321
Detection Prevalence 0.0000 0.00000 0.2230
Balanced Accuracy 0.5000 0.50000 0.6345
We find the three words that most distinguish New York Pinots from all other Pinots.
pacman::p_load(tidytext,data.table,scales)
data(stop_words)
head(stop_words, 25)$word [1] "a" "a's" "able" "about" "above"
[6] "according" "accordingly" "across" "actually" "after"
[11] "afterwards" "again" "against" "ain't" "all"
[16] "allow" "allows" "almost" "alone" "along"
[21] "already" "also" "although" "always" "am"
wino3 <- wine %>%
unnest_tokens(word, description)
head(wino3) id province price points year word
1 1 Oregon 65 87 2012 much
2 1 Oregon 65 87 2012 like
3 1 Oregon 65 87 2012 the
4 1 Oregon 65 87 2012 regular
5 1 Oregon 65 87 2012 bottling
6 1 Oregon 65 87 2012 from
wino3 <- wino3 %>%
anti_join(stop_words)Joining with `by = join_by(word)`
head(wino3) id province price points year word
1 1 Oregon 65 87 2012 regular
2 1 Oregon 65 87 2012 bottling
3 1 Oregon 65 87 2012 2012
4 1 Oregon 65 87 2012 rough
5 1 Oregon 65 87 2012 tannic
6 1 Oregon 65 87 2012 rustic
wino3 <- wino3 %>%
filter(!(word %in% c("wine","pinot","drink","noir","vineyard","palate","notes","flavors","bottling")))
head(wino3) id province price points year word
1 1 Oregon 65 87 2012 regular
2 1 Oregon 65 87 2012 2012
3 1 Oregon 65 87 2012 rough
4 1 Oregon 65 87 2012 tannic
5 1 Oregon 65 87 2012 rustic
6 1 Oregon 65 87 2012 earthy
wino3 %>%
filter(province == "New_York") %>%
count(province, word) %>%
group_by(province) %>%
top_n(3, n) %>%
arrange(province, desc(n)) %>%
head()# A tibble: 3 × 3
# Groups: province [1]
province word n
<chr> <chr> <int>
1 New_York cherry 120
2 New_York tannins 76
3 New_York finish 65
Either do this as a bonus problem, or delete this section.
Calculate the variance of the logged word-frequency distributions for each province.
wtxt <- wine %>%
unnest_tokens(word, description) %>%
anti_join(stop_words) %>%
filter(str_detect(string = word, pattern = "[a-z+]")) %>%
filter(str_length(word)>3) %>%
group_by(word) %>%
mutate(total=n()) %>%
ungroup()Joining with `by = join_by(word)`
wtxt_variance <- wtxt %>%
filter(!(word %in% c("wine","pinot","drink","noir","vineyard","palate","notes","flavors","bottling"))) %>%
count(province, word) %>%
group_by(province) %>%
mutate(log_freq = log1p(n)) %>%
summarise(variance = var(log_freq, na.rm = TRUE))
print(wtxt_variance)# A tibble: 6 × 2
province variance
<chr> <dbl>
1 Burgundy 1.09
2 California 1.30
3 Casablanca_Valley 0.457
4 Marlborough 0.490
5 New_York 0.478
6 Oregon 1.16