Conditional Probability

Author

Nathan butler!

Published

February 17, 2025

Abstract:

This is a technical blog post of both an HTML file and .qmd file hosted on GitHub pages.

0. Quarto Type-setting

  • This document is rendered with Quarto, and configured to embed an images using the embed-resources option in the header.
  • If you wish to use a similar header, here’s is the format specification for this document:

1. Setup

Set Up Code:

sh <- suppressPackageStartupMessages
sh(library(tidyverse))
sh(library(caret))
Warning: package 'caret' was built under R version 4.4.2
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/pinot.rds")))

2. Conditional Probability

head(wine)
  id   province price points year
1  1     Oregon    65     87 2012
2  2     Oregon    20     87 2013
3  3 California    69     87 2011
4  4     Oregon    50     86 2010
5  5     Oregon    22     86 2009
6  6     Oregon    25     86 2015
                                                                                                                                                                                                                                                                                       description
1                                        Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew.
2                                                        A sleek mix of tart berry, stem and herb, along with a hint of oak and chocolate, this is a fair value in a widely available, drink-now Oregon Pinot. The wine oak-aged for six months, whether in neutral or re-staved is not indicated.
3                              Oak and earth intermingle around robust aromas of wet forest floor in this vineyard-designated Pinot that hails from a high-elevation site. Small in production, it offers intense, full-bodied raspberry and blackberry steeped in smoky spice and smooth texture.
4                                               As with many of the Erath 2010 vineyard designates, this is strongly herbal. The notes of leaf and herb create somewhat unripe flavor impressions, with a touch of bitterness on the finish. The fruit just passes the ripeness of sweet tomatoes.
5                                                                                                                              A stiff, tannic wine, this slowly opens and brings brambly berry flavors into play, along with notes of earthy herbs. There's a touch of bitterness to the tannins.
6 Some rosés are made simply by bleeding the juice from the fermenter, to concentrate the remaining wine. Whether or not that is the case here, the wine has the simple pleasant lightly candied strawberry flavors one might expect from such a technique. It's fruity and undemanding. Drink up.

Calculate the probability that a Pinot comes from Burgundy given it has the word ‘fruit’ in the description.

\[ P({\rm Burgundy}~|~{\rm Fruit}) \]

wino <- wine %>% 
  mutate(Fruit = str_detect(tolower(description),"fruit"))
fruit_bur <- nrow(filter(wino, province=="Burgundy" & Fruit))/nrow(wino)
Fruit <- nrow(filter(wino, Fruit))/nrow(wino)
fruit_bur/Fruit
[1] 0.2184909

3. Naive Bayes Algorithm

We train a naive bayes algorithm to classify a wine’s province using: 1. An 80-20 train-test split. 2. Three features engineered from the description 3. 5-fold cross validation.

We report Kappa after using the model to predict provinces in the holdout sample.

wino2 <- wine %>%
  mutate(
    fruit = str_detect(tolower(description), "fruit"),
    earthy = str_detect(tolower(description), "earth|earthy"),
    tannins = str_detect(tolower(description), "tannins")
  ) %>%
  select(-description)
set.seed(505) 
train_index <- createDataPartition(wino2$province, p = 0.8, list = FALSE)
train_data <- wino2[train_index, ]
test_data <- wino2[-train_index, ]
fit <- train(province ~ .,
             data = train_data, 
             method = "naive_bayes",
             metric = "Kappa",
             trControl = trainControl(method = "cv"))
fit
Naive Bayes 

6707 samples
   7 predictor
   6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6037, 6037, 6036, 6036, 6036, 6037, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.6150368  0.3954457
   TRUE      0.6432067  0.4223953

Tuning parameter 'laplace' was held constant at a value of 0
Tuning
 parameter 'adjust' was held constant at a value of 1
Kappa was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = TRUE
 and adjust = 1.
confusionMatrix(predict(fit, test_data),factor(test_data$province))
Confusion Matrix and Statistics

                   Reference
Prediction          Burgundy California Casablanca_Valley Marlborough New_York
  Burgundy               172         35                 3           9        7
  California              30        672                12          16        9
  Casablanca_Valley        1          2                 2           2        2
  Marlborough              0          0                 0           0        0
  New_York                 0          0                 0           0        0
  Oregon                  35         82                 9          18        8
                   Reference
Prediction          Oregon
  Burgundy              74
  California           250
  Casablanca_Valley      2
  Marlborough            0
  New_York               0
  Oregon               221

Overall Statistics
                                          
               Accuracy : 0.6378          
                 95% CI : (0.6142, 0.6608)
    No Information Rate : 0.4728          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4176          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity                   0.7227            0.8496                 0.076923
Specificity                   0.9108            0.6406                 0.994536
Pos Pred Value                0.5733            0.6795                 0.181818
Neg Pred Value                0.9519            0.8260                 0.985560
Prevalence                    0.1423            0.4728                 0.015541
Detection Rate                0.1028            0.4017                 0.001195
Detection Prevalence          0.1793            0.5912                 0.006575
Balanced Accuracy             0.8167            0.7451                 0.535729
                     Class: Marlborough Class: New_York Class: Oregon
Sensitivity                      0.0000         0.00000        0.4040
Specificity                      1.0000         1.00000        0.8650
Pos Pred Value                      NaN             NaN        0.5925
Neg Pred Value                   0.9731         0.98446        0.7492
Prevalence                       0.0269         0.01554        0.3270
Detection Rate                   0.0000         0.00000        0.1321
Detection Prevalence             0.0000         0.00000        0.2230
Balanced Accuracy                0.5000         0.50000        0.6345

4. Frequency Differences

We find the three words that most distinguish New York Pinots from all other Pinots.

pacman::p_load(tidytext,data.table,scales)
data(stop_words)
head(stop_words, 25)$word
 [1] "a"           "a's"         "able"        "about"       "above"      
 [6] "according"   "accordingly" "across"      "actually"    "after"      
[11] "afterwards"  "again"       "against"     "ain't"       "all"        
[16] "allow"       "allows"      "almost"      "alone"       "along"      
[21] "already"     "also"        "although"    "always"      "am"         
wino3 <- wine %>%
  unnest_tokens(word, description) 
head(wino3)
  id province price points year     word
1  1   Oregon    65     87 2012     much
2  1   Oregon    65     87 2012     like
3  1   Oregon    65     87 2012      the
4  1   Oregon    65     87 2012  regular
5  1   Oregon    65     87 2012 bottling
6  1   Oregon    65     87 2012     from
wino3 <- wino3 %>%
  anti_join(stop_words)
Joining with `by = join_by(word)`
head(wino3)
  id province price points year     word
1  1   Oregon    65     87 2012  regular
2  1   Oregon    65     87 2012 bottling
3  1   Oregon    65     87 2012     2012
4  1   Oregon    65     87 2012    rough
5  1   Oregon    65     87 2012   tannic
6  1   Oregon    65     87 2012   rustic
wino3 <- wino3 %>%
  filter(!(word %in% c("wine","pinot","drink","noir","vineyard","palate","notes","flavors","bottling")))
head(wino3)
  id province price points year    word
1  1   Oregon    65     87 2012 regular
2  1   Oregon    65     87 2012    2012
3  1   Oregon    65     87 2012   rough
4  1   Oregon    65     87 2012  tannic
5  1   Oregon    65     87 2012  rustic
6  1   Oregon    65     87 2012  earthy
wino3 %>%
  filter(province == "New_York") %>% 
  count(province, word) %>%
  group_by(province) %>%
  top_n(3, n) %>%
  arrange(province, desc(n)) %>%
  head()
# A tibble: 3 × 3
# Groups:   province [1]
  province word        n
  <chr>    <chr>   <int>
1 New_York cherry    120
2 New_York tannins    76
3 New_York finish     65

5. Extension

Either do this as a bonus problem, or delete this section.

Calculate the variance of the logged word-frequency distributions for each province.

wtxt <- wine %>% 
  unnest_tokens(word, description) %>% 
  anti_join(stop_words) %>% 
  filter(str_detect(string = word, pattern = "[a-z+]")) %>%  
  filter(str_length(word)>3) %>%  
  group_by(word) %>% 
  mutate(total=n()) %>% 
  ungroup()
Joining with `by = join_by(word)`
wtxt_variance <- wtxt %>%
    filter(!(word %in% c("wine","pinot","drink","noir","vineyard","palate","notes","flavors","bottling"))) %>% 
    count(province, word) %>%
    group_by(province) %>%
    mutate(log_freq = log1p(n)) %>%  
    summarise(variance = var(log_freq, na.rm = TRUE))

print(wtxt_variance)
# A tibble: 6 × 2
  province          variance
  <chr>                <dbl>
1 Burgundy             1.09 
2 California           1.30 
3 Casablanca_Valley    0.457
4 Marlborough          0.490
5 New_York             0.478
6 Oregon               1.16