1. Setup (1pt)

Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘wine.rds’ data file from your own computer.

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(naivebayes)
library(tidytext)
wine = read_rds("/Users/rochellerafn/RStudio Files/pinot.rds")
names(wine)[names(wine) == 'id'] = 'ID'
data(stop_words)
head(stop_words, 25)$word

##  [1] "a"           "a's"         "able"        "about"       "above"      
##  [6] "according"   "accordingly" "across"      "actually"    "after"      
## [11] "afterwards"  "again"       "against"     "ain't"       "all"        
## [16] "allow"       "allows"      "almost"      "alone"       "along"      
## [21] "already"     "also"        "although"    "always"      "am"

2. Conditional Probability (3pts)

Calculate \(P(Burgundy | Fruit)\)

…i.e. the probability that a Pinot comes from Burgundy given it has the word ‘fruit’ in the description.

I know the calculation is…

\[ P({\rm Burgundy}~|~{\rm Fruit}) = \frac{P({\rm Fruit}~|~ \rm{Burgundy})P(\rm{Burgundy})}{P({\rm Fruit})} \] …but I will need help figuring out where to get the numbers from. The class examples we had the percentages to do the calculations. I’m sure once it is explained to me where to find the percentages of Burgundy and Fruit to region and word… it will make sense.

So my guess is… a wine will be from Burgundy 1193 out of 8380 times. That is roughly 14% of the time… I’m still shakey on what the probability is… I can work on that.

wine %>%
  count(province)

province	n
Burgundy	1193
California	3959
Casablanca_Valley	131
Marlborough	229
New_York	131
Oregon	2737

Fruit appears 4146 times.

wine %>%
  unnest_tokens(word, description) %>%
  anti_join(stop_words) %>% 
  filter(word %in% c("fruit")) %>%
  count(word)

word	n
fruit	4146

3. Naive Bayes Algorithm (4pts)

Train a naive bayes algorithm to classify a wine’s province,
using 80% of your data,
three features engineered from the description
and 5-fold cross validation.
Report Kappa after using your model to predict provinces in the holdout (test) sample.

Ok, I’ve officially spent way too much time on this… I’m not sure if we’re being graded on the performance of the model, or just proving that we know the syntax and are working through testing it out. The highest Kappa I was able to achieve was around 0.29… it’s currently at around 0.08 :-P. I took a few different approaches… First I tried just simple filtering out words (like non-descriptive and basic words), and that actually ended up giving me higher results, but was heavily skewed toward Oregon and California. I had filtered out words and kepts the filter of total > 700 or 1000… I was able to actually get a few models that ended up exclusively picking the LEAST populous regions. Then, another model where it was pretty evenly distributed, but like highly incorrect haha. It was more balanced at guessing differnet places, but the accuracy and kappa were still garbage. So, then I tried switching my strategy to figuring out what words are ranked the highest in each region. I figured that would help a lot…I separately notated my findings and kept re-running the left join by province to keep tabs on the most popular words and slowly filtered out duplicate words until the top 10 words were all unique. I then created the code below with those unique words… however, I found the issue with that is that even though a word is not showing up in the top ten of one region (but is in another) that doesn’t mean the word does not show up at all… It’s almost impossible to find words that are 100% unique to each region. So… I’m going to call it good for now and leave this explanation of my process here. This is fun and frustating at the same time haha.

str(wine)

## 'data.frame':    8380 obs. of  6 variables:
##  $ ID         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ province   : chr  "Oregon" "Oregon" "California" "Oregon" ...
##  $ price      : num  65 20 69 50 22 25 64 55 44 38 ...
##  $ points     : num  87 87 87 86 86 86 91 91 91 91 ...
##  $ year       : num  2012 2013 2011 2010 2009 ...
##  $ description: chr  "Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, he"| __truncated__ "A sleek mix of tart berry, stem and herb, along with a hint of oak and chocolate, this is a fair value in a wid"| __truncated__ "Oak and earth intermingle around robust aromas of wet forest floor in this vineyard-designated Pinot that hails"| __truncated__ "As with many of the Erath 2010 vineyard designates, this is strongly herbal. The notes of leaf and herb create "| __truncated__ ...

df <- wine %>%
  unnest_tokens(word, description) %>%
  anti_join(stop_words) %>% # get rid of stop words
  filter(word %in% c("wood", "thyme", "creamy", "mild", "heavy", "lean", "supple", "marlborough", "mocha", "burgundy", "casablanca", "new york", "oregon", "california", "brisk", "bramble", "penetrating", "pretty", "forward")) %>%
  count(ID, word) %>% 
  group_by(ID) %>% 
  mutate(freq = n/sum(n)) %>% 
  mutate(exists = (n>0)) %>% 
  ungroup %>% 
  group_by(word) %>% 
  mutate(total = sum(n))

head(df, 10)

ID	word	n	freq	exists	total
2	oregon	1	1.0	TRUE	214
7	thyme	1	1.0	TRUE	300
8	brisk	1	1.0	TRUE	52
10	forward	1	1.0	TRUE	258
16	forward	1	1.0	TRUE	258
17	penetrating	1	0.5	TRUE	50
17	supple	1	0.5	TRUE	210
19	wood	1	1.0	TRUE	333
23	bramble	1	1.0	TRUE	60
25	wood	1	1.0	TRUE	333

df %>%
  left_join(select(wine, ID, province), by = "ID") %>%
  count(province, word) %>%
  group_by(province) %>% 
  top_n(20,n) %>% 
  arrange(province, desc(n))

word	province	n
wood	Burgundy	229
burgundy	Burgundy	24
forward	Burgundy	19
lean	Burgundy	10
supple	Burgundy	5
heavy	Burgundy	3
brisk	Burgundy	2
pretty	Burgundy	2
creamy	Burgundy	1
mild	Burgundy	1
thyme	Burgundy	1
thyme	California	284
lean	California	95
pretty	California	85
supple	California	69
heavy	California	51
wood	California	41
mild	California	40
mocha	California	35
forward	California	25
bramble	California	23
brisk	California	21
california	California	19
creamy	California	16
burgundy	California	8
oregon	California	3
penetrating	California	3
creamy	Casablanca_Valley	15
mild	Casablanca_Valley	14
heavy	Casablanca_Valley	13
lean	Casablanca_Valley	13
wood	Casablanca_Valley	3
california	Casablanca_Valley	2
bramble	Casablanca_Valley	1
casablanca	Casablanca_Valley	1
forward	Casablanca_Valley	1
supple	Marlborough	48
marlborough	Marlborough	35
mocha	Marlborough	24
pretty	Marlborough	15
creamy	Marlborough	9
heavy	Marlborough	5
lean	Marlborough	4
wood	Marlborough	3
forward	Marlborough	1
thyme	Marlborough	1
brisk	New_York	28
bramble	New_York	23
penetrating	New_York	18
pretty	New_York	9
forward	New_York	4
wood	New_York	4
mocha	New_York	2
lean	New_York	1
supple	New_York	1
pretty	Oregon	255
forward	Oregon	206
oregon	Oregon	204
supple	Oregon	86
mocha	Oregon	77
wood	Oregon	44
penetrating	Oregon	28
california	Oregon	26
burgundy	Oregon	17
bramble	Oregon	13
creamy	Oregon	13
heavy	Oregon	13
lean	Oregon	12
thyme	Oregon	4
mild	Oregon	1

df %>% 
  count(word) %>%
  arrange(desc(n))

word	n
pretty	366
wood	324
thyme	290
forward	256
supple	209
oregon	207
mocha	138
lean	135
heavy	85
bramble	60
mild	56
creamy	54
brisk	51
burgundy	49
penetrating	49
california	47
marlborough	35
casablanca	1

wino <- df %>%
  pivot_wider(id_cols = ID, names_from = word, values_from = exists, values_fill = list(exists=0)) %>% 
  right_join(select(wine,ID, province)) %>%
  drop_na()

head(wino)

ID	oregon	thyme	brisk	forward	penetrating	supple	wood	bramble	mocha	lean	marlborough	pretty	heavy	mild	burgundy	california	creamy	casablanca	province
2	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	Oregon
7	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	California
8	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	California
10	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	Oregon
16	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	Oregon
17	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	Oregon

set.seed(504)
wine_index <- createDataPartition(wino$province, p = 0.80, list = FALSE)
train <- wino[ wine_index, ]
test <- wino[-wine_index, ]

fit <- train(province ~ .,
             data = train, 
             method = "naive_bayes",
             tuneGrid = expand.grid(usekernel = c(T,F), laplace = T, adjust = T),
             metric = "Kappa",
             trControl = trainControl(method = "cv", number = 5))

fit

## Naive Bayes 
## 
## 1693 samples
##   19 predictor
##    6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1354, 1355, 1354, 1355, 1354 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa     
##   FALSE      0.1181006  0.08042429
##    TRUE      0.5818174  0.33489773
## 
## Tuning parameter 'laplace' was held constant at a value of TRUE
## 
## Tuning parameter 'adjust' was held constant at a value of TRUE
## Kappa was used to select the optimal model using the largest value.
## The final values used for the model were laplace = TRUE, usekernel = TRUE
##  and adjust = TRUE.

confusionMatrix(predict(fit, test),factor(test$province))

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Burgundy California Casablanca_Valley Marlborough New_York
##   Burgundy                 0          0                 0           0        0
##   California              53        127                10          17        9
##   Casablanca_Valley        0          0                 0           0        0
##   Marlborough              0          0                 0           0        0
##   New_York                 0          0                 0           0        0
##   Oregon                   4         26                 0           5        4
##                    Reference
## Prediction          Oregon
##   Burgundy               0
##   California            44
##   Casablanca_Valley      0
##   Marlborough            0
##   New_York               0
##   Oregon               122
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5914          
##                  95% CI : (0.5428, 0.6388)
##     No Information Rate : 0.3943          
##     P-Value [Acc > NIR] : 2.627e-16       
##                                           
##                   Kappa : 0.3461          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Burgundy Class: California Class: Casablanca_Valley
## Sensitivity                   0.0000            0.8301                  0.00000
## Specificity                   1.0000            0.5037                  1.00000
## Pos Pred Value                   NaN            0.4885                      NaN
## Neg Pred Value                0.8646            0.8385                  0.97625
## Prevalence                    0.1354            0.3634                  0.02375
## Detection Rate                0.0000            0.3017                  0.00000
## Detection Prevalence          0.0000            0.6176                  0.00000
## Balanced Accuracy             0.5000            0.6669                  0.50000
##                      Class: Marlborough Class: New_York Class: Oregon
## Sensitivity                     0.00000         0.00000        0.7349
## Specificity                     1.00000         1.00000        0.8471
## Pos Pred Value                      NaN             NaN        0.7578
## Neg Pred Value                  0.94774         0.96912        0.8308
## Prevalence                      0.05226         0.03088        0.3943
## Detection Rate                  0.00000         0.00000        0.2898
## Detection Prevalence            0.00000         0.00000        0.3824
## Balanced Accuracy               0.50000         0.50000        0.7910

4. Frequency differences (2pts)

List the three words that most distinguish New York Pinots from all other Pinots.

With all of the research I did on words above, I narrowed down the top ten, and then filtered down based on words that I found duplicated in other regions. I think it’s nearly impossible to find words that come up literally 0 times anywhere else… but the words, “Brisk”, “Delicate”, and “Bramble” seemed to distiguish New York from other regions based on the number of mentions. They are only receiving mentions between 23-28 times, but considering the small amount of data we have for New York, that is a much higher percentage than California which thousands of entries. 25/131 = 19% whereas 100/3959 = 2.5%. See the counts below.

wine %>%
  count(province)

province	n
Burgundy	1193
California	3959
Casablanca_Valley	131
Marlborough	229
New_York	131
Oregon	2737

df_2 <- wine %>%
  unnest_tokens(word, description) %>%
  filter(word %in% c("brisk", "delicate", "bramble")) %>%
  anti_join(stop_words) %>% # get rid of stop words
  count(ID, word) %>% 
  group_by(ID) %>% 
  mutate(freq = n/sum(n)) %>% 
  mutate(exists = (n>0)) %>% 
  ungroup %>% 
  group_by(word) %>% 
  mutate(total = sum(n))

df_2 %>%
  left_join(select(wine, ID, province), by = "ID") %>%
  count(province, word) %>%
  group_by(province) %>% 
  top_n(10,n) %>% 
  arrange(province, desc(n))

word	province	n
delicate	Burgundy	9
brisk	Burgundy	2
delicate	California	100
bramble	California	23
brisk	California	21
bramble	Casablanca_Valley	1
delicate	Marlborough	17
brisk	New_York	28
delicate	New_York	25
bramble	New_York	23
delicate	Oregon	119
bramble	Oregon	13

Homework 4