Directions:

Please turn in both a knitted HTML file and your Rmd file on WISE.

Good luck!

1. Setup (1pt)

Change the author of this RMD file to be yourself and modify the code below so that you can successfully load the ‘wine.rds’ data file from your own computer.

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(naivebayes)
library(tidytext)
wine = read_rds("/Users/rochellerafn/RStudio Files/pinot.rds")
names(wine)[names(wine) == 'id'] = 'ID'
data(stop_words)
head(stop_words, 25)$word
##  [1] "a"           "a's"         "able"        "about"       "above"      
##  [6] "according"   "accordingly" "across"      "actually"    "after"      
## [11] "afterwards"  "again"       "against"     "ain't"       "all"        
## [16] "allow"       "allows"      "almost"      "alone"       "along"      
## [21] "already"     "also"        "although"    "always"      "am"

2. Conditional Probability (3pts)

Calculate \(P(Burgundy | Fruit)\)

…i.e. the probability that a Pinot comes from Burgundy given it has the word ‘fruit’ in the description.

I know the calculation is…

\[ P({\rm Burgundy}~|~{\rm Fruit}) = \frac{P({\rm Fruit}~|~{\rm Burgundy})\,P({\rm Burgundy})}{P({\rm Fruit})} \]

…but I will need help figuring out where to get the numbers from. In the class examples, we were given the percentages needed for the calculation. I’m sure it will make sense once it is explained to me where to find the share of wines from Burgundy and the share of descriptions that contain the word ‘fruit’.

So my guess is… a wine will be from Burgundy 1193 out of 8380 times, so \(P({\rm Burgundy})\) is roughly 14%… I’m still shaky on what the other probabilities are… I can work on that.
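As a hedged aid to the calculation above: if every probability is estimated as a share of wines, the formula collapses to a ratio of two counts. Here \(N\) is the total number of wines, \(N_{B}\) the number from Burgundy, \(N_{F}\) the number whose description contains ‘fruit’, and \(N_{BF}\) the number that satisfy both. These symbols are introduced only for this illustration, and only \(N\) and \(N_{B}\) are known from the counts in this section.

\[ P({\rm Burgundy}~|~{\rm Fruit}) = \frac{\frac{N_{BF}}{N_{B}} \cdot \frac{N_{B}}{N}}{\frac{N_{F}}{N}} = \frac{N_{BF}}{N_{F}}, \qquad P({\rm Burgundy}) = \frac{N_{B}}{N} = \frac{1193}{8380} \approx 0.142 \]

So the two numbers still needed are how many wines mention ‘fruit’ at all, and how many of those are from Burgundy.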

wine %>%
  count(province)
province n
Burgundy 1193
California 3959
Casablanca_Valley 131
Marlborough 229
New_York 131
Oregon 2737
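
Fruit appears 4146 times. (This counts occurrences of the word across all descriptions, not the number of wines that mention it.)

wine %>%
  unnest_tokens(word, description) %>%   # one row per word in each description
  anti_join(stop_words, by = "word") %>% # drop stop words
  filter(word == "fruit") %>%
  count(word)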
word n
fruit 4146
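
The missing pieces can be computed per wine rather than per word. The sketch below is only an illustration under stated assumptions: `toy_wine` is a made-up five-row stand-in for the real `wine` data, and "contains the word fruit" is taken to mean a whole-word, case-insensitive match. It flags each wine, estimates the three probabilities as shares of wines, and checks that Bayes' rule agrees with the direct share of ‘fruit’ wines that come from Burgundy.

```r
library(tidyverse)

# Hypothetical toy data standing in for `wine` (not the real pinot data)
toy_wine <- tibble(
  province    = c("Burgundy", "Burgundy", "California", "California", "Oregon"),
  description = c("Red fruit and earth", "Firm tannins", "Ripe fruit",
                  "Jammy fruit", "Spice and oak")
)

probs <- toy_wine %>%
  mutate(
    burgundy = province == "Burgundy",
    # TRUE when the description contains "fruit" as a whole word
    fruit    = str_detect(str_to_lower(description), "\\bfruit\\b")
  ) %>%
  summarise(
    p_burgundy             = mean(burgundy),         # P(Burgundy)
    p_fruit                = mean(fruit),            # P(Fruit)
    p_fruit_given_burgundy = mean(fruit[burgundy]),  # P(Fruit | Burgundy)
    # Bayes' rule
    p_burgundy_given_fruit = p_fruit_given_burgundy * p_burgundy / p_fruit,
    # Direct estimate: share of "fruit" wines that are from Burgundy
    direct                 = mean(burgundy[fruit])
  )

probs
```

Replacing `toy_wine` with `wine` should give the figures for the assignment data, assuming it has the same `province` and `description` columns.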

3. Naive Bayes Algorithm (4pts)

  1. Train a naive Bayes algorithm to classify a wine’s province,
  2. using 80% of your data,
  3. three features engineered from the description
  4. and 5-fold cross validation.
  5. Report Kappa after using your model to predict provinces in the holdout (test) sample.
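
The five steps above can be sketched with `caret` as follows. This is a minimal illustration, not a worked answer: `toy_wine` is synthetic data invented so the block runs on its own, the three features (`fruit`, `earth`, `cherry`) are arbitrary choices, and the seed is arbitrary. On the real data, `wine` would take the place of `toy_wine`.

```r
library(tidyverse)
library(caret)
library(naivebayes)

set.seed(504)  # arbitrary seed, for reproducibility only

# Synthetic stand-in for `wine` (hypothetical provinces and phrases)
phrases <- list(
  Burgundy   = c("earth and cherry", "earth and spice", "red fruit"),
  California = c("ripe fruit and cherry", "jammy fruit", "oak and vanilla"),
  Oregon     = c("cherry and spice", "earth and fruit", "oak and spice")
)
toy_wine <- tibble(province = sample(names(phrases), 300, replace = TRUE)) %>%
  mutate(description = map_chr(province, ~ sample(phrases[[.x]], 1)))

# Step 3: three features engineered from the description
features <- toy_wine %>%
  mutate(
    province = factor(province),
    fruit    = str_detect(description, "fruit"),
    earth    = str_detect(description, "earth"),
    cherry   = str_detect(description, "cherry")
  ) %>%
  select(province, fruit, earth, cherry)

# Step 2: 80% of the data for training, stratified by province
idx       <- createDataPartition(features$province, p = 0.8, list = FALSE)[, 1]
train_set <- features[idx, ]
test_set  <- features[-idx, ]

# Steps 1 and 4: naive Bayes with 5-fold cross validation
fit <- train(
  province ~ .,
  data      = train_set,
  method    = "naive_bayes",
  metric    = "Kappa",
  trControl = trainControl(method = "cv", number = 5)
)

# Step 5: Kappa on the holdout (test) sample
cm <- confusionMatrix(predict(fit, test_set), test_set$province)
cm$overall["Kappa"]
```

The Kappa printed here only reflects the synthetic data, so it says nothing about how these features would perform on the real wines.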

4. Frequency differences (2pts)

List the three words that most distinguish New York Pinots from all other Pinots.

With all of the research I did on words above, I narrowed the candidates down to a top ten, and then filtered out words that I also found in other regions. I think it’s nearly impossible to find words that come up literally 0 times anywhere else… but the words “Brisk”, “Delicate”, and “Bramble” seemed to distinguish New York from other regions based on the number of mentions. Each is mentioned only 23 to 28 times for New York, but considering the small amount of data we have for New York, that is a much higher percentage than in California, which has thousands of entries: 25/131 ≈ 19%, whereas 100/3959 ≈ 2.5%. See the counts below.

wine %>%
  count(province)
province n
Burgundy 1193
California 3959
Casablanca_Valley 131
Marlborough 229
New_York 131
Oregon 2737
df_2 <- wine %>%
  unnest_tokens(word, description) %>%                    # one row per word
  filter(word %in% c("brisk", "delicate", "bramble")) %>%
  anti_join(stop_words, by = "word") %>%                  # get rid of stop words
  count(ID, word) %>%                                     # mentions per review
  group_by(ID) %>% 
  mutate(freq = n/sum(n)) %>% 
  mutate(exists = (n>0)) %>% 
  ungroup() %>% 
  group_by(word) %>% 
  mutate(total = sum(n))                                  # total mentions per word
df_2 %>%
  left_join(select(wine, ID, province), by = "ID") %>%
  count(province, word) %>%
  group_by(province) %>% 
  top_n(10,n) %>% 
  arrange(province, desc(n))
word province n
delicate Burgundy 9
brisk Burgundy 2
delicate California 100
bramble California 23
brisk California 21
bramble Casablanca_Valley 1
delicate Marlborough 17
brisk New_York 28
delicate New_York 25
bramble New_York 23
delicate Oregon 119
bramble Oregon 13
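
The percentage argument above can be made explicit by dividing each count by the size of its province. The sketch below retypes the numbers from the two tables in this section and assumes that each `n` is the number of reviews mentioning the word. The names `province_n`, `word_n`, and `ny_vs_rest` are made up for this illustration. It compares the New York rate for each word with the rate across all other provinces combined.

```r
library(tidyverse)

# Province sizes, copied from the count(province) output
province_n <- tibble(
  province = c("Burgundy", "California", "Casablanca_Valley",
               "Marlborough", "New_York", "Oregon"),
  total    = c(1193, 3959, 131, 229, 131, 2737)
)

# Reviews mentioning each word, copied from the table above
word_n <- tribble(
  ~word,      ~province,           ~n,
  "delicate", "Burgundy",            9,
  "brisk",    "Burgundy",            2,
  "delicate", "California",        100,
  "bramble",  "California",         23,
  "brisk",    "California",         21,
  "bramble",  "Casablanca_Valley",   1,
  "delicate", "Marlborough",        17,
  "brisk",    "New_York",           28,
  "delicate", "New_York",           25,
  "bramble",  "New_York",           23,
  "delicate", "Oregon",            119,
  "bramble",  "Oregon",             13
)

ny_total   <- province_n$total[province_n$province == "New_York"]
rest_total <- sum(province_n$total[province_n$province != "New_York"])

ny_vs_rest <- word_n %>%
  group_by(word) %>%
  summarise(
    ny_rate   = sum(n[province == "New_York"]) / ny_total,    # share of NY reviews
    rest_rate = sum(n[province != "New_York"]) / rest_total   # share elsewhere
  ) %>%
  mutate(ratio = ny_rate / rest_rate) %>%  # how many times more common in NY
  arrange(desc(ratio))

ny_vs_rest
```

Ranking by `ratio` rather than by raw count is one reasonable way to compare a small province with large ones. Running the same calculation over every word, not only these three, would be needed to confirm that they really are the top three.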