Directions:
Please turn in both a knitted HTML file and your Rmd file on WISE.
Good luck!
Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘wine.rds’ data file from your own computer.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(naivebayes)
library(tidytext)
wine = read_rds("/Users/rochellerafn/RStudio Files/pinot.rds")
names(wine)[names(wine) == 'id'] = 'ID'
data(stop_words)
head(stop_words, 25)$word
## [1] "a" "a's" "able" "about" "above"
## [6] "according" "accordingly" "across" "actually" "after"
## [11] "afterwards" "again" "against" "ain't" "all"
## [16] "allow" "allows" "almost" "alone" "along"
## [21] "already" "also" "although" "always" "am"
Calculate \(P(Burgundy | Fruit)\)
…i.e. the probability that a Pinot comes from Burgundy given it has the word ‘fruit’ in the description.
\[ P({\rm Burgundy}~|~{\rm Fruit}) = \frac{P({\rm Fruit}~|~ \rm{Burgundy})P(\rm{Burgundy})}{P({\rm Fruit})} \] …but I will need help figuring out where to get the numbers from. The class examples we had the percentages to do the calculations. I’m sure once it is explained to me where to find the percentages of Burgundy and Fruit to region and word… it will make sense.
So my guess is… a wine will be from Burgundy 1193 out of 8380 times. That is roughly 14% of the time… I’m still shakey on what the probability is… I can work on that.
wine %>%
count(province)
| province | n |
|---|---|
| Burgundy | 1193 |
| California | 3959 |
| Casablanca_Valley | 131 |
| Marlborough | 229 |
| New_York | 131 |
| Oregon | 2737 |
wine %>%
unnest_tokens(word, description) %>%
anti_join(stop_words) %>%
filter(word %in% c("fruit")) %>%
count(word)
| word | n |
|---|---|
| fruit | 4146 |
str(wine)
## 'data.frame': 8380 obs. of 6 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ province : chr "Oregon" "Oregon" "California" "Oregon" ...
## $ price : num 65 20 69 50 22 25 64 55 44 38 ...
## $ points : num 87 87 87 86 86 86 91 91 91 91 ...
## $ year : num 2012 2013 2011 2010 2009 ...
## $ description: chr "Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, he"| __truncated__ "A sleek mix of tart berry, stem and herb, along with a hint of oak and chocolate, this is a fair value in a wid"| __truncated__ "Oak and earth intermingle around robust aromas of wet forest floor in this vineyard-designated Pinot that hails"| __truncated__ "As with many of the Erath 2010 vineyard designates, this is strongly herbal. The notes of leaf and herb create "| __truncated__ ...
df <- wine %>%
unnest_tokens(word, description) %>%
anti_join(stop_words) %>% # get rid of stop words
filter(word %in% c("wood", "thyme", "creamy", "mild", "heavy", "lean", "supple", "marlborough", "mocha", "burgundy", "casablanca", "new york", "oregon", "california", "brisk", "bramble", "penetrating", "pretty", "forward")) %>%
count(ID, word) %>%
group_by(ID) %>%
mutate(freq = n/sum(n)) %>%
mutate(exists = (n>0)) %>%
ungroup %>%
group_by(word) %>%
mutate(total = sum(n))
head(df, 10)
| ID | word | n | freq | exists | total |
|---|---|---|---|---|---|
| 2 | oregon | 1 | 1.0 | TRUE | 214 |
| 7 | thyme | 1 | 1.0 | TRUE | 300 |
| 8 | brisk | 1 | 1.0 | TRUE | 52 |
| 10 | forward | 1 | 1.0 | TRUE | 258 |
| 16 | forward | 1 | 1.0 | TRUE | 258 |
| 17 | penetrating | 1 | 0.5 | TRUE | 50 |
| 17 | supple | 1 | 0.5 | TRUE | 210 |
| 19 | wood | 1 | 1.0 | TRUE | 333 |
| 23 | bramble | 1 | 1.0 | TRUE | 60 |
| 25 | wood | 1 | 1.0 | TRUE | 333 |
df %>%
left_join(select(wine, ID, province), by = "ID") %>%
count(province, word) %>%
group_by(province) %>%
top_n(20,n) %>%
arrange(province, desc(n))
| word | province | n |
|---|---|---|
| wood | Burgundy | 229 |
| burgundy | Burgundy | 24 |
| forward | Burgundy | 19 |
| lean | Burgundy | 10 |
| supple | Burgundy | 5 |
| heavy | Burgundy | 3 |
| brisk | Burgundy | 2 |
| pretty | Burgundy | 2 |
| creamy | Burgundy | 1 |
| mild | Burgundy | 1 |
| thyme | Burgundy | 1 |
| thyme | California | 284 |
| lean | California | 95 |
| pretty | California | 85 |
| supple | California | 69 |
| heavy | California | 51 |
| wood | California | 41 |
| mild | California | 40 |
| mocha | California | 35 |
| forward | California | 25 |
| bramble | California | 23 |
| brisk | California | 21 |
| california | California | 19 |
| creamy | California | 16 |
| burgundy | California | 8 |
| oregon | California | 3 |
| penetrating | California | 3 |
| creamy | Casablanca_Valley | 15 |
| mild | Casablanca_Valley | 14 |
| heavy | Casablanca_Valley | 13 |
| lean | Casablanca_Valley | 13 |
| wood | Casablanca_Valley | 3 |
| california | Casablanca_Valley | 2 |
| bramble | Casablanca_Valley | 1 |
| casablanca | Casablanca_Valley | 1 |
| forward | Casablanca_Valley | 1 |
| supple | Marlborough | 48 |
| marlborough | Marlborough | 35 |
| mocha | Marlborough | 24 |
| pretty | Marlborough | 15 |
| creamy | Marlborough | 9 |
| heavy | Marlborough | 5 |
| lean | Marlborough | 4 |
| wood | Marlborough | 3 |
| forward | Marlborough | 1 |
| thyme | Marlborough | 1 |
| brisk | New_York | 28 |
| bramble | New_York | 23 |
| penetrating | New_York | 18 |
| pretty | New_York | 9 |
| forward | New_York | 4 |
| wood | New_York | 4 |
| mocha | New_York | 2 |
| lean | New_York | 1 |
| supple | New_York | 1 |
| pretty | Oregon | 255 |
| forward | Oregon | 206 |
| oregon | Oregon | 204 |
| supple | Oregon | 86 |
| mocha | Oregon | 77 |
| wood | Oregon | 44 |
| penetrating | Oregon | 28 |
| california | Oregon | 26 |
| burgundy | Oregon | 17 |
| bramble | Oregon | 13 |
| creamy | Oregon | 13 |
| heavy | Oregon | 13 |
| lean | Oregon | 12 |
| thyme | Oregon | 4 |
| mild | Oregon | 1 |
df %>%
count(word) %>%
arrange(desc(n))
| word | n |
|---|---|
| pretty | 366 |
| wood | 324 |
| thyme | 290 |
| forward | 256 |
| supple | 209 |
| oregon | 207 |
| mocha | 138 |
| lean | 135 |
| heavy | 85 |
| bramble | 60 |
| mild | 56 |
| creamy | 54 |
| brisk | 51 |
| burgundy | 49 |
| penetrating | 49 |
| california | 47 |
| marlborough | 35 |
| casablanca | 1 |
wino <- df %>%
pivot_wider(id_cols = ID, names_from = word, values_from = exists, values_fill = list(exists=0)) %>%
right_join(select(wine,ID, province)) %>%
drop_na()
head(wino)
| ID | oregon | thyme | brisk | forward | penetrating | supple | wood | bramble | mocha | lean | marlborough | pretty | heavy | mild | burgundy | california | creamy | casablanca | province |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | Oregon |
| 7 | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | California |
| 8 | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | California |
| 10 | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | Oregon |
| 16 | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | Oregon |
| 17 | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | Oregon |
set.seed(504)
wine_index <- createDataPartition(wino$province, p = 0.80, list = FALSE)
train <- wino[ wine_index, ]
test <- wino[-wine_index, ]
fit <- train(province ~ .,
data = train,
method = "naive_bayes",
tuneGrid = expand.grid(usekernel = c(T,F), laplace = T, adjust = T),
metric = "Kappa",
trControl = trainControl(method = "cv", number = 5))
fit
## Naive Bayes
##
## 1693 samples
## 19 predictor
## 6 classes: 'Burgundy', 'California', 'Casablanca_Valley', 'Marlborough', 'New_York', 'Oregon'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1354, 1355, 1354, 1355, 1354
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.1181006 0.08042429
## TRUE 0.5818174 0.33489773
##
## Tuning parameter 'laplace' was held constant at a value of TRUE
##
## Tuning parameter 'adjust' was held constant at a value of TRUE
## Kappa was used to select the optimal model using the largest value.
## The final values used for the model were laplace = TRUE, usekernel = TRUE
## and adjust = TRUE.
confusionMatrix(predict(fit, test),factor(test$province))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Burgundy California Casablanca_Valley Marlborough New_York
## Burgundy 0 0 0 0 0
## California 53 127 10 17 9
## Casablanca_Valley 0 0 0 0 0
## Marlborough 0 0 0 0 0
## New_York 0 0 0 0 0
## Oregon 4 26 0 5 4
## Reference
## Prediction Oregon
## Burgundy 0
## California 44
## Casablanca_Valley 0
## Marlborough 0
## New_York 0
## Oregon 122
##
## Overall Statistics
##
## Accuracy : 0.5914
## 95% CI : (0.5428, 0.6388)
## No Information Rate : 0.3943
## P-Value [Acc > NIR] : 2.627e-16
##
## Kappa : 0.3461
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Burgundy Class: California Class: Casablanca_Valley
## Sensitivity 0.0000 0.8301 0.00000
## Specificity 1.0000 0.5037 1.00000
## Pos Pred Value NaN 0.4885 NaN
## Neg Pred Value 0.8646 0.8385 0.97625
## Prevalence 0.1354 0.3634 0.02375
## Detection Rate 0.0000 0.3017 0.00000
## Detection Prevalence 0.0000 0.6176 0.00000
## Balanced Accuracy 0.5000 0.6669 0.50000
## Class: Marlborough Class: New_York Class: Oregon
## Sensitivity 0.00000 0.00000 0.7349
## Specificity 1.00000 1.00000 0.8471
## Pos Pred Value NaN NaN 0.7578
## Neg Pred Value 0.94774 0.96912 0.8308
## Prevalence 0.05226 0.03088 0.3943
## Detection Rate 0.00000 0.00000 0.2898
## Detection Prevalence 0.00000 0.00000 0.3824
## Balanced Accuracy 0.50000 0.50000 0.7910
List the three words that most distinguish New York Pinots from all other Pinots.
wine %>%
count(province)
| province | n |
|---|---|
| Burgundy | 1193 |
| California | 3959 |
| Casablanca_Valley | 131 |
| Marlborough | 229 |
| New_York | 131 |
| Oregon | 2737 |
df_2 <- wine %>%
unnest_tokens(word, description) %>%
filter(word %in% c("brisk", "delicate", "bramble")) %>%
anti_join(stop_words) %>% # get rid of stop words
count(ID, word) %>%
group_by(ID) %>%
mutate(freq = n/sum(n)) %>%
mutate(exists = (n>0)) %>%
ungroup %>%
group_by(word) %>%
mutate(total = sum(n))
df_2 %>%
left_join(select(wine, ID, province), by = "ID") %>%
count(province, word) %>%
group_by(province) %>%
top_n(10,n) %>%
arrange(province, desc(n))
| word | province | n |
|---|---|---|
| delicate | Burgundy | 9 |
| brisk | Burgundy | 2 |
| delicate | California | 100 |
| bramble | California | 23 |
| brisk | California | 21 |
| bramble | Casablanca_Valley | 1 |
| delicate | Marlborough | 17 |
| brisk | New_York | 28 |
| delicate | New_York | 25 |
| bramble | New_York | 23 |
| delicate | Oregon | 119 |
| bramble | Oregon | 13 |