knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
wine <- read_rds("/Users/Rose/Downloads/wine.rds") %>%
filter(province=="Oregon" | province == "California" | province == "New York") %>%
mutate(cherry=as.integer(str_detect(description,"[Cc]herry"))) %>%
# cherry = 1 if the description mentions "cherry" or "Cherry", otherwise 0
mutate(lprice=log(price)) %>%
select(price, lprice, points, cherry, province)
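Answer: The first line sets the knitr chunk options so that code is echoed while messages and warnings are hidden. library(tidyverse) loads dplyr, stringr, readr, and ggplot2. read_rds() loads the wine data, and filter() keeps only wines from Oregon, California, or New York. The first mutate() creates cherry (1 if the description contains "cherry" or "Cherry", otherwise 0), the second mutate() creates lprice as the natural log of price, and select() keeps the five columns used below.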
Run a linear regression model with log of price as the dependent variable and ‘points’ and ‘cherry’ as features (variables). Report the RMSE.
library(moderndive)
m1 <- lm(lprice ~ points + cherry, data = wine)
get_regression_summaries(m1) # RMSE = .4687
r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df |
---|---|---|---|---|---|---|---|
0.305 | 0.305 | 0.2197412 | 0.4687657 | 0.469 | 5845.826 | 0 | 3 |
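The RMSE in this table can be checked by hand. A minimal sketch, assuming (as I understand moderndive) that rmse is the square root of the mean squared residual while sigma divides by the residual degrees of freedom instead of n. The built-in mtcars data stands in for wine here so that the block runs without the local .rds file; the names fit, res, rmse, and sigma_hat are my own.

```r
# Stand-in data: mtcars ships with R, so no file path is needed.
# With the wine data, the equivalent fitted model is m1 above.
fit <- lm(log(mpg) ~ hp + am, data = mtcars)

res <- residuals(fit)

# RMSE: square root of the average squared residual (divides by n)
rmse <- sqrt(mean(res^2))

# Residual standard error: divides by n minus the number of coefficients
sigma_hat <- sqrt(sum(res^2) / df.residual(fit))

c(rmse = rmse, sigma = sigma_hat)
```

Because lprice is on the log scale, the RMSE of about 0.469 is in log units, so a typical prediction is off by a factor of roughly exp(0.469), or around 1.6, in price.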
Run the same model as above, but add an interaction between ‘points’ and ‘cherry’.
m2 <- lm(lprice ~ points * cherry, data = wine)
get_regression_summaries(m2) # RMSE = .4685
r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df |
---|---|---|---|---|---|---|---|
0.306 | 0.306 | 0.2195131 | 0.4685223 | 0.469 | 3910.329 | 0 | 4 |
get_regression_table(m2) # coefficient estimates, including the points:cherry interaction
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | -5.660 | 0.102 | -55.350 | 0 | -5.860 | -5.459 |
points | 0.102 | 0.001 | 88.981 | 0 | 0.100 | 0.104 |
cherry | -1.015 | 0.216 | -4.703 | 0 | -1.438 | -0.592 |
points:cherry | 0.013 | 0.002 | 5.256 | 0 | 0.008 | 0.017 |
How should I interpret the coefficient on the interaction variable? Please explain as you would to a non-technical manager.
This is a categorical-by-continuous interaction: the continuous variable points is interacted with the binary (0/1) variable cherry.
When the word "cherry" appears in the description, an increase in points is associated with a larger increase in price. The coefficient of 0.013 means that each extra point goes with roughly 1.3 percentage points more price growth for cherry wines (about 11.5% per point) than for wines without cherry (about 10.2% per point).
So for wines with a note of cherry, the relationship between points and price is stronger: one extra point on a wine with cherry means more for my pricing power than one extra point on a wine that doesn't mention cherry.
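To put numbers on that explanation, the per-point slopes for each group can be recovered from the m2 table. This is a sketch with the estimates typed in by hand from the table above (they are rounded, so the results are approximate); the variable names are my own. Reading a log-scale coefficient directly as a percentage is an approximation, so the code converts with exp().

```r
# Estimates copied from get_regression_table(m2)
b_points        <- 0.102  # slope on points when cherry == 0
b_points_cherry <- 0.013  # extra slope on points when cherry == 1

slope_no_cherry <- b_points
slope_cherry    <- b_points + b_points_cherry

# lprice is log(price), so exp(slope) - 1 is the proportional
# change in price associated with one additional point
pct_no_cherry <- 100 * (exp(slope_no_cherry) - 1)
pct_cherry    <- 100 * (exp(slope_cherry) - 1)

round(c(no_cherry = pct_no_cherry, cherry = pct_cherry), 1)
```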
In which province (Oregon, California, or New York), does the ‘cherry’ feature in the data affect price most? Show your code and write the answer below.
m3 <- lm(lprice ~ cherry * province, data = wine)
get_regression_summaries(m3)
r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df |
---|---|---|---|---|---|---|---|
0.08 | 0.08 | 0.2910074 | 0.539451 | 0.54 | 463.724 | 0 | 6 |
get_regression_table(m3)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 3.492 | 0.004 | 778.012 | 0.000 | 3.483 | 3.501 |
cherry | 0.177 | 0.009 | 19.385 | 0.000 | 0.159 | 0.195 |
provinceNew York | -0.472 | 0.014 | -34.454 | 0.000 | -0.499 | -0.445 |
provinceOregon | -0.087 | 0.010 | -8.847 | 0.000 | -0.106 | -0.067 |
cherry:provinceNew York | -0.003 | 0.027 | -0.130 | 0.896 | -0.056 | 0.049 |
cherry:provinceOregon | 0.126 | 0.020 | 6.451 | 0.000 | 0.088 | 0.164 |
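California is the reference province, so the coefficients describe six groups: California without cherry, California with cherry, and the same pair for New York and Oregon.
The intercept is the mean log(price) of the baseline wine, a California wine without cherry: about 3.5.
The cherry effect on log(price) is 0.177 in California, 0.177 - 0.003 = 0.174 in New York, and 0.177 + 0.126 = 0.303 in Oregon, so ‘cherry’ affects price most in Oregon.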
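The province-specific cherry effects come from adding each interaction term to the baseline cherry coefficient in the m3 table. A minimal sketch, with the rounded estimates typed in by hand from that table; the vector names b and cherry_effect are my own.

```r
# Estimates copied from get_regression_table(m3).
# California is the reference level, so it has no interaction term.
b <- c(
  cherry          = 0.177,
  cherry_new_york = -0.003,  # cherry:provinceNew York
  cherry_oregon   = 0.126    # cherry:provinceOregon
)

# Effect of cherry on log(price) within each province
cherry_effect <- c(
  California = b[["cherry"]],
  `New York` = b[["cherry"]] + b[["cherry_new_york"]],
  Oregon     = b[["cherry"]] + b[["cherry_oregon"]]
)

cherry_effect
names(which.max(cherry_effect))
```

Note that the New York interaction has a p-value of 0.896 in the table, so the cherry effect in New York is not distinguishable from the one in California.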
library(wesanderson)
wine %>%
group_by(cherry, province) %>%
summarise(lprice = mean(lprice)) %>%
ggplot(aes(cherry, lprice, color = province)) +
geom_line() + geom_point()
Imagine that you are a manager at an E-commerce operation that sells wine online. Your employee has been building a model to distinguish New York wines from those in California and Oregon. After a few days of work, your employee bursts into your office and exclaims, “I’ve achieved 91% accuracy on my model!”
Should you be impressed? Why or why not? Use simple descriptive statistics from the data to justify your answer.
wine %>%
count(province) %>%
arrange(desc(n))
province | n |
---|---|
California | 19073 |
Oregon | 5147 |
New York | 2364 |
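These counts give the benchmark for the 91% claim. A minimal sketch, with the counts typed in from the table above: a "model" that always predicts "not New York" never looks at a single feature, and its accuracy is simply the share of wines that are not from New York.

```r
# Counts copied from the count(province) output
n <- c(California = 19073, Oregon = 5147, `New York` = 2364)

share_new_york    <- n[["New York"]] / sum(n)
baseline_accuracy <- 1 - share_new_york  # always guess "not New York"

round(c(share_new_york = share_new_york,
        baseline_accuracy = baseline_accuracy), 3)
```

Since this do-nothing baseline already reaches about 91%, accuracy alone does not show that the employee's model has learned anything. How many of the New York wines it actually identifies would be a more informative measure.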
Why is understanding the vignette in the previous question important if you want to use machine learning in an ethical manner?
Imagine you are working on a model to predict the likelihood that an individual loses their job as a result of the coronavirus. You have a very large dataset with many hundreds of features, but you are worried that including indicators like age, income, or gender might pose some ethical problems. When you discuss these concerns with your boss, she tells you to simply drop those features from the model. Does this solve the ethical issue? Why or why not?