Setup (5pts)

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)

wine <- read_rds("/Users/Rose/Downloads/wine.rds") %>%
  filter(province == "Oregon" | province == "California" | province == "New York") %>% 
  # flag whether "cherry" (in either case) appears in the description
  mutate(cherry = as.integer(str_detect(description, "[Cc]herry"))) %>% 
  mutate(lprice = log(price)) %>% # log-transform price to reduce right skew
  select(price, lprice, points, cherry, province)

Answer: The knitr::opts_chunk$set() call sets global chunk options so that code is echoed but messages and warnings are suppressed. library(tidyverse) loads the tidyverse packages (dplyr, stringr, ggplot2, etc.). read_rds() reads the saved wine data frame from disk. filter() keeps only wines from Oregon, California, or New York. The first mutate() creates cherry, a 0/1 indicator for whether the word "cherry" (capitalized or not) appears anywhere in the description. The second mutate() creates lprice, the natural log of price. Finally, select() keeps just the five columns used in the analysis.
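As a quick illustration of how the [Cc]herry pattern behaves, here is a toy call with made-up strings (not rows from the data):

str_detect(c("Black cherry aromas", "Cherry and plum", "citrus and oak"),
           "[Cc]herry") # returns TRUE TRUE FALSE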

Multiple Regression

(2pts)

Run a linear regression model with log of price as the dependent variable and ‘points’ and ‘cherry’ as features (variables). Report the RMSE.

library(moderndive)
m1 <- lm(lprice ~ points + cherry, data = wine)
get_regression_summaries(m1) # RMSE = .4687
r_squared  adj_r_squared  mse        rmse       sigma  statistic  p_value  df
0.305      0.305          0.2197412  0.4687657  0.469  5845.826   0        3
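As a sanity check, the RMSE is just the square root of the mean squared residual, so it can be recovered directly from the fitted object (a minimal sketch using m1 above):

sqrt(mean(residuals(m1)^2)) # ~0.4688, matching the rmse column above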

(2pts)

Run the same model as above, but add an interaction between ‘points’ and ‘cherry’.

m2 <- lm(lprice ~ points * cherry, data = wine)
get_regression_summaries(m2) # RMSE = .4685
r_squared  adj_r_squared  mse        rmse       sigma  statistic  p_value  df
0.306      0.306          0.2195131  0.4685223  0.469  3910.329   0        4
get_regression_table(m2) # coefficient estimates
term           estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept      -5.660    0.102      -55.350    0        -5.860    -5.459
points         0.102     0.001      88.981     0        0.100     0.104
cherry         -1.015    0.216      -4.703     0        -1.438    -0.592
points:cherry  0.013     0.002      5.256      0        0.008     0.017

(3pts)

How should I interpret the coefficient on the interaction variable? Please explain as you would to a non-technical manager.

This is a continuous-by-categorical interaction: the continuous variable points is interacted with the categorical (0/1) variable cherry.

  • Think about it in terms of slope
    • The points coefficient (0.102) is the baseline: the slope relating points to log(price) for wines whose description does not mention cherry
      • Once cherry appears in the description, this slope increases from 0.102 to 0.115 (0.102 + 0.013) – the points coefficient plus the interaction coefficient
      • The interaction estimate (0.013) only comes into play when both points and cherry are present, i.e., when cherry = 1

Whenever the word cherry is in the description, increases in points have a greater effect on price

So for wines that have a note of cherry, the relationship between points and price is stronger: one extra point means more for my pricing power in a wine with cherry than in a wine without it. The two slopes are computed below.
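To make the slopes concrete, they can be pulled straight out of the fitted model (a small sketch using the m2 object above):

b <- coef(m2)
b["points"]                      # slope without cherry: 0.102
b["points"] + b["points:cherry"] # slope with cherry: 0.102 + 0.013 = 0.115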

(Bonus: 1pt)

In which province (Oregon, California, or New York), does the ‘cherry’ feature in the data affect price most? Show your code and write the answer below.

m3 <- lm(lprice ~ cherry * province, data = wine)
get_regression_summaries(m3)
r_squared  adj_r_squared  mse        rmse      sigma  statistic  p_value  df
0.08       0.08           0.2910074  0.539451  0.54   463.724    0        6
get_regression_table(m3)
term                     estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept                3.492     0.004      778.012    0.000    3.483     3.501
cherry                   0.177     0.009      19.385     0.000    0.159     0.195
provinceNew York         -0.472    0.014      -34.454    0.000    -0.499    -0.445
provinceOregon           -0.087    0.010      -8.847     0.000    -0.106    -0.067
cherry:provinceNew York  -0.003    0.027      -0.130     0.896    -0.056    0.049
cherry:provinceOregon    0.126     0.020      6.451      0.000    0.088     0.164
  • This is a categorical-by-categorical interaction (cherry × province, with California as the reference level)
  • Essentially asking: what is the average log(price) for each combination of categories?
  • One combination is California with cherry, another is California without cherry, and so on

  • log(price) = intercept + cherry + NY + Oregon + (cherry × NY) + (cherry × Oregon)
    • The intercept represents California, the omitted reference category
    • When a set of dummy variables enters the equation, one level is dropped and absorbed into the intercept
  • The log(price) of a basic (no-cherry) California wine is about 3.5 (see the coefficient sketch below)
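Combining the coefficients gives the effect of cherry on log(price) in each province (a sketch using the m3 fit above; California is the reference level):

b3 <- coef(m3)
b3["cherry"]                                 # California: 0.177
b3["cherry"] + b3["cherry:provinceNew York"] # New York: 0.177 - 0.003 = 0.174
b3["cherry"] + b3["cherry:provinceOregon"]   # Oregon: 0.177 + 0.126 = 0.303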

library(wesanderson)
wine %>% 
  group_by(cherry, province) %>%
  summarise(lprice = mean(lprice)) %>% # mean log(price) per cherry/province cell
  ggplot(aes(cherry, lprice, color = province)) +
  geom_line() + geom_point() +
  scale_color_manual(values = wes_palette("Darjeeling1", n = 3)) # apply the loaded palette

  • The log(price) of New York wines is low compared to the other two provinces
  • Cherry helps Oregon wines (in terms of pricing power) more than it helps California or New York wines: Oregon has the steepest slope, so Oregon is the province where the cherry feature affects price most

Data Ethics

(3pts)

Imagine that you are a manager at an E-commerce operation that sells wine online. Your employee has been building a model to distinguish New York wines from those in California and Oregon. After a few days of work, your employee bursts into your office and exclaims, “I’ve achieved 91% accuracy on my model!”

Should you be impressed? Why or why not? Use simple descriptive statistics from the data to justify your answer.

wine %>%
  count(province) %>%
  arrange(desc(n))
province       n
California 19073
Oregon      5147
New York    2364
  • There is a large imbalance in the counts of each province
  • A model that never predicts New York would still be correct about 91% of the time, because New York makes up only about 9% of the data (2364 of 26584 wines). So 91% accuracy is no better than the no-information baseline (verified below), and I should not be impressed
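That baseline is easy to verify directly from the data:

mean(wine$province != "New York") # accuracy of always guessing "not NY": ~0.91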

(3pts)

Why is understanding the vignette in the previous question important if you want to use machine learning in an ethical manner?

  • Natural imbalances in data lead a model to fit the dominant category while ignoring the underrepresented ones (here, Oregon and New York). Decisions facilitated by such a model therefore do not serve the underrepresented groups, and a headline accuracy number can hide that failure

(3pts)

Imagine you are working on a model to predict the likelihood that an individual loses their job as the result of the Corona virus. You have a very large dataset with many hundreds of features, but you are worried that including indicators like age, income or gender might pose some ethical problems. When you discuss these concerns with your boss, she tells you to simply drop those features from the model. Does this solve the ethical issue? Why or why not?

  • It does not, because the same biases remain in the data whether or not we keep the demographic columns: other features can act as proxies for age, income, or gender (for example, zip code or job title often correlate strongly with them), so the model can still learn the same discriminatory patterns. Dropping the features hides the problem rather than solving it