Setup (5pts)

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)

wine <- read_rds("/Users/Rose/Downloads/wine.rds") %>%
  filter(province == "Oregon" | province == "California" | province == "New York") %>% 
  # flag whether "cherry" (in either case) appears in the description
  mutate(cherry = as.integer(str_detect(description, "[Cc]herry"))) %>% 
  mutate(lprice = log(price)) %>% # log-transform price to reduce right skew
  select(price, lprice, points, cherry, province)

Answer: The knitr::opts_chunk$set() call sets global chunk options so that code is echoed but messages and warnings are suppressed. library(tidyverse) loads the tidyverse packages (dplyr, stringr, ggplot2, etc.). read_rds() reads the saved wine data frame from disk. filter() keeps only wines from Oregon, California, or New York. The first mutate() creates cherry, a 0/1 indicator for whether the word "cherry" (capitalized or not) appears anywhere in the description. The second mutate() creates lprice, the natural log of price. Finally, select() keeps just the five columns used in the analysis.
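As a quick illustration of how the [Cc]herry pattern behaves, here is a toy call with made-up strings (not rows from the data):

str_detect(c("Black cherry aromas", "Cherry and plum", "citrus and oak"),
           "[Cc]herry") # returns TRUE TRUE FALSE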

Multiple Regression

(2pts)

Run a linear regression model with log of price as the dependent variable and ‘points’ and ‘cherry’ as features (variables). Report the RMSE.

library(moderndive)
m1 <- lm(lprice ~ points + cherry, data = wine)
get_regression_summaries(m1) # RMSE = .4687
r_squared  adj_r_squared  mse        rmse       sigma  statistic  p_value  df
0.305      0.305          0.2197412  0.4687657  0.469  5845.826   0        3
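As a sanity check, the RMSE is just the square root of the mean squared residual, so it can be recovered directly from the fitted object (a minimal sketch using m1 above):

sqrt(mean(residuals(m1)^2)) # ~0.4688, matching the rmse column above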

(2pts)

Run the same model as above, but add an interaction between ‘points’ and ‘cherry’.

m2 <- lm(lprice ~ points * cherry, data = wine)
get_regression_summaries(m2) # RMSE = .4685
r_squared  adj_r_squared  mse        rmse       sigma  statistic  p_value  df
0.306      0.306          0.2195131  0.4685223  0.469  3910.329   0        4
get_regression_table(m2) # coefficient estimates
term           estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept      -5.660    0.102      -55.350    0        -5.860    -5.459
points         0.102     0.001      88.981     0        0.100     0.104
cherry         -1.015    0.216      -4.703     0        -1.438    -0.592
points:cherry  0.013     0.002      5.256      0        0.008     0.017

(3pts)

How should I interpret the coefficient on the interaction variable? Please explain as you would to a non-technical manager.

This is a continuous-by-categorical interaction: the continuous variable points is interacted with the categorical (0/1) variable cherry.

  • Think about it in terms of slope
    • The points coefficient (0.102) is the baseline: the slope relating points to log(price) for wines whose description does not mention cherry
      • Once cherry appears in the description, this slope increases from 0.102 to 0.115 (0.102 + 0.013) – the points coefficient plus the interaction coefficient
      • The interaction estimate (0.013) only comes into play when both points and cherry are present, i.e., when cherry = 1

Whenever the word cherry is in the description, increases in points have a greater effect on price

So for wines that have a note of cherry, the relationship between points and price is stronger: one extra point means more for my pricing power in a wine with cherry than in a wine without it. The two slopes are computed below.
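To make the slopes concrete, they can be pulled straight out of the fitted model (a small sketch using the m2 object above):

b <- coef(m2)
b["points"]                      # slope without cherry: 0.102
b["points"] + b["points:cherry"] # slope with cherry: 0.102 + 0.013 = 0.115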

(Bonus: 1pt)

In which province (Oregon, California, or New York), does the ‘cherry’ feature in the data affect price most? Show your code and write the answer below.

m3 <- lm(lprice ~ cherry * province, data = wine)
get_regression_summaries(m3)
r_squared  adj_r_squared  mse        rmse      sigma  statistic  p_value  df
0.08       0.08           0.2910074  0.539451  0.54   463.724    0        6
get_regression_table(m3)
term                     estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept                3.492     0.004      778.012    0.000    3.483     3.501
cherry                   0.177     0.009      19.385     0.000    0.159     0.195
provinceNew York         -0.472    0.014      -34.454    0.000    -0.499    -0.445
provinceOregon           -0.087    0.010      -8.847     0.000    -0.106    -0.067
cherry:provinceNew York  -0.003    0.027      -0.130     0.896    -0.056    0.049
cherry:provinceOregon    0.126     0.020      6.451      0.000    0.088     0.164
  • This is a categorical-by-categorical interaction (cherry × province, with California as the reference level)
  • Essentially asking: what is the average log(price) for each combination of categories?
  • One combination is California with cherry, another is California without cherry, and so on

  • log(price) = intercept + cherry + NY + Oregon + (cherry × NY) + (cherry × Oregon)
    • The intercept represents California, the omitted reference category
    • When a set of dummy variables enters the equation, one level is dropped and absorbed into the intercept
  • The log(price) of a basic (no-cherry) California wine is about 3.5 (see the coefficient sketch below)
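Combining the coefficients gives the effect of cherry on log(price) in each province (a sketch using the m3 fit above; California is the reference level):

b3 <- coef(m3)
b3["cherry"]                                 # California: 0.177
b3["cherry"] + b3["cherry:provinceNew York"] # New York: 0.177 - 0.003 = 0.174
b3["cherry"] + b3["cherry:provinceOregon"]   # Oregon: 0.177 + 0.126 = 0.303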

library(wesanderson)
wine %>% 
  group_by(cherry, province) %>%
  summarise(lprice = mean(lprice)) %>% # mean log(price) per cherry/province cell
  ggplot(aes(cherry, lprice, color = province)) +
  geom_line() + geom_point() +
  scale_color_manual(values = wes_palette("Darjeeling1", n = 3)) # apply the loaded palette

  • The log(price) of New York wines is low compared to the other two provinces
  • Cherry helps Oregon wines (in terms of pricing power) more than it helps California or New York wines: Oregon has the steepest slope, so Oregon is the province where the cherry feature affects price most

Data Ethics

(3pts)

Imagine that you are a manager at an E-commerce operation that sells wine online. Your employee has been building a model to distinguish New York wines from those in California and Oregon. After a few days of work, your employee bursts into your office and exclaims, “I’ve achieved 91% accuracy on my model!”

Should you be impressed? Why or why not? Use simple descriptive statistics from the data to justify your answer.

wine %>%
  count(province) %>%
  arrange(desc(n))
province       n
California 19073
Oregon      5147
New York    2364
  • There is a large imbalance in the counts of each province
  • A model that never predicts New York would still be correct about 91% of the time, because New York makes up only about 9% of the data (2364 of 26584 wines). So 91% accuracy is no better than the no-information baseline (verified below), and I should not be impressed
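That baseline is easy to verify directly from the data:

mean(wine$province != "New York") # accuracy of always guessing "not NY": ~0.91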

(3pts)

Why is understanding the vignette in the previous question important if you want to use machine learning in an ethical manner?

  • Natural imbalances in data lead a model to fit the dominant category while ignoring the underrepresented ones (here, Oregon and New York). Decisions facilitated by such a model therefore do not serve the underrepresented groups, and a headline accuracy number can hide that failure

(3pts)

Imagine you are working on a model to predict the likelihood that an individual loses their job as the result of the Corona virus. You have a very large dataset with many hundreds of features, but you are worried that including indicators like age, income or gender might pose some ethical problems. When you discuss these concerns with your boss, she tells you to simply drop those features from the model. Does this solve the ethical issue? Why or why not?

  • It does not, because the same biases remain in the data whether or not we keep the demographic columns: other features can act as proxies for age, income, or gender (for example, zip code or job title often correlate strongly with them), so the model can still learn the same discriminatory patterns. Dropping the features hides the problem rather than solving it