Directions:

Please turn in a knitted HTML file or PDF on WISE before next class.

Setup (5pts)

Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘wine.rds’ data file from your own computer. In the space provided after the R chunk, explain what this code is doing (line by line).

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(moderndive)
library(sjPlot)
library(sjmisc)
library(ggplot2)

wine <- read_rds("/cloud/project/resources/wine.rds") %>%
  # keep only wines from the three provinces of interest
  filter(province %in% c("Oregon", "California", "New York")) %>%
  # cherry = 1 if the description mentions "cherry" or "Cherry", else 0
  mutate(cherry = as.integer(str_detect(description, "[Cc]herry")),
         # model price on the log scale
         lprice = log(price)) %>%
  select(lprice, points, cherry, province)

Answer: I added a few packages. moderndive lets me call the get_regression functions to further analyze the linear models that I will be running. I also included sjPlot and sjmisc so that I can easily plot interaction models and show the difference between the original model and the interaction effect. Since we’re using the cloud, and that is where the dataset is stored, I adjusted the read_rds path so that it can find the data set. The rest of the data wrangling was already included in the code: it keeps wines from Oregon, California, and New York, flags descriptions that mention cherry, takes the log of price, and keeps the four columns used below.

Multiple Regression

(2pts)

Run a linear regression model with log of price as the dependent variable and ‘points’ and ‘cherry’ as features (variables). Report the RMSE.

# hint: m1 <- lm(lprice ~ points + cherry)

RMSE is 0.4687657

m1 <- lm(lprice~points+cherry, data = wine)

m1

## 
## Call:
## lm(formula = lprice ~ points + cherry, data = wine)
## 
## Coefficients:
## (Intercept)       points       cherry  
##     -5.9157       0.1051       0.1188

get_regression_table(m1)

term	estimate	std_error	statistic	lower_ci	upper_ci
intercept	-5.916	0.090	-65.776	-6.092	-5.739
points	0.105	0.001	104.030	0.103	0.107
cherry	0.119	0.007	17.700	0.106	0.132

get_regression_summaries(m1)

r_squared	adj_r_squared	mse	rmse	sigma	statistic	p_value	df	nobs
0.305	0.305	0.2197412	0.4687657	0.469	5845.826	0	2	26584
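
The RMSE reported above can also be computed by hand from the model residuals. The sketch below uses base R on a small simulated data frame (the data, seed, and coefficients are made up for illustration and are not the wine data). It also shows how RMSE differs from sigma, which divides by the residual degrees of freedom instead of the number of rows.

```r
# Simulated stand-in for the wine data (hypothetical values, illustration only)
set.seed(1)
n <- 200
toy <- data.frame(
  points = sample(80:100, n, replace = TRUE),
  cherry = rbinom(n, 1, 0.25)
)
toy$lprice <- -5.9 + 0.105 * toy$points + 0.12 * toy$cherry + rnorm(n, sd = 0.47)

fit <- lm(lprice ~ points + cherry, data = toy)

# RMSE: square root of the mean squared residual (divides by n)
rmse <- sqrt(mean(residuals(fit)^2))

# sigma: residual standard error (divides by n minus the number of coefficients)
sigma_hat <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

c(rmse = rmse, sigma = sigma_hat)
```

With 26,584 rows and only three coefficients, the two measures are almost identical, which is consistent with the reported rmse of 0.4687657 and sigma of 0.469.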

(2pts)

Run the same model as above, but add an interaction between ‘points’ and ‘cherry’.

RMSE is 0.4685223

m2 <- lm(lprice~points*cherry, data=wine)

m2

## 
## Call:
## lm(formula = lprice ~ points * cherry, data = wine)
## 
## Coefficients:
##   (Intercept)         points         cherry  points:cherry  
##      -5.65962        0.10223       -1.01490        0.01266

get_regression_table(m2)

term	estimate	std_error	statistic	lower_ci	upper_ci
intercept	-5.660	0.102	-55.350	-5.860	-5.459
points	0.102	0.001	88.981	0.100	0.104
cherry	-1.015	0.216	-4.703	-1.438	-0.592
points:cherry	0.013	0.002	5.256	0.008	0.017

get_regression_summaries(m2)

r_squared	adj_r_squared	mse	rmse	sigma	statistic	p_value	df	nobs
0.306	0.306	0.2195131	0.4685223	0.469	3910.329	0	3	26584

(3pts)

How should I interpret the coefficient on the interaction variable? Please explain as you would to a non-technical audience.

Answer: If cherry is TRUE, the fitted line gets a new intercept and a new slope. The intercept drops by 1.015, from -5.660 to -6.675. The slope rises by 0.013, from 0.102 to 0.115. A lower intercept with a higher slope means the line for cherry wines starts lower but climbs faster than the line for wines without cherry, although the two slopes are very close. In plain terms, each extra point is associated with a slightly bigger price increase for wines whose description mentions cherry. In the plot below you will see a red line where cherry is 0 (meaning FALSE) and a blue line where cherry is 1 (meaning TRUE). When cherry is TRUE, we can see that log price increases more rapidly as points increase.

plot_model(m2, type = "int")
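
The arithmetic behind the two fitted lines can be checked directly from the m2 coefficients. This is a minimal sketch in base R that copies the printed coefficients by hand; the variable names are my own, and the percent conversion assumes lprice is the natural log of price, as in the setup code.

```r
# Coefficients copied from the m2 output above
b0       <- -5.65962   # intercept
b_points <-  0.10223   # points
b_cherry <- -1.01490   # cherry
b_int    <-  0.01266   # points:cherry

# Intercept and slope of the fitted line for each value of cherry
fitted_lines <- data.frame(
  cherry    = c(0, 1),
  intercept = c(b0, b0 + b_cherry),
  slope     = c(b_points, b_points + b_int)
)
fitted_lines

# Score at which the two lines cross (the cherry effect is zero there)
crossover <- -b_cherry / b_int
crossover

# Approximate percent change in price per extra point on each line
pct_per_point <- (exp(fitted_lines$slope) - 1) * 100
round(pct_per_point, 1)
```

The lines cross at roughly 80 points, which is the bottom of the score range in this data, so the cherry line sits above the other line for almost every observed score and the gap widens as points increase. The per-point price change works out to roughly 11% without cherry and 12% with cherry.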

(Bonus: 1pt)

In which province (Oregon, California, or New York), does the ‘cherry’ feature in the data affect price most? Show your code and write the answer below.

wine_or <- wine %>%
  filter(province == "Oregon")

wine_ca <- wine %>%
  filter(province == "California")

wine_ny <- wine %>%
  filter(province == "New York")

m_or <- lm(lprice ~ cherry, data = wine_or)
m_ca <- lm(lprice ~ cherry, data = wine_ca)
m_ny <- lm(lprice ~ cherry, data = wine_ny)

get_regression_table(m_or)

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	3.405	0.008	429.517	0	3.390	3.421
cherry	0.303	0.016	19.227	0	0.272	0.334

get_regression_table(m_ca)

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	3.492	0.005	739.071	0	3.482	3.501
cherry	0.177	0.010	18.415	0	0.158	0.196

get_regression_table(m_ny)

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	3.020	0.009	330.817	0	3.002	3.038
cherry	0.173	0.018	9.766	0	0.138	0.208

Answer: The cherry feature appears to affect price most in Oregon. California and New York have very similar coefficients (0.177 and 0.173), whereas Oregon’s is almost double at 0.303. As you can see in the graph below, although the lines all start in different places, Oregon (yellow) has the steepest incline, meaning its log price rises the most when cherry is present.

ggplot(wine_or, aes(cherry, lprice))+
  geom_smooth(method = lm, se = FALSE, color = "goldenrod1")+
  geom_smooth(method = lm, se = FALSE, color = "ivory", data = wine_ca)+
  geom_smooth(method = lm, se = FALSE, color = "grey70", data = wine_ny)+
  theme(panel.background = element_rect(fill = "mistyrose1"),
        plot.background = element_rect(fill = "mistyrose1"),
        panel.grid = element_blank(),
        axis.title = element_text(family = "courier"),
        axis.text = element_text(family = "courier"))
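
Because lprice is the log of price, each cherry coefficient can be turned into an approximate percent difference in price. The sketch below copies the three estimates from the regression tables above; the variable names are my own, and the conversion is a rough reading of a log-scale coefficient rather than an exact statement about average prices.

```r
# cherry coefficients copied from the three regression tables above
cherry_coef <- c(Oregon = 0.303, California = 0.177, `New York` = 0.173)

# exp(coef) - 1 is the proportional difference in price on the original scale
pct_diff <- (exp(cherry_coef) - 1) * 100
round(pct_diff, 1)

# Province with the largest cherry effect
top_province <- names(which.max(pct_diff))
top_province
```

Oregon comes out on top at roughly 35%, against a little under 20% for the other two. Its confidence interval (0.272 to 0.334) also does not overlap California’s (0.158 to 0.196) or New York’s (0.138 to 0.208).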

Data Ethics

(3pts)

Imagine that you are a manager at an E-commerce operation that sells wine online. Your employee has been building a model to distinguish New York wines from those in California and Oregon. The employee is excited to report an accuracy of 91%.

Should you be impressed? Why or why not? Use simple descriptive statistics from the data to justify your answer.

summary(wine_ny)

##      lprice          points          cherry         province        
##  Min.   :2.197   Min.   :80.00   Min.   :0.0000   Length:2364       
##  1st Qu.:2.773   1st Qu.:86.00   1st Qu.:0.0000   Class :character  
##  Median :2.996   Median :87.00   Median :0.0000   Mode  :character  
##  Mean   :3.066   Mean   :87.29   Mean   :0.2648                     
##  3rd Qu.:3.296   3rd Qu.:89.00   3rd Qu.:1.0000                     
##  Max.   :4.828   Max.   :94.00   Max.   :1.0000

summary(wine_or)

##      lprice          points          cherry         province        
##  Min.   :1.609   Min.   :80.00   Min.   :0.0000   Length:5147       
##  1st Qu.:3.091   1st Qu.:87.00   1st Qu.:0.0000   Class :character  
##  Median :3.466   Median :89.00   Median :0.0000   Mode  :character  
##  Mean   :3.482   Mean   :89.08   Mean   :0.2534                     
##  3rd Qu.:3.871   3rd Qu.:91.00   3rd Qu.:1.0000                     
##  Max.   :5.617   Max.   :99.00   Max.   :1.0000

summary(wine_ca)

##      lprice          points          cherry         province        
##  Min.   :1.386   Min.   :80.00   Min.   :0.0000   Length:19073      
##  1st Qu.:3.178   1st Qu.:87.00   1st Qu.:0.0000   Class :character  
##  Median :3.555   Median :90.00   Median :0.0000   Mode  :character  
##  Mean   :3.535   Mean   :89.39   Mean   :0.2423                     
##  3rd Qu.:3.912   3rd Qu.:91.00   3rd Qu.:0.0000                     
##  Max.   :7.607   Max.   :99.00   Max.   :1.0000

Answer: As we discussed in class, it’s important to have full context for the data before judging a 91% accuracy, including factors such as data size and class proportions. Based on the summaries above, the amount of data differs hugely by province: 2,364 New York wines versus 5,147 from Oregon and 19,073 from California. Was this size discrepancy taken into account? Is the New York sample large enough to support any conclusion? The result will be more impressive if we can answer these more detailed questions. Finally, accuracy is not the same as precision or recall, and it is important to look at all of these measures to get a fuller picture of how meaningful the result is.
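
The proportion question can be checked with the row counts alone. This sketch copies the Length values from the summary() output; the "always predict not New York" model is a hypothetical baseline used only for comparison.

```r
# Row counts copied from the Length column of the summary() output above
n_rows <- c(`New York` = 2364, Oregon = 5147, California = 19073)

# Share of each province in the data
share <- n_rows / sum(n_rows)
round(share, 3)

# Accuracy of a hypothetical model that always predicts "not New York"
baseline_accuracy <- 1 - share[["New York"]]
round(baseline_accuracy, 3)
```

The baseline comes out at about 91%, so an accuracy of 91% is what a model would get without learning anything about New York wines. Precision and recall for the New York class would show whether the employee’s model does better than that.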

(3pts)

Why is understanding the vignette in the previous question important if you want to use machine learning in an ethical manner?

Answer: Understanding the context and the multiple factors involved is important because most questions do not have a simple answer. Humans are complicated, and so are most business decisions. As we discussed in class, there are some circumstances where correlation is enough, but in most scenarios we need more than correlation. Jumping to a simple conclusion can also create damaging and ethically harmful biases; redlining is one example. This is why it is important to look at the data from every angle and consider all of the internal and external factors at play.

(3pts)

Imagine you are working on a model to predict the likelihood that an individual loses their job as the result of the covid-19 pandemic. You have a very large dataset with many hundreds of features, but you are worried that including indicators like age, income or gender might pose some ethical problems. When you discuss these concerns with your boss, she tells you to simply drop those features from the model. Does this solve the ethical issue? Why or why not?

Answer: Simply omitting the features from the model does not seem like it would fix the problem. Keeping them in the model could create ethical issues if the model is used for hiring or firing decisions. However, if it is simply informational, I could see how it could uncover human biases regarding age, income, and gender. I believe it is important to have this information so that we can fix problems of discrimination rather than use the model to make them worse. I would probably build models with and without these features, in different variations, and analyze the results to see what changes. But again, I think the difference is whether the model is used just to inform and fix, or for decision making.

Homework 1

Rochelle Rafn

01/14/2022
