Directions:
Please turn in a knitted HTML file or PDF on WISE before next class.
Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘wine.rds’ data file from your own computer. In the space provided after the R chunk, explain what this code is doing (line by line).
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(moderndive)
library(sjPlot)
library(sjmisc)
library(ggplot2)
wine <- read_rds("/cloud/project/resources/wine.rds") %>%
  filter(province == "Oregon" | province == "California" | province == "New York") %>%
  mutate(cherry = as.integer(str_detect(description, "[Cc]herry"))) %>%
  mutate(lprice = log(price)) %>%
  select(lprice, points, cherry, province)
Answer: I added a few packages. moderndive lets me call the get_regression_* functions to further analyze the linear models that I run. I also included sjPlot and sjmisc so that I can easily plot interaction models and show the difference between the original model and the interaction effect. Since we’re using the cloud, and that is where the dataset is stored, I adjusted the read_rds() path so that it can find the file. The rest of the data wrangling was already included in the code.
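The wrangling steps in the chunk above can be traced line by line on a tiny made-up data frame. This is only a sketch: the four rows in `toy` are hypothetical and are not taken from wine.rds, and `%in%` is used here as a shorter equivalent of the chained `|` comparisons.

```r
library(dplyr)
library(stringr)

# Hypothetical rows standing in for the real wine data
toy <- data.frame(
  province = c("Oregon", "California", "New York", "Bordeaux"),
  description = c("Bright cherry and spice", "Oaky with vanilla",
                  "Cherry cola notes", "Black cherry"),
  price = c(20, 35, 15, 60),
  points = c(88, 91, 86, 93),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  # keep only rows from the three provinces of interest
  filter(province %in% c("Oregon", "California", "New York")) %>%
  # 1 if the description contains "cherry" or "Cherry", otherwise 0
  mutate(cherry = as.integer(str_detect(description, "[Cc]herry"))) %>%
  # natural log of price, used as the dependent variable
  mutate(lprice = log(price)) %>%
  # keep only the columns needed for modelling
  select(lprice, points, cherry, province)

print(toy_clean)
```

The Bordeaux row is dropped by `filter()` even though its description mentions cherry, which shows that the province filter runs before the cherry indicator is created.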
Run a linear regression model with log of price as the dependent variable and ‘points’ and ‘cherry’ as features (variables). Report the RMSE.
# hint: m1 <- lm(lprice ~ points + cherry)
RMSE is 0.4687657
m1 <- lm(lprice~points+cherry, data = wine)
m1
##
## Call:
## lm(formula = lprice ~ points + cherry, data = wine)
##
## Coefficients:
## (Intercept) points cherry
## -5.9157 0.1051 0.1188
get_regression_table(m1)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -5.916 | 0.090 | -65.776 | 0 | -6.092 | -5.739 |
| points | 0.105 | 0.001 | 104.030 | 0 | 0.103 | 0.107 |
| cherry | 0.119 | 0.007 | 17.700 | 0 | 0.106 | 0.132 |
get_regression_summaries(m1)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.305 | 0.305 | 0.2197412 | 0.4687657 | 0.469 | 5845.826 | 0 | 2 | 26584 |
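The `mse` and `rmse` columns in the summary table are consistent with RMSE being the square root of the mean squared residual (the square root of 0.2197412 is about 0.4687657). The sketch below shows that calculation by hand. It uses R's built-in `mtcars` data as a stand-in because the wine file is not available here, so the model and numbers are illustrative only.

```r
# Stand-in model on built-in data (not the wine model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- residuals(fit)

# Mean squared error and its square root
mse <- mean(res^2)
rmse <- sqrt(mse)

# Residual standard error divides by the residual degrees of freedom
# instead of n, so it is slightly larger than the RMSE
sigma_hat <- sqrt(sum(res^2) / df.residual(fit))

# Check against the values reported in the table for m1
rmse_from_table <- sqrt(0.2197412)

print(c(rmse = rmse, sigma = sigma_hat, rmse_from_table = rmse_from_table))
```

With 26,584 observations and only a few coefficients, the two denominators are nearly the same, which would explain why `rmse` and `sigma` agree to three decimals in the table.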
Run the same model as above, but add an interaction between ‘points’ and ‘cherry’.
RMSE is 0.4685223
m2 <- lm(lprice~points*cherry, data=wine)
m2
##
## Call:
## lm(formula = lprice ~ points * cherry, data = wine)
##
## Coefficients:
## (Intercept) points cherry points:cherry
## -5.65962 0.10223 -1.01490 0.01266
get_regression_table(m2)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -5.660 | 0.102 | -55.350 | 0 | -5.860 | -5.459 |
| points | 0.102 | 0.001 | 88.981 | 0 | 0.100 | 0.104 |
| cherry | -1.015 | 0.216 | -4.703 | 0 | -1.438 | -0.592 |
| points:cherry | 0.013 | 0.002 | 5.256 | 0 | 0.008 | 0.017 |
get_regression_summaries(m2)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.306 | 0.306 | 0.2195131 | 0.4685223 | 0.469 | 3910.329 | 0 | 3 | 26584 |
How should I interpret the coefficient on the interaction variable? Please explain as you would to a non-technical audience.
Answer: When a wine’s description mentions cherry (cherry is TRUE), the line relating points to log price gets a new intercept and a new slope. The intercept drops by 1.015, from -5.660 to -6.675. The slope rises by 0.013, from 0.102 to 0.115. Since the intercept is lower and the slope is higher, the line for cherry wines climbs more quickly than the line for wines without cherry, although the two lines are very close. In plain terms, each additional point is worth slightly more in price for a wine described with cherry than for one without. In the plot below, the red line is where cherry is 0 (meaning FALSE) and the blue line is where cherry is 1 (meaning TRUE). When cherry is TRUE, we can see that price increases more rapidly as points increase.
plot_model(m2, type = "int")
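The intercept and slope arithmetic for the interaction model can be reproduced directly from the printed coefficients of m2. The sketch below copies those four numbers by hand. The percent conversion assumes the usual reading of a log-price model (a coefficient b corresponds to a change of exp(b) - 1 in price), and the crossover point is a derived quantity that the assignment does not ask for.

```r
# Coefficients copied from the printed m2 output
b <- c(intercept = -5.65962, points = 0.10223,
       cherry = -1.01490, points_cherry = 0.01266)

# Line for wines whose description mentions cherry (cherry = 1)
intercept_cherry <- b[["intercept"]] + b[["cherry"]]
slope_cherry <- b[["points"]] + b[["points_cherry"]]

# Approximate percent change in price per extra point, by group
pct_no_cherry <- (exp(b[["points"]]) - 1) * 100
pct_cherry <- (exp(slope_cherry) - 1) * 100

# Points value at which the two fitted lines cross
crossover <- -b[["cherry"]] / b[["points_cherry"]]

print(c(intercept_cherry = intercept_cherry, slope_cherry = slope_cherry,
        pct_no_cherry = pct_no_cherry, pct_cherry = pct_cherry,
        crossover = crossover))
```

The crossover lands close to the lowest points value in the data (80), so across almost the whole observed range the cherry line sits above the non-cherry line despite its lower intercept.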
In which province (Oregon, California, or New York), does the ‘cherry’ feature in the data affect price most? Show your code and write the answer below.
wine_or <- wine %>%
  filter(province == "Oregon")
wine_ca <- wine %>%
  filter(province == "California")
wine_ny <- wine %>%
  filter(province == "New York")
m_or <- lm(lprice ~ cherry, data = wine_or)
m_ca <- lm(lprice ~ cherry, data = wine_ca)
m_ny <- lm(lprice ~ cherry, data = wine_ny)
get_regression_table(m_or)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 3.405 | 0.008 | 429.517 | 0 | 3.390 | 3.421 |
| cherry | 0.303 | 0.016 | 19.227 | 0 | 0.272 | 0.334 |
get_regression_table(m_ca)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 3.492 | 0.005 | 739.071 | 0 | 3.482 | 3.501 |
| cherry | 0.177 | 0.010 | 18.415 | 0 | 0.158 | 0.196 |
get_regression_table(m_ny)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 3.020 | 0.009 | 330.817 | 0 | 3.002 | 3.038 |
| cherry | 0.173 | 0.018 | 9.766 | 0 | 0.138 | 0.208 |
Answer: The cherry feature appears to affect price most in Oregon. California and New York have very similar slopes (0.177 and 0.173), whereas Oregon’s is almost double at 0.303. As you can see in the graph below, although the lines all start in different places, Oregon (yellow) has the steepest incline, meaning its price increases at a more rapid rate.
ggplot(wine_or, aes(cherry, lprice)) +
  geom_smooth(method = lm, se = FALSE, color = "goldenrod1") +
  geom_smooth(method = lm, se = FALSE, color = "ivory", data = wine_ca) +
  geom_smooth(method = lm, se = FALSE, color = "grey70", data = wine_ny) +
  theme(panel.background = element_rect(fill = "mistyrose1"),
        plot.background = element_rect(fill = "mistyrose1"),
        panel.grid = element_blank(),
        axis.title = element_text(family = "courier"),
        axis.text = element_text(family = "courier"))
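An alternative to fitting three separate models is a single model with a cherry-by-province interaction, which gives the same per-province cherry slopes. The sketch below demonstrates this on simulated data, since the wine file is not available here. The data frame `toy`, the effect sizes in `true_effect`, and the noise level are all made up for illustration.

```r
set.seed(1)
n <- 300

# Simulated data: three provinces with different cherry effects (assumed values)
toy <- data.frame(
  province = rep(c("California", "New York", "Oregon"), each = n),
  cherry = rbinom(3 * n, 1, 0.25),
  stringsAsFactors = FALSE
)
true_effect <- c("California" = 0.18, "New York" = 0.17, "Oregon" = 0.30)
toy$lprice <- 3.3 + unname(true_effect[toy$province]) * toy$cherry +
  rnorm(3 * n, sd = 0.4)

# Approach 1: one regression per province, as in the homework
separate <- sapply(split(toy, toy$province), function(d) {
  coef(lm(lprice ~ cherry, data = d))[["cherry"]]
})

# Approach 2: one model with an interaction; California is the baseline level
combined <- lm(lprice ~ cherry * province, data = toy)
b <- coef(combined)
from_combined <- c(
  "California" = b[["cherry"]],
  "New York" = b[["cherry"]] + b[["cherry:provinceNew York"]],
  "Oregon" = b[["cherry"]] + b[["cherry:provinceOregon"]]
)

print(rbind(separate, from_combined))
```

The point estimates match, and the interaction rows in the combined model's coefficient table additionally test whether each province's cherry slope differs from the baseline province.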
Imagine that you are a manager at an E-commerce operation that sells wine online. Your employee has been building a model to distinguish New York wines from those in California and Oregon. The employee is excited to report an accuracy of 91%.
Should you be impressed? Why or why not? Use simple descriptive statistics from the data to justify your answer.
summary(wine_ny)
## lprice points cherry province
## Min. :2.197 Min. :80.00 Min. :0.0000 Length:2364
## 1st Qu.:2.773 1st Qu.:86.00 1st Qu.:0.0000 Class :character
## Median :2.996 Median :87.00 Median :0.0000 Mode :character
## Mean :3.066 Mean :87.29 Mean :0.2648
## 3rd Qu.:3.296 3rd Qu.:89.00 3rd Qu.:1.0000
## Max. :4.828 Max. :94.00 Max. :1.0000
summary(wine_or)
## lprice points cherry province
## Min. :1.609 Min. :80.00 Min. :0.0000 Length:5147
## 1st Qu.:3.091 1st Qu.:87.00 1st Qu.:0.0000 Class :character
## Median :3.466 Median :89.00 Median :0.0000 Mode :character
## Mean :3.482 Mean :89.08 Mean :0.2534
## 3rd Qu.:3.871 3rd Qu.:91.00 3rd Qu.:1.0000
## Max. :5.617 Max. :99.00 Max. :1.0000
summary(wine_ca)
## lprice points cherry province
## Min. :1.386 Min. :80.00 Min. :0.0000 Length:19073
## 1st Qu.:3.178 1st Qu.:87.00 1st Qu.:0.0000 Class :character
## Median :3.555 Median :90.00 Median :0.0000 Mode :character
## Mean :3.535 Mean :89.39 Mean :0.2423
## 3rd Qu.:3.912 3rd Qu.:91.00 3rd Qu.:0.0000
## Max. :7.607 Max. :99.00 Max. :1.0000
Answer: As we discussed in class, it’s important to have the full context of the data before judging a 91% accuracy, including factors such as data size and class proportions. Based on the summaries above, there is a huge difference in the number of observations for New York (2,364) versus Oregon (5,147) and California (19,073). Was this size discrepancy taken into account? Is the sample that we have gathered from New York large enough to support any type of conclusion? The result would be impressive only if we could answer more of these detailed questions. Lastly, accuracy is not the same as precision or recall, so it is important to analyze all of these measures to get a fuller picture of how meaningful the result is.
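The class proportions behind this concern can be computed from the row counts printed by the three summary() calls (the Length values). The sketch below copies those counts by hand. The "always predict not New York" rule is a hypothetical baseline used for comparison, not the employee's model.

```r
# Row counts copied from the summary() output above
counts <- c("California" = 19073, "New York" = 2364, "Oregon" = 5147)

total <- sum(counts)

# Share of wines that come from New York
share_ny <- counts[["New York"]] / total

# Accuracy of a rule that never predicts New York
baseline_accuracy <- 1 - share_ny

print(round(c(total = total, share_ny = share_ny,
              baseline_accuracy = baseline_accuracy), 4))
```

Because New York makes up only about 9% of the rows, this do-nothing rule already scores about 91%, so the reported accuracy by itself says little about how well New York wines are identified.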
Why is understanding the vignette in the previous question important if you want to use machine learning in an ethical manner?
Answer: Understanding the context and the multiple factors involved is important because most questions do not have a simple answer. Humans are complicated… most business decisions are complicated. As we discussed in class, there are some circumstances where correlation is enough, but in most scenarios we need more than correlation. Jumping to a simple conclusion can also create damaging and ethically harmful biases; redlining is one example. This is why it is important to look at the data from every angle and consider all of the internal and external factors at play.
Imagine you are working on a model to predict the likelihood that an individual loses their job as the result of the covid-19 pandemic. You have a very large dataset with many hundreds of features, but you are worried that including indicators like age, income or gender might pose some ethical problems. When you discuss these concerns with your boss, she tells you to simply drop those features from the model. Does this solve the ethical issue? Why or why not?
Answer: Simply omitting the features from the model does not seem like it would fix the problem. I suppose that keeping them in the model could create ethical issues if the model is used for hiring or firing decisions. However, if it is simply informational, I could see how it could uncover human biases regarding age, income and gender. I believe it is important to have this information so that we are able to fix problems of discrimination, not use the model to make them worse. I would probably build models with and without these features, in different variations, and analyze the results to see what happens. But again, I think the difference is whether the model is used just to inform and fix, or for decision making.
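One way to try the with-and-without comparison described above is a small simulation. Everything in this sketch is hypothetical: `age`, `years_experience`, and `lost_job` are simulated, the effect sizes are made up, and `years_experience` is an assumed stand-in for the kind of correlated feature a large dataset might contain. It only illustrates how to check whether a model that omits a feature still tracks it.

```r
set.seed(42)
n <- 2000

# Simulated sensitive feature and a second feature that is correlated with it
age <- round(runif(n, 20, 65))
years_experience <- pmax(age - 22 + rnorm(n, sd = 3), 0)

# Simulated outcome that depends on age only (assumed effect size)
lost_job <- rbinom(n, 1, plogis(-3 + 0.05 * age))

sim <- data.frame(age, years_experience, lost_job)

# Model with the sensitive feature and model without it
with_age <- glm(lost_job ~ age + years_experience,
                family = binomial, data = sim)
without_age <- glm(lost_job ~ years_experience,
                   family = binomial, data = sim)

# How strongly do the predictions of the "age-free" model still track age?
cor_without <- cor(sim$age, fitted(without_age))

print(cor_without)
```

In this simulated setup the predictions from the model without age remain strongly correlated with age, which is the kind of result the comparison is meant to surface before deciding whether dropping a feature is enough.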