Setup (5pts)

Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘wine.rds’ data file from your own computer. In the space provided after the R chunk, explain what this code is doing (line by line).

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(modelr)

wine <- readRDS("C:/Users/Kari/Downloads/wine (1).rds") %>%
  filter(province=="Oregon" | province == "California" | province == "New York") %>% 
  mutate(cherry=as.integer(str_detect(description,"[Cc]herry"))) %>% 
  mutate(lprice=log(price)) %>% 
  select(lprice, points, cherry, province)

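Answer: The first line sets the default knitr chunk options so that code is shown in the knitted document (echo = TRUE) while messages and warnings are hidden. The next two lines load two libraries: tidyverse, for data wrangling and the pipe, and modelr, which provides the rmse() function.

I then created a new object called ‘wine’ by reading in the wine data set with readRDS() and piping it through a series of steps. filter() keeps only the rows where province is Oregon, California, or New York. The first mutate() creates a new variable called ‘cherry’, which is 1 if the description contains “cherry” or “Cherry” and 0 otherwise. The second mutate() creates a new variable called ‘lprice’, which is the natural log of the existing variable price. Finally, select() keeps only the lprice, points, cherry, and province columns.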
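The two mutate() steps in the setup chunk can be tried on a toy example. This is only a sketch: the descriptions and prices below are made up for illustration and are not taken from the wine data.

```r
library(stringr)

# Hypothetical descriptions and prices, made up for illustration
description <- c("Ripe cherry and oak", "Cherry cola notes", "Citrus and wet stone")
price <- c(20, 45, 15)

# str_detect() returns TRUE/FALSE; as.integer() turns that into 1/0
cherry <- as.integer(str_detect(description, "[Cc]herry"))

# log() is the natural log, the same transformation used for lprice
lprice <- log(price)

cherry
lprice
```

The character class [Cc] is what lets the pattern match both “cherry” and “Cherry”.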

Multiple Regression

(2pts)

Run a linear regression model with log of price as the dependent variable and ‘points’ and ‘cherry’ as features (variables). Report the RMSE.

m1 <- lm(lprice ~ points + cherry, data = wine)
rmse(m1, wine)

## [1] 0.4687657
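For reference, RMSE is the square root of the mean squared residual. The sketch below computes it by hand in base R. It assumes the built-in mtcars data as a stand-in, because the wine file is not available here, and the model formula is arbitrary.

```r
# Illustrative only: uses the built-in mtcars data, not the wine data
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- residuals(fit)               # observed minus fitted values
rmse_by_hand <- sqrt(mean(res^2))   # root mean squared error

# The "Residual standard error" in summary() divides by the residual
# degrees of freedom instead of n, so it is slightly larger than the RMSE
rse <- sqrt(sum(res^2) / df.residual(fit))

rmse_by_hand
rse
```

The same distinction likely explains why, in the interaction model later in this section, rmse() reports 0.4685223 while summary() reports a residual standard error of 0.4686.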

(2pts)

Run the same model as above, but add an interaction between ‘points’ and ‘cherry’.

m2 <- lm(lprice ~ points * cherry, data = wine)
rmse(m2, wine)

## [1] 0.4685223

summary(m2)

## 
## Call:
## lm(formula = lprice ~ points * cherry, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6432 -0.3332 -0.0151  0.2924  3.9645 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.659620   0.102252 -55.350  < 2e-16 ***
## points         0.102225   0.001149  88.981  < 2e-16 ***
## cherry        -1.014896   0.215812  -4.703 2.58e-06 ***
## points:cherry  0.012663   0.002409   5.256 1.48e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4686 on 26580 degrees of freedom
## Multiple R-squared:  0.3062, Adjusted R-squared:  0.3061 
## F-statistic:  3910 on 3 and 26580 DF,  p-value: < 2.2e-16

(3pts)

How should I interpret the coefficient on the interaction variable? Please explain as you would to a non-technical manager.

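Answer: Wines with higher ratings cost more: each additional point is associated with a price that is roughly 11% higher. For wines whose description mentions cherry, that increase is a bit more than one percentage point larger, so each additional point goes with a price that is roughly 12% higher. In other words, the interaction coefficient (0.0127) tells us that ratings matter a little more for the price of ‘cherry’ wines than for other wines. It is measured on the log-price scale, so it describes an approximate percentage change, not a dollar amount.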
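The interaction in m2 can be read off as two different slopes on points, one for each value of cherry. The working below uses only the coefficients printed by summary(m2).

```latex
\begin{align*}
\widehat{\text{lprice}} &= -5.659620 + 0.102225\,\text{points} - 1.014896\,\text{cherry} + 0.012663\,(\text{points} \times \text{cherry}) \\
\text{cherry} = 0:\quad \frac{\Delta \widehat{\text{lprice}}}{\Delta \text{points}} &= 0.102225 \\
\text{cherry} = 1:\quad \frac{\Delta \widehat{\text{lprice}}}{\Delta \text{points}} &= 0.102225 + 0.012663 = 0.114888
\end{align*}
```

Exponentiating these slopes gives about 10.8% and 12.2% higher price per additional point. Because the coefficient on cherry itself is negative, the predicted cherry premium, -1.014896 + 0.012663 × points, is positive only for wines rated above roughly 80 points.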

(Bonus: 1pt)

In which province (Oregon, California, or New York), does the ‘cherry’ feature in the data affect price most? Show your code and write the answer below.

m3 <- lm(lprice ~ cherry * province, data = wine)
rmse(m3, wine)

## [1] 0.539451

summary(m3)

## 
## Call:
## lm(formula = lprice ~ cherry * province, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1054 -0.3562 -0.0190  0.3623  4.1157 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              3.491711   0.004488 778.012  < 2e-16 ***
## cherry                   0.176729   0.009117  19.385  < 2e-16 ***
## provinceNew York        -0.471927   0.013697 -34.454  < 2e-16 ***
## provinceOregon          -0.086626   0.009792  -8.847  < 2e-16 ***
## cherry:provinceNew York -0.003490   0.026750  -0.130    0.896    
## cherry:provinceOregon    0.126100   0.019547   6.451 1.13e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5395 on 26578 degrees of freedom
## Multiple R-squared:  0.08024,    Adjusted R-squared:  0.08007 
## F-statistic: 463.7 on 5 and 26578 DF,  p-value: < 2.2e-16

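Answer: Oregon. California is the baseline province, so the ‘cherry’ coefficient (0.177) is the effect of cherry on log price in California. The effect in New York is essentially the same (0.177 - 0.003 = 0.173), and the New York interaction is not statistically significant (p = 0.896), so we cannot distinguish it from California. The Oregon interaction is positive and statistically significant (three asterisks), so the effect of cherry in Oregon is 0.177 + 0.126 = 0.303, the largest of the three provinces.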
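The per-province effect of cherry can be recovered by adding each interaction term to the baseline cherry coefficient. The sketch below hard-codes the estimates from summary(m3) rather than refitting the model, and it assumes California is the reference level because it has no province term of its own in the output.

```r
# Coefficients copied from summary(m3)
b_cherry    <- 0.176729    # cherry effect in the baseline province (California)
b_cherry_ny <- -0.003490   # cherry:provinceNew York
b_cherry_or <- 0.126100    # cherry:provinceOregon

# Total effect of cherry on log price in each province
cherry_effect <- c(
  California = b_cherry,
  `New York` = b_cherry + b_cherry_ny,
  Oregon     = b_cherry + b_cherry_or
)

round(cherry_effect, 3)

# Approximate percentage price difference: exp(effect) - 1
round(100 * (exp(cherry_effect) - 1), 1)

# Province with the largest cherry effect
names(which.max(cherry_effect))
```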

Data Ethics

(3pts)

Imagine that you are a manager at an E-commerce operation that sells wine online. Your employee has been building a model to distinguish New York wines from those in California and Oregon. After a few days of work, your employee bursts into your office and exclaims, “I’ve achieved 91% accuracy on my model!”

Should you be impressed? Why or why not? Use simple descriptive statistics from the data to justify your answer.

wine_NY <- wine %>%
  mutate(NewY = ifelse(province == "New York", 1, 0))
wine_NY %>% count(NewY)

NewY	n
0	24220
1	2364

Answer: No, I would not be impressed. New York wines make up only 2,364 of the 26,584 wines in this data, or about 9%. A model that ignores the data entirely and labels every wine “not New York” would be right 24,220 times out of 26,584, which is about 91% accuracy. The employee’s model therefore does no better than always guessing the majority class, and 91% accuracy by itself tells us nothing about whether the model can actually identify a New York wine. Relying on it would be a risk to our customers.
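The baseline comparison can be checked directly from the counts in the table. This sketch assumes a hypothetical “always predict not New York” classifier as the benchmark.

```r
# Counts copied from the count(NewY) output
n_not_ny <- 24220
n_ny     <- 2364
n_total  <- n_not_ny + n_ny

# Share of wines that are from New York
ny_share <- n_ny / n_total

# Accuracy of a model that always predicts "not New York"
baseline_accuracy <- n_not_ny / n_total

# That same model finds none of the New York wines
baseline_recall_ny <- 0 / n_ny

round(ny_share, 3)
round(baseline_accuracy, 3)
baseline_recall_ny
```

A more informative report from the employee would include how many of the New York wines the model actually catches, not just overall accuracy.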

(3pts)

Why is understanding the vignette in the previous question important if you want to use machine learning in an ethical manner?

Answer: A single headline number like accuracy can hide what a model is actually doing, so we have to understand the data behind it before trusting it. We also have to be careful with features like description, which depend on human discretion and are therefore subjective, because that subjectivity can carry through into what the model learns. Features like points and price, by contrast, are numeric measures related to the wine and are less open to interpretation.

(3pts)

Imagine you are working on a model to predict the likelihood that an individual loses their job as the result of the Corona virus. You have a very large dataset with many hundreds of features, but you are worried that including indicators like age, income or gender might pose some ethical problems. When you discuss these concerns with your boss, she tells you to simply drop those features from the model. Does this solve the ethical issue? Why or why not?

Answer: No. Dropping these features does not solve the ethical issue. Other features in a dataset this large can act as proxies for age, income, or gender, so the model can still pick up the same patterns. You are looking for patterns regardless of the indicators, and if you find that job loss is related to things like gender, income, and age, it is worth calling attention to those disparities, because doing so highlights marginalized groups that might need additional support during this time. If you did not consider things like gender, policies could be implemented that fail to benefit the groups most in need.

504 - Homework 1 - Kari Olson

Kari Olson

1/23/2021
