Directions:

Please turn in a knitted HTML file or PDF on WISE before next class.

Setup (5pts)

Change the author of this RMD file to be yourself and modify the below code so that you can successfully load the ‘wine.rds’ data file from your own computer. In the space provided after the R chunk, explain what this code is doing (line by line).

#install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
wine <- read_rds("C:\\Users\\rebec\\OneDrive\\Desktop\\Willamette\\ML2023\\wine.rds") %>%
  filter(province=="Oregon" | province == "California" | province == "New York") %>% 
  mutate(cherry=as.integer(str_detect(description,"[Cc]herry"))) %>% 
  mutate(lprice=log(price)) %>% 
  select(lprice, points, cherry, province)
view(wine)

Answer:

Line 24: sets the knitting options for the chunks. echo = TRUE adds the code to the Markdown output; message = FALSE and warning = FALSE keep messages and warnings out of the output.

25/26: install the tidyverse package (commented out, so it does not reinstall on every knit) and load it.

28: import wine.rds into RStudio and assign it the name wine, which serves as the input for the rest of the code.

29: filter the rows on the province column, using OR (|) statements to keep only Oregon, California, and New York.

30: use mutate to create a new integer column named cherry. str_detect checks whether the description string contains "cherry" or "Cherry"; as.integer turns the result into 1 if the word is detected and 0 if not.

31: use mutate to create a new column called lprice that holds the log of the price column.

32: select the columns to keep in the new data frame called wine (lprice, points, cherry, province).
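The two mutate steps described above can be tried on a tiny made-up data frame. This is only a sketch: the `toy` data and its values are invented for illustration, and base R's `grepl()` stands in for `str_detect()` so the example runs without extra packages.

```r
# Toy data standing in for the wine data (values are made up for illustration)
toy <- data.frame(
  description = c("Ripe cherry and plum", "Crisp apple", "Black Cherry notes"),
  price = c(20, 35, 50),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE where the pattern matches; as.integer() maps TRUE/FALSE to 1/0
toy$cherry <- as.integer(grepl("[Cc]herry", toy$description))

# Natural log of price, mirroring mutate(lprice = log(price))
toy$lprice <- log(toy$price)

print(toy[, c("description", "lprice", "cherry")])
```

The pattern `[Cc]herry` matches both capitalizations, which is why the first and third rows are flagged and the second is not.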

Multiple Regression

(2pts)

Run a linear regression model with log of price as the dependent variable and ‘points’ and ‘cherry’ as features (variables). Report the RMSE.

#install.packages("moderndive")
library(moderndive)
#lprice is the dependent variable. We are looking for the effects of points and cherry on price.

m1 <- lm(lprice ~ points + cherry, data = wine)

m1
## 
## Call:
## lm(formula = lprice ~ points + cherry, data = wine)
## 
## Coefficients:
## (Intercept)       points       cherry  
##     -5.9157       0.1051       0.1188
get_regression_summaries(m1)

rmse = 0.4687657
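For reference, RMSE can also be computed by hand from a fitted model's residuals. The sketch below uses simulated data (the `sim` data frame and its coefficients are invented, not the wine data), and it assumes the rmse reported by get_regression_summaries() is the square root of the mean squared residual.

```r
set.seed(1)

# Simulated stand-in for the wine data (all values are hypothetical)
n <- 200
sim <- data.frame(
  points = round(runif(n, 80, 100)),
  cherry = rbinom(n, 1, 0.3)
)
sim$lprice <- -6 + 0.1 * sim$points + 0.1 * sim$cherry + rnorm(n, sd = 0.5)

fit <- lm(lprice ~ points + cherry, data = sim)

# RMSE: square root of the average squared residual
rmse <- sqrt(mean(residuals(fit)^2))
print(rmse)
```

Because lprice is a log, an RMSE on this scale is roughly a typical proportional error in price rather than an error in dollars.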

(2pts)

Run the same model as above, but add an interaction between ‘points’ and ‘cherry’.

library(ggplot2)
m2 <- lm(lprice ~ points + cherry + points:cherry, data = wine)

get_regression_table(m2)
get_regression_summaries(m2)
# cherry is converted to a factor so each group gets its own color and fitted line
ggplot(data = wine, aes(points, lprice, color = factor(cherry))) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, fullrange = TRUE)
## `geom_smooth()` using formula = 'y ~ x'

rmse = 0.4685223

(3pts)

How should I interpret the coefficient on the interaction variable? Please explain as you would to a non-technical audience.

Answer:

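The coefficient on the interaction is about .013. Because price is on a log scale, this means each additional rating point is associated with roughly 1.3% more price for wines whose description mentions cherry than for wines that don't. Put simply: a higher score goes with a higher price for every wine, and that boost is a little bigger for wines described with the word cherry.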
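One way to see what the interaction coefficient means is to compare the slope on points for the two groups. This is a sketch on simulated data: the `sim` data frame and the coefficients used to generate it are invented, with .013 borrowed from the answer above purely as an illustrative value.

```r
set.seed(2)

# Simulated stand-in for the wine data (hypothetical values)
n <- 500
sim <- data.frame(
  points = runif(n, 80, 100),
  cherry = rbinom(n, 1, 0.5)
)
sim$lprice <- -6 + 0.10 * sim$points +
  0.013 * sim$points * sim$cherry + rnorm(n, sd = 0.1)

fit <- lm(lprice ~ points * cherry, data = sim)
b <- coef(fit)

# Slope of log price on points for wines without and with "cherry"
slope_no_cherry <- unname(b["points"])
slope_cherry <- unname(b["points"] + b["points:cherry"])

# Fitting the cherry wines on their own recovers the same slope
fit_cherry <- lm(lprice ~ points, data = sim[sim$cherry == 1, ])

print(c(no_cherry = slope_no_cherry, cherry = slope_cherry))
```

The interaction coefficient is the gap between the two slopes, which is why it reads as "extra price growth per point" for cherry wines.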

(Bonus: 1pt)

In which province (Oregon, California, or New York), does the ‘cherry’ feature in the data affect price most? Show your code and write the answer below.

m3 <- lm(lprice~province+cherry+cherry*province, data=wine)

get_regression_table(m3)

Answer: The cherry description appears to have the largest positive effect on price in California, followed by Oregon, with estimated increases in log price of .177 and .126. The interaction between New York and cherry has a high p-value, so it is not statistically significant.
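When reading a province-by-cherry interaction table, the cherry effect in each province is the baseline cherry coefficient plus that province's interaction term. The sketch below illustrates this on simulated data; the `sim` data frame and the `true_effect` values are invented and say nothing about the real wine data.

```r
set.seed(3)

# Simulated stand-in for the wine data (hypothetical effects per province)
n <- 600
sim <- data.frame(
  province = factor(sample(c("California", "New York", "Oregon"), n, replace = TRUE)),
  cherry = rbinom(n, 1, 0.4)
)
true_effect <- c(California = 0.2, "New York" = 0.0, Oregon = 0.3)
sim$lprice <- 3 + unname(true_effect[as.character(sim$province)]) * sim$cherry +
  rnorm(n, sd = 0.2)

fit <- lm(lprice ~ province * cherry, data = sim)
b <- coef(fit)

# California is the baseline level, so its cherry effect is the plain coefficient;
# the other provinces add their interaction term to it
effects <- c(
  California = unname(b["cherry"]),
  "New York" = unname(b["cherry"] + b["provinceNew York:cherry"]),
  Oregon = unname(b["cherry"] + b["provinceOregon:cherry"])
)
print(round(effects, 3))
```

Comparing these combined effects, rather than the raw interaction rows, avoids mistaking a difference from the baseline province for the effect itself.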

Data Ethics

(3pts)

Imagine that you are a manager at an E-commerce operation that sells wine online. Your employee has been building a model to distinguish New York wines from those in California and Oregon. The employee is excited to report an accuracy of 91%.

Should you be impressed? Why or why not? Use simple descriptive statistics from the data to justify your answer.

table(wine$province)
## 
## California   New York     Oregon 
##      19073       2364       5147

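Answer:

No. The data are heavily imbalanced: New York accounts for only 2,364 of the 26,584 wines (about 8.9%). A model that always predicts "not New York" would already be about 91.1% accurate, so 91% accuracy is no better than guessing the majority class. To train a useful model we would need to rebalance the data by sampling either up or down.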
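The baseline implied by the province counts above can be checked with a few lines of arithmetic. The counts are copied from the table() output; the variable names are just for illustration.

```r
# Province counts copied from the table(wine$province) output
counts <- c(California = 19073, "New York" = 2364, Oregon = 5147)

# Share of wines from New York
share_ny <- counts[["New York"]] / sum(counts)

# Accuracy of a "model" that always answers "not New York"
baseline_accuracy <- 1 - share_ny

print(round(c(share_ny = share_ny, baseline_accuracy = baseline_accuracy), 3))
```

Any reported accuracy should be compared against this do-nothing baseline before it is treated as evidence that the model has learned something.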

(3pts)

Why is understanding the vignette in the previous question important if you want to use machine learning in an ethical manner?

Answer:

A model trained on lopsided data tends to lean toward the majority class. Understanding the imbalance first leads to better predictions and leaves the model fewer ways to favor one side over another.

(3pts)

Imagine you are working on a model to predict the likelihood that an individual loses their job as the result of the covid-19 pandemic. You have a very large data set with many hundreds of features, but you are worried that including indicators like age, income or gender might pose some ethical problems. When you discuss these concerns with your boss, she tells you to simply drop those features from the model. Does this solve the ethical issue? Why or why not?

Answer:

What are the ethical issues associated with age, income, and/or gender, and what will the predictions be used for? In theory there are several scenarios; I'll elaborate on two. In one, the predictions are used to protect workers: a business could build them into its planning for another pandemic and put procedures in place to support workers who lose their jobs or to reduce the loss of workers in the future. In the other, businesses could use the predictions to assess the likelihood of future pandemics and make employment decisions based on the models and the causes of worker loss, biasing hiring and treating specific people differently based on age, income, and gender even more aggressively than they already do.