DATA 505 HW #1 Kayle Megginson January 21st 2025

Step Up Code:

library(tidyverse)
wine <- readRDS(gzcon(url("https://github.com/cd-public/DSLM-505/raw/master/dat/wine.rds"))) %>%
  filter(province=="Oregon" | province=="California" | province=="New York") %>% 
  mutate(cherry=as.integer(str_detect(description,"[Cc]herry"))) %>% 
  mutate(lprice=log(price)) %>% 
  select(lprice, points, cherry, province)

Explanation:

Line 1: Loads the Tidyverse package
Line 2: Reads the RDS file from the specified URL
Line 3: Retains rows where the province is Oregon, California, or New York.
Line 4: Creates a binary variable denoted as cherry that is 1 if “cherry” or “Cherry” appears in the description, 0 otherwise.
Line 5: Creates a new column denoted as lprice, which is the natural logarithm of the price.
Line 6: Keeps only the specified columns.
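
As a quick check on what the cherry flag does and does not catch, the same expression can be run on a few made-up strings. This is only a sketch: the tasting notes below are hypothetical and were invented to exercise the pattern, and the variable names are not from the assignment.

```r
library(stringr)

# Hypothetical tasting notes, invented only to exercise the pattern
notes <- c("Ripe cherry and plum",
           "Cherry cola on the nose",
           "Black-cherry finish",
           "Dried cherries and spice",
           "Citrus and green apple")

# Same expression as the mutate() step: 1 if the pattern matches, else 0
cherry_flag <- as.integer(str_detect(notes, "[Cc]herry"))
cherry_flag
```

Note that the plural "cherries" does not contain the substring "cherry", so a description that only uses the plural would be coded 0 under this pattern.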

Multiple Regression

Linear Models

First run a linear regression model with log of price as the dependent variable and ‘points’ and ‘cherry’ as features (variables).

m1 <- lm(lprice ~ points + cherry, data = wine)
summary(m1)


Call:
lm(formula = lprice ~ points + cherry, data = wine)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6318 -0.3377 -0.0162  0.2929  3.9586 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.915747   0.089938  -65.78   <2e-16 ***
points       0.105104   0.001010  104.03   <2e-16 ***
cherry       0.118832   0.006714   17.70   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4688 on 26581 degrees of freedom
Multiple R-squared:  0.3055,    Adjusted R-squared:  0.3054 
F-statistic:  5846 on 2 and 26581 DF,  p-value: < 2.2e-16

predicted <- predict(m1, wine)
rmse <- sqrt(mean((wine$lprice - predicted)^2))
rmse

[1] 0.4687657

Explanation:

Line 1: Creates a linear regression model where lprice is explained by points and cherry.
Line 2: Displays the regression coefficients and model statistics.
Line 3: Predicts lprice using the model.
Line 4: Calculates the Root Mean Squared Error (RMSE) to evaluate model performance.
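
The RMSE calculation can be wrapped in a small helper so the same formula is reused for every model. This is a minimal sketch: the name rmse_of is hypothetical, and the toy vectors are invented for illustration rather than taken from the wine data.

```r
# Root mean squared error between observed and predicted values
rmse_of <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical log prices and predictions, invented for illustration
actual    <- c(3.0, 3.5, 4.0)
predicted <- c(2.5, 3.5, 4.5)

toy_rmse <- rmse_of(actual, predicted)
toy_rmse
```

With the fitted model in memory, rmse_of(wine$lprice, predict(m1, wine)) should reproduce the calculation above.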

Results:
- The RMSE for m1 is 0.4688 on the log scale, which is the typical size of the error when predicting log price on the same data the model was fit to.
- Holding cherry fixed, each additional point in the wine rating is associated with a 0.1051 increase in log price, or about 11.1% higher price.
- Holding points fixed, wines described as having “cherry” flavors are associated with a 0.1188 increase in log price, or about 12.6% higher price.
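
Because the outcome is log price, a coefficient b corresponds to a multiplicative change of exp(b) in price. The sketch below converts the two m1 estimates to percent changes; the coefficient values are copied from the summary output above, and the variable names are only illustrative.

```r
# m1 coefficient estimates, copied from the summary(m1) output
coefs <- c(points = 0.105104, cherry = 0.118832)

# A one-unit increase multiplies price by exp(b); express as a percent change
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 1)
```

For small coefficients the percent change is close to 100 times the coefficient, which is why 0.1051 and 0.1188 land near 11% and 13%.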

Interaction Models

Add an interaction between ‘points’ and ‘cherry’.

m2 <- lm(lprice ~ points * cherry, data = wine)
summary(m2)


Call:
lm(formula = lprice ~ points * cherry, data = wine)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6432 -0.3332 -0.0151  0.2924  3.9645 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -5.659620   0.102252 -55.350  < 2e-16 ***
points         0.102225   0.001149  88.981  < 2e-16 ***
cherry        -1.014896   0.215812  -4.703 2.58e-06 ***
points:cherry  0.012663   0.002409   5.256 1.48e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4686 on 26580 degrees of freedom
Multiple R-squared:  0.3062,    Adjusted R-squared:  0.3061 
F-statistic:  3910 on 3 and 26580 DF,  p-value: < 2.2e-16

predicted_interaction <- predict(m2, wine)
rmse_interaction <- sqrt(mean((wine$lprice - predicted_interaction)^2))
rmse_interaction

[1] 0.4685223

Line 1: Includes both the individual effects and the interaction between points and cherry in the model.
Line 2: Displays the updated regression results, including the interaction term.
Line 3: Predicts lprice using the new model.
Line 4: Computes the RMSE for the interaction model.

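Results:
- The RMSE for m2 is 0.4685, slightly lower than the 0.4688 for m1. In-sample RMSE cannot increase when a term is added to a linear model, so this is at most a marginal improvement in fit.
- The interaction term (points:cherry) has a positive coefficient (0.0127), suggesting that the effect of points on log price is stronger for wines described as having “cherry” flavors: the slope on points is 0.1022 without “cherry” and 0.1022 + 0.0127 = 0.1149 with it.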

The Interaction Variable

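The coefficient on the interaction term indicates how the relationship between points and lprice changes depending on whether the wine mentions “cherry.” Here the coefficient is positive (0.0127), so wines with “cherry” descriptions show a steeper increase in log price as points increase. Once the interaction is included, the cherry coefficient on its own (-1.0149) is the “cherry” effect at points = 0, not an average effect; the effect at a given rating is -1.0149 + 0.0127 × points.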
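
To make the interaction concrete, the m2 estimates can be combined into the two slopes and the implied “cherry” gap at a given rating. This is a sketch built only from the coefficients printed above; the function name cherry_gap and the ratings passed to it are illustrative choices, not part of the assignment.

```r
# m2 coefficient estimates, copied from the summary(m2) output
b <- c(points = 0.102225, cherry = -1.014896, interaction = 0.012663)

# Slope of log price on points, without and with "cherry" in the description
slope_no_cherry <- b[["points"]]
slope_cherry    <- b[["points"]] + b[["interaction"]]

# Implied difference in log price between cherry and non-cherry wines
# at a given rating (hypothetical helper)
cherry_gap <- function(points) b[["cherry"]] + b[["interaction"]] * points

c(slope_no_cherry = slope_no_cherry, slope_cherry = slope_cherry)
cherry_gap(c(80, 90, 100))
```

The gap grows with the rating, which is the same statement as the positive interaction coefficient.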

Applications

In which province (Oregon, California, or New York) does the ‘cherry’ feature in the data affect price most?

m3 <- lm(lprice ~ points * cherry * province, data = wine)
summary(m3)


Call:
lm(formula = lprice ~ points * cherry * province, data = wine)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6751 -0.3215 -0.0233  0.2891  3.9315 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    -5.643824   0.118705 -47.545  < 2e-16 ***
points                          0.102415   0.001330  77.000  < 2e-16 ***
cherry                         -1.074154   0.254487  -4.221 2.44e-05 ***
provinceNew York                3.018542   0.448839   6.725 1.79e-11 ***
provinceOregon                  1.510512   0.261449   5.777 7.67e-09 ***
points:cherry                   0.013027   0.002833   4.599 4.27e-06 ***
points:provinceNew York        -0.037675   0.005138  -7.333 2.32e-13 ***
points:provinceOregon          -0.017562   0.002939  -5.976 2.32e-09 ***
cherry:provinceNew York         2.655053   0.899457   2.952  0.00316 ** 
cherry:provinceOregon          -0.422541   0.562001  -0.752  0.45215    
points:cherry:provinceNew York -0.029367   0.010260  -2.862  0.00421 ** 
points:cherry:provinceOregon    0.006144   0.006269   0.980  0.32703    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4615 on 26572 degrees of freedom
Multiple R-squared:  0.3272,    Adjusted R-squared:  0.3269 
F-statistic:  1175 on 11 and 26572 DF,  p-value: < 2.2e-16

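Line 1: Fits a model with all main effects and all two-way and three-way interactions among points, cherry, and province.
Line 2: Displays the regression results. California is the baseline province, so each province term is a difference from California.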

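Results:
- The interaction effect (cherry:provinceNew York) has the largest positive coefficient (2.655), which points to New York as the province where “cherry” shifts log price most relative to the California baseline. Because the model also includes points:cherry:province terms, this coefficient is the gap at points = 0, presumably far outside the observed ratings, so the ranking should be confirmed at realistic point values.
- The Oregon terms (cherry:provinceOregon and points:cherry:provinceOregon) are not statistically significant, so the “cherry” effect in Oregon cannot be distinguished from the effect in California.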
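
Since every cherry term in m3 interacts with points, the “cherry” effect in each province depends on the rating at which it is evaluated. The sketch below combines the printed point estimates into that effect for one chosen rating. The function name cherry_effect and the rating of 90 are assumptions made for illustration, and the calculation ignores the standard errors, so it should be read alongside the significance results above.

```r
# "cherry" intercept shift per province: baseline + cherry:province term
cherry_main <- -1.074154 +
  c(California = 0, "New York" = 2.655053, Oregon = -0.422541)

# "cherry" slope on points per province: points:cherry + three-way term
cherry_slope <- 0.013027 +
  c(California = 0, "New York" = -0.029367, Oregon = 0.006144)

# Implied difference in log price (cherry vs. no cherry) at a single rating
cherry_effect <- function(points) {
  stopifnot(length(points) == 1)
  cherry_main + cherry_slope * points
}

# 90 is an illustrative rating, not a value taken from the assignment
round(cherry_effect(90), 3)
```

Evaluating the function at several ratings shows how the comparison between provinces can change across the rating range.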

Scenarios

On Accuracy

Imagine a model to distinguish New York wines from those in California and Oregon. After a few days of work, you take some measurements and note: “I’ve achieved 91% accuracy on my model!”

Should you be impressed? Why or why not?

wine %>% group_by(province) %>% summarise(count = n())

# A tibble: 3 × 2
  province   count
  <chr>      <int>
1 California 19073
2 New York    2364
3 Oregon      5147

Achieving 91% accuracy is not impressive because of the class imbalance. New York wines make up only 8.9% of the data (2,364 of 26,584), so a naive model that always predicts “not New York” would reach about 91.1% accuracy while never identifying a single New York wine.
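
The baseline can be worked out directly from the counts in the table above. This is a small sketch; the variable names are illustrative, and the "naive model" is the always-predict-"not New York" rule described in the text.

```r
# Province counts, copied from the summarise() output
counts <- c(California = 19073, "New York" = 2364, Oregon = 5147)

# Share of New York wines in the data
share_ny <- counts[["New York"]] / sum(counts)

# Accuracy of a rule that always predicts "not New York"
baseline_accuracy <- 1 - share_ny

# That rule finds zero New York wines, so its recall for New York is 0
recall_ny <- 0 / counts[["New York"]]

round(c(share_ny = share_ny,
        baseline_accuracy = baseline_accuracy,
        recall_ny = recall_ny), 3)
```

A model is only worth noting if it beats this baseline, and metrics such as recall for the New York class show what accuracy alone hides.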

On Ethics

Why is understanding this vignette important to use machine learning in an ethical manner?

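Understanding this vignette matters because a single headline metric can hide how a model treats a minority group. A model can report 91% accuracy while failing on every New York wine. In a higher-stakes setting the overlooked class could be a group of people, so using machine learning ethically requires checking representation in the data and performance for each group, not only overall accuracy.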

Ignorance is no excuse

Imagine you are working on a model to predict the likelihood that an individual loses their job as the result of the changing federal policy under new presidential administrations. You have a very large dataset with many hundreds of features, but you are worried that including indicators like age, income or gender might pose some ethical problems. When you discuss these concerns with your boss, she tells you to simply drop those features from the model. Does this solve the ethical issue? Why or why not?

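Dropping sensitive features does not resolve the ethical issue. With hundreds of features, some will act as proxies (variables correlated with age, income, or gender), so the model can still reproduce the same biases indirectly. Removing the sensitive columns also makes it harder to measure whether predictions differ across those groups. Ethical machine learning requires thorough investigation of how all features influence the model and its predictions.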
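
The proxy problem can be illustrated with a small simulation. Everything below is hypothetical: the data are simulated, the variable names (sensitive, proxy, outcome) are invented, and the effect sizes were chosen only to make the pattern visible. The sketch fits a model that never sees the sensitive attribute and then compares its predictions across the two groups.

```r
set.seed(505)
n <- 5000

# Hypothetical sensitive attribute (for example, an age-group indicator)
sensitive <- rbinom(n, 1, 0.5)

# Hypothetical proxy feature that stays in the data and tracks the attribute
proxy <- sensitive + rnorm(n, sd = 0.5)

# Simulated outcome that depends on the sensitive attribute only
outcome <- rbinom(n, 1, plogis(-1 + 2 * sensitive))

# Model fitted WITHOUT the sensitive attribute
fit  <- glm(outcome ~ proxy, family = binomial)
pred <- predict(fit, type = "response")

# Average predicted risk by group, and the gap between groups
group_gap <- mean(pred[sensitive == 1]) - mean(pred[sensitive == 0])
proxy_cor <- cor(proxy, sensitive)
c(proxy_cor = proxy_cor, group_gap = group_gap)
```

Even though the sensitive column was dropped, the predictions still differ by group because the proxy carries much of the same information.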