Homework questions and instructions copyright Miles Chen, Do not post, share, or distribute without permission.

Homework 2 Requirements

You will submit two files:

  1. 101C_HW_02_First_Last.Rmd Take the provided R Markdown file and make the necessary edits so that it generates the requested output.

  2. 101C_HW_02_First_Last.pdf Your output file. This must be a PDF. This is the primary file that will be graded. Make sure all requested output is visible in the output file.

Academic Integrity

At the top of your R Markdown file, be sure to include the following statement after modifying it with your name.

“By including this statement, I, Isaiah Mireles, declare that all of the work in this assignment is my own original work. At no time did I look at the code of other students nor did I search for code solutions online. I understand that plagiarism on any single part of this assignment will result in a 0 for the entire assignment and that I will be referred to the dean of students.”

Reading:

DataCamp Homework Part 1 (25 pts)

Include certificate of completion here:

DataCamp Homework Part 2 (25 pts)

ISLR Chapter 3 Applied Exercises

The following questions are based on exercises from ISLR Chapter 3, but I have modified them. You can refer to the original questions in the chapter text if some of the questions are confusing because of missing context.

Exercise 8 (modified to use tidymodels)

Step 0. Use tidymodels and rsample to split Auto into a training set (prop = 0.80) and a test set. Use set.seed(101) before calling initial_split(). Stratify on mpg. Report the dimensions of the training and test sets.

  1. Use tidymodels to fit a simple linear regression model to the training data with mpg as the response and horsepower as the predictor. Use the engine lm. Once you fit the model, print the model summary.

If you call summary() on the parsnip model_fit object, it will summarize the object as a list. To get the traditional summary() output associated with lm, use extract_fit_engine() along with summary().

# Load libraries
library(ISLR)
library(tidymodels)
library(moderndive)

# Set seed and split data
set.seed(101)
auto_split <- initial_split(Auto, prop = 0.80, strata = mpg)
auto_train <- training(auto_split)
auto_test <- testing(auto_split)

# Report dimensions
dim(auto_train)
## [1] 312   9
dim(auto_test)
## [1] 80  9
# Define and fit the linear model using tidymodels
lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ horsepower, data = auto_train)

# Traditional model summary from lm engine
summary(extract_fit_engine(lm_model))
## 
## Call:
## stats::lm(formula = mpg ~ horsepower, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6159  -3.1739  -0.4144   2.5778  16.8674 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.102631   0.793211   50.56   <2e-16 ***
## horsepower  -0.159538   0.007168  -22.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.891 on 310 degrees of freedom
## Multiple R-squared:  0.6151, Adjusted R-squared:  0.6139 
## F-statistic: 495.4 on 1 and 310 DF,  p-value: < 2.2e-16
# Tidy regression table
get_regression_table(extract_fit_engine(lm_model))

Answer the following questions based on the fit model:

  1. Is there a relationship between the predictor and the response?
    1. Yes. Consider the hypothesis test on the slope:

\[ H_0 : \beta_1 = 0 \qquad H_a : \beta_1 \ne 0 \]

The p-value is far below the standard \(\alpha = 0.05\), so there is a statistically significant relationship between mpg and horsepower.
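To pull the slope's test statistic and p-value directly, broom's tidy() also works on the parsnip fit (a quick convenience, not required by the exercise):

# Coefficient table as a tibble; the horsepower row carries the slope test
tidy(lm_model)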

  1. How strong is the relationship between the predictor and the response?
    1. We can quantify the strength in two ways:

      1. Coefficient of determination: \(R^2 = 0.6151\), so the model explains about 61.5% of the variation in mpg.

      \[ R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = \frac{SS_{\text{reg}}}{SS_{\text{tot}}} \]

      \[ SS_{\text{res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SS_{\text{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad SS_{\text{reg}} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \]

      1. Scatter plot: the linear fit does not appear to capture the true relationship; a more flexible model reveals curvature that simple linear regression oversimplifies. Because we assume a linear parametric form, the model is inflexible, with high bias and low variance, so we are underfitting rather than overfitting. (A hand computation of \(R^2\) is sketched after this list.)
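As a sanity check on the \(R^2\) figure quoted above, here is a minimal hand computation on the training set, reusing the lm_model fit from earlier:

# Recover fitted values from the underlying lm object
y_hat <- fitted(extract_fit_engine(lm_model))
y <- auto_train$mpg

ss_res <- sum((y - y_hat)^2)    # residual sum of squares
ss_tot <- sum((y - mean(y))^2)  # total sum of squares
1 - ss_res / ss_tot             # should match Multiple R-squared: 0.6151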

  1. Is the relationship between the predictor and the response positive or negative?
    1. It is clearly negative: the estimated slope is -0.159538, so each additional unit of horsepower is associated with a decrease of about 0.16 mpg.
  2. Make predictions on the test set. Create a new data frame that contains the actual mpg, the model's prediction, and the lower and upper bounds of a 95% prediction interval. Add another column indicating whether the actual mpg value falls outside the bounds of the prediction interval. Identify the observations whose prediction intervals failed to capture the actual mpg.
# Extract the lm object
lm_base <- extract_fit_engine(lm_model)

# Make predictions with 95% prediction intervals
auto_preds <- predict(lm_base,
                      newdata = auto_test,
                      interval = "prediction",
                      level = 0.95)

# Convert to tibble and bind with actual data
results <- 
  auto_test %>%
  select(mpg, horsepower) %>%
  bind_cols(as_tibble(auto_preds)) %>%
  mutate(outside_PI = mpg < lwr | mpg > upr)

# Inspect the structure, then list observations outside their intervals
str(results)
results %>% filter(outside_PI)
## 'data.frame':    80 obs. of  6 variables:
##  $ mpg       : num  18 15 14 21 10 27 14 14 12 13 ...
##  $ horsepower: num  150 198 225 90 215 88 165 153 180 170 ...
##  $ fit       : num  16.17 8.51 4.21 25.74 5.8 ...
##  $ lwr       : num  6.51 -1.22 -5.58 16.1 -3.96 ...
##  $ upr       : num  25.8 18.2 14 35.4 15.6 ...
##  $ outside_PI: logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
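An equivalent approach stays entirely within parsnip: predicting with type = "pred_int" returns .pred_lower and .pred_upper columns. As a rough sanity check, about 5% of test observations should fall outside a 95% prediction interval. A sketch, reusing the lm_model fit from above:

# parsnip-native prediction intervals (.pred, .pred_lower, .pred_upper)
pi_tbl <- bind_cols(
  auto_test %>% select(mpg),
  predict(lm_model, new_data = auto_test),
  predict(lm_model, new_data = auto_test, type = "pred_int", level = 0.95)
)

# Proportion of test observations outside their 95% prediction interval
mean(pi_tbl$mpg < pi_tbl$.pred_lower | pi_tbl$mpg > pi_tbl$.pred_upper)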

  1. Use ggplot and create a plot for the test set with actual mpg on the x-axis and the predicted mpg on the y-axis. Add a geom_abline with a slope of 1 and intercept of 0 (also use lty = 2 to make it dotted) - this line represents where the predictions would be if they were 100% accurate. Add the option coord_obs_pred(). Color the observations whose prediction intervals failed to capture the actual mpg a different color from the other observations.
# Pred vs actual 
ggplot(results, aes(x = mpg, y = fit, color = outside_PI)) +
  geom_point() +
  # Dotted reference line for perfect prediction
  geom_abline(slope = 1, intercept = 0, lty = 2, color = "black") +
  coord_obs_pred() +  # From the tune package (loaded with tidymodels): equal obs/pred axis scales
  scale_color_manual(values = c("FALSE" = "green", "TRUE" = "red")) +
  labs(
    title = "Predicted vs. Actual MPG (Test Set)",
    x = "Actual MPG",
    y = "Predicted MPG",
    color = "Outside 95% PI"
  ) +
  theme_minimal()

Exercise 9 (modified to use tidymodels)

This exercise involves the use of multiple linear regression on the Auto data set.

Skip part (a). You can make the plot for your own benefit, but don’t include it in your HW solutions.

  1. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative, from the call to cor().
# Exclude 'name' column and compute correlation matrix
cor_matrix <- cor(Auto[ , sapply(Auto, is.numeric)])
round(cor_matrix, 2)  # Optional: round for readability
##                mpg cylinders displacement horsepower weight acceleration  year
## mpg           1.00     -0.78        -0.81      -0.78  -0.83         0.42  0.58
## cylinders    -0.78      1.00         0.95       0.84   0.90        -0.50 -0.35
## displacement -0.81      0.95         1.00       0.90   0.93        -0.54 -0.37
## horsepower   -0.78      0.84         0.90       1.00   0.86        -0.69 -0.42
## weight       -0.83      0.90         0.93       0.86   1.00        -0.42 -0.31
## acceleration  0.42     -0.50        -0.54      -0.69  -0.42         1.00  0.29
## year          0.58     -0.35        -0.37      -0.42  -0.31         0.29  1.00
## origin        0.57     -0.57        -0.61      -0.46  -0.59         0.21  0.18
##              origin
## mpg            0.57
## cylinders     -0.57
## displacement  -0.61
## horsepower    -0.46
## weight        -0.59
## acceleration   0.21
## year           0.18
## origin         1.00
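As a quick convenience (not required by the exercise), the correlations with mpg can be pulled out and sorted to preview which predictors are most strongly associated with the response:

# Correlations of each variable with mpg, in increasing order
sort(cor_matrix["mpg", ])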
  1. Use tidymodels with the lm engine to create a multiple linear regression with mpg as the response and all other variables except name as the predictors. Print the summary and answer the following questions:
# Fit with base lm; a tidymodels equivalent is sketched after the summary
auto_lm <- lm(mpg ~ . - name, data = Auto)

# Print summary
summary(auto_lm)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
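Since the exercise asks for tidymodels, here is an equivalent specification that yields the identical coefficients (a sketch, following the same pattern as Exercise 8):

# Same model via parsnip; extract_fit_engine() recovers the lm object
auto_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ . - name, data = Auto)
summary(extract_fit_engine(auto_fit))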
  1. Is there a relationship between the predictors and the response?

    1. Yes. The overall F-statistic (252.4 on 7 and 384 degrees of freedom) has a p-value below 2.2e-16, so at least one predictor has a relationship with mpg.
  2. Which predictors appear to have a statistically significant relationship to the response?

    1. At the \(\alpha = 0.05\) level: displacement, weight, year, and origin.

      displacement   0.019896   0.007515   2.647  0.00844 ** 
      weight        -0.006474   0.000652  -9.929  < 2e-16 ***
      year           0.750773   0.050973  14.729  < 2e-16 ***
      origin         1.426141   0.278136   5.127 4.67e-07 ***
  3. What does the coefficient for the year variable suggest?

    1. The coefficient for year is positive: holding the other variables fixed, each additional model year is associated with an increase of about 0.75 mpg (roughly 7.5 mpg over a decade). This suggests newer cars tend to be more fuel-efficient, likely reflecting technological and regulatory trends (e.g., emissions standards and fuel-economy requirements).

Skip part (d).

  1. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
# (.)^2 expands to all main effects plus all two-way interactions;
# subtracting . leaves only the interaction terms
interaction_only_mdl <- lm(mpg ~ (.)^2 - ., data = subset(Auto, select = -name))
summary(interaction_only_mdl)
## 
## Call:
## lm(formula = mpg ~ (.)^2 - ., data = subset(Auto, select = -name))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8275 -1.4924 -0.1428  1.2977 15.2100 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                6.721e+00  2.245e+00   2.994  0.00294 ** 
## cylinders:displacement     1.973e-03  6.785e-03   0.291  0.77139    
## cylinders:horsepower      -2.427e-03  2.292e-02  -0.106  0.91576    
## cylinders:weight          -7.059e-04  9.531e-04  -0.741  0.45939    
## cylinders:acceleration    -1.895e-01  1.576e-01  -1.203  0.22991    
## cylinders:year             8.276e-02  3.703e-02   2.235  0.02601 *  
## cylinders:origin          -6.433e-01  5.231e-01  -1.230  0.21956    
## displacement:horsepower    2.755e-04  2.611e-04   1.055  0.29196    
## displacement:weight        1.748e-05  1.596e-05   1.095  0.27406    
## displacement:acceleration  7.896e-03  3.192e-03   2.474  0.01381 *  
## displacement:year         -3.999e-03  9.710e-04  -4.119  4.7e-05 ***
## displacement:origin        5.045e-02  2.035e-02   2.479  0.01361 *  
## horsepower:weight         -1.803e-05  2.947e-05  -0.612  0.54093    
## horsepower:acceleration   -4.421e-03  3.937e-03  -1.123  0.26225    
## horsepower:year            1.093e-03  1.500e-03   0.729  0.46671    
## horsepower:origin         -4.101e-02  2.487e-02  -1.649  0.10008    
## weight:acceleration       -6.201e-04  2.112e-04  -2.936  0.00353 ** 
## weight:year                1.442e-04  8.675e-05   1.663  0.09724 .  
## weight:origin             -1.972e-03  1.604e-03  -1.229  0.21974    
## acceleration:year          2.159e-02  5.137e-03   4.204  3.3e-05 ***
## acceleration:origin        4.872e-02  1.343e-01   0.363  0.71698    
## year:origin                6.172e-02  3.090e-02   1.997  0.04651 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.967 on 370 degrees of freedom
## Multiple R-squared:  0.8632, Adjusted R-squared:  0.8554 
## F-statistic: 111.2 on 21 and 370 DF,  p-value: < 2.2e-16

The interaction-only model (mpg ~ (.)^2 - .) fits the Auto data well, with an R-squared of 0.8632 and an adjusted R-squared of 0.8554. Several interaction terms are statistically significant at \(\alpha = 0.05\), most strongly displacement:year and acceleration:year (both p < 0.001), along with weight:acceleration, cylinders:year, displacement:acceleration, displacement:origin, and year:origin. This suggests the effect of one variable on mpg often depends on another: the positive acceleration:year interaction indicates that acceleration is associated with better fuel efficiency in newer cars, while the negative weight:acceleration interaction suggests that association weakens for heavier vehicles. One caveat: excluding the main effects violates the hierarchy principle, which recommends keeping main effects whenever their interactions are included, so the individual coefficients here should be interpreted with caution.
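To respect the hierarchy principle, one could instead fit the full second-order model (main effects plus all two-way interactions) and use a nested F-test to ask whether the interactions add explanatory power as a group. A sketch, reusing the auto_lm fit from earlier:

# Full model: all main effects plus all two-way interactions
full_mdl <- lm(mpg ~ (.)^2, data = subset(Auto, select = -name))

# Nested F-test: main-effects-only model vs. full second-order model
anova(auto_lm, full_mdl)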