Homework questions and instructions copyright Miles Chen, Do not post, share, or distribute without permission.
You will submit two files:
- 101C_HW_02_First_Last.Rmd: Take the provided R Markdown file and make the necessary edits so that it generates the requested output.
- 101C_HW_02_First_Last.pdf: Your output file. This must be a PDF. This is the primary file that will be graded. Make sure all requested output is visible in the output file.
At the top of your R markdown file, be sure to include the following statement after modifying it with your name.
“By including this statement, I, Isaiah Mireles, declare that all of the work in this assignment is my own original work. At no time did I look at the code of other students nor did I search for code solutions online. I understand that plagiarism on any single part of this assignment will result in a 0 for the entire assignment and that I will be referred to the dean of students.”
Include certificate of completion here:
The following questions are based on exercises from ISLR Chapter 3, but I have modified them. You can refer to the original questions in the chapter text if some of the questions are confusing because of missing context.
step 0. Use tidymodels and rsample to split Auto into a training set (prop = 0.80) and test set. Use set.seed(101) before using initial_split(). Stratify on mpg. Report the dimensions of the training and test sets.
step 1. Use tidymodels to fit a simple linear regression model to the training data with mpg as the response and horsepower as the predictor. Use the engine lm. Once you fit the model, print the model summary. If you call summary() on the parsnip model_fit object, it will print a list summary. To get the traditional summary() output associated with lm, use extract_fit_engine() along with summary().
# Load libraries
library(ISLR)
library(tidymodels)
library(moderndive)
# Set seed and split data
set.seed(101)
auto_split <- initial_split(Auto, prop = 0.80, strata = mpg)
auto_train <- training(auto_split)
auto_test <- testing(auto_split)
# Report dimensions
dim(auto_train)
## [1] 312 9
dim(auto_test)
## [1] 80 9
# Define and fit the linear model using tidymodels
lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ horsepower, data = auto_train)
# Traditional model summary from lm engine
summary(extract_fit_engine(lm_model))
##
## Call:
## stats::lm(formula = mpg ~ horsepower, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6159 -3.1739 -0.4144 2.5778 16.8674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.102631 0.793211 50.56 <2e-16 ***
## horsepower -0.159538 0.007168 -22.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.891 on 310 degrees of freedom
## Multiple R-squared: 0.6151, Adjusted R-squared: 0.6139
## F-statistic: 495.4 on 1 and 310 DF, p-value: < 2.2e-16
# Tidy regression table
get_regression_table(extract_fit_engine(lm_model))
Answer the following questions based on the fitted model:
\[ H_0 : \beta_1 = 0 \\ H_a : \beta_1 \ne 0 \]
Our p-value for horsepower (< 2e-16) is dramatically below the standard \(\alpha = 0.05\), so we reject \(H_0\): there is a statistically significant relationship between mpg and horsepower.
We can quantify the strength of the relationship in two ways: the residual standard error (4.891 on the training data) and \(R^2 = 0.6151\), meaning horsepower explains about 61.5% of the variance in mpg in the training set. For reference:
\[ R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = \frac{SS_{\text{reg}}}{SS_{\text{tot}}} \]
\[ SS_{\text{res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SS_{\text{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad SS_{\text{reg}} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \]
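As a quick sanity check, we can recompute \(R^2\) directly from these sums of squares on the training data (a minimal sketch; it reuses lm_model and auto_train from above and should match the Multiple R-squared reported by summary()):
fit_lm <- extract_fit_engine(lm_model)
y      <- auto_train$mpg
y_hat  <- fitted(fit_lm)
ss_res <- sum((y - y_hat)^2)   # residual sum of squares
ss_tot <- sum((y - mean(y))^2) # total sum of squares
1 - ss_res / ss_tot            # should agree with Multiple R-squared: 0.6151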
The relationship is negative: the estimated slope is -0.159538, so each additional unit of horsepower is associated with a decrease of about 0.16 mpg.
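To see the slope in action, note that two predictions one horsepower apart differ by exactly the slope estimate (a small illustration using the fitted model from above):
# Predictions at 100 hp and 101 hp differ by the slope
preds <- predict(extract_fit_engine(lm_model),
                 newdata = data.frame(horsepower = c(100, 101)))
diff(preds)  # approximately -0.1595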
# Extract the lm object
lm_base <- extract_fit_engine(lm_model)
# Make predictions with 95% prediction intervals
auto_preds <- predict(lm_base,
                      newdata = auto_test,
                      interval = "prediction",
                      level = 0.95)
# Convert to tibble and bind with actual data
results <- auto_test %>%
  select(mpg, horsepower) %>%
  bind_cols(as_tibble(auto_preds)) %>%
  mutate(outside_PI = mpg < lwr | mpg > upr)
# Inspect the results, then show failed prediction intervals
str(results)
results %>% filter(outside_PI)
## 'data.frame': 80 obs. of 6 variables:
## $ mpg : num 18 15 14 21 10 27 14 14 12 13 ...
## $ horsepower: num 150 198 225 90 215 88 165 153 180 170 ...
## $ fit : num 16.17 8.51 4.21 25.74 5.8 ...
## $ lwr : num 6.51 -1.22 -5.58 16.1 -3.96 ...
## $ upr : num 25.8 18.2 14 35.4 15.6 ...
## $ outside_PI: logi FALSE FALSE TRUE FALSE FALSE FALSE ...
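With 80 test observations, a 95% prediction interval should cover roughly 95% of them. A one-line check of the empirical coverage, using the results object built above:
# Proportion of test observations inside their prediction interval
mean(!results$outside_PI)  # nominally about 0.95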
Plot the predicted values against the actual values for the test set, using coord_obs_pred(). Color the observations that failed to make successful prediction intervals a different color from the other observations.
# Pred vs actual
ggplot(results, aes(x = mpg, y = fit, color = outside_PI)) +
  geom_point() +
  # Dotted reference line for perfect prediction
  geom_abline(slope = 1, intercept = 0, lty = 2, color = "black") +
  coord_obs_pred() +  # Equal, aligned axes (from the tune package, loaded with tidymodels)
  scale_color_manual(values = c("FALSE" = "green", "TRUE" = "red")) +
  labs(
    title = "Predicted vs. Actual MPG (Test Set)",
    x = "Actual MPG",
    y = "Predicted MPG",
    color = "Outside 95% PI"
  ) +
  theme_minimal()
This exercise involves the use of multiple linear regression on the Auto data set.
Skip part (a). You can make the plot for your own benefit, but don’t include it in your HW solutions.
# Exclude 'name' column and compute correlation matrix
cor_matrix <- cor(Auto[ , sapply(Auto, is.numeric)])
round(cor_matrix, 2) # Optional: round for readability
## mpg cylinders displacement horsepower weight acceleration year
## mpg 1.00 -0.78 -0.81 -0.78 -0.83 0.42 0.58
## cylinders -0.78 1.00 0.95 0.84 0.90 -0.50 -0.35
## displacement -0.81 0.95 1.00 0.90 0.93 -0.54 -0.37
## horsepower -0.78 0.84 0.90 1.00 0.86 -0.69 -0.42
## weight -0.83 0.90 0.93 0.86 1.00 -0.42 -0.31
## acceleration 0.42 -0.50 -0.54 -0.69 -0.42 1.00 0.29
## year 0.58 -0.35 -0.37 -0.42 -0.31 0.29 1.00
## origin 0.57 -0.57 -0.61 -0.46 -0.59 0.21 0.18
## origin
## mpg 0.57
## cylinders -0.57
## displacement -0.61
## horsepower -0.46
## weight -0.59
## acceleration 0.21
## year 0.18
## origin 1.00
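To make the matrix easier to scan, one could rank the predictors by the absolute value of their correlation with mpg (a small sketch reusing cor_matrix from above):
# Predictors ordered by |correlation| with mpg
mpg_cor <- abs(cor_matrix[, "mpg"])
sort(mpg_cor[names(mpg_cor) != "mpg"], decreasing = TRUE)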
auto_lm <- lm(mpg ~ . - name, data = Auto)
# Print summary
summary(auto_lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response?
Yes. The overall F-statistic is 252.4 on 7 and 384 degrees of freedom with a p-value < 2.2e-16, so we reject the null hypothesis that all coefficients are zero: at least one predictor is related to mpg.
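The F-statistic and its p-value can also be pulled out programmatically rather than read off the printout (a sketch using the auto_lm fit from above):
# Overall F-test for the multiple regression
fstat <- summary(auto_lm)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)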
Which predictors appear to have a statistically significant relationship to the response?
At the 5% level: displacement, weight, year, and origin.
displacement 0.019896 0.007515 2.647 0.00844 **
weight -0.006474 0.000652 -9.929 < 2e-16 ***
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
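The same list can be produced programmatically with broom (loaded as part of tidymodels); this sketch filters the coefficient table at the 5% level:
# Terms significant at alpha = 0.05, excluding the intercept
broom::tidy(auto_lm) %>%
  filter(term != "(Intercept)", p.value < 0.05)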
What does the coefficient for the year variable suggest?
The coefficient of 0.7508 suggests that, holding the other predictors fixed, average mpg increases by about 0.75 per model year; in other words, cars have become more fuel efficient over time.
Skip part (d).
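A note on the formula below: (.)^2 expands to all main effects plus all pairwise interactions, and subtracting . then removes the main effects, leaving only the interaction terms. One way to verify which terms survive (a sketch; terms() performs the same formula expansion that lm() uses internally):
# List the terms in the interaction-only formula
attr(terms(mpg ~ (.)^2 - ., data = subset(Auto, select = -name)),
     "term.labels")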
interaction_only_mdl <- lm(mpg ~ (. )^2 - ., data = subset(Auto, select = -name))
summary(interaction_only_mdl)
##
## Call:
## lm(formula = mpg ~ (.)^2 - ., data = subset(Auto, select = -name))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8275 -1.4924 -0.1428 1.2977 15.2100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.721e+00 2.245e+00 2.994 0.00294 **
## cylinders:displacement 1.973e-03 6.785e-03 0.291 0.77139
## cylinders:horsepower -2.427e-03 2.292e-02 -0.106 0.91576
## cylinders:weight -7.059e-04 9.531e-04 -0.741 0.45939
## cylinders:acceleration -1.895e-01 1.576e-01 -1.203 0.22991
## cylinders:year 8.276e-02 3.703e-02 2.235 0.02601 *
## cylinders:origin -6.433e-01 5.231e-01 -1.230 0.21956
## displacement:horsepower 2.755e-04 2.611e-04 1.055 0.29196
## displacement:weight 1.748e-05 1.596e-05 1.095 0.27406
## displacement:acceleration 7.896e-03 3.192e-03 2.474 0.01381 *
## displacement:year -3.999e-03 9.710e-04 -4.119 4.7e-05 ***
## displacement:origin 5.045e-02 2.035e-02 2.479 0.01361 *
## horsepower:weight -1.803e-05 2.947e-05 -0.612 0.54093
## horsepower:acceleration -4.421e-03 3.937e-03 -1.123 0.26225
## horsepower:year 1.093e-03 1.500e-03 0.729 0.46671
## horsepower:origin -4.101e-02 2.487e-02 -1.649 0.10008
## weight:acceleration -6.201e-04 2.112e-04 -2.936 0.00353 **
## weight:year 1.442e-04 8.675e-05 1.663 0.09724 .
## weight:origin -1.972e-03 1.604e-03 -1.229 0.21974
## acceleration:year 2.159e-02 5.137e-03 4.204 3.3e-05 ***
## acceleration:origin 4.872e-02 1.343e-01 0.363 0.71698
## year:origin 6.172e-02 3.090e-02 1.997 0.04651 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.967 on 370 degrees of freedom
## Multiple R-squared: 0.8632, Adjusted R-squared: 0.8554
## F-statistic: 111.2 on 21 and 370 DF, p-value: < 2.2e-16
The interaction-only model (mpg ~ (.)^2 - .) captures relationships between pairs of predictors in the Auto dataset and fits the data well, with an R-squared of 0.8632 and an adjusted R-squared of 0.8554. Several interaction terms are statistically significant, suggesting that the effect of one variable on mpg often depends on another. For example, the positive acceleration:year interaction indicates that acceleration contributes more positively to fuel efficiency in newer cars, while the negative weight:acceleration interaction suggests that the benefit of acceleration diminishes for heavier vehicles. However, excluding main effects limits interpretability and violates the hierarchy principle, which recommends including main effects whenever their interactions are present. Despite this, the model reveals how combinations of features influence fuel efficiency.
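To respect the hierarchy principle mentioned above, one could refit with the main effects included; the sketch below is illustrative, and its output is not shown here:
# Full second-order model: main effects plus all pairwise interactions
full_mdl <- lm(mpg ~ (.)^2, data = subset(Auto, select = -name))
summary(full_mdl)  # compare adjusted R-squared with the interaction-only fit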