Exercise 1

Consider a well-known auto dataset (Auto.txt available on Canvas), which consists of 392 observations and two variables: mpg: miles per gallon and horsepower: Engine displacement (engine horsepower). As in those previous analyses in Chapter 2, we take mpg as the dependent/outcome variable and horsepower as the predictor variable.

Read data into R using the read.table function in R. For instance, I read the data from my computer using

Auto <- read.table("C:/Users/jenei/Downloads/Auto.txt",
header = T)

Draw a scatterplot to check the relationship between horsepower (x-axis) and mpg (y-axis) and interpret the relationship between mgp and horsepower.

plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower",
     ylab = "Miles Per Gallon (mpg)",
     main = "Scatterplot of mpg vs horsepower",
     pch = 19)

Interpretation: The scatterplot shows a strong negative relationship between horsepower and miles per gallon (mpg). As horsepower increases, mpg generally decreases. Vehicles with lower horsepower tend to have higher fuel efficiency, while vehicles with higher horsepower tend to have lower fuel efficiency. The pattern also suggests a slight nonlinear relationship, as mpg declines more rapidly at lower horsepower levels and levels off at higher horsepower values.

Write down the least square regression equation and circle the results from your outputs.

lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Least squared regression equation: y=β0 + β1x mpg = β0 + β1 horsepower mpg= 39.935861 - 0.157845(horsepower)

Circled results from Outputs - Intercept: 39.935861 horsepower: -0.157845

Find proportion of the variation that can be explained by the least squares regression line (i.e., R2).

summary(lm_fit)$r.squared

## [1] 0.6059483

Interpretation: R2=0.61

About 61% of the variability in mpg is explained by horsepower using the linear regression model.

Draw the boxplot for the residuals from the linear regression model between mpg and horsepower to check if the data contain any potential outliers?

residuals_lm <- resid(lm_fit)

boxplot(residuals_lm,
        main = "Boxplot of Residuals",
        ylab = "Residuals")

Fit a single linear model and conduct 10-fold CV to estimate the error. In addition, draw the scatter plot with the fitted line and the scatter plot between the observed and fitted values.

library(boot)

set.seed(123)

cv_linear <- cv.glm(Auto, lm_fit, K = 10)
cv_linear$delta

## [1] NaN NaN

plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower",
     ylab = "mpg",
     main = "Linear Regression Fit",
     pch = 19)
abline(lm_fit, col = "red", lwd = 2)

plot(fitted(lm_fit), Auto$mpg,
     xlab = "Fitted Values",
     ylab = "Observed mpg",
     main = "Observed vs Fitted (Values)",
     pch = 19)
abline(0, 1, col = "red", lwd = 2)

Fit a quadratic model and conduct 10-fold CV to estimate the error and draw the scatter plot with the fitted line and the scatter plot between the observed and fitted values. (Hint, you need to sort your data in an ascending order, such that #Create the squared term of horsepower, called horsepower2 Auto\(horsepower2 = Auto\)horsepower^2 #sort the data in an ascending order Auto = Auto[order(Auto[,3],decreasing=FALSE),]

# Create squared term
Auto$horsepower2 <- Auto$horsepower^2

# Sort data
Auto <- Auto[order(Auto$horsepower), ]

# Fit quadratic model
quad_fit <- lm(mpg ~ horsepower + horsepower2, data = Auto)

cv_quad <- cv.glm(Auto, quad_fit, K = 10)
cv_quad$delta

## [1] NaN NaN

plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower",
     ylab = "mpg",
     main = "Quadratic Regression Fit",
     pch = 19)

lines(Auto$horsepower, fitted(quad_fit),
      col = "2", lwd = 2)

plot(fitted(quad_fit), Auto$mpg,
     xlab = "Fitted Values",
     ylab = "Observed mpg",
     main = "Observed vs Fitted (Quadratic)",
     pch = 19)
abline(0, 1, col = "red", lwd = 2)

Fit a mars model with optimal tuning parameters that you choose and conduct 10-fold CV to estimate the error and draw the scatter plot with the fitted line and the scatter plot between the observed and fitted values.

library(earth)

## Warning: package 'earth' was built under R version 4.5.2

## Loading required package: Formula

## Loading required package: plotmo

## Warning: package 'plotmo' was built under R version 4.5.2

## Loading required package: plotrix

## Warning: package 'plotrix' was built under R version 4.5.2

mars_fit <- earth(mpg ~ horsepower, data = Auto)
summary(mars_fit)

## Call: earth(formula=mpg~horsepower, data=Auto)
## 
##                   coefficients
## (Intercept)         20.3320268
## h(103-horsepower)    0.3147486
## h(horsepower-120)   -0.1780548
## h(horsepower-160)    0.1761460
## 
## Selected 4 of 5 terms, and 1 of 1 predictors
## Termination condition: RSq changed by less than 0.001 at 5 terms
## Importance: horsepower
## Number of terms at each degree of interaction: 1 3 (additive model)
## GCV 19.05576    RSS 7205.459    GRSq 0.6879887    RSq 0.697491

set.seed(1)
mars_cv <- earth(mpg ~ horsepower, data = Auto,
                 nfold = 10, ncross = 1)

mars_cv$cv.oof.fit

## NULL

plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower",
     ylab = "mpg",
     main = "MARS Model Fit",
     pch = 19)

lines(Auto$horsepower, predict(mars_fit),
      col = "red", lwd = 2)

plot(predict(mars_fit), Auto$mpg,
     xlab = "Fitted Values",
     ylab = "Observed mpg",
     main = "Observed vs Fitted (MARS)",
     pch = 19)
abline(0, 1, col = "red", lwd = 2)

Compare the three fitted models that obtained in g), h) and i) and suggest which model should be preferred according to your criteria, such as R2 or root mean square error (RMSE), or others.

# R-squared
r2_linear <- summary(lm_fit)$r.squared
r2_quad   <- summary(quad_fit)$r.squared
r2_mars   <- mars_fit$rsq




# RMSE
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse_linear <- rmse(Auto$mpg, fitted(lm_fit))
rmse_quad   <- rmse(Auto$mpg, fitted(quad_fit))
rmse_mars   <- rmse(Auto$mpg, predict(mars_fit))

# Combine results
model_comparison <- data.frame(
  Model = c("Linear", "Quadratic", "MARS"),
  R_squared = c(r2_linear, r2_quad, r2_mars),
  RMSE = c(rmse_linear, rmse_quad, rmse_mars)
)

model_comparison

##       Model R_squared      RMSE
## 1    Linear 0.6059483 11.339532
## 2 Quadratic 0.6875590  4.357151
## 3      MARS 0.6974910  4.287339

The MARS model performed best, with the highest R² (0.6975) and lowest RMSE (4.2873), indicating the most accurate predictions. The quadratic model performed better than the linear model, but MARS is preferred overall.