Auto <- read.table("/Users/macbookair/Desktop/STA 2026spring/3st/Auto-1.txt", header = TRUE)

head(Auto)
##   mpg horsepower
## 1  18        130
## 2  15        165
## 3  18        150
## 4  16        150
## 5  17        140
## 6  15        198
dim(Auto)
## [1] 392   2
str(Auto)
## 'data.frame':    392 obs. of  2 variables:
##  $ mpg       : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ horsepower: int  130 165 150 150 140 198 220 215 225 190 ...
summary(Auto)
##       mpg          horsepower   
##  Min.   : 9.00   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.: 75.0  
##  Median :22.75   Median : 93.5  
##  Mean   :23.45   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :230.0
plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower",
     ylab = "MPG",
     main = "MPG vs Horsepower",
     pch = 19)

The plot shows that cars with higher horsepower tend to have lower mpg: as horsepower increases, fuel efficiency decreases. The relationship looks fairly strong but not perfectly linear, with some apparent curvature.
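To quantify the strength of that association, we can compute the correlation; in simple linear regression its square equals the R² reported later, so it should come out near −0.78:

cor(Auto$mpg, Auto$horsepower)  # expect about -0.78, the negative square root of R^2 = 0.606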

lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

The least squares regression equation: mpg = 39.936 − 0.1578 × horsepower

summary(lm_fit)$r.squared
## [1] 0.6059483

The R² is about 0.61, so horsepower alone explains roughly 61% of the variation in mpg, a fairly strong relationship for a single predictor.

boxplot(residuals(lm_fit),
        main = "Boxplot of Residuals",
        ylab = "Residuals")

The residuals are mostly centered near zero, but a few points lie far from the rest; these may be outliers.
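To see which observations those are, we can list the largest residuals (a minimal sketch; the cutoff of 10 is arbitrary and the name big_res is mine):

# Rows whose absolute residual exceeds an eyeballed cutoff of 10 mpg
big_res <- which(abs(residuals(lm_fit)) > 10)
Auto[big_res, c("mpg", "horsepower")]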

library(boot)

cv_lin <- cv.glm(
  data = Auto,
  glmfit = glm(mpg ~ horsepower, data = Auto),
  K = 10
)

cv_lin$delta
## [1] 24.27809 24.26058

The 10-fold CV estimate of the test MSE is about 24.28; the second component of delta, 24.26, is a bias-corrected version of the same estimate.
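One caveat: cv.glm assigns the folds at random, so delta changes slightly between runs. A sketch of a reproducible version (the seed value and the name cv_lin_seeded are arbitrary choices of mine):

set.seed(1)  # any fixed seed makes the fold assignment reproducible
cv_lin_seeded <- cv.glm(Auto, glm(mpg ~ horsepower, data = Auto), K = 10)
cv_lin_seeded$delta  # [1] raw 10-fold CV MSE, [2] bias-corrected estimate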

plot(Auto$horsepower, Auto$mpg,
     pch = 19,
     xlab = "Horsepower",
     ylab = "MPG",
     main = "Linear Fit")

abline(lm_fit, col = "red", lwd = 2)

The red fitted line shows the negative trend clearly, but many points are still spread around it.

plot(fitted(lm_fit), Auto$mpg,
     pch = 19,
     xlab = "Fitted MPG",
     ylab = "Observed MPG",
     main = "Observed vs Fitted")

abline(0, 1, col = "blue", lwd = 2)

The points roughly follow the 45-degree line, but there is still quite a bit of scatter, reflecting the model's remaining prediction error.
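One way to quantify that scatter is the residual standard error, which summary() reported as 4.906; a typical prediction misses by roughly 5 mpg:

sigma(lm_fit)  # residual standard error of the linear fit, about 4.906 mpg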

# Create an explicit squared-horsepower column for the quadratic model
Auto$horsepower2 <- Auto$horsepower^2

# Sort by horsepower so lines() can draw the fitted curve left to right
Auto <- Auto[order(Auto$horsepower), ]


quad_fit <- lm(mpg ~ horsepower + horsepower2, data = Auto)
summary(quad_fit)
## 
## Call:
## lm(formula = mpg ~ horsepower + horsepower2, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower  -0.4661896  0.0311246  -14.98   <2e-16 ***
## horsepower2  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

The quadratic regression equation: mpg = 56.90 − 0.4662 × horsepower + 0.00123 × horsepower²

The squared term is highly significant (p < 2e-16), which indicates the relationship is curved rather than a straight line, so the quadratic model should fit better than the linear one.
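As an aside, the same model can be fit without creating a helper column, using I() or an orthogonal poly() basis; a sketch (quad_fit2 and quad_fit3 are names I introduce here), both of which produce the same fitted values as quad_fit:

quad_fit2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)  # same coefficients as quad_fit
quad_fit3 <- lm(mpg ~ poly(horsepower, 2), data = Auto)           # different basis, identical fit
all.equal(fitted(quad_fit), fitted(quad_fit2))                    # should be TRUE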

library(boot)

cv_quad <- cv.glm(
  data = Auto,
  glmfit = glm(mpg ~ horsepower + horsepower2, data = Auto),
  K = 10
)

cv_quad$delta
## [1] 19.30202 19.28525

The 10-fold CV error is about 19.30, noticeably smaller than the linear model's 24.28, so the quadratic model predicts better.
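The same comparison extends to higher-degree polynomials. A sketch that loops over degrees 1 through 5 (the degree range and the seed are my choices):

set.seed(1)
cv_by_degree <- rep(0, 5)
for (d in 1:5) {
  fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv_by_degree[d] <- cv.glm(Auto, fit, K = 10)$delta[1]  # raw 10-fold CV MSE
}
cv_by_degree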

plot(Auto$horsepower, Auto$mpg,
     pch = 19,
     xlab = "Horsepower",
     ylab = "MPG",
     main = "Quadratic Fit")

lines(Auto$horsepower,
      fitted(quad_fit),
      col = "green",
      lwd = 2)

The curved fitted line follows the bend in the data and matches the overall pattern better than the straight line did.

plot(fitted(quad_fit), Auto$mpg,
     pch = 19,
     xlab = "Fitted MPG",
     ylab = "Observed MPG",
     main = "Observed vs Fitted (Quadratic)")

abline(0, 1, col = "blue", lwd = 2)

Compared with the linear model, the points are closer to the line here, so the quadratic model seems to predict better.
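The residual standard errors tell the same story: the quadratic fit lowers the typical error from about 4.91 to about 4.37 mpg, matching the two summary() outputs above:

c(linear = sigma(lm_fit), quadratic = sigma(quad_fit))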

library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
mars_fit <- earth(mpg ~ horsepower, data = Auto)
summary(mars_fit)
## Call: earth(formula=mpg~horsepower, data=Auto)
## 
##                   coefficients
## (Intercept)         20.3320268
## h(103-horsepower)    0.3147486
## h(horsepower-120)   -0.1780548
## h(horsepower-160)    0.1761460
## 
## Selected 4 of 5 terms, and 1 of 1 predictors
## Termination condition: RSq changed by less than 0.001 at 5 terms
## Importance: horsepower
## Number of terms at each degree of interaction: 1 3 (additive model)
## GCV 19.05576    RSS 7205.459    GRSq 0.6879887    RSq 0.697491
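The h(·) terms in this summary are hinge functions, h(x) = max(0, x), so the MARS fit is piecewise linear with knots at 103, 120, and 160 horsepower. A sketch reconstructing the prediction by hand from the coefficients printed above (h and mars_by_hand are names I introduce):

h <- function(x) pmax(0, x)  # hinge function used by MARS basis terms

mars_by_hand <- function(hp) {
  20.3320268 +
    0.3147486 * h(103 - hp) -
    0.1780548 * h(hp - 120) +
    0.1761460 * h(hp - 160)
}

mars_by_hand(c(60, 100, 150, 200))  # should match predict(mars_fit) at these horsepowers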
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
set.seed(1)

ctrl <- trainControl(method = "cv", number = 10)

mars_cv_model <- train(
  mpg ~ horsepower,
  data = Auto,
  method = "earth",
  trControl = ctrl
)

mars_cv_model
## Multivariate Adaptive Regression Spline 
## 
## 392 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 355, 352, 352, 353, 352, 353, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE     
##   2       4.801547  0.6299820  3.794938
##   3       4.347131  0.6924163  3.314007
##   5       4.309786  0.6972659  3.240772
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.

The MARS model selected nprune = 5 as the best setting. The cross-validated RMSE is about 4.31 and the R-squared is about 0.70.
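Since cv.glm reported errors on the MSE scale, squaring the selected model's RMSE makes the numbers comparable (roughly, since the fold assignments differ between caret and cv.glm):

min(mars_cv_model$results$RMSE)^2  # about 4.31^2 ≈ 18.6, below both polynomial CV errors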

plot(Auto$horsepower, Auto$mpg,
     pch = 19,
     xlab = "Horsepower",
     ylab = "MPG",
     main = "MARS Fit")

lines(Auto$horsepower,
      predict(mars_fit),
      col = "purple",
      lwd = 2)

The MARS curve adapts to the shape of the data and follows the trend better than a straight line; its piecewise-linear segments stay closer to the points.

plot(predict(mars_fit), Auto$mpg,
     pch = 19,
     xlab = "Fitted MPG",
     ylab = "Observed MPG",
     main = "Observed vs Fitted (MARS)")

abline(0, 1, col = "blue", lwd = 2)

The points sit closer to the 45-degree line than in the linear fit, so the MARS predictions seem more accurate overall.

Linear:    R² = 0.606, 10-fold CV MSE ≈ 24.28
Quadratic: R² = 0.688, 10-fold CV MSE ≈ 19.30
MARS:      R² ≈ 0.70,  10-fold CV RMSE ≈ 4.31 (MSE ≈ 18.6)
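Putting the three cross-validated errors on a common MSE scale (a rough comparison, since the fold assignments differ across the three runs):

c(linear    = cv_lin$delta[1],
  quadratic = cv_quad$delta[1],
  mars      = min(mars_cv_model$results$RMSE)^2)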

Among the three models, the linear model fits the worst. The quadratic model improves on it by capturing the curvature in the data. The MARS model gives the best results, with the highest R² and the lowest cross-validated error, so I would choose the MARS model.