Auto <- read.table("/Users/macbookair/Desktop/STA 2026spring/3st/Auto-1.txt", header = TRUE)
head(Auto)
## mpg horsepower
## 1 18 130
## 2 15 165
## 3 18 150
## 4 16 150
## 5 17 140
## 6 15 198
dim(Auto)
## [1] 392 2
str(Auto)
## 'data.frame': 392 obs. of 2 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ horsepower: int 130 165 150 150 140 198 220 215 225 190 ...
summary(Auto)
## mpg horsepower
## Min. : 9.00 Min. : 46.0
## 1st Qu.:17.00 1st Qu.: 75.0
## Median :22.75 Median : 93.5
## Mean :23.45 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:126.0
## Max. :46.60 Max. :230.0
plot(Auto$horsepower, Auto$mpg,
xlab = "Horsepower",
ylab = "MPG",
main = "MPG vs Horsepower",
pch = 19)
The plot shows a clear negative association: cars with higher horsepower
tend to have lower mpg. The pattern is fairly strong but not perfectly
linear, and the point cloud bends, which suggests some curvature.
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
The least squares regression equation: mpg = 39.936 − 0.1578 × horsepower
summary(lm_fit)$r.squared
## [1] 0.6059483
The R² is about 0.61, so horsepower alone explains about 61% of the variation in mpg. That is a fairly strong relationship for a single predictor.
boxplot(residuals(lm_fit),
main = "Boxplot of Residuals",
ylab = "Residuals")
The residuals are mostly around zero, but I can see a few points that are far away from the rest. These could be possible outliers.
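The points drawn beyond the boxplot whiskers can also be listed explicitly. The following is a minimal sketch on simulated stand-in data (the data frame `toy` and its coefficients are hypothetical, not the Auto data); it relies on `boxplot.stats()`, which applies the same 1.5 × IQR whisker rule that `boxplot()` uses by default.

```r
# Simulated stand-in data (hypothetical; not the Auto data set)
set.seed(42)
toy <- data.frame(x = runif(200, 50, 230))
toy$y <- 40 - 0.16 * toy$x + rnorm(200, sd = 5)

fit <- lm(y ~ x, data = toy)
res <- residuals(fit)

# Residuals lying beyond the whiskers (1.5 * IQR rule, the boxplot default)
flagged <- boxplot.stats(res)$out

# Rows of the data that produced those residuals (may be zero rows)
toy[which(res %in% flagged), ]
```

A point flagged this way is only a candidate outlier; it is worth checking whether it disappears once the curvature in the data is modelled.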
library(boot)
cv_lin <- cv.glm(
data = Auto,
glmfit = glm(mpg ~ horsepower, data = Auto),
K = 10
)
cv_lin$delta
## [1] 24.27809 24.26058
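The 10-fold CV error (mean squared error) is about 24.28, or 24.26 after the bias adjustment. No seed was set before cv.glm, so this value changes slightly from run to run.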
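Because `cv.glm` assigns observations to folds at random, the estimate is only reproducible if the seed is fixed first. The sketch below illustrates this on simulated stand-in data (`toy` is hypothetical, not the Auto data) and shows what the two entries of `delta` are.

```r
library(boot)  # recommended package, included in standard R installations

# Simulated stand-in data (hypothetical; not the Auto data set)
set.seed(42)
toy <- data.frame(x = runif(200, 50, 230))
toy$y <- 40 - 0.16 * toy$x + rnorm(200, sd = 5)

fit <- glm(y ~ x, data = toy)

# Setting the seed right before cv.glm() fixes the random fold assignment
set.seed(1)
cv_a <- cv.glm(data = toy, glmfit = fit, K = 10)
set.seed(1)
cv_b <- cv.glm(data = toy, glmfit = fit, K = 10)

cv_a$delta                          # [1] raw CV MSE, [2] bias-adjusted CV MSE
identical(cv_a$delta, cv_b$delta)   # same seed, same folds, same estimate
sqrt(cv_a$delta[1])                 # RMSE, in the same units as the response
```

Without the `set.seed()` calls, two runs would generally give slightly different `delta` values.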
plot(Auto$horsepower, Auto$mpg,
pch = 19,
xlab = "Horsepower",
ylab = "MPG",
main = "Linear Fit")
abline(lm_fit, col = "red", lwd = 2)
The red fitted line shows the negative trend clearly, but many points
are still spread around it.
plot(fitted(lm_fit), Auto$mpg,
pch = 19,
xlab = "Fitted MPG",
ylab = "Observed MPG",
main = "Observed vs Fitted")
abline(0, 1, col = "blue", lwd = 2)
The points roughly follow the line, but there is still quite a bit of
spread, which means the model has some prediction error.
Auto$horsepower2 <- Auto$horsepower^2
Auto <- Auto[order(Auto$horsepower), ]
quad_fit <- lm(mpg ~ horsepower + horsepower2, data = Auto)
summary(quad_fit)
##
## Call:
## lm(formula = mpg ~ horsepower + horsepower2, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## horsepower -0.4661896 0.0311246 -14.98 <2e-16 ***
## horsepower2 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
The quadratic regression equation: mpg = 56.90 − 0.4662 × horsepower + 0.00123 × horsepower²
The squared term is highly significant (t = 10.08, p < 2e-16), which shows the relationship is curved, not just a straight line. The fit also improves: R² rises from 0.606 to 0.688 and the residual standard error drops from 4.906 to 4.374.
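The same comparison can be made with a nested-model F test. This is a minimal sketch on simulated stand-in data (`toy` and its coefficients are hypothetical, not the Auto data); it also shows that writing `I(x^2)` in the formula gives the same fit as adding a squared column by hand.

```r
# Simulated stand-in data with built-in curvature (hypothetical; not Auto)
set.seed(42)
toy <- data.frame(x = runif(200, 50, 230))
toy$y <- 57 - 0.47 * toy$x + 0.0012 * toy$x^2 + rnorm(200, sd = 4)

lin  <- lm(y ~ x, data = toy)
quad <- lm(y ~ x + I(x^2), data = toy)   # I() squares x inside the formula

# Equivalent to creating the squared column manually
toy$x2 <- toy$x^2
quad_manual <- lm(y ~ x + x2, data = toy)
all.equal(unname(coef(quad)), unname(coef(quad_manual)))

# Nested F test: does the squared term reduce the residual sum of squares
# by more than chance alone would explain?
cmp <- anova(lin, quad)
cmp

# With a single added term, the F statistic equals the squared t statistic
t_sq <- summary(quad)$coefficients["I(x^2)", "t value"]^2
all.equal(t_sq, cmp$F[2])
```

Using `I(x^2)` avoids keeping an extra column in the data frame in sync with the predictor.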
library(boot)
cv_quad <- cv.glm(
data = Auto,
glmfit = glm(mpg ~ horsepower + horsepower2, data = Auto),
K = 10
)
cv_quad$delta
## [1] 19.30202 19.28525
The CV error is about 19.30, which is smaller than the linear model's 24.28, so the quadratic model predicts better out of sample.
plot(Auto$horsepower, Auto$mpg,
pch = 19,
xlab = "Horsepower",
ylab = "MPG",
main = "Quadratic Fit")
lines(Auto$horsepower,
fitted(quad_fit),
col = "green",
lwd = 2)
The curved fitted line matches the data pattern better than the linear
line. It follows the bend in the data, so it looks like a better
fit.
plot(fitted(quad_fit), Auto$mpg,
pch = 19,
xlab = "Fitted MPG",
ylab = "Observed MPG",
main = "Observed vs Fitted (Quadratic)")
abline(0, 1, col = "blue", lwd = 2)
Compared with the linear model, the points are closer to the line here,
so the quadratic model seems to predict better.
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
mars_fit <- earth(mpg ~ horsepower, data = Auto)
summary(mars_fit)
## Call: earth(formula=mpg~horsepower, data=Auto)
##
## coefficients
## (Intercept) 20.3320268
## h(103-horsepower) 0.3147486
## h(horsepower-120) -0.1780548
## h(horsepower-160) 0.1761460
##
## Selected 4 of 5 terms, and 1 of 1 predictors
## Termination condition: RSq changed by less than 0.001 at 5 terms
## Importance: horsepower
## Number of terms at each degree of interaction: 1 3 (additive model)
## GCV 19.05576 RSS 7205.459 GRSq 0.6879887 RSq 0.697491
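The terms in the summary above are hinge functions, h(x) = max(0, x), so the fitted MARS equation is mpg = 20.332 + 0.3147 × h(103 − horsepower) − 0.1781 × h(horsepower − 120) + 0.1761 × h(horsepower − 160). As a sketch, it can be evaluated by hand with the coefficients copied from the printed output (the helper names `h` and `mars_by_hand` are illustrative and not part of the earth package):

```r
# Hinge function: zero below the knot, linear above it
h <- function(x) pmax(0, x)

# Fitted MARS equation, coefficients copied from summary(mars_fit) above
mars_by_hand <- function(hp) {
  20.3320268 +
    0.3147486 * h(103 - hp) -
    0.1780548 * h(hp - 120) +
    0.1761460 * h(hp - 160)
}

pred <- mars_by_hand(c(75, 110, 150, 200))
round(pred, 2)
# 29.14 20.33 14.99 13.13
```

Between 103 and 120 horsepower no hinge is active, so the prediction is flat at the intercept; above 160 the last two slopes nearly cancel, so the curve levels off again.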
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)
mars_cv_model <- train(
mpg ~ horsepower,
data = Auto,
method = "earth",
trControl = ctrl
)
mars_cv_model
## Multivariate Adaptive Regression Spline
##
## 392 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 355, 352, 352, 353, 352, 353, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 4.801547 0.6299820 3.794938
## 3 4.347131 0.6924163 3.314007
## 5 4.309786 0.6972659 3.240772
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.
The MARS model selected nprune = 5 as the best setting. At that setting, the cross-validated RMSE is about 4.31 and the R-squared is about 0.70.
plot(Auto$horsepower, Auto$mpg,
pch = 19,
xlab = "Horsepower",
ylab = "MPG",
main = "MARS Fit")
lines(Auto$horsepower,
predict(mars_fit),
col = "purple",
lwd = 2)
The MARS curve adjusts to the shape of the data and follows the trend
better than a straight line. It looks more flexible and closer to the
points.
plot(predict(mars_fit), Auto$mpg,
pch = 19,
xlab = "Fitted MPG",
ylab = "Observed MPG",
main = "Observed vs Fitted (MARS)")
abline(0, 1, col = "blue", lwd = 2)
The points are closer to the line here than before, so the MARS
predictions seem more accurate overall.
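Model       R²       10-fold CV MSE   10-fold CV RMSE
Linear      0.606    24.28            4.93
Quadratic   0.688    19.30            4.39
MARS        0.697    n/a              4.31
CV RMSE for the linear and quadratic models is the square root of the cv.glm MSE. The MARS value is the RMSE reported by caret at nprune = 5. The three runs used different random folds, so small differences should be read with caution.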
Among the three models, the linear model fits worst. The quadratic model is better because it captures the curve in the data. The MARS model has the highest R-squared and the lowest cross-validated error, so I would choose it, although its margin over the quadratic model is small (CV RMSE about 4.31 versus √19.30 ≈ 4.39).
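The cross-validation errors for the linear and quadratic models come from cv.glm as mean squared errors, while caret reports an RMSE for MARS. Putting them on one scale can be sketched as follows, using the values printed in the output above (the variable names are illustrative):

```r
# cv.glm delta[1] values from the output above (mean squared error)
cv_mse <- c(linear = 24.27809, quadratic = 19.30202)

# caret's cross-validated RMSE for MARS at nprune = 5
mars_rmse <- 4.309786

# Square roots put the cv.glm results on the RMSE scale (mpg units)
cv_rmse <- c(sqrt(cv_mse), mars = mars_rmse)
round(cv_rmse, 2)
# linear quadratic      mars
#   4.93      4.39      4.31
```

The comparison is approximate: caret averages the RMSE over folds, which is not exactly the square root of the averaged MSE, and the folds differ between the three runs.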