Consider a well-known auto dataset (Auto.txt, available on Canvas), which consists of 392 observations and two variables: mpg (miles per gallon) and horsepower (engine horsepower). As in the previous analyses in Chapter 2, we take mpg as the dependent/outcome variable and horsepower as the predictor variable.
Auto <- read.table("C:/Users/jenei/Downloads/Auto.txt",
                   header = TRUE)
plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower",
     ylab = "Miles Per Gallon (mpg)",
     main = "Scatterplot of mpg vs horsepower",
     pch = 19)
Interpretation: The scatterplot shows a strong negative relationship between horsepower and miles per gallon (mpg): as horsepower increases, mpg generally decreases, so low-horsepower vehicles tend to be more fuel efficient. The pattern also suggests a nonlinear relationship, as mpg declines rapidly at lower horsepower levels and levels off at higher horsepower values.
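To make the curvature easier to see, a nonparametric smoother can be overlaid on the scatterplot. This is a minimal sketch using base R's lowess(); it assumes the scatterplot above is still the active plot.
# Optional: overlay a lowess smooth to highlight the nonlinear trend
lines(lowess(Auto$horsepower, Auto$mpg), col = "blue", lwd = 2, lty = 2)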
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Least squares regression equation: y = β0 + β1x, i.e., mpg = β0 + β1 · horsepower.
Fitted equation: mpg = 39.935861 − 0.157845 · horsepower
Circled results from the output: Intercept = 39.935861, horsepower slope = −0.157845.
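The same estimates can also be pulled directly from the fitted object instead of reading them off the printed summary:
# Extract the intercept and slope directly from the fitted model
coef(lm_fit)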
summary(lm_fit)$r.squared
## [1] 0.6059483
Interpretation: R² ≈ 0.61. About 61% of the variability in mpg is explained by horsepower in the linear regression model.
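As a check on this value, R² can be recomputed from its definition, R² = 1 − RSS/TSS. The sketch below does this from the residuals (rss and tss are illustrative variable names):
# Recompute R-squared from its definition: 1 - RSS/TSS
rss <- sum(resid(lm_fit)^2)                 # residual sum of squares
tss <- sum((Auto$mpg - mean(Auto$mpg))^2)   # total sum of squares
1 - rss / tss                               # should match summary(lm_fit)$r.squared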
residuals_lm <- resid(lm_fit)
boxplot(residuals_lm,
        main = "Boxplot of Residuals",
        ylab = "Residuals")
library(boot)
set.seed(123)
cv_linear <- cv.glm(Auto, lm_fit, K = 10)
cv_linear$delta
## [1] NaN NaN
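The NaN values appear because cv.glm() relies on the response stored in the model object's $y component, which glm() keeps by default but lm() does not, so the cross-validation cost cannot be evaluated. A minimal fix, assuming the default Gaussian family (equivalent to least squares), is to refit the same model with glm() before cross-validating (glm_fit is an illustrative name):
# cv.glm() expects a glm object; a Gaussian glm is the same model as lm()
glm_fit <- glm(mpg ~ horsepower, data = Auto)
set.seed(123)
cv_linear <- cv.glm(Auto, glm_fit, K = 10)
cv_linear$delta  # raw and bias-corrected 10-fold CV estimates of test MSE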
plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower",
     ylab = "mpg",
     main = "Linear Regression Fit",
     pch = 19)
abline(lm_fit, col = "red", lwd = 2)
plot(fitted(lm_fit), Auto$mpg,
     xlab = "Fitted Values",
     ylab = "Observed mpg",
     main = "Observed vs Fitted (Linear)",
     pch = 19)
abline(0, 1, col = "red", lwd = 2)
# Create squared term
Auto$horsepower2 <- Auto$horsepower^2
# Sort data by horsepower so the fitted curves plot left to right
Auto <- Auto[order(Auto$horsepower), ]
# Fit quadratic model
quad_fit <- lm(mpg ~ horsepower + horsepower2, data = Auto)
cv_quad <- cv.glm(Auto, quad_fit, K = 10)
cv_quad$delta
## [1] NaN NaN
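The same issue occurs here: quad_fit was fit with lm(), so cv.glm() again returns NaN. Refitting the quadratic model with glm() (Gaussian by default) gives usable estimates; glm_quad is an illustrative name:
# Refit the quadratic model as a Gaussian glm so cv.glm() can evaluate it
glm_quad <- glm(mpg ~ horsepower + horsepower2, data = Auto)
set.seed(123)
cv_quad <- cv.glm(Auto, glm_quad, K = 10)
cv_quad$delta  # raw and bias-corrected 10-fold CV estimates of test MSE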
plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower",
     ylab = "mpg",
     main = "Quadratic Regression Fit",
     pch = 19)
lines(Auto$horsepower, fitted(quad_fit),
      col = "red", lwd = 2)
plot(fitted(quad_fit), Auto$mpg,
     xlab = "Fitted Values",
     ylab = "Observed mpg",
     main = "Observed vs Fitted (Quadratic)",
     pch = 19)
abline(0, 1, col = "red", lwd = 2)
library(earth)
## Warning: package 'earth' was built under R version 4.5.2
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.5.2
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.5.2
mars_fit <- earth(mpg ~ horsepower, data = Auto)
summary(mars_fit)
## Call: earth(formula=mpg~horsepower, data=Auto)
##
## coefficients
## (Intercept) 20.3320268
## h(103-horsepower) 0.3147486
## h(horsepower-120) -0.1780548
## h(horsepower-160) 0.1761460
##
## Selected 4 of 5 terms, and 1 of 1 predictors
## Termination condition: RSq changed by less than 0.001 at 5 terms
## Importance: horsepower
## Number of terms at each degree of interaction: 1 3 (additive model)
## GCV 19.05576 RSS 7205.459 GRSq 0.6879887 RSq 0.697491
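The h(·) terms are hinge functions, h(x) = max(0, x), so the selected MARS model is piecewise linear with knots near horsepower values of 103, 120, and 160. As an illustration, the fitted value at a given horsepower can be reconstructed by hand from the coefficients above (h and hp are illustrative names):
# Reconstruct the MARS prediction from its hinge functions
h <- function(x) pmax(0, x)
hp <- 150  # example horsepower value
20.3320268 + 0.3147486 * h(103 - hp) -
  0.1780548 * h(hp - 120) + 0.1761460 * h(hp - 160)
# should match predict(mars_fit, data.frame(horsepower = hp))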
set.seed(1)
mars_cv <- earth(mpg ~ horsepower, data = Auto,
                 nfold = 10, ncross = 1)
mars_cv$cv.oof.fit
## NULL
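cv.oof.fit is not a component of an earth object, which is why NULL is returned. When nfold is used, the cross-validation results are stored in cv.* components of the fitted object (for example, a per-fold R² table) and are summarized when the object is printed. A minimal sketch, assuming the component names documented for the earth package:
# Inspect the cross-validation components stored by earth()
names(mars_cv)        # lists the cv.* components actually present
summary(mars_cv)      # should also report the cross-validated R-squared (CVRSq)
mars_cv$cv.rsq.tab    # per-fold out-of-fold R-squared, if present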
plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower",
     ylab = "mpg",
     main = "MARS Model Fit",
     pch = 19)
lines(Auto$horsepower, predict(mars_fit),
      col = "red", lwd = 2)
plot(predict(mars_fit), Auto$mpg,
     xlab = "Fitted Values",
     ylab = "Observed mpg",
     main = "Observed vs Fitted (MARS)",
     pch = 19)
abline(0, 1, col = "red", lwd = 2)
# R-squared
r2_linear <- summary(lm_fit)$r.squared
r2_quad <- summary(quad_fit)$r.squared
r2_mars <- mars_fit$rsq
# RMSE
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}
rmse_linear <- rmse(Auto$mpg, fitted(lm_fit))
rmse_quad <- rmse(Auto$mpg, fitted(quad_fit))
rmse_mars <- rmse(Auto$mpg, predict(mars_fit))
# Combine results
model_comparison <- data.frame(
  Model = c("Linear", "Quadratic", "MARS"),
  R_squared = c(r2_linear, r2_quad, r2_mars),
  RMSE = c(rmse_linear, rmse_quad, rmse_mars)
)
model_comparison
## Model R_squared RMSE
## 1 Linear 0.6059483 11.339532
## 2 Quadratic 0.6875590 4.357151
## 3 MARS 0.6974910 4.287339
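One caveat: the linear model's RMSE (11.34) is much larger than its residual standard error (4.906) because Auto was re-sorted by horsepower after lm_fit was fit, so fitted(lm_fit) is no longer row-aligned with Auto$mpg. Re-predicting on the current data frame lines the rows up again; a minimal sketch:
# Re-predict so the rows line up with the re-sorted Auto$mpg
rmse(Auto$mpg, predict(lm_fit, newdata = Auto))  # roughly 4.9, comparable to the other models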
The MARS model performed best, with the highest R² (0.6975) and the lowest RMSE (4.2873), indicating the closest in-sample fit. The quadratic model improves substantially on the linear model by capturing the curvature in the data, and MARS improves slightly further, so MARS is preferred overall.