A) Read the data
Auto <- read.table("~/Downloads/Predictive modeling/Auto-1.txt", header=TRUE)

B) Scatterplot of Horsepower and MPG

plot(Auto$horsepower, Auto$mpg, 
     main="Horsepower vs MPG",
     xlab= "Horsepower",
     ylab="MPG",
     pch=19, col="blue")

The scatterplot shows that MPG decreases as horsepower increases, so the two variables have a negative, non-linear relationship.
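One rough way to put numbers on "negative and non-linear" is to compare the Pearson and Spearman correlations. The sketch below uses R's built-in mtcars data (columns hp and mpg) as a stand-in, since the Auto file is not bundled with R; the stand-in data is an assumption, not part of the original analysis.

```r
# Stand-in data: built-in mtcars (hp = horsepower, mpg = miles per gallon)
hp  <- mtcars$hp
mpg <- mtcars$mpg

# Pearson measures linear association; Spearman measures monotone association
pearson  <- cor(hp, mpg, method = "pearson")
spearman <- cor(hp, mpg, method = "spearman")

cat("Pearson: ", round(pearson, 3), "\n")
cat("Spearman:", round(spearman, 3), "\n")
```

Both values being negative matches the downward trend. A Spearman value noticeably larger in magnitude than the Pearson value would hint that the trend is monotone but curved.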

C) Least Squares Regression Equation: mpg = 39.9359 - 0.157845 × horsepower

model_linear<- lm(mpg~ horsepower, data=Auto)
summary(model_linear)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
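As a worked example of reading the fitted equation, the sketch below plugs a horsepower value into it by hand. The coefficients are copied from the summary output above; the input of 100 horsepower is a hypothetical value chosen for illustration.

```r
# Coefficients copied from the summary(model_linear) output above
b0 <- 39.935861   # intercept
b1 <- -0.157845   # slope: change in mpg per one unit of horsepower

# Hypothetical input: a car with 100 horsepower
hp_new  <- 100
mpg_hat <- b0 + b1 * hp_new
print(mpg_hat)    # 24.15136
```

With the fitted model in memory, predict(model_linear, newdata = data.frame(horsepower = 100)) should give the same value up to rounding of the copied coefficients.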
D) Find the proportion of the variation explained by the least squares regression line
r_sq<-summary(model_linear)$r.squared
cat("The proportion of variation explained is:", r_sq)
## The proportion of variation explained is: 0.6059483
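The proportion of variation explained is R-squared, which can be computed by hand as 1 - RSS/TSS. The sketch below checks this on the built-in mtcars data, used here as a stand-in for the Auto file (an assumption made so the example is self-contained).

```r
# Stand-in data: built-in mtcars (hp = horsepower)
fit <- lm(mpg ~ hp, data = mtcars)

# Residual sum of squares: variation left unexplained by the line
rss <- sum(residuals(fit)^2)
# Total sum of squares: variation of mpg around its mean
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

r_sq_manual <- 1 - rss / tss
# With a single predictor, R-squared equals the squared correlation
r_sq_cor <- cor(mtcars$hp, mtcars$mpg)^2

cat("Manual R-squared:   ", r_sq_manual, "\n")
cat("Squared correlation:", r_sq_cor, "\n")
```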
E) Boxplot of residuals
resids<- residuals(model_linear)
boxplot(resids, main="Boxplot of Residuals",
        ylab="Residuals")

The boxplot shows that the residuals contain outliers and are right-skewed.
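The outliers and skew read off the boxplot can also be checked numerically. The sketch below uses boxplot.stats(), which applies the same 1.5 × IQR rule that boxplot() uses by default, on the built-in mtcars data as a stand-in for the Auto file (an assumption).

```r
# Stand-in data: built-in mtcars (hp = horsepower)
fit <- lm(mpg ~ hp, data = mtcars)
res <- residuals(fit)

# Points beyond 1.5 * IQR from the box are flagged as outliers
stats <- boxplot.stats(res)
cat("Number of flagged outliers:", length(stats$out), "\n")

# Least squares residuals average to zero, so a negative median
# (mean above median) is a simple sign of right skew
cat("Mean minus median:", mean(res) - median(res), "\n")
```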

F) Simple linear model, 10-fold CV & scatterplots
library(caret)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Loading required package: lattice
set.seed(123)
train_control<-trainControl(method="cv", number=10)
cv_linear<-train(mpg~horsepower, data= Auto, method="lm", trControl= train_control)
print(cv_linear)
## Linear Regression 
## 
## 392 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 353, 352, 353, 353, 353, 354, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.857287  0.6141382  3.832307
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

The model's MPG predictions are off by about 4.86 MPG on average (the cross-validated RMSE), and the model captures about 61.4% of the variance in MPG.
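To make the resampling procedure concrete, here is a minimal base R sketch of 10-fold cross-validation: fit on nine folds, predict the held-out fold, and average the per-fold RMSE, which is roughly what the RMSE column above summarizes. The data is simulated and the variable names are hypothetical; the exact fold assignment will not match caret's.

```r
set.seed(1)

# Simulated stand-in data (assumption): a noisy downward linear trend
n   <- 200
hp  <- runif(n, 50, 230)
mpg <- 40 - 0.16 * hp + rnorm(n, sd = 5)
dat <- data.frame(hp = hp, mpg = mpg)

# Randomly assign each row to one of k folds of near-equal size
k       <- 10
fold_id <- sample(rep(1:k, length.out = n))

# For each fold: train on the other k - 1 folds, score on the held-out fold
fold_rmse <- sapply(1:k, function(i) {
  train <- dat[fold_id != i, ]
  test  <- dat[fold_id == i, ]
  fit   <- lm(mpg ~ hp, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))
})

cv_rmse <- mean(fold_rmse)
cat("10-fold CV RMSE:", cv_rmse, "\n")
```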

plot(Auto$horsepower, Auto$mpg, main="Linear Fit")
abline(model_linear, col="red", lwd=2)

This plot shows non-linearity: the points follow a curve, so the straight line underfits the data.

plot(predict(cv_linear), Auto$mpg, xlab="Fitted", 
     ylab= "Observed", main="Obs vs. Fitted (Linear)")
abline(0,1, col="green")
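A residuals-versus-fitted plot is another common way to check for the curvature noted in the linear fit. The sketch below draws one with a lowess smoother on the built-in mtcars data, used as a stand-in for the Auto file (an assumption); a smoother that bends away from the zero line suggests the straight line is missing structure.

```r
# Stand-in data: built-in mtcars (hp = horsepower)
fit <- lm(mpg ~ hp, data = mtcars)

plot(fitted(fit), residuals(fit),
     xlab = "Fitted", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)   # reference line: residuals should scatter around 0

# Lowess smoother: a U-shape or arch indicates unmodeled curvature
smooth <- lowess(fitted(fit), residuals(fit))
lines(smooth, col = "red", lwd = 2)
```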

G) Quadratic model, 10-fold CV & scatterplot

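# Add the squared term as its own column so it can enter the model formula
Auto$horsepower2 <- Auto$horsepower^2
# Reorder the rows by the third column, ascending
Auto <- Auto[order(Auto[, 3]), ]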
cv_quad <- train(mpg~horsepower + horsepower2, data=Auto, method="lm", trControl= train_control)
print(cv_quad)
## Linear Regression 
## 
## 392 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 353, 353, 353, 352, 353, 352, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.360583  0.6896283  3.269904
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
plot(Auto$horsepower, Auto$mpg, main="Quadratic Fit")
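# Order by horsepower at plot time so the fitted curve is drawn left to right
ord <- order(Auto$horsepower)
lines(Auto$horsepower[ord], predict(cv_quad)[ord], col="red", lwd=2)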
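The quadratic term can also be written inside the formula instead of being stored as an extra column. The sketch below compares three equivalent specifications on the built-in mtcars data, used as a stand-in for the Auto file (an assumption), and draws the curve over an evenly spaced grid so the row order of the data does not matter.

```r
# Three ways to specify the same quadratic fit (stand-in data: mtcars)
fit_col  <- lm(mpg ~ hp + hp2, data = transform(mtcars, hp2 = hp^2))
fit_I    <- lm(mpg ~ hp + I(hp^2), data = mtcars)    # I() protects the ^ operator
fit_poly <- lm(mpg ~ poly(hp, 2), data = mtcars)     # orthogonal polynomial basis

# Predict on a sorted grid so lines() draws a smooth left-to-right curve
grid <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))

plot(mtcars$hp, mtcars$mpg, main = "Quadratic Fit",
     xlab = "Horsepower", ylab = "MPG")
lines(grid$hp, predict(fit_I, newdata = grid), col = "red", lwd = 2)
```

All three give the same fitted values; only the coefficients of the poly() version differ, because its basis is orthogonalized.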

H) MARS & Scatterplots

library(earth)
## Warning: package 'earth' was built under R version 4.5.2
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.5.2
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.5.2
cv_mars<-train(mpg~horsepower, data=Auto, method="earth", trControl= train_control)
print(cv_mars)
## Multivariate Adaptive Regression Spline 
## 
## 392 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 352, 353, 353, 353, 354, 352, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE     
##   2       4.804816  0.6315289  3.771082
##   3       4.365337  0.7000506  3.327959
##   5       4.349369  0.6997091  3.273270
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.
plot(Auto$horsepower, Auto$mpg, main="MARS Fit")
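# Order by horsepower at plot time so the fitted curve is drawn left to right
ord <- order(Auto$horsepower)
lines(Auto$horsepower[ord], predict(cv_mars)[ord], col="red", lwd=2)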
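MARS builds its fit from hinge functions of the form max(0, x - c) and max(0, c - x), and nprune caps how many terms survive pruning. The sketch below imitates a single hinge pair with plain lm() on the built-in mtcars data. The knot at 120 is a hypothetical value picked by eye, whereas earth searches for knots automatically, so this illustrates the idea only and is not a reproduction of the fitted model.

```r
# Hinge function: zero below the knot, linear above it
hinge <- function(x) pmax(0, x)

knot <- 120   # hypothetical knot location (assumption, not estimated)

# A piecewise-linear fit with one slope on each side of the knot
fit_hinge <- lm(mpg ~ hinge(hp - knot) + hinge(knot - hp), data = mtcars)
print(coef(fit_hinge))

# Predict on a sorted grid and draw the resulting broken line
grid <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
pred <- predict(fit_hinge, newdata = grid)

plot(mtcars$hp, mtcars$mpg, main = "Single-knot hinge fit",
     xlab = "Horsepower", ylab = "MPG")
lines(grid$hp, pred, col = "red", lwd = 2)
```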

I) Model Comparison

Model       RMSE       R-Squared
Linear      4.857287   0.6141382
Quadratic   4.360583   0.6896283
MARS        4.349369   0.6997091

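The best model is MARS, which has the lowest cross-validated RMSE and the highest R-squared, although its margin over the quadratic model is small. A non-linear model is preferred because the relationship between horsepower and MPG is non-linear, which the straight-line fit cannot capture.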
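The comparison can be made reproducible by collecting the metrics in a data frame and ranking them. The sketch below uses the cross-validated values printed in the three outputs above (for MARS, the selected nprune = 5 row); the percentage column is an added illustration of how much each model lowers RMSE relative to the linear fit.

```r
# Cross-validated metrics copied from the printed caret outputs above
results <- data.frame(
  model    = c("Linear", "Quadratic", "MARS"),
  RMSE     = c(4.857287, 4.360583, 4.349369),
  Rsquared = c(0.6141382, 0.6896283, 0.6997091)
)

# Percent reduction in RMSE relative to the linear model (first row)
results$RMSE_drop_pct <- 100 * (results$RMSE[1] - results$RMSE) / results$RMSE[1]

# Rank models from lowest to highest RMSE
print(results[order(results$RMSE), ])

best <- as.character(results$model[which.min(results$RMSE)])
cat("Lowest RMSE:", best, "\n")
```

Because each model was cross-validated on a different random split, small gaps such as the one between the quadratic and MARS fits should be read with some caution.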