Auto <- read.table("~/Downloads/Predictive modeling/Auto-1.txt", header=TRUE)
B) Scatterplot of Horsepower and MPG
plot(Auto$horsepower, Auto$mpg,
main="Horsepower vs MPG",
xlab= "Horsepower",
ylab="MPG",
pch=19, col="blue")
The scatterplot shows that as horsepower increases, MPG decreases.
Horsepower and MPG have a negative non-linear relationship.
C) Least Squares Regression Equation: mpg = 39.9359 - 0.157845 * horsepower
model_linear<- lm(mpg~ horsepower, data=Auto)
summary(model_linear)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
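As a quick worked example of the fitted equation, the sketch below plugs a horsepower value into the coefficients reported in the summary output. The helper name `predict_mpg` and the input value of 100 horsepower are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(model_linear) output above
b0 <- 39.935861   # intercept
b1 <- -0.157845   # slope for horsepower

# Hypothetical helper: predicted mpg for a given horsepower
predict_mpg <- function(hp) b0 + b1 * hp

# Predicted mpg for a car with 100 horsepower (illustrative input)
predict_mpg(100)
# 39.935861 - 0.157845 * 100 = 24.151361
```

With the fitted model in memory, `predict(model_linear, newdata = data.frame(horsepower = 100))` should give the same value up to rounding of the printed coefficients.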
r_sq<-summary(model_linear)$r.squared
cat("The proportion of variation explained is:", r_sq)
## The proportion of variation explained is: 0.6059483
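The proportion of variation explained can also be computed by hand as 1 - RSS/TSS. The sketch below checks that identity on R's built-in `mtcars` data (using `hp` as a stand-in for horsepower), which is an assumption made only so the example runs without the Auto file.

```r
# Stand-in data: built-in mtcars, NOT the Auto data set used in this report
fit <- lm(mpg ~ hp, data = mtcars)

rss <- sum(residuals(fit)^2)                     # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares
r_sq_manual <- 1 - rss / tss

# The manual value should agree with the one reported by summary()
all.equal(r_sq_manual, summary(fit)$r.squared)
```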
resids<- residuals(model_linear)
boxplot(resids, main="Boxplot of Residuals",
ylab="Residuals")
This boxplot shows that the residuals contain outliers and that their
distribution is right-skewed.
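The visual reading of the boxplot can be backed up numerically. This is a minimal sketch, again on the built-in `mtcars` data as a stand-in (an assumption, not the Auto data): `boxplot.stats()` lists the points drawn beyond the whiskers, and a simple moment-based skewness estimate indicates the direction of skew.

```r
# Stand-in data: built-in mtcars, NOT the Auto data set used in this report
fit <- lm(mpg ~ hp, data = mtcars)
res <- residuals(fit)

# Points a boxplot would draw beyond the whiskers (1.5 * IQR rule)
outliers <- boxplot.stats(res)$out

# Simple moment-based skewness estimate: positive suggests right skew
skew <- mean((res - mean(res))^3) / sd(res)^3

outliers
skew
```

On the Auto residuals, the same two lines could be applied to `resids` directly.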
library(caret)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Loading required package: lattice
set.seed(123)
train_control<-trainControl(method="cv", number=10)
cv_linear<-train(mpg~horsepower, data= Auto, method="lm", trControl= train_control)
print(cv_linear)
## Linear Regression
##
## 392 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 353, 352, 353, 353, 353, 354, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.857287 0.6141382 3.832307
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The model's predictions for MPG are typically off by about 4.86 mpg (the cross-validated RMSE), and the model captures about 61.4% of the variance in MPG.
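To make the cross-validated RMSE less of a black box, here is a hand-rolled 10-fold cross-validation in base R. It is a sketch of the general procedure, not caret's exact internals, and it uses the built-in `mtcars` data as a stand-in (an assumption), so the numbers will not match the output above.

```r
# Stand-in data: built-in mtcars, NOT the Auto data set used in this report
set.seed(123)
k <- 10
n <- nrow(mtcars)

# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = n))

rmse <- numeric(k)
for (i in 1:k) {
  train_set <- mtcars[folds != i, ]   # fit on the other k - 1 folds
  test_set  <- mtcars[folds == i, ]   # evaluate on the held-out fold
  fit  <- lm(mpg ~ hp, data = train_set)
  pred <- predict(fit, newdata = test_set)
  rmse[i] <- sqrt(mean((test_set$mpg - pred)^2))
}

# Average the per-fold RMSE values
cv_rmse <- mean(rmse)
cv_rmse
```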
plot(Auto$horsepower, Auto$mpg, main="Linear Fit")
abline(model_linear, col="red", lwd=2)
This graph shows non-linearity: the points follow a curve, so the straight
line underfits the data.
plot(predict(cv_linear), Auto$mpg, xlab="Fitted",
ylab= "Observed", main="Obs vs. Fitted (Linear)")
abline(0,1, col="green")
G) Quadratic model: 10-fold CV & scatterplot
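# Add the squared term, then sort by horsepower (by name rather than by
# column position) so that lines() draws a smooth curve
Auto$horsepower2 <- Auto$horsepower^2
Auto <- Auto[order(Auto$horsepower), ]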
cv_quad <- train(mpg~horsepower + horsepower2, data=Auto, method="lm", trControl= train_control)
print(cv_quad)
## Linear Regression
##
## 392 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 353, 353, 353, 352, 353, 352, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.360583 0.6896283 3.269904
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
plot(Auto$horsepower, Auto$mpg, main="Quadratic Fit")
lines(Auto$horsepower, predict(cv_quad), col="red", lwd=2)
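An alternative way to draw the quadratic curve, sketched below, is to write the squared term inside the formula with `I()` and predict on a sorted grid of horsepower values, so the data frame does not need an extra column or a particular row order. The example uses the built-in `mtcars` data as a stand-in (an assumption), and the `grid` object is an illustrative name.

```r
# Stand-in data: built-in mtcars, NOT the Auto data set used in this report
fit_quad <- lm(mpg ~ hp + I(hp^2), data = mtcars)

# Evenly spaced, already sorted horsepower values for a smooth curve
grid <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
grid$mpg_hat <- predict(fit_quad, newdata = grid)

plot(mtcars$hp, mtcars$mpg, main = "Quadratic Fit (sketch)",
     xlab = "Horsepower", ylab = "MPG")
lines(grid$hp, grid$mpg_hat, col = "red", lwd = 2)
```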
H) MARS & Scatterplots
library(earth)
## Warning: package 'earth' was built under R version 4.5.2
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.5.2
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.5.2
cv_mars<-train(mpg~horsepower, data=Auto, method="earth", trControl= train_control)
print(cv_mars)
## Multivariate Adaptive Regression Spline
##
## 392 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 352, 353, 353, 353, 354, 352, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 4.804816 0.6315289 3.771082
## 3 4.365337 0.7000506 3.327959
## 5 4.349369 0.6997091 3.273270
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.
plot(Auto$horsepower, Auto$mpg, main="MARS Fit")
lines(Auto$horsepower, predict(cv_mars), col="red", lwd=2)
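With degree = 1, a MARS fit is a piecewise-linear function built from hinge terms of the form max(0, x - knot). The sketch below illustrates that building block with a single hand-picked knot in base R. The knot value, the `hinge` helper, and the use of the built-in `mtcars` data are all assumptions for illustration; `earth` chooses its own knots and terms automatically.

```r
# Hypothetical helper: one hinge basis function, max(0, x - knot)
hinge <- function(x, knot) pmax(0, x - knot)

# Stand-in data and a hand-picked knot (both assumptions, not earth's choices)
knot <- 120
dat <- transform(mtcars, hp_hinge = hinge(hp, knot))

# Ordinary least squares on the original term plus the hinge term
fit_hinge <- lm(mpg ~ hp + hp_hinge, data = dat)

# The slope is coef["hp"] below the knot and
# coef["hp"] + coef["hp_hinge"] above it
coef(fit_hinge)
```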
Model      RMSE      R-squared
Linear     4.857287  0.6141382
Quadratic  4.360583  0.6896283
MARS       4.349369  0.6997091
The best model, with the lowest cross-validated RMSE and the highest R-squared, is MARS, although its margin over the quadratic model is small. MARS is preferred because the relationship between horsepower and MPG is non-linear, which the straight-line model cannot capture.
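The comparison can also be made programmatically. The sketch below collects the cross-validated metrics printed by `print(cv_linear)`, `print(cv_quad)`, and `print(cv_mars)` (the nprune = 5 row) into a data frame and picks the model with the lowest RMSE. The `results` data frame is an illustrative construction, not an object from the original analysis.

```r
# Metrics copied from the printed cross-validation output above
results <- data.frame(
  model = c("Linear", "Quadratic", "MARS"),
  rmse  = c(4.857287, 4.360583, 4.349369),
  rsq   = c(0.6141382, 0.6896283, 0.6997091)
)

# Model with the lowest RMSE and model with the highest R-squared
best_by_rmse <- results$model[which.min(results$rmse)]
best_by_rsq  <- results$model[which.max(results$rsq)]

best_by_rmse
best_by_rsq
```

Because the MARS and quadratic RMSE values are close, a different seed or fold assignment could plausibly change their ordering.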