Structure, summary statistics, and key attributes
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
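The code that produced the output above is not shown. A minimal sketch that would print the same three views, assuming the data is the built-in mtcars data frame (which the variable names and row labels suggest), is:

```r
# mtcars ships with base R (package 'datasets'), so no extra package is needed.
data(mtcars)

str(mtcars)      # structure: 32 observations of 11 numeric variables
head(mtcars)     # first six rows
summary(mtcars)  # minimum, quartiles, mean, and maximum for each column
```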
Check for NAs
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
##    0    0    0    0    0    0    0    0    0    0    0
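One common way to get a per-column count of missing values like the one above is `colSums(is.na(...))`. This is a sketch, assuming the data frame is mtcars; the original call is not shown.

```r
# Count missing values per column; a zero in every column means no NAs.
na_counts <- colSums(is.na(mtcars))
print(na_counts)
```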
Check the correlation between mpg and each of the other variables
##            cyl       disp         hp      drat         wt     qsec        vs
## [1,] -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684 0.6640389
##             am      gear       carb
## [1,] 0.5998324 0.4802848 -0.5509251
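A one-row correlation matrix of this shape can be obtained by passing mpg and the remaining columns to `cor()`. The sketch below assumes mtcars and the default Pearson method; the ranking step at the end is my addition to make the weakest correlations easy to spot.

```r
# Pearson correlation of mpg with every other column (column 1 is mpg).
mpg_cor <- cor(mtcars$mpg, mtcars[, -1])
print(mpg_cor)

# Rank the predictors by absolute correlation, strongest first.
sort(abs(mpg_cor[1, ]), decreasing = TRUE)
```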
Initial variable selection:
cyl, disp, hp, drat, wt, vs, and am were chosen as initial predictors based on correlation with mpg and domain knowledge.
Variables qsec, gear, and carb were excluded due to:
Weaker individual correlations with mpg: their absolute correlations (0.42, 0.48, and 0.55) are the three smallest in the table above.
Potential multicollinearity:
qsec (1/4 mile time) is likely influenced by hp and wt.
gear (number of gears) is often correlated with am (transmission type) and may also be related to engine size (disp) and hp, making its independent contribution to mpg difficult to isolate.
carb (number of carburetors) is potentially linked to disp (engine size) and hp.
Backward elimination on the training set, after the train/test split, will refine the predictor set. Dropping redundant predictors should also reduce any remaining multicollinearity.
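The multicollinearity argument above is qualitative. One way to check it numerically is the variance inflation factor (VIF), which the original analysis does not report. The sketch below computes it in base R from its definition, assuming the mtcars data; values well above 5 or 10 are often read as a sign of strong collinearity, but that threshold is only a rule of thumb.

```r
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors.
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]

vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

round(sort(vif, decreasing = TRUE), 2)
```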
I split the dataset 70/30 into a training set (22 cars) and a test set (10 cars).
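The splitting code is not shown, so the following is only a sketch of one way to do it in base R. The seed (123), the use of `sample()`, and the name `dataset` are assumptions; a different seed or method will select different cars and give slightly different coefficients than those reported here.

```r
# Keep the response and the seven initially selected predictors.
dataset <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "vs", "am")]

set.seed(123)                                    # assumed seed, for reproducibility
n_train   <- floor(0.7 * nrow(dataset))          # 22 of 32 rows
train_idx <- sample(seq_len(nrow(dataset)), n_train)

training_set <- dataset[train_idx, ]
test_set     <- dataset[-train_idx, ]
```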
Fit the multiple linear regression model
##
## Call:
## lm(formula = mpg ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4889 -1.3295 -0.6108 1.6427 5.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.971619 10.973636 2.913 0.0113 *
## cyl -1.140605 1.106243 -1.031 0.3200
## disp 0.008819 0.014771 0.597 0.5600
## hp -0.021499 0.021977 -0.978 0.3446
## drat 1.079989 1.759212 0.614 0.5491
## wt -2.537754 1.506208 -1.685 0.1142
## vs 1.063936 2.191131 0.486 0.6348
## am 0.592436 2.479485 0.239 0.8146
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.604 on 14 degrees of freedom
## Multiple R-squared: 0.8713, Adjusted R-squared: 0.807
## F-statistic: 13.54 on 7 and 14 DF, p-value: 2.933e-05
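The Call line above shows the formula `mpg ~ .` fitted on `training_set`. A self-contained sketch of that fit follows; the split is recreated with an assumed seed, and the name `regressor` is hypothetical, so the numbers it prints will not match the summary above exactly.

```r
# Assumed split (seed and method are illustrative, not the original ones).
dataset <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "vs", "am")]
set.seed(123)
train_idx    <- sample(seq_len(nrow(dataset)), floor(0.7 * nrow(dataset)))
training_set <- dataset[train_idx, ]

# Regress mpg on all remaining columns of the training set.
regressor <- lm(mpg ~ ., data = training_set)
summary(regressor)
```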
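None of the individual predictors is significant at the 0.05 level, even though the overall F-test is highly significant (p-value: 2.933e-05). This pattern is typical when predictors are correlated with each other. Using backward elimination, I remove the predictor with the highest p-value one step at a time until every remaining predictor has a p-value below the 0.05 significance level.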
Backward Elimination:
Step 1: Remove ‘am’ (highest p-value: 0.8146)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + vs, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.334 -1.434 -0.591 1.599 5.378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.734914 10.163031 3.221 0.00571 **
## cyl -1.239029 0.993895 -1.247 0.23165
## disp 0.008849 0.014299 0.619 0.54531
## hp -0.020065 0.020467 -0.980 0.34245
## drat 1.225087 1.598339 0.766 0.45529
## wt -2.701829 1.297746 -2.082 0.05489 .
## vs 0.817218 1.870848 0.437 0.66847
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.521 on 15 degrees of freedom
## Multiple R-squared: 0.8708, Adjusted R-squared: 0.8191
## F-statistic: 16.85 on 6 and 15 DF, p-value: 6.849e-06
Step 2: Remove ‘vs’ (p-value: 0.66847)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2591 -1.3665 -0.8902 1.6625 5.4790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.919292 8.620992 4.050 0.000928 ***
## cyl -1.462495 0.830296 -1.761 0.097263 .
## disp 0.008958 0.013930 0.643 0.529294
## hp -0.020788 0.019878 -1.046 0.311196
## drat 1.065701 1.516272 0.703 0.492254
## wt -2.649718 1.259150 -2.104 0.051509 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.456 on 16 degrees of freedom
## Multiple R-squared: 0.8692, Adjusted R-squared: 0.8283
## F-statistic: 21.26 on 5 and 16 DF, p-value: 1.509e-06
Step 3: Remove ‘disp’ (p-value: 0.529294)
##
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2328 -1.4877 -0.9369 1.3373 5.4409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.50802 8.19191 4.090 0.000762 ***
## cyl -1.27645 0.76472 -1.669 0.113393
## hp -0.01573 0.01794 -0.877 0.392726
## drat 1.05953 1.48986 0.711 0.486626
## wt -2.14896 0.97226 -2.210 0.041085 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.414 on 17 degrees of freedom
## Multiple R-squared: 0.8658, Adjusted R-squared: 0.8342
## F-statistic: 27.41 on 4 and 17 DF, p-value: 3.227e-07
Step 4: Remove ‘drat’ (p-value: 0.486626)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5487 -1.4337 -0.7772 1.4646 5.4523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.14908 2.01823 19.398 1.63e-13 ***
## cyl -1.53257 0.66527 -2.304 0.0334 *
## hp -0.01145 0.01666 -0.687 0.5009
## wt -2.41622 0.88429 -2.732 0.0137 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.38 on 18 degrees of freedom
## Multiple R-squared: 0.8618, Adjusted R-squared: 0.8387
## F-statistic: 37.41 on 3 and 18 DF, p-value: 6.072e-08
Step 5: Remove ‘hp’ (p-value: 0.5009)
##
## Call:
## lm(formula = mpg ~ cyl + wt, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7049 -1.4539 -0.7823 1.6057 5.5980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.5830 1.8900 20.944 1.38e-14 ***
## cyl -1.8335 0.4937 -3.714 0.00147 **
## wt -2.4759 0.8677 -2.853 0.01017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.347 on 19 degrees of freedom
## Multiple R-squared: 0.8582, Adjusted R-squared: 0.8432
## F-statistic: 57.47 on 2 and 19 DF, p-value: 8.754e-09
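All remaining predictors are now significant at the 0.05 level (p-value < 0.05). Over the five steps, the adjusted R-squared rose from 0.807 to 0.8432 while the multiple R-squared fell only from 0.8713 to 0.8582.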
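The manual steps above can be automated. The sketch below is one possible implementation of p-value-based backward elimination, not the code used for this report; the function name `backward_eliminate` and the assumed split are hypothetical, and with a different split the order of removal may differ.

```r
# Assumed split (seed and method are illustrative, not the original ones).
dataset <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "vs", "am")]
set.seed(123)
train_idx    <- sample(seq_len(nrow(dataset)), floor(0.7 * nrow(dataset)))
training_set <- dataset[train_idx, ]

# Repeatedly drop the predictor with the largest p-value until all
# remaining p-values are <= alpha (or only one predictor is left).
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit   <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    pvals <- setNames(coefs[-1, 4], rownames(coefs)[-1])  # skip the intercept
    if (max(pvals) <= alpha || length(predictors) == 1) return(fit)
    worst <- names(which.max(pvals))
    message("Removing ", worst, " (p-value: ", signif(max(pvals), 4), ")")
    predictors <- setdiff(predictors, worst)
  }
}

final_model <- backward_eliminate(training_set, response = "mpg")
summary(final_model)
```

Base R also provides `step()`, which selects variables by AIC rather than by p-value, so it can end with a different predictor set.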
Interpretation of the coefficients in the model:
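Intercept: When both independent variables equal 0, the estimated mpg is 39.5830. No car has zero cylinders or zero weight, so this value is an extrapolation and not a meaningful prediction on its own.
Cylinders: Holding weight constant, each additional cylinder is associated with a decrease of 1.8335 mpg on average.
Weight: Holding the number of cylinders constant, each additional 1000 lbs is associated with a decrease of 2.4759 mpg on average.
R-squared: The model explains 85.82% of the variability in mpg in the training set (adjusted R-squared: 84.32%).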
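To make the coefficients concrete, here is a small worked example that plugs the reported estimates into the fitted equation mpg = 39.5830 - 1.8335 * cyl - 2.4759 * wt. The helper `predict_mpg` and the example car (4 cylinders, 2,500 lbs) are hypothetical.

```r
# Coefficients copied from the final model summary (mpg ~ cyl + wt).
b0    <- 39.5830
b_cyl <- -1.8335
b_wt  <- -2.4759

# wt is measured in units of 1000 lbs, as in the dataset.
predict_mpg <- function(cyl, wt) b0 + b_cyl * cyl + b_wt * wt

predict_mpg(cyl = 4, wt = 2.5)                                   # about 26.06 mpg
predict_mpg(cyl = 6, wt = 2.5) - predict_mpg(cyl = 4, wt = 2.5)  # 2 * b_cyl = -3.667
```

At the same weight, moving from four to six cylinders lowers the predicted value by about 3.67 mpg.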
## Model Performance on Test Data:
## MSE: 9.34971
## RMSE: 3.057729
## MAE: 2.446939
## R-squared: 0.858156
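The code behind these metrics is not shown. The sketch below shows how they could be computed on the held-out rows, with the split recreated under an assumed seed, so its numbers will differ from those above. The out-of-sample R-squared here uses 1 - SSE/SST around the test-set mean, which is one common definition and may not be the one used in the report.

```r
# Assumed split (seed and method are illustrative, not the original ones).
dataset <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "vs", "am")]
set.seed(123)
train_idx    <- sample(seq_len(nrow(dataset)), floor(0.7 * nrow(dataset)))
training_set <- dataset[train_idx, ]
test_set     <- dataset[-train_idx, ]

model <- lm(mpg ~ cyl + wt, data = training_set)
pred  <- predict(model, newdata = test_set)
err   <- test_set$mpg - pred

mse  <- mean(err^2)       # mean squared error
rmse <- sqrt(mse)         # root mean squared error, in mpg
mae  <- mean(abs(err))    # mean absolute error, in mpg
# R-squared computed from the test predictions, not from summary(model).
r2_test <- 1 - sum(err^2) / sum((test_set$mpg - mean(test_set$mpg))^2)

cat("MSE:", mse, "\nRMSE:", rmse, "\nMAE:", mae, "\nR-squared:", r2_test, "\n")
```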
Overall, the model predicts mpg reasonably well on the 10 held-out cars. On average, predictions are off by about 2.45 mpg (MAE), and the RMSE, which weights large errors more heavily, is about 3.06 mpg. For context, mpg in the dataset ranges from 10.4 to 33.9, and the residual standard error on the training set was 2.347, so the test error is somewhat larger than the training error, as expected.
The R-squared value of 85.82% indicates that weight and cylinder count together account for most of the variation in mpg. Note, however, that the reported value (0.858156) matches the training-set Multiple R-squared (0.8582), which suggests it may describe the fit to the training data and not an R-squared computed from the test predictions.
There is still room for improvement: adding more variables or exploring nonlinear patterns could refine the predictions further. With only 32 cars in the dataset, the results should be read with caution, but as it stands the two-variable model gives a reasonable estimate of mpg.