1) Data Exploration

Structure, summary statistics, and key attributes

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
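
The output above has the shape produced by base R's `str()`, `head()`, and `summary()`. A minimal sketch that would generate it, assuming the data is the built-in `mtcars` dataset (the report does not show its loading code):

```r
# mtcars ships with base R's datasets package (assumed to be the data source)
data(mtcars)

str(mtcars)      # structure: 32 observations of 11 numeric variables
head(mtcars)     # first six rows
summary(mtcars)  # min, quartiles, mean, and max for each variable
```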

Check for NAs

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0
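
One common way to obtain per-column NA counts like these is `colSums(is.na(...))`. This is a sketch of that idiom; the variable name `na_counts` is chosen here for illustration.

```r
data(mtcars)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(mtcars))
na_counts
```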

2) Variable Selection

Check the correlation between mpg and each of the other variables

##            cyl       disp         hp      drat         wt     qsec        vs
## [1,] -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684 0.6640389
##             am      gear       carb
## [1,] 0.5998324 0.4802848 -0.5509251
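
A one-row correlation matrix of this form can be produced by passing the response as a vector and the remaining columns as a data frame to `cor()`. The sketch below assumes Pearson correlation (the default) and uses the illustrative name `mpg_cor`; sorting by absolute value makes the ranking easier to read.

```r
data(mtcars)

# Correlate mpg (column 1) with every other column; the result is a 1 x 10 matrix
mpg_cor <- cor(mtcars$mpg, mtcars[, -1])
mpg_cor

# Rank the predictors by strength of association, ignoring sign
sort(abs(mpg_cor[1, ]), decreasing = TRUE)
```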

Initial variable selection:

cyl, disp, hp, drat, wt, vs, and am were chosen as initial predictors based on correlation with mpg and domain knowledge.

Variables qsec, gear, and carb were excluded due to:

  1. Weaker individual correlations with mpg.

  2. Potential multicollinearity with other predictors.

Backward elimination after train/test split will refine the predictor set, addressing any remaining multicollinearity and optimizing model performance.
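
The multicollinearity concern can be inspected directly by correlating the excluded variables with the retained ones. This is only a sketch of one possible check; the report does not say how multicollinearity was assessed, and the names `retained`, `excluded`, and `cross_cor` are illustrative.

```r
data(mtcars)

retained <- c("cyl", "disp", "hp", "drat", "wt", "vs", "am")
excluded <- c("qsec", "gear", "carb")

# Rows are excluded variables, columns are retained predictors;
# large absolute values suggest overlapping information
cross_cor <- cor(mtcars[, excluded], mtcars[, retained])
round(cross_cor, 2)
```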

3) Data Split

The dataset is split 70/30 into a training set (22 observations) and a test set (10 observations).
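
A minimal sketch of such a split in base R follows. The seed, the use of `sample()`, and the names `dataset`, `train_idx`, and `test_set` are assumptions; only `training_set` appears in the report's output, and a different seed or splitting function would give different rows and therefore different coefficients.

```r
data(mtcars)
set.seed(123)  # arbitrary seed, assumed here for reproducibility

# Keep mpg plus the seven initially selected predictors
dataset <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "vs", "am")]

# floor(0.7 * 32) = 22 rows for training, the remaining 10 for testing
train_idx <- sample(nrow(dataset), size = floor(0.7 * nrow(dataset)))
training_set <- dataset[train_idx, ]
test_set <- dataset[-train_idx, ]
```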

4) Linear Regression Model

Fit the multiple linear regression model

## 
## Call:
## lm(formula = mpg ~ ., data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4889 -1.3295 -0.6108  1.6427  5.2360 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 31.971619  10.973636   2.913   0.0113 *
## cyl         -1.140605   1.106243  -1.031   0.3200  
## disp         0.008819   0.014771   0.597   0.5600  
## hp          -0.021499   0.021977  -0.978   0.3446  
## drat         1.079989   1.759212   0.614   0.5491  
## wt          -2.537754   1.506208  -1.685   0.1142  
## vs           1.063936   2.191131   0.486   0.6348  
## am           0.592436   2.479485   0.239   0.8146  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.604 on 14 degrees of freedom
## Multiple R-squared:  0.8713, Adjusted R-squared:  0.807 
## F-statistic: 13.54 on 7 and 14 DF,  p-value: 2.933e-05
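
The `Call:` line above shows the model was fit with `lm(mpg ~ ., data = training_set)`. The sketch below reproduces that call on an assumed split (the seed and sampling method are illustrative), so its coefficients will not match the report's unless the same rows are selected.

```r
data(mtcars)
set.seed(123)  # assumed; the report's actual split is not shown

dataset <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "vs", "am")]
train_idx <- sample(nrow(dataset), size = floor(0.7 * nrow(dataset)))
training_set <- dataset[train_idx, ]

# "mpg ~ ." regresses mpg on every other column of training_set
full_model <- lm(mpg ~ ., data = training_set)
summary(full_model)
```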

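None of the predictors in the full model is individually significant at the 0.05 level, even though the overall F-test is (p-value: 2.933e-05), which is typical when predictors are correlated with each other. Backward elimination is used to reduce the model: at each step, the predictor with the highest p-value is removed and the model is refit, until every remaining predictor has a p-value below the 0.05 significance level.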

Backward Elimination:

Step 1: Remove ‘am’ (highest p-value: 0.8146)

## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + vs, data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.334 -1.434 -0.591  1.599  5.378 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 32.734914  10.163031   3.221  0.00571 **
## cyl         -1.239029   0.993895  -1.247  0.23165   
## disp         0.008849   0.014299   0.619  0.54531   
## hp          -0.020065   0.020467  -0.980  0.34245   
## drat         1.225087   1.598339   0.766  0.45529   
## wt          -2.701829   1.297746  -2.082  0.05489 . 
## vs           0.817218   1.870848   0.437  0.66847   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.521 on 15 degrees of freedom
## Multiple R-squared:  0.8708, Adjusted R-squared:  0.8191 
## F-statistic: 16.85 on 6 and 15 DF,  p-value: 6.849e-06

Step 2: Remove ‘vs’ (p-value: 0.66847)

## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2591 -1.3665 -0.8902  1.6625  5.4790 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.919292   8.620992   4.050 0.000928 ***
## cyl         -1.462495   0.830296  -1.761 0.097263 .  
## disp         0.008958   0.013930   0.643 0.529294    
## hp          -0.020788   0.019878  -1.046 0.311196    
## drat         1.065701   1.516272   0.703 0.492254    
## wt          -2.649718   1.259150  -2.104 0.051509 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.456 on 16 degrees of freedom
## Multiple R-squared:  0.8692, Adjusted R-squared:  0.8283 
## F-statistic: 21.26 on 5 and 16 DF,  p-value: 1.509e-06

Step 3: Remove ‘disp’ (p-value: 0.529294)

## 
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2328 -1.4877 -0.9369  1.3373  5.4409 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.50802    8.19191   4.090 0.000762 ***
## cyl         -1.27645    0.76472  -1.669 0.113393    
## hp          -0.01573    0.01794  -0.877 0.392726    
## drat         1.05953    1.48986   0.711 0.486626    
## wt          -2.14896    0.97226  -2.210 0.041085 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.414 on 17 degrees of freedom
## Multiple R-squared:  0.8658, Adjusted R-squared:  0.8342 
## F-statistic: 27.41 on 4 and 17 DF,  p-value: 3.227e-07

Step 4: Remove ‘drat’ (p-value: 0.486626)

## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5487 -1.4337 -0.7772  1.4646  5.4523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.14908    2.01823  19.398 1.63e-13 ***
## cyl         -1.53257    0.66527  -2.304   0.0334 *  
## hp          -0.01145    0.01666  -0.687   0.5009    
## wt          -2.41622    0.88429  -2.732   0.0137 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.38 on 18 degrees of freedom
## Multiple R-squared:  0.8618, Adjusted R-squared:  0.8387 
## F-statistic: 37.41 on 3 and 18 DF,  p-value: 6.072e-08

Step 5: Remove ‘hp’ (p-value: 0.5009)

## 
## Call:
## lm(formula = mpg ~ cyl + wt, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7049 -1.4539 -0.7823  1.6057  5.5980 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.5830     1.8900  20.944 1.38e-14 ***
## cyl          -1.8335     0.4937  -3.714  0.00147 ** 
## wt           -2.4759     0.8677  -2.853  0.01017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.347 on 19 degrees of freedom
## Multiple R-squared:  0.8582, Adjusted R-squared:  0.8432 
## F-statistic: 57.47 on 2 and 19 DF,  p-value: 8.754e-09

Both remaining predictors, cyl and wt, are now significant (p-value < 0.05), so the elimination stops here.
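
The five manual steps above follow a fixed rule, so they can be automated. The function below is a hypothetical helper (`backward_eliminate` is not from the report) that sketches the same p-value-based procedure. It is shown on the full `mtcars` data rather than the report's training set, so the surviving predictors may differ.

```r
# Repeatedly drop the predictor with the largest p-value until all are <= alpha
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    # Coefficient table without the intercept row
    p_values <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
    worst <- which.max(p_values[, 1])
    # Stop when everything is significant or only one predictor is left
    if (p_values[worst, 1] <= alpha || length(predictors) == 1) break
    predictors <- setdiff(predictors, rownames(p_values)[worst])
  }
  fit
}

data(mtcars)
dataset <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "vs", "am")]
final_model <- backward_eliminate(dataset, "mpg")
summary(final_model)
```

Base R's `step()` offers a related automated search, but it selects by AIC rather than by p-values, so it can keep a different set of variables.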

Interpretation of the coefficients in the model:

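Intercept: When both independent variables equal 0, the estimated mpg is 39.5830. No car has zero cylinders or zero weight, so this value only anchors the regression plane and has no practical meaning.

Cylinders: For each additional cylinder, mpg decreases by 1.8335 on average, holding weight constant.

Weight: For each additional 1000 lbs, mpg decreases by 2.4759 on average, holding the number of cylinders constant.

R-squared: The model explains 85.82% of the variability in mpg in the training set (adjusted R-squared: 84.32%).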
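
To make the interpretation concrete, the fitted equation can be evaluated by hand using the coefficients reported above. The example cars (4 and 8 cylinders at 2500 lbs) and the helper name `predict_mpg` are illustrative choices, not values from the report.

```r
# Coefficients copied from the final model summary
coefs <- c(intercept = 39.5830, cyl = -1.8335, wt = -2.4759)

# mpg = intercept + b_cyl * cyl + b_wt * wt, with wt in units of 1000 lbs
predict_mpg <- function(cyl, wt) {
  unname(coefs["intercept"] + coefs["cyl"] * cyl + coefs["wt"] * wt)
}

four_cyl  <- predict_mpg(cyl = 4, wt = 2.5)  # 39.5830 - 7.3340 - 6.18975
eight_cyl <- predict_mpg(cyl = 8, wt = 2.5)

four_cyl
four_cyl - eight_cyl  # four extra cylinders cost 4 * 1.8335 mpg at equal weight
```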

5) Visualization

6) Model Evaluation

## Model Performance on Test Data:
## MSE: 9.34971
## RMSE: 3.057729
## MAE: 2.446939
## R-squared: 0.858156
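
For reference, the four metrics can be computed from a vector of observed values and a vector of predictions as sketched below. The function name `regression_metrics` and the toy vectors are illustrative; R-squared is computed here as 1 - SS_res / SS_tot on the supplied data, which is an assumption about the definition the report used.

```r
regression_metrics <- function(actual, predicted) {
  errors <- actual - predicted
  mse <- mean(errors^2)
  c(
    MSE = mse,
    RMSE = sqrt(mse),
    MAE = mean(abs(errors)),
    # Share of variance in `actual` accounted for by the predictions
    R_squared = 1 - sum(errors^2) / sum((actual - mean(actual))^2)
  )
}

# Toy data, for illustration only
actual <- c(20, 22, 18, 25)
predicted <- c(21, 20, 18, 24)
metrics <- regression_metrics(actual, predicted)
metrics
```

With a fitted model and a held-out set, the call would look like `regression_metrics(test_set$mpg, predict(model, newdata = test_set))`, where `model` and `test_set` stand in for the report's objects.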

Overall, the model predicts MPG reasonably well. On the test set, predictions are off by about 2.45 MPG on average (MAE), and the RMSE is about 3.06 MPG, against observed values that range from 10.4 to 33.9 MPG in the full dataset. The test RMSE is higher than the training residual standard error of 2.347, which is expected for held-out data.

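The R-squared value of 85.82% means that weight and cylinder count together account for most of the variation in MPG. The reported test R-squared (0.858156) matches the training R-squared (0.8582) to four decimal places, so it is worth confirming that this value was computed from test-set predictions rather than taken from the fitted model.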

There is room for improvement, for example by adding more variables or exploring nonlinear patterns. With only 10 test observations, these metrics are also sensitive to how the data happened to be split, so the model is best read as a reasonable baseline for estimating MPG.