1) Data Exploration

Structure, summary statistics, and key attributes

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
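
The report does not show the calls that produced this output. The values are consistent with R's built-in `mtcars` dataset, so a minimal sketch, assuming that is the data source, would be:

```r
# Load the built-in mtcars dataset (assumed to be the data used in this report)
data(mtcars)

str(mtcars)      # structure: number of observations, variable names and types
head(mtcars)     # first six rows
summary(mtcars)  # min, quartiles, median, mean, and max for every column
```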

Check for NAs

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0
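
A per-column count of missing values like the one above can be obtained with `colSums(is.na(...))`. This is a sketch of one common way to do it, again assuming the built-in `mtcars` data; `na_counts` is an illustrative name:

```r
data(mtcars)

# is.na() gives a logical matrix; colSums() counts the TRUEs in each column
na_counts <- colSums(is.na(mtcars))
print(na_counts)
```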

2) Variable Selection

Check the correlation between mpg and each of the other variables

##            cyl       disp         hp      drat         wt     qsec        vs
## [1,] -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684 0.6640389
##             am      gear       carb
## [1,] 0.5998324 0.4802848 -0.5509251
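
One way to compute these correlations is to pass mpg as a vector and the remaining columns as a data frame to `cor()`, which returns a one-row matrix. The sketch below assumes the built-in `mtcars` data; `mpg_cor` and `ranked` are illustrative names. Sorting by absolute value makes the ranking of predictors easier to read:

```r
data(mtcars)

# Pearson correlation of mpg with every other column (1 x 10 matrix)
mpg_cor <- cor(mtcars$mpg, mtcars[, names(mtcars) != "mpg"])
print(mpg_cor)

# Rank the variables by the strength of their correlation with mpg
ranked <- sort(abs(mpg_cor[1, ]), decreasing = TRUE)
print(round(ranked, 3))
```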

Initial variable selection:

cyl, disp, hp, drat, wt, vs, and am were chosen as initial predictors based on correlation with mpg and domain knowledge.

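Variables qsec, gear, and carb were excluded for two reasons. First, their individual correlations with mpg (0.42, 0.48, and -0.55 respectively) are weaker than those of the selected predictors. Second, each is potentially collinear with predictors already in the set: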

qsec (1/4 mile time) is likely influenced by hp and wt.
gear (number of gears) is often correlated with am (transmission type) and may also be related to engine size (disp) and hp, making its independent contribution to mpg difficult to isolate.
carb (number of carburetors) is potentially linked to disp (engine size) and hp.

Backward elimination on the training set will then refine the predictor set, dropping predictors that contribute little once the others are in the model, which also reduces any remaining multicollinearity.
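
The multicollinearity concerns above are stated qualitatively. A quick way to check them is to correlate the excluded variables against the included ones. This is a sketch assuming the built-in `mtcars` data; `cross_cor` is an illustrative name:

```r
data(mtcars)

excluded <- c("qsec", "gear", "carb")
included <- c("cyl", "disp", "hp", "drat", "wt", "vs", "am")

# Rows are excluded variables, columns are included predictors
cross_cor <- cor(mtcars[, excluded], mtcars[, included])
print(round(cross_cor, 2))
```

Large absolute values in a row would support the claim that the excluded variable carries information already present in the selected predictors.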

3) Data Split

I split the dataset 70/30 into a training set (22 observations) and a test set (10 observations).
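
The report does not show how the split was made, so the following is only a sketch using base R's `sample()`. The seed (123) and the names `cars`, `train_idx`, and `test_set` are assumptions; a different seed or splitting function will select different rows and therefore give different coefficients from those reported below.

```r
data(mtcars)

# Keep the response and the seven initially selected predictors
selected <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "vs", "am")
cars <- mtcars[, selected]

set.seed(123)  # assumed seed, for reproducibility only
train_idx <- sample(seq_len(nrow(cars)), size = floor(0.7 * nrow(cars)))

training_set <- cars[train_idx, ]   # 22 rows
test_set     <- cars[-train_idx, ]  # remaining 10 rows
```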

4) Linear Regression Model

Fit the multiple linear regression model

## 
## Call:
## lm(formula = mpg ~ ., data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4889 -1.3295 -0.6108  1.6427  5.2360 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 31.971619  10.973636   2.913   0.0113 *
## cyl         -1.140605   1.106243  -1.031   0.3200  
## disp         0.008819   0.014771   0.597   0.5600  
## hp          -0.021499   0.021977  -0.978   0.3446  
## drat         1.079989   1.759212   0.614   0.5491  
## wt          -2.537754   1.506208  -1.685   0.1142  
## vs           1.063936   2.191131   0.486   0.6348  
## am           0.592436   2.479485   0.239   0.8146  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.604 on 14 degrees of freedom
## Multiple R-squared:  0.8713, Adjusted R-squared:  0.807 
## F-statistic: 13.54 on 7 and 14 DF,  p-value: 2.933e-05
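
The `Call:` line above shows the model was fit with `lm(mpg ~ ., data = training_set)`. A self-contained sketch of that step follows. The split is an assumption (base R `sample()` with seed 123), so the estimates it prints will not match the reported table exactly:

```r
data(mtcars)

# Assumed split: 70% of rows for training, fixed seed for reproducibility
selected <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "vs", "am")
set.seed(123)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
training_set <- mtcars[train_idx, selected]

# "mpg ~ ." regresses mpg on every other column of training_set
full_model <- lm(mpg ~ ., data = training_set)
print(summary(full_model))
```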

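Apart from the intercept, none of the predictors is individually significant at the 0.05 level, even though the overall F-test is highly significant (p-value: 2.933e-05). This pattern is typical when predictors are correlated with one another. Backward elimination is used to remove the least significant predictor one at a time until every remaining predictor has a p-value below 0.05.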

Backward Elimination:

Step 1: Remove ‘am’ (highest p-value: 0.8146)

## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + vs, data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.334 -1.434 -0.591  1.599  5.378 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 32.734914  10.163031   3.221  0.00571 **
## cyl         -1.239029   0.993895  -1.247  0.23165   
## disp         0.008849   0.014299   0.619  0.54531   
## hp          -0.020065   0.020467  -0.980  0.34245   
## drat         1.225087   1.598339   0.766  0.45529   
## wt          -2.701829   1.297746  -2.082  0.05489 . 
## vs           0.817218   1.870848   0.437  0.66847   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.521 on 15 degrees of freedom
## Multiple R-squared:  0.8708, Adjusted R-squared:  0.8191 
## F-statistic: 16.85 on 6 and 15 DF,  p-value: 6.849e-06

Step 2: Remove ‘vs’ (p-value: 0.66847)

## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2591 -1.3665 -0.8902  1.6625  5.4790 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.919292   8.620992   4.050 0.000928 ***
## cyl         -1.462495   0.830296  -1.761 0.097263 .  
## disp         0.008958   0.013930   0.643 0.529294    
## hp          -0.020788   0.019878  -1.046 0.311196    
## drat         1.065701   1.516272   0.703 0.492254    
## wt          -2.649718   1.259150  -2.104 0.051509 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.456 on 16 degrees of freedom
## Multiple R-squared:  0.8692, Adjusted R-squared:  0.8283 
## F-statistic: 21.26 on 5 and 16 DF,  p-value: 1.509e-06

Step 3: Remove ‘disp’ (p-value: 0.529294)

## 
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2328 -1.4877 -0.9369  1.3373  5.4409 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.50802    8.19191   4.090 0.000762 ***
## cyl         -1.27645    0.76472  -1.669 0.113393    
## hp          -0.01573    0.01794  -0.877 0.392726    
## drat         1.05953    1.48986   0.711 0.486626    
## wt          -2.14896    0.97226  -2.210 0.041085 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.414 on 17 degrees of freedom
## Multiple R-squared:  0.8658, Adjusted R-squared:  0.8342 
## F-statistic: 27.41 on 4 and 17 DF,  p-value: 3.227e-07

Step 4: Remove ‘drat’ (p-value: 0.486626)

## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5487 -1.4337 -0.7772  1.4646  5.4523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.14908    2.01823  19.398 1.63e-13 ***
## cyl         -1.53257    0.66527  -2.304   0.0334 *  
## hp          -0.01145    0.01666  -0.687   0.5009    
## wt          -2.41622    0.88429  -2.732   0.0137 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.38 on 18 degrees of freedom
## Multiple R-squared:  0.8618, Adjusted R-squared:  0.8387 
## F-statistic: 37.41 on 3 and 18 DF,  p-value: 6.072e-08

Step 5: Remove ‘hp’ (p-value: 0.5009)

## 
## Call:
## lm(formula = mpg ~ cyl + wt, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7049 -1.4539 -0.7823  1.6057  5.5980 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.5830     1.8900  20.944 1.38e-14 ***
## cyl          -1.8335     0.4937  -3.714  0.00147 ** 
## wt           -2.4759     0.8677  -2.853  0.01017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.347 on 19 degrees of freedom
## Multiple R-squared:  0.8582, Adjusted R-squared:  0.8432 
## F-statistic: 57.47 on 2 and 19 DF,  p-value: 8.754e-09

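Both remaining predictors are now significant (p-value < 0.05). Dropping five predictors lowered the Multiple R-squared only slightly (from 0.8713 to 0.8582), while the Adjusted R-squared rose from 0.807 to 0.8432.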
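
The five manual steps above can also be written as a loop. This is a sketch under assumptions, not the code used for the report: the split (seed 123) and the names `predictors`, `alpha`, and `fit` are illustrative, so the variables it retains may differ from the cyl + wt model reported here.

```r
data(mtcars)

# Assumed split: 70% of rows for training, fixed seed
selected <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "vs", "am")
set.seed(123)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
training_set <- mtcars[train_idx, selected]

predictors <- setdiff(selected, "mpg")
alpha <- 0.05  # significance level used as the removal threshold

repeat {
  fit <- lm(reformulate(predictors, response = "mpg"), data = training_set)

  # p-values of the predictors, with the intercept row dropped
  coefs <- summary(fit)$coefficients
  p_values <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])

  # Stop once the least significant predictor is below the threshold
  worst <- names(which.max(p_values))
  if (p_values[worst] <= alpha) break

  # Otherwise drop it and refit
  predictors <- setdiff(predictors, worst)
  if (length(predictors) == 0) {
    fit <- lm(mpg ~ 1, data = training_set)  # nothing survived: intercept only
    break
  }
}

print(predictors)
print(summary(fit))
```

Base R's `step()` automates a similar search but selects by AIC rather than by p-value, so it is a related technique rather than a drop-in replacement for this procedure.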

Interpretation of the coefficients in the model:

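Intercept: When both independent variables equal 0, the estimated mpg is 39.5830. No car has zero cylinders or zero weight, so this value anchors the regression plane rather than describing a real vehicle.

Cylinders: For each additional cylinder, mpg decreases by 1.8335 on average, holding weight constant.

Weight: For each additional 1000 lbs, mpg decreases by 2.4759 on average, holding the number of cylinders constant.

R-squared: The model explains 85.82% of the variability in mpg in the training set.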
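
As a worked example of these coefficients, the sketch below plugs the reported estimates into the fitted equation. The helper `predict_mpg` and the example car (4 cylinders, 2,500 lbs) are hypothetical, chosen only for illustration:

```r
# Coefficients copied from the final model summary (mpg ~ cyl + wt)
b0    <- 39.5830   # intercept
b_cyl <- -1.8335   # change in mpg per additional cylinder
b_wt  <- -2.4759   # change in mpg per additional 1000 lbs

# Hypothetical helper: predicted mpg for a given cylinder count and weight
predict_mpg <- function(cyl, wt) b0 + b_cyl * cyl + b_wt * wt

# 39.5830 - 1.8335 * 4 - 2.4759 * 2.5 = 26.05925
print(predict_mpg(cyl = 4, wt = 2.5))
```

Under this model, moving from 4 to 6 cylinders at the same weight lowers the prediction by 2 × 1.8335 = 3.667 mpg.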

5) Visualization

6) Model Evaluation

## Model Performance on Test Data:

## MSE: 9.34971

## RMSE: 3.057729

## MAE: 2.446939

## R-squared: 0.858156
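
The evaluation code is not shown in the report. The sketch below shows one way to compute these four metrics on held-out data. The split (seed 123) and all variable names are assumptions, so the numbers it prints will differ from those above. R-squared is computed here from the test-set predictions as 1 - SSE/SST, which is one common out-of-sample definition.

```r
data(mtcars)

# Assumed split: 70% of rows for training, fixed seed
set.seed(123)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
training_set <- mtcars[train_idx, ]
test_set     <- mtcars[-train_idx, ]

# Final model from the backward elimination
final_model <- lm(mpg ~ cyl + wt, data = training_set)

# Predict on the held-out rows and compute the prediction errors
pred   <- predict(final_model, newdata = test_set)
errors <- test_set$mpg - pred

mse  <- mean(errors^2)      # mean squared error
rmse <- sqrt(mse)           # root mean squared error, in mpg units
mae  <- mean(abs(errors))   # mean absolute error, in mpg units

# Out-of-sample R-squared: 1 - SSE / SST on the test set
test_r2 <- 1 - sum(errors^2) / sum((test_set$mpg - mean(test_set$mpg))^2)

cat("MSE:", mse, "\nRMSE:", rmse, "\nMAE:", mae, "\nR-squared:", test_r2, "\n")
```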

Overall, the model does a solid job predicting MPG on the test set. On average, predictions are off by about 2.45 MPG (MAE), and the typical error is around 3.06 MPG (RMSE). Relative to the dataset's mean of 20.09 MPG, the MAE is roughly 12%, which is a reasonable margin for a model with only two predictors.

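The R-squared value of 85.82% suggests that weight and cylinder count account for most of the variation in MPG. Note that this figure is identical to the Multiple R-squared of the final model on the training set (0.8582), so it appears to describe the training fit rather than performance on the 10 test observations. An R-squared computed from the test-set predictions would be the more direct measure of out-of-sample accuracy.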

Of course, there is always room for improvement: adding more variables or exploring nonlinear patterns could fine-tune the predictions further. As it stands, this model provides a reasonable estimate of MPG, though with only 10 test observations the error metrics should be read with some caution.