Predicting Fuel Efficiency Using Machine Learning

A. Introduction

Fuel consumption is a major concern in the automotive world, combining both economic and environmental stakes. But what truly determines a car’s efficiency? Is it solely its weight, its horsepower, or the design of its engine?

To answer this question, we will build a Multiple Linear Regression (MLR) model using the well-known mtcars dataset. Extracted from the 1974 Motor Trend magazine, this benchmark dataset describes 32 automobiles through 11 variables: fuel consumption plus 10 technical specifications.

The objective of this project is twofold: to understand the combined impact of the 10 technical variables on fuel consumption (measured in Miles per Gallon, mpg) and to demonstrate, step-by-step, a rigorous methodology for building a linear model.

B. Step-by-Step Workflow

1. Pairwise linear correlation study between explanatory variables.
2. Linearity evaluation between each numeric explanatory (independent) variable and the response (dependent) variable.
3. Fitting the full model.
4. Multicollinearity evaluation using Variance Inflation Factors (VIFs).
5. Assumption checking for normality and homoscedasticity of the full model’s residuals.
6. Interpretation of the full model results.
7. Parsimonious model selection.
8. Outlier detection and analysis.
9. Interpretation of the parsimonious model.
10. Synthesis of the results.

To illustrate this approach, we will use the mtcars dataset (built into R). The data was extracted from the 1974 American magazine Motor Trend and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models):

- mpg : Miles/(US) gallon

- cyl : Number of cylinders

- disp : Displacement (cu.in.)

- hp : Gross horsepower

- drat : Rear axle ratio

- wt : Weight (1000 lbs)

- qsec : 1/4 mile time (time in seconds to cover a quarter mile)

- vs : Engine (0 = V-shaped, 1 = straight)

- am : Transmission (0 = automatic, 1 = manual)

- gear : Number of forward gears

- carb : Number of carburetors.
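As a quick sanity check on the description above, the following sketch confirms the shape of the dataset. It only uses base R; the helper names `dims` and `n_predictors` are ours and purely illustrative.

```r
# mtcars ships with base R (datasets package), so no download is needed
data(mtcars)

dims <- dim(mtcars)               # rows = cars, columns = variables
n_predictors <- ncol(mtcars) - 1  # every column except the response mpg

cat("Cars:", dims[1], "| Variables:", dims[2], "| Predictors:", n_predictors, "\n")
str(mtcars)                       # all 11 columns are stored as numeric
```

Note that vs and am are stored as 0/1 numbers, which is why they are converted to factors later on.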

A multiple linear regression approach will be employed to assess the effect of each explanatory variable on miles per gallon (the response variable) while holding the other predictors constant.

We start by computing and visualizing the Pearson correlation coefficients between all numerical variables to get a first glimpse of their linear relationships.

1. Pairwise Linear Correlation Analysis

# Load necessary libraries
library(corrplot)

# Select ONLY numeric variables automatically to avoid errors
numeric_mtcars <- mtcars[, sapply(mtcars, is.numeric)]

# Compute correlation matrix on numeric data only
cor_matrix <- cor(numeric_mtcars)

# Visualize using a clean correlation plot
corrplot(cor_matrix, method = "color", type = "upper", 
         addCoef.col = "black", tl.col = "black", tl.srt = 45,
         number.cex = 0.7, diag = FALSE)

Signs of Severe Multicollinearity

Several explanatory variables are very highly correlated with each other:

- disp and cyl (\(r = 0.90\))

- disp and wt (\(r = 0.89\))

- hp and cyl (\(r = 0.83\))
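Rather than reading these pairs off the plot, they can be listed programmatically. The sketch below is one possible way to do it in base R; the 0.8 cutoff and the name `pair_table` are our own illustrative choices, not a standard threshold.

```r
# Correlation matrix of the 10 predictors (response mpg excluded)
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]
cor_pred <- cor(predictors)

# Keep each pair once by blanking the lower triangle and the diagonal
cor_pred[lower.tri(cor_pred, diag = TRUE)] <- NA

# Long format: one row per pair, filtered on an illustrative 0.8 cutoff
pair_table <- as.data.frame(as.table(cor_pred), stringsAsFactors = FALSE)
names(pair_table) <- c("var1", "var2", "r")
high_pairs <- subset(pair_table, !is.na(r) & abs(r) > 0.8)
high_pairs <- high_pairs[order(-abs(high_pairs$r)), ]
print(high_pairs, row.names = FALSE)
```

Filtering on the absolute value matters here, because a strong negative correlation between two predictors is just as problematic as a strong positive one.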

2. Linearity Evaluation

To evaluate the linearity assumption between our response variable (mpg) and each continuous predictor, we analyze a grid of scatter plots accompanied by linear fits and local smooth curves (LOESS).

library(ggplot2)
library(tidyverse)

# Force predictors to be numeric just for the long-format visualization
mtcars %>%
  mutate(across(-mpg, as.numeric)) %>%
  pivot_longer(cols = -mpg, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, y = mpg)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_smooth(method = "loess", se = FALSE, color = "red", linetype = "dashed") +
  facet_wrap(~ variable, scales = "free_x") +
  theme_minimal() +
  labs(title = "Linearity Check: mpg vs Predictors",
       subtitle = "Blue: Linear trend | Red: Local smooth trend (LOESS)")

Based on the generated diagnostic grid, we evaluate the linearity assumption by comparing the linear regression line (blue solid line) with the local smooth curve (red dashed line) for each predictor against mpg.

- Weight (wt) & Horsepower (hp): These variables display a clear negative relationship with mpg. The red LOESS curve follows the blue linear line closely across most of the data range, which suggests that a linear term is a reasonable approximation for our primary continuous predictors.

- Rear Axle Ratio (drat): Shows a solid, consistent positive linear relationship with mpg, with no meaningful divergence from the linear path.

- Displacement (disp): While a strong negative correlation is clear, the red LOESS line shows a slight curve (u-shape) at the lower and higher ends of the spectrum. This indicates a minor non-linear component, but a linear approximation remains acceptable as a starting point.

- Quarter Mile Time (qsec): The red line shows an S-curve behavior, suggesting that the relationship with acceleration isn’t perfectly linear.

- Transmission (am) & Engine Shape (vs): Because these are binary variables (0 or 1), the points are grouped into two vertical columns. The lines simply connect the averages of these two groups. This reinforces our decision to treat them as categorical factor variables in the next step.

- Cylinders (cyl), Carburetors (carb), and Gears (gear): These discrete variables align moderately well with a linear assumption, though carb and gear show some localized fluctuations due to uneven data distribution at extreme values (e.g., cars with 6 or 8 carburetors).
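The visual reading above can be complemented with a rough numeric check. The sketch below is our own addition, not part of the original workflow: for each continuous predictor it compares a straight-line fit with a quadratic fit using a partial F-test. The name `curvature` is illustrative.

```r
# Continuous predictors discussed above
continuous <- c("wt", "hp", "disp", "drat", "qsec")

# For each predictor, test whether a quadratic term improves on the straight line
curvature <- sapply(continuous, function(v) {
  linear    <- lm(reformulate(v, response = "mpg"), data = mtcars)
  quadratic <- lm(reformulate(sprintf("poly(%s, 2)", v), response = "mpg"),
                  data = mtcars)
  anova(linear, quadratic)[2, "Pr(>F)"]   # p-value of the extra quadratic term
})

# Small p-values flag predictors whose relationship with mpg bends
print(round(sort(curvature), 4))
```

A small p-value does not invalidate the linear model by itself, but it points to predictors where a transformation or polynomial term may be worth considering.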

3. Fitting the full model

We now fit the full multiple linear regression model. Note that variables like am (transmission) and vs (engine shape) are technically categorical, so we should convert them to factors.

# Prepare data by converting categorical variables to factors
mtcars_cleaned <- mtcars
mtcars_cleaned$am <- as.factor(mtcars_cleaned$am)
mtcars_cleaned$vs <- as.factor(mtcars_cleaned$vs)

# Fit the full model with all 10 predictors
full_model <- lm(mpg ~ ., data = mtcars_cleaned)
summary(full_model)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars_cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs1          0.31776    2.10451   0.151   0.8814  
## am1          2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

The full model yields a highly significant global fit (\(F = 13.93\), \(p < 0.001\)) and explains about 87% of the variance in mpg (\(R^2 = 0.869\)), or 81% after adjusting for the number of predictors (Adjusted \(R^2 = 0.8066\)). However, a striking paradox emerges: not a single individual predictor is statistically significant at the \(\alpha = 0.05\) level. The closest is wt, with \(p = 0.0633\).

This phenomenon—where a model performs exceptionally well as a whole while its individual components appear useless—is a textbook indicator of severe multicollinearity. The highly correlated technical specifications are fighting for variance, inflating the standard errors and masking each other’s true impact.
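To make the inflation of standard errors concrete, the sketch below compares the standard error of the wt coefficient when weight is the only predictor with its standard error in the full model. The object names (`dat`, `simple_fit`, `full_fit`) are ours.

```r
# Same preparation as the full model: am and vs as factors
dat <- mtcars
dat$am <- as.factor(dat$am)
dat$vs <- as.factor(dat$vs)

simple_fit <- lm(mpg ~ wt, data = dat)   # weight alone
full_fit   <- lm(mpg ~ ., data = dat)    # weight plus the 9 other predictors

# Standard error of the wt coefficient in each model
se_simple <- coef(summary(simple_fit))["wt", "Std. Error"]
se_full   <- coef(summary(full_fit))["wt", "Std. Error"]

cat("SE(wt), simple model:", round(se_simple, 3), "\n")
cat("SE(wt), full model:  ", round(se_full, 3), "\n")
cat("Inflation ratio:     ", round(se_full / se_simple, 2), "\n")
```

The larger the ratio, the more the other predictors are competing with wt to explain the same variation in mpg.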

4. Multicollinearity Evaluation

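To diagnose and quantify this overlap, we calculate the Variance Inflation Factor (VIF) for each predictor. The VIF measures how much the variance of a coefficient is inflated by that predictor’s correlation with the others. As a common rule of thumb, a VIF above 5 signals moderate multicollinearity and a VIF above 10 signals severe multicollinearity that should be addressed.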

library(performance)
check_collinearity(full_model)  # Calculate VIF for the full model

## # Check for Multicollinearity
## 
## Low Correlation
## 
##  Term  VIF     VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
##  drat 3.37 [ 2.44,  4.92]     1.84      0.30     [0.20, 0.41]
##    vs 4.97 [ 3.49,  7.32]     2.23      0.20     [0.14, 0.29]
##    am 4.65 [ 3.28,  6.84]     2.16      0.22     [0.15, 0.31]
## 
## Moderate Correlation
## 
##  Term  VIF     VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
##    hp 9.83 [ 6.70, 14.68]     3.14      0.10     [0.07, 0.15]
##  qsec 7.53 [ 5.18, 11.20]     2.74      0.13     [0.09, 0.19]
##  gear 5.36 [ 3.75,  7.92]     2.31      0.19     [0.13, 0.27]
##  carb 7.91 [ 5.43, 11.77]     2.81      0.13     [0.08, 0.18]
## 
## High Correlation
## 
##  Term   VIF     VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
##   cyl 15.37 [10.36, 23.07]     3.92      0.07     [0.04, 0.10]
##  disp 21.62 [14.49, 32.52]     4.65      0.05     [0.03, 0.07]
##    wt 15.16 [10.22, 22.75]     3.89      0.07     [0.04, 0.10]

The check_collinearity() output shows that our variables overlap severely. Engine displacement (disp), cylinders (cyl), and car weight (wt) all have VIF values well above 10, meaning that each is largely predictable from the other predictors. This redundancy inflates the standard errors and obscures each variable’s individual effect, confirming that the full 10-predictor model is too crowded to interpret reliably and should be simplified.
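The VIF values reported above can be reproduced by hand from the definition \(VIF_j = 1 / (1 - R_j^2)\), where \(R_j^2\) is obtained by regressing predictor \(j\) on all the other predictors. The sketch below uses base R only; `X` and `manual_vif` are illustrative names, and vs and am are kept in their 0/1 numeric coding, which gives the same VIF as a two-level factor.

```r
# Predictors only, in their original numeric coding (0/1 for vs and am)
X <- mtcars[, setdiff(names(mtcars), "mpg")]

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest
manual_vif <- sapply(names(X), function(v) {
  aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})

print(round(sort(manual_vif, decreasing = TRUE), 2))
```

Seeing the formula makes the interpretation easier: a VIF of about 21 for disp means that roughly 95% of its variance is explained by the other nine predictors.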

5. Assumption Checking

We evaluate the full model’s residuals to check for normality (errors follow a normal distribution) and homoscedasticity (errors have constant variance).

# Statistical test for normality
shapiro.test(residuals(full_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(full_model)
## W = 0.95694, p-value = 0.2261

# Diagnostic plots
par(mfrow = c(2, 2))
plot(full_model)

The diagnostics give no evidence that the model’s errors (residuals) violate the standard assumptions. The Shapiro-Wilk test yields a p-value of 0.2261, well above 0.05, so we cannot reject the hypothesis that the residuals are normally distributed. This is visually backed by the points staying close to the diagonal line on the Q-Q Residuals plot. Additionally, the random scatter of points in the Residuals vs Fitted plot is consistent with constant variance (homoscedasticity), meaning the model’s accuracy remains stable across the range of fitted values. Finally, no data points cross the Cook’s distance contours in the Residuals vs Leverage plot, indicating that no single vehicle is unduly influencing our results.
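Normality was tested formally, but homoscedasticity was only judged by eye. As a possible complement, the sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R. This test is our addition rather than part of the original analysis, and all object names are illustrative.

```r
dat <- mtcars
dat$am <- as.factor(dat$am)
dat$vs <- as.factor(dat$vs)
fit <- lm(mpg ~ ., data = dat)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- dat
aux_data$mpg <- NULL
aux_data$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ ., data = aux_data)

# LM statistic = n * R^2 of the auxiliary regression, chi-squared with k df
n <- nrow(dat)
k <- length(coef(aux)) - 1
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = k, lower.tail = FALSE)

cat("BP statistic:", round(bp_stat, 3), "| df:", k,
    "| p-value:", round(bp_p, 4), "\n")
```

A p-value above 0.05 would mean we cannot reject constant variance, in line with the visual reading of the Residuals vs Fitted plot.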

7. Parsimonious Model Selection

To reduce the severe multicollinearity we discovered earlier, we perform an automated stepwise variable selection based on the Akaike Information Criterion (AIC). At each step, the procedure drops or re-adds one variable, whichever lowers the AIC the most, and stops when no single change improves it further.

# Stepwise regression (both forward and backward selection)
parsimonious_model <- step(full_model, direction = "both", trace = FALSE)

# Display the summary of the optimized model
summary(parsimonious_model)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars_cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am1           2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The stepwise selection algorithm pruned the 10 original predictors down to just three: Weight (wt), Quarter-mile time (qsec), and Transmission type (am). With overlapping variables like cylinders and displacement removed, all three remaining predictors are statistically significant at the 5% level: wt and qsec strongly so (\(p < 0.001\)), am only marginally (\(p = 0.047\)). This simplified model has an adjusted \(R^2\) of 0.8336 (83.4%), higher than the full model’s 0.8066 (81%), and its coefficients are estimated far more precisely (for wt, a standard error of 0.71 instead of 1.89).
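The comparison between the two models can be checked directly. The sketch below refits both models and puts their AIC, a partial F-test, and adjusted \(R^2\) side by side; the names `full_fit`, `small_fit`, and `adj_r2` are ours.

```r
dat <- mtcars
dat$am <- as.factor(dat$am)
dat$vs <- as.factor(dat$vs)

full_fit  <- lm(mpg ~ ., data = dat)
small_fit <- lm(mpg ~ wt + qsec + am, data = dat)

# Lower AIC is better; AIC penalizes each extra coefficient
print(AIC(full_fit, small_fit))

# Partial F-test: do the 7 dropped predictors jointly add explanatory power?
print(anova(small_fit, full_fit))

# Adjusted R-squared side by side
adj_r2 <- c(full  = summary(full_fit)$adj.r.squared,
            small = summary(small_fit)$adj.r.squared)
print(round(adj_r2, 4))
```

A non-significant partial F-test would indicate that the seven dropped predictors do not jointly improve the fit, which is the formal counterpart of the AIC-based choice.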

8. Outlier Detection

check_outliers(parsimonious_model)

## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.808).
## - For variable: (Whole model)

The check_outliers() function from the performance package reports no influential outliers in our optimized model. Using Cook’s distance with a threshold of 0.808, none of the 32 observations has enough influence to distort our regression coefficients.
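To see which cars come closest to that threshold, Cook’s distance can be computed directly with base R. In the sketch below the threshold of 0.808 is simply copied from the check_outliers() output above, not derived; `cooks_d` and `small_fit` are illustrative names.

```r
dat <- mtcars
dat$am <- as.factor(dat$am)
small_fit <- lm(mpg ~ wt + qsec + am, data = dat)

# Cook's distance: how much the fitted values move when one car is removed
cooks_d <- cooks.distance(small_fit)

# Threshold copied from the check_outliers() output above
threshold <- 0.808

# Five most influential cars and whether any crosses the threshold
print(round(head(sort(cooks_d, decreasing = TRUE), 5), 3))
cat("Cars above threshold:", sum(cooks_d > threshold), "\n")
```

Listing the top few values is more informative than a plain pass or fail message, since it shows how much margin remains before any car would be flagged.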

9. Interpretation of the parsimonious model

The theoretical multiple linear regression model is structured as follows:

\[Y_{mpg} = \beta_0 + \beta_1(wt) + \beta_2(qsec) + \beta_3(am) + \epsilon\]

By substituting the parameters with our final estimated coefficients from the parsimonious model, we obtain the definitive predictive equation for vehicle fuel efficiency:

\[\widehat{mpg} = 9.6178 - 3.9165(wt) + 1.2259(qsec) + 2.9358(am)\]

Final Model Coefficients:

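- Intercept (9.6178): The predicted mpg for an automatic car (am = 0) with wt = 0 and qsec = 0. No such car exists, so this is an extrapolation with no physical meaning, and the estimate is not significantly different from zero (\(p = 0.178\)).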

- Weight (wt, Coefficient = -3.9165, \(p < 0.001\)): Holding acceleration and transmission constant, every 1,000 lbs increase in vehicle weight decreases fuel efficiency by approximately 3.92 Miles per Gallon (mpg). Weight is the most punishing driver of fuel consumption.

- Acceleration (qsec, Coefficient = 1.2259, \(p < 0.001\)): Holding weight and transmission constant, every one-second increase in quarter-mile time (meaning a slower, less aggressive acceleration setup) increases fuel efficiency by 1.23 mpg.

- Transmission Type (am1, Coefficient = 2.9358, \(p = 0.047\)): Holding weight and acceleration constant, a manual transmission (am = 1) is associated with 2.94 mpg higher fuel efficiency than an automatic. This effect is only marginally significant at the 5% level.
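As a worked example of the equation, the sketch below predicts the fuel efficiency of a hypothetical car. The input values (3,000 lbs, 18 seconds, manual transmission) are invented for illustration and do not correspond to a specific vehicle in the dataset.

```r
dat <- mtcars
dat$am <- as.factor(dat$am)
small_fit <- lm(mpg ~ wt + qsec + am, data = dat)

# Hypothetical car (values chosen for illustration only):
# 3,000 lbs, 18 s quarter mile, manual transmission
new_car <- data.frame(wt = 3.0, qsec = 18,
                      am = factor(1, levels = levels(dat$am)))

# Point prediction with a 95% prediction interval
pred <- predict(small_fit, newdata = new_car, interval = "prediction")
print(round(pred, 2))

# Same point estimate from the rounded equation in the text
by_hand <- 9.6178 - 3.9165 * 3.0 + 1.2259 * 18 + 2.9358 * 1
cat("By hand:", round(by_hand, 2), "\n")
```

The hand calculation gives about 22.87 mpg. The prediction interval around it is a reminder that, with only 32 cars, individual predictions remain fairly uncertain.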

10. Final Project Synthesis

We started with 10 overlapping predictors and trimmed them down to the 3 that matter most for saving gas. This final, simple model explains about 83% of the variation in fuel efficiency (Adjusted \(R^2 = 0.8336\)). The data shows three clear patterns: heavier cars burn much more gas (losing about 4 mpg per 1,000 lbs), slower acceleration saves fuel (about 1.2 mpg per extra second of quarter-mile time), and manual transmissions are more efficient (gaining nearly 3 mpg over automatics).