Causal Analysis:
Mileage & Transmission Type

Regression Models

Johns Hopkins University

Author

D.McCabe

https://en.m.wikipedia.org/wiki/Overdrive_(mechanics)
https://themotorguy.com/what-is-axle-ratio-in-trucks/
https://www.ecklers.com/understanding-rear-gear-ratios-tech-guide.html
https://www.youtube.com/watch?v=NCibZWA46Bo&ab_channel=DustorBust

“Is an automatic or manual transmission better for MPG?”

“Quantify the MPG difference between automatic and manual transmissions.”

We note that both Mazda RX4s in the data set have engines mis-reported as V-shaped. They are in fact Wankel rotary engines.

Data Sources

Data acquisition

getType <- function(hp, wt) {
  type <- rep("economy/compact", length(hp))
  type[wt > 3.0] <- "midsized/muscle"
  type[wt > 4.5] <- "luxury/superheavy"
  type[hp / wt > 60] <- "sports/muscle (hp)"
  return(as.factor(type))
}

library(data.table)  # provides as.data.table() and the := operator

dt <- as.data.table(mtcars, keep.rownames = "name")
dt[,vs:=factor(vs, labels = c("V-shaped", "straight"))]
dt[,am:=factor(am, labels = c("automatic", "manual"))]
dt[,gear:=as.factor(gear)]
dt[,type := getType(hp,wt)]

Effect of Transmission Type on MPG

Weight

Automatic gearboxes are used in bigger, heavier, less economical cars

Weight is a confounder: cars designed with automatic transmission have poorer mileage because automatic transmission is used on heavier cars, while manual transmission tends to be used on lighter, more economical cars.

Weight also appears to act as an effect modifier, with mileage decreasing more sharply with weight for manuals than for automatics. However, the limited overlap between the two groups, along with the influence of very economical and very heavy luxury cars, seems to exaggerate this effect.
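The confounding and effect-modification claims above can be checked with a minimal sketch. It assumes the built-in mtcars data is used directly (am coded 0 = automatic, 1 = manual) rather than the dt table, and the model names are illustrative only.

```r
# Compare the transmission coefficient before and after adjusting for weight.
data(mtcars)

m_unadjusted  <- lm(mpg ~ am, data = mtcars)        # crude manual-vs-automatic gap
m_adjusted    <- lm(mpg ~ am + wt, data = mtcars)   # weight treated as a confounder
m_interaction <- lm(mpg ~ am * wt, data = mtcars)   # weight treated as an effect modifier

am_effects <- c(
  unadjusted = unname(coef(m_unadjusted)["am"]),
  adjusted   = unname(coef(m_adjusted)["am"])
)
print(round(am_effects, 2))

# A negative am:wt term means mpg falls faster with weight for manuals.
slope_difference <- unname(coef(m_interaction)["am:wt"])
print(round(slope_difference, 2))
```

If the crude gap largely disappears once wt enters the model, that is consistent with weight confounding the comparison. The interaction term rests on very little overlap between the groups, so it should be read cautiously.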

Muscle and mid-size sports cars have both low mpg and manual transmission

Here we are at (or far beyond) the limits of the data. Nevertheless, we can still glimpse something which I do believe exists: namely, within the midsized/muscle category, cars with manual transmission have poorer mileage.

Automatics are mostly practical midsized cars (with a few exceptions), as indicated by a high quarter-mile time: \[\operatorname{qsec}_{auto}\sim N(18.32,3.82^2)\]
Manuals are more heavily punctuated with muscle and high-performance sports cars, with a quarter-mile time that is on average 92% of the automatics': \[\operatorname{qsec}_{stick}\sim N(16.83,3.20^2)\] (This design choice saved weight and made gear changes quicker and acceleration higher.)

This suggests that manuals are concentrated among higher performance cars, which explains why their fuel efficiency appears lower in this subset even when controlling for weight.

No. of carburettors

The number of carburettors is not itself a statistically significant predictor of fuel efficiency

The relationship between transmission type and mileage was examined while adjusting for the number of carburettors. We observe that the primary effect of transmission on mileage remains largely unchanged after this adjustment. Additionally, there is a negative correlation between mileage and the number of carburettors, indicating that cars with more carburettors tend to be less fuel-efficient. This aligns with the observation that more wasteful, bulkier or sportier vehicles typically use more carburettors. However, there are a few high-influence sports cars that deviate from this general trend.
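As a rough illustration of the adjustment described above, the sketch below compares the transmission coefficient with and without the number of carburettors in the model. It assumes the raw mtcars columns (am as 0/1, carb as a count); the object names are hypothetical.

```r
data(mtcars)

fit_crude <- lm(mpg ~ am, data = mtcars)          # transmission only
fit_carb  <- lm(mpg ~ am + carb, data = mtcars)   # adjusted for number of carburettors

# How far the transmission coefficient moves after adjustment,
# and the direction of the carburettor slope.
am_shift   <- unname(coef(fit_carb)["am"] - coef(fit_crude)["am"])
carb_slope <- unname(coef(fit_carb)["carb"])
print(round(c(am_shift = am_shift, carb_slope = carb_slope), 2))
```

A small shift in the am coefficient alongside a negative carb slope would match the description in the text.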

My preconceived theory that carburettors waste fuel is not supported by the evidence

model <- lm(mpg ~ wt+cyl+carb, data = dt)
anova(model)

Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq  F value    Pr(>F)    
wt         1 847.73  847.73 133.8014 3.526e-12 ***
cyl        1  87.15   87.15  13.7554  0.000912 ***
carb       1  13.77   13.77   2.1738  0.151536    
Residuals 28 177.40    6.34                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

vif(model)

      wt      cyl     carb 
2.581453 2.920519 1.385647

I noticed that the number of carburettors appears to affect mileage in the additive model mpg ~ wt + disp + carb. However, the variance inflation factors for weight and displacement were high, indicating collinearity between the two. The number of carburettors does not appear to be strongly associated with the bulk (PC1) score, but it does contribute positively to the sportiness score. This aligns with our expectation that vehicles with more carburettors tend to be overpowered and less fuel-efficient. When we model mpg with cylinders instead, we see that the number of carburettors was previously just acting as a proxy for performance-oriented cars.
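The collinearity comparison can be reproduced without extra packages. This sketch assumes the usual definition of the variance inflation factor for numeric predictors, 1 / (1 - R²), where R² comes from regressing each predictor on the others. The helper vif_manual is hypothetical and is not the vif() function called above.

```r
data(mtcars)

# VIF for each predictor: regress it on the remaining predictors
# and convert the resulting R^2 into 1 / (1 - R^2).
vif_manual <- function(predictors, data) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_disp <- vif_manual(c("wt", "disp", "carb"), mtcars)  # model with displacement
vif_cyl  <- vif_manual(c("wt", "cyl", "carb"), mtcars)   # model with cylinders
print(round(vif_disp, 2))
print(round(vif_cyl, 2))
```

The values for the cylinder model should agree with the vif(model) output shown earlier, and the displacement model should show the larger inflation for wt.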

Adjustment for drat

Fuel efficiency increases with the final drive ratio. This is likely because larger vehicles generally require bigger, more powerful engines as well as lower output-to-input gearing ratios. These lower ratios help provide the torque needed for acceleration and, we suppose, compensate for the lower engine RPMs resulting from design constraints imposed by the larger centrifugal forces in bigger engines. There is insufficient overlap between the automatic and manual transmission groups to allow a meaningful analysis.
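One way to make the overlap problem concrete is to compare the drat ranges of the two groups. The sketch below assumes the raw mtcars coding (am = 0 automatic, 1 manual); the variable names are illustrative.

```r
data(mtcars)

# Range of final drive ratios within each transmission group.
drat_by_am <- split(mtcars$drat, factor(mtcars$am, labels = c("automatic", "manual")))
ranges <- sapply(drat_by_am, range)   # row 1 = minimum, row 2 = maximum
print(ranges)

# Interval of drat values covered by both groups, and how many cars fall in it.
common <- c(max(ranges[1, ]), min(ranges[2, ]))
in_common <- mtcars$drat >= common[1] & mtcars$drat <= common[2]
print(table(am = mtcars$am[in_common]))
```

Only cars inside the common interval support a like-for-like comparison, which is why any adjusted estimate here rests on a small subset of the data.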

dt2 <- dt[drat>3.5 & drat<4.0]

#| echo: false
model <- lm(mpg ~ am+drat, data = dt2)
ggplot(dt2, aes(x = drat, y = mpg, color = am)) +
  geom_point(alpha=0.4) +
  geom_line(aes(y = predict(model, newdata = dt2)), linewidth = 1) +  # fitted line per transmission group
  geom_hline(yintercept = mean(dt2[am == "automatic"]$mpg), linetype = "dashed", linewidth = 1, colour = "coral") +  # automatic group mean
  geom_hline(yintercept = mean(dt2[am == "manual"]$mpg), linetype = "dashed", linewidth = 1, colour = "cyan") +  # manual group mean
  labs(
    title = "MPG vs Rear Axle Ratio by Transmission",
    x = "Final Drive Ratio",
    y = "Miles Per Gallon (MPG)",
    color = "Transmission"
  ) +
  theme_minimal()

Appendices

A0) Statistical variation in car design

To explore the statistical variation in car designs available in the USA in the 1970s, we can look at how variance is distributed in the mtcars dataset using PCA. It reveals three orthogonal directions containing most of the variance: 60%, 24% and 6%.

Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6    PC7
Standard deviation     2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
Cumulative Proportion  0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
                           PC8    PC9    PC10   PC11
Standard deviation     0.35057 0.2776 0.22811 0.1485
Proportion of Variance 0.01117 0.0070 0.00473 0.0020
Cumulative Proportion  0.98626 0.9933 0.99800 1.0000
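The object pca_result is used below, but its construction is not shown. A plausible sketch, assuming all eleven mtcars columns were centred and scaled before the decomposition, is given here under the hypothetical name pca_sketch.

```r
data(mtcars)

# PCA on standardised variables (assumed; the original call is not shown).
pca_sketch <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component.
var_share <- pca_sketch$sdev^2 / sum(pca_sketch$sdev^2)
print(round(var_share[1:3], 4))
```

Under that assumption the first three shares should match the proportions reported in the table above.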

Car bulkiness

The first principal component - the direction of greatest variance in the data is driven mainly by weight and engine size, with mileage (closely linked to engine size) and to a lesser extent the final drive ratio also contributing. This dimension approximates “bulk” i.e. large, heavy cars with powerful engines and poor mileage score high while smaller, lighter, more efficient cars score low.

#| echo: false
pc1_loadings <- pca_result$rotation[, 1]
filtered_pc1 <- pc1_loadings[abs(pc1_loadings) > 0.25]
filtered_pc1[order(-abs(filtered_pc1))]

    cyl.V1    disp.V1     mpg.V1      wt.V1      hp.V1      vs.V1    drat.V1 
 0.3739160  0.3681852 -0.3625305  0.3461033  0.3300569 -0.3065113 -0.2941514

bulk_scores <-pca_result$x[,1]

Car sportiness

The second principal component - the direction of greatest variance orthogonal to bulkiness lies in the direction of drag ability, transmission design (gears and type), the no. of carburettors and to a lesser extent the final drive ratio.

This second measure aligns with sportiness after controlling for bulk. Cars with V-6 and V-8 engines rather than inline engines, and with slightly higher power-to-weight ratios, score higher in terms of sportiness than cars of similar bulk that lack these features.

#| echo: false
pc2_loadings <- pca_result$rotation[, 2]
# PC2 loadings of the variables that load strongly on PC1 (|PC1 loading| > 0.25)
filtered_pc2 <- pc2_loadings[abs(pc1_loadings) > 0.25]
filtered_pc2[order(-abs(filtered_pc2))]

    drat.V1       hp.V1       vs.V1       wt.V1     disp.V1      cyl.V1 
 0.27469408  0.24878402 -0.23164699 -0.14303825 -0.04932413  0.04374371 
     mpg.V1 
 0.01612440

sport_scores <-pca_result$x[,2]

A1) U.S. car type classification in 1970s

The dataset (mtcars) spans a wide variety of car types. I applied a basic ruleset to classify cars into distinct types. Classification boundaries are shown below, along with rough efficiency strata estimates¹, the bulk direction in the hp ~ wt plane at the data mean (arrow indicating the principal component direction), and the actual sportiness measure for each car (point size).

Here’s the ruleset (in the spirit of the course, we could have identified these points as outliers, high-leverage or influential but I believe this approach is clearer and more transferable)…

getType <- function(hp, wt) {
  type <- rep("economy/compact", length(hp))
  type[wt > 3.0] <- "midsized/muscle"
  type[wt > 4.5] <- "luxury/superheavy"
  type[hp / wt > 60] <- "sports/muscle (hp)"
  return(as.factor(type))
}

sports/muscle (hp) (any car producing $60+$ hp per half-ton): Sports cars of the 1970s, as today, tended to be light for manoeuvrability and speed while maintaining a high power-to-weight ratio. Even lower-powered roadsters such as the Lotus Europa were stripped down and very basic inside to maximise performance through a high power-to-weight ratio.
luxury/superheavy: any other car exceeding $4.5$ half-tons. This class is exemplified by the Cadillac: heavy steel-bodied cars with big-block V8s that prioritised comfort over performance, resulting in low power-to-weight ratios.
midsized/muscle: any other car exceeding $3$ half-tons. Cars in this range are midsized saloons or two-door cars fitted with big-block engines, creating the classic American muscle car. These vehicles were tuned for straight-line acceleration, emphasising horsepower and torque over handling or efficiency while still being practical enough to serve as everyday cars.
economy/compact: few small, economical cars were produced domestically so this category mostly consisted of European and Japanese imports. These cars were lightweight, fuel-efficient and designed for practicality with smaller engines and simple interiors.
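To see how the ruleset plays out, the following sketch applies it to mtcars and looks up a few named cars. The function is copied from above so the example runs on its own; the car_type vector is illustrative.

```r
data(mtcars)

# Ruleset copied from above so that this example is self-contained.
getType <- function(hp, wt) {
  type <- rep("economy/compact", length(hp))
  type[wt > 3.0] <- "midsized/muscle"
  type[wt > 4.5] <- "luxury/superheavy"
  type[hp / wt > 60] <- "sports/muscle (hp)"  # applied last, so it overrides the weight classes
  as.factor(type)
}

# Classify every car and keep the model names for lookup.
car_type <- setNames(getType(mtcars$hp, mtcars$wt), rownames(mtcars))
print(table(car_type))
print(car_type[c("Cadillac Fleetwood", "Lotus Europa", "Toyota Corolla")])
```

Because the power-to-weight rule is applied last, a heavy but very powerful car is classed as sports/muscle (hp) rather than by its weight band.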

A1) Engine and drive system effect on mileage

A car’s efficiency curve is convex due to two competing effects:

fuel consumption generally increases with power (speed)
power (speed) is useful output, which increases efficiency

The mechanical efficiency of the various drive system components is detailed below:

Engine Efficiency: Fuel consumption increases with engine RPM

A four-stroke (Otto) cycle:

aspiration
compression
ignition/combustion (not a stroke)
expansion
exhaust

Ideally, engine RPM would match wheel RPM (avoiding gearbox losses) while still providing torque for acceleration in a lightweight, compact design. Engines use less fuel at lower RPM but face efficiency limits at idle. Lubrication relies on an oil sump where sloshing creates resistance and energy losses.

Note: engine power mtcars$hp is measured directly at the crank prior to any transmission losses (so-called brake horsepower).

Carburettor Efficiency: carburettors reduce mileage

In the days before fuel injection, engines drew air through a carburettor where it was mixed with fuel. The air-fuel ratio was determined by airflow (via the Bernoulli principle) and by prior tuning. For cold starts, a choke allowed the driver to temporarily enrich the mixture.

An enriched mixture made starting easier but reduced fuel economy. Sometimes unburned fuel would ignite in the hot exhaust, causing the familiar backfire of older cars. A lean mixture on the other hand, could make the engine stall or cause sluggish throttle response. Striking the right balance was always a challenge, especially in engines with multiple carburettors which tended to favour performance and enrichment over efficiency.

Transmission Efficiency:

Gearboxes introduce mechanical losses. In top gear, the ratio is typically close to 1:1, so the crankshaft and driveshaft turn in unison, matching engine and wheel RPM. Lower gears provide torque for acceleration, while higher gears enable efficient cruising.

Fuel consumption decreases with number of gears

Progressing sequentially through the gears allows the car to reach cruising speed quickly while keeping the engine near its optimal RPM. Too low a gear wastes fuel by over-revving; too high a gear lacks torque, slows acceleration and may even stall the engine. More gears provide finer steps improving efficiency and fuel economy.

Automatics of the 1970s were less efficient than manuals

In the 1970s, manual gearboxes were more efficient than automatics, which were heavy, mechanically complex, and relied on inefficient clutch designs. Modern electronic control has largely eliminated this disadvantage.

Differential Efficiency: lower final drive ratio increases efficiency

The rear axle ratio is the gearing between the transmission output and the rear wheels (taken as the average of the two, since the differential lets them turn at different speeds), with a higher ratio meaning more engine turns per wheel rotation. In top gear the gearbox ratio is typically close to 1:1, so the rear axle (aka final drive) ratio defines the overall gearing at cruising speed.

\[\boxed{\operatorname{drat}=\frac{\text{no. ring gear (output) teeth}}{\text{no. pinion (input) teeth}}}\]
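The effect of the ratio on engine speed can be worked through numerically. This is a sketch only: the tyre diameter of 26 inches is a hypothetical input (mtcars does not record tyre size), the top-gear ratio is assumed to be 1:1, and engine_rpm is an illustrative helper.

```r
# Engine speed implied by road speed, tyre size and gearing.
# tyre_diameter_in is a hypothetical input; mtcars does not record tyre size.
engine_rpm <- function(speed_mph, tyre_diameter_in, drat, gear_ratio = 1) {
  inches_per_minute <- speed_mph * 63360 / 60            # 63360 inches per mile
  wheel_rpm <- inches_per_minute / (pi * tyre_diameter_in)
  wheel_rpm * drat * gear_ratio                          # engine turns per minute
}

rpm_low  <- engine_rpm(60, 26, drat = 3.08)  # lower final drive ratio
rpm_high <- engine_rpm(60, 26, drat = 3.90)  # higher final drive ratio
print(round(c(rpm_low, rpm_high)))
```

At the same road speed, engine speed scales in direct proportion to drat, which is the mechanical reason a lower final drive ratio favours cruising economy.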

In the mockup, the gearing ratio between the worm gear and the gear on the outside of the differential housing determines the final drive ratio ².

rear differential mockup illustrating mechanism

A2) Causal pathway between MPG and transmission type

flowchart LR
  Engine[<b>Confounder</b>:<br>Engine Displacement]
  Gears{<i>Influencer</i>:<br>No. Gears}
  subgraph CausalPathway [<b><i>The Causal Pathway</i></b>]
    direction LR
    Transmission[<b>Exposure</b>:<br>Transmission<br>Manual/Automatic]
    Differential[<b>Mediator</b>:<br>Rear axle ratio]
    Weight{<i>Influencer</i>:<br>Weight}
    MPG[<b>Output</b>:<br>MPG]
  end
  Emissions[<b>Collider</b>:<br>Emissions Rating]
  
  Engine --> |some other pathway| Emissions
  Transmission --> |some other pathway| Emissions
  Differential --> |some other pathway| Emissions
  Engine --> Gears --> Transmission --> Differential --> Weight --> MPG --> Emissions

    carb
     |
     v
sportiness ---> mpg
     ^
     |
    cyl ---> mpg
     |
     v
    wt ---> mpg

transmission ---> mpg

Carburetors influence mpg indirectly via sportiness.
Cylinders and weight influence mpg directly.
Transmission influences mpg directly, independent of carburetors.
The non-significant ANOVA result for carb reflects that, once you control for cyl and wt, the direct effect of carburetors is minimal.
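The last point can be checked with a nested-model F-test, which asks whether carb adds anything once wt and cyl are already in the model. The sketch assumes the raw mtcars columns; the object names are illustrative.

```r
data(mtcars)

base_model <- lm(mpg ~ wt + cyl, data = mtcars)          # without carburettors
carb_model <- lm(mpg ~ wt + cyl + carb, data = mtcars)   # with carburettors

# F-test for the single extra term.
comparison <- anova(base_model, carb_model)
print(comparison)

carb_p <- comparison[["Pr(>F)"]][2]
print(round(carb_p, 4))
```

Since carb is the last term in the sequential ANOVA table shown earlier, this comparison should return the same p-value as the carb row there.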

A3) Exploratory Analysis - Transmission Type (Adjusted)

Initial thoughts and observations:

we don’t have a huge volume of data here
weight and displacement look like great general predictors of mpg, but mpg, weight and displacement are all collinear (remember to watch model stability if regressing on these covariates; better to avoid weight and displacement).
- The linearity is great; however, the variance is too low to fit a stable multivariable regression model (i.e. these points sit in a subspace of the plane, so we’ve hit the ceiling of the available data). Since mpg is what we want to study, we must avoid wt and disp when it comes to regression models.
the final columns clearly show that automatics use a lower final drive ratio and manuals use a higher final drive ratio. I suspect this is down to non-negligible gearing in the transmission at top gear.

propensity scores - lack of data overlap
VIF (variance inflation factor) - collinear data
F-value if we add interaction terms
outliers? residuals? cooks.dist?

mpg~hp+am (model)
mpg~carb+am (model)

mpg~drat+am (explain design, propensity score)
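Two of the checks noted above can be sketched in base R. The choice of wt as the only propensity covariate and of mpg ~ am + wt as the model to diagnose are assumptions made for illustration; the notes do not fix either.

```r
data(mtcars)

# Propensity of being manual given weight; scores close to 0 or 1
# indicate cars with no comparable counterpart in the other group.
ps_model <- glm(am ~ wt, data = mtcars, family = binomial)
propensity <- fitted(ps_model)
overlap <- sapply(split(propensity, mtcars$am), range)  # columns: am = 0, am = 1
print(round(overlap, 2))

# Influence of individual cars on a weight-adjusted model.
fit <- lm(mpg ~ am + wt, data = mtcars)
cooks <- cooks.distance(fit)
print(round(head(sort(cooks, decreasing = TRUE), 3), 3))
```

Cars with extreme propensity scores or large Cook's distances are the ones most likely to drive the apparent transmission effect.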

A5)

library(data.table)  # as.data.table(), := and fifelse()
library(readxl)      # read_excel()

url <- "https://www.fueleconomy.gov/feg/EPAGreenGuide/xls/all_alpha_25.xlsx"
destfile <- "./all_alpha_25.xlsx"

if (!file.exists(destfile)) {
  download.file(url, destfile, mode = "wb")
}

dModern <- as.data.table(read_excel(destfile))
dModern[,Fuel:=as.factor(Fuel)]
dModern<-dModern[Fuel=="Gasoline"|Fuel=="Diesel"] # avoid LPG and interesting hybrid/elec.
dModern[,Drive:=as.factor(Drive)]
dModern[,class:=as.factor(`Veh Class`)]; dModern[,`Veh Class` := NULL]
dModern[,Stnd:=as.factor(Stnd)] # emission standard
dModern[,Engine:=as.factor(`Underhood ID`)]; dModern[,`Underhood ID` := NULL]
dModern[,TransClass:=as.factor(Trans)]
dModern[, Trans := fifelse(grepl("^Man-", Trans), "Manual",
                      fifelse(grepl("^SemiAuto-", Trans), "Semi-Automatic",
                      fifelse(Trans == "CVT", "Automatic (CVT)",
                      fifelse(grepl("^AutoMan|^AMS-", Trans), "Automated Manual (AMT)",
                      "Automatic"))))]
dModern[,Trans:=as.factor(Trans)]
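The transmission recode can be tried without downloading the file. The sketch below uses base ifelse() in place of fifelse() and a handful of made-up codes in the style the patterns expect. Both the sample codes and the helper recode_trans are illustrative and are not taken from the EPA data.

```r
# Illustrative transmission codes (not read from the EPA file).
trans_codes <- c("Man-6", "SemiAuto-8", "CVT", "AutoMan-7", "AMS-7", "Auto-8")

# Same pattern order as the recode above: the first matching rule wins,
# and anything unmatched falls through to "Automatic".
recode_trans <- function(x) {
  ifelse(grepl("^Man-", x), "Manual",
  ifelse(grepl("^SemiAuto-", x), "Semi-Automatic",
  ifelse(x == "CVT", "Automatic (CVT)",
  ifelse(grepl("^AutoMan|^AMS-", x), "Automated Manual (AMT)",
         "Automatic"))))
}

recoded <- setNames(recode_trans(trans_codes), trans_codes)
print(recoded)
```

Note that "AutoMan-7" is not caught by the "^Man-" rule because the pattern is anchored at the start of the string.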

References

DATAtab. (2025). ANCOVA (analysis of covariance): A mix of ANOVA and···. https://www.youtube.com/watch?v=PngndHgZOgY.

Kmiec, P. (2016). The unofficial LEGO technic builder’s guide, 2nd edition. No Starch Press. https://www.nostarch.com/technicbuilder2

McCabe, D. (2025a). ANOVA workflows. R Pubs. https://rpubs.com/mccabe08/1303319

McCabe, D. (2025b). Regression i: Practical linear multivariable regression. R Pubs. https://rpubs.com/mccabe08/1335788

Peng, R. D. (2015). Regression models. Johns Hopkins University; Coursera. https://www.coursera.org/learn/regression-models/home/module/4

U.S. Environmental Protection Agency. (2025). Green vehicle guide: Downloadable data files. https://www.fueleconomy.gov/feg/EPAGreenGuide/xls/all_alpha_25.xlsx.

Footnotes

Predicted efficiency strata were calculated using a simple multivariable regression model (mpg ~ hp * wt * disp). I acknowledge that both these predicted strata and the car type decision boundaries are highly dataset-specific and somewhat subjective. The coplanar nature of the predictors makes the model fit extremely unstable, and the large number of parameters leaves few residual degrees of freedom. I include this model here mainly as a note: introducing interaction terms means the response surface is no longer a flat plane, and taking a lower-dimensional cross-section (as I have done) produces non-linear contours that illustrate how efficiency varies across combinations of power, weight, and displacement.
In the LEGO mockup, the worm gear turns the differential, and there are 4 ring gears (three fixed) housed inside the rotating casing.