Study questions: 1. “Is an automatic or manual transmission better for MPG” 2. “Quantify the MPG difference between automatic and manual transmissions”
We found that, based on the sample that was studied, there is a trend for manuals to on average have a higher MPG, but we cannot infer with much certainty that the transmission type (manual or auto) necessarily makes a difference to fuel mileage. The differences borne out by the data is most likely due to other factors such as vehicle weight and engine displacement. Without knowing how the sample was collected we cannot assume that it is representative of the population and we strongly suspect that it is not.
The average difference in MPG performance in the mtcars sample is that manual cars have a 7.245 (+- 1.764422 SE) higher miles per gallon performance compared to autos. But the transmission type only explains about 34% of the variation in the data.
When modeled along with vehicle weight as a predictor, it can be shown that manuals have a gas mileage advantage at at the low end of the max spectrum, but that the MPG deteriorates faster (an extra -5.30 +-1.44 SE mpg per 1000 lbs) as mass increases compared to autos. This model explains 81% of the variation, yet the data is too limited to able to draw firm conclusions. No data is available for autos in the low mass range and the opposite for manuals. The entire region of data overlap (in terms of mass) has overlapping confidence intervals.
package_names <- c("tidyverse", "GGally", "rstatix", "ggpubr", "car")
check_install_load_packages <- function(package_names){
for (i in package_names) {
if(!i %in% installed.packages()){
install.packages(i)
}
if(!i %in% (.packages())){
library(i, character.only = TRUE)
}
}
}
check_install_load_packages(package_names)
#(.packages())
data_raw <- mtcars
#save data into new data frame and convert categorical variables to factors
data_cars <- data_raw
data_cars$cyl <- factor(data_cars$cyl, levels = c("4", "6", "8"),
labels = c("four_cyl", "six_cyl", "eigth_cyl"))
data_cars$vs <- factor(data_cars$vs, levels = c("0", "1"),
labels = c("V_cyls", "Straight_cyls"))
data_cars$am <- factor(data_cars$am, levels = c("0", "1"),
labels = c("auto", "manual"))
Plot of the data
Let’s see if there is more to the data than this simple relationship by looking at how the performance varies by cylinders.
give.n <- function(x){
return(c(y = median(x)*1.2, label = length(x)))
# experiment with the multiplier to find the perfect position
}
boxPlot_mpg_am = ggplot(data = data_cars, aes(x = am, y = mpg)) +
xlab("Transmission (manual(1) or auto(0))") +
ylab("Miles per Gallon") +
geom_boxplot(aes(fill = am)) +
facet_wrap("cyl") +
stat_summary(fun.data = give.n, geom = "text", fun.y = median,
position = position_dodge(width = 0.75))
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
boxPlot_mpg_am
We note: 1. The mgp performance is clearly better for smaller numbers of
cylinders 2. The difference in mpg performance between autos and manuals
decreases for more cylinders 3. There are very few observations in each
group so all of this is pretty sketchy
We did a linear model predicting mpg by number of cylinders and found that it would have a greater R squared - i.e. 0.72 compared to about 0.34 for modelling on transmission type.
For a bit more detail on some of the other variables that we intuitively know are important we run the ggpairs() function from the GGally package. Refer to Plot 1 in the Appendix.
We note the following from the pairs plots: 1. The auto cars in the dataset were generally heavier 2. The auto cars in the dataset generally had higher displacement engines 3. The auto cars in the dataset tended to have more cylinders
Hence the auto cars will probably seem to be worse performers on mpg but probably because of confounders.
**Fist model: mpg_hat = beta0 + beta1*am_manual **
linmod <- lm(formula = mpg ~ am, data_cars)
summary(linmod)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## ammanual 7.244939 1.764422 4.106127 2.850207e-04
We can see from the plot and coefficients that mean MPG of the auto cars is 17.147 miles per gallon, which is the intercept. The difference in mean when going to the manual group is 7.245 miles per gallon (the slope or estimated coefficient “am”). I.e. the mean mpg for the manuals is 17.147 + 7.245 = 24.392 miles per gallon. This linear relationship only explains about 34% of the variation (R squared).
We are not surprised at this outcome, given what we noted in our data exploration.
Second model: mpg_hat = beta0 + beta1wt + beta2am_manual - i.e. we include weight as a predictor, which we reason should be the main real predictor. We add interaction so we can see the change in slope of considering manuals relative to autos. Weight is prefered as predictor above cyl and disp because it is more highly correlated with mpg, as can be shown with cor(x = mtcars[,c(“wt”, “cyl”, “disp”)], y = mtcars$mpg).
linmod2 <- lm(formula = mpg ~ wt * am , data_cars)
summary(linmod2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.416055 3.0201093 10.402291 4.001043e-11
## wt -3.785908 0.7856478 -4.818836 4.551182e-05
## ammanual 14.878423 4.2640422 3.489277 1.621034e-03
## wt:ammanual -5.298360 1.4446993 -3.667449 1.017148e-03
See plot 2 in the appendix. Manuals are more fuel efficient under roughly the 2800 pound mass. Thereafter, autos perform better.
But let’s check the residuals to see if our linear model is reliable. Residuals Diagnostic plots:
par(mfrow = c(2,2))
plot(linmod2)
The data looks roughly normal on the QQ plot but shows some departure at
the higher residuals. The top left hand plot shows some
heteroskedasticity, which violates our linear regression assumption of
equal variances in the residuals. We’ll try to remove the
heteroskedasticity by doing a log transform of the response.
linmod3 <- lm(formula = log(mpg) ~ wt * am , data_cars)
summary(linmod3)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7267980 0.15145179 24.607157 1.656822e-20
## wt -0.2414783 0.03939850 -6.129124 1.293878e-06
## ammanual 0.3898126 0.21383227 1.822983 7.900069e-02
## wt:ammanual -0.1539089 0.07244847 -2.124391 4.260222e-02
par(mfrow = c(2,2))
plot(linmod3)
We note now the residuals look a bit better but still not great -
indicating there is still a problem of unequal variance, which casts
doubt over the accuracy of our linear model. A square root and a cube
root transformation has a similar result.
After the slightly helpful log transformation, the influence of the transmission type has lost much of its statistical significance with the t test probability going from 0.16% to 8%. At a 95% confidence requirement we’ll fail to reject the null hypothesis that transmission type has no effect. Some additional residual plots are shown in the Appendix demonstrating that there isn’t a very strong relationship between mpg and transmission type after having removed the effect of weight.
We were not able, by linear regression, to determine a convincing relationship between transmission type and MPG performance. Not enough direct evidence exists in this data set to perform a meaningful comparison of the variables in question. Unequal variances result in inaccurate linear regression modeling outcomes.
#based on the questions, we'll limit our exploration to mpg and transmission, and sneak in displacement and weight
ggpairs(subset(data_cars, select = c(mpg, cyl, disp, wt, am) )
,ggplot2::aes(colour=am, alpha = 0.8)#,
#upper = list(continuous = "density", combo = "box_no_facet")
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
predlm = predict(linmod2, interval = "confidence")
data_cars_pred <- cbind(data_cars, predlm)
lmPlot_mpg_wt_am = ggplot(data_cars_pred, aes(x = wt, y = mpg, colour = am)) +
xlab("Weight x1000 lbs") +
ylab("Miles per Gallon") +
geom_point() +
geom_ribbon( aes(ymin = lwr, ymax = upr, fill = am, color = NULL), alpha = .15) +
geom_abline(intercept = coef(linmod2)[1], slope = coef(linmod2)[2], size = 0.8, colour = "red") +
geom_abline(intercept = coef(linmod2)[1] + coef(linmod2)[3],
slope = coef(linmod2)[2] + coef(linmod2)[4], size = 0.8, colour = "cyan")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
lmPlot_mpg_wt_am
Once again this shows that the data is too limited to able to draw many
conclusions. No data is available for autos in the low mass range and
the opposite for manuals. The entire region of data overlap (in terms of
mass) has overlapping confidence intervals.
This shows that the mpg values are not identically distributed around the mean when comparing between transmission type groups.
v_plot <- ggplot(data_cars, aes(x = am, y = mpg, fill = am)) +
geom_violin() +
stat_summary(fun = "mean",
geom = "crossbar",
width = 0.5,
color = "black")
v_plot
First we plot the residuals of mpg against weight after have removed the affect of transmission type.
There is a clear relationship between mpg and weight (wt) after having
removed the effect of transmission type (am).
Secondly we plot the residuals of mpg against transmission type after have removed the affect of weight.
There is not a clear relationship between mpg and transmission type (am)
after having removed the effect of weight (wt).
Standardized residual plot showing fanning out pattern - indicative of unequal variance:
standard_res <- rstandard(linmod3)
resplot1 = ggplot(data_cars_pred, aes(x = standard_res, y = log(mpg), colour = am)) +
xlab("Regression Standardized Residual") +
ylab("Log of Miles per Gallon") +
geom_point()
resplot1