Motor Trend Data Analysis: Influence of Transmissions on MPG

Executive Summary

We examine the mtcars data set to answer the following questions:

  • Is an automatic or manual transmission better for MPG?
  • Quantify the MPG difference between automatic and manual transmissions.

The exploratory data analysis and a t-test indicate that manual transmission gives better MPG than automatic transmission. A regression model (with predictor selection and outliers omitted) is built and explains ~ \(93\)% of the variance in MPG of the resulting data set. This model shows that, holding the other predictors fixed, manual transmission improves MPG by ~ \(2.19\). Note: all figures are in the Appendix of this report.

# Libraries
library(leaps); library(ggplot2); library(car)

Exploratory Data Analysis

Data Processing

# Load data set
data(mtcars); mydata<- mtcars
# Assign factor class to the categorical variables
vars <- c('cyl','vs','gear','carb','am')
for (elt in vars) mydata[,elt] <- factor(mydata[,elt])
# Label the transmission levels (0 = automatic, 1 = manual)
levels(mydata$am) <- c('automatic','manual')

Figure 1 shows the relationships between variables (e.g. correlations, and frequencies via histograms). It suggests that manual transmission is associated with higher MPG than automatic transmission (see also figure 2). A t-test is run below to check this observation.

T-test and Confidence Interval

Here, the null hypothesis is that the mean mpg does not differ between the two types of transmission (manual and automatic):

# T-Test on mpg from each transmission
T_test <- with(mydata,t.test(mpg~am))
# P-value and the 95% confidence interval 
T_test$p.value; T_test$conf.int 
## [1] 0.001373638
## [1] -11.280194  -3.209684
## attr(,"conf.level")
## [1] 0.95

The t-test above rejects the null hypothesis (p-value < 0.05, and the 95% confidence interval does not contain zero). The interval is for the automatic mean minus the manual mean; since it is entirely negative, manual transmission has the higher mean MPG, as suggested by the observation above. A regression model is built below to quantify this difference between types of transmission.
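
To make the direction of the difference explicit, the group means can be computed next to the test. This is a minimal sketch in base R; the names `d`, `group_means` and `welch` are illustrative and not part of the original analysis.

```r
data(mtcars)

# Transmission as a labelled factor (am: 0 = automatic, 1 = manual)
d <- data.frame(
  mpg = mtcars$mpg,
  am  = factor(mtcars$am, labels = c("automatic", "manual"))
)

# Mean mpg per transmission type
group_means <- tapply(d$mpg, d$am, mean)
round(group_means, 2)
# automatic    manual
#     17.15     24.39

# Welch two-sample t-test; the interval is for mean(automatic) - mean(manual)
welch <- t.test(mpg ~ am, data = d)
welch$conf.int

# Raw difference in means (manual minus automatic), about 7.24 mpg
unname(diff(group_means))
```

The raw gap of about 7.24 mpg is not adjusted for other variables such as weight or horsepower, which is why a regression model is still needed.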

Regression Model

Subset Regression Selection

Best-subset regression is used (because it inspects all possible combinations of predictors) to select predictors and build a regression model with the highest adjusted R-squared. Adjusted R-squared is preferred because plain R-squared never decreases when predictors are added. The technique is illustrated in figure 3 (obtained with the regsubsets() function), which shows that the regression model with only the intercept and the variable wt has the lowest adjusted R-squared value (~ \(0.74\)). The subset selection suggests using the predictors cyl, hp, wt, vs and am to get a high adjusted R-squared value.
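
The idea behind the subset search can be sketched in base R by fitting every non-empty combination of predictors and comparing adjusted R-squared values. This is a stand-in for illustration, not the regsubsets() call used in the report: it treats each factor as a single term, whereas regsubsets() works on the individual dummy columns, so the ranking may differ. The names `adj_r2`, `subsets`, `scores` and `best` are hypothetical.

```r
data(mtcars)
mydata <- mtcars
for (v in c("cyl", "vs", "gear", "carb", "am")) mydata[[v]] <- factor(mydata[[v]])

predictors <- setdiff(names(mydata), "mpg")

# Adjusted R-squared of mpg regressed on a given set of predictors
adj_r2 <- function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mydata)
  summary(fit)$adj.r.squared
}

# Every non-empty subset of the 10 predictors (2^10 - 1 = 1023 models)
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)
scores <- vapply(subsets, adj_r2, numeric(1))

# Weight alone versus the predictor set chosen in the report
adj_r2("wt")                               # about 0.74
adj_r2(c("cyl", "hp", "wt", "vs", "am"))

# Subset with the highest adjusted R-squared under this search
best <- subsets[[which.max(scores)]]
best
```

An exhaustive search is cheap here because there are only 10 candidate predictors; with many more, a dedicated routine such as regsubsets() is the practical choice.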

Fit Model

With the above selected predictors, the regression model Fit is as follows:

# Fit model
Fit <- lm(mpg~ cyl+hp+wt+vs+am, mydata)

The influence plot (figure 4, obtained with the influencePlot() function) shows that “Chrysler Imperial” is the most influential observation (largest circle), while “Maserati Bora” and “Porsche 914-2” both have high leverage. Cars with studentized residuals above +2 or below -2 (the vertical axis of the plot) are treated as outliers and omitted to fit a second regression model, Fit_2, as follows:

# Omit outliers
vars <- c('Fiat 128','Toyota Corolla','Chrysler Imperial','Volvo 142E','Datsun 710')
# Fit 2 model
Fit_2 <- lm(mpg~ cyl+hp+wt+vs+am, mydata[!(rownames(mydata) %in% vars),])
# Coefficients estimate and R-squared
round(summary(Fit_2)$coefficients[,1],2); round(summary(Fit_2)$r.squared,2)
## (Intercept)        cyl6        cyl8          hp          wt         vs1 
##       31.65       -2.25       -0.85       -0.03       -2.66        1.79 
##    ammanual 
##        2.19
## [1] 0.93
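
The quantities displayed in an influence plot can also be inspected numerically. The sketch below uses only base R functions (rstudent(), hatvalues(), cooks.distance()) on the first model; the names `infl` and `flagged` are illustrative. A strict numeric cutoff of 2 need not reproduce exactly the list of cars read off the plot above.

```r
data(mtcars)
mydata <- mtcars
for (v in c("cyl", "vs", "gear", "carb", "am")) mydata[[v]] <- factor(mydata[[v]])
Fit <- lm(mpg ~ cyl + hp + wt + vs + am, data = mydata)

# Per-car influence measures
infl <- data.frame(
  rstudent = rstudent(Fit),        # studentized residuals
  hat      = hatvalues(Fit),       # leverage
  cook     = cooks.distance(Fit),  # overall influence (Cook's distance)
  row.names = rownames(mydata)
)

# Cars whose studentized residual lies outside [-2, 2]
flagged <- rownames(infl)[abs(infl$rstudent) > 2]
flagged

# Cars ordered by Cook's distance, most influential first
head(infl[order(-infl$cook), ], 5)
```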

This second model explains ~ \(93\)% of the variance in the mpg variable, which indicates a good fit. Holding the other variables fixed, manual transmission increases MPG by ~ \(2.19\) (the ammanual coefficient) compared to automatic transmission. This confirms that manual transmission is better, as expected from the exploratory data analysis section.
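
A point estimate alone does not show how precisely the transmission effect is determined. As a hedged complement, the sketch below refits the second model and requests a 95% confidence interval for the ammanual coefficient with confint(); the variable name `omitted` is illustrative, and no interval values are quoted here because they are not reported in the original analysis.

```r
data(mtcars)
mydata <- mtcars
for (v in c("cyl", "vs")) mydata[[v]] <- factor(mydata[[v]])
mydata$am <- factor(mydata$am, labels = c("automatic", "manual"))

# Same cars omitted as in the second model above
omitted <- c("Fiat 128", "Toyota Corolla", "Chrysler Imperial",
             "Volvo 142E", "Datsun 710")
Fit_2 <- lm(mpg ~ cyl + hp + wt + vs + am,
            data = mydata[!(rownames(mydata) %in% omitted), ])

# Estimated manual-vs-automatic effect, other predictors held fixed
coef(Fit_2)["ammanual"]

# 95% confidence interval for that effect
confint(Fit_2, "ammanual", level = 0.95)
```

If the interval includes zero, the adjusted effect of transmission is not distinguishable from no effect at the 5% level, even though the point estimate is positive.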

Residual Diagnostics

In figure 5, the points in the Residuals vs. Fitted plot are randomly scattered, which supports the linearity assumption. Similarly, the Normal Q-Q plot shows that the residuals are approximately normally distributed, while the Scale-Location plot supports constant variance (homoscedasticity). Finally, the Residuals vs. Leverage plot shows that the great majority of the points, including the high-leverage ones, are inliers.
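
The four diagnostic plots discussed here are the ones base R draws for an lm object, and the visual reading of the Q-Q plot can be complemented with a formal normality test. The sketch below assumes the same second model as above; using shapiro.test() is an addition for illustration and not part of the original analysis.

```r
data(mtcars)
mydata <- mtcars
for (v in c("cyl", "vs")) mydata[[v]] <- factor(mydata[[v]])
mydata$am <- factor(mydata$am, labels = c("automatic", "manual"))
omitted <- c("Fiat 128", "Toyota Corolla", "Chrysler Imperial",
             "Volvo 142E", "Datsun 710")
Fit_2 <- lm(mpg ~ cyl + hp + wt + vs + am,
            data = mydata[!(rownames(mydata) %in% omitted), ])

# Residuals vs. Fitted, Normal Q-Q, Scale-Location, Residuals vs. Leverage
op <- par(mfrow = c(2, 2))
plot(Fit_2)
par(op)

# Shapiro-Wilk test on the residuals; a large p-value gives no evidence
# against normality (it does not prove normality)
sw <- shapiro.test(residuals(Fit_2))
sw$p.value
```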

Conclusions

The influence of the type of transmission (manual or automatic) on the dependent variable mpg of the mtcars data set is analysed with a t-test and a regression model. Manual transmission was found to be better for MPG, with an estimated improvement of ~ \(2.19\).

Appendix
