Executive Summary

In this analysis, I will look at a data set of cars (mtcars) and address the following two questions:

  1. Is an automatic or manual transmission better for MPG?

  2. Quantify the MPG difference between automatic and manual transmissions.

This analysis reveals a statistically significant difference in mean MPG between automatic and manual transmission cars. With all other variables in the final model held constant, we observe an estimated increase of 1.80921 MPG when moving from an automatic to a manual transmission.

Data Processing and Data Transformation

In this section, I load the data set and transform variables into factors.

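data(mtcars)
# Treat the discrete variables as categorical rather than numeric
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
# am is coded 0 = automatic, 1 = manual
mtcars$am <- factor(mtcars$am, labels = c('Automatic', 'Manual'))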
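A quick sanity check of the recoding can be useful at this point. The sketch below works on a copy named mt (an illustrative name, chosen so that the mtcars object above is left untouched) and cross-tabulates the original 0/1 coding of am against the new labels.

```r
# `mt` is an illustrative copy, not an object used elsewhere in this analysis.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# Rows: original 0/1 coding; columns: new factor labels.
recode_check <- table(original = datasets::mtcars$am, recoded = mt$am)
print(recode_check)

# Number of cars in each transmission group.
group_sizes <- table(mt$am)
print(group_sizes)
```

All the counts should fall on the diagonal of the first table, which confirms that 0 maps to Automatic and 1 maps to Manual.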

Exploratory Data Analysis

In this section, I explore the relationships between the variables in the mtcars data set using the plots presented in Appendices 1 and 2.

As can be seen from the pairs plot in Appendix 1, seven variables (cyl, disp, hp, drat, wt, vs, and am) appear to be correlated with MPG. The plot only suggests these relationships; to quantify them while accounting for the other variables, we need linear models.

Next, we turn our attention to the effect of transmission type on MPG. As can be seen from the box plot in Appendix 2, MPG is higher when the car has a manual transmission. This is our first hint of the effect of transmission on MPG, but we need regression analysis to quantify and verify this conclusion.
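The visual impressions from the two plots can be backed by simple numbers. The following is a minimal sketch, assuming the untouched numeric copy of the data (named raw here purely for illustration) so that cor() applies to every column. It lists the Pearson correlation of each variable with mpg and the mean and median MPG per transmission type.

```r
# Use the original numeric data; `raw` is an illustrative name.
raw <- datasets::mtcars

# Correlation of every variable with mpg, sorted from most negative to most positive.
cor_with_mpg <- sort(cor(raw)[, "mpg"])
cor_with_mpg <- cor_with_mpg[names(cor_with_mpg) != "mpg"]
print(round(cor_with_mpg, 2))

# Mean and median MPG per transmission type (0 = automatic, 1 = manual).
mpg_by_am <- aggregate(mpg ~ am, data = raw,
                       FUN = function(x) c(mean = mean(x), median = median(x)))
print(mpg_by_am)
```

These are unadjusted summaries: they describe the raw association and do not account for the other variables.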

Regression Analysis

In this section, I first build regression models with different sets of predictors and select the best-fitting one. Next, I analyze the residuals of the selected model.

Model Building and Model Selection

I start with a model containing all the variables as predictors of MPG. To select the predictors for the best-fit model, I next perform stepwise model selection. The step function refits the linear model multiple times, adding and dropping variables (forward selection and backward elimination) and comparing the candidate models by AIC. The resulting model keeps the variables that improve the fit enough to justify their added complexity and drops the rest.

model_init <- lm(mpg ~ ., data = mtcars)
model_best <- step(model_init, direction = "both")
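To make the selection criterion concrete, the sketch below repeats the search on an illustrative copy of the data (mt, full_fit, and step_fit are names introduced here, not objects from the analysis) and compares the AIC of the full and selected models. Note that step() ranks models with extractAIC(), which differs from AIC() by a constant for models fitted to the same data, so the ordering of the two models is the same.

```r
# Rebuild the factor-coded data on a copy (`mt` is an illustrative name).
mt <- datasets::mtcars
for (v in c("cyl", "vs", "gear", "carb")) mt[[v]] <- factor(mt[[v]])
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

full_fit <- lm(mpg ~ ., data = mt)
# trace = 0 suppresses the step-by-step log that step() prints by default.
step_fit <- step(full_fit, direction = "both", trace = 0)

# Terms kept by the search, and the AIC before and after selection.
print(formula(step_fit))
print(AIC(full_fit, step_fit))
```

A lower AIC for the selected model indicates a better trade-off between goodness of fit and the number of estimated parameters.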

The result of this analysis shows that the best-fit model contains three variables, cyl, wt, and hp, as confounders and one variable, am, as the predictor of interest. Details of this model are shown below:

summary(model_best)

The adjusted \(R^2\) of this model is 0.84, meaning that about 84% of the variation in MPG is explained by the above model after adjusting for the number of predictors.
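For readers who want to see where the adjusted value comes from, here is a minimal sketch that recomputes it from the ordinary \(R^2\) using the standard formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\). The model is refitted directly with the four predictors named above; mt and fit are illustrative names.

```r
# Refit the selected model on an illustrative copy of the data.
mt <- datasets::mtcars
mt$cyl <- factor(mt$cyl)
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

n <- nrow(mt)                  # number of observations
p <- length(coef(fit)) - 1     # estimated slopes, intercept excluded
r2 <- summary(fit)$r.squared

# Penalize R^2 for the number of estimated slopes.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(c(r_squared = r2, adjusted = adj_r2))
```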

Next, we compare this model with the model with only am as the predictor variable.

model_base <- lm(mpg ~ am, data = mtcars)
anova(model_base, model_best)

Because the p-value of this F-test is very small, we reject the null hypothesis that cyl, wt, and hp add nothing beyond am, and conclude that the three confounders contribute significantly to the model.
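The comparison above is an F-test for nested models. As a rough sketch of what anova() computes, the block below derives the F statistic by hand from the residual sums of squares and checks it against the anova() table. The object names (mt, base_fit, best_fit) are illustrative.

```r
mt <- datasets::mtcars
mt$cyl <- factor(mt$cyl)
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

base_fit <- lm(mpg ~ am, data = mt)
best_fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

# Residual sums of squares and residual degrees of freedom of both models.
rss0 <- deviance(base_fit); df0 <- df.residual(base_fit)
rss1 <- deviance(best_fit); df1 <- df.residual(best_fit)

# F = (drop in RSS per added parameter) / (residual variance of the larger model).
f_manual <- ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)
p_manual <- pf(f_manual, df0 - df1, df1, lower.tail = FALSE)

cmp <- anova(base_fit, best_fit)
print(c(F_by_hand = f_manual, F_anova = cmp$F[2], p_value = p_manual))
```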

Residuals and Diagnostics

In this section, I analyze the residual plots (presented in Appendix 3) along with the regression diagnostics.

par(mfrow=c(2, 2))
plot(model_best)

The Residuals vs. Fitted plot shows no obvious pattern in the scatter of points, which supports the independence condition. In the Normal Q-Q plot, the points mostly fall on the line, which supports the normality condition for the regression residuals. The points in the Scale-Location plot are scattered in a roughly constant band, which supports the constant variance condition. In the Residuals vs. Leverage plot we can visually identify some potential outliers, but we need regression diagnostics to identify them by name.

I identify the three cars with the highest leverage (hat values) and the three cars with the largest dfbetas for the transmission coefficient. The following analysis agrees with what we saw in the Residuals vs. Leverage plot.

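# Leverage (hat values): the three highest-leverage cars
outlier <- hatvalues(model_best)
tail(sort(outlier), 3)
# Influence of each car on each coefficient; column 6 is the am (Manual) coefficient
influential <- dfbetas(model_best)
tail(sort(influential[, 6]), 3)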
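As a complement to listing the top three values, a cutoff can help decide which cars deserve a closer look. The sketch below uses the common rule of thumb of flagging hat values above 2p/n, where p is the number of coefficients and n the number of cars. This cutoff is an assumption added here for illustration and is not part of the analysis above. It also selects the dfbetas column by name, assuming the coefficient is called amManual, which follows from the factor labels used earlier.

```r
mt <- datasets::mtcars
mt$cyl <- factor(mt$cyl)
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

h <- hatvalues(fit)
p <- length(coef(fit))   # number of coefficients, intercept included
n <- length(h)           # number of cars

# Hat values sum to p, so their average is p / n; flag values above twice that.
cutoff <- 2 * p / n
print(h[h > cutoff])

# Largest absolute influence on the transmission coefficient, selected by name.
db <- dfbetas(fit)
print(head(sort(abs(db[, "amManual"]), decreasing = TRUE), 3))
```

Sorting by absolute value, unlike the signed sort above, also surfaces cars that pull the coefficient downward.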

Inference

In this section, I run a two-sample t-test of the null hypothesis that manual and automatic transmission cars have equal mean MPG. The small p-value (0.001374) shows that mean MPG differs significantly between manual and automatic transmission cars.

t.test(mpg ~ am, data = mtcars)
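The individual pieces of the test result can be pulled out of the returned object, which makes the unadjusted difference easy to report. The following is a minimal sketch on an illustrative copy of the data (mt and tt are names introduced here). By default t.test() runs the Welch version of the test, which does not assume equal variances.

```r
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

tt <- t.test(mpg ~ am, data = mt)   # Welch two-sample t-test by default

# Group means, in factor-level order: Automatic first, then Manual.
group_means <- tt$estimate
diff_manual_minus_auto <- unname(group_means[2] - group_means[1])

print(group_means)
print(diff_manual_minus_auto)
print(tt$conf.int)   # 95% interval for mean(Automatic) - mean(Manual)
print(tt$p.value)
```

This difference is not adjusted for weight, horsepower, or cylinders, which is why it is larger than the transmission coefficient in the regression model.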

Conclusion

Using the best fit model, we conclude that:

  1. MPG is higher by an estimated 1.80921 units for cars with manual transmission than for cars with automatic transmission, with cyl, hp, and wt held constant.

  2. For each 1000 lb increase in the wt variable, MPG decreases by 2.49683 units.

  3. For each unit increase in hp variable, MPG decreases by 0.03211 units.

  4. Relative to a 4-cyl car, MPG is lower by 3.03134 units for a 6-cyl car and by 2.16368 units for an 8-cyl car.
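The estimates listed above can be read off the fitted model together with confidence intervals, which show how precisely each effect is estimated. This is a minimal sketch that refits the selected model on an illustrative copy of the data (mt, fit, and est are names introduced here).

```r
mt <- datasets::mtcars
mt$cyl <- factor(mt$cyl)
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

# Point estimates next to their 95% confidence intervals.
est <- cbind(estimate = coef(fit), confint(fit, level = 0.95))
print(round(est, 3))

# Coefficient table with standard errors, t statistics, and p-values.
print(round(coef(summary(fit)), 4))
```

An interval that contains zero means the data are compatible with no effect for that term once the other predictors are held constant, so the interval for the transmission coefficient is worth checking before interpreting its size.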

Appendix 1 - Pairs Plots

pairs(mpg ~ ., data = mtcars)

Appendix 2 - Boxplots

boxplot(mpg ~ am, data = mtcars, col = c("white", "white"), ylab = "MPG", xlab = "Transmission")

Appendix 3 - Residual Plots

par(mfrow=c(2, 2))
plot(model_best)