Executive Summary

In this report, we analyze the mtcars data set and explore the relationship between a set of variables and miles per gallon (MPG). The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). We use exploratory data analysis and regression models to examine how transmission type (automatic or manual) affects MPG. A t-test shows a statistically significant difference in mean MPG between cars with automatic and manual transmissions. The analysis focuses on two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions.

We use both a simple linear regression model and a multiple regression model. Both models support the conclusion that, for the cars in this study, manual transmissions have on average significantly higher MPG than automatic transmissions. However, other variables (weight and quarter-mile time) also have a significant influence on MPG, so the size of the transmission effect depends on which of them are included in the model. Further investigation with multivariable modelling is recommended.

Data Preprocessing

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
data("mtcars")
# prepare variable "am" (automatic and manual) – change to factor
mtcarsFact <- mtcars
mtcarsFact$am <- as.factor(mtcarsFact$am)
levels(mtcarsFact$am) <- c("Automatic", "Manual")
kable(mtcars[1:5,])
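|                  |  mpg| cyl| disp|  hp| drat|    wt|  qsec| vs| am| gear| carb|
|:-----------------|----:|---:|----:|---:|----:|-----:|-----:|--:|--:|----:|----:|
|Mazda RX4         | 21.0|   6|  160| 110| 3.90| 2.620| 16.46|  0|  1|    4|    4|
|Mazda RX4 Wag     | 21.0|   6|  160| 110| 3.90| 2.875| 17.02|  0|  1|    4|    4|
|Datsun 710        | 22.8|   4|  108|  93| 3.85| 2.320| 18.61|  1|  1|    4|    1|
|Hornet 4 Drive    | 21.4|   6|  258| 110| 3.08| 3.215| 19.44|  1|  0|    3|    1|
|Hornet Sportabout | 18.7|   8|  360| 175| 3.15| 3.440| 17.02|  0|  0|    3|    2|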
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Statistical Inference

The two variables of interest, “am” (transmission type) and “mpg” (miles per gallon), are plotted against each other to see whether a visual analysis indicates a possible relationship (code and violin plot of “mpg” and “am”: see Appendix A). The violin plot indicates that cars with manual transmissions tend to have higher mpg than cars with automatic transmissions. However, this is based on only 32 observations, a relatively small sample. A t-test is therefore carried out to test whether the difference is significant (R's t.test() uses Welch's unequal-variance test by default). The null hypothesis is that mean mpg is the same for both transmission types.

result <- t.test(mpg ~ am, data  = mtcars)
result$p.value
## [1] 0.001373638

The p-value is 0.0014. This is less than 0.05, so the null hypothesis is rejected: mean mpg differs significantly between automatic and manual transmissions. The size of this difference is examined next.
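The p-value alone does not show the direction or size of the difference. The following is a minimal sketch, not part of the original analysis, that pulls the group sizes, group means and confidence interval out of the same test. The names tt, group_sizes, group_means, mean_diff and ci are illustrative.

```r
# Welch two-sample t-test of mpg by transmission (am: 0 = automatic, 1 = manual)
data("mtcars")
tt <- t.test(mpg ~ am, data = mtcars)

group_sizes <- table(mtcars$am)   # number of cars per transmission type
group_means <- tt$estimate        # mean mpg in group 0 and in group 1
mean_diff   <- unname(group_means[2] - group_means[1])  # manual minus automatic
ci          <- tt$conf.int        # 95% CI for (automatic mean - manual mean)

print(group_sizes)
print(group_means)
print(mean_diff)
print(ci)
```

Note that t.test() reports the interval for the first group minus the second, so an interval lying entirely below zero corresponds to higher mean mpg for manual cars.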

Regression Analysis

To answer our first question, “Is an automatic or manual transmission better for MPG?”, we fit regression models.

Simple Linear regression model

The t-test indicates a significant difference in mpg between the two transmission types. The first model applied is a simple linear regression of mpg on am. It tests the same difference found by the t-test, and its adjusted R-squared value indicates how much of the variation in mpg is explained by transmission type alone.

fit1 <- lm(mpg ~ am, data = mtcars)
summary(fit1)$adj.r.squared
## [1] 0.3384589

The adjusted R-squared value of the linear model is 0.3385, i.e. it explains 33.8% of the variation in mpg. This is quite low, so we need to examine models with additional predictors.
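With a single 0/1 predictor, the coefficients of the simple model have a direct reading: the intercept is the mean mpg of automatic cars and the slope is the difference between the manual and automatic means. The sketch below checks this under the same coding of am as above. The names coefs, mean_auto and mean_manual are illustrative.

```r
# Simple linear regression of mpg on transmission type (0 = automatic, 1 = manual)
data("mtcars")
fit1  <- lm(mpg ~ am, data = mtcars)
coefs <- coef(fit1)

# Group means computed directly from the data
mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])

print(coefs)
print(c(automatic = mean_auto, manual = mean_manual))

# The intercept equals the automatic mean; the slope equals the difference in means
print(all.equal(unname(coefs["(Intercept)"]), mean_auto))
print(all.equal(unname(coefs["am"]), mean_manual - mean_auto))
```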

Multivariable regression model

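In this model we use multiple predictors to obtain a better fit. First, we need to find which variables to include. We start from the model containing all variables and let step() add and remove terms in both directions, keeping the combination with the lowest AIC.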

fit2 <- step(lm(mpg ~., data = mtcars), direction = "both", trace=0)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The variables that provide the best fit are “am”, “qsec” (quarter-mile time) and “wt” (weight). We generate a pairs plot of these variables (code and pairs plot of “mpg”, “am”, “qsec” and “wt”: see Appendix B). The pairs plot shows that there are a number of other correlations in addition to the one already established between “am” (transmission type) and “mpg”. These variables and correlations should be explored in a multivariable model.
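The correlations visible in the pairs plot can also be summarised numerically. As a rough sketch, assuming am is kept in its original 0/1 numeric coding so that Pearson correlation can be applied, the correlation matrix of the selected variables is computed below. The names selected and cor_matrix are illustrative.

```r
# Pairwise Pearson correlations between mpg and the selected predictors
data("mtcars")
selected   <- mtcars[, c("mpg", "wt", "qsec", "am")]
cor_matrix <- cor(selected)

print(round(cor_matrix, 2))
```

A strong correlation between wt and am, for example, would mean that part of the raw mpg gap between transmission types may reflect differences in weight rather than the transmission itself.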

This multivariable regression gives an adjusted R-squared value of 0.8336, suggesting that about 83% of the variance in mpg can be explained by the model. Moreover, holding weight and quarter-mile time constant, manual transmission delivers on average 2.9358 more mpg than automatic transmission (p = 0.047).
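Two further checks can support this reading: a nested-model F-test of whether adding wt and qsec improves on the simple model, and a confidence interval for the am coefficient. The sketch below refits both models so that it is self-contained. The formula for fit2 is copied from the step() output above, and the names model_comparison and am_ci are illustrative.

```r
# Refit both models so the example runs on its own
data("mtcars")
fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# F-test for the nested models: does adding wt and qsec reduce residual error?
model_comparison <- anova(fit1, fit2)

# 95% confidence interval for the adjusted transmission effect
am_ci <- confint(fit2, "am", level = 0.95)

print(model_comparison)
print(am_ci)
```

Because the p-value for am is close to 0.05, the lower end of this interval is expected to sit close to zero, so the adjusted effect should be interpreted with some caution.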

Diagnostics

Linear regression makes several assumptions about the data:

  • Linearity of the data. The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
  • Normality of residuals. The residual errors are assumed to be normally distributed.
  • Homoscedasticity. The residuals are assumed to have a constant variance.
  • Independence of the residual error terms.

You should check whether or not these assumptions hold true. Potential problems include:

  • Non-linearity of the outcome-predictor relationships
  • Heteroscedasticity: non-constant variance of the error terms
  • Presence of influential values in the data, which can be:
    • Outliers: extreme values in the outcome (y) variable
    • High-leverage points: extreme values in the predictor (x) variables

All these assumptions and potential problems can be checked by producing diagnostic plots that visualize the residual errors: simply call plot() on an lm object after fitting the model. The diagnostic plots for the multivariable model are shown in Appendix C (code and plot). The results are as follows:

  • The Residuals vs Fitted plot gives a broadly horizontal line with no distinct pattern. This indicates no evidence of a non-linear relationship.
  • The Normal Q-Q plot shows no outliers, and the points broadly follow the dashed line. This suggests that the residuals are approximately normally distributed.
  • The Scale-Location plot shows randomly scattered points, but the line is not horizontal. The variance appears to be roughly constant, but it is quite possible that other influences (variables) are at work.
  • The Residuals vs Leverage plot indicates no influential outliers (no points fall outside the Cook’s distance lines in the top-right or bottom-right corners).
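The visual checks above can be complemented with a few numbers from base R. This is a minimal sketch under stated assumptions: the Shapiro-Wilk test stands in for the Q-Q plot, and the cut-off of twice the mean leverage is a common rule of thumb rather than a fixed standard. The names normality, leverage, cooks and high_leverage are illustrative.

```r
# Numeric companions to the diagnostic plots for the selected model
data("mtcars")
fit2 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Normality of residuals: a small p-value would indicate departure from normality
normality <- shapiro.test(residuals(fit2))

# Leverage (hat values) and influence (Cook's distance) for each car
leverage <- hatvalues(fit2)
cooks    <- cooks.distance(fit2)

# Rule-of-thumb flag: leverage above twice the average leverage
high_leverage <- names(leverage)[leverage > 2 * mean(leverage)]

print(normality)
print(high_leverage)
print(head(sort(cooks, decreasing = TRUE), 3))  # three most influential cars
```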

Conclusion

To answer the two questions outlined in the executive summary:

  1. Manual transmission delivers a significantly higher mpg than automatic for the cars in this data set. Manual transmission is therefore ‘better’ in terms of mpg.
  2. Manual transmission delivers 2.94 more mpg than automatic transmission, holding weight and quarter-mile time constant (using the multivariable model). This model has an adjusted R-squared of 0.83, and the transmission coefficient has a p-value of 0.047.

Appendix A: Violin plot of mpg vs am

library(ggplot2)

g1 <- ggplot(mtcarsFact , aes(am, mpg))
g1 + geom_violin(aes(fill = am)) + geom_jitter(height = 0)

Appendix B: Plot of mpg vs optimal variables (am, qsec, wt)

library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
mtcarsFact[, c("mpg", "wt", "qsec", "am")] %>%
    ggpairs(
        mapping = ggplot2::aes(color = am),
        upper = list(continuous = gglegend("points")),
        lower = list(
            continuous = wrap("smooth", alpha = 0.2, size = 1),
            combo = wrap("dot")
        )
    )

Appendix C: Diagnostic plots

par(mfrow = c(2,2))
plot(fit2)