In this report, we analyze the mtcars data set and explore the relationship between a set of variables and miles per gallon (MPG). The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). We use exploratory data analysis and regression models to examine how transmission type (automatic or manual) affects MPG. A t-test shows a statistically significant difference in mean MPG between cars with automatic and manual transmissions. The analysis is focused on answering two questions:
We use both a simple linear regression model and a multiple regression model for our analysis. Both models support the conclusion that, in this study, cars with manual transmissions have on average significantly higher MPG than cars with automatic transmissions. However, other variables (weight and acceleration time) have a significant influence on this relationship, so further investigation and multivariable modelling are recommended.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
data("mtcars")
# prepare variable "am" (automatic and manual) – change to factor
mtcarsFact <- mtcars
mtcarsFact$am <- as.factor(mtcarsFact$am)
levels(mtcarsFact$am) <- c("Automatic", "Manual")
kable(mtcars[1:5,])
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The two variables of interest, “am” (transmission type) and “mpg” (miles per gallon), are plotted against each other to see whether a visual analysis indicates a possible relationship (code and violin plot of “mpg” and “am”: see Appendix A). The violin plot indicates that cars with manual transmissions tend to have higher mpg than cars with automatic transmissions. However, this is based on only 32 observations, which is a relatively small sample. A t-test is therefore carried out to test whether the difference is significant. The null hypothesis is that mean mpg is the same for both transmission types.
result <- t.test(mpg ~ am, data = mtcars)
result$p.value
## [1] 0.001373638
The p-value is 0.0013736, which is less than 0.05, so the null hypothesis is rejected at the 5% level. The alternative hypothesis, that mean mpg differs between automatic and manual transmissions, is examined next.
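To make the size of the difference concrete, the following minimal sketch pulls the group means and the 95% confidence interval out of the same Welch t-test. The names `tt`, `group_means`, `mean_diff` and `ci` are illustrative choices, not part of the original analysis. In the `mtcars` coding, `am = 0` is automatic and `am = 1` is manual.

```r
data("mtcars")

# Welch two-sample t-test of mpg by transmission type (same call as above)
tt <- t.test(mpg ~ am, data = mtcars)

# Mean mpg per group: first element is am = 0 (automatic), second is am = 1 (manual)
group_means <- unname(tt$estimate)
mean_diff <- group_means[2] - group_means[1]  # manual minus automatic

# 95% confidence interval for (automatic mean - manual mean)
ci <- tt$conf.int

print(group_means)
print(mean_diff)
print(ci)
```

An interval that excludes zero is consistent with rejecting the null hypothesis at the 5% level.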
To answer our first question, “Is an automatic or manual transmission better for MPG?”, we fit regression models.
The t-test indicates a significant difference in mpg between the two transmission types. The first model is a simple linear regression of mpg on am. It tests the difference found in the t-test above, and its adjusted R-squared value indicates how much of the variation in mpg is explained by transmission type alone.
fit1 <- lm(mpg ~ am, data = mtcars)
summary(fit1)$adj.r.squared
## [1] 0.3384589
The adjusted R-squared value of the simple linear model is 0.3384589, i.e. transmission type alone explains about 33.8% of the variation in mpg. This is quite low, so we need to examine models with additional variables.
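With a single binary predictor, the simple regression restates the group comparison: the intercept is the mean mpg of the reference group (automatic, `am = 0`) and the slope is the difference between the two group means. A short sketch to check this, where `group_means` and `slope_from_means` are illustrative names:

```r
data("mtcars")

fit1 <- lm(mpg ~ am, data = mtcars)

# Mean mpg for automatic (am = 0) and manual (am = 1) cars
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Slope implied by the group means: manual mean minus automatic mean
slope_from_means <- unname(group_means["1"] - group_means["0"])

print(coef(fit1))         # intercept and slope from the regression
print(group_means)        # raw group means for comparison
print(slope_from_means)
```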
The next model uses multiple variables. First, we need to find which variables to include. We start from the model containing all variables and let step() add and drop terms, which by default selects the model with the lowest AIC.
fit2 <- step(lm(mpg ~., data = mtcars), direction = "both", trace=0)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The variables selected for the best fit are “am”, “qsec” (quarter-mile time, i.e. acceleration time) and “wt” (weight). We generate a pairs plot of these variables (code and pairs plot of “mpg” against “am”, “qsec” and “wt”: see Appendix B). The pairs plot shows that there are a number of other correlations in addition to the one between “am” (transmission type) and “mpg” established above. These variables and correlations should be explored in a multivariable model.
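The correlations visible in the pairs plot can also be read off numerically. The sketch below is an addition to the original analysis. It computes Pearson correlations on the numeric `mtcars` columns, where `am` is coded 0/1, and the names `vars` and `cor_mat` are illustrative.

```r
data("mtcars")

# Pairwise Pearson correlations among mpg and the three selected predictors
vars <- c("mpg", "wt", "qsec", "am")
cor_mat <- cor(mtcars[, vars])

print(round(cor_mat, 2))
```

The entry for `wt` and `am` is worth checking in particular: if weight and transmission type are correlated, part of the raw mpg difference between transmission types may reflect weight.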
The multivariable regression model has an adjusted R-squared value of 0.8336, suggesting that about 83% of the variance in mpg is explained by the model. Moreover, holding weight and acceleration time constant, cars with manual transmissions achieve an estimated 2.9358 more mpg than cars with automatic transmissions (p = 0.047).
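Two further checks help quantify this result: a confidence interval for the adjusted transmission effect, and an F-test comparing the transmission-only model with the three-variable model. The following is a minimal sketch that refits both models explicitly so it can run on its own. The names `am_ci` and `cmp` are illustrative and not part of the original analysis.

```r
data("mtcars")

fit1 <- lm(mpg ~ am, data = mtcars)              # transmission only
fit2 <- lm(mpg ~ wt + qsec + am, data = mtcars)  # formula selected by step()

# 95% confidence interval for the manual-vs-automatic difference,
# adjusted for weight and quarter-mile time
am_ci <- confint(fit2)["am", ]
print(am_ci)

# F-test for nested models: do wt and qsec improve on the transmission-only model?
cmp <- anova(fit1, fit2)
print(cmp)
```

A wide interval for `am` would indicate that, although the adjusted effect is significant at the 5% level, its size is estimated imprecisely from 32 cars.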
Linear regression makes several assumptions about the data, such as:
You should check whether or not these assumptions hold true. Potential problems include:
All these assumptions and potential problems can be checked by producing diagnostic plots that visualize the residuals. Calling plot() on an lm object after fitting the model produces them. A plot is generated to examine the model and check the diagnostics (code and plot: see Appendix C). The results are as follows:
To answer the two questions outlined in the executive summary:
# Appendix A: violin plot of mpg by transmission type
library(ggplot2)
g1 <- ggplot(mtcarsFact, aes(am, mpg))
g1 + geom_violin(aes(fill = am)) + geom_jitter(height = 0)
# Appendix B: pairs plot of mpg, wt, qsec and am
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
# columns 1, 6, 7 and 9 are mpg, wt, qsec and am
mtcarsFact[, c(1, 6, 7, 9)] %>%
  ggpairs(
    mapping = ggplot2::aes(color = am),
    upper = list(continuous = gglegend("points")),
    lower = list(continuous = wrap("smooth", alpha = 0.2, size = 1), combo = wrap("dot"))
  )
# Appendix C: diagnostic plots for the multivariable model, in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(fit2)
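The four diagnostic plots can be complemented with a few numerical checks. The sketch below is an addition, not part of the original analysis. It refits the selected model so it runs on its own, and the names `sw`, `cooks` and `lev` are illustrative. No cut-off values are applied here, so interpreting the numbers is left to the reader.

```r
data("mtcars")

fit2 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Shapiro-Wilk test on the residuals (null hypothesis: residuals are normally distributed)
sw <- shapiro.test(residuals(fit2))
print(sw$p.value)

# Influence measures per car: Cook's distance and leverage (hat values)
cooks <- cooks.distance(fit2)
lev <- hatvalues(fit2)

# Cars with the largest Cook's distance and the largest leverage
print(head(sort(cooks, decreasing = TRUE), 3))
print(head(sort(lev, decreasing = TRUE), 3))
```

Cars that appear at the top of both lists are the ones to look for in the residuals-versus-leverage panel of `plot(fit2)`.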