This report explores the relationship between transmission type (manual or automatic) and miles per gallon (MPG), based on the mtcars dataset. It addresses two questions: which type of transmission is better for MPG, and how large the difference in MPG is. The analysis uses a simple linear regression model and a multiple regression model, each with hypothesis testing. Both models indicate that, in this sample, cars with manual transmissions had significantly higher MPG on average than cars with automatic transmissions. Data visualisation is presented in the Appendix section.
data(mtcars)
head(mtcars, n = 3)
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
ModelFit <- lm(mpg ~ am, data = mtcars)
summary(ModelFit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
summary(ModelFit)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
The intercept (beta0) is the mean MPG for cars with automatic transmissions (am = 0); the am coefficient (beta1) is the mean increase in MPG for cars with manual transmissions (am = 1), so beta0 + beta1 is the mean MPG for cars with manual transmissions. The estimated mean difference in MPG is therefore 7.244939.
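This reading of the coefficients can be checked against the raw group means. The snippet below is a minimal sketch; the names `fit`, `b`, `mean_auto` and `mean_manual` are introduced here for illustration and are not part of the original analysis.

```r
data(mtcars)

# Refit the simple model and pull out its two coefficients
fit <- lm(mpg ~ am, data = mtcars)
b <- unname(coef(fit))  # b[1] = intercept, b[2] = am coefficient

# Raw mean MPG in each transmission group
mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])

# The intercept should match the automatic mean,
# and intercept + slope should match the manual mean
isTRUE(all.equal(b[1], mean_auto))
isTRUE(all.equal(b[1] + b[2], mean_manual))
```

With a single 0/1 predictor, least squares reproduces the two group means exactly, so both comparisons should evaluate to TRUE.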
The 95% confidence interval for beta1 (the mean MPG difference) is computed as follows:
alpha <- 0.05
n <- length(mtcars$mpg)
pe <- coef(summary(ModelFit))["am", "Estimate"]
se <- coef(summary(ModelFit))["am", "Std. Error"]
t <- qt(1 - alpha/2, n - 2)
pe + c(-1, 1) * (se * t)
## [1] 3.64151 10.84837
Because the interval does not contain zero, we can reject the null hypothesis of no difference in favor of the alternative: there is a significant difference in MPG between the two groups at alpha = 0.05.
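As a cross-check on the hand-computed interval, the same result can be obtained from `confint()` and from a pooled-variance two-sample t-test, which is equivalent to the simple regression on a 0/1 predictor. This is a sketch; `fit`, `ci_lm` and `tt` are illustrative names.

```r
data(mtcars)
fit <- lm(mpg ~ am, data = mtcars)

# Interval for the am coefficient straight from the fitted model
ci_lm <- confint(fit, "am", level = 0.95)

# Pooled-variance t-test; manual is listed first so the
# estimated difference is manual minus automatic
tt <- t.test(mtcars$mpg[mtcars$am == 1],
             mtcars$mpg[mtcars$am == 0],
             var.equal = TRUE)

ci_lm
tt$conf.int
```

Both intervals should agree with the one computed by hand above.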
The following predictor variables are included in the analysis: wt (weight), qsec (1/4 mile time) and am (transmission type). The model is built step by step:
1) Start with the predictor whose correlation with mpg is highest (wt).
2) Remove the variables that are highly correlated with wt.
3) Add the remaining predictor, qsec.
4) Finally, add am to see whether it is a significant predictor.
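The steps above can be sketched in code as follows. This is one possible way to carry them out, not necessarily the exact procedure used for the report; the object names `cors`, `m1`, `m2`, `m3` and `steps` are assumptions made for illustration.

```r
data(mtcars)

# Step 1: rank the candidate predictors by absolute correlation with mpg
cors <- cor(mtcars)[, "mpg"]
cors <- cors[names(cors) != "mpg"]
sort(abs(cors), decreasing = TRUE)

# Step 2: inspect how strongly the other variables correlate with wt
sort(abs(cor(mtcars)[, "wt"]), decreasing = TRUE)

# Steps 3 and 4: grow the model one term at a time
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- update(m1, . ~ . + qsec)
m3 <- update(m2, . ~ . + am)

# Sequential F-tests: does each added term improve the fit?
steps <- anova(m1, m2, m3)
steps
```

The last row of the `anova()` table tests whether adding am improves on the model with wt and qsec; its p-value matches the t-test p-value for am in the full model.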
MultiFit <- lm(mpg ~ wt + qsec + am, data=mtcars)
summary(MultiFit)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
coef(summary(MultiFit))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## am 2.935837 1.4109045 2.080819 4.671551e-02
Holding wt and qsec fixed, the estimated mean difference in MPG is 2.935837. The 95% confidence interval for the am coefficient (the adjusted mean MPG difference) is as follows:
alpha <- 0.05
pe <- coef(summary(MultiFit))["am", "Estimate"]
se <- coef(summary(MultiFit))["am", "Std. Error"]
# Use the model's residual degrees of freedom (28, as in the summary above), not n - 2
t <- qt(1 - alpha/2, df.residual(MultiFit))
round(pe + c(-1, 1) * (se * t), 4)
## [1] 0.0457 5.8259
Because this interval also excludes zero, we can again reject the null hypothesis in favor of the alternative: after adjusting for wt and qsec, there is a significant difference in MPG between the two groups at alpha = 0.05.
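As a cross-check, R's built-in `confint()` computes the interval from the model's residual degrees of freedom (n minus the number of estimated coefficients, which is 28 for this model). The sketch below refits the model under the illustrative name `multi_fit`.

```r
data(mtcars)
multi_fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Residual degrees of freedom: 32 observations minus 4 coefficients
df.residual(multi_fit)

# 95% confidence interval for the am coefficient
ci_am <- confint(multi_fit, "am", level = 0.95)
ci_am
```

The lower bound is close to zero, which is consistent with the am p-value of about 0.047 in the model summary.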
The analysis confirms that there is a difference in MPG associated with transmission type. In the simple model, the mean MPG difference is 7.24 MPG, while the multiple regression model, which adjusts for wt and qsec, gives a difference of 2.94 MPG.
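One plausible reading of the gap between the two estimates is that transmission type is related to the other predictors, so part of the unadjusted difference reflects weight rather than transmission. This is an interpretation rather than a result established above; the sketch below, with illustrative names `simple_fit`, `multi_fit` and `wt_by_am`, shows how it could be examined.

```r
data(mtcars)
simple_fit <- lm(mpg ~ am, data = mtcars)
multi_fit  <- lm(mpg ~ wt + qsec + am, data = mtcars)

# am coefficient before and after adjusting for wt and qsec
c(simple   = unname(coef(simple_fit)["am"]),
  adjusted = unname(coef(multi_fit)["am"]))

# Mean weight (in 1000 lbs) by transmission: 0 = automatic, 1 = manual
wt_by_am <- aggregate(wt ~ am, data = mtcars, FUN = mean)
wt_by_am
```

In mtcars the automatic cars are heavier on average, and weight has a negative coefficient in the multiple regression model.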
This section contains basic exploratory data analysis and all the required visualisations supporting the final conclusion.
The boxplots below show that, in this data set, cars with manual transmissions generally have higher MPG than cars with automatic transmissions.
library(ggplot2)
ggplot(data = mtcars, aes(x = as.factor(am), y = mpg)) + geom_boxplot() + labs(x = "Transmission type: 0 - Automatic, 1 - Manual", y = "MPG") + ggtitle("Comparison")
The scatterplots below visualise the pairwise correlations among mpg, wt, qsec and am; a moderate association can be seen.
mtcarsv <- mtcars[, c("mpg", "wt", "qsec", "am")]
pairs(mtcarsv, panel = panel.smooth, col = "blue")
plot(MultiFit)
The diagnostic plots above support the model assumptions. The Residuals vs Fitted plot shows no clear pattern, which suggests that the residuals and fitted values are independent. The points of the Normal Q-Q plot lie close to the line, which suggests that the residuals are approximately normally distributed. The random scatter in the Scale-Location plot supports the constant variance assumption. As all the points fall within the 0.5 Cook's distance contour, the Residuals vs Leverage plot suggests that there are no influential outliers.
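The visual checks can be complemented with a few numeric diagnostics. The following is a minimal sketch rather than part of the original analysis; the choice of the Shapiro-Wilk test and the names `multi_fit`, `res`, `sw` and `cd` are assumptions made here.

```r
data(mtcars)
multi_fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
res <- residuals(multi_fit)

# Shapiro-Wilk test: a small p-value would cast doubt on normal residuals
sw <- shapiro.test(res)
sw

# Cook's distance: the three most influential observations
cd <- cooks.distance(multi_fit)
head(sort(cd, decreasing = TRUE), 3)

# Leverage (hat values): the three highest-leverage observations
head(sort(hatvalues(multi_fit), decreasing = TRUE), 3)
```

Observations with large Cook's distance or leverage are worth inspecting by name before concluding that no single car drives the fit.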