Motor Trend, a magazine about the automobile industry, is examining a data set of cars to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. They are particularly interested in two questions: whether an automatic or manual transmission is better for MPG, and how large the MPG difference between the two transmission types is.
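Our analysis concludes that manual transmission is better than automatic for MPG. In a simple comparison, manual cars average 7.245 mpg more than automatic cars; after adjusting for number of cylinders, displacement, horsepower, and weight, the estimated advantage shrinks to about 1.8 mpg.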
data(mtcars) #load the dataset
str(mtcars) #basic understanding of the data
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Transformation of variable into factor
mtcars$am <- factor(mtcars$am,labels=c('Automatic','Manual'))
First, an initial linear regression model is fitted with mpg regressed on transmission type (am). In appendix plot 1, the boxplot shows that cars with manual transmission achieve higher mpg than those with automatic transmission. The following linear regression model quantifies this comparison.
fit0 <- lm(mpg ~ am, data = mtcars)
round(summary(fit0)$coef, 4)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.1474 1.1246 15.2475 0e+00
## amManual 7.2449 1.7644 4.1061 3e-04
The expected mpg is 17.147 for automatic transmission, and manual transmission is 7.245 mpg higher. The p-value for this difference (3e-04) is far below the 0.05 benchmark, indicating that the difference in mpg between manual and automatic transmission is statistically significant.
The R-squared is 0.3597989, showing that our simple linear model explains only around 36% of the variability. Hence, additional variables have to be considered in order to account for the variation in mpg.
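To make the interpretation of the coefficients concrete, the sketch below checks that the intercept equals the mean mpg of automatic cars and that intercept plus slope equals the mean mpg of manual cars. The names `cars`, `fit`, `group_means`, `from_model`, and `r_squared` are illustrative and not part of the original analysis; a copy of the data is used so the session's `mtcars` is left alone.

```r
# Work on a copy (hypothetical name `cars`) so the session's mtcars is untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)

# Mean mpg within each transmission type
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept = mean for Automatic; intercept + slope = mean for Manual
from_model <- c(Automatic = unname(coef(fit)[1]),
                Manual    = unname(sum(coef(fit))))

print(round(rbind(group_means, from_model), 3))

# Share of the mpg variance explained by transmission type alone
r_squared <- summary(fit)$r.squared
print(r_squared)
```

With a single two-level factor as predictor, the regression is simply a comparison of two group means, which is why the two rows printed above should agree.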
In appendix plot 2, the pairwise plots of all the variables in mtcars show the strength of the correlation between each pair of variables. Taking mpg as our main interest, its correlations with the number of cylinders (cyl), displacement (disp), gross horsepower (hp), and weight (wt) are stronger than with the other variables.
From the EDA, the variables most strongly correlated with mpg have been identified. They are used to fit a new regression model, together with our original predictor, am.
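One way to back up the visual impression numerically is to rank the variables by the absolute value of their correlation with mpg. This is a minimal sketch, not the original author's code; `cars`, `predictors`, `mpg_cor`, and `ranked` are illustrative names, and the untransformed data set is used because `cor()` needs numeric columns.

```r
# Fresh numeric copy of the data (am is still 0/1 here, so cor() applies)
cars <- datasets::mtcars

# Pearson correlation of mpg with every other variable
predictors <- setdiff(names(cars), "mpg")
mpg_cor <- sapply(cars[predictors], function(x) cor(cars$mpg, x))

# Rank by strength of association, ignoring the sign
ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

The variables at the top of this ranking are the natural candidates to add to the model alongside am.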
fit1 <- lm(mpg ~ am + I(factor(cyl)) + disp + hp + wt, data = mtcars)
round(summary(fit1)$coef, 4)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.8643 2.6954 12.5637 0.0000
## amManual 1.8061 1.4211 1.2709 0.2155
## I(factor(cyl))6 -3.1361 1.4691 -2.1347 0.0428
## I(factor(cyl))8 -2.7178 2.8981 -0.9378 0.3573
## disp 0.0041 0.0128 0.3202 0.7515
## hp -0.0325 0.0140 -2.3228 0.0286
## wt -2.7387 1.1760 -2.3289 0.0282
In the multivariable regression model, the R-squared of 0.8664276 shows that the model explains around 87% of the variability. After adjustment, the estimated advantage of manual over automatic transmission is about 1.8 mpg, although this coefficient is no longer statistically significant at the 0.05 level (p = 0.2155).
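The uncertainty around the adjusted transmission effect can be expressed as a confidence interval. The sketch below refits the same model on a copy of the data and extracts the 95% interval for the `amManual` coefficient; `cars`, `adj_fit`, and `am_ci` are illustrative names, and `factor(cyl)` is used in place of `I(factor(cyl))`, which gives the same fit with shorter coefficient labels.

```r
# Copy of the data with am as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Same predictors as the multivariable model in the text
adj_fit <- lm(mpg ~ am + factor(cyl) + disp + hp + wt, data = cars)

# 95% confidence interval for the adjusted manual-vs-automatic difference
am_ci <- confint(adj_fit, "amManual", level = 0.95)
print(round(am_ci, 3))
```

Because the coefficient's p-value is above 0.05, this interval is expected to include zero, so the data are compatible with a small or even absent adjusted difference.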
anova(fit0, fit1)
The initial linear regression model and the multivariable regression model were compared with anova. The p-value is extremely small and well below the 0.05 benchmark, suggesting that the additional variables significantly improve the model. Hence, we reject the null hypothesis that the simpler model is sufficient.
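For reference, the F statistic and p-value of the nested-model comparison can be pulled out of the anova table directly. This is a sketch under the assumption that the two models are refitted as below; `cars`, `small_fit`, `big_fit`, `cmp`, `f_stat`, and `p_val` are illustrative names.

```r
# Copy of the data with am as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Nested pair: transmission only vs. transmission plus four covariates
small_fit <- lm(mpg ~ am, data = cars)
big_fit   <- lm(mpg ~ am + factor(cyl) + disp + hp + wt, data = cars)

# F test of whether the extra terms reduce the residual sum of squares
cmp <- anova(small_fit, big_fit)
f_stat <- cmp[2, "F"]
p_val  <- cmp[2, "Pr(>F)"]
cat("F =", f_stat, " p =", p_val, "\n")
```

The second row of the table holds the comparison; its `Df` entry counts the coefficients added by the larger model.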
In appendix plot 3, the “Residuals vs Fitted” plot shows that the points are randomly scattered with no obvious pattern, so the residuals appear homoscedastic. The “Normal Q-Q” plot shows no clear evidence of non-normality, suggesting the residuals are approximately normally distributed.
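The visual reading of the Q-Q plot can be supplemented with a formal test. The original analysis does not include one; the sketch below uses `shapiro.test` from base R's stats package as one common choice, with `cars`, `adj_fit`, and `resid_check` as illustrative names.

```r
# Copy of the data with am as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

adj_fit <- lm(mpg ~ am + factor(cyl) + disp + hp + wt, data = cars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
resid_check <- shapiro.test(residuals(adj_fit))
print(resid_check)
```

A p-value above 0.05 would be consistent with the plot-based conclusion, though with only 32 observations the test has limited power.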
leverage <- hatvalues(fit1)
summary(leverage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1120 0.1737 0.2037 0.2188 0.2538 0.5121
The diagnostics show no obviously dominant leverage points or influential values.
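A conventional rule of thumb flags observations whose hat value exceeds twice the mean hat value. Since the largest value reported above (0.5121) is greater than 2 × 0.2188 = 0.4375, at least one car will be flagged and is worth a closer look. The sketch below lists such cars and checks their influence on the transmission coefficient; the cutoff is a rule of thumb rather than part of the original analysis, and `cars`, `adj_fit`, `lev`, `cutoff`, `high_lev`, and `infl` are illustrative names.

```r
# Copy of the data with am as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

adj_fit <- lm(mpg ~ am + factor(cyl) + disp + hp + wt, data = cars)

# Leverage: flag cars above twice the average hat value
lev <- hatvalues(adj_fit)
cutoff <- 2 * mean(lev)
high_lev <- lev[lev > cutoff]
print(round(high_lev, 4))

# Influence of each car on the amManual coefficient, largest first
infl <- dfbetas(adj_fit)[, "amManual"]
print(round(head(sort(abs(infl), decreasing = TRUE), 3), 3))
```

High leverage alone does not make a point problematic; it matters mainly when the same car also has a large dfbetas value.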
Finally, a t-test is conducted to summarize the difference in mpg between automatic and manual transmission.
t.test(mpg ~ am, data = mtcars)$p.value
## [1] 0.001373638
The p-value is 0.001374, which is lower than the 0.05 benchmark, indicating that the difference is significant. This result reinforces our earlier conclusion that manual transmission is better than automatic for mpg.
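It may help to see how this t-test relates to the first regression model. By default `t.test` performs Welch's test, which does not assume equal variances; with `var.equal = TRUE` it becomes the pooled two-sample test, which is the same test as the slope test in the simple model of mpg on am. The names `cars`, `welch_p`, `pooled_p`, and `lm_p` below are illustrative.

```r
# Copy of the data with am as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Welch test (default) and pooled-variance test
welch_p  <- t.test(mpg ~ am, data = cars)$p.value
pooled_p <- t.test(mpg ~ am, data = cars, var.equal = TRUE)$p.value

# p-value of the amManual coefficient in the simple regression
lm_p <- summary(lm(mpg ~ am, data = cars))$coefficients["amManual", "Pr(>|t|)"]

print(c(welch = welch_p, pooled = pooled_p, lm = lm_p))
```

Both tests are unadjusted comparisons, so they describe the raw difference between the two groups rather than the effect of transmission with the other variables held fixed.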
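# Appendix plot 1: mpg by transmission type
boxplot(mpg ~ am, data = mtcars, col = c("blue", "green"), xlab = "Transmission Type", ylab = "mpg")
# Appendix plot 2: pairwise scatterplots of all variables
plot(mtcars)
# Appendix plot 3: residual diagnostics for the multivariable model
par(mfrow = c(2, 2))
plot(fit1)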