We work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, the magazine is interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. They are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG?”
“Quantify the MPG difference between automatic and manual transmissions”
We take the mtcars data set and write up an analysis to answer these questions using exploratory data analysis and regression models.
data(mtcars)
head(mtcars,10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
We will analyse how the following variables affect mpg (miles per US gallon): cyl (number of cylinders), hp (gross horsepower), drat (rear axle ratio), wt (weight in 1000 lbs), qsec (1/4 mile time), vs (engine shape: 0 = V-shaped, 1 = straight) and am (transmission: 0 = automatic, 1 = manual).
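As a small convenience for later plots and summaries (a sketch, not used in the model fits below), the 0/1 transmission codes could be given readable labels in a separate factor, leaving the numeric mtcars columns untouched:
# Labeled copy of the transmission variable; kept separate so cor(mtcars) below
# still works on an all-numeric data frame
transmission <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
table(transmission)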
Comparing the group means shows that, on average, fuel economy (mpg) is higher for manual than for automatic transmissions. This is visualized in the boxplot in the Appendix.
aggregate(mpg~am, data = mtcars, mean)
## am mpg
## 1 0 17.14737
## 2 1 24.39231
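As a quick sanity check (not part of the original write-up), a Welch two-sample t-test could be used to confirm that this difference in group means is statistically significant before moving on to the regression models.
# Welch two-sample t-test of mpg by transmission type (a sketch)
t.test(mpg ~ am, data = mtcars)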
To quantify this result we fit a simple linear regression model, with mpg as the dependent variable and am as the independent variable.
single <- lm(mpg ~ am, data = mtcars)
summary(single)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Since the p-value (0.000285) is less than 0.05, we reject the null hypothesis that transmission type has no association with mpg. However, the adjusted R-squared is 0.3385, meaning the model explains only about a third of the variance in mpg, so we need to include other predictor variables.
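From the simple model alone, the am coefficient (7.245) is the estimated increase in mpg for a manual relative to an automatic transmission; as a sketch, a 95% confidence interval for this estimate could be obtained with confint.
# 95% confidence intervals for the coefficients of the simple model (sketch)
confint(single)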
The simple linear regression model is therefore not sufficient to explain the data; several other predictor variables must be taken into account. To identify the most relevant variables we look at the correlation of each variable with mpg using the cor function.
cor(mtcars)[1,]
## mpg cyl disp hp drat wt
## 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594
## qsec vs am gear carb
## 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
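To make the variable selection in the next step explicit, the correlations with mpg could be ranked by absolute value (a sketch, not part of the original output):
# Rank the other variables by the absolute strength of their correlation with mpg
sort(abs(cor(mtcars)[1, -1]), decreasing = TRUE)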
Using the variables most strongly correlated with mpg, namely wt (weight), hp (horsepower), disp (displacement) and cyl (number of cylinders), together with am, we fit a multivariable regression model.
multi <- lm(mpg ~ wt+hp+disp+cyl+am, data = mtcars)
summary(multi)
##
## Call:
## lm(formula = mpg ~ wt + hp + disp + cyl + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5952 -1.5864 -0.7157 1.2821 5.5725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.20280 3.66910 10.412 9.08e-11 ***
## wt -3.30262 1.13364 -2.913 0.00726 **
## hp -0.02796 0.01392 -2.008 0.05510 .
## disp 0.01226 0.01171 1.047 0.30472
## cyl -1.10638 0.67636 -1.636 0.11393
## am 1.55649 1.44054 1.080 0.28984
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.505 on 26 degrees of freedom
## Multiple R-squared: 0.8551, Adjusted R-squared: 0.8273
## F-statistic: 30.7 on 5 and 26 DF, p-value: 4.029e-10
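Because the simple model is nested in the multivariable model, an anova comparison could be added as a sketch to confirm that the extra predictors (wt, hp, disp and cyl) significantly improve the fit:
# Compare the nested models: does adding wt, hp, disp and cyl improve the fit?
anova(single, multi)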
We plot the residuals to examine heteroskedasticity between the fitted and residual values, and to check for non-normality.
plot(multi)
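As a complement to the visual inspection, a formal normality check on the residuals could be added (a sketch, not part of the original analysis):
# Shapiro-Wilk test of the residuals of the multivariable model
shapiro.test(residuals(multi))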
library(ggplot2)
# Appendix plot: boxplot of mpg by transmission type, with the raw data
# points overlaid as jitter
g = ggplot(mtcars, aes(factor(am), mpg, fill = factor(am)))
g = g + geom_boxplot()
g = g + geom_jitter(position = position_jitter(width = .1, height = 0))
g = g + scale_fill_discrete(name = "Type", breaks = c("0", "1"),
                            labels = c("Automatic", "Manual"))
g = g + scale_x_discrete(breaks = c("0", "1"), labels = c("Automatic", "Manual"))
g = g + xlab("")
g