From mtcars data set in R, we would like to investigate the relationship between miles per gallon variable (MPG) and other variables by using regression models and exploratory analyses. The result should answer these 2 questions:
We will follow below steps in order to find the answers:
The data was extracted from the 1974 Motor Trend US magazine, which comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
The data set consists of a data frame with 32 observations (nrow) and 11 variables (ncol).
mpg: Miles per US gallon
cyl: Number of cylinders
disp: Displacement (cubic inches)
hp: Gross horsepower
drat: Rear axle ratio
wt: Weight (lb / 1000)
qsec: 1 / 4 mile time
vs: V/S
am: Transmission (0 = automatic, 1 = manual)
gear: Number of forward gears
carb: Number of carburetors
Firstly, take a look at the overall information of the mtcars data set such as dimension, general summary.
library(ggplot2)
library(knitr)
library(MASS)
library(xtable)
data("mtcars")
head(mtcars)
dim(mtcars)
## [1] 32 11
mtcars$cyl = factor(mtcars$cyl)
mtcars$vs = factor(mtcars$vs)
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
mtcars$am = factor(mtcars$am,labels=c("Automatic","Manual"))
summary(mtcars)
## mpg cyl disp hp drat
## Min. :10.40 4:11 Min. : 71.1 Min. : 52.0 Min. :2.760
## 1st Qu.:15.43 6: 7 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
## Median :19.20 8:14 Median :196.3 Median :123.0 Median :3.695
## Mean :20.09 Mean :230.7 Mean :146.7 Mean :3.597
## 3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
## Max. :33.90 Max. :472.0 Max. :335.0 Max. :4.930
## wt qsec vs am gear carb
## Min. :1.513 Min. :14.50 0:18 Automatic:19 3:15 1: 7
## 1st Qu.:2.581 1st Qu.:16.89 1:14 Manual :13 4:12 2:10
## Median :3.325 Median :17.71 5: 5 3: 3
## Mean :3.217 Mean :17.85 4:10
## 3rd Qu.:3.610 3rd Qu.:18.90 6: 1
## Max. :5.424 Max. :22.90 8: 1
ggplot(aes(x = am, y = mpg), data = mtcars) +
geom_boxplot(aes(fill = am),color="gray20",size=0.5)+
scale_fill_manual(name="Transmission",
values = c("royalblue","firebrick3"))+
labs(title = "MPG by transmission boxplot")
The boxplot clearly shows that the manual transmissions have higher mpg’s. To test this claim, a hypothesis test is performed. Since 0 is not included in the 95 percent confident interval, we reject the null hypothesis that there is a no difference in mpg between automatic and manual transmission. As a result, there is sufficient evidence showing that there is significantly difference in mpg using different type of fuel transmission.
t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
Regression analyses will now be performed to quantify how much of a factor transmission type accounts for gas mileage.
fit0 = lm(mpg~am,data=mtcars)
summary(fit0)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
We have seen in the exploratory data analysis section that manual transmissions are generally better for mpg. To quantify the difference, a linear model was fit using only mpg as an explanatory variable. This model produced an 0.3598 which means that it explains only 36% of the variation in mpg. Thus, this is not a very good model to answer the original questions. We will try to look for another better model which involves variables that most likely account for gas mileage.The R “step” function is used to chooses the optimal model by AIC (Aikake Information Criterion) with mpg as the outcome variable and all other variables, except 1/4 mile, as the explanatory variables.
fit= lm(mpg ~ cyl + disp + hp+drat+wt+vs+am+gear+carb, data=mtcars)
step = stepAIC(fit, direction="both")
## Start: AIC=74.73
## mpg ~ cyl + disp + hp + drat + wt + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 5 17.6209 139.26 69.061
## - gear 2 3.9739 125.62 71.760
## - am 1 0.3613 122.00 72.827
## - drat 1 1.9910 123.64 73.251
## - vs 1 4.1355 125.78 73.801
## - cyl 2 12.7542 134.40 73.922
## <none> 121.64 74.732
## - disp 1 11.1303 132.77 75.533
## - wt 1 24.3453 145.99 78.570
## - hp 1 27.0227 148.67 79.151
##
## Step: AIC=69.06
## mpg ~ cyl + disp + hp + drat + wt + vs + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 2 3.391 142.66 65.830
## - drat 1 0.386 139.65 67.149
## - disp 1 0.409 139.67 67.154
## - am 1 8.404 147.67 68.936
## <none> 139.26 69.061
## - vs 1 9.137 148.40 69.094
## - cyl 2 20.764 160.03 69.508
## - wt 1 14.990 154.25 70.332
## - hp 1 32.807 172.07 73.830
## + carb 5 17.621 121.64 74.732
##
## Step: AIC=65.83
## mpg ~ cyl + disp + hp + drat + wt + vs + am
##
## Df Sum of Sq RSS AIC
## - drat 1 0.435 143.09 63.928
## - disp 1 0.694 143.35 63.986
## - vs 1 7.445 150.10 65.458
## <none> 142.66 65.830
## - cyl 2 19.365 162.02 65.904
## - am 1 13.369 156.02 66.697
## + gear 2 3.391 139.26 69.061
## - wt 1 28.934 171.59 69.740
## - hp 1 37.107 179.76 71.229
## + carb 5 17.038 125.62 71.760
##
## Step: AIC=63.93
## mpg ~ cyl + disp + hp + wt + vs + am
##
## Df Sum of Sq RSS AIC
## - disp 1 0.589 143.68 62.059
## - vs 1 7.319 150.41 63.524
## <none> 143.09 63.928
## - cyl 2 21.221 164.31 64.353
## - am 1 16.385 159.47 65.397
## + drat 1 0.435 142.66 65.830
## + gear 2 3.439 139.65 67.149
## - wt 1 29.348 172.44 67.898
## - hp 1 36.855 179.94 69.261
## + carb 5 14.306 128.78 70.557
##
## Step: AIC=62.06
## mpg ~ cyl + hp + wt + vs + am
##
## Df Sum of Sq RSS AIC
## - vs 1 7.346 151.03 61.655
## <none> 143.68 62.059
## - cyl 2 25.284 168.96 63.246
## - am 1 16.443 160.12 63.527
## + disp 1 0.589 143.09 63.928
## + drat 1 0.330 143.35 63.986
## + gear 2 3.437 140.24 65.284
## - hp 1 36.344 180.02 67.275
## - wt 1 41.088 184.77 68.108
## + carb 5 3.480 140.20 71.275
##
## Step: AIC=61.65
## mpg ~ cyl + hp + wt + am
##
## Df Sum of Sq RSS AIC
## <none> 151.03 61.655
## - am 1 9.752 160.78 61.657
## + vs 1 7.346 143.68 62.059
## - cyl 2 29.265 180.29 63.323
## + disp 1 0.617 150.41 63.524
## + drat 1 0.220 150.81 63.608
## + gear 2 1.361 149.66 65.365
## - hp 1 31.943 182.97 65.794
## - wt 1 46.173 197.20 68.191
## + carb 5 5.633 145.39 70.438
From the result of AIC process, we will use ANOVA to examine closly to these models:
Note that for the first and second model is second and first best fit result from AIC process. The third one is the model without horse power since p-value for its coefficient in first model is above 0.05.
fit1 = lm(mpg ~ cyl+hp+wt+am, data = mtcars)
fit2 = lm(mpg ~ cyl+hp+wt, data = mtcars)
fit3 = lm(mpg ~ cyl+wt, data = mtcars)
p.values=round(summary(fit1)$coefficients[,4],5) ## p-val of coefficient of hp > 0.5, thus eliminate the hp in model
pAnova1 = round(anova(fit1,fit2)$"Pr(>F)"[2],3)
pAnova2 = round(anova(fit2,fit3)$"Pr(>F)"[2],3)
pAnova3 = round(anova(fit1,fit3)$"Pr(>F)"[2],3)
kable(matrix(as.character(c(pAnova1,pAnova2,pAnova3)),ncol = 3),
col.names= c("Model 1 vs 2","Model 2 vs 3", "Model 1 vs 3"),format = "markdown")
| Model 1 vs 2 | Model 2 vs 3 | Model 1 vs 3 |
|---|---|---|
| 0.206 | 0.064 | 0.082 |
The fact that pAnova1 > pAnova3 > pAnova2 > 0.05 suggests that the best fit model is mpg~ cyl + wt
summary(fit3)
##
## Call:
## lm(formula = mpg ~ cyl + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5890 -1.2357 -0.5159 1.3845 5.7915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.9908 1.8878 18.006 < 2e-16 ***
## cyl6 -4.2556 1.3861 -3.070 0.004718 **
## cyl8 -6.0709 1.6523 -3.674 0.000999 ***
## wt -3.2056 0.7539 -4.252 0.000213 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared: 0.8374, Adjusted R-squared: 0.82
## F-statistic: 48.08 on 3 and 28 DF, p-value: 3.594e-11
This model has an Rsquared of 0.8302 which means that it explains 83% of the variation in mpg. The plots below are the diagnostic plots for the model. The residuals appear randomly, the standardized residuals appear normally distributed, and there are not any highly influential outliers.
layout(matrix(c(1,2,3,4),2,2))
plot(fit3,pch=16)
After performing this analysis, we show that: