Executive Summary

From mtcars data set in R, we would like to investigate the relationship between miles per gallon variable (MPG) and other variables by using regression models and exploratory analyses. The result should answer these 2 questions:

We will follow below steps in order to find the answers:

Processing

1. Data set description

The data was extracted from the 1974 Motor Trend US magazine, which comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

The data set consists of a data frame with 32 observations (nrow) and 11 variables (ncol).

mpg: Miles per US gallon
cyl: Number of cylinders
disp: Displacement (cubic inches)
hp: Gross horsepower
drat: Rear axle ratio
wt: Weight (lb / 1000)
qsec: 1 / 4 mile time
vs: V/S
am: Transmission (0 = automatic, 1 = manual)
gear: Number of forward gears
carb: Number of carburetors

2. Preprocessing

Firstly, take a look at the overall information of the mtcars data set such as dimension, general summary.

library(ggplot2)
library(knitr)
library(MASS)
library(xtable)

data("mtcars")
head(mtcars)
dim(mtcars)
## [1] 32 11
mtcars$cyl = factor(mtcars$cyl)
mtcars$vs = factor(mtcars$vs)
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
mtcars$am = factor(mtcars$am,labels=c("Automatic","Manual"))
summary(mtcars)
##       mpg        cyl         disp             hp             drat      
##  Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec       vs             am     gear   carb  
##  Min.   :1.513   Min.   :14.50   0:18   Automatic:19   3:15   1: 7  
##  1st Qu.:2.581   1st Qu.:16.89   1:14   Manual   :13   4:12   2:10  
##  Median :3.325   Median :17.71                         5: 5   3: 3  
##  Mean   :3.217   Mean   :17.85                                4:10  
##  3rd Qu.:3.610   3rd Qu.:18.90                                6: 1  
##  Max.   :5.424   Max.   :22.90                                8: 1

Exploratory analysis

ggplot(aes(x = am, y = mpg), data = mtcars) + 
  geom_boxplot(aes(fill = am),color="gray20",size=0.5)+
  scale_fill_manual(name="Transmission",
                    values = c("royalblue","firebrick3"))+
  labs(title = "MPG by transmission boxplot")

The boxplot clearly shows that the manual transmissions have higher mpg’s. To test this claim, a hypothesis test is performed. Since 0 is not included in the 95 percent confident interval, we reject the null hypothesis that there is a no difference in mpg between automatic and manual transmission. As a result, there is sufficient evidence showing that there is significantly difference in mpg using different type of fuel transmission.

t.test(mpg ~ am, data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

Regression model

Regression analyses will now be performed to quantify how much of a factor transmission type accounts for gas mileage.

fit0 = lm(mpg~am,data=mtcars) 
summary(fit0)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

We have seen in the exploratory data analysis section that manual transmissions are generally better for mpg. To quantify the difference, a linear model was fit using only mpg as an explanatory variable. This model produced an 0.3598 which means that it explains only 36% of the variation in mpg. Thus, this is not a very good model to answer the original questions. We will try to look for another better model which involves variables that most likely account for gas mileage.The R “step” function is used to chooses the optimal model by AIC (Aikake Information Criterion) with mpg as the outcome variable and all other variables, except 1/4 mile, as the explanatory variables.

fit= lm(mpg ~ cyl + disp + hp+drat+wt+vs+am+gear+carb, data=mtcars)
step = stepAIC(fit, direction="both")
## Start:  AIC=74.73
## mpg ~ cyl + disp + hp + drat + wt + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  5   17.6209 139.26 69.061
## - gear  2    3.9739 125.62 71.760
## - am    1    0.3613 122.00 72.827
## - drat  1    1.9910 123.64 73.251
## - vs    1    4.1355 125.78 73.801
## - cyl   2   12.7542 134.40 73.922
## <none>              121.64 74.732
## - disp  1   11.1303 132.77 75.533
## - wt    1   24.3453 145.99 78.570
## - hp    1   27.0227 148.67 79.151
## 
## Step:  AIC=69.06
## mpg ~ cyl + disp + hp + drat + wt + vs + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  2     3.391 142.66 65.830
## - drat  1     0.386 139.65 67.149
## - disp  1     0.409 139.67 67.154
## - am    1     8.404 147.67 68.936
## <none>              139.26 69.061
## - vs    1     9.137 148.40 69.094
## - cyl   2    20.764 160.03 69.508
## - wt    1    14.990 154.25 70.332
## - hp    1    32.807 172.07 73.830
## + carb  5    17.621 121.64 74.732
## 
## Step:  AIC=65.83
## mpg ~ cyl + disp + hp + drat + wt + vs + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     0.435 143.09 63.928
## - disp  1     0.694 143.35 63.986
## - vs    1     7.445 150.10 65.458
## <none>              142.66 65.830
## - cyl   2    19.365 162.02 65.904
## - am    1    13.369 156.02 66.697
## + gear  2     3.391 139.26 69.061
## - wt    1    28.934 171.59 69.740
## - hp    1    37.107 179.76 71.229
## + carb  5    17.038 125.62 71.760
## 
## Step:  AIC=63.93
## mpg ~ cyl + disp + hp + wt + vs + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     0.589 143.68 62.059
## - vs    1     7.319 150.41 63.524
## <none>              143.09 63.928
## - cyl   2    21.221 164.31 64.353
## - am    1    16.385 159.47 65.397
## + drat  1     0.435 142.66 65.830
## + gear  2     3.439 139.65 67.149
## - wt    1    29.348 172.44 67.898
## - hp    1    36.855 179.94 69.261
## + carb  5    14.306 128.78 70.557
## 
## Step:  AIC=62.06
## mpg ~ cyl + hp + wt + vs + am
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1     7.346 151.03 61.655
## <none>              143.68 62.059
## - cyl   2    25.284 168.96 63.246
## - am    1    16.443 160.12 63.527
## + disp  1     0.589 143.09 63.928
## + drat  1     0.330 143.35 63.986
## + gear  2     3.437 140.24 65.284
## - hp    1    36.344 180.02 67.275
## - wt    1    41.088 184.77 68.108
## + carb  5     3.480 140.20 71.275
## 
## Step:  AIC=61.65
## mpg ~ cyl + hp + wt + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              151.03 61.655
## - am    1     9.752 160.78 61.657
## + vs    1     7.346 143.68 62.059
## - cyl   2    29.265 180.29 63.323
## + disp  1     0.617 150.41 63.524
## + drat  1     0.220 150.81 63.608
## + gear  2     1.361 149.66 65.365
## - hp    1    31.943 182.97 65.794
## - wt    1    46.173 197.20 68.191
## + carb  5     5.633 145.39 70.438

From the result of AIC process, we will use ANOVA to examine closly to these models:

Note that for the first and second model is second and first best fit result from AIC process. The third one is the model without horse power since p-value for its coefficient in first model is above 0.05.

fit1 = lm(mpg ~ cyl+hp+wt+am, data = mtcars)
fit2 = lm(mpg ~ cyl+hp+wt, data = mtcars)
fit3 = lm(mpg ~ cyl+wt, data = mtcars)

p.values=round(summary(fit1)$coefficients[,4],5) ## p-val of coefficient of hp > 0.5, thus eliminate the hp in model

pAnova1 = round(anova(fit1,fit2)$"Pr(>F)"[2],3)
pAnova2 = round(anova(fit2,fit3)$"Pr(>F)"[2],3)
pAnova3 = round(anova(fit1,fit3)$"Pr(>F)"[2],3)

kable(matrix(as.character(c(pAnova1,pAnova2,pAnova3)),ncol = 3),
      col.names= c("Model 1 vs 2","Model 2 vs 3", "Model 1 vs 3"),format = "markdown")
Model 1 vs 2 Model 2 vs 3 Model 1 vs 3
0.206 0.064 0.082

The fact that pAnova1 > pAnova3 > pAnova2 > 0.05 suggests that the best fit model is mpg~ cyl + wt

summary(fit3)
## 
## Call:
## lm(formula = mpg ~ cyl + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5890 -1.2357 -0.5159  1.3845  5.7915 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
## cyl6         -4.2556     1.3861  -3.070 0.004718 ** 
## cyl8         -6.0709     1.6523  -3.674 0.000999 ***
## wt           -3.2056     0.7539  -4.252 0.000213 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared:  0.8374, Adjusted R-squared:   0.82 
## F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11

This model has an Rsquared of 0.8302 which means that it explains 83% of the variation in mpg. The plots below are the diagnostic plots for the model. The residuals appear randomly, the standardized residuals appear normally distributed, and there are not any highly influential outliers.

layout(matrix(c(1,2,3,4),2,2))
plot(fit3,pch=16)

Conclusion

After performing this analysis, we show that: