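The primary objective of this report is to explore the relationship between transmission type, am, and fuel consumption, mpg, in the data set mtcars. More specifically, it aims to answer two questions: whether an automatic or a manual transmission gives better fuel economy, and how large the difference in mpg between the two is.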
The analysis below leads to the conclusion that there is a statistically significant difference in mpg between manual and automatic transmissions when transmission type is considered on its own, and that manual cars have better fuel economy than automatic ones.
The mtcars data set was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). Here is a brief description of all the variables that will be mentioned later on.
mpg, Miles/(US) gallon
cyl, Number of cylinders
disp, Displacement (cu.in.)
hp, Gross horsepower
drat, Rear axle ratio
wt, Weight (1000 lbs)
qsec, 1/4 mile time
vs, Engine (0 = V-shaped, 1 = straight)
am, Transmission (0 = automatic, 1 = manual)
gear, Number of forward gears
carb, Number of carburetors
It can be seen from the exploratory plots shown in the appendix that transmission type does have an impact on mpg. The following section quantitatively analyses the impact and builds the best-fitting model to explain it.
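The code in this report refers to a data frame called data, whose construction is not shown. A minimal sketch of how it could be prepared, assuming the categorical variables were converted to factors and the transmission levels were labelled Auto and Manual (which is what the coefficient names in the output, such as cyl6 and amManual, suggest):

```r
# Assumed preparation of `data` (this step is not shown in the report).
data <- mtcars

# Convert the categorical variables to factors so that lm() creates
# dummy variables such as cyl6, vs1, gear4 and carb2.
data$cyl  <- factor(data$cyl)
data$vs   <- factor(data$vs)
data$gear <- factor(data$gear)
data$carb <- factor(data$carb)

# Label the transmission levels; "Auto" becomes the reference level.
data$am <- factor(data$am, levels = c(0, 1), labels = c("Auto", "Manual"))

str(data)
```

With Auto as the reference level, the coefficient amManual in the models below measures the change in mpg when moving from an automatic to a manual car.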
Initially, a regression of mpg on am alone is fitted to examine the relationship between the two. It serves as the base model.
base <- lm(mpg ~ am, data = data)
summary(base)
##
## Call:
## lm(formula = mpg ~ am, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
aggregate(mpg ~ am, data = data, mean)
## am mpg
## 1 Auto 17.14737
## 2 Manual 24.39231
The analysis shown here strongly supports the preliminary conclusion drawn from the exploratory plots mentioned above: transmission type has a statistically significant effect on fuel consumption. The average mpg is 24.39 for manual and 17.15 for automatic cars, and the p-value of the am coefficient (0.000285) is well below 0.05, which supports this conclusion.
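The link between the regression output and the group means can be made explicit: with a single two-level factor as predictor, the slope is the difference between the two group means, and its p-value matches that of a pooled-variance two-sample t-test. A short sketch, assuming data is mtcars with am converted to a labelled factor (the names diff_means, slope and tt are introduced here for illustration):

```r
# Assumed data preparation (not shown in the report).
data <- mtcars
data$am <- factor(data$am, levels = c(0, 1), labels = c("Auto", "Manual"))

# Difference between the group means of mpg.
means <- tapply(data$mpg, data$am, mean)
diff_means <- unname(means["Manual"] - means["Auto"])

# Slope of the base model.
base <- lm(mpg ~ am, data = data)
slope <- unname(coef(base)["amManual"])

# Pooled-variance two-sample t-test on the same grouping.
tt <- t.test(mpg ~ am, data = data, var.equal = TRUE)

print(c(diff_means = diff_means, slope = slope))
print(c(p_lm = summary(base)$coefficients["amManual", "Pr(>|t|)"],
        p_ttest = tt$p.value))
```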
It is also clear that other variables such as cyl, hp and wt should be taken into consideration when exploring fuel consumption. Hence, a model named og, which includes all the variables, is fitted and used as the starting point for model selection.
# Full model: regress mpg on all the other variables
og <- lm(mpg ~ ., data = data)
summary(og)
##
## Call:
## lm(formula = mpg ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## amManual 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
With the help of step(), R can perform a stepwise model selection, which decides which variables to keep in the regression model. The procedure selects the model with the smallest AIC (Akaike Information Criterion), a measure that balances goodness of fit against the complexity of a statistical model.
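To make the criterion concrete, the AIC value that step() minimises for a linear model is n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. A minimal sketch that computes it by hand for the three models, assuming data is mtcars with cyl, vs, am, gear and carb converted to factors (aic_by_hand is a helper introduced here for illustration, and trace = 0 only suppresses the step-by-step printout):

```r
# Assumed data preparation (not shown in the report).
data <- mtcars
for (v in c("cyl", "vs", "gear", "carb")) data[[v]] <- factor(data[[v]])
data$am <- factor(data$am, levels = c(0, 1), labels = c("Auto", "Manual"))

base <- lm(mpg ~ am, data = data)
og   <- lm(mpg ~ ., data = data)
best <- step(og, direction = "both", trace = 0)

# AIC on the scale used by step() for linear models.
aic_by_hand <- function(model) {
  n   <- nobs(model)
  rss <- sum(residuals(model)^2)
  edf <- n - df.residual(model)  # number of estimated coefficients
  n * log(rss / n) + 2 * edf
}

aic_table <- sapply(list(base = base, best = best, og = og), aic_by_hand)
print(aic_table)
```

The value differs from the one returned by AIC() by an additive constant, so the ranking of models fitted to the same data is unaffected.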
# Stepwise selection starting from the full model, to find the best-fitting one
best <- step(og, direction = "both")
summary(best)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
Then, by using anova(), the differences among the nested models can be tested.
# Compare the three nested models: base, best and og
anova(base, best, og)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 17.7489 1.476e-05 ***
## 3 15 120.40 11 30.62 0.3468 0.9588
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The table shows that adding cyl, hp and wt to the base model reduces the residual sum of squares significantly (F = 17.75, p = 1.476e-05), whereas adding the remaining variables to go from best to og does not (F = 0.35, p = 0.96). This supports the choice of the model obtained from step() as the most suitable one for answering the question.
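The F statistics in the table can be reproduced by hand from the residual sums of squares. Each row divides the drop in RSS per added degree of freedom by the residual mean square of the largest model (og). A small sketch using the rounded values printed in the table above (the variable names are introduced here for illustration):

```r
# Values copied from the anova() table above.
rss    <- c(base = 720.90, best = 151.03, og = 120.40)
res_df <- c(base = 30, best = 26, og = 15)

# Residual mean square of the largest model, used as the denominator.
scale <- rss[["og"]] / res_df[["og"]]

# base -> best: 4 extra degrees of freedom.
df_best <- res_df[["base"]] - res_df[["best"]]
F_best  <- ((rss[["base"]] - rss[["best"]]) / df_best) / scale
p_best  <- pf(F_best, df_best, res_df[["og"]], lower.tail = FALSE)

# best -> og: 11 extra degrees of freedom.
df_og <- res_df[["best"]] - res_df[["og"]]
F_og  <- ((rss[["best"]] - rss[["og"]]) / df_og) / scale
p_og  <- pf(F_og, df_og, res_df[["og"]], lower.tail = FALSE)

print(c(F_best = F_best, p_best = p_best, F_og = F_og, p_og = p_og))
```

Because the inputs are rounded, the results agree with the table only to about three significant digits.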
Recall the summary of best again.
summary(best)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
Again, according to the p-values shown above, cyl6, hp and wt are below the 0.05 threshold, which means that these variables have a significant effect on the output, mpg. The exceptions are cyl8 (0.352) and amManual (0.206). Part of the explanation is that the predictors are correlated with one another; hp in particular depends heavily on cyl, as the following regression shows.
summary(lm(hp ~ cyl - 1, data = data))
##
## Call:
## lm(formula = hp ~ cyl - 1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.21 -22.78 -8.25 15.97 125.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## cyl4 82.64 11.43 7.228 5.86e-08 ***
## cyl6 122.29 14.33 8.532 2.12e-09 ***
## cyl8 209.21 10.13 20.645 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.92 on 29 degrees of freedom
## Multiple R-squared: 0.95, Adjusted R-squared: 0.9449
## F-statistic: 183.7 on 3 and 29 DF, p-value: < 2.2e-16
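One caveat on the output above: because the model omits the intercept, R computes \(R^2\) relative to zero rather than to the mean of hp, so the value of 0.95 overstates how much of the variation in hp is explained by cyl. A sketch that refits the same model with an intercept for comparison, assuming cyl has been converted to a factor (no_int and with_int are names introduced here for illustration):

```r
# Assumed data preparation (not shown in the report).
data <- mtcars
data$cyl <- factor(data$cyl)

no_int   <- lm(hp ~ cyl - 1, data = data)  # as in the report
with_int <- lm(hp ~ cyl, data = data)      # same fit, with intercept

# The fitted values are identical; only the R-squared baseline differs.
print(all.equal(fitted(no_int), fitted(with_int)))
print(c(r2_no_intercept   = summary(no_int)$r.squared,
        r2_with_intercept = summary(with_int)$r.squared))
```

The version with the intercept gives the share of the variance of hp around its mean that is explained by cyl, which is the more usual reading of \(R^2\).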
But this doesn’t mean that hp can be removed. More importantly, in the regression model best the multiple \(R^2\) and adjusted \(R^2\) are 0.8659 and 0.8401 respectively, which are higher than the \(R^2\) values of the model without hp. This indicates that around 84 percent of the variance in mpg can be explained by the selected variables, after adjusting for the number of predictors. The F-statistic shows that the p-value of the whole model is 1.506e-10, which also supports the conclusion that the selected variables are jointly significant.
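The comparison with the model without hp can be checked directly. A minimal sketch, assuming data is mtcars with cyl and am converted to factors (no_hp and r2 are names introduced here for illustration):

```r
# Assumed data preparation (not shown in the report).
data <- mtcars
data$cyl <- factor(data$cyl)
data$am  <- factor(data$am, levels = c(0, 1), labels = c("Auto", "Manual"))

best  <- lm(mpg ~ cyl + hp + wt + am, data = data)
no_hp <- update(best, . ~ . - hp)  # same model with hp dropped

# Collect multiple and adjusted R-squared for a fitted model.
r2 <- function(m) {
  s <- summary(m)
  c(multiple = s$r.squared, adjusted = s$adj.r.squared)
}

comparison <- rbind(with_hp = r2(best), without_hp = r2(no_hp))
print(round(comparison, 4))
```

Multiple \(R^2\) can never decrease when a variable is added, so the adjusted value is the more informative of the two columns here.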
Additionally, a series of residual plots is shown below to carry out regression diagnostics and check for non-normality.
par(mfrow = c(2,2))
plot(best)
These four plots indicate that the residuals are approximately independent and normally distributed, with roughly constant variance.
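The visual impression can be complemented with a couple of numerical checks. The sketch below is not part of the original analysis; it assumes data is mtcars with cyl and am converted to factors, and uses a Shapiro-Wilk test on the residuals and Cook's distance to flag influential cars:

```r
# Assumed data preparation (not shown in the report).
data <- mtcars
data$cyl <- factor(data$cyl)
data$am  <- factor(data$am, levels = c(0, 1), labels = c("Auto", "Manual"))

best <- lm(mpg ~ cyl + hp + wt + am, data = data)

# Shapiro-Wilk test: a small p-value would point to non-normal residuals.
normality <- shapiro.test(residuals(best))
print(normality)

# Cook's distance: the cars with the largest influence on the fit.
cooks <- cooks.distance(best)
print(head(sort(cooks, decreasing = TRUE), 3))
```

With only 32 observations these checks have limited power, so they are best read together with the plots rather than in place of them.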
Based on the analysis conducted above, a few conclusions can be made:
Transmission type does have an impact on vehicle fuel consumption. Specifically, manual cars have better mpg than automatic cars.
Quantitatively speaking, when only transmission type is considered, mpg is 7.245 higher for manual cars than for automatic cars. The increase is 1.8 when adjusted for hp, cyl and wt, and this adjusted difference is not statistically significant at the 0.05 level (p = 0.206).
Even though horsepower, hp, is highly related to cyl, it still makes its own contribution in terms of \(R^2\) and the F-statistic when the whole model is evaluated.
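The size of the transmission effect is easier to judge with confidence intervals than with point estimates alone. A short sketch, assuming data is mtcars with cyl and am converted to factors, that reports the 95% interval for the amManual coefficient in the base model and in the adjusted model (ci_base and ci_best are names introduced here for illustration):

```r
# Assumed data preparation (not shown in the report).
data <- mtcars
data$cyl <- factor(data$cyl)
data$am  <- factor(data$am, levels = c(0, 1), labels = c("Auto", "Manual"))

base <- lm(mpg ~ am, data = data)
best <- lm(mpg ~ cyl + hp + wt + am, data = data)

# 95% confidence intervals for the manual-versus-automatic difference.
ci_base <- confint(base, "amManual", level = 0.95)
ci_best <- confint(best, "amManual", level = 0.95)

print(rbind(unadjusted = ci_base[1, ], adjusted = ci_best[1, ]))
```

An interval that excludes zero corresponds to a coefficient that is significant at the 0.05 level, so the two rows show how the evidence changes once cyl, hp and wt are accounted for.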
Here are some plots for exploratory analysis.
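# Exploratory plots. ggplot2.multiplot() is assumed to come from the
# easyGgplot2 package; plot1 to plot10 are plot objects defined elsewhere.
# Box plots first
ggplot2.multiplot(plot1, plot7, cols = 2)
ggplot2.multiplot(plot8, plot9, plot10, cols = 2)
# The remaining plots
ggplot2.multiplot(plot2, plot3, cols = 2)
ggplot2.multiplot(plot4, plot5, plot6, cols = 2)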
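The plot objects plot1 to plot10 are not defined in this section. As a stand-in, here is a minimal base-R sketch of the central exploratory plot, a box plot of mpg by transmission type. It assumes data is mtcars with am converted to a labelled factor and is not the author's original plotting code:

```r
# Assumed data preparation (not shown in the report).
data <- mtcars
data$am <- factor(data$am, levels = c(0, 1), labels = c("Auto", "Manual"))

# Box plot of fuel consumption by transmission type.
b <- boxplot(mpg ~ am, data = data,
             xlab = "Transmission type",
             ylab = "Miles per (US) gallon",
             main = "mpg by transmission type")

# Number of cars in each group, as counted by boxplot().
print(setNames(b$n, b$names))
```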