On behalf of Motor Trend magazine, we examine the mtcars dataset (a collection of cars) and are particularly interested in the following two questions:
Is an automatic or manual transmission better for Miles Per Gallon?
How large is the Miles Per Gallon difference between automatic and manual transmissions?
In this report, we analyze the mtcars dataset and explore the relationship between miles per gallon (mpg) and transmission type (am). Here miles per gallon is our outcome.
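The key inferences from our analysis are:
Manual transmission cars get about 1.8 MPG more than automatic transmission cars after adjusting for number of cylinders, horsepower, and weight, although this adjusted difference is not statistically significant (p = 0.21).
Without adjustment, the mean and median MPG of manual transmission cars are clearly higher than those of automatic transmission cars.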
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am <- factor(mtcars$am, labels = c('Automatic', 'Manual'))
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
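Because the labels passed to factor() are assigned in the sorted order of the raw codes, it is worth confirming that 0 maps to Automatic and 1 to Manual, which is the coding given in the mtcars help page. The following is a minimal sketch of such a check; the names raw_am, am_labelled, and check are illustrative and not part of the original analysis.

```r
# Sanity check (illustrative): cross-tabulate the raw 0/1 codes against the labels.
raw_am <- datasets::mtcars$am                       # original numeric coding (0/1)
am_labelled <- factor(raw_am, labels = c("Automatic", "Manual"))

# Every car should fall on the diagonal: 0 -> Automatic, 1 -> Manual
check <- table(raw = raw_am, labelled = am_labelled)
print(check)
```

All counts should sit on the diagonal of the table; any off-diagonal count would mean the labels were attached in the wrong order.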
In this section, we build linear regression models from the available variables, select the best-fitting model, and compare it against a base model using anova. After model selection, we also analyse the residuals.
We first fit an initial linear model with all the other variables as predictors of mpg. Then we perform stepwise model selection to choose the predictors for the final model. To do so, we use the step function, which repeatedly adds and drops terms (direction = "both", combining forward selection and backward elimination) and at each step keeps the candidate model with the lowest AIC.
initialModel <- lm(mpg ~ ., data = mtcars)
finalModel <- step(initialModel, direction = "both")
## Start: AIC=76.4
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 5 13.60 134 69.8
## - gear 2 3.97 124 73.4
## - am 1 1.14 122 74.7
## - qsec 1 1.24 122 74.7
## - drat 1 1.82 122 74.9
## - cyl 2 10.93 131 75.2
## - vs 1 3.63 124 75.4
## <none> 120 76.4
## - disp 1 9.97 130 76.9
## - wt 1 25.55 146 80.6
## - hp 1 25.67 146 80.6
##
## Step: AIC=69.83
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 2 5.02 139 67.0
## - disp 1 0.99 135 68.1
## - drat 1 1.19 135 68.1
## - vs 1 3.68 138 68.7
## - cyl 2 12.56 147 68.7
## - qsec 1 5.26 139 69.1
## <none> 134 69.8
## - am 1 11.93 146 70.6
## - wt 1 19.80 154 72.2
## - hp 1 22.79 157 72.9
## + carb 5 13.60 120 76.4
##
## Step: AIC=67
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - drat 1 0.97 140 65.2
## - cyl 2 10.42 149 65.3
## - disp 1 1.55 141 65.4
## - vs 1 2.18 141 65.5
## - qsec 1 3.63 143 65.8
## <none> 139 67.0
## - am 1 16.57 156 68.6
## - hp 1 18.18 157 68.9
## + gear 2 5.02 134 69.8
## - wt 1 31.19 170 71.5
## + carb 5 14.65 124 73.4
##
## Step: AIC=65.23
## mpg ~ cyl + disp + hp + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - disp 1 1.25 141 63.5
## - vs 1 2.34 142 63.8
## - cyl 2 12.33 152 63.9
## - qsec 1 3.10 143 63.9
## <none> 140 65.2
## + drat 1 0.97 139 67.0
## - hp 1 17.74 158 67.0
## - am 1 19.47 160 67.4
## + gear 2 4.80 135 68.1
## - wt 1 30.72 171 69.6
## + carb 5 13.05 127 72.1
##
## Step: AIC=63.51
## mpg ~ cyl + hp + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - qsec 1 2.4 144 62.1
## - vs 1 2.7 144 62.1
## - cyl 2 18.6 160 63.5
## <none> 141 63.5
## + disp 1 1.2 140 65.2
## + drat 1 0.7 141 65.4
## - hp 1 18.2 159 65.4
## - am 1 18.9 160 65.5
## + gear 2 4.7 137 66.4
## - wt 1 39.6 181 69.4
## + carb 5 2.3 139 73.0
##
## Step: AIC=62.06
## mpg ~ cyl + hp + wt + vs + am
##
## Df Sum of Sq RSS AIC
## - vs 1 7.3 151 61.7
## <none> 144 62.1
## - cyl 2 25.3 169 63.2
## + qsec 1 2.4 141 63.5
## - am 1 16.4 160 63.5
## + disp 1 0.6 143 63.9
## + drat 1 0.3 143 64.0
## + gear 2 3.4 140 65.3
## - hp 1 36.3 180 67.3
## - wt 1 41.1 185 68.1
## + carb 5 3.5 140 71.3
##
## Step: AIC=61.65
## mpg ~ cyl + hp + wt + am
##
## Df Sum of Sq RSS AIC
## <none> 151 61.7
## - am 1 9.8 161 61.7
## + vs 1 7.3 144 62.1
## + qsec 1 7.0 144 62.1
## - cyl 2 29.3 180 63.3
## + disp 1 0.6 150 63.5
## + drat 1 0.2 151 63.6
## + gear 2 1.4 150 65.4
## - hp 1 31.9 183 65.8
## - wt 1 46.2 197 68.2
## + carb 5 5.6 145 70.4
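The AIC column in the trace above can be reproduced by hand. For a linear model, step relies on extractAIC, which computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The sketch below refits the selected formula directly instead of rerunning step; fit and aic_manual are illustrative names.

```r
# Refit the model that step() selected (formula taken from the trace above)
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)

n <- nrow(mtcars)                 # number of observations
rss <- sum(residuals(fit)^2)      # residual sum of squares
k <- length(coef(fit))            # estimated coefficients, including the intercept

# AIC as used by step() for lm fits (up to an additive constant vs. AIC())
aic_manual <- n * log(rss / n) + 2 * k
print(aic_manual)
print(extractAIC(fit))            # returns c(edf, AIC)
```

The value should agree with the AIC reported for the last step of the trace. Note that stats::AIC() uses a different additive constant, so its numbers are not directly comparable with the step output.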
The final model keeps number of cylinders (cyl), gross horsepower (hp), and weight (wt) as adjustment variables (confounders) alongside transmission type (am), which is our predictor of interest.
summary(finalModel)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.939 -1.256 -0.401 1.125 5.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.7083 2.6049 12.94 7.7e-13 ***
## cyl6 -3.0313 1.4073 -2.15 0.0407 *
## cyl8 -2.1637 2.2843 -0.95 0.3523
## hp -0.0321 0.0137 -2.35 0.0269 *
## wt -2.4968 0.8856 -2.82 0.0091 **
## amManual 1.8092 1.3963 1.30 0.2065
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.866, Adjusted R-squared: 0.84
## F-statistic: 33.6 on 5 and 26 DF, p-value: 1.51e-10
From the above summary we can write the fitted linear equation: mpg = 33.7083 - 3.0313 * cyl6 - 2.1637 * cyl8 - 0.0321 * hp - 2.4968 * wt + 1.8092 * amManual, where cyl6, cyl8, and amManual are 0/1 indicator variables and the reference levels are 4 cylinders and Automatic.
The transmission coefficient means that, all else held constant, the model predicts that a manual transmission car gets 1.8092 more miles per gallon than an automatic transmission car, on average. With a p-value of 0.2065, however, this adjusted difference is not statistically distinguishable from zero.
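To make the interpretation of the transmission coefficient concrete, the sketch below predicts mpg for two hypothetical cars that are identical except for transmission, and also extracts a 95% confidence interval for the coefficient. The covariate values (4 cylinders, 100 hp, 2.5 thousand lb) are arbitrary assumptions chosen for illustration, and the names fit, newcars, gap, and ci are not from the original analysis.

```r
# Rebuild the final model so the example is self-contained
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)

# Two hypothetical cars that differ only in transmission type
newcars <- data.frame(
  cyl = factor(c("4", "4"), levels = levels(mtcars$cyl)),
  hp  = c(100, 100),
  wt  = c(2.5, 2.5),
  am  = factor(c("Automatic", "Manual"), levels = levels(mtcars$am))
)

pred <- predict(fit, newdata = newcars)
gap <- unname(pred[2] - pred[1])   # Manual minus Automatic
print(gap)                         # equals the amManual coefficient

# 95% confidence interval for the adjusted transmission effect
ci <- confint(fit, "amManual", level = 0.95)
print(ci)
```

Because the model has no interaction terms, the predicted gap is the same whatever covariate values are chosen. An interval that spans zero is consistent with the non-significant p-value in the summary.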
Now we compare finalModel with another model (baseModel) that has only the transmission variable am as a predictor.
baseModel <- lm(mpg ~ am, data = mtcars)
anova(baseModel, finalModel)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 721
## 2 26 151 4 570 24.5 1.7e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-test comparing the two models is highly significant (p = 1.7e-08), so we reject the null hypothesis that the confounder variables cyl, hp and wt do not improve on the transmission-only model.
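The F statistic in the anova table can be reconstructed from the two residual sums of squares: F = ((RSS_base - RSS_final) / df_diff) / (RSS_final / df_final). With the rounded values from the table, that is (721 - 151) / 4 divided by 151 / 26, or roughly 24.5. A short sketch of the same computation on the unrounded values follows; the names base, full, f_stat, and p_val are illustrative.

```r
# Rebuild both models so the example is self-contained
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
base <- lm(mpg ~ am, data = mtcars)
full <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)

rss_base <- sum(residuals(base)^2)
rss_full <- sum(residuals(full)^2)
df_diff  <- df.residual(base) - df.residual(full)   # extra parameters in the larger model
df_full  <- df.residual(full)

# Nested-model F statistic and its upper-tail p-value
f_stat <- ((rss_base - rss_full) / df_diff) / (rss_full / df_full)
p_val  <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)
print(c(F = f_stat, p = p_val))

tab <- anova(base, full)   # should report the same F and p-value
```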
Since we are interested in the effects of car transmission type on mpg, we plot boxplots of the variable mpg when am is Automatic or Manual.
plot(mpg ~ am, data = mtcars)
The plot shows higher mileage (mpg) for Manual cars than for Automatic cars, before any adjustment for other variables.
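The boxplot gives only a visual impression of the unadjusted difference, and the report does not show a formal test for it. One way to quantify it, sketched below, is to tabulate the group means and medians and run a Welch two-sample t-test. This is an added illustration, not a step from the original analysis; the names group_stats and tt are illustrative.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# Mean and median mpg per transmission type
group_stats <- aggregate(
  mpg ~ am, data = mtcars,
  FUN = function(x) c(mean = mean(x), median = median(x))
)
print(group_stats)

# Welch two-sample t-test (unequal variances, the default for t.test)
tt <- t.test(mpg ~ am, data = mtcars)
print(tt)
```

This comparison ignores weight, horsepower, and cylinder count, so it answers a different question from the adjusted coefficient in finalModel.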
In this section, we analyse the residual plots of our finalModel and compute some regression diagnostics to identify high-leverage and influential points in the data set.
par(mfrow = c(2, 2))
plot(finalModel)
From the above plots, we can make the following observations:
The points in the Residuals vs. Fitted plot appear randomly scattered with no systematic pattern, which is consistent with the linearity and independence assumptions.
In the Normal Q-Q plot, most points fall on the line, indicating that the residuals are approximately normally distributed.
The Scale-Location plot shows points scattered in a roughly constant band, indicating constant variance.
There are some distinct points of interest (potential outliers or leverage points) in the top right of the plots.
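The visual reading of the Q-Q plot can be complemented with a formal normality test on the residuals. The sketch below uses the Shapiro-Wilk test from base R; it is an added check under the assumption that the final model is mpg ~ cyl + hp + wt + am, and fit, res, and sw are illustrative names.

```r
# Rebuild the final model so the example is self-contained
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
res <- residuals(fit)
sw <- shapiro.test(res)
print(sw)
```

A large p-value would mean the test finds no evidence against normality. With only 32 observations the test has limited power, so it supplements the plot and does not replace it.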
We now compute some regression diagnostics for our model to identify these points. For each influence measure, we list the three largest values.
leverage <- hatvalues(finalModel)
tail(sort(leverage), 3)
## Toyota Corona Lincoln Continental Maserati Bora
## 0.2778 0.2937 0.4714
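# dfbetas: change in each coefficient (in standard errors) when a car is dropped
influential <- dfbetas(finalModel)
# column 6 is the transmission coefficient, amManual
tail(sort(influential[, "amManual"]), 3)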
## Chrysler Imperial Fiat 128 Toyota Corona
## 0.3507 0.4292 0.7305
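The top-three lists above do not say whether any value is actually large. A common convention, used here as an assumption and not as part of the original analysis, flags hat values above 2p/n and absolute dfbetas above 2/sqrt(n), where p is the number of coefficients and n the number of cars. The sketch below applies both cutoffs; all variable names are illustrative.

```r
# Rebuild the final model so the example is self-contained
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(fit))

# Leverage: flag hat values above the rule-of-thumb cutoff 2p/n
lev <- hatvalues(fit)
lev_cut <- 2 * p / n
high_leverage <- names(lev[lev > lev_cut])
print(high_leverage)

# Influence on the transmission coefficient: flag |dfbetas| above 2/sqrt(n)
dfb <- dfbetas(fit)[, "amManual"]
dfb_cut <- 2 / sqrt(n)
influential_am <- names(dfb[abs(dfb) > dfb_cut])
print(influential_am)
```

Taking the absolute value of dfbetas also catches cars that pull the transmission coefficient downward, which the sorted top-three list of signed values cannot show.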
Based on the observations from our finalModel, we can conclude the following:
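Cars with Manual transmission get about 1.8 more miles per gallon than cars with Automatic transmission (adjusted for hp, cyl, and wt), although this difference is not statistically significant (p = 0.21).
mpg decreases by about 2.5 (adjusted for hp, cyl, and am) for every 1000 lb increase in wt.
mpg decreases only slightly with hp: about 0.03 per unit of horsepower, or roughly 3.2 per 100 hp (adjusted for cyl, wt, and am).
Compared with 4-cylinder cars, 6-cylinder and 8-cylinder cars get about 3.0 and 2.2 fewer mpg respectively (adjusted for hp, wt, and am); only the 6-cylinder difference is significant at the 5% level.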