Let’s imagine we work for ‘Motor Trend’, a magazine about the automobile industry. Looking at a data set of a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). We are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG”
“Quantifying how different is the MPG between automatic and manual transmissions?”
Let’s start with some exploratory works, to learn more about the dataset
## load the data
data(mtcars)
## look at the variables and their formats
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## Investigate the data descriptions as headers are not relally explicit
?mtcars
As the dataset comes with good descriptions, we can have a better understandings of each variable in the dataset.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs V/S
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
Now we can run a quick ‘pairs’ chart to have an idea of which variable might be correlated with ‘mpg’
## Quick linear regression of mpg versus all variables
pairs(mpg ~ ., data = mtcars)
From this chart, we see that most variables are more or less correlated with ‘mpg’ and we will need to consider this in our regression analysis, as variance might be inflated. We also directly see that ‘cyl’, ‘vs’, ‘am’, ‘gear’ and ‘carb’ variables are not continuous and should therefore be factorized. Then, we can run a summary.
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
mtcars$am <- factor(mtcars$am,labels=c('Auto','Manual')) ## Let'use 'Auto' and 'Manual' instead of 0 and 1
summary(mtcars)
## mpg cyl disp hp drat
## Min. :10.40 4:11 Min. : 71.1 Min. : 52.0 Min. :2.760
## 1st Qu.:15.43 6: 7 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
## Median :19.20 8:14 Median :196.3 Median :123.0 Median :3.695
## Mean :20.09 Mean :230.7 Mean :146.7 Mean :3.597
## 3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
## Max. :33.90 Max. :472.0 Max. :335.0 Max. :4.930
## wt qsec vs am gear carb
## Min. :1.513 Min. :14.50 0:18 Auto :19 3:15 1: 7
## 1st Qu.:2.581 1st Qu.:16.89 1:14 Manual:13 4:12 2:10
## Median :3.325 Median :17.71 5: 5 3: 3
## Mean :3.217 Mean :17.85 4:10
## 3rd Qu.:3.610 3rd Qu.:18.90 6: 1
## Max. :5.424 Max. :22.90 8: 1
To see if gasoline consumption is different for automatic and manual transmissions, we can make a T-test, the null hypothesis being that there is no difference between the two different models.
testResult <- t.test(mtcars$mpg~mtcars$am)
PValue <- round(testResult$p.value,4)
alpha <- 0.05 ## let set alpha to 5% confidence to reject or not the null hypothesis
if(PValue<alpha){
print(paste("Our ",PValue," test P-value is smaller than our 0.05 alpha value, so we successfully rejected the null hypothesis"))
}else{
print(paste("Our ",PValue," test P-value is greater than our 0.05 alpha value, so we failed to reject the null hypothesis"))
}
## [1] "Our 0.0014 test P-value is smaller than our 0.05 alpha value, so we successfully rejected the null hypothesis"
Our T-test finally suggests that manual and automatic transmissions show differences in gasoline consumption. This can be seen on a boxplot graph, as below:
boxplot(mpg ~ am, data = mtcars, main = "Gasoline Consumption, by Transmission Type",
xlab = "Transmission ", ylab = "Gasoline Consumption (mpg)")
The boxplot shows that gasoline consumption is higher for manual than for automatic transmissions. Let’s extract the boxplot values:
AutoFilter <- mtcars[mtcars$am=="Auto",c("am","mpg")]
summary(AutoFilter)
## am mpg
## Auto :19 Min. :10.40
## Manual: 0 1st Qu.:14.95
## Median :17.30
## Mean :17.15
## 3rd Qu.:19.20
## Max. :24.40
ManualFilter <- mtcars[mtcars$am=="Manual",c("am","mpg")]
summary(ManualFilter)
## am mpg
## Auto : 0 Min. :15.00
## Manual:13 1st Qu.:21.00
## Median :22.80
## Mean :24.39
## 3rd Qu.:30.40
## Max. :33.90
Let’s investigate more with regression models. As we saw that most variables are strongly correlated to ‘mpg’, we might use the ‘step’ function to find the optimal regression model with optimal variables, following AIC algorithm rules.
modelFull <- lm(mpg ~ ., data = mtcars)
modelBestFit<- step(modelFull, direction = "both")
## Start: AIC=76.4
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 5 13.5989 134.00 69.828
## - gear 2 3.9729 124.38 73.442
## - am 1 1.1420 121.55 74.705
## - qsec 1 1.2413 121.64 74.732
## - drat 1 1.8208 122.22 74.884
## - cyl 2 10.9314 131.33 75.184
## - vs 1 3.6299 124.03 75.354
## <none> 120.40 76.403
## - disp 1 9.9672 130.37 76.948
## - wt 1 25.5541 145.96 80.562
## - hp 1 25.6715 146.07 80.588
##
## Step: AIC=69.83
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 2 5.0215 139.02 67.005
## - disp 1 0.9934 135.00 68.064
## - drat 1 1.1854 135.19 68.110
## - vs 1 3.6763 137.68 68.694
## - cyl 2 12.5642 146.57 68.696
## - qsec 1 5.2634 139.26 69.061
## <none> 134.00 69.828
## - am 1 11.9255 145.93 70.556
## - wt 1 19.7963 153.80 72.237
## - hp 1 22.7935 156.79 72.855
## + carb 5 13.5989 120.40 76.403
##
## Step: AIC=67
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - drat 1 0.9672 139.99 65.227
## - cyl 2 10.4247 149.45 65.319
## - disp 1 1.5483 140.57 65.359
## - vs 1 2.1829 141.21 65.503
## - qsec 1 3.6324 142.66 65.830
## <none> 139.02 67.005
## - am 1 16.5665 155.59 68.608
## - hp 1 18.1768 157.20 68.937
## + gear 2 5.0215 134.00 69.828
## - wt 1 31.1896 170.21 71.482
## + carb 5 14.6475 124.38 73.442
##
## Step: AIC=65.23
## mpg ~ cyl + disp + hp + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - disp 1 1.2474 141.24 63.511
## - vs 1 2.3403 142.33 63.757
## - cyl 2 12.3267 152.32 63.927
## - qsec 1 3.1000 143.09 63.928
## <none> 139.99 65.227
## + drat 1 0.9672 139.02 67.005
## - hp 1 17.7382 157.73 67.044
## - am 1 19.4660 159.46 67.393
## + gear 2 4.8033 135.19 68.110
## - wt 1 30.7151 170.71 69.574
## + carb 5 13.0509 126.94 72.095
##
## Step: AIC=63.51
## mpg ~ cyl + hp + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - qsec 1 2.442 143.68 62.059
## - vs 1 2.744 143.98 62.126
## - cyl 2 18.580 159.82 63.466
## <none> 141.24 63.511
## + disp 1 1.247 139.99 65.227
## + drat 1 0.666 140.57 65.359
## - hp 1 18.184 159.42 65.386
## - am 1 18.885 160.12 65.527
## + gear 2 4.684 136.55 66.431
## - wt 1 39.645 180.88 69.428
## + carb 5 2.331 138.91 72.978
##
## Step: AIC=62.06
## mpg ~ cyl + hp + wt + vs + am
##
## Df Sum of Sq RSS AIC
## - vs 1 7.346 151.03 61.655
## <none> 143.68 62.059
## - cyl 2 25.284 168.96 63.246
## + qsec 1 2.442 141.24 63.511
## - am 1 16.443 160.12 63.527
## + disp 1 0.589 143.09 63.928
## + drat 1 0.330 143.35 63.986
## + gear 2 3.437 140.24 65.284
## - hp 1 36.344 180.02 67.275
## - wt 1 41.088 184.77 68.108
## + carb 5 3.480 140.20 71.275
##
## Step: AIC=61.65
## mpg ~ cyl + hp + wt + am
##
## Df Sum of Sq RSS AIC
## <none> 151.03 61.655
## - am 1 9.752 160.78 61.657
## + vs 1 7.346 143.68 62.059
## + qsec 1 7.044 143.98 62.126
## - cyl 2 29.265 180.29 63.323
## + disp 1 0.617 150.41 63.524
## + drat 1 0.220 150.81 63.608
## + gear 2 1.361 149.66 65.365
## - hp 1 31.943 182.97 65.794
## - wt 1 46.173 197.20 68.191
## + carb 5 5.633 145.39 70.438
summary(modelBestFit)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
When comparing models fitted by maximum likelihood to the same data, the smaller the AIC, the better the fit. AIC passed from 76.4 (all variables) to 61.5 with ‘cyl’, ‘hp’, ‘wt’, ‘am’ as the most relevant variables. Adjusted r-squared value finally reached 0.8401, explaining 84% of the variance.
Let’s go on with the ‘anova’ function, comparing our best-fit model with a model taking into account only the ‘am’ variable (transmission type). Our null hypothesis will be that model with variables selected for our best-fit model (‘cyl’, ‘hp’, ‘wt’, ‘am’) does not differ from a simple model only related to the ‘am’ vairiable.
modelAmVar <- lm(mpg ~ am, data = mtcars)
anova(modelBestFit, modelAmVar)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + hp + wt + am
## Model 2: mpg ~ am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 151.03
## 2 30 720.90 -4 -569.87 24.527 1.688e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
PValue <- anova(modelBestFit, modelAmVar)$"Pr(>F)"[2]
alpha <- 0.05 ## let set alpha to 5% confidence to reject or not the null hypothesis
if(PValue<alpha){
print(paste("Our ",PValue," test P-value is smaller than our 0.05 alpha value, so we successfully rejected the null hypothesis"))
}else{
print(paste("Our ",PValue," test P-value is greater than our 0.05 alpha value, so we failed to reject the null hypothesis"))
}
## [1] "Our 1.68843485084987e-08 test P-value is smaller than our 0.05 alpha value, so we successfully rejected the null hypothesis"
We will focus now on our best-fit model and have a look to the residual values. We would like to be sure that residuals do not present any hidden pattern that might weaken our regression model. We will also the check the presence of outliers and evaulate its effect on the regression model.
par(mfrow = c(2, 2))
plot(modelBestFit, which = 1)
plot(modelBestFit, which = 2)
plot(modelBestFit, which = 3)
plot(modelBestFit, which = 5)
What can we conclude from residual analysis?
Finally, we can now answer both questions:
Our analysis showed that manual transmission models have a higher autonomy than automatic transmission ones regarding gasoline comsumption.
We saw that cars gasoline consumption is strongly correlated with different parameters such as weight, cylinders, horse power or transmission type. In multi-vraiable regression, such an interdependency is responsible for a strong variance inflation. That is why we used AIC algoritms to isolate the most relevant variables to explain the regression fit. Finally we were able to support with different tests that manual transmission car models are better automatic ones regarding the mpg index.