You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. They are particularly interested in the following two questions: (1) Is an automatic or manual transmission better for MPG? (2) Quantify the MPG difference between automatic and manual transmissions.
df <- mtcars
dim(df)
## [1] 32 11
head(df)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Looking at the boxplot of mileage vs. transmission type in the Appendix, it seems that manual transmission could be more fuel-efficient than automatic (in mtcars, am = 0 is automatic and am = 1 is manual). This still has to be confirmed by a statistical test. Otherwise it is impossible to give a definite answer as to whether manual transmission is really better for fuel economy or whether the effect is an artifact of this particular data sample.
The boxplot of mileage vs. number of cylinders in the Appendix shows a clear dependency between mileage and cylinder count. This dependency is much clearer than the one between mileage and transmission type (automatic or manual).
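As a numeric companion to the two boxplots, the group summaries can be computed directly. This is a minimal sketch rather than part of the original analysis; the names `mean_by_am` and `median_by_cyl` are illustrative, and it relies on the mtcars coding am = 0 for automatic and am = 1 for manual.

```r
# Numeric summaries behind the boxplots (illustrative sketch)
df <- mtcars

# Mean MPG per transmission type (am: 0 = automatic, 1 = manual)
mean_by_am <- tapply(df$mpg, df$am, mean)

# Median MPG per cylinder count (4, 6, 8)
median_by_cyl <- tapply(df$mpg, df$cyl, median)

print(round(mean_by_am, 2))
print(median_by_cyl)
```

The group means printed here are the same quantities that the Welch t-test in the Appendix reports as sample estimates.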
As can be seen from the scatterplot matrix, mileage correlates well with the variables cyl, hp and wt. These predictors were chosen for the model because they are among the most strongly correlated with mpg. The model is also compared with one that includes all the predictors (see Appendix).
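The visual impression from the scatterplot matrix can be cross-checked by ranking the predictors by the absolute value of their Pearson correlation with mpg. The sketch below is an illustration only; `cors` and `ranked` are names introduced here, and Pearson correlation is an assumption about which measure of association is appropriate.

```r
# Rank predictors by absolute Pearson correlation with mpg (illustrative sketch)
df <- mtcars

cors <- cor(df)[, "mpg"]              # correlation of every column with mpg
cors <- cors[names(cors) != "mpg"]    # drop the trivial mpg-mpg entry

ranked <- sort(abs(cors), decreasing = TRUE)
print(round(ranked, 3))
```

The resulting ranking can be compared with the predictors picked by eye; strongly inter-correlated predictors may rank close to one another.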
fitLr <- lm(mpg ~ cyl + hp + wt + am, df)
summary(fitLr)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4765 -1.8471 -0.5544  1.2758  5.6608
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.14654    3.10478  11.642 4.94e-12 ***
## cyl         -0.74516    0.58279  -1.279   0.2119
## hp          -0.02495    0.01365  -1.828   0.0786 .
## wt          -2.60648    0.91984  -2.834   0.0086 **
## am           1.47805    1.44115   1.026   0.3142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.8267
## F-statistic: 37.96 on 4 and 27 DF, p-value: 1.025e-10
The p-value for the am coefficient is greater than 0.3 (0.3142), so the hypothesis that this coefficient is zero cannot be rejected. This model therefore cannot answer the question of interest. The linear regression with all the regressors included can be found in the Appendix. The residual plots for the chosen linear model can also be found in the Appendix.
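The same conclusion can be read from a confidence interval for the am coefficient: a 95% interval that contains zero corresponds to a two-sided p-value above 0.05. The following is a minimal sketch that refits the model shown above; `ci_am` and `contains_zero` are names introduced only for this illustration.

```r
# 95% confidence interval for the am coefficient (illustrative sketch)
df <- mtcars
fitLr <- lm(mpg ~ cyl + hp + wt + am, data = df)

ci_am <- confint(fitLr, parm = "am", level = 0.95)
print(ci_am)

# TRUE when the interval straddles zero, i.e. the effect is not
# distinguishable from zero at the 5% level
contains_zero <- ci_am[1, 1] < 0 && ci_am[1, 2] > 0
print(contains_zero)
```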
library(ridge)
ridge <- linearRidge(mpg ~ cyl + hp + wt + am, df)
summary(ridge)
##
## Call:
## linearRidge(formula = mpg ~ cyl + hp + wt + am, data = df)
##
##
## Coefficients:
##             Estimate Scaled estimate Std. Error (scaled) t value (scaled)
## (Intercept) 35.76620              NA                  NA               NA
## cyl         -0.78228        -7.77867             4.44349            1.751
## hp          -0.02459        -9.38833             4.01090            2.341
## wt          -2.44628       -13.32689             4.11122            3.242
## am           1.58182         4.39470             3.32648            1.321
##             Pr(>|t|)
## (Intercept)       NA
## cyl          0.08002 .
## hp           0.01925 *
## wt           0.00119 **
## am           0.18646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Ridge parameter: 0.03558268, chosen automatically, computed using 2 PCs
##
## Degrees of freedom: model 3.537 , variance 3.163 , residual 3.912
As can be seen from the summary of the ridge regression estimates, the p-value for the am coefficient (0.19) is too large to conclude that the coefficient differs from zero.
Is an automatic or manual transmission better for MPG? This question cannot be answered with this data sample: the p-value for am is too large, in both the linear regression and the ridge regression, to give a definite answer.
Quantify the MPG difference between automatic and manual transmissions. Suppose that the data in mtcars are normally distributed and that the difference in MPG between automatic and manual transmissions depends only on transmission type, setting aside all other relevant predictors. The difference is then simply the difference between the two group means. The relevant t-statistic for this estimate can be found in the Appendix.
# In mtcars, am = 0 is automatic and am = 1 is manual
manual <- mtcars[mtcars$am == 1, ]$mpg
auto <- mtcars[mtcars$am == 0, ]$mpg
print(paste0('The difference in MPG between manual and automatic transmissions = ', mean(manual) - mean(auto)))
## [1] "The difference in MPG between manual and automatic transmissions = 7.24493927125506"
But, as we know, the other relevant predictors cannot be set aside, so this estimate cannot be taken as a statistically confirmed effect of transmission type.
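To illustrate why the raw difference cannot be read as a pure transmission effect, the am coefficient can be compared with and without adjustment for another predictor. The sketch below adjusts for weight only; the choice of wt as the single adjusting variable and the names `fit_raw`, `fit_adj`, `raw_diff` and `adj_diff` are assumptions made for this illustration, not part of the original analysis.

```r
# Unadjusted vs. weight-adjusted transmission effect (illustrative sketch)
df <- mtcars

fit_raw <- lm(mpg ~ am, data = df)        # am only
fit_adj <- lm(mpg ~ wt + am, data = df)   # am after accounting for weight

# With a single 0/1 predictor, the slope equals the difference of group means
raw_diff <- unname(coef(fit_raw)["am"])
adj_diff <- unname(coef(fit_adj)["am"])

print(c(unadjusted = raw_diff, adjusted_for_wt = adj_diff))
```

The unadjusted coefficient reproduces the difference of means computed above, while the adjusted one shows how much of that difference remains once weight is held fixed.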
fitLr_1 <- lm(mpg ~ ., df)
summary(fitLr_1)
##
## Call:
## lm(formula = mpg ~ ., data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4506 -1.6044 -0.1196  1.2193  4.6271
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337   18.71788   0.657   0.5181
## cyl         -0.11144    1.04502  -0.107   0.9161
## disp         0.01334    0.01786   0.747   0.4635
## hp          -0.02148    0.02177  -0.987   0.3350
## drat         0.78711    1.63537   0.481   0.6353
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739
## vs           0.31776    2.10451   0.151   0.8814
## am           2.52023    2.05665   1.225   0.2340
## gear         0.65541    1.49326   0.439   0.6652
## carb        -0.19942    0.82875  -0.241   0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
The p-value for am is greater than 0.2 (0.234), and in fact no coefficient in this model is significant at the 5% level, so none of them can be distinguished from zero. This model therefore cannot answer the question of interest either.
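The comparison between the four-predictor model and the full model can be made formal with an F-test for nested models. This is a minimal sketch under the assumption that `anova()` on the two fits is an acceptable way to compare them; `fit_small`, `fit_full`, `cmp` and `p_extra` are names introduced for the illustration.

```r
# Nested-model F-test: do the six extra predictors improve the fit? (sketch)
df <- mtcars

fit_small <- lm(mpg ~ cyl + hp + wt + am, data = df)
fit_full  <- lm(mpg ~ ., data = df)

cmp <- anova(fit_small, fit_full)
print(cmp)

# p-value for the hypothesis that all extra coefficients are zero
p_extra <- cmp[["Pr(>F)"]][2]
print(p_extra)
```

A large p-value here would indicate that the additional predictors do not significantly reduce the residual sum of squares relative to the smaller model.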
boxplot(mpg ~ am, data=df, col=c('blue', 'green'),
xlab='Transmission (0 = Automatic, 1 = Manual)',
ylab='Miles per Gallon', main='Boxplot of MPG vs. Transmission')
boxplot(mpg ~ cyl, data=df, col=c('red', 'green', 'blue'),
xlab='Number of Cylinders',
ylab='Miles per Gallon', main='Boxplot of MPG vs. Number of Cylinders')
pairs(mpg ~ ., data = df)
t.test(mpg ~ am, data = df)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
##        17.14737        24.39231
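The t-test above relies on MPG being approximately normal within each transmission group. One way to probe that assumption is a Shapiro-Wilk test per group; the sketch below is illustrative only, `sw_p` is a name introduced here, and with samples this small the test has limited power.

```r
# Shapiro-Wilk normality check of mpg within each transmission group (sketch)
df <- mtcars

# p-value per group (names "0" = automatic, "1" = manual)
sw_p <- tapply(df$mpg, df$am, function(x) shapiro.test(x)$p.value)
print(round(sw_p, 4))
```

Small p-values would cast doubt on the normality assumption; large ones only mean that no departure from normality was detected.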
par(mfrow = c(2,2))
plot(fitLr)