background: You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
data(mtcars)
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
# There are 11 variables. Since we are interested in the relationship between mpg and the other variables, we first check their correlations with mpg using the cor() function
# mtcars[,-1] is the dataframe without mpg
cor(mtcars$mpg,mtcars[,-1])
## cyl disp hp drat wt qsec vs am gear
## [1,] -0.8522 -0.8476 -0.7762 0.6812 -0.8677 0.4187 0.664 0.5998 0.4803
## carb
## [1,] -0.5509
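The correlations above are easier to compare when ranked by absolute strength. The following is a minimal sketch, not part of the original analysis; the names `r` and `r_sorted` are introduced here for illustration only.

```r
# Work on the built-in data set directly; no columns are modified here
cars <- datasets::mtcars

# Correlation of mpg with every other column, flattened to a named vector
r <- cor(cars$mpg, cars[, -1])[1, ]

# Rank the variables by the absolute value of their correlation with mpg
r_sorted <- r[order(abs(r), decreasing = TRUE)]
round(r_sorted, 4)
```

Ranked this way, wt, cyl and disp come out as the variables most strongly (and negatively) correlated with mpg, which matches the values printed above.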
# First we convert am to a factor
mtcars$am <- as.factor(mtcars$am)
# by ?mtcars, we can see the following information
# [,9] am Transmission (0 = automatic, 1 = manual)
levels(mtcars$am) <-c("Automatic", "Manual")
A boxplot was created to examine the relationship between mpg and transmission type (Appendix 1); manual cars appear to have better mpg than automatic cars.
# To test this hypothesis statistically, we use a t-test
t.test(mtcars$mpg~mtcars$am,conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.28 -3.21
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.15 24.39
The p-value is 0.001374, so we can reject the null hypothesis of equal means and conclude that automatic cars have lower mpg than manual cars (17.15 vs. 24.39 on average). However, this conclusion assumes that all other characteristics of automatic and manual cars are the same (e.g., that both groups have the same weight distribution), which needs to be explored further in the multiple linear regression analysis.
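Whether the two groups really share the same weight distribution can be checked directly. The sketch below is an illustrative addition, not part of the original analysis; it works on a copy named `cars`, and `wt_by_am` and `wt_test` are names assumed here.

```r
# Copy the data so the session's mtcars object is left untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean weight (in 1000 lbs) for each transmission type
wt_by_am <- tapply(cars$wt, cars$am, mean)
wt_by_am

# Welch two-sample t-test on weight, mirroring the test used for mpg
wt_test <- t.test(wt ~ am, data = cars)
wt_test$p.value
```

If the automatic cars in this sample turn out to be noticeably heavier, the raw mpg gap cannot be attributed to transmission alone, which is the motivation for the regression models that follow.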
# Here we use a stepwise algorithm to choose the best model with the step() function, which selects variables by AIC
stepmodel <- step(lm(mpg ~ ., data = mtcars), trace = 0, steps = 10000)
summary(stepmodel)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.481 -1.556 -0.726 1.411 4.661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.618 6.960 1.38 0.17792
## wt -3.917 0.711 -5.51 7e-06 ***
## qsec 1.226 0.289 4.25 0.00022 ***
## amManual 2.936 1.411 2.08 0.04672 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.46 on 28 degrees of freedom
## Multiple R-squared: 0.85, Adjusted R-squared: 0.834
## F-statistic: 52.7 on 3 and 28 DF, p-value: 1.21e-11
# Here we get a model with 3 variables: wt, qsec and am. This model explains 85% of the total variance in mpg.
# To further refine the model, we fit mpg ~ wt + qsec with separate slopes for each level of am
model <- lm(mpg~ factor(am):wt + factor(am):qsec,data=mtcars)
summary(model)
##
## Call:
## lm(formula = mpg ~ factor(am):wt + factor(am):qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.936 -1.402 -0.155 1.269 3.886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.969 5.776 2.42 0.0226 *
## factor(am)Automatic:wt -3.176 0.636 -4.99 3.1e-05 ***
## factor(am)Manual:wt -6.099 0.969 -6.30 9.7e-07 ***
## factor(am)Automatic:qsec 0.834 0.260 3.20 0.0035 **
## factor(am)Manual:qsec 1.446 0.269 5.37 1.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.1 on 27 degrees of freedom
## Multiple R-squared: 0.895, Adjusted R-squared: 0.879
## F-statistic: 57.3 on 4 and 27 DF, p-value: 8.42e-13
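The two fits can be compared on a common scale. Because neither model is nested in the other (one has separate intercepts and shared slopes, the other a shared intercept and separate slopes), the sketch below uses AIC rather than an F-test. It is an illustrative addition; `step_fit` and `slope_fit` are names assumed here, and the stepwise result is refit directly from its formula.

```r
# Copy the data so the session's mtcars object is left untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Model chosen by step(): common slopes, intercept shifted by transmission
step_fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Model with one intercept and transmission-specific slopes for wt and qsec
slope_fit <- lm(mpg ~ am:wt + am:qsec, data = cars)

# Lower AIC indicates a better trade-off between fit and complexity
AIC(step_fit, slope_fit)
```

A lower AIC for the separate-slopes model would be consistent with its higher adjusted R-squared, although with only 32 observations the comparison should be read cautiously.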
To interpret the results: this model explains 89.5% of the total variance, with an adjusted R-squared of 0.879, which is quite good. In the coefficients section, we can see that for each 1000 lbs increase in weight, mpg decreases by 3.176 for automatic cars and by 6.099 for manual cars, so the mpg advantage of manual cars shrinks as weight increases, and for heavy cars an automatic may be the better choice. For qsec, when acceleration drops and the ¼ mile time increases by 1 second, mpg increases by 0.834 for automatic cars and by 1.446 for manual cars, which implies that, at the same weight, manual cars are better for mpg when the car accelerates slowly. In conclusion, mpg is largely determined by the interplay between weight, acceleration and transmission.
Therefore, given the above analysis, the question of automatic versus manual cars has no single answer and has to be considered in the context of weight and acceleration.
Finally, we plot the residuals and diagnostic plots for this linear model (Appendix 2).
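The interplay between weight and transmission can be made concrete by predicting mpg for a few hypothetical cars. This is a sketch under stated assumptions: the cars in `new_cars` are invented for illustration (2000 lbs and 5000 lbs, each with an 18-second quarter-mile time), and the model is refit on a copy of the data with the same structure as the one above.

```r
# Copy the data so the session's mtcars object is left untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# One intercept, separate wt and qsec slopes per transmission type
fit <- lm(mpg ~ am:wt + am:qsec, data = cars)

# Hypothetical cars: every combination of transmission and two weights
new_cars <- expand.grid(
  am   = factor(c("Automatic", "Manual"), levels = levels(cars$am)),
  wt   = c(2, 5),   # weight in 1000 lbs
  qsec = 18         # quarter-mile time in seconds
)

# Predicted mpg for each hypothetical car
new_cars$pred_mpg <- predict(fit, newdata = new_cars)
new_cars
```

Note that no manual car in the data is close to 5000 lbs, so the heavy manual prediction is an extrapolation and only illustrates the direction implied by the coefficients.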
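# Appendix 1: boxplot of mpg by transmission type
boxplot(mpg ~ am, data = mtcars, outpch = 19, ylab = "mpg: miles per gallon", xlab = "transmission type", main = "mpg vs transmission type")
# Appendix 2: residual and diagnostic plots for the final model
par(mfrow = c(2, 2))
plot(model)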
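The diagnostic plots can be complemented with a couple of numeric checks. The sketch below is an illustrative addition rather than part of the original appendix; it refits the same model on a copy of the data, and `fit`, `res` and `cd` are names assumed here.

```r
# Copy the data so the session's mtcars object is left untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ am:wt + am:qsec, data = cars)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
res <- residuals(fit)
shapiro.test(res)

# Cook's distance: the largest values flag the most influential cars
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)
```

These checks address the same questions as the Q-Q and residuals-vs-leverage panels, so they are best read alongside the plots rather than instead of them.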