We are asked to perform a regression analysis for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG), which is the outcome. They are particularly interested in the following two questions:
1. Is an automatic or manual transmission better for MPG?
2. Quantify the MPG difference between automatic and manual transmissions.
Let's load the data and check its general structure.
data(mtcars)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
We start with the most important question: is an automatic or manual transmission better for MPG?
# Convert am to a factor for later use.
mtcars$am <- as.factor(mtcars$am)
# Note: in mtcars, am = 0 means automatic and am = 1 means manual.
levels(mtcars$am) <- c("Automatic", "Manual")
boxplot(mpg ~ am, data = mtcars, main = "MPG by Transmission", varwidth = TRUE, col = c(3, 4), ylab = "MPG")
Although this is only an exploratory graph, manual transmission appears to have a higher mpg, as expected.
To quantify the numerical difference in mpg:
by(mtcars$mpg, mtcars$am , mean)
## mtcars$am: Automatic
## [1] 17.14737
## --------------------------------------------------------
## mtcars$am: Manual
## [1] 24.39231
Of course, this does not take any of the other variables into account, but it still suggests that manual transmission is better for mpg: the group means differ by about 7.2 mpg.
We can also test this difference formally with a t-test:
t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
With a p-value of 0.001374, our first impression is that transmission type has a significant effect on mpg. Because the test takes the difference as Automatic minus Manual, the 95% confidence interval (-11.280194, -3.209684) means that automatic cars average between about 3.2 and 11.3 mpg less than manual cars.
Before developing the model, it helps to check the pairwise relationships among all the variables so that we can decide which ones to include in our linear regression model.
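The same interval can be read off the test object instead of being retyped from the printout. The snippet below is a minimal sketch of that idea. It works on a fresh copy of the data (the names `cars`, `mpg_gain` and `gain_ci` are introduced here only for illustration) and flips the sign so the result reads as the gain of manual over automatic.

```r
# Illustrative sketch; `cars` is a hypothetical working copy of the data set.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

tt <- t.test(mpg ~ am, data = cars)   # Welch two-sample t-test, as above

# t.test() reports Automatic minus Manual, so flip the sign and the order
# of the bounds to express the difference as Manual minus Automatic.
mpg_gain <- unname(diff(tt$estimate))      # mean(Manual) - mean(Automatic)
gain_ci  <- rev(-as.numeric(tt$conf.int))  # 95% CI for the same difference

print(mpg_gain)
print(gain_ci)
```

Both quantities are unadjusted: they describe the raw difference between the two groups and say nothing yet about the other variables.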
pairs(mtcars, pch = 18, panel = panel.smooth)
Our approach is a linear regression fit. We will fit a few models with different regressors, and the pairs plot gives an idea of which variables to include.
Both our intuition and the pairs plot suggest that cyl, wt, hp and, of course, am are related to mpg: we expect the number of cylinders, the weight of the car and the horsepower to affect fuel consumption. On the other hand, disp and vs also seem to have an effect, but we are unsure whether adding them helps our model at all.
The variables carb, qsec, drat and gear do not show promising relationships with mpg, so we discard them.
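A pairs plot is read by eye, so a numeric summary can serve as a cross-check on the visual impression. The following is a rough sketch, assuming plain Pearson correlations are an adequate first look. They treat cyl, vs and am as numeric codes and ignore any overlap between regressors, and the names `cars` and `mpg_cor` are illustrative only.

```r
# Illustrative sketch; uses an untouched numeric copy of the data set.
cars <- datasets::mtcars

# Correlation of every variable with mpg, dropping mpg's correlation with itself.
mpg_cor <- cor(cars)["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Sort by absolute strength so the strongest single-variable candidates come first.
mpg_cor <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(mpg_cor, 2))
```

The ranking is only a guide. A variable with a modest marginal correlation can still matter once others are in the model, and the reverse is also true, which is why the choice is checked with nested models below.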
We will fit three models: one with transmission type as the only regressor, then one that adds the variables listed above (cyl, hp and wt), and finally one that also adds disp and vs.
# Treat the number of cylinders as a factor.
mtcars$cyl <- as.factor(mtcars$cyl)
# Create the models
fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- lm(mpg ~ am + cyl + hp + wt, data=mtcars)
fit3 <- lm(mpg ~ am + cyl + hp + wt + disp + vs, data = mtcars)
Let's compare the nested models with ANOVA (analysis of variance):
anova(fit1, fit2, fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl + hp + wt
## Model 3: mpg ~ am + cyl + hp + wt + disp + vs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 23.8956 4.524e-08 ***
## 3 24 143.09 2 7.94 0.6655 0.5233
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The low p-value of 4.524e-08 for Model 2 shows that adding cyl, hp and wt to am significantly improves the fit over the transmission-only model. Similarly, the high p-value for Model 3 (0.5233) suggests there is not much gain from including disp and vs.
So we select fit2 as our model.
Let’s take a closer look at fit2 to “Quantify the MPG difference between automatic and manual transmissions”.
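As a second opinion on the model choice, the three fits can also be compared with an information criterion. This is a minimal sketch using AIC, which is a swapped-in technique and not part of the analysis above. The models are refit on a fresh copy of the data under the hypothetical names `m1`, `m2` and `m3` so the snippet runs on its own.

```r
# Illustrative sketch; `cars`, `m1`, `m2`, `m3` are hypothetical names.
cars <- datasets::mtcars
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

m1 <- lm(mpg ~ am, data = cars)
m2 <- lm(mpg ~ am + cyl + hp + wt, data = cars)
m3 <- lm(mpg ~ am + cyl + hp + wt + disp + vs, data = cars)

# AIC() returns a data frame with one row per model; lower AIC is preferred.
aic_table <- AIC(m1, m2, m3)
print(aic_table)
```

AIC penalizes each extra parameter, so it rewards a model only when the added regressors reduce the residual sum of squares enough to pay for themselves.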
summary(fit2)
##
## Call:
## lm(formula = mpg ~ am + cyl + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## amManual 1.80921 1.39630 1.296 0.20646
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
The result after adding extra regressors to transmission type is important. With cyl, hp and wt held constant, we can quantify the effect of manual transmission as a 1.80921 increase in mpg, far below the raw difference in group means of about 7.2 mpg. However, the p-value is 0.20646, which is well above the usual 0.05 threshold. So we conclude that, once the other variables are accounted for, transmission type may not have a significant effect on mpg.
Weight and horsepower are much more significant, with p-values of 0.00908 and 0.02693 respectively.
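A p-value above 0.05 implies that the 95% confidence interval for the transmission coefficient contains zero. The sketch below makes that interval explicit with `confint()`. The model is refit on a fresh copy of the data under the hypothetical name `m2` so the snippet is self-contained.

```r
# Illustrative sketch; `cars`, `m2` and `ci` are hypothetical names.
cars <- datasets::mtcars
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

m2 <- lm(mpg ~ am + cyl + hp + wt, data = cars)

# 95% confidence intervals for all coefficients; one row per term.
ci <- confint(m2, level = 0.95)
print(ci["amManual", ])   # adjusted manual-vs-automatic difference in mpg
print(ci[c("hp", "wt"), ])
```

Reporting the interval alongside the point estimate shows both the size of the adjusted transmission effect and how uncertain it is with only 32 cars.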
par(mfrow = c(2,2))
plot(fit2)