author: liuyubobobo
date: Sunday, April 26, 2015
First of all, we need to load the mtcars data.
library(datasets)
data(mtcars)
Then, we do some exploratory analysis using head , str functions and plot the variables’ relationship using pairs function. The result plot is shown on Appendix, Figure 1.
head(mtcars)
str(mtcars)
pairs(mtcars, panel=panel.smooth, main="pairs plot for mtcars data set", col=4+(mtcars$am==0))
Create a regression model using lm function for only mpg and am,which we are interested in. We call this regression model fit0
fit0 <- lm(mpg ~ factor(am), data=mtcars)
Then, create mutiple models which are added a second regressor to see wheter other variables effect the result. Let’s call these models from fit1 to fit9
fit1 <- lm(mpg ~ factor(am)+factor(cyl), data=mtcars);
fit2 <- lm(mpg ~ factor(am)+disp, data=mtcars)
fit3 <- lm(mpg ~ factor(am)+hp, data=mtcars);
fit4 <- lm(mpg ~ factor(am)+drat, data=mtcars)
fit5 <- lm(mpg ~ factor(am)+wt, data=mtcars);
fit6 <- lm(mpg ~ factor(am)+qsec, data=mtcars)
fit7 <- lm(mpg ~ factor(am)+factor(vs), data=mtcars)
fit8 <- lm(mpg ~ factor(am)+factor(gear), data=mtcars)
fit9 <- lm(mpg ~ factor(am)+factor(carb), data=mtcars)
Then we can use summary to see the different coefficients of am in each model, as follows:
## Model 0 Model 1 Model 2 Model 3 Model 4 Model 5
## 7.24493927 2.55995370 1.83345825 5.27708531 2.80706095 -0.02361522
## Model 6 Model 7 Model 8 Model 9
## 8.87633094 6.06666667 5.22500000 7.08988506
From the above results, we can see that the coefficients of am in fit5 model are totally different from fit0, the sign of the coefficent is changed. Besides, disp, cyl and drat also make the cofficient of am changed a lot, which means these variables may be included in the best model. We can use nest models and nested likelihood ratio tests to determine which model should be selected.
fit10 <- update(fit0, mpg ~ factor(am)+wt)
fit11 <- update(fit0, mpg ~ factor(am)+wt+disp)
fit12 <- update(fit0, mpg ~ factor(am)+wt+disp+factor(cyl))
fit13 <- update(fit0, mpg ~ factor(am)+wt+disp+factor(cyl)+drat)
anova(fit10,fit11,fit12,fit13)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am) + wt
## Model 2: mpg ~ factor(am) + wt + disp
## Model 3: mpg ~ factor(am) + wt + disp + factor(cyl)
## Model 4: mpg ~ factor(am) + wt + disp + factor(cyl) + drat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 278.32
## 2 28 246.56 1 31.763 4.3470 0.04744 *
## 3 26 182.87 2 63.687 4.3579 0.02379 *
## 4 25 182.68 1 0.195 0.0266 0.87166
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above result shows that ther’s a significant difference between fit11 and fit12. fit12 is selected for this research.
we examine our model by doing residual plots, which is shown on Appendix, Figure 2, and regression diagnostics. The results tell us our model is good enough for this research.
par(mfrow = c(2,2))
plot(fit12)
hatvalues(fit12)
dfbetas(fit12)
We use summary on our final selected model fit12 for analysis.
summary(fit12)
The detail is shown on Appendix, Figure3, From the result, We can clearly see the following coclusions:
1. the mpg for Manual transmission cars is larger than the Automatic transmission cars;
2. the mpg difference between automatic transmission cars and manual transmission cars is not significant;
3. 80.64% of total variation can be described by our model.
Figure 1 shows the variables’ relashionship using pais function.
Figure 2 shows the residual plots of our final selected model fit12
Figure 3 shows the summary of our final selected model fit12
##
## Call:
## lm(formula = mpg ~ factor(am) + wt + disp + factor(cyl), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5029 -1.2829 -0.4825 1.4954 5.7889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.816067 2.914272 11.604 8.79e-12 ***
## factor(am)1 0.141212 1.326751 0.106 0.91605
## wt -3.249176 1.249098 -2.601 0.01513 *
## disp 0.001632 0.013757 0.119 0.90647
## factor(cyl)6 -4.304782 1.492355 -2.885 0.00777 **
## factor(cyl)8 -6.318406 2.647658 -2.386 0.02458 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.652 on 26 degrees of freedom
## Multiple R-squared: 0.8376, Adjusted R-squared: 0.8064
## F-statistic: 26.82 on 5 and 26 DF, p-value: 1.73e-09