Short note: On my rpubs-site, I have a longer version of the report with more descriptions and explanations. If you want, you can check it out on http://rpubs.com/Niko/81205. Let me add, that stargaze is a great package for tables. You should definitely have a look at it!
Working for Motor Trend, a magazine about the automobile industry, our task was to explore the relationship between a set of variables and miles per gallon (MPG) (outcome). Our client was particularly interested in the following two questions:
"Is an automatic or manual transmission better for MPG"
"Quantify the MPG difference between automatic and manual transmissions"
Preliminary analyses showed an effect of transmission type on MPG. The presence of a manual transmisson led to an average increase of roughly 7 miles per gallon. However, when we included further predictors in the model, the effect shrank to about 2 miles per gallon and, moreover, turned out to be non-significant. We propose to reproduce the calculations on a bigger data set in the future so as to investigate further on the topic.
Having a first look at the relationship between MPG and AM, we are intrigued to claim, that there is, indeed, an obvious relatonship between the two variables. A manual transmission seems to have a positive impact on MPG: The average MPG of the cars with a manual transmission is at approximately 24.39, whereas the average MPG for cars with an automatic transmission is at approximately 17.14. After getting an idea about the correlations between mpg and some further variables in the dataset, it gets clear that we should take these variables into account, as well, when specifying the model. The plots show clear relationships between the predictor variables on the and MPG. All of them are negative relationships, where high values of MPG correlate with low values in the predictor variables, and vice versa.
Instead of throwing all variables of the dataset into our model, at once, we propose a different approach by building our model in a stepwise fashion. As our client is especally interested in the relationship between MPG and AM, our first model will comprise only these two variables.
fit1 <- lm(mpg ~ am, data = mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Using this simple regression model, we come to the conclusion, that the effect of AM on MPG explains about 36% of the variation in the outcome variable (see R-squared). The F-statistic is significant (p=0.0002) which tells us that using AM as a predictor adds valuable information to the model. As regards, the coefficient of AM, we can see that there is an increase of 7.245 miles per gallon, if the car has a manual transmission. Note, that this value is the same value as the difference between the two means in the tables!
fit2 <- lm(mpg ~ am + wt, data=mtcars)
fit3 <- lm(mpg ~ am + wt + hp, data=mtcars)
fit4 <- lm(mpg ~ am + wt + hp + disp, data=mtcars)
fit5 <- lm(mpg ~ am + wt + hp + disp + cyl, data=mtcars)
As we plug in WT, the model changes dramatically. R-squared, the variance explained, jumps up to 75%. However, the coefficient of AM is no longer significant, that is, almost the complete variance is explained now by the variable weight. Adding HP, we see an increase in R-squared to 84% explained variance. Both WT and HP have significant coeffcients. AM’s coeffcient is still non-significant, but to a much lesser degree compared to the second model. Model 4 shows no improvement in R-squared; the coeffcient of DISP is not significant, at all. Adding CYL improves R-squared by about 2%. The dummy variable CYL6 is significant, whereas the dummy variable CYL8 is not. That means, that with a change from 4 (baseline) to 6 cylinders, we see a significant change of 3 miles per gallon less. The decrease of 2 miles per gallon between 4 cylinders and 8 cylinders is not significant. The ANOVA for model 5 turns out to be non-significant, assuming a 5% alpha-level, so we stick with model 3. One might argue, that this model disposes of only a few variables. However, given the high R-squared we achieved by including only three variables, we can be satisfied. Besides, having only a few predictors makes interpretation much easier. If we turn back to our question, if an automatic of manual transmission is better for MPG, and if so, what the difference in MPG is between these two types, we can conclude: It turns out, that WT and HP account for most of the variance in MPG; at the same time they correlate with AM. That means, that a manual transmission correlates with low numbers in HP and WT. If we control for WT and HP, there is only an effect of 2 miles per gallon more in case of a manual transmission, but this estimate is not significant, so chances are high that it could be 0 miles per gallon as well!
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observerations. The two plots in the first column show the fitted values (a.k.a the standardized predicted values) vs. the (standardized) residuals. They inform us about any violations against homoscedasticity and/or linearity. As one can see, the assumption of homoscdasticity is met; on the other hand, we have some slight curves which implies some violation of linearity, however, these curves are not pronounced. So we can check the boxes as regards homoscedasticity and linearity. The plot in the upper right corner informs us about the criterion, if the standardized residuals are normally distributed. As we can see, there are some data points that deviate slightly from the line; however, overall, the curve of the data points is not that wiggly, so we can check the box ‘normally distributed residuals’ as well. The fourth plot gives us an idea about potential outliers. From the plot, we know that none of the observations is even above 0.5; so we can say, that there are no outliers that have a disturbing leverge effect on the model.
durbinWatsonTest(fit3)
## lag Autocorrelation D-W Statistic p-value
## 1 0.232331 1.43292 0.048
## Alternative hypothesis: rho != 0
The null-hypothesis that there is no autocorrelation must be rejected, if we assume alpha<0.05. So that is a problem, we should focus on in future analyses.
vif(fit3)
## am wt hp
## 2.271082 3.774838 2.088124
Here, we can see, that the variable that may pose a problem is WT. However, its tolerance (calculated by 1/VIF) is still above 0.2, which is alright. Still, the average of VIF is above 1, so multicollinearity might pose a problem in this model.
mtcars$am <- as.factor(mtcars$am)
ggplot(mtcars, aes(x = am, y = mpg, fill=am)) + geom_boxplot()
stargazer(subset(mtcars[c("mpg")], mtcars$am==0), title="Automatic transmission", type="text", digits=1, out="table1.txt")
##
## Automatic transmission
## ====================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------
## mpg 19 17.1 3.8 10.4 24.4
## ------------------------------------
stargazer(subset(mtcars[c("mpg")], mtcars$am==1), title="Manual transmission", type="text", digits=1, out="table2.txt")
##
## Manual transmission
## ====================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------
## mpg 13 24.4 6.2 15.0 33.9
## ------------------------------------
mtcars$am <- as.numeric(mtcars$am)
chart.Correlation(mtcars)
levels(mtcars$am) <- c("Automatic", "Manual")
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)
a <- ggplot(mtcars, aes(x = hp, y = mpg, colour = factor(am))) + geom_point()+ stat_smooth(method="lm", formula= y ~ x, se=FALSE)
b <- ggplot(mtcars, aes(x = disp, y = mpg, colour = factor(am))) + geom_point()+ stat_smooth(method="lm", formula= y ~ x, se=FALSE)
c <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(am))) + geom_point()+ stat_smooth(method="lm", formula= y ~ x, se=FALSE)
d <- ggplot(mtcars, aes(x = cyl, y = mpg, colour = factor(am))) + geom_point()
grid.arrange(a, b, c, d, ncol=2)
stargazer(fit2, fit3, fit4, fit5, type="text", intercept.bottom=TRUE, dep.var.labels=c("Miles per gallon"), covariate.labels=c("Transmisson type (manual=1)", "Weight", "Horse Power", "Displacement", "Cylinders(6)", "Cylinders(8)"), out="models.txt")
##
## =======================================================================================================================
## Dependent variable:
## -------------------------------------------------------------------------------------------
## Miles per gallon
## (1) (2) (3) (4)
## -----------------------------------------------------------------------------------------------------------------------
## Transmisson type (manual=1) -0.024 2.084 2.159 1.556
## (1.546) (1.376) (1.435) (1.441)
##
## Weight -5.353*** -2.879*** -3.047** -3.303***
## (0.788) (0.905) (1.157) (1.134)
##
## Horse Power -0.037*** -0.039*** -0.028*
## (0.010) (0.012) (0.014)
##
## Displacement 0.002 0.012
## (0.010) (0.012)
##
## Cylinders(6) -1.106
## (0.676)
##
## Cylinders(8) 37.322*** 34.003*** 34.209*** 38.203***
## (3.055) (2.643) (2.823) (3.669)
##
## -----------------------------------------------------------------------------------------------------------------------
## Observations 32 32 32 32
## R2 0.753 0.840 0.840 0.855
## Adjusted R2 0.736 0.823 0.817 0.827
## Residual Std. Error 3.098 (df = 29) 2.538 (df = 28) 2.581 (df = 27) 2.505 (df = 26)
## F Statistic 44.165*** (df = 2; 29) 48.960*** (df = 3; 28) 35.498*** (df = 4; 27) 30.697*** (df = 5; 26)
## =======================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
fit2 <- update(fit1, mpg ~ am + wt, data=mtcars)
fit3 <- update(fit2, mpg ~ am + wt + hp, data=mtcars)
anova(fit1, fit2, fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 68.734 5.071e-09 ***
## 3 28 180.29 1 98.03 15.224 0.0005464 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(fit3$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit3$residuals
## W = 0.9453, p-value = 0.1059
par(mfrow=c(2,2))
plot(fit3)