This report analyses the mtcars data set. The goal is to determine whether transmission type influences fuel economy (miles per gallon) and, if so, by how much. With weight (wt) and quarter-mile time (qsec) held fixed, a car with manual transmission is estimated to travel 2.9358 miles per gallon more, on average, than a car with automatic transmission. Detailed results follow.
First, we fit a regression with mpg as the outcome and am as the only regressor, where am = 0 for automatic transmission and am = 1 for manual transmission.
library(ggplot2)
library(car)
data(mtcars)
mtcars$am <- as.numeric(as.character(mtcars$am))
fit1 <- lm( mpg ~ am, data = mtcars )
summary(fit1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -9.392 -3.092 -0.297  3.244  9.508
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    17.15       1.12   15.25  1.1e-15 ***
## am              7.24       1.76    4.11  0.00029 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared: 0.36, Adjusted R-squared: 0.338
## F-statistic: 16.9 on 1 and 30 DF, p-value: 0.000285
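# qplot() is deprecated since ggplot2 3.4.0; the ggplot() call below draws the same scatter plot
ggplot(mtcars, aes(mpg, residuals(fit1), colour = am)) + geom_point() + labs(x = "Mpg", y = "Residuals")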
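With a single 0/1 regressor, the slope is simply the difference between the two group means. A minimal check, assuming the stock mtcars data; the names `group_means`, `slope` and `mean_diff` are illustrative and not part of the original analysis:

```r
data(mtcars)

# Mean mpg per transmission type (names are "0" = automatic, "1" = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Slope of the one-regressor model mpg ~ am
slope <- unname(coef(lm(mpg ~ am, data = mtcars))["am"])

# Difference in group means; this should coincide with the slope
mean_diff <- unname(group_means["1"] - group_means["0"])

print(c(slope = slope, mean_diff = mean_diff))
```

Both numbers should agree with the am estimate reported in the summary above.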
As the output shows, a car with manual transmission travels on average 7.245 miles per gallon more than a car with automatic transmission. However, the plot of residuals against mpg shows that the residuals are not randomly scattered, which indicates that some important variables are missing from the model. We therefore consider a multivariable regression with mpg as the outcome and all other variables as regressors. Insignificant variables are removed by backward elimination: regress mpg on all remaining variables and, if any of them are insignificant, drop the least significant one, then refit.
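The elimination procedure described above can be sketched as a loop. This is only an illustration under stated assumptions: the 0.05 threshold `alpha` and the names `vars`, `fit`, `pvals` and `worst` are hypothetical, and the original selection may have been carried out by hand.

```r
data(mtcars)

alpha <- 0.05                            # assumed significance threshold
vars  <- setdiff(names(mtcars), "mpg")   # start from all candidate regressors

repeat {
  # Refit mpg on the regressors still in play
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)

  # p-values of the regressors; row 1 (the intercept) is left out
  pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
  worst <- which.max(pvals[, 1])

  # Stop once the least significant regressor is below the threshold
  if (pvals[worst, 1] < alpha || length(vars) == 1) break

  # Otherwise drop it and go round again
  vars <- setdiff(vars, rownames(pvals)[worst])
}

print(vars)
```

Selecting variables on p-values alone is a simple heuristic; an information-criterion approach such as `step()` is a common alternative.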
fit2 <- lm( mpg ~ . - cyl - vs - carb - gear - drat - disp - hp, data = mtcars )
vif(fit2)
##    wt  qsec    am
## 2.483 1.364 2.541
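For reference, the variance inflation factor of a regressor is 1 / (1 - R²), where R² comes from regressing that regressor on the others. A minimal sketch that reproduces the values above without the car package; `regressors` and `manual_vif` are illustrative names:

```r
data(mtcars)

regressors <- c("wt", "qsec", "am")

# For each regressor, regress it on the remaining ones and convert R^2 to a VIF
manual_vif <- sapply(regressors, function(v) {
  others <- setdiff(regressors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

print(round(manual_vif, 3))
```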
summary(fit2)
##
## Call:
## lm(formula = mpg ~ . - cyl - vs - carb - gear - drat - disp -
## hp, data = mtcars)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.481 -1.556 -0.726  1.411  4.661
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    9.618      6.960    1.38  0.17792
## wt            -3.917      0.711   -5.51    7e-06 ***
## qsec           1.226      0.289    4.25  0.00022 ***
## am             2.936      1.411    2.08  0.04672 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.46 on 28 degrees of freedom
## Multiple R-squared: 0.85, Adjusted R-squared: 0.834
## F-statistic: 52.7 on 3 and 28 DF, p-value: 1.21e-11
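# qplot() is deprecated since ggplot2 3.4.0; the ggplot() call below draws the same scatter plot
ggplot(mtcars, aes(mpg, residuals(fit2), colour = am)) + geom_point() + labs(x = "Mpg", y = "Residuals")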
anova(fit1,fit2)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ (cyl + disp + hp + drat + wt + qsec + vs + am + gear +
## carb) - cyl - vs - carb - gear - drat - disp - hp
##   Res.Df RSS Df Sum of Sq    F  Pr(>F)
## 1     30 721
## 2     28 169  2       552 45.6 1.6e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
a <- c(-1,1)*1.4109*qt(0.95, 28) + 2.9358 # 90% confidence interval for the am coefficient
The ANOVA F-test shows that the regression including the wt and qsec variables fits significantly better than the one-regressor model (p-value = 1.6e-09). The estimated influence of transmission type has decreased because omitted-variable bias is reduced: with wt and qsec held fixed, a car with manual transmission travels 2.9358 miles per gallon more on average. The residual plot shows no obvious heteroscedasticity and the small VIFs indicate no problematic multicollinearity, so the final regression appears reliable. The 90% confidence interval for the am coefficient runs from 0.5357 to 5.3359.
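The same interval can be obtained without hard-coding the estimate and standard error. A minimal sketch using `confint()` from base R, assuming the final model is refit as `mpg ~ wt + qsec + am`; the names `fit`, `ci_builtin` and `ci_manual` are illustrative:

```r
data(mtcars)

# Final model with the three retained regressors
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Built-in 90% confidence interval for the am coefficient
ci_builtin <- confint(fit, "am", level = 0.90)

# Manual version: estimate +/- t quantile * standard error
est <- coef(summary(fit))["am", "Estimate"]
se  <- coef(summary(fit))["am", "Std. Error"]
ci_manual <- est + c(-1, 1) * se * qt(0.95, df.residual(fit))

print(ci_builtin)
print(ci_manual)
```

Using `df.residual(fit)` instead of a literal 28 keeps the calculation correct if the set of regressors changes.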