This report explores the relationship between several variables in the mtcars data set and the response variable “miles per gallon (MPG)”. We are particularly interested in two questions. First, is an automatic or a manual transmission better for MPG? Second, how large is the difference in MPG between automatic and manual transmissions? The exploration and analysis of the mtcars data indicate that, among all variables, “Weight” and “Number of cylinders” have a significant impact when quantifying the difference in “MPG” between automatic and manual transmission cars.
##            Mean Std.Dev Count
## automatic 17.15   3.834    19
## manual    24.39   6.167    13
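A group summary of this kind could be reproduced with a short sketch like the following. It assumes the coding `am` 0 = automatic, 1 = manual from the mtcars help page; the names `cars`, `transmission` and `group_stats` are illustrative and do not come from the report.

```r
# Illustrative names (cars, transmission, group_stats) are assumptions,
# not objects defined in the report.
cars <- transform(
  mtcars,
  transmission = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
)

# Mean, standard deviation and number of cars per transmission type
group_stats <- do.call(rbind, lapply(
  split(cars$mpg, cars$transmission),
  function(x) c(Mean = mean(x), Std.Dev = sd(x), Count = length(x))
))

print(round(group_stats, 3))
```

The spread reported per group is the standard deviation of the raw MPG values; dividing it by the square root of the group size would give the standard error of the mean.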
Motivated by the difference in the group means and by the box plots, we use a two-sample t-test (Welch, unequal variances) to check whether automatic cars have lower miles per gallon than manual cars.
## conf.lev t.stat    df p.value lower.ci upper.ci aut.mean man.mean
##     0.90 -3.767 18.33 0.00137   -10.58   -3.913    17.15    24.39
##     0.95 -3.767 18.33 0.00137   -11.28   -3.210    17.15    24.39
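A comparison like the one tabulated above might be run as follows. This is a sketch, assuming that `t_am` is a factor built from `am` with the levels automatic and manual (the report does not show how `my_mtcars` was constructed). R's `t.test` defaults to the Welch variant, which matches the fractional degrees of freedom in the table. The names `conf_levels`, `tests` and `t_table` are illustrative.

```r
# Assumed construction of my_mtcars and t_am (not shown in the report)
my_mtcars <- transform(
  mtcars,
  t_am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
)

# Welch two-sample t-test at both confidence levels
conf_levels <- c(0.90, 0.95)
tests <- lapply(conf_levels, function(lev) {
  t.test(mpg ~ t_am, data = my_mtcars, conf.level = lev)
})

# Collect statistic, degrees of freedom, p-value and interval bounds
t_table <- data.frame(
  conf.lev = conf_levels,
  t.stat   = sapply(tests, function(tt) unname(tt$statistic)),
  df       = sapply(tests, function(tt) unname(tt$parameter)),
  p.value  = sapply(tests, function(tt) tt$p.value),
  lower.ci = sapply(tests, function(tt) tt$conf.int[1]),
  upper.ci = sapply(tests, function(tt) tt$conf.int[2])
)

print(t_table, digits = 4)
```

An interval for the difference in means (automatic minus manual) that excludes zero at a given level corresponds to rejecting equal means at that level.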
Appendix B shows the normality check for “Miles per gallon (MPG)”. At both confidence levels, 0.90 and 0.95, the confidence interval for the difference in means excludes zero (p = 0.00137), so we reject the hypothesis that the mean MPG of automatic and manual cars is equal: the difference is significant. However, this comparison ignores every other variable in the data, so the statement is not robust. Let us therefore fit a multiple linear regression.
After a first look at the AIC statistic and the changes in the estimates in the table of Appendix C, three variables stand out: wt, am and qsec. Including wt and hp in a regression equation also makes sense intuitively: heavier cars and cars with more horsepower should have lower MPG.
First, let us look at a simple linear model with the predictor “t_am” and the response “mpg”.
my_cars.lm1 <- lm(mpg ~ t_am, data = my_mtcars)
sum.lm1 <- summary(my_cars.lm1)
conf.lm1 <- confint(my_cars.lm1, level = 0.95)
drop.all1 <- drop1(my_cars.lm1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.3923 -3.0923 -0.2974 0.0000 3.2439 9.5077
## Estim. Std.Err t.val Pr(>|t|) r.sqrd 2.5% Sum of Sq RSS
## (Intercept) 17.147 1.125 15.247 1.134e-15 0.3598 14.851 NA 720.9
## t_ammanual 7.245 1.764 4.106 2.850e-04 0.3598 3.642 405.2 1126.0
We do not gain much more information from this model than from our hypothesis test. Interpreting the intercept and the coefficient, we say that, on average, automatic cars achieve 17.147 MPG and manual transmission cars achieve 7.245 MPG more. In addition, the “r.squared” value is 0.3598, which means that the model explains only 35.98% of the variance.
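The reading of the intercept and slope can be checked directly against the group means. The sketch below assumes the same construction of `my_mtcars` and `t_am` as elsewhere in this report (a factor with automatic as the reference level); `fit`, `group_means` and the two `*_check` names are illustrative.

```r
# Assumed construction of my_mtcars and t_am (not shown in the report)
my_mtcars <- transform(
  mtcars,
  t_am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
)

fit <- lm(mpg ~ t_am, data = my_mtcars)
group_means <- tapply(my_mtcars$mpg, my_mtcars$t_am, mean)

# Intercept = mean MPG of the reference level (automatic)
intercept_check <- isTRUE(all.equal(
  unname(coef(fit)["(Intercept)"]),
  unname(group_means["automatic"])
))

# Slope = mean MPG of manual cars minus mean MPG of automatic cars
slope_check <- isTRUE(all.equal(
  unname(coef(fit)["t_ammanual"]),
  unname(group_means["manual"] - group_means["automatic"])
))

# Share of the variance in mpg explained by transmission alone
r_squared <- summary(fit)$r.squared

print(c(intercept = intercept_check, slope = slope_check))
print(round(r_squared, 4))
```

With a single two-level factor as predictor, the regression carries the same information as the comparison of the two group means, which is why this model adds little to the t-test.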
Let us look at the relationship between MPG and the other variables in the following models.
M1 <- lm(mpg ~ t_am, data=my_mtcars )
M2 <- lm(mpg ~ t_am+I(wt-mean(wt)), data=my_mtcars )
M3 <- lm(mpg ~ t_am+I(wt-mean(wt))+I(qsec-mean(qsec)), data=my_mtcars )
M4 <- lm(mpg ~ t_am+I(wt-mean(wt))+I(cyl-mean(cyl))+I(hp-mean(hp)),data=my_mtcars )
tab_anova <- anova(M1, M2, M3, M4)
print("Analysis of Variance Table")
## [1] "Analysis of Variance Table"
print(tab_anova, format="html", digits=3)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## M1: mpg ~ t_am 30 721 NA NA NA NA
## M2: mpg ~ t_am+wt 29 278 1 442.577 70.3 5.39e-09
## M3: mpg ~ t_am+wt+cyl 28 169 1 109.034 17.3 2.88e-04
## M4: mpg ~ t_am+wt+cyl+hp 27 170 1 -0.712 NA NA
The contribution of the variable “hp” in model M4 is not significant: the 95% confidence interval for its coefficient, [-0.07, 0.02], contains zero, in line with the t-value of -0.98684 for the test H0: beta_4 = 0 (Appendix C). Therefore M3 is the best-fitting multiple linear model.
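The sequential F-tests of `anova()` assume that each model is nested in the one that follows it. The sketch below restricts the comparison to the nested chain M1, M2, M3, using the formulas from the model listing above. The construction of `my_mtcars` and `t_am` is an assumption, and `nested_anova` is an illustrative name. M4 as coded contains cyl and hp but not qsec, so it does not extend M3 and is left out here.

```r
# Assumed construction of my_mtcars and t_am (not shown in the report)
my_mtcars <- transform(
  mtcars,
  t_am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
)

# Nested chain: each model adds one centered regressor to the previous one
M1 <- lm(mpg ~ t_am, data = my_mtcars)
M2 <- lm(mpg ~ t_am + I(wt - mean(wt)), data = my_mtcars)
M3 <- lm(mpg ~ t_am + I(wt - mean(wt)) + I(qsec - mean(qsec)), data = my_mtcars)

# Each row tests whether the added regressor reduces the residual sum of squares
nested_anova <- anova(M1, M2, M3)
print(nested_anova)
```

Because the F statistics use the residual variance of the largest model in the call, they change when a model is added to or removed from the comparison, while the residual sums of squares stay the same.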
The variance inflation factors of the regressors “wt” and “cyl” increase strongly in model M4 compared with model M3 (Appendix F).
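A variance inflation factor can also be computed without an add-on package: regress each predictor on the remaining ones and take 1 / (1 - R²). The helper `vif_manual` below is a hypothetical illustration, not the function used in the report, and it works on the numeric 0/1 column `am` rather than the factor `t_am`, which gives the same value for a two-level factor.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all other predictors in the model.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Centering the regressors does not change these values
vif_m2 <- vif_manual(mtcars, c("am", "wt"))
vif_m4 <- vif_manual(mtcars, c("am", "wt", "cyl", "hp"))

print(round(vif_m2, 3))
print(round(vif_m4, 3))
```

A large value means that the predictor is well explained by the other predictors, so its coefficient is estimated with inflated variance.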
Model M3 explains more than 81.22% of the variance. Moreover, we see that wt and hp did indeed confound the relationship between am and mpg (mostly wt). Reading the coefficients for am in Appendix C, we say that, on average, manual transmission cars have 2.52 MPG more than automatic transmission cars.
The residuals of models M1 and M3 are normally distributed and homoskedastic. Let us look at the residual plots of the multiple linear regression model M3 in Appendix E.
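The residual plots can be complemented by a numerical check. The sketch below applies a Shapiro-Wilk test to the residuals of M3; this is one common choice and not necessarily the check the report relies on. The construction of `my_mtcars` and `t_am` is an assumption, and `res` and `sw` are illustrative names.

```r
# Assumed construction of my_mtcars and t_am (not shown in the report)
my_mtcars <- transform(
  mtcars,
  t_am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
)

# Model M3 with the formula from the model listing
M3 <- lm(mpg ~ t_am + I(wt - mean(wt)) + I(qsec - mean(qsec)), data = my_mtcars)

# Residual summary and Shapiro-Wilk normality test on the residuals
res <- residuals(M3)
sw <- shapiro.test(res)

print(summary(res))
print(sw)
```

A small p-value would speak against normally distributed residuals. The test does not address homoskedasticity, for which the scale-location plot remains the reference here.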
Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391-411.

A data frame with 32 observations on 11 variables.

* [, 1] mpg Miles/(US) gallon
* [, 2] cyl Number of cylinders
* [, 3] disp Displacement (cu.in.)
* [, 4] hp Gross horsepower
* [, 5] drat Rear axle ratio
* [, 6] wt Weight (1000 lbs)
* [, 7] qsec 1/4 mile time
* [, 8] vs V/S
* [, 9] am Transmission (0 = automatic, 1 = manual)
* [,10] gear Number of forward gears
* [,11] carb Number of carburetors
print( head(mtcars, 2), format="html", digits=4 )
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
Let us run the Shapiro-Wilk normality test and look at the Q-Q plot and the histogram.
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.9, p-value = 0.1
## Q-Q-Plot and histogram
par(mfrow=c(1,2), mar=c(3.8,4,2,2)+0.1, cex=0.6); qqnorm( mtcars$mpg)
hist( mtcars$mpg, breaks=8, cex=0.6, main="Histogram of Miles/(US) gallon" )
Table “Multiple Regression Model with Estimate, Std.Err, RSS and AIC”
## define model
cars.lm <- lm(mpg~factor(am)+wt+cyl+hp+disp+drat+gear+carb+vs+qsec-1,data=mtcars)
sum.lm <- summary(cars.lm)
conf.lm <- confint(cars.lm, level = 0.95)
drop.all <- drop1(cars.lm)
## Estim. Std.Err t.val Pr(>|t|) 2.5% 97.5% RSS AIC
## factor(am)0 12.30337 18.71788 0.6573 0.51812 -26.62 51.23 147.5 70.90
## factor(am)1 14.82360 18.35265 0.8077 0.42831 -23.34 52.99 164.6 70.41
## wt -3.71530 1.89441 -1.9612 0.06325 -7.65 0.22 174.5 74.28
## cyl -0.11144 1.04502 -0.1066 0.91609 -2.28 2.06 147.6 68.92
## hp -0.02148 0.02177 -0.9868 0.33496 -0.07 0.02 154.3 70.35
## disp 0.01334 0.01786 0.7468 0.46349 -0.02 0.05 151.4 69.74
# Basic Scatterplot Matrix
# panel.cor is not a base R function; it is assumed to be defined beforehand (e.g. as in the ?pairs examples)
my_cars <- mtcars[, c("mpg", "wt", "am", "cyl")]
pairs(my_cars,lower.panel=panel.smooth,upper.panel=panel.cor,main="Scatterplot Matrix")
par(mfrow=c(1,3), mar=c(3.8,4,2,2)+0.1, cex=0.6);
plot(my_cars.lm1, which=c(1,2,4))
##
## Call:
## lm(formula = mpg ~ t_am + I(wt - mean(wt)) + I(qsec - mean(qsec)),
## data = my_mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.8979 0.7194 26.271 < 2e-16 ***
## t_ammanual 2.9358 1.4109 2.081 0.046716 *
## I(wt - mean(wt)) -3.9165 0.7112 -5.507 6.95e-06 ***
## I(qsec - mean(qsec)) 1.2259 0.2887 4.247 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
## Residuals of multiple regression model M3
par(mfrow=c(2,2), mar=c(3.8,4,2,2)+0.1, cex=0.6)
plot(M3, which=c(1,2,3,4))
library(car)  # provides vif()
vit_all <- rbind( c(vif(M2), NA, NA), c(vif(M3),NA), vif(M4) )
colnames(vit_all) <- c( "t_am", " wt", " cyl", " hp")
rownames(vit_all) <- c( "Model 2: mpg ~ t_am + wt",
"Model 3: mpg ~ t_am + wt + cyl",
"Model 4: mpg ~ t_am + wt + cyl + hp" )
print( vit_all, format="html", digits=4 )
## t_am wt cyl hp
## Model 2: mpg ~ t_am + wt 1.921 1.921 NA NA
## Model 3: mpg ~ t_am + wt + cyl 2.541 2.483 1.364 NA
## Model 4: mpg ~ t_am + wt + cyl + hp 2.546 3.988 5.334 4.31