Bikram Bhusal
28 July, 2019
In this analysis, we use the “mtcars” data set to examine the relationship between a set of variables and miles per gallon (MPG), the outcome. We are particularly interested in the following two questions:
library(ggplot2)
data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Convert the categorical variables to factors
mtcars$vs <- factor(mtcars$vs)
mtcars$am.label <- factor(mtcars$am, labels = c("Automatic", "Manual"))  # 0 = automatic, 1 = manual
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
Boxplot of MPG by transmission type:
boxplot(mpg ~ am.label, data = mtcars, col = c("yellow", "green"),
        xlab = "Transmission Type", ylab = "Miles Per Gallon")
The boxplot suggests that cars with a manual transmission tend to achieve higher MPG than cars with an automatic transmission.
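The visual impression from the boxplot can be checked numerically. The following is a minimal sketch, not part of the original analysis: it works on a fresh copy of the built-in data (the name `cars` is chosen here for illustration), computes the mean MPG per transmission type, and runs a Welch two-sample t-test on the difference.

```r
# Fresh copy of the built-in data, so the factor conversions above are left untouched.
# `cars`, `group_means` and `welch` are illustrative names, not from the original report.
cars <- datasets::mtcars

# Mean MPG per transmission type (am: 0 = automatic, 1 = manual)
group_means <- tapply(cars$mpg, cars$am, mean)
print(round(group_means, 3))

# Welch two-sample t-test for a difference in mean MPG between the two groups
welch <- t.test(mpg ~ am, data = cars)
print(welch)
```

A small p-value here would support the boxplot's message, although a two-group comparison says nothing yet about other variables that differ between the groups.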
Let’s fit a simple linear regression of MPG on transmission type:
R_simple <- lm(mpg ~ factor(am), data = mtcars)
summary(R_simple)
##
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
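Here, the p-value (0.000285) is well below 0.05, so we reject the null hypothesis that mean MPG is the same for both transmission types: manual cars average 7.245 MPG more than automatic cars. At the same time, the R-squared value (0.3598) shows that transmission type alone explains only about 36% of the variance in MPG, so the relationship is not very strong.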
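To express the uncertainty around the estimated difference, a confidence interval is often more informative than the p-value alone. The sketch below is an illustrative addition: it refits the same simple model on a fresh copy of the data (`cars` and `fit_simple` are names assumed here) and also confirms that, with a single two-level factor, the slope is just the difference in group means.

```r
# Illustrative names; the model formula matches the simple regression above.
cars <- datasets::mtcars
fit_simple <- lm(mpg ~ factor(am), data = cars)

# 95% confidence intervals for the intercept and the manual-vs-automatic difference
ci <- confint(fit_simple, level = 0.95)
print(ci)

# With one binary predictor, the slope equals the difference in group means
slope <- coef(fit_simple)[["factor(am)1"]]
mean_diff <- mean(cars$mpg[cars$am == 1]) - mean(cars$mpg[cars$am == 0])
print(all.equal(slope, mean_diff))
```

If the interval for `factor(am)1` excludes zero, that agrees with rejecting the null hypothesis at the 5% level.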
Now, let’s perform an analysis of variance with all variables as predictors:
Aov <- aov(mpg ~ ., data = mtcars)
summary(Aov)
##             Df Sum Sq Mean Sq F value  Pr(>F)
## cyl          1  817.7   817.7 102.591 2.3e-08 ***
## disp         1   37.6    37.6   4.717 0.04525 *
## hp           1    9.4     9.4   1.176 0.29430
## drat         1   16.5    16.5   2.066 0.16988
## wt           1   77.5    77.5   9.720 0.00663 **
## qsec         1    3.9     3.9   0.495 0.49161
## vs           1    0.1     0.1   0.016 0.90006
## am           1   14.5    14.5   1.816 0.19657
## gear         2    2.3     1.2   0.145 0.86578
## carb         5   19.0     3.8   0.477 0.78789
## Residuals   16  127.5     8.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
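One caveat when reading this table: `aov()` reports sequential (Type I) sums of squares, so each p-value depends on the order in which the terms enter the formula. The sketch below is an illustrative addition using a reduced set of predictors and names assumed here (`cars`, `am_first`, `am_last`); it shows how the p-value for `am` changes depending on whether it enters first or last.

```r
# Illustrative sketch: the same four predictors, entered in two different orders.
cars <- datasets::mtcars

# am entered first: tested before any adjustment for the other variables
am_first <- anova(lm(mpg ~ am + cyl + disp + wt, data = cars))

# am entered last: tested after adjusting for cyl, disp and wt
am_last <- anova(lm(mpg ~ cyl + disp + wt + am, data = cars))

p_first <- am_first["am", "Pr(>F)"]
p_last <- am_last["am", "Pr(>F)"]
print(c(am_entered_first = p_first, am_entered_last = p_last))
```

For a term entered last, the sequential F-test matches the t-test on that coefficient in the fitted model, which is why order matters most for predictors that are correlated with each other.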
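We see that cyl, disp, and wt have p-values below 0.05. The p-value for am (0.197) is above 0.05, but we retain am because transmission type is the variable of interest. Let’s perform the multivariable linear regression analysis: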
R_multivar <- lm(mpg ~ cyl + disp + wt + am, data = mtcars)
summary(R_multivar)
##
## Call:
## lm(formula = mpg ~ cyl + disp + wt + am, data = mtcars)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.318 -1.362 -0.479  1.354  6.059
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.898313   3.601540  11.356 8.68e-12 ***
## cyl         -1.784173   0.618192  -2.886  0.00758 **
## disp         0.007404   0.012081   0.613  0.54509
## wt          -3.583425   1.186504  -3.020  0.00547 **
## am           0.129066   1.321512   0.098  0.92292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.642 on 27 degrees of freedom
## Multiple R-squared: 0.8327, Adjusted R-squared: 0.8079
## F-statistic: 33.59 on 4 and 27 DF, p-value: 4.038e-10
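The R-squared value is 0.8327, which means that about 83% of the variance in MPG is explained by this multivariable model. The p-values for cyl and wt are below 0.05, so both are significant predictors of MPG. After adjusting for them, the estimated effect of transmission type drops from 7.245 to 0.129 MPG and is no longer significant (p = 0.923), which suggests that cyl and wt confound the relationship between transmission type and MPG.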
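The effect of adjustment can be seen directly by placing the transmission coefficient from the two models side by side. This is a minimal sketch with names assumed here (`cars`, `fit_unadjusted`, `fit_adjusted`). It uses `am` as a 0/1 numeric variable in both models, which gives the same estimate as `factor(am)` for a two-level variable.

```r
# Illustrative comparison of the am coefficient with and without adjustment.
cars <- datasets::mtcars

fit_unadjusted <- lm(mpg ~ am, data = cars)
fit_adjusted <- lm(mpg ~ cyl + disp + wt + am, data = cars)

am_effects <- c(
  unadjusted = coef(fit_unadjusted)[["am"]],
  adjusted = coef(fit_adjusted)[["am"]]
)
print(round(am_effects, 3))

# Nested-model F-test: do cyl, disp and wt add explanatory power beyond am alone?
print(anova(fit_unadjusted, fit_adjusted))
```

A large drop in the coefficient after adjustment is the usual informal sign of confounding; it does not by itself establish which of the added variables is responsible.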
Pairs plot for the dataset:
pairs(mpg ~ ., data = mtcars)
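A pairs plot with this many panels can be hard to read, so a numeric summary may help. The sketch below is an illustrative addition on a fresh, all-numeric copy of the data (`cars` and `cors` are names assumed here); it ranks the other variables by the absolute value of their Pearson correlation with MPG.

```r
# Fresh copy so that every column is still numeric (the report converts some to factors).
cars <- datasets::mtcars

# Pearson correlation of each variable with mpg, strongest first
cors <- cor(cars)["mpg", ]
cors <- cors[names(cors) != "mpg"]
cors <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors, 3))
```

Correlations only capture linear, pairwise association, so this ranking is a rough guide to candidate predictors rather than a model-selection procedure.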
Residual plots:
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(R_multivar)
The Residuals vs Fitted plot shows no clear pattern in the spread of the residuals, which suggests that they are homoscedastic. The Q-Q plot shows that the residuals are approximately normally distributed, apart from a few outliers.
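The visual diagnostics can be complemented with numeric checks. The following sketch is an addition, not part of the original analysis: it refits the multivariable model on a fresh copy of the data (names `cars`, `fit`, `sw`, and `cd` are assumed here), applies the Shapiro-Wilk normality test to the residuals, and lists the observations with the largest Cook's distance as candidate outliers.

```r
# Illustrative refit of the multivariable model on an untouched copy of the data.
cars <- datasets::mtcars
fit <- lm(mpg ~ cyl + disp + wt + am, data = cars)

# Shapiro-Wilk test: a small p-value would indicate departure from normality
sw <- shapiro.test(residuals(fit))
print(sw)

# Cook's distance: the three most influential observations
cd <- cooks.distance(fit)
print(round(head(sort(cd, decreasing = TRUE), 3), 3))
```

With only 32 observations, such tests have limited power, so they are best read alongside the plots rather than in place of them.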