Load the data
data(mtcars);help(mtcars);dim(mtcars)
## [1] 32 11
Plot all pairwise relationships in the data
?pairs
# Upper-panel helper: print |r|, with text size scaled by the correlation
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
  usr <- par("usr"); on.exit(par(usr = usr))  # restore the plot coordinates on exit
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if (missing(cex.cor)) cex.cor <- 0.8 / strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(mtcars, panel = panel.smooth, main = "mtcars", col = 3 + mtcars$am,
      upper.panel = panel.cor)
MPG has the highest absolute correlation (> 0.8) with wt, cyl and disp. The correlation between MPG and am is moderate (0.6).
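The values read off the upper panels can also be checked numerically. This is a minimal sketch, assuming the untouched copy of the data in `datasets::mtcars` (so `am` is still numeric); the name `mpg_cor` is only illustrative.

```r
# Correlation of mpg with every variable, using the original numeric data
mpg_cor <- cor(datasets::mtcars)["mpg", ]

# Rank by absolute correlation, dropping mpg's correlation with itself
sort(abs(mpg_cor[names(mpg_cor) != "mpg"]), decreasing = TRUE)
```

The absolute value is used because the panel function above also prints `abs(cor(x, y))`; the sign of each relationship is visible in `mpg_cor` itself.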
Convert am to a factor and show the conditional mean of mpg
mtcars$am <-as.factor(mtcars$am)
library(plyr)
(mmpg <- ddply(mtcars, "am", summarise, mpg.mean=mean(mpg)))
## am mpg.mean
## 1 0 17.14737
## 2 1 24.39231
Mean MPG is lower for automatic cars (17.1) than for manual cars (24.4).
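The same conditional means can be obtained without plyr. This is a base-R sketch, working on a local copy (`cars`, an illustrative name) so that the `mtcars` object used elsewhere is not modified.

```r
# Local copy of the data with am as a factor (0 = automatic, 1 = manual)
cars <- datasets::mtcars
cars$am <- factor(cars$am)

# Mean mpg within each transmission group
group_means <- tapply(cars$mpg, cars$am, mean)
group_means
```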
Show boxplot for these two groups
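The code for the boxplot is not included here. One way it could be drawn in base graphics is sketched below; the local copy `cars`, the factor labels and the axis titles are assumptions, not the original call.

```r
# Local copy with readable transmission labels (0 -> auto, 1 -> manual)
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("auto", "manual"))

# One box per transmission type; the summary statistics are returned invisibly
bp <- boxplot(mpg ~ am, data = cars, xlab = "Transmission", ylab = "MPG")
```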
Show the distribution of mpg for the two transmission types. The mpg distribution for manual cars does not look Gaussian.
library(ggplot2)
ggplot(mtcars, aes(x=mpg, fill=factor(am))) + geom_density(alpha=.3) +
  scale_fill_discrete(labels=c("auto","manual")) +
  geom_vline(data=mmpg, aes(xintercept=mpg.mean, colour=am), linetype="dashed", size=1)
Here, I will perform a hypothesis test. The alternative hypothesis is that cars with automatic transmission have a lower mean mpg than cars with manual transmission. The null hypothesis is that the mean mpg of the two groups does not differ. So it will be a one-sided t-test.
The automatic and manual groups are independent samples, so the test should be unpaired.
# Extract mpg as plain numeric vectors for each group
auto <- subset(mtcars, am == 0)$mpg
manual <- subset(mtcars, am == 1)$mpg
t.test(auto, manual, alternative = "less")$p.value
## [1] 0.0006868192
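Since `t.test` does not assume equal variances by default, the p-value above comes from Welch's test. The sketch below reproduces it by hand to show what is being computed; all variable names are illustrative, and the data are taken from `datasets::mtcars` so the block runs on its own.

```r
cars <- datasets::mtcars
auto_mpg   <- cars$mpg[cars$am == 0]
manual_mpg <- cars$mpg[cars$am == 1]

# Squared standard error of each group mean
se2_auto   <- var(auto_mpg) / length(auto_mpg)
se2_manual <- var(manual_mpg) / length(manual_mpg)

# Welch t statistic for mean(auto) - mean(manual)
t_stat <- (mean(auto_mpg) - mean(manual_mpg)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto_mpg) - 1) + se2_manual^2 / (length(manual_mpg) - 1))

# One-sided p-value for the alternative "auto is less than manual"
p_value <- pt(t_stat, df)
p_value
```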
fit1<-lm(mpg ~ am-1,data = mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ am - 1, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## am0 17.147 1.125 15.25 1.13e-15 ***
## am1 24.392 1.360 17.94 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.9487, Adjusted R-squared: 0.9452
## F-statistic: 277.2 on 2 and 30 DF, p-value: < 2.2e-16
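With the intercept removed, each coefficient is the group mean. The R-squared of this model is computed around zero rather than around the overall mean, so it is not comparable with that of the models that include an intercept.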
fit2<-lm(mpg ~ am,data = mtcars)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
par(mfrow=c(2,2));plot(fit2)
par(mfrow=c(1,1))
plot(predict(fit2),resid(fit2))
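The am1 coefficient, which is the difference between manual and automatic, is 7.245 with a p-value of 0.000285. Before adjusting for any other variable, manual transmission appears to give significantly better fuel economy (about 7.245 MPG more on average).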
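With a single two-level factor as predictor, the slope is simply the difference between the two group means. The sketch below checks this and adds a confidence interval for the difference; `cars`, `fit` and `means` are illustrative names, and the model is refitted on a local copy so the block is self-contained.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am)

fit   <- lm(mpg ~ am, data = cars)
means <- tapply(cars$mpg, cars$am, mean)

unname(coef(fit)["am1"])          # slope for the manual indicator
unname(means["1"] - means["0"])   # same quantity from the raw group means

# 95% confidence interval for the manual-minus-automatic difference
confint(fit, "am1")
```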
fit4<-lm(mpg~ wt + am + wt*am,data=mtcars)
fit3<-lm(mpg~ wt + am,data=mtcars)
summary(fit4)
##
## Call:
## lm(formula = mpg ~ wt + am + wt * am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6004 -1.5446 -0.5325 0.9012 6.0909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.4161 3.0201 10.402 4.00e-11 ***
## wt -3.7859 0.7856 -4.819 4.55e-05 ***
## am1 14.8784 4.2640 3.489 0.00162 **
## wt:am1 -5.2984 1.4447 -3.667 0.00102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.591 on 28 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.8151
## F-statistic: 46.57 on 3 and 28 DF, p-value: 5.209e-11
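In the interaction model the effect of transmission is no longer a single number: it depends on weight. The sketch below works this out from the coefficients printed above, copied by hand into an illustrative vector `b`; `wt` is in units of 1000 lbs, as documented in `help(mtcars)`.

```r
# Coefficients copied from the summary of the interaction model
b <- c(intercept = 31.4161, wt = -3.7859, am1 = 14.8784, wt_am1 = -5.2984)

# Predicted manual-minus-automatic difference in mpg at a given weight
am_effect <- function(wt) b[["am1"]] + b[["wt_am1"]] * wt
am_effect(c(2, 3, 4))

# Weight at which the two fitted lines cross (difference equals zero)
crossover <- -b[["am1"]] / b[["wt_am1"]]
crossover
```

Under this fit, manual cars are predicted to do better than automatic cars below roughly 2.8 (about 2800 lbs) and worse above it.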
plot(mtcars$wt,mtcars$mpg,pch=21,bg=3+as.numeric(mtcars$am))
abline(c(fit4$coefficients[1],fit4$coefficients[2]),col=4,lwd=3)
abline(c(fit4$coefficients[1]+fit4$coefficients[3],fit4$coefficients[2]+fit4$coefficients[4]),col=5,lwd=3)
abline(c(fit3$coefficients[1],fit3$coefficients[2]),col=4,lwd=3,lty=2)
abline(c(fit3$coefficients[1]+fit3$coefficients[3],fit3$coefficients[2]),col=5,lwd=3,lty=2)
qplot(wt,mpg,data = mtcars,color=am) + geom_smooth(method="lm")+scale_colour_discrete(name="Transmission",labels=c("auto","manual"))
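Notice that once wt is included as a predictor without an interaction (fit3), the coefficient describing the difference between manual and automatic in predicted mpg is small and not significant (-0.02362, p-value = 0.988). This means that am is not a strong predictor compared to wt in the additive model. In fit4, by contrast, both am1 and the wt:am1 interaction are significant.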
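The figures quoted for the additive model come from its coefficient table, which is not printed above. A minimal sketch of how to display it, refitting the model on a local copy (`cars` and `fit_add` are illustrative names):

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am)

# Additive model: common slope for wt, constant shift for manual cars
fit_add <- lm(mpg ~ wt + am, data = cars)

# Estimate, standard error, t value and p-value for each term
coef(summary(fit_add))
```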
fit1 <-lm(mpg~wt,data=mtcars)
anova(fit1,fit3,fit4)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + am
## Model 3: mpg ~ wt + am + wt * am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 278.32
## 2 29 278.32 1 0.002 0.0003 0.985556
## 3 28 188.01 1 90.312 13.4502 0.001017 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
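The p-value for model 2 versus model 1 is 0.986, which means that am alone is not necessary once wt is in the model. Adding the wt:am interaction (model 3 versus model 2), however, improves the fit significantly (p = 0.001).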
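The F statistic in the last row of the table can be reproduced from the residual sums of squares. This is a small worked sketch using the rounded values printed above (the variable names are illustrative), so the result agrees with the table only up to rounding.

```r
# Residual sums of squares and residual df, copied from the anova table
rss_additive    <- 278.32   # model 2: mpg ~ wt + am
rss_interaction <- 188.01   # model 3: mpg ~ wt + am + wt * am
df_interaction  <- 28       # residual df of model 3

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_stat <- ((rss_additive - rss_interaction) / 1) / (rss_interaction / df_interaction)
f_stat

# Upper-tail probability on 1 and 28 degrees of freedom
pf(f_stat, 1, df_interaction, lower.tail = FALSE)
```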