Load the data

data(mtcars);help(mtcars);dim(mtcars)
## [1] 32 11

Plot all pairs of variables

?pairs
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    # Save the panel's user coordinates and restore them on exit
    usr <- par("usr"); on.exit(par(usr = usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if (missing(cex.cor)) cex.cor <- 0.8 / strwidth(txt)
    # Scale the label size by the absolute correlation
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(mtcars, panel = panel.smooth, main = "mtcars", col = 3 + mtcars$am,
      upper.panel = panel.cor)

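mpg is most strongly correlated (absolute correlation > 0.8) with wt, cyl, and disp. The upper panels show absolute values; the underlying correlations with these three variables are negative. The correlation between mpg and am is moderate (0.6).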
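The values read off the plot can also be checked numerically. The following is a minimal sketch, assuming a fresh copy of the data in which am is still numeric; the names `cars`, `mpg_cor`, and `mpg_cor_sorted` are introduced here for illustration only.

```r
# `cars` is a fresh copy of mtcars, so `am` is still numeric here
cars <- datasets::mtcars

# Correlation of mpg with every other variable
mpg_cor <- cor(cars)["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Order by absolute strength while keeping the sign visible
mpg_cor_sorted <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(mpg_cor_sorted, 2))
```

Keeping the sign makes it clear that heavier cars, more cylinders, and larger displacement all go with lower mpg.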

Exploratory analysis

Convert am to a factor and show the conditional mean of mpg

mtcars$am <-as.factor(mtcars$am)
library(plyr)
(mmpg <- ddply(mtcars, "am", summarise, mpg.mean=mean(mpg)))
##   am mpg.mean
## 1  0 17.14737
## 2  1 24.39231

Mean mpg is lower for automatic cars (17.1) than for manual cars (24.4).

Show boxplot for these two groups
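No boxplot code appears in this report, so the following is only a sketch of one way to draw it with base graphics. The copy `cars`, the level labels, and the axis titles are assumptions made for this example.

```r
# Work on a copy so the original mtcars object is left untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

# boxplot() draws the plot and invisibly returns the plotted statistics
bp <- boxplot(mpg ~ am, data = cars,
              xlab = "Transmission", ylab = "mpg",
              main = "mpg by transmission type")

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
```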

Show the distribution of mpg for the two transmission types. The mpg distribution for manual cars does not look Gaussian.

library(ggplot2)
ggplot(mtcars, aes(x = mpg, fill = factor(am))) +
  geom_density(alpha = .3) +
  scale_fill_discrete(labels = c("auto", "manual")) +
  geom_vline(data = mmpg, aes(xintercept = mpg.mean, colour = am), linetype = "dashed", size = 1)

Inference: automatic cars have significantly lower mpg than manual cars

Here I perform a hypothesis test. The alternative hypothesis is that automatic transmission has a lower mean mpg than manual transmission. The null hypothesis is that the two groups do not differ in mean mpg. This calls for a one-sided t-test.

The automatic and manual cars are independent samples, so the test should be unpaired.

auto <- subset(mtcars, am == 0, mpg)
manual <- subset(mtcars, am == 1, mpg)
t.test(auto, manual, alternative = "less")$p.value
## [1] 0.0006868192
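Because the manual group did not look Gaussian, a rank-based test is a reasonable cross-check. This is a sketch, not part of the original analysis: it repeats the Welch test through the formula interface and adds a one-sided Wilcoxon rank-sum test. The names `cars`, `welch`, and `rank_test` are introduced here, and `exact = FALSE` is chosen because mpg contains ties.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am)   # level "0" = auto, level "1" = manual

# Welch two-sample t-test, H1: mean mpg of auto < mean mpg of manual
welch <- t.test(mpg ~ am, data = cars, alternative = "less")
welch$p.value

# Wilcoxon rank-sum test with the same one-sided alternative;
# exact = FALSE uses the normal approximation, since mpg has tied values
rank_test <- wilcox.test(mpg ~ am, data = cars,
                         alternative = "less", exact = FALSE)
rank_test$p.value
```

If both tests reject at the chosen level, the conclusion does not hinge on the normality assumption.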

Regression study

  1. Regression on am
fit1<-lm(mpg ~ am-1,data = mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ am - 1, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## am0   17.147      1.125   15.25 1.13e-15 ***
## am1   24.392      1.360   17.94  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.9487, Adjusted R-squared:  0.9452 
## F-statistic: 277.2 on 2 and 30 DF,  p-value: < 2.2e-16

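The coefficients are the group means. Because this model has no intercept, R computes R-squared relative to zero rather than to the overall mean, so the value of 0.9487 is not comparable with the R-squared of the models below.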
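The match between coefficients and group means can be verified directly. A minimal sketch, using a copy of the data named `cars` and a model object named `fit_no_int` (both names are introduced for this example):

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am)

# Intercept-free model: one coefficient per level of am
fit_no_int <- lm(mpg ~ am - 1, data = cars)

# Plain group means for comparison
group_means <- tapply(cars$mpg, cars$am, mean)

cbind(coefficient = coef(fit_no_int), group.mean = group_means)
```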

  2. Regression on am, with auto as the reference level.
fit2<-lm(mpg ~ am,data = mtcars)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
par(mfrow=c(2,2));plot(fit2)

par(mfrow=c(1,1))
plot(predict(fit2),resid(fit2))

The am1 coefficient (manual relative to auto) is 7.245 with a p-value of 0.000285. Without adjusting for any other variable, manual cars average about 7.245 mpg more than automatic cars, and the difference is significant.
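This regression is equivalent to a two-sample t-test with pooled variance, which is why its p-value differs slightly from a Welch test. The sketch below checks that equivalence and adds a confidence interval for the difference; `cars`, `fit_am`, `slope_p`, and `pooled` are names introduced here.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am)

fit_am <- lm(mpg ~ am, data = cars)

# p-value of the am1 coefficient from the regression
slope_p <- coef(summary(fit_am))["am1", "Pr(>|t|)"]

# Two-sided t-test assuming equal variances in both groups
pooled <- t.test(mpg ~ am, data = cars, var.equal = TRUE)

c(lm = slope_p, t.test = pooled$p.value)

# 95% confidence interval for the manual-minus-auto difference
confint(fit_am)["am1", ]
```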

  3. Regression including wt as a predictor
fit4<-lm(mpg~ wt + am + wt*am,data=mtcars)
fit3<-lm(mpg~ wt + am,data=mtcars)
summary(fit4)
## 
## Call:
## lm(formula = mpg ~ wt + am + wt * am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6004 -1.5446 -0.5325  0.9012  6.0909 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31.4161     3.0201  10.402 4.00e-11 ***
## wt           -3.7859     0.7856  -4.819 4.55e-05 ***
## am1          14.8784     4.2640   3.489  0.00162 ** 
## wt:am1       -5.2984     1.4447  -3.667  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.591 on 28 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8151 
## F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11
plot(mtcars$wt,mtcars$mpg,pch=21,bg=3+as.numeric(mtcars$am))
abline(c(fit4$coefficients[1],fit4$coefficients[2]),col=4,lwd=3)
abline(c(fit4$coefficients[1]+fit4$coefficients[3],fit4$coefficients[2]+fit4$coefficients[4]),col=5,lwd=3)
abline(c(fit3$coefficients[1],fit3$coefficients[2]),col=4,lwd=3,lty=2)
abline(c(fit3$coefficients[1]+fit3$coefficients[3],fit3$coefficients[2]),col=5,lwd=3,lty=2)

qplot(wt,mpg,data = mtcars,color=am) + geom_smooth(method="lm")+scale_colour_discrete(name="Transmission",labels=c("auto","manual")) 
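In the interaction model the manual-minus-auto difference depends on weight: it is the am1 coefficient plus the wt:am1 coefficient times wt. As a worked example, the sketch below computes the weight at which the two fitted lines cross. The names `cars`, `fit_int`, `wt_cross`, and `gap` are introduced here; wt is in units of 1000 lbs, as documented in help(mtcars).

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am)

# mpg ~ wt * am expands to wt + am + wt:am
fit_int <- lm(mpg ~ wt * am, data = cars)
b <- coef(fit_int)

# Predicted manual-minus-auto difference at a given weight
gap <- function(wt) b[["am1"]] + b[["wt:am1"]] * wt

# The lines cross where the difference is zero
wt_cross <- -b[["am1"]] / b[["wt:am1"]]
wt_cross

# Difference for a light car (wt = 2) and a heavy car (wt = 4)
gap(c(2, 4))
```

With the printed coefficients this gives a crossover near 14.8784 / 5.2984, about 2.81: below that weight the fitted manual line lies above the automatic line, and above it the order reverses.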

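Note that in the additive model fit3 (mpg ~ wt + am), whose summary is not printed above, the am1 coefficient describing the difference between manual and auto is small and not significant (-0.02362, p-value = 0.988). Once wt is in the model, am adds little as a main effect, so am is not a strong predictor compared to wt. In the interaction model fit4 shown above, by contrast, both am1 and wt:am1 are significant, which means the effect of transmission depends on weight.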
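The summary of the additive model is not printed in this report. A minimal sketch to display its coefficient table, using a copy of the data named `cars` and a model object named `fit_add` (both introduced for this example):

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am)

# Additive model: common slope for wt, separate intercepts by transmission
fit_add <- lm(mpg ~ wt + am, data = cars)

# Estimate, standard error, t value, and p-value for each term
round(coef(summary(fit_add)), 4)
```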

  4. Model selection
fit1 <-lm(mpg~wt,data=mtcars)
anova(fit1,fit3,fit4)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + am
## Model 3: mpg ~ wt + am + wt * am
##   Res.Df    RSS Df Sum of Sq       F   Pr(>F)   
## 1     30 278.32                                 
## 2     29 278.32  1     0.002  0.0003 0.985556   
## 3     28 188.01  1    90.312 13.4502 0.001017 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

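The p-value for Model 2 versus Model 1 is 0.986, so adding am alone to a model that already contains wt is not necessary. The p-value for Model 3 versus Model 2 is 0.001, so the wt:am interaction does improve the fit significantly.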