Regression Model Project (2014.11.23)

Questions
Take the mtcars data set and write up an analysis to answer the questions below using regression models and exploratory data analyses.

Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions

Summary

The first analysis examines the relationship between MPG and transmission type in the mtcars dataset. The result shows the relationship is significant: cars with manual transmissions statistically outperform those with automatic transmissions.

In the second analysis, several regression models with different combinations of variables are compared. Preferring fewer regressors together with a high $R^2$ value, we choose mpg~wt+qsec+am as the best fit.

Is an automatic or manual transmission better for MPG?

Load data
```
data(mtcars)
```

Boxplot: MPG with automatic am=0 & manual am=1 (defined in ?mtcars)

library(ggplot2)  # provides ggplot() and the geom_*/scale_* functions used below
g = ggplot(mtcars, aes(factor(am), mpg, fill=factor(am)))
g = g + geom_boxplot()
g = g + geom_jitter(position=position_jitter(width=.1, height=0))
g = g + scale_colour_discrete(name = "Type")
g = g + scale_fill_discrete(name="Type", breaks=c("0", "1"),
            labels=c("Automatic", "Manual"))
g = g + scale_x_discrete(breaks=c("0", "1"), labels=c("Automatic", "Manual"))
g = g + xlab("")
g

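T-Test: compare group #1 with am=0 (automatic) & group #2 with am=1 (manual).
$H_0$: mean mpg of manual is less than or equal to mean mpg of automatic
$H_A$: mean mpg of manual is greater than mean mpg of automatic
```
# Compare the mpg columns only; t.test() on whole data frames pools every variable
g1 <- subset(mtcars, am == 0)  # automatic
g2 <- subset(mtcars, am == 1)  # manual
# alternative="greater" tests mean(x) > mean(y), so manual (g2) is passed as x
amt <- t.test(g2$mpg, g1$mpg, alternative="greater", paired=FALSE)
```
The p-value is about 0.0007, which is less than 0.05. So the mean mpg of the am=1 group is significantly larger than that of the am=0 group.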

ANS: The manual transmission has better performance in miles per gallon (mpg).
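As a rough sketch of how large the gap is, the group means and a Welch confidence interval can be computed directly. The formula interface `t.test(mpg ~ am)` is used here as an alternative to the subset-based call above, and the names `means`, `diff_mpg` and `tt` are illustrative only.

```r
data(mtcars)

# Mean mpg per transmission type (am = 0 automatic, am = 1 manual)
means <- tapply(mtcars$mpg, mtcars$am, mean)
diff_mpg <- unname(means["1"] - means["0"])

# Two-sided Welch test; the interval is for mean(am = 0) - mean(am = 1),
# so it is negative when manual cars have the higher mean
tt <- t.test(mpg ~ am, data = mtcars)

print(round(means, 2))
print(round(diff_mpg, 2))
print(round(tt$conf.int, 2))
```

This raw difference does not adjust for other variables such as weight, which is what the regression models in the next part address.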

Quantify the MPG difference between automatic and manual transmissions

From the previous analysis, we know transmission is an important variable. What about the other variables? Use stepwise selection (step in R) to find a better-fitting model.

s_model <- step(lm(data = mtcars, mpg~.), trace=0)
summary(s_model)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.481 -1.556 -0.726  1.411  4.661 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.618      6.960    1.38  0.17792    
## wt            -3.917      0.711   -5.51    7e-06 ***
## qsec           1.226      0.289    4.25  0.00022 ***
## am             2.936      1.411    2.08  0.04672 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.46 on 28 degrees of freedom
## Multiple R-squared:  0.85,   Adjusted R-squared:  0.834 
## F-statistic: 52.7 on 3 and 28 DF,  p-value: 1.21e-11

With weight wt, 1/4 mile time qsec and transmission am, we can build the model that stepwise selection prefers (step minimizes AIC by default); its $R^2$ is 0.85.
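To see what the stepwise search is optimizing, the sketch below compares the AIC of the full model with that of the selected one. `AIC()` and `step()` are base R functions; the object names `full` and `aic_cmp` are illustrative, and the absolute AIC values differ from those `step()` prints internally by a constant, so only the ordering matters here.

```r
data(mtcars)

full <- lm(mpg ~ ., data = mtcars)
s_model <- step(full, trace = 0)   # backward/forward search on AIC (k = 2)

# Lower AIC is better; the selected model should not be worse than the full one
aic_cmp <- c(full = AIC(full), stepwise = AIC(s_model))
print(round(aic_cmp, 2))
print(formula(s_model))
```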

Get the variables which have the highest correlations with mpg

var_cor <- round(cor(mtcars)[1,], 2)
name_cor <- names(sort(abs(var_cor),decreasing=T))
## the top four variables: wt, cyl, disp, hp

cor(mpg, wt) = -0.87
cor(mpg, cyl) = -0.85
cor(mpg, disp) = -0.85
cor(mpg, hp) = -0.78
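A high correlation with mpg does not mean each of these four variables adds independent information. As a quick check, the sketch below looks at how strongly the predictors correlate with one another; `top4` and `pred_cor` are illustrative names.

```r
data(mtcars)

top4 <- c("wt", "cyl", "disp", "hp")

# Pairwise correlations among the four predictors themselves
pred_cor <- cor(mtcars[, top4])
print(round(pred_cor, 2))
```

Strong correlations among predictors suggest that a model using all four carries redundant information, which is one reason to prefer a smaller model.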

Use mpg ~ wt + qsec + am as the formula of linear regression model.

fit1 <- lm(data=mtcars, mpg~wt+qsec+am)
## coefficient
summary(fit1)$coefficient

##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    9.618     6.9596   1.382 1.779e-01
## wt            -3.917     0.7112  -5.507 6.953e-06
## qsec           1.226     0.2887   4.247 2.162e-04
## am             2.936     1.4109   2.081 4.672e-02

Compare with other models
Stepwise-selected model step in R: mpg~wt+qsec+am
All variables: mpg~.
Weight: mpg~wt (with the largest value of correlation)
Four high-correlated variables: mpg~wt+cyl+disp+hp

fit2 <- lm(data=mtcars, mpg~.)
fit3 <- lm(data=mtcars, mpg~wt)
fit4 <- lm(data=mtcars, mpg~wt+cyl+disp+hp)
## using anova to compare models
anova(fit1, fit2, fit3, fit4)

## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + qsec + am
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 3: mpg ~ wt
## Model 4: mpg ~ wt + cyl + disp + hp
##   Res.Df RSS Df Sum of Sq    F Pr(>F)   
## 1     28 169                            
## 2     21 148  7      21.8 0.44 0.8636   
## 3     30 278 -9    -130.8 2.07 0.0816 . 
## 4     27 170  3     107.9 5.12 0.0082 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

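In the ANOVA table each row is tested against the model on the row above, so the significant result for mpg~wt+cyl+disp+hp (p = 0.0082) is relative to mpg~wt, not to mpg~wt+qsec+am. These two candidates are not nested, so we compare their $R^2$ values instead: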
Model mpg~wt+qsec+am = 0.8497
Model mpg~wt+cyl+disp+hp = 0.8486
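The $R^2$ values can be extracted for all candidate models in one pass. The sketch below also reports adjusted $R^2$, which penalizes extra regressors; the list `models` and the matrix `fit_stats` are illustrative names.

```r
data(mtcars)

models <- list(
  "wt+qsec+am"     = lm(mpg ~ wt + qsec + am, data = mtcars),
  "all"            = lm(mpg ~ ., data = mtcars),
  "wt"             = lm(mpg ~ wt, data = mtcars),
  "wt+cyl+disp+hp" = lm(mpg ~ wt + cyl + disp + hp, data = mtcars)
)

# One row per model: R^2, adjusted R^2, number of estimated coefficients
fit_stats <- t(sapply(models, function(m) {
  s <- summary(m)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, n_coef = length(coef(m)))
}))
print(round(fit_stats, 4))
```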

Compare residuals of the models (mpg~wt+qsec+am and mpg~wt+cyl+disp+hp)

    x <- mtcars$wt
    res1 <- resid(fit1)
    e <- res1
    n <- length(e)
    plot(x, e,  
         main="mpg~wt+qsec+am",
         xlab="wt", 
         ylab="Residuals", 
         bg="lightblue", 
         col="black", cex = 2, pch = 21,frame = FALSE)
    abline(h = 0, lwd = 2)
    for (i in 1 : n) 
        lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)

Figure (chunk residual_compare): residuals against wt for mpg~wt+qsec+am

    res4 <- resid(fit4)
    e <- res4
    n <- length(e)
    plot(x, e,  
         main="mpg~wt+cyl+disp+hp",
         xlab="wt", 
         ylab="Residuals", 
         bg="lightblue", 
         col="black", cex = 2, pch = 21,frame = FALSE)
    abline(h = 0, lwd = 2)
    for (i in 1 : n) 
        lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)

Figure (chunk residual_compare): residuals against wt for mpg~wt+cyl+disp+hp

The variables in both models explain most of the variation in mpg ($R^2 \approx 0.85$ in each case). However, we recommend the model with fewer regressors: mpg~wt+qsec+am.
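The visual comparison of residuals can be backed by a few numbers. This is a minimal sketch, with `resid_summary` as an illustrative name, that tabulates the residual sum of squares, the residual standard error and the largest absolute residual for the two models.

```r
data(mtcars)

fit1 <- lm(mpg ~ wt + qsec + am, data = mtcars)
fit4 <- lm(mpg ~ wt + cyl + disp + hp, data = mtcars)

resid_summary <- sapply(list(fit1 = fit1, fit4 = fit4), function(m) {
  e <- resid(m)
  c(rss     = sum(e^2),          # residual sum of squares
    rse     = summary(m)$sigma,  # residual standard error
    max_abs = max(abs(e)))       # largest absolute residual
})
print(round(resid_summary, 3))
```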

Explanation of the MPG difference between automatic and manual transmissions
Let's use two variables and their interaction, the model mpg~wt+am+wt*am, to explain the contribution of am together with wt.

fit <- lm(data=mtcars, mpg~wt+am+wt*am)
g1 <- subset(mtcars, mtcars$am==0)
g2 <- subset(mtcars, mtcars$am==1)
plot(g1$wt, g1$mpg, col="lightblue", pch=19, cex=2, xlab="Weight", ylab="mpg")
points(g2$wt, g2$mpg, col="salmon", pch=19, cex=2)
## am=0; 
abline(c(fit$coeff[1], fit$coeff[2]), col="lightblue", lwd=3, lty=2)
## am=1
abline(c(fit$coeff[1]+fit$coeff[3], fit$coeff[2]+fit$coeff[4]), col="salmon", lwd=3, lty=2)
## lightblue points are g1 (am=0), salmon points are g2 (am=1)
legend("topright", pch=19, col=c("lightblue", "salmon"), legend=c("Automatic (am=0)", "Manual (am=1)"))

Figure (chunk am_wt): mpg against weight by transmission type, with fitted regression lines

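The number of samples with am=1 (13) is less than the number with am=0 (19). From the regression lines, the mpg payoff of am=1 is better if wt < 2.8 (in units of 1000 lbs), which is where the two lines cross; above that weight the fitted line for am=0 is higher. We still need more data to confirm this and increase the reliability.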
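The weight at which the two regression lines cross can be computed from the interaction model's coefficients rather than read off the plot. In this sketch the helper `manual_gain` is a hypothetical function introduced for illustration; it returns the estimated manual-minus-automatic mpg difference at a given weight, and the two weights passed to it are arbitrary examples.

```r
data(mtcars)

fit <- lm(mpg ~ wt * am, data = mtcars)  # expands to wt + am + wt:am
b <- coef(fit)

# Automatic line: b0 + b_wt * wt
# Manual line:    (b0 + b_am) + (b_wt + b_wt:am) * wt
# The lines cross where b_am + b_wt:am * wt = 0
wt_cross <- unname(-b["am"] / b["wt:am"])

# Estimated mpg advantage of manual over automatic at weight w (1000 lbs)
manual_gain <- function(w) unname(b["am"] + b["wt:am"] * w)

print(round(wt_cross, 2))
print(round(c(light = manual_gain(2.0), heavy = manual_gain(3.5)), 2))
```

A positive value from `manual_gain` means the manual line is higher at that weight, and a negative value means the automatic line is higher.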