Questions
Take the mtcars data set and write up an analysis to answer the questions below using regression models and exploratory data analyses.
The first analysis examines the relationship between MPG performance and transmission type in the mtcars dataset. The result shows the relationship is significant: cars with manual transmissions significantly outperform those with automatic transmissions.
In the second analysis, several regression models with different combinations of variables are compared. Preferring fewer regressors and a high \(R^2\) value, we choose mpg~wt+qsec+am as the best-fitting model.
Load data
library(ggplot2)
data(mtcars)
Boxplot: MPG with automatic am=0 & manual am=1 (defined in ?mtcars)
g = ggplot(mtcars, aes(factor(am), mpg, fill=factor(am)))
g = g + geom_boxplot()
g = g + geom_jitter(position=position_jitter(width=.1, height=0))
g = g + scale_colour_discrete(name = "Type")
g = g + scale_fill_discrete(name="Type", breaks=c("0", "1"),
labels=c("Automatic", "Manual"))
g = g + scale_x_discrete(breaks=c("0", "1"), labels=c("Automatic", "Manual"))
g = g + xlab("")
g
T-Test: compare group #1 with am=1 & group #2 with am=0.
\(H_0\): mpg of manual is less than or equal to mpg of automatic
\(H_A\): mpg of manual is greater than mpg of automatic
## compare the mpg column only; g1 = manual (am=1), g2 = automatic (am=0)
g1 <- subset(mtcars, am==1)$mpg
g2 <- subset(mtcars, am==0)$mpg
amt <- t.test(g1, g2, alternative="greater", paired=FALSE)
The p-value is about 0.0007, which is less than 0.05. So, the mpg of the am=1 group is significantly larger than that of the am=0 group.
ANS: The manual transmission has better performance in miles per gallon (mpg).
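The size of the difference, and not only its significance, can be read from the group means and a confidence interval. The following is a minimal sketch; the names mpg_means and welch are illustrative and not part of the original analysis, and it uses a two-sided Welch test through the formula interface rather than the one-sided test above.

```r
data(mtcars)

# Mean mpg per transmission type (am: 0 = automatic, 1 = manual)
mpg_means <- tapply(mtcars$mpg, mtcars$am, mean)
print(round(mpg_means, 2))

# Two-sided Welch t-test through the formula interface;
# the estimated difference is automatic minus manual
welch <- t.test(mpg ~ am, data = mtcars)
print(welch$conf.int)
```

A confidence interval that lies entirely below zero means the automatic group has the lower mean mpg at the chosen confidence level.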
From the previous analysis, we know transmission is an important variable. What about other variables? We use the stepwise selection function step in R to look for a better-fitting model.
s_model <- step(lm(data = mtcars, mpg~.), trace=0)
summary(s_model)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.481 -1.556 -0.726  1.411  4.661
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    9.618      6.960    1.38  0.17792
## wt            -3.917      0.711   -5.51    7e-06 ***
## qsec           1.226      0.289    4.25  0.00022 ***
## am             2.936      1.411    2.08  0.04672 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.46 on 28 degrees of freedom
## Multiple R-squared: 0.85, Adjusted R-squared: 0.834
## F-statistic: 52.7 on 3 and 28 DF, p-value: 1.21e-11
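By default, step ranks candidate models by AIC rather than by \(R^2\). As a rough check on the selection, the sketch below compares the AIC of the full model with that of the selected one; full_model is an illustrative name not used in the original code.

```r
data(mtcars)

# Full model with every other column as a regressor
full_model <- lm(mpg ~ ., data = mtcars)

# Stepwise selection starting from the full model (AIC criterion by default)
s_model <- step(full_model, trace = 0)

# Lower AIC is better; the selected model should not be worse than the full one
print(AIC(full_model, s_model))
print(formula(s_model))
```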
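With weight wt, 1/4 mile time qsec and transmission am, the stepwise search (which minimizes AIC by default) arrives at a three-regressor model with an adjusted \(R^2\) of 0.834. Holding wt and qsec fixed, a manual transmission is associated with an estimated gain of 2.94 mpg (p = 0.047).
Get the variables which have the highest correlations with mpg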
var_cor <- round(cor(mtcars)[1,], 2)
name_cor <- names(sort(abs(var_cor),decreasing=T))
## the top four variables: wt, cyl, disp, hp
cor(mpg, wt) = -0.87
cor(mpg, cyl) = -0.85
cor(mpg, disp) = -0.85
cor(mpg, hp) = -0.78
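The ranking can also be produced in one step. This is a minimal sketch with illustrative names (mpg_cor, ranked); it drops the trivial correlation of mpg with itself before sorting by absolute value.

```r
data(mtcars)

# Correlation of mpg with every other variable
mpg_cor <- cor(mtcars)["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Order by strength of association, regardless of sign
ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(head(ranked, 4), 2))
```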
Use mpg ~ wt + qsec + am as the formula of linear regression model.
fit1 <- lm(data=mtcars, mpg~wt+qsec+am)
## coefficient
summary(fit1)$coefficients
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    9.618     6.9596   1.382 1.779e-01
## wt            -3.917     0.7112  -5.507 6.953e-06
## qsec           1.226     0.2887   4.247 2.162e-04
## am             2.936     1.4109   2.081 4.672e-02
Compare with other models
Stepwise-selected model step in R: mpg~wt+qsec+am
All variables: mpg~.
Weight: mpg~wt (the variable most strongly correlated with mpg)
Four highly correlated variables: mpg~wt+cyl+disp+hp
fit2 <- lm(data=mtcars, mpg~.)
fit3 <- lm(data=mtcars, mpg~wt)
fit4 <- lm(data=mtcars, mpg~wt+cyl+disp+hp)
## using anova to compare models
anova(fit1, fit2, fit3, fit4)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + qsec + am
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 3: mpg ~ wt
## Model 4: mpg ~ wt + cyl + disp + hp
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1     28 169
## 2     21 148  7      21.8 0.44 0.8636
## 3     30 278 -9    -130.8 2.07 0.0816 .
## 4     27 170  3     107.9 5.12 0.0082 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
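In the ANOVA table each model is compared with the one listed before it, so the significant result in row 4 (p = 0.0082) is for mpg~wt+cyl+disp+hp against mpg~wt, not against mpg~wt+qsec+am. Those two models are not nested, so this F-test cannot compare them directly. Instead, we compare their \(R^2\) values: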
Model mpg~wt+qsec+am = 0.8497
Model mpg~wt+cyl+disp+hp = 0.8486
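The two values can be extracted from the fitted models as sketched below. The matrix name r2 is illustrative; adjusted \(R^2\) is included because the two models use different numbers of regressors.

```r
data(mtcars)
fit1 <- lm(mpg ~ wt + qsec + am, data = mtcars)
fit4 <- lm(mpg ~ wt + cyl + disp + hp, data = mtcars)

# Collect R^2 and adjusted R^2 for both models, one column per model
r2 <- sapply(list(fit1 = fit1, fit4 = fit4), function(m) {
  s <- summary(m)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
})
print(round(r2, 4))
```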
Compare residuals of the models (mpg~wt+qsec+am and mpg~wt+cyl+disp+hp)
x <- mtcars$wt
res1 <- resid(fit1)
e <- res1
n <- length(e)
plot(x, e,
main="mpg~wt+qsec+am",
xlab="wt",
ylab="Residuals",
bg="lightblue",
col="black", cex = 2, pch = 21,frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n)
lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)
res4 <- resid(fit4)
e <- res4
n <- length(e)
plot(x, e,
main="mpg~wt+cyl+disp+hp",
xlab="wt",
ylab="Residuals",
bg="lightblue",
col="black", cex = 2, pch = 21,frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n)
lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)
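The two residual plots can be backed up with numbers. The sketch below, using an illustrative data frame res_summary, tabulates the residual sum of squares and the largest absolute residual for each model.

```r
data(mtcars)
fit1 <- lm(mpg ~ wt + qsec + am, data = mtcars)
fit4 <- lm(mpg ~ wt + cyl + disp + hp, data = mtcars)

# Residual sum of squares and largest absolute residual per model
res_summary <- data.frame(
  model = c("mpg~wt+qsec+am", "mpg~wt+cyl+disp+hp"),
  rss = c(sum(resid(fit1)^2), sum(resid(fit4)^2)),
  max_abs_resid = c(max(abs(resid(fit1))), max(abs(resid(fit4))))
)
print(res_summary)
```

The rss column should match the RSS values reported for models 1 and 4 in the ANOVA table.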
The variables in both models explain the variation in mpg across the sampled cars about equally well. However, we recommend the model with fewer regressors: mpg~wt+qsec+am.
Explanation of the MPG difference between automatic and manual transmissions
Let’s use a model with two variables and their interaction, mpg~wt+am+wt*am, to explain the contribution of am together with wt.
fit <- lm(data=mtcars, mpg~wt+am+wt*am)
g1 <- subset(mtcars, mtcars$am==0)
g2 <- subset(mtcars, mtcars$am==1)
plot(g1$wt, g1$mpg, col="lightblue", pch=19, cex=2, xlab="Weight", ylab="mpg")
points(g2$wt, g2$mpg, col="salmon", pch=19, cex=2)
## am=0;
abline(c(fit$coeff[1], fit$coeff[2]), col="lightblue", lwd=3, lty=2)
## am=1
abline(c(fit$coeff[1]+fit$coeff[3], fit$coeff[2]+fit$coeff[4]), col="salmon", lwd=3, lty=2)
legend("topright", pch=19, col=c("lightblue", "salmon"), legend=c("Automatic (am=0)", "Manual (am=1)"))
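There are fewer samples with am=1 (13 cars) than with am=0 (19 cars). The regression lines cross near wt = 2.8 (about 2,800 lb): below that weight the fitted mpg of am=1 is higher, and above it the fitted mpg of am=0 is higher. We still need more data to confirm this and increase its reliability.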
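The weight at which the two regression lines cross follows from the interaction model: the am=1 line differs from the am=0 line by \(\beta_{am} + \beta_{wt:am} \cdot wt\), which is zero at \(wt = -\beta_{am} / \beta_{wt:am}\). A minimal sketch, with b and crossover as illustrative names:

```r
data(mtcars)

# wt * am expands to wt + am + wt:am, the same model as mpg~wt+am+wt*am
fit <- lm(mpg ~ wt * am, data = mtcars)
b <- coef(fit)

# Weight (in 1000 lb) at which the am=0 and am=1 fitted lines intersect
crossover <- -b[["am"]] / b[["wt:am"]]
print(round(crossover, 2))

# Number of cars per transmission type (0 = automatic, 1 = manual)
print(table(mtcars$am))
```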