In this project, we analyze the mtcars dataset and explore the linear relationship between a set of variables and miles per gallon.
The main objectives of this research are as follows:
The results of this analysis are:
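# Welch two-sample t-test of mpg by transmission type
# (named welchTest to avoid masking base::t)
welchTest <- t.test(mpg ~ am, data = mtcars, paired = FALSE, var.equal = FALSE)
welchTest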
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231
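To show where these numbers come from, the sketch below recomputes the Welch t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value directly from the group means and variances. The variable names (`auto`, `manual`, `t_manual`, and so on) are ours for illustration and are not part of the original analysis.

```r
# Split mpg by transmission type (am: 0 = automatic, 1 = manual)
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Squared standard error of each group mean
se2_auto   <- var(auto) / length(auto)
se2_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over its unpooled standard error
t_manual <- (mean(auto) - mean(manual)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto) - 1) + se2_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

round(c(t = t_manual, df = df_manual, p.value = p_manual), 4)
```

The three values should agree with the `t`, `df`, and `p-value` entries reported by `t.test()` above.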
The exploratory data analysis shows that:
Single variable linear regression model - Model #1:
model_1 <- lm(mpg ~ am, data = mtcars)
summary(model_1)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04
summary(model_1)$adj.r.squared
## [1] 0.3384589
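With a single 0/1 predictor, the fitted coefficients have a direct reading: the intercept is the mean mpg of the automatic cars and the slope is the difference between the manual and automatic means. A minimal sketch of that check, using our own helper names `group_means` and `fit_am`:

```r
# Group means of mpg by transmission (am: 0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

fit_am <- lm(mpg ~ am, data = mtcars)

# Intercept = mean mpg of automatic cars;
# slope     = mean(manual) - mean(automatic)
rbind(
  from_lm    = coef(fit_am),
  from_means = c(group_means[["0"]], group_means[["1"]] - group_means[["0"]])
)
```

The two rows should match, which is why the slope of Model #1 equals the difference between the two sample estimates reported by the t-test.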
Multivariable linear regression model (Stepwise Regression) - Model #2:
fitAll <- lm(mpg ~ . , data = mtcars)
model_2 <- step(fitAll, direction = "both") # stepwise regression
summary(model_2)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## am           2.935837  1.4109045  2.080819 4.671551e-02
summary(model_2)$adj.r.squared
## [1] 0.8335561
Model #2, selected by the stepwise procedure above, includes “wt” and “qsec” as adjustment variables (potential confounders) and “am” as the predictor of interest.
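By default, `step()` compares candidate models by AIC (penalty `k = 2`), as computed by `extractAIC()`. The sketch below, which assumes the selected formula `mpg ~ wt + qsec + am` shown above and uses our own object names `full`, `selected`, and `aic_table`, puts the criterion for the full and the selected model side by side.

```r
# Full model (all predictors) and the model kept by the stepwise search
full     <- lm(mpg ~ ., data = mtcars)
selected <- lm(mpg ~ wt + qsec + am, data = mtcars)

# extractAIC() returns c(equivalent degrees of freedom, AIC);
# this is the criterion step() minimizes by default (k = 2)
aic_table <- rbind(full = extractAIC(full), selected = extractAIC(selected))
colnames(aic_table) <- c("edf", "AIC")
aic_table
```

A lower AIC for the selected model is what justifies dropping the remaining predictors under this criterion. It does not by itself show that the dropped variables are unrelated to mpg.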
Compare the Adjusted R-squared values of the two models:
##         Adjusted R-squared
## model_1             0.3385
## model_2             0.8336
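The adjusted R-squared penalizes the ordinary R-squared for the number of predictors p: 1 - (1 - R²)(n - 1)/(n - p - 1). As a sketch, the helper `adj_r2()` below (a hypothetical name of ours, not part of the original report) applies that formula to both models and builds a comparison table.

```r
# Adjusted R-squared from the ordinary R-squared, sample size n,
# and number of predictors p (excluding the intercept)
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n  <- nobs(fit)
  p  <- length(coef(fit)) - 1
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

fit_1 <- lm(mpg ~ am, data = mtcars)
fit_2 <- lm(mpg ~ wt + qsec + am, data = mtcars)

data.frame(
  row.names = c("model_1", "model_2"),
  "Adjusted R-squared" = round(c(adj_r2(fit_1), adj_r2(fit_2)), 4),
  check.names = FALSE
)
```

The helper should reproduce `summary(fit)$adj.r.squared` for each model.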
Next, we compare the two nested models using an analysis of variance table:
varTbl <- anova(model_1, model_2)
varTbl
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
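The F statistic in this table is the drop in residual sum of squares per added parameter, divided by the residual mean square of the larger model. The sketch below recomputes it from the two fits; the names `rss_1`, `rss_2`, `f_stat`, and `p_val` are ours.

```r
fit_1 <- lm(mpg ~ am, data = mtcars)
fit_2 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Residual sums of squares of the nested models
rss_1 <- deviance(fit_1)
rss_2 <- deviance(fit_2)

# Numerator df = number of added parameters; denominator df = residual df
df_num <- df.residual(fit_1) - df.residual(fit_2)
df_den <- df.residual(fit_2)

# Partial F test for the two added terms (wt and qsec)
f_stat <- ((rss_1 - rss_2) / df_num) / (rss_2 / df_den)
p_val  <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

c(F = f_stat, p.value = p_val)
```

With the rounded values from the table, this is (551.61 / 2) / (169.29 / 28), which is approximately 45.62.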
Finally, the variance inflation factors (VIF) of Model #2:
library(car)  # provides vif()
vif(model_2)
##       wt     qsec       am 
## 2.482952 1.364339 2.541437
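For a model whose terms each use one degree of freedom, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones. The sketch below computes this in base R without the `car` package. The names `predictors` and `vif_manual` are ours.

```r
predictors <- c("wt", "qsec", "am")

# Regress each predictor on the other two and convert the
# auxiliary R-squared into a variance inflation factor
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

round(vif_manual, 6)
```

These values should agree with the `vif(model_2)` output above.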
Findings:
Refer to Residual Plots and Diagnostics sections in Appendix below for more details.
Findings:
To make the intercept interpretable, we centered both “wt” and “qsec” at their sample means:
fitCentered <- lm(mpg ~ I(wt - mean(wt)) + I(qsec - mean(qsec)) + factor(am),
                  data = mtcars)
summary(fitCentered)$coef
##                       Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)          18.897941  0.7193542 26.270704 2.855851e-21
## I(wt - mean(wt))     -3.916504  0.7112016 -5.506882 6.952711e-06
## I(qsec - mean(qsec))  1.225886  0.2886696  4.246676 2.161737e-04
## factor(am)1           2.935837  1.4109045  2.080819 4.671551e-02
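Centering changes only the intercept: the slopes are identical to those of Model #2, and the new intercept is the predicted mpg of an automatic car with average weight and average quarter-mile time. A short sketch of that equivalence, with our own names `fit_raw`, `fit_ctr`, and `avg_auto`:

```r
fit_raw <- lm(mpg ~ wt + qsec + am, data = mtcars)
fit_ctr <- lm(mpg ~ I(wt - mean(wt)) + I(qsec - mean(qsec)) + factor(am),
              data = mtcars)

# A hypothetical automatic car (am = 0) at the sample means of wt and qsec
avg_auto <- data.frame(wt = mean(mtcars$wt), qsec = mean(mtcars$qsec), am = 0)

c(centered_intercept = unname(coef(fit_ctr)[1]),
  prediction         = unname(predict(fit_raw, newdata = avg_auto)))
```

Both entries should be equal, and the remaining coefficients of `fit_ctr` should match those of `fit_raw`.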
Interpretation:
Below are the 95% confidence intervals for the intercept and the predictors:
confint(fitCentered)
##                            2.5 %    97.5 %
## (Intercept)          17.42441087 20.371471
## I(wt - mean(wt))     -5.37333423 -2.459673
## I(qsec - mean(qsec))  0.63457320  1.817199
## factor(am)1           0.04573031  5.825944
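Each interval above is the estimate plus or minus a t quantile times the standard error, with the residual degrees of freedom of the model (28 here). The sketch below rebuilds the intervals by hand; the names `est`, `se`, `t_crit`, and `ci_manual` are ours.

```r
fit_ctr <- lm(mpg ~ I(wt - mean(wt)) + I(qsec - mean(qsec)) + factor(am),
              data = mtcars)

# Estimates and standard errors from the coefficient table
est <- coef(summary(fit_ctr))[, "Estimate"]
se  <- coef(summary(fit_ctr))[, "Std. Error"]

# Critical value of the t distribution for a two-sided 95% interval
t_crit <- qt(0.975, df = df.residual(fit_ctr))

ci_manual <- cbind("2.5 %" = est - t_crit * se, "97.5 %" = est + t_crit * se)
ci_manual
```

The result should match `confint(fitCentered)`. Note that the lower bound for the transmission coefficient lies only slightly above zero.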
library(ggplot2)  # provides ggplot(), geom_boxplot(), labs()

# Label the transmission indicator (0 = automatic, 1 = manual) for plotting
mtcars2 <- mtcars
mtcars2$txType <- factor(mtcars$am, labels = c("Automatic", "Manual"))
ggplot(mtcars2, aes(x = txType, y = mpg, fill = txType)) +
  geom_boxplot() +
  labs(title = "Miles Per Gallon by Transmission Type",
       x = "Transmission Type",
       y = "Miles Per Gallon") +
  scale_fill_discrete(name = "Transmission")
library(GGally)  # provides ggpairs()
g <- ggpairs(mtcars, lower = list(continuous = "smooth"))
g
par(mfrow = c(2, 2))
plot(model_2)
par(mfrow = c(1, 2))
# Leverage
plot(hatvalues(model_2), main="Leverage")
# Outliers: externally studentized residuals
plot(rstudent(model_2), main="Studentized Residuals")
par(mfrow = c(1, 2))
plot(dffits(model_2), main="Influence - dffits")
plot(cooks.distance(model_2), main="Influence - Cook's Distance")
par(mfrow = c(1, 2))
plot(dfbetas(model_2)[, 2], main="Influence - dfbetas of \"wt\"")
plot(dfbetas(model_2)[, 3], main="Influence - dfbetas of \"qsec\"")
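The plots above show the diagnostics but do not name the cars that stand out. As a sketch, the code below lists the observations that exceed two common rule-of-thumb cutoffs: leverage above 2p/n and Cook's distance above 4/n. These cutoffs are our assumption and are not taken from the original analysis; other thresholds are equally defensible.

```r
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

n <- nobs(fit)          # number of observations
p <- length(coef(fit))  # number of estimated coefficients, incl. intercept

lev  <- hatvalues(fit)       # leverage of each car
cook <- cooks.distance(fit)  # overall influence of each car

# Rule-of-thumb cutoffs (assumed here, not from the original report)
high_leverage  <- names(lev)[lev > 2 * p / n]
high_influence <- names(cook)[cook > 4 / n]

list(high_leverage = high_leverage, high_influence = high_influence)
```

Any car flagged here is a candidate for closer inspection, not automatically an observation to remove.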