In the Data Cleaning section, I convert the am attribute to a factor. In the Exploratory Data Analysis section, I use a boxplot to show that there may be a significant difference in MPG between automatic and manual cars.
In the Linear Regression Model section, I fit the data with both a simple linear regression and a multivariate regression, and find that the multivariate regression fits better. This result is verified in the Residuals Analysis section.
data("mtcars")
mtcars$am <- as.factor(mtcars$am)
Load the ggplot2 library.
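library(ggplot2)
A boxplot of mpg by transmission type can then be drawn as follows (a minimal sketch; the exact aesthetics of the original figure are an assumption):
# Boxplot of MPG by transmission type (0 = automatic, 1 = manual)
ggplot(mtcars, aes(x = am, y = mpg, fill = am)) +
    geom_boxplot() +
    labs(x = "Transmission (0 = automatic, 1 = manual)", y = "Miles per gallon")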
Notice that the 25th percentile of the manual cars' MPG is well above the 75th percentile of the automatic cars' MPG. So we might hypothesize that manual cars get better mileage than automatic cars, and we can use a t-test to test this hypothesis.
# Welch two-sample t-test of MPG by transmission type
tTest <- t.test(mpg ~ factor(am), data = mtcars)
tTest
##
## Welch Two Sample t-test
##
## data: mpg by factor(am)
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1
##        17.14737        24.39231
The p-value (0.001374) is well below 0.05, so we reject the null hypothesis of equal means in favor of the alternative: the true difference in means is not equal to 0.
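For reference, these quantities can also be read directly off the htest object (a small usage note):
# Extract the p-value and confidence interval from the t-test result
tTest$p.value
tTest$conf.int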
# Simple linear regression of MPG on transmission type alone
simpleFit <- lm(formula = mpg ~ factor(am), data = mtcars)
summary(simpleFit)
##
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
We can see that the Adjusted R-squared is 0.3385 and the Multiple R-squared is 0.3598, meaning the simple linear model explains only about 36 percent of the variance in mpg; it is not enough to capture the underlying relationship between the outcome and the predictors.
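These statistics can be extracted programmatically from the model summary (a small usage note):
# R-squared and adjusted R-squared of the simple model
summary(simpleFit)$r.squared
summary(simpleFit)$adj.r.squared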
As discussed above, we need a more complex model that includes the other predictors.
# Multivariate regression of MPG on all remaining variables
multiVFit <- lm(mpg ~ ., data = mtcars)
summary(multiVFit)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4506 -1.6044 -0.1196  1.2193  4.6271
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337   18.71788   0.657   0.5181
## cyl         -0.11144    1.04502  -0.107   0.9161
## disp         0.01334    0.01786   0.747   0.4635
## hp          -0.02148    0.02177  -0.987   0.3350
## drat         0.78711    1.63537   0.481   0.6353
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739
## vs           0.31776    2.10451   0.151   0.8814
## am1          2.52023    2.05665   1.225   0.2340
## gear         0.65541    1.49326   0.439   0.6652
## carb        -0.19942    0.82875  -0.241   0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
The Adjusted R-squared is now 0.8066, which means this model explains more than 80 percent of the variance in mpg. The overall F-statistic's p-value (3.793e-07) shows that we can reject the null hypothesis that none of the predictors is associated with mpg, even though no individual coefficient is significant at the 0.05 level.
I select the multivariate fit as the best model.
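One way to back this choice (a sketch, not output from the original report) is a nested-model comparison, since the simple model's single predictor is contained in the full model:
# F-test comparing the simple model against the full multivariate model
anova(simpleFit, multiVFit)
A small p-value in this comparison would indicate that the additional predictors significantly improve the fit over transmission type alone.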
# Standard diagnostic plots for the selected model, in a 2x2 grid
par(mfrow = c(2, 2))
plot(multiVFit)
Figure 1 (Residuals vs Fitted) supports the independence assumption. Figure 2 (Normal Q-Q) shows that the residuals are approximately normally distributed. Figure 3 (Scale-Location) shows that the variance is roughly constant. Figure 4 (Residuals vs Leverage) shows that there may be some outliers we might be interested in.
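To follow up on those potential outliers, base R's influence measures can be inspected (a sketch using the same fitted model):
# Cars with the highest leverage (hat values) on the fit
head(sort(hatvalues(multiVFit), decreasing = TRUE), 3)
# Cars with the largest influence on the estimated coefficients
head(sort(apply(abs(dfbetas(multiVFit)), 1, max), decreasing = TRUE), 3)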