Executive Summary

This report presents the results of the course project for the Regression Models course, part of the Johns Hopkins Data Science Specialization on Coursera.

It analyzes the mtcars data set from the R datasets package. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and ten characteristics of automobile design and performance for 32 cars. The goals of this analysis are to:

This analysis was made using regression models and exploratory data analyses, and the findings are as follows:

Exploratory Data Analyses

The mtcars dataset description shows that the data frame has 32 observations on 11 variables: mpg - Miles/(US) gallon, cyl - Number of cylinders, disp - Displacement (cu.in.), hp - Gross horsepower, drat - Rear axle ratio, wt - Weight (lb/1000), qsec - 1/4 mile time, vs - V/S (0 = V engine, 1 = straight engine), am - Transmission (0 = automatic, 1 = manual), gear - Number of forward gears, and carb - Number of carburetors.
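The structure described above can be checked directly in R. The following is a minimal sketch; the object names `mtcars_labeled` and `am_counts` are illustrative and not part of the original analysis.

```r
# Illustrative check of the data set structure.
data(mtcars)
dim(mtcars)   # number of rows (cars) and columns (variables)
str(mtcars)   # every column is stored as numeric

# Work on a copy so that mtcars itself keeps its 0/1 coding of am
mtcars_labeled <- mtcars
mtcars_labeled$am <- factor(mtcars_labeled$am,
                            levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Number of cars in each transmission group
am_counts <- table(mtcars_labeled$am)
am_counts
```

Labeling am on a copy keeps the later model code, which relies on the numeric 0/1 coding, unchanged.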

The first step is an exploratory analysis restricted to the variables am and mpg, because the questions under study relate directly to these two variables. The first question can be answered from the boxplot shown in Appendix A (Miles per Gallon - mpg by Transmission - am). The density plot of mpg in Appendix A also suggests that the distribution is approximately normal. In addition, a t-test compares the means of the two transmission groups (Manual and Auto):

amAuto <- mtcars$mpg[mtcars$am == 0]; amManual <- mtcars$mpg[mtcars$am == 1];
t.test(amAuto, amManual, paired = FALSE, alternative="two.sided", var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  amAuto and amManual
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.28  -3.21
## sample estimates:
## mean of x mean of y 
##     17.15     24.39

The 95% confidence interval (-11.28, -3.21) does not contain zero, and the p-value (0.001374) is less than 0.005. It can therefore be concluded that the average fuel economy, in miles per gallon, is higher with manual transmission than with automatic transmission. From the group means it is possible to quantify the MPG difference between automatic and manual transmissions: manual is 7.24 mpg greater, obtained by subtracting the means.

meanMPGmanual <- 24.39; meanMPGauto <- 17.15; meanMPGmanual - meanMPGauto
## [1] 7.24
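The same difference can be computed from the data instead of retyping the rounded means. This is a small sketch; `group_means` and `mpg_diff` are illustrative names.

```r
# Mean mpg per transmission group (names are "0" = automatic, "1" = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
round(group_means, 2)

# Manual minus automatic, computed from the unrounded means
mpg_diff <- unname(group_means["1"] - group_means["0"])
round(mpg_diff, 2)
```

Computing from the unrounded means avoids small rounding discrepancies between this figure and the regression slope reported later.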

Additionally, the graph in Appendix B (Pairs Panel - Mtcars variables) shows that other variables are also correlated with mpg. These correlations are evaluated in the regression analysis, which is the main topic of this report.
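As a numerical complement to the pairs panel, the correlations with mpg can be listed directly. This is only a sketch; `mpg_cor` and `mpg_cor_sorted` are illustrative names, and Pearson correlation (the default of `cor`) is an assumption about what is of interest here.

```r
# Pearson correlation of every variable with mpg
mpg_cor <- cor(mtcars)[, "mpg"]

# Drop the trivial correlation of mpg with itself
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Order by absolute strength, strongest first
mpg_cor_sorted <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
round(mpg_cor_sorted, 2)
```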

Regression Analysis

At this stage, a single-predictor model is evaluated first, with transmission type as the predictor and miles per gallon as the outcome. A multivariable analysis follows.

Single Model

This analysis is made to compare its results with those of the exploratory analysis. With transmission type as the predictor, the null hypothesis is that the slope for am is zero, that is, mean mpg does not differ between the two transmission types. The alternative hypothesis is that the slope is different from zero.

SingleModel <- lm(mtcars$mpg ~ mtcars$am); 
summary(SingleModel)$coefficients
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.134e-15
## mtcars$am      7.245      1.764   4.106 2.850e-04

The results show that the p-value of the slope (2.850e-04) is less than 0.005, so the null hypothesis can be rejected. This confirms the exploratory analysis: mean mpg with manual transmission is 7.245 miles per gallon greater than with automatic transmission. Because am is coded 0 = automatic and 1 = manual, a positive slope means that manual transmission is better than the automatic one.
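With a single 0/1 predictor, the intercept is the mean of the group coded 0 and the slope is the difference between the two group means. The sketch below checks this and also looks at how much of the variation in mpg the model explains; `single_fit`, `slope`, `mean_diff` and `r2` are illustrative names.

```r
# Same single-predictor model, written with the data argument
single_fit <- lm(mpg ~ am, data = mtcars)

# Slope for am versus the raw difference in group means (manual - automatic)
slope <- unname(coef(single_fit)["am"])
mean_diff <- mean(mtcars$mpg[mtcars$am == 1]) - mean(mtcars$mpg[mtcars$am == 0])
all.equal(slope, mean_diff)

# Share of the variance in mpg explained by transmission type alone
r2 <- summary(single_fit)$r.squared
round(r2, 2)
```

A modest R-squared here is one motivation for adjusting for further variables in the multivariable analysis.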

Multivariable Analysis

Multivariable regression gives a better estimate of the impact of transmission type on mpg by adjusting for confounding variables such as weight (wt) and quarter-mile time (qsec). The stepAIC function from the MASS package was chosen for this analysis because it automates the choice of the best model. The results are the following:

require(MASS)
## Loading required package: MASS
MultModel <- stepAIC(lm(mpg ~ . ,data=mtcars), direction = 'both', trace = FALSE)
MultModel$anova[1:5]
##     Step Df Deviance Resid. Df Resid. Dev
## 1                           21      147.5
## 2  - cyl  1  0.07987        22      147.6
## 3   - vs  1  0.26852        23      147.8
## 4 - carb  1  0.68546        24      148.5
## 5 - gear  1  1.56497        25      150.1
## 6 - drat  1  3.34455        26      153.4
## 7 - disp  1  6.62865        27      160.1
## 8   - hp  1  9.21947        28      169.3

The best model indicated by the automated analysis has mpg as the outcome and wt, qsec and am as predictors.
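The selected predictors can also be read from the fitted object rather than from the step table. A minimal sketch, assuming the MASS package is installed; `full_fit`, `selected_fit` and `selected_terms` are illustrative names.

```r
library(MASS)

# Full model with every other variable as a predictor
full_fit <- lm(mpg ~ ., data = mtcars)

# Stepwise selection by AIC, in both directions, without printing each step
selected_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

# Formula and predictor names of the selected model
formula(selected_fit)
selected_terms <- attr(terms(selected_fit), "term.labels")
selected_terms

# AIC before and after selection (lower is better)
c(full = AIC(full_fit), selected = AIC(selected_fit))
```

Stepwise selection is a convenience, not a guarantee: it optimizes AIC on this sample, so the selected model is best read as a reasonable candidate.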

finalModel <- lm(mtcars$mpg ~ mtcars$wt + mtcars$qsec + mtcars$am)
summary(finalModel)$coefficients
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    9.618     6.9596   1.382 1.779e-01
## mtcars$wt     -3.917     0.7112  -5.507 6.953e-06
## mtcars$qsec    1.226     0.2887   4.247 2.162e-04
## mtcars$am      2.936     1.4109   2.081 4.672e-02

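Then, the regression equation is mpg = 9.618 - 3.917 wt + 1.226 qsec + 2.936 am. It is assumed that the errors have mean zero. Holding weight and quarter-mile time constant, manual transmission adds an estimated 2.936 mpg. As the two-sided p-value for the am coefficient is 0.04672, smaller than 0.05, the null hypothesis that this coefficient is zero can be rejected.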
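The adjusted effect of transmission can be illustrated with a confidence interval and a pair of predictions. This is a sketch only: `adj_fit`, `am_ci`, `new_cars` and `pred_gap` are illustrative names, and the values wt = 3 and qsec = 18 are hypothetical inputs chosen for the example, not cars from the data set.

```r
# Final model, written with the data argument
adj_fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# 95% confidence interval for the am coefficient
am_ci <- confint(adj_fit)["am", ]
am_ci

# Two hypothetical cars that differ only in transmission type
new_cars <- data.frame(wt = 3, qsec = 18, am = c(0, 1))
pred <- predict(adj_fit, newdata = new_cars)
pred

# The gap between the two predictions is the am coefficient
pred_gap <- unname(diff(pred))
all.equal(pred_gap, unname(coef(adj_fit)["am"]))
```

Because the model has no interaction terms, the predicted gap between manual and automatic is the same for any choice of wt and qsec.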

Looking at the plots in Appendix C (Final Model Residuals), visual analysis shows that the final model behaves adequately with respect to normality of the residuals and constant variance. The leverage values are within a reasonable upper limit.
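The visual checks can be complemented with a few numerical ones. The sketch below is not part of the original analysis: the Shapiro-Wilk test and the 2p/n leverage cutoff are common rules of thumb assumed here for illustration, and `diag_fit`, `res_normality`, `lev` and `cutoff` are illustrative names.

```r
# Final model, refitted so the example is self-contained
diag_fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Shapiro-Wilk test on the residuals; a small p-value would suggest
# a departure from normality
res_normality <- shapiro.test(residuals(diag_fit))
res_normality$p.value

# Leverage (hat values); they sum to the number of coefficients p
lev <- hatvalues(diag_fit)

# Rule-of-thumb cutoff of 2p/n (an assumption, not a strict limit)
cutoff <- 2 * length(coef(diag_fit)) / nrow(mtcars)

# Cars above the cutoff, if any, deserve a closer look
lev[lev > cutoff]
```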

Conclusions

These are the conclusions of the analysis:

The online version of this report is available at http://rpubs.com/dsasas/mtcars.

Appendix

A - Boxplot - Miles per Gallon (MPG) by Transmission and Density Plot

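# Summary for the box geom: min, mean - sd, mean, mean + sd, max
# (a mean/sd box, not the usual median/quartile boxplot)
summary.data <- function(x) {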
  temp <- c(min(x), mean(x) - sd(x), mean(x), mean(x) + sd(x), max(x))
  names(temp) <- c("ymin", "lower", "middle", "upper", "ymax")
  temp
}
require(ggplot2)
## Loading required package: ggplot2
p1 <- ggplot(mtcars, aes(x = factor(am), y = mpg, fill = factor(am))) +
  stat_summary(fun.data = summary.data, geom = "boxplot") +
  geom_jitter(position = position_jitter(width = .2), size = 3) +
  ggtitle("Miles per Gallon (MPG) by Transmission") +
  xlab("Transmission (0 = automatic, 1 = manual)") +
  ylab("MPG - Miles per Gallon")
p1

[Figure: Miles per Gallon (MPG) by Transmission]

plot(density(mtcars$mpg), xlab = "Miles per Gallon (MPG)", main ="Density Plot of Miles per Gallon")

[Figure: Density Plot of Miles per Gallon]
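The density plot gives a visual impression of normality; a formal test can complement it. The following sketch uses the Shapiro-Wilk test, which is an assumption added here for illustration and not part of the original analysis; `mpg_normality` is an illustrative name.

```r
# Shapiro-Wilk normality test on mpg; a p-value above the chosen
# significance level means normality is not rejected
mpg_normality <- shapiro.test(mtcars$mpg)
mpg_normality$statistic
mpg_normality$p.value
```

With only 32 observations the test has limited power, so it is best read together with the plot.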

B - Pairs Panel - Mtcars variables

require(graphics)
pairs(mtcars,main = "Pair Panel - Mtcars variables", panel=panel.smooth)

[Figure: Pair Panel - Mtcars variables]

C - Final Model Residuals

par(mfrow = c(2,2));plot(finalModel)

[Figure: Final Model Residuals, four diagnostic plots]