Project Executive Summary

In this project, we play the role of a data analyst working for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. We try to answer the following two questions. Question 1: “Is an automatic or manual transmission better for MPG?” Question 2: “Quantify the MPG difference between automatic and manual transmissions.”

Getting the dataset

First, we need to load the dataset into R:

library(datasets); data(mtcars)

Exploratory Data Analysis

The purpose of Exploratory Data Analysis (EDA) is to detect any possible correlations between the transmission type am, along with the other variables, and the dependent variable MPG. See Plot 1 in the Appendix.

At first glance, the plot does not let us answer these two questions quickly: the ranges of MPG values for the two transmission types overlap. We also see that some of the other variables have strong linear correlations with MPG and should not be omitted from the regression models.

We use a boxplot to show the spread of mpg values for each transmission type; see Plot 3 in the Appendix. From this boxplot, we get a general idea of where the group means for automatic and manual transmissions lie.
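The comparison described above can be sketched in a few lines of base R. This is only an illustration, not necessarily the code behind Plot 3: the names `trans`, `by_trans`, `rng` and `med` are introduced here, and the labelling assumes the usual mtcars coding of am (0 = automatic, 1 = manual).

```r
library(datasets)
data(mtcars)

# Assumed coding: am = 0 is automatic, am = 1 is manual.
trans <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Split mpg by transmission type and summarise each group.
by_trans <- split(mtcars$mpg, trans)
rng <- sapply(by_trans, range)          # 2 x 2 matrix: rows are min and max
rownames(rng) <- c("min", "max")
med <- sapply(by_trans, median)

print(lengths(by_trans))                # number of cars per group
print(rng)                              # compare the two ranges for overlap
print(med)                              # the centre line drawn by the boxplot

# Boxplot of mpg by transmission type.
boxplot(by_trans, xlab = "Transmission", ylab = "Miles per gallon")
```

Printing the ranges next to the boxplot makes it easy to check how much the two groups overlap before fitting any model.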

Regression Model Selection

Based on the initial analysis and the information we were given about the variables mpg and am, we can approach this problem as a dummy variable model, with \(X_{i1}\) being binary: it is 1 when the transmission is manual and 0 when the transmission is automatic.

From that, we start by building our first linear regression model, with only am as the independent variable and mpg as the dependent variable. We treat am as a factor variable: \(Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_{i}\)

Coefficients Interpretation

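\(\beta_0\) is the intercept, which here is the group mean for automatic transmission (\(X_{i1} = 0\)); \(\beta_1\) is the difference in group mean between manual and automatic transmission; \(\epsilon_{i}\) is the residual.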

For the purposes of this project, and to keep the model simple, we include the intercept as part of the model.

The following R code shows the summary of our first simple linear regression model:

fit1 <- lm(mpg ~ factor(am), data=mtcars); summary(fit1)$coef
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.134e-15
## factor(am)1    7.245      1.764   4.106 2.850e-04

With a very small p-value of 0.000285, we reject the null hypothesis and conclude that there is a linear relationship between the predictor variable am and mpg. The summary also gives an adjusted \(R^{2}\) of 0.338, which means this model explains only about a third of the variance. We can also read off the group means for mpg: 17.147 for automatic transmission cars and 24.392 (17.147 + 7.245) for manual transmission cars.
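As a quick check on this reading of the coefficients, the sketch below compares the fitted values of the simple model with the raw group means and prints a confidence interval for the difference. The variable names (`b`, `mean_auto`, `mean_manual`, `s`) are introduced here for illustration only.

```r
library(datasets)
data(mtcars)

fit1 <- lm(mpg ~ factor(am), data = mtcars)
b <- coef(fit1)

# The intercept should equal the mean mpg of the reference group (am = 0).
mean_auto <- mean(mtcars$mpg[mtcars$am == 0])
# Intercept plus slope should equal the mean mpg of the am = 1 group.
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])

print(c(intercept = unname(b[1]), mean_auto = mean_auto))
print(c(intercept_plus_slope = unname(b[1] + b[2]), mean_manual = mean_manual))

# 95% confidence interval for the difference between the two group means.
print(confint(fit1)["factor(am)1", ])

# Multiple and adjusted R-squared of the one-predictor model.
s <- summary(fit1)
print(c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared))
```

With a single binary predictor, the least-squares fit reproduces the two group means exactly, which is why the coefficients can be read as a group mean and a difference in means.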

Since more variables in this dataset also appear to be linearly correlated with the dependent variable mpg, we next explore a multivariable regression model, using the vif and cor functions in R to compute variance inflation factors and select variables for this linear model:

library(car); fit <- lm(mpg ~ . ,data=mtcars); sqrt(vif(fit));cor(mtcars)[1,]
##   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
## 3.921 4.650 3.136 1.837 3.894 2.744 2.228 2.156 2.315 2.812
##     mpg     cyl    disp      hp    drat      wt    qsec      vs      am 
##  1.0000 -0.8522 -0.8476 -0.7762  0.6812 -0.8677  0.4187  0.6640  0.5998 
##    gear    carb 
##  0.4803 -0.5509
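For readers without the car package, the variance inflation factors can be approximated in base R. The sketch below uses the textbook definition, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others; `vif_by_hand` is a hypothetical helper written for this example, not a function from car. It also ranks the predictors by absolute correlation with mpg.

```r
library(datasets)
data(mtcars)

# Hypothetical helper (not part of the car package): VIF of one predictor,
# computed as 1 / (1 - R^2) from regressing it on all the other predictors.
vif_by_hand <- function(predictor, data, response = "mpg") {
  others <- setdiff(names(data), c(predictor, response))
  aux <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(aux)$r.squared)
}

predictors <- setdiff(names(mtcars), "mpg")
vifs <- sapply(predictors, vif_by_hand, data = mtcars)

# Square roots, comparable to the sqrt(vif(fit)) output above.
print(round(sqrt(vifs), 3))

# Rank the predictors by absolute correlation with mpg.
mpg_cor <- cor(mtcars)["mpg", predictors]
ranked <- sort(abs(mpg_cor), decreasing = TRUE)
print(round(ranked, 4))
```

The square root of a VIF indicates roughly how many times larger the standard error of a coefficient is than it would be if that predictor were uncorrelated with the others.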

From the above output of the vif and cor functions, we select cyl, wt, hp, and disp, the four variables most strongly correlated with mpg (all above 0.77 in absolute value), for a linear model along with am. Thus we have: \(Y_i = \beta_0 + X_{i1} \beta_1 + X_{i2}\beta_2 + X_{i3}\beta_3 + X_{i4}\beta_4 + X_{i5}\beta_5 + \epsilon_{i}\)

fit2 <- lm(mpg ~ am+wt+hp+disp+cyl, data = mtcars); anova(fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ am + wt + hp + disp + cyl
##   Res.Df RSS Df Sum of Sq    F  Pr(>F)    
## 1     30 721                              
## 2     26 163  4       558 22.2 4.5e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value here is very small at 4.5e-08, so we can state that our multivariable model fits significantly better than our simple model.
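To show where the F statistic in the table comes from, the sketch below recomputes it from the residual sums of squares of the two nested models. The variable names (`rss1`, `rss2`, `f_stat`, `p_value`) are chosen for this example.

```r
library(datasets)
data(mtcars)

fit1 <- lm(mpg ~ factor(am), data = mtcars)
fit2 <- lm(mpg ~ am + wt + hp + disp + cyl, data = mtcars)

# Residual sum of squares and residual degrees of freedom of each model.
rss1 <- sum(resid(fit1)^2); df1 <- df.residual(fit1)
rss2 <- sum(resid(fit2)^2); df2 <- df.residual(fit2)

# F = (drop in RSS per added parameter) / (residual variance of larger model).
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_value <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

The test is valid here because the simple model is nested in the larger one: both contain am, and the larger model only adds predictors.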

We check the residuals for any signs of non-normality and examine the residuals vs. fitted values plot to look for heteroskedasticity. See Plot 4 in the Appendix. After examination, the residuals appear approximately normally distributed and homoskedastic, so we can now look at the estimates from our multivariable model.
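One possible way to produce these diagnostics is sketched below, using the standard plots for lm objects and a Shapiro-Wilk test on the residuals. This is an assumption about how such a check could be run, not necessarily the code behind the diagnostic plot in the Appendix.

```r
library(datasets)
data(mtcars)

fit2 <- lm(mpg ~ am + wt + hp + disp + cyl, data = mtcars)

# Standard lm diagnostics in a 2 x 2 grid: residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage.
op <- par(mfrow = c(2, 2))
plot(fit2)
par(op)

# Shapiro-Wilk test of normality on the residuals; a large p-value means
# no evidence against normality.
sw <- shapiro.test(resid(fit2))
print(sw)
```

With only 32 observations, the plots and the test have limited power, so they are best read as a rough sanity check rather than proof that the assumptions hold.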

See Appendix Table 1.

\(R^{2}\) shows that this model explains about 85.5% of the variance. We can see that two variables, wt and cyl, clearly confound the relationship between am and mpg. To answer our questions: yes, the model suggests manual transmission is better for MPG; holding wt, hp, disp, and cyl constant, manual transmission cars get on average about 1.56 MPG more than automatic transmission cars. Note, however, that the am coefficient is not statistically significant in this model (p = 0.29, Table 1).
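To quantify the uncertainty around the estimated difference, a confidence interval for the am coefficient can be taken from the fitted model. The sketch below refits the same model; `est` and `ci` are names introduced for this example.

```r
library(datasets)
data(mtcars)

fit2 <- lm(mpg ~ am + wt + hp + disp + cyl, data = mtcars)

# Point estimate of the manual-vs-automatic difference, adjusted for
# wt, hp, disp and cyl.
est <- unname(coef(fit2)["am"])

# 95% confidence interval for that adjusted difference.
ci <- unname(confint(fit2)["am", ])

print(round(c(estimate = est, lower = ci[1], upper = ci[2]), 3))
```

Reporting the interval alongside the point estimate gives a fuller answer to the second question than the single number alone.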

Appendix

Plot 1

[plot of chunk unnamed-chunk-5]

Plot 2

[plot of chunk unnamed-chunk-6]

Plot 3

[plot of chunk unnamed-chunk-7]

Table 1

summary(fit2)
## 
## Call:
## lm(formula = mpg ~ am + wt + hp + disp + cyl, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.595 -1.586 -0.716  1.282  5.572 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.2028     3.6691   10.41  9.1e-11 ***
## am            1.5565     1.4405    1.08   0.2898    
## wt           -3.3026     1.1336   -2.91   0.0073 ** 
## hp           -0.0280     0.0139   -2.01   0.0551 .  
## disp          0.0123     0.0117    1.05   0.3047    
## cyl          -1.1064     0.6764   -1.64   0.1139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.5 on 26 degrees of freedom
## Multiple R-squared:  0.855,  Adjusted R-squared:  0.827 
## F-statistic: 30.7 on 5 and 26 DF,  p-value: 4.03e-10