Executive Summary

This project explores the relationship between transmission type (automatic or manual) and miles per gallon (MPG, the outcome), and quantifies the MPG difference between automatic and manual transmissions. The analysis uses the “mtcars” dataset in R. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). We find that a manual transmission is better for MPG when other conditions are the same.

Exploratory Data Analysis

data <- data.frame(mtcars)

The Toyota Corolla has the highest MPG among these cars (see appendix 1).

Regression Assumption Test

In the Residuals vs Fitted plot (see appendix 2), the residuals are scattered around zero with roughly constant spread, which suggests that the assumption of homoscedasticity is not violated. The random, patternless residuals in the same plot are also consistent with independent errors. Moreover, the points in the Normal Q-Q plot fall close to a straight line, so the normality assumption appears reasonable.
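
The visual checks above can be complemented by a formal test. The following is a minimal sketch, assuming the full model from appendix 2 is refit directly on mtcars. The Shapiro-Wilk test is an illustrative addition that was not part of the original analysis, and the names fit_full and sw are hypothetical.

```r
# Refit the full model from appendix 2 (mpg on all ten predictors).
fit_full <- lm(mpg ~ ., data = mtcars)

# Shapiro-Wilk test on the residuals: the null hypothesis is that
# they come from a normal distribution (illustrative addition).
sw <- shapiro.test(residuals(fit_full))
sw$p.value
```

A p-value well above the chosen significance level would be consistent with the reading of the Q-Q plot; a small p-value would call for a closer look at the residuals.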

Correlations between variables

We treat an absolute correlation greater than 0.8 as indicating that two variables are highly correlated. By this criterion, cyl vs. disp (0.902), cyl vs. hp (0.832), cyl vs. vs (-0.811), and disp vs. wt (0.888) are highly correlated pairs (see appendix 3). Our model must include the transmission variable (am). When we fit a model, at most one variable from each highly correlated pair is added.
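
The screening rule described above can also be applied programmatically instead of reading the correlation matrix by eye. This is a minimal sketch under the same 0.8 cutoff; the names threshold, cor_mat, idx and high_pairs are illustrative and do not appear in the original analysis.

```r
# Sketch: list predictor pairs whose absolute correlation exceeds a cutoff.
threshold <- 0.8
cor_mat <- cor(mtcars[, -1])   # drop mpg (column 1), keep the 10 predictors

# Look only at the upper triangle so that each pair is reported once.
idx <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = round(cor_mat[idx], 3)
)
high_pairs
```

Restricting the search to the upper triangle avoids listing each pair twice and excludes the diagonal of ones.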

Model Selection

Base Model

lm0 <- lm(data$mpg ~ data$am)

Model 1

lm1 <- lm(data$mpg ~ data$am + data$hp)
summary(lm1)
## 
## Call:
## lm(formula = data$mpg ~ data$am + data$hp)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.384 -2.264  0.137  1.697  5.866 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.58491    1.42509   18.65  < 2e-16 ***
## data$am      5.27709    1.07954    4.89  3.5e-05 ***
## data$hp     -0.05889    0.00786   -7.50  2.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.91 on 29 degrees of freedom
## Multiple R-squared:  0.782,  Adjusted R-squared:  0.767 
## F-statistic:   52 on 2 and 29 DF,  p-value: 2.55e-10

Model 2

lm2 <- lm(data$mpg ~ data$am + data$qsec + data$carb)
summary(lm2)
## 
## Call:
## lm(formula = data$mpg ~ data$am + data$qsec + data$carb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.674 -1.502  0.465  1.716  5.253 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.325      8.631    0.04   0.9702    
## data$am        8.435      1.149    7.34  5.4e-08 ***
## data$qsec      1.133      0.425    2.67   0.0125 *  
## data$carb     -1.383      0.458   -3.02   0.0053 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.08 on 28 degrees of freedom
## Multiple R-squared:  0.764,  Adjusted R-squared:  0.738 
## F-statistic: 30.2 on 3 and 28 DF,  p-value: 6.45e-09

Model 3

lm3 <- lm(data$mpg ~ data$am + data$vs + data$carb)
summary(lm3)
## 
## Call:
## lm(formula = data$mpg ~ data$am + data$vs + data$carb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.280 -1.231  0.408  2.052  4.820 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   19.517      1.609   12.13  1.2e-12 ***
## data$am        6.798      1.101    6.17  1.2e-06 ***
## data$vs        4.196      1.325    3.17   0.0037 ** 
## data$carb     -1.431      0.408   -3.51   0.0016 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.96 on 28 degrees of freedom
## Multiple R-squared:  0.782,  Adjusted R-squared:  0.758 
## F-statistic: 33.4 on 3 and 28 DF,  p-value: 2.14e-09

Model 4

lm4 <- lm(data$mpg ~ data$am + data$carb)
summary(lm4)
## 
## Call:
## lm(formula = data$mpg ~ data$am + data$carb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.232 -1.741 -0.071  2.394  5.638 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   23.146      1.294   17.89  < 2e-16 ***
## data$am        7.653      1.223    6.26  7.9e-07 ***
## data$carb     -2.192      0.378   -5.80  2.8e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.39 on 29 degrees of freedom
## Multiple R-squared:  0.704,  Adjusted R-squared:  0.683 
## F-statistic: 34.4 on 2 and 29 DF,  p-value: 2.19e-08

Among the four models above, Model 1 and Model 3 have the highest R-squared (0.782 each), and the coefficients in both models are statistically significant. Next, we check their VIFs and compare the two models using AIC and BIC.

The VIFs in both models are low (all below 2; see appendix 4), although vs and carb in Model 3 are moderately correlated with each other (-0.570; see appendix 3). The model with the smallest AIC and BIC values is considered the “best”. Both the AIC and BIC values of Model 1 are smaller than those of Model 3 (see appendix 5). Therefore, Model 1 is preferred over Model 3.
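
To make the comparison criterion concrete, the sketch below recomputes AIC and BIC from the log-likelihood. It assumes the two models are refit with a data argument (m1 and m3 are hypothetical names for the same fits as lm1 and lm3), and ic_by_hand is an illustrative helper, not part of the original analysis.

```r
# Refit Models 1 and 3 using the data argument (same fits as lm1 and lm3).
m1 <- lm(mpg ~ am + hp, data = mtcars)
m3 <- lm(mpg ~ am + vs + carb, data = mtcars)

# AIC = -2 * logLik + 2 * k;  BIC = -2 * logLik + log(n) * k,
# where k counts the regression coefficients plus the error variance.
ic_by_hand <- function(model) {
  ll <- logLik(model)
  k  <- attr(ll, "df")
  n  <- nobs(model)
  c(AIC = -2 * as.numeric(ll) + 2 * k,
    BIC = -2 * as.numeric(ll) + log(n) * k)
}

rbind(lm1 = ic_by_hand(m1), lm3 = ic_by_hand(m3))
```

Both criteria penalise the extra parameter in Model 3, and BIC does so more strongly because log(32) is greater than 2.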

Best Model

\(Y_{mpg,i} = 26.58 + 5.28 X_{am,i} - 0.06 X_{hp,i} + \epsilon_i\).

Here the \(\epsilon_{i}\) are assumed iid \(N(0, \sigma^2)\).

We estimate an expected increase of 5.28 mpg for a manual transmission relative to an automatic one (am: 0 = automatic, 1 = manual), holding the remaining variable (hp) constant. Therefore, a manual transmission is better for MPG when other conditions are the same.

We also estimate an expected decrease of 0.06 mpg for every one-unit increase in gross horsepower, holding the remaining variable (am) constant.
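
The point estimates above can be reported together with their uncertainty. This is a minimal sketch, assuming Model 1 is refit with a data argument; m1 and ci are hypothetical names, and the 95% level is an assumption made for illustration.

```r
# Refit Model 1 (mpg on transmission type and horsepower).
m1 <- lm(mpg ~ am + hp, data = mtcars)

# Point estimates, rounded as in the fitted equation.
round(coef(m1), 2)

# 95% confidence intervals for the intercept and both slopes.
ci <- confint(m1, level = 0.95)
ci
```

An interval for the am coefficient that excludes zero would support the conclusion that the transmission effect is not due to chance alone, given hp.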

Quantify the MPG difference between automatic and manual transmissions

# mean mpg of automatic cars (am = 0) minus mean mpg of manual cars (am = 1)
Dif <- mean(data$mpg[data$am == 0]) - mean(data$mpg[data$am == 1])
Dif
## [1] -7.245

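\(Difference_{mpg} = -7.24\), that is, without adjusting for any other variable, automatic cars average about 7.24 mpg less than manual cars.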
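
The raw difference can be tied back to the base model and checked for statistical significance. The sketch below is an illustration only: the Welch t-test was not part of the original analysis, and group_means, unadj_diff, tt and base_slope are hypothetical names.

```r
# Group means of mpg by transmission (0 = automatic, 1 = manual).
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
unadj_diff  <- unname(group_means["0"] - group_means["1"])  # automatic minus manual

# Welch two-sample t-test of mpg between the two groups (illustrative addition).
tt <- t.test(mpg ~ am, data = mtcars)

# The slope of the base model mpg ~ am is the manual-minus-automatic mean difference.
base_slope <- unname(coef(lm(mpg ~ am, data = mtcars))["am"])

c(unadj_diff = unadj_diff, base_slope = base_slope, p_value = tt$p.value)
```

The unadjusted gap is larger in magnitude than the 5.28 mpg estimated by Model 1, because that model also accounts for horsepower.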

Appendix

Appendix 1

data <- data.frame(mtcars)
data$model <- factor(rownames(data))
data$car_no <- 1:32
str(data)
## 'data.frame':    32 obs. of  13 variables:
##  $ mpg   : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl   : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp  : num  160 160 108 258 360 ...
##  $ hp    : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat  : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt    : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec  : num  16.5 17 18.6 19.4 17 ...
##  $ vs    : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am    : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear  : num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb  : num  4 4 1 1 2 1 4 2 2 4 ...
##  $ model : Factor w/ 32 levels "AMC Javelin",..: 18 19 5 13 14 31 7 21 20 22 ...
##  $ car_no: int  1 2 3 4 5 6 7 8 9 10 ...
plot(data$car_no, data$mpg)
text(data$car_no, data$mpg, labels = data$car_no, pos = 1)

[Figure: scatter plot of mpg against car number, each point labelled with its car number]

data$model[which.max(data$mpg)]  # model name of the car with the highest mpg
## [1] Toyota Corolla
## 32 Levels: AMC Javelin Cadillac Fleetwood Camaro Z28 ... Volvo 142E

Appendix 2

data <- data[,1:11]
fit <- lm(mpg ~., data = data[,1:11])
summary(fit)
## 
## Call:
## lm(formula = mpg ~ ., data = data[, 1:11])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3.45  -1.60  -0.12   1.22   4.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  12.3034    18.7179    0.66    0.518  
## cyl          -0.1114     1.0450   -0.11    0.916  
## disp          0.0133     0.0179    0.75    0.463  
## hp           -0.0215     0.0218   -0.99    0.335  
## drat          0.7871     1.6354    0.48    0.635  
## wt           -3.7153     1.8944   -1.96    0.063 .
## qsec          0.8210     0.7308    1.12    0.274  
## vs            0.3178     2.1045    0.15    0.881  
## am            2.5202     2.0567    1.23    0.234  
## gear          0.6554     1.4933    0.44    0.665  
## carb         -0.1994     0.8288   -0.24    0.812  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.807 
## F-statistic: 13.9 on 10 and 21 DF,  p-value: 3.79e-07
par(mfrow = c(2,2))
plot(fit)

[Figure: diagnostic plots for the full model: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

Appendix 3

correlation = cor(data[,2:11])
correlation
##          cyl    disp      hp     drat      wt    qsec      vs       am
## cyl   1.0000  0.9020  0.8324 -0.69994  0.7825 -0.5912 -0.8108 -0.52261
## disp  0.9020  1.0000  0.7909 -0.71021  0.8880 -0.4337 -0.7104 -0.59123
## hp    0.8324  0.7909  1.0000 -0.44876  0.6587 -0.7082 -0.7231 -0.24320
## drat -0.6999 -0.7102 -0.4488  1.00000 -0.7124  0.0912  0.4403  0.71271
## wt    0.7825  0.8880  0.6587 -0.71244  1.0000 -0.1747 -0.5549 -0.69250
## qsec -0.5912 -0.4337 -0.7082  0.09120 -0.1747  1.0000  0.7445 -0.22986
## vs   -0.8108 -0.7104 -0.7231  0.44028 -0.5549  0.7445  1.0000  0.16835
## am   -0.5226 -0.5912 -0.2432  0.71271 -0.6925 -0.2299  0.1683  1.00000
## gear -0.4927 -0.5556 -0.1257  0.69961 -0.5833 -0.2127  0.2060  0.79406
## carb  0.5270  0.3950  0.7498 -0.09079  0.4276 -0.6562 -0.5696  0.05753
##         gear     carb
## cyl  -0.4927  0.52699
## disp -0.5556  0.39498
## hp   -0.1257  0.74981
## drat  0.6996 -0.09079
## wt   -0.5833  0.42761
## qsec -0.2127 -0.65625
## vs    0.2060 -0.56961
## am    0.7941  0.05753
## gear  1.0000  0.27407
## carb  0.2741  1.00000

Appendix 4

library(car)
vif(lm1)
## data$am data$hp 
##   1.063   1.063
vif(lm3)
##   data$am   data$vs data$carb 
##     1.067     1.575     1.535
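
For a model with exactly two predictors, the VIF can be reproduced without the car package, which shows what the statistic measures. This is a minimal sketch for Model 1; r_am_hp and vif_by_hand are hypothetical names.

```r
# With two predictors, VIF = 1 / (1 - r^2), where r is the
# correlation between the two predictors (here am and hp).
r_am_hp <- cor(mtcars$am, mtcars$hp)
vif_by_hand <- 1 / (1 - r_am_hp^2)

# Should agree with the value reported by vif(lm1) above.
round(vif_by_hand, 3)
```

With three or more predictors, r^2 is replaced by the R-squared from regressing each predictor on all the others.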

Appendix 5

table <- cbind(c(AIC(lm1),AIC(lm3)),c(BIC(lm1),BIC(lm3)))
rownames(table) <- c("lm1","lm3")
colnames(table) <- c("AIC","BIC")
table
##     AIC   BIC
## lm1 164 169.9
## lm3 166 173.4