This project explores the relationship between transmission type (automatic or manual) and miles per gallon (MPG, the outcome) and quantifies the MPG difference between automatic and manual transmissions. The analysis uses the “mtcars” dataset in R. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). We find that, holding gross horsepower constant, a manual transmission is associated with about 5.28 more MPG than an automatic, so a manual transmission is better for MPG when other conditions are the same.
data <- data.frame(mtcars)
As appendix 1 shows, the Toyota Corolla has the highest MPG among these cars.
As appendix 2 shows, the residuals in the Residuals vs Fitted plot are scattered around zero with roughly constant spread, which suggests that the homoscedasticity assumption is not violated. The random, patternless residuals in that plot are also consistent with independent errors. Moreover, the points in the Normal Q-Q plot fall close to a straight line, so the normality assumption appears reasonable.
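The visual checks above can be complemented by a couple of numeric checks. The following is a minimal sketch, not part of the original analysis: it refits the full model from appendix 2 directly on the built-in `mtcars` data and runs a Shapiro-Wilk test on the residuals. The Shapiro-Wilk test and the names `fit_full`, `res`, `mean_res`, and `sw` are additions made here for illustration.

```r
# Sketch only: refit the full model from appendix 2 on the built-in mtcars data.
fit_full <- lm(mpg ~ ., data = mtcars)
res <- residuals(fit_full)

# With an intercept in the model, OLS residuals average to (numerically) zero.
mean_res <- mean(res)
print(mean_res)

# Shapiro-Wilk test from base R (stats): a small p-value would cast doubt
# on the normality assumption; a large one is consistent with it.
sw <- shapiro.test(res)
print(sw)
```

With only 32 observations the test has limited power, so it is best read alongside the Q-Q plot rather than instead of it.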
We treat an absolute correlation greater than 0.8 as indicating that two variables are highly correlated. By this rule, cyl vs. disp (0.902), cyl vs. hp (0.832), cyl vs. vs (-0.811), and disp vs. wt (0.888) are highly correlated pairs (see appendix 3). Transmission type (am) must be in every model because it is the variable of interest. When we fit a model, at most one variable from each highly correlated pair is included.
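The 0.8 screening rule can also be applied programmatically instead of reading the matrix by eye. This is a small sketch under the assumption that the predictors are columns 2 to 11 of `mtcars`, as in appendix 3; the names `cors`, `idx`, and `high_pairs` are illustrative.

```r
# Correlations among the ten candidate predictors (mpg, the outcome, is column 1).
cors <- cor(mtcars[, -1])

# Keep each pair once (upper triangle) where |r| exceeds the 0.8 threshold.
idx <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = round(cors[idx], 3)
)
print(high_pairs)
```

The result should list the same four pairs named in the text, and am does not appear in any of them.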
lm0 <- lm(data$mpg ~ data$am)
lm1 <- lm(data$mpg ~ data$am + data$hp)
summary(lm1)
##
## Call:
## lm(formula = data$mpg ~ data$am + data$hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.384 -2.264 0.137 1.697 5.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.58491 1.42509 18.65 < 2e-16 ***
## data$am 5.27709 1.07954 4.89 3.5e-05 ***
## data$hp -0.05889 0.00786 -7.50 2.9e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.91 on 29 degrees of freedom
## Multiple R-squared: 0.782, Adjusted R-squared: 0.767
## F-statistic: 52 on 2 and 29 DF, p-value: 2.55e-10
lm2 <- lm(data$mpg ~ data$am + data$qsec + data$carb)
summary(lm2)
##
## Call:
## lm(formula = data$mpg ~ data$am + data$qsec + data$carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.674 -1.502 0.465 1.716 5.253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.325 8.631 0.04 0.9702
## data$am 8.435 1.149 7.34 5.4e-08 ***
## data$qsec 1.133 0.425 2.67 0.0125 *
## data$carb -1.383 0.458 -3.02 0.0053 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.08 on 28 degrees of freedom
## Multiple R-squared: 0.764, Adjusted R-squared: 0.738
## F-statistic: 30.2 on 3 and 28 DF, p-value: 6.45e-09
lm3 <- lm(data$mpg ~ data$am + data$vs + data$carb)
summary(lm3)
##
## Call:
## lm(formula = data$mpg ~ data$am + data$vs + data$carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.280 -1.231 0.408 2.052 4.820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.517 1.609 12.13 1.2e-12 ***
## data$am 6.798 1.101 6.17 1.2e-06 ***
## data$vs 4.196 1.325 3.17 0.0037 **
## data$carb -1.431 0.408 -3.51 0.0016 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.96 on 28 degrees of freedom
## Multiple R-squared: 0.782, Adjusted R-squared: 0.758
## F-statistic: 33.4 on 3 and 28 DF, p-value: 2.14e-09
lm4 <- lm(data$mpg ~ data$am + data$carb)
summary(lm4)
##
## Call:
## lm(formula = data$mpg ~ data$am + data$carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.232 -1.741 -0.071 2.394 5.638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.146 1.294 17.89 < 2e-16 ***
## data$am 7.653 1.223 6.26 7.9e-07 ***
## data$carb -2.192 0.378 -5.80 2.8e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.39 on 29 degrees of freedom
## Multiple R-squared: 0.704, Adjusted R-squared: 0.683
## F-statistic: 34.4 on 2 and 29 DF, p-value: 2.19e-08
Of the four models above (Models 1 to 4, i.e. lm1 to lm4), Model 1 and Model 3 have the highest R-squared (0.782 each), and all coefficients in both models are statistically significant. Next, we examine their VIFs and compare the two models using AIC and BIC.
In Model 3, vs and carb are correlated with each other (r = -0.570), which gives them higher VIFs (1.575 and 1.535) than the predictors in Model 1 (1.063 each), although all of these values are well below commonly used cutoffs such as 5 (see appendix 4). The model with the smaller AIC and BIC values is preferred. Both AIC and BIC are smaller for Model 1 (164 and 169.9) than for Model 3 (166 and 173.4). Therefore, Model 1 is better than Model 3 (see appendix 5).
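To make the comparison criteria concrete, the sketch below recomputes AIC and BIC from the log-likelihood, using AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik, where k counts the estimated parameters (including the error variance) and n is the number of observations. The helper `ic()` and the objects `m1`, `m3`, and `ic_table` are hypothetical names introduced here; the models are refit with the `data =` argument rather than `data$` terms, which should not change the fit.

```r
# Refit the two candidate models on the built-in mtcars data.
m1 <- lm(mpg ~ am + hp, data = mtcars)
m3 <- lm(mpg ~ am + vs + carb, data = mtcars)

# Hypothetical helper: information criteria computed from the log-likelihood.
ic <- function(model) {
  ll <- logLik(model)
  k  <- attr(ll, "df")   # number of estimated parameters, incl. error variance
  n  <- nobs(model)      # number of observations used in the fit
  c(AIC = 2 * k - 2 * as.numeric(ll),
    BIC = log(n) * k - 2 * as.numeric(ll))
}

ic_table <- rbind(lm1 = ic(m1), lm3 = ic(m3))
print(round(ic_table, 1))
```

These values should agree with the built-in `AIC()` and `BIC()` results in appendix 5. BIC penalizes the extra parameter in Model 3 more heavily than AIC does because log(32) is larger than 2.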
\(Y_{mpg,i} = 26.58 + 5.28 X_{am,i} - 0.06 X_{hp,i} + \epsilon_i\).
Here the \(\epsilon_{i}\) are assumed iid \(N(0, \sigma^2)\).
We estimate that a manual transmission (am: 0 = automatic, 1 = manual) is associated with an expected 5.28 increase in mpg, holding gross horsepower constant. Therefore, a manual transmission is better for MPG when other conditions are the same.
We also estimate an expected 0.06 decrease in mpg for every one-unit increase in gross horsepower, holding transmission type constant.
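One way to see what "holding horsepower constant" means is to predict mpg for two hypothetical cars that differ only in transmission type. The sketch below assumes Model 1 refit with the `data =` argument; the 110 hp value is an arbitrary illustration, and `m1`, `new_cars`, `pred`, and `gap` are names introduced here.

```r
# Model 1, refit on the built-in mtcars data.
m1 <- lm(mpg ~ am + hp, data = mtcars)

# Two hypothetical cars with the same horsepower (110 is an arbitrary choice),
# one automatic (am = 0) and one manual (am = 1).
new_cars <- data.frame(am = c(0, 1), hp = c(110, 110))
pred <- predict(m1, newdata = new_cars)

# The predicted gap equals the am coefficient, whatever hp value is chosen.
gap <- unname(pred[2] - pred[1])
print(gap)

# A 95% confidence interval gives a sense of the uncertainty in that estimate.
print(confint(m1, "am"))
```

Because the model has no interaction term, the gap is the same at every horsepower level.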
# Unadjusted difference in mean mpg: automatic (am = 0) minus manual (am = 1)
Dif <- mean(data$mpg[data$am == 0]) - mean(data$mpg[data$am == 1])
Dif
## [1] -7.245
\(Difference_{mpg} = -7.24\).
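The unadjusted difference in means is the same quantity that a regression of mpg on am alone estimates, with the sign flipped because the coefficient measures manual minus automatic. A minimal sketch, with `m0`, `mean_auto`, `mean_manual`, and `raw_gap` as illustrative names:

```r
# Simple regression with transmission type as the only predictor.
m0 <- lm(mpg ~ am, data = mtcars)

# Group means computed directly.
mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])

# Manual minus automatic: matches the am coefficient of the simple regression.
raw_gap <- mean_manual - mean_auto
print(raw_gap)
print(unname(coef(m0)["am"]))
```

This unadjusted gap of about 7.24 mpg is larger than the 5.28 estimated by Model 1, which adjusts for gross horsepower.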
data <- data.frame(mtcars)
data$model <- factor(rownames(data))
data$car_no <- 1:32
str(data)
## 'data.frame': 32 obs. of 13 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4 ...
## $ model : Factor w/ 32 levels "AMC Javelin",..: 18 19 5 13 14 31 7 21 20 22 ...
## $ car_no: int 1 2 3 4 5 6 7 8 9 10 ...
plot(data$car_no, data$mpg)
text(data$car_no, data$mpg, labels = data$car_no, pos = 1)
data[20,12]
## [1] Toyota Corolla
## 32 Levels: AMC Javelin Cadillac Fleetwood Camaro Z28 ... Volvo 142E
data <- data[,1:11]
fit <- lm(mpg ~., data = data[,1:11])
summary(fit)
##
## Call:
## lm(formula = mpg ~ ., data = data[, 1:11])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.45 -1.60 -0.12 1.22 4.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3034 18.7179 0.66 0.518
## cyl -0.1114 1.0450 -0.11 0.916
## disp 0.0133 0.0179 0.75 0.463
## hp -0.0215 0.0218 -0.99 0.335
## drat 0.7871 1.6354 0.48 0.635
## wt -3.7153 1.8944 -1.96 0.063 .
## qsec 0.8210 0.7308 1.12 0.274
## vs 0.3178 2.1045 0.15 0.881
## am 2.5202 2.0567 1.23 0.234
## gear 0.6554 1.4933 0.44 0.665
## carb -0.1994 0.8288 -0.24 0.812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.807
## F-statistic: 13.9 on 10 and 21 DF, p-value: 3.79e-07
par(mfrow = c(2,2))
plot(fit)
correlation = cor(data[,2:11])
correlation
## cyl disp hp drat wt qsec vs am
## cyl 1.0000 0.9020 0.8324 -0.69994 0.7825 -0.5912 -0.8108 -0.52261
## disp 0.9020 1.0000 0.7909 -0.71021 0.8880 -0.4337 -0.7104 -0.59123
## hp 0.8324 0.7909 1.0000 -0.44876 0.6587 -0.7082 -0.7231 -0.24320
## drat -0.6999 -0.7102 -0.4488 1.00000 -0.7124 0.0912 0.4403 0.71271
## wt 0.7825 0.8880 0.6587 -0.71244 1.0000 -0.1747 -0.5549 -0.69250
## qsec -0.5912 -0.4337 -0.7082 0.09120 -0.1747 1.0000 0.7445 -0.22986
## vs -0.8108 -0.7104 -0.7231 0.44028 -0.5549 0.7445 1.0000 0.16835
## am -0.5226 -0.5912 -0.2432 0.71271 -0.6925 -0.2299 0.1683 1.00000
## gear -0.4927 -0.5556 -0.1257 0.69961 -0.5833 -0.2127 0.2060 0.79406
## carb 0.5270 0.3950 0.7498 -0.09079 0.4276 -0.6562 -0.5696 0.05753
## gear carb
## cyl -0.4927 0.52699
## disp -0.5556 0.39498
## hp -0.1257 0.74981
## drat 0.6996 -0.09079
## wt -0.5833 0.42761
## qsec -0.2127 -0.65625
## vs 0.2060 -0.56961
## am 0.7941 0.05753
## gear 1.0000 0.27407
## carb 0.2741 1.00000
library(car)
vif(lm1)
## data$am data$hp
## 1.063 1.063
vif(lm3)
## data$am data$vs data$carb
## 1.067 1.575 1.535
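The VIF values above can be reproduced without the car package, which may help in reading them. For predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the other predictors in the model. The following sketch uses only base R; `r2_am`, `vif_am`, `r2_vs`, and `vif_vs` are names introduced here.

```r
# Model 1: regress am on the other predictor (hp) and convert R^2 to a VIF.
r2_am  <- summary(lm(am ~ hp, data = mtcars))$r.squared
vif_am <- 1 / (1 - r2_am)

# Model 3: regress vs on the other predictors (am and carb).
r2_vs  <- summary(lm(vs ~ am + carb, data = mtcars))$r.squared
vif_vs <- 1 / (1 - r2_vs)

print(round(c(am_in_lm1 = vif_am, vs_in_lm3 = vif_vs), 3))
```

A VIF of 1 means a predictor is uncorrelated with the others in the model, and larger values mean more of its variance is explained by them.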
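# Collect AIC and BIC for the two candidate models (name chosen to avoid masking base::table)
criteria <- cbind(c(AIC(lm1),AIC(lm3)),c(BIC(lm1),BIC(lm3)))
rownames(criteria) <- c("lm1","lm3")
colnames(criteria) <- c("AIC","BIC")
criteria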
## AIC BIC
## lm1 164 169.9
## lm3 166 173.4