Task - Analysis of the MPG difference between transmissions

Executive Summary

This report explores the relationship between several variables of the mtcars data set and the response variable “miles per gallon (MPG)”. We are particularly interested in two questions. First, is an automatic or manual transmission better for MPG? Second, how different is the MPG between automatic and manual transmissions? The exploration and analysis of the mtcars data suggest that, among all variables, “Weight” and “Number of cylinders” have a significant impact in quantifying the difference in MPG between automatic and manual transmission cars.

Is an automatic or manual transmission better for MPG?

Exploratory data analysis - dataset “mtcars” (32 obs. on 11 variables; see Appendix A)

##              Mean      St.Dev    Count
## automatic   17.15       3.834       19
## manual      24.39       6.167       13
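The report does not show the code behind this table. A minimal sketch of how such a group summary could be computed from mtcars follows; the names `transmission`, `mpg_by_group` and `group_summary` are our own, and SD denotes the sample standard deviation within each group.

```r
# Group-wise summary of mpg by transmission type (am: 0 = automatic, 1 = manual).
data(mtcars)
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))
mpg_by_group <- split(mtcars$mpg, transmission)

# One row per transmission type: mean, standard deviation and group size.
group_summary <- data.frame(
  Mean  = sapply(mpg_by_group, mean),
  SD    = sapply(mpg_by_group, sd),
  Count = sapply(mpg_by_group, length)
)
print(round(group_summary, 3))
```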

Two Samples T-Test

Given the difference of the means and the visualization of the boxplots, we use a two-sample t-test (the Welch version, as the fractional degrees of freedom indicate) to check whether automatic cars have a lower MPG than manual cars.

##   conf.lev t.stat    df p.value lower.ci upper.ci aut.mean man.mean
##       0.90 -3.767 18.33 0.00137   -10.58   -3.913    17.15    24.39
##       0.95 -3.767 18.33 0.00137   -11.28   -3.210    17.15    24.39
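The test behind this table can be sketched as follows. We assume the default Welch variant of `t.test` (unequal variances); `welch_rows` and `welch_table` are illustrative names that do not appear in the report.

```r
data(mtcars)

# am is coded 0 = automatic, 1 = manual, so the estimated difference
# is "automatic minus manual".
welch_rows <- lapply(c(0.90, 0.95), function(level) {
  tt <- t.test(mpg ~ am, data = mtcars, conf.level = level)  # Welch by default
  data.frame(conf.lev = level,
             t.stat   = unname(tt$statistic),
             df       = unname(tt$parameter),
             p.value  = tt$p.value,
             lower.ci = tt$conf.int[1],
             upper.ci = tt$conf.int[2],
             aut.mean = unname(tt$estimate[1]),
             man.mean = unname(tt$estimate[2]))
})
welch_table <- do.call(rbind, welch_rows)
print(welch_table, digits = 4)
```

A confidence interval that lies entirely below zero means that the hypothesis of equal means is rejected at that level.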

Appendix B shows the normality check for “Miles per gallon (MPG)”. At both confidence levels, 0.90 and 0.95, the confidence interval for the difference of the means excludes zero (p-value 0.00137), so we reject the hypothesis that the means for automatic and manual cars are equal. The difference between automatic and manual cars is significant. However, the t-test ignores all other variables that may influence MPG, so the statement is not robust. Let's fit a multiple linear regression.

How different is the MPG between automatic and manual transmission?

Exploratory data analysis (see Appendix C)

After a first look at the AIC statistic and the changes of the estimates in the table of Appendix C, three variables are of interest: wt, am and qsec. Including wt and hp in a regression equation also makes sense intuitively: heavier cars and cars with more horsepower should have lower MPG.

Simple Linear Regression (see Appendix D)

First, let's look at a simple linear model with the variable “t_am” and the response “mpg”.
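The models in this report use a data frame `my_mtcars` with a factor column `t_am`, but their construction is not shown. The following is only an assumed reconstruction, chosen so that the coefficient name `t_ammanual` in the output is reproduced.

```r
data(mtcars)

# Assumption: my_mtcars is a copy of mtcars with a labelled transmission factor.
# The report itself does not show how my_mtcars or t_am were created.
my_mtcars <- mtcars
my_mtcars$t_am <- factor(my_mtcars$am, levels = c(0, 1),
                         labels = c("automatic", "manual"))

# "automatic" is the first level and therefore the reference category in lm().
table(my_mtcars$t_am)
```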

my_cars.lm1 <- lm( mpg ~ t_am, data = my_mtcars )
sum.lm1     <- summary( my_cars.lm1 )
conf.lm1    <- confint( my_cars.lm1, level=0.95 )
drop.all1   <- drop1( my_cars.lm1 )

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -9.3923 -3.0923 -0.2974  0.0000  3.2439  9.5077

##             Estim. Std.Err  t.val  Pr(>|t|) r.sqrd   2.5% Sum of Sq    RSS
## (Intercept) 17.147   1.125 15.247 1.134e-15 0.3598 14.851        NA  720.9
## t_ammanual   7.245   1.764  4.106 2.850e-04 0.3598  3.642     405.2 1126.0

We do not gain much more information from this model than from our hypothesis test. Interpreting the coefficient and the intercept, we say that, on average, automatic cars achieve 17.147 MPG and manual transmission cars achieve 7.245 MPG more. In addition, the “r.sqrd” value is 0.3598, which means that the model explains only 35.98% of the variance.
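The interpretation above can be checked directly: with a single two-level factor as regressor, the intercept is the mean of the reference group and the slope is the difference of the two group means. The sketch below uses `factor(am)` on the raw mtcars data instead of the report's `t_am`; the variable names are ours.

```r
data(mtcars)

fit         <- lm(mpg ~ factor(am), data = mtcars)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)  # names "0" and "1"

intercept <- unname(coef(fit)[1])  # mean MPG of automatic cars (am = 0)
slope     <- unname(coef(fit)[2])  # manual mean minus automatic mean
r_squared <- summary(fit)$r.squared

# With one 0/1 regressor, R^2 equals the squared correlation of mpg and am.
print(c(intercept = intercept, slope = slope, r_squared = r_squared,
        cor_squared = cor(mtcars$mpg, mtcars$am)^2))
```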

Multiple Linear Regression (see Appendix E)

Let's look at the relationship between MPG and the variables in the following models.

M1 <- lm(mpg ~ t_am, data=my_mtcars )
M2 <- lm(mpg ~ t_am+I(wt-mean(wt)), data=my_mtcars )
M3 <- lm(mpg ~ t_am+I(wt-mean(wt))+I(qsec-mean(qsec)), data=my_mtcars )
M4 <- lm(mpg ~ t_am+I(wt-mean(wt))+I(cyl-mean(cyl))+I(hp-mean(hp)),data=my_mtcars )
tab_anova <- anova(M1, M2, M3, M4)

print("Analysis of Variance Table")

## [1] "Analysis of Variance Table"

print(tab_anova, format="html", digits=3)

##                          Res.Df RSS Df Sum of Sq    F   Pr(>F)
## M1: mpg ~ t_am               30 721 NA        NA   NA       NA
## M2: mpg ~ t_am+wt            29 278  1   442.577 70.3 5.39e-09
## M3: mpg ~ t_am+wt+qsec       28 169  1   109.034 17.3 2.88e-04
## M4: mpg ~ t_am+wt+cyl+hp     27 170  1    -0.712   NA       NA

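The contribution of the variable “hp” is not significant: in the full model of Appendix C its t-value is -0.98684 for the test ( H0: beta 4 = 0 ), and the 95% confidence interval [-0.07, 0.02] of its coefficient contains zero. Moreover, M4 does not reduce the residual sum of squares relative to M3 (170 versus 169); since M3 (with qsec) and M4 (with cyl and hp) are not nested, the table reports no F-test for this step. Therefore M3 is the best-fitting multiple linear model.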
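Because the model with qsec and the model with cyl and hp are not nested, one possible way to compare them is to test each against the common submodel M2 and to compare the two directly by AIC. The sketch below is our suggestion, not the procedure used in the report. It fits the models on the raw mtcars columns, since centering the regressors does not change the residual sum of squares.

```r
data(mtcars)

M2 <- lm(mpg ~ am + wt, data = mtcars)
M3 <- lm(mpg ~ am + wt + qsec, data = mtcars)      # adds qsec
M4 <- lm(mpg ~ am + wt + cyl + hp, data = mtcars)  # adds cyl and hp instead

# Nested F-tests: each larger model against the common submodel M2.
print(anova(M2, M3))
print(anova(M2, M4))

# Non-nested comparison: the lower AIC indicates the preferred model.
print(AIC(M3, M4))
```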

The variance-inflation factor of the regressor “wt” increases markedly in model M4 compared with model M3 (from 2.483 to 3.988), and the added regressors “cyl” and “hp” have factors of 5.334 and 4.31 (Appendix F).

Model M3 explains 84.97% of the variance (Appendix E). Moreover, we see that wt and qsec did indeed confound the relationship between am and mpg (mostly wt). Reading the coefficient for am in M3, we say that, on average, manual transmission cars achieve 2.94 MPG more than automatic transmission cars at the same weight and quarter-mile time (Appendix E). In the full model of Appendix C, the corresponding difference is 2.52 MPG.
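The centered terms `I(wt - mean(wt))` and `I(qsec - mean(qsec))` in M3 change only the intercept, which becomes the expected MPG of an automatic car with average weight and average quarter-mile time. The slopes, including the transmission effect, are the same as in the uncentered model. A short sketch on the raw mtcars data, using the numeric `am` column in place of `t_am`:

```r
data(mtcars)

centered   <- lm(mpg ~ am + I(wt - mean(wt)) + I(qsec - mean(qsec)),
                 data = mtcars)
uncentered <- lm(mpg ~ am + wt + qsec, data = mtcars)

# Side-by-side coefficients: only the intercept row differs.
print(cbind(centered = coef(centered), uncentered = coef(uncentered)))

# The fit itself is unchanged, so R^2 is identical for both versions.
print(c(centered   = summary(centered)$r.squared,
        uncentered = summary(uncentered)$r.squared))
```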

Regression diagnostics (see Appendix D, E and F)

The residuals of models M1 and M3 appear approximately normally distributed and homoskedastic. Let's have a look at the residual plots of the multiple linear regression model M3 in Appendix E.
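The plots can be complemented by formal checks. The sketch below is not part of the original analysis. It applies the Shapiro-Wilk test to the residuals of M3 and, as a simple check of homoskedasticity, a studentized Breusch-Pagan-type test that regresses the squared residuals on the fitted values. Using the fitted values as the only explanatory variable is our own simplification.

```r
data(mtcars)

M3  <- lm(mpg ~ am + wt + qsec, data = mtcars)
res <- residuals(M3)

# Normality of the residuals (H0: residuals are normally distributed).
sw <- shapiro.test(res)

# Auxiliary regression of the squared residuals on the fitted values.
aux_data <- data.frame(sq_res = res^2, fit = fitted(M3))
aux      <- lm(sq_res ~ fit, data = aux_data)

# Test statistic n * R^2, compared with a chi-squared distribution (1 df).
bp_stat <- nrow(aux_data) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(shapiro_p = sw$p.value, bp_p = bp_p))
```

Large p-values in both tests would be in line with the visual impression from the residual plots, while small p-values would call for a closer look.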

Appendix

Appendix A) Data Description “mtcars”

Source

Henderson and Velleman (1981) Building multiple regression models interactively. Biometrics, 37, 391-411.

A data frame with 32 observations on 11 variables.

* [, 1] mpg Miles/(US) gallon
* [, 2] cyl Number of cylinders
* [, 3] disp Displacement (cu.in.)
* [, 4] hp Gross horsepower
* [, 5] drat Rear axle ratio
* [, 6] wt Weight (1000 lbs)
* [, 7] qsec 1/4 mile time
* [, 8] vs V/S
* [, 9] am Transmission (0 = automatic, 1 = manual)
* [,10] gear Number of forward gears
* [,11] carb Number of carburetors

  print( head(mtcars, 2), format="html", digits=4 )

##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

Appendix B) Check of normality of “Miles per gallon (MPG)”

Let's run the Shapiro-Wilk test and have a look at the Q-Q plot and histogram.

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.9, p-value = 0.1
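The output above is shown without its call. It was presumably produced by something like the following, where `sw_mpg` is our own name. A p-value above the usual 0.05 threshold means that normality of MPG is not rejected.

```r
data(mtcars)

# Shapiro-Wilk test of normality for the response variable.
sw_mpg <- shapiro.test(mtcars$mpg)
print(sw_mpg)

# The decision at the 5% level, stated explicitly.
normality_rejected <- sw_mpg$p.value < 0.05
print(normality_rejected)
```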

## Q-Q-Plot and histogram
   par(mfrow=c(1,2), mar=c(3.8,4,2,2)+0.1, cex=0.6); qqnorm( mtcars$mpg)
   hist( mtcars$mpg, breaks=8, cex=0.6, main="Histogram of Miles/(US) gallon" )

Appendix C) Multiple Regression Model

Table “Multiple Regression Model with Estimate, Std.Err, RSS and AIC”

## define model 
cars.lm  <- lm(mpg ~ factor(am)+wt+cyl+hp+disp+drat+gear+carb+vs+qsec-1, data=mtcars)
sum.lm   <- summary(cars.lm, digi=4)
conf.lm  <- confint(cars.lm, level=0.95)
drop.all <- drop1(cars.lm)

##               Estim.  Std.Err   t.val Pr(>|t|)   2.5% 97.5%   RSS   AIC
## factor(am)0 12.30337 18.71788  0.6573  0.51812 -26.62 51.23 147.5 70.90
## factor(am)1 14.82360 18.35265  0.8077  0.42831 -23.34 52.99 164.6 70.41
## wt          -3.71530  1.89441 -1.9612  0.06325  -7.65  0.22 174.5 74.28
## cyl         -0.11144  1.04502 -0.1066  0.91609  -2.28  2.06 147.6 68.92
## hp          -0.02148  0.02177 -0.9868  0.33496  -0.07  0.02 154.3 70.35
## disp         0.01334  0.01786  0.7468  0.46349  -0.02  0.05 151.4 69.74
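Because this model is fitted without an intercept (`-1`), the two rows `factor(am)0` and `factor(am)1` are group-specific intercepts. The transmission effect is their difference, which equals the `am` coefficient of the same model fitted with an intercept. A brief sketch, with `no_int`, `with_int` and `am_diff` as illustrative names:

```r
data(mtcars)

# Same regressors as in the table, once without and once with an intercept.
no_int   <- lm(mpg ~ factor(am) + wt + cyl + hp + disp + drat + gear + carb +
                 vs + qsec - 1, data = mtcars)
with_int <- lm(mpg ~ ., data = mtcars)

# Manual minus automatic, holding all other variables fixed.
am_diff <- unname(coef(no_int)["factor(am)1"] - coef(no_int)["factor(am)0"])

print(c(difference = am_diff, am_coefficient = unname(coef(with_int)["am"])))
```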

# Basic Scatterplot Matrix
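# panel.cor is not part of base R; this minimal version follows the ?pairs examples
panel.cor <- function(x, y, ...) {
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.5, format(cor(x, y), digits = 2))
}
my_cars <- mtcars[, c("mpg", "wt", "am", "cyl")]
pairs(my_cars,lower.panel=panel.smooth,upper.panel=panel.cor,main="Scatterplot Matrix")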

Appendix D) Simple linear regression - Analysis of Residuals

par(mfrow=c(1,3), mar=c(3.8,4,2,2)+0.1, cex=0.6);
plot(my_cars.lm1, which=c(1,2,4))

Appendix E) Multiple linear regression - Analysis of Residuals

## 
## Call:
## lm(formula = mpg ~ t_am + I(wt - mean(wt)) + I(qsec - mean(qsec)), 
##     data = my_mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           18.8979     0.7194  26.271  < 2e-16 ***
## t_ammanual             2.9358     1.4109   2.081 0.046716 *  
## I(wt - mean(wt))      -3.9165     0.7112  -5.507 6.95e-06 ***
## I(qsec - mean(qsec))   1.2259     0.2887   4.247 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

## Residuals of multiple model M3 
par(mfrow=c(2,2), mar=c(3.8,4,2,2)+0.1, cex=0.6);
plot(M3, which=c(1,2,3,4))

Appendix F) Regression diagnostics

Variance-inflation factors of the multiple linear models

library(car)  # provides vif()
# M3 contains qsec, M4 contains cyl and hp, so each regressor gets its own column
vif_all <- rbind( c(vif(M2), NA, NA, NA),
                  c(vif(M3), NA, NA),
                  c(vif(M4)[1:2], NA, vif(M4)[3:4]) )
colnames(vif_all) <- c( "t_am", "wt", "qsec", "cyl", "hp" )
rownames(vif_all) <- c( "Model 2: mpg ~ t_am + wt",
                        "Model 3: mpg ~ t_am + wt + qsec",
                        "Model 4: mpg ~ t_am + wt + cyl + hp" )
print( vif_all, format="html", digits=4 )

##                                      t_am     wt   qsec    cyl     hp
## Model 2: mpg ~ t_am + wt            1.921  1.921     NA     NA     NA
## Model 3: mpg ~ t_am + wt + qsec     2.541  2.483  1.364     NA     NA
## Model 4: mpg ~ t_am + wt + cyl + hp 2.546  3.988     NA  5.334   4.31
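The variance-inflation factor of a regressor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that regressor on all other regressors of the model. The sketch below computes it in base R without the car package. `vif_by_hand` is a hypothetical helper of our own, and the predictor sets follow the model definitions of M2, M3 and M4 (with the numeric `am` column in place of `t_am`).

```r
data(mtcars)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others.
vif_by_hand <- function(predictors, data) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_m2 <- vif_by_hand(c("am", "wt"), mtcars)
vif_m3 <- vif_by_hand(c("am", "wt", "qsec"), mtcars)
vif_m4 <- vif_by_hand(c("am", "wt", "cyl", "hp"), mtcars)

print(round(vif_m2, 3))
print(round(vif_m3, 3))
print(round(vif_m4, 3))
```

A common rule of thumb treats values above 5 or 10 as a sign of problematic collinearity, so the comparison between models matters more than any single value.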