The report aims at analyzing the data set of a collection of cars for exploring the relationship between a set of variables and miles per gallon (MPG). The report addresses the following questions:
1. Is an automatic or manual transmission better for MPG
2. Quantify the MPG difference between automatic and manual transmissions
Analysis is carried out using R package
library(knitr)
# Setting the global options.
opts_chunk$set(fig.width=6, fig.height=4, warning=FALSE)
A data frame with 32 observations on 11 variables.
[, 1] mpg Miles/(US) gallon [, 2] cyl Number of cylinders [, 3] disp Displacement (cu.in.) [, 4] hp Gross horsepower [, 5] drat Rear axle ratio [, 6] wt Weight (lb/1000) [, 7] qsec 1/4 mile time [, 8] vs V/S [, 9] am Transmission (0 = automatic, 1 = manual) [,10] gear Number of forward gears [,11] carb Number of carburetors
library(datasets)
data(mtcars)
mtcars = mtcars[complete.cases(mtcars),]
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
suppressMessages(library(dplyr))
mtcars <- mutate(mtcars, txtype=ifelse(!am, "auto", "manual"))
par(mfrow = c(1, 2))
boxplot(mpg~txtype*cyl, data=mtcars, notch=FALSE,
col = (c("lightblue","lightgreen")), main = "Motor Trend Car Dataset",
xlab = "Transmission-Type . Number of Cylinders", ylab = "Miles / (US) gallon")
boxplot(mpg~txtype, data=mtcars, notch=FALSE,
col = (c("lightblue","lightgreen")), main = "Motor Trend Car Dataset",
xlab = "Transmission-Type", ylab = "Miles / (US) gallon")
mtcars %>% group_by(am, cyl) %>% summarize(mpg = mean(mpg, na.rm = TRUE))
## Source: local data frame [6 x 3]
## Groups: am
##
## am cyl mpg
## 1 0 4 22.90000
## 2 0 6 19.12500
## 3 0 8 15.05000
## 4 1 4 28.07500
## 5 1 6 20.56667
## 6 1 8 15.40000
Included figures highlight the means we are comparing.
str(mtcars)
## 'data.frame': 32 obs. of 12 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4 ...
## $ txtype: chr "manual" "manual" "manual" "auto" ...
Lets carry-out classical hypothesis testing that is concerned with deciding between two decisions. The first, a null hypothesis is usually labeled, H0 and this is what we assume by default. The alternative or research hypothesis is what we require evidence to conclude. This hypothesis is usually labeled, Ha or sometimes H1 (or some other number other than 0). So to reiterate, the null hypothesis is assumed true and statistical evidence is required to reject it in favor of a research or alternative hypothesis.
Let’s fit the linear model that include all the terms. Including a residual plot, can be performed with the plot function on the output of a lm fit.
par(mfrow = c(2, 2))
fit <- lm(mpg ~ . , data = mtcars); plot(fit)
1st-Plot: Consider the upper left hand plot of the residuals versus the fitted values. Often, a horiztonal reference line at 0 is drawn since (whenever an intercept is included) the residuals must sum to 0 and so will lie above and below the zero. There doesn’t seem to be any systematic patters or large outlying observations; in this plot, the variance does look okay i.e. neither increasing nor decreasing.
3rd Plot: Scale-location plot is plotting residuals in the standardized form so that it has comparable scale across the experiments and scale look similar to T-like statistics.
4th Plot: Leverage measures (hat values) plot does show that we have few points (far from the center of axis) that has large leverage and hence high influence on regression line (residual).
summary(lm(mpg ~ . , data = mtcars))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## vs 0.31776281 2.10450861 0.1509915 0.88142347
## am 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
So lets focus on -0.11144, this number is interpreted as the following:
Lets go through some other models and see look into how the process of model selection changes our estimates. Now lets contrast the model so that it just has transmission type in it.
summary(lm(mpg ~ am , data = mtcars))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
What is interesting is that transmission-type impact has gone much higher (i.e. 2.52 to 7.24). Notice from the T-test that it is strongly statistically significant (unlike before).
summary(lm(mpg ~ cyl , data = mtcars))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.88458 2.0738436 18.267808 8.369155e-18
## cyl -2.87579 0.3224089 -8.919699 6.112687e-10
Similar is true with the size of the cylinder.
summary(lm(mpg ~ am + cyl + hp + wt , data = mtcars))
##
## Call:
## lm(formula = mpg ~ am + cyl + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4765 -1.8471 -0.5544 1.2758 5.6608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.14654 3.10478 11.642 4.94e-12 ***
## am 1.47805 1.44115 1.026 0.3142
## cyl -0.74516 0.58279 -1.279 0.2119
## hp -0.02495 0.01365 -1.828 0.0786 .
## wt -2.60648 0.91984 -2.834 0.0086 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.8267
## F-statistic: 37.96 on 4 and 27 DF, p-value: 1.025e-10
From the above selection approach shows that variance coverage is 85% (approx.). Also it won’t be wrong to conclude that cars of manual transmission type has around 1.5 (appox.) mpg higher than its counter-part (i.e. automatic type).
Now finally lets use nested models. If the models of interest are nested and without lots of parameters differentiating them, it’s fairly uncontroversial to use nested likelihood ratio tests for model selection. Consider the following example:
fit1 <- lm(mpg ~ am, data = mtcars)
fit3 <- update(fit, mpg ~ am + cyl)
fit5 <- update(fit, mpg ~ am + cyl + hp + wt)
anova(fit1, fit3, fit5)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl
## Model 3: mpg ~ am + cyl + hp + wt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 271.36 1 449.53 71.3976 4.619e-09 ***
## 3 27 170.00 2 101.36 8.0496 0.001812 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice how the three models are nested. That is, Model 3 contains all of the Model 2 variables which contains all of the Model 1 variables. The P-values are for a test of whether all of the new variables are all zero or not (i.e. whether or not they’re necessary). So this model would conclude that all of the added Model 3 terms are necessary over Model 2 and all of the Model 2 terms are necessary over Model 1. So, unless there were some other compelling reasons, we’d pick Model 3.