Overview

The report aims at analyzing the data set of a collection of cars for exploring the relationship between a set of variables and miles per gallon (MPG). The report addresses the following questions:
1. Is an automatic or manual transmission better for MPG
2. Quantify the MPG difference between automatic and manual transmissions

Analysis is carried out using R package

library(knitr)
# Setting the global options.
opts_chunk$set(fig.width=6, fig.height=4, warning=FALSE)

Basic Exploratory Data Analyses of MTCARS Dataset

A data frame with 32 observations on 11 variables.

[, 1] mpg Miles/(US) gallon [, 2] cyl Number of cylinders [, 3] disp Displacement (cu.in.) [, 4] hp Gross horsepower [, 5] drat Rear axle ratio [, 6] wt Weight (lb/1000) [, 7] qsec 1/4 mile time [, 8] vs V/S [, 9] am Transmission (0 = automatic, 1 = manual) [,10] gear Number of forward gears [,11] carb Number of carburetors

library(datasets)
data(mtcars)
mtcars = mtcars[complete.cases(mtcars),]
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
suppressMessages(library(dplyr))
mtcars <- mutate(mtcars, txtype=ifelse(!am, "auto", "manual"))
par(mfrow = c(1, 2))
boxplot(mpg~txtype*cyl, data=mtcars, notch=FALSE, 
        col = (c("lightblue","lightgreen")), main = "Motor Trend Car Dataset", 
        xlab = "Transmission-Type . Number of Cylinders", ylab = "Miles / (US) gallon")
boxplot(mpg~txtype, data=mtcars, notch=FALSE, 
        col = (c("lightblue","lightgreen")), main = "Motor Trend Car Dataset", 
        xlab = "Transmission-Type", ylab = "Miles / (US) gallon")

mtcars %>% group_by(am, cyl) %>% summarize(mpg = mean(mpg, na.rm = TRUE))
## Source: local data frame [6 x 3]
## Groups: am
## 
##   am cyl      mpg
## 1  0   4 22.90000
## 2  0   6 19.12500
## 3  0   8 15.05000
## 4  1   4 28.07500
## 5  1   6 20.56667
## 6  1   8 15.40000

Provide a basic summary of the data.

Included figures highlight the means we are comparing.

str(mtcars)
## 'data.frame':    32 obs. of  12 variables:
##  $ mpg   : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl   : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp  : num  160 160 108 258 360 ...
##  $ hp    : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat  : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt    : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec  : num  16.5 17 18.6 19.4 17 ...
##  $ vs    : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am    : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear  : num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb  : num  4 4 1 1 2 1 4 2 2 4 ...
##  $ txtype: chr  "manual" "manual" "manual" "auto" ...

Linear Regression Model

Lets carry-out classical hypothesis testing that is concerned with deciding between two decisions. The first, a null hypothesis is usually labeled, H0 and this is what we assume by default. The alternative or research hypothesis is what we require evidence to conclude. This hypothesis is usually labeled, Ha or sometimes H1 (or some other number other than 0). So to reiterate, the null hypothesis is assumed true and statistical evidence is required to reject it in favor of a research or alternative hypothesis.

Residuals, variation, diagnostics

Let’s fit the linear model that include all the terms. Including a residual plot, can be performed with the plot function on the output of a lm fit.

par(mfrow = c(2, 2))
fit <- lm(mpg ~ . , data = mtcars); plot(fit)

1st-Plot: Consider the upper left hand plot of the residuals versus the fitted values. Often, a horiztonal reference line at 0 is drawn since (whenever an intercept is included) the residuals must sum to 0 and so will lie above and below the zero. There doesn’t seem to be any systematic patters or large outlying observations; in this plot, the variance does look okay i.e. neither increasing nor decreasing.

3rd Plot: Scale-location plot is plotting residuals in the standardized form so that it has comparable scale across the experiments and scale look similar to T-like statistics.

4th Plot: Leverage measures (hat values) plot does show that we have few points (far from the center of axis) that has large leverage and hence high influence on regression line (residual).

summary(lm(mpg ~ . , data = mtcars))$coefficients
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## am           2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

So lets focus on -0.11144, this number is interpreted as the following:

  • Our model estimates an expected 0.11 decrease in the standarized miles/gallon (MPG) for every 1% increase in the percentage of the cylinder size in cars (holding the remaining variables constant).
  • Similarly the model estimate an expected 2.53 increase in the standarized miles/gallon (MPG) for choosing automatic transmission-type vehicles (holding the remaining variables constant).

Linear regression using transmission-type and cylinder size

Lets go through some other models and see look into how the process of model selection changes our estimates. Now lets contrast the model so that it just has transmission type in it.

summary(lm(mpg ~ am , data = mtcars))$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04

What is interesting is that transmission-type impact has gone much higher (i.e. 2.52 to 7.24). Notice from the T-test that it is strongly statistically significant (unlike before).

summary(lm(mpg ~ cyl , data = mtcars))$coefficients
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 37.88458  2.0738436 18.267808 8.369155e-18
## cyl         -2.87579  0.3224089 -8.919699 6.112687e-10

Similar is true with the size of the cylinder.

summary(lm(mpg ~ am + cyl + hp + wt , data = mtcars))
## 
## Call:
## lm(formula = mpg ~ am + cyl + hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4765 -1.8471 -0.5544  1.2758  5.6608 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.14654    3.10478  11.642 4.94e-12 ***
## am           1.47805    1.44115   1.026   0.3142    
## cyl         -0.74516    0.58279  -1.279   0.2119    
## hp          -0.02495    0.01365  -1.828   0.0786 .  
## wt          -2.60648    0.91984  -2.834   0.0086 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.8267 
## F-statistic: 37.96 on 4 and 27 DF,  p-value: 1.025e-10

From the above selection approach shows that variance coverage is 85% (approx.). Also it won’t be wrong to conclude that cars of manual transmission type has around 1.5 (appox.) mpg higher than its counter-part (i.e. automatic type).

Covariate or nested model selection

Now finally lets use nested models. If the models of interest are nested and without lots of parameters differentiating them, it’s fairly uncontroversial to use nested likelihood ratio tests for model selection. Consider the following example:

fit1 <- lm(mpg ~ am, data = mtcars)
fit3 <- update(fit, mpg ~ am + cyl)
fit5 <- update(fit, mpg ~ am + cyl + hp + wt)
anova(fit1, fit3, fit5)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl
## Model 3: mpg ~ am + cyl + hp + wt
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 271.36  1    449.53 71.3976 4.619e-09 ***
## 3     27 170.00  2    101.36  8.0496  0.001812 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice how the three models are nested. That is, Model 3 contains all of the Model 2 variables which contains all of the Model 1 variables. The P-values are for a test of whether all of the new variables are all zero or not (i.e. whether or not they’re necessary). So this model would conclude that all of the added Model 3 terms are necessary over Model 2 and all of the Model 2 terms are necessary over Model 1. So, unless there were some other compelling reasons, we’d pick Model 3.

Conclusions and the Assumptions

Assumption(s)

  1. The residual errors are normally distributed

Conclusion(s)

  • By looking at F-statistics of nested model approach it is evident that the inclusion on of the cylinder size to the transmission type was necessary. Further it also suggested that the includsion of additional variables like horse-power and weight were also necessary.
  • The boxplot shows that there are no outliers in the given dataset. On an average miles/gallon (MPG) for manual transmission type is higher w.r.t. automatic. Also, in general (irrespective of number of cylinders) manual transmission type gives better MPG w.r.t. automatic type.
  • Cars of manual transmission type has around 1.5 (appox.) mpg higher than automatic type.