mtcars data set and regression methods

Introduction

We are looking at the mtcars data set, which is built-in to R and attempting to find if 1) automatic or manual transmission is better for mpg and 2) the difference in mpg between automatic and manual.

Using logistic & linear regression, we show that there is a significant difference in mpg between transmission types.

Setup & Exploratory Analysis

First load the data. Checking \(str(mtcars)\) will show us the data types. Then check if our dataset contains any NAs.

library(ggplot2)
library(corrplot)
data(mtcars)

corr_mt = cor(mtcars) 
colSums(is.na(mtcars))

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0

A correlation plot given in Appendix 1 shows several of the predictors cyl are highly correlated with each other, which may introduce the problem of multicollinearity.

There is no missing data.

19 of the 32 cars in the dataset are automatic:

mtcars$am = factor(mtcars$am, labels=c("Automatic", "Manual"))
table(mtcars$am)

## 
## Automatic    Manual 
##        19        13

Inference

Using a t-test, we see there is significant difference between mpg for transmission types (p-value of 0.001).

t.test(mpg ~ am, data=mtcars)

## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

The p-value is <0.05; we reject that there is no difference in mpg between transmission types.

A boxplot also shows this relationship, shown in Appendix 2.

Model selection

Following is a simple linear model with only 1 predictor, a model using all predictors, and a model determined by backward elimination, which starts with all predictors and iteratively removes the least useful predictor each step, based on minimizing AIC.

fit_simple = lm(mpg~am, data=mtcars) 
summary(fit_simple)$r.squared

## [1] 0.3597989

fit_all = lm(mpg~., data=mtcars)
fit_step = step(fit_all, direction="back", trace=0)

AIC(fit_simple, fit_all, fit_step)

##            df      AIC
## fit_simple  3 196.4844
## fit_all    12 163.7098
## fit_step    5 154.1194

Looking at the R-squared of the simple model, we see <40% of the variance is explained. So we should add other variables to the model. Next, we add all predictors to the model, and we have a much higher r-squared, but as shown in the above correlation plot, the model suffers from multicolinearity. The third model includes only three predictors (wt + qsec + am) instead of all 10 and has an only slightly higher r-squared value. Note that am has the highest p-value/is least significant. Adding more predictors may increase variation in the model.

Following, we show a logistic regression model that classifies transmission type based on mpg

fit_log_step = step(glm(am ~  mpg, family=binomial, data=mtcars, trace=0))

## Start:  AIC=33.68
## am ~ mpg
## 
##        Df Deviance    AIC
## <none>      29.675 33.675
## - mpg   1   43.230 45.230

num_correctly_classified = sum( round(predict(fit_log_step,type="response")) == mtcars$am) 
num_correctly_classified / dim(mtcars)[1]

## [1] 0

This classification accuracy is 75%. Increasing mpg increases the probability that a car has a manual transmission, assuming the number of mpg is independently normally distributed for each transmission type.

ANOVA

anova(fit_simple, fit_step)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the simple and step models are significantly different.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Mpg and Transmission Type

Below, we can see that in our model with confounders transmission type, weight, and 1/4 mile time, manual transmission type cars have 2.93 higher mpg than auotmatic transmission type cars.

summary(fit_step)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Uncertainty

The uncertainty of this model is as follows:

confint(fit_step)

##                   2.5 %    97.5 %
## (Intercept) -4.63829946 23.873860
## wt          -5.37333423 -2.459673
## qsec         0.63457320  1.817199
## amManual     0.04573031  5.825944

Appendix

Appendix 1: Correlation Plot

corrplot(corr_mt, method="color", addCoef.col="black", number.cex=0.5, type="lower")

Appendix 2: Transmission Type Boxplot

p = ggplot(mtcars, aes(am,mpg)) 
p + geom_boxplot() + geom_point()

Appendix 3: Model Residuals

par(mfrow=c(2,2))
plot(fit_step)

The Residuals vs Fitted and Scale location show if there is a trend to the residuals. R’s lm uses ordinary least squares, which require residuals to be independently distributed, so their distribution doesn’t change for different input values.

Residuals vs. Fitted: This plot type can help detect non-linearity, but we don’t have this problem. If there had been a parabola or other shape in the plot, we would know there is a non-linear relationship not explained by the model. However, we can see some outliers, which may influence the \(\beta\) values.

Scale Location: In this plot we can see that the square root of the standardized residuals as a function of the fitted values spread wider and wider, so the residuals are not completely randomly spread along the ranges of predictors. This may mean there is not completely equal variance. This will not influence the \(\beta\) estimates, but the \(p\)-values are less meaningful.

Normal Q-Q: Because the points are near the line, we can see the residuals are normally distributed, which is what we’re looking for.

Residuals vs Leverage: We can see some outliers that are influencing the fit. We can confirm this, using r’s hatvalues() and betas() functions. hatvalues() will show us how much each observation \(X_i\) has the most leverage in the fit. betas() will show us which influence the coefficients \(\beta\).

head(sort(hatvalues(fit_step), decreasing=TRUE), 4)

##            Merc 230 Lincoln Continental   Chrysler Imperial 
##           0.2970422           0.2642151           0.2296338 
##  Cadillac Fleetwood 
##           0.2270069

head(dfbetas(fit_step),4)

##                  (Intercept)           wt        qsec    amManual
## Mazda RX4      -0.0349774936 -0.006599829  0.04974726 -0.08284844
## Mazda RX4 Wag   0.0319133284 -0.058920738 -0.01222537 -0.11093933
## Datsun 710      0.1889625607 -0.069619988 -0.21499454 -0.29857502
## Hornet 4 Drive -0.0001293684 -0.021170451  0.01677960 -0.03323132