Regression Models - Course Project

library(ggplot2)      # Plots
library(RColorBrewer) # Plots
library(car)          # VIF
library(corrplot)     # Correlation plotting
library(dplyr)        # Data Wrangling
library(sandwich)     # Confidence interval

Executive Summary

The preferred model has high confidence that a manual car has higher mpg than an automatic. At 95% confidence it estimates the improvement between 3.2 and 8.7 mpg more.

In an adjusted plot it is also shown that for cars with a higher quarter mile time (ie. slower acceleration) the difference increases. Meaning that if you buy a car with fast acceleration it doesn’t matter to much whether you buy a manual or automatic, but when you buy slow car you are much better of with a manual when it comes to miles per gallon.

Exploratory data analyses

data("mtcars")
RawData <- mtcars

The mtcars dataset has 32 observations of 11 variables. A small, but luckily, complete dataset. The variables’ names and classes are all numeric. Five of the classes are better represented as factor variables, including transmission, which we are interested in.

dat <- RawData
dat$cyl <- as.factor(dat$cyl)
dat$vs <- as.factor(dat$vs)
dat$am <- as.factor(dat$am)
dat$gear <- as.factor(dat$gear)
dat$carb <- as.factor(dat$carb)

We would like to model the effect of transmissions on mpg. Let’s start with a quick boxplot. See Figure 1 in the appendix.

Model Selection

The predictor is a continuous variable, hence we do not want to use a logistic regression or poisson regression. A linear regression is preferred over other complicated regression models because it has interpretable coefficients.

The first model to explore will be single variable linear regression Linear regression will fit a straight line that goes through the two means that are the subsets of mpg with respect to automatic and manual transmission.

fit0 <- lm(formula = mpg ~ am, data = dat)
summary(fit0)$coef

##              Estimate Std. Error   t value                Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 0.000000000000001133983
## am1          7.244939   1.764422  4.106127 0.000285020743935065980

The intercept is the mean of mpg of automatic cars because automatic has the x-axis value of 0. The slope is the subtraction of the mean of mpg of manual with respect to automatic cars. Hence, intercept plus slope equals mean of mpg of manual cars.

We probably want to include more variables in the model, because omitting variables results in bias in the coeficients of interest - unless their regressors are uncorrelated with the omitted ones. However, including any new variables increases (actual, not estimated) standard errors of other regressors. So we don’t want to idly throw variables into the model.

Let’s explore correlation and variation inflation. For a great visual representation of the correlation see figure 2 in the appendix. The variation inflation factor of a linear model that takes all variables into account.

vif(lm(mpg ~ .,data = RawData))

##       cyl      disp        hp      drat        wt      qsec        vs 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873 
##        am      gear      carb 
##  4.648487  5.357452  7.908747

High correlation and vif means we have to take care with picking the variables.

Let’s produce some plots that plot the effect of Adjustment. In figure 3 the adjustment plots show the linear regression of transmission adjusted for the continuous variables qsec, hp, drat, wt and disp. For drat, wt and disp we find similar slopes and a sort of continuation in the data. Ie. most of the automatic cars dominate half of the x-axis and the manual cars the other half. Hence the models do not split the data very well. In the first to plots for qsec and hp there exists a clear separation across the x-axis.

Use annova() to determine the significance of adding other regressors than qsec. Print the significance codes:

fit1 <- lm(mpg ~ am + qsec,data = dat)
fit2 <- lm(mpg ~ am + qsec + hp,data = dat)
fit3 <- lm(mpg ~ .,data = dat)
anova(fit0, fit1, fit2, fit3)$heading[2]

## NULL

anova(fit0, fit1, fit2, fit3)[6]

##    Pr(>F)    
## 1            
## 2 0.00001 ***
## 3 0.00185 ** 
## 4 0.40480    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The significance is relatively high for adding hp, but we ignore any other variables. Hence we will stic with the linear model with regressors am, qsec and hp.

Diagnostics

With the hatvalues() and dfbetas() functions an outlier can be identified. Calculate the hatvalues for the model with regressors am, hp and qsec.

min(round(hatvalues(fit2),3))

## [1] 0.056

max(round(hatvalues(fit2),3))

## [1] 0.477

The minimum and maximum are within an order of magnitude Using dfbetas() we find that the maximum and minimum are in the same order as the standard deviation for each coefficient, so no outliers here either.

m <- round(dfbetas(fit2),3)
min <- apply(m,2,min)
max <- apply(m,2,max)
sd <- apply(m,2,sd)
rbind(min=min,max=max,sd=sd)

##     (Intercept)        am1       qsec         hp
## min  -0.4840000 -0.4320000 -0.2950000 -0.3040000
## max   0.3020000  0.6180000  0.4900000  0.7780000
## sd    0.1852825  0.2138005  0.1835779  0.1949891

Residual plot

The residual plot in figure 4 show a random behaviour in all three regressors plotted. Hence, we can not spot an obvious underlining trend that our model is not accounting for.

Confidence and inference

The coefficients of our model and its 95% confidence intervals are:

round(cbind(summary(fit2)$coef,confint(fit2)),2)

##             Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
## (Intercept)    16.63      11.17    1.49     0.15 -6.25  39.51
## am1             5.98       1.34    4.47     0.00  3.24   8.72
## qsec            0.46       0.51    0.90     0.38 -0.59   1.51
## hp             -0.05       0.01   -3.66     0.00 -0.08  -0.02

The intercept has a large uncertainty, that is due to most of the data lying relatively far off from the y-axis (hp and qsec are much greater than zero), hence a small incertainty in slope has a large effect on intercept.

The model has high confidence that a manual car has higher mpg than an automatic. At 95% confidence it estimates the improvement between 3.2 and 8.7 mpg more.

The quater mile time (qsec) has relatively likely a positive slope and horse power (hp) has very likely a negative slope. We can safetely conclude that for cars with higher power/lower acceleration the mpg goes down. However, as we could find in the qsec adjusted plot the difference between mpg for automatic and manual cars increases for slower cars!

Appendix For who-ever is interested in the full code. Figure 1: Boxplot

g <- ggplot(dat, aes(x = am, y = mpg, colour = am))
g <- g + scale_color_brewer(palette = "Set1")
g <- g + geom_boxplot()
g <- g + geom_point(size = 3)
g <- g + labs(title = "Boxplot for MPG versus transmissions", x = "Automatic and Manual",y = "Miles per Gallon" )
g

Figure 2: Correlation plot

corrplot(cor(RawData), method = "circle")

adjustPlot <- function(dF, ind, reg1, reg2){

  hl <- tapply(ind, reg2, mean)
  
  fit1 <- lm(ind ~ reg1 * reg2, data = dF)
  g <- ggplot(dF, aes(x = reg1, y = ind, colour = reg2))
  g <- g + geom_point(size = 2)
  g <- g + geom_abline(intercept = coef(fit1)[1], slope = coef(fit1)[2], size = 1, colour = "coral2")
  g <- g + geom_abline(intercept = coef(fit1)[1] + coef(fit1)[3], slope = coef(fit1)[2] + coef(fit1)[4], size = 1, colour = "cyan3")
  g <- g + geom_abline(intercept = hl[1], slope = 0, colour = "coral2" ,size = 1, linetype="dotted")
  g <- g + geom_abline(intercept = hl[2], slope = 0, colour = "cyan3", size = 1, linetype="dotted")
  g
}

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Figure 3: Adjust plot

g1 <- adjustPlot(dat, dat$mpg, dat$qsec, dat$am)
g2 <- adjustPlot(dat, dat$mpg, dat$hp, dat$am)
g3 <- adjustPlot(dat, dat$mpg, dat$drat, dat$am)
g4 <- adjustPlot(dat, dat$mpg, dat$wt, dat$am)
g5 <- adjustPlot(dat, dat$mpg, dat$disp, dat$am)
multiplot(g1,g2,g3,g4,g5, cols = 2)

This adjustment plot plots the linear regression with predictor mpg and regressors qsec, hp, drat, wt and disp (on x-axis in order filling column 1 then column 2 of the graph) combined in multi-regression with am and their squared term. The solid lines are the regression lines and the dotted lines the means for automatic (0) and manual (1) transmission.

Figure 4: Residual plot

e <- resid(fit2)
dF <- data.frame(qsec = dat$qsec, y = e, am = dat$am, hp = dat$hp)
g <- ggplot(dF, aes(x=qsec, y=y, colour = hp))
g <- g + geom_hline(yintercept = 0, size = 1)
g <- g + geom_point(size = 4, aes(shape = am))
g <- g + labs(title = "Residual plot", y="Residual", x="qsec")
g

Regression Models - Course Project

Jiddu Alexander

February 28, 2016