Regression Models

This note is a reorganization of Dr. Brian Caffo's lecture notes for the Coursera course Regression Models.

Module III : Generalized Linear Models

Introduction

The generalized linear model (GLM) was introduced in a 1972 Journal of the Royal Statistical Society paper by Nelder and Wedderburn. It involves three components:

- An exponential family distribution for the response
- A systematic component via a linear predictor
- A link function that connects the mean of the response to the linear predictor

GLMs can be thought of as a two-stage modeling approach. First, we model the response variable using a probability distribution, such as the binomial or Poisson distribution. Second, we model the parameter of the distribution using a collection of predictors and a special form of multiple regression. - OpenIntro

Mathematically, we first assume that \( Y_i \) follows an exponential family distribution; for example, \( Y_i \sim N(\mu_i, \sigma^2) \) (the Gaussian distribution is an exponential family distribution). We then define the linear predictor \( \eta_i = \sum_{k=1}^p X_{ik} \beta_k \) and connect the two through a link function \( g \) with \( g(\mu) = \eta \).

For linear models \( g(\mu) = \mu \) so that \( \mu_i = \eta_i \). This yields the same likelihood model as our additive error Gaussian linear model \( Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i} \) where \( \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2) \).
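To see the equivalence concretely, here is a minimal sketch (using simulated data, so all variable names and values are illustrative) showing that glm with a Gaussian family and identity link reproduces lm:

# A Gaussian GLM with the identity link is the ordinary linear model
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
rbind(lm = coef(lm(y ~ x)), 
    glm = coef(glm(y ~ x, family = gaussian(link = "identity"))))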

Logistic Regression

Logistic regression is a type of GLM for binary outcomes, a setting in which ordinary multiple regression does not work well. Assume that \( Y_i \sim Bernoulli(\mu_i) \) so that \( E[Y_i] = \mu_i \), where \( 0 \leq \mu_i \leq 1 \). The canonical link is the logit, \( g(\mu) = \eta = \log\left(\frac{\mu}{1-\mu}\right) \), the natural log of the odds.

Poisson Regression

Assume that \( Y_i \sim Poisson(\mu_i) \) so that \( E[Y_i] = \mu_i \), where \( 0 \leq \mu_i \). The canonical link is the log, \( g(\mu) = \eta = \log(\mu) \).

Summary

In each case, the only way in which the likelihood depends on the data is through \[ \sum_{i=1}^n y_i \eta_i = \sum_{i=1}^n y_i\sum_{k=1}^p X_{ik} \beta_k = \sum_{k=1}^p \beta_k\sum_{i=1}^n X_{ik} y_i \] Thus we do not need the full data, only \( \sum_{i=1}^n X_{ik} y_i \). This simplification is a consequence of choosing so-called canonical link functions.
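For example, in the Bernoulli case with the logit link, the log-likelihood is \[ \sum_{i=1}^n \left\{ y_i \log\left(\frac{\mu_i}{1-\mu_i}\right) + \log(1-\mu_i) \right\} = \sum_{i=1}^n y_i \eta_i + \sum_{i=1}^n \log(1-\mu_i) \] The second sum depends on \( \beta \) only through the \( \mu_i \), not on the data, so the \( y_i \) enter only through \( \sum_{i=1}^n y_i \eta_i \).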

All of the models achieve their maximum at the root of the so-called normal equations \[ 0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i \] where the \( W_i \) are the derivatives of the inverse link function.

The variances in the GLMs are \( Var(Y_i) = \sigma^2 \) (constant) for the Gaussian case, \( Var(Y_i) = \mu_i(1 - \mu_i) \) for the Bernoulli case, and \( Var(Y_i) = \mu_i \) for the Poisson case.

Odds and ends

The normal equations have to be solved iteratively (in practice, by iteratively reweighted least squares), resulting in the estimates \( \hat{\beta}_k \). Predicted linear predictor responses can then be obtained as \( \hat{\eta} = \sum_{k=1}^p X_k \hat{\beta}_k \), and predicted mean responses as \( \hat{\mu} = g^{-1}(\hat{\eta}) \).


Logistic Regression

In this section, we will explore the Baltimore Ravens data. First, let's load it and check its structure:

download.file("https://dl.dropboxusercontent.com/u/7710864/data/ravensData.rda", 
    destfile = "../03_02_binaryOutcomes/data/ravensData.rda", method = "wget")
load("../03_02_binaryOutcomes/data/ravensData.rda")
head(ravensData)
##   ravenWinNum ravenWin ravenScore opponentScore
## 1           1        W         24             9
## 2           1        W         38            35
## 3           1        W         28            13
## 4           1        W         34            31
## 5           1        W         44            13
## 6           0        L         23            24
str(ravensData)
## 'data.frame':    20 obs. of  4 variables:
##  $ ravenWinNum  : num  1 1 1 1 1 0 1 1 1 1 ...
##  $ ravenWin     : Factor w/ 2 levels "L","W": 2 2 2 2 2 1 2 2 2 2 ...
##  $ ravenScore   : num  24 38 28 34 44 23 31 23 9 31 ...
##  $ opponentScore: num  9 35 13 31 13 24 30 16 6 29 ...

Linear Regression

The linear regression can be modeled as \[ RW_i = b_0 + b_1 RS_i + e_i \] where \( RW_i \) is a binary indicator that the Ravens won game \( i \), \( RS_i \) is the number of points the Ravens scored, \( b_0 \) is the probability of a Ravens win if they score 0 points, \( b_1 \) is the increase in the probability of a win for each additional point scored, and \( e_i \) is the residual variation.

lmRavens <- lm(ravensData$ravenWinNum ~ ravensData$ravenScore)
summary(lmRavens)$coef
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)             0.2850   0.256643   1.111  0.28135
## ravensData$ravenScore   0.0159   0.009059   1.755  0.09625

Odds

Binary outcome, taking values 0/1:

\[ RW_i \]

Probability, in \( (0,1) \):

\[ \rm{Pr}(RW_i | RS_i, b_0, b_1 ) \]

Odds, in \( (0,\infty) \):

\[ \frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)} \]

Log odds, in \( (-\infty,\infty) \):

\[ \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) \]

Linear vs. logistic regression

Linear

\[ RW_i = b_0 + b_1 RS_i + e_i \]

or

\[ E[RW_i | RS_i, b_0, b_1] = b_0 + b_1 RS_i \]

Logistic

\[ \rm{Pr}(RW_i | RS_i, b_0, b_1) = \frac{\exp(b_0 + b_1 RS_i)}{1 + \exp(b_0 + b_1 RS_i)} \]

or

\[ \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i \]
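To visualize the S-shaped curve this implies, here is a minimal sketch of the inverse logit, with \( b_0 \) and \( b_1 \) set near the estimates obtained below and a second, purely hypothetical steeper slope for comparison:

# Inverse logit: maps the linear predictor to a probability in (0, 1)
expit <- function(eta) exp(eta)/(1 + exp(eta))
score <- seq(0, 50, length.out = 200)
plot(score, expit(-1.68 + 0.11 * score), type = "l", lwd = 3, col = "blue", 
    xlab = "Score", ylab = "Pr(Win)")
lines(score, expit(-1.68 + 0.25 * score), lwd = 3, col = "red")  # hypothetical steeper slope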

Fitting the Logistic Regression

We fit the log-odds model above in R using glm with family = "binomial":

logRegRavens <- glm(ravensData$ravenWinNum ~ ravensData$ravenScore, family = "binomial")
summary(logRegRavens)
## 
## Call:
## glm(formula = ravensData$ravenWinNum ~ ravensData$ravenScore, 
##     family = "binomial")
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.758  -1.100   0.530   0.806   1.495  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)            -1.6800     1.5541   -1.08     0.28
## ravensData$ravenScore   0.1066     0.0667    1.60     0.11
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24.435  on 19  degrees of freedom
## Residual deviance: 20.895  on 18  degrees of freedom
## AIC: 24.89
## 
## Number of Fisher Scoring iterations: 5
plot(ravensData$ravenScore, logRegRavens$fitted, pch = 19, col = "blue", xlab = "Score", 
    ylab = "Prob Ravens Win")

[Plot: fitted probability of a Ravens win versus the Ravens' score]

Odds ratios and confidence intervals

exp(logRegRavens$coeff)
##           (Intercept) ravensData$ravenScore 
##                0.1864                1.1125
exp(confint(logRegRavens))
## Waiting for profiling to be done...
##                          2.5 % 97.5 %
## (Intercept)           0.005675  3.106
## ravensData$ravenScore 0.996230  1.303
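For reference, confint uses profile likelihood for GLMs; a hand-computed Wald interval from the summary output (a sketch of the usual estimate \( \pm 1.96 \) standard errors, exponentiated to the odds-ratio scale) typically agrees closely:

# Wald interval for the score coefficient on the odds-ratio scale
coefTab <- summary(logRegRavens)$coef
exp(coefTab["ravensData$ravenScore", "Estimate"] + 
    c(-1, 1) * qnorm(0.975) * coefTab["ravensData$ravenScore", "Std. Error"])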

ANOVA for logistic regression

anova(logRegRavens, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: ravensData$ravenWinNum
## 
## Terms added sequentially (first to last)
## 
## 
##                       Df Deviance Resid. Df Resid. Dev Pr(>Chi)  
## NULL                                     19       24.4           
## ravensData$ravenScore  1     3.54        18       20.9     0.06 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can now interpret the odds ratios: the exponentiated score coefficient, about 1.11, says that each additional point the Ravens score is associated with an estimated 11% increase in the odds of winning. The 95% confidence interval (0.996, 1.303) includes 1, however, so the association is not statistically significant at the 5% level.


Poisson Regression

Many data take the form of counts, such as the number of calls to a call center, the number of flu cases in an area, or the number of cars that cross a bridge. The data may also be in the form of rates, such as the percentage of children passing a test or the percentage of hits to a website coming from a particular country.

In all of the above cases, linear regression with a transformed outcome is an option, but Poisson regression often fits more naturally.

First, recall that the Poisson distribution is a useful model for counts and rates (a rate being a count per unit of monitoring time). Example uses of the Poisson distribution include modeling count data, event-time or survival data, and contingency tables, and approximating binomials when \( n \) is large and \( p \) is small.

For the Poisson distribution, \( X \sim Poisson(t\lambda) \) if \[ P(X = x) = \frac{(t\lambda)^x e^{-t\lambda}}{x!}, \quad x = 0, 1, \ldots \] Both the mean and the variance of \( X \) are \( t\lambda \), so that \( E[X / t] = \lambda \). As \( t\lambda \) grows, the distribution looks increasingly normal:

par(mfrow = c(1, 3))
plot(0:10, dpois(0:10, lambda = 2), type = "h", frame = FALSE)
plot(0:20, dpois(0:20, lambda = 10), type = "h", frame = FALSE)
plot(0:200, dpois(0:200, lambda = 100), type = "h", frame = FALSE)

[Plot: Poisson probability mass functions for \( \lambda = 2 \), \( \lambda = 10 \), and \( \lambda = 100 \)]

We can show that the mean and variance of a Poisson distribution are equal (truncating the infinite sums at 10000, which captures essentially all of the mass for \( \lambda = 3 \)):

x <- 0:10000
lambda = 3
mu <- sum(x * dpois(x, lambda = lambda))
sigmasq <- sum((x - mu)^2 * dpois(x, lambda = lambda))
c(mu, sigmasq)
## [1] 3 3
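Equivalently, a quick simulation sketch (the sample values will vary slightly around 3 from run to run):

# Sample mean and variance of Poisson draws are both close to lambda
y <- rpois(10000, lambda = 3)
c(mean(y), var(y))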

Example: Leek Group Website Traffic

As an example, let's consider the daily hit counts to Jeff Leek's web site. Since the unit of time is always one day, we set \( t = 1 \), and the Poisson mean is then interpreted as web hits per day. (If we set \( t = 24 \), it would be web hits per hour.) (http://skardhamar.github.com/rga/)

download.file("https://dl.dropboxusercontent.com/u/7710864/data/gaData.rda", 
    destfile = "../03_03_countOutcomes/data/gaData.rda", method = "wget")
load("../03_03_countOutcomes/data/gaData.rda")
gaData$julian <- julian(gaData$date)
str(gaData)
## 'data.frame':    731 obs. of  4 variables:
##  $ date       : Date, format: "2011-01-01" "2011-01-02" ...
##  $ visits     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ simplystats: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ julian     : atomic  14975 14976 14977 14978 14979 ...
##   ..- attr(*, "origin")= Date, format: "1970-01-01"

If we fit the data with a simple linear regression model

\[ NH_i = b_0 + b_1 JD_i + e_i \]

where \( NH_i \) is the number of hits on day \( i \), \( JD_i \) is the Julian day, and \( e_i \) is the residual variation, we can plot the fitted regression line:

plot(gaData$julian, gaData$visits, pch = 19, col = "darkgrey", xlab = "Julian", 
    ylab = "Visits")
lm1 <- lm(gaData$visits ~ gaData$julian)
abline(lm1, col = "red", lwd = 3)

[Plot: daily visits versus Julian day, with the fitted linear regression line in red]

As an aside, taking the natural log of the outcome gives the coefficients a specific interpretation. Consider the model

\[ \log(NH_i) = b_0 + b_1 JD_i + e_i \]

Exponentiating the coefficients then yields multiplicative effects: \( e^{b_1} \) estimates the relative increase in the geometric mean of daily hits per one-day increase in Julian day (1 is added to the outcome before taking the log to handle days with zero visits):

round(exp(coef(lm(I(log(gaData$visits + 1)) ~ gaData$julian))), 5)
##   (Intercept) gaData$julian 
##         0.000         1.002

Linear vs. Poisson regression

Linear

\[ NH_i = b_0 + b_1 JD_i + e_i \]

or

\[ E[NH_i | JD_i, b_0, b_1] = b_0 + b_1 JD_i \]

Poisson/log-linear

\[ \log\left(E[NH_i | JD_i, b_0, b_1]\right) = b_0 + b_1 JD_i \]

or

\[ E[NH_i | JD_i, b_0, b_1] = \exp\left(b_0 + b_1 JD_i\right) \]

Multiplicative differences \[ E[NH_i | JD_i, b_0, b_1] = \exp\left(b_0 + b_1 JD_i\right) \] \[ E[NH_i | JD_i, b_0, b_1] = \exp\left(b_0 \right)\exp\left(b_1 JD_i\right) \]

If \( JD_i \) is increased by one unit, \( E[NH_i | JD_i, b_0, b_1] \) is multiplied by \( \exp\left(b_1\right) \)

par(mfrow = c(1, 2))
plot(gaData$julian, gaData$visits, pch = 19, col = "darkgrey", xlab = "Julian", 
    ylab = "Visits")
glm1 <- glm(gaData$visits ~ gaData$julian, family = "poisson")
abline(lm1, col = "red", lwd = 3)
lines(gaData$julian, glm1$fitted, col = "blue", lwd = 3)
plot(glm1$fitted, glm1$residuals, pch = 19, col = "grey", ylab = "Residuals", 
    xlab = "Fitted", main = "Mean-variance relationship")

[Plot: linear (red) and Poisson (blue) fits to the visit counts; Poisson residuals versus fitted values, illustrating the mean-variance relationship]
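Because the model is log-linear, exponentiating the slope of glm1 gives the estimated multiplicative change in expected daily visits per one-day increase in Julian day:

# exp(b1) is the multiplicative change in E[visits] per unit change in julian
exp(coef(glm1))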

Model agnostic standard errors

If the Poisson assumption that the variance equals the mean fails, model-agnostic (sandwich) standard errors still give asymptotically valid intervals:

library(sandwich)
# Wald confidence intervals based on the robust ("sandwich") variance
# estimator rather than the model-based one
confint.agnostic <- function(object, parm, level = 0.95, ...) {
    cf <- coef(object)
    pnames <- names(cf)
    if (missing(parm)) 
        parm <- pnames else if (is.numeric(parm)) 
        parm <- pnames[parm]
    a <- (1 - level)/2
    a <- c(a, 1 - a)
    pct <- stats:::format.perc(a, 3)
    fac <- qnorm(a)  # normal quantiles for the interval endpoints
    ci <- array(NA, dim = c(length(parm), 2L), dimnames = list(parm, pct))
    ses <- sqrt(diag(sandwich::vcovHC(object)))[parm]  # robust standard errors
    ci[] <- cf[parm] + ses %o% fac
    ci
}

Source: http://stackoverflow.com/questions/3817182/vcovhc-and-confidence-interval

Estimating confidence intervals

confint(glm1)
## Waiting for profiling to be done...
##                   2.5 %     97.5 %
## (Intercept)   -34.34658 -31.159716
## gaData$julian   0.00219   0.002396
confint.agnostic(glm1)
##                    2.5 %     97.5 %
## (Intercept)   -36.362675 -29.136997
## gaData$julian   0.002058   0.002528

Rates

We can also model rates: here, the proportion of the site's hits that come from Simply Statistics. Writing \( NHSS_i \) for the number of hits from Simply Statistics on day \( i \), \[ E[NHSS_i | JD_i, b_0, b_1]/NH_i = \exp\left(b_0 + b_1 JD_i\right) \] \[ \log\left(E[NHSS_i | JD_i, b_0, b_1]\right) - \log(NH_i) = b_0 + b_1 JD_i \] \[ \log\left(E[NHSS_i | JD_i, b_0, b_1]\right) = \log(NH_i) + b_0 + b_1 JD_i \] so \( \log(NH_i) \) enters the model as a fixed offset; the code below uses \( \log(NH_i + 1) \) to handle days with zero visits.

par(mfrow = c(1, 2))
glm2 <- glm(gaData$simplystats ~ julian(gaData$date), offset = log(visits + 
    1), family = "poisson", data = gaData)
plot(julian(gaData$date), glm2$fitted, col = "blue", pch = 19, xlab = "Date", 
    ylab = "Fitted Counts")
points(julian(gaData$date), glm1$fitted, col = "red", pch = 19)

plot(julian(gaData$date), gaData$simplystats/(gaData$visits + 1), col = "grey", 
    xlab = "Date", ylab = "Fitted Rates", pch = 19)
lines(julian(gaData$date), glm2$fitted/(gaData$visits + 1), col = "blue", lwd = 3)

[Plot: fitted counts from the offset model (blue) and the total-visits model (red); observed rates (grey) with fitted rates (blue)]


Further resources


Previous module: Module II, Multivariable Regression