This note is a reorganization of Dr. Brian Caffo's lecture notes for the Coursera course Regression Models.
The Generalized Linear Model (GLM) was introduced in a 1972 Journal of the Royal Statistical Society paper by Nelder and Wedderburn. It involves three components: an exponential family distribution for the response, a linear predictor, and a link function connecting the mean of the response to the linear predictor.
GLMs can be thought of as a two-stage modeling approach. We first model the response variable using a probability distribution, such as the binomial or Poisson distribution. Second, we model the parameter of the distribution using a collection of predictors and a special form of multiple regression. - OpenIntro
Mathematically, first we assume that \( Y_i \sim N(\mu_i, \sigma^2) \) (the Gaussian distribution is an exponential family distribution) and define the linear predictor to be \( \eta_i = \sum_{k=1}^p X_{ik} \beta_k \). The link function is \( g(\mu) = \eta \).
For linear models \( g(\mu) = \mu \) so that \( \mu_i = \eta_i \). This yields the same likelihood model as our additive error Gaussian linear model \( Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i} \) where \( \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2) \).
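As a quick sanity check that the Gaussian GLM with identity link reproduces ordinary least squares, we can fit the same data both ways (the data below are simulated purely for illustration):

# lm() and glm() with family = gaussian give the same coefficient estimates
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                # hypothetical simulated data
coef(lm(y ~ x))                          # ordinary least squares
coef(glm(y ~ x, family = gaussian))      # same fit via the GLM machinery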
Logistic regression is the GLM for binary outcomes, a setting where regular multiple regression does not work well. Assume that \( Y_i \sim Bernoulli(\mu_i) \) so that \( E[Y_i] = \mu_i \) where \( 0\leq \mu_i \leq 1 \). The canonical link is the logit, \( g(\mu) = \log\left(\frac{\mu}{1-\mu}\right) \).
For count outcomes, assume that \( Y_i \sim Poisson(\mu_i) \) so that \( E[Y_i] = \mu_i \) where \( 0\leq \mu_i \). The canonical link is the log, \( g(\mu) = \log(\mu) \).
In each case, the only way in which the likelihood depends on the data is through \[ \sum_{i=1}^n y_i \eta_i = \sum_{i=1}^n y_i\sum_{k=1}^p X_{ik} \beta_k = \sum_{k=1}^p \beta_k\sum_{i=1}^n X_{ik} y_i \] Thus we do not need the full data, only the summaries \( \sum_{i=1}^n X_{ik} y_i \). This simplification is a consequence of choosing so-called canonical link functions.
All of the models achieve their maximum at the root of the so-called normal equations \[ 0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i \] where the \( W_i \) are the derivatives of the inverse of the link function.
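To make the normal equations concrete, here is a minimal sketch of iteratively reweighted least squares (IRLS) for logistic regression, the algorithm that glm() runs internally; the function name irls_logistic and its arguments are ours, for illustration only:

# Minimal IRLS sketch for a Bernoulli GLM with the canonical logit link.
# With the canonical link the normal equations reduce to 0 = sum_i (y_i - mu_i) X_ik.
irls_logistic <- function(X, y, tol = 1e-8, maxit = 25) {
    beta <- rep(0, ncol(X))
    for (iter in 1:maxit) {
        eta <- X %*% beta                  # linear predictor
        mu  <- 1 / (1 + exp(-eta))         # inverse logit
        W   <- as.vector(mu * (1 - mu))    # variance function doubles as the weight
        z   <- eta + (y - mu) / W          # working response
        beta_new <- solve(t(X) %*% (W * X), t(X) %*% (W * z))  # weighted least squares step
        if (max(abs(beta_new - beta)) < tol) break
        beta <- beta_new
    }
    as.vector(beta)
}

Given a design matrix X whose first column is all ones, irls_logistic(X, y) should agree with coef(glm(y ~ X - 1, family = binomial)) up to convergence tolerance.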
The variances in the GLMs are \( Var(Y_i) = \sigma^2 \) (constant) in the Gaussian case, \( Var(Y_i) = \mu_i(1 - \mu_i) \) in the Bernoulli case, and \( Var(Y_i) = \mu_i \) in the Poisson case.
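A quick numerical check of the Bernoulli variance function (the probability 0.3 below is an arbitrary illustrative choice):

# Empirical variance of Bernoulli draws matches mu * (1 - mu)
mu <- 0.3
y <- rbinom(1e5, size = 1, prob = mu)
c(empirical = var(y), theoretical = mu * (1 - mu))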
In this section, we will explore the Baltimore Ravens data. First, let's load it and check its structure:
download.file("https://dl.dropboxusercontent.com/u/7710864/data/ravensData.rda",
destfile = "../03_02_binaryOutcomes/data/ravensData.rda", method = "wget")
load("../03_02_binaryOutcomes/data/ravensData.rda")
head(ravensData)
## ravenWinNum ravenWin ravenScore opponentScore
## 1 1 W 24 9
## 2 1 W 38 35
## 3 1 W 28 13
## 4 1 W 34 31
## 5 1 W 44 13
## 6 0 L 23 24
str(ravensData)
## 'data.frame': 20 obs. of 4 variables:
## $ ravenWinNum : num 1 1 1 1 1 0 1 1 1 1 ...
## $ ravenWin : Factor w/ 2 levels "L","W": 2 2 2 2 2 1 2 2 2 2 ...
## $ ravenScore : num 24 38 28 34 44 23 31 23 9 31 ...
## $ opponentScore: num 9 35 13 31 13 24 30 16 6 29 ...
A first attempt is to model the outcome with linear regression: \[ RW_i = b_0 + b_1 RS_i + e_i \] where \( RW_i \) is 1 if the Ravens win and 0 if not, \( RS_i \) is the number of points the Ravens scored, and \( e_i \) is residual variation due to everything we didn't measure.
lmRavens <- lm(ravensData$ravenWinNum ~ ravensData$ravenScore)
summary(lmRavens)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2850 0.256643 1.111 0.28135
## ravensData$ravenScore 0.0159 0.009059 1.755 0.09625
The quantities we could model, and their ranges:
- Binary outcome \( \{0, 1\} \): \( RW_i \)
- Probability \( (0,1) \): \( \rm{Pr}(RW_i | RS_i, b_0, b_1 ) \)
- Odds \( (0,\infty) \): \( \frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)} \)
- Log odds \( (-\infty,\infty) \): \( \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) \)
Linear
\[ RW_i = b_0 + b_1 RS_i + e_i \]
or
\[ E[RW_i | RS_i, b_0, b_1] = b_0 + b_1 RS_i \]
Logistic
\[ \rm{Pr}(RW_i | RS_i, b_0, b_1) = \frac{\exp(b_0 + b_1 RS_i)}{1 + \exp(b_0 + b_1 RS_i)} \]
or
\[ \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i \]
To understand odds, imagine that you are playing a game where you flip a coin with success probability \( p \); if it comes up heads you win \( X \) dollars, and if it comes up tails you lose \( Y \). What should we set \( X \) and \( Y \) to for the game to be fair?
For the game to be fair, we need \[ E[earnings]= X p - Y (1 - p) = 0, \] which implies \[ \frac{Y}{X} = \frac{p}{1 - p}, \] the odds of winning.
The odds can thus be interpreted as the answer to the question: "How much should you be willing to pay for a \( p \) probability of winning a dollar?"
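For instance (the numbers are purely illustrative): with \( p = 0.75 \) the odds are 3, so in a fair game you should risk 3 dollars for the chance to win 1:

# Odds corresponding to a win probability p
p <- 0.75
odds <- p / (1 - p)   # equals Y / X in the fair game above
odds                  # 3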
logRegRavens <- glm(ravensData$ravenWinNum ~ ravensData$ravenScore, family = "binomial")
summary(logRegRavens)
##
## Call:
## glm(formula = ravensData$ravenWinNum ~ ravensData$ravenScore,
## family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.758 -1.100 0.530 0.806 1.495
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6800 1.5541 -1.08 0.28
## ravensData$ravenScore 0.1066 0.0667 1.60 0.11
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24.435 on 19 degrees of freedom
## Residual deviance: 20.895 on 18 degrees of freedom
## AIC: 24.89
##
## Number of Fisher Scoring iterations: 5
plot(ravensData$ravenScore, logRegRavens$fitted, pch = 19, col = "blue", xlab = "Score",
ylab = "Prob Ravens Win")
Odds ratios and confidence intervals
exp(logRegRavens$coeff)
## (Intercept) ravensData$ravenScore
## 0.1864 1.1125
exp(confint(logRegRavens))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.005675 3.106
## ravensData$ravenScore 0.996230 1.303
ANOVA for logistic regression
anova(logRegRavens, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: ravensData$ravenWinNum
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 24.4
## ravensData$ravenScore 1 3.54 18 20.9 0.06 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can interpret the exponentiated slope as an odds ratio: the estimate of \( \exp(b_1) \approx 1.11 \) above says that each additional point the Ravens score is associated with roughly an 11% increase in the odds of winning.
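To turn fitted log odds back into probabilities, we apply the inverse logit (plogis() in R); the score of 30 below is just an illustrative value:

# Predicted probability of a Ravens win when they score 30 points
b <- coef(logRegRavens)
plogis(b[1] + b[2] * 30)   # inverse logit of the linear predictor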
Many data take the form of counts, such as the number of calls to a call center, the number of flu cases in an area, or the number of cars that cross a bridge. Data may also be in the form of rates, such as the percentage of children passing a test or the percentage of hits to a website coming from a particular country. In all of these cases, linear regression with a transformation is an option.
First, we recall that the Poisson distribution is a useful model for counts and rates (a rate is a count per some monitoring time). Example uses include modeling web traffic hits, incidence rates, approximating binomial probabilities with small \( p \) and large \( n \), and analyzing contingency-table data. For the Poisson distribution, \( X \sim Poisson(t\lambda) \) if \[ P(X = x) = \frac{(t\lambda)^x e^{-t\lambda}}{x!} \] for \( x = 0, 1, \ldots \); the mean is \( E[X] = t\lambda \), so \( E[X/t] = \lambda \), and the variance is \( Var(X) = t\lambda \). The plots below show the Poisson mass function for increasing means; note that the distribution becomes increasingly bell-shaped.
par(mfrow = c(1, 3))
plot(0:10, dpois(0:10, lambda = 2), type = "h", frame = FALSE)
plot(0:20, dpois(0:20, lambda = 10), type = "h", frame = FALSE)
plot(0:200, dpois(0:200, lambda = 100), type = "h", frame = FALSE)
We can numerically verify that the mean and variance of a Poisson distribution are equal:
x <- 0:10000
lambda = 3
mu <- sum(x * dpois(x, lambda = lambda))
sigmasq <- sum((x - mu)^2 * dpois(x, lambda = lambda))
c(mu, sigmasq)
## [1] 3 3
As an example, let's consider the daily hit counts to Jeff Leek's website. Since the unit of time is always one day, we set \( t = 1 \), and the Poisson mean is then interpreted as web hits per day. (If we set \( t = 24 \), it would be web hits per hour.) (http://skardhamar.github.com/rga/)
download.file("https://dl.dropboxusercontent.com/u/7710864/data/gaData.rda",
destfile = "../03_03_countOutcomes/data/gaData.rda", method = "wget")
load("../03_03_countOutcomes/data/gaData.rda")
gaData$julian <- julian(gaData$date)
str(gaData)
## 'data.frame': 731 obs. of 4 variables:
## $ date : Date, format: "2011-01-01" "2011-01-02" ...
## $ visits : num 0 0 0 0 0 0 0 0 0 0 ...
## $ simplystats: num 0 0 0 0 0 0 0 0 0 0 ...
## $ julian : atomic 14975 14976 14977 14978 14979 ...
## ..- attr(*, "origin")= Date, format: "1970-01-01"
We first fit the data with a simple linear regression model \[ NH_i = b_0 + b_1 JD_i + e_i \] where \( NH_i \) is the number of hits to the website on day \( i \), \( JD_i \) is the day of the count (Julian day), \( b_0 \) is the number of hits on Julian day 0 (1970-01-01), \( b_1 \) is the increase in hits per day, and \( e_i \) is residual variation.
We can plot the data along with the fitted regression line:
plot(gaData$julian, gaData$visits, pch = 19, col = "darkgrey", xlab = "Julian",
ylab = "Visits")
lm1 <- lm(gaData$visits ~ gaData$julian)
abline(lm1, col = "red", lwd = 3)
As an aside, taking the natural log of the outcome gives the coefficients a specific interpretation. Consider the model
\[ \log(NH_i) = b_0 + b_1 JD_i + e_i \]
Exponentiated coefficients then estimate multiplicative effects on the geometric mean of the outcome; here \( \exp(b_1) \) is the estimated relative increase in geometric-mean web hits per day. (We add 1 before taking logs to avoid \( \log(0) \).) The estimate of about 1.002 below corresponds to roughly a 0.2% increase per day.
round(exp(coef(lm(I(log(gaData$visits + 1)) ~ gaData$julian))), 5)
## (Intercept) gaData$julian
## 0.000 1.002
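The geometric-mean language comes from the fact that \( e^{E[\log Y]} \) is the population geometric mean of \( Y \); a quick empirical check on the (shifted) hit counts:

# exp of the mean of the logs is the geometric mean, not the arithmetic mean
y <- gaData$visits + 1   # shift by 1 to avoid log(0), as in the fit above
c(geometric = exp(mean(log(y))), arithmetic = mean(y))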
Linear
\[ NH_i = b_0 + b_1 JD_i + e_i \]
or
\[ E[NH_i | JD_i, b_0, b_1] = b_0 + b_1 JD_i \]
Poisson/log-linear
\[ \log\left(E[NH_i | JD_i, b_0, b_1]\right) = b_0 + b_1 JD_i \]
or
\[ E[NH_i | JD_i, b_0, b_1] = \exp\left(b_0 + b_1 JD_i\right) \]
Multiplicative differences \[ E[NH_i | JD_i, b_0, b_1] = \exp\left(b_0 + b_1 JD_i\right) \] \[ E[NH_i | JD_i, b_0, b_1] = \exp\left(b_0 \right)\exp\left(b_1 JD_i\right) \]
If \( JD_i \) is increased by one unit, \( E[NH_i | JD_i, b_0, b_1] \) is multiplied by \( \exp\left(b_1\right) \)
par(mfrow = c(1, 2))
plot(gaData$julian, gaData$visits, pch = 19, col = "darkgrey", xlab = "Julian",
ylab = "Visits")
glm1 <- glm(gaData$visits ~ gaData$julian, family = "poisson")
abline(lm1, col = "red", lwd = 3)
lines(gaData$julian, glm1$fitted, col = "blue", lwd = 3)
plot(glm1$fitted, glm1$residuals, pch = 19, col = "grey", ylab = "Residuals",
xlab = "Fitted", main = "Mean-variance relationship")
In the plot above, the residuals fan out as the fitted values grow, so the Poisson mean-variance assumption is suspect; model-agnostic (robust) standard errors from the sandwich package are one remedy. The helper below (adapted from http://stackoverflow.com/questions/3817182/vcovhc-and-confidence-interval) builds Wald confidence intervals from the heteroskedasticity-consistent covariance estimate:

library(sandwich)
confint.agnostic <- function(object, parm, level = 0.95, ...) {
    # Wald confidence intervals using the sandwich (robust) variance estimate
    cf <- coef(object)
    pnames <- names(cf)
    if (missing(parm)) {
        parm <- pnames
    } else if (is.numeric(parm)) {
        parm <- pnames[parm]
    }
    a <- (1 - level) / 2
    a <- c(a, 1 - a)
    pct <- stats:::format.perc(a, 3)        # unexported helper: "2.5 %", "97.5 %"
    fac <- qnorm(a)                         # normal quantiles for the interval
    ci <- array(NA, dim = c(length(parm), 2L), dimnames = list(parm, pct))
    ses <- sqrt(diag(sandwich::vcovHC(object)))[parm]  # robust standard errors
    ci[] <- cf[parm] + ses %o% fac          # estimate +/- z * robust SE
    ci
}
confint(glm1)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -34.34658 -31.159716
## gaData$julian 0.00219 0.002396
confint.agnostic(glm1)
## 2.5 % 97.5 %
## (Intercept) -36.362675 -29.136997
## gaData$julian 0.002058 0.002528
Now suppose we want to model rates, e.g., the fraction of total hits coming from the Simply Statistics site. Writing \( NHSS_i \) for the number of hits from Simply Statistics and \( NH_i \) for the total number of hits, \[ E[NHSS_i | JD_i, b_0, b_1]/NH_i = \exp\left(b_0 + b_1 JD_i\right) \] \[ \log\left(E[NHSS_i | JD_i, b_0, b_1]\right) - \log(NH_i) = b_0 + b_1 JD_i \] \[ \log\left(E[NHSS_i | JD_i, b_0, b_1]\right) = \log(NH_i) + b_0 + b_1 JD_i \] so \( \log(NH_i) \) enters the model as an offset with a fixed coefficient of 1.
par(mfrow = c(1, 2))
glm2 <- glm(gaData$simplystats ~ julian(gaData$date), offset = log(visits +
1), family = "poisson", data = gaData)
plot(julian(gaData$date), glm2$fitted, col = "blue", pch = 19, xlab = "Date",
ylab = "Fitted Counts")
points(julian(gaData$date), glm1$fitted, col = "red", pch = 19)
plot(julian(gaData$date), gaData$simplystats/(gaData$visits + 1), col = "grey",
xlab = "Date", ylab = "Fitted Rates", pch = 19)
lines(julian(gaData$date), glm2$fitted/(gaData$visits + 1), col = "blue", lwd = 3)