The three most famous cases of GLMs are: * linear models, * binomial and binary regression * Poisson regression.
Linear models are the most useful applied statistical technique. However, they are not without their limitations.
The assumption of an additive response model is not justified if the response is discrete or strictly positive.
Transformations are often hard to interpret.
Particularly interpretable transformations, like the logarithm, are not applicable for negative or zero values.
Generalized linear models (GLMs) are a family of models that includes linear models as a special case.
GLMs were introduced in a 1972 paper by Nelder and Wedderburn.
A GLM involves three components:
An exponential family model for the response.
A systematic component via a linear predictor.
A link function that connects the means of the response to the linear predictor.
(The normal distribution is an exponential family distribution).
Define the linear predictor to be \[ \eta_i = \sum_{j=1}^{p} X_{ij} \beta_{j} \]
Define the link function as \(g\), so that
\[ g(\mu) = \eta \] * For linear models, \(g(\mu) = \mu\), so that \(\mu_i = \eta_i\).
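As a quick sanity check, a linear model can be fit either with lm() or with glm() using the Gaussian family and identity link; the coefficients agree (a minimal sketch on simulated data):
# Identity link: glm() with the Gaussian family reproduces lm()
set.seed(42)
x <- rnorm(100); y <- 2 + 3 * x + rnorm(100)
coef(lm(y ~ x))
coef(glm(y ~ x, family = gaussian(link = 'identity')))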
Assume that \(Y_i \sim \text{Bernoulli}(\mu_i)\) so that \(\mathbb{E}\left(Y_i\right) = \mu_i\), where \(0 \leq \mu_i \leq 1\).
The linear predictor is \[ \eta_i = \sum_{j=1}^{p} X_{ij} \beta_{j} \] The link function is \(g(\mu) = \ln \left( \frac{\mu}{1 - \mu} \right) = \eta\). This link function \(g\) is the (natural) log odds, referred to as the logit function.
The mean \(\mu\) is expressed via the linear predictor \(\eta\) using the logistic function.
\[ g(\mu) = \ln \left( \frac{\mu}{1 - \mu} \right) = \eta \Rightarrow \mu = \frac{\exp{(\eta)}}{1+\exp{(\eta)}} \]
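In R, qlogis() and plogis() implement the logit link and its logistic inverse (a small illustration):
mu <- 0.7
eta <- qlogis(mu)   # log(mu / (1 - mu))
plogis(eta)         # exp(eta) / (1 + exp(eta)); recovers 0.7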
Assume that \(Y_i \sim \text{Poisson}(\mu_i)\) so that \(\mathbb{E}\left(Y_i\right) = \mu_i\), where \(0 \leq \mu_i\).
The linear predictor is \[ \eta_i = \sum_{j=1}^{p} X_{ij} \beta_{j} \] The link function is \(g(\mu) = \ln \left( \mu \right) = \eta\). This link function \(g\) is the (natural) log.
The mean \(\mu\) is expressed via the linear predictor \(\eta\) using the exponential function.
\[ g(\mu) = \ln \left( \mu \right) = \eta \Rightarrow \mu = \exp{(\eta)} \]
In each case, the only way in which the likelihood depends on the data is through
\[ \sum_{i=1}^{n} y_i \eta_i = \sum_{i=1}^{n} y_i \sum_{j=1}^{p} X_{ij} \beta_{j} = \sum_{j=1}^{p} \beta_{j} \sum_{i=1}^{n} X_{ij} y_{i} \]
For the linear model, \(\mathbb{V} \left[ Y_i \right] = \sigma^2\) is constant
For the Bernoulli case, \(\mathbb{V} \left[ Y_i \right] = \mu_i \left( 1 - \mu_i \right)\) - the variance depends on the observation, unlike the linear model case where the variance is constant. Similarly, for the Poisson case, \(\mathbb{V} \left[ Y_i \right] = \mu_i\), so the variance again depends on the observation.
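A quick simulation sketch of these mean-variance relationships (simulated data, illustrative only):
set.seed(1)
mu <- 0.3
c(var(rbinom(1e5, 1, mu)), mu * (1 - mu))   # Bernoulli: variance near mu * (1 - mu)
lambda <- 4
c(var(rpois(1e5, lambda)), lambda)          # Poisson: variance near the mean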
The normal equations have to be solved iteratively, resulting in \(\widehat{\beta}_k\) and, if included, \(\widehat{\phi}\).
Predicted linear predictor responses can be obtained as \(\widehat{\eta} = \sum_{j=1}^{p} X_j \widehat{\beta}_{j}\).
The interpretation of the coefficients is the same as in the linear model case: the change in the expected response per unit change in the regressor, holding all the other regressors constant. The difference is that this interpretation is on the scale of the linear predictor. \[ \beta_j = g \left( \mathbb{E} \left[ Y \mid X_j = x_j + 1 , X_{-j} = x_{-j} \right] \right) - g \left( \mathbb{E} \left[ Y \mid X_j = x_j , X_{-j} = x_{-j} \right] \right) \]
Frequently we care about outcomes that have two values:
Called binary, Bernoulli or 0/1 outcomes
A collection of exchangeable binary outcomes for the same covariate values leads to binomial outcomes.
# Load the Ravens win/loss data
df <- read.csv('./data/ravensData.csv')
head(df)
## X ravenWinNum ravenWin ravenScore opponentScore
## 1 1 1 W 24 9
## 2 2 1 W 38 35
## 3 3 1 W 28 13
## 4 4 1 W 34 31
## 5 5 1 W 44 13
## 6 6 0 L 23 24
\[ \text{RW}_i = \beta_0 + \beta_1 \text{RS}_i + \epsilon_i \] * \(\text{RW}_i\) - 1 if the Ravens win, 0 if not. * \(\text{RS}_i\) - number of points the Ravens scored. * \(\beta_0\) - probability of a Ravens win if they score 0 points. * \(\beta_1\) - increase in probability of a Ravens win for each additional point. * \(\epsilon_i\) - residual variation.
Binary outcome \(\{0, 1\}\) \[ \text{RW}_i \]
Probability \((0, 1)\) \[ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) \]
Odds \((0, \infty)\) \[ \frac{ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) } {1-\text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right)} \]
Log odds \((-\infty, \infty)\)
\[ \ln \left( \frac{ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) } {1-\text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right)} \right) \]
\[ \text{RW}_i = \beta_0 + \beta_1 \text{RS}_i + \epsilon_i \Leftrightarrow \mathbb{E}\left[ \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right] = \beta_0 + \beta_1 \text{RS}_i \] * Logistic \[ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) = \frac{\exp{\left(\beta_0 + \beta_1 \text{RS}_i\right)}}{1 + \exp{\left(\beta_0 + \beta_1 \text{RS}_i\right)}} \Leftrightarrow \ln \left( \frac{ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) } {1-\text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right)} \right) = \beta_0 + \beta_1 \text{RS}_i \]
\[ \ln \left( \frac{ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) } {1-\text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right)} \right) = \beta_0 + \beta_1 \text{RS}_i \] * \(\beta_0\) - log odds of a Ravens win if they score zero points.
\(\beta_1\) - increase in the log odds of a Ravens win for each additional point scored (the log odds ratio per point).
\(\exp{\left(\beta_1\right)}\) - odds ratio of a Ravens win for each additional point scored.
A game is played, where a coin is flipped with success probability \(p\)
If it comes up heads, you win \(X\); if it comes up tails, you lose \(Y\).
How should \(X\) and \(Y\) be set so the game is fair?
\[ \mathbb{E} \left[ \text{earnings} \right] = Xp - Y(1-p) \] * A fair game (zero expected earnings) implies
\[ \frac{Y}{X} = \frac{p}{1-p} \] * The odds answer the question: “How much should you be willing to pay for a \(p\) probability of winning a dollar?”
* if \(p > 0.5\) you have to pay more if you lose than you get if you win.
* if \(p < 0.5\) you have to pay less if you lose than you get if you win.
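For example, with \(p = 0.75\) the odds are \(p / (1 - p) = 3\), so a fair game requires risking 3 dollars to win 1 dollar:
p <- 0.75
p / (1 - p)   # odds of 3 to 1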
# Logistic curves for several choices of beta0 (intercept) and beta1 (slope)
x <- seq(-10, 10, length = 1000)
par(mfrow = c(2, 2))
beta0 = -2; beta1 = -1
plot(x, exp(beta0+beta1*x)/(1 + exp(beta0+beta1*x)),
type = 'l', lwd = 3, frame = FALSE)
beta0 = -2; beta1 = 0
plot(x, exp(beta0+beta1*x)/(1 + exp(beta0+beta1*x)),
type = 'l', lwd = 3, frame = FALSE)
beta0 = -2; beta1 = 1
plot(x, exp(beta0+beta1*x)/(1 + exp(beta0+beta1*x)),
type = 'l', lwd = 3, frame = FALSE)
beta0 = 0; beta1 = 2
plot(x, exp(beta0+beta1*x)/(1 + exp(beta0+beta1*x)),
type = 'l', lwd = 3, frame = FALSE)
logdf <- glm(df$ravenWinNum ~ df$ravenScore, family = 'binomial')
summary(logdf)
##
## Call:
## glm(formula = df$ravenWinNum ~ df$ravenScore, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7575 -1.0999 0.5305 0.8060 1.4947
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.68001 1.55412 -1.081 0.28
## df$ravenScore 0.10658 0.06674 1.597 0.11
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24.435 on 19 degrees of freedom
## Residual deviance: 20.895 on 18 degrees of freedom
## AIC: 24.895
##
## Number of Fisher Scoring iterations: 5
plot(df$ravenScore, logdf$fitted, pch = 19, col = 'blue', xlab = 'Score', ylab = 'Prob Ravens Win')
exp(logdf$coeff)
## (Intercept) df$ravenScore
## 0.1863724 1.1124694
exp(confint(logdf))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.005674966 3.106384
## df$ravenScore 0.996229662 1.303304
anova(logdf, test = 'Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: df$ravenWinNum
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 24.435
## df$ravenScore 1 3.5398 18 20.895 0.05991 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Many data take the form of counts. Data may also be in the form of rates, i.e., counts per unit of monitoring time.
Linear regression with a transformed outcome is an option.
The Poisson distribution is a useful model for counts and rates (here a rate is a count per some monitoring time).
Example uses of the Poisson distribution include modeling web traffic hits, incidence rates, and approximating binomial probabilities when \(n\) is large and \(p\) is small.
The Poisson mass function
\[ X \sim \text{Poisson}\left(t\lambda \right) , \quad \text{Pr}\left( X = x \right) = \frac{(t\lambda)^x \exp{ \left( -t \lambda \right) }}{x!}, \quad x = 0, 1, \ldots \]
The mean of the Poisson is \(\mathbb{E} \left[ X \right] = t\lambda\), thus \(\mathbb{E} \left[ X/t \right] = \lambda\).
The variance of the Poisson is \(\mathbb{V} \left[ X \right] = t\lambda\)
The Poisson tends to a normal as \(t\lambda\) gets large.
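A quick simulation check of these properties (tm and lambda below are arbitrary illustrative values):
tm <- 5; lambda <- 2                 # monitoring time t and rate lambda
x <- rpois(1e5, tm * lambda)
c(mean(x), var(x), mean(x / tm))     # roughly 10, 10 and 2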
# Poisson mass functions for increasing means: the shape approaches a normal
par(mfrow = c(1, 3))
plot(0:10, dpois(0:10, lambda = 2), type = 'h', frame = FALSE)
plot(0:20, dpois(0:20, lambda = 10), type = 'h', frame = FALSE)
plot(0:200, dpois(0:200, lambda = 100), type = 'h', frame = FALSE)
# Load the daily web-traffic data and add a Julian day number for each date
df <- read.csv('./data/gaData.csv')
df$julian <- julian(as.Date(df$date))
head(df)
## date visits simplystats julian
## 1 2011-01-01 0 0 14975
## 2 2011-01-02 0 0 14976
## 3 2011-01-03 0 0 14977
## 4 2011-01-04 0 0 14978
## 5 2011-01-05 0 0 14979
## 6 2011-01-06 0 0 14980
plot(df$julian, df$visits,
pch = 19, col = 'darkgrey',
xlab = 'Julian', ylab = 'Visits')
lm1 <- lm(df$visits ~ df$julian)
abline(lm1, col = 'red', lwd = 3)
\[ \text{NH}_i = \beta_0 + \beta_1 \text{JD}_i + \epsilon_i \] * \(\text{NH}_i\) - number of hits to the website * \(\text{JD}_i\) - day of the year (Julian day) * \(\beta_0\) - number of hits on Julian day 0 * \(\beta_1\) - increase in number of hits per unit day * \(\epsilon_i\) - variation due to everything we didn’t measure.
Taking the logarithm of the outcome has a specific interpretation
\[ \ln(\text{NH}_i) = \beta_0 + \beta_1 \text{JD}_i + \epsilon_i \] * \(\text{NH}_i\) - number of hits to the website * \(\text{JD}_i\) - day of the year (Julian day) * \(\beta_0\) - log number of hits on Julian day 0 * \(\beta_1\) - increase in log number of hits per unit day * \(\epsilon_i\) - variation due to everything we didn’t measure.
Linear \[ \text{NH}_i = \beta_0 + \beta_1 \text{JD}_i + \epsilon_i \Leftrightarrow \mathbb{E} \left( \text{NH}_i \mid \text{JD}_i, \beta_0, \beta_1 \right) = \beta_0 + \beta_1 \text{JD}_i \]
Poisson / log-linear \[ \ln(\mathbb{E}\left[\text{NH}_i \mid \text{JD}_i, \beta_0, \beta_1 \right]) = \beta_0 + \beta_1 \text{JD}_i \Leftrightarrow \mathbb{E}\left[\text{NH}_i \mid \text{JD}_i, \beta_0, \beta_1 \right] = \exp{\left( \beta_0 + \beta_1 \text{JD}_i \right)} \]
round(exp(coef(lm(I(log(df$visits+1)) ~ df$julian))),5)
## (Intercept) df$julian
## 0.00000 1.00231
\[ \mathbb{E}\left[\text{NH}_i \mid \text{JD}_i, \beta_0, \beta_1 \right] = \exp{\left( \beta_0 + \beta_1 \text{JD}_i \right)} = \exp{\left( \beta_0 \right)} \exp{\left( \beta_1 \text{JD}_i \right)} \]
If \(\text{JD}_i\) is increased by one unit, \(\mathbb{E}\left[\text{NH}_i \mid \text{JD}_i, \beta_0, \beta_1 \right]\) is multiplied by \(\exp{\left( \beta_1 \right)}\).
plot(df$julian, df$visits,
pch = 19, col = 'darkgrey',
xlab = 'Julian', ylab = 'Visits')
lm1 <- lm(df$visits ~ df$julian)
glm1 <- glm(df$visits ~ df$julian, family = 'poisson')
abline(lm1, col = 'red', lwd = 3)
lines(df$julian, glm1$fitted, col = 'blue', lwd = 3)
plot(glm1$fitted, glm1$residuals,
pch = 19, col = 'grey',
xlab = 'Fitted', ylab = 'Residuals')
confint(glm1)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -34.346577587 -31.159715656
## df$julian 0.002190043 0.002396461
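Exponentiating the coefficients (and their confidence limits) moves them to the natural scale, where they act multiplicatively on the expected count, as described above:
# exp(beta1) is the factor by which the expected number of hits
# is multiplied for each additional day
round(exp(coef(glm1)), 5)
round(exp(confint(glm1)), 5)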
\[ \mathbb{E}\left[\text{NHSS}_i \mid \text{JD}_i, \beta_0, \beta_1 \right] / \text{NH}_i = \exp{\left( \beta_0 + \beta_1 \text{JD}_i \right)} \] \[ \ln \left( \mathbb{E}\left[\text{NHSS}_i \mid \text{JD}_i, \beta_0, \beta_1 \right] \right) - \ln \left( \text{NH}_i \right) = \beta_0 + \beta_1 \text{JD}_i \] \[ \ln \left( \mathbb{E}\left[\text{NHSS}_i \mid \text{JD}_i, \beta_0, \beta_1 \right] \right) = \ln \left( \text{NH}_i \right) + \beta_0 + \beta_1 \text{JD}_i \]
glm2 <- glm(df$simplystats ~ df$julian,
offset = log(df$visits + 1),
family = 'poisson')
plot(df$julian, glm2$fitted,
     pch = 19, col = 'blue',
     xlab = 'Julian', ylab = 'Fitted')
points(df$julian, glm1$fitted,
       pch = 19, col = 'red')
glm2 <- glm(df$simplystats ~ df$julian,
offset = log(df$visits + 1),
family = 'poisson')
plot(df$julian, df$simplystats/(df$visits+1),
pch = 19, col = 'grey',
xlab = 'Date', ylab = 'Fitted Rates')
points(df$julian, glm2$fitted/(df$visits + 1),
       pch = 19, col = 'blue')
\[ Y_i = f(X_i) + \epsilon_i \] * How can we fit such a model using linear models (scatterplot smoothing)?
\[ Y_i = \beta_0 + \beta_1 X_i + \sum_{j=1}^{d} \left( x_i - \xi_{j} \right)_{+} \gamma_{j} + \epsilon_i \] where \[ (a)_{+} = \begin{cases} a, \; a > 0 \\ 0, \; \text{otherwise} \end{cases} \] and \(\xi_{1} \leq \ldots \leq \xi_{d}\) are known knots.
\[ \beta_0 + \beta_1 X_i + \sum_{j=1}^{d} \left( x_i - \xi_{j} \right)_{+} \gamma_{j} \] is continuous at the knot points.
# Simulate noisy sine data and fit a linear (hockey-stick) spline
n <- 500
x <- seq(0, 4 * pi, length = n)
y <- sin(x) + rnorm(n, sd = .3)
knots <- seq(0, 8 * pi, length = 20)
# one truncated linear term (x - knot)_+ per knot
splineTerms <- sapply(knots, function(knot) (x > knot) * (x - knot))
xMat <- cbind(1, x, splineTerms)
yhat <- predict(lm(y ~ xMat - 1))
plot(x, y, frame = FALSE, pch = 21, bg = 'lightblue', cex = 2)
lines(x, yhat, col = 'salmon', lwd = 2)
Adding squared terms makes it continuously differentiable at the knot points.
Adding cubic terms makes it twice continuously differentiable at the knot points.
\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \sum_{j=1}^{d} \left( x_i - \xi_{j} \right)^2_{+} \gamma_{j} + \epsilon_i \]
# Same data, now with squared truncated terms, giving a fit that is
# continuously differentiable at the knots
n <- 500
x <- seq(0, 4 * pi, length = n)
y <- sin(x) + rnorm(n, sd = .3)
knots <- seq(0, 8 * pi, length = 20)
splineTerms <- sapply(knots, function(knot) (x > knot) * (x - knot)^2)
xMat <- cbind(1, x, x^2, splineTerms)
yhat <- predict(lm(y ~ xMat - 1))
plot(x, y, frame = FALSE, pch = 21, bg = 'lightblue', cex = 2)
lines(x, yhat, col = 'salmon', lwd = 2)
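For completeness, here is a hedged sketch of the cubic version mentioned above (reusing x, y and knots from the chunk just run); truncated cubic terms give a fit that is twice continuously differentiable at the knots:
cubicTerms <- sapply(knots, function(knot) (x > knot) * (x - knot)^3)
xMatCubic <- cbind(1, x, x^2, x^3, cubicTerms)
yhatCubic <- predict(lm(y ~ xMatCubic - 1))
lines(x, yhatCubic, col = 'darkgreen', lwd = 2)   # add to the previous plot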
The collection of regressors is called a basis.
Single knot point terms can fit hockey-stick-like processes.
These bases can be used in GLMs as well.
An issue with these approaches is the large number of parameters introduced.
This requires some method of so-called regularization, as sketched below.
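As a rough illustration of what regularization can look like here, a ridge-type penalty can be added to the spline fit above (a minimal sketch; xMat and y come from the previous chunk and lambda is an arbitrary tuning value):
lambda <- 1                                   # arbitrary penalty strength
# in practice the intercept column is usually left unpenalized
betaRidge <- solve(t(xMat) %*% xMat + lambda * diag(ncol(xMat)),
                   t(xMat) %*% y)
yhatRidge <- xMat %*% betaRidge
lines(x, yhatRidge, col = 'purple', lwd = 2)  # compare with the unpenalized fit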
# Frequencies (Hz) of the notes C4 through C5 of a major scale
notes4 <- c(261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88, 523.25)
t <- seq(0, 2, by = 0.001)
n <- length(t)
# A C major chord (C4 + E4 + G4) plus noise
c4 <- sin(2 * pi * notes4[1] * t)
e4 <- sin(2 * pi * notes4[3] * t)
g4 <- sin(2 * pi * notes4[5] * t)
chord <- c4 + e4 + g4 + rnorm(n, 0, 0.3)
# Regress the chord on a basis of harmonics, one per note frequency
Harmonic <- function(freq) {
  sin(2 * pi * freq * t)
}
x <- sapply(notes4, Harmonic)
fit <- lm(chord ~ x - 1)
plot(notes4, fit$coefficients, type = 'l')
# The discrete Fourier transform identifies the same dominant frequencies
a <- fft(chord)
plot(Re(a)^2, type = 'l')