The three most famous cases of GLMs are: * linear models, * binomial and binary regression * Poisson regression.
Linear models are the most useful applied statistical technique. However, they are not without their limitations.
The assumption of an additive response model is not justified if the response is discrete or strictly positive.
Transformations are often hard to interpret.
Particularly interpretable transformations, like the logarithm, are not applicable for negative or zero values.
Generalized linear models (GLMs) are a family of models that includes linear models as a special case.
GLMs were introduced in a 1972 paper by Nelder and Wedderburn.
A GLM involves three components:
An exponential family model for the response.
A systematic component via a linear predictor.
A link function that connects the means of the response to the linear predictor.
(The normal distribution is an exponential family distribution).
Define the linear predictor to be \[ \eta_i = \sum_{j=1}^{p} X_{ij} \beta_{j} \]
Define the link function as \(g\), so that
\[ g(\mu) = \eta \] * For linear models, \(g(\mu) = \mu\), so that \(\mu_i = \eta_i\).
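As a quick sanity check, a linear model can be fit either with lm() or with glm() using the Gaussian family and identity link; the coefficients agree (a minimal sketch on simulated data):
# Identity link: glm() with the Gaussian family reproduces lm()
set.seed(42)
x <- rnorm(100); y <- 2 + 3 * x + rnorm(100)
coef(lm(y ~ x))
coef(glm(y ~ x, family = gaussian(link = 'identity')))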
Assume that \(Y_i \sim \text{Bernoulli}(\mu_i)\) so that \(\mathbb{E}\left(Y_i\right) = \mu_i\), where \(0 \leq \mu_i \leq 1\).
The linear predictor is \[ \eta_i = \sum_{j=1}^{p} X_{ij} \beta_{j} \] The link function is \(g(\mu) = \ln \left( \frac{\mu}{1 - \mu} \right) = \eta\). This link function \(g\) is the (natural) log odds, referred to as the logit function.
The mean \(\mu\) is expressed via the linear predictor \(\eta\) using the logistic function.
\[ g(\mu) = \ln \left( \frac{\mu}{1 - \mu} \right) = \eta \Rightarrow \mu = \frac{\exp{(\eta)}}{1+\exp{(\eta)}} \]
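In R, qlogis() and plogis() implement the logit link and its logistic inverse (a small illustration):
mu <- 0.7
eta <- qlogis(mu)   # log(mu / (1 - mu))
plogis(eta)         # exp(eta) / (1 + exp(eta)); recovers 0.7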
Assume that \(Y_i \sim \text{Poisson}(\mu_i)\) so that \(\mathbb{E}\left(Y_i\right) = \mu_i\), where \(0 \leq \mu_i\).
The linear predictor is \[ \eta_i = \sum_{j=1}^{p} X_{ij} \beta_{j} \] The link function is \(g(\mu) = \ln \left( \mu \right) = \eta\). This link function \(g\) is the (natural) log.
The mean \(\mu\) is expressed via the linear predictor \(\eta\) using the exponential function.
\[ g(\mu) = \ln \left( \mu \right) = \eta \Rightarrow \mu = \exp{(\eta)} \]
In each case, the only way in which the likelihood depends on the data is through
\[ \sum_{i=1}^{n} y_i \eta_i = \sum_{i=1}^{n} y_i \sum_{j=1}^{p} X_{ij} \beta_{j} = \sum_{j=1}^{p} \beta_{j} \sum_{i=1}^{n} X_{ij} y_{i} \]
For the linear model, \(\mathbb{V} \left[ Y_i \right] = \sigma^2\) is constant
For the Bernoulli case, \(\mathbb{V} \left[ Y_i \right] = \mu_i \left( 1 - \mu_i \right)\) - the variance depends on the observation, unlike the linear model case where the variance is constant. Similarly, for the Poisson case, \(\mathbb{V} \left[ Y_i \right] = \mu_i\), so the variance again depends on the observation.
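A quick simulation sketch of these mean-variance relationships (simulated data, illustrative only):
set.seed(1)
mu <- 0.3
c(var(rbinom(1e5, 1, mu)), mu * (1 - mu))   # Bernoulli: variance near mu * (1 - mu)
lambda <- 4
c(var(rpois(1e5, lambda)), lambda)          # Poisson: variance near the mean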
The normal equations have to be solved iteratively, resulting in \(\widehat{\beta}_k\) and, if included, \(\widehat{\phi}\).
Predicted linear predictor responses can be obtained as \(\widehat{\eta} = \sum_{j=1}^{p} X_j \widehat{\beta}_{j}\).
The interpretation of the coefficients is the same as in the linear model case: the change in the expected response per unit change in the regressor, holding all the other regressors constant. The difference is that this interpretation is on the scale of the linear predictor. \[ \beta_j = g \left( \mathbb{E} \left[ Y \mid X_j = x_j + 1 , X_{-j} = x_{-j} \right] \right) - g \left( \mathbb{E} \left[ Y \mid X_j = x_j , X_{-j} = x_{-j} \right] \right) \]
Frequently we care about outcomes that have two values:
Called binary, Bernoulli or 0/1 outcomes
A collection of exchangeable binary outcomes for the same covariate values leads to binomial outcomes.
# Load the Ravens win/loss data
df <- read.csv('./data/ravensData.csv')
head(df)
## X ravenWinNum ravenWin ravenScore opponentScore
## 1 1 1 W 24 9
## 2 2 1 W 38 35
## 3 3 1 W 28 13
## 4 4 1 W 34 31
## 5 5 1 W 44 13
## 6 6 0 L 23 24
\[ \text{RW}_i = \beta_0 + \beta_1 \text{RS}_i + \epsilon_i \] * \(\text{RW}_i\) - 1 if the Ravens win, 0 if not. * \(\text{RS}_i\) - number of points the Ravens scored. * \(\beta_0\) - probability of a Ravens win if they score 0 points. * \(\beta_1\) - increase in probability of a Ravens win for each additional point. * \(\epsilon_i\) - residual variation.
Binary outcome \(\{0, 1\}\) \[ \text{RW}_i \]
Probability \((0, 1)\) \[ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) \]
Odds \((0, \infty)\) \[ \frac{ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) } {1-\text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right)} \]
Log odds \((-\infty, \infty)\)
\[ \ln \left( \frac{ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) } {1-\text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right)} \right) \]
\[ \text{RW}_i = \beta_0 + \beta_1 \text{RS}_i + \epsilon_i \Leftrightarrow \mathbb{E}\left[ \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right] = \beta_0 + \beta_1 \text{RS}_i \] * Logistic \[ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) = \frac{\exp{\left(\beta_0 + \beta_1 \text{RS}_i\right)}}{1 + \exp{\left(\beta_0 + \beta_1 \text{RS}_i\right)}} \Leftrightarrow \ln \left( \frac{ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) } {1-\text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right)} \right) = \beta_0 + \beta_1 \text{RS}_i \]
\[ \ln \left( \frac{ \text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right) } {1-\text{Pr} \left( \text{RW}_i \mid \text{RS}_i, \beta_0, \beta_1 \right)} \right) = \beta_0 + \beta_1 \text{RS}_i \] * \(\beta_0\) - log odds of a Ravens win if they score zero points.
\(\beta_1\) - increase in the log odds of a Ravens win for each additional point scored (the log odds ratio per point).
\(\exp{\left(\beta_1\right)}\) - odds ratio of a Ravens win for each additional point scored.
A game is played, where a coin is flipped with success probability \(p\)
If it comes up heads, you win \(X\); if it comes up tails, you lose \(Y\).
How should \(X\) and \(Y\) be set so the game is fair?
\[ \mathbb{E} \left[ \text{earnings} \right] = Xp - Y(1-p) \] * A fair game (zero expected earnings) implies
\[ \frac{Y}{X} = \frac{p}{1-p} \] * The odds answer the question: “How much should you be willing to pay for a \(p\) probability of winning a dollar?”
* if \(p > 0.5\) you have to pay more if you lose than you get if you win.
* if \(p < 0.5\) you have to pay less if you lose than you get if you win.
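For example, with \(p = 0.75\) the odds are \(p / (1 - p) = 3\), so a fair game requires risking 3 dollars to win 1 dollar:
p <- 0.75
p / (1 - p)   # odds of 3 to 1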
# Logistic curves for several choices of beta0 (intercept) and beta1 (slope)
x <- seq(-10, 10, length = 1000)
par(mfrow = c(2, 2))
beta0 = -2; beta1 = -1
plot(x, exp(beta0+beta1*x)/(1 + exp(beta0+beta1*x)),
type = 'l', lwd = 3, frame = FALSE)
beta0 = -2; beta1 = 0
plot(x, exp(beta0+beta1*x)/(1 + exp(beta0+beta1*x)),
type = 'l', lwd = 3, frame = FALSE)
beta0 = -2; beta1 = 1
plot(x, exp(beta0+beta1*x)/(1 + exp(beta0+beta1*x)),
type = 'l', lwd = 3, frame = FALSE)
beta0 = 0; beta1 = 2
plot(x, exp(beta0+beta1*x)/(1 + exp(beta0+beta1*x)),
type = 'l', lwd = 3, frame = FALSE)
logdf <- glm(df$ravenWinNum ~ df$ravenScore, family = 'binomial')
summary(logdf)
##
## Call:
## glm(formula = df$ravenWinNum ~ df$ravenScore, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7575 -1.0999 0.5305 0.8060 1.4947
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.68001 1.55412 -1.081 0.28
## df$ravenScore 0.10658 0.06674 1.597 0.11
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24.435 on 19 degrees of freedom
## Residual deviance: 20.895 on 18 degrees of freedom
## AIC: 24.895
##
## Number of Fisher Scoring iterations: 5
plot(df$ravenScore, logdf$fitted, pch = 19, col = 'blue', xlab = 'Score', ylab = 'Prob Ravens Win')
exp(logdf$coeff)
## (Intercept) df$ravenScore
## 0.1863724 1.1124694
exp(confint(logdf))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.005674966 3.106384
## df$ravenScore 0.996229662 1.303304
anova(logdf, test = 'Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: df$ravenWinNum
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 24.435
## df$ravenScore 1 3.5398 18 20.895 0.05991 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Many data take the form of counts. Data may also be in the form of rates, i.e., counts per unit of monitoring time.
Linear regression with a transformed outcome is an option.
The Poisson distribution is a useful model for counts and rates (here a rate is a count per some monitoring time).
Example uses of the Poisson distribution include modeling web traffic hits, incidence rates, and approximating binomial probabilities when \(n\) is large and \(p\) is small.
The Poisson mass function
\[ X \sim \text{Poisson}\left(t\lambda \right) , \quad \text{Pr}\left( X = x \right) = \frac{(t\lambda)^x \exp{ \left( -t \lambda \right) }}{x!}, \quad x = 0, 1, \ldots \]
The mean of the Poisson is \(\mathbb{E} \left[ X \right] = t\lambda\), thus \(\mathbb{E} \left[ X/t \right] = \lambda\).
The variance of the Poisson is \(\mathbb{V} \left[ X \right] = t\lambda\)
The Poisson tends to a normal as \(t\lambda\) gets large.
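A quick simulation check of these properties (tm and lambda below are arbitrary illustrative values):
tm <- 5; lambda <- 2                 # monitoring time t and rate lambda
x <- rpois(1e5, tm * lambda)
c(mean(x), var(x), mean(x / tm))     # roughly 10, 10 and 2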
# Poisson mass functions for increasing means: the shape approaches a normal
par(mfrow = c(1, 3))
plot(0:10, dpois(0:10, lambda = 2), type = 'h', frame = FALSE)
plot(0:20, dpois(0:20, lambda = 10), type = 'h', frame = FALSE)
plot(0:200, dpois(0:200, lambda = 100), type = 'h', frame = FALSE)
# Load the daily web-traffic data and add a Julian day number for each date
df <- read.csv('./data/gaData.csv')
df$julian <- julian(as.Date(df$date))
head(df)
## date visits simplystats julian
## 1 2011-01-01 0 0 14975
## 2 2011-01-02 0 0 14976
## 3 2011-01-03 0 0 14977
## 4 2011-01-04 0 0 14978
## 5 2011-01-05 0 0 14979
## 6 2011-01-06 0 0 14980
plot(df$julian, df$visits,
pch = 19, col = 'darkgrey',
xlab = 'Julian', ylab = 'Visits')
lm1 <- lm(df$visits ~ df$julian)
abline(lm1, col = 'red', lwd = 3)
\[ \text{NH}_i = \beta_0 + \beta_1 \text{JD}_i + \epsilon_i \] * \(\text{NH}_i\) - number of hits to the website * \(\text{JD}_i\) - day of the year (Julian day) * \(\beta_0\) - number of hits on Julian day 0 * \(\beta_1\) - increase in number of hits per unit day * \(\epsilon_i\) - variation due to everything we didn’t measure.
Taking the logarithm of the outcome has a specific interpretation
\[ \ln(\text{NH}_i) = \beta_0 + \beta_1 \text{JD}_i + \epsilon_i \] * \(\text{NH}_i\) - number of hits to the website * \(\text{JD}_i\) - day of the year (Julian day) * \(\beta_0\) - log number of hits on Julian day 0 * \(\beta_1\) - increase in log number of hits per unit day * \(\epsilon_i\) - variation due to everything we didn’t measure.
Linear \[ \text{NH}_i = \beta_0 + \beta_1 \text{JD}_i + \epsilon_i \Leftrightarrow \mathbb{E} \left( \text{NH}_i \mid \text{JD}_i, \beta_0, \beta_1 \right) = \beta_0 + \beta_1 \text{JD}_i \]
Poisson / log-linear \[ \ln(\mathbb{E}\left[\text{NH}_i \mid \text{JD}_i, \beta_0, \beta_1 \right]) = \beta_0 + \beta_1 \text{JD}_i \Leftrightarrow \mathbb{E}\left[\text{NH}_i \mid \text{JD}_i, \beta_0, \beta_1 \right] = \exp{\left( \beta_0 + \beta_1 \text{JD}_i \right)} \]
round(exp(coef(lm(I(log(df$visits+1)) ~ df$julian))),5)
## (Intercept) df$julian
## 0.00000 1.00231
\[ \mathbb{E}\left[\text{NH}_i \mid \text{JD}_i, \beta_0, \beta_1 \right] = \exp{\left( \beta_0 + \beta_1 \text{JD}_i \right)} = \exp{\left( \beta_0 \right)} \exp{\left( \beta_1 \text{JD}_i \right)} \]
If \(\text{JD}_i\) is increased by one unit, \(\mathbb{E}\left[\text{NH}_i \mid \text{JD}_i, \beta_0, \beta_1 \right]\) is multiplied by \(\exp{\left( \beta_1 \right)}\).
plot(df$julian, df$visits,
pch = 19, col = 'darkgrey',
xlab = 'Julian', ylab = 'Visits')
lm1 <- lm(df$visits ~ df$julian)
glm1 <- glm(df$visits ~ df$julian, family = 'poisson')
abline(lm1, col = 'red', lwd = 3)
lines(df$julian, glm1$fitted, col = 'blue', lwd = 3)
plot(glm1$fitted, glm1$residuals,
pch = 19, col = 'grey',
xlab = 'Fitted', ylab = 'Residuals')
confint(glm1)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -34.346577587 -31.159715656
## df$julian 0.002190043 0.002396461
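Exponentiating the coefficients (and their confidence limits) moves them to the natural scale, where they act multiplicatively on the expected count, as described above:
# exp(beta1) is the factor by which the expected number of hits
# is multiplied for each additional day
round(exp(coef(glm1)), 5)
round(exp(confint(glm1)), 5)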
\[ \mathbb{E}\left[\text{NHSS}_i \mid \text{JD}_i, \beta_0, \beta_1 \right] / \text{NH}_i = \exp{\left( \beta_0 + \beta_1 \text{JD}_i \right)} \] \[ \ln \left( \mathbb{E}\left[\text{NHSS}_i \mid \text{JD}_i, \beta_0, \beta_1 \right] \right) - \ln \left( \text{NH}_i \right) = \beta_0 + \beta_1 \text{JD}_i \] \[ \ln \left( \mathbb{E}\left[\text{NHSS}_i \mid \text{JD}_i, \beta_0, \beta_1 \right] \right) = \ln \left( \text{NH}_i \right) + \beta_0 + \beta_1 \text{JD}_i \]
glm2 <- glm(df$simplystats ~ df$julian,
offset = log(df$visits + 1),
family = 'poisson')
plot(df$julian, glm2$fitted,
     pch = 19, col = 'blue',
     xlab = 'Julian', ylab = 'Fitted')
points(df$julian, glm1$fitted,
       pch = 19, col = 'red')
glm2 <- glm(df$simplystats ~ df$julian,
offset = log(df$visits + 1),
family = 'poisson')
plot(df$julian, df$simplystats/(df$visits+1),
pch = 19, col = 'grey',
xlab = 'Date', ylab = 'Fitted Rates')
points(df$julian, glm2$fitted/(df$visits + 1),
       pch = 19, col = 'blue')
\[ Y_i = f(X_i) + \epsilon_i \] * How can we fit such a model using linear models (scatterplot smoothing)?
\[ Y_i = \beta_0 + \beta_1 X_i + \sum_{j=1}^{d} \left( x_i - \xi_{j} \right)_{+} \gamma_{j} + \epsilon_i \] where \[ (a)_{+} = \begin{cases} a, \; a > 0 \\ 0, \; \text{otherwise} \end{cases} \] and \(\xi_{1} \leq \ldots \leq \xi_{d}\) are known knots.
\[ \beta_0 + \beta_1 X_i + \sum_{j=1}^{d} \left( x_i - \xi_{j} \right)_{+} \gamma_{j} \] is continuous at the knot points.
# Simulate noisy sine data and fit a linear (hockey-stick) spline
n <- 500
x <- seq(0, 4 * pi, length = n)
y <- sin(x) + rnorm(n, sd = .3)
knots <- seq(0, 8 * pi, length = 20)
# one truncated linear term (x - knot)_+ per knot
splineTerms <- sapply(knots, function(knot) (x > knot) * (x - knot))
xMat <- cbind(1, x, splineTerms)
yhat <- predict(lm(y ~ xMat - 1))
plot(x, y, frame = FALSE, pch = 21, bg = 'lightblue', cex = 2)
lines(x, yhat, col = 'salmon', lwd = 2)
Adding squared terms makes it continuously differentiable at the knot points.
Adding cubic terms makes it twice continuously differentiable at the knot points.
\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \sum_{j=1}^{d} \left( x_i - \xi_{j} \right)^2_{+} \gamma_{j} + \epsilon_i \]
# Same data, now with squared truncated terms, giving a fit that is
# continuously differentiable at the knots
n <- 500
x <- seq(0, 4 * pi, length = n)
y <- sin(x) + rnorm(n, sd = .3)
knots <- seq(0, 8 * pi, length = 20)
splineTerms <- sapply(knots, function(knot) (x > knot) * (x - knot)^2)
xMat <- cbind(1, x, x^2, splineTerms)
yhat <- predict(lm(y ~ xMat - 1))
plot(x, y, frame = FALSE, pch = 21, bg = 'lightblue', cex = 2)
lines(x, yhat, col = 'salmon', lwd = 2)
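For completeness, here is a hedged sketch of the cubic version mentioned above (reusing x, y and knots from the chunk just run); truncated cubic terms give a fit that is twice continuously differentiable at the knots:
cubicTerms <- sapply(knots, function(knot) (x > knot) * (x - knot)^3)
xMatCubic <- cbind(1, x, x^2, x^3, cubicTerms)
yhatCubic <- predict(lm(y ~ xMatCubic - 1))
lines(x, yhatCubic, col = 'darkgreen', lwd = 2)   # add to the previous plot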
The collection of regressors is called a basis.
Single knot point terms can fit hockey-stick-like processes.
These bases can be used in GLMs as well.
An issue with these approaches is the large number of parameters introduced.
This requires some method of so-called regularization, as sketched below.
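As a rough illustration of what regularization can look like here, a ridge-type penalty can be added to the spline fit above (a minimal sketch; xMat and y come from the previous chunk and lambda is an arbitrary tuning value):
lambda <- 1                                   # arbitrary penalty strength
# in practice the intercept column is usually left unpenalized
betaRidge <- solve(t(xMat) %*% xMat + lambda * diag(ncol(xMat)),
                   t(xMat) %*% y)
yhatRidge <- xMat %*% betaRidge
lines(x, yhatRidge, col = 'purple', lwd = 2)  # compare with the unpenalized fit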
# Frequencies (Hz) of the notes C4 through C5 of a major scale
notes4 <- c(261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88, 523.25)
t <- seq(0, 2, by = 0.001)
n <- length(t)
# A C major chord (C4 + E4 + G4) plus noise
c4 <- sin(2 * pi * notes4[1] * t)
e4 <- sin(2 * pi * notes4[3] * t)
g4 <- sin(2 * pi * notes4[5] * t)
chord <- c4 + e4 + g4 + rnorm(n, 0, 0.3)
# Regress the chord on a basis of harmonics, one per note frequency
Harmonic <- function(freq) {
  sin(2 * pi * freq * t)
}
x <- sapply(notes4, Harmonic)
fit <- lm(chord ~ x - 1)
plot(notes4, fit$coefficients, type = 'l')
# The discrete Fourier transform identifies the same dominant frequencies
a <- fft(chord)
plot(Re(a)^2, type = 'l')