These notes are modified from your reading
by David Dalpiaz, Department of Statistics at The University of Illinois
After reading chapter 17 you will be able to use generalized linear models, and in particular logistic regression, to model response variables that are not normally distributed, such as binary outcomes.
So far we have only considered the linear regression model for numeric response variables. What about response variables that only take integer values? What about a response variable that is categorical? Can we use linear models in these situations? Yes! The model that we have been using, which we will call ordinary linear regression, is actually a specific case of the more general, generalized linear model.
So far, we’ve had response variables that, conditioned on the predictors, were modeled using a normal distribution with a mean that is some linear combination of the predictors. This linear combination is what made a linear model “linear.”
\[ Y \mid {\bf X} = {\bf x} \sim N(\beta_0 + \beta_1x_1 + \ldots + \beta_{p - 1}x_{p - 1}, \ \sigma^2) \]
Now we’ll allow for two modifications of this situation, which will let us use linear models in many more situations. Instead of using a normal distribution for the response conditioned on the predictors, we’ll allow for other distributions. Also, instead of the conditional mean being a linear combination of the predictors, it can be some function of a linear combination of the predictors.
In general, a generalized linear model has three parts:

- a distribution of the response conditioned on the predictors,
- a linear combination of the \(p - 1\) predictors, which we write as \(\eta({\bf x})\),

\[\eta({\bf x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_{p - 1} x_{p - 1}\]

- and a link function, \(g()\), that relates \(\eta({\bf x})\) to the mean of the conditional distribution of the response,

\[ \eta({\bf x}) = g\left(\text{E}[Y \mid {\bf X} = {\bf x}]\right). \]
The following table summarizes three examples of a generalized linear model:
| | Linear Regression | Poisson Regression | Logistic Regression |
|---|---|---|---|
| \(Y \mid {\bf X} = {\bf x}\) | \(N(\mu({\bf x}), \sigma^2)\) | \(\text{Pois}(\lambda({\bf x}))\) | \(\text{Bern}(p({\bf x}))\) |
| Distribution Name | Normal | Poisson | Bernoulli (Binomial) |
| \(\text{E}[Y \mid {\bf X} = {\bf x}]\) | \(\mu({\bf x})\) | \(\lambda({\bf x})\) | \(p({\bf x})\) |
| Support | Real: \((-\infty, \infty)\) | Integer: \(0, 1, 2, \ldots\) | Integer: \(0, 1\) |
| Usage | Numeric Data | Count (Integer) Data | Binary (Class) Data |
| Link Name | Identity | Log | Logit |
| Link Function | \(\eta({\bf x}) = \mu({\bf x})\) | \(\eta({\bf x}) = \log(\lambda({\bf x}))\) | \(\eta({\bf x}) = \log \left(\frac{p({\bf x})}{1 - p({\bf x})} \right)\) |
| Mean Function | \(\mu({\bf x}) = \eta({\bf x})\) | \(\lambda({\bf x}) = e^{\eta({\bf x})}\) | \(p({\bf x}) = \frac{e^{\eta({\bf x})}}{1 + e^{\eta({\bf x})}} = \frac{1}{1 + e^{-\eta({\bf x})}}\) |
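As a quick illustration of how these three cases map onto R's glm() function, here is a minimal sketch using each corresponding family argument; the data frame dat and its columns are hypothetical placeholders.

# hypothetical data frame `dat` with a numeric response y_num, a count
# response y_count, a 0/1 response y_bin, and predictors x1 and x2
fit_norm <- glm(y_num   ~ x1 + x2, data = dat, family = gaussian)  # identity link
fit_pois <- glm(y_count ~ x1 + x2, data = dat, family = poisson)   # log link
fit_bern <- glm(y_bin   ~ x1 + x2, data = dat, family = binomial)  # logit link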
Like ordinary linear regression, we will seek to “fit” the model by estimating the \(\beta\) parameters. To do so, we will use the method of maximum likelihood.
Note that a Bernoulli distribution is a specific case of a binomial distribution where the \(n\) parameter of a binomial is \(1\). Binomial regression is also possible, but we’ll focus on the much more popular Bernoulli case.
So, in general, GLMs relate the mean of the response to a linear combination of the predictors, \(\eta({\bf x})\), through the use of a link function, \(g()\). That is,
\[ \eta({\bf x}) = g\left(\text{E}[Y \mid {\bf X} = {\bf x}]\right). \]
The mean is then
\[ \text{E}[Y \mid {\bf X} = {\bf x}] = g^{-1}(\eta({\bf x})). \]
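In R these two scales correspond to the two types of prediction available from a fitted glm() object: type = "link" returns \(\eta({\bf x})\), while type = "response" returns \(g^{-1}(\eta({\bf x}))\), the estimated mean. A small sketch, reusing the hypothetical Poisson fit fit_pois from above:

eta_hat  <- predict(fit_pois, type = "link")      # eta(x), the linear predictor
mean_hat <- predict(fit_pois, type = "response")  # g^{-1}(eta(x)), the estimated mean
all.equal(mean_hat, exp(eta_hat))                 # TRUE, since the inverse of the log link is exp()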
To illustrate the use of a GLM we’ll focus on the case of a binary response variable coded using \(0\) and \(1\). In practice, these \(0\)s and \(1\)s will code for two classes such as yes/no, cat/dog, sick/healthy, etc.
\[ Y = \begin{cases} 1 & \text{yes} \\ 0 & \text{no} \end{cases} \]
First, we define some notation that we will use throughout.
\[ p({\bf x}) = P[Y = 1 \mid {\bf X} = {\bf x}] \]
With a binary (Bernoulli) response, we’ll mostly focus on the case when \(Y = 1\), since with only two possibilities, it is trivial to obtain probabilities when \(Y = 0\).
\[ P[Y = 0 \mid {\bf X} = {\bf x}] + P[Y = 1 \mid {\bf X} = {\bf x}] = 1 \]
\[ P[Y = 0 \mid {\bf X} = {\bf x}] = 1 - p({\bf x}) \]
We now define the logistic regression model.
\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p - 1} x_{p - 1} \]
Immediately we notice some similarities to ordinary linear regression, in particular, the right hand side. This is our usual linear combination of the predictors. We have our usual \(p - 1\) predictors for a total of \(p\) \(\beta\) parameters. (Note, many more machine learning focused texts will use \(p\) as the number of predictors. This is an arbitrary choice, but you should be aware of it.)
The left hand side is called the log odds, which is the log of the odds. The odds are the probability for a positive event \((Y = 1)\) divided by the probability of a negative event \((Y = 0)\). So when the odds are \(1\), the two events have equal probability. Odds greater than \(1\) favor a positive event. The opposite is true when the odds are less than \(1\).
\[ \frac{p({\bf x})}{1 - p({\bf x})} = \frac{P[Y = 1 \mid {\bf X} = {\bf x}]}{P[Y = 0 \mid {\bf X} = {\bf x}]} \]
Essentially, the log odds are the logit transform applied to \(p({\bf x})\).
\[ \text{logit}(\xi) = \log\left(\frac{\xi}{1 - \xi}\right) \]
It will also be useful to define the inverse logit, otherwise known as the “logistic” or sigmoid function.
\[ \text{logit}^{-1}(\xi) = \frac{e^\xi}{1 + e^{\xi}} = \frac{1}{1 + e^{-\xi}} \]
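Both transformations are easy to write and check in R; a short sketch (the names logit and inv_logit are our own, not base R functions):

logit     <- function(xi) log(xi / (1 - xi))   # maps (0, 1) to the real line
inv_logit <- function(xi) 1 / (1 + exp(-xi))   # maps the real line back to (0, 1)

inv_logit(logit(0.75))  # returns 0.75, confirming the two functions are inverses

(Base R provides the same transformations as qlogis() and plogis().)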
In ordinary linear regression, while our main focus is on estimating the mean, \(\beta_0 + \beta_1x_1 + \ldots + \beta_{p-1}x_{p-1}\), there is also another parameter, \(\sigma^2\), which needs to be estimated. This is the result of the normal distribution having two parameters.

With logistic regression, which uses the Bernoulli distribution, we only need to estimate the Bernoulli distribution’s single parameter \(p({\bf x})\), which happens to be its mean.

\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p - 1} x_{p - 1} \]
So even though we introduced ordinary linear regression first, in some ways, logistic regression is actually simpler.
Note that applying the inverse logit transformation allows us to obtain an expression for \(p({\bf x})\).
\[ p({\bf x}) = P[Y = 1 \mid {\bf X} = {\bf x}] = \frac{e^{\beta_0 + \beta_1 x_{1} + \cdots + \beta_{p-1} x_{(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{1} + \cdots + \beta_{p-1} x_{(p-1)}}} \]
With \(n\) observations, we write the model indexed with \(i\) to note that it is being applied to each observation.
\[ \log\left(\frac{p({\bf x_i})}{1 - p({\bf x_i})}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)} \]
We can apply the inverse logit transformation to obtain \(P[Y_i = 1 \mid {\bf X_i} = {\bf x_i}]\) for each observation. Since these are probabilities, it’s good that we used a function that returns values between \(0\) and \(1\).
\[ p({\bf x_i}) = P[Y_i = 1 \mid {\bf X_i} = {\bf x_i}] = \frac{e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}} \]
\[ 1 - p({\bf x_i}) = P[Y_i = 0 \mid {\bf X} = {\bf x_i}] = \frac{1}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}} \]
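Given a candidate coefficient vector, these probabilities are straightforward to compute; a sketch, assuming a hypothetical design matrix X (with a leading column of ones for the intercept) and coefficient vector beta:

eta <- drop(X %*% beta)     # eta(x_i) for each observation
p   <- 1 / (1 + exp(-eta))  # P[Y_i = 1 | X_i = x_i]
q   <- 1 - p                # P[Y_i = 0 | X_i = x_i]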
To “fit” this model, that is estimate the \(\beta\) parameters, we will use maximum likelihood.
\[ \boldsymbol{{\beta}} = [\beta_0, \beta_1, \beta_2, \beta_3, \ldots, \beta_{p - 1}] \]
We first write the likelihood given the observed data.
\[ L(\boldsymbol{{\beta}}) = \prod_{i = 1}^{n} P[Y_i = y_i \mid {\bf X_i} = {\bf x_i}] \]
This is already technically a function of the \(\beta\) parameters, but we’ll do some rearrangement to make this more explicit.
\[ L(\boldsymbol{{\beta}}) = \prod_{i = 1}^{n} p({\bf x_i})^{y_i} (1 - p({\bf x_i}))^{(1 - y_i)} \]
\[ L(\boldsymbol{\beta}) = \prod_{i : y_i = 1} p({\bf x_i}) \prod_{j : y_j = 0} (1 - p({\bf x_j})) \]

Working with this non-linear product is simplified by taking logs: the log turns the product into a sum, and because the log is a monotone function, maximizing the log-likelihood also maximizes the likelihood. Each observation contributes \(\log p({\bf x_i})\) if it is a \(1\) (a "yes") and \(\log(1 - p({\bf x_j}))\) if it is a \(0\) (a "no").

\[ \ell(\boldsymbol{\beta}) = \log L(\boldsymbol{\beta}) = \sum_{i : y_i = 1} \log p({\bf x_i}) + \sum_{j : y_j = 0} \log\left(1 - p({\bf x_j})\right) \]
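As a check on this formula, here is a sketch of the log-likelihood written as an R function; the design matrix X (with an intercept column) and 0/1 response y are hypothetical, but at the maximum likelihood estimates its value should agree with logLik() applied to the corresponding glm() fit.

# log-likelihood of a logistic regression model
loglik_logistic <- function(beta, X, y) {
  eta <- drop(X %*% beta)     # linear predictor for each observation
  p   <- 1 / (1 + exp(-eta))  # fitted probabilities p(x_i)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

There is no closed-form solution for the maximizing \(\boldsymbol{\beta}\); R's glm() maximizes this numerically using iteratively reweighted least squares.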
While the logistic regression model isn’t exactly the same as the ordinary linear regression model, because they both use a linear combination of the predictors
\[ \eta({\bf x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_{p - 1} x_{p - 1} \]
working with logistic regression is very similar. Many of the things we did with ordinary linear regression can be done with logistic regression in a very similar fashion, for example, testing the significance of individual parameters and comparing nested models. After some introduction to the new tests, we’ll demonstrate each of these using an example.
Like ordinary linear regression, we’ll want to be able to perform hypothesis testing. We’ll again want both single parameter, and multiple parameter tests.
In ordinary linear regression, we performed the test of
\[ H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0 \]
using a \(t\)-test.
For the logistic regression model,
\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p - 1} x_{p - 1} \]
we can again perform a test of
\[ H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0 \]
however, the test statistic and its distribution are no longer \(t\). We see that the test statistic takes the same form
\[ z = \frac{\hat{\beta}_j - \beta_j}{\text{SE}[\hat{\beta}_j]} \overset{\text{approx}}{\sim} N(0, 1) \]
but now we are performing a \(z\)-test, as the test statistic is approximated by a standard normal distribution, provided we have a large enough sample. (The \(t\)-test for ordinary linear regression, assuming the assumptions were correct, had an exact distribution for any sample size.)
We’ll skip some of the exact details of the calculations, as
R will obtain the standard error for us. The use of this
test will be extremely similar to the \(t\)-test for ordinary linear regression.
Essentially the only thing that changes is the distribution of the test
statistic.
Consider the following full model,
\[ \log\left(\frac{p({\bf x_i})}{1 - p({\bf x_i})}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(p-1)} x_{i(p-1)} \]

This model has \(p - 1\) predictors, for a total of \(p\) \(\beta\)-parameters. We will denote the MLE of these \(\beta\)-parameters as \(\hat{\beta}_{\text{Full}}\).
Now consider a null (or reduced) model,
\[ \log\left(\frac{p({\bf x_i})}{1 - p({\bf x_i})}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(q-1)} x_{i(q-1)} \]

where \(q < p\). This model has \(q - 1\) predictors, for a total of \(q\) \(\beta\)-parameters. We will denote the MLE of these \(\beta\)-parameters as \(\hat{\beta}_{\text{Null}}\).
The difference between these two models can be codified by the null hypothesis of a test.
\[ H_0: \beta_q = \beta_{q+1} = \cdots = \beta_{p - 1} = 0. \]
This implies that the reduced model is nested inside the full model.
We then define a test statistic, \(D\),
\[ D = -2 \log \left( \frac{L(\boldsymbol{\hat{\beta}_{\text{Null}}})} {L(\boldsymbol{\hat{\beta}_{\text{Full}}})} \right) = 2 \log \left( \frac{L(\boldsymbol{\hat{\beta}_{\text{Full}}})} {L(\boldsymbol{\hat{\beta}_{\text{Null}}})} \right) = 2 \left( \ell(\hat{\beta}_{\text{Full}}) - \ell(\hat{\beta}_{\text{Null}})\right) \]
where \(L\) denotes a likelihood and \(\ell\) denotes a log-likelihood. For a large enough sample, this test statistic has an approximate Chi-square distribution
\[ D \overset{\text{approx}}{\sim} \chi^2_{k} \]
where \(k = p - q\), the difference in number of parameters of the two models.
This test, which we will call the Likelihood-Ratio
Test, will be the analogue to the ANOVA \(F\)-test for logistic regression.
Interestingly, to perform the Likelihood-Ratio Test, we’ll actually again use the anova() function in R!
The Likelihood-Ratio Test is actually a rather general test, however, here we have presented a specific application to nested logistic regression models.
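In R, the statistic \(D\) can be computed directly from the log-likelihoods of two nested fitted models; a sketch, assuming hypothetical fitted objects fit_full and fit_null (we do exactly this for the Titanic data below):

D <- 2 * as.numeric(logLik(fit_full) - logLik(fit_null))  # likelihood-ratio statistic
k <- length(coef(fit_full)) - length(coef(fit_null))      # difference in number of parameters
pchisq(D, df = k, lower.tail = FALSE)                     # approximate p-value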
### Titanic Example

To illustrate the use of logistic regression, we will use a Titanic passenger data set, read in below from the file Titanic_all.csv.

The data are taken directly from the Titanic register of passengers and have been prepared for modelling. The data contain information about each passenger's age and sex, along with where they boarded the vessel, their passenger class, and their ticket price. For illustrative purposes we will only look at age and sex.

The data are augmented with information on whether the passenger was rescued (Survived = 1) or perished at sea (Survived = 0).
library(dplyr)  # provides %>% and select()

Titanic_all   <- read.csv("Titanic_all.csv")
Titanic_small <- Titanic_all %>% select(Survived, Sex, Age)
Titanic_small$one <- 1  # constant column; we will use this for a null model

# convert the class labels to factors in the full data set;
# Titanic_small keeps Survived and Sex as numeric 0/1, as head() shows below
Titanic_all$Survived <- as.factor(Titanic_all$Survived)
Titanic_all$Sex      <- as.factor(Titanic_all$Sex)
head(Titanic_small)
## Survived Sex Age one
## 1 0 1 27.50 1
## 2 1 0 47.50 1
## 3 1 0 32.50 1
## 4 1 0 43.75 1
## 5 0 1 43.75 1
## 6 0 1 35.00 1
Our logistic model will be used to estimate the probability that a passenger survived the Titanic, conditional on their age and sex. On the log-odds (logit) scale the model is:

\[ \log\left(\frac{P[\texttt{Survived} = 1]}{1 - P[\texttt{Survived} = 1]}\right) = \beta_0 + \beta_{\texttt{Sex}} x_{\texttt{Sex}} + \beta_{\texttt{Age}} x_{\texttt{Age}} \]
Let’s start with one independent variable, age.
Surv_Model = glm(Survived ~ Age, data = Titanic_small, family = binomial)

# plot the 0/1 responses (lightly jittered) and overlay the fitted probabilities
plot(jitter(Survived, factor = 0.1) ~ Age, data = Titanic_small, pch = 20,
     ylab = "Probability of survival", xlab = "Age of passenger")
grid()
curve(predict(Surv_Model, data.frame(Age = x), type = "response"),
      add = TRUE, col = "dodgerblue", lty = 2)
As before, we plot the data in addition to the estimated probabilities. Note that we have “jittered” the data to make them easier to visualize, but the data only take the values 0 and 1.

The plot indicates that younger passengers were more likely to survive.
coef(summary(Surv_Model))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.176717087 0.170132413 -1.038703 0.29894278
## Age -0.008192576 0.004301293 -1.904677 0.05682201
To test the significance of the coefficient for age,

\[ H_0: \beta_{\texttt{Age}} = 0, \]

we can read the p-value from the coefficient table above, obtained with coef(summary()) as we have done so many times before. Like the \(t\)-test for ordinary linear regression, this output gives the estimate of the parameter, its standard error, the relevant test statistic (\(z\)), and its p-value. Here the p-value (about 0.057) is fairly small, but not small enough to reject the null hypothesis at the 5% level of significance.
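That p-value is simply the two-sided normal tail probability of the \(z\) statistic, which we can verify from the coefficient table:

z_vals <- coef(summary(Surv_Model))[, "z value"]
2 * pnorm(abs(z_vals), lower.tail = FALSE)  # reproduces the Pr(>|z|) column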
We would also like to test the overall fit of the model. As in simple linear regression (with one predictor), the test of significance of the model should lead to the same conclusion as the test of significance of the single parameter, although the Wald \(z\)-test and the likelihood-ratio test are not numerically identical. Here we fit a null model with the constant predictor one (equal to 1 for all observations). The resulting fitted probability is the overall proportion of survivors, applied to every observation.
# the constant column one is aliased with the intercept, so this is effectively an intercept-only model
Surv_Model_Null = glm(Survived ~ one, data = Titanic_small, family = binomial)
We could manually calculate the test statistic,
chivalue <- -2 * as.numeric(logLik(Surv_Model_Null) - logLik(Surv_Model))  # likelihood-ratio statistic
This is \(\chi^2\) distributed with one degree of freedom, and we can compare it to the 5% critical value and compute the p-value:

qchisq(0.95, df = 1)  # 5% critical value with 1 degree of freedom
## [1] 3.841459
pchisq(q=chivalue, df=1, lower.tail=FALSE)
## [1] 0.05520155
We can also get this information from the anova() output. By specifying test = "LRT", R will use the likelihood-ratio test to compare the null model to the model containing Age.

anova(Surv_Model, test = "LRT")  # the Deviance column is twice the difference in log-likelihoods
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Survived
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 891 1187.6
## Age 1 3.676 890 1184.0 0.0552 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
### Test two nested models
When fitting logistic regression, we can use the same formula syntax as ordinary linear regression. So, to fit an additive model using all available predictors (in our case Age and Sex; note that the . notation also picks up the constant column one, which is aliased with the intercept and does not change the fit):
Surv_Model_additive = glm(Survived ~ ., data = Titanic_small, family = binomial)
For comparison, recall that the null model estimates only an intercept, the log odds of the overall survival proportion:

coef(summary(Surv_Model_Null))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4751075 0.06886333 -6.899282 5.2266e-12
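As a quick sanity check, that intercept should match the log odds of the sample proportion of survivors:

p_hat <- mean(Titanic_small$Survived)  # overall proportion of survivors
log(p_hat / (1 - p_hat))               # should match the intercept above, about -0.475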
We can then use the likelihood-ratio test to compare the model containing only Age to the additive model containing both Age and Sex. Specifically, we are testing

\[ H_0: \beta_{\texttt{Sex}} = 0 \]

in the model that contains both predictors.
We could manually calculate the test statistic,
-2 * as.numeric(logLik(Surv_Model) - logLik(Surv_Model_additive))
## [1] 266.254
This is \(\chi^2\) distributed with one degree of freedom, the difference in the number of estimated parameters between the two models.
Alternatively, we could utilize the anova() function. By specifying test = "LRT", R will use the likelihood-ratio test to compare the two models.
anova(Surv_Model, Surv_Model_additive, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: Survived ~ Age
## Model 2: Survived ~ Sex + Age + one
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 890 1183.95
## 2 889 917.69 1 266.25 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the test statistic that we had just calculated appears in the output. The very small p-value suggests that we prefer the larger model.
In our lab you will have the full data set to work with. The lab will guide you through some preliminary data exploration and then some models, leaving you to decide on the best model.