These notes are modified from your reading
by David Dalpiaz, Department of Statistics at The University of Illinois
After reading chapter 17 you will be able to use generalized linear models, and in particular logistic regression, to model response variables that are not normally distributed, such as binary outcomes.
So far we have only considered the linear regression model for numeric response variables. What about response variables that only take integer values? What about a response variable that is categorical? Can we use linear models in these situations? Yes! The model that we have been using, which we will call ordinary linear regression, is actually a specific case of the more general, generalized linear model.
So far, we’ve had response variables that, conditioned on the predictors, were modeled using a normal distribution with a mean that is some linear combination of the predictors. This linear combination is what made a linear model “linear.”
\[ Y \mid {\bf X} = {\bf x} \sim N(\beta_0 + \beta_1x_1 + \ldots + \beta_{p - 1}x_{p - 1}, \ \sigma^2) \]
Now we’ll allow for two modifications of this situation, which will let us use linear models in many more situations. Instead of using a normal distribution for the response conditioned on the predictors, we’ll allow for other distributions. Also, instead of the conditional mean being a linear combination of the predictors, it can be some function of a linear combination of the predictors.
In general, a generalized linear model has three parts:

- a distribution of the response conditioned on the predictors,
- a linear combination of the \(p - 1\) predictors, which we write as \(\eta({\bf x})\),

\[\eta({\bf x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_{p - 1} x_{p - 1}\]

- and a link function, \(g()\), that relates \(\eta({\bf x})\) to the mean of the conditional distribution of the response,

\[ \eta({\bf x}) = g\left(\text{E}[Y \mid {\bf X} = {\bf x}]\right). \]
The following table summarizes three examples of a generalized linear model:
| | Linear Regression | Poisson Regression | Logistic Regression |
|---|---|---|---|
| \(Y \mid {\bf X} = {\bf x}\) | \(N(\mu({\bf x}), \sigma^2)\) | \(\text{Pois}(\lambda({\bf x}))\) | \(\text{Bern}(p({\bf x}))\) |
| Distribution Name | Normal | Poisson | Bernoulli (Binomial) |
| \(\text{E}[Y \mid {\bf X} = {\bf x}]\) | \(\mu({\bf x})\) | \(\lambda({\bf x})\) | \(p({\bf x})\) |
| Support | Real: \((-\infty, \infty)\) | Integer: \(0, 1, 2, \ldots\) | Integer: \(0, 1\) |
| Usage | Numeric Data | Count (Integer) Data | Binary (Class) Data |
| Link Name | Identity | Log | Logit |
| Link Function | \(\eta({\bf x}) = \mu({\bf x})\) | \(\eta({\bf x}) = \log(\lambda({\bf x}))\) | \(\eta({\bf x}) = \log \left(\frac{p({\bf x})}{1 - p({\bf x})} \right)\) |
| Mean Function | \(\mu({\bf x}) = \eta({\bf x})\) | \(\lambda({\bf x}) = e^{\eta({\bf x})}\) | \(p({\bf x}) = \frac{e^{\eta({\bf x})}}{1 + e^{\eta({\bf x})}} = \frac{1}{1 + e^{-\eta({\bf x})}}\) |
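As a quick illustration of how these three cases map onto R's glm() function, here is a minimal sketch using each corresponding family argument; the data frame dat and its columns are hypothetical placeholders.

# hypothetical data frame `dat` with a numeric response y_num, a count
# response y_count, a 0/1 response y_bin, and predictors x1 and x2
fit_norm <- glm(y_num   ~ x1 + x2, data = dat, family = gaussian)  # identity link
fit_pois <- glm(y_count ~ x1 + x2, data = dat, family = poisson)   # log link
fit_bern <- glm(y_bin   ~ x1 + x2, data = dat, family = binomial)  # logit link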
Like ordinary linear regression, we will seek to “fit” the model by estimating the \(\beta\) parameters. To do so, we will use the method of maximum likelihood.
Note that a Bernoulli distribution is a specific case of a binomial distribution where the \(n\) parameter of a binomial is \(1\). Binomial regression is also possible, but we’ll focus on the much more popular Bernoulli case.
So, in general, GLMs relate the mean of the response to a linear combination of the predictors, \(\eta({\bf x})\), through the use of a link function, \(g()\). That is,
\[ \eta({\bf x}) = g\left(\text{E}[Y \mid {\bf X} = {\bf x}]\right). \]
The mean is then
\[ \text{E}[Y \mid {\bf X} = {\bf x}] = g^{-1}(\eta({\bf x})). \]
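In R these two scales correspond to the two types of prediction available from a fitted glm() object: type = "link" returns \(\eta({\bf x})\), while type = "response" returns \(g^{-1}(\eta({\bf x}))\), the estimated mean. A small sketch, reusing the hypothetical Poisson fit fit_pois from above:

eta_hat  <- predict(fit_pois, type = "link")      # eta(x), the linear predictor
mean_hat <- predict(fit_pois, type = "response")  # g^{-1}(eta(x)), the estimated mean
all.equal(mean_hat, exp(eta_hat))                 # TRUE, since the inverse of the log link is exp()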
To illustrate the use of a GLM we’ll focus on the case of a binary response variable coded using \(0\) and \(1\). In practice, these \(0\)s and \(1\)s will code for two classes such as yes/no, cat/dog, sick/healthy, etc.
\[ Y = \begin{cases} 1 & \text{yes} \\ 0 & \text{no} \end{cases} \]
First, we define some notation that we will use throughout.
\[ p({\bf x}) = P[Y = 1 \mid {\bf X} = {\bf x}] \]
With a binary (Bernoulli) response, we’ll mostly focus on the case when \(Y = 1\), since with only two possibilities, it is trivial to obtain probabilities when \(Y = 0\).
\[ P[Y = 0 \mid {\bf X} = {\bf x}] + P[Y = 1 \mid {\bf X} = {\bf x}] = 1 \]
\[ P[Y = 0 \mid {\bf X} = {\bf x}] = 1 - p({\bf x}) \]
We now define the logistic regression model.
\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p - 1} x_{p - 1} \]
Immediately we notice some similarities to ordinary linear regression, in particular, the right hand side. This is our usual linear combination of the predictors. We have our usual \(p - 1\) predictors for a total of \(p\) \(\beta\) parameters. (Note, many more machine learning focused texts will use \(p\) as the number of predictors. This is an arbitrary choice, but you should be aware of it.)
The left hand side is called the log odds, which is the log of the odds. The odds are the probability for a positive event \((Y = 1)\) divided by the probability of a negative event \((Y = 0)\). So when the odds are \(1\), the two events have equal probability. Odds greater than \(1\) favor a positive event. The opposite is true when the odds are less than \(1\).
\[ \frac{p({\bf x})}{1 - p({\bf x})} = \frac{P[Y = 1 \mid {\bf X} = {\bf x}]}{P[Y = 0 \mid {\bf X} = {\bf x}]} \]
Essentially, the log odds are the logit transform applied to \(p({\bf x})\).
\[ \text{logit}(\xi) = \log\left(\frac{\xi}{1 - \xi}\right) \]
It will also be useful to define the inverse logit, otherwise known as the “logistic” or sigmoid function.
\[ \text{logit}^{-1}(\xi) = \frac{e^\xi}{1 + e^{\xi}} = \frac{1}{1 + e^{-\xi}} \]
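Both transformations are easy to write and check in R; a short sketch (the names logit and inv_logit are our own, not base R functions):

logit     <- function(xi) log(xi / (1 - xi))   # maps (0, 1) to the real line
inv_logit <- function(xi) 1 / (1 + exp(-xi))   # maps the real line back to (0, 1)

inv_logit(logit(0.75))  # returns 0.75, confirming the two functions are inverses

(Base R provides the same transformations as qlogis() and plogis().)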
In ordinary linear regression, while our main focus is on estimating the mean, \(\beta_0 + \beta_1x_1 + \ldots + \beta_{p-1}x_{p-1}\), there is also another parameter, \(\sigma^2\), which needs to be estimated. This is the result of the normal distribution having two parameters.

With logistic regression, which uses the Bernoulli distribution, we only need to estimate the Bernoulli distribution’s single parameter \(p({\bf x})\), which happens to be its mean.

\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p - 1} x_{p - 1} \]
So even though we introduced ordinary linear regression first, in some ways, logistic regression is actually simpler.
Note that applying the inverse logit transformation allows us to obtain an expression for \(p({\bf x})\).
\[ p({\bf x}) = P[Y = 1 \mid {\bf X} = {\bf x}] = \frac{e^{\beta_0 + \beta_1 x_{1} + \cdots + \beta_{p-1} x_{(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{1} + \cdots + \beta_{p-1} x_{(p-1)}}} \]
With \(n\) observations, we write the model indexed with \(i\) to note that it is being applied to each observation.
\[ \log\left(\frac{p({\bf x_i})}{1 - p({\bf x_i})}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)} \]
We can apply the inverse logit transformation to obtain \(P[Y_i = 1 \mid {\bf X_i} = {\bf x_i}]\) for each observation. Since these are probabilities, it’s good that we used a function that returns values between \(0\) and \(1\).
\[ p({\bf x_i}) = P[Y_i = 1 \mid {\bf X_i} = {\bf x_i}] = \frac{e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}} \]
\[ 1 - p({\bf x_i}) = P[Y_i = 0 \mid {\bf X} = {\bf x_i}] = \frac{1}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}} \]
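Given a candidate coefficient vector, these probabilities are straightforward to compute; a sketch, assuming a hypothetical design matrix X (with a leading column of ones for the intercept) and coefficient vector beta:

eta <- drop(X %*% beta)     # eta(x_i) for each observation
p   <- 1 / (1 + exp(-eta))  # P[Y_i = 1 | X_i = x_i]
q   <- 1 - p                # P[Y_i = 0 | X_i = x_i]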
To “fit” this model, that is estimate the \(\beta\) parameters, we will use maximum likelihood.
\[ \boldsymbol{{\beta}} = [\beta_0, \beta_1, \beta_2, \beta_3, \ldots, \beta_{p - 1}] \]
We first write the likelihood given the observed data.
\[ L(\boldsymbol{{\beta}}) = \prod_{i = 1}^{n} P[Y_i = y_i \mid {\bf X_i} = {\bf x_i}] \]
This is already technically a function of the \(\beta\) parameters, but we’ll do some rearrangement to make this more explicit.
\[ L(\boldsymbol{{\beta}}) = \prod_{i = 1}^{n} p({\bf x_i})^{y_i} (1 - p({\bf x_i}))^{(1 - y_i)} \]
\[ L(\boldsymbol{\beta}) = \prod_{i : y_i = 1} p({\bf x_i}) \prod_{j : y_j = 0} (1 - p({\bf x_j})) \]

Working with this non-linear product is simplified by taking logs: the log turns the product into a sum, and because the log is a monotone function, maximizing the log-likelihood also maximizes the likelihood. Each observation contributes \(\log p({\bf x_i})\) if it is a \(1\) (a "yes") and \(\log(1 - p({\bf x_j}))\) if it is a \(0\) (a "no").

\[ \ell(\boldsymbol{\beta}) = \log L(\boldsymbol{\beta}) = \sum_{i : y_i = 1} \log p({\bf x_i}) + \sum_{j : y_j = 0} \log\left(1 - p({\bf x_j})\right) \]
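As a check on this formula, here is a sketch of the log-likelihood written as an R function; the design matrix X (with an intercept column) and 0/1 response y are hypothetical, but at the maximum likelihood estimates its value should agree with logLik() applied to the corresponding glm() fit.

# log-likelihood of a logistic regression model
loglik_logistic <- function(beta, X, y) {
  eta <- drop(X %*% beta)     # linear predictor for each observation
  p   <- 1 / (1 + exp(-eta))  # fitted probabilities p(x_i)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

There is no closed-form solution for the maximizing \(\boldsymbol{\beta}\); R's glm() maximizes this numerically using iteratively reweighted least squares.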
While the logistic regression model isn’t exactly the same as the ordinary linear regression model, because they both use a linear combination of the predictors
\[ \eta({\bf x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_{p - 1} x_{p - 1} \]
working with logistic regression is very similar. Many of the things we did with ordinary linear regression can be done with logistic regression in a very similar fashion, for example, testing the significance of individual parameters and comparing nested models. After some introduction to the new tests, we’ll demonstrate each of these using an example.
Like ordinary linear regression, we’ll want to be able to perform hypothesis testing. We’ll again want both single parameter, and multiple parameter tests.
In ordinary linear regression, we performed the test of
\[ H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0 \]
using a \(t\)-test.
For the logistic regression model,
\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p - 1} x_{p - 1} \]
we can again perform a test of
\[ H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0 \]
however, the test statistic and its distribution are no longer \(t\). We see that the test statistic takes the same form
\[ z = \frac{\hat{\beta}_j - \beta_j}{\text{SE}[\hat{\beta}_j]} \overset{\text{approx}}{\sim} N(0, 1) \]
but now we are performing a \(z\)-test, as the test statistic is approximated by a standard normal distribution, provided we have a large enough sample. (The \(t\)-test for ordinary linear regression, assuming the assumptions were correct, had an exact distribution for any sample size.)
We’ll skip some of the exact details of the calculations, as
R will obtain the standard error for us. The use of this
test will be extremely similar to the \(t\)-test for ordinary linear regression.
Essentially the only thing that changes is the distribution of the test
statistic.
Consider the following full model,
\[ \log\left(\frac{p({\bf x_i})}{1 - p({\bf x_i})}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(p-1)} x_{i(p-1)} \]

This model has \(p - 1\) predictors, for a total of \(p\) \(\beta\)-parameters. We will denote the MLE of these \(\beta\)-parameters as \(\hat{\beta}_{\text{Full}}\).
Now consider a null (or reduced) model,
\[ \log\left(\frac{p({\bf x_i})}{1 - p({\bf x_i})}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(q-1)} x_{i(q-1)} \]

where \(q < p\). This model has \(q - 1\) predictors, for a total of \(q\) \(\beta\)-parameters. We will denote the MLE of these \(\beta\)-parameters as \(\hat{\beta}_{\text{Null}}\).
The difference between these two models can be codified by the null hypothesis of a test.
\[ H_0: \beta_q = \beta_{q+1} = \cdots = \beta_{p - 1} = 0. \]
This implies that the reduced model is nested inside the full model.
We then define a test statistic, \(D\),
\[ D = -2 \log \left( \frac{L(\boldsymbol{\hat{\beta}_{\text{Null}}})} {L(\boldsymbol{\hat{\beta}_{\text{Full}}})} \right) = 2 \log \left( \frac{L(\boldsymbol{\hat{\beta}_{\text{Full}}})} {L(\boldsymbol{\hat{\beta}_{\text{Null}}})} \right) = 2 \left( \ell(\hat{\beta}_{\text{Full}}) - \ell(\hat{\beta}_{\text{Null}})\right) \]
where \(L\) denotes a likelihood and \(\ell\) denotes a log-likelihood. For a large enough sample, this test statistic has an approximate Chi-square distribution
\[ D \overset{\text{approx}}{\sim} \chi^2_{k} \]
where \(k = p - q\), the difference in number of parameters of the two models.
This test, which we will call the Likelihood-Ratio
Test, will be the analogue to the ANOVA \(F\)-test for logistic regression.
Interestingly, to perform the Likelihood-Ratio Test, we’ll actually again use the anova() function in R!
The Likelihood-Ratio Test is actually a rather general test, however, here we have presented a specific application to nested logistic regression models.
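In R, the statistic \(D\) can be computed directly from the log-likelihoods of two nested fitted models; a sketch, assuming hypothetical fitted objects fit_full and fit_null (we do exactly this for the Titanic data below):

D <- 2 * as.numeric(logLik(fit_full) - logLik(fit_null))  # likelihood-ratio statistic
k <- length(coef(fit_full)) - length(coef(fit_null))      # difference in number of parameters
pchisq(D, df = k, lower.tail = FALSE)                     # approximate p-value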
### Titanic Example

To illustrate the use of logistic regression, we will use a Titanic passenger data set, read in below from the file Titanic_all.csv.

The data are taken directly from the Titanic register of passengers and have been prepared for modelling. The data contain information about each passenger's age and sex, along with where they boarded the vessel, their passenger class, and their ticket price. For illustrative purposes we will only look at age and sex.

The data are augmented with information on whether the passenger was rescued (Survived = 1) or perished at sea (Survived = 0).
library(dplyr)  # provides %>% and select()

Titanic_all   <- read.csv("Titanic_all.csv")
Titanic_small <- Titanic_all %>% select(Survived, Sex, Age)
Titanic_small$one <- 1  # constant column; we will use this for a null model

# convert the class labels to factors in the full data set;
# Titanic_small keeps Survived and Sex as numeric 0/1, as head() shows below
Titanic_all$Survived <- as.factor(Titanic_all$Survived)
Titanic_all$Sex      <- as.factor(Titanic_all$Sex)
head(Titanic_small)
## Survived Sex Age one
## 1 0 1 27.50 1
## 2 1 0 47.50 1
## 3 1 0 32.50 1
## 4 1 0 43.75 1
## 5 0 1 43.75 1
## 6 0 1 35.00 1
Our logistic model will be used to estimate the probability that a passenger survived the Titanic, conditional on their age and sex. On the log-odds (logit) scale the model is:

\[ \log\left(\frac{P[\texttt{Survived} = 1]}{1 - P[\texttt{Survived} = 1]}\right) = \beta_0 + \beta_{\texttt{Sex}} x_{\texttt{Sex}} + \beta_{\texttt{Age}} x_{\texttt{Age}} \]
Let’s start with one independent variable, age.
Surv_Model = glm(Survived ~ Age, data = Titanic_small, family = binomial)

# plot the 0/1 responses (lightly jittered) and overlay the fitted probabilities
plot(jitter(Survived, factor = 0.1) ~ Age, data = Titanic_small, pch = 20,
     ylab = "Probability of survival", xlab = "Age of passenger")
grid()
curve(predict(Surv_Model, data.frame(Age = x), type = "response"),
      add = TRUE, col = "dodgerblue", lty = 2)
As before, we plot the data in addition to the estimated probabilities. Note that we have “jittered” the data to make them easier to visualize, but the data only take the values 0 and 1.

The plot indicates that younger passengers were more likely to survive.
coef(summary(Surv_Model))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.176717087 0.170132413 -1.038703 0.29894278
## Age -0.008192576 0.004301293 -1.904677 0.05682201
To test the significance of the coefficient for age,

\[ H_0: \beta_{\texttt{Age}} = 0, \]

we can read the p-value from the coefficient table above, obtained with coef(summary()) as we have done so many times before. Like the \(t\)-test for ordinary linear regression, this output gives the estimate of the parameter, its standard error, the relevant test statistic (\(z\)), and its p-value. Here the p-value (about 0.057) is fairly small, but not small enough to reject the null hypothesis at the 5% level of significance.
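That p-value is simply the two-sided normal tail probability of the \(z\) statistic, which we can verify from the coefficient table:

z_vals <- coef(summary(Surv_Model))[, "z value"]
2 * pnorm(abs(z_vals), lower.tail = FALSE)  # reproduces the Pr(>|z|) column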
We would also like to test the overall fit of the model. As in simple linear regression (with one predictor), the test of significance of the model should lead to the same conclusion as the test of significance of the single parameter, although the Wald \(z\)-test and the likelihood-ratio test are not numerically identical. Here we fit a null model with the constant predictor one (equal to 1 for all observations). The resulting fitted probability is the overall proportion of survivors, applied to every observation.
# the constant column one is aliased with the intercept, so this is effectively an intercept-only model
Surv_Model_Null = glm(Survived ~ one, data = Titanic_small, family = binomial)
We could manually calculate the test statistic,
chivalue <- -2 * as.numeric(logLik(Surv_Model_Null) - logLik(Surv_Model))  # likelihood-ratio statistic
This is \(\chi^2\) distributed with one degree of freedom, and we can compare it to the 5% critical value and compute the p-value:

qchisq(0.95, df = 1)  # 5% critical value with 1 degree of freedom
## [1] 3.841459
pchisq(q=chivalue, df=1, lower.tail=FALSE)
## [1] 0.05520155
We can also get this information from the anova() output. By specifying test = "LRT", R will use the likelihood-ratio test to compare the null model to the model containing Age.

anova(Surv_Model, test = "LRT")  # the Deviance column is twice the difference in log-likelihoods
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Survived
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 891 1187.6
## Age 1 3.676 890 1184.0 0.0552 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
### Test two nested models
When fitting logistic regression, we can use the same formula syntax as ordinary linear regression. So, to fit an additive model using all available predictors (in our case Age and Sex; note that the . notation also picks up the constant column one, which is aliased with the intercept and does not change the fit):
Surv_Model_additive = glm(Survived ~ ., data = Titanic_small, family = binomial)
For comparison, recall that the null model estimates only an intercept, the log odds of the overall survival proportion:

coef(summary(Surv_Model_Null))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4751075 0.06886333 -6.899282 5.2266e-12
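As a quick sanity check, that intercept should match the log odds of the sample proportion of survivors:

p_hat <- mean(Titanic_small$Survived)  # overall proportion of survivors
log(p_hat / (1 - p_hat))               # should match the intercept above, about -0.475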
We can then use the likelihood-ratio test to compare the model containing only Age to the additive model containing both Age and Sex. Specifically, we are testing

\[ H_0: \beta_{\texttt{Sex}} = 0 \]

in the model that contains both predictors.
We could manually calculate the test statistic,
-2 * as.numeric(logLik(Surv_Model) - logLik(Surv_Model_additive))
## [1] 266.254
This is \(\chi^2\) distributed with one degree of freedom, the difference in the number of estimated parameters between the two models.
Alternatively, we could utilize the anova() function. By specifying test = "LRT", R will use the likelihood-ratio test to compare the two models.
anova(Surv_Model, Surv_Model_additive, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: Survived ~ Age
## Model 2: Survived ~ Sex + Age + one
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 890 1183.95
## 2 889 917.69 1 266.25 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the test statistic that we had just calculated appears in the output. The very small p-value suggests that we prefer the larger model.
In our lab you will have the full data set to work with. The lab will guide you through some preliminary data exploration and then some models, leaving you to decide on the best model.