August 2023

Linear Models

  • Why I don’t like SPSS et al.: ANOVA, ANCOVA, regression, linear model…
    all with their own menu items and dialogue boxes
  • But these can all be seen as linear models (LM) in a regression framework.
  • I’ll be using R examples for this session.
  • R helps to adopt the LM/regression perspective.
  • It does not matter what software you use, as long as you know what you are doing.
  • Before we go into logistic regression models, let’s look at ordinary regression in R (ordinary least-squares or OLS regression).

Linear models, the R way

In a linear regression model, we are seeking to model

  • the effect of one or more independent variables (IV) or predictors
    \(X = (x_1, x_2, ..., x_i)\)
  • on one dependent variable (DV), response or outcome \(y\)

Model formulas in R look like so (beautiful, right?):

y ~ x               # univariable regression model
y ~ x1 + x2 + x3    # multivariable regression model

This is shorthand for

\[y = \beta_0 + \beta_1 x + \epsilon\]
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\]

Remember that, in OLS regression, the residual error \(\epsilon\) is assumed to be normally distributed.

A simple linear regression model in R

The plot shows simulated data, with

  • one continuous IV \((x)\); \(N=100\)
  • DV \((y)\) is also continuous

The simulated relationship \(y \sim x\) follows

\[y = 1 + 2x + \epsilon\]

Let’s fit the model:

m.1 <- lm(y ~ x)  # fit linear model
print(summary(m.1)$coefficients, digits = 3)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     1.04      0.105    9.88 2.21e-16
x               1.97      0.104   19.02 1.15e-34

The model fit recovers the intercept and slope nicely (\(\hat\beta_0 = 1.04, \hat\beta_1 = 1.97\)).
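For reference, here is a minimal, self-contained sketch of this kind of simulate-and-fit workflow; the seed, the distribution of \(x\) and the residual standard deviation are illustrative assumptions, not the values behind the plot above:

set.seed(1)                   # assumed seed, for reproducibility
x <- rnorm(100)               # one continuous IV, N = 100 (distribution assumed)
y <- 1 + 2 * x + rnorm(100)   # intercept 1, slope 2, normal residuals (SD assumed to be 1)
m.1 <- lm(y ~ x)              # fit the linear model, as above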

What if the outcome is binary?

Examples of binary outcomes:

  • tossing a coin: heads / tails.
  • cancer: yes / no — association with smoking, drinking, sex, age, social stratum, income…
  • vote cast in Brexit referendum: remain / leave — association with education, ethnicity, age, gender…
  • binary choice of mouse in a Y-maze: left / right, or
  • startle in fear conditioning: yes / no — association with number of training trials, sex, drug treatment, gene knock-outs…

We are typically interested in

  • the effects of multiple predictors, and in
  • their possible interactions.

Simulated cancer data

To keep things simple, we simulate only one continuous risk factor, \(x\). We can plot the relationship between \(x\) and \(y\) in different ways.
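For concreteness, here is a minimal sketch of one way such data could be generated; the ‘true’ intercept, slope, sample size and range of \(x\) below are assumptions for illustration only:

set.seed(2)                            # assumed seed
n <- 200                               # assumed sample size
x <- runif(n, 0, 60)                   # continuous risk factor, e.g. cigarettes/day (range assumed)
p <- plogis(-3.3 + 0.11 * x)           # assumed 'true' intercept and slope on the logit scale
y <- rbinom(n, size = 1, prob = p)     # binary outcome: 1 = cancer, 0 = healthy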

OLS regression on binary DV is bad

Our beautiful earlier linear regression model on continuous \(y\):

Linear regression model on binary \(y\): we are coding healthy = 0 and cancer = 1.

What is wrong with a t-test?

We could turn the problem around and think of the data as in the box plot on the right;
this suggests using a t-test or Wilcoxon test:

w <- wilcox.test(x ~ y)   # Wilcoxon rank-sum test of x between the two outcome groups
w$p.value
[1] 2.634e-20

This is kosher, but it works only for one continuous \(x\) at a time, whereas we often have multiple IVs. Moreover,

  • it does not tell us how \(x\) affects the likelihood of catching cancer;
  • instead, it tells us that cancer patients have a higher \(x\).

This is because we have switched the roles of \(x\) and \(y\) as DV and IV: we now do \(x\sim y\), not \(y\sim x\).

Time for Logistic Regression!

So how should we deal with binary DVs?

An intuitive understanding of LR

Staying with a single continuous IV:

  • in our sample, the frequency of ‘events’ (here, cancer cases) increases with \(x\).
  • you’d hope that the sample is representative of the population;
  • the data suggest that the probability of an event (here, catching cancer) increases with \(x\).

LR is about estimating how predictors (IVs, \(X\)) change the probability of an event.

Proportions, risks, probabilities

If \(x\) is a risk factor, increasing \(x\) increases the probability of cancer, \(P(y=\mathrm{cancer})\).

  • We want to estimate the effect of \(x\) on \(P(y=\mathrm{cancer})\).
  • The proportion of cancer cases, for any given value of \(x\), is an estimate of \(P\) given \(x\): conditional probability \(P(y=\mathrm{cancer}\ |\ x)\).
  • In the histogram, this is shown graphically for discrete bins of \(x\).
  • But LR can deal with continuous \(x\), just like ordinary regression.

LR is about estimating conditional probabilities of events occurring: \[P(y\ |\ X)\]

Probabilities are odd (pun…)

In LR, instead of modelling \(y \sim x\), we seek to model \(P(y) \sim x\).

But we have a problem: linear models are additive, probabilities are not.

Linear models work like this

  • Assume (toy example) that smoking 10 ciggies/day increases the probability of getting cancer from 20% to 75%.
  • So it would seem that 10 ciggies/day give you a 55% increase in the probability;
  • now scale this up to 30 ciggies/day: three such increments, i.e. 55% + 55% + 55% = 165%;
  • and, added to the baseline probability of 20% for a non-smoker: 20% + 3 × 55% = 185%.

This cannot be right!

  • Probabilities are bounded between 0 and 1: \(0 \leq P(y) \leq 1\)
  • (Linear combinations of) predictors can be arbitrarily large or small: \(-\infty < x < \infty\)

Probabilities, odds and logits

How can we linearly map our predictors onto probabilities?
Can we transform \(P(y)\)?

Odds, e.g., \(3:1,\ 1:7,\ 100:1\): the odds, given \(P\), are \[\mathrm{odds} = \frac{P}{1-P}\]

  • Odds are multiplicative: \(\frac{1}{3} \times \frac{1}{7} = \frac{1}{21},\ \frac{1}{10} \times \frac{10}{1} = \frac{1}{1}\) etc.
  • Odds are still bounded on one side: \(0 < \frac{P}{1-P} < \infty\)
  • Remember that log-transformation turns multiplications into additions…

Log-odds: \(\log \frac{P}{1-P}\) are unbounded: \(-\infty < \log\frac{P}{1-P} < \infty\)

The logit transformation: \[\mathrm{logit}\ P = \log (\mathrm{odds}) = \log \frac{P}{1-P}\]

Logits and the logistic function

From probabilities to logits: \[\mathrm{logit}\ P = \log \frac{P}{1-P}\]

From logits to probabilities: \[P = \frac{e^t}{1+e^t},\ \mathrm{with\ } t=\mathrm{logit\ }P\]
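In R, these two transformations are available as qlogis() (probability to logit) and plogis() (logit to probability); a quick sanity check:

qlogis(0.75)            # logit of 0.75: log(0.75/0.25) = log(3) ≈ 1.099
plogis(1.099)           # back to a probability: ≈ 0.75
plogis(qlogis(0.75))    # the round trip recovers 0.75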

LR is a Generalised Linear Model (GLM)

Logistic regression assumes that the effect of the predictors \(X\) on the probability of an event \(P(y|X)\) is linear on the logit scale.

This permits us to ‘couple’ or link a linear model to outcome probabilities: \[\mathrm{logit}\ P(y|x) \sim \beta_0 + \beta_1 x\]

or, with multiple predictors \(x_1, x_2, ..., x_i\): \[\mathrm{logit}\ P(y|X) \sim \beta_0 + \beta_1 x_1 + ... +\beta_i x_i\]

LR is an example of a Generalised Linear Model (GLM) = generalisation of the linear model.

The logit is the so-called link function; its inverse, the logistic function \(\sigma(t) = \frac{e^t}{1+e^t}\), provides the (nonlinear) link between the linear model equation \((\beta_0 + \beta_1 x_1 + ...)\) and \(P(y)\).

What LR can do for you

The link function trick permits “porting” the power and flexibility of linear models to the analysis of binary outcomes:

  • estimate the effects of multiple IVs on outcome probabilities.
  • model main effects and interactions.

LR makes no assumption about the distribution of the predictors (IVs):

  • IVs can be categorical, ordinal or continuous.
  • The only assumption is linearity of effects on the log-odds scale (this can be relaxed by, e.g., modelling continuous effects as cubic splines).

LR models can

  • detect associations between IVs and outcomes (hypothesis testing).
  • estimate the effect size of particular IVs.
  • predict the probability of future events given particular values for IVs.

Fitting our LR model in R

We use the same model formula y ~ x, but with the R function glm instead of lm.
We also need to specify the distribution family; for LR this is family = "binomial", which uses the logit link by default:

m.3 <- glm(y ~ x, family = "binomial")  # fit the LR model
print(summary(m.3)$coefficients)        # print the coefficients
            Estimate Std. Error z value  Pr(>|z|)
(Intercept)  -3.2756    0.48950  -6.692 2.205e-11
x             0.1129    0.01511   7.469 8.109e-14

Interpreting the coefficients: The coefficients indicate how each IV changes the probability of the outcome:

  • Positive coefficient \((\beta_i>0)\) means an increase in \(x_i\) makes event more likely;
  • Negative coefficient \((\beta_i<0)\) means an increase in \(x_i\) makes event less likely.
  • Coefficient indicates effect size on the logit scale: \(\beta_i=1\) means changing \(x_i\) by 1 changes the logit by 1, and hence the odds become \(e^1=2.7183\times\) greater.
  • In our toy example, \(\beta_1 \approx 0.11\), so an increase of about 9 cigarettes / day shifts the log-odds by 1 towards cancer (see the odds-ratio snippet below).
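The effect is often easier to read on the odds scale, obtained by exponentiating the coefficients; a quick sketch using the model m.3 fitted above:

exp(coef(m.3))        # exponentiated coefficients = odds ratios
exp(coef(m.3)["x"])   # ≈ 1.12: each additional cigarette/day increases the odds of cancer by about 12%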

Predicting probabilities of new cases

We can now also predict the future!
E.g., what is

  • \(P(y=\mathrm{cancer}\ |\ x = 15)\) (15 fags / day) or
  • \(P(y=\mathrm{cancer}\ |\ x = 40)\) (40 fags / day)?

[note this is a fictive toy model!!!]

predict(m.3, newdata = data.frame(x=c(15, 40)), type = "response")
     1      2 
0.1705 0.7755 

So, the model predicts that,

  • with \(x=15\) cigarettes / day, your \(P(y=\mathrm{cancer})=17.05\%\);
  • with \(x=40\) cigarettes / day, your \(P(y=\mathrm{cancer})=77.55\%\).
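The same predictions can be checked ‘by hand’ by pushing the linear predictor through the inverse link (plogis); small differences are due to rounding of the printed coefficients:

plogis(-3.2756 + 0.1129 * c(15, 40))   # inverse-logit of beta_0 + beta_1 * x
# ≈ 0.17 and 0.78, matching predict() above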

Multiple predictors: Titanic survival

Outcome (DV): survived yes / no

Predictors (IVs):

  • sex: female / male
  • pclass = passenger class on ship: 1st, 2nd, 3rd
  • age.cat = age category: under 16, 16–55 [reference], over 55

Fit the model using glm:

m.4 <- glm(survived ~ sex + pclass + age.cat, data=Titanic3, family = "binomial")

Titanic survival model results

Type-II analysis of deviance (likelihood-ratio tests)

drop1(m.4, test = "Chisq")
Single term deletions

Model:
survived ~ sex + pclass + age.cat
        Df Deviance  AIC LRT Pr(>Chi)    
<none>         1226 1238                 
sex      1     1566 1576 340  < 2e-16 ***
pclass   2     1357 1365 132  < 2e-16 ***
age.cat  2     1257 1265  31  1.5e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What does it mean?

  • Sex, passenger class and age are all associated with survival;
  • But how? You need to look at the estimates for the coefficients!

Titanic survival model results

Model estimates of coefficients:

summary(m.4)$coefficients
               Estimate Std. Error z value  Pr(>|z|)
(Intercept)       2.196     0.1806  12.158 5.178e-34
sexmale          -2.500     0.1489 -16.788 2.984e-63
pclass2          -1.054     0.2040  -5.166 2.394e-07
pclass3          -1.967     0.1813 -10.851 1.983e-27
age.catunder16    1.140     0.2446   4.661 3.142e-06
age.catover55    -1.018     0.3637  -2.799 5.130e-03

  • Men were much less likely to survive than women: odds were \(e^{2.5}=12.2\times\) lower for men.
  • 2nd class passengers were less likely to survive than 1st class passengers, and 3rd class passengers were even less likely to survive.
  • Children (under 16) were more likely to survive than the 16–55 reference group, while over-55s were less likely to survive; the odds ratios below quantify this.
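A convenient way to report these effects is as odds ratios with (Wald) confidence intervals; a sketch, assuming m.4 as fitted above:

exp(cbind(OR = coef(m.4), confint.default(m.4)))   # odds ratios with 95% Wald confidence intervals
# e.g. OR for sexmale = exp(-2.5) ≈ 0.08, i.e. roughly 12 times lower odds of survival for men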

LR is not necessarily about risks

Thinking in terms of risk factors is extremely helpful in getting at the essence of LR. But LR is not necessarily about risk as we know it.

  • ‘Did the rat press the lever?’ is a binary outcome.

  • ‘Does drug treatment increase the risk of the rat pressing the lever?’ — ???
    not the most sensible way of thinking about rat behaviour… but

  • ‘Does drug treatment increase the probability of the rat pressing the lever?’ = are drug-treated rats more likely to press the lever — eminently sensible!!!

I estimate the probability that, at some point in your research career, your data are best analysed by logistic regression, as

\[P(method\ needed = \mathrm{LR}\ |\ data) \sim 0.99\]

Pointers and advanced topics

Things you may want to explore, but that we did not have time for in the lecture

Logits what???

If \(\mathrm{logit} = 0\), odds are \(e^0=1:1\),
and \(P=0.5\)

If \(\mathrm{logit} = -1\), odds are \(e^{-1}=0.37:1\),
and \(P=0.27\)

If \(\mathrm{logit} = 1\), odds are \(e^1=2.72:1\),
and \(P=0.73\)

If \(\mathrm{logit} = 4\), odds are \(e^4=54.6:1\),
and \(P=0.98\)
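These conversions are easy to check in R with the inverse-logit function plogis():

plogis(c(0, -1, 1, 4))   # probabilities for logits 0, -1, 1 and 4
# 0.50 0.27 0.73 0.98  (rounded to two decimals)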

Sample size requirements

If you fit a model with too many predictor variables and not enough observations, bad things happen:

  • the model fit may fail altogether (due to ‘complete separation’);
  • the standard errors of the estimated coefficients become very large, and the predictors come out ‘nonsignificant’;
  • the model describes the sample very well, but predicts very poorly on future observations: this is known as overfitting.

So how many observations do you need? LR is limited by the number of observations in the smaller of the two outcome groups: the limiting sample size.

  • if you have N=500 healthy and N=15 cancer subjects, then the effective sample size is only 15…
  • you need at least about 10–15 ‘events per variable’ (EPV).

So in the above example, you can just about sensibly fit a model with a single predictor!

LR as a classifier

One can consider LR as a classification technique, but seeing it as only a classification technique risks missing the point…

  • An LR model of the association between Brexit voting behaviour and socio-economic predictors (social class, income, education, gender…) can be interpreted as classifying voters into Remainers and Brexiteers.
  • this provides a link to, and a simple example of, machine learning;
  • the model can then predict class-membership probabilities from the predictor values.

Classification is then based on choosing a cut-off probability, often 0.5 (50%):

  • Predicted Remainer: \(P(vote = \mathrm{leave}\ |\ X) < 0.5\);
  • Predicted Brexiteer: \(P(vote = \mathrm{leave}\ |\ X) \geq 0.5\).

Comparing predicted and actual voting behaviour gives us a crude indicator of model fit: ‘percent correctly classified’.
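For the simulated cancer model m.3, for example, a crude ‘percent correctly classified’ could be computed like this (assuming y is the 0 = healthy / 1 = cancer outcome used earlier):

pred.class <- as.numeric(fitted(m.3) >= 0.5)   # classify each subject at the 0.5 cut-off
mean(pred.class == y)                          # proportion correctly classified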

Quality measures for LR models

‘Percent correctly classified’ is very crude and often misleading: the 0.5 cut-off is arbitrary, and LR is not really about hard cut-offs anyway.

Moving the threshold from 0 to 1, and plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity), traces out a receiver operating characteristic (ROC) curve.

Area Under the (ROC) Curve = AUC:

  • AUC = 1.0: perfect classification (100% correct)
  • AUC = 0.5: no better than random guessing.
  • AUC = 0.0: perfectly wrong (100% incorrect)

AUC for our simulated cancer data model:

Area under the curve: 0.878
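An AUC like this can be computed, for example, with the pROC package (a sketch using y and m.3 from the simulated cancer example; the package choice is an assumption):

library(pROC)                    # assumed package for ROC/AUC computation
roc.m3 <- roc(y, fitted(m.3))    # ROC curve from observed outcomes and fitted probabilities
auc(roc.m3)                      # area under the ROC curve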

Extensions of the LR model

Ordinal logistic regression (OLR): The DV is on an ordinal scale, e.g., university degree classes (First, 2.1, 2.2, Third…). This can be extended to an arbitrary number of ordered classes, which provides a (little-known) semi-parametric alternative to OLS regression models.

Multinomial logistic regression: Extends from the binary (two-class) case to multi-class problems, i.e., with three or more discrete outcomes. Unlike in OLR, the DV is not ordinal.

Cox regression survival models: Closely related to OLR; these permit modelling survival over time without having to fit a parametric survival function.