Logistic regression is a statistical method used to model the relationship between a binary outcome (event) and one or more independent variables, which may be categorical or continuous.
It is used both as a classification algorithm in machine learning and as a statistical tool for inference, for example when we want to quantify how certain explanatory variables affect a binary outcome.
The model is such that
\[\text{logit}(Y) = \beta^\top X\] where
\[ Y = \begin{cases} 1, & \text{if the event occurred} \\ 0, & \text{if the event did not occur} \end{cases} \]
\(X\) is a vector of predictors/explanatory variables.
\(\beta\) is the vector of regression coefficients.
The logit function is a log transformation of the event probability into the log-odds scale. It is defined as:
\[\text{logit}(Y) = \log\bigg(\frac{P(Y=1 \mid X)}{1-P(Y=1\mid X)}\bigg) = \log\bigg(\frac{p}{1-p}\bigg)\] where \(P(Y=1\mid X)=p\) is the probability that the event occurs conditional on \(X\).
\(\text{logit}(Y)\) is a linear combination of contributions from each predictor \(x_j\) in the vector \(X\).
\[\text{logit}(\Pr(Y = 1\mid X)) = \beta^T X= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k\] Each coefficient \(\beta_j\) represents the change in the log-odds of the outcome per one-unit increase in \(x_j\), holding all other variables constant.
To interpret \(\beta_j\) more intuitively, we exponentiate it to obtain the odds ratio: \[\exp(\beta_j) = \text{odds ratio associated with a one-unit increase in } x_j\] If \(\exp(\beta_j) > 1\), the odds of the event increase as \(x_j\) increases.
If \(\exp(\beta_j) < 1\), the odds of the event decrease as \(x_j\) increases.
If \(\exp(\beta_j) = 1\), there is no association between \(x_j\) and the outcome. (A short R sketch of this calculation follows below.)
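As a quick illustration, once a logistic model has been fitted in R (such models are fitted later in this document with glm()), the coefficients can be exponentiated to read off odds ratios. Here `fit` is just a placeholder name for any fitted logistic model:
# Minimal sketch: odds ratios from a fitted logistic model `fit` (placeholder name)
exp(coef(fit))             # odds ratios: exp(beta_j) per one-unit increase in x_j
exp(confint.default(fit))  # Wald 95% confidence intervals on the odds-ratio scale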
Recall that we are modelling the logit of the outcome given \(X\) as \[\text{logit}(Y\mid X) = \log\bigg(\frac{p}{1-p}\bigg) = \beta^\top X\]
To find the probability p, you invert the logit function. Start from:
\[\log\left(\frac{p}{1 - p}\right) = \beta^\top X\]
Exponentiate both sides:
\[\frac{p}{1 - p} = e^{\beta^\top X}\]
Solve for p:
\[ logit^{-1}(Y|X) = p = \frac{e^{\beta^\top X}}{1 + e^{\beta^\top X}} = \frac{1}{1 + e^{-\beta^\top X}}\]
This result is what we call the logistic function or sigmoid curve. It converts the linear predictor into a valid probability in \([0, 1]\), while the logit function maps a probability back to the log-odds scale.
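R already provides both transformations: qlogis() is the logit and plogis() is its inverse (the sigmoid). A tiny check:
p <- 0.75
qlogis(p)          # logit: log(0.75 / 0.25) = log(3), about 1.099
plogis(qlogis(p))  # inverse logit recovers 0.75
plogis(0)          # a linear predictor of 0 corresponds to a probability of 0.5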
In linear regression analysis we are concerned with estimating the line of best fit, the one that minimizes the sum of squared residuals: \(\text{RSS} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2\)
The fit is such that : \[\hat{Y} = \beta^TX\] which simplifies to \(\hat{Y} = \beta_0 + \beta_1X\) when we have one predictor.
In cases with more than one predictor, we often use a residuals vs fitted values plot to assess model fit. If we observe a curve or pattern in this plot, it may indicate violations of the linearity or homoscedasticity assumptions.
For linearity violations, possible remedies include transforming predictors (e.g., log, square root), adding polynomial terms, using splines or generalized additive models (GAMs), or switching to tree-based models.
If the residuals display a funnel shape, that is, their spread increases or decreases across the fitted values, this suggests heteroscedasticity. In such cases, we may transform the outcome variable or model the variance structure explicitly (e.g., with weighted least squares or generalized least squares).
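As a small illustration, once the workshop data htndat (loaded later in this document) is available, a residuals-vs-fitted plot for an illustrative linear model can be drawn directly with base R; the predictors chosen here are arbitrary and not a model analysed elsewhere:
# Sketch: residuals vs fitted values for an illustrative linear model
m_check <- lm(SBP ~ age + BMI, data = htndat)   # illustrative predictors only
plot(fitted(m_check), resid(m_check),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # look for curvature (non-linearity) or a funnel shape (heteroscedasticity)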
A visual representation of logistic regression differs from linear regression because the outcome is binary. Instead of predicting continuous values, the model estimates probabilities between 0 and 1. These predicted probabilities can then be compared to a threshold (e.g., 0.5) to classify observations into outcome categories—typically 1 for “event” and 0 for “no event.”
The standard residual sum of squares (RSS) used in linear regression is not appropriate for binary outcomes. Instead of minimizing RSS, logistic regression maximizes the log-likelihood based on the Bernoulli distribution of the outcome. Residual plots (e.g., deviance or Pearson residuals vs. fitted values) can help detect patterns or model misfit. Goodness of fit is typically assessed using deviance, Hosmer-Lemeshow tests, or pseudo-\(R^2\) measures like McFadden’s \(R^2\).
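For reference, a minimal sketch of two of these summaries, assuming `fit` is any logistic model fitted with glm() (such as the models fitted later in this document) and that the model variables have no missing values:
# Deviance and McFadden's pseudo-R^2 for a fitted logistic model `fit`
null_fit <- update(fit, . ~ 1)                              # intercept-only model
mcfadden <- 1 - as.numeric(logLik(fit) / logLik(null_fit))  # 1 - logLik(model)/logLik(null)
c(residual_deviance = deviance(fit),
  null_deviance     = deviance(null_fit),
  McFadden_R2       = mcfadden)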
We have mentioned the residual sum of squares, and it is important to understand that it comes from an estimation method known as least squares (LS). The regression coefficients \(\beta\) are obtained by minimizing the sum of squared residuals: \(RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2\). This method works directly for continuous outcomes because we assume the residuals are normally distributed.
However, there exists a general-purpose estimation method known as maximum likelihood estimation (MLE). It works by maximizing the likelihood function \[L(\beta) = \prod_{i=1}^n f(y_i \mid x_i, \beta)\] or, equivalently, maximizing the log-transformed likelihood for efficient computation:
\[\ell(\beta) = \sum_{i=1}^n \log f(y_i \mid x_i, \beta)\] MLE works for many outcome types (continuous, binary, count, etc.) and only requires specifying a probability model for \(Y_i\): Gaussian for a normally distributed continuous outcome, Bernoulli for a binary outcome, Poisson for count data, and so on.
As mentioned, when the outcome is binary we assume a Bernoulli distribution for \(Y_i\): \[Y_i \overset{indep}{\sim} \text{Bernoulli}(p_i)\] where \(p_i\) is the probability of success for the event, that is \(P(Y=1\mid X)\). We then express the likelihood for \(n\) independent outcomes as \[L(\beta) = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\] or equivalently the log-likelihood as \[\ell(\beta) = \sum_{i=1}^n \big[ y_i\log(p_i) + (1-y_i)\log(1-p_i) \big]\] If there were no covariates (a single common \(p\)), maximizing the latter would give the closed-form estimate \(\hat{P}(Y=1) = \hat{p} = \frac{\sum_{i} y_i}{n}\), the sample proportion.
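A quick numeric check of this no-covariate special case, with made-up data:
# With no covariates, the MLE of p is just the sample proportion of 1s
y <- c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1)
mean(y)                                       # 0.5
plogis(coef(glm(y ~ 1, family = binomial)))   # logit^-1 of the intercept is also 0.5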
But we know that \(\hat{p} = \frac{1}{1 + e^{-X^\top\beta}} = \frac{e^{X^\top\beta}}{1+e^{X^\top\beta}}\)
Therefore, to solve for the regression coefficients \(\beta\), we substitute this expression for \(p_i\) and use the revised log-likelihood
\[\ell(\beta) = \sum_{i=1}^n \bigg[ y_i \log\left( \frac{e^{X_i^T\beta}}{1+e^{X_i^T\beta}} \right) + (1-y_i)\log\left( 1 - \frac{e^{X_i^T\beta}}{1+e^{X_i^T\beta}} \right) \bigg]\]
This simplifies to the canonical form of the logistic regression log-likelihood: \[\ell(\beta) = \sum_{i=1}^n \bigg[ y_i X_i^\top\beta - \log\big(1+e^{X_i^\top\beta}\big) \bigg]\]
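This canonical form is easy to evaluate directly. A minimal sketch, where X is assumed to be a design matrix (first column of 1s for the intercept), y a 0/1 vector, and beta a candidate coefficient vector:
# Log-likelihood of a logistic regression in its canonical form
loglik <- function(beta, X, y) {
  eta <- as.vector(X %*% beta)        # linear predictor X beta
  sum(y * eta - log(1 + exp(eta)))    # sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
}
# e.g. loglik(rep(0, ncol(X)), X, y) evaluates the log-likelihood at beta = 0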
Using the MLE method, we need the first and second derivatives of \(\ell(\beta)\). The second derivative is known as the Hessian matrix; it confirms that we have a maximum when it is negative definite.
\[ \frac{\partial \ell(\beta)}{\partial\beta} = \sum_{i=1}^n \bigg[ \frac{\partial }{\partial\beta}\, y_i X_i^\top\beta - \frac{\partial}{\partial\beta} \log\big(1+e^{X_i^\top\beta}\big) \bigg] \] Applying the chain rule to the second component we get
\[\frac{\partial \ell(\beta)}{\partial\beta} = \sum_{i=1}^n \bigg[ y_i X_i - \frac{X_i e^{X_i^\top\beta}}{1+ e^{X_i^\top\beta}} \bigg]\] which simplifies to the score function (first derivative)
\[ U(\beta) = \frac{\partial \ell(\beta)}{\partial\beta} = \sum_{i=1}^n \big( y_i - p_i \big)\, x_i \]
The Hessian is obtained by differentiating \(-p_i x_i\), since \(y_i x_i\) does not depend on \(\beta\):
\[H(\beta) = \frac{\partial^2 \ell(\beta)}{\partial\beta\,\partial\beta^\top} = -\sum_{i=1}^n p_i(1-p_i)\, x_i x_i^\top \]
The above working reveals that \(p\) depends on \(\beta\) through \(X^\top\beta\), making the gradient (the score function) itself depend on \(\beta\). This means we cannot isolate \(\beta\) on one side of the equation, so there is no closed-form solution for \(\beta\). This is in contrast to linear regression, where setting \(U(\beta) = 0\) yields \(\hat{\beta} = (X^\top X)^{-1}X^\top y\).
Thus, in logistic regression, maximizing \(\ell(\beta)\) requires solving the score equation for \(\beta\) iteratively, either through Newton-Raphson, Fisher scoring, or gradient ascent.
Newton-Raphson updates use the observed Hessian: \[\beta^{t+1} = \beta^{t} - {H(\beta^t)}^{-1} U(\beta^t)\]
Fisher scoring replaces the observed Hessian with its expectation: \[\beta^{t+1} = \beta^{t} - {E[H(\beta^t)]}^{-1} U(\beta^t)\] where \(-E[H(\beta)]\) is the expected (Fisher) information matrix used in place of the actual Hessian \(H(\beta)\).
Gradient ascent instead moves along the score in steps of size \(\alpha\); it is usually slower than Newton-Raphson but does not require second derivatives:
\[\beta^{t+1} = \beta^{t} + \alpha\, U(\beta^t)\] We conclude the MLE discussion for logistic regression by showing why it is often referred to as an iteratively reweighted least squares (IRLS) update.
In matrix form the score is \(U(\beta) = X^\top(y-p)\), where \(X\) is the \(n \times p\) design matrix, \(y\) is the vector of observed outcomes, and \(p\) is the vector of fitted probabilities under the current \(\beta\). The Hessian is \(H(\beta) = - X^\top W X\), with \(W = \operatorname{diag}\{p_i(1-p_i)\}\) a diagonal matrix of weights. The Newton-Raphson update then looks like a weighted least squares update, with weight \(W\) attached to each observation:
\[\beta^{t+1} = \beta^{t} + (X^\top W X)^{-1}X^\top(y-p)\]
Defining the residuals as \(y-p\) and the working response \(z = X\beta^{t} + W^{-1}(y - p)\), where \(X\beta^{t}\) is the current linear predictor \(\eta^t\) and \(W^{-1}(y-p)\) is an adjustment to the residuals, \(z\) is also called the adjusted dependent variable.
Each iteration therefore solves the weighted least squares problem \[\beta^{(t+1)} = (X^\top W^{(t)}X)^{-1}X^\top W^{(t)}z^{(t)}\] Under the hood, R's glm() implements the Fisher scoring method in exactly this form.
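To make the algorithm concrete, here is a bare-bones IRLS loop written only from the formulas above. It is a sketch for intuition, not a replacement for glm(); X is assumed to be a design matrix with an intercept column and y a 0/1 vector.
# Iteratively reweighted least squares for logistic regression (sketch)
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))                      # start at beta = 0
  for (iter in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)               # current linear predictor
    p   <- plogis(eta)                         # fitted probabilities
    W   <- p * (1 - p)                         # diagonal weights
    z   <- eta + (y - p) / W                   # working response
    beta_new <- as.vector(solve(crossprod(X, W * X), crossprod(X, W * z)))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
# The result should agree with coef(glm(y ~ X - 1, family = binomial)) up to numerical tolerance.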
Now that we know what goes on when fitting a logistic regression, before fitting any model we first explore the relationships between the predictors and the outcome. For continuous outcomes and continuous predictors we use scatter plots, possibly adding loess lines to help visualize trends. For a binary outcome with a continuous predictor we use box plots. We may also use counts and frequencies, as well as bivariate tests such as the chi-square test.
# load your data
load("Workshop2025.Rdata")
str(htndat)
'data.frame':   4998 obs. of  17 variables:
 $ DBP               : int  60 75 60 60 60 60 59 60 70 58 ...
 $ SBP               : int  90 110 80 90 100 120 100 100 120 110 ...
 $ BMI               : num  NA 27.3 17.7 19.9 21.3 ...
 $ age               : num  28 26.5 43 50.1 30.6 ...
 $ married           : int  0 1 0 1 0 1 0 0 1 1 ...
 $ male.gender       : int  0 0 0 1 1 0 0 1 0 0 ...
 $ hgb_centered      : num  NA -3.9 -3.2 NA -0.4 ...
 $ adv_HIV           : int  NA 0 NA NA NA NA NA NA NA NA ...
 $ survtime          : int  338 439 752 526 215 272 1669 388 14 419 ...
 $ event             : int  1 1 1 1 1 1 1 1 1 1 ...
 $ arv_naive         : int  1 1 1 1 1 1 1 0 1 1 ...
 $ urban.clinic      : int  0 1 0 1 0 0 1 0 0 0 ...
 $ log_creat_centered: num  NA 5.42e-02 -3.60e-01 NA -1.00e-07 ...
 $ IPW_weight        : num  0.924 1.164 0.721 0.829 0.856 ...
 $ SBP_ge120         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ survmonth         : num  12 15 26 18 8 10 56 13 1 14 ...
 $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
# Get to know your data
#ls()
# Which variables
#names(htndat)
#names(htnlongdat)
# Size
#dim(htndat)
#dim(htnlongdat)
# Identify your outcome variable and ensure it is binary
#table(htndat$event)# death=1, censored = 0
#typeof(htndat$event) # Right?
###
library(ggplot2)
ggplot(htndat, aes(x = factor(event), y = survmonth)) +
geom_boxplot(notch = TRUE) +
ggtitle("Boxplot of survival time againts event")
# Table 1 summary with p-values from ANOVA and chi-square tests
library(gtsummary)
htndat$event <- factor(htndat$event)
htndat$male.gender <- factor(htndat$male.gender)
htndat$adv_HIV <- factor(htndat$adv_HIV)
htndat$arv_naive <- factor(htndat$arv_naive)
htndat$urban.clinic <- factor(htndat$urban.clinic)
htndat$SBP_ge120 <- factor(htndat$SBP_ge120)
# Table 1 stratified by 'event'
tbl <- htndat %>%
tbl_summary(
by = event, # stratify by event
statistic = all_continuous() ~ "{mean} ({sd})",
digits = all_continuous() ~ 2
) %>%
add_p(
test = list(
all_continuous() ~ "aov", # ANOVA for continuous
all_categorical() ~ "chisq.test" # Chi-sq for categorical
)
)
# Add p-values
# add overall column
tbl <- tbl %>% add_overall()
tbl
| Characteristic | Overall, N = 4,998¹ | 0, N = 4,249¹ | 1, N = 749¹ | p-value² |
|---|---|---|---|---|
| DBP | 68.03 (10.32) | 68.71 (10.11) | 64.17 (10.66) | <0.001 |
| SBP | 110.60 (14.78) | 111.31 (14.70) | 106.56 (14.56) | <0.001 |
| BMI | 21.43 (3.62) | 21.62 (3.65) | 20.35 (3.25) | <0.001 |
| Unknown | 669 | 569 | 100 | |
| age | 35.19 (10.56) | 35.07 (10.51) | 35.90 (10.86) | 0.046 |
| married | 2,848 (59%) | 2,453 (60%) | 395 (54%) | 0.007 |
| Unknown | 163 | 141 | 22 | |
| male.gender | | | | <0.001 |
| 0 | 3,643 (73%) | 3,201 (75%) | 442 (59%) | |
| 1 | 1,355 (27%) | 1,048 (25%) | 307 (41%) | |
| hgb_centered | -0.05 (2.59) | 0.08 (2.57) | -0.88 (2.57) | <0.001 |
| Unknown | 1,393 | 1,147 | 246 | |
| adv_HIV | | | | <0.001 |
| 0 | 1,130 (37%) | 1,031 (39%) | 99 (25%) | |
| 1 | 1,908 (63%) | 1,609 (61%) | 299 (75%) | |
| Unknown | 1,960 | 1,609 | 351 | |
| survtime | 750.68 (650.86) | 788.04 (666.77) | 538.72 (502.19) | <0.001 |
| arv_naive | | | | 0.4 |
| 0 | 468 (9.4%) | 405 (9.5%) | 63 (8.4%) | |
| 1 | 4,530 (91%) | 3,844 (90%) | 686 (92%) | |
| urban.clinic | | | | 0.024 |
| 0 | 2,636 (53%) | 2,212 (52%) | 424 (57%) | |
| 1 | 2,362 (47%) | 2,037 (48%) | 325 (43%) | |
| log_creat_centered | -0.09 (0.32) | -0.10 (0.31) | -0.04 (0.33) | <0.001 |
| Unknown | 1,537 | 1,264 | 273 | |
| IPW_weight | 0.99 (0.33) | 1.00 (0.33) | 0.98 (0.30) | 0.3 |
| SBP_ge120 | | | | 0.046 |
| 0 | 4,748 (95%) | 4,025 (95%) | 723 (97%) | |
| 1 | 250 (5.0%) | 224 (5.3%) | 26 (3.5%) | |
| survmonth | 25.54 (21.67) | 26.78 (22.19) | 18.46 (16.74) | <0.001 |
| ID | 2,499.50 (1,442.94) | 2,874.00 (1,226.72) | 375.00 (216.36) | <0.001 |

¹ Mean (SD); n (%)
² One-way analysis of means; Pearson's Chi-squared test
We use the function lm(Y ~ X, data = yourdata) to fit a linear regression. On the other hand, we use glm(Y ~ X, data = yourdata, family = binomial(link = "logit")) to fit a logistic regression model in R.
As discussed earlier, Y is a binary outcome variable — for example, life status where Y = 1 indicates death and Y = 0 indicates survival. The vector of explanatory variables X could include predictors such as age, gender, BMI, and systolic blood pressure (SBP).
The argument yourdata refers to the dataset — typically a data.frame, data.table, or tibble — that contains both the outcome variable and the predictor variables.
By specifying family = binomial(link = "logit"), we are telling R that the outcome follows a binomial distribution and that we are modeling the logit transformation of the probability of the event Y = 1.
m0 = lm(SBP ~ age + male.gender + married, data = htndat )
#summary(m0)
library(knitr)
library(gtsummary)
kable(coef(summary(m0)),caption="Linear regression")
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 103.0737900 | 0.7781899 | 132.453256 | 0.0000000 |
| age | 0.1697076 | 0.0205136 | 8.272921 | 0.0000000 |
| male.gender1 | 1.1482555 | 0.4971801 | 2.309536 | 0.0209558 |
| married | 1.8972446 | 0.4347769 | 4.363720 | 0.0000131 |
#tbl_regression(m0)
m1 = glm(event ~ survtime + adv_HIV + male.gender ,data = htndat,family = binomial(link="logit"))
# Obtain fit summary
#summary(m1)
# What issues do you note?
kable(coef(summary(m1)),caption="Logistic regression")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -2.0582266 | 0.1265390 | -16.265556 | 0.00e+00 |
| survtime | -0.0006148 | 0.0000962 | -6.393221 | 0.00e+00 |
| adv_HIV1 | 0.5392274 | 0.1247368 | 4.322921 | 1.54e-05 |
| male.gender1 | 0.7235381 | 0.1153533 | 6.272366 | 0.00e+00 |
To obtain predictions from a logistic regression model in R, we typically use built-in functions for efficiency, though manual calculation using the model coefficients will yield the same results.
In this context, we are primarily interested in the fitted probabilities. Since the model is applied to the full dataset used for fitting, these predictions are considered in-sample or training predictions. In contrast, a proper classification pipeline often involves splitting the data into training, testing, and validation sets to assess model performance more reliably.
In R, the predict() function is used as follows: predict(fit, type = "response"). This returns the estimated probabilities \(\hat{p}_i = \Pr(Y_i = 1 \mid X_i)\). Note that glm objects do not offer a type = "class" option; predicted classes are obtained by applying a threshold to these probabilities (e.g., classify as 1 when \(\hat{p}_i > 0.5\)).
To generate predictions for new observations, the new data must contain all the predictor variables used in model fitting. The syntax becomes:
predict(fit, type = "response", newdata = new_data)
predictions = predict(m1, type = "response")
head(predictions)
2 11 12 13 14 18
0.08881977 0.14275512 0.20466253 0.17753473 0.16169752 0.17457218
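If predicted classes are needed, we threshold these probabilities ourselves. A minimal sketch using the 0.5 cut-off mentioned earlier (m1$model contains the complete cases actually used in the fit, so its rows line up with the predictions):
# Classify using a 0.5 threshold and compare with the observed outcome
pred_class <- ifelse(predictions > 0.5, 1, 0)
table(observed = m1$model$event, predicted = pred_class)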
A survival outcome is a special type of binary event—such as death or disease progression—that is observed over time. Unlike standard binary outcomes, survival data account not just for whether an event occurred, but also when it occurred.
For example, consider an HIV cohort study following individuals from the time of enrollment and treatment initiation. The investigator is interested in the time to death. Subject A might die after 10 years of follow-up, while subject B dies after just 2 years. Subject C may remain alive at the end of the study, never experiencing the event, and subject D may drop out after 4 years, with their survival status thereafter unknown.
This variability in event timing and incomplete follow-up requires methods beyond standard binary regression. One such approach is pooled logistic regression, which adapts logistic regression to handle survival outcomes in discrete time intervals.
In discrete-time survival analysis, follow-up time is divided into intervals (e.g., weeks, months, years). For each subject, we track whether the event (e.g., death) occurs in each interval. This transforms the survival data into a person-period dataset.
Let \(t = 1, 2, \dots, T\) denote discrete time intervals (e.g., years). \(Y_{it} \in \{0, 1\}\) be a binary indicator for whether subject i experienced the event in interval \(t\) \((1 = event \; occurred(death), 0 = no \; event(alive))\). \(\mathbf{X}_{it}\) be the covariates for subject \(i\) at time \(t\) (can include time-fixed or time-varying predictors).
We assume that each subject contributes a row for each time point up until the event or censoring.
The model expresses the conditional probability of the event occurring at time t for subject i, given they have survived up to t:
\[\Pr(Y_{it} = 1 \mid Y_{i,t-1} = 0, \mathbf{X}_{it}) = \text{logit}^{-1} \big(\alpha_t + \mathbf{X}_{it}^\top \boldsymbol{\beta}\big)\]
Where \(\alpha_t\) is a time-specific intercept (e.g., one per time point to allow for baseline hazard variation). \(\boldsymbol{\beta}\) is the vector of coefficients for the covariates \(\mathbf{X}_{it}\).
This is a logistic regression where the outcome is whether the event occurred in a given time interval. The data is pooled across time, hence the name “pooled logistic regression.”
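The person-period dataset htnlongdat used below was prepared in advance for the workshop. For intuition only, here is a sketch of how such a dataset could be built from the one-row-per-subject data, assuming columns ID, survmonth (follow-up in months) and event (1 = died, 0 = censored) as in htndat; this is not necessarily the exact preparation used for htnlongdat:
library(dplyr)
library(tidyr)
# Expand each subject into one row per month of follow-up (sketch)
person_period <- htndat %>%
  mutate(n_months = pmax(1, ceiling(survmonth))) %>%       # at least one interval per subject
  uncount(n_months, .id = "Month") %>%                     # one row per subject-month
  group_by(ID) %>%
  mutate(death = as.integer(event == 1 & Month == max(Month))) %>%  # event only in the last interval
  ungroup()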
library(dplyr)
library(tidyr)
xx=htnlongdat |> filter(ID %in% c(133,1259)) |>
select(ID,Month,death,age,BMI,hgb_centered,SBP_ge120)
kable(xx,caption="Sample Person period data")
| ID | Month | death | age | BMI | hgb_centered | SBP_ge120 |
|---|---|---|---|---|---|---|
| 133 | 1 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 2 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 3 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 4 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 5 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 6 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 7 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 8 | 1 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 1259 | 1 | 0 | 29.51677 | 23.08254 | 0.6000004 | 0 |
| 1259 | 2 | 0 | 29.51677 | 23.08254 | 0.6000004 | 0 |
| 1259 | 3 | 0 | 29.51677 | 23.08254 | 0.6000004 | 0 |
| 1259 | 4 | 0 | 29.51677 | 23.08254 | 0.6000004 | 0 |
# add a time varying data
xtable::xtable(xx)
Fitting and estimation proceed exactly as in an ordinary logistic regression.
htnlongdat = htnlongdat |> drop_na()
library(splines)
fit = glm(death~ ns(Month,2) + age + BMI +
hgb_centered + adv_HIV+ SBP_ge120,
family= binomial(link="logit"),
data = htnlongdat)
summary(fit)
Call:
glm(formula = death ~ ns(Month, 2) + age + BMI + hgb_centered + adv_HIV + SBP_ge120, family = binomial(link = "logit"), data = htnlongdat)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   -4.195052   0.483265  -8.681  < 2e-16 ***
ns(Month, 2)1 -0.641916   0.313418  -2.048  0.04055 *
ns(Month, 2)2 -0.182229   0.450190  -0.405  0.68564
age            0.013318   0.005212   2.555  0.01061 *
BMI           -0.085760   0.019670  -4.360 1.30e-05 ***
hgb_centered  -0.131210   0.023393  -5.609 2.04e-08 ***
adv_HIV        0.442044   0.140556   3.145  0.00166 **
SBP_ge120      0.280788   0.249679   1.125  0.26076
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3684.0  on 61327  degrees of freedom
Residual deviance: 3587.9  on 61320  degrees of freedom
AIC: 3603.9

Number of Fisher Scoring iterations: 8
#for a cleaner output
tbl_regression(fit)
| Characteristic | log(OR) | 95% CI | p-value |
|---|---|---|---|
| ns(Month, 2) | | | |
| ns(Month, 2)1 | -0.64 | -1.3, -0.03 | 0.041 |
| ns(Month, 2)2 | -0.18 | -1.1, 0.65 | 0.7 |
| age | 0.01 | 0.00, 0.02 | 0.011 |
| BMI | -0.09 | -0.12, -0.05 | <0.001 |
| hgb_centered | -0.13 | -0.18, -0.09 | <0.001 |
| adv_HIV | 0.44 | 0.17, 0.72 | 0.002 |
| SBP_ge120 | 0.28 | -0.24, 0.74 | 0.3 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
# Get the predicted hazards
htnlongdat$lambda_t <- predict(fit, type = "response")
# Get the sample data
xx=htnlongdat |> filter(ID %in% c(133,1259)) |>
select(ID,Month,death,age,BMI,lambda_t)
# add a time varying data
xtable::xtable(xx)
xtable::xtable(xx, digits = -1)
Let \[\lambda_{it}(t) = \Pr(Y_{it} = 1 \mid Y_{i,t-1} = 0, X_i) = \Pr(Y_{it} = 1 \mid T \ge t, X_i) = \Pr(T = t \mid T \ge t, X_i)\] be the discrete-time hazard function for individual i at time t. This represents the probability that subject i experiences the event (e.g., death) at time t, given they have not yet experienced it by t, conditional on covariates \(X_i\). Here, T denotes the event time, and \(X_i\) is a vector of covariates for individual \(i\).
Note that if we define \[ Y_{it} = \begin{cases} 1 & \text{if } T = t \\ 0 & \text{otherwise}, \end{cases} \] then \(\lambda_{it}\) can equivalently be expressed as \(\lambda_{it} = \Pr(Y_{it} = 1 \mid T \ge t, X_i)\).
We model \(\lambda_{it}\) using a pooled logistic regression. Taking the complement \(1 - \lambda_{it}\) gives the probability that individual i does not experience the event at time t (i.e., remains alive), given no prior occurrence of the event up to t-1 and conditional on covariates X_i.
Therefore, the survival function can be expressed as: \[S(t \mid X_i) = \Pr(T > t \mid X_i) = \prod_{k=1}^{t} \left(1 - \lambda_{ik}(k)\right)\], which gives the probability that individual i survives beyond time t.
We are interested in \[S(6 \mid X_i) = \Pr(T > 6 \mid X_i) = \prod_{t=1}^{6} \big(1 - \lambda_{it}(t)\big)\] which is the probability of not having the event(death) in any of the first 6 months.
Here \(\lambda_{it}(t)\) is the predicted hazard at time \(t\) for individual \(i\). The product multiplies the complements of the hazards across times \(1\) to \(6\) for each individual, as in the small numeric example below.
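A tiny numeric illustration with made-up monthly hazards:
# Chaining six made-up monthly hazards into a 6-month survival probability
lambda <- c(0.02, 0.02, 0.03, 0.03, 0.04, 0.04)
prod(1 - lambda)       # about 0.83: probability of surviving all six months
1 - prod(1 - lambda)   # about 0.17: probability of having the event by month 6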
To compute the survival probability \(S(6 \mid X_i)\), we must first create new rows for each individual at every time point up to 6 months (or any time horizon T of interest). We then use the fitted model to predict \(\lambda_{it}\) at each time point and combine these conditional hazards over time using the survival formula. Recall that the pooled logistic regression predicts the probability of experiencing the event at time t given survival up to t; the overall survival probability is obtained by chaining these predictions across all t.
A common concern is what happens if an individual experienced the event before \(T\) or remained event-free by \(T\). This issue is already handled during model fitting: the pooled logistic regression accounts for censoring and event times. When predicting, we are essentially asking, “What if the individual survived up to each time \(t\)?” under the fitted model, independent of their actual observed survival status. For this reason, it is necessary to supply a grid of time points \(t = 1, \dots, 6\) and the covariates \(X_i\) to the prediction function.
If we wish to compute \(S(6 \mid X_i)\) for all individuals in the dataset, we must create an expanded dataset where each individual \(i\) is represented at all time points \(t = 1, \dots, 6.\)
# In sample predictions
xx <- htnlongdat %>%
group_by(ID) %>%
summarise(S = prod(1 - lambda_t), .groups = "drop") |> filter(ID %in% c(133,1259))
xtable::xtable(xx, digits = -1)
# Expand data: each individual at t=1..6
expanded_data <- htnlongdat %>%
group_by(ID) %>%
tidyr::expand(Month = 1:6, .by = "ID") %>%
left_join(htnlongdat %>% distinct(ID, .keep_all = TRUE), by = c("ID")) %>%
# bring in baseline covariates, just one copy of X at each t
rename(Month = Month.x)
#? What if we have time varying predictors
# Predict hazards
expanded_data <- expanded_data %>%
ungroup() %>% # remove any groupings
mutate(predicted_hazard = predict(fit, newdata = ., type = "response"))
# What do you observe
# Compute survival at 6 months per individual
survival_6mo <- expanded_data %>%
group_by(ID) %>%
summarise(S6_predicted_prob = prod(1 - predicted_hazard), .groups = "drop")
head(survival_6mo[!is.na(survival_6mo$S6_predicted_prob),])
The standard pooled logistic regression can be extended in several ways to improve flexibility and capture complex relationships in the data.
In many applications, covariates X_i(t) may change over time (e.g., blood pressure, treatment adherence). These time-varying predictors can be directly incorporated into the model:
\[\text{logit} \, \lambda_{it}(t) = \beta_0 + \beta_1 X_{i}(t) + \beta_2 t + \dots\]
where X_i(t) represents the value of covariate X for individual i at time t. This allows the model to account for dynamic patient characteristics and their effect on the hazard.
A key assumption in pooled logistic regression is the functional form of time t. Instead of using a simple linear or categorical term for t, spline functions can be used to model complex, nonlinear effects:
\[\text{logit} \, \lambda_{it}(t) = \beta_0 + f(t) + \beta_1 X_{i} + \dots\]
where f(t) is a spline basis (e.g., natural splines or cubic splines). This approach captures variations in the baseline hazard over time more smoothly.
Interactions between covariates and time or between covariates themselves can capture differential effects across time or groups:
\[\text{logit} \, \lambda_{it}(t) = \beta_0 + \beta_1 X_i + \beta_2 t + \beta_3 X_i \cdot t + \dots\] Covariate \(\times\) time interactions \((X_i \cdot t)\) allow the effect of a covariate to vary across time, while covariate \(\times\) covariate interactions help model synergistic or antagonistic effects between predictors (a sketch follows below).
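For illustration only, a covariate-by-time interaction could be added to the pooled model along these lines; this particular specification is not fitted elsewhere in this document:
# Sketch: allowing the effect of gender on the discrete-time hazard to change with follow-up time
fit_tv <- glm(death ~ Month + male.gender + male.gender:Month +
                BMI + hgb_centered + adv_HIV,
              family = binomial(link = "logit"),
              data = htnlongdat)
# The interaction term lets the gender effect on the log-odds change linearly with Month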
These enhancements are particularly important in survival analysis settings where the proportional hazards assumption may not hold.
The pooled logistic regression model should be assessed using standard diagnostic tools. These include examination of the deviance, residual plots, and other goodness-of-fit measures to ensure that the model adequately captures the relationship between covariates and the discrete-time hazard.
Once the model has been fitted, the predicted survival probabilities \(S(T \mid X_i)\) can be validated against observed data. For this purpose, define a binary outcome indicating whether each individual experienced the event by time T: \[ Z_i = \begin{cases} 1, & \text{if } T_i \leq T \\ 0, & \text{if } T_i > T \end{cases} \] where \(T_i\) is the observed event time for individual i.
The predicted survival probabilities \(S(T \mid X_i)\) can then be compared with these observed outcomes by calculating the area under the receiver operating characteristic curve (AUC). This provides a summary measure of the model’s ability to discriminate between individuals who experience the event by T and those who do not.
# Extract T for all patients that had the event(death)
T_i_all <- htnlongdat %>%
filter(death == 1) %>%
group_by(ID) %>%
summarise(T_i = min(Month), .groups = "drop")
# Join back to full list of IDs to get T= NA where no death occurred
T_i_all <- htnlongdat %>%
distinct(ID) %>%
left_join(T_i_all, by = "ID")
# Create Z True binary outcome to validate 6 months predictions
T_i_all <- T_i_all %>%
mutate(Z = ifelse(is.na(T_i) | T_i > 6, 0, 1),
Time = 6)
# Compare Z with the predicted outcomes
# Join everything and drop cases where missing data prevented pooled logistic estimation (listwise deletion)
results <- left_join(T_i_all,survival_6mo,by ="ID")
results <- results[!is.na(results$S6_predicted_prob),] # drop IDs with missing predictions
results<- results %>% select(ID,T_i,Time,Z,S6_predicted_prob)
# Compute the event probability
results <- results %>%
mutate(event_prob_6mo = 1 - S6_predicted_prob)
library(pROC)
# Compute ROC and AUC
roc_obj <- roc(results$Z, results$event_prob_6mo)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Print AUC
#roc_obj
#print(xtable::xtable(head(results)), include.rownames = FALSE)
library(ggplot2)
# ggplot-style ROC curve
p =ggroc(roc_obj, colour = "#0072B2", size = 1.2) +
labs(
title = "ROC Curve censoring ignored",
subtitle = paste("AUC =", round(auc(roc_obj), 3)),
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold"),
) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray")
#p
#ggsave("auc.pdf", p, width = 8, height = 6, dpi = 600)
# Obtain last row for each person
test.test = htnlongdat |> group_by(ID) %>% slice_tail()
# Time dependent AUC that handles censoring
library(timeROC)
library(survival)
roc6 <- timeROC(
T = htnlongdat$Month, # observed time
delta = htnlongdat$death, # event indicator
marker = htnlongdat$lambda_t, # model predictions (pooled logistic)
cause = 1, # event of interest
weighting = "marginal", # IPCW weighting
times = 6 # time horizon
)
roc_df <- data.frame(
FPR = roc6$FP[, 2],
TPR = roc6$TP[, 2],
#Model = "Pooled Logistic",
AUC = round(roc6$AUC[2], 3)
)
## Warning in data.frame(FPR = roc6$FP[, 2], TPR = roc6$TP[, 2], AUC =
## round(roc6$AUC[2], : row names were found from a short variable and have been
## discarded
# Combine AUC into model names for legend
roc_df <- roc_df %>%
mutate(AUC = paste0( "(",AUC, ")"))
p2 = ggplot(roc_df, aes(x = FPR, y = TPR, color = AUC)) +
geom_line(size = 1.2) +
geom_abline(slope = 1, intercept = 0, color = "gray") +
labs(
title = "Time dependent ROC (IPCW)",
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "top",
legend.direction = "horizontal",
# legend.title = element_text(face = "bold"),
legend.text = element_text(size = 10)
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Combine plots comparing AUC with and without accounting for censoring
library(patchwork)
auc = p + p2
ggsave("auc.pdf",auc, width = 8, height = 6, dpi = 600)
auc
# Deal with complete case data
htnlongdat_c = htnlongdat |> drop_na()
# Sample 70% of IDs for training and 30% for testing
uniqueids <- htnlongdat_c %>%
distinct(ID) %>%
pull(ID)
set.seed(123) # for reproducibility
sampled_ids <- sample(uniqueids, size = floor(0.7 * length(uniqueids)))
# Train data
train <- htnlongdat_c %>%
filter(ID %in% sampled_ids)
# Test data
test <- htnlongdat_c %>%
filter(!ID %in% sampled_ids)
#Advanced pooled logistic regression
# Multiple predictors
pooled = glm(death~ Month + age + male.gender + married + BMI +
hgb_centered + adv_HIV + log_creat_centered +SBP_ge120,
family= binomial(link="logit"),
data = train)
# splines
pooled.ns = glm(death~ ns(Month,2) + age + male.gender + married + BMI +
hgb_centered + adv_HIV + log_creat_centered +SBP_ge120,
family= binomial(link="logit"),
data = train)
# Interaction
pooled.in = glm(death~ Month + age + male.gender: married + BMI +
hgb_centered + adv_HIV + log_creat_centered +SBP_ge120,
family= binomial(link="logit"),
data = train)
# Compare models
AIC(pooled,pooled.ns,pooled.in)
summary(pooled)
Call: glm(formula = death ~ Month + age + male.gender + married + BMI + hgb_centered + adv_HIV + log_creat_centered + SBP_ge120, family = binomial(link = “logit”), data = train)
Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.073406 0.574288 -7.093 1.31e-12 Month
-0.007023 0.004314 -1.628 0.103528
age 0.002335 0.006377 0.366 0.714268
male.gender 0.940458 0.161711 5.816 6.04e-09 married
-0.503951 0.141787 -3.554 0.000379 BMI -0.068033 0.023303
-2.919 0.003506 hgb_centered -0.189045 0.026809 -7.051
1.77e-12 ** adv_HIV 0.297176 0.166466 1.785 0.074227 .
log_creat_centered 0.390688 0.221717 1.762 0.078053 .
SBP_ge120 0.300378 0.298658 1.006 0.314531
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05
‘.’ 0.1 ’ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2731.1 on 43268 degrees of freedom
Residual deviance: 2603.2 on 43259 degrees of freedom AIC: 2623.2
Number of Fisher Scoring iterations: 8
# Random forest
library(randomForest)
rf_fit <- randomForest(factor(death) ~ Month + age + male.gender + married + BMI +
                         hgb_centered + adv_HIV + log_creat_centered + SBP_ge120,
                       data = train, ntree = 500)
# XGBoost
library(xgboost)
# Prepare the design matrix and labels
X_train <- model.matrix(death ~ Month + age + male.gender + married + BMI +
hgb_centered + adv_HIV + log_creat_centered +SBP_ge120 -1,
data = train)
y_train <- train$death
xgb_fit <- xgboost(data = X_train, label = y_train,
objective = "binary:logistic", nrounds = 100,verbose = F)
#=============================================#
# Predicting for S(6)
# Create expanded data using the test data each individual at t=1..6
#=============================================#
expanded_test <- test %>%
group_by(ID) %>%
tidyr::expand(Month = 1:6, .by = "ID") %>%
left_join(test %>% distinct(ID, .keep_all = TRUE), by = c("ID")) |>
rename(Month = Month.x)
# Predict hazards
expanded_test_r <- expanded_test %>%
ungroup() %>% # remove any groupings
mutate(lambda_pool = predict(pooled, newdata = ., type = "response"),
lambda_rf = predict(rf_fit, newdata = ., type = "prob")[,2])
# Convert test data to a matrix form for XGB prediction
X_test = model.matrix(death ~ Month + age + male.gender + married + BMI +
hgb_centered + adv_HIV + log_creat_centered +SBP_ge120 -1,data = expanded_test)
# Predict hazards using xgb
expanded_test_r$lambda_xgb <- predict(xgb_fit, newdata = X_test)
#====================================================#
# Compute survival at 6 months per individual
#====================================================#
S6 <- expanded_test_r %>%
group_by(ID) %>%
summarise(S6_pool = prod(1 - lambda_pool),
S6_RF = prod(1 - lambda_rf),
S6_XGB = prod(1 - lambda_xgb)
, .groups = "drop")
# View
head(S6)
#====================================================#
# Validate and compute the AUC and plots
#====================================================#
# Extract T for all patients that had the event(death)
T_i_all <- test %>%
group_by(ID) %>%
summarise(T_i = min(Month[death == 1], na.rm = TRUE),
.groups = "drop") %>%
mutate(T_i = ifelse(is.infinite(T_i), NA, T_i)) %>%
right_join(distinct(test, ID), by = "ID")
## Warning: There were 625 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `T_i = min(Month[death == 1], na.rm = TRUE)`.
## ℹ In group 74: `ID = 755`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 624 remaining warnings.
# Create Z True binary outcome to validate 6 months predictions
T_i_all <- T_i_all %>%
mutate(Z = ifelse(is.na(T_i) | T_i > 6, 0, 1),
Time = 6)
# Compare Z with the predicted outcomes
results <- left_join(T_i_all,S6,by ="ID")
results<- results %>% select(ID,T_i,Time,Z,S6_pool,S6_RF,S6_XGB)
# Compute the event probability
results <- results %>%
mutate(Pr_event_pool = 1 - S6_pool,
Pr_event_rf = 1 - S6_RF,
Pr_event_xgb = 1 - S6_XGB)
# Compute ROC and AUC
roc_pool <- roc(results$Z, results$Pr_event_pool)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_rf <- roc(results$Z, results$Pr_event_rf)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_xgb <- roc(results$Z, results$Pr_event_xgb)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
p= ggroc(list(Pooled = roc_pool, RF = roc_rf, XGB = roc_xgb),
aes = c("color")) +
labs(title = "ROC Curves for 6-Month Event Prediction",
subtitle = paste0("AUCs: Pooled=", round(auc(roc_pool), 3),
", RF=", round(auc(roc_rf), 3),
", XGB=", round(auc(roc_xgb), 3)),
x = "1 - Specificity", y = "Sensitivity") +
scale_color_manual(values = c("Pooled" = "blue", "RF" = "forestgreen", "XGB" = "darkorange")) +
theme_minimal()
p
ggsave("auc_compare_6m.pdf", p, width = 8, height = 6, dpi = 600)
# Let's see predictions for 5 years (60 months) and 10 years (120 months)
#====================================================#
# Function to predict survival at T months
#====================================================#
predict_survival <- function(test_data, horizon, pooled, rf_fit, xgb_fit) {
# Expand test data: each individual at t=1..horizon
expanded_test <- test_data %>%
group_by(ID) %>%
tidyr::expand(Month = 1:horizon, .by = "ID") %>%
left_join(test_data %>% distinct(ID, .keep_all = TRUE), by = "ID") %>%
rename(Month = Month.x)
# Predict hazards
expanded_test_r <- expanded_test %>%
ungroup() %>% # remove any groupings
mutate(lambda_pool = predict(pooled, newdata = ., type = "response"),
lambda_rf = predict(rf_fit, newdata = ., type = "prob")[, 2])
# Convert test data for XGB prediction
X_test <- model.matrix(death ~ Month + age + male.gender + married + BMI +
hgb_centered + adv_HIV + log_creat_centered + SBP_ge120 - 1,
data = expanded_test)
# Predict hazards using XGB
expanded_test_r$lambda_xgb <- predict(xgb_fit, newdata = X_test)
# Compute survival at horizon months per individual
S_pred <- expanded_test_r %>%
group_by(ID) %>%
summarise(
S_pool = prod(1 - lambda_pool),
S_RF = prod(1 - lambda_rf),
S_XGB = prod(1 - lambda_xgb),
.groups = "drop"
)
# Extract T for all patients that had the event(death)
T_i_all <- test_data %>%
group_by(ID) %>%
summarise(T_i = min(Month[death == 1], na.rm = TRUE),
.groups = "drop") %>%
mutate(T_i = ifelse(is.infinite(T_i), NA, T_i)) %>%
right_join(distinct(test_data, ID), by = "ID")
# Create binary outcome Z for validation at horizon months
T_i_all <- T_i_all %>%
mutate(Z = ifelse(is.na(T_i) | T_i > horizon, 0, 1),
Time = horizon)
# Merge predictions
results <- left_join(T_i_all, S_pred, by = "ID") %>%
mutate(
Pr_event_pool = 1 - S_pool,
Pr_event_rf = 1 - S_RF,
Pr_event_xgb = 1 - S_XGB
)
# Compute ROC and AUC
roc_pool <- roc(results$Z, results$Pr_event_pool)
roc_rf <- roc(results$Z, results$Pr_event_rf)
roc_xgb <- roc(results$Z, results$Pr_event_xgb)
# Plot ROC curves
p <- ggroc(list(Pooled = roc_pool, RF = roc_rf, XGB = roc_xgb),
aes = c("color")) +
labs(
title = paste0("ROC Curves for ", horizon, "-Month Event Prediction"),
subtitle = paste0("AUCs: Pooled=", round(auc(roc_pool), 3),
", RF=", round(auc(roc_rf), 3),
", XGB=", round(auc(roc_xgb), 3)),
x = "1 - Specificity", y = "Sensitivity"
) +
scale_color_manual(values = c("Pooled" = "blue", "RF" = "forestgreen", "XGB" = "darkorange")) +
theme_minimal()
return(list(results = results, plot = p,
aucs = c(Pooled = auc(roc_pool), RF = auc(roc_rf), XGB = auc(roc_xgb))))
}
# Predictions at 6 months
res_6 <- predict_survival(test, horizon = 6, pooled, rf_fit, xgb_fit)
## Warning: There were 625 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `T_i = min(Month[death == 1], na.rm = TRUE)`.
## ℹ In group 74: `ID = 755`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 624 remaining warnings.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Predictions at 5 years (60 months)
res_60 <- predict_survival(test, horizon = 60, pooled, rf_fit, xgb_fit)
## Warning: There were 625 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `T_i = min(Month[death == 1], na.rm = TRUE)`.
## ℹ In group 74: `ID = 755`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 624 remaining warnings.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Predictions at 10 years (120 months)
res_120 <- predict_survival(test, horizon = 120, pooled, rf_fit, xgb_fit)
## Warning: There were 625 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `T_i = min(Month[death == 1], na.rm = TRUE)`.
## ℹ In group 74: `ID = 755`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 624 remaining warnings.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# View ROC plots
res_6$plot
res_60$plot
res_120$plot
# View AUCs
res_6$aucs
   Pooled        RF       XGB
0.7013177 0.5807223 0.5700342
res_60$aucs
   Pooled        RF       XGB
0.6061191 0.5897453 0.5751908
res_120$aucs
   Pooled        RF       XGB
0.6115945 0.5963397 0.5804712
# Combine plots
p = res_6$plot + res_60$plot
ggsave("auc_compare_6m.pdf", p, width = 8, height = 6, dpi = 600)
############################################
# Get time dependent AUC
#############################################
# Obtain last row for each person in the test data
test.test = test |> group_by(ID) %>% slice_tail()
# Pooled logistic regression
roc6 <- timeROC(
T = test.test$Month, # observed time
delta = test.test$death, # event indicator
marker = res_60$results$Pr_event_pool, # model predictions (pooled logistic)
cause = 1, # event of interest
weighting = "marginal", # IPCW weighting
times = 60 # time horizon
)
plot(roc6,time = 60, main ="Pooled logistic R")
# Nicer plot
plot(
roc6,
time = 60,
col = "blue", # Line color
lwd = 2, # Line width (thicker)
main = "Pooled Logistic Model ROC at 60 Months", # Title
xlab = "1 - Specificity", # X-axis label
ylab = "Sensitivity" # Y-axis label
)
# Random Forest
roc_rf <- timeROC(
T = test.test$Month,
delta = test.test$death,
marker = res_60$results$Pr_event_rf,
cause = 1,
weighting = "marginal",
times = 60 ,# or c(6, 12, 60, 120) for multiple time points
ROC = TRUE,
# iid = TRUE # for confidence intervals
)
plot(roc_rf,time = 60)
# For XGboost
roc_xgb <- timeROC(
T = test.test$Month,
delta = test.test$death,
marker = res_60$results$Pr_event_xgb,
cause = 1,
weighting = "marginal",
times = 60 ,
ROC = TRUE
)
plot(roc_xgb,time = 60)
# All plots
plot(roc6$FP[, 2], roc6$TP[, 2], type = "l", col = "blue", lwd = 2,
main = "",
xlab = "1 - Specificity", ylab = "Sensitivity")
title("ROC Curves at 60 Months")
lines(roc_rf$FP[, 2], roc_rf$TP[, 2], col = "red", lwd = 2)
lines(roc_xgb$FP[, 2], roc_xgb$TP[, 2], col = "green", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "gray")
legend("bottomright", legend = c("Pooled Logistic", "Random Forest", "XGBoost"),
col = c("blue", "red", "green"), lwd = 2, cex = 0.8)
auc_value <- as.numeric(roc6$AUC[2]) # AUC at time=60
text(0.6, 0.3, paste("AUC =", round(auc_value, 3)), cex = 1, col = "blue")
auc_value <- as.numeric(roc_rf$AUC[2]) # AUC at time=60
text(0.6, 0.2, paste("AUC =", round(auc_value, 3)), cex = 1, col = "red")
auc_value <- as.numeric(roc_xgb$AUC[2]) # AUC at time=60
text(0.6, 0.1, paste("AUC =", round(auc_value, 3)), cex = 1, col = "green")
# ggplot version
# Create a data frame for all ROC curves
roc_df <- bind_rows(
data.frame(
FPR = roc6$FP[, 2],
TPR = roc6$TP[, 2],
Model = "Pooled Logistic",
AUC = round(roc6$AUC[2], 3)
),
data.frame(
FPR = roc_rf$FP[, 2],
TPR = roc_rf$TP[, 2],
Model = "Random Forest",
AUC = round(roc_rf$AUC[2], 3)
),
data.frame(
FPR = roc_xgb$FP[, 2],
TPR = roc_xgb$TP[, 2],
Model = "XGBoost",
AUC = round(roc_xgb$AUC[2], 3)
)
)
## Warning in data.frame(FPR = roc6$FP[, 2], TPR = roc6$TP[, 2], Model = "Pooled
## Logistic", : row names were found from a short variable and have been discarded
## Warning in data.frame(FPR = roc_rf$FP[, 2], TPR = roc_rf$TP[, 2], Model =
## "Random Forest", : row names were found from a short variable and have been
## discarded
## Warning in data.frame(FPR = roc_xgb$FP[, 2], TPR = roc_xgb$TP[, 2], Model =
## "XGBoost", : row names were found from a short variable and have been discarded
# Combine AUC into model names for legend
roc_df <- roc_df %>%
mutate(Model_AUC = paste0(Model, " (AUC=", AUC, ")"))
p2 = ggplot(roc_df, aes(x = FPR, y = TPR, color = Model_AUC)) +
geom_line(size = 1.2) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
labs(
title = "Time dependent ROC Curves at 60 Months",
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)",
color = "Model"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.direction = "vertical", # make it vertical
legend.title = element_text(face = "bold"),
legend.text = element_text(size = 12)
) +
guides(color = guide_legend(ncol = 1)) # force 1 column (vertical)
ggsave("timedependentauc_60m.pdf", p2, width = 8, height = 6, dpi = 600)
# Show why censoring needed to be taken into account
show_cens = left_join(test.test %>% select(ID,Month,death), results,by = "ID")
xx = show_cens|> filter(ID %in% c(4801,4684,297,1259,656,714,155,133)) |> select(ID,Month,death,T_i,Time,Z,S6_pool,Pr_event_pool)
xtable::xtable(xx)