Logistic regression is a statistical method used to model the relationship between a binary outcome (event) and one or more independent variables, which may be categorical or continuous.
It is used both as a classification algorithm in machine learning and as a statistical tool for inference, for example when we want to quantify how certain explanatory variables affect a binary outcome.
The model is such that
\[\text{logit}(Y) = \beta^\top X\] where
\[ Y = \begin{cases} 1, & \text{if the event occurred} \\ 0, & \text{if the event did not occur} \end{cases} \]
\(X\) is a vector of predictors/explanatory variables.
\(\beta\) is the vector of regression coefficients.
The logit function is a log transformation of the event probability into the log-odds scale. It is defined as:
\[\text{logit}(Y) = \log\bigg(\frac{P(Y=1 \mid X)}{1-P(Y=1\mid X)}\bigg) = \log\bigg(\frac{p}{1-p}\bigg)\] where \(P(Y=1\mid X)=p\) is the probability that the event occurs conditional on \(X\).
\(\text{logit}(Y)\) is a linear combination of contributions from each predictor \(x_j\) in the vector \(X\).
\[\text{logit}(\Pr(Y = 1\mid X)) = \beta^T X= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k\] Each coefficient \(\beta_j\) represents the change in the log-odds of the outcome per one-unit increase in \(x_j\), holding all other variables constant.
To interpret \(\beta_j\) more intuitively, we exponentiate it to obtain the odds ratio: \[\exp(\beta_j) = \text{odds ratio associated with a one-unit increase in } x_j\] If \(\exp(\beta_j) > 1\), the odds of the event increase as \(x_j\) increases.
If \(\exp(\beta_j) < 1\), the odds of the event decrease as \(x_j\) increases.
If \(\exp(\beta_j) = 1\), there is no association between \(x_j\) and the outcome. (A short R sketch of this calculation follows below.)
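As a quick illustration, once a logistic model has been fitted in R (such models are fitted later in this document with glm()), the coefficients can be exponentiated to read off odds ratios. Here `fit` is just a placeholder name for any fitted logistic model:
# Minimal sketch: odds ratios from a fitted logistic model `fit` (placeholder name)
exp(coef(fit))             # odds ratios: exp(beta_j) per one-unit increase in x_j
exp(confint.default(fit))  # Wald 95% confidence intervals on the odds-ratio scale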
Recall that we are modelling the logit of the outcome given \(X\) as \[\text{logit}(Y\mid X) = \log\bigg(\frac{p}{1-p}\bigg) = \beta^\top X\]
To find the probability p, you invert the logit function. Start from:
\[\log\left(\frac{p}{1 - p}\right) = \beta^\top X\]
Exponentiate both sides:
\[\frac{p}{1 - p} = e^{\beta^\top X}\]
Solve for p:
\[ logit^{-1}(Y|X) = p = \frac{e^{\beta^\top X}}{1 + e^{\beta^\top X}} = \frac{1}{1 + e^{-\beta^\top X}}\]
This result is what we call the logistic function or sigmoid curve. It converts the linear predictor into a valid probability in \([0, 1]\), while the logit function maps a probability back to the log-odds scale.
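R already provides both transformations: qlogis() is the logit and plogis() is its inverse (the sigmoid). A tiny check:
p <- 0.75
qlogis(p)          # logit: log(0.75 / 0.25) = log(3), about 1.099
plogis(qlogis(p))  # inverse logit recovers 0.75
plogis(0)          # a linear predictor of 0 corresponds to a probability of 0.5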
In linear regression analysis we are concerned with estimating the line of best fit, the one that minimizes the sum of squared residuals: \(\text{RSS} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2\)
The fit is such that : \[\hat{Y} = \beta^TX\] which simplifies to \(\hat{Y} = \beta_0 + \beta_1X\) when we have one predictor.
In cases with more than one predictor, we often use a residuals vs fitted values plot to assess model fit. If we observe a curve or pattern in this plot, it may indicate violations of the linearity or homoscedasticity assumptions.
For linearity violations, possible remedies include transforming predictors (e.g., log, square root), adding polynomial terms, using splines or generalized additive models (GAMs), or switching to tree-based models.
If the residuals display a funnel shape, that is, their spread increases or decreases across the fitted values, this suggests heteroscedasticity. In such cases, we may transform the outcome variable or model the variance structure explicitly (e.g., with weighted least squares or generalized least squares).
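As a small illustration, once the workshop data htndat (loaded later in this document) is available, a residuals-vs-fitted plot for an illustrative linear model can be drawn directly with base R; the predictors chosen here are arbitrary and not a model analysed elsewhere:
# Sketch: residuals vs fitted values for an illustrative linear model
m_check <- lm(SBP ~ age + BMI, data = htndat)   # illustrative predictors only
plot(fitted(m_check), resid(m_check),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # look for curvature (non-linearity) or a funnel shape (heteroscedasticity)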
A visual representation of logistic regression differs from linear regression because the outcome is binary. Instead of predicting continuous values, the model estimates probabilities between 0 and 1. These predicted probabilities can then be compared to a threshold (e.g., 0.5) to classify observations into outcome categories—typically 1 for “event” and 0 for “no event.”
The standard residual sum of squares (RSS) used in linear regression is not appropriate for binary outcomes. Instead of minimizing RSS, logistic regression maximizes the log-likelihood based on the Bernoulli distribution of the outcome. Residual plots (e.g., deviance or Pearson residuals vs. fitted values) can help detect patterns or model misfit. Goodness of fit is typically assessed using deviance, Hosmer-Lemeshow tests, or pseudo-\(R^2\) measures like McFadden’s \(R^2\).
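For reference, a minimal sketch of two of these summaries, assuming `fit` is any logistic model fitted with glm() (such as the models fitted later in this document) and that the model variables have no missing values:
# Deviance and McFadden's pseudo-R^2 for a fitted logistic model `fit`
null_fit <- update(fit, . ~ 1)                              # intercept-only model
mcfadden <- 1 - as.numeric(logLik(fit) / logLik(null_fit))  # 1 - logLik(model)/logLik(null)
c(residual_deviance = deviance(fit),
  null_deviance     = deviance(null_fit),
  McFadden_R2       = mcfadden)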
We have mentioned the residual sum of squares, and it is important to understand that it comes from an estimation method known as least squares (LS). The regression coefficients \(\beta\) are obtained by minimizing the sum of squared residuals: \(RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2\). This method works directly for continuous outcomes because we assume the residuals are normally distributed.
However, there exists a general-purpose estimation method known as maximum likelihood estimation (MLE). It works by maximizing the likelihood function \[L(\beta) = \prod_{i=1}^n f(y_i \mid x_i, \beta)\] or, equivalently, maximizing the log-transformed likelihood for efficient computation:
\[\ell(\beta) = \sum_{i=1}^n \log f(y_i \mid x_i, \beta)\] MLE works for many outcome types (continuous, binary, count, etc.) and only requires specifying a probability model for \(Y_i\): Gaussian for a normally distributed continuous outcome, Bernoulli for a binary outcome, Poisson for count data, and so on.
As mentioned, when the outcome is binary we assume a Bernoulli distribution for \(Y_i\): \[Y_i \overset{indep}{\sim} \text{Bernoulli}(p_i)\] where \(p_i\) is the probability of success for the event, that is \(P(Y=1\mid X)\). We then express the likelihood for \(n\) independent outcomes as \[L(\beta) = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\] or equivalently the log-likelihood as \[\ell(\beta) = \sum_{i=1}^n \big[ y_i\log(p_i) + (1-y_i)\log(1-p_i) \big]\] If there were no covariates (a single common \(p\)), maximizing the latter would give the closed-form estimate \(\hat{P}(Y=1) = \hat{p} = \frac{\sum_{i} y_i}{n}\), the sample proportion.
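A quick numeric check of this no-covariate special case, with made-up data:
# With no covariates, the MLE of p is just the sample proportion of 1s
y <- c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1)
mean(y)                                       # 0.5
plogis(coef(glm(y ~ 1, family = binomial)))   # logit^-1 of the intercept is also 0.5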
But we know that \(\hat{p} = \frac{1}{1 + e^{-X^\top\beta}} = \frac{e^{X^\top\beta}}{1+e^{X^\top\beta}}\)
Therefore, to solve for the regression coefficients \(\beta\), we substitute this expression for \(p_i\) and use the revised log-likelihood
\[\ell(\beta) = \sum_{i=1}^n \bigg[ y_i \log\left( \frac{e^{X_i^T\beta}}{1+e^{X_i^T\beta}} \right) + (1-y_i)\log\left( 1 - \frac{e^{X_i^T\beta}}{1+e^{X_i^T\beta}} \right) \bigg]\]
This simplifies to the canonical form of the logistic regression log-likelihood: \[\ell(\beta) = \sum_{i=1}^n \bigg[ y_i X_i^\top\beta - \log\big(1+e^{X_i^\top\beta}\big) \bigg]\]
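This canonical form is easy to evaluate directly. A minimal sketch, where X is assumed to be a design matrix (first column of 1s for the intercept), y a 0/1 vector, and beta a candidate coefficient vector:
# Log-likelihood of a logistic regression in its canonical form
loglik <- function(beta, X, y) {
  eta <- as.vector(X %*% beta)        # linear predictor X beta
  sum(y * eta - log(1 + exp(eta)))    # sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
}
# e.g. loglik(rep(0, ncol(X)), X, y) evaluates the log-likelihood at beta = 0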
Using the MLE method, we need the first and second derivatives of \(\ell(\beta)\). The second derivative is known as the Hessian matrix; it confirms that we have a maximum when it is negative definite.
\[ \frac{\partial \ell(\beta)}{\partial\beta} = \sum_{i=1}^n \bigg[ \frac{\partial }{\partial\beta}\, y_i X_i^\top\beta - \frac{\partial}{\partial\beta} \log\big(1+e^{X_i^\top\beta}\big) \bigg] \] Applying the chain rule to the second component we get
\[\frac{\partial \ell(\beta)}{\partial\beta} = \sum_{i=1}^n \bigg[ y_i X_i - \frac{X_i e^{X_i^\top\beta}}{1+ e^{X_i^\top\beta}} \bigg]\] which simplifies to the score function (first derivative)
\[ U(\beta) = \frac{\partial \ell(\beta)}{\partial\beta} = \sum_{i=1}^n \big( y_i - p_i \big)\, x_i \]
The Hessian is obtained by differentiating \(-p_i x_i\), since \(y_i x_i\) does not depend on \(\beta\):
\[H(\beta) = \frac{\partial^2 \ell(\beta)}{\partial\beta\,\partial\beta^\top} = -\sum_{i=1}^n p_i(1-p_i)\, x_i x_i^\top \]
The above working reveals that \(p\) depends on \(\beta\) through \(X^\top\beta\), making the gradient (the score function) itself depend on \(\beta\). This means we cannot isolate \(\beta\) on one side of the equation, so there is no closed-form solution for \(\beta\). This is in contrast to linear regression, where setting \(U(\beta) = 0\) yields \(\hat{\beta} = (X^\top X)^{-1}X^\top y\).
Thus, in logistic regression, maximizing \(\ell(\beta)\) requires solving the score equation for \(\beta\) iteratively, either through Newton-Raphson, Fisher scoring, or gradient ascent.
Newton-Raphson updates use the observed Hessian: \[\beta^{t+1} = \beta^{t} - {H(\beta^t)}^{-1} U(\beta^t)\]
Fisher scoring replaces the observed Hessian with its expectation: \[\beta^{t+1} = \beta^{t} - {E[H(\beta^t)]}^{-1} U(\beta^t)\] where \(-E[H(\beta)]\) is the expected (Fisher) information matrix used in place of the actual Hessian \(H(\beta)\).
Gradient ascent instead moves along the score in steps of size \(\alpha\); it is usually slower than Newton-Raphson but does not require second derivatives:
\[\beta^{t+1} = \beta^{t} + \alpha\, U(\beta^t)\] We conclude the MLE discussion for logistic regression by showing why it is often referred to as an iteratively reweighted least squares (IRLS) update.
In matrix form the score is \(U(\beta) = X^\top(y-p)\), where \(X\) is the \(n \times p\) design matrix, \(y\) is the vector of observed outcomes, and \(p\) is the vector of fitted probabilities under the current \(\beta\). The Hessian is \(H(\beta) = - X^\top W X\), with \(W = \operatorname{diag}\{p_i(1-p_i)\}\) a diagonal matrix of weights. The Newton-Raphson update then looks like a weighted least squares update, with weight \(W\) attached to each observation:
\[\beta^{t+1} = \beta^{t} + (X^\top W X)^{-1}X^\top(y-p)\]
Defining the residuals as \(y-p\) and the working response \(z = X\beta^{t} + W^{-1}(y - p)\), where \(X\beta^{t}\) is the current linear predictor \(\eta^t\) and \(W^{-1}(y-p)\) is an adjustment to the residuals, \(z\) is also called the adjusted dependent variable.
Each iteration therefore solves the weighted least squares problem \[\beta^{(t+1)} = (X^\top W^{(t)}X)^{-1}X^\top W^{(t)}z^{(t)}\] Under the hood, R's glm() implements the Fisher scoring method in exactly this form.
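To make the algorithm concrete, here is a bare-bones IRLS loop written only from the formulas above. It is a sketch for intuition, not a replacement for glm(); X is assumed to be a design matrix with an intercept column and y a 0/1 vector.
# Iteratively reweighted least squares for logistic regression (sketch)
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))                      # start at beta = 0
  for (iter in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)               # current linear predictor
    p   <- plogis(eta)                         # fitted probabilities
    W   <- p * (1 - p)                         # diagonal weights
    z   <- eta + (y - p) / W                   # working response
    beta_new <- as.vector(solve(crossprod(X, W * X), crossprod(X, W * z)))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
# The result should agree with coef(glm(y ~ X - 1, family = binomial)) up to numerical tolerance.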
Now that we know what goes on when fitting a logistic regression, before fitting any model we first explore the relationships between the predictors and the outcome. For continuous outcomes and continuous predictors we use scatter plots, possibly adding loess lines to help visualize trends. For a binary outcome with a continuous predictor we use box plots. We may also use counts and frequencies, as well as bivariate tests such as the chi-square test.
# load your data
load("Workshop2025.Rdata")
str(htndat)
'data.frame':   4998 obs. of  17 variables:
 $ DBP               : int  60 75 60 60 60 60 59 60 70 58 ...
 $ SBP               : int  90 110 80 90 100 120 100 100 120 110 ...
 $ BMI               : num  NA 27.3 17.7 19.9 21.3 ...
 $ age               : num  28 26.5 43 50.1 30.6 ...
 $ married           : int  0 1 0 1 0 1 0 0 1 1 ...
 $ male.gender       : int  0 0 0 1 1 0 0 1 0 0 ...
 $ hgb_centered      : num  NA -3.9 -3.2 NA -0.4 ...
 $ adv_HIV           : int  NA 0 NA NA NA NA NA NA NA NA ...
 $ survtime          : int  338 439 752 526 215 272 1669 388 14 419 ...
 $ event             : int  1 1 1 1 1 1 1 1 1 1 ...
 $ arv_naive         : int  1 1 1 1 1 1 1 0 1 1 ...
 $ urban.clinic      : int  0 1 0 1 0 0 1 0 0 0 ...
 $ log_creat_centered: num  NA 5.42e-02 -3.60e-01 NA -1.00e-07 ...
 $ IPW_weight        : num  0.924 1.164 0.721 0.829 0.856 ...
 $ SBP_ge120         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ survmonth         : num  12 15 26 18 8 10 56 13 1 14 ...
 $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
# Get to know your data
#ls()
# Which variables
#names(htndat)
#names(htnlongdat)
# Size
#dim(htndat)
#dim(htnlongdat)
# Identify your outcome variable and ensure it is binary
#table(htndat$event)# death=1, censored = 0
#typeof(htndat$event) # Right?
###
library(ggplot2)
ggplot(htndat, aes(x = factor(event), y = survmonth)) +
geom_boxplot(notch = TRUE) +
ggtitle("Boxplot of survival time againts event")
# Table 1 summary with p-values from ANOVA and chi-square tests
library(gtsummary)
htndat$event <- factor(htndat$event)
htndat$male.gender <- factor(htndat$male.gender)
htndat$adv_HIV <- factor(htndat$adv_HIV)
htndat$arv_naive <- factor(htndat$arv_naive)
htndat$urban.clinic <- factor(htndat$urban.clinic)
htndat$SBP_ge120 <- factor(htndat$SBP_ge120)
# Table 1 stratified by 'event'
tbl <- htndat %>%
tbl_summary(
by = event, # stratify by event
statistic = all_continuous() ~ "{mean} ({sd})",
digits = all_continuous() ~ 2
) %>%
add_p(
test = list(
all_continuous() ~ "aov", # ANOVA for continuous
all_categorical() ~ "chisq.test" # Chi-sq for categorical
)
)
# Add p-values
# add overall column
tbl <- tbl %>% add_overall()
tbl
| Characteristic | Overall, N = 4,998¹ | 0, N = 4,249¹ | 1, N = 749¹ | p-value² |
|---|---|---|---|---|
| DBP | 68.03 (10.32) | 68.71 (10.11) | 64.17 (10.66) | <0.001 |
| SBP | 110.60 (14.78) | 111.31 (14.70) | 106.56 (14.56) | <0.001 |
| BMI | 21.43 (3.62) | 21.62 (3.65) | 20.35 (3.25) | <0.001 |
| Unknown | 669 | 569 | 100 | |
| age | 35.19 (10.56) | 35.07 (10.51) | 35.90 (10.86) | 0.046 |
| married | 2,848 (59%) | 2,453 (60%) | 395 (54%) | 0.007 |
| Unknown | 163 | 141 | 22 | |
| male.gender | | | | <0.001 |
| 0 | 3,643 (73%) | 3,201 (75%) | 442 (59%) | |
| 1 | 1,355 (27%) | 1,048 (25%) | 307 (41%) | |
| hgb_centered | -0.05 (2.59) | 0.08 (2.57) | -0.88 (2.57) | <0.001 |
| Unknown | 1,393 | 1,147 | 246 | |
| adv_HIV | | | | <0.001 |
| 0 | 1,130 (37%) | 1,031 (39%) | 99 (25%) | |
| 1 | 1,908 (63%) | 1,609 (61%) | 299 (75%) | |
| Unknown | 1,960 | 1,609 | 351 | |
| survtime | 750.68 (650.86) | 788.04 (666.77) | 538.72 (502.19) | <0.001 |
| arv_naive | | | | 0.4 |
| 0 | 468 (9.4%) | 405 (9.5%) | 63 (8.4%) | |
| 1 | 4,530 (91%) | 3,844 (90%) | 686 (92%) | |
| urban.clinic | | | | 0.024 |
| 0 | 2,636 (53%) | 2,212 (52%) | 424 (57%) | |
| 1 | 2,362 (47%) | 2,037 (48%) | 325 (43%) | |
| log_creat_centered | -0.09 (0.32) | -0.10 (0.31) | -0.04 (0.33) | <0.001 |
| Unknown | 1,537 | 1,264 | 273 | |
| IPW_weight | 0.99 (0.33) | 1.00 (0.33) | 0.98 (0.30) | 0.3 |
| SBP_ge120 | | | | 0.046 |
| 0 | 4,748 (95%) | 4,025 (95%) | 723 (97%) | |
| 1 | 250 (5.0%) | 224 (5.3%) | 26 (3.5%) | |
| survmonth | 25.54 (21.67) | 26.78 (22.19) | 18.46 (16.74) | <0.001 |
| ID | 2,499.50 (1,442.94) | 2,874.00 (1,226.72) | 375.00 (216.36) | <0.001 |

¹ Mean (SD); n (%)
² One-way analysis of means; Pearson's Chi-squared test
We use the function lm(Y ~ X, data = yourdata) to fit a linear regression. On the other hand, we use glm(Y ~ X, data = yourdata, family = binomial(link = "logit")) to fit a logistic regression model in R.
As discussed earlier, Y is a binary outcome variable — for example, life status where Y = 1 indicates death and Y = 0 indicates survival. The vector of explanatory variables X could include predictors such as age, gender, BMI, and systolic blood pressure (SBP).
The argument yourdata refers to the dataset — typically a data.frame, data.table, or tibble — that contains both the outcome variable and the predictor variables.
By specifying family = binomial(link = "logit"), we are telling R that the outcome follows a binomial distribution and that we are modeling the logit transformation of the probability of the event Y = 1.
m0 = lm(SBP ~ age + male.gender + married, data = htndat )
#summary(m0)
library(knitr)
library(gtsummary)
kable(coef(summary(m0)),caption="Linear regression")
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 103.0737900 | 0.7781899 | 132.453256 | 0.0000000 |
| age | 0.1697076 | 0.0205136 | 8.272921 | 0.0000000 |
| male.gender1 | 1.1482555 | 0.4971801 | 2.309536 | 0.0209558 |
| married | 1.8972446 | 0.4347769 | 4.363720 | 0.0000131 |
#tbl_regression(m0)
m1 = glm(event ~ survtime + adv_HIV + male.gender ,data = htndat,family = binomial(link="logit"))
# Obtain fit summary
#summary(m1)
# What issues do you note?
kable(coef(summary(m1)),caption="Logistic regression")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -2.0582266 | 0.1265390 | -16.265556 | 0.00e+00 |
| survtime | -0.0006148 | 0.0000962 | -6.393221 | 0.00e+00 |
| adv_HIV1 | 0.5392274 | 0.1247368 | 4.322921 | 1.54e-05 |
| male.gender1 | 0.7235381 | 0.1153533 | 6.272366 | 0.00e+00 |
To obtain predictions from a logistic regression model in R, we typically use built-in functions for efficiency, though manual calculation using the model coefficients will yield the same results.
In this context, we are primarily interested in the fitted probabilities. Since the model is applied to the full dataset used for fitting, these predictions are considered in-sample or training predictions. In contrast, a proper classification pipeline often involves splitting the data into training, testing, and validation sets to assess model performance more reliably.
In R, the predict() function is used as follows: predict(fit, type = "response"). This returns the estimated probabilities \(\hat{p}_i = \Pr(Y_i = 1 \mid X_i)\). Note that glm objects do not offer a type = "class" option; predicted classes are obtained by applying a threshold to these probabilities (e.g., classify as 1 when \(\hat{p}_i > 0.5\)).
To generate predictions for new observations, the new data must contain all the predictor variables used in model fitting. The syntax becomes:
predict(fit, type = "response", newdata = new_data)
predictions = predict(m1, type = "response")
head(predictions)
2 11 12 13 14 18
0.08881977 0.14275512 0.20466253 0.17753473 0.16169752 0.17457218
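If predicted classes are needed, we threshold these probabilities ourselves. A minimal sketch using the 0.5 cut-off mentioned earlier (m1$model contains the complete cases actually used in the fit, so its rows line up with the predictions):
# Classify using a 0.5 threshold and compare with the observed outcome
pred_class <- ifelse(predictions > 0.5, 1, 0)
table(observed = m1$model$event, predicted = pred_class)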
A survival outcome is a special type of binary event—such as death or disease progression—that is observed over time. Unlike standard binary outcomes, survival data account not just for whether an event occurred, but also when it occurred.
For example, consider an HIV cohort study following individuals from the time of enrollment and treatment initiation. The investigator is interested in the time to death. Subject A might die after 10 years of follow-up, while subject B dies after just 2 years. Subject C may remain alive at the end of the study, never experiencing the event, and subject D may drop out after 4 years, with their survival status thereafter unknown.
This variability in event timing and incomplete follow-up requires methods beyond standard binary regression. One such approach is pooled logistic regression, which adapts logistic regression to handle survival outcomes in discrete time intervals.
In discrete-time survival analysis, follow-up time is divided into intervals (e.g., weeks, months, years). For each subject, we track whether the event (e.g., death) occurs in each interval. This transforms the survival data into a person-period dataset.
Let \(t = 1, 2, \dots, T\) denote discrete time intervals (e.g., years). \(Y_{it} \in \{0, 1\}\) be a binary indicator for whether subject i experienced the event in interval \(t\) \((1 = event \; occurred(death), 0 = no \; event(alive))\). \(\mathbf{X}_{it}\) be the covariates for subject \(i\) at time \(t\) (can include time-fixed or time-varying predictors).
We assume that each subject contributes a row for each time point up until the event or censoring.
The model expresses the conditional probability of the event occurring at time t for subject i, given they have survived up to t:
\[\Pr(Y_{it} = 1 \mid Y_{i,t-1} = 0, \mathbf{X}_{it}) = \text{logit}^{-1} \big(\alpha_t + \mathbf{X}_{it}^\top \boldsymbol{\beta}\big)\]
Where \(\alpha_t\) is a time-specific intercept (e.g., one per time point to allow for baseline hazard variation). \(\boldsymbol{\beta}\) is the vector of coefficients for the covariates \(\mathbf{X}_{it}\).
This is a logistic regression where the outcome is whether the event occurred in a given time interval. The data is pooled across time, hence the name “pooled logistic regression.”
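The person-period dataset htnlongdat used below was prepared in advance for the workshop. For intuition only, here is a sketch of how such a dataset could be built from the one-row-per-subject data, assuming columns ID, survmonth (follow-up in months) and event (1 = died, 0 = censored) as in htndat; this is not necessarily the exact preparation used for htnlongdat:
library(dplyr)
library(tidyr)
# Expand each subject into one row per month of follow-up (sketch)
person_period <- htndat %>%
  mutate(n_months = pmax(1, ceiling(survmonth))) %>%       # at least one interval per subject
  uncount(n_months, .id = "Month") %>%                     # one row per subject-month
  group_by(ID) %>%
  mutate(death = as.integer(event == 1 & Month == max(Month))) %>%  # event only in the last interval
  ungroup()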
library(dplyr)
library(tidyr)
xx=htnlongdat |> filter(ID %in% c(133,1259)) |>
select(ID,Month,death,age,BMI,hgb_centered,SBP_ge120)
kable(xx,caption="Sample Person period data")
| ID | Month | death | age | BMI | hgb_centered | SBP_ge120 |
|---|---|---|---|---|---|---|
| 133 | 1 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 2 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 3 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 4 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 5 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 6 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 7 | 0 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 133 | 8 | 1 | 25.57974 | 18.31356 | -1.3000002 | 0 |
| 1259 | 1 | 0 | 29.51677 | 23.08254 | 0.6000004 | 0 |
| 1259 | 2 | 0 | 29.51677 | 23.08254 | 0.6000004 | 0 |
| 1259 | 3 | 0 | 29.51677 | 23.08254 | 0.6000004 | 0 |
| 1259 | 4 | 0 | 29.51677 | 23.08254 | 0.6000004 | 0 |
# add a time varying data
xtable::xtable(xx)
Fitting and estimation proceed exactly as in an ordinary logistic regression.
htnlongdat = htnlongdat |> drop_na()
library(splines)
fit = glm(death~ ns(Month,2) + age + BMI +
hgb_centered + adv_HIV+ SBP_ge120,
family= binomial(link="logit"),
data = htnlongdat)
summary(fit)
Call:
glm(formula = death ~ ns(Month, 2) + age + BMI + hgb_centered + adv_HIV + SBP_ge120, family = binomial(link = "logit"), data = htnlongdat)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   -4.195052   0.483265  -8.681  < 2e-16 ***
ns(Month, 2)1 -0.641916   0.313418  -2.048  0.04055 *
ns(Month, 2)2 -0.182229   0.450190  -0.405  0.68564
age            0.013318   0.005212   2.555  0.01061 *
BMI           -0.085760   0.019670  -4.360 1.30e-05 ***
hgb_centered  -0.131210   0.023393  -5.609 2.04e-08 ***
adv_HIV        0.442044   0.140556   3.145  0.00166 **
SBP_ge120      0.280788   0.249679   1.125  0.26076
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3684.0  on 61327  degrees of freedom
Residual deviance: 3587.9  on 61320  degrees of freedom
AIC: 3603.9

Number of Fisher Scoring iterations: 8
#for a cleaner output
tbl_regression(fit)
| Characteristic | log(OR) | 95% CI | p-value |
|---|---|---|---|
| ns(Month, 2) | | | |
| ns(Month, 2)1 | -0.64 | -1.3, -0.03 | 0.041 |
| ns(Month, 2)2 | -0.18 | -1.1, 0.65 | 0.7 |
| age | 0.01 | 0.00, 0.02 | 0.011 |
| BMI | -0.09 | -0.12, -0.05 | <0.001 |
| hgb_centered | -0.13 | -0.18, -0.09 | <0.001 |
| adv_HIV | 0.44 | 0.17, 0.72 | 0.002 |
| SBP_ge120 | 0.28 | -0.24, 0.74 | 0.3 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
# Get the predicted hazards
htnlongdat$lambda_t <- predict(fit, type = "response")
# Get the sample data
xx=htnlongdat |> filter(ID %in% c(133,1259)) |>
select(ID,Month,death,age,BMI,lambda_t)
# add a time varying data
xtable::xtable(xx)
xtable::xtable(xx, digits = -1)
Let \[\lambda_{it}(t) = \Pr(Y_{it} = 1 \mid Y_{i,t-1} = 0, X_i) = \Pr(Y_{it} = 1 \mid T \ge t, X_i) = \Pr(T = t \mid T \ge t, X_i)\] be the discrete-time hazard function for individual i at time t. This represents the probability that subject i experiences the event (e.g., death) at time t, given they have not yet experienced it by t, conditional on covariates \(X_i\). Here, T denotes the event time, and \(X_i\) is a vector of covariates for individual \(i\).
Note that if we define \[ Y_{it} = \begin{cases} 1 & \text{if } T = t \\ 0 & \text{otherwise}, \end{cases} \] then \(\lambda_{it}\) can equivalently be expressed as \(\lambda_{it} = \Pr(Y_{it} = 1 \mid T \ge t, X_i)\).
We model \(\lambda_{it}\) using a pooled logistic regression. Taking the complement \(1 - \lambda_{it}\) gives the probability that individual i does not experience the event at time t (i.e., remains alive), given no prior occurrence of the event up to t-1 and conditional on covariates X_i.
Therefore, the survival function can be expressed as: \[S(t \mid X_i) = \Pr(T > t \mid X_i) = \prod_{k=1}^{t} \left(1 - \lambda_{ik}(k)\right)\], which gives the probability that individual i survives beyond time t.
We are interested in \[S(6 \mid X_i) = \Pr(T > 6 \mid X_i) = \prod_{t=1}^{6} \big(1 - \lambda_{it}(t)\big)\] which is the probability of not having the event(death) in any of the first 6 months.
Here \(\lambda_{it}(t)\) is the predicted hazard at time \(t\) for individual \(i\). The product multiplies the complements of the hazards across times \(1\) to \(6\) for each individual, as in the small numeric example below.
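A tiny numeric illustration with made-up monthly hazards:
# Chaining six made-up monthly hazards into a 6-month survival probability
lambda <- c(0.02, 0.02, 0.03, 0.03, 0.04, 0.04)
prod(1 - lambda)       # about 0.83: probability of surviving all six months
1 - prod(1 - lambda)   # about 0.17: probability of having the event by month 6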
To compute the survival probability \(S(6 \mid X_i)\), we must first create new rows for each individual at every time point up to 6 months (or any time horizon T of interest). We then use the fitted model to predict \(\lambda_{it}\) at each time point and combine these conditional hazards over time using the survival formula. Recall that the pooled logistic regression predicts the probability of experiencing the event at time t given survival up to t; the overall survival probability is obtained by chaining these predictions across all t.
A common concern is what happens if an individual experienced the event before \(T\) or remained event-free by \(T\). This issue is already handled during model fitting: the pooled logistic regression accounts for censoring and event times. When predicting, we are essentially asking, “What if the individual survived up to each time \(t\)?” under the fitted model, independent of their actual observed survival status. For this reason, it is necessary to supply a grid of time points \(t = 1, \dots, 6\) and the covariates \(X_i\) to the prediction function.
If we wish to compute \(S(6 \mid X_i)\) for all individuals in the dataset, we must create an expanded dataset where each individual \(i\) is represented at all time points \(t = 1, \dots, 6.\)
# In sample predictions
xx <- htnlongdat %>%
group_by(ID) %>%
summarise(S = prod(1 - lambda_t), .groups = "drop") |> filter(ID %in% c(133,1259))
xtable::xtable(xx, digits = -1)
# Expand data: each individual at t=1..6
expanded_data <- htnlongdat %>%
group_by(ID) %>%
tidyr::expand(Month = 1:6, .by = "ID") %>%
left_join(htnlongdat %>% distinct(ID, .keep_all = TRUE), by = c("ID")) %>%
# bring in baseline covariates, just one copy of X at each t
rename(Month = Month.x)
#? What if we have time varying predictors
# Predict hazards
expanded_data <- expanded_data %>%
ungroup() %>% # remove any groupings
mutate(predicted_hazard = predict(fit, newdata = ., type = "response"))
# What do you observe
# Compute survival at 6 months per individual
survival_6mo <- expanded_data %>%
group_by(ID) %>%
summarise(S6_predicted_prob = prod(1 - predicted_hazard), .groups = "drop")
head(survival_6mo[!is.na(survival_6mo$S6_predicted_prob),])
The standard pooled logistic regression can be extended in several ways to improve flexibility and capture complex relationships in the data.
In many applications, covariates X_i(t) may change over time (e.g., blood pressure, treatment adherence). These time-varying predictors can be directly incorporated into the model:
\[\text{logit} \, \lambda_{it}(t) = \beta_0 + \beta_1 X_{i}(t) + \beta_2 t + \dots\]
where X_i(t) represents the value of covariate X for individual i at time t. This allows the model to account for dynamic patient characteristics and their effect on the hazard.
A key assumption in pooled logistic regression is the functional form of time t. Instead of using a simple linear or categorical term for t, spline functions can be used to model complex, nonlinear effects:
\[\text{logit} \, \lambda_{it}(t) = \beta_0 + f(t) + \beta_1 X_{i} + \dots\]
where f(t) is a spline basis (e.g., natural splines or cubic splines). This approach captures variations in the baseline hazard over time more smoothly.
Interactions between covariates and time or between covariates themselves can capture differential effects across time or groups:
\[\text{logit} \, \lambda_{it}(t) = \beta_0 + \beta_1 X_i + \beta_2 t + \beta_3 X_i \cdot t + \dots\] Covariate \(\times\) time interactions \((X_i \cdot t)\) allow the effect of a covariate to vary across time, while covariate \(\times\) covariate interactions help model synergistic or antagonistic effects between predictors (a sketch follows below).
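For illustration only, a covariate-by-time interaction could be added to the pooled model along these lines; this particular specification is not fitted elsewhere in this document:
# Sketch: allowing the effect of gender on the discrete-time hazard to change with follow-up time
fit_tv <- glm(death ~ Month + male.gender + male.gender:Month +
                BMI + hgb_centered + adv_HIV,
              family = binomial(link = "logit"),
              data = htnlongdat)
# The interaction term lets the gender effect on the log-odds change linearly with Month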
These enhancements are particularly important in survival analysis settings where the proportional hazards assumption may not hold.
The pooled logistic regression model should be assessed using standard diagnostic tools. These include examination of the deviance, residual plots, and other goodness-of-fit measures to ensure that the model adequately captures the relationship between covariates and the discrete-time hazard.
Once the model has been fitted, the predicted survival probabilities \(S(T \mid X_i)\) can be validated against observed data. For this purpose, define a binary outcome indicating whether each individual experienced the event by time T: \[ Z_i = \begin{cases} 1, & \text{if } T_i \leq T \\ 0, & \text{if } T_i > T \end{cases} \] where \(T_i\) is the observed event time for individual i.
The predicted survival probabilities \(S(T \mid X_i)\) can then be compared with these observed outcomes by calculating the area under the receiver operating characteristic curve (AUC). This provides a summary measure of the model’s ability to discriminate between individuals who experience the event by T and those who do not.
# Extract T for all patients that had the event(death)
T_i_all <- htnlongdat %>%
filter(death == 1) %>%
group_by(ID) %>%
summarise(T_i = min(Month), .groups = "drop")
# Join back to full list of IDs to get T= NA where no death occurred
T_i_all <- htnlongdat %>%
distinct(ID) %>%
left_join(T_i_all, by = "ID")
# Create Z True binary outcome to validate 6 months predictions
T_i_all <- T_i_all %>%
mutate(Z = ifelse(is.na(T_i) | T_i > 6, 0, 1),
Time = 6)
# Compare Z with the predicted outcomes
# Join everything and drop cases where missing data prevented pooled logistic estimation (listwise deletion)
results <- left_join(T_i_all,survival_6mo,by ="ID")
results <- results[!is.na(results$S6_predicted_prob),] # drop IDs with missing predictions
results<- results %>% select(ID,T_i,Time,Z,S6_predicted_prob)
# Compute the event probability
results <- results %>%
mutate(event_prob_6mo = 1 - S6_predicted_prob)
library(pROC)
# Compute ROC and AUC
roc_obj <- roc(results$Z, results$event_prob_6mo)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Print AUC
#roc_obj
#print(xtable::xtable(head(results)), include.rownames = FALSE)
library(ggplot2)
# ggplot-style ROC curve
p =ggroc(roc_obj, colour = "#0072B2", size = 1.2) +
labs(
title = "ROC Curve censoring ignored",
subtitle = paste("AUC =", round(auc(roc_obj), 3)),
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold"),
) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray")
#p
#ggsave("auc.pdf", p, width = 8, height = 6, dpi = 600)
# Obtain last row for each person
test.test = htnlongdat |> group_by(ID) %>% slice_tail()
# Time dependent AUC that handles censoring
library(timeROC)
library(survival)
roc6 <- timeROC(
T = htnlongdat$Month, # observed time
delta = htnlongdat$death, # event indicator
marker = htnlongdat$lambda_t, # model predictions (pooled logistic)
cause = 1, # event of interest
weighting = "marginal", # IPCW weighting
times = 6 # time horizon
)
roc_df <- data.frame(
FPR = roc6$FP[, 2],
TPR = roc6$TP[, 2],
#Model = "Pooled Logistic",
AUC = round(roc6$AUC[2], 3)
)
## Warning in data.frame(FPR = roc6$FP[, 2], TPR = roc6$TP[, 2], AUC =
## round(roc6$AUC[2], : row names were found from a short variable and have been
## discarded
# Combine AUC into model names for legend
roc_df <- roc_df %>%
mutate(AUC = paste0( "(",AUC, ")"))
p2 = ggplot(roc_df, aes(x = FPR, y = TPR, color = AUC)) +
geom_line(size = 1.2) +
geom_abline(slope = 1, intercept = 0, color = "gray") +
labs(
title = "Time dependent ROC (IPCW)",
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "top",
legend.direction = "horizontal",
# legend.title = element_text(face = "bold"),
legend.text = element_text(size = 10)
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Combine plots comparing AUC with and without accounting for censoring
library(patchwork)
auc = p + p2
ggsave("auc.pdf",auc, width = 8, height = 6, dpi = 600)
auc
# Deal with complete case data
htnlongdat_c = htnlongdat |> drop_na()
# Sample 70% of IDs for training and 30% for testing
uniqueids <- htnlongdat_c %>%
distinct(ID) %>%
pull(ID)
set.seed(123) # for reproducibility
sampled_ids <- sample(uniqueids, size = floor(0.7 * length(uniqueids)))
# Train data
train <- htnlongdat_c %>%
filter(ID %in% sampled_ids)
# Test data
test <- htnlongdat_c %>%
filter(!ID %in% sampled_ids)
#Advanced pooled logistic regression
# Multiple predictors
pooled = glm(death~ Month + age + male.gender + married + BMI +
hgb_centered + adv_HIV + log_creat_centered +SBP_ge120,
family= binomial(link="logit"),
data = train)
# splines
pooled.ns = glm(death~ ns(Month,2) + age + male.gender + married + BMI +
hgb_centered + adv_HIV + log_creat_centered +SBP_ge120,
family= binomial(link="logit"),
data = train)
# Interaction
pooled.in = glm(death~ Month + age + male.gender: married + BMI +
hgb_centered + adv_HIV + log_creat_centered +SBP_ge120,
family= binomial(link="logit"),
data = train)
# Compare models
AIC(pooled,pooled.ns,pooled.in)
summary(pooled)
Call: glm(formula = death ~ Month + age + male.gender + married + BMI + hgb_centered + adv_HIV + log_creat_centered + SBP_ge120, family = binomial(link = “logit”), data = train)
Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.073406 0.574288 -7.093 1.31e-12 Month
-0.007023 0.004314 -1.628 0.103528
age 0.002335 0.006377 0.366 0.714268
male.gender 0.940458 0.161711 5.816 6.04e-09 married
-0.503951 0.141787 -3.554 0.000379 BMI -0.068033 0.023303
-2.919 0.003506 hgb_centered -0.189045 0.026809 -7.051
1.77e-12 ** adv_HIV 0.297176 0.166466 1.785 0.074227 .
log_creat_centered 0.390688 0.221717 1.762 0.078053 .
SBP_ge120 0.300378 0.298658 1.006 0.314531
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05
‘.’ 0.1 ’ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2731.1 on 43268 degrees of freedom
Residual deviance: 2603.2 on 43259 degrees of freedom AIC: 2623.2
Number of Fisher Scoring iterations: 8
# Random forest
library(randomForest)
rf_fit <- randomForest(factor(death) ~ Month + age + male.gender + married + BMI +
                         hgb_centered + adv_HIV + log_creat_centered + SBP_ge120,
                       data = train, ntree = 500)
# XGBoost
library(xgboost)
# Prepare the design matrix and labels
X_train <- model.matrix(death ~ Month + age + male.gender + married + BMI +
hgb_centered + adv_HIV + log_creat_centered +SBP_ge120 -1,
data = train)
y_train <- train$death
xgb_fit <- xgboost(data = X_train, label = y_train,
objective = "binary:logistic", nrounds = 100,verbose = F)
#=============================================#
# Predicting for S(6)
# Create expanded data using the test data each individual at t=1..6
#=============================================#
expanded_test <- test %>%
group_by(ID) %>%
tidyr::expand(Month = 1:6, .by = "ID") %>%
left_join(test %>% distinct(ID, .keep_all = TRUE), by = c("ID")) |>
rename(Month = Month.x)
# Predict hazards
expanded_test_r <- expanded_test %>%
ungroup() %>% # remove any groupings
mutate(lambda_pool = predict(pooled, newdata = ., type = "response"),
lambda_rf = predict(rf_fit, newdata = ., type = "prob")[,2])
# Convert test data to a matrix form for XGB prediction
X_test = model.matrix(death ~ Month + age + male.gender + married + BMI +
hgb_centered + adv_HIV + log_creat_centered +SBP_ge120 -1,data = expanded_test)
# Predict hazards using xgb
expanded_test_r$lambda_xgb <- predict(xgb_fit, newdata = X_test)
#====================================================#
# Compute survival at 6 months per individual
#====================================================#
S6 <- expanded_test_r %>%
group_by(ID) %>%
summarise(S6_pool = prod(1 - lambda_pool),
S6_RF = prod(1 - lambda_rf),
S6_XGB = prod(1 - lambda_xgb)
, .groups = "drop")
# View
head(S6)
#====================================================#
# Validate and compute the AUC and plots
#====================================================#
# Extract T for all patients that had the event(death)
T_i_all <- test %>%
group_by(ID) %>%
summarise(T_i = min(Month[death == 1], na.rm = TRUE),
.groups = "drop") %>%
mutate(T_i = ifelse(is.infinite(T_i), NA, T_i)) %>%
right_join(distinct(test, ID), by = "ID")
## Warning: There were 625 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `T_i = min(Month[death == 1], na.rm = TRUE)`.
## ℹ In group 74: `ID = 755`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 624 remaining warnings.
# Create Z True binary outcome to validate 6 months predictions
T_i_all <- T_i_all %>%
mutate(Z = ifelse(is.na(T_i) | T_i > 6, 0, 1),
Time = 6)
# Compare Z with the predicted outcomes
results <- left_join(T_i_all,S6,by ="ID")
results<- results %>% select(ID,T_i,Time,Z,S6_pool,S6_RF,S6_XGB)
# Compute the event probability
results <- results %>%
mutate(Pr_event_pool = 1 - S6_pool,
Pr_event_rf = 1 - S6_RF,
Pr_event_xgb = 1 - S6_XGB)
# Compute ROC and AUC
roc_pool <- roc(results$Z, results$Pr_event_pool)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_rf <- roc(results$Z, results$Pr_event_rf)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_xgb <- roc(results$Z, results$Pr_event_xgb)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
p= ggroc(list(Pooled = roc_pool, RF = roc_rf, XGB = roc_xgb),
aes = c("color")) +
labs(title = "ROC Curves for 6-Month Event Prediction",
subtitle = paste0("AUCs: Pooled=", round(auc(roc_pool), 3),
", RF=", round(auc(roc_rf), 3),
", XGB=", round(auc(roc_xgb), 3)),
x = "1 - Specificity", y = "Sensitivity") +
scale_color_manual(values = c("Pooled" = "blue", "RF" = "forestgreen", "XGB" = "darkorange")) +
theme_minimal()
p
ggsave("auc_compare_6m.pdf", p, width = 8, height = 6, dpi = 600)
# Let's see predictions for 5 years (60 months) and 10 years (120 months)
#====================================================#
# Function to predict survival at T months
#====================================================#
predict_survival <- function(test_data, horizon, pooled, rf_fit, xgb_fit) {
# Expand test data: each individual at t=1..horizon
expanded_test <- test_data %>%
group_by(ID) %>%
tidyr::expand(Month = 1:horizon, .by = "ID") %>%
left_join(test_data %>% distinct(ID, .keep_all = TRUE), by = "ID") %>%
rename(Month = Month.x)
# Predict hazards
expanded_test_r <- expanded_test %>%
ungroup() %>% # remove any groupings
mutate(lambda_pool = predict(pooled, newdata = ., type = "response"),
lambda_rf = predict(rf_fit, newdata = ., type = "prob")[, 2])
# Convert test data for XGB prediction
X_test <- model.matrix(death ~ Month + age + male.gender + married + BMI +
hgb_centered + adv_HIV + log_creat_centered + SBP_ge120 - 1,
data = expanded_test)
# Predict hazards using XGB
expanded_test_r$lambda_xgb <- predict(xgb_fit, newdata = X_test)
# Compute survival at horizon months per individual
S_pred <- expanded_test_r %>%
group_by(ID) %>%
summarise(
S_pool = prod(1 - lambda_pool),
S_RF = prod(1 - lambda_rf),
S_XGB = prod(1 - lambda_xgb),
.groups = "drop"
)
# Extract T for all patients that had the event(death)
T_i_all <- test_data %>%
group_by(ID) %>%
summarise(T_i = min(Month[death == 1], na.rm = TRUE),
.groups = "drop") %>%
mutate(T_i = ifelse(is.infinite(T_i), NA, T_i)) %>%
right_join(distinct(test_data, ID), by = "ID")
# Create binary outcome Z for validation at horizon months
T_i_all <- T_i_all %>%
mutate(Z = ifelse(is.na(T_i) | T_i > horizon, 0, 1),
Time = horizon)
# Merge predictions
results <- left_join(T_i_all, S_pred, by = "ID") %>%
mutate(
Pr_event_pool = 1 - S_pool,
Pr_event_rf = 1 - S_RF,
Pr_event_xgb = 1 - S_XGB
)
# Compute ROC and AUC
roc_pool <- roc(results$Z, results$Pr_event_pool)
roc_rf <- roc(results$Z, results$Pr_event_rf)
roc_xgb <- roc(results$Z, results$Pr_event_xgb)
# Plot ROC curves
p <- ggroc(list(Pooled = roc_pool, RF = roc_rf, XGB = roc_xgb),
aes = c("color")) +
labs(
title = paste0("ROC Curves for ", horizon, "-Month Event Prediction"),
subtitle = paste0("AUCs: Pooled=", round(auc(roc_pool), 3),
", RF=", round(auc(roc_rf), 3),
", XGB=", round(auc(roc_xgb), 3)),
x = "1 - Specificity", y = "Sensitivity"
) +
scale_color_manual(values = c("Pooled" = "blue", "RF" = "forestgreen", "XGB" = "darkorange")) +
theme_minimal()
return(list(results = results, plot = p,
aucs = c(Pooled = auc(roc_pool), RF = auc(roc_rf), XGB = auc(roc_xgb))))
}
# Predictions at 6 months
res_6 <- predict_survival(test, horizon = 6, pooled, rf_fit, xgb_fit)
## Warning: There were 625 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `T_i = min(Month[death == 1], na.rm = TRUE)`.
## ℹ In group 74: `ID = 755`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 624 remaining warnings.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Predictions at 5 years (60 months)
res_60 <- predict_survival(test, horizon = 60, pooled, rf_fit, xgb_fit)
## Warning: There were 625 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `T_i = min(Month[death == 1], na.rm = TRUE)`.
## ℹ In group 74: `ID = 755`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 624 remaining warnings.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Predictions at 10 years (120 months)
res_120 <- predict_survival(test, horizon = 120, pooled, rf_fit, xgb_fit)
## Warning: There were 625 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `T_i = min(Month[death == 1], na.rm = TRUE)`.
## ℹ In group 74: `ID = 755`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 624 remaining warnings.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# View ROC plots
res_6$plot
res_60$plot
res_120$plot
# View AUCs
res_6$aucs
   Pooled        RF       XGB
0.7013177 0.5807223 0.5700342
res_60$aucs
   Pooled        RF       XGB
0.6061191 0.5897453 0.5751908
res_120$aucs
   Pooled        RF       XGB
0.6115945 0.5963397 0.5804712
# Combine plots
p = res_6$plot + res_60$plot
ggsave("auc_compare_6m.pdf", p, width = 8, height = 6, dpi = 600)
############################################
# Get time dependent AUC
#############################################
# Obtain last row for each person in the test data
test.test = test |> group_by(ID) %>% slice_tail()
# Pooled logistic regression
roc6 <- timeROC(
T = test.test$Month, # observed time
delta = test.test$death, # event indicator
marker = res_60$results$Pr_event_pool, # model predictions (pooled logistic)
cause = 1, # event of interest
weighting = "marginal", # IPCW weighting
times = 60 # time horizon
)
plot(roc6,time = 60, main ="Pooled logistic R")
# Nicer plot
plot(
roc6,
time = 60,
col = "blue", # Line color
lwd = 2, # Line width (thicker)
main = "Pooled Logistic Model ROC at 60 Months", # Title
xlab = "1 - Specificity", # X-axis label
ylab = "Sensitivity" # Y-axis label
)
# Random Forest
roc_rf <- timeROC(
T = test.test$Month,
delta = test.test$death,
marker = res_60$results$Pr_event_rf,
cause = 1,
weighting = "marginal",
times = 60 ,# or c(6, 12, 60, 120) for multiple time points
ROC = TRUE,
# iid = TRUE # for confidence intervals
)
plot(roc_rf,time = 60)
# For XGboost
roc_xgb <- timeROC(
T = test.test$Month,
delta = test.test$death,
marker = res_60$results$Pr_event_xgb,
cause = 1,
weighting = "marginal",
times = 60 ,
ROC = TRUE
)
plot(roc_xgb,time = 60)
# All plots
plot(roc6$FP[, 2], roc6$TP[, 2], type = "l", col = "blue", lwd = 2,
main = "",
xlab = "1 - Specificity", ylab = "Sensitivity")
title("ROC Curves at 60 Months")
lines(roc_rf$FP[, 2], roc_rf$TP[, 2], col = "red", lwd = 2)
lines(roc_xgb$FP[, 2], roc_xgb$TP[, 2], col = "green", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "gray")
legend("bottomright", legend = c("Pooled Logistic", "Random Forest", "XGBoost"),
col = c("blue", "red", "green"), lwd = 2, cex = 0.8)
auc_value <- as.numeric(roc6$AUC[2]) # AUC at time=60
text(0.6, 0.3, paste("AUC =", round(auc_value, 3)), cex = 1, col = "blue")
auc_value <- as.numeric(roc_rf$AUC[2]) # AUC at time=60
text(0.6, 0.2, paste("AUC =", round(auc_value, 3)), cex = 1, col = "red")
auc_value <- as.numeric(roc_xgb$AUC[2]) # AUC at time=60
text(0.6, 0.1, paste("AUC =", round(auc_value, 3)), cex = 1, col = "green")
# ggplot version
# Create a data frame for all ROC curves
roc_df <- bind_rows(
data.frame(
FPR = roc6$FP[, 2],
TPR = roc6$TP[, 2],
Model = "Pooled Logistic",
AUC = round(roc6$AUC[2], 3)
),
data.frame(
FPR = roc_rf$FP[, 2],
TPR = roc_rf$TP[, 2],
Model = "Random Forest",
AUC = round(roc_rf$AUC[2], 3)
),
data.frame(
FPR = roc_xgb$FP[, 2],
TPR = roc_xgb$TP[, 2],
Model = "XGBoost",
AUC = round(roc_xgb$AUC[2], 3)
)
)
## Warning in data.frame(FPR = roc6$FP[, 2], TPR = roc6$TP[, 2], Model = "Pooled
## Logistic", : row names were found from a short variable and have been discarded
## Warning in data.frame(FPR = roc_rf$FP[, 2], TPR = roc_rf$TP[, 2], Model =
## "Random Forest", : row names were found from a short variable and have been
## discarded
## Warning in data.frame(FPR = roc_xgb$FP[, 2], TPR = roc_xgb$TP[, 2], Model =
## "XGBoost", : row names were found from a short variable and have been discarded
# Combine AUC into model names for legend
roc_df <- roc_df %>%
mutate(Model_AUC = paste0(Model, " (AUC=", AUC, ")"))
p2 = ggplot(roc_df, aes(x = FPR, y = TPR, color = Model_AUC)) +
geom_line(size = 1.2) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
labs(
title = "Time dependent ROC Curves at 60 Months",
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)",
color = "Model"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.direction = "vertical", # make it vertical
legend.title = element_text(face = "bold"),
legend.text = element_text(size = 12)
) +
guides(color = guide_legend(ncol = 1)) # force 1 column (vertical)
ggsave("timedependentauc_60m.pdf", p2, width = 8, height = 6, dpi = 600)
# Show why censoring needed to be taken into account
show_cens = left_join(test.test %>% select(ID,Month,death), results,by = "ID")
xx = show_cens|> filter(ID %in% c(4801,4684,297,1259,656,714,155,133)) |> select(ID,Month,death,T_i,Time,Z,S6_pool,Pr_event_pool)
xtable::xtable(xx)