When retrospective data is censored data

Introduction

This vignette discusses an aspect of the following paper:

Bakeera-Kitaka, S., Smekens, T., Jespers, V., Wobudeya, E., Loos, J., Colebunders, R., Adipo, D., Kekitiinwa, A., Musoke, P., Buve, A., & Nöstlinger, C. (2019). Factors Influencing the Risk of Becoming Sexually Active Among HIV Infected Adolescents in Kampala and Kisumu, East Africa. AIDS and Behavior, 23(6), 1375–1386. https://doi.org/10.1007/s10461-018-2323-y

The data set consists of survey responses from 582 Kenyan and Ugandan adolescents living with HIV [ALHIV], who are assumed to be infected at or shortly after birth. If not treated effectively with antiretroviral therapy (ART), ALHIV often have worse treatment outcomes compared to adults, and are at risk of transmitting HIV to their sexual partners. Earlier research showed associations between delayed sexual activity among ALHIV and greater self-efficacy to prevent onwards HIV transmission (including higher condom use and fewer teenage pregnancies); therefore, the aim of this paper was to identify risk factors for early sexual activity.

The survey was cross-sectional and asked, among others, the respondents’ current age, the age of their first sexual activity (if any), and the age at which they became aware that they were infected with HIV. This vignette will show a faulty analysis from one draft of the paper, followed by the published version in which we used survival analysis, even though this is cross-sectional data!

library(haven)

## Warning: package 'haven' was built under R version 4.2.2

dat <- as.data.frame(read_stata("datawithrecode.dta"))

Description of the data

One of the key variables in the data is ‘eversex’, the response to the survey question “Have you ever had sex?” In addition, the variable ‘agefirstsex’ contains the answer to “How old were you the first time you had sex?” as a number, as well as ‘ageHIV’, “How old were you when you learnt that you are HIV-infected?” Respondents were aged between 13 and 17 at the time of the survey.

## Show an extract from the data
# Examples of respondents who had had sex
#head(dat[dat$eversex == 1,c("SbjNum","eversex","agefirstsex","ageHIV")])
# Examples of respondents who had not had sex
#head(dat[dat$eversex == 2,c("SbjNum","eversex","agefirstsex","ageHIV")])
     
# 1 = yes, 2 = no
table(dat$eversex)

## 
##   1   2 
## 123 457

# -1 = never had sex
table(dat$agefirstsex)

## 
##  -1   6   7   8   9  10  11  12  13  14  15  16 
## 457   6   2   2   4  15   5  19  16  25  22   7

# Histogram of this, leaving out those who never had sex
hist(dat$agefirstsex[dat$agefirstsex != -1])

# Histogram of age at which they found out about their HIV infection
hist(dat$ageHIV)

# Age of first sex, by gender. 1 = male, 2 = female
boxplot(agefirstsex ~ sex, data=dat, subset=agefirstsex != -1)

When looking for associations between early sexual activity and other variables, there is a big problem: 457 respondents (thus, a large majority of the sample) had not yet had sex at the time of the survey, and therefore, are missing an age of first intercourse. In response to this, researchers may be tempted to drop the missing data and restrict the analysis to only those adolescents who had had sex.

But this would be a grave mistake, for several reasons. One reason is that these 457 respondents are examples of adolescents who did NOT become sexually active, and thus any characteristics they have may be protective factors. Discarding these adolescents from the analysis would mean discarding this important evidence. Or in other words, adolescents who never had sex may differ systematically from those who did, and removing them introduces bias.

To illustrate this, here are the proportions of having had sex, comparing those adolescents who were on ART to those who weren’t. This is an important predictor, as being on ART means the adolescent is receiving health care for their HIV infection, and therefore likely also has better access to information and support.

# Table of being on ART (rows; 1 = yes, 2 = no) by ever having had sex (columns; 1 = yes, 2 = no)
tab <- table(dat$ARV, dat$eversex)
tab

##    
##       1   2
##   1  60 344
##   2  63 109

# With row percentages
round(proportions(tab, margin=1),2)

##    
##        1    2
##   1 0.15 0.85
##   2 0.37 0.63

The proportion of sexually active adolescents is much lower among those receiving ART, suggesting a protective effect. This evidence would not be included if we only analyze the sexually active adolescents (or in other words, remove the bottom row of the table), and thus we would gravely underestimate this protective effect.

Faulty analysis

As discussed earlier, most of the respondents had no age at first intercourse as they had not yet had any, and restricting the analysis to those that did would have been a grave mistake. It is probably for this reason that this variable was not used in the first draft and a different analysis was chosen instead, albeit one with a major problem of its own.

Here, the categorical variable ‘eversex’ was used as the outcome, i.e. whether or not the adolescent had had sex by the time of the survey, ignoring the specific age at which this happened. The risk factor analysis thus became a logistic regression model of ‘eversex’ across several predictors, and no respondents had to be excluded.

# Final model
mod <- glm(I(eversex - 1) ~ ageca + country + stud + as.factor(tsh) + health + ARV + as.factor(tscat) + as.factor(fr1), family = binomial, data=dat)
round(exp(coef(mod)),1) # odds ratios

##       (Intercept)             ageca           country              stud 
##           17427.0               0.1               0.4               0.3 
##   as.factor(tsh)2   as.factor(tsh)3            health               ARV 
##               1.9               2.1               0.6               0.3 
## as.factor(tscat)2 as.factor(tscat)3   as.factor(fr1)3   as.factor(fr1)4 
##               0.4               0.4               2.6               4.4

One obvious drawback of this approach is that it ignores the age of first intercourse; becoming sexually active at age 13 should be considered a worse outcome than at 17 within the context of this study, but here they are treated as the same. This will again make us underestimate any associated factors. There is an even bigger problem, however, and that is that the adolescents were not all the same age at the time of the survey.

# Table of age at the time of the survey
table(dat$age)

## 
##  13  14  15  16  17 
## 176 126  93  97  88

# Age by eversex
tab <- table(dat$age, dat$eversex)
tab

##     
##        1   2
##   13  17 159
##   14   8 118
##   15  20  73
##   16  35  62
##   17  43  45

# Histogram of age at first sex, across age at the time of the survey
boxplot(agefirstsex ~ age, data=dat, subset=agefirstsex != -1)

Why is this so problematic? Because someone who was surveyed at 17 years of age has had more time to become sexually active than someone who was still only 13. This makes those aged 13 automatically less likely to be sexually active, and thus any characteristics they may have will be wrongly treated as evidence for a protective effect.

This accumulation of biases leads to a very large potential for error in the analysis, and indeed, major revisions were requested after the draft had been reviewed by the journal. One reviewer took particular issue with the fact that age at first sex was not taken into account by using the binary “ever had sex” outcome, as though the paper was trying to find evidence in support of abstinence-focused programmes rather than merely delayed sexual activity. They suggested to code a new outcome variable “early sexual debut”, that is, before the age of 15. This would have partially solved both aforementioned issues, but only partially: since most of the ages of sexual debut were under 15 anyway, all of these would still have been treated the same in this analysis. Moreover, many respondents were 13 or 14 years old at the time of the survey, meaning they still had had less time to have an early sexual debut than those who were older.

The proper analysis

Instead, we applied survival analysis (also known as event history analysis in the social sciences), a method that is perfectly suited to this kind of data and allowed us to take into account the specific age of sexual debut, without excluding those who had never had sex at the time of the survey.

The key insight is that we are working with data that is censored. Censoring means that the timing of an event (in this case, sexual debut) is not completely known. Here, the time of sexual debut is censored for those who had not yet had sex by the time of the survey. More specifically, it is right-censored because we know it could not have taken place before their age at the time of the survey (such as at 13) but it could still take place after (such as at 14). Right-censoring is the most common kind of censoring and it poses no problem for survival analysis. There is only the assumption that the censoring is non-informative, i.e. the fact that censoring occurred must not be related to the chances of the event happening; in this case, that is a relatively safe assumption, as censoring is just the result of adolescents being different ages.

Preparation

The outcome variable in a survival analysis is effectively twofold: on the one hand, you must supply a variable indicating whether the event occurred or not. For sexually active adolescents, it will be 1, and for those not yet active, it will be 0.

On the other hand, you must supply the time at which the event occurred, or if there was no event, the time at which censoring occurred. This will be, respectively, a person’s age of sexual debut, or otherwise, their current age at the time of the survey.

dat$surv_event <- FALSE
dat$surv_event[dat$eversex == 1] <- TRUE
dat$surv_time <- dat$agefirstsex
dat$surv_time[dat$eversex == 2] <- dat$age[dat$eversex == 2]

Kaplan-Meier Plot

The ideal way to describe the raw data in a survival analysis is by plotting a survival curve, using a Kaplan-Meier plot. It shows a survival curve, showing the proportion of subjects that have not had the event happen, and how it decreases over time. Censored individuals will still be a part of the curve until the time of censoring, but will not make it go down when censoring happens. You can optionally split the survival curve in categories, usually a candidate predictor. When plotting without a predictor, the base plotting function will automatically add a confidence interval. With a predictor, it will not, as that would become very difficult to read. In both examples, however, I specify mark.time=TRUE so that consored cases are plotted along the survival curves, marked by a +.

library(survival)

kmfit <- survfit(Surv(surv_time,surv_event) ~ 1, data=dat)
plot(kmfit, mark.time=TRUE)

kmfit2 <- survfit(Surv(surv_time,surv_event) ~ sex, data=dat)
plot(kmfit2, mark.time=TRUE)

Either way, the plot is a bit ugly. For the paper, we used the ggplot2 package, with the addition of the ggalt package in order to be able to plot stepped ribbons for the confidence intervals. The ggplot method requires a data frame holding all the necessary variables to map onto the plot, so we need to do a bit of extra work to extract this from the model object.

plotdata <- data.frame(
  time=kmfit2$time,
  surv=kmfit2$surv,
  strata=rep(names(kmfit2$strata), kmfit2$strata),
  lower=kmfit2$lower,
  upper=kmfit2$upper
)


# And now to make the plot:

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.2

library(ggalt)

## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

ggplot(plotdata, aes(x=time, y=surv, col=strata, fill=strata, group=strata, ymin=lower, ymax=upper)) +
  # The stat="stepribbon" is made possible by the ggalt package
  geom_step() + geom_ribbon(stat="stepribbon", alpha=0.3, linetype=0) +
  xlab("Age") + ylab("Proportion sexually inactive") + labs(col="Gender", fill="Gender") + scale_fill_discrete(labels=c("Male","Female")) + scale_color_discrete(labels=c("Male","Female")) +
  theme_minimal() + ylim(0,1)

The curve initially decreases faster for boys, but by age 17, girls have “caught up” and are at a higher proportion of being sexually active. That is, if we take this at face value; this kind of data on sexual behavior is very sensitive and may thus be susceptible to severe self-report bias. We assume that the bias may even be in opposite directions for boys and girls, so we did not try to estimate an effect of gender. Instead, we split up the entire risk factor analysis so that every candidate risk factor was estimated separately for males and females.

Cox Regression

Whereas the previous strategy was to use logistic regression on sexual activity as a binary outcome in order to estimate associations with risk factors and adjust for confounding, survival analysis has such a method of its own, namely cox regression. The syntax is very similar to other regression models in R, except the outcome variable is actually two variables in one, like with kaplan-meier plots.

Below is a candidate multivariable model for each gender after having followed a variable selection procedure. The I(2-…) syntax just recodes a variable from 1/2 coding to 1/0.

# Male
coxmod1_m <- coxph(Surv(surv_time,surv_event) ~ I(2-stud) + health + I(2-ARV) + at6new_2cat + fr1_2cat, data=dat, subset=sex == 1)
summary(coxmod1_m)

## Call:
## coxph(formula = Surv(surv_time, surv_event) ~ I(2 - stud) + health + 
##     I(2 - ARV) + at6new_2cat + fr1_2cat, data = dat, subset = sex == 
##     1)
## 
##   n= 248, number of events= 52 
##    (15 observations deleted due to missingness)
## 
##                    coef exp(coef) se(coef)      z Pr(>|z|)    
## I(2 - stud)     -0.8775    0.4158   0.4822 -1.820 0.068780 .  
## health           0.2167    1.2420   0.2889  0.750 0.453070    
## I(2 - ARV)      -1.3582    0.2571   0.3045 -4.461 8.16e-06 ***
## at6new_2catTRUE  0.2306    1.2593   0.3693  0.624 0.532316    
## fr1_2catTRUE     1.1335    3.1064   0.3004  3.773 0.000161 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                 exp(coef) exp(-coef) lower .95 upper .95
## I(2 - stud)        0.4158     2.4049    0.1616     1.070
## health             1.2420     0.8051    0.7051     2.188
## I(2 - ARV)         0.2571     3.8892    0.1416     0.467
## at6new_2catTRUE    1.2593     0.7941    0.6107     2.597
## fr1_2catTRUE       3.1064     0.3219    1.7241     5.597
## 
## Concordance= 0.676  (se = 0.042 )
## Likelihood ratio test= 35.5  on 5 df,   p=1e-06
## Wald test            = 35.99  on 5 df,   p=1e-06
## Score (logrank) test = 41.47  on 5 df,   p=8e-08

# Female
coxmod1_f <- coxph(Surv(surv_time,surv_event) ~ I(2-stud) + health + I(2-ARV) + at6new_2cat + fr1_2cat, data=dat, subset=sex == 2)
summary(coxmod1_f)

## Call:
## coxph(formula = Surv(surv_time, surv_event) ~ I(2 - stud) + health + 
##     I(2 - ARV) + at6new_2cat + fr1_2cat, data = dat, subset = sex == 
##     2)
## 
##   n= 281, number of events= 61 
##    (36 observations deleted due to missingness)
## 
##                    coef exp(coef) se(coef)      z Pr(>|z|)    
## I(2 - stud)     -0.8090    0.4453   0.2770 -2.920  0.00349 ** 
## health           0.8328    2.2997   0.2811  2.962  0.00305 ** 
## I(2 - ARV)      -1.2530    0.2856   0.2647 -4.734 2.20e-06 ***
## at6new_2catTRUE -1.0991    0.3332   0.2704 -4.064 4.82e-05 ***
## fr1_2catTRUE     0.8371    2.3096   0.2769  3.023  0.00251 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                 exp(coef) exp(-coef) lower .95 upper .95
## I(2 - stud)        0.4453     2.2457    0.2587    0.7664
## health             2.2997     0.4348    1.3255    3.9902
## I(2 - ARV)         0.2856     3.5008    0.1700    0.4799
## at6new_2catTRUE    0.3332     3.0014    0.1961    0.5661
## fr1_2catTRUE       2.3096     0.4330    1.3422    3.9743
## 
## Concordance= 0.808  (se = 0.027 )
## Likelihood ratio test= 76.65  on 5 df,   p=4e-15
## Wald test            = 75.83  on 5 df,   p=6e-15
## Score (logrank) test = 91.8  on 5 df,   p=<2e-16

The coefficients need to be exponentiated for us to interpret them. Thankfully, the output has already done this for us. Such coefficients are hazard ratios, describing the relative “hazard” of experiencing the event, in this case becoming sexually active. They are somewhat similar to risk ratios; hazard can be thought of as the risk of the event occurring, but along infinitesimally short time points along the timeline. This hazard itself cannot be calculated directly (which is why the model does not include an intercept), but it can be compared along predictors in order to get hazard ratios.

One further assumption of the Cox regression model is the proportional hazards assumption. This means that the association with the predictor has to stay more or less the same at different points in time, in this case ages. You can tell when this is assumption is violated when the Kaplan-Meier plot shows two survival curves making very different trajectories, such as in the above plot by gender, where the curves even cross. This makes gender very unsuitable as a predictor in a Cox regression model without further precautions. In this study, we looked at Kaplan-Meier plots for all candidate predictors, though they are not shown here.

Note the hazard ratio for the ARV variable, which is far below 1, thus making it a strong protective factor.

Time-varying covariates

There was one more variable that we wanted to include, namely the “ageHIV” variable mentioned earlier, where survey participants were asked at what age they became aware that they had been infected with HIV. The original analysis computed the reverse - how many years they had already been aware by the time of the survey - and categorized it, before including it in the model. The reasoning may have been that a high number of years meant that the infection was spotted early, implying better access to health care, and more time knowingly living with HIV and thus more time to better cope with the associated challenges, such as HIV transmission risk.

This kind of variable is problematic in the same way as we saw with the event itself: someone who was younger at the time of the survey will have had less time to discover their HIV infection, so this variable will automatically tend to be lower for them. This is especially obvious when you compare the ageHIV variable across respondents’ current age.

table(dat$ageHIV)

## 
##   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
##   4   1  10  19  37  34  33  91  64 103  79  41  38  26

table(dat$age)

## 
##  13  14  15  16  17 
## 176 126  93  97  88

boxplot(ageHIV ~ age, data=dat)

We could try to sidestep this issue by making a new variable “Knew about their HIV status at the time of sexual debut”, since this is the true risk factor that we are thinking of, but then we run into the problem where this variable is undefined if the respondent wasn’t sexually active yet. The true solution is realizing that this is a predictor that changes over time. At first, every respondent was unaware of their HIV infection, until at some point they became aware. Whereas all the other predictors are cross-sectional in nature (i.e. one value that is “fixed in time” and applies to that person as a whole), here we are dealing with a time-varying covariate, namely “knowing one’s HIV status”.

Thankfully, survival analysis is perfectly equipped to deal with time-varying covariates as well. At any time (or in this case, age) when an event occurs, it will check whether this predictor is systematically different among those for whom the event has just happened, compared to those for whom the event has not yet happened. For this to work, you need to know the value of this predictor for every person, at every time (or at least at each time when someone else experiences an event), not just at the time of their own event. This is not a problem in this particular case, as we are retrospectively filling in the timeline based on a survey question; but when such a variable is prospectively measured, this can be a serious problem. If you have such data, I recommend looking up the Joint Modeling approach by Dimitris Rizopoulos.

To perform a cox regression with a time-varying covariate in R, we need to put in some extra effort, as the previous way of structuring the data - with one row per person - will no longer work. Each person’s timeline will now need to be divided into “periods”, which are spans of time in which the covariate stays the same. Each period will be a row in the data, leading to multiple rows per person. This won’t increase the number of observations in the data, it is just a means to provide information on the time-varying covariate. This is also called the Counting Process style of input.

In this case, the time-varying covariate can change at most once (from not knowing to knowing) and therefore each person can only have a maximum of two periods, or in other words, two rows in the data. There are individuals who had sex before knowing about their HIV infection, and these will have only one period and one row, as it doesn’t matter what happens to the covariate after the event. There are probably packages that automate this data conversion, but for these paper, we did it manually. We need to provide two new variables indicating the times at which the period began and ended. The event variable will begin at 0, even for the sexually active participants, and only become 1 at the last period, when the event actually occurred.

# Indicate whether the person first had sex or was censored while knowing their status and will thus need two rows
dat$surv_after_knowing <- dat$surv_time > dat$ageHIV

# Extract of sexually active cases
#head(dat[dat$eversex == 1,c("SbjNum","age","eversex","agefirstsex","ageHIV","surv_after_knowing")])
# Extract of censored cases
#head(dat[dat$eversex == 2,c("SbjNum","age","eversex","agefirstsex","ageHIV","surv_after_knowing")])

# Make a new data frame that includes extra duplicate rows for each of these subjects
new_indices <- sort(c(rep(which(dat$surv_after_knowing),2), which(!dat$surv_after_knowing)))
dat_counting <- dat[new_indices,]

# Copy old survival variables to new ones, valid for subjects who had sex before knowing about their HIV status
dat_counting$surv_start <- 0
dat_counting$surv_end <- dat_counting$surv_time
dat_counting$surv_event2 <- dat_counting$surv_event
dat_counting$surv_knowhiv <- FALSE

## For subjects who were censored or had sex after knowing HIV status, define two time intervals
# For every person, the two starting times will be 0 and the time of knowing the HIV infection
dat_counting$surv_start[dat_counting$surv_after_knowing] <- ave(x=dat_counting$ageHIV[dat_counting$surv_after_knowing], dat_counting$SbjNum[dat_counting$surv_after_knowing], FUN=function(x) c(0, x[1]))
# For every person, the first ending time will be the time of knowing the HIV infection, and...
dat_counting$surv_end[dat_counting$surv_after_knowing] <- ave(x=dat_counting$ageHIV[dat_counting$surv_after_knowing], dat_counting$SbjNum[dat_counting$surv_after_knowing], FUN=function(x) c(x[1], NA))
# ...the second ending time will be the time of event, or censoring
dat_counting$surv_end[is.na(dat_counting$surv_end)] <- dat_counting$surv_time[is.na(dat_counting$surv_end)]
# The time-varying covariate will be FALSE and then TRUE for each person
dat_counting$surv_knowhiv[dat_counting$surv_after_knowing] <- rep(c(FALSE,TRUE), sum(dat_counting$surv_after_knowing) / 2)
# The event variable will be FALSE and then the actual event outcome for each person
dat_counting$surv_event2[dat_counting$surv_after_knowing] <- ave(x=dat_counting$surv_event[dat_counting$surv_after_knowing], dat_counting$SbjNum[dat_counting$surv_after_knowing], FUN=function(x) c(FALSE,x[1]))

# Extract of sexually active cases
#head(dat_counting[dat_counting$eversex == 1,c("SbjNum","age","surv_start","surv_end","surv_event2","surv_knowhiv")])
# Extract of censored cases
#head(dat_counting[dat_counting$eversex == 2,c("SbjNum","age","surv_start","surv_end","surv_event2","surv_knowhiv")])

We are now ready to put this into a cox regression. In the previous specification, we provided two variables as the outcome. Now it will be three: the start time, the end time, and the event.

# Male
coxmod2_m <- coxph(Surv(surv_start,surv_end,surv_event2) ~ surv_knowhiv, data=dat_counting, subset=sex == 1)
summary(coxmod2_m)

## Call:
## coxph(formula = Surv(surv_start, surv_end, surv_event2) ~ surv_knowhiv, 
##     data = dat_counting, subset = sex == 1)
## 
##   n= 480, number of events= 53 
## 
##                     coef exp(coef) se(coef)      z Pr(>|z|)    
## surv_knowhivTRUE -2.2146    0.1092   0.4097 -5.405 6.48e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## surv_knowhivTRUE    0.1092      9.158   0.04891    0.2438
## 
## Concordance= 0.656  (se = 0.028 )
## Likelihood ratio test= 37.11  on 1 df,   p=1e-09
## Wald test            = 29.21  on 1 df,   p=6e-08
## Score (logrank) test = 35.65  on 1 df,   p=2e-09

# Female
coxmod2_f <- coxph(Surv(surv_start,surv_end,surv_event2) ~ surv_knowhiv, data=dat_counting, subset=sex == 2)
summary(coxmod2_f)

## Call:
## coxph(formula = Surv(surv_start, surv_end, surv_event2) ~ surv_knowhiv, 
##     data = dat_counting, subset = sex == 2)
## 
##   n= 582, number of events= 70 
## 
##                     coef exp(coef) se(coef)      z Pr(>|z|)    
## surv_knowhivTRUE -1.5902    0.2039   0.2721 -5.845 5.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## surv_knowhivTRUE    0.2039      4.905    0.1196    0.3475
## 
## Concordance= 0.639  (se = 0.032 )
## Likelihood ratio test= 35.08  on 1 df,   p=3e-09
## Wald test            = 34.17  on 1 df,   p=5e-09
## Score (logrank) test = 38.93  on 1 df,   p=4e-10

Indeed, knowing your HIV status is a strong protective effect in isolation. Here is the full multivariable model which made it into the paper:

# Male
coxmod3_m <- coxph(Surv(surv_start, surv_end, surv_event2) ~ I(-age) + country + surv_knowhiv + I(2-ARV) + ps1_2cat, ties='efron', data=dat_counting, subset=sex == 1)
summary(coxmod3_m)

## Call:
## coxph(formula = Surv(surv_start, surv_end, surv_event2) ~ I(-age) + 
##     country + surv_knowhiv + I(2 - ARV) + ps1_2cat, data = dat_counting, 
##     subset = sex == 1, ties = "efron")
## 
##   n= 459, number of events= 52 
##    (21 observations deleted due to missingness)
## 
##                     coef exp(coef) se(coef)      z Pr(>|z|)    
## I(-age)          -0.2969    0.7431   0.1154 -2.573 0.010088 *  
## country           1.5369    4.6501   0.4312  3.564 0.000365 ***
## surv_knowhivTRUE -1.2773    0.2788   0.4500 -2.839 0.004529 ** 
## I(2 - ARV)       -0.5971    0.5504   0.2998 -1.991 0.046446 *  
## ps1_2catTRUE      0.9541    2.5964   0.3785  2.520 0.011720 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## I(-age)             0.7431     1.3457    0.5927    0.9317
## country             4.6501     0.2150    1.9973   10.8264
## surv_knowhivTRUE    0.2788     3.5870    0.1154    0.6734
## I(2 - ARV)          0.5504     1.8168    0.3058    0.9906
## ps1_2catTRUE        2.5964     0.3852    1.2364    5.4524
## 
## Concordance= 0.786  (se = 0.032 )
## Likelihood ratio test= 73.44  on 5 df,   p=2e-14
## Wald test            = 60.1  on 5 df,   p=1e-11
## Score (logrank) test = 81.88  on 5 df,   p=3e-16

# Female
coxmod3_f <- coxph(Surv(surv_start, surv_end, surv_event2) ~ I(2-stud) + health + surv_knowhiv + I(2-ARV) + at6new_2cat + fr1_2cat, data=dat_counting, subset=sex == 2)
summary(coxmod3_f)

## Call:
## coxph(formula = Surv(surv_start, surv_end, surv_event2) ~ I(2 - 
##     stud) + health + surv_knowhiv + I(2 - ARV) + at6new_2cat + 
##     fr1_2cat, data = dat_counting, subset = sex == 2)
## 
##   n= 516, number of events= 61 
##    (66 observations deleted due to missingness)
## 
##                     coef exp(coef) se(coef)      z Pr(>|z|)    
## I(2 - stud)      -0.5933    0.5525   0.2865 -2.071 0.038360 *  
## health            0.7548    2.1271   0.2824  2.672 0.007531 ** 
## surv_knowhivTRUE -1.0300    0.3570   0.3123 -3.298 0.000973 ***
## I(2 - ARV)       -1.0412    0.3530   0.2684 -3.880 0.000104 ***
## at6new_2catTRUE  -1.0691    0.3433   0.2683 -3.985 6.76e-05 ***
## fr1_2catTRUE      0.7239    2.0625   0.2781  2.603 0.009231 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## I(2 - stud)         0.5525     1.8100    0.3151    0.9687
## health              2.1271     0.4701    1.2229    3.7000
## surv_knowhivTRUE    0.3570     2.8009    0.1936    0.6584
## I(2 - ARV)          0.3530     2.8326    0.2086    0.5974
## at6new_2catTRUE     0.3433     2.9128    0.2029    0.5809
## fr1_2catTRUE        2.0625     0.4848    1.1959    3.5571
## 
## Concordance= 0.82  (se = 0.026 )
## Likelihood ratio test= 87.94  on 6 df,   p=<2e-16
## Wald test            = 85.5  on 6 df,   p=3e-16
## Score (logrank) test = 105.4  on 6 df,   p=<2e-16

Conclusion

Usually, survival analysis is understood as a method for longitudinal data analysis, applied in prospective studies where patients are followed up until some event happens to them, or they are censored. This is still its most typical application, but we have just seen that it is also often required in cross-sectional studies if patients are asked about life course events which they have not all undergone, or in other words, when there is censoring in the data. In addition to being more “correct” - in the sense that it avoids bias from having to exclude or misrepresent some data - it is simpler to explain. It allowed us to discuss the relative risk (or more specifically, hazard) of sexual activity as a function of knowing your HIV status, and other factors; the other strategies had us put an arbitrary cutoff in place to define “early sexual debut”, or turn the time-varying covariate into “years since knowing about one’s HIV status”, or worse.

In the published paper, we found that the strongest protective factors against early sexual debut among adolescents living with HIV were related to HIV programs in health care systems; namely, knowing your HIV status, and receiving antiretroviral treatment. It also found that subjective peer-norms that promote sex at a young age constituted a risk factor, but - contrary to other studies - self-efficacy was not associated. We therefore concluded that strengthening peer norms to delay sexual activity at a young age and on making informed decisions about sexuality should be part of positive prevention interventions in HIV programs, rather than merely emphasizing individual behavior change and responsibility.