RM2 2025 S2 Assignment 3

Author

Beth Firipis

Published

November 10, 2025

HTML Version: https://rpubs.com/bfrps/1366550

Abstract

Using Cox proportional hazards regression, we model the association between various factors and all-cause mortality.

The multivariable model demonstrated good discriminative performance (c-index = 0.733). Age had a strong linear association (HR = 1.09 per year, 95% CI: 1.08-1.11). Female gender was highly protective (HR = 0.46, 95% CI: 0.37-0.57). Diabetes conferred substantial mortality risk (HR = 2.64, 95% CI: 1.64-4.25). Obesity conferred moderate mortality risk (HR = 1.38, 95% CI: 1.02-1.89). Time-dependent angina was associated with 83% increased mortality hazard (HR = 1.83, 95% CI: 1.40-2.39). Education showed no significant association after multivariable adjustment.

These findings quantify key cardiovascular risk factors and support targeted interventions for high-risk populations.

Introduction

The Framingham Heart Study is a landmark longitudinal cardiovascular epidemiological study that began in 1948 and continues today, originally designed to identify risk factors for heart disease in a community-based population in Framingham, Massachusetts¹.

This analysis investigates associations between key demographic and clinical factors (age, sex, BMI, diabetes, education, and angina) and all-cause mortality, using data from the Framingham Heart Study’s original 1948 cohort.

A critical aspect of this analysis is the treatment of angina as a time-dependent covariate. Unlike traditional risk factors that are measured at baseline, angina status can change during the follow-up period. Participants may develop angina at any point during observation, or may have been diagnosed prior to study enrolment. This time-varying nature requires the use of survival analysis techniques to account for the changing risk profile over time.

Understanding these associations allows us to estimate the relative cardiovascular risk of the various covariates and quantify their individual contributions to mortality outcomes. This analysis provides hazard ratio estimates that can be used to inform clinical decision-making, patient counselling, and potentially guide targeted interventions for high-risk populations.

Methods

Statistical Analysis

Cox proportional hazards regression was used to examine the association between covariates and mortality, accounting for the time-dependent nature of angina. The model included age, sex, BMI category, diabetes, education, and angina. Univariate and multi-variate analysis were performed.

Covariate Processing

Angina was modeled as a time-dependent covariate. Each participant’s follow-up was divided at the point of angina occurrence, creating separate rows for distinct risk periods with different angina status, as outlined in Table 3.

BMI categories were collapsed from six original categories into three categories (Underweight/Healthy Weight, Overweight, and Obese) to ensure adequate sample sizes and focus on clinically meaningful thresholds.

Results

Baseline Characteristics

The study population demonstrated typical characteristics of a middle-aged to older adult cohort. The mean age at baseline reflects the demographic profile expected for cardiovascular risk assessment, with representation across educational levels. Due to low numbers in some BMI categories necessitated collapsing the original six categories into three groups due to small numbers in some individual categories, particularly in the extreme BMI ranges (underweight and highest obesity categories). The diabetes prevalence and mortality rate during follow-up are consistent with expected outcomes for this population and follow-up duration.

Survival Analysis

Univariate Analysis

The univariate analyses revealed significant crude associations for most covariates examined:

Age had a strong linear association with mortality - each additional year associated with 8.9% increased hazard (HR = 1.089, 95% CI: 1.075-1.103, p < 0.001)
Women had 39% lower mortality hazard compared to men (HR = 0.609, 95% CI: 0.494-0.750, p < 0.001)
Having diabetes substantially increased mortality hazard (estimated more than four-fold!) (HR = 4.057, 95% CI: 2.552-6.450, p < 0.001)
Education had a mild protective effect with high school completion (HR = 0.641, 95% CI: 0.497-0.827, p = 0.001) and some college (HR = 0.614, 95% CI: 0.444-0.849, p = 0.003) showing significant protection. College degree showed non-significant protective trend (HR = 0.819, 95% CI: 0.583-1.151, p = 0.250)
Angina had a strong association with increased mortality risk - more than 2.6-fold increased hazard following angina onset (HR = 2.617, 95% CI: 2.021-3.390, p < 0.001)
Obesity demonstrated significantly increased mortality risk (HR = 1.867, 95% CI: 1.390-2.508, p < 0.001) compared to underweight/healthy weight reference. Overweight status showed non-significant trend toward increased risk (HR = 1.231, 95% CI: 0.977-1.550, p = 0.078) compared to underweight/healthy weight reference.

Multivariable Analysis

The multivariable Cox proportional hazards model had strong overall performance with a concordance index of 0.733, indicating good discriminative ability. The likelihood ratio test (χ² = 266.7, df = 9, p < 0.001) confirmed the model significantly improved prediction compared to a null model.

The results of the multivariable analysis revealed several key findings:

Age was a strong predictor of mortality, with each additional year associated with a 9% increase in hazard (HR = 1.09, 95% CI: 1.08-1.11, p < 0.001)
Gender was a very strong predictor of mortality, with women having 54% lower hazard of death compared to men (HR = 0.46, 95% CI: 0.37-0.57, p < 0.001).
Diabetes was the strongest association, with those with diabetes having a 2.64-fold increase in mortality risk (HR = 2.64, 95% CI: 1.64-4.25, p < 0.001). The magnitude of the association was substantially attenuated in the multivariable model (univariate HR = 4.06 vs multivariable HR = 2.64), indicating that age and other factors partially mediate the association of diabetes and all-cause mortality.
Obesity significantly elevated mortality risk by 38% compared to underweight/healthy weight participants (HR = 1.38, 95% CI: 1.02-1.89, p = 0.038). Overweight status showed no significant association with mortality (HR = 0.92, 95% CI: 0.73-1.17, p = 0.49)
The development of angina during follow-up was associated with an 83% increase in mortality hazard (HR = 1.83, 95% CI: 1.40-2.39, p < 0.001), highlighting the clinical effects of incident cardiovascular symptoms on future cardiovascular risk.
Despite showing significant protective effects in univariate analysis (high school: HR = 0.64, p = 0.001; some college: HR = 0.61, p = 0.003), education level showed no significant association with mortality after adjustment for other covariates (global p = 0.5), suggesting that socioeconomic effects may be mediated through other measured risk factors.

Model Performance

The model demonstrated good discriminative performance (c-index = 0.733) and satisfied key statistical assumptions, including proportional hazards. The comprehensive diagnostic assessment revealed some evidence of potential non-linearity in age effects, but sensitivity analyses using splined age terms confirmed that linear modeling provides adequate fit without significantly changing substantive conclusions.

Conclusion

This analysis of Framingham Heart Study data demonstrates strong independent associations between age, sex, diabetes, obesity, and incident angina with all-cause mortality. The time-dependent modeling captured angina’s dynamic impact on mortality risk.

The findings’ generalizability is limited by the study’s restriction to the predominantly white 1948 Framingham cohort, which may not reflect contemporary populations or diverse demographic groups.

The model’s good discriminative performance supports its utility for risk stratification, though broader validation is needed for generalizability to modern clinical practice.

References

Framingham Heart Study (FHS) NHLBI, NIH. https://www.nhlbi.nih.gov/science/framingham-heart-study-fhs;

Appendices

Appendix 1: Univariate Model Results

Multivariable model summary table

Table 1: Univariate Cox regression results

Variable	Term	HR	95% CI Lower	95% CI Upper	P-value
age	age	1.089	1.075	1.103	0.000
sex	sexFemale	0.609	0.494	0.750	0.000
bmicat_collapsed	bmicat_collapsedOverweight	1.231	0.977	1.550	0.078
bmicat_collapsed	bmicat_collapsedObese	1.867	1.390	2.508	0.000
diabetes	diabetesYes	4.057	2.552	6.450	0.000
educ	educHigh School	0.641	0.497	0.827	0.001
educ	educSome College	0.614	0.444	0.849	0.003
educ	educCollege degree	0.819	0.583	1.151	0.250
angina	angina	2.617	2.021	3.390	0.000

Appendix 2: Multivariate Model Results

Multivariable model summary table

Table 2: Multivariable Cox regression model with time-dependent angina

Characteristic	N	Event N	HR	95% CI	p-value
Age (per year)	1,139	353	1.09	1.08, 1.11	<0.001
Sex					<0.001
Male	487	185	1.00	—
Female	652	168	0.46	0.37, 0.57
BMI Category					0.038
Underweight/Healthy Weight	522	142	1.00	—
Overweight	464	147	0.92	0.73, 1.17
Obese	153	64	1.38	1.02, 1.89
Diabetes					<0.001
No	1,112	334	1.00	—
Yes	27	19	2.64	1.64, 4.25
Education Level					0.5
0-11 years	477	176	1.00	—
High School	343	90	1.12	0.85, 1.46
Some College	184	46	1.04	0.74, 1.45
College degree	135	41	0.84	0.59, 1.19
Angina (Time-dependent)	1,139	353	1.83	1.40, 2.39	<0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
Likelihood Ratio Test = 267; LR Test p-value = 0.000; Concordance (c-index) = 0.733

Appendix 3: Kaplan-Meier Curves

KM Curves

Figure 1: Kaplan-Meier survival curves by covariates

Appendix 4: Time-Dependent Angina Covariate Structure

Time-Dependent Angina Covariate Structure

Table 3: Time-Dependent Angina Covariate Structure

Participant Group	# of Intervals	Interval Description
Angina at baseline (timeap = 0)	1	Single interval with angina = 1
Develop angina during follow-up	2	Before angina onset (angina = 0) & after onset (angina = 1)
Never develop angina	1	Single interval with angina = 0

Appendix 5: Study Participant Characteristics

Study Participant Characteristics

Table 4: Baseline characteristics of Framingham Heart Study participants

Characteristic	N	N = 999¹
Age (years)	999	50 (9)
Sex	999
Male		424 (42%)
Female		575 (58%)
BMI Category	999
Underweight/Healthy Weight		463 (46%)
Overweight		407 (41%)
Obese		129 (13%)
Diabetes	999	24 (2.4%)
Education Level	999
0-11 years		417 (42%)
High School		305 (31%)
Some College		163 (16%)
College degree		114 (11%)
Deaths	999	353 (35%)
Follow-up (years)	999	20.3 (6.3)
¹ Mean (SD); n (%)

Figure 2: Distribution of continuous variables

Figure 3: Boxplots of continuous variables

Figure 4: Stacked bar charts showing categorical variables frequencies

Appendix 6: Model Diagnostics

Model Diagnostics

The proportional hazards assumption was reasonably satisfied for all covariates (global test p = 0.71), supporting the validity of the Cox regression approach. Initial diagnostic plots suggested mild non-linearity in the association of age and all-cause mortality, with very older individiauls having higher than linearly predicted risk.

The Schoenfeld residual plots show relatively random scatter around zero for most variables, with no systematic trends that would indicate time-varying effects. This supports the assumption that hazard ratios remain constant over time

Diagnostic deviance residuals plot revealed a moderate systematic pattern in deviance residuals, suggesting potential non-linear relationships, unmeasured confounding and/or missing interactions. Exploration of spline terms for age revealed no significant non-linear relationships (LR test p = 0.34), supporting the adequacy of a linear age effect in this cohort.

                  chisq df    p
age              1.0366  1 0.31
sex              1.7244  1 0.19
bmicat_collapsed 1.0565  2 0.59
diabetes         0.1767  1 0.67
educ             2.1526  3 0.54
angina           0.0495  1 0.82
GLOBAL           6.3402  9 0.71

Figure 5: Proportional hazards assumption test (Schoenfeld residuals)

Figure 6: Assessment of linearity for age (Martingale residuals)

The linearity assumption for age appears is not clearly satisfied. Although, the lowess smoothed line stays relatively close to zero across the age range there is a notable upward trend at higher ages.

Appendix 7: Alternative Model - Restricted Cubic Splines for Age

for reference, not reported due to worse fit

Spline Model

Analysis of Deviance Table
 Cox model: response is  surv_tdc
 Model 1: ~ age + sex + bmicat_collapsed + diabetes + educ + angina
 Model 2: ~ rcs(age, 4) + sex + bmicat_collapsed + diabetes + educ + angina
   loglik  Chisq Df Pr(>|Chi|)
1 -2233.6                     
2 -2232.5 2.1387  2     0.3432

Table 5: Spline model coefficients

Characteristic	HR	95% CI	p-value
Age (splined)
rcs(age, 4)age	1.14	1.04, 1.25	0.004
rcs(age, 4)age'	0.84	0.63, 1.10	0.2
rcs(age, 4)age''	1.58	0.82, 3.05	0.2
Sex
Male	—	—
Female	0.46	0.37, 0.57	<0.001
BMI Category
Underweight/Healthy Weight	—	—
Overweight	0.93	0.73, 1.17	0.5
Obese	1.39	1.02, 1.90	0.038
Diabetes
No	—	—
Yes	2.65	1.64, 4.26	<0.001
Education Level
0-11 years	—	—
High School	1.11	0.85, 1.46	0.4
Some College	1.03	0.73, 1.44	0.9
College degree	0.84	0.59, 1.19	0.3
Angina (Time-dependent)	1.86	1.42, 2.43	<0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
Concordance = 0.735
Spline model not selected due to non-significant improvement (LR test p = 0.34)

Figure 8: Age-mortality relationship with splines

a joke

Further analysis of appropriate model specification accommodating for systematic patterns in deviance residuals is considered trivial and left as an exercise for the reader.

Appendix 8: R Code

Show code

blockquote {
  font-size: inherit;
}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# Load required libraries
library(knitr)
library(tidyverse)
library(survival)
library(survminer)
library(gridExtra)
library(kableExtra)
library(gtsummary)
library(broom.helpers)
library(parameters)
library(rms)
library(broom)
library(gt)

# Set ggplot theme
theme_set(theme_minimal())

# Load the Framingham data
data <- read.csv("assn3_2025_S2.csv")
# Convert categorical variables to factors
data <- data |>
  mutate(
    sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female")),
    bmicat = factor(bmicat, levels = 1:6, 
                    labels = c("Underweight", "Healthy Weight", "Overweight", 
                              "Obese I", "Obese II", "Obese III")),
    diabetes = factor(diabetes, levels = c(0, 1), labels = c("No", "Yes")),
    educ = factor(educ, levels = 1:4, 
                  labels = c("0-11 years", "High School", "Some College", "College degree")),
    death = factor(death, levels = c(0, 1), labels = c("Censored", "Death"))
  ) |>
  # Combine BMI categories
  mutate(
    bmicat_collapsed = case_when(
      bmicat %in% c("Underweight", "Healthy Weight") ~ "Underweight/Healthy Weight",
      bmicat == "Overweight" ~ "Overweight", 
      bmicat %in% c("Obese I", "Obese II", "Obese III") ~ "Obese",
      TRUE ~ as.character(bmicat)
    ),
    bmicat_collapsed = factor(bmicat_collapsed, 
                             levels = c("Underweight/Healthy Weight", "Overweight", "Obese"))
  )
# Create time-dependent dataset for angina scenarios
tdc_data <- data |>
  rowwise() |>
  mutate(
    rows = case_when(
      # Case 1: Angina at baseline (timeap = 0)
      timeap == 0 ~ list(tibble(
        randid = randid,
        tstart = 0,
        tstop = timedth,
        death = as.numeric(death) - 1,
        age = age,
        sex = sex,
        bmicat = bmicat,
        bmicat_collapsed = bmicat_collapsed,
        diabetes = diabetes,
        educ = educ,
        angina = 1
      )),
      
      # Case 2: Angina during follow-up (timeap != 0)
      timeap <= timedth ~ list(bind_rows(
        # Before angina
        tibble(
          randid = randid,
          tstart = 0,
          tstop = timeap,
          death = 0,
          age = age,
          sex = sex,
          bmicat = bmicat,
          bmicat_collapsed = bmicat_collapsed,
          diabetes = diabetes,
          educ = educ,
          angina = 0
        ),
        # After angina
        tibble(
          randid = randid,
          tstart = timeap,
          tstop = timedth,
          death = as.numeric(death) - 1,
          age = age,
          sex = sex,
          bmicat = bmicat,
          bmicat_collapsed = bmicat_collapsed,
          diabetes = diabetes,
          educ = educ,
          angina = 1
        )
      )),
      
      # Case 3: No angina during study
      TRUE ~ list(tibble(
        randid = randid,
        tstart = 0,
        tstop = timedth,
        death = as.numeric(death) - 1,
        age = age,
        sex = sex,
        bmicat = bmicat,
        bmicat_collapsed = bmicat_collapsed,
        diabetes = diabetes,
        educ = educ,
        angina = 0
      ))
    )
  ) |>
  # Expand nested rows and clean up
  select(rows) |>
  unnest(rows) |>
  ungroup()

# Create survival objects
surv_obj <- Surv(data$timedth, as.numeric(data$death) - 1)
surv_tdc <- Surv(tdc_data$tstart, tdc_data$tstop, tdc_data$death)
# Univariate models
uni_models <- list(
  age = coxph(surv_tdc ~ age, data = tdc_data),
  sex = coxph(surv_tdc ~ sex, data = tdc_data),
  bmicat_collapsed = coxph(surv_tdc ~ bmicat_collapsed, data = tdc_data),
  diabetes = coxph(surv_tdc ~ diabetes, data = tdc_data),
  educ = coxph(surv_tdc ~ educ, data = tdc_data),
  angina = coxph(surv_tdc ~ angina, data = tdc_data)
)

# Create univariate results table
uni_results <- map_dfr(names(uni_models), ~{
  model <- uni_models[[.x]]
  tidy(model, exponentiate = TRUE, conf.int = TRUE) |>
    mutate(variable = .x)
}) |>
  select(variable, term, estimate, conf.low, conf.high, p.value)

kable(uni_results, 
      col.names = c("Variable", "Term", "HR", "95% CI Lower", "95% CI Upper", "P-value"),
      digits = 3)
# Fit multivariable model
mv_model <- coxph(surv_tdc ~ age + sex + bmicat_collapsed + diabetes + educ + angina, 
                  data = tdc_data)

# Display results using gtsummary
tbl_regression(mv_model,
               exponentiate = TRUE,
               add_estimate_to_reference_rows = TRUE,
               label = list(
                 age ~ "Age (per year)",
                 sex ~ "Sex",
                 bmicat_collapsed ~ "BMI Category",
                 diabetes ~ "Diabetes",
                 educ ~ "Education Level",
                 angina ~ "Angina (Time-dependent)"
               )) |>
  add_n(location = "level") |>
  add_nevent(location = "level") |>
  add_global_p() |>
  add_glance_source_note(
    label = list(
      concordance = "Concordance (c-index)",
      statistic.log = "Likelihood Ratio Test",
      p.value.log = "LR Test p-value",
      AIC = "AIC",
      BIC = "BIC"
    ),
    include = c(statistic.log, p.value.log, concordance)
  ) |>
  bold_p(t = 0.05) |>
  bold_labels() |>
  as_gt()
# KM curves by different variables
km_plot_sex <- ggsurvplot(survfit(surv_obj ~ sex, data = data), 
                 data = data, pval = TRUE, conf.int = TRUE,
                 title = "Survival by Sex", xlab = "Time (days)")$plot

km_plot_diabetes <- ggsurvplot(survfit(surv_obj ~ diabetes, data = data), 
                 data = data, pval = TRUE, conf.int = TRUE,
                 title = "Survival by Diabetes", xlab = "Time (days)")$plot

km_plot_education <- ggsurvplot(survfit(surv_obj ~ educ, data = data), 
                 data = data, pval = TRUE, conf.int = TRUE,
                 title = "Survival by Education", xlab = "Time (days)")$plot

km_plot_bmi <- ggsurvplot(survfit(surv_obj ~ bmicat_collapsed, data = data), 
                 data = data, pval = TRUE, conf.int = TRUE,
                 title = "Survival by BMI Category", xlab = "Time (days)")$plot

grid.arrange(km_plot_sex, km_plot_diabetes, km_plot_education, km_plot_bmi, ncol = 2)
angina_table <- data.frame(
  "Participant Group" = c("Angina at baseline (timeap = 0)",
                         "Develop angina during follow-up", 
                         "Never develop angina"),
  "# of Intervals" = c("1", "2", "1"),
  "Interval Description" = c("Single interval with angina = 1",
                            "Before angina onset (angina = 0) & after onset (angina = 1)",
                            "Single interval with angina = 0"),
  check.names = FALSE
)
kable(angina_table, 
      format = "markdown",
      align = c("l", "c", "l"))
baseline_data <- data |>
  select(age, sex, bmicat_collapsed, diabetes, educ, death, timedth) |>
  mutate(
    death_binary = ifelse(death == "Death", 1, 0),
    follow_up_years = timedth / 365.25
  )

tbl_summary(baseline_data,
            include = c(age, sex, bmicat_collapsed, diabetes, educ, death_binary, follow_up_years),
            statistic = list(
              all_continuous() ~ "{mean} ({sd})",
              all_categorical() ~ "{n} ({p}%)"
            ),
            label = list(
              age ~ "Age (years)",
              sex ~ "Sex",
              bmicat_collapsed ~ "BMI Category", 
              diabetes ~ "Diabetes",
              educ ~ "Education Level",
              death_binary ~ "Deaths",
              follow_up_years ~ "Follow-up (years)"
            )) |>
  add_n() |>
  bold_labels()
# Histograms for continuous variables
histogram_age <- ggplot(data, aes(x = age)) +
  geom_histogram(bins = 30, fill = "thistle2", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Age", x = "Age (years)", y = "Frequency")

histogram_timedth <- ggplot(data, aes(x = timedth)) +
  geom_histogram(bins = 30, fill = "lightblue2", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Follow-up Time", 
       x = "Time to Death/Censoring (days)", y = "Frequency")

histogram_timeap <- ggplot(data, aes(x = timeap)) +
  geom_histogram(bins = 30, fill = "lightgreen", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Time to Angina", 
       x = "Time to Angina Diagnosis (days)", y = "Frequency")

grid.arrange(histogram_age, histogram_timedth, histogram_timeap, ncol = 3)
# Boxplots for continuous variables
boxplot_age <- ggplot(data, aes(x = age)) +
  geom_boxplot(fill = "thistle2", alpha = 0.7) +
  labs(title = "Boxplot of Age", x = "Age (years)") +
  coord_flip()

boxplot_timedth <- ggplot(data, aes(x = timedth)) +
  geom_boxplot(fill = "lightblue2", alpha = 0.7) +
  labs(title = "Boxplot of Follow-up Time", x = "Time (days)") +
  coord_flip()

boxplot_timeap <- ggplot(data, aes(x = timeap)) +
  geom_boxplot(fill = "lightgreen", alpha = 0.7) +
  labs(title = "Boxplot of Time to Angina", x = "Time (days)") +
  coord_flip()

grid.arrange(boxplot_age, boxplot_timedth, boxplot_timeap, ncol = 3)
# Stacked bar charts for categorical variables
bar_sex <- ggplot(data, aes(x = "", fill = sex)) +
  geom_bar(position = "stack", alpha = 0.7, color = "black", width = 0.7) +
  geom_text(aes(label = after_stat(sprintf("%d\n(%.1f%%)", count, 100*count/sum(count)))),
            stat = 'count', position = position_stack(vjust = 0.5), size = 3.5) +
  labs(title = "Distribution of Sex", x = "", y = "Frequency", fill = "Sex")

bar_diabetes <- ggplot(data, aes(x = "", fill = diabetes)) +
  geom_bar(position = "stack", alpha = 0.7, color = "black", width = 0.7) +
  geom_text(aes(label = after_stat(sprintf("%d\n(%.1f%%)", count, 100*count/sum(count)))),
            stat = 'count', position = position_stack(vjust = 0.5), size = 3.5) +
  labs(title = "Distribution of Diabetes", x = "", y = "Frequency", fill = "Diabetes")

bar_educ <- ggplot(data, aes(x = "", fill = educ)) +
  geom_bar(position = "stack", alpha = 0.7, color = "black", width = 0.7) +
  geom_text(aes(label = after_stat(sprintf("%d\n(%.1f%%)", count, 100*count/sum(count)))),
            stat = 'count', position = position_stack(vjust = 0.5), size = 3.5) +
  labs(title = "Distribution of Education", 
       x = "", y = "Frequency", fill = "Education Level")

bar_bmicat <- ggplot(data, aes(x = "", fill = bmicat)) +
  geom_bar(position = "stack", alpha = 0.7, color = "black", width = 0.7) +
  geom_text(aes(label = after_stat(sprintf("%d\n(%.1f%%)", count, 100*count/sum(count)))),
            stat = 'count', position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Distribution of BMI Category", 
       x = "", y = "Frequency", fill = "BMI Category") +
  theme(legend.text = element_text(size = 8))

bar_death <- ggplot(data, aes(x = "", fill = death)) +
  geom_bar(position = "stack", alpha = 0.7, color = "black", width = 0.7) +
  geom_text(aes(label = after_stat(sprintf("%d\n(%.1f%%)", count, 100*count/sum(count)))),
            stat = 'count', position = position_stack(vjust = 0.5), size = 3.5) +
  labs(title = "Distribution of Death Status", 
       x = "", y = "Frequency", fill = "Death Status")

# Combine BMI categories
data <- data |>
  mutate(
    bmicat_collapsed = case_when(
      bmicat %in% c("Underweight", "Healthy Weight") ~ "Underweight/Healthy Weight",
      bmicat == "Overweight" ~ "Overweight", 
      bmicat %in% c("Obese I", "Obese II", "Obese III") ~ "Obese",
      TRUE ~ as.character(bmicat)
    ),
    bmicat_collapsed = factor(bmicat_collapsed, 
                             levels = c("Underweight/Healthy Weight", "Overweight", "Obese"))
  )

# Stacked bar chart of clapsed BMI Category
bar_bmicat_collapsed <- ggplot(data, aes(x = "", fill = bmicat_collapsed)) +
  geom_bar(position = "stack", alpha = 0.7, color = "black", width = 0.7) +
  geom_text(aes(label = after_stat(sprintf("%d\n(%.1f%%)", count, 100*count/sum(count)))),
            stat = 'count', position = position_stack(vjust = 0.5), size = 3.5) +
  labs(title = "Distribution of Collapsed BMI Category", 
       x = "", y = "Frequency", fill = "BMI Category") +
  theme_minimal()

grid.arrange(bar_sex, bar_diabetes, bar_educ, bar_bmicat, bar_bmicat_collapsed, bar_death, ncol = 3)
# Proportional hazards assumption test
ph_test <- cox.zph(mv_model)
print(ph_test)

# Plot Schoenfeld residuals
ggcoxzph(ph_test, font.main = 12)
# Martingale residuals to test for linearity of age 
mart_resid <- residuals(mv_model, type = "martingale")

# Martingale residuals plot for age
plot(tdc_data$age, mart_resid,
     xlab = "Age (years)",
     ylab = "Martingale Residuals",
     main = "Assessment of Age Linearity")
lines(lowess(tdc_data$age, mart_resid), col = "thistle2", lwd = 2)
abline(h = 0, lty = 2, col = "gray")
# Deviance residuals
dev_resid <- residuals(mv_model, type = "deviance")
par(mfrow = c(2, 2))

# Deviance residuals plot for linear predictor
plot(predict(mv_model), dev_resid, 
     xlab = "Linear Predictor", ylab = "Deviance Residuals",
     main = "Deviance Residuals vs Linear Predictor")
abline(h = 0, col = "thistle2", lty = 2)

# Deviance residuals plot for age
plot(tdc_data$age, dev_resid,
     xlab = "Age", ylab = "Deviance Residuals",
     main = "Deviance Residuals vs Age")
abline(h = 0, col = "thistle2", lty = 2)

# DFBETA plot for age
dfbeta_vals <- residuals(mv_model, type = "dfbeta")
plot(dfbeta_vals[,1], ylab = "DFBETA for Age",
     main = "Influential Observations (Age)")
abline(h = c(-2/sqrt(nrow(tdc_data)), 2/sqrt(nrow(tdc_data))), 
       col = "thistle2", lty = 2)

# DFBETA plot for angina
angina_col <- which(colnames(dfbeta_vals) == "angina")
if(length(angina_col) > 0) {
  plot(dfbeta_vals[,angina_col], ylab = "DFBETA for Angina",
       main = "Influential Observations (Angina)")
  abline(h = c(-2/sqrt(nrow(tdc_data)), 2/sqrt(nrow(tdc_data))), 
         col = "thistle2", lty = 2)
}
par(mfrow = c(1, 1))
# Fit model with restricted cubic splines for age
dd <- datadist(tdc_data)
options(datadist = 'dd')

# Fit spline model
mv_model_spline <- coxph(surv_tdc ~ rcs(age, 4) + sex + bmicat_collapsed + 
                         diabetes + educ + angina, data = tdc_data)

# Likelihood ratio test
anova(mv_model, mv_model_spline)
# Display spline model coefficients
tbl_regression(mv_model_spline,
               exponentiate = TRUE,
               label = list(
                 "rcs(age, 4)" ~ "Age (splined)",
                 sex ~ "Sex", 
                 bmicat_collapsed ~ "BMI Category",
                 diabetes ~ "Diabetes",
                 educ ~ "Education Level",
                 angina ~ "Angina (Time-dependent)"
               )) |>
  add_glance_source_note(
    label = list(concordance = "Concordance"),
    include = concordance
  ) |>
  bold_p(t = 0.05) |>
  bold_labels() |>
  as_gt() |>
  tab_footnote("Spline model not selected due to non-significant improvement (LR test p = 0.34)")
# Spline plot for age
age_vals <- seq(min(tdc_data$age), max(tdc_data$age), length.out = 100)

# Create prediction data
pred_data <- data.frame(
  age = age_vals,
  sex = factor("Male", levels = levels(tdc_data$sex)),
  bmicat_collapsed = factor("Underweight/Healthy Weight", 
                           levels = levels(tdc_data$bmicat_collapsed)),
  diabetes = factor("No", levels = levels(tdc_data$diabetes)),
  educ = factor("High School", levels = levels(tdc_data$educ)),
  angina = 0
)

# Get predictions
linear_pred <- predict(mv_model, newdata = pred_data, type = "lp")
spline_pred <- predict(mv_model_spline, newdata = pred_data, type = "lp")

# Plot comparison
plot_data <- data.frame(
  age = age_vals,
  linear = exp(linear_pred - linear_pred[1]),
  spline = exp(spline_pred - spline_pred[1])
) |>
  pivot_longer(cols = c(linear, spline), names_to = "model", values_to = "hr")

ggplot(plot_data, aes(x = age, y = hr, color = model)) +
  geom_line(size = 1.2) +
  labs(x = "Age (years)", 
       y = "Relative Hazard Ratio",
       title = "Age-Mortality Relationship: Linear vs Splined",
       color = "Model") +
  scale_color_manual(values = c("linear" = "lightblue2", "spline" = "thistle2"),
                     labels = c("Linear", "Splined (4 knots)"))
codebook <- tribble(
  ~Variable, ~Description,
  "randid", "Identifying ID",
  "age", "Age in years", 
  "sex", "Sex (1=male, 2=female)",
  "bmicat", "BMI category (1-6: Underweight to Obese III)",
  "diabetes", "Diabetes diagnosis (0=no, 1=yes)",
  "educ", "Education (1-4: 0-11 years to College degree)",
  "timeap", "Time to angina diagnosis (days)",
  "death", "Death (0=censored, 1=death)",
  "timedth", "Time to death/censoring (days)"
)

codebook |>
  gt() |>
  cols_label(Variable = "Variable", Description = "Description")

Appendix 9: Assignment Instructions

Assignment Instructions

Presentation schedule

Presentations will be in the week 10th to 14th November. Please select a time slot in the Excel sheet (link given on the Canvas page). The time slots are 15 minutes, allowing time for some discussion and feedback. Contact the unit coordinator by email (gillian.heller@sydney.edu.au) if you require another time.

Submission Instructions

Due 11:59pm 7th October 2025. This assignment is worth 30% of the overall mark.

Please submit your work in one document, with your name, the course title and the assignment number in the file name (e.g. RM2 Assignment2 JimmyBarnes.docx) and in the document itself (either on the title page, or in the header/footer). Please submit either a Word or PDF file. You should include relevant statistical code in an appendix, showing how you answered the question. Please use a different font (Courier 10pt is recommended) and format it neatly, to make your answers easy to read and understand. You should submit your assignment online according to the instructions in the BCA Turnitin Guide (available on Canvas).

Task Scenario > You are the Biostatistician who has been asked to deliver research findings to your research team and wider professional network. > > The presentation of your findings will be via Zoom and should be 5 minutes long (±10%). > > - The team you are presenting to includes non-statisticians, so it is important to present the findings in a way that can be understood by this audience. > - Do not present interim modelling steps, just your final model. > - The presentation will be followed by 2 live questions. > - Your slides will form part of your assessment and should be submitted. > > In addition to the live presentation and presentation slides, two other documents will be submitted as part of the assessment: > > - A report describing the analysis. > - software script (R or Stata) which shows the data preparation and analysis steps and command syntax. This code should be “well commented” so that a colleague familiar with the software can follow the analysis steps undertaken. > > You will be assessed on the content, as well as the quality of your presentation. This includes visual appearance of slides, clarity of presentation and keeping to time.

Outputs > There will be two components for assessment. > > ### 1. Report [20 marks] > > The report should be of 3 pages length and include > > 1. Description of study and data. > 2. Demographics. > 3. Univariate description or graphs for survival. > 4. Description of the methods used, and the final model. You do not need to describe the steps. > 5. A table of estimates for your final model, together with a discussion of the effects. This should be constructed in a way that is easily interpretable. > 6. Brief summary, and a discussion of limitations. > 7. In an appendix, the diagnostics, including linearity, for the final model. (not included in 3 pages). Note that for linearity that we can ignore small improvements for nonlinear models. > 8. Listing of code used (not included in 3 pages). > > Note that you do not include anything that you did to check the data. The report is meant to be shown to someone who has only a little knowledge of statistics and will assume that the analysis is sensible. I have asked for the diagnostics, as these have not been examined elsewhere > > ### 2. Presentation [20 marks] > > The presentation should be 5 minutes long (±10%). > In your presentation you should cover: > > 1. Aims of the study > 2. The study itself. > 3. Description of the subjects. > 4. Univariate descriptions. > 5. Description of the methods used. > 6. Results. > 7. Brief summary, and a discussion of limitations. > > Presentations are always a compromise between having too much on each slide and having too many slides. I expect that about 10-15 slides but there is no correct number. For this data the aims and study details will be very brief, as I have not given much information. > > #### Resources on presentation > A live presentation (online) forms part of the assessment of Assignment 3. The following resources are provided on Canvas to assist you with preparing your presentation: > > 1. A short lecture, by Dr Michael Waller, on presentation skills; > 2. An example of excellent slides, from a previous semester.

Task > The data for this assignment is taken from the Framingham Heart Study, and is contained in the file assn3_2025_S2.csv. This is a study conducted to determine the epidemiology of heart disease, and is still in progress. The interest in this analysis is the association of age, sex, BMI, diabetes, education and angina on all cause mortality. Angina is a time-dependent covariate, which may have already occurred at baseline. This is similar to the example in the notes, except the time-dependent covariate changes from 0 to 1. Your presentation should explain the time-dependent effect, and any limitations. Note that this is old data, so may not be comparable to current studies and I have included only a small subset of variables, and a subset of subjects.

Table 6: Codebook for Framingham Heart Study

Variable	Description
randid	Identifying ID
age	Age in years
sex	Sex (1=male, 2=female)
bmicat	BMI category (1-6: Underweight to Obese III)
diabetes	Diabetes diagnosis (0=no, 1=yes)
educ	Education (1-4: 0-11 years to College degree)
timeap	Time to angina diagnosis (days)
death	Death (0=censored, 1=death)
timedth	Time to death/censoring (days)