EPI 553 Progress Check-in 2

library(readxl)

## Warning: package 'readxl' was built under R version 4.5.2

library(janitor)

## Warning: package 'janitor' was built under R version 4.5.2

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.5.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(forcats)

## Warning: package 'forcats' was built under R version 4.5.2

library(ggplot2)
library(broom)

## Warning: package 'broom' was built under R version 4.5.2

library(gtsummary)

## Warning: package 'gtsummary' was built under R version 4.5.2

library(ResourceSelection)

## Warning: package 'ResourceSelection' was built under R version 4.5.2

## ResourceSelection 0.3-6   2023-06-27

library(knitr)

## Warning: package 'knitr' was built under R version 4.5.2

1. Analytical Sample Update

The analytical sample was unchanged from Check-in 1. The final analytic dataset included all 150 participants in the HIV clinic dataset. No additional exclusions were required because all participants were aged 40 years or older and there were no missing values for the primary exposure (ART duration), the binary outcome (elevated NT-proBNP), age, sex, height, weight, hypertension, or current ARV regimen.

One modeling refinement was made since Check-in 1. For descriptive summaries, ARV regimen was originally harmonized into four groups. For the adjusted regression model, the PI-based and other/mixed categories were combined into a single PI/Other group because the original categories were very small and produced unstable estimates when entered separately. This change affected only the multivariable model specification and did not alter the analytic sample.

nursultan <- read_excel("C:/Users/userp/OneDrive/Рабочий стол/HSTA553/Project EPI 553/nursultan.xlsx") |>
  clean_names()

library(readxl)
library(janitor)
library(dplyr)
library(stringr)

nursultan <- read_excel("C:/Users/userp/OneDrive/Рабочий стол/HSTA553/Project EPI 553/nursultan.xlsx") %>%
  clean_names()

# 
names(nursultan)[grepl("art", names(nursultan), ignore.case = TRUE)]

## [1] "start"         "sexpartner"    "partnerstatus" "heart"        
## [5] "heartmed"      "heartmedtype"  "art"           "fhheart"

# if the column is art_yrs, use this:
# nursultan <- nursultan %>% rename(art_duration_years = art_yrs)

# if the column is artyrs, use this:
nursultan <- nursultan %>% rename(art_duration = ar_tyrs)

analysis_df <- nursultan %>%
  mutate(
    elevated_proBNP = if_else(pro_bnp >= 125, 1, 0),
    BMI = weight / (height / 100)^2,
    sex_f = factor(if_else(sex == 1, "Male", "Female"),
                   levels = c("Female", "Male")),
    HTN_f = factor(if_else(htn == 1, "Yes", "No"),
                   levels = c("No", "Yes")),
    smoking = case_when(
      smoke100 == 1 & smokeday %in% c(1, 2) ~ "Current",
      TRUE ~ "Non-current"
    ),
    smoking = factor(smoking, levels = c("Non-current", "Current")),
    ART_clean = str_to_lower(art),
    ART_group4 = case_when(
      str_detect(ART_clean, "резолста|калетра|лопинавир|дарунавир") &
        str_detect(ART_clean, "теград|тивикай|долутегравир|dtg|триумек") ~ "Other/mixed",
      str_detect(ART_clean, "резолста|калетра|лопинавир|дарунавир") ~ "PI-based",
      str_detect(ART_clean, "теград|тивикай|долутегравир|dtg|триумек|триград") ~ "INSTI-based",
      str_detect(ART_clean, "тенмифа|атрипла|комплера|efv|рилпивирин") ~ "NNRTI-based",
      TRUE ~ "Other/mixed"
    ),
    ART_group3 = case_when(
      ART_group4 %in% c("PI-based", "Other/mixed") ~ "PI/Other",
      TRUE ~ ART_group4
    ),
    ART_group3 = factor(ART_group3,
                        levels = c("INSTI-based", "NNRTI-based", "PI/Other")),
    art_duration_cat = case_when(
      art_duration < 5 ~ "<5 years",
      art_duration >= 5 & art_duration <= 9 ~ "5-9 years",
      art_duration >= 10 ~ ">=10 years",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(age >= 40)

2. Regression Model Specification

A logistic regression model was used because the outcome was binary: elevated NT-proBNP was defined as NT-proBNP >=125 pg/mL versus <125 pg/mL. This specification was preferred to linear regression because the underlying biomarker distribution was strongly right-skewed in Check-in 1, whereas the binary threshold provides a clinically interpretable outcome aligned with the study question.

Model 1 was an unadjusted logistic regression of elevated NT-proBNP on ART duration in years. Model 2 added age, hypertension, ARV regimen group, sex, body mass index (BMI), and smoking status. In words, the adjusted model estimated the log-odds of elevated NT-proBNP as a function of ART duration while holding those covariates constant.

Reference categories were selected to provide interpretable baselines and to reflect the most common or clinically natural comparison groups. INSTI-based regimen served as the ARV reference category because it was the largest regimen group in the sample. Female was the reference for sex, no hypertension for hypertension, and non-current smoker for smoking status.

Age and hypertension were included a priori because both are plausible confounders of treatment history and cardiac stress. ARV regimen group was included because regimen composition may be related to both cumulative treatment exposure and cardiovascular risk. Sex, BMI, and smoking status were included because they are clinically relevant cardiovascular risk factors and could also differ across patients with different treatment histories.

model1 <- glm(
  elevated_proBNP ~ art_duration,
  family = binomial(link = "logit"),
  data = analysis_df
)

model2 <- glm(
  elevated_proBNP ~ art_duration + age + HTN_f + ART_group3 + sex_f + BMI + smoking,
  family = binomial(link = "logit"),
  data = analysis_df
)

3. Regression Results

tbl_regression(model1, exponentiate = TRUE) %>%
  modify_caption("**Model 1. Unadjusted logistic regression**")

**Model 1. Unadjusted logistic regression**
Characteristic	OR	95% CI	p-value
art_duration	1.07	0.99, 1.15	0.094
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

tbl_regression(model2, exponentiate = TRUE) %>%
  modify_caption("**Model 2. Adjusted logistic regression**")

**Model 2. Adjusted logistic regression**
Characteristic	OR	95% CI	p-value
art_duration	1.05	0.97, 1.14	0.2
age	1.08	1.03, 1.13	0.003
HTN_f
No	—	—
Yes	0.88	0.29, 2.56	0.8
ART_group3
INSTI-based	—	—
NNRTI-based	0.95	0.45, 1.99	0.9
PI/Other	2.01	0.48, 8.69	0.3
sex_f
Female	—	—
Male	1.01	0.46, 2.24	>0.9
BMI	0.98	0.88, 1.08	0.7
smoking
Non-current	—	—
Current	0.72	0.30, 1.70	0.4
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

In the unadjusted model, each additional year of ART duration was associated with higher odds of elevated NT-proBNP, and the direction of the estimate was consistent with the study hypothesis. The confidence interval included the null, so the result should be interpreted as suggestive rather than conclusive.

After adjustment for age, hypertension, ARV regimen, sex, BMI, and smoking, the estimated association between ART duration and elevated NT-proBNP became somewhat smaller. This attenuation suggests that some of the crude association was explained by differences in patient characteristics, especially age. In the adjusted model, age emerged as the strongest predictor of elevated NT-proBNP, while the remaining covariates did not show statistically significant associations in this sample.

Taken together, the results are most consistent with a weak positive crude association between ART duration and elevated NT-proBNP that is partly confounded by age and estimated imprecisely in this moderate-sized cross-sectional dataset.

tidy(model1, exponentiate = TRUE, conf.int = TRUE)

## # A tibble: 2 × 7
##   term         estimate std.error statistic p.value conf.low conf.high
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)     0.380    0.307      -3.15 0.00162    0.202     0.679
## 2 art_duration    1.07     0.0383      1.67 0.0943     0.992     1.15

tidy(model2, exponentiate = TRUE, conf.int = TRUE)

## # A tibble: 9 × 7
##   term                  estimate std.error statistic p.value conf.low conf.high
##   <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)             0.0211    1.77     -2.18   0.0295  0.000591     0.644
## 2 art_duration            1.05      0.0397    1.21   0.226   0.972        1.14 
## 3 age                     1.08      0.0250    2.95   0.00319 1.03         1.13 
## 4 HTN_fYes                0.884     0.552    -0.224  0.823   0.287        2.56 
## 5 ART_group3NNRTI-based   0.948     0.378    -0.141  0.888   0.449        1.99 
## 6 ART_group3PI/Other      2.01      0.728     0.958  0.338   0.476        8.69 
## 7 sex_fMale               1.01      0.402     0.0315 0.975   0.459        2.24 
## 8 BMI                     0.978     0.0522   -0.431  0.666   0.880        1.08 
## 9 smokingCurrent          0.716     0.437    -0.765  0.444   0.303        1.70

4. Model Diagnostics

diag_df <- augment(model2, type.predict = "response", type.residuals = "deviance")

# Hosmer-Lemeshow test
hoslem.test(analysis_df$elevated_proBNP, fitted(model2), g = 10)

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  analysis_df$elevated_proBNP, fitted(model2)
## X-squared = 7.0797, df = 8, p-value = 0.5281

ggplot(diag_df, aes(.fitted, .resid)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(
    title = "Diagnostic 1. Deviance residuals vs fitted values",
    x = "Fitted probability",
    y = "Deviance residual"
  ) +
  theme_minimal()

The deviance residuals were centered around zero across the fitted-probability range without a severe systematic curve. Some banding and slightly wider spread at higher fitted probabilities are expected with a binary outcome, and there was no obvious pattern suggesting gross misspecification.

cal_df <- analysis_df %>%
  mutate(pred = fitted(model2)) %>%
  mutate(decile = ntile(pred, 10)) %>%
  group_by(decile) %>%
  summarize(
    pred = mean(pred),
    obs = mean(elevated_proBNP),
    .groups = "drop"
  )

ggplot(cal_df, aes(pred, obs)) +
  geom_point() +
  geom_line() +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  labs(
    title = "Diagnostic 2. Calibration by decile of predicted risk",
    x = "Mean predicted probability",
    y = "Observed event proportion"
  ) +
  theme_minimal()

Predicted risks tracked observed event proportions reasonably well across deciles of predicted probability. The Hosmer-Lemeshow test did not indicate poor calibration, so there was no strong evidence that the model systematically overpredicted or underpredicted the outcome.

cook_vals <- cooks.distance(model2)
plot(cook_vals, type = "h",
     main = "Diagnostic 3. Influence plot",
     xlab = "Observation", ylab = "Cook's distance")
abline(h = 1, lty = 2)

No observation approached the usual Cook’s distance threshold of 1, indicating that the fitted coefficients were not being driven by any single highly influential observation.

hat_vals <- hatvalues(model2)
std_resid <- rstandard(model2)

plot(hat_vals, std_resid,
     main = "Diagnostic 4. Leverage vs standardized residuals",
     xlab = "Leverage", ylab = "Standardized residual")
abline(h = 0, lty = 2)

A small number of observations had moderately higher leverage, but they were not paired with extreme standardized residuals. These points should be reviewed in the Final Report, but they do not currently warrant removing observations or re-specifying the model.

5. Regression Visualizations

newdat <- tibble(
  art_duration = seq(min(analysis_df$art_duration), max(analysis_df$art_duration), by = 1),
  age = mean(analysis_df$age, na.rm = TRUE),
  HTN_f = factor("No", levels = c("No", "Yes")),
  ART_group3 = factor("INSTI-based", levels = c("INSTI-based", "NNRTI-based", "PI/Other")),
  sex_f = factor("Female", levels = c("Female", "Male")),
  BMI = mean(analysis_df$BMI, na.rm = TRUE),
  smoking = factor("Non-current", levels = c("Non-current", "Current"))
)

pred_obj <- predict(model2, newdata = newdat, type = "link", se.fit = TRUE)
newdat <- newdat %>%
  mutate(
    fit = plogis(pred_obj$fit),
    lo = plogis(pred_obj$fit - 1.96 * pred_obj$se.fit),
    hi = plogis(pred_obj$fit + 1.96 * pred_obj$se.fit)
  )

ggplot(newdat, aes(art_duration, fit)) +
  geom_line() +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.2) +
  labs(
    title = "Figure 1. Adjusted predicted probability of elevated NT-proBNP by ART duration",
    x = "ART duration (years)",
    y = "Predicted probability of elevated NT-proBNP"
  ) +
  theme_minimal()

Figure 1 shows adjusted predicted probabilities from the logistic regression model while holding age and BMI at their sample means and setting hypertension to no, smoking to non-current, sex to female, and regimen to INSTI-based. The fitted curve rises gradually with longer ART duration, indicating that the model still estimates a positive association after adjustment. However, the confidence band widens at longer treatment durations because relatively few participants had very long ART exposure, so the slope should be interpreted cautiously.

coef_df <- tidy(model2, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    term = recode(
      term,
      "art_duration" = "ART duration (per year)",
      "age" = "Age (per year)",
      "HTN_fYes" = "Hypertension: Yes vs No",
      "ART_group3NNRTI-based" = "Regimen: NNRTI vs INSTI",
      "ART_group3PI/Other" = "Regimen: PI/Other vs INSTI",
      "sex_fMale" = "Male vs Female",
      "BMI" = "BMI (kg/m^2)",
      "smokingCurrent" = "Current smoker vs Non-current"
    )
  )

ggplot(coef_df, aes(x = estimate, y = forcats::fct_rev(term))) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.15) +
  geom_vline(xintercept = 1, linetype = 2) +
  scale_x_log10() +
  labs(
    title = "Figure 2. Adjusted odds ratios for elevated NT-proBNP",
    x = "Odds ratio (log scale)",
    y = NULL
  ) +
  theme_minimal()

## Warning: `geom_errorbarh()` was deprecated in ggplot2 4.0.0.
## ℹ Please use the `orientation` argument of `geom_errorbar()` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `height` was translated to `width`.

Figure 2 presents the adjusted odds ratios for all predictors in the multivariable model. Age stands out as the most clearly positive predictor because its confidence interval lies above 1, whereas the intervals for ART duration and the remaining covariates all cross the null value. This visual summary reinforces the main interpretation from the model table: within this sample, age was more strongly associated with elevated NT-proBNP than ART duration or regimen group.

Appendix. AI Assistance Statement

AI assistance was not used

#end