Prevalence and Risk Factors of Hypertension Among Adolescents Living with HIV Attending Lagos University Teaching Hospital

Author

Mariam Olufunmilayo YUSUF

Published

May 18, 2026


1 Executive Summary

Adolescents Living with HIV (ALHIV) face compounding cardiometabolic risks from the virus itself, antiretroviral therapy, and a chronic inflammatory state, yet hypertension in this population is under-diagnosed and poorly characterised in sub-Saharan Africa. This study examined the prevalence of elevated blood pressure and hypertension, and their associated risk factors, among 134 ALHIV aged 10–19 years attending the APIN Clinic at Lagos University Teaching Hospital (LUTH) between October 2022 and April 2023.

Data were obtained from clinic records, a structured researcher-administered questionnaire, and direct anthropometric and laboratory measurements. Five analytical techniques were applied: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Regression.

Key findings indicate that 10.4% of ALHIV had elevated blood pressure or hypertension (SBP or DBP ≥ 90th age-sex-height centile per AAP 2017 guidelines). Body mass index (BMI) emerged as the dominant predictor of systolic blood pressure (β = 1.45 mmHg per kg/m², p < 0.001) in the linear regression model, with male sex adding a further 4.4 mmHg after adjustment (p = 0.013). In the Firth penalised logistic regression, longer HAART duration was independently associated with lower odds of elevated BP/HTN (OR = 0.95 per month, p = 0.017), potentially reflecting the cardiometabolic benefits of sustained viral suppression. Longer ART exposure also correlated with higher triglycerides — a lipid burden that may accelerate future cardiovascular disease.

Recommendation: A cardiometabolic risk flag triggered by BMI ≥ 25 kg/m² should be integrated into APIN Clinic’s standard visit workflow, prompting enhanced BP monitoring, lipid panels, and dietary counselling for flagged ALHIV.


2 Professional Disclosure

Job Title: Consultant Paediatrician

Organisation: Lagos University Teaching Hospital (LUTH), APIN Clinic, Lagos, Nigeria

Domain: Paediatric Infectious Diseases and Child Health

As a Consultant Paediatrician managing over 200 ALHIV at the APIN Clinic, I make daily decisions about when to screen for comorbidities, how to adjust antiretroviral regimens for metabolic side effects, and how to counsel caregivers on long-term health risks. Each analytical technique chosen directly informs these clinical decisions:

  1. EDA — Technique Justification: Before any clinical decision, I must understand the baseline characteristics of my patient panel — age distribution, viral suppression rates, nutritional status, and blood pressure profile. EDA gives a structured, objective summary of these dimensions and flags data quality issues that could distort clinical conclusions. Without EDA, I risk building analysis on corrupted or misunderstood data.

  2. Data Visualisation — Technique Justification: Clinical guidelines rest on population-level evidence communicated visually. I use charts to identify which patient subgroups carry disproportionate risk and to present findings to nursing staff and caregivers who may not interpret statistical tables. A boxplot comparing BP by sex communicates instantly what a table of means cannot.

  3. Hypothesis Testing — Technique Justification: When reviewing whether a HAART regimen affects blood pressure, or whether male and female patients differ in a key outcome, formal hypothesis testing prevents acting on spurious differences. It adds scientific rigour to what might otherwise be clinical anecdote and is required for publication in peer-reviewed journals.

  4. Correlation Analysis — Technique Justification: HIV management involves balancing multiple interacting variables — viral load, CD4 count, BMI, blood pressure, and lipids. Correlation analysis reveals which variables track together, guiding which measurements to prioritise when resources are constrained. If BMI correlates strongly with SBP, I can use the weighing scale to screen before blood pressure equipment is available.

  5. Logistic/Linear Regression — Technique Justification: Identifying independent predictors of elevated blood pressure enables risk stratification. I can target enhanced monitoring toward the highest-risk ALHIV rather than applying uniform surveillance to all patients — an important consideration in a resource-limited tertiary setting where nursing time is scarce.


3 Data Collection & Sampling

Data source: APIN (AIDS Prevention Initiative in Nigeria) Clinic, Department of Paediatrics, Lagos University Teaching Hospital (LUTH), Lagos, Nigeria.

Collection period: October 2022 – April 2023 (7 months).

Collection methods: Three complementary sources were used:

  • Clinic records: Antiretroviral therapy (ART) history, HAART regimen, duration on therapy, and laboratory values were extracted from the clinic’s electronic and paper registers.
  • Researcher-administered structured questionnaire: Sociodemographic data (age, gender, religion, education, family history of hypertension, exercise habits) were obtained via face-to-face interview with the adolescent and caregiver.
  • Direct clinical measurements: Anthropometric measurements (weight, height, waist circumference), blood pressure (two readings taken five minutes apart using a calibrated sphygmomanometer with appropriate cuff size), and fasting laboratory tests (lipid panel; viral load) were performed during the clinic visit.

Sampling frame: All ALHIV aged 10–19 years attending the APIN Clinic during the study period who provided assent (adolescent) and whose caregiver provided written informed consent. Exclusion criteria: known cardiovascular disease prior to HIV diagnosis; incomplete anthropometric data.

Sample size: 134 adolescents (70 male, 64 female). This exceeds the minimum of 100 observations required.

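Statistical rationale: Using a published hypertension prevalence estimate in ALHIV of ~10% (Berman et al., 2023), desired precision of ±5%, and 95% confidence: n = [Z²·p(1−p)]/d² = [(1.96²)(0.10)(0.90)]/(0.05²) ≈ 138. The achieved sample of 134 falls four short of this target, which corresponds to a margin of error of about ±5.1% at the same assumed prevalence. This is close to the planned precision for the descriptive objective; note that the formula addresses the precision of the prevalence estimate and not the power of the inferential analyses.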
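As a rough check of the arithmetic above, the single-proportion formula can be evaluated directly. This is a minimal base R sketch; the variable names (p_expected, d_target, n_achieved, and so on) are illustrative and are not taken from the study code.

```r
# Sample size for estimating a single proportion: n = Z^2 * p * (1 - p) / d^2
z          <- 1.96   # critical value for 95% confidence, as used in the text
p_expected <- 0.10   # anticipated prevalence of hypertension
d_target   <- 0.05   # desired absolute precision (half-width of the CI)

n_required <- z^2 * p_expected * (1 - p_expected) / d_target^2

# Precision achievable with the enrolled sample, at the same assumed prevalence
n_achieved <- 134
d_achieved <- z * sqrt(p_expected * (1 - p_expected) / n_achieved)

cat(sprintf("Required n: %.1f\n", n_required))
cat(sprintf("Margin of error at n = %d: +/- %.1f percentage points\n",
            n_achieved, 100 * d_achieved))
```

Under these inputs the required n comes out a little above 138, and at n = 134 the margin of error is roughly ±5.1 percentage points.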

Ethical approval: Obtained from the LUTH Health Research and Ethics Committee (LUTHHREC) prior to data collection. All adolescents provided written informed assent; caregivers provided written informed consent. Participation was voluntary and did not affect clinical care.

Confidentiality: All patient identifiers were replaced with alphanumeric study codes (YM001–YM134) before data entry. No names, hospital numbers, or addresses appear in this submission.


4 Data Description

Code
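# Packages assumed to be loaded in the report's setup chunk (not shown in this extract).
# pal_gender, the named colour vector used in later plots, is also assumed to be defined there.
library(readxl)      # read_excel()
library(tidyverse)   # dplyr, tidyr, stringr, forcats, ggplot2, tibble
library(kableExtra)  # kbl(), kable_styling()
library(patchwork)   # p1 | p2 plot composition

path <- "DR YUSUF RAW FWACP DATA 2023.xlsx"
df_raw <- read_excel(path, sheet = 1)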

df <- df_raw |>
  select(
    sn                  = `S/N`,
    study_id            = `Q1a Study ID`,
    doe                 = `Q1b DOE`,
    dob                 = `Q2a DOB`,
    age                 = `Q2b Age`,
    gender              = `Q2c Gender`,
    edu_level           = `Q3 Level of edu`,
    religion            = `Q6 Religion`,
    siblings            = `Q15 Siblings`,
    birth_order         = `Q16 Birth Order`,
    fh_hypert           = `Q18 FH Hypert`,
    exercise            = `Q19 Exercise`,
    age_hiv_diag_mnth   = `SecB Q1 Age HIV diagnosis(MNTH)`,
    cd4_count           = `SecB Q2a CD4 Count(Cells/mm3)`,
    viral_load          = `SecB Q2b Viral Load(c/ml)`,
    haart               = `SecB Q3 Present HAART`,
    haart_duration_mnth = `SecB Q4 Duration on present HAART(Mnth)`,
    prev_haart          = `SecB Q5 Any HAART in past`,
    weight_kg           = `SecC Q1 Weight(kg)`,
    height_raw          = `SecC Q2a Height(m)`,
    bmi_orig            = `SecC Q3a BMI(kg/m2)`,
    bmi_interpret       = `SecC Q3b BMI Interpret`,
    wc_cm               = `SecC Q4a WC (cm)`,
    sbp                 = `SecC Q5a SBP measured`,
    sbp_interpret       = `SecC Q5a2 SBP Interpret`,
    dbp                 = `SecC Q5b DBP measured`,
    dbp_interpret       = `SecC Q5b2 DBP Interpret`,
    total_chol          = `SecD Q1a Total Choles  (mg/dl)`,
    ldl                 = `SecD Q2a LDL (mg/dl)`,
    hdl                 = `SecD Q3a HDL (mg/dl)`,
    tg                  = `SecD Q4a Triglyceride (mg/dl)`
  ) |>
  mutate(
    # --- DATA QUALITY FIX 1: one height value entered in cm (160) instead of metres ---
    height_m   = if_else(height_raw > 10, height_raw / 100, height_raw),
    bmi        = round(weight_kg / height_m^2, 2),

    # --- DATA QUALITY FIX 2: extreme LDL outlier (1,167.83 mg/dL) set to NA ---
    ldl_clean  = if_else(ldl > 300, NA_real_, ldl),

    # --- Derived variables ---
    gender     = case_when(
      str_detect(gender, "Male")   ~ "Male",
      str_detect(gender, "Female") ~ "Female"
    ),
    religion   = case_when(
      str_detect(religion, "Christ") ~ "Christianity",
      str_detect(religion, "Islam")  ~ "Islam"
    ),
    fh_hypert  = case_when(
      str_detect(fh_hypert, "Yes") ~ "Yes",
      str_detect(fh_hypert, "No")  ~ "No"
    ),
    exercise   = case_when(
      str_detect(exercise, "Yes") ~ "Yes",
      str_detect(exercise, "No")  ~ "No"
    ),
    siblings   = case_when(
      str_detect(siblings, "<3") ~ "< 3",
      str_detect(siblings, "≥3") ~ "≥ 3"
    ),
    haart_category = case_when(
      str_detect(toupper(haart), "DTG|TLD|TDL") ~ "DTG-based",
      str_detect(toupper(haart), "LPV|LPU")     ~ "LPV/r-based",
      str_detect(toupper(haart), "ATV")          ~ "ATV/r-based",
      !is.na(haart)                              ~ "Other"
    ),
    vl_suppressed       = if_else(viral_load < 1000, "Suppressed",
                                  "Unsuppressed", missing = NA_character_),
    log_vl              = log10(viral_load + 1),
    age_hiv_diag_yr     = round(age_hiv_diag_mnth / 12, 1),

    # --- AAP 2017 BP Classification ---
    # Elevated/HTN = SBP OR DBP at or above 90th centile for age/sex/height
    sbp_ge90 = sbp_interpret %in% c(">95th Centile", "90-95th Centile", "90th Centile"),
    dbp_ge90 = dbp_interpret %in% c("95th Centile", "90-95th Centile", "90th Centile"),
    bp_htn   = as.integer(sbp_ge90 | dbp_ge90),

    # Factors for modelling
    gender_f    = factor(gender,        levels = c("Female", "Male")),
    fh_hypert_f = factor(fh_hypert,     levels = c("No", "Yes")),
    haart_f     = factor(haart_category,
                         levels = c("DTG-based", "LPV/r-based", "ATV/r-based", "Other"))
  )

4.1 Variable Inventory

Code
var_dict <- tribble(
  ~Variable,              ~Type,                 ~Description,
  "age",                  "Numeric",             "Age in years (10–19)",
  "gender",               "Categorical",         "Male / Female",
  "religion",             "Categorical",         "Christianity / Islam",
  "fh_hypert",            "Categorical",         "Family history of hypertension (Yes/No)",
  "exercise",             "Categorical",         "Reports regular exercise (Yes/No)",
  "age_hiv_diag_yr",      "Numeric",             "Age at HIV diagnosis (years)",
  "viral_load",           "Numeric",             "HIV viral load (copies/mL)",
  "haart_duration_mnth",  "Numeric",             "Duration on current HAART (months)",
  "haart_category",       "Categorical",         "HAART class (DTG / LPV/r / ATV/r)",
  "weight_kg",            "Numeric",             "Body weight (kg)",
  "height_m",             "Numeric",             "Height (metres) — corrected from entry error",
  "bmi",                  "Numeric",             "BMI (kg/m²) — recalculated post-correction",
  "wc_cm",                "Numeric",             "Waist circumference (cm)",
  "sbp",                  "Numeric [primary outcome]",  "Systolic BP (mmHg)",
  "dbp",                  "Numeric",             "Diastolic BP (mmHg)",
  "bp_htn",               "Binary [secondary outcome]", "Elevated BP or HTN per AAP 2017 (1=Yes)",
  "total_chol",           "Numeric",             "Total cholesterol (mg/dL)",
  "ldl_clean",            "Numeric",             "LDL cholesterol (mg/dL); extreme outlier → NA",
  "hdl",                  "Numeric",             "HDL cholesterol (mg/dL)",
  "tg",                   "Numeric",             "Triglycerides (mg/dL)"
)

var_dict |>
  kbl(caption = "Table 1. Data dictionary: key variables used in analysis") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 13) |>
  column_spec(1, bold = TRUE) |>
  row_spec(which(grepl("outcome", var_dict$Type)), background = "#FFF9C4")
Table 1. Data dictionary: key variables used in analysis
Variable Type Description
age Numeric Age in years (10–19)
gender Categorical Male / Female
religion Categorical Christianity / Islam
fh_hypert Categorical Family history of hypertension (Yes/No)
exercise Categorical Reports regular exercise (Yes/No)
age_hiv_diag_yr Numeric Age at HIV diagnosis (years)
viral_load Numeric HIV viral load (copies/mL)
haart_duration_mnth Numeric Duration on current HAART (months)
haart_category Categorical HAART class (DTG / LPV/r / ATV/r)
weight_kg Numeric Body weight (kg)
height_m Numeric Height (metres) — corrected from entry error
bmi Numeric BMI (kg/m²) — recalculated post-correction
wc_cm Numeric Waist circumference (cm)
sbp Numeric [primary outcome] Systolic BP (mmHg)
dbp Numeric Diastolic BP (mmHg)
bp_htn Binary [secondary outcome] Elevated BP or HTN per AAP 2017 (1=Yes)
total_chol Numeric Total cholesterol (mg/dL)
ldl_clean Numeric LDL cholesterol (mg/dL); extreme outlier → NA
hdl Numeric HDL cholesterol (mg/dL)
tg Numeric Triglycerides (mg/dL)

5 Technique 1 — Exploratory Data Analysis

Theory recap (Ch. 4): EDA is the first and most critical step in any analysis. It combines numerical summaries with visual inspection to understand distributions, detect data quality issues (missing values, outliers, coding errors), and identify patterns worth testing formally. Anscombe’s Quartet illustrates why summary statistics alone are insufficient — identical means and variances can mask wildly different data structures. Always look at the data before computing.
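The Anscombe point can be seen directly in R, which ships the quartet as the built-in `anscombe` data frame. The sketch below is illustrative only and is separate from the study data; the object name `quartet` is an arbitrary choice.

```r
# Anscombe's Quartet: columns x1..x4 and y1..y4, 11 rows each (base R dataset)
quartet <- data.frame(
  set    = paste0("Set ", 1:4),
  mean_x = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  var_x  = sapply(1:4, function(i) var(anscombe[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  r_xy   = sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                       anscombe[[paste0("y", i)]]))
)

# The numerical summaries are nearly identical across the four sets
print(quartet, digits = 3)

# The scatter plots are not: one 2 x 2 panel per set
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Set", i), pch = 19)
}
par(op)
```

The four sets share essentially the same means, variances and correlation, yet plotting them shows a linear cloud, a curve, and two sets dominated by a single outlying point.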

Business (clinical) justification: Before drawing any inference about hypertension risk in ALHIV, I must understand who is in the dataset. Summary statistics reveal the demographic and clinical profile of the clinic’s adolescent panel, while outlier detection and missing-value analysis prevent erroneous conclusions that could lead to misguided clinical recommendations. Data quality checks are especially important in a clinic where records are maintained across paper and electronic systems.

5.1 Summary Statistics

Code
numeric_vars <- df |>
  select(age, age_hiv_diag_yr, haart_duration_mnth, weight_kg, height_m,
         bmi, wc_cm, sbp, dbp, total_chol, ldl_clean, hdl, tg, log_vl) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") |>
  group_by(Variable) |>
  summarise(
    N       = sum(!is.na(Value)),
    Missing = sum(is.na(Value)),
    Mean    = round(mean(Value, na.rm = TRUE), 2),
    SD      = round(sd(Value, na.rm = TRUE), 2),
    Median  = round(median(Value, na.rm = TRUE), 2),
    IQR     = round(IQR(Value, na.rm = TRUE), 2),
    Min     = round(min(Value, na.rm = TRUE), 2),
    Max     = round(max(Value, na.rm = TRUE), 2),
    .groups = "drop"
  ) |>
  arrange(Variable)

numeric_vars |>
  kbl(caption = "Table 2. Descriptive statistics for continuous variables (n = 134 ALHIV)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 13) |>
  scroll_box(width = "100%", height = "400px")
Table 2. Descriptive statistics for continuous variables (n = 134 ALHIV)
Variable N Missing Mean SD Median IQR Min Max
age 134 0 15.34 2.67 16.00 3.00 10.00 19.00
age_hiv_diag_yr 134 0 3.97 3.59 3.00 5.00 0.20 16.00
bmi 134 0 19.62 3.17 19.44 3.73 13.15 32.69
dbp 134 0 65.35 7.93 65.50 11.00 42.00 84.00
haart_duration_mnth 132 2 29.17 20.32 30.00 24.00 1.00 132.00
hdl 134 0 40.66 12.73 38.28 13.34 7.73 109.82
height_m 134 0 1.58 0.13 1.60 0.16 1.23 1.87
ldl_clean 133 1 78.74 25.89 77.34 35.58 28.62 175.56
log_vl 133 1 1.68 1.04 1.32 0.22 0.00 5.75
sbp 134 0 105.28 10.72 105.50 14.00 70.00 130.00
tg 134 0 91.41 52.90 74.40 59.12 15.94 254.20
total_chol 134 0 134.50 34.60 134.38 41.28 28.62 250.97
wc_cm 134 0 70.59 7.66 70.00 9.00 54.00 94.00
weight_kg 134 0 50.01 12.47 51.05 14.73 19.90 89.00

5.2 Missing Value Analysis

Code
missing_df <- df |>
  select(age, gender, religion, fh_hypert, exercise,
         age_hiv_diag_yr, cd4_count, viral_load, haart_duration_mnth,
         weight_kg, height_m, bmi, wc_cm, sbp, dbp,
         total_chol, ldl_clean, hdl, tg, log_vl, bp_htn) |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "n_missing") |>
  mutate(pct_missing = round(100 * n_missing / nrow(df), 1)) |>
  filter(n_missing > 0) |>
  arrange(desc(n_missing))

missing_df |>
  kbl(caption = "Table 3. Variables with missing values") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 3. Variables with missing values
Variable n_missing pct_missing
cd4_count 107 79.9
fh_hypert 12 9.0
exercise 6 4.5
religion 2 1.5
haart_duration_mnth 2 1.5
viral_load 1 0.7
ldl_clean 1 0.7
log_vl 1 0.7
Code
missing_df |>
  mutate(Variable = fct_reorder(Variable, pct_missing),
         flag = pct_missing > 50) |>
  ggplot(aes(x = Variable, y = pct_missing, fill = flag)) +
  geom_col() +
  geom_text(aes(label = paste0(pct_missing, "%")), hjust = -0.15, size = 3.5) +
  scale_fill_manual(values = c("FALSE" = "#64B5F6", "TRUE" = "#EF5350"), guide = "none") +
  coord_flip() +
  scale_y_continuous(limits = c(0, 92)) +
  labs(title = "Missing value profile by variable",
       subtitle = "Red = > 50% missing (CD4 count deprioritised post-2018 policy change)",
       x = NULL, y = "% Missing") +
  theme(plot.subtitle = element_text(size = 9.5, colour = "grey40"))

Figure 1. Missing value profile

Data quality issue 1 — Height entry error (n = 1): One participant had height recorded as 160 — clearly in centimetres rather than metres. This was corrected to 1.60 m and BMI was recalculated (original BMI for this participant was 9.6 kg/m², physiologically implausible; corrected BMI = 26.2 kg/m²). All 134 BMI values are therefore based on validated heights.

Data quality issue 2 — Extreme LDL outlier (n = 1): One participant had LDL recorded as 1,167.83 mg/dL (clinical upper limit in severe familial hypercholesterolaemia ≈ 400–500 mg/dL; the value likely represents a transcription error, possibly 116.78 mg/dL). This value was set to NA for all lipid analyses (ldl_clean). The participant is retained in all other analyses.

Data quality issue 3 — CD4 count (n missing = 107, 79.9%): CD4 monitoring was deprioritised in the Nigerian national ART programme after viral load became the primary treatment-monitoring tool post-2018. CD4 count is therefore excluded from the primary regression model.

5.3 Categorical Variable Distributions

Code
# Helper: frequency and percentage of one categorical variable among non-missing rows.
# The counted column is named `Category` inside count() so that the labels
# stay aligned when the pieces are stacked with bind_rows().
count_pct <- function(data, var, label) {
  data |>
    filter(!is.na({{ var }})) |>
    count(Category = as.character({{ var }})) |>
    mutate(Variable = label, pct = round(100 * n / sum(n), 1))
}

bind_rows(
  count_pct(df, gender,         "Gender"),
  count_pct(df, religion,       "Religion"),
  count_pct(df, fh_hypert,      "Family Hx HTN"),
  count_pct(df, exercise,       "Exercise"),
  count_pct(df, haart_category, "HAART Class"),
  count_pct(df, vl_suppressed,  "VL Status")
) |>
  select(Variable, Category, n, pct) |>
  kbl(caption = "Table 4. Distribution of key categorical variables") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 4. Distribution of key categorical variables
Variable Category n pct
Gender Female 64 47.8
Gender Male 70 52.2
Religion Christianity 110 83.3
Religion Islam 22 16.7
Family Hx HTN No 104 85.2
Family Hx HTN Yes 18 14.8
Exercise No 23 18.0
Exercise Yes 105 82.0
HAART Class ATV/r-based 8 6.0
HAART Class DTG-based 111 83.5
HAART Class LPV/r-based 10 7.5
HAART Class Other 4 3.0
VL Status Suppressed 117 88.0
VL Status Unsuppressed 16 12.0

Plain-language interpretation: The clinic panel is evenly split by sex (52.2% male). Among participants with non-missing data, Christianity predominates (83.3%) and 14.8% have a documented family history of hypertension, though this is likely under-reported given low parental awareness. Encouragingly, 88.0% of ALHIV with available viral load data are virally suppressed (< 1,000 copies/mL), which is consistent with strong HAART adherence. The large majority (83.5%) are on DTG-based regimens, consistent with current Nigerian National ART Guidelines.
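The percentages in Table 4 are computed among participants with non-missing data, so they differ slightly from percentages of the full sample of 134. The sketch below recomputes both from the table's counts. It assumes the Table 4 rows follow the alphabetical order produced by count() (so 110 is Christianity, 18 is a positive family history, 117 is suppressed and 111 is DTG-based); the object names are illustrative.

```r
# Counts copied from Table 4 (category labels assume count()'s alphabetical order)
counts      <- c(christianity = 110, fh_htn_yes = 18, vl_suppressed = 117, dtg_based = 111)
non_missing <- c(110 + 22, 104 + 18, 117 + 16, 8 + 111 + 10 + 4)  # per-variable denominators
n_total     <- 134

pct_tbl <- data.frame(
  item               = names(counts),
  n                  = unname(counts),
  pct_of_non_missing = round(100 * unname(counts) / non_missing, 1),
  pct_of_all         = round(100 * unname(counts) / n_total, 1)
)
print(pct_tbl)
```

Whichever denominator is chosen, it helps to state it alongside each percentage so that the text and the table can be reconciled.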


6 Technique 2 — Data Visualisation

Theory recap (Ch. 5): The grammar of graphics (Wilkinson, 2005; Wickham, 2016) builds every chart from data, aesthetic mappings, and geometric objects. Effective chart selection depends on the data type and the story to be told: histograms and density plots reveal shape; boxplots compare groups; scatter plots show relationships. Good data visualisation is not decorative — it communicates patterns that tables cannot.

Business (clinical) justification: In a busy paediatric HIV clinic, charts communicate panel characteristics to nurses, counsellors, and caregivers far more effectively than tables of numbers. The five plots below form a cohesive visual narrative about the cardiometabolic health of ALHIV at LUTH’s APIN Clinic: who are they (age, sex)? How nourished are they (BMI)? How well controlled is their HIV (viral load)? What are their blood pressure and lipid profiles?

Code
p1 <- df |>
  filter(!is.na(gender)) |>
  ggplot(aes(x = age, fill = gender)) +
  geom_histogram(binwidth = 1, colour = "white", alpha = 0.85, position = "identity") +
  scale_fill_manual(values = pal_gender) +
  scale_x_continuous(breaks = 10:19) +
  labs(title = "A. Age distribution by sex",
       x = "Age (years)", y = "Count", fill = NULL) +
  theme(legend.position = c(0.85, 0.83))

p2 <- df |>
  ggplot(aes(x = bmi, fill = gender)) +
  geom_density(alpha = 0.6) +
  geom_vline(xintercept = c(16, 18.5, 25, 30),
             linetype = "dashed", colour = "grey40", linewidth = 0.5) +
  annotate("text", x = c(14.8, 17.3, 21.8, 27.8), y = 0.118,
           label = c("Sev.\nUnderw.", "Underw.", "Normal", "Overw./\nObese"),
           size = 2.7, colour = "grey30") +
  scale_fill_manual(values = pal_gender) +
  labs(title = "B. BMI distribution (adult WHO reference zones)",
       x = "BMI (kg/m²)", y = "Density", fill = NULL) +
  theme(legend.position = c(0.88, 0.80))

p1 | p2

Figure 2. Age distribution and BMI profile
Code
df |>
  filter(!is.na(gender)) |>
  select(gender, SBP = sbp, DBP = dbp) |>
  pivot_longer(c(SBP, DBP), names_to = "Measure", values_to = "Value") |>
  ggplot(aes(x = Measure, y = Value, fill = gender)) +
  geom_violin(alpha = 0.7, trim = TRUE) +
  geom_boxplot(width = 0.12, position = position_dodge(0.9),
               outlier.shape = 21, outlier.size = 1.5, fill = "white") +
  geom_hline(yintercept = 90, linetype = "dashed",
             colour = "#FF5722", linewidth = 0.7) +
  geom_hline(yintercept = 120, linetype = "dashed",
             colour = "#FF9800", linewidth = 0.7) +
  annotate("text", x = 2.48, y = c(91.5, 121.5),
           label = c("DBP 90 mmHg", "SBP 120 mmHg"),
           hjust = 1, size = 3.1, colour = c("#FF5722","#FF9800")) +
  scale_fill_manual(values = pal_gender) +
  labs(title = "C. Blood pressure distribution by sex",
       subtitle = "Dashed lines show absolute thresholds (AAP 2017 for age ≥ 13)",
       x = NULL, y = "Blood Pressure (mmHg)", fill = NULL) +
  theme(legend.position = "top")

Figure 3. Blood pressure by sex
Code
df |>
  filter(!is.na(vl_suppressed)) |>
  ggplot(aes(x = log_vl, fill = vl_suppressed)) +
  geom_density(alpha = 0.75) +
  geom_vline(xintercept = log10(1000), linetype = "dashed",
             colour = "#D32F2F", linewidth = 0.9) +
  annotate("text", x = log10(1000) + 0.08, y = 0.52,
           label = "1,000 copies/mL\n(suppression threshold)",
           hjust = 0, size = 3, colour = "#D32F2F") +
  scale_fill_manual(values = c("Suppressed" = "#43A047", "Unsuppressed" = "#E53935")) +
  labs(title = "D. HIV viral load distribution (log₁₀ scale)",
       subtitle = "88.0% are virally suppressed, indicating strong treatment adherence",
       x = "log₁₀(Viral Load + 1)", y = "Density", fill = "VL Status") +
  theme(legend.position = c(0.78, 0.78))

Figure 4. Viral load suppression status
Code
df |>
  filter(!is.na(haart_category)) |>
  select(haart_category,
         "Total Chol." = total_chol,
         LDL           = ldl_clean,
         HDL           = hdl,
         Triglycerides = tg) |>
  pivot_longer(-haart_category, names_to = "Lipid", values_to = "mg_dL") |>
  filter(!is.na(mg_dL)) |>
  ggplot(aes(x = haart_category, y = mg_dL, fill = haart_category)) +
  geom_boxplot(outlier.shape = 21, alpha = 0.8) +
  facet_wrap(~ Lipid, scales = "free_y", ncol = 4) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "E. Lipid profiles by HAART class",
       x = NULL, y = "Concentration (mg/dL)", fill = "HAART class") +
  theme(axis.text.x  = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "bottom",
        strip.text = element_text(size = 9.5, face = "bold"))

Figure 5. Lipid profiles by HAART class

Plain-language interpretation of the visualisation narrative:

  • Panel A: The panel is concentrated in mid-to-late adolescence (14–17 years), consistent with a perinatally infected cohort now entering puberty. Sex distribution is balanced.
  • Panel B: The BMI distribution is right-skewed: a substantial proportion of ALHIV are underweight or in the low-normal range (median 19.4 kg/m²), reflecting the nutritional vulnerability of children with HIV in Nigeria, while a visible right tail (maximum 32.7 kg/m²) captures overweight/obese adolescents who represent an emerging, distinct risk group.
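  • Panel C: Most blood pressures fall within the normal range. Measured SBP reaches 130 mmHg, above the 120 mmHg reference line, whereas measured DBP peaks at 84 mmHg, below the 90 mmHg line (Table 2). The two lines are visual references only and are not equivalent cut-offs: under AAP 2017 for ages ≥ 13, 120 mmHg marks the start of the elevated SBP range, while 90 mmHg is the stage 2 DBP cut-off. The elevated BP/HTN classification used in this report is based on age-sex-height centiles, so a reading below either line, including every DBP value in this cohort, can still be classified at or above the 90th centile.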
  • Panel D: Strong viral suppression (88.0%) is consistent with good treatment adherence and programme quality. The small unsuppressed group carries a higher risk of immune deterioration and cardiometabolic complications.
  • Panel E: LPV/r-based regimens produce visibly higher triglycerides and total cholesterol compared with DTG-based regimens — a well-documented pharmacological effect of protease inhibitors. This has direct clinical implications for switching adolescents from LPV/r to DTG, which is now standard practice in Nigeria.

7 Technique 3 — Hypothesis Testing

Theory recap (Ch. 6): Hypothesis testing provides a formal framework for deciding whether observed differences reflect true population effects or sampling variation. The process: state H₀ and H₁ → check assumptions → compute test statistic and p-value → report effect size. Effect sizes (Cohen’s d, Cramer’s V) convey practical significance independently of sample size. A significant p-value with a trivial effect size has no clinical meaning; a non-significant result with a moderate effect size in a small sample warrants further investigation.

Business (clinical) justification: Two clinically relevant questions arise: (1) Do male and female ALHIV have different systolic blood pressures? Sex is a documented modifier of blood pressure in adolescents after puberty, and the answer determines whether sex-specific monitoring thresholds are warranted. (2) Is overweight or obesity status associated with elevated BP? The answer determines whether simple anthropometry can substitute for formal BP measurement in resource-limited triage.

7.1 Test 1 — Does systolic BP differ between male and female ALHIV?

H₀: Mean SBP is equal in male and female ALHIV: μ_male = μ_female

H₁: Mean SBP differs between male and female ALHIV: μ_male ≠ μ_female (two-sided, α = 0.05)

Code
normality_tbl <- df |>
  filter(!is.na(gender)) |>
  group_by(gender) |>
  summarise(
    n         = n(),
    Mean_SBP  = round(mean(sbp), 2),
    SD_SBP    = round(sd(sbp), 2),
    SW_stat   = round(shapiro.test(sbp)$statistic, 4),
    SW_p      = round(shapiro.test(sbp)$p.value, 4),
    .groups   = "drop"
  )

normality_tbl |>
  kbl(caption = "Table 5. SBP descriptive statistics and Shapiro-Wilk normality test by sex") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 5. SBP descriptive statistics and Shapiro-Wilk normality test by sex
gender n Mean_SBP SD_SBP SW_stat SW_p
Female 64 104.20 11.10 0.9883 0.8051
Male 70 106.27 10.33 0.9734 0.1416
Code
sw_male   <- shapiro.test(df$sbp[df$gender == "Male"   & !is.na(df$gender)])$p.value
sw_female <- shapiro.test(df$sbp[df$gender == "Female" & !is.na(df$gender)])$p.value

# Both groups normal (SW p > 0.05): use Welch's t-test
if (sw_male > 0.05 && sw_female > 0.05) {
  t_res     <- t.test(sbp ~ gender, data = df, var.equal = FALSE)
  test_name <- "Welch's Two-Sample t-test"
} else {
  t_res     <- wilcox.test(sbp ~ gender, data = df, exact = FALSE)
  test_name <- "Mann-Whitney U test"
}

# Cohen's d (explicit namespace to avoid rstatix mask)
d_val   <- effectsize::cohens_d(sbp ~ gender, data = filter(df, !is.na(gender)))
d_abs   <- abs(d_val$Cohens_d)
d_label <- case_when(d_abs < 0.2 ~ "negligible", d_abs < 0.5 ~ "small",
                     d_abs < 0.8 ~ "medium", TRUE ~ "large")

cat(sprintf("Test: %s\n", test_name))
Test: Welch's Two-Sample t-test
Code
cat(sprintf("Statistic = %.4f,  p-value = %.4f\n",
            t_res$statistic, t_res$p.value))
Statistic = -1.1134,  p-value = 0.2676
Code
cat(sprintf("Cohen's d = %.3f  (%s effect)\n", d_abs, d_label))
Cohen's d = 0.193  (negligible effect)
Code
df |>
  filter(!is.na(gender)) |>
  ggplot(aes(x = gender, y = sbp, fill = gender)) +
  geom_violin(alpha = 0.6, trim = TRUE) +
  geom_boxplot(width = 0.14, outlier.shape = 21, fill = "white") +
  geom_jitter(aes(colour = gender), width = 0.12, alpha = 0.5, size = 1.3) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 4,
               fill = "gold", colour = "black") +
  scale_fill_manual(values  = pal_gender) +
  scale_colour_manual(values = pal_gender) +
  labs(
    title    = "Systolic blood pressure by sex",
    subtitle = sprintf("%s: statistic = %.3f, p = %.3f  |  Cohen's d = %.2f (%s)",
                       test_name, t_res$statistic, t_res$p.value, d_abs, d_label),
    x = "Sex", y = "Systolic BP (mmHg)"
  ) +
  theme(legend.position = "none")

Figure 6. Systolic BP by sex with test result

Interpretation: The Shapiro-Wilk test does not reject normality of SBP in either sex (p > 0.05), so Welch’s t-test is appropriate. The result is not statistically significant (p = 0.268), meaning we cannot reject H₀: there is insufficient evidence that male and female ALHIV have different mean SBP in an unadjusted analysis. The observed difference is about 2.1 mmHg (106.27 vs 104.20 mmHg) and the effect size is negligible (Cohen’s d = 0.19). However, in the regression analysis (Section 9), after controlling for BMI and age, male sex does emerge as a significant predictor of higher SBP, suggesting that confounding (by BMI or growth stage) masks the sex effect in this unadjusted test. The clinical implication is that sex-specific BP targets are not warranted at this stage, but the sex effect should be reassessed in larger adjusted analyses.
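The test statistic and effect size reported above can be approximately reproduced by hand from the rounded summaries in Table 5. The sketch below is a base R illustration of the Welch formulas and of Cohen's d with a pooled SD; because the inputs are rounded to two decimals, the results agree with the reported values only to about two decimal places. Variable names are illustrative.

```r
# Summary statistics copied from Table 5 (rounded to 2 decimals)
n_f <- 64; mean_f <- 104.20; sd_f <- 11.10   # female
n_m <- 70; mean_m <- 106.27; sd_m <- 10.33   # male

# Welch's t: difference in means over the unpooled standard error
v_f    <- sd_f^2 / n_f
v_m    <- sd_m^2 / n_m
t_stat <- (mean_f - mean_m) / sqrt(v_f + v_m)

# Welch-Satterthwaite degrees of freedom and two-sided p-value
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))
p_value  <- 2 * pt(-abs(t_stat), df = df_welch)

# Cohen's d using the pooled standard deviation
sd_pooled <- sqrt(((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / (n_f + n_m - 2))
cohens_d  <- (mean_f - mean_m) / sd_pooled

cat(sprintf("t = %.3f, df = %.1f, p = %.3f, |d| = %.3f\n",
            t_stat, df_welch, p_value, abs(cohens_d)))
```

Working from summaries like this is a quick way to sanity-check software output, although the full-precision values from t.test() on the raw data remain the figures to report.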


7.2 Test 2 — Is overweight/obesity associated with elevated blood pressure or hypertension?

H₀: The proportion of elevated BP/HTN (≥ 90th centile, AAP 2017) is independent of overweight/obesity status.

H₁: Overweight/obese ALHIV have a higher prevalence of elevated BP/HTN than normal/underweight peers. (Fisher’s exact test, two-sided, α = 0.05; the χ² test is reported for reference only because expected cell counts fall below 5.)

Code
df_chi <- df |>
  filter(!is.na(bmi), !is.na(bp_htn)) |>
  mutate(bmi_group = if_else(bmi >= 25,
                             "Overweight/Obese\n(BMI ≥ 25)",
                             "Normal/Underweight\n(BMI < 25)"))

chi_tbl  <- table(df_chi$bmi_group, df_chi$bp_htn)
dimnames(chi_tbl)[[2]] <- c("Normal BP", "Elevated/HTN")

# Fisher's exact test preferred when expected cell counts < 5 (small event count)
fisher_res <- stats::fisher.test(chi_tbl)
v_val      <- effectsize::cramers_v(chi_tbl)
v_abs      <- abs(v_val$Cramers_v_adjusted)
v_label    <- case_when(v_abs < 0.1 ~ "negligible", v_abs < 0.3 ~ "small",
                        v_abs < 0.5 ~ "medium", TRUE ~ "large")
# Provide chi-sq for reference, noting approximation caveat
chi_res    <- stats::chisq.test(chi_tbl, correct = TRUE)

cat(sprintf("Fisher's exact test: OR = %.3f,  p = %.4f\n",
            fisher_res$estimate, fisher_res$p.value))
Fisher's exact test: OR = 3.774,  p = 0.1572
Code
cat(sprintf("Chi-sq (for reference, approximation may be imprecise): p = %.4f\n",
            chi_res$p.value))
Chi-sq (for reference, approximation may be imprecise): p = 0.3292
Code
cat(sprintf("Cramer's V (adjusted) = %.3f  (%s association)\n", v_abs, v_label))
Cramer's V (adjusted) = 0.109  (small association)
Code
chi_tbl |>
  addmargins() |>
  kbl(caption = "Table 6. Cross-tabulation: BMI group × BP classification") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 6. Cross-tabulation: BMI group × BP classification
Normal BP Elevated/HTN Sum
Normal/Underweight (BMI < 25) 115 12 127
Overweight/Obese (BMI ≥ 25) 5 2 7
Sum 120 14 134

Code
# Bar chart of proportions
as.data.frame(chi_tbl) |>
  rename(BMI_group = Var1, BP_status = Var2) |>
  group_by(BMI_group) |>
  mutate(pct = 100 * Freq / sum(Freq)) |>
  filter(BP_status == "Elevated/HTN") |>
  ggplot(aes(x = BMI_group, y = pct, fill = BMI_group)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = sprintf("%.1f%%\n(n=%d)", pct, Freq)),
            vjust = -0.4, size = 4) +
  scale_fill_brewer(palette = "Set1", guide = "none") +
  scale_y_continuous(limits = c(0, 60)) +
  labs(title = "Prevalence of elevated BP/HTN by BMI group",
       subtitle = sprintf("Fisher's exact p = %.3f  |  Cramer's V (adj.) = %.2f (%s effect)",
                          fisher_res$p.value, v_abs, v_label),
       x = NULL, y = "Prevalence of Elevated BP/HTN (%)") +
  theme(axis.text.x = element_text(size = 11))

Figure 7. Proportion of elevated BP/HTN by BMI group

Interpretation: Fisher’s exact test (preferred over chi-squared when expected cell counts are < 5, as is the case here) returns p = 0.157, which is not statistically significant at α = 0.05. The direction of the association is nonetheless clinically plausible: 2 of 7 overweight/obese ALHIV (28.6%) had elevated BP/HTN, compared with 12 of 127 normal/underweight peers (9.4%). Cramer’s V indicates a small association (V = 0.11). With only seven overweight/obese participants the test has little power, so the non-significant result should not be read as evidence of no association. The linear regression model (Technique 5), which uses BMI as a continuous predictor and SBP as a continuous outcome and therefore has greater power, shows a significant BMI–SBP relationship; the logistic model gives a borderline BMI effect (p = 0.056). Clinical implication: Even without formal statistical significance in this cross-tabulation, BMI surveillance remains warranted given its biological plausibility and the significant linear regression finding.
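The quantities behind this interpretation can be recomputed directly from the counts in Table 6. The sketch below uses base R only; the object names (`tbl`, `expected`, `sample_or`, `fisher_check`) are illustrative and not part of the original analysis code.

```r
# Counts copied from Table 6 (rows: BMI group; columns: BP classification)
tbl <- matrix(c(115, 12,
                  5,  2),
              nrow = 2, byrow = TRUE,
              dimnames = list(
                bmi_group = c("Normal/Underweight", "Overweight/Obese"),
                bp_status = c("Normal BP", "Elevated/HTN")))

# Prevalence of elevated BP/HTN within each BMI group
prevalence <- tbl[, "Elevated/HTN"] / rowSums(tbl)

# Expected counts under independence; a cell below 5 argues for Fisher's test
expected <- outer(rowSums(tbl), colSums(tbl)) / sum(tbl)

# Sample (cross-product) odds ratio. fisher.test() reports the conditional
# maximum-likelihood estimate instead, so the two values differ slightly.
sample_or <- (tbl[1, 1] * tbl[2, 2]) / (tbl[1, 2] * tbl[2, 1])

fisher_check <- fisher.test(tbl)

print(round(100 * prevalence, 1))
print(round(expected, 2))
cat(sprintf("Sample OR = %.2f, Fisher conditional OR = %.2f, p = %.4f\n",
            sample_or, fisher_check$estimate, fisher_check$p.value))
```

The smallest expected count (overweight/obese with elevated BP/HTN) is 7 × 14 / 134, or about 0.73, well below the conventional threshold of 5, which is why the chi-squared approximation is treated as reference-only here.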


8 Technique 4 — Correlation Analysis

Theory recap (Ch. 8): Pearson’s r measures linear association between normally distributed continuous variables; Spearman’s ρ is the rank-based non-parametric equivalent, more appropriate for skewed variables or those with outliers. Partial correlation controls for potential confounders. Correlation is not causation — a strong correlation between BMI and SBP does not establish that changes in BMI cause changes in SBP; unmeasured confounders (diet, physical activity, renal function) may drive both.
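Partial correlation is mentioned above but not computed in this report, so a minimal sketch may help fix the idea. It uses synthetic data with assumed parameters (not the study data) and base R only: the partial Pearson correlation of BMI and SBP given age is obtained by correlating the residuals left after regressing each variable on age, and is cross-checked against the closed-form expression.

```r
# Synthetic data: all parameters are assumptions for illustration only
set.seed(1)
n   <- 300
age <- runif(n, min = 10, max = 19)
bmi <- 14 + 0.5 * age + rnorm(n, sd = 2.5)            # BMI rises with age
sbp <- 80 + 1.0 * age + 0.8 * bmi + rnorm(n, sd = 8)  # SBP depends on both

# Zero-order correlations
r_pearson  <- cor(bmi, sbp, method = "pearson")
r_spearman <- cor(bmi, sbp, method = "spearman")

# Partial Pearson correlation of BMI and SBP given age:
# correlate the residuals after regressing each variable on age
res_bmi   <- resid(lm(bmi ~ age))
res_sbp   <- resid(lm(sbp ~ age))
r_partial <- cor(res_bmi, res_sbp)

# Closed-form check built from the three pairwise correlations
r_xz <- cor(bmi, age)
r_yz <- cor(sbp, age)
r_partial_formula <- (r_pearson - r_xz * r_yz) /
  sqrt((1 - r_xz^2) * (1 - r_yz^2))

cat(sprintf("Pearson r = %.3f, Spearman rho = %.3f, partial r (given age) = %.3f\n",
            r_pearson, r_spearman, r_partial))
```

Because age drives both variables in this toy example, the partial correlation is smaller than the zero-order correlation; the same logic motivates the adjusted regression models in Technique 5.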

Business (clinical) justification: In a multi-morbidity context like HIV management, knowing which clinical measurements correlate strongly with blood pressure guides resource allocation. If BMI and WC predict SBP, then the weighing scale and tape measure — universally available — can screen high-risk individuals before blood pressure equipment or lab results are available. Correlation analysis provides this short-list of actionable measurements.

Code
corr_vars <- df |>
  select(
    "Age (yrs)"           = age,
    "Age HIV dx\n(yrs)"   = age_hiv_diag_yr,
    "HAART\nduration"     = haart_duration_mnth,
    "Weight\n(kg)"        = weight_kg,
    "BMI"                 = bmi,
    "WC (cm)"             = wc_cm,
    "SBP"                 = sbp,
    "DBP"                 = dbp,
    "Total\nChol"         = total_chol,
    "LDL"                 = ldl_clean,
    "HDL"                 = hdl,
    "TG"                  = tg,
    "log10\n(VL)"         = log_vl
  ) |>
  drop_na()

cor_mat <- cor(corr_vars, method = "spearman", use = "pairwise.complete.obs")

ggcorrplot(
  cor_mat,
  method        = "square",
  type          = "lower",
  lab           = TRUE,
  lab_size      = 2.6,
  colors        = c("#2166AC", "white", "#D6604D"),
  outline.color = "white",
  ggtheme       = theme_bw(base_size = 10),
  title         = "Spearman correlation matrix — continuous clinical variables\n(n = 130 complete cases)",
  legend.title  = "ρ"
)

Figure 8. Spearman correlation heatmap — continuous clinical variables
Code
cor_long <- cor_mat |>
  as.data.frame() |>
  rownames_to_column("Var1") |>
  pivot_longer(-Var1, names_to = "Var2", values_to = "rho") |>
  filter(Var1 < Var2, abs(rho) > 0.30) |>
  arrange(desc(abs(rho))) |>
  mutate(rho = round(rho, 3),
         Strength = case_when(abs(rho) >= 0.7 ~ "Strong",
                              abs(rho) >= 0.4 ~ "Moderate",
                              TRUE ~ "Weak-moderate"))

cor_long |>
  rename(`Variable 1` = Var1, `Variable 2` = Var2, `Spearman ρ` = rho) |>
  kbl(caption = "Table 7. Variable pairs with |ρ| > 0.30 (Spearman)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 7. Variable pairs with |ρ| > 0.30 (Spearman)
Variable 1 Variable 2 Spearman ρ Strength
WC (cm) Weight (kg) 0.851 Strong
BMI Weight (kg) 0.812 Strong
LDL Total Chol 0.777 Strong
BMI WC (cm) 0.716 Strong
Age (yrs) Weight (kg) 0.628 Moderate
DBP SBP 0.602 Moderate
HDL Total Chol 0.533 Moderate
Age (yrs) WC (cm) 0.484 Moderate
SBP Weight (kg) 0.429 Moderate
Age (yrs) BMI 0.423 Moderate
BMI SBP 0.367 Weak-moderate
BMI DBP 0.334 Weak-moderate
Age (yrs) HAART duration 0.327 Weak-moderate
Age (yrs) SBP 0.322 Weak-moderate
SBP WC (cm) 0.313 Weak-moderate
HAART duration Weight (kg) 0.304 Weak-moderate

Interpretation of the three strongest clinically relevant correlations:

  1. WC ↔︎ Weight (ρ = 0.85) and BMI ↔︎ Weight (ρ = 0.81): Both anthropometric indices track body mass, as expected. For clinical triage, waist circumference may be preferable because it specifically captures visceral (abdominal) adiposity — the most metabolically dangerous fat depot — independent of height, which can be unreliable in growth-faltered ALHIV.

  2. BMI ↔︎ SBP (ρ = 0.37) and WC ↔︎ SBP (ρ = 0.31): A weak-to-moderate positive correlation between adiposity and systolic blood pressure (classed as "Weak-moderate" in Table 7) is consistent with the physiology of obesity-related hypertension: excess adipose tissue elevates angiotensin-II, sympathetic tone, and sodium retention. This is the primary clinical signal that motivates the regression analysis in the next section.

  3. HAART duration ↔︎ Weight (ρ = 0.30): Longer ART exposure is associated with greater body weight. This likely reflects both the expected nutritional recovery on ART and possible metabolic effects of long-term antiretroviral therapy. Combined with the lipid patterns seen in Panel E of the visualisation section, it underlines the cardiometabolic surveillance burden that accrues with increasing ART duration.

Causation caveat: All correlations here are observational. Randomised interventions (e.g., exercise programmes, dietary modification) would be required to confirm that reducing BMI in ALHIV causally lowers blood pressure.
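Table 7 reports point estimates only. As a rough guide to their precision, the sketch below computes approximate 95% confidence intervals for two of the Spearman coefficients using the Fisher z-transform. The helper name `spearman_ci` and the variance formula are assumptions: (1 + ρ²/2)/(n − 3) is one common large-sample approximation for rank correlations, and other choices give slightly different limits. The sample size n = 130 is taken from the Figure 8 title.

```r
# Approximate CI for a Spearman correlation via the Fisher z-transform.
# Assumption: variance (1 + rho^2 / 2) / (n - 3); other formulas exist.
spearman_ci <- function(rho, n, level = 0.95) {
  z      <- atanh(rho)
  se     <- sqrt((1 + rho^2 / 2) / (n - 3))
  z_crit <- qnorm(1 - (1 - level) / 2)
  c(lower = tanh(z - z_crit * se), upper = tanh(z + z_crit * se))
}

# Coefficients taken from Table 7; n from the Figure 8 title
ci_bmi_sbp <- spearman_ci(rho = 0.367, n = 130)
ci_wc_sbp  <- spearman_ci(rho = 0.313, n = 130)

print(round(ci_bmi_sbp, 2))
print(round(ci_wc_sbp, 2))
```

Under this approximation both intervals exclude zero but are wide, which fits the "weak-to-moderate" reading of the adiposity and SBP correlations.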


9 Technique 5 — Regression Analysis

Theory recap (Ch. 9 and Ch. 13): Ordinary least squares (OLS) linear regression estimates the independent association between a continuous outcome and multiple predictors simultaneously, controlling for confounders. Logistic regression (binomial GLM) models the log-odds of a binary outcome, producing odds ratios (ORs). Both require post-fit diagnostics to validate assumptions: residual normality and homoscedasticity for OLS; absence of complete separation and adequate event counts for logistic.

Business (clinical) justification: Regression identifies independent predictors of blood pressure in ALHIV, that is, those that still matter after accounting for confounders. This allows the clinic to risk-stratify patients based on easily measured characteristics (BMI, sex, age) rather than waiting for blood pressure readings. It also quantifies the magnitude of each predictor’s contribution, enabling priority-setting: if BMI shows a larger effect than HAART duration once both are compared over clinically comparable ranges (raw coefficients are expressed in different units and cannot be compared directly), nutritional interventions should take precedence over antiretroviral switching as the first-line strategy.

9.1 Model A — Linear Regression: Predictors of Systolic Blood Pressure

Linear regression on continuous SBP is the primary model, using the full available sample for maximum statistical power.

Code
df_model <- df |>
  filter(!is.na(gender_f), !is.na(haart_duration_mnth),
         !is.na(log_vl), !is.na(bmi), !is.na(wc_cm))

lm_fit <- lm(sbp ~ age + gender_f + bmi + wc_cm + haart_duration_mnth + log_vl + fh_hypert_f,
             data = df_model)

tidy(lm_fit, conf.int = TRUE) |>
  mutate(
    term = case_when(
      term == "(Intercept)"          ~ "Intercept",
      term == "gender_fMale"         ~ "Sex: Male (ref = Female)",
      term == "haart_duration_mnth"  ~ "HAART duration (months)",
      term == "log_vl"               ~ "log₁₀(Viral Load + 1)",
      TRUE                           ~ term
    ),
    across(where(is.numeric), ~ round(., 3)),
    sig = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      TRUE            ~ ""
    )
  ) |>
  rename(Predictor = term, `β` = estimate, SE = std.error,
         `t` = statistic, `p-value` = p.value,
         `95% CI (low)` = conf.low, `95% CI (high)` = conf.high,
         Sig = sig) |>
  kbl(caption = "Table 8. Linear regression: outcome = Systolic BP (mmHg). * p<0.05, ** p<0.01, *** p<0.001") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(which(tidy(lm_fit)$p.value < 0.05), bold = TRUE, color = "#1A237E")
Table 8. Linear regression: outcome = Systolic BP (mmHg). * p<0.05, ** p<0.01, *** p<0.001
Predictor β SE t p-value 95% CI (low) 95% CI (high) Sig
Intercept 64.334 8.431 7.631 0.000 47.627 81.041 ***
age 0.518 0.381 1.359 0.177 -0.238 1.274
Sex: Male (ref = Female) 4.420 1.743 2.536 0.013 0.967 7.874 *
bmi 1.450 0.389 3.730 0.000 0.679 2.220 ***
wc_cm -0.006 0.172 -0.034 0.973 -0.347 0.335
HAART duration (months) 0.072 0.049 1.473 0.143 -0.025 0.169
log₁₀(Viral Load + 1) -0.081 0.817 -0.099 0.921 -1.700 1.539
fh_hypert_fYes 0.774 2.385 0.324 0.746 -3.953 5.500
Code
glance(lm_fit) |>
  select(R2 = r.squared, `Adj R2` = adj.r.squared,
         RMSE = sigma, `F-stat` = statistic, `F p-value` = p.value, n = nobs) |>
  mutate(across(where(is.numeric), ~ round(., 4))) |>
  kbl(caption = "Table 9. Linear model fit statistics") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 9. Linear model fit statistics
R2 Adj R2 RMSE F-stat F p-value n
0.3021 0.2581 9.1815 6.8637 0 119
Code
vif(lm_fit) |>
  as.data.frame() |>
  rownames_to_column("Predictor") |>
  rename(VIF = 2) |>
  mutate(VIF = round(VIF, 2),
         Assessment = if_else(VIF > 5, "High multicollinearity", "Acceptable")) |>
  kbl(caption = "Table 10. Variance Inflation Factors — multicollinearity check") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 10. Variance Inflation Factors — multicollinearity check
Predictor VIF Assessment
age 1.47 Acceptable
gender_f 1.06 Acceptable
bmi 2.12 Acceptable
wc_cm 2.32 Acceptable
haart_duration_mnth 1.08 Acceptable
log_vl 1.07 Acceptable
fh_hypert_f 1.03 Acceptable
Code
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(lm_fit, which = 1:4, col = "#1565C080", pch = 19, cex = 0.7)

Figure 9. Linear regression diagnostic plots
Code
par(mfrow = c(1, 1))
Code
tidy(lm_fit, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(term = case_when(
    term == "gender_fMale"        ~ "Sex: Male",
    term == "haart_duration_mnth" ~ "HAART duration",
    term == "log_vl"              ~ "log₁₀(Viral Load)",
    TRUE                          ~ term
  )) |>
  ggplot(aes(x = estimate, y = reorder(term, estimate),
             colour = p.value < 0.05)) +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.25) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey50") +
  scale_colour_manual(values = c("FALSE" = "grey60", "TRUE" = "#C62828"),
                      labels = c("p ≥ 0.05", "p < 0.05"), name = NULL) +
  labs(title = "Regression coefficients (β) with 95% CI",
       subtitle = "Red = statistically significant (p < 0.05)",
       x = "Change in SBP (mmHg) per unit increase in predictor",
       y = NULL) +
  theme(legend.position = c(0.82, 0.15))

Figure 10. Coefficient plot — linear regression on SBP

Interpretation of significant coefficients (plain language for a non-technical manager):

  • BMI (β ≈ 1.45, p < 0.001): For every 1 kg/m² increase in BMI, systolic blood pressure rises by approximately 1.5 mmHg on average, holding all other variables constant. An ALHIV who is overweight (BMI = 26) compared to one who is normal-weight (BMI = 21) would be expected to have a SBP approximately 7 mmHg higher. Over years, this difference accumulates and increases the lifetime risk of target-organ damage.

  • Sex: Male (β ≈ 4.4, p = 0.013, with family history of HTN included in the model): After adjusting for age, BMI, waist circumference, HAART duration, viral load, and family history of hypertension, male ALHIV have on average 4.4 mmHg higher SBP than females. This sex effect is not apparent in the unadjusted t-test (Technique 3), demonstrating how regression can unmask associations that confounders suppress in univariable analyses. In a model excluding family history (because of its 9% missingness), the sex coefficient remains in the same direction but is attenuated (p = 0.07), which shows that the estimate is sensitive to model specification and should be interpreted with that limitation in mind.

  • Model fit: R² = 0.302, so the seven predictors together explain approximately 30.2% of the variance in SBP (adjusted R² = 0.258). This is reasonable for a physiological outcome with many unmeasured determinants (diet, sodium intake, renal function). All VIF values are < 5, indicating no problematic multicollinearity.
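The arithmetic behind the BMI and model-fit bullets can be reproduced from the published tables. The sketch below only rescales numbers already reported in Tables 8 and 9; the variable names are illustrative.

```r
# Coefficient and 95% CI for BMI copied from Table 8 (mmHg per kg/m^2)
beta_bmi <- 1.450
ci_bmi   <- c(low = 0.679, high = 2.220)

# Expected SBP difference between two otherwise identical patients
bmi_reference  <- 21  # normal-weight example from the text
bmi_comparison <- 26  # overweight example from the text
delta_bmi <- bmi_comparison - bmi_reference

sbp_diff    <- beta_bmi * delta_bmi
sbp_diff_ci <- ci_bmi * delta_bmi   # a linear contrast scales the CI by the same factor

cat(sprintf("Expected SBP difference: %.2f mmHg (95%% CI %.2f to %.2f)\n",
            sbp_diff, sbp_diff_ci["low"], sbp_diff_ci["high"]))

# Adjusted R-squared recomputed from Table 9 (n = 119, 7 predictors)
r2 <- 0.3021; n_obs <- 119; k <- 7
adj_r2 <- 1 - (1 - r2) * (n_obs - 1) / (n_obs - k - 1)
cat(sprintf("Adjusted R-squared: %.4f\n", adj_r2))
```

The 5 kg/m² contrast gives 7.25 mmHg, the "approximately 7 mmHg" quoted above, and the recomputed adjusted R² matches Table 9 only when k = 7, which confirms the predictor count.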

Diagnostic plots: The Residuals vs Fitted plot shows no strong non-linearity or heteroscedasticity. The Q-Q plot suggests approximately normal residuals. No observation exceeds a Cook’s distance of 1 (Cook’s distance plot, the fourth panel), indicating no highly influential observations.
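Visual diagnostics can be backed by numerical checks. The sketch below shows one way to do this with base R on a synthetic stand-in model (`toy_fit`, built from assumed parameters); in the report the same calls would be applied to `lm_fit`. The 4/n threshold is a common rule of thumb and is not a criterion used by the original analysis.

```r
# Synthetic stand-in for the fitted model; in the report this would be lm_fit
set.seed(7)
toy <- data.frame(bmi = rnorm(119, mean = 19, sd = 3))
toy$sbp <- 70 + 1.5 * toy$bmi + rnorm(119, sd = 9)
toy_fit <- lm(sbp ~ bmi, data = toy)

# Cook's distance per observation and two screening thresholds
cooks_d     <- cooks.distance(toy_fit)
over_one    <- which(cooks_d > 1)                  # conventional cut-off used in the text
over_4_by_n <- which(cooks_d > 4 / nobs(toy_fit))  # stricter rule of thumb (assumption)

# Formal normality test on the residuals to accompany the Q-Q plot
resid_normality <- shapiro.test(resid(toy_fit))

cat(sprintf("Max Cook's distance: %.3f\n", max(cooks_d)))
cat(sprintf("Observations with D > 1: %d; with D > 4/n: %d\n",
            length(over_one), length(over_4_by_n)))
cat(sprintf("Shapiro-Wilk on residuals: p = %.3f\n", resid_normality$p.value))
```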


9.2 Model B — Logistic Regression: Predictors of Elevated BP or Hypertension (AAP 2017)

The binary outcome is elevated BP or hypertension defined as SBP or DBP ≥ 90th age-sex-height centile (AAP 2017). With 14 events, the rule of 10 events per predictor limits the model to at most one or two predictors. Firth’s penalised logistic regression (logistf) is used to reduce small-sample bias and to avoid the infinite estimates produced by complete separation in standard logistic regression with small event counts.

Code
n_events <- sum(df$bp_htn, na.rm = TRUE)
n_total  <- sum(!is.na(df$bp_htn))
cat(sprintf("Prevalence of elevated BP/HTN (AAP 2017): %d / %d = %.1f%%\n",
            n_events, n_total, 100 * n_events / n_total))
Prevalence of elevated BP/HTN (AAP 2017): 14 / 134 = 10.4%
Code
# Criterion-specific counts; a participant can meet both criteria, so these overlap
cat(sprintf("SBP ≥ 90th centile: %d cases\n", sum(df$sbp_ge90, na.rm=TRUE)))
SBP ≥ 90th centile: 4 cases
Code
cat(sprintf("DBP ≥ 90th centile: %d cases\n", sum(df$dbp_ge90, na.rm=TRUE)))
DBP ≥ 90th centile: 11 cases
Code
# Firth penalised logistic regression (handles small sample, rare events, separation)
logit_firth <- logistf(bp_htn ~ bmi + haart_duration_mnth,
                       data = df_model, firth = TRUE)

# Extract results manually
logit_coefs <- data.frame(
  Predictor  = c("Intercept", "BMI (kg/m²)", "HAART duration (months)"),
  OR         = round(exp(coef(logit_firth)), 3),
  OR_low     = round(exp(logit_firth$ci.lower), 3),
  OR_high    = round(exp(logit_firth$ci.upper), 3),
  p_value    = round(logit_firth$prob, 4),
  row.names  = NULL  # drop the coefficient names so they are not printed as a duplicate column
)

logit_coefs |>
  rename(`95% CI (low)` = OR_low, `95% CI (high)` = OR_high,
         `p-value` = p_value) |>
  kbl(caption = "Table 11. Firth penalised logistic regression: outcome = Elevated BP or HTN (AAP 2017)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(which(logit_coefs$p_value < 0.05), bold = TRUE, color = "#B71C1C")
Table 11. Firth penalised logistic regression: outcome = Elevated BP or HTN (AAP 2017)
Predictor OR 95% CI (low) 95% CI (high) p-value
Intercept 0.022 0.001 0.504 0.0172
BMI (kg/m²) 1.166 0.996 1.369 0.0557
HAART duration (months) 0.948 0.901 0.991 0.0165
Code
pred_df <- data.frame(
  bmi                 = seq(min(df_model$bmi, na.rm=TRUE),
                            max(df_model$bmi, na.rm=TRUE), length.out = 100),
  haart_duration_mnth = median(df_model$haart_duration_mnth, na.rm=TRUE)
)
# Manual logistic: p = 1 / (1 + exp(-Xβ)) — avoids logistf predict() newdata issues
b <- coef(logit_firth)
pred_df$prob <- 1 / (1 + exp(-(b["(Intercept)"] +
                                b["bmi"] * pred_df$bmi +
                                b["haart_duration_mnth"] * pred_df$haart_duration_mnth)))

ggplot(pred_df, aes(x = bmi, y = prob)) +
  geom_ribbon(aes(ymin = 0, ymax = prob), fill = "#FFCDD2", alpha = 0.6) +
  geom_line(colour = "#C62828", linewidth = 1.2) +
  geom_rug(data = df_model, aes(x = bmi, colour = factor(bp_htn)),
           sides = "b", alpha = 0.8, length = unit(0.03, "npc"),
           inherit.aes = FALSE) +
  scale_colour_manual(values = c("0" = "#90CAF9","1" = "#EF5350"),
                      labels = c("Normal BP","Elevated/HTN"), name = "Observed") +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title    = "Predicted probability of elevated BP or HTN vs BMI",
    subtitle = sprintf("HAART duration held at median (%d months);  Firth logistic regression",
                       round(median(df_model$haart_duration_mnth, na.rm=TRUE))),
    x = "BMI (kg/m²)",
    y = "P(Elevated BP or HTN)"
  ) +
  theme(legend.position = c(0.14, 0.85))

Figure 11. Predicted probability of elevated BP/HTN by BMI (Firth logistic regression)

Interpretation: Using Firth penalised logistic regression:

  • BMI (OR = 1.166, 95% CI: 0.996–1.369, p = 0.056): the association is in the expected direction, with higher BMI associated with higher odds of elevated BP/HTN, but it is borderline and does not reach significance at α = 0.05 given the small event count. The complementary linear regression (Model A), which models SBP as a continuous outcome and therefore has greater statistical power (n = 119 observations), shows a highly significant BMI effect (p < 0.001). Taken together, the evidence points to BMI as the key modifiable correlate of blood pressure in this cohort.

  • HAART duration (OR = 0.948 per additional month, 95% CI: 0.901–0.991, p = 0.017): statistically significant and inversely associated with the outcome. Each additional month on HAART is associated with approximately 5.2% lower odds of elevated BP/HTN. This may reflect the sustained benefits of viral suppression on endothelial function, immune activation, and systemic inflammation, all of which influence blood pressure in PLHIV, although this two-predictor model does not adjust for age or other potential confounders. In plain language for a clinical manager: patients who have been on treatment the longest tend to have the healthiest blood pressures, which supports early ART initiation and retention in care.
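Odds ratios reported per month or per kg/m² are easier to communicate when rescaled to a clinically meaningful increment. The sketch below rescales the Table 11 estimates; it assumes the log-odds are linear in each predictor over the rescaled range, which the two-predictor model cannot verify, so the rescaled values are illustrative.

```r
# Odds ratios and 95% CIs copied from Table 11
or_haart_month <- c(est = 0.948, low = 0.901, high = 0.991)
or_bmi_unit    <- c(est = 1.166, low = 0.996, high = 1.369)

# Percentage change in odds per one-unit increase: (OR - 1) * 100
pct_change_month <- (or_haart_month - 1) * 100

# Rescale to a different increment by raising the OR to that power
# (equivalent to multiplying the log-odds coefficient by the increment)
or_haart_year <- or_haart_month^12  # per additional year on HAART
or_bmi_5units <- or_bmi_unit^5      # per 5 kg/m^2 of BMI

print(round(pct_change_month, 1))
print(round(or_haart_year, 2))
print(round(or_bmi_5units, 2))
```

On this scale the HAART point estimate corresponds to an odds ratio of about 0.53 per additional year, and the rescaled BMI interval still includes 1, consistent with its borderline p-value.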

Statistical note: With 14 events across 119 complete cases, this logistic model is constrained to 1–2 predictors. Firth’s method reduces the bias inherent in standard maximum likelihood estimation with sparse outcomes. Results are illustrative — a multi-centre study providing ≥ 140 events would enable a fully adjusted model with 10–12 predictors.
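The events-per-variable (EPV) reasoning in the statistical note can be turned into a simple planning calculation. The sketch below assumes that the observed prevalence (14 / 134) would hold at other sites, which is an assumption rather than a finding; `participants_needed` is an illustrative helper.

```r
# Events-per-variable planning sketch; EPV = 10 is the rule of thumb cited above
n_events <- 14
n_total  <- 134
epv      <- 10

# Largest number of predictors the current data support under a strict reading
max_predictors <- floor(n_events / epv)

# Participants needed to observe a target number of events,
# assuming the observed prevalence carries over to new sites
participants_needed <- function(target_events) {
  ceiling(target_events * n_total / n_events)
}

targets <- c(`10 predictors` = 10, `12 predictors` = 12, `14 predictors` = 14) * epv
print(max_predictors)
print(sapply(targets, participants_needed))
```

A strict reading of the rule supports a single predictor; the two-predictor model used here sits at 7 events per variable, which is part of the rationale for Firth’s penalisation. Reaching 140 events at the current prevalence would require roughly 1,340 participants.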


10 Integrated Findings

The five analytical techniques collectively paint a consistent and actionable picture of cardiometabolic risk in ALHIV at LUTH’s APIN Clinic:

Technique | Key Finding | Clinical Action
EDA | Population is young, mostly virally suppressed; two data quality issues corrected; CD4 systematically missing | Standardise data entry; restore routine CD4 monitoring
Visualisation | LPV/r produces higher TG; overweight subgroup visible but minority | Switch LPV/r → DTG; introduce BMI chart in clinic rooms
Hypothesis Testing | Sex alone does not differ in SBP (unadjusted); overweight ALHIV trend toward higher BP | Screen all ALHIV for weight regardless of sex
Correlation | Weight (ρ = 0.43), BMI (ρ = 0.37), and WC (ρ = 0.31) correlate positively with SBP; HAART duration correlates with weight | Use BMI + WC as triage proxies for BP risk
Regression | BMI (β = 1.45, p < 0.001) and male sex (β = 4.4, p = 0.013) independently predict higher SBP; longer HAART duration is associated with lower odds of elevated BP/HTN (OR = 0.95 per month, p = 0.017), while BMI shows a borderline positive association (OR = 1.17, p = 0.056) | Prioritise weight management; reward retention in care

Single integrated recommendation: The APIN Clinic should implement a cardiometabolic risk flag protocol: at every visit, if BMI ≥ 25 kg/m² or waist circumference exceeds the 75th centile for age and sex, trigger: (a) repeat blood pressure measurement at the next two consecutive visits; (b) fasting lipid panel every 6 months; (c) structured dietary and physical activity counselling referral; and (d) clinician review of whether a switch from LPV/r to DTG-based therapy is clinically appropriate to reduce lipid burden. The flag itself uses only measurements already taken in routine care, so identifying at-risk patients adds no cost to the existing workflow; the follow-up actions apply only to flagged ALHIV. The protocol draws on findings from all five analyses.
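The flag rule above is simple enough to express as a function. The sketch below is a hypothetical helper: the name `cardiometabolic_flag`, its arguments, and its handling of missing values are illustrative assumptions. The 75th-centile waist circumference (`wc_p75`) must be supplied from an external age- and sex-specific reference table, which this report does not include, and the example patients are invented.

```r
# Hypothetical helper: name, arguments, and NA handling are assumptions.
# wc_p75 = age- and sex-specific 75th-centile waist circumference (external table).
cardiometabolic_flag <- function(bmi, wc_cm, wc_p75, bmi_cutoff = 25) {
  bmi_high <- !is.na(bmi) & bmi >= bmi_cutoff
  wc_high  <- !is.na(wc_cm) & !is.na(wc_p75) & wc_cm > wc_p75
  bmi_high | wc_high
}

# Invented example visits, not study data
visits <- data.frame(
  id     = c("A", "B", "C", "D"),
  bmi    = c(26.1, 19.4, 20.2, NA),
  wc_cm  = c(78, 64, 81, 66),
  wc_p75 = c(75, 70, 76, 72)
)
visits$flag <- cardiometabolic_flag(visits$bmi, visits$wc_cm, visits$wc_p75)
print(visits)
```

Here a missing measurement is treated as "not flagged"; a clinic might reasonably prefer the opposite choice and route patients with missing anthropometry for re-measurement.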


11 Limitations & Further Work

  1. Small event count for logistic regression: With 14 elevated BP/HTN events, the logistic model is underpowered. A multi-centre study across 3–5 APIN sites in Lagos is needed to provide the ≥ 140 events required for a fully adjusted 10–12 predictor model.

  2. CD4 count (80% missing): Immunological CD4 status — a plausible mediator of the HIV–hypertension pathway — could not be included in regression models. Future prospective designs should collect CD4 at every visit.

  3. Cross-sectional design: Causality cannot be established from cross-sectional data. A longitudinal cohort following ALHIV from enrolment through adolescence into adulthood would allow analysis of how cumulative ART exposure, viral suppression trajectories, and growth patterns interact to determine adult blood pressure.

  4. Single blood pressure measurement: AAP 2017 requires elevated readings to be confirmed on repeat measurement at separate visits (three visits for a diagnosis of hypertension). Single-visit measurement therefore likely overestimates true hypertension prevalence, partly because of white-coat effects. Future studies should confirm BP classification at follow-up visits.

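  5. Paediatric BMI centiles: The adult WHO threshold (BMI ≥ 25 for overweight) was used because age- and sex-specific paediatric BMI-for-age reference tables were not included in this dataset. Because paediatric cut-offs for overweight fall below 25 kg/m² at younger ages, this threshold probably undercounts overweight among younger adolescents. Future analyses should apply CDC or WHO BMI-for-age centile charts for more precise classification.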

  6. Unmeasured confounders: Dietary sodium intake, physical activity quantification, renal function (eGFR, urine protein), and socioeconomic status variables were not captured. These are primary determinants of blood pressure in adolescents and should be integrated into future data collection instruments.

  7. With more data, time, and computing power: Mixed-effects models would account for clustering within families (perinatal transmission sibling cohorts); penalised regression (LASSO) would enable variable selection across 30+ candidate predictors; Mendelian randomisation using HIV pharmacogenomic data could establish causal pathways between ART-induced metabolic changes and blood pressure.


12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

American Academy of Pediatrics. (2017). Clinical practice guideline for screening and management of high blood pressure in children and adolescents. Pediatrics, 140(3), e20171904. https://doi.org/10.1542/peds.2017-1904

Berman, D. M., Arpadi, S. M., & Yin, M. T. (2023). Hypertension in adolescents living with HIV: Prevalence and risk factors. Journal of Acquired Immune Deficiency Syndromes, 92(1), 45–53.

Peduzzi, P., Concato, J., Kemper, E., Holford, T. R., & Feinstein, A. R. (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology, 49(12), 1373–1379. https://doi.org/10.1016/S0895-4356(96)00236-3

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.5.2). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Code
pkgs_cite <- c("readxl", "broom", "car", "corrplot", "effectsize",
               "kableExtra", "patchwork", "ggcorrplot", "logistf", "scales")
for (p in pkgs_cite) {
  tryCatch({
    ci    <- citation(p)
    entry <- format(ci, style = "text")[1]
    cat(sprintf("\n*%s package:* %s\n", p, entry))
  }, error = function(e) NULL)
}

readxl package: Wickham H, Bryan J (2025). readxl: Read Excel Files. doi:10.32614/CRAN.package.readxl https://doi.org/10.32614/CRAN.package.readxl, R package version 1.4.5, https://CRAN.R-project.org/package=readxl.

broom package: Robinson D, Hayes A, Couch S, Hvitfeldt E (2026). broom: Convert Statistical Objects into Tidy Tibbles. doi:10.32614/CRAN.package.broom https://doi.org/10.32614/CRAN.package.broom, R package version 1.0.12, https://CRAN.R-project.org/package=broom.

car package: Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://www.john-fox.ca/Companion/.

corrplot package: Wei T, Simko V (2024). R package ‘corrplot’: Visualization of a Correlation Matrix. (Version 0.95), https://github.com/taiyun/corrplot.

effectsize package: Ben-Shachar MS, Lüdecke D, Makowski D (2020). “effectsize: Estimation of Effect Size Indices and Standardized Parameters.” Journal of Open Source Software, 5(56), 2815. doi:10.21105/joss.02815 https://doi.org/10.21105/joss.02815, https://doi.org/10.21105/joss.02815.

kableExtra package: Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. doi:10.32614/CRAN.package.kableExtra https://doi.org/10.32614/CRAN.package.kableExtra, R package version 1.4.0, https://CRAN.R-project.org/package=kableExtra.

patchwork package: Pedersen T (2025). patchwork: The Composer of Plots. doi:10.32614/CRAN.package.patchwork https://doi.org/10.32614/CRAN.package.patchwork, R package version 1.3.2, https://CRAN.R-project.org/package=patchwork.

ggcorrplot package: Kassambara A (2023). ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’. doi:10.32614/CRAN.package.ggcorrplot https://doi.org/10.32614/CRAN.package.ggcorrplot, R package version 0.1.4.1, https://CRAN.R-project.org/package=ggcorrplot.

logistf package: Heinze G, Ploner M, Jiricka L, Steiner G (2025). logistf: Firth’s Bias-Reduced Logistic Regression. doi:10.32614/CRAN.package.logistf https://doi.org/10.32614/CRAN.package.logistf, R package version 1.26.1, https://CRAN.R-project.org/package=logistf.

scales package: Wickham H, Pedersen T, Seidel D (2025). scales: Scale Functions for Visualization. doi:10.32614/CRAN.package.scales https://doi.org/10.32614/CRAN.package.scales, R package version 1.4.0, https://CRAN.R-project.org/package=scales.

Yusuf, M. O. (2023). Cardiometabolic parameters and HIV-specific variables in adolescents attending APIN Clinic, LUTH [Dataset]. Collected from APIN Clinic, Department of Paediatrics, Lagos University Teaching Hospital, Lagos, Nigeria. Data available on request from the author.


13 Appendix: AI Usage Statement

Claude (Anthropic, claude-sonnet-4-6, accessed May 2026) was used as a coding assistant in the preparation of this Quarto document. AI assistance was used specifically for: (1) generating skeleton R code for data loading, reshaping, and tidyverse/ggplot2 visualisations; (2) suggesting appropriate regex patterns for cleaning inconsistently coded categorical variables (HAART regimen names, gender labels); (3) formatting kableExtra table styling and patchwork layout; (4) identifying the rstatix/effectsize namespace conflict and recommending explicit effectsize::cohens_d() and stats::chisq.test() calls; and (5) drafting initial interpretive text for each technique section.

All analytical decisions were made independently by the author, Dr Mariam Olufunmilayo Yusuf, drawing on her clinical expertise and understanding of paediatric HIV medicine. These include: selection of Case Study 1 (Exploratory & Inferential Analytics); choice of SBP as the continuous outcome and elevated BP/HTN per AAP 2017 as the binary outcome; the decision to use Firth penalised logistic regression given the small number of events and the risk of complete separation; identification of the two data quality issues (height in cm; extreme LDL outlier); the rationale for limiting the logistic model to two predictors; and the clinical interpretation of every result.

The AI did not collect the data, did not independently access the APIN Clinic context, and did not determine which findings are clinically actionable. The author takes full responsibility for the accuracy, interpretation, and recommendations in this document.