library(dplyr)
library(tidyr)
library(ggplot2)

msm.pop.totals_2019 <- WApopdata::msm.pop.totals_2019
PLWH <- msm.pop.totals_2019$MSM_PLWH_WA

Introduction

We need HIV prevalence targets to calibrate and validate our models. It turns out there is a lot of uncertainty in these target values.

HIV prevalence, as a rate, is defined by

\[ Prevalence = \frac{HIV+}{Population} \] To assess the validity of our model, we need this broken down by age group, race and region.

Uncertainty in the numerator

The numerator for us comes from the WA DOH HIV surveillance system, and represents persons who have been diagnosed with HIV, are living in the state, and whose risk category is identified as MSM. There are two aspects of these estimates worth noting:

  • Risk category is not known with certainty. We represent this here by having a lower bound estimate that includes the MSM and MSM/IDU groups, and an upper bound estimate that adds the CDC imputed MSM risk assignment from the NIR/NRR group.

  • These estimates do not include the undiagnosed cases.

Steve Erly provided the aggregate totals by age, race and region.

Uncertainty in the denominator

For this we need an estimate of the number of MSM living in WA state, broken down by age, race and region. We use two data sources for this estimate:

  • County-level estimates of MSM prevalence from the Emory CAMP team, published by Grey, et al., JMIR 2016. These estimates represent the fraction of men 18 and older in each county who have reported sex with a man in the past 5 years.

  • US Census bureau estimates of the number of men in each county in 2019, by age and race. These estimates come from: Annual County Resident Population Estimates by Age, Sex, Race, and Hispanic Origin: April 1, 2010 to July 1, 2019 CC-EST2019-ALLDATA

Our estimates of the number of MSM (the “Grey” estimates) are then

\[ MSM_{(county, age.group, race)} = Males_{(county, age.group, race)} * MSM.percent_{(county)} \] This assumes the age and race profile for MSM is the same as that for all men within each county.

The counties are then aggregated up into 3 regions: King Co, Western Washington and Eastern Washington.

However, the Emory CAMP team have also recently published two papers with State level estimates of MSM by age and race:

  • Rosenberg et al. 2018 by race

    The Rosenberg race denominator estimates are quite close to our Grey estimates, once you adjust for the fact that Rosenberg’s estimates are for “Black alone” (BA), while our estimates are for “Black alone or in combination” (BAC). Analysis of the WA census data for 2014-2019 shows a consistent fraction of 23% of the BAC group is in the “Combination” category. So we inflate Rosenberg’s BA estimates by 1/(1-0.23) (about 37%) to account for this.

  • Jones et al. 2018 by age

    The Jones age denominator estimates are not consistent with our Grey estimates. The comparison is complicated by the fact that Jones et al. have a slightly different grouping: the bottom age group is 18-24, rather than 15-24, and they consolidate the top age groups into 55+. To address this, I adjusted their youngest group by 10/7, re-estimated the percentages in each group with the new total, and redistributed the count in 55+ proportional to the Grey estimates of the percent in each group (55-64, 65-74, 75-84 and 85+). But the differences between the two estimates are still quite large.

Numerator: PLWH

These are the data from the WA DOH, that we use as numerators to construct prevalence. The upper and lower bounds are plotted.

# Raw PLWH data for WA 2019
ggplot(PLWH, aes(x=as.numeric(factor(age.grp10)))) +
  geom_ribbon(aes(ymin = LB, ymax = UB), color="blue", alpha=0.5) +
  facet_grid(rows = vars(race), cols=vars(region),
             scales = "free_y") +
  scale_x_continuous(breaks = 1:8,
                     labels = unique(PLWH$age.grp10)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Denominator: MSM population

Translating the new CAMP estimates into 2019 numbers is, in principle, straightforward. They report the number of MSM broken down by attribute, so we can take the percent in each group, and multiply this by our 2019 Grey estimate total.

These plots compare our original Grey estimates to these alternate estimates.

By age

As noted above, these estimates are quite discrepant.

ggplot(msm.pop.totals_2019$pop.age.all, 
       aes(x = as.numeric(factor(age.grp10)))) +
  geom_point(aes(y = num, color = "Grey")) +
  geom_line(aes(y = num, color = "Grey")) +
  geom_point(aes(y = num.alt, color = "Jones")) +
  geom_line(aes(y = num.alt, color = "Jones")) +
  scale_x_continuous(breaks = 1:8,
                     labels = msm.pop.totals_2019$pop.age.pos$age.grp10) +
  labs(title = "Number of MSM in WA by age",
       x = "Age",
       y = "Number",
       color = "Estimate")

By Race

And as noted above, these estimates are quite close.

dfbar <- msm.pop.totals_2019$pop.race.all %>%
  select(race, Grey = num, Rosenberg = num.alt) %>%
  pivot_longer(-race,
               names_to = "source",
               values_to = "num")

ggplot(dfbar, aes(x=race, y=num, fill=source)) +
  geom_col(position = "dodge2", alpha = 0.5) +
  labs(title = "Number of MSM in WA by race",
       x = "Race",
       y = "Number",
       fill = "Estimate")

HIV prevalence estimates

Note that we represent both forms of uncertainty here. The uncertainty in the numerator (number of PLWH) is shown as the upper and lower bounds; the uncertainty in the denominator (number of MSM) is shown as the comparison between estimates.

By Age

colors <- c("Grey" = "gray", "Jones" = "blue", "Rosenberg" = "green")

ggplot(msm.pop.totals_2019$pop.age.pos, 
       aes(x = as.numeric(factor(age.grp10)))) +
  geom_ribbon(aes(ymin = prev_lb, ymax = prev_ub, 
                  fill = "Grey"), alpha = 0.5) +
  geom_ribbon(aes(ymin = prev_lb_alt, ymax = prev_ub_alt, 
                  fill = "Jones"), alpha = 0.5) +
  scale_x_continuous(breaks = 1:8,
                     labels = msm.pop.totals_2019$pop.age.pos$age.grp10) +
  labs(title = "HIV prevalence estimates by age",
       x = "Age",
       y = "Prevalence",
       caption = "2019: WADOH PLWH numerator, alternate denominators",
       fill = "Estimate")

By Race

upper <- msm.pop.totals_2019$pop.race.pos %>%
  select(race, Grey = prev_ub, Rosenberg = prev_ub_alt) %>%
  pivot_longer(-race,
               names_to = "source",
               values_to = "upper")

lower <- msm.pop.totals_2019$pop.race.pos %>%
  select(race, Grey = prev_lb, Rosenberg = prev_lb_alt) %>%
  pivot_longer(-race,
               names_to = "source",
               values_to = "lower")

dfbar <- msm.pop.totals_2019$pop.race.pos %>%
  rowwise() %>%
  mutate(prev_mid = mean(c(prev_lb, prev_ub)),
         prev_mid_alt = mean(c(prev_lb_alt, prev_ub_alt))) %>%
  select(race, Grey = prev_mid, Rosenberg = prev_mid_alt) %>%
  pivot_longer(-race,
               names_to = "source",
               values_to = "prev") %>%
  left_join(lower) %>%
  left_join(upper)


ggplot(dfbar, aes(x=race, y=prev, fill=source)) +
  geom_col(position = "dodge2", alpha = 0.5) +
  geom_errorbar(
    aes(ymin = lower, ymax = upper),
    position = position_dodge2(width = 0.5, padding = 0.5)
    ) +
  labs(title = "Grey vs. Rosenberg estimates of HIV prevalence by race",
       x = "Race",
       y = "Prevalence",
       caption = "2019: WADOH PLWH numerator, alternate denominators",
       fill = "Estimate")