Tuberculosis (TB) and human immunodeficiency virus (HIV) are both infectious diseases. TB is caused by Mycobacterium tuberculosis and spreads through the air, while HIV is a viral infection transmitted through bodily fluids. Both contribute substantially to morbidity and mortality worldwide. According to the CDC, TB is the leading cause of death among people living with HIV, who are “16 times more likely to develop TB” in their lifetime. The burden of these two diseases is not evenly distributed; it varies widely across countries and regions because of differences in healthcare access, socioeconomic conditions, and public health systems. Understanding how TB mortality differs between HIV-positive and HIV-negative populations is essential for guiding resource allocation and prevention strategies. In this analysis, we use TB surveillance data to examine patterns in TB-related mortality, with a specific focus on the differences between HIV-positive and HIV-negative populations. Using ggplot2 and dplyr from the tidyverse, we explore trends across regions and over time, compare mortality across populations, and assess the overall contribution of HIV to TB-related deaths. We also apply statistical tests to determine whether the differences in mortality rates observed in the data are statistically significant. Overall, this project aims to provide insight into the relationship between TB and HIV at a global level, emphasizing disparities and reinforcing the importance of disease control strategies.
How does TB mortality differ between HIV-positive and HIV-negative populations?
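# Graph 1: Distribution of HIV-positive TB mortality
# Assumes who_tb_data has already been loaded from the TidyTuesday 2025-11-11 dataset
library(tidyverse)  # provides ggplot2, dplyr, and tidyr
ggplot(who_tb_data, aes(x = e_mort_tbhiv_100k)) +
  geom_histogram(bins = 20) +
  labs(
    title = "Distribution of HIV-Positive TB Mortality",
    x = "TB Mortality Rate per 100,000",
    y = "Count"
  ) +
  theme_minimal()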
## Warning: Removed 24 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Graph 2: Distribution of TB mortality excluding HIV
ggplot(who_tb_data, aes(x = e_mort_exc_tbhiv_100k)) +
  geom_histogram(bins = 20) +
  labs(
    title = "Distribution of TB Mortality (Excluding HIV)",
    x = "TB Mortality Rate per 100,000",
    y = "Count"
  ) +
  theme_minimal()
## Warning: Removed 24 rows containing non-finite outside the scale range
## (`stat_bin()`).
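Both histograms report that 24 rows were removed because the plotted variable was not finite. As a minimal sketch of how to check this before plotting, the example below counts non-finite values per column. The `toy` data frame and its values are made up for illustration and only stand in for `who_tb_data`.

```r
# Toy data standing in for who_tb_data (values are made up for illustration)
toy <- data.frame(
  e_mort_tbhiv_100k     = c(0.5, NA, 12.0, 3.1, NA),
  e_mort_exc_tbhiv_100k = c(4.2, NA, 30.5, NA, 8.8)
)

# Count non-finite values (NA, NaN, Inf) in each column; rows with such
# values are the ones geom_histogram() drops with a warning
n_dropped <- sapply(toy, function(x) sum(!is.finite(x)))
print(n_dropped)
```

Running the same `sapply()` call on the two mortality columns of the real data should show where the removed rows come from.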
# Graph 3: Mortality rate over time
global_rate <- who_tb_data %>%
  group_by(year) %>%
  summarize(
    hiv_positive_rate = mean(e_mort_tbhiv_100k, na.rm = TRUE),
    hiv_negative_rate = mean(e_mort_exc_tbhiv_100k, na.rm = TRUE)
  )
ggplot(global_rate, aes(x = year)) +
  geom_line(aes(y = hiv_positive_rate, color = "HIV-positive")) +
  geom_line(aes(y = hiv_negative_rate, color = "HIV-negative / excluding HIV")) +
  labs(
    title = "Average TB Mortality Rate by HIV Status",
    x = "Year",
    y = "TB Mortality Rate per 100,000",
    color = "Population"
  ) +
  theme_minimal()
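Graph 3 averages the country-level rates with equal weight for every country. As a rough sketch of how that differs from a population-weighted average, the example below uses a made-up two-country table. The `pop` column is hypothetical and is not confirmed to exist in `who_tb_data`.

```r
# Made-up example: one small country with a high rate, one large country
# with a low rate (the pop column is a hypothetical population count)
toy <- data.frame(
  country = c("A", "B"),
  rate    = c(10, 2),      # deaths per 100,000
  pop     = c(1e6, 9e6)
)

# Unweighted mean: every country counts equally (what mean() does in Graph 3)
unweighted <- mean(toy$rate)

# Population-weighted mean: larger countries count more
weighted <- weighted.mean(toy$rate, w = toy$pop)

c(unweighted = unweighted, weighted = weighted)
```

The two averages answer different questions: the unweighted mean describes the typical country, while the weighted mean describes the typical person.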
# Graph 4: TB mortality by HIV status and region
latest_year <- max(who_tb_data$year, na.rm = TRUE)
region_tb <- who_tb_data %>%
  filter(year == latest_year) %>%
  group_by(g_whoregion) %>%
  summarize(
    hiv_positive_tb_deaths = sum(e_mort_tbhiv_num, na.rm = TRUE),
    hiv_negative_tb_deaths = sum(e_mort_exc_tbhiv_num, na.rm = TRUE)
  )
region_tb_long <- region_tb %>%
  pivot_longer(
    cols = c(hiv_positive_tb_deaths, hiv_negative_tb_deaths),
    names_to = "hiv_status",
    values_to = "deaths"
  )
ggplot(region_tb_long, aes(x = g_whoregion, y = deaths, fill = hiv_status)) +
  geom_col(position = "dodge") +
  labs(
    title = "TB Mortality by HIV Status and WHO Region",
    x = "WHO Region",
    y = "Estimated TB Deaths",
    fill = "Population"
  ) +
  theme_minimal()
# Graph 5: HIV-positive TB deaths as a share of total TB deaths over time
# Yearly totals and the HIV-positive proportion
tb_prop <- who_tb_data %>%
  group_by(year) %>%
  summarize(
    hiv_positive_deaths = sum(e_mort_tbhiv_num, na.rm = TRUE),
    hiv_negative_deaths = sum(e_mort_exc_tbhiv_num, na.rm = TRUE)
  ) %>%
  mutate(
    total_tb_deaths = hiv_positive_deaths + hiv_negative_deaths,
    proportion_hiv_positive = hiv_positive_deaths / total_tb_deaths
  )
# Plot the HIV-positive proportion
ggplot(tb_prop, aes(x = year, y = proportion_hiv_positive)) +
  geom_line(linewidth = 1.2, color = "steelblue") +
  geom_point() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Proportion of TB Deaths in HIV-Positive Populations Over Time",
    x = "Year",
    y = "Percent of Total TB Deaths (HIV+)"
  ) +
  theme_minimal()
# Graph 6: Linear regression of the proportion of TB deaths that are HIV-positive over time
# Recompute the yearly totals (same summary as Graph 5, with grouping dropped explicitly)
tb_prop <- who_tb_data %>%
  group_by(year) %>%
  summarize(
    hiv_positive_deaths = sum(e_mort_tbhiv_num, na.rm = TRUE),
    hiv_negative_deaths = sum(e_mort_exc_tbhiv_num, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    total_tb_deaths = hiv_positive_deaths + hiv_negative_deaths,
    proportion_hiv_positive = hiv_positive_deaths / total_tb_deaths
  )
# Fit the model and print its summary
tb_prop_model <- lm(proportion_hiv_positive ~ year, data = tb_prop)
summary(tb_prop_model)
##
## Call:
## lm(formula = proportion_hiv_positive ~ year, data = tb_prop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.063200 -0.023716 -0.001636 0.024996 0.046798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.4683717 1.7264628 9.539 2.83e-09 ***
## year -0.0080644 0.0008583 -9.396 3.70e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02911 on 22 degrees of freedom
## Multiple R-squared: 0.8005, Adjusted R-squared: 0.7914
## F-statistic: 88.28 on 1 and 22 DF, p-value: 3.705e-09
# Regression graph
ggplot(tb_prop, aes(x = year, y = proportion_hiv_positive)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Linear Regression of HIV-Positive TB Death Proportion Over Time",
    x = "Year",
    y = "Proportion of Total TB Deaths that are HIV-Positive"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
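To make the regression coefficients easier to read, the following sketch plugs the intercept and slope from the `summary()` output into the fitted line. The helper name `fitted_prop` is introduced here only for illustration. The coefficients are the rounded values printed above, so the results are approximate.

```r
# Coefficients copied from the summary() output of tb_prop_model
intercept <- 16.4683717
slope     <- -0.0080644   # change in proportion per calendar year

# Fitted proportion of TB deaths that are HIV-positive in a given year
# (fitted_prop is a hypothetical helper, not part of the original analysis)
fitted_prop <- function(year) intercept + slope * year

# Fitted values at the start and near the end of the series
round(fitted_prop(c(2000, 2020)), 3)

# Slope expressed in percentage points per decade
slope_pp_per_decade <- slope * 10 * 100
slope_pp_per_decade
```

Under this fit the HIV-positive share falls by roughly 8 percentage points per decade, from about 34% in 2000 to about 18% in 2020. These are fitted values from the line, not the observed yearly proportions.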
# Get the most recent year and keep countries with both estimates available
latest_year <- max(who_tb_data$year, na.rm = TRUE)
tb_latest <- who_tb_data %>%
  filter(year == latest_year) %>%
  select(e_mort_tbhiv_100k, e_mort_exc_tbhiv_100k) %>%
  drop_na()
# Paired t-test: compares HIV-positive TB mortality with TB mortality excluding HIV,
# pairing the two rates within each row
# Null hypothesis: there is no difference in TB mortality between HIV-positive and HIV-negative populations.
# Alternative hypothesis: there is a difference in TB mortality between the two groups.
t.test(
  tb_latest$e_mort_tbhiv_100k,
  tb_latest$e_mort_exc_tbhiv_100k,
  paired = TRUE
)
##
## Paired t-test
##
## data: tb_latest$e_mort_tbhiv_100k and tb_latest$e_mort_exc_tbhiv_100k
## t = -5.8482, df = 213, p-value = 1.852e-08
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -8.291815 -4.111270
## sample estimates:
## mean difference
## -6.201542
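A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below illustrates this with simulated data. The vectors `hiv_pos` and `hiv_neg` are randomly generated stand-ins, not the WHO estimates.

```r
# Simulated paired rates (made-up numbers, for illustration only)
set.seed(1)
hiv_pos <- rexp(30, rate = 1 / 3)
hiv_neg <- hiv_pos + rnorm(30, mean = 6, sd = 4)

# Paired t-test on the two vectors
paired <- t.test(hiv_pos, hiv_neg, paired = TRUE)

# One-sample t-test on the pairwise differences against a mean of 0
one_sample <- t.test(hiv_pos - hiv_neg, mu = 0)

# Both tests give the same t statistic and p-value
all.equal(unname(paired$statistic), unname(one_sample$statistic))
all.equal(paired$p.value, one_sample$p.value)
```

This is why the output reports a single "mean difference" rather than two group means: the test only looks at the difference within each pair.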
Graphs 1 and 5 show low TB mortality in the HIV-positive population. Graph 2 shows the distribution of TB mortality excluding HIV, per 100,000, for comparison. Graph 3 plots both groups together: the HIV-positive rate starts lower than the HIV-negative rate and declines faster over time. Graph 4 shows that estimated TB deaths are higher among people without HIV than among people with HIV in every WHO region in the data: Africa, the Americas, Eastern Mediterranean, Europe, South-East Asia, and Western Pacific. The highest number of TB deaths among people without HIV is in South-East Asia. Graph 6 shows a negative trend: the proportion of total TB deaths that are HIV-positive decreased from 30% in 2000 to 15% by 2020, and the fitted slope of about -0.008 per year (R-squared = 0.80, p < 0.001) corresponds to a drop of roughly 0.8 percentage points per year.
To test the difference formally, we ran a paired t-test, which indicated a statistically significant difference between tb_latest$e_mort_tbhiv_100k and tb_latest$e_mort_exc_tbhiv_100k. Since the p-value (1.852e-08) is well below the standard threshold of 0.05, we reject the null hypothesis. The estimated mean difference is -6.20, which suggests that the first variable, e_mort_tbhiv_100k, is on average 6.2 deaths per 100,000 lower than the second, e_mort_exc_tbhiv_100k. This finding is further supported by the 95% confidence interval, which ranges from -8.29 to -4.11 and does not include 0.
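As a worked check of the confidence interval, the sketch below rebuilds it from the mean difference, t statistic, and degrees of freedom printed in the paired t-test output. The inputs are the rounded values from that output, so the result matches only to about two decimal places.

```r
# Values copied from the paired t-test output
mean_diff <- -6.201542
t_stat    <- -5.8482
df        <- 213

# Standard error of the mean difference, recovered from t = mean_diff / se
se <- mean_diff / t_stat

# 95% confidence interval: mean_diff +/- critical t value * se
ci <- mean_diff + c(-1, 1) * qt(0.975, df) * se
round(ci, 2)
```

Both bounds are negative, which is what it means for the interval to exclude 0.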
Based on the statistical and graphical analysis, there is significant evidence that TB mortality per 100,000 is lower in the HIV-positive population than in the HIV-negative population. One possible explanation is that people who are HIV-positive are tested more frequently than people who are HIV-negative, so TB infection is caught earlier and treated sooner. Because both rates appear to be expressed per 100,000 total population rather than per 100,000 people within each HIV group, the difference may also reflect the smaller size of the HIV-positive population.
rfordatascience. “readme.md.” rfordatascience/tidytuesday, 11 Nov. 2025, https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-11-11/readme.md. Accessed 29 Apr. 2026.
Centers for Disease Control and Prevention. “Combating global TB.” Centers for Disease Control and Prevention, https://www.cdc.gov/global-hiv-tb/php/our-approach/combatingglobaltb.html#:~:text=To%20achieve%20these%20commitments%2C%20CDC,among%20PLHIV%20by%2080%20percent. Accessed 29 Apr. 2026.