Part 1 - Introduction

Tuberculosis (TB) is an infectious airborne disease caused by a bacterium, Mycobacterium tuberculosis, that was first identified in the late 19th century and still affects the world’s population today. It causes prolonged cough, chest pain, fatigue, and weight loss, among other symptoms. TB can exist in an inactive form in which individuals are asymptomatic and not contagious, but in its active, contagious form its effects are worse among children and immunocompromised persons, especially people who have HIV. HIV stands for human immunodeficiency virus. It is a virus that infects only humans, weakens the immune system, and causes AIDS. Because HIV damages the immune system and makes it easier to get sick, TB is twelve times more likely to affect a person with HIV than a person without it. The tuberculosis dataset from the TidyTuesday site contains statistical data about the disease, including incidence, mortality, and population for each country across years. We decided to explore which region has made the most progress in reducing tuberculosis mortality in HIV patients. The independent variable is the WHO region, and the dependent variable is the estimated number of TB deaths among HIV-positive people (e_mort_tbhiv_num), averaged across the countries in each region for each year. These data are important biologically because they allow researchers to understand the scale of the disease worldwide and, from there, to direct further research or solutions that reduce its impact.

Part 2 - Main Research Question

Which region has the most progress in reducing tuberculosis mortality in HIV patients?

Part 3 - Exploring the Data (Descriptive Statistics)

To better understand the data, we calculated statistics for each region. Measures like the mean and median help show the levels of TB mortality, while the standard deviation and standard error show how much the data varies and how reliable the averages are. These statistics help support and explain the patterns we observed in the graphs.

## Mean, Median, Standard Deviation, and Standard Error

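library(dplyr)  # provides summarize(), group_by(), and filter()
# Average the estimated TB-HIV deaths across countries for each region and year.
# who_tb_data is the TidyTuesday WHO TB dataset, assumed to be loaded already.
tb_region <- summarize(
  group_by(who_tb_data, g_whoregion, year),
  mortality = mean(e_mort_tbhiv_num, na.rm = TRUE)
) 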
## `summarise()` has grouped output by 'g_whoregion'. You can override using the
## `.groups` argument.
tb_stats <- summarize(
  group_by(tb_region, g_whoregion),
  mean = mean(mortality),
  median = median(mortality),
  sd = sd(mortality),
  n = length(mortality),
  se = sd / sqrt(n)
)
tb_stats
## # A tibble: 6 × 6
##   g_whoregion              mean  median     sd     n      se
##   <fct>                   <dbl>   <dbl>  <dbl> <int>   <dbl>
## 1 Africa                 7675.   7909.  3363.     24  686.  
## 2 Americas                165.    153.    33.9    24    6.93
## 3 Eastern Mediterranean   117.    108.    38.5    24    7.85
## 4 Europe                   91.3    84.1   30.1    24    6.14
## 5 South-East Asia       12342.  13252.  8257.     24 1685.  
## 6 Western Pacific         427.    449.    81.2    24   16.6

The statistics show clear differences in TB mortality among HIV patients across regions. South-East Asia had the highest average mortality (mean of about 12,342) and the most variation (standard deviation of about 8,257), meaning its numbers were both high and changed a lot over time. Africa also had high mortality (mean of about 7,675), but with less variation than South-East Asia. Europe, the Americas, and the Eastern Mediterranean had much lower mortality, with means under 200. The standard error was also highest for South-East Asia, so its average is the least precisely estimated of the six regions. Overall, these results match what we saw in the graphs, where South-East Asia and Africa stood out the most.
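As a quick arithmetic check on the table above, the standard error is the standard deviation divided by the square root of the number of yearly values (n = 24). The sketch below uses the rounded values printed in the table for the two highest-mortality regions, so small rounding differences from the full-precision output are possible.

```r
# Standard deviations copied from the tb_stats table (rounded as printed).
sd_by_region <- c(Africa = 3363, `South-East Asia` = 8257)
n_years <- 24  # one regional mean per year

# Standard error of the mean: sd / sqrt(n)
se_by_region <- sd_by_region / sqrt(n_years)
round(se_by_region)
```

Rounded to whole numbers, the results agree with the se column reported for Africa and South-East Asia.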

All regions were first compared together in a single plot to show differences in tuberculosis mortality among HIV patients. Afterward, separate plots were created for each region to show individual trends more clearly.

## Averaging HIV/TB mortalities by region and year
tb_region <- summarize(
  group_by(who_tb_data, g_whoregion, year),
  mortality = mean(e_mort_tbhiv_num, na.rm = TRUE)
)
## `summarise()` has grouped output by 'g_whoregion'. You can override using the
## `.groups` argument.
## Plot of HIV/TB deaths from 2000-2023

library(ggplot2)  # provides ggplot() and the geom and theme layers used below
ggplot(tb_region, aes(x = year, y = mortality, color = g_whoregion)) +
  geom_line() +
  geom_point() +
  xlab("Year") +
  ylab("TB-HIV Mortality") +
  theme_classic()

## Africa data

africa <- filter(tb_region, g_whoregion == "Africa")

ggplot(africa, aes(x = year, y = mortality)) +
  geom_line() +
  geom_point() +
  ggtitle("Africa TB HIV Mortality Over Time") +
  theme_classic()

## Americas data

americas <- filter(tb_region, g_whoregion == "Americas")

ggplot(americas, aes(x = year, y = mortality)) +
  geom_line() +
  geom_point() +
  ggtitle("Americas TB HIV Mortality Over Time") +
  theme_classic()

## Europe data

europe <- filter(tb_region, g_whoregion == "Europe")

ggplot(europe, aes(x = year, y = mortality)) +
  geom_line() +
  geom_point() +
  ggtitle("Europe TB HIV Mortality Over Time") +
  theme_classic()

## Western Pacific data

westernpacific <- filter(tb_region, g_whoregion == "Western Pacific")

ggplot(westernpacific, aes(x = year, y = mortality)) +
  geom_line() +
  geom_point() +
  ggtitle("Western Pacific TB HIV Mortality Over Time") +
  theme_classic()

## Eastern Mediterranean data
easternmediterranean <- filter(tb_region, g_whoregion == "Eastern Mediterranean")

ggplot(easternmediterranean, aes(x = year, y = mortality)) +
  geom_line() +
  geom_point() +
  ggtitle("Eastern Mediterranean TB HIV Mortality Over Time") +
  theme_classic()

## Southeast Asia data
southeastasia <- filter(tb_region, g_whoregion == "South-East Asia")

ggplot(southeastasia, aes(x = year, y = mortality)) +
  geom_line() +
  geom_point() +
  ggtitle("South-East Asia TB HIV Mortality Over Time") +
  theme_classic()

## Comparing TB HIV deaths across all regions

ggplot(tb_region, aes(x = year, y = mortality, color = g_whoregion)) +
  geom_line() +
  geom_point() +
  ggtitle("TB-HIV Mortality Trends by Region") +
  theme_classic()

The separate graphs show how TB mortality in HIV patients changed over time in each region. South-East Asia and Africa started with very high numbers that fell substantially over the years, and South-East Asia had the biggest drop. Note that the y-axis scale is much larger for South-East Asia and Africa, while the other four regions stay in the hundreds. Europe, the Americas, and the Eastern Mediterranean stayed low the whole time and changed little. The Western Pacific went down a little, but not as much as the two high-mortality regions. Overall, this shows that some regions made a lot of progress, while others were already low and stayed that way.
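One way to put a number on "progress" is to compare each region's first and last year in the series. The sketch below is only an illustration: toy_region, its two regions, and all of its values are hypothetical stand-ins for tb_region, and base R is used so the example runs on its own.

```r
# Hypothetical toy data standing in for tb_region (values are made up).
toy_region <- data.frame(
  g_whoregion = rep(c("Region A", "Region B"), each = 3),
  year        = rep(c(2000, 2010, 2023), times = 2),
  mortality   = c(1000, 600, 250, 200, 180, 150)
)

# For each region, compare the earliest and latest year.
change_by_region <- do.call(rbind, lapply(
  split(toy_region, toy_region$g_whoregion),
  function(d) {
    d <- d[order(d$year), ]          # make sure years are in order
    first <- d$mortality[1]
    last  <- d$mortality[nrow(d)]
    data.frame(
      g_whoregion = d$g_whoregion[1],
      abs_change  = last - first,                  # change in deaths
      pct_change  = 100 * (last - first) / first   # change relative to start
    )
  }
))
change_by_region
```

Absolute and percent change can rank regions differently: a region that starts high can show the largest absolute drop, while a region that starts low can show a comparable percent drop from a much smaller base. Reporting both makes the claim about progress easier to judge.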

Part 4 - Statistical Tests (Inferential Statistics)

A one-way ANOVA was used to determine whether tuberculosis mortality among HIV patients differs across world regions. This test is appropriate because the independent variable, region, is categorical, while the dependent variable, mortality, is numerical. ANOVA lets us compare mean mortality values across all regions at once. The null hypothesis states that there is no difference in mean tuberculosis mortality among HIV patients across world regions. The alternative hypothesis states that at least one region has a different mean tuberculosis mortality among HIV patients compared to the others.

## Running ANOVA

mortality ~ g_whoregion
## mortality ~ g_whoregion
tb_ANOVA <- lm(mortality ~ g_whoregion, data = tb_region)

## Looking for normality

qqnorm(residuals(tb_ANOVA))
qqline(residuals(tb_ANOVA))

A QQ plot of the residuals showed some deviation from normality, mainly at the tails. However, given the large sample size, this violation is not considered severe enough to impact the ANOVA results.

anova(tb_ANOVA)
## Analysis of Variance Table
## 
## Response: mortality
##              Df     Sum Sq   Mean Sq F value    Pr(>F)    
## g_whoregion   5 3341845730 668369146  50.449 < 2.2e-16 ***
## Residuals   138 1828300342  13248553                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A one-way ANOVA was used to examine differences in tuberculosis mortality in HIV patients across world regions. The results showed a significant effect of region on mortality (F = 50.449, p < 2.2e-16), which means that mean mortality differs across regions. Because the p-value is far below 0.05, we reject the null hypothesis that all regions have equal mean mortality, as expected from the descriptive statistics. The large F value indicates that variation between regions is much greater than variation within regions. The Residuals row captures the variation in mortality that is not explained by region. Overall, these results suggest that geographic region plays an important role in TB mortality in HIV patients.
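To make the meaning of the F value concrete, the following sketch fits a one-way ANOVA to a small made-up data set and recomputes F by hand as the between-group mean square divided by the within-group (residual) mean square. The data frame toy and its values are hypothetical and are not taken from the TB data.

```r
# Hypothetical toy data: three groups of four observations each.
toy <- data.frame(
  region    = factor(rep(c("A", "B", "C"), each = 4)),
  mortality = c(10, 12, 11, 13,  20, 22, 21, 23,  30, 29, 31, 32)
)

fit <- lm(mortality ~ region, data = toy)
tab <- anova(fit)

# F is the ratio of the two mean squares in the ANOVA table.
ms_between <- tab["region", "Mean Sq"]
ms_within  <- tab["Residuals", "Mean Sq"]
f_by_hand  <- ms_between / ms_within

c(reported = tab["region", "F value"], by_hand = f_by_hand)
```

The two numbers agree, which shows that a large F simply means the group means are spread far apart relative to the scatter inside each group.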

We decided to run a Tukey test to observe which specific regions are different from one another. The ANOVA shows that at least one group mean is different, but it does not specifically show where those differences are. The Tukey test lets us see comparisons between all regions.

TukeyHSD(aov(tb_ANOVA))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = tb_ANOVA)
## 
## $g_whoregion
##                                               diff        lwr       upr
## Americas-Africa                        -7510.36071 -10547.167 -4473.554
## Eastern Mediterranean-Africa           -7558.03814 -10594.845 -4521.231
## Europe-Africa                          -7583.70046 -10620.507 -4546.894
## South-East Asia-Africa                  4667.23201   1630.425  7704.039
## Western Pacific-Africa                 -7248.21489 -10285.022 -4211.408
## Eastern Mediterranean-Americas           -47.67743  -3084.484  2989.129
## Europe-Americas                          -73.33975  -3110.146  2963.467
## South-East Asia-Americas               12177.59272   9140.786 15214.399
## Western Pacific-Americas                 262.14582  -2774.661  3298.952
## Europe-Eastern Mediterranean             -25.66232  -3062.469  3011.144
## South-East Asia-Eastern Mediterranean  12225.27015   9188.463 15262.077
## Western Pacific-Eastern Mediterranean    309.82325  -2726.983  3346.630
## South-East Asia-Europe                 12250.93247   9214.126 15287.739
## Western Pacific-Europe                   335.48557  -2701.321  3372.292
## Western Pacific-South-East Asia       -11915.44690 -14952.254 -8878.640
##                                           p adj
## Americas-Africa                       0.0000000
## Eastern Mediterranean-Africa          0.0000000
## Europe-Africa                         0.0000000
## South-East Asia-Africa                0.0002587
## Western Pacific-Africa                0.0000000
## Eastern Mediterranean-Americas        1.0000000
## Europe-Americas                       0.9999998
## South-East Asia-Americas              0.0000000
## Western Pacific-Americas              0.9998658
## Europe-Eastern Mediterranean          1.0000000
## South-East Asia-Eastern Mediterranean 0.0000000
## Western Pacific-Eastern Mediterranean 0.9996950
## South-East Asia-Europe                0.0000000
## Western Pacific-Europe                0.9995498
## Western Pacific-South-East Asia       0.0000000

The Tukey test results show that many of the region comparisons were statistically significant. For example, South-East Asia had significantly higher mortality than every other region, as shown by very small adjusted p-values (0.0002587 against Africa and effectively 0 against the rest). Africa also had significantly higher mortality than the Americas, Europe, the Eastern Mediterranean, and the Western Pacific. However, the comparisons among the Americas, Europe, the Eastern Mediterranean, and the Western Pacific were not statistically significant, with adjusted p-values close to 1. These results suggest that the major differences in TB mortality among HIV patients are driven by the higher-mortality regions, South-East Asia and Africa. Overall, the Tukey test helps identify which specific regions contribute most to the significant differences found in the ANOVA.
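With fifteen pairwise comparisons, it can help to filter the Tukey output down to the pairs whose adjusted p-value falls below 0.05. The sketch below does this on made-up data; the data frame toy, its group labels, and its values are hypothetical, and the 0.05 cutoff is the conventional threshold used elsewhere in this report.

```r
# Hypothetical toy data: groups A and B are similar, group C is much higher.
toy <- data.frame(
  region    = factor(rep(c("A", "B", "C"), each = 4)),
  mortality = c(10, 12, 11, 13,  11, 12, 10, 13,  30, 29, 31, 32)
)

tuk <- TukeyHSD(aov(mortality ~ region, data = toy))
res <- tuk$region  # matrix with columns diff, lwr, upr, p adj

# Keep only the comparisons with an adjusted p-value below 0.05.
sig <- res[res[, "p adj"] < 0.05, , drop = FALSE]
rownames(sig)
```

Here only the comparisons involving group C remain, mirroring the pattern in the report where the significant pairs all involve one of the two high-mortality regions.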

Part 5 - Discussion

The results of the analysis show that tuberculosis mortality among HIV patients differs across world regions. The ANOVA test was significant (F = 50.449, p < 2.2e-16), which means we reject the null hypothesis and conclude that not all regions have the same average mortality. The Tukey test showed that South-East Asia and Africa have much higher mortality than the other regions, while the remaining four regions were not significantly different from each other. This analysis is important because it shows that some parts of the world are more affected than others and may need more healthcare support. However, there are some limitations, such as not including factors like healthcare access or living conditions that could affect mortality. Also, the residuals were not perfectly normal, but this is unlikely to have strongly affected the results because the sample size was large.

Part 6 - Conclusion

In conclusion, this study found that tuberculosis mortality among HIV patients varies significantly across world regions. The ANOVA confirmed that these differences are significant, and the Tukey test identified the specific regions contributing to them. South-East Asia and Africa showed the highest mortality, while the other regions had lower levels. Based on trends over time, South-East Asia showed the greatest progress in reducing tuberculosis mortality among HIV patients, because it had the largest overall decrease in mortality of any region, even though it did not have the lowest mortality by the end of the time period. Overall, while progress has been made in reducing TB mortality, no region has eliminated TB deaths among HIV patients. More efforts, research, and studies are needed to reduce these differences and improve global health outcomes.

Part 7 - References

HIV education is prevention – learn more on hivcare.org. HIV Care. (2025, December 4). https://hivcare.org/hiv-basics/?gad_source=1&gad_campaignid=1672329731

Nehal, D. (2025, November 11). WHO TB Burden Data: Incidence, Mortality, and Population. GitHub. https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-11-11/readme.md

Navasardyan, I., Miwalian, R., Petrosyan, A., Yeganyan, S., & Venketaraman, V. (2024). HIV–TB Coinfection: Current Therapeutic Approaches and Drug Interactions. Viruses, 16(3), 321. https://doi.org/10.3390/v16030321

World Health Organization. (2026, March 24). Tuberculosis (TB). World Health Organization. https://www.who.int/news-room/fact-sheets/detail/tuberculosis