Part 1 - Introduction

The data used in this analysis comes from the Pre-Approved TidyTuesday Datasets provided by Dr. Tothero and includes the annual number of confirmed measles cases reported across different regions of the world. This dataset allows for comparison of disease patterns on a global scale.

Measles is a highly contagious viral disease that continues to be a major global public health concern, especially in areas with low vaccination coverage. Although an effective vaccine exists, outbreaks still occur due to gaps in immunization, limited healthcare access, and population movement. Studying measles cases helps researchers and public health officials understand where the disease is most prevalent and where prevention efforts are needed.

Figure 1. Visualization of the measles virus (left) and common symptoms of measles (right)

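The response variable in this study is the annual number of reported laboratory-confirmed measles cases (measles_lab_confirmed).

The explanatory variable is the geographic region (region), which groups countries into the six World Health Organization regions: AFRO (Africa), AMRO (Americas), EMRO (Eastern Mediterranean), EURO (Europe), SEARO (South-East Asia), and WPRO (Western Pacific).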

This analysis aims to determine whether measles cases differ across regions, which can help identify areas that may need stronger vaccination and public health interventions.

Part 2 - Main Research Question

Research Question: Is there a significant difference in the annual number of reported confirmed measles cases across different regions?

Part 3 - Exploring the Data (Descriptive Statistics)

ggplot(data, aes(x = measles_lab_confirmed)) +
  geom_histogram(fill = "red", color = "black") +
  scale_x_log10() +
  xlab("Annual Confirmed Measles Cases (log10 scale)") +
  ylab("Frequency") +
  ggtitle("Distribution of Annual Confirmed Measles Cases")
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 703 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Summary statistics of confirmed cases for each region
data %>%
  group_by(region) %>%
  summarise(
    mean_cases = mean(measles_lab_confirmed, na.rm = TRUE),
    median_cases = median(measles_lab_confirmed, na.rm = TRUE),
    sd_cases = sd(measles_lab_confirmed, na.rm = TRUE),
    # n() counts every row in the group, including rows with a missing count
    se_cases = sd(measles_lab_confirmed, na.rm = TRUE) / sqrt(n()))
## # A tibble: 6 × 5
##   region mean_cases median_cases sd_cases se_cases
##   <chr>       <dbl>        <dbl>    <dbl>    <dbl>
## 1 AFRO         262.           53     523.     21.3
## 2 AMRO         176.            0    1410.     78.0
## 3 EMRO         660.           79    2014.    119. 
## 4 EURO         410.           10    1886.     72.4
## 5 SEARO        958.           20    3507.    287. 
## 6 WPRO         704.            1    4044.    219.
ggplot(data, aes(x = region, y = measles_lab_confirmed)) +
  geom_boxplot(fill = "red", color = "black") +
  scale_y_log10() +
  xlab("Region") +
  ylab("Annual Confirmed Measles Cases (log10 scale)") +
  ggtitle("Distribution of Confirmed Measles Cases by Region (Log Scale)") +
  theme_minimal()
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 703 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The histogram above shows the distribution of annual confirmed measles cases on a log10 scale. I chose a log scale because it compresses extremely large values and gives a clearer view of the overall distribution, which remains right-skewed even after the transformation. The plot produces a warning because the dataset contains zero values for confirmed measles cases: log10(0) is not a finite number, so ggplot2 removes those observations (the warning reports 703 rows removed as non-finite). A substantial number of observations therefore report zero confirmed cases and are not visible in the histogram, which is important to keep in mind when interpreting the overall distribution of the data.
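The effect of zeros on a log axis can be checked directly. The following is a minimal sketch on a made-up vector (toy_cases is hypothetical and not taken from the measles data). It shows that R evaluates log10(0) to -Inf, which is why those rows are dropped, and that plotting log10(x + 1) is one common alternative that keeps zero counts visible.

```r
# Hypothetical toy counts, not taken from the measles dataset
toy_cases <- c(0, 0, 3, 12, 150, 4000)

# log10(0) is -Inf in R; ggplot2 treats that as non-finite and removes the row
log_cases <- log10(toy_cases)
n_dropped <- sum(!is.finite(log_cases))
n_dropped

# Adding 1 before taking logs keeps the zero counts (they map to 0)
log_cases_shifted <- log10(toy_cases + 1)
all(is.finite(log_cases_shifted))
```

Whether a shifted log scale is preferable depends on the audience, since the axis no longer shows the raw counts.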

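The table above summarizes the mean, median, standard deviation, and standard error of annual confirmed measles cases for each region. In every region the mean is well above the median (for example, AMRO has a mean of about 176 cases but a median of 0), which reflects the strong right skew of the data.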
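One detail of the summary code is worth noting: n() counts every row in a group, including rows where the case count is missing, whereas sd(..., na.rm = TRUE) uses only the non-missing values. Below is a minimal sketch of a standard error that uses the non-missing count instead, written in base R on a hypothetical toy vector (toy_cases is made up for illustration).

```r
# Hypothetical toy counts with one missing value (not the measles data)
toy_cases <- c(2, 4, 4, NA, 10)

# Standard error based on the non-missing observations only
n_obs <- sum(!is.na(toy_cases))   # 4 non-missing values out of 5 rows
se_na_aware <- sd(toy_cases, na.rm = TRUE) / sqrt(n_obs)

# Dividing by the full row count instead gives a smaller value
se_all_rows <- sd(toy_cases, na.rm = TRUE) / sqrt(length(toy_cases))

c(se_na_aware = se_na_aware, se_all_rows = se_all_rows)
```

Inside summarise(), the equivalent denominator would be sqrt(sum(!is.na(measles_lab_confirmed))). The two versions agree whenever a region has no missing counts.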

The boxplot above shows the distribution of annual confirmed measles cases by region on a log10 scale. The log scale is used again because it allows a clearer comparison by compressing extremely large values. As with the histogram, observations with zero cases are dropped from the plot, so the boxes describe only non-zero counts and their medians do not match the medians in the summary table (for example, the table median for AMRO is 0). There are noticeable differences in both the median and the spread of measles cases between regions. For example, SEARO and EMRO appear to have higher median case counts, while AMRO and EURO show lower median values. Additionally, SEARO and WPRO display greater variability, as indicated by the wider spread of their boxes and whiskers. These differences provide visual evidence that the number of confirmed measles cases varies by region. They are tested formally in the next section.
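Because the log-scale boxplot omits rows with zero cases, a median read from the plot is not the same quantity as the median in the summary table. The sketch below illustrates the gap on a hypothetical two-region data frame (toy, with made-up regions "A" and "B"), not on the measles data.

```r
# Hypothetical toy data: two made-up regions, not the measles dataset
toy <- data.frame(
  region = rep(c("A", "B"), each = 6),
  cases  = c(0, 0, 0, 0, 50, 900,    # region A: mostly zeros
             5, 8, 10, 12, 15, 20),  # region B: no zeros
  stringsAsFactors = FALSE
)

# Median over all rows (what a summary table reports)
median_all <- tapply(toy$cases, toy$region, median)

# Median over non-zero rows only (what a log-scale boxplot can display)
nonzero <- toy[toy$cases > 0, ]
median_nonzero <- tapply(nonzero$cases, nonzero$region, median)

rbind(median_all, median_nonzero)
```

In this toy example region A has the lower median when all rows are used but the higher median once zeros are excluded, so the ranking of regions can depend on which version is reported.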

Part 4 - Statistical Tests (Inferential Statistics)

To determine whether the mean annual number of confirmed measles cases differs across regions, a one-way ANOVA was conducted. This test was selected because it compares the mean of a numeric response (confirmed measles cases) across more than two independent groups (the six regions).

The null hypothesis (H₀) states that there is no difference in the mean number of confirmed measles cases across regions.

The alternative hypothesis (Hₐ) states that at least one region has a different mean number of confirmed measles cases.
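In symbols, writing \mu_r for the true mean annual case count in region r (notation introduced here for illustration only), the two hypotheses and the test statistic can be sketched as:

$$
H_0:\ \mu_{\mathrm{AFRO}} = \mu_{\mathrm{AMRO}} = \mu_{\mathrm{EMRO}} = \mu_{\mathrm{EURO}} = \mu_{\mathrm{SEARO}} = \mu_{\mathrm{WPRO}}
$$

$$
H_a:\ \mu_r \neq \mu_s \quad \text{for at least one pair of regions } r \neq s
$$

$$
F = \frac{\mathrm{MS}_{\mathrm{region}}}{\mathrm{MS}_{\mathrm{residual}}}
$$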

# One-way ANOVA of confirmed cases on region
anova_model <- aov(measles_lab_confirmed ~ region, data = data)
summary(anova_model)
##               Df    Sum Sq  Mean Sq F value   Pr(>F)    
## region         5 1.200e+08 24001288    4.85 0.000204 ***
## Residuals   2376 1.176e+10  4948409                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Tukey HSD post-hoc comparisons for all 15 pairs of regions
TukeyHSD(anova_model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = measles_lab_confirmed ~ region, data = data)
## 
## $region
##                  diff        lwr       upr     p adj
## AMRO-AFRO   -85.72483 -521.68907  350.2394 0.9934576
## EMRO-AFRO   398.25984  -57.49318  854.0129 0.1266990
## EURO-AFRO   148.45656 -206.98744  503.9006 0.8412463
## SEARO-AFRO  696.44499  115.82998 1277.0600 0.0083220
## WPRO-AFRO   442.34601   12.21671  872.4753 0.0396571
## EMRO-AMRO   483.98467  -29.65829  997.6276 0.0781057
## EURO-AMRO   234.18139 -192.97047  661.3332 0.6227628
## SEARO-AMRO  782.16982  155.08764 1409.2520 0.0051226
## WPRO-AMRO   528.07084   37.02147 1019.1202 0.0265586
## EURO-EMRO  -249.80328 -697.13392  197.5274 0.6035014
## SEARO-EMRO  298.18515 -342.81255  939.1829 0.7702064
## WPRO-EMRO    44.08617 -464.61362  552.7860 0.9998749
## SEARO-EURO  547.98843  -26.03916 1122.0160 0.0711868
## WPRO-EURO   293.88945 -127.30540  715.0843 0.3482746
## WPRO-SEARO -254.09898 -877.13866  368.9407 0.8543087

The results of the one-way ANOVA test show a statistically significant difference in the mean number of confirmed measles cases across regions (F(5, 2376) = 4.85, p = 0.000204). Since the p-value is less than 0.05, we reject the null hypothesis.
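The reported F statistic and p-value can be reproduced from the ANOVA table alone. This is a small check, assuming only the mean squares and degrees of freedom printed above; the variable names are introduced here for illustration.

```r
# Values copied from the ANOVA table above
ms_region   <- 24001288  # Mean Sq for region
ms_residual <- 4948409   # Mean Sq for residuals
df_region   <- 5
df_residual <- 2376

# F is the ratio of the between-group to the within-group mean square
f_value <- ms_region / ms_residual

# The p-value is the upper-tail probability of the F distribution
p_value <- pf(f_value, df_region, df_residual, lower.tail = FALSE)

round(f_value, 2)
signif(p_value, 3)
```

The residual degrees of freedom also imply that 2376 + 5 + 1 = 2382 observations entered the model.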

This indicates that at least one region has a significantly different mean number of confirmed measles cases compared to the others.

Since the ANOVA test indicated a statistically significant difference among group means, a Tukey post-hoc test was conducted to determine which specific regions differed from each other. The results show that several regional pairs had statistically significant differences; specifically, SEARO had significantly higher mean measles cases compared to both AFRO (p = 0.0083) and AMRO (p = 0.0051). Additionally, WPRO showed significantly higher mean cases compared to AFRO (p = 0.0397) and AMRO (p = 0.0266).

No other pairwise comparisons were statistically significant (p > 0.05), indicating that not all regions differ from each other. These results suggest that the overall differences identified in the ANOVA are primarily driven by higher case counts in SEARO and WPRO compared to AFRO and AMRO.
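The list of significant pairs can be read off the Tukey output mechanically. The sketch below copies the interval bounds and adjusted p-values from the output above into a data frame (the object and column names are chosen here for illustration) and confirms that the pairs with an adjusted p-value below 0.05 are exactly the pairs whose 95% interval excludes zero.

```r
# Tukey results copied from the output above
tukey <- data.frame(
  comparison = c("AMRO-AFRO", "EMRO-AFRO", "EURO-AFRO", "SEARO-AFRO",
                 "WPRO-AFRO", "EMRO-AMRO", "EURO-AMRO", "SEARO-AMRO",
                 "WPRO-AMRO", "EURO-EMRO", "SEARO-EMRO", "WPRO-EMRO",
                 "SEARO-EURO", "WPRO-EURO", "WPRO-SEARO"),
  lwr = c(-521.68907, -57.49318, -206.98744, 115.82998, 12.21671,
          -29.65829, -192.97047, 155.08764, 37.02147, -697.13392,
          -342.81255, -464.61362, -26.03916, -127.30540, -877.13866),
  upr = c(350.2394, 854.0129, 503.9006, 1277.0600, 872.4753,
          997.6276, 661.3332, 1409.2520, 1019.1202, 197.5274,
          939.1829, 552.7860, 1122.0160, 715.0843, 368.9407),
  p_adj = c(0.9934576, 0.1266990, 0.8412463, 0.0083220, 0.0396571,
            0.0781057, 0.6227628, 0.0051226, 0.0265586, 0.6035014,
            0.7702064, 0.9998749, 0.0711868, 0.3482746, 0.8543087),
  stringsAsFactors = FALSE
)

# Pairs with an adjusted p-value below 0.05
significant <- tukey$comparison[tukey$p_adj < 0.05]

# Pairs whose 95% family-wise interval does not contain zero
excludes_zero <- tukey$comparison[tukey$lwr > 0 | tukey$upr < 0]

significant
identical(significant, excludes_zero)
```

With the fitted model available, the same table can be taken from TukeyHSD(anova_model)$region, which is a matrix with the columns diff, lwr, upr, and p adj.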

Part 5 - Discussion

The results of this analysis show that there are statistically significant differences in the mean number of confirmed measles cases across global regions. Both the visualizations and the ANOVA test support the conclusion that measles case counts vary geographically.

The Tukey post-hoc test further showed that SEARO and WPRO have significantly higher mean numbers of confirmed measles cases than AFRO and AMRO. These differences may be influenced by several factors, including vaccination coverage, access to healthcare, population density, and public health infrastructure. Regions with lower vaccination rates or limited healthcare access may be more susceptible to higher measles case counts and outbreaks.

This analysis is important because it highlights how measles remains a global public health issue, particularly in certain regions. Understanding where higher case counts occur can help guide vaccination efforts and public health interventions to reduce the spread of the disease.

There are, however, a few limitations to this analysis. It does not account for all possible factors that influence measles cases, such as differences in reporting accuracy, population size, or vaccination rates within each region. Additionally, the use of aggregated regional data may mask important variations within individual countries. The presence of many zero values and highly skewed data may also affect the interpretation of the results.
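One way to gauge how much skew and zeros matter is to compare the ANOVA with a rank-based test. The sketch below is not part of the analysis in this report: it applies the Kruskal-Wallis test, a different technique from the one-way ANOVA used above, to simulated skewed counts (the data frame toy, its three made-up regions, and the simulation settings are all assumptions for illustration).

```r
# Simulated, heavily skewed counts for three made-up regions (not the measles data)
set.seed(1)
toy <- data.frame(
  region = rep(c("A", "B", "C"), each = 40),
  cases  = c(rnbinom(40, size = 0.3, mu = 50),
             rnbinom(40, size = 0.3, mu = 200),
             rnbinom(40, size = 0.3, mu = 200)),
  stringsAsFactors = FALSE
)

# Kruskal-Wallis compares the groups using ranks, so extreme values
# and ties at zero have less influence than in a comparison of means
kw <- kruskal.test(cases ~ region, data = toy)

kw$statistic
kw$p.value
```

On the real data the corresponding call would be kruskal.test(measles_lab_confirmed ~ region, data = data). If it agreed with the ANOVA, that would suggest the main conclusion is not driven solely by a few extreme values.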

Overall, this analysis demonstrates that measles cases are not evenly distributed across regions and that the number of confirmed cases is associated with geographic region.

Part 6 - Conclusion

In conclusion, this analysis found a statistically significant difference in the mean number of confirmed measles cases across global regions. SEARO and WPRO had significantly higher mean case counts than AFRO and AMRO, which suggests that the measles burden is associated with geographic region. These findings highlight the need for targeted public health efforts to reduce measles cases in the most affected regions.

Part 7 - References

Data Science Learning Community. (2025). Measles cases around the world [Data set]. TidyTuesday. https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-06-24/cases_year.csv

Data Science Learning Community. (2025). Measles cases around the world (README). TidyTuesday. https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-06-24/readme.md

KidsHealth New Zealand. (2025). Measles in children. https://www.kidshealth.org.nz/measles-in-children

Mayo Clinic. (2025). Measles: Symptoms and causes. https://www.mayoclinic.org/diseases-conditions/measles/symptoms-causes/syc-20374857

University of Virginia. (2025). Measles reported in 9 states: Here’s what you need to know. https://news.virginia.edu/content/measles-reported-9-states-heres-what-you-need-know