Helminth Prevalence Homework

Author

Nguyen Nhat Linh - R14B48009

Published

2025-09-11

1 Setup

1.1 Input data

The first step is to load the necessary library for data processing (tidyverse) and import the datasets into R.

# Load necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Import Ascaris data
ascaris_data_url <- 'https://github.com/WGCAdvancedCourses/Helminths_2021/raw/main/manuals/module_5_R/Ascaris_prevalence_data.txt'
ascaris_data <- read_delim(file = ascaris_data_url, delim ="\t")
Rows: 989 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): Region, Country, ISO_code, ADM1
dbl (9): Latitude, Longitude, Year_start, Year_end, Age_start, Age_end, Indi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import Schistosoma data
schisto_data_url <- 'https://github.com/WGCAdvancedCourses/Helminths_2021/raw/main/manuals/module_5_R/Schistosoma_mansoni_prevalence_data.txt'
schisto_data <- read_delim(file = schisto_data_url, delim ="\t")
Rows: 589 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): Region, Country, ISO_code, ADM1
dbl (9): Latitude, Longitude, Year_start, Year_end, Age_start, Age_end, Indi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import Hookworm data
hookworm_data_url <- 'https://github.com/WGCAdvancedCourses/Helminths_2021/raw/main/manuals/module_5_R/Hookworms_prevalence_data.txt'
hookworm_data <- read_delim(file = hookworm_data_url, delim ="\t")
Rows: 1000 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): Region, Country, ISO_code, ADM1
dbl (9): Latitude, Longitude, Year_start, Year_end, Age_start, Age_end, Indi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The dataset comprises three tables describing the prevalence of three helminth infections: Ascaris, hookworm, and Schistosoma mansoni. Each table has the same 13 columns, covering geographic location, survey period, study population (age range), and survey results (number of individuals surveyed, number of positives, and prevalence).
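Because all three files are tab-delimited, `readr::read_tsv()` can serve as a shorthand for `read_delim(delim = "\t")`, and `show_col_types = FALSE` quiets the column-specification message. The sketch below is only an illustration: it reads a small temporary file with made-up rows (not the course data) through a hypothetical helper `read_prevalence()`, and checks that two tables share the same column names before they are stacked.

```r
library(readr)

# Hypothetical helper (not part of the original analysis):
# read one tab-delimited prevalence file without the column-type message.
read_prevalence <- function(path) {
  read_tsv(path, show_col_types = FALSE)
}

# Made-up example rows written to a temporary file (not the course data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Region\tCountry\tPrevalence",
             "Africa\tAngola\t0.10",
             "Africa\tUganda\t0.25"),
           tmp)

toy_a <- read_prevalence(tmp)
toy_b <- read_prevalence(tmp)

# Tables can be stacked row-wise without gaps when their column names match.
same_columns <- identical(names(toy_a), names(toy_b))
same_columns
```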

1.2 Combine into a master table

To explore the epidemiology of the three helminth infections across different regions, I will combine the datasets into a single master table that integrates all three diseases with their corresponding variables.

combined.mastertable <- bind_rows("ascaris" = ascaris_data,
                                  "schisto" = schisto_data, 
                                  "hookworm" = hookworm_data, 
                                  .id = "disease")
combined.mastertable
# A tibble: 2,578 × 14
   disease Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
   <chr>   <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
 1 ascaris Africa Angola  AO       Bengo    -8.59      13.6       2010     2010
 2 ascaris Africa Angola  AO       Bengo    -8.63      13.7       2010     2010
 3 ascaris Africa Angola  AO       Bengo    -8.61      13.6       2010     2010
 4 ascaris Africa Angola  AO       Bengo    -8.62      14.2       2010     2010
 5 ascaris Africa Angola  AO       Bengo    -8.53      13.7       2010     2010
 6 ascaris Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
 7 ascaris Africa Angola  AO       Bengo    -8.63      14.0       2010     2010
 8 ascaris Africa Angola  AO       Bengo    -8.62      13.8       2010     2010
 9 ascaris Africa Angola  AO       Bengo    -8.64      14.0       2010     2010
10 ascaris Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
# ℹ 2,568 more rows
# ℹ 5 more variables: Age_start <dbl>, Age_end <dbl>,
#   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>

This master table has 2,578 rows and 14 columns. From left to right, the columns give the disease name, geographic location (the next six columns), study period (Year_start and Year_end), study population (Age_start and Age_end), and survey results (Individuals_surveyed, Number_Positives, and Prevalence).
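The row count is consistent with the three imports: 989 + 589 + 1000 = 2,578. A quick way to confirm that no rows were lost or duplicated is to count rows per disease after binding. The sketch below shows the idea on toy tables with made-up values; the `toy_*` names are illustrative only.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for the three imported tables (made-up values).
toy_ascaris  <- tibble(Country = c("Angola", "Uganda"),
                       Prevalence = c(0.10, 0.05))
toy_schisto  <- tibble(Country = "Uganda",
                       Prevalence = 0.15)
toy_hookworm <- tibble(Country = c("Nepal", "Ghana", "Uganda"),
                       Prevalence = c(0.12, 0.20, 0.49))

# Same pattern as the master table: the list names become the disease column.
toy_master <- bind_rows("ascaris" = toy_ascaris,
                        "schisto" = toy_schisto,
                        "hookworm" = toy_hookworm,
                        .id = "disease")

# One row per disease with the number of survey rows it contributed.
rows_per_disease <- count(toy_master, disease)
rows_per_disease
```

Applied to `combined.mastertable`, the same `count(disease)` call should return 989, 1000, and 589 rows for ascaris, hookworm, and schisto respectively, assuming the imports are unchanged.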

2 Questions

2.1 Q1. Which region has the highest prevalence for each disease?

2.1.1 Rationale:

This question aims to identify regional hotspots of helminth infections and to provide an initial geographical overview of disease burden.

2.1.2 R code:

# Group by region and disease, compute mean prevalence and the number of survey rows, then sort from highest to lowest
q1.region.disease.tb <- combined.mastertable %>% 
  group_by(Region, 
           disease) %>% 
  summarise(mean_prev = mean(Prevalence, 
                             na.rm = TRUE), 
            n_sites = n()) %>% 
  arrange(desc(mean_prev))
`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.
q1.region.disease.tb
# A tibble: 6 × 4
# Groups:   Region [3]
  Region                  disease  mean_prev n_sites
  <chr>                   <chr>        <dbl>   <int>
1 East and Southeast Asia ascaris     0.288      133
2 Africa                  hookworm    0.233      880
3 Africa                  schisto     0.212      589
4 Central and South Asia  hookworm    0.119        2
5 East and Southeast Asia hookworm    0.113      118
6 Africa                  ascaris     0.0736     856
# Draw a grouped column chart comparing the mean prevalence of each disease by region
ggplot(q1.region.disease.tb, 
       aes(x = Region, 
           y = mean_prev, 
           fill = disease)) + 
  geom_col(position = "dodge") + 
  labs(title = "Mean prevalence by Region and Disease", 
       y = "Mean prevalence") + 
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1)) + 
  scale_fill_brewer(palette = "Dark2")

2.1.3 Results:

Within this dataset, East and Southeast Asia has the highest mean prevalence of ascariasis (0.288), whereas Africa has the highest mean prevalence of hookworm infection (0.233) and is the only region with schistosomiasis records (0.212). Hookworm is the most widespread of the three infections, with records in all three regions, although the Central and South Asia estimate rests on only two survey sites.
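The means above give every survey row equal weight, regardless of how many people were examined. One possible complement is a pooled estimate that divides total positives by total individuals surveyed. The sketch below uses made-up numbers and assumes that `Prevalence` equals `Number_Positives / Individuals_surveyed`, which has not been verified for the course data.

```r
library(dplyr)
library(tibble)

# Made-up surveys; Prevalence is assumed to be positives / surveyed.
toy_surveys <- tibble(
  Region = c("Africa", "Africa", "Asia"),
  disease = "hookworm",
  Individuals_surveyed = c(50, 450, 100),
  Number_Positives = c(25, 45, 20),
  Prevalence = Number_Positives / Individuals_surveyed
)

toy_summary <- toy_surveys %>%
  group_by(Region, disease) %>%
  summarise(
    # Unweighted mean of site-level prevalence, as in Q1
    mean_prev = mean(Prevalence, na.rm = TRUE),
    # Pooled prevalence: large surveys count for more (toy data has no NAs)
    pooled_prev = sum(Number_Positives) / sum(Individuals_surveyed),
    n_sites = n(),
    .groups = "drop"   # return an ungrouped tibble and avoid the message
  )
toy_summary
```

In the toy Africa group, the small survey with 50% prevalence pulls the unweighted mean well above the pooled value, which is the kind of difference this comparison is meant to expose.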

2.2 Q2. Are there geographic hotspots at finer resolution (latitude and longitude)?

2.2.1 Rationale:

This question extends the analysis from broad regional patterns (Q1) to more granular geographic scales defined by latitude and longitude. Identifying fine-resolution hotspots allows for a clearer understanding of localized transmission dynamics and helps pinpoint environmental and socio-economic factors driving disease prevalence.

2.2.2 R code:

# Summarise mean prevalence for each combination of longitude, latitude, and disease, keeping only locations with non-zero prevalence
q2.longid.latid.tb <- combined.mastertable %>% 
  group_by(Longitude, 
           Latitude, 
           disease) %>% 
  summarise(mean_prev = mean(Prevalence, 
                             na.rm = TRUE)) %>% 
  filter(mean_prev > 0)
`summarise()` has grouped output by 'Longitude', 'Latitude'. You can override
using the `.groups` argument.
q2.longid.latid.tb
# A tibble: 1,584 × 4
# Groups:   Longitude, Latitude [941]
   Longitude Latitude disease  mean_prev
       <dbl>    <dbl> <chr>        <dbl>
 1     -17.4     14.8 ascaris       0.06
 2     -17.4     14.7 ascaris       0.12
 3     -17.4     14.8 ascaris       0.36
 4     -17.4     14.8 ascaris       0.62
 5     -17.4     14.8 ascaris       0.48
 6     -17.4     14.7 ascaris       0.36
 7     -17.4     14.8 ascaris       0.38
 8     -17.4     14.7 ascaris       0.4 
 9     -17.4     14.8 ascaris       0.54
10     -17.4     14.8 hookworm      0.02
# ℹ 1,574 more rows
# Scatterplot of site coordinates, one panel per disease, with colour and size showing mean prevalence
ggplot(q2.longid.latid.tb, 
       aes (x = Longitude, 
            y = Latitude, 
            color = mean_prev, 
            size = mean_prev)) + 
  geom_point() + 
  scale_color_gradient(low = "lightyellow",
                       high = "red") + 
  facet_wrap(~ disease) + 
  labs (title = "Geographic distribution of helminth prevalence", 
        x = "Longitude", 
        y = "Latitude", 
        color = "mean prevalence", 
        size = "mean prevalence")

2.2.3 Results:

The figure shows that most records for all three helminth infections fall within tropical latitudes (roughly -20° to +20°), with some extending to about -30° and 25°. Each infection has a distinct geographic pattern. Ascariasis hotspots mainly occur between -10° and 25° latitude and between 0° and 120° longitude, particularly in East and Southeast Asia. Hookworm has a wider distribution across Africa and Southeast Asia, mostly between -30° and 20° latitude. Schistosomiasis is largely confined to Africa, between -30° and 20° latitude and between 0° and about 60° longitude.

Tropical latitudes offer warm, humid climates that favor parasite survival and transmission. Differences in longitude often reflect distinct climate zones, landscapes, and human population patterns, which affect host availability and parasite life cycles. For example, areas between 20° and 40° longitude may differ in rainfall, temperature, or land use from regions near 80°, leading to variation in infection prevalence.
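Grouping by exact coordinates leaves almost every site in its own group, so nearby sites are not pooled. One possible way to look for hotspots is to snap coordinates to a grid before summarising. The sketch below uses made-up sites and an assumed cell size of 1 degree; both the data and the `cell_size` value are illustrative.

```r
library(dplyr)
library(tibble)

# Made-up sites: the first three fall within the same 1-degree cell.
toy_sites <- tibble(
  Longitude  = c(32.41, 32.48, 32.95, 120.20),
  Latitude   = c(0.31, 0.35, 0.90, 14.60),
  disease    = "hookworm",
  Prevalence = c(0.40, 0.60, 0.20, 0.10)
)

cell_size <- 1  # assumed grid resolution in degrees

toy_grid <- toy_sites %>%
  # Snap each coordinate to the lower-left corner of its grid cell
  mutate(lon_cell = floor(Longitude / cell_size) * cell_size,
         lat_cell = floor(Latitude / cell_size) * cell_size) %>%
  group_by(lon_cell, lat_cell, disease) %>%
  summarise(mean_prev = mean(Prevalence, na.rm = TRUE),
            n_sites = n(),
            .groups = "drop")
toy_grid
```

Cells with both a high `mean_prev` and a reasonable `n_sites` are better hotspot candidates than single high-prevalence points.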

2.3 Q3. How has the prevalence of Hookworm infection changed over time across regions?

2.3.1 Rationale:

Building on results from Q1 and Q2, which show that hookworm is the only helminth infection present in all three regions, examining temporal trends is essential. Understanding how hookworm prevalence has evolved over time allows for evaluation of the effectiveness of control programs in diverse geographic contexts. Additionally, this analysis helps identify emerging or persistent hotspots that may require accelerated public health interventions.

2.3.2 R code:

# Keep hookworm records and average prevalence by region and survey start year
q3.hookworm <- combined.mastertable %>% 
  filter(disease == "hookworm") %>% 
  group_by(Region, 
           Year_start) %>% 
  summarise(mean_prev = mean(Prevalence, 
                             na.rm = TRUE))
`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.
q3.hookworm
# A tibble: 26 × 3
# Groups:   Region [3]
   Region Year_start mean_prev
   <chr>       <dbl>     <dbl>
 1 Africa       1997    0.0992
 2 Africa       1998    0.493 
 3 Africa       1999    0.648 
 4 Africa       2001    0.136 
 5 Africa       2002    0.176 
 6 Africa       2003    0.545 
 7 Africa       2004    0.712 
 8 Africa       2005    0.276 
 9 Africa       2006    0.0435
10 Africa       2007    0.220 
# ℹ 16 more rows
# Plot the trend of mean prevalence over time by region
ggplot(q3.hookworm, 
       aes(x = Year_start, 
           y = mean_prev, 
           colour = Region)) + 
  geom_point() + 
  geom_line() + 
  labs(title = "Trend of Hookworm prevalence by region", 
       x = "Year of study", 
       y = "Mean prevalence") + 
  scale_x_continuous(breaks = unique(q3.hookworm$Year_start)) + 
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1))
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

2.3.3 Results:

Hookworm prevalence in Africa demonstrates considerable year-to-year variability, which may reflect changes in control efforts, surveillance quality, or underlying transmission dynamics. In contrast, East and Southeast Asia experienced a marked surge in prevalence around 2006, followed by a notable decline, suggesting the possible impact of accelerated interventions or improved public health initiatives in that region. The limited data available for Central and South Asia precludes meaningful assessment of temporal trends in that area.
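Part of the year-to-year variability may simply reflect how few surveys contribute to some region-year means. A minimal sketch of one way to check this is to carry the survey count alongside each mean and keep only well-supported points. The data below are made up, and the `min_surveys` threshold is an arbitrary assumption.

```r
library(dplyr)
library(tibble)

# Made-up hookworm surveys.
toy_hookworm <- tibble(
  Region     = c("Africa", "Africa", "Africa", "Asia"),
  Year_start = c(2004, 2004, 2006, 2006),
  Prevalence = c(0.70, 0.72, 0.04, 0.30)
)

toy_trend <- toy_hookworm %>%
  group_by(Region, Year_start) %>%
  summarise(mean_prev = mean(Prevalence, na.rm = TRUE),
            n_surveys = n(),          # how many surveys support each mean
            .groups = "drop")

min_surveys <- 2  # assumed threshold, chosen only for illustration

# Region-year means that rest on at least min_surveys surveys
well_supported <- filter(toy_trend, n_surveys >= min_surveys)
well_supported
```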

2.4 Q4. Which countries are affected by specific helminth infections, and to what extent do multiple helminth species co-occur within the same country?

2.4.1 Rationale:

In helminth epidemiology, understanding co-occurrence of infections within populations is crucial, as it informs whether integrated intervention strategies are necessary. Identifying countries where multiple helminth diseases overlap provides evidence to support the design of combined control programs, maximizing resource efficiency and enhancing the effectiveness of public health efforts.

2.4.2 R code:

# Build a presence table: "X" if a country has records for a disease, "O" otherwise
q4.country.disease.tb <- combined.mastertable %>% 
  group_by(Country, 
           disease) %>% 
  summarise(has_data = "X", 
            .groups = "drop") %>% 
  pivot_wider(names_from = disease, 
              values_from = has_data, 
              values_fill = "O")
knitr::kable(q4.country.disease.tb, caption = "Presence of helminth diseases by country")
Presence of helminth diseases by country
Country ascaris hookworm schisto
Angola X X O
Burundi X X O
Cameroon X O O
China X X O
Cote D’Ivoire X X O
Democratic Republic of the Congo O O X
Eritrea X X O
Ethiopia O X X
Ghana X X O
Malawi X X O
Nepal O X O
Nigeria X X O
Philippines X X O
Senegal X X O
Sierra Leone X X X
South Africa O X O
Uganda X X X
United Republic of Tanzania O X X
Zambia O O X

2.4.3 Results:

The table shows that, at country level, helminth infections rarely occur in isolation: 14 of the 19 countries have records for two or all three diseases. This overlap supports integrated, multi-disease intervention strategies rather than purely disease-specific approaches. Schistosomiasis is less widespread (6 countries) and is more often the only disease recorded (2 of those 6), which may reflect distinct transmission dynamics and a need for targeted interventions where it appears alone.
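The overlap can also be tallied directly rather than read off the table. Counting from the table above gives 5 countries with one disease, 12 with two, and 2 with all three. The sketch below shows one way to compute such a tally with `n_distinct()`, using made-up records rather than the master table.

```r
library(dplyr)
library(tibble)

# Made-up country/disease records.
toy_records <- tibble(
  Country = c("Uganda", "Uganda", "Uganda", "Nepal", "Ghana", "Ghana"),
  disease = c("ascaris", "hookworm", "schisto", "hookworm", "ascaris", "hookworm")
)

# Number of different diseases recorded in each country
diseases_per_country <- toy_records %>%
  group_by(Country) %>%
  summarise(n_diseases = n_distinct(disease), .groups = "drop")

# How many countries have 1, 2, or 3 diseases
overlap_counts <- count(diseases_per_country, n_diseases)
overlap_counts
```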

2.5 Q5. Is there a significant correlation between the presence of different helminth diseases (in pairs) across countries?

2.5.1 Rationale:

The Q4 table suggests that multiple helminth infections co-occur within countries. To gauge how strongly each pair of diseases overlaps, this analysis computes a co-occurrence ratio for each pair, referred to below as the correlation ratio:

correlation_ratio = (number of countries with both diseases) / (number of countries with at least one of the two diseases)
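This ratio is the Jaccard index of the two sets of countries: countries with neither disease do not enter the calculation. A minimal sketch on made-up presence vectors follows; the helper name `jaccard_ratio()` is hypothetical and not part of any package.

```r
# Hypothetical helper: shared countries divided by countries with either disease.
jaccard_ratio <- function(has_a, has_b) {
  sum(has_a & has_b) / sum(has_a | has_b)
}

# Made-up presence flags for five countries.
toy_a <- c(TRUE, TRUE, TRUE,  FALSE, FALSE)
toy_b <- c(TRUE, TRUE, FALSE, TRUE,  FALSE)

# 2 countries have both diseases, 4 have at least one; the fifth is ignored.
toy_ratio <- jaccard_ratio(toy_a, toy_b)
toy_ratio
```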

2.5.2 R code:

#Create a table summarizing information about the frequency of occurrence of 2 diseases in pairs:
q5.co_occur <- q4.country.disease.tb %>% 
  summarise(Ascaris_Hookworm = 
              sum(ascaris == "X" & hookworm == "X")/
              sum(ascaris == "X" | hookworm == "X"), 
            Hookworm_Schisto = 
              sum(schisto == "X" & hookworm == "X")/
              sum(schisto == "X" | hookworm == "X"), 
            Ascaris_Schisto = 
              sum(schisto == "X" & ascaris == "X")/
              sum(schisto == "X" | ascaris == "X")) %>% 
  pivot_longer(cols = c("Ascaris_Hookworm",
                        "Hookworm_Schisto",
                        "Ascaris_Schisto"), 
               names_to = "Pair", 
               values_to = "Correlation_ratio")
q5.co_occur
# A tibble: 3 × 2
  Pair             Correlation_ratio
  <chr>                        <dbl>
1 Ascaris_Hookworm             0.706
2 Hookworm_Schisto             0.222
3 Ascaris_Schisto              0.118
# Draw a bar chart of the co-occurrence ratio for each disease pair
ggplot(q5.co_occur) + 
  geom_col(aes(x = reorder(Pair, 
                           Correlation_ratio), 
               y = Correlation_ratio, 
               fill = Pair)) + 
  labs(title = "Co-occurrence rate between disease pairs", 
       x = "Disease pairs", 
       y = "Correlation ratio") +  
  scale_fill_brewer(palette = 7)

2.5.3 Results:

The figure and table show that Ascaris and hookworm co-occur strongly, with a correlation ratio of 0.71, meaning that these two infections frequently overlap within the same countries. By contrast, the ratios for pairs involving Schistosoma are much lower (0.22 and 0.12), reflecting a weaker association. These findings suggest that countries endemic for Ascaris should be prioritized for concurrent hookworm surveillance and control, and vice versa, while integrated strategies involving schistosomiasis may be less broadly applicable but are still warranted in specific settings.
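The ratio describes overlap but does not by itself say whether the association is statistically significant. One option, swapped in here as an illustration and not part of the original analysis, is Fisher's exact test on the 2 × 2 presence table. The counts below are read off the Q4 table for Ascaris and hookworm (12 countries with both, 1 with Ascaris only, 4 with hookworm only, 2 with neither). With only 19 countries, and with presence reflecting where surveys exist, any result should be read cautiously.

```r
# Country counts read off the Q4 presence table (19 countries in total).
presence_table <- matrix(
  c(12, 1,    # ascaris present: hookworm present / hookworm absent
     4, 2),   # ascaris absent:  hookworm present / hookworm absent
  nrow = 2, byrow = TRUE,
  dimnames = list(ascaris  = c("present", "absent"),
                  hookworm = c("present", "absent"))
)

# Fisher's exact test suits small counts such as these.
fisher_result <- fisher.test(presence_table)
fisher_result$p.value

# The same counts reproduce the co-occurrence ratio: 12 / (12 + 1 + 4)
ratio_check <- presence_table["present", "present"] /
  (sum(presence_table) - presence_table["absent", "absent"])
ratio_check
```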

2.6 Q6. How does the prevalence of helminth infections change over time in countries where all three diseases are present?

2.6.1 Rationale:

Tracking the temporal dynamics of helminth prevalence in countries with data on Ascaris, hookworm, and schistosomiasis gives insight into the effectiveness of control strategies such as deworming and sanitation programs. Restricting the analysis to countries with records for all three infections allows like-for-like comparisons across diseases and over time.

2.6.2 R code:

# Keep countries that have records for all 3 diseases and more than 1 survey year
q6.countries.all3 <- combined.mastertable %>% 
  group_by(Country) %>% 
  filter(n_distinct(disease) == 3 & 
           n_distinct(Year_start) > 1) %>% 
  ungroup()

#Summarise data
q6.summarise <- q6.countries.all3 %>% 
  group_by(Country, 
           disease, 
           Year_start) %>% 
  summarise(mean_prev = mean(Prevalence)) %>% 
  ungroup()
`summarise()` has grouped output by 'Country', 'disease'. You can override
using the `.groups` argument.
knitr::kable(q6.summarise, caption = "Temporal Trends of Helminth triple-infections in Endemic Countries")
Temporal Trends of Helminth triple-infections in Endemic Countries
Country disease Year_start mean_prev
Uganda ascaris 1998 0.0452189
Uganda ascaris 2002 0.0594700
Uganda ascaris 2003 0.0830500
Uganda ascaris 2005 0.2472018
Uganda ascaris 2006 0.2187270
Uganda ascaris 2008 0.0951707
Uganda ascaris 2009 0.0061564
Uganda hookworm 1998 0.4934728
Uganda hookworm 2002 0.5752034
Uganda hookworm 2003 0.5447246
Uganda hookworm 2004 0.7124457
Uganda hookworm 2005 0.2762282
Uganda hookworm 2006 0.0435118
Uganda hookworm 2008 0.0331454
Uganda hookworm 2009 0.1027241
Uganda schisto 1998 0.1496048
Uganda schisto 2002 0.4415909
Uganda schisto 2003 0.4899146
Uganda schisto 2004 0.3674429
Uganda schisto 2005 0.0659276
Uganda schisto 2006 0.0361705
Uganda schisto 2008 0.1424500
Uganda schisto 2009 0.1000861
#Plot trend
ggplot(q6.summarise, 
       aes(x = Year_start, 
           y = mean_prev, 
           color = disease)) + 
  geom_line () + 
  geom_point() + 
  scale_x_continuous(breaks = unique(q6.summarise$Year_start)) +
  facet_wrap(~ Country) + 
  labs(title = "Trend of helminth prevalence over time", 
       x = "Year of study", 
       y = "Mean prevalence") + 
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1))

2.6.3 Results:

Q4 and Q5 suggest that Ascaris and hookworm frequently co-occur at country level, while schistosomiasis is more independent. Uganda is the only country that meets both filter criteria. There, however, the prevalence trends of hookworm and schistosomiasis closely mirror each other: both peak in 2003-2004 and then drop sharply from 2005. Notably, Ascaris prevalence rose in 2005 as the other two fell, which might indicate a data anomaly, a reporting lag, or a distinct epidemiological event that deserves further investigation. This underscores the complexity of helminth epidemiology and the value of longitudinal, multi-disease monitoring.
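The visual impression that hookworm and schistosomiasis move together in Uganda can be given a rough number. The sketch below copies the yearly means from the table above and computes a Spearman rank correlation. This is an added illustration, not part of the original analysis, and eight yearly points are too few for a firm conclusion. Ascaris is left out because it has no 2004 value.

```r
# Yearly mean prevalence in Uganda, copied from the Q6 table.
uganda_trend <- data.frame(
  Year_start = c(1998, 2002, 2003, 2004, 2005, 2006, 2008, 2009),
  hookworm   = c(0.4934728, 0.5752034, 0.5447246, 0.7124457,
                 0.2762282, 0.0435118, 0.0331454, 0.1027241),
  schisto    = c(0.1496048, 0.4415909, 0.4899146, 0.3674429,
                 0.0659276, 0.0361705, 0.1424500, 0.1000861)
)

# Rank correlation: do years with high hookworm prevalence
# also tend to have high schistosomiasis prevalence?
trend_cor <- cor(uganda_trend$hookworm,
                 uganda_trend$schisto,
                 method = "spearman")
trend_cor
```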

2.7 Q7. Which countries have sufficiently robust epidemiological studies to provide reliable evidence for helminth infection assessment?

2.7.1 Rationale:

Filtering studies by sample size (more than 29 participants) and replication (more than 2 such studies per country) reduces noise and increases confidence in the epidemiological conclusions. This approach identifies countries where data quality is strong enough to support in-depth analysis, accurate prevalence estimation, and meaningful evaluation of control program effectiveness.

2.7.2 R code:

# Study level: flag studies with more than 29 participants as reliable
q7.study.level <- combined.mastertable %>% 
  mutate(study_reliable = 
           Individuals_surveyed > 29)
q7.study.level
# A tibble: 2,578 × 15
   disease Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
   <chr>   <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
 1 ascaris Africa Angola  AO       Bengo    -8.59      13.6       2010     2010
 2 ascaris Africa Angola  AO       Bengo    -8.63      13.7       2010     2010
 3 ascaris Africa Angola  AO       Bengo    -8.61      13.6       2010     2010
 4 ascaris Africa Angola  AO       Bengo    -8.62      14.2       2010     2010
 5 ascaris Africa Angola  AO       Bengo    -8.53      13.7       2010     2010
 6 ascaris Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
 7 ascaris Africa Angola  AO       Bengo    -8.63      14.0       2010     2010
 8 ascaris Africa Angola  AO       Bengo    -8.62      13.8       2010     2010
 9 ascaris Africa Angola  AO       Bengo    -8.64      14.0       2010     2010
10 ascaris Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
# ℹ 2,568 more rows
# ℹ 6 more variables: Age_start <dbl>, Age_end <dbl>,
#   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>,
#   study_reliable <lgl>
# Country level: keep countries with more than 2 reliable studies
q7.country.level <- q7.study.level %>% 
  group_by(Country) %>% 
  summarise(n_reliable_study = sum(study_reliable), 
            sum_participants = sum(Individuals_surveyed), 
            participants_per_study = sum_participants/sum(study_reliable), 
            .groups = "drop") %>% 
  filter(n_reliable_study > 2) %>% 
  arrange(n_reliable_study) %>% 
  mutate(group_by_num.studies = 
           ifelse(n_reliable_study > 100, 
                  "high", 
                  ifelse(n_reliable_study >10, 
                         "medium",
                         "low")))
knitr::kable(q7.country.level, caption = "Countries with reliable data")
Countries with reliable data
Country n_reliable_study sum_participants participants_per_study group_by_num.studies
Cote D’Ivoire 3 348 116.00000 low
South Africa 4 642 160.50000 low
United Republic of Tanzania 7 1571 224.42857 low
Ethiopia 8 2892 361.50000 low
Nigeria 20 1470 73.50000 medium
Burundi 44 2628 59.72727 medium
Angola 52 3294 63.34615 medium
Malawi 66 3930 59.54545 medium
Eritrea 80 3214 40.17500 medium
Sierra Leone 147 8187 55.69388 high
Ghana 150 9036 60.24000 high
Senegal 207 11017 53.22222 high
Philippines 232 24916 107.39655 high
Uganda 1372 95011 69.25000 high
# Plot the number of reliable studies per country, with point size and colour showing average participants per study
ggplot(q7.country.level, 
       aes(x = Country, 
           y = n_reliable_study, 
           size = participants_per_study, 
           color = participants_per_study)) + 
  geom_point() + 
  facet_wrap (~ group_by_num.studies, 
              scales = "free") + 
  labs(title = "Reliable studies vs participants per study", 
       x = "Country", 
       y = "Number of reliable studies", 
       size = "Avg participants per study", 
       color = "Avg participants per study") + 
  theme(axis.text.x = 
          element_text(angle = 45, 
                       hjust = 1, 
                       size = 8)) + 
  scale_color_gradient(low = "lightblue", 
                       high = "darkblue")

2.7.3 Results:

After applying the reliability criteria, 14 of the 19 countries qualified for detailed epidemiological investigation. In the figure, countries represented by large, dark dots have the largest average sample size per study, while those with a moderate to high number of reliable studies are good candidates for evaluating control efforts. Countries with both substantial sample sizes and repeated studies form the best foundation for further research into disease patterns and transmission dynamics within their populations.
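In the country-level code above, `sum_participants` adds up participants from every study, while the denominator counts only the reliable ones, so small studies can inflate `participants_per_study`. One possible refinement, sketched here on made-up data with illustrative column names, restricts the numerator to reliable studies as well.

```r
library(dplyr)
library(tibble)

# Made-up studies: country A has three reliable studies, B has one.
toy_studies <- tibble(
  Country = c("A", "A", "A", "A", "B", "B"),
  Individuals_surveyed = c(100, 50, 40, 10, 200, 20)
)

toy_country_level <- toy_studies %>%
  mutate(study_reliable = Individuals_surveyed > 29) %>%
  group_by(Country) %>%
  summarise(
    n_reliable_study = sum(study_reliable),
    # Count participants from reliable studies only
    reliable_participants = sum(Individuals_surveyed[study_reliable]),
    participants_per_reliable_study = reliable_participants / n_reliable_study,
    .groups = "drop"
  ) %>%
  filter(n_reliable_study > 2)
toy_country_level
```

For country A, the 10-person study is excluded from both the count and the participant total, so the average reflects only the studies that passed the size filter.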

Note: I published this report to RPubs so that it can be shared online: https://rpubs.com/nhatlinh0406/1344350