Helminth Prevalence Homework

Author

Nguyen Nhat Linh - R14B48009

Published

2025-09-11

#R14B48009_Helminth Prevalence_Homework_09-11

##Setup_Input data

# Load necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Import Ascaris data
ascaris_data_url <- 'https://github.com/WGCAdvancedCourses/Helminths_2021/raw/main/manuals/module_5_R/Ascaris_prevalence_data.txt'
ascaris_data <- read_delim(file = ascaris_data_url, delim ="\t")
Rows: 989 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): Region, Country, ISO_code, ADM1
dbl (9): Latitude, Longitude, Year_start, Year_end, Age_start, Age_end, Indi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import Schistosoma data
schisto_data_url <- 'https://github.com/WGCAdvancedCourses/Helminths_2021/raw/main/manuals/module_5_R/Schistosoma_mansoni_prevalence_data.txt'
schisto_data <- read_delim(file = schisto_data_url, delim ="\t")
Rows: 589 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): Region, Country, ISO_code, ADM1
dbl (9): Latitude, Longitude, Year_start, Year_end, Age_start, Age_end, Indi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import Hookworm data
hookworm_data_url <- 'https://github.com/WGCAdvancedCourses/Helminths_2021/raw/main/manuals/module_5_R/Hookworms_prevalence_data.txt'
hookworm_data <- read_delim(file = hookworm_data_url, delim ="\t")
Rows: 1000 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): Region, Country, ISO_code, ADM1
dbl (9): Latitude, Longitude, Year_start, Year_end, Age_start, Age_end, Indi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
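
The readr messages above mention `show_col_types = FALSE` as a way to quiet the column specification output. A minimal sketch of that option, using a made-up two-row table passed as literal text (wrapped in `I()`) so that it runs without a download; `toy_text` and `toy_data` are hypothetical names, not part of the homework:

```r
library(readr)

# Made-up tab-separated text standing in for one of the downloaded files
toy_text <- "Country\tPrevalence\nAngola\t0.10\nUganda\t0.25\n"

# I() marks the string as literal data rather than a file path;
# show_col_types = FALSE suppresses the column specification message
toy_data <- read_delim(I(toy_text), delim = "\t", show_col_types = FALSE)
toy_data
```

The same argument could be added to the three `read_delim()` calls above if a quieter report is preferred.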

##Combine to a master table

combined.mastertable <- bind_rows("ascaris" = ascaris_data,"schisto" = schisto_data, "hookworm" = hookworm_data, .id = "disease")
combined.mastertable
# A tibble: 2,578 × 14
   disease Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
   <chr>   <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
 1 ascaris Africa Angola  AO       Bengo    -8.59      13.6       2010     2010
 2 ascaris Africa Angola  AO       Bengo    -8.63      13.7       2010     2010
 3 ascaris Africa Angola  AO       Bengo    -8.61      13.6       2010     2010
 4 ascaris Africa Angola  AO       Bengo    -8.62      14.2       2010     2010
 5 ascaris Africa Angola  AO       Bengo    -8.53      13.7       2010     2010
 6 ascaris Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
 7 ascaris Africa Angola  AO       Bengo    -8.63      14.0       2010     2010
 8 ascaris Africa Angola  AO       Bengo    -8.62      13.8       2010     2010
 9 ascaris Africa Angola  AO       Bengo    -8.64      14.0       2010     2010
10 ascaris Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
# ℹ 2,568 more rows
# ℹ 5 more variables: Age_start <dbl>, Age_end <dbl>,
#   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>

##Q1. Which region has the highest prevalence for each disease?

This question identifies regional hotspots and provides an initial geographical overview of the disease burden.

#Group by region and disease, summarise the mean prevalence and number of sites, then arrange from highest to lowest
q1.region.disease.tb <- combined.mastertable %>% group_by(Region, disease) %>% summarise(mean_prev = mean(Prevalence, na.rm = TRUE), n_sites = n()) %>% arrange(desc(mean_prev))
`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.
q1.region.disease.tb
# A tibble: 6 × 4
# Groups:   Region [3]
  Region                  disease  mean_prev n_sites
  <chr>                   <chr>        <dbl>   <int>
1 East and Southeast Asia ascaris     0.288      133
2 Africa                  hookworm    0.233      880
3 Africa                  schisto     0.212      589
4 Central and South Asia  hookworm    0.119        2
5 East and Southeast Asia hookworm    0.113      118
6 Africa                  ascaris     0.0736     856
#Draw a column chart to compare the mean of prevalence of each disease in each region 
ggplot(q1.region.disease.tb, aes(x = Region, y = mean_prev, fill = disease)) + geom_col (position = "dodge") + labs(title = "Mean prevalence by Region and Disease", y = "Mean prevalence") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

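The results show that East and Southeast Asia has the highest mean ascaris prevalence (0.288), while Africa has the highest mean prevalence of both hookworm (0.233) and schistosomiasis (0.212). Hookworm is the most widespread of the 3 diseases, being the only one recorded in all three regions, and schisto is recorded only in Africa. The Central and South Asia value is based on only 2 sites, so it should be read with caution.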
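
The Q1 table averages site-level prevalence, so a site with few participants counts as much as a large one. A minimal sketch of the alternative, a pooled prevalence weighted by sample size, on made-up numbers. It assumes that Prevalence equals Number_Positives divided by Individuals_surveyed, which the data description here does not state; `toy_sites` and `toy_summary` are hypothetical names:

```r
library(dplyr)

# Made-up sites (not taken from the survey files)
toy_sites <- data.frame(
  Region               = c("A", "A", "B", "B"),
  Individuals_surveyed = c(10, 990, 100, 100),
  Number_Positives     = c(5, 99, 20, 40)
) %>%
  mutate(Prevalence = Number_Positives / Individuals_surveyed)

toy_summary <- toy_sites %>%
  group_by(Region) %>%
  summarise(
    mean_prev   = mean(Prevalence),                                  # unweighted mean of site values
    pooled_prev = sum(Number_Positives) / sum(Individuals_surveyed), # weighted by sample size
    n_sites     = n(),
    .groups     = "drop"
  )
toy_summary
```

In region A the two measures differ (0.3 versus 0.104) because the small site has a much higher prevalence than the large one. In region B they agree because both sites have the same size.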

##Q2. Are there geographic hotspots at finer resolution (latitude and longitude)?

Building on Q1 by moving from regions to specific locations, hotspots can reveal environmental and socio-economic drivers.

#Create the q2 table: mean prevalence for each combination of longitude, latitude and disease, keeping only sites with prevalence > 0
q2.longid.latid.tb <- combined.mastertable %>% group_by(Longitude, Latitude, disease) %>% summarise(mean_prev = mean(Prevalence, na.rm = TRUE)) %>% filter(mean_prev > 0)
`summarise()` has grouped output by 'Longitude', 'Latitude'. You can override
using the `.groups` argument.
q2.longid.latid.tb
# A tibble: 1,584 × 4
# Groups:   Longitude, Latitude [941]
   Longitude Latitude disease  mean_prev
       <dbl>    <dbl> <chr>        <dbl>
 1     -17.4     14.8 ascaris       0.06
 2     -17.4     14.7 ascaris       0.12
 3     -17.4     14.8 ascaris       0.36
 4     -17.4     14.8 ascaris       0.62
 5     -17.4     14.8 ascaris       0.48
 6     -17.4     14.7 ascaris       0.36
 7     -17.4     14.8 ascaris       0.38
 8     -17.4     14.7 ascaris       0.4 
 9     -17.4     14.8 ascaris       0.54
10     -17.4     14.8 hookworm      0.02
# ℹ 1,574 more rows
#Draw a scatterplot of each disease on the geographic map
ggplot(q2.longid.latid.tb, aes (x = Longitude, y = Latitude, color = mean_prev, size = mean_prev)) + geom_point() + scale_color_gradient(low = "lightyellow", high = "red") + facet_wrap(~ disease) + labs (title = "Geographic distribution of helminth prevalence", x = "Longitude", y = "Latitude", color = "mean prevalence", size = "mean prevalence")

All three helminth infections are concentrated within tropical latitudes (-20 to +20), but they differ in longitudinal spread. Ascaris appears in two separate longitude zones. Hookworm is also widely distributed but is concentrated between longitudes -20 and +50. Schisto is more geographically restricted (longitude 0 to +40).
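
Besides the map, hotspots can also be listed directly by taking the highest-prevalence sites per disease. A minimal sketch on made-up coordinates and values, shaped like the Q2 table; `toy_sites` and `top_sites` are hypothetical names, and the choice of 2 sites per disease is arbitrary:

```r
library(dplyr)

# Made-up site summaries with the same columns as the Q2 table
toy_sites <- data.frame(
  Longitude = c(1, 2, 3, 4, 5, 6),
  Latitude  = c(10, 11, 12, 13, 14, 15),
  disease   = c("ascaris", "ascaris", "ascaris", "hookworm", "hookworm", "hookworm"),
  mean_prev = c(0.10, 0.62, 0.36, 0.02, 0.45, 0.30)
)

# Keep the 2 sites with the highest mean prevalence for each disease
top_sites <- toy_sites %>%
  group_by(disease) %>%
  slice_max(mean_prev, n = 2, with_ties = FALSE) %>%
  ungroup()
top_sites
```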

##Q3. How has the prevalence of Hookworm infection changed over time across regions?

Q1 and Q2 show that hookworm is the only disease recorded in all 3 regions. Understanding the temporal trends of hookworm infection helps evaluate how successful control programs have been across different regions and highlights possible hotspots that require further attention.

#Filter hookworm data in all regions, then summarise mean prevalence by region and year
q3.hookworm <- combined.mastertable %>% filter(disease == "hookworm") %>% group_by(Region, Year_start) %>% summarise(mean_prev = mean(Prevalence, na.rm = TRUE))
`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.
q3.hookworm
# A tibble: 26 × 3
# Groups:   Region [3]
   Region Year_start mean_prev
   <chr>       <dbl>     <dbl>
 1 Africa       1997    0.0992
 2 Africa       1998    0.493 
 3 Africa       1999    0.648 
 4 Africa       2001    0.136 
 5 Africa       2002    0.176 
 6 Africa       2003    0.545 
 7 Africa       2004    0.712 
 8 Africa       2005    0.276 
 9 Africa       2006    0.0435
10 Africa       2007    0.220 
# ℹ 16 more rows
#Plot trend of prevalence over time by region
ggplot(q3.hookworm, aes(x = Year_start, y = mean_prev, colour = Region)) + geom_point() + geom_line() + labs(title = "Trend of Hookworm prevalence by region", x = "Year of study", y = "mean_prev") + scale_x_continuous(breaks = unique(q3.hookworm$Year_start)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

Hookworm prevalence in Africa fluctuates strongly over time, likely reflecting inconsistent control or reporting, while East and Southeast Asia shows a sharp peak around 2006 followed by a decline, suggesting more effective interventions. Data from Central and South Asia are too sparse for interpretation.
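
Part of the year-to-year fluctuation may simply reflect how few sites were surveyed in some years. A minimal sketch, on made-up rows, of carrying the number of sites alongside the yearly mean (as `n_sites` was in Q1) and dropping records with a missing start year before plotting; `toy_hookworm` and `toy_trend` are hypothetical names:

```r
library(dplyr)

# Made-up hookworm records; the last row has no start year
toy_hookworm <- data.frame(
  Region     = c("Africa", "Africa", "Africa", "Africa", "Africa"),
  Year_start = c(2004, 2005, 2005, 2005, NA),
  Prevalence = c(0.70, 0.20, 0.30, 0.40, 0.50)
)

toy_trend <- toy_hookworm %>%
  filter(!is.na(Year_start)) %>%               # rows without a year cannot be placed on the x axis
  group_by(Region, Year_start) %>%
  summarise(
    mean_prev = mean(Prevalence, na.rm = TRUE),
    n_sites   = n(),                           # how many sites support each yearly mean
    .groups   = "drop"
  )
toy_trend
```

Here the 2004 mean rests on a single site while the 2005 mean rests on three, so the apparent drop between the two years deserves less weight than the plot alone would suggest.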

##Q4. Which countries are affected by which helminth diseases, and how many diseases co-occur in the same country?

In helminth epidemiology, it is important to ask whether different infections occur together in the same populations. If they do, interventions should be integrated rather than targeting a single parasite. It provides evidence for designing combined control programs.

#Table showing whether each disease has data in each country (X = present, O = absent)
q4.country.disease.tb <- combined.mastertable %>% group_by(Country, disease) %>% summarise(has_data = "X", .groups = "drop") %>% pivot_wider(names_from = disease, values_from = has_data, values_fill = "O")
q4.country.disease.tb
# A tibble: 19 × 4
   Country                          ascaris hookworm schisto
   <chr>                            <chr>   <chr>    <chr>  
 1 Angola                           X       X        O      
 2 Burundi                          X       X        O      
 3 Cameroon                         X       O        O      
 4 China                            X       X        O      
 5 Cote D'Ivoire                    X       X        O      
 6 Democratic Republic of the Congo O       O        X      
 7 Eritrea                          X       X        O      
 8 Ethiopia                         O       X        X      
 9 Ghana                            X       X        O      
10 Malawi                           X       X        O      
11 Nepal                            O       X        O      
12 Nigeria                          X       X        O      
13 Philippines                      X       X        O      
14 Senegal                          X       X        O      
15 Sierra Leone                     X       X        X      
16 South Africa                     O       X        O      
17 Uganda                           X       X        X      
18 United Republic of Tanzania      O       X        X      
19 Zambia                           O       O        X      

The table shows that these parasitic diseases rarely occur alone: 14 of the 19 countries report two or even all three of them, so multi-targeted treatment options are needed. Meanwhile, schisto is recorded in the fewest countries (6 of 19) and is the disease most likely to appear on its own (2 of those 6 countries).

##Q5. Is there a correlation between these diseases (pairs)?

The Q4 table shows that several diseases can be present within one country. When diseases co-occur, integrated control may prove more effective than managing each disease in isolation. This analysis quantifies how often each pair of diseases co-occurs at country level: correlation_ratio = (number of countries with both diseases) / (number of countries with at least one of the two diseases). This ratio is the Jaccard index of the two sets of countries, so it measures overlap rather than statistical correlation.

#Create a table summarizing information about the frequency of occurrence of 2 diseases in pairs:
q5.co_occur <- q4.country.disease.tb %>% summarise(Ascaris_Hookworm = sum(ascaris == "X" & hookworm == "X")/sum(ascaris == "X" | hookworm == "X"), Hookworm_Schisto = sum(schisto == "X" & hookworm == "X")/sum(schisto == "X" | hookworm == "X"), Ascaris_Schisto = sum(schisto == "X" & ascaris == "X")/sum(schisto == "X" | ascaris == "X")) %>% pivot_longer(cols = c("Ascaris_Hookworm","Hookworm_Schisto","Ascaris_Schisto"), names_to = "Pair", values_to = "Correlation_ratio")
q5.co_occur
# A tibble: 3 × 2
  Pair             Correlation_ratio
  <chr>                        <dbl>
1 Ascaris_Hookworm             0.706
2 Hookworm_Schisto             0.222
3 Ascaris_Schisto              0.118
#Draw a graph illustrating the co-occurrence ratio of each disease pair.
ggplot(q5.co_occur) + geom_col(aes(x = reorder(Pair, Correlation_ratio), y = Correlation_ratio, fill = Pair)) + labs(title = "Co-occurrence rate between disease pairs", x = "Disease pairs", y = "Correlation ratio") + scale_fill_brewer(palette = 7)

With a ratio of 71%, ascaris and hookworm are very often reported in the same countries, while the pairs involving schisto have much lower ratios (less than 30%). Therefore, areas with ascaris endemicity should be checked for concurrent hookworm infection and vice versa. Note that this is co-occurrence at country level, not evidence of co-infection in the same individuals.
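
The ratios can be checked by hand from the Q4 table. The sketch below transcribes the X/O columns of that table as logical vectors (same row order, Angola to Zambia) and applies the Q5 ratio with a small helper; `jaccard` is a hypothetical helper name, not a function from any package:

```r
# Presence (TRUE) or absence (FALSE) per country, transcribed from the Q4 table
ascaris  <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE,
              FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
hookworm <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE,
              TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
schisto  <- c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE,
              FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Countries with both diseases divided by countries with at least one
jaccard <- function(a, b) sum(a & b) / sum(a | b)

ratios <- c(
  Ascaris_Hookworm = jaccard(ascaris, hookworm),  # 12 / 17
  Hookworm_Schisto = jaccard(hookworm, schisto),  # 4 / 18
  Ascaris_Schisto  = jaccard(ascaris, schisto)    # 2 / 17
)
round(ratios, 3)
```

The rounded values match the `q5.co_occur` output (0.706, 0.222, 0.118).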

##Q6. How does the prevalence of helminth infections change over time in countries where all three diseases are present?

Assessing helminth prevalence over time is essential for evaluating control strategies like deworming and sanitation. To avoid bias, we only include nations with comprehensive Ascaris, Hookworm, and Schistosoma data. This way, comparisons across diseases and over time are meaningful and balanced.

#Keep countries that have all 3 diseases and more than 1 survey year
q6.countries.all3 <- combined.mastertable %>% group_by(Country) %>% filter(n_distinct(disease) == 3 & n_distinct(Year_start) > 1) %>% ungroup()

#Summarise data
q6.summarise <- q6.countries.all3 %>% group_by(Country, disease, Year_start) %>% summarise(mean_prev = mean(Prevalence)) %>% ungroup()
`summarise()` has grouped output by 'Country', 'disease'. You can override
using the `.groups` argument.
q6.summarise
# A tibble: 23 × 4
   Country disease  Year_start mean_prev
   <chr>   <chr>         <dbl>     <dbl>
 1 Uganda  ascaris        1998   0.0452 
 2 Uganda  ascaris        2002   0.0595 
 3 Uganda  ascaris        2003   0.0830 
 4 Uganda  ascaris        2005   0.247  
 5 Uganda  ascaris        2006   0.219  
 6 Uganda  ascaris        2008   0.0952 
 7 Uganda  ascaris        2009   0.00616
 8 Uganda  hookworm       1998   0.493  
 9 Uganda  hookworm       2002   0.575  
10 Uganda  hookworm       2003   0.545  
# ℹ 13 more rows
#Plot trend
ggplot(q6.summarise, aes(x = Year_start, y = mean_prev, color = disease)) + geom_line () + geom_point() + facet_wrap(~ Country) + labs(title = "Trend of helminth prevalence over time", x = "Year of study", y = "Mean prevalence") + scale_x_continuous(breaks = unique(q6.summarise$Year_start)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Results from Q4 and Q5 indicate that ascaris and hookworm are the most likely to co-occur, while schisto is more independent. In Uganda, however, the overall trends of hookworm and schistosomiasis are similar. Interestingly, in 2005, ascaris prevalence peaked while the other two diseases sharply declined, suggesting a possible anomaly that warrants further investigation.

##Q7. Which studies/countries have relatively reliable data?

Assessing reliability helps eliminate noise and provides a basis for the findings above. I treat a study as reliable if it has more than 29 participants, and a country as reliable if it has more than 2 such studies.
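
To illustrate why study size matters for this kind of threshold, the sketch below computes the standard error of an observed proportion, sqrt(p(1 - p)/n), for a few study sizes. The prevalence of 0.2 and the sizes are made-up values, and `se_prop` is a hypothetical helper name. This does not justify the cut-off of 29 in particular; it only shows how uncertainty shrinks as n grows:

```r
# Standard error of an observed proportion p estimated from n participants
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

p <- 0.2                    # hypothetical true prevalence
n <- c(10, 30, 100, 1000)   # hypothetical numbers of participants

se_table <- data.frame(n = n, se = round(se_prop(p, n), 3))
se_table
```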

#study level - investigation has >29 participants
q7.study.level <- combined.mastertable %>% mutate(study_reliable = Individuals_surveyed > 29)
q7.study.level
# A tibble: 2,578 × 15
   disease Region Country ISO_code ADM1  Latitude Longitude Year_start Year_end
   <chr>   <chr>  <chr>   <chr>    <chr>    <dbl>     <dbl>      <dbl>    <dbl>
 1 ascaris Africa Angola  AO       Bengo    -8.59      13.6       2010     2010
 2 ascaris Africa Angola  AO       Bengo    -8.63      13.7       2010     2010
 3 ascaris Africa Angola  AO       Bengo    -8.61      13.6       2010     2010
 4 ascaris Africa Angola  AO       Bengo    -8.62      14.2       2010     2010
 5 ascaris Africa Angola  AO       Bengo    -8.53      13.7       2010     2010
 6 ascaris Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
 7 ascaris Africa Angola  AO       Bengo    -8.63      14.0       2010     2010
 8 ascaris Africa Angola  AO       Bengo    -8.62      13.8       2010     2010
 9 ascaris Africa Angola  AO       Bengo    -8.64      14.0       2010     2010
10 ascaris Africa Angola  AO       Bengo    -8.60      13.6       2010     2010
# ℹ 2,568 more rows
# ℹ 6 more variables: Age_start <dbl>, Age_end <dbl>,
#   Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>,
#   study_reliable <lgl>
#country level - country has > 2 reliable studies
q7.country.level <- q7.study.level %>% group_by(Country) %>% summarise(n_reliable_study = sum(study_reliable), sum_participants = sum(Individuals_surveyed), participants_per_study = sum_participants/sum(study_reliable), .groups = "drop") %>% filter(n_reliable_study > 2) %>% arrange(n_reliable_study) %>% mutate(group_by_num.studies = ifelse(n_reliable_study > 100, "high", ifelse(n_reliable_study >10, "medium","low")))
q7.country.level
# A tibble: 14 × 5
   Country              n_reliable_study sum_participants participants_per_study
   <chr>                           <int>            <dbl>                  <dbl>
 1 Cote D'Ivoire                       3              348                  116  
 2 South Africa                        4              642                  160. 
 3 United Republic of …                7             1571                  224. 
 4 Ethiopia                            8             2892                  362. 
 5 Nigeria                            20             1470                   73.5
 6 Burundi                            44             2628                   59.7
 7 Angola                             52             3294                   63.3
 8 Malawi                             66             3930                   59.5
 9 Eritrea                            80             3214                   40.2
10 Sierra Leone                      147             8187                   55.7
11 Ghana                             150             9036                   60.2
12 Senegal                           207            11017                   53.2
13 Philippines                       232            24916                  107. 
14 Uganda                           1372            95011                   69.2
# ℹ 1 more variable: group_by_num.studies <chr>
#draw a graphic to visualize the data
ggplot(q7.country.level, aes(x = Country, y = n_reliable_study, size = participants_per_study, color = participants_per_study)) + geom_point() + facet_wrap (~ group_by_num.studies, scales = "free") + labs(title = "Reliable studies vs participants per study", x = "Country", y = "Number of reliable studies", size = "Avg participants per study", color = "Avg participants per study") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) + scale_color_gradient(low = "lightblue", high = "darkblue")

After filtering, only 14 countries satisfy the reliability conditions, so it is possible to delve into detailed epidemiological studies in these countries to draw the most accurate conclusions. Countries with large, dark dots are the most reliable due to large sample sizes. Countries with a medium to high number of reliable studies are suitable for evaluating the effectiveness of epidemiological control programs. Meanwhile, countries with large sample sizes per study may provide stronger foundations for in-depth analyses of disease pathology and transmission dynamics.
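
One detail worth noting: in the country-level summary, `sum_participants` adds up participants from all studies, while the divisor counts only reliable studies, so `participants_per_study` can be inflated when a country has many small studies. A possible refinement, sketched on made-up numbers, restricts the numerator to reliable studies as well; `toy_studies`, `toy_country` and the new column names are hypothetical:

```r
library(dplyr)

# Made-up studies for one country: one small study and three larger ones
toy_studies <- data.frame(
  Country              = c("X", "X", "X", "X"),
  Individuals_surveyed = c(10, 50, 60, 70)
)

toy_country <- toy_studies %>%
  mutate(study_reliable = Individuals_surveyed > 29) %>%
  group_by(Country) %>%
  summarise(
    n_reliable_study      = sum(study_reliable, na.rm = TRUE),
    participants_all      = sum(Individuals_surveyed, na.rm = TRUE),
    participants_reliable = sum(Individuals_surveyed[study_reliable], na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    per_study_original      = participants_all / n_reliable_study,      # as in the analysis above
    per_study_reliable_only = participants_reliable / n_reliable_study  # reliable studies only
  )
toy_country
```

With these numbers the original calculation gives about 63.3 participants per study, while the reliable-only version gives 60.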

Note: I used the Publish function (to Rpubs) to be able to share my work online via link: https://rpubs.com/nhatlinh0406/1344350