Rows: 989 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): Region, Country, ISO_code, ADM1
dbl (9): Latitude, Longitude, Year_start, Year_end, Age_start, Age_end, Indi...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 589 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): Region, Country, ISO_code, ADM1
dbl (9): Latitude, Longitude, Year_start, Year_end, Age_start, Age_end, Indi...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1000 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): Region, Country, ISO_code, ADM1
dbl (9): Latitude, Longitude, Year_start, Year_end, Age_start, Age_end, Indi...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2,578 × 14
disease Region Country ISO_code ADM1 Latitude Longitude Year_start Year_end
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 ascaris Africa Angola AO Bengo -8.59 13.6 2010 2010
2 ascaris Africa Angola AO Bengo -8.63 13.7 2010 2010
3 ascaris Africa Angola AO Bengo -8.61 13.6 2010 2010
4 ascaris Africa Angola AO Bengo -8.62 14.2 2010 2010
5 ascaris Africa Angola AO Bengo -8.53 13.7 2010 2010
6 ascaris Africa Angola AO Bengo -8.60 13.6 2010 2010
7 ascaris Africa Angola AO Bengo -8.63 14.0 2010 2010
8 ascaris Africa Angola AO Bengo -8.62 13.8 2010 2010
9 ascaris Africa Angola AO Bengo -8.64 14.0 2010 2010
10 ascaris Africa Angola AO Bengo -8.60 13.6 2010 2010
# ℹ 2,568 more rows
# ℹ 5 more variables: Age_start <dbl>, Age_end <dbl>,
# Individuals_surveyed <dbl>, Number_Positives <dbl>, Prevalence <dbl>
##Q1. Which region has the highest prevalence for each disease?
This question identifies regional hotspots and provides an initial geographical overview of the disease burden.
#Group by region and disease, then summarise the mean prevalence and arrange it from highest to lowest
q1.region.disease.tb <- combined.mastertable %>%
  group_by(Region, disease) %>%
  summarise(mean_prev = mean(Prevalence, na.rm = TRUE), n_sites = n()) %>%
  arrange(desc(mean_prev))
`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.
q1.region.disease.tb
# A tibble: 6 × 4
# Groups: Region [3]
Region disease mean_prev n_sites
<chr> <chr> <dbl> <int>
1 East and Southeast Asia ascaris 0.288 133
2 Africa hookworm 0.233 880
3 Africa schisto 0.212 589
4 Central and South Asia hookworm 0.119 2
5 East and Southeast Asia hookworm 0.113 118
6 Africa ascaris 0.0736 856
#Draw a column chart to compare the mean of prevalence of each disease in each region
ggplot(q1.region.disease.tb, aes(x = Region, y = mean_prev, fill = disease)) +
  geom_col(position = "dodge") +
  labs(title = "Mean prevalence by Region and Disease", y = "Mean prevalence") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Results show that East and Southeast Asia has the highest mean prevalence of ascaris (0.288), while Africa has the highest mean prevalence of both hookworm (0.233) and schistosomiasis (0.212). Hookworm is the most widespread of the three diseases, appearing in all three regions, and schisto appears only in Africa.
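Sorting the whole table answers the question only indirectly. A minimal sketch of how the top region per disease could be pulled out directly, assuming dplyr 1.0.0 or later for `slice_max()`. The tibble re-enters the six printed rows by hand, and the names `q1.printed` and `q1.top.region` are made up for this illustration:

```r
library(dplyr)
library(tibble)

# Values copied by hand from the q1.region.disease.tb output above
q1.printed <- tribble(
  ~Region,                   ~disease,   ~mean_prev, ~n_sites,
  "East and Southeast Asia", "ascaris",  0.288,      133,
  "Africa",                  "hookworm", 0.233,      880,
  "Africa",                  "schisto",  0.212,      589,
  "Central and South Asia",  "hookworm", 0.119,      2,
  "East and Southeast Asia", "hookworm", 0.113,      118,
  "Africa",                  "ascaris",  0.0736,     856
)

# For each disease, keep only the region with the highest mean prevalence
q1.top.region <- q1.printed %>%
  group_by(disease) %>%
  slice_max(mean_prev, n = 1) %>%
  ungroup()

q1.top.region
```

With the printed values this keeps East and Southeast Asia for ascaris and Africa for both hookworm and schisto. The `n_sites` column is worth reading alongside the means, since the Central and South Asia row rests on only 2 sites.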
##Q2. Are there geographic hotspots at finer resolution (latitude and longitude)?
Building on Q1, this question moves from regions to specific locations, where hotspots can reveal environmental and socio-economic drivers.
#Create the Q2 table: mean prevalence for each combination of latitude, longitude and disease
q2.longid.latid.tb <- combined.mastertable %>%
  group_by(Longitude, Latitude, disease) %>%
  summarise(mean_prev = mean(Prevalence, na.rm = TRUE)) %>%
  filter(mean_prev > 0)
`summarise()` has grouped output by 'Longitude', 'Latitude'. You can override
using the `.groups` argument.
#Draw a scatterplot of each disease by geographic coordinates
ggplot(q2.longid.latid.tb, aes(x = Longitude, y = Latitude, color = mean_prev, size = mean_prev)) +
  geom_point() +
  scale_color_gradient(low = "lightyellow", high = "red") +
  facet_wrap(~ disease) +
  labs(title = "Geographic distribution of helminth prevalence", x = "Longitude", y = "Latitude", color = "mean prevalence", size = "mean prevalence")
All three helminth infections concentrate within tropical latitudes (-20 to +20) but differ in longitudinal spread. Ascaris appears in two separate longitude zones. Hookworm is also widely distributed but is concentrated between longitudes -20 and +50. Schistosomiasis is more geographically restricted (longitudes 0 to +40).
##Q3. How has the prevalence of hookworm infection changed over time across regions?
Q1 and Q2 results show hookworm is the only disease that appears in all 3 regions. Understanding temporal trends in hookworm infection allows us to evaluate how successful control programs have been across regions and to highlight possible hotspots that require further attention.
#Filter hookworm data in all regions and summarise mean prevalence by region and year
q3.hookworm <- combined.mastertable %>%
  filter(disease == "hookworm") %>%
  group_by(Region, Year_start) %>%
  summarise(mean_prev = mean(Prevalence, na.rm = TRUE))
`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.
q3.hookworm
# A tibble: 26 × 3
# Groups: Region [3]
Region Year_start mean_prev
<chr> <dbl> <dbl>
1 Africa 1997 0.0992
2 Africa 1998 0.493
3 Africa 1999 0.648
4 Africa 2001 0.136
5 Africa 2002 0.176
6 Africa 2003 0.545
7 Africa 2004 0.712
8 Africa 2005 0.276
9 Africa 2006 0.0435
10 Africa 2007 0.220
# ℹ 16 more rows
#Plot trend of prevalence over time by region
ggplot(q3.hookworm, aes(x = Year_start, y = mean_prev, colour = Region)) +
  geom_point() +
  geom_line() +
  labs(title = "Trend of Hookworm prevalence by region", x = "Year of study", y = "mean_prev") +
  scale_x_continuous(breaks = unique(q3.hookworm$Year_start)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Hookworm prevalence in Africa fluctuates strongly over time, likely reflecting inconsistent control or reporting, while East and Southeast Asia shows a sharp peak around 2006 followed by a decline, suggesting more effective interventions. Data from Central and South Asia are too sparse for interpretation.
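Strong year-to-year swings can also arise when a yearly mean rests on only a handful of survey sites. A hedged sketch of how the number of sites behind each mean could be reported alongside it. The toy rows are invented purely for illustration and are not taken from the dataset, and `toy.hookworm` and `q3.with.counts` are hypothetical names:

```r
library(dplyr)
library(tibble)

# Invented toy rows using the same column names as combined.mastertable
toy.hookworm <- tribble(
  ~Region,  ~Year_start, ~Prevalence,
  "Africa", 1998,        0.60,
  "Africa", 1998,        0.40,
  "Africa", 1999,        0.65,
  "Africa", 2001,        0.10,
  "Africa", 2001,        NA,
  "Africa", 2001,        0.20
)

q3.with.counts <- toy.hookworm %>%
  group_by(Region, Year_start) %>%
  summarise(
    mean_prev = mean(Prevalence, na.rm = TRUE),
    n_sites   = sum(!is.na(Prevalence)),  # sites that actually contribute to the mean
    .groups   = "drop"                    # return an ungrouped tibble, no grouping message
  )

q3.with.counts
```

Years backed by very few sites could then be flagged, or drawn with smaller points by mapping `size = n_sites` in `geom_point()`, before a trend is read into them.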
##Q4. Which countries are affected by which helminth diseases, and how many diseases co-occur in the same country?
In helminth epidemiology, it is important to ask whether different infections occur together in the same populations. If they do, interventions should be integrated rather than targeting a single parasite. This question provides evidence for designing combined control programs.
#Table showing whether each disease has data in each country ("X" = present, "O" = absent)
q4.country.disease.tb <- combined.mastertable %>%
  group_by(Country, disease) %>%
  summarise(has_data = "X", .groups = "drop") %>%
  pivot_wider(names_from = disease, values_from = has_data, values_fill = "O")
q4.country.disease.tb
# A tibble: 19 × 4
Country ascaris hookworm schisto
<chr> <chr> <chr> <chr>
1 Angola X X O
2 Burundi X X O
3 Cameroon X O O
4 China X X O
5 Cote D'Ivoire X X O
6 Democratic Republic of the Congo O O X
7 Eritrea X X O
8 Ethiopia O X X
9 Ghana X X O
10 Malawi X X O
11 Nepal O X O
12 Nigeria X X O
13 Philippines X X O
14 Senegal X X O
15 Sierra Leone X X X
16 South Africa O X O
17 Uganda X X X
18 United Republic of Tanzania O X X
19 Zambia O O X
This table shows that these parasitic diseases rarely appear alone: 12 of the 19 countries report two diseases and 2 report all three, so multi-targeted treatment options are needed. Meanwhile, schisto is the least widespread of the three (6 countries) and more often appears without the other two.
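The number of diseases recorded per country can be counted directly from the presence table. A small sketch, with the 19 rows re-entered by hand from the output above; `q4.printed` and `q4.n.diseases` are names introduced only for this illustration:

```r
library(dplyr)
library(tibble)

# Presence table copied by hand from the output above ("X" = data present)
q4.printed <- tribble(
  ~Country,                           ~ascaris, ~hookworm, ~schisto,
  "Angola",                           "X", "X", "O",
  "Burundi",                          "X", "X", "O",
  "Cameroon",                         "X", "O", "O",
  "China",                            "X", "X", "O",
  "Cote D'Ivoire",                    "X", "X", "O",
  "Democratic Republic of the Congo", "O", "O", "X",
  "Eritrea",                          "X", "X", "O",
  "Ethiopia",                         "O", "X", "X",
  "Ghana",                            "X", "X", "O",
  "Malawi",                           "X", "X", "O",
  "Nepal",                            "O", "X", "O",
  "Nigeria",                          "X", "X", "O",
  "Philippines",                      "X", "X", "O",
  "Senegal",                          "X", "X", "O",
  "Sierra Leone",                     "X", "X", "X",
  "South Africa",                     "O", "X", "O",
  "Uganda",                           "X", "X", "X",
  "United Republic of Tanzania",      "O", "X", "X",
  "Zambia",                           "O", "O", "X"
)

# Each comparison is TRUE/FALSE, so adding them counts the diseases present
q4.n.diseases <- q4.printed %>%
  mutate(n_diseases = (ascaris == "X") + (hookworm == "X") + (schisto == "X"))

# Number of countries reporting 1, 2 or 3 diseases
q4.n.diseases %>% count(n_diseases)
```

Because "X" only marks that survey data exist, a count of 2 or 3 indicates co-occurrence at country level, not co-infection in the same individuals.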
##Q5. Is there a correlation between pairs of these diseases?
Q4 shows that several diseases can be present in the same country. When diseases co-occur, integrated control may prove more effective than managing each disease in isolation. This analysis quantifies how often each pair of diseases co-occurs: correlation_ratio = (number of countries with both diseases) / (number of countries with at least one of the two diseases)
#Create a table summarising how often each pair of diseases co-occurs
q5.co_occur <- q4.country.disease.tb %>%
  summarise(
    Ascaris_Hookworm = sum(ascaris == "X" & hookworm == "X") / sum(ascaris == "X" | hookworm == "X"),
    Hookworm_Schisto = sum(schisto == "X" & hookworm == "X") / sum(schisto == "X" | hookworm == "X"),
    Ascaris_Schisto = sum(schisto == "X" & ascaris == "X") / sum(schisto == "X" | ascaris == "X")
  ) %>%
  pivot_longer(cols = c("Ascaris_Hookworm", "Hookworm_Schisto", "Ascaris_Schisto"), names_to = "Pair", values_to = "Correlation_ratio")
q5.co_occur
#Draw a bar chart of the co-occurrence ratio of each disease pair
ggplot(q5.co_occur) +
  geom_col(aes(x = reorder(Pair, Correlation_ratio), y = Correlation_ratio, fill = Pair)) +
  labs(title = "Co-occurrence rate between disease pairs", x = "Disease pairs", y = "Correlation ratio") +
  scale_fill_brewer(palette = 7)
This table confirms that, with a correlation ratio of 71%, ascaris and hookworm have a strong tendency to co-occur in the same country, while schisto co-occurs with either of them much less often (less than 30%). Therefore, areas where ascaris is endemic should be checked for concurrent hookworm infection, and vice versa.
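The ratio used here has the same form as the Jaccard index (size of the intersection divided by size of the union), which measures overlap between sets rather than statistical correlation. As a check, it can be recomputed by hand from counts read off the Q4 table. The helper `jaccard_from_counts()` is a hypothetical name introduced for this sketch:

```r
# Counts read off the Q4 presence table (number of countries marked "X")
n_ascaris  <- 13
n_hookworm <- 16
n_schisto  <- 6

# Intersection over union, using |A or B| = |A| + |B| - |A and B|
jaccard_from_counts <- function(n_a, n_b, n_both) {
  n_both / (n_a + n_b - n_both)
}

ratios <- c(
  Ascaris_Hookworm = jaccard_from_counts(n_ascaris, n_hookworm, n_both = 12),
  Hookworm_Schisto = jaccard_from_counts(n_hookworm, n_schisto, n_both = 4),
  Ascaris_Schisto  = jaccard_from_counts(n_ascaris, n_schisto, n_both = 2)
)

round(ratios, 3)
```

This gives 12/17 (about 0.71) for ascaris and hookworm, 4/18 (about 0.22) for hookworm and schisto, and 2/17 (about 0.12) for ascaris and schisto, consistent with the 71% and "less than 30%" figures quoted above.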
##Q6. How does the prevalence of helminth infections change over time in countries where all three diseases are present?
Assessing helminth prevalence over time is essential for evaluating control strategies like deworming and sanitation. To avoid bias, we only include nations with comprehensive Ascaris, Hookworm, and Schistosoma data. This way, comparisons across diseases and over time are meaningful and balanced.
#Filter countries with all 3 diseases and more than 1 survey year
q6.countries.all3 <- combined.mastertable %>%
  group_by(Country) %>%
  filter(n_distinct(disease) == 3 & n_distinct(Year_start) > 1) %>%
  ungroup()
#Summarise data
q6.summarise <- q6.countries.all3 %>%
  group_by(Country, disease, Year_start) %>%
  summarise(mean_prev = mean(Prevalence)) %>%
  ungroup()
`summarise()` has grouped output by 'Country', 'disease'. You can override
using the `.groups` argument.
#Plot trend
ggplot(q6.summarise, aes(x = Year_start, y = mean_prev, color = disease)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ Country) +
  labs(title = "Trend of helminth prevalence over time", x = "Year of study", y = "Mean prevalence") +
  scale_x_continuous(breaks = unique(q6.summarise$Year_start)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Results from Q4 indicate that ascaris and hookworm are the most likely to co-occur, while schisto is more independent. On the other hand, the overall trend of hookworm and schistosomiasis in Uganda is similar. Interestingly, in 2005, ascaris prevalence peaked while the other two diseases sharply declined, suggesting a possible anomaly that warrants further investigation.
##Q7. Which studies/countries have relatively reliable data?
A reliability assessment helps eliminate noise and provides a basis for the findings above. I consider a study reliable if it has more than 29 participants, and I require each country to have more than 2 such studies.
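The code that builds `q7.country.level` is not shown in this report. The following is only a sketch of how such a table could be derived under the two thresholds stated above, assuming that each row of the data is one study and that `Individuals_surveyed` is its number of participants. The toy rows and the bin boundaries passed to `cut()` are invented for illustration, and the author's actual definitions may differ:

```r
library(dplyr)
library(tibble)

# Invented toy rows; not taken from the dataset
toy.studies <- tribble(
  ~Country, ~Individuals_surveyed,
  "Uganda", 120,
  "Uganda", 45,
  "Uganda", 30,
  "Uganda", 12,
  "Nepal",  200,
  "Nepal",  25,
  "Zambia", 60,
  "Zambia", 60,
  "Zambia", 90
)

q7.sketch <- toy.studies %>%
  filter(Individuals_surveyed > 29) %>%   # reliable study: more than 29 participants
  group_by(Country) %>%
  summarise(
    n_reliable_study       = n(),
    participants_per_study = mean(Individuals_surveyed),
    .groups = "drop"
  ) %>%
  filter(n_reliable_study > 2) %>%        # keep countries with more than 2 reliable studies
  mutate(
    # Hypothetical bins, used only to facet countries with similar study counts
    group_by_num.studies = cut(n_reliable_study,
                               breaks = c(2, 10, 100, Inf),
                               labels = c("3-10", "11-100", ">100"))
  )

q7.sketch
```

In this toy input Nepal is dropped because only one of its studies passes the participant threshold, while Uganda and Zambia each keep three reliable studies.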
#draw a graphic to visualize the data
ggplot(q7.country.level, aes(x = Country, y = n_reliable_study, size = participants_per_study, color = participants_per_study)) +
  geom_point() +
  facet_wrap(~ group_by_num.studies, scales = "free") +
  labs(title = "Reliable studies vs participants per study", x = "Country", y = "Number of reliable studies", size = "Avg participants per study", color = "Avg participants per study") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) +
  scale_color_gradient(low = "lightblue", high = "darkblue")
After filtering, only 14 countries satisfy the reliability conditions, so it is possible to delve into detailed epidemiological studies in these countries to draw the most accurate conclusions. Countries with large, dark dots are the most reliable due to large sample sizes. Countries with a medium to high number of reliable studies are suitable for evaluating the effectiveness of epidemiological control programs. Meanwhile, countries with large sample sizes per study may provide stronger foundations for in-depth analyses of disease pathology and transmission dynamics.