LF.DATA

THE DATA:

The data were obtained from the World Health Organization (WHO) website, under the section ‘Control of Neglected Tropical Diseases’. From the data platforms and tools tab, we selected ‘preventative chemotherapy’, opened the PCT Databank, and downloaded the data as an .xlsx file. After opening it in Excel, we realized the file contained too many data points to analyze easily with any Microsoft Office program. After much trial and error, we loaded the data into RStudio and worked in an R Markdown file, first and foremost to clean it. Don’t forget to set your working directory! What follows is the cleaning of our file:

# At first glance, the data frame contained many 'N/A' values, so we removed the rows containing them. We kept the zeros because those rows still had non-zero values for 'Population requiring PC for LF'.
df1 <- drop_na(df) %>% arrange(desc(Total.population.of.IUs)) 
#wpop1 <- wpop %>% rename('country' = 'Country Name')
df1<- df1 %>% rename('country' = 'Country')
#joined<- df1 %>% left_join( y= wpop) %>% rename('pop' = '2023')
#AFR <- joined %>% filter(Region == 'AFR')

The data frame df1 now contains 1022 observations of 14 variables.
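
As a minimal sketch of what this cleaning step does, the base-R example below uses a small made-up data frame (the column names and values are hypothetical, not taken from the WHO file). It shows that dropping rows with missing values keeps rows that contain zeros. Here complete.cases() stands in for drop_na(), and order() stands in for arrange(desc(...)).

```r
# Toy data frame; names and values are illustrative only, not from the WHO file
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  treated   = c(0, NA, 150, 0),
  requiring = c(1200, 900, NA, 300)
)

# Keep only rows with no missing values (base-R stand-in for tidyr::drop_na())
toy_clean <- toy[complete.cases(toy), ]

# Sort by a column in descending order, as arrange(desc(...)) would
toy_clean <- toy_clean[order(-toy_clean$requiring), ]

print(toy_clean)
```

Rows B and C are removed because each has a missing value, while rows A and D stay even though their treated value is zero.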

When importing the data, it makes sense to keep it as an Excel file (.xlsx) and load it as such. In Excel, it matters little whether a column is stored as text or as a number, but once the data is in R, columns of numbers that were read in as characters must be converted with the as.numeric() function, so that R distinguishes the numerical values from the characters (names, places, etc.) and dates present in our dataset.
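
One detail of this conversion is easy to get wrong, so here is a small sketch on a made-up vector (the values are hypothetical). Calling as.numeric() directly on a factor returns the integer level codes, while converting through as.character() first recovers the original numbers.

```r
# Numbers stored as text, as can happen after importing a spreadsheet
x <- c("10.5", "3", "250", "3")

# Converting to a factor keeps every element; it does not remove duplicates
f <- as.factor(x)

# as.numeric() on a factor gives the position of each value among the levels
codes <- as.numeric(f)

# Going through as.character() first gives the actual numeric values
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

If a column is already a character vector, as.numeric() can be applied to it directly, and the detour through as.factor() is not needed.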

df2 <- df1 %>% rename(`programme.drug.cov` = `Programme.(drug).coverage.(%)`)

TESTING:

Test 1

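# Caution: as.factor() does not remove duplicates; it stores each distinct value as a level.
df2$Number.of.IUs.covered <- as.factor(df2$Number.of.IUs.covered)
# Caution: as.numeric() on a factor returns the integer level codes, not the original values,
# so the t-test below summarizes level codes. as.numeric(as.character(x)) would recover the values.
df2$Number.of.IUs.covered <- as.numeric(df2$Number.of.IUs.covered)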
t.test(df2$Number.of.IUs.covered)

## 
##  One Sample t-test
## 
## data:  df2$Number.of.IUs.covered
## t = 36.282, df = 1021, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  48.82107 54.40398
## sample estimates:
## mean of x 
##  51.61252
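
For readers unfamiliar with this output, the sketch below shows on a made-up vector (hypothetical values) how the one-sample t statistic is formed: the sample mean minus the hypothesized mean (0 by default), divided by the standard error. It then checks the result against t.test().

```r
# Made-up sample; values are illustrative only
x <- c(2, 4, 6, 8)

n  <- length(x)
se <- sd(x) / sqrt(n)            # standard error of the mean
t_manual <- (mean(x) - 0) / se   # test against a true mean of 0

# The same quantities as computed by t.test()
res <- t.test(x)
t_builtin  <- unname(res$statistic)
df_builtin <- unname(res$parameter)  # degrees of freedom, n - 1

print(c(manual = t_manual, builtin = t_builtin, df = df_builtin))
```

Because the null value is 0, any variable whose values are all well above zero will produce a large t and a very small p-value.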

t.test(df$Population.requiring.PC.for.LF)

## 
##  One Sample t-test
## 
## data:  df$Population.requiring.PC.for.LF
## t = 10.295, df = 1154, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  20963698 30835651
## sample estimates:
## mean of x 
##  25899675

#print(df)
write.xlsx(df, 'df.xlsx')
#print(df2)
write.xlsx(df2, 'df2.xlsx')
write.xlsx(df1, 'df1.xlsx')

Test 2

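# Caution: as.numeric() on a factor returns integer level codes, not the original percentages;
# as.numeric(as.character(x)) would recover the values.
df2$Geographical.coverage. <- as.factor(df2$Geographical.coverage.)
df2$Geographical.coverage. <- as.numeric(df2$Geographical.coverage.)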
#head(df2$Geographical.coverage.)
t.test(df2$Geographical.coverage.)

## 
##  One Sample t-test
## 
## data:  df2$Geographical.coverage.
## t = 25.41, df = 1021, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  78.22734 91.32061
## sample estimates:
## mean of x 
##  84.77397

#df2$National.coverage. <- as.factor(df2$National.coverage.) 
#df2$National.coverage. <- as.numeric(df2$National.coverage.) 
#head(df2$National.coverage.)

#df2$Total.population.of.IUs <- as.factor(df2$Total.population.of.IUs) 
#df2$Total.population.of.IUs <- as.numeric(df2$Total.population.of.IUs) 
#head(df2$Total.population.of.IUs)

#df2$Reported.number.of.people.treated <- as.factor(df2$Reported.number.of.people.treated) 
#df2$Reported.number.of.people.treated <- as.numeric(df2$Reported.number.of.people.treated) 
#head(df2$Reported.number.of.people.treated)

Test 3

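# Caution: as.numeric() on a factor returns integer level codes, not the original percentages,
# which is why the mean reported below exceeds 100. as.numeric(as.character(x)) would recover the values.
df2$programme.drug.cov <- as.factor(df2$programme.drug.cov)
df2$programme.drug.cov <- as.numeric(df2$programme.drug.cov)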
t.test(df2$programme.drug.cov)

## 
##  One Sample t-test
## 
## data:  df2$programme.drug.cov
## t = 42.503, df = 1021, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  351.2779 385.2837
## sample estimates:
## mean of x 
##  368.2808

INCLUDING PLOTS:

To aid in visualizing this data, multicolor, multivariable plots were created. The following plots use the variables Region, Type of MDA, and Year, drawn from the data frames ‘df1’ and ‘df1.1’.

Categorizing Administered Drugs In Each Region

df1.1 <- df1 %>% select(Type.of.MDA, Region, Year)
ggplot(df1, aes(x = Region, fill = Type.of.MDA)) +
  geom_bar(position = "stack", width = 0.6) +
  labs(title = "Regionally Categorizing Administered Drugs Treating Lymphatic Filariasis", x = "Region", y = "Amount Administered", fill = "Type of MDA") +
  theme_light() +
  scale_fill_viridis(discrete = TRUE, option = "H", direction = 1)

Yearly Drug Popularity

ggplot(df1.1, aes(x = Year, fill = Type.of.MDA)) +  geom_bar(position = "stack", width = 0.6) + labs(title = "Worldwide Drug Popularity per Year to Treat Lymphatic Filariasis", x = "Year", y = "Amount Administered (per year, per region)", fill = "Type of MDA") + theme_light() + scale_fill_viridis(discrete = TRUE, option = "H", direction = 1)

Mapping Status

ggplot(df1, aes(x = Region, fill = Mapping.status)) + geom_bar() + theme_light() + labs(title = "Lymphatic Filariasis Mapping Status per Region", x = "Region", y = "Amount of Mapping", fill = "Mapping Status")

MDA Status

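## Error in eval(expr, envir, enclos): object 'joined' not found

This chunk fails because ‘joined’ is created only by the left_join() line that is commented out in the cleaning step, so no MDA status plot is produced.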

Needing Treatment

# The bars use a fixed fill color, so no fill scale or grouping is needed
ggplot(df1, aes(Year, Population.requiring.PC.for.LF)) +
  geom_col(fill = "hotpink") +
  ggtitle("Yearly Number of People Worldwide Needing PC to Treat LF Since 2000") +
  scale_y_continuous(labels = scales::comma) +
  ylab("Number of Individuals")

# Convert the coverage, treatment, and year columns to numeric; drop rows without drug coverage
df8 <- df1 %>%
  mutate(drugcov = as.numeric(`Programme.(drug).coverage.(%)`)) %>%
  drop_na(drugcov) %>%
  mutate(
    drugcov = round(drugcov, digits = 1),
    treat = as.numeric(Reported.number.of.people.treated),
    year = as.numeric(Year),
    geocov = replace_na(as.numeric(`Geographical.coverage.(%)`), 0)
  )
# Mean geographical coverage across all rows in each year
av <- df8 %>%
  group_by(year) %>%
  summarise(mean = mean(geocov))
ggplot(av, aes(year, mean)) + geom_point() + geom_line() + labs(title = "Global Average Percent of Geographic Coverage Per Year", y = "Mean (%)" )
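
The yearly averaging step can be illustrated with a base-R sketch on a made-up data frame (the years and coverage values are hypothetical). It computes one mean per year, which is the quantity plotted on the y-axis above.

```r
# Toy data: several coverage values per year (illustrative only)
toy <- data.frame(
  year   = c(2000, 2000, 2001, 2001, 2001),
  geocov = c(50, 100, 0, 30, 60)
)

# One mean per year; aggregate() is a base-R stand-in for group_by() + a mean
yearly <- aggregate(geocov ~ year, data = toy, FUN = mean)
names(yearly)[2] <- "mean"

print(yearly)
```

Note that in the toy data the 0 in 2001 counts toward the mean. The same holds in the real pipeline, where missing geographical coverage is replaced with 0 before averaging, so those rows lower the yearly mean.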

# Total number of people treated per country
av1 <- df8 %>%
  group_by(country) %>%
  summarise(treated = sum(treat))
# Treated per country per year, excluding rows with no treatment or with 0% or 100% coverage
av2 <- df8 %>%
  select(year, country, geocov, drugcov, treat, Region) %>%
  filter(geocov != 0, drugcov != 0, treat != 0, geocov != 100, drugcov != 100)
# The legend title is set with color =, because color (not fill) is the mapped aesthetic
ggplot(av2, aes(year, treat, color = treat)) +
  geom_point(size = 2) +
  labs(x = "Year", y = "Number of Individuals Treated", title = "Number of Individuals Treated per Country per Year", color = "# Treated") +
  scale_color_viridis(option = "H") +
  theme_light() +
  scale_y_continuous(labels = scales::comma, breaks = scales::breaks_extended(n = 8))

Brazil

brazil <- filter(df1, country == "Brazil")
brazil$xy<-paste0(brazil$Year," = ", scales::comma(brazil$Population.requiring.PC.for.LF))
ggplot(brazil, aes(Year, Population.requiring.PC.for.LF, fill = xy)) + geom_col( width = .75) + geom_text(aes(label = scales::comma(Population.requiring.PC.for.LF)), color = "black", vjust = -.5, size = 2.5)+ labs(title = "Number of individuals from Brazilian Population Needing PC for LF Treatment Each Year", x = "Year", y = "Number of Individuals", fill = "Year = Number of Individuals") + theme_light() + scale_y_continuous(labels = scales::comma, breaks = scales::breaks_extended(n=8)) + scale_fill_viridis(discrete = TRUE, option = "H")  +  theme(legend.text = element_text(size = 10))

The following plots required downloading a few more packages, which can be done using the command ‘install.packages(c("cowplot", "googleway", "ggplot2", "ggrepel", "ggspatial", "lwgeom", "sf", "rnaturalearth", "rnaturalearthdata"))’. Back at the top of the page, ‘sf’, ‘rnaturalearth’, and ‘rnaturalearthdata’ should be loaded via ‘library("sf")’, ‘library("rnaturalearth")’, and ‘library("rnaturalearthdata")’. These packages let us draw world maps, filter by continent or nation, and access a large set of attributes for each country. To create each map, a new data frame was created by joining our variables of interest to the ‘ne_countries’ set.
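
Before running the map code, it can help to check which of the mapping packages are already available. The sketch below is one possible way to do that; the variable names are our own, and it only reports what is missing rather than installing anything.

```r
# Packages the map chunks load with library()
pkgs <- c("sf", "rnaturalearth", "rnaturalearthdata")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All mapping packages are available.")
}
```

Any names reported as missing can then be passed to install.packages().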

World Map

world <- ne_countries(scale = "medium", returnclass = "sf")
new <- full_join(world, av1, by = join_by(name == country))
ggplot(new) +
  geom_sf(aes(fill = treated)) +
  ggtitle("Cumulative Number of Individuals Treated for LF: 2000 - 2023", subtitle = paste0("(", length(unique(av1$country)), " countries)")) +
  scale_fill_viridis(option = "H", trans = "log10", na.value = "grey90", name = "# of individuals treated", labels = scales::comma_format()) +
  theme_light() +
  theme(legend.text = element_text(size = 10))
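
Because the join above matches on exact country names, any country spelled differently in the two sources is left without a shape and does not appear on the map. The sketch below shows one way to list such mismatches with setdiff(). The two name vectors are made up for illustration; in practice they would be av1$country and world$name.

```r
# Hypothetical name lists standing in for av1$country and world$name
data_names <- c("Brazil", "India", "United Republic of Tanzania")
map_names  <- c("Brazil", "India", "Tanzania", "Kenya")

# Countries in our data that have no identically named shape in the map data
unmatched <- setdiff(data_names, map_names)

print(unmatched)
```

Names that show up here could be recoded in the data before joining so that they match the map's spelling.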

Africa

world1 <- ne_countries(scale = "medium", returnclass = "sf", continent = "Africa")
new1 <- full_join(world1, av1, by = join_by(name == country))
ggplot(new1) +
  geom_sf(aes(fill = treated)) +
  ggtitle("Cumulative Number of Individuals Treated for LF: Africa, 2000 - 2023") +
  scale_fill_viridis(option = "H", na.value = "grey90", trans = "log10", name = "# of individuals treated", labels = scales::comma_format()) +
  theme_light() +
  theme(legend.text = element_text(size = 10)) +
  xlab("Longitude") +
  ylab("Latitude")

Asia

world2 <- ne_countries(scale = "medium", returnclass = "sf", continent = "Asia")
new2 <- full_join(world2, av1, by = join_by(name == country))
ggplot(new2) +
  geom_sf(aes(fill = treated)) +
  ggtitle("Cumulative Number of Individuals Treated for LF: Asia, 2000 - 2023") +
  xlab("Longitude") +
  ylab("Latitude") +
  scale_fill_viridis(option = "H", na.value = "grey90", trans = "log10", name = "# of individuals treated", labels = scales::comma_format()) +
  theme_light() +
  theme(legend.text = element_text(size = 10))