DATA 608 01[46846] : Final Project
DATA 608 - Final Project - Knowledge and Visual Analytics
DATA 608 01[46846] : Final Project
1 INTRODUCTION
Given the current pandemic COVID-19 situation, I was inclined to use diseases to explore the visualizations in the last decade. However, with plethora of COVID-19 data ambiguity and the ever-changing landscape of the spread of the disease, I wanted to avoid making any presumptive conclusions/predictions. Instead, I relied on data source over prolonged period for diseases that have lasted for a long period namely HIV/AIDS, Tuberculosis and Malaria.
1.1 DataSet
The dataset used in this project is a demography data collected from the World Health Nutrition and Population Statistics from 2000 to 2020.
The primary issue with such type of demographic data is the missing observation(s) (NAs) and needs data cleaning to be able to slice-and-dice the data for various visualizations
1.2 Objective
The purpose of this project is to:
- Identify the relationship between labor force/Population and epidemic diseases like HIV/AIDS, Tuberculosis, Malaria. Better still, does geographical location has anything to do with all the widespread of these diseases?
- Show the merit and demerit of visual analytics on data analysis, and how to improve it.
- Make the visualized image(s) tells the story in an unambiguous way, understandable, even to a layman!
- Apply the knowledge learned in class to real live situation.
1.3 Methodology
The project would lay more emphasis on the explanatory techniques. It will be used in making data presentation to the viewers in a more succinct way. Therefore, used the R programing language to explore and analysis the dataset.
- Majorly used SQL from sqldf package (it was the best for this type of dataset) to extract the observations and to show their (countries using GOOGLEVIS) location and graphically using PLOTLY and GGPLOT2.
- The dataset was not normally distributed, so used Generalized Linear Model (GLM) for regression analysis and Analysis of Variance (ANOVA) to obtain best model for predictions.
1.4 Tools/Packages To Be Used
Load these libraries and dataset and lets get to work!
- Plotly
- Knitr
- Dplyr
- Plyr
- Reshape2
- Ggplot2
- Graphics
- Ggthemes
- GoogleVis
- scatterplot3d
- Shiny
- Shinydashboard
- Rshiny
- Sqldf
- gridExtra
- treemap
2 DATA LOAD
The datasets are loaded into R to proceed with subsequent data analysis.
2.1 World Health
2.2 World Longitudes and Latitudes
This includes the various world Longitudes and Latitudes for all the country locations in the above dataset to able to plot them in a map for better visualizations.
3 DATA PREPARATION
- Checking for all NA’s in the datasets before cleaning.
| x | |
|---|---|
| Year_Code | 0 |
| Country_Name | 0 |
| Country_Code | 0 |
| Adults_15_living_HIV | 1716 |
| Adults_Children_0_14_15_living_HIV | 1688 |
| AIDS_estimated_deaths_UNAIDS | 1785 |
| Adults_children_0_14_15_newly_infected_HIV | 1655 |
| Adults_15_newly_infected_HIV | 1700 |
| Children_0_14_living_with_HIV | 2365 |
| Children_orphaned_by_HIV_AIDS | 2365 |
| Children_0_14_newly_infected_HIV | 2365 |
| Incidence_tuberculosis_per_100000 | 394 |
| Labor_force_total | 568 |
| Mortality_traffic_injury_100K | 3974 |
| Population_female | 623 |
| Population_male | 623 |
| Population_total | 224 |
| Malaria_cases_reported | 3450 |
| Suicide_mortality_per_100K | 3415 |
| Tuberculosis_death_per_100K | 394 |
| Tuberculosis_case_detection | 564 |
| Tuberculosis_treatment_success_NewCases | 982 |
- Cleaning and merging the two datasets of
World HealthandWorld Longitudes and Latitudes.
options(warn = -1)
df2 <- merge(df, lat_long, by.x = "Country_Name", by.y = "Country_Name", all = TRUE)
DT::datatable(df2)- Checking for all NA’s in the datasets after cleaning.
| x | |
|---|---|
| Country_Name | 0 |
| Year_Code | 63 |
| Country_Code.x | 63 |
| Adults_15_living_HIV | 1779 |
| Adults_Children_0_14_15_living_HIV | 1751 |
| AIDS_estimated_deaths_UNAIDS | 1848 |
| Adults_children_0_14_15_newly_infected_HIV | 1718 |
| Adults_15_newly_infected_HIV | 1763 |
| Children_0_14_living_with_HIV | 2428 |
| Children_orphaned_by_HIV_AIDS | 2428 |
| Children_0_14_newly_infected_HIV | 2428 |
| Incidence_tuberculosis_per_100000 | 457 |
| Labor_force_total | 631 |
| Mortality_traffic_injury_100K | 4037 |
| Population_female | 686 |
| Population_male | 686 |
| Population_total | 287 |
| Malaria_cases_reported | 3513 |
| Suicide_mortality_per_100K | 3478 |
| Tuberculosis_death_per_100K | 457 |
| Tuberculosis_case_detection | 627 |
| Tuberculosis_treatment_success_NewCases | 1045 |
| Country_Code.y | 721 |
| Latitude | 721 |
| Longtitude | 721 |
From the table above, we can see that both datasets contain a lot of NAs (This is due to data unavalability in some countries under observation).
- Convert the columns to Numeric
Let’s convert some of the variable to numeric for better a analysis.
## Observations: 4,403
## Variables: 25
## $ Country_Name <chr> "Afghanistan", "Afghanis...
## $ Year_Code <chr> "YR2009", "YR2012", "YR2...
## $ Country_Code.x <chr> "AFG", "AFG", "AFG", "AF...
## $ Adults_15_living_HIV <dbl> 3700, 4600, 4000, 1700, ...
## $ Adults_Children_0_14_15_living_HIV <dbl> 4000, 4900, 4200, 1800, ...
## $ AIDS_estimated_deaths_UNAIDS <dbl> 500, 500, 500, 200, 200,...
## $ Adults_children_0_14_15_newly_infected_HIV <dbl> 560, 620, 560, 500, 500,...
## $ Adults_15_newly_infected_HIV <dbl> 500, 560, 500, 500, 500,...
## $ Children_0_14_living_with_HIV <dbl> 500, 500, 500, 200, 200,...
## $ Children_orphaned_by_HIV_AIDS <dbl> 2600, 3400, 2800, 1000, ...
## $ Children_0_14_newly_infected_HIV <dbl> 100, 100, 100, 100, 100,...
## $ Incidence_tuberculosis_per_100000 <dbl> 189, 189, 189, 189, 189,...
## $ Labor_force_total <dbl> 9312008, 10534726, 95725...
## $ Mortality_traffic_injury_100K <dbl> NA, NA, NA, NA, NA, NA, ...
## $ Population_female <dbl> 13850754, 15196503, 1424...
## $ Population_male <dbl> 14544059, 15964873, 1493...
## $ Population_total <dbl> 28394813, 31161376, 2918...
## $ Malaria_cases_reported <dbl> NA, 54840, 69397, NA, NA...
## $ Suicide_mortality_per_100K <dbl> NA, NA, 5.1, NA, NA, NA,...
## $ Tuberculosis_death_per_100K <dbl> 44.00, 44.00, 43.00, 62....
## $ Tuberculosis_case_detection <dbl> 49, 49, 51, 25, 31, 50, ...
## $ Tuberculosis_treatment_success_NewCases <dbl> 86, 88, 86, 84, 86, 87, ...
## $ Country_Code.y <fct> AFG, AFG, AFG, AFG, AFG,...
## $ Latitude <dbl> 33, 33, 33, 33, 33, 33, ...
## $ Longtitude <dbl> 65, 65, 65, 65, 65, 65, ...
- Merging dataset
Merging column longitude and Latitude together for a better coordinate to be used in our maps
- A graphical representation of the missing values
library(Amelia)
missmap(df2, legend = TRUE, col = c("wheat","darkred", col=c('yellow', 'darkgreen')), main ="Plot Showing The Missing Values Per Observation",
y.cex = 0.8, x.cex = 0.8,csvar = NULL, tsvar =NULL, rank.order = TRUE)Since it is evident that we have lots of missing observations in the dataset, we are going to employ a method that best deal with datasets of this nature,
MICE (Multivariate Imputation by Chained Equations) package helps in inspecting, imputing, diagonise, analyze, pool the result, and generate simulated incomplete data
library(mice)
mice_missing_val <- mice(df2, m=1, method='cart', printFlag=FALSE, where = is.na(df2))
mice_missing_val_comp <- complete(mice_missing_val, action='long', include=FALSE)
df2_cleaned <- na.omit(mice_missing_val_comp)
df2_cleaned <- df2_cleaned[, -c(1:2)]
DT::datatable(df2_cleaned)After the generated values by mice, we still have some missing observations. We therefore going to omit those observation that still have missing value(s) to avoid a biased analysis.
na_count_after_cleaning <- sapply(df2_cleaned, function(c) sum(is.na(c)))
kable(na_count_after_cleaning)| x | |
|---|---|
| Country_Name | 0 |
| Year_Code | 0 |
| Country_Code.x | 0 |
| Adults_15_living_HIV | 0 |
| Adults_Children_0_14_15_living_HIV | 0 |
| AIDS_estimated_deaths_UNAIDS | 0 |
| Adults_children_0_14_15_newly_infected_HIV | 0 |
| Adults_15_newly_infected_HIV | 0 |
| Children_0_14_living_with_HIV | 0 |
| Children_orphaned_by_HIV_AIDS | 0 |
| Children_0_14_newly_infected_HIV | 0 |
| Incidence_tuberculosis_per_100000 | 0 |
| Labor_force_total | 0 |
| Mortality_traffic_injury_100K | 0 |
| Population_female | 0 |
| Population_male | 0 |
| Population_total | 0 |
| Malaria_cases_reported | 0 |
| Suicide_mortality_per_100K | 0 |
| Tuberculosis_death_per_100K | 0 |
| Tuberculosis_case_detection | 0 |
| Tuberculosis_treatment_success_NewCases | 0 |
| Country_Code.y | 0 |
| Latitude | 0 |
| Longtitude | 0 |
| Lat_Long | 0 |
We now have data with no missing observations.
Comparative Analysis of DataTypes and Missing Values Before and After
library(gridExtra)
library(visdat)
p_before_dt <- vis_dat(df)
p_before_miss <- vis_miss(df)
p_after_dt <- vis_dat(df2_cleaned)
p_after_miss <- vis_miss(df2_cleaned)
p_before_dt4 DATA EXPLORATION
In the data exploration, obtained an inferential statistic and checked for any outliers.
Summary Statistics
Percentage of Children Orphaned By HIV/AIDS
Use sql to subset(query) columns so as to diffentiate between year 2000 and 2020 where the number of children orphaned by HIV/AIDS.
library(sqldf)
All_Cen <- sqldf("SELECT Country_Name, Year_Code, (Children_orphaned_by_HIV_AIDS/Population_total)*100 as 'Percentage_Orphaned_byHIV' FROM df2_cleaned where Percentage_Orphaned_byHIV >= 2 ORDER BY Percentage_Orphaned_byHIV DESC LIMIT 50")
library(plotly)
# Boxplot
g <- ggplot(All_Cen, aes(Year_Code, Percentage_Orphaned_byHIV)) + geom_boxplot(varwidth=T, fill="plum") +
labs(title="Box plot of Children Orphaned By HIV/AIDS per Year (%)",
subtitle="Percentage Orphaned by HIV grouped by Year",
caption="Source: The World Bank Data Bank",
x="Year",
y="% of Children Orphaned By HIV/AIDS") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plotly::ggplotly(g)Scatter Plot of HIV/Tuberculosis/Malaria Comparison
Disease Distribution by Top 50 Countries
library(dplyr)
disease = df2_cleaned %>%
filter(Country_Name != 'World') %>%
group_by(Country_Name) %>%
summarise(HivTbMalaria = max(Children_0_14_living_with_HIV+Adults_15_living_HIV) + max(Tuberculosis_case_detection) + max(Malaria_cases_reported)) %>%
arrange(desc(HivTbMalaria)) %>%
ungroup() %>%
mutate(Country_Name = reorder(Country_Name,HivTbMalaria)) %>%
head(50)
#DT::datatable(disease)
library(treemap)
treemap(disease, index="Country_Name", vSize = "HivTbMalaria", title="Top 50 countries with Disease Count", palette = "RdBu",fontsize.title = 14)library(ggplot2)
disease %>%
head(20) %>%
mutate( fill = as.factor(Country_Name)) %>%
ggplot(aes(x = Country_Name, y= HivTbMalaria, fill = Country_Name)) +
scale_y_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))
) +
geom_boxplot() +
labs(x= 'Country Name', y = 'Disease Count', title = 'Distribution of Diseases by Top 20 Countries') +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))5 DATA ANALYSIS
library(plotly)
ggplotly(ggplot(All_Cen, aes(x=Percentage_Orphaned_byHIV, y=reorder(Country_Name, +Percentage_Orphaned_byHIV), fill=Year_Code)) +
geom_point(colour="purple", size=2, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code))) + labs(title="Chart of Children Orphaned By HIV/AIDS BY Countries (%)"))5.1 INCIDENCE PROPORTION and INCIDENCE RATES
Incidence: A measure of the occurrence of a new disease, in a defined population at risk for the disease, within a specified time period.
while
Incidence proportion (also known as cumulative incidence) is the number of new cases within a specified time period divided by the size of the population initially at risk
Incidence rate (also known as person-time rate) is a measure of incidence that incorporates time directly into the denominator. As per CDC, “a person-time rate is generally calculated from a long-term cohort follow-up study, wherein enrollees are followed over time and the occurrence of new cases of disease is documented.”
Prevalence is the proportion of the total number of cases to the total population and is more a measure of the burden of the disease on society with no regard to time at risk or when subjects may have been exposed to a possible risk factor. Prevalence can also be measured with respect to a specific subgroup of a population
Mathematically as,
\[Incidence Rate = \frac{Number Of New Cases Of Disease Or Injury During Specified Period}{Time Each Person Was Observed,Totaled For All Persons}\]
5.1.1 Incidence Person At Risk
- HIV/AIDS Incidence Calculation For Person At Risk
library(sqldf)
Hiv_Incident_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Adults_children_0_14_15_newly_infected_HIV/Population_total)*100000 as 'HIV_Incident_Risk_Per_100K' FROM df2_cleaned where HIV_Incident_Risk_Per_100K >= 200 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 30" )
DT::datatable(Hiv_Incident_Risk)ggplotly(ggplot(Hiv_Incident_Risk, aes(x=HIV_Incident_Risk_Per_100K, y=reorder(Country_Name,+HIV_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="red", size=2, alpha=.8) + ylab("Countries") + xlab("HIV Incident Risk") +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code))) + labs(title="Chart of HIV/AIDS Incidence (Person At Risk) BY Countries"))library(googleVis)
Hiv_Inc_Map <- googleVis::gvisGeoChart(Hiv_Incident_Risk, locationvar ="Lat_Long", hovervar ="Country_Name",sizevar = "HIV_Incident_Risk_Per_100K", colorvar = "Population_total",
options=list(displayMode="Markers",
colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
backgroundColor="lightblue"), chartid="Hiv_Incidence_Person_At_Risk")
plot(Hiv_Inc_Map)
#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Hiv_Incidence_Person_At_Risk.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Hiv_Incidence_Person_At_Risk.html")
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy
- Tuberculosis Incidence Calculation For Person At Risk
Tuberculosis_Inc_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Tuberculosis_case_detection/Population_total)* 100000 as 'Tuberculocisis_Incident_Risk_Per_100K' FROM df2_cleaned where Tuberculocisis_Incident_Risk_Per_100K >= 1 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 50" )
DT::datatable(Tuberculosis_Inc_Risk)ggplotly(ggplot(Tuberculosis_Inc_Risk, aes(x=Tuberculocisis_Incident_Risk_Per_100K, y=reorder(Country_Name, Tuberculocisis_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="purple", size=2, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code)))+ labs(title="Chart of Tuberculosis Incidence (Person At Risk) BY Countries"))library(googleVis)
Tub_Inc_Map <- googleVis::gvisGeoChart(Tuberculosis_Inc_Risk, locationvar ="Lat_Long", hovervar ="Country_Name",sizevar = "Tuberculocisis_Incident_Risk_Per_100K", colorvar = "Population_total",
options=list(displayMode="Markers",
colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
backgroundColor="lightblue"), chartid="Tuberculosis_Incidence_Person_At_Risk")
plot(Tub_Inc_Map)
#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Hiv_Incidence_Person_At_Risk.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Tuberculosis_Incidence_Person_At_Risk.html")
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy
- Malaria Incidence Calculation For Person At Risk
Malaria_Inc_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Malaria_cases_reported/Population_total)* 100000 as 'Malaria_Incident_Risk_Per_100K' FROM df2_cleaned where Malaria_Incident_Risk_Per_100K >= 1 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 50" )
DT::datatable(Malaria_Inc_Risk)ggplotly(ggplot(Malaria_Inc_Risk, aes(x=Malaria_Incident_Risk_Per_100K, y=reorder(Country_Name, Malaria_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="brown", size=2, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code)))+ labs(title="Chart of Malaria Incidence (Person At Risk) BY Countries"))library(googleVis)
Malaria_Inc_Map <- googleVis::gvisGeoChart(Malaria_Inc_Risk, locationvar ="Lat_Long", hovervar ="Country_Name",sizevar = "Malaria_Incident_Risk_Per_100K", colorvar = "Population_total",
options=list(displayMode="Markers",
colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
backgroundColor="lightblue"), chartid="Malaria_Incidence_Person_At_Risk")
plot(Malaria_Inc_Map)
#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Malaria_Incidence_Person_At_Risk.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Malaria_Incidence_Person_At_Risk.html")
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy
5.1.2 HIV/AID, Malaria & Tuberculosis Incident Risk Comparison
- Person At Risk Comparison
comparison <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, Labor_force_total, (Malaria_cases_reported/Population_total)* 1000 as 'Malaria_Incident_Risk_Per_1K', (Tuberculosis_case_detection/Population_total)* 100000 as 'Tuberculocisis_Incident_Risk_Per_100K', (Adults_children_0_14_15_newly_infected_HIV/Population_total)*100000 as 'HIV_Incident_Risk_Per_100K' FROM df2_cleaned where Malaria_Incident_Risk_Per_1K >= 1 and Lat_Long != 'NA:NA' or Tuberculocisis_Incident_Risk_Per_100K >= 1 or HIV_Incident_Risk_Per_100K >= 200 ORDER BY Country_Name ASC")
DT::datatable(comparison)comparison <- comparison[order(comparison$Country_Name, comparison$Year_Code),]
comparison$size <- comparison$Population_total
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')
pw <- plot_ly(comparison, x = ~Malaria_Incident_Risk_Per_1K, y = ~Tuberculocisis_Incident_Risk_Per_100K, z = ~HIV_Incident_Risk_Per_100K, color = ~Year_Code, size = ~size, colors = colors,
marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 120),
text = ~paste('Country:', Country_Name, '<br>Labor Force:', Labor_force_total, '<br>Population Total:', Population_total,
'<br>Lat & Long.:', Lat_Long)) %>%
layout(title = 'Comparison of HIV/AIDS, Malaria & Tuberculosis Incidence for People At Risk',
scene = list(xaxis = list(title = 'Malaria Pop At Risk Per 1K',
gridcolor = 'rgb(255, 255, 255)',
type = 'log',
zerolinewidth = 1,
ticklen = 5,
gridwidth = 2),
yaxis = list(title = 'Tuberculosis Pop At Risk Per 100K',
gridcolor = 'rgb(255, 255, 255)',
zerolinewidth = 1,
ticklen = 5,
gridwith = 2),
zaxis = list(title = 'HIV/AIDS Pop At Risk Per 100K',
gridcolor = 'rgb(255, 255, 255)',
type = 'log',
zerolinewidth = 1,
ticklen = 5,
gridwith = 2)),
paper_bgcolor = 'rgb(243, 243, 243)',
plot_bgcolor = 'rgb(243, 243, 243)')
pw5.1.3 Incidence Person Per Time
- HIV/AIDS Person Per Time
HIV_Inc_Person_Time <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code,((New_Hiv_Cases)/(Population_total-(AIDS_Deaths+Tuberculosis_Deaths+Mortality_Deaths+Traffic_Deaths)))*1000000 as 'HIV_Inc_Person_Time100K' FROM (SELECT Population_total,Lat_Long,Country_Name, Year_Code,Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths',round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY New_Hiv_Cases, Year_Code DESC LIMIT 50) where Lat_Long != 'NA:NA'")
DT::datatable(HIV_Inc_Person_Time)HIV_Inc_Person_Time_Map <- gvisGeoChart(HIV_Inc_Person_Time, locationvar ="Lat_Long", hovervar ="Country_Name",sizevar = "HIV_Inc_Person_Time100K", colorvar = "Population_total",
options=list(displayMode="Markers",
colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
backgroundColor="lightblue"), chartid="Hiv_Incidence_Person_Per_Time")
plot(HIV_Inc_Person_Time_Map)
#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Hiv_Incidence_Person_Per_Time.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Hiv_Incidence_Person_Per_Time.html")
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy
- Tuberculosis Person Per Time
Tuber_Inc_Person_Time <- sqldf("SELECT Population_total,Lat_Long,Country_Name,Year_Code,(Tuberculosis_case_detection/(Population_total-(Tuber_trtd + AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000 as 'Tuberculosis_Inc_Person_Time100K' FROM (SELECT Population_total,Tuberculosis_case_detection, Lat_Long,Country_Name, Year_Code, Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Tuberculosis_case_detection,Year_Code DESC LIMIT 50) where Lat_Long != 'NA:NA'")
DT::datatable(Tuber_Inc_Person_Time)Tuber_Inc_Person_Time_Map <- gvisGeoChart(Tuber_Inc_Person_Time, locationvar ="Lat_Long", hovervar ="Country_Name",sizevar = "Tuberculosis_Inc_Person_Time100K", colorvar = "Population_total", options=list(displayMode="Markers", colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}", backgroundColor="lightblue"), chartid="Tuberculosis_Incidence_Person_Per_Time")
plot(Tuber_Inc_Person_Time_Map)
#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Tuberculosis_Incidence_Person_Per_Time.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Tuberculosis_Incidence_Person_Per_Time.html")
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy
- Malaria Person Per Time
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy
5.2 PREVALENCE RATE
Prevalence : Actual number of cases alive, with the disease either during a period of time (period prevalence) or at a particular date in time (point prevalence). Period prevalence provides the better measure of the disease load since it includes all new cases and all deaths between two dates, whereas point prevalence only counts those alive on a particular date.
Prevalence is also most meaningfully reported as the number of cases as a fraction of the total population at risk and can be further categorized according to different subsets of the population.
- HIV Prevalence
HIV_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long, Country_Name, Year_Code, round(((Old_Hiv_Cases + New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'HIV_Inc_Prevalence_100K' FROM (SELECT Population_total,Adults_Children_0_14_15_living_HIV as 'Old_Hiv_Cases', Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
DT::datatable(HIV_Prevalence_Rate)HIV_Prevalence_Rate_Map <- gvisGeoChart(HIV_Prevalence_Rate, locationvar ="Lat_Long", hovervar ="Country_Name", sizevar = "HIV_Inc_Prevalence_100K", colorvar = "Population_total", options=list(displayMode="Markers", colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}", backgroundColor="lightblue"), chartid="HIV_Prevalence_Rate_Map")
plot(HIV_Prevalence_Rate_Map)
#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/HIV_Prevalence_Rate_Map.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/HIV_Prevalence_Rate_Map.html")
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy
Since, Malaria and Tuberculosis are most likely not be carried over to the proceding year, so we will not include them (person living with either malaria or tuberculosis) in the numerators.
- Tuberculosis Prevalence
Tuber_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round((Tuberculosis_case_detection/(Population_total -(Tuber_trtd + AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Tuberculosis_Prevalence_per_100K' FROM (SELECT Population_total,Tuberculosis_case_detection, Lat_Long,Country_Name, Year_Code, Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
DT::datatable(Tuber_Prevalence_Rate)Tuber_Prevalence_Rate_Map <- gvisGeoChart(Tuber_Prevalence_Rate, locationvar ="Lat_Long", hovervar ="Country_Name", sizevar = "Tuberculosis_Prevalence_per_100K", colorvar = "Population_total", options=list(displayMode="Markers", colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}", backgroundColor="lightblue"), chartid="Tuber_Prevalence_Rate_Map")
plot(Tuber_Prevalence_Rate_Map)
#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Tuber_Prevalence_Rate_Map.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Tuber_Prevalence_Rate_Map.html")
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy
- Malaria Prevalence
Malaria_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round((Malaria_cases_reported/(Population_total -(AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Malaria_Prevalence_per_100K' FROM (SELECT Population_total, Lat_Long,Country_Name, Year_Code, AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year',Malaria_cases_reported, round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
DT::datatable(Malaria_Prevalence_Rate)Malaria_Prevalence_Rate_Map <- gvisGeoChart(Malaria_Prevalence_Rate, locationvar ="Lat_Long", hovervar ="Country_Name", sizevar = "Malaria_Prevalence_per_100K", colorvar = "Population_total", options=list(displayMode="Markers", colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}", backgroundColor="lightblue"), chartid="Malaria_Prevalence_Rate_Map")
plot(Malaria_Prevalence_Rate_Map)
#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Malaria_Prevalence_Rate_Map.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Malaria_Prevalence_Rate_Map.html")
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy
- All Disease Prevalence Comparison
Com_HIV_Tub_Mal_Prevalence <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code,round((Labor_force_total/Population_total)*100) as 'Labor_Pop_Percentage',round((Tuberculosis_case_detection/(Population_total -(Tuber_trtd + AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Tuberculosis_Prevalence_per_100K',round((Malaria_cases_reported/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*100000) as 'Malaria_Prevalence_per_100K',round(((Old_Hiv_Cases + New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'HIV_Prevalence_Per_100K' FROM (SELECT Labor_force_total,Population_total,Adults_Children_0_14_15_living_HIV as 'Old_Hiv_Cases', Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',Tuberculosis_case_detection,Malaria_cases_reported,Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC) where Lat_Long != 'NA:NA'")
DT::datatable(Com_HIV_Tub_Mal_Prevalence)5.2.1 A 3D Approach For All Disease Prevalence Comparison
comp <- Com_HIV_Tub_Mal_Prevalence[order(Com_HIV_Tub_Mal_Prevalence$Country_Name, Com_HIV_Tub_Mal_Prevalence$Year_Code),]
comp$size <- Com_HIV_Tub_Mal_Prevalence$Population_total
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')
pw2 <- plot_ly(comp, x = ~Malaria_Prevalence_per_100K, y = ~Tuberculosis_Prevalence_per_100K, z = ~HIV_Prevalence_Per_100K, color = ~Year_Code, size = ~size, colors = colors,
marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 120),
text = ~paste('Country:', Country_Name, '<br>Population Total:', Population_total,
'<br>Labor Force/Pop %:', Labor_Pop_Percentage)) %>%
layout(title = 'Comparison of HIV/AIDS, Malaria & Tuberculosis Incidence for People At Risk',
scene = list(xaxis = list(title = 'Malaria Pop At Risk Per 1K',
gridcolor = 'rgb(255, 255, 255)',
type = 'log',
zerolinewidth = 1,
ticklen = 5,
gridwidth = 2),
yaxis = list(title = 'Tuberculosis Pop At Risk Per 100K',
gridcolor = 'rgb(255, 255, 255)',
zerolinewidth = 1,
ticklen = 5,
gridwith = 2),
zaxis = list(title = 'HIV/AIDS Pop At Risk Per 100K',
gridcolor = 'rgb(255, 255, 255)',
type = 'log',
zerolinewidth = 1,
ticklen = 5,
gridwith = 2)),
paper_bgcolor = 'rgb(243, 243, 243)',
plot_bgcolor = 'rgb(243, 243, 243)')
pw26 REGRESSION ANALYSIS
- Check for normality
qqnorm(Com_HIV_Tub_Mal_Prevalence$Malaria_Prevalence_per_100K)
qqline(Com_HIV_Tub_Mal_Prevalence$Malaria_Prevalence_per_100K, col = 2)From the above normality check, we can see that the variable under observation is not normally distributed. We will therefore use Generalized Linear Model (conventional linear regression models) to; select best model, estimate parameters and interpretations. The generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models etc.
i.e The dependent variable \(Y_i\) does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal etc.
For Linear Regression, we have:
\[Y_i= \beta_0 + \beta_1x_i + \epsilon_i\] where, \[Y_i = Dependent Variable\] \[\beta_0 = Intercept\] \[\beta_1 = Parameter To Be estimated (slope)\] \[x_i = Independent Variable\] \[\epsilon_i = Error Term\] \[i = 1,2,3,4,......,n\]
While Multiple linear regression, we have (in matrix notation):
\[\beta =(X^TX)^{-1}X^TY\]
- Obtaining Training and Test Dataset
# Create Training and Test data -
set.seed(12345) # setting seed to reproduce results of random sampling
Com_HIV_Tub_Mal_Prevalence <- Com_HIV_Tub_Mal_Prevalence[, c(1,5,6,7,8)]
traindata <- sample(1:nrow(Com_HIV_Tub_Mal_Prevalence), 0.8*nrow(Com_HIV_Tub_Mal_Prevalence)) # row indices for training data
Train <- Com_HIV_Tub_Mal_Prevalence[traindata, ] # model training data
Test <- Com_HIV_Tub_Mal_Prevalence[-traindata, ] # test data6.1 Analysis of Labor Force
- Null Hypothesis: Labor Force has no effect on HIV, Tuberculosis and Malaria.
\(H_o: \mu_1 = \mu_2 = \mu_3...= \mu_n\)
- Alternative Hypothesis: Labor Force has effect on HIV, Tuberculosis and Malaria.
\(H_a \neq Ho: \mu_1 \neq \mu_2 \neq \mu_3...\neq \mu_n\)
- Rejection : Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)
Using both forward and backward model selection to select the best model for our analysis.
fit_labor <- glm(Labor_Pop_Percentage ~., family=poisson(link='log'),data=Train)
summary(stepwise <- step(fit_labor, direction = "both")) ## Start: AIC=12485.84
## Labor_Pop_Percentage ~ Population_total + Tuberculosis_Prevalence_per_100K +
## Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
##
## Df Deviance AIC
## - Population_total 1 2584.2 12484
## <none> 2584.1 12486
## - HIV_Prevalence_Per_100K 1 2591.0 12491
## - Malaria_Prevalence_per_100K 1 2604.2 12504
## - Tuberculosis_Prevalence_per_100K 1 2622.8 12522
##
## Step: AIC=12483.87
## Labor_Pop_Percentage ~ Tuberculosis_Prevalence_per_100K + Malaria_Prevalence_per_100K +
## HIV_Prevalence_Per_100K
##
## Df Deviance AIC
## <none> 2584.2 12484
## + Population_total 1 2584.1 12486
## - HIV_Prevalence_Per_100K 1 2591.2 12489
## - Malaria_Prevalence_per_100K 1 2604.4 12502
## - Tuberculosis_Prevalence_per_100K 1 2624.9 12523
##
## Call:
## glm(formula = Labor_Pop_Percentage ~ Tuberculosis_Prevalence_per_100K +
## Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"),
## data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3191 -0.7588 0.1204 0.8847 3.3895
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.758e+00 4.419e-03 850.226 < 2e-16 ***
## Tuberculosis_Prevalence_per_100K 4.301e-04 6.649e-05 6.470 9.83e-11 ***
## Malaria_Prevalence_per_100K -1.522e-06 3.477e-07 -4.378 1.20e-05 ***
## HIV_Prevalence_Per_100K -3.282e-07 1.243e-07 -2.640 0.00829 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2653.6 on 1771 degrees of freedom
## Residual deviance: 2584.2 on 1768 degrees of freedom
## AIC: 12484
##
## Number of Fisher Scoring iterations: 4
Plot of Residuals
residual_labor = data.frame(Fitted = fitted(stepwise),
Residuals = resid(stepwise), Treatment = Train$Labor_Pop_Percentage)
plot_res <- ggplot(residual_labor, aes(Fitted, Residuals, colour = Treatment)) + geom_point()
ggplotly(plot_res)Analysis of Variance (ANOVA of Labor Force)
Decision
Reject Null Hypothesis (\(H_o\)) since P-Values are all less than Tabulated value of 0.05.
6.2 Analysis of Population
- Null Hypothesis: Population has no effect on HIV, Tuberculosis and Malaria.
\(H_o: \mu_1 = \mu_2 = \mu_3...= \mu_n\)
- Alternative Hythesis: Population has effect on HIV, Tuberculosis and Malaria.
\(H_a \neq Ho: \mu_1 \neq \mu_2 \neq \mu_3...\neq \mu_n\)
- Rejection : Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)
Using both forward and backward model selection to select the best model for our analysis.
fit_Population <- glm(Population_total ~., family=poisson(link='log'),data=Train)
summary(stepwise_pop <- step(fit_Population , direction = "both")) ## Start: AIC=21941748579
## Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K +
## Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
##
## Df Deviance AIC
## <none> 2.1942e+10 2.1942e+10
## - Labor_Pop_Percentage 1 2.1980e+10 2.1981e+10
## - HIV_Prevalence_Per_100K 1 2.2276e+10 2.2276e+10
## - Malaria_Prevalence_per_100K 1 2.2726e+10 2.2726e+10
## - Tuberculosis_Prevalence_per_100K 1 6.5965e+10 6.5965e+10
##
## Call:
## glm(formula = Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K +
## Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"),
## data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7103.3 -2151.0 -318.5 2490.6 16399.9
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.802e+01 2.818e-05 639641 <2e-16 ***
## Labor_Pop_Percentage 3.945e-03 6.346e-07 6216 <2e-16 ***
## Tuberculosis_Prevalence_per_100K -1.995e-01 1.567e-06 -127356 <2e-16 ***
## Malaria_Prevalence_per_100K -2.173e-05 8.688e-10 -25010 <2e-16 ***
## HIV_Prevalence_Per_100K -4.397e-06 2.570e-10 -17105 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 6.7926e+10 on 1771 degrees of freedom
## Residual deviance: 2.1942e+10 on 1767 degrees of freedom
## AIC: 2.1942e+10
##
## Number of Fisher Scoring iterations: 7
Plot of Residuals
residual_pop = data.frame(Fitted = fitted(stepwise_pop),
Residuals = resid(stepwise_pop), Treatment = Train$Population_total)
plot_pop <- ggplot(residual_pop, aes(Fitted, Residuals, colour = Treatment)) + geom_point()
ggplotly(plot_pop)Analysis of Variance (ANOVA of Population)
Decision
Reject Null Hypothesis (\(H_o\)) since P-Values are all less than Tabulated value of 0.05.
Non Zero Confidence Interval
Meaning that there is a 95% probability that the confidence interval will contain the true population mean.
## 2.5 % 97.5 %
## (Intercept) 3.748894e+00 3.766218e+00
## Tuberculosis_Prevalence_per_100K 2.992630e-04 5.598876e-04
## Malaria_Prevalence_per_100K -2.211387e-06 -8.485368e-07
## HIV_Prevalence_Per_100K -5.728867e-07 -8.563474e-08
From the table above, we can deduce that the our analysis is correct (as non of the value between the 2.5 and 97.5 is zero) for Population_total, Malaria and HIV as corfirmed by Analysis of Variance (Anova) result above.
7 PREDICTION
The below below shows the predicted values and the actual values side-by-side.
pred <- predict(fit_labor, Test, type="response")
actual_pred <- data.frame(cbind(actual=Train$Labor_Pop_Percentage, predicted=pred))
kable(head(actual_pred))| actual | predicted |
|---|---|
| 41 | 42.86778 |
| 41 | 42.87670 |
| 45 | 42.87575 |
| 45 | 42.87028 |
| 47 | 42.89949 |
| 35 | 42.87475 |
7.1 MAPE
The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics. Represented mathematically as,
\[M=\frac{100}{n}\sum _{t=1}^{n} |\frac{A_t - F_t}{A_t}|\], Where, \[A_t=Actual Value\] \[F_t=Predicted Value\]
min_max_accuracy <- mean(apply(actual_pred, 1, min) / apply(actual_pred, 1, max))*100
print(paste('The Percentage Accuracy Is: ', min_max_accuracy))## [1] "The Percentage Accuracy Is: 86.2957407776888"
mape <- mean(abs((actual_pred$predicted - actual_pred$actual))/actual_pred$actual)*100
print(paste('The MAPE Is: ', mape ))## [1] "The MAPE Is: 16.4859681592503"
We can be rest assured that about 16 % of our prediction might not be accurate.
8 CONCLUSION
Using data visualization in the field of epidemiology to depict succinctly the effect of labor force and population on the widespread of diseases. As observed that using incident risk alone may not be the best option to make recommendation and conclusion on epidemics, but rather use prevalence to make accurate analysis.
Incident (Person At Risk): Shows that majority of the diseases is highly concentrated in southern part of africa.
Incident (Person Per Time): Gives more insight about the dieases that are considered non-chronic like Malaria and Tuberculosis.
Prevalence Rate: This was used to estimate the prevalence of all the diseases and a better result was obtained. It shows the widespread of the diseases beyond southern part of africa to some of Asia, South America and Carribean was not left out!
Since P-value is less than Tabulated/Table value (0.05), we will therefore reject Null Hypothesis Ho and conclude that both Labor Force and Population have effects on HIV/AIDS, Malaria and Tuberculosis. I.e an increase and decrease in the labor force and population would have an impact on the wide spread of disease in the countries under study.
8.1 Recommendation
The countries that are mostly affected needs to do more in terms of;
Labor Force: Provide more employment opportunities to the citizens. If an economy is viable, the citizen would want to live a healthy life style and would ultimately aid in decreasing the widespread of diseases.
Population: Maintain the population, as a sporadic increment in population without good standard of living, would jeopardize the economy and cause an increase in the spread of diseases.