ABSTRACT
The main reason why I opted to do my project on demographic dataset is to show the beauty of data visualization. A good statistical analysis may not be convincing enough without depicting the result graphically in a very understandable and unambiguous way. The dataset used in this project is a demography data collected from the world bank website from 2010 to 2015. If I may ask, are you not curious to know if there is any relationship between labor Force/Population and diseases like HIV/AIDS, Tuberculosis, Malaria etc.? Better still, does geographical location has anything to do with all the widespread of these diseases?
Although the dataset was collected from world bank, but the issue with demographic data is that, they always have missing observation(s) and this dataset is no exception. We had a lot of missing observations (NAs), but I was able to deal with it.
In the data exploration, I obtained an inferential statistic and checked for any outliers. I majorly used SQL from sqldf package (it was the best for this type of dataset) to extract our observation and was able to show their (countries using GOOGLEVIS) location and graphically using PLOTLY and GGPLOT2. I observed that the dataset was not normally distributed, so, I decided to use Generalized Linear Model (GLM) for regression analysis and Analysis of Variance (ANOVA) to obtain best model for predictions.
In the end, I was able to use data visualization in the field of epidemiology to depict succinctly the effect of labor force and population on the widespread of diseases. I also observed that using incident risk alone may not be the best option to make recommendation and conclusion on epidemics, but rather use prevalence to make accurate analysis.
Plotly Map
Click this map to see the incident risk-person for the year 2015
DATA EXPLORATORY
Get the datasets from here:
https://raw.githubusercontent.com/mascotinme/MSDA-608/master/WorldHealth.csv
https://raw.githubusercontent.com/mascotinme/MSDA-608/master/Countries_long_lat2.csv
suppressMessages(library(R.rsp))
suppressMessages(library(pandocfilters))
suppressMessages(library(knitr))
suppressMessages(library(plyr))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
suppressMessages(library(sqldf))
suppressMessages(library(MASS))
suppressMessages(library(reshape2))
suppressMessages(library(Amelia))
suppressMessages(library(mice))
suppressPackageStartupMessages(library(googleVis))
suppressMessages(library(stringi))
suppressMessages(library(ROCR));
The dataset is being loaded into R and a glimpse of it follows…
df <- read.csv("https://raw.githubusercontent.com/mascotinme/MSDA-608/master/WorldHealth.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
kable(head(df[1:10, ]))
| Year_Code | Country_Name | Country_Code | Adults_15_living_HIV | Adults_Children_0_14_15_living_HIV | AIDS_estimated_deaths_UNAIDS | Adults_children_0_14_15_newly_infected_HIV | Adults_15_newly_infected_HIV | Children_0_14_living_with_HIV | Children_orphaned_by_HIV_AIDS | Children_0_14_newly_infected_HIV | Incidence_tuberculosis_per_100000 | Labor_force_total | Mortality_traffic_injury_100K | Population_female | Population_male | Population_total | Malaria_cases_reported | Suicide_mortality_per_100K | Tuberculosis_death_per_100K | Tuberculosis_case_detection | Tuberculosis_treatment_success_NewCases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YR2010 | Afghanistan | AFG | 4300 | 4500 | 500 | 1000 | 1000 | 200 | 3000 | 100 | 189.0 | 7707349 | 15.3 | 14005473 | 14797694 | 28803167 | 69397 | 5.1 | 41.00 | 53 | 86 |
| YR2010 | Albania | ALB | 1000 | 1000 | 100 | 200 | 200 | NA | NA | NA | 17.0 | 1216574 | 15.4 | 1451422 | 1461599 | 2913021 | NA | 5.3 | 0.20 | 88 | 89 |
| YR2010 | Algeria | DZA | 9000 | 9200 | 500 | 1000 | 1000 | 500 | 2100 | 100 | 77.0 | 11115092 | 24.7 | 17887239 | 18230398 | 36117637 | 1 | 3.4 | 8.50 | 80 | 89 |
| YR2010 | American Samoa | ASM | NA | NA | NA | NA | NA | NA | NA | NA | 8.3 | NA | NA | NA | NA | 55637 | NA | NA | 0.68 | 87 | 100 |
| YR2010 | Andorra | AND | NA | NA | NA | NA | NA | NA | NA | NA | 9.5 | NA | NA | NA | NA | 84449 | NA | NA | 0.78 | 87 | 86 |
| YR2010 | Angola | AGO | 190000 | 210000 | 9400 | 25000 | 21000 | 18000 | 94000 | 3800 | 384.0 | 8394741 | 20.8 | 11936016 | 11433115 | 23369131 | 1682870 | 20.6 | 58.00 | 55 | 48 |
We are going to include another data that contains all the needed Longtitude and latitude for all country locations in the ddataset.
# We are going to include another data that contains all the needed Longtitude and latitude for all country locations in the ddataset.
lat_long <- read.csv("https://raw.githubusercontent.com/mascotinme/MSDA-608/master/Countries_long_lat2.csv", header = TRUE, sep = ",")
colnames(lat_long) <- c("Country_Name", "Country_Code", "Latitude", "Longtitude")
kable(head(lat_long))
| Country_Name | Country_Code | Latitude | Longtitude |
|---|---|---|---|
| Albania | ALB | 41.0000 | 20.0000 |
| Algeria | DZA | 28.0000 | 3.0000 |
| American Samoa | ASM | -14.3333 | -170.0000 |
| Andorra | AND | 42.5000 | 1.6000 |
| Angola | AGO | -12.5000 | 18.5000 |
| Anguilla | AIA | 18.2500 | -63.1667 |
options(warn = -1)
df2 <- merge(df, lat_long, by.x = "Country_Name", by.y = "Country_Name", all = TRUE)
kable(head(df2))
| Country_Name | Year_Code | Country_Code.x | Adults_15_living_HIV | Adults_Children_0_14_15_living_HIV | AIDS_estimated_deaths_UNAIDS | Adults_children_0_14_15_newly_infected_HIV | Adults_15_newly_infected_HIV | Children_0_14_living_with_HIV | Children_orphaned_by_HIV_AIDS | Children_0_14_newly_infected_HIV | Incidence_tuberculosis_per_100000 | Labor_force_total | Mortality_traffic_injury_100K | Population_female | Population_male | Population_total | Malaria_cases_reported | Suicide_mortality_per_100K | Tuberculosis_death_per_100K | Tuberculosis_case_detection | Tuberculosis_treatment_success_NewCases | Country_Code.y | Latitude | Longtitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | YR2015 | AFG | 6700 | 6900 | 500 | 1000 | 1000 | 500 | 4200 | 100 | 189 | 9825616 | 15.2 | 16346869 | 17389625 | 33736494 | 86895 | 5.5 | 37 | 58 | NA | AFG | 33 | 65 |
| Afghanistan | YR2013 | AFG | 5600 | 5800 | 500 | 1000 | 1000 | 200 | 3700 | 100 | 189 | 8916157 | NA | 15398276 | 16333412 | 31731688 | 39263 | NA | 41 | 53 | 88 | AFG | 33 | 65 |
| Afghanistan | YR2011 | AFG | 4700 | 4900 | 500 | 1000 | 1000 | 200 | 3200 | 100 | 189 | 8050184 | NA | 14444001 | 15264598 | 29708599 | 77549 | NA | 42 | 51 | 88 | AFG | 33 | 65 |
| Afghanistan | YR2012 | AFG | 5200 | 5300 | 500 | 1000 | 1000 | 200 | 3500 | 100 | 189 | 8458402 | NA | 14912657 | 15784301 | 30696958 | 54840 | NA | 42 | 51 | 88 | AFG | 33 | 65 |
| Afghanistan | YR2014 | AFG | 6200 | 6400 | 500 | 1000 | 1000 | 500 | 4000 | 100 | 189 | 9397624 | NA | 15881092 | 16876928 | 32758020 | 145282 | NA | 41 | 53 | 87 | AFG | 33 | 65 |
| Afghanistan | YR2010 | AFG | 4300 | 4500 | 500 | 1000 | 1000 | 200 | 3000 | 100 | 189 | 7707349 | 15.3 | 14005473 | 14797694 | 28803167 | 69397 | 5.1 | 41 | 53 | 86 | AFG | 33 | 65 |
Checking for all NAs in the datasets before cleaning.
na_count_before <-sapply(df, function(a) sum(is.na(a)))
na_count_after <-sapply(df2, function(b) sum(is.na(b)))
kable(na_count_before)
| Year_Code | 0 |
| Country_Name | 0 |
| Country_Code | 0 |
| Adults_15_living_HIV | 511 |
| Adults_Children_0_14_15_living_HIV | 516 |
| AIDS_estimated_deaths_UNAIDS | 540 |
| Adults_children_0_14_15_newly_infected_HIV | 576 |
| Adults_15_newly_infected_HIV | 571 |
| Children_0_14_living_with_HIV | 684 |
| Children_orphaned_by_HIV_AIDS | 684 |
| Children_0_14_newly_infected_HIV | 678 |
| Incidence_tuberculosis_per_100000 | 55 |
| Labor_force_total | 160 |
| Mortality_traffic_injury_100K | 916 |
| Population_female | 118 |
| Population_male | 118 |
| Population_total | 4 |
| Malaria_cases_reported | 703 |
| Suicide_mortality_per_100K | 916 |
| Tuberculosis_death_per_100K | 55 |
| Tuberculosis_case_detection | 101 |
| Tuberculosis_treatment_success_NewCases | 345 |
kable(na_count_after)
| Country_Name | 0 |
| Year_Code | 68 |
| Country_Code.x | 68 |
| Adults_15_living_HIV | 579 |
| Adults_Children_0_14_15_living_HIV | 584 |
| AIDS_estimated_deaths_UNAIDS | 608 |
| Adults_children_0_14_15_newly_infected_HIV | 644 |
| Adults_15_newly_infected_HIV | 639 |
| Children_0_14_living_with_HIV | 752 |
| Children_orphaned_by_HIV_AIDS | 752 |
| Children_0_14_newly_infected_HIV | 746 |
| Incidence_tuberculosis_per_100000 | 123 |
| Labor_force_total | 228 |
| Mortality_traffic_injury_100K | 984 |
| Population_female | 186 |
| Population_male | 186 |
| Population_total | 72 |
| Malaria_cases_reported | 771 |
| Suicide_mortality_per_100K | 984 |
| Tuberculosis_death_per_100K | 123 |
| Tuberculosis_case_detection | 169 |
| Tuberculosis_treatment_success_NewCases | 413 |
| Country_Code.y | 246 |
| Latitude | 246 |
| Longtitude | 246 |
From the table above, we can see that both datasets contains alot of NAs (This is due to data unavalability in some countries under observation). But, before we can proceed, let first convert some of the variable to numeric. This would give room for better a analysis.
#Covert the columns to Numeric
df2[, c(4:22)] <- sapply(df2[, c(4:22)], as.numeric)
glimpse(df2)
## Observations: 1,364
## Variables: 25
## $ Country_Name <chr> "Afghanistan", "Afg...
## $ Year_Code <chr> "YR2015", "YR2013",...
## $ Country_Code.x <chr> "AFG", "AFG", "AFG"...
## $ Adults_15_living_HIV <dbl> 6700, 5600, 4700, 5...
## $ Adults_Children_0_14_15_living_HIV <dbl> 6900, 5800, 4900, 5...
## $ AIDS_estimated_deaths_UNAIDS <dbl> 500, 500, 500, 500,...
## $ Adults_children_0_14_15_newly_infected_HIV <dbl> 1000, 1000, 1000, 1...
## $ Adults_15_newly_infected_HIV <dbl> 1000, 1000, 1000, 1...
## $ Children_0_14_living_with_HIV <dbl> 500, 200, 200, 200,...
## $ Children_orphaned_by_HIV_AIDS <dbl> 4200, 3700, 3200, 3...
## $ Children_0_14_newly_infected_HIV <dbl> 100, 100, 100, 100,...
## $ Incidence_tuberculosis_per_100000 <dbl> 189.0, 189.0, 189.0...
## $ Labor_force_total <dbl> 9825616, 8916157, 8...
## $ Mortality_traffic_injury_100K <dbl> 15.2, NA, NA, NA, N...
## $ Population_female <dbl> 16346869, 15398276,...
## $ Population_male <dbl> 17389625, 16333412,...
## $ Population_total <dbl> 33736494, 31731688,...
## $ Malaria_cases_reported <dbl> 86895, 39263, 77549...
## $ Suicide_mortality_per_100K <dbl> 5.5, NA, NA, NA, NA...
## $ Tuberculosis_death_per_100K <dbl> 37.00, 41.00, 42.00...
## $ Tuberculosis_case_detection <dbl> 58, 53, 51, 51, 53,...
## $ Tuberculosis_treatment_success_NewCases <dbl> NA, 88, 88, 88, 87,...
## $ Country_Code.y <fctr> AFG, AFG, AFG, AFG...
## $ Latitude <dbl> 33.0000, 33.0000, 3...
## $ Longtitude <dbl> 65.0, 65.0, 65.0, 6...
Merging column longitude and Latitude together for a better coordinate to be used in our maps (googlevis requirement!)
df2$Lat_Long = paste(df2$Latitude, df2$Longtitude, sep=":")
kable(head(df2))
| Country_Name | Year_Code | Country_Code.x | Adults_15_living_HIV | Adults_Children_0_14_15_living_HIV | AIDS_estimated_deaths_UNAIDS | Adults_children_0_14_15_newly_infected_HIV | Adults_15_newly_infected_HIV | Children_0_14_living_with_HIV | Children_orphaned_by_HIV_AIDS | Children_0_14_newly_infected_HIV | Incidence_tuberculosis_per_100000 | Labor_force_total | Mortality_traffic_injury_100K | Population_female | Population_male | Population_total | Malaria_cases_reported | Suicide_mortality_per_100K | Tuberculosis_death_per_100K | Tuberculosis_case_detection | Tuberculosis_treatment_success_NewCases | Country_Code.y | Latitude | Longtitude | Lat_Long |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | YR2015 | AFG | 6700 | 6900 | 500 | 1000 | 1000 | 500 | 4200 | 100 | 189 | 9825616 | 15.2 | 16346869 | 17389625 | 33736494 | 86895 | 5.5 | 37 | 58 | NA | AFG | 33 | 65 | 33:65 |
| Afghanistan | YR2013 | AFG | 5600 | 5800 | 500 | 1000 | 1000 | 200 | 3700 | 100 | 189 | 8916157 | NA | 15398276 | 16333412 | 31731688 | 39263 | NA | 41 | 53 | 88 | AFG | 33 | 65 | 33:65 |
| Afghanistan | YR2011 | AFG | 4700 | 4900 | 500 | 1000 | 1000 | 200 | 3200 | 100 | 189 | 8050184 | NA | 14444001 | 15264598 | 29708599 | 77549 | NA | 42 | 51 | 88 | AFG | 33 | 65 | 33:65 |
| Afghanistan | YR2012 | AFG | 5200 | 5300 | 500 | 1000 | 1000 | 200 | 3500 | 100 | 189 | 8458402 | NA | 14912657 | 15784301 | 30696958 | 54840 | NA | 42 | 51 | 88 | AFG | 33 | 65 | 33:65 |
| Afghanistan | YR2014 | AFG | 6200 | 6400 | 500 | 1000 | 1000 | 500 | 4000 | 100 | 189 | 9397624 | NA | 15881092 | 16876928 | 32758020 | 145282 | NA | 41 | 53 | 87 | AFG | 33 | 65 | 33:65 |
| Afghanistan | YR2010 | AFG | 4300 | 4500 | 500 | 1000 | 1000 | 200 | 3000 | 100 | 189 | 7707349 | 15.3 | 14005473 | 14797694 | 28803167 | 69397 | 5.1 | 41 | 53 | 86 | AFG | 33 | 65 | 33:65 |
A graphical representation of the missing values.
missmap(df2, legend = TRUE, col = c("wheat","darkred", col=c('yellow', 'darkgreen')), main ="Plot Showing The Missing Values Per Observation",
y.cex = 0.8, x.cex = 0.8,csvar = NULL, tsvar =NULL, rank.order = TRUE)
mice_missing_val <- mice(df2, m=1, method='cart', printFlag=FALSE, where = is.na(df2))
mice_missing_val_comp <- complete(mice_missing_val, action='long', include=FALSE)
df2_cleaned <- na.omit(mice_missing_val_comp)
df2_cleaned <- df2_cleaned[, -c(1:2)]
kable(head(df2_cleaned))
| Country_Name | Year_Code | Country_Code.x | Adults_15_living_HIV | Adults_Children_0_14_15_living_HIV | AIDS_estimated_deaths_UNAIDS | Adults_children_0_14_15_newly_infected_HIV | Adults_15_newly_infected_HIV | Children_0_14_living_with_HIV | Children_orphaned_by_HIV_AIDS | Children_0_14_newly_infected_HIV | Incidence_tuberculosis_per_100000 | Labor_force_total | Mortality_traffic_injury_100K | Population_female | Population_male | Population_total | Malaria_cases_reported | Suicide_mortality_per_100K | Tuberculosis_death_per_100K | Tuberculosis_case_detection | Tuberculosis_treatment_success_NewCases | Country_Code.y | Latitude | Longtitude | Lat_Long |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | YR2015 | AFG | 6700 | 6900 | 500 | 1000 | 1000 | 500 | 4200 | 100 | 189 | 9825616 | 15.2 | 16346869 | 17389625 | 33736494 | 86895 | 5.500000 | 37 | 58 | 91 | AFG | 33 | 65 | 33:65 |
| Afghanistan | YR2013 | AFG | 5600 | 5800 | 500 | 1000 | 1000 | 200 | 3700 | 100 | 189 | 8916157 | 4.6 | 15398276 | 16333412 | 31731688 | 39263 | 14.600000 | 41 | 53 | 88 | AFG | 33 | 65 | 33:65 |
| Afghanistan | YR2011 | AFG | 4700 | 4900 | 500 | 1000 | 1000 | 200 | 3200 | 100 | 189 | 8050184 | 15.0 | 14444001 | 15264598 | 29708599 | 77549 | 5.100000 | 42 | 51 | 88 | AFG | 33 | 65 | 33:65 |
| Afghanistan | YR2012 | AFG | 5200 | 5300 | 500 | 1000 | 1000 | 200 | 3500 | 100 | 189 | 8458402 | 6.5 | 14912657 | 15784301 | 30696958 | 54840 | 8.500000 | 42 | 51 | 88 | AFG | 33 | 65 | 33:65 |
| Afghanistan | YR2014 | AFG | 6200 | 6400 | 500 | 1000 | 1000 | 500 | 4000 | 100 | 189 | 9397624 | 13.0 | 15881092 | 16876928 | 32758020 | 145282 | 9.397671 | 41 | 53 | 87 | AFG | 33 | 65 | 33:65 |
| Afghanistan | YR2010 | AFG | 4300 | 4500 | 500 | 1000 | 1000 | 200 | 3000 | 100 | 189 | 7707349 | 15.3 | 14005473 | 14797694 | 28803167 | 69397 | 5.100000 | 41 | 53 | 86 | AFG | 33 | 65 | 33:65 |
We now have data with no missing observations.
na_count_after_cleaning <-sapply(df2_cleaned, function(c) sum(is.na(c)))
kable(na_count_after_cleaning)
| Country_Name | 0 |
| Year_Code | 0 |
| Country_Code.x | 0 |
| Adults_15_living_HIV | 0 |
| Adults_Children_0_14_15_living_HIV | 0 |
| AIDS_estimated_deaths_UNAIDS | 0 |
| Adults_children_0_14_15_newly_infected_HIV | 0 |
| Adults_15_newly_infected_HIV | 0 |
| Children_0_14_living_with_HIV | 0 |
| Children_orphaned_by_HIV_AIDS | 0 |
| Children_0_14_newly_infected_HIV | 0 |
| Incidence_tuberculosis_per_100000 | 0 |
| Labor_force_total | 0 |
| Mortality_traffic_injury_100K | 0 |
| Population_female | 0 |
| Population_male | 0 |
| Population_total | 0 |
| Malaria_cases_reported | 0 |
| Suicide_mortality_per_100K | 0 |
| Tuberculosis_death_per_100K | 0 |
| Tuberculosis_case_detection | 0 |
| Tuberculosis_treatment_success_NewCases | 0 |
| Country_Code.y | 0 |
| Latitude | 0 |
| Longtitude | 0 |
| Lat_Long | 0 |
Summary Statistics
kable(summary(df2_cleaned[, 4:22]))
| Adults_15_living_HIV | Adults_Children_0_14_15_living_HIV | AIDS_estimated_deaths_UNAIDS | Adults_children_0_14_15_newly_infected_HIV | Adults_15_newly_infected_HIV | Children_0_14_living_with_HIV | Children_orphaned_by_HIV_AIDS | Children_0_14_newly_infected_HIV | Incidence_tuberculosis_per_100000 | Labor_force_total | Mortality_traffic_injury_100K | Population_female | Population_male | Population_total | Malaria_cases_reported | Suicide_mortality_per_100K | Tuberculosis_death_per_100K | Tuberculosis_case_detection | Tuberculosis_treatment_success_NewCases | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 100 | Min. : 100 | Min. : 100 | Min. : 100 | Min. : 100 | Min. : 100 | Min. : 500 | Min. : 100 | Min. : 0.00 | Min. :1.389e+05 | Min. : 2.40 | Min. :1.454e+05 | Min. :1.341e+05 | Min. :2.796e+05 | Min. : 0 | Min. : 0.000 | Min. : 0.11 | Min. : 15.00 | Min. : 0.00 | |
| 1st Qu.: 8300 | 1st Qu.: 8700 | 1st Qu.: 200 | 1st Qu.: 1000 | 1st Qu.: 1000 | 1st Qu.: 500 | 1st Qu.: 3925 | 1st Qu.: 100 | 1st Qu.: 27.25 | 1st Qu.:1.897e+06 | 1st Qu.:14.10 | 1st Qu.:2.199e+06 | 1st Qu.:2.194e+06 | 1st Qu.:4.350e+06 | 1st Qu.: 364 | 1st Qu.: 6.025 | 1st Qu.: 2.00 | 1st Qu.: 58.00 | 1st Qu.: 75.00 | |
| Median : 37000 | Median : 40500 | Median : 1300 | Median : 2400 | Median : 2000 | Median : 2400 | Median : 23000 | Median : 500 | Median : 100.00 | Median :5.089e+06 | Median :20.80 | Median :5.661e+06 | Median :5.620e+06 | Median :1.131e+07 | Median : 15156 | Median : 8.200 | Median : 9.25 | Median : 76.00 | Median : 83.00 | |
| Mean : 514638 | Mean : 555964 | Mean : 20739 | Mean : 32101 | Mean : 28040 | Mean : 44315 | Mean : 326680 | Mean : 4192 | Mean : 165.15 | Mean :4.366e+07 | Mean :20.25 | Mean :4.838e+07 | Mean :4.923e+07 | Mean :9.763e+07 | Mean : 469355 | Mean :10.256 | Mean :18.96 | Mean : 70.22 | Mean : 79.34 | |
| 3rd Qu.: 170000 | 3rd Qu.: 180000 | 3rd Qu.: 5875 | 3rd Qu.: 9950 | 3rd Qu.: 9225 | 3rd Qu.: 13000 | 3rd Qu.: 97000 | 3rd Qu.: 1200 | 3rd Qu.: 235.00 | 3rd Qu.:1.229e+07 | 3rd Qu.:27.30 | 3rd Qu.:1.476e+07 | 3rd Qu.:1.574e+07 | 3rd Qu.:3.030e+07 | 3rd Qu.: 259482 | 3rd Qu.:12.700 | 3rd Qu.:30.00 | 3rd Qu.: 87.00 | 3rd Qu.: 88.00 | |
| Max. :33800000 | Max. :36100000 | Max. :1500000 | Max. :2200000 | Max. :1900000 | Max. :2600000 | Max. :19700000 | Max. :300000 | Max. :1246.00 | Max. :3.406e+09 | Max. :45.40 | Max. :3.644e+09 | Max. :3.708e+09 | Max. :7.355e+09 | Max. :11627473 | Max. :35.300 | Max. :99.00 | Max. :110.00 | Max. :100.00 |
Percentage of Children Orphaned By HIV/AIDS
All_Cen <- sqldf("SELECT Country_Name, Year_Code, (Children_orphaned_by_HIV_AIDS/Population_total)*100 as 'Percentage_Orphaned_byHIV' FROM df2_cleaned where Percentage_Orphaned_byHIV >= 2 ORDER BY Percentage_Orphaned_byHIV DESC LIMIT 50" )
A box plot of the outcomes.
# Boxplot
g <- ggplot(All_Cen, aes(Year_Code, Percentage_Orphaned_byHIV)) + geom_boxplot(varwidth=T, fill="plum") +
labs(title="Box plot of Children Orphaned By HIV/AID per Year (%)",
subtitle="Percentage Orphaned by HIV grouped by Year",
caption="Source: The World Bank Data Bank",
x="Year",
y="% of Children Orphaned By HIV/AIDS")
ggplotly(g)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
THE ANALYSIS
Let first look at the graph of children orphaned by HIV?AIDS!
ggplotly(ggplot(All_Cen, aes(x=Percentage_Orphaned_byHIV, y=reorder(Country_Name, +Percentage_Orphaned_byHIV), fill=Year_Code)) +
geom_point(colour="purple", size=2, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code))) + labs(title="Chart of Children Orphaned By HIV/AIDS BY Countries (%)"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
INCIDENCE PROPORTION,INCIDENCE RATES AND PREVALENCE
Incidence: a measure of the occurrence of a new disease, in a defined population at risk for the disease, in a specified time period.
while
1). Incidence proportion (also known as cumulative incidence) is the number of new cases within a specified time period divided by the size of the population initially at risk
2). Incidence rate or person-time rate is a measure of incidence that incorporates time directly into the denominator. A person-time rate is generally calculated from a long-term cohort follow-up study, wherein enrollees are followed over time and the occurrence of new cases of disease is documented." —CDC
3). Prevalence is the proportion of the total number of cases to the total population and is more a measure of the burden of the disease on society with no regard to time at risk or when subjects may have been exposed to a possible risk factor. Prevalence can also be measured with respect to a specific subgroup of a population
\(Incidence\quad Rate\quad =\quad \frac { Number\quad of\quad newcases\quad of\quad disease\quad or\quad injury\quad during\quad specified\quad period }{ Time\quad each\quad person\quad was\quad observed,\quad totaled\quad for\quad all\quad persons }\)
Hiv_Incident_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Adults_children_0_14_15_newly_infected_HIV/Population_total)*100000 as 'HIV_Incident_Risk_Per_100K' FROM df2_cleaned where HIV_Incident_Risk_Per_100K >= 200 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 30" )
kable(head(Hiv_Incident_Risk))
| Population_total | Lat_Long | Country_Name | Year_Code | HIV_Incident_Risk_Per_100K |
|---|---|---|---|---|
| 2209197 | -22:24 | Botswana | YR2015 | 497.9185 |
| 4546100 | 7:21 | Central African Republic | YR2015 | 202.3713 |
| 1175389 | 2:10 | Equatorial Guinea | YR2015 | 204.1877 |
| 2174645 | -29.5:28.5 | Lesotho | YR2015 | 1011.6594 |
| 17573607 | -13.5:34 | Malawi | YR2015 | 221.9237 |
| 28010691 | -18.25:35 | Mozambique | YR2015 | 339.1562 |
Tuberculosis_Inc_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Tuberculosis_case_detection/Population_total)* 100000 as 'Tuberculocisis_Incident_Risk_Per_100K' FROM df2_cleaned where Tuberculocisis_Incident_Risk_Per_100K >= 1 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 50" )
kable(head(Tuberculosis_Inc_Risk))
| Population_total | Lat_Long | Country_Name | Year_Code | Tuberculocisis_Incident_Risk_Per_100K |
|---|---|---|---|---|
| 2880703 | 41:20 | Albania | YR2015 | 2.638245 |
| 2916950 | 40.0691:45.03819 | Armenia | YR2015 | 3.051132 |
| 1371855 | 26:50.55 | Bahrain | YR2015 | 6.341778 |
| 284217 | 13.1667:-59.5333 | Barbados | YR2015 | 30.610414 |
| 359288 | 17.25:-88.75 | Belize | YR2015 | 24.214558 |
| 2209197 | -22:24 | Botswana | YR2015 | 2.806450 |
Malaria_Inc_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Malaria_cases_reported/Population_total)* 100000 as 'Malaria_Incident_Risk_Per_100K' FROM df2_cleaned where Malaria_Incident_Risk_Per_100K >= 1 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 50" )
kable(head(Malaria_Inc_Risk))
| Population_total | Lat_Long | Country_Name | Year_Code | Malaria_Incident_Risk_Per_100K |
|---|---|---|---|---|
| 33736494 | 33:65 | Afghanistan | YR2015 | 257.569740 |
| 2880703 | 41:20 | Albania | YR2015 | 1.909256 |
| 27859305 | -12.5:18.5 | Angola | YR2015 | 9940.323350 |
| 161200886 | 24:90 | Bangladesh | YR2015 | 4.099233 |
| 284217 | 13.1667:-59.5333 | Barbados | YR2015 | 6186.470197 |
| 9489616 | 53:28 | Belarus | YR2015 | 126.980902 |
ggplotly(ggplot(Hiv_Incident_Risk, aes(x=HIV_Incident_Risk_Per_100K, y=reorder(Country_Name,+HIV_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="red", size=2, alpha=.8) + ylab("Countries") + xlab("HIV Incident Risk") +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code))) + labs(title="Chart of HIV/AIDS Incidence (Person At Risk) BY Countries"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(ggplot(Tuberculosis_Inc_Risk, aes(x=Tuberculocisis_Incident_Risk_Per_100K, y=reorder(Country_Name, Tuberculocisis_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="purple", size=2, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code)))+ labs(title="Chart of Tuberculosis Incidence (Person At Risk) BY Countries"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(ggplot(Malaria_Inc_Risk, aes(x=Malaria_Incident_Risk_Per_100K, y=reorder(Country_Name, Malaria_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="brown", size=2, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code)))+ labs(title="Chart of Malaria Incidence (Person At Risk) BY Countries"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Their Locations On The Map
Hiv_Inc_Map <- gvisGeoChart(Hiv_Incident_Risk, locationvar ="Lat_Long", hovervar ="Country_Name",sizevar = "HIV_Incident_Risk_Per_100K", colorvar = "Population_total",
options=list(displayMode="Markers",
colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
backgroundColor="lightblue"), chartid="Hiv_Incidence_Person_At_Risk")
plot(Hiv_Inc_Map)
Tub_Inc_Map <- gvisGeoChart(Tuberculosis_Inc_Risk, locationvar ="Lat_Long", hovervar ="Country_Name",sizevar = "Tuberculocisis_Incident_Risk_Per_100K", colorvar = "Population_total",
options=list(displayMode="Markers",
colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
backgroundColor="lightblue"), chartid="Tuberculosis_Incidence_Person_At_Risk")
plot(Tub_Inc_Map )
Malaria_Inc_Map <- gvisGeoChart(Malaria_Inc_Risk, locationvar ="Lat_Long", hovervar ="Country_Name",sizevar = "Malaria_Incident_Risk_Per_100K", colorvar = "Population_total",
options=list(displayMode="Markers",
colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
backgroundColor="lightblue"), chartid="Malaria_Incidence_Person_At_Risk")
plot(Malaria_Inc_Map )
HIV/AID, Malaria & Tuberculosis Incident Risk Comparison
comparison <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, Labor_force_total, (Malaria_cases_reported/Population_total)* 1000 as 'Malaria_Incident_Risk_Per_1K', (Tuberculosis_case_detection/Population_total)* 100000 as 'Tuberculocisis_Incident_Risk_Per_100K', (Adults_children_0_14_15_newly_infected_HIV/Population_total)*100000 as 'HIV_Incident_Risk_Per_100K' FROM df2_cleaned where Malaria_Incident_Risk_Per_1K >= 1 and Lat_Long != 'NA:NA' or Tuberculocisis_Incident_Risk_Per_100K >= 1 or HIV_Incident_Risk_Per_100K >= 200 ORDER BY Country_Name ASC")
kable(head(comparison))
| Population_total | Lat_Long | Country_Name | Year_Code | Labor_force_total | Malaria_Incident_Risk_Per_1K | Tuberculocisis_Incident_Risk_Per_100K | HIV_Incident_Risk_Per_100K |
|---|---|---|---|---|---|---|---|
| 33736494 | 33:65 | Afghanistan | YR2015 | 9825616 | 2.575697 | 0.1719207 | 2.964149 |
| 31731688 | 33:65 | Afghanistan | YR2013 | 8916157 | 1.237344 | 0.1670255 | 3.151424 |
| 29708599 | 33:65 | Afghanistan | YR2011 | 8050184 | 2.610322 | 0.1716675 | 3.366029 |
| 30696958 | 33:65 | Afghanistan | YR2012 | 8458402 | 1.786496 | 0.1661402 | 3.257652 |
| 32758020 | 33:65 | Afghanistan | YR2014 | 9397624 | 4.435005 | 0.1617924 | 3.052688 |
| 28803167 | 33:65 | Afghanistan | YR2010 | 7707349 | 2.409353 | 0.1840075 | 3.471840 |
A 3D approach
comparison <- comparison[order(comparison$Country_Name, comparison$Year_Code),]
comparison$size <- comparison$Population_total
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')
pw <- plot_ly(comparison, x = ~Malaria_Incident_Risk_Per_1K, y = ~Tuberculocisis_Incident_Risk_Per_100K, z = ~HIV_Incident_Risk_Per_100K, color = ~Year_Code, size = ~size, colors = colors,
marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 120),
text = ~paste('Country:', Country_Name, '<br>Labor Force:', Labor_force_total, '<br>Population Total:', Population_total,
'<br>Lat & Long.:', Lat_Long)) %>%
layout(title = 'Comparison Among HIV/AIDS, Malaria & Tuberculosis Incidence For People At Risk',
scene = list(xaxis = list(title = 'Malaria Pop At Risk Per 1K',
gridcolor = 'rgb(255, 255, 255)',
type = 'log',
zerolinewidth = 1,
ticklen = 5,
gridwidth = 2),
yaxis = list(title = 'Tuberculosis Pop At Risk Per 100K',
gridcolor = 'rgb(255, 255, 255)',
zerolinewidth = 1,
ticklen = 5,
gridwith = 2),
zaxis = list(title = 'HIV/AIDS Pop At Risk Per 100K',
gridcolor = 'rgb(255, 255, 255)',
type = 'log',
zerolinewidth = 1,
ticklen = 5,
gridwith = 2)),
paper_bgcolor = 'rgb(243, 243, 243)',
plot_bgcolor = 'rgb(243, 243, 243)')
pw
Incidence Person_Per_Time
HIV_Inc_Person_Time <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, ((New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000 as 'HIV_Inc_Person_Time100K' FROM (SELECT Population_total, Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY New_Hiv_Cases, Year_Code DESC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(HIV_Inc_Person_Time))
| Population_total | Lat_Long | Country_Name | Year_Code | HIV_Inc_Person_Time100K |
|---|---|---|---|---|
| 1371855 | 26:50.55 | Bahrain | YR2015 | 72.90038 |
| 777424 | -12.1667:44.25 | Comoros | YR2015 | 128.65211 |
| 4203604 | 45.1667:15.5 | Croatia | YR2015 | 23.79033 |
| 9159302 | 31:36 | Jordan | YR2015 | 10.91828 |
| 3935794 | 29.3375:47.6581 | Kuwait | YR2015 | 25.40904 |
| 2976877 | 46:105 | Mongolia | YR2015 | 33.59576 |
#Nested SELECTion!
Tuber_Inc_Person_Time <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Tuberculosis_case_detection/(Population_total -(Tuber_trtd + AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000 as 'Tuberculosis_Inc_Person_Time100K' FROM (SELECT Population_total,Tuberculosis_case_detection, Lat_Long,Country_Name, Year_Code, Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Year_Code DESC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(Tuber_Inc_Person_Time))
| Population_total | Lat_Long | Country_Name | Year_Code | Tuberculosis_Inc_Person_Time100K |
|---|---|---|---|---|
| 33736494 | 33:65 | Afghanistan | YR2015 | 1.719336 |
| 2880703 | 41:20 | Albania | YR2015 | 26.384446 |
| 39871528 | 28:3 | Algeria | YR2015 | 2.006544 |
| 27859305 | -12.5:18.5 | Angola | YR2015 | 2.298376 |
| 43417765 | -34:-64 | Argentina | YR2015 | 2.003961 |
| 2916950 | 40.0691:45.03819 | Armenia | YR2015 | 30.515015 |
Person Per Time Maps
HIV_Inc_Person_Time_Map <- gvisGeoChart(HIV_Inc_Person_Time, locationvar ="Lat_Long", hovervar ="Country_Name",sizevar = "HIV_Inc_Person_Time100K", colorvar = "Population_total",
options=list(displayMode="Markers",
colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
backgroundColor="lightblue"), chartid="Hiv_Incidence_Person_Per_Time")
plot(HIV_Inc_Person_Time_Map )
Tuber_Inc_PT_Map <- gvisGeoChart(Tuber_Inc_Person_Time, locationvar ="Lat_Long", hovervar ="Country_Name",sizevar = "Tuberculosis_Inc_Person_Time100K", colorvar = "Population_total",
options=list(displayMode="Markers",
colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
backgroundColor="lightblue"), chartid="Tuberculosis_Incidence_Person_Per_Time")
plot(Tuber_Inc_PT_Map)
PREVALENCE RATE:
Prevalence is the actual number of cases alive, with the disease either during a period of time (period prevalence) or at a particular date in time (point prevalence). Period prevalence provides the better measure of the disease load since it includes all new cases and all deaths between two dates, whereas point prevalence only counts those alive on a particular date.
Prevalence is also most meaningfully reported as the number of cases as a fraction of the total population at risk and can be further categorized according to different subsets of the population.
HIV_Prevalence <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round(((Old_Hiv_Cases + New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'HIV_Inc_Prevalence_100K' FROM (SELECT Population_total,Adults_Children_0_14_15_living_HIV as 'Old_Hiv_Cases', Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(HIV_Prevalence))
| Population_total | Lat_Long | Country_Name | Year_Code | HIV_Inc_Prevalence_100K |
|---|---|---|---|---|
| 33736494 | 33:65 | Afghanistan | YR2015 | 234 |
| 31731688 | 33:65 | Afghanistan | YR2013 | 214 |
| 29708599 | 33:65 | Afghanistan | YR2011 | 199 |
| 30696958 | 33:65 | Afghanistan | YR2012 | 205 |
| 32758020 | 33:65 | Afghanistan | YR2014 | 226 |
| 28803167 | 33:65 | Afghanistan | YR2010 | 191 |
NB: Malaria and Tuberculosis are most likely not be carried over to the proceding year. So, we will not include them (person living with either malaria or tuberculosis) in the numerators.
#Nested SELECTion!
Tuber_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round((Tuberculosis_case_detection/(Population_total -(Tuber_trtd + AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Tuberculosis_Prevalence_per_100K' FROM (SELECT Population_total,Tuberculosis_case_detection, Lat_Long,Country_Name, Year_Code, Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(Tuber_Prevalence_Rate))
| Population_total | Lat_Long | Country_Name | Year_Code | Tuberculosis_Prevalence_per_100K |
|---|---|---|---|---|
| 33736494 | 33:65 | Afghanistan | YR2015 | 2 |
| 31731688 | 33:65 | Afghanistan | YR2013 | 2 |
| 29708599 | 33:65 | Afghanistan | YR2011 | 2 |
| 30696958 | 33:65 | Afghanistan | YR2012 | 2 |
| 32758020 | 33:65 | Afghanistan | YR2014 | 2 |
| 28803167 | 33:65 | Afghanistan | YR2010 | 2 |
Malaria_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round((Malaria_cases_reported/(Population_total -(AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Malaria_Prevalence_per_100K' FROM (SELECT Population_total, Lat_Long,Country_Name, Year_Code, AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year',Malaria_cases_reported, round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(Malaria_Prevalence_Rate))
| Population_total | Lat_Long | Country_Name | Year_Code | Malaria_Prevalence_per_100K |
|---|---|---|---|---|
| 33736494 | 33:65 | Afghanistan | YR2015 | 2576 |
| 31731688 | 33:65 | Afghanistan | YR2013 | 1237 |
| 29708599 | 33:65 | Afghanistan | YR2011 | 2611 |
| 30696958 | 33:65 | Afghanistan | YR2012 | 1787 |
| 32758020 | 33:65 | Afghanistan | YR2014 | 4435 |
| 28803167 | 33:65 | Afghanistan | YR2010 | 2410 |
Com_HIV_Tub_Mal_Prevalence <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code,round((Labor_force_total/Population_total)*100) as 'Labor_Pop_Percentage',round((Tuberculosis_case_detection/(Population_total -(Tuber_trtd + AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Tuberculosis_Prevalence_per_100K',round((Malaria_cases_reported/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*100000) as 'Malaria_Prevalence_per_100K',round(((Old_Hiv_Cases + New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'HIV_Prevalence_Per_100K' FROM (SELECT Labor_force_total,Population_total,Adults_Children_0_14_15_living_HIV as 'Old_Hiv_Cases', Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',Tuberculosis_case_detection,Malaria_cases_reported,Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC) where Lat_Long != 'NA:NA'")
kable(head(Com_HIV_Tub_Mal_Prevalence))
| Population_total | Lat_Long | Country_Name | Year_Code | Labor_Pop_Percentage | Tuberculosis_Prevalence_per_100K | Malaria_Prevalence_per_100K | HIV_Prevalence_Per_100K |
|---|---|---|---|---|---|---|---|
| 33736494 | 33:65 | Afghanistan | YR2015 | 29 | 2 | 258 | 234 |
| 31731688 | 33:65 | Afghanistan | YR2013 | 28 | 2 | 124 | 214 |
| 29708599 | 33:65 | Afghanistan | YR2011 | 27 | 2 | 261 | 199 |
| 30696958 | 33:65 | Afghanistan | YR2012 | 28 | 2 | 179 | 205 |
| 32758020 | 33:65 | Afghanistan | YR2014 | 29 | 2 | 444 | 226 |
| 28803167 | 33:65 | Afghanistan | YR2010 | 27 | 2 | 241 | 191 |
comp <- Com_HIV_Tub_Mal_Prevalence[order(Com_HIV_Tub_Mal_Prevalence$Country_Name, Com_HIV_Tub_Mal_Prevalence$Year_Code),]
comp$size <- Com_HIV_Tub_Mal_Prevalence$Population_total
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')
pw2 <- plot_ly(comp, x = ~Malaria_Prevalence_per_100K, y = ~Tuberculosis_Prevalence_per_100K, z = ~HIV_Prevalence_Per_100K, color = ~Year_Code, size = ~size, colors = colors,
marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 120),
text = ~paste('Country:', Country_Name, '<br>Population Total:', Population_total,
'<br>Labor Force/Pop %:', Labor_Pop_Percentage)) %>%
layout(title = 'Comparison Among HIV/AIDS, Malaria & Tuberculosis Incidence For People At Risk',
scene = list(xaxis = list(title = 'Malaria Pop At Risk Per 1K',
gridcolor = 'rgb(255, 255, 255)',
type = 'log',
zerolinewidth = 1,
ticklen = 5,
gridwidth = 2),
yaxis = list(title = 'Tuberculosis Pop At Risk Per 100K',
gridcolor = 'rgb(255, 255, 255)',
zerolinewidth = 1,
ticklen = 5,
gridwith = 2),
zaxis = list(title = 'HIV/AIDS Pop At Risk Per 100K',
gridcolor = 'rgb(255, 255, 255)',
type = 'log',
zerolinewidth = 1,
ticklen = 5,
gridwith = 2)),
paper_bgcolor = 'rgb(243, 243, 243)',
plot_bgcolor = 'rgb(243, 243, 243)')
pw2
Regression Analysis & Prediction
qqnorm(Com_HIV_Tub_Mal_Prevalence$Malaria_Prevalence_per_100K)
qqline(Com_HIV_Tub_Mal_Prevalence$Malaria_Prevalence_per_100K, col = 2)
From the above normality check, we can see that the variable under observation is not normally distributed. We will therefore use Generalized Linear Model (conventional linear regression models) to; select best model, estimate parameters and interpretations. The generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models etc.
i.e The dependent variable Yi does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal etc.
For Linear Regression, we have
\({ Y }_{ i }\quad =\quad { \beta }_{ 0 }\quad +\quad { \beta }_{ 1 }{ x }_{ i }\quad +\quad { \varepsilon }_{ i }\quad \\ where,\\ { Y }_{ i }\quad =\quad Dependent\quad Variable.\\ { \beta }_{ 0 }\quad =\quad Intercept.\\ { \beta }_{ 1 }\quad =\quad Parameter\quad To\quad Be\quad estimated\quad (slope).\\ { x }_{ i }\quad =\quad Independent\quad Variable.\\ { \varepsilon }_{ i }\quad =\quad Error\quad Term.\\ i\quad \quad =\quad 1,\quad 2,\quad 3,\quad 4,......,\quad n\quad \\\)
While Multiple linear regression, we have (in matrix notation):
\(\hat { \beta } =\quad ({ { X }^{ T }X) }^{ -1 }{ X }^{ T }Y\)
# Create Training and Test data -
set.seed(12345) # setting seed to reproduce results of random sampling
Com_HIV_Tub_Mal_Prevalence <- Com_HIV_Tub_Mal_Prevalence[, c(1,5,6,7,8)]
traindata <- sample(1:nrow(Com_HIV_Tub_Mal_Prevalence), 0.8*nrow(Com_HIV_Tub_Mal_Prevalence)) # row indices for training data
Train <- Com_HIV_Tub_Mal_Prevalence[traindata, ] # model training data
Test <- Com_HIV_Tub_Mal_Prevalence[-traindata, ] # test data
SETTING UP HYPOTHESIS: For Labor Force
Null Hypothesis: Labor Force has no effect on HIV, Tuberculosis and Malaria.
\(H_o:\) \(\mu_1\) = \(\mu_2\) = \(\mu_3\)…..= \(\mu_n\)
Alternative Hythesis:
\(H_a:\) Not \(H_o:\) \(\mu_1\) \(\neq\) \(\mu_2\) \(\neq\) \(\mu_3\)…..\(\neq\) \(\mu_n\)
Rejection:
Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)
fit_labor <- glm(Labor_Pop_Percentage ~., family=poisson(link='log'),data=Train)
summary(stepwise <- step(fit_labor, direction = "both"))
## Start: AIC=3397.71
## Labor_Pop_Percentage ~ Population_total + Tuberculosis_Prevalence_per_100K +
## Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
##
## Df Deviance AIC
## <none> 661.93 3397.7
## - Malaria_Prevalence_per_100K 1 666.83 3400.6
## - Population_total 1 666.87 3400.6
## - Tuberculosis_Prevalence_per_100K 1 667.02 3400.8
## - HIV_Prevalence_Per_100K 1 671.42 3405.2
##
## Call:
## glm(formula = Labor_Pop_Percentage ~ Population_total + Tuberculosis_Prevalence_per_100K +
## Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"),
## data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2093 -0.6812 0.1074 0.7941 4.0322
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.797e+00 8.946e-03 424.403 < 2e-16
## Population_total -1.338e-10 6.168e-11 -2.169 0.03012
## Tuberculosis_Prevalence_per_100K 3.378e-04 1.482e-04 2.279 0.02264
## Malaria_Prevalence_per_100K -2.248e-06 1.027e-06 -2.189 0.02859
## HIV_Prevalence_Per_100K -6.438e-07 2.115e-07 -3.043 0.00234
##
## (Intercept) ***
## Population_total *
## Tuberculosis_Prevalence_per_100K *
## Malaria_Prevalence_per_100K *
## HIV_Prevalence_Per_100K **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 688.60 on 485 degrees of freedom
## Residual deviance: 661.93 on 481 degrees of freedom
## AIC: 3397.7
##
## Number of Fisher Scoring iterations: 4
residual_labor = data.frame(Fitted = fitted(stepwise),
Residuals = resid(stepwise), Treatment = Train$Labor_Pop_Percentage)
plot_res <- ggplot(residual_labor, aes(Fitted, Residuals, colour = Treatment)) + geom_point()
ggplotly(plot_res)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
anova(stepwise, test="F")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Labor_Pop_Percentage
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev F
## NULL 485 688.60
## Population_total 1 4.7922 484 683.81 4.7922
## Tuberculosis_Prevalence_per_100K 1 5.1819 483 678.63 5.1819
## Malaria_Prevalence_per_100K 1 7.2054 482 671.42 7.2054
## HIV_Prevalence_Per_100K 1 9.4892 481 661.93 9.4892
## Pr(>F)
## NULL
## Population_total 0.028589 *
## Tuberculosis_Prevalence_per_100K 0.022824 *
## Malaria_Prevalence_per_100K 0.007269 **
## HIV_Prevalence_Per_100K 0.002067 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
DECISION (Labor Force):
Reject Null Hypothesis (\(H_o\)) since P-Values are all less than Tabulated value of 0.05.
SETTING UP HYPOTHESIS: Population
Null Hypothesis: Population has no effect on HIV, Tuberculosis and Malaria.
\(H_o:\) \(\mu_1\) = \(\mu_2\) = \(\mu_3\)…..= \(\mu_n\)
Alternative Hythesis:
\(H_a:\) Not \(H_o:\) \(\mu_1\) \(\neq\) \(\mu_2\) \(\neq\) \(\mu_3\)…..\(\neq\) \(\mu_n\)
Rejection:
Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)
fit_Population <- glm(Population_total ~., family=poisson(link='log'),data=Train)
summary(stepwise_pop <- step(fit_Population , direction = "both"))
## Start: AIC=20335673459
## Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K +
## Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
##
## Df Deviance AIC
## <none> 2.0336e+10 2.0336e+10
## - Labor_Pop_Percentage 1 2.0638e+10 2.0638e+10
## - HIV_Prevalence_Per_100K 1 2.0751e+10 2.0751e+10
## - Malaria_Prevalence_per_100K 1 2.1448e+10 2.1448e+10
## - Tuberculosis_Prevalence_per_100K 1 4.4134e+10 4.4134e+10
##
## Call:
## glm(formula = Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K +
## Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"),
## data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -10699 -2899 184 4585 53407
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.967e+01 4.044e-05 486275 <2e-16
## Labor_Pop_Percentage -1.652e-02 9.326e-07 -17713 <2e-16
## Tuberculosis_Prevalence_per_100K -3.374e-01 3.824e-06 -88230 <2e-16
## Malaria_Prevalence_per_100K -7.291e-05 2.624e-09 -27782 <2e-16
## HIV_Prevalence_Per_100K -7.459e-06 4.197e-10 -17772 <2e-16
##
## (Intercept) ***
## Labor_Pop_Percentage ***
## Tuberculosis_Prevalence_per_100K ***
## Malaria_Prevalence_per_100K ***
## HIV_Prevalence_Per_100K ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4.8172e+10 on 485 degrees of freedom
## Residual deviance: 2.0336e+10 on 481 degrees of freedom
## AIC: 2.0336e+10
##
## Number of Fisher Scoring iterations: 8
residual_pop = data.frame(Fitted = fitted(stepwise_pop),
Residuals = resid(stepwise_pop), Treatment = Train$Population_total)
plot_pop <- ggplot(residual_pop, aes(Fitted, Residuals, colour = Treatment)) + geom_point()
ggplotly(plot_pop)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
anova(stepwise_pop, test="F")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Population_total
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 485 4.8172e+10
## Labor_Pop_Percentage 1 1.3111e+09 484 4.6861e+10
## Tuberculosis_Prevalence_per_100K 1 2.4433e+10 483 2.2428e+10
## Malaria_Prevalence_per_100K 1 1.6777e+09 482 2.0751e+10
## HIV_Prevalence_Per_100K 1 4.1494e+08 481 2.0336e+10
## F Pr(>F)
## NULL
## Labor_Pop_Percentage 1.3111e+09 < 2.2e-16 ***
## Tuberculosis_Prevalence_per_100K 2.4433e+10 < 2.2e-16 ***
## Malaria_Prevalence_per_100K 1.6777e+09 < 2.2e-16 ***
## HIV_Prevalence_Per_100K 4.1494e+08 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
DECISION (Population):
Reject Null Hypothesis since P-Values are all less than Tabulated value of 0.05.
Non Zero Confidence Interval
Meaning that there is a 95% probability that the confidence interval will contain the true population mean. From the table below, we can deduce that the our analysis is correct (as non of the value between the 2.5 and 97.5 is zero) for Population_total, Malaria and HIV as corfirmed by Analysis of Variance (Anova) result above.
confint(stepwise)
## 2.5 % 97.5 %
## (Intercept) 3.779050e+00 3.814117e+00
## Population_total -2.572919e-10 -1.540813e-11
## Tuberculosis_Prevalence_per_100K 4.451145e-05 6.254320e-04
## Malaria_Prevalence_per_100K -4.279592e-06 -2.537143e-07
## HIV_Prevalence_Per_100K -1.061679e-06 -2.323674e-07
Prediction
The below below shows the predicted values and the actual values side-by-side.
pred <- predict(fit_labor, Test, type="response")
actual_pred <- data.frame(cbind(actual=Train$Labor_Pop_Percentage, predicted=pred))
kable(head(actual_pred))
| actual | predicted |
|---|---|
| 31 | 44.35727 |
| 39 | 44.33879 |
| 50 | 44.35285 |
| 27 | 43.54034 |
| 48 | 44.43838 |
| 45 | 45.53024 |
MAPE
The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics. Represented mathematically as,
\(M=\quad \frac { 100 }{ n } \sum _{ t=1 }^{ n }{ \left| \frac { { A }_{ t }\quad -\quad { F }_{ t } }{ { A }_{ t } } \right| } ,\\ Where\quad { A }_{ t }\quad =\quad Actual\quad Value\\ \quad \quad \quad \quad \quad \quad { F }_{ t }\quad =\quad Predicted\quad Value.\)
min_max_accuracy <- mean(apply(actual_pred, 1, min) / apply(actual_pred, 1, max))*100
print(paste('The Percentage Accuracy Is: ', min_max_accuracy))
## [1] "The Percentage Accuracy Is: 86.9749925016897"
mape <- mean(abs((actual_pred$predicted - actual_pred$actual))/actual_pred$actual)*100
print(paste('The MAPE Is: ', mape ))
## [1] "The MAPE Is: 15.3796521256326"
We can be rest assured that about 15 percent of our prediction might not be accurate. Although this might be a bit of concern to us as we wanted about 99percent prediction accuracy.
GENERAL CONCLUSION
Incident (Person At Risk): Shows that majority of the diseases is highly concentrated in southern part of africa.
Incident (Person Per Time): Gives more insight about the dieases that are considered non-chronic like Malaria and Tuberculosis.
Prevalence Rate: This was used to estimate the prevalence of all the diseases and a better result was obtained. It shows the widespread of the diseases beyond southern part of africa to some of Asia, South America and Carribean was not left out!
Since P-value is less than Tabulated/Table value (0.05), we will therefore reject Null Hypothesis Ho and conclude that both Labor Force and Population have effects on HIV/AIDS, Malaria and Tuberculosis. I.e an increase and decrease in the labor force and population would have an impact on the wide spread of disease in the countries under study.
RECOMMENDATION
The countries that are mostly affected needs to do more in terms of;
1). LABOR FORCE: Provide more employment opportunities to the citizenry. If an economy is viable, the citizen would want to live a healthy life style and would ultimately aid in decreasing the widespread of diseases.
2). Population: Maintain the population, as a sporadic increament in population without good stanadard of living, would jeopardized the ecomony and cause an increase in the spread of dideases.
REFERENCES
http://advancedrenaleducation.com/content/incidence-and-prevalence
https://www.cdc.gov/ophss/csels/dsepd/ss1978/lesson3/section2.html
http://rmarkdown.rstudio.com/html_document_format.html#overview