ABSTRACT

The main reason why I opted to do my project on demographic dataset is to show the beauty of data visualization. A good statistical analysis may not be convincing enough without depicting the result graphically in a very understandable and unambiguous way. The dataset used in this project is a demography data collected from the world bank website from 2010 to 2015. If I may ask, are you not curious to know if there is any relationship between labor Force/Population and diseases like HIV/AIDS, Tuberculosis, Malaria etc.? Better still, does geographical location has anything to do with all the widespread of these diseases?

Although the dataset was collected from world bank, but the issue with demographic data is that, they always have missing observation(s) and this dataset is no exception. We had a lot of missing observations (NAs), but I was able to deal with it.

In the data exploration, I obtained an inferential statistic and checked for any outliers. I majorly used SQL from sqldf package (it was the best for this type of dataset) to extract our observation and was able to show their (countries using GOOGLEVIS) location and graphically using PLOTLY and GGPLOT2. I observed that the dataset was not normally distributed, so, I decided to use Generalized Linear Model (GLM) for regression analysis and Analysis of Variance (ANOVA) to obtain best model for predictions.

In the end, I was able to use data visualization in the field of epidemiology to depict succinctly the effect of labor force and population on the widespread of diseases. I also observed that using incident risk alone may not be the best option to make recommendation and conclusion on epidemics, but rather use prevalence to make accurate analysis.

Plotly Map

Click this map to see the incident risk-person for the year 2015

DATA EXPLORATORY

Get the datasets from here:

https://raw.githubusercontent.com/mascotinme/MSDA-608/master/WorldHealth.csv

https://raw.githubusercontent.com/mascotinme/MSDA-608/master/Countries_long_lat2.csv

suppressMessages(library(R.rsp))
suppressMessages(library(pandocfilters))
suppressMessages(library(knitr))
suppressMessages(library(plyr))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
suppressMessages(library(sqldf))
suppressMessages(library(MASS))
suppressMessages(library(reshape2))
suppressMessages(library(Amelia))
suppressMessages(library(mice))
suppressPackageStartupMessages(library(googleVis))
suppressMessages(library(stringi))
suppressMessages(library(ROCR));

The dataset is being loaded into R and a glimpse of it follows…

df <- read.csv("https://raw.githubusercontent.com/mascotinme/MSDA-608/master/WorldHealth.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
kable(head(df[1:10, ]))
Year_Code Country_Name Country_Code Adults_15_living_HIV Adults_Children_0_14_15_living_HIV AIDS_estimated_deaths_UNAIDS Adults_children_0_14_15_newly_infected_HIV Adults_15_newly_infected_HIV Children_0_14_living_with_HIV Children_orphaned_by_HIV_AIDS Children_0_14_newly_infected_HIV Incidence_tuberculosis_per_100000 Labor_force_total Mortality_traffic_injury_100K Population_female Population_male Population_total Malaria_cases_reported Suicide_mortality_per_100K Tuberculosis_death_per_100K Tuberculosis_case_detection Tuberculosis_treatment_success_NewCases
YR2010 Afghanistan AFG 4300 4500 500 1000 1000 200 3000 100 189.0 7707349 15.3 14005473 14797694 28803167 69397 5.1 41.00 53 86
YR2010 Albania ALB 1000 1000 100 200 200 NA NA NA 17.0 1216574 15.4 1451422 1461599 2913021 NA 5.3 0.20 88 89
YR2010 Algeria DZA 9000 9200 500 1000 1000 500 2100 100 77.0 11115092 24.7 17887239 18230398 36117637 1 3.4 8.50 80 89
YR2010 American Samoa ASM NA NA NA NA NA NA NA NA 8.3 NA NA NA NA 55637 NA NA 0.68 87 100
YR2010 Andorra AND NA NA NA NA NA NA NA NA 9.5 NA NA NA NA 84449 NA NA 0.78 87 86
YR2010 Angola AGO 190000 210000 9400 25000 21000 18000 94000 3800 384.0 8394741 20.8 11936016 11433115 23369131 1682870 20.6 58.00 55 48

We are going to include another data that contains all the needed Longtitude and latitude for all country locations in the ddataset.

# We are going to include another data that contains all the needed Longtitude and latitude for all country locations in the ddataset.
lat_long <- read.csv("https://raw.githubusercontent.com/mascotinme/MSDA-608/master/Countries_long_lat2.csv", header = TRUE, sep = ",")
colnames(lat_long) <- c("Country_Name", "Country_Code", "Latitude", "Longtitude")
kable(head(lat_long))
Country_Name Country_Code Latitude Longtitude
Albania ALB 41.0000 20.0000
Algeria DZA 28.0000 3.0000
American Samoa ASM -14.3333 -170.0000
Andorra AND 42.5000 1.6000
Angola AGO -12.5000 18.5000
Anguilla AIA 18.2500 -63.1667
options(warn = -1)
df2 <- merge(df, lat_long, by.x = "Country_Name", by.y = "Country_Name", all = TRUE)
kable(head(df2))
Country_Name Year_Code Country_Code.x Adults_15_living_HIV Adults_Children_0_14_15_living_HIV AIDS_estimated_deaths_UNAIDS Adults_children_0_14_15_newly_infected_HIV Adults_15_newly_infected_HIV Children_0_14_living_with_HIV Children_orphaned_by_HIV_AIDS Children_0_14_newly_infected_HIV Incidence_tuberculosis_per_100000 Labor_force_total Mortality_traffic_injury_100K Population_female Population_male Population_total Malaria_cases_reported Suicide_mortality_per_100K Tuberculosis_death_per_100K Tuberculosis_case_detection Tuberculosis_treatment_success_NewCases Country_Code.y Latitude Longtitude
Afghanistan YR2015 AFG 6700 6900 500 1000 1000 500 4200 100 189 9825616 15.2 16346869 17389625 33736494 86895 5.5 37 58 NA AFG 33 65
Afghanistan YR2013 AFG 5600 5800 500 1000 1000 200 3700 100 189 8916157 NA 15398276 16333412 31731688 39263 NA 41 53 88 AFG 33 65
Afghanistan YR2011 AFG 4700 4900 500 1000 1000 200 3200 100 189 8050184 NA 14444001 15264598 29708599 77549 NA 42 51 88 AFG 33 65
Afghanistan YR2012 AFG 5200 5300 500 1000 1000 200 3500 100 189 8458402 NA 14912657 15784301 30696958 54840 NA 42 51 88 AFG 33 65
Afghanistan YR2014 AFG 6200 6400 500 1000 1000 500 4000 100 189 9397624 NA 15881092 16876928 32758020 145282 NA 41 53 87 AFG 33 65
Afghanistan YR2010 AFG 4300 4500 500 1000 1000 200 3000 100 189 7707349 15.3 14005473 14797694 28803167 69397 5.1 41 53 86 AFG 33 65

Checking for all NAs in the datasets before cleaning.

na_count_before <-sapply(df, function(a) sum(is.na(a)))
na_count_after <-sapply(df2, function(b) sum(is.na(b)))

kable(na_count_before)
Year_Code 0
Country_Name 0
Country_Code 0
Adults_15_living_HIV 511
Adults_Children_0_14_15_living_HIV 516
AIDS_estimated_deaths_UNAIDS 540
Adults_children_0_14_15_newly_infected_HIV 576
Adults_15_newly_infected_HIV 571
Children_0_14_living_with_HIV 684
Children_orphaned_by_HIV_AIDS 684
Children_0_14_newly_infected_HIV 678
Incidence_tuberculosis_per_100000 55
Labor_force_total 160
Mortality_traffic_injury_100K 916
Population_female 118
Population_male 118
Population_total 4
Malaria_cases_reported 703
Suicide_mortality_per_100K 916
Tuberculosis_death_per_100K 55
Tuberculosis_case_detection 101
Tuberculosis_treatment_success_NewCases 345
kable(na_count_after)
Country_Name 0
Year_Code 68
Country_Code.x 68
Adults_15_living_HIV 579
Adults_Children_0_14_15_living_HIV 584
AIDS_estimated_deaths_UNAIDS 608
Adults_children_0_14_15_newly_infected_HIV 644
Adults_15_newly_infected_HIV 639
Children_0_14_living_with_HIV 752
Children_orphaned_by_HIV_AIDS 752
Children_0_14_newly_infected_HIV 746
Incidence_tuberculosis_per_100000 123
Labor_force_total 228
Mortality_traffic_injury_100K 984
Population_female 186
Population_male 186
Population_total 72
Malaria_cases_reported 771
Suicide_mortality_per_100K 984
Tuberculosis_death_per_100K 123
Tuberculosis_case_detection 169
Tuberculosis_treatment_success_NewCases 413
Country_Code.y 246
Latitude 246
Longtitude 246

From the table above, we can see that both datasets contains alot of NAs (This is due to data unavalability in some countries under observation). But, before we can proceed, let first convert some of the variable to numeric. This would give room for better a analysis.

#Covert the columns to Numeric

df2[, c(4:22)] <- sapply(df2[, c(4:22)], as.numeric)

glimpse(df2)
## Observations: 1,364
## Variables: 25
## $ Country_Name                               <chr> "Afghanistan", "Afg...
## $ Year_Code                                  <chr> "YR2015", "YR2013",...
## $ Country_Code.x                             <chr> "AFG", "AFG", "AFG"...
## $ Adults_15_living_HIV                       <dbl> 6700, 5600, 4700, 5...
## $ Adults_Children_0_14_15_living_HIV         <dbl> 6900, 5800, 4900, 5...
## $ AIDS_estimated_deaths_UNAIDS               <dbl> 500, 500, 500, 500,...
## $ Adults_children_0_14_15_newly_infected_HIV <dbl> 1000, 1000, 1000, 1...
## $ Adults_15_newly_infected_HIV               <dbl> 1000, 1000, 1000, 1...
## $ Children_0_14_living_with_HIV              <dbl> 500, 200, 200, 200,...
## $ Children_orphaned_by_HIV_AIDS              <dbl> 4200, 3700, 3200, 3...
## $ Children_0_14_newly_infected_HIV           <dbl> 100, 100, 100, 100,...
## $ Incidence_tuberculosis_per_100000          <dbl> 189.0, 189.0, 189.0...
## $ Labor_force_total                          <dbl> 9825616, 8916157, 8...
## $ Mortality_traffic_injury_100K              <dbl> 15.2, NA, NA, NA, N...
## $ Population_female                          <dbl> 16346869, 15398276,...
## $ Population_male                            <dbl> 17389625, 16333412,...
## $ Population_total                           <dbl> 33736494, 31731688,...
## $ Malaria_cases_reported                     <dbl> 86895, 39263, 77549...
## $ Suicide_mortality_per_100K                 <dbl> 5.5, NA, NA, NA, NA...
## $ Tuberculosis_death_per_100K                <dbl> 37.00, 41.00, 42.00...
## $ Tuberculosis_case_detection                <dbl> 58, 53, 51, 51, 53,...
## $ Tuberculosis_treatment_success_NewCases    <dbl> NA, 88, 88, 88, 87,...
## $ Country_Code.y                             <fctr> AFG, AFG, AFG, AFG...
## $ Latitude                                   <dbl> 33.0000, 33.0000, 3...
## $ Longtitude                                 <dbl> 65.0, 65.0, 65.0, 6...

Merging column longitude and Latitude together for a better coordinate to be used in our maps (googlevis requirement!)

df2$Lat_Long = paste(df2$Latitude, df2$Longtitude, sep=":")

kable(head(df2))
Country_Name Year_Code Country_Code.x Adults_15_living_HIV Adults_Children_0_14_15_living_HIV AIDS_estimated_deaths_UNAIDS Adults_children_0_14_15_newly_infected_HIV Adults_15_newly_infected_HIV Children_0_14_living_with_HIV Children_orphaned_by_HIV_AIDS Children_0_14_newly_infected_HIV Incidence_tuberculosis_per_100000 Labor_force_total Mortality_traffic_injury_100K Population_female Population_male Population_total Malaria_cases_reported Suicide_mortality_per_100K Tuberculosis_death_per_100K Tuberculosis_case_detection Tuberculosis_treatment_success_NewCases Country_Code.y Latitude Longtitude Lat_Long
Afghanistan YR2015 AFG 6700 6900 500 1000 1000 500 4200 100 189 9825616 15.2 16346869 17389625 33736494 86895 5.5 37 58 NA AFG 33 65 33:65
Afghanistan YR2013 AFG 5600 5800 500 1000 1000 200 3700 100 189 8916157 NA 15398276 16333412 31731688 39263 NA 41 53 88 AFG 33 65 33:65
Afghanistan YR2011 AFG 4700 4900 500 1000 1000 200 3200 100 189 8050184 NA 14444001 15264598 29708599 77549 NA 42 51 88 AFG 33 65 33:65
Afghanistan YR2012 AFG 5200 5300 500 1000 1000 200 3500 100 189 8458402 NA 14912657 15784301 30696958 54840 NA 42 51 88 AFG 33 65 33:65
Afghanistan YR2014 AFG 6200 6400 500 1000 1000 500 4000 100 189 9397624 NA 15881092 16876928 32758020 145282 NA 41 53 87 AFG 33 65 33:65
Afghanistan YR2010 AFG 4300 4500 500 1000 1000 200 3000 100 189 7707349 15.3 14005473 14797694 28803167 69397 5.1 41 53 86 AFG 33 65 33:65

A graphical representation of the missing values.

missmap(df2, legend = TRUE, col = c("wheat","darkred", col=c('yellow', 'darkgreen')), main ="Plot Showing The Missing Values Per Observation",
        y.cex = 0.8, x.cex = 0.8,csvar = NULL, tsvar =NULL, rank.order = TRUE)

mice_missing_val <- mice(df2, m=1, method='cart', printFlag=FALSE, where = is.na(df2))
mice_missing_val_comp <- complete(mice_missing_val, action='long', include=FALSE)
df2_cleaned <- na.omit(mice_missing_val_comp)
df2_cleaned <- df2_cleaned[, -c(1:2)]
kable(head(df2_cleaned))
Country_Name Year_Code Country_Code.x Adults_15_living_HIV Adults_Children_0_14_15_living_HIV AIDS_estimated_deaths_UNAIDS Adults_children_0_14_15_newly_infected_HIV Adults_15_newly_infected_HIV Children_0_14_living_with_HIV Children_orphaned_by_HIV_AIDS Children_0_14_newly_infected_HIV Incidence_tuberculosis_per_100000 Labor_force_total Mortality_traffic_injury_100K Population_female Population_male Population_total Malaria_cases_reported Suicide_mortality_per_100K Tuberculosis_death_per_100K Tuberculosis_case_detection Tuberculosis_treatment_success_NewCases Country_Code.y Latitude Longtitude Lat_Long
Afghanistan YR2015 AFG 6700 6900 500 1000 1000 500 4200 100 189 9825616 15.2 16346869 17389625 33736494 86895 5.500000 37 58 91 AFG 33 65 33:65
Afghanistan YR2013 AFG 5600 5800 500 1000 1000 200 3700 100 189 8916157 4.6 15398276 16333412 31731688 39263 14.600000 41 53 88 AFG 33 65 33:65
Afghanistan YR2011 AFG 4700 4900 500 1000 1000 200 3200 100 189 8050184 15.0 14444001 15264598 29708599 77549 5.100000 42 51 88 AFG 33 65 33:65
Afghanistan YR2012 AFG 5200 5300 500 1000 1000 200 3500 100 189 8458402 6.5 14912657 15784301 30696958 54840 8.500000 42 51 88 AFG 33 65 33:65
Afghanistan YR2014 AFG 6200 6400 500 1000 1000 500 4000 100 189 9397624 13.0 15881092 16876928 32758020 145282 9.397671 41 53 87 AFG 33 65 33:65
Afghanistan YR2010 AFG 4300 4500 500 1000 1000 200 3000 100 189 7707349 15.3 14005473 14797694 28803167 69397 5.100000 41 53 86 AFG 33 65 33:65

We now have data with no missing observations.

na_count_after_cleaning <-sapply(df2_cleaned, function(c) sum(is.na(c)))
kable(na_count_after_cleaning)
Country_Name 0
Year_Code 0
Country_Code.x 0
Adults_15_living_HIV 0
Adults_Children_0_14_15_living_HIV 0
AIDS_estimated_deaths_UNAIDS 0
Adults_children_0_14_15_newly_infected_HIV 0
Adults_15_newly_infected_HIV 0
Children_0_14_living_with_HIV 0
Children_orphaned_by_HIV_AIDS 0
Children_0_14_newly_infected_HIV 0
Incidence_tuberculosis_per_100000 0
Labor_force_total 0
Mortality_traffic_injury_100K 0
Population_female 0
Population_male 0
Population_total 0
Malaria_cases_reported 0
Suicide_mortality_per_100K 0
Tuberculosis_death_per_100K 0
Tuberculosis_case_detection 0
Tuberculosis_treatment_success_NewCases 0
Country_Code.y 0
Latitude 0
Longtitude 0
Lat_Long 0

Summary Statistics

kable(summary(df2_cleaned[, 4:22]))
Adults_15_living_HIV Adults_Children_0_14_15_living_HIV AIDS_estimated_deaths_UNAIDS Adults_children_0_14_15_newly_infected_HIV Adults_15_newly_infected_HIV Children_0_14_living_with_HIV Children_orphaned_by_HIV_AIDS Children_0_14_newly_infected_HIV Incidence_tuberculosis_per_100000 Labor_force_total Mortality_traffic_injury_100K Population_female Population_male Population_total Malaria_cases_reported Suicide_mortality_per_100K Tuberculosis_death_per_100K Tuberculosis_case_detection Tuberculosis_treatment_success_NewCases
Min. : 100 Min. : 100 Min. : 100 Min. : 100 Min. : 100 Min. : 100 Min. : 500 Min. : 100 Min. : 0.00 Min. :1.389e+05 Min. : 2.40 Min. :1.454e+05 Min. :1.341e+05 Min. :2.796e+05 Min. : 0 Min. : 0.000 Min. : 0.11 Min. : 15.00 Min. : 0.00
1st Qu.: 8300 1st Qu.: 8700 1st Qu.: 200 1st Qu.: 1000 1st Qu.: 1000 1st Qu.: 500 1st Qu.: 3925 1st Qu.: 100 1st Qu.: 27.25 1st Qu.:1.897e+06 1st Qu.:14.10 1st Qu.:2.199e+06 1st Qu.:2.194e+06 1st Qu.:4.350e+06 1st Qu.: 364 1st Qu.: 6.025 1st Qu.: 2.00 1st Qu.: 58.00 1st Qu.: 75.00
Median : 37000 Median : 40500 Median : 1300 Median : 2400 Median : 2000 Median : 2400 Median : 23000 Median : 500 Median : 100.00 Median :5.089e+06 Median :20.80 Median :5.661e+06 Median :5.620e+06 Median :1.131e+07 Median : 15156 Median : 8.200 Median : 9.25 Median : 76.00 Median : 83.00
Mean : 514638 Mean : 555964 Mean : 20739 Mean : 32101 Mean : 28040 Mean : 44315 Mean : 326680 Mean : 4192 Mean : 165.15 Mean :4.366e+07 Mean :20.25 Mean :4.838e+07 Mean :4.923e+07 Mean :9.763e+07 Mean : 469355 Mean :10.256 Mean :18.96 Mean : 70.22 Mean : 79.34
3rd Qu.: 170000 3rd Qu.: 180000 3rd Qu.: 5875 3rd Qu.: 9950 3rd Qu.: 9225 3rd Qu.: 13000 3rd Qu.: 97000 3rd Qu.: 1200 3rd Qu.: 235.00 3rd Qu.:1.229e+07 3rd Qu.:27.30 3rd Qu.:1.476e+07 3rd Qu.:1.574e+07 3rd Qu.:3.030e+07 3rd Qu.: 259482 3rd Qu.:12.700 3rd Qu.:30.00 3rd Qu.: 87.00 3rd Qu.: 88.00
Max. :33800000 Max. :36100000 Max. :1500000 Max. :2200000 Max. :1900000 Max. :2600000 Max. :19700000 Max. :300000 Max. :1246.00 Max. :3.406e+09 Max. :45.40 Max. :3.644e+09 Max. :3.708e+09 Max. :7.355e+09 Max. :11627473 Max. :35.300 Max. :99.00 Max. :110.00 Max. :100.00

Percentage of Children Orphaned By HIV/AIDS

All_Cen <- sqldf("SELECT Country_Name, Year_Code, (Children_orphaned_by_HIV_AIDS/Population_total)*100 as 'Percentage_Orphaned_byHIV' FROM df2_cleaned where Percentage_Orphaned_byHIV >= 2 ORDER BY Percentage_Orphaned_byHIV DESC LIMIT 50" )

A box plot of the outcomes.

# Boxplot
g <- ggplot(All_Cen, aes(Year_Code, Percentage_Orphaned_byHIV)) + geom_boxplot(varwidth=T, fill="plum") + 
    labs(title="Box plot of Children Orphaned By HIV/AID per Year (%)", 
         subtitle="Percentage Orphaned by HIV grouped by Year",
         caption="Source: The World Bank Data Bank",
         x="Year",
         y="% of Children Orphaned By HIV/AIDS")

ggplotly(g)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

THE ANALYSIS

Let first look at the graph of children orphaned by HIV?AIDS!

ggplotly(ggplot(All_Cen, aes(x=Percentage_Orphaned_byHIV, y=reorder(Country_Name, +Percentage_Orphaned_byHIV), fill=Year_Code)) +
geom_point(colour="purple", size=2, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code))) + labs(title="Chart of Children Orphaned By HIV/AIDS BY Countries (%)"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

INCIDENCE PROPORTION,INCIDENCE RATES AND PREVALENCE

Incidence: a measure of the occurrence of a new disease, in a defined population at risk for the disease, in a specified time period.

while

1). Incidence proportion (also known as cumulative incidence) is the number of new cases within a specified time period divided by the size of the population initially at risk

2). Incidence rate or person-time rate is a measure of incidence that incorporates time directly into the denominator. A person-time rate is generally calculated from a long-term cohort follow-up study, wherein enrollees are followed over time and the occurrence of new cases of disease is documented." —CDC

3). Prevalence is the proportion of the total number of cases to the total population and is more a measure of the burden of the disease on society with no regard to time at risk or when subjects may have been exposed to a possible risk factor. Prevalence can also be measured with respect to a specific subgroup of a population

\(Incidence\quad Rate\quad =\quad \frac { Number\quad of\quad newcases\quad of\quad disease\quad or\quad injury\quad during\quad specified\quad period }{ Time\quad each\quad person\quad was\quad observed,\quad totaled\quad for\quad all\quad persons }\)

Hiv_Incident_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Adults_children_0_14_15_newly_infected_HIV/Population_total)*100000 as 'HIV_Incident_Risk_Per_100K' FROM df2_cleaned where HIV_Incident_Risk_Per_100K >= 200 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 30" )
kable(head(Hiv_Incident_Risk))
Population_total Lat_Long Country_Name Year_Code HIV_Incident_Risk_Per_100K
2209197 -22:24 Botswana YR2015 497.9185
4546100 7:21 Central African Republic YR2015 202.3713
1175389 2:10 Equatorial Guinea YR2015 204.1877
2174645 -29.5:28.5 Lesotho YR2015 1011.6594
17573607 -13.5:34 Malawi YR2015 221.9237
28010691 -18.25:35 Mozambique YR2015 339.1562
Tuberculosis_Inc_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Tuberculosis_case_detection/Population_total)* 100000 as 'Tuberculocisis_Incident_Risk_Per_100K' FROM df2_cleaned where Tuberculocisis_Incident_Risk_Per_100K >= 1 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 50" )
kable(head(Tuberculosis_Inc_Risk)) 
Population_total Lat_Long Country_Name Year_Code Tuberculocisis_Incident_Risk_Per_100K
2880703 41:20 Albania YR2015 2.638245
2916950 40.0691:45.03819 Armenia YR2015 3.051132
1371855 26:50.55 Bahrain YR2015 6.341778
284217 13.1667:-59.5333 Barbados YR2015 30.610414
359288 17.25:-88.75 Belize YR2015 24.214558
2209197 -22:24 Botswana YR2015 2.806450
Malaria_Inc_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Malaria_cases_reported/Population_total)* 100000 as 'Malaria_Incident_Risk_Per_100K' FROM df2_cleaned where Malaria_Incident_Risk_Per_100K >= 1 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 50" )
kable(head(Malaria_Inc_Risk)) 
Population_total Lat_Long Country_Name Year_Code Malaria_Incident_Risk_Per_100K
33736494 33:65 Afghanistan YR2015 257.569740
2880703 41:20 Albania YR2015 1.909256
27859305 -12.5:18.5 Angola YR2015 9940.323350
161200886 24:90 Bangladesh YR2015 4.099233
284217 13.1667:-59.5333 Barbados YR2015 6186.470197
9489616 53:28 Belarus YR2015 126.980902
ggplotly(ggplot(Hiv_Incident_Risk, aes(x=HIV_Incident_Risk_Per_100K, y=reorder(Country_Name,+HIV_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="red", size=2, alpha=.8) + ylab("Countries") + xlab("HIV Incident Risk") +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code))) + labs(title="Chart of HIV/AIDS Incidence (Person At Risk) BY Countries"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(ggplot(Tuberculosis_Inc_Risk, aes(x=Tuberculocisis_Incident_Risk_Per_100K, y=reorder(Country_Name, Tuberculocisis_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="purple", size=2, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code)))+ labs(title="Chart of Tuberculosis Incidence (Person At Risk) BY Countries"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(ggplot(Malaria_Inc_Risk, aes(x=Malaria_Incident_Risk_Per_100K, y=reorder(Country_Name, Malaria_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="brown", size=2, alpha=.8) + 
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code)))+ labs(title="Chart of Malaria Incidence (Person At Risk) BY Countries"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Their Locations On The Map

Hiv_Inc_Map <- gvisGeoChart(Hiv_Incident_Risk, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "HIV_Incident_Risk_Per_100K", colorvar = "Population_total",
                   options=list(displayMode="Markers", 
                   colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                   backgroundColor="lightblue"), chartid="Hiv_Incidence_Person_At_Risk")
plot(Hiv_Inc_Map)
Tub_Inc_Map <- gvisGeoChart(Tuberculosis_Inc_Risk, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "Tuberculocisis_Incident_Risk_Per_100K", colorvar = "Population_total",
                   options=list(displayMode="Markers", 
                   colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                   backgroundColor="lightblue"), chartid="Tuberculosis_Incidence_Person_At_Risk")
plot(Tub_Inc_Map )
Malaria_Inc_Map <- gvisGeoChart(Malaria_Inc_Risk, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "Malaria_Incident_Risk_Per_100K", colorvar = "Population_total",
                   options=list(displayMode="Markers", 
                   colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                   backgroundColor="lightblue"), chartid="Malaria_Incidence_Person_At_Risk")
plot(Malaria_Inc_Map )

HIV/AID, Malaria & Tuberculosis Incident Risk Comparison

comparison <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, Labor_force_total, (Malaria_cases_reported/Population_total)* 1000 as 'Malaria_Incident_Risk_Per_1K', (Tuberculosis_case_detection/Population_total)* 100000 as 'Tuberculocisis_Incident_Risk_Per_100K',  (Adults_children_0_14_15_newly_infected_HIV/Population_total)*100000 as 'HIV_Incident_Risk_Per_100K' FROM df2_cleaned where Malaria_Incident_Risk_Per_1K >= 1 and Lat_Long != 'NA:NA' or Tuberculocisis_Incident_Risk_Per_100K >= 1 or HIV_Incident_Risk_Per_100K >= 200 ORDER BY Country_Name ASC")

kable(head(comparison))
Population_total Lat_Long Country_Name Year_Code Labor_force_total Malaria_Incident_Risk_Per_1K Tuberculocisis_Incident_Risk_Per_100K HIV_Incident_Risk_Per_100K
33736494 33:65 Afghanistan YR2015 9825616 2.575697 0.1719207 2.964149
31731688 33:65 Afghanistan YR2013 8916157 1.237344 0.1670255 3.151424
29708599 33:65 Afghanistan YR2011 8050184 2.610322 0.1716675 3.366029
30696958 33:65 Afghanistan YR2012 8458402 1.786496 0.1661402 3.257652
32758020 33:65 Afghanistan YR2014 9397624 4.435005 0.1617924 3.052688
28803167 33:65 Afghanistan YR2010 7707349 2.409353 0.1840075 3.471840

A 3D approach

comparison <- comparison[order(comparison$Country_Name, comparison$Year_Code),]
comparison$size <- comparison$Population_total
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')

pw <- plot_ly(comparison, x = ~Malaria_Incident_Risk_Per_1K, y = ~Tuberculocisis_Incident_Risk_Per_100K, z = ~HIV_Incident_Risk_Per_100K, color = ~Year_Code, size = ~size, colors = colors,
             marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 120),
             text = ~paste('Country:', Country_Name, '<br>Labor Force:', Labor_force_total, '<br>Population Total:', Population_total,
                           '<br>Lat & Long.:', Lat_Long)) %>%
  layout(title = 'Comparison Among HIV/AIDS, Malaria & Tuberculosis Incidence For People At Risk',
         scene = list(xaxis = list(title = 'Malaria Pop At Risk Per 1K',
                      gridcolor = 'rgb(255, 255, 255)',
                      type = 'log',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwidth = 2),
               yaxis = list(title = 'Tuberculosis Pop At Risk Per 100K',
                      gridcolor = 'rgb(255, 255, 255)',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwith = 2),
               zaxis = list(title = 'HIV/AIDS Pop At Risk Per 100K',
                            gridcolor = 'rgb(255, 255, 255)',
                            type = 'log',
                            zerolinewidth = 1,
                            ticklen = 5,
                            gridwith = 2)),
         paper_bgcolor = 'rgb(243, 243, 243)',
         plot_bgcolor = 'rgb(243, 243, 243)')

pw

Incidence Person_Per_Time

HIV_Inc_Person_Time <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, ((New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000 as 'HIV_Inc_Person_Time100K'  FROM (SELECT Population_total, Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY New_Hiv_Cases, Year_Code DESC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(HIV_Inc_Person_Time))
Population_total Lat_Long Country_Name Year_Code HIV_Inc_Person_Time100K
1371855 26:50.55 Bahrain YR2015 72.90038
777424 -12.1667:44.25 Comoros YR2015 128.65211
4203604 45.1667:15.5 Croatia YR2015 23.79033
9159302 31:36 Jordan YR2015 10.91828
3935794 29.3375:47.6581 Kuwait YR2015 25.40904
2976877 46:105 Mongolia YR2015 33.59576
#Nested SELECTion!

Tuber_Inc_Person_Time <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Tuberculosis_case_detection/(Population_total -(Tuber_trtd  + AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000 as 'Tuberculosis_Inc_Person_Time100K'  FROM (SELECT Population_total,Tuberculosis_case_detection, Lat_Long,Country_Name, Year_Code, Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Year_Code DESC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(Tuber_Inc_Person_Time))
Population_total Lat_Long Country_Name Year_Code Tuberculosis_Inc_Person_Time100K
33736494 33:65 Afghanistan YR2015 1.719336
2880703 41:20 Albania YR2015 26.384446
39871528 28:3 Algeria YR2015 2.006544
27859305 -12.5:18.5 Angola YR2015 2.298376
43417765 -34:-64 Argentina YR2015 2.003961
2916950 40.0691:45.03819 Armenia YR2015 30.515015

Person Per Time Maps

HIV_Inc_Person_Time_Map <- gvisGeoChart(HIV_Inc_Person_Time, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "HIV_Inc_Person_Time100K", colorvar = "Population_total",
                    options=list(displayMode="Markers", 
                    colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                    backgroundColor="lightblue"), chartid="Hiv_Incidence_Person_Per_Time")
plot(HIV_Inc_Person_Time_Map )
Tuber_Inc_PT_Map <- gvisGeoChart(Tuber_Inc_Person_Time, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "Tuberculosis_Inc_Person_Time100K", colorvar = "Population_total",
                    options=list(displayMode="Markers", 
                    colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                    backgroundColor="lightblue"), chartid="Tuberculosis_Incidence_Person_Per_Time")
plot(Tuber_Inc_PT_Map)

PREVALENCE RATE:

Prevalence is the actual number of cases alive, with the disease either during a period of time (period prevalence) or at a particular date in time (point prevalence). Period prevalence provides the better measure of the disease load since it includes all new cases and all deaths between two dates, whereas point prevalence only counts those alive on a particular date.

Prevalence is also most meaningfully reported as the number of cases as a fraction of the total population at risk and can be further categorized according to different subsets of the population.

HIV_Prevalence <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round(((Old_Hiv_Cases + New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'HIV_Inc_Prevalence_100K'  FROM (SELECT Population_total,Adults_Children_0_14_15_living_HIV as 'Old_Hiv_Cases', Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(HIV_Prevalence))
Population_total Lat_Long Country_Name Year_Code HIV_Inc_Prevalence_100K
33736494 33:65 Afghanistan YR2015 234
31731688 33:65 Afghanistan YR2013 214
29708599 33:65 Afghanistan YR2011 199
30696958 33:65 Afghanistan YR2012 205
32758020 33:65 Afghanistan YR2014 226
28803167 33:65 Afghanistan YR2010 191

NB: Malaria and Tuberculosis are most likely not be carried over to the proceding year. So, we will not include them (person living with either malaria or tuberculosis) in the numerators.

#Nested SELECTion!

Tuber_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round((Tuberculosis_case_detection/(Population_total -(Tuber_trtd    + AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Tuberculosis_Prevalence_per_100K'  FROM (SELECT Population_total,Tuberculosis_case_detection, Lat_Long,Country_Name, Year_Code, Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(Tuber_Prevalence_Rate))
Population_total Lat_Long Country_Name Year_Code Tuberculosis_Prevalence_per_100K
33736494 33:65 Afghanistan YR2015 2
31731688 33:65 Afghanistan YR2013 2
29708599 33:65 Afghanistan YR2011 2
30696958 33:65 Afghanistan YR2012 2
32758020 33:65 Afghanistan YR2014 2
28803167 33:65 Afghanistan YR2010 2
Malaria_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round((Malaria_cases_reported/(Population_total -(AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Malaria_Prevalence_per_100K'  FROM (SELECT Population_total, Lat_Long,Country_Name, Year_Code, AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year',Malaria_cases_reported, round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(Malaria_Prevalence_Rate))
Population_total Lat_Long Country_Name Year_Code Malaria_Prevalence_per_100K
33736494 33:65 Afghanistan YR2015 2576
31731688 33:65 Afghanistan YR2013 1237
29708599 33:65 Afghanistan YR2011 2611
30696958 33:65 Afghanistan YR2012 1787
32758020 33:65 Afghanistan YR2014 4435
28803167 33:65 Afghanistan YR2010 2410
Com_HIV_Tub_Mal_Prevalence <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code,round((Labor_force_total/Population_total)*100) as 'Labor_Pop_Percentage',round((Tuberculosis_case_detection/(Population_total -(Tuber_trtd  + AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Tuberculosis_Prevalence_per_100K',round((Malaria_cases_reported/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*100000) as 'Malaria_Prevalence_per_100K',round(((Old_Hiv_Cases + New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'HIV_Prevalence_Per_100K'  FROM (SELECT Labor_force_total,Population_total,Adults_Children_0_14_15_living_HIV as 'Old_Hiv_Cases', Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',Tuberculosis_case_detection,Malaria_cases_reported,Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC) where Lat_Long != 'NA:NA'")
kable(head(Com_HIV_Tub_Mal_Prevalence))
Population_total Lat_Long Country_Name Year_Code Labor_Pop_Percentage Tuberculosis_Prevalence_per_100K Malaria_Prevalence_per_100K HIV_Prevalence_Per_100K
33736494 33:65 Afghanistan YR2015 29 2 258 234
31731688 33:65 Afghanistan YR2013 28 2 124 214
29708599 33:65 Afghanistan YR2011 27 2 261 199
30696958 33:65 Afghanistan YR2012 28 2 179 205
32758020 33:65 Afghanistan YR2014 29 2 444 226
28803167 33:65 Afghanistan YR2010 27 2 241 191
comp <- Com_HIV_Tub_Mal_Prevalence[order(Com_HIV_Tub_Mal_Prevalence$Country_Name, Com_HIV_Tub_Mal_Prevalence$Year_Code),]
comp$size <- Com_HIV_Tub_Mal_Prevalence$Population_total
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')

pw2 <- plot_ly(comp, x = ~Malaria_Prevalence_per_100K, y = ~Tuberculosis_Prevalence_per_100K, z = ~HIV_Prevalence_Per_100K, color = ~Year_Code, size = ~size, colors = colors,
             marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 120),
             text = ~paste('Country:', Country_Name, '<br>Population Total:', Population_total,
                           '<br>Labor Force/Pop %:', Labor_Pop_Percentage)) %>%
  layout(title = 'Comparison Among HIV/AIDS, Malaria & Tuberculosis Incidence For People At Risk',
         scene = list(xaxis = list(title = 'Malaria Pop At Risk Per 1K',
                      gridcolor = 'rgb(255, 255, 255)',
                      type = 'log',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwidth = 2),
               yaxis = list(title = 'Tuberculosis Pop At Risk Per 100K',
                      gridcolor = 'rgb(255, 255, 255)',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwith = 2),
               zaxis = list(title = 'HIV/AIDS Pop At Risk Per 100K',
                            gridcolor = 'rgb(255, 255, 255)',
                            type = 'log',
                            zerolinewidth = 1,
                            ticklen = 5,
                            gridwith = 2)),
         paper_bgcolor = 'rgb(243, 243, 243)',
         plot_bgcolor = 'rgb(243, 243, 243)')

pw2

Regression Analysis & Prediction

qqnorm(Com_HIV_Tub_Mal_Prevalence$Malaria_Prevalence_per_100K)
qqline(Com_HIV_Tub_Mal_Prevalence$Malaria_Prevalence_per_100K, col = 2)

From the above normality check, we can see that the variable under observation is not normally distributed. We will therefore use Generalized Linear Model (conventional linear regression models) to; select best model, estimate parameters and interpretations. The generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models etc.

i.e The dependent variable Yi does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal etc.

For Linear Regression, we have

\({ Y }_{ i }\quad =\quad { \beta }_{ 0 }\quad +\quad { \beta }_{ 1 }{ x }_{ i }\quad +\quad { \varepsilon }_{ i }\quad \\ where,\\ { Y }_{ i }\quad =\quad Dependent\quad Variable.\\ { \beta }_{ 0 }\quad =\quad Intercept.\\ { \beta }_{ 1 }\quad =\quad Parameter\quad To\quad Be\quad estimated\quad (slope).\\ { x }_{ i }\quad =\quad Independent\quad Variable.\\ { \varepsilon }_{ i }\quad =\quad Error\quad Term.\\ i\quad \quad =\quad 1,\quad 2,\quad 3,\quad 4,......,\quad n\quad \\\)

While Multiple linear regression, we have (in matrix notation):

\(\hat { \beta } =\quad ({ { X }^{ T }X) }^{ -1 }{ X }^{ T }Y\)

# Create Training and Test data -
set.seed(12345)  # setting seed to reproduce results of random sampling
Com_HIV_Tub_Mal_Prevalence <- Com_HIV_Tub_Mal_Prevalence[, c(1,5,6,7,8)]
traindata <- sample(1:nrow(Com_HIV_Tub_Mal_Prevalence), 0.8*nrow(Com_HIV_Tub_Mal_Prevalence))  # row indices for training data
Train <- Com_HIV_Tub_Mal_Prevalence[traindata, ]  # model training data
Test  <- Com_HIV_Tub_Mal_Prevalence[-traindata, ]   # test data

SETTING UP HYPOTHESIS: For Labor Force

Rejection:

Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

fit_labor <-  glm(Labor_Pop_Percentage ~., family=poisson(link='log'),data=Train)
summary(stepwise <- step(fit_labor, direction = "both")) 
## Start:  AIC=3397.71
## Labor_Pop_Percentage ~ Population_total + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
## 
##                                    Df Deviance    AIC
## <none>                                  661.93 3397.7
## - Malaria_Prevalence_per_100K       1   666.83 3400.6
## - Population_total                  1   666.87 3400.6
## - Tuberculosis_Prevalence_per_100K  1   667.02 3400.8
## - HIV_Prevalence_Per_100K           1   671.42 3405.2
## 
## Call:
## glm(formula = Labor_Pop_Percentage ~ Population_total + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"), 
##     data = Train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2093  -0.6812   0.1074   0.7941   4.0322  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       3.797e+00  8.946e-03 424.403  < 2e-16
## Population_total                 -1.338e-10  6.168e-11  -2.169  0.03012
## Tuberculosis_Prevalence_per_100K  3.378e-04  1.482e-04   2.279  0.02264
## Malaria_Prevalence_per_100K      -2.248e-06  1.027e-06  -2.189  0.02859
## HIV_Prevalence_Per_100K          -6.438e-07  2.115e-07  -3.043  0.00234
##                                     
## (Intercept)                      ***
## Population_total                 *  
## Tuberculosis_Prevalence_per_100K *  
## Malaria_Prevalence_per_100K      *  
## HIV_Prevalence_Per_100K          ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 688.60  on 485  degrees of freedom
## Residual deviance: 661.93  on 481  degrees of freedom
## AIC: 3397.7
## 
## Number of Fisher Scoring iterations: 4
residual_labor = data.frame(Fitted = fitted(stepwise),
Residuals = resid(stepwise), Treatment = Train$Labor_Pop_Percentage)

plot_res <- ggplot(residual_labor, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

ggplotly(plot_res)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
anova(stepwise, test="F")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Labor_Pop_Percentage
## 
## Terms added sequentially (first to last)
## 
## 
##                                  Df Deviance Resid. Df Resid. Dev      F
## NULL                                               485     688.60       
## Population_total                  1   4.7922       484     683.81 4.7922
## Tuberculosis_Prevalence_per_100K  1   5.1819       483     678.63 5.1819
## Malaria_Prevalence_per_100K       1   7.2054       482     671.42 7.2054
## HIV_Prevalence_Per_100K           1   9.4892       481     661.93 9.4892
##                                    Pr(>F)   
## NULL                                        
## Population_total                 0.028589 * 
## Tuberculosis_Prevalence_per_100K 0.022824 * 
## Malaria_Prevalence_per_100K      0.007269 **
## HIV_Prevalence_Per_100K          0.002067 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

DECISION (Labor Force):

Reject Null Hypothesis (\(H_o\)) since P-Values are all less than Tabulated value of 0.05.

SETTING UP HYPOTHESIS: Population

Rejection:

Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

fit_Population <-  glm(Population_total ~., family=poisson(link='log'),data=Train)
summary(stepwise_pop <- step(fit_Population , direction = "both")) 
## Start:  AIC=20335673459
## Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
## 
##                                    Df   Deviance        AIC
## <none>                                2.0336e+10 2.0336e+10
## - Labor_Pop_Percentage              1 2.0638e+10 2.0638e+10
## - HIV_Prevalence_Per_100K           1 2.0751e+10 2.0751e+10
## - Malaria_Prevalence_per_100K       1 2.1448e+10 2.1448e+10
## - Tuberculosis_Prevalence_per_100K  1 4.4134e+10 4.4134e+10
## 
## Call:
## glm(formula = Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"), 
##     data = Train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -10699   -2899     184    4585   53407  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       1.967e+01  4.044e-05  486275   <2e-16
## Labor_Pop_Percentage             -1.652e-02  9.326e-07  -17713   <2e-16
## Tuberculosis_Prevalence_per_100K -3.374e-01  3.824e-06  -88230   <2e-16
## Malaria_Prevalence_per_100K      -7.291e-05  2.624e-09  -27782   <2e-16
## HIV_Prevalence_Per_100K          -7.459e-06  4.197e-10  -17772   <2e-16
##                                     
## (Intercept)                      ***
## Labor_Pop_Percentage             ***
## Tuberculosis_Prevalence_per_100K ***
## Malaria_Prevalence_per_100K      ***
## HIV_Prevalence_Per_100K          ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4.8172e+10  on 485  degrees of freedom
## Residual deviance: 2.0336e+10  on 481  degrees of freedom
## AIC: 2.0336e+10
## 
## Number of Fisher Scoring iterations: 8
residual_pop = data.frame(Fitted = fitted(stepwise_pop),
Residuals = resid(stepwise_pop), Treatment = Train$Population_total)

plot_pop <- ggplot(residual_pop, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

ggplotly(plot_pop)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
anova(stepwise_pop, test="F")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Population_total
## 
## Terms added sequentially (first to last)
## 
## 
##                                  Df   Deviance Resid. Df Resid. Dev
## NULL                                                 485 4.8172e+10
## Labor_Pop_Percentage              1 1.3111e+09       484 4.6861e+10
## Tuberculosis_Prevalence_per_100K  1 2.4433e+10       483 2.2428e+10
## Malaria_Prevalence_per_100K       1 1.6777e+09       482 2.0751e+10
## HIV_Prevalence_Per_100K           1 4.1494e+08       481 2.0336e+10
##                                           F    Pr(>F)    
## NULL                                                     
## Labor_Pop_Percentage             1.3111e+09 < 2.2e-16 ***
## Tuberculosis_Prevalence_per_100K 2.4433e+10 < 2.2e-16 ***
## Malaria_Prevalence_per_100K      1.6777e+09 < 2.2e-16 ***
## HIV_Prevalence_Per_100K          4.1494e+08 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

DECISION (Population):

Reject Null Hypothesis since P-Values are all less than Tabulated value of 0.05.

Non Zero Confidence Interval

Meaning that there is a 95% probability that the confidence interval will contain the true population mean. From the table below, we can deduce that the our analysis is correct (as non of the value between the 2.5 and 97.5 is zero) for Population_total, Malaria and HIV as corfirmed by Analysis of Variance (Anova) result above.

confint(stepwise)
##                                          2.5 %        97.5 %
## (Intercept)                       3.779050e+00  3.814117e+00
## Population_total                 -2.572919e-10 -1.540813e-11
## Tuberculosis_Prevalence_per_100K  4.451145e-05  6.254320e-04
## Malaria_Prevalence_per_100K      -4.279592e-06 -2.537143e-07
## HIV_Prevalence_Per_100K          -1.061679e-06 -2.323674e-07

Prediction

The below below shows the predicted values and the actual values side-by-side.

pred <- predict(fit_labor, Test, type="response")
actual_pred <- data.frame(cbind(actual=Train$Labor_Pop_Percentage, predicted=pred))
kable(head(actual_pred))
actual predicted
31 44.35727
39 44.33879
50 44.35285
27 43.54034
48 44.43838
45 45.53024

MAPE

The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics. Represented mathematically as,

\(M=\quad \frac { 100 }{ n } \sum _{ t=1 }^{ n }{ \left| \frac { { A }_{ t }\quad -\quad { F }_{ t } }{ { A }_{ t } } \right| } ,\\ Where\quad { A }_{ t }\quad =\quad Actual\quad Value\\ \quad \quad \quad \quad \quad \quad { F }_{ t }\quad =\quad Predicted\quad Value.\)

min_max_accuracy <- mean(apply(actual_pred, 1, min) / apply(actual_pred, 1, max))*100 
print(paste('The Percentage Accuracy Is: ', min_max_accuracy))
## [1] "The Percentage Accuracy Is:  86.9749925016897"
mape <- mean(abs((actual_pred$predicted - actual_pred$actual))/actual_pred$actual)*100 
print(paste('The MAPE Is: ', mape ))
## [1] "The MAPE Is:  15.3796521256326"

We can be rest assured that about 15 percent of our prediction might not be accurate. Although this might be a bit of concern to us as we wanted about 99percent prediction accuracy.

GENERAL CONCLUSION

Since P-value is less than Tabulated/Table value (0.05), we will therefore reject Null Hypothesis Ho and conclude that both Labor Force and Population have effects on HIV/AIDS, Malaria and Tuberculosis. I.e an increase and decrease in the labor force and population would have an impact on the wide spread of disease in the countries under study.

RECOMMENDATION

The countries that are mostly affected needs to do more in terms of;

1). LABOR FORCE: Provide more employment opportunities to the citizenry. If an economy is viable, the citizen would want to live a healthy life style and would ultimately aid in decreasing the widespread of diseases.

2). Population: Maintain the population, as a sporadic increament in population without good stanadard of living, would jeopardized the ecomony and cause an increase in the spread of dideases.

REFERENCES

http://advancedrenaleducation.com/content/incidence-and-prevalence

https://www.cdc.gov/ophss/csels/dsepd/ss1978/lesson3/section2.html

https://plot.ly/r/

http://rmarkdown.rstudio.com/html_document_format.html#overview

http://databank.worldbank.org/data/reports.aspx?source=health-nutrition-and-population-statistics#advancedDownloadOptions