|IS 608 KNOWLEDGE AND VISUAL ANALYTICS - FINAL PROJECT

ABSTRACT

The main reason why I opted to do my project on demographic dataset is to show the beauty of data visualization. A good statistical analysis may not be convincing enough without depicting the result graphically in a very understandable and unambiguous way. The dataset used in this project is a demography data collected from the world bank website from 2010 to 2015. If I may ask, are you not curious to know if there is any relationship between labor Force/Population and diseases like HIV/AIDS, Tuberculosis, Malaria etc.? Better still, does geographical location has anything to do with all the widespread of these diseases?

Although the dataset was collected from world bank, but the issue with demographic data is that, they always have missing observation(s) and this dataset is no exception. We had a lot of missing observations (NAs), but I was able to deal with it.

In the data exploration, I obtained an inferential statistic and checked for any outliers. I majorly used SQL from sqldf package (it was the best for this type of dataset) to extract our observation and was able to show their (countries using GOOGLEVIS) location and graphically using PLOTLY and GGPLOT2. I observed that the dataset was not normally distributed, so, I decided to use Generalized Linear Model (GLM) for regression analysis and Analysis of Variance (ANOVA) to obtain best model for predictions.

In the end, I was able to use data visualization in the field of epidemiology to depict succinctly the effect of labor force and population on the widespread of diseases. I also observed that using incident risk alone may not be the best option to make recommendation and conclusion on epidemics, but rather use prevalence to make accurate analysis.

Plotly Map

Click this map to see the incident risk-person for the year 2015

DATA EXPLORATORY

Load this libraries and dataset and lets get to work!

Get the datasets from here:

https://raw.githubusercontent.com/mascotinme/MSDA-608/master/WorldHealth.csv

https://raw.githubusercontent.com/mascotinme/MSDA-608/master/Countries_long_lat2.csv

suppressMessages(library(R.rsp))
suppressMessages(library(pandocfilters))
suppressMessages(library(knitr))
suppressMessages(library(plyr))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
suppressMessages(library(sqldf))
suppressMessages(library(MASS))
suppressMessages(library(reshape2))
suppressMessages(library(Amelia))
suppressMessages(library(mice))
suppressPackageStartupMessages(library(googleVis))
suppressMessages(library(stringi))
suppressMessages(library(ROCR));

The dataset is being loaded into R and a glimpse of it follows…

df <- read.csv("https://raw.githubusercontent.com/mascotinme/MSDA-608/master/WorldHealth.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
kable(head(df[1:10, ]))

Year_Code	Country_Name	Country_Code	Adults_15_living_HIV	Adults_Children_0_14_15_living_HIV	AIDS_estimated_deaths_UNAIDS	Adults_children_0_14_15_newly_infected_HIV	Adults_15_newly_infected_HIV	Children_0_14_living_with_HIV	Children_orphaned_by_HIV_AIDS	Children_0_14_newly_infected_HIV	Incidence_tuberculosis_per_100000	Labor_force_total	Mortality_traffic_injury_100K	Population_female	Population_male	Population_total	Malaria_cases_reported	Suicide_mortality_per_100K	Tuberculosis_death_per_100K	Tuberculosis_case_detection	Tuberculosis_treatment_success_NewCases
YR2010	Afghanistan	AFG	4300	4500	500	1000	1000	200	3000	100	189.0	7707349	15.3	14005473	14797694	28803167	69397	5.1	41.00	53	86
YR2010	Albania	ALB	1000	1000	100	200	200	NA	NA	NA	17.0	1216574	15.4	1451422	1461599	2913021	NA	5.3	0.20	88	89
YR2010	Algeria	DZA	9000	9200	500	1000	1000	500	2100	100	77.0	11115092	24.7	17887239	18230398	36117637	1	3.4	8.50	80	89
YR2010	American Samoa	ASM	NA	NA	NA	NA	NA	NA	NA	NA	8.3	NA	NA	NA	NA	55637	NA	NA	0.68	87	100
YR2010	Andorra	AND	NA	NA	NA	NA	NA	NA	NA	NA	9.5	NA	NA	NA	NA	84449	NA	NA	0.78	87	86
YR2010	Angola	AGO	190000	210000	9400	25000	21000	18000	94000	3800	384.0	8394741	20.8	11936016	11433115	23369131	1682870	20.6	58.00	55	48

World longitudes and latitudes.

We are going to include another data that contains all the needed Longtitude and latitude for all country locations in the ddataset.

# We are going to include another data that contains all the needed Longtitude and latitude for all country locations in the ddataset.
lat_long <- read.csv("https://raw.githubusercontent.com/mascotinme/MSDA-608/master/Countries_long_lat2.csv", header = TRUE, sep = ",")
colnames(lat_long) <- c("Country_Name", "Country_Code", "Latitude", "Longtitude")
kable(head(lat_long))

Country_Name	Country_Code	Latitude	Longtitude
Albania	ALB	41.0000	20.0000
Algeria	DZA	28.0000	3.0000
American Samoa	ASM	-14.3333	-170.0000
Andorra	AND	42.5000	1.6000
Angola	AGO	-12.5000	18.5000
Anguilla	AIA	18.2500	-63.1667

Cleaning and renaming of dataset and column respectively.

options(warn = -1)
df2 <- merge(df, lat_long, by.x = "Country_Name", by.y = "Country_Name", all = TRUE)
kable(head(df2))

Country_Name	Year_Code	Country_Code.x	Adults_15_living_HIV	Adults_Children_0_14_15_living_HIV	AIDS_estimated_deaths_UNAIDS	Adults_children_0_14_15_newly_infected_HIV	Adults_15_newly_infected_HIV	Children_0_14_living_with_HIV	Children_orphaned_by_HIV_AIDS	Children_0_14_newly_infected_HIV	Incidence_tuberculosis_per_100000	Labor_force_total	Mortality_traffic_injury_100K	Population_female	Population_male	Population_total	Malaria_cases_reported	Suicide_mortality_per_100K	Tuberculosis_death_per_100K	Tuberculosis_case_detection	Tuberculosis_treatment_success_NewCases	Country_Code.y	Latitude	Longtitude
Afghanistan	YR2015	AFG	6700	6900	500	1000	1000	500	4200	100	189	9825616	15.2	16346869	17389625	33736494	86895	5.5	37	58	NA	AFG	33	65
Afghanistan	YR2013	AFG	5600	5800	500	1000	1000	200	3700	100	189	8916157	NA	15398276	16333412	31731688	39263	NA	41	53	88	AFG	33	65
Afghanistan	YR2011	AFG	4700	4900	500	1000	1000	200	3200	100	189	8050184	NA	14444001	15264598	29708599	77549	NA	42	51	88	AFG	33	65
Afghanistan	YR2012	AFG	5200	5300	500	1000	1000	200	3500	100	189	8458402	NA	14912657	15784301	30696958	54840	NA	42	51	88	AFG	33	65
Afghanistan	YR2014	AFG	6200	6400	500	1000	1000	500	4000	100	189	9397624	NA	15881092	16876928	32758020	145282	NA	41	53	87	AFG	33	65
Afghanistan	YR2010	AFG	4300	4500	500	1000	1000	200	3000	100	189	7707349	15.3	14005473	14797694	28803167	69397	5.1	41	53	86	AFG	33	65

Checking for all NAs in the datasets before cleaning.

na_count_before <-sapply(df, function(a) sum(is.na(a)))
na_count_after <-sapply(df2, function(b) sum(is.na(b)))

kable(na_count_before)

Year_Code	0
Country_Name	0
Country_Code	0
Adults_15_living_HIV	511
Adults_Children_0_14_15_living_HIV	516
AIDS_estimated_deaths_UNAIDS	540
Adults_children_0_14_15_newly_infected_HIV	576
Adults_15_newly_infected_HIV	571
Children_0_14_living_with_HIV	684
Children_orphaned_by_HIV_AIDS	684
Children_0_14_newly_infected_HIV	678
Incidence_tuberculosis_per_100000	55
Labor_force_total	160
Mortality_traffic_injury_100K	916
Population_female	118
Population_male	118
Population_total	4
Malaria_cases_reported	703
Suicide_mortality_per_100K	916
Tuberculosis_death_per_100K	55
Tuberculosis_case_detection	101
Tuberculosis_treatment_success_NewCases	345

kable(na_count_after)

Country_Name	0
Year_Code	68
Country_Code.x	68
Adults_15_living_HIV	579
Adults_Children_0_14_15_living_HIV	584
AIDS_estimated_deaths_UNAIDS	608
Adults_children_0_14_15_newly_infected_HIV	644
Adults_15_newly_infected_HIV	639
Children_0_14_living_with_HIV	752
Children_orphaned_by_HIV_AIDS	752
Children_0_14_newly_infected_HIV	746
Incidence_tuberculosis_per_100000	123
Labor_force_total	228
Mortality_traffic_injury_100K	984
Population_female	186
Population_male	186
Population_total	72
Malaria_cases_reported	771
Suicide_mortality_per_100K	984
Tuberculosis_death_per_100K	123
Tuberculosis_case_detection	169
Tuberculosis_treatment_success_NewCases	413
Country_Code.y	246
Latitude	246
Longtitude	246

From the table above, we can see that both datasets contains alot of NAs (This is due to data unavalability in some countries under observation). But, before we can proceed, let first convert some of the variable to numeric. This would give room for better a analysis.

#Covert the columns to Numeric

df2[, c(4:22)] <- sapply(df2[, c(4:22)], as.numeric)

glimpse(df2)

## Observations: 1,364
## Variables: 25
## $ Country_Name                               <chr> "Afghanistan", "Afg...
## $ Year_Code                                  <chr> "YR2015", "YR2013",...
## $ Country_Code.x                             <chr> "AFG", "AFG", "AFG"...
## $ Adults_15_living_HIV                       <dbl> 6700, 5600, 4700, 5...
## $ Adults_Children_0_14_15_living_HIV         <dbl> 6900, 5800, 4900, 5...
## $ AIDS_estimated_deaths_UNAIDS               <dbl> 500, 500, 500, 500,...
## $ Adults_children_0_14_15_newly_infected_HIV <dbl> 1000, 1000, 1000, 1...
## $ Adults_15_newly_infected_HIV               <dbl> 1000, 1000, 1000, 1...
## $ Children_0_14_living_with_HIV              <dbl> 500, 200, 200, 200,...
## $ Children_orphaned_by_HIV_AIDS              <dbl> 4200, 3700, 3200, 3...
## $ Children_0_14_newly_infected_HIV           <dbl> 100, 100, 100, 100,...
## $ Incidence_tuberculosis_per_100000          <dbl> 189.0, 189.0, 189.0...
## $ Labor_force_total                          <dbl> 9825616, 8916157, 8...
## $ Mortality_traffic_injury_100K              <dbl> 15.2, NA, NA, NA, N...
## $ Population_female                          <dbl> 16346869, 15398276,...
## $ Population_male                            <dbl> 17389625, 16333412,...
## $ Population_total                           <dbl> 33736494, 31731688,...
## $ Malaria_cases_reported                     <dbl> 86895, 39263, 77549...
## $ Suicide_mortality_per_100K                 <dbl> 5.5, NA, NA, NA, NA...
## $ Tuberculosis_death_per_100K                <dbl> 37.00, 41.00, 42.00...
## $ Tuberculosis_case_detection                <dbl> 58, 53, 51, 51, 53,...
## $ Tuberculosis_treatment_success_NewCases    <dbl> NA, 88, 88, 88, 87,...
## $ Country_Code.y                             <fctr> AFG, AFG, AFG, AFG...
## $ Latitude                                   <dbl> 33.0000, 33.0000, 3...
## $ Longtitude                                 <dbl> 65.0, 65.0, 65.0, 6...

Merging column longitude and Latitude together for a better coordinate to be used in our maps (googlevis requirement!)

df2$Lat_Long = paste(df2$Latitude, df2$Longtitude, sep=":")

kable(head(df2))

Country_Name	Year_Code	Country_Code.x	Adults_15_living_HIV	Adults_Children_0_14_15_living_HIV	AIDS_estimated_deaths_UNAIDS	Adults_children_0_14_15_newly_infected_HIV	Adults_15_newly_infected_HIV	Children_0_14_living_with_HIV	Children_orphaned_by_HIV_AIDS	Children_0_14_newly_infected_HIV	Incidence_tuberculosis_per_100000	Labor_force_total	Mortality_traffic_injury_100K	Population_female	Population_male	Population_total	Malaria_cases_reported	Suicide_mortality_per_100K	Tuberculosis_death_per_100K	Tuberculosis_case_detection	Tuberculosis_treatment_success_NewCases	Country_Code.y	Latitude	Longtitude	Lat_Long
Afghanistan	YR2015	AFG	6700	6900	500	1000	1000	500	4200	100	189	9825616	15.2	16346869	17389625	33736494	86895	5.5	37	58	NA	AFG	33	65	33:65
Afghanistan	YR2013	AFG	5600	5800	500	1000	1000	200	3700	100	189	8916157	NA	15398276	16333412	31731688	39263	NA	41	53	88	AFG	33	65	33:65
Afghanistan	YR2011	AFG	4700	4900	500	1000	1000	200	3200	100	189	8050184	NA	14444001	15264598	29708599	77549	NA	42	51	88	AFG	33	65	33:65
Afghanistan	YR2012	AFG	5200	5300	500	1000	1000	200	3500	100	189	8458402	NA	14912657	15784301	30696958	54840	NA	42	51	88	AFG	33	65	33:65
Afghanistan	YR2014	AFG	6200	6400	500	1000	1000	500	4000	100	189	9397624	NA	15881092	16876928	32758020	145282	NA	41	53	87	AFG	33	65	33:65
Afghanistan	YR2010	AFG	4300	4500	500	1000	1000	200	3000	100	189	7707349	15.3	14005473	14797694	28803167	69397	5.1	41	53	86	AFG	33	65	33:65

A graphical representation of the missing values.

missmap(df2, legend = TRUE, col = c("wheat","darkred", col=c('yellow', 'darkgreen')), main ="Plot Showing The Missing Values Per Observation",
        y.cex = 0.8, x.cex = 0.8,csvar = NULL, tsvar =NULL, rank.order = TRUE)

Since it is lucid that we have lots of missing observations in the dataset, we are going to employ a method that best deal with datasets of this nature, The MICE (Multivariate Imputation by Chained Equations) package. The package helps in inspecting, imputing, diagonise, analyze, pool the result, and generate simulated incomplete data.

mice_missing_val <- mice(df2, m=1, method='cart', printFlag=FALSE, where = is.na(df2))
mice_missing_val_comp <- complete(mice_missing_val, action='long', include=FALSE)

After the generated values by mice, we still have some missing observations. We therefore going to omit those observation that stll have missing value(s) to avoid a biased analysis.

df2_cleaned <- na.omit(mice_missing_val_comp)
df2_cleaned <- df2_cleaned[, -c(1:2)]
kable(head(df2_cleaned))

Country_Name	Year_Code	Country_Code.x	Adults_15_living_HIV	Adults_Children_0_14_15_living_HIV	AIDS_estimated_deaths_UNAIDS	Adults_children_0_14_15_newly_infected_HIV	Adults_15_newly_infected_HIV	Children_0_14_living_with_HIV	Children_orphaned_by_HIV_AIDS	Children_0_14_newly_infected_HIV	Incidence_tuberculosis_per_100000	Labor_force_total	Mortality_traffic_injury_100K	Population_female	Population_male	Population_total	Malaria_cases_reported	Suicide_mortality_per_100K	Tuberculosis_death_per_100K	Tuberculosis_case_detection	Tuberculosis_treatment_success_NewCases	Country_Code.y	Latitude	Longtitude	Lat_Long
Afghanistan	YR2015	AFG	6700	6900	500	1000	1000	500	4200	100	189	9825616	15.2	16346869	17389625	33736494	86895	5.500000	37	58	91	AFG	33	65	33:65
Afghanistan	YR2013	AFG	5600	5800	500	1000	1000	200	3700	100	189	8916157	4.6	15398276	16333412	31731688	39263	14.600000	41	53	88	AFG	33	65	33:65
Afghanistan	YR2011	AFG	4700	4900	500	1000	1000	200	3200	100	189	8050184	15.0	14444001	15264598	29708599	77549	5.100000	42	51	88	AFG	33	65	33:65
Afghanistan	YR2012	AFG	5200	5300	500	1000	1000	200	3500	100	189	8458402	6.5	14912657	15784301	30696958	54840	8.500000	42	51	88	AFG	33	65	33:65
Afghanistan	YR2014	AFG	6200	6400	500	1000	1000	500	4000	100	189	9397624	13.0	15881092	16876928	32758020	145282	9.397671	41	53	87	AFG	33	65	33:65
Afghanistan	YR2010	AFG	4300	4500	500	1000	1000	200	3000	100	189	7707349	15.3	14005473	14797694	28803167	69397	5.100000	41	53	86	AFG	33	65	33:65

We now have data with no missing observations.

na_count_after_cleaning <-sapply(df2_cleaned, function(c) sum(is.na(c)))
kable(na_count_after_cleaning)

Country_Name	0
Year_Code	0
Country_Code.x	0
Adults_15_living_HIV	0
Adults_Children_0_14_15_living_HIV	0
AIDS_estimated_deaths_UNAIDS	0
Adults_children_0_14_15_newly_infected_HIV	0
Adults_15_newly_infected_HIV	0
Children_0_14_living_with_HIV	0
Children_orphaned_by_HIV_AIDS	0
Children_0_14_newly_infected_HIV	0
Incidence_tuberculosis_per_100000	0
Labor_force_total	0
Mortality_traffic_injury_100K	0
Population_female	0
Population_male	0
Population_total	0
Malaria_cases_reported	0
Suicide_mortality_per_100K	0
Tuberculosis_death_per_100K	0
Tuberculosis_case_detection	0
Tuberculosis_treatment_success_NewCases	0
Country_Code.y	0
Latitude	0
Longtitude	0
Lat_Long	0

Summary Statistics

kable(summary(df2_cleaned[, 4:22]))

Adults_15_living_HIV	Adults_Children_0_14_15_living_HIV	AIDS_estimated_deaths_UNAIDS	Adults_children_0_14_15_newly_infected_HIV	Adults_15_newly_infected_HIV	Children_0_14_living_with_HIV	Children_orphaned_by_HIV_AIDS	Children_0_14_newly_infected_HIV	Incidence_tuberculosis_per_100000	Labor_force_total	Mortality_traffic_injury_100K	Population_female	Population_male	Population_total	Malaria_cases_reported	Suicide_mortality_per_100K	Tuberculosis_death_per_100K	Tuberculosis_case_detection	Tuberculosis_treatment_success_NewCases
Min. : 100	Min. : 100	Min. : 100	Min. : 100	Min. : 100	Min. : 100	Min. : 500	Min. : 100	Min. : 0.00	Min. :1.389e+05	Min. : 2.40	Min. :1.454e+05	Min. :1.341e+05	Min. :2.796e+05	Min. : 0	Min. : 0.000	Min. : 0.11	Min. : 15.00	Min. : 0.00
1st Qu.: 8300	1st Qu.: 8700	1st Qu.: 200	1st Qu.: 1000	1st Qu.: 1000	1st Qu.: 500	1st Qu.: 3925	1st Qu.: 100	1st Qu.: 27.25	1st Qu.:1.897e+06	1st Qu.:14.10	1st Qu.:2.199e+06	1st Qu.:2.194e+06	1st Qu.:4.350e+06	1st Qu.: 364	1st Qu.: 6.025	1st Qu.: 2.00	1st Qu.: 58.00	1st Qu.: 75.00
Median : 37000	Median : 40500	Median : 1300	Median : 2400	Median : 2000	Median : 2400	Median : 23000	Median : 500	Median : 100.00	Median :5.089e+06	Median :20.80	Median :5.661e+06	Median :5.620e+06	Median :1.131e+07	Median : 15156	Median : 8.200	Median : 9.25	Median : 76.00	Median : 83.00
Mean : 514638	Mean : 555964	Mean : 20739	Mean : 32101	Mean : 28040	Mean : 44315	Mean : 326680	Mean : 4192	Mean : 165.15	Mean :4.366e+07	Mean :20.25	Mean :4.838e+07	Mean :4.923e+07	Mean :9.763e+07	Mean : 469355	Mean :10.256	Mean :18.96	Mean : 70.22	Mean : 79.34
3rd Qu.: 170000	3rd Qu.: 180000	3rd Qu.: 5875	3rd Qu.: 9950	3rd Qu.: 9225	3rd Qu.: 13000	3rd Qu.: 97000	3rd Qu.: 1200	3rd Qu.: 235.00	3rd Qu.:1.229e+07	3rd Qu.:27.30	3rd Qu.:1.476e+07	3rd Qu.:1.574e+07	3rd Qu.:3.030e+07	3rd Qu.: 259482	3rd Qu.:12.700	3rd Qu.:30.00	3rd Qu.: 87.00	3rd Qu.: 88.00
Max. :33800000	Max. :36100000	Max. :1500000	Max. :2200000	Max. :1900000	Max. :2600000	Max. :19700000	Max. :300000	Max. :1246.00	Max. :3.406e+09	Max. :45.40	Max. :3.644e+09	Max. :3.708e+09	Max. :7.355e+09	Max. :11627473	Max. :35.300	Max. :99.00	Max. :110.00	Max. :100.00

Percentage of Children Orphaned By HIV/AIDS

we are now to goint make use of sql to subset(query) columns so as to diffentiate between year 2000 and 20105 where the number children orphaned by HIV/AIDS.

All_Cen <- sqldf("SELECT Country_Name, Year_Code, (Children_orphaned_by_HIV_AIDS/Population_total)*100 as 'Percentage_Orphaned_byHIV' FROM df2_cleaned where Percentage_Orphaned_byHIV >= 2 ORDER BY Percentage_Orphaned_byHIV DESC LIMIT 50" )

A box plot of the outcomes.

# Boxplot
g <- ggplot(All_Cen, aes(Year_Code, Percentage_Orphaned_byHIV)) + geom_boxplot(varwidth=T, fill="plum") + 
    labs(title="Box plot of Children Orphaned By HIV/AID per Year (%)", 
         subtitle="Percentage Orphaned by HIV grouped by Year",
         caption="Source: The World Bank Data Bank",
         x="Year",
         y="% of Children Orphaned By HIV/AIDS")

ggplotly(g)

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

I think a better approach is to include as many information as possible that is also free of redundancy! (PLEASE HOVER YOUR MOUSE ON PLOTS!!!)

THE ANALYSIS

Let first look at the graph of children orphaned by HIV?AIDS!

ggplotly(ggplot(All_Cen, aes(x=Percentage_Orphaned_byHIV, y=reorder(Country_Name, +Percentage_Orphaned_byHIV), fill=Year_Code)) +
geom_point(colour="purple", size=2, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code))) + labs(title="Chart of Children Orphaned By HIV/AIDS BY Countries (%)"))

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

INCIDENCE PROPORTION,INCIDENCE RATES AND PREVALENCE

Definitions:

Incidence: a measure of the occurrence of a new disease, in a defined population at risk for the disease, in a specified time period.

while

1). Incidence proportion (also known as cumulative incidence) is the number of new cases within a specified time period divided by the size of the population initially at risk

2). Incidence rate or person-time rate is a measure of incidence that incorporates time directly into the denominator. A person-time rate is generally calculated from a long-term cohort follow-up study, wherein enrollees are followed over time and the occurrence of new cases of disease is documented." —CDC

3). Prevalence is the proportion of the total number of cases to the total population and is more a measure of the burden of the disease on society with no regard to time at risk or when subjects may have been exposed to a possible risk factor. Prevalence can also be measured with respect to a specific subgroup of a population

Mathematically as,

\(Incidence\quad Rate\quad =\quad \frac { Number\quad of\quad newcases\quad of\quad disease\quad or\quad injury\quad during\quad specified\quad period }{ Time\quad each\quad person\quad was\quad observed,\quad totaled\quad for\quad all\quad persons }\)

HIV/AIDS and Tuberculosis Incidence Calculation For People At Risk (Proportion)

Hiv_Incident_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Adults_children_0_14_15_newly_infected_HIV/Population_total)*100000 as 'HIV_Incident_Risk_Per_100K' FROM df2_cleaned where HIV_Incident_Risk_Per_100K >= 200 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 30" )
kable(head(Hiv_Incident_Risk))

Population_total	Lat_Long	Country_Name	Year_Code	HIV_Incident_Risk_Per_100K
2209197	-22:24	Botswana	YR2015	497.9185
4546100	7:21	Central African Republic	YR2015	202.3713
1175389	2:10	Equatorial Guinea	YR2015	204.1877
2174645	-29.5:28.5	Lesotho	YR2015	1011.6594
17573607	-13.5:34	Malawi	YR2015	221.9237
28010691	-18.25:35	Mozambique	YR2015	339.1562

Tuberculosis_Inc_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Tuberculosis_case_detection/Population_total)* 100000 as 'Tuberculocisis_Incident_Risk_Per_100K' FROM df2_cleaned where Tuberculocisis_Incident_Risk_Per_100K >= 1 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 50" )
kable(head(Tuberculosis_Inc_Risk))

Population_total	Lat_Long	Country_Name	Year_Code	Tuberculocisis_Incident_Risk_Per_100K
2880703	41:20	Albania	YR2015	2.638245
2916950	40.0691:45.03819	Armenia	YR2015	3.051132
1371855	26:50.55	Bahrain	YR2015	6.341778
284217	13.1667:-59.5333	Barbados	YR2015	30.610414
359288	17.25:-88.75	Belize	YR2015	24.214558
2209197	-22:24	Botswana	YR2015	2.806450

Malaria_Inc_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Malaria_cases_reported/Population_total)* 100000 as 'Malaria_Incident_Risk_Per_100K' FROM df2_cleaned where Malaria_Incident_Risk_Per_100K >= 1 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 50" )
kable(head(Malaria_Inc_Risk))

Population_total	Lat_Long	Country_Name	Year_Code	Malaria_Incident_Risk_Per_100K
33736494	33:65	Afghanistan	YR2015	257.569740
2880703	41:20	Albania	YR2015	1.909256
27859305	-12.5:18.5	Angola	YR2015	9940.323350
161200886	24:90	Bangladesh	YR2015	4.099233
284217	13.1667:-59.5333	Barbados	YR2015	6186.470197
9489616	53:28	Belarus	YR2015	126.980902

HIV Incidence Person At Risk

ggplotly(ggplot(Hiv_Incident_Risk, aes(x=HIV_Incident_Risk_Per_100K, y=reorder(Country_Name,+HIV_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="red", size=2, alpha=.8) + ylab("Countries") + xlab("HIV Incident Risk") +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code))) + labs(title="Chart of HIV/AIDS Incidence (Person At Risk) BY Countries"))

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Tuberculosis Incidence Person At Risk

ggplotly(ggplot(Tuberculosis_Inc_Risk, aes(x=Tuberculocisis_Incident_Risk_Per_100K, y=reorder(Country_Name, Tuberculocisis_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="purple", size=2, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code)))+ labs(title="Chart of Tuberculosis Incidence (Person At Risk) BY Countries"))

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Malaria Incidence Person At Risk

ggplotly(ggplot(Malaria_Inc_Risk, aes(x=Malaria_Incident_Risk_Per_100K, y=reorder(Country_Name, Malaria_Incident_Risk_Per_100K), fill=Year_Code)) +
geom_point(colour="brown", size=2, alpha=.8) + 
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code)))+ labs(title="Chart of Malaria Incidence (Person At Risk) BY Countries"))

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Their Locations On The Map

HIV/AIDS

Hiv_Inc_Map <- gvisGeoChart(Hiv_Incident_Risk, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "HIV_Incident_Risk_Per_100K", colorvar = "Population_total",
                   options=list(displayMode="Markers", 
                   colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                   backgroundColor="lightblue"), chartid="Hiv_Incidence_Person_At_Risk")
plot(Hiv_Inc_Map)

Tuberculosis

Tub_Inc_Map <- gvisGeoChart(Tuberculosis_Inc_Risk, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "Tuberculocisis_Incident_Risk_Per_100K", colorvar = "Population_total",
                   options=list(displayMode="Markers", 
                   colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                   backgroundColor="lightblue"), chartid="Tuberculosis_Incidence_Person_At_Risk")
plot(Tub_Inc_Map )

Malaria

Malaria_Inc_Map <- gvisGeoChart(Malaria_Inc_Risk, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "Malaria_Incident_Risk_Per_100K", colorvar = "Population_total",
                   options=list(displayMode="Markers", 
                   colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                   backgroundColor="lightblue"), chartid="Malaria_Incidence_Person_At_Risk")
plot(Malaria_Inc_Map )

HIV/AID, Malaria & Tuberculosis Incident Risk Comparison

Person At Risk Comaparison

comparison <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, Labor_force_total, (Malaria_cases_reported/Population_total)* 1000 as 'Malaria_Incident_Risk_Per_1K', (Tuberculosis_case_detection/Population_total)* 100000 as 'Tuberculocisis_Incident_Risk_Per_100K',  (Adults_children_0_14_15_newly_infected_HIV/Population_total)*100000 as 'HIV_Incident_Risk_Per_100K' FROM df2_cleaned where Malaria_Incident_Risk_Per_1K >= 1 and Lat_Long != 'NA:NA' or Tuberculocisis_Incident_Risk_Per_100K >= 1 or HIV_Incident_Risk_Per_100K >= 200 ORDER BY Country_Name ASC")

kable(head(comparison))

Population_total	Lat_Long	Country_Name	Year_Code	Labor_force_total	Malaria_Incident_Risk_Per_1K	Tuberculocisis_Incident_Risk_Per_100K	HIV_Incident_Risk_Per_100K
33736494	33:65	Afghanistan	YR2015	9825616	2.575697	0.1719207	2.964149
31731688	33:65	Afghanistan	YR2013	8916157	1.237344	0.1670255	3.151424
29708599	33:65	Afghanistan	YR2011	8050184	2.610322	0.1716675	3.366029
30696958	33:65	Afghanistan	YR2012	8458402	1.786496	0.1661402	3.257652
32758020	33:65	Afghanistan	YR2014	9397624	4.435005	0.1617924	3.052688
28803167	33:65	Afghanistan	YR2010	7707349	2.409353	0.1840075	3.471840

A 3D approach

comparison <- comparison[order(comparison$Country_Name, comparison$Year_Code),]
comparison$size <- comparison$Population_total
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')

pw <- plot_ly(comparison, x = ~Malaria_Incident_Risk_Per_1K, y = ~Tuberculocisis_Incident_Risk_Per_100K, z = ~HIV_Incident_Risk_Per_100K, color = ~Year_Code, size = ~size, colors = colors,
             marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 120),
             text = ~paste('Country:', Country_Name, '<br>Labor Force:', Labor_force_total, '<br>Population Total:', Population_total,
                           '<br>Lat & Long.:', Lat_Long)) %>%
  layout(title = 'Comparison Among HIV/AIDS, Malaria & Tuberculosis Incidence For People At Risk',
         scene = list(xaxis = list(title = 'Malaria Pop At Risk Per 1K',
                      gridcolor = 'rgb(255, 255, 255)',
                      type = 'log',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwidth = 2),
               yaxis = list(title = 'Tuberculosis Pop At Risk Per 100K',
                      gridcolor = 'rgb(255, 255, 255)',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwith = 2),
               zaxis = list(title = 'HIV/AIDS Pop At Risk Per 100K',
                            gridcolor = 'rgb(255, 255, 255)',
                            type = 'log',
                            zerolinewidth = 1,
                            ticklen = 5,
                            gridwith = 2)),
         paper_bgcolor = 'rgb(243, 243, 243)',
         plot_bgcolor = 'rgb(243, 243, 243)')

pw

Incidence Person_Per_Time

HIV/AIDS Person Per Time

HIV_Inc_Person_Time <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, ((New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000 as 'HIV_Inc_Person_Time100K'  FROM (SELECT Population_total, Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY New_Hiv_Cases, Year_Code DESC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(HIV_Inc_Person_Time))

Population_total	Lat_Long	Country_Name	Year_Code	HIV_Inc_Person_Time100K
1371855	26:50.55	Bahrain	YR2015	72.90038
777424	-12.1667:44.25	Comoros	YR2015	128.65211
4203604	45.1667:15.5	Croatia	YR2015	23.79033
9159302	31:36	Jordan	YR2015	10.91828
3935794	29.3375:47.6581	Kuwait	YR2015	25.40904
2976877	46:105	Mongolia	YR2015	33.59576

Tuberculosis Person Per Time

#Nested SELECTion!

Tuber_Inc_Person_Time <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Tuberculosis_case_detection/(Population_total -(Tuber_trtd  + AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000 as 'Tuberculosis_Inc_Person_Time100K'  FROM (SELECT Population_total,Tuberculosis_case_detection, Lat_Long,Country_Name, Year_Code, Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Year_Code DESC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(Tuber_Inc_Person_Time))

Population_total	Lat_Long	Country_Name	Year_Code	Tuberculosis_Inc_Person_Time100K
33736494	33:65	Afghanistan	YR2015	1.719336
2880703	41:20	Albania	YR2015	26.384446
39871528	28:3	Algeria	YR2015	2.006544
27859305	-12.5:18.5	Angola	YR2015	2.298376
43417765	-34:-64	Argentina	YR2015	2.003961
2916950	40.0691:45.03819	Armenia	YR2015	30.515015

Person Per Time Maps

HIV/AIDS Person Per Time Map

HIV_Inc_Person_Time_Map <- gvisGeoChart(HIV_Inc_Person_Time, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "HIV_Inc_Person_Time100K", colorvar = "Population_total",
                    options=list(displayMode="Markers", 
                    colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                    backgroundColor="lightblue"), chartid="Hiv_Incidence_Person_Per_Time")
plot(HIV_Inc_Person_Time_Map )

Tuberculosis Person Per Time Map

Tuber_Inc_PT_Map <- gvisGeoChart(Tuber_Inc_Person_Time, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "Tuberculosis_Inc_Person_Time100K", colorvar = "Population_total",
                    options=list(displayMode="Markers", 
                    colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                    backgroundColor="lightblue"), chartid="Tuberculosis_Incidence_Person_Per_Time")
plot(Tuber_Inc_PT_Map)

PREVALENCE RATE:

Prevalence is the actual number of cases alive, with the disease either during a period of time (period prevalence) or at a particular date in time (point prevalence). Period prevalence provides the better measure of the disease load since it includes all new cases and all deaths between two dates, whereas point prevalence only counts those alive on a particular date.

Prevalence is also most meaningfully reported as the number of cases as a fraction of the total population at risk and can be further categorized according to different subsets of the population.

HIV Prevalence

HIV_Prevalence <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round(((Old_Hiv_Cases + New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'HIV_Inc_Prevalence_100K'  FROM (SELECT Population_total,Adults_Children_0_14_15_living_HIV as 'Old_Hiv_Cases', Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(HIV_Prevalence))

Population_total	Lat_Long	Country_Name	Year_Code	HIV_Inc_Prevalence_100K
33736494	33:65	Afghanistan	YR2015	234
31731688	33:65	Afghanistan	YR2013	214
29708599	33:65	Afghanistan	YR2011	199
30696958	33:65	Afghanistan	YR2012	205
32758020	33:65	Afghanistan	YR2014	226
28803167	33:65	Afghanistan	YR2010	191

NB: Malaria and Tuberculosis are most likely not be carried over to the proceding year. So, we will not include them (person living with either malaria or tuberculosis) in the numerators.

Tuberculosis Prevalence

#Nested SELECTion!

Tuber_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round((Tuberculosis_case_detection/(Population_total -(Tuber_trtd    + AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Tuberculosis_Prevalence_per_100K'  FROM (SELECT Population_total,Tuberculosis_case_detection, Lat_Long,Country_Name, Year_Code, Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(Tuber_Prevalence_Rate))

Population_total	Lat_Long	Country_Name	Year_Code	Tuberculosis_Prevalence_per_100K
33736494	33:65	Afghanistan	YR2015	2
31731688	33:65	Afghanistan	YR2013	2
29708599	33:65	Afghanistan	YR2011	2
30696958	33:65	Afghanistan	YR2012	2
32758020	33:65	Afghanistan	YR2014	2
28803167	33:65	Afghanistan	YR2010	2

Malaria Prevalence

Malaria_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round((Malaria_cases_reported/(Population_total -(AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Malaria_Prevalence_per_100K'  FROM (SELECT Population_total, Lat_Long,Country_Name, Year_Code, AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year',Malaria_cases_reported, round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
kable(head(Malaria_Prevalence_Rate))

Population_total	Lat_Long	Country_Name	Year_Code	Malaria_Prevalence_per_100K
33736494	33:65	Afghanistan	YR2015	2576
31731688	33:65	Afghanistan	YR2013	1237
29708599	33:65	Afghanistan	YR2011	2611
30696958	33:65	Afghanistan	YR2012	1787
32758020	33:65	Afghanistan	YR2014	4435
28803167	33:65	Afghanistan	YR2010	2410

All Disease Comparison (Prevalence)

Com_HIV_Tub_Mal_Prevalence <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code,round((Labor_force_total/Population_total)*100) as 'Labor_Pop_Percentage',round((Tuberculosis_case_detection/(Population_total -(Tuber_trtd  + AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Tuberculosis_Prevalence_per_100K',round((Malaria_cases_reported/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*100000) as 'Malaria_Prevalence_per_100K',round(((Old_Hiv_Cases + New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'HIV_Prevalence_Per_100K'  FROM (SELECT Labor_force_total,Population_total,Adults_Children_0_14_15_living_HIV as 'Old_Hiv_Cases', Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',Tuberculosis_case_detection,Malaria_cases_reported,Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC) where Lat_Long != 'NA:NA'")
kable(head(Com_HIV_Tub_Mal_Prevalence))

Population_total	Lat_Long	Country_Name	Year_Code	Labor_Pop_Percentage	Tuberculosis_Prevalence_per_100K	Malaria_Prevalence_per_100K	HIV_Prevalence_Per_100K
33736494	33:65	Afghanistan	YR2015	29	2	258	234
31731688	33:65	Afghanistan	YR2013	28	2	124	214
29708599	33:65	Afghanistan	YR2011	27	2	261	199
30696958	33:65	Afghanistan	YR2012	28	2	179	205
32758020	33:65	Afghanistan	YR2014	29	2	444	226
28803167	33:65	Afghanistan	YR2010	27	2	241	191

A 3D Approach For All Disease Comparison (Prevalence)

comp <- Com_HIV_Tub_Mal_Prevalence[order(Com_HIV_Tub_Mal_Prevalence$Country_Name, Com_HIV_Tub_Mal_Prevalence$Year_Code),]
comp$size <- Com_HIV_Tub_Mal_Prevalence$Population_total
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')

pw2 <- plot_ly(comp, x = ~Malaria_Prevalence_per_100K, y = ~Tuberculosis_Prevalence_per_100K, z = ~HIV_Prevalence_Per_100K, color = ~Year_Code, size = ~size, colors = colors,
             marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 120),
             text = ~paste('Country:', Country_Name, '<br>Population Total:', Population_total,
                           '<br>Labor Force/Pop %:', Labor_Pop_Percentage)) %>%
  layout(title = 'Comparison Among HIV/AIDS, Malaria & Tuberculosis Incidence For People At Risk',
         scene = list(xaxis = list(title = 'Malaria Pop At Risk Per 1K',
                      gridcolor = 'rgb(255, 255, 255)',
                      type = 'log',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwidth = 2),
               yaxis = list(title = 'Tuberculosis Pop At Risk Per 100K',
                      gridcolor = 'rgb(255, 255, 255)',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwith = 2),
               zaxis = list(title = 'HIV/AIDS Pop At Risk Per 100K',
                            gridcolor = 'rgb(255, 255, 255)',
                            type = 'log',
                            zerolinewidth = 1,
                            ticklen = 5,
                            gridwith = 2)),
         paper_bgcolor = 'rgb(243, 243, 243)',
         plot_bgcolor = 'rgb(243, 243, 243)')

pw2

Regression Analysis & Prediction

Check for normality

qqnorm(Com_HIV_Tub_Mal_Prevalence$Malaria_Prevalence_per_100K)
qqline(Com_HIV_Tub_Mal_Prevalence$Malaria_Prevalence_per_100K, col = 2)

From the above normality check, we can see that the variable under observation is not normally distributed. We will therefore use Generalized Linear Model (conventional linear regression models) to; select best model, estimate parameters and interpretations. The generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models etc.

i.e The dependent variable Yi does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal etc.

For Linear Regression, we have

\({ Y }_{ i }\quad =\quad { \beta }_{ 0 }\quad +\quad { \beta }_{ 1 }{ x }_{ i }\quad +\quad { \varepsilon }_{ i }\quad \\ where,\\ { Y }_{ i }\quad =\quad Dependent\quad Variable.\\ { \beta }_{ 0 }\quad =\quad Intercept.\\ { \beta }_{ 1 }\quad =\quad Parameter\quad To\quad Be\quad estimated\quad (slope).\\ { x }_{ i }\quad =\quad Independent\quad Variable.\\ { \varepsilon }_{ i }\quad =\quad Error\quad Term.\\ i\quad \quad =\quad 1,\quad 2,\quad 3,\quad 4,......,\quad n\quad \\\)

While Multiple linear regression, we have (in matrix notation):

\(\hat { \beta } =\quad ({ { X }^{ T }X) }^{ -1 }{ X }^{ T }Y\)

Obtaining Training and Test Dataset

# Create Training and Test data -
set.seed(12345)  # setting seed to reproduce results of random sampling
Com_HIV_Tub_Mal_Prevalence <- Com_HIV_Tub_Mal_Prevalence[, c(1,5,6,7,8)]
traindata <- sample(1:nrow(Com_HIV_Tub_Mal_Prevalence), 0.8*nrow(Com_HIV_Tub_Mal_Prevalence))  # row indices for training data
Train <- Com_HIV_Tub_Mal_Prevalence[traindata, ]  # model training data
Test  <- Com_HIV_Tub_Mal_Prevalence[-traindata, ]   # test data

SETTING UP HYPOTHESIS: For Labor Force

Null Hypothesis: Labor Force has no effect on HIV, Tuberculosis and Malaria.

\(H_o:\) \(\mu_1\) = \(\mu_2\) = \(\mu_3\)…..= \(\mu_n\)
Alternative Hythesis:

\(H_a:\) Not \(H_o:\) \(\mu_1\) \(\neq\) \(\mu_2\) \(\neq\) \(\mu_3\)…..\(\neq\) \(\mu_n\)

Rejection:

Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

NB: We used both forward and backward model selection to select the best model for our analysis.

fit_labor <-  glm(Labor_Pop_Percentage ~., family=poisson(link='log'),data=Train)
summary(stepwise <- step(fit_labor, direction = "both"))

## Start:  AIC=3397.71
## Labor_Pop_Percentage ~ Population_total + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
## 
##                                    Df Deviance    AIC
## <none>                                  661.93 3397.7
## - Malaria_Prevalence_per_100K       1   666.83 3400.6
## - Population_total                  1   666.87 3400.6
## - Tuberculosis_Prevalence_per_100K  1   667.02 3400.8
## - HIV_Prevalence_Per_100K           1   671.42 3405.2

## 
## Call:
## glm(formula = Labor_Pop_Percentage ~ Population_total + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"), 
##     data = Train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2093  -0.6812   0.1074   0.7941   4.0322  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       3.797e+00  8.946e-03 424.403  < 2e-16
## Population_total                 -1.338e-10  6.168e-11  -2.169  0.03012
## Tuberculosis_Prevalence_per_100K  3.378e-04  1.482e-04   2.279  0.02264
## Malaria_Prevalence_per_100K      -2.248e-06  1.027e-06  -2.189  0.02859
## HIV_Prevalence_Per_100K          -6.438e-07  2.115e-07  -3.043  0.00234
##                                     
## (Intercept)                      ***
## Population_total                 *  
## Tuberculosis_Prevalence_per_100K *  
## Malaria_Prevalence_per_100K      *  
## HIV_Prevalence_Per_100K          ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 688.60  on 485  degrees of freedom
## Residual deviance: 661.93  on 481  degrees of freedom
## AIC: 3397.7
## 
## Number of Fisher Scoring iterations: 4

Plot of Residuals

residual_labor = data.frame(Fitted = fitted(stepwise),
Residuals = resid(stepwise), Treatment = Train$Labor_Pop_Percentage)

plot_res <- ggplot(residual_labor, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

ggplotly(plot_res)

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Analysis of Variance (ANOVA)

anova(stepwise, test="F")

## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Labor_Pop_Percentage
## 
## Terms added sequentially (first to last)
## 
## 
##                                  Df Deviance Resid. Df Resid. Dev      F
## NULL                                               485     688.60       
## Population_total                  1   4.7922       484     683.81 4.7922
## Tuberculosis_Prevalence_per_100K  1   5.1819       483     678.63 5.1819
## Malaria_Prevalence_per_100K       1   7.2054       482     671.42 7.2054
## HIV_Prevalence_Per_100K           1   9.4892       481     661.93 9.4892
##                                    Pr(>F)   
## NULL                                        
## Population_total                 0.028589 * 
## Tuberculosis_Prevalence_per_100K 0.022824 * 
## Malaria_Prevalence_per_100K      0.007269 **
## HIV_Prevalence_Per_100K          0.002067 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

DECISION (Labor Force):

Reject Null Hypothesis (\(H_o\)) since P-Values are all less than Tabulated value of 0.05.

SETTING UP HYPOTHESIS: Population

Null Hypothesis: Population has no effect on HIV, Tuberculosis and Malaria.

\(H_o:\) \(\mu_1\) = \(\mu_2\) = \(\mu_3\)…..= \(\mu_n\)
Alternative Hythesis:

\(H_a:\) Not \(H_o:\) \(\mu_1\) \(\neq\) \(\mu_2\) \(\neq\) \(\mu_3\)…..\(\neq\) \(\mu_n\)

Rejection:

Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

NB: We used both forward and backward model selection to select the best model for our analysi.

fit_Population <-  glm(Population_total ~., family=poisson(link='log'),data=Train)
summary(stepwise_pop <- step(fit_Population , direction = "both"))

## Start:  AIC=20335673459
## Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
## 
##                                    Df   Deviance        AIC
## <none>                                2.0336e+10 2.0336e+10
## - Labor_Pop_Percentage              1 2.0638e+10 2.0638e+10
## - HIV_Prevalence_Per_100K           1 2.0751e+10 2.0751e+10
## - Malaria_Prevalence_per_100K       1 2.1448e+10 2.1448e+10
## - Tuberculosis_Prevalence_per_100K  1 4.4134e+10 4.4134e+10

## 
## Call:
## glm(formula = Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"), 
##     data = Train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -10699   -2899     184    4585   53407  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       1.967e+01  4.044e-05  486275   <2e-16
## Labor_Pop_Percentage             -1.652e-02  9.326e-07  -17713   <2e-16
## Tuberculosis_Prevalence_per_100K -3.374e-01  3.824e-06  -88230   <2e-16
## Malaria_Prevalence_per_100K      -7.291e-05  2.624e-09  -27782   <2e-16
## HIV_Prevalence_Per_100K          -7.459e-06  4.197e-10  -17772   <2e-16
##                                     
## (Intercept)                      ***
## Labor_Pop_Percentage             ***
## Tuberculosis_Prevalence_per_100K ***
## Malaria_Prevalence_per_100K      ***
## HIV_Prevalence_Per_100K          ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4.8172e+10  on 485  degrees of freedom
## Residual deviance: 2.0336e+10  on 481  degrees of freedom
## AIC: 2.0336e+10
## 
## Number of Fisher Scoring iterations: 8

Plot of Residual (Population)

residual_pop = data.frame(Fitted = fitted(stepwise_pop),
Residuals = resid(stepwise_pop), Treatment = Train$Population_total)

plot_pop <- ggplot(residual_pop, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

ggplotly(plot_pop)

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Analysis of Variance (Population)

anova(stepwise_pop, test="F")

## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Population_total
## 
## Terms added sequentially (first to last)
## 
## 
##                                  Df   Deviance Resid. Df Resid. Dev
## NULL                                                 485 4.8172e+10
## Labor_Pop_Percentage              1 1.3111e+09       484 4.6861e+10
## Tuberculosis_Prevalence_per_100K  1 2.4433e+10       483 2.2428e+10
## Malaria_Prevalence_per_100K       1 1.6777e+09       482 2.0751e+10
## HIV_Prevalence_Per_100K           1 4.1494e+08       481 2.0336e+10
##                                           F    Pr(>F)    
## NULL                                                     
## Labor_Pop_Percentage             1.3111e+09 < 2.2e-16 ***
## Tuberculosis_Prevalence_per_100K 2.4433e+10 < 2.2e-16 ***
## Malaria_Prevalence_per_100K      1.6777e+09 < 2.2e-16 ***
## HIV_Prevalence_Per_100K          4.1494e+08 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

DECISION (Population):

Reject Null Hypothesis since P-Values are all less than Tabulated value of 0.05.

Non Zero Confidence Interval

Meaning that there is a 95% probability that the confidence interval will contain the true population mean. From the table below, we can deduce that the our analysis is correct (as non of the value between the 2.5 and 97.5 is zero) for Population_total, Malaria and HIV as corfirmed by Analysis of Variance (Anova) result above.

confint(stepwise)

##                                          2.5 %        97.5 %
## (Intercept)                       3.779050e+00  3.814117e+00
## Population_total                 -2.572919e-10 -1.540813e-11
## Tuberculosis_Prevalence_per_100K  4.451145e-05  6.254320e-04
## Malaria_Prevalence_per_100K      -4.279592e-06 -2.537143e-07
## HIV_Prevalence_Per_100K          -1.061679e-06 -2.323674e-07

Prediction

The below below shows the predicted values and the actual values side-by-side.

pred <- predict(fit_labor, Test, type="response")
actual_pred <- data.frame(cbind(actual=Train$Labor_Pop_Percentage, predicted=pred))
kable(head(actual_pred))

actual	predicted
31	44.35727
39	44.33879
50	44.35285
27	43.54034
48	44.43838
45	45.53024

MAPE

The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics. Represented mathematically as,

\(M=\quad \frac { 100 }{ n } \sum _{ t=1 }^{ n }{ \left| \frac { { A }_{ t }\quad -\quad { F }_{ t } }{ { A }_{ t } } \right| } ,\\ Where\quad { A }_{ t }\quad =\quad Actual\quad Value\\ \quad \quad \quad \quad \quad \quad { F }_{ t }\quad =\quad Predicted\quad Value.\)

min_max_accuracy <- mean(apply(actual_pred, 1, min) / apply(actual_pred, 1, max))*100 
print(paste('The Percentage Accuracy Is: ', min_max_accuracy))

## [1] "The Percentage Accuracy Is:  86.9749925016897"

mape <- mean(abs((actual_pred$predicted - actual_pred$actual))/actual_pred$actual)*100 
print(paste('The MAPE Is: ', mape ))

## [1] "The MAPE Is:  15.3796521256326"

We can be rest assured that about 15 percent of our prediction might not be accurate. Although this might be a bit of concern to us as we wanted about 99percent prediction accuracy.

GENERAL CONCLUSION

Incident (Person At Risk): Shows that majority of the diseases is highly concentrated in southern part of africa.
Incident (Person Per Time): Gives more insight about the dieases that are considered non-chronic like Malaria and Tuberculosis.
Prevalence Rate: This was used to estimate the prevalence of all the diseases and a better result was obtained. It shows the widespread of the diseases beyond southern part of africa to some of Asia, South America and Carribean was not left out!

Since P-value is less than Tabulated/Table value (0.05), we will therefore reject Null Hypothesis Ho and conclude that both Labor Force and Population have effects on HIV/AIDS, Malaria and Tuberculosis. I.e an increase and decrease in the labor force and population would have an impact on the wide spread of disease in the countries under study.

RECOMMENDATION

The countries that are mostly affected needs to do more in terms of;

1). LABOR FORCE: Provide more employment opportunities to the citizenry. If an economy is viable, the citizen would want to live a healthy life style and would ultimately aid in decreasing the widespread of diseases.

2). Population: Maintain the population, as a sporadic increament in population without good stanadard of living, would jeopardized the ecomony and cause an increase in the spread of dideases.

REFERENCES

http://advancedrenaleducation.com/content/incidence-and-prevalence

https://www.cdc.gov/ophss/csels/dsepd/ss1978/lesson3/section2.html

https://plot.ly/r/

http://rmarkdown.rstudio.com/html_document_format.html#overview

http://databank.worldbank.org/data/reports.aspx?source=health-nutrition-and-population-statistics#advancedDownloadOptions

|IS 608 KNOWLEDGE AND VISUAL ANALYTICS - FINAL PROJECT | Data Analytics

Author:MUSA T GANIYU

OCT. 20th, 2017