DATA 608 01[46846] : Final Project

1 INTRODUCTION
2 DATA LOAD
- 2.1 World Health
- 2.2 World Longitudes and Latitudes
3 DATA PREPARATION
4 DATA EXPLORATION
5 DATA ANALYSIS
- 5.1 INCIDENCE PROPORTION and INCIDENCE RATES
- 5.2 PREVALENCE RATE
  - 5.2.1 A 3D Approach For All Disease Prevalence Comparison
6 REGRESSION ANALYSIS
- 6.1 Analysis of Labor Force
- 6.2 Analysis of Population
7 PREDICTION
- 7.1 MAPE
8 CONCLUSION
- 8.1 Recommendation

1 INTRODUCTION

Given the current pandemic COVID-19 situation, I was inclined to use diseases to explore the visualizations in the last decade. However, with plethora of COVID-19 data ambiguity and the ever-changing landscape of the spread of the disease, I wanted to avoid making any presumptive conclusions/predictions. Instead, I relied on data source over prolonged period for diseases that have lasted for a long period namely HIV/AIDS, Tuberculosis and Malaria.

1.1 DataSet

The dataset used in this project is a demography data collected from the World Health Nutrition and Population Statistics from 2000 to 2020.

The primary issue with such type of demographic data is the missing observation(s) (NAs) and needs data cleaning to be able to slice-and-dice the data for various visualizations

1.2 Objective

The purpose of this project is to:

Identify the relationship between labor force/Population and epidemic diseases like HIV/AIDS, Tuberculosis, Malaria. Better still, does geographical location has anything to do with all the widespread of these diseases?
Show the merit and demerit of visual analytics on data analysis, and how to improve it.
Make the visualized image(s) tells the story in an unambiguous way, understandable, even to a layman!
Apply the knowledge learned in class to real live situation.

1.3 Methodology

The project would lay more emphasis on the explanatory techniques. It will be used in making data presentation to the viewers in a more succinct way. Therefore, used the R programing language to explore and analysis the dataset.

Majorly used SQL from sqldf package (it was the best for this type of dataset) to extract the observations and to show their (countries using GOOGLEVIS) location and graphically using PLOTLY and GGPLOT2.
The dataset was not normally distributed, so used Generalized Linear Model (GLM) for regression analysis and Analysis of Variance (ANOVA) to obtain best model for predictions.

1.4 Tools/Packages To Be Used

Load these libraries and dataset and lets get to work!

Plotly
Knitr
Dplyr
Plyr
Reshape2
Ggplot2
Graphics
Ggthemes
GoogleVis
scatterplot3d
Shiny
Shinydashboard
Rshiny
Sqldf
gridExtra
treemap

2 DATA LOAD

The datasets are loaded into R to proceed with subsequent data analysis.

2.1 World Health

df <- read.csv("World Health Nutrition and Population Statistics_2000-2020.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
DT::datatable(df)

2.2 World Longitudes and Latitudes

This includes the various world Longitudes and Latitudes for all the country locations in the above dataset to able to plot them in a map for better visualizations.

lat_long <- read.csv("Countries-longitude_latitude.csv", header = TRUE, sep = ",")
colnames(lat_long) <- c("Country_Name", "Country_Code", "Latitude", "Longtitude")
DT::datatable(lat_long)

3 DATA PREPARATION

Checking for all NA’s in the datasets before cleaning.

na_count_before <-sapply(df, function(a) sum(is.na(a)))

library(knitr)
kable(na_count_before)

	x
Year_Code	0
Country_Name	0
Country_Code	0
Adults_15_living_HIV	1716
Adults_Children_0_14_15_living_HIV	1688
AIDS_estimated_deaths_UNAIDS	1785
Adults_children_0_14_15_newly_infected_HIV	1655
Adults_15_newly_infected_HIV	1700
Children_0_14_living_with_HIV	2365
Children_orphaned_by_HIV_AIDS	2365
Children_0_14_newly_infected_HIV	2365
Incidence_tuberculosis_per_100000	394
Labor_force_total	568
Mortality_traffic_injury_100K	3974
Population_female	623
Population_male	623
Population_total	224
Malaria_cases_reported	3450
Suicide_mortality_per_100K	3415
Tuberculosis_death_per_100K	394
Tuberculosis_case_detection	564
Tuberculosis_treatment_success_NewCases	982

Cleaning and merging the two datasets of World Health and World Longitudes and Latitudes.

options(warn = -1)
df2 <- merge(df, lat_long, by.x = "Country_Name", by.y = "Country_Name", all = TRUE)
DT::datatable(df2)

Checking for all NA’s in the datasets after cleaning.

na_count_after <-sapply(df2, function(b) sum(is.na(b)))

library(knitr)
kable(na_count_after)

	x
Country_Name	0
Year_Code	63
Country_Code.x	63
Adults_15_living_HIV	1779
Adults_Children_0_14_15_living_HIV	1751
AIDS_estimated_deaths_UNAIDS	1848
Adults_children_0_14_15_newly_infected_HIV	1718
Adults_15_newly_infected_HIV	1763
Children_0_14_living_with_HIV	2428
Children_orphaned_by_HIV_AIDS	2428
Children_0_14_newly_infected_HIV	2428
Incidence_tuberculosis_per_100000	457
Labor_force_total	631
Mortality_traffic_injury_100K	4037
Population_female	686
Population_male	686
Population_total	287
Malaria_cases_reported	3513
Suicide_mortality_per_100K	3478
Tuberculosis_death_per_100K	457
Tuberculosis_case_detection	627
Tuberculosis_treatment_success_NewCases	1045
Country_Code.y	721
Latitude	721
Longtitude	721

From the table above, we can see that both datasets contain a lot of NAs (This is due to data unavalability in some countries under observation).

Convert the columns to Numeric

Let’s convert some of the variable to numeric for better a analysis.

df2[, c(4:22)] <- sapply(df2[, c(4:22)], as.numeric)

glimpse(df2)

## Observations: 4,403
## Variables: 25
## $ Country_Name                               <chr> "Afghanistan", "Afghanis...
## $ Year_Code                                  <chr> "YR2009", "YR2012", "YR2...
## $ Country_Code.x                             <chr> "AFG", "AFG", "AFG", "AF...
## $ Adults_15_living_HIV                       <dbl> 3700, 4600, 4000, 1700, ...
## $ Adults_Children_0_14_15_living_HIV         <dbl> 4000, 4900, 4200, 1800, ...
## $ AIDS_estimated_deaths_UNAIDS               <dbl> 500, 500, 500, 200, 200,...
## $ Adults_children_0_14_15_newly_infected_HIV <dbl> 560, 620, 560, 500, 500,...
## $ Adults_15_newly_infected_HIV               <dbl> 500, 560, 500, 500, 500,...
## $ Children_0_14_living_with_HIV              <dbl> 500, 500, 500, 200, 200,...
## $ Children_orphaned_by_HIV_AIDS              <dbl> 2600, 3400, 2800, 1000, ...
## $ Children_0_14_newly_infected_HIV           <dbl> 100, 100, 100, 100, 100,...
## $ Incidence_tuberculosis_per_100000          <dbl> 189, 189, 189, 189, 189,...
## $ Labor_force_total                          <dbl> 9312008, 10534726, 95725...
## $ Mortality_traffic_injury_100K              <dbl> NA, NA, NA, NA, NA, NA, ...
## $ Population_female                          <dbl> 13850754, 15196503, 1424...
## $ Population_male                            <dbl> 14544059, 15964873, 1493...
## $ Population_total                           <dbl> 28394813, 31161376, 2918...
## $ Malaria_cases_reported                     <dbl> NA, 54840, 69397, NA, NA...
## $ Suicide_mortality_per_100K                 <dbl> NA, NA, 5.1, NA, NA, NA,...
## $ Tuberculosis_death_per_100K                <dbl> 44.00, 44.00, 43.00, 62....
## $ Tuberculosis_case_detection                <dbl> 49, 49, 51, 25, 31, 50, ...
## $ Tuberculosis_treatment_success_NewCases    <dbl> 86, 88, 86, 84, 86, 87, ...
## $ Country_Code.y                             <fct> AFG, AFG, AFG, AFG, AFG,...
## $ Latitude                                   <dbl> 33, 33, 33, 33, 33, 33, ...
## $ Longtitude                                 <dbl> 65, 65, 65, 65, 65, 65, ...

Merging dataset

Merging column longitude and Latitude together for a better coordinate to be used in our maps

df2$Lat_Long = paste(df2$Latitude, df2$Longtitude, sep=":")
DT::datatable(df2)

A graphical representation of the missing values

library(Amelia)
missmap(df2, legend = TRUE, col = c("wheat","darkred", col=c('yellow', 'darkgreen')), main ="Plot Showing The Missing Values Per Observation",
        y.cex = 0.8, x.cex = 0.8,csvar = NULL, tsvar =NULL, rank.order = TRUE)

Since it is evident that we have lots of missing observations in the dataset, we are going to employ a method that best deal with datasets of this nature,

MICE (Multivariate Imputation by Chained Equations) package helps in inspecting, imputing, diagonise, analyze, pool the result, and generate simulated incomplete data

library(mice)
mice_missing_val <- mice(df2, m=1, method='cart', printFlag=FALSE, where = is.na(df2))
mice_missing_val_comp <- complete(mice_missing_val, action='long', include=FALSE)

df2_cleaned <- na.omit(mice_missing_val_comp)
df2_cleaned <- df2_cleaned[, -c(1:2)]
DT::datatable(df2_cleaned)

After the generated values by mice, we still have some missing observations. We therefore going to omit those observation that still have missing value(s) to avoid a biased analysis.

na_count_after_cleaning <- sapply(df2_cleaned, function(c) sum(is.na(c)))
kable(na_count_after_cleaning)

	x
Country_Name	0
Year_Code	0
Country_Code.x	0
Adults_15_living_HIV	0
Adults_Children_0_14_15_living_HIV	0
AIDS_estimated_deaths_UNAIDS	0
Adults_children_0_14_15_newly_infected_HIV	0
Adults_15_newly_infected_HIV	0
Children_0_14_living_with_HIV	0
Children_orphaned_by_HIV_AIDS	0
Children_0_14_newly_infected_HIV	0
Incidence_tuberculosis_per_100000	0
Labor_force_total	0
Mortality_traffic_injury_100K	0
Population_female	0
Population_male	0
Population_total	0
Malaria_cases_reported	0
Suicide_mortality_per_100K	0
Tuberculosis_death_per_100K	0
Tuberculosis_case_detection	0
Tuberculosis_treatment_success_NewCases	0
Country_Code.y	0
Latitude	0
Longtitude	0
Lat_Long	0

We now have data with no missing observations.

Comparative Analysis of DataTypes and Missing Values Before and After

library(gridExtra)
library(visdat)

p_before_dt <- vis_dat(df)
p_before_miss <- vis_miss(df)
p_after_dt <- vis_dat(df2_cleaned)
p_after_miss <- vis_miss(df2_cleaned)

p_before_dt

p_before_miss

p_after_dt

p_after_miss

#grid.arrange(p_before_dt,p_before_miss, p_after_dt,p_after_miss, ncol = 2, widths = c(2,2), heights = c(2,2), top = 'Missing Values & Data Type Check Before and After Cleaning')

4 DATA EXPLORATION

In the data exploration, obtained an inferential statistic and checked for any outliers.

Summary Statistics

DT::datatable(summary(df2_cleaned[, 4:22]))

Percentage of Children Orphaned By HIV/AIDS

Use sql to subset(query) columns so as to diffentiate between year 2000 and 2020 where the number of children orphaned by HIV/AIDS.

library(sqldf)
All_Cen  <- sqldf("SELECT Country_Name, Year_Code, (Children_orphaned_by_HIV_AIDS/Population_total)*100 as 'Percentage_Orphaned_byHIV' FROM df2_cleaned where Percentage_Orphaned_byHIV >= 2 ORDER BY Percentage_Orphaned_byHIV DESC LIMIT 50")

library(plotly)
# Boxplot
g <- ggplot(All_Cen, aes(Year_Code, Percentage_Orphaned_byHIV)) + geom_boxplot(varwidth=T, fill="plum") + 
    labs(title="Box plot of Children Orphaned By HIV/AIDS per Year (%)", 
         subtitle="Percentage Orphaned by HIV grouped by Year",
         caption="Source: The World Bank Data Bank",
         x="Year",
         y="% of Children Orphaned By HIV/AIDS") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

plotly::ggplotly(g)

Scatter Plot of HIV/Tuberculosis/Malaria Comparison

scatterplot3d(x=(df2_cleaned$Children_0_14_living_with_HIV+df2_cleaned$Adults_15_living_HIV)/1000, y=df2_cleaned$Tuberculosis_case_detection, z=df2_cleaned$Malaria_cases_reported/1000, pch = 20, highlight.3d = TRUE, type = "h", main = "3D ScatterPlots", xlab='HIV in 1000s', ylab='TB', zlab='Malaria in 1000s')

Disease Distribution by Top 50 Countries

library(dplyr)
disease = df2_cleaned %>%
  filter(Country_Name != 'World') %>%
  group_by(Country_Name) %>%
  summarise(HivTbMalaria = max(Children_0_14_living_with_HIV+Adults_15_living_HIV) + max(Tuberculosis_case_detection) + max(Malaria_cases_reported)) %>%
  arrange(desc(HivTbMalaria)) %>%
  ungroup() %>%
  mutate(Country_Name = reorder(Country_Name,HivTbMalaria)) %>%
  head(50)
#DT::datatable(disease)

library(treemap)
treemap(disease, index="Country_Name", vSize = "HivTbMalaria", title="Top 50 countries with Disease Count", palette = "RdBu",fontsize.title = 14)

library(ggplot2)
disease %>%
  head(20) %>%
  mutate( fill = as.factor(Country_Name)) %>%
      ggplot(aes(x = Country_Name, y= HivTbMalaria, fill = Country_Name)) +
      scale_y_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
      ) +
  geom_boxplot() +
  labs(x= 'Country Name', y = 'Disease Count', title = 'Distribution of Diseases by Top 20 Countries') +
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

5 DATA ANALYSIS

library(plotly)
ggplotly(ggplot(All_Cen, aes(x=Percentage_Orphaned_byHIV, y=reorder(Country_Name, +Percentage_Orphaned_byHIV), fill=Year_Code)) +
geom_point(colour="purple", size=2, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code))) + labs(title="Chart of Children Orphaned By HIV/AIDS BY Countries (%)"))

5.1 INCIDENCE PROPORTION and INCIDENCE RATES

Incidence: A measure of the occurrence of a new disease, in a defined population at risk for the disease, within a specified time period.

while

Incidence proportion (also known as cumulative incidence) is the number of new cases within a specified time period divided by the size of the population initially at risk
Incidence rate (also known as person-time rate) is a measure of incidence that incorporates time directly into the denominator. As per CDC, “a person-time rate is generally calculated from a long-term cohort follow-up study, wherein enrollees are followed over time and the occurrence of new cases of disease is documented.”
Prevalence is the proportion of the total number of cases to the total population and is more a measure of the burden of the disease on society with no regard to time at risk or when subjects may have been exposed to a possible risk factor. Prevalence can also be measured with respect to a specific subgroup of a population

Mathematically as,

\[Incidence Rate = \frac{Number Of New Cases Of Disease Or Injury During Specified Period}{Time Each Person Was Observed,Totaled For All Persons}\]

5.1.1 Incidence Person At Risk

HIV/AIDS Incidence Calculation For Person At Risk

library(sqldf)
Hiv_Incident_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Adults_children_0_14_15_newly_infected_HIV/Population_total)*100000 as 'HIV_Incident_Risk_Per_100K' FROM df2_cleaned where HIV_Incident_Risk_Per_100K >= 200 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 30" )
DT::datatable(Hiv_Incident_Risk)

ggplotly(ggplot(Hiv_Incident_Risk, aes(x=HIV_Incident_Risk_Per_100K, y=reorder(Country_Name,+HIV_Incident_Risk_Per_100K), fill=Year_Code)) +
  geom_point(colour="red", size=2, alpha=.8) + ylab("Countries") + xlab("HIV Incident Risk") +
  scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code))) + labs(title="Chart of HIV/AIDS Incidence (Person At Risk) BY Countries"))

library(googleVis)
Hiv_Inc_Map <- googleVis::gvisGeoChart(Hiv_Incident_Risk, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "HIV_Incident_Risk_Per_100K", colorvar = "Population_total",
                   options=list(displayMode="Markers", 
                   colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                   backgroundColor="lightblue"), chartid="Hiv_Incidence_Person_At_Risk")
plot(Hiv_Inc_Map)

#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Hiv_Incidence_Person_At_Risk.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Hiv_Incidence_Person_At_Risk.html")

Hiv_Incidence_Person_At_Risk

Data: Hiv_Incident_Risk • Chart ID: Hiv_Incidence_Person_At_Risk • googleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy

HIV_Incidence_Person_At_Risk_WorldMap

Tuberculosis Incidence Calculation For Person At Risk

Tuberculosis_Inc_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Tuberculosis_case_detection/Population_total)* 100000 as 'Tuberculocisis_Incident_Risk_Per_100K' FROM df2_cleaned where Tuberculocisis_Incident_Risk_Per_100K >= 1 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 50" )
DT::datatable(Tuberculosis_Inc_Risk)

ggplotly(ggplot(Tuberculosis_Inc_Risk, aes(x=Tuberculocisis_Incident_Risk_Per_100K, y=reorder(Country_Name, Tuberculocisis_Incident_Risk_Per_100K), fill=Year_Code)) +
  geom_point(colour="purple", size=2, alpha=.8) +
  scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code)))+ labs(title="Chart of Tuberculosis Incidence (Person At Risk) BY Countries"))

library(googleVis)
Tub_Inc_Map <- googleVis::gvisGeoChart(Tuberculosis_Inc_Risk, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "Tuberculocisis_Incident_Risk_Per_100K", colorvar = "Population_total",
                   options=list(displayMode="Markers", 
                   colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                   backgroundColor="lightblue"), chartid="Tuberculosis_Incidence_Person_At_Risk")
plot(Tub_Inc_Map)

#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Hiv_Incidence_Person_At_Risk.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Tuberculosis_Incidence_Person_At_Risk.html")

Tuberculosis_Incidence_Person_At_Risk

Data: Tuberculosis_Inc_Risk • Chart ID: Tuberculosis_Incidence_Person_At_Risk • googleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy

Tuberculosis_Incidence_Person_At_Risk_WorldMap

Malaria Incidence Calculation For Person At Risk

Malaria_Inc_Risk <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, (Malaria_cases_reported/Population_total)* 100000 as 'Malaria_Incident_Risk_Per_100K' FROM df2_cleaned where Malaria_Incident_Risk_Per_100K >= 1 and Lat_Long != 'NA:NA' ORDER BY Year_Code DESC LIMIT 50" )
DT::datatable(Malaria_Inc_Risk)

ggplotly(ggplot(Malaria_Inc_Risk, aes(x=Malaria_Incident_Risk_Per_100K, y=reorder(Country_Name, Malaria_Incident_Risk_Per_100K), fill=Year_Code)) +
  geom_point(colour="brown", size=2, alpha=.8) + 
  scale_fill_brewer(palette="Blues", breaks=rev(levels(All_Cen$Year_Code)))+ labs(title="Chart of Malaria Incidence (Person At Risk) BY Countries"))

library(googleVis)
Malaria_Inc_Map <- googleVis::gvisGeoChart(Malaria_Inc_Risk, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "Malaria_Incident_Risk_Per_100K", colorvar = "Population_total",
                   options=list(displayMode="Markers", 
                   colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                   backgroundColor="lightblue"), chartid="Malaria_Incidence_Person_At_Risk")
plot(Malaria_Inc_Map)

#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Malaria_Incidence_Person_At_Risk.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Malaria_Incidence_Person_At_Risk.html")

Malaria_Incidence_Person_At_Risk

Data: Malaria_Inc_Risk • Chart ID: Malaria_Incidence_Person_At_Risk • googleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy

Malaria_Incidence_Person_At_Risk_WorldMap

5.1.2 HIV/AID, Malaria & Tuberculosis Incident Risk Comparison

Person At Risk Comparison

comparison <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, Labor_force_total, (Malaria_cases_reported/Population_total)* 1000 as 'Malaria_Incident_Risk_Per_1K', (Tuberculosis_case_detection/Population_total)* 100000 as 'Tuberculocisis_Incident_Risk_Per_100K',  (Adults_children_0_14_15_newly_infected_HIV/Population_total)*100000 as 'HIV_Incident_Risk_Per_100K' FROM df2_cleaned where Malaria_Incident_Risk_Per_1K >= 1 and Lat_Long != 'NA:NA' or Tuberculocisis_Incident_Risk_Per_100K >= 1 or HIV_Incident_Risk_Per_100K >= 200 ORDER BY Country_Name ASC")
DT::datatable(comparison)

comparison <- comparison[order(comparison$Country_Name, comparison$Year_Code),]
comparison$size <- comparison$Population_total
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')

pw <- plot_ly(comparison, x = ~Malaria_Incident_Risk_Per_1K, y = ~Tuberculocisis_Incident_Risk_Per_100K, z = ~HIV_Incident_Risk_Per_100K, color = ~Year_Code, size = ~size, colors = colors,
             marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 120),
             text = ~paste('Country:', Country_Name, '<br>Labor Force:', Labor_force_total, '<br>Population Total:', Population_total,
                           '<br>Lat & Long.:', Lat_Long)) %>%
  layout(title = 'Comparison of HIV/AIDS, Malaria & Tuberculosis Incidence for People At Risk',
         scene = list(xaxis = list(title = 'Malaria Pop At Risk Per 1K',
                      gridcolor = 'rgb(255, 255, 255)',
                      type = 'log',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwidth = 2),
               yaxis = list(title = 'Tuberculosis Pop At Risk Per 100K',
                      gridcolor = 'rgb(255, 255, 255)',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwith = 2),
               zaxis = list(title = 'HIV/AIDS Pop At Risk Per 100K',
                            gridcolor = 'rgb(255, 255, 255)',
                            type = 'log',
                            zerolinewidth = 1,
                            ticklen = 5,
                            gridwith = 2)),
         paper_bgcolor = 'rgb(243, 243, 243)',
         plot_bgcolor = 'rgb(243, 243, 243)')

pw

5.1.3 Incidence Person Per Time

HIV/AIDS Person Per Time

HIV_Inc_Person_Time <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code,((New_Hiv_Cases)/(Population_total-(AIDS_Deaths+Tuberculosis_Deaths+Mortality_Deaths+Traffic_Deaths)))*1000000 as 'HIV_Inc_Person_Time100K' FROM (SELECT Population_total,Lat_Long,Country_Name, Year_Code,Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths',round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY New_Hiv_Cases, Year_Code DESC LIMIT 50) where Lat_Long != 'NA:NA'")
DT::datatable(HIV_Inc_Person_Time)

HIV_Inc_Person_Time_Map <- gvisGeoChart(HIV_Inc_Person_Time, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "HIV_Inc_Person_Time100K", colorvar = "Population_total",
                    options=list(displayMode="Markers", 
                    colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                    backgroundColor="lightblue"), chartid="Hiv_Incidence_Person_Per_Time")
plot(HIV_Inc_Person_Time_Map)

#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Hiv_Incidence_Person_Per_Time.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Hiv_Incidence_Person_Per_Time.html")

Hiv_Incidence_Person_Per_Time

Data: HIV_Inc_Person_Time • Chart ID: Hiv_Incidence_Person_Per_Time • googleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy

HIV_Incidence_Person_Per_Time_WorldMap

Tuberculosis Person Per Time

Tuber_Inc_Person_Time <- sqldf("SELECT Population_total,Lat_Long,Country_Name,Year_Code,(Tuberculosis_case_detection/(Population_total-(Tuber_trtd  + AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000 as 'Tuberculosis_Inc_Person_Time100K' FROM (SELECT Population_total,Tuberculosis_case_detection, Lat_Long,Country_Name, Year_Code, Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Tuberculosis_case_detection,Year_Code DESC LIMIT 50) where Lat_Long != 'NA:NA'")
DT::datatable(Tuber_Inc_Person_Time)

Tuber_Inc_Person_Time_Map <- gvisGeoChart(Tuber_Inc_Person_Time, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "Tuberculosis_Inc_Person_Time100K", colorvar = "Population_total", options=list(displayMode="Markers", colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}", backgroundColor="lightblue"), chartid="Tuberculosis_Incidence_Person_Per_Time")
plot(Tuber_Inc_Person_Time_Map)

#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Tuberculosis_Incidence_Person_Per_Time.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Tuberculosis_Incidence_Person_Per_Time.html")

Tuberculosis_Incidence_Person_Per_Time

Data: Tuber_Inc_Person_Time • Chart ID: Tuberculosis_Incidence_Person_Per_Time • googleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy

Tuberculosis_Incidence_Person_Per_Time_WorldMap

Malaria Person Per Time

Malaria_Incidence_Person_Per_Time

Data: Malaria_Inc_Person_Time • Chart ID: Malaria_Incidence_Person_Per_Time • googleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy

Malaria_Incidence_Person_Per_Time_WorldMap

5.2 PREVALENCE RATE

Prevalence : Actual number of cases alive, with the disease either during a period of time (period prevalence) or at a particular date in time (point prevalence). Period prevalence provides the better measure of the disease load since it includes all new cases and all deaths between two dates, whereas point prevalence only counts those alive on a particular date.

Prevalence is also most meaningfully reported as the number of cases as a fraction of the total population at risk and can be further categorized according to different subsets of the population.

HIV Prevalence

HIV_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long, Country_Name, Year_Code, round(((Old_Hiv_Cases + New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'HIV_Inc_Prevalence_100K'  FROM (SELECT Population_total,Adults_Children_0_14_15_living_HIV as 'Old_Hiv_Cases', Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
DT::datatable(HIV_Prevalence_Rate)

HIV_Prevalence_Rate_Map <- gvisGeoChart(HIV_Prevalence_Rate, locationvar ="Lat_Long", hovervar ="Country_Name", sizevar = "HIV_Inc_Prevalence_100K", colorvar = "Population_total", options=list(displayMode="Markers", colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}", backgroundColor="lightblue"),  chartid="HIV_Prevalence_Rate_Map")
plot(HIV_Prevalence_Rate_Map)

#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/HIV_Prevalence_Rate_Map.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/HIV_Prevalence_Rate_Map.html")

HIV_Prevalence_Rate_Map

Data: HIV_Prevalence_Rate • Chart ID: HIV_Prevalence_Rate_Map • googleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy

HIV_Prevalence_Rate_WorldMap

Since, Malaria and Tuberculosis are most likely not be carried over to the proceding year, so we will not include them (person living with either malaria or tuberculosis) in the numerators.

Tuberculosis Prevalence

Tuber_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round((Tuberculosis_case_detection/(Population_total -(Tuber_trtd    + AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Tuberculosis_Prevalence_per_100K'  FROM (SELECT Population_total,Tuberculosis_case_detection, Lat_Long,Country_Name, Year_Code, Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
DT::datatable(Tuber_Prevalence_Rate)

Tuber_Prevalence_Rate_Map <- gvisGeoChart(Tuber_Prevalence_Rate, locationvar ="Lat_Long", hovervar ="Country_Name", sizevar = "Tuberculosis_Prevalence_per_100K", colorvar = "Population_total", options=list(displayMode="Markers", colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}", backgroundColor="lightblue"), chartid="Tuber_Prevalence_Rate_Map")
plot(Tuber_Prevalence_Rate_Map)

#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Tuber_Prevalence_Rate_Map.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Tuber_Prevalence_Rate_Map.html")

Tuber_Prevalence_Rate_Map

Data: Tuber_Prevalence_Rate • Chart ID: Tuber_Prevalence_Rate_Map • googleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy

Tuberculosis_Prevalence_Rate_WorldMap

Malaria Prevalence

Malaria_Prevalence_Rate <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code, round((Malaria_cases_reported/(Population_total -(AIDS_Deaths + Tuberculocisis_Deaths_Per_Year + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Malaria_Prevalence_per_100K'  FROM (SELECT Population_total, Lat_Long,Country_Name, Year_Code, AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculocisis_Deaths_Per_Year',Malaria_cases_reported, round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC LIMIT 50) where Lat_Long != 'NA:NA'")
DT::datatable(Malaria_Prevalence_Rate)

Malaria_Prevalence_Rate_Map <- gvisGeoChart(Malaria_Prevalence_Rate, locationvar ="Lat_Long", hovervar ="Country_Name", sizevar = "Malaria_Prevalence_per_100K", colorvar = "Population_total", options=list(displayMode="Markers", colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}", backgroundColor="lightblue"), chartid="Malaria_Prevalence_Rate_Map")
plot(Malaria_Prevalence_Rate_Map)

#shiny::includeHTML("http://127.0.0.1:27278/custom/googleVis/Malaria_Prevalence_Rate_Map.html")
htmltools::includeHTML("http://127.0.0.1:22461/custom/googleVis/Malaria_Prevalence_Rate_Map.html")

Malaria_Prevalence_Rate_Map

Data: Malaria_Prevalence_Rate • Chart ID: Malaria_Prevalence_Rate_Map • googleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of Use • Documentation and Data Policy

Malaria_Prevalence_Rate_WorldMap

All Disease Prevalence Comparison

Com_HIV_Tub_Mal_Prevalence <- sqldf("SELECT Population_total, Lat_Long,Country_Name, Year_Code,round((Labor_force_total/Population_total)*100) as 'Labor_Pop_Percentage',round((Tuberculosis_case_detection/(Population_total -(Tuber_trtd  + AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'Tuberculosis_Prevalence_per_100K',round((Malaria_cases_reported/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*100000) as 'Malaria_Prevalence_per_100K',round(((Old_Hiv_Cases + New_Hiv_Cases)/(Population_total -(AIDS_Deaths + Tuberculosis_Deaths + Mortality_Deaths + Traffic_Deaths)))*1000000) as 'HIV_Prevalence_Per_100K'  FROM (SELECT Labor_force_total,Population_total,Adults_Children_0_14_15_living_HIV as 'Old_Hiv_Cases', Lat_Long,Country_Name, Year_Code, Adults_children_0_14_15_newly_infected_HIV as 'New_Hiv_Cases',Tuberculosis_case_detection,Malaria_cases_reported,Tuberculosis_treatment_success_NewCases as 'Tuber_trtd',AIDS_estimated_deaths_UNAIDS as 'AIDS_Deaths' ,round((Tuberculosis_death_per_100K/1000000)*Population_total) as 'Tuberculosis_Deaths', round((Suicide_mortality_per_100K/1000000) * Population_total) as 'Mortality_Deaths', round((Mortality_traffic_injury_100K/1000000) * Population_total) as 'Traffic_Deaths' FROM df2_cleaned ORDER BY Country_Name ASC) where Lat_Long != 'NA:NA'")

DT::datatable(Com_HIV_Tub_Mal_Prevalence)

5.2.1 A 3D Approach For All Disease Prevalence Comparison

comp <- Com_HIV_Tub_Mal_Prevalence[order(Com_HIV_Tub_Mal_Prevalence$Country_Name, Com_HIV_Tub_Mal_Prevalence$Year_Code),]
comp$size <- Com_HIV_Tub_Mal_Prevalence$Population_total
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')

pw2 <- plot_ly(comp, x = ~Malaria_Prevalence_per_100K, y = ~Tuberculosis_Prevalence_per_100K, z = ~HIV_Prevalence_Per_100K, color = ~Year_Code, size = ~size, colors = colors,
             marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 120),
             text = ~paste('Country:', Country_Name, '<br>Population Total:', Population_total,
                           '<br>Labor Force/Pop %:', Labor_Pop_Percentage)) %>%
  layout(title = 'Comparison of HIV/AIDS, Malaria & Tuberculosis Incidence for People At Risk',
         scene = list(xaxis = list(title = 'Malaria Pop At Risk Per 1K',
                      gridcolor = 'rgb(255, 255, 255)',
                      type = 'log',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwidth = 2),
               yaxis = list(title = 'Tuberculosis Pop At Risk Per 100K',
                      gridcolor = 'rgb(255, 255, 255)',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwith = 2),
               zaxis = list(title = 'HIV/AIDS Pop At Risk Per 100K',
                            gridcolor = 'rgb(255, 255, 255)',
                            type = 'log',
                            zerolinewidth = 1,
                            ticklen = 5,
                            gridwith = 2)),
         paper_bgcolor = 'rgb(243, 243, 243)',
         plot_bgcolor = 'rgb(243, 243, 243)')

pw2

6 REGRESSION ANALYSIS

Check for normality

qqnorm(Com_HIV_Tub_Mal_Prevalence$Malaria_Prevalence_per_100K)
qqline(Com_HIV_Tub_Mal_Prevalence$Malaria_Prevalence_per_100K, col = 2)

From the above normality check, we can see that the variable under observation is not normally distributed. We will therefore use Generalized Linear Model (conventional linear regression models) to; select best model, estimate parameters and interpretations. The generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models etc.

i.e The dependent variable \(Y_i\) does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal etc.

For Linear Regression, we have:

\[Y_i= \beta_0 + \beta_1x_i + \epsilon_i\] where, \[Y_i = Dependent Variable\] \[\beta_0 = Intercept\] \[\beta_1 = Parameter To Be estimated (slope)\] \[x_i = Independent Variable\] \[\epsilon_i = Error Term\] \[i = 1,2,3,4,......,n\]

While Multiple linear regression, we have (in matrix notation):

\[\beta =(X^TX)^{-1}X^TY\]

Obtaining Training and Test Dataset

# Create Training and Test data -
set.seed(12345)  # setting seed to reproduce results of random sampling
Com_HIV_Tub_Mal_Prevalence <- Com_HIV_Tub_Mal_Prevalence[, c(1,5,6,7,8)]
traindata <- sample(1:nrow(Com_HIV_Tub_Mal_Prevalence), 0.8*nrow(Com_HIV_Tub_Mal_Prevalence))  # row indices for training data
Train <- Com_HIV_Tub_Mal_Prevalence[traindata, ]  # model training data
Test  <- Com_HIV_Tub_Mal_Prevalence[-traindata, ]   # test data

6.1 Analysis of Labor Force

Null Hypothesis: Labor Force has no effect on HIV, Tuberculosis and Malaria.

\(H_o: \mu_1 = \mu_2 = \mu_3...= \mu_n\)

Alternative Hypothesis: Labor Force has effect on HIV, Tuberculosis and Malaria.

\(H_a \neq Ho: \mu_1 \neq \mu_2 \neq \mu_3...\neq \mu_n\)

Rejection : Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

Using both forward and backward model selection to select the best model for our analysis.

fit_labor <-  glm(Labor_Pop_Percentage ~., family=poisson(link='log'),data=Train)
summary(stepwise <- step(fit_labor, direction = "both"))

## Start:  AIC=12485.84
## Labor_Pop_Percentage ~ Population_total + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
## 
##                                    Df Deviance   AIC
## - Population_total                  1   2584.2 12484
## <none>                                  2584.1 12486
## - HIV_Prevalence_Per_100K           1   2591.0 12491
## - Malaria_Prevalence_per_100K       1   2604.2 12504
## - Tuberculosis_Prevalence_per_100K  1   2622.8 12522
## 
## Step:  AIC=12483.87
## Labor_Pop_Percentage ~ Tuberculosis_Prevalence_per_100K + Malaria_Prevalence_per_100K + 
##     HIV_Prevalence_Per_100K
## 
##                                    Df Deviance   AIC
## <none>                                  2584.2 12484
## + Population_total                  1   2584.1 12486
## - HIV_Prevalence_Per_100K           1   2591.2 12489
## - Malaria_Prevalence_per_100K       1   2604.4 12502
## - Tuberculosis_Prevalence_per_100K  1   2624.9 12523

## 
## Call:
## glm(formula = Labor_Pop_Percentage ~ Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"), 
##     data = Train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3191  -0.7588   0.1204   0.8847   3.3895  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       3.758e+00  4.419e-03 850.226  < 2e-16 ***
## Tuberculosis_Prevalence_per_100K  4.301e-04  6.649e-05   6.470 9.83e-11 ***
## Malaria_Prevalence_per_100K      -1.522e-06  3.477e-07  -4.378 1.20e-05 ***
## HIV_Prevalence_Per_100K          -3.282e-07  1.243e-07  -2.640  0.00829 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2653.6  on 1771  degrees of freedom
## Residual deviance: 2584.2  on 1768  degrees of freedom
## AIC: 12484
## 
## Number of Fisher Scoring iterations: 4

Plot of Residuals

residual_labor = data.frame(Fitted = fitted(stepwise),
Residuals = resid(stepwise), Treatment = Train$Labor_Pop_Percentage)

plot_res <- ggplot(residual_labor, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

ggplotly(plot_res)

Analysis of Variance (ANOVA of Labor Force)

anova(stepwise, test="F")

Decision

Reject Null Hypothesis (\(H_o\)) since P-Values are all less than Tabulated value of 0.05.

6.2 Analysis of Population

Null Hypothesis: Population has no effect on HIV, Tuberculosis and Malaria.

\(H_o: \mu_1 = \mu_2 = \mu_3...= \mu_n\)

Alternative Hythesis: Population has effect on HIV, Tuberculosis and Malaria.

\(H_a \neq Ho: \mu_1 \neq \mu_2 \neq \mu_3...\neq \mu_n\)

Rejection : Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

Using both forward and backward model selection to select the best model for our analysis.

fit_Population <-  glm(Population_total ~., family=poisson(link='log'),data=Train)
summary(stepwise_pop <- step(fit_Population , direction = "both"))

## Start:  AIC=21941748579
## Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
## 
##                                    Df   Deviance        AIC
## <none>                                2.1942e+10 2.1942e+10
## - Labor_Pop_Percentage              1 2.1980e+10 2.1981e+10
## - HIV_Prevalence_Per_100K           1 2.2276e+10 2.2276e+10
## - Malaria_Prevalence_per_100K       1 2.2726e+10 2.2726e+10
## - Tuberculosis_Prevalence_per_100K  1 6.5965e+10 6.5965e+10

## 
## Call:
## glm(formula = Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"), 
##     data = Train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7103.3  -2151.0   -318.5   2490.6  16399.9  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       1.802e+01  2.818e-05  639641   <2e-16 ***
## Labor_Pop_Percentage              3.945e-03  6.346e-07    6216   <2e-16 ***
## Tuberculosis_Prevalence_per_100K -1.995e-01  1.567e-06 -127356   <2e-16 ***
## Malaria_Prevalence_per_100K      -2.173e-05  8.688e-10  -25010   <2e-16 ***
## HIV_Prevalence_Per_100K          -4.397e-06  2.570e-10  -17105   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6.7926e+10  on 1771  degrees of freedom
## Residual deviance: 2.1942e+10  on 1767  degrees of freedom
## AIC: 2.1942e+10
## 
## Number of Fisher Scoring iterations: 7

Plot of Residuals

residual_pop = data.frame(Fitted = fitted(stepwise_pop),
Residuals = resid(stepwise_pop), Treatment = Train$Population_total)

plot_pop <- ggplot(residual_pop, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

ggplotly(plot_pop)

Analysis of Variance (ANOVA of Population)

anova(stepwise_pop, test="F")

Decision

Reject Null Hypothesis (\(H_o\)) since P-Values are all less than Tabulated value of 0.05.

Non Zero Confidence Interval

Meaning that there is a 95% probability that the confidence interval will contain the true population mean.

confint(stepwise)

##                                          2.5 %        97.5 %
## (Intercept)                       3.748894e+00  3.766218e+00
## Tuberculosis_Prevalence_per_100K  2.992630e-04  5.598876e-04
## Malaria_Prevalence_per_100K      -2.211387e-06 -8.485368e-07
## HIV_Prevalence_Per_100K          -5.728867e-07 -8.563474e-08

From the table above, we can deduce that the our analysis is correct (as non of the value between the 2.5 and 97.5 is zero) for Population_total, Malaria and HIV as corfirmed by Analysis of Variance (Anova) result above.

7 PREDICTION

The below below shows the predicted values and the actual values side-by-side.

pred <- predict(fit_labor, Test, type="response")
actual_pred <- data.frame(cbind(actual=Train$Labor_Pop_Percentage, predicted=pred))
kable(head(actual_pred))

actual	predicted
41	42.86778
41	42.87670
45	42.87575
45	42.87028
47	42.89949
35	42.87475

7.1 MAPE

The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics. Represented mathematically as,

\[M=\frac{100}{n}\sum _{t=1}^{n} |\frac{A_t - F_t}{A_t}|\], Where, \[A_t=Actual Value\] \[F_t=Predicted Value\]

min_max_accuracy <- mean(apply(actual_pred, 1, min) / apply(actual_pred, 1, max))*100 
print(paste('The Percentage Accuracy Is: ', min_max_accuracy))

## [1] "The Percentage Accuracy Is:  86.2957407776888"

mape <- mean(abs((actual_pred$predicted - actual_pred$actual))/actual_pred$actual)*100 
print(paste('The MAPE Is: ', mape ))

## [1] "The MAPE Is:  16.4859681592503"

We can be rest assured that about 16 % of our prediction might not be accurate.

8 CONCLUSION

Using data visualization in the field of epidemiology to depict succinctly the effect of labor force and population on the widespread of diseases. As observed that using incident risk alone may not be the best option to make recommendation and conclusion on epidemics, but rather use prevalence to make accurate analysis.

Incident (Person At Risk): Shows that majority of the diseases is highly concentrated in southern part of africa.
Incident (Person Per Time): Gives more insight about the dieases that are considered non-chronic like Malaria and Tuberculosis.
Prevalence Rate: This was used to estimate the prevalence of all the diseases and a better result was obtained. It shows the widespread of the diseases beyond southern part of africa to some of Asia, South America and Carribean was not left out!

Since P-value is less than Tabulated/Table value (0.05), we will therefore reject Null Hypothesis Ho and conclude that both Labor Force and Population have effects on HIV/AIDS, Malaria and Tuberculosis. I.e an increase and decrease in the labor force and population would have an impact on the wide spread of disease in the countries under study.

8.1 Recommendation

The countries that are mostly affected needs to do more in terms of;

Labor Force: Provide more employment opportunities to the citizens. If an economy is viable, the citizen would want to live a healthy life style and would ultimately aid in decreasing the widespread of diseases.

Population: Maintain the population, as a sporadic increment in population without good standard of living, would jeopardize the economy and cause an increase in the spread of diseases.

DATA 608 01[46846] : Final Project

DATA 608 - Final Project - Knowledge and Visual Analytics

DATA 608 01[46846] : Final Project

1 INTRODUCTION

1.1 DataSet

1.2 Objective

1.3 Methodology

1.4 Tools/Packages To Be Used

2 DATA LOAD

2.1 World Health

2.2 World Longitudes and Latitudes

3 DATA PREPARATION

4 DATA EXPLORATION

5 DATA ANALYSIS

5.1 INCIDENCE PROPORTION and INCIDENCE RATES

5.1.1 Incidence Person At Risk

5.1.2 HIV/AID, Malaria & Tuberculosis Incident Risk Comparison

5.1.3 Incidence Person Per Time

5.2 PREVALENCE RATE

5.2.1 A 3D Approach For All Disease Prevalence Comparison

6 REGRESSION ANALYSIS

6.1 Analysis of Labor Force

6.2 Analysis of Population

7 PREDICTION

7.1 MAPE

8 CONCLUSION

8.1 Recommendation