DATA 608 01[46846] : Final Project

1 INTRODUCTION

Given the current pandemic COVID-19 situation, I was inclined to use diseases to explore the visualizations in the last decade. However, with plethora of COVID-19 data ambiguity and the ever-changing landscape of the spread of the disease, I wanted to avoid making any presumptive conclusions/predictions. Instead, I relied on data source over prolonged period for diseases that have lasted for a long period namely HIV/AIDS, Tuberculosis and Malaria.

1.1 DataSet

The dataset used in this project is a demography data collected from the World Health Nutrition and Population Statistics from 2000 to 2020.

The primary issue with such type of demographic data is the missing observation(s) (NAs) and needs data cleaning to be able to slice-and-dice the data for various visualizations

1.2 Objective

The purpose of this project is to:

  • Identify the relationship between labor force/Population and epidemic diseases like HIV/AIDS, Tuberculosis, Malaria. Better still, does geographical location has anything to do with all the widespread of these diseases?
  • Show the merit and demerit of visual analytics on data analysis, and how to improve it.
  • Make the visualized image(s) tells the story in an unambiguous way, understandable, even to a layman!
  • Apply the knowledge learned in class to real live situation.

1.3 Methodology

The project would lay more emphasis on the explanatory techniques. It will be used in making data presentation to the viewers in a more succinct way. Therefore, used the R programing language to explore and analysis the dataset.

  • Majorly used SQL from sqldf package (it was the best for this type of dataset) to extract the observations and to show their (countries using GOOGLEVIS) location and graphically using PLOTLY and GGPLOT2.
  • The dataset was not normally distributed, so used Generalized Linear Model (GLM) for regression analysis and Analysis of Variance (ANOVA) to obtain best model for predictions.

1.4 Tools/Packages To Be Used

Load these libraries and dataset and lets get to work!

  • Plotly
  • Knitr
  • Dplyr
  • Plyr
  • Reshape2
  • Ggplot2
  • Graphics
  • Ggthemes
  • GoogleVis
  • scatterplot3d
  • Shiny
  • Shinydashboard
  • Rshiny
  • Sqldf
  • gridExtra
  • treemap

2 DATA LOAD

The datasets are loaded into R to proceed with subsequent data analysis.

2.1 World Health

2.2 World Longitudes and Latitudes

This includes the various world Longitudes and Latitudes for all the country locations in the above dataset to able to plot them in a map for better visualizations.

3 DATA PREPARATION

  • Checking for all NA’s in the datasets before cleaning.
x
Year_Code 0
Country_Name 0
Country_Code 0
Adults_15_living_HIV 1716
Adults_Children_0_14_15_living_HIV 1688
AIDS_estimated_deaths_UNAIDS 1785
Adults_children_0_14_15_newly_infected_HIV 1655
Adults_15_newly_infected_HIV 1700
Children_0_14_living_with_HIV 2365
Children_orphaned_by_HIV_AIDS 2365
Children_0_14_newly_infected_HIV 2365
Incidence_tuberculosis_per_100000 394
Labor_force_total 568
Mortality_traffic_injury_100K 3974
Population_female 623
Population_male 623
Population_total 224
Malaria_cases_reported 3450
Suicide_mortality_per_100K 3415
Tuberculosis_death_per_100K 394
Tuberculosis_case_detection 564
Tuberculosis_treatment_success_NewCases 982
  • Cleaning and merging the two datasets of World Health and World Longitudes and Latitudes.
  • Checking for all NA’s in the datasets after cleaning.
x
Country_Name 0
Year_Code 63
Country_Code.x 63
Adults_15_living_HIV 1779
Adults_Children_0_14_15_living_HIV 1751
AIDS_estimated_deaths_UNAIDS 1848
Adults_children_0_14_15_newly_infected_HIV 1718
Adults_15_newly_infected_HIV 1763
Children_0_14_living_with_HIV 2428
Children_orphaned_by_HIV_AIDS 2428
Children_0_14_newly_infected_HIV 2428
Incidence_tuberculosis_per_100000 457
Labor_force_total 631
Mortality_traffic_injury_100K 4037
Population_female 686
Population_male 686
Population_total 287
Malaria_cases_reported 3513
Suicide_mortality_per_100K 3478
Tuberculosis_death_per_100K 457
Tuberculosis_case_detection 627
Tuberculosis_treatment_success_NewCases 1045
Country_Code.y 721
Latitude 721
Longtitude 721

From the table above, we can see that both datasets contain a lot of NAs (This is due to data unavalability in some countries under observation).

  • Convert the columns to Numeric

Let’s convert some of the variable to numeric for better a analysis.

## Observations: 4,403
## Variables: 25
## $ Country_Name                               <chr> "Afghanistan", "Afghanis...
## $ Year_Code                                  <chr> "YR2009", "YR2012", "YR2...
## $ Country_Code.x                             <chr> "AFG", "AFG", "AFG", "AF...
## $ Adults_15_living_HIV                       <dbl> 3700, 4600, 4000, 1700, ...
## $ Adults_Children_0_14_15_living_HIV         <dbl> 4000, 4900, 4200, 1800, ...
## $ AIDS_estimated_deaths_UNAIDS               <dbl> 500, 500, 500, 200, 200,...
## $ Adults_children_0_14_15_newly_infected_HIV <dbl> 560, 620, 560, 500, 500,...
## $ Adults_15_newly_infected_HIV               <dbl> 500, 560, 500, 500, 500,...
## $ Children_0_14_living_with_HIV              <dbl> 500, 500, 500, 200, 200,...
## $ Children_orphaned_by_HIV_AIDS              <dbl> 2600, 3400, 2800, 1000, ...
## $ Children_0_14_newly_infected_HIV           <dbl> 100, 100, 100, 100, 100,...
## $ Incidence_tuberculosis_per_100000          <dbl> 189, 189, 189, 189, 189,...
## $ Labor_force_total                          <dbl> 9312008, 10534726, 95725...
## $ Mortality_traffic_injury_100K              <dbl> NA, NA, NA, NA, NA, NA, ...
## $ Population_female                          <dbl> 13850754, 15196503, 1424...
## $ Population_male                            <dbl> 14544059, 15964873, 1493...
## $ Population_total                           <dbl> 28394813, 31161376, 2918...
## $ Malaria_cases_reported                     <dbl> NA, 54840, 69397, NA, NA...
## $ Suicide_mortality_per_100K                 <dbl> NA, NA, 5.1, NA, NA, NA,...
## $ Tuberculosis_death_per_100K                <dbl> 44.00, 44.00, 43.00, 62....
## $ Tuberculosis_case_detection                <dbl> 49, 49, 51, 25, 31, 50, ...
## $ Tuberculosis_treatment_success_NewCases    <dbl> 86, 88, 86, 84, 86, 87, ...
## $ Country_Code.y                             <fct> AFG, AFG, AFG, AFG, AFG,...
## $ Latitude                                   <dbl> 33, 33, 33, 33, 33, 33, ...
## $ Longtitude                                 <dbl> 65, 65, 65, 65, 65, 65, ...
  • Merging dataset

Merging column longitude and Latitude together for a better coordinate to be used in our maps

  • A graphical representation of the missing values

Since it is evident that we have lots of missing observations in the dataset, we are going to employ a method that best deal with datasets of this nature,

MICE (Multivariate Imputation by Chained Equations) package helps in inspecting, imputing, diagonise, analyze, pool the result, and generate simulated incomplete data

After the generated values by mice, we still have some missing observations. We therefore going to omit those observation that still have missing value(s) to avoid a biased analysis.

x
Country_Name 0
Year_Code 0
Country_Code.x 0
Adults_15_living_HIV 0
Adults_Children_0_14_15_living_HIV 0
AIDS_estimated_deaths_UNAIDS 0
Adults_children_0_14_15_newly_infected_HIV 0
Adults_15_newly_infected_HIV 0
Children_0_14_living_with_HIV 0
Children_orphaned_by_HIV_AIDS 0
Children_0_14_newly_infected_HIV 0
Incidence_tuberculosis_per_100000 0
Labor_force_total 0
Mortality_traffic_injury_100K 0
Population_female 0
Population_male 0
Population_total 0
Malaria_cases_reported 0
Suicide_mortality_per_100K 0
Tuberculosis_death_per_100K 0
Tuberculosis_case_detection 0
Tuberculosis_treatment_success_NewCases 0
Country_Code.y 0
Latitude 0
Longtitude 0
Lat_Long 0

We now have data with no missing observations.

Comparative Analysis of DataTypes and Missing Values Before and After

4 DATA EXPLORATION

In the data exploration, obtained an inferential statistic and checked for any outliers.

Summary Statistics

Percentage of Children Orphaned By HIV/AIDS

Use sql to subset(query) columns so as to diffentiate between year 2000 and 2020 where the number of children orphaned by HIV/AIDS.

Scatter Plot of HIV/Tuberculosis/Malaria Comparison

Disease Distribution by Top 50 Countries

5 DATA ANALYSIS

5.1 INCIDENCE PROPORTION and INCIDENCE RATES

Incidence: A measure of the occurrence of a new disease, in a defined population at risk for the disease, within a specified time period.

while

  1. Incidence proportion (also known as cumulative incidence) is the number of new cases within a specified time period divided by the size of the population initially at risk

  2. Incidence rate (also known as person-time rate) is a measure of incidence that incorporates time directly into the denominator. As per CDC, “a person-time rate is generally calculated from a long-term cohort follow-up study, wherein enrollees are followed over time and the occurrence of new cases of disease is documented.”

  3. Prevalence is the proportion of the total number of cases to the total population and is more a measure of the burden of the disease on society with no regard to time at risk or when subjects may have been exposed to a possible risk factor. Prevalence can also be measured with respect to a specific subgroup of a population

Mathematically as,

\[Incidence Rate = \frac{Number Of New Cases Of Disease Or Injury During Specified Period}{Time Each Person Was Observed,Totaled For All Persons}\]

5.1.1 Incidence Person At Risk

  • HIV/AIDS Incidence Calculation For Person At Risk

Hiv_Incidence_Person_At_Risk

Data: Hiv_Incident_Risk • Chart ID: Hiv_Incidence_Person_At_RiskgoogleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of UseDocumentation and Data Policy
HIV_Incidence_Person_At_Risk_WorldMap

  • Tuberculosis Incidence Calculation For Person At Risk

Tuberculosis_Incidence_Person_At_Risk

Data: Tuberculosis_Inc_Risk • Chart ID: Tuberculosis_Incidence_Person_At_RiskgoogleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of UseDocumentation and Data Policy
Tuberculosis_Incidence_Person_At_Risk_WorldMap

  • Malaria Incidence Calculation For Person At Risk

Malaria_Incidence_Person_At_Risk

Data: Malaria_Inc_Risk • Chart ID: Malaria_Incidence_Person_At_RiskgoogleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of UseDocumentation and Data Policy
Malaria_Incidence_Person_At_Risk_WorldMap

5.1.2 HIV/AID, Malaria & Tuberculosis Incident Risk Comparison

  • Person At Risk Comparison

5.1.3 Incidence Person Per Time

  • HIV/AIDS Person Per Time

Hiv_Incidence_Person_Per_Time

Data: HIV_Inc_Person_Time • Chart ID: Hiv_Incidence_Person_Per_TimegoogleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of UseDocumentation and Data Policy
HIV_Incidence_Person_Per_Time_WorldMap

  • Tuberculosis Person Per Time

Tuberculosis_Incidence_Person_Per_Time

Data: Tuber_Inc_Person_Time • Chart ID: Tuberculosis_Incidence_Person_Per_TimegoogleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of UseDocumentation and Data Policy
Tuberculosis_Incidence_Person_Per_Time_WorldMap

  • Malaria Person Per Time

Malaria_Incidence_Person_Per_Time
Data: Malaria_Inc_Person_Time • Chart ID: Malaria_Incidence_Person_Per_TimegoogleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of UseDocumentation and Data Policy
Malaria_Incidence_Person_Per_Time_WorldMap

5.2 PREVALENCE RATE

Prevalence : Actual number of cases alive, with the disease either during a period of time (period prevalence) or at a particular date in time (point prevalence). Period prevalence provides the better measure of the disease load since it includes all new cases and all deaths between two dates, whereas point prevalence only counts those alive on a particular date.

Prevalence is also most meaningfully reported as the number of cases as a fraction of the total population at risk and can be further categorized according to different subsets of the population.

  • HIV Prevalence

HIV_Prevalence_Rate_Map

Data: HIV_Prevalence_Rate • Chart ID: HIV_Prevalence_Rate_MapgoogleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of UseDocumentation and Data Policy
HIV_Prevalence_Rate_WorldMap

Since, Malaria and Tuberculosis are most likely not be carried over to the proceding year, so we will not include them (person living with either malaria or tuberculosis) in the numerators.

  • Tuberculosis Prevalence

Tuber_Prevalence_Rate_Map

Data: Tuber_Prevalence_Rate • Chart ID: Tuber_Prevalence_Rate_MapgoogleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of UseDocumentation and Data Policy
Tuberculosis_Prevalence_Rate_WorldMap

  • Malaria Prevalence

Malaria_Prevalence_Rate_Map

Data: Malaria_Prevalence_Rate • Chart ID: Malaria_Prevalence_Rate_MapgoogleVis-0.6.4
R version 3.6.3 (2020-02-29) • Google Terms of UseDocumentation and Data Policy
Malaria_Prevalence_Rate_WorldMap

  • All Disease Prevalence Comparison

5.2.1 A 3D Approach For All Disease Prevalence Comparison

6 REGRESSION ANALYSIS

  • Check for normality

From the above normality check, we can see that the variable under observation is not normally distributed. We will therefore use Generalized Linear Model (conventional linear regression models) to; select best model, estimate parameters and interpretations. The generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models etc.

i.e The dependent variable \(Y_i\) does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal etc.

For Linear Regression, we have:

\[Y_i= \beta_0 + \beta_1x_i + \epsilon_i\] where, \[Y_i = Dependent Variable\] \[\beta_0 = Intercept\] \[\beta_1 = Parameter To Be estimated (slope)\] \[x_i = Independent Variable\] \[\epsilon_i = Error Term\] \[i = 1,2,3,4,......,n\]

While Multiple linear regression, we have (in matrix notation):

\[\beta =(X^TX)^{-1}X^TY\]

  • Obtaining Training and Test Dataset

6.1 Analysis of Labor Force

  • Null Hypothesis: Labor Force has no effect on HIV, Tuberculosis and Malaria.

\(H_o: \mu_1 = \mu_2 = \mu_3...= \mu_n\)

  • Alternative Hypothesis: Labor Force has effect on HIV, Tuberculosis and Malaria.

\(H_a \neq Ho: \mu_1 \neq \mu_2 \neq \mu_3...\neq \mu_n\)

  • Rejection : Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

Using both forward and backward model selection to select the best model for our analysis.

## Start:  AIC=12485.84
## Labor_Pop_Percentage ~ Population_total + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
## 
##                                    Df Deviance   AIC
## - Population_total                  1   2584.2 12484
## <none>                                  2584.1 12486
## - HIV_Prevalence_Per_100K           1   2591.0 12491
## - Malaria_Prevalence_per_100K       1   2604.2 12504
## - Tuberculosis_Prevalence_per_100K  1   2622.8 12522
## 
## Step:  AIC=12483.87
## Labor_Pop_Percentage ~ Tuberculosis_Prevalence_per_100K + Malaria_Prevalence_per_100K + 
##     HIV_Prevalence_Per_100K
## 
##                                    Df Deviance   AIC
## <none>                                  2584.2 12484
## + Population_total                  1   2584.1 12486
## - HIV_Prevalence_Per_100K           1   2591.2 12489
## - Malaria_Prevalence_per_100K       1   2604.4 12502
## - Tuberculosis_Prevalence_per_100K  1   2624.9 12523
## 
## Call:
## glm(formula = Labor_Pop_Percentage ~ Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"), 
##     data = Train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3191  -0.7588   0.1204   0.8847   3.3895  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       3.758e+00  4.419e-03 850.226  < 2e-16 ***
## Tuberculosis_Prevalence_per_100K  4.301e-04  6.649e-05   6.470 9.83e-11 ***
## Malaria_Prevalence_per_100K      -1.522e-06  3.477e-07  -4.378 1.20e-05 ***
## HIV_Prevalence_Per_100K          -3.282e-07  1.243e-07  -2.640  0.00829 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2653.6  on 1771  degrees of freedom
## Residual deviance: 2584.2  on 1768  degrees of freedom
## AIC: 12484
## 
## Number of Fisher Scoring iterations: 4

Plot of Residuals

Analysis of Variance (ANOVA of Labor Force)

Decision

Reject Null Hypothesis (\(H_o\)) since P-Values are all less than Tabulated value of 0.05.

6.2 Analysis of Population

  • Null Hypothesis: Population has no effect on HIV, Tuberculosis and Malaria.

\(H_o: \mu_1 = \mu_2 = \mu_3...= \mu_n\)

  • Alternative Hythesis: Population has effect on HIV, Tuberculosis and Malaria.

\(H_a \neq Ho: \mu_1 \neq \mu_2 \neq \mu_3...\neq \mu_n\)

  • Rejection : Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

Using both forward and backward model selection to select the best model for our analysis.

## Start:  AIC=21941748579
## Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K
## 
##                                    Df   Deviance        AIC
## <none>                                2.1942e+10 2.1942e+10
## - Labor_Pop_Percentage              1 2.1980e+10 2.1981e+10
## - HIV_Prevalence_Per_100K           1 2.2276e+10 2.2276e+10
## - Malaria_Prevalence_per_100K       1 2.2726e+10 2.2726e+10
## - Tuberculosis_Prevalence_per_100K  1 6.5965e+10 6.5965e+10
## 
## Call:
## glm(formula = Population_total ~ Labor_Pop_Percentage + Tuberculosis_Prevalence_per_100K + 
##     Malaria_Prevalence_per_100K + HIV_Prevalence_Per_100K, family = poisson(link = "log"), 
##     data = Train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7103.3  -2151.0   -318.5   2490.6  16399.9  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       1.802e+01  2.818e-05  639641   <2e-16 ***
## Labor_Pop_Percentage              3.945e-03  6.346e-07    6216   <2e-16 ***
## Tuberculosis_Prevalence_per_100K -1.995e-01  1.567e-06 -127356   <2e-16 ***
## Malaria_Prevalence_per_100K      -2.173e-05  8.688e-10  -25010   <2e-16 ***
## HIV_Prevalence_Per_100K          -4.397e-06  2.570e-10  -17105   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6.7926e+10  on 1771  degrees of freedom
## Residual deviance: 2.1942e+10  on 1767  degrees of freedom
## AIC: 2.1942e+10
## 
## Number of Fisher Scoring iterations: 7

Plot of Residuals

Analysis of Variance (ANOVA of Population)

Decision

Reject Null Hypothesis (\(H_o\)) since P-Values are all less than Tabulated value of 0.05.

Non Zero Confidence Interval

Meaning that there is a 95% probability that the confidence interval will contain the true population mean.

##                                          2.5 %        97.5 %
## (Intercept)                       3.748894e+00  3.766218e+00
## Tuberculosis_Prevalence_per_100K  2.992630e-04  5.598876e-04
## Malaria_Prevalence_per_100K      -2.211387e-06 -8.485368e-07
## HIV_Prevalence_Per_100K          -5.728867e-07 -8.563474e-08

From the table above, we can deduce that the our analysis is correct (as non of the value between the 2.5 and 97.5 is zero) for Population_total, Malaria and HIV as corfirmed by Analysis of Variance (Anova) result above.

7 PREDICTION

The below below shows the predicted values and the actual values side-by-side.

actual predicted
41 42.86778
41 42.87670
45 42.87575
45 42.87028
47 42.89949
35 42.87475

7.1 MAPE

The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics. Represented mathematically as,

\[M=\frac{100}{n}\sum _{t=1}^{n} |\frac{A_t - F_t}{A_t}|\], Where, \[A_t=Actual Value\] \[F_t=Predicted Value\]

## [1] "The Percentage Accuracy Is:  86.2957407776888"
## [1] "The MAPE Is:  16.4859681592503"

We can be rest assured that about 16 % of our prediction might not be accurate.

8 CONCLUSION

Using data visualization in the field of epidemiology to depict succinctly the effect of labor force and population on the widespread of diseases. As observed that using incident risk alone may not be the best option to make recommendation and conclusion on epidemics, but rather use prevalence to make accurate analysis.

  • Incident (Person At Risk): Shows that majority of the diseases is highly concentrated in southern part of africa.

  • Incident (Person Per Time): Gives more insight about the dieases that are considered non-chronic like Malaria and Tuberculosis.

  • Prevalence Rate: This was used to estimate the prevalence of all the diseases and a better result was obtained. It shows the widespread of the diseases beyond southern part of africa to some of Asia, South America and Carribean was not left out!

Since P-value is less than Tabulated/Table value (0.05), we will therefore reject Null Hypothesis Ho and conclude that both Labor Force and Population have effects on HIV/AIDS, Malaria and Tuberculosis. I.e an increase and decrease in the labor force and population would have an impact on the wide spread of disease in the countries under study.

8.1 Recommendation

The countries that are mostly affected needs to do more in terms of;

Labor Force: Provide more employment opportunities to the citizens. If an economy is viable, the citizen would want to live a healthy life style and would ultimately aid in decreasing the widespread of diseases.

Population: Maintain the population, as a sporadic increment in population without good standard of living, would jeopardize the economy and cause an increase in the spread of diseases.


Debabrata Kabiraj

May 07, 2019