Introduction

The global burden of HIV/AIDS continues to challenge public health systems worldwide, with marked disparities across regions, economic groups, and genders.

This analysis brings a fresh perspective by integrating:

  1. Gender-disaggregated HIV/AIDS indicators, and
  2. Socioeconomic data (specifically GDP per capita).

The main goal is to uncover nuanced insights that can drive targeted interventions.

Expected Outcomes

This analysis aims to deliver the following key insights:

1. Comprehensive Overview

Provide a detailed understanding of gender-specific disparities in HIV/AIDS prevalence and AIDS-related deaths.

2. Economic Determinants

Highlight the role of economic factors, such as GDP per capita, in shaping these disparities.

3. Detection of Anomalies

Explore possible anomalies, such as high prevalence rates coinciding with low AIDS-related deaths.

4. Statistical Validation

Determine whether gender differences in prevalence or deaths are statistically significant.

5. Visualization

  • Create geographic maps to visualize gender-specific indicators.
  • Generate:
    • Bar charts to highlight disparities, and
    • Boxplots to showcase regional trends.

6. Policy Contributions

Identify priority areas for policy interventions and resource allocation.

7. Targeted Insights

Focus on countries with:
- Pronounced gender disparities, or
- Observable anomalies (e.g., high prevalence but low death rates).

These findings aim to support evidence-based decision-making and contribute to addressing gender inequalities in the global HIV/AIDS burden.

Data Sources and Preparation

The analysis is based on data from two key sources:
1. UNAIDS (ONUSIDA): For HIV/AIDS indicators such as prevalence and AIDS-related deaths.
2. World Bank: For GDP data adjusted for purchasing power parity (PPP).

Data Cleaning and Integration

  • The datasets were cleaned and merged using Python to focus on 2023 data.
  • The preprocessing steps, along with the analysis code, are documented in a Jupyter Notebook.

You can access the datasets and the Jupyter Notebook here:
Download the ZIP file

Dataset Overview

The dataset used in this analysis comprises 11 columns and 117 rows, representing countries with up-to-date data for 2023. The data provides insights into various HIV/AIDS indicators, gender disparities, and economic factors across different countries. The key variables included are as follows:

Analysis

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(sf)
library(DT)
library(htmlwidgets)
library(htmltools)
library(FSA)
library(ggsignif)
library(corrplot)
library(psych)
library(tidyr)

1. Load Dataset, Check for and Handle Missing Values

# Load dataset
data <- read.csv("data_AIDS_and_GDP_cleaned.csv")
# Check for missing values
describe_result <- describe(data)
describe_result$na <- sapply(data, function(x) sum(is.na(x)))
result <- describe_result[, "na", drop = FALSE]
print(result)
##                                        na
## Country*                                0
## PeopleWithAIDS_All_adults_2023          0
## PeopleWithAIDS_Female_adults_2023       2
## PeopleWithAIDSe_Male_adults_2023        2
## AIDS_Prevalence_All_adults_2023         0
## AIDS_Prevalence_Female_adults_2023      2
## AIDS_Prevalence_Male_adults_2023        2
## AIDS_related_deaths_All_adults_2023     0
## AIDS_related_deaths_Female_adults_2023  2
## AIDS_related_deaths_Male_adults_2023    2
## GPD_PCAP_2023                           0

Decision

Only 2 of the 117 rows contain missing values, so we remove them.
This simplifies the analysis with minimal impact on the overall results, since the incomplete rows make up less than 2% of the dataset.

# Handle missing values by removing rows that contain NA
data_clean <- na.omit(data)
na_count <- sum(is.na(data_clean))
cat("After removing rows with missing values, the number of NA values remaining is:", na_count, "\n")
## After removing rows with missing values, the number of NA values remaining is: 0

2. Descriptive Analysis by Gender

Objectives

  • Compare HIV/AIDS prevalence and AIDS-related deaths between men and women.
  • Identify countries where gender disparities are most pronounced.

Step 1: Subset data by gender
# Subset data by gender
women_data <- data_clean %>% select(Country, GPD_PCAP_2023, AIDS_Prevalence_Female_adults_2023, AIDS_related_deaths_Female_adults_2023)
men_data <- data_clean %>% select(Country, GPD_PCAP_2023, AIDS_Prevalence_Male_adults_2023, AIDS_related_deaths_Male_adults_2023)
Step 2: Descriptive Analysis

a- Summarize key statistics for HIV prevalence and deaths by gender

Women key statistics

# Summary statistics for women
women_data_summary <- describe(women_data)
women_data_summary[, c("n", "mean", "sd", "median", "min", "max")]
##                                          n     mean       sd   median    min
## Country*                               115    58.00    33.34    58.00   1.00
## GPD_PCAP_2023                          115 25607.46 28752.09 16062.02 919.91
## AIDS_Prevalence_Female_adults_2023     115     1.46     4.55     0.20   0.10
## AIDS_related_deaths_Female_adults_2023 115   631.30  1035.71   200.00 100.00
##                                             max
## Country*                                  115.0
## GPD_PCAP_2023                          143809.5
## AIDS_Prevalence_Female_adults_2023         32.2
## AIDS_related_deaths_Female_adults_2023   5300.0

Men key statistics

# Summary statistics for men
men_data_summary <- describe(men_data)
men_data_summary[, c("n", "mean", "sd", "median", "min", "max")]
##                                        n     mean       sd   median    min
## Country*                             115    58.00    33.34    58.00   1.00
## GPD_PCAP_2023                        115 25607.46 28752.09 16062.02 919.91
## AIDS_Prevalence_Male_adults_2023     115     1.13     3.00     0.50   0.10
## AIDS_related_deaths_Male_adults_2023 115   732.17  1009.04   500.00 100.00
##                                           max
## Country*                                115.0
## GPD_PCAP_2023                        143809.5
## AIDS_Prevalence_Male_adults_2023         22.5
## AIDS_related_deaths_Male_adults_2023   5200.0

Geographic Maps

Objective: Create maps showing prevalence by gender.

Load spatial data and merge with HIV/AIDS data_clean

# Load spatial data (replace with actual shapefile for country boundaries)
world_shapefile <- st_read("ne_10m_admin_0_countries/ne_10m_admin_0_countries.shp")
## Reading layer `ne_10m_admin_0_countries' from data source 
##   `C:\Users\PDG Junior\Desktop\M.Sc biostatistics and epidemiology\Biostatistics\R_project\AIDS\ne_10m_admin_0_countries\ne_10m_admin_0_countries.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 258 features and 168 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.6341
## Geodetic CRS:  WGS 84
# Merge spatial data with HIV/AIDS data
map_data <- world_shapefile %>% 
  left_join(data_clean, by = c("NAME" = "Country"))

Prevalence Map for Women

# Map for women
map_women <- ggplot(map_data) +
  geom_sf(aes(fill = AIDS_Prevalence_Female_adults_2023), color = "white") +
  scale_fill_viridis_c() +
  theme_minimal() +
  labs(title = "HIV Prevalence among Women by Country",
       fill = "Prevalence")
print(map_women)

Prevalence Map for Men

# Map for men
map_men <- ggplot(map_data) +
  geom_sf(aes(fill = AIDS_Prevalence_Male_adults_2023), color = "white") +
  scale_fill_viridis_c() +
  theme_minimal() +
  labs(title = "HIV Prevalence among Men by Country",
       fill = "Prevalence")
print(map_men)

HIV Prevalence across Gender by Country and GDP Categories

HIV Prevalence Map among Women by Country and GDP Categories

HIV Prevalence Map among Men by Country and GDP Categories

HIV Prevalence Table across Gender by Country and GDP Categories

Legend:
  • GDP Category: colors representing the GDP categories:
    • Low (GDP <= 1st quartile of GDP per capita values)
    • Moderate (1st quartile < GDP <= 2nd quartile of GDP per capita values)
    • High (GDP > 2nd quartile of GDP per capita values)
  • Prevalence: colors representing the HIV prevalence values:
    • Low (<= 1%)
    • Moderate (between 1% and 5%)
    • High (> 5%)
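
The code that builds the GDP_Category column is not shown in this report, although later steps rely on it. The following is a minimal sketch of how the two sets of categories in the legend could be derived. The helper names `categorize_gdp` and `categorize_prevalence` are hypothetical, and the cut points simply follow the legend (first and second quartiles for GDP per capita; 1% and 5% for prevalence).

```r
# Sketch only: helper names are hypothetical; thresholds follow the legend.

# Bin GDP per capita into Low / Moderate / High using its 1st and 2nd quartiles.
categorize_gdp <- function(gdp) {
  q <- quantile(gdp, probs = c(0.25, 0.50), na.rm = TRUE)
  cut(gdp,
      breaks = c(-Inf, q[1], q[2], Inf),
      labels = c("Low", "Moderate", "High"),
      right = TRUE)  # intervals are closed on the right: (lower, upper]
}

# Bin HIV prevalence (%) into Low (<= 1), Moderate (1 to 5], High (> 5).
categorize_prevalence <- function(prevalence) {
  cut(prevalence,
      breaks = c(-Inf, 1, 5, Inf),
      labels = c("Low", "Moderate", "High"),
      right = TRUE)
}
```

With the report's data, this might be applied as `data_clean$GDP_Category <- categorize_gdp(data_clean$GPD_PCAP_2023)`. Note that `cut()` stops with an error if the two quartiles happen to be equal, which is unlikely for a continuous variable such as GDP per capita.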

b- Calculate gender disparities in HIV prevalence and AIDS-related deaths

# Calculate disparities
data_clean_With_Disparities <- data_clean %>% 
  mutate(Prevalence_Disparity_2023 = round(abs(AIDS_Prevalence_Female_adults_2023 - AIDS_Prevalence_Male_adults_2023), 2),
         Deaths_Disparity_2023 = abs(AIDS_related_deaths_Female_adults_2023 - AIDS_related_deaths_Male_adults_2023))%>% 
  select(Country, Prevalence_Disparity_2023, Deaths_Disparity_2023) 
saveRDS(data_clean_With_Disparities, "data_clean_With_Disparities.rds")

Table of Gender “HIV prevalence” and “death” disparities

datatable(data_clean_With_Disparities, 
          options = list(
            
          ),
          caption = 'Gender "HIV prevalence" and "death" disparities Table pertaining to 2023'
          
)

Top 10 countries with largest disparities

HIV prevalence

HIV related death
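
The code behind the two top-10 tables is not included here. One possible way to rank countries by disparity is sketched below in base R; `top_disparities` is a hypothetical helper, not the function used in the report.

```r
# Sketch only: `top_disparities` is a hypothetical helper.

# Return the n rows with the largest values in `column`, in descending order.
top_disparities <- function(df, column, n = 10) {
  ranked <- df[order(df[[column]], decreasing = TRUE), , drop = FALSE]
  head(ranked, n)
}
```

With the disparities table built earlier, the two rankings could be obtained with `top_disparities(data_clean_With_Disparities, "Prevalence_Disparity_2023")` and `top_disparities(data_clean_With_Disparities, "Deaths_Disparity_2023")`.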

Disparities Visualization

Bar Charts and Boxplots

Objective: Highlight gender disparities and regional trends.

a. Bar Chart

b. Boxplot

# Boxplot of male prevalence (named boxplot_men to avoid masking base::boxplot)
boxplot_men <- ggplot(data_clean, aes(x = "Men", y = AIDS_Prevalence_Male_adults_2023)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Boxplot of HIV Prevalence among Men",
       x = "", y = "HIV Prevalence")
print(boxplot_men)

Check summary for any extreme outliers

# Check summary for any extreme outliers
summary(data_clean$AIDS_Prevalence_Female_adults_2023)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.100   0.200   1.461   1.000  32.200
summary(data_clean$AIDS_Prevalence_Male_adults_2023)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.200   0.500   1.126   0.900  22.500

Presence of Extreme Values

The summary statistics indicate the presence of extreme values: the maximum prevalence is 32.2 for women and 22.5 for men, far above the third quartiles of 1.0 and 0.9.

Boxplot with limited scale to exclude extreme values
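
The code for this plot is not shown. A minimal sketch of one way to draw both genders side by side with a capped y-axis follows. The helper name `plot_prevalence_by_gender` and the default cap of 5% are assumptions chosen for illustration.

```r
library(ggplot2)
library(tidyr)

# Sketch only: the helper name and the default y-axis cap (5%) are assumptions.
plot_prevalence_by_gender <- function(df, y_max = 5) {
  # Reshape the two gender-specific columns into long format:
  # one row per country and gender.
  long <- pivot_longer(
    df,
    cols = c(AIDS_Prevalence_Female_adults_2023, AIDS_Prevalence_Male_adults_2023),
    names_to = "Gender",
    values_to = "Prevalence"
  )
  long$Gender <- ifelse(grepl("Female", long$Gender), "Women", "Men")

  ggplot(long, aes(x = Gender, y = Prevalence)) +
    geom_boxplot(aes(fill = Gender)) +
    # coord_cartesian() zooms without dropping data, so the box statistics are
    # still computed from all countries, including the extreme ones.
    coord_cartesian(ylim = c(0, y_max)) +
    theme_minimal() +
    labs(title = "HIV Prevalence by Gender (y-axis limited)",
         x = "", y = "HIV Prevalence (%)")
}
```

Using `coord_cartesian()` rather than `ylim()` is a deliberate choice in this sketch: `ylim()` would remove the extreme countries before the medians and quartiles are computed, which changes the boxes themselves.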

2-Comparison of Prevalence by GDP Categories

Objective: Explore whether HIV prevalence varies significantly across GDP categories (Low, Moderate, High).

Null Hypothesis (H0): The distributions of AIDS_Prevalence_All_adults_2023 are the same across the GDP_Category groups. In other words, there is no difference in medians between the groups.

Alternative Hypothesis (Ha): At least one group has a different distribution or median compared to the others.

Step 1: Visualize the Distribution for Each GDP_Category

We’ll use histograms and Q-Q plots to assess the distribution of AIDS_Prevalence_All_adults_2023 for each GDP category.

Histogram for each GDP category

ggplot(data_clean, aes(x = AIDS_Prevalence_All_adults_2023, fill = GDP_Category)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15, alpha = 0.4, position = "identity", color = "black") +
  geom_density(alpha = 1, linewidth = 1.2, aes(color = GDP_Category)) +
  facet_wrap(~GDP_Category) +
  theme_minimal() +
  scale_color_manual(values = c("red", "blue", "green")) + # Set distinct colors for curves
  scale_fill_manual(values = c("lightpink", "lightblue", "lightgreen")) + # Set distinct colors for bars
  labs(title = "Histogram and Frequency Curve of HIV Prevalence by GDP Category",
       x = "HIV Prevalence (%)",
       y = "Density") +
  theme(legend.position = "top",
        legend.title = element_blank())

Q-Q Plot for normality assessment

ggplot(data_clean, aes(sample = AIDS_Prevalence_All_adults_2023)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~GDP_Category) +
  theme_minimal() +
  labs(title = "Q-Q Plot of HIV Prevalence by GDP Category",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles")

Step 2: Perform Shapiro-Wilk Test for Normality

The Shapiro-Wilk test will be conducted separately for each GDP category to statistically test the normality of AIDS_Prevalence_All_adults_2023.

  • If p≤0.05, the data for that GDP category significantly deviates from normality.
  • If p>0.05, the data does not significantly deviate from normality.
# Shapiro-Wilk test for normality within each GDP category
normality_results <- lapply(unique(data_clean$GDP_Category), function(category) {
  shapiro.test(data_clean$AIDS_Prevalence_All_adults_2023[data_clean$GDP_Category == category])
})

# Assign category names to the results
names(normality_results) <- unique(data_clean$GDP_Category)

# Display results
print("Shapiro-Wilk Test Results by GDP Category:")
## [1] "Shapiro-Wilk Test Results by GDP Category:"
print(normality_results)
## $Low
## 
##  Shapiro-Wilk normality test
## 
## data:  data_clean$AIDS_Prevalence_All_adults_2023[data_clean$GDP_Category == category]
## W = 0.39495, p-value = 1.684e-11
## 
## 
## $Moderate
## 
##  Shapiro-Wilk normality test
## 
## data:  data_clean$AIDS_Prevalence_All_adults_2023[data_clean$GDP_Category == category]
## W = 0.37255, p-value = 1.43e-11
## 
## 
## $High
## 
##  Shapiro-Wilk normality test
## 
## data:  data_clean$AIDS_Prevalence_All_adults_2023[data_clean$GDP_Category == category]
## W = 0.69954, p-value = 1.621e-07

The Shapiro-Wilk test rejects normality in all three GDP categories (every p-value is far below 0.05).

Apply a log transformation to try to normalize the distribution

data_clean$Log_Prevalence <- log(data_clean$AIDS_Prevalence_All_adults_2023 + 1)

Perform the Shapiro-Wilk test again

normality_results <- lapply(unique(data_clean$GDP_Category), function(category) {
  shapiro.test(data_clean$Log_Prevalence[data_clean$GDP_Category == category])
})

# Assign category names to the results
names(normality_results) <- unique(data_clean$GDP_Category)

# Display results
print("Shapiro-Wilk Test Results by GDP Category:")
## [1] "Shapiro-Wilk Test Results by GDP Category:"
print(normality_results)
## $Low
## 
##  Shapiro-Wilk normality test
## 
## data:  data_clean$Log_Prevalence[data_clean$GDP_Category == category]
## W = 0.79939, p-value = 8.168e-06
## 
## 
## $Moderate
## 
##  Shapiro-Wilk normality test
## 
## data:  data_clean$Log_Prevalence[data_clean$GDP_Category == category]
## W = 0.60077, p-value = 5.814e-09
## 
## 
## $High
## 
##  Shapiro-Wilk normality test
## 
## data:  data_clean$Log_Prevalence[data_clean$GDP_Category == category]
## W = 0.76835, p-value = 2.519e-06

The distribution remains non-normal in every GDP category even after the log transformation (all p-values are still far below 0.05), so we proceed with a non-parametric test.

Step 3: Perform Kruskal-Wallis test

Because the normality assumption is not met, we use the Kruskal-Wallis test instead of ANOVA to compare the groups. The Kruskal-Wallis test is a rank-based, non-parametric method for determining whether three or more independent groups differ; when the group distributions have similar shapes, it can be read as a comparison of medians. It is often used when the assumptions of ANOVA (normality and homogeneity of variances) are not met.

# Kruskal-Wallis test to compare HIV prevalence by GDP categories
kruskal_test <- kruskal.test(AIDS_Prevalence_All_adults_2023 ~ GDP_Category, data = data_clean)
print(kruskal_test)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  AIDS_Prevalence_All_adults_2023 by GDP_Category
## Kruskal-Wallis chi-squared = 14.294, df = 2, p-value = 0.0007872
Key Statistics

  • Kruskal-Wallis chi-squared: 14.294. This is the test statistic, which measures the degree of difference in ranks among the groups. A higher value suggests greater differences.
  • Degrees of freedom (df): 2. This corresponds to the number of groups minus one (3 GDP categories: Low, Moderate, High → df = 3 − 1 = 2).
  • p-value: 0.0007872. This is the probability of observing a test statistic at least this extreme if the null hypothesis were true. It is well below the common significance level (α = 0.05).

Therefore, we reject the null hypothesis.

Conclusion

The Kruskal-Wallis test shows a significant difference in AIDS_Prevalence_All_adults_2023 among GDP categories (p = 0.0008). This suggests that AIDS prevalence varies meaningfully with GDP levels. Post hoc analysis can reveal specific group differences.

Step 4: Post Hoc Pairwise Comparisons

We will use the Dunn test for pairwise comparisons

dunn_test <- dunnTest(AIDS_Prevalence_All_adults_2023 ~ GDP_Category, data = data_clean, method = "bonferroni")
print(dunn_test)
## Dunn (1964) Kruskal-Wallis multiple comparison
##   p-values adjusted with the Bonferroni method.
##        Comparison         Z      P.unadj        P.adj
## 1      High - Low -3.764237 0.0001670586 0.0005011757
## 2 High - Moderate -1.589149 0.1120268530 0.3360805590
## 3  Low - Moderate  2.164802 0.0304028298 0.0912084893

Interpretation of Post Hoc Results

The post hoc comparisons using Dunn’s test show the pairwise differences among GDP categories with adjusted p-values (Bonferroni correction). Here’s the breakdown:

  • High vs. Low: Z = -3.76, adjusted p-value = 0.0005. The difference between the “High” and “Low” GDP categories in terms of AIDS prevalence is statistically significant (p < 0.05), indicating a notable disparity.
  • High vs. Moderate: Z = -1.59, adjusted p-value = 0.336. The difference between the “High” and “Moderate” GDP categories is not statistically significant (p > 0.05).
  • Low vs. Moderate: Z = 2.16, adjusted p-value = 0.091. The difference between the “Low” and “Moderate” GDP categories is not statistically significant after adjustment (p > 0.05).

Summary

There is a significant difference in AIDS prevalence between countries with “High” and “Low” GDP categories. Differences between “High vs. Moderate” and “Low vs. Moderate” GDP categories are not statistically significant after Bonferroni adjustment.
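
To make the Bonferroni adjustment concrete, the short example below reproduces the adjusted p-values from the unadjusted ones reported in the Dunn test output. It uses only base R; the variable names are illustrative.

```r
# Unadjusted p-values copied from the Dunn test output above.
p_unadj <- c("High - Low"      = 0.0001670586,
             "High - Moderate" = 0.1120268530,
             "Low - Moderate"  = 0.0304028298)

# Bonferroni: multiply each p-value by the number of comparisons (3), capped at 1.
p_adj <- p.adjust(p_unadj, method = "bonferroni")
print(p_adj)

# The same arithmetic written out by hand.
p_manual <- pmin(1, p_unadj * length(p_unadj))
```

Both calculations give the values in the P.adj column up to rounding. In particular, the Low vs. Moderate p-value moves from about 0.030 to about 0.091, crossing the 0.05 threshold.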

Step 5: Visualization of Pairwise Comparisons

Boxplots to illustrate variations.
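
The plotting code is not included in this report. One way to draw such boxplots with significance brackets, using the ggsignif package loaded earlier, might look like the sketch below. The helper name `plot_prevalence_by_gdp`, the choice of pairwise Wilcoxon tests, and the bracket spacing are all assumptions.

```r
library(ggplot2)
library(ggsignif)

# Sketch only: helper name, pairwise test and bracket spacing are assumptions.
plot_prevalence_by_gdp <- function(df) {
  ggplot(df, aes(x = GDP_Category, y = AIDS_Prevalence_All_adults_2023)) +
    geom_boxplot(aes(fill = GDP_Category)) +
    # Brackets come from pairwise Wilcoxon tests with no multiplicity
    # correction, so they can disagree with a Bonferroni-adjusted Dunn test.
    geom_signif(comparisons = list(c("Low", "Moderate"),
                                   c("Moderate", "High"),
                                   c("Low", "High")),
                test = "wilcox.test",
                map_signif_level = TRUE,
                step_increase = 0.1) +
    theme_minimal() +
    labs(title = "HIV Prevalence by GDP Category",
         x = "GDP Category", y = "HIV Prevalence (%)")
}
```

The fill aesthetic is set inside `geom_boxplot()` rather than globally so that `geom_signif()` sees all GDP categories as one group when it runs its comparisons.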

P-Value and Significance Levels

  • “NS” (Not Significant): P > 0.05. Confidence level: less than 95%.
  • “*” (Significant): 0.01 < P ≤ 0.05. Confidence level: 95%.
  • “**” (Very Significant): 0.001 < P ≤ 0.01. Confidence level: 99%.
  • “***” (Highly Significant): P ≤ 0.001. Confidence level: 99.9%.
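
The thresholds above can be written as a small lookup. The sketch below uses base R's `cut()`; `p_to_stars` is a hypothetical helper name introduced only for illustration.

```r
# Sketch only: `p_to_stars` is a hypothetical helper implementing the thresholds above.
p_to_stars <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.001, 0.01, 0.05, 1),
                   labels = c("***", "**", "*", "NS"),
                   right = TRUE,            # intervals are (lower, upper]
                   include.lowest = TRUE))  # so that p = 0 maps to "***"
}

# Bonferroni-adjusted p-values from the Dunn test output.
p_to_stars(c(0.0005011757, 0.3360805590, 0.0912084893))
```

Applied to the adjusted Dunn p-values, this labels High vs. Low as “***” and the other two pairs as “NS”.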

Why do the Dunn test and the boxplot disagree on the “Low” vs. “Moderate” pair?

  • The Dunn test adjusts for the risk of Type I errors (false positives) when making multiple comparisons. The unadjusted p-value for Low vs. Moderate is 0.030, which is below 0.05, but after the Bonferroni correction it rises to 0.091, so the difference is not statistically significant at the 5% level.
  • The boxplot shows a visible difference without applying that adjustment, which can give the appearance of significance even when the adjusted test does not support it.

Key Takeaway:

  • Boxplot: Useful for quickly observing trends, but may not account for adjustments needed when making multiple comparisons.
  • Dunn Test: Provides a more rigorous statistical evaluation of differences, adjusting for the risk of false positives in multiple comparisons.

2- Correlation Analysis

Test Normality

# Apply the Shapiro-Wilk test to all quantitative variables
shapiro_results <- sapply(data_clean[ , c("AIDS_Prevalence_All_adults_2023", 
                                           "AIDS_Prevalence_Female_adults_2023", 
                                           "AIDS_Prevalence_Male_adults_2023", 
                                           "AIDS_related_deaths_All_adults_2023", 
                                           "AIDS_related_deaths_Female_adults_2023", 
                                           "AIDS_related_deaths_Male_adults_2023", 
                                           "GPD_PCAP_2023")], shapiro.test)

# Shapiro-Wilk test results
shapiro_results
##           AIDS_Prevalence_All_adults_2023 AIDS_Prevalence_Female_adults_2023
## statistic 0.3107968                       0.3111598                         
## p.value   7.318499e-21                    7.399975e-21                      
## method    "Shapiro-Wilk normality test"   "Shapiro-Wilk normality test"     
## data.name "X[[i]]"                        "X[[i]]"                          
##           AIDS_Prevalence_Male_adults_2023 AIDS_related_deaths_All_adults_2023
## statistic 0.3147373                        0.6127807                          
## p.value   8.254949e-21                     5.883565e-16                       
## method    "Shapiro-Wilk normality test"    "Shapiro-Wilk normality test"      
## data.name "X[[i]]"                         "X[[i]]"                           
##           AIDS_related_deaths_Female_adults_2023
## statistic 0.566939                              
## p.value   7.471252e-17                          
## method    "Shapiro-Wilk normality test"         
## data.name "X[[i]]"                              
##           AIDS_related_deaths_Male_adults_2023 GPD_PCAP_2023                
## statistic 0.6574065                            0.7517411                    
## p.value   5.245459e-15                         1.164334e-12                 
## method    "Shapiro-Wilk normality test"        "Shapiro-Wilk normality test"
## data.name "X[[i]]"                             "X[[i]]"

A Shapiro-Wilk p-value below 0.05 indicates that a variable does not follow a normal distribution. All seven p-values here are far below that threshold, so none of our variables meets the normality assumption. This matters because non-normal data call for non-parametric methods in the analysis that follows.

Step 1: Correlation Matrix

Objective: Obtain an overview of relationships between all quantitative variables using Spearman's rank correlation.

# Selecting relevant quantitative variables
quantitative_vars <- data_clean %>%
  select(AIDS_Prevalence_All_adults_2023,
         AIDS_Prevalence_Female_adults_2023,
         AIDS_Prevalence_Male_adults_2023,
         AIDS_related_deaths_All_adults_2023,
         AIDS_related_deaths_Female_adults_2023,
         AIDS_related_deaths_Male_adults_2023,
         GPD_PCAP_2023)

# Calculating the correlation matrix
#cor_matrix <- cor(quantitative_vars, use = "complete.obs")
cor_matrix_2 <- cor(quantitative_vars, use = "complete.obs", method = "spearman")
#print(cor_matrix_2)
Step 2: Correlation Matrix visualization

Objective: All correlation coefficients will be displayed in a heatmap for a global view
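
The heatmap code is not shown in this report. A minimal sketch using the corrplot package loaded earlier is given below; the helper name `plot_spearman_heatmap` and all styling choices are assumptions rather than the settings actually used.

```r
library(corrplot)

# Sketch only: helper name and styling choices are assumptions.
plot_spearman_heatmap <- function(df) {
  cor_matrix <- cor(df, use = "complete.obs", method = "spearman")
  corrplot(cor_matrix,
           method = "color",       # filled squares, i.e. a heatmap
           type = "upper",         # upper triangle only; the matrix is symmetric
           addCoef.col = "black",  # print each coefficient in its cell
           number.cex = 0.7,
           tl.col = "black",
           tl.cex = 0.7)
  invisible(cor_matrix)            # return the matrix for further inspection
}
```

With the report's data, the call would be `plot_spearman_heatmap(quantitative_vars)`.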

Key Points from the Correlation Matrix

The strongest relationships are observed among HIV-related indicators, while GDP displays weaker and more indirect associations.

a. Strong Positive Correlations Among HIV Indicators

Indicators such as HIV prevalence among all adults, females, and males show very strong positive correlations (r > 0.99). This reflects their interdependence, where the trends across gender-specific and overall prevalence metrics are highly similar.

b. Moderate Positive Correlations Between Prevalence and Deaths

The correlation between HIV prevalence and AIDS-related deaths ranges from 0.56 to 0.63, indicating a moderate positive relationship. Higher prevalence tends to align with higher mortality, though not perfectly, which suggests that other factors, such as access to treatment and healthcare, may also influence mortality.

c. Negative Correlations Between GDP and HIV Indicators

GDP per capita shows weak to moderate negative correlations with HIV-related metrics, ranging from -0.26 to -0.58:

  • AIDS_Prevalence_All_adults_2023: -0.3735
  • AIDS_Prevalence_Female_adults_2023: -0.4748
  • AIDS_Prevalence_Male_adults_2023: -0.2559
  • AIDS_related_deaths_All_adults_2023: -0.5846
  • AIDS_related_deaths_Female_adults_2023: -0.5807
  • AIDS_related_deaths_Male_adults_2023: -0.5689

These findings are consistent with global trends, where higher GDP often correlates with better healthcare infrastructure, greater access to antiretroviral therapy, and more effective prevention programs. However, the magnitudes (roughly 0.26 to 0.58 in absolute value) indicate that the relationship is weak to moderate, so economic factors alone cannot fully explain variations in HIV prevalence or mortality. Other factors, such as health inequalities, access to healthcare, and national policies, are likely to play a significant role.

d. Implications for Further Analysis

The insights from this correlation matrix suggest the need for multivariate analysis to better understand the combined effects of economic and non-economic factors on HIV prevalence and mortality.

Step 3: Visualization of Key Relationships (Scatter Plots)

Relationship between GDP and HIV Prevalence

ggplot(data_clean, aes(x = GPD_PCAP_2023, y = AIDS_Prevalence_All_adults_2023)) +
  geom_point(aes(color = as.factor(GDP_Category)), size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "blue", linetype = "dashed") +
  theme_minimal() +
  labs(title = "Relationship between GDP and HIV Prevalence",
       x = "GDP per Capita (2023)",
       y = "HIV Prevalence (%)",
       color = "GDP Category") +
  theme(plot.title = element_text(hjust = 0.5))  # Center-align the title
## `geom_smooth()` using formula = 'y ~ x'

Relationship between HIV Prevalence and Deaths

ggplot(data_clean, aes(x = AIDS_related_deaths_All_adults_2023, y = AIDS_Prevalence_All_adults_2023)) +
  geom_point(aes(color = as.factor(GDP_Category)), size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "blue", linetype = "dashed") +
  theme_minimal() +
  labs(title = "Relationship between HIV Prevalence and Deaths",
       x = "AIDS-related Deaths",
       y = "HIV Prevalence (%)",
       color = "GDP Category") +
  theme(plot.title = element_text(hjust = 0.5))  # Center-align the title
## `geom_smooth()` using formula = 'y ~ x'

Step 4: Statistical Significance Tests for Correlations

Objective: Confirm whether the detected correlations between GDP and HIV prevalence, and between HIV prevalence and deaths, are statistically significant.

Explanation of the significance test:

The Pearson or Spearman test evaluates whether the observed relationship between two variables is likely to have occurred by chance.

  • Null hypothesis: There is no true correlation (r = 0).
  • Alternative hypothesis: A true correlation exists (r ≠ 0).

The p-value helps determine whether the null hypothesis can be rejected:

  • p ≤ 0.05: The correlation is statistically significant.
  • p > 0.05: We fail to reject the null hypothesis, meaning the correlation is not considered statistically significant.

Statistical Significance Test for GDP and Prevalence Correlation

# Correlation test for GDP and prevalence (all adults)
cor_GDP_prevalence <- cor.test(data_clean$GPD_PCAP_2023, data_clean$AIDS_Prevalence_All_adults_2023, method = "spearman")
print(cor_GDP_prevalence)
## 
##  Spearman's rank correlation rho
## 
## data:  data_clean$GPD_PCAP_2023 and data_clean$AIDS_Prevalence_All_adults_2023
## S = 348130, p-value = 3.93e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.3735126

Key Statistics

  • Spearman’s rho = -0.3735126
  • p-value = 3.93e-05
  • Conclusion: There is a moderate negative correlation between GDP per capita and HIV prevalence among all adults. The p-value is less than 0.05, indicating that this correlation is statistically significant. This suggests that HIV prevalence tends to decrease as GDP per capita increases, although the relationship is not strong.

Statistical Significance Test for Prevalence and Deaths Correlation:

# Correlation test for prevalence and deaths (all adults)
cor_prevalence_deaths <- cor.test(data_clean$AIDS_Prevalence_All_adults_2023, data_clean$AIDS_related_deaths_All_adults_2023, method = "spearman")
print(cor_prevalence_deaths)
## 
##  Spearman's rank correlation rho
## 
## data:  data_clean$AIDS_Prevalence_All_adults_2023 and data_clean$AIDS_related_deaths_All_adults_2023
## S = 97925, p-value = 3.064e-13
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.6136468

Key Statistics

  • Spearman’s rho = 0.6136468
  • p-value = 3.064e-13
  • Conclusion: There is a moderate positive correlation between HIV prevalence and AIDS-related deaths among all adults. The p-value is less than 0.05, indicating that this correlation is statistically significant. This suggests that as HIV prevalence increases, the number of AIDS-related deaths also tends to increase, though the correlation is moderate in strength.

3. Detection of Anomalies

Objective: Identify unusual or unexpected observations in the dataset.

Potential types of anomalies:

  • Countries with high prevalence and low deaths.
  • Countries with high GDP and high prevalence (or the reverse).

Step 1: Detecting Anomalies

Rules to Flag Anomalies

  1. High Prevalence and Low Deaths:
    • Condition:
      • HIV prevalence among all adults is in the top 25% (≥ 75th percentile).
      • AIDS-related deaths among all adults are in the bottom 25% (≤ 25th percentile).
  2. High GDP and High Prevalence:
    • Condition:
      • GDP per capita is in the top 25% (≥ 75th percentile).
      • HIV prevalence among all adults is in the top 25% (≥ 75th percentile).
  3. Low GDP and Low Prevalence:
    • Condition:
      • GDP per capita is in the bottom 25% (≤ 25th percentile).
      • HIV prevalence among all adults is in the bottom 50% (≤ 50th percentile).
  4. Low GDP and High Deaths:
    • Condition:
      • GDP per capita is in the bottom 25% (≤ 25th percentile).
      • AIDS-related deaths among all adults are in the top 25% (≥ 75th percentile).
  5. High Deaths and Low Prevalence:
    • Condition:
      • AIDS-related deaths among all adults are in the top 25% (≥ 75th percentile).
      • HIV prevalence among all adults is in the bottom 50% (≤ 50th percentile).
  6. Normal:
    • Condition:
      • Any observation that does not meet the conditions for the above anomaly categories.

Thresholds Used

  • High Prevalence Threshold: 75th percentile of AIDS_Prevalence_All_adults_2023.
  • Low Prevalence Threshold: 50th percentile of AIDS_Prevalence_All_adults_2023.
  • High Deaths Threshold: 75th percentile of AIDS_related_deaths_All_adults_2023.
  • Low Deaths Threshold: 25th percentile of AIDS_related_deaths_All_adults_2023.
  • High GDP Threshold: 75th percentile of GPD_PCAP_2023.
  • Low GDP Threshold: 25th percentile of GPD_PCAP_2023.
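
The code that creates the anomaly column is not shown in this report. The rules and thresholds above could be implemented along the lines of the sketch below. The helper name `flag_anomalies` and the category labels are assumptions, and so is the precedence: a country can satisfy more than one rule, and this sketch assigns the first rule that matches, in the order listed.

```r
library(dplyr)

# Sketch only: `flag_anomalies`, the labels and the rule precedence are assumptions.
flag_anomalies <- function(df) {
  # Percentile thresholds, as listed above.
  prev_high   <- quantile(df$AIDS_Prevalence_All_adults_2023, 0.75)
  prev_low    <- quantile(df$AIDS_Prevalence_All_adults_2023, 0.50)
  deaths_high <- quantile(df$AIDS_related_deaths_All_adults_2023, 0.75)
  deaths_low  <- quantile(df$AIDS_related_deaths_All_adults_2023, 0.25)
  gdp_high    <- quantile(df$GPD_PCAP_2023, 0.75)
  gdp_low     <- quantile(df$GPD_PCAP_2023, 0.25)

  # case_when() evaluates conditions in order and keeps the first match.
  df %>%
    mutate(anomaly = case_when(
      AIDS_Prevalence_All_adults_2023 >= prev_high &
        AIDS_related_deaths_All_adults_2023 <= deaths_low ~ "High Prevalence and Low Deaths",
      GPD_PCAP_2023 >= gdp_high &
        AIDS_Prevalence_All_adults_2023 >= prev_high ~ "High GDP and High Prevalence",
      GPD_PCAP_2023 <= gdp_low &
        AIDS_Prevalence_All_adults_2023 <= prev_low ~ "Low GDP and Low Prevalence",
      GPD_PCAP_2023 <= gdp_low &
        AIDS_related_deaths_All_adults_2023 >= deaths_high ~ "Low GDP and High Deaths",
      AIDS_related_deaths_All_adults_2023 >= deaths_high &
        AIDS_Prevalence_All_adults_2023 <= prev_low ~ "High Deaths and Low Prevalence",
      TRUE ~ "Normal"
    ))
}
```

With the report's data, this would be applied as `data_clean <- flag_anomalies(data_clean)` before filtering on `anomaly != "Normal"`. Changing the order of the conditions changes which label an overlapping country receives, so the counts per anomaly type depend on that choice.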

Objective

These rules aim to identify countries with unusual combinations of HIV prevalence, AIDS-related deaths, and GDP, which could highlight disparities or inefficiencies in healthcare systems, access to treatment, or disease management strategies.

Anomalies by Country and Type

# View anomalies in the data
anomalies <- data_clean %>% filter(anomaly != "Normal")

# Display anomalies with DT (Only Country and Anomaly)
datatable(anomalies[, c("Country", "anomaly")], options = list(
              
          ),
          caption = 'Table: Anomalies by Country and Type')

Anomalies and Countries

# List countries by anomaly type, keeping the countries in one line
anomalies_list <- anomalies %>%
  group_by(anomaly) %>%
  summarise(Countries = paste(unique(Country), collapse = ", ")) %>%
  arrange(anomaly)

# Display the anomalies list with countries on the same line
datatable(anomalies_list, options = list(
  pageLength = 10,
  autoWidth = TRUE,
  columnDefs = list(list(targets = 1, width = '300px'))  # Adjust column width for readability
),
caption = 'Anomalies and Countries')

Summary of anomalies by type

# Count the observations in each anomaly type
anomalies_summary <- anomalies %>%
  group_by(anomaly) %>%
  summarise(Count = n())

# Display the summary table
datatable(anomalies_summary, caption = 'Anomalies summary')

1. Interpretation of Results

a) Correlation Between GDP and HIV Prevalence

A moderate negative correlation (-0.3735) was observed between GDP per capita and HIV prevalence, indicating that countries with higher GDP per capita tend to have lower HIV prevalence. However, this relationship is not absolute. For instance, South Africa exhibits high prevalence despite having a relatively high GDP per capita for sub-Saharan Africa.

Two factors may explain this pattern:
- Better healthcare infrastructure and prevention programs in wealthier countries, which would account for the overall negative relationship.
- Internal economic inequalities (among social classes) within countries, which can sustain high prevalence even where average income is higher.
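
A correlation of this kind can be computed along the following lines. This is a hedged sketch on invented toy values, so it will not reproduce the -0.3735 figure. The report does not state which coefficient was used, so both Pearson and Spearman are shown as assumptions; the column names follow the ones used elsewhere in this report.

```r
# Toy data: invented values, NOT the report's dataset.
toy <- data.frame(
  GPD_PCAP_2023                   = c(900, 2500, 5000, 8000, 12000, 15000, 30000, 45000),
  AIDS_Prevalence_All_adults_2023 = c(9.0, 3.0, 18.0, 1.5, 0.8, 0.4, 0.2, 0.1)
)

# Pearson: measures linear association; sensitive to skewed GDP values.
pearson <- cor.test(toy$GPD_PCAP_2023,
                    toy$AIDS_Prevalence_All_adults_2023,
                    method = "pearson")

# Spearman: rank-based, so it is less affected by a few very rich
# or very high-prevalence countries.
spearman <- cor.test(toy$GPD_PCAP_2023,
                     toy$AIDS_Prevalence_All_adults_2023,
                     method = "spearman")

r_pearson  <- unname(pearson$estimate)
r_spearman <- unname(spearman$estimate)

cat("Pearson r:    ", round(r_pearson, 4),  " p =", signif(pearson$p.value, 3),  "\n")
cat("Spearman rho: ", round(r_spearman, 4), " p =", signif(spearman$p.value, 3), "\n")
```

Reporting the p-value alongside the coefficient helps separate a weak-but-real association from sampling noise, and comparing the two coefficients gives a quick check on whether a handful of extreme countries is driving the result.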

b) Gender Disparities

Women represent a disproportionately high share of people living with HIV, especially in sub-Saharan Africa, where their prevalence rates are often double those of men.

Factors Contributing to Gender Disparities:

  1. Biological factors: Women are biologically more susceptible to infection during heterosexual intercourse.
  2. Socio-economic vulnerabilities:
    • Gender norms
    • Early marriages
    • Lack of economic power
    • Limited access to education

The geographic maps included in the analysis highlight critical zones like West Africa and parts of South Asia.

c) Detected Anomalies

2. Countries with High AIDS Mortality Rates Despite Low Prevalence:

  • Characteristics: Healthcare access gaps, especially in low-income countries.

Discussion

The analysis presented in this study offers critical insights into the socio-economic and gendered dynamics of the HIV/AIDS epidemic. The findings contribute to existing literature while uncovering anomalies and trends that demand attention in public health policymaking.

Economic Determinants and HIV Prevalence

The negative correlation observed between GDP per capita and HIV prevalence (-0.3735) suggests that higher-income nations generally have better health outcomes concerning HIV. Plausible explanations for this trend include:
- Improved healthcare infrastructure: Including prevention programs and widespread access to antiretroviral therapies (ARVs).
- Greater investment in public health awareness campaigns: Alongside early detection programs.

However, the correlation’s moderate strength underscores the complexity of the HIV epidemic. Upper-middle-income countries such as South Africa illustrate that economic strength alone does not contain the epidemic. Structural inequalities, uneven healthcare distribution, and cultural factors can offset the benefits of economic growth.

This finding aligns with previous studies highlighting that the epidemic disproportionately affects marginalized populations, even in wealthier nations.

Gender Disparities in HIV Prevalence

The disproportionate burden of HIV among women in low-income countries, especially in sub-Saharan Africa, is both a biological and socio-cultural issue. Women face higher biological susceptibility to infection, but social determinants (such as gender inequality, lack of education, and economic dependence) amplify their vulnerability.

Key Findings:

  • Early marriages and gender-based violence remain pervasive in certain regions, directly increasing HIV risks among women and adolescent girls.
  • Gender norms often limit women’s ability to negotiate safer sexual practices or access healthcare services, further exacerbating the epidemic’s impact on them.

This disparity highlights the urgent need for gender-sensitive policies and interventions that empower women through:
- Education
- Economic independence
- Healthcare access

Anomalies in Healthcare Access

The study identified anomalies where countries with high HIV prevalence demonstrated low AIDS-related mortality rates (e.g., Botswana). This suggests that healthcare access, specifically ARV coverage, plays a pivotal role in mitigating AIDS-related deaths.

Conversely, countries with low prevalence but high AIDS mortality highlight significant gaps in healthcare systems, particularly in reaching underserved populations.

These findings reveal that healthcare outcomes are not solely dependent on prevalence but are significantly influenced by:
- Access to life-saving treatments.
- The efficiency of healthcare delivery systems.

Policy and Practical Implications

The trends and disparities observed in this analysis reinforce the need for tailored public health interventions. Policies should prioritize:
1. Reducing systemic barriers to healthcare access, particularly for rural and marginalized populations.
2. Addressing socio-economic and gender inequalities that amplify the epidemic’s impact on women.
3. Strengthening healthcare systems to close gaps in treatment availability and ensure equitable access.

Moreover, anomalies in the data emphasize the importance of localized strategies that consider the unique challenges faced by each country or region. A one-size-fits-all approach is insufficient in addressing the diverse determinants of the HIV epidemic.

Limitations and Future Research

While the study offers valuable insights, it is important to acknowledge its limitations:
- The reliance on secondary data may not fully capture localized nuances or under-documented populations, such as residents of informal settlements or migrant workers.
- Correlation does not imply causation; additional research is needed to understand the causal pathways between socio-economic factors and HIV prevalence.

Future Research Directions:

  • Exploring the role of cultural factors in shaping HIV-related behaviors and outcomes.
  • Investigating the impact of emerging technologies, such as digital health platforms, in improving data collection and healthcare delivery in low-resource settings.

Recommendations

a) Strengthen Universal Access to Care

  • Scale up the distribution of antiretroviral treatments (ARVs):
    Focus on rural and low-income areas through public-private partnerships and global initiatives such as the Global Fund.
  • Establish mobile health centers:
    Provide services to isolated and underserved communities.

b) Target Gender-Sensitive Policies

  • Community-based awareness programs:
    Focus on young girls and women, emphasizing prevention methods such as female condoms and sexual education.
  • Advancing gender equality:
    Promote laws and policies to combat early marriages and domestic violence.

c) Improve National Data Collection

  • Implement real-time data collection tools:
    Use online platforms or mobile applications to track new infections and trends.
  • Collaborate with international research institutions:
    Strengthen data-driven decision-making through partnerships with governments and global organizations.

d) Promote Socio-Economic Interventions

  • Invest in girls’ education:
    Studies show that women with higher education levels are less likely to contract HIV.
  • Develop microcredit programs:
    Empower women economically to reduce their vulnerability.

e) Multidisciplinary Interventions

  • Integrate health, education, and economic development:
    Adopt a holistic approach to address structural vulnerabilities contributing to the HIV epidemic.