Project 1: How AIDS Affects Different Populations

Author

Emme Gunther

Published

March 30, 2026

Load Libraries and Data

library(readr)
library(ggplot2)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ stringr   1.6.0
✔ forcats   1.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggfortify)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

setwd("~/Desktop/datasets/")
aids <- read_csv("aids.csv")

Rows: 2759 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Country
dbl (22): Year, Data.AIDS-Related Deaths.AIDS Orphans, Data.AIDS-Related Dea...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

About the Data

AIDS (Acquired Immunodeficiency Syndrome) is a deadly disease that develops from HIV (Human Immunodeficiency Virus) infection. The “AIDS” dataset contains information on populations affected by the disease, covering Asian and African countries. The data are collected by UNAIDS (the Joint United Nations Programme on HIV/AIDS). The dataset has 2,759 rows and 23 columns. Twenty of the 23 variables are split into 4 groups: AIDS-related deaths, HIV prevalence, new HIV infections, and people living with HIV. Each group contains 3 to 6 variables, one for each affected population (men, women, children, etc.). These variables are defined below:

Data.AIDS-Related Deaths.(AIDS Orphans, Adults, All Ages, Children, Female Adults, Male Adults): a discrete, numerical variable containing the number of AIDS-related deaths (with respect to each population type)
Data.HIV Prevalence.(Adults, Young Men, Young Women): a continuous, numerical variable containing HIV prevalence as a percentage of the population (with respect to each population type)
Data.New HIV Infections.(Young Adults, Male Adults, Female Adults, Children, All Ages, Adults): a discrete, numerical variable containing the number of new HIV infections (with respect to each population type)
Data.People Living with HIV.(Total, Male Adults, Female Adults, Children, Adults): a discrete, numerical variable containing the number of people living with HIV (with respect to each population type)

Other variables:

Year: a discrete, numerical variable containing the year the infection data is from

Country: a nominal, categorical variable containing the country the infection data is from

Data.New HIV Infections.Incidence Rate Among Adults: a continuous, numerical variable containing the new HIV infection incidence rate among adults
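
The grouping described above can be checked directly from the column names, since each name follows the pattern Data.<group>.<population>. The following is a minimal base R sketch, assuming a hypothetical stand-in data frame `demo` that borrows a few of the real column names; its values are placeholders, not UNAIDS figures.

```r
# Hypothetical stand-in for `aids`; the values are placeholders for illustration only
demo <- data.frame(
  Country = c("A", "B"),
  Year = c(1990, 1990),
  `Data.HIV Prevalence.Adults` = c(1.0, 2.0),
  `Data.HIV Prevalence.Young Men` = c(0.5, 1.5),
  `Data.HIV Prevalence.Young Women` = c(0.7, 1.8),
  `Data.AIDS-Related Deaths.Adults` = c(100, 200),
  check.names = FALSE  # keep the spaces and hyphens in the column names
)

# Keep only the "Data." columns, then extract the text between the first two dots
data_cols <- grep("^Data\\.", names(demo), value = TRUE)
groups <- sub("^Data\\.([^.]+)\\..*$", "\\1", data_cols)

# Number of columns in each group
group_counts <- table(groups)
print(group_counts)
```

Running the same three lines on `names(aids)` would give the size of each group in the real dataset.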

Cleaning:

sum(is.na(aids))

[1] 0

#no NA values were found. 

#filter to the year 1990 only 
aids <- aids |>
  filter(Year == 1990)
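
The two cleaning steps above (counting NAs and filtering to one year) can be made a little more informative by counting NAs per column and by confirming that each country appears once after the filter. This is a sketch in base R on a hypothetical stand-in data frame `demo`; the column `deaths` and all values are placeholders.

```r
# Hypothetical stand-in for the full `aids` data; values are placeholders
demo <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year = c(1990, 1991, 1990, 1991),
  deaths = c(100, NA, 300, 400)
)

# Missing values per column, rather than one grand total
na_by_col <- colSums(is.na(demo))
print(na_by_col)

# Keep 1990 only, then confirm that no country is repeated
demo_1990 <- demo[demo$Year == 1990, ]
one_row_per_country <- !any(duplicated(demo_1990$Country))
print(one_row_per_country)
```

A per-column count shows where missing values are concentrated, which a single total cannot do.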

Linear Regression:

Prepare Data

#create new dataset for linear regression (easier to read and manipulate)
aids2 <- aids |>
#choose 2 quantitative variables
  select("Country", "Data.HIV Prevalence.Young Men", "Data.AIDS-Related Deaths.Male Adults")

First Plot

#first scatterplot graph (p1)
p1 <- ggplot(aids2, aes(x = `Data.AIDS-Related Deaths.Male Adults`, y = `Data.HIV Prevalence.Young Men`)) +
  labs(title = "AIDS RELATED DEATHS FOR MALE ADULTS VERSUS \n HIV PREVALENCE RATE IN YOUNG MEN",
  caption = "Source: UNAIDS",
  x = "AIDS-Related Deaths for Male Adults", 
  y = "HIV Prevalence Rate in Young Men") +
  theme_minimal(base_size = 12)
p1 + geom_point()

#there are 2 outliers; locate the countries by sorting the dataset in descending order on each variable
aids2 |>
  arrange(desc(`Data.AIDS-Related Deaths.Male Adults`))

# A tibble: 89 × 3
   Country                         Data.HIV Prevalence.…¹ Data.AIDS-Related De…²
   <chr>                                            <dbl>                  <dbl>
 1 Uganda                                             2.3                  15000
 2 Zimbabwe                                           6.6                   7300
 3 United Republic of Tanzania                        1.5                   5900
 4 Malawi                                             2.3                   4700
 5 Zambia                                             3.1                   4100
 6 Nigeria                                            0.3                   3900
 7 Kenya                                              2.3                   3600
 8 Côte d'Ivoire                                      1.1                   3500
 9 Ethiopia                                           0.5                   3200
10 Democratic Republic of the Con…                    0.4                   2800
# ℹ 79 more rows
# ℹ abbreviated names: ¹`Data.HIV Prevalence.Young Men`,
#   ²`Data.AIDS-Related Deaths.Male Adults`

aids2 |>
  arrange(desc(`Data.HIV Prevalence.Young Men`))

# A tibble: 89 × 3
   Country                     Data.HIV Prevalence.Youn…¹ Data.AIDS-Related De…²
   <chr>                                            <dbl>                  <dbl>
 1 Zimbabwe                                           6.6                   7300
 2 Zambia                                             3.1                   4100
 3 Botswana                                           2.8                    500
 4 Kenya                                              2.3                   3600
 5 Malawi                                             2.3                   4700
 6 Uganda                                             2.3                  15000
 7 Burundi                                            1.6                   1400
 8 Thailand                                           1.6                   1000
 9 United Republic of Tanzania                        1.5                   5900
10 Burkina Faso                                       1.4                   1700
# ℹ 79 more rows
# ℹ abbreviated names: ¹`Data.HIV Prevalence.Young Men`,
#   ²`Data.AIDS-Related Deaths.Male Adults`
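
Instead of reading the outliers off the sorted tables, they could also be flagged with a rule. The sketch below applies the common 1.5 × IQR rule in base R to the first nine rows printed above (the tenth is left out because its country name is truncated in the output). The short column names and the helper `is_iqr_outlier` are hypothetical, and because only the nine highest-death countries are included, the fences differ from those of the full 89-row dataset.

```r
# First nine rows of the table sorted by male adult deaths (values copied from the output above)
d <- data.frame(
  Country = c("Uganda", "Zimbabwe", "United Republic of Tanzania", "Malawi",
              "Zambia", "Nigeria", "Kenya", "Côte d'Ivoire", "Ethiopia"),
  prevalence_young_men = c(2.3, 6.6, 1.5, 2.3, 3.1, 0.3, 2.3, 1.1, 0.5),
  deaths_male_adults = c(15000, 7300, 5900, 4700, 4100, 3900, 3600, 3500, 3200)
)

# Hypothetical helper: TRUE where a value lies more than 1.5 * IQR beyond the quartiles
is_iqr_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# A country is flagged if it is an outlier on either variable
flagged <- d[is_iqr_outlier(d$prevalence_young_men) |
               is_iqr_outlier(d$deaths_male_adults), ]
print(flagged)
```

On these nine rows the rule flags Uganda (deaths) and Zimbabwe (prevalence), which agrees with the two points picked out by eye.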

Second Plot (without outliers)

#remove outliers (Uganda and Zimbabwe)
aids2 <- aids2 |>
  filter(!Country %in% c("Uganda", "Zimbabwe"))

#re plot
p2 <- ggplot(aids2, aes(x = `Data.AIDS-Related Deaths.Male Adults`, y = `Data.HIV Prevalence.Young Men`)) +
  labs(title = "AIDS RELATED DEATHS FOR MALE ADULTS VERSUS \n HIV PREVALENCE RATE IN YOUNG MEN",
  caption = "Source: UNAIDS",
  x = "AIDS-Related Deaths for Male Adults", 
  y = "HIV Prevalence Rate in Young Men") +
  theme_minimal(base_size = 12)
p2 + geom_point()

Create Linear Regression Model

#fix axes to start at 0
p3 <- p2 + geom_point() + xlim(0,5900)+ ylim(0,3.1)

#smoother in red w/ confidence interval
p4 <- p3 + geom_smooth(color = "red")

#linear regression with confidence interval
p5 <- p3 + geom_smooth(method='lm',formula=y~x) 

#make the line dot-dashed and thin, and remove the confidence interval band
#(linewidth replaces the size argument for lines, which is deprecated since ggplot2 3.4.0)
p6 <- p3 + geom_smooth(method='lm',formula=y~x, se = FALSE, linetype= "dotdash", linewidth = .2)
p6

Correlation, Fit the predictor variable x into the model to predict y

cor(aids2$`Data.AIDS-Related Deaths.Male Adults`, aids2$`Data.HIV Prevalence.Young Men`)

[1] 0.6408574

#r = 0.641 is closer to 1 than to 0, so the two variables have a moderately strong positive correlation
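
A single correlation coefficient says nothing about its own uncertainty. As a sketch, `cor.test()` from base R reports a confidence interval alongside the estimate, and a Spearman version is less sensitive to the skew in the death counts. The eleven rows below are copied from the two printed tables above, excluding Uganda and Zimbabwe and the truncated row, so the numbers will not match the full 87-row result of 0.641.

```r
# Rows visible in the printed tables, outliers excluded (not the full dataset)
prevalence <- c(1.5, 2.3, 3.1, 0.3, 2.3, 1.1, 0.5, 2.8, 1.6, 1.6, 1.4)
deaths <- c(5900, 4700, 4100, 3900, 3600, 3500, 3200, 500, 1400, 1000, 1700)

# Pearson correlation with a 95% confidence interval
pearson <- cor.test(deaths, prevalence, method = "pearson")
print(pearson$estimate)
print(pearson$conf.int)

# Rank-based alternative; exact = FALSE because the data contain ties
spearman <- cor.test(deaths, prevalence, method = "spearman", exact = FALSE)
print(spearman$estimate)
```

On the real data the same calls would take the two `aids2` columns used in `cor()` above.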

fit1 <- lm(`Data.AIDS-Related Deaths.Male Adults` ~ aids2$`Data.HIV Prevalence.Young Men`, data = aids2)
summary(fit1)


Call:
lm(formula = `Data.AIDS-Related Deaths.Male Adults` ~ aids2$`Data.HIV Prevalence.Young Men`, 
    data = aids2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3041.2  -182.2  -182.2   -65.0  3927.9 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                              161.5      115.9   1.393    0.167    
aids2$`Data.HIV Prevalence.Young Men`   1207.1      156.8   7.697  2.3e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 901.2 on 85 degrees of freedom
Multiple R-squared:  0.4107,    Adjusted R-squared:  0.4038 
F-statistic: 59.24 on 1 and 85 DF,  p-value: 2.3e-11
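
One design note on the formula: writing the predictor as aids2$`...` inside `lm()` fits correctly, but `predict()` then cannot substitute new values for it. A minimal sketch of the alternative, referring to both columns by bare (backticked) name, is shown below. The data frame `d` holds only the eleven rows visible in the printed tables (outliers excluded), so its coefficients are illustrative and differ from the summary above.

```r
# Eleven rows copied from the printed tables; not the full 87-row dataset
d <- data.frame(
  `Data.HIV Prevalence.Young Men` = c(1.5, 2.3, 3.1, 0.3, 2.3, 1.1, 0.5, 2.8, 1.6, 1.6, 1.4),
  `Data.AIDS-Related Deaths.Male Adults` = c(5900, 4700, 4100, 3900, 3600, 3500,
                                              3200, 500, 1400, 1000, 1700),
  check.names = FALSE  # keep the original column names
)

# Both variables are looked up in `data`, so the model knows the predictor by name
fit <- lm(`Data.AIDS-Related Deaths.Male Adults` ~ `Data.HIV Prevalence.Young Men`, data = d)

# Predict deaths at 1% and 2% prevalence
new_obs <- data.frame(`Data.HIV Prevalence.Young Men` = c(1, 2), check.names = FALSE)
pred <- predict(fit, newdata = new_obs)
print(pred)
```

The difference between the two predictions equals the fitted slope, which is a quick check that `newdata` was actually used.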

Conclusion 1:

Final equation: Data.AIDS-Related Deaths.Male Adults = 1207.1(Data.HIV Prevalence.Young Men) + 161.5

In summary, after removing the two outliers (Uganda and Zimbabwe), HIV prevalence among young men in 1990 is moderately to strongly correlated with the number of AIDS-related deaths among male adults (r = 0.641). In other words, for every additional percentage point of HIV prevalence among young men, the model predicts an increase of about 1,207 AIDS-related male adult deaths. The adjusted R-squared tells us that about 40% of the variation in male adult deaths is explained by the model. The p-value (2.3e-11) shows that the predictor (HIV prevalence among young men) is statistically significant, although this describes an association across countries rather than a causal effect.
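
As a worked example of the final equation, the sketch below plugs a prevalence of 2% into the fitted line and confirms that the squared correlation matches the reported multiple R-squared. The coefficients and correlation are copied from the output above; the function name `predict_deaths` is hypothetical.

```r
# Coefficients and correlation copied from the summary() and cor() output above
intercept <- 161.5
slope <- 1207.1
r <- 0.6408574

# Hypothetical helper: predicted male adult deaths for a given prevalence (in percent)
predict_deaths <- function(prevalence_pct) intercept + slope * prevalence_pct

predicted_at_2 <- predict_deaths(2)
print(predicted_at_2)        # 161.5 + 1207.1 * 2 = 2575.7

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2
print(round(r_squared, 4))   # 0.4107, matching "Multiple R-squared"
```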

Visualization

Prepare data for bar graph

#identify top 3 countries for AIDS related adult deaths
aids |>
  arrange(desc(`Data.AIDS-Related Deaths.Adults`))

# A tibble: 89 × 23
   Country                    Year Data.AIDS-Related De…¹ Data.AIDS-Related De…²
   <chr>                     <dbl>                  <dbl>                  <dbl>
 1 Uganda                     1990                 190000                  31000
 2 Zimbabwe                   1990                  68000                  14000
 3 Malawi                     1990                  57000                  11000
 4 United Republic of Tanza…  1990                  56000                  11000
 5 Democratic Republic of t…  1990                  49000                   9600
 6 Zambia                     1990                  44000                   8000
 7 Kenya                      1990                  36000                   7600
 8 Nigeria                    1990                  37000                   7000
 9 Ethiopia                   1990                  30000                   6600
10 Côte d'Ivoire              1990                  28000                   6100
# ℹ 79 more rows
# ℹ abbreviated names: ¹`Data.AIDS-Related Deaths.AIDS Orphans`,
#   ²`Data.AIDS-Related Deaths.Adults`
# ℹ 19 more variables: `Data.AIDS-Related Deaths.All Ages` <dbl>,
#   `Data.AIDS-Related Deaths.Children` <dbl>,
#   `Data.AIDS-Related Deaths.Female Adults` <dbl>,
#   `Data.AIDS-Related Deaths.Male Adults` <dbl>, …

#new dataset with top 3 countries
aids_top3 <- aids |>
  filter(Country %in% c("Uganda", "Zimbabwe", "Malawi"))
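
The top three can also be selected programmatically rather than by typing the country names. The base R sketch below uses nine of the rows printed above (the row with a truncated country name is left out) and a hypothetical short column name `adult_deaths`. It also shows that Malawi and the United Republic of Tanzania are tied at 11,000 in the printed (possibly rounded) values, so a strict top three depends on how the tie is broken.

```r
# Adult deaths copied from the sorted output above (truncated row omitted)
d <- data.frame(
  Country = c("Uganda", "Zimbabwe", "Malawi", "United Republic of Tanzania",
              "Zambia", "Kenya", "Nigeria", "Ethiopia", "Côte d'Ivoire"),
  adult_deaths = c(31000, 14000, 11000, 11000, 8000, 7600, 7000, 6600, 6100)
)

# Strict top 3: sort in descending order and take the first three rows
ranked <- d[order(d$adult_deaths, decreasing = TRUE), ]
top3 <- head(ranked, 3)
print(top3)

# Top 3 keeping ties: every country at or above the third-largest value
cutoff <- top3$adult_deaths[3]
top3_with_ties <- d[d$adult_deaths >= cutoff, ]
print(top3_with_ties)
```

In dplyr, `slice_max()` offers the same choice through its `with_ties` argument.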

Bar graph

ggplot(aids_top3, aes(x = Country, y = `Data.AIDS-Related Deaths.Adults`, fill = Country)) +
  #geom_col() plots the values as given (same as geom_bar(stat = "identity"))
  geom_col() +
  labs(
    title = "AIDS-Related Deaths per Country",
    x = "Country",
    y = "AIDS-Related Adult Deaths"
  ) +
  scale_fill_manual(values = c(
    "Uganda" = "pink",
    "Zimbabwe" = "purple",
    "Malawi" = "blue"
  )) +
  theme_minimal()

Conclusion 2

To clean the dataset, I first checked for NAs (there were none, so minimal cleaning was involved). Afterwards, I filtered to the year 1990 so that each country appears only once, which avoids violating the regression assumption of independent observations (repeated yearly measurements of the same country are not independent). The visualization above is a bar graph of the 3 countries with the highest AIDS-related adult deaths. I thought it was interesting that most other countries have similar death counts, but Uganda is the outlier in many of the death-related categories. This shows in the bar graph, as Uganda rises very far above Malawi and Zimbabwe. If I were to further my research, I would investigate Uganda specifically to look for internal patterns that explain why its death count is so high. There are many variables and years in the dataset, so I would start by creating an AIDS dataset solely for Uganda and performing a multiple linear regression.