Project 1: How AIDS Affects Different Populations
Load Libraries and Data
#tidyverse also attaches readr, ggplot2, and dplyr; they are loaded explicitly here as well
library(readr)
library(ggplot2)
library(tidyverse)
library(dplyr)
library(ggfortify)
library(plotly)
setwd("~/Desktop/datasets/")
aids <- read_csv("aids.csv")
About the Data
AIDS (Acquired Immunodeficiency Syndrome) is a deadly disease that develops from infection with HIV (Human Immunodeficiency Virus). The “AIDS” dataset contains information on populations affected by the disease in Asian and African countries. The data is collected by the UNAIDS Organization, an entity of the United Nations focusing on the epidemic. In my research, I plan to explore the progression from HIV to AIDS in adult populations.
The dataset has 2,759 rows and 23 columns. Twenty of the 23 variables fall into four groups: AIDS-related deaths, HIV prevalence, new HIV infections, and people living with HIV. Each group contains 3 to 6 variables, one for each affected population (men, women, children, etc.). These variables are defined below:
Data.AIDS-Related Deaths.(AIDS Orphans, Adults, All Ages, Children, Female Adults, Male Adults): a discrete, numerical variable containing the number of AIDS-related deaths (with respect to each population type)
Data.HIV Prevalence.(Adults, Young Men, Young Women): a continuous, numerical variable containing HIV prevalence as a percentage of the population (with respect to each population type)
Data.New HIV Infections.(Young Adults, Male Adults, Female Adults, Children, All Ages, Adults): a discrete, numerical variable containing the number of new HIV infections (with respect to each population type)
Data.People Living with HIV.(Total, Male Adults, Female Adults, Children, Adults): a discrete, numerical variable containing the number of people living with HIV (with respect to each population type)
Other variables:
Year: a discrete, numerical variable containing the year the infection data is from
Country: a nominal, categorical variable containing the country the infection data is from
Data.New HIV Infections.Incidence Rate Among Adults: a continuous, numerical variable containing the incidence rate of new HIV infections among adults
Cleaning:
sum(is.na(aids))
[1] 0
#no NA values were found.
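A single total is enough when it comes back as 0, but if it were nonzero it would not say which columns have gaps. The sketch below shows a per-column count with base R on a small made-up data frame; `toy` and its column names are hypothetical and are not taken from aids.csv.

```r
# Hypothetical toy data, for illustration only (not from aids.csv).
toy <- data.frame(
  country    = c("A", "B", "C"),
  deaths     = c(100, NA, 300),
  prevalence = c(0.5, 1.2, NA)
)

# Total number of missing cells, as in the check above
total_na <- sum(is.na(toy))

# Missing cells per column, which shows where any cleaning is needed
na_per_column <- colSums(is.na(toy))

print(total_na)
print(na_per_column)
```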
#filter to the year 1990 only
aids <- aids |>
  filter(Year == 1990)
Linear Regression:
Prepare Data
#create new dataset for linear regression (easier to read and manipulate)
aids2 <- aids |>
  #choose 2 quantitative variables
  select("Country", "Data.HIV Prevalence.Young Men", "Data.AIDS-Related Deaths.Male Adults")
First Plot
#first scatterplot graph (p1)
p1 <- ggplot(aids2, aes(x = `Data.AIDS-Related Deaths.Male Adults`, y = `Data.HIV Prevalence.Young Men`)) +
  labs(title = "AIDS RELATED DEATHS FOR MALE ADULTS VERSUS \n HIV PREVALENCE RATE IN YOUNG MEN",
       caption = "Source: UNAIDS",
       x = "AIDS-Related Deaths for Male Adults",
       y = "HIV Prevalence Rate in Young Men") +
  theme_minimal(base_size = 12)
p1 + geom_point()
#there are 2 outliers: locate each country by sorting the dataset in descending order on both variables
aids2 |>
  arrange(desc(`Data.AIDS-Related Deaths.Male Adults`))
# A tibble: 89 × 3
   Country                         Data.HIV Prevalence.…¹ Data.AIDS-Related De…²
   <chr>                                            <dbl>                  <dbl>
 1 Uganda                                             2.3                  15000
 2 Zimbabwe                                           6.6                   7300
 3 United Republic of Tanzania                        1.5                   5900
 4 Malawi                                             2.3                   4700
 5 Zambia                                             3.1                   4100
 6 Nigeria                                            0.3                   3900
 7 Kenya                                              2.3                   3600
 8 Côte d'Ivoire                                      1.1                   3500
 9 Ethiopia                                           0.5                   3200
10 Democratic Republic of the Con…                    0.4                   2800
# ℹ 79 more rows
# ℹ abbreviated names: ¹`Data.HIV Prevalence.Young Men`,
#   ²`Data.AIDS-Related Deaths.Male Adults`
aids2 |>
  arrange(desc(`Data.HIV Prevalence.Young Men`))
# A tibble: 89 × 3
   Country                     Data.HIV Prevalence.Youn…¹ Data.AIDS-Related De…²
   <chr>                                            <dbl>                  <dbl>
 1 Zimbabwe                                           6.6                   7300
 2 Zambia                                             3.1                   4100
 3 Botswana                                           2.8                    500
 4 Kenya                                              2.3                   3600
 5 Malawi                                             2.3                   4700
 6 Uganda                                             2.3                  15000
 7 Burundi                                            1.6                   1400
 8 Thailand                                           1.6                   1000
 9 United Republic of Tanzania                        1.5                   5900
10 Burkina Faso                                       1.4                   1700
# ℹ 79 more rows
# ℹ abbreviated names: ¹`Data.HIV Prevalence.Young Men`,
#   ²`Data.AIDS-Related Deaths.Male Adults`
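The two outliers were picked by eye from the sorted output. A rule-based check can back up that choice. The sketch below applies the common 1.5 × IQR upper-fence rule to the ten rows printed just above. Because those are only the ten highest-prevalence countries and not the full 89-country dataset, the fences are illustrative only. The helper `is_high_outlier` and the short column names are assumptions made for this example.

```r
# Values copied from the ten rows printed above (top 10 by prevalence only,
# so the fences below do not reflect the full dataset).
top10 <- data.frame(
  country = c("Zimbabwe", "Zambia", "Botswana", "Kenya", "Malawi", "Uganda",
              "Burundi", "Thailand", "United Republic of Tanzania", "Burkina Faso"),
  prevalence = c(6.6, 3.1, 2.8, 2.3, 2.3, 2.3, 1.6, 1.6, 1.5, 1.4),
  deaths = c(7300, 4100, 500, 3600, 4700, 15000, 1400, 1000, 5900, 1700)
)

# Hypothetical helper: TRUE where a value lies above Q3 + k * IQR
is_high_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  x > q[2] + k * (q[2] - q[1])
}

# Flag a country if it is a high outlier on either variable
flagged <- top10$country[
  is_high_outlier(top10$prevalence) | is_high_outlier(top10$deaths)
]
print(flagged)
```

On these ten rows the rule flags Zimbabwe (prevalence) and Uganda (deaths), the same two countries identified visually.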
Second Plot (without outliers)
#remove outliers (Uganda and Zimbabwe)
aids2 <- aids2 |>
  filter(!Country %in% c("Uganda", "Zimbabwe"))
#replot
p2 <- ggplot(aids2, aes(x = `Data.AIDS-Related Deaths.Male Adults`, y = `Data.HIV Prevalence.Young Men`)) +
  labs(title = "AIDS RELATED DEATHS FOR MALE ADULTS VERSUS \n HIV PREVALENCE RATE IN YOUNG MEN",
       caption = "Source: UNAIDS",
       x = "AIDS-Related Deaths for Male Adults",
       y = "HIV Prevalence Rate in Young Men") +
  theme_minimal(base_size = 12)
p2 + geom_point()
Create Linear Regression Model
#set both axes to start at 0 (the upper limits are the largest remaining values)
p3 <- p2 + geom_point() + xlim(0, 5900) + ylim(0, 3.1)
#smoother in red w/ confidence interval
p4 <- p3 + geom_smooth(color = "red")
#linear regression with confidence interval
p5 <- p3 + geom_smooth(method = "lm", formula = y ~ x)
#make the line thin and dot-dashed, and remove the confidence interval band
#(linewidth replaces the deprecated size argument for lines in ggplot2 3.4.0 and later)
p6 <- p3 + geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linetype = "dotdash", linewidth = .2)
p6
Correlation, Fit the predictor variable x into the model to predict y
cor(aids2$`Data.AIDS-Related Deaths.Male Adults`, aids2$`Data.HIV Prevalence.Young Men`)
[1] 0.6408574
#r = 0.641 indicates a moderately strong positive linear correlation between the two variables
fit1 <- lm(`Data.AIDS-Related Deaths.Male Adults` ~ aids2$`Data.HIV Prevalence.Young Men`, data = aids2)
summary(fit1)

Call:
lm(formula = `Data.AIDS-Related Deaths.Male Adults` ~ aids2$`Data.HIV Prevalence.Young Men`,
    data = aids2)

Residuals:
    Min      1Q  Median      3Q     Max
-3041.2  -182.2  -182.2   -65.0  3927.9

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)
(Intercept)                              161.5      115.9   1.393    0.167
aids2$`Data.HIV Prevalence.Young Men`   1207.1      156.8   7.697  2.3e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 901.2 on 85 degrees of freedom
Multiple R-squared:  0.4107,	Adjusted R-squared:  0.4038
F-statistic: 59.24 on 1 and 85 DF,  p-value: 2.3e-11
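One caution about the call above: writing the predictor as `aids2$...` inside the formula while also passing `data = aids2` gives the same coefficients as using the bare column name, but the fitted model then ignores `newdata` in `predict()`. The sketch below shows the difference on a small made-up data frame; `toy`, `prevalence`, and `deaths` are hypothetical names and values used only for illustration.

```r
# Hypothetical data, for illustration only.
toy <- data.frame(
  prevalence = c(0.2, 0.5, 1.0, 1.5, 2.0, 3.0),
  deaths     = c(300, 900, 1200, 2100, 2400, 3900)
)

# Same style as the call above: predictor referenced through toy$...
fit_dollar <- lm(deaths ~ toy$prevalence, data = toy)
# Bare column name, looked up in `data`
fit_plain <- lm(deaths ~ prevalence, data = toy)

# Both fits produce the same coefficient values
same_coefs <- isTRUE(all.equal(unname(coef(fit_dollar)), unname(coef(fit_plain))))

new_obs <- data.frame(prevalence = 2.5)
# fit_dollar cannot find `toy$prevalence` in new_obs, so predict() warns and
# falls back to the original six rows instead of returning one prediction
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = new_obs))
pred_plain  <- predict(fit_plain, newdata = new_obs)

print(length(pred_dollar))
print(length(pred_plain))
```

With the bare-name form, `predict()` returns a single value for the single new observation, which is what a later prediction step would need.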
Conclusion 1:
Final equation: Data.AIDS-Related Deaths.Male Adults = 1207.1 × (Data.HIV Prevalence.Young Men) + 161.5
In summary, with the outliers Uganda and Zimbabwe removed, HIV prevalence among young men in 1990 is moderately to strongly correlated with the number of AIDS-related deaths among male adults (r = 0.641). In other words, for each one-percentage-point increase in HIV prevalence among young men, the model predicts 1,207.1 additional AIDS-related deaths among male adults. The adjusted R-squared tells us that about 40% of the variation in male adult deaths is explained by the model. The p-value (2.3e-11) shows that the predictor (HIV prevalence among young men) is statistically significant in explaining the variation in AIDS-related deaths among male adults.
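As a quick check on the equation, the sketch below plugs in the United Republic of Tanzania, which the sorted output lists at 1.5% prevalence and 5,900 deaths. It also squares the reported correlation to compare it with the Multiple R-squared from the summary. The coefficients are the rounded values printed in the summary, so results are approximate; `predict_deaths` is a hypothetical helper name.

```r
# Rounded coefficients copied from the model summary
intercept <- 161.5
slope <- 1207.1

# Hypothetical helper: predicted male adult deaths for a given prevalence (%)
predict_deaths <- function(prevalence_pct) intercept + slope * prevalence_pct

# United Republic of Tanzania: prevalence 1.5, observed deaths 5900
tanzania_pred  <- predict_deaths(1.5)
tanzania_resid <- 5900 - tanzania_pred

# In simple linear regression, R-squared equals the squared correlation
r_squared_from_r <- 0.6408574^2

print(round(tanzania_pred, 2))
print(round(tanzania_resid, 2))
print(round(r_squared_from_r, 4))
```

The residual for Tanzania comes out at about 3,928, in line with the maximum residual in the summary, and the squared correlation rounds to 0.4107, matching the Multiple R-squared.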
Visualization
Prepare data for bar graph
#identify top 3 countries for AIDS related adult deaths
aids |>
  arrange(desc(`Data.AIDS-Related Deaths.Adults`))
# A tibble: 89 × 23
   Country                    Year Data.AIDS-Related De…¹ Data.AIDS-Related De…²
   <chr>                     <dbl>                  <dbl>                  <dbl>
 1 Uganda                     1990                 190000                  31000
 2 Zimbabwe                   1990                  68000                  14000
 3 Malawi                     1990                  57000                  11000
 4 United Republic of Tanza…  1990                  56000                  11000
 5 Democratic Republic of t…  1990                  49000                   9600
 6 Zambia                     1990                  44000                   8000
 7 Kenya                      1990                  36000                   7600
 8 Nigeria                    1990                  37000                   7000
 9 Ethiopia                   1990                  30000                   6600
10 Côte d'Ivoire              1990                  28000                   6100
# ℹ 79 more rows
# ℹ abbreviated names: ¹`Data.AIDS-Related Deaths.AIDS Orphans`,
#   ²`Data.AIDS-Related Deaths.Adults`
# ℹ 19 more variables: `Data.AIDS-Related Deaths.All Ages` <dbl>,
#   `Data.AIDS-Related Deaths.Children` <dbl>,
#   `Data.AIDS-Related Deaths.Female Adults` <dbl>,
#   `Data.AIDS-Related Deaths.Male Adults` <dbl>, …
#new dataset with top 3 countries
aids_top3 <- aids |>
  filter(Country %in% c("Uganda", "Zimbabwe", "Malawi"))
Bar graph
ggplot(aids_top3, aes(x = Country, y = `Data.AIDS-Related Deaths.Adults`, fill = Country)) +
  geom_bar(stat = "identity") +
  labs(
    title = "AIDS-Related Deaths per Country",
    x = "Country",
    y = "AIDS-Related Adult Deaths"
  ) +
  scale_fill_manual(values = c(
    "Uganda" = "pink",
    "Zimbabwe" = "purple",
    "Malawi" = "blue"
  )) +
  theme_minimal()
Conclusion 2
To clean the dataset, I first checked for NAs (there were none, so minimal cleaning was needed). Afterwards, I filtered for the year 1990 so that each country appears only once; repeated yearly measurements of the same country would violate the assumption of independent observations. The visualization above is a bar graph of the three countries with the most AIDS-related adult deaths. I thought it was interesting that most other countries have similar death counts, while Uganda is an outlier in many of the death-related categories. This shows in the bar graph, where Uganda rises far above Zimbabwe and Malawi. If I were to continue this research, I would investigate Uganda specifically to look for internal patterns that could explain why its death count is so high. Because the dataset contains many variables and years, I would start by creating an AIDS dataset solely for Uganda and performing a multiple linear regression.