── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ stringr 1.6.0
✔ forcats 1.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggfortify)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 2759 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Country
dbl (22): Year, Data.AIDS-Related Deaths.AIDS Orphans, Data.AIDS-Related Dea...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
About the Data
AIDS (Acquired Immunodeficiency Syndrome) is a deadly disease that develops from infection with HIV (Human Immunodeficiency Virus). The “AIDS” dataset contains information on populations affected by the disease in Asian and African countries. The data is collected by UNAIDS, the Joint United Nations Programme on HIV/AIDS, which focuses on the epidemic. The dataset has 2,759 rows and 23 columns. Twenty of the 23 variables fall into 4 groups: AIDS-related deaths, HIV prevalence, new HIV infections, and people living with HIV. Each group contains 3 to 6 variables, one for each affected population (men, women, children, etc.). These variables are defined below:
Data.AIDS-Related Deaths.(AIDS Orphans, Adults, All Ages, Children, Female Adults, Male Adults): a discrete, numerical variable containing the number of AIDS-related deaths (with respect to each population type)
Data.HIV Prevalence.(Adults, Young Men, Young Women): a continuous, numerical variable containing HIV prevalence as a percentage of the population (with respect to each population type)
Data.New HIV Infections.(Young Adults, Male Adults, Female Adults, Children, All Ages, Adults): a discrete, numerical variable containing the number of new HIV infections (with respect to each population type)
Data.People Living with HIV.(Total, Male Adults, Female Adults, Children, Adults): a discrete, numerical variable containing the number of people living with HIV (with respect to each population type)
Other variables:
Year: a discrete, numerical variable containing the year the infection data is from
Country: a nominal, categorical variable containing the country the infection data is from
Data.New HIV Infections.Incidence Rate Among Adults: a continuous, numerical variable containing the new HIV infection incidence rate among adults
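The grouping described above can be checked by counting columns per name prefix. The sketch below is only illustrative: the column names are typed by hand from the descriptions in this section (their exact spelling is an assumption), and with the real data `names(aids)` would take the place of `cols`. Note that the incidence rate shares the "New HIV Infections" prefix, so a prefix count reports 7 columns there even though the rate is described separately from the six count variables.

```r
# Column names as described in this section (hand-typed, so spelling is an assumption).
cols <- c(
  "Country", "Year",
  paste0("Data.AIDS-Related Deaths.",
         c("AIDS Orphans", "Adults", "All Ages", "Children",
           "Female Adults", "Male Adults")),
  paste0("Data.HIV Prevalence.", c("Adults", "Young Men", "Young Women")),
  paste0("Data.New HIV Infections.",
         c("Young Adults", "Male Adults", "Female Adults", "Children",
           "All Ages", "Adults", "Incidence Rate Among Adults")),
  paste0("Data.People Living with HIV.",
         c("Total", "Male Adults", "Female Adults", "Children", "Adults"))
)

# The group is the middle part of "Data.<group>.<population>".
data_cols <- cols[startsWith(cols, "Data.")]
group <- sub("^Data\\.([^.]+)\\..*$", "\\1", data_cols)

# Number of columns in each group.
group_counts <- table(group)
print(group_counts)
```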
Cleaning:
sum(is.na(aids))
[1] 0
# no NA values were found
# filter to the year 1990 only
aids <- aids |>
  filter(Year == 1990)
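A slightly more detailed version of the same two cleaning steps is sketched below on a toy table. The data frame `toy` and its values are made up for illustration and are not part of the AIDS dataset. The sketch counts missing values per column rather than in total, then confirms that filtering to one year leaves one row per country.

```r
# Toy stand-in for the real `aids` table (values are made up for illustration).
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year    = c(1990, 1991, 1990, 1991),
  Deaths  = c(100, 120, NA, 80)
)

# Missing values per column, rather than one grand total.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep a single year so that each country contributes exactly one row.
toy_1990 <- toy[toy$Year == 1990, ]

# Fails loudly if any country still appears more than once.
stopifnot(!anyDuplicated(toy_1990$Country))
```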
Linear Regression:
Prepare Data
# create new dataset for linear regression (easier to read and manipulate)
aids2 <- aids |>
  # choose 2 quantitative variables
  select("Country", "Data.HIV Prevalence.Young Men", "Data.AIDS-Related Deaths.Male Adults")
First Plot
# first scatterplot graph (p1)
p1 <- ggplot(aids2, aes(x = `Data.AIDS-Related Deaths.Male Adults`, y = `Data.HIV Prevalence.Young Men`)) +
  labs(title = "AIDS RELATED DEATHS FOR MALE ADULTS VERSUS \n HIV PREVALENCE RATE IN YOUNG MEN",
       caption = "Source: UNAIDS",
       x = "AIDS-Related Deaths for Male Adults",
       y = "HIV Prevalence Rate in Young Men") +
  theme_minimal(base_size = 12)
p1 + geom_point()
# there are 2 outliers; locate the countries by sorting the dataset in descending order of male adult deaths
aids2 |>
  arrange(desc(`Data.AIDS-Related Deaths.Male Adults`))
# A tibble: 89 × 3
Country Data.HIV Prevalence.…¹ Data.AIDS-Related De…²
<chr> <dbl> <dbl>
1 Uganda 2.3 15000
2 Zimbabwe 6.6 7300
3 United Republic of Tanzania 1.5 5900
4 Malawi 2.3 4700
5 Zambia 3.1 4100
6 Nigeria 0.3 3900
7 Kenya 2.3 3600
8 Côte d'Ivoire 1.1 3500
9 Ethiopia 0.5 3200
10 Democratic Republic of the Con… 0.4 2800
# ℹ 79 more rows
# ℹ abbreviated names: ¹`Data.HIV Prevalence.Young Men`,
# ²`Data.AIDS-Related Deaths.Male Adults`
# remove outliers (Uganda and Zimbabwe)
aids2 <- aids2 |>
  filter(!Country %in% c("Uganda", "Zimbabwe"))
# re-plot
p2 <- ggplot(aids2, aes(x = `Data.AIDS-Related Deaths.Male Adults`, y = `Data.HIV Prevalence.Young Men`)) +
  labs(title = "AIDS RELATED DEATHS FOR MALE ADULTS VERSUS \n HIV PREVALENCE RATE IN YOUNG MEN",
       caption = "Source: UNAIDS",
       x = "AIDS-Related Deaths for Male Adults",
       y = "HIV Prevalence Rate in Young Men") +
  theme_minimal(base_size = 12)
p2 + geom_point()
Create Linear Regression Model
# fix axes to start at 0
p3 <- p2 + geom_point() + xlim(0, 5900) + ylim(0, 3.1)
# smoother in red w/ confidence interval
p4 <- p3 + geom_smooth(color = "red")
# linear regression with confidence interval
p5 <- p3 + geom_smooth(method = 'lm', formula = y ~ x)
# make the line dashed and remove the confidence interval band
# (linewidth replaces size for lines in ggplot2 3.4.0 and later)
p6 <- p3 + geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, linetype = "dotdash", linewidth = .2)
p6
Correlation, Fit the predictor variable x into the model to predict y
# r = 0.641 is closer to 1 than to 0, so the two variables have a fairly strong positive correlation
fit1 <- lm(`Data.AIDS-Related Deaths.Male Adults` ~ aids2$`Data.HIV Prevalence.Young Men`, data = aids2)
summary(fit1)
Call:
lm(formula = `Data.AIDS-Related Deaths.Male Adults` ~ aids2$`Data.HIV Prevalence.Young Men`,
data = aids2)
Residuals:
Min 1Q Median 3Q Max
-3041.2 -182.2 -182.2 -65.0 3927.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 161.5 115.9 1.393 0.167
aids2$`Data.HIV Prevalence.Young Men` 1207.1 156.8 7.697 2.3e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 901.2 on 85 degrees of freedom
Multiple R-squared: 0.4107, Adjusted R-squared: 0.4038
F-statistic: 59.24 on 1 and 85 DF, p-value: 2.3e-11
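The correlation of 0.641 quoted in this section can be recovered from the summary output: in a simple linear regression with one predictor, the multiple R-squared equals the squared Pearson correlation. The sketch below does that arithmetic and then checks the identity on a small dataset; the vectors `x` and `y` are made-up values used only for the check.

```r
# R-squared reported by summary(fit1).
r_squared <- 0.4107

# The slope estimate is positive, so r is the positive square root.
r <- sqrt(r_squared)
print(round(r, 3))

# Check the identity cor(x, y)^2 == R-squared on made-up data.
x <- c(0.1, 0.4, 0.5, 1.1, 1.5, 2.3, 3.1)
y <- c(150, 900, 400, 1800, 1600, 3600, 4100)
fit <- lm(y ~ x)
stopifnot(isTRUE(all.equal(cor(x, y)^2, summary(fit)$r.squared)))
```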
In summary, HIV prevalence among young men in 1990 is positively and fairly strongly correlated with the number of AIDS-related deaths among male adults (r = 0.641). According to the model, each additional percentage point of HIV prevalence among young men is associated with a predicted increase of 1207.1 AIDS-related male adult deaths. The adjusted R-squared tells us that about 40% of the variation in male adult deaths may be explained by the model. The p-value (2.3e-11, or 0.000000000023) shows that the predictor (HIV prevalence among young men) is highly statistically significant in explaining the variation in male adult AIDS-related deaths.
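The slope interpretation can be made concrete by plugging a few prevalence values into the fitted line, using the intercept and slope printed in the summary output. The helper `predict_deaths` is a hypothetical name introduced here for illustration. With a residual standard error of about 901 deaths, these point predictions are rough.

```r
# Coefficients copied from the summary(fit1) output.
intercept <- 161.5
slope <- 1207.1

# Predicted male adult deaths for a given young-men HIV prevalence (in percent).
predict_deaths <- function(prevalence_pct) {
  intercept + slope * prevalence_pct
}

# Predictions at 0.5%, 1% and 2% prevalence.
predictions <- predict_deaths(c(0.5, 1, 2))
print(predictions)
```

Moving from 1% to 2% prevalence raises the prediction by exactly the slope, 1207.1 deaths.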
Visualization
Prepare data for bar graph
# identify top 3 countries for AIDS related adult deaths
aids |>
  arrange(desc(`Data.AIDS-Related Deaths.Adults`))
# new dataset with top 3 countries
aids_top3 <- aids |>
  filter(Country %in% c("Uganda", "Zimbabwe", "Malawi"))
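Instead of reading the top countries off a sorted table and typing their names by hand, the selection can be done in code. The sketch below is a minimal base R version. It uses the Male Adults figures from the sorted tibble printed earlier, because the Adults column values are not shown in this report, so its third country differs from the one used for the bar graph. The column name `MaleAdultDeaths` is a hypothetical short name.

```r
# Five rows taken from the sorted tibble printed earlier (Male Adults deaths, 1990),
# deliberately listed out of order here.
deaths <- data.frame(
  Country = c("Malawi", "Uganda", "Zambia", "Zimbabwe",
              "United Republic of Tanzania"),
  MaleAdultDeaths = c(4700, 15000, 4100, 7300, 5900)
)

# Sort by deaths, largest first, and keep the first three country names.
ranked <- deaths[order(deaths$MaleAdultDeaths, decreasing = TRUE), ]
top3 <- head(ranked$Country, 3)
print(top3)

# Subset to those countries without hard-coding their names.
deaths_top3 <- deaths[deaths$Country %in% top3, ]
```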
Bar graph
ggplot(aids_top3, aes(x = Country, y = `Data.AIDS-Related Deaths.Adults`, fill = Country)) +
  geom_bar(stat = "identity") +
  labs(title = "AIDS-Related Deaths per Country",
       x = "Country",
       y = "AIDS-Related Adult Deaths") +
  scale_fill_manual(values = c("Uganda" = "pink", "Zimbabwe" = "purple", "Malawi" = "blue")) +
  theme_minimal()
Conclusion 2
To clean the dataset, I first checked for NAs (there were none, so minimal cleaning was involved). Afterwards, I filtered for the year 1990 so that each country appears only once, which avoids violating the regression assumption of independent observations. The visualization above is a bar graph of the 3 countries with the highest number of AIDS-related adult deaths. I thought it was interesting that most other countries have similar death counts, while Uganda is the outlier in many of the death-related categories. This shows in the bar graph, where Uganda rises far above Malawi and Zimbabwe. If I were to further my research, I would investigate Uganda specifically to look for internal patterns that explain why its death count is so high. There are many variables and years in the dataset, so I would start by creating an AIDS dataset solely for Uganda and performing a multiple linear regression.