Age and Gender Disparities in HIV Diagnoses and Deaths in NYC: A Data Analysis
The spread of HIV/AIDS remains a major public health concern, particularly in large urban centers like New York City. The dataset used in this project comes from the NYC Open Data platform and contains detailed records on HIV diagnoses and related deaths in New York City across multiple years. These data allow us to explore trends in HIV/AIDS outcomes by key demographic variables, such as age and gender.
The dataset includes the following key variables:
Age Group: A categorical variable representing different age brackets.
Gender: A categorical variable representing male and female, which will allow us to explore gender disparities in HIV diagnoses and deaths.
HIV Diagnoses: A quantitative variable representing the number of newly diagnosed HIV cases in each demographic group.
Deaths: A quantitative variable that tracks the number of deaths related to HIV/AIDS, helping us understand the mortality impact across different age and gender groups.
In this project, we aim to explore the relationship between age and gender in HIV diagnoses and deaths in New York City. Specifically, we will investigate trends in HIV diagnoses across different age groups, examine significant differences between males and females, and assess how age influences HIV-related mortality rates.
Dataset Source: NYC.gov
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Set working directory
setwd("C:/Users/akais/OneDrive/Documents/HIV dataset")

# Load the dataset
HIV_data <- read_csv("HIV_AIDS_NY.csv")
Rows: 6005 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Borough, UHF, Gender, Age, Race
dbl (13): Year, HIV diagnoses, HIV diagnosis rate, Concurrent diagnoses, % l...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check for missing values
print(colSums(is.na(HIV_data)))  # Display the count of missing values in each column
Year Borough
0 0
UHF Gender
0 0
Age Race
0 0
HIV diagnoses HIV diagnosis rate
0 0
Concurrent diagnoses % linked to care within 3 months
0 0
AIDS diagnoses AIDS diagnosis rate
0 0
PLWDHI prevalence % viral suppression
0 0
Deaths Death rate
0 0
HIV-related death rate Non-HIV-related death rate
0 0
# Remove rows with missing values in critical columns
HIV_data_clean <- na.omit(HIV_data)  # This removes rows with any NA values

# Standardize gender values to "Male" and "Female"
HIV_data_clean$Gender <- tolower(HIV_data_clean$Gender)  # Convert to lowercase
HIV_data_clean$Gender[HIV_data_clean$Gender == "male"] <- "Male"  # Standardize to "Male"
HIV_data_clean$Gender[HIV_data_clean$Gender == "female"] <- "Female"  # Standardize to "Female"

# Remove duplicate rows
HIV_data_clean <- unique(HIV_data_clean)

# Display cleaned data
head(HIV_data_clean)  # Show the first few rows of the cleaned data
# A tibble: 6 × 18
Year Borough UHF Gender Age Race `HIV diagnoses` `HIV diagnosis rate`
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 2011 All All all All All 3379 48.3
2 2011 All All Male All All 2595 79.1
3 2011 All All Female All All 733 21.1
4 2011 All All transgen… All All 51 99999
5 2011 All All Female 13 -… All 47 13.6
6 2011 All All Female 20 -… All 178 24.7
# ℹ 10 more variables: `Concurrent diagnoses` <dbl>,
# `% linked to care within 3 months` <dbl>, `AIDS diagnoses` <dbl>,
# `AIDS diagnosis rate` <dbl>, `PLWDHI prevalence` <dbl>,
# `% viral suppression` <dbl>, Deaths <dbl>, `Death rate` <dbl>,
# `HIV-related death rate` <dbl>, `Non-HIV-related death rate` <dbl>
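The preview above shows aggregate rows (Age equal to "All", Gender equal to "all") alongside the detailed rows, and a rate of 99999 in the transgender row. The following is a minimal sketch, on a small made-up tibble, of how such rows might be handled before any totals are computed. The names toy and toy_detail are hypothetical, and treating 99999 as a "not available" code is an assumption rather than something confirmed by the dataset documentation.

```r
library(dplyr)
library(tibble)

# Toy rows mimicking the structure shown above; the values are illustrative only.
toy <- tribble(
  ~Age,      ~Gender,  ~`HIV diagnoses`, ~`HIV diagnosis rate`,
  "All",     "all",    100,              48.3,
  "All",     "Male",   70,               79.1,
  "13 - 19", "Female", 5,                13.6,
  "20 - 29", "Female", 25,               99999
)

toy_detail <- toy %>%
  # Keep only rows that describe a specific age bracket and a specific gender
  filter(Age != "All", tolower(Gender) != "all") %>%
  # Assumption: 99999 marks an unavailable rate, so recode it to NA
  mutate(`HIV diagnosis rate` = na_if(`HIV diagnosis rate`, 99999))

toy_detail
```

Dropping the aggregate rows first avoids counting the same cases twice when the detailed rows are later summed.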
# Filter the dataset to include only the four variables of interest
HIV_data_filtered <- HIV_data_clean %>%
  select(Age, Gender, `HIV diagnoses`, Deaths)

# Display the first few rows of the filtered dataset to check
head(HIV_data_filtered)
# A tibble: 6 × 4
Age Gender `HIV diagnoses` Deaths
<chr> <chr> <dbl> <dbl>
1 All all 3379 2040
2 All Male 2595 1423
3 All Female 733 605
4 All transgender 51 12
5 13 - 19 Female 47 1
6 20 - 29 Female 178 19
# Age is a categorical (character) variable; convert it to integer codes for the model.
# Note: the codes follow the alphabetical order of the labels, not the ages themselves.
HIV_data_clean$Age_Num <- as.numeric(factor(HIV_data_clean$Age))

# Run a linear regression to see how age, gender, and deaths
# help predict the number of HIV diagnoses.
linear_model <- lm(`HIV diagnoses` ~ Age_Num + Gender + Deaths, data = HIV_data_clean)

# Summarize the model: coefficients, standard errors, and p-values
# used to interpret how well the model fits.
summary(linear_model)
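Converting Age with as.numeric(factor(...)) gives each label an integer code, so the model above treats the age brackets as evenly spaced numbers. One alternative, sketched here on invented data, is to pass Age to lm() as a factor so that each bracket gets its own coefficient relative to a baseline bracket. The data frame toy, its column Diagnoses, and the name toy_model are hypothetical.

```r
# Toy data: values are invented for illustration only.
toy <- data.frame(
  Age       = rep(c("13 - 19", "20 - 29", "30 - 39"), each = 4),
  Gender    = rep(c("Female", "Male"), times = 6),
  Deaths    = c(1, 2, 0, 3, 4, 6, 5, 8, 7, 9, 6, 11),
  Diagnoses = c(5, 9, 4, 12, 20, 31, 22, 35, 18, 27, 16, 30)
)

# Integer codes come from the sorted labels, not from the ages they describe
as.numeric(factor(c("All", "13 - 19", "20 - 29")))

# Treat Age as a factor: one coefficient per bracket relative to the baseline
toy_model <- lm(Diagnoses ~ factor(Age) + Gender + Deaths, data = toy)
coef(toy_model)
```

With a factor, the model no longer assumes that moving from one bracket to the next always changes diagnoses by the same amount.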
# Reshape the data to long format using pivot_longer
HIV_data_long <- HIV_data_clean %>%
  pivot_longer(cols = c(`HIV diagnoses`, Deaths),
               names_to = "Variable",
               values_to = "Count")
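# Define age groups
# Age is stored as text (e.g. "13 - 19"), so comparing it with numbers directly
# would compare strings. Extract the lower bound of each bracket first.
HIV_data_grouped <- HIV_data_clean %>%
  mutate(
    Age_Lower = as.numeric(str_extract(Age, "\\d+")),  # NA for labels without digits, e.g. "All"
    Age_Group = case_when(
      Age_Lower < 13 ~ "Baby",                          # Define "Baby" as ages less than 13
      Age_Lower >= 13 & Age_Lower < 20 ~ "Teenager",    # Define "Teenager" as ages 13-19
      Age_Lower >= 20 & Age_Lower < 65 ~ "Adult",       # Define "Adult" as ages 20-64
      Age_Lower >= 65 ~ "Senior"                        # Define "Senior" as ages 65 and above
    )
  )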
# Summarize the data to get total HIV diagnoses and deaths by age group and gender
HIV_data_summary <- HIV_data_grouped %>%
  group_by(Age_Group, Gender) %>%
  summarize(Total_Diagnoses = sum(`HIV diagnoses`, na.rm = TRUE),
            Total_Deaths = sum(Deaths, na.rm = TRUE),
            .groups = 'drop')
# Load necessary libraries
library(ggplot2)
library(dplyr)

# Summarize total counts for total HIV diagnoses and deaths by age group and gender
HIV_summary <- HIV_data_grouped %>%
  group_by(Age_Group, Gender) %>%
  summarise(total_diagnoses = sum(`HIV diagnoses`, na.rm = TRUE),
            total_deaths = sum(Deaths, na.rm = TRUE),
            mean_count = mean(`HIV diagnoses` + Deaths, na.rm = TRUE))
`summarise()` has grouped output by 'Age_Group'. You can override using the
`.groups` argument.
# Reshape the data for plotting
HIV_summary_long <- HIV_summary %>%
  pivot_longer(cols = c(total_diagnoses, total_deaths),
               names_to = "Variable",
               values_to = "Count")

# Create the bar plot
ggplot(HIV_summary_long, aes(x = Age_Group, y = Count, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge", na.rm = TRUE) +
  # Name the colours so diagnoses are dark blue and deaths dark red,
  # regardless of the alphabetical order of the two labels
  scale_fill_manual(values = c(total_diagnoses = "darkblue", total_deaths = "darkred")) +
  labs(x = "Age Group", y = "Count", fill = "Variable",
       title = "Total HIV Diagnoses and Deaths by Age Group and Gender: Analysis",
       caption = "Source: NYC.gov") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma)  # Format y-axis labels with thousands separators
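The bar plot above maps only Age_Group and Variable to aesthetics, so the gender categories are not visually separated. One way to show them, sketched here under stated assumptions, is to add facet_wrap() so each gender gets its own panel. The data frame toy_long is a hypothetical stand-in for HIV_summary_long, and its counts are invented.

```r
library(ggplot2)

# Stand-in for HIV_summary_long; counts are invented for illustration.
toy_long <- data.frame(
  Age_Group = rep(c("Teenager", "Adult"), each = 4),
  Gender    = rep(rep(c("Female", "Male"), each = 2), times = 2),
  Variable  = rep(c("total_diagnoses", "total_deaths"), times = 4),
  Count     = c(47, 1, 120, 3, 600, 80, 2100, 300)
)

p <- ggplot(toy_long, aes(x = Age_Group, y = Count, fill = Variable)) +
  geom_col(position = "dodge") +
  # Named values tie each colour to a specific variable
  scale_fill_manual(values = c(total_diagnoses = "darkblue", total_deaths = "darkred")) +
  # One panel per gender category
  facet_wrap(~ Gender) +
  labs(x = "Age Group", y = "Count", fill = "Variable") +
  theme_minimal()

p
```

Faceting keeps the same axes in every panel, which makes the male and female bars directly comparable.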
Dataset Cleaning and Visualization Analysis
To clean the dataset, I checked for missing values and removed rows with critical NA entries using the na.omit() function. I standardized the gender variable by converting all entries to lowercase and then labeling them consistently as “Male” or “Female.” Duplicate entries were eliminated with the unique() function. Finally, I filtered the dataset to focus on the relevant variables: Age, Gender, HIV Diagnoses, and Deaths.
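The visualization displays total HIV diagnoses (dark blue) and deaths (dark red), summarized by age group and gender. In the plot as produced, the highest counts fall in the oldest category, which suggests increased vulnerability to HIV/AIDS among older adults. This reading depends on how the text Age labels, including the aggregate "All" rows, are mapped to age groups, so it should be confirmed against the bracket-level rows. Younger groups show much lower counts, which still marks them as a potential area for targeted prevention efforts.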
I planned to incorporate a time series analysis to examine annual trends in HIV diagnoses and deaths, which could provide insights into the effectiveness of public health interventions over time. Additionally, I considered using interactive visualizations, such as plotly charts, for a more detailed exploration of the data. However, I ultimately decided to focus on a straightforward bar plot for this project to maintain clarity and simplicity.
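The annual-trend analysis mentioned above was not carried out in this project. As a rough sketch of what a first step could look like, the snippet below totals diagnoses and deaths per Year on invented data. The names toy and yearly and the column Diagnoses are hypothetical, and with the real dataset the aggregate "All" rows would need to be handled first.

```r
library(dplyr)

# Toy yearly records; the values are invented for illustration only.
toy <- data.frame(
  Year      = c(2011, 2011, 2012, 2012, 2013, 2013),
  Diagnoses = c(47, 178, 40, 160, 35, 150),
  Deaths    = c(1, 19, 2, 15, 1, 12)
)

# Total diagnoses and deaths per year, ordered chronologically
yearly <- toy %>%
  group_by(Year) %>%
  summarise(Total_Diagnoses = sum(Diagnoses),
            Total_Deaths = sum(Deaths),
            .groups = "drop") %>%
  arrange(Year)

yearly
```

A table like yearly could then be passed to a line plot with Year on the x-axis to show the trend.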