Project 1 Assignment

Author

Latifah Traore

Age and Gender Disparities in HIV Diagnoses and Deaths in NYC: A Data Analysis

The spread of HIV/AIDS remains a major public health concern, particularly in large urban centers like New York City. The dataset used in this project is obtained from the NYC Open Data platform, which contains detailed records on HIV diagnoses and related deaths in New York City across multiple years. This data allows us to explore trends in HIV/AIDS outcomes based on key demographic variables, such as age and gender.

The dataset includes the following key variables:

Age Group: A categorical variable representing different age brackets.

Gender: A categorical variable representing male and female, which will allow us to explore gender disparities in HIV diagnoses and deaths.

HIV Diagnoses: A quantitative variable representing the number of newly diagnosed HIV cases in each demographic group.

Deaths: A quantitative variable that tracks the number of deaths related to HIV/AIDS, helping us understand the mortality impact across different age and gender groups.

In this project, we aim to explore the relationship between age and gender in HIV diagnoses and deaths in New York City. Specifically, we will investigate trends in HIV diagnoses across different age groups, examine significant differences between males and females, and assess how age influences HIV-related mortality rates.

Dataset Source: NYC.gov

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggsci)
library(RColorBrewer)
library(readr)

# Set working directory
setwd("C:/Users/akais/OneDrive/Documents/HIV dataset")

# Load the dataset
HIV_data <- read_csv("HIV_AIDS_NY.csv")

Rows: 6005 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Borough, UHF, Gender, Age, Race
dbl (13): Year, HIV diagnoses, HIV diagnosis rate, Concurrent diagnoses, % l...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check for missing values
print(colSums(is.na(HIV_data)))  # Display the count of missing values in each column

                            Year                          Borough 
                               0                                0 
                             UHF                           Gender 
                               0                                0 
                             Age                             Race 
                               0                                0 
                   HIV diagnoses               HIV diagnosis rate 
                               0                                0 
            Concurrent diagnoses % linked to care within 3 months 
                               0                                0 
                  AIDS diagnoses              AIDS diagnosis rate 
                               0                                0 
               PLWDHI prevalence              % viral suppression 
                               0                                0 
                          Deaths                       Death rate 
                               0                                0 
          HIV-related death rate       Non-HIV-related death rate 
                               0                                0

# Remove rows that contain a missing value in any column
HIV_data_clean <- na.omit(HIV_data)  # The check above found no NAs, so no rows are dropped here

# Standardize the capitalization of the gender values
HIV_data_clean$Gender <- tolower(HIV_data_clean$Gender)  # Convert to lowercase
HIV_data_clean$Gender[HIV_data_clean$Gender == "male"] <- "Male"  # Standardize to "Male"
HIV_data_clean$Gender[HIV_data_clean$Gender == "female"] <- "Female"  # Standardize to "Female"
# Other values (e.g. "all" and "transgender") remain lowercase

# Remove duplicate rows
HIV_data_clean <- unique(HIV_data_clean)

# Display cleaned data
head(HIV_data_clean)  # Show the first few rows of the cleaned data

# A tibble: 6 × 18
   Year Borough UHF   Gender    Age   Race  `HIV diagnoses` `HIV diagnosis rate`
  <dbl> <chr>   <chr> <chr>     <chr> <chr>           <dbl>                <dbl>
1  2011 All     All   all       All   All              3379                 48.3
2  2011 All     All   Male      All   All              2595                 79.1
3  2011 All     All   Female    All   All               733                 21.1
4  2011 All     All   transgen… All   All                51              99999  
5  2011 All     All   Female    13 -… All                47                 13.6
6  2011 All     All   Female    20 -… All               178                 24.7
# ℹ 10 more variables: `Concurrent diagnoses` <dbl>,
#   `% linked to care within 3 months` <dbl>, `AIDS diagnoses` <dbl>,
#   `AIDS diagnosis rate` <dbl>, `PLWDHI prevalence` <dbl>,
#   `% viral suppression` <dbl>, Deaths <dbl>, `Death rate` <dbl>,
#   `HIV-related death rate` <dbl>, `Non-HIV-related death rate` <dbl>
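
The fourth row above shows 99999 in `HIV diagnosis rate`, which is far outside the range of the other rates and may be a placeholder for an unavailable value rather than a measured rate. That reading is an assumption; the dataset's documentation would need to confirm it. If it holds, such values pass the is.na() check unnoticed. A minimal sketch of recoding them, using a small tibble built from the Gender and rate values printed above (toy_rates and the meaning of 99999 are illustrative assumptions):

```r
library(dplyr)
library(tibble)

# Toy rows that mirror the Gender and rate values printed above, for illustration only
toy_rates <- tibble(
  Gender = c("all", "Male", "Female", "transgender"),
  `HIV diagnosis rate` = c(48.3, 79.1, 21.1, 99999)
)

# Assumption: 99999 marks "rate not available" and is not a real rate
toy_rates_clean <- toy_rates %>%
  mutate(`HIV diagnosis rate` = na_if(`HIV diagnosis rate`, 99999))

# Count the values that became NA in each column
print(colSums(is.na(toy_rates_clean)))
```

The same na_if() step could be applied to the other rate columns if they turn out to use the same code.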

# Keep only the four variables of interest
HIV_data_filtered <- HIV_data_clean %>%
  select(Age, Gender, `HIV diagnoses`, Deaths)

# Display the first few rows of the filtered dataset to check
head(HIV_data_filtered)

# A tibble: 6 × 4
  Age     Gender      `HIV diagnoses` Deaths
  <chr>   <chr>                 <dbl>  <dbl>
1 All     all                    3379   2040
2 All     Male                   2595   1423
3 All     Female                  733    605
4 All     transgender              51     12
5 13 - 19 Female                   47      1
6 20 - 29 Female                  178     19

# Age is a categorical (character) variable, so it is converted to numeric codes for the model.
# Note: the codes are the positions of the sorted factor levels (including "All"), not ages in years.
HIV_data_clean$Age_Num <- as.numeric(factor(HIV_data_clean$Age))

# Run a linear regression analysis.
# The goal is to see how age, gender, and deaths help predict the number of HIV diagnoses.
linear_model <- lm( `HIV diagnoses` ~ Age_Num + Gender + Deaths, data = HIV_data_clean)

# Summarize the results of the linear regression model.
# This provides statistics such as coefficients and p-values for interpreting the model.
summary(linear_model)


Call:
lm(formula = `HIV diagnoses` ~ Age_Num + Gender + Deaths, data = HIV_data_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-289.1  -25.6  -11.6   -2.1 3208.4 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.509e+02  9.897e+00  15.251  < 2e-16 ***
Age_Num            2.077e+00  7.887e-01   2.633  0.00849 ** 
GenderFemale      -1.530e+02  8.660e+00 -17.664  < 2e-16 ***
GenderMale        -1.296e+02  8.660e+00 -14.964  < 2e-16 ***
Gendertransgender -1.195e+02  5.750e+01  -2.078  0.03772 *  
Deaths             2.533e-03  8.994e-04   2.816  0.00488 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 127.3 on 5999 degrees of freedom
Multiple R-squared:  0.05711,   Adjusted R-squared:  0.05632 
F-statistic: 72.67 on 5 and 5999 DF,  p-value: < 2.2e-16
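
One point to keep in mind when reading the Age_Num coefficient: as.numeric(factor(x)) returns the position of each value among the factor's sorted levels, not an age in years, and the aggregate label "All" receives a code as well. The sketch below illustrates this on three labels taken from the printed rows above and shows one possible alternative, an ordered factor with explicit levels. The variable names are hypothetical, and the alternative is a suggestion rather than the method used in this analysis.

```r
# Three Age labels that appear in the printed rows above
ages <- c("All", "13 - 19", "20 - 29")

# factor() sorts the labels to build its levels, so the codes reflect sort order, not age
codes <- as.numeric(factor(ages))
print(data.frame(Age = ages, Age_Num = codes))

# One alternative (an assumption, not the author's method): drop the aggregate
# label and declare the bracket order explicitly
brackets <- ages[ages != "All"]
age_ordered <- factor(brackets, levels = c("13 - 19", "20 - 29"), ordered = TRUE)
print(levels(age_ordered))
```

With the full dataset, the levels vector would need to list every bracket that actually occurs in the Age column.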

library(tidyverse)
library(tidyr)

# Reshape the data to long format using pivot_longer
HIV_data_long <- HIV_data_clean %>%
  pivot_longer(cols = c(`HIV diagnoses`, Deaths), 
               names_to = "Variable", 
               values_to = "Count")

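# Define age groups from the lower bound of each Age bracket.
# Age is stored as text (e.g. "13 - 19"), so comparing it to a number directly
# would compare strings rather than ages.
HIV_data_grouped <- HIV_data_clean %>%
  mutate(
    Age_Lower = as.numeric(str_extract(Age, "[0-9]+")),  # First number in the label; NA for "All"
    Age_Group = case_when(
      Age_Lower < 13 ~ "Baby",                          # Brackets starting below 13
      Age_Lower >= 13 & Age_Lower < 20 ~ "Teenager",    # Brackets starting at 13-19
      Age_Lower >= 20 & Age_Lower < 65 ~ "Adult",       # Brackets starting at 20-64
      Age_Lower >= 65 ~ "Senior"                        # Brackets starting at 65 or above
    ))  # Rows with Age "All" get NA as their Age_Group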

# Summarize the data to get total HIV diagnoses and deaths by age group and gender
HIV_data_summary <- HIV_data_grouped %>%
  group_by(Age_Group, Gender) %>%
  summarize(
    Total_Diagnoses = sum(`HIV diagnoses`, na.rm = TRUE), 
    Total_Deaths = sum(Deaths, na.rm = TRUE), 
    .groups = 'drop')

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Summarize total counts for total HIV diagnoses and deaths by age group and gender
HIV_summary <- HIV_data_grouped %>%
  group_by(Age_Group, Gender) %>%
  summarise(total_diagnoses = sum(`HIV diagnoses`, na.rm = TRUE),
            total_deaths = sum(Deaths, na.rm = TRUE),
            mean_count = mean(`HIV diagnoses` + Deaths, na.rm = TRUE))

`summarise()` has grouped output by 'Age_Group'. You can override using the
`.groups` argument.
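
In the first rows printed earlier, the row with Gender "all" reports 3379 diagnoses, which equals the sum of the Male (2595), Female (733), and transgender (51) rows. This suggests that "All"/"all" marks aggregate rows, so summing over every row would count the same cases more than once. A hedged sketch of restricting the data to non-aggregate rows before totalling, on a toy tibble whose values are copied from the printed rows (toy_hiv and toy_summary are hypothetical names):

```r
library(dplyr)
library(tibble)

# Values copied from the six rows printed earlier in this report
toy_hiv <- tibble(
  Age    = c("All", "All", "All", "All", "13 - 19", "20 - 29"),
  Gender = c("all", "Male", "Female", "transgender", "Female", "Female"),
  `HIV diagnoses` = c(3379, 2595, 733, 51, 47, 178),
  Deaths = c(2040, 1423, 605, 12, 1, 19)
)

# Keep only rows that describe a specific age bracket and a specific gender
toy_summary <- toy_hiv %>%
  filter(Age != "All", Gender != "all") %>%
  group_by(Age, Gender) %>%
  summarise(
    total_diagnoses = sum(`HIV diagnoses`),
    total_deaths = sum(Deaths),
    .groups = "drop"
  )

print(toy_summary)
```

On the full dataset, the Borough, UHF, and Race columns also contain "All" and would likely need the same treatment, with each fixed at either its aggregate or its detailed level before summing.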

# Reshape the data for plotting
HIV_summary_long <- HIV_summary %>%
  pivot_longer(cols = c(total_diagnoses, total_deaths), 
               names_to = "Variable", 
               values_to = "Count")

# Create the bar plot
ggplot(HIV_summary_long, aes(x = Age_Group, y = Count, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge", na.rm = TRUE) +
  # Name the colors so each measure gets the intended one regardless of level order
  scale_fill_manual(values = c(total_diagnoses = "darkblue", total_deaths = "darkred")) +
  labs(x = "Age Group", 
       y = "Count", 
       fill = "Variable",
       title = "Total HIV Diagnoses and Deaths by Age Group and Gender: Analysis", 
       caption = "Source: NYC.gov") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma)  # Format y-axis labels with thousands separators
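
The plot code above maps only Age_Group and Variable to aesthetics, so the gender breakdown named in the title does not appear as a separate dimension. One option, sketched here on made-up counts, is to add facet_wrap() so that each gender gets its own panel. toy_long and all of its values are illustrative assumptions, not results from the dataset.

```r
library(ggplot2)

# Made-up counts in the same long layout as HIV_summary_long
toy_long <- data.frame(
  Age_Group = rep(c("Teenager", "Adult"), times = 4),
  Gender    = rep(c("Female", "Male"), each = 4),
  Variable  = rep(rep(c("total_diagnoses", "total_deaths"), each = 2), times = 2),
  Count     = c(40, 300, 2, 60, 90, 900, 3, 150)
)

p <- ggplot(toy_long, aes(x = Age_Group, y = Count, fill = Variable)) +
  geom_col(position = "dodge") +   # geom_col() is shorthand for geom_bar(stat = "identity")
  facet_wrap(~ Gender) +           # one panel per gender
  scale_fill_manual(values = c(total_diagnoses = "darkblue", total_deaths = "darkred")) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal()

print(p)
```

Faceting keeps the dodged diagnoses and deaths bars side by side within each panel, which makes differences between genders easier to compare than overlapping bars.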

Dataset Cleaning and Visualization Analysis

To clean the dataset, I checked for missing values; none were found, so the na.omit() call did not remove any rows. I standardized the gender variable by converting all entries to lowercase and then relabeling the male and female entries as “Male” and “Female”; the remaining values (“all” and “transgender”) stay lowercase. Duplicate rows, if any, were removed with the unique() function. Finally, I selected the relevant variables: Age, Gender, HIV diagnoses, and Deaths.

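The visualization displays total HIV diagnoses (dark blue) and deaths (dark red) by age group; gender is used in the summary but is not mapped to a separate dimension of the plot. Counts are highest in the oldest age category and considerably lower in the younger groups. This pattern may point to a greater burden of HIV/AIDS among older adults, although the Age column also contains the aggregate value “All”, which is not an age bracket and can inflate totals if it is not excluded. The lower counts in younger groups highlight a potential area for targeted prevention efforts.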

I planned to incorporate a time series analysis to examine annual trends in HIV diagnoses and deaths, which could provide insights into the effectiveness of public health interventions over time. Additionally, I considered using interactive visualizations, such as plotly charts, for a more detailed exploration of the data. However, I ultimately decided to focus on a straightforward bar plot for this project to maintain clarity and simplicity.
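
As a rough illustration of the annual trend analysis mentioned above, the sketch below totals diagnoses and deaths per year and draws one line per measure. It is not part of the submitted analysis; toy_years and all of its values are invented for illustration, and only the column names follow the dataset.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)

# Invented yearly rows; only the column names follow the dataset
toy_years <- tibble(
  Year = c(2011, 2011, 2012, 2012, 2013, 2013),
  `HIV diagnoses` = c(120, 80, 110, 70, 95, 65),
  Deaths = c(40, 30, 38, 27, 33, 25)
)

# Total each measure per year, then reshape to long format for plotting
yearly <- toy_years %>%
  group_by(Year) %>%
  summarise(
    total_diagnoses = sum(`HIV diagnoses`),
    total_deaths = sum(Deaths),
    .groups = "drop"
  ) %>%
  pivot_longer(cols = c(total_diagnoses, total_deaths),
               names_to = "Variable",
               values_to = "Count")

trend_plot <- ggplot(yearly, aes(x = Year, y = Count, colour = Variable)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = unique(yearly$Year)) +  # one tick per year
  theme_minimal()

print(trend_plot)
```

Applied to the real data, the aggregate "All" rows would need to be handled first so that each year's total is not counted more than once.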