Project 2 DATA 607: Analysis of HIV/AIDS Diagnoses by Neighborhood, Sex, and Race/Ethnicity in NYC

#Introduction This project analyzes a dataset of HIV and AIDS diagnoses in New York City from 2010 to 2021. It identifies peak years and year-to-year fluctuations, and it examines how diagnoses differ across neighborhoods, sexes, and racial/ethnic groups, giving context for the uneven impact of HIV across communities.

# Load necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(readr)

#Defining file path and steps to tidy the dataset

# Define file path
file_path <- "C:/Users/Dell/Downloads/HIV_AIDS_Diagnoses_20241014.csv"

# Step 1: Read the CSV file
data <- read_csv(file_path)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 8976 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Borough, Neighborhood (U.H.F), SEX, RACE/ETHNICITY, TOTAL NUMBER OF...
## dbl (3): YEAR, TOTAL NUMBER OF CONCURRENT HIV/AIDS DIAGNOSES, PROPORTION OF ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Step 2: Rename columns for clarity
data <- data %>%
  rename(
    Year = `YEAR`,
    Borough = `Borough`,
    Neighborhood = `Neighborhood (U.H.F)`,
    Sex = `SEX`,
    Race_Ethnicity = `RACE/ETHNICITY`,
    Total_HIV_Diagnoses = `TOTAL NUMBER OF HIV DIAGNOSES`,
    HIV_Diagnoses_Per_100k = `HIV DIAGNOSES PER 100,000 POPULATION`,
    Total_Concurrent_HIV_AIDS_Diagnoses = `TOTAL NUMBER OF CONCURRENT HIV/AIDS DIAGNOSES`,
    Proportion_Concurrent_HIV_AIDS_Diagnoses = `PROPORTION OF CONCURRENT HIV/AIDS DIAGNOSES AMONG ALL HIV DIAGNOSES`,
    Total_AIDS_Diagnoses = `TOTAL NUMBER OF AIDS DIAGNOSES`,
    AIDS_Diagnoses_Per_100k = `AIDS DIAGNOSES PER 100,000 POPULATION`
  )

# Step 3: Split Neighborhood column where necessary
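# Some entries name more than one area (e.g. "East Flatbush - Flatbush"); split them into separate rows.
# Caution: every resulting row keeps the full counts of the original row, so later sums count these areas more than once.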
data <- data %>%
  separate_rows(Neighborhood, sep = " - ")

# Step 4: Fill missing values in Borough column based on Neighborhood if known
data <- data %>%
  mutate(
    Borough = case_when(
      Neighborhood == "Greenpoint" ~ "Brooklyn",
      Neighborhood == "Stapleton" ~ "Staten Island",
      Neighborhood == "Southeast Queens" ~ "Queens",
      Neighborhood == "Upper Westside" ~ "Manhattan",
      Neighborhood == "Willowbrook" ~ "Staten Island",
      Neighborhood == "East Flatbush" ~ "Brooklyn",
      Neighborhood == "Southwest Queens" ~ "Queens",
      Neighborhood == "Fordham" ~ "Bronx",
      Neighborhood == "Flushing" ~ "Queens",
      TRUE ~ Borough  # Keep existing Borough values
    )
  )

# Step 5: Treat "All" entries in Sex and Race/Ethnicity columns
data <- data %>%
  mutate(
    Sex = ifelse(Sex == "All", NA, Sex),  # Replace "All" with NA
    Race_Ethnicity = ifelse(Race_Ethnicity == "All", NA, Race_Ethnicity)  # Replace "All" with NA
  )

# Step 6: Label missing values (including the former "All" entries) as "Unknown"
data <- data %>%
  replace_na(list(
    Borough = "Unknown",
    Sex = "Unknown",
    Race_Ethnicity = "Unknown"
  ))

# Step 7: Ensure all numeric columns are numeric (values that cannot be parsed become NA)
data <- data %>%
  mutate(
    Total_HIV_Diagnoses = as.numeric(Total_HIV_Diagnoses),
    HIV_Diagnoses_Per_100k = as.numeric(HIV_Diagnoses_Per_100k),
    Total_Concurrent_HIV_AIDS_Diagnoses = as.numeric(Total_Concurrent_HIV_AIDS_Diagnoses),
    Proportion_Concurrent_HIV_AIDS_Diagnoses = as.numeric(Proportion_Concurrent_HIV_AIDS_Diagnoses),
    Total_AIDS_Diagnoses = as.numeric(Total_AIDS_Diagnoses),
    AIDS_Diagnoses_Per_100k = as.numeric(AIDS_Diagnoses_Per_100k)
  )
## Warning: There were 4 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `Total_HIV_Diagnoses = as.numeric(Total_HIV_Diagnoses)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
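
The coercion warning above means some text values could not be converted to numbers. A minimal sketch of how such values might be located before they are silently turned into NA is shown here. The toy tibble and its contents ("*", "1,024") are hypothetical; the actual offending values in the source file are not known from this report.

```r
library(dplyr)

# Hypothetical raw column: counts stored as text, including values
# that as.numeric() cannot convert. Illustration only.
raw <- tibble(
  Neighborhood = c("A", "B", "C"),
  Total_HIV_Diagnoses = c("12", "*", "1,024")
)

# Keep the rows whose value is present as text but becomes NA after conversion.
bad_values <- raw %>%
  mutate(parsed = suppressWarnings(as.numeric(Total_HIV_Diagnoses))) %>%
  filter(!is.na(Total_HIV_Diagnoses), is.na(parsed))

print(bad_values)
```

Inspecting the flagged rows first makes it possible to decide whether a value is a genuine gap or a formatting issue, such as a thousands separator, that could be cleaned instead of dropped.
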
# Step 8: View the cleaned dataset
print(data)
## # A tibble: 12,256 × 11
##     Year Borough       Neighborhood     Sex   Race_Ethnicity Total_HIV_Diagnoses
##    <dbl> <chr>         <chr>            <chr> <chr>                        <dbl>
##  1  2010 Brooklyn      Greenpoint       Male  Black                            6
##  2  2011 Staten Island Stapleton        Fema… Native Americ…                   0
##  3  2011 Unknown       St. George       Fema… Native Americ…                   0
##  4  2010 Queens        Southeast Queens Male  Unknown                         23
##  5  2012 Manhattan     Upper Westside   Fema… Unknown                          0
##  6  2013 Staten Island Willowbrook      Male  Unknown                          0
##  7  2013 Brooklyn      East Flatbush    Male  Black                           54
##  8  2013 Unknown       Flatbush         Male  Black                           54
##  9  2013 Brooklyn      East Flatbush    Fema… Native Americ…                   0
## 10  2013 Unknown       Flatbush         Fema… Native Americ…                   0
## # ℹ 12,246 more rows
## # ℹ 5 more variables: HIV_Diagnoses_Per_100k <dbl>,
## #   Total_Concurrent_HIV_AIDS_Diagnoses <dbl>,
## #   Proportion_Concurrent_HIV_AIDS_Diagnoses <dbl>, Total_AIDS_Diagnoses <dbl>,
## #   AIDS_Diagnoses_Per_100k <dbl>
# Optional: Write the cleaned data to a new CSV file
write_csv(data, "C:/Users/Dell/Downloads/Tidy_HIV_AIDS_Diagnoses_20241014.csv")
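
Step 3 grows the data from 8,976 to 12,256 rows because `separate_rows()` copies every other column into each new row. The sketch below, on a small hypothetical tibble, illustrates the effect on sums and one possible alternative that keeps a single row per record. The column name `Neighborhood_Alt` is an assumption introduced for this example.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row example mirroring the "East Flatbush - Flatbush" case.
toy <- tibble(
  Neighborhood = c("East Flatbush - Flatbush", "Greenpoint"),
  Total_HIV_Diagnoses = c(54, 6)
)

# Splitting into rows duplicates the count of 54.
split_toy <- toy %>%
  separate_rows(Neighborhood, sep = " - ")

sum_before <- sum(toy$Total_HIV_Diagnoses)        # 60
sum_after  <- sum(split_toy$Total_HIV_Diagnoses)  # 114

# Possible alternative: split into columns, so each record stays on one row.
wide_toy <- toy %>%
  separate(Neighborhood,
           into = c("Neighborhood", "Neighborhood_Alt"),
           sep = " - ", fill = "right", extra = "merge")
```

With the column-wise split, totals computed from `Total_HIV_Diagnoses` match the original data, at the cost of a second name column that is NA for single-area entries.
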

Including Plots

HIV Diagnosis Rates by Gender and Race/Ethnicity

# Load necessary libraries
library(dplyr)
library(ggplot2)

# Read the tidied dataset back in
dataset <- read.csv("C:/Users/Dell/Downloads/Tidy_HIV_AIDS_Diagnoses_20241014.csv")

# Summarize HIV Diagnosis Rates by Gender and Race/Ethnicity
hiv_by_gender_race <- dataset %>%
  group_by(Sex, Race_Ethnicity) %>%
  summarise(average_rate = mean(HIV_Diagnoses_Per_100k, na.rm = TRUE))
## `summarise()` has grouped output by 'Sex'. You can override using the `.groups`
## argument.
# Plot the results
ggplot(hiv_by_gender_race, aes(x = Sex, y = average_rate, fill = Race_Ethnicity)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "HIV Diagnosis Rates by Gender and Race/Ethnicity",
       y = "Average HIV Diagnosis Rate per 100,000",
       x = "Gender") +
  theme_minimal()
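
Because Steps 5 and 6 relabel the aggregate "All" entries as "Unknown", the "Unknown" bars in this chart represent totals across groups and not a separate group. One way to compare only the specific groups is sketched below on hypothetical values; it also sets `.groups = "drop"` to silence the grouping message.

```r
library(dplyr)

# Hypothetical rates; "Unknown" stands in for the relabeled "All" rows.
toy <- tibble(
  Sex = c("Male", "Male", "Male", "Female", "Female", "Female"),
  Race_Ethnicity = c("Black", "White", "Unknown", "Black", "White", "Unknown"),
  HIV_Diagnoses_Per_100k = c(60, 20, 40, 30, 10, 20)
)

# Drop the aggregate rows before averaging by group.
rates <- toy %>%
  filter(Sex != "Unknown", Race_Ethnicity != "Unknown") %>%
  group_by(Sex, Race_Ethnicity) %>%
  summarise(average_rate = mean(HIV_Diagnoses_Per_100k, na.rm = TRUE),
            .groups = "drop")

print(rates)
```
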

Temporal Trends: Number of HIV and AIDS Diagnoses Over the Years

hiv_aids_trends <- dataset %>%
  group_by(Year) %>%
  summarise(total_hiv_diagnoses = sum(`Total_HIV_Diagnoses`, na.rm = TRUE),
            total_aids_diagnoses = sum(`Total_AIDS_Diagnoses`, na.rm = TRUE))

# Plot the results
ggplot(hiv_aids_trends, aes(x = Year)) +
  geom_line(aes(y = total_hiv_diagnoses, color = "HIV Diagnoses")) +
  geom_line(aes(y = total_aids_diagnoses, color = "AIDS Diagnoses")) +
  labs(title = "Temporal Trends in HIV and AIDS Diagnoses",
       y = "Number of Diagnoses",
       x = "Year") +
  scale_color_manual(name = "Diagnosis Type", values = c("HIV Diagnoses" = "blue", "AIDS Diagnoses" = "red")) +
  theme_minimal()
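
The same chart can also be drawn from long-format data, which lets ggplot2 build the legend from a single color mapping. This is a sketch of that alternative on hypothetical yearly totals, not the method used above; the column names `Diagnosis_Type` and `Count` are chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical yearly totals, for illustration only.
trends <- tibble(
  Year = 2010:2012,
  total_hiv_diagnoses = c(30, 25, 20),
  total_aids_diagnoses = c(15, 12, 10)
)

# One row per year and diagnosis type.
trends_long <- trends %>%
  pivot_longer(cols = c(total_hiv_diagnoses, total_aids_diagnoses),
               names_to = "Diagnosis_Type", values_to = "Count")

# A single geom_line() now covers both series.
p <- ggplot(trends_long, aes(x = Year, y = Count, color = Diagnosis_Type)) +
  geom_line() +
  labs(y = "Number of Diagnoses", x = "Year", color = "Diagnosis Type") +
  theme_minimal()
```
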

Year with most HIV diagnoses

# Calculate the total HIV diagnoses per year
hiv_by_year <- dataset %>%
  group_by(Year) %>%
  summarise(total_hiv_diagnoses = sum(Total_HIV_Diagnoses, na.rm = TRUE))

# Find the year with the most HIV diagnoses
year_most_hiv <- hiv_by_year %>%
  filter(total_hiv_diagnoses == max(total_hiv_diagnoses))

year_most_hiv
## # A tibble: 1 × 2
##    Year total_hiv_diagnoses
##   <int>               <int>
## 1  2020               35136
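
As a more compact alternative to filtering on `max()`, dplyr's `slice_max()` returns the row or rows with the largest value directly. The yearly totals below are hypothetical.

```r
library(dplyr)

# Hypothetical yearly totals, for illustration only.
hiv_by_year_toy <- tibble(
  Year = 2018:2021,
  total_hiv_diagnoses = c(10, 12, 20, 15)
)

# Keep the year with the highest total (ties are kept by default).
peak_year <- hiv_by_year_toy %>%
  slice_max(total_hiv_diagnoses, n = 1)

print(peak_year)
```
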

HIV Diagnoses by Neighborhood (Percentage)

# Create a summary of HIV diagnoses by neighborhood
hiv_by_neighborhood <- dataset %>%
  group_by(Neighborhood) %>%
  summarise(total_hiv_diagnoses = sum(Total_HIV_Diagnoses, na.rm = TRUE))

# Calculate percentage of total diagnoses per neighborhood
hiv_by_neighborhood <- hiv_by_neighborhood %>%
  mutate(percentage = (total_hiv_diagnoses / sum(total_hiv_diagnoses)) * 100)

# Plot the bar chart
ggplot(hiv_by_neighborhood, aes(x = reorder(Neighborhood, -percentage), y = percentage)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Percentage of HIV Diagnoses by Neighborhood",
       x = "Neighborhood",
       y = "Percentage of Total HIV Diagnoses") +
  theme_minimal()
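
With many neighborhoods on one axis, the labels can become hard to read. A possible refinement, sketched here on hypothetical totals, is to compute the percentages over all neighborhoods first and then keep only the largest few for plotting.

```r
library(dplyr)

# Hypothetical neighborhood totals, for illustration only.
toy <- tibble(
  Neighborhood = c("A", "B", "C", "D"),
  total_hiv_diagnoses = c(50, 30, 15, 5)
)

# Percentages use the full total as the denominator; the filter comes afterwards.
top_neighborhoods <- toy %>%
  mutate(percentage = total_hiv_diagnoses / sum(total_hiv_diagnoses) * 100) %>%
  slice_max(percentage, n = 2)

print(top_neighborhoods)
```

Computing the share before filtering keeps each percentage relative to the citywide total, so the retained bars do not sum to 100.
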

Geographical Patterns: HIV/AIDS Diagnoses by Neighborhood

# Summarize HIV and AIDS Diagnoses by Neighborhood
hiv_by_neighborhood <- dataset %>%
  group_by(Neighborhood) %>%
  summarise(
    total_hiv_diagnoses = sum(Total_HIV_Diagnoses, na.rm = TRUE),
    total_aids_diagnoses = sum(Total_AIDS_Diagnoses, na.rm = TRUE)
  )

# Plot the results for HIV Diagnoses by Neighborhood
ggplot(hiv_by_neighborhood, aes(x = reorder(Neighborhood, total_hiv_diagnoses), y = total_hiv_diagnoses)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  labs(title = "HIV Diagnoses by Neighborhood",
       y = "Total HIV Diagnoses",
       x = "Neighborhood") +
  theme_minimal()

Intersectional Analysis: HIV Diagnoses by Gender and Race/Ethnicity

# Intersectional Analysis by Gender and Race/Ethnicity
intersectional_analysis <- dataset %>%
  group_by(Sex, `Race_Ethnicity`) %>%
  summarise(total_hiv_diagnoses = sum(`Total_HIV_Diagnoses`, na.rm = TRUE))
## `summarise()` has grouped output by 'Sex'. You can override using the `.groups`
## argument.
# Plot the results
ggplot(intersectional_analysis, aes(x = Sex, y = total_hiv_diagnoses, fill = `Race_Ethnicity`)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Intersectional Analysis of HIV Diagnoses by Gender and Race/Ethnicity",
       y = "Total HIV Diagnoses",
       x = "Gender") +
  theme_minimal()

Relationship Between New HIV and AIDS Diagnoses

# Summarize the total number of HIV and AIDS diagnoses for the pie chart
diagnosis_summary <- dataset %>%
  summarise(
    total_hiv_diagnoses = sum(Total_HIV_Diagnoses, na.rm = TRUE),
    total_aids_diagnoses = sum(Total_AIDS_Diagnoses, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = c(total_hiv_diagnoses, total_aids_diagnoses), names_to = "Diagnosis", values_to = "Count")

# Create pie chart
ggplot(diagnosis_summary, aes(x = "", y = Count, fill = Diagnosis)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Proportion of Newly Diagnosed HIV vs AIDS Cases") +
  theme_minimal() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank())

# Statistics Summary:

# Get a summary of every column in the dataset
summary(dataset)
##       Year        Borough          Neighborhood           Sex           
##  Min.   :2010   Length:12256       Length:12256       Length:12256      
##  1st Qu.:2012   Class :character   Class :character   Class :character  
##  Median :2017   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2016                                                           
##  3rd Qu.:2020                                                           
##  Max.   :2021                                                           
##                                                                         
##  Race_Ethnicity     Total_HIV_Diagnoses HIV_Diagnoses_Per_100k
##  Length:12256       Min.   :   0.00     Min.   :  0.00        
##  Class :character   1st Qu.:   0.00     1st Qu.:  0.00        
##  Mode  :character   Median :   2.00     Median :  8.90        
##                     Mean   :  18.48     Mean   : 26.33        
##                     3rd Qu.:  12.00     3rd Qu.: 34.20        
##                     Max.   :3353.00     Max.   :821.60        
##                     NA's   :29          NA's   :97            
##  Total_Concurrent_HIV_AIDS_Diagnoses Proportion_Concurrent_HIV_AIDS_Diagnoses
##  Min.   :  0.000                     Min.   :  0.00                          
##  1st Qu.:  0.000                     1st Qu.:  0.00                          
##  Median :  0.000                     Median :  9.10                          
##  Mean   :  3.458                     Mean   : 15.11                          
##  3rd Qu.:  2.000                     3rd Qu.: 22.70                          
##  Max.   :680.000                     Max.   :100.00                          
##  NA's   :7                           NA's   :2424                            
##  Total_AIDS_Diagnoses AIDS_Diagnoses_Per_100k
##  Min.   :   0.00      Min.   :  0.00         
##  1st Qu.:   0.00      1st Qu.:  0.00         
##  Median :   1.00      Median :  4.10         
##  Mean   :  12.03      Mean   : 17.15         
##  3rd Qu.:   7.00      3rd Qu.: 20.80         
##  Max.   :2611.00      Max.   :565.50         
##  NA's   :24           NA's   :92

Interpreting the summary: the dataset shows wide variability in HIV and AIDS diagnoses. Each row is one combination of year, neighborhood, sex, and race/ethnicity, and many rows report no diagnoses at all. The first quartile is 0 for both the total counts and the rates per 100,000 population, and the medians are low (2 HIV diagnoses, 8.9 per 100,000). The means are much higher (18.48 and 26.33), which indicates that a small number of rows with very large values skew the distribution to the right. Concurrent HIV/AIDS diagnoses are rare: the median count is 0, so at least half of the rows report none. The dataset also contains missing values, most notably 2,424 in the proportion of concurrent diagnoses. Overall, the summary suggests that the impact varies considerably, with certain areas and groups facing greater health challenges than others.
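
The gap between mean and median described above can be checked directly for any single column. The following sketch uses a hypothetical vector of counts with one extreme value and one missing entry; it is not taken from the dataset.

```r
library(dplyr)

# Hypothetical counts: mostly small, one very large value, one missing.
toy <- tibble(Total_HIV_Diagnoses = c(0, 0, 0, 2, 2, 5, 12, NA, 3353))

skew_check <- toy %>%
  summarise(
    median_count = median(Total_HIV_Diagnoses, na.rm = TRUE),
    mean_count   = mean(Total_HIV_Diagnoses, na.rm = TRUE),
    share_zero   = mean(Total_HIV_Diagnoses == 0, na.rm = TRUE),
    n_missing    = sum(is.na(Total_HIV_Diagnoses))
  )

print(skew_check)
```

A mean far above the median, together with a large share of zeros, is the pattern of right skew that the summary table shows for the diagnosis counts.
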

#Conclusion The dataset on HIV diagnoses in New York City, spanning 2010 to 2021, highlights significant trends. The summed diagnosis count peaks in 2020 at 35,136. Because the tidied data retains the aggregate "All" rows (relabeled "Unknown") and the rows duplicated by the neighborhood split, this figure overstates the number of individual diagnoses and is best read as a relative indicator. The earliest data is from 2010, the median year is 2017, and the most recent year is 2021. The tidied dataset has 12,256 rows and includes the categorical variables Borough, Neighborhood, Sex, and Race/Ethnicity. The analysis shows that the most affected population is cisgender males, although their sexual orientation is unknown because the dataset does not record it. These findings can help inform and improve HIV prevention programs such as PrEP and deepen understanding of the spread and impact of HIV across different communities. They also underline the importance of adherence to antiretroviral therapy in preventing further infections, offering useful input for public health strategies. One further challenge in the analysis was that the initial dataset did not clearly distinguish neighborhoods from their respective boroughs (counties).
