Dylan Johnson - Project 2 HIV/AIDS

HIV and AIDS: The Basics | NIH

Intro:

The data set analyzed in this project focuses on HIV/AIDS diagnoses in New York. This topic is of immense importance due to the ongoing public health implications of HIV/AIDS, which remains a significant health challenge worldwide. By analyzing the distribution and trends of HIV diagnoses, public health officials, researchers, and policymakers can better understand the spread of the disease and implement targeted interventions.

Variables Included:

Year: The year when the HIV diagnoses were reported (Numeric).
Borough: The borough in New York where the diagnoses were reported (Categorical: Brooklyn, Bronx, Manhattan, Queens, Staten Island).
HIV diagnoses: The number of HIV diagnoses reported (Numeric).
Race: The racial category of individuals diagnosed (Categorical: White, Black, Latino/Hispanic, Asian/Pacific Islander, Other/Unknown, All).

Data Source and Methodology

The data was sourced from the New York City Department of Health and Mental Hygiene. It was downloaded from a publicly available data set on HIV/AIDS diagnoses. Unfortunately, there was no accompanying ReadMe file with detailed information about the data collection methodology. However, it is reasonable to assume that the data was collected through mandatory reporting systems where healthcare providers report new HIV diagnoses to the health department.

Data Cleaning Process

1: Renaming Columns: The column containing the number of HIV diagnoses was renamed from ‘Cases’ to ‘HIV diagnoses’ for clarity.
2: Handling White Space: The Race column contained leading and trailing white space, which was removed to standardize the values.
3: Filtering Data: The data set was filtered to include data only from the year 2000 onwards, focusing on more recent trends.
4: Summarizing Data: The data was summarized to get total diagnoses by race and borough for more focused analysis.
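
The four cleaning steps above can be sketched as follows. This is a minimal illustration on made-up rows, not the project's actual cleaning script: the `raw` table, its values, and the names `cleaned` and `totals_by_race_borough` are hypothetical, and `trimws()` is just one of several ways to strip white space.

```r
library(dplyr)

# Hypothetical toy rows; the real file has many more rows and columns.
raw <- tibble(
  Year    = c(1999, 2011, 2011, 2012),
  Borough = c("Bronx", "Bronx", "Queens", "Bronx"),
  Race    = c(" Black", "Black ", " White ", "Black"),
  Cases   = c(5, 10, 4, 7)
)

cleaned <- raw %>%
  rename(`HIV diagnoses` = Cases) %>%  # step 1: clearer column name
  mutate(Race = trimws(Race)) %>%      # step 2: strip leading/trailing white space
  filter(Year > 2000)                  # step 3: same condition as the report's filter

# Step 4: total diagnoses by race and borough
totals_by_race_borough <- cleaned %>%
  group_by(Race, Borough) %>%
  summarize(TotalDiagnoses = sum(`HIV diagnoses`), .groups = "drop")

print(totals_by_race_borough)
```

Trimming before grouping matters here: without it, " Black" and "Black " would be treated as two separate race categories.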

Personal Significance

I chose this topic and data set because I am interested in epidemiology and public health. Planning and implementing effective public health interventions requires an understanding of the patterns and trends in HIV diagnoses. I hope that my analysis of this data set will advance knowledge about HIV/AIDS and aid in the fight against this illness. This topic holds special significance for me because it relates to my academic and professional aspirations in the field of public health and offers a chance to apply analytical and statistical knowledge to a pressing problem affecting a large number of individuals.

#Load Packages
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)
library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(broom)
library(RColorBrewer)
library(ggalluvial)

#Load Data set and set working directory
setwd("C:/Users/dylan/OneDrive/Documents/Data 110 Summer")
HivAids <- read_csv("HIV_AIDS_NY.csv")

## Rows: 6005 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Borough, UHF, Gender, Age, Race
## dbl (13): Year, HIV diagnoses, HIV diagnosis rate, Concurrent diagnoses, % l...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(HivAids)

## # A tibble: 6 × 18
##    Year Borough UHF   Gender    Age   Race  `HIV diagnoses` `HIV diagnosis rate`
##   <dbl> <chr>   <chr> <chr>     <chr> <chr>           <dbl>                <dbl>
## 1  2011 All     All   All       All   All              3379                 48.3
## 2  2011 All     All   Male      All   All              2595                 79.1
## 3  2011 All     All   Female    All   All               733                 21.1
## 4  2011 All     All   Transgen… All   All                51              99999  
## 5  2011 All     All   Female    13 -… All                47                 13.6
## 6  2011 All     All   Female    20 -… All               178                 24.7
## # ℹ 10 more variables: `Concurrent diagnoses` <dbl>,
## #   `% linked to care within 3 months` <dbl>, `AIDS diagnoses` <dbl>,
## #   `AIDS diagnosis rate` <dbl>, `PLWDHI prevalence` <dbl>,
## #   `% viral suppression` <dbl>, Deaths <dbl>, `Death rate` <dbl>,
## #   `HIV-related death rate` <dbl>, `Non-HIV-related death rate` <dbl>

# Review column names
colnames(HivAids)

##  [1] "Year"                             "Borough"                         
##  [3] "UHF"                              "Gender"                          
##  [5] "Age"                              "Race"                            
##  [7] "HIV diagnoses"                    "HIV diagnosis rate"              
##  [9] "Concurrent diagnoses"             "% linked to care within 3 months"
## [11] "AIDS diagnoses"                   "AIDS diagnosis rate"             
## [13] "PLWDHI prevalence"                "% viral suppression"             
## [15] "Deaths"                           "Death rate"                      
## [17] "HIV-related death rate"           "Non-HIV-related death rate"

# Select relevant columns and filter for years greater than 2000
HivAids_filtered <- HivAids %>%
  select(Year, Borough, `HIV diagnoses`, Race, Age) %>%
  filter(Year > 2000)

# Summarize the data to get total diagnoses per year
total_diagnoses_per_year <- HivAids_filtered %>%
  group_by(Year) %>%
  summarize(TotalDiagnoses = sum(`HIV diagnoses`))
head(total_diagnoses_per_year)

## # A tibble: 5 × 2
##    Year TotalDiagnoses
##   <dbl>          <dbl>
## 1  2011          36708
## 2  2012          33648
## 3  2013          31004
## 4  2014          30028
## 5  2015          27712
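
One thing to check in these totals: the `head()` preview shows rows where Borough, Race, and other columns equal "All", which look like pre-computed aggregates. The 2011 row with "All" in every column reports 3379 diagnoses, while the summed total for 2011 is 36708. If that reading is right, summing every row counts the same diagnoses more than once. The sketch below uses hypothetical toy data and assumes that "All" rows are aggregates of the detail rows; the names `toy`, `naive_total`, and `citywide_total` are illustrative only.

```r
library(dplyr)

# Hypothetical toy rows mimicking the layout: "All" marks an aggregate row.
toy <- tibble(
  Year = rep(2011, 7),
  Borough = c("All", "Bronx", "Queens", "Bronx", "Bronx", "Queens", "Queens"),
  Race = c("All", "All", "All", "Black", "White", "Black", "White"),
  `HIV diagnoses` = c(30, 20, 10, 12, 8, 6, 4)
)

# Summing every row mixes aggregate rows with the detail rows they summarize
naive_total <- toy %>%
  group_by(Year) %>%
  summarize(TotalDiagnoses = sum(`HIV diagnoses`))

# Keeping only the citywide aggregate row gives one count per year
citywide_total <- toy %>%
  filter(Borough == "All", Race == "All") %>%
  group_by(Year) %>%
  summarize(TotalDiagnoses = sum(`HIV diagnoses`))

naive_total$TotalDiagnoses     # 90: three times the true count in this toy example
citywide_total$TotalDiagnoses  # 30
```

In the real file, the same idea would presumably also require the Gender, Age, and UHF columns to equal "All"; that is an assumption about the data layout that should be verified against the source documentation.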

# Create a box plot for HIV diagnoses by race with custom colors and theme
# Race is mapped to fill so that the colors in scale_fill_manual() are applied
box_plot <- ggplot(HivAids_filtered, aes(x = Race, y = `HIV diagnoses`, fill = Race)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 2, notch = TRUE, alpha = 0.5) +
  scale_fill_manual(values = c("White" = "#87ceeb", "Black" = "#00008b", "Latino/Hispanic" = "#32cd32", "Asian/Pacific Islander" = "#800080", "Other/Unknown" = "#ffa500", "All" = "#ffff00")) +
  labs(title = "HIV Diagnoses by Race per Report",
       x = "Race",
       y = "Number of Diagnoses",
       caption = "Data Source: HIV/AIDS Data NY") +
  theme_classic()

# Convert to Interactive plot and display
box_plot_interactive <- ggplotly(box_plot)

box_plot_interactive

What this graph visualizes:

The box plot visualizes the distribution of HIV diagnoses across different racial groups. Each box represents the interquartile range (IQR) of the reported counts within a race category, with the median indicated by a line inside the box. The whiskers extend to the most extreme values that lie within 1.5 times the IQR of the box edges, and points beyond that are drawn in red as outliers. The visualization shows how much the number of diagnoses per report varies among racial groups. For instance, the median might be notably higher in certain groups, such as the Black and Latino/Hispanic populations, than in others, which would point to disparities in the incidence of HIV. One limitation is that the box plot does not show the distribution of diagnoses within each borough; I had trouble programming that into the visualization. A faceted plot that includes both race and borough would show how these factors interact. Hovering over the interactive version shows the summary values for each box, which adds detail that a static plot lacks.
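
The race-by-borough view mentioned above could be approached with `facet_wrap()`, which draws one panel per borough. The sketch below runs on simulated counts, not the project data: the `toy` data frame, its `diagnoses` column, and the Poisson rate are all made up for illustration.

```r
library(ggplot2)

# Simulated stand-in for the filtered data: 10 reports per race and borough.
set.seed(1)
toy <- expand.grid(
  Race = c("Black", "Latino/Hispanic", "White"),
  Borough = c("Bronx", "Brooklyn", "Manhattan"),
  report = 1:10,
  stringsAsFactors = FALSE
)
toy$diagnoses <- rpois(nrow(toy), lambda = 20)

# One panel per borough, one box per race within each panel
facet_plot <- ggplot(toy, aes(x = Race, y = diagnoses, fill = Race)) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~ Borough) +
  labs(title = "Diagnoses by Race, Faceted by Borough (simulated data)",
       x = "Race",
       y = "Number of Diagnoses") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

facet_plot
```

With the real data, replacing `toy` with `HivAids_filtered` and `diagnoses` with `` `HIV diagnoses` `` should give the equivalent plot, and the result can be passed to `ggplotly()` in the same way as the other figures.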

# Create a line plot for HIV diagnoses over time by borough with custom colors and theme
line_plot <- ggplot(HivAids_filtered, aes(x = Year, y = `HIV diagnoses`, color = Borough)) +
  geom_line(linewidth = 1.2, alpha = 0.5) +
  scale_color_manual(values = c("blue", "red", "green", "purple", "orange", "yellow")) +
  labs(title = "HIV Diagnoses Over Time by Borough",
       x = "Year",
       y = "Number of Diagnoses",
       caption = "Data Source: HIV/AIDS Data NY") +
  theme_bw()

# Convert to an interactive plot
line_plot_interactive <- ggplotly(line_plot)

# Display the interactive plot
line_plot_interactive

What this visualization represents

The line plot shows the trend of HIV diagnoses over time, separated by borough. Each line represents a specific borough, with different colors used to differentiate between them. The plot can reveal temporal trends such as periods of increasing or decreasing diagnoses. For example, one might observe a general decline in diagnoses over time in some boroughs while others fluctuate. Such patterns can indicate the effectiveness of public health interventions or emerging hot spots of the epidemic. One caveat is that the filtered data holds several rows per borough and year, so each line passes through multiple values per year rather than a single yearly total.
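
A line plot of this kind is usually easier to read when each borough has exactly one value per year. The sketch below aggregates before plotting, using hypothetical toy rows; `toy`, `yearly_by_borough`, and `trend_plot` are illustrative names, and the numbers are invented.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy rows: two reports per borough and year.
toy <- tibble(
  Year = rep(c(2011, 2011, 2012, 2012), times = 2),
  Borough = rep(c("Bronx", "Queens"), each = 4),
  `HIV diagnoses` = c(5, 7, 4, 6, 2, 3, 1, 2)
)

# Collapse to one total per borough and year
yearly_by_borough <- toy %>%
  group_by(Borough, Year) %>%
  summarize(TotalDiagnoses = sum(`HIV diagnoses`), .groups = "drop")

# Each line now connects a single point per year
trend_plot <- ggplot(yearly_by_borough,
                     aes(x = Year, y = TotalDiagnoses, color = Borough)) +
  geom_line(linewidth = 1.2) +
  geom_point() +
  scale_x_continuous(breaks = unique(yearly_by_borough$Year)) +
  theme_bw()

trend_plot
```

Whether a sum is the right aggregate depends on which rows are kept: if the data mixes detail rows with pre-computed totals, the totals should be filtered out (or used alone) before summing.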

# Summarize the data to get total diagnoses per borough
total_diagnoses_by_borough <- HivAids_filtered %>%
  group_by(Borough) %>%
  summarize(TotalDiagnoses = sum(`HIV diagnoses`))

bar_plot <- ggplot(total_diagnoses_by_borough, aes(x = Borough, y = TotalDiagnoses, fill = Borough)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("dodgerblue", "orange", "forestgreen", "purple", "red", "#a26ca3")) +
  labs(title = "HIV Diagnoses by Borough",
       x = "Borough",
       y = "Number of Diagnoses",
       caption = "Data Source: HIV/AIDS Data NY") +
  theme_minimal() +
  theme(legend.position = "none")

bar_plot_interactive <- ggplotly(bar_plot)

# Display the interactive plot
bar_plot_interactive

What this visualization shows

The bar plot displays the total number of HIV diagnoses in each borough. Each bar represents a borough, and the height of the bar corresponds to the total number of diagnoses reported. This visualization can show stark differences in the number of diagnoses between boroughs. For example, boroughs like Manhattan or Brooklyn might have significantly higher numbers than others, which could reflect differences in population size or disparities in healthcare access.

# Perform linear regression analysis
linear_model <- lm(TotalDiagnoses ~ Year, data = total_diagnoses_per_year)

# Summary of the model
model_summary <- summary(linear_model)
print(model_summary)

## 
## Call:
## lm(formula = TotalDiagnoses ~ Year, data = total_diagnoses_per_year)
## 
## Residuals:
##      1      2      3      4      5 
##  565.6 -333.2 -816.0  369.2  214.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 4382315.6   415650.5   10.54  0.00182 **
## Year          -2161.2      206.5  -10.47  0.00186 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 653 on 3 degrees of freedom
## Multiple R-squared:  0.9733, Adjusted R-squared:  0.9645 
## F-statistic: 109.6 on 1 and 3 DF,  p-value: 0.001862

# Equation for the model
cat("The linear regression equation is: TotalDiagnoses =", coef(linear_model)[1], "+", coef(linear_model)[2], "* Year\n")

## The linear regression equation is: TotalDiagnoses = 4382316 + -2161.2 * Year

# Plot diagnostics
par(mfrow = c(2, 2))
plot(linear_model)

# Adjusted R-squared value
cat("Adjusted R-squared value:", model_summary$adj.r.squared, "\n")

## Adjusted R-squared value: 0.9644609

# P-values
cat("P-values:\n", model_summary$coefficients[,4], "\n")

## P-values:
##  0.00182245 0.001861868

The linear regression analysis examines the relationship between the year and the total number of HIV diagnoses. The fitted slope is -2161.2, meaning the summed diagnoses fall by about 2,161 per year, and the slope is statistically significant (p = 0.0019). The adjusted R-squared of 0.964 indicates that the year explains most of the variability in the yearly totals. A significant negative slope like this indicates a decrease in diagnoses over time, which might be a positive outcome reflecting successful interventions. The intercept of about 4,382,316 is the extrapolated value at Year 0 and has no practical interpretation. The diagnostic plots are used to check the assumptions of the linear regression model, but with only five yearly totals (2011 to 2015) and 3 residual degrees of freedom they can say little, and the estimates should be read with caution. Another limitation is that the model only considers the overall trend without accounting for potential confounders such as demographic changes, public health initiatives, or socio-economic factors. Including additional variables in a multiple regression model could provide a more nuanced understanding of the factors influencing HIV diagnoses over time.
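
As a cross-check of the regression output, the model can be refit from the five yearly totals printed earlier in this report. The sketch below also centers the predictor on 2011, which is an optional variation rather than what the report's model does: the slope and R-squared are unchanged, but the intercept becomes the fitted value for 2011 instead of an extrapolation to Year 0. The names `yearly`, `YearsSince2011`, and `centered_model` are illustrative.

```r
# Yearly totals copied from the summary table printed in this report
yearly <- data.frame(
  Year = 2011:2015,
  TotalDiagnoses = c(36708, 33648, 31004, 30028, 27712)
)

# Center the predictor on the first year so the intercept is the fitted 2011 value
yearly$YearsSince2011 <- yearly$Year - 2011
centered_model <- lm(TotalDiagnoses ~ YearsSince2011, data = yearly)

coef(centered_model)                    # intercept 36142.4, slope -2161.2
confint(centered_model)                 # 95% confidence intervals for both coefficients
summary(centered_model)$adj.r.squared   # same adjusted R-squared as the uncentered model
```

The confidence interval from `confint()` is a useful companion to the p-value here, since with five points the interval around the slope is wide even though the fit looks tight.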