HIV AIDS NY Data Visualizations

Intro

The dataset I used is the HIV_AIDS_NY dataset compiled by the HIV Epidemiology Program of the NYC Department of Health and Mental Hygiene, which provides data on HIV and AIDS cases in New York City for the years 2011-2015. The categorical variables of this dataset are the year, the borough, the UHF (a code for a smaller neighborhood inside a borough of New York City), gender, age, and race.

The quantitative variables include the number of HIV diagnoses, AIDS diagnoses, and concurrent diagnoses (those diagnosed with both HIV and AIDS), the diagnosis rates for HIV and AIDS (the number of diagnoses per 100,000 people), the percentage of people diagnosed with HIV who were linked to medical care within 3 months, the PLWDHI prevalence (the estimated prevalence of people living with diagnosed HIV infection), the percent viral suppression of people diagnosed with HIV within one year of diagnosis, the number of deaths, the death rate, the HIV-related death rate, and the non-HIV-related death rate (these rates are also per 100,000 people).

Unfortunately, this dataset was not very clean to start with. Almost every quantitative column had values of 99999 scattered through it, and some of the percentage values were over 100. To deal with this, I excluded every row that had either problem in any of the affected columns by subsetting those rows out of the data. This ended up being a very long line of code because of the number of columns I had to check.

Loading the packages and the dataset

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(RColorBrewer)
hiv_data <- read_csv("HIV_AIDS_NY.csv")
## Rows: 6005 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Borough, UHF, Gender, Age, Race
## dbl (13): Year, HIV diagnoses, HIV diagnosis rate, Concurrent diagnoses, % l...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Here I loaded tidyverse and RColorBrewer to allow me to use the functions necessary for my visualizations. I also loaded in the dataset.

Cleaning up the data

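# Drop rows that hold the 99999 placeholder, or a value over 100,
# in any of the affected columns. subset() looks the column names up
# in hiv_data, so the hiv_data$ prefix is not needed.
# The first check originally compared against 999999.0 (six 9s); it now
# uses 99999 to match the other checks and the description in the text.
hiv_data <- subset(
  hiv_data,
  `HIV diagnosis rate` != 99999 &
    `% linked to care within 3 months` <= 100 &
    `AIDS diagnosis rate` != 99999 &
    `PLWDHI prevalence` != 99999 &
    `% viral suppression` != 99999 &
    Deaths != 99999 &
    `Death rate` <= 100 &
    `HIV-related death rate` != 99999 &
    `Non-HIV-related death rate` != 99999
)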

This code is long and a bit hard to look at, but all it does is remove any row from the dataset that has a value of 99999 in one of the checked columns, as well as any row where a checked percentage or rate is greater than 100. This cleans the data so that my visualizations will not contain outliers that are clearly placeholders or incorrectly recorded data.
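The same idea can be written more compactly. The sketch below is only an illustration on a small made-up table (the `toy` data and its values are hypothetical, not taken from the real file), and it assumes that 99999 is a placeholder in every numeric column. It is broader than the filter above, because it checks all numeric columns rather than a hand-picked list.

```r
library(dplyr)

# Hypothetical toy data standing in for a few columns of hiv_data
toy <- tibble(
  Borough               = c("Bronx", "Brooklyn", "Queens"),
  `HIV diagnosis rate`  = c(35.2, 99999, 12.1),
  `% viral suppression` = c(71, 68, 99999)
)

# Keep only the rows where no numeric column holds the 99999 placeholder
toy_clean <- toy %>%
  filter(if_all(where(is.numeric), ~ .x != 99999))

toy_clean
```

Only the Bronx row survives here, because each of the other two rows has the placeholder in one column. The checks for percentages over 100 would still need their own conditions.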

Visualizations

First attempt at visualizing this dataset

ggplot(hiv_data, aes(x=`HIV diagnoses`, fill=`Borough`)) +
  geom_histogram(binwidth=100, alpha=0.5) +
  labs(title="HIV Diagnoses by Borough",
       x="Number of HIV Diagnoses",
       y="Count",
       fill="Borough") +
  scale_fill_hue()

In this visualization I attempted to make a histogram showing the frequency of HIV diagnosis counts across the different boroughs of New York City. However, there were some outliers in how many diagnoses occurred in different places due to differences in population, so most of the data is reasonably placed while a few values stretch the graph too far. While this could be fixed, I decided against it and left the visualization in this state.
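One possible way to handle a few very large values, which I did not apply to the plot above, is a log-scaled x axis. The sketch below uses simulated counts (the `toy` data frame and its numbers are made up for illustration), and it assumes every count is positive, since a log scale cannot show zeros.

```r
library(ggplot2)

set.seed(1)
# Hypothetical right-skewed counts standing in for `HIV diagnoses`
toy <- data.frame(
  Borough   = rep(c("Bronx", "Queens"), each = 200),
  diagnoses = c(rpois(200, 5) + 1, rpois(200, 3) + 1)
)
# Add two very large values, similar to the outliers described above
toy$diagnoses[c(1, 201)] <- c(900, 700)

# A log10 x axis compresses the long right tail so the bulk stays readable
p <- ggplot(toy, aes(x = diagnoses, fill = Borough)) +
  geom_histogram(bins = 30, alpha = 0.5, position = "identity") +
  scale_x_log10() +
  labs(title = "HIV Diagnoses by Borough (log scale)",
       x = "Number of HIV Diagnoses (log scale)",
       y = "Count",
       fill = "Borough")

p
```

Faceting by borough with `facet_wrap()` would be another option if the goal is to compare the shapes of the distributions rather than overlay them.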

Scatterplot of Linkage to Care vs. Viral Suppression

hiv_data_without_all <- subset(hiv_data, Race != "All")
ggplot(hiv_data_without_all, aes(x=`% linked to care within 3 months`, y=`% viral suppression`, color=`Race`)) +
  geom_point() +
  labs(title="Linkage to Care vs. Viral Suppression",
       x="Percent linked to care within 3 months",
       y="Percent viral suppression",
       color="Race") +
  scale_color_brewer(palette= "Set2")

In this visualization I wanted to compare the percentage of people who were linked to care for their HIV within 3 months of their diagnosis with the percentage whose virus was suppressed, to try to figure out whether treating HIV early after a diagnosis may be more effective at suppressing the virus. I then colored each point by race to look for any differences in effectiveness based on the race of those diagnosed. I first subsetted the data to remove the rows that listed the race as “All”, because these data points do not show any racial disparities and because, when I first created this plot, the “All” data points covered up almost all of the other data.
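A scatterplot shows the relationship by eye; a per-group correlation and a fitted line can make it easier to compare groups. The sketch below is an illustration on made-up numbers (the `toy` table, its group labels, and the column names `linked` and `suppressed` are hypothetical), not a result from the real data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: two groups, four observations each
toy <- tibble(
  Race       = rep(c("Group A", "Group B"), each = 4),
  linked     = rep(c(60, 70, 80, 90), times = 2),
  suppressed = c(50, 58, 66, 75, 55, 66, 74, 88)
)

# Pearson correlation between the two percentages within each group
by_race <- toy %>%
  group_by(Race) %>%
  summarise(r = cor(linked, suppressed), .groups = "drop")

# Same scatterplot idea, with one straight-line fit per group
p <- ggplot(toy, aes(x = linked, y = suppressed, color = Race)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Set2")

by_race
```

A correlation like this describes an association only; on its own it does not show that earlier linkage to care causes better suppression.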

Bar Graph of HIV Diagnosis Rate by Borough and Gender

ggplot(hiv_data, aes(x=`Borough`, y=`HIV diagnosis rate`, fill=`Gender`)) +
  geom_bar(stat="identity", position=position_dodge()) +
  labs(title="HIV Diagnosis Rates by Borough and Gender",
       x="Borough",
       y="HIV diagnosis rate (per 100,000)",
       fill="Gender") +
  scale_fill_brewer(palette="Dark2")

In this visualization, I wanted to look at the HIV diagnosis rates of each borough of New York City and also compare the rates of diagnosis between genders. To get the bars for each gender to appear next to one another instead of stacked on top of one another, I set the position argument to position_dodge().
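One thing to keep in mind with this kind of plot: when the data has several rows for the same borough and gender, stat="identity" with position_dodge() draws those bars over one another, so the bar that is visible reflects the largest value rather than a summary. A possible alternative is to summarise first and then plot one bar per pair. The sketch below does this on made-up numbers (the `toy` table and the `rate` column are hypothetical), and it uses a simple unweighted mean, which is an assumption and not a population-weighted rate.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: two rows per borough/gender pair
toy <- tibble(
  Borough = rep(c("Bronx", "Queens"), each = 4),
  Gender  = rep(c("Male", "Male", "Female", "Female"), times = 2),
  rate    = c(40, 60, 10, 20, 30, 50, 5, 15)
)

# One row per borough/gender pair, holding the mean rate
rate_summary <- toy %>%
  group_by(Borough, Gender) %>%
  summarise(mean_rate = mean(rate), .groups = "drop")

# geom_col() plots the values as given, one dodged bar per pair
p <- ggplot(rate_summary, aes(x = Borough, y = mean_rate, fill = Gender)) +
  geom_col(position = position_dodge()) +
  labs(x = "Borough",
       y = "Mean HIV diagnosis rate (per 100,000)",
       fill = "Gender") +
  scale_fill_brewer(palette = "Dark2")

rate_summary
```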

Conclusion

While my first plot didn’t really amount to much, my second and third plots worked out much better. My scatterplot reveals that as the percentage of people diagnosed with HIV who are linked to care within 3 months increases, the viral suppression rate also increases. However, it was interesting to note that this seemed to be less effective at suppressing the virus in Black populations, while it seemed to be more effective in White and Asian/Pacific Islander populations. The bar graph illustrates that the HIV diagnosis rate is higher in boroughs like the Bronx, Brooklyn, and Manhattan, and that the diagnosis rate of HIV in males is much higher than in all other genders.

This dataset is so rich with interesting data that there are multiple facets of it I was unable to explore in this project due to time constraints. I wish I could have looked at some of the AIDS statistics in comparison to the HIV statistics, as well as at the death rate data, but I ultimately didn’t give myself enough time to include more plots than the ones I created for this assignment, due to a bit of procrastination. I also wish I could have made a histogram of the HIV data work, but there was too much variance in a lot of the data to make a histogram that both conveyed the data well and looked good. I’m sure that with more time I could have figured out a way to make it work.