project 1

Author

bodidi

## Introduction

HIV is a virus that attacks cells that help the body fight infection, making a person more vulnerable to other infections and diseases. It is spread by contact with certain bodily fluids of a person with HIV, most commonly during unprotected sex or through sharing needles. There is no cure, and if left untreated, HIV can progress to AIDS.

This dataset covers HIV surveillance in New York City and includes variables such as year, borough, race, gender, age, and deaths. In this project, I analyze HIV diagnoses in New York City over time, and I also compare the five boroughs to see which has the highest number of HIV cases and whether any trends are emerging.

Source: https://www.hiv.gov/hiv-basics/overview/about-hiv-and-aids/what-are-hiv-and-aids

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Desktop/data 110")
HIV_AIDS_NY <- read_csv("HIV_AIDS_NY.csv")
Rows: 6005 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Borough, UHF, Gender, Age, Race
dbl (13): Year, HIV diagnoses, HIV diagnosis rate, Concurrent diagnoses, % l...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(HIV_AIDS_NY)
# A tibble: 6 × 18
   Year Borough UHF   Gender    Age   Race  `HIV diagnoses` `HIV diagnosis rate`
  <dbl> <chr>   <chr> <chr>     <chr> <chr>           <dbl>                <dbl>
1  2011 All     All   All       All   All              3379                 48.3
2  2011 All     All   Male      All   All              2595                 79.1
3  2011 All     All   Female    All   All               733                 21.1
4  2011 All     All   Transgen… All   All                51              99999  
5  2011 All     All   Female    13 -… All                47                 13.6
6  2011 All     All   Female    20 -… All               178                 24.7
# ℹ 10 more variables: `Concurrent diagnoses` <dbl>,
#   `% linked to care within 3 months` <dbl>, `AIDS diagnoses` <dbl>,
#   `AIDS diagnosis rate` <dbl>, `PLWDHI prevalence` <dbl>,
#   `% viral suppression` <dbl>, Deaths <dbl>, `Death rate` <dbl>,
#   `HIV-related death rate` <dbl>, `Non-HIV-related death rate` <dbl>
# Keep only the columns used in this analysis, grouped by year and borough
HIV_AIDS <- HIV_AIDS_NY |>
  select(Year, Borough, Age, `Death rate`, `HIV diagnoses`, `HIV diagnosis rate`) |>
  group_by(Year, Borough)
head(HIV_AIDS)
# A tibble: 6 × 6
# Groups:   Year, Borough [1]
   Year Borough Age     `Death rate` `HIV diagnoses` `HIV diagnosis rate`
  <dbl> <chr>   <chr>          <dbl>           <dbl>                <dbl>
1  2011 All     All             13.6            3379                 48.3
2  2011 All     All             13.4            2595                 79.1
3  2011 All     All             14               733                 21.1
4  2011 All     All             11.1              51              99999  
5  2011 All     13 - 19          1.4              47                 13.6
6  2011 All     20 - 29          7.2             178                 24.7
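The preview above shows an `HIV diagnosis rate` of 99999 on a row with only 51 diagnoses, which looks like a placeholder rather than a real rate. Below is a minimal sketch of one way to handle it, assuming that 99999 marks an unavailable value and that rows labeled "All" are aggregate totals rather than a separate borough. Both are assumptions about the dataset, and the toy tibble `hiv_toy` is hypothetical: it only mirrors the column names.

```r
library(dplyr)

# Hypothetical toy rows that mirror the column names in HIV_AIDS_NY
hiv_toy <- tibble(
  Year = c(2011, 2011, 2011, 2011),
  Borough = c("All", "All", "Bronx", "Brooklyn"),
  `HIV diagnoses` = c(3379, 51, 10, 20),
  `HIV diagnosis rate` = c(48.3, 99999, 99999, 30.5)
)

hiv_clean <- hiv_toy |>
  # Assumption: 99999 is a placeholder for an unavailable rate, so treat it as NA
  mutate(`HIV diagnosis rate` = na_if(`HIV diagnosis rate`, 99999)) |>
  # Assumption: "All" rows are citywide totals, not a sixth borough
  filter(Borough != "All")

hiv_clean
```

If these assumptions hold for the real file, the same two steps could be applied to `HIV_AIDS_NY` before plotting, so that placeholder values and citywide totals do not distort the bars.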
plot1 <- HIV_AIDS_NY |>
  ggplot() +
  geom_bar(aes(x = `Year`, y = `HIV diagnoses`, fill = `Borough`), 
           position = "dodge", stat = "identity") + 
  labs(fill = "Borough",
       y = "Number of HIV Diagnoses",
       title = "HIV Diagnoses by Year and Borough",
       caption = "nyc.gov") +
  theme_minimal()
plot1

plot2 <- HIV_AIDS_NY |>
  ggplot() +
  geom_bar(aes(x = `Borough`, y = `HIV diagnoses`, fill = `Borough`), 
           position = "dodge", stat = "identity") + 
  labs(fill = "Borough",
       y = "Number of HIV Diagnoses",
       title = "HIV Diagnoses by Borough",
       caption = "nyc.gov") +
  theme_minimal()
plot2
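A table of totals can back up the borough comparison with numbers instead of reading it off the bars. The sketch below uses a hypothetical toy tibble with made-up values. It assumes that rows with "All" in `Gender`, `Age`, and `Race` are totals for that borough and year, so keeping only that slice avoids counting the same diagnoses more than once. The real data also has a `UHF` column that would likely need the same treatment.

```r
library(dplyr)

# Hypothetical toy rows; the numbers are made up for illustration
hiv_toy <- tibble(
  Year = c(2011, 2011, 2012, 2012, 2011),
  Borough = c("Bronx", "Brooklyn", "Bronx", "Brooklyn", "All"),
  Gender = "All",
  Age = "All",
  Race = "All",
  `HIV diagnoses` = c(10, 20, 12, 18, 30)
)

borough_totals <- hiv_toy |>
  # Assumption: one "All" slice per borough and year avoids double counting
  filter(Borough != "All", Gender == "All", Age == "All", Race == "All") |>
  group_by(Borough) |>
  summarise(total_diagnoses = sum(`HIV diagnoses`, na.rm = TRUE)) |>
  # Largest total first
  arrange(desc(total_diagnoses))

borough_totals
```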

model <- lm(`Death rate` ~ `Age`, data = HIV_AIDS_NY)
summary(model)

Call:
lm(formula = `Death rate` ~ Age, data = HIV_AIDS_NY)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.364  -8.889  -1.718   2.921 232.836 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.3871     0.6779   0.571    0.568    
Age20 - 29    3.9923     0.9586   4.165 3.16e-05 ***
Age30 - 39    5.9798     0.9586   6.238 4.74e-10 ***
Age40 - 49   10.0435     0.9586  10.477  < 2e-16 ***
Age50 - 59   16.4017     0.9586  17.109  < 2e-16 ***
Age60+       29.9767     0.9586  31.270  < 2e-16 ***
AgeAll        8.9305     0.7281  12.266  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.85 on 5998 degrees of freedom
Multiple R-squared:  0.1787,    Adjusted R-squared:  0.1779 
F-statistic: 217.5 on 6 and 5998 DF,  p-value: < 2.2e-16
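Because `Age` is a categorical variable, each coefficient in the output above is a difference from the baseline age group, not a slope. The baseline is the level missing from the table, which appears to be "13 - 19" (it shows up in the data preview but has no coefficient). The sketch below copies the estimates from the output and adds the intercept to each one to recover the estimated death rate per age group. The variable names are illustrative.

```r
# Coefficients copied from the summary(model) output
intercept <- 0.3871  # estimate for the baseline age group, "13 - 19"
offsets <- c(
  "20 - 29" = 3.9923,
  "30 - 39" = 5.9798,
  "40 - 49" = 10.0435,
  "50 - 59" = 16.4017,
  "60+"     = 29.9767,
  "All"     = 8.9305
)

# With one categorical predictor, each fitted value is intercept + offset
fitted_means <- c("13 - 19" = intercept, intercept + offsets)
round(fitted_means, 2)
```

For example, the estimate for ages 60+ is 0.3871 + 29.9767 = 30.3638, compared with 0.3871 for the baseline group.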
plot3 <- HIV_AIDS_NY |>
  ggplot(aes(x = `Age`, y = `Death rate`)) +
  geom_point(size = 2) +  
  labs(title = "Death Rate by Age Group",
       x = "Age",
       y = "Death Rate",
       caption = "nyc.gov") +
  theme_minimal()
plot3
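Since `Age` is stored as age-group labels, a scatter plot stacks all points in vertical strips and does not show the fitted model. One possible alternative, sketched below with hypothetical toy data, is a box plot per age group with the group mean marked, because the group means are what `lm()` estimates when the only predictor is categorical. The data frame `toy` and its values are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data; values are made up for illustration only
toy <- data.frame(
  Age = rep(c("13 - 19", "20 - 29", "60+"), each = 4),
  death_rate = c(0, 1, 0, 1, 3, 5, 4, 4, 25, 35, 30, 30)
)

plot_sketch <- ggplot(toy, aes(x = Age, y = death_rate)) +
  # Distribution of death rates within each age group
  geom_boxplot() +
  # Mark each group mean with a cross
  stat_summary(fun = mean, geom = "point", shape = 4, size = 3) +
  labs(x = "Age", y = "Death Rate") +
  theme_minimal()

plot_sketch
```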

I got the inspiration for this project from the hate crime homework we did a couple of weeks ago. I started by cleaning up the dataset: I used `select()` to keep only the columns I wanted to focus on (`Year`, `Borough`, `Age`, `Death rate`, `HIV diagnoses`, and `HIV diagnosis rate`) and grouped the rows by year and borough with `group_by()`.

I created a bar graph to show the number of HIV diagnoses by year and borough, which helped me spot the trend I was looking for: Brooklyn and Manhattan had noticeably higher numbers than the other boroughs. Then I ran a linear regression to look at the relationship between age and death rate. The estimated death rate increases with each age group. Compared with the baseline group (13 - 19), the estimate is about 4.0 higher for ages 20 - 29 and about 30.0 higher for ages 60+, and every age coefficient is statistically significant (p < 0.001). Age group alone explains only about 18% of the variation in death rate (R-squared = 0.1787).

One thing I really wanted to explore was whether the people in the data were local residents, tourists, or both, especially since New York is such a popular tourist destination.