National Institute of Allergy and Infectious Diseases. (2024). “a group of red and black cells on a blue background”. Unsplash. https://unsplash.com/photos/a-group-of-red-and-black-cells-on-a-blue-background-cxUtwHxNGmo
I chose the dataset “HIV in Prisons, 2021 - Statistical Tables”, provided by Laura M. Maruschak to the Bureau of Justice Statistics. This dataset stood out to me; HIV/AIDS was a huge deal in America in the decades before I was born, but I didn’t even know how serious the epidemic was until college. AIDS killed 700,000 Americans by 2019, at one point 50,000 Americans in a year alone (Bayer, Oppenheimer & Parisi, 2021). The government did not act to find treatment for the disease as quickly as it would have if “mainstream” people were dying from it; the first cases of HIV/AIDS were in 1981, and effective medication that made HIV no longer a death sentence was not available until 1995 (Bayer, Oppenheimer & Parisi, 2021). Learning about the true impact of this disease was shocking to me. It has been estimated that 10% of gay men in the US aged 25-44 had died of AIDS by 1995 (Rosenfeld 2018); this would mean almost all gay men my parents’ age who have been out long enough have lost someone. Yet today, people dying of AIDS in America is much more rare. The modern day population of U.S. prisoners with HIV/AIDS was an interesting topic to me because of how much the disease and the context of it has changed.
I usd multiple datasets, which were very messy at first. I did most of the initial cleaning in the CSV files and only some of it through R. The main variables I worked with were totalpop and totalperc, variables representing the total number of inmates with HIV per prison jurisdiction and the total percentage of inmates with HIV per jurisdiction respectively. I mostly only used quantitative variables. The categorical variables other than jurisdiction that I included (the survey questions on HIV testing) proved to be irrelevant. I used dplyr to filter the data and join the datasets.
Load datasets
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DataExplorer)library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
setwd("/Users/Lucinda/Downloads/data110/HIVproject")hiv <-read_csv("HIVpositiveprisoners.csv") # HIV positive prisoners
Rows: 53 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Jurisdiction
dbl (5): female2017, female2018, female2019, female2020, female2021
num (5): male2017, male2018, male2019, male2020, male2021
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pop <-read_csv("PrisonPop.csv") # prison population
New names:
Rows: 53 Columns: 18
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(1): Jurisdiction num (15): 2017total, 2017male, 2017female, 2018total,
2018male, 2018female, ... lgl (2): ...17, ...18
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...17`
• `` -> `...18`
Rows: 50 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Jurisdiction, HIVMandatory, HIVOptout, HIVoptin, HIVonassess, Conse...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
custody <-read_csv("InCustodyHIV TESTING.csv") # policies on testing in custody
Rows: 50 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Jurisdiction, RoutineMed, HighRisk, OnRequest, ClinIndication, Cour...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Clean datasets
I actually did most of my cleaning by editing the CSV files manually. There were a lot of empty spaces, and I decided that the case_when command is way too painful to use when I could just shorten the state names in the file.
Here, I used my knowledge of tidyverse to filter out the parts of the datasets that I didn’t want and then join all the datasets together into one.
hiv <- hiv |>filter(jurisdiction !="AK", jurisdiction !="OR", jurisdiction !="VA")# I removed these rows because they have NA that make including them pointless later onhiv21 <- hiv |>select(jurisdiction, male2021, female2021) |>mutate(total2021 = male2021+female2021) # select just 2021, and add a totals columnpop21 <- pop |>select(jurisdiction, "2021total", "2021male", "2021female")# make a matching subset to hiv21 with the general prison population (already has a totals column)pophiv21 <-left_join(hiv21, pop21)
Joining with `by = join_by(jurisdiction)`
names(pophiv21) <-c("jurisdiction", "mhiv", "fhiv", "totalhiv", "totalpop", "mpop", "fpop")# rename titles to make sense for the joined dataframeintake21 <-left_join(pophiv21, intake) |>filter(jurisdiction !="Ustotal", jurisdiction !="State")
Joining with `by = join_by(jurisdiction)`
# Ustotal and State both have NA so I'm removing themeverything21 <-left_join(intake21, custody) |>mutate(fperc = fhiv/fpop,mperc = mhiv/mpop,totalperc = totalhiv/totalpop)
Joining with `by = join_by(jurisdiction)`
# joined all datasets together, and added percent columns for the percentage of prisoners in each jurisdiction who are positive for HIV
Statistical analysis
plot_correlation(everything21)
1 features with more than 20 categories ignored!
jurisdiction: 48 categories
This visualization was surprising to me. There is next to no correlation between categorical variables (other than jurisdiction) and any of the quantitative variables. The only correlation we really see is categorical variables among themselves (jurisdictions that mandate HIV testing are obviously most likely to be the jurisdictions that don’t allow consent in medical procedures) and quantitative variables among themselves.
More surprising, there is no correlation between the population of the jurisdiction and the number of prisoners with HIV. There is definitely a correlation between the jurisdiction and the HIV percentage, as I illustrate in a plot later. HIV positive population must instead depend on factors outside the scope of this data.
summary(lm(totalperc ~ totalpop, data = everything21))
Call:
lm(formula = totalperc ~ totalpop, data = everything21)
Residuals:
Min 1Q Median 3Q Max
-0.006573 -0.003626 -0.002035 0.002350 0.016298
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.476e-03 1.024e-03 8.277 1.16e-10 ***
totalpop 5.265e-08 2.943e-08 1.789 0.0803 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.005703 on 46 degrees of freedom
Multiple R-squared: 0.06502, Adjusted R-squared: 0.0447
F-statistic: 3.199 on 1 and 46 DF, p-value: 0.08027
# p-value 0.08 - no correlation between prison population and percentage with HIVsummary(lm(fhiv ~ mhiv, data = everything21))
Call:
lm(formula = fhiv ~ mhiv, data = everything21)
Residuals:
Min 1Q Median 3Q Max
-20.728 -1.365 0.185 2.188 16.548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.684199 1.019495 -0.671 0.506
mhiv 0.056714 0.002384 23.787 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.986 on 46 degrees of freedom
Multiple R-squared: 0.9248, Adjusted R-squared: 0.9232
F-statistic: 565.8 on 1 and 46 DF, p-value: < 2.2e-16
# tiny p-value, extremely strong correlation between men with HIV and women with HIV per jurisdiction
I looked over many different variables, but none of the categorical (“YES”/“NO”) variables had correlation with any quantitative variables other than on-request HIV testing (which I believe only had a statistically significant p-value because of having too little data; only 3/48 jurisdictions reported “NO”). However, there is a very strong correlation between the amount of HIV-positive female prisoners and HIV-positive male prisoners in a jurisdiction. The p-value is stated as 2.2e-16, which is statistically significant. I suppose this makes sense, but it’s also interesting to me, because from projects I’ve done on prisons before it generally seems like male and female prisons are very different. The formula for the linear regression model would be fhiv = -0.68 + 0.056(mhiv). The multiple r-squared value indicates that the variance in the female population is accounted for 92% of the time by the male prisoner population.
Scatterplots, exploring dataset
everything21 |>ggplot(aes(x = jurisdiction, y = totalhiv, color = totalperc)) +labs(title ="Total prisoners with HIV by jursidction",caption ="Bureau of Justice Statistics (Maruschak)",x ="Jurisdiction",y ="HIV total") +theme_minimal(base_size =12) +geom_point()
# the x axis is utterly unreadable, I just wanted to see where the points would fall in relation to each othereverything21 |>filter(totalpop <60000) |># removed outliers, 60000 inmates is way higher population than averageggplot(aes(x = totalpop, y = totalperc)) +labs(title ="Jurisdiction Population vs Percentage with HIV",caption ="Bureau of Justice Statistics (Maruschak)",x ="Population",y ="% with HIV") +geom_smooth(color ="#ee3a84") +geom_point() +scale_x_continuous(labels = scales::comma) +# stack overflowtheme_minimal(base_size =10)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
everything21 |>ggplot(aes(x = fhiv, y = mhiv)) +labs(title ="Amount of Prisoners with HIV per Jurisdiction by Gender",caption ="Bureau of Justice Statistics (Maruschak)",x ="Female Prisoners with HIV",y ="Male Prisoners with HIV") +geom_smooth(color ="#ee3a84") +geom_point() +scale_x_continuous(labels = scales::comma) +# the same stackoverflowtheme_minimal(base_size =10)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Here I just wanted a visual representation of the correlations I was testing in the linear regression model. I was also curious about HIV statistics per jurisdiction.
Main graphs
hivperc21 <- everything21 |>filter(totalperc >0.0095) |>arrange(totalperc)acolors <-c("#B429F9", "#A935F9", "#9E41F8", "#934DF8", "#8859F7", "#7D65F7", "#7271F6", "#687DF6", "#5D89F5", "#5295F5", "#47A1F4", "#3CADF4", "#31B9F3", "#26C5F3")highchart() |>hc_add_series(data = hivperc21,type ="lollipop", hcaes(x = jurisdiction,y = totalperc,group = totalpop)) |>hc_colors(acolors) |>hc_xAxis(title =list(text="Jurisdiction")) |>hc_yAxis(title =list(text="Percent of Inmates with HIV")) |>hc_subtitle(text="Source: Bureau of Justice Statistics (provided by M. Maruschak)") |>hc_title( # from rdrr.iotext ="<b>Jurisdictions (States) with Highest Percentages of Inmates with HIV, 2021</b>",margin =30,align ="center",style =list(color ="#B429F9") )
For this last plot, I made an interactive lollipop chart using Highcharter, which I had never used before. Unfortunately, I was unable to include the mean percentage as a baseline like I like doing with ggplot. I tried to use arrange to sort the visualization by percentage before creating it, and while it did change the arrangement within the dataset, it did not have any effect on the graph. I could not find a way to add a title in the class notes, so I went online to find out how. For the data source subtitle, I made a luck guess on my own on what the code would be.
I enjoyed this project, and like the first project, will probably continue to play with the code and improve it on my own for practice. One thing not included in this dataset is what the categorical variables other than jurisdiction are; they are a series of different policies on HIV testing (eg, whether or not it’s a mandatory part of intake, if its part of routine medical exams, if inmates can opt out of it, etc). They are still included in the dataset I used in the project because I spent time exploring them initially. However, there is no correlation between them and the quantitative variables (the amount of HIV positive inmates). This both surprised me and also didn’t surprise me at all. I expected there to be no correlation but was still curious if there would be. I also was originally working on a GIS plot but decided it was too much trouble for a project due so soon, and might look a bit odd since each jurisdiction is an entire state.
Bibliography and sources for code not taught in class
For scale_y_continuous(labels = scales::percent) https://stackoverflow.com/questions/27433798/how-can-i-change-the-y-axis-figures-into-percentages-in-a-barplot For scale_x_continuous(labels = scales::comma) https://stackoverflow.com/questions/52602503/display-an-axis-value-in-millions-in-ggplot For hc_title and its customization https://rdrr.io/cran/highcharter/man/hc_title.html
Bayer, R., Oppenheimer, G. M., & Parisi, V. (2021). Marking the 40th Anniversary of the AIDS Epidemic — American Physicians Look Back. The New England Journal of Medicine, 385(14), 1251–1253. https://doi.org/10.1056/NEJMp2106933
Rosenfeld, D. (2018). The AIDS epidemic’s lasting impact on gay men. The British Academy. https://www.thebritishacademy.ac.uk/blog/aids-epidemic-lasting-impact-gay-men/