Data 110 Final Project

Data 110 Final Project: COVID-19 in US Prisons

Introduction Essay: This project explores the rate of COVID-19 in US prisons. During the COVID-19 pandemic, the incarcerated population in the United States faced a unique set of challenges and vulnerabilities. US prisons suffer from overcrowding and a lack of adequate health care, and these factors exacerbate the rate at which COVID-19 was, and still is, experienced within this population. In fact, as of 2021 just barely over 50% of US inmates were vaccinated against COVID-19. This project uses data to explore these trends in hopes of informing a solution and plan of action.

The data set used for this project was put together by the New York Times, which scraped data from state and federal prison system websites, state health departments, and coroners' records. It includes the following variables: facility name, facility type (federal, state, or local), facility county, facility state, facility latitude, facility longitude, latest recorded inmate population, maximum inmate population in 2020, total inmate cases, total inmate deaths, total officer cases, and total officer deaths. Based on these variables, I want to compare rates of illness and death by facility type, state, and inmate population. I chose this topic because the COVID-19 pandemic affected many people worldwide, yet the incarcerated population is often put on the back burner and overlooked. As a result, I want to explore how the pandemic affected this unique population.

Guiding Questions

Load Data Set and Libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

prisonsdf <- read_csv("facilities.csv")

Rows: 2639 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): nyt_id, facility_name, facility_type, facility_city, facility_count...
dbl (8): facility_lng, facility_lat, latest_inmate_population, max_inmate_po...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Statistical Analysis 1: Histogram to Measure Spread in Values

ggplot(prisonsdf, aes(x = latest_inmate_population)) +
  geom_histogram(color = "black", fill = "red") +
  labs(title = "Histogram of Inmate Population", x = "Number of Inmates", y = "Frequency") +
  theme_minimal()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 1046 rows containing non-finite values (`stat_bin()`).

ggplot(prisonsdf, aes(x = max_inmate_population_2020)) +
  geom_histogram(color = "black", fill = "red") +
  labs(title = "Histogram of Max Inmate Population (2020)", x = "Number of Inmates", y = "Frequency") +
  theme_minimal()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 1801 rows containing non-finite values (`stat_bin()`).

Visualization Explanation: For my statistical visualizations I used histograms to show the spread of two variables. One of the issues prisoners face in the US is overcrowding: the US has experienced explosive prison population growth since the 1970s due to policy (Penal Reform International). I created the histograms to compare the spread of the typical populations (Histogram 1) with the populations at the height of COVID-19 (Histogram 2). By comparing these plots we can see how prison populations grew and how they are spread. In Histogram 1, most US prisons fall in the range of 1,000 to 2,500 inmates. However, during COVID-19 some prison populations (for example, Central California Women’s Facility, Elkton federal prison, and Big Spring) more than doubled. Both plots exclude facilities with no recorded value (1,046 rows for the latest population and 1,801 rows for the 2020 maximum), so the two histograms do not cover exactly the same set of facilities. Through these visualizations I understand how the population spread has changed. This sets the scene for the explorations that follow.
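The `stat_bin()` message and the missing-value warnings above suggest two small checks before reading the histograms: count how many facilities have no recorded population, and choose a bin width on purpose. The following is a minimal sketch, using a small made-up data frame (`toy`) in place of `prisonsdf`. The bin width of 500 inmates is an assumption for illustration, not a value used in this project.

```r
library(ggplot2)

# Hypothetical stand-in for prisonsdf; the values are made up for illustration.
toy <- data.frame(
  latest_inmate_population = c(120, 850, 1400, NA, 2300, 1900, NA, 640)
)

# How many facilities have no recorded population? These are the rows
# that geom_histogram() drops with a "non-finite values" warning.
n_missing <- sum(is.na(toy$latest_inmate_population))

# Drop the missing rows explicitly so the plot is built without that warning.
toy_complete <- toy[!is.na(toy$latest_inmate_population), , drop = FALSE]

# Set the bin width deliberately (500 inmates per bin is an assumed value)
# instead of relying on the default of 30 bins.
p <- ggplot(toy_complete, aes(x = latest_inmate_population)) +
  geom_histogram(binwidth = 500, boundary = 0, color = "black", fill = "red") +
  labs(title = "Histogram of Inmate Population",
       x = "Number of Inmates", y = "Frequency") +
  theme_minimal()
```

Printing `p` draws the histogram. Reporting `n_missing` next to the plot makes it clear how many facilities the histogram leaves out.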

Statistical Exploration: Correlation Calculation

# use = "complete.obs" keeps only facilities where both values are recorded;
# without it, cor() returns NA because the population column contains missing values
cor(prisonsdf$latest_inmate_population, prisonsdf$total_inmate_deaths, use = "complete.obs")

This is not a visualization. It is a calculation of the correlation between inmate population and inmate deaths. The value I obtained is .41, which indicates a moderate positive correlation between the two variables: as the inmate population increases, so does the number of deaths. This makes sense because larger inmate populations mean more crowded spaces where disease can spread quickly.
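By default, `cor()` returns `NA` whenever either input contains a missing value. The following is a minimal sketch of that behavior and of restricting the calculation to complete pairs. The numbers are made up for illustration and are not taken from `facilities.csv`.

```r
# Made-up numbers for illustration; not values from facilities.csv.
population <- c(500, 1200, NA, 2400, 3100, 800)
deaths     <- c(0, 2, 1, NA, 6, 1)

# With the default use = "everything", any NA makes the result NA.
r_default <- cor(population, deaths)

# Restrict the calculation to rows where both values are present.
r_complete <- cor(population, deaths, use = "complete.obs")

# Count how many pairs actually went into the calculation.
n_pairs <- sum(complete.cases(population, deaths))
```

Reporting `n_pairs` alongside the coefficient shows how many facilities the correlation is actually based on.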

Data Filtering and Dplyr Commands

# Create a new data frame: keep facilities with a recorded population and case count,
# add a death rate (deaths per case), and sort by population
filtered_prisons <- prisonsdf |>
  filter(!is.na(latest_inmate_population), !is.na(total_inmate_cases)) |>
  mutate(death_rate = total_inmate_deaths / total_inmate_cases) |>
  arrange(latest_inmate_population)
# Keep only federal prisons (the match is case-sensitive: "Federal prison")
federal_prisons <- prisonsdf |>
  filter(facility_type == "Federal prison")
# Keep only state prisons
state_prisons <- prisonsdf |>
  filter(facility_type == "State prison")
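
Because `filter()` compares strings exactly, a label such as "Federal Prison" will not match rows stored as "Federal prison". The following is a quick sketch of checking the stored labels before filtering, and of guarding the death rate against facilities with zero cases. It uses a made-up data frame (`toy`), and the label values are assumptions modeled on the ones used in this project.

```r
library(dplyr)

# Hypothetical stand-in for prisonsdf; the values are made up for illustration.
toy <- tibble(
  facility_type = c("Federal prison", "State prison", "State prison", "Jail"),
  total_inmate_cases = c(40, 10, 0, 5),
  total_inmate_deaths = c(2, 0, 0, 1)
)

# List the labels exactly as they are stored, with counts.
label_counts <- toy |> count(facility_type)

# Exact, case-sensitive match: the capital "P" finds nothing.
wrong_case <- toy |> filter(facility_type == "Federal Prison")

# Matching the stored label returns the federal rows.
federal <- toy |> filter(facility_type == "Federal prison")

# A death rate is undefined when a facility has zero cases (0 / 0 is NaN),
# so mark those rows as NA instead.
toy_rates <- toy |>
  mutate(death_rate = if_else(total_inmate_cases > 0,
                              total_inmate_deaths / total_inmate_cases,
                              NA_real_))
```

Running `count(facility_type)` on the real data first is a cheap way to confirm the exact spelling and capitalization of each label before writing a filter.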

Primary Visualization 1: Scatter Plot Focusing on Population and Cases per Prison Type

prisonscatter <- ggplot(prisonsdf, aes(x = latest_inmate_population, y = total_inmate_cases)) +
  geom_point(size = 3, color = "red") +
  labs(title = "Scatter Plot: Prison Population vs. COVID Cases",
       x = "Prison Population",
       y = "COVID Cases") +
  theme_minimal()

ggplotly(prisonscatter)

federal_prisons <- filter(prisonsdf, facility_type == "Federal prison")

fedscatter <- ggplot(federal_prisons, aes(x = latest_inmate_population, y = total_inmate_cases)) +
  geom_point(size = 3, color = "red") +
  labs(title = "Scatter Plot: Federal Prison Population vs. COVID Cases",
       x = "Prison Population",
       y = "COVID Cases") +
  theme_minimal()

ggplotly(fedscatter)

Explanation: This is a simple scatter plot showing the relationship between inmate population and COVID-19 cases in federal prisons. The graph shows a positive relationship: for the most part, as population increases, so does the number of COVID-19 cases. The overall trend was not surprising; however, I had expected federal prisons to be an exception to it.

Primary Visualization 2

https://public.tableau.com/app/profile/chisom.anyanwu/viz/Book1_17024305846190/Dashboard1?publish=yes

Explanation: Linked to this assignment is a data dashboard showing breakdowns of COVID-19 across different facilities in the country. The purpose of this visualization is to show the spread and the rates at which different facility types experienced COVID-19, using a map and a bar graph. Hovering over the bar graph shows the variable “Inmate Case Rate”. This variable was not in the initial data set; it was calculated in Tableau using the following equation: (Total_Inmate_Cases/Latest_Inmate_Population)*100. The purpose of this was to standardize the data for analysis. While the raw counts are useful, the data make more sense now that I can compare values in a standardized manner. In the interactive visualization, one can select a facility type from the key to highlight those points on the map.
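
The same rate can be reproduced in R. The following is a minimal sketch using a made-up data frame (`toy`) in place of `prisonsdf`. The column name `inmate_case_rate` and the pooled summary by facility type are assumptions for illustration; the dashboard's own aggregation in Tableau may differ.

```r
library(dplyr)

# Hypothetical stand-in for prisonsdf; the values are made up for illustration.
toy <- tibble(
  facility_type = c("Federal prison", "Federal prison", "State prison", "State prison"),
  latest_inmate_population = c(1000, 2000, 500, NA),
  total_inmate_cases = c(100, 300, 200, 50)
)

# Per-facility rate: cases per 100 inmates, i.e.
# (Total_Inmate_Cases / Latest_Inmate_Population) * 100.
# Rows with a missing or zero population are dropped first,
# because the rate is undefined for them.
rates <- toy |>
  filter(!is.na(latest_inmate_population), latest_inmate_population > 0) |>
  mutate(inmate_case_rate = total_inmate_cases / latest_inmate_population * 100)

# Pooled rate per facility type (an assumed aggregation):
# total cases divided by total population within each type.
by_type <- rates |>
  group_by(facility_type) |>
  summarise(
    facilities = n(),
    pooled_case_rate = sum(total_inmate_cases) / sum(latest_inmate_population) * 100
  )
```

A pooled rate weights large facilities more heavily than averaging the per-facility rates would, so the two summaries can differ. Which one is appropriate depends on whether the question is about inmates or about facilities.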

Extra Exploration

library(GGally)

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

ggpairs(prisonsdf, columns = 10:15)

Warning: Removed 1046 rows containing non-finite values (`stat_density()`).

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 1803 rows containing missing values

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 1046 rows containing missing values

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 1046 rows containing missing values

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 1046 rows containing missing values

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 1047 rows containing missing values

Warning: Removed 1803 rows containing missing values (`geom_point()`).

Warning: Removed 1801 rows containing non-finite values (`stat_density()`).

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 1801 rows containing missing values

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 1801 rows containing missing values

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 1801 rows containing missing values

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 1802 rows containing missing values

Warning: Removed 1046 rows containing missing values (`geom_point()`).

Warning: Removed 1801 rows containing missing values (`geom_point()`).

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value

Warning: Removed 1046 rows containing missing values (`geom_point()`).

Warning: Removed 1801 rows containing missing values (`geom_point()`).

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value

Warning: Removed 1046 rows containing missing values (`geom_point()`).

Warning: Removed 1801 rows containing missing values (`geom_point()`).

Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value

Warning: Removed 1047 rows containing missing values (`geom_point()`).

Warning: Removed 1802 rows containing missing values (`geom_point()`).

Warning: Removed 1 rows containing missing values (`geom_point()`).
Removed 1 rows containing missing values (`geom_point()`).
Removed 1 rows containing missing values (`geom_point()`).

Warning: Removed 1 rows containing non-finite values (`stat_density()`).

Essay: As mentioned previously, the purpose of this exploration was to learn about factors that affect COVID-19 in prisons and how cases manifest in different prison types. Facilities within the US operate differently and serve different purposes. For this project I primarily focused on federal prisons and state prisons, because health care expenditure and security measures vary greatly between these two facility types. Federal prisons are very high in security, and because of this I assumed their COVID-19 rates would be much lower, since these inmates are kept very separate from one another. This ended up being correct: when the data was visualized using a bar graph, the rates for federal prisons were much lower. In the data dashboard I also created an equation to measure rates of COVID-19. The calculated numbers represent the rates for every state and facility type. This lets me standardize the data and better understand the variables I was exploring.

One thing I would like to explore further is the relationship between health care spending and COVID-19 rates. While researching for this assignment I read an article about the average spending per inmate in each state. California and Vermont were both among the highest-spending states. Despite their similar spending, Vermont had significantly lower rates of COVID-19. These variables would be worth exploring in another project. Overall, this exploration was very interesting.

Works Cited

McKillop, M. (2017, December 15). Prison Health Care Spending Varies Dramatically by State. Pewtrusts.org. https://www.pewtrusts.org/en/research-and-analysis/articles/2017/12/15/prison-health-care-spending-varies-dramatically-by-state

Penal Reform International. (2013). Overcrowding. Penal Reform International. https://www.penalreform.org/issues/prison-conditions/key-facts/overcrowding/