Final Project

Author

Emma Poch

setwd("C:/Users/emmap/Downloads/DATA110")
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(viridis)

Warning: package 'viridis' was built under R version 4.3.3

Loading required package: viridisLite

jails <- read_csv("california_jail_county_monthly_1995_2020.csv")

Rows: 17801 Columns: 48
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (6): jurisdiction, month, census_county_name, fips_state_code, fips_co...
dbl  (40): year, avg_daily_pop_unsentenced_male, avg_daily_pop_unsentenced_f...
date  (2): date, day_of_highest_count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Introduction

This data set, compiled by Jacob Kaplan, covers 1995-2020 and reports population and quality-of-life measures for jails throughout California, organized at the county level. It uses data scraped from the California Board of State and Community Corrections’ website. The variables most relevant to my research were the average daily population per month (most data was collected at the beginning of each month), the number of new mental health cases opened each month, and breakdowns of these variables (such as counts by male/female, sentenced/unsentenced, or felony/misdemeanor charges). Most of my cleaning for this data set involved identifying columns that were not useful, assigning levels to factor variables, and isolating variables under specific conditions.

I was particularly interested in using this data because I believe that our prison system ought to be advancing carceral justice and adequately meeting its inmates’ needs. California is the most populous state in the nation and also has one of the highest rates of income inequality, with the wealthiest 10% of the state earning 11 times more than the poorest 10% (Thorman & Payares-Montoya, 2024). The vast disparities present in this single state make it a particularly relevant one to study. How mental health is handled in California jails is also pertinent, given that the share of inmates requiring some form of mental health care rose dramatically from 20% in 2010 to 53% in 2023 (Lofstrom & Martin, 2023). This data set provides an insightful look into the composition of California’s jail system and its populations, and may serve as a foundation for surveys of prison systems at a larger scale.

Cleaning the data

# Dropping columns that are primarily NA and thus won't be helpful
jails2 <- jails |>
  select(!c(
    adp_of_maximum_security_inmates,
    adp_of_medium_security_inmates,
    adp_of_minimum_security_inmates,
    avg_inmates_get_medical_bed,
    avg_inmate_need_reg_med_attent,
    avg_inmates_need_reg_ment_health
  ))
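One way to check which columns are "primarily NA" is to compute the share of missing values in each column and drop anything above a cutoff. The sketch below uses a small made-up data frame (`toy`) and an assumed cutoff of 0.5; neither reflects the actual rule used to pick the six columns dropped above.

```r
# Toy data standing in for the jail data (all values are made up)
toy <- data.frame(
  jurisdiction = c("A", "B", "C", "D"),
  avg_daily_pop_total_jurisdiction = c(120, 340, NA, 95),
  adp_of_maximum_security_inmates = c(NA, NA, NA, 12)
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

# Keep columns where at most half of the values are missing (assumed cutoff)
toy_clean <- toy[, na_share <= 0.5, drop = FALSE]

print(round(na_share, 2))
print(names(toy_clean))
```

Running the same `colMeans(is.na(...))` check on `jails` would give a quick ranking of which columns are too sparse to analyze.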

Linear regression

# Fitting a linear regression to evaluate the relationship between the number of new
# mental health cases opened in a given month and the number of sentenced inmates
# released due to a lack of housing
model1 <- lm(tot_sentenced_release_lack_bed ~ num_new_mental_health_cases, data = jails2)
summary(model1)


Call:
lm(formula = tot_sentenced_release_lack_bed ~ num_new_mental_health_cases, 
    data = jails2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2369.2   -33.8    13.7    26.6  4595.3 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -21.016104   3.484070  -6.032 1.67e-09 ***
num_new_mental_health_cases   0.694084   0.008115  85.530  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 328 on 10821 degrees of freedom
  (6978 observations deleted due to missingness)
Multiple R-squared:  0.4034,    Adjusted R-squared:  0.4033 
F-statistic:  7315 on 1 and 10821 DF,  p-value: < 2.2e-16

plot(model1$residuals)

The residuals plot shows a distinct pattern, which suggests that a simple linear model is not the best way to represent the relationship between these variables. The relatively low R-squared value of 0.40 supports this: it is not low enough to discount the model entirely, but a model that explains only 40% of the variation in its response variable is of limited use. Because the patterned residuals suggest that the model’s assumptions may not hold, the p-value (which is well below the alpha of 0.05 and would otherwise indicate statistical significance) should be taken with a grain of salt. Although I hesitate to speculate, I would assume that variation in many of the variables is at least partly attributable to differences in jail population, which may be why a correlation exists despite the model’s weakness.
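The idea that population size might drive both variables can be illustrated with simulated data. In the sketch below, every number is invented: `pop`, `new_cases`, and `releases` are hypothetical stand-ins for jail population, `num_new_mental_health_cases`, and `tot_sentenced_release_lack_bed`. The two simulated variables are linked only through population, so it shows how a shared driver can produce a correlation and how adding that driver as a covariate changes the estimate. It is not an analysis of the real data.

```r
set.seed(110)

# Simulated jurisdictions in which population drives both variables (assumption)
n <- 500
pop <- runif(n, min = 100, max = 5000)
new_cases <- 0.10 * pop + rnorm(n, sd = 50)  # stand-in for new mental health cases
releases  <- 0.05 * pop + rnorm(n, sd = 30)  # stand-in for releases due to lack of beds

# Fit the model without and with population as a covariate
naive    <- lm(releases ~ new_cases)
adjusted <- lm(releases ~ new_cases + pop)

slope_naive    <- unname(coef(naive)["new_cases"])
slope_adjusted <- unname(coef(adjusted)["new_cases"])

# Compare the slope on new_cases before and after adjusting for population
print(c(naive = slope_naive, adjusted = slope_adjusted))
```

With the real data, a comparable check might add `avg_daily_pop_total_jurisdiction` to the model formula and see how much the coefficient on `num_new_mental_health_cases` shrinks.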

Visualization 1

# Finding the jurisdiction and month with the highest average daily population
jails2 |>
  arrange(desc(avg_daily_pop_total_jurisdiction)) |>
  head(1)

# A tibble: 1 × 42
  jurisdiction          year month date       census_county_name fips_state_code
  <chr>                <dbl> <chr> <date>     <chr>              <chr>          
1 Los Angeles Sheriff…  1998 march 1998-03-01 Los Angeles County 06             
# ℹ 36 more variables: fips_county_code <chr>, fips_state_county_code <chr>,
#   avg_daily_pop_unsentenced_male <dbl>,
#   avg_daily_pop_unsentenced_female <dbl>, avg_daily_pop_sentenced_male <dbl>,
#   avg_daily_pop_sentenced_female <dbl>,
#   avg_daily_pop_total_jurisdiction <dbl>,
#   avg_felony_inmate_unsentenced <dbl>, avg_felony_inmate_sentenced <dbl>,
#   avg_felony_inmate_total <dbl>, avg_misdemean_inmate_unsentenced <dbl>, …

jails2 |>
  # Keep 2005 onward, matching the plot title
  filter(census_county_name == "Los Angeles County", year >= 2005) |>
  # Capitalize month names and order them chronologically
  mutate(month = factor(str_to_title(month), levels = month.name)) |>
  ggplot(aes(x = year, y = month, fill = num_new_mental_health_cases)) +
  geom_tile() +
  scale_fill_viridis(option = "rocket") +
  labs(
    x = "Year", y = "Month", fill = "Number of New Mental Health Cases",
    title = "Mental Health Cases Per Month for \n Los Angeles County Jails (2005-2020)",
    caption = "Source: California Board of State and Community Corrections"
  ) +
  theme_minimal() +
  theme(panel.grid = element_blank())

This visualization, focusing specifically on Los Angeles County, is a heatmap depicting variation in the number of new mental health cases each month from 2005 to 2020. Although the data set begins tracking in 1995, I limited the time frame to start at 2005 because data prior to then was sporadic and less visually cohesive. Los Angeles County jails appear to have recorded their highest numbers of new mental health cases around the summer of 2005 and their lowest around 2009-2010. I have to wonder whether the 2008 recession played any role in this pattern, or what other factors might turn out to matter if investigated more deeply. Additionally, because the data only counts new cases and not ongoing ones, it likely tells a less complete story than a cumulative count would. Ideally, I would have liked to compare the results of multiple counties (which I attempted to do by log-transforming the number of mental health cases), but the resulting graphic was too visually crowded, so I decided to stick with only the Los Angeles data.
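For reference, a multi-county comparison along these lines could be sketched with a log-scaled fill and one panel per county. The example below uses made-up Poisson counts for three hypothetical counties; the county names, the `typical` case levels, and the choice of a `log1p` transformation (which tolerates zero counts) are all assumptions, and this is not a reconstruction of the attempt described above.

```r
library(ggplot2)

set.seed(110)

# Made-up monthly counts for three hypothetical counties of very different size
toy <- expand.grid(
  month  = factor(month.name, levels = month.name),
  year   = 2005:2008,
  county = c("County A", "County B", "County C"),
  stringsAsFactors = FALSE
)
typical <- c("County A" = 2000, "County B" = 150, "County C" = 20)
toy$new_cases <- rpois(nrow(toy), lambda = typical[toy$county])

# One panel per county; the log-scaled fill keeps small counties from washing out
p <- ggplot(toy, aes(x = year, y = month, fill = new_cases)) +
  geom_tile() +
  facet_wrap(~ county, ncol = 1) +
  scale_fill_viridis_c(option = "rocket", trans = "log1p") +
  labs(x = "Year", y = "Month", fill = "New cases (log scale)") +
  theme_minimal() +
  theme(panel.grid = element_blank())

p
```

The `trans` argument matches the ggplot2 version loaded in this document; newer ggplot2 releases name it `transform`. With dozens of real counties the panels would still be crowded, so a version like this probably works best for a handful of selected counties.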

Visualization 2

Tableau Visualization

This visualization is a map depicting the average jail population (and the percentage of that population that has been sentenced) for each county in California. As I chose not to stratify the graphic by year, each population and percentage is the average across all jails in the county and all years for which data was documented. Unsurprisingly, Los Angeles County clearly has the greatest population, although I was somewhat surprised that its sentenced percentage was as high as 51.02% given the high volume of inmates that it handles. San Diego had the highest percentage of sentenced inmates, at 70.65%, despite having a fairly large daily population compared to many other counties. The shapefile used to create this visualization was sourced from Data.gov and compiled by the California Department of Technology, and it is cited in the references below.
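The map itself was built in Tableau, so the exact aggregation is not shown here. The sketch below is one plausible way to compute comparable county-level figures in R. The column names follow the data set, but the values are made up, and defining the sentenced percentage as total sentenced population over total population is an assumption.

```r
library(dplyr)

# Made-up monthly rows; column names follow the data set, values do not
toy <- tibble(
  census_county_name = c("County A", "County A", "County B", "County B"),
  avg_daily_pop_sentenced_male     = c(40, 60, 10, 20),
  avg_daily_pop_sentenced_female   = c(10, 10, 5, 5),
  avg_daily_pop_total_jurisdiction = c(100, 100, 50, 50)
)

county_summary <- toy |>
  # Total sentenced population per row
  mutate(sentenced = avg_daily_pop_sentenced_male + avg_daily_pop_sentenced_female) |>
  group_by(census_county_name) |>
  summarise(
    avg_population = mean(avg_daily_pop_total_jurisdiction, na.rm = TRUE),
    # Assumed definition: sentenced population as a share of total population
    pct_sentenced = 100 * sum(sentenced, na.rm = TRUE) /
      sum(avg_daily_pop_total_jurisdiction, na.rm = TRUE)
  )

print(county_summary)
```

Averaging the monthly percentages instead of dividing the summed counts would weight every month equally regardless of population, so the two approaches can give slightly different results.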

References

California Department of Technology. (2024). CA Geographic Boundaries [Dataset]. State of California. https://catalog.data.gov/dataset/ca-geographic-boundaries

Kaplan, J. (2024). California Jail Profile Survey 1995-2020 (Version 1) [Dataset]. Harvard Dataverse. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/9KWMTJ

Lofstrom, M., & Martin, B. (2023, October 25). County Jails House Fewer Inmates, but Over Half Face Mental Health Issues. Public Policy Institute of California. https://www.ppic.org/blog/county-jails-house-fewer-inmates-but-over-half-face-mental-health-issues/

Thorman, T., & Payares-Montoya, D. (2024, April). Income Inequality in California. Public Policy Institute of California. https://www.ppic.org/publication/income-inequality-in-california/