Abstract

Pretrial release decisions represent a critical juncture in the criminal justice system, balancing public safety against the presumption of innocence. This project investigates the predictive power of Public Safety Assessment (PSA)-like scores and case-level factors on rearrest and release decisions in Queens County, New York, from 2023 to 2024. Using comprehensive court data, we reconstructed PSA risk scores and developed local models. Our analysis reveals that while the PSA framework shows predictive value, a locally calibrated model significantly outperforms its generic application (test-set AUC of about 0.63 for logistic regression versus 0.56 for the PSA threshold rule). Furthermore, we identify important disparities in prediction accuracy across racial groups. These findings highlight the necessity of localized validation and careful implementation of risk assessment tools to ensure both effectiveness and equity.

Introduction

Pretrial release decisions are among the most consequential moments in the criminal justice process. Judges must determine whether to release a defendant before trial, weighing concerns for public safety against the presumption of innocence. In recent years, data-driven risk assessment tools like the Public Safety Assessment (PSA) have been introduced to guide these decisions objectively.

This project examines whether PSA scores and case-level factors predict two key outcomes in Queens County between 2023 and 2024:

  • rearrest while on release
  • judicial release decisions

The study period is particularly relevant as it follows New York’s significant bail reform measures, providing insight into how risk assessment operates in this reformed landscape.

I chose this topic not only for its pressing importance in criminal justice reform but also because of its direct alignment with my professional aspirations. As someone pursuing opportunities to become a court officer, understanding the mechanics of pretrial decision-making is highly relevant to my future work. A personal experience further motivated this research: I was recently in a car accident where the other driver fled the scene, leaving me with damages and no recourse. This incident underscored how profoundly we rely on the justice system to deliver fairness, accountability, and safety, concerns that are central to pretrial risk assessment.

Literature Review

Pretrial risk assessment and the PSA

Actuarial pretrial tools aim to estimate a defendant’s likelihood of failing to appear and/or being rearrested pretrial. The Public Safety Assessment (PSA), developed by Arnold Ventures, uses nine factors covering age, the current charge, and criminal history to generate two risk scores, Failure to Appear (FTA) and New Criminal Activity (NCA), and a New Violent Criminal Activity (NVCA) flag; it is used across hundreds of jurisdictions and comes with core implementation requirements.

Validation studies generally find that PSA scores are predictive of pretrial outcomes in multiple sites, though performance varies by locality. For example, a San Francisco validation using local data reported meaningful discrimination for FTA and NCA outcomes and discussed the importance of local calibration and policy cutoffs.

Evidence on bail reform and pretrial outcomes

New Jersey’s 2017 reform (with PSA statewide) substantially reduced the pretrial jail population while maintaining public safety and court appearance rates, according to independent evaluations (MDRC and later academic work). Recent analyses continue to find no associated increase in violent crime.

In New York, the 2019–2020 bail reforms (with subsequent amendments in 2020 and 2022) sought to limit money bail and reduce pretrial detention. Two-year follow-ups documented large shifts in release decisions, bail amounts, and racial disparities, while policy explainers clarify how “qualifying” vs. “non-qualifying” offenses structure eligibility for bail and detention today. This study’s Queens focus (2023–2024) lands squarely in this evolving policy period, making local, time-bounded analysis important.

Accuracy, bias, and fairness debates

Scholars have raised concerns about whether algorithmic tools outperform simple or human baselines and about potential disparate impacts. Dressel & Farid (2018) found that a widely used tool (COMPAS) performed comparably to lay predictions, sparking broader debates about transparency and construct validity; although COMPAS is not the PSA, the critique motivates careful, local validation and subgroup analyses. Theoretical work (Kleinberg, Mullainathan & Raghavan) proves that commonly desired fairness criteria cannot all be satisfied at once except in special cases such as equal base rates across groups or perfect prediction, underscoring the trade-offs policymakers face when converting scores into detention or supervision rules.
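
To make this trade-off concrete, the sketch below computes group-wise error rates from hypothetical confusion-matrix counts. The function name `group_error_rates` and every number in it are illustrative assumptions, not results from this study. The two groups are constructed to share the same positive predictive value and false negative rate while having different base rates, and the false positive rates then differ.

```r
# Hypothetical confusion-matrix counts for two groups (illustrative only).
counts <- data.frame(
  group = c("A", "B"),
  tp = c(100, 200),  # flagged high risk, rearrested
  fp = c(100, 200),  # flagged high risk, not rearrested
  fn = c(100, 200),  # flagged low risk, rearrested
  tn = c(700, 400)   # flagged low risk, not rearrested
)

# Hypothetical helper: base rate and error rates for each group.
group_error_rates <- function(d) {
  data.frame(
    group = d$group,
    base_rate = (d$tp + d$fn) / (d$tp + d$fp + d$fn + d$tn),
    ppv = d$tp / (d$tp + d$fp),
    fnr = d$fn / (d$tp + d$fn),
    fpr = d$fp / (d$fp + d$tn)
  )
}

rates <- group_error_rates(counts)
print(rates)
```

In this constructed example, holding predictive value and miss rate equal across groups forces the higher-base-rate group to carry a higher false positive rate, which is the kind of tension the fairness literature describes.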

Gaps this study addresses

Most PSA research is jurisdiction-specific and often pre-2022. There is limited public analysis of PSA-like scores specifically for Queens in 2023–2024 amid New York’s post-amendment landscape. By (a) reconstructing NCA/NVCA measures from case-level data, (b) benchmarking them against realized rearrest and release outcomes, and (c) comparing simple PSA thresholds with logistic and random-forest models, this project adds a timely, local validation and explores model transparency vs. accuracy trade-offs relevant to current NYC practice.

Research Questions

This project addresses three primary research questions:

1- Predictive Accuracy:

Do reconstructed PSA risk scores (NCA and NVCA) predict general and violent rearrest in Queens County?

2- Model Comparison:

Does a locally trained statistical model outperform the standard PSA scoring thresholds?

3- Fairness Assessment:

Are there significant disparities in prediction accuracy across racial groups?

Data Sources

This project uses the NYC Pretrial Release dataset (2023–2024). The dataset includes individual-level court case information, criminal history variables (used here to reconstruct PSA-style risk scores), release decisions, and rearrest outcomes. In this project, I focus on Queens County for three reasons:

  • It is one of the largest boroughs in NYC with diverse populations and case volumes.
  • Focusing on one borough ensures manageable scope for modeling.
  • As someone who commutes through Queens, I feel personally connected to the region.

Variables analyzed include:

  • Dependent variables: Rearrest (yes/no), Release decision (released/detained)
  • Independent variables: reconstructed PSA-style risk scores (New Criminal Activity, New Violent Criminal Activity), prior criminal history, type of arrest, custody status, and demographic information

The dataset comes from publicly available court and pretrial release records for New York, covering 2023–2024. It is available for download as the NYC Official Pre-Trial Release Data.

Data Preparation and Cleaning

Primary Data Filtering and Cleaning

# Load required libraries
library(tidyverse)
library(knitr)
library(naniar)
library(caret)
library(randomForest)
library(pROC)
library(snakecase)
library(logistf)
library(gt)
library(patchwork)
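# Load data
# Note: na.strings entries are matched literally (not as regular expressions),
# so "\\s+" does not catch runs of whitespace.
NYC <- read.csv("NYS for Web 2024 copy_try copy.csv", na.strings = c("NULL", " ", "\\s+", ""))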
kable(head(NYC,5))
Internal_Case_ID Gender Race Ethnicity Age_at_Crime Age_at_Arrest Court_Name Court_ORI County_Name District Region Court_Type Judge_Name Offense_Month Offense.Year Arrest_Month Arrest.Year Arrest_Type Top_Arrest_Law Top_Arrest_Article_Section Top_Arrest_Attempt_Indicator Top_Charge_at_Arrest Top_Charge_Severity_at_Arrest Top_Charge_Weight_at_Arrest Top_Charge_at_Arrest_Violent_Felony_Ind Case_Type First_Arraign_Date Top_Arraign_Law Top_Arraign_Article_Section Top_Arraign_Attempt_Indicator Top_Charge_at_Arraign Top_Severity_at_Arraign Top_Charge_Weight_at_Arraign Top_Charge_at_Arraign_Violent_Felony_Ind Hate_Crime_Ind Arraign.Charge.Category Representation_Type App_Count_Arraign_to_Dispo_Released App_Count_Arraign_to_Dispo_Detained App_Count_Arraign_to_Dispo_Total Def_Attended_Sched_Pretrials Remanded_to_Jail_at_Arraign ROR_at_Arraign Bail_Set_and_Posted_at_Arraign Bail_Set_and_Not_Posted_at_Arraign NMR_at_Arraign Release.Decision.at.Arraign Representation_at_Securing_Order Pretrial_Supervision_at_Arraign Contact_Pretrial_Service_Agency Electronic_Monitoring Travel_Restrictions Passport_Surrender No_Firearms_or_Weapons Maintain_Employment Maintain_Housing Maintain_School Placement_in_Mandatory_Program Removal_to_Hospital Obey_Order_of_Protection Obey_Court_Conditions.Family_Offense Other_NMR Order_of_Protection First_Bail_Set_Cash First_Bail_Set_Credit First_Insurance_Company_Bail_Bond First_Secured_Surety_Bond First_Secured_App_Bond First_Unsecured_Surety_Bond First_Unsecured_App_Bond First_Partially_Secured_Surety_Bond Partially_Secured_Surety_Bond_Perc First_Partially_Secured_App_Bond Partially_Secured_App_Bond_Perc Bail_Made_Indicator NotRequestedFlag RemandRequestedFlag NMRRequestedFlag RORRequestedFlag UnspecifiedTypeRequestedAmount CashRequestedAmount CreditRequestedAmount InsuranceCompanyRequestedAmount SecuredSuretyRequestedAmount SecuredAppRequestedAmount UnsecuredSuretyRequestedAmount UnsecuredAppRequestedAmount PartiallySecuredSuretyRequestedAmount 
PartiallySecuredAppRequestedAmount UnspecifiedBondTypeRequestedAmount Warrant_Ordered_btw_Arraign_and_Dispo DAT_WO_WS_Prior_to_Arraign First_Bench_Warrant_Month First_Bench_Warrant_Year Non_Stayed_WO Num_of_Stayed_WO Num_of_ROW Docket_Status Disposition_Type Disposition_Detail Dismissal_Reason Disposition_Date Most_Severe_Sentence Top_Conviction_Law Top_Conviction_Article_Section Top_Conviction_Attempt_Indicator Top_Charge_at_Conviction Top_Charge_Severity_at_Conviction Top_Charge_Weight_at_Conviction Top_Charge_at_Conviction_Violent_Felony_Ind Days_Arraign_Remand_First_Released Known_Days_in_Custody Days_Arraign_Bail_Set_to_First_Posted Days_Arraign_Bail_Set_to_First_Release Days_Arraign_to_Dispo MinImpTopConvDays MaxImpTopConvDays UCMSLiveDate prior_vfo_cnt prior_nonvfo_cnt prior_misd_cnt pend_vfo pend_nonvfo pend_misd supervision rearrest rearrest_date rearrest_firearm rearrest_date_firearm arr_cycle_id
4.887186e+27 Male White Hispanic 21 21 New York Criminal Court NY030033J New York District 1 NYC Local Tatham, Beverly S. Jan 2024 Jan 2024 Custody PL 215.51 NA PL 215.51 BII EF Crim Contempt-1st:Follows Felony EF N Docket 1/1/2024 PL 215.51 NA PL 215.51 BII EF Crim Contempt-1st:Follows Felony EF N N Criminal Contempt 18B (Assigned Counsel) 1 0 1 1 Y N N N N Remanded Y N N N N N N N N N N N N N N Family Offense NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA N N NA NA NA NA 0 Disposed GJ/Trans Transfer to Superior Court NA 1/1/2024 NA NA NA NA NA NA NA N 1 1 NA NA 1 NA NA NA 0 0 0 0 0 1 0 No Arrest NA 0 NA 1370269
2.813440e+27 Male Black Non Hispanic 19 19 Kings Criminal Court NY023033J Kings District 2 NYC Local Tubridy, Jennifer A. Apr 2024 Jun 2024 Custody PL 215.5 NA PL 215.50 03 AM Crim Contempt-2nd:Disobey Crt Misdemeanor AM N Docket 6/1/2024 PL 215.5 NA PL 215.50 03 AM Crim Contempt-2nd:Disobey Crt Misdemeanor AM N N Criminal Contempt Legal Aid 3 0 3 2 N Y N N N ROR Y N N N N N N N N N N N N N N Non-Family Offense NA NA NA NA NA NA NA NA NA NA NA NA Y N N N NA NA NA NA NA NA NA NA NA NA NA N N NA NA NA 1 0 Pending NA NA NA NA NA NA NA NA NA NA NA N NA 0 NA NA NA NA NA NA 0 0 0 1 0 0 0 Misdemeanor 7/1/2024 0 NA 1252047
3.546526e+27 Unknown Unknown Unknown 0 0 Nassau District Court NY029013J Nassau District 10N ONYC Local Mccormack, Marie F. Oct 2023 Oct 2023 DAT NC-FPO 13.11 NA NC-FPO 13.11 UM Failure to Comply Misdemeanor UM N Docket 1/1/2024 NC-FPO 13.11 NA NC-FPO 13.11 UM Failure to Comply Misdemeanor UM N N Other Retained Attorney 1 0 1 NA N N N N N Disposed at arraign Y N N N N N N N N N N N N N N NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA N N NA NA NA NA 0 Disposed Plea Pled Guilty NA 1/1/2024 Fine NC-FPO 1.7 NA NC-FPO 1.7 V Failing To Comply Violation V N NA 0 NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2.159048e+27 Male Unknown Unknown 23 23 Mount Vernon City Court NY059031J Westchester District 9 ONYC Local Johnson, Nichelle Aug 2022 Aug 2022 DAT VTL 511 NA VTL 0511 02A2 UM Agg Unlicensed Operation-2nd Misdemeanor UM N Docket 5/1/2024 VTL 511 NA VTL 0511 02A2 UM Agg Unlicensed Operation-2nd Misdemeanor UM N N Unlicensed Operation 18B (Assigned Counsel) 1 0 1 1 N N N N N Disposed at arraign Y N N N N N N N N N N N N N N NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA N N NA NA NA NA 0 Disposed Plea Pled Guilty NA 5/1/2024 Surcharge VTL 509 NA VTL 0509 01 I MV License Viol:No License Infraction I N NA 0 NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4.715249e+27 Male White Hispanic 40 40 Queens Criminal Court NY040033J Queens District 11 NYC Local Battisti, Anthony M. Jan 2024 Jan 2024 Custody PL 155.25 NA PL 155.25 AM Petit Larceny Misdemeanor AM N Docket 1/1/2024 PL 155.25 NA PL 155.25 AM Petit Larceny Misdemeanor AM N N Larceny Legal Aid 4 0 4 2 N Y N N N ROR Y N N N N N N N N N N N N N N Non-Family Offense NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Y N Apr 2024 1 NA 1 Disposed Plea Pled Guilty NA 8/1/2024 Conditional Discharge PL 240.2 NA PL 240.20 V Disorderly Conduct Violation V N NA 0 NA NA 189 NA NA NA 0 0 0 0 0 1 0 Misdemeanor 2/1/2024 0 NA 1419467
# Standardize column names
names(NYC) <- snakecase::to_snake_case(names(NYC))

# Check structure before filtering
# glimpse(NYC)

# Filter for Queens County 2023-2024 with key criteria
queens_data <- NYC |>
  filter(county_name == "Queens",
         arrest_year %in% c(2023, 2024),
         arrest_type == "Custody",
         docket_status == "Disposed",
         !release_decision_at_arraign %in% c("Disposed at arraign", "Unknown", "Remanded"),
         rearrest != "Unknown") |>
  mutate(across(c(prior_vfo_cnt, prior_nonvfo_cnt, prior_misd_cnt, pend_vfo, pend_nonvfo, pend_misd),
                ~ as.numeric(str_replace(., "[+|>]", "")))) |>
  select(-arr_cycle_id) |>
  drop_na(prior_vfo_cnt, prior_nonvfo_cnt, prior_misd_cnt, pend_vfo, pend_nonvfo, pend_misd)

# Remove systematically missing variables
variables_to_remove <- c(
  "first_secured_surety_bond", "first_secured_app_bond", 
  "first_unsecured_app_bond", "first_partially_secured_app_bond",
  "partially_secured_app_bond_perc", "unsecured_surety_requested_amount",
  "unsecured_app_requested_amount", "days_arraign_remand_first_released",
  "secured_app_requested_amount", "unspecified_bond_type_requested_amount"
)

queens_clean <- queens_data |>
  select(-any_of(variables_to_remove))

# Handle missing data and create analysis variables
queens_clean <- queens_clean |>
  mutate(
    across(where(is.numeric) & !contains("id") & !contains("date"),
           ~ifelse(is.na(.), median(., na.rm = TRUE), .)),
    across(where(is.character), 
           ~ifelse(is.na(.), "Unknown", .)),
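    # Any label not listed here (for example "Misdemeanor", which appears in the
    # raw preview above) falls through to "Unknown" and is dropped by the filter below.
    rearrest_binary = case_when(
      rearrest %in% c("No Arrest") ~ "No",
      rearrest %in% c("Non-violent felony", "Violent felony", "Yes") ~ "Yes",
      TRUE ~ "Unknown"
    ),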
    rearrest_type = case_when(
      rearrest == "No Arrest" ~ "None",
      rearrest == "Non-violent felony" ~ "Non-violent",
      rearrest == "Violent felony" ~ "Violent", 
      rearrest == "Yes" ~ "Unknown Type",
      TRUE ~ "Unknown"
    )
  ) |>
  filter(age_at_arrest >= 16, age_at_arrest <= 100,
         rearrest_binary != "Unknown")

# Ages were already restricted to 16-100 above; keep a separately named copy for PSA scoring
queens_psa_ready <- queens_clean
# Save the cleaned dataset
write.csv(queens_clean, "queens_2023_2024.csv", row.names = FALSE)
kable(head(queens_clean, 5))
internal_case_id gender race ethnicity age_at_crime age_at_arrest court_name court_ori county_name district region court_type judge_name offense_month offense_year arrest_month arrest_year arrest_type top_arrest_law top_arrest_article_section top_arrest_attempt_indicator top_charge_at_arrest top_charge_severity_at_arrest top_charge_weight_at_arrest top_charge_at_arrest_violent_felony_ind case_type first_arraign_date top_arraign_law top_arraign_article_section top_arraign_attempt_indicator top_charge_at_arraign top_severity_at_arraign top_charge_weight_at_arraign top_charge_at_arraign_violent_felony_ind hate_crime_ind arraign_charge_category representation_type app_count_arraign_to_dispo_released app_count_arraign_to_dispo_detained app_count_arraign_to_dispo_total def_attended_sched_pretrials remanded_to_jail_at_arraign ror_at_arraign bail_set_and_posted_at_arraign bail_set_and_not_posted_at_arraign nmr_at_arraign release_decision_at_arraign representation_at_securing_order pretrial_supervision_at_arraign contact_pretrial_service_agency electronic_monitoring travel_restrictions passport_surrender no_firearms_or_weapons maintain_employment maintain_housing maintain_school placement_in_mandatory_program removal_to_hospital obey_order_of_protection obey_court_conditions_family_offense other_nmr order_of_protection first_bail_set_cash first_bail_set_credit first_insurance_company_bail_bond first_unsecured_surety_bond first_partially_secured_surety_bond partially_secured_surety_bond_perc bail_made_indicator not_requested_flag remand_requested_flag nmr_requested_flag ror_requested_flag unspecified_type_requested_amount cash_requested_amount credit_requested_amount insurance_company_requested_amount secured_surety_requested_amount partially_secured_surety_requested_amount partially_secured_app_requested_amount warrant_ordered_btw_arraign_and_dispo dat_wo_ws_prior_to_arraign first_bench_warrant_month first_bench_warrant_year non_stayed_wo num_of_stayed_wo num_of_row 
docket_status disposition_type disposition_detail dismissal_reason disposition_date most_severe_sentence top_conviction_law top_conviction_article_section top_conviction_attempt_indicator top_charge_at_conviction top_charge_severity_at_conviction top_charge_weight_at_conviction top_charge_at_conviction_violent_felony_ind known_days_in_custody days_arraign_bail_set_to_first_posted days_arraign_bail_set_to_first_release days_arraign_to_dispo min_imp_top_conv_days max_imp_top_conv_days ucms_live_date prior_vfo_cnt prior_nonvfo_cnt prior_misd_cnt pend_vfo pend_nonvfo pend_misd supervision rearrest rearrest_date rearrest_firearm rearrest_date_firearm rearrest_binary rearrest_type
2.557388e+27 Female Black Non Hispanic 39 39 Queens Criminal Court NY040033J Queens District 11 NYC Local Daniels, Edward F. May 2024 May 2024 Custody PL 120.05 Unknown PL 120.05 12 DF Aslt-2: Injure Vic 65 Or Older Felony DF Y Docket 5/1/2024 PL 120.05 Unknown PL 120.05 12 DF Aslt-2: Injure Vic 65 Or Older Felony DF Y N Assault Public Defender 2 0 2 2 N N N Y N Bail-set Y N N N N N N N N N N N N N N Family Offense 5000 10000 5000 1 5000 10 Unknown N N N N 5000 20000 60000 45000 42500 60000 45000 N N Unknown 2024 1 1 0 Disposed Plea Pled Guilty Unknown 5/1/2024 Conditional Discharge PL 240.2 Unknown PL 240.20 V Disorderly Conduct Violation V N 1 2 3 3 90 Unknown Unknown 0 0 0 1 0 0 0 No Arrest Unknown 0 Unknown No None
3.352332e+27 Male Black Non Hispanic 37 37 Queens Criminal Court NY040033J Queens District 11 NYC Local Gershuny, Jeffrey A. Feb 2024 Feb 2024 Custody PL 120 Unknown PL 120.00 01 AM Aslt 3-W/Int Cause Phys Injury Misdemeanor AM N Docket 2/1/2024 PL 120 Unknown PL 120.00 01 AM Aslt 3-W/Int Cause Phys Injury Misdemeanor AM N N Assault Public Defender 2 0 2 2 N Y N N N ROR Y N N N N N N N N N N N N N N Non-Family Offense 10000 10000 25000 1 30000 10 Unknown N N N Y 20000 20000 60000 45000 42500 60000 45000 N N Unknown 2024 1 1 0 Disposed Plea Pled Guilty Unknown 3/1/2024 Conditional Discharge PL 240.2 Unknown PL 240.20 V Disorderly Conduct Violation V N 0 2 6 24 90 Unknown Unknown 0 0 0 0 0 0 0 No Arrest Unknown 0 Unknown No None
3.301952e+27 Male White Hispanic 31 32 Queens Criminal Court NY040033J Queens District 11 NYC Local Gershuny, Jeffrey A. Apr 2023 Jan 2024 Custody PL 120 Unknown PL 120.00 01 AM Aslt 3-W/Int Cause Phys Injury Misdemeanor AM N Docket 1/1/2024 PL 120 Unknown PL 120.00 01 AM Aslt 3-W/Int Cause Phys Injury Misdemeanor AM N N Assault Retained Attorney 2 0 2 1 N Y N N N ROR Y N N N N N N N N N N N N N N Non-Family Offense 10000 10000 25000 1 30000 10 Unknown Unknown Unknown Unknown Unknown 20000 20000 60000 45000 42500 60000 45000 N N Unknown 2024 1 1 0 Disposed Dismissed Dismissed Uncooperative Witness (CPL 170.30 (1)(f)) 4/1/2024 Unknown Unknown Unknown Unknown Unknown Unknown Unknown N 0 2 6 96 90 Unknown Unknown 0 0 2 1 0 0 0 No Arrest Unknown 0 Unknown No None
3.339975e+27 Male Black Non Hispanic 26 26 Queens Criminal Court NY040033J Queens District 11 NYC Local Gonzalez, Maria T. Mar 2024 Apr 2024 Custody PL 120 Unknown PL 120.00 01 AM Aslt 3-W/Int Cause Phys Injury Misdemeanor AM N Docket 4/1/2024 PL 120 Unknown PL 120.00 01 AM Aslt 3-W/Int Cause Phys Injury Misdemeanor AM N N Assault Public Defender 4 0 4 3 N N N Y N Bail-set Y N N N N N N N N N N N N N N Family Offense 3000 10000 9000 1 9000 10 Bond N N N N 20000 3000 60000 45000 42500 90000 45000 N N Unknown 2024 1 1 0 Disposed Plea Pled Guilty Unknown 5/1/2024 Conditional Discharge PL 240.2 Unknown PL 240.20 V Disorderly Conduct Violation V N 6 6 6 33 90 Unknown Unknown 1 0 0 0 0 1 0 No Arrest Unknown 0 Unknown No None
1.854242e+27 Male White Hispanic 37 37 Queens Criminal Court NY040033J Queens District 11 NYC Local Gonzalez, Maria T. Jun 2024 Jun 2024 Custody PL 160.05 Attempt PL 110-160.05 EF Robbery-3rd Felony EF N Docket 6/1/2024 PL 160.05 Attempt PL 110-160.05 EF Robbery-3rd Felony EF N N Robbery Legal Aid 1 2 3 3 N N N Y N Bail-set Y N N N N N N N N N N N N N N Family Offense 5000 10000 10000 1 10000 10 Unknown N N N N 20000 75000 60000 45000 42500 225000 45000 N N Unknown 2024 1 1 0 Disposed Plea Pled Guilty Unknown 8/1/2024 Imprisonment-Not Time Served PL 240.26 Unknown PL 240.26 01 V Harassment-2nd:Physical Cntact Violation V N 49 2 52 52 15 15 Unknown 1 1 4 0 1 0 1 No Arrest Unknown 0 Unknown No None
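
One quick way to check a recode such as `rearrest_binary` is to cross-tabulate the raw labels against the recoded values, so that any label falling through to "Unknown" is visible before rows are filtered out. The sketch below uses a small hypothetical vector of labels rather than the real data, and the label set in the actual file may differ.

```r
# Hypothetical raw labels; the real column may contain other values.
rearrest <- c("No Arrest", "Misdemeanor", "Violent felony",
              "No Arrest", "Non-violent felony", "Misdemeanor")

# Same mapping rule as the cleaning step, written in base R.
rearrest_binary <- ifelse(
  rearrest == "No Arrest", "No",
  ifelse(rearrest %in% c("Non-violent felony", "Violent felony", "Yes"),
         "Yes", "Unknown")
)

# Rows are raw labels, columns are recoded values.
audit <- table(raw = rearrest, recoded = rearrest_binary)
print(audit)

# Raw labels that a filter on rearrest_binary != "Unknown" would remove.
dropped_labels <- rownames(audit)[audit[, "Unknown"] > 0]
print(dropped_labels)
```

Running the same cross-tabulation on the full dataset would show exactly which rearrest categories are excluded from the analysis sample.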

PSA Score Calculation

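# Calculate NCA Score (New Criminal Activity)
# PSA-like approximation based on the age, pending-charge, and prior-record fields
# available in this dataset; it may not match an official PSA score.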
nca_scored <- queens_psa_ready |>
  mutate(
    pend_charge = ifelse(pend_vfo + pend_nonvfo + pend_misd > 0, 1, 0),
    age_at_arrest_score = ifelse(age_at_arrest <= 22, 2, 0),
    pending_charge_score = ifelse(pend_charge == 1, 3, 0),
    prior_misd_score = ifelse(prior_misd_cnt > 0, 1, 0),
    prior_felony_score = ifelse(prior_nonvfo_cnt > 0 | prior_vfo_cnt > 0, 1, 0),
    prior_violent_score = case_when(
      prior_vfo_cnt %in% c(1, 2) ~ 1,
      prior_vfo_cnt >= 3 ~ 2,
      TRUE ~ 0
    ),
    nca_score_raw = age_at_arrest_score + pending_charge_score + prior_misd_score + prior_felony_score + prior_violent_score,
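    # Scale the raw total to 1-6; totals of 7 or more are capped at 6 so that
    # the maximum possible raw total (9) cannot fall outside the scale.
    nca_score = case_when(
      nca_score_raw %in% c(0, 1) ~ 1,
      nca_score_raw %in% c(5, 6) ~ 5,
      nca_score_raw >= 7 ~ 6,
      TRUE ~ nca_score_raw
    )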
  )

# Calculate NVCA Score (New Violent Criminal Activity)
nvca_scored <- queens_psa_ready |>
  mutate(
    pend_charge = ifelse(pend_vfo + pend_nonvfo + pend_misd > 0, 1, 0),
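    # Note: the arraign_charge_category values previewed above are offense types
    # (e.g. "Assault", "Robbery"); if "Violent" never occurs in the data, the two
    # current-offense terms below are always 0.
    current_violent_offense_score = ifelse(arraign_charge_category == "Violent", 2, 0),
    current_violent_20_under = ifelse(arraign_charge_category == "Violent" & age_at_arrest <= 20, 1, 0),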
    pending_charge_score = ifelse(pend_charge == 1, 1, 0),
    prior_misd_or_felony_score = ifelse(prior_misd_cnt > 0 | prior_nonvfo_cnt > 0 | prior_vfo_cnt > 0, 1, 0),
    prior_violent_score = case_when(
      prior_vfo_cnt %in% c(1, 2) ~ 1,
      prior_vfo_cnt >= 3 ~ 2,
      TRUE ~ 0
    ),
    nvca_score_raw = current_violent_offense_score + current_violent_20_under + pending_charge_score + prior_misd_or_felony_score + prior_violent_score,
    nvca_score = case_when(
      nvca_score_raw %in% c(0, 1) ~ 1,
      TRUE ~ nvca_score_raw
    )
  )

# Add scores to main dataset
queens_psa_ready$nca_score <- nca_scored$nca_score
queens_psa_ready$nvca_score <- nvca_scored$nvca_score

# Save scored data (each file holds the data frame its name refers to)
write.csv(nca_scored, "nca_scored.csv", row.names = FALSE)
write.csv(nvca_scored, "nvca_scored.csv", row.names = FALSE)
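
As a worked example of the NCA point rules used in this section, the sketch below scores single hypothetical defendants. The function name `nca_like_score` and its arguments are assumptions introduced for illustration; the point values mirror the code in this section, and raw totals of 7 or more are mapped to 6 here so the result stays on a 1 to 6 scale.

```r
# Hypothetical helper mirroring the PSA-like NCA point rules in this section.
nca_like_score <- function(age, pend_any, prior_misd, prior_nonvfo, prior_vfo) {
  raw <- ifelse(age <= 22, 2, 0) +                      # young at arrest
    ifelse(pend_any, 3, 0) +                            # any pending charge
    ifelse(prior_misd > 0, 1, 0) +                      # any prior misdemeanor
    ifelse(prior_nonvfo > 0 | prior_vfo > 0, 1, 0) +    # any prior felony
    ifelse(prior_vfo >= 3, 2, ifelse(prior_vfo >= 1, 1, 0))  # prior violent felonies
  scaled <- ifelse(raw <= 1, 1,
              ifelse(raw %in% c(5, 6), 5,
                ifelse(raw >= 7, 6, raw)))
  c(raw = raw, scaled = scaled)
}

# A 21-year-old with a pending charge and two prior misdemeanors:
# 2 (age) + 3 (pending) + 1 (prior misdemeanor) = 6 raw points.
young_pending <- nca_like_score(age = 21, pend_any = TRUE,
                                prior_misd = 2, prior_nonvfo = 0, prior_vfo = 0)
print(young_pending)

# A 40-year-old with no pending charges and no prior record: 0 raw points.
no_history <- nca_like_score(age = 40, pend_any = FALSE,
                             prior_misd = 0, prior_nonvfo = 0, prior_vfo = 0)
print(no_history)
```

Scoring a few hand-checked profiles like these is a cheap way to confirm that the vectorized scoring code assigns the intended points.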

Final Modeling Dataset Preparation

# Create modeling dataframe with proper variables
model_data <- queens_psa_ready |>
  mutate(
    current_violent = ifelse(arraign_charge_category %in% c("Assault", "Strangulation", "Rape", "Homicide Related", "Robbery"), 1, 0),
    current_property = ifelse(arraign_charge_category %in% c("Larceny", "Burglary"), 1, 0),
    rearrest_binary = ifelse(rearrest_binary == "Yes", 1, 0),
    rearrest_violent_binary = ifelse(rearrest_type == "Violent", 1, 0),
    current_misd = ifelse(top_severity_at_arraign == "Misdemeanor", 1, 0),
    gender_male = ifelse(gender == "Male", 1, 0),
    race_black = ifelse(race == "Black", 1, 0),
    race_white = ifelse(race == "White", 1, 0),
    race_hispanic = ifelse(ethnicity == "Hispanic", 1, 0),
    released_ror = ifelse(release_decision_at_arraign == "ROR", 1, 0),
    released_nmr = ifelse(release_decision_at_arraign == "Nonmonetary release", 1, 0),
    detained = ifelse(release_decision_at_arraign == "Bail-set", 1, 0)
  ) |>
  select(gender_male, race_black, race_white, race_hispanic, age_at_arrest, 
         current_violent, current_misd, current_property,
         prior_vfo_cnt, prior_nonvfo_cnt, prior_misd_cnt, 
         pend_vfo, pend_nonvfo, pend_misd,
         nca_score, nvca_score,
         released_ror, released_nmr, detained, 
         rearrest_binary, rearrest_violent_binary)

# Remove any remaining missing values
model_data <- model_data |> drop_na()

# Split data into training and testing sets
set.seed(613)
train_index <- createDataPartition(model_data$rearrest_binary, p = 0.8, list = FALSE)
train <- model_data[train_index, ]
test <- model_data[-train_index, ]

# Add continuous NVCA scores
# (taken from the rows already in each split, so they stay aligned even if drop_na() removed rows)
train$nvca_continuous <- train$nvca_score
test$nvca_continuous <- test$nvca_score
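
Because rearrest is a rare outcome in this sample, it is worth confirming that the training and test sets have similar outcome rates. The sketch below is a base-R illustration of a class-stratified 80/20 split on simulated data; `stratified_split` is a hypothetical helper and the simulated outcome is not the study data.

```r
set.seed(613)
# Simulated binary outcome with roughly a 9% event rate (illustrative only).
y <- rbinom(5000, size = 1, prob = 0.09)

# Hypothetical helper: sample a fraction p of row indices within each class.
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(p * length(idx)))]
  }))
  sort(unname(train_idx))
}

train_idx <- stratified_split(y)
rate_train <- mean(y[train_idx])
rate_test <- mean(y[-train_idx])
print(c(train = rate_train, test = rate_test))
```

With `caret::createDataPartition`, a numeric outcome is grouped by percentiles before sampling, so passing the outcome as a factor is one way to make the class-stratified behaviour explicit.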

Data Exploration

Descriptive Statistics

# Create descriptive tables
desc_stats <- list()

# Sample characteristics
desc_stats$sample_size <- nrow(queens_psa_ready)
desc_stats$rearrest_rate <- mean(queens_psa_ready$rearrest_binary == "Yes") * 100
desc_stats$violent_rearrest_rate <- mean(queens_psa_ready$rearrest_type == "Violent", na.rm = TRUE) * 100

# Release decisions
release_dist <- queens_psa_ready |>
  count(release_decision_at_arraign) |>
  mutate(Percentage = round(n / sum(n) * 100, 1))

# Demographic characteristics
gender_dist <- queens_psa_ready |>
  count(gender) |>
  mutate(Percentage = round(n / sum(n) * 100, 1))

race_dist <- queens_psa_ready |>
  count(race) |>
  mutate(Percentage = round(n / sum(n) * 100, 1))

# Age distribution
age_stats <- queens_psa_ready |>
  summarize(
    Mean = round(mean(age_at_arrest), 1),
    Median = median(age_at_arrest),
    SD = round(sd(age_at_arrest), 1),
    Min = min(age_at_arrest),
    Max = max(age_at_arrest)
  )

# Criminal history
criminal_history <- queens_psa_ready |>
  summarize(
    `Prior Violent Felonies` = mean(prior_vfo_cnt),
    `Prior Non-Violent Felonies` = mean(prior_nonvfo_cnt),
    `Prior Misdemeanors` = mean(prior_misd_cnt),
    `Any Prior Record (%)` = round(mean(prior_vfo_cnt > 0 | prior_nonvfo_cnt > 0 | prior_misd_cnt > 0) * 100, 1)
  )

# Display key statistics
cat("## Sample Characteristics\n")
## ## Sample Characteristics
cat("- Total cases:", desc_stats$sample_size, "\n")
## - Total cases: 9027
cat("- Rearrest rate:", round(desc_stats$rearrest_rate, 1), "%\n")
## - Rearrest rate: 8.8 %
cat("- Violent rearrest rate:", round(desc_stats$violent_rearrest_rate, 1), "%\n\n")
## - Violent rearrest rate: 1.9 %
cat("## Release Decisions\n")
## ## Release Decisions
kable(release_dist, col.names = c("Release Decision", "Count", "Percentage"))
Release Decision       Count   Percentage
Bail-set                1937         21.5
Nonmonetary release     1744         19.3
ROR                     5346         59.2
cat("\n## Demographic Characteristics\n")
## 
## ## Demographic Characteristics
kable(gender_dist, col.names = c("Gender", "Count", "Percentage"))
Gender   Count   Percentage
Female    1718           19
Male      7309           81
kable(race_dist, col.names = c("Race", "Count", "Percentage"))
Race                             Count   Percentage
American Indian/Alaskan Native      52          0.6
Asian/Pacific Islander            1336         14.8
Black                             3671         40.7
Unknown                             64          0.7
White                             3904         43.2
cat("\n## Age Distribution\n")
## 
## ## Age Distribution
kable(age_stats, col.names = c("Mean", "Median", "SD", "Min", "Max"))
Mean   Median     SD   Min   Max
35.5       33   12.1    16    88
cat("\n## Criminal History\n")
## 
## ## Criminal History
kable(criminal_history)
Prior Violent Felonies   Prior Non-Violent Felonies   Prior Misdemeanors   Any Prior Record (%)
0.1488867                0.2690816                    1.015066             31.2

Correlation Analysis

## ### Correlation with General Rearrest (rearrest_binary)
Variable           Correlation Coefficient
rearrest_binary                      1.000
nca_score                            0.125
pend_misd                            0.101
pend_nonvfo                          0.095
nvca_score                           0.059
prior_misd_cnt                       0.054
prior_nonvfo_cnt                     0.047
pend_vfo                             0.046
prior_vfo_cnt                        0.022
age_at_arrest                       -0.034
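
The code that produced this table is not shown in the report. A table of this kind could be computed roughly as follows, assuming a numeric data frame with a 0/1 `rearrest_binary` column; the `toy` data frame below is simulated and only stands in for the modeling data.

```r
set.seed(1)
# Simulated stand-in for the modeling data (illustrative values only).
toy <- data.frame(
  rearrest_binary = rbinom(200, 1, 0.2),
  nca_score = sample(1:6, 200, replace = TRUE),
  age_at_arrest = sample(16:80, 200, replace = TRUE)
)

# Pearson correlation of every column with the outcome, sorted descending.
# With a 0/1 outcome this is the point-biserial correlation.
cors <- cor(toy)[, "rearrest_binary"]
cor_table <- data.frame(
  Variable = names(cors),
  Correlation = round(unname(cors), 3)
)
cor_table <- cor_table[order(-cor_table$Correlation), ]
print(cor_table, row.names = FALSE)
```

Correlations with a rare binary outcome tend to be small in absolute terms, so the ranking of predictors is usually more informative than the magnitudes.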

PSA Score Performance

library(gridExtra)
# NCA Score performance
nca_performance <- nca_scored |>
  group_by(nca_score) |>
  summarize(
    Cases = n(),
    Rearrests = sum(rearrest_binary == "Yes"),
    Rearrest_Rate = round(Rearrests / Cases * 100, 1)
  )

# NVCA Score performance
nvca_performance <- nvca_scored |>
  group_by(nvca_score) |>
  summarize(
    Cases = n(),
    Violent_Rearrests = sum(rearrest_type == "Violent", na.rm = TRUE),
    Violent_Rearrest_Rate = round(Violent_Rearrests / Cases * 100, 2)
  )

# Plot NCA performance
nca_plot <- ggplot(nca_performance, aes(x = factor(nca_score), y = Rearrest_Rate)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = Rearrest_Rate), vjust = -0.5) +
  labs(title = "Rearrest Rate by NCA Score",
       x = "NCA Score", y = "Rearrest Rate (%)") +
  theme_minimal()

# Plot NVCA performance
nvca_plot <- ggplot(nvca_performance, aes(x = factor(nvca_score), y = Violent_Rearrest_Rate)) +
  geom_bar(stat = "identity", fill = "darkred", alpha = 0.7) +
  geom_text(aes(label = Violent_Rearrest_Rate), vjust = -0.5) +
  labs(title = "Violent Rearrest Rate by NVCA Score",
       x = "NVCA Score", y = "Violent Rearrest Rate (%)") +
  theme_minimal()


# Side by side
grid.arrange(nca_plot, nvca_plot, ncol = 2)

# Or one above the other
grid.arrange(nca_plot, nvca_plot, nrow = 2)

cat("### NCA Score Performance\n")
## ### NCA Score Performance
kable(nca_performance, col.names = c("NCA Score", "Cases", "Rearrests", "Rearrest Rate (%)"))
NCA Score   Cases   Rearrests   Rearrest Rate (%)
1            4803         268                 5.6
2            1185          95                 8.0
3            1359         198                14.6
4             485          74                15.3
5            1162         151                13.0
6              33           8                24.2
cat("\n### NVCA Score Performance\n")
## 
## ### NVCA Score Performance
kable(nvca_performance, col.names = c("NVCA Score", "Cases", "Violent Rearrests", "Violent Rearrest Rate (%)"))
NVCA Score   Cases   Violent Rearrests   Violent Rearrest Rate (%)
1             7102                 115                        1.62
2             1398                  46                        3.29
3              527                  12                        2.28
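
Some score levels contain few cases (NCA score 6 has only 33), so the observed rates carry very different amounts of uncertainty. The sketch below attaches exact binomial 95% confidence intervals to the NCA rearrest rates, using the counts from the NCA table above; the choice of the exact (Clopper-Pearson) interval is an assumption made for illustration.

```r
# Counts copied from the NCA performance table in this section.
nca <- data.frame(
  score = 1:6,
  cases = c(4803, 1185, 1359, 485, 1162, 33),
  rearrests = c(268, 95, 198, 74, 151, 8)
)

# Exact binomial 95% interval for each score level (one row per level).
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int,
               nca$rearrests, nca$cases))

nca$rate_pct <- round(nca$rearrests / nca$cases * 100, 1)
nca$lower_pct <- round(ci[, 1] * 100, 1)
nca$upper_pct <- round(ci[, 2] * 100, 1)
print(nca)
```

Comparing the interval widths across score levels helps judge whether apparent non-monotonic steps in the rates, such as score 5 sitting below score 4, are distinguishable from sampling noise.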

Predictive Modeling

# PSA threshold approach (standard approach)
test_psa <- test |>
  mutate(psa_high_risk = ifelse(nca_score >= 4, 1, 0))

roc_psa <- roc(test$rearrest_binary, test_psa$psa_high_risk)
auc_psa <- auc(roc_psa)

# Logistic regression model with class weighting
logit_model <- glm(rearrest_binary ~ current_violent + current_misd + current_property + nca_score,
                  data = train, family = "binomial",
                  weights = ifelse(train$rearrest_binary == 1, 6, 1))

prob_logit <- predict(logit_model, test, type = "response")
roc_logit <- roc(test$rearrest_binary, prob_logit)
auc_logit <- auc(roc_logit)

# Find optimal threshold for logistic model
optimal_threshold <- coords(roc_logit, "best", ret = "threshold", best.method = "youden")$threshold
pred_logit <- ifelse(prob_logit > optimal_threshold, 1, 0)

# Random Forest model with balanced sampling
set.seed(613)
rf_model <- randomForest(factor(rearrest_binary) ~ current_violent + current_misd + current_property + nca_score,
                        data = train,
                        strata = factor(train$rearrest_binary),
                        sampsize = rep(min(table(train$rearrest_binary)), 2),
                        ntree = 300)

prob_rf <- predict(rf_model, test, type = "prob")[, "1"]
roc_rf <- roc(test$rearrest_binary, prob_rf)
auc_rf <- auc(roc_rf)

# Compare all approaches
roc_comparison <- tibble(
  Model = c("PSA Threshold", "Logistic Regression", "Random Forest"),
  AUC = c(auc_psa, auc_logit, auc_rf)
)

# Create ROC curve plot
# Legend labels are built from the computed AUCs so they always match the results table
roc_data <- rbind(
  data.frame(Sensitivity = roc_psa$sensitivities,
             Specificity = roc_psa$specificities,
             Model = sprintf("PSA Threshold (AUC = %.2f)", as.numeric(auc_psa))),
  data.frame(Sensitivity = roc_logit$sensitivities,
             Specificity = roc_logit$specificities,
             Model = sprintf("Logistic Regression (AUC = %.2f)", as.numeric(auc_logit))),
  data.frame(Sensitivity = roc_rf$sensitivities,
             Specificity = roc_rf$specificities,
             Model = sprintf("Random Forest (AUC = %.2f)", as.numeric(auc_rf)))
)

roc_plot <- ggplot(roc_data, aes(x = 1 - Specificity, y = Sensitivity, color = Model)) +
  geom_line(linewidth = 1) +
  geom_abline(linetype = "dashed", color = "gray") +
  labs(title = "ROC Curve Comparison", x = "False Positive Rate", y = "True Positive Rate") +
  theme_minimal() +
  theme(legend.position = "bottom")

# Display results
roc_plot

cat("### Model Performance Comparison\n")
## ### Model Performance Comparison
kable(roc_comparison, col.names = c("Model", "AUC"), digits = 3)

| Model               |   AUC |
|:--------------------|------:|
| PSA Threshold       | 0.560 |
| Logistic Regression | 0.633 |
| Random Forest       | 0.640 |

# Statistical comparison
roc_test <- roc.test(roc_psa, roc_logit)
cat("\nStatistical comparison between PSA threshold and logistic regression: p =", format.pval(roc_test$p.value, digits = 3), "\n")
## 
## Statistical comparison between PSA threshold and logistic regression: p = 3.85e-05
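
One caveat when reading the comparison above: the PSA threshold AUC is computed from a binary flag (`nca_score >= 4`), while the logistic regression and random forest AUCs are computed from continuous predicted probabilities. For a binary predictor, the AUC reduces to the average of sensitivity and specificity, so collapsing the NCA score into two categories can discard ranking information. The base R sketch below illustrates this on a small hypothetical data set. The helper `auc_rank` and the `toy` data frame are illustrative names and values, not part of the original analysis or the Queens County sample.

```r
# Rank-based (Mann-Whitney) AUC in base R; tied scores receive average ranks,
# which is equivalent to counting each tied pair as one half.
auc_rank <- function(outcome, score) {
  r  <- rank(score)
  n1 <- sum(outcome == 1)
  n0 <- sum(outcome == 0)
  (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical toy data (NOT the Queens County sample)
toy <- data.frame(
  nca_score = c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6),
  rearrest  = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1)
)

# Dichotomize the score the same way as the PSA threshold approach
high_risk <- as.integer(toy$nca_score >= 4)
sens <- mean(high_risk[toy$rearrest == 1] == 1)  # true positive rate
spec <- mean(high_risk[toy$rearrest == 0] == 0)  # true negative rate

auc_binary  <- auc_rank(toy$rearrest, high_risk)      # equals (sens + spec) / 2
auc_ordinal <- auc_rank(toy$rearrest, toy$nca_score)  # uses all six score levels

print(c(binary = auc_binary, ordinal = auc_ordinal, balanced = (sens + spec) / 2))
```

On this toy data the binary flag gives an AUC of about 0.58 and the full score about 0.73. Whether a similar gap exists in the actual data would need to be checked, for instance by passing `test$nca_score` directly to `roc()` alongside the thresholded version.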

Fairness Analysis

# Add predictions to test data
test$predicted_risk <- predict(logit_model, test, type = "response")

# Evaluate fairness across racial groups
fairness_metrics <- test |>
  group_by(race_black) |>
  summarize(
    N = n(),
    Actual_Rearrest_Rate = mean(rearrest_binary) * 100,
    Predicted_Risk = mean(predicted_risk) * 100,
    Calibration_Error = abs(Actual_Rearrest_Rate - Predicted_Risk),
    TPR = sum(rearrest_binary == 1 & predicted_risk > optimal_threshold) / sum(rearrest_binary == 1),
    FPR = sum(rearrest_binary == 0 & predicted_risk > optimal_threshold) / sum(rearrest_binary == 0),
    FNR = sum(rearrest_binary == 1 & predicted_risk <= optimal_threshold) / sum(rearrest_binary == 1),
    PPV = sum(rearrest_binary == 1 & predicted_risk > optimal_threshold) / sum(predicted_risk > optimal_threshold)
  )

# Calculate disparity ratios
disparity_ratios <- fairness_metrics |>
  summarize(
    FPR_Ratio = max(FPR) / min(FPR),
    TPR_Ratio = max(TPR) / min(TPR),
    PPV_Ratio = max(PPV) / min(PPV)
  )

# Create visualization
fairness_plot <- fairness_metrics |>
  select(race_black, Actual_Rearrest_Rate, Predicted_Risk) |>
  pivot_longer(cols = -race_black, names_to = "Metric", values_to = "Value") |>
  mutate(Metric = ifelse(Metric == "Actual_Rearrest_Rate", "Actual Rearrest Rate", "Predicted Risk"),
         Race = ifelse(race_black == 1, "Black", "Non-Black")) |>
  ggplot(aes(x = Race, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Actual vs Predicted Rearrest Rates by Race",
       y = "Percentage", fill = "") +
  theme_minimal() +
  theme(legend.position = "bottom")

# Display results
fairness_plot

cat("### Fairness Metrics by Race\n")
## ### Fairness Metrics by Race
kable(fairness_metrics, col.names = c("Black", "N", "Actual Rearrest Rate", "Predicted Risk", 
                                     "Calibration Error", "TPR", "FPR", "FNR", "PPV"), digits = 3)

| Black |    N | Actual Rearrest Rate | Predicted Risk | Calibration Error |   TPR |   FPR |   FNR |   PPV |
|------:|-----:|---------------------:|---------------:|------------------:|------:|------:|------:|------:|
|     0 | 1077 |                7.985 |         34.778 |            26.793 | 0.581 | 0.351 | 0.419 | 0.126 |
|     1 |  728 |                9.341 |         36.734 |            27.394 | 0.662 | 0.442 | 0.338 | 0.134 |

cat("\n### Disparity Ratios\n")
## 
## ### Disparity Ratios
kable(disparity_ratios, col.names = c("FPR Ratio", "TPR Ratio", "PPV Ratio"), digits = 3)

| FPR Ratio | TPR Ratio | PPV Ratio |
|----------:|----------:|----------:|
|      1.26 |     1.138 |     1.063 |
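
The PPV values in the fairness table are low (about 0.13) even though the true positive rates are moderate. This follows from the low base rate of rearrest: when few people are rearrested, most people flagged as high risk are false positives. The sketch below applies Bayes' rule to the rates reported for the non-Black group and, for contrast, to a purely hypothetical 30% base rate. The function name `ppv_from_rates` is an illustrative assumption, not part of the original analysis.

```r
# PPV implied by a base rate, true positive rate, and false positive rate
ppv_from_rates <- function(prevalence, tpr, fpr) {
  true_pos  <- tpr * prevalence          # share of all cases that are flagged and rearrested
  false_pos <- fpr * (1 - prevalence)    # share of all cases that are flagged but not rearrested
  true_pos / (true_pos + false_pos)
}

# Rates reported for the non-Black group (actual rearrest rate 7.985%)
ppv_nonblack <- ppv_from_rates(prevalence = 0.07985, tpr = 0.581, fpr = 0.351)

# Same error rates under a HYPOTHETICAL 30% base rate, for contrast only
ppv_hypothetical <- ppv_from_rates(prevalence = 0.30, tpr = 0.581, fpr = 0.351)

print(round(c(reported_base_rate = ppv_nonblack,
              hypothetical_30pct = ppv_hypothetical), 3))
```

The first value reproduces the reported PPV of 0.126 for the non-Black group. The second shows that identical error rates would yield a much higher PPV if rearrest were more common, which is why PPV should be read alongside the base rate.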

Discussion

This study set out to evaluate the predictive performance of PSA-style scores in Queens County (2023–2024) and to compare them with locally trained models. The descriptive statistics show a relatively low overall rearrest rate (8.8%) and violent rearrest rate (1.9%), meaning that most defendants released pretrial were not rearrested during the observation period. This baseline is crucial: even modest predictive gains must be interpreted in the context of generally low event rates. The PSA NCA and NVCA scores showed some predictive value, with rearrest rates generally rising with higher scores (e.g., NCA score 6 had a 24.2% rearrest rate compared to 5.6% for score 1). The increase was not strictly monotonic, however: NCA score 5 (13.0%) fell below scores 3 and 4 (14.6% and 15.3%), NVCA score 3 (2.28%) fell below score 2 (3.29%), and the NCA score 6 estimate rests on only 33 cases. The correlation coefficients between PSA scores and rearrest outcomes were modest, and discrimination was limited when the NCA score was dichotomized at 4 or above (AUC ≈ 0.56). This indicates that, while the PSA captures some risk-related variation, its predictive power in Queens appears weaker than often reported in multi-jurisdiction validations.

When benchmarked against statistical models, both logistic regression (AUC ≈ 0.63) and random forest (AUC ≈ 0.64) outperformed the simple PSA cutoff (AUC ≈ 0.56), a gain of roughly 0.07 to 0.08 in AUC. Part of this gap may reflect that the cutoff collapses the six-point NCA scale into a binary flag, whereas the fitted models produce continuous risk estimates. The improvement of the logistic regression model over the PSA threshold was statistically significant (p < 0.001). This finding underscores the importance of local calibration and suggests that courts could benefit from tailoring predictive tools to local contexts rather than relying on generic thresholds.

The fairness analysis adds an important dimension. Predictions overestimated risk in both Black and non-Black groups: mean predicted risk was near 35% against observed rearrest rates of 8% to 9%, a calibration error of about 27 percentage points in each group. Much of this gap is expected, because the logistic model was fit with a 6:1 weight on rearrest cases, which inflates predicted probabilities. Its outputs are therefore better read as risk rankings than as calibrated probabilities. Disparities in TPR, FPR, and PPV were present but not extreme. For example, the false positive rate was 0.442 for Black defendants versus 0.351 for non-Black defendants, a ratio of 1.26. This suggests that while racial disparities exist in predictive accuracy, they may not be the primary driver of inequities in this dataset. Still, even modest disparities can accumulate into meaningful differences in pretrial detention decisions, highlighting the need for transparency and ongoing monitoring.

Overall, the results present a mixed picture: risk assessment tools like the PSA capture real patterns but may underperform in specific jurisdictions if applied without local validation. Local models provide measurable accuracy gains but still face challenges related to fairness and calibration.
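
The calibration gap discussed above is consistent with the class weights used when fitting `logit_model` (weight 6 on rearrest cases). Up-weighting positives by a factor w multiplies the fitted odds by approximately w when the model is reasonably well specified, so a rough correction is to divide the predicted odds by w. The sketch below shows that adjustment in base R. The function name `unweight_prob` and the example probabilities are hypothetical. This is one possible recalibration under the stated assumption, not a step performed in the original analysis.

```r
# Approximate correction for predictions from a logistic model fit with
# weight w on the positive class: divide the predicted odds by w.
# p / (1 - p) / w, converted back to a probability, simplifies to the line below.
unweight_prob <- function(p, w) {
  p / (p + w * (1 - p))
}

# HYPOTHETICAL predicted probabilities from a model fit with weight 6 on positives
p_weighted <- c(0.20, 0.35, 0.50, 0.80)
p_adjusted <- unweight_prob(p_weighted, w = 6)

print(round(p_adjusted, 3))
```

Because the adjustment is monotonic, it leaves the ranking of cases, and therefore the AUC, unchanged. As a rough check, applying it to the mean predicted risk of 34.8% reported for the non-Black group gives about 8%, close to the observed 8.0% rearrest rate. The mean of adjusted probabilities need not equal the adjustment of the mean, so this is only indicative. Any classification threshold would also need to be re-derived on the adjusted scale.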

Conclusion

Pretrial decision-making in Queens County reflects the broader tension between public safety, fairness, and efficiency. This analysis demonstrates that while the PSA offers a structured and transparent framework, its predictive power is limited in this setting. Locally trained models, even with modest complexity, deliver stronger predictive accuracy, underscoring the value of jurisdiction-specific validation. At the same time, fairness concerns remain. Although disparities between Black and non-Black defendants were not extreme, the overprediction of risk in both groups, likely amplified by the class weighting used to fit the logistic model, shows that model outputs cannot be taken at face value and points to the limits of statistical models in resolving deeply rooted inequities in the justice system. Policymakers and practitioners should therefore view predictive tools as one piece of a larger decision-making process, not as replacements for judicial discretion or broader reform.

In practice, this means:

- Local calibration of tools like the PSA should be routine, not optional.
- Performance should be assessed not only on accuracy but also on fairness metrics across groups.
- Risk assessments should complement, rather than substitute for, transparent judicial reasoning.

Ultimately, the findings suggest that effective pretrial reform requires both technical improvements to predictive tools and a broader commitment to equity and accountability in decision-making.