Getting Started:

# Load Packages
library(readxl)
library(dplyr)
library(tidyr)
library(tidyverse)
library(descr)
library(knitr)
library(ggplot2)
library(ggthemes)
library(clarify)
library(arm) # clarify still won't work
library(betareg)
library(MASS)
library(modelsummary)

Dataset:

The dataset used in this analysis comes from the NYC Department of Probation’s Annual Adult Cases dataset, available on NYC Open Data. The dataset is provided in CSV for Excel format, accompanied by a detailed data dictionary that explains each variable. Key variables include the Year (indicating the calendar year of the case), Borough (the NYC borough where the case was recorded), and several measures related to probation cases. These measures include Supervision Caseload Count (the number of cases under supervision), Investigation Charge - Felonies (the number of felony charges investigated), Filed Violation - Rearrests (the count of rearrests filed as violations), and Filed Violation - Technical (the count of technical violations filed). Other variables in the dataset, such as Passthrough Count, Supervision Intakes, Early Discharge - New Requests, and Early Discharge - Requests Approved/Disapproved, provide additional context about the probation system’s operations and outcomes.

For the purpose of this analysis, I focused on early discharge outcomes by creating a new variable called approval_rate. This variable was derived by combining the counts from Early Discharge - Requests Approved and Early Discharge - Requests Disapproved to first calculate the total number of early discharge requests, and then computing the ratio of approved requests to this total. This transformation results in a proportion that reflects the rate at which early discharge requests are approved. By converting raw counts into a proportion, the data became well-suited for beta regression analysis. This approach allowed me to examine how factors such as supervision caseload, felony investigations, and filed violations relate to the likelihood of early discharge approvals in NYC’s adult probation cases.

This analysis builds on the foundations of the previous weeks assignment.

# Loading the data
data <- read_excel("Department_of_Probation__DOP__Annual_Adult_Cases_20250302.xlsx")
# Cleaning data and creating the approval_rate variable
data2 <- data %>%
  mutate(
    total_requests = `Early Discharge - Requests Approved` + `Early Discharge - Requests Disapproved`,
    approval_rate = `Early Discharge - Requests Approved` / total_requests
  ) %>%
  filter(total_requests > 0) %>%
  dplyr::select(
    Year,
    Borough,
    approval_rate,
    `Supervision Caseload Count`,
    `Investigation Charge - Felonies`,
    `Filed Violation - Rearrests`,
    `Filed Violation - Technical`
  )

# Displaying the first few rows of the cleaned data
kable(head(data2))
Year Borough approval_rate Supervision Caseload Count Investigation Charge - Felonies Filed Violation - Rearrests Filed Violation - Technical
2006 Bronx 0.8989899 8426 3469 964 297
2006 Brooklyn 0.9600000 7193 4511 768 529
2006 Manhattan 0.6442953 6619 7471 802 393
2006 Queens 0.9620253 7633 3193 1246 290
2006 Staten Island 0.3636364 1246 621 164 69
2007 Bronx 0.9838710 8142 3619 925 341

I used mutate to create a new variable “total_requests” by adding the approved and disapproved early discharge counts then I created “approval_rate” as the proportion of approved requests (i.e., approved divided by total). I then used filter(total_requests > 0) to exclude rows where total_requests equals 0 to avoid division by zero. I had to use dplyr::select() to keep only the relevant columns needed for the analysis and because I was getting errors when I don’t use dplyr:: . The select is to choose only the relevant columns for further analysis. Kable is used to make the table from head look visually aesthetic.

Factors Influencing Early Discharge Approval Rates in NYC Adult Probation Cases: A Beta Regression Analysis

Beta Regression Model:

# Fit the beta regression model with constant precision (phi) explicitly specified
model <- betareg(approval_rate ~ 
                   `Supervision Caseload Count` + 
                   `Investigation Charge - Felonies` +
                   `Filed Violation - Rearrests` + 
                   `Filed Violation - Technical`,
                 phi.formula = ~ 1,
                 data = data2)

# Producing a neat summary table using modelsummary
modelsummary(model)
(1)
(Intercept) 1.806
(0.214)
`Supervision Caseload Count` -0.000
(0.000)
`Investigation Charge - Felonies` -0.000
(0.000)
`Filed Violation - Rearrests` 0.004
(0.001)
`Filed Violation - Technical` -0.003
(0.001)
(phi) 4.486
Log(nu) -2.303
(0.210)
Num.Obs. 190
AIC 62.7
BIC 85.5
Log.Lik. -24.366

Here:

  • The outcome approval_rate is modeled as a function of four predictors.

  • phi.formula = ~ 1 Specifies that the precision parameter (phi) should be modeled as a constant. This ensures the model object includes phi coefficients.

  • modelsummary generates a clean, publication-ready table summarizing the regression results. This table typically includes coefficients, standard errors, t-values, and p-values, making it easier to interpret your model’s performance. This replaces summary().

Manually Simulate Predictions (Can’t use clarify):

# Extracting mean coefficients
beta_mean <- coef(model, model = "mean")
summary(beta_mean)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0029335 -0.0003225 -0.0000088  0.3612754  0.0035973  1.8060445
# Extracting the variance-covariance matrix for these coefficients
vcov_mean <- vcov(model)[names(beta_mean), names(beta_mean)]
summary(vcov_mean)
##   (Intercept)         `Supervision Caseload Count`
##  Min.   :-1.760e-04   Min.   :-2.554e-08          
##  1st Qu.: 1.660e-06   1st Qu.:-7.220e-09          
##  Median : 4.670e-06   Median : 2.794e-09          
##  Mean   : 9.181e-03   Mean   : 3.320e-07          
##  3rd Qu.: 1.907e-04   3rd Qu.: 2.518e-08          
##  Max.   : 4.589e-02   Max.   : 1.665e-06          
##  `Investigation Charge - Felonies` `Filed Violation - Rearrests`
##  Min.   :-4.741e-08                Min.   :-1.760e-04           
##  1st Qu.:-7.220e-09                1st Qu.:-1.108e-06           
##  Median : 1.335e-08                Median :-4.741e-08           
##  Mean   : 9.336e-07                Mean   :-3.526e-05           
##  3rd Qu.: 4.005e-08                3rd Qu.: 2.518e-08           
##  Max.   : 4.669e-06                Max.   : 8.693e-07           
##  `Filed Violation - Technical`
##  Min.   :-1.108e-06           
##  1st Qu.:-2.554e-08           
##  Median : 4.005e-08           
##  Mean   : 3.829e-05           
##  3rd Qu.: 1.820e-06           
##  Max.   : 1.907e-04

Here:

  • coef(model, model = "mean") retrieves only the coefficients for the mean submodel.

  • vcov(model)[names(beta_mean), names(beta_mean)] extracts the part of the variance-covariance matrix corresponding to these coefficients.

# Number of simulation draws
n_sim <- 1000

# Simulating draws from a multivariate normal distribution using the mean coefficients and their covariance matrix
sim_draws <- mvrnorm(n_sim, mu = beta_mean, Sigma = vcov_mean)

# Visualize
plot(sim_draws)

Here:

  • mvrnorm() generates 1,000 sets of simulated coefficient values that reflect the uncertainty in our estimates.

  • I added a plot to visualize the simulation.

# Creating the model matrix for the mean model using the cleaned data
X <- model.matrix(model, data = data2)

# Calculating simulated linear predictors for each observation
lp_sim <- X %*% t(sim_draws)

# Visualize
plot(lp_sim)

Here:

  • model.matrix() constructs a design matrix X (including the intercept and predictor variables) from data2.

  • Matrix Multiplication: Multiplies the design matrix by the transposed simulation draws. Each element in lp_sim is a simulated linear predictor (on the logit scale) for an observation.

# Defining the inverse logit function to transform linear predictors to probabilities
inv_logit <- function(x) {
  exp(x) / (1 + exp(x))
}

# Applying the inverse logit function to obtain predicted approval rates
pred_sim <- inv_logit(lp_sim)

# Visualize
plot(pred_sim)

Here:

  • inv_logit converts the logit (log-odds) back to probabilities (values between 0 and 1).

  • Transformation: Applying inv_logit to lp_sim converts the simulated linear predictors into predicted approval rates.

# For each observation, I'm computing the mean predicted approval rate and the 95% confidence intervals
pred_mean <- apply(pred_sim, 1, mean)
pred_lower <- apply(pred_sim, 1, quantile, probs = 0.025)
pred_upper <- apply(pred_sim, 1, quantile, probs = 0.975)

# Combining these into a data frame for easy viewing
preds_df <- data.frame(
  fit = pred_mean,
  lower = pred_lower,
  upper = pred_upper
)

# Showing the first few rows of the predictions
head(preds_df)
##         fit     lower     upper
## 1 0.9514124 0.8691374 0.9892699
## 2 0.8072428 0.6318858 0.9237964
## 3 0.7329289 0.5337602 0.8786372
## 4 0.9812861 0.9351130 0.9975408
## 5 0.8775677 0.8337012 0.9138319
## 6 0.9379132 0.8469009 0.9830235

Here:

  • pred_mean <- apply(pred_sim, 1, mean) calculates the average predicted approval rate for each observation across the simulation draws.

  • The code lines below: calculates the 2.5th and 97.5th percentiles to form a 95% confidence interval.

    • pred_lower <- apply(pred_sim, 1, quantile, probs = 0.025)

    • pred_upper <- apply(pred_sim, 1, quantile, probs = 0.975)

Introduction:

Community supervision—encompassing probation and parole—remains a pivotal component of the U.S. criminal justice system. Recent data show that nearly 3.7 million adults were under community supervision in 2021 (“Probation and Parole in the United States, 2021 | Bureau of Justice Statistics,” n.d.). Despite this extensive reach, there is still considerable debate about how supervision practices influence recidivism and other outcomes. A rapid evidence assessment by (Smith et al. 2018) concluded that while supervision overall is associated with lower rates of re-offending, there is no universal consensus on “what works” best for reducing criminal behavior. Approaches vary widely, from intensive monitoring and enforcement to rehabilitative, relationship-focused models.

In light of ongoing reforms and policy debates, it is critical to deepen the understanding of the factors that shape probation outcomes. This study examines early discharge approval rates within NYC adult probation cases using a beta regression framework. This approach facilitates the modeling of a proportional outcome while accounting for uncertainty in the estimates through simulation techniques.

The analysis is designed to rigorously investigate the determinants of early discharge approvals without preempting specific outcomes. By modeling approval rates as a function of key predictors—such as case complexity, violation types, and supervisory workload—it aims to develop a nuanced understanding of how various elements within probation supervision may influence decision-making processes. This methodological framework, grounded in robust statistical techniques, sets the stage for future research and policy interventions aimed at enhancing the effectiveness of community supervision.

Analysis:

This analysis performs a full beta regression analysis and simulation of predicted outcomes for early discharge approval rates. It starts by fitting a beta regression model where the dependent variable, approval_rate, is modeled as a function of predictors like Supervision Caseload Count, Investigation Charge - Felonies, Filed Violation - Rearrests, and Filed Violation - Technical. By specifying phi.formula = ~ 1, the model assumes that the precision parameter (phi) is constant across observations, which is common when there is no strong reason to believe the variability of the response changes with predictors. The summary(model) command provides the coefficient estimates, standard errors, and significance levels, offering insight into how each predictor affects the approval rate on the logit scale.

Following the model fitting, the code proceeds to quantify the uncertainty in these estimates by simulating 1,000 draws from the multivariate normal distribution of the mean coefficients. The mean coefficients are extracted using coef(model, model = "mean"), and their variance-covariance matrix is retrieved with vcov(model), ensuring that the simulated draws account for the estimation uncertainty. Next, the design matrix is constructed using model.matrix(), and the simulated linear predictors are computed via matrix multiplication. These linear predictors, originally on the logit scale, are then transformed into probabilities (i.e., predicted approval rates) using an inverse logit function. Finally, for each observation, the code calculates the mean predicted approval rate along with its 95% confidence interval (using the 2.5th and 97.5th percentiles from the simulated predictions). The results are compiled into a data frame (preds_df), and the first few rows are displayed with head(preds_df), providing a clear summary of the predicted outcomes along with measures of uncertainty.

Results:

The analysis provides an evaluation of the factors influencing approval_rate in NYC adult probation cases. In this analysis, approval_rate is modeled as a function of several key variables: Supervision Caseload Count, Investigation Charge - Felonies, Filed Violation - Rearrests, and Filed Violation - Technical. The beta regression model was carefully specified with a constant precision parameter (phi) to account for the proportional nature of the data.

The results reveal that Investigation Charge - Felonies is significantly and negatively associated with approval_rate (Estimate = -3.225e-04, p ≈ 0.005), indicating that an increase in felony investigations corresponds with a reduced likelihood of early discharge approvals. Conversely, Filed Violation - Rearrests shows a statistically significant positive effect (Estimate = 3.597e-03, p ≈ 0.0001), suggesting that higher counts of rearrests are linked with increased approval rates. Additionally, Filed Violation - Technical exhibits a significant negative association (Estimate = -2.934e-03, p ≈ 0.03), implying that a higher number of technical violations decreases approval rates. The Supervision Caseload Count does not appear to have a significant impact (p ≈ 0.87), suggesting that overall caseload size may not be a critical determinant in early discharge decisions. Furthermore, by simulating the mean coefficients via multivariate normal draws, the analysis effectively quantifies the uncertainty around these estimates and provides robust confidence intervals for the predicted approval rates.

Works Cited

“Probation and Parole in the United States, 2021 | Bureau of Justice Statistics.” n.d. https://bjs.ojp.gov/library/publications/probation-and-parole-united-states-2021.
Smith, Andrew, Kim Heyes, Chris Fox, Jordan Harrison, Zsolt Kiss, and Andrew Bradbury. 2018. “The Effectiveness of Probation Supervision Towards Reducing Reoffending: A Rapid Evidence Assessment.” Probation Journal 65 (4): 407–28. https://doi.org/10.1177/0264550518796275.