# Load Packages
library(readxl)
library(dplyr)
library(tidyr)
library(tidyverse)
library(descr)
library(knitr)
library(ggplot2)
library(ggthemes)
library(clarify)
library(arm) # clarify still won't work
library(betareg)
library(MASS)
library(modelsummary)
The dataset used in this analysis comes from the NYC Department of Probation’s Annual Adult Cases dataset, available on NYC Open Data. The dataset is provided in CSV for Excel format, accompanied by a detailed data dictionary that explains each variable. Key variables include the Year (indicating the calendar year of the case), Borough (the NYC borough where the case was recorded), and several measures related to probation cases. These measures include Supervision Caseload Count (the number of cases under supervision), Investigation Charge - Felonies (the number of felony charges investigated), Filed Violation - Rearrests (the count of rearrests filed as violations), and Filed Violation - Technical (the count of technical violations filed). Other variables in the dataset, such as Passthrough Count, Supervision Intakes, Early Discharge - New Requests, and Early Discharge - Requests Approved/Disapproved, provide additional context about the probation system’s operations and outcomes.
For the purpose of this analysis, I focused on early discharge outcomes by creating a new variable called approval_rate. This variable was derived by combining the counts from Early Discharge - Requests Approved and Early Discharge - Requests Disapproved to first calculate the total number of early discharge requests, and then computing the ratio of approved requests to this total. This transformation results in a proportion that reflects the rate at which early discharge requests are approved. By converting raw counts into a proportion, the data became well-suited for beta regression analysis. This approach allowed me to examine how factors such as supervision caseload, felony investigations, and filed violations relate to the likelihood of early discharge approvals in NYC’s adult probation cases.
This analysis builds on the foundations of the previous weeks assignment.
# Loading the data
data <- read_excel("Department_of_Probation__DOP__Annual_Adult_Cases_20250302.xlsx")
# Cleaning data and creating the approval_rate variable
data2 <- data %>%
mutate(
total_requests = `Early Discharge - Requests Approved` + `Early Discharge - Requests Disapproved`,
approval_rate = `Early Discharge - Requests Approved` / total_requests
) %>%
filter(total_requests > 0) %>%
dplyr::select(
Year,
Borough,
approval_rate,
`Supervision Caseload Count`,
`Investigation Charge - Felonies`,
`Filed Violation - Rearrests`,
`Filed Violation - Technical`
)
# Displaying the first few rows of the cleaned data
kable(head(data2))
Year | Borough | approval_rate | Supervision Caseload Count | Investigation Charge - Felonies | Filed Violation - Rearrests | Filed Violation - Technical |
---|---|---|---|---|---|---|
2006 | Bronx | 0.8989899 | 8426 | 3469 | 964 | 297 |
2006 | Brooklyn | 0.9600000 | 7193 | 4511 | 768 | 529 |
2006 | Manhattan | 0.6442953 | 6619 | 7471 | 802 | 393 |
2006 | Queens | 0.9620253 | 7633 | 3193 | 1246 | 290 |
2006 | Staten Island | 0.3636364 | 1246 | 621 | 164 | 69 |
2007 | Bronx | 0.9838710 | 8142 | 3619 | 925 | 341 |
I used mutate
to create a new variable
“total_requests
” by adding the approved and disapproved
early discharge counts then I created “approval_rate
” as
the proportion of approved requests (i.e., approved divided by total). I
then used filter(total_requests > 0)
to exclude rows
where total_requests
equals 0 to avoid division by zero. I
had to use dplyr::select()
to keep only the relevant
columns needed for the analysis and because I was getting errors when I
don’t use dplyr::
. The select
is to choose
only the relevant columns for further analysis. Kable
is
used to make the table from head
look visually
aesthetic.
# Fit the beta regression model with constant precision (phi) explicitly specified
model <- betareg(approval_rate ~
`Supervision Caseload Count` +
`Investigation Charge - Felonies` +
`Filed Violation - Rearrests` +
`Filed Violation - Technical`,
phi.formula = ~ 1,
data = data2)
# Producing a neat summary table using modelsummary
modelsummary(model)
(1) | |
---|---|
(Intercept) | 1.806 |
(0.214) | |
`Supervision Caseload Count` | -0.000 |
(0.000) | |
`Investigation Charge - Felonies` | -0.000 |
(0.000) | |
`Filed Violation - Rearrests` | 0.004 |
(0.001) | |
`Filed Violation - Technical` | -0.003 |
(0.001) | |
(phi) | 4.486 |
Log(nu) | -2.303 |
(0.210) | |
Num.Obs. | 190 |
AIC | 62.7 |
BIC | 85.5 |
Log.Lik. | -24.366 |
Here:
The outcome approval_rate
is modeled as a function
of four predictors.
phi.formula = ~ 1
Specifies that the precision
parameter (phi) should be modeled as a constant. This ensures the model
object includes phi coefficients.
modelsummary
generates a clean, publication-ready
table summarizing the regression results. This table typically includes
coefficients, standard errors, t-values, and p-values, making it easier
to interpret your model’s performance. This replaces
summary()
.
# Extracting mean coefficients
beta_mean <- coef(model, model = "mean")
summary(beta_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0029335 -0.0003225 -0.0000088 0.3612754 0.0035973 1.8060445
# Extracting the variance-covariance matrix for these coefficients
vcov_mean <- vcov(model)[names(beta_mean), names(beta_mean)]
summary(vcov_mean)
## (Intercept) `Supervision Caseload Count`
## Min. :-1.760e-04 Min. :-2.554e-08
## 1st Qu.: 1.660e-06 1st Qu.:-7.220e-09
## Median : 4.670e-06 Median : 2.794e-09
## Mean : 9.181e-03 Mean : 3.320e-07
## 3rd Qu.: 1.907e-04 3rd Qu.: 2.518e-08
## Max. : 4.589e-02 Max. : 1.665e-06
## `Investigation Charge - Felonies` `Filed Violation - Rearrests`
## Min. :-4.741e-08 Min. :-1.760e-04
## 1st Qu.:-7.220e-09 1st Qu.:-1.108e-06
## Median : 1.335e-08 Median :-4.741e-08
## Mean : 9.336e-07 Mean :-3.526e-05
## 3rd Qu.: 4.005e-08 3rd Qu.: 2.518e-08
## Max. : 4.669e-06 Max. : 8.693e-07
## `Filed Violation - Technical`
## Min. :-1.108e-06
## 1st Qu.:-2.554e-08
## Median : 4.005e-08
## Mean : 3.829e-05
## 3rd Qu.: 1.820e-06
## Max. : 1.907e-04
Here:
coef(model, model = "mean")
retrieves only the
coefficients for the mean submodel.
vcov(model)[names(beta_mean), names(beta_mean)]
extracts the part of the variance-covariance matrix corresponding to
these coefficients.
# Number of simulation draws
n_sim <- 1000
# Simulating draws from a multivariate normal distribution using the mean coefficients and their covariance matrix
sim_draws <- mvrnorm(n_sim, mu = beta_mean, Sigma = vcov_mean)
# Visualize
plot(sim_draws)
Here:
mvrnorm()
generates 1,000 sets of simulated
coefficient values that reflect the uncertainty in our
estimates.
I added a plot to visualize the simulation.
# Creating the model matrix for the mean model using the cleaned data
X <- model.matrix(model, data = data2)
# Calculating simulated linear predictors for each observation
lp_sim <- X %*% t(sim_draws)
# Visualize
plot(lp_sim)
Here:
model.matrix()
constructs a design matrix
X
(including the intercept and predictor variables) from
data2.
Matrix Multiplication: Multiplies the design matrix by the
transposed simulation draws. Each element in lp_sim
is a
simulated linear predictor (on the logit scale) for an
observation.
# Defining the inverse logit function to transform linear predictors to probabilities
inv_logit <- function(x) {
exp(x) / (1 + exp(x))
}
# Applying the inverse logit function to obtain predicted approval rates
pred_sim <- inv_logit(lp_sim)
# Visualize
plot(pred_sim)
Here:
inv_logit
converts the logit (log-odds) back to
probabilities (values between 0 and 1).
Transformation: Applying inv_logit
to
lp_sim
converts the simulated linear predictors into
predicted approval rates.
# For each observation, I'm computing the mean predicted approval rate and the 95% confidence intervals
pred_mean <- apply(pred_sim, 1, mean)
pred_lower <- apply(pred_sim, 1, quantile, probs = 0.025)
pred_upper <- apply(pred_sim, 1, quantile, probs = 0.975)
# Combining these into a data frame for easy viewing
preds_df <- data.frame(
fit = pred_mean,
lower = pred_lower,
upper = pred_upper
)
# Showing the first few rows of the predictions
head(preds_df)
## fit lower upper
## 1 0.9514124 0.8691374 0.9892699
## 2 0.8072428 0.6318858 0.9237964
## 3 0.7329289 0.5337602 0.8786372
## 4 0.9812861 0.9351130 0.9975408
## 5 0.8775677 0.8337012 0.9138319
## 6 0.9379132 0.8469009 0.9830235
Here:
pred_mean <- apply(pred_sim, 1, mean)
calculates
the average predicted approval rate for each observation across the
simulation draws.
The code lines below: calculates the 2.5th and 97.5th percentiles to form a 95% confidence interval.
pred_lower <- apply(pred_sim, 1, quantile, probs = 0.025)
pred_upper <- apply(pred_sim, 1, quantile, probs = 0.975)
Community supervision—encompassing probation and parole—remains a pivotal component of the U.S. criminal justice system. Recent data show that nearly 3.7 million adults were under community supervision in 2021 (“Probation and Parole in the United States, 2021 | Bureau of Justice Statistics,” n.d.). Despite this extensive reach, there is still considerable debate about how supervision practices influence recidivism and other outcomes. A rapid evidence assessment by (Smith et al. 2018) concluded that while supervision overall is associated with lower rates of re-offending, there is no universal consensus on “what works” best for reducing criminal behavior. Approaches vary widely, from intensive monitoring and enforcement to rehabilitative, relationship-focused models.
In light of ongoing reforms and policy debates, it is critical to deepen the understanding of the factors that shape probation outcomes. This study examines early discharge approval rates within NYC adult probation cases using a beta regression framework. This approach facilitates the modeling of a proportional outcome while accounting for uncertainty in the estimates through simulation techniques.
The analysis is designed to rigorously investigate the determinants of early discharge approvals without preempting specific outcomes. By modeling approval rates as a function of key predictors—such as case complexity, violation types, and supervisory workload—it aims to develop a nuanced understanding of how various elements within probation supervision may influence decision-making processes. This methodological framework, grounded in robust statistical techniques, sets the stage for future research and policy interventions aimed at enhancing the effectiveness of community supervision.
This analysis performs a full beta regression analysis and simulation
of predicted outcomes for early discharge approval rates. It starts by
fitting a beta regression model where the dependent variable,
approval_rate
, is modeled as a function of predictors like
Supervision Caseload Count, Investigation Charge - Felonies, Filed
Violation - Rearrests, and Filed Violation - Technical. By specifying
phi.formula = ~ 1
, the model assumes that the precision
parameter (phi) is constant across observations, which is common when
there is no strong reason to believe the variability of the response
changes with predictors. The summary(model)
command
provides the coefficient estimates, standard errors, and significance
levels, offering insight into how each predictor affects the approval
rate on the logit scale.
Following the model fitting, the code proceeds to quantify the
uncertainty in these estimates by simulating 1,000 draws from the
multivariate normal distribution of the mean coefficients. The mean
coefficients are extracted using
coef(model, model = "mean")
, and their variance-covariance
matrix is retrieved with vcov(model)
, ensuring that the
simulated draws account for the estimation uncertainty. Next, the design
matrix is constructed using model.matrix()
, and the
simulated linear predictors are computed via matrix multiplication.
These linear predictors, originally on the logit scale, are then
transformed into probabilities (i.e., predicted approval rates) using an
inverse logit function. Finally, for each observation, the code
calculates the mean predicted approval rate along with its 95%
confidence interval (using the 2.5th and 97.5th percentiles from the
simulated predictions). The results are compiled into a data frame
(preds_df
), and the first few rows are displayed with
head(preds_df)
, providing a clear summary of the predicted
outcomes along with measures of uncertainty.
The analysis provides an evaluation of the factors influencing
approval_rate
in NYC adult probation
cases. In this analysis, approval_rate
is modeled as a
function of several key variables:
Supervision Caseload Count
,
Investigation Charge - Felonies
,
Filed Violation - Rearrests
, and
Filed Violation - Technical
. The beta
regression model was carefully specified with a constant precision
parameter (phi) to account for the proportional nature of the data.
The results reveal that
Investigation Charge - Felonies
is
significantly and negatively associated with
approval_rate
(Estimate = -3.225e-04, p ≈ 0.005)
, indicating that an
increase in felony investigations corresponds with a reduced likelihood
of early discharge approvals. Conversely,
Filed Violation - Rearrests
shows a
statistically significant positive effect
(Estimate = 3.597e-03, p ≈ 0.0001)
, suggesting that higher
counts of rearrests are linked with increased approval rates.
Additionally, Filed Violation - Technical
exhibits a significant negative association
(Estimate = -2.934e-03, p ≈ 0.03)
, implying that a higher
number of technical violations decreases approval rates. The
Supervision Caseload Count
does not appear
to have a significant impact (p ≈ 0.87)
, suggesting that
overall caseload size may not be a critical determinant in early
discharge decisions. Furthermore, by simulating the mean coefficients
via multivariate normal draws, the analysis effectively quantifies the
uncertainty around these estimates and provides robust confidence
intervals for the predicted approval rates.