Getting Started:

# Load Packages
library(readxl)
library(dplyr)
library(tidyr)
library(tidyverse)
library(descr)
library(knitr)
library(ggplot2)
library(ggthemes)
library(clarify)
library(arm) # clarify still won't work
library(betareg)
library(MASS)

Dataset:

The dataset used in this analysis comes from the NYC Department of Probation’s Annual Adult Cases dataset, available on NYC Open Data. The dataset is provided in CSV for Excel format, accompanied by a detailed data dictionary that explains each variable. Key variables include the Year (indicating the calendar year of the case), Borough (the NYC borough where the case was recorded), and several measures related to probation cases. These measures include Supervision Caseload Count (the number of cases under supervision), Investigation Charge - Felonies (the number of felony charges investigated), Filed Violation - Rearrests (the count of rearrests filed as violations), and Filed Violation - Technical (the count of technical violations filed). Other variables in the dataset, such as Passthrough Count, Supervision Intakes, Early Discharge - New Requests, and Early Discharge - Requests Approved/Disapproved, provide additional context about the probation system’s operations and outcomes.

For the purpose of our analysis, we focused on early discharge outcomes by creating a new variable called approval_rate. This variable was derived by combining the counts from Early Discharge - Requests Approved and Early Discharge - Requests Disapproved to first calculate the total number of early discharge requests, and then computing the ratio of approved requests to this total. This transformation results in a proportion that reflects the rate at which early discharge requests are approved. By converting raw counts into a proportion, the data became well-suited for beta regression analysis. This approach allowed us to examine how factors such as supervision caseload, felony investigations, and filed violations relate to the likelihood of early discharge approvals in NYC’s adult probation cases.

# Loading the data
data <- read_excel("Department_of_Probation__DOP__Annual_Adult_Cases_20250302.xlsx")
# Cleaning data and creating the approval_rate variable
data2 <- data %>%
  mutate(
    total_requests = `Early Discharge - Requests Approved` + `Early Discharge - Requests Disapproved`,
    approval_rate = `Early Discharge - Requests Approved` / total_requests
  ) %>%
  filter(total_requests > 0) %>%
  dplyr::select(
    Year,
    Borough,
    approval_rate,
    `Supervision Caseload Count`,
    `Investigation Charge - Felonies`,
    `Filed Violation - Rearrests`,
    `Filed Violation - Technical`
  )

head(data2)
## # A tibble: 6 × 7
##    Year Borough      approval_rate Supervision Caseload…¹ Investigation Charge…²
##   <dbl> <chr>                <dbl>                  <dbl>                  <dbl>
## 1  2006 Bronx                0.899                   8426                   3469
## 2  2006 Brooklyn             0.96                    7193                   4511
## 3  2006 Manhattan            0.644                   6619                   7471
## 4  2006 Queens               0.962                   7633                   3193
## 5  2006 Staten Isla…         0.364                   1246                    621
## 6  2007 Bronx                0.984                   8142                   3619
## # ℹ abbreviated names: ¹​`Supervision Caseload Count`,
## #   ²​`Investigation Charge - Felonies`
## # ℹ 2 more variables: `Filed Violation - Rearrests` <dbl>,
## #   `Filed Violation - Technical` <dbl>

I used mutate to create a new variable “total_requests” by adding the approved and disapproved early discharge counts then I created “approval_rate” as the proportion of approved requests (i.e., approved divided by total). I then used filter(total_requests > 0) to exclude rows where total_requests equals 0 to avoid division by zero. I had to use dplyr::select() to keep only the relevant columns needed for the analysis and because I was getting errors when I don’t use dplyr:: .

Factors Influencing Early Discharge Approval Rates in NYC Adult Probation Cases: A Beta Regression Analysis

Beta Regression Model:

# Fit the beta regression model with constant precision (phi) explicitly specified
model <- betareg(approval_rate ~ 
                   `Supervision Caseload Count` + 
                   `Investigation Charge - Felonies` +
                   `Filed Violation - Rearrests` + 
                   `Filed Violation - Technical`,
                 phi.formula = ~ 1,
                 data = data2)
## Loading required namespace: numDeriv
# View a summary of the model
summary(model)
## 
## Call:
## betareg(formula = approval_rate ~ `Supervision Caseload Count` + `Investigation Charge - Felonies` + 
##     `Filed Violation - Rearrests` + `Filed Violation - Technical`, data = data2, 
##     phi.formula = ~1)
## 
## Randomized quantile residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7494 -1.2594 -0.7026 -0.3238  2.4499 
## 
## Coefficients (mu model with logit link):
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        1.806e+00  2.142e-01   8.431  < 2e-16 ***
## `Supervision Caseload Count`      -8.848e-06  5.286e-05  -0.167 0.867072    
## `Investigation Charge - Felonies` -3.225e-04  1.155e-04  -2.791 0.005250 ** 
## `Filed Violation - Rearrests`      3.597e-03  9.324e-04   3.858 0.000114 ***
## `Filed Violation - Technical`     -2.934e-03  1.349e-03  -2.174 0.029674 *  
## 
## Phi coefficients (phi model with identity link):
##       Estimate Std. Error z value Pr(>|z|)
## (phi)    4.486        NaN     NaN      NaN
## 
## Exceedence parameter (extended-support xbetax model):
##         Estimate Std. Error z value Pr(>|z|)    
## Log(nu)  -2.3026     0.2097  -10.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Exceedence parameter nu: 0.1
## Type of estimator: ML (maximum likelihood)
## Log-likelihood: -24.37 on 7 Df
## Number of iterations in BFGS optimization: 1

Here:

  • The outcome approval_rate is modeled as a function of four predictors.

  • phi.formula = ~ 1 Specifies that the precision parameter (phi) should be modeled as a constant. This ensures the model object includes phi coefficients.

  • summary(model) displays the estimated coefficients, standard errors, and other diagnostic information for both the mean model (on a logit scale by default) and the precision parameter.

Manually Simulate Predictions (Can’t use clarify):

# Extracting mean coefficients
beta_mean <- coef(model, model = "mean")

summary(beta_mean)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0029335 -0.0003225 -0.0000088  0.3612754  0.0035973  1.8060445
# Extracting the variance-covariance matrix for these coefficients
vcov_mean <- vcov(model)[names(beta_mean), names(beta_mean)]

summary(vcov_mean)
##   (Intercept)         `Supervision Caseload Count`
##  Min.   :-1.760e-04   Min.   :-2.554e-08          
##  1st Qu.: 1.660e-06   1st Qu.:-7.220e-09          
##  Median : 4.670e-06   Median : 2.794e-09          
##  Mean   : 9.181e-03   Mean   : 3.320e-07          
##  3rd Qu.: 1.907e-04   3rd Qu.: 2.518e-08          
##  Max.   : 4.589e-02   Max.   : 1.665e-06          
##  `Investigation Charge - Felonies` `Filed Violation - Rearrests`
##  Min.   :-4.741e-08                Min.   :-1.760e-04           
##  1st Qu.:-7.220e-09                1st Qu.:-1.108e-06           
##  Median : 1.335e-08                Median :-4.741e-08           
##  Mean   : 9.336e-07                Mean   :-3.526e-05           
##  3rd Qu.: 4.005e-08                3rd Qu.: 2.518e-08           
##  Max.   : 4.669e-06                Max.   : 8.693e-07           
##  `Filed Violation - Technical`
##  Min.   :-1.108e-06           
##  1st Qu.:-2.554e-08           
##  Median : 4.005e-08           
##  Mean   : 3.829e-05           
##  3rd Qu.: 1.820e-06           
##  Max.   : 1.907e-04

Here:

  • coef(model, model = "mean") retrieves only the coefficients for the mean submodel.

  • vcov(model)[names(beta_mean), names(beta_mean)] extracts the part of the variance-covariance matrix corresponding to these coefficients.

# Number of simulation draws
n_sim <- 1000

# Simulating draws from a multivariate normal distribution using the mean coefficients and their covariance matrix
sim_draws <- mvrnorm(n_sim, mu = beta_mean, Sigma = vcov_mean)

# Visualize
plot(sim_draws)

Here:

  • mvrnorm() generates 1,000 sets of simulated coefficient values that reflect the uncertainty in our estimates.

  • I added a plot to visualize the simulation.

# Creating the model matrix for the mean model using the cleaned data
X <- model.matrix(model, data = data2)

# Calculating simulated linear predictors for each observation
lp_sim <- X %*% t(sim_draws)

# Visualize
plot(lp_sim)

Here:

  • model.matrix() sonstructs a design matrix X (including the intercept and predictor variables) from data2.

  • Matrix Multiplication: Multiplies the design matrix by the transposed simulation draws. Each element in lp_sim is a simulated linear predictor (on the logit scale) for an observation.

# Defining the inverse logit function to transform linear predictors to probabilities
inv_logit <- function(x) {
  exp(x) / (1 + exp(x))
}

# Applying the inverse logit function to obtain predicted approval rates
pred_sim <- inv_logit(lp_sim)

# Visualize
plot(pred_sim)

Here:

  • inv_logit converts the logit (log-odds) back to probabilities (values between 0 and 1).

  • Transformation: Applying inv_logit to lp_sim converts the simulated linear predictors into predicted approval rates.

# For each observation, I'm computing the mean predicted approval rate and the 95% confidence intervals
pred_mean <- apply(pred_sim, 1, mean)
pred_lower <- apply(pred_sim, 1, quantile, probs = 0.025)
pred_upper <- apply(pred_sim, 1, quantile, probs = 0.975)

# Combining these into a data frame for easy viewing
preds_df <- data.frame(
  fit = pred_mean,
  lower = pred_lower,
  upper = pred_upper
)

# Showing the first few rows of the predictions
head(preds_df)
##         fit     lower     upper
## 1 0.9539846 0.8798346 0.9894745
## 2 0.8043348 0.6341338 0.9195871
## 3 0.7308754 0.5323478 0.8862635
## 4 0.9829193 0.9431664 0.9975357
## 5 0.8771210 0.8324872 0.9113810
## 6 0.9402384 0.8607385 0.9831500

Here:

  • pred_mean <- apply(pred_sim, 1, mean) calculates the average predicted approval rate for each observation across the simulation draws.

  • The code lines below: calculates the 2.5th and 97.5th percentiles to form a 95% confidence interval.

    • pred_lower <- apply(pred_sim, 1, quantile, probs = 0.025)

    • pred_upper <- apply(pred_sim, 1, quantile, probs = 0.975)

Analysis:

This analysis performs a full beta regression analysis and simulation of predicted outcomes for early discharge approval rates. It starts by fitting a beta regression model where the dependent variable, approval_rate, is modeled as a function of predictors like Supervision Caseload Count, Investigation Charge - Felonies, Filed Violation - Rearrests, and Filed Violation - Technical. By specifying phi.formula = ~ 1, the model assumes that the precision parameter (phi) is constant across observations, which is common when there is no strong reason to believe the variability of the response changes with predictors. The summary(model) command provides the coefficient estimates, standard errors, and significance levels, offering insight into how each predictor affects the approval rate on the logit scale.

Following the model fitting, the code proceeds to quantify the uncertainty in these estimates by simulating 1,000 draws from the multivariate normal distribution of the mean coefficients. The mean coefficients are extracted using coef(model, model = "mean"), and their variance-covariance matrix is retrieved with vcov(model), ensuring that the simulated draws account for the estimation uncertainty. Next, the design matrix is constructed using model.matrix(), and the simulated linear predictors are computed via matrix multiplication. These linear predictors, originally on the logit scale, are then transformed into probabilities (i.e., predicted approval rates) using an inverse logit function. Finally, for each observation, the code calculates the mean predicted approval rate along with its 95% confidence interval (using the 2.5th and 97.5th percentiles from the simulated predictions). The results are compiled into a data frame (preds_df), and the first few rows are displayed with head(preds_df), providing a clear summary of the predicted outcomes along with measures of uncertainty.

Results:

The analysis provides an evaluation of the factors influencing approval_rate in NYC adult probation cases. In this study, approval_rate is modeled as a function of several key variables: Supervision Caseload Count, Investigation Charge - Felonies, Filed Violation - Rearrests, and Filed Violation - Technical. The beta regression model was carefully specified with a constant precision parameter (phi) to account for the proportional nature of the data.

The results reveal that Investigation Charge - Felonies is significantly and negatively associated with approval_rate (Estimate = -3.225e-04, p ≈ 0.005), indicating that an increase in felony investigations corresponds with a reduced likelihood of early discharge approvals. Conversely, Filed Violation - Rearrests shows a statistically significant positive effect (Estimate = 3.597e-03, p ≈ 0.0001), suggesting that higher counts of rearrests are linked with increased approval rates. Additionally, Filed Violation - Technical exhibits a significant negative association (Estimate = -2.934e-03, p ≈ 0.03), implying that a higher number of technical violations decreases approval rates. The Supervision Caseload Count does not appear to have a significant impact (p ≈ 0.87), suggesting that overall caseload size may not be a critical determinant in early discharge decisions. Furthermore, by simulating the mean coefficients via multivariate normal draws, the analysis effectively quantifies the uncertainty around these estimates and provides robust confidence intervals for the predicted approval rates.