Introduction to the Dataset
Overview of the Dataset
The data set I will use to perform a logit regression is the Rossi dataset, which becomes available after loading the AER package (the data itself appears to ship with carData, which AER attaches through its car dependency). The data set contains information on 432 individuals released from prison and tracks whether they were re-arrested within a fixed follow-up period. Mass incarceration is a serious problem in the United States, and the incarceration rate has risen sharply over the past several decades. In 2018, rearrest rates were reported to have reached a historic high, with more than 76.6% of offenders re-offending and returning to prison. Previous studies have shown that many former convicts lack education and employment, placing them at a higher risk of rearrest. I want to see how factors like race, marital status, work experience, and financial assistance contribute to mass incarceration and rearrest rates.
Key Variables I will analyze
- Binary Dependent Variable: Rearrested (1 = Yes, 0 = No)
- Independent Variable: Race (black, other)
- Independent Variable: Marriage (not married, married)
- Independent Variable: Work Experience (yes, no)
- Independent Variable: Financial Assistance (yes, no)
Research Questions
- Does race influence the likelihood of being re-arrested?
- Does marriage status impact the likelihood of being re-arrested?
- Does having work experience impact the likelihood of being re-arrested?
- Does financial assistance impact the likelihood of being re-arrested?
Hypotheses
- 1. Does race influence the likelihood of being re-arrested?
  - H₀ (Null Hypothesis): Race has no effect on the likelihood of being re-arrested.
  - H₁ (Alternative Hypothesis): Race has a significant effect on the likelihood of being re-arrested.
- 2. Does marriage status impact the likelihood of being re-arrested?
  - H₀: Marriage status has no effect on the likelihood of being re-arrested.
  - H₁: Marriage status significantly affects the likelihood of being re-arrested.
- 3. Does having work experience impact the likelihood of being re-arrested?
  - H₀: Work experience does not impact the likelihood of being re-arrested.
  - H₁: Work experience does impact the likelihood of being re-arrested.
- 4. Does financial assistance impact the likelihood of being re-arrested?
  - H₀: Receiving financial assistance does not impact the likelihood of being re-arrested.
  - H₁: Receiving financial assistance does impact the likelihood of being re-arrested.
Statistical Analysis
- I will perform a binary logit regression to analyze the relationship between my variables.
- The binary logit regression model is appropriate for this analysis because my dependent variable is binary (rearrested: yes or no).
- I will conduct a likelihood ratio test and use the AIC and BIC to select the best model.
- I will obtain the probability of being re-arrested at each level of the independent variables that turn out to be significant.
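The analysis plan above can be written out as a worked set of formulas. This is a sketch in standard notation; the symbols (p_i, β, ℓ, k, n, D) are my own labels and do not appear in the original analysis. Here p_i is the probability that individual i is re-arrested, ℓ is the maximized log-likelihood, k is the number of estimated coefficients, n is the number of observations used in the fit, and D is the residual deviance.

$$
\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1\,\text{Race}_i + \beta_2\,\text{Marriage}_i + \beta_3\,\text{WorkExperience}_i + \beta_4\,\text{FinancialAssistance}_i
$$

$$
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{Race}_i + \beta_2\,\text{Marriage}_i + \beta_3\,\text{WorkExperience}_i + \beta_4\,\text{FinancialAssistance}_i)}}
$$

$$
\text{AIC} = -2\ell + 2k, \qquad \text{BIC} = -2\ell + k\log n, \qquad G = D_{\text{reduced}} - D_{\text{full}} \sim \chi^2_{\Delta k}
$$

The likelihood ratio statistic G compares two nested models; under the null hypothesis that the extra coefficients are zero, it is approximately chi-squared with degrees of freedom equal to the number of added coefficients.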
Data Preparation
# Clear workspace
rm(list = ls())
# Import Data
library(AER)
data("Rossi")
# Create new data frame with variables of interest
library(dplyr) # Data manipulation
Rossi.clean <- select(Rossi, arrest, race, mar, wexp, fin)
# Rename variables
Rossi.clean <- rename(Rossi.clean, Rearrested = arrest, Race = race, Marriage = mar, Work.Experience = wexp, Financial.Assistance = fin)
# Make sure structure of the data is correct
glimpse(Rossi.clean)
## Rows: 432
## Columns: 5
## $ Rearrested <int> 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1…
## $ Race <fct> black, black, other, black, other, black, black, …
## $ Marriage <fct> not married, not married, not married, married, n…
## $ Work.Experience <fct> no, no, yes, yes, yes, yes, yes, yes, no, yes, no…
## $ Financial.Assistance <fct> no, no, no, yes, no, no, no, yes, no, no, yes, no…
Data Table
library(DT) # Data Table
datatable(Rossi.clean)
Descriptive Statistics
Univariate Analysis
library(modelsummary) # Data Tables/Model Summaries
datasummary_skim(Rossi.clean, output = "kableExtra", type = "categorical")
| Variable | Level | N | % |
|---|---|---|---|
| Race | black | 379 | 87.7 |
| | other | 53 | 12.3 |
| Marriage | married | 53 | 12.3 |
| | not married | 379 | 87.7 |
| Work.Experience | no | 185 | 42.8 |
| | yes | 247 | 57.2 |
| Financial.Assistance | no | 216 | 50.0 |
| | yes | 216 | 50.0 |
Key Takeaways
- The majority of participants in the study are black (87.7%).
- The majority of participants are not married (87.7%).
- The majority of participants have work experience (57.2%).
- Half of the participants receive financial assistance, while the other half do not.
Bivariate Analysis
# Cross-tabulation of categorical variables
# I want to get a better understanding of the relationship between financial assistance and work experience among participants of the study.
datasummary_crosstab(Work.Experience ~ Financial.Assistance, data = Rossi.clean, output = "kableExtra")
| Work.Experience | | Financial.Assistance: no | Financial.Assistance: yes | All |
|---|---|---|---|---|
| no | N | 93 | 92 | 185 |
| | % row | 50.3 | 49.7 | 100.0 |
| yes | N | 123 | 124 | 247 |
| | % row | 49.8 | 50.2 | 100.0 |
| All | N | 216 | 216 | 432 |
| | % row | 50.0 | 50.0 | 100.0 |
Key Takeaways
- Among participants without work experience, the distribution is nearly equal: 50.3% did not receive financial assistance and 49.7% did.
- Among participants with work experience, the distribution is nearly equal: 49.8% without and 50.2% with financial assistance.
- Overall, the distribution of financial assistance is balanced across work experience groups.
Binary Logit Regression
First, I will convert my individual-level raw data into grouped data, with one row per observed combination of the predictors.
Grouped <- Rossi.clean %>%
  group_by(Race, Marriage, Work.Experience, Financial.Assistance) %>%
  # total = group size, Yes = number rearrested; drop grouping afterwards
  summarise(total = n(), Yes = sum(Rearrested), .groups = "drop") %>%
  mutate(No = total - Yes)
datatable(Grouped)
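Grouping does not change the estimated coefficients or their standard errors; a model fitted to the individual-level data gives the same estimates. Grouped data is still convenient for a binary logistic regression for a few reasons:
- Reducing Size: the 432 individual records collapse into 15 rows, one per observed combination of the predictors.
- Assessing Model Fit: with binomial counts, the residual deviance compares each model against the saturated model for the 15 groups.
- Simplifying Inspection: the number rearrested and the group total can be read directly for each combination.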
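To see what grouping does and does not change, here is a minimal sketch in base R. It uses a small simulated data set (the `toy` data frame and its values are hypothetical, not the Rossi data) and fits the same logistic regression once on individual 0/1 outcomes and once on grouped counts.

```r
# Toy individual-level data (hypothetical values, not the Rossi data)
set.seed(1)
toy <- data.frame(
  x = factor(rep(c("no", "yes"), times = c(60, 40))),
  y = c(rbinom(60, 1, 0.35), rbinom(40, 1, 0.20))
)

# Fit 1: individual-level (0/1) outcome
fit_ind <- glm(y ~ x, family = binomial, data = toy)

# Collapse to one row per level of x: successes (Yes) and failures (No)
yes   <- tapply(toy$y, toy$x, sum)
total <- tapply(toy$y, toy$x, length)
grp <- data.frame(
  x   = factor(names(yes)),
  Yes = as.vector(yes),
  No  = as.vector(total - yes)
)

# Fit 2: grouped (binomial counts) outcome
fit_grp <- glm(cbind(Yes, No) ~ x, family = binomial, data = grp)

# Coefficients agree; residual deviances do not
same_coefs <- isTRUE(all.equal(coef(fit_ind), coef(fit_grp), tolerance = 1e-5))
print(same_coefs)
print(c(individual = deviance(fit_ind), grouped = deviance(fit_grp)))
```

The coefficients should match up to numerical tolerance, while the residual deviance differs because the two fits are compared against different saturated models. The same reasoning applies to log-likelihood, AIC, and BIC, which is why those values depend on whether the data are grouped.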
Binary Logistic Regression Models
models <- list(Model_1 = glm(formula = cbind(Yes, No) ~ Race, family = binomial, data = Grouped),
Model_2 = glm(formula = cbind(Yes, No) ~ Race + Marriage, family = binomial, data = Grouped),
Model_3 = glm(formula = cbind(Yes, No) ~ Race + Marriage + Work.Experience, family = binomial, data = Grouped),
Model_4 = glm(formula = cbind(Yes, No) ~ Race + Marriage + Work.Experience + Financial.Assistance, family = binomial, data = Grouped))
library(huxtable)
modelsummary(models, output = "huxtable", statistic = "p.value")
| | Model_1 | Model_2 | Model_3 | Model_4 |
|---|---|---|---|---|
| (Intercept) | -0.999 | -1.696 | -1.208 | -1.011 |
| | (<0.001) | (<0.001) | (0.006) | (0.023) |
| Raceother | -0.230 | -0.196 | -0.176 | -0.221 |
| | (0.509) | (0.575) | (0.617) | (0.534) |
| Marriagenot married | | 0.772 | 0.555 | 0.584 |
| | | (0.054) | (0.179) | (0.158) |
| Work.Experienceyes | | | -0.552 | -0.551 |
| | | | (0.015) | (0.016) |
| Financial.Assistanceyes | | | | -0.460 |
| | | | | (0.040) |
| Num.Obs. | 15 | 15 | 15 | 15 |
| AIC | 69.9 | 67.7 | 63.7 | 61.4 |
| BIC | 71.3 | 69.8 | 66.5 | 65.0 |
| Log.Lik. | -32.958 | -30.827 | -27.838 | -25.706 |
| F | 0.435 | 2.055 | 3.362 | 3.505 |
| RMSE | 0.31 | 0.34 | 0.31 | 0.30 |
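The AIC and BIC rows can be reproduced from the Log.Lik. row, which helps show what they measure. The sketch below copies the log-likelihoods from the table and assumes k counts the estimated coefficients (intercept included) and that n is the 15 grouped rows reported as Num.Obs.; the variable names are my own.

```r
# Log-likelihoods copied from the model table
log_lik <- c(Model_1 = -32.958, Model_2 = -30.827,
             Model_3 = -27.838, Model_4 = -25.706)

k <- 2:5  # estimated coefficients per model, intercept included
n <- 15   # grouped rows (Num.Obs.), assumed to be the n used for BIC

aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + log(n) * k

print(round(rbind(AIC = aic, BIC = bic), 1))
```

The results agree with the table to one decimal place. The BIC penalty here depends on n = 15 grouped rows, so BIC values from a model fitted to the 432 individual records would not be directly comparable.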
A lower AIC and BIC indicate a better balance between fit and model complexity. The table shows that both values decrease each time a predictor is added, so Model 4 is the preferred model because it has the lowest AIC (61.4) and BIC (65.0). The values in parentheses are p-values. In Model 4, work experience (p = 0.016) and financial assistance (p = 0.040) both have a significant effect on the likelihood of being rearrested because their p-values are less than .05, and both coefficients are negative, which means lower log-odds of rearrest. Race (p = 0.534) and marriage status (p = 0.158) do not have a significant effect because their p-values are greater than .05.
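Because logit coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret. A short sketch using the Model 4 coefficients copied from the table (the vector name `coefs` is my own):

```r
# Model 4 coefficients (log-odds) copied from the regression table
coefs <- c(
  "Raceother"               = -0.221,
  "Marriagenot married"     =  0.584,
  "Work.Experienceyes"      = -0.551,
  "Financial.Assistanceyes" = -0.460
)

# Odds ratio = exp(coefficient); values below 1 mean lower odds of rearrest
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

The odds ratios for work experience and financial assistance come out at roughly 0.58 and 0.63, so holding the other predictors fixed, the odds of rearrest are about 42% lower with work experience and about 37% lower with financial assistance.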
Likelihood Ratio Test
anova(models$Model_1, models$Model_2, models$Model_3, models$Model_4, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: cbind(Yes, No) ~ Race
## Model 2: cbind(Yes, No) ~ Race + Marriage
## Model 3: cbind(Yes, No) ~ Race + Marriage + Work.Experience
## Model 4: cbind(Yes, No) ~ Race + Marriage + Work.Experience + Financial.Assistance
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 13 30.991
## 2 12 26.728 1 4.2632 0.03895 *
## 3 11 20.750 1 5.9773 0.01449 *
## 4 10 16.487 1 4.2637 0.03893 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Create a table for the likelihood ratio test
anova.result <- anova(models$Model_1, models$Model_2, models$Model_3, models$Model_4, test = "Chisq")
library(knitr)
kable(anova.result, caption = "Likelihood Ratio Test")
Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
---|---|---|---|---|
13 | 30.99092 | NA | NA | NA |
12 | 26.72769 | 1 | 4.263235 | 0.0389457 |
11 | 20.75038 | 1 | 5.977306 | 0.0144911 |
10 | 16.48664 | 1 | 4.263740 | 0.0389341 |
Since all three p-values are < 0.05, each added predictor (Marriage, then Work Experience, then Financial Assistance) significantly improves the model’s fit over the previous model. So Model 4 would be the best fit model.
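The likelihood ratio test can be checked by hand from the residual deviances, since each test statistic is the drop in deviance between consecutive models. A minimal sketch using the values from the table above (variable names are my own):

```r
# Residual deviances of Models 1 to 4, copied from the deviance table
resid_dev <- c(30.99092, 26.72769, 20.75038, 16.48664)

# LR statistic = drop in deviance between consecutive nested models
lr_stat <- -diff(resid_dev)

# Each step adds one coefficient, so compare against chi-squared with 1 df
p_val <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(round(data.frame(lr_stat = lr_stat, p_val = p_val), 4))
```

The statistics and p-values should agree with the Deviance and Pr(>Chi) columns reported by `anova()`.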
Obtaining the Probabilities of being re-arrested for significant variables
# Probability of the likelihood of being re-arrested based on Work Experience
Prob_Table <- Rossi.clean %>%
group_by(Work.Experience) %>%
summarise(Rearrested = mean(Rearrested)) %>%
mutate(Not.Rearrested = 1 - Rearrested)
# Create table for results
kable(Prob_Table, caption = "Probabilities of being re-arrested based on Work Experience")
Work.Experience | Rearrested | Not.Rearrested |
---|---|---|
no | 0.3351351 | 0.6648649 |
yes | 0.2105263 | 0.7894737 |
Key Takeaways
- In my table, former convicts with work experience have a lower observed probability of being rearrested (21.1%) compared to those without work experience (33.5%).
# Probability of the likelihood of being re-arrested based on Financial Assistance
Prob_Table1 <- Rossi.clean %>%
group_by(Financial.Assistance) %>%
summarise(Rearrested = mean(Rearrested)) %>%
mutate(Not.Rearrested = 1 - Rearrested)
# Create table for results
kable(Prob_Table1, caption = "Probabilities of being re-arrested based on Financial Assistance")
Financial.Assistance | Rearrested | Not.Rearrested |
---|---|---|
no | 0.3055556 | 0.6944444 |
yes | 0.2222222 | 0.7777778 |
Key Takeaways
- In my table, former convicts with financial assistance have a lower observed probability of being rearrested (22.2%) compared to those without financial assistance (30.6%).
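The tables above are observed proportions. As a complement, fitted probabilities can also be computed from the Model 4 coefficients by applying the inverse logit. The sketch below is an illustration under assumptions: the coefficients are copied from the regression table, and the profile (race = black, not married) is a hypothetical one chosen because it matches the largest groups in the data.

```r
# Model 4 coefficients (log-odds) copied from the regression table
b <- c(intercept = -1.011, not_married = 0.584,
       wexp_yes = -0.551, fin_yes = -0.460)

# Hypothetical profile: race = black (reference level), not married;
# vary work experience and financial assistance (0 = no, 1 = yes)
profiles <- expand.grid(wexp_yes = 0:1, fin_yes = 0:1)

# Linear predictor, then inverse logit via plogis()
eta <- b[["intercept"]] + b[["not_married"]] +
  b[["wexp_yes"]] * profiles$wexp_yes +
  b[["fin_yes"]] * profiles$fin_yes
profiles$prob <- plogis(eta)

print(round(profiles, 3))
```

For this profile, the fitted probability of rearrest falls from roughly 0.39 with neither work experience nor financial assistance to roughly 0.19 with both.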
Conclusion
The results of my analysis indicated that race and marital status do not have a significant impact on the likelihood of re-arrest. However, work experience and financial assistance are both associated with a significantly lower likelihood of being re-arrested. This suggests that providing individuals with work experience and financial assistance can help reduce recidivism rates and improve their chances of reintegrating into society successfully.
References
(Esparza Flores 2018)