-- coding: utf-8 --

“““(Final) DIDA 325 Final Project

Automatically generated by Colab.

Original file is located at https://colab.research.google.com/drive/14UnXxs4T9Z07UEPSrJor24RbeupGuXjE

Group Members:

* Elena Michaud
* Evan Liu
* Ryan Gilbert

1. Introduction

Our dataset comes from the National Institute of Justice (NIJ), part of the US Department of Justice. The data was collected by the Georgia Department of Community Supervision on 26,000 individuals released from Georgia prisons under post-incarceration supervision between January 1, 2013 and December 31, 2015. It was originally released by the NIJ for the Recidivism Forecasting Challenge, whose results were published in February 2022.

We find this dataset interesting because it relates to an important aspect of the American justice system (the rehabilitation and parole of criminals), it contains a significant amount of (anonymized) data on real-world parolees, and it includes recidivism outcomes, so we can check the accuracy of any models we make.

We will explore the following research questions:

1. Which of the numeric variables have the strongest correlation with each other? For example, does the % of drug tests positive for THC have a strong correlation with the number of unexcused absences? We will determine this by data visualization through corrplot.
2. Using the data in this set, can we accurately predict Recidivism_Within_3years? Using the model generated, if we split the testing data between white and black parolees, does the model overpredict or underpredict compared to the general accuracy? We will perform logistic regression on the data set, excluding the other recidivism-related columns.
3. Are there significant differences between the history and outcomes of different races? For example, are they more likely to have a higher Supervision Risk Assessment Score? Are they more likely to be sentenced to more years in prison or be older at the time of their release? We will perform descriptive statistics on the data set using dplyr, and create and interpret visualizations (such as box plots) using ggplot2.
4. Are individuals convicted of more serious crimes more likely to have worse outcomes? For example, are they at higher risk of violating parole, or testing positive for illicit substances? We will perform descriptive statistics on the data set using dplyr, and create and interpret visualizations (such as box plots) using ggplot2.

2. Dataset

““”

# We install and load our packages
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.94 loaded
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.3.3

“““Our dataset was downloaded from the United States government’s open data website, and originally came from the US Department of Justice. More information about our data can be found in the details for the Recidivism Forecasting Challenge, which we use to inform our interpretation of the columns below:

| Column Name | Description | Type | Value Range |
| --- | --- | --- | --- |
| Gender | Individual’s gender | categorical text | M/F |
| Race | Individual’s race | categorical text | BLACK/WHITE |
| Age_at_Release | Age when released on parole | numeric int | 20/25/30/35/40/45/48 |
| Gang_Affiliated | Whether or not the individual was gang affiliated | categorical boolean | TRUE/FALSE/NA |
| Supervision_Risk_Score_First | First parole “Supervision Risk Assessment Score” (1=lowest risk of recidivism) | numeric int | 1-10 |
| Supervision_Level_First | First parole “Supervision Level Assignment” (level of required supervision on parole) | categorical text | High/Specialized/Standard/NA |
| Education_Level | Individual’s education level upon prison entry | categorical text | At least some college/High School Diploma/Less than HS diploma |
| Dependents | Number of dependents the individual had upon prison entry | numeric int | 0/1/2/3 |
| Prison_Offense | The offense committed by the individual | categorical text | Drug/Other/Property/(Violent/Non-Sex)/(Violent/Sex)/NA |
| Prison_Years | Number of years in prison prior to parole release | categorical text | 1-2 years/Greater than 2 to 3 years/Less than 1 year/More than 3 years |
| Prior_Arrest_Episodes_Felony | Number of prior felony arrests | numeric int | 0-10 |
| Prior_Arrest_Episodes_Misd | Number of prior misdemeanor arrests | numeric int | 0-6 |
| Prior_Arrest_Episodes_Violent | Number of prior violent arrests | numeric int | 0-3 |
| Prior_Arrest_Episodes_Property | Number of prior property-related arrests | numeric int | 0-5 |
| Prior_Arrest_Episodes_Drug | Number of prior drug-related arrests | numeric int | 0-5 |
| Prior_Arrest_Episodes_PPViolationCharges | Number of prior arrests leading to parole violations | numeric int | 0-5 |
| Prior_Arrest_Episodes_DVCharges | If the individual has prior domestic violence-related arrests | categorical boolean | TRUE/FALSE |
| Prior_Arrest_Episodes_GunCharges | If the individual has prior gun-related arrests | categorical boolean | TRUE/FALSE |
| Prior_Conviction_Episodes_Felony | Number of prior felony convictions | numeric int | 0-3 |
| Prior_Conviction_Episodes_Misd | Number of prior misdemeanor convictions | numeric int | 0-4 |
| Prior_Conviction_Episodes_Viol | If the individual has prior violence-related convictions | categorical boolean | TRUE/FALSE |
| Prior_Conviction_Episodes_Prop | Number of prior property-related convictions | numeric int | 0-3 |
| Prior_Conviction_Episodes_Drug | Number of prior drug-related convictions | numeric int | 0-2 |
| Prior_Conviction_Episodes_PPViolationCharges | If the individual has prior parole/probation violations due to arrest | categorical boolean | TRUE/FALSE |
| Prior_Conviction_Episodes_DomesticViolenceCharges | If the individual has prior convictions related to domestic arrest | categorical boolean | TRUE/FALSE |
| Prior_Conviction_Episodes_GunCharges | If the individual has prior gun-related convictions | categorical boolean | TRUE/FALSE |
| Prior_Revocations_Parole | If the individual has prior parole revocations | categorical boolean | TRUE/FALSE |
| Prior_Revocations_Probation | If the individual has prior probation revocations | categorical boolean | TRUE/FALSE |
| Condition_MH_SA | If a condition of parole release was mental health or substance abuse programming | categorical boolean | TRUE/FALSE |
| Condition_Cog_Ed | If a condition of parole release was cognitive skills or education programming | categorical boolean | TRUE/FALSE |
| Condition_Other | If a condition of parole was no victim contact, or electronic monitoring, or restitution, or sex offender registration | categorical boolean | TRUE/FALSE |
| Violations_ElectronicMonitoring | If the individual has electronic monitoring-related violations | categorical boolean | TRUE/FALSE |
| Violations_Instruction | If the individual has parole instruction-related violations | categorical boolean | TRUE/FALSE |
| Violations_FailToReport | If the individual has failed to report violations before | categorical boolean | TRUE/FALSE |
| Violations_MoveWithoutPermission | If the individual has changed residences without permission before | categorical boolean | TRUE/FALSE |
| Delinquency_Reports | Number of delinquency reports on parole | numeric int | 0-4 |
| Program_Attendances | Number of programs attended | numeric int | 0-10 |
| Program_UnexcusedAbsences | Number of unexcused program absences | numeric int | 0-3 |
| Residence_Changes | Number of residence changes | numeric int | 0-3 |
| Avg_Days_per_DrugTest | Average days between drug tests | numeric float | 0.5-1088.5 |
| DrugTests_THC_Positive | % of drug tests positive for THC | numeric float | 0-1 |
| DrugTests_Cocaine_Positive | % of drug tests positive for cocaine | numeric float | 0-1 |
| DrugTests_Meth_Positive | % of drug tests positive for meth | numeric float | 0-1 |
| DrugTests_Other_Positive | % of drug tests positive for other drugs | numeric float | 0-1 |
| Percent_Days_Employed | % of days employed on parole | numeric float | 0-1 |
| Jobs_Per_Year | Jobs held per year on parole | numeric float | 0-8 |
| Employment_Exempt | If employment is not required for parole | categorical boolean | TRUE/FALSE |
| Recidivism_Within_3years | If the individual was arrested for a misdemeanor/felony within 3 years of parole | categorical boolean | TRUE/FALSE |
| Recidivism_Arrest_Year1 | If the previous arrest occurred in Year 1 | categorical boolean | TRUE/FALSE |
| Recidivism_Arrest_Year2 | If the previous arrest occurred in Year 2 | categorical boolean | TRUE/FALSE |
| Recidivism_Arrest_Year3 | If the previous arrest occurred in Year 3 | categorical boolean | TRUE/FALSE |

We cleaned the data using Python and removed unnecessary columns with R. We display the results below. ““”

# The data went through the following cleaning procedures:
# "n or more" values were truncated to "n" for data analysis purposes
# "n or older" values were truncated to "n"
# true and false were rewritten to TRUE and FALSE
# In cases where a range of ages were provided (e.g. 27-33), the median of that range was taken (e.g. 30)
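
# The cleaning itself was done in Python and is not shown in this notebook. As a rough
# illustration only, the rules listed above could be sketched in R as follows. The helper
# names truncate_open_ended(), range_midpoint() and to_logical(), and the sample strings,
# are hypothetical and are not part of the original cleaning script.

```r
# Rule: "n or more" / "n or older" values are truncated to n
truncate_open_ended <- function(x) {
  as.numeric(sub("^([0-9]+) or (more|older)$", "\\1", x))
}

# Rule: an age range such as "27-33" is replaced by the middle of the range
range_midpoint <- function(x) {
  bounds <- strsplit(x, "-", fixed = TRUE)
  sapply(bounds, function(b) mean(as.numeric(b)))
}

# Rule: lowercase true/false become R logicals (TRUE/FALSE)
to_logical <- function(x) toupper(x) == "TRUE"

truncate_open_ended(c("3", "10 or more", "48 or older"))  # 3 10 48
range_midpoint(c("27-33", "23-27"))                        # 30 25
to_logical(c("true", "false"))                             # TRUE FALSE
```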

# Reading the cleaned data
recidivism <- read.csv("https://raw.githubusercontent.com/WhileCrocodile/datasets/main/DIDA%20325/NIJ_s_Recidivism_Challenge_Full_Dataset_Clean.csv", na.strings="")

# Removing row IDs, geographical information and model training metadata
recidivism <- recidivism %>% select(-ID, -Residence_PUMA, -Training_Sample)

# Removing rows that contain NA values
recidivism <- na.omit(recidivism)

# Displaying the result through head(), column explanations in markdown below
head(recidivism)
##   Gender  Race Age_at_Release Gang_Affiliated Supervision_Risk_Score_First
## 1      M BLACK             45           FALSE                            3
## 2      M BLACK             35           FALSE                            6
## 3      M BLACK             48           FALSE                            7
## 4      M WHITE             40           FALSE                            7
## 5      M WHITE             35           FALSE                            4
## 6      M WHITE             40           FALSE                            5
##   Supervision_Level_First       Education_Level Dependents  Prison_Offense
## 1                Standard At least some college          3            Drug
## 2             Specialized  Less than HS diploma          1 Violent/Non-Sex
## 3                    High At least some college          3            Drug
## 4                    High  Less than HS diploma          1        Property
## 5             Specialized  Less than HS diploma          3 Violent/Non-Sex
## 6                Standard   High School Diploma          0        Property
##        Prison_Years Prior_Arrest_Episodes_Felony Prior_Arrest_Episodes_Misd
## 1 More than 3 years                            6                          6
## 2 More than 3 years                            7                          6
## 3         1-2 years                            6                          6
## 4         1-2 years                            8                          6
## 5         1-2 years                            4                          4
## 6 More than 3 years                            4                          0
##   Prior_Arrest_Episodes_Violent Prior_Arrest_Episodes_Property
## 1                             1                              3
## 2                             3                              0
## 3                             3                              2
## 4                             0                              3
## 5                             3                              2
## 6                             1                              3
##   Prior_Arrest_Episodes_Drug Prior_Arrest_Episodes_PPViolationCharges
## 1                          3                                        4
## 2                          3                                        5
## 3                          2                                        5
## 4                          3                                        3
## 5                          1                                        3
## 6                          0                                        0
##   Prior_Arrest_Episodes_DVCharges Prior_Arrest_Episodes_GunCharges
## 1                           FALSE                            FALSE
## 2                            TRUE                            FALSE
## 3                            TRUE                            FALSE
## 4                           FALSE                            FALSE
## 5                            TRUE                            FALSE
## 6                           FALSE                            FALSE
##   Prior_Conviction_Episodes_Felony Prior_Conviction_Episodes_Misd
## 1                                3                              3
## 2                                3                              4
## 3                                3                              2
## 4                                3                              4
## 5                                1                              0
## 6                                1                              0
##   Prior_Conviction_Episodes_Viol Prior_Conviction_Episodes_Prop
## 1                          FALSE                              2
## 2                           TRUE                              0
## 3                           TRUE                              1
## 4                          FALSE                              3
## 5                           TRUE                              0
## 6                          FALSE                              2
##   Prior_Conviction_Episodes_Drug Prior_Conviction_Episodes_PPViolationCharges
## 1                              2                                        FALSE
## 2                              2                                         TRUE
## 3                              2                                        FALSE
## 4                              2                                        FALSE
## 5                              1                                        FALSE
## 6                              0                                        FALSE
##   Prior_Conviction_Episodes_DomesticViolenceCharges
## 1                                             FALSE
## 2                                              TRUE
## 3                                              TRUE
## 4                                             FALSE
## 5                                             FALSE
## 6                                             FALSE
##   Prior_Conviction_Episodes_GunCharges Prior_Revocations_Parole
## 1                                FALSE                    FALSE
## 2                                 TRUE                    FALSE
## 3                                FALSE                    FALSE
## 4                                FALSE                    FALSE
## 5                                FALSE                    FALSE
## 6                                FALSE                    FALSE
##   Prior_Revocations_Probation Condition_MH_SA Condition_Cog_Ed Condition_Other
## 1                       FALSE            TRUE             TRUE           FALSE
## 2                       FALSE           FALSE            FALSE           FALSE
## 3                       FALSE            TRUE             TRUE           FALSE
## 4                        TRUE            TRUE             TRUE           FALSE
## 5                       FALSE            TRUE             TRUE            TRUE
## 6                       FALSE           FALSE            FALSE            TRUE
##   Violations_ElectronicMonitoring Violations_Instruction
## 1                           FALSE                  FALSE
## 2                           FALSE                   TRUE
## 3                           FALSE                   TRUE
## 4                           FALSE                  FALSE
## 5                           FALSE                  FALSE
## 6                           FALSE                  FALSE
##   Violations_FailToReport Violations_MoveWithoutPermission Delinquency_Reports
## 1                   FALSE                            FALSE                   0
## 2                   FALSE                            FALSE                   4
## 3                   FALSE                             TRUE                   4
## 4                   FALSE                            FALSE                   0
## 5                   FALSE                            FALSE                   0
## 6                   FALSE                             TRUE                   0
##   Program_Attendances Program_UnexcusedAbsences Residence_Changes
## 1                   6                         0                 2
## 2                   0                         0                 2
## 3                   6                         0                 0
## 4                   6                         0                 3
## 5                   7                         0                 0
## 6                   0                         0                 3
##   Avg_Days_per_DrugTest DrugTests_THC_Positive DrugTests_Cocaine_Positive
## 1             612.00000              0.0000000                          0
## 2              35.66667              0.0000000                          0
## 3              93.66667              0.3333333                          0
## 4              25.40000              0.0000000                          0
## 5              23.11765              0.0000000                          0
## 6             474.60000              0.0000000                          0
##   DrugTests_Meth_Positive DrugTests_Other_Positive Percent_Days_Employed
## 1              0.00000000                        0             0.4885621
## 2              0.00000000                        0             0.4252336
## 3              0.16666667                        0             0.0000000
## 4              0.00000000                        0             1.0000000
## 5              0.05882353                        0             0.2035623
## 6              0.00000000                        0             0.6742520
##   Jobs_Per_Year Employment_Exempt Recidivism_Within_3years
## 1     0.4476103             FALSE                    FALSE
## 2     2.0000000             FALSE                     TRUE
## 3     0.0000000             FALSE                     TRUE
## 4     0.7189961             FALSE                    FALSE
## 5     0.9293893             FALSE                     TRUE
## 6     0.3078382             FALSE                    FALSE
##   Recidivism_Arrest_Year1 Recidivism_Arrest_Year2 Recidivism_Arrest_Year3
## 1                   FALSE                   FALSE                   FALSE
## 2                   FALSE                   FALSE                    TRUE
## 3                   FALSE                    TRUE                   FALSE
## 4                   FALSE                   FALSE                   FALSE
## 5                    TRUE                   FALSE                   FALSE
## 6                   FALSE                   FALSE                   FALSE

“““## 2.1. Question 1

Which of the numeric variables have the strongest correlation with each other? For example, does the % of drug tests positive for THC have a strong correlation with the number of unexcused absences? We will determine this by data visualization through corrplot. ““”

# Selects only numeric columns using select_if() and a function() which checks if a column contains characters or booleans
# Renames some of the columns to be shorter for the sake of readability
recidivism_num <- recidivism %>% select_if(function(column) !is.character(column) & !is.logical(column)) %>%
  rename(Risk_Score = Supervision_Risk_Score_First, Felony_Arrests = Prior_Arrest_Episodes_Felony, Misdemeanor_Arrests = Prior_Arrest_Episodes_Misd, Violent_Arrests = Prior_Arrest_Episodes_Violent,
  Property_Arrests = Prior_Arrest_Episodes_Property, Drug_Arrests = Prior_Arrest_Episodes_Drug, Parole_Violations = Prior_Arrest_Episodes_PPViolationCharges,
  Felony_Convictions = Prior_Conviction_Episodes_Felony, Misdemeanor_Convictions = Prior_Conviction_Episodes_Misd, Property_Convictions = Prior_Conviction_Episodes_Prop,
  Drug_Convictions = Prior_Conviction_Episodes_Drug)

recidivism_corr <- cor(recidivism_num, method="spearman")
corrplot(recidivism_corr)

“““The square cluster in the middle is not particularly interesting, as it just tells us that prior arrests and convictions for one type of crime are often correlated with arrests and convictions for other types of crime.

While this mostly blank plot does not seem to tell us much, the lack of correlation between prior arrests/convictions and the other numeric variables suggests, perhaps surprisingly, that prior record is not a good predictor of (numerically measured) forms of delinquency. In fact, most of these variables seem to be largely independent of each other.

However, we do see that “Age_at_Release” is negatively correlated with “Risk_Score” (Supervision_Risk_Score_First), which shows that higher ages at release are associated with lower perceived risk of recidivism. Additionally, “Jobs_Per_Year” is positively correlated with “Percent_Days_Employed”. This suggests that someone who is consistently employed is likely to have held (and changed) more jobs than someone who is not.
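
Reading the strongest pairs off a corrplot by eye can be imprecise, so a ranked list may help. The following is a minimal sketch on made-up data: the `toy` data frame, its values, and the name `pair_table` are assumptions for illustration, and only the column names echo the dataset.

```r
# Toy data standing in for the numeric columns (all values are made up)
set.seed(42)
n <- 200
age <- sample(c(20, 25, 30, 35, 40, 45, 48), n, replace = TRUE)
toy <- data.frame(
  Age_at_Release = age,
  Risk_Score     = pmin(10, pmax(1, round(11 - age / 5 + rnorm(n)))),
  Noise          = rnorm(n)
)

corr <- cor(toy, method = "spearman")

# Keep each pair once (upper triangle), then sort by absolute correlation
idx <- which(upper.tri(corr), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  rho  = corr[idx]
)
pair_table <- pair_table[order(-abs(pair_table$rho)), ]
pair_table
```

The same steps should work on the matrix computed above by substituting `recidivism_corr` for `corr`.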

## 2.2. Question 2

Using the data in this set, can we accurately predict Recidivism_Within_3years? Using the model generated, if we split the testing data between white and black parolees, does the model overpredict or underpredict compared to the general accuracy? We will perform logistic regression on the data set, excluding the other recidivism-related columns. ““”

# Split the data into training and testing sets.
set.seed(1)
split <- 0.75
rows  <- nrow(recidivism)
train.entries <- sample(rows, floor(rows * split))  # floor() makes the training-set size an explicit whole number

train.data <- recidivism[train.entries, ]
test.data  <- recidivism[-train.entries,  ]

# Fit a logistic regression model to the training data, then let step() prune predictors by stepwise selection on AIC
model <- glm( Recidivism_Within_3years ~ Race + Age_at_Release + Gang_Affiliated + Supervision_Risk_Score_First + Supervision_Level_First + Education_Level + Dependents + Prison_Offense + Prison_Years + Prior_Arrest_Episodes_Felony + Prior_Arrest_Episodes_Misd +
  Prior_Arrest_Episodes_Violent + Prior_Arrest_Episodes_Property + Prior_Arrest_Episodes_Drug + Prior_Arrest_Episodes_PPViolationCharges + Prior_Arrest_Episodes_DVCharges + Prior_Arrest_Episodes_GunCharges + Prior_Conviction_Episodes_Felony + Prior_Conviction_Episodes_Misd +
    Prior_Conviction_Episodes_Viol + Prior_Conviction_Episodes_Prop + Prior_Conviction_Episodes_Drug + Prior_Conviction_Episodes_PPViolationCharges + Prior_Conviction_Episodes_DomesticViolenceCharges + Prior_Conviction_Episodes_GunCharges + Prior_Revocations_Parole + Prior_Revocations_Probation +
      Condition_MH_SA + Condition_Cog_Ed + Condition_Other + Violations_ElectronicMonitoring + Violations_Instruction + Violations_FailToReport + Violations_MoveWithoutPermission + Delinquency_Reports + Program_Attendances + Program_UnexcusedAbsences + Residence_Changes + Avg_Days_per_DrugTest +
        DrugTests_THC_Positive + DrugTests_Cocaine_Positive + DrugTests_Meth_Positive + DrugTests_Other_Positive + Percent_Days_Employed + Jobs_Per_Year + Employment_Exempt, data = train.data, family = binomial()) %>% step(trace=0)

# Predictions
train.pred <- train.data %>% mutate(phat = predict(model, type='response', newdata = train.data))
test.pred <- test.data %>% mutate(phat = predict(model, type='response', newdata = test.data))

# Model Evaluations
test.pred <- test.pred %>% mutate(prediction = phat > 0.5)    # Evaluate testing set
train.pred <- train.pred %>% mutate(prediction = phat > 0.5) # then evaluate on training set to see if model is overfit

# Compare with true values
table(test.pred$prediction, test.pred$Recidivism_Within_3years, dnn=c("prediction", "true value"))
##           true value
## prediction FALSE TRUE
##      FALSE   799  410
##      TRUE    608 1726
# Calculate accuracy
((799 + 1726)/(799 + 1726 + 608 + 410)) * 100
## [1] 71.26729
# Calculate the accuracy for the training set to check if the model is overfit
table(train.pred$prediction, train.pred$Recidivism_Within_3years, dnn=c("prediction", "true value"))
##           true value
## prediction FALSE TRUE
##      FALSE  2551 1166
##      TRUE   1789 5121
((2551 + 5121)/(2551 + 5121 + 1789 + 1166)) * 100
## [1] 72.19347
# Since the testing accuracy (71.3%) is close to the training accuracy (72.2%), the model shows little sign of overfitting
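
# The accuracy figures above are typed in by hand from the printed tables. As a hedged
# alternative, the sketch below computes accuracy directly from a confusion matrix.
# The helper name accuracy_pct() is hypothetical; the counts are copied from the
# test-set table printed above.

```r
# Accuracy = correct predictions (the diagonal) divided by all predictions
accuracy_pct <- function(confusion) {
  sum(diag(confusion)) / sum(confusion) * 100
}

# Test-set confusion matrix (filled column by column)
test.confusion <- matrix(
  c(799, 608, 410, 1726), nrow = 2,
  dimnames = list(prediction = c("FALSE", "TRUE"), "true value" = c("FALSE", "TRUE"))
)

accuracy_pct(test.confusion)  # about 71.27, matching the hand calculation
```

# With the objects defined earlier, the helper could also be applied to the table itself:
# accuracy_pct(table(test.pred$prediction, test.pred$Recidivism_Within_3years))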

# Null Accuracy
recidivism %>% group_by(Recidivism_Within_3years) %>% summarize(N=n(), null.predict = n()/nrow(recidivism)*100)
## # A tibble: 2 × 3
##   Recidivism_Within_3years     N null.predict
##   <lgl>                    <int>        <dbl>
## 1 FALSE                     5747         40.6
## 2 TRUE                      8423         59.4
# Always predicting TRUE, the most common outcome, gives a null accuracy of 59.4%, which is lower than our model's test accuracy of 71.3%

# Testing model accuracy based on race

black.test <- test.data %>% filter(Race == "BLACK")
black.test.pred <- black.test %>% mutate(phat = predict(model, type='response', newdata = black.test))
black.test.pred <- black.test.pred %>% mutate(prediction = phat > 0.5)
table(black.test.pred$prediction, black.test.pred$Recidivism_Within_3years, dnn=c("prediction", "true value"))
##           true value
## prediction FALSE TRUE
##      FALSE   457  231
##      TRUE    383 1035
((457 + 1035)/(457 + 231 + 1035 + 383)) * 100  # Accuracy black only data
## [1] 70.8452
white.test <- test.data %>% filter(Race == "WHITE")
white.test.pred <- white.test %>% mutate(phat = predict(model, type='response', newdata = white.test))
white.test.pred <- white.test.pred %>% mutate(prediction = phat > 0.5)
table(white.test.pred$prediction, white.test.pred$Recidivism_Within_3years, dnn=c("prediction", "true value"))
##           true value
## prediction FALSE TRUE
##      FALSE   342  179
##      TRUE    225  691
((342 + 691)/(342 + 691 + 179 + 225)) * 100  # Accuracy white only data
## [1] 71.88587
test.data %>% group_by(Race) %>% summarize(counts = n())
## # A tibble: 2 × 2
##   Race  counts
##   <chr>  <int>
## 1 BLACK   2106
## 2 WHITE   1437
# The model predicts Recidivism_Within_3years for black parolees with an accuracy of 70.8% and for white parolees with an accuracy of 71.9%.
# This similarity indicates that the model's accuracy is consistent across both groups, with no meaningful disparity in accuracy based on race.
# Both figures are also very close to the model's general accuracy of 71.3%: the model is slightly less accurate for black parolees and slightly more accurate for white parolees, but the differences are small (about 0.4 and 0.6 percentage points).
# Note that although the sample is large, the test set contains noticeably more black parolees (2,106) than white parolees (1,437), which could be affecting these results.
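
# Accuracy alone does not say whether the model over- or underpredicts recidivism for a group.
# One way to look at that, sketched here as an assumption rather than as part of the original
# analysis, is to compare false positive and false negative rates by race. The helper name
# error_rates() is hypothetical; the counts are copied from the two confusion matrices above.

```r
# Rows: prediction FALSE/TRUE; columns: true value FALSE/TRUE (filled column by column)
black <- matrix(c(457, 383, 231, 1035), nrow = 2)
white <- matrix(c(342, 225, 179, 691),  nrow = 2)

error_rates <- function(m) {
  c(
    false_positive_rate = m[2, 1] / sum(m[, 1]),  # predicted TRUE among true FALSE
    false_negative_rate = m[1, 2] / sum(m[, 2])   # predicted FALSE among true TRUE
  )
}

round(rbind(BLACK = error_rates(black), WHITE = error_rates(white)), 3)
```

# With these counts the false positive rate is about 0.456 for black parolees and 0.397 for
# white parolees, and the false negative rate is about 0.182 and 0.206 respectively.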

“““## 2.3. Question 3

Are there significant differences between the history and outcomes of different races? For example, are they more likely to have a higher Supervision Risk Assessment Score? Are they more likely to be sentenced to more years in prison or be older at the time of their release? We will perform descriptive statistics on the data set using dplyr, and create and interpret visualizations (such as box plots) using ggplot2. ““”

Race_Count <- recidivism %>% group_by(Race) %>% summarize(counts = n())
Race_Count
## # A tibble: 2 × 2
##   Race  counts
##   <chr>  <int>
## 1 BLACK   8400
## 2 WHITE   5770

“““The difference of roughly 2,600 people between black and white offenders in the data could reflect biases based on race. Because the two groups differ so much in size, visuals must compare percentages within each group rather than raw counts in order to represent each group accurately.”“”

supervision_counts <- recidivism %>% group_by(Supervision_Risk_Score_First, Race) %>% summarize(counts = n())
## `summarise()` has grouped output by 'Supervision_Risk_Score_First'. You can
## override using the `.groups` argument.
total_pop <- supervision_counts %>% group_by(Race) %>% summarize(total_race = sum(counts))
supervision_counts <- supervision_counts %>% left_join(total_pop, by = "Race") %>% mutate(percentage = counts / total_race * 100)

ggplot(supervision_counts, aes(x = Supervision_Risk_Score_First, y = percentage, fill = Race)) +
  geom_bar(stat = "identity", position = "dodge") +
  xlab("Supervision Risk Assessment Score") +
  ylab("Percentage of Offenders") +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

“““This graph shows the percentage of offenders in the data set at each Supervision Risk Assessment score, separated by race. For the lower scores (1-5), white offenders generally make up a larger percentage of their group's total than black offenders do. For the higher scores (6-10), black offenders make up a larger percentage of their group's total than white offenders do. This suggests that black individuals are more likely to be given higher risk assessment scores than white individuals.”“”

recidivism_counts <- recidivism %>% group_by(Recidivism_Within_3years) %>% summarize(counts = n())
recidivism_counts <- recidivism_counts %>% mutate(percentage = counts / sum(counts) * 100)

ggplot(recidivism_counts, aes(x = "", y = counts, fill = Recidivism_Within_3years)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "Recidivism Within 3 Years", fill = "Recidivism") +
  geom_text(aes(label = paste0(round(percentage), "%")), position = position_stack(vjust = 0.5))

ggplot(recidivism) +
  geom_boxplot(aes(x=Race, y=Age_at_Release)) +
  xlab("Race") + ylab("Age at Time of Release")

“““The first chart is a pie chart (built as a bar chart in polar coordinates) showing the percentage of all offenders who recidivate. Nearly 60% of the offenders in the data set recidivate within 3 years.

The boxplot shows the spread of ages at the time of an individual's release, separated by race. Black individuals were, on average, about 5 years younger at release than white individuals. This could suggest a bias towards convicting black individuals of crimes at younger ages. Such convictions tend to lead to a cycle of recidivism, given that an individual is more likely to recidivate within 3 years than not. ““”

recidivism_counts1 <- recidivism %>% group_by(Race, Prison_Years) %>% summarize(counts = n())
## `summarise()` has grouped output by 'Race'. You can override using the
## `.groups` argument.
total_counts_race <- recidivism_counts1 %>% group_by(Race) %>% summarize(total_race = sum(counts))

recidivism_counts1 <- recidivism_counts1 %>% left_join(total_counts_race, by = "Race") %>% mutate(percentage = counts / total_race * 100)

order <- c("Less than 1 year", "1-2 years", "Greater than 2 to 3 years", "More than 3 years")
recidivism_counts1$Prison_Years <- factor(recidivism_counts1$Prison_Years, levels = order)

ggplot(recidivism_counts1, aes(x = Prison_Years, y = percentage, fill = as.factor(Prison_Years))) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~Race) +
  xlab("Prison Sentence") +
  ylab("Percentage of Total People from Each Race") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none")

“““Lastly, this visual shows two bar charts, separated by race. The charts represent the length of an offender's prison sentence, split into 4 categories. The general trends of the charts are similar; however, the chart for black individuals clearly skews more heavily towards sentences of more than 2 years. This shows that the percentage of black individuals who receive harsher sentences, relative to their total count, is greater than the corresponding percentage of white individuals.
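
Whether a gap like this is larger than chance would produce can be checked with a chi-squared test of independence. The sketch below uses hypothetical counts that are NOT taken from the dataset; the object names `sentence_counts` and `result` are also assumptions for illustration.

```r
# Hypothetical counts: rows are races, columns are the four Prison_Years
# categories in sentence-length order (matrix is filled column by column)
sentence_counts <- matrix(
  c(300, 250,   # Less than 1 year
    350, 300,   # 1-2 years
    200, 150,   # Greater than 2 to 3 years
    350, 200),  # More than 3 years
  nrow = 2,
  dimnames = list(
    Race = c("BLACK", "WHITE"),
    Prison_Years = c("Less than 1 year", "1-2 years",
                     "Greater than 2 to 3 years", "More than 3 years")
  )
)

# Row-wise percentages, the same quantity the faceted bar chart displays
round(prop.table(sentence_counts, margin = 1) * 100, 1)

# Chi-squared test of independence between race and sentence length
result <- chisq.test(sentence_counts)
result$p.value
```

On the real data, the equivalent call would likely be `chisq.test(table(recidivism$Race, recidivism$Prison_Years))`.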

2.4. Question 4

Are individuals convicted of more serious crimes more likely to have worse outcomes? For example, are they at higher risk of violating parole, or testing positive for illicit substances? We will perform descriptive statistics on the data set using dplyr, and create and interpret visualizations (such as box plots) using ggplot2. ““”

recidivism %>% group_by(Prison_Offense) %>% summarise(Count = n())
## # A tibble: 5 × 2
##   Prison_Offense  Count
##   <chr>           <int>
## 1 Drug             2978
## 2 Other            1845
## 3 Property         4888
## 4 Violent/Non-Sex  3773
## 5 Violent/Sex       686

“““As we can see, the number of parolees for each crime category varies greatly. In particular, we do not have nearly as much data on violent sex offenders, which may be a result of both the comparatively small violent sex offender population in prison, and the number of those that are actually released on parole. We will have to keep this in mind when making any conclusions from the data.”“”

ggplot(recidivism) +
  geom_bar(aes(x=Prison_Offense, fill=Recidivism_Within_3years), position="fill") +
  xlab("Prison Offense") + ylab("Proportion of Recidivism") +
  scale_fill_brewer(palette="Paired")

“““From this graph, we can see that while most other types of crimes are about equal in terms of recidivism, those who committed crimes of a violent and sexual nature are less likely to fall into recidivism. In the state of Georgia, as in many states, there are heavy restrictions and reporting requirements placed on what sex offenders can and cannot do, which may contribute to a lower rate of recidivism.
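
To read exact values rather than bar heights, the same proportions can be tabulated. This is a minimal sketch on made-up data (`toy` is hypothetical), and it assumes `Recidivism_Within_3years` is stored as a logical column; if it is stored as text, it would first need to be converted:

```r
library(dplyr)

# Hypothetical toy data; the real column types may differ
toy <- data.frame(
  Prison_Offense = c("Drug", "Drug", "Property", "Property",
                     "Property", "Violent/Sex", "Violent/Sex"),
  Recidivism_Within_3years = c(TRUE, FALSE, TRUE, TRUE,
                               FALSE, FALSE, FALSE)
)

recidivism_rate <- toy %>%
  group_by(Prison_Offense) %>%
  summarise(n = n(),
            rate = mean(Recidivism_Within_3years)) %>%  # mean of a logical = proportion TRUE
  arrange(desc(rate))

recidivism_rate
```

Keeping the group size `n` next to each rate makes it easier to remember that the smaller categories carry more uncertainty.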

Furthermore, “property” crimes are generally related to theft, and can indicate financial instability. This is something a prison sentence and a criminal record may worsen, and this higher rate of recidivism may result from these conditions. ““”

recidivism_avgreschanges <- recidivism %>% group_by(Prison_Offense) %>% summarise(Average_Res_Changes = mean(Residence_Changes))
recidivism_avgreschanges
## # A tibble: 5 × 2
##   Prison_Offense  Average_Res_Changes
##   <chr>                         <dbl>
## 1 Drug                          0.796
## 2 Other                         0.811
## 3 Property                      0.975
## 4 Violent/Non-Sex               0.916
## 5 Violent/Sex                   0.569

“““As we can see, violent sex offenders on average go through fewer residence changes than those who are on parole for other crimes. Some of the restrictions listed on the page above also relate to residence requirements:

> The individual under supervision shall live only in a residence approved by his or her community supervision officer and agree not to live or share a residence with any other person with a history of sexual offense conviction(s). The location of the offender’s residence shall comply with Ga Law and/or any other condition imposed by the Board. The offender may not be employed at a location within 1000 ft of, or be employed by a child day care facility, school or church.

Given these requirements, it makes sense that sex offenders would get fewer opportunities for residence changes compared to other offenders. ““”

# Create table of the average days between drug tests
recidivism_testfrequency <- recidivism %>% group_by(Prison_Offense) %>% summarise(Mean_Avg_Days_per_DrugTest = mean(Avg_Days_per_DrugTest))
recidivism_testfrequency
## # A tibble: 5 × 2
##   Prison_Offense  Mean_Avg_Days_per_DrugTest
##   <chr>                                <dbl>
## 1 Drug                                 104. 
## 2 Other                                 92.7
## 3 Property                              98.9
## 4 Violent/Non-Sex                       92.6
## 5 Violent/Sex                           61.6
# Create a table of the average proportion of positive test results, then use gather() to convert it to a table useful for heat maps
recidivism_testresults <- recidivism %>%
  group_by(Prison_Offense) %>%
  summarise(THC = mean(DrugTests_THC_Positive),
            Cocaine = mean(DrugTests_Cocaine_Positive),
            Meth = mean(DrugTests_Meth_Positive),
            Other = mean(DrugTests_Other_Positive)) %>%
  gather(key="Test_Type", value="Proportion_Positive_Tests", 2:5)
ggplot(recidivism_testresults) +
  geom_tile(aes(x=Prison_Offense, y=Test_Type, fill=Proportion_Positive_Tests)) +
  xlab("Prison Offense") + ylab("Drug Test Type") +
  scale_fill_distiller(palette = "GnBu")

“““From the table above, violent sex offenders were subjected to drug tests the most often (about every 62 days on average), while those on parole for drug offenses were actually tested the least (about every 104 days).

The heat map shows us that violent sex offenders seem to be the least likely to test positive for any kind of drug. THC seems to have the highest incidence of positive tests; it is most common for parolees in the “Other” and “Property” categories, second-least common for drug crime parolees, and by far the least common for violent sex offenders. Moreover, those in prison for drug-related crimes do not seem to have a notably higher proportion of positive tests than those in other categories, and in many cases their proportion is slightly lower.
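
The long-format table behind the heat map was built with `gather()`, which tidyr now describes as superseded by `pivot_longer()`. A minimal sketch of the equivalent reshaping, using a made-up two-row table (`toy_wide` and its values are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per offense, one column per test type
toy_wide <- data.frame(
  Prison_Offense = c("Drug", "Property"),
  THC = c(0.05, 0.08),
  Cocaine = c(0.01, 0.02)
)

toy_long <- toy_wide %>%
  pivot_longer(cols = -Prison_Offense,                 # every column except the offense
               names_to = "Test_Type",
               values_to = "Proportion_Positive_Tests")

toy_long
```

Selecting columns with `-Prison_Offense` instead of positions such as `2:5` keeps the reshaping correct if further test types are added to the summary later.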

In addition to these results being further indicative of the harsh restrictions and standards placed upon violent sex offenders, the fact that drug crime parolees are both tested the least and have comparable/fewer incidences of positive tests could be indicative of the effectiveness of rehabilitation, increased mindfulness of drug-related violations, or both. ““”

# Creates a dataframe of the proportion of delinquency reports for each category (e.g. of those in prison for drug crimes, 5% of them have 3 reports)
recidivism_offense_count <- recidivism %>% group_by(Prison_Offense) %>% summarise(total = n())
recidivism_delinquency_count <- recidivism %>% group_by(Prison_Offense, Delinquency_Reports) %>% summarize(count=n()) %>% left_join(recidivism_offense_count) %>% mutate(proportion = count/total)
## `summarise()` has grouped output by 'Prison_Offense'. You can override using
## the `.groups` argument.
## Joining with `by = join_by(Prison_Offense)`
ggplot(recidivism_delinquency_count) +
  geom_tile(aes(x=Prison_Offense, y=Delinquency_Reports, fill=proportion)) +
  xlab("Prison Offense") + ylab("Number of Delinquency Reports") +
  scale_fill_distiller(palette = "Blues")

“““Note that in the heat map above, near the bottom of the graph, lighter colors are better (indicating a greater proportion of individuals with a low number of reports). Near the top of the graph, dark colors are better (indicating a smaller proportion of individuals with a high number of reports).

Based on our heat map, we see that the distribution of delinquency reports across the different prison offense categories is relatively similar. However, we can see that the Violent/Sex category tends to be slightly darker near the bottom of the graph, and slightly lighter near the top of the graph as compared to the other categories. This indicates a greater proportion of individuals who have received many delinquency reports.
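
The reading guide above reflects what we understand to be the default of `scale_fill_distiller()`, `direction = -1`, under which larger values are drawn in lighter colors. If a more conventional "darker means more" reading is preferred, the direction can be flipped. A minimal sketch on made-up proportions (`toy` is hypothetical):

```r
library(ggplot2)

# Hypothetical proportions of parolees per offense and report count
toy <- data.frame(
  Prison_Offense = c("Drug", "Drug", "Violent/Sex", "Violent/Sex"),
  Delinquency_Reports = c("0", "1", "0", "1"),
  proportion = c(0.7, 0.3, 0.6, 0.4)
)

p <- ggplot(toy) +
  geom_tile(aes(x = Prison_Offense, y = Delinquency_Reports, fill = proportion)) +
  xlab("Prison Offense") + ylab("Number of Delinquency Reports") +
  # direction = 1 keeps the palette order, so darker tiles mean larger proportions
  scale_fill_distiller(palette = "Blues", direction = 1)

p
```

With this setting, the interpretation reverses relative to the heat map above: darker tiles near the top would indicate a larger share of individuals with many reports.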

The parole requirements for sex offenders in Georgia include:

  • Not approaching anybody under 18 years of age unless chaperoned
  • Regularly registering with a sex offender database
  • Having to submit a log of weekly activities
  • Being placed under (possibly permanent) electronic supervision

Given these requirements, it is possible that it is both easier for sex offenders to inadvertently violate the terms of their parole, and for these violations to be detected. Since their parole is also a matter of public record, outside reports of their activities may be easier to receive as well. ““”

recidivism_avgcrimes <- recidivism %>% group_by(Prison_Offense) %>% summarise(Average_Number_Felony = mean(Prior_Arrest_Episodes_Felony),
                                                                              Average_Number_Misdemeanor = mean(Prior_Arrest_Episodes_Misd),
                                                                              Average_Prior_Parole_Violations = mean(Prior_Arrest_Episodes_PPViolationCharges))
recidivism_avgcrimes
## # A tibble: 5 × 4
##   Prison_Offense  Average_Number_Felony Average_Number_Misdemeanor
##   <chr>                           <dbl>                      <dbl>
## 1 Drug                             5.98                       3.82
## 2 Other                            6.15                       3.71
## 3 Property                         6.40                       3.45
## 4 Violent/Non-Sex                  4.81                       2.77
## 5 Violent/Sex                      3.38                       1.93
## # ℹ 1 more variable: Average_Prior_Parole_Violations <dbl>

“““We observe another trend here: those who have committed crimes of a violent and sexual nature have the lowest average number of prior felony and misdemeanor arrest episodes, and those who have committed crimes of a violent and non-sexual nature are similarly lower.

Meanwhile, those who have committed crimes relating to drugs, property, or other crimes are fairly similar, with between 5.98 and 6.40 felonies on average, and between 3.45 and 3.82 misdemeanors on average.

Looking up parole eligibility requirements as determined by the Georgia State Board of Pardons and Paroles, we see that violent offenses and sexual offenses are seen as especially severe “Level VII” and above crimes, for which some are ineligible for parole unless on life sentence, and are weighed more heavily when assessing parole risk. Based on these requirements, it seems extremely unlikely for people who have committed these kinds of crimes to receive parole.

Therefore, it makes sense that those who have received parole (the individuals in our data set) would have fallen under much heavier scrutiny, especially when it comes to past crimes.

3. Conclusion

Our correlation visualization and analysis found that, surprisingly, there seem to be few correlations between our variables outside of those between arrests and convictions. One exception is the negative correlation between supervision risk level and age, which indicates that older parolees are judged to be less problematic than younger parolees. Another is the positive correlation between the number of jobs held per year and the percentage of days employed, which suggests that parolees who hold more jobs per year also tend to be employed for more of the year. While our correlation analysis does not tie together many variables, the lack of correlations does effectively convey that we should be careful about assuming something about a parolee based on their other characteristics.

Using logistic regression to predict Recidivism_Within_3years, we conclude that the model demonstrates a reasonable level of accuracy both in the overall dataset and when considering racial subgroups. The model’s testing accuracy closely aligns with the training accuracy, suggesting that overfitting is not a big concern. Also, its accuracy in predicting recidivism for black and white parolees is similar, which shows consistency across racial groups. Although there is a slight difference in accuracy for black and white data, it is relatively small (0.4% and 0.06%, respectively). However, only two races, Black and White, are recorded, which raises the question of which category other potential races are placed in and why. A more detailed understanding or additional factors would result in a more thorough analysis. Also, while the logistic model performed well, a random forest model could potentially be another approach to predicting recidivism because it can capture complex relationships in the dataset and handle imbalanced data. Overall, the model’s performance appears to be effective, given the context of the analysis.

Upon analyzing the differences between the history and outcomes of different races, we found clear biases in the data. For the Supervision Risk Assessment Scores, a higher percentage of white offenders than black offenders, each relative to their own total, received lower scores (1-5). For the higher scores (6-10), black offenders made up a larger percentage of their respective total than white offenders did. This suggests that black individuals are more likely to be given higher risk assessment scores than white individuals. Also, black individuals were, on average, released at an age 5 years younger than white individuals. Although on the surface that would be considered a positive for black individuals, it points to the fact that black individuals are being incarcerated at a younger age. These early convictions tend to lead to a cycle of recidivism, given that an individual is more likely to recidivate within 3 years than not. The data supports this claim, as 60% of the offenders recidivated within 3 years. Lastly, the percentage of black individuals who received harsher sentences, relative to their total count, is greater than the corresponding percentage of white individuals. This reflects a disproportionate response to crime for black and white individuals.

On comparing the outcomes between parolees who have committed different types of crimes, we found violent sexual offenders to have the lowest parole population, but also to have had generally the most positive results when it comes to recidivism, drug use, and past criminal history. Based on Georgia State legal standards on the evaluation and supervision of sex offender parolees, as well as the small number of such parolees actually in our data set, we conjecture that this is a result of both restrictive screening of such parolees, and close surveillance after they are released on parole. Interestingly, when it comes to drug tests, drug crime parolees were not more likely to have drug test violations than other parolees. Overall, those convicted of property crimes (theft-related crimes) seem to have the worst outcomes on most metrics. One possible explanation is that these types of crimes are associated with financial instability, which a prison sentence and criminal record can worsen. While our analysis of the data seems to generally be effective, a more informed interpretation of the results would require more context and a deeper understanding of the Georgia parole system. ““”