This section loads the data and creates a working data set that is cleaned and transformed with additional derived factors. Age is considered because, at first glance in the plots, younger people appear more likely to be rearrested. Gender is considered because the data cover far more men than women, so being male could be a significant factor. Race is considered because the goal of this analysis is to try to reduce statistical bias. Past offenses are taken into account through juvenile misdemeanor counts and juvenile felony counts. Repeat violent offenses and the type of charge are also considered.
This plot shows the age distribution within the dataset.
This plot shows the age distribution split into two categories, 18-34 and 35+, and how these two groups relate to rearrest within two years. A clear trend shows that most repeat offenders are in the younger age bracket. A derived column called ageBucket records which of the two brackets each offender's age falls into.
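A minimal sketch of how a column like ageBucket could be derived is shown here. The original derivation is not shown, so the 0/1 encoding (1 for 35 and over) is an assumption, and the toy ages are made up for illustration.

```r
# Toy ages for illustration only; these are not records from the dataset.
toy <- data.frame(age = c(19, 27, 34, 35, 52))

# Assumed encoding: 0 = ages 18-34, 1 = ages 35 and over.
toy$ageBucket <- ifelse(toy$age >= 35, 1L, 0L)

# Cross-check with a labelled factor built from the same cut point.
# cut() uses right-closed intervals by default: (17, 34] and (34, Inf].
toy$ageLabel <- cut(toy$age, breaks = c(17, 34, Inf), labels = c("18-34", "35+"))

print(toy)
```

Keeping both the numeric flag and the labelled factor makes it easy to confirm that the flag lines up with the brackets used in the plot.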
This plot shows the distribution of gender within our dataset. This analysis treats gender as binary. The male population makes up about 80% of the dataset. A derived binary column for this factor, sex_bit, is used in the linear and logistic models.
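The binary gender column could be built along these lines. The labels "Male" and "Female" and the choice of 1 for male are assumptions, and the toy vector is made up so that it mirrors the roughly 80% male share described above.

```r
# Toy values for illustration only; these are not records from the dataset.
sex <- c("Male", "Male", "Female", "Male", "Male")

# Assumed encoding: 1 = Male, 0 = Female.
sex_bit <- as.integer(sex == "Male")

# Share of each gender, the quantity the plot summarises.
shares <- prop.table(table(sex))
print(shares)
```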
The following two plots show the breakdown of juvenile criminal records by misdemeanors and felonies. Surprisingly, the majority of people in our dataset had no juvenile misdemeanors or felonies. This could reflect gaps in the data shared from previous arrests or limited access to historical records for this dataset.
This plot shows that men were rearrested within two years more often than women.
This plot shows how disproportionately African-Americans are represented when compared to Caucasians. This will be a consideration for downsampling later in the analysis.
This plot shows the frequency of rearrest within two years.
The following two plots show the number of felonies and violent recidivism by race. These plots highlight the bias within the dataset even further.
These plots show that most people in our data set are male, aged 18-34, and primarily African-American. They also show that the variable we are predicting, rearrest within two years, is skewed toward these groups.
The next step builds the training and testing data sets used in the rest of the analysis, with a 70/30 split. Here is the code snippet used:
# Keep only the outcome and the candidate predictors
dfcrimeDataTraining <- dfcrimeData[,
  c('two_year_recid', 'sex', 'sex_bit', 'age', 'ageBucket', 'race', 'race_bit',
    'juv_fel_count', 'juv_misd_count', 'c_charge_degree', 'is_violent_recid')]
# Fix the seed so the 70/30 split is reproducible
set.seed(42)
samp <- sample(nrow(dfcrimeDataTraining), nrow(dfcrimeDataTraining) * 0.7)
training <- dfcrimeDataTraining[samp, ]
testing <- dfcrimeDataTraining[-samp, ]
The first model is a linear regression. A linear regression can give a formulaic answer to the question of whether age, race, and gender have a linear relationship to rearrest within two years. Regression models are easy to interpret, and their coefficients help explain trends. For example, the age coefficient indicates how the likelihood of rearrest within two years changes with age when the other factors are held constant. Based on the summary below, every factor is statistically significant at the 0.05 level. However, the R-squared value of about 0.18 is so low that the model does not adequately explain the total variability. Age, race, and juvenile felonies showed weaker significance than most of the other factors.
##
## Call:
## lm(formula = two_year_recid ~ age + ageBucket + race_bit + sex_bit +
## juv_fel_count + juv_misd_count + c_charge_degree + is_violent_recid,
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2188 -0.3903 -0.2012 0.4848 0.8885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.383865 0.039277 9.773 < 2e-16 ***
## age -0.002957 0.001175 -2.516 0.011909 *
## ageBucket -0.080177 0.027948 -2.869 0.004145 **
## race_bit 0.047780 0.015725 3.038 0.002395 **
## sex_bit 0.066053 0.018720 3.528 0.000423 ***
## juv_fel_count 0.055762 0.017974 3.102 0.001934 **
## juv_misd_count 0.059639 0.014578 4.091 4.39e-05 ***
## c_charge_degree 0.103225 0.015679 6.584 5.23e-11 ***
## is_violent_recid 0.510869 0.023137 22.080 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4523 on 3685 degrees of freedom
## Multiple R-squared: 0.1802, Adjusted R-squared: 0.1784
## F-statistic: 101.3 on 8 and 3685 DF, p-value: < 2.2e-16
| Predicted (rows) / Actual (columns) | 0 | 1 |
|---|---|---|
| 0 | 676 | 387 |
| 1 | 155 | 366 |

| Accuracy | Sensitivity | Specificity | Misclassification Error |
|---|---|---|---|
| 0.6578283 | 0.4860558 | 0.8134777 | 0.3422 |
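The reported metrics can be recomputed from the four counts in the confusion matrix. The sketch below reads the rows as predicted classes and the columns as actual classes, which is the reading that reproduces the reported figures. The variable names are illustrative.

```r
# Counts from the linear regression confusion matrix
# (rows: predicted class, columns: actual class). matrix() fills by column.
cm <- matrix(c(676, 155, 387, 366), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

tn <- cm["0", "0"]  # predicted 0, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1
fp <- cm["1", "0"]  # predicted 1, actually 0
tp <- cm["1", "1"]  # predicted 1, actually 1

accuracy    <- (tp + tn) / sum(cm)  # share of all cases classified correctly
sensitivity <- tp / (tp + fn)       # share of actual rearrests identified
specificity <- tn / (tn + fp)       # share of actual non-rearrests identified
misclass    <- 1 - accuracy         # misclassification error

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, misclassification = misclass), 4))
```

The same four formulas apply to the confusion matrices of the other models.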
The next model is a logistic regression. It uses the same predictors as the previous regression attempt. Since the predicted variable is binary (Yes or No), a logistic model is more appropriate and meaningful than a linear regression model, because its fitted values are probabilities between 0 and 1.
##
## Call:
## glm(formula = two_year_recid ~ age + ageBucket + race_bit + sex_bit +
## juv_fel_count + juv_misd_count + c_charge_degree + is_violent_recid,
## family = binomial, data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0644 -0.9764 -0.6547 1.1330 1.9897
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.523691 0.197486 -2.652 0.00801 **
## age -0.014171 0.005996 -2.363 0.01811 *
## ageBucket -0.386025 0.139876 -2.760 0.00578 **
## race_bit 0.220467 0.076810 2.870 0.00410 **
## sex_bit 0.300288 0.091940 3.266 0.00109 **
## juv_fel_count 0.345781 0.115734 2.988 0.00281 **
## juv_misd_count 0.420999 0.103637 4.062 4.86e-05 ***
## c_charge_degree 0.519371 0.078198 6.642 3.10e-11 ***
## is_violent_recid 3.307795 0.220093 15.029 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5106.1 on 3693 degrees of freedom
## Residual deviance: 4321.2 on 3685 degrees of freedom
## AIC: 4339.2
##
## Number of Fisher Scoring iterations: 5
| Predicted (rows) / Actual (columns) | 0 | 1 |
|---|---|---|
| 0 | 669 | 376 |
| 1 | 162 | 377 |

| Accuracy | Sensitivity | Specificity | Misclassification Error |
|---|---|---|---|
| 0.6603535 | 0.500664 | 0.8050542 | 0.3396 |
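The logistic regression coefficients are on the log-odds scale, so they are easier to read once exponentiated into odds ratios. The sketch below does this for a few estimates copied from the summary above. The variable names are illustrative, and the baseline probability is only an arithmetic illustration, since setting every predictor to 0 includes an age of 0.

```r
# Estimates copied from the logistic regression summary (log-odds scale).
log_odds <- c(sex_bit = 0.300288, juv_misd_count = 0.420999,
              c_charge_degree = 0.519371, is_violent_recid = 3.307795)

# exp() converts each coefficient to an odds ratio: the multiplicative change
# in the odds of rearrest for a one-unit increase, other predictors held fixed.
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 2))

# The inverse logit turns a linear predictor into a probability.
# Only the intercept is used here, i.e. every predictor is set to 0.
p_baseline <- plogis(-0.523691)
print(round(p_baseline, 3))
```

An odds ratio above 1 means the factor is associated with higher odds of rearrest; a value below 1 means lower odds.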
The next approach uses a random forest. The idea behind this is that there might be nonlinear relationships or interactions in the data that the regression models could not capture. This model is also better suited to classification than the regression models.
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 3694 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 3694 -none- numeric
## importance 8 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 13645636 -none- numeric
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 3694 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
| Predicted (rows) / Actual (columns) | 0 | 1 |
|---|---|---|
| 0 | 667 | 378 |
| 1 | 164 | 375 |

| Accuracy | Sensitivity | Specificity | Misclassification Error |
|---|---|---|---|
| 0.6578283 | 0.498008 | 0.8026474 | 0.3422 |
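The model summary above lists mse and rsq components, which the randomForest package stores when the response is numeric, that is, when the forest is grown as a regression. A sketch of growing the forest in classification mode instead, by converting the response to a factor, is shown below. The toy data frame is made up for illustration and is not the study data, the two-predictor formula is a simplification, and the model is only fitted if the randomForest package is installed.

```r
set.seed(42)
# Made-up toy data, not the study data.
n <- 200
toy <- data.frame(age = sample(18:70, n, replace = TRUE),
                  sex_bit = rbinom(n, 1, 0.8))
toy$two_year_recid <- rbinom(n, 1, plogis(1 - 0.04 * toy$age))

# A factor response makes randomForest() grow classification trees;
# a numeric 0/1 response would be treated as a regression target.
toy$two_year_recid <- factor(toy$two_year_recid, levels = c(0, 1))

rf_type <- NA_character_
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(two_year_recid ~ age + sex_bit,
                                    data = toy, ntree = 500)
  rf_type <- fit$type    # type of forest that was grown
  print(fit$confusion)   # out-of-bag confusion matrix
}
```

In classification mode the predictions are class labels, so no probability cutoff is needed to build the confusion matrix.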
This approach uses downsampling to get a more even grouping of the race variable in the training data. The idea here is to reduce statistical bias and hopefully create a better training dataset. Downsampling randomly drops records from the larger groups until every group is the same size as the smallest one, which makes race evenly represented and offsets some of the bias shown in the charts above.
Here is the code snippet used to downsample:
library(caret)  # provides downSample()
library(dplyr)  # provides select()
training_ds <- downSample(x = select(training, -race), y = training$race, yname = "race")
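The snippet above relies on downSample() from caret. The base R sketch below shows the same idea on a made-up data frame, so the mechanics are visible. The group labels "A" and "B" and all variable names are hypothetical.

```r
set.seed(42)
# Made-up toy data with an imbalanced two-level group, not the study data.
toy <- data.frame(id = 1:10,
                  race = c(rep("A", 7), rep("B", 3)))

# Size of the smallest group; every group is cut down to this many rows.
n_min <- min(table(toy$race))

# For each group, keep a random subset of n_min row indices.
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$race),
                      function(idx) idx[sample.int(length(idx), n_min)]))

toy_ds <- toy[keep, ]
print(table(toy_ds$race))
```

Rows are only removed, never duplicated, so the downsampled training set is smaller than the original one.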
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 2982 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 2982 -none- numeric
## importance 8 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 8892324 -none- numeric
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 2982 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
| Predicted (rows) / Actual (columns) | 0 | 1 |
|---|---|---|
| 0 | 668 | 378 |
| 1 | 163 | 375 |

| Accuracy | Sensitivity | Specificity | Misclassification Error |
|---|---|---|---|
| 0.6584596 | 0.498008 | 0.8038508 | 0.3415 |
This project was a learning experience about statistical bias and the implications that it can have. This data set reflects crime data in which African-Americans are overrepresented and have more rearrests within two years than their Caucasian counterparts. Downsampling put the race groups on an equal footing in the training data, but the models still rely on the same variables (age, sex, race, juvenile misdemeanors, juvenile felonies, charge type, and violent offender) and the derived columns created from them. No model was especially accurate at predicting rearrest within two years. The implications of using prediction methods to make parole and early release decisions are profound. The final model results are assessed below, but the main takeaway is that even with the available models, Type I and Type II errors will occur.
| Model | Accuracy | Sensitivity | Specificity | Misclassification Error |
|---|---|---|---|---|
| Linear Regression | 0.6578283 | 0.4860558 | 0.8134777 | 0.3422 |
| Logistic Regression | 0.6603535 | 0.5006640 | 0.8050542 | 0.3396 |
| Random Forest | 0.6578283 | 0.4980080 | 0.8026474 | 0.3422 |
| Random Forest Down Sample | 0.6584596 | 0.4980080 | 0.8038508 | 0.3415 |
Based on the final results table, the logistic regression model had the highest overall accuracy, at about 66%, and a specificity of just over 80%, which means it correctly identified over 80% of the true negatives. Its sensitivity was also the highest at just over 50%, which means it correctly identified just over 50% of the true positives. Every other model identified fewer than 50% of the true positives, meaning they all produce more Type II errors (false negatives). Further analysis could apply downsampling to all models, but the models were reasonably close in accuracy, sensitivity, and specificity.
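The comparison can be checked directly from the final results table. The sketch below copies the reported figures into a data frame, derives the two error rates from sensitivity and specificity, and looks up the best model on each metric. The column and variable names are illustrative.

```r
# Figures copied from the final results table.
results <- data.frame(
  model = c("Linear Regression", "Logistic Regression",
            "Random Forest", "Random Forest Down Sample"),
  accuracy    = c(0.6578283, 0.6603535, 0.6578283, 0.6584596),
  sensitivity = c(0.4860558, 0.5006640, 0.4980080, 0.4980080),
  specificity = c(0.8134777, 0.8050542, 0.8026474, 0.8038508)
)

# Type II error rate (false negative rate) is the complement of sensitivity.
results$false_negative_rate <- 1 - results$sensitivity
# Type I error rate (false positive rate) is the complement of specificity.
results$false_positive_rate <- 1 - results$specificity

# which.max() returns the first row holding the maximum of each metric.
best_accuracy    <- results$model[which.max(results$accuracy)]
best_sensitivity <- results$model[which.max(results$sensitivity)]
best_specificity <- results$model[which.max(results$specificity)]

print(results)
```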