R Markdown

This section loads the data and builds a working data set that is cleaned and transformed with additional derived factors. Age is considered because, at first glance, the plots suggest that younger people are rearrested more often. Gender is considered because the data cover far more men than women, so being male could be a significant factor. Race is also considered because the goal of this analysis is to try to reduce statistical bias. Past offenses are taken into account through juvenile misdemeanor and juvenile felony counts. Repeat violent offenses and the type of charge are also considered.

Including Plots

This plot shows the age distribution within the dataset.

This plot shows the age distribution split into two categories, 18-34 and 35+, and how the two groups relate to rearrest within two years. A clear trend shows that most repeat offenders are in the younger age bracket. A derived column called ageBucket records whether the offender falls in the 18-34 group or the 35+ group.
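As a minimal sketch of how a column like ageBucket could be derived: the 0/1 encoding below (0 for ages 18-34, 1 for 35 and over) is an assumption, since the encoding used in this analysis is not shown, and the ages are made up for illustration.

```r
# Toy ages for illustration only (not taken from the dataset).
ages <- c(19, 25, 34, 35, 42, 61)

# Assumed encoding: 0 = ages 18-34, 1 = ages 35 and over.
ageBucket <- ifelse(ages >= 35, 1L, 0L)

# Show each age next to its bucket to check the boundary at 34/35.
print(data.frame(age = ages, ageBucket = ageBucket))
```

Keeping the raw age alongside the bucket, as the models later in this report do, lets a model use both the continuous trend and the step between the two groups.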

This plot shows the distribution of gender within the dataset. This analysis treats gender as binary. Men make up about 80% of the dataset. A derived column that encodes this factor as a binary value is used for the linear and logistic models.
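A derived binary gender column might look like the sketch below. The labels "Male" and "Female" and the choice of 1 for male are assumptions made for illustration, and the values are synthetic.

```r
# Toy values for illustration only; the labels "Male"/"Female" are assumed.
sex <- c("Male", "Female", "Male", "Male", "Female")

# Assumed encoding: 1 = Male, 0 = Female.
sex_bit <- as.integer(sex == "Male")

# Share of each gender in the toy vector.
print(prop.table(table(sex)))
```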

The following two plots break out juvenile criminal records by misdemeanors and felonies. Surprisingly, the majority of people in the dataset have no juvenile misdemeanors or felonies. This could reflect gaps in how data from previous arrests were shared, or limited access to historical records for this dataset.



This plot shows that men were rearrested within two years more often than women.

This plot shows how disproportionately African-Americans are represented when compared to Caucasians. This will be a consideration for downsampling later in the analysis.

This plot shows the frequency of rearrest within two years.

The following two plots show the number of felonies and violent recidivism by race. These plots highlight the bias within the dataset even further.



These plots show that most people in the data set are male, aged 18-34, and African-American. They also show that the outcome variable is skewed toward these same groups.


The next step builds the training and testing data sets used in the rest of the analysis.

Here is the code snippet used:

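# Keep only the outcome and the candidate predictor columns
dfcrimeDataTraining <- dfcrimeData[, c('two_year_recid','sex','sex_bit','age','ageBucket','race','race_bit','juv_fel_count','juv_misd_count','c_charge_degree','is_violent_recid')]

# Fix the seed so the split is reproducible
set.seed(42)

# Randomly assign 70% of the rows to training and the remaining 30% to testing
samp <- sample(nrow(dfcrimeDataTraining), floor(nrow(dfcrimeDataTraining) * 0.7))
training <- dfcrimeDataTraining[samp, ]
testing <- dfcrimeDataTraining[-samp, ]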
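The split logic can be checked on a small stand-in. This is a sketch on a hypothetical data frame (`toy`, with synthetic values), not the report's data; it checks that the two partitions are disjoint and compares their outcome rates.

```r
# Hypothetical stand-in for dfcrimeDataTraining (values are synthetic).
set.seed(42)
toy <- data.frame(id = 1:100, two_year_recid = rbinom(100, 1, 0.45))

# Same idea as the split above: 70% of row indices go to training.
samp <- sample(nrow(toy), floor(nrow(toy) * 0.7))
training_toy <- toy[samp, ]
testing_toy  <- toy[-samp, ]

# The partitions should be disjoint and together cover every row.
stopifnot(length(intersect(training_toy$id, testing_toy$id)) == 0)
stopifnot(nrow(training_toy) + nrow(testing_toy) == nrow(toy))

# Compare the outcome rate in each partition.
print(c(train = mean(training_toy$two_year_recid),
        test  = mean(testing_toy$two_year_recid)))
```

If the two rates differ noticeably, a stratified split on the outcome is a common alternative.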


The first model is a linear regression. A linear regression can give a formulaic answer to the question of whether age, race, and gender have a linear relationship with rearrest within two years. Regression models are easy to interpret, and their coefficients help explain trends: for example, the age coefficient estimates how much the predicted likelihood of rearrest within two years changes with each additional year of age, holding the other factors constant. Based on the summary below, every factor is statistically significant at the 0.05 level. However, the R-squared value of about 0.18 is so low that the model does not adequately explain the total variability. Age, race, and juvenile felony count were significant at a weaker level than sex, juvenile misdemeanor count, charge degree, and violent recidivism.



## 
## Call:
## lm(formula = two_year_recid ~ age + ageBucket + race_bit + sex_bit + 
##     juv_fel_count + juv_misd_count + c_charge_degree + is_violent_recid, 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2188 -0.3903 -0.2012  0.4848  0.8885 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.383865   0.039277   9.773  < 2e-16 ***
## age              -0.002957   0.001175  -2.516 0.011909 *  
## ageBucket        -0.080177   0.027948  -2.869 0.004145 ** 
## race_bit          0.047780   0.015725   3.038 0.002395 ** 
## sex_bit           0.066053   0.018720   3.528 0.000423 ***
## juv_fel_count     0.055762   0.017974   3.102 0.001934 ** 
## juv_misd_count    0.059639   0.014578   4.091 4.39e-05 ***
## c_charge_degree   0.103225   0.015679   6.584 5.23e-11 ***
## is_violent_recid  0.510869   0.023137  22.080  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4523 on 3685 degrees of freedom
## Multiple R-squared:  0.1802, Adjusted R-squared:  0.1784 
## F-statistic: 101.3 on 8 and 3685 DF,  p-value: < 2.2e-16
             Actual 0  Actual 1
Predicted 0       676       387
Predicted 1       155       366

Accuracy   Sensitivity  Specificity  Misclassification Error
0.6578283  0.4860558    0.8134777    0.3422
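The reported metrics can be recomputed from the confusion matrix. The sketch below assumes rows are predicted classes and columns are actual classes, which is the orientation that reproduces the reported figures; the variable names are illustrative.

```r
# Confusion matrix from the linear regression output above
# (assumed orientation: rows = predicted class, columns = actual class).
cm <- matrix(c(676, 155, 387, 366), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

tn <- cm["0", "0"]; fn <- cm["0", "1"]
fp <- cm["1", "0"]; tp <- cm["1", "1"]

accuracy    <- (tp + tn) / sum(cm)  # share of all cases classified correctly
sensitivity <- tp / (tp + fn)       # share of actual rearrests that were flagged
specificity <- tn / (tn + fp)       # share of non-rearrests that were not flagged
misclass    <- 1 - accuracy         # share of all cases classified incorrectly

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, misclassification = misclass), 4))
```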

The next model is a logistic regression. The goal is to reuse the same predictors as the previous model in a logistic framework. Because the predicted variable is binary (yes or no), a logistic model is more appropriate and more meaningful than a linear one: it models the probability of rearrest directly and keeps predictions between 0 and 1.

## 
## Call:
## glm(formula = two_year_recid ~ age + ageBucket + race_bit + sex_bit + 
##     juv_fel_count + juv_misd_count + c_charge_degree + is_violent_recid, 
##     family = binomial, data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0644  -0.9764  -0.6547   1.1330   1.9897  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.523691   0.197486  -2.652  0.00801 ** 
## age              -0.014171   0.005996  -2.363  0.01811 *  
## ageBucket        -0.386025   0.139876  -2.760  0.00578 ** 
## race_bit          0.220467   0.076810   2.870  0.00410 ** 
## sex_bit           0.300288   0.091940   3.266  0.00109 ** 
## juv_fel_count     0.345781   0.115734   2.988  0.00281 ** 
## juv_misd_count    0.420999   0.103637   4.062 4.86e-05 ***
## c_charge_degree   0.519371   0.078198   6.642 3.10e-11 ***
## is_violent_recid  3.307795   0.220093  15.029  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5106.1  on 3693  degrees of freedom
## Residual deviance: 4321.2  on 3685  degrees of freedom
## AIC: 4339.2
## 
## Number of Fisher Scoring iterations: 5
             Actual 0  Actual 1
Predicted 0       669       376
Predicted 1       162       377

Accuracy   Sensitivity  Specificity  Misclassification Error
0.6603535  0.500664     0.8050542    0.3396
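To illustrate how class predictions can be obtained from a logistic model, here is a minimal sketch on synthetic data. The 0.5 cutoff is an assumption, since the report does not state which threshold it used, and the data frame `toy` and its effect sizes are invented.

```r
# Synthetic data for illustration only; column names mirror the report,
# but the values and effect sizes are invented.
set.seed(42)
n <- 500
toy <- data.frame(age = sample(18:70, n, replace = TRUE),
                  sex_bit = rbinom(n, 1, 0.8))
toy$two_year_recid <- rbinom(n, 1, plogis(0.5 - 0.03 * toy$age + 0.3 * toy$sex_bit))

fit <- glm(two_year_recid ~ age + sex_bit, family = binomial, data = toy)

# type = "response" returns predicted probabilities between 0 and 1.
prob <- predict(fit, newdata = toy, type = "response")

# Assumed cutoff of 0.5 to turn probabilities into classes.
pred <- factor(as.integer(prob >= 0.5), levels = c(0, 1))

# Rows = predicted class, columns = actual class.
print(table(predicted = pred, actual = toy$two_year_recid))

# Exponentiated coefficients can be read as odds ratios.
print(exp(coef(fit)))
```

Lowering the cutoff trades specificity for sensitivity, which matters when false negatives and false positives carry different costs.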


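The following approach uses random forests. The idea is that there might be non-linear relationships and interactions in the data that the regression models could not capture. Random forests are also generally better suited to classification than linear regression. Note that the summary below lists mse and rsq components, which the randomForest package reports for a regression forest, so this forest was fitted on a numeric two_year_recid rather than on a factor.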

##                 Length   Class  Mode     
## call                   4 -none- call     
## type                   1 -none- character
## predicted           3694 -none- numeric  
## mse                  500 -none- numeric  
## rsq                  500 -none- numeric  
## oob.times           3694 -none- numeric  
## importance             8 -none- numeric  
## importanceSD           0 -none- NULL     
## localImportance        0 -none- NULL     
## proximity       13645636 -none- numeric  
## ntree                  1 -none- numeric  
## mtry                   1 -none- numeric  
## forest                11 -none- list     
## coefs                  0 -none- NULL     
## y                   3694 -none- numeric  
## test                   0 -none- NULL     
## inbag                  0 -none- NULL     
## terms                  3 terms  call
             Actual 0  Actual 1
Predicted 0       669       376
Predicted 1       162       377

Accuracy   Sensitivity  Specificity  Misclassification Error
0.6578283  0.498008     0.8026474    0.3422
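The mse and rsq entries in the summary above suggest the forest was grown in regression mode on a numeric outcome. The sketch below, on synthetic data, shows how converting the outcome to a factor switches randomForest to classification. It assumes the randomForest package is installed (the block is skipped otherwise), and the data frame `toy` and the ntree value are illustrative choices.

```r
# Requires the randomForest package; the sketch is skipped if it is missing.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  n <- 300
  toy <- data.frame(age = sample(18:70, n, replace = TRUE),
                    sex_bit = rbinom(n, 1, 0.8))
  toy$two_year_recid <- rbinom(n, 1, plogis(0.5 - 0.03 * toy$age))

  # Numeric response: randomForest grows a regression forest.
  # It warns about the 0/1 response, so the warning is suppressed here.
  rf_reg <- suppressWarnings(
    randomForest::randomForest(two_year_recid ~ age + sex_bit,
                               data = toy, ntree = 100))

  # Factor response: randomForest grows a classification forest instead.
  toy$recid_f <- factor(toy$two_year_recid)
  rf_cls <- randomForest::randomForest(recid_f ~ age + sex_bit,
                                       data = toy, ntree = 100)

  print(c(numeric_response = rf_reg$type, factor_response = rf_cls$type))
  print(rf_cls$confusion)  # out-of-bag confusion matrix with class errors
}
```

With a factor outcome, the forest predicts class labels by majority vote, so no separate cutoff on a numeric prediction is needed.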


This approach uses downsampling to get a more even grouping of the race variable in the training data. The idea is to reduce statistical bias and, hopefully, create a better training dataset for the model. Downsampling randomly discards rows from the larger race groups until every group is the same size as the smallest one, which evens out the race variable and offsets some of the bias shown in the charts above.


Here is the code snippet used to downsample:

# downSample() comes from the caret package and select() from dplyr
training_ds <- downSample(x = select(training, -race),
                          y = training$race, yname = "race")
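To make the mechanics concrete, here is a base-R sketch of downsampling on a synthetic, imbalanced data frame. It illustrates the general technique of cutting every class down to the size of the smallest one; it is not caret's internal implementation, and the group labels and sizes are made up.

```r
# Synthetic, imbalanced example; group labels and sizes are made up.
set.seed(42)
toy <- data.frame(race = factor(rep(c("A", "B", "C"), times = c(60, 30, 10))),
                  age  = sample(18:70, 100, replace = TRUE))

# Size of the smallest class: every class is cut down to this many rows.
n_min <- min(table(toy$race))

# Randomly keep n_min row indices from each class.
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$race),
                      function(idx) idx[sample.int(length(idx), n_min)]))
toy_ds <- toy[keep, ]

print(table(toy$race))     # class counts before downsampling
print(table(toy_ds$race))  # class counts after downsampling
```

The cost of this balance is a smaller training set, which is consistent with the drop from 3694 to 2982 training rows visible in the two random forest summaries.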

##                 Length  Class  Mode     
## call                  4 -none- call     
## type                  1 -none- character
## predicted          2982 -none- numeric  
## mse                 500 -none- numeric  
## rsq                 500 -none- numeric  
## oob.times          2982 -none- numeric  
## importance            8 -none- numeric  
## importanceSD          0 -none- NULL     
## localImportance       0 -none- NULL     
## proximity       8892324 -none- numeric  
## ntree                 1 -none- numeric  
## mtry                  1 -none- numeric  
## forest               11 -none- list     
## coefs                 0 -none- NULL     
## y                  2982 -none- numeric  
## test                  0 -none- NULL     
## inbag                 0 -none- NULL     
## terms                 3 terms  call
##           
## predicted4   0   1
##          0 668 378
##          1 163 375
             Actual 0  Actual 1
Predicted 0       668       378
Predicted 1       163       375

Accuracy   Sensitivity  Specificity  Misclassification Error
0.6584596  0.498008     0.8038508    0.3415

This project was a learning experience about statistical bias and the implications it can have. In this crime data set, African-Americans are overrepresented and are rearrested within two years more often than their Caucasian counterparts. Downsampling put the race groups on an equal footing in the training data, but the models still rely on the same variables (age, sex, race, juvenile misdemeanors, juvenile felonies, charge type, and violent recidivism) and the columns derived from them. No model reached more than about 66% accuracy when predicting rearrest within two years. The implications of using prediction methods to make parole and early release decisions are profound. The final model results are assessed in the next section, but the main takeaway is that even with the available models, Type I and Type II errors will occur.



Model                      Accuracy   Sensitivity  Specificity  Misclassification Error
Linear Regression          0.6578283  0.4860558    0.8134777    0.3422
Logistic Regression        0.6603535  0.5006640    0.8050542    0.3396
Random Forest              0.6578283  0.4980080    0.8026474    0.3422
Random Forest Down Sample  0.6584596  0.4980080    0.8038508    0.3415



Based on the final results table, the logistic regression model had the highest overall accuracy (about 66.0%) and a specificity of just over 80%, which means it correctly identified over 80% of the people who were not rearrested; only the linear regression had a slightly higher specificity (about 81.3%). Its sensitivity was also the highest at just over 50%, which means it correctly identified just over half of the people who were rearrested. Every other model identified fewer than 50% of true positives, meaning they all produce more Type II errors and higher false negative rates. Further analysis could apply downsampling to all models, but the models were reasonably close in accuracy, sensitivity, and specificity.
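The comparison can be reproduced from the reported metrics. The sketch below copies the figures from the results table into a data frame (the names `results` and `best` are illustrative) and derives the Type I and Type II error rates as one minus specificity and one minus sensitivity.

```r
# Metrics copied from the results table above.
results <- data.frame(
  model = c("Linear Regression", "Logistic Regression",
            "Random Forest", "Random Forest Down Sample"),
  accuracy    = c(0.6578283, 0.6603535, 0.6578283, 0.6584596),
  sensitivity = c(0.4860558, 0.5006640, 0.4980080, 0.4980080),
  specificity = c(0.8134777, 0.8050542, 0.8026474, 0.8038508),
  stringsAsFactors = FALSE
)

# Type I error rate (false positives) and Type II error rate (false negatives).
results$false_positive_rate <- 1 - results$specificity
results$false_negative_rate <- 1 - results$sensitivity

# Which model leads on each metric?
best <- sapply(results[c("accuracy", "sensitivity", "specificity")],
               function(col) results$model[which.max(col)])
print(best)
print(results[c("model", "false_positive_rate", "false_negative_rate")])
```

Across all four models the false negative rate is close to 50%, which is the practical meaning of the sensitivity figures discussed above.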