by Eric Shapiro, Preetha Swamy, and Jessica Young

Note: Please use the Table of Contents on the left to explore this report.

knitr::opts_chunk$set(cache=TRUE, message= FALSE, warning=FALSE)

library(broom)
library(knitr)
library(ggplot2)
library(pROC)
library(caret)
library(klaR)
library(leaps)
library(randomForest)
library(import)
library(foreach)
library(doParallel)
library(iterators)
library(sparsediscrim)
library(glmnet)
library(plyr)
library(dplyr)
library(grid)
library(gridExtra)

Introduction

As machine learning and artificial intelligence gain popularity, predictive analytics are entering a vast array of settings, including the U.S. criminal justice system. Assisted by risk assessment algorithms, court systems make bail and sentencing decisions based upon the predicted likelihood that an individual will commit a crime. However, the probability predicted for any one individual can vary widely depending on the algorithm, or Risk Assessment Instrument (RAI), used.

Using criminal justice records from Broward County, Florida, released by ProPublica, we created our own risk assessment models and compared them to a popular risk assessment tool developed by Northpointe Inc.: Correctional Offender Management Profiling for Alternative Sanctions (COMPAS). Through this data mining project we 1) develop a model to predict two-year recidivism and assess strong predictors of recidivating, 2) develop a model to predict violent recidivism, again evaluating strong predictors, 3) determine whether our RAIs are equally predictive across sociodemographic factors such as age, race, ethnicity, and sex, and 4) compare our RAI performance to that of COMPAS in terms of accuracy and bias.

The remainder of this report details the process we undertook to achieve the above goals. First, we provide an overview of how we cleaned and processed our data, including variable transformations and feature engineering. We then produce high-level plots and tables that give an overview of the data at hand. Next, we test our trial models, assessing the performance of each model we created before selecting the best. Finally, we use the chosen model to compare its performance in predicting recidivism and violent recidivism relative to COMPAS.

recidivism.raw <- read.csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years-violent.csv", header = TRUE)

Data Exploration

Data Cleaning & Transformations

We took the following steps to clean our data. We kept these variables in our dataset: sex, race, priors count, juvenile felony count, juvenile misdemeanor count, juvenile other count, charge degree, and two-year recidivism. A new variable called age.at.charge was created to capture a person's age when he or she entered jail; we did so because the age variable in the original dataset was based on age in 2016 rather than age at the time of jailing. It was calculated by subtracting DOB from c_jail_in. We created another new variable, length of stay, measuring how long an individual spent in jail, since there might be a relationship between length of stay and the type of charge someone received; it was calculated as c_jail_out minus c_jail_in. Though this could have been a good predictor, not everyone had a jail date, which could alter our results, so after deliberation we dropped this variable from our dataset. We binned age into 4 categories, making sure not to separate ages 22-24; our rationale was that this age group had the largest number of crimes in our histogram. A separate violent recidivism data frame was created by mutating our subsetted data frame to include the variables associated with violent recidivism. We removed 268 observations with missing values, and we removed individuals who did not have a c_jail_in date. A sketch of these steps appears below.
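The cleaning itself happens in a chunk that is not echoed above. A minimal sketch of the steps just described, assuming the column names in the raw ProPublica file (the hidden chunk may differ in small details), looks roughly like this:

recidivism <- recidivism.raw %>%
  dplyr::filter(c_jail_in != "") %>%                        # drop rows with no jail-in date
  dplyr::mutate(
    # age when the person entered jail, rather than age in 2016
    age.at.charge  = round(as.numeric(as.Date(c_jail_in) - as.Date(dob)) / 365),
    # length of stay was explored but ultimately dropped from the final data
    length.of.stay = as.numeric(as.Date(c_jail_out) - as.Date(c_jail_in)),
    age.cat        = cut_number(age.at.charge, 4),          # four age bins
    two_year_recid = as.factor(two_year_recid)
  ) %>%
  dplyr::select(sex, race, priors_count, juv_fel_count, juv_misd_count,
                juv_other_count, c_charge_degree, age.at.charge, age.cat,
                two_year_recid) %>%
  na.omit()
# recidivism.v is built the same way, additionally keeping the engineered
# violent-recidivism outcome (violent.recid) and the violent charge degree (vr_charge_degree)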

Correlations Assessment

We ran a pairs plot and found the following: as expected, there is a strong correlation between age.at.charge and date of birth. There also appear to be possible correlations between juvenile other count and priors count, and between age.at.charge and juvenile misdemeanor count. Date of birth and age.at.charge also appear potentially correlated with juvenile misdemeanor count and juvenile other count, and juvenile misdemeanor count is correlated with juvenile other count. In this process, we also decided to drop the date of birth variable from our data frame. A sketch of this check is shown below.
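The pairs plot chunk is likewise not echoed; a sketch of the kind of check described above, restricted to the numeric columns kept after cleaning, is:

# Scatterplot matrix plus a numeric correlation table for the quantitative predictors
numeric.vars <- c("priors_count", "juv_fel_count", "juv_misd_count",
                  "juv_other_count", "age.at.charge")
pairs(recidivism[, numeric.vars])
round(cor(recidivism[, numeric.vars]), 2)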

Data Visualizations

The following visuals capture high-level information about our dataset. We plotted our data based on the binning we conducted and also split the data across sociodemographic categories. In the first graph, we see the number of individuals by the age at which they were charged. As mentioned above, younger individuals (between 17 and 24) are the most represented. As age increases, the counts slope downward, which suggests that crimes committed could decrease with age. The second plot shows how many individuals of each race had a two-year recidivism event. African Americans appear to have a higher count of two-year recidivism than other races, while Asians and Native Americans have the lowest counts. We present this information in stacked as well as dodged bar graphs, and we also highlight the different age categories.

ggplot(data=recidivism, mapping = aes(x = age.at.charge, fill=age.cat)) + geom_bar()

ggplot(data=recidivism, mapping = aes(x = race, fill=as.factor(two_year_recid))) + geom_bar()

ggplot(data=recidivism, mapping = aes(x = race, fill=as.factor(two_year_recid))) + geom_bar(position = "dodge")

ggplot(data=recidivism, mapping = aes(x = age.cat, fill=as.factor(two_year_recid))) + geom_bar(position = "dodge")

Method of Creating the Train and Test Data

We used an 80/20 split to set up our train and test data.

#Create the training and testing dataset in order to run our models and cross validate
set.seed(531)
recid_test <- sample(1:nrow(recidivism), 
                       round(0.2 * nrow(recidivism)))
recid_train <- setdiff(1:nrow(recidivism), recid_test)

recid.train <- recidivism[recid_train,]
recid.test <- recidivism[recid_test,]


v_recid_test <- sample(1:nrow(recidivism.v), 
                       round(0.2 * nrow(recidivism.v)))
v_recid_train <- setdiff(1:nrow(recidivism.v), v_recid_test)

v.recid.train <- recidivism.v[v_recid_train,]
v.recid.test <- recidivism.v[v_recid_test,]

Methodology

Since we intend to determine whether or not an individual will recidivate, we use classification (supervised learning). Supervised learning is appropriate because the outcome (whether an individual recidivated) is known in the dataset at hand. The model options we considered for classification were logistic regression, k-nearest neighbors, Bayes methods, classification trees, bagging, boosting, and random forests. From these, we chose logistic regression, a Bayes method (specifically Naive Bayes), and random forests. We chose logistic regression because our outcome variable is categorical and binary; this method works well for data with those characteristics, specifically our recidivism and violent recidivism outcomes. We chose Naive Bayes because this method assumes independence among predictors. We did run an LDA/QDA analysis but ran into collinearity among our variables, so we stuck with Naive Bayes (a sketch of that attempt is shown below). Finally, we chose random forests, which do a good job of capturing interactions. In this way we tried a model from each of the major families of classifiers: logistic regression, Bayes methods, and tree-based classification. The following are our tests and analyses of these methods.
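The LDA/QDA attempt mentioned above is not shown in the report; a minimal sketch of it (using MASS, which is already loaded via klaR) is below. With age.cat derived directly from age.at.charge, the discriminant fits are prone to flagging collinear variables, which is the issue we ran into.

# Sketch of the discriminant-analysis attempt referenced above
recid.lda <- MASS::lda(two_year_recid ~ ., data = recid.train)        # warns about collinear variables
recid.qda <- try(MASS::qda(two_year_recid ~ ., data = recid.train))   # may fail outright due to rank deficiency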

classMetrics <- function(score, y, cutoff, 
                         type = c("all", "accuracy", "sensitivity", 
                                  "specificity", "ppv", "npv", "precision", 
                                  "recall")) {
  type <- match.arg(type, several.ok = TRUE)
  n <- length(y) 
  
  # Form confusion matrix
  score.factor <- factor(as.numeric(score >= cutoff), levels = c("0", "1"))
  confusion.mat <- table(score.factor, as.factor(y), dnn = list("predicted", "observed"))
  # Calculate all metrics
  acc <- sum(diag(confusion.mat)) / n
  sens <- confusion.mat[2,2] / sum(confusion.mat[,2])
  spec <- confusion.mat[1,1] / sum(confusion.mat[,1])
  ppv <- confusion.mat[2,2] / sum(confusion.mat[2,])
  npv <- confusion.mat[1,1] / sum(confusion.mat[1,])
  prec <- ppv
  rec <- sens
  
  metric.names <- c("accuracy", "sensitivity", "specificity", 
                    "ppv", "npv", "precision", "recall")
  metric.vals <- c(acc, sens, spec, ppv, npv, prec, rec)
  
  # Form into data frame
  full.df <- data.frame(value = metric.vals)
  rownames(full.df) <- metric.names
  
  # Return just the requested subset of metrics
  if(type[1] == "all") {
    list(conf.mat = confusion.mat, perf = full.df)
  } else {
    list(conf.mat = confusion.mat, 
         perf = subset(full.df, subset = metric.names %in% type))
  }
}




plotClassMetrics <- function(score, y, xvar = NULL, yvar = c("accuracy", "sensitivity", 
                                  "specificity", "ppv", "npv", "precision", 
                                  "recall"),
                             flip.x = FALSE) {
  yvar <- match.arg(yvar)
  
  # If there are a lot of unique score values, just calculate the metrics
  # at a random sample of 100 of the observed score values.
  unique.scores <- unique(score)
  if(length(unique.scores) > 100) {
    cutoff.seq <- sample(unique.scores, 100, replace = FALSE)
  } else {
    cutoff.seq <- unique.scores
  }
  
  n <- length(cutoff.seq)
  
  x.out <- numeric(n)
  y.out <- numeric(n)
  # Loop over all values of the score and calculate the performance metrics
  for(i in 1:n) {
    if(!is.null(xvar)) {
      metrics <- classMetrics(score, y, cutoff = cutoff.seq[i], type = c(xvar, yvar))
      x.out[i] <- metrics$perf[xvar, 1]
      y.out[i] <- metrics$perf[yvar, 1]
    } else {
      metrics <- classMetrics(score, y, cutoff = cutoff.seq[i], type = c(yvar))
      x.out[i] <- cutoff.seq[i]
      y.out[i] <- metrics$perf[yvar, 1]
    }
  }
  # Combine metrics into a data frame
  if(flip.x) {
    x.out <- 1 - x.out
  }
  df.out <- data.frame(score = cutoff.seq, x = x.out, y = y.out)
  # Reorder the data frame in increasing order of the x-axis variable
  df.out <- df.out[order(df.out$score), ]
  # De-duplicate x-axis
  df.out <- subset(df.out, subset = !duplicated(df.out$x))
  
  # Construct line plot
  if(!is.null(xvar)) {
    x.text <- ifelse(flip.x, paste0("1 - ", xvar), xvar)
  } else {
    x.text <- "score"
  }
  
  print(qplot(data = df.out, x = x, y = y, geom = "line",
              xlab = ifelse(is.null(xvar), "score", x.text),
              ylab = yvar, ylim = c(0, 1)))
}
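Neither helper is exercised until the model sections below, so here is a quick toy illustration of the calling convention (the simulated scores and labels are made up purely for this example):

# Toy illustration of classMetrics() and plotClassMetrics()
set.seed(1)
toy.y     <- rbinom(200, size = 1, prob = 0.3)
toy.score <- 0.5 * runif(200) + 0.3 * toy.y   # noisy scores, higher on average for the positives
classMetrics(toy.score, toy.y, cutoff = 0.5, type = c("sensitivity", "specificity"))
# Sensitivity against 1 - specificity, i.e. an empirical ROC-style curve
plotClassMetrics(toy.score, toy.y, xvar = "specificity", yvar = "sensitivity", flip.x = TRUE)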

Creating Models to Predict Recidivism

To predict two-year recidivism (i.e. the likelihood that one will recidivate within two years), we developed three models (Naive Bayes, Logistic Regression and Random Forest) and ultimately selected the Random Forest model based on its performance.

Naive Bayes

Density Plots

With our Naive Bayes model for predicting two-year recidivism, we see how a classification method that assumes independence among predictors handles this task. We constructed a series of density plots to see how Naive Bayes would behave for certain variables. Based on the overlap we see in the plots, there appear to be interactions among the variables. Because Naive Bayes does not account for interactions, this may not be our best model choice.

Assessment of Model Metrics

Our analysis of accuracy, sensitivity, and specificity tells us more about this model. Accuracy is fairly high at 80.2%, but we completely compromise sensitivity (0%) to obtain a very high specificity (100%). Though we do want to prioritize specificity for this model, we do not want to do so entirely at the expense of sensitivity. Additionally, since there were no false positives in our confusion matrix, specificity is exactly 100%, which may be skewing our results; part of this issue could be due to the independence assumption of the Naive Bayes model. From the confusion matrix, we see that the model predicted true negatives (TNs) for 718 individuals. However, there were also 177 false negatives (FNs). An FN in this case means the model predicted that someone would not recidivate within two years when they actually did.

ROC Curve

The ROC curve plots the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 - specificity). Because a false positive (wrongly setting a high bail or jail time for someone who will not recidivate) is by and large the costliest mistake, we prioritize specificity over sensitivity. With Naive Bayes, we get an AUC of 0.7089, indicating that the model ranks a randomly chosen recidivist above a randomly chosen non-recidivist 70.89% of the time. This is only slightly higher than the values reported for COMPAS, at around 65%.
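One way to act on that priority, rather than sticking with the default 0.5 cutoff, is to ask pROC for the score cutoff that attains a target specificity. The sketch below uses the roc.nb.r object fitted in the chunk that follows; the 0.95 target is purely illustrative and is not a value used elsewhere in this report.

# Cutoff that achieves (approximately) 95% specificity on the test set
coords(roc.nb.r, x = 0.95, input = "specificity", ret = "threshold")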

recid.nb <- NaiveBayes(two_year_recid ~. ,data = recid.train, usekernel = TRUE)


nb.model <- two_year_recid ~ .
train.data <- recid.train
testData <- recid.test
mse <- numeric(10)
w.mse <- numeric(10)
recidnb.model <- nb.model

#Randomly shuffle the data
train.data<-train.data[sample(nrow(train.data)),]
#Create 10 equally sized folds
folds <- cut(seq(1,nrow(train.data)),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
for(i in 1:10){
  #Segment your data by fold using the which() function
  testIndexes <- which(folds==i,arr.ind=TRUE)
  testData <- train.data[testIndexes, ]
  trainData <- train.data[-testIndexes, ]
  
  recidnb.model.trained <- NaiveBayes(recidnb.model, data = trainData)
  recidnb.preds <- predict(recidnb.model.trained, testData)$posterior[,2]
  mse[i] <- mean((as.numeric(testData$two_year_recid)  - recidnb.preds)^2)
  
  w.mse[i] <- mean((length(testData$two_year_recid)*mse[i])/(length(testData$two_year_recid)+length(trainData$two_year_recid)))
  
 }

mse.nb.r <- sum(w.mse)/length(w.mse)

#how to predict with naive bayes
confusion.nb <- predict(recid.nb, newdata = recid.test)$posterior[,2]
classMetric.nb <- classMetrics(confusion.nb, recid.test$two_year_recid, cutoff=.5, type = "all")



#Produce ROC Curve. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Here we see the trade-off between sensitivity and specificity with this model, and that we get an AUC value of .7089. 

roc.nb.r <- roc(recid.test$two_year_recid, confusion.nb)         
plot(roc.nb.r)

print(roc.nb.r$auc)
## Area under the curve: 0.7089
ggplot(data=recidivism, mapping = aes(x = age.at.charge, fill=two_year_recid)) + geom_density(alpha = 0.5)

ggplot(data=recidivism, mapping = aes(x = priors_count, fill=two_year_recid)) + geom_density(alpha = 0.5)

ggplot(data=recidivism, mapping = aes(x = race, fill=two_year_recid)) + geom_density(alpha = 0.5)

Logistic Regression

Probability Plot and Model Size Plot

Based on our probability histogram, the fitted probabilities are concentrated toward low values rather than being normally distributed. The plot shows the probability of recidivism that the logistic model assigns to each individual and how many individuals receive each probability.

Looking at the model size plot, we see that RSS decreases as we add more predictors; each variable adds a little more to the description of our data. On the BIC plot, a model size of 4 is the lowest point, so it is the best model size.

Assessment of Model Metrics

Specifically for two-year recidivism, we ran BIC within our best subset search to check for the ideal model size, opting for BIC over AIC to get a stronger penalty for model complexity. According to this, we would choose a model of size 4 to minimize BIC, keeping the following four variables in our model: sexMale, priors_count, juv_other_count, and age.at.charge. One thing to note for this model is that accuracy is relatively high at 81.9%; however, we get a high specificity (98.9%) at the cost of a low sensitivity (13.0%). We also see far more true negatives (710) than true positives (23).

ROC Curve

When we plotted a ROC curve, we found that the area under the curve was 0.7177. In other words, the logistic regression model ranks a randomly chosen recidivist above a randomly chosen non-recidivist 71.77% of the time.

#Fit Logisitc Regression for 2-Year Recid
recid.glm <- glm(two_year_recid ~ ., data = recid.train, family = binomial)



glm.model <- two_year_recid ~ .
train.data <- recid.train
testData <- recid.test
mse <- numeric(10)
w.mse <- numeric(10)
logrecid.model <- glm.model

#Randomly shuffle the data
train.data<-train.data[sample(nrow(train.data)),]
#Create 10 equally sized folds
folds <- cut(seq(1,nrow(train.data)),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
for(i in 1:10){
  #Segment your data by fold using the which() function
  testIndexes <- which(folds==i,arr.ind=TRUE)
  testData <- train.data[testIndexes, ]
  trainData <- train.data[-testIndexes, ]
  
  logrecid.model.trained <- glm(logrecid.model, data = trainData, family = binomial)
  logrecid.preds <- predict(logrecid.model.trained, testData, type = "response")
  mse[i] <- mean((as.numeric(testData$two_year_recid)  - logrecid.preds)^2)
  
  w.mse[i] <- mean((length(testData$two_year_recid)*mse[i])/(length(testData$two_year_recid)+length(trainData$two_year_recid)))
  
  # pred.stats <- classMetrics(r.pred, rec
  #id.test$two_year_recid, cutoff=.5, type = "all")
  # rf.matrix[i] <- pred.stats$conf.mat
  # rf.perf[i] <- pred.stats$perf
}

mse.logi.r <- sum(w.mse)/ length(w.mse)



#Table of statistically significant variables for LogReg: Recid
#We see only 6 appear significant and none are noise variables.
signif.names.recid <- names(which(coef(summary(recid.glm))[,"Pr(>|z|)"] < 0.05))
kable(signif.names.recid)
x
sexMale
raceCaucasian
raceHispanic
priors_count
juv_other_count
age.at.charge
#Confusion Matrix for LogReg: Recid with .5 cutoff
table(round(recid.glm$fitted.values), recid.glm$y)
##    
##        0    1
##   0 2954  525
##   1   44   57
#Est. Probabilities Histogram for LogReg: Recid with .5 cutoff
qplot(x = recid.glm$fitted.values)

#Run best subset
recid.subset <- regsubsets(two_year_recid ~ .,
               data = recid.train,
               nbest = 1,    # 1 best model for each number of predictors
               nvmax = 5,    # NULL for no limit on number of variables
               method = "exhaustive", really.big = TRUE)
# Print-out of the best variables in each model size
lapply(coef(recid.subset, 1:5), FUN = names)
## [[1]]
## [1] "(Intercept)"  "priors_count"
## 
## [[2]]
## [1] "(Intercept)"   "priors_count"  "age.at.charge"
## 
## [[3]]
## [1] "(Intercept)"   "sexMale"       "priors_count"  "age.at.charge"
## 
## [[4]]
## [1] "(Intercept)"     "sexMale"         "priors_count"    "juv_other_count"
## [5] "age.at.charge"  
## 
## [[5]]
## [1] "(Intercept)"     "sexMale"         "priors_count"    "juv_misd_count" 
## [5] "juv_other_count" "age.at.charge"
sum.recid.subset<-summary(recid.subset)
sum.recid.subset
## Subset selection object
## Call: regsubsets.formula(two_year_recid ~ ., data = recid.train, nbest = 1, 
##     nvmax = 5, method = "exhaustive", really.big = TRUE)
## 15 Variables  (and intercept)
##                     Forced in Forced out
## sexMale                 FALSE      FALSE
## raceAsian               FALSE      FALSE
## raceCaucasian           FALSE      FALSE
## raceHispanic            FALSE      FALSE
## raceNative American     FALSE      FALSE
## raceOther               FALSE      FALSE
## priors_count            FALSE      FALSE
## juv_fel_count           FALSE      FALSE
## juv_misd_count          FALSE      FALSE
## juv_other_count         FALSE      FALSE
## c_charge_degreeM        FALSE      FALSE
## age.at.charge           FALSE      FALSE
## age.cat(24,31]          FALSE      FALSE
## age.cat(31,42]          FALSE      FALSE
## age.cat(42,80]          FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
##          sexMale raceAsian raceCaucasian raceHispanic raceNative American
## 1  ( 1 ) " "     " "       " "           " "          " "                
## 2  ( 1 ) " "     " "       " "           " "          " "                
## 3  ( 1 ) "*"     " "       " "           " "          " "                
## 4  ( 1 ) "*"     " "       " "           " "          " "                
## 5  ( 1 ) "*"     " "       " "           " "          " "                
##          raceOther priors_count juv_fel_count juv_misd_count
## 1  ( 1 ) " "       "*"          " "           " "           
## 2  ( 1 ) " "       "*"          " "           " "           
## 3  ( 1 ) " "       "*"          " "           " "           
## 4  ( 1 ) " "       "*"          " "           " "           
## 5  ( 1 ) " "       "*"          " "           "*"           
##          juv_other_count c_charge_degreeM age.at.charge age.cat(24,31]
## 1  ( 1 ) " "             " "              " "           " "           
## 2  ( 1 ) " "             " "              "*"           " "           
## 3  ( 1 ) " "             " "              "*"           " "           
## 4  ( 1 ) "*"             " "              "*"           " "           
## 5  ( 1 ) "*"             " "              "*"           " "           
##          age.cat(31,42] age.cat(42,80]
## 1  ( 1 ) " "            " "           
## 2  ( 1 ) " "            " "           
## 3  ( 1 ) " "            " "           
## 4  ( 1 ) " "            " "           
## 5  ( 1 ) " "            " "
#We run BIC to check for ideal model size, opting for this over AIC to get a stronger penalty for model complexity. According to this, we would choose model size 4 to reduce BIC, keeping the following four variables in our model: sexMale, priors_count, juv_other_count, and age.at.charge.

num.vars <- length(sum.recid.subset$rsq)
qplot(1:num.vars, sum.recid.subset$bic, xlab = "Model size", ylab = "BIC")

which.min(sum.recid.subset$bic)
## [1] 4
#Print confusion matrix comparing 2 year recid prediction accuracy against TEST set
TwoYr.recid.glm.test <- glm(two_year_recid ~ .,  data = recid.test, family = "binomial")

classMetrics(TwoYr.recid.glm.test$fitted.values, recid.test$two_year_recid, cutoff=0.5, type = "all")
## $conf.mat
##          observed
## predicted   0   1
##         0 710 154
##         1   8  23
## 
## $perf
##                 value
## accuracy    0.8189944
## sensitivity 0.1299435
## specificity 0.9888579
## ppv         0.7419355
## npv         0.8217593
## precision   0.7419355
## recall      0.1299435
#Produce ROC Curve. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Here we see the trade-off between sensitivity and specificity with this model, and that we get an AUC value of .7177. 
roc.glm.r <- roc(recid.test$two_year_recid, TwoYr.recid.glm.test$fitted.values)         
plot(roc.glm.r)

print(roc.glm.r$auc)
## Area under the curve: 0.7177

Random Forest

Variable Importance Plot

In our random forest model, we see that certain variables are strong predictors. According to the variable importance plot, priors_count and age.at.charge were the two highest-ranked predictors, with race coming in third. Race was a variable that ProPublica analyzed in great detail.

Assessment of Model Metrics

We see that our accuracy for this model is high at 82.5%, with a sensitivity of 5.8% and a specificity of 96.4%. We also have 27 false positives and 130 false negatives.

ROC Curve

When we plotted a ROC curve, we found that the area under the curve was 0.675 for our random forest model. In other words, the random forest model ranks a randomly chosen recidivist above a randomly chosen non-recidivist about 67.5% of the time. If we were to run the random forest model on a smaller subset, using only the variables we found to be important, this value might improve.

rf.model <- two_year_recid ~ .
train.data <- recid.train
testData <- recid.test
mse <- numeric(10)
w.mse <- numeric(10)
model <- rf.model

#Randomly shuffle the data
train.data<-train.data[sample(nrow(train.data)),]
#Create 10 equally sized folds
folds <- cut(seq(1,nrow(train.data)),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
for(i in 1:10){
  #Segment your data by fold using the which() function
  testIndexes <- which(folds==i,arr.ind=TRUE)
  testData <- train.data[testIndexes, ]
  trainData <- train.data[-testIndexes, ]
  
  model.trained <- randomForest(model, data = trainData)
  preds <- predict(model.trained, testData, type = "prob")[,2]
  mse[i] <- mean((as.numeric(testData$two_year_recid)  - preds)^2)
  
  w.mse[i] <- mean((length(testData$two_year_recid)*mse[i])/(length(testData$two_year_recid)+length(trainData$two_year_recid)))
  
}

mse.rf.r <- sum(w.mse)/length(w.mse)






rf.model <- two_year_recid ~ priors_count+age.at.charge
train.data <- recid.train
testData <- recid.test
mse <- numeric(10)
w.mse <- numeric(10)
model <- rf.model

#Randomly shuffle the data
train.data<-train.data[sample(nrow(train.data)),]
#Create 10 equally sized folds
folds <- cut(seq(1,nrow(train.data)),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
for(i in 1:10){
  #Segment your data by fold using the which() function
  testIndexes <- which(folds==i,arr.ind=TRUE)
  testData <- train.data[testIndexes, ]
  trainData <- train.data[-testIndexes, ]
  
  model.trained <- randomForest(model, data = trainData)
  preds <- predict(model.trained, testData, type = "prob")[,2]
  mse[i] <- mean((as.numeric(testData$two_year_recid)  - preds)^2)
  
  w.mse[i] <- mean((length(testData$two_year_recid)*mse[i])/(length(testData$two_year_recid)+length(trainData$two_year_recid)))
  
}

mse.rf.r.short <- sum(w.mse)/length(w.mse)


#2 yr recid prediction and ROC
model.rf <- randomForest(two_year_recid ~ .,  data = recid.train)
varImpPlot(model.rf)

pred.rf.r <- predict(model.rf, newdata = recid.test, type = "prob")[,2]
classMetrics(pred.rf.r, recid.test$two_year_recid, cutoff=0.5, type="all")
## $conf.mat
##          observed
## predicted   0   1
##         0 730 130
##         1  27   8
## 
## $perf
##                  value
## accuracy    0.82458101
## sensitivity 0.05797101
## specificity 0.96433289
## ppv         0.22857143
## npv         0.84883721
## precision   0.22857143
## recall      0.05797101
roc.rf.r <-roc(recid.test$two_year_recid, pred.rf.r)
plot(roc.rf.r)

print(roc.rf.r$auc)
## Area under the curve: 0.675
#2 year recid short-model prediction and ROC

model.rf.short <- randomForest(two_year_recid ~ priors_count+age.at.charge,  data = recid.train)
pred.rf.r.short <- predict(model.rf.short, newdata = recid.test, type = "prob")[,2]
roc.rf.r.short <-roc(recid.test$two_year_recid,  pred.rf.r.short)



#qplot(x = r.pred, fill = recid.test$two_year_recid, geom = "histogram")

Summary of Two-Year Recidivism Models

Based on the assessments above and the table below, we found that the random forest model was the best performer. It has the high specificity we are looking for, and we expect it could perform even better with a smaller model. Additionally, random forests can account for interactions, which are highly present in the data. If we look specifically at misclassification error (1 - accuracy), Naive Bayes misclassified 19.8% of test cases, logistic regression 18.1%, and random forest 17.6%. Using these error rates, we can see that random forest classified individuals who may recidivate better than the other models because it had the lowest error, and it also had strong accuracy and specificity overall. We did not think Naive Bayes would be a good approach because it does not take interactions into account, and although random forest had a lower sensitivity than logistic regression, we felt it did better because of its lower misclassification error. For both of these models, age.at.charge and priors_count appeared to be strong predictors of recidivism.

nb.metrics <- classMetrics(predict(recid.nb, newdata = recid.test)$posterior[,2], recid.test$two_year_recid, cutoff=0.5, type = "all")
logi.metric <- classMetrics(TwoYr.recid.glm.test$fitted.values, recid.test$two_year_recid, cutoff=0.5, type = "all")
rf.metric <- classMetrics(predict(model.rf, newdata = recid.test, type = "prob")[,2], recid.test$two_year_recid, cutoff=0.5, type = "all")

rf.table <- c('Random Forest', 
             round(mse.rf.r, 3),
             round(roc.rf.r$auc[1], 3),
             round(rf.metric$perf[3,1], 3),
             round(rf.metric$perf[2,1], 3),
             round(rf.metric$perf[1,1], 3))
nb.table <- c('Naive Bayes', 
             round(mse.nb.r, 3),
             round(roc.nb.r$auc[1], 3),
             round(nb.metrics$perf[3,1], 3), 
             round(nb.metrics$perf[2,1], 3),
             round(nb.metrics$perf[1,1], 3))
logi.table <- c('Logistic Regression', 
             round(mse.logi.r, 3),
             round(roc.glm.r$auc[1], 3),
             round(logi.metric$perf[3,1], 3),
             round(logi.metric$perf[2,1], 3),
             round(logi.metric$perf[1,1], 3))



table.values <- rbind(rf.table,logi.table, nb.table)
df.values.models<-as.data.frame(table.values)

kable(df.values.models, row.names = FALSE, col.names = c("Model", "CV MSE", "AUC", "Specificity", "Sensitivity", "Accuracy" ), digits = 3)
Model                 CV MSE   AUC     Specificity   Sensitivity   Accuracy
Random Forest         0.125    0.675   1             0             0.802
Logistic Regression   0.112    0.718   0.979         0.113         0.808
Naive Bayes           0.13     0.709   0.989         0.13          0.819

Creating Models to Predict Violent Recidivism

To predict the likelihood of violent recidivism (i.e. an individual committing a violent crime), we explored three models (Naive Bayes, Logistic Regression, and Random Forest). Again, we found the Random Forest model performed best and was thus chosen for violent recidivism predictions.

Naive Bayes

Density Plots

With our Naive Bayes model for violent recidivism, we again see how a classification method that assumes independence handles this task. We constructed a series of density plots to see how Naive Bayes would behave for certain variables. Based on the overlap we see in the plots, there appear to be interactions among the variables; because Naive Bayes does not account for interactions, this may not be our best model choice. In particular, there is considerable overlap in the race plot.

Assessment of Model Metrics

When we run our model to determine its accuracy, sensitivity, and specificity, we see that it has a very high accuracy of 98.8%, complemented by a specificity of 100% and a sensitivity of 92.7%. Specificity is exactly 100% because there were no false positives in our confusion matrix; everything pooling into specificity was a true negative. This may be skewing our results.

ROC Curve

The ROC curve plots the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 - specificity). Because a false positive (wrongly setting a high bail or jail time for someone who will not recidivate) is by and large the costliest mistake, we prioritize specificity over sensitivity. With Naive Bayes we get an AUC of 1, indicating that the model perfectly separates violent recidivists from non-recidivists on the test set. Though this is an excellent result, we need to keep the independence assumption that Naive Bayes imposes firmly in mind.

violent.recid.nb <- NaiveBayes(violent.recid ~ .-vr_charge_degree-two_year_recid, data = v.recid.train, usekernel = TRUE)


violent.nb.model <- violent.recid ~ .-vr_charge_degree-two_year_recid
v.train.data <- v.recid.train
v.testData <- v.recid.test
mse <- numeric(10)
w.mse <- numeric(10)
v.recidnb.model <- violent.nb.model

#Randomly shuffle the data
v.train.data<-v.train.data[sample(nrow(v.train.data)),]
#Create 10 equally sized folds
folds <- cut(seq(1,nrow(v.train.data)),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
for(i in 1:10){
  #Segment your data by fold using the which() function
  testIndexes <- which(folds==i,arr.ind=TRUE)
  v.testData <- v.train.data[testIndexes, ]
  v.trainData <- v.train.data[-testIndexes, ]
  
  v.model.trained.nb <- NaiveBayes(v.recidnb.model, data = v.trainData)
  v.recidnb.preds <- predict(v.model.trained.nb, v.testData)$posterior[,2]
  mse[i] <- mean((as.numeric(v.testData$violent.recid)  - v.recidnb.preds)^2)
  
  w.mse[i] <- mean((length(v.testData$violent.recid)*mse[i])/(length(v.testData$violent.recid)+length(v.trainData$violent.recid)))
  
 }

mse.nb.v <- sum(w.mse)/length(w.mse)

#how to predict with naive bayes
violent.confusion.nb <- predict(violent.recid.nb, newdata =v.recid.test)$posterior[,2]
classMetrics(violent.confusion.nb, v.recid.test$violent.recid, cutoff=.5, type = "all")
## $conf.mat
##          observed
## predicted   0   1
##         0 744  11
##         1   0 140
## 
## $perf
##                 value
## accuracy    0.9877095
## sensitivity 0.9271523
## specificity 1.0000000
## ppv         1.0000000
## npv         0.9854305
## precision   1.0000000
## recall      0.9271523
#Produce ROC Curve. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Here we see the trade-off between sensitivity and specificity with this model, and that we get an AUC value of 1. 

violent.roc.nb.r <- roc(v.recid.test$violent.recid, violent.confusion.nb)         
plot(violent.roc.nb.r)

print(violent.roc.nb.r$auc)
## Area under the curve: 1

Logistic Regression

Probability Histograms and Model Size Plots

From our probability histogram, we can see how the logistic regression model assessed violent recidivism. Based on the plot, the model classifies fewer than 20% of individuals as likely to commit a violent recidivism event, and the counts at higher predicted probabilities are fairly small, at fewer than 200 individuals.

For our model size plot, we see that RSS decreases as we increase the number of predictors. A model size of 5 appears to be the best for classifying violent recidivism with logistic regression.

Assessment of Model Metrics

We took a similar approach for violent recidivism using logistic regression. To select appropriate variables, we ran best subset selection with BIC to determine both the variables and an ideal model size. For our violent recidivism model, we found no noise variables and 6 variables that appear to be good predictors: sexMale, raceCaucasian, raceHispanic, priors_count, juv_other_count, and age.at.charge. To obtain the lowest BIC we would choose a model of size 5, although BIC remains relatively low even at a model size of 3. With a model of size 5, we would keep the following five variables: sexMale, priors_count, juv_misd_count, juv_other_count, and age.at.charge. We found that the accuracy of this model was 83.5%, with a specificity of 98.8% and a sensitivity of 7.9%.

ROC Curve

We ran a ROC curve and found that our AUC was 0.7216. In other words, the logistic regression model ranks a randomly chosen violent recidivist above a randomly chosen non-recidivist 72.16% of the time.

#Fit Logisitc Regression for Violent Recid
v.recid.glm <- glm(violent.recid ~ . - vr_charge_degree - two_year_recid, data = v.recid.train, family = binomial)

v.glm.model <- violent.recid ~ . - vr_charge_degree - two_year_recid
v.train.dataglm <- v.recid.train
v.testDataglm <- v.recid.test
mse <- numeric(10)
w.mse <- numeric(10)
v.logrecid.model <- v.glm.model

#Randomly shuffle the data
v.train.dataglm<-v.train.dataglm[sample(nrow(v.train.dataglm)),]
#Create 10 equally sized folds
folds <- cut(seq(1,nrow(v.train.dataglm)),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
for(i in 1:10){
  #Segment your data by fold using the which() function
  testIndexes <- which(folds==i,arr.ind=TRUE)
  v.testDataglm <- v.train.dataglm[testIndexes, ]
  v.trainDataglm <- v.train.dataglm[-testIndexes, ]
  
  v.logrecid.model.trained <- glm(v.logrecid.model, data = v.trainDataglm, family = binomial)
  v.logrecid.preds <- predict(v.logrecid.model.trained, v.testDataglm, type = "response")
  mse[i] <- mean((as.numeric(v.testDataglm$violent.recid)  - v.logrecid.preds)^2)
  
  w.mse[i] <- mean((length(v.testDataglm$violent.recid)*mse[i])/(length(v.testDataglm$violent.recid)+length(v.trainDataglm$violent.recid)))

}

mse.logi.v <- sum(w.mse)/ length(w.mse)






#Table of statistically significant variables for LogReg: Violent Recid
#We see only 6 appear significant and none are noise variables.
signif.names.vrecid <- names(which(coef(summary(v.recid.glm))[,"Pr(>|z|)"] < 0.05))
kable(signif.names.vrecid)
x
sexMale
raceCaucasian
raceHispanic
priors_count
juv_other_count
age.at.charge
#Confusion Matrix for LogReg: Violent Recid with .5 cutoff
table(round(v.recid.glm$fitted.values), v.recid.glm$y)
##    
##        0    1
##   0 2869  578
##   1   59   74
#Est. Probabilities Histogram for LogReg: Violent Recid with .5 cutoff
qplot(x = v.recid.glm$fitted.values)

v.recid.subset <- regsubsets(violent.recid ~ .-vr_charge_degree-two_year_recid,
               data = v.recid.train,
               nbest = 1,    # 1 best model for each number of predictors
               nvmax = 5,    # NULL for no limit on number of variables
               method = "exhaustive", really.big = TRUE)

sum.v.recid.subset<-summary(v.recid.subset)

sum.v.recid.subset
## Subset selection object
## Call: regsubsets.formula(violent.recid ~ . - vr_charge_degree - two_year_recid, 
##     data = v.recid.train, nbest = 1, nvmax = 5, method = "exhaustive", 
##     really.big = TRUE)
## 15 Variables  (and intercept)
##                     Forced in Forced out
## sexMale                 FALSE      FALSE
## raceAsian               FALSE      FALSE
## raceCaucasian           FALSE      FALSE
## raceHispanic            FALSE      FALSE
## raceNative American     FALSE      FALSE
## raceOther               FALSE      FALSE
## priors_count            FALSE      FALSE
## juv_fel_count           FALSE      FALSE
## juv_misd_count          FALSE      FALSE
## juv_other_count         FALSE      FALSE
## c_charge_degreeM        FALSE      FALSE
## age.at.charge           FALSE      FALSE
## age.cat(24,31]          FALSE      FALSE
## age.cat(31,42]          FALSE      FALSE
## age.cat(42,80]          FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
##          sexMale raceAsian raceCaucasian raceHispanic raceNative American
## 1  ( 1 ) " "     " "       " "           " "          " "                
## 2  ( 1 ) " "     " "       " "           " "          " "                
## 3  ( 1 ) " "     " "       " "           " "          " "                
## 4  ( 1 ) "*"     " "       " "           " "          " "                
## 5  ( 1 ) "*"     " "       " "           " "          " "                
##          raceOther priors_count juv_fel_count juv_misd_count
## 1  ( 1 ) " "       "*"          " "           " "           
## 2  ( 1 ) " "       "*"          " "           " "           
## 3  ( 1 ) " "       "*"          " "           " "           
## 4  ( 1 ) " "       "*"          " "           " "           
## 5  ( 1 ) " "       "*"          " "           "*"           
##          juv_other_count c_charge_degreeM age.at.charge age.cat(24,31]
## 1  ( 1 ) " "             " "              " "           " "           
## 2  ( 1 ) " "             " "              "*"           " "           
## 3  ( 1 ) "*"             " "              "*"           " "           
## 4  ( 1 ) "*"             " "              "*"           " "           
## 5  ( 1 ) "*"             " "              "*"           " "           
##          age.cat(31,42] age.cat(42,80]
## 1  ( 1 ) " "            " "           
## 2  ( 1 ) " "            " "           
## 3  ( 1 ) " "            " "           
## 4  ( 1 ) " "            " "           
## 5  ( 1 ) " "            " "
#We run BIC to check for ideal model size, opting for this over AIC to get a stronger penalty for model complexity. To have the lowest BIC, we would choose model size 5, however this model still keeps a relatively low BIC even at variable size of 3. With a model of size 5, we would keep the following five variables in our model: sexMale, priors_count, juv_misd_count, juv_other_count, age.at.charge.

num.vars.v <- length(sum.v.recid.subset$rsq)
qplot(1:num.vars.v, sum.v.recid.subset$bic, xlab = "Model size", ylab = "BIC")

which.min(sum.v.recid.subset$bic)
## [1] 5
# Print-out of the variables in each model.
lapply(coef(v.recid.subset, 1:5), FUN = names)
## [[1]]
## [1] "(Intercept)"  "priors_count"
## 
## [[2]]
## [1] "(Intercept)"   "priors_count"  "age.at.charge"
## 
## [[3]]
## [1] "(Intercept)"     "priors_count"    "juv_other_count" "age.at.charge"  
## 
## [[4]]
## [1] "(Intercept)"     "sexMale"         "priors_count"    "juv_other_count"
## [5] "age.at.charge"  
## 
## [[5]]
## [1] "(Intercept)"     "sexMale"         "priors_count"    "juv_misd_count" 
## [5] "juv_other_count" "age.at.charge"
#Print confusion matrix comparing violent recid prediction accuracy against TEST set
v.recid.glm.test <- glm(violent.recid ~ .-vr_charge_degree-two_year_recid,  data = v.recid.test, family = "binomial")

classMetrics(v.recid.glm.test$fitted.values, v.recid.test$violent.recid, cutoff=0.5, type = "all")
## $conf.mat
##          observed
## predicted   0   1
##         0 735 139
##         1   9  12
## 
## $perf
##                 value
## accuracy    0.8346369
## sensitivity 0.0794702
## specificity 0.9879032
## ppv         0.5714286
## npv         0.8409611
## precision   0.5714286
## recall      0.0794702
#Produce ROC Curve. This time the area under the curve is 0.7216.
roc.glm.vr <-roc(v.recid.test$violent.recid, v.recid.glm.test$fitted.values)
plot(roc.glm.vr)

print(roc.glm.vr$auc)
## Area under the curve: 0.7216

Random Forest

Variable Importance Plot

Our variable importance plot shows the variables that most help classify whether or not someone will violently recidivate. Similar to the two-year recidivism model, the top three are age.at.charge, priors_count, and race.

Assessment of Model Metrics

We see that our accuracy for this model is high at 82.8%, with a sensitivity of 13.0% and a specificity of 95.5%. This model seems to balance maintaining a high specificity while still having a decent sensitivity. Additionally, there are 34 false positives and 120 false negatives.

ROC Curve

We ran a ROC curve and found that our AUC was 0.6717. In other words, the random forest model ranks a randomly chosen violent recidivist above a randomly chosen non-recidivist about 67.2% of the time.

rf.model.v <- violent.recid ~ .-vr_charge_degree-two_year_recid
train.data <- v.recid.train
testData <- v.recid.test
mse <- numeric(10)
w.mse <- numeric(10)
model <- rf.model.v

#Randomly shuffle the data
train.data<-train.data[sample(nrow(train.data)),]
#Create 10 equally sized folds
folds <- cut(seq(1,nrow(train.data)),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
for(i in 1:10){
  #Segment your data by fold using the which() function
  testIndexes <- which(folds==i,arr.ind=TRUE)
  testData <- train.data[testIndexes, ]
  trainData <- train.data[-testIndexes, ]
  
  model.trained <- randomForest(model, data = trainData)
  preds <- predict(model.trained, testData, type = "prob")[,2]
  mse[i] <- mean((as.numeric(testData$violent.recid)  - preds)^2)
  
  w.mse[i] <- mean((length(testData$violent.recid)*mse[i])/(length(testData$violent.recid)+length(trainData$violent.recid)))
  
}

mse.rf.v <- sum(w.mse)/length(w.mse)






# rf.model.v <- violent.recid ~ priors_count+age.at.charge
# train.data <- v.recid.train
# testData <- v.recid.test
# mse <- rep(1:10)
# w.mse <- rep(1:10)
# model <- rf.model.v
# 
# #Randomly shuffle the data
# train.data<-train.data[sample(nrow(train.data)),]
# #Create 10 equally size folds
# folds <- cut(seq(1,nrow(train.data)),breaks=10,labels=FALSE)
# #Perform 10 fold cross validation
# for(i in 1:10){
#   #Segement your data by fold using the which() function 
#   testIndexes <- which(folds==i,arr.ind=TRUE)
#   testData <- train.data[testIndexes, ]
#   trainData <- train.data[-testIndexes, ]
#   
#   model.trained <- randomForest(model, data = trainData)
#   preds <- predict(model.trained, testData, type = "prob")[,2]
#   mse[i] <- mean((as.numeric(testData$two_year_recid)  - preds)^2)
#   
#   w.mse[i] <- mean((length(testData$two_year_recid)*mse[i])/(length(testData$two_year_recid)+length(trainData$two_year_recid)))
#   
# }
# 
# mse.rf.v.short <- sum(w.mse)/length(w.mse)



#Violent recid prediction and ROC
model.rf.v <- randomForest(violent.recid ~ .-vr_charge_degree-two_year_recid,  data = v.recid.train)
varImpPlot(model.rf.v)

pred.rf.r.v <- predict(model.rf.v, newdata = v.recid.test, type = "prob")[,2]
classMetrics(pred.rf.r.v, v.recid.test$violent.recid, cutoff=0.5, type="all")
## $conf.mat
##          observed
## predicted   0   1
##         0 725 119
##         1  32  19
## 
## $perf
##                 value
## accuracy    0.8312849
## sensitivity 0.1376812
## specificity 0.9577279
## ppv         0.3725490
## npv         0.8590047
## precision   0.3725490
## recall      0.1376812
roc.rf.r.v <-roc(v.recid.test$violent.recid, pred.rf.r.v)
plot(roc.rf.r.v)

print(roc.rf.r.v$auc)
## Area under the curve: 0.6717
#Violent recid short-model prediction and ROC

# model.rf.short.v <- randomForest(two_year_recid ~ priors_count+age.at.charge,  data = v.recid.train)
# pred.rf.r.short.v <- predict(model.rf.short, newdata = v.recid.test, type = "prob")[,2]
# roc.rf.r.short.v <-roc(v.recid.test$violent.recid, pred.rf.r)





#qplot(x = r.pred, fill = recid.test$two_year_recid, geom = "histogram")

Summary of Violent Recidivism Models

Based on the assessments above and the table below, we found that the random forest model was the best performer. It had the high specificity we sought while maintaining a higher sensitivity than the other models. If we look specifically at misclassification error (1 - accuracy), Naive Bayes misclassified 1.2% of test cases, logistic regression 16.5%, and random forest 17.2%. By that measure alone, Naive Bayes appears to classify individuals who may violently recidivate better than the other models. However, we suspect this value is so low because the model assumes the variables do not interact with one another. Random forest, on the other hand, had strong accuracy and specificity overall.

nb.metrics.v <- classMetrics(predict(violent.recid.nb, newdata =v.recid.test)$posterior[,2], v.recid.test$violent.recid, cutoff=0.5, type = "all")
logi.metric.v <- classMetrics(v.recid.glm.test$fitted.values, v.recid.test$violent.recid, cutoff=0.5, type = "all")
rf.metric.v <- classMetrics(predict(model.rf.v, newdata = v.recid.test, type = "prob")[,2], v.recid.test$violent.recid, cutoff=0.5, type = "all")

rf.table.v <- c('Random Forest', 
               round(mse.rf.v, 3),
               round(roc.rf.r.v$auc[1], 3),
               round(rf.metric.v$perf[3,1], 3),
               round(rf.metric.v$perf[2,1], 3),
               round(rf.metric.v$perf[1,1], 3))
nb.table.v <- c('Naive Bayes', 
                round(mse.nb.v, 3), 
                round(violent.roc.nb.r$auc[1], 3),
                round(nb.metrics.v$perf[3,1], 3),
                round(nb.metrics.v$perf[2,1], 3),
                round(nb.metrics.v$perf[1,1], 3))
logi.table.v <- c('Logistic Regression', 
                  round(mse.logi.v, 3),
                  round(roc.glm.vr$auc[1], 3),
                  round(logi.metric.v$perf[3,1], 3),
                  round(logi.metric.v$perf[2,1], 3), 
                  round(logi.metric.v$perf[1,1], 3))




table.values.v <- rbind(rf.table.v,logi.table.v, nb.table.v)
df.values.models.v<-as.data.frame(table.values.v)

kable(df.values.models.v, row.names = FALSE, col.names = c("Model", "CV MSE", "AUC", "Specificity", "Sensitivity", "Accuracy" ))
Model                 CV MSE   AUC      Specificity   Sensitivity   Accuracy
Random Forest         0.125    0.6717   1             0.927         0.988
Logistic Regression   0.112    0.722    0.961         0.146         0.823
Naive Bayes           0.13     1        0.988         0.079         0.835
ROC Curves Overlay

Here we overlay the ROC curves for each of our models, first for two-year recidivism and then for violent recidivism. This reinforced our decision that random forest is the best model for this classification task.

#Overlay all ROC curves for the paper (blue = Naive Bayes, red = logistic regression, green = random forest)

plot(roc.nb.r, col="blue")
plot(roc.glm.r,add = TRUE, col="red")
plot(roc.rf.r,add = TRUE, col="green")

plot(violent.roc.nb.r, col="blue")
plot(roc.glm.vr, add= TRUE, col="red")
plot(roc.rf.r.v, add=TRUE, col="green")

Evaluating Accuracy and Bias

For both our two-year recidivism model and our violent recidivism model, the most important predictors were by far priors_count and age.at.charge. It is intuitive that prior arrest count in particular would make a good predictor.

Before running the analysis below, we removed observations for which the COMPAS is_recid score was equal to -1 or N/A, keeping only the valid values of 0 or 1 (a sketch of this step is shown below). This allowed us to compare our predictions and COMPAS's against observed outcomes to analyze both bias and predictive accuracy. We analyzed bias and predictive accuracy across subgroups by race, sex, and age.
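That filtering happens in a chunk that is not echoed; a sketch of it, assuming the is_recid column from the raw COMPAS file, is simply:

# Keep only rows where COMPAS's is_recid flag is a valid 0 or 1 (drops -1 and NA)
recidivism.raw <- subset(recidivism.raw, is_recid %in% c(0, 1))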

model.final <- randomForest(two_year_recid ~ .,  data = recid.train) 

#convert two_year_recid from numeric to factor
recidivism.raw <- mutate(recidivism.raw, two_year_recid = as.factor(two_year_recid)) 


#create age at the time the individual was charged
recidivism.raw$age.at.charge <- round(as.numeric(as.Date(recidivism.raw$c_jail_in) - as.Date(recidivism.raw$dob))/365, 0)

    
#create 4 equal-sized age categories
recidivism.raw$age.cat <- cut_number(recidivism.raw$age.at.charge, 4)

recidivism.final <- mutate(recidivism.raw, PJE.RIA = predict(model.final, newdata = recidivism.raw, type = "prob")[,2])
v.model.final <- randomForest(violent.recid ~ .-vr_charge_degree-two_year_recid,  data = v.recid.train) 

recidivism.final <- mutate(recidivism.final, PJE.RIA.v = predict(v.model.final, newdata = recidivism.raw, type = "prob")[,2])

Bias

Using logistic regression, we analyze how much bias exists in both of our models. We find the following.

Regarding bias in our two-year recidivism model: at first glance, black defendants appear to be 17% more likely than white defendants to receive a higher score after controlling for the seriousness of their crime, prior arrest count, and future criminal activity, and Native Americans initially appear to be 5.5 times as likely to receive a high score controlling for the same factors. However, neither of these race coefficients is statistically significant, so we cannot conclude from this table that there is racial bias in our model. Regarding bias by sex, we see that women are roughly 40% less likely than men to get a high score (in other words, only 60% as likely as men), after controlling for the elements above. Our model also indicates that varying degrees of bias do exist against those aged under 24. Controlling for the above elements, those aged 24-31 are roughly half as likely to get a high score as those under 24; the probability gets smaller for those aged 32-42, and smaller still for those aged 42-80 (who are only 2% as likely to get a high score as those under 24).

Regarding bias in our violent recidivism model: we again see that the race factors have coefficients that are not statistically significant (at the 0.01 level), with the exception of Asian. Here, Asian defendants appear to be 8.5 times more likely than white defendants to receive a higher score, after controlling for the seriousness of their crime, prior arrest count, and future criminal activity. We see the same trends discussed above for sex and age bias in our violent recidivism model, with roughly the same coefficients.
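The percentage and "times as likely" figures quoted above are read off odds ratios, i.e. the exponentiated coefficients of the logistic bias models fitted in the chunks below; the conversion itself is just:

# Odds ratios relative to the reference categories (male, under 24, Caucasian),
# computed after `model` is fit below
round(exp(coef(model)), 2)
round(exp(confint.default(model)), 2)   # Wald-style confidence intervals for the odds ratios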

# Most important predictors for 2-year recid 
varImpPlot(model.final)

#Most important predictors for violent recid
varImpPlot(v.model.final)

#ANALYZING OUR PERFORMANCE ON 2 YEAR RECIDIVISM
# Run LogReg for coeffs: to see bias holding all else constant : ln 16-17 ProP Github (across sex, age.cat, race)
# Calculate concordance to see predictive accuracy: ln 31-31 ProP (across sex, age.cat, race)
# Pull False Positive Rates by subgroup: ln 50-54 ProP (across sex, age.cat, race) >> or can we xtab confusion matrix FP rate across the categories?


#ANALYZING OUR PERFORMANCE ON VIOLENT RECIDIVISM
### same as above for violent


#Our 2 year recidivism RAI
recidivism.final$PJE.RIA.level = cut(recidivism.final$PJE.RIA, breaks=c(-1,.5,.8,1), labels=c("Low","Medium","High"))

recidivism.final <- subset(recidivism.final, age.at.charge != " ")
recidivism.final <- subset(recidivism.final, score_text != "N/A")

recidivism.final$score_text <- factor(recidivism.final$score_text, c('High', 'Medium', 'Low'))
recidivism.final$PJE.RIA.level <- factor(recidivism.final$PJE.RIA.level, c('High', 'Medium', 'Low'))



df <- mutate(recidivism.final, crime_factor = factor(c_charge_degree)) %>%
      mutate(age_factor = as.factor(age.cat)) %>%
      within(age_factor <- relevel(age_factor, ref = 1)) %>%
      mutate(race_factor = factor(race)) %>%
      within(race_factor <- relevel(race_factor, ref = 3)) %>%
      mutate(gender_factor = factor(sex, labels= c("Female","Male"))) %>%
      within(gender_factor <- relevel(gender_factor, ref = 2)) %>%
      mutate(score_factor = factor(PJE.RIA.level != "Low", labels = c("LowScore","HighScore")))


model <- glm(score_factor ~ gender_factor + age_factor + race_factor +
                            priors_count + crime_factor + two_year_recid, family="binomial", data=df)
summary(model)
## 
## Call:
## glm(formula = score_factor ~ gender_factor + age_factor + race_factor + 
##     priors_count + crime_factor + two_year_recid, family = "binomial", 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7224  -0.0912  -0.0548  -0.0199   3.6870  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -5.78098    0.35968 -16.073  < 2e-16 ***
## gender_factorFemale         -0.13006    0.27901  -0.466   0.6411    
## age_factor(24,31]           -0.93196    0.21200  -4.396 1.10e-05 ***
## age_factor(31,42]           -2.45409    0.32378  -7.580 3.47e-14 ***
## age_factor(42,80]           -4.79991    0.63249  -7.589 3.23e-14 ***
## race_factorAfrican-American -0.09410    0.21822  -0.431   0.6663    
## race_factorAsian             1.63283    0.98524   1.657   0.0975 .  
## race_factorHispanic         -0.23390    0.42594  -0.549   0.5829    
## race_factorNative American   2.34160    1.41304   1.657   0.0975 .  
## race_factorOther             0.29947    0.44209   0.677   0.4982    
## priors_count                 0.30664    0.02202  13.924  < 2e-16 ***
## crime_factorM                0.30147    0.19047   1.583   0.1135    
## two_year_recid1              4.34402    0.28852  15.056  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1961.35  on 4469  degrees of freedom
## Residual deviance:  821.75  on 4457  degrees of freedom
## AIC: 847.75
## 
## Number of Fisher Scoring iterations: 9
#Our Violent Recidivism RAI
recidivism.final$PJE.RIA.level.v = cut(recidivism.final$PJE.RIA.v, breaks=c(-1,.5,.8,1), labels=c("Low","Medium","High"))

recidivism.final$v_score_text <- factor(recidivism.final$v_score_text, c('High', 'Medium', 'Low'))
recidivism.final$PJE.RIA.level.v <- factor(recidivism.final$PJE.RIA.level.v, c('High', 'Medium', 'Low'))

df.v <- mutate(recidivism.final, crime_factor = factor(c_charge_degree)) %>%
      mutate(age_factor = as.factor(age.cat)) %>%
      within(age_factor <- relevel(age_factor, ref = 1)) %>%
      mutate(race_factor = factor(race)) %>%
      within(race_factor <- relevel(race_factor, ref = 3)) %>%
      mutate(gender_factor = factor(sex, labels= c("Female","Male"))) %>%
      within(gender_factor <- relevel(gender_factor, ref = 2)) %>%
      mutate(score_factor = factor(PJE.RIA.level.v != "Low", labels = c("LowScore","HighScore")))


model.v <- glm(score_factor ~ gender_factor + age_factor + race_factor +
                            priors_count + crime_factor + is_violent_recid, family="binomial", data=df.v)
summary(model.v)
## 
## Call:
## glm(formula = score_factor ~ gender_factor + age_factor + race_factor + 
##     priors_count + crime_factor + is_violent_recid, family = "binomial", 
##     data = df.v)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7840  -0.1272  -0.0753  -0.0206   3.5665  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -5.02381    0.29032 -17.304  < 2e-16 ***
## gender_factorFemale         -0.64018    0.27048  -2.367  0.01794 *  
## age_factor(24,31]           -0.75392    0.19174  -3.932 8.42e-05 ***
## age_factor(31,42]           -2.79983    0.31261  -8.956  < 2e-16 ***
## age_factor(42,80]           -4.67329    0.53939  -8.664  < 2e-16 ***
## race_factorAfrican-American -0.10369    0.19995  -0.519  0.60407    
## race_factorAsian             2.72227    0.86599   3.144  0.00167 ** 
## race_factorHispanic          0.35878    0.34916   1.028  0.30415    
## race_factorNative American   2.43017    1.55852   1.559  0.11893    
## race_factorOther            -0.43495    0.45530  -0.955  0.33942    
## priors_count                 0.31381    0.02084  15.058  < 2e-16 ***
## crime_factorM                0.23542    0.17498   1.345  0.17849    
## is_violent_recid             3.93655    0.22552  17.456  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2298.96  on 4469  degrees of freedom
## Residual deviance:  977.61  on 4457  degrees of freedom
## AIC: 1003.6
## 
## Number of Fisher Scoring iterations: 8
# COMPAS 2-Year Recidivism RAI
df.c <- mutate(recidivism.final, crime_factor = factor(c_charge_degree)) %>%
      mutate(age_factor = as.factor(age.cat)) %>%
      within(age_factor <- relevel(age_factor, ref = 1)) %>%
      mutate(race_factor = factor(race)) %>%
      within(race_factor <- relevel(race_factor, ref = 3)) %>%
      mutate(gender_factor = factor(sex, labels= c("Female","Male"))) %>%
      within(gender_factor <- relevel(gender_factor, ref = 2)) %>%
      mutate(score_factor = factor(score_text != "Low", labels = c("LowScore","HighScore")))


model.c <- glm(score_factor ~ gender_factor + age_factor + race_factor +
                            priors_count + crime_factor + two_year_recid, family="binomial", data=df.c)
summary(model.c)
## 
## Call:
## glm(formula = score_factor ~ gender_factor + age_factor + race_factor + 
##     priors_count + crime_factor + two_year_recid, family = "binomial", 
##     data = df.c)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5185  -0.7782  -0.3985   0.7984   2.6270  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -0.49401    0.09263  -5.333 9.65e-08 ***
## gender_factorFemale          0.24117    0.09181   2.627 0.008619 ** 
## age_factor(24,31]           -1.08151    0.09613 -11.251  < 2e-16 ***
## age_factor(31,42]           -1.64254    0.10551 -15.567  < 2e-16 ***
## age_factor(42,80]           -2.53501    0.12848 -19.730  < 2e-16 ***
## race_factorAfrican-American  0.53591    0.08323   6.439 1.20e-10 ***
## race_factorAsian            -0.67692    0.59041  -1.147 0.251583    
## race_factorHispanic         -0.23111    0.14846  -1.557 0.119539    
## race_factorNative American   1.05954    0.79548   1.332 0.182877    
## race_factorOther            -0.67866    0.19346  -3.508 0.000451 ***
## priors_count                 0.28936    0.01392  20.789  < 2e-16 ***
## crime_factorM               -0.20489    0.07869  -2.604 0.009220 ** 
## two_year_recid1              0.92159    0.10116   9.110  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5912.1  on 4469  degrees of freedom
## Residual deviance: 4315.9  on 4457  degrees of freedom
## AIC: 4341.9
## 
## Number of Fisher Scoring iterations: 5
# COMPAS Violent Recidivism RAI
df.cv <- mutate(recidivism.final, crime_factor = factor(c_charge_degree)) %>%
      mutate(age_factor = as.factor(age.cat)) %>%
      within(age_factor <- relevel(age_factor, ref = 1)) %>%
      mutate(race_factor = factor(race)) %>%
      within(race_factor <- relevel(race_factor, ref = 3)) %>%
      mutate(gender_factor = factor(sex, labels= c("Female","Male"))) %>%
      within(gender_factor <- relevel(gender_factor, ref = 2)) %>%
      mutate(score_factor = factor(v_score_text != "Low", labels = c("LowScore","HighScore")))


model.cv <- glm(score_factor ~ gender_factor + age_factor + race_factor +
                            priors_count + crime_factor + is_violent_recid, family="binomial", data=df.cv)
summary(model.cv)
## 
## Call:
## glm(formula = score_factor ~ gender_factor + age_factor + race_factor + 
##     priors_count + crime_factor + is_violent_recid, family = "binomial", 
##     data = df.cv)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6841  -0.5595  -0.2818   0.5852   2.7849  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  0.08603    0.09970   0.863   0.3882    
## gender_factorFemale         -0.51592    0.10970  -4.703 2.56e-06 ***
## age_factor(24,31]           -2.09776    0.10295 -20.377  < 2e-16 ***
## age_factor(31,42]           -3.14220    0.13186 -23.831  < 2e-16 ***
## age_factor(42,80]           -4.14588    0.18464 -22.454  < 2e-16 ***
## race_factorAfrican-American  0.63245    0.09753   6.484 8.91e-11 ***
## race_factorAsian            -0.94510    0.67365  -1.403   0.1606    
## race_factorHispanic          0.04153    0.16846   0.247   0.8053    
## race_factorNative American   0.51779    0.87256   0.593   0.5529    
## race_factorOther            -0.17247    0.20535  -0.840   0.4010    
## priors_count                 0.14224    0.01084  13.117  < 2e-16 ***
## crime_factorM               -0.22390    0.08958  -2.499   0.0124 *  
## is_violent_recid             0.95945    0.10389   9.235  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5339.1  on 4469  degrees of freedom
## Residual deviance: 3510.6  on 4457  degrees of freedom
## AIC: 3536.6
## 
## Number of Fisher Scoring iterations: 6
#our recid racial bias -African American
control <- exp(-4.94314)/(1+ exp(-4.94314))
exp(0.16369)/(1-control) + (control*exp(0.16369))
## [1] 1.194591
#our recid racial bias - Native American
exp(1.74306)/(1-control + (control*exp(1.74306)))
## [1] 5.530159
#our recid gender bias
exp(-0.51769)/(1-control + (control*exp(-0.51769)))
## [1] 0.5976057
#our recid: ageism
#under 24 compared to 24-31
exp(-0.66056)/(1-control + (control*exp(-0.66056)))
## [1] 0.5183365
#under 24 compared to 42-80
exp(-3.78080)/(1-control + (control*exp(-3.78080)))
## [1] 0.02296335
#our violent recid racial bias -Asian
vcontrol <- exp(-4.49629)/(1+ exp(-4.49629))
exp(2.23788)/(1-vcontrol + (vcontrol*exp(2.23788)))
## [1] 8.581089
#our violent recid gender bias
exp(-0.62064)/(1-vcontrol + (vcontrol*exp(-0.62064)))
## [1] 0.5403556
#our violent recid: ageism
#under 24 compared to 24-31
exp(-0.66751)/(1-vcontrol + (vcontrol*exp(-0.66751)))
## [1] 0.5157542
#under 24 compared to 42-80
exp(-4.01727)/(1-vcontrol + (vcontrol*exp(-4.01727)))
## [1] 0.01819912
#Compas recid racial bias: race African American
ccontrol <- exp(-0.49401)/(1+ exp(-0.49401))
exp(0.53591)/(1-ccontrol) + (ccontrol*exp(0.53591))
## [1] 3.399418
#Compas recid  racial bias: race Other
exp(-0.67866)/(1-ccontrol + (ccontrol*exp(-0.67866)))
## [1] 0.623758
#Compas recid gender bias
exp(0.24117)/(1-ccontrol + (ccontrol*exp(0.24117)))
## [1] 1.153517
#Compas recid: ageism
#under 24 compared to 24-31
exp(-1.08151)/(1-ccontrol + (ccontrol*exp(-1.08151)))
## [1] 0.4523847
#under 24 compared to 42-80
exp(-2.53501)/(1-ccontrol + (ccontrol*exp(-2.53501)))
## [1] 0.1217364
#Compas violent recid racial bias: race African American
cvcontrol <- exp(0.08603)/(1+ exp(0.08603))
exp(0.63245)/(1-cvcontrol) + (cvcontrol*exp(0.63245))
## [1] 4.915094
#Compas violent recid gender bias
exp(-0.51592)/(1-cvcontrol + (cvcontrol*exp(-0.51592)))
## [1] 0.7558139
#Compas violent recid: ageism
#under 24 compared to 24-31
exp(-2.09776)/(1-cvcontrol + (cvcontrol*exp(-2.09776)))
## [1] 0.2262285
#under 24 compared to 42-80
exp(-4.14588)/(1-cvcontrol + (cvcontrol*exp(-4.14588)))
## [1] 0.03252008
summary(recidivism.final$race)
## African-American            Asian        Caucasian         Hispanic 
##             2129               27             1628              411 
##  Native American            Other 
##               10              265

Prediction Accuracy

Looking at the plotted ROC curve for our two-year recidivism model, we see that the curve hugs the vertical axis up to roughly the .35 sensitivity mark, classifying true positives nearly perfectly without compromising specificity or increasing the false positive rate.
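As a quick cross-check on the curve, the area under it can be computed with pROC (loaded earlier); a minimal sketch, assuming the fitted probabilities are stored in PJE.RIA:

# sketch: ROC and AUC for our two-year recidivism probabilities
roc.two.year <- roc(recidivism.final$two_year_recid, recidivism.final$PJE.RIA)
pROC::auc(roc.two.year)   # namespaced because glmnet masks auc()
plot(roc.two.year)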

We then consider confusion matrices for our RAIs to evaluate how accurately our predictions classified defendants into the high-, medium-, and low-risk recidivism levels. Anyone with a predicted probability of .5 or higher falls into the medium or high tier and is considered at risk of recidivism. If the observed outcome for such an observation was a “1”, we count the classification as correct; if the observed outcome was a “0”, a low-risk prediction is likewise counted as correct.
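A minimal sketch of that classification rule, computing the overall share of correct classifications for our two-year model:

# sketch: probability >= .5 counts as an at-risk (Medium/High) prediction; a prediction is
# correct when an at-risk defendant recidivated or a low-risk defendant did not
with(recidivism.final, {
  at.risk <- PJE.RIA >= 0.5
  correct <- (at.risk & two_year_recid == "1") | (!at.risk & two_year_recid == "0")
  mean(correct)   # overall classification accuracy
})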

For our two-year RAI we calculated the following. We correctly classified high-risk 96% of the time, medium-risk 92% of the time, and low-risk 88% of the time. With false positives being the costliest mistake for society, we sought high specificity, which we achieved at 99.5%. Given the prevalence of actual two-year recidivism in our dataset (20%), our model classified rather well. Our overall misclassification rate was 14.6%, and our lowest accuracy was at the low-risk level, where our model tended to classify defendants downward; this follows from our desire to minimize false positives.

For our violent recidivism RAI we calculated the following. We correctly classified high-risk 98% of the time, medium-risk 84% of the time, and low-risk 87% of the time. With false positives again the costliest mistake, we sought high specificity, which we achieved at 99%. The prevalence of actual violent recidivism in our dataset was 18%, and our model classified rather well. Our overall misclassification rate was 12.6%. As a caveat, we note that the costs of misclassification are higher for society with this RAI: the costs are large both for inaccurately predicting that someone will violently recidivate (a false positive) and for false negatives (thereby missing the opportunity to mitigate more severe, violent crime).
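The per-level accuracies and specificities quoted in the two paragraphs above can be read off the prediction-level cross-tabs; a minimal sketch (the helper name per.level is ours, for illustration):

# sketch: per-level accuracy (High/Medium correct when the defendant recidivated,
# Low correct when they did not) and specificity, read off a cross-tab
per.level <- function(tab) {
  c(high        = tab["High", "1"]   / sum(tab["High", ]),
    medium      = tab["Medium", "1"] / sum(tab["Medium", ]),
    low         = tab["Low", "0"]    / sum(tab["Low", ]),
    specificity = tab["Low", "0"]    / sum(tab[, "0"]))
}
per.level(xtabs(~ PJE.RIA.level + two_year_recid, data = recidivism.final))      # two-year RAI
per.level(xtabs(~ PJE.RIA.level.v + is_violent_recid, data = recidivism.final))  # violent RAI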

# part 4 from Steps.xls

# can run the code from above for COMPAS, and compare it to our values

# use confusion matrices to compare false positives and false negatives

recidivism.final <- mutate(recidivism.final, decile_score=decile_score*.1, v_decile_score=v_decile_score*.1 )



compasss.ROC <- plotClassMetrics(recidivism.final$decile_score, recidivism.final$two_year_recid, xvar = "specificity", yvar = "sensitivity",
                 flip.x = TRUE)

PJE.RIA.ROC <-plotClassMetrics(recidivism.final$PJE.RIA, recidivism.final$two_year_recid, xvar = "specificity", yvar = "sensitivity",
                 flip.x = TRUE)

compass.v.ROC <- plotClassMetrics(recidivism.final$v_decile_score, recidivism.final$is_violent_recid, xvar = "specificity", yvar = "sensitivity",
                 flip.x = TRUE)

PJE.RIA.V.ROC <- plotClassMetrics(recidivism.final$PJE.RIA.v, recidivism.final$is_violent_recid, xvar = "specificity", yvar = "sensitivity",
                 flip.x = TRUE)

grid.arrange(compasss.ROC$plot, PJE.RIA.ROC$plot,  ncol = 2, top="Two Year Recidivism Prediction ROC", 
               left="COMPAS Model", right="Our Model")

compas.table.v <- kable(xtabs(~v_score_text+is_violent_recid, data = recidivism.final), caption = "Violent Recidivism: COMPAS Prediction Categories Vs Actual Outcomes")

PJE.table.v <- kable(xtabs(~PJE.RIA.level.v+is_violent_recid, data = recidivism.final), caption = "Violent Recidivism: Our Model Prediction Categories Vs Actual Outcomes")


compass.table <- kable(xtabs(~score_text+two_year_recid, data = recidivism.final), caption = "2 Year Recidivism: COMPAS Prediction Categories Vs Actual Outcomes")

PJE.table <- kable(xtabs(~PJE.RIA.level+two_year_recid, data = recidivism.final), caption = "2 Year Recidivism: Our Model Prediction Categories Vs Actual Outcomes")



PJE.table

2 Year Recidivism: Our Model Prediction Categories Vs Actual Outcomes

                   0      1
  High             2     47
  Medium          17    190
  Low           3692    522

PJE.table.v

Violent Recidivism: Our Model Prediction Categories Vs Actual Outcomes

                   0      1
  High             1     67
  Medium          34    217
  Low           3632    519

compass.table

2 Year Recidivism: COMPAS Prediction Categories Vs Actual Outcomes

                   0      1
  High           350    272
  Medium         814    238
  Low           2547    249

compas.table.v

Violent Recidivism: COMPAS Prediction Categories Vs Actual Outcomes

                   0      1
  High           166    170
  Medium         674    262
  Low           2827    371

Comparing Our Model Performance to COMPAS

Bias

Looking at COMPAS’ two-year recidivism model, we see statistically significant bias across race, age, and sex. Black defendants are 3.4 times as likely as white defendants to receive a higher score, after controlling for the seriousness of their crime, prior arrest count, and future criminal activity. Interestingly, those in the race category “Other” are less likely than white defendants to receive a high score (only 62% as likely) after controlling for the same elements. Controlling for these elements, we also see the following: women are 15% more likely than men to get a high score, and those in the under-24 age group are far more likely to get a high score. Those aged 24-31 are only 45% as likely to get a high score as those under 24, and those aged 42-80 are only 12% as likely.

Assessing COMPAS’ violent recidivism model, we again see statistically significant racial bias with regard to African Americans: black defendants are 4.9 times as likely as white defendants to receive a high score for violent recidivism, after controlling for the seriousness of their crime, prior arrest count, and future criminal activity. Here, women are 25% less likely than men to get a high score, and the same age-bias trend is present: those aged 24-31 are only 23% as likely to get a high score as those under 24, and those aged 42-80 are only 3% as likely.

Considering the bias present in our models vis-a-vis COMPAS’, we conclude: (1) Our RAI for predicting two-year recidivism shows no statistically significant racial bias. COMPAS’ performs much worse here, with significant bias against black defendants, who are 3.4 times as likely as their white counterparts to get a high recidivism score. (2) Both RAIs for two-year recidivism exhibit bias across sexes, with ours showing more bias against men and COMPAS’ showing more bias against women. (3) Both RAIs for two-year recidivism show age bias against those under 24 years of age. At the most extreme end of the spectrum, ours is more favorable toward the eldest group (42-80), who are only 2% as likely to have a high score, compared to 12% as likely under COMPAS’ RAI. (4) Both RAIs for predicting violent recidivism show racial bias. Ours shows bias against Asians (to a degree of 8.5X), while COMPAS’ shows bias against African Americans (to a degree of 4.9X). Ours may perform poorly here because Asians are underrepresented in the analyzed dataset, with only 27 observations. In contrast, African Americans are widely represented with 2,129 observations, so the racial bias in COMPAS is surprising; moreover, the implications of an instrument showing such bias against a population that comprises nearly 48% of those judged (as in this case) are severe. (5) Both RAIs for violent recidivism show bias by sex and age. In these areas, however, COMPAS’ bias is smaller than ours: women are still less likely than men to have a high score, but the gap is narrower. The favorable age bias toward the eldest group is comparable, though ours is more favorable toward the 24-31 age group.

Prediction Accuracy

With its two-year recidivism model, COMPAS performed as follows. It correctly classified high-risk 44% of the time (compared to our 96%), medium-risk 23% of the time (compared to our 92%), and low-risk 91% of the time (slightly better than our 88%). Though false positives are the costliest mistake for society in this instance, COMPAS’ specificity was only 68% (compared to our 99.5%). COMPAS’ overall misclassification rate for two-year recidivism was 32% (over twice our misclassification rate of 14.6%).
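These per-level figures can be reproduced from the COMPAS cross-tab built above; a minimal sketch:

# sketch: per-level accuracy for the COMPAS two-year scores
# (High/Medium correct when the defendant recidivated, Low correct when they did not)
tab.c <- xtabs(~ score_text + two_year_recid, data = recidivism.final)
c(high   = tab.c["High", "1"]   / sum(tab.c["High", ]),
  medium = tab.c["Medium", "1"] / sum(tab.c["Medium", ]),
  low    = tab.c["Low", "0"]    / sum(tab.c["Low", ]))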

Across nearly every metric listed above, our RAI for predicting 2-year recidivism performed better than COMPAS’. This is largely because COMPAS vastly over-categorized people as high and medium risk when, in actuality, they did not recidivate as predicted.

With its violent recidivism RAI, COMPAS correctly classified high-risk 51% of the time (compared to our 98%), medium-risk only 28% of the time (compared to our 84%), and low-risk 88% of the time (compared to our 87%). Though false positives should again be considered the costliest mistake, COMPAS’ specificity in this model was 77% (compared to our 99%). COMPAS’ overall misclassification rate for violent recidivism was 27% (over twice our misclassification rate of 12.6%).

Again, across nearly every metric, our RAI for predicting violent recidivism performed starkly better than COMPAS’. (The only exception was low-risk classification, where COMPAS’ 1% advantage was negligible.) This is again largely because COMPAS vastly over-categorized people as being at high or medium risk of violent recidivism when, in fact, they did not recidivate as predicted.

Policy Implications for RAI Use & Recommendations

Through the above analysis, we see that despite actively trying to avoid it, bias does creep into algorithms and predictive models. While the bias of our RAIs was less extreme than COMPAS’ and we were able to avoid any statistically significant racial bias in our two-year recidivism model, we (and COMPAS) still had sex- and age-related bias. As such, our models were likelier to assign higher scores to people based on demographic traits beyond their control, even when controlling for variables like prior arrest count. Such bias, and the high costs of making false positives as explained throughout this report, illustrate the risks of using algorithms as risk assessment instruments in the criminal justice system.

We see that with careful model selection, tuning, and feature engineering, a wide array of techniques can be used to improve the performance of RAIs. We also see the importance of training models on data representative of the population for which the system will be deployed. The COMPAS tool, for instance, was not trained using the Broward County, Florida dataset we accessed via ProPublica; had COMPAS’ training data been representative of this population, its prediction performance might have been better. Instead, we found strikingly high misclassification rates when analyzing COMPAS’ models for predicting both 2-year recidivism and violent recidivism.

Commercial providers of RAI systems and public-sector procurement offices alike should follow best practices for reducing bias in machine learning systems. As mentioned, training sets should be representative of the populations for which a system will be deployed. Bias testing should be robust, conducted before deployment and iteratively thereafter, proactively and continually assessing accuracy and potential bias across demographics such as age, sex, and race. Procurement offices should avoid purchasing proprietary systems for which such testing and auditing is not permissible, especially in areas of high-stakes decision-making such as criminal justice. Finally, best practices for transparency, such as Explainable AI mechanisms, should accompany RAIs when used for punitive (rather than benevolent) purposes. Such a component could provide value to the stakeholders who use and are affected by these instruments, offering greater transparency into how scores are constructed and how they might err.