MITx Analytics Edge assignment

Predicting parole violators

In many criminal justice systems around the world, inmates deemed not to be a threat to society are released from prison under the parole system prior to completing their sentence. They are still considered to be serving their sentence while on parole, and they can be returned to prison if they violate the terms of their parole.

Parole boards are charged with identifying which inmates are good candidates for release on parole. They seek to release inmates who will not commit additional crimes after release. In this problem, we will build and validate a model that predicts if an inmate will violate the terms of his or her parole. Such a model could be useful to a parole board when deciding to approve or deny an application for parole.

For this prediction task, we will use data from the United States 2004 National Corrections Reporting Program, a nationwide census of parole releases that occurred during 2004. We limited our focus to parolees who served no more than 6 months in prison and whose maximum sentence for all charges did not exceed 18 months. The dataset contains all such parolees who either successfully completed their term of parole during 2004 or those who violated the terms of their parole during that year. The dataset contains the following variables:

(For MITx course ‘The Analytics Edge’ 15.071x)

Data

parole <- read.csv('parole.csv')

Summary

str(parole)
## 'data.frame':    675 obs. of  9 variables:
##  $ male             : int  1 0 1 1 1 1 1 0 0 1 ...
##  $ race             : int  1 1 2 1 2 2 1 1 1 2 ...
##  $ age              : num  33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
##  $ state            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ time.served      : num  5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
##  $ max.sentence     : int  18 12 12 18 12 18 18 12 13 12 ...
##  $ multiple.offenses: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ crime            : int  4 3 3 1 1 4 3 1 3 2 ...
##  $ violator         : int  0 0 0 0 0 0 0 0 0 0 ...
sum(parole$violator)
## [1] 78
# convert unordered factors with three or more levels into factors

parole$state <- as.factor(parole$state)
parole$crime <- as.factor(parole$crime)

summary(parole)
##       male             race            age        state    time.served   
##  Min.   :0.0000   Min.   :1.000   Min.   :18.40   1:143   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.000   1st Qu.:25.35   2:120   1st Qu.:3.250  
##  Median :1.0000   Median :1.000   Median :33.70   3: 82   Median :4.400  
##  Mean   :0.8074   Mean   :1.424   Mean   :34.51   4:330   Mean   :4.198  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:42.55           3rd Qu.:5.200  
##  Max.   :1.0000   Max.   :2.000   Max.   :67.00           Max.   :6.000  
##   max.sentence   multiple.offenses crime      violator     
##  Min.   : 1.00   Min.   :0.0000    1:315   Min.   :0.0000  
##  1st Qu.:12.00   1st Qu.:0.0000    2:106   1st Qu.:0.0000  
##  Median :12.00   Median :1.0000    3:153   Median :0.0000  
##  Mean   :13.06   Mean   :0.5363    4:101   Mean   :0.1156  
##  3rd Qu.:15.00   3rd Qu.:1.0000            3rd Qu.:0.0000  
##  Max.   :18.00   Max.   :1.0000            Max.   :1.0000
# create training and test sets

set.seed(144) # fix the random number generator

library(caTools) # provides splitting tools

split = sample.split(parole$violator, SplitRatio = 0.7) # ensure balance of violators in sets
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)

paste('Number in training set:', nrow(train))
## [1] "Number in training set: 473"
paste('Number in test set:', nrow(test))
## [1] "Number in test set: 202"
model <- glm(violator ~ ., data = train, family = 'binomial') # 'binomial' = logistic regression
summary(model)
## 
## Call:
## glm(formula = violator ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7041  -0.4236  -0.2719  -0.1690   2.8375  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -4.2411574  1.2938852  -3.278  0.00105 ** 
## male               0.3869904  0.4379613   0.884  0.37690    
## race               0.8867192  0.3950660   2.244  0.02480 *  
## age               -0.0001756  0.0160852  -0.011  0.99129    
## state2             0.4433007  0.4816619   0.920  0.35739    
## state3             0.8349797  0.5562704   1.501  0.13335    
## state4            -3.3967878  0.6115860  -5.554 2.79e-08 ***
## time.served       -0.1238867  0.1204230  -1.029  0.30359    
## max.sentence       0.0802954  0.0553747   1.450  0.14705    
## multiple.offenses  1.6119919  0.3853050   4.184 2.87e-05 ***
## crime2             0.6837143  0.5003550   1.366  0.17180    
## crime3            -0.2781054  0.4328356  -0.643  0.52054    
## crime4            -0.0117627  0.5713035  -0.021  0.98357    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 340.04  on 472  degrees of freedom
## Residual deviance: 251.48  on 460  degrees of freedom
## AIC: 277.48
## 
## Number of Fisher Scoring iterations: 6
paste('Co-efficient of multiple.crimes =', model$coefficients['multiple.offenses'][[1]])
## [1] "Co-efficient of multiple.crimes = 1.61199185947379"
paste('e^coefficient of multiple.crimes =', exp(model$coefficients['multiple.offenses'][[1]]))
## [1] "e^coefficient of multiple.crimes = 5.01278605570309"

Our model predicts that a parolee who committed multiple offences has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical.

Example

Consider a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny.

According to the model, what are the odds this individual is a violator?

logit <- model$coefficients['(Intercept)'][[1]] + model$coefficients['male'][[1]] + model$coefficients['race'][[1]] * 1 +
  model$coefficients['age'][[1]] * 50 + model$coefficients['time.served'][[1]] * 3 +
  model$coefficients['max.sentence'][[1]] * 12 + model$coefficients['crime2'][[1]]

paste('Logit :', logit)
## [1] "Logit : -1.70062980028895"
paste('Odds  :', exp(logit))
## [1] "Odds  : 0.182568506139459"
paste('P(y=1):', 1/(1+exp(-logit)))
## [1] "P(y=1): 0.154383027445455"

Predictions for test set

testprediction <- predict(model, newdata = test, type = 'response')
hist(testprediction)

paste('Maximum predicted probability of a violation', max(testprediction))
## [1] "Maximum predicted probability of a violation 0.907279069042028"

Accuracy of prediction on test set

Threshold 0.5

Note that ‘0’ and ‘1’ on the left-side of the table indicate whether the parolee violated the parole (‘1’) or not (‘0’).

testpredicttable <- table(test$violator, testprediction >= 0.5)
testpredicttable
##    
##     FALSE TRUE
##   0   167   12
##   1    11   12
paste('Sensitivity : ', testpredicttable['1', 'TRUE']/sum(testpredicttable['1',]))
## [1] "Sensitivity :  0.521739130434783"
paste('Specificity : ', testpredicttable['0', 'FALSE']/sum(testpredicttable['0',]))
## [1] "Specificity :  0.932960893854749"
paste('Accuracy    : ', (testpredicttable['1', 'TRUE'] + testpredicttable['0', 'FALSE'])/sum(testpredicttable[,]))
## [1] "Accuracy    :  0.886138613861386"
paste("Accuracy of 'simple' model which predicts that every parolee is a non-violator :", sum(train$violator == FALSE)/length(train$violator))
## [1] "Accuracy of 'simple' model which predicts that every parolee is a non-violator : 0.883720930232558"

The model at cutoff 0.5 has 12 false positives and 11 false negatives, while the baseline model has 0 false positives and 23 false negatives. Because a parole board is likely to assign more cost to a false negative, the model at cutoff 0.5 is likely of value to the board.

The parole board would likely benefit from decreasing the logistic regression cutoffs, which decreases the false negative rate while increasing the false positive rate.

Receiver Operater Characteristic (ROC) Curve

The ROC area under cuve is the probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.

The AUC deals with differentiating between a randomly selected positive and negative example. It is independent of the regression cutoff selected.

library(ROCR)
pred_ROCR <- prediction(testprediction, test$violator)
auc_ROCR <- performance(pred_ROCR, measure = 'auc')
plot(performance(pred_ROCR, measure = 'tpr', x.measure = 'fpr'), colorize = TRUE,  print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

paste('Area under Curve :', auc_ROCR@y.values[[1]])
## [1] "Area under Curve : 0.894583434539713"

The dataset contains all individuals released from parole in 2004, either due to completing their parole term or violating the terms of their parole. However, it does not contain parolees who neither violated their parole nor completed their term in 2004, causing non-violators to be underrepresented. This is called “selection bias” or “selecting on the dependent variable,” because only a subset of all relevant parolees were included in our analysis, based on our dependent variable in this analysis (parole violation).

A prospective dataset that tracks a cohort of parolees and observes the true outcome of each is more desirable. Unfortunately, such datasets are often more challenging to obtain (for instance, if a parolee had a 10-year term, it might require tracking that individual for 10 years before building the model). Such a prospective analysis would not be possible using the 2004 National Corrections Reporting Program dataset.