In many criminal justice systems around the world, inmates deemed not to be a threat to society are released from prison under the parole system prior to completing their sentence. They are still considered to be serving their sentence while on parole, and they can be returned to prison if they violate the terms of their parole.
Parole boards are charged with identifying which inmates are good candidates for release on parole. They seek to release inmates who will not commit additional crimes after release. In this problem, we will build and validate a model that predicts if an inmate will violate the terms of his or her parole. Such a model could be useful to a parole board when deciding to approve or deny an application for parole.
For this prediction task, we will use data from the United States 2004 National Corrections Reporting Program, a nationwide census of parole releases that occurred during 2004. We limited our focus to parolees who served no more than 6 months in prison and whose maximum sentence for all charges did not exceed 18 months. The dataset contains all such parolees who either successfully completed their term of parole during 2004 or those who violated the terms of their parole during that year. The dataset contains the following variables:
# Load the data
parole = read.csv("parole.csv")# Count the number of parolees
nrow(parole)
## [1] 675675 parolees.
# Count the number of violators
z = table(parole$violator)
kable(z)| Var1 | Freq |
|---|---|
| 0 | 597 |
| 1 | 78 |
78 parolees have violated the terms of their parole.
# Output the structure
str(parole)
## 'data.frame': 675 obs. of 9 variables:
## $ male : int 1 0 1 1 1 1 1 0 0 1 ...
## $ race : int 1 1 2 1 2 2 1 1 1 2 ...
## $ age : num 33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
## $ state : int 1 1 1 1 1 1 1 1 1 1 ...
## $ time.served : num 5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
## $ max.sentence : int 18 12 12 18 12 18 18 12 13 12 ...
## $ multiple.offenses: int 0 0 0 0 0 0 0 0 0 0 ...
## $ crime : int 4 3 3 1 1 4 3 1 3 2 ...
## $ violator : int 0 0 0 0 0 0 0 0 0 0 ...While the variables male, race, state, crime, and violator are all unordered factors, only state and crime have at least 3 levels in this dataset.
# Convert to Factor
parole$state = as.factor(parole$state)
parole$crime = as.factor(parole$crime)
# Output Summary
summary(parole$state)
## 1 2 3 4
## 143 120 82 330
summary(parole$crime)
## 1 2 3 4
## 315 106 153 101The output of summary(parole$state) or summary(parole$crime) now shows a breakdown of the number of parolees with each level of the factor, which is most similar to the output of the table() function.
# Split the data
set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)SplitRatio=0.7 causes split to take the value TRUE roughly 70% of the time, so train should contain roughly 70% of the values in the dataset.
The exact same training/testing set split as the first execution of [1]-[5] correct
A different training/testing set split from the first execution of [1]-[5]
A different training/testing set split from the first execution of [1]-[5]
Using glm (and remembering the parameter family=“binomial”), train a logistic regression model on the training set. Your dependent variable is “violator”, and you should use all of the other variables as independent variables.
# Logistic Regression
mod = glm(violator~., data=train, family="binomial")
# Output the summary
summary(mod)
##
## Call:
## glm(formula = violator ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7041 -0.4236 -0.2719 -0.1690 2.8375
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2411574 1.2938852 -3.278 0.00105 **
## male 0.3869904 0.4379613 0.884 0.37690
## race 0.8867192 0.3950660 2.244 0.02480 *
## age -0.0001756 0.0160852 -0.011 0.99129
## state2 0.4433007 0.4816619 0.920 0.35739
## state3 0.8349797 0.5562704 1.501 0.13335
## state4 -3.3967878 0.6115860 -5.554 0.0000000279 ***
## time.served -0.1238867 0.1204230 -1.029 0.30359
## max.sentence 0.0802954 0.0553747 1.450 0.14705
## multiple.offenses 1.6119919 0.3853050 4.184 0.0000286831 ***
## crime2 0.6837143 0.5003550 1.366 0.17180
## crime3 -0.2781054 0.4328356 -0.643 0.52054
## crime4 -0.0117627 0.5713035 -0.021 0.98357
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 340.04 on 472 degrees of freedom
## Residual deviance: 251.48 on 460 degrees of freedom
## AIC: 277.48
##
## Number of Fisher Scoring iterations: 6race, state4, and multiple.offenses are significant in this model.
Our model predicts that a parolee who committed multiple offenses has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical.
# Calculate log odds
male=1
race=1
age=50
state2=0
state3=0
state4=0
time.served=3
max.sentence=12
multiple.offenses=0
crime2=1
crime3=0
crime4=0
logodds =-4.2411574 + 0.3869904*male + 0.8867192*race - 0.0001756*age + 0.4433007*state2 + 0.8349797*state3 - 3.3967878*state4 - 0.1238867*time.served + 0.0802954*max.sentence + 1.6119919*multiple.offenses + 0.6837143*crime2 - 0.2781054*crime3 - 0.0117627*crime4
logodds
## [1] -1.700629
odds = exp(logodds)
odds
## [1] 0.1825687Odds = 0.1825687
# Calculate Probability
1/(1 + exp(-logodds))
## [1] 0.1543832Probability = 0.1543831
Use the predict() function to obtain the model’s predicted probabilities for parolees in the testing set, remembering to pass type=“response”.
# Make Predictions
predictions = predict(mod, newdata=test, type="response")# Output the summary
summary(predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002334 0.023777 0.057905 0.146576 0.147452 0.907279Max Probability = 0.907
# Model Predictions with threshold of 0.5
z = table(test$violator, as.numeric(predictions >= 0.5))
kable(z)| 0 | 1 | |
|---|---|---|
| 0 | 167 | 12 |
| 1 | 11 | 12 |
# Calculate Sensitivty
z[4]/(z[4]+z[2])
## [1] 0.5217391Sensitivity = 0.522
# Calculate specifcity
z[1]/(z[3]+z[1])
## [1] 0.9329609Specificity = 0.933
# Calculate accuracy
sum(diag(z))/sum(z)
## [1] 0.8861386Accuracy = 0.886
# Tabulate the baseline
z = table(test$violator)
kable(z)| Var1 | Freq |
|---|---|
| 0 | 179 |
| 1 | 23 |
z[1]/z[2]
## 0
## 7.782609Accuracy = 0.886
If the board used the model for parole decisions, a negative prediction would lead to a prisoner being granted parole, while a positive prediction would lead to a prisoner being denied parole. The parole board would experience more regret for releasing a prisoner who then violates parole (a negative prediction that is actually positive, or false negative) than it would experience for denying parole to a prisoner who would not have violated parole (a positive prediction that is actually negative, or false positive).
Decreasing the cutoff leads to more positive predictions, which increases false positives and decreases false negatives. Meanwhile, increasing the cutoff leads to more negative predictions, which increases false negatives and decreases false positives. The parole board assigns high cost to false negatives, and therefore should decrease the cutoff.
The model at cutoff 0.5 has 12 false positives and 11 false negatives, while the baseline model has 0 false positives and 23 false negatives. Because a parole board is likely to assign more cost to a false negative, the model at cutoff 0.5 is likely of value to the board.
From the previous question, the parole board would likely benefit from decreasing the logistic regression cutoffs, which decreases the false negative rate while increasing the false positive rate.
# Calculate AUC
library(ROCR)
pred = prediction(predictions, test$violator)
as.numeric(performance(pred, "auc")@y.values)
## [1] 0.8945834AUC = 0.8945834
The AUC deals with differentiating between a randomly selected positive and negative example. It is independent of the regression cutoff selected.
While expanding the dataset to include the missing parolees and labeling each as violator=0 would improve the representation of non-violators, it does not capture the true outcome, since the parolee might become a violator after 2004. Though labeling these new examples with violator=NA correctly identifies that we don’t know their true outcome, we cannot train or test a prediction model with a missing dependent variable.
As a result, a prospective dataset that tracks a cohort of parolees and observes the true outcome of each is more desirable. Unfortunately, such datasets are often more challenging to obtain (for instance, if a parolee had a 10-year term, it might require tracking that individual for 10 years before building the model). Such a prospective analysis would not be possible using the 2004 National Corrections Reporting Program dataset.