Setting Up the Environment

Before I start my work, I run a global controller file that houses my libraries, custom functions, and global options. This lets me work unimpeded.

setwd("/Users/cassandrabayer/Desktop/Estimating-Treatment-Effects")
options("scipen" =100, "digits" = 4)

# Load Packages -----------------------------------------------------------
# load basic packages
library(data.table)
library(tidyr)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
## Conflicts with tidy packages ----------------------------------------------
## between():   dplyr, data.table
## filter():    dplyr, stats
## first():     dplyr, data.table
## lag():       dplyr, stats
## last():      dplyr, data.table
## transpose(): purrr, data.table
# visualization packages
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(corrplot)
## corrplot 0.84 loaded
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
library(sjPlot)

# stats and prediction
library(stats)
library(forecast)
library(tseries)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:plotly':
## 
##     subplot
## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
library(MatchIt)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
# Model Selection/Validation
library(MASS)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loaded glmnet 2.0-5
library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(rpart)

#Dates
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following object is masked from 'package:base':
## 
##     date
# Text analysis
library(RSiteCatalyst)
library(RTextTools)
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
# spell check
bigText <- tolower(paste(readLines("http://www.norvig.com/big.txt"), collapse = " "))
sorted_words <- names(sort(table(strsplit(bigText, "[^a-z]+")), decreasing = TRUE))
# correct <- function(word) { c(sorted_words[ adist(word, sorted_words) <= min(adist(word, sorted_words), 2)], word)[1] }
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## 
## Attaching package: 'qdapRegex'
## The following objects are masked from 'package:dplyr':
## 
##     escape, explain
## The following object is masked from 'package:ggplot2':
## 
##     %+%
## Loading required package: qdapTools
## 
## Attaching package: 'qdapTools'
## The following object is masked from 'package:dplyr':
## 
##     id
## The following object is masked from 'package:data.table':
## 
##     shift
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following object is masked from 'package:Matrix':
## 
##     %&%
## The following object is masked from 'package:forecast':
## 
##     %>%
## The following object is masked from 'package:plotly':
## 
##     %>%
## The following object is masked from 'package:dplyr':
## 
##     %>%
## The following object is masked from 'package:purrr':
## 
##     %>%
## The following object is masked from 'package:tidyr':
## 
##     %>%
## The following object is masked from 'package:base':
## 
##     Filter
#devtools::install_github("ropensci/spelling")

# Load any custom functions -----------------------------------------------
# rowShift: shift a vector by shiftLen positions (positive looks ahead), padding with NA
rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA
  return(x[r])
}

# shift: like rowShift, but fills out-of-range positions with `pad`. Note that this
# masks data.table::shift (a lag by default); this version looks *ahead* by `offset`,
# which the re-arrest logic below relies on. It also replaces genuine NAs with `pad`.
shift <- function(x, offset = 1, pad = NA) {
  r <- (1 + offset):(length(x) + offset)
  r[r < 1] <- NA
  ans <- x[r]
  ans[is.na(ans)] <- pad
  return(ans)
}

# maxMissing: max() that returns NA rather than -Inf (plus a warning) when all values are NA
maxMissing <- function(x){
  if(all(is.na(x))){
    return(NA_real_)
  } else{
    return(max(x, na.rm = TRUE))
  }
}

# nwords: count words in a string; pseudo = TRUE counts any non-whitespace token,
# otherwise only alphabetic runs (str_count comes from stringr, loaded via tidyverse)
nwords <- function(string, pseudo = FALSE){
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}

# wordParser: split a word into a character vector of its individual letters
wordParser <- function(word){
  unlist(lapply(seq(nchar(word)), function(x) substr(word, x, x)))
}
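
A few quick sanity checks (illustrative inputs, with the expected return values in the comments) show how the helpers behave:

# Illustrative calls; each comment shows the expected return value
shift(c(1, 2, 3), offset = 1, pad = 0)        # 2 3 0 -- looks one row ahead, pads the tail
maxMissing(c(NA_real_, NA_real_))             # NA instead of -Inf plus a warning
nwords("logit models, 2 ways", pseudo = TRUE) # 4 -- counts non-whitespace tokens
wordParser("cat")                             # "c" "a" "t"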

Pre-Processing

Initially I load the data and convert the data frames to data.tables; I use the data.table package because it is much more performant. As a first pass, I look at the data type of each column in each table. Finding that the dates are coded as characters, I coerce them to the Date class, which allows me to manipulate them arithmetically, much as if they were integers.

# Load data  --------------------------------------------------------------
# Inital loading and basic cleaning
case <- as.data.table(read.csv("case.csv", stringsAsFactors = F))
demo <- as.data.table(read.csv("demo.csv",stringsAsFactors = F))
priorArrests <- as.data.table(read.csv("prior_arrests.csv", stringsAsFactors = F))

# Pre-processing/cleaning -------------------------------------------------------------------------------
## Get names and data types for all in environment
sapply(case, class)
##      caseid   person_id arrest_date dispos_date       treat 
##   "integer"   "integer" "character" "character"   "integer"
sapply(demo, class)
##   person_id        race      gender       bdate 
##   "integer" "character" "character" "character"
sapply(priorArrests, class)
##   person_id arrest_date 
##   "integer" "character"
## Update data types
case[ , c("arrest_date", "dispos_date") := lapply(.SD, as.Date), .SDcols = c("arrest_date", "dispos_date")]
demo[, bdate := as.Date(bdate)]
priorArrests[, arrest_date := as.Date(arrest_date)]
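
One caveat worth noting: as.Date with no format argument assumes ISO "YYYY-MM-DD" (or "YYYY/MM/DD") strings, which these files evidently use. If the raw dates had been in another convention, the format would need to be spelled out; a hypothetical example:

# Hypothetical: US-style dates would need an explicit format string
as.Date("10/06/2012", format = "%m/%d/%Y") # returns "2012-10-06"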

Merges and Reshaping

Next, I tailor the data structure to the shape I need. While I want the case data to drive the shape of the table, I still need information from the priorArrests data to glean whether a person has had arrests beyond what's included in the case data. To do so, I create caseSmall, which is unique at the person_id-arrest_date level. Once I have that subsetted data table, I rbind it onto the full priorArrests, which is already unique at the same level. The bind gives me a complete data set of arrest dates by person. This will come in handy later when I need to calculate the number of prior arrests for each current arrest.

Since I have what I need from the priorArrests data in a self-contained table, I then merge the demo data onto the case data and order it by the person_id-arrest_date combination, which creates the full data set I will be working with. Once I do a quick check for missing data (and luckily there is none), I am free to move on.

## Merge 1: Get a comprehensive list of all arrests by person
caseSmall <- case[, .(person_id, arrest_date)]
totalArrests <- unique(rbind(caseSmall, priorArrests))
totalArrests <- totalArrests[order(person_id, arrest_date)]

## Merge 2: demo/case data together
setkey(case, person_id)
setkey(demo, person_id)
full <- demo[case]
full <- full[order(person_id, arrest_date)]

## Look for completeness of data before proceeding
sapply(full,function(x) sum(is.na(x)))
##   person_id        race      gender       bdate      caseid arrest_date 
##           0           0           0           0           0           0 
## dispos_date       treat 
##           0           0

Part 1: Data Management

Question 1

In order to create re_arrest, I need to find out whether a person was arrested between a given arrest and its disposition date. To do so, I first find the total number of arrests, totalArrests, that a person has within the case data. I then look for rows where a person has been arrested more than once (the flag cannot apply to someone arrested only once), where their arrest_date is less than the arrest_date directly below it (using the shift helper defined above, which looks one row ahead), and where that next arrest_date is also less than the dispos_date. The data.table package is a boon in this process as it operates row by row. Since the data is already ordered by date, I group the operation by person_id, which ensures the calculations are done per unique arrestee.
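
To make the mechanics concrete, here is a toy illustration with hypothetical dates. Note that it relies on the custom shift helper defined above, which looks one row ahead; data.table::shift, by contrast, lags by default.

# Toy example: person 1's first arrest is followed by a second arrest before disposition
toy <- data.table(person_id = c(1, 1),
                  arrest_date = as.Date(c("2012-01-01", "2012-02-01")),
                  dispos_date = as.Date(c("2012-06-01", "2012-08-01")))
toy[, re_arrest := 0]
toy[arrest_date < shift(arrest_date, 1) & shift(arrest_date, 1) < dispos_date,
    re_arrest := 1, by = person_id]
toy$re_arrest # 1 0 -- the earlier arrest is flagged; a person's last arrest never is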

I do a quick check by looking at all person_ids that have any flags for re_arrest. A handful of records show me that the calculation was performed properly.

full[, totalArrests := as.integer(uniqueN(arrest_date)), by = person_id]
full[, re_arrest := 0]
full[totalArrests > 1 & 
       (arrest_date < shift(arrest_date, 1) & 
          shift(arrest_date, 1) < dispos_date), 
     re_arrest := 1, by = person_id]

## Check my work
re_arrests <- full[re_arrest==1, unique(person_id)]
full[person_id %in% re_arrests]
##        person_id     race gender      bdate caseid arrest_date dispos_date
##     1:         8    BLACK      M 1978-04-04   2913  2012-10-06  2013-12-29
##     2:         8    BLACK      M 1978-04-04   6304  2013-04-06  2013-07-07
##     3:        13    BLACK      M 1963-04-27  88153  2012-10-29  2013-01-21
##     4:        13    BLACK      M 1963-04-27  40377  2012-11-04  2013-03-05
##     5:        13    BLACK      M 1963-04-27  39447  2013-07-31  2013-08-16
##    ---                                                                    
## 11925:     19983 HISPANIC      M 1967-11-16  14837  2013-07-01  2013-09-24
## 11926:     19987 HISPANIC      M 1972-08-28  12170  2012-03-01  2013-10-18
## 11927:     19987 HISPANIC      M 1972-08-28  92133  2012-03-27  2012-04-03
## 11928:     19988    WHITE      M 1973-05-02  80147  2012-11-03  2014-02-09
## 11929:     19988    WHITE      M 1973-05-02   2301  2013-04-25  2014-05-19
##        treat totalArrests re_arrest
##     1:     1            2         1
##     2:     0            2         0
##     3:     1            3         1
##     4:     1            3         0
##     5:     1            3         0
##    ---                             
## 11925:     1            3         0
## 11926:     0            2         1
## 11927:     1            2         0
## 11928:     1            2         1
## 11929:     0            2         0

Question 2

To generate prior_arrests, I use the subsetted totalArrests table generated earlier. I first create an index that itemizes the records, take a cumulative sum of that index, and subtract one to get the number of prior arrests as of each current arrest. Once I generate that figure, I drop the index and merge the prior_arrests variable onto the full data set. I set person_id and arrest_date as the merge key to ensure each count of prior arrests is aligned to the proper record.

totalArrests[, index := 1]
totalArrests[, prior_arrests := cumsum(index)-1, by = person_id]
totalArrests[, index := NULL]
setkey(totalArrests, person_id, arrest_date)
setkey(full, person_id, arrest_date)
full <- totalArrests[full]
head(full)
##    person_id arrest_date prior_arrests     race gender      bdate caseid
## 1:         1  2012-01-04             2 HISPANIC      F 1985-07-03  57514
## 2:         1  2012-07-11             3 HISPANIC      F 1985-07-03  39970
## 3:         1  2013-04-04             4 HISPANIC      F 1985-07-03  88413
## 4:         5  2012-03-31             2    BLACK      M 1986-09-27  40216
## 5:         6  2012-12-09             3    BLACK      M 1991-06-07  92255
## 6:         7  2012-02-25             0 HISPANIC      F 1994-08-24  26516
##    dispos_date treat totalArrests re_arrest
## 1:  2012-03-27     0            3         0
## 2:  2012-10-20     1            3         0
## 3:  2013-06-22     0            3         0
## 4:  2013-03-25     0            1         0
## 5:  2013-11-09     0            1         0
## 6:  2012-03-26     0            1         0
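
As an aside, since totalArrests is unique and already ordered within person, an equivalent and slightly more direct formulation uses data.table's per-group row counter:

# Equivalent to cumsum(index) - 1: zero-based row number within each person
totalArrests[, prior_arrests := seq_len(.N) - 1L, by = person_id]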

Question 3

Since I have already converted the data types and ordered the data, all I have to do is subtract bdate from arrest_date to get the age of the defendant at the time of each arrest. The difference between two Dates is automatically expressed in days, so once I divide by 365 and round to the nearest whole number, the age variable is in a useful format.

While I do group the age calculation by person_id, the by parameter is not actually necessary here, given that the calculation is row-wise in data.table; it just helps elucidate the intention of the calculation.

full[, age := arrest_date - bdate, by = person_id]
full[, age := as.integer(age)]
full[, age := round((age/365),0)]
head(full)
##    person_id arrest_date prior_arrests     race gender      bdate caseid
## 1:         1  2012-01-04             2 HISPANIC      F 1985-07-03  57514
## 2:         1  2012-07-11             3 HISPANIC      F 1985-07-03  39970
## 3:         1  2013-04-04             4 HISPANIC      F 1985-07-03  88413
## 4:         5  2012-03-31             2    BLACK      M 1986-09-27  40216
## 5:         6  2012-12-09             3    BLACK      M 1991-06-07  92255
## 6:         7  2012-02-25             0 HISPANIC      F 1994-08-24  26516
##    dispos_date treat totalArrests re_arrest age
## 1:  2012-03-27     0            3         0  27
## 2:  2012-10-20     1            3         0  27
## 3:  2013-06-22     0            3         0  28
## 4:  2013-03-25     0            1         0  26
## 5:  2013-11-09     0            1         0  22
## 6:  2012-03-26     0            1         0  18
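
Dividing by 365 is a close approximation but ignores leap years, and rounding can bump an age up rather than down. A sketch of a more exact alternative using lubridate (already loaded above; the ageExact column name is illustrative):

# Exact age in completed years via lubridate's interval arithmetic
full[, ageExact := floor(time_length(interval(bdate, arrest_date), unit = "years"))]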

Quick Look at Distribution

Before moving on, I want to know whether the treatment is evenly distributed across the sample. To do so, I generate some quick tabs to look at the breakdown by age, race, and gender. While this output is helpful, I also want to see how the data spread. I find that age is fairly well distributed, but there is more skew in race: a predominant share of defendants in the sample are black males.

I also want a quick glance at the correlations between variables. A correlation matrix shows me that there are no surprising or meaningful correlations outside the obvious one (arrest_date and dispos_date).

# Quick distributions 
full[, .N, by = race]
##        race     N
## 1: HISPANIC  6221
## 2:    BLACK 15126
## 3:    ASIAN  1239
## 4:    WHITE  2414
full[, .N, by = gender]
##    gender     N
## 1:      F  4936
## 2:      M 20064
# Handle the continuous variable a bit differently
summary(full$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    25.0    30.0    30.7    36.0    70.0
# structure the data for plotting
toPlot <- full[, .(arrestYear = year(arrest_date),
                   race = as.factor(race),
                   gender = as.factor(gender),
                   disposYear = year(dispos_date),
                   treat, totalArrests, re_arrest, age)]

# Facet-wrapped histograms for a quick at-a-glance view of all numeric variables
toPlot %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram() + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Get a correlation plot 
corData <- cor(toPlot[,.(arrestYear,
                         disposYear,
                         treat,
                         re_arrest,
                         totalArrests,age)])

corrplot(corData, 
         type = "upper", 
         order = "hclust", 
         tl.col = "black", 
         tl.srt = 45)

Part II: Statistical Analysis

Before diving in, I reset the categorical variables to factors and then drop the ID and date variables, since they will not be useful in any of the models.

Once cleaned up, I can partition the data into training and test sets. I randomly select 75% of my data as the training set and use the remaining rows as my test set. While I could simply take the first N rows, I want to ensure true randomness in my selection. To verify completeness, I do a quick check that the train and test row counts sum to the size of the original data set.

# new col and snapshot
full[, anyReArrest := max(re_arrest), by = person_id]
fullRaw <- copy(full) # copy() takes a true snapshot; plain assignment would alias the same data.table by reference

# clean up
full[, `:=`(race = as.factor(race),
            gender = as.factor(gender))]
full <- full[, .SD, .SDcols = !grepl("id|date", names(full))]

## split data: randomly sample 75% of rows for training, hold out the rest
set.seed(8675309)
trainSize <- floor(0.75 * nrow(full))
trainIdx <- sample(seq_len(nrow(full)), size = trainSize)
train <- full[trainIdx]
test <- full[-trainIdx]
nrow(train) + nrow(test) == nrow(full) # gut check
## [1] TRUE
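
If I were worried about the treatment share drifting between partitions, a stratified split would hold it fixed. A sketch, not used in the results that follow (trainIdxStrat and friends are illustrative names):

# Stratified 75/25 split: sample within each level of treat so proportions match
set.seed(8675309)
trainIdxStrat <- unlist(lapply(split(seq_len(nrow(full)), full$treat),
                               function(idx) sample(idx, size = floor(0.75 * length(idx)))))
trainStrat <- full[trainIdxStrat]
testStrat <- full[-trainIdxStrat]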

Model selection

Why a logistic regression?

I chose a logistic regression for a few reasons:
My dependent variable is binary.
I am assessing the impact of the treatment, not necessarily trying to predict what causes re_arrest.
By using stepwise model selection, I can see, variable by variable, the impact on the model and thus the incremental decrease in the explanatory power of treat.
The model allows me to understand whether the treatment is explaining variation in re_arrest or whether other independent variables are explaining it.

Model Application

The first model that I run is the basic y ~ x: the effect of treat on re_arrest. A simple correlation shows a weak positive relationship, which already seems suspect: it suggests that treatment is associated with a greater probability of getting arrested again prior to disposition.

Model 1

When I run the basic model (stored in an object I call lm, which admittedly shadows stats::lm), I find that the relationship is highly significant, but I suspect there is a lot of noise in the model.
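
Formally, this first specification models the probability of re-arrest as a logistic function of treatment alone:

\[ \Pr(\text{re\_arrest}_i = 1 \mid \text{treat}_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{treat}_i)}} \]

so the coefficient on treat is the change in the log-odds of re-arrest associated with receiving the treatment.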

# logit regression
## 1) Simple model to look at relationship between treatment and re_arrest
cor(x = train$treat, y = train$re_arrest)
## [1] 0.03079
lm <- glm(formula = re_arrest ~ treat,
          data = train,
          family = binomial(link = 'logit'))

summary(lm)
## 
## Call:
## glm(formula = re_arrest ~ treat, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.772  -0.772  -0.725  -0.725   1.712  
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)  -1.2022     0.0252  -47.73 < 0.0000000000000002 ***
## treat         0.1438     0.0341    4.21             0.000025 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20882  on 18749  degrees of freedom
## Residual deviance: 20864  on 18748  degrees of freedom
## AIC: 20868
## 
## Number of Fisher Scoring iterations: 4

Model 2

On the other end of the spectrum, I run a model with all variables included. While this poses the risk of overfitting, I want to get a sense of which other variables are likely driving variation in re_arrest outside of treatment. Sure enough, this model shows that the explanatory power of treat has all but deflated. We now see that the best predictors of re_arrest amongst the variables given are prior_arrests and age, with treat contributing little.

# 2) Full model with all vars --------------------------------------------------------------------------
lmFull <- glm(formula = re_arrest ~ .,
              data = train,
              family = binomial(link = 'logit'))

summary(lmFull)
## 
## Call:
## glm(formula = re_arrest ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7272  -0.0001  -0.0001  -0.0001   1.8774  
## 
## Coefficients:
##                Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)   -20.32322  107.71945   -0.19                0.85    
## prior_arrests  -0.24265    0.01427  -17.01 <0.0000000000000002 ***
## raceBLACK       0.04764    0.09958    0.48                0.63    
## raceHISPANIC    0.01519    0.10499    0.14                0.88    
## raceWHITE      -0.10100    0.11779   -0.86                0.39    
## genderM         0.00848    0.05520    0.15                0.88    
## treat          -0.04621    0.04550   -1.02                0.31    
## totalArrests    0.03228    0.02290    1.41                0.16    
## age             0.04774    0.00366   13.04 <0.0000000000000002 ***
## anyReArrest    19.86240  107.71936    0.18                0.85    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20882  on 18749  degrees of freedom
## Residual deviance: 12101  on 18740  degrees of freedom
## AIC: 12121
## 
## Number of Fisher Scoring iterations: 18

Model 3

Once I’ve run the fullest and sparsest models, I use the stepwise model selection process. stepAIC searches for the model with the lowest AIC, trading goodness of fit against model complexity, which serves as a practical stand-in for the bias-variance tradeoff. The chosen model keeps only the variables that pull their weight and discards the others to do away with noise.
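
Concretely, for each candidate model \(\mathrm{AIC} = 2k - 2\ln(\hat{L})\), where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood; lower is better, so an extra variable must improve the log-likelihood by more than one unit to survive.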

# Stepwise selection: at each step, keep the change that most lowers the AIC
lmStep <- stepAIC(lmFull, direction = "both", 
                  trace = T)
## Start:  AIC=12121
## re_arrest ~ prior_arrests + race + gender + treat + totalArrests + 
##     age + anyReArrest
## 
##                 Df Deviance   AIC
## - gender         1    12101 12119
## - race           3    12105 12119
## - treat          1    12102 12120
## - totalArrests   1    12103 12121
## <none>                12101 12121
## - age            1    12276 12294
## - prior_arrests  1    12405 12423
## - anyReArrest    1    18265 18283
## 
## Step:  AIC=12119
## re_arrest ~ prior_arrests + race + treat + totalArrests + age + 
##     anyReArrest
## 
##                 Df Deviance   AIC
## - race           3    12105 12117
## - treat          1    12102 12118
## - totalArrests   1    12103 12119
## <none>                12101 12119
## + gender         1    12101 12121
## - age            1    12276 12292
## - prior_arrests  1    12405 12421
## - anyReArrest    1    18267 18283
## 
## Step:  AIC=12117
## re_arrest ~ prior_arrests + treat + totalArrests + age + anyReArrest
## 
##                 Df Deviance   AIC
## - treat          1    12106 12116
## - totalArrests   1    12107 12117
## <none>                12105 12117
## + race           3    12101 12119
## + gender         1    12105 12119
## - age            1    12279 12289
## - prior_arrests  1    12408 12418
## - anyReArrest    1    18274 18284
## 
## Step:  AIC=12116
## re_arrest ~ prior_arrests + totalArrests + age + anyReArrest
## 
##                 Df Deviance   AIC
## - totalArrests   1    12108 12116
## <none>                12106 12116
## + treat          1    12105 12117
## + race           3    12102 12118
## + gender         1    12106 12118
## - age            1    12281 12289
## - prior_arrests  1    12433 12441
## - anyReArrest    1    18276 18284
## 
## Step:  AIC=12116
## re_arrest ~ prior_arrests + age + anyReArrest
## 
##                 Df Deviance   AIC
## <none>                12108 12116
## + totalArrests   1    12106 12116
## + treat          1    12107 12117
## + gender         1    12108 12118
## + race           3    12104 12118
## - age            1    12282 12288
## - prior_arrests  1    12443 12449
## - anyReArrest    1    20441 20447
summary(lmStep)
## 
## Call:
## glm(formula = re_arrest ~ prior_arrests + age + anyReArrest, 
##     family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7329  -0.0001  -0.0001  -0.0001   1.9024  
## 
## Coefficients:
##                Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)   -20.24019  107.74745   -0.19                0.85    
## prior_arrests  -0.24060    0.01353  -17.78 <0.0000000000000002 ***
## age             0.04671    0.00359   13.01 <0.0000000000000002 ***
## anyReArrest    19.90136  107.74742    0.18                0.85    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20882  on 18749  degrees of freedom
## Residual deviance: 12108  on 18746  degrees of freedom
## AIC: 12116
## 
## Number of Fisher Scoring iterations: 18

Model 4

Although I’m pleased with the above model, I’m curious whether adding some quadratic and interaction terms can improve it. I add an age2 (age squared) variable because I assume the relationship between age and re-arrest is not linear. Similarly with prior_arrests, I assume the probability of re_arrest increases nonlinearly with each additional prior arrest. I also add an interaction term to estimate the joint effect of age and prior_arrests.

full2 <- copy(full) # copy() takes a true snapshot; plain assignment would alias by reference
full2[, `:=`(age2 = age^2, # quadratic age term
             priorArrests2 = prior_arrests^2,
             agePriorArrests = age*prior_arrests)]

train2 <- full2[1:floor(0.75 * nrow(full2))] # note: a sequential split, unlike the random split above
test2 <- full2[(nrow(train2) + 1):nrow(full2)]
nrow(train2) + nrow(test2) == nrow(full2)
## [1] TRUE
## Get a second stepped model with the interaction and quadratic terms
lmFull2 <- glm(formula = re_arrest ~ .,
               data = train2,
               family = binomial(link = 'logit'))
summary(lmFull2)
## 
## Call:
## glm(formula = re_arrest ~ ., family = binomial(link = "logit"), 
##     data = train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6536  -0.0001  -0.0001  -0.0001   2.3069  
## 
## Coefficients:
##                                                    Estimate
## (Intercept)     -20.077401739617339870846990379504859447479
## prior_arrests    -0.157416939396000876705272730760043486953
## raceBLACK         0.015424424853648152980434282710575644160
## raceHISPANIC      0.012520087556050253640993474846254684962
## raceWHITE        -0.136827202715282342238012347479525487870
## genderM           0.008033435804685939637947988956057088217
## treat            -0.042369306965862732483607544509141007438
## totalArrests      0.039341997347624391156983847395167686045
## age               0.025729611649575087617414226315304404125
## anyReArrest      19.795318856157599896050669485703110694885
## age2             -0.000000000000000000000000000000000000118
## priorArrests2    -0.025288768607028708212958179046836448833
## agePriorArrests   0.004894435371759630823418341094566130778
##                                                  Std. Error z value
## (Intercept)     108.361814347061709895569947548210620880127   -0.19
## prior_arrests     0.039705403474358226534324245449170120992   -3.96
## raceBLACK         0.100567121342676688011685826040775282308    0.15
## raceHISPANIC      0.106011920748610544151802059786859899759    0.12
## raceWHITE         0.118167809947957655469785720470099477097   -1.16
## genderM           0.054532287879749313797006493587105069309    0.15
## treat             0.045612983981102230013338072467377060093   -0.93
## totalArrests      0.023058987087375543839629088438414328266    1.71
## age               0.009181488183546164180692805700800818158    2.80
## anyReArrest     108.361553230815502502082381397485733032227    0.18
## age2              0.000000000000000000000000000000000000114   -1.03
## priorArrests2     0.005548378336612514415138175394304198562   -4.56
## agePriorArrests   0.001698589067897110591737197538009240816    2.88
##                  Pr(>|z|)    
## (Intercept)        0.8530    
## prior_arrests   0.0000735 ***
## raceBLACK          0.8781    
## raceHISPANIC       0.9060    
## raceWHITE          0.2469    
## genderM            0.8829    
## treat              0.3529    
## totalArrests       0.0880 .  
## age                0.0051 ** 
## anyReArrest        0.8551    
## age2               0.3016    
## priorArrests2   0.0000052 ***
## agePriorArrests    0.0040 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20909  on 18749  degrees of freedom
## Residual deviance: 12137  on 18737  degrees of freedom
## AIC: 12163
## 
## Number of Fisher Scoring iterations: 18
lmStep2 <- stepAIC(lmFull2, direction = "both", 
                   trace = T)
## Start:  AIC=12163
## re_arrest ~ prior_arrests + race + gender + treat + totalArrests + 
##     age + anyReArrest + age2 + priorArrests2 + agePriorArrests
## 
##                   Df Deviance   AIC
## - gender           1    12137 12161
## - race             3    12141 12161
## - treat            1    12138 12162
## - age2             1    12138 12162
## <none>                  12137 12163
## - totalArrests     1    12140 12164
## - age              1    12145 12169
## - agePriorArrests  1    12145 12169
## - prior_arrests    1    12152 12176
## - priorArrests2    1    12158 12182
## - anyReArrest      1    18020 18044
## 
## Step:  AIC=12161
## re_arrest ~ prior_arrests + race + treat + totalArrests + age + 
##     anyReArrest + age2 + priorArrests2 + agePriorArrests
## 
##                   Df Deviance   AIC
## - race             3    12141 12159
## - treat            1    12138 12160
## - age2             1    12138 12160
## <none>                  12137 12161
## - totalArrests     1    12140 12162
## + gender           1    12137 12163
## - age              1    12145 12167
## - agePriorArrests  1    12145 12167
## - prior_arrests    1    12152 12174
## - priorArrests2    1    12158 12180
## - anyReArrest      1    18024 18046
## 
## Step:  AIC=12159
## re_arrest ~ prior_arrests + treat + totalArrests + age + anyReArrest + 
##     age2 + priorArrests2 + agePriorArrests
## 
##                   Df Deviance   AIC
## - treat            1    12142 12158
## - age2             1    12142 12158
## <none>                  12141 12159
## - totalArrests     1    12144 12160
## + race             3    12137 12161
## + gender           1    12141 12161
## - age              1    12149 12165
## - agePriorArrests  1    12150 12166
## - prior_arrests    1    12157 12173
## - priorArrests2    1    12163 12179
## - anyReArrest      1    18031 18047
## 
## Step:  AIC=12158
## re_arrest ~ prior_arrests + totalArrests + age + anyReArrest + 
##     age2 + priorArrests2 + agePriorArrests
## 
##                   Df Deviance   AIC
## - age2             1    12143 12157
## <none>                  12142 12158
## - totalArrests     1    12145 12159
## + treat            1    12141 12159
## + race             3    12138 12160
## + gender           1    12142 12160
## - age              1    12150 12164
## - agePriorArrests  1    12151 12165
## - prior_arrests    1    12159 12173
## - priorArrests2    1    12163 12177
## - anyReArrest      1    18037 18051
## 
## Step:  AIC=12157
## re_arrest ~ prior_arrests + totalArrests + age + anyReArrest + 
##     priorArrests2 + agePriorArrests
## 
##                   Df Deviance   AIC
## <none>                  12143 12157
## - totalArrests     1    12146 12158
## + age2             1    12142 12158
## + treat            1    12142 12158
## + race             3    12139 12159
## + gender           1    12143 12159
## - agePriorArrests  1    12151 12163
## - age              1    12152 12164
## - prior_arrests    1    12159 12171
## - priorArrests2    1    12163 12175
## - anyReArrest      1    18039 18051
summary(lmStep2)
## 
## Call:
## glm(formula = re_arrest ~ prior_arrests + totalArrests + age + 
##     anyReArrest + priorArrests2 + agePriorArrests, family = binomial(link = "logit"), 
##     data = train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6699  -0.0001  -0.0001  -0.0001   2.2936  
## 
## Coefficients:
##                  Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)     -20.11374  108.37903   -0.19    0.8528    
## prior_arrests    -0.15325    0.03812   -4.02 0.0000582 ***
## totalArrests      0.03626    0.02299    1.58    0.1148    
## age               0.02738    0.00890    3.08    0.0021 ** 
## anyReArrest      19.79921  108.37884    0.18    0.8550    
## priorArrests2    -0.02429    0.00548   -4.43 0.0000092 ***
## agePriorArrests   0.00442    0.00161    2.75    0.0060 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20909  on 18749  degrees of freedom
## Residual deviance: 12143  on 18743  degrees of freedom
## AIC: 12157
## 
## Number of Fisher Scoring iterations: 18

Model Validation

McFadden’s and Chi-Squared Tests as a gut check

While I can look at the regression output and glean a job well enough done, I turn to the Chi-squared test and McFadden’s R-squared (the closest thing to an R-squared for a logit regression). According to McFadden’s \(r^{2}\), the two stepwise models are nearly indistinguishable; the simpler model (0.4202) in fact edges out the one with interaction and quadratic terms (0.4192). I suspect the added terms buy little because they covary heavily with the plain variables. Similarly, the Chi-squared tests confirm that the results I’m seeing are not borne of randomness.
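
For reference, McFadden’s measure compares the fitted model’s log-likelihood to that of an intercept-only model:

\[ R^{2}_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{model}}}{\ln \hat{L}_{\text{null}}} \]

which checks out against the pR2 output below: \(1 - (-6053.77)/(-10440.94) \approx 0.4202\) for lmStep.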

# McFadden's R-Squared
pR2(lmStep)
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
##  -6053.7659 -10440.9400   8774.3482      0.4202      0.3737      0.5564
pR2(lmStep2)
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
##  -6071.5146 -10454.4205   8765.8118      0.4192      0.3734      0.5556
# Assess the deviance between our model and the null model
anova(lmStep, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: re_arrest
## 
## Terms added sequentially (first to last)
## 
## 
##               Df Deviance Resid. Df Resid. Dev             Pr(>Chi)    
## NULL                          18749      20882                         
## prior_arrests  1      424     18748      20458 < 0.0000000000000002 ***
## age            1       17     18747      20441              0.00004 ***
## anyReArrest    1     8333     18746      12108 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lmStep2, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: re_arrest
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev             Pr(>Chi)    
## NULL                            18749      20909                         
## prior_arrests    1      423     18748      20486 < 0.0000000000000002 ***
## totalArrests     1     1977     18747      18509 < 0.0000000000000002 ***
## age              1      224     18746      18285 < 0.0000000000000002 ***
## anyReArrest      1     6120     18745      12165 < 0.0000000000000002 ***
## priorArrests2    1       14     18744      12151              0.00014 ***
## agePriorArrests  1        8     18743      12143              0.00575 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bringing in the test data

I can now apply my models to the test data to see how I’ve done. I find that the second model is more accurate, but only narrowly. The predict function is a succinct way to generate predicted probabilities, which I then threshold and compare against the observed outcomes to roll up each model’s accuracy.

## Get probabilities for each obs that obs re_arrest = 1
lmStepProbs <- predict(lmStep,
                       newdata=test,
                       type='response')

lmStepProbs2 <- predict(lmStep2,
                        newdata=test2,
                        type='response')

## Convert probs > .5 to 1 and lower than .5 to 0
lmStepResults <- ifelse(lmStepProbs > 0.5,1,0)
lmStepResults2 <- ifelse(lmStepProbs2 > 0.5,1,0)

# Get our average
misClasificError <- mean(lmStepResults != test$re_arrest)
misClasificError2 <- mean(lmStepResults2 != test2$re_arrest)
paste0('The accuracy for model w/o interaction vars is ', 1-misClasificError,
       ' and the accuracy for model w/ interaction vars is ', 1-misClasificError2,
       if(misClasificError < misClasificError2){'. Model 1 is more accurate.'}
       else {'. Model 2 is more accurate.'})
## [1] "The accuracy for model w/o interaction vars is 0.80544 and the accuracy for model w/ interaction vars is 0.80848. Model 2 is more accurate."

Area under the ROC Curve

I want to see how well my model predicts positives, or re_arrests in this case. The ROC curve traces the tradeoff between the false positive rate and the true positive rate. My AUC is relatively strong at 0.88, but it still makes me wonder what else is going into the model that I could be missing; perhaps the assignment of the treatment itself.

pred <- prediction(lmStepProbs, test$re_arrest) # pred/perf avoid masking ROCR's own function names
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)

auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8829
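
For comparison, the same calculation can be run on the interaction model (a sketch reusing the probabilities computed above; pred2 is an illustrative name):

# AUC for the interaction model on its test set
pred2 <- prediction(lmStepProbs2, test2$re_arrest)
performance(pred2, measure = "auc")@y.values[[1]]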

Thinking about Treatment Assignment

The weak explanatory power of the model itself and of treat makes me wonder whether there is a systematic difference in who receives the treatment. I recognize that, more likely, there are omitted variables biasing the model that are not within the scope of this data, so I will only examine what is in my locus of control.

I first develop propensity scores for each record: the likelihood of re_arrest given the variables that have proven to have significant explanatory power. These scores allow me to “match” records of those in the treatment group with similar records outside the treatment group. The theory here is that two people who are, for all intents and purposes, the same should have the same likelihood of re_arrest given no treatment; thus, any change in the observed rate of re_arrest can be attributed to treat.

Using “nearest neighbors”, i.e. the closest propensity scores, roughly 12k records are matched: 6,081 in the treatment group paired with 6,081 outside it. Post-matching, the groups are far better balanced on the given metrics.
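
Formally, the propensity score for record \(i\) is the probability of belonging to the flagged group given its covariates,

\[ e(x_i) = \Pr(Z_i = 1 \mid x_i), \]

where \(Z_i\) is the left-hand-side indicator in the matchit formula and \(x_i\) the covariates; nearest-neighbor matching pairs each flagged record with the unflagged record whose estimated score is closest.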

pScores <- matchit(re_arrest ~ age + prior_arrests + treat, 
                   data = full,
                   method = "nearest", 
                   distance = "logit")

summary(pScores)
## 
## Call:
## matchit(formula = re_arrest ~ age + prior_arrests + treat, data = full, 
##     method = "nearest", distance = "logit")
## 
## Summary of balance for all data:
##               Means Treated Means Control SD Control Mean Diff eQQ Med
## distance              0.261         0.237      0.067     0.024   0.024
## age                  32.405        30.106      7.580     2.299   2.000
## prior_arrests         4.372         3.609      2.126     0.763   1.000
## treat                 0.552         0.518      0.500     0.034   0.000
##               eQQ Mean eQQ Max
## distance         0.024   0.039
## age              2.300   6.000
## prior_arrests    0.764   1.000
## treat            0.034   1.000
## 
## 
## Summary of balance for matched data:
##               Means Treated Means Control SD Control Mean Diff eQQ Med
## distance              0.261         0.261      0.070     0.000       0
## age                  32.405        32.269      7.852     0.136       0
## prior_arrests         4.372         4.386      2.089    -0.013       0
## treat                 0.552         0.554      0.497    -0.002       0
##               eQQ Mean eQQ Max
## distance         0.000   0.031
## age              0.190   4.000
## prior_arrests    0.017   1.000
## treat            0.002   1.000
## 
## Percent Balance Improvement:
##               Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance           99.90     100    99.85   21.89
## age                94.06     100    91.73   33.33
## prior_arrests      98.25     100    97.74    0.00
## treat              95.20       0    95.19    0.00
## 
## Sample sizes:
##           Control Treated
## All         18919    6081
## Matched      6081    6081
## Unmatched   12838       0
## Discarded       0       0
## Keep only the matched data 
matchedFull <- data.table(match.data(pScores, distance = "pscore"))
hist(matchedFull$pscore)

Were the treatment groups balanced?

I use a t-test to see whether there is a significant difference in re_arrest rates between those who received the treatment and those who did not. The first test, which uses the unmatched data, shows a highly significant difference. However, the significance of treatment is completely lost in the second t-test using the matched data.

These two differing tests suggest that any variation in re_arrest rates may not be as attributable to the treatment as one might think. Rather, the variance may be accounted for by non-random assignment, or systematic differences between the people who received the treatment and those who did not.

t.test(full[treat == 1]$re_arrest, full[treat == 0]$re_arrest, paired = F)
## 
##  Welch Two Sample t-test
## 
## data:  full[treat == 1]$re_arrest and full[treat == 0]$re_arrest
## t = 4.7, df = 25000, p-value = 0.000003
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01466 0.03592
## sample estimates:
## mean of x mean of y 
##    0.2552    0.2299
t.test(matchedFull[treat==1]$re_arrest,matchedFull[treat==0]$re_arrest,paired = F)
## 
##  Welch Two Sample t-test
## 
## data:  matchedFull[treat == 1]$re_arrest and matchedFull[treat == 0]$re_arrest
## t = -0.18, df = 12000, p-value = 0.9
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01954  0.01621
## sample estimates:
## mean of x mean of y 
##    0.4993    0.5009
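
As a complementary check (a sketch, not part of the original output), the simple treatment model can be re-estimated on the matched sample; its treat coefficient should echo the null result of the second t-test:

# Re-run the simple logit on the matched data; expect treat to lose significance
lmMatched <- glm(re_arrest ~ treat, data = matchedFull, family = binomial(link = "logit"))
summary(lmMatched)$coefficients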