Clean Slate Initiative Data Exercise

Author

Jonathan Zadra

Prompt

Download the dataset https://www.muckrock.com/foi/chicago-169/arrests-data-chicago-police-department-78593/ and anonymize any PII that is not relevant to the analysis. This could involve removing or hashing names and other direct identifiers.
Create a new dataset that only includes arrests between January 1, 2013 and December 31, 2018.
Analyze the demographics of individuals in the dataset including age, sex, and race. Identify any patterns in charges that vary by demographics within the 5 year period.
Compile a short report summarizing the findings. Include visualizations where appropriate and offer recommendations for Clean Slate legislation in Illinois to have the largest impact on racial equity based on your analysis of arrests in the largest city in the state. You can also include brief recommendations for further research and other policy changes based on the analysis.
Submit the final dataset you created as an Excel file (.xls, .xlsx).

Setup

library(tidyverse)
library(integral)
library(readxl)
library(janitor)
library(knitr)
library(kableExtra)
library(scales)
library(DT)
library(digest)
library(calendR)
library(plotly)

knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)

theme_set(theme_minimal())

Load Data

There are two data files at the link provided. Each has two sheets (Data1 and Data2) that appear to contain the same data formats, but different data. The description and notes in each file seem to indicate that the files are identical. However, we’ll load both and compare them to confirm.

Unfortunately, the Chicago Police Department (CPD) does not appear to have provided a data dictionary to Muckrock as requested in the FOIA, so for some variables it will be necessary to make an educated guess as to what they are likely to be.

# GET("https://cdn.muckrock.com/foia_files/2019/10/04/P514093_-_Work_File201.xlsx", write_disk(tf <- tempfile(fileext = ".xlsx")))
# read_excel(tf, sheet = "Data1")
#Other is https://cdn.muckrock.com/foia_files/2019/10/04/P514093_-_Work_File.xlsx


a1 <- read_excel("data/P514093_-_Work_File.xlsx", sheet = "Data1") 
a2 <- read_excel("data/P514093_-_Work_File.xlsx", sheet = "Data2")
# b1 <- read_excel("data/P514093_-_Work_File201.xlsx", sheet = "Data1")
# b2 <- read_excel("data/P514093_-_Work_File201.xlsx", sheet = "Data2")
# all(a1 == b1, na.rm = T)
# all(a2 == b2, na.rm = T)

Testing confirms that the files are identical, so we’ll just use one, and combine the data from sheets Data1 and Data2 into a single data frame.
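One caveat about the commented-out check: `all(x == y, na.rm = TRUE)` skips every position where either side is missing, so it cannot detect a value that is NA in one file but present in the other. A minimal sketch of the difference, using small made-up data frames (`sheet_a`, `sheet_b`, and `sheet_c` are hypothetical stand-ins, not the real sheets):

```r
# Toy stand-ins for sheets that are expected to hold the same data (hypothetical values).
sheet_a <- data.frame(id = c(1, 2, 3), name = c("x", NA, "z"))
sheet_b <- data.frame(id = c(1, 2, 3), name = c("x", NA, "z"))
sheet_c <- data.frame(id = c(1, 2, 3), name = c("x", "y", "z"))

# Element-wise comparison with na.rm = TRUE ignores the position where sheet_a is NA,
# so sheet_a and sheet_c look the same even though they differ.
elementwise_ac <- all(sheet_a == sheet_c, na.rm = TRUE)

# identical() compares the whole object, including NA positions, dimensions, and types.
same_ab <- identical(sheet_a, sheet_b)
same_ac <- identical(sheet_a, sheet_c)
```

Here `elementwise_ac` is TRUE while `same_ac` is FALSE, which is why a whole-object comparison may be the safer confirmation.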

# rm(b1, b2)  # only needed if the commented-out comparison objects above were created

pd <- bind_rows(a1, a2) %>% 
  clean_names()

pd <- pd %>% 
  mutate(arrest_date = dmy(arrest_date)) %>%  #convert from character to date class
  rename(statute = stat_descr,
         charge = charge_class_descr,
         sex = sex_code_cd)

1. Anonymize PII

The data contains the following variables that are PII: First name, last name, date of birth, race, and sex. Additionally, there is a variable “cb” that contains 8-digit numeric data. A quick search of potential definitions for “cb” relevant to Illinois police data did not result in any likely candidates, but “cb” is most likely a unique case number (there are no duplicates in the data). Because the values of “cb” for juvenile cases have been redacted to just “J” in instances where the same redaction was made to other PII, and because in all other rows the “cb” values are unique, this variable will be treated as PII as well.

Because the analyses requested involve analyzing demographics of age, sex, and race, these PII variables will not be anonymized. (However, if we were only interested in differences between values of the variables without knowing what the values are, the analyses could be done with the demographics anonymized as well).

PII will be hashed using the crc32 algorithm. Juvenile data that has already been redacted will be left as is for ease of identifying juveniles, as will any of the records that show REFUSED for first and/or last name in order to eliminate any confusion about the nature of duplicated name hashes.

anonymize <- function(x, algo="crc32"){
  unq_hashes <- vapply(unique(x), function(object) digest(object, algo=algo), FUN.VALUE="", USE.NAMES=TRUE)
  unname(unq_hashes[x])
}

raw.pd <- pd

anon <- pd %>% 
  filter(cb != "J", first_nme != "REFUSED") %>% 
  mutate(across(cb:last_nme, ~anonymize(.)))

pd <- anon %>% 
  bind_rows(pd %>% 
              filter(cb == "J" | first_nme == "REFUSED"))

set.seed(53650)
pd %>% 
  select(cb:sex) %>% 
  sample_n(5) %>% 
  datatable(caption = "Example of hashed PII data.")
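The helper above hashes each distinct value once and then looks the result up for every row. Because crc32 is a checksum rather than a cryptographic hash, and names come from a limited set of values, it may be possible to recover a name by hashing candidate names and comparing. A minimal sketch of the same lookup pattern with random surrogate keys instead of hashes, offered as an illustrative alternative only (the function name `pseudonymize` and the token format are hypothetical, not part of the original analysis):

```r
pseudonymize <- function(x, prefix = "id") {
  unq <- unique(x[!is.na(x)])
  # Shuffle the token numbers so they do not reveal the order of first appearance.
  tokens <- paste0(prefix, "_", sample(seq_along(unq)))
  names(tokens) <- unq
  # Look up each row's token by value; missing values stay missing.
  unname(tokens[as.character(x)])
}

set.seed(1)
first_names <- c("ANN", "BOB", "ANN", NA, "CAL")
tokens_out <- pseudonymize(first_names)
```

As with the hashed version, identical inputs receive identical tokens, so duplicate checks on the anonymized columns would still work.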

Initial Exploration

Note: This section was not requested but is part of my standard data cleaning process. Feel free to skip to section 2.

Data Summaries

In total there are 1,272,318 records.

The dates range from 2009-01-01 to 2018-12-31, which matches the FOIA request.

All columns have a completion rate (i.e., non-missing values) above 90%, with the exception of desired court date, which is 80% complete with 250,304 empty values.

Duplicates

It is good practice to check for potential data errors in the form of duplicate records. Checking for duplicate rows reveals 56,377 duplicate records. Further inspection shows that all of the duplicate rows are likely juvenile records, given that the names and other identifying information have been redacted and replaced with “J”. Given that such redaction was intended, it is most likely that these cases (roughly 4% of all records) are actual group arrests rather than duplicates. However, because this juvenile subset is missing information that would allow a much clearer determination for the rest of the data set, some additional checks are warranted to ensure that later analyses do not rely on erroneous data.

juvenile_dupes <- pd %>% 
  get_dupes(everything())

juvenile_dupes %>% 
  count(cb, first_nme, last_nme, dob_year) %>% 
  kable(caption = "Distinct identifying information for duplicates (only values are J's)", format.args = list(big.mark = ","))

Distinct identifying information for duplicates (only values are J’s)
cb	first_nme	last_nme	dob_year	n
J	J	J	J	56,377

This raises the question of whether these are true duplicate rows that should be removed, or if they are indeed unique individuals. If the latter, it would mean that each group of duplicates are actually separate people who were all arrested on the same date, same location, and for the same statute description and charge class. This could certainly be possible if the arrest involved multiple people.

There are several things to check that can help us decide which is the case:

How many juvenile records share all of their values with other records? In other words, how many people were arrested in groups, assuming the records are not duplicates? If we see many examples of very large groups, it suggests that they are true duplicates, as it would be unlikely for large numbers of people to be arrested at the same time on many occasions.
Do the potentially duplicated juvenile rates compare to those of adults arrested at the same time and place, and for the same statute description and charge? If we examine only non-juvenile cases and look for rows that are duplicated across these variables, while ignoring the PII variables that we don’t have for juvenile cases, do we still see similar patterns of multiple people seemingly being arrested at the same time? If so, this suggests that the juvenile data is not duplicated and people truly are being arrested in groups.
Is there a clear pattern of specific statute descriptions and charge classes for what may be group arrests versus those that more often only involve one person being arrested? I.e., do group arrests appear to indicate a potential group activity, such as any related to gang activity, and do single person arrests tend to involve a description that could only be for one person, such as those related to being the driver of a car?

Q1: Rates of juvenile records that have duplicates

pd %>% 
  filter(last_nme == "J" & first_nme == "J") %>% 
  add_count(statute, name = "stat_descr_n") %>% 
  add_count(arrest_date, street_no, statute, charge, name = "Group Size") %>% 
  count(`Group Size`) %>% 
  mutate(n = n / `Group Size`) %>% 
  datatable(caption = "Counts of number of juveniles arrested together", rownames = F, options = list(pageLength = 30))

The sizes of the groups for potential group arrests are heavily weighted toward lower numbers, with the maximum being 30 individuals at a time on only 1 occasion. This seems very reasonable and suggests that the juvenile records are unique individuals involved in group arrests.
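The counting logic behind this table can be sketched in base R on a few made-up rows. The data below is hypothetical; the point is only to show why the row count for each group size is divided by the size itself to get the number of distinct groups.

```r
# Hypothetical toy records: each row is one arrest.
arrests <- data.frame(
  arrest_date = c("2013-01-05", "2013-01-05", "2013-01-05",
                  "2013-02-10", "2013-03-01", "2013-03-01"),
  street_no   = c(100, 100, 100, 200, 300, 300),
  statute     = c("MOB ACTION", "MOB ACTION", "MOB ACTION",
                  "THEFT", "TRESPASS", "TRESPASS")
)

# Size of the group each row belongs to (rows sharing date, location, and statute).
key <- interaction(arrests$arrest_date, arrests$street_no, arrests$statute, drop = TRUE)
arrests$group_size <- as.integer(ave(seq_along(key), key, FUN = length))

# Rows per group size, divided by the size, gives the number of distinct groups of that size.
rows_per_size   <- table(arrests$group_size)
groups_per_size <- rows_per_size / as.integer(names(rows_per_size))
```

In this toy data there is one group each of size 1, 2, and 3, even though the size-3 group contributes three rows.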

Q2: Comparison of juvenile and adult group size patterns

juvadult_group_size <- pd %>% 
  filter(!is.na(first_nme) & !is.na(last_nme)) %>% 
  mutate(age_group = if_else(last_nme == "J" & first_nme == "J", "Juvenile", "Adult")) %>% 
  group_by(age_group) %>% 
  add_count(statute, name = "stat_descr_n") %>% 
  add_count(arrest_date, street_no, statute, charge, name = "group_n") %>% 
  mutate(`Group Size` = cut(group_n, 
                           breaks = c(1, 2, 3, 4, 5, 11, 21, 41, Inf), 
                           labels = c("1", "2", "3", "4", "5-10", "11-20", "21-40", "41+"), 
                           right = F)) %>% 
  tabyl(`Group Size`, age_group) %>% 
  adorn_percentages("col") %>% 
  adorn_pct_formatting() %>% 
  adorn_ns()

juvadult_group_size %>% 
  datatable(caption = "Percents of number of individuals with identical arrest data.", rownames = F, options = 
              list(pageLength = 25,
                   columnDefs = list(list(targets = "_all", orderable = FALSE))  # Disable sorting for all columns
  ))

Q3: Pattern of statute descriptions with high and low rates of small and large group sizes

Because there are a large number of different statute descriptions (3,115), it is more manageable to eliminate those that have relatively few occurrences overall. A density plot of the number of occurrences of each statute description will help guide the choice of a cutoff point.

pd %>% 
  add_count(statute, name = "stat_descr_n") %>% 
  ggplot(aes(x = stat_descr_n)) + 
  geom_density() + 
  scale_x_continuous(breaks = pretty_breaks(20), labels = label_comma(big.mark = ",")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Frequency of Statute Descriptions", y = "Density")

It appears that eliminating any statute description with under 10,000 occurrences makes the total number of statute descriptions much more manageable (50 unique statute descriptions). However, lowering the cutoff to 5,000 occurrences (70 unique) adds some descriptions that seem relevant because, in theory, they should only involve 1 person being arrested at a time for the same statute description and charge, such as incidents involving the driver of a vehicle (other people in the car who may have been arrested at the same time should have different statute descriptions and charges). The 5,000 cutoff is therefore used below.

In addition, because some statute descriptions should involve larger group arrests but may not meet the cutoff, any statute description that included the terms “MOB” or “GANG” and had more than 40 total instances was also included, as were any descriptions that had at least one occurrence of a group size greater than or equal to 20.

With this subset of the data, we can examine the proportion of group sizes for each statute description. In the table below, some group sizes are combined into ranges to further simplify the output.
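To make the effect of a cutoff concrete, the number of descriptions that survive each candidate threshold can be tabulated directly. A minimal sketch on invented frequencies (the statute names and counts below are hypothetical and chosen only for illustration):

```r
# Hypothetical statute descriptions with very different frequencies.
statute <- rep(c("ISSUANCE OF WARRANT", "RETAIL THEFT", "MOB ACTION", "RARE OFFENSE"),
               times = c(12000, 6000, 50, 3))

freq <- table(statute)

# Number of distinct descriptions with more occurrences than each candidate cutoff.
cutoffs <- c(10, 40, 5000, 10000)
n_kept <- vapply(cutoffs, function(k) sum(freq > k), integer(1))
names(n_kept) <- cutoffs
```

Running the same tabulation on the real `statute` column would show how quickly the set of descriptions shrinks as the cutoff rises.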

occurr_stat <- pd %>% 
  add_count(statute, name = "stat_descr_n")%>% 
  add_count(arrest_date, street_no, statute, charge, name = "group_n") %>% 
  filter(stat_descr_n > 5000 | group_n >= 20 | (str_detect(statute, "MOB|GANG") & stat_descr_n > 40)) %>% 
  filter(stat_descr_n > 10) %>% 
  mutate(group_n_range = cut(group_n, 
                           breaks = c(1, 2, 3, 4, 5, 11, Inf), 
                           labels = c("1", "2", "3", "4", "5-10", "11+"), 
                           right = F)) %>% 
  tabyl(statute, group_n_range) %>% 
  adorn_percentages("row") 
  

occurr_stat %>% 
  datatable(caption = "Percentage of various group sizes for specific statute descriptions") %>% 
  formatPercentage(columns = 2:7, digits = 0)

By manipulating the sorting of each column, we can see several patterns that support the hypothesis about certain types of statute descriptions. For instance, sorting the column for “1” individual arrested, 5 of the top 11 statute descriptions involve driving, one is being a fugitive, and one is refusing to show ID, all of which should only involve one individual. Likewise, sorting for high percentages of groups of 5-10 brings several statute descriptions that involve mob actions or gang-related activity into the top 10.

Duplicates conclusion

Based on the answers to all three questions, the weight of the evidence suggests fairly clearly that there are no actual duplicate records, but rather that multiple arrests on the same date, in the same location, and for the same statute description are group arrests. Therefore, it seems warranted to proceed without removing any records.

2. Select subset of arrest dates

pd <- pd %>% 
  filter(between(arrest_date, ymd(20130101), ymd(20181231)))

Changes in arrests over time

With the subset, it is worth examining any changes in arrests over time. First, we’ll examine the entire 2013-2018 period.

pd %>% 
  ggplot(aes(x = arrest_date)) + geom_density() +
  scale_x_date(breaks = pretty_breaks(n = 8)) +
  ggtitle("Distribution of Daily Arrests") +
  labs(x = "Arrest Date")

It appears that the number of arrests has been decreasing over the 2013-2018 period. To test this observation, we’ll run a simple model:

pd %>% 
  count(arrest_date) %>% 
  mutate(day_index = row_number()) %>% 
  lm(n ~ day_index, .) %>% 
  summary()


Call:
lm(formula = n ~ day_index, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-271.652  -29.891    2.774   33.645  157.121 

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 396.195267   2.108048  187.94 <0.0000000000000002 ***
day_index    -0.096220   0.001666  -57.76 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 49.32 on 2189 degrees of freedom
Multiple R-squared:  0.6038,    Adjusted R-squared:  0.6036 
F-statistic:  3336 on 1 and 2189 DF,  p-value: < 0.00000000000000022

A simple linear model predicting daily arrests from the day index of the entire period indicates that there is a highly significant decrease in arrests over time. Starting from a predicted 396.10 arrests on January 1, 2013, there is an average decrease of about 0.096 arrests per day, yielding a predicted value of 185.38 daily arrests at the end of the period.
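The start and end predictions implied by the printed model summary can be checked by hand. The sketch below copies the two coefficients from the summary output above and assumes, as the residual degrees of freedom suggest, that there is one row per calendar day from 2013-01-01 to 2018-12-31:

```r
# Coefficients copied from the printed summary of lm(n ~ day_index).
intercept <- 396.195267
slope     <- -0.096220

# One row per calendar day in the period (also equals residual df 2189 + 2 coefficients).
n_days <- as.integer(as.Date("2018-12-31") - as.Date("2013-01-01")) + 1L

pred_first <- intercept + slope * 1        # fitted value on the first day
pred_last  <- intercept + slope * n_days   # fitted value on the last day
total_drop <- pred_first - pred_last       # fitted decline across the period
```

This gives roughly 396.1 fitted arrests on the first day and roughly 185.4 on the last.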

pd %>% 
  count(arrest_date) %>% 
  ggplot(aes(x = arrest_date, y = n)) + 
  geom_smooth() + 
  geom_smooth(method = "lm", color = "purple") +
  ggtitle("Number of Daily Arrests") + geom_point(alpha = .05) +
  scale_x_date(breaks = pretty_breaks(n = 6)) +
  labs(x = "Arrest Date", y = "Daily Arrests")

Plotting a GAM smoothing function (blue) on top of the linear fit (purple) suggests that this descending trend may have accelerated downward between the summers of 2015 and 2016, and then flattened until the summer of 2017, after which it increased slightly. Adding a quadratic term to the model improves the overall fit (R-squared rises from 0.60 to 0.65), and its positive coefficient (0.0000469) indicates that the downward trend becomes less steep over time.

mod_daycount <- pd %>% 
  count(arrest_date) %>% 
  mutate(day_index = row_number()) %>% 
  lm(n ~ day_index + I(day_index^2), .)
  
mod_daycount %>% 
  summary()


Call:
lm(formula = n ~ day_index + I(day_index^2), data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-278.36  -25.59    3.08   30.74  131.96 

Coefficients:
                   Estimate   Std. Error t value            Pr(>|t|)    
(Intercept)    433.72861734   2.97603990  145.74 <0.0000000000000002 ***
day_index       -0.19891012   0.00627069  -31.72 <0.0000000000000002 ***
I(day_index^2)   0.00004685   0.00000277   16.91 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 46.39 on 2188 degrees of freedom
Multiple R-squared:  0.6496,    Adjusted R-squared:  0.6493 
F-statistic:  2028 on 2 and 2188 DF,  p-value: < 0.00000000000000022

In summary, it appears that there may have been one or more external factors that affected arrest rates over the 2013-2018 period. It may have been something straightforward like a policy change, but could also reflect changes in population base rates or economic factors.
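One way to read the quadratic fit is through its slope, which changes from day to day. The sketch below copies the coefficients from the printed summary and computes the day on which the fitted curve stops decreasing. This is a property of the fitted parabola only, which imposes a symmetric shape, so it should not be taken as the date of an actual change in arrests.

```r
# Coefficients copied from the printed summary of lm(n ~ day_index + I(day_index^2)).
b1 <- -0.19891012   # day_index
b2 <-  0.00004685   # I(day_index^2)

# Daily change implied by the fit on a given day: derivative b1 + 2 * b2 * day.
slope_at <- function(day) b1 + 2 * b2 * day

# Day index on which the fitted slope reaches zero, and the matching calendar date
# (day_index 1 is assumed to be 2013-01-01).
turning_day  <- -b1 / (2 * b2)
turning_date <- as.Date("2013-01-01") + round(turning_day) - 1
```

The fitted slope is about -0.2 arrests per day at the start and close to zero near the end of 2018, consistent with the flattening seen in the smoothed plot.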

Seasonal changes

Seasonal variability in arrest data is fairly common. In the initial plot of the distribution of arrests there is a clear cyclical component. First we’ll plot total arrests by calendar month.

pd %>% 
  mutate(a_month = month(arrest_date)) %>% 
  mutate(lab_month = factor(month(arrest_date, label = T))) %>% 
  count(a_month, lab_month) %>% 
  ggplot(aes(y = n)) + 
  geom_col(aes(x = lab_month)) +
  geom_smooth(aes(x = as.numeric(lab_month))) +
  scale_y_continuous(labels = label_comma()) +
  labs(x = "Month", y = "Monthly Arrests")

There does appear to be a monthly trend, with arrests peaking in warmer months and declining in cooler months.

We will examine other changes over time relative to other factors shortly.

3. Demographics

Basic Descriptions

Age

Because the data only contains the year of the date of birth, age will be calculated based only on the year of the arrest date. As a result, age will only be approximate and could be off by up to roughly 1 year.

The majority of cases where the first name is “REFUSED” have the year of birth entered as 1900. Based on this, it can be assumed that 1900 is a placeholder that indicates that providing a DOB was also refused, but likely the CPD database does not allow non-numeric or missing values to be entered for that field. Thus, these cases will have the year of birth replaced with NAs (i.e. missing). It should also be noted that there are roughly 20 cases where the person entering the data included a typo when attempting to enter “REFUSED”. However, none of these have birth years of 1900 so they can be ignored.

pd <- pd %>% 
  mutate(dob_year = if_else(dob_year == "J" | (first_nme == "REFUSED" & dob_year == 1900), NA_character_, dob_year)) %>% 
  mutate(dob_year = as.numeric(dob_year)) %>%  
  mutate(age = year(arrest_date) - dob_year)
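The size of the approximation error can be illustrated with a made-up person whose full birth date is known. The dates below are hypothetical, since the real data only contains the birth year:

```r
# Hypothetical person: the exact birth date is used only to show the size of the error.
birth_date  <- as.Date("1990-12-15")
arrest_date <- as.Date("2013-01-10")

year_of <- function(d) as.integer(format(d, "%Y"))

# Approximation used in the analysis: difference of calendar years.
approx_age <- year_of(arrest_date) - year_of(birth_date)

# Exact age in completed years: subtract one if the birthday has not yet occurred that year.
had_birthday <- format(arrest_date, "%m%d") >= format(birth_date, "%m%d")
exact_age <- approx_age - as.integer(!had_birthday)
```

Here the year-based age is 23 while the exact age is 22, so the approximation can overstate age by one year but never understates it.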

An initial examination of the age data shows the following descriptives:

pd %>% select(age) %>% summary() %>% kable()

	age
	Min. : 1.00
	1st Qu.: 24.00
	Median : 31.00
	Mean : 34.03
	3rd Qu.: 42.00
	Max. :136.00
	NA’s :106053

There appears to be some erroneous birth year data: the oldest verified person to ever live reached 122, but the data contains an individual who would be 136 and a number of people over 100, which seems unlikely:

pd %>%
  filter(age >= 100) %>%
  select(dob_year, arrest_date, age) %>%
  arrange(dob_year) %>% datatable()

pd %>% 
  ggplot(aes(x= age)) + geom_density() +
  scale_x_continuous(breaks = pretty_breaks(n = 8))

There is one non-juvenile record for a person who was 1 year old at time of arrest. Either this person did not have their PII properly redacted and CPD arrested a 1-year-old, or the birth year was entered incorrectly. This case will be removed.
There are 47 records where the age is over 100.
- 31 have a DOB year of 1900. Given the aforementioned use of 1900 as a likely placeholder for DOBs that were unknown or refused, it makes sense to remove them.
- 11 have DOB years in the late 1800s. These are likely to be typos that should have been 1900s, so it makes sense to remove them.
There is only 1 record with an age of 18, likely due to the potential error of up to one year. To simplify analyses, this record will be removed.

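pd <- pd %>% 
  # filter() drops rows where the condition is NA, so records with a missing
  # birth year (redacted juveniles, refused DOBs) are removed here as well.
  # Note: `>=` keeps a birth year of exactly 1900; use `> 1900` to drop that placeholder too.
  filter(dob_year >= 1900) %>% 
  filter(age != 1) %>% 
  filter(age != 18)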
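A side effect of filtering on birth year is worth spelling out: `dplyr::filter()` keeps only rows where the condition is TRUE, so rows with a missing birth year are dropped too. Base R's `subset()` behaves the same way, which allows a small self-contained sketch (the toy values are hypothetical):

```r
# Toy data: one record with a missing birth year, one placeholder year, one pre-1900 year.
toy <- data.frame(dob_year = c(1985, NA, 1900, 1899),
                  age      = c(30, NA, 115, 116))

# Keeps only rows where the condition is TRUE: the NA row and the 1899 row are both dropped.
kept <- subset(toy, dob_year >= 1900)

# Keeping records with a missing birth year requires saying so explicitly.
kept_with_na <- subset(toy, is.na(dob_year) | dob_year >= 1900)
```

If records without an age should remain available for analyses that do not involve age, the second form is one possible way to retain them.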

First, we examine the number of arrests by age:

pd %>% 
  ggplot(aes(x = age)) + 
  geom_bar() + 
  labs(x = "Age", y = "Arrests") +
  scale_x_continuous(breaks = pretty_breaks(10))

The bulk of arrests appear to be for people between 20 and 30. The distribution is unimodal with a strong right skew.

pd %>% 
  mutate(age18 = age - 19) %>%  # shift age so the intercept is the predicted count at age 19
  count(age18) %>% 
  lm(n ~ age18, .) %>% summary()


Call:
lm(formula = n ~ age18, data = .)

Residuals:
   Min     1Q Median     3Q    Max 
 -4881  -2875  -1724   1529   9869 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 16994.23     868.56   19.57 <0.0000000000000002 ***
age18        -249.81      17.37  -14.38 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4191 on 81 degrees of freedom
Multiple R-squared:  0.7186,    Adjusted R-squared:  0.7152 
F-statistic: 206.9 on 1 and 81 DF,  p-value: < 0.00000000000000022

pd %>% 
  ggplot(aes(x = arrest_date, y = age)) +
  geom_smooth() +
  geom_smooth(method = "lm", color = "red")

There is an increase in the age at arrest over time. A simple linear model accounting for arrest date and age shows that this is a significant increase.

pd %>%
  count(arrest_date, age) %>%
  lm(n ~ arrest_date * age, .) %>%
  summary()


Call:
lm(formula = n ~ arrest_date * age, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.8070  -1.7592  -0.1604   1.3957  27.2847 

Coefficients:
                    Estimate   Std. Error t value            Pr(>|t|)    
(Intercept)     95.487949713  0.786158428  121.46 <0.0000000000000002 ***
arrest_date     -0.004835534  0.000046760 -103.41 <0.0000000000000002 ***
age             -1.571520199  0.017903802  -87.78 <0.0000000000000002 ***
arrest_date:age  0.000080910  0.000001065   75.97 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.939 on 97503 degrees of freedom
Multiple R-squared:  0.537, Adjusted R-squared:  0.5369 
F-statistic: 3.769e+04 on 3 and 97503 DF,  p-value: < 0.00000000000000022

The negative coefficients for arrest date and age are in line with what we’ve seen: arrests have gone down over time, and there are fewer arrests of older people than of younger people. The positive interaction between arrest date and age indicates that the decline over time is less steep at older ages, which is consistent with an increase in the age at arrest over time.
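The interaction can be made concrete by computing the date slope the model implies at a few ages. The sketch below copies the two relevant coefficients from the printed summary; the ages 20, 40, and 60 are arbitrary illustrative choices.

```r
# Coefficients copied from the printed summary of lm(n ~ arrest_date * age).
b_date <- -0.004835534   # arrest_date
b_int  <-  0.000080910   # arrest_date:age

# Fitted change in the daily count for a given age per additional day.
date_slope <- function(age) b_date + b_int * age

slopes <- date_slope(c(20, 40, 60))

# Age at which the fitted date slope crosses zero.
crossover_age <- -b_date / b_int
```

The fitted decline is steepest for the youngest ages and shrinks toward zero at around age 60, which is what shifts the age distribution of arrests upward over time.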

Sex

Males are much more highly represented in number of arrests.

pd %>% 
  tabyl(sex) %>% 
  adorn_pct_formatting() %>% 
  kable()

sex	n	percent
F	82063	15.5%
M	448802	84.5%
X	77	0.0%

pd %>% 
  ggplot(aes(x = sex, fill = sex)) + geom_bar() +
  scale_fill_viridis_d() +
  labs(y = "Arrests", x = "Sex") +
  scale_y_continuous(labels = label_comma(big.mark = ",")) +
  guides(fill = "none")

Because sex designations of “X” represent less than 0.1% of the arrests (77 records) and are therefore not a large enough group to draw conclusions from, they will be omitted from analyses involving sex for the remainder of this report.

There do not appear to be any major differences in arrest frequencies by sex over time beyond the changes in total arrests over time seen previously:

pd %>% 
  filter(sex != "X") %>% 
  ggplot(aes(x = arrest_date, color = sex)) + geom_density() +
  scale_x_date(breaks = pretty_breaks(n = 8))

Race

Arrests for individuals classified as black are massively disproportionate, making up 72% of arrests. White Hispanics are second highest with only 17%, followed by white at 9.3%.

pd %>% 
  tabyl(race) %>% 
  adorn_pct_formatting() %>% 
  kable()

race	n	percent
AMER IND/ALASKAN NATIVE	411	0.1%
ASIAN/PACIFIC ISLANDER	3226	0.6%
BLACK	382106	72.0%
BLACK HISPANIC	3017	0.6%
UNKNOWN	922	0.2%
WHITE	49188	9.3%
WHITE HISPANIC	92072	17.3%

pd %>% 
  ggplot(aes(x = race, fill = race)) + geom_bar() +
  scale_fill_viridis_d() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5)) +
  scale_y_continuous(labels = label_comma(big.mark = ",")) +
  guides(fill = "none") +
  labs(x = "Race", y = "Arrests")

There do not appear to be any major differences in arrest frequencies by race over time beyond the changes in total arrests over time seen previously, with the exception that American Indians/Alaskan Natives had a slightly higher frequency of arrests between 2013 and 2015 than other races, and a stronger decrease in arrests after 2016, notably without the uptick in arrests seen for other races in 2018:

pd %>% 
  filter(sex != "X") %>% 
  ggplot(aes(x = arrest_date, color = race)) + geom_density() +
  theme(legend.position = "top") +
  scale_x_date(breaks = pretty_breaks(n = 8))

Demographic interactions

Age & race

For the most part the pattern of age at arrest is consistent across races, with the exception of American Indians/Alaskan Natives who tend to be slightly older at time of arrest, with a bimodal distribution peaking at around 30 and 40.

pd %>%
  ggplot(aes(x = age, color = race)) + 
  geom_density() +
  theme(legend.position = "top") +
  labs(x = "Age", color = "Race") +
  scale_x_continuous(breaks = pretty_breaks(10))

Age & sex

The distribution of age at time of arrest is fairly similar across sex designations.

pd %>% 
  filter(sex != "X") %>% 
  ggplot(aes(x = age, color = sex)) + 
  geom_density() +
  theme(legend.position = "top") +
  labs(x = "Age", color = "Sex") +
  scale_x_continuous(breaks = pretty_breaks(10))

Race & sex

The patterns of arrests by race and sex show that males are the large majority of arrests in every racial category. The female share does vary somewhat, from roughly 12% among White Hispanic arrestees to roughly 22% among White and American Indian/Alaskan Native arrestees.

pd %>% 
  filter(sex != "X") %>% 
  tabyl(race, sex)

                    race     F      M
 AMER IND/ALASKAN NATIVE    92    319
  ASIAN/PACIFIC ISLANDER   561   2665
                   BLACK 59625 322422
          BLACK HISPANIC   472   2545
                 UNKNOWN   129    793
                   WHITE 10589  38591
          WHITE HISPANIC 10595  81467

pd %>% 
  filter(sex != "X") %>% 
  ggplot(aes(x = sex, fill = race)) + geom_bar() +
  facet_wrap(~race, scales = "free") +
  guides(fill = "none") +
  scale_y_continuous(labels = label_comma())

Age, race, & sex

A quick look at the relative distributions of number of arrests by age, broken into all combinations of race and sex, does not show any patterns different from those seen previously in the individual distributions. The only slight exception is that American Indian/Alaskan Native men tend to be slightly older at time of arrest than women.

pd %>% 
  filter(sex != "X") %>% 
  ggplot(aes(x = age, color = sex))  +
  geom_density() +
  facet_wrap(~race, ncol = 1) +
  labs(x = "Age") +
  scale_x_continuous(breaks = pretty_breaks(n = 8))

Charge Class

Descriptives for charge class

The police data has the following categories of charge classes.

pd %>% 
  tabyl(charge) %>% 
  adorn_pct_formatting() %>% 
  kable(caption = "Charge class frequencies.")

Charge class frequencies.
charge	n	percent	valid_percent
BUSINESS OFFENSE	532	0.1%	0.1%
CLASS 1 FELONY	8748	1.6%	1.8%
CLASS 2 FELONY	16714	3.1%	3.4%
CLASS 3 FELONY	16741	3.2%	3.4%
CLASS 4 FELONY	68676	12.9%	13.8%
CLASS A MISDEMEANOR	194323	36.6%	39.1%
CLASS B MISDEMEANOR	43205	8.1%	8.7%
CLASS C MISDEMEANOR	23590	4.4%	4.7%
CLASS MURDER	1635	0.3%	0.3%
CLASS UNKNOWN	42691	8.0%	8.6%
CLASS X FELONY	15470	2.9%	3.1%
LOCAL ORDINANCE	58906	11.1%	11.9%
PETTY OFFENSE	5754	1.1%	1.2%
NA	33957	6.4%	-

pd %>% 
  count(charge) %>% 
  ggplot(aes(x = charge, y = n)) + geom_col() +
  guides(fill = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Charge Class", y = "Arrests") +
  scale_y_continuous(labels = label_comma())

For misdemeanors, Class A is the most serious and Class C is the least serious. For felonies, Class 4 is the least serious, with seriousness increasing through Class 3, Class 2, Class 1, and Class X, and with Class “Murder” being the most serious.

Class A misdemeanors are punishable by up to 1 year in jail. The maximum fine on most Class A misdemeanors is $2,500.
Class B misdemeanors are punishable by a maximum of 6 months in jail. The maximum fine on most Class B misdemeanors is $1,500.
Class C misdemeanors are punishable by a maximum of 1 month in jail. The maximum fine on most Class C misdemeanors is $1,500.

Offense Class	Sentence Range	Extended Term	Fine	Probation Term	Supervised Release	Statute
Murder	20-Life	60-100 yrs.	up to $25,000	Not available	3 yrs.	730 ILCS 5/5-4.5-20
Class X	6-30 yrs.	30-60 yrs.	up to $25,000	Not available	3 yrs.	730 ILCS 5/5-4.5-25
Class 1	4-15 yrs.	15-30 yrs.	up to $25,000	Up to 4 yrs.	2 yrs.	730 ILCS 5/5-4.5-30
Class 2	3-7 yrs.	7-14 yrs.	up to $25,000	Up to 4 yrs.	2 yrs.	730 ILCS 5/5-4.5-35
Class 3	2-5 yrs.	5-10 yrs.	up to $25,000	Up to 30 mos.	1 yr.	730 ILCS 5/5-4.5-40
Class 4	1-3 yrs.	3-6 yrs.	up to $25,000	Up to 30 mos.	1 yr.	730 ILCS 5/5-4.5-45
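The severity ordering described above can be encoded as an ordered factor, which would let plots and tables sort charge classes by seriousness instead of alphabetically. This is a sketch under the assumption that only misdemeanors and felonies are ranked; where petty offenses, business offenses, local ordinances, and unknown classes belong is left open, so they are deliberately not given a rank here.

```r
# Least to most serious, following the ordering described in the text.
severity_levels <- c("CLASS C MISDEMEANOR", "CLASS B MISDEMEANOR", "CLASS A MISDEMEANOR",
                     "CLASS 4 FELONY", "CLASS 3 FELONY", "CLASS 2 FELONY",
                     "CLASS 1 FELONY", "CLASS X FELONY", "CLASS MURDER")

# A few example values in the format used by the charge column.
charge <- c("CLASS X FELONY", "CLASS A MISDEMEANOR", "LOCAL ORDINANCE", "CLASS 4 FELONY")

# Values outside the listed levels (e.g. LOCAL ORDINANCE) become NA rather than being ranked.
charge_ord <- factor(charge, levels = severity_levels, ordered = TRUE)

sorted_charges <- sort(charge_ord)               # NA values are dropped by sort()
x_more_serious <- charge_ord[1] > charge_ord[4]  # Class X vs Class 4
```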

A Petty Offense is any offense for which the maximum fine may not exceed $1,000 and for which there is no jail sentence. Likewise, a Business Offense means any offense punishable by a fine in excess of $1,000 and for which a sentence of imprisonment is not an authorized disposition.

There are 5,754 arrests in the Petty Offense charge category. The majority of these appear to be minor driving-related offenses. The top 5 statute descriptions are below:

pd %>% 
  filter(charge == "PETTY OFFENSE") %>% 
  tabyl(statute) %>% 
  adorn_pct_formatting() %>% 
  arrange(desc(n)) %>% 
  slice_head(n = 5) %>% 
  kable(caption = "Top 5 statute descriptions for Petty Offense arrests.")

Top 5 statute descriptions for Petty Offense arrests.
statute	n	percent
NO VALID REGISTRATION	1137	19.8%
IVC - NOT WEARING SEAT BELT/DRIVER	965	16.8%
IVC - FAIL TO REDUCE SPEED	477	8.3%
THEFT OF LOST/MISLAID PROPERTY	217	3.8%
DRIVER’S LICENSE/PERMIT - FAIL TO CARRY/DISPLAY	214	3.7%

A local ordinance offense in Illinois police arrest data typically refers to a violation of a law or regulation enacted by a local government entity, such as a city or municipality. These ordinances are distinct from state laws and are enforced at the local level.

These offenses account for 58,906 arrests, with the two most frequent statute descriptions (drinking alcohol on the public way and soliciting unlawful business) making up 19.2% and 10.6% of them respectively. The top 5 are shown here:

pd %>% 
  filter(charge == "LOCAL ORDINANCE") %>% 
  tabyl(statute) %>% 
  adorn_pct_formatting() %>% 
  arrange(desc(n)) %>% 
  slice_head(n = 5) %>% 
  kable(caption = "Top 5 statute descriptions for Local Ordinance arrests.")

Top 5 statute descriptions for Local Ordinance arrests.
statute	n	percent
DRINKING ALCOHOL ON THE PUBLIC WAY	11282	19.2%
SOLICITING UNLAWFUL BUSINESS	6232	10.6%
FAILURE TO APPEAR IN COURT	3472	5.9%
OBSTRUCTION OF TRAFFIC BY NON-MOTORIST	1766	3.0%
STOP AT STOP SIGN	1651	2.8%

A Business Offense appears to be similar to a petty offense in terms of severity. These arrests primarily involve motor vehicle operation, such as driving without insurance or with a suspended, unregistered, or expired registration.

pd %>% 
  filter(charge == "BUSINESS OFFENSE") %>% 
  tabyl(statute) %>% 
  adorn_pct_formatting() %>% 
  arrange(desc(n)) %>% 
  kable(caption = "Statute descriptions for business offense charges.")

Statute descriptions for business offense charges.
statute	n	percent
INSURANCE - OPERATE MTR VEHICLE WITHOUT	345	64.8%
OPERATE MTR VEHICLE/REGIS/SUSPENDED/NON-INSURED	116	21.8%
UNREGISTERED/EXPIRED REGIS	47	8.8%
TELEPHONE HARASSMENT/CREDITOR	7	1.3%
SELL LIQUOR W/O LICENSE	4	0.8%
DECEP COLL/FALSE PERSONATION	3	0.6%
IMPEDE INVESTIGATION	3	0.6%
IVC - IMPROPER PASS EMER VEHICLE	2	0.4%
DECEP COLL/ACCEPT MONEY NOT OWED	1	0.2%
IVC - PASS EMER VEH/INJ TO ANOTHER	1	0.2%
TOWER SOLICITATION AT SCENE	1	0.2%
UNLWFL SALE/REGIS PLATE COVER	1	0.2%
VIO BARBER/COSMTLGY/ETHICS	1	0.2%

A significant number of arrests have a charge of Class Unknown (7.3% of total arrests). As can be seen in the table below, these are primarily for statute descriptions of Issuance of Warrant.

pd %>% 
  filter(charge == "CLASS UNKNOWN") %>% 
  tabyl(statute) %>% 
  adorn_pct_formatting() %>% 
  kable(caption = "Statute descriptions for unknown charge classes.")

Statute descriptions for unknown charge classes.
statute	n	percent
CHILD MURDER/VIO OFFENDER AGAINST YOUTH REGIS	46	0.1%
CONTEMPT - ENFORCE JUDGEMENT ORDER/SUPPORT	7	0.0%
DIRECT CRIMINAL CONTEMPT	2	0.0%
FUGITIVE FROM JUSTICE - OUT OF STATE WARRANT	3396	8.0%
INDIRECT CRIMINAL CONTEMPT	2	0.0%
ISSUANCE OF WARRANT	38049	89.1%
NEGLECTED OR ABUSED MINOR	11	0.0%
SEX OFFENDER REGISTRATION	89	0.2%
SOLICIT AT ACCIDENT SCENE	8	0.0%
VIOLATE ORDER OF PROTECTION	468	1.1%
VIOLATE/ORDER OF PROTECTION	71	0.2%
VIOLATION OF PAROLE	412	1.0%
VIOLATION OF PROBATION	130	0.3%

#TODO: If time, come back and build a table of counts based on whether or not the statute appears with unknown classes.
# pd %>% 
#   filter(!is.na(charge)) %>% 
#   mutate(is_class_unknown = charge == "CLASS UNKNOWN") %>%
#   group_by(statute) %>% 
#   mutate(has_unknown_class = any(is_class_unknown), .keep = "used") %>% 
#   filter(has_unknown_class) %>% 
#   pivot_longer(cols = -statute) %>% 
#   tabyl(statute, name)
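
One possible way to build the table described in the TODO is sketched below. It is only an illustration on invented toy data (the `toy` tibble and the `class_status` column are hypothetical), not the analysis that was run for this report. For every statute that appears at least once as CLASS UNKNOWN, it counts how many arrests were recorded with an unknown class versus a known one.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for `pd`; values are invented.
toy <- tibble(
  statute = c("ISSUANCE OF WARRANT", "ISSUANCE OF WARRANT", "ISSUANCE OF WARRANT",
              "VIOLATION OF PAROLE", "VIOLATION OF PAROLE", "RETAIL THEFT"),
  charge  = c("CLASS UNKNOWN", "CLASS UNKNOWN", "CLASS A MISDEMEANOR",
              "CLASS UNKNOWN", "CLASS 4 FELONY", "CLASS A MISDEMEANOR")
)

unknown_by_statute <- toy %>%
  filter(!is.na(charge)) %>%
  mutate(class_status = if_else(charge == "CLASS UNKNOWN", "unknown", "known")) %>%
  group_by(statute) %>%
  # Keep only statutes recorded as CLASS UNKNOWN at least once.
  filter(any(class_status == "unknown")) %>%
  ungroup() %>%
  count(statute, class_status) %>%
  # One row per statute, one column per class status.
  pivot_wider(names_from = class_status, values_from = n, values_fill = 0)
```

A statute with a large `known` count alongside its `unknown` count would suggest that the class could be inferred for some of the unknown records.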

Finally, there is a significant number of missing values (6.1%) for class description. The vast majority (98%) are for Issuance of Warrant.

pd %>% 
  filter(is.na(charge)) %>% 
  tabyl(statute) %>% 
  adorn_pct_formatting() %>% 
  arrange(desc(n)) %>% 
  kable(caption = "Statute descriptions for cases where the charge description is missing.")

Statute descriptions for cases where the charge description is missing.
statute	n	percent
ISSUANCE OF WARRANT	33293	98.0%
PETITION TO VIOLATE PROBATION	286	0.8%
CONDITIONS OF BAIL BOND	242	0.7%
POSS CANNABIS<10 GRAMS	87	0.3%
POSS DRUG PARAPHERNALIA	31	0.1%
NEGLECTED OR ABUSED MINOR	10	0.0%
SALE OF CIGARETTES W/OUT COOK COUNTY STAMP	4	0.0%
TELEVISION ON IN DRIVERS VIEW	2	0.0%
SMALL UNMANED AIRCRAFT	1	0.0%
TAX IMPOSED-UNLWFL CONCEAL OF CIGARETTES/OFFER FOR SALE	1	0.0%

Charges over time

Earlier it was evident that arrests have gone down over time. Looking at the behavior of individual charge classes over time yields some interesting patterns.

pd %>% 
  ggplot(aes(x = arrest_date, color = charge)) + 
  geom_density() +
  scale_x_date(breaks = pretty_breaks(n = 8))

The first is a data entry/classification issue. Prior to 2016 there are many missing values for the charge, with only a few entered as “CLASS UNKNOWN”. However, it appears that at some point in 2015 something changed that drastically reduced the number of missing values and increased the number of CLASS UNKNOWN values. Given the finding above that Class Unknown and missing values were both primarily for statute descriptions of “issuance of warrant”, it makes sense to recode missing charge values as “CLASS UNKNOWN”. (Normally I would go back and do this early in the analysis after such a finding, but in this case I’m going to leave the analyses so far unchanged.)

pd <- pd %>% 
  mutate(charge = replace_na(charge, "CLASS UNKNOWN"))
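
The timing of the switch from missing values to CLASS UNKNOWN could also be checked numerically rather than read off the density plot. The sketch below uses invented toy data (the `toy` tibble is hypothetical) and would need to be run on `pd` before the recode above, since afterwards no missing values remain.

```r
library(dplyr)

# Hypothetical toy data standing in for `pd` before the recode; values are invented.
toy <- tibble(
  arrest_date = as.Date(c("2014-03-01", "2014-07-15", "2014-11-02",
                          "2016-02-10", "2016-06-21", "2016-09-30")),
  charge = c(NA, NA, "CLASS A MISDEMEANOR",
             "CLASS UNKNOWN", "CLASS UNKNOWN", "CLASS 4 FELONY")
)

unknown_by_year <- toy %>%
  mutate(year = as.integer(format(arrest_date, "%Y"))) %>%
  group_by(year) %>%
  summarize(
    # Share of the year's arrests with no charge class recorded.
    pct_missing = mean(is.na(charge)),
    # Share recorded explicitly as CLASS UNKNOWN (%in% treats NA as FALSE).
    pct_class_unknown = mean(charge %in% "CLASS UNKNOWN")
  )
```

If the two columns trade places between consecutive years, that supports treating missing values and CLASS UNKNOWN as the same category.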

Other patterns that deviate from the general pattern of arrests over time (peaking in 2013, declining until 2018, and then increasing again) include business offenses and petty offenses, both of which climb gradually over time, and class 2 felonies, which are in the middle in terms of frequency early on but rise to become the most common charge type in 2018.

Differences in charge class by demographic

Age & charge class

The various charge classes appear to follow the general trend observed earlier for overall number of arrests by age. Two exceptions are class 4 felony and class murder.

Class murder arrests are predominantly of individuals in their early 20s, just like other classes; however, there appears to be a disproportionate decrease in class murder arrests for individuals 35 and older as compared to the pattern for other charges.
Class 4 felony arrests are also predominantly of individuals in their early 20s; however, there is a marked increase in class 4 felony arrests for individuals between 40 and 50 relative to other charges, which begins to decrease again after 50.

pd %>% 
  ggplot(aes(x = age, color = charge)) +
  geom_density() +
  scale_x_continuous(breaks = pretty_breaks(n = 8))

Focusing on these two charges, we will investigate whether there are any differences due to race or sex.

For class 4 felony arrests, there are interesting differences by sex. Arrests of males are concentrated more heavily in their early 20s than arrests of females, with the rate declining with age apart from a slight increase around age 45. For females, while there is a peak in their mid 20s, there is also a large increase in arrests in their late 40s that isn’t observed as strongly for males.

For class murder, the pattern is generally the same for both sexes, with arrests peaking in their 20s, except that females tend to be slightly older at time of arrest, peaking in their mid 20s while males peak in their early 20s. Both decline at similar rates following these peaks.

pd %>%
  filter(sex != "X") %>% 
  filter(charge == "CLASS MURDER" | charge == "CLASS 4 FELONY") %>% 
  ggplot(aes(x = age, color = sex)) + 
  geom_density() +
  facet_wrap(~charge, ncol = 1) +
  scale_x_continuous(breaks = pretty_breaks(n = 8))

Sex and charge class

Men account for a much higher percentage of arrests overall, as we’ve seen previously. Because of this, it is easier to compare percentages than counts when looking for differences between sexes. The table below shows, within each sex, the share of arrests falling in each charge class. Most charge classes make up an equal or larger share of men’s arrests than of women’s; the exceptions are class A misdemeanors and class unknowns, which make up a larger share of women’s arrests.

pd %>% 
  filter(sex != "X") %>% 
  tabyl(charge, sex) %>% 
  adorn_percentages("col") %>% 
  adorn_pct_formatting() %>% 
  kable()

charge	F	M
BUSINESS OFFENSE	0.1%	0.1%
CLASS 1 FELONY	1.0%	1.8%
CLASS 2 FELONY	2.4%	3.3%
CLASS 3 FELONY	3.1%	3.2%
CLASS 4 FELONY	10.9%	13.3%
CLASS A MISDEMEANOR	45.5%	35.0%
CLASS B MISDEMEANOR	5.0%	8.7%
CLASS C MISDEMEANOR	3.6%	4.6%
CLASS MURDER	0.2%	0.3%
CLASS UNKNOWN	18.4%	13.7%
CLASS X FELONY	2.0%	3.1%
LOCAL ORDINANCE	6.9%	11.9%
PETTY OFFENSE	0.9%	1.1%
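
The table above uses column percentages, so each cell is the share of one sex’s arrests that fall in a given charge class. A small sketch with invented counts (the `toy` tibble is hypothetical) shows how this differs from row percentages, which instead give each sex’s share of the arrests within a charge class.

```r
library(dplyr)
library(janitor)

# Hypothetical toy data: 5 arrests of women and 15 arrests of men; counts are invented.
toy <- tibble(
  charge = c(rep("CLASS A MISDEMEANOR", 10), rep("CLASS 4 FELONY", 10)),
  sex    = c(rep("F", 4), rep("M", 6), rep("F", 1), rep("M", 9))
)

# Column percentages: share of each sex's arrests by charge class.
col_share <- toy %>%
  tabyl(charge, sex) %>%
  adorn_percentages("col")

# Row percentages: share of each charge class's arrests by sex.
row_share <- toy %>%
  tabyl(charge, sex) %>%
  adorn_percentages("row")
```

In this toy example class A misdemeanors make up 80% of women’s arrests but only 40% of men’s (column view), even though men still account for 60% of all class A misdemeanor arrests (row view). A higher column percentage for one sex therefore does not mean that sex has more arrests in that class.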

pd %>% 
  filter(sex != "X") %>% 
  count(charge, sex) %>% 
  group_by(sex) %>%
  mutate(total = sum(n),
         pct = n / total) %>%
  ggplot(aes(x = charge, y = pct, fill = sex)) +
  geom_col(position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = label_percent())

#TODO: investigate these two charge classes to see what the main statute differences are.
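
A possible starting point for this TODO is sketched below, assuming the two charge classes of interest are class A misdemeanors and class unknown. It runs on invented toy data (the `toy` tibble and its statute counts are hypothetical) and lists the most frequent statute descriptions for each sex within those classes.

```r
library(dplyr)

# Hypothetical toy data standing in for `pd`; values are invented.
toy <- tibble(
  charge  = c(rep("CLASS A MISDEMEANOR", 6), rep("CLASS UNKNOWN", 3), "CLASS 4 FELONY"),
  sex     = c("F", "F", "F", "M", "M", "M", "F", "M", "M", "M"),
  statute = c("RETAIL THEFT", "RETAIL THEFT", "DOMESTIC BATTERY",
              "DOMESTIC BATTERY", "DOMESTIC BATTERY", "RETAIL THEFT",
              "ISSUANCE OF WARRANT", "ISSUANCE OF WARRANT", "VIOLATION OF PAROLE",
              "RETAIL THEFT")
)

top_statutes_by_sex <- toy %>%
  filter(charge %in% c("CLASS A MISDEMEANOR", "CLASS UNKNOWN"), sex != "X") %>%
  count(charge, sex, statute, name = "arrests") %>%
  group_by(charge, sex) %>%
  # Share of each statute within a charge class and sex.
  mutate(pct = arrests / sum(arrests)) %>%
  # Keep up to the 3 most frequent statutes per charge class and sex.
  slice_max(arrests, n = 3, with_ties = FALSE) %>%
  ungroup()
```

Comparing the `pct` column between sexes for the same statute would show which statute descriptions drive the difference.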

Race and charge class

Because there are drastically more arrests for black individuals, it is again simpler to look at relative percentages of each charge type within each race in order to compare between races. While there are a number of differences across charge class and race, the following are major standouts:

White Hispanics are represented in business offense categories at roughly twice the rate of other races, less often in class 3 felonies than other races, and more often in petty offense arrests.
American Indians/Alaskan Natives have a far lower rate of class 1 felony arrests than other races, and have a much higher rate of class murder felonies, local ordinance arrests, and arrests with missing values for charges.
Blacks have the highest rates of class 2 felonies, class B misdemeanors, and class X felonies, while Black Hispanics have the highest rate of class 1 felonies.
Whites have the highest rates of class 4 felonies and unknown classes.

Overall, there does not appear to be a consistent pattern of rates for a given race across different charge classes. In other words, there isn’t a specific race that is always highest or lowest in terms of arrest rates across all charge classes.

pd %>% 
  tabyl(charge, race) %>% 
  adorn_percentages("col") %>% 
  adorn_pct_formatting() %>% 
  kable()

charge	AMER IND/ALASKAN NATIVE	ASIAN/PACIFIC ISLANDER	BLACK	BLACK HISPANIC	UNKNOWN	WHITE	WHITE HISPANIC
BUSINESS OFFENSE	0.0%	0.1%	0.1%	0.1%	0.0%	0.1%	0.1%
CLASS 1 FELONY	0.5%	1.9%	1.7%	2.3%	1.0%	1.5%	1.6%
CLASS 2 FELONY	2.7%	2.7%	3.3%	2.9%	2.4%	2.4%	2.8%
CLASS 3 FELONY	3.6%	4.2%	3.4%	3.4%	3.8%	3.1%	2.3%
CLASS 4 FELONY	6.8%	7.6%	12.8%	11.7%	9.2%	13.7%	13.2%
CLASS A MISDEMEANOR	38.4%	49.6%	34.6%	42.6%	43.2%	41.4%	41.8%
CLASS B MISDEMEANOR	6.6%	7.0%	8.7%	6.1%	8.0%	7.5%	6.1%
CLASS C MISDEMEANOR	5.4%	4.7%	4.4%	4.3%	6.1%	4.2%	4.6%
CLASS MURDER	1.0%	0.2%	0.3%	0.5%	0.1%	0.1%	0.3%
CLASS UNKNOWN	15.3%	13.2%	14.4%	13.6%	14.8%	17.0%	13.1%
CLASS X FELONY	1.9%	1.9%	3.4%	2.4%	1.6%	0.9%	1.9%
LOCAL ORDINANCE	17.3%	6.4%	11.7%	9.1%	8.8%	7.4%	10.7%
PETTY OFFENSE	0.5%	0.6%	1.1%	1.1%	1.1%	0.7%	1.4%

pd %>% 
  count(charge, race) %>% 
  group_by(race) %>% 
  mutate(total = sum(n),
         pct = n / total) %>% 
  ggplot(aes(x = charge, y = pct, fill = race)) +
  geom_col(position = "dodge") +
  facet_wrap(~charge, scales = "free") +
  scale_fill_viridis_d() +
  scale_y_continuous(labels = label_percent()) +
  theme(legend.position = "top", axis.text.x = element_blank())

Charge demographics over time

Changes in the frequencies of arrests for each charge class over time by sex and by race were cursorily analyzed (due to time restrictions) but there are no immediately obvious patterns that differ from the overall changes over time seen previously.
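
A more systematic version of this check could compute, for each year and sex (or race), the share of arrests in each charge class and then compare those shares across years. The sketch below is only an illustration on invented toy data (the `toy` tibble is hypothetical) and is not the analysis that was run.

```r
library(dplyr)

# Hypothetical toy data standing in for `pd`; values are invented.
toy <- tibble(
  arrest_date = as.Date(c("2013-01-05", "2013-06-10", "2013-09-01", "2013-12-24",
                          "2018-02-14", "2018-05-30", "2018-08-08", "2018-11-11")),
  sex    = c("M", "M", "F", "F", "M", "M", "F", "F"),
  charge = c("CLASS 4 FELONY", "CLASS A MISDEMEANOR", "CLASS A MISDEMEANOR",
             "CLASS A MISDEMEANOR", "CLASS 4 FELONY", "CLASS 4 FELONY",
             "CLASS A MISDEMEANOR", "CLASS 4 FELONY")
)

charge_share_by_year <- toy %>%
  mutate(year = as.integer(format(arrest_date, "%Y"))) %>%
  count(year, sex, charge, name = "arrests") %>%
  group_by(year, sex) %>%
  # Share of each charge class within a given year and sex.
  mutate(share = arrests / sum(arrests)) %>%
  ungroup()
```

Plotting `share` against `year` with one line per sex, faceted by `charge`, would make any divergence from the overall trend easier to spot than comparing raw counts.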

4. Summary of Findings

Potential duplicates were present, but checking various aspects of the data suggests that all observations are unique.
The DOB year data has some errors in it, with 1900 used as a placeholder for unknown, and some individuals having birthdays in the 1800s. These were removed, but there were still 47 individuals with ages above 100 (one was 130!). DOB data should be treated as questionable.
Arrests have decreased over time.
There is a seasonal effect in arrests.
The majority of arrests are for people ages 20-30, and decrease as age increases.
Over time, the age at arrest is increasing
Arrests are predominately men
Arrests are predominately black, followed by white hispanic and then white.
The data entry process/rules for unknown charge classes were changed in 2015 such that previously missing values began being entered as “Class Unknown”.
Most charge classes make up an equal or larger share of men’s arrests than of women’s; the exceptions are Class A misdemeanors and Class Unknown, which make up a larger share of women’s arrests.
White Hispanics are represented in business offense categories at roughly twice the rate of other races, less often in class 3 felonies than other races, and more often in petty offense arrests.
American Indians/Alaskan Natives have a far lower rate of class 1 felony arrests than other races, and have a much higher rate of class murder felonies, local ordinance arrests, and arrests with missing values for charges.
Blacks have the highest rates of class 2 felonies, class B misdemeanors, and class X felonies, while Black Hispanics have the highest rate of class 1 felonies.
Whites have the highest rates of class 4 felonies and unknown classes.