Goals of this investigation

  1. Identify which charges are the largest drivers of pretrial jail admissions
  2. Outline the demographics (age, race, gender) of those in pretrial detention
  3. Show the distribution of bond amounts (with means by race)
  4. Show the distribution of jail stays (with means by race)
  5. Show the relationship between bond amount and pretrial jail stay duration

Basic facts

Read in the CSV and clean it up.

nrow(CT)
[1] 3534889

It has 3.53 million rows.

How many unique people are in the data?

#search unique by identifier
nrow(unique(CT[,"identifier"]))
[1] 34892

There are 34,892 unique people. But maybe some people have gone in and out of jail multiple times? So, how many unique person-admissions are there?

#search unique rows by identifier and latest admission date
nrow(unique(CT[,c("identifier","latest_admission_date")]))
[1] 50247

50,247 unique person-admissions.

Since an admission can involve multiple charges, there may be more than 50K charges behind the ~50K admissions.

nrow(unique(CT[,c("identifier","latest_admission_date", "offense")]))
[1] 54367

Specifically, there are 54,367 charges across all admissions in this data (unique person-admission-offense combinations).
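
To make the three levels of counting concrete, here is a minimal sketch on made-up data. The data frame toy and its values are hypothetical; only the column names follow the ones used above. It uses base R, so it does not depend on how CT was loaded.

```r
# Hypothetical toy data with repeated rows, standing in for CT.
toy <- data.frame(
  identifier = c("A", "A", "A", "B", "B", "C"),
  latest_admission_date = c("2017-01-01", "2017-01-01", "2018-03-05",
                            "2017-02-01", "2017-02-01", "2017-02-01"),
  offense = c("LARCENY", "ASSAULT", "LARCENY", "LARCENY", "LARCENY", "ASSAULT"),
  stringsAsFactors = FALSE
)

# Unique people: A, B, C
n_people <- length(unique(toy$identifier))

# Unique person-admissions: A has two admissions, B and C one each
n_admissions <- nrow(unique(toy[, c("identifier", "latest_admission_date")]))

# Unique person-admission-offense rows: A's first admission carries two charges
n_charges <- nrow(unique(toy[, c("identifier", "latest_admission_date", "offense")]))

print(c(people = n_people, admissions = n_admissions, charges = n_charges))
```

Each count can only stay the same or grow as more columns define uniqueness, which is the pattern in the real numbers (34,892, then 50,247, then 54,367).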

How many unique charges?

nrow(unique(CT[,c("offense")]))
[1] 348

There are 348 unique charges.

1. Why in Jail?

Consider the previously mentioned ~54K charges.

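#needs dplyr loaded (for %>%, group_by, summarise, mutate)
#count each offense across unique person-admission-offense rows,
#then compute its share of all charges
charges<-unique(CT[,c("identifier","latest_admission_date", "offense")]) %>%
  group_by(offense)%>%
  summarise(count=n())%>%
  mutate(tot=sum(count), perc=count/tot)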

Make a graph of top 10 charges.

library(stringr)

#keep charges with more than a 1.9% share (the cutoff used here to get the top 10), largest first
charges10<-subset(charges, perc>.019)
charges10<-charges10[order(-charges10$perc),]
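
The .019 cutoff is hand-picked for this data. As an alternative, here is a small sketch that takes the top 10 by rank instead, so it keeps working if the shares change. The charges_toy data frame is hypothetical and this is not the selection used for the graph below.

```r
# Hypothetical counts for 12 offenses, standing in for the charges table.
charges_toy <- data.frame(
  offense = paste0("OFFENSE_", 1:12),
  count = c(50, 40, 30, 25, 20, 15, 10, 8, 6, 4, 2, 1),
  stringsAsFactors = FALSE
)

# Share of all charges, as in the perc column above.
charges_toy$perc <- charges_toy$count / sum(charges_toy$count)

# Sort by share (largest first) and keep the first 10 rows.
top10_toy <- head(charges_toy[order(-charges_toy$perc), ], 10)

print(top10_toy)
```

One caveat with this version: if two offenses tie at the 10th spot, head() keeps whichever comes first after sorting.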

#fix up offense
charges10$offense<-tolower(charges10$offense)
charges10$offense<-str_remove(charges10$offense, " df")
charges10$offense<-str_remove(charges10$offense, "  f")
charges10$offense<-str_remove(charges10$offense, " am")
charges10$offense<-str_squish(charges10$offense)
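
One thing to watch with this cleanup: str_remove drops the first match wherever it occurs, so a pattern like " am" could also hit the middle of a label. Here is a sketch of an anchored alternative in base R, assuming the codes only ever appear at the end of the label (I have not verified that for every offense). The labels in offense_toy are made up.

```r
# Hypothetical offense labels; the real ones in CT may look different.
offense_toy <- c("larceny 6th deg am", "burglary 3rd deg  f",
                 "sale of narcotics df", "illegal ammunition sale")

# Removing the first " am" anywhere also damages the last label.
unanchored <- sub(" am", "", offense_toy, fixed = TRUE)

# Anchored version: drop only a trailing "df", "f", or "am" code
# together with the whitespace in front of it.
anchored <- sub("\\s+(df|f|am)$", "", offense_toy)

print(unanchored)
print(anchored)
```

For the ten labels graphed here the two approaches may well give the same result; the anchored pattern just removes the risk.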

Make offense a factor whose levels are sorted by share, so the x axis is ordered by share rather than alphabetically.

charges10$offense<- factor(charges10$offense, levels = charges10$offense[order(charges10$perc)])
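
A tiny illustration of what that line does, on a hypothetical three-row data frame d. The factor levels follow the order of perc (smallest first), and ggplot draws a discrete axis in level order.

```r
# Hypothetical shares for three offenses.
d <- data.frame(
  offense = c("b", "a", "c"),
  perc = c(0.2, 0.5, 0.3),
  stringsAsFactors = FALSE
)

# Same pattern as above: levels are the offenses sorted by increasing perc.
d$offense <- factor(d$offense, levels = d$offense[order(d$perc)])

print(levels(d$offense))
```

Without this step the bars would be drawn in alphabetical order.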

Before making graphs, I load the plotting libraries.

#Load more libraries
library(ggplot2);library(ggrepel); library(extrafont); library(ggthemes);library(reshape);library(grid); library(dplyr)
library(scales);library(RColorBrewer);library(gridExtra)

I graph!

ggplot(data=charges10, aes(x=offense, y=perc)) + 
  geom_bar(stat = 'identity')+
  theme_minimal()+ theme(text=element_text(family="Palatino"))+
  scale_y_continuous(labels = percent, limits=c(0,.175), breaks=seq(0,.175,.025))+
  theme(plot.title = element_text(hjust = 0))+
  labs(x="", y="", caption="These calculations consider all 54,367 unique person-admission-offense observations from 7/1/16-7/28/19.")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))+
  ggtitle("Why Are People in Jail? (Top 10 Charges)", subtitle = "Data Available via Connecticut Open Data")

ggsave("graphs/offense_top_10.png", width=7, height=7, dpi=900)

Note: this is not the same as looking at days in jail across offenses! This is looking at admissions! Many of these are smaller charges, so those inmates might be in and out pretty quickly.

2. Demographics

Gender/Race

First, let’s summarize by gender/race. Look at unique people over the whole time period.

CTpeople<-unique(CT[,c("identifier", "race", "gender")])

OK, there are 35,023 rows instead of 34,892. That means some people are coded with different genders/races at different points in the data. Let’s check it out.

Let’s exclude the people coded with different race and/or genders across the time period.

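#flag the second and later rows of any identifier that appears more than once
#(by= is the data.table method of duplicated)
CTpeople$dup<-duplicated(CTpeople, by="identifier")
CTpeopledup<-subset(CTpeople, dup==TRUE)
#mark those identifiers for exclusion
CTpeopledup1<-CTpeopledup[,1]
CTpeopledup1$exclude<-1
#merge the flag back on, then drop every row for a flagged identifier
CTpeople<-merge(CTpeople, CTpeopledup1, by ="identifier", all=TRUE)
CTp<-subset(CTpeople, is.na(CTpeople$exclude))
CTp<-CTp[,c(1:3)]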

Great, so now we use CTp to look at the race/gender demographics of the 34,761 people who are coded consistently (excluding the 131 inconsistent ones) and were inmates in CT correctional facilities from 7/1/16-7/28/19.
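
Here is a minimal base R sketch of the same exclusion on made-up data, which also shows one subtlety in the counting. The 131 is 35,023 minus 34,892, i.e. the number of extra rows. That equals the number of inconsistently coded people only if each of them has exactly two codings, which I can't confirm from the output above. The data frame ct_toy and its codes are hypothetical.

```r
# Hypothetical codings: B has two race/gender combinations, D has three.
ct_toy <- data.frame(
  identifier = c("A", "A", "B", "B", "C", "D", "D", "D"),
  race   = c("WHITE", "WHITE", "BLACK", "HISPANIC", "HISPANIC",
             "WHITE", "BLACK", "WHITE"),
  gender = c("M", "M", "F", "F", "M", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# One row per distinct identifier/race/gender combination.
people_toy <- unique(ct_toy)

# Identifiers with more than one combination are inconsistently coded.
codings_per_person <- table(people_toy$identifier)
inconsistent_ids <- names(codings_per_person)[codings_per_person > 1]

# Keep only people with a single coding.
consistent_toy <- people_toy[!(people_toy$identifier %in% inconsistent_ids), ]

# Extra rows beyond one per person (the analogue of 35,023 - 34,892).
extra_rows <- nrow(people_toy) - length(unique(people_toy$identifier))

print(inconsistent_ids)
print(extra_rows)
```

In this toy case there are 3 extra rows but only 2 inconsistent people, because D is coded three ways. Counting length(inconsistent_ids) directly avoids that ambiguity.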

I define some colorblind-friendly palettes.

# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
Pal <- c("#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# The palette with black:
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

I graph!

ggplot(data=CTp, aes(race, fill=gender)) + 
  geom_bar()+
  theme_minimal()+ theme(text=element_text(family="Palatino"))+
  scale_fill_manual(name="Gender", values = c("gold", "dodgerblue"), labels=c("Female", "Male")) +
  scale_y_continuous(limits=c(0,14000), breaks=seq(0,14000,2000))+
  theme(plot.title = element_text(hjust = 0))+
  labs(x="", y="Number of Pretrial Inmates", caption="The total number of unique pretrial inmates in this time period is 34,892; I present data for 34,761 here.\n(I exclude 131 individuals who are coded inconsistently by race or gender over time).\nThe dataset codes Hispanic as a race (this is different than, say, the Census methodology).")+
  ggtitle("Gender and Race of Connecticut Pretrial Inmates (7/1/16-7/28/19)", subtitle = "Data Available via Connecticut Open Data")
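
The counts behind a stacked bar chart like this can be checked with a plain cross-tab. A minimal sketch on a hypothetical data frame ctp_toy (the race and gender codes here are made up, not necessarily the ones in CTp):

```r
# Hypothetical people, one row each, standing in for CTp.
ctp_toy <- data.frame(
  race = c("BLACK", "BLACK", "WHITE", "HISPANIC", "WHITE", "BLACK"),
  gender = c("M", "F", "M", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Rows are race, columns are gender; each cell is one bar segment.
counts <- table(ctp_toy$race, ctp_toy$gender)

# Row sums give the total bar height for each race.
race_totals <- rowSums(counts)

print(counts)
print(race_totals)
```

Running table(CTp$race, CTp$gender) on the real data would give the numbers to compare against the graph.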