MarylandIncarceration

Author

Julian Beckert

MD Incarceration Study

The dataset I am using in this project is “Racial and Equity Impact Notes 2024”, submitted to Maryland’s Open Data Portal by Jay E. Miller. It gives an overview of the people incarcerated in Maryland’s prison system: their offenses, jurisdictions, prisons, and demographic details such as race, age, sex and birthplace. I care about social justice and find visualizations of social issues interesting, so I was excited to have a chance to examine a related topic myself.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("/Users/Lucinda/Downloads/data110")
equity <- read_csv("mdracialequity.csv")

Rows: 95408 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): FACILITY, Race, Sex, BIRTH_PLACE, Jurisdiction, Offense
dbl (5): ID, Age at Extract, Number of Jurisdictions, Number of Offenses, Se...
lgl (1): Lifer

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean up column names

As it is, the dataset is difficult to use. Here, I remove spaces from column names and make them the same case.

names(equity) <- tolower(names(equity))
names(equity) <- gsub(" ","",names(equity))
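
As a small illustration of what these two lines do, here is the same transformation applied to a few of the column names listed in the import message. The vector `demo_names` is only for demonstration and is not part of the analysis.

```r
# Hypothetical demo vector using column names from the import message
demo_names <- c("FACILITY", "Age at Extract", "Number of Offenses")

demo_names <- tolower(demo_names)                      # make everything lower case
demo_names <- gsub(" ", "", demo_names, fixed = TRUE)  # drop the spaces
demo_names
```

This is why later code can refer to columns such as numberofoffenses without backticks or quoting.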

Statistical Analysis

lengthjuris <- summary(lm(sentencelength ~ jurisdiction, data = equity))
lengthjuris


Call:
lm(formula = sentencelength ~ jurisdiction, data = equity)

Residuals:
   Min     1Q Median     3Q    Max 
-12454  -5674  -2279   3365 729101 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          8236.1      263.1  31.307  < 2e-16 ***
jurisdictionAnne Arundel County      1084.4      324.8   3.339 0.000843 ***
jurisdictionBaltimore City           2278.4      276.5   8.241  < 2e-16 ***
jurisdictionBaltimore County         1255.9      293.4   4.281 1.86e-05 ***
jurisdictionCalvert County           -307.8      529.6  -0.581 0.561107    
jurisdictionCaroline County          2195.6      546.5   4.018 5.89e-05 ***
jurisdictionCarroll County            222.7      352.0   0.633 0.526851    
jurisdictionCecil County             2502.5      441.8   5.664 1.48e-08 ***
jurisdictionCharles County           4570.6      377.2  12.118  < 2e-16 ***
jurisdictionDorchester County         613.7      423.2   1.450 0.147062    
jurisdictionFrederick County         2677.6      357.3   7.493 6.80e-14 ***
jurisdictionGarrett County          -1178.8      846.5  -1.393 0.163756    
jurisdictionHarford County           -244.3      351.5  -0.695 0.486979    
jurisdictionHoward County            1710.5      369.7   4.626 3.73e-06 ***
jurisdictionKent County               240.9     1086.5   0.222 0.824529    
jurisdictionMontgomery County        2511.4      298.3   8.419  < 2e-16 ***
jurisdictionPrince George's County   4675.3      302.9  15.437  < 2e-16 ***
jurisdictionQueen Anne's County      -217.7      577.5  -0.377 0.706203    
jurisdictionSomerset County           118.2      426.2   0.277 0.781601    
jurisdictionSt. Mary's County         450.8      442.2   1.019 0.307990    
jurisdictionTalbot County            1206.6      639.1   1.888 0.059023 .  
jurisdictionWashington County        1264.1      305.5   4.138 3.50e-05 ***
jurisdictionWicomico County          2206.2      308.2   7.158 8.25e-13 ***
jurisdictionWorcester County          -91.0      430.8  -0.211 0.832695    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11880 on 78726 degrees of freedom
  (16658 observations deleted due to missingness)
Multiple R-squared:  0.01126,   Adjusted R-squared:  0.01097 
F-statistic: 38.99 on 23 and 78726 DF,  p-value: < 2.2e-16

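The overall p-value is < 2.2e-16, which is extremely small, so sentence length differs across jurisdictions by more than chance alone would explain. Baltimore City's coefficient also has p < 2e-16: its mean sentence is about 2,278 days longer than that of the reference jurisdiction (the intercept). Statistical significance is not the same as a strong relationship, though. With an R-squared of about 0.011, jurisdiction accounts for only around 1% of the variation in sentence length, and the very large sample makes even small differences significant. I expected to see some difference; Baltimore is known for higher rates of crime and a harsh justice system.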
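
To see how a tiny p-value and a tiny R-squared can appear together, here is a minimal sketch on simulated data (not the Maryland dataset). The group labels, sample size, and effect size are all assumptions chosen for illustration.

```r
# Simulated data: two groups whose means differ by a small amount
# relative to the spread within each group.
set.seed(110)
n <- 20000
group <- rep(c("A", "B"), each = n / 2)
y <- rnorm(n, mean = 0, sd = 1) + ifelse(group == "B", 0.15, 0)

fit <- summary(lm(y ~ group))

p_value   <- fit$coefficients["groupB", "Pr(>|t|)"]
r_squared <- fit$r.squared

p_value    # very small: the group difference is statistically significant
r_squared  # also very small: group explains little of the variation in y
```

With many observations, even a modest difference in means is detected reliably, so the p-value says that a difference exists while R-squared says how much of the variation it explains.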

Explore dataset, clean up & mutate

I plan on making graphs analyzing racial demographics, so shortening the names of the racial groups in the dataset will be important to avoid clutter. I use mutate and case_when to recode the values in the race column. I also need a dataframe that keeps only single offenders (rows where the number of offenses is 1); there are several major outliers in this dataset with multiple offenses and much longer sentences as a result. If I am looking at sentence length averages, these outliers will probably skew the average upwards.

unique(equity$race) # list all unique values in the "race" column

 [1] "Black"                                                        
 [2] "White"                                                        
 [3] NA                                                             
 [4] "White - Hispanic or Latino"                                   
 [5] "Black - Hispanic or Latino"                                   
 [6] "Asian or Pacific islander"                                    
 [7] "Unknown - Hispanic or Latino"                                 
 [8] "Asian or Pacific islander - Hispanic or Latino"               
 [9] "Unknown"                                                      
[10] "Native American Indian or Alaskan Native"                     
[11] "Native American Indian or Alaskan Native - Hispanic or Latino"

equity <- equity |>
  mutate(
    race = case_when(
      is.na(race) ~ NA_character_, # missing values stay NA; race == "NA" would never match a real NA
      race == "Black" ~ "Black",
      race == "White" ~ "White",
      race == "White - Hispanic or Latino" ~ "White (Hisp/Lat)",
      race == "Black - Hispanic or Latino" ~ "Black (Hisp/Lat)",
      race == "Asian or Pacific islander" ~ "AAPI",
      race == "Unknown - Hispanic or Latino" ~ "Unknown (Hisp/Lat)",
      race == "Asian or Pacific islander - Hispanic or Latino" ~ "AAPI (Hisp/Lat)",
      race == "Unknown" ~ "Unknown",
      race == "Native American Indian or Alaskan Native" ~ "Indigenous",
      race == "Native American Indian or Alaskan Native - Hispanic or Latino" ~ "Indigenous (Hisp/Lat)"
    ))
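
One detail worth checking when recoding with case_when: a missing value is NA, not the string "NA", so a comparison such as race == "NA" never matches it. The sketch below uses a made-up vector, `race_demo`, and a "Missing" label chosen only for this demo. It also assumes dplyr 1.1.0 or later for the `.default` argument.

```r
library(dplyr)

# Hypothetical values, including a real missing value
race_demo <- c("Black", NA, "Asian or Pacific islander")

race_demo == "NA"   # the NA element compares to NA, never TRUE
is.na(race_demo)    # this is the test that actually finds missing values

recoded <- case_when(
  is.na(race_demo) ~ "Missing",                       # explicit label, for this demo only
  race_demo == "Asian or Pacific islander" ~ "AAPI",
  .default = race_demo                                # leave every other value unchanged
)
recoded
```

Using `.default` also means categories that already have a short name do not need their own line.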

oneoffense <- equity |>
  filter(numberofoffenses == 1)
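
To show why extreme sentences matter for averages, here is a small sketch with invented sentence lengths. Only the last value, 739,616 days, comes from this dataset (it is the longest sentence, discussed in the essay); the others are assumptions.

```r
# Sentence lengths in days; the last value is an extreme outlier
sentences <- c(1825, 2190, 3650, 5475, 739616)

mean(sentences)    # pulled far upwards by the single extreme value
median(sentences)  # barely affected by it
```

Comparing the mean with the median is one quick way to judge how much outliers are driving an average.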

# quick check to see which offenses have the highest count before I continue
offensetotal <- oneoffense |> 
  group_by(offense) |>
  summarise(count = n()) |>
  arrange(desc(count))
offensetotal

# A tibble: 233 × 2
   offense                    count
   <chr>                      <int>
 1 ASSAULT-FIRST DEGREE         380
 2 ASSAULT-SEC DEGREE           318
 3 CDS: POSS W/I DIST: NARC     206
 4 FIREARM POSS W/FEL CONVICT   200
 5 MURDER-SECOND DEGREE         193
 6 MURDER-FIRST DEGREE          158
 7 SEX ABUSE MINOR              158
 8 RAPE SECOND DEGREE           151
 9 ARMED ROBBERY                149
10 ROBBERY                      132
# ℹ 223 more rows
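
The counting step above could also be written more compactly with dplyr's count(). This is a sketch on a made-up table, `demo`, standing in for oneoffense.

```r
library(dplyr)

# Hypothetical stand-in for the oneoffense table
demo <- tibble(offense = c("ROBBERY", "ARMED ROBBERY", "ROBBERY",
                           "ROBBERY", "ARMED ROBBERY", "ARSON"))

# Same steps as in the report
long_form <- demo |>
  group_by(offense) |>
  summarise(count = n()) |>
  arrange(desc(count))

# Shorter equivalent: count, name the column, and sort in one call
short_form <- demo |>
  count(offense, sort = TRUE, name = "count")

identical(long_form$offense, short_form$offense)
identical(long_form$count, short_form$count)
```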

Make means tables and other dataframes that I’ll need

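Here, I make two tables for the offense with the highest count, first-degree assault. Both drop rows with a missing jurisdiction or sentence length and add a count column alongside the mean sentence length; the first groups by jurisdiction and race, the second by race alone.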

assaultfirstall <- oneoffense |>
  filter(offense == "ASSAULT-FIRST DEGREE" & !is.na(jurisdiction) & !is.na(sentencelength)) |>
  group_by(jurisdiction, race) |>
  summarise(count = n(),
            sentencelength = mean(sentencelength)
            )

`summarise()` has grouped output by 'jurisdiction'. You can override using the
`.groups` argument.
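
The message above is informational: after summarising by two variables, the result is still grouped by the first one. A minimal sketch of silencing it by stating the grouping explicitly, using an invented table `demo`:

```r
library(dplyr)

# Hypothetical miniature of the jurisdiction-by-race table
demo <- tibble(
  jurisdiction   = c("A", "A", "B", "B"),
  race           = c("Black", "White", "Black", "Black"),
  sentencelength = c(1000, 2000, 3000, 5000)
)

by_group <- demo |>
  group_by(jurisdiction, race) |>
  summarise(count = n(),
            sentencelength = mean(sentencelength),
            .groups = "drop") # return an ungrouped tibble and silence the message

by_group
is_grouped_df(by_group) # FALSE
```

Dropping the groups is usually the safer choice when the table will be passed to mean() or ggplot afterwards.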

# assaultfirstall (above) supplies the averages line on my next plot.

# assaultavg is the main source of data in my next plot.
assaultavg <- oneoffense |>
  filter(offense == "ASSAULT-FIRST DEGREE" & !is.na(jurisdiction) & !is.na(sentencelength)) |>
  group_by(race) |>
  summarise(count = n(),
            sentencelength = mean(sentencelength)
            ) |> 
  filter(count > 10) # keep only racial groups with more than 10 inmates
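
A caution when reusing tables of group means: averaging the group means is not the same as averaging the individual rows, because small groups then count as much as large ones. The sketch below uses made-up numbers in a table called `demo`.

```r
library(dplyr)

# Hypothetical data: one small group with a long sentence, one larger group with short ones
demo <- tibble(
  race           = c("X", "Y", "Y", "Y", "Y"),
  sentencelength = c(10000, 2000, 2000, 2000, 2000)
)

group_means <- demo |>
  group_by(race) |>
  summarise(count = n(), sentencelength = mean(sentencelength))

mean_of_means <- mean(group_means$sentencelength)  # every group counts equally
overall_mean  <- mean(demo$sentencelength)         # every row counts equally
weighted_mean <- weighted.mean(group_means$sentencelength,
                               group_means$count)  # weights groups by size

c(mean_of_means, overall_mean, weighted_mean)
```

If the baseline is meant to represent all inmates, the overall or weighted version is the one to use.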

First-degree assault sentence lengths: lollipop chart

I was researching different types of plots that I could make, and came upon the “lollipop chart”: a combination of geom_point, geom_segment, and geom_hline that shows individual values and how far they fall from a baseline. In this case, it shows how far the mean sentence for each racial group falls from an overall baseline, which the code below computes as the average of the jurisdiction-by-race means in assaultfirstall (so it is not weighted by the number of inmates in each group).

assaultavg |>
  ggplot(aes(x = race, y = sentencelength)) +
  geom_segment(aes(x = race,
                   y = mean(assaultfirstall$sentencelength),
                   xend = race,
                   yend = sentencelength)) + # creates lines leading to each dot
  geom_point(size = 7,
             color = "#FFB384",
             alpha = 0.8) + # creates dots for racial group means
  geom_hline(yintercept = mean(assaultfirstall$sentencelength), color = "#3d3b3c", linewidth = .5) + # creates the main horizontal line
  labs(x = "Race",
       y = "Sentence Length (days)",
       title = "Average Sentence Length per Race for 1st-Degree Assault") +
  theme_minimal(base_size = 12, base_family = "serif")

Baltimore mean sentence lengths: lollipop chart

Baltimore City has more sentences than any other jurisdiction in this dataset. Anyone who has lived in Maryland for long enough knows the context of this: Baltimore has higher rates of crime than other cities, but also a history of an unusually cruel justice system with more pronounced racial bias than most others in the state. Many data points skewed towards Baltimore in the plots I made while getting a feel for this dataset, so I wanted to dive deeper into it.

oneoffbaltimore <- oneoffense |>
  filter(jurisdiction == "Baltimore City" & !is.na(sentencelength))
# "oneoffense" but just for Baltimore City

baltimore <- oneoffense |>
  filter(jurisdiction == "Baltimore City" & !is.na(sentencelength)) |>
  group_by(race) |>
  summarise(count = n(),
            sentencelength = mean(sentencelength)
            ) |> 
  mutate(mean_sentencelength = mean(sentencelength)) |>
  filter(count > 5)
# "assaultavg" but just for Baltimore City


baltimore |>
  ggplot(aes(x = race, y = sentencelength)) +
  geom_segment(aes(x = race,
                   y = mean(oneoffbaltimore$sentencelength),
                   xend = race,
                   yend = sentencelength)) +
  geom_point(size = 7,
             color = "#790033",
             alpha = 0.8) +
  geom_hline(yintercept = mean(oneoffbaltimore$sentencelength), color = "#3d3b3c", linewidth = .5) + # linewidth replaces the deprecated size for lines
  labs(x = "Race",
      y = "Sentence Length (days)",
      title = "Mean Sentence Length per Race in Baltimore for Single Offenders",
      caption = "Source: Jay E. Miller on Maryland's Open Data Portal") +
    theme_minimal(base_size = 12, base_family = "serif")

Baltimore sentencing by race: bar chart

I decided I wanted to compare the previous chart to the total number of people incarcerated in Baltimore by race, to give a sense of how skewed the mean sentence length was from the inmate totals.

totalbaltimore <- oneoffense |>
  filter(jurisdiction == "Baltimore City") |>
  group_by(race) |>
  summarise(count = n())


ggplot(totalbaltimore, aes(x = count, y = race, fill = race)) +
  geom_bar(stat = "identity")+
  scale_fill_manual(values = c("White (Hisp/Lat)" = "#790033", 
                               "White" = "#8A163D",
                               "Unknown (Hisp/Lat)" = "#9B2D47",
                               "Unknown" = "#AB4351",
                               "Indigenous" = "#BC5A5C",
                               "Black (Hisp/Lat)" = "#CD7066",
                               "Black" = "#DE8670",
                               "AAPI (Hisp/Lat)" = "#EE9D7A",
                               "AAPI" = "#FFB384")) +
  labs(title = "Totals of Single Offenders Incarcerated in Baltimore City, by Race",
       x = "Inmate Total",
       y = "Race",
       caption = "Source: Jay E. Miller on Maryland's Open Data Portal") +
  geom_text(aes(label=count), hjust=-0.3, color="black", size=2) +
  theme_minimal(base_size = 12, base_family = "serif")

Essay

I was very surprised by some of the numbers I saw in this dataset and the graphs I was making. One thing I noticed was that while some people were sentenced for a number of offenses in the 70s, the inmate with the highest specified sentence in the entire dataset (739,616 days, or about 2,026 years) was only arrested on four charges, none of them murder. He was arrested on three charges of possessing or distributing controlled substances, and one charge of second-degree assault. The first variable I looked at in this dataset was second-degree assault. Single offenders with this charge are usually only in for around 5 years, 10 at most. Both he and the inmate with the next longest sentence (once again, no murder) are Black men from Baltimore City. The third longest sentence was given to another Black man, from Prince George’s County this time.

Since this was the first thing I saw, I was surprised when I found that white inmates are on average given longer sentences than Black inmates. Yet, at the same time, white inmates make up a significantly smaller share of the prison population, as I illustrated with the bar chart of single offenders incarcerated in Baltimore City. I have no solid answer for why this is, but I would hesitantly suggest that racial profiling may be one of the largest contributing factors. Black people are much more likely to be wrongfully convicted or sentenced to prison without sufficient evidence of their guilt. White people, due to privilege and police bias, can get away with crime more easily. Therefore, when a white person is sentenced to prison, they are more likely to have committed a serious crime with solid evidence to prove it, and may receive a longer sentence on these grounds.

I chose Baltimore because it stood out the most in the dataset and has a notorious history of a racist legal system, which came to national attention following the death of Freddie Gray in police custody in 2015.
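
As a quick check of the day-to-year conversion for the 739,616-day sentence, a one-line calculation is enough. This sketch assumes 365-day years and ignores leap years.

```r
days  <- 739616      # longest sentence length mentioned in the essay, in days
years <- days / 365  # assumption: 365-day years, leap years ignored
round(years)         # 2026
```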

Some things I wanted to include but couldn’t were more comparison and stacked charts. I wanted to have a table that shows the mean sentence length per racial group in Baltimore and at the same time the total number of sentences per racial group in Baltimore. Specifically, I wanted to include another lollipop chart, showing the mean number of sentences for all groups in Baltimore as the baseline, with lines leading to dots that show the mean number of sentences for each individual racial group. I tried something similar to this, by using geom_text to add data labels for inmate count to the second lollipop chart, centering them inside the dots. However, I couldn’t find a way to add a legend explaining what they meant, so I left it out.
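
One possible workaround for the missing legend, offered as a sketch rather than the only way: keep the count labels inside the dots and explain them in the plot caption. The table `demo` and its numbers are made up; only the column names follow the baltimore table. The code assumes ggplot2 3.4.0 or later for `linewidth`.

```r
library(ggplot2)

# Hypothetical summary table with the same columns as the "baltimore" table
demo <- data.frame(
  race           = c("Black", "White", "AAPI"),
  count          = c(120, 45, 8),
  sentencelength = c(5200, 6100, 4300)
)
baseline <- mean(demo$sentencelength) # simple average of the three group means

p <- ggplot(demo, aes(x = race, y = sentencelength)) +
  geom_segment(aes(xend = race, y = baseline, yend = sentencelength)) +
  geom_point(size = 9, color = "#790033", alpha = 0.8) +
  geom_text(aes(label = count), color = "white", size = 3) + # count centered in each dot
  geom_hline(yintercept = baseline, color = "#3d3b3c", linewidth = 0.5) +
  labs(x = "Race",
       y = "Sentence Length (days)",
       caption = "Numbers inside the dots: inmates in each group (made-up values)") +
  theme_minimal(base_size = 12)
p
```

Another option would be to map count to the size aesthetic of geom_point, which produces a legend automatically, at the cost of making small groups harder to see.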

I was very proud of the work I did exploring and cleaning this dataset. I used tolower and gsub to make the column names easy to use, and mutate with case_when to recode the values in the race column for legibility in my plots. Initially, when creating the filtered tables and mean tables, I did each part individually. I realized halfway through that I could use piping to streamline the process and I am glad I did.