The dataset I am using in this project is “Racial and Equity Impact Notes 2024”, submitted to Maryland’s Open Data Portal by Jay E. Miller. It gives an overview of the people incarcerated in Maryland’s prison system, including their offenses, jurisdictions, prisons, and demographic details such as race, age, sex, and birthplace. I care about social justice and find visualizations of social issues interesting, so I was excited to have a chance to examine a related topic myself.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 95408 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): FACILITY, Race, Sex, BIRTH_PLACE, Jurisdiction, Offense
dbl (5): ID, Age at Extract, Number of Jurisdictions, Number of Offenses, Se...
lgl (1): Lifer
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Clean up column names
As it is, the dataset is difficult to use. Here, I remove spaces from column names and make them the same case.
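A minimal sketch of this kind of cleanup on a small stand-in table, assuming the goal is lowercase names with no spaces. The `toy` data frame is hypothetical; only its column names are taken from the import message above.

```r
# Hypothetical stand-in for the imported data; only the column names matter here.
toy <- data.frame(
  FACILITY = "A",
  Race = "Black",
  `Age at Extract` = 30,
  `Number of Offenses` = 1,
  check.names = FALSE  # keep the spaces so there is something to clean
)

# Lowercase every name, then strip the spaces.
names(toy) <- tolower(names(toy))
names(toy) <- gsub(" ", "", names(toy))

names(toy)
```

After this step, a column such as `Number of Offenses` can be referred to as `numberofoffenses` without backticks.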
lengthjuris <- summary(lm(sentencelength ~ jurisdiction, data = equity))
lengthjuris
Call:
lm(formula = sentencelength ~ jurisdiction, data = equity)
Residuals:
Min 1Q Median 3Q Max
-12454 -5674 -2279 3365 729101
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8236.1 263.1 31.307 < 2e-16 ***
jurisdictionAnne Arundel County 1084.4 324.8 3.339 0.000843 ***
jurisdictionBaltimore City 2278.4 276.5 8.241 < 2e-16 ***
jurisdictionBaltimore County 1255.9 293.4 4.281 1.86e-05 ***
jurisdictionCalvert County -307.8 529.6 -0.581 0.561107
jurisdictionCaroline County 2195.6 546.5 4.018 5.89e-05 ***
jurisdictionCarroll County 222.7 352.0 0.633 0.526851
jurisdictionCecil County 2502.5 441.8 5.664 1.48e-08 ***
jurisdictionCharles County 4570.6 377.2 12.118 < 2e-16 ***
jurisdictionDorchester County 613.7 423.2 1.450 0.147062
jurisdictionFrederick County 2677.6 357.3 7.493 6.80e-14 ***
jurisdictionGarrett County -1178.8 846.5 -1.393 0.163756
jurisdictionHarford County -244.3 351.5 -0.695 0.486979
jurisdictionHoward County 1710.5 369.7 4.626 3.73e-06 ***
jurisdictionKent County 240.9 1086.5 0.222 0.824529
jurisdictionMontgomery County 2511.4 298.3 8.419 < 2e-16 ***
jurisdictionPrince George's County 4675.3 302.9 15.437 < 2e-16 ***
jurisdictionQueen Anne's County -217.7 577.5 -0.377 0.706203
jurisdictionSomerset County 118.2 426.2 0.277 0.781601
jurisdictionSt. Mary's County 450.8 442.2 1.019 0.307990
jurisdictionTalbot County 1206.6 639.1 1.888 0.059023 .
jurisdictionWashington County 1264.1 305.5 4.138 3.50e-05 ***
jurisdictionWicomico County 2206.2 308.2 7.158 8.25e-13 ***
jurisdictionWorcester County -91.0 430.8 -0.211 0.832695
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11880 on 78726 degrees of freedom
(16658 observations deleted due to missingness)
Multiple R-squared: 0.01126, Adjusted R-squared: 0.01097
F-statistic: 38.99 on 23 and 78726 DF, p-value: < 2.2e-16
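The overall p-value of the model is < 2.2e-16, which is incredibly small, and the coefficient for Baltimore City has a p-value below 2e-16 as well. This is highly statistically significant: mean sentence lengths differ between jurisdictions by more than chance alone would explain, and Baltimore City's mean is about 2,278 days longer than that of the reference jurisdiction (the one absorbed into the intercept). Statistical significance is not the same as a strong correlation, though. The R-squared is only 0.011, so jurisdiction accounts for about 1% of the variation in sentence length. I expected a difference between jurisdictions, since Baltimore is known for higher rates of crime and a harsh justice system.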
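Statistical significance and strength of association are different things, and with tens of thousands of rows even a small difference in group means produces a tiny p-value. The sketch below uses simulated data (not the Maryland dataset; the group labels, sample size, and effect size are made up for illustration) to show a model with a very small p-value and a small R-squared at the same time.

```r
set.seed(1)

n <- 20000  # observations per group (arbitrary, chosen to be large)
sim <- data.frame(
  group = rep(c("A", "B"), each = n),
  # Group B's mean is shifted by 0.2 standard deviations.
  y = c(rnorm(n, mean = 0), rnorm(n, mean = 0.2))
)

fit <- summary(lm(y ~ group, data = sim))

# Overall F-test p-value, computed from the F statistic stored in the summary.
f <- fit$fstatistic
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

r_squared <- fit$r.squared  # share of variance explained by group

p_value    # far below 0.001
r_squared  # small: group explains little of the variation in y
```

Reading the two numbers together gives a fuller picture than the p-value alone: the p-value says whether a difference is distinguishable from noise, and R-squared says how much of the variation it accounts for.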
Explore dataset, clean up & mutate
I plan on making graphs analyzing racial demographics, so shortening the names of the racial groups in the dataset will be important to avoid clutter. I use mutate and case_when to recode the values in the race column. I also need a dataframe that filters for only single offenders; there are several major outliers in this dataset with multiple offenses and much longer sentences as a result. If I am looking at mean sentence lengths, these outliers would skew the mean upwards.
unique(equity$race) # list all unique values in the "race" column
[1] "Black"
[2] "White"
[3] NA
[4] "White - Hispanic or Latino"
[5] "Black - Hispanic or Latino"
[6] "Asian or Pacific islander"
[7] "Unknown - Hispanic or Latino"
[8] "Asian or Pacific islander - Hispanic or Latino"
[9] "Unknown"
[10] "Native American Indian or Alaskan Native"
[11] "Native American Indian or Alaskan Native - Hispanic or Latino"
equity <- equity |>
  mutate(race = case_when(
    race == "Black" ~ "Black",
    race == "White" ~ "White",
    # missing values match none of these conditions, so they stay NA
    race == "White - Hispanic or Latino" ~ "White (Hisp/Lat)",
    race == "Black - Hispanic or Latino" ~ "Black (Hisp/Lat)",
    race == "Asian or Pacific islander" ~ "AAPI",
    race == "Unknown - Hispanic or Latino" ~ "Unknown (Hisp/Lat)",
    race == "Asian or Pacific islander - Hispanic or Latino" ~ "AAPI (Hisp/Lat)",
    race == "Unknown" ~ "Unknown",
    race == "Native American Indian or Alaskan Native" ~ "Indigenous",
    race == "Native American Indian or Alaskan Native - Hispanic or Latino" ~ "Indigenous (Hisp/Lat)"
  ))

# keep only inmates with a single offense
oneoffense <- equity |>
  filter(numberofoffenses == 1)

# quick check to see which offenses have the highest count before I continue
offensetotal <- oneoffense |>
  group_by(offense) |>
  summarise(count = n()) |>
  arrange(desc(count))
offensetotal
`summarise()` has grouped output by 'jurisdiction'. You can override using the
`.groups` argument.
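# This is to make an averages line on my next plot.
# Assumed definition: the same filter as assaultavg, before grouping.
assaultfirstall <- oneoffense |>
  filter(offense == "ASSAULT-FIRST DEGREE" & !is.na(jurisdiction) & !is.na(sentencelength))

# This is the main source of data in my next plot.
assaultavg <- oneoffense |>
  filter(offense == "ASSAULT-FIRST DEGREE" & !is.na(jurisdiction) & !is.na(sentencelength)) |>
  group_by(race) |>
  summarise(count = n(),
            sentencelength = mean(sentencelength)) |>
  filter(count > 10)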
While researching different types of plots I could make, I came upon the “lollipop chart”: a combination of geom_point, geom_segment, and geom_hline that shows individual values and how far they fall from a reference value. In this case, it shows how far the mean sentence for each racial group falls from the overall mean sentence for first-degree assault.
assaultavg |>
  ggplot(aes(x = race, y = sentencelength)) +
  # creates lines leading to each dot
  geom_segment(aes(x = race,
                   y = mean(assaultfirstall$sentencelength),
                   xend = race,
                   yend = sentencelength)) +
  # creates dots for racial group means
  geom_point(size = 7, color = "#FFB384", alpha = 0.8) +
  # creates the main horizontal line (linewidth replaces size, which is deprecated for lines)
  geom_hline(yintercept = mean(assaultfirstall$sentencelength),
             color = "#3d3b3c", linewidth = 0.5) +
  labs(x = "Race",
       y = "Sentence Length (days)",
       title = "Average Sentence Length per Race for 1st-Degree Assault") +
  theme_minimal(base_size = 12, base_family = "serif")
Baltimore mean sentence lengths: lollipop chart
Baltimore City is the jurisdiction with the highest number of sentences of all in this dataset. Anyone who has lived in Maryland for long enough knows the context of this: Baltimore has higher rates of crime than other cities, but also a history of an unusually cruel justice system with more pronounced racial bias than most others in the state. I had seen a lot of data points skewing towards Baltimore in the plots I was making when getting a feel for this dataset, so I wanted to dive deeper into it.
# "oneoffense" but just for Baltimore City
oneoffbaltimore <- oneoffense |>
  filter(jurisdiction == "Baltimore City" & !is.na(sentencelength))

# "assaultavg" but just for Baltimore City
baltimore <- oneoffense |>
  filter(jurisdiction == "Baltimore City" & !is.na(sentencelength)) |>
  group_by(race) |>
  summarise(count = n(),
            sentencelength = mean(sentencelength)) |>
  mutate(mean_sentencelength = mean(sentencelength)) |>
  filter(count > 5)

baltimore |>
  ggplot(aes(x = race, y = sentencelength)) +
  geom_segment(aes(x = race,
                   y = mean(oneoffbaltimore$sentencelength),
                   xend = race,
                   yend = sentencelength)) +
  geom_point(size = 7, color = "#790033", alpha = 0.8) +
  # linewidth replaces size, which is deprecated for lines
  geom_hline(yintercept = mean(oneoffbaltimore$sentencelength),
             color = "#3d3b3c", linewidth = 0.5) +
  labs(x = "Race",
       y = "Sentence Length (days)",
       title = "Mean Sentence Length per Race in Baltimore for Single Offenders",
       caption = "Source: Jay E. Miller on Maryland's Open Data Portal") +
  theme_minimal(base_size = 12, base_family = "serif")
Baltimore sentencing by race: barchart
I wanted to compare the previous chart with the total number of single offenders incarcerated in Baltimore by race, to give a sense of how differently the mean sentence lengths and the inmate totals are distributed across groups.
totalbaltimore <- oneoffense |>
  filter(jurisdiction == "Baltimore City") |>
  group_by(race) |>
  summarise(count = n())

ggplot(totalbaltimore, aes(x = count, y = race, fill = race)) +
  geom_col() +
  scale_fill_manual(values = c("White (Hisp/Lat)" = "#790033",
                               "White" = "#8A163D",
                               "Unknown (Hisp/Lat)" = "#9B2D47",
                               "Unknown" = "#AB4351",
                               "Indigenous" = "#BC5A5C",
                               "Black (Hisp/Lat)" = "#CD7066",
                               "Black" = "#DE8670",
                               "AAPI (Hisp/Lat)" = "#EE9D7A",
                               "AAPI" = "#FFB384")) +
  labs(title = "Totals of Single Offenders Incarcerated in Baltimore City, by Race",
       x = "Inmate Total",
       y = "Race",
       caption = "Source: Jay E. Miller on Maryland Open Data Portal") +
  geom_text(aes(label = count), hjust = -0.3, color = "black", size = 2) +
  theme_minimal(base_size = 12, base_family = "serif")
Essay
I was very surprised by some of the numbers I saw in this dataset and the graphs I was making. One thing I noticed was that while some people were sentenced for upwards of 70 offenses, the inmate with the highest specified sentence in the entire dataset (739,616 days, or about 2,026 years) was arrested on only four charges, none of them murder: three charges of possessing or distributing controlled substances and one charge of second-degree assault. Second-degree assault was the first offense I looked at in this dataset, and single offenders with this charge are usually only in for around 5 years, 10 at most. Both he and the inmate with the next longest sentence (once again, no murder) are Black men from Baltimore City. The third longest sentence was given to another Black man, this time from Prince George’s County.
Since this was the first thing I saw, I was surprised to find that white inmates are, on average, given longer sentences than Black inmates. Yet, at the same time, white inmates make up a significantly smaller share of the prison population, as I illustrated with the barchart of single offenders incarcerated in Baltimore City. I have no solid answer for why this is, but I would hesitantly suggest that racial profiling may be one of the largest contributing factors. Black people are much more likely to be wrongfully convicted or sentenced to prison without sufficient evidence of their guilt. White people, due to privilege and police bias, can get away with crime more easily. Therefore, when a white person is sentenced to prison, they are more likely to have committed a serious crime with solid evidence to prove it, and may receive a longer sentence on these grounds.
I chose Baltimore because it stood out the most in the dataset and has a notorious history of a racist legal system, which came to national attention following the death of Freddie Gray in police custody in 2015.
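As a quick check on the conversion of the longest sentence from days to years, assuming a 365-day year:

```r
days <- 739616        # longest specified sentence in the dataset, in days
years <- days / 365   # assumes 365 days per year, ignoring leap years
round(years)          # 2026
```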
Some things I wanted to include but couldn’t were more comparison and stacked charts. I wanted to have a table that shows the mean sentence length per racial group in Baltimore and at the same time the total number of sentences per racial group in Baltimore. Specifically, I wanted to include another lollipop chart, showing the mean number of sentences for all groups in Baltimore as the baseline, with lines leading to dots that show the mean number of sentences for each individual racial group. I tried something similar to this, by using geom_text to add data labels for inmate count to the second lollipop chart, centering them inside the dots. However, I couldn’t find a way to add a legend explaining what they meant, so I left it out.
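One possible way to keep the count labels without a legend is to state their meaning in the plot caption. The sketch below is only a guess at that approach, using a made-up summary table (`toy`, with invented group names, counts, and means, not values from the dataset). It places `geom_text` labels at the same coordinates as the dots, so they sit centered inside them.

```r
library(ggplot2)

# Invented numbers for illustration only.
toy <- data.frame(
  race = c("Group A", "Group B", "Group C"),
  count = c(120, 45, 12),
  sentencelength = c(9000, 11000, 7500)
)
baseline <- mean(toy$sentencelength)  # stand-in for the overall mean

p <- ggplot(toy, aes(x = race, y = sentencelength)) +
  # lines from the baseline to each group mean
  geom_segment(aes(xend = race, y = baseline, yend = sentencelength)) +
  # dots for the group means, large enough to hold a label
  geom_point(size = 10, color = "#790033", alpha = 0.8) +
  # count labels drawn on top of the dots
  geom_text(aes(label = count), color = "white", size = 3) +
  geom_hline(yintercept = baseline, color = "#3d3b3c", linewidth = 0.5) +
  labs(
    x = "Race",
    y = "Sentence Length (days)",
    caption = "Numbers inside the dots are inmate counts."
  ) +
  theme_minimal(base_size = 12)

p
```

A caption is not a true legend, but it avoids mapping the labels to an aesthetic just to generate one.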
I was very proud of the work I did exploring and cleaning this dataset. I used tolower and gsub to make the column names easy to use, and mutate with case_when to recode the values in the race column for legibility in my plots. Initially, when creating the filtered tables and mean tables, I did each part individually. I realized halfway through that I could use piping to streamline the process, and I am glad I did.