The data set I am using for my final project is titled Juvenile Arrests. It is collected by DC Data and I accessed it through data.gov. The link is: https://catalog.data.gov/dataset/juvenile-arrests-434b1
The data set covers only the District of Columbia from 2011 until 2025. It contains all arrests made by MPD and other law enforcement agencies of individuals 17 and under, excluding any arrests that have been expunged. Only the most serious charge is reported, so the data do not show whether an individual was arrested on multiple charges. I could not find a useful codebook to help me understand the data set, so I relied mainly on the landing page and on exploratory data analysis.
First, I set my working directory and load my data set and the necessary packages. I also name my data ‘df’ so that it is easier to work with.
setwd("/Users/ingridellis/Desktop/CJS 310/Final Project")
library(readxl)
Juvenile_Arrests <- read_excel("/Users/ingridellis/Desktop/CJS 310/Juvenile Arrests.xlsx")
head(Juvenile_Arrests)
## # A tibble: 6 × 9
## OBJECTID ARREST_DATE TOP_CHARGE_DESC HOME_PSA CRIME_PSA GIS_ID GLOBALID
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 26241 2011/03/01 05:00:… Robbery -- For… 304 305 Juven… {119216…
## 2 26242 2011/03/01 05:00:… Juvenile Custo… 304 504 Juven… {F11EC1…
## 3 26243 2011/03/01 05:00:… Felony Escapee… 501 501 Juven… {4AC0AE…
## 4 26244 2011/03/01 05:00:… UCSA Possessio… 605 605 Juven… {F62548…
## 5 26245 2011/03/01 05:00:… Theft 2nd Degr… 404 302 Juven… {64AC66…
## 6 26246 2011/03/01 05:00:… Simple Assault 604 604 Juven… {E4AA41…
## # ℹ 2 more variables: CREATED <chr>, EDITED <chr>
df <- Juvenile_Arrests
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(stringr)
Initially, I noticed that the data set does not have much to manipulate other than temporal variables. I started by making sure the date variable would be read correctly by R, and then created graphs breaking down arrest count by year, month, and day.
df$ARREST_DATE <- as.POSIXct(df$ARREST_DATE,
format = "%Y/%m/%d %H:%M:%S",
tz = "UTC")
df$YEAR <- as.numeric(format(df$ARREST_DATE, "%Y"))
df$MONTH <- as.numeric(format(df$ARREST_DATE, "%m"))
# Day of the month as a number ("%A" returns the weekday name, which as.numeric() turns into NA)
df$DAY <- as.numeric(format(df$ARREST_DATE, "%d"))
ggplot(df, aes(x = factor(YEAR))) +
geom_bar() +
labs(
title = "Juvenile Arrest Counts by Year",
x = "Year",
y = "Count"
) +
theme_minimal()
ggplot(df, aes(x = factor(MONTH))) +
geom_bar() +
scale_x_discrete(labels = c("January", "February", "March", "April",
"May", "June", "July", "August",
"September", "October", "November", "December")) +
labs(
title = "Juvenile Arrest Counts by Month",
x = "Month",
y = "Count"
) +
theme_minimal()
df$DATE_ONLY <- as.Date(df$ARREST_DATE)
df$DOW <- factor(
weekdays(df$DATE_ONLY),
levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday")
)
ggplot(df, aes(x = DOW)) +
geom_bar() +
theme_minimal() +
labs(
title = "Juvenile Arrests by Day of Week",
x = "Day of Week",
y = "Count"
)
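The bar charts above can also be read as small tables, which makes the size of a change (such as a drop from one year to the next) easier to state. The following is a minimal sketch rather than part of the original analysis: `toy` is an invented stand-in for `df`, and `count_share()` is a hypothetical helper.

```r
library(dplyr)

# Hypothetical stand-in for df: one row per arrest, dates invented for illustration
toy <- data.frame(
  DATE_ONLY = as.Date(c("2019-03-04", "2019-03-06", "2019-03-09",
                        "2019-10-16", "2020-02-05", "2020-11-18"))
)
toy$YEAR <- as.numeric(format(toy$DATE_ONLY, "%Y"))

# Count rows per group and add each group's share of all rows
count_share <- function(data, group_col) {
  data %>%
    count({{ group_col }}, name = "n") %>%
    mutate(share = n / sum(n))
}

by_year <- count_share(toy, YEAR) %>%
  # Percent change from the previous year; NA for the first year
  mutate(pct_change = 100 * (n - lag(n)) / lag(n))

by_year
```

The same helper could be called with `MONTH` or `DOW` on the real data to get the counts behind the other two charts.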
By starting at the yearly level and moving to finer time units, I can learn different things about the patterns within the data. It also gives me more room to ask different questions. For example, there is a dramatic drop in juvenile arrests in 2020. I hypothesize this is because of COVID, but I will have to do more research to gain a better understanding. Once I break the data down by day of week, I get a completely different picture than I would get from the yearly breakdown.
There is a steep decline on the weekends, which I initially found peculiar until I did a bit more research. According to the Office of Juvenile Justice and Delinquency Prevention, the majority of violent crimes committed by youth occur during the after-school hours on school days. When there is no school, violent crime is more likely to occur later in the evening, around 9pm, but the rate itself is still much lower than that of school days (Lantz & Knapp, 2024). During my time at Alexandria Police Department, I have noticed firsthand that juvenile crime is often high during the after-school hours, when people congregate at the nearby retail plaza and bus stops. When large groups of students are together because of school, but unsupervised because school has just let out, there is more room for delinquency and violent crime.
To extract the information I needed, I had to do some extensive cleaning. While I began this process in Excel, I quickly moved to R in order to create some new conditional variables. Using the mutate function, I created a new, more cleanly organized offense type variable based on the existing top charge description variable. That being said, I recognize there is bias in how I chose what to include and what to leave out. I created my variables using the following code, and I note that these categories are human-defined rather than defined by the original data.
df <- df %>%
mutate(
offense_type = case_when(
str_detect(TOP_CHARGE_DESC, regex("murder|homicide", ignore_case = TRUE)) ~ "Homicide",
str_detect(TOP_CHARGE_DESC, regex("sex", ignore_case = TRUE)) ~ "Sex Offense",
str_detect(TOP_CHARGE_DESC, regex("robbery", ignore_case = TRUE)) ~ "Robbery",
str_detect(TOP_CHARGE_DESC, regex("assault|adw", ignore_case = TRUE)) ~ "Assault",
str_detect(TOP_CHARGE_DESC, regex("larceny", ignore_case = TRUE)) ~ "Larceny",
str_detect(TOP_CHARGE_DESC, regex("burglary", ignore_case = TRUE)) ~ "Burglary",
str_detect(TOP_CHARGE_DESC, regex("vehicle", ignore_case = TRUE)) ~ "Vehicle Involved",
str_detect(TOP_CHARGE_DESC, regex("disorderly", ignore_case = TRUE)) ~ "Disorderly",
TRUE ~ "Other"
),
weapon_inv = case_when(
str_detect(TOP_CHARGE_DESC, regex("armed|weapon", ignore_case = TRUE)) ~ "Yes",
TRUE ~ "No"
)
)
head(df)
## # A tibble: 6 × 16
## OBJECTID ARREST_DATE TOP_CHARGE_DESC HOME_PSA CRIME_PSA GIS_ID
## <dbl> <dttm> <chr> <chr> <chr> <chr>
## 1 26241 2011-03-01 05:00:00 Robbery -- Force & Vio… 304 305 Juven…
## 2 26242 2011-03-01 05:00:00 Juvenile Custody Order… 304 504 Juven…
## 3 26243 2011-03-01 05:00:00 Felony Escapee Warrant 501 501 Juven…
## 4 26244 2011-03-01 05:00:00 UCSA Possession Mariju… 605 605 Juven…
## 5 26245 2011-03-01 05:00:00 Theft 2nd Degree 404 302 Juven…
## 6 26246 2011-03-01 05:00:00 Simple Assault 604 604 Juven…
## # ℹ 10 more variables: GLOBALID <chr>, CREATED <chr>, EDITED <chr>, YEAR <dbl>,
## # MONTH <dbl>, DAY <dbl>, DATE_ONLY <date>, DOW <fct>, offense_type <chr>,
## # weapon_inv <chr>
df %>%
select(TOP_CHARGE_DESC, offense_type)
## # A tibble: 36,100 × 2
## TOP_CHARGE_DESC offense_type
## <chr> <chr>
## 1 Robbery -- Force & Violence Robbery
## 2 Juvenile Custody Order - Prepetition Other
## 3 Felony Escapee Warrant Other
## 4 UCSA Possession Marijuana Other
## 5 Theft 2nd Degree Other
## 6 Simple Assault Assault
## 7 Juvenile Custody Order - Prepetition Other
## 8 UCSA Possession Marijuana Other
## 9 Destruction of Property (Felony) Other
## 10 Simple Assault Assault
## # ℹ 36,090 more rows
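The output above shows charges such as "Theft 2nd Degree" landing in "Other" because none of the patterns match them. One way to audit the human-defined categories is to tally which charges fall through. This is a minimal sketch: `toy` reuses the ten charge descriptions printed above, and the `case_when()` is a shortened copy of the classification, not the full version.

```r
library(dplyr)
library(stringr)

# Charge descriptions copied from the first ten rows printed above
toy <- data.frame(
  TOP_CHARGE_DESC = c("Robbery -- Force & Violence",
                      "Juvenile Custody Order - Prepetition",
                      "Felony Escapee Warrant",
                      "UCSA Possession Marijuana",
                      "Theft 2nd Degree",
                      "Simple Assault",
                      "Juvenile Custody Order - Prepetition",
                      "UCSA Possession Marijuana",
                      "Destruction of Property (Felony)",
                      "Simple Assault")
)

# Shortened copy of the classification; case_when() uses the first pattern that matches
toy <- toy %>%
  mutate(
    offense_type = case_when(
      str_detect(TOP_CHARGE_DESC, regex("robbery", ignore_case = TRUE)) ~ "Robbery",
      str_detect(TOP_CHARGE_DESC, regex("assault|adw", ignore_case = TRUE)) ~ "Assault",
      str_detect(TOP_CHARGE_DESC, regex("larceny", ignore_case = TRUE)) ~ "Larceny",
      TRUE ~ "Other"
    )
  )

# Charges that fall through to "Other", most frequent first
other_charges <- toy %>%
  filter(offense_type == "Other") %>%
  count(TOP_CHARGE_DESC, sort = TRUE)

other_charges
```

Run against the full `df`, the same tally would show whether any frequent charge (for example, theft charges that do not contain the word "larceny") might deserve its own pattern.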
ggplot(df, aes(x = offense_type, fill = offense_type)) +
geom_bar() +
labs(
title = "Offense Types",
x = "Offense Type",
y = "Count"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Because it seems like the ‘Other’ category I made is skewing the data, I decided to see what the plot would look like with the ‘Other’ arrests filtered out.
ggplot(subset(df, offense_type != "Other"),
aes(x = offense_type, fill = offense_type)) +
geom_bar() +
labs(
title = "Offense Types",
x = "Offense Type",
y = "Count"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This plot looks at those same arrests but sorts them by the day of week.
df %>%
filter(offense_type != "Other", !is.na(offense_type)) %>%
ggplot(aes(x = DOW, fill = offense_type)) +
geom_bar() +
labs(
title = "Offense Type by Day of Week",
x = "Day of Week",
y = "Count",
fill = "Offense Type"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The mix of offense types in this graph looks roughly proportional across the week. It is consistent with previous research finding that there tends to be more crime on school days during the after-school hours. I personally know that, for people who split their time between the office and working from home, Tuesday and Wednesday are the most popular days to be in the office. Traffic is heavier on those days because most people are commuting, whereas they may work from home on Monday or Friday. The same framework may apply to these arrests. Because the after-school hours account for the majority of crime, higher school attendance on Wednesdays could account for the spike.
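Whether the offense mix really is proportional across days is easier to judge from shares than from stacked counts. The following is a minimal sketch with invented values: `toy` stands in for the filtered `df`, and `dow_shares` is a hypothetical name.

```r
library(dplyr)

# Hypothetical rows standing in for df; the values are invented
toy <- data.frame(
  DOW = factor(c("Monday", "Monday", "Monday", "Wednesday", "Wednesday",
                 "Wednesday", "Wednesday", "Saturday", "Saturday"),
               levels = c("Monday", "Wednesday", "Saturday")),
  offense_type = c("Assault", "Robbery", "Assault", "Assault", "Assault",
                   "Robbery", "Robbery", "Assault", "Robbery")
)

# Share of each offense type within each day of the week
dow_shares <- toy %>%
  count(DOW, offense_type, name = "n") %>%
  group_by(DOW) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

dow_shares
```

In ggplot2, `geom_bar(position = "fill")` draws the same within-day shares directly.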
Unfortunately, the only attendance data that DCPS collects are yearly trends and high-level percentages for dropouts, truancy, tardiness, etc. Rather than comparing the arrest data directly against those figures, I will make some initial charts to understand the DCPS data better for future analysis.
I retrieved this data from the District of Columbia Public Schools website under their downloadable data set tab. This is the link: https://dcps.dc.gov/node/1018342. I downloaded the enrollment data from this link and manually created a database that included all years together rather than separately. This shows a bit of a glimpse into before, during, and after COVID, and also overlaps with the years that I have Juvenile Arrest Data for. Some schools have opened and closed since the beginning of this data collection, hence the NAs present in the counts.
library(readxl)
Clean_DCPS_Enrollment <- read_excel("/Users/ingridellis/Desktop/CJS 310/Clean DCPS Enrollment.xlsx")
enrollment <- Clean_DCPS_Enrollment
head(enrollment)
## # A tibble: 6 × 15
## `School Name` `2011-2012` `2012-2013` `2013-2014` `2014-2015` `2015-2016`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Aiton Elementary … 269 252 247 262 260
## 2 Amidon-Bowen Elem… 254 293 342 345 356
## 3 Anacostia High Sc… 784 697 751 661 597
## 4 Ballou High School 910 791 678 755 933
## 5 Bancroft Elementa… 463 473 490 508 521
## 6 Bard High School … NA NA NA NA NA
## # ℹ 9 more variables: `2016-2017` <chr>, `2017-2018` <chr>, `2018-2019` <chr>,
## # `2019-2020` <chr>, `2020-2021` <chr>, `2021-2022` <chr>, `2022-2023` <chr>,
## # `2023-2024` <chr>, `2024-2025` <chr>
dcps_totals <- enrollment %>%
filter(`School Name` == "DCPS Schools Total")
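dcps_totals_long <- dcps_totals %>%
  pivot_longer(
    cols = -`School Name`,
    names_to = "Year",
    values_to = "Enrollment"
  ) %>%
  # The enrollment columns are read in as text, so convert them before plotting
  mutate(Enrollment = as.numeric(Enrollment))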
ggplot(dcps_totals_long, aes(x = Year, y = Enrollment, fill = Year)) +
geom_bar(stat = "identity") +
geom_line(aes(group = 1), color = "black", linewidth = 1) +
geom_point(size = 3, color = "black") +
labs(
title = "DCPS Total Enrollment Over Time",
x = "Year",
y = "Total Enrollment"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_fill_viridis_d(option = "turbo") +
guides(fill = "none")
After that, I used R to create a data set that included only high schools and middle schools to see if there was anything of interest. While I don’t have data on the age of the juveniles arrested, I am still interested in older juveniles, as they may have more delinquent involvement. According to the Council on Criminal Justice, the most common age of offending is 16-17. While offending can start as young as 12, it often doubles by 13-14 and peaks at 16-17. For this reason, I wanted to look at these specific enrollment numbers.
library(dplyr)
library(stringr)
hs_ms <- enrollment %>%
filter(str_detect(`School Name`, "High School| Middle School"))
head(hs_ms)
## # A tibble: 6 × 15
## `School Name` `2011-2012` `2012-2013` `2013-2014` `2014-2015` `2015-2016`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Anacostia High Sc… 784 697 751 661 597
## 2 Ballou High School 910 791 678 755 933
## 3 Bard High School … NA NA NA NA NA
## 4 Benjamin Banneker… 413 394 430 449 454
## 5 Brookland Middle … 304 274 249 225 315
## 6 Coolidge High Sch… 547 490 433 395 384
## # ℹ 9 more variables: `2016-2017` <chr>, `2017-2018` <chr>, `2018-2019` <chr>,
## # `2019-2020` <chr>, `2020-2021` <chr>, `2021-2022` <chr>, `2022-2023` <chr>,
## # `2023-2024` <chr>, `2024-2025` <chr>
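# hs_ms has no "DCPS Schools Total" row (the filter above keeps only high and middle schools),
# so the yearly totals are computed below by summing the individual schools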
hsms_totals_long <- hs_ms %>%
pivot_longer(
cols = -`School Name`,
names_to = "Year",
values_to = "Enrollment"
)
hsms_totals_long <- hsms_totals_long %>%
mutate(Enrollment = as.numeric(Enrollment)) %>%
group_by(Year) %>%
summarise(Enrollment = sum(Enrollment, na.rm = TRUE))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Enrollment = as.numeric(Enrollment)`.
## Caused by warning:
## ! NAs introduced by coercion
ggplot(hsms_totals_long, aes(x = Year, y = Enrollment, fill = Year)) +
geom_bar(stat = "identity") +
geom_line(aes(group = 1), color = "black", linewidth = 1) +
geom_point(size = 3, color = "black") +
labs(
title = "DCPS High School + Middle School Enrollment Over Time",
x = "Year",
y = "Total High School + Middle School Enrollment"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_fill_viridis_d(option = "turbo") +
guides(fill = "none")
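The coercion warning above means that at least one enrollment cell holds text that `as.numeric()` cannot parse, and `sum(..., na.rm = TRUE)` then drops that cell silently. The following is a minimal sketch for listing such cells before summing. The `toy` table is invented, and "closed" is only a hypothetical example of a non-numeric entry; the actual offending values in the DCPS file are not known here.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows in the same wide layout as enrollment; counts are stored as text
toy <- data.frame(
  school = c("School A", "School B"),
  y1 = c("269", NA),
  y2 = c("252", "closed"),
  stringsAsFactors = FALSE
)
names(toy) <- c("School Name", "2011-2012", "2012-2013")

toy_long <- toy %>%
  pivot_longer(cols = -`School Name`, names_to = "Year", values_to = "Enrollment")

# Cells that contain text but do not convert to a number
bad_values <- toy_long %>%
  mutate(as_number = suppressWarnings(as.numeric(Enrollment))) %>%
  filter(!is.na(Enrollment), is.na(as_number))

bad_values
```

Cells that are simply empty (true `NA`) are excluded from the listing, since they are missing rather than mis-typed.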
Next, I compare the enrollment figures with the juvenile arrest counts for each year.
library(dplyr)
library(tidyr)
library(stringr)
arrest_counts <- df %>%
count(YEAR, name = "Arrest_Count")
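# Note: slice_tail(n = 1) keeps only the last row of hs_ms, i.e. one row from the
# high school / middle school filter rather than a sum across schools
hsms_totals <- hs_ms %>%
  slice_tail(n = 1) %>%
  pivot_longer(
    everything(),
    names_to = "year",
    values_to = "DCPS_Total"
  ) %>%
  mutate(
    # Keep the ending year of each school-year label, e.g. "2011-2012" becomes 2012
    year = str_extract(year, "\\d{4}$"),
    year = as.numeric(year),
    DCPS_Total = replace_na(DCPS_Total, 0)
  ) %>%
  select(year, DCPS_Total)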
counts <- arrest_counts %>%
rename(year = YEAR) %>%
left_join(hsms_totals, by = "year") %>%
select(year, DCPS_Total, Arrest_Count)
head(counts)
## # A tibble: 6 × 3
## year DCPS_Total Arrest_Count
## <dbl> <chr> <int>
## 1 2011 <NA> 3499
## 2 2012 1633 3022
## 3 2013 1713 3173
## 4 2014 1696 2982
## 5 2015 1788 3141
## 6 2016 1791 3278
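As an alternative to taking a single row of `hs_ms`, the enrollment can be summed across all high schools and middle schools before it is joined to the arrest counts. The following is a minimal sketch under stated assumptions: all values are invented, the names `toy_hs_ms`, `toy_arrests`, and `HSMS_Total` are hypothetical, and matching a school year to the calendar year in which it starts is a judgment call rather than a rule.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Invented enrollment figures in the same wide, text-typed layout as hs_ms
toy_hs_ms <- data.frame(
  school = c("School A High School", "School B Middle School"),
  y1 = c("500", "300"),
  y2 = c("520", NA),
  stringsAsFactors = FALSE
)
names(toy_hs_ms) <- c("School Name", "2011-2012", "2012-2013")

# Invented yearly arrest counts standing in for count(df, YEAR)
toy_arrests <- data.frame(year = c(2011, 2012), Arrest_Count = c(40L, 35L))

# Sum enrollment across all schools for each school year
toy_totals <- toy_hs_ms %>%
  pivot_longer(cols = -`School Name`, names_to = "school_year",
               values_to = "Enrollment") %>%
  mutate(
    Enrollment = as.numeric(Enrollment),
    # Assumption: key each school year by the calendar year in which it starts
    year = as.numeric(str_extract(school_year, "^\\d{4}"))
  ) %>%
  group_by(year) %>%
  summarise(HSMS_Total = sum(Enrollment, na.rm = TRUE), .groups = "drop")

toy_counts <- toy_arrests %>%
  left_join(toy_totals, by = "year")

toy_counts
```

Because `HSMS_Total` is numeric here, it can be placed on a continuous axis and used with `geom_smooth(method = "lm")`.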
ggplot(counts, aes(x = DCPS_Total, y = Arrest_Count)) +
geom_point(size = 3) +
geom_smooth(method = "lm", se = TRUE) +
labs(
x = "HS/MS Enrollment",
y = "Arrest Count",
title = "Relationship Between DCPS Enrollment and Arrests"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Next, I remove the NA values and color the points by year.
ggplot(dplyr::filter(counts, !is.na(DCPS_Total)),
aes(x = DCPS_Total, y = Arrest_Count, color = factor(year))) +
geom_point(size = 4) +
geom_smooth(method = "lm", color = "black", alpha = 0.3, linewidth = 1.3) +
theme_minimal() +
labs(color = "Year")
## `geom_smooth()` using formula = 'y ~ x'
Overall, there appears to be a weak negative relationship between enrollment and arrest counts. It is hard to say how much enrollment contributed to the lower arrest counts, because COVID probably contributed to the drop as well.
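The direction and strength of a relationship like this can be put into numbers with a correlation coefficient and a fitted slope. The following is a minimal sketch: the values in `toy_counts` are invented and are not taken from the project data.

```r
# Invented yearly values; not taken from the project data
toy_counts <- data.frame(
  year = 2012:2017,
  enrollment = c(1600, 1650, 1700, 1750, 1800, 1850),
  arrests = c(3300, 3100, 3250, 3000, 3050, 2900)
)

# Pearson correlation: the sign gives the direction, the magnitude the strength
r <- cor(toy_counts$enrollment, toy_counts$arrests)

# Simple linear model; the slope is the change in arrests per additional student
fit <- lm(arrests ~ enrollment, data = toy_counts)
slope <- unname(coef(fit)["enrollment"])

r
slope
```

With only one point per year, any such estimate is sensitive to a single unusual year, so it is best read as a description rather than as evidence of cause.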
A combination of factors seems to have led to the fall in juvenile arrests. According to the Council on Criminal Justice, there was about a 25% decrease in non-lethal violent crime committed by juveniles in the year after COVID. There was also less unstructured social time, such as parties and hanging out with friends (Baumer & Staff, 2024), which left less opportunity for crime. Interestingly, total enrollment continued to increase steadily, even after COVID. This may also have contributed to the lower number of juvenile arrests, because more students were enrolled and attending school online.
It may be hard to make solid predictions just based on this information, because it is so multi-faceted. There are many different factors that may have contributed, but this is a first step in understanding.
Using R wasn’t too difficult. The hardest part was cleaning my data sets and making them usable, which was partly a problem I gave myself. There weren’t many open-source juvenile arrest data sets that I could get my hands on, because this kind of data is so protected. Because these data sets were hard to work with, my job in R was a lot more difficult than it might have been with other data.
Links:
https://www.ojjdp.gov/ojstatbb/offenders/qa03301.asp
https://counciloncj.org/youth-crime-before-and-after-the-beginning-of-covid-19-a-survey-of-middle-and-high-school-students-in-the-united-states/
https://counciloncj.org/trends-in-juvenile-offending-what-you-need-to-know/
Citations:
Baumer, E.P. & Staff, J. (2024). Youth crime before and after the beginning of COVID-19: A survey of middle and high school students in the United States. Council on Criminal Justice. https://counciloncj.org/youth-crime-before-and-after-the-beginning-of-covid-19-a-survey-of-middle-and-high-school-students-in-the-united-states/
Lantz, B. & Knapp, K.G. (2024). Trends in juvenile offending: What you need to know. Council on Criminal Justice. https://counciloncj.org/trends-in-juvenile-offending-what-you-need-to-know/
Office of Juvenile Justice and Delinquency Prevention. (n.d.). Juvenile offenders and victims: 2014 national report - Statistical briefing book. https://www.ojjdp.gov/ojstatbb/offenders/qa03301.asp