Overview

The data set I am using for my final project is titled Juvenile Arrests. It is collected by DC Data and I accessed it through data.gov. The link is: https://catalog.data.gov/dataset/juvenile-arrests-434b1

The data set covers only the District of Columbia from 2011 through 2025. It contains all arrests made by MPD and other law enforcement agencies of individuals 17 and under, excluding any arrests that have been expunged. Only the most serious charge is reported, so the data do not show whether an individual was arrested on multiple charges. I could not find a useful codebook to help me understand the data set, so I relied mainly on the landing page and on exploratory data analysis.

Body

First, I set my working directory and load my data set and the necessary packages. I also rename my data ‘df’ so that it is easier to work with.

setwd("/Users/ingridellis/Desktop/CJS 310/Final Project")

library(readxl)
Juvenile_Arrests <- read_excel("/Users/ingridellis/Desktop/CJS 310/Juvenile Arrests.xlsx")
head(Juvenile_Arrests)

## # A tibble: 6 × 9
##   OBJECTID ARREST_DATE        TOP_CHARGE_DESC HOME_PSA CRIME_PSA GIS_ID GLOBALID
##      <dbl> <chr>              <chr>           <chr>    <chr>     <chr>  <chr>   
## 1    26241 2011/03/01 05:00:… Robbery -- For… 304      305       Juven… {119216…
## 2    26242 2011/03/01 05:00:… Juvenile Custo… 304      504       Juven… {F11EC1…
## 3    26243 2011/03/01 05:00:… Felony Escapee… 501      501       Juven… {4AC0AE…
## 4    26244 2011/03/01 05:00:… UCSA Possessio… 605      605       Juven… {F62548…
## 5    26245 2011/03/01 05:00:… Theft 2nd Degr… 404      302       Juven… {64AC66…
## 6    26246 2011/03/01 05:00:… Simple Assault  604      604       Juven… {E4AA41…
## # ℹ 2 more variables: CREATED <chr>, EDITED <chr>

df <- Juvenile_Arrests

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

library(dplyr)
library(stringr)

Initial Graphs

Initially, I noticed that the data set does not have much to manipulate other than temporal variables. I started by making sure the date variable would be read correctly by R, and then created graphs breaking down the arrest count by year, month, and day of the week.

df$ARREST_DATE <- as.POSIXct(df$ARREST_DATE,
                      format = "%Y/%m/%d %H:%M:%S",
                      tz = "UTC")

df$YEAR <- as.numeric(format(df$ARREST_DATE, "%Y"))
df$MONTH <- as.numeric(format(df$ARREST_DATE, "%m"))
# "%d" returns the day of the month as digits; "%A" returns a weekday name, which as.numeric() turns into NA
df$DAY <- as.numeric(format(df$ARREST_DATE, "%d"))
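
A quick check on the date step can be sketched as follows. This is a minimal illustration with three made-up timestamps (the last one is deliberately malformed), not the project data; the variable names are hypothetical. It shows that entries which do not match the format silently become NA, and that weekday names are text rather than numbers.

```r
# Toy timestamps in the "YYYY/MM/DD HH:MM:SS" layout; the third is malformed on purpose
raw_dates <- c("2011/03/01 05:00:00", "2020/07/15 04:00:00", "03-01-2011")

parsed <- as.POSIXct(raw_dates, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

# Entries that do not match the format come back as NA instead of raising an error
n_failed <- sum(is.na(parsed))

# Year and month are digit codes, so they convert to numbers cleanly
year_num  <- as.numeric(format(parsed, "%Y"))
month_num <- as.numeric(format(parsed, "%m"))

# "%A" gives a weekday name, so it is kept as character
dow_name <- format(parsed, "%A")

data.frame(raw_dates, parsed, year_num, month_num, dow_name)
```

Counting the NA values right after parsing makes it easier to tell a formatting problem in the source file apart from a problem in a later conversion step.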

Plot 1 - Juvenile Arrests by Year

ggplot(df, aes(x = factor(YEAR))) +
  geom_bar() +
  labs(
    title = "Juvenile Arrest Counts by Year",
    x = "Year",
    y = "Count"
  ) +
  theme_minimal()

Plot 2 - Juvenile Arrest Count by Month

ggplot(df, aes(x = factor(MONTH))) +
  geom_bar() +
  scale_x_discrete(labels = c("January", "February", "March", "April",
                              "May", "June", "July", "August",
                              "September", "October", "November", "December")) +
  labs(
    title = "Juvenile Arrest Counts by Month",
    x = "Month",
    y = "Count"
  ) +
  theme_minimal()

Plot 3 - Juvenile Arrests by Day of Week

df$DATE_ONLY <- as.Date(df$ARREST_DATE)

df$DOW <- factor(
  weekdays(df$DATE_ONLY),
  levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday")
)

ggplot(df, aes(x = DOW)) +
  geom_bar() +
  theme_minimal() +
  labs(
    title = "Juvenile Arrests by Day of Week",
    x = "Day of Week",
    y = "Count"
  )

By starting at the macro level and getting more granular, I am able to learn different things about the patterns within the data. It also gives me more room to ask different questions. For example, there is a dramatic drop in juvenile arrests in 2020. I hypothesize this is because of COVID, but I will have to do more research to gain a better understanding. Once I break the data down by day of week, I get a completely different picture than I would get from the yearly breakdown.
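
One way to put a number on a year-to-year drop is the percent change from the previous year. The sketch below uses the 2011 to 2016 arrest counts that appear in the counts table later in this workbook; the same calculation would apply to the later years once their counts are tabulated.

```r
# Yearly arrest counts for 2011-2016, copied from the counts table in this workbook
arrests <- c(`2011` = 3499, `2012` = 3022, `2013` = 3173,
             `2014` = 2982, `2015` = 3141, `2016` = 3278)

# Percent change relative to the previous year;
# diff() gives the year-over-year difference, head(x, -1) drops the last year
pct_change <- diff(arrests) / head(arrests, -1) * 100

round(pct_change, 1)
```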

There is a steep decline on the weekends, which I initially found peculiar until I did a bit more research. According to the Office of Juvenile Justice and Delinquency Prevention, the majority of violent crimes committed by youth occur during the after-school hours on school days. When there is no school, violent crime is more likely to occur in the late hours, around 9 p.m., but the rate itself is still much lower than on school days (Lantz & Knapp, 2024). During my time at the Alexandria Police Department, I have noticed firsthand how juvenile crime is often high during the after-school hours, when people congregate at the nearby retail plaza and bus stops. When large groups of students are together because of school but unsupervised because school has just let out, there is more room for delinquency and violent crime.
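
Because there are five weekdays and only two weekend days, a fair weekday versus weekend comparison uses arrests per day rather than raw totals. The following is a minimal sketch on made-up dates (not the project data), and it assumes the date range covers whole weeks.

```r
# Made-up arrest dates for illustration only (Monday 2024-03-04 through Sunday 2024-03-10)
dates <- as.Date(c("2024-03-04", "2024-03-05", "2024-03-05", "2024-03-06",
                   "2024-03-07", "2024-03-08", "2024-03-09", "2024-03-10"))

# format(x, "%u") gives the weekday number as text: "1" = Monday ... "7" = Sunday
iso_day  <- as.integer(format(dates, "%u"))
day_type <- ifelse(iso_day >= 6, "Weekend", "Weekday")

# Total arrests in each group
totals <- table(day_type)

# Divide by the number of days in each group (5 weekdays, 2 weekend days)
per_day <- totals / c(Weekday = 5, Weekend = 2)
per_day
```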

Data Cleaning

To extract the information I thought would be best, I had to do some extensive cleaning. While I initially began this process in Excel, I quickly moved to R in order to create some new conditional variables. Using the mutate function, I was able to create a new offense type variable based on the existing top charge description variable that was more cleanly organized. That being said, I recognize there is bias in the way I chose what should be included and what wasn’t included. I created my variables using the following code, and have made note that this category is human defined rather than defined by the original input.

df <- df %>%
  mutate(
    offense_type = case_when(
      str_detect(TOP_CHARGE_DESC, regex("murder|homicide", ignore_case = TRUE)) ~ "Homicide",
      str_detect(TOP_CHARGE_DESC, regex("sex", ignore_case = TRUE)) ~ "Sex Offense",
      str_detect(TOP_CHARGE_DESC, regex("robbery", ignore_case = TRUE)) ~ "Robbery",
      str_detect(TOP_CHARGE_DESC, regex("assault|adw", ignore_case = TRUE)) ~ "Assault",
      str_detect(TOP_CHARGE_DESC, regex("larceny", ignore_case = TRUE)) ~ "Larceny",
      str_detect(TOP_CHARGE_DESC, regex("burglary", ignore_case = TRUE)) ~ "Burglary",
      str_detect(TOP_CHARGE_DESC, regex("vehicle", ignore_case = TRUE)) ~ "Vehicle Involved",
      str_detect(TOP_CHARGE_DESC, regex("disorderly", ignore_case = TRUE)) ~ "Disorderly",
      TRUE ~ "Other"
    ),
    
    weapon_inv = case_when(
      str_detect(TOP_CHARGE_DESC, regex("armed|weapon", ignore_case = TRUE)) ~ "Yes",
      TRUE ~ "No"
    )
  )

head(df)

## # A tibble: 6 × 16
##   OBJECTID ARREST_DATE         TOP_CHARGE_DESC         HOME_PSA CRIME_PSA GIS_ID
##      <dbl> <dttm>              <chr>                   <chr>    <chr>     <chr> 
## 1    26241 2011-03-01 05:00:00 Robbery -- Force & Vio… 304      305       Juven…
## 2    26242 2011-03-01 05:00:00 Juvenile Custody Order… 304      504       Juven…
## 3    26243 2011-03-01 05:00:00 Felony Escapee Warrant  501      501       Juven…
## 4    26244 2011-03-01 05:00:00 UCSA Possession Mariju… 605      605       Juven…
## 5    26245 2011-03-01 05:00:00 Theft 2nd Degree        404      302       Juven…
## 6    26246 2011-03-01 05:00:00 Simple Assault          604      604       Juven…
## # ℹ 10 more variables: GLOBALID <chr>, CREATED <chr>, EDITED <chr>, YEAR <dbl>,
## #   MONTH <dbl>, DAY <dbl>, DATE_ONLY <date>, DOW <fct>, offense_type <chr>,
## #   weapon_inv <chr>

df %>%
  select(TOP_CHARGE_DESC, offense_type)

## # A tibble: 36,100 × 2
##    TOP_CHARGE_DESC                      offense_type
##    <chr>                                <chr>       
##  1 Robbery -- Force & Violence          Robbery     
##  2 Juvenile Custody Order - Prepetition Other       
##  3 Felony Escapee Warrant               Other       
##  4 UCSA Possession Marijuana            Other       
##  5 Theft 2nd Degree                     Other       
##  6 Simple Assault                       Assault     
##  7 Juvenile Custody Order - Prepetition Other       
##  8 UCSA Possession Marijuana            Other       
##  9 Destruction of Property (Felony)     Other       
## 10 Simple Assault                       Assault     
## # ℹ 36,090 more rows
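
Because the categories are defined by keyword matching, it can help to tabulate which charge descriptions fall through to "Other". The sketch below is a shortened, hypothetical version of the classification (only two keywords) applied to a handful of descriptions taken from the rows printed above; it is meant to show the check, not to replace the full code.

```r
library(dplyr)
library(stringr)

# Toy charge descriptions taken from the rows printed above
charges <- tibble(
  TOP_CHARGE_DESC = c("Robbery -- Force & Violence", "Theft 2nd Degree",
                      "UCSA Possession Marijuana", "Simple Assault",
                      "Theft 2nd Degree", "Destruction of Property (Felony)")
)

# Shortened keyword classification; anything unmatched becomes "Other"
classified <- charges %>%
  mutate(offense_type = case_when(
    str_detect(TOP_CHARGE_DESC, regex("robbery", ignore_case = TRUE)) ~ "Robbery",
    str_detect(TOP_CHARGE_DESC, regex("assault|adw", ignore_case = TRUE)) ~ "Assault",
    TRUE ~ "Other"
  ))

# Rank the descriptions that no keyword caught, most frequent first
other_breakdown <- classified %>%
  filter(offense_type == "Other") %>%
  count(TOP_CHARGE_DESC, sort = TRUE)

other_breakdown
```

The most frequent descriptions in this table are the natural candidates for new keywords or new categories.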

ggplot(df, aes(x = offense_type, fill = offense_type)) +
  geom_bar() +
  labs(
    title = "Offense Types",
    x = "Offense Type",
    y = "Count"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Because the ‘Other’ category I made dominates the plot, I decided to look at the distribution with the ‘Other’ arrests filtered out.

ggplot(subset(df, offense_type != "Other"),
       aes(x = offense_type, fill = offense_type)) +
  geom_bar() +
  labs(
    title = "Offense Types",
    x = "Offense Type",
    y = "Count"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This plot looks at those same arrests but sorts them by the day of week.

df %>%
  filter(offense_type != "Other", !is.na(offense_type)) %>%
  ggplot(aes(x = DOW, fill = offense_type)) +
  geom_bar() +
  labs(
    title = "Offense Type by Day of Week",
    x = "Day of Week",
    y = "Count",
    fill = "Offense Type"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The mix of offense types appears to stay roughly proportional across the week, even as the total count changes. This is consistent with previous research showing that more crime occurs on school days during the after-school hours. I know from personal experience that Tuesday and Wednesday are the most popular days for working in the office rather than from home. Traffic is heavier on those days because more people commute, whereas many work from home on Monday or Friday. The same pattern may apply to these arrests. Because the after-school hours account for the majority of crime, higher school attendance on Wednesdays could explain the midweek spike.
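
Whether the mix of offense types is really proportional across days is easier to judge from row shares than from stacked counts. Here is a minimal sketch on made-up day and offense pairs (not the project data), in which Wednesday has twice as many arrests as Monday but the same mix.

```r
# Made-up day/offense pairs for illustration only
toy <- data.frame(
  DOW = factor(c("Monday", "Monday", "Monday", "Wednesday", "Wednesday",
                 "Wednesday", "Wednesday", "Wednesday", "Wednesday"),
               levels = c("Monday", "Wednesday")),
  offense_type = c("Assault", "Robbery", "Assault",
                   "Assault", "Assault", "Robbery", "Assault", "Robbery", "Assault")
)

# Cross-tabulate counts: one row per day, one column per offense type
counts_tab <- table(toy$DOW, toy$offense_type)

# margin = 1 rescales each row (day) so that its shares sum to 1
share_tab <- prop.table(counts_tab, margin = 1)
round(share_tab, 2)
```

In ggplot2, the same idea can be drawn by using `geom_bar(position = "fill")`, which rescales every bar to the same height so that only the shares differ.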

Unfortunately, the only attendance data that DCPS collects are yearly trends and percentages of large macro overviews of drop outs, truancy, tardiness, etc. Instead of focusing on the macro data in comparison to DCPS data, I will make some initial charts to understand some of the DCPS data better for future analysis.

DCPS Data

I retrieved this data from the District of Columbia Public Schools website under their downloadable data set tab. This is the link: https://dcps.dc.gov/node/1018342. I downloaded the enrollment data from this link and manually created a database that included all years together rather than separately. This shows a bit of a glimpse into before, during, and after COVID, and also overlaps with the years that I have Juvenile Arrest Data for. Some schools have opened and closed since the beginning of this data collection, hence the NAs present in the counts.

library(readxl)
Clean_DCPS_Enrollment <- read_excel("/Users/ingridellis/Desktop/CJS 310/Clean DCPS Enrollment.xlsx")
enrollment <- Clean_DCPS_Enrollment
head(enrollment)

## # A tibble: 6 × 15
##   `School Name`      `2011-2012` `2012-2013` `2013-2014` `2014-2015` `2015-2016`
##   <chr>              <chr>       <chr>       <chr>       <chr>       <chr>      
## 1 Aiton Elementary … 269         252         247         262         260        
## 2 Amidon-Bowen Elem… 254         293         342         345         356        
## 3 Anacostia High Sc… 784         697         751         661         597        
## 4 Ballou High School 910         791         678         755         933        
## 5 Bancroft Elementa… 463         473         490         508         521        
## 6 Bard High School … NA          NA          NA          NA          NA         
## # ℹ 9 more variables: `2016-2017` <chr>, `2017-2018` <chr>, `2018-2019` <chr>,
## #   `2019-2020` <chr>, `2020-2021` <chr>, `2021-2022` <chr>, `2022-2023` <chr>,
## #   `2023-2024` <chr>, `2024-2025` <chr>

dcps_totals <- enrollment %>%
  filter(`School Name` == "DCPS Schools Total")

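dcps_totals_long <- dcps_totals %>%
  pivot_longer(
    cols = -`School Name`,
    names_to = "Year",
    values_to = "Enrollment"
  ) %>%
  # The year columns are read in as text, so convert them to numbers before plotting
  mutate(Enrollment = as.numeric(Enrollment))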

ggplot(dcps_totals_long, aes(x = Year, y = Enrollment, fill = Year)) +
  geom_bar(stat = "identity") +
  geom_line(aes(group = 1), color = "black", linewidth = 1) +
  geom_point(size = 3, color = "black") +
  labs(
    title = "DCPS Total Enrollment Over Time",
    x = "Year",
    y = "Total Enrollment"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1) 
  ) +
  scale_fill_viridis_d(option = "turbo") +
  guides(fill = "none")

After doing that, I used R to create a data set that included only high schools and middle schools to see if there was anything of interest. While I don’t have age data for the juvenile arrests, I am still interested in older juveniles, as they may have more delinquent involvement. According to the Council on Criminal Justice, the most common age of offending is 16-17. While offending can start as young as 12, it often doubles by 13-14 and peaks at 16-17. For this reason, I wanted to look at these specific enrollment numbers.

library(dplyr)
library(stringr)

hs_ms <- enrollment %>%
  filter(str_detect(`School Name`, "High School| Middle School"))
head(hs_ms)

## # A tibble: 6 × 15
##   `School Name`      `2011-2012` `2012-2013` `2013-2014` `2014-2015` `2015-2016`
##   <chr>              <chr>       <chr>       <chr>       <chr>       <chr>      
## 1 Anacostia High Sc… 784         697         751         661         597        
## 2 Ballou High School 910         791         678         755         933        
## 3 Bard High School … NA          NA          NA          NA          NA         
## 4 Benjamin Banneker… 413         394         430         449         454        
## 5 Brookland Middle … 304         274         249         225         315        
## 6 Coolidge High Sch… 547         490         433         395         384        
## # ℹ 9 more variables: `2016-2017` <chr>, `2017-2018` <chr>, `2018-2019` <chr>,
## #   `2019-2020` <chr>, `2020-2021` <chr>, `2021-2022` <chr>, `2022-2023` <chr>,
## #   `2023-2024` <chr>, `2024-2025` <chr>

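# hs_ms contains only individual high schools and middle schools (the
# "DCPS Schools Total" row does not match the filter above), so the
# yearly totals are computed below by summing across schools.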

hsms_totals_long <- hs_ms %>%
  pivot_longer(
    cols = -`School Name`,
    names_to = "Year",
    values_to = "Enrollment"
  )

hsms_totals_long <- hsms_totals_long %>%
  mutate(Enrollment = as.numeric(Enrollment)) %>%
  group_by(Year) %>%
  summarise(Enrollment = sum(Enrollment, na.rm = TRUE))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Enrollment = as.numeric(Enrollment)`.
## Caused by warning:
## ! NAs introduced by coercion
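
When a conversion warning like this appears, it can be useful to list exactly which text entries failed to convert. The sketch below uses a made-up column; "n/a" and "1,204" are hypothetical examples of entries that do not parse as numbers, not values known to be in the DCPS file.

```r
# Toy enrollment column stored as text; "n/a" and "1,204" are made-up problem entries
enrollment_chr <- c("269", "NA", "n/a", "1,204", "910")

# Convert quietly so the result can be inspected
converted <- suppressWarnings(as.numeric(enrollment_chr))

# Entries that were present as text but became NA during conversion
# (the literal string "NA" is excluded because it already denotes a missing value)
failed <- enrollment_chr[is.na(converted) & enrollment_chr != "NA"]
unique(failed)
```

If the problem entries turn out to be placeholders for missing data, `read_excel()` also accepts an `na` argument listing the strings that should be read as missing.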

ggplot(hsms_totals_long, aes(x = Year, y = Enrollment, fill = Year)) +
  geom_bar(stat = "identity") +
  geom_line(aes(group = 1), color = "black", linewidth = 1) +
  geom_point(size = 3, color = "black") +
  labs(
    title = "DCPS High School + Middle School Enrollment Over Time",
    x = "Year",
    y = "Total High School + Middle School Enrollment"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  scale_fill_viridis_d(option = "turbo") +
  guides(fill = "none")

Next, I compare the enrollment numbers with the juvenile arrest counts for each year.

library(dplyr)
library(tidyr)
library(stringr)

arrest_counts <- df %>%
  count(YEAR, name = "Arrest_Count")

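# Note: hs_ms has one row per school, so slice_tail(n = 1) keeps only the
# last school in the table; the summed HS/MS totals are in hsms_totals_long.
hsms_totals <- hs_ms %>%
  slice_tail(n = 1) %>%
  pivot_longer(
    cols = -`School Name`,   # pivot only the year columns, not the school name
    names_to = "year",
    values_to = "DCPS_Total"
  ) %>%
  mutate(
    # Key each school year by its ending year, e.g. "2011-2012" -> 2012
    year = str_extract(year, "\\d{4}$"),
    year = as.numeric(year),
    # DCPS_Total is a character column, so the replacement must be text too
    DCPS_Total = replace_na(DCPS_Total, "0")
  ) %>%
  select(year, DCPS_Total)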

counts <- arrest_counts %>%
  rename(year = YEAR) %>%
  left_join(hsms_totals, by = "year") %>%
  select(year, DCPS_Total, Arrest_Count)

head(counts)

## # A tibble: 6 × 3
##    year DCPS_Total Arrest_Count
##   <dbl> <chr>             <int>
## 1  2011 <NA>               3499
## 2  2012 1633               3022
## 3  2013 1713               3173
## 4  2014 1696               2982
## 5  2015 1788               3141
## 6  2016 1791               3278
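
The join depends on how a school-year label such as "2011-2012" is mapped to a single calendar year. The code above keys on the ending year, which is why 2011 has no match. As a minimal sketch of the alternative, the example below keys on the starting year instead, using made-up numbers and hypothetical table names; either convention is defensible as long as it is stated.

```r
library(dplyr)
library(stringr)

# Made-up totals in long form, one row per school year (illustration only)
enrollment_long <- tibble(
  Year = c("2011-2012", "2012-2013", "2013-2014"),
  Enrollment = c(15000, 15200, 15100)
)

# Made-up yearly arrest counts
arrests_by_year <- tibble(year = c(2011, 2012, 2013),
                          Arrest_Count = c(30, 28, 25))

# Key each school year by its starting calendar year ("2011-2012" -> 2011)
enrollment_by_year <- enrollment_long %>%
  mutate(year = as.numeric(str_extract(Year, "^\\d{4}"))) %>%
  select(year, Enrollment)

# Every arrest year now finds a matching enrollment row
joined <- left_join(arrests_by_year, enrollment_by_year, by = "year")
joined
```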

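# DCPS_Total is stored as text; convert it so the x-axis is numeric and lm can fit a line
ggplot(counts, aes(x = as.numeric(DCPS_Total), y = Arrest_Count)) +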
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    x = "HS/MS Enrollment",
    y = "Arrest Count",
    title = "Relationship Between DCPS Enrollment and Arrests"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Next, I remove the NA row and color the points by year.

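# DCPS_Total is stored as text, so it is converted to numeric for the x-axis
ggplot(dplyr::filter(counts, !is.na(DCPS_Total)),
       aes(x = as.numeric(DCPS_Total), y = Arrest_Count, color = factor(year))) +
  geom_point(size = 4) +
  # group = 1 fits one line across all years instead of one per color group
  geom_smooth(aes(group = 1), method = "lm", color = "black", alpha = 0.3, linewidth = 1.3) +
  theme_minimal() +
  labs(x = "HS/MS Enrollment", y = "Arrest Count", color = "Year")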

## `geom_smooth()` using formula = 'y ~ x'

Overall, there seems to be a weak, negative relationship between these two factors. It is hard to say whether enrollment played much of a role in the lower arrest counts, because COVID probably also contributed to the drop.
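
The direction and strength of a relationship like this can be summarized with a correlation coefficient and the slope of the fitted line. The sketch below uses made-up yearly values (not the project data) purely to show the calculation.

```r
# Made-up yearly values for illustration only
enrollment <- c(15000, 15200, 15100, 15600, 15900, 16100)
arrests    <- c(3400, 3000, 3200, 2900, 3100, 2500)

# Pearson correlation: the sign gives the direction, the magnitude gives the strength
r <- cor(enrollment, arrests)

# Slope of the same kind of straight line that geom_smooth(method = "lm") draws
fit   <- lm(arrests ~ enrollment)
slope <- unname(coef(fit)["enrollment"])

c(r = r, slope = slope)
```

With only a handful of yearly points, a single unusual year can move both numbers substantially, so they are best read as descriptive rather than conclusive.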

Conclusions

Overall, a combination of factors seems to have led to the fall in juvenile arrests. According to the Council on Criminal Justice, there was about a 25% decrease in non-lethal violent crime committed by juveniles in the year after COVID. There was less unstructured social time (parties, hanging out with friends, etc.), which meant less criminal opportunity (Baumer & Staff, 2024). Interestingly, total enrollment continued to increase steadily, even after COVID. This may also have contributed to the drop in juvenile arrests because more students were enrolled and attending school online.

It may be hard to make solid predictions based on this information alone, because the issue is so multifaceted. Many different factors may have contributed, but this is a first step toward understanding them.

Using R wasn’t too difficult. I think the hardest part was cleaning my data set and making it usable, which was a problem I largely gave myself. There weren’t many open-source juvenile arrest data sets that I could get my hands on because the data are so protected. Because these data sets were so hard to work with, my job in R was a lot more difficult. If I hadn’t used these data sets, though, I think it may have been easier.

Sources

Links:

https://www.ojjdp.gov/ojstatbb/offenders/qa03301.asp

https://counciloncj.org/youth-crime-before-and-after-the-beginning-of-covid-19-a-survey-of-middle-and-high-school-students-in-the-united-states/

https://counciloncj.org/trends-in-juvenile-offending-what-you-need-to-know/#:~:text=There%20was%20significant%20variation%20in,to%202022%20for%20younger%20juveniles

Citations:

Baumer, E.P. & Staff, J. (2024). Youth crime before and after the beginning of COVID-19: A survey of middle and high school students in the United States. Council on Criminal Justice. https://counciloncj.org/youth-crime-before-and-after-the-beginning-of-covid-19-a-survey-of-middle-and-high-school-students-in-the-united-states/

Lantz, B. & Knapp, K.G. (2024). Trends in juvenile offending: What you need to know. Council on Criminal Justice. https://counciloncj.org/trends-in-juvenile-offending-what-you-need-to-know/

Office of Juvenile Justice and Delinquency Prevention. (n.d.). Juvenile offenders and victims: 2014 national report—Statistical briefing book. https://www.ojjdp.gov/ojstatbb/offenders/qa03301.asp

Final Project Workbook

Ingrid Ellis

2026-03-05