Adult and juvenile incarceration. Photo taken in Paris, France by Karollyne Videira Hubert. Source: Unsplash, published on November 6, 2021.

 

This project explores how violent and property crime trends relate to arrest rates in the United States between 1994 and 2016, and whether these relationships differ between juveniles and adults. The analysis uses three datasets collected by the Federal Bureau of Investigation (FBI) and made available through the FBI Crime Data Explorer, which provides annual data on reported crimes and arrests. The research question is: How do violent and property crime trends predict total arrest rates in the U.S., and do these relationships differ between juveniles and adults? To answer it, I use the following variables: year, violent_crime, population, total_male, total_female, and property_crime. Additional variables are created during cleaning and analysis: total_arrests, arrest_rate, and age_group. This topic matters because understanding arrest trends can help inform policy decisions around law enforcement, and it is a subject I personally find meaningful and interesting.

library(dplyr)
library(ggplot2)
library(tidyverse)
library(plotly)
library(ggfortify)
crimes <- read.csv("estimated_crimes_1979_2023.csv")
head(crimes)
##   year state_abbr state_name population violent_crime homicide rape_legacy
## 1 1979                        220099000       1208030    21460       76390
## 2 1979         AK     Alaska     406000          1994       54         292
## 3 1979         AL    Alabama    3769000         15578      496        1037
## 4 1979         AR   Arkansas    2180000          7984      198         595
## 5 1979         AZ    Arizona    2450000         14528      219        1120
## 6 1979         CA California   22696000        184087     2952       12239
##   rape_revised robbery aggravated_assault property_crime burglary larceny
## 1               480700             629480       11041500  3327700 6601000
## 2                  445               1203          23193     5616   15076
## 3                 4127               9918         144372    48517   83791
## 4                 1626               5565          70949    21457   45267
## 5                 4305               8884         177977    48916  116976
## 6                75767              93129        1511021   496310  847148
##   motor_vehicle_theft caveats
## 1             1112800        
## 2                2501        
## 3               12064        
## 4                4225        
## 5               12085        
## 6              167563
juvenile <- read.csv("arrests_national_juvenile.csv")
head(juvenile)
##     id year state_abbr offense_code                        offense_name
## 1 1081 2016         NA    ASR_ARSON                               Arson
## 2 1082 2016         NA      ASR_AST                  Aggravated Assault
## 3 1083 2016         NA  ASR_AST_SMP                      Simple Assault
## 4 1084 2016         NA      ASR_BRG                            Burglary
## 5 1085 2016         NA      ASR_CUR Curfew and Loitering Law Violations
## 6 1086 2016         NA      ASR_DIS                  Disorderly Conduct
##   agencies population total_male total_female m_0_9 m_10_12 m_13_14  m_15  m_16
## 1    13310  264534532       1760          328   117     368     548   291   239
## 2    13310  264534532      16997         5918   141    1406    3743  3062  3839
## 3    13310  264534532      66360        38712  1043    7406   16966 12403 14134
## 4    13310  264534532      23307         3071   167    1474    5301  4789  5662
## 5    13310  264534532      19218         8319   110    1184    4160  4405  5126
## 6    13310  264534532      34438        19449   431    3428    9021  6690  7553
##    m_17 f_0_9 f_10_12 f_13_14 f_15 f_16 f_17 race_agencies race_population
## 1   197    12      54     114   66   39   43         12581       263887632
## 2  4806    19     465    1565 1171 1337 1361         12581       263887632
## 3 14408   259    3353   11210 8036 8264 7590         12581       263887632
## 4  5914    34     170     727  618  689  833         12581       263887632
## 5  4233    24     449    1991 1887 2267 1701         12581       263887632
## 6  7315    80    1682    5883 4114 4049 3641         12581       263887632
##   white black asian_pacific_islander american_indian
## 1  1436   516                     41              48
## 2 12370  9736                    296             364
## 3 59778 41923                   1130            1378
## 4 14413 11082                    368             362
## 5 15468 11045                    382             429
## 6 28572 23619                    416             892
adults <- read.csv("arrests_national_adults.csv")
head(adults)
##     id state_abbr year offense_code          offense_name agencies population
## 1 1009         NA 2016    ASR_ARSON                 Arson    13310  264534532
## 2 1010         NA 2016      ASR_AST    Aggravated Assault    13310  264534532
## 3 1011         NA 2016  ASR_AST_SMP        Simple Assault    13310  264534532
## 4 1012         NA 2016      ASR_BRG              Burglary    13310  264534532
## 5 1013         NA 2016      ASR_DIS    Disorderly Conduct    13310  264534532
## 6 1014         NA 2016      ASR_DRG Drug Abuse Violations    13310  264534532
##   total_male total_female  m_18  m_19  m_20  m_21  m_22  m_23  m_24 m_25_29
## 1       4509         1426   161   180   165   150   157   140   160     748
## 2     224176        67016  5780  6482  6934  7824  8327  8595  8741   41655
## 3     570193       213178 14018 14756 16192 18911 20346 21466 22196  104402
## 4     116213        28754  7077  6501  5568  5127  4981  4907  4978   22425
## 5     180722        68577  6997  6446  6505  8259  7595  7362  7000   30066
## 6     920190       284712 46831 49913 47116 44717 43522 42817 42116  179120
##   m_30_34 m_35_39 m_40_44 m_45_49 m_50_54 m_55_59 m_60_64 m_65p  f_18  f_19
## 1     655     549     379     308     334     223     119    81    28    33
## 2   35024   27837   19331   16523   13550    9190    4590  3793  1733  1957
## 3   88978   72203   51692   44598   36523   23302   11381  9229  6816  7127
## 4   17522   12755    7861    6866    5020    2943    1119   563  1105  1100
## 5   24432   19210   13995   13421   12472    9095    4513  3354  2974  2742
## 6  133536   97497   61460   50920   40577   24873   10526  4649 12566 13825
##    f_20  f_21  f_22  f_23  f_24 f_25_29 f_30_34 f_35_39 f_40_44 f_45_49 f_50_54
## 1    35    52    34    45    55     241     229     189     136     107     102
## 2  2150  2457  2627  2692  2904   13141   10588    8121    5754    4879    3996
## 3  7351  8368  8667  8936  9064   40311   32734   25716   18433   15443   12005
## 4  1061  1062  1094  1163  1202    5861    4988    3665    2309    1788    1279
## 5  2618  2797  2883  2842  2807   12262    9622    7742    5666    5047    4085
## 6 13052 12511 12697 12685 13050   58647   45859   33246   20883   16330   11068
##   f_55_59 f_60_64 f_65p race_agencies race_population  white  black
## 1      74      38    28         12581       263887632   4263   1373
## 2    2343     961   713         12581       263887632 183478  94982
## 3    6697    3166  2344         12581       263887632 514297 237138
## 4     621     286   170         12581       263887632 101778  39235
## 5    2464    1098   928         12581       263887632 161655  73552
## 6    5645    1852   796         12581       263887632 844916 325859
##   asian_pacific_islander american_indian
## 1                    103             183
## 2                   5365            6129
## 3                  12418           14376
## 4                   2035            1323
## 5                   2556            9460
## 6                  14813           11743

Cleaning and Wrangling

# Add a column called age group and select relevant columns from each dataset 
adults2 <- adults |>
  mutate(age_group = "adult") |>
  select(year, population, total_male, total_female, age_group)

juvenile2 <- juvenile |>
  mutate(age_group = "juvenile") |>
  select(year, population, total_male, total_female, age_group)

# Combine adult and juvenile datasets
arrests <- bind_rows(adults2, juvenile2)
head(arrests)
##   year population total_male total_female age_group
## 1 2016  264534532       4509         1426     adult
## 2 2016  264534532     224176        67016     adult
## 3 2016  264534532     570193       213178     adult
## 4 2016  264534532     116213        28754     adult
## 5 2016  264534532     180722        68577     adult
## 6 2016  264534532     920190       284712     adult
# Group by year and age group to calculate total arrests and arrest rates per 100,000
arrests_summary <- arrests |>
  group_by(year, age_group) |>
  summarize(
    total_arrests = sum(total_male + total_female, na.rm = TRUE),
    population = max(population, na.rm = TRUE)
  ) |>
  mutate(arrest_rate = (total_arrests / population) * 100000)
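
As a quick sanity check on the rate formula, the sketch below recomputes the 1994 rates from the totals printed in the merged table of this report. The helper name rate_per_100k is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper (not in the original code): arrests per 100,000 people.
rate_per_100k <- function(total_arrests, population) {
  (total_arrests / population) * 100000
}

# 1994 totals as printed in the merged table of this report.
adult_1994    <- rate_per_100k(9742152, 208091172)
juvenile_1994 <- rate_per_100k(2220784, 208091172)

# Rounded to whole arrests per 100,000, matching the arrest_rate column.
round(c(adult = adult_1994, juvenile = juvenile_1994))
```

Both groups share the same denominator, the total population covered by the reporting agencies, so the juvenile figure is a rate per 100,000 people overall and not per 100,000 juveniles.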


# Convert crime counts to numeric and filter for years 1994–2016
# Summarize to ensure each year has only one row with total violent and property crime counts
crimes2 <- crimes |>
  mutate(across(c(violent_crime, property_crime), as.numeric)) |>
  filter(year >= 1994 & year <= 2016) |>
  group_by(year) |>
  summarize(violent_crime = sum(violent_crime, na.rm = TRUE),
  property_crime = sum(property_crime, na.rm = TRUE))
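
The first row of head(crimes) has a blank state_abbr and a population of about 220 million, which suggests the file holds a national total row alongside the state rows. If that is the case, summing every row within a year counts each crime twice. The sketch below uses a small made-up data frame to show one way to keep only the national row. The assumption that a blank state_abbr marks the national total should be checked against the real file.

```r
library(dplyr)

# Toy data (made up for illustration): one national row plus two state rows per year.
toy_crimes <- data.frame(
  year          = c(1994, 1994, 1994, 1995, 1995, 1995),
  state_abbr    = c("", "AK", "AL", "", "AK", "AL"),
  violent_crime = c(300, 100, 200, 330, 110, 220)
)

# Summing all rows double counts, because the national row already includes the states.
all_rows <- toy_crimes |>
  group_by(year) |>
  summarize(violent_crime = sum(violent_crime, na.rm = TRUE))

# Assumption: a blank state_abbr marks the national total.
national_only <- toy_crimes |>
  filter(state_abbr == "") |>
  select(year, violent_crime)

all_rows$violent_crime       # 600 660
national_only$violent_crime  # 300 330
```

Because every crime count would be scaled by the same factor, a change like this would rescale the crime coefficients in a regression without changing their t-values or p-values.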

# Merge crime data with arrest summary by year
final_df <- arrests_summary |>
  inner_join(crimes2, by = "year")
head(final_df)
## # A tibble: 6 × 7
## # Groups:   year [3]
##    year age_group total_arrests population arrest_rate violent_crime
##   <int> <chr>             <int>      <int>       <dbl>         <dbl>
## 1  1994 adult           9742152  208091172       4682.       3715340
## 2  1994 juvenile        2220784  208091172       1067.       3715340
## 3  1995 adult           9834654  206783051       4756.       3597584
## 4  1995 juvenile        2227408  206783051       1077.       3597584
## 5  1996 adult           9259837  195867829       4728.       3377080
## 6  1996 juvenile        2159457  195867829       1103.       3377080
## # ℹ 1 more variable: property_crime <dbl>

Creating the Visualizations and Statistical Analysis

Visualization 1: Line Graph - Arrest Rate Over Time

# Plot arrest rates over time, separated by age group
ggplot(final_df, aes(x = year, y = arrest_rate, color = age_group)) +
  geom_line() +
  labs(title = "Arrest Rates Over Time by Age Group (1994 - 2016)", caption = "FBI Crime Data Explorer, based on data from the Uniform Crime Reporting (UCR) Program") +
  scale_y_continuous(name = "Arrest Rate per 100,000") +
  scale_x_continuous(name = "Year") +
  scale_color_manual(name = "Age Group", values = c("#1f77b4", "#df7e2b"), labels = c("Adult", "Juvenile")) +
  theme_minimal(base_size = 12, base_family = "serif")

This line graph shows how arrest rates changed between 1994 and 2016 for both juveniles and adults. Arrest rates went down for both groups over time, but they dropped more quickly for juveniles. Adults had higher arrest rates throughout the entire period.

Visualization 2: Scatterplot + Regression Line

# Scatterplot with regression line showing relationship between violent crime and arrest rate
ggplot(final_df, aes(x = violent_crime, y = arrest_rate, color = age_group)) +
  geom_point(size = 1.7, alpha = 0.8) +
  geom_smooth(method = "lm") +
  labs(title = "Violent Crime vs. Arrest Rate by Age Group", x = "Violent Crime (per year)", y = "Arrest Rate per 100,000", color = "Age Group", caption = "Source: FBI UCR / Crime Data Explorer") +
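  # Set breaks explicitly so each label is paired with the right group
  scale_color_manual(values = c("juvenile" = "#1f77b4", "adult" = "#df7e2b"), breaks = c("adult", "juvenile"), labels = c("Adult", "Juvenile")) +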
  theme_minimal(base_size = 12, base_family = "serif")

This scatterplot displays the relationship between violent crime and arrest rate, separated by age group. Arrest rates tend to increase as violent crime increases in both groups. Juveniles show a clearer linear trend, while for adults the points are more spread out and follow the pattern less closely.

Linear Regression

# Fit model - regression model to predict total arrests using crime levels and age group
fit1 <- lm(total_arrests ~ violent_crime + property_crime + age_group, data = final_df)
summary(fit1)
## 
## Call:
## lm(formula = total_arrests ~ violent_crime + property_crime + 
##     age_group, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -516674 -122233  -28734  104608  728657 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        5.705e+06  3.052e+05   18.694  < 2e-16 ***
## violent_crime      4.804e-04  2.698e-01    0.002 0.998588    
## property_crime     1.654e-01  4.157e-02    3.979 0.000269 ***
## age_groupjuvenile -7.426e+06  7.169e+04 -103.587  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 243100 on 42 degrees of freedom
## Multiple R-squared:  0.9961, Adjusted R-squared:  0.9959 
## F-statistic:  3617 on 3 and 42 DF,  p-value: < 2.2e-16

Model Equation:

Total arrests = 5,705,325.68 + 0.00048(violent crime) + 0.1654(property crime) - 7,426,440.30(juvenile)
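
To make the equation concrete, the sketch below evaluates it for one adult and one juvenile case. The coefficients are copied from the equation above. The helper predict_arrests is hypothetical, and the property crime input is a made-up round number used only for illustration.

```r
# Coefficients copied from the model equation in this report.
b0         <- 5705325.68   # intercept (adult baseline)
b_violent  <- 0.00048      # per additional violent crime
b_property <- 0.1654       # per additional property crime
b_juvenile <- -7426440.30  # shift applied to the juvenile group

# Hypothetical helper: evaluate the fitted equation for given inputs.
predict_arrests <- function(violent_crime, property_crime, juvenile) {
  b0 + b_violent * violent_crime + b_property * property_crime +
    b_juvenile * as.numeric(juvenile)
}

# Illustrative inputs only; the property crime figure is made up.
adult_hat    <- predict_arrests(3715340, 20000000, juvenile = FALSE)
juvenile_hat <- predict_arrests(3715340, 20000000, juvenile = TRUE)

# For identical crime inputs, the two predictions differ by the age-group coefficient.
adult_hat - juvenile_hat
```

Whatever crime values are plugged in, the gap between the two predictions stays the same, because the model has no term that lets the crime slopes vary by age group.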

  • For each additional violent crime, the model predicts 0.00048 more arrests, holding property crime and age group constant. This is practically zero.

  • For each additional property crime, the model predicts approximately 0.165 more arrests, holding violent crime and age group constant.

This means property crime is a stronger predictor of total arrests than violent crime in this model.

The intercept (about 5.7 million) is the expected number of adult arrests when both crime variables are 0, which is not meaningful in a realistic context.

The p-value for property crime (0.000269) indicates that it is a statistically significant predictor of total arrests. On the other hand, the p-value for violent crime is very high (0.998), meaning that it does not significantly explain changes in total arrests.

The age group variable is also significant. The negative coefficient for juveniles means that, for the same number of violent and property crimes, the model predicts about 7.4 million fewer juvenile arrests than adult arrests per year.

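The adjusted R-squared value is 0.9959, which means the model explains 99.59% of the variation in total arrests using violent crime, property crime, and age group. This is an extremely high value. Given the very large age group coefficient (t = -103.6), much of it likely reflects the gap between adult and juvenile arrest counts rather than the crime variables.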
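
The research question also asks whether the crime and arrest relationship differs by age group. The model above gives both groups the same slopes and lets only the intercept differ. One common way to let the slopes differ is an interaction term. The sketch below illustrates the idea on simulated data; it is not part of the original analysis, and all values in it are made up.

```r
set.seed(1)

# Simulated data (not the FBI data): 23 years for each of two groups.
n <- 23
sim <- data.frame(
  property_crime = rep(seq(8e6, 12e6, length.out = n), times = 2),
  age_group      = rep(c("adult", "juvenile"), each = n)
)

# Build in a steeper slope for adults (0.30) than for juveniles (0.10), plus noise.
slope <- ifelse(sim$age_group == "adult", 0.30, 0.10)
sim$total_arrests <- 5e6 + slope * sim$property_crime + rnorm(2 * n, sd = 5e4)

# Separate slope per group via an interaction term.
fit_interact <- lm(total_arrests ~ property_crime * age_group, data = sim)

# The interaction row estimates how much the juvenile slope differs from the adult slope.
int_row <- summary(fit_interact)$coefficients["property_crime:age_groupjuvenile", ]
int_row
```

In the real data, the same formula with final_df would show whether the juvenile slope differs reliably from the adult slope, which speaks more directly to the second half of the research question than the age_group intercept shift does.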

# Diagnostic plots
autoplot(fit1, 1:4, nrow=2, ncol=2)

  • Residuals vs Fitted – There’s a slight curve and more spread at higher fitted values, which suggests the model doesn’t fit equally well for all data points.

  • Normal Q-Q Plot – The Q-Q plot shows that the residuals generally follow the diagonal line, with a few small deviations at the tails (not severe enough to invalidate the model).

  • Scale-Location Plot – The line increases slightly with fitted values, suggesting that residuals spread out more for larger predicted values.

  • Cook’s Distance – This helps identify influential data points. Larger values (points above 0.5 or 1) indicate observations that may disproportionately affect the model.
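
The Cook's distance check can also be done numerically rather than by reading the plot. The sketch below uses base R's cooks.distance() on simulated data with one deliberately extreme point. The data are made up, and the 0.5 cutoff is the rule of thumb mentioned in the list above.

```r
set.seed(2)

# Simulated data (not the FBI data) with one deliberately extreme observation.
x <- 1:20
y <- 2 * x + rnorm(20)
x[20] <- 40   # far from the other x values (high leverage)
y[20] <- 20   # and far below the trend (large residual)

fit <- lm(y ~ x)

# Cook's distance for every observation.
d <- cooks.distance(fit)

# Indices of observations above the 0.5 rule of thumb.
flagged <- which(d > 0.5)
flagged
```

Applied to the model in this report, cooks.distance(fit1) would return one value per year and age group row, which makes it easy to see which years drive the flagged points.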

Although a few data points show moderate leverage, the model still demonstrates a very strong fit (Adjusted R² = 0.9959). Removing violent_crime does not significantly improve the model’s performance, so I decided to keep the variable and not build a new model.
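
A comparison between a model with and without one predictor can be sketched as follows. This uses simulated data in which the second predictor has no real effect, so it only illustrates the mechanics; the variable names are placeholders and not the columns of final_df.

```r
set.seed(3)

# Simulated data (not the FBI data): x2 has no real effect on y.
n  <- 46
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

fit_full    <- lm(y ~ x1 + x2)  # with the candidate variable
fit_reduced <- lm(y ~ x1)       # without it

# Adjusted R-squared penalizes extra predictors, so the two values are comparable.
c(full    = summary(fit_full)$adj.r.squared,
  reduced = summary(fit_reduced)$adj.r.squared)

# Partial F-test for the dropped term; a large p-value means little is lost by removing it.
cmp <- anova(fit_reduced, fit_full)
cmp
```

For the model in this report, the analogous call would compare fit1 with a refit that omits violent_crime.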

Visualization 3: Interactive Visualization

# Create a long-format gender summary with calculated arrest rates
gender_summary <- arrests |>
  pivot_longer(cols = c(total_male, total_female), names_to = "gender1", values_to = "count")|>
  mutate(gender = ifelse(gender1 == "total_male", "Male", "Female")) |>
  group_by(year, age_group, gender) |>
  summarize(total_arrests = sum(count, na.rm = TRUE),
    population = max(population, na.rm = TRUE)) |>
  mutate(arrest_rate = (total_arrests / population) * 100000)

Essay:

This project explores how violent and property crime trends relate to arrest rates in the United States between 1994 and 2016, and whether these relationships differ between juveniles and adults. The analysis used three datasets from the FBI Crime Data Explorer: one for national juvenile arrests, one for adult arrests, and a third for estimated crimes from 1979 to 2023. I added an age_group column to the juvenile and adult datasets and then combined them. Arrests were grouped by year and age group to calculate total arrests and arrest rates per 100,000 people. The crime dataset was filtered to the years 1994 to 2016, and the total number of violent and property crimes was calculated for each year. These cleaned datasets were merged into a final dataset used for visualization and regression.

The first visualization (a line graph) shows arrest rates over time for adults and juveniles. From 1994 to 2016, arrest rates declined for both groups, but the decline was steeper for juveniles. Adults had higher arrest rates throughout the period. The second visualization (a scatterplot with regression lines) examines the relationship between violent crime and arrest rate for each age group. The trend lines show that while arrest rates generally increase with violent crime, the relationship is not equally strong for both groups. Juveniles showed a more defined linear relationship, while adults had more variability.

For the statistical analysis, a multiple linear regression model was used to predict total arrests based on violent crime, property crime, and age group. The model had an extremely high adjusted R-squared value of 0.9959, meaning that around 99.6% of the variation in arrests could be explained by the included variables.

The coefficient for property crime was statistically significant, but the coefficient for violent crime was not (it did not meaningfully predict changes in arrests). The coefficient for age group (juvenile) was highly significant and negative, suggesting that, for the same number of violent and property crimes, juveniles are predicted to have about 7.4 million fewer arrests than adults. This model indicates that property crime is a far stronger predictor of total arrests than violent crime, and that age group plays a major role in determining arrest outcomes.

I also created an interactive Tableau chart to visualize arrest trends by gender within each age group from 1994 to 2016. This visualization allows users to filter by year and compare how male and female arrests have changed over time. While gender was not part of the regression analysis, I thought it would add important context for this research.

I would have liked to include additional factors, such as regional differences and comparisons with other countries, but the arrest datasets were limited to national-level U.S. data with no state or regional breakdown.