Introduction

This analysis examines the relationship between childhood exposure to lead from leaded gasoline and criminal behavior in adulthood, using US crime rate data. Tetraethyl lead was widely used as a gasoline additive through the mid-20th century until its phaseout began in the mid-1970s because of established health concerns; this analysis treats 1977 as the first phaseout year. Prior research has shown that childhood lead exposure negatively affects brain development, particularly in regions governing impulse control, decision-making, and aggression, all of which are associated with criminal activity. To test whether this effect is visible at the population level, crime rates through 1994 (years in which adult offenders would have been exposed during childhood) are compared with rates from 1995 onward (years in which people reaching adulthood had been born after the phaseout began), using an 18-year lag to align childhood exposure with adult criminal behavior.

Hypotheses

Let \(\mu_{\text{leaded}}\) represent the mean annual crime rate per 100,000 population in the leaded era (years through 1994), and \(\mu_{\text{unleaded}}\) the mean annual crime rate per 100,000 population in the unleaded era (1995 onward).

Null Hypothesis:

\[H_0: \mu_{\text{leaded}} - \mu_{\text{unleaded}} = 0\]

Alternative Hypothesis:

\[H_a: \mu_{\text{leaded}} - \mu_{\text{unleaded}} > 0\]

Analysis

Data Preparation and Analysis

The dataset is first reduced to the essential variables: year, total crime count, and population. Records containing missing values are removed. A crime rate per 100,000 population is then calculated so that years with different population sizes can be compared directly. Each year is assigned to one of two eras using the 18-year lag: the “Leaded” era (years before 1995, when adult offenders would have been exposed to leaded gasoline during childhood) and the “Unleaded” era (1995 onward, that is, 18 years after the phaseout began in 1977). Finally, the mean crime rate is computed for each era to compare the two groups.
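
As a worked check of the normalization, the 1960 row of the dataset (3,384,200 total crimes, population 179,323,175, as shown in the Load Data output) gives:

\[\text{crime rate} = \frac{\text{total}}{\text{population}} \times 100{,}000 = \frac{3{,}384{,}200}{179{,}323{,}175} \times 100{,}000 \approx 1887.2\]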

Code

Load Data

library(tidyverse)  # provides read_csv(), the dplyr verbs, and drop_na()

us_crime_rates <- read_csv("us_crime_rates.csv")
## Rows: 60 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): year, population, total, violent, property, murder, forcible_rape,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(us_crime_rates)
## # A tibble: 6 × 12
##    year population   total violent property murder forcible_rape robbery
##   <dbl>      <dbl>   <dbl>   <dbl>    <dbl>  <dbl>         <dbl>   <dbl>
## 1  1960  179323175 3384200  288460  3095700   9110         17190  107840
## 2  1961  182992000 3488000  289390  3198600   8740         17220  106670
## 3  1962  185771000 3752200  301510  3450700   8530         17550  110860
## 4  1963  188483000 4109500  316970  3792500   8640         17650  116470
## 5  1964  191141000 4564600  364220  4200400   9360         21420  130390
## 6  1965  193526000 4739400  387390  4352000   9960         23410  138690
## # ℹ 4 more variables: aggravated_assault <dbl>, burglary <dbl>,
## #   larceny_theft <dbl>, vehicle_theft <dbl>
summary(us_crime_rates)
##       year        population            total             violent       
##  Min.   :1960   Min.   :179323175   Min.   : 3384200   Min.   : 288460  
##  1st Qu.:1975   1st Qu.:212691000   1st Qu.: 8685625   1st Qu.: 996838  
##  Median :1990   Median :248474436   Median :11272114   Median :1278578  
##  Mean   :1990   Mean   :252716504   Mean   :10452904   Mean   :1194566  
##  3rd Qu.:2004   3rd Qu.:294369397   3rd Qu.:12600326   3rd Qu.:1425626  
##  Max.   :2019   Max.   :328239523   Max.   :14872900   Max.   :1932270  
##     property            murder      forcible_rape       robbery      
##  Min.   : 3095700   Min.   : 8530   Min.   : 17190   Min.   :106670  
##  1st Qu.: 7749522   1st Qu.:15266   1st Qu.: 55918   1st Qu.:331625  
##  Median :10053484   Median :16980   Median : 87132   Median :415836  
##  Mean   : 9256650   Mean   :17263   Mean   : 77384   Mean   :407210  
##  3rd Qu.:11216494   3rd Qu.:20200   3rd Qu.: 94385   3rd Qu.:500543  
##  Max.   :12961100   Max.   :24700   Max.   :143765   Max.   :687730  
##  aggravated_assault    burglary       larceny_theft     vehicle_theft    
##  Min.   : 154320    Min.   : 912100   Min.   :1855400   Min.   : 328200  
##  1st Qu.: 483518    1st Qu.:1913601   1st Qu.:5195649   1st Qu.: 748819  
##  Median : 763033    Median :2204156   Median :6453334   Median :1006000  
##  Mean   : 691072    Mean   :2335967   Mean   :5915788   Mean   :1004968  
##  3rd Qu.: 869517    3rd Qu.:3071950   3rd Qu.:7138300   3rd Qu.:1236357  
##  Max.   :1135610    Max.   :3795200   Max.   :8142200   Max.   :1661700

Clean Data

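phaseout_year <- 1977  # first year treated as post-phaseout
lag_years <- 18        # lag between childhood exposure and adult offending

us_crime_rates <- us_crime_rates |>
  select(year, total, population) |>  # keep only the variables used below
  drop_na() |>                        # drop years with missing values
  mutate(
    # total reported crimes per 100,000 residents
    crime_rate_normalized = total / population * 100000,
    # years before 1995 (1977 + 18) are "Leaded"; 1995 onward are "Unleaded"
    era = factor(ifelse(year < phaseout_year + lag_years, "Leaded", "Unleaded"),
                 levels = c("Leaded", "Unleaded"))
  )

# mean normalized crime rate within each era
era_means <- us_crime_rates |>
  group_by(era) |>
  summarize(mean_crime_rate = mean(crime_rate_normalized))

print(era_means)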
## # A tibble: 2 × 2
##   era      mean_crime_rate
##   <fct>              <dbl>
## 1 Leaded             4473.
## 2 Unleaded           3711.

Visualization

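library(ggplot2)
# Normalized crime rate by year, colored by era
ggplot(us_crime_rates, aes(x = year, y = crime_rate_normalized, color = era)) +
  geom_point() +
  # red: start of the phaseout (between 1976 and 1977)
  geom_vline(xintercept = 1976.5, linetype = "dashed", color = "red") +
  # blue: era cutoff after the 18-year lag (between 1994 and 1995)
  geom_vline(xintercept = 1976.5 + 18, linetype = "dashed", color = "blue") +
  labs(title = "US Crime Rate Before and After Lead Removal",
       x = "Year", y = "Crime Rate per 100,000") +
  theme_minimal()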

The red dashed line marks the start of the lead phaseout; the blue dashed line, 18 years later, marks the cutoff between the two eras, delayed to account for childhood exposure.

ANOVA Analysis

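A one-way ANOVA is used to determine whether mean crime rates differ significantly between the leaded and unleaded eras. It compares the variation between the group means with the variation within each group, and so assesses whether the observed difference exceeds what random variation would be expected to produce. With only two groups, the F test is equivalent to a two-sided, equal-variance two-sample t-test (\(F = t^2\)), so it does not by itself test the direction stated in \(H_a\).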

anova_result <- aov(crime_rate_normalized ~ era, data=us_crime_rates)
summary(anova_result)
##             Df   Sum Sq Mean Sq F value Pr(>F)  
## era          1  8475608 8475608    6.31 0.0148 *
## Residuals   58 77909916 1343274                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
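
The link between the two-group ANOVA and a directional test can be sketched as follows. This is only an illustration on synthetic data: the `toy` data frame and its values are made up and are not the crime data. The group sizes of 35 and 25 mirror the 1960–1994 and 1995–2019 year ranges. It shows that `aov()` and a pooled-variance `t.test()` agree, and how a one-sided p-value for the "Leaded > Unleaded" alternative could be obtained.

```r
# Synthetic illustration only: these numbers are made up, not the crime data.
set.seed(42)
toy <- data.frame(
  era = factor(rep(c("Leaded", "Unleaded"), times = c(35, 25)),
               levels = c("Leaded", "Unleaded")),
  rate = c(rnorm(35, mean = 4500, sd = 1100),
           rnorm(25, mean = 3700, sd = 1100))
)

# One-way ANOVA with two groups (two-sided F test)
fit <- aov(rate ~ era, data = toy)
f_stat <- summary(fit)[[1]][["F value"]][1]
p_anova <- summary(fit)[[1]][["Pr(>F)"]][1]

# Pooled-variance two-sample t-test: two-sided, then one-sided.
# With levels c("Leaded", "Unleaded"), "greater" tests Leaded - Unleaded > 0.
t_two <- t.test(rate ~ era, data = toy, var.equal = TRUE)
t_one <- t.test(rate ~ era, data = toy, var.equal = TRUE,
                alternative = "greater")

t_stat <- unname(t_two$statistic)

# F equals t^2, and the two-sided p-values coincide
print(c(F = f_stat, t_squared = t_stat^2))
print(c(p_anova = p_anova, p_t_two_sided = t_two$p.value,
        p_t_one_sided = t_one$p.value))
```

When the observed difference lies in the hypothesized direction, the one-sided p-value is half the two-sided one. On the real data, the same `t.test()` call with `crime_rate_normalized ~ era` and `data = us_crime_rates` would be the directional counterpart of the ANOVA above.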

Conclusion

The ANOVA yielded a p-value of 0.0148, which is statistically significant at the α = 0.05 level, leading us to reject the null hypothesis. Because the F test with two groups is two-sided and the observed difference lies in the hypothesized direction, the one-sided p-value for \(H_a\) is half of that, about 0.0074. The mean crime rate was about 4,473 per 100,000 in the leaded era (through 1994) versus about 3,711 per 100,000 in the unleaded era (1995 onward), a difference of roughly 760 per 100,000. The data therefore support the claim that crime rates were significantly higher in years when adult offenders would have grown up exposed to leaded gasoline.

The 18-year lag, based on the typical age for adult criminal charges, accounts for the delay between childhood exposure and adult behavioral outcomes. This lag is conservative: criminal activity peaks from the late teens through the mid-twenties, so many offenders in the early “unleaded” years were born before the phaseout began and still experienced some lead exposure. That a significant difference appears despite this overlap suggests a robust relationship.

Correlation does not prove causation, but the statistically significant result, combined with the biological plausibility of lead’s effects on brain development and impulse control, supports lead exposure as a contributing factor to historical variation in crime rates. Two limitations remain. Annual crime rates are serially correlated, so the independence assumption behind the ANOVA is doubtful and the reported p-value is probably optimistic. Further research could also control for confounding variables, such as changes in judicial policy, or adjust the exposure measure for the number of cars actually on the road.
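
One way to probe the claim that the 18-year lag is conservative would be to repeat the test under longer lags. The sketch below is hypothetical and not part of the original analysis: `era_p_value()` is an illustrative helper, the lag values are arbitrary choices, and the `toy` data frame is synthetic stand-in data to be replaced by the real `us_crime_rates` tibble.

```r
# Hypothetical helper: classify years by a given lag and return the
# one-way ANOVA p-value for the difference between eras.
era_p_value <- function(data, lag_years, phaseout_year = 1977) {
  data$era <- factor(
    ifelse(data$year < phaseout_year + lag_years, "Leaded", "Unleaded"),
    levels = c("Leaded", "Unleaded")
  )
  fit <- aov(crime_rate_normalized ~ era, data = data)
  summary(fit)[[1]][["Pr(>F)"]][1]
}

# Synthetic stand-in data (made up); swap in the real us_crime_rates tibble.
set.seed(1)
toy <- data.frame(year = 1960:2019)
toy$crime_rate_normalized <-
  4000 + 1000 * sin((toy$year - 1960) / 10) + rnorm(60, sd = 200)

# Arbitrary illustrative lags; each keeps both eras non-empty for 1960-2019
lags <- c(18, 20, 22, 25)
p_values <- sapply(lags, function(l) era_p_value(toy, l))

print(data.frame(lag_years = lags, p_value = p_values))
```

If the conclusion is robust, the difference between eras should remain significant, or grow clearer, as the lag moves toward the peak offending ages. Running several lags also raises a multiple-comparisons concern, so such a table is better read as a sensitivity check than as a set of independent tests.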