Crime rates change and can vary across space and time. Reasons and even trends can be difficult to detect, due to many complex factors driving crime. Despite this uncertainty, policymakers need to know if crime rates are changing to inform their decisions.
I explore crime statistics collected between 1975 and 2016 from the State of Maryland. I’ll see if linear trends exist in crime rates across Maryland. This data comes from the Maryland Statistical Analysis Center.
I use regression analysis because it allows estimation of linear trends. But, some datasets can be hierarchical or nested, which presents a regression challenge. Many government statistics, such as crime rates, come from nested datasets. For example, counties exist within most US states (Alaska has “burrows,” and Louisiana has “parishes”). Counties and county-level governments vary even within the same state. For example, one county might have a high population density and be urban whereas a second county might have a low population density and be rural.
Hierarchical modeling allows me to capture and model this hierarchy. This figure shows the counties and their populations in Maryland. I will create a similar figure for their crime trends at the end of this project.
knitr::opts_chunk$set(echo = TRUE, include = TRUE, warning = FALSE, message = FALSE, fig.align = 'center')
# Load the packages
library(tidyverse)
library(lubridate)
library(lmerTest)
library(usmap)
# Read in the crime data
crime_raw <- read_csv("https://github.com/topherdea/Trends_in_Maryland_Crime_Rates/blob/master/Violent_Crime_by_County_1975_to_2016.csv?raw=true")
# Select and mutate columns the needed columns
crime_use <- crime_raw %>%
select(JURISDICTION, YEAR, POPULATION, crime_rate = `VIOLENT CRIME RATE PER 100,000 PEOPLE`) %>%
mutate(YEAR_2 = year(mdy_hms(YEAR)))
# Peek at the data
head(crime_use)
## # A tibble: 6 x 5
## JURISDICTION YEAR POPULATION crime_rate YEAR_2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Allegany County 01/01/1975 12:00:00 AM 79655 178. 1975
## 2 Allegany County 01/01/1976 12:00:00 AM 83923 104. 1976
## 3 Allegany County 01/01/1977 12:00:00 AM 82102 155. 1977
## 4 Allegany County 01/01/1978 12:00:00 AM 79966 128. 1978
## 5 Allegany County 01/01/1979 12:00:00 AM 79721 138 1979
## 6 Allegany County 01/01/1980 12:00:00 AM 80461 148. 1980
I have now loaded and tidied our data. Before running a regression or building a model, I like to visualize the data. Plotting the data helps me see what is going on with the data and tell the data’s story. A plot like this, after a little cleanup, would likely be in a final report on crime statistics. A picture (or well-designed figure!) can be worth a thousand words!
Now, I’ll plot the crime rate over time for each county and add a linear trend line.
# Plot the data as lines and linear trend lines
ggplot(crime_use, aes(x = YEAR_2, y = crime_rate, group = JURISDICTION)) +
geom_line() +
stat_smooth(method = "lm", se = FALSE, size = 0.5) +
theme_minimal() +
labs(x = "", y = "Crime Rate", main = "Crime Type by Year")
Now I can build a hierarchical model, also known as a linear mixed-effects regression model, using lmer(). lmer() uses similar syntax as lm(), but it also requires a random-effect argument. For example, y predicted by fixed-effect slope x and random-effect intercept group would be y ~ x + (1|group). x can also be included as a random-effect slope: y ~ x + (x|group). Hierarchical and Mixed Effect Models covers these models in greater detail.
In our case, I wonder if a linear trend through time predicts crime rates. I will estimate a trend for the entire state (a fixed-effect slope) and trends for each county (random-effect slopes). By treating each county as a random-effect, I assume the trend for each county comes from a state-wide distribution.
I use the lmerTest package because it adds p-values for fixed-effect coefficients to lmer() models, something the lme4 package does not include for reasons listed on the project’s FAQ page.
# Run a lmer() on data, notice the warning message
lmer_fail <- lmer(crime_rate ~ YEAR_2 + (YEAR_2|JURISDICTION), crime_use)
# Print the model's output
print(lmer_fail)
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: crime_rate ~ YEAR_2 + (YEAR_2 | JURISDICTION)
## Data: crime_use
## REML criterion at convergence: 13222.23
## Random effects:
## Groups Name Std.Dev. Corr
## JURISDICTION (Intercept) 138.9827
## YEAR_2 0.2477 -1.00
## Residual 161.1795
## Number of obs: 1008, groups: JURISDICTION, 24
## Fixed Effects:
## (Intercept) YEAR_2
## 3342.520 -1.425
## convergence code 0; 1 optimizer warnings; 0 lme4 warnings
In the last section, I got a warning message, singular fit, as well as more text in the model output: convergence code 0; 1 optimizer warnings; 0 lme4 warnings. The output tells me that lmer() was unable to fit the data to the model. When this happens, I double check my input data and model formula.
In my case, the warning occurs because of the intercept estimates. Specifically, regression models work best when the intercept is near zero. With my data, Year_2 goes from 1976 to 2016. 1976 is far from zero, relative to the data, making the model numerically unstable. I need to rescale YEAR_2 by changing Year_2 to start at zero, rather than 1976.
After I do this, I can rebuild the model using the new slope predictor variable.
# Mutate data to create another year column, YEAR_3
crime_use <- crime_use %>%
mutate(YEAR_3 = YEAR_2 - min(YEAR_2))
# Run a lmer() on mutuated data
lmer_crime <- lmer(crime_rate ~ YEAR_3 + (YEAR_3|JURISDICTION), crime_use)
After building the model, I can examine the model’s output using summary(). lmer() outputs are similar to the outputs from lm(). But, lmer() outputs include extra details including both fixed- and random-effects. The Hierarchical and Mixed-effects Models course provides more information on lmer() summary outputs.
The fixed-effect trend for YEAR_3 is not significantly different from zero, but what do the estimates look like for different counties? I can also access the regression coefficients using fixef() and ranef().
# Examine the model outputs using summary
summary(lmer_crime)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: crime_rate ~ YEAR_3 + (YEAR_3 | JURISDICTION)
## Data: crime_use
##
## REML criterion at convergence: 13144.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.4467 -0.4675 -0.0546 0.4328 6.5170
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## JURISDICTION (Intercept) 183231.42 428.055
## YEAR_3 17.94 4.236 -0.69
## Residual 23141.39 152.123
## Number of obs: 1008, groups: JURISDICTION, 24
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 529.0308 87.8821 22.9668 6.020 3.89e-06 ***
## YEAR_3 -1.4246 0.9507 22.9853 -1.498 0.148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## YEAR_3 -0.660
## convergence code: 0
## Model failed to converge with max|grad| = 0.0167762 (tol = 0.002, component 1)
# This is for readability
noquote("**** Fixed-effects ****")
## [1] **** Fixed-effects ****
# Use fixef() to view fixed-effects
fixef(lmer_crime)
## (Intercept) YEAR_3
## 529.030782 -1.424551
# This is for readability
noquote("**** Random-effects ****")
## [1] **** Random-effects ****
# Use ranef() to view random-effects
ranef(lmer_crime)
## $JURISDICTION
## (Intercept) YEAR_3
## Allegany County -324.40669 6.38788013
## Anne Arundel County -125.57018 4.81462314
## Baltimore City 1751.65039 -10.69338633
## Baltimore County 376.31543 -3.94835788
## Calvert County -132.90704 -1.82530226
## Caroline County -143.10110 2.15643569
## Carroll County -297.05454 0.75049213
## Cecil County -68.66688 2.77906999
## Charles County -60.79639 2.70974127
## Dorchester County 154.05937 -2.19196232
## Frederick County -67.79590 -1.42736199
## Garrett County -335.53522 2.19631991
## Harford County -179.92119 0.25558711
## Howard County -123.98247 -3.82371742
## Kent County -220.62959 2.49308956
## Montgomery County -253.86269 0.03515831
## Prince George's County 437.97151 -3.06303714
## Queen Anne's County -181.20256 -0.78719434
## Somerset County -71.53355 0.55556573
## St. Mary's County -152.42109 0.14712656
## Talbot County -93.14857 -1.77479609
## Washington County -237.30786 2.34186466
## Wicomico County 69.09364 7.09396163
## Worcester County 280.75315 -5.18180004
##
## with conditional variances for "JURISDICTION"
I estimated a trend for the entire state as a fixed-effect. This is the average trend across all of Maryland. I estimated a trend for each county as a random-effect. The specific random-effect slope estimated for each county is the difference for the county compared to the state average. For example, Allegany County had a slope estimate of 6.4 and Maryland had a slope estimate of -1.4. Adding these together, I get an estimated slope of 5.0 (6.4 + -1.4 = 5.0) for Allegany County.
I can use R to calculate the slope estimate for each county by extracting the fixed-effect estimate and adding it to the random-effect estimates.
# Add the fixed-effect to the random-effect and save as county_slopes
county_slopes <- fixef(lmer_crime)["YEAR_3"] + ranef(lmer_crime)$JURISDICTION["YEAR_3"]
# Add a new column with county names
county_slopes <-
county_slopes %>%
rownames_to_column("county")
I now have the crime trend data ready, but I need to get map data for the plot.
I use the usmap package to get map data for the US because the package gives output directly in data frames. The older, maps package contains data for the entire world but requires some wrangling to use with ggplot2.
# Load and filter map data
county_map <- us_map(regions = "counties", include = "MD")
I need to make sure the crime data (supplied by the State of Maryland) and the map data (provided in the usmap package) use the same county names. This will allow me to merge the two data frames. In this case, I am lucky and only have one small difference.
# See which counties are not in both datasets
county_slopes %>% anti_join(county_map, by = "county")
## county YEAR_3
## 1 Baltimore City -12.11794
county_map %>% anti_join(county_slopes, by = "county")
## x y order hole piece group fips abbr full
## 1 1971953 -346337.0 22832 FALSE 1 24510.1 24510 MD Maryland
## 2 1973112 -344910.4 22833 FALSE 1 24510.1 24510 MD Maryland
## 3 1978638 -343296.4 22834 FALSE 1 24510.1 24510 MD Maryland
## 4 1986523 -340965.4 22835 FALSE 1 24510.1 24510 MD Maryland
## 5 1988768 -349066.3 22836 FALSE 1 24510.1 24510 MD Maryland
## 6 1991099 -357567.7 22837 FALSE 1 24510.1 24510 MD Maryland
## 7 1987283 -360072.8 22838 FALSE 1 24510.1 24510 MD Maryland
## 8 1980793 -357040.2 22839 FALSE 1 24510.1 24510 MD Maryland
## 9 1979316 -356705.1 22840 FALSE 1 24510.1 24510 MD Maryland
## 10 1976962 -356117.7 22841 FALSE 1 24510.1 24510 MD Maryland
## 11 1974461 -355516.2 22842 FALSE 1 24510.1 24510 MD Maryland
## 12 1972201 -347246.9 22843 FALSE 1 24510.1 24510 MD Maryland
## 13 1971953 -346337.0 22844 FALSE 1 24510.1 24510 MD Maryland
## county
## 1 Baltimore city
## 2 Baltimore city
## 3 Baltimore city
## 4 Baltimore city
## 5 Baltimore city
## 6 Baltimore city
## 7 Baltimore city
## 8 Baltimore city
## 9 Baltimore city
## 10 Baltimore city
## 11 Baltimore city
## 12 Baltimore city
## 13 Baltimore city
# Rename crime_names county
county_slopes <- county_slopes %>%
mutate(county = ifelse(county == "Baltimore City", "Baltimore city", county))
Finally, both data frames now have the same county names. Now we can merge the
# Merge the map and slope data frames
both_data <- full_join(county_map, county_slopes)
# Peek at the data
head(both_data)
## x y order hole piece group fips abbr full
## 1 1775008 -387237.8 22297 FALSE 1 24001.1 24001 MD Maryland
## 2 1779953 -357902.4 22298 FALSE 1 24001.1 24001 MD Maryland
## 3 1789820 -355286.0 22299 FALSE 1 24001.1 24001 MD Maryland
## 4 1796777 -353431.8 22300 FALSE 1 24001.1 24001 MD Maryland
## 5 1824941 -345915.7 22301 FALSE 1 24001.1 24001 MD Maryland
## 6 1828029 -345083.7 22302 FALSE 1 24001.1 24001 MD Maryland
## county YEAR_3
## 1 Allegany County 4.963329
## 2 Allegany County 4.963329
## 3 Allegany County 4.963329
## 4 Allegany County 4.963329
## 5 Allegany County 4.963329
## 6 Allegany County 4.963329
After creating the plot, let’s see if any patterns emerge.
# Set the notebook's plot settings
options(repr.plot.width = 10, repr.plot.height = 5)
# Plot the results
crime_map <- ggplot(both_data, aes(y, x, group = county, fill = YEAR_3)) +
geom_polygon() +
scale_fill_continuous(name = expression(atop("Change in crime rate","(Number year"^-1*")")),
low = "skyblue", high = "gold")
# Look at the map
crime_map
My map shows how linear trends in crime rates varied between 1976 and 2016, but I do not have a figure to show to policymakers yet.
# Plot options
options(repr.plot.width=10, repr.plot.height=5)
# Polish figure
crime_map_final <- crime_map +
theme_minimal() +
xlab("") +
ylab("") +
theme(axis.line = element_blank(),
axis.text = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
# Look at the map
print(crime_map_final)
Statewide, I found no linear trend in crime across Maryland. But, trends in crime rates varied across the state. What should I tell a policy maker? I would say to them Location is key. Some locations have had increasing crime while others have had decreasing crime.
My analysis raises other question as well. Comparing the population figure to the crime trends figure, I see that population might impact crime rate trends. Other predictor variables might explain a counties crime rate. I could look for more explanatory variables in public data sources such those supplied by the State of Maryland, the US Federal Government at Data.gov, and the US Census Bureau.
Also, the figure of trends through time suggests a non-linear trend, at least in some counties. I could either use a non-linear model or the only model the crime rate for the past 10 or 20 years. Nonlinear modeling in R with GAMS covers some non-linear models. Additionally, I could build my own, more complicated regression using a language such as JAGS, covered in Bayesian Modeling with RJAGS. The last model approach would also allow me to calculate credible intervals around the random-effects.
To finish up, I’ll quickly look to see if the population by county impacts the crime rate.
# Build a lmer with both year and population
lmer_pop <- lmer(crime_rate ~ YEAR_3 + POPULATION + (YEAR_3|JURISDICTION), crime_use)
# Inspect the results
summary(lmer_pop)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: crime_rate ~ YEAR_3 + POPULATION + (YEAR_3 | JURISDICTION)
## Data: crime_use
##
## REML criterion at convergence: 13160.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.4481 -0.4659 -0.0545 0.4322 6.5152
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## JURISDICTION (Intercept) 184386.8 429.403
## YEAR_3 18.1 4.255 -0.68
## Residual 23148.1 152.145
## Number of obs: 1008, groups: JURISDICTION, 24
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 5.330e+02 9.181e+01 2.242e+01 5.805 7.16e-06 ***
## YEAR_3 -1.374e+00 1.010e+00 2.676e+01 -1.359 0.185
## POPULATION -2.376e-05 1.547e-04 3.095e+01 -0.154 0.879
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) YEAR_3
## YEAR_3 -0.500
## POPULATION -0.279 -0.329
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
## convergence code: 0
## Model failed to converge with max|grad| = 0.0322298 (tol = 0.002, component 1)
ranef(lmer_pop)
## $JURISDICTION
## (Intercept) YEAR_3
## Allegany County -327.22428 6.3739881
## Anne Arundel County -121.40622 4.8971058
## Baltimore City 1766.24968 -10.8194184
## Baltimore County 387.94632 -3.9168107
## Calvert County -135.97576 -1.8479343
## Caroline County -146.94314 2.1352133
## Carroll County -298.73276 0.7440946
## Cecil County -71.82014 2.7847022
## Charles County -63.65423 2.7354540
## Dorchester County 150.61624 -2.2302161
## Frederick County -69.15179 -1.4059349
## Garrett County -339.06322 2.1589469
## Harford County -180.46456 0.2671489
## Howard County -124.70376 -3.7945019
## Kent County -224.57417 2.4655454
## Montgomery County -243.37547 0.1968961
## Prince George's County 449.83967 -2.9963116
## Queen Anne's County -184.55545 -0.8248845
## Somerset County -75.33004 0.5257197
## St. Mary's County -155.13296 0.1308435
## Talbot County -96.43543 -1.8212356
## Washington County -238.92873 2.3275518
## Wicomico County 65.26742 7.1363125
## Worcester County 277.55277 -5.2222747
##
## with conditional variances for "JURISDICTION"
Adapted from Datacamp Coursework by Richard Erickson | Quantitative Ecologist