TidyVerse Vignette - Philly Crime Rates

We’ll walk through an analysis of crime rates in Philadelphia, highlighting various features of the tidyverse along the way. We’ll primarily make use of dplyr, ggplot and forcats functions.

Our data will be pulled from the City of Philadelphia’s OpenPhillyData site: https://opendataphilly.org/datasets/crime-incidents/. The site provides an API, which can be queried using SQL (specifically PostgreSQL). OpenPhillyData even offers an R package — rphl — to facilitate queries, which can be installed using the remotes package (remotes::install_github(“CityOfPhiladelphia/rphl”).

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(rphl)

## Loading required package: extrafont
## Registering fonts with R

We’ll begin by defining our query and submitting it via the rphl package’s get_carto function. In this query, we request counts of various types of crimes per year.

query_crime <- paste("SELECT Text_General_Code, COUNT(*) AS n,",
                        "DATE_PART('YEAR', Dispatch_Date_Time) AS Year",
                     "FROM incidents_part1_part2",
                     "GROUP BY Year, Text_General_Code",
                     "ORDER BY Year, n")

crime <- get_carto(query_crime, format = "csv") %>%
  filter(year != 2023)

crime %>%
  head(10) %>%
  knitr::kable()

text_general_code	n	year
Homicide - Justifiable	2	2006
Homicide - Criminal	59	2006
Receiving Stolen Property	140	2006
Gambling Violations	180	2006
Offenses Against Family and Children	190	2006
Public Drunkenness	253	2006
Homicide - Criminal	324	2006
Liquor Law Violations	537	2006
Embezzlement	548	2006
Arson	622	2006

crime %>%
  ggplot(aes(year, n, color = text_general_code)) +
  geom_line() +
  theme(legend.position = 'bottom')

In previewing our dataframe and generating a line plot, we see that we have too many levels in the Text_General_Code field to handle simultaneously. However, we can use tidyverse features to map in additional data and simplify this factor in support of more streamlined analysis.

We begin with another query that creates a mapping between Text_General_Code and the UCR_General codes, which are more consolidated.

query_mapping <- paste("SELECT UCR_General, Text_General_Code",
                       "FROM incidents_part1_part2",
                       "GROUP BY UCR_General, Text_General_Code",
                       "ORDER BY UCR_General")

mapping <- get_carto(query_mapping, format = "csv")

mapping %>%
  arrange(ucr_general) %>%
  head(10) %>%
  knitr::kable()

ucr_general	text_general_code
100	Homicide - Criminal
100	Homicide - Justifiable
100	Homicide - Gross Negligence
100	Homicide - Criminal
200	Rape
300	Robbery Firearm
300	Robbery No Firearm
400	Aggravated Assault No Firearm
400	Aggravated Assault Firearm
500	Burglary Non-Residential

The UCR_General field still has a few too many levels, so we’ll use the mutate function to generate a new field, code_simple. Within mutate, we’ll leverage the case_when function to manually determine how each of the 26 code categories should be reclassified.

mapping <- mapping %>%
  mutate(code_simple = case_when(ucr_general == 100 ~ 'Homocide',
                                 ucr_general == 200 ~ 'Rape',
                                 ucr_general == 300 ~ 'Robbery/Burglary',
                                 ucr_general == 400 ~ 'Assault',
                                 ucr_general == 500 ~ 'Robbery/Burglary',
                                 ucr_general == 600 ~ 'Theft',
                                 ucr_general == 700 ~ 'Theft',
                                 ucr_general == 800 ~ 'Assault',
                                 ucr_general == 900 ~ 'Arson',
                                 ucr_general == 1000 ~ 'Forgery/Fraud',
                                 ucr_general == 1100 ~ 'Forgery/Fraud',
                                 ucr_general == 1200 ~ 'Forgery/Fraud',
                                 ucr_general == 1300 ~ 'Forgery/Fraud',
                                 ucr_general == 1400 ~ 'Vandalism',
                                 ucr_general == 1500 ~ 'Weapons',
                                 ucr_general == 1600 ~ 'Prostitution/Vice',
                                 ucr_general == 1700 ~ 'Prostitution/Vice',
                                 ucr_general == 1800 ~ 'Narcotics',
                                 ucr_general == 1900 ~ 'Prostitution/Vice',
                                 ucr_general == 2000 ~ 'Other',
                                 ucr_general == 2100 ~ 'DUI',
                                 ucr_general == 2200 ~ 'Other',
                                 ucr_general == 2300 ~ 'Other',
                                 ucr_general == 2400 ~ 'Other',
                                 ucr_general == 2500 ~ 'Other',
                                 ucr_general == 2600 ~ 'Other'))

mapping %>% select(text_general_code) %>% unique() %>% nrow()

## [1] 33

mapping %>% select(code_simple) %>% unique() %>% nrow()

## [1] 13

We can see that we’ve reduced the number of levels in our crime coding from 33 to 13. We’re now able to use the left_join from dplyr to bring this simplified code field into our crime dataframe. Following the join, we can use group_by and summarize to consolidate the crime tallies under the new, simplified classifications.

crime_simple <- crime %>%
  left_join(
    select(mapping, text_general_code, code_simple)) %>%
  group_by(year, code_simple) %>%
  summarize(count = sum(n), .groups = 'keep')

## Joining with `by = join_by(text_general_code)`

crime_simple %>%
  head(10) %>%
  knitr::kable()

year	code_simple	count
2006	Arson	622
2006	Assault	37261
2006	DUI	4698
2006	Forgery/Fraud	10699
2006	Homocide	385
2006	Narcotics	13696
2006	Other	55998
2006	Prostitution/Vice	3227
2006	Rape	1030
2006	Robbery/Burglary	21956

crime_simple %>%
  ggplot(aes(year, count, color = code_simple)) +
  geom_line() +
  theme(legend.position = 'bottom') +
  scale_y_continuous(labels = scales::comma)

Our plot is now much more interpretable. However, there still appears to be a few too many levels to see the finer details, so it will be best to simplify this field a bit further. We’ll again use mutate to redefine the field, but instead of case_when we can use a different function from the forcats package: fct_collapse. After coercing the character field into a factor, we can then create new levels, listing out each of the constituent levels from the previous classification. We’ll single out Homicide, then create another level for other violent crimes, then collapse all other crimes into the Non-Violent category.

crime_simpler <- crime_simple %>%
  mutate(code_simple = as.factor(code_simple),
         code_simple = factor(fct_collapse(
           code_simple,
           Homocide = c('Homocide'),
           Rape_Assault_Burglary = c('Rape','Robbery/Burglary','Arson','Assault'),
           other_level = 'Non-Violent')),
         ordered = TRUE) %>%
  group_by(year, code_simple) %>%
  summarize(count = sum(count), .groups = 'keep') %>%
  ungroup()

crime_simpler %>%
  head(10) %>%
  knitr::kable()

year	code_simple	count
2006	Rape_Assault_Burglary	60869
2006	Non-Violent	161992
2006	Homocide	385
2007	Rape_Assault_Burglary	58653
2007	Non-Violent	152336
2007	Homocide	382
2008	Rape_Assault_Burglary	59461
2008	Non-Violent	152107
2008	Homocide	328
2009	Rape_Assault_Burglary	54444

crime_simpler %>%
  ggplot(aes(year, count)) +
  geom_line() +
  facet_grid(rows = vars(code_simple),
             scales = 'free') +
  scale_y_continuous(labels = scales::comma)

We now have a clean plot to facilitate analysis. The plot shows that both Non-Violent and Violent Crimes excluding homocide have steadily decreased since 2006, with slight upticks in 2022. The level of homicides, however, shows a different path. After some slight declines through ~2014, the number of homicides has steadily increased. The increase between 2019 and 2020 appears especially pronounced.

This raises a question. The plot above shows absolute counts of crimes, so it is unclear whether crime rates have increased. In other words, is the increase in homicide rates simply a reflection of an increasing population?

To answer this question, we’ll pull in one final dataset from OpenDataPhilly, this one detailing the total population of Philadelphia County. This dataset is available directly in .csv format, so we can use the read_csv function from readr. We’ll also use the filter function from dplyr to remove unwanted rows, focusing only on total population counts each year.

pop <- read_csv(paste0('https://opendata.arcgis.com/api/v3/datasets/',
                       'd0ac67bb117b42f39614bad23525a13e_0/downloads/',
                       'data?format=csv&spatialRefId=4326'))

## Rows: 6906 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): GEOGRAPHY_NAME, GEOGRAPHY, SEX, RACE_ETHNICITY, AGE_CATEGORY, SOURCE
## dbl (3): OBJECTID, YEAR, COUNT_
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pop_annual <- pop %>%
  filter(SEX == 'All sexes',
         RACE_ETHNICITY == 'All races/ethnicities',
         AGE_CATEGORY == 'All ages',
         SOURCE == 'Annual County Resident Population Estimates')

pop_annual %>%
  tail(10) %>%
  knitr::kable()

OBJECTID	YEAR	GEOGRAPHY_NAME	GEOGRAPHY	SEX	RACE_ETHNICITY	AGE_CATEGORY	COUNT_	SOURCE
151	2010	Philadelphia	County	All sexes	All races/ethnicities	All ages	1528306	Annual County Resident Population Estimates
166	2011	Philadelphia	County	All sexes	All races/ethnicities	All ages	1540466	Annual County Resident Population Estimates
181	2012	Philadelphia	County	All sexes	All races/ethnicities	All ages	1551824	Annual County Resident Population Estimates
196	2013	Philadelphia	County	All sexes	All races/ethnicities	All ages	1558313	Annual County Resident Population Estimates
211	2014	Philadelphia	County	All sexes	All races/ethnicities	All ages	1565460	Annual County Resident Population Estimates
226	2015	Philadelphia	County	All sexes	All races/ethnicities	All ages	1571065	Annual County Resident Population Estimates
241	2016	Philadelphia	County	All sexes	All races/ethnicities	All ages	1576051	Annual County Resident Population Estimates
256	2017	Philadelphia	County	All sexes	All races/ethnicities	All ages	1580601	Annual County Resident Population Estimates
271	2018	Philadelphia	County	All sexes	All races/ethnicities	All ages	1583592	Annual County Resident Population Estimates
286	2019	Philadelphia	County	All sexes	All races/ethnicities	All ages	1584064	Annual County Resident Population Estimates

As above, we’ll use the left_join function to map in the population counts per year, and the mutate function to calculate crime rates, defined as number of crimes divided by total population. We’ll then generate a final plot to compare crimes total and rates per category over time.

crime_simpler_pop <- crime_simpler %>%
  left_join(select(pop_annual, YEAR, COUNT_),
            by = c('year' = 'YEAR')) %>%
  rename(population = COUNT_) %>%
  fill(population, .direction = 'down') %>%
  mutate(rate = count / population)

crime_simpler_pop %>%
  pivot_longer(cols = c(count, rate),
               names_to = 'measure') %>%
  ggplot() +
  geom_line(aes(year, value)) +
  facet_wrap(measure ~ code_simple,
             # rows = vars(code_simple),
             # cols = vars(measure),
             scales = 'free') +
  scale_y_continuous(labels = scales::comma)

Surprisingly, the plots of crime counts and rates match almost exactly, indicating the crime overall has kept pace with increases in total population. On the one hand, this indicates that most crime is steadily decreasing. At the same time, homicide rates have continued to climb despite the decreases in other categories.

In conclusion, we see that the tidyverse offers a variety of tools to gather, clean and compile data efficiently and intuitively.

TidyVerse Vignette - Philly Crime Rates

Keith Colella

2023-05-09