We’ll walk through an analysis of crime rates in Philadelphia,
highlighting various features of the tidyverse along the way. We’ll
primarily make use of dplyr, ggplot and
forcats functions.
Our data will be pulled from the City of Philadelphia’s
OpenPhillyData site: https://opendataphilly.org/datasets/crime-incidents/.
The site provides an API, which can be queried using SQL (specifically
PostgreSQL). OpenPhillyData even offers an R package — rphl
— to facilitate queries, which can be installed using the
remotes package
(remotes::install_github(“CityOfPhiladelphia/rphl”).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(rphl)
## Loading required package: extrafont
## Registering fonts with R
We’ll begin by defining our query and submitting it via the
rphl package’s get_carto function. In this
query, we request counts of various types of crimes per year.
query_crime <- paste("SELECT Text_General_Code, COUNT(*) AS n,",
"DATE_PART('YEAR', Dispatch_Date_Time) AS Year",
"FROM incidents_part1_part2",
"GROUP BY Year, Text_General_Code",
"ORDER BY Year, n")
crime <- get_carto(query_crime, format = "csv") %>%
filter(year != 2023)
crime %>%
head(10) %>%
knitr::kable()
| text_general_code | n | year |
|---|---|---|
| Homicide - Justifiable | 2 | 2006 |
| Homicide - Criminal | 59 | 2006 |
| Receiving Stolen Property | 140 | 2006 |
| Gambling Violations | 180 | 2006 |
| Offenses Against Family and Children | 190 | 2006 |
| Public Drunkenness | 253 | 2006 |
| Homicide - Criminal | 324 | 2006 |
| Liquor Law Violations | 537 | 2006 |
| Embezzlement | 548 | 2006 |
| Arson | 622 | 2006 |
crime %>%
ggplot(aes(year, n, color = text_general_code)) +
geom_line() +
theme(legend.position = 'bottom')
In previewing our dataframe and generating a line plot, we see that
we have too many levels in the Text_General_Code field to
handle simultaneously. However, we can use tidyverse features to map in
additional data and simplify this factor in support of more streamlined
analysis.
We begin with another query that creates a mapping between
Text_General_Code and the UCR_General codes,
which are more consolidated.
query_mapping <- paste("SELECT UCR_General, Text_General_Code",
"FROM incidents_part1_part2",
"GROUP BY UCR_General, Text_General_Code",
"ORDER BY UCR_General")
mapping <- get_carto(query_mapping, format = "csv")
mapping %>%
arrange(ucr_general) %>%
head(10) %>%
knitr::kable()
| ucr_general | text_general_code |
|---|---|
| 100 | Homicide - Criminal |
| 100 | Homicide - Justifiable |
| 100 | Homicide - Gross Negligence |
| 100 | Homicide - Criminal |
| 200 | Rape |
| 300 | Robbery Firearm |
| 300 | Robbery No Firearm |
| 400 | Aggravated Assault No Firearm |
| 400 | Aggravated Assault Firearm |
| 500 | Burglary Non-Residential |
The UCR_General field still has a few too many levels,
so we’ll use the mutate function to generate a new field,
code_simple. Within mutate, we’ll leverage the
case_when function to manually determine how each of the 26
code categories should be reclassified.
mapping <- mapping %>%
mutate(code_simple = case_when(ucr_general == 100 ~ 'Homocide',
ucr_general == 200 ~ 'Rape',
ucr_general == 300 ~ 'Robbery/Burglary',
ucr_general == 400 ~ 'Assault',
ucr_general == 500 ~ 'Robbery/Burglary',
ucr_general == 600 ~ 'Theft',
ucr_general == 700 ~ 'Theft',
ucr_general == 800 ~ 'Assault',
ucr_general == 900 ~ 'Arson',
ucr_general == 1000 ~ 'Forgery/Fraud',
ucr_general == 1100 ~ 'Forgery/Fraud',
ucr_general == 1200 ~ 'Forgery/Fraud',
ucr_general == 1300 ~ 'Forgery/Fraud',
ucr_general == 1400 ~ 'Vandalism',
ucr_general == 1500 ~ 'Weapons',
ucr_general == 1600 ~ 'Prostitution/Vice',
ucr_general == 1700 ~ 'Prostitution/Vice',
ucr_general == 1800 ~ 'Narcotics',
ucr_general == 1900 ~ 'Prostitution/Vice',
ucr_general == 2000 ~ 'Other',
ucr_general == 2100 ~ 'DUI',
ucr_general == 2200 ~ 'Other',
ucr_general == 2300 ~ 'Other',
ucr_general == 2400 ~ 'Other',
ucr_general == 2500 ~ 'Other',
ucr_general == 2600 ~ 'Other'))
mapping %>% select(text_general_code) %>% unique() %>% nrow()
## [1] 33
mapping %>% select(code_simple) %>% unique() %>% nrow()
## [1] 13
We can see that we’ve reduced the number of levels in our crime
coding from 33 to 13. We’re now able to use the left_join
from dplyr to bring this simplified code field into our
crime dataframe. Following the join, we can use group_by
and summarize to consolidate the crime tallies under the
new, simplified classifications.
crime_simple <- crime %>%
left_join(
select(mapping, text_general_code, code_simple)) %>%
group_by(year, code_simple) %>%
summarize(count = sum(n), .groups = 'keep')
## Joining with `by = join_by(text_general_code)`
crime_simple %>%
head(10) %>%
knitr::kable()
| year | code_simple | count |
|---|---|---|
| 2006 | Arson | 622 |
| 2006 | Assault | 37261 |
| 2006 | DUI | 4698 |
| 2006 | Forgery/Fraud | 10699 |
| 2006 | Homocide | 385 |
| 2006 | Narcotics | 13696 |
| 2006 | Other | 55998 |
| 2006 | Prostitution/Vice | 3227 |
| 2006 | Rape | 1030 |
| 2006 | Robbery/Burglary | 21956 |
crime_simple %>%
ggplot(aes(year, count, color = code_simple)) +
geom_line() +
theme(legend.position = 'bottom') +
scale_y_continuous(labels = scales::comma)
Our plot is now much more interpretable. However, there still appears
to be a few too many levels to see the finer details, so it will be best
to simplify this field a bit further. We’ll again use
mutate to redefine the field, but instead of
case_when we can use a different function from the
forcats package: fct_collapse. After coercing
the character field into a factor, we can then create new levels,
listing out each of the constituent levels from the previous
classification. We’ll single out Homicide, then create another level for
other violent crimes, then collapse all other crimes into the
Non-Violent category.
crime_simpler <- crime_simple %>%
mutate(code_simple = as.factor(code_simple),
code_simple = factor(fct_collapse(
code_simple,
Homocide = c('Homocide'),
Rape_Assault_Burglary = c('Rape','Robbery/Burglary','Arson','Assault'),
other_level = 'Non-Violent')),
ordered = TRUE) %>%
group_by(year, code_simple) %>%
summarize(count = sum(count), .groups = 'keep') %>%
ungroup()
crime_simpler %>%
head(10) %>%
knitr::kable()
| year | code_simple | count |
|---|---|---|
| 2006 | Rape_Assault_Burglary | 60869 |
| 2006 | Non-Violent | 161992 |
| 2006 | Homocide | 385 |
| 2007 | Rape_Assault_Burglary | 58653 |
| 2007 | Non-Violent | 152336 |
| 2007 | Homocide | 382 |
| 2008 | Rape_Assault_Burglary | 59461 |
| 2008 | Non-Violent | 152107 |
| 2008 | Homocide | 328 |
| 2009 | Rape_Assault_Burglary | 54444 |
crime_simpler %>%
ggplot(aes(year, count)) +
geom_line() +
facet_grid(rows = vars(code_simple),
scales = 'free') +
scale_y_continuous(labels = scales::comma)
We now have a clean plot to facilitate analysis. The plot shows that both Non-Violent and Violent Crimes excluding homocide have steadily decreased since 2006, with slight upticks in 2022. The level of homicides, however, shows a different path. After some slight declines through ~2014, the number of homicides has steadily increased. The increase between 2019 and 2020 appears especially pronounced.
This raises a question. The plot above shows absolute counts of crimes, so it is unclear whether crime rates have increased. In other words, is the increase in homicide rates simply a reflection of an increasing population?
To answer this question, we’ll pull in one final dataset from
OpenDataPhilly, this one detailing the total population of Philadelphia
County. This dataset is available directly in .csv format, so we can use
the read_csv function from readr. We’ll also
use the filter function from dplyr to remove
unwanted rows, focusing only on total population counts each year.
pop <- read_csv(paste0('https://opendata.arcgis.com/api/v3/datasets/',
'd0ac67bb117b42f39614bad23525a13e_0/downloads/',
'data?format=csv&spatialRefId=4326'))
## Rows: 6906 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): GEOGRAPHY_NAME, GEOGRAPHY, SEX, RACE_ETHNICITY, AGE_CATEGORY, SOURCE
## dbl (3): OBJECTID, YEAR, COUNT_
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pop_annual <- pop %>%
filter(SEX == 'All sexes',
RACE_ETHNICITY == 'All races/ethnicities',
AGE_CATEGORY == 'All ages',
SOURCE == 'Annual County Resident Population Estimates')
pop_annual %>%
tail(10) %>%
knitr::kable()
| OBJECTID | YEAR | GEOGRAPHY_NAME | GEOGRAPHY | SEX | RACE_ETHNICITY | AGE_CATEGORY | COUNT_ | SOURCE |
|---|---|---|---|---|---|---|---|---|
| 151 | 2010 | Philadelphia | County | All sexes | All races/ethnicities | All ages | 1528306 | Annual County Resident Population Estimates |
| 166 | 2011 | Philadelphia | County | All sexes | All races/ethnicities | All ages | 1540466 | Annual County Resident Population Estimates |
| 181 | 2012 | Philadelphia | County | All sexes | All races/ethnicities | All ages | 1551824 | Annual County Resident Population Estimates |
| 196 | 2013 | Philadelphia | County | All sexes | All races/ethnicities | All ages | 1558313 | Annual County Resident Population Estimates |
| 211 | 2014 | Philadelphia | County | All sexes | All races/ethnicities | All ages | 1565460 | Annual County Resident Population Estimates |
| 226 | 2015 | Philadelphia | County | All sexes | All races/ethnicities | All ages | 1571065 | Annual County Resident Population Estimates |
| 241 | 2016 | Philadelphia | County | All sexes | All races/ethnicities | All ages | 1576051 | Annual County Resident Population Estimates |
| 256 | 2017 | Philadelphia | County | All sexes | All races/ethnicities | All ages | 1580601 | Annual County Resident Population Estimates |
| 271 | 2018 | Philadelphia | County | All sexes | All races/ethnicities | All ages | 1583592 | Annual County Resident Population Estimates |
| 286 | 2019 | Philadelphia | County | All sexes | All races/ethnicities | All ages | 1584064 | Annual County Resident Population Estimates |
As above, we’ll use the left_join function to map in the
population counts per year, and the mutate function to
calculate crime rates, defined as number of crimes divided by total
population. We’ll then generate a final plot to compare crimes total and
rates per category over time.
crime_simpler_pop <- crime_simpler %>%
left_join(select(pop_annual, YEAR, COUNT_),
by = c('year' = 'YEAR')) %>%
rename(population = COUNT_) %>%
fill(population, .direction = 'down') %>%
mutate(rate = count / population)
crime_simpler_pop %>%
pivot_longer(cols = c(count, rate),
names_to = 'measure') %>%
ggplot() +
geom_line(aes(year, value)) +
facet_wrap(measure ~ code_simple,
# rows = vars(code_simple),
# cols = vars(measure),
scales = 'free') +
scale_y_continuous(labels = scales::comma)
Surprisingly, the plots of crime counts and rates match almost exactly, indicating the crime overall has kept pace with increases in total population. On the one hand, this indicates that most crime is steadily decreasing. At the same time, homicide rates have continued to climb despite the decreases in other categories.
In conclusion, we see that the tidyverse offers a variety of tools to gather, clean and compile data efficiently and intuitively.