This module extracts and describes the Raleigh Crime Incident data to be used for Model Evaluation. Crime incident data is organized by crime_category. Each observation has date, time, crime category and subcategory information, address, and geocoding information. We find that the crime_category of ASSAULT is the most frequent and that the geocoding information is sufficiently complete to warrant the dataset’s use in model building.
Visual exploratory analysis will be prepared in a separate module of this EDA.
library(tidyverse)
library(kableExtra)
library(sf)
library(fpp3)The crime incident data is furnished by the City of Raleigh, North Carolina in geojson format on its municipal open data portal. After download the most recent dataset, we load the geojson file and transform its coordinate system to a standard EPSG=2264 projection. That represents NAD83 (North Carolina) projection that is adopted for Wake County geospatial data files. There are over 350,000 incidents with 21 reported fields.
Because the dataset is updated daily and the contents are cumulative from June 2014, the file will grow in the future.
gj_crime = st_read( paste0(data_dir, "Raleigh_Police_Incidents_(NIBRS).geojson")) %>% st_transform(2264)## Reading layer `Raleigh_Police_Incidents_(NIBRS)' from data source `/Volumes/GDRIVE_SSD/homes/alex/datascience/698_CAPSTONE_2021_FALL/project_data/Raleigh_Police_Incidents_(NIBRS).geojson' using driver `GeoJSON'
## replacing null geometries with empty geometries
## Simple feature collection with 358732 features and 21 fields (with 92324 geometries empty)
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -78.9848 ymin: 35.56671 xmax: -78.32503 ymax: 36.00489
## Geodetic CRS: WGS 84
gj_crime %>% mutate( status = st_is_empty(geometry)) %>% # identify that rows where geometry is empty or undefined.
st_drop_geometry() %>%
group_by(status, reported_year, crime_category ) %>% summarize( ct = n()) %>%
arrange(reported_year, crime_category, status ) %>%
pivot_wider(names_from = status, values_from = ct ) %>%
mutate( total = `FALSE` + `TRUE`, # Total incidents geocoded or not
pct_valid = 100* `FALSE`/total # geocoded valid incidents
) %>%
rename(geocoded=`FALSE`, notgeocoded=`TRUE` ) -> gj_crime_na_status## `summarise()` has grouped output by 'status', 'reported_year'. You can override using the `.groups` argument.
In the table below, we observe that ASSAULT comprises 15.9% of total crimes in the historical dataset. It represents over 42,000 observations from 2014 to 2021. This is a sufficiently rich dataset for predictive analysis.
gj_crime_na_status %>%
group_by(crime_category) %>%
summarize( cnt = sum(geocoded, na.rm = TRUE) ) %>%
mutate( freq = 100 * cnt/ sum(cnt, na.rm = TRUE) ) %>%
arrange( desc(freq )) %>%
kable(digits = 2 , caption = "Total Crimes Categories By Frequency June 2014-Aug 2021" ) %>%
kable_styling(bootstrap_options = c("hover", "striped"))| crime_category | cnt | freq |
|---|---|---|
| ASSAULT | 42305 | 15.88 |
| LARCENY | 36955 | 13.87 |
| DRUGS | 26580 | 9.98 |
| VANDALISM | 23149 | 8.69 |
| LARCENY FROM MV | 22274 | 8.36 |
| FRAUD | 22063 | 8.28 |
| ALL OTHER OFFENSES | 15940 | 5.98 |
| DRUG VIOLATIONS | 12709 | 4.77 |
| BURGLARY/RESIDENTIAL | 10564 | 3.97 |
| TRAFFIC | 9784 | 3.67 |
| WEAPONS VIOLATION | 8911 | 3.34 |
| HUMANE | 8632 | 3.24 |
| MV THEFT | 6679 | 2.51 |
| ROBBERY | 4276 | 1.61 |
| DISORDERLY CONDUCT | 3278 | 1.23 |
| BURGLARY/COMMERCIAL | 3068 | 1.15 |
| STOLEN PROPERTY | 2400 | 0.90 |
| UNAUTHORIZED MOTOR VEHICLE USE | 1689 | 0.63 |
| SEX OFFENSES | 1120 | 0.42 |
| EMBEZZLEMENT | 1059 | 0.40 |
| LIQUOR LAW VIOLATIONS | 1030 | 0.39 |
| OBSCENE MATERIAL | 393 | 0.15 |
| ARSON | 347 | 0.13 |
| EXTORTION | 314 | 0.12 |
| KIDNAPPING | 307 | 0.12 |
| PROSTITUTION | 291 | 0.11 |
| MURDER | 171 | 0.06 |
| HUMAN TRAFFICKING | 96 | 0.04 |
| GAMBLING | 12 | 0.00 |
| BRIBERY | 6 | 0.00 |
| MISCELLANEOUS | 6 | 0.00 |
| JUVENILE | 0 | 0.00 |
gj_crime_na_status %>%
select(reported_year, crime_category, total) %>%
pivot_wider(names_from = reported_year, values_from = total ) %>%
arrange(crime_category) %>%
kable(digits = 1, caption = "Count of all crimes by category across years" ) %>%
kable_styling( bootstrap_options = c("hover", "striped"))| crime_category | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
|---|---|---|---|---|---|---|---|---|
| ALL OTHER OFFENSES | 1413 | 2619 | 2379 | 2338 | 2261 | 2482 | 1991 | 1243 |
| ARSON | 36 | NA | 40 | NA | NA | NA | 49 | NA |
| ASSAULT | 3460 | 6179 | 5946 | 5838 | 6318 | 6537 | 5901 | 4216 |
| BRIBERY | NA | NA | NA | NA | NA | NA | NA | NA |
| BURGLARY/COMMERCIAL | 229 | NA | 462 | NA | 450 | 277 | 495 | 240 |
| BURGLARY/RESIDENTIAL | 1265 | 1643 | 1758 | 1832 | 1628 | 1198 | 763 | 546 |
| DISORDERLY CONDUCT | 306 | 505 | 442 | 474 | 482 | 514 | 340 | NA |
| DRUG VIOLATIONS | 2355 | 1812 | 1533 | 1780 | 1845 | 1749 | 1034 | 768 |
| DRUGS | 948 | 4406 | 3753 | 4137 | 4298 | 4216 | 3060 | 2098 |
| EMBEZZLEMENT | NA | 151 | NA | NA | 174 | NA | NA | NA |
| EXTORTION | NA | NA | NA | 75 | 62 | NA | NA | NA |
| FRAUD | 1531 | 2821 | 3366 | 3392 | 3075 | 3240 | 2901 | 1881 |
| GAMBLING | NA | NA | NA | NA | NA | NA | NA | NA |
| HUMAN TRAFFICKING | NA | NA | NA | NA | NA | NA | NA | NA |
| HUMANE | 554 | 1061 | 1131 | 1231 | 1237 | 1425 | 1283 | 770 |
| JUVENILE | NA | NA | NA | NA | NA | NA | NA | NA |
| KIDNAPPING | 20 | NA | NA | 37 | 46 | NA | NA | 32 |
| LARCENY | 3474 | 5412 | 5204 | 5117 | 4956 | 5525 | 4466 | 2975 |
| LARCENY FROM MV | 2239 | 2946 | 3158 | 3540 | 2954 | 2942 | 2861 | 1772 |
| LIQUOR LAW VIOLATIONS | NA | 139 | NA | 175 | 208 | 200 | 82 | NA |
| MISCELLANEOUS | NA | NA | NA | 10450 | 12098 | NA | NA | 8173 |
| MURDER | NA | NA | NA | NA | NA | NA | 27 | 19 |
| MV THEFT | 548 | 826 | 931 | 970 | 837 | 916 | 1018 | 687 |
| OBSCENE MATERIAL | NA | NA | NA | 27 | NA | NA | 130 | NA |
| PROSTITUTION | NA | NA | NA | NA | NA | NA | NA | NA |
| ROBBERY | 466 | 675 | 653 | 616 | 562 | 549 | 496 | 304 |
| SEX OFFENSES | 370 | 636 | 638 | 602 | 609 | 635 | 543 | 432 |
| STOLEN PROPERTY | 225 | 283 | 315 | 349 | 333 | 372 | 327 | 229 |
| TRAFFIC | 1642 | 2266 | 1647 | 1230 | 1124 | 1051 | 758 | 562 |
| UNAUTHORIZED MOTOR VEHICLE USE | 116 | 233 | 253 | 217 | 261 | 224 | 256 | NA |
| VANDALISM | 2152 | 3686 | 3260 | 3259 | 2883 | 3042 | 3051 | 2031 |
| WEAPONS VIOLATION | 551 | 1044 | 1086 | 1276 | 1298 | 1297 | 1396 | 1061 |
The most important data field to check for completeness is the geometry. If the geometry is empty or undefined, the observation cannot be used. Luckily for most crime_categories, we observe geocoding is over 90% complete in all years except for SEX OFFENSES where the geocoding is less than 31% for all years. An important rule of thumb in the crime hotspot research literature is that a 85% geocoding completeness rate is the minimum acceptable threshold for usage.
gj_crime_na_status %>%
select(reported_year, crime_category, pct_valid ) %>%
pivot_wider(names_from = reported_year, values_from = pct_valid ) %>%
arrange(crime_category) %>%
kable(digits = 1, caption = "% geocoded crimes by category across years" ) %>%
kable_styling( bootstrap_options = c("hover", "striped"))| crime_category | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
|---|---|---|---|---|---|---|---|---|
| ALL OTHER OFFENSES | 95.2 | 91.6 | 93.3 | 95.9 | 96.3 | 97.5 | 97.0 | 97.0 |
| ARSON | 97.2 | NA | 97.5 | NA | NA | NA | 98.0 | NA |
| ASSAULT | 95.3 | 94.8 | 94.2 | 94.2 | 95.7 | 96.1 | 96.2 | 95.9 |
| BRIBERY | NA | NA | NA | NA | NA | NA | NA | NA |
| BURGLARY/COMMERCIAL | 99.6 | NA | 99.8 | NA | 99.3 | 99.6 | 99.6 | 98.3 |
| BURGLARY/RESIDENTIAL | 99.1 | 99.3 | 99.6 | 99.5 | 99.6 | 99.6 | 99.5 | 97.4 |
| DISORDERLY CONDUCT | 99.0 | 98.8 | 99.3 | 99.8 | 99.6 | 99.0 | 99.4 | NA |
| DRUG VIOLATIONS | 98.5 | 98.7 | 98.8 | 98.7 | 99.2 | 98.3 | 99.0 | 98.4 |
| DRUGS | 99.1 | 98.8 | 98.7 | 99.0 | 99.2 | 98.7 | 98.2 | 98.1 |
| EMBEZZLEMENT | NA | 98.7 | NA | NA | 99.4 | NA | NA | NA |
| EXTORTION | NA | NA | NA | 98.7 | 98.4 | NA | NA | NA |
| FRAUD | 99.3 | 99.5 | 99.1 | 99.6 | 99.5 | 99.3 | 99.3 | 99.2 |
| GAMBLING | NA | NA | NA | NA | NA | NA | NA | NA |
| HUMAN TRAFFICKING | NA | NA | NA | NA | NA | NA | NA | NA |
| HUMANE | 98.7 | 99.3 | 99.7 | 98.9 | 99.4 | 99.5 | 99.1 | 99.5 |
| JUVENILE | NA | NA | NA | NA | NA | NA | NA | NA |
| KIDNAPPING | 95.0 | NA | NA | 97.3 | 97.8 | NA | NA | 96.9 |
| LARCENY | 99.3 | 99.5 | 99.5 | 99.7 | 99.7 | 99.6 | 99.6 | 98.9 |
| LARCENY FROM MV | 99.3 | 99.4 | 99.4 | 99.7 | 99.4 | 99.5 | 99.3 | 98.9 |
| LIQUOR LAW VIOLATIONS | NA | 98.6 | NA | 99.4 | 99.5 | 99.0 | 95.1 | NA |
| MISCELLANEOUS | NA | NA | NA | 0.0 | 0.0 | NA | NA | 0.0 |
| MURDER | NA | NA | NA | NA | NA | NA | 96.3 | 94.7 |
| MV THEFT | 98.9 | 98.9 | 99.1 | 99.2 | 99.4 | 99.5 | 99.0 | 99.6 |
| OBSCENE MATERIAL | NA | NA | NA | 96.3 | NA | NA | 98.5 | NA |
| PROSTITUTION | NA | NA | NA | NA | NA | NA | NA | NA |
| ROBBERY | 98.9 | 98.7 | 98.8 | 99.5 | 99.3 | 99.1 | 98.2 | 99.3 |
| SEX OFFENSES | 30.5 | 22.6 | 22.1 | 26.6 | 29.9 | 26.5 | 21.0 | 22.7 |
| STOLEN PROPERTY | 98.2 | 98.6 | 97.8 | 99.7 | 98.5 | 98.7 | 98.5 | 99.1 |
| TRAFFIC | 95.2 | 95.2 | 94.6 | 94.5 | 95.7 | 94.9 | 95.5 | 97.0 |
| UNAUTHORIZED MOTOR VEHICLE USE | 98.3 | 98.7 | 99.2 | 99.5 | 99.6 | 99.6 | 99.6 | NA |
| VANDALISM | 98.6 | 99.1 | 99.1 | 99.1 | 99.2 | 99.2 | 99.3 | 98.8 |
| WEAPONS VIOLATION | 99.3 | 98.9 | 99.0 | 98.8 | 99.2 | 98.7 | 98.9 | 98.8 |
Our primary goal will be to analyze and predict assaults. Considering only assaults, we evaluate their count and completeness of geocoding by year and month. We observe in the table below that assault count and geocoding data quality is consistent through all years and regimes.
gj_crime %>% filter(crime_category == 'ASSAULT') -> gj_assaultgj_assault %>% mutate( status = st_is_empty(geometry)) %>% # identify that rows where geometry is empty or undefined.
st_drop_geometry() %>% # strip out the geometry when reporting
group_by(status, reported_year, reported_month) %>% # grouped by status = TRUE mean missing geometry, FALSE means well-defined, year, month
summarize(ct = n()) %>%
arrange(reported_year, reported_month, status) %>% # sort by year, month and status
pivot_wider(names_from = status, values_from = ct ) %>% # pivot to get TRUE , FALSE as columns. ROWS represent year, month, data values are counts
mutate( total = `FALSE` + `TRUE`, # Total incidents geocoded or not
pct_valid = 100* `FALSE`/total , # geocoded valid incidents
month = formatC( reported_month, width=2, format="d", flag="0") # change the month string to zero padded text for column sorting
) -> assault_na_status
gj_assault %>% filter( !st_is_empty(geometry) ) -> gj_assault_geocodedThe resulting table below shows the within month data completeness. For each year-month, we reported the number of incidents which have valid geocoding divided by total incidents in that period. The percentages belows show all periods exceed 90% data completeness. The median month completeness is 95.2%. The number of assaults range from 211 to 618 in the reporting period.
assault_na_status %>%
select(reported_year, month, pct_valid ) %>%
arrange(month, reported_year) %>%
pivot_wider(names_from = month, values_from = pct_valid ) %>%
arrange(reported_year) %>%
kable(digits = 1, caption="Data Completeness % of Assaults within each Month/Year") %>%
kable_styling(bootstrap_options = c("hover", "striped"))| reported_year | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | NA | NA | NA | NA | NA | 99.1 | 99.3 | 94.5 | 93.4 | 93.8 | 93.7 | 94.2 |
| 2015 | 95.2 | 95.1 | 93.8 | 92.4 | 95.0 | 96.0 | 94.9 | 95.8 | 96.7 | 94.2 | 93.9 | 94.9 |
| 2016 | 94.2 | 92.7 | 90.3 | 94.4 | 93.9 | 93.8 | 97.9 | 96.1 | 94.6 | 93.4 | 95.2 | 94.1 |
| 2017 | 94.6 | 93.2 | 93.1 | 93.5 | 93.5 | 93.2 | 95.9 | 94.4 | 95.1 | 95.2 | 93.3 | 95.5 |
| 2018 | 95.2 | 94.7 | 94.2 | 96.1 | 95.5 | 94.5 | 95.2 | 96.2 | 95.5 | 96.7 | 97.8 | 96.9 |
| 2019 | 97.1 | 95.5 | 96.1 | 95.9 | 96.6 | 98.2 | 96.4 | 97.0 | 96.6 | 95.2 | 95.6 | 92.9 |
| 2020 | 94.1 | 92.8 | 97.2 | 95.6 | 96.5 | 97.2 | 97.3 | 95.3 | 96.8 | 97.9 | 96.3 | 96.0 |
| 2021 | 97.0 | 94.3 | 96.6 | 96.9 | 96.7 | 96.4 | 94.3 | 92.5 | NA | NA | NA | NA |
assault_na_status %>%
rename( geocoded = `FALSE`) %>%
select(reported_year, month, geocoded ) %>%
arrange(month, reported_year) %>%
pivot_wider(names_from = month, values_from = geocoded ) %>%
arrange(reported_year) %>%
kable(digits = 1, caption="Count of Geocoded Assaults within each Month/Year") %>%
kable_styling(bootstrap_options = c("hover", "striped"))| reported_year | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | NA | NA | NA | NA | NA | 427 | 448 | 430 | 479 | 534 | 472 | 506 |
| 2015 | 497 | 428 | 500 | 485 | 566 | 528 | 464 | 519 | 495 | 467 | 428 | 482 |
| 2016 | 451 | 380 | 454 | 493 | 527 | 472 | 512 | 470 | 459 | 482 | 474 | 430 |
| 2017 | 458 | 372 | 416 | 460 | 514 | 504 | 462 | 435 | 502 | 498 | 430 | 449 |
| 2018 | 439 | 427 | 489 | 488 | 552 | 519 | 513 | 529 | 600 | 503 | 453 | 535 |
| 2019 | 527 | 426 | 489 | 560 | 592 | 546 | 534 | 575 | 517 | 536 | 498 | 483 |
| 2020 | 466 | 388 | 416 | 366 | 472 | 490 | 574 | 502 | 477 | 555 | 492 | 476 |
| 2021 | 480 | 396 | 571 | 566 | 609 | 618 | 591 | 211 | NA | NA | NA | NA |
Clearly we see that monthly assaults have some seasonality. Since the most recent month is partial, the drop in number of assaults is expected. So we drop the most recent observation to evaluate trend and patterns.
assault_na_status %>% head( length(assault_na_status$total) - 1 ) %>%
rename( geocoded = `FALSE`) %>%
mutate( ym = yearmonth(sprintf("%d-%0d", reported_year, reported_month) ) ) %>%
as_tsibble( index = ym ) ->ts_assaults_monthly
ts_assaults_monthly %>% autoplot(total) + labs(title = "Monthly Assaults in Raleigh", subtitle = "June 2014-Aug 2021")We observe some seasonality in the monthly observations.
ts_assaults_monthly %>% model(STL(geocoded )) %>% components() %>% autoplot()ts_assaults_monthly %>% as_tibble() %>% select(ym, geocoded, total, pct_valid) %>% as_tsibble() %>% gg_season()## Using `ym` as index variable.
## Plot variable not specified, automatically selected `y = geocoded`
# Note that sf guesses the format of the output file from the suffix.
st_write(gj_assault_geocoded , dsn = paste0(data_dir, "EXP_EDA_RALEIGH_assaults_Aug2021.geojson"), delete_dsn = TRUE)## Deleting source `/Volumes/GDRIVE_SSD/homes/alex/datascience/698_CAPSTONE_2021_FALL/project_data/EXP_EDA_RALEIGH_assaults_Aug2021.geojson' using driver `GeoJSON'
## Writing layer `EXP_EDA_RALEIGH_assaults_Aug2021' to data source `/Volumes/GDRIVE_SSD/homes/alex/datascience/698_CAPSTONE_2021_FALL/project_data/EXP_EDA_RALEIGH_assaults_Aug2021.geojson' using driver `GeoJSON'
## Writing 42305 features with 21 fields and geometry type Point.
st_write(gj_assault_geocoded %>% filter(reported_year == 2018, reported_month == 1), dsn = paste0(data_dir, "EXP_RALEIGH_SUBSET_201801_assaults_Aug2021.geojson"), delete_dsn = TRUE)## Deleting source `/Volumes/GDRIVE_SSD/homes/alex/datascience/698_CAPSTONE_2021_FALL/project_data/EXP_RALEIGH_SUBSET_201801_assaults_Aug2021.geojson' using driver `GeoJSON'
## Writing layer `EXP_RALEIGH_SUBSET_201801_assaults_Aug2021' to data source `/Volumes/GDRIVE_SSD/homes/alex/datascience/698_CAPSTONE_2021_FALL/project_data/EXP_RALEIGH_SUBSET_201801_assaults_Aug2021.geojson' using driver `GeoJSON'
## Writing 439 features with 21 fields and geometry type Point.