1 Introduction

This module extracts and describes the Raleigh Crime Incident data to be used for Model Evaluation. Crime incident data is organized by crime_category. Each observation has date, time, crime category and subcategory information, address, and geocoding information. We find that the crime_category of ASSAULT is the most frequent and that the geocoding information is sufficiently complete to warrant the dataset’s use in model building.

Visual exploratory analysis will be prepared in a separate module of this EDA.

library(tidyverse)
library(kableExtra)
library(sf)
library(fpp3)

1.1 Data Gathering Process

The crime incident data is furnished by the City of Raleigh, North Carolina in geojson format on its municipal open data portal. After download the most recent dataset, we load the geojson file and transform its coordinate system to a standard EPSG=2264 projection. That represents NAD83 (North Carolina) projection that is adopted for Wake County geospatial data files. There are over 350,000 incidents with 21 reported fields.

Because the dataset is updated daily and the contents are cumulative from June 2014, the file will grow in the future.

gj_crime = st_read( paste0(data_dir, "Raleigh_Police_Incidents_(NIBRS).geojson"))  %>% st_transform(2264)
## Reading layer `Raleigh_Police_Incidents_(NIBRS)' from data source `/Volumes/GDRIVE_SSD/homes/alex/datascience/698_CAPSTONE_2021_FALL/project_data/Raleigh_Police_Incidents_(NIBRS).geojson' using driver `GeoJSON'
## replacing null geometries with empty geometries
## Simple feature collection with 358732 features and 21 fields (with 92324 geometries empty)
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -78.9848 ymin: 35.56671 xmax: -78.32503 ymax: 36.00489
## Geodetic CRS:  WGS 84
gj_crime %>%  mutate( status = st_is_empty(geometry))  %>%   # identify that rows where geometry is empty or undefined.
  st_drop_geometry() %>%
  group_by(status, reported_year, crime_category ) %>% summarize( ct = n()) %>% 
  arrange(reported_year, crime_category, status ) %>%
  pivot_wider(names_from = status, values_from = ct ) %>% 
  mutate( total = `FALSE` + `TRUE`,   # Total incidents geocoded or not
          pct_valid = 100* `FALSE`/total     # geocoded valid incidents
          ) %>%
  rename(geocoded=`FALSE`, notgeocoded=`TRUE` ) -> gj_crime_na_status
## `summarise()` has grouped output by 'status', 'reported_year'. You can override using the `.groups` argument.

In the table below, we observe that ASSAULT comprises 15.9% of total crimes in the historical dataset. It represents over 42,000 observations from 2014 to 2021. This is a sufficiently rich dataset for predictive analysis.

gj_crime_na_status %>% 
  group_by(crime_category) %>% 
  summarize( cnt = sum(geocoded, na.rm = TRUE) ) %>%  
  mutate( freq = 100 * cnt/ sum(cnt, na.rm = TRUE) ) %>% 
  arrange( desc(freq )) %>% 
  kable(digits = 2 , caption = "Total Crimes Categories By Frequency June 2014-Aug 2021" ) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
Total Crimes Categories By Frequency June 2014-Aug 2021
crime_category cnt freq
ASSAULT 42305 15.88
LARCENY 36955 13.87
DRUGS 26580 9.98
VANDALISM 23149 8.69
LARCENY FROM MV 22274 8.36
FRAUD 22063 8.28
ALL OTHER OFFENSES 15940 5.98
DRUG VIOLATIONS 12709 4.77
BURGLARY/RESIDENTIAL 10564 3.97
TRAFFIC 9784 3.67
WEAPONS VIOLATION 8911 3.34
HUMANE 8632 3.24
MV THEFT 6679 2.51
ROBBERY 4276 1.61
DISORDERLY CONDUCT 3278 1.23
BURGLARY/COMMERCIAL 3068 1.15
STOLEN PROPERTY 2400 0.90
UNAUTHORIZED MOTOR VEHICLE USE 1689 0.63
SEX OFFENSES 1120 0.42
EMBEZZLEMENT 1059 0.40
LIQUOR LAW VIOLATIONS 1030 0.39
OBSCENE MATERIAL 393 0.15
ARSON 347 0.13
EXTORTION 314 0.12
KIDNAPPING 307 0.12
PROSTITUTION 291 0.11
MURDER 171 0.06
HUMAN TRAFFICKING 96 0.04
GAMBLING 12 0.00
BRIBERY 6 0.00
MISCELLANEOUS 6 0.00
JUVENILE 0 0.00
gj_crime_na_status %>% 
  select(reported_year, crime_category, total) %>%  
  pivot_wider(names_from = reported_year, values_from = total  ) %>% 
  arrange(crime_category) %>%
  kable(digits = 1, caption = "Count of all crimes by category across years" ) %>%
  kable_styling( bootstrap_options = c("hover", "striped"))
Count of all crimes by category across years
crime_category 2014 2015 2016 2017 2018 2019 2020 2021
ALL OTHER OFFENSES 1413 2619 2379 2338 2261 2482 1991 1243
ARSON 36 NA 40 NA NA NA 49 NA
ASSAULT 3460 6179 5946 5838 6318 6537 5901 4216
BRIBERY NA NA NA NA NA NA NA NA
BURGLARY/COMMERCIAL 229 NA 462 NA 450 277 495 240
BURGLARY/RESIDENTIAL 1265 1643 1758 1832 1628 1198 763 546
DISORDERLY CONDUCT 306 505 442 474 482 514 340 NA
DRUG VIOLATIONS 2355 1812 1533 1780 1845 1749 1034 768
DRUGS 948 4406 3753 4137 4298 4216 3060 2098
EMBEZZLEMENT NA 151 NA NA 174 NA NA NA
EXTORTION NA NA NA 75 62 NA NA NA
FRAUD 1531 2821 3366 3392 3075 3240 2901 1881
GAMBLING NA NA NA NA NA NA NA NA
HUMAN TRAFFICKING NA NA NA NA NA NA NA NA
HUMANE 554 1061 1131 1231 1237 1425 1283 770
JUVENILE NA NA NA NA NA NA NA NA
KIDNAPPING 20 NA NA 37 46 NA NA 32
LARCENY 3474 5412 5204 5117 4956 5525 4466 2975
LARCENY FROM MV 2239 2946 3158 3540 2954 2942 2861 1772
LIQUOR LAW VIOLATIONS NA 139 NA 175 208 200 82 NA
MISCELLANEOUS NA NA NA 10450 12098 NA NA 8173
MURDER NA NA NA NA NA NA 27 19
MV THEFT 548 826 931 970 837 916 1018 687
OBSCENE MATERIAL NA NA NA 27 NA NA 130 NA
PROSTITUTION NA NA NA NA NA NA NA NA
ROBBERY 466 675 653 616 562 549 496 304
SEX OFFENSES 370 636 638 602 609 635 543 432
STOLEN PROPERTY 225 283 315 349 333 372 327 229
TRAFFIC 1642 2266 1647 1230 1124 1051 758 562
UNAUTHORIZED MOTOR VEHICLE USE 116 233 253 217 261 224 256 NA
VANDALISM 2152 3686 3260 3259 2883 3042 3051 2031
WEAPONS VIOLATION 551 1044 1086 1276 1298 1297 1396 1061

The most important data field to check for completeness is the geometry. If the geometry is empty or undefined, the observation cannot be used. Luckily for most crime_categories, we observe geocoding is over 90% complete in all years except for SEX OFFENSES where the geocoding is less than 31% for all years. An important rule of thumb in the crime hotspot research literature is that a 85% geocoding completeness rate is the minimum acceptable threshold for usage.

gj_crime_na_status %>% 
  select(reported_year, crime_category, pct_valid ) %>%  
  pivot_wider(names_from = reported_year, values_from = pct_valid ) %>% 
  arrange(crime_category) %>%
  kable(digits = 1, caption = "% geocoded crimes by category across years" ) %>%
  kable_styling( bootstrap_options = c("hover", "striped"))
% geocoded crimes by category across years
crime_category 2014 2015 2016 2017 2018 2019 2020 2021
ALL OTHER OFFENSES 95.2 91.6 93.3 95.9 96.3 97.5 97.0 97.0
ARSON 97.2 NA 97.5 NA NA NA 98.0 NA
ASSAULT 95.3 94.8 94.2 94.2 95.7 96.1 96.2 95.9
BRIBERY NA NA NA NA NA NA NA NA
BURGLARY/COMMERCIAL 99.6 NA 99.8 NA 99.3 99.6 99.6 98.3
BURGLARY/RESIDENTIAL 99.1 99.3 99.6 99.5 99.6 99.6 99.5 97.4
DISORDERLY CONDUCT 99.0 98.8 99.3 99.8 99.6 99.0 99.4 NA
DRUG VIOLATIONS 98.5 98.7 98.8 98.7 99.2 98.3 99.0 98.4
DRUGS 99.1 98.8 98.7 99.0 99.2 98.7 98.2 98.1
EMBEZZLEMENT NA 98.7 NA NA 99.4 NA NA NA
EXTORTION NA NA NA 98.7 98.4 NA NA NA
FRAUD 99.3 99.5 99.1 99.6 99.5 99.3 99.3 99.2
GAMBLING NA NA NA NA NA NA NA NA
HUMAN TRAFFICKING NA NA NA NA NA NA NA NA
HUMANE 98.7 99.3 99.7 98.9 99.4 99.5 99.1 99.5
JUVENILE NA NA NA NA NA NA NA NA
KIDNAPPING 95.0 NA NA 97.3 97.8 NA NA 96.9
LARCENY 99.3 99.5 99.5 99.7 99.7 99.6 99.6 98.9
LARCENY FROM MV 99.3 99.4 99.4 99.7 99.4 99.5 99.3 98.9
LIQUOR LAW VIOLATIONS NA 98.6 NA 99.4 99.5 99.0 95.1 NA
MISCELLANEOUS NA NA NA 0.0 0.0 NA NA 0.0
MURDER NA NA NA NA NA NA 96.3 94.7
MV THEFT 98.9 98.9 99.1 99.2 99.4 99.5 99.0 99.6
OBSCENE MATERIAL NA NA NA 96.3 NA NA 98.5 NA
PROSTITUTION NA NA NA NA NA NA NA NA
ROBBERY 98.9 98.7 98.8 99.5 99.3 99.1 98.2 99.3
SEX OFFENSES 30.5 22.6 22.1 26.6 29.9 26.5 21.0 22.7
STOLEN PROPERTY 98.2 98.6 97.8 99.7 98.5 98.7 98.5 99.1
TRAFFIC 95.2 95.2 94.6 94.5 95.7 94.9 95.5 97.0
UNAUTHORIZED MOTOR VEHICLE USE 98.3 98.7 99.2 99.5 99.6 99.6 99.6 NA
VANDALISM 98.6 99.1 99.1 99.1 99.2 99.2 99.3 98.8
WEAPONS VIOLATION 99.3 98.9 99.0 98.8 99.2 98.7 98.9 98.8

2 Assaults

Our primary goal will be to analyze and predict assaults. Considering only assaults, we evaluate their count and completeness of geocoding by year and month. We observe in the table below that assault count and geocoding data quality is consistent through all years and regimes.

gj_crime %>% filter(crime_category == 'ASSAULT')  -> gj_assault
gj_assault %>% mutate( status = st_is_empty(geometry))  %>%   # identify that rows where geometry is empty or undefined.
  st_drop_geometry() %>%     # strip out the geometry when reporting 
  group_by(status, reported_year, reported_month) %>%   # grouped by status = TRUE mean missing geometry, FALSE means well-defined, year, month
  summarize(ct = n()) %>%    
  arrange(reported_year, reported_month, status) %>%   # sort by year, month and status
  pivot_wider(names_from = status, values_from = ct ) %>%   # pivot to get TRUE , FALSE as columns.  ROWS represent year, month,  data values are counts
  mutate( total = `FALSE` + `TRUE`,   # Total incidents geocoded or not
          pct_valid = 100* `FALSE`/total ,    # geocoded valid incidents
          month = formatC( reported_month, width=2, format="d", flag="0")   # change the month string to zero padded text for column sorting
          ) -> assault_na_status

gj_assault %>% filter( !st_is_empty(geometry) ) -> gj_assault_geocoded

The resulting table below shows the within month data completeness. For each year-month, we reported the number of incidents which have valid geocoding divided by total incidents in that period. The percentages belows show all periods exceed 90% data completeness. The median month completeness is 95.2%. The number of assaults range from 211 to 618 in the reporting period.

assault_na_status %>% 
  select(reported_year, month, pct_valid ) %>% 
  arrange(month, reported_year) %>% 
  pivot_wider(names_from = month, values_from = pct_valid ) %>% 
  arrange(reported_year) %>%
  kable(digits = 1, caption="Data Completeness % of Assaults within each Month/Year") %>% 
  kable_styling(bootstrap_options = c("hover", "striped"))
Data Completeness % of Assaults within each Month/Year
reported_year 01 02 03 04 05 06 07 08 09 10 11 12
2014 NA NA NA NA NA 99.1 99.3 94.5 93.4 93.8 93.7 94.2
2015 95.2 95.1 93.8 92.4 95.0 96.0 94.9 95.8 96.7 94.2 93.9 94.9
2016 94.2 92.7 90.3 94.4 93.9 93.8 97.9 96.1 94.6 93.4 95.2 94.1
2017 94.6 93.2 93.1 93.5 93.5 93.2 95.9 94.4 95.1 95.2 93.3 95.5
2018 95.2 94.7 94.2 96.1 95.5 94.5 95.2 96.2 95.5 96.7 97.8 96.9
2019 97.1 95.5 96.1 95.9 96.6 98.2 96.4 97.0 96.6 95.2 95.6 92.9
2020 94.1 92.8 97.2 95.6 96.5 97.2 97.3 95.3 96.8 97.9 96.3 96.0
2021 97.0 94.3 96.6 96.9 96.7 96.4 94.3 92.5 NA NA NA NA
assault_na_status %>% 
  rename( geocoded = `FALSE`) %>%
  select(reported_year, month, geocoded ) %>% 
  arrange(month, reported_year) %>% 
  pivot_wider(names_from = month, values_from = geocoded ) %>% 
  arrange(reported_year) %>%
  kable(digits = 1, caption="Count of Geocoded Assaults within each Month/Year") %>% 
  kable_styling(bootstrap_options = c("hover", "striped"))
Count of Geocoded Assaults within each Month/Year
reported_year 01 02 03 04 05 06 07 08 09 10 11 12
2014 NA NA NA NA NA 427 448 430 479 534 472 506
2015 497 428 500 485 566 528 464 519 495 467 428 482
2016 451 380 454 493 527 472 512 470 459 482 474 430
2017 458 372 416 460 514 504 462 435 502 498 430 449
2018 439 427 489 488 552 519 513 529 600 503 453 535
2019 527 426 489 560 592 546 534 575 517 536 498 483
2020 466 388 416 366 472 490 574 502 477 555 492 476
2021 480 396 571 566 609 618 591 211 NA NA NA NA

2.1 Visualizing Assault Frequency by Month

Clearly we see that monthly assaults have some seasonality. Since the most recent month is partial, the drop in number of assaults is expected. So we drop the most recent observation to evaluate trend and patterns.

assault_na_status %>% head( length(assault_na_status$total) - 1 ) %>%
  rename( geocoded = `FALSE`) %>%
  mutate( ym = yearmonth(sprintf("%d-%0d", reported_year, reported_month) ) ) %>% 
  as_tsibble( index = ym ) ->ts_assaults_monthly

ts_assaults_monthly %>% autoplot(total) + labs(title = "Monthly Assaults in Raleigh", subtitle = "June 2014-Aug 2021")

We observe some seasonality in the monthly observations.

ts_assaults_monthly  %>%  model(STL(geocoded )) %>% components() %>% autoplot()

ts_assaults_monthly %>% as_tibble() %>% select(ym, geocoded, total, pct_valid) %>% as_tsibble() %>% gg_season()
## Using `ym` as index variable.
## Plot variable not specified, automatically selected `y = geocoded`

2.2 Saving Assaults Data

# Note that sf guesses the format of the output file from the suffix.

st_write(gj_assault_geocoded ,  dsn = paste0(data_dir, "EXP_EDA_RALEIGH_assaults_Aug2021.geojson"), delete_dsn = TRUE)
## Deleting source `/Volumes/GDRIVE_SSD/homes/alex/datascience/698_CAPSTONE_2021_FALL/project_data/EXP_EDA_RALEIGH_assaults_Aug2021.geojson' using driver `GeoJSON'
## Writing layer `EXP_EDA_RALEIGH_assaults_Aug2021' to data source `/Volumes/GDRIVE_SSD/homes/alex/datascience/698_CAPSTONE_2021_FALL/project_data/EXP_EDA_RALEIGH_assaults_Aug2021.geojson' using driver `GeoJSON'
## Writing 42305 features with 21 fields and geometry type Point.
st_write(gj_assault_geocoded %>% filter(reported_year == 2018, reported_month == 1),  dsn = paste0(data_dir, "EXP_RALEIGH_SUBSET_201801_assaults_Aug2021.geojson"), delete_dsn = TRUE)
## Deleting source `/Volumes/GDRIVE_SSD/homes/alex/datascience/698_CAPSTONE_2021_FALL/project_data/EXP_RALEIGH_SUBSET_201801_assaults_Aug2021.geojson' using driver `GeoJSON'
## Writing layer `EXP_RALEIGH_SUBSET_201801_assaults_Aug2021' to data source `/Volumes/GDRIVE_SSD/homes/alex/datascience/698_CAPSTONE_2021_FALL/project_data/EXP_RALEIGH_SUBSET_201801_assaults_Aug2021.geojson' using driver `GeoJSON'
## Writing 439 features with 21 fields and geometry type Point.