This document is also published at RPubs website

An interactive application for data visualization is hosted at Shiny.io website

In this document, I’m exploring and modeling the city of Seattle GIS data. The data is available at https://data-seattlecitygis.opendata.arcgis.com/search.

Summary

I used three datasets from the City of Seattle GIS data. Those are
1. traffic flows
2. streets
3. collisions.
I started by exploring the 3 sets separately, and then I merged them together using their geo-spatial proximity.
I then used R’s Tidymodels framework for doing unsupervised learning, with the aim of further exploring the data and finding patterns. I used two techniques, those are:
1. PCA for identifying linear relationships.
2. UMAP for identifying non-linear relationships.
Next I modeled the severity of a collision. That’s a multi-class classification problem. I explored various models (XGBoost, Random Forest & Linear Regression). I selected the best model and performed hyper parameter tuning using a racing technique.
Lastly, I used various techniques to interpret the resulting model. Those are:
1. Feature Importance for global interpretability.
2. LIME & SHAP for local interpretability.

Downloading the data

The date can be downloaded via several methods
1. REST API
2. Text .csv files
3. GIS .shp files

I chose to work with the GIS Shape files as they contain the geometries (i.e. geo-locations and shapes of the data), in contrast to the .csv files which don’t contain the geometries.

Traffic Flow

The website has traffic flow data sets starting from 2007 till 2018. The date for the years 2016, 2017 & 2018 are more comprehensive and has more features, so I will use only those latest three years.

I start by reading the data and skimming it.

data_folder <- "SeattleTraffic/data"
flows <- read_sf(data_folder, "2018_Traffic_Flow_Counts") %>% 
  mutate(across(where(~ is.character(.x) & length(unique(.x)) < 10), as.factor))
skim_without_charts(flows)
Data summary
Name flows
Number of rows 1906
Number of columns 15
_______________________
Column type frequency:
character 3
Date 1
factor 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
COMPKEY 14 0.99 4 143 0 1892 0
STNAME_ORD 0 1.00 6 84 0 662 0
geometry 0 1.00 78 47618 0 1906 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
START_DATE 0 1 1970-01-01 2018-12-01 2015-11-17 357

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
DOWNTOWN 0 1.00 FALSE 3 N: 1491, Y: 351, B: 64
DATAQUALIT 0 1.00 FALSE 4 Est: 494, Stu: 492, Stu: 492, Stu: 428
FLAGS 1882 0.01 FALSE 2 Com: 15, WSD: 9

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
OBJECTID 0 1.00 961.66 555.39 1.00 482.25 962.50 1442.75 1923.00
FLOWSEGID 0 1.00 1034.33 579.72 3.00 544.25 1031.50 1538.75 3000.00
AMPK 1001 0.47 703.82 551.59 29.00 324.00 554.00 926.00 4569.00
PMPK 1001 0.47 774.95 583.42 39.00 362.00 624.00 1026.00 4756.00
AWDT 2 1.00 12120.34 10144.93 387.00 5000.00 9428.50 16000.00 108179.00
ADT 482 0.75 10473.26 9078.04 0.00 4307.25 8152.00 13655.00 102555.00
AWDT_ROUND 0 1.00 12144.28 10106.11 400.00 5000.00 9400.00 16000.00 98700.00
SHAPE_Leng 0 1.00 1321.33 1365.73 48.54 357.50 830.32 1759.56 13076.44

This is a simple dataset. It has 1906 rows & 15 columns. Each row observation is the traffic flow counts from a street segment. The data is aggregated over 1 year, i.e. one measurement per year.

Data Exploration

The data has the below variables:
##### Identifer Variables
1. OBJECTID is a unique identifier for the measured segment.
2. COMPKEY is the street segment number.
3. STNAME_ORD is the street segment name.
4. FLOWSEGID is an ID for the segment flow.

Categorical Variables
  1. DOWNTOWN Flag whether segment is within the downtown corridor.
  2. DATAQUALITY Specifies if the data is generated from an estimate or from a study. 25% of the data is an estimate.
  3. FLAGS Flag for data from SDOT, WSDOT, or a combination of both. SDOT stands for Seattle city Department of Transportation, WSDOT stands for Washington state Department of Transportation. More than 99% of this data is missing.
Numerical Variables
  1. AMPK is peak traffic in the AM hours.
  2. PMPK is the peak traffic in the PM hours.
  3. AWDT is the average weekday traffic, in thousands of vesicles.
  4. ADT is the average day traffic, in thousands of vesicles.
  5. AWDT_ROUNDED is just rounding AWDT to the nearest ’00.
  6. SHAPE_Length is the length of the street segment.
Other Variables
  1. START_DATE is the date on which the data count started. Please note that this is not a time series data; there is one reading for every street segment.
  2. geometry is the geometry of the street segments. This column is not included in the csv file. I use this column for plotting the streets.

Below is simple geographical plot for AWDT_ROUND, which stands for Average Weekday Traffic (I chose this for plotting because it is least variable with missing data)

tm_shape(flows) +
  tm_lines(col = "AWDT_ROUND", lwd = 5, palette = "Blues", alpha = 1)

Summary

This data set is collected from the the traffic cameras feeds. The data set is small and has many estimated (25%), rather than measured data.

Collisions Dataset

As the above data is small (in terms of rows & number of features), I complement it with a more comprehensive data set that lists the vehicle collisions in Seattle. It is also available at the city of Seattle GIS data website.

# read the data
collisions <- read_sf(data_folder, "Collisions") %>% 
  mutate(across(where(~ is.character(.x) & length(unique(.x)) < 70), as.factor))

A very basic cleaning step for bi-level categorical (Y/N) variables:
a. Replace 1 with Y and 0 with N in UNDERINFL.
b. Replace NA with N in INATTENTIO, UNDERINFL, PEDROWNOTG, SPEEDING & HITPARKEDC.
c. Filter dates larger than 01-Jan-2004 as preceding observations were very few and not complete.

collisions_cleaned <- collisions %>%
  mutate(
    UNDERINFL = case_when(
      UNDERINFL == 1 ~ "Y",
      UNDERINFL == 0 ~ "N",
      TRUE ~ as.character(UNDERINFL)) %>% 
      as.factor(),
    across(all_of(c("INATTENTIO", "UNDERINFL", "PEDROWNOTG", "SPEEDING", "HITPARKEDC")),
                  ~replace_na(as.character(.x), "N") %>% as.factor())) %>% 
  filter(INCDATE >= "2004-01-01")

Skim the data for preliminary exploration

collisions_cleaned %>% 
  st_drop_geometry() %>% 
  skim()
Data summary
Name Piped data
Number of rows 223673
Number of columns 38
_______________________
Column type frequency:
character 4
Date 1
factor 20
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
REPORTNO 0 1.00 1 9 0 223670 0
LOCATION 4617 0.98 17 90 0 25276 0
INCDTTM 0 1.00 8 22 0 170926 0
SDOTCOLNUM 96469 0.57 7 8 0 127185 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
INCDATE 0 1 2004-01-01 2020-12-30 2011-10-26 6209

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
STATUS 0 1.00 FALSE 2 Mat: 196523, Unm: 27150
ADDRTYPE 3736 0.98 FALSE 3 Blo: 146306, Int: 72750, All: 881
EXCEPTRSNC 211786 0.05 FALSE 1 NEI: 11887
EXCEPTRSND 211786 0.05 FALSE 1 Not: 11887
SEVERITYCO 1 1.00 FALSE 5 1: 138828, 2: 59334, 0: 22001, 2b: 3151
SEVERITYDE 0 1.00 FALSE 5 Pro: 138828, Inj: 59334, Unk: 22002, Ser: 3151
COLLISIONT 27222 0.88 FALSE 10 Par: 48740, Ang: 35849, Rea: 34907, Oth: 24781
JUNCTIONTY 12041 0.95 FALSE 7 Mid: 102909, At : 70037, Mid: 24466, Dri: 11506
SDOT_COLCO 0 1.00 FALSE 40 11: 93018, 14: 59922, 00: 19278, 16: 10998
SDOT_COLDE 0 1.00 FALSE 40 MOT: 93018, MOT: 59922, NOT: 19278, MOT: 10998
INATTENTIO 0 1.00 FALSE 2 N: 193478, Y: 30195
UNDERINFL 0 1.00 FALSE 2 N: 214044, Y: 9629
WEATHER 27413 0.88 FALSE 12 Cle: 115582, Rai: 34166, Ove: 28742, Unk: 15131
ROADCOND 27333 0.88 FALSE 9 Dry: 129587, Wet: 48949, Unk: 15161, Ice: 1233
LIGHTCOND 27504 0.88 FALSE 9 Day: 120312, Dar: 50463, Unk: 13549, Dus: 6121
PEDROWNOTG 0 1.00 FALSE 2 N: 218450, Y: 5223
SPEEDING 0 1.00 FALSE 2 N: 213658, Y: 10015
ST_COLCODE 27222 0.88 FALSE 62 32: 45100, 10: 35849, 14: 26353, 50: 14316
ST_COLDESC 27222 0.88 FALSE 62 One: 45100, Ent: 35849, Fro: 26353, Fix: 14316
HITPARKEDC 0 1.00 FALSE 2 N: 211132, Y: 12541

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
OBJECTID 0 1.00 111837.99 64569.00 1 55920 111838 167756 223674 ▇▇▇▇▇
INCKEY 0 1.00 146654.68 90716.54 1001 72205 128328 212390 336687 ▆▇▅▃▃
COLDETKEY 0 1.00 146895.90 91107.81 1001 72205 128328 212670 338187 ▆▇▅▃▃
INTKEY 150923 0.33 37695.91 52206.87 23807 28655 29973 34008 764413 ▇▁▁▁▁
PERSONCOUN 0 1.00 2.22 1.47 0 2 2 3 93 ▇▁▁▁▁
PEDCOUNT 0 1.00 0.04 0.20 0 0 0 0 6 ▇▁▁▁▁
PEDCYLCOUN 0 1.00 0.03 0.16 0 0 0 0 2 ▇▁▁▁▁
VEHCOUNT 0 1.00 1.72 0.83 0 2 2 2 15 ▇▁▁▁▁
INJURIES 0 1.00 0.37 0.73 0 0 0 1 78 ▇▁▁▁▁
SERIOUSINJ 0 1.00 0.02 0.16 0 0 0 0 41 ▇▁▁▁▁
FATALITIES 0 1.00 0.00 0.04 0 0 0 0 5 ▇▁▁▁▁
SEGLANEKEY 0 1.00 267.45 3271.23 0 0 0 0 525241 ▇▁▁▁▁
CROSSWALKK 0 1.00 9499.31 71173.48 0 0 0 0 5239700 ▇▁▁▁▁

The data is compiled by SPD (Seattle Police Department), it is compiled from traffic accidents reports. The dataset had ~ 0.25M rows & 39 columns. It’s one row per accident. The data spans the period from Jan-2004 till Dec-2020.

Data Exploration

Identifier Variables
  1. OBJECTID A unique identifier.
  2. REPORTNO The report number.
  3. INCKEY A unique key for the incident.
  4. COLDETKEY Secondary key for the incident.
  5. INTKEY A key that corresponds to the intersection associated with a collision.
  6. SEGLANEKEY A key for the lane segment in which the collision occurred.
  7. CROSSWALKK A key for the crosswalk at which the collision occurred.
  8. SDOTCOLNUM A number given to the collision by SDOT.
Categorical Variables
  1. STATUS No description is provided. It has two values, Matched & Unmatched.
  2. ADDRTYPE Collision address type (Alley, Block, Intersection)
  3. EXCEPTRSNC No description is provided, but it has only 1 value, so it has no predictive power and will be removed.
  4. EXCEPTRSND A text description of the EXCEPTRSNC, so it is totally redundant and has no added predictive power.
  5. SEVERITYCO A code that corresponds to the severity of the collision (5 levels)
  6. SEVERITYDE A text description of the SEVERITYCO, so it is totally redundant and has no added predictive power.
  7. COLLISIONT Collision type (i.e. parked car, pedestrian, head on, ….etc).
  8. JUNCTIONTY Category of junction at which collision took place (i.e. intersection, mid-block, ramp junction, …etc).
  9. SDOT_COLCO A code given to the collision by SDOT (Seattle Department of Traffic).
  10. SDOT_COLDE A text description of SDOT_COLCO. It’s totally redundant and has no added predictive power.
  11. INATTENTIO Whether or not collision was due to inattention.
  12. UNDERINFL Whether or not a driver involved was under the influence of drugs or alcohol.
  13. WEATHER A description of the weather conditions during the time of the collision.
  14. ROADCOND The condition of the road during the collision.
  15. LIGHTCOND The light conditions during the collision.
  16. PEDROWNOTG Whether or not the pedestrian right of way was not granted.
  17. SPEEDING Whether or not speeding was a factor in the collision.
  18. ST_COLCODE A code provided by the state that describes the collision.
  19. ST_COLDESC A text description of ST_COLCODE. It’s totally redundant and has no added predictive power.
  20. HITPARKEDC Whether or not the collision involved hitting a parked car.
Numerical Variables
  1. PERSONCOUN The total number of people involved in the collision.
  2. PEDCOUNT The number of pedestrians involved in the collision.
  3. PEDCYLCOUN The number of bicycles involved in the collision.
  4. VEHCOUNT The number of vehicles involved in the collision.
  5. INJURIES The number of total injuries in the collision.
  6. SERIOUSINJ The number of serious injuries in the collision.
  7. FATALITIES The number of fatalities in the collision.
Other Variables
  1. INCDATE The date of the incident. The data spans the period from Oct-2003 till Dec-2020.
  2. LOCATION Description of the general location of the collision.
  3. INCDTTM The date and time of the incident.
  4. geometry The geometric location of the accident.

Interactive Data Exploration

Please use the interactive version hosted at Shiny.io to interact with the plots
In the below, I use Shiny code for interactive data exploration.

Map Plot

Geographical locations of the collisions, coloured by the number of injuries.

collisions_cleaned %>% 
  tm_shape() +
  tm_dots(col = "INJURIES", size = "PERSONCOUN")

Bar Plot

Explore the distribution of variables. I plot a bar plot and make a percentage table.

collisions_cleaned %>% 
  ggplot(aes(x = INJURIES)) +
  geom_bar()

collisions_cleaned %>% 
  as_tibble() %>% 
  select(-geometry) %>% 
  tabyl(INJURIES) %>% 
  adorn_pct_formatting() %>% 
  rename(Count = n, Percent = percent) %>% 
  kable()
INJURIES Count Percent
0 161070 72.0%
1 47802 21.4%
2 10779 4.8%
3 2743 1.2%
4 826 0.4%
5 275 0.1%
6 100 0.0%
7 40 0.0%
8 12 0.0%
9 10 0.0%
10 6 0.0%
11 5 0.0%
12 1 0.0%
13 2 0.0%
15 1 0.0%
78 1 0.0%

Most accidents have no injuries

Time Plot

Below I make a time series plot of the number of accidents

collisions_cleaned %>% 
  count(INCDATE) %>% 
  ggplot(aes(x = INCDATE, y = n)) +
  geom_line()

A noticeable decrease in 2020 due to COVID-19 lowdown.

Seattle Streets

Now, I’m enriching the data with a dataset for Seattle streets

data_folder <- "SeattleTraffic/data"

# read the data
streets <- read_sf(data_folder, "Seattle_Streets") %>% 
  mutate(across(where(~ is.character(.x) & length(unique(.x)) < 20), as.factor))

# convert the below numeric columns to factors, as they represent categorical features
num_to_cat <- c("ARTCLASS", "TRANCLASS")
streets <- streets %>% 
  mutate(across(all_of(num_to_cat), as.factor))

skim_without_charts(streets)
Data summary
Name streets
Number of rows 23806
Number of columns 39
_______________________
Column type frequency:
character 10
factor 18
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
UNITID 5 1 5 5 0 2562 0
UNITID2 5 1 4 5 0 941 0
UNITIDSORT 5 1 9 10 0 23801 0
UNITDESC 5 1 35 92 0 23800 0
STNAME_ORD 0 1 5 25 0 2409 0
XSTRLO 5 1 5 30 0 3682 0
XSTRHI 5 1 5 30 0 3684 0
INTRLO 8 1 18 63 0 14108 0
INTRHI 10 1 18 63 0 14234 0
geometry 0 1 70 18415 0 23806 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
ARTCLASS 5 1.00 FALSE 7 0: 16957, 2: 2459, 1: 2197, 3: 1888
ARTDESCRIP 5 1.00 FALSE 7 Not: 16957, Min: 2459, Pri: 2197, Col: 1888
OWNER 23062 0.03 FALSE 7 WSD: 641, PRI: 74, PAR: 12, SDO: 12
STATUS 50 1.00 FALSE 5 INS: 23726, OUT: 13, TEM: 13, PLA: 3
SEGDIR 6 1.00 FALSE 8 E: 6771, N: 6554, S: 4028, W: 3255
ONEWAY 10 1.00 FALSE 2 N: 22227, Y: 1569
ONEWAYDIR 22235 0.07 FALSE 8 S: 336, N: 327, NW: 174, SE: 169
FLOW 22237 0.07 FALSE 2 F: 903, T: 666
SURFACETYP 1198 0.95 FALSE 6 PCC: 10580, ST: 4179, AC/: 3818, AC: 3698
SURFACET_1 23123 0.03 FALSE 5 PCC: 451, AC: 105, AC/: 94, ST: 29
DIRLO 6 1.00 FALSE 8 E: 6771, N: 6554, S: 4028, W: 3255
DIRHI 6 1.00 FALSE 8 W: 6771, S: 6554, N: 4028, E: 3255
NATIONHWYS 5 1.00 FALSE 2 N: 22004, Y: 1797
STREETTYPE 793 0.97 FALSE 12 Nei: 12799, Urb: 3074, Nei: 2193, Urb: 1186
TRANCLASS 0 1.00 FALSE 7 0: 18534, 3: 3081, 2: 1737, 1: 225
TRANDESCRI 187 0.99 FALSE 7 NOT: 18347, MIN: 3081, MAJ: 1737, PRI: 225
PVMTCATEGO 81 1.00 FALSE 6 NON: 16747, ART: 6525, NON: 280, MLT: 147
PARKBOULEV 5 1.00 FALSE 2 N: 23636, Y: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
OBJECTID 0 1 11903.50 6872.34 1.0 5952.25 11903.50 17854.75 23806.00
COMPKEY 0 1 35005.36 99491.72 1000.0 7045.25 13159.50 19145.75 765446.00
BLOCKNBR 5 1 3866.84 3259.50 0.0 1300.00 3100.00 5500.00 77000.00
SPEEDLIMIT 6 1 22.07 4.61 0.0 20.00 20.00 25.00 60.00
SEGLENGTH 5 1 423.16 431.88 8.0 260.00 326.00 525.00 15035.00
SURFACEWID 8 1 25.37 15.31 0.0 21.00 25.00 32.00 295.00
INTKEYLO 8 1 49966.98 89615.33 23806.0 27379.25 31092.00 35143.75 765443.00
PVMTCONDIN 5 1 51.40 32.67 0.0 22.00 60.00 79.00 100.00
PVMTCOND_1 5 1 2.03 12.58 0.0 0.00 0.00 0.00 100.00
SLOPE_PCT 18 1 3.97 3.83 0.0 1.00 3.00 6.00 47.00
SHAPE_Leng 0 1 423.08 434.16 8.4 260.01 325.94 523.43 15035.23
tm_shape(streets) +
  tm_lines(col = "ONEWAY", lwd = 5, palette = "Blues", alpha = 1)

Data Exploration

Identifier Variables
  1. OBJECTID ESRI unique identifier
  2. COMPKEY Primary key of the Street asset table
  3. UNITIDSORT Alpha-numeric unique identifier
Categorical Variables
  1. ARTCLASS Arterial classification code
  2. ARTDESCRIP A test description of ARTCLASS
  3. OWNER The organization that owns the street
  4. STATUS Current street status
  5. SEGDIR Street segment direction
  6. ONEWAY One Way Street (Y/N)
  7. ONEWAYDIR One Way Street traffic flow direction
  8. FLOW One Way Street traffic flow classification
  9. SURFACETYP Primary pavement used on Street surface
  10. SURFACET_1 Secondary pavement used on Street surface
  11. DIRLO Direction of low address end of segment
  12. DIRHI Direction of high address end of segment
  13. NATIONHWYS Whether the street is part of the National Highway System
  14. TRANCLASS Street transit classification
  15. TRANDESCRI A text description of TRANCLASS
  16. PVMTCATEGO Pavement category
  17. PARKBOULEV Is parking allowed (Y/N)
  18. STREETTYPE Street type classification
Character Variables
  1. UNITID No description is provided
  2. UNITID2 No description is provided
  3. UNITDESC Structured description of the Street location
  4. STNAME_ORD Street segment name
  5. XSTRLO No description is provided
  6. XSTRHI No description is provided
  7. INTRLO Description of the intersection location with cross street at high address end of segment
  8. INTRHI Description of the intersection location with cross street at high address end of segment
Numeric Variables
  1. BLOCKNBR ID of block street runs adjacent to
  2. SPEEDLIMIT Street speed limit in MPH
  3. SEGLENGTH Street segment length in feet
  4. SURFACEWID Street segment width in feet
  5. INTKEYLO Intersection key at low address end of segment
  6. PVMTCONDIN Primary pavement condition, out of 100
  7. PVMTCOND_1 Secondary pavement condition, out of 100
  8. SLOPE_PCT Street grade in slope percentage
  9. SHAPE_Leng The length of the GIS feature

Joining the datasets

The 3 datasets has no common ID to use for joining them together, hence I had to resort to joining them using GIS libraries to calculate their proximity together.
I want to map each collision to its street in the streets dataset and in the flows dataset.

The collisions dataset is large, so I used parallel backend to perform the calculation in parallel on multiple cores

First, write a function to identify the nearest street or flow.
And then use a “split, apply and combine strategy”
1. Split the collisions into smaller sub data frames
2. Apply the function to each sub data frame
3. Combine the results together

This is the function to identify the nearest line (e.g. a street) to a given point, (e.g. a collision location)

#' Identify the Nearest sf Line Object to an sf Point Object
#' Identifies the nearest sf line object (for example a street) to a given
#' sf point object (e.g. a specific location)
#'
#' @param sfp sf point object
#' @param sfl sf line object
#' @param .by the name of a unique identifier column in the sfl object
#' @param .name the name given to the new sfp object id
#'
#' @return the sfp object after adding a column for the nearest sfl object
#' @export
#'
#' @examples
#' @todo modify the function to reject objects further apart than a given 
#' threshold
identify_nearest_line <- function(sfp, sfl, .by, .name) {
  dist_matrix <- st_distance(sfp, sfl, by_element = FALSE) 
  min_index <- apply(dist_matrix, 1, which.min) %>% 
    map_int(~ .x[1])
  min_dist <- diag(dist_matrix[,min_index])
  sfp[[.name]] <- sfl[[.by]][min_index]
  return(sfp)
}

This is a generic function to apply any given function in parallel

#' Apply a Function in Parallel
#' Applies a given function to a dataframe in parallel
#' The function uses a "split, apply then combine strategy"  
#' 1. Split the dataframe into smaller sub dataframes  
#' 2. Apply the function to each sub dataframe  
#' 3. Combine the results together; in this case using bind_rows  
#' @param df a dataframe to apply the function on
#' @param fct a function to apply
#' @param workers number of parallel workers to use
#' @param split_size the size (number or rows) of each split
#' @param ... 
#'
#' @return
#' @export
#'
#' @examples
apply_in_parallel <- function(df, fct, workers = 4, split_size = 3000, ...) {
  require(furrr)
  
  nr <- nrow(df)
  n <- split_size
  
  if(workers > 1) {
    plan(multisession, workers = workers)
  }
  processed <- df %>% 
    split(rep(1:ceiling(nr/n), each=n, length.out=nr)) %>% 
      future_map_dfr(fct, ...) %>% 
      ungroup()
  plan(sequential)
  return(processed)
}

Before anything else, the GIS objects should be projected onto a planar metric projection. This is necessary to make sure that the distance metrics are calculated in meters, rather than degrees. This region of North America is suited to be projected onto EPSG:32610 or EPSG:26910. The former uses WGS84 geographic 2D CRS as its base CRS and UTM zone 10N as its projection, while the later uses the NAD83 geographic CRS and UTM zone 10N.

collisions_cleaned <- collisions_cleaned %>% 
  st_transform(crs = 32610)
flows <- flows %>% 
  st_transform(crs = 32610)
streets <- streets %>% 
  st_transform(crs = 32610)  

Now, extract the nearest flow to each collision location

tic()
collisions_new <- apply_in_parallel(
  collisions_cleaned,
  identify_nearest_line,
  split_size = 1000,
  workers = 10,
  sfl = flows,
  .by = "OBJECTID",
  .name = "nearest_flow_id")
toc()

And then extract the nearest street to each collision location

tic()
collisions_new <- apply_in_parallel(
  collisions_new,
  identify_nearest_line,
  split_size = 1000,
  workers = 10,
  sfl = streets,
  .by = "OBJECTID",
  .name = "nearest_street_id")
toc()

Now, join collisions with both flows and streets

collisions_joined <- collisions_new %>% 
  inner_join(st_drop_geometry(streets), by = c("nearest_street_id" = "OBJECTID")) %>% 
  inner_join(st_drop_geometry(flows), by = c("nearest_flow_id" = "OBJECTID"))

Not too late, and for the better, I discovered an alternate method for joining using the ’sf` package. Its algorithm is optimized and it runs much faster.

collisions_joined <- collisions_cleaned %>% 
  st_join(streets, join = st_nearest_feature, left = FALSE, suffix = c(".coll", ".strt")) %>% 
  st_join(flows, join = st_nearest_feature, left = FALSE, suffix = c(".colst", ".flow"))

Now, the resulting data frame has 91 features. It’s time to extract lat & long.

collisions_joined_df <- collisions_joined %>% 
  st_coordinates() %>% 
  as_tibble() %>% 
  bind_cols(collisions_joined) %>% 
  select(-geometry) %>% 
  rename (XCord = X, YCord = Y)
skim_without_charts(collisions_joined_df)
Data summary
Name collisions_joined_df
Number of rows 223673
Number of columns 92
_______________________
Column type frequency:
character 15
Date 2
factor 41
numeric 34
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
REPORTNO 0 1.00 1 9 0 223670 0
LOCATION 4617 0.98 17 90 0 25276 0
INCDTTM 0 1.00 8 22 0 170926 0
SDOTCOLNUM 96469 0.57 7 8 0 127185 0
UNITID 7507 0.97 5 5 0 2111 0
UNITID2 7507 0.97 4 5 0 795 0
UNITIDSORT 7507 0.97 9 10 0 18343 0
UNITDESC 7507 0.97 35 90 0 18342 0
STNAME_ORD.colst 7507 0.97 5 25 0 2017 0
XSTRLO 7507 0.97 5 30 0 3120 0
XSTRHI 7507 0.97 5 30 0 3154 0
INTRLO 7508 0.97 18 63 0 11817 0
INTRHI 7525 0.97 18 63 0 12047 0
COMPKEY.flow 7814 0.97 4 143 0 1886 0
STNAME_ORD.flow 7507 0.97 6 84 0 659 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
INCDATE 0 1.00 2004-01-01 2020-12-30 2011-10-26 6209
START_DATE 7507 0.97 1970-01-01 2018-12-01 2015-08-20 356

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
STATUS.coll 0 1.00 FALSE 2 Mat: 196523, Unm: 27150
ADDRTYPE 3736 0.98 FALSE 3 Blo: 146306, Int: 72750, All: 881
EXCEPTRSNC 211786 0.05 FALSE 1 NEI: 11887
EXCEPTRSND 211786 0.05 FALSE 1 Not: 11887
SEVERITYCO 1 1.00 FALSE 5 1: 138828, 2: 59334, 0: 22001, 2b: 3151
SEVERITYDE 0 1.00 FALSE 5 Pro: 138828, Inj: 59334, Unk: 22002, Ser: 3151
COLLISIONT 27222 0.88 FALSE 10 Par: 48740, Ang: 35849, Rea: 34907, Oth: 24781
JUNCTIONTY 12041 0.95 FALSE 7 Mid: 102909, At : 70037, Mid: 24466, Dri: 11506
SDOT_COLCO 0 1.00 FALSE 40 11: 93018, 14: 59922, 00: 19278, 16: 10998
SDOT_COLDE 0 1.00 FALSE 40 MOT: 93018, MOT: 59922, NOT: 19278, MOT: 10998
INATTENTIO 0 1.00 FALSE 2 N: 193478, Y: 30195
UNDERINFL 0 1.00 FALSE 2 N: 214044, Y: 9629
WEATHER 27413 0.88 FALSE 12 Cle: 115582, Rai: 34166, Ove: 28742, Unk: 15131
ROADCOND 27333 0.88 FALSE 9 Dry: 129587, Wet: 48949, Unk: 15161, Ice: 1233
LIGHTCOND 27504 0.88 FALSE 9 Day: 120312, Dar: 50463, Unk: 13549, Dus: 6121
PEDROWNOTG 0 1.00 FALSE 2 N: 218450, Y: 5223
SPEEDING 0 1.00 FALSE 2 N: 213658, Y: 10015
ST_COLCODE 27222 0.88 FALSE 62 32: 45100, 10: 35849, 14: 26353, 50: 14316
ST_COLDESC 27222 0.88 FALSE 62 One: 45100, Ent: 35849, Fro: 26353, Fix: 14316
HITPARKEDC 0 1.00 FALSE 2 N: 211132, Y: 12541
ARTCLASS 7507 0.97 FALSE 6 1: 83513, 0: 67646, 2: 47940, 3: 16003
ARTDESCRIP 7507 0.97 FALSE 6 Pri: 83513, Not: 67646, Min: 47940, Col: 16003
OWNER 208526 0.07 FALSE 6 WSD: 14795, SDO: 189, PRI: 116, OTH: 32
STATUS.strt 7646 0.97 FALSE 4 INS: 215869, TEM: 94, OUT: 59, PLA: 5
SEGDIR 7533 0.97 FALSE 8 N: 57721, E: 54057, S: 37748, NW: 19059
ONEWAY 7509 0.97 FALSE 2 N: 187043, Y: 29121
ONEWAYDIR 194564 0.13 FALSE 8 N: 5126, S: 4635, NE: 4198, SE: 4195
FLOW 194552 0.13 FALSE 2 F: 15330, T: 13791
SURFACETYP 20493 0.91 FALSE 6 AC/: 90399, PCC: 74462, AC: 25198, ST: 11007
SURFACET_1 205367 0.08 FALSE 5 PCC: 13699, AC/: 2569, AC: 1782, ST: 201
DIRLO 7533 0.97 FALSE 8 N: 57721, E: 54057, S: 37748, NW: 19059
DIRHI 7533 0.97 FALSE 8 S: 57721, W: 54057, N: 37748, SE: 19059
NATIONHWYS 7507 0.97 FALSE 2 N: 149430, Y: 66736
STREETTYPE 13864 0.94 FALSE 12 Urb: 35142, Nei: 32536, Urb: 27694, Urb: 27379
TRANCLASS 7507 0.97 FALSE 7 0: 85245, 2: 61048, 3: 57602, 1: 10671
TRANDESCRI 8029 0.96 FALSE 7 NOT: 84723, MAJ: 61048, MIN: 57602, PRI: 10671
PVMTCATEGO 7738 0.97 FALSE 6 ART: 147458, NON: 67328, NON: 875, MLT: 250
PARKBOULEV 7507 0.97 FALSE 2 N: 214852, Y: 1314
DOWNTOWN 7507 0.97 FALSE 3 N: 183577, Y: 23527, B: 9062
DATAQUALIT 7507 0.97 FALSE 4 Stu: 86922, Est: 48121, Stu: 44380, Stu: 36743
FLAGS 217771 0.03 FALSE 2 Com: 4902, WSD: 1000

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
XCord 7507 0.97 550288.78 2269.22 543148.85 548884.15 550320.72 551669.39 557305.76
YCord 7507 0.97 5274302.58 6232.62 5260500.03 5269494.43 5273844.63 5279192.37 5286965.89
OBJECTID.coll 0 1.00 111837.99 64569.00 1.00 55920.00 111838.00 167756.00 223674.00
INCKEY 0 1.00 146654.68 90716.54 1001.00 72205.00 128328.00 212390.00 336687.00
COLDETKEY 0 1.00 146895.90 91107.81 1001.00 72205.00 128328.00 212670.00 338187.00
INTKEY 150923 0.33 37695.91 52206.87 23807.00 28655.00 29973.00 34008.00 764413.00
PERSONCOUN 0 1.00 2.22 1.47 0.00 2.00 2.00 3.00 93.00
PEDCOUNT 0 1.00 0.04 0.20 0.00 0.00 0.00 0.00 6.00
PEDCYLCOUN 0 1.00 0.03 0.16 0.00 0.00 0.00 0.00 2.00
VEHCOUNT 0 1.00 1.72 0.83 0.00 2.00 2.00 2.00 15.00
INJURIES 0 1.00 0.37 0.73 0.00 0.00 0.00 1.00 78.00
SERIOUSINJ 0 1.00 0.02 0.16 0.00 0.00 0.00 0.00 41.00
FATALITIES 0 1.00 0.00 0.04 0.00 0.00 0.00 0.00 5.00
SEGLANEKEY 0 1.00 267.45 3271.23 0.00 0.00 0.00 0.00 525241.00
CROSSWALKK 0 1.00 9499.31 71173.48 0.00 0.00 0.00 0.00 5239700.00
OBJECTID.strt 7507 0.97 11863.83 6962.27 1.00 6098.00 11911.00 17917.00 23806.00
COMPKEY.colst 7507 0.97 24579.89 77105.30 1000.00 7906.00 11851.00 16068.00 764415.00
BLOCKNBR 7507 0.97 3359.21 3325.00 0.00 900.00 2100.00 5000.00 77000.00
SPEEDLIMIT 7508 0.97 24.95 5.13 0.00 20.00 25.00 25.00 60.00
SEGLENGTH 7507 0.97 488.62 402.06 16.00 298.00 360.00 647.00 15035.00
SURFACEWID 7509 0.97 37.55 19.11 0.00 25.00 40.00 50.00 295.00
INTKEYLO 7508 0.97 42548.52 72848.18 23806.00 28492.00 30348.00 34085.00 765442.00
PVMTCONDIN 7507 0.97 54.59 29.71 0.00 35.00 61.00 78.00 100.00
PVMTCOND_1 7507 0.97 6.53 22.33 0.00 0.00 0.00 0.00 100.00
SLOPE_PCT 7537 0.97 3.04 3.13 0.00 1.00 2.00 4.00 32.00
SHAPE_Leng.colst 7507 0.97 488.17 400.68 16.37 297.18 360.07 647.01 15035.23
OBJECTID 7507 0.97 952.47 551.87 1.00 474.00 936.00 1432.00 1923.00
FLOWSEGID 7507 0.97 1010.91 620.99 3.00 439.00 1016.00 1575.00 3000.00
AMPK 148022 0.34 885.06 655.18 29.00 418.00 743.00 1152.00 4569.00
PMPK 148022 0.34 977.93 694.67 39.00 497.00 828.00 1278.00 4756.00
AWDT 8022 0.96 16529.50 12143.95 387.00 8382.00 13500.00 21558.00 108179.00
ADT 52687 0.76 14327.72 11045.04 0.00 7056.00 11955.00 19092.00 102555.00
AWDT_ROUND 7507 0.97 16690.85 12084.62 400.00 8400.00 13600.00 22100.00 98700.00
SHAPE_Leng.flow 7507 0.97 2213.18 1860.37 48.54 856.21 1728.97 3020.21 13076.44

Modelling with Tidymodels

Tidymodels is a new modeling framework for R developed by RStudio. In my openion, the Tidymodels philosophy and syntax is more intuitive and simpler than scikit-learn, letting the data scientist focus more on the underlying model rather than the syntax. The Tidymodels makes it possible to define a recipe for the pre-processing steps and a specification for the model to use.

It is useful to model the number of injuries in a collision. But before hand, I plan to explore the data with PCA & UMAP.

Final Cleansing

Some of the columns in the final joined dataset will be removed before modeling. Those are: 1. ID columns.
2. Name columns, like street names or intersection names.
3. Redundant columns, as some features are represented by two columns, where one of them is a numeric code, and tge other is a mere description of that code.

# id columns, I'm excluding OBJECTID.coll to allow tracking the data later
id_cols <-c("COMPKEY.colst", "UNITIDSORT", "REPORTNO", "INCKEY", "COLDETKEY",
            "INTKEY", "SEGLANEKEY", 'CROSSWALKK', 'SDOTCOLNUM', 'OBJECTID.strt',
            "FLOWSEGID", "COMPKEY.colst", "OBJECTID", "BLOCKNBR", "INTKEYLO")

redund_cols <- c("SHAPE_Leng.flow", "EXCEPTRSND", "SEVERITYDE", "SDOT_COLDE",
                 "ST_COLDESC", "ARTDESCRIP", "TRANDESCRI", "START_DATE",
                 "SHAPE_Leng.colst", "EXCEPTRSNC")

data <- collisions_joined_df %>% 
  mutate(HOUR = str_extract(INCDTTM, "(?<=\\s).*") %>%
           as.POSIXct(format="%I:%M:%S %p") %>% 
           lubridate::hour()) %>% 
  select(-where(is.character),
         -all_of(id_cols),
         -all_of(redund_cols)) %>% 
  rename(COLLISION_ID = OBJECTID.coll) %>% 
  select(COLLISION_ID, everything())

I encode the Date & Hour as cyclic values using sin & cos. This is useful to preserve the cyclic nature of dates & times (for example, hour 23 is as close to hour 0 as it is close to hour 22, Jan is close to Dec as it is to Feb)

data <- lubridate::cyclic_encoding(data$INCDATE, periods = "month") %>% 
  as_tibble() %>% 
  rename(MONTH.SIN = sin.month,
         MONTH.COS = cos.month) %>% 
  bind_cols(data) %>% 
  mutate(HOUR.COS = cos(2*3.14*HOUR/24),
         HOUR.SIN = sin(2*3.14*HOUR/24),
         .keep = "unused")
saveRDS(data, 
        paste0("data_", 
               gsub("-", "", lubridate::today()),
               ".rds"))

PCA

I will make a PCA to explore the linear relationship between the variables.

I use the Tidymodels recipe to create a specification or recipe for pre-processing steps, those are:
a. One-hot encoding using step_dummy()
b. Normalization (scaling & standardization) using step_normalise. This is necessary since PCA requires all its inputs to be normalized.

Notes
* step_unknown() creates a new unknown level even for columns with no missing values. This creates a new all NA column. It’s necessary to add step_zv() or step_naomit() after step_unkown().
* step_unknown() creates a new unknown level, wheras my data already had some Unknown level. I used step_mutate() with fct_collapse() to collapse those 2 levels.

Create a PCA recipe

pca_rec <- recipe(~., data = data) %>%
  update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
  step_unknown(all_nominal(), -has_role('ID')) %>% 
  step_mutate(
    LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
    JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
    WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
    ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>% 
  step_medianimpute(all_numeric(), -has_role('ID')) %>% 
  step_dummy(all_nominal(), -has_role('ID')) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors())
pca_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          2
##  predictor         58
## 
## Operations:
## 
## Unknown factor level assignment for all_nominal(), -has_role("ID")
## Variable mutation for LIGHTCOND, JUNCTIONTY, WEATHER, ROADCOND
## Median Imputation for all_numeric(), -has_role("ID")
## Dummy variables from all_nominal(), -has_role("ID")
## Zero variance filter on all_predictors()
## Centering and scaling for all_predictors()
## No PCA components were extracted.

Now, train the recipe

pca_prep <- prep(pca_rec)

I now plot a scree plot for the PCA results, this plot shows the amount of variance captured by each principle component.

pca_step <- 7
pca_variance <- tidy(pca_prep, pca_step, type = "variance") %>% 
  select(-id)
p <- pca_variance %>% 
  filter(terms == "percent variance") %>% 
  ggplot(aes(x = component, y = value)) +
  geom_point() +
  geom_line(col = "grey")
print(p)

pca_variance %>% 
  pivot_wider(names_from = terms, values_from = value) %>% 
  select(component, contains("percent")) %>% 
  slice_head(n = 20) %>% 
  kable()
component percent variance cumulative percent variance
1 5.1045232 5.104523
2 3.0152155 8.119739
3 2.4851997 10.604938
4 2.3507613 12.955700
5 1.9797727 14.935472
6 1.6870924 16.622565
7 1.5307607 18.153326
8 1.4802663 19.633592
9 1.3542519 20.987844
10 1.3425025 22.330346
11 1.2487346 23.579081
12 1.2439870 24.823068
13 1.1554448 25.978512
14 1.1026987 27.081211
15 1.0701438 28.151355
16 1.0146546 29.166010
17 0.8966015 30.062611
18 0.8564496 30.919061
19 0.7913306 31.710391
20 0.7878970 32.498288

Takeaways

  • The 1st PC accounts for 5% of the variance, and the 2nd for 3%. After the 4th component, the contribution of each PC is very small (less than 2%).
  • Below, I will study the loading of each variable on each of the 1st 4 PCs. I’m considering only the top 10 values.
library(tidytext)
pca_loading <- tidy(pca_prep, pca_step, type = "coef")
pca_loading %>%
  filter(component %in% paste0("PC", 1:4)) %>%
  group_by(component) %>%
  top_n(6, abs(value)) %>%
  ungroup() %>%
  mutate(terms = reorder_within(terms, abs(value), component)) %>%
  ggplot(aes(abs(value), terms, fill = value > 0)) +
  geom_col() +
  facet_wrap(~component, scales = "free_y", dir = "v") +
  scale_y_reordered() +
  labs(
    x = "Absolute value of contribution",
    y = NULL, fill = "Positive?"
  )

Takeaways

  • PC1 captures the categorical variables with unknown values. As they all have the same sign, then they are positively correlated together. For example, an observation with an unknown PARKBOULEV is likely to have an unknown DOWNTOWN.
  • PC2 shows a negative correlation between ARTCLASS_X1 (Principal Arterial streets) and unknown measurements such as WEATHER, ROADCOND & FLOW. This means that Principal Arterial streets tend to have known road condition, flow counts and weather condition at the time of a collision.
  • PC3 reveals some rational findings, but now they are backed up by the data. For example, DOWNTOWN_Y is positively correlated with ONEWAY_Y. In other words, streets in downtown tend to be one-way.
  • PC4 shows that when ADDRTYPE = “Block”, the other features tend to be unknown. In other words, those features are not recorded for “Block” addresses. This also seems rational as the block addresses are very small & least busy streets.

UMAP

UMAP is a new technique devised by McInnes et al in 2018 that helps to visualize and understand high dimensional data. In contrast to PCA, it captures non-linear relationships.

library(embed)

neighbors = 30
min_dist = 0.01

umap_rec <- recipe(~., data = data) %>%
  update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
  step_unknown(all_nominal(), -has_role('ID')) %>% 
  step_mutate(
    LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
    JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
    WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
    ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>% 
  step_medianimpute(all_numeric(), -has_role('ID')) %>% 
  step_dummy(all_nominal(), -has_role('ID')) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_umap(all_predictors(), neighbors = neighbors, min_dist = min_dist)

umap_prep <- prep(umap_rec)
umap_prep

saveRDS(umap_prep, "umapPrep_20210214.rds")
umap_prep <- readRDS("umapPrep_20210214.rds")
umap_proj <- juice(umap_prep)

Plotting the UMAP 2D projections

umap_proj %>%
  inner_join(data, by = "COLLISION_ID") %>% 
  ggplot(aes(umap_1, umap_2)) +
  geom_point(aes(color = SEVERITYCO), alpha = 0.7, size = 2) +
  labs(color = NULL) +
  ggtitle("UMAP 2D Projection", subtitle = "Coloured by Collision Severity")

umap_proj %>%
  inner_join(data, by = "COLLISION_ID") %>% 
  ggplot(aes(umap_1, umap_2)) +
  geom_point(aes(color = DOWNTOWN), alpha = 0.7, size = 2) +
  labs(color = NULL) +
  ggtitle("UMAP 2D Projection", subtitle = "Coloured by DOWNTOWN")

umap_proj %>%
  inner_join(data, by = "COLLISION_ID") %>% 
  ggplot(aes(umap_1, umap_2)) +
  geom_point(aes(color = ADDRTYPE), alpha = 0.7, size = 2) +
  labs(color = NULL) +
  ggtitle("UMAP 2D Projection", subtitle = "Coloured by Address Type")

Takeaways

  • Collision severity class 0 (0 = unknown severity) forms a clear distinct cluster.
  • Same for NA Address Type and NA DOWNTOWN.
  • Same, but with less distinction, is for Alley Address Types
  • DOWNTOWN Y forms somehow a distinct cluster from DOWNTOWN N. The distinction is fuzzy though.

Predictive Modelling

In this section, I will do a predictive modeling for the Severity Class of a collision. It’s a multiclass classification problem with 5 classes.

data %>% 
  ggplot(aes(x = SEVERITYCO)) +
  geom_bar()

Again, I will use the Tidymodels framework for its simplicity & versatility in defining various recipes and model specification.

I start by testing the below 3 models:
1. Logistic Regression It is a simple and fast model. This is a baseline for comparison with other advanced models.
2. Random Forest It is an ensemble learning method that constructs a multitude of decision trees. It was first devised by Tin Kam Ho in 1998 and later extended by Leo Breiman in 2001.
3. XGBoost It is a new algorithm, devised by Tianqi Chen et al in 2016. It is an ensemble tree boosting algorithm, and it had proved to give very good results in different scenarios.

I start with a train test split

library(tidymodels)
set.seed(104)
train_test_split <- data %>% 
  filter(!is.na(SEVERITYCO)) %>% 
  initial_split(prop = 4/5, strata = 'SEVERITYCO')
data_train <- training(train_test_split)
data_test <- testing(train_test_split)

And then I create various recipes (a recipe is a specification for pre-processing steps). Different classification models require different recipes. For example a LR model requires normalization of the inputs, whereas a tree model doesn’t. Also, a LR model would benefit form a PCA step (as it removes co-linearity), whereas this step is not required in a tree model.

The main steps included in the LR recipe are:
1. Convert missing values in categorical features to a new category named “unknown”.
2. Median imputation for missing values in numeric features.
3. One-hot encoding for categorical variables.
4. Remove zero variance features (features that have a single value).
5. Scaling & standardization.
6. PCA, retaining 200 PCs.
7. SMOTE up-sampling as the data is unbalanced.
The DT recipe excludes the last 2 steps

library(themis)
# a recipe for logistic regression
lr_rec <- recipe(SEVERITYCO ~ ., data = data) %>%
  update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
  step_unknown(all_nominal(), -has_role('ID'), -all_outcomes()) %>% 
  step_mutate(
    LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
    JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
    WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
    ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>% 
  step_medianimpute(all_numeric(), -has_role('ID')) %>% 
  step_dummy(all_nominal(), -has_role('ID'), -all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors(), num_comp = 200) %>% 
  step_smote(SEVERITYCO)
lr_rec$name <- "LR Recipe"

# Decision tree recipe, to be used with Random Forest & XGBoost
dt_rec <- recipe(SEVERITYCO ~ ., data = data) %>%
  update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
  step_unknown(all_nominal(), -has_role('ID'), -all_outcomes()) %>% 
  step_mutate(
    LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
    JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
    WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
    ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>% 
  step_medianimpute(all_numeric(), -has_role('ID')) %>% 
  step_dummy(all_nominal(), -has_role('ID'), -all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_smote(SEVERITYCO)
dt_rec$name <- "DT Recipe"

Now, I define the model specifications. I define the below 3 specifications:
1. Multinomial logistic regression.
2. Random forest.
3. XGBoost.

#' Multinomial Regression
lr_clf <-
  multinom_reg(
    penalty = tune(),
    mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")
lr_clf$name <- 'glmnet'
params <- lr_clf %>% 
  parameters() 
lr_grid <- grid_max_entropy(params, size = 6)
#lr_grid

# Random Forest
rf_clf <- 
  rand_forest(
    trees = tune(),
    min_n = tune()) %>%
  set_engine(
    "ranger",
    num.threads = 11,
    verbose = TRUE,
    seed = 123) %>%
  set_mode("classification")
rf_clf$name <- 'ranger'
params <- rf_clf %>% 
  parameters() 
rf_grid <- grid_max_entropy(params, size = 6)
#rf_grid

#' XGBoost
xg_clf <-
  boost_tree(
    tree_depth = tune(),
    trees = 300,
    learn_rate = 0.01,
    sample_size = tune(),
    mtry = tune(),
    min_n = tune()) %>%
  set_engine(
    "xgboost",
    nthread = 11,
    verbosity = 2) %>%
  set_mode("classification")
xg_clf$name <- 'XGBoost'
params <- xg_clf %>% 
  parameters() %>% 
  update(mtry = predictor_prop())
xg_grid <- grid_max_entropy(params, size = 12)
#xg_grid
# custom list
custom_list <- list(
  list(lr_rec, lr_clf, lr_grid, "None"),
  list(dt_rec, rf_clf, rf_grid, "None"),
  list(dt_rec, xg_clf, xg_grid, "None")) %>% 
  transpose(
    .names = c("data_recipe", "clf", "clf_grid", "parallel_backend")) %>% 
  as_tibble(.name_repair = "minimal")
class(custom_list) <- c("tune_spec", "tbl_df", "tbl", "data.frame")
#print(summary(custom_list))

Below is the initial training of the 3 candidate models. I don’t do a exhaustive hyper-parameter tuning as my aim here is just an initial comparison between the 3 models. Once I settle on a choice model, I will do a more exhaustive search for the optimum hyper-parameters.

library(tictoc)
tic()
FitModel <- TuneFitFactory()
fit_results <- pmap(
  custom_list,
  FitModel,
  train_data = data_train,
  outcome = "SEVERITYCO")
timing <- toc()
elapsed <- timing$toc - timing$tic
print(str_interp("Done fitting ${nrow(custom_list)} models in $[.2f]{elapsed/60} minutes."))
saveRDS(fit_results, "fitResults_20210215.rds")

Below, I compare the results of the 3 initial tunings

fit_results <- readRDS("fitResults_20210215.rds")
models_metrics <- map_dfr(fit_results, possibly(ExtractModelInfo, otherwise = NULL))
models_metrics %>% 
  select(-best_acc_model, -best_auc_model) %>% 
  kable()
clf recipe best_accuracy best_roc_auc n mean_accuracy mean_roc_auc sd_accuracy sd_roc_auc fit_time
glmnet LR Recipe 0.9882863 0.9953769 6 0.8660117 0.9033425 0.1827193 0.1987817 5505.490
ranger DT Recipe 0.9924442 0.9994215 6 0.9919263 0.9993457 0.0005637 0.0001001 2540.848
XGBoost DT Recipe 0.9951491 0.9999239 12 0.9525491 0.9830514 0.0617329 0.0288334 5110.510
plotMetrics(models_metrics, "roc_auc")

plotMetrics(models_metrics, "accuracy")

models_metrics %>% 
  ggplot(aes(x = clf, y = fit_time)) +
  geom_col(fill = "orange") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_log10()

#### Takeaways
* The fit time of LR is longer the Random Forest & XGBoost. This may seem counter-intuitive, but it is because I trained the LR using one thread. Random Forest & XGBoost implementations has inherent parallelisation and I trained them using 11 parallel threads.
* The accuracy and ROC AUC has very high unrealistic figures. This is because I’m using features that are highly correlated with the outcome, such as the count of injured or severely insured people. In real world scenarios it is not realistic to include those features as predictors, because those features are only known after the accident and not before. I will exclude those features as my aim is to predict the accident severity before it happens.

Analyzing the feature importance confirms the above statement. To extract the feature importance, I have to train the model outside of the hyper-parameter search cycle. This is because the above training returns only the evaluation metrics, and not the trained model objects.

# the parameters of the best xgboost model
xg_best <- fit_results %>% 
  pluck(3) %>% 
  pluck("resamples") %>% 
  select_best('roc_auc')
# create a workflow using those parameters
final_wf <-
  workflow() %>% 
  add_model(xg_clf) %>% 
  add_recipe(dt_rec) %>% 
  finalize_workflow(xg_best)
final_wf

# create cv folds (resamples)
set.seed(12)
folds <- vfold_cv(data_train, v = 10, strata = "SEVERITYCO")

# fit the model to those resamples
#final_fit <- final_wf %>% 
#  fit(data_test)
#final_fit

xg_fit <- final_wf %>% 
  last_fit(train_test_split)
saveRDS(xg_fit, "xgFit_20210215.rds")

Now, I see the variable importance of top important features.

library(vip)
xg_fit <- readRDS("xgFit_20210215.rds")
xg_fit %>%
  pull(.workflow) %>%
  pluck(1) %>% 
  pull_workflow_fit() %>% 
  vip()

The above graph shows that the most important features are the number of injuries, fatalities & serious injuries. Including those features is not practical in a real life situation, as they are only known after the accident happens. I will exclude those features from the modeling to make for a more practical model.

# features that are closely coupled with the predictor (i.e. those features should be predicted as well), hence I will remove from the feature space.
pred_features <- c("PERSONCOUN", "PEDCOUNT", "PEDCYLCOUN", "VEHCOUNT", 
                   "INJURIES", "SERIOUSINJ", "FATALITIES")

dt_rec <- dt_rec %>% 
  update_role(any_of(pred_features), new_role = "preds")

# workflow
final_wf <-
  workflow() %>% 
  add_model(xg_clf) %>% 
  add_recipe(dt_rec) %>% 
  finalize_workflow(xg_best)

# train
xg_fit <- final_wf %>% 
  last_fit(train_test_split)

# save
saveRDS(xg_fit, 
        paste0("xgbFitReducedFeatures_", 
        gsub("-", "", lubridate::today()),
        ".rds"))

I check the variable importance of the modified model.

library(vip)
xg_fit <- readRDS("xgbFitReducedFeatures_20210221.rds")
xg_fit %>%
  pull(.workflow) %>%
  pluck(1) %>% 
  pull_workflow_fit() %>% 
  vip()

Takeaways

  • The accuracy is 71%, and the ROC AUC is 85%. This is good results.
  • The most important features are:
    • COLLISIONT (collision type, i.e. parked car, angles, head on, right turn, left turn, …etc).
    • Speeding Y/N
    • SDOT_COLCO, which is a code given by the police department to describe the collision type.

Now, it is time to use hyper-parameter tuning to find the best set of parameters. I also use Racing Methods, devised by [Maron and Moore (1994)], (https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=Hoeffding+races%3A+Accelerating+model+selection+search+for+classification+and+function+approximation&btnG=) to reduce the total number of model evaluations. The method is described by Max Kuhn (2014) in this paper.

I will use the LightGBM model instead of XGBoost as it is considerably faster. LightGBM is gradient boosting framework developed by Microsoft that uses tree based learning algorithms.

library(treesnip)

pred_features <- c("PERSONCOUN", "PEDCOUNT", "PEDCYLCOUN", "VEHCOUNT", 
                   "INJURIES", "SERIOUSINJ", "FATALITIES")

# LightGBM
lg_rec <- recipe(SEVERITYCO ~ ., data = data) %>%
  update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
  update_role(any_of(pred_features), new_role = "preds") %>% 
  step_unknown(all_nominal(), -has_role('ID'), -all_outcomes()) %>% 
  step_mutate(
    LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
    JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
    WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
    ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>% 
  step_zv(all_predictors())

# model spec
lg_spec <-
  boost_tree(
    tree_depth = tune(),
    trees = tune(),
    sample_size = tune(),
    mtry = tune(),
    min_n = tune()) %>%
  set_engine(
    "lightgbm",
    nthread = 11,
    verbosity = 1) %>%
  set_mode("classification")

# parameter grid
grid_size = 16
params <- lg_spec %>% 
  parameters() %>% 
  update(mtry = predictor_prop(),
         sample_size = predictor_prop())
set.seed(911)
race_grid <- grid_max_entropy(params, size = grid_size)

# workflow
wflow <- workflow() %>% 
  add_recipe(lg_rec) %>% 
  add_model(lg_spec)

# cv folds
set.seed(12)
folds <- vfold_cv(data_train, v = 8, strata = "SEVERITYCO")
library(finetune)
library(tictoc)
# race search for best hyper-parameters
tic()
race_tune <- tune_race_anova(
  wflow, 
  folds,
  grid = race_grid,
  metrics = metric_set(roc_auc),
  control = control_race(
    verbose = TRUE,
    verbose_elim = TRUE))
toc()

saveRDS(race_tune, 
        paste0("xgbRacingResamplesN16_", 
               gsub("-", "", lubridate::today()),
                              ".rds"))

The below plot illustrates the racing condition. Out of 16 different models (16 different set of hyper-parameters), only 5 made it to the end. So, instead of fitting 128 models (16 models x 8 resmaples), fewer models were fitted.

require(finetune)
race_tune <- readRDS("xgbRacingResamplesN16_20210222.rds")

plot_race(race_tune)

autoplot(race_tune)

show_best(race_tune, metric = "roc_auc", n = 3) %>% kable()
mtry trees min_n tree_depth sample_size .metric .estimator mean n std_err .config
0.9940907 1871 35 2 0.0546378 roc_auc hand_till 0.8198579 8 0.0022909 Preprocessor1_Model16
0.8477562 741 31 3 0.2075844 roc_auc hand_till 0.8194337 8 0.0022680 Preprocessor1_Model12
0.8354038 1085 2 5 0.7793958 roc_auc hand_till 0.8152093 8 0.0029104 Preprocessor1_Model10

The best 2 models models have very close ROC AUC, whereas the 2nd model is less complex. I follow the Principle of Parsimony and select the 2nd model.

I use this final model of the final training, whereas I use all the training data for training (no resamples here) and test the performance on the test samples.

bt_best <- select_by_pct_loss(race_tune, metric = "roc_auc", trees)
bt_best %>% kable()
mtry trees min_n tree_depth sample_size .metric .estimator mean n std_err .config .best .loss
0.8812664 543 7 8 0.1255015 roc_auc hand_till 0.8127876 8 0.0029113 Preprocessor1_Model13 0.8198579 0.8623834
final_wf <-
  wflow %>% 
  finalize_workflow(bt_best)

# create cv folds (resamples)
set.seed(12)
folds <- vfold_cv(data_train, v = 10, strata = "SEVERITYCO")
bt_fit <- final_wf %>% 
  last_fit(train_test_split)
saveRDS(bt_fit, 
        paste0("boostedTreeLastFit_", 
               gsub("-", "", lubridate::today()),
                              ".rds"))

Local Interpretability Techniques

Now I use LIME technique for local interpretability of machine learning models. This technique was devised by Riberia et al at 2016.

# load the fitted model (`last_fit` object)
bt_fit <- readRDS("xgbFitReducedFeatures_20210221.rds")

# extract model, mold & data of final XGBoost object
final_model <- bt_fit %>% 
  pull(.workflow) %>%
  pluck(1) %>% 
  extract_model()

final_mold <- bt_fit %>% 
  pull(.workflow) %>%
  pluck(1) %>% 
  pull_workflow_mold()

train_X_df <- final_mold$predictors  # needed for LIME
train_X <- final_mold$predictors %>% 
  data.matrix()
train_y_df <- final_mold$outcomes
train_y <- final_mold$outcomes %>% 
  data.matrix() 
train_ids <- final_mold$extras %>% 
  pluck(1) %>% 
  pluck(1)

# use LIME
library(lime)
explainer <- lime(train_X_df, final_model)
# explain 1st 1000 observations
explanation <- lime::explain(train_X_df[1:1000,], explainer, n_labels = 1, n_features = 3)
saveRDS(explanation, 
        paste0("limeExplainN1000_", 
               gsub("-", "", lubridate::today()),
               ".rds"))
explanation <- readRDS("limeExplainN1000_20210223.rds")
explanation %>% 
  select(-data) %>% 
  separate(feature, into = c("feature_name", "feature_class"), sep = "_", extra = "merge") %>% 
  ggplot(aes(x = feature_weight, y = feature_name, col = feature_value)) +
  geom_jitter(height = 0.2, alpha = 0.4)

The features that have most weight in interpreting the model are ST, WEATHER and SURFACET.

Explain the first three observations

explanation <- lime::explain(train_X_df[1:3,], explainer, n_labels = 1, n_features = 6)
plot_features(explanation)

The above is an example of interpreting specific observations. For example, the 1st observation is predicted to be of class 1 with a probability of 0.82, the features that impacted this prediction are SURFACET (Primary pavement used on Street surface), PVMCOND (pavement condition) and ADT (average daytime traffic).
The 3rd observation is impacted by1STREETTYPE (street type) and ADT (average daytime traffic).