Seattle Traffic

This document is also published at RPubs website

An interactive application for data visualization is hosted at Shiny.io website

In this document, I’m exploring and modeling the city of Seattle GIS data. The data is available at https://data-seattlecitygis.opendata.arcgis.com/search.

Summary

I used three datasets from the City of Seattle GIS data. Those are
1. traffic flows
2. streets
3. collisions.
I started by exploring the 3 sets separately, and then I merged them together using their geo-spatial proximity.
I then used R’s Tidymodels framework for doing unsupervised learning, with the aim of further exploring the data and finding patterns. I used two techniques, those are:
1. PCA for identifying linear relationships.
2. UMAP for identifying non-linear relationships.
Next I modeled the severity of a collision. That’s a multi-class classification problem. I explored various models (XGBoost, Random Forest & Linear Regression). I selected the best model and performed hyper parameter tuning using a racing technique.
Lastly, I used various techniques to interpret the resulting model. Those are:
1. Feature Importance for global interpretability.
2. LIME & SHAP for local interpretability.

Downloading the data

The date can be downloaded via several methods
1. REST API
2. Text .csv files
3. GIS .shp files

I chose to work with the GIS Shape files as they contain the geometries (i.e. geo-locations and shapes of the data), in contrast to the .csv files which don’t contain the geometries.

Traffic Flow

The website has traffic flow data sets starting from 2007 till 2018. The date for the years 2016, 2017 & 2018 are more comprehensive and has more features, so I will use only those latest three years.

I start by reading the data and skimming it.

data_folder <- "SeattleTraffic/data"
flows <- read_sf(data_folder, "2018_Traffic_Flow_Counts") %>% 
  mutate(across(where(~ is.character(.x) & length(unique(.x)) < 10), as.factor))
skim_without_charts(flows)

Data summary
Name	flows
Number of rows	1906
Number of columns	15
_______________________
Column type frequency:
character	3
Date	1
factor	3
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
COMPKEY	14	0.99	4	143	1892
STNAME_ORD	0	1.00	6	84	662
geometry	0	1.00	78	47618	1906

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
START_DATE	0	1	1970-01-01	2018-12-01	2015-11-17	357

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
DOWNTOWN	0	1.00	FALSE	3	N: 1491, Y: 351, B: 64
DATAQUALIT	0	1.00	FALSE	4	Est: 494, Stu: 492, Stu: 492, Stu: 428
FLAGS	1882	0.01	FALSE	2	Com: 15, WSD: 9

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
OBJECTID	0	1.00	961.66	555.39	1.00	482.25	962.50	1442.75	1923.00
FLOWSEGID	0	1.00	1034.33	579.72	3.00	544.25	1031.50	1538.75	3000.00
AMPK	1001	0.47	703.82	551.59	29.00	324.00	554.00	926.00	4569.00
PMPK	1001	0.47	774.95	583.42	39.00	362.00	624.00	1026.00	4756.00
AWDT	2	1.00	12120.34	10144.93	387.00	5000.00	9428.50	16000.00	108179.00
ADT	482	0.75	10473.26	9078.04	0.00	4307.25	8152.00	13655.00	102555.00
AWDT_ROUND	0	1.00	12144.28	10106.11	400.00	5000.00	9400.00	16000.00	98700.00
SHAPE_Leng	0	1.00	1321.33	1365.73	48.54	357.50	830.32	1759.56	13076.44

This is a simple dataset. It has 1906 rows & 15 columns. Each row observation is the traffic flow counts from a street segment. The data is aggregated over 1 year, i.e. one measurement per year.

Data Exploration

The data has the below variables:
##### Identifer Variables
1. OBJECTID is a unique identifier for the measured segment.
2. COMPKEY is the street segment number.
3. STNAME_ORD is the street segment name.
4. FLOWSEGID is an ID for the segment flow.

Categorical Variables

DOWNTOWN Flag whether segment is within the downtown corridor.
DATAQUALITY Specifies if the data is generated from an estimate or from a study. 25% of the data is an estimate.
FLAGS Flag for data from SDOT, WSDOT, or a combination of both. SDOT stands for Seattle city Department of Transportation, WSDOT stands for Washington state Department of Transportation. More than 99% of this data is missing.

Numerical Variables

AMPK is peak traffic in the AM hours.
PMPK is the peak traffic in the PM hours.
AWDT is the average weekday traffic, in thousands of vesicles.
ADT is the average day traffic, in thousands of vesicles.
AWDT_ROUNDED is just rounding AWDT to the nearest ’00.
SHAPE_Length is the length of the street segment.

Other Variables

START_DATE is the date on which the data count started. Please note that this is not a time series data; there is one reading for every street segment.
geometry is the geometry of the street segments. This column is not included in the csv file. I use this column for plotting the streets.

Below is simple geographical plot for AWDT_ROUND, which stands for Average Weekday Traffic (I chose this for plotting because it is least variable with missing data)

tm_shape(flows) +
  tm_lines(col = "AWDT_ROUND", lwd = 5, palette = "Blues", alpha = 1)

Summary

This data set is collected from the the traffic cameras feeds. The data set is small and has many estimated (25%), rather than measured data.

Collisions Dataset

As the above data is small (in terms of rows & number of features), I complement it with a more comprehensive data set that lists the vehicle collisions in Seattle. It is also available at the city of Seattle GIS data website.

# read the data
collisions <- read_sf(data_folder, "Collisions") %>% 
  mutate(across(where(~ is.character(.x) & length(unique(.x)) < 70), as.factor))

A very basic cleaning step for bi-level categorical (Y/N) variables:
a. Replace 1 with Y and 0 with N in UNDERINFL.
b. Replace NA with N in INATTENTIO, UNDERINFL, PEDROWNOTG, SPEEDING & HITPARKEDC.
c. Filter dates larger than 01-Jan-2004 as preceding observations were very few and not complete.

collisions_cleaned <- collisions %>%
  mutate(
    UNDERINFL = case_when(
      UNDERINFL == 1 ~ "Y",
      UNDERINFL == 0 ~ "N",
      TRUE ~ as.character(UNDERINFL)) %>% 
      as.factor(),
    across(all_of(c("INATTENTIO", "UNDERINFL", "PEDROWNOTG", "SPEEDING", "HITPARKEDC")),
                  ~replace_na(as.character(.x), "N") %>% as.factor())) %>% 
  filter(INCDATE >= "2004-01-01")

Skim the data for preliminary exploration

collisions_cleaned %>% 
  st_drop_geometry() %>% 
  skim()

Data summary
Name	Piped data
Number of rows	223673
Number of columns	38
_______________________
Column type frequency:
character	4
Date	1
factor	20
numeric	13
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
REPORTNO	0	1.00	1	9	223670
LOCATION	4617	0.98	17	90	25276
INCDTTM	0	1.00	8	22	170926
SDOTCOLNUM	96469	0.57	7	8	127185

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
INCDATE	0	1	2004-01-01	2020-12-30	2011-10-26	6209

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
STATUS	0	1.00	FALSE	2	Mat: 196523, Unm: 27150
ADDRTYPE	3736	0.98	FALSE	3	Blo: 146306, Int: 72750, All: 881
EXCEPTRSNC	211786	0.05	FALSE	1	NEI: 11887
EXCEPTRSND	211786	0.05	FALSE	1	Not: 11887
SEVERITYCO	1	1.00	FALSE	5	1: 138828, 2: 59334, 0: 22001, 2b: 3151
SEVERITYDE	0	1.00	FALSE	5	Pro: 138828, Inj: 59334, Unk: 22002, Ser: 3151
COLLISIONT	27222	0.88	FALSE	10	Par: 48740, Ang: 35849, Rea: 34907, Oth: 24781
JUNCTIONTY	12041	0.95	FALSE	7	Mid: 102909, At : 70037, Mid: 24466, Dri: 11506
SDOT_COLCO	0	1.00	FALSE	40	11: 93018, 14: 59922, 00: 19278, 16: 10998
SDOT_COLDE	0	1.00	FALSE	40	MOT: 93018, MOT: 59922, NOT: 19278, MOT: 10998
INATTENTIO	0	1.00	FALSE	2	N: 193478, Y: 30195
UNDERINFL	0	1.00	FALSE	2	N: 214044, Y: 9629
WEATHER	27413	0.88	FALSE	12	Cle: 115582, Rai: 34166, Ove: 28742, Unk: 15131
ROADCOND	27333	0.88	FALSE	9	Dry: 129587, Wet: 48949, Unk: 15161, Ice: 1233
LIGHTCOND	27504	0.88	FALSE	9	Day: 120312, Dar: 50463, Unk: 13549, Dus: 6121
PEDROWNOTG	0	1.00	FALSE	2	N: 218450, Y: 5223
SPEEDING	0	1.00	FALSE	2	N: 213658, Y: 10015
ST_COLCODE	27222	0.88	FALSE	62	32: 45100, 10: 35849, 14: 26353, 50: 14316
ST_COLDESC	27222	0.88	FALSE	62	One: 45100, Ent: 35849, Fro: 26353, Fix: 14316
HITPARKEDC	0	1.00	FALSE	2	N: 211132, Y: 12541

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
OBJECTID	0	1.00	111837.99	64569.00	1	55920	111838	167756	223674	▇▇▇▇▇
INCKEY	0	1.00	146654.68	90716.54	1001	72205	128328	212390	336687	▆▇▅▃▃
COLDETKEY	0	1.00	146895.90	91107.81	1001	72205	128328	212670	338187	▆▇▅▃▃
INTKEY	150923	0.33	37695.91	52206.87	23807	28655	29973	34008	764413	▇▁▁▁▁
PERSONCOUN	0	1.00	2.22	1.47	0	2	2	3	93	▇▁▁▁▁
PEDCOUNT	0	1.00	0.04	0.20	0	0	0	0	6	▇▁▁▁▁
PEDCYLCOUN	0	1.00	0.03	0.16	0	0	0	0	2	▇▁▁▁▁
VEHCOUNT	0	1.00	1.72	0.83	0	2	2	2	15	▇▁▁▁▁
INJURIES	0	1.00	0.37	0.73	0	0	0	1	78	▇▁▁▁▁
SERIOUSINJ	0	1.00	0.02	0.16	0	0	0	0	41	▇▁▁▁▁
FATALITIES	0	1.00	0.00	0.04	0	0	0	0	5	▇▁▁▁▁
SEGLANEKEY	0	1.00	267.45	3271.23	0	0	0	0	525241	▇▁▁▁▁
CROSSWALKK	0	1.00	9499.31	71173.48	0	0	0	0	5239700	▇▁▁▁▁

The data is compiled by SPD (Seattle Police Department), it is compiled from traffic accidents reports. The dataset had ~ 0.25M rows & 39 columns. It’s one row per accident. The data spans the period from Jan-2004 till Dec-2020.

Data Exploration

Identifier Variables

OBJECTID A unique identifier.
REPORTNO The report number.
INCKEY A unique key for the incident.
COLDETKEY Secondary key for the incident.
INTKEY A key that corresponds to the intersection associated with a collision.
SEGLANEKEY A key for the lane segment in which the collision occurred.
CROSSWALKK A key for the crosswalk at which the collision occurred.
SDOTCOLNUM A number given to the collision by SDOT.

Categorical Variables

STATUS No description is provided. It has two values, Matched & Unmatched.
ADDRTYPE Collision address type (Alley, Block, Intersection)
EXCEPTRSNC No description is provided, but it has only 1 value, so it has no predictive power and will be removed.
EXCEPTRSND A text description of the EXCEPTRSNC, so it is totally redundant and has no added predictive power.
SEVERITYCO A code that corresponds to the severity of the collision (5 levels)
SEVERITYDE A text description of the SEVERITYCO, so it is totally redundant and has no added predictive power.
COLLISIONT Collision type (i.e. parked car, pedestrian, head on, ….etc).
JUNCTIONTY Category of junction at which collision took place (i.e. intersection, mid-block, ramp junction, …etc).
SDOT_COLCO A code given to the collision by SDOT (Seattle Department of Traffic).
SDOT_COLDE A text description of SDOT_COLCO. It’s totally redundant and has no added predictive power.
INATTENTIO Whether or not collision was due to inattention.
UNDERINFL Whether or not a driver involved was under the influence of drugs or alcohol.
WEATHER A description of the weather conditions during the time of the collision.
ROADCOND The condition of the road during the collision.
LIGHTCOND The light conditions during the collision.
PEDROWNOTG Whether or not the pedestrian right of way was not granted.
SPEEDING Whether or not speeding was a factor in the collision.
ST_COLCODE A code provided by the state that describes the collision.
ST_COLDESC A text description of ST_COLCODE. It’s totally redundant and has no added predictive power.
HITPARKEDC Whether or not the collision involved hitting a parked car.

Numerical Variables

PERSONCOUN The total number of people involved in the collision.
PEDCOUNT The number of pedestrians involved in the collision.
PEDCYLCOUN The number of bicycles involved in the collision.
VEHCOUNT The number of vehicles involved in the collision.
INJURIES The number of total injuries in the collision.
SERIOUSINJ The number of serious injuries in the collision.
FATALITIES The number of fatalities in the collision.

Other Variables

INCDATE The date of the incident. The data spans the period from Oct-2003 till Dec-2020.
LOCATION Description of the general location of the collision.
INCDTTM The date and time of the incident.
geometry The geometric location of the accident.

Interactive Data Exploration

Please use the interactive version hosted at Shiny.io to interact with the plots
In the below, I use Shiny code for interactive data exploration.

Map Plot

Geographical locations of the collisions, coloured by the number of injuries.

collisions_cleaned %>% 
  tm_shape() +
  tm_dots(col = "INJURIES", size = "PERSONCOUN")

Bar Plot

Explore the distribution of variables. I plot a bar plot and make a percentage table.

collisions_cleaned %>% 
  ggplot(aes(x = INJURIES)) +
  geom_bar()

collisions_cleaned %>% 
  as_tibble() %>% 
  select(-geometry) %>% 
  tabyl(INJURIES) %>% 
  adorn_pct_formatting() %>% 
  rename(Count = n, Percent = percent) %>% 
  kable()

INJURIES	Count	Percent
0	161070	72.0%
1	47802	21.4%
2	10779	4.8%
3	2743	1.2%
4	826	0.4%
5	275	0.1%
6	100	0.0%
7	40	0.0%
8	12	0.0%
9	10	0.0%
10	6	0.0%
11	5	0.0%
12	1	0.0%
13	2	0.0%
15	1	0.0%
78	1	0.0%

Most accidents have no injuries

Time Plot

Below I make a time series plot of the number of accidents

collisions_cleaned %>% 
  count(INCDATE) %>% 
  ggplot(aes(x = INCDATE, y = n)) +
  geom_line()

A noticeable decrease in 2020 due to COVID-19 lowdown.

Seattle Streets

Now, I’m enriching the data with a dataset for Seattle streets

data_folder <- "SeattleTraffic/data"

# read the data
streets <- read_sf(data_folder, "Seattle_Streets") %>% 
  mutate(across(where(~ is.character(.x) & length(unique(.x)) < 20), as.factor))

# convert the below numeric columns to factors, as they represent categorical features
num_to_cat <- c("ARTCLASS", "TRANCLASS")
streets <- streets %>% 
  mutate(across(all_of(num_to_cat), as.factor))

skim_without_charts(streets)

Data summary
Name	streets
Number of rows	23806
Number of columns	39
_______________________
Column type frequency:
character	10
factor	18
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
UNITID	5	1	5	5	2562
UNITID2	5	1	4	5	941
UNITIDSORT	5	1	9	10	23801
UNITDESC	5	1	35	92	23800
STNAME_ORD	0	1	5	25	2409
XSTRLO	5	1	5	30	3682
XSTRHI	5	1	5	30	3684
INTRLO	8	1	18	63	14108
INTRHI	10	1	18	63	14234
geometry	0	1	70	18415	23806

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
ARTCLASS	5	1.00	FALSE	7	0: 16957, 2: 2459, 1: 2197, 3: 1888
ARTDESCRIP	5	1.00	FALSE	7	Not: 16957, Min: 2459, Pri: 2197, Col: 1888
OWNER	23062	0.03	FALSE	7	WSD: 641, PRI: 74, PAR: 12, SDO: 12
STATUS	50	1.00	FALSE	5	INS: 23726, OUT: 13, TEM: 13, PLA: 3
SEGDIR	6	1.00	FALSE	8	E: 6771, N: 6554, S: 4028, W: 3255
ONEWAY	10	1.00	FALSE	2	N: 22227, Y: 1569
ONEWAYDIR	22235	0.07	FALSE	8	S: 336, N: 327, NW: 174, SE: 169
FLOW	22237	0.07	FALSE	2	F: 903, T: 666
SURFACETYP	1198	0.95	FALSE	6	PCC: 10580, ST: 4179, AC/: 3818, AC: 3698
SURFACET_1	23123	0.03	FALSE	5	PCC: 451, AC: 105, AC/: 94, ST: 29
DIRLO	6	1.00	FALSE	8	E: 6771, N: 6554, S: 4028, W: 3255
DIRHI	6	1.00	FALSE	8	W: 6771, S: 6554, N: 4028, E: 3255
NATIONHWYS	5	1.00	FALSE	2	N: 22004, Y: 1797
STREETTYPE	793	0.97	FALSE	12	Nei: 12799, Urb: 3074, Nei: 2193, Urb: 1186
TRANCLASS	0	1.00	FALSE	7	0: 18534, 3: 3081, 2: 1737, 1: 225
TRANDESCRI	187	0.99	FALSE	7	NOT: 18347, MIN: 3081, MAJ: 1737, PRI: 225
PVMTCATEGO	81	1.00	FALSE	6	NON: 16747, ART: 6525, NON: 280, MLT: 147
PARKBOULEV	5	1.00	FALSE	2	N: 23636, Y: 165

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
OBJECTID	0	1	11903.50	6872.34	1.0	5952.25	11903.50	17854.75	23806.00
COMPKEY	0	1	35005.36	99491.72	1000.0	7045.25	13159.50	19145.75	765446.00
BLOCKNBR	5	1	3866.84	3259.50	0.0	1300.00	3100.00	5500.00	77000.00
SPEEDLIMIT	6	1	22.07	4.61	0.0	20.00	20.00	25.00	60.00
SEGLENGTH	5	1	423.16	431.88	8.0	260.00	326.00	525.00	15035.00
SURFACEWID	8	1	25.37	15.31	0.0	21.00	25.00	32.00	295.00
INTKEYLO	8	1	49966.98	89615.33	23806.0	27379.25	31092.00	35143.75	765443.00
PVMTCONDIN	5	1	51.40	32.67	0.0	22.00	60.00	79.00	100.00
PVMTCOND_1	5	1	2.03	12.58	0.0	0.00	0.00	0.00	100.00
SLOPE_PCT	18	1	3.97	3.83	0.0	1.00	3.00	6.00	47.00
SHAPE_Leng	0	1	423.08	434.16	8.4	260.01	325.94	523.43	15035.23

tm_shape(streets) +
  tm_lines(col = "ONEWAY", lwd = 5, palette = "Blues", alpha = 1)

Data Exploration

Identifier Variables

OBJECTID ESRI unique identifier
COMPKEY Primary key of the Street asset table
UNITIDSORT Alpha-numeric unique identifier

Categorical Variables

ARTCLASS Arterial classification code
ARTDESCRIP A test description of ARTCLASS
OWNER The organization that owns the street
STATUS Current street status
SEGDIR Street segment direction
ONEWAY One Way Street (Y/N)
ONEWAYDIR One Way Street traffic flow direction
FLOW One Way Street traffic flow classification
SURFACETYP Primary pavement used on Street surface
SURFACET_1 Secondary pavement used on Street surface
DIRLO Direction of low address end of segment
DIRHI Direction of high address end of segment
NATIONHWYS Whether the street is part of the National Highway System
TRANCLASS Street transit classification
TRANDESCRI A text description of TRANCLASS
PVMTCATEGO Pavement category
PARKBOULEV Is parking allowed (Y/N)
STREETTYPE Street type classification

Character Variables

UNITID No description is provided
UNITID2 No description is provided
UNITDESC Structured description of the Street location
STNAME_ORD Street segment name
XSTRLO No description is provided
XSTRHI No description is provided
INTRLO Description of the intersection location with cross street at high address end of segment
INTRHI Description of the intersection location with cross street at high address end of segment

Numeric Variables

BLOCKNBR ID of block street runs adjacent to
SPEEDLIMIT Street speed limit in MPH
SEGLENGTH Street segment length in feet
SURFACEWID Street segment width in feet
INTKEYLO Intersection key at low address end of segment
PVMTCONDIN Primary pavement condition, out of 100
PVMTCOND_1 Secondary pavement condition, out of 100
SLOPE_PCT Street grade in slope percentage
SHAPE_Leng The length of the GIS feature

Joining the datasets

The 3 datasets has no common ID to use for joining them together, hence I had to resort to joining them using GIS libraries to calculate their proximity together.
I want to map each collision to its street in the streets dataset and in the flows dataset.

The collisions dataset is large, so I used parallel backend to perform the calculation in parallel on multiple cores

First, write a function to identify the nearest street or flow.
And then use a “split, apply and combine strategy”
1. Split the collisions into smaller sub data frames
2. Apply the function to each sub data frame
3. Combine the results together

This is the function to identify the nearest line (e.g. a street) to a given point, (e.g. a collision location)

#' Identify the Nearest sf Line Object to an sf Point Object
#' Identifies the nearest sf line object (for example a street) to a given
#' sf point object (e.g. a specific location)
#'
#' @param sfp sf point object
#' @param sfl sf line object
#' @param .by the name of a unique identifier column in the sfl object
#' @param .name the name given to the new sfp object id
#'
#' @return the sfp object after adding a column for the nearest sfl object
#' @export
#'
#' @examples
#' @todo modify the function to reject objects further apart than a given 
#' threshold
identify_nearest_line <- function(sfp, sfl, .by, .name) {
  dist_matrix <- st_distance(sfp, sfl, by_element = FALSE) 
  min_index <- apply(dist_matrix, 1, which.min) %>% 
    map_int(~ .x[1])
  min_dist <- diag(dist_matrix[,min_index])
  sfp[[.name]] <- sfl[[.by]][min_index]
  return(sfp)
}

This is a generic function to apply any given function in parallel

#' Apply a Function in Parallel
#' Applies a given function to a dataframe in parallel
#' The function uses a "split, apply then combine strategy"  
#' 1. Split the dataframe into smaller sub dataframes  
#' 2. Apply the function to each sub dataframe  
#' 3. Combine the results together; in this case using bind_rows  
#' @param df a dataframe to apply the function on
#' @param fct a function to apply
#' @param workers number of parallel workers to use
#' @param split_size the size (number or rows) of each split
#' @param ... 
#'
#' @return
#' @export
#'
#' @examples
apply_in_parallel <- function(df, fct, workers = 4, split_size = 3000, ...) {
  require(furrr)
  
  nr <- nrow(df)
  n <- split_size
  
  if(workers > 1) {
    plan(multisession, workers = workers)
  }
  processed <- df %>% 
    split(rep(1:ceiling(nr/n), each=n, length.out=nr)) %>% 
      future_map_dfr(fct, ...) %>% 
      ungroup()
  plan(sequential)
  return(processed)
}

Before anything else, the GIS objects should be projected onto a planar metric projection. This is necessary to make sure that the distance metrics are calculated in meters, rather than degrees. This region of North America is suited to be projected onto EPSG:32610 or EPSG:26910. The former uses WGS84 geographic 2D CRS as its base CRS and UTM zone 10N as its projection, while the later uses the NAD83 geographic CRS and UTM zone 10N.

collisions_cleaned <- collisions_cleaned %>% 
  st_transform(crs = 32610)
flows <- flows %>% 
  st_transform(crs = 32610)
streets <- streets %>% 
  st_transform(crs = 32610)

Now, extract the nearest flow to each collision location

tic()
collisions_new <- apply_in_parallel(
  collisions_cleaned,
  identify_nearest_line,
  split_size = 1000,
  workers = 10,
  sfl = flows,
  .by = "OBJECTID",
  .name = "nearest_flow_id")
toc()

And then extract the nearest street to each collision location

tic()
collisions_new <- apply_in_parallel(
  collisions_new,
  identify_nearest_line,
  split_size = 1000,
  workers = 10,
  sfl = streets,
  .by = "OBJECTID",
  .name = "nearest_street_id")
toc()

Now, join collisions with both flows and streets

collisions_joined <- collisions_new %>% 
  inner_join(st_drop_geometry(streets), by = c("nearest_street_id" = "OBJECTID")) %>% 
  inner_join(st_drop_geometry(flows), by = c("nearest_flow_id" = "OBJECTID"))

Not too late, and for the better, I discovered an alternate method for joining using the ’sf` package. Its algorithm is optimized and it runs much faster.

collisions_joined <- collisions_cleaned %>% 
  st_join(streets, join = st_nearest_feature, left = FALSE, suffix = c(".coll", ".strt")) %>% 
  st_join(flows, join = st_nearest_feature, left = FALSE, suffix = c(".colst", ".flow"))

Now, the resulting data frame has 91 features. It’s time to extract lat & long.

collisions_joined_df <- collisions_joined %>% 
  st_coordinates() %>% 
  as_tibble() %>% 
  bind_cols(collisions_joined) %>% 
  select(-geometry) %>% 
  rename (XCord = X, YCord = Y)
skim_without_charts(collisions_joined_df)

Data summary
Name	collisions_joined_df
Number of rows	223673
Number of columns	92
_______________________
Column type frequency:
character	15
Date	2
factor	41
numeric	34
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
REPORTNO	0	1.00	1	9	223670
LOCATION	4617	0.98	17	90	25276
INCDTTM	0	1.00	8	22	170926
SDOTCOLNUM	96469	0.57	7	8	127185
UNITID	7507	0.97	5	5	2111
UNITID2	7507	0.97	4	5	795
UNITIDSORT	7507	0.97	9	10	18343
UNITDESC	7507	0.97	35	90	18342
STNAME_ORD.colst	7507	0.97	5	25	2017
XSTRLO	7507	0.97	5	30	3120
XSTRHI	7507	0.97	5	30	3154
INTRLO	7508	0.97	18	63	11817
INTRHI	7525	0.97	18	63	12047
COMPKEY.flow	7814	0.97	4	143	1886
STNAME_ORD.flow	7507	0.97	6	84	659

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
INCDATE	0	1.00	2004-01-01	2020-12-30	2011-10-26	6209
START_DATE	7507	0.97	1970-01-01	2018-12-01	2015-08-20	356

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
STATUS.coll	0	1.00	FALSE	2	Mat: 196523, Unm: 27150
ADDRTYPE	3736	0.98	FALSE	3	Blo: 146306, Int: 72750, All: 881
EXCEPTRSNC	211786	0.05	FALSE	1	NEI: 11887
EXCEPTRSND	211786	0.05	FALSE	1	Not: 11887
SEVERITYCO	1	1.00	FALSE	5	1: 138828, 2: 59334, 0: 22001, 2b: 3151
SEVERITYDE	0	1.00	FALSE	5	Pro: 138828, Inj: 59334, Unk: 22002, Ser: 3151
COLLISIONT	27222	0.88	FALSE	10	Par: 48740, Ang: 35849, Rea: 34907, Oth: 24781
JUNCTIONTY	12041	0.95	FALSE	7	Mid: 102909, At : 70037, Mid: 24466, Dri: 11506
SDOT_COLCO	0	1.00	FALSE	40	11: 93018, 14: 59922, 00: 19278, 16: 10998
SDOT_COLDE	0	1.00	FALSE	40	MOT: 93018, MOT: 59922, NOT: 19278, MOT: 10998
INATTENTIO	0	1.00	FALSE	2	N: 193478, Y: 30195
UNDERINFL	0	1.00	FALSE	2	N: 214044, Y: 9629
WEATHER	27413	0.88	FALSE	12	Cle: 115582, Rai: 34166, Ove: 28742, Unk: 15131
ROADCOND	27333	0.88	FALSE	9	Dry: 129587, Wet: 48949, Unk: 15161, Ice: 1233
LIGHTCOND	27504	0.88	FALSE	9	Day: 120312, Dar: 50463, Unk: 13549, Dus: 6121
PEDROWNOTG	0	1.00	FALSE	2	N: 218450, Y: 5223
SPEEDING	0	1.00	FALSE	2	N: 213658, Y: 10015
ST_COLCODE	27222	0.88	FALSE	62	32: 45100, 10: 35849, 14: 26353, 50: 14316
ST_COLDESC	27222	0.88	FALSE	62	One: 45100, Ent: 35849, Fro: 26353, Fix: 14316
HITPARKEDC	0	1.00	FALSE	2	N: 211132, Y: 12541
ARTCLASS	7507	0.97	FALSE	6	1: 83513, 0: 67646, 2: 47940, 3: 16003
ARTDESCRIP	7507	0.97	FALSE	6	Pri: 83513, Not: 67646, Min: 47940, Col: 16003
OWNER	208526	0.07	FALSE	6	WSD: 14795, SDO: 189, PRI: 116, OTH: 32
STATUS.strt	7646	0.97	FALSE	4	INS: 215869, TEM: 94, OUT: 59, PLA: 5
SEGDIR	7533	0.97	FALSE	8	N: 57721, E: 54057, S: 37748, NW: 19059
ONEWAY	7509	0.97	FALSE	2	N: 187043, Y: 29121
ONEWAYDIR	194564	0.13	FALSE	8	N: 5126, S: 4635, NE: 4198, SE: 4195
FLOW	194552	0.13	FALSE	2	F: 15330, T: 13791
SURFACETYP	20493	0.91	FALSE	6	AC/: 90399, PCC: 74462, AC: 25198, ST: 11007
SURFACET_1	205367	0.08	FALSE	5	PCC: 13699, AC/: 2569, AC: 1782, ST: 201
DIRLO	7533	0.97	FALSE	8	N: 57721, E: 54057, S: 37748, NW: 19059
DIRHI	7533	0.97	FALSE	8	S: 57721, W: 54057, N: 37748, SE: 19059
NATIONHWYS	7507	0.97	FALSE	2	N: 149430, Y: 66736
STREETTYPE	13864	0.94	FALSE	12	Urb: 35142, Nei: 32536, Urb: 27694, Urb: 27379
TRANCLASS	7507	0.97	FALSE	7	0: 85245, 2: 61048, 3: 57602, 1: 10671
TRANDESCRI	8029	0.96	FALSE	7	NOT: 84723, MAJ: 61048, MIN: 57602, PRI: 10671
PVMTCATEGO	7738	0.97	FALSE	6	ART: 147458, NON: 67328, NON: 875, MLT: 250
PARKBOULEV	7507	0.97	FALSE	2	N: 214852, Y: 1314
DOWNTOWN	7507	0.97	FALSE	3	N: 183577, Y: 23527, B: 9062
DATAQUALIT	7507	0.97	FALSE	4	Stu: 86922, Est: 48121, Stu: 44380, Stu: 36743
FLAGS	217771	0.03	FALSE	2	Com: 4902, WSD: 1000

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
XCord	7507	0.97	550288.78	2269.22	543148.85	548884.15	550320.72	551669.39	557305.76
YCord	7507	0.97	5274302.58	6232.62	5260500.03	5269494.43	5273844.63	5279192.37	5286965.89
OBJECTID.coll	0	1.00	111837.99	64569.00	1.00	55920.00	111838.00	167756.00	223674.00
INCKEY	0	1.00	146654.68	90716.54	1001.00	72205.00	128328.00	212390.00	336687.00
COLDETKEY	0	1.00	146895.90	91107.81	1001.00	72205.00	128328.00	212670.00	338187.00
INTKEY	150923	0.33	37695.91	52206.87	23807.00	28655.00	29973.00	34008.00	764413.00
PERSONCOUN	0	1.00	2.22	1.47	0.00	2.00	2.00	3.00	93.00
PEDCOUNT	0	1.00	0.04	0.20	0.00	0.00	0.00	0.00	6.00
PEDCYLCOUN	0	1.00	0.03	0.16	0.00	0.00	0.00	0.00	2.00
VEHCOUNT	0	1.00	1.72	0.83	0.00	2.00	2.00	2.00	15.00
INJURIES	0	1.00	0.37	0.73	0.00	0.00	0.00	1.00	78.00
SERIOUSINJ	0	1.00	0.02	0.16	0.00	0.00	0.00	0.00	41.00
FATALITIES	0	1.00	0.00	0.04	0.00	0.00	0.00	0.00	5.00
SEGLANEKEY	0	1.00	267.45	3271.23	0.00	0.00	0.00	0.00	525241.00
CROSSWALKK	0	1.00	9499.31	71173.48	0.00	0.00	0.00	0.00	5239700.00
OBJECTID.strt	7507	0.97	11863.83	6962.27	1.00	6098.00	11911.00	17917.00	23806.00
COMPKEY.colst	7507	0.97	24579.89	77105.30	1000.00	7906.00	11851.00	16068.00	764415.00
BLOCKNBR	7507	0.97	3359.21	3325.00	0.00	900.00	2100.00	5000.00	77000.00
SPEEDLIMIT	7508	0.97	24.95	5.13	0.00	20.00	25.00	25.00	60.00
SEGLENGTH	7507	0.97	488.62	402.06	16.00	298.00	360.00	647.00	15035.00
SURFACEWID	7509	0.97	37.55	19.11	0.00	25.00	40.00	50.00	295.00
INTKEYLO	7508	0.97	42548.52	72848.18	23806.00	28492.00	30348.00	34085.00	765442.00
PVMTCONDIN	7507	0.97	54.59	29.71	0.00	35.00	61.00	78.00	100.00
PVMTCOND_1	7507	0.97	6.53	22.33	0.00	0.00	0.00	0.00	100.00
SLOPE_PCT	7537	0.97	3.04	3.13	0.00	1.00	2.00	4.00	32.00
SHAPE_Leng.colst	7507	0.97	488.17	400.68	16.37	297.18	360.07	647.01	15035.23
OBJECTID	7507	0.97	952.47	551.87	1.00	474.00	936.00	1432.00	1923.00
FLOWSEGID	7507	0.97	1010.91	620.99	3.00	439.00	1016.00	1575.00	3000.00
AMPK	148022	0.34	885.06	655.18	29.00	418.00	743.00	1152.00	4569.00
PMPK	148022	0.34	977.93	694.67	39.00	497.00	828.00	1278.00	4756.00
AWDT	8022	0.96	16529.50	12143.95	387.00	8382.00	13500.00	21558.00	108179.00
ADT	52687	0.76	14327.72	11045.04	0.00	7056.00	11955.00	19092.00	102555.00
AWDT_ROUND	7507	0.97	16690.85	12084.62	400.00	8400.00	13600.00	22100.00	98700.00
SHAPE_Leng.flow	7507	0.97	2213.18	1860.37	48.54	856.21	1728.97	3020.21	13076.44

Modelling with Tidymodels

Tidymodels is a new modeling framework for R developed by RStudio. In my openion, the Tidymodels philosophy and syntax is more intuitive and simpler than scikit-learn, letting the data scientist focus more on the underlying model rather than the syntax. The Tidymodels makes it possible to define a recipe for the pre-processing steps and a specification for the model to use.

It is useful to model the number of injuries in a collision. But before hand, I plan to explore the data with PCA & UMAP.

Final Cleansing

Some of the columns in the final joined dataset will be removed before modeling. Those are: 1. ID columns.
2. Name columns, like street names or intersection names.
3. Redundant columns, as some features are represented by two columns, where one of them is a numeric code, and tge other is a mere description of that code.

# id columns, I'm excluding OBJECTID.coll to allow tracking the data later
id_cols <-c("COMPKEY.colst", "UNITIDSORT", "REPORTNO", "INCKEY", "COLDETKEY",
            "INTKEY", "SEGLANEKEY", 'CROSSWALKK', 'SDOTCOLNUM', 'OBJECTID.strt',
            "FLOWSEGID", "COMPKEY.colst", "OBJECTID", "BLOCKNBR", "INTKEYLO")

redund_cols <- c("SHAPE_Leng.flow", "EXCEPTRSND", "SEVERITYDE", "SDOT_COLDE",
                 "ST_COLDESC", "ARTDESCRIP", "TRANDESCRI", "START_DATE",
                 "SHAPE_Leng.colst", "EXCEPTRSNC")

data <- collisions_joined_df %>% 
  mutate(HOUR = str_extract(INCDTTM, "(?<=\\s).*") %>%
           as.POSIXct(format="%I:%M:%S %p") %>% 
           lubridate::hour()) %>% 
  select(-where(is.character),
         -all_of(id_cols),
         -all_of(redund_cols)) %>% 
  rename(COLLISION_ID = OBJECTID.coll) %>% 
  select(COLLISION_ID, everything())

I encode the Date & Hour as cyclic values using sin & cos. This is useful to preserve the cyclic nature of dates & times (for example, hour 23 is as close to hour 0 as it is close to hour 22, Jan is close to Dec as it is to Feb)

data <- lubridate::cyclic_encoding(data$INCDATE, periods = "month") %>% 
  as_tibble() %>% 
  rename(MONTH.SIN = sin.month,
         MONTH.COS = cos.month) %>% 
  bind_cols(data) %>% 
  mutate(HOUR.COS = cos(2*3.14*HOUR/24),
         HOUR.SIN = sin(2*3.14*HOUR/24),
         .keep = "unused")
saveRDS(data, 
        paste0("data_", 
               gsub("-", "", lubridate::today()),
               ".rds"))

PCA

I will make a PCA to explore the linear relationship between the variables.

I use the Tidymodels recipe to create a specification or recipe for pre-processing steps, those are:
a. One-hot encoding using step_dummy()
b. Normalization (scaling & standardization) using step_normalise. This is necessary since PCA requires all its inputs to be normalized.

Notes
* step_unknown() creates a new unknown level even for columns with no missing values. This creates a new all NA column. It’s necessary to add step_zv() or step_naomit() after step_unkown().
* step_unknown() creates a new unknown level, wheras my data already had some Unknown level. I used step_mutate() with fct_collapse() to collapse those 2 levels.

Create a PCA `recipe`

pca_rec <- recipe(~., data = data) %>%
  update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
  step_unknown(all_nominal(), -has_role('ID')) %>% 
  step_mutate(
    LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
    JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
    WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
    ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>% 
  step_medianimpute(all_numeric(), -has_role('ID')) %>% 
  step_dummy(all_nominal(), -has_role('ID')) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors())
pca_rec

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          2
##  predictor         58
## 
## Operations:
## 
## Unknown factor level assignment for all_nominal(), -has_role("ID")
## Variable mutation for LIGHTCOND, JUNCTIONTY, WEATHER, ROADCOND
## Median Imputation for all_numeric(), -has_role("ID")
## Dummy variables from all_nominal(), -has_role("ID")
## Zero variance filter on all_predictors()
## Centering and scaling for all_predictors()
## No PCA components were extracted.

Now, train the recipe

pca_prep <- prep(pca_rec)

I now plot a scree plot for the PCA results, this plot shows the amount of variance captured by each principle component.

pca_step <- 7
pca_variance <- tidy(pca_prep, pca_step, type = "variance") %>% 
  select(-id)
p <- pca_variance %>% 
  filter(terms == "percent variance") %>% 
  ggplot(aes(x = component, y = value)) +
  geom_point() +
  geom_line(col = "grey")
print(p)

pca_variance %>% 
  pivot_wider(names_from = terms, values_from = value) %>% 
  select(component, contains("percent")) %>% 
  slice_head(n = 20) %>% 
  kable()

component	percent variance	cumulative percent variance
1	5.1045232	5.104523
2	3.0152155	8.119739
3	2.4851997	10.604938
4	2.3507613	12.955700
5	1.9797727	14.935472
6	1.6870924	16.622565
7	1.5307607	18.153326
8	1.4802663	19.633592
9	1.3542519	20.987844
10	1.3425025	22.330346
11	1.2487346	23.579081
12	1.2439870	24.823068
13	1.1554448	25.978512
14	1.1026987	27.081211
15	1.0701438	28.151355
16	1.0146546	29.166010
17	0.8966015	30.062611
18	0.8564496	30.919061
19	0.7913306	31.710391
20	0.7878970	32.498288

Takeaways

The 1st PC accounts for 5% of the variance, and the 2nd for 3%. After the 4th component, the contribution of each PC is very small (less than 2%).
Below, I will study the loading of each variable on each of the 1st 4 PCs. I’m considering only the top 10 values.

library(tidytext)
pca_loading <- tidy(pca_prep, pca_step, type = "coef")
pca_loading %>%
  filter(component %in% paste0("PC", 1:4)) %>%
  group_by(component) %>%
  top_n(6, abs(value)) %>%
  ungroup() %>%
  mutate(terms = reorder_within(terms, abs(value), component)) %>%
  ggplot(aes(abs(value), terms, fill = value > 0)) +
  geom_col() +
  facet_wrap(~component, scales = "free_y", dir = "v") +
  scale_y_reordered() +
  labs(
    x = "Absolute value of contribution",
    y = NULL, fill = "Positive?"
  )

Takeaways

PC1 captures the categorical variables with unknown values. As they all have the same sign, then they are positively correlated together. For example, an observation with an unknown PARKBOULEV is likely to have an unknown DOWNTOWN.
PC2 shows a negative correlation between ARTCLASS_X1 (Principal Arterial streets) and unknown measurements such as WEATHER, ROADCOND & FLOW. This means that Principal Arterial streets tend to have known road condition, flow counts and weather condition at the time of a collision.
PC3 reveals some rational findings, but now they are backed up by the data. For example, DOWNTOWN_Y is positively correlated with ONEWAY_Y. In other words, streets in downtown tend to be one-way.
PC4 shows that when ADDRTYPE = “Block”, the other features tend to be unknown. In other words, those features are not recorded for “Block” addresses. This also seems rational as the block addresses are very small & least busy streets.

UMAP

UMAP is a new technique devised by McInnes et al in 2018 that helps to visualize and understand high dimensional data. In contrast to PCA, it captures non-linear relationships.

library(embed)

neighbors = 30
min_dist = 0.01

umap_rec <- recipe(~., data = data) %>%
  update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
  step_unknown(all_nominal(), -has_role('ID')) %>% 
  step_mutate(
    LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
    JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
    WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
    ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>% 
  step_medianimpute(all_numeric(), -has_role('ID')) %>% 
  step_dummy(all_nominal(), -has_role('ID')) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_umap(all_predictors(), neighbors = neighbors, min_dist = min_dist)

umap_prep <- prep(umap_rec)
umap_prep

saveRDS(umap_prep, "umapPrep_20210214.rds")

umap_prep <- readRDS("umapPrep_20210214.rds")
umap_proj <- juice(umap_prep)

Plotting the UMAP 2D projections

umap_proj %>%
  inner_join(data, by = "COLLISION_ID") %>% 
  ggplot(aes(umap_1, umap_2)) +
  geom_point(aes(color = SEVERITYCO), alpha = 0.7, size = 2) +
  labs(color = NULL) +
  ggtitle("UMAP 2D Projection", subtitle = "Coloured by Collision Severity")

umap_proj %>%
  inner_join(data, by = "COLLISION_ID") %>% 
  ggplot(aes(umap_1, umap_2)) +
  geom_point(aes(color = DOWNTOWN), alpha = 0.7, size = 2) +
  labs(color = NULL) +
  ggtitle("UMAP 2D Projection", subtitle = "Coloured by DOWNTOWN")

umap_proj %>%
  inner_join(data, by = "COLLISION_ID") %>% 
  ggplot(aes(umap_1, umap_2)) +
  geom_point(aes(color = ADDRTYPE), alpha = 0.7, size = 2) +
  labs(color = NULL) +
  ggtitle("UMAP 2D Projection", subtitle = "Coloured by Address Type")

Takeaways

Collision severity class 0 (0 = unknown severity) forms a clear distinct cluster.
Same for NA Address Type and NA DOWNTOWN.
Same, but with less distinction, is for Alley Address Types
DOWNTOWN Y forms somehow a distinct cluster from DOWNTOWN N. The distinction is fuzzy though.

Predictive Modelling

In this section, I will do a predictive modeling for the Severity Class of a collision. It’s a multiclass classification problem with 5 classes.

data %>% 
  ggplot(aes(x = SEVERITYCO)) +
  geom_bar()

Again, I will use the Tidymodels framework for its simplicity & versatility in defining various recipes and model specification.

I start by testing the below 3 models:
1. Logistic Regression It is a simple and fast model. This is a baseline for comparison with other advanced models.
2. Random Forest It is an ensemble learning method that constructs a multitude of decision trees. It was first devised by Tin Kam Ho in 1998 and later extended by Leo Breiman in 2001.
3. XGBoost It is a new algorithm, devised by Tianqi Chen et al in 2016. It is an ensemble tree boosting algorithm, and it had proved to give very good results in different scenarios.

I start with a train test split

library(tidymodels)
set.seed(104)
train_test_split <- data %>% 
  filter(!is.na(SEVERITYCO)) %>% 
  initial_split(prop = 4/5, strata = 'SEVERITYCO')
data_train <- training(train_test_split)
data_test <- testing(train_test_split)

And then I create various recipes (a recipe is a specification for pre-processing steps). Different classification models require different recipes. For example a LR model requires normalization of the inputs, whereas a tree model doesn’t. Also, a LR model would benefit form a PCA step (as it removes co-linearity), whereas this step is not required in a tree model.

The main steps included in the LR recipe are:
1. Convert missing values in categorical features to a new category named “unknown”.
2. Median imputation for missing values in numeric features.
3. One-hot encoding for categorical variables.
4. Remove zero variance features (features that have a single value).
5. Scaling & standardization.
6. PCA, retaining 200 PCs.
7. SMOTE up-sampling as the data is unbalanced.
The DT recipe excludes the last 2 steps

library(themis)
# a recipe for logistic regression
lr_rec <- recipe(SEVERITYCO ~ ., data = data) %>%
  update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
  step_unknown(all_nominal(), -has_role('ID'), -all_outcomes()) %>% 
  step_mutate(
    LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
    JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
    WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
    ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>% 
  step_medianimpute(all_numeric(), -has_role('ID')) %>% 
  step_dummy(all_nominal(), -has_role('ID'), -all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors(), num_comp = 200) %>% 
  step_smote(SEVERITYCO)
lr_rec$name <- "LR Recipe"

# Decision tree recipe, to be used with Random Forest & XGBoost
dt_rec <- recipe(SEVERITYCO ~ ., data = data) %>%
  update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
  step_unknown(all_nominal(), -has_role('ID'), -all_outcomes()) %>% 
  step_mutate(
    LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
    JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
    WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
    ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>% 
  step_medianimpute(all_numeric(), -has_role('ID')) %>% 
  step_dummy(all_nominal(), -has_role('ID'), -all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_smote(SEVERITYCO)
dt_rec$name <- "DT Recipe"

Now, I define the model specifications. I define the below 3 specifications:
1. Multinomial logistic regression.
2. Random forest.
3. XGBoost.

#' Multinomial Regression
lr_clf <-
  multinom_reg(
    penalty = tune(),
    mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")
lr_clf$name <- 'glmnet'
params <- lr_clf %>% 
  parameters() 
lr_grid <- grid_max_entropy(params, size = 6)
#lr_grid

# Random Forest
rf_clf <- 
  rand_forest(
    trees = tune(),
    min_n = tune()) %>%
  set_engine(
    "ranger",
    num.threads = 11,
    verbose = TRUE,
    seed = 123) %>%
  set_mode("classification")
rf_clf$name <- 'ranger'
params <- rf_clf %>% 
  parameters() 
rf_grid <- grid_max_entropy(params, size = 6)
#rf_grid

#' XGBoost
xg_clf <-
  boost_tree(
    tree_depth = tune(),
    trees = 300,
    learn_rate = 0.01,
    sample_size = tune(),
    mtry = tune(),
    min_n = tune()) %>%
  set_engine(
    "xgboost",
    nthread = 11,
    verbosity = 2) %>%
  set_mode("classification")
xg_clf$name <- 'XGBoost'
params <- xg_clf %>% 
  parameters() %>% 
  update(mtry = predictor_prop())
xg_grid <- grid_max_entropy(params, size = 12)
#xg_grid

# custom list
custom_list <- list(
  list(lr_rec, lr_clf, lr_grid, "None"),
  list(dt_rec, rf_clf, rf_grid, "None"),
  list(dt_rec, xg_clf, xg_grid, "None")) %>% 
  transpose(
    .names = c("data_recipe", "clf", "clf_grid", "parallel_backend")) %>% 
  as_tibble(.name_repair = "minimal")
class(custom_list) <- c("tune_spec", "tbl_df", "tbl", "data.frame")
#print(summary(custom_list))

Below is the initial training of the 3 candidate models. I don’t do a exhaustive hyper-parameter tuning as my aim here is just an initial comparison between the 3 models. Once I settle on a choice model, I will do a more exhaustive search for the optimum hyper-parameters.

library(tictoc)
tic()
FitModel <- TuneFitFactory()
fit_results <- pmap(
  custom_list,
  FitModel,
  train_data = data_train,
  outcome = "SEVERITYCO")
timing <- toc()
elapsed <- timing$toc - timing$tic
print(str_interp("Done fitting ${nrow(custom_list)} models in $[.2f]{elapsed/60} minutes."))
saveRDS(fit_results, "fitResults_20210215.rds")

Below, I compare the results of the 3 initial tunings

fit_results <- readRDS("fitResults_20210215.rds")
models_metrics <- map_dfr(fit_results, possibly(ExtractModelInfo, otherwise = NULL))
models_metrics %>% 
  select(-best_acc_model, -best_auc_model) %>% 
  kable()

clf	recipe	best_accuracy	best_roc_auc	n	mean_accuracy	mean_roc_auc	sd_accuracy	sd_roc_auc	fit_time
glmnet	LR Recipe	0.9882863	0.9953769	6	0.8660117	0.9033425	0.1827193	0.1987817	5505.490
ranger	DT Recipe	0.9924442	0.9994215	6	0.9919263	0.9993457	0.0005637	0.0001001	2540.848
XGBoost	DT Recipe	0.9951491	0.9999239	12	0.9525491	0.9830514	0.0617329	0.0288334	5110.510

plotMetrics(models_metrics, "roc_auc")

plotMetrics(models_metrics, "accuracy")

models_metrics %>% 
  ggplot(aes(x = clf, y = fit_time)) +
  geom_col(fill = "orange") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_log10()

#### Takeaways
* The fit time of LR is longer the Random Forest & XGBoost. This may seem counter-intuitive, but it is because I trained the LR using one thread. Random Forest & XGBoost implementations has inherent parallelisation and I trained them using 11 parallel threads.
* The accuracy and ROC AUC has very high unrealistic figures. This is because I’m using features that are highly correlated with the outcome, such as the count of injured or severely insured people. In real world scenarios it is not realistic to include those features as predictors, because those features are only known after the accident and not before. I will exclude those features as my aim is to predict the accident severity before it happens.

Analyzing the feature importance confirms the above statement. To extract the feature importance, I have to train the model outside of the hyper-parameter search cycle. This is because the above training returns only the evaluation metrics, and not the trained model objects.

# the parameters of the best xgboost model
xg_best <- fit_results %>% 
  pluck(3) %>% 
  pluck("resamples") %>% 
  select_best('roc_auc')

# create a workflow using those parameters
final_wf <-
  workflow() %>% 
  add_model(xg_clf) %>% 
  add_recipe(dt_rec) %>% 
  finalize_workflow(xg_best)
final_wf

# create cv folds (resamples)
set.seed(12)
folds <- vfold_cv(data_train, v = 10, strata = "SEVERITYCO")

# fit the model to those resamples
#final_fit <- final_wf %>% 
#  fit(data_test)
#final_fit

xg_fit <- final_wf %>% 
  last_fit(train_test_split)
saveRDS(xg_fit, "xgFit_20210215.rds")

Now, I see the variable importance of top important features.

library(vip)
xg_fit <- readRDS("xgFit_20210215.rds")
xg_fit %>%
  pull(.workflow) %>%
  pluck(1) %>% 
  pull_workflow_fit() %>% 
  vip()

The above graph shows that the most important features are the number of injuries, fatalities & serious injuries. Including those features is not practical in a real life situation, as they are only known after the accident happens. I will exclude those features from the modeling to make for a more practical model.

# features that are closely coupled with the predictor (i.e. those features should be predicted as well), hence I will remove from the feature space.
pred_features <- c("PERSONCOUN", "PEDCOUNT", "PEDCYLCOUN", "VEHCOUNT", 
                   "INJURIES", "SERIOUSINJ", "FATALITIES")

dt_rec <- dt_rec %>% 
  update_role(any_of(pred_features), new_role = "preds")

# workflow
final_wf <-
  workflow() %>% 
  add_model(xg_clf) %>% 
  add_recipe(dt_rec) %>% 
  finalize_workflow(xg_best)

# train
xg_fit <- final_wf %>% 
  last_fit(train_test_split)

# save
saveRDS(xg_fit, 
        paste0("xgbFitReducedFeatures_", 
        gsub("-", "", lubridate::today()),
        ".rds"))

I check the variable importance of the modified model.

library(vip)
xg_fit <- readRDS("xgbFitReducedFeatures_20210221.rds")
xg_fit %>%
  pull(.workflow) %>%
  pluck(1) %>% 
  pull_workflow_fit() %>% 
  vip()

Takeaways

The accuracy is 71%, and the ROC AUC is 85%. This is good results.
The most important features are:
- COLLISIONT (collision type, i.e. parked car, angles, head on, right turn, left turn, …etc).
- Speeding Y/N
- SDOT_COLCO, which is a code given by the police department to describe the collision type.

Now, it is time to use hyper-parameter tuning to find the best set of parameters. I also use Racing Methods, devised by [Maron and Moore (1994)], (https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=Hoeffding+races%3A+Accelerating+model+selection+search+for+classification+and+function+approximation&btnG=) to reduce the total number of model evaluations. The method is described by Max Kuhn (2014) in this paper.

I will use the LightGBM model instead of XGBoost as it is considerably faster. LightGBM is gradient boosting framework developed by Microsoft that uses tree based learning algorithms.

library(treesnip)

pred_features <- c("PERSONCOUN", "PEDCOUNT", "PEDCYLCOUN", "VEHCOUNT", 
                   "INJURIES", "SERIOUSINJ", "FATALITIES")

# LightGBM
lg_rec <- recipe(SEVERITYCO ~ ., data = data) %>%
  update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
  update_role(any_of(pred_features), new_role = "preds") %>% 
  step_unknown(all_nominal(), -has_role('ID'), -all_outcomes()) %>% 
  step_mutate(
    LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
    JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
    WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
    ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>% 
  step_zv(all_predictors())

# model spec
lg_spec <-
  boost_tree(
    tree_depth = tune(),
    trees = tune(),
    sample_size = tune(),
    mtry = tune(),
    min_n = tune()) %>%
  set_engine(
    "lightgbm",
    nthread = 11,
    verbosity = 1) %>%
  set_mode("classification")

# parameter grid
grid_size = 16
params <- lg_spec %>% 
  parameters() %>% 
  update(mtry = predictor_prop(),
         sample_size = predictor_prop())
set.seed(911)
race_grid <- grid_max_entropy(params, size = grid_size)

# workflow
wflow <- workflow() %>% 
  add_recipe(lg_rec) %>% 
  add_model(lg_spec)

# cv folds
set.seed(12)
folds <- vfold_cv(data_train, v = 8, strata = "SEVERITYCO")

library(finetune)
library(tictoc)
# race search for best hyper-parameters
tic()
race_tune <- tune_race_anova(
  wflow, 
  folds,
  grid = race_grid,
  metrics = metric_set(roc_auc),
  control = control_race(
    verbose = TRUE,
    verbose_elim = TRUE))
toc()

saveRDS(race_tune, 
        paste0("xgbRacingResamplesN16_", 
               gsub("-", "", lubridate::today()),
                              ".rds"))

The below plot illustrates the racing condition. Out of 16 different models (16 different set of hyper-parameters), only 5 made it to the end. So, instead of fitting 128 models (16 models x 8 resmaples), fewer models were fitted.

require(finetune)
race_tune <- readRDS("xgbRacingResamplesN16_20210222.rds")

plot_race(race_tune)

autoplot(race_tune)

show_best(race_tune, metric = "roc_auc", n = 3) %>% kable()

mtry	trees	min_n	tree_depth	sample_size	.metric	.estimator	mean	n	std_err	.config
0.9940907	1871	35	2	0.0546378	roc_auc	hand_till	0.8198579	8	0.0022909	Preprocessor1_Model16
0.8477562	741	31	3	0.2075844	roc_auc	hand_till	0.8194337	8	0.0022680	Preprocessor1_Model12
0.8354038	1085	2	5	0.7793958	roc_auc	hand_till	0.8152093	8	0.0029104	Preprocessor1_Model10

The best 2 models models have very close ROC AUC, whereas the 2nd model is less complex. I follow the Principle of Parsimony and select the 2nd model.

I use this final model of the final training, whereas I use all the training data for training (no resamples here) and test the performance on the test samples.

bt_best <- select_by_pct_loss(race_tune, metric = "roc_auc", trees)
bt_best %>% kable()

mtry	trees	min_n	tree_depth	sample_size	.metric	.estimator	mean	n	std_err	.config	.best	.loss
0.8812664	543	7	8	0.1255015	roc_auc	hand_till	0.8127876	8	0.0029113	Preprocessor1_Model13	0.8198579	0.8623834

final_wf <-
  wflow %>% 
  finalize_workflow(bt_best)

# create cv folds (resamples)
set.seed(12)
folds <- vfold_cv(data_train, v = 10, strata = "SEVERITYCO")

bt_fit <- final_wf %>% 
  last_fit(train_test_split)
saveRDS(bt_fit, 
        paste0("boostedTreeLastFit_", 
               gsub("-", "", lubridate::today()),
                              ".rds"))

Local Interpretability Techniques

Now I use LIME technique for local interpretability of machine learning models. This technique was devised by Riberia et al at 2016.

# load the fitted model (`last_fit` object)
bt_fit <- readRDS("xgbFitReducedFeatures_20210221.rds")

# extract model, mold & data of final XGBoost object
final_model <- bt_fit %>% 
  pull(.workflow) %>%
  pluck(1) %>% 
  extract_model()

final_mold <- bt_fit %>% 
  pull(.workflow) %>%
  pluck(1) %>% 
  pull_workflow_mold()

train_X_df <- final_mold$predictors  # needed for LIME
train_X <- final_mold$predictors %>% 
  data.matrix()
train_y_df <- final_mold$outcomes
train_y <- final_mold$outcomes %>% 
  data.matrix() 
train_ids <- final_mold$extras %>% 
  pluck(1) %>% 
  pluck(1)

# use LIME
library(lime)
explainer <- lime(train_X_df, final_model)

# explain 1st 1000 observations
explanation <- lime::explain(train_X_df[1:1000,], explainer, n_labels = 1, n_features = 3)
saveRDS(explanation, 
        paste0("limeExplainN1000_", 
               gsub("-", "", lubridate::today()),
               ".rds"))

explanation <- readRDS("limeExplainN1000_20210223.rds")
explanation %>% 
  select(-data) %>% 
  separate(feature, into = c("feature_name", "feature_class"), sep = "_", extra = "merge") %>% 
  ggplot(aes(x = feature_weight, y = feature_name, col = feature_value)) +
  geom_jitter(height = 0.2, alpha = 0.4)

The features that have most weight in interpreting the model are ST, WEATHER and SURFACET.

Explain the first three observations

explanation <- lime::explain(train_X_df[1:3,], explainer, n_labels = 1, n_features = 6)
plot_features(explanation)

The above is an example of interpreting specific observations. For example, the 1st observation is predicted to be of class 1 with a probability of 0.82, the features that impacted this prediction are SURFACET (Primary pavement used on Street surface), PVMCOND (pavement condition) and ADT (average daytime traffic).
The 3rd observation is impacted by1STREETTYPE (street type) and ADT (average daytime traffic).

Seattle Traffic

Hany Nagtay

Feb-2021

Summary

Downloading the data

Traffic Flow

Data Exploration

Categorical Variables

Numerical Variables

Other Variables

Summary

Collisions Dataset

Data Exploration

Identifier Variables

Categorical Variables

Numerical Variables

Other Variables

Interactive Data Exploration

Map Plot

Bar Plot

Time Plot

Seattle Streets

Data Exploration

Identifier Variables

Categorical Variables

Character Variables

Numeric Variables

Joining the datasets

Modelling with Tidymodels

Final Cleansing

PCA

Create a PCA `recipe`

Takeaways

Takeaways

UMAP

Takeaways

Predictive Modelling

Takeaways

Local Interpretability Techniques

Seattle Traffic

Hany Nagtay

Feb-2021

Summary

Downloading the data

Traffic Flow

Data Exploration

Categorical Variables

Numerical Variables

Other Variables

Summary

Collisions Dataset

Data Exploration

Identifier Variables

Categorical Variables

Numerical Variables

Other Variables

Interactive Data Exploration

Map Plot

Bar Plot

Time Plot

Seattle Streets

Data Exploration

Identifier Variables

Categorical Variables

Character Variables

Numeric Variables

Joining the datasets

Modelling with Tidymodels

Final Cleansing

PCA

Create a PCA recipe

Takeaways

Takeaways

UMAP

Takeaways

Predictive Modelling

Takeaways

Local Interpretability Techniques

Create a PCA `recipe`