This document is also published at RPubs website
An interactive application for data visualization is hosted at Shiny.io website
In this document, I’m exploring and modeling the city of Seattle GIS data. The data is available at https://data-seattlecitygis.opendata.arcgis.com/search.
I used three datasets from the City of Seattle GIS data. Those are
1. traffic flows
2. streets
3. collisions.
I started by exploring the 3 sets separately, and then I merged them together using their geo-spatial proximity.
I then used R’s Tidymodels framework for doing unsupervised learning, with the aim of further exploring the data and finding patterns. I used two techniques, those are:
1. PCA for identifying linear relationships.
2. UMAP for identifying non-linear relationships.
Next I modeled the severity of a collision. That’s a multi-class classification problem. I explored various models (XGBoost, Random Forest & Linear Regression). I selected the best model and performed hyper parameter tuning using a racing technique.
Lastly, I used various techniques to interpret the resulting model. Those are:
1. Feature Importance for global interpretability.
2. LIME & SHAP for local interpretability.
The date can be downloaded via several methods
1. REST API
2. Text .csv files
3. GIS .shp files
I chose to work with the GIS Shape files as they contain the geometries (i.e. geo-locations and shapes of the data), in contrast to the .csv files which don’t contain the geometries.
The website has traffic flow data sets starting from 2007 till 2018. The date for the years 2016, 2017 & 2018 are more comprehensive and has more features, so I will use only those latest three years.
I start by reading the data and skimming it.
data_folder <- "SeattleTraffic/data"
flows <- read_sf(data_folder, "2018_Traffic_Flow_Counts") %>%
mutate(across(where(~ is.character(.x) & length(unique(.x)) < 10), as.factor))
skim_without_charts(flows)
| Name | flows |
| Number of rows | 1906 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| Date | 1 |
| factor | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| COMPKEY | 14 | 0.99 | 4 | 143 | 0 | 1892 | 0 |
| STNAME_ORD | 0 | 1.00 | 6 | 84 | 0 | 662 | 0 |
| geometry | 0 | 1.00 | 78 | 47618 | 0 | 1906 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| START_DATE | 0 | 1 | 1970-01-01 | 2018-12-01 | 2015-11-17 | 357 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| DOWNTOWN | 0 | 1.00 | FALSE | 3 | N: 1491, Y: 351, B: 64 |
| DATAQUALIT | 0 | 1.00 | FALSE | 4 | Est: 494, Stu: 492, Stu: 492, Stu: 428 |
| FLAGS | 1882 | 0.01 | FALSE | 2 | Com: 15, WSD: 9 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| OBJECTID | 0 | 1.00 | 961.66 | 555.39 | 1.00 | 482.25 | 962.50 | 1442.75 | 1923.00 |
| FLOWSEGID | 0 | 1.00 | 1034.33 | 579.72 | 3.00 | 544.25 | 1031.50 | 1538.75 | 3000.00 |
| AMPK | 1001 | 0.47 | 703.82 | 551.59 | 29.00 | 324.00 | 554.00 | 926.00 | 4569.00 |
| PMPK | 1001 | 0.47 | 774.95 | 583.42 | 39.00 | 362.00 | 624.00 | 1026.00 | 4756.00 |
| AWDT | 2 | 1.00 | 12120.34 | 10144.93 | 387.00 | 5000.00 | 9428.50 | 16000.00 | 108179.00 |
| ADT | 482 | 0.75 | 10473.26 | 9078.04 | 0.00 | 4307.25 | 8152.00 | 13655.00 | 102555.00 |
| AWDT_ROUND | 0 | 1.00 | 12144.28 | 10106.11 | 400.00 | 5000.00 | 9400.00 | 16000.00 | 98700.00 |
| SHAPE_Leng | 0 | 1.00 | 1321.33 | 1365.73 | 48.54 | 357.50 | 830.32 | 1759.56 | 13076.44 |
This is a simple dataset. It has 1906 rows & 15 columns. Each row observation is the traffic flow counts from a street segment. The data is aggregated over 1 year, i.e. one measurement per year.
The data has the below variables:
##### Identifer Variables
1. OBJECTID is a unique identifier for the measured segment.
2. COMPKEY is the street segment number.
3. STNAME_ORD is the street segment name.
4. FLOWSEGID is an ID for the segment flow.
AWDT to the nearest ’00.Below is simple geographical plot for AWDT_ROUND, which stands for Average Weekday Traffic (I chose this for plotting because it is least variable with missing data)
tm_shape(flows) +
tm_lines(col = "AWDT_ROUND", lwd = 5, palette = "Blues", alpha = 1)
This data set is collected from the the traffic cameras feeds. The data set is small and has many estimated (25%), rather than measured data.
As the above data is small (in terms of rows & number of features), I complement it with a more comprehensive data set that lists the vehicle collisions in Seattle. It is also available at the city of Seattle GIS data website.
# read the data
collisions <- read_sf(data_folder, "Collisions") %>%
mutate(across(where(~ is.character(.x) & length(unique(.x)) < 70), as.factor))
A very basic cleaning step for bi-level categorical (Y/N) variables:
a. Replace 1 with Y and 0 with N in UNDERINFL.
b. Replace NA with N in INATTENTIO, UNDERINFL, PEDROWNOTG, SPEEDING & HITPARKEDC.
c. Filter dates larger than 01-Jan-2004 as preceding observations were very few and not complete.
collisions_cleaned <- collisions %>%
mutate(
UNDERINFL = case_when(
UNDERINFL == 1 ~ "Y",
UNDERINFL == 0 ~ "N",
TRUE ~ as.character(UNDERINFL)) %>%
as.factor(),
across(all_of(c("INATTENTIO", "UNDERINFL", "PEDROWNOTG", "SPEEDING", "HITPARKEDC")),
~replace_na(as.character(.x), "N") %>% as.factor())) %>%
filter(INCDATE >= "2004-01-01")
Skim the data for preliminary exploration
collisions_cleaned %>%
st_drop_geometry() %>%
skim()
| Name | Piped data |
| Number of rows | 223673 |
| Number of columns | 38 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 1 |
| factor | 20 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| REPORTNO | 0 | 1.00 | 1 | 9 | 0 | 223670 | 0 |
| LOCATION | 4617 | 0.98 | 17 | 90 | 0 | 25276 | 0 |
| INCDTTM | 0 | 1.00 | 8 | 22 | 0 | 170926 | 0 |
| SDOTCOLNUM | 96469 | 0.57 | 7 | 8 | 0 | 127185 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| INCDATE | 0 | 1 | 2004-01-01 | 2020-12-30 | 2011-10-26 | 6209 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| STATUS | 0 | 1.00 | FALSE | 2 | Mat: 196523, Unm: 27150 |
| ADDRTYPE | 3736 | 0.98 | FALSE | 3 | Blo: 146306, Int: 72750, All: 881 |
| EXCEPTRSNC | 211786 | 0.05 | FALSE | 1 | NEI: 11887 |
| EXCEPTRSND | 211786 | 0.05 | FALSE | 1 | Not: 11887 |
| SEVERITYCO | 1 | 1.00 | FALSE | 5 | 1: 138828, 2: 59334, 0: 22001, 2b: 3151 |
| SEVERITYDE | 0 | 1.00 | FALSE | 5 | Pro: 138828, Inj: 59334, Unk: 22002, Ser: 3151 |
| COLLISIONT | 27222 | 0.88 | FALSE | 10 | Par: 48740, Ang: 35849, Rea: 34907, Oth: 24781 |
| JUNCTIONTY | 12041 | 0.95 | FALSE | 7 | Mid: 102909, At : 70037, Mid: 24466, Dri: 11506 |
| SDOT_COLCO | 0 | 1.00 | FALSE | 40 | 11: 93018, 14: 59922, 00: 19278, 16: 10998 |
| SDOT_COLDE | 0 | 1.00 | FALSE | 40 | MOT: 93018, MOT: 59922, NOT: 19278, MOT: 10998 |
| INATTENTIO | 0 | 1.00 | FALSE | 2 | N: 193478, Y: 30195 |
| UNDERINFL | 0 | 1.00 | FALSE | 2 | N: 214044, Y: 9629 |
| WEATHER | 27413 | 0.88 | FALSE | 12 | Cle: 115582, Rai: 34166, Ove: 28742, Unk: 15131 |
| ROADCOND | 27333 | 0.88 | FALSE | 9 | Dry: 129587, Wet: 48949, Unk: 15161, Ice: 1233 |
| LIGHTCOND | 27504 | 0.88 | FALSE | 9 | Day: 120312, Dar: 50463, Unk: 13549, Dus: 6121 |
| PEDROWNOTG | 0 | 1.00 | FALSE | 2 | N: 218450, Y: 5223 |
| SPEEDING | 0 | 1.00 | FALSE | 2 | N: 213658, Y: 10015 |
| ST_COLCODE | 27222 | 0.88 | FALSE | 62 | 32: 45100, 10: 35849, 14: 26353, 50: 14316 |
| ST_COLDESC | 27222 | 0.88 | FALSE | 62 | One: 45100, Ent: 35849, Fro: 26353, Fix: 14316 |
| HITPARKEDC | 0 | 1.00 | FALSE | 2 | N: 211132, Y: 12541 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| OBJECTID | 0 | 1.00 | 111837.99 | 64569.00 | 1 | 55920 | 111838 | 167756 | 223674 | ▇▇▇▇▇ |
| INCKEY | 0 | 1.00 | 146654.68 | 90716.54 | 1001 | 72205 | 128328 | 212390 | 336687 | ▆▇▅▃▃ |
| COLDETKEY | 0 | 1.00 | 146895.90 | 91107.81 | 1001 | 72205 | 128328 | 212670 | 338187 | ▆▇▅▃▃ |
| INTKEY | 150923 | 0.33 | 37695.91 | 52206.87 | 23807 | 28655 | 29973 | 34008 | 764413 | ▇▁▁▁▁ |
| PERSONCOUN | 0 | 1.00 | 2.22 | 1.47 | 0 | 2 | 2 | 3 | 93 | ▇▁▁▁▁ |
| PEDCOUNT | 0 | 1.00 | 0.04 | 0.20 | 0 | 0 | 0 | 0 | 6 | ▇▁▁▁▁ |
| PEDCYLCOUN | 0 | 1.00 | 0.03 | 0.16 | 0 | 0 | 0 | 0 | 2 | ▇▁▁▁▁ |
| VEHCOUNT | 0 | 1.00 | 1.72 | 0.83 | 0 | 2 | 2 | 2 | 15 | ▇▁▁▁▁ |
| INJURIES | 0 | 1.00 | 0.37 | 0.73 | 0 | 0 | 0 | 1 | 78 | ▇▁▁▁▁ |
| SERIOUSINJ | 0 | 1.00 | 0.02 | 0.16 | 0 | 0 | 0 | 0 | 41 | ▇▁▁▁▁ |
| FATALITIES | 0 | 1.00 | 0.00 | 0.04 | 0 | 0 | 0 | 0 | 5 | ▇▁▁▁▁ |
| SEGLANEKEY | 0 | 1.00 | 267.45 | 3271.23 | 0 | 0 | 0 | 0 | 525241 | ▇▁▁▁▁ |
| CROSSWALKK | 0 | 1.00 | 9499.31 | 71173.48 | 0 | 0 | 0 | 0 | 5239700 | ▇▁▁▁▁ |
The data is compiled by SPD (Seattle Police Department), it is compiled from traffic accidents reports. The dataset had ~ 0.25M rows & 39 columns. It’s one row per accident. The data spans the period from Jan-2004 till Dec-2020.
EXCEPTRSNC, so it is totally redundant and has no added predictive power.SEVERITYCO, so it is totally redundant and has no added predictive power.SDOT_COLCO. It’s totally redundant and has no added predictive power.ST_COLCODE. It’s totally redundant and has no added predictive power.Please use the interactive version hosted at Shiny.io to interact with the plots
In the below, I use Shiny code for interactive data exploration.
Geographical locations of the collisions, coloured by the number of injuries.
collisions_cleaned %>%
tm_shape() +
tm_dots(col = "INJURIES", size = "PERSONCOUN")
Explore the distribution of variables. I plot a bar plot and make a percentage table.
collisions_cleaned %>%
ggplot(aes(x = INJURIES)) +
geom_bar()
collisions_cleaned %>%
as_tibble() %>%
select(-geometry) %>%
tabyl(INJURIES) %>%
adorn_pct_formatting() %>%
rename(Count = n, Percent = percent) %>%
kable()
| INJURIES | Count | Percent |
|---|---|---|
| 0 | 161070 | 72.0% |
| 1 | 47802 | 21.4% |
| 2 | 10779 | 4.8% |
| 3 | 2743 | 1.2% |
| 4 | 826 | 0.4% |
| 5 | 275 | 0.1% |
| 6 | 100 | 0.0% |
| 7 | 40 | 0.0% |
| 8 | 12 | 0.0% |
| 9 | 10 | 0.0% |
| 10 | 6 | 0.0% |
| 11 | 5 | 0.0% |
| 12 | 1 | 0.0% |
| 13 | 2 | 0.0% |
| 15 | 1 | 0.0% |
| 78 | 1 | 0.0% |
Most accidents have no injuries
Below I make a time series plot of the number of accidents
collisions_cleaned %>%
count(INCDATE) %>%
ggplot(aes(x = INCDATE, y = n)) +
geom_line()
A noticeable decrease in 2020 due to COVID-19 lowdown.
Now, I’m enriching the data with a dataset for Seattle streets
data_folder <- "SeattleTraffic/data"
# read the data
streets <- read_sf(data_folder, "Seattle_Streets") %>%
mutate(across(where(~ is.character(.x) & length(unique(.x)) < 20), as.factor))
# convert the below numeric columns to factors, as they represent categorical features
num_to_cat <- c("ARTCLASS", "TRANCLASS")
streets <- streets %>%
mutate(across(all_of(num_to_cat), as.factor))
skim_without_charts(streets)
| Name | streets |
| Number of rows | 23806 |
| Number of columns | 39 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| factor | 18 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| UNITID | 5 | 1 | 5 | 5 | 0 | 2562 | 0 |
| UNITID2 | 5 | 1 | 4 | 5 | 0 | 941 | 0 |
| UNITIDSORT | 5 | 1 | 9 | 10 | 0 | 23801 | 0 |
| UNITDESC | 5 | 1 | 35 | 92 | 0 | 23800 | 0 |
| STNAME_ORD | 0 | 1 | 5 | 25 | 0 | 2409 | 0 |
| XSTRLO | 5 | 1 | 5 | 30 | 0 | 3682 | 0 |
| XSTRHI | 5 | 1 | 5 | 30 | 0 | 3684 | 0 |
| INTRLO | 8 | 1 | 18 | 63 | 0 | 14108 | 0 |
| INTRHI | 10 | 1 | 18 | 63 | 0 | 14234 | 0 |
| geometry | 0 | 1 | 70 | 18415 | 0 | 23806 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| ARTCLASS | 5 | 1.00 | FALSE | 7 | 0: 16957, 2: 2459, 1: 2197, 3: 1888 |
| ARTDESCRIP | 5 | 1.00 | FALSE | 7 | Not: 16957, Min: 2459, Pri: 2197, Col: 1888 |
| OWNER | 23062 | 0.03 | FALSE | 7 | WSD: 641, PRI: 74, PAR: 12, SDO: 12 |
| STATUS | 50 | 1.00 | FALSE | 5 | INS: 23726, OUT: 13, TEM: 13, PLA: 3 |
| SEGDIR | 6 | 1.00 | FALSE | 8 | E: 6771, N: 6554, S: 4028, W: 3255 |
| ONEWAY | 10 | 1.00 | FALSE | 2 | N: 22227, Y: 1569 |
| ONEWAYDIR | 22235 | 0.07 | FALSE | 8 | S: 336, N: 327, NW: 174, SE: 169 |
| FLOW | 22237 | 0.07 | FALSE | 2 | F: 903, T: 666 |
| SURFACETYP | 1198 | 0.95 | FALSE | 6 | PCC: 10580, ST: 4179, AC/: 3818, AC: 3698 |
| SURFACET_1 | 23123 | 0.03 | FALSE | 5 | PCC: 451, AC: 105, AC/: 94, ST: 29 |
| DIRLO | 6 | 1.00 | FALSE | 8 | E: 6771, N: 6554, S: 4028, W: 3255 |
| DIRHI | 6 | 1.00 | FALSE | 8 | W: 6771, S: 6554, N: 4028, E: 3255 |
| NATIONHWYS | 5 | 1.00 | FALSE | 2 | N: 22004, Y: 1797 |
| STREETTYPE | 793 | 0.97 | FALSE | 12 | Nei: 12799, Urb: 3074, Nei: 2193, Urb: 1186 |
| TRANCLASS | 0 | 1.00 | FALSE | 7 | 0: 18534, 3: 3081, 2: 1737, 1: 225 |
| TRANDESCRI | 187 | 0.99 | FALSE | 7 | NOT: 18347, MIN: 3081, MAJ: 1737, PRI: 225 |
| PVMTCATEGO | 81 | 1.00 | FALSE | 6 | NON: 16747, ART: 6525, NON: 280, MLT: 147 |
| PARKBOULEV | 5 | 1.00 | FALSE | 2 | N: 23636, Y: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| OBJECTID | 0 | 1 | 11903.50 | 6872.34 | 1.0 | 5952.25 | 11903.50 | 17854.75 | 23806.00 |
| COMPKEY | 0 | 1 | 35005.36 | 99491.72 | 1000.0 | 7045.25 | 13159.50 | 19145.75 | 765446.00 |
| BLOCKNBR | 5 | 1 | 3866.84 | 3259.50 | 0.0 | 1300.00 | 3100.00 | 5500.00 | 77000.00 |
| SPEEDLIMIT | 6 | 1 | 22.07 | 4.61 | 0.0 | 20.00 | 20.00 | 25.00 | 60.00 |
| SEGLENGTH | 5 | 1 | 423.16 | 431.88 | 8.0 | 260.00 | 326.00 | 525.00 | 15035.00 |
| SURFACEWID | 8 | 1 | 25.37 | 15.31 | 0.0 | 21.00 | 25.00 | 32.00 | 295.00 |
| INTKEYLO | 8 | 1 | 49966.98 | 89615.33 | 23806.0 | 27379.25 | 31092.00 | 35143.75 | 765443.00 |
| PVMTCONDIN | 5 | 1 | 51.40 | 32.67 | 0.0 | 22.00 | 60.00 | 79.00 | 100.00 |
| PVMTCOND_1 | 5 | 1 | 2.03 | 12.58 | 0.0 | 0.00 | 0.00 | 0.00 | 100.00 |
| SLOPE_PCT | 18 | 1 | 3.97 | 3.83 | 0.0 | 1.00 | 3.00 | 6.00 | 47.00 |
| SHAPE_Leng | 0 | 1 | 423.08 | 434.16 | 8.4 | 260.01 | 325.94 | 523.43 | 15035.23 |
tm_shape(streets) +
tm_lines(col = "ONEWAY", lwd = 5, palette = "Blues", alpha = 1)
The 3 datasets has no common ID to use for joining them together, hence I had to resort to joining them using GIS libraries to calculate their proximity together.
I want to map each collision to its street in the streets dataset and in the flows dataset.
The collisions dataset is large, so I used parallel backend to perform the calculation in parallel on multiple cores
First, write a function to identify the nearest street or flow.
And then use a “split, apply and combine strategy”
1. Split the collisions into smaller sub data frames
2. Apply the function to each sub data frame
3. Combine the results together
This is the function to identify the nearest line (e.g. a street) to a given point, (e.g. a collision location)
#' Identify the Nearest sf Line Object to an sf Point Object
#' Identifies the nearest sf line object (for example a street) to a given
#' sf point object (e.g. a specific location)
#'
#' @param sfp sf point object
#' @param sfl sf line object
#' @param .by the name of a unique identifier column in the sfl object
#' @param .name the name given to the new sfp object id
#'
#' @return the sfp object after adding a column for the nearest sfl object
#' @export
#'
#' @examples
#' @todo modify the function to reject objects further apart than a given
#' threshold
identify_nearest_line <- function(sfp, sfl, .by, .name) {
dist_matrix <- st_distance(sfp, sfl, by_element = FALSE)
min_index <- apply(dist_matrix, 1, which.min) %>%
map_int(~ .x[1])
min_dist <- diag(dist_matrix[,min_index])
sfp[[.name]] <- sfl[[.by]][min_index]
return(sfp)
}
This is a generic function to apply any given function in parallel
#' Apply a Function in Parallel
#' Applies a given function to a dataframe in parallel
#' The function uses a "split, apply then combine strategy"
#' 1. Split the dataframe into smaller sub dataframes
#' 2. Apply the function to each sub dataframe
#' 3. Combine the results together; in this case using bind_rows
#' @param df a dataframe to apply the function on
#' @param fct a function to apply
#' @param workers number of parallel workers to use
#' @param split_size the size (number or rows) of each split
#' @param ...
#'
#' @return
#' @export
#'
#' @examples
apply_in_parallel <- function(df, fct, workers = 4, split_size = 3000, ...) {
require(furrr)
nr <- nrow(df)
n <- split_size
if(workers > 1) {
plan(multisession, workers = workers)
}
processed <- df %>%
split(rep(1:ceiling(nr/n), each=n, length.out=nr)) %>%
future_map_dfr(fct, ...) %>%
ungroup()
plan(sequential)
return(processed)
}
Before anything else, the GIS objects should be projected onto a planar metric projection. This is necessary to make sure that the distance metrics are calculated in meters, rather than degrees. This region of North America is suited to be projected onto EPSG:32610 or EPSG:26910. The former uses WGS84 geographic 2D CRS as its base CRS and UTM zone 10N as its projection, while the later uses the NAD83 geographic CRS and UTM zone 10N.
collisions_cleaned <- collisions_cleaned %>%
st_transform(crs = 32610)
flows <- flows %>%
st_transform(crs = 32610)
streets <- streets %>%
st_transform(crs = 32610)
Now, extract the nearest flow to each collision location
tic()
collisions_new <- apply_in_parallel(
collisions_cleaned,
identify_nearest_line,
split_size = 1000,
workers = 10,
sfl = flows,
.by = "OBJECTID",
.name = "nearest_flow_id")
toc()
And then extract the nearest street to each collision location
tic()
collisions_new <- apply_in_parallel(
collisions_new,
identify_nearest_line,
split_size = 1000,
workers = 10,
sfl = streets,
.by = "OBJECTID",
.name = "nearest_street_id")
toc()
Now, join collisions with both flows and streets
collisions_joined <- collisions_new %>%
inner_join(st_drop_geometry(streets), by = c("nearest_street_id" = "OBJECTID")) %>%
inner_join(st_drop_geometry(flows), by = c("nearest_flow_id" = "OBJECTID"))
Not too late, and for the better, I discovered an alternate method for joining using the ’sf` package. Its algorithm is optimized and it runs much faster.
collisions_joined <- collisions_cleaned %>%
st_join(streets, join = st_nearest_feature, left = FALSE, suffix = c(".coll", ".strt")) %>%
st_join(flows, join = st_nearest_feature, left = FALSE, suffix = c(".colst", ".flow"))
Now, the resulting data frame has 91 features. It’s time to extract lat & long.
collisions_joined_df <- collisions_joined %>%
st_coordinates() %>%
as_tibble() %>%
bind_cols(collisions_joined) %>%
select(-geometry) %>%
rename (XCord = X, YCord = Y)
skim_without_charts(collisions_joined_df)
| Name | collisions_joined_df |
| Number of rows | 223673 |
| Number of columns | 92 |
| _______________________ | |
| Column type frequency: | |
| character | 15 |
| Date | 2 |
| factor | 41 |
| numeric | 34 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| REPORTNO | 0 | 1.00 | 1 | 9 | 0 | 223670 | 0 |
| LOCATION | 4617 | 0.98 | 17 | 90 | 0 | 25276 | 0 |
| INCDTTM | 0 | 1.00 | 8 | 22 | 0 | 170926 | 0 |
| SDOTCOLNUM | 96469 | 0.57 | 7 | 8 | 0 | 127185 | 0 |
| UNITID | 7507 | 0.97 | 5 | 5 | 0 | 2111 | 0 |
| UNITID2 | 7507 | 0.97 | 4 | 5 | 0 | 795 | 0 |
| UNITIDSORT | 7507 | 0.97 | 9 | 10 | 0 | 18343 | 0 |
| UNITDESC | 7507 | 0.97 | 35 | 90 | 0 | 18342 | 0 |
| STNAME_ORD.colst | 7507 | 0.97 | 5 | 25 | 0 | 2017 | 0 |
| XSTRLO | 7507 | 0.97 | 5 | 30 | 0 | 3120 | 0 |
| XSTRHI | 7507 | 0.97 | 5 | 30 | 0 | 3154 | 0 |
| INTRLO | 7508 | 0.97 | 18 | 63 | 0 | 11817 | 0 |
| INTRHI | 7525 | 0.97 | 18 | 63 | 0 | 12047 | 0 |
| COMPKEY.flow | 7814 | 0.97 | 4 | 143 | 0 | 1886 | 0 |
| STNAME_ORD.flow | 7507 | 0.97 | 6 | 84 | 0 | 659 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| INCDATE | 0 | 1.00 | 2004-01-01 | 2020-12-30 | 2011-10-26 | 6209 |
| START_DATE | 7507 | 0.97 | 1970-01-01 | 2018-12-01 | 2015-08-20 | 356 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| STATUS.coll | 0 | 1.00 | FALSE | 2 | Mat: 196523, Unm: 27150 |
| ADDRTYPE | 3736 | 0.98 | FALSE | 3 | Blo: 146306, Int: 72750, All: 881 |
| EXCEPTRSNC | 211786 | 0.05 | FALSE | 1 | NEI: 11887 |
| EXCEPTRSND | 211786 | 0.05 | FALSE | 1 | Not: 11887 |
| SEVERITYCO | 1 | 1.00 | FALSE | 5 | 1: 138828, 2: 59334, 0: 22001, 2b: 3151 |
| SEVERITYDE | 0 | 1.00 | FALSE | 5 | Pro: 138828, Inj: 59334, Unk: 22002, Ser: 3151 |
| COLLISIONT | 27222 | 0.88 | FALSE | 10 | Par: 48740, Ang: 35849, Rea: 34907, Oth: 24781 |
| JUNCTIONTY | 12041 | 0.95 | FALSE | 7 | Mid: 102909, At : 70037, Mid: 24466, Dri: 11506 |
| SDOT_COLCO | 0 | 1.00 | FALSE | 40 | 11: 93018, 14: 59922, 00: 19278, 16: 10998 |
| SDOT_COLDE | 0 | 1.00 | FALSE | 40 | MOT: 93018, MOT: 59922, NOT: 19278, MOT: 10998 |
| INATTENTIO | 0 | 1.00 | FALSE | 2 | N: 193478, Y: 30195 |
| UNDERINFL | 0 | 1.00 | FALSE | 2 | N: 214044, Y: 9629 |
| WEATHER | 27413 | 0.88 | FALSE | 12 | Cle: 115582, Rai: 34166, Ove: 28742, Unk: 15131 |
| ROADCOND | 27333 | 0.88 | FALSE | 9 | Dry: 129587, Wet: 48949, Unk: 15161, Ice: 1233 |
| LIGHTCOND | 27504 | 0.88 | FALSE | 9 | Day: 120312, Dar: 50463, Unk: 13549, Dus: 6121 |
| PEDROWNOTG | 0 | 1.00 | FALSE | 2 | N: 218450, Y: 5223 |
| SPEEDING | 0 | 1.00 | FALSE | 2 | N: 213658, Y: 10015 |
| ST_COLCODE | 27222 | 0.88 | FALSE | 62 | 32: 45100, 10: 35849, 14: 26353, 50: 14316 |
| ST_COLDESC | 27222 | 0.88 | FALSE | 62 | One: 45100, Ent: 35849, Fro: 26353, Fix: 14316 |
| HITPARKEDC | 0 | 1.00 | FALSE | 2 | N: 211132, Y: 12541 |
| ARTCLASS | 7507 | 0.97 | FALSE | 6 | 1: 83513, 0: 67646, 2: 47940, 3: 16003 |
| ARTDESCRIP | 7507 | 0.97 | FALSE | 6 | Pri: 83513, Not: 67646, Min: 47940, Col: 16003 |
| OWNER | 208526 | 0.07 | FALSE | 6 | WSD: 14795, SDO: 189, PRI: 116, OTH: 32 |
| STATUS.strt | 7646 | 0.97 | FALSE | 4 | INS: 215869, TEM: 94, OUT: 59, PLA: 5 |
| SEGDIR | 7533 | 0.97 | FALSE | 8 | N: 57721, E: 54057, S: 37748, NW: 19059 |
| ONEWAY | 7509 | 0.97 | FALSE | 2 | N: 187043, Y: 29121 |
| ONEWAYDIR | 194564 | 0.13 | FALSE | 8 | N: 5126, S: 4635, NE: 4198, SE: 4195 |
| FLOW | 194552 | 0.13 | FALSE | 2 | F: 15330, T: 13791 |
| SURFACETYP | 20493 | 0.91 | FALSE | 6 | AC/: 90399, PCC: 74462, AC: 25198, ST: 11007 |
| SURFACET_1 | 205367 | 0.08 | FALSE | 5 | PCC: 13699, AC/: 2569, AC: 1782, ST: 201 |
| DIRLO | 7533 | 0.97 | FALSE | 8 | N: 57721, E: 54057, S: 37748, NW: 19059 |
| DIRHI | 7533 | 0.97 | FALSE | 8 | S: 57721, W: 54057, N: 37748, SE: 19059 |
| NATIONHWYS | 7507 | 0.97 | FALSE | 2 | N: 149430, Y: 66736 |
| STREETTYPE | 13864 | 0.94 | FALSE | 12 | Urb: 35142, Nei: 32536, Urb: 27694, Urb: 27379 |
| TRANCLASS | 7507 | 0.97 | FALSE | 7 | 0: 85245, 2: 61048, 3: 57602, 1: 10671 |
| TRANDESCRI | 8029 | 0.96 | FALSE | 7 | NOT: 84723, MAJ: 61048, MIN: 57602, PRI: 10671 |
| PVMTCATEGO | 7738 | 0.97 | FALSE | 6 | ART: 147458, NON: 67328, NON: 875, MLT: 250 |
| PARKBOULEV | 7507 | 0.97 | FALSE | 2 | N: 214852, Y: 1314 |
| DOWNTOWN | 7507 | 0.97 | FALSE | 3 | N: 183577, Y: 23527, B: 9062 |
| DATAQUALIT | 7507 | 0.97 | FALSE | 4 | Stu: 86922, Est: 48121, Stu: 44380, Stu: 36743 |
| FLAGS | 217771 | 0.03 | FALSE | 2 | Com: 4902, WSD: 1000 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| XCord | 7507 | 0.97 | 550288.78 | 2269.22 | 543148.85 | 548884.15 | 550320.72 | 551669.39 | 557305.76 |
| YCord | 7507 | 0.97 | 5274302.58 | 6232.62 | 5260500.03 | 5269494.43 | 5273844.63 | 5279192.37 | 5286965.89 |
| OBJECTID.coll | 0 | 1.00 | 111837.99 | 64569.00 | 1.00 | 55920.00 | 111838.00 | 167756.00 | 223674.00 |
| INCKEY | 0 | 1.00 | 146654.68 | 90716.54 | 1001.00 | 72205.00 | 128328.00 | 212390.00 | 336687.00 |
| COLDETKEY | 0 | 1.00 | 146895.90 | 91107.81 | 1001.00 | 72205.00 | 128328.00 | 212670.00 | 338187.00 |
| INTKEY | 150923 | 0.33 | 37695.91 | 52206.87 | 23807.00 | 28655.00 | 29973.00 | 34008.00 | 764413.00 |
| PERSONCOUN | 0 | 1.00 | 2.22 | 1.47 | 0.00 | 2.00 | 2.00 | 3.00 | 93.00 |
| PEDCOUNT | 0 | 1.00 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 6.00 |
| PEDCYLCOUN | 0 | 1.00 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 |
| VEHCOUNT | 0 | 1.00 | 1.72 | 0.83 | 0.00 | 2.00 | 2.00 | 2.00 | 15.00 |
| INJURIES | 0 | 1.00 | 0.37 | 0.73 | 0.00 | 0.00 | 0.00 | 1.00 | 78.00 |
| SERIOUSINJ | 0 | 1.00 | 0.02 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 41.00 |
| FATALITIES | 0 | 1.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 5.00 |
| SEGLANEKEY | 0 | 1.00 | 267.45 | 3271.23 | 0.00 | 0.00 | 0.00 | 0.00 | 525241.00 |
| CROSSWALKK | 0 | 1.00 | 9499.31 | 71173.48 | 0.00 | 0.00 | 0.00 | 0.00 | 5239700.00 |
| OBJECTID.strt | 7507 | 0.97 | 11863.83 | 6962.27 | 1.00 | 6098.00 | 11911.00 | 17917.00 | 23806.00 |
| COMPKEY.colst | 7507 | 0.97 | 24579.89 | 77105.30 | 1000.00 | 7906.00 | 11851.00 | 16068.00 | 764415.00 |
| BLOCKNBR | 7507 | 0.97 | 3359.21 | 3325.00 | 0.00 | 900.00 | 2100.00 | 5000.00 | 77000.00 |
| SPEEDLIMIT | 7508 | 0.97 | 24.95 | 5.13 | 0.00 | 20.00 | 25.00 | 25.00 | 60.00 |
| SEGLENGTH | 7507 | 0.97 | 488.62 | 402.06 | 16.00 | 298.00 | 360.00 | 647.00 | 15035.00 |
| SURFACEWID | 7509 | 0.97 | 37.55 | 19.11 | 0.00 | 25.00 | 40.00 | 50.00 | 295.00 |
| INTKEYLO | 7508 | 0.97 | 42548.52 | 72848.18 | 23806.00 | 28492.00 | 30348.00 | 34085.00 | 765442.00 |
| PVMTCONDIN | 7507 | 0.97 | 54.59 | 29.71 | 0.00 | 35.00 | 61.00 | 78.00 | 100.00 |
| PVMTCOND_1 | 7507 | 0.97 | 6.53 | 22.33 | 0.00 | 0.00 | 0.00 | 0.00 | 100.00 |
| SLOPE_PCT | 7537 | 0.97 | 3.04 | 3.13 | 0.00 | 1.00 | 2.00 | 4.00 | 32.00 |
| SHAPE_Leng.colst | 7507 | 0.97 | 488.17 | 400.68 | 16.37 | 297.18 | 360.07 | 647.01 | 15035.23 |
| OBJECTID | 7507 | 0.97 | 952.47 | 551.87 | 1.00 | 474.00 | 936.00 | 1432.00 | 1923.00 |
| FLOWSEGID | 7507 | 0.97 | 1010.91 | 620.99 | 3.00 | 439.00 | 1016.00 | 1575.00 | 3000.00 |
| AMPK | 148022 | 0.34 | 885.06 | 655.18 | 29.00 | 418.00 | 743.00 | 1152.00 | 4569.00 |
| PMPK | 148022 | 0.34 | 977.93 | 694.67 | 39.00 | 497.00 | 828.00 | 1278.00 | 4756.00 |
| AWDT | 8022 | 0.96 | 16529.50 | 12143.95 | 387.00 | 8382.00 | 13500.00 | 21558.00 | 108179.00 |
| ADT | 52687 | 0.76 | 14327.72 | 11045.04 | 0.00 | 7056.00 | 11955.00 | 19092.00 | 102555.00 |
| AWDT_ROUND | 7507 | 0.97 | 16690.85 | 12084.62 | 400.00 | 8400.00 | 13600.00 | 22100.00 | 98700.00 |
| SHAPE_Leng.flow | 7507 | 0.97 | 2213.18 | 1860.37 | 48.54 | 856.21 | 1728.97 | 3020.21 | 13076.44 |
Tidymodels is a new modeling framework for R developed by RStudio. In my openion, the Tidymodels philosophy and syntax is more intuitive and simpler than scikit-learn, letting the data scientist focus more on the underlying model rather than the syntax. The Tidymodels makes it possible to define a recipe for the pre-processing steps and a specification for the model to use.
It is useful to model the number of injuries in a collision. But before hand, I plan to explore the data with PCA & UMAP.
Some of the columns in the final joined dataset will be removed before modeling. Those are: 1. ID columns.
2. Name columns, like street names or intersection names.
3. Redundant columns, as some features are represented by two columns, where one of them is a numeric code, and tge other is a mere description of that code.
# id columns, I'm excluding OBJECTID.coll to allow tracking the data later
id_cols <-c("COMPKEY.colst", "UNITIDSORT", "REPORTNO", "INCKEY", "COLDETKEY",
"INTKEY", "SEGLANEKEY", 'CROSSWALKK', 'SDOTCOLNUM', 'OBJECTID.strt',
"FLOWSEGID", "COMPKEY.colst", "OBJECTID", "BLOCKNBR", "INTKEYLO")
redund_cols <- c("SHAPE_Leng.flow", "EXCEPTRSND", "SEVERITYDE", "SDOT_COLDE",
"ST_COLDESC", "ARTDESCRIP", "TRANDESCRI", "START_DATE",
"SHAPE_Leng.colst", "EXCEPTRSNC")
data <- collisions_joined_df %>%
mutate(HOUR = str_extract(INCDTTM, "(?<=\\s).*") %>%
as.POSIXct(format="%I:%M:%S %p") %>%
lubridate::hour()) %>%
select(-where(is.character),
-all_of(id_cols),
-all_of(redund_cols)) %>%
rename(COLLISION_ID = OBJECTID.coll) %>%
select(COLLISION_ID, everything())
I encode the Date & Hour as cyclic values using sin & cos. This is useful to preserve the cyclic nature of dates & times (for example, hour 23 is as close to hour 0 as it is close to hour 22, Jan is close to Dec as it is to Feb)
data <- lubridate::cyclic_encoding(data$INCDATE, periods = "month") %>%
as_tibble() %>%
rename(MONTH.SIN = sin.month,
MONTH.COS = cos.month) %>%
bind_cols(data) %>%
mutate(HOUR.COS = cos(2*3.14*HOUR/24),
HOUR.SIN = sin(2*3.14*HOUR/24),
.keep = "unused")
saveRDS(data,
paste0("data_",
gsub("-", "", lubridate::today()),
".rds"))
I will make a PCA to explore the linear relationship between the variables.
I use the Tidymodels recipe to create a specification or recipe for pre-processing steps, those are:
a. One-hot encoding using step_dummy()
b. Normalization (scaling & standardization) using step_normalise. This is necessary since PCA requires all its inputs to be normalized.
Notes
* step_unknown() creates a new unknown level even for columns with no missing values. This creates a new all NA column. It’s necessary to add step_zv() or step_naomit() after step_unkown().
* step_unknown() creates a new unknown level, wheras my data already had some Unknown level. I used step_mutate() with fct_collapse() to collapse those 2 levels.
recipepca_rec <- recipe(~., data = data) %>%
update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
step_unknown(all_nominal(), -has_role('ID')) %>%
step_mutate(
LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>%
step_medianimpute(all_numeric(), -has_role('ID')) %>%
step_dummy(all_nominal(), -has_role('ID')) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors())
pca_rec
## Data Recipe
##
## Inputs:
##
## role #variables
## id 2
## predictor 58
##
## Operations:
##
## Unknown factor level assignment for all_nominal(), -has_role("ID")
## Variable mutation for LIGHTCOND, JUNCTIONTY, WEATHER, ROADCOND
## Median Imputation for all_numeric(), -has_role("ID")
## Dummy variables from all_nominal(), -has_role("ID")
## Zero variance filter on all_predictors()
## Centering and scaling for all_predictors()
## No PCA components were extracted.
Now, train the recipe
pca_prep <- prep(pca_rec)
I now plot a scree plot for the PCA results, this plot shows the amount of variance captured by each principle component.
pca_step <- 7
pca_variance <- tidy(pca_prep, pca_step, type = "variance") %>%
select(-id)
p <- pca_variance %>%
filter(terms == "percent variance") %>%
ggplot(aes(x = component, y = value)) +
geom_point() +
geom_line(col = "grey")
print(p)
pca_variance %>%
pivot_wider(names_from = terms, values_from = value) %>%
select(component, contains("percent")) %>%
slice_head(n = 20) %>%
kable()
| component | percent variance | cumulative percent variance |
|---|---|---|
| 1 | 5.1045232 | 5.104523 |
| 2 | 3.0152155 | 8.119739 |
| 3 | 2.4851997 | 10.604938 |
| 4 | 2.3507613 | 12.955700 |
| 5 | 1.9797727 | 14.935472 |
| 6 | 1.6870924 | 16.622565 |
| 7 | 1.5307607 | 18.153326 |
| 8 | 1.4802663 | 19.633592 |
| 9 | 1.3542519 | 20.987844 |
| 10 | 1.3425025 | 22.330346 |
| 11 | 1.2487346 | 23.579081 |
| 12 | 1.2439870 | 24.823068 |
| 13 | 1.1554448 | 25.978512 |
| 14 | 1.1026987 | 27.081211 |
| 15 | 1.0701438 | 28.151355 |
| 16 | 1.0146546 | 29.166010 |
| 17 | 0.8966015 | 30.062611 |
| 18 | 0.8564496 | 30.919061 |
| 19 | 0.7913306 | 31.710391 |
| 20 | 0.7878970 | 32.498288 |
library(tidytext)
pca_loading <- tidy(pca_prep, pca_step, type = "coef")
pca_loading %>%
filter(component %in% paste0("PC", 1:4)) %>%
group_by(component) %>%
top_n(6, abs(value)) %>%
ungroup() %>%
mutate(terms = reorder_within(terms, abs(value), component)) %>%
ggplot(aes(abs(value), terms, fill = value > 0)) +
geom_col() +
facet_wrap(~component, scales = "free_y", dir = "v") +
scale_y_reordered() +
labs(
x = "Absolute value of contribution",
y = NULL, fill = "Positive?"
)
unknown values. As they all have the same sign, then they are positively correlated together. For example, an observation with an unknown PARKBOULEV is likely to have an unknown DOWNTOWN.WEATHER, ROADCOND & FLOW. This means that Principal Arterial streets tend to have known road condition, flow counts and weather condition at the time of a collision.UMAP is a new technique devised by McInnes et al in 2018 that helps to visualize and understand high dimensional data. In contrast to PCA, it captures non-linear relationships.
library(embed)
neighbors = 30
min_dist = 0.01
umap_rec <- recipe(~., data = data) %>%
update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
step_unknown(all_nominal(), -has_role('ID')) %>%
step_mutate(
LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>%
step_medianimpute(all_numeric(), -has_role('ID')) %>%
step_dummy(all_nominal(), -has_role('ID')) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_umap(all_predictors(), neighbors = neighbors, min_dist = min_dist)
umap_prep <- prep(umap_rec)
umap_prep
saveRDS(umap_prep, "umapPrep_20210214.rds")
umap_prep <- readRDS("umapPrep_20210214.rds")
umap_proj <- juice(umap_prep)
Plotting the UMAP 2D projections
umap_proj %>%
inner_join(data, by = "COLLISION_ID") %>%
ggplot(aes(umap_1, umap_2)) +
geom_point(aes(color = SEVERITYCO), alpha = 0.7, size = 2) +
labs(color = NULL) +
ggtitle("UMAP 2D Projection", subtitle = "Coloured by Collision Severity")
umap_proj %>%
inner_join(data, by = "COLLISION_ID") %>%
ggplot(aes(umap_1, umap_2)) +
geom_point(aes(color = DOWNTOWN), alpha = 0.7, size = 2) +
labs(color = NULL) +
ggtitle("UMAP 2D Projection", subtitle = "Coloured by DOWNTOWN")
umap_proj %>%
inner_join(data, by = "COLLISION_ID") %>%
ggplot(aes(umap_1, umap_2)) +
geom_point(aes(color = ADDRTYPE), alpha = 0.7, size = 2) +
labs(color = NULL) +
ggtitle("UMAP 2D Projection", subtitle = "Coloured by Address Type")
In this section, I will do a predictive modeling for the Severity Class of a collision. It’s a multiclass classification problem with 5 classes.
data %>%
ggplot(aes(x = SEVERITYCO)) +
geom_bar()
Again, I will use the Tidymodels framework for its simplicity & versatility in defining various recipes and model specification.
I start by testing the below 3 models:
1. Logistic Regression It is a simple and fast model. This is a baseline for comparison with other advanced models.
2. Random Forest It is an ensemble learning method that constructs a multitude of decision trees. It was first devised by Tin Kam Ho in 1998 and later extended by Leo Breiman in 2001.
3. XGBoost It is a new algorithm, devised by Tianqi Chen et al in 2016. It is an ensemble tree boosting algorithm, and it had proved to give very good results in different scenarios.
I start with a train test split
library(tidymodels)
set.seed(104)
train_test_split <- data %>%
filter(!is.na(SEVERITYCO)) %>%
initial_split(prop = 4/5, strata = 'SEVERITYCO')
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
And then I create various recipes (a recipe is a specification for pre-processing steps). Different classification models require different recipes. For example a LR model requires normalization of the inputs, whereas a tree model doesn’t. Also, a LR model would benefit form a PCA step (as it removes co-linearity), whereas this step is not required in a tree model.
The main steps included in the LR recipe are:
1. Convert missing values in categorical features to a new category named “unknown”.
2. Median imputation for missing values in numeric features.
3. One-hot encoding for categorical variables.
4. Remove zero variance features (features that have a single value).
5. Scaling & standardization.
6. PCA, retaining 200 PCs.
7. SMOTE up-sampling as the data is unbalanced.
The DT recipe excludes the last 2 steps
library(themis)
# a recipe for logistic regression
lr_rec <- recipe(SEVERITYCO ~ ., data = data) %>%
update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
step_unknown(all_nominal(), -has_role('ID'), -all_outcomes()) %>%
step_mutate(
LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>%
step_medianimpute(all_numeric(), -has_role('ID')) %>%
step_dummy(all_nominal(), -has_role('ID'), -all_outcomes()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors(), num_comp = 200) %>%
step_smote(SEVERITYCO)
lr_rec$name <- "LR Recipe"
# Decision tree recipe, to be used with Random Forest & XGBoost
dt_rec <- recipe(SEVERITYCO ~ ., data = data) %>%
update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
step_unknown(all_nominal(), -has_role('ID'), -all_outcomes()) %>%
step_mutate(
LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>%
step_medianimpute(all_numeric(), -has_role('ID')) %>%
step_dummy(all_nominal(), -has_role('ID'), -all_outcomes()) %>%
step_zv(all_predictors()) %>%
step_smote(SEVERITYCO)
dt_rec$name <- "DT Recipe"
Now, I define the model specifications. I define the below 3 specifications:
1. Multinomial logistic regression.
2. Random forest.
3. XGBoost.
#' Multinomial Regression
lr_clf <-
multinom_reg(
penalty = tune(),
mixture = tune()) %>%
set_engine("glmnet") %>%
set_mode("classification")
lr_clf$name <- 'glmnet'
params <- lr_clf %>%
parameters()
lr_grid <- grid_max_entropy(params, size = 6)
#lr_grid
# Random Forest
rf_clf <-
rand_forest(
trees = tune(),
min_n = tune()) %>%
set_engine(
"ranger",
num.threads = 11,
verbose = TRUE,
seed = 123) %>%
set_mode("classification")
rf_clf$name <- 'ranger'
params <- rf_clf %>%
parameters()
rf_grid <- grid_max_entropy(params, size = 6)
#rf_grid
#' XGBoost
xg_clf <-
boost_tree(
tree_depth = tune(),
trees = 300,
learn_rate = 0.01,
sample_size = tune(),
mtry = tune(),
min_n = tune()) %>%
set_engine(
"xgboost",
nthread = 11,
verbosity = 2) %>%
set_mode("classification")
xg_clf$name <- 'XGBoost'
params <- xg_clf %>%
parameters() %>%
update(mtry = predictor_prop())
xg_grid <- grid_max_entropy(params, size = 12)
#xg_grid
# custom list
custom_list <- list(
list(lr_rec, lr_clf, lr_grid, "None"),
list(dt_rec, rf_clf, rf_grid, "None"),
list(dt_rec, xg_clf, xg_grid, "None")) %>%
transpose(
.names = c("data_recipe", "clf", "clf_grid", "parallel_backend")) %>%
as_tibble(.name_repair = "minimal")
class(custom_list) <- c("tune_spec", "tbl_df", "tbl", "data.frame")
#print(summary(custom_list))
Below is the initial training of the 3 candidate models. I don’t do a exhaustive hyper-parameter tuning as my aim here is just an initial comparison between the 3 models. Once I settle on a choice model, I will do a more exhaustive search for the optimum hyper-parameters.
library(tictoc)
tic()
FitModel <- TuneFitFactory()
fit_results <- pmap(
custom_list,
FitModel,
train_data = data_train,
outcome = "SEVERITYCO")
timing <- toc()
elapsed <- timing$toc - timing$tic
print(str_interp("Done fitting ${nrow(custom_list)} models in $[.2f]{elapsed/60} minutes."))
saveRDS(fit_results, "fitResults_20210215.rds")
Below, I compare the results of the 3 initial tunings
fit_results <- readRDS("fitResults_20210215.rds")
models_metrics <- map_dfr(fit_results, possibly(ExtractModelInfo, otherwise = NULL))
models_metrics %>%
select(-best_acc_model, -best_auc_model) %>%
kable()
| clf | recipe | best_accuracy | best_roc_auc | n | mean_accuracy | mean_roc_auc | sd_accuracy | sd_roc_auc | fit_time |
|---|---|---|---|---|---|---|---|---|---|
| glmnet | LR Recipe | 0.9882863 | 0.9953769 | 6 | 0.8660117 | 0.9033425 | 0.1827193 | 0.1987817 | 5505.490 |
| ranger | DT Recipe | 0.9924442 | 0.9994215 | 6 | 0.9919263 | 0.9993457 | 0.0005637 | 0.0001001 | 2540.848 |
| XGBoost | DT Recipe | 0.9951491 | 0.9999239 | 12 | 0.9525491 | 0.9830514 | 0.0617329 | 0.0288334 | 5110.510 |
plotMetrics(models_metrics, "roc_auc")
plotMetrics(models_metrics, "accuracy")
models_metrics %>%
ggplot(aes(x = clf, y = fit_time)) +
geom_col(fill = "orange") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_log10()
#### Takeaways
* The fit time of LR is longer the Random Forest & XGBoost. This may seem counter-intuitive, but it is because I trained the LR using one thread. Random Forest & XGBoost implementations has inherent parallelisation and I trained them using 11 parallel threads.
* The accuracy and ROC AUC has very high unrealistic figures. This is because I’m using features that are highly correlated with the outcome, such as the count of injured or severely insured people. In real world scenarios it is not realistic to include those features as predictors, because those features are only known after the accident and not before. I will exclude those features as my aim is to predict the accident severity before it happens.
Analyzing the feature importance confirms the above statement. To extract the feature importance, I have to train the model outside of the hyper-parameter search cycle. This is because the above training returns only the evaluation metrics, and not the trained model objects.
# the parameters of the best xgboost model
xg_best <- fit_results %>%
pluck(3) %>%
pluck("resamples") %>%
select_best('roc_auc')
# create a workflow using those parameters
final_wf <-
workflow() %>%
add_model(xg_clf) %>%
add_recipe(dt_rec) %>%
finalize_workflow(xg_best)
final_wf
# create cv folds (resamples)
set.seed(12)
folds <- vfold_cv(data_train, v = 10, strata = "SEVERITYCO")
# fit the model to those resamples
#final_fit <- final_wf %>%
# fit(data_test)
#final_fit
xg_fit <- final_wf %>%
last_fit(train_test_split)
saveRDS(xg_fit, "xgFit_20210215.rds")
Now, I see the variable importance of top important features.
library(vip)
xg_fit <- readRDS("xgFit_20210215.rds")
xg_fit %>%
pull(.workflow) %>%
pluck(1) %>%
pull_workflow_fit() %>%
vip()
The above graph shows that the most important features are the number of injuries, fatalities & serious injuries. Including those features is not practical in a real life situation, as they are only known after the accident happens. I will exclude those features from the modeling to make for a more practical model.
# features that are closely coupled with the predictor (i.e. those features should be predicted as well), hence I will remove from the feature space.
pred_features <- c("PERSONCOUN", "PEDCOUNT", "PEDCYLCOUN", "VEHCOUNT",
"INJURIES", "SERIOUSINJ", "FATALITIES")
dt_rec <- dt_rec %>%
update_role(any_of(pred_features), new_role = "preds")
# workflow
final_wf <-
workflow() %>%
add_model(xg_clf) %>%
add_recipe(dt_rec) %>%
finalize_workflow(xg_best)
# train
xg_fit <- final_wf %>%
last_fit(train_test_split)
# save
saveRDS(xg_fit,
paste0("xgbFitReducedFeatures_",
gsub("-", "", lubridate::today()),
".rds"))
I check the variable importance of the modified model.
library(vip)
xg_fit <- readRDS("xgbFitReducedFeatures_20210221.rds")
xg_fit %>%
pull(.workflow) %>%
pluck(1) %>%
pull_workflow_fit() %>%
vip()
Now, it is time to use hyper-parameter tuning to find the best set of parameters. I also use Racing Methods, devised by [Maron and Moore (1994)], (https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=Hoeffding+races%3A+Accelerating+model+selection+search+for+classification+and+function+approximation&btnG=) to reduce the total number of model evaluations. The method is described by Max Kuhn (2014) in this paper.
I will use the LightGBM model instead of XGBoost as it is considerably faster. LightGBM is gradient boosting framework developed by Microsoft that uses tree based learning algorithms.
library(treesnip)
pred_features <- c("PERSONCOUN", "PEDCOUNT", "PEDCYLCOUN", "VEHCOUNT",
"INJURIES", "SERIOUSINJ", "FATALITIES")
# LightGBM
lg_rec <- recipe(SEVERITYCO ~ ., data = data) %>%
update_role(COLLISION_ID, INCDATE, new_role = "id") %>%
update_role(any_of(pred_features), new_role = "preds") %>%
step_unknown(all_nominal(), -has_role('ID'), -all_outcomes()) %>%
step_mutate(
LIGHTCOND = fct_collapse(LIGHTCOND, "Unknown" = c("Unknown", "unknown")),
JUNCTIONTY = fct_collapse(JUNCTIONTY, "Unknown" = c("Unknown", "unknown")),
WEATHER = fct_collapse(WEATHER, "Unknown" = c("Unknown", "unknown")),
ROADCOND = fct_collapse(ROADCOND, "Unknown" = c("Unknown", "unknown"))) %>%
step_zv(all_predictors())
# model spec
lg_spec <-
boost_tree(
tree_depth = tune(),
trees = tune(),
sample_size = tune(),
mtry = tune(),
min_n = tune()) %>%
set_engine(
"lightgbm",
nthread = 11,
verbosity = 1) %>%
set_mode("classification")
# parameter grid
grid_size = 16
params <- lg_spec %>%
parameters() %>%
update(mtry = predictor_prop(),
sample_size = predictor_prop())
set.seed(911)
race_grid <- grid_max_entropy(params, size = grid_size)
# workflow
wflow <- workflow() %>%
add_recipe(lg_rec) %>%
add_model(lg_spec)
# cv folds
set.seed(12)
folds <- vfold_cv(data_train, v = 8, strata = "SEVERITYCO")
library(finetune)
library(tictoc)
# race search for best hyper-parameters
tic()
race_tune <- tune_race_anova(
wflow,
folds,
grid = race_grid,
metrics = metric_set(roc_auc),
control = control_race(
verbose = TRUE,
verbose_elim = TRUE))
toc()
saveRDS(race_tune,
paste0("xgbRacingResamplesN16_",
gsub("-", "", lubridate::today()),
".rds"))
The below plot illustrates the racing condition. Out of 16 different models (16 different set of hyper-parameters), only 5 made it to the end. So, instead of fitting 128 models (16 models x 8 resmaples), fewer models were fitted.
require(finetune)
race_tune <- readRDS("xgbRacingResamplesN16_20210222.rds")
plot_race(race_tune)
autoplot(race_tune)
show_best(race_tune, metric = "roc_auc", n = 3) %>% kable()
| mtry | trees | min_n | tree_depth | sample_size | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.9940907 | 1871 | 35 | 2 | 0.0546378 | roc_auc | hand_till | 0.8198579 | 8 | 0.0022909 | Preprocessor1_Model16 |
| 0.8477562 | 741 | 31 | 3 | 0.2075844 | roc_auc | hand_till | 0.8194337 | 8 | 0.0022680 | Preprocessor1_Model12 |
| 0.8354038 | 1085 | 2 | 5 | 0.7793958 | roc_auc | hand_till | 0.8152093 | 8 | 0.0029104 | Preprocessor1_Model10 |
The best 2 models models have very close ROC AUC, whereas the 2nd model is less complex. I follow the Principle of Parsimony and select the 2nd model.
I use this final model of the final training, whereas I use all the training data for training (no resamples here) and test the performance on the test samples.
bt_best <- select_by_pct_loss(race_tune, metric = "roc_auc", trees)
bt_best %>% kable()
| mtry | trees | min_n | tree_depth | sample_size | .metric | .estimator | mean | n | std_err | .config | .best | .loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.8812664 | 543 | 7 | 8 | 0.1255015 | roc_auc | hand_till | 0.8127876 | 8 | 0.0029113 | Preprocessor1_Model13 | 0.8198579 | 0.8623834 |
final_wf <-
wflow %>%
finalize_workflow(bt_best)
# create cv folds (resamples)
set.seed(12)
folds <- vfold_cv(data_train, v = 10, strata = "SEVERITYCO")
bt_fit <- final_wf %>%
last_fit(train_test_split)
saveRDS(bt_fit,
paste0("boostedTreeLastFit_",
gsub("-", "", lubridate::today()),
".rds"))
Now I use LIME technique for local interpretability of machine learning models. This technique was devised by Riberia et al at 2016.
# load the fitted model (`last_fit` object)
bt_fit <- readRDS("xgbFitReducedFeatures_20210221.rds")
# extract model, mold & data of final XGBoost object
final_model <- bt_fit %>%
pull(.workflow) %>%
pluck(1) %>%
extract_model()
final_mold <- bt_fit %>%
pull(.workflow) %>%
pluck(1) %>%
pull_workflow_mold()
train_X_df <- final_mold$predictors # needed for LIME
train_X <- final_mold$predictors %>%
data.matrix()
train_y_df <- final_mold$outcomes
train_y <- final_mold$outcomes %>%
data.matrix()
train_ids <- final_mold$extras %>%
pluck(1) %>%
pluck(1)
# use LIME
library(lime)
explainer <- lime(train_X_df, final_model)
# explain 1st 1000 observations
explanation <- lime::explain(train_X_df[1:1000,], explainer, n_labels = 1, n_features = 3)
saveRDS(explanation,
paste0("limeExplainN1000_",
gsub("-", "", lubridate::today()),
".rds"))
explanation <- readRDS("limeExplainN1000_20210223.rds")
explanation %>%
select(-data) %>%
separate(feature, into = c("feature_name", "feature_class"), sep = "_", extra = "merge") %>%
ggplot(aes(x = feature_weight, y = feature_name, col = feature_value)) +
geom_jitter(height = 0.2, alpha = 0.4)
The features that have most weight in interpreting the model are
ST, WEATHER and SURFACET.
Explain the first three observations
explanation <- lime::explain(train_X_df[1:3,], explainer, n_labels = 1, n_features = 6)
plot_features(explanation)
The above is an example of interpreting specific observations. For example, the 1st observation is predicted to be of class 1 with a probability of 0.82, the features that impacted this prediction are SURFACET (Primary pavement used on Street surface), PVMCOND (pavement condition) and ADT (average daytime traffic).
The 3rd observation is impacted by1STREETTYPE (street type) and ADT (average daytime traffic).