Factors influencing perceptions towards a potential Motorbike ban in Hanoi: A boosted trees approach

Author

Eric Wanjau, Minh Kieu

Quarto

This Notebook is created using Quarto. Quarto is a multi-language, next-generation version of R Markdown from RStudio, and includes dozens of new features and capabilities while at the same being able to render most existing Rmd files without modification. To learn more about Quarto see https://quarto.org.

Summary: the important parts

This document implements a Classification model in Machine Learning on the Hanoi’s travel survey data, as part of the project Urban transport modelling for sustainable well-being in Hanoi. This is me learning and trying to build upon Minh Kieu’s work.

The objectives of the model were to:

See if we can predict whether a person would like or dislike the motorbike ban
See if how many variables we can take off the training dataset, and still make good prediction? This is important because what if we can anticipate travellers’ perception to transport policies using limited dataset (e.g. census data)
Explain the model’s outcomes to tell variable importance, and understand more about the data
See if we can have some policy implications while doing this modelling work

The following packages will be required for this adventure:

Code

suppressWarnings(if(!require("pacman")) install.packages("pacman"))

pacman::p_load('tidyverse', 'here', 'sf', 'tmap', 'stplanr','broom',
               'tidymodels', 'paletteer', 'patchwork', 'themis', 'stacks',
               "spatialsample", "glue",
               "vip", "xgboost", "SHAPforxgboost", "tidytext", "textrecipes", "beepr", "finetune")

#conflicted::conflict_prefer("slice", "dplyr")

doParallel::registerDoParallel()

1 Set Up

1.1 Loading data

The code chunk below loads a copy of a cleaned survey data set.

Code

# Load the data
hn_survey <- read_csv("hn_survey.csv", col_types = cols(.default = "c")) %>% 
  
  mutate(across(contains(c("row","homel", "origl", "destl", "dist", "trav", "index")), ~as.numeric(.x)))


# Dimensions of data
dim(hn_survey)

[1] 26109    57

Code

# User defined functions for EDA
# Function for summarizing the distribution of opinions
summarize_opinionban <- function(tbl){
  tbl %>% 
  #group_by(occup) %>% 
  summarise(n_agree = sum(opinion_ban == "agree"), n_total = n()) %>%
  ungroup() %>% 
  mutate(pct_agree = n_agree / n_total) %>% 
  arrange(desc(pct_agree)) %>% 
  mutate(low = qbeta(0.025, n_agree + 5, n_total - n_agree + .5),
         high = qbeta(0.975, n_agree + 5, n_total - n_agree + .5),
         pct = n_total / sum(n_total))
  } 


# Function for binding number of observations to above results
bind_count = function(x){
  as_tibble(x) %>% 
  add_count(value) %>% 
  mutate(value = glue("{value} ({n})")) %>%
    pull(value)
  
}

Let’s take a peek at the data

glimpse
tibble

Code

glimpse(hn_survey)

Rows: 26,109
Columns: 57
$ rowid          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ~
$ id             <chr> "80606769", "77517825", "80493824", "84394903", "844778~
$ homelat        <dbl> 20.99640, 20.97593, 21.01729, 21.00348, 20.99657, 21.01~
$ homelon        <dbl> 105.8451, 105.7936, 105.8085, 105.8084, 105.8084, 105.8~
$ age            <chr> "18_25", "18_25", "18_25", "18_25", "less_18", "less_18~
$ occup          <chr> "student", "student", "student", "student", "student", ~
$ gender         <chr> "male", "female", "male", "female", "female", "male", "~
$ origlat        <dbl> 20.99640, 20.98280, 21.01729, 21.00348, 20.99657, 21.01~
$ origlon        <dbl> 105.8451, 105.8009, 105.8085, 105.8084, 105.8084, 105.8~
$ destlat        <dbl> 21.00334, 20.99657, 20.99640, 20.99657, 20.98974, 20.98~
$ destlon        <dbl> 105.8378, 105.8084, 105.8451, 105.8084, 105.7936, 105.7~
$ purp           <chr> "education", "education", "education", "education", "ed~
$ vehic          <chr> "moto", "moto", "moto", "moto", "moto", "moto", "ebike"~
$ freqpweek      <chr> "4_7", "4_7", "4_7", "4_7", "4_7", "4_7", "4_7", "4_7",~
$ OD_dist        <dbl> 1.0825039, 1.7145142, 4.4571154, 0.7676662, 1.7085007, ~
$ network_dist   <dbl> 1.766566, 2.342279, 5.515824, 1.401829, 2.223775, 3.349~
$ travtime       <dbl> 5, 10, 20, 15, 15, 15, 30, 10, 15, 15, 15, 5, 20, 20, 8~
$ reason         <chr> "convinient", "convinient", "cost convinient timesaving~
$ own_car        <chr> "2", "0", "0", "0", "0", "1", "0", "1", "1", "1", "1", ~
$ own_motob      <chr> "2", "more_5", "2", "4", "3", "2", "3", "2", "2", "3", ~
$ own_ebike      <chr> "1", "2", "1", "0", "0", "0", "2", "0", "0", "0", "2", ~
$ own_bike       <chr> "0", "1", "0", "0", "0", "0", "3", "1", "0", "0", "0", ~
$ freq_car       <chr> "0", "1_5", "0", "1_5", "0", "1_5", "0", "0", "1_5", "1~
$ freq_motob     <chr> "0", "more_20", "1_5", "more_20", "more_20", "more_20",~
$ freq_ebike     <chr> "0", "1_5", "1_5", "1_5", "0", "0", "more_20", "0", "0"~
$ freq_bike      <chr> "0", "1_5", "0", "1_5", "0", "0", "0", "0", "0", "0", "~
$ freq_taxi      <chr> "0", "1_5", "0", "1_5", "0", "0", "0", "0", "0", "0", "~
$ freq_bus       <chr> "0", "1_5", "0", "6_10", "0", "0", "0", "0", "0", "0", ~
$ school_acc     <chr> NA, "good", "good", "very_good", "good", "very_good", "~
$ market_acc     <chr> NA, "neutral", "good", "very_good", "good", "very_good"~
$ hosp_acc       <chr> NA, "neutral", "good", "good", "good", "neutral", "neut~
$ bank_acc       <chr> NA, "good", "good", "very_good", "good", "neutral", "ne~
$ leis_acc       <chr> NA, "neutral", "good", "very_good", "good", "good", "ne~
$ type           <chr> "old_building", "resettlement", "high_rise", "private_h~
$ status         <chr> "long_stay", "long_stay", "long_stay", "long_stay", "pe~
$ own            <chr> "rent", "rent", "rent", "rent", "parent_house", "parent~
$ reas_not_ebike <chr> "cost", "unsafe", "slow", "jam", "jam", "jam", "cost", ~
$ reas_not_motob <chr> "cost", "jam", "unsafe", "jam", "jam", "jam", "cost", "~
$ dist_to_pub    <dbl> 50, 1000, 500, 750, 10, 500, 600, 50, 400, 200, 100, 15~
$ aware_ban      <chr> "no", "yes", "yes", "yes", "yes", "yes", "no", "no", "y~
$ fut_veh        <chr> "car", "car", "car", "no", "no", "no", "moto", "no", "n~
$ alt_veh        <chr> "car", "ebike bike bus walk", "car lighttrain", "ebike ~
$ alt_car        <chr> "1", "0", "1", "0", "0", "0", "0", "1", "0", "1", "1", ~
$ alt_ebike      <chr> "0", "1", "0", "1", "0", "0", "1", "0", "0", "1", "1", ~
$ alt_bike       <chr> "0", "1", "0", "0", "0", "1", "0", "0", "0", "0", "0", ~
$ alt_bus        <chr> "0", "1", "0", "1", "1", "1", "0", "0", "1", "0", "0", ~
$ alt_ltrain     <chr> "0", "0", "1", "0", "0", "0", "0", "0", "0", "0", "0", ~
$ alt_taxi       <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", ~
$ alt_walk       <chr> "0", "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", ~
$ opinion_car    <chr> "neutral", "neutral", "verybad", "good", "neutral", "go~
$ opinion_motob  <chr> "neutral", "good", "verybad", "verygood", "good", "good~
$ opinion_ebike  <chr> "neutral", "neutral", "verybad", "verygood", "good", "g~
$ opinion_bike   <chr> "neutral", "neutral", "verybad", "verygood", "good", "g~
$ opinion_taxi   <chr> "neutral", "neutral", "verybad", "good", "good", "good"~
$ opinion_bus    <chr> "neutral", "neutral", "verybad", "good", "neutral", "ne~
$ opinion_ban    <chr> "agree", "disagree", "strongagree", "neutral", "agree",~
$ index          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~

Code

hn_survey %>% 
  slice_head(n = 5)

# A tibble: 5 x 57
  rowid id    homelat homelon age   occup gender origlat origlon destlat destlon
  <dbl> <chr>   <dbl>   <dbl> <chr> <chr> <chr>    <dbl>   <dbl>   <dbl>   <dbl>
1     1 8060~    21.0    106. 18_25 stud~ male      21.0    106.    21.0    106.
2     2 7751~    21.0    106. 18_25 stud~ female    21.0    106.    21.0    106.
3     3 8049~    21.0    106. 18_25 stud~ male      21.0    106.    21.0    106.
4     4 8439~    21.0    106. 18_25 stud~ female    21.0    106.    21.0    106.
5     5 8447~    21.0    106. less~ stud~ female    21.0    106.    21.0    106.
# ... with 46 more variables: purp <chr>, vehic <chr>, freqpweek <chr>,
#   OD_dist <dbl>, network_dist <dbl>, travtime <dbl>, reason <chr>,
#   own_car <chr>, own_motob <chr>, own_ebike <chr>, own_bike <chr>,
#   freq_car <chr>, freq_motob <chr>, freq_ebike <chr>, freq_bike <chr>,
#   freq_taxi <chr>, freq_bus <chr>, school_acc <chr>, market_acc <chr>,
#   hosp_acc <chr>, bank_acc <chr>, leis_acc <chr>, type <chr>, status <chr>,
#   own <chr>, reas_not_ebike <chr>, reas_not_motob <chr>, ...

Where do we have missing values?

Code

colSums(is.na(hn_survey)) %>% 
  as.data.frame() %>% 
  filter(. > 0)

                 .
network_dist   205
school_acc   13021
market_acc   12634
hosp_acc     12723
bank_acc     12771
leis_acc     13066

Let’s see how we’ll handle these - that is, if they will be even important in our model.

1.2 Data Preprocessing

Here we do the following steps to make sure that the data is ready for modelling:

Reduce the label column from 5 outcomes (Agree, Strongly Agree, Neutral, Disagree, Strongly Disagree) to two: Agree or Disagree

What’s the various levels of the outcome?

Code

theme_set(theme_minimal())
# Distribution of opinions
hn_survey %>% 
  count(opinion_ban) %>%
  mutate(opinion_ban = fct_reorder(opinion_ban, desc(n))) %>%
  ggplot(mapping = aes(x = opinion_ban, y = n)) +
  geom_col(aes(fill = opinion_ban), show.legend = FALSE, alpha = 0.7) +
  geom_text(aes(label = n, y = n + 200)) +
  paletteer::scale_fill_paletteer_d("ggsci::default_nejm")

Count of various opinions towards a potential motorbike ban

Note

In this survey, majority of the respondents agree with the ban. This is quite different from what we had at the beginning when majority of respondents were disagreeing with the ban.

Data preprocessing steps:

Drop Neutral
Aggregate disagree and strong disagree
Aggregate agree and strong agree

Code

# Aggregate data
hn_survey_agg <- hn_survey %>% 
  # Make Neutral Disagree
  filter(opinion_ban != "neutral") %>% 
  # Aggregate opinions
  mutate(opinion_ban = case_when(
    opinion_ban == "strongdisagree" ~ "disagree",
    #opinion_ban == "neutral" ~ "disagree",
    opinion_ban == "strongagree" ~ "agree",
    TRUE ~ opinion_ban
  )) %>% 
  tibble()

Tibble Count
Bar Plot

Code

# Verify
hn_survey_agg %>% 
  count(opinion_ban, sort = TRUE)

# A tibble: 2 x 2
  opinion_ban     n
  <chr>       <int>
1 agree       12321
2 disagree     7489

Code

hn_survey_agg %>% 
  count(opinion_ban) %>% 
  mutate(opinion_ban = fct_reorder(opinion_ban, (n))) %>% 
  ggplot(mapping = aes(x = opinion_ban, y = n)) +
  geom_col(aes(fill = opinion_ban), show.legend = FALSE, alpha = 0.7, width = 0.5) +
  geom_text(aes(label = n, y = n + 200)) +
  paletteer::scale_fill_paletteer_d("ggsci::default_nejm")

Note

Not much of a class imbalance but will upsample on the training set.

2 A little Exploratory Data Analysis of nominal variables

The following sections try to do some Exploratory Data Analysis aimed at:

Getting a better understanding of how the variables related to opinion on ban
Identify which variables might be potentially predictive
Identify potential feature engineering possibilities to extract a better signal.

2.1 How is age related to opinion on ban?

Code

# How opinions vary with age 
hn_survey_agg %>% 
  group_by(age) %>% 
  summarize_opinionban()

# A tibble: 6 x 7
  age     n_agree n_total pct_agree   low  high     pct
  <chr>     <int>   <int>     <dbl> <dbl> <dbl>   <dbl>
1 more_75     124     166     0.747 0.685 0.814 0.00838
2 18_25      3745    5459     0.686 0.674 0.698 0.276  
3 36_55      3406    5288     0.644 0.631 0.657 0.267  
4 26_35      4356    7483     0.582 0.571 0.594 0.378  
5 56_75       607    1048     0.579 0.551 0.611 0.0529 
6 less_18      83     366     0.227 0.195 0.281 0.0185

Let’s explore this visually.

Code

# Bar plots
hn_survey_agg %>% 
  count(age, opinion_ban) %>% 
  mutate(opinion_ban = factor(opinion_ban, levels = c("disagree", "agree"))) %>% 
  mutate(age = factor(age,
                      levels = hn_survey_agg %>% distinct(age) %>% dplyr::slice(c(2, 1, 3:6)) %>% pull(age) )) %>% 
  ggplot(mapping = aes(x = age, y = n)) +
  geom_col(aes(fill = opinion_ban), position = "dodge", alpha = 0.7) +
  scale_fill_manual(values = sample(c("#32A251FF" ,"#FF7F0FFF" , "#3CB7CCFF", "#FFD94AFF", "#39737CFF", "#B85A0DFF"), size = 2, replace = FALSE)) +
  theme(legend.position = "top") +
  labs(fill = "")

Code

# Scatter plot
hn_survey_agg %>% 
  group_by(age = bind_count(age)) %>% 
  summarize_opinionban() %>% 
  mutate(age = fct_reorder(age, pct_agree)) %>% 
  ggplot(mapping = aes(x = pct_agree, y = age)) +
  geom_point(aes(size = pct), show.legend = T) +
  geom_errorbarh(aes(xmin = low, xmax = high), height = .3) +
  labs(
    x = "Percentage of folks in each category who agree",
    title = "Distribution of opinions by age",
    size = "%prevalence",
    subtitle = "The number in brackets indicates the total count of respondents in that category"
  ) +
  scale_x_continuous(labels = percent) +
  scale_size_continuous(labels = percent) +
  theme(plot.title = element_text(hjust = 0.5))

Note

It seems that cumulatively, people in the age group > 18 years tend to agree with the ban of motorcycles from a section of Hanoi CBD. 75 % of respondents > 75 years agree though they only make 0.8% of the total respondents. Their opinion makes sense because of their age and are less likely to use motorcycles.

Interesting to observe that a higher proportion of respondents in the bracket 18-25 tend to agree than in 26-35.

Majority of 18 - 25 would be students/young professionals. Are their opinions based on their educational knowledge of the effects of motorbikes? But as we’ll see, trips for education purposes have lower probabilities of agreeing to ban.

Why are respondents less that 18 disagreeing with the ban?

Majority of 26-35 would be working. Around 58% of them agree to the ban. Probably a lot of them use motorcyles for work?

Good questions to explore with the team!

Seems like Age matters.

2.2 Does occupation influence opinion?

Code

# Investigate r/ship between occupation and opinion
hn_survey_agg %>% 
  group_by(occup) %>% 
  summarize_opinionban()

# A tibble: 5 x 7
  occup   n_agree n_total pct_agree   low  high    pct
  <chr>     <int>   <int>     <dbl> <dbl> <dbl>  <dbl>
1 fdi        3617    4610     0.785 0.773 0.796 0.233 
2 state      4224    6355     0.665 0.653 0.676 0.321 
3 student    1479    2851     0.519 0.501 0.538 0.144 
4 retired     400     781     0.512 0.480 0.550 0.0394
5 private    2601    5213     0.499 0.486 0.513 0.263

Exploring this visually:

Code

# General overview
hn_survey_agg %>% 
  count(occup, opinion_ban) %>%  
  mutate(opinion_ban = factor(opinion_ban, levels = c("disagree", "agree"))) %>% 
  ggplot(mapping = aes(x = occup, y = n)) +
  geom_col(aes(fill = opinion_ban), position = "dodge", alpha = 0.7) +
  scale_fill_manual(values = sample(c("#32A251FF" ,"#FF7F0FFF" , "#3CB7CCFF", "#FFD94AFF", "#39737CFF", "#B85A0DFF"), size = 2, replace = FALSE))+ 
  theme(legend.position = "top") +
  labs(fill = "")

Code

# Narrowing our analysis to the different factor levels
hn_survey_agg %>% 
  group_by(occup = bind_count(occup)) %>% 
  summarize_opinionban() %>% 
  mutate(occup = fct_reorder(occup, pct_agree)) %>% 
  ggplot(mapping = aes(x = pct_agree, y = occup)) +
  geom_point(aes(size = pct), show.legend = F) +
  geom_errorbarh(aes(xmax = low, xmin = high), height = .3) +
  labs(
    x = "Percentage of folks in each category who agree",
    title = "Distribution of opinions by Occupation",
  size = "%prevalence"
  ) +
  scale_x_continuous(labels = percent) +
  scale_size_continuous(labels = percent) +
  theme(plot.title = element_text(hjust = 0.5))

Note

Respondents employed by FDI and state tend to agree with ban. Is this due to influence by government?

There is very little difference among respondents who agree/disagree with the ban in the other categories.

2.3 How opinion on ban varies with gender of respondents.

Let’s create some user defined functions that take a particular returns the plots above.

Tip

Please see rlang documentation for more info on Tidy evaluation.

Code

# Plot bar charts
bar_plot <- function(column){
  hn_survey_agg %>% 
  count({{ column }}, opinion_ban) %>%  
  mutate(opinion_ban = factor(opinion_ban, levels = c("disagree", "agree"))) %>%
  ggplot(mapping = aes(x = {{ column }}, y = n)) +
  geom_col(aes(fill = opinion_ban), position = "dodge", alpha = 0.7) +
  scale_fill_manual(values = sample(c("#32A251FF" ,"#FF7F0FFF" , "#3CB7CCFF", "#FFD94AFF", "#39737CFF", "#EB5291FF"), size = 2, replace = FALSE)) +
  theme(legend.position = "top") +
  labs(fill = "")
    
}


# Make dot plot

dot_plot <- function(column){
  hn_survey_agg %>% 
  group_by({{ column }} := bind_count({{ column }})) %>% 
  summarize_opinionban() %>% 
  mutate({{ column }} := fct_reorder({{ column }}, pct_agree)) %>%
  ggplot(mapping = aes(x = pct_agree, y = {{ column }})) +
  geom_point(aes(size = pct), show.legend = FALSE) +
  geom_errorbarh(aes(xmax = low, xmin = high), height = .3) +
  labs(
    x = "Percentage of folks in each category who agree",
    title = "",
  size = "%prevalence"
  ) +
  scale_x_continuous(labels = percent) +
  scale_size_continuous(labels = percent) +
  theme(plot.title = element_text(hjust = 0.5))
}

Influence of gender on opinion:

Code

hn_survey_agg %>% 
  group_by(gender) %>% 
  summarize_opinionban()

# A tibble: 2 x 7
  gender n_agree n_total pct_agree   low  high   pct
  <chr>    <int>   <int>     <dbl> <dbl> <dbl> <dbl>
1 male      7286   10561     0.690 0.681 0.699 0.533
2 female    5035    9249     0.544 0.534 0.555 0.467

Code

# Does gender influence one's opinion
bar_plot(gender)

Code

# Narrowing our analysis to the different factor levels
dot_plot(gender)

Note

Around 68% of male respondents agree with the ban. Does this imply usage of motorcyles is greater among female respondents?

2.4 How opinion on ban varies with the means of transport used.

Code

hn_survey_agg %>% 
  group_by(vehic) %>% 
  summarize_opinionban()

# A tibble: 8 x 7
  vehic n_agree n_total pct_agree   low  high      pct
  <chr>   <int>   <int>     <dbl> <dbl> <dbl>    <dbl>
1 car      4605    5586     0.824 0.814 0.834 0.282   
2 bike      667     891     0.749 0.721 0.777 0.0450  
3 tram        5       7     0.714 0.549 0.962 0.000353
4 ebike     657    1128     0.582 0.555 0.613 0.0569  
5 moto     5993   11291     0.531 0.522 0.540 0.570   
6 taxi      116     235     0.494 0.440 0.566 0.0119  
7 walk      164     370     0.443 0.400 0.501 0.0187  
8 bus       114     302     0.377 0.333 0.442 0.0152

Code

# Does gender influence one's opinion
bar_plot(vehic)

Code

# Narrowing our analysis to the different factor levels
dot_plot(vehic)

Note

Majority of respondents who use cars and bikes tend to agree with the ban. Taxi, walk and bus users tend to disagree with the ban. Perhaps they use motorcycles at some point in their journeys?

Drop tram since it’s sparse?

2.5 How opinion ban looks like to frequency of usage of a particular means of transport

Code

hn_survey_agg %>% 
  select(contains("freq_"), opinion_ban) %>%
  pivot_longer(!opinion_ban, names_to = "vehic_freq", values_to = "freq") %>%
  mutate(opinion_ban = factor(opinion_ban, levels = c("disagree", "agree")),
         freq = factor(freq, levels = c("0", "1_5", "6_10", "11_15", "16_20", "more_20"))) %>% 
  ggplot(mapping = aes(x = freq)) +
  geom_bar(aes(fill = opinion_ban), position = "dodge", alpha = 0.7) +
  scale_fill_manual(values = c("#BC3C29FF", "#0072B5FF")) +
  facet_wrap(~ vehic_freq) +
  theme(legend.position = "top") +
  labs(fill = "")

Note

Respondents who use cars more frequently tend to agree with the ban.

Respondents using motorcycles more than 15 times a week tend to disagree with ban.

How do I include this in the model? Only cars/motorcycles?

2.6 Does access to schools, market influence opinion on ban?

Code

hn_survey_agg %>% 
  select(contains("_acc"), opinion_ban) %>%
  pivot_longer(!opinion_ban, names_to = "amenities", values_to = "access") %>%
  mutate(opinion_ban = factor(opinion_ban, levels = c("disagree", "agree"))) %>% 
  ggplot(mapping = aes(x = access)) +
  geom_bar(aes(fill = opinion_ban), position = "dodge", alpha = 0.7) +
  scale_fill_manual(values = c("#BC3C29FF", "#0072B5FF")) +
  facet_wrap(~ amenities)+
  theme(legend.position = "top") +
  labs(fill = "")

Note

The pattern seems pretty much the same across all amenities - not much signal can be derived from this. Has lots of NA values. Won’t include this for now.

2.7 Is ownership of residence related to opinion on ban?

Code

# Does residence ownership influence one's opinion
bar_plot(own)

Code

# Narrowing our analysis to the different factor levels
dot_plot(own)

Note

Respondents who own and live in parent’s house tend to agree to ban than those who mortgage and rent.

2.8 Do people who make more trips per week agree/disagree with ban?

Code

hn_survey_agg %>% 
  group_by(freqpweek) %>% 
  summarize_opinionban()

# A tibble: 7 x 7
  freqpweek n_agree n_total pct_agree   low  high     pct
  <chr>       <int>   <int>     <dbl> <dbl> <dbl>   <dbl>
1 4_7          7880   11261     0.700 0.691 0.708 0.568  
2 1_3          2185    3575     0.611 0.596 0.628 0.180  
3 14_16         234     422     0.555 0.512 0.606 0.0213 
4 8_10         1204    2379     0.506 0.487 0.527 0.120  
5 17_20         427     926     0.461 0.432 0.496 0.0467 
6 more_20        29      86     0.337 0.276 0.472 0.00434
7 11_13         362    1161     0.312 0.288 0.342 0.0586

Code

# Does frequency of trips influence one's opinion
bar_plot(freqpweek)

Code

# Narrowing our analysis to the different factor levels
dot_plot(freqpweek)

Note

Majority of respondents make 4-7 trips per week most of them tend to agree with the ban.

Generally, as the number of trips per week increases, the number of people agreeing to ban decreases. Are motorbikes cheap for frequent travels?

2.9 How opinion on ban varies with trip purpose.

A summary of opinions based on trip purpose:

Code

hn_survey_agg %>% 
  group_by(fut_veh) %>%  
  summarize_opinionban()

# A tibble: 5 x 7
  fut_veh n_agree n_total pct_agree   low  high    pct
  <chr>     <int>   <int>     <dbl> <dbl> <dbl>  <dbl>
1 bike        552     715     0.772 0.742 0.803 0.0361
2 ebike       665     914     0.728 0.699 0.757 0.0461
3 car        4499    6760     0.666 0.654 0.677 0.341 
4 no         4589    7753     0.592 0.581 0.603 0.391 
5 moto       2016    3668     0.550 0.534 0.566 0.185

Code

# Does trip purpose influence one's opinion
bar_plot(purp)

Code

# Narrowing our analysis to the different factor levels
dot_plot(purp)

Note

Respondents who make trips for leisure and shopping are more likely to agree to ban. though they cumulatively make up to 10% of the respondents.

Trips made for education purpose have a lower probability of agreeing to ban. Why? Are motorbikes the most used vehicle for education trips?

2.10 Do opinions on other means of transport influence opinion on motorbike ban?

Code

hn_survey_agg %>% 
  select(contains("opinion")) %>%
  pivot_longer(!opinion_ban, names_to = "mode", values_to = "opinion") %>%
  mutate(opinion_ban = factor(opinion_ban, levels = c("disagree", "agree"))) %>% 
  ggplot(mapping = aes(x = opinion)) +
  geom_bar(aes(fill = opinion_ban), position = "dodge", alpha = 0.7) +
  scale_fill_manual(values = c("#BC3C29FF", "#0072B5FF")) +
  facet_wrap(~ mode)

Note

Really interesting to see that respondents who think motorcycles are good tend to agree to the ban.

Also majority of opinions on cars is very _good. Are people more inclined to use cars next?

2.11 Does being aware of the ban influence opinion on the ban?

Code

hn_survey_agg %>% 
  group_by(aware_ban) %>%  
  summarize_opinionban()

# A tibble: 3 x 7
  aware_ban n_agree n_total pct_agree   low  high   pct
  <chr>       <int>   <int>     <dbl> <dbl> <dbl> <dbl>
1 yes          9417   12977     0.726 0.718 0.733 0.655
2 donotcare    1614    3653     0.442 0.426 0.459 0.184
3 no           1290    3180     0.406 0.390 0.424 0.161

Code

# Does awareness of ban influence one's opinion
bar_plot(aware_ban)

Code

# Narrowing our analysis to the different factor levels
dot_plot(aware_ban)

Note

Generally, respondents aware of the ban tend to agree to the ban than those who do not care.

Majority of respondents who are unaware of the ban tend to disagree to it.

Would sensitizing the public about the ban lead to more cooperation?

2.12 Any relation between future vehicle and opinion on ban?

Summary of opinions based on future vehicles

Code

hn_survey_agg %>% 
  group_by(fut_veh) %>%  
  summarize_opinionban()

# A tibble: 5 x 7
  fut_veh n_agree n_total pct_agree   low  high    pct
  <chr>     <int>   <int>     <dbl> <dbl> <dbl>  <dbl>
1 bike        552     715     0.772 0.742 0.803 0.0361
2 ebike       665     914     0.728 0.699 0.757 0.0461
3 car        4499    6760     0.666 0.654 0.677 0.341 
4 no         4589    7753     0.592 0.581 0.603 0.391 
5 moto       2016    3668     0.550 0.534 0.566 0.185

Code

# R/ship between future vehicle and one's opinion
bar_plot(fut_veh)

Code

# Narrowing our analysis to the different factor levels
dot_plot(fut_veh)

Note

Majority of respondents (39%) have no future plans of purchasing a vehicle. However, the remaining would purchase a car and 66% of them agree with the ban of motorcycles. Respondents who considered purchasing Bike and ebike are more likely to agree to ban than those intending to buy motorcycles which is plausible.

3 How are opinions spread out based on home location?

Caution

An assumption was made that respondents do actually live in the home location and this sought to explore whether living in certain areas makes one susceptible to agreeing/disagreeing with the ban.

Also, the analysis narrowed down to people who live in Hanoi and within a 5 KM buffer area and trips made within Hanoi (will be useful is exploring how distance and time of travel relate to ban)

Preprocessing spatial variables to only include respondents who live in Hanoi and trips within Hanoi:

Code

library(tmap)
tmap_mode("view")

# Load VN full data set
vn_orig <- st_read("../../data-synced/raw_data/VN.shp", quiet = T) 
hn <- st_read("../../data-synced/raw_data/Hanoi_commune.shp", quiet = T)

# Create a HN outline
hn_out <- hn %>% 
  st_union() 

# Put a 5KM buffer around it
hn_buf <- hn_out %>% 
  st_transform(3405) %>% 
  st_buffer(dist = 5000) %>% 
  st_transform(4326)

# Subset VN data
vn = vn_orig[hn_buf, ]

# Assign area codes to column code
vn <- vn %>% 
  rownames_to_column(var = "code") 

# Home and OD coordinates
home_coord <- st_as_sf(hn_survey_agg, coords = c("homelon", "homelat"), agr = "constant", crs = 4326, remove = FALSE)
origin <- hn_survey_agg %>% 
  st_as_sf(agr = "constant", coords = c("origlon", "origlat"), crs = 4326)
destination <- hn_survey_agg %>% 
  st_as_sf(agr = "constant", coords = c("destlon", "destlat"), crs = 4326)

# Ensure that home coordinates are within HN
index <- st_intersects(home_coord, vn, sparse = F)
home_true <- rowSums(index)

# Ensure that OD coordinates fall within hanoi
index <- st_intersects(origin, vn, sparse = F)
or_true <- rowSums(index)
index <- st_intersects(destination, vn, sparse = F)
de_true <- rowSums(index)



# Filter for those coordinates that fall within Hanoi i.e home_true = 1
home_coord <- home_coord %>% 
  mutate(home_true = home_true, 
         or_true = or_true,
         de_true = de_true) %>% 
  filter(home_true > 0, or_true > 0, de_true > 0)



# Add district names
# District names
index <- st_intersects(home_coord, vn, sparse = F)
home_true <- rowSums(index)

valid_mat <- index %>%
  as_tibble()
names(valid_mat) <- vn$NAME_2


# Pivot longer then obtain the home area codes for regions that intersect
home_code <- valid_mat %>%
  pivot_longer(everything(), names_to = "homecode", values_to = "is_intersect") %>%
  filter(is_intersect == TRUE) %>%
  pull(homecode)

# Add district name
home_coord <- home_coord %>% 
  mutate(district_name = home_code) %>%
  relocate(district_name, .after = homelon)


# Print out data
home_coord %>% 
  slice_sample(n = 5)

Simple feature collection with 5 features and 61 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 105.7493 ymin: 20.95542 xmax: 105.8306 ymax: 21.02422
Geodetic CRS:  WGS 84
# A tibble: 5 x 62
  rowid id      homelat homelon district_name age   occup gender origlat origlon
  <dbl> <chr>     <dbl>   <dbl> <chr>         <chr> <chr> <chr>    <dbl>   <dbl>
1 19685 136428~    21.0    106. Hà Ðông       56_75 reti~ female    21.0    106.
2 12932 786108~    21.0    106. Ð<U+1ED1>ng Ða       18_25 stud~ female    21.0    106.
3  8620 135409~    21.0    106. Hoàng Mai     26_35 state male      21.0    106.
4   150 845051~    21.0    106. Ð<U+1ED1>ng Ða       18_25 stud~ female    21.0    106.
5 12392 812787~    21.0    106. T<U+1EEB> Liêm       18_25 stud~ male      21.0    106.
# ... with 52 more variables: destlat <dbl>, destlon <dbl>, purp <chr>,
#   vehic <chr>, freqpweek <chr>, OD_dist <dbl>, network_dist <dbl>,
#   travtime <dbl>, reason <chr>, own_car <chr>, own_motob <chr>,
#   own_ebike <chr>, own_bike <chr>, freq_car <chr>, freq_motob <chr>,
#   freq_ebike <chr>, freq_bike <chr>, freq_taxi <chr>, freq_bus <chr>,
#   school_acc <chr>, market_acc <chr>, hosp_acc <chr>, bank_acc <chr>,
#   leis_acc <chr>, type <chr>, status <chr>, own <chr>, ...

Code

rm(index)

Distribution of opinions based on respondents’ home location.

Code

pal <- c("#264653", "#2a9d8f", "#e9c46a", "#f4a261", "#e76f51")

# Aggregate areas and group them according to opinion
agg_lat_long = home_coord %>%
  as_tibble() %>% 
  # Aggregate certain areas
  group_by(homelat = round(homelat, 2),
           homelon = round(homelon, 2),
           opinion_ban) %>% 
  summarise(n = n()) %>% 
  st_as_sf(coords = c("homelon", "homelat"), agr = "constant", crs = 4326)


# Plot opinions based on location
agg_lat_long %>%
  tm_shape() +
  tm_dots(size = "n", col = "opinion_ban", palette = c("dodgerblue", "firebrick3"), alpha = 0.5) +
  tm_basemap("OpenStreetMap")

Note

Some areas have predominantly agree/disagree opinions. But most are overlayed on each other, probably due to Nick’s preprocessing? More living further from the city center (especially to the West) tend to agree more? Why lots of red dots in locations across the river? Try out two heatmaps.

Alternative: Summarize opinions based on the district names:

Plot
Tibble

Code

# Opinions based on location plot
home_coord %>% 
  st_drop_geometry() %>% 
  group_by(district_name = bind_count(district_name)) %>% 
  summarize_opinionban() %>% 
  filter(n_total > 10) %>% 
  mutate(district_name = fct_reorder(district_name, pct_agree)) %>% 
  ggplot(mapping = aes(x = pct_agree, y = district_name)) +
  geom_point(aes(size = n_total), show.legend = T) +
  #geom_errorbarh(aes(xmax = low, xmin = high), height = .3) +
  labs(
    x = "Percentage of folks in each category who agree",
    title = "Distribution of opinions by location",
    size = "Total respondents"
  ) +
  scale_x_continuous(labels = percent) +
  theme(plot.title = element_text(hjust = 0.5))

Code

# Opinions based on location tibble
home_coord %>% 
  st_drop_geometry() %>% 
  group_by(district_name = (district_name)) %>% 
  summarize_opinionban() %>% 
  filter(n_total > 10)

# A tibble: 22 x 7
   district_name n_agree n_total pct_agree   low  high     pct
   <chr>           <int>   <int>     <dbl> <dbl> <dbl>   <dbl>
 1 Hoài Ð<U+1EE9>c          613     662     0.926 0.905 0.944 0.0336 
 2 Hoàng Mai        1262    1652     0.764 0.744 0.785 0.0839 
 3 Ðan Phu<U+1EE3>ng         30      41     0.732 0.621 0.864 0.00208
 4 Ba Ðình          1546    2119     0.730 0.711 0.749 0.108  
 5 C<U+1EA7>u Gi<U+1EA5>y         1888    2696     0.700 0.683 0.718 0.137  
 6 Ð<U+1ED1>ng Ða          1811    2653     0.683 0.665 0.701 0.135  
 7 T<U+1EEB> Liêm          1803    2672     0.675 0.657 0.693 0.136  
 8 Tây H<U+1ED3>            478     716     0.668 0.635 0.703 0.0364 
 9 Thanh Xuân        674    1104     0.611 0.583 0.640 0.0561 
10 Hà Ðông           415     754     0.550 0.518 0.588 0.0383 
# ... with 12 more rows

Note

Most of the significant districts (with a large number of respondents) have a large number agreeing with the ban. Long Bien district has a large number of respondents and less than 30% of them agree to ban. Why?

4 A little Exploratory Data Analysis of numeric variables

Let’s explore the numeric predictors. From a previous analysis, there seems to be some incorrect values for dist_to_pub feature:

Code

as_tibble(home_coord) %>% 
  ggplot(mapping = aes(x = dist_to_pub, y = 1)) +
  geom_boxplot()

Code

# Summary statistics for distance to public transport
as_tibble(home_coord) %>% 
  select(dist_to_pub) %>% 
  summary()

  dist_to_pub      
 Min.   :    -6.0  
 1st Qu.:   120.0  
 Median :   200.0  
 Mean   :   516.5  
 3rd Qu.:   500.0  
 Max.   :636578.0

Caution

The following preprocessing is done:

Distance is made absolute
Only distances < 10 KM are considered. Perhaps pick a better threshold for this.

Code

# Remove outliers
home_coord <- home_coord %>% 
  mutate(dist_to_pub = abs(dist_to_pub)) %>% 
  filter(dist_to_pub < 10000)

Summary
Boxplot

Code

# Summary stats of distance to public transport
as_tibble(home_coord) %>% 
  select(dist_to_pub) %>% 
  summary()

  dist_to_pub  
 Min.   :   0  
 1st Qu.: 120  
 Median : 200  
 Mean   : 425  
 3rd Qu.: 500  
 Max.   :8400

Code

# Distribution of distnace to public transport
as_tibble(home_coord) %>% 
  select(dist_to_pub) %>%
  ggplot(mapping = aes(x = dist_to_pub, y = 1)) +
  geom_boxplot(color = "midnightblue")

Code

  #geom_jitter(size = 0.1, alpha = 0.2)

Seems there are quite a number of outliers. Perhaps a threshold of 1KM would be better. Ask Minh’s opinion.

A estimate of how numeric variables relate to the opinion on ban.

Code

numeric_vars <- as_tibble(home_coord) %>% 
  mutate(across(everything(), as.character)) %>% 
  mutate(across(contains("own_"), ~ case_when(.x == "more_5" ~ "7", TRUE ~ .x))) %>%
  select(contains("own_"), OD_dist, network_dist, travtime, dist_to_pub, opinion_ban) %>% 
  mutate(across(!opinion_ban, ~ abs(as.numeric(.x))),
         opinion_ban = factor(opinion_ban, levels = c("disagree", "agree"))) %>% 
  pivot_longer(!opinion_ban, names_to = "metric", values_to = "value")

# Density plot
numeric_vars %>% 
  ggplot(mapping = aes(x = value, color = opinion_ban)) +
  geom_density() +
  scale_x_log10() + 
  labs(color = "") +
  theme(legend.position = "top") +
  facet_wrap(~ metric, scales = "free")

Code

# Use ROc_AUC to determine how well each numeric predictor
# would sway opinions
numeric_vars %>% 
  group_by(metric) %>% 
  roc_auc(opinion_ban, value, event_level = "second") %>% 
  mutate(metric = fct_reorder(metric, .estimate)) %>% 
  ggplot(mapping = aes(x = .estimate, y = metric)) +
  geom_point() +
  geom_vline(aes(xintercept = .5)) +
  labs(
    x = "AUC estimates of agreeing",
    title = "How predictive is each numeric predictor by itself?",
    subtitle = ">.5 means positively influences agree and <.5 means negatively influences agree"
  )

Note

As distance to public transport increases, people tend to disagree with the ban
As the network distance increases, people tend to agree with ban. Why would this be the case? Congestion?
Respondents who own 2 motorbikes tend to agree with ban
Ownership of a car is generally associated with agreeance of the ban.
Respondents who own 1, 3 or 4 motorbikes tend to disagree with the ban.
Generally respondents who own ebikes tend to disgaree with ban

5 Text Analysis of opinions

Word tokenization to explore the relationship between alternative vehicle and opinion ban.

Summary tibble of what how opinions on ban vary with future alternative vehicle in case of ban:

Code

as_tibble(home_coord) %>% 
  unnest_tokens(input = alt_veh, output = alt_veh, token = "words") %>% 
  group_by(alt_veh) %>% 
  summarize_opinionban()

# A tibble: 7 x 7
  alt_veh    n_agree n_total pct_agree   low  high   pct
  <chr>        <int>   <int>     <dbl> <dbl> <dbl> <dbl>
1 bike          4583    6062     0.756 0.745 0.767 0.125
2 car           8955   11899     0.753 0.745 0.760 0.245
3 bus           5747    8023     0.716 0.707 0.726 0.165
4 taxi          4613    6454     0.715 0.704 0.726 0.133
5 walk          3728    5236     0.712 0.700 0.724 0.108
6 lighttrain    3824    5411     0.707 0.695 0.719 0.111
7 ebike         3441    5478     0.628 0.616 0.641 0.113

Explore this visually:

Code

library(tidytext)

# Split future vehicle into tokens and visualize opinion
as_tibble(home_coord) %>% 
  unnest_tokens(input = alt_veh, output = alt_veh, token = "words") %>%
  mutate(opinion_ban = factor(opinion_ban, levels = c("disagree", "agree"))) %>% 
  ggplot(mapping = aes(x = alt_veh)) +
  geom_bar(aes(fill = opinion_ban), position = "dodge", alpha = 0.7) +
  labs(fill = "") +
  scale_fill_manual(values = sample(c("#32A251FF" ,"#FF7F0FFF" , "#3CB7CCFF", "#FFD94AFF", "#39737CFF", "#B85A0DFF"), size = 2, replace = FALSE)) +
  theme(legend.position = "top")

Code

# Category wise analysis
as_tibble(home_coord) %>% 
  unnest_tokens(input = alt_veh, output = alt_veh, token = "words") %>%
  group_by(alt_veh = bind_count(alt_veh)) %>%
  summarize_opinionban() %>% 
  mutate(alt_veh = fct_reorder(alt_veh, pct_agree)) %>% 
  ggplot(mapping = aes(x = pct_agree, y = alt_veh)) +
  geom_point(aes(size = pct), show.legend = F) +
  geom_errorbarh(aes(xmax = low, xmin = high), height = .3) +
  labs(
    x = "Percentage of folks in each category who agree",
    title = "Distribution of opinions by future vehicle preference",
 size = "%prevalence"
  ) +
  scale_x_continuous(labels = percent) +
  scale_size_continuous(labels = percent) +
  theme(plot.title = element_text(hjust = 0.5))

Note

If motorcycles were banned, majority of respondents would result to cars.
Majority of respondents who would result to cars, bike agree with ban.
Interesting to see that respondents who would result to ebikes have lowest probability of agreeing with ban.

In general, we have predictors that clearly distinguish the observations in terms of the opinion on ban, while some, not so much. Will begin by including everything in the model, then use Variable Importance to eliminate non-predictive features.

6 Creating a Boosted Tree model

There are several steps to create a useful model, including parameter estimation, model selection and tuning, and performance assessment.

Tip

If some of the methodologies are a bit over-explained, that’s for future references for my talks, blogs, etc. 🙂

6.1 Data Budgeting

In this section, the data is split into two distinct sets, the training set and the test set. The training set (typically larger) is used to develop and optimize the model by fitting different models and investigating various feature engineering strategies etc.

The other portion of the data is the test set. This is held in reserve until one or two models are chosen as the methods that are most likely to succeed. The test set is then used as the final arbiter to determine the efficacy of the model.

Drop amenity accessibility columns
Remove tram since it’s too sparsely distributed for now.
Make a 70/30 split specification.

Code

set.seed(2056)

# Drop opinions on amenity accessibility
hn_survey_ml <-  as_tibble(home_coord) %>% 
  mutate(district_name_eng = str_remove_all(district_name, "[^a-zA-z]")) %>% 
  relocate(district_name_eng, .after = district_name) %>% 
  select(!contains("_acc")) %>% 
  drop_na() %>% 
  mutate(opinion_ban = factor(opinion_ban, levels = c("agree", "disagree"))) %>% 
  filter(vehic != "tram")

# Make a split specification
hn_survey_split <- hn_survey_ml %>% 
  select(!geometry) %>% 
  as_tibble() %>% 
  mutate(across(contains("own_"), as.character),
    across(contains("own_"), ~ case_when(.x == "more_5" ~ "7", TRUE ~ .x)),
         across(contains("own_"), ~ as.numeric(.x)
         )) %>% 
  # Stratified split based on opinion ban
  initial_split(strata = opinion_ban, prop = 0.7)

# Obtain train and test sets
train <- training(hn_survey_split)
test <- testing(hn_survey_split)

# Print out observations in each category
glue(
  'The training set has {nrow(train)} observations \n',
  'The testing set has {nrow(test)} observations'
)

The training set has 13757 observations 
The testing set has 5897 observations

6.2 Resampling for model evaluation and tuning

Boosted tree models typically have tuning parameters or hyperparameters must be specified ahead of time and can’t be directly found from training data. These are unknown structural or other kind of values that have significant impact on the model but cannot be directly estimated from the data. Instead, hyperparameters are estimated using simulated data sets created from a process called resampling such as cross-validation or bootstrap resampling.

The idea of resampling is to create simulated data sets that can be used to estimate the performance of your model.

Since this data has a spatial component, we’ll do a spatial resample. Spatial or cluster cross-validation splits the data into V groups of disjointed sets using k-means clustering of some variables, typically spatial coordinates. This ensures that we account for the presence of spatial autocorrelation in geospatial data.

Tip

Please see for more information on spatial resampling: Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest

Perform 10 fold spatial cross validation:

Code

library(spatialsample)
set.seed(2056)
#Spatial V fold cross validation
train_folds <- spatial_clustering_cv(data = train, coords = c(homelat, homelon), v = 10)

Visualize resamples
Tibble resamples

Code

plot_splits <- function(split) {
  p <- bind_rows(
    analysis(split) %>%
      mutate(analysis = "Analysis"),
    assessment(split) %>%
      mutate(analysis = "Assessment")
  ) %>%
    ggplot(aes(homelon, homelat, color = analysis)) +
    geom_point(size = 1.5, alpha = 0.8) +
    coord_fixed() +
    labs(color = NULL)
  print(p)
}

walk(train_folds$splits, plot_splits)

Code

train_folds

#  10-fold spatial cross-validation 
# A tibble: 10 x 2
   splits               id    
   <list>               <chr> 
 1 <split [12748/1009]> Fold01
 2 <split [12336/1421]> Fold02
 3 <split [13358/399]>  Fold03
 4 <split [10076/3681]> Fold04
 5 <split [12022/1735]> Fold05
 6 <split [11611/2146]> Fold06
 7 <split [13736/21]>   Fold07
 8 <split [13733/24]>   Fold08
 9 <split [11553/2204]> Fold09
10 <split [12640/1117]> Fold10

6.3 Feature Engineering with recipes

Feature engineering entails reformatting predictor values to make them easier for a model to use effectively. The following transformation steps will be done before refining the model further down the line:

One hot encoding: xgboost does not have a means to translate factor predictors to grouped splits. Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine.
Remove zero variance: potentially remove variables that are highly sparse and unbalanced.
Impute missing values:
Normalize/transform: not really required for tree models
Collapse some categorical levels: potentially pool infrequently occurring values into an “other” category.
Upsampling: There is some class imbalance with the survey having more people who agree with ban. It is equally important to understand the reasons behind people disagreeing with the ban. A SMOTE algorithm (Chawla et al. (2002)) will be applied which synthetically generates new points based on existing points.

Tip

SMOTE requires all variables to be numeric with no missing data. Good thing is that these requirements can and should be handled by previous recipe steps e.g step_impute_mean and step_dummy

The recipes package allows you to combine different feature engineering and preprocessing tasks into a single object and then apply these transformations to different data sets.

The workflows package allows the user to bind modeling and preprocessing objects together. You can then fit the entire workflow to the data, so that the model encapsulates all of the pre-processing, modeling, and post-processing requests.

Code

set.seed(2056)
library(textrecipes)

# Function for prepping and baking a recipe
prep_juice <- function(recipe){
  recipe %>% 
    prep() %>% 
    juice()
}


library(beepr)
# Data preprocessing with recipes
boost_recipe <- recipe(
  opinion_ban ~ age + occup + 
    gender + vehic + 
    freq_bike + freq_ebike +
    freq_bus + freq_motob +
    freq_car + freq_taxi +
    own + freqpweek + purp +
    aware_ban + fut_veh +
    own_car + own_bike +
    own_ebike + dist_to_pub +
    homelat + homelon + alt_veh + network_dist, data = train) %>%
  
  # Tokenize future vehicle considereation
  step_tokenize(alt_veh) %>% 
  # One hot encode the tokens
  step_tf(alt_veh) %>% 
  
  # Pool infrequently occurring values into an "other" category.
   #step_other(purp, threshold = 0.05) %>%
   #step_other(freqpweek, threshold = 0.05) %>%
  step_other(contains("freq_"), threshold = 0.05) %>% 
  
   # Dummy variables for categorical 
  #step_integer(all_nominal_predictors(), zero_based = TRUE) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  step_upsample(opinion_ban)

# %>%
#   
#   
#   # Near zero variance filter
#   step_nzv(all_predictors()) 

# Just for sanity check
#View(prep_juice(boost_recipe))


# Allow recipe to handle new levels
# bp <-  hardhat::default_recipe_blueprint(
#   allow_novel_levels = TRUE
# )


# Create boosted tree model specification
boost_spec <- boost_tree(
  mtry = tune(),
  trees = tune(),
  #min_n = tune(),
  #tree_depth = tune(),
  learn_rate = 0.01,
  #loss_reduction = tune(),
  #sample_size = tune(),
  #stop_iter = tune()
  ) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

# Bind recipe and model specification together
boost_wf <- workflow() %>% 
  add_recipe(boost_recipe) %>% 
  add_model(boost_spec)

# Print workflow
boost_wf

== Workflow ====================================================================
Preprocessor: Recipe
Model: boost_tree()

-- Preprocessor ----------------------------------------------------------------
5 Recipe Steps

* step_tokenize()
* step_tf()
* step_other()
* step_dummy()
* step_upsample()

-- Model -----------------------------------------------------------------------
Boosted Tree Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()
  learn_rate = 0.01

Computational engine: xgboost

6.4 Model tuning

Now that we have a tunable workflow, we’ll need to figure out a set of possible values to try out then choose the best.

Tuning parameter optimization usually falls into one of two categories:

grid_search: when a pre-defined set of parameter values is evaluated and the best set picked. We’ll use a non-regular grid in this case.

When grid search is infeasible or inefficient, iterative methods are a sensible approach for optimizing tuning parameters.

Iterative search or sequential search is when we sequentially discover new parameter combinations based on previous results e.g simulated annealing

Using racing methods to perform an efficient grid search.

In racing methods, the tuning process evaluates all hyperparameter combinations on an initial subset of resamples. Based on their current performance metrics, the process eliminates tuning parameter combinations that are unlikely to be the best results using a repeated measure ANOVA model. This is unlike a classical grid search where all models need to be fit across all resamples before any tuning parameters can be evaluated. This efficiently reduces training time.

grid: An integer or data frame. When an integer is used, the function creates a space-filling design with grid number of candidate parameter combinations. Non-regular space-filling designs generally find a configuration of points that cover the parameter space with the smallest chance of overlapping or redundant values. Examples of such designs are Latin hypercubes (McKay (1979))

Let’s go ahead and tune the model:

Code

doParallel::registerDoParallel()
set.seed(2056)
library(finetune)

# Evaluation metrics during tuning
eval_metrics <- metric_set(mn_log_loss, accuracy)

# Efficient grid search via racing
xgb_race <- tune_race_anova(
  object = boost_wf,
  resamples = train_folds,
  metrics = eval_metrics,
  
  # Try out 20 different hyperparameter combinations
  grid = 20,
  control = control_race(
    verbose_elim = TRUE
  )
)

beepr::beep(9)
#saveRDS(xgb_race, "smote_xgb_race")

We can observe the incremental elimination of tuning parameters that may not have likely improved model performance:

Code

# Plot racing results
plot_race(xgb_race)

Let’ see which hyperparameter combinations performed best on the resamples:

Accuracy

Code

# Tibble
xgb_race %>% 
  show_best(metric = "accuracy")

# A tibble: 3 x 8
   mtry trees .metric  .estimator  mean     n std_err .config              
  <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
1    46  1725 accuracy binary     0.840    10  0.0156 Preprocessor1_Model12
2    35  1581 accuracy binary     0.835    10  0.0195 Preprocessor1_Model09
3    41  1936 accuracy binary     0.835    10  0.0143 Preprocessor1_Model10

Code

# Plot
xgb_race %>% 
  collect_metrics() %>% 
  select(mtry, trees, .metric, mean) %>% 
  filter(.metric == "accuracy") %>% 
  pivot_longer(!c(mean, .metric), names_to = "metrics", values_to = "values") %>% 
  ggplot(mapping = aes(x = values, y = mean)) +
  geom_point(size = 2, color = "darkorange", alpha = 0.7) +
  geom_line(color = "dodgerblue", size = 1.2, alpha = 0.7) +
  labs(y = "accuracy") +
  facet_wrap(vars(metrics), scales = "free_x")

How accuracy varies with different combinations of mtry and trees

Mean log loss

Code

# Tibble
xgb_race %>% 
  show_best(metric = "mn_log_loss")

# A tibble: 3 x 8
   mtry trees .metric     .estimator  mean     n std_err .config              
  <int> <int> <chr>       <chr>      <dbl> <int>   <dbl> <chr>                
1    35  1581 mn_log_loss binary     0.377    10  0.0365 Preprocessor1_Model09
2    46  1725 mn_log_loss binary     0.381    10  0.0354 Preprocessor1_Model12
3    41  1936 mn_log_loss binary     0.383    10  0.0354 Preprocessor1_Model10

Code

# Plot
xgb_race %>% 
  collect_metrics() %>% 
  select(mtry, trees, .metric, mean) %>% 
  filter(.metric == "mn_log_loss") %>% 
  pivot_longer(!c(mean, .metric), names_to = "metrics", values_to = "values") %>% 
  ggplot(mapping = aes(x = values, y = mean)) +
  geom_point(size = 2, color = "dodgerblue", alpha = 0.7) +
  geom_line(color = "darkorange", size = 1.2, alpha = 0.7) +
  labs(y = "mn_log_loss") +
  facet_wrap(vars(metrics), scales = "free_x")

How mean log loss varies with different combinations of mtry and trees

Plausible enough! Probably increasing trees would have lowered the mean log loss further? For this model, we’ll finalize the hyperparameters based on the values of the mean log loss.

Using resamples, we were able to get the best estimates of mtry and trees and a good indicator of how those values would perform on the actual train and test sets.

6.5 Finalizing the model, training and testing

Now that we have the best performance values, we can use tune::finalize_workflow() to update (or “finalize”) our workflow object with the best estimate values for our hyper-parameters.

Finally, we can return to our test data and estimate the model performance we expect to see with new data. We use the function last_fit() with our finalized model; this function fits the finalized model on the full training data set and evaluates the finalized model on the testing data.

Code

# Finalize workflow
final_boost_wf <- boost_wf %>% 
  finalize_workflow(select_best(xgb_race, metric = "mn_log_loss" #mn_log_loss
                    ))

# Train then test
xgb_model <- final_boost_wf %>% 
  last_fit(hn_survey_split, metrics = metric_set(accuracy, recall, spec, ppv, roc_auc, mn_log_loss, f_meas))

# Collect metrics
xgb_model %>% 
  collect_metrics()

# A tibble: 7 x 4
  .metric     .estimator .estimate .config             
  <chr>       <chr>          <dbl> <chr>               
1 accuracy    binary         0.878 Preprocessor1_Model1
2 recall      binary         0.853 Preprocessor1_Model1
3 spec        binary         0.919 Preprocessor1_Model1
4 ppv         binary         0.946 Preprocessor1_Model1
5 f_meas      binary         0.897 Preprocessor1_Model1
6 roc_auc     binary         0.954 Preprocessor1_Model1
7 mn_log_loss binary         0.269 Preprocessor1_Model1

Another performance metric associated with classification problems is the confusion matrix. A confusion matrix describes how well a classification model performs by tabulating how many examples in each class were correctly classified by a model.

Code

# Plot confusion matrix
xgb_model %>% 
  collect_predictions() %>% 
  conf_mat(truth = opinion_ban, estimate = .pred_class) %>% 
  autoplot(type = "heatmap")

Code

# Prettier?
update_geom_defaults(geom = "rect", new = list(fill = "midnightblue", alpha = 0.7))

xgb_model %>% 
  collect_predictions() %>% 
  conf_mat(opinion_ban, .pred_class) %>% 
  autoplot()

Code

# Extract trained workflow
xgb_wf <- xgb_model %>% 
  extract_workflow()

# Save model
saveRDS(xgb_model %>% extract_workflow(), "smote2_boost_hnwf")
#xgb_wf <- readRDS("smote_boost_hnwf")

The performance metrics considered are:

Recall: TP/(TP + FN) defined as the proportion of positive results out of the number of samples which were actually positive. Also known as sensitivity.

Specificity: TN/(TN + FP) defined as the proportion of negative results out of the number of samples which were actually negative.

Precision: TP/(TP + FP) defined as the proportion of predicted positives that are actually positive. Also called positive predictive value

Accuracy: TP + TN/(TP + TN + FP + FN) The percentage of labels predicted accurately for a sample.

F Measure: A weighted average of the precision and recall, with best being 1 and worst being 0.

Note

Overall, the performance metrics look really good. Generally, the model did a better job in predicting observations that disagree with the ban.

With this model, we can now probe it to investigate/understand the underlying reasons for agreeing/disagreeing with ban.

7 Explaining the model and predictions

Tip

Some reference books that I found really helpful for this section include:

Kuhn, Max and Silge, Julia. 2022. Tidy Modeling with R. https://www.tmwr.org/
Biecek, Przemyslaw, and Tomasz Burzykowski. 2021. Explanatory Model Analysis. Chapman; Hall/CRC, New York. https://ema.drwhy.ai/.
Molnar, Christopher. 2020. Interpretable Machine Learning. lulu.com. https://christophm.github.io/interpretable-ml-book/.

In this section, we explore why the model makes the predictions it does. Answering the question “why?” allows modeling practitioners to understand which features were important in predictions and even how model predictions would change under different values for the features.

Note

There are two types of model explanations, global and local. Global model explanations provide an overall understanding aggregated over a whole set of observations; local model explanations provide information about a prediction for a single observation.

7.1 Global model explanations

For global model interpretation, the following will be considered:

Variable Importance
Partial Dependency Plots
SHAP Values

7.1.1 Variable Importance/Permutation feature importance

Variable Importance plots are one way of understanding which predictor has the largest effect on the model outcomes.

The main idea is to measure how much does a model’s performance changes if the effect of a selected explanatory variable, or of a group of variables, is removed. To remove the effect, we use perturbations, like resampling from an empirical distribution or permutation of the values of the variable.

If a variable is important, then we expect that, after permuting/shuffling the values of the variable, the model’s performance will worsen. The larger the change in the performance, the more important is the variable. Please see Explanatory Model Analysis and Interpretable Machine Learning

The method may be applied for several purposes.

Model simplification: variables that do not influence a model’s predictions may be excluded from the model.
Model exploration: comparison of variables’ importance in different models may help in discovering interrelations between the variables. Also, the ordering of variables in the function of their importance is helpful in deciding in which order should we perform further model exploration.
Domain-knowledge-based model validation: identification of the most important variables may be helpful in assessing the validity of the model based on domain knowledge.
Knowledge generation: identification of the most important variables may lead to the discovery of new factors involved in a particular mechanism.

Variable Importance:

Code

# Load saved model
#xgb_wf <- readRDS("smote_boost_hnwf")
options(scipen = 999)

# Extract variable importance
library(vip)
vi <- xgb_wf %>% 
  extract_fit_parsnip() %>% 
  vi()

vi

# A tibble: 83 x 2
   Variable       Importance
   <chr>               <dbl>
 1 freq_car_X0        0.130 
 2 dist_to_pub        0.121 
 3 own_car            0.0753
 4 homelon            0.0685
 5 tf_alt_veh_car     0.0652
 6 aware_ban_yes      0.0552
 7 homelat            0.0490
 8 network_dist       0.0474
 9 vehic_car          0.0314
10 own_bike           0.0299
# ... with 73 more rows

Let’s represent the above as Variable Importance Plots:

Code

# Make Variable Importance Plots
vi %>% 
  slice_max(Importance, n = 42) %>%
  mutate(Variable = fct_reorder(Variable, Importance)) %>%
  ggplot(mapping = aes(y = Variable, x = Importance)) +
  geom_point(size = 3, color = "dodgerblue") + 
  geom_segment(aes(y = Variable, yend = Variable, x = 0, xend = Importance), size = 2, color = "dodgerblue", alpha = 0.7 ) +
  ggtitle(paste("Variable Importance plot of top", round(nrow(vi)/2), "variables")) +
  theme(plot.title = element_text(hjust = 0.5))

Code

# Bottom n variable
vi %>% 
  slice_min(Importance, n = 42) %>%
  mutate(Variable = fct_reorder(Variable, Importance)) %>%
  ggplot(mapping = aes(y = Variable, x = Importance)) +
  geom_point(size = 3, color = "cyan4") + 
  geom_segment(aes(y = Variable, yend = Variable, x = 0, xend = Importance), size = 2, color = "cyan4", alpha = 0.7) +
  ggtitle(paste("Variable Importance plot of bottom", round(nrow(vi)/2), "variables")) +
  theme(plot.title = element_text(hjust = 0.5))

From the VIP plot, we can observe which variables most/least influence the model performance. For instance the following quick inferences can be made:

The frequency of usage of cars, in this case whether an individual does not use a car, has the most predictive value in this model. This warrants further exploration. Also usage of motobike and bikes 0 times a week is an important factor driving individuals’ opinion to ban.
distance to public transport and the distance to destination are among the most important variable in this model
In terms of vehicle ownership, number of cars and bikes are the most important.
Being aware of the ban has a significant effect on the model as well as a person’s home location.
Text preprocessing made it on the list with alternative vehicles car and train topping the list.
Making a trip 4-7 times a week was an important underlying factor in driving opinions towards ban.
Occupation, gender, age and trip purpose were not as important compared to other features

There are lots of inferences that can be made about this characteristics later and some of the less important features can be removed. Perhaps this can improve model importance.

Tip

The feature importance plot is useful, but contains no information beyond the importances i.e, do the features influence the predictions positively or negatively. For a more informative plot, we will next look at the SHAP summary plot.

7.1.2 Shapley Additive exPlanations (SHAP)

SHAP feature importance is an alternative to permutation feature importance above.

SHAP is based on the idea of averaging the value of a variable’s attribution over all (or a large number of) possible orderings.

As such, permutation feature importance is based on decrease in model’s performance with shuffling while SHAP is based on magnitude of feature attributions.

The code below calculates the SHAP values for the top 25 variables ranked by mean SHAP.

Code

# SHAP for xgboost
library(SHAPforxgboost)

# Prepare shap values for plotting. Requires a matrix
opinion_shap <- shap.prep(
  # Actual Boost engine
  xgb_model = xgb_wf %>% 
    extract_fit_engine(),
  # predictors used to calculate SHAP values
  X_train = boost_recipe %>% 
    prep() %>% bake(has_role("predictor"),
                 new_data = NULL,
                 composition = "matrix"),
  top_n = round(nrow(vi)/2) + 10
  
)

shap.plot.summary(opinion_shap)

Your regular reminder: All effects describe the behavior of the model and are not necessarily causal in the real world.

The SHAP summary plot combines feature importance and feature effects with features being ordered according to their importance. Each point on the SHAP summary plot is a Shapley value for a feature and an instance and the color represents the value of the feature from low to high. Some inferences we can derive from our Shapley plot:

Use of cars 0 times a week is associated with lower shapley values and this pushes the model prediction towards disagree
High values of distance to public transport reduce probability of agreeing with the ban.
Owning more cars has a larger effect on the model (High Shapley values) and pushes the model towards agree This is in contrast to ebikes where high ownershing of ebikes pushes the model towards disagree
Individuals who make trips for leisure have a greater probability of agreeing to ban than those who make trips for visit
Having alternative vehicles as car or train in case of ban increases the probability of agreeing with ban in contrast to having taxis
High instances of of frequency of motorbikes being greater than 20 times a week pushes the model towards disagree

More inferences to be made while writing the paper.

SHAP summary plots therefore give us first indications of the relationship between a feature and the impact on predictions.

7.1.3 SHAP Dependence Plot

SHAP summary plot give us first indications of the relationship between a feature and the impact on predictions. We can take a closer look at the exact form of the relationship by making a SHAP dependence plot as follows:

Pick a feature.
For each data instance, plot a point with the feature value on the x-axis and the corresponding Shapley value on the y-axis.
Done.

SHAP Dependence plots for the top 20 variables:

Code

opinion_shap %>% 
  filter(variable %in% (opinion_shap %>% distinct(variable) %>% slice_head(n = 20) %>% pull())) %>% 
  ggplot(mapping = aes(x = rfvalue, y = value)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "loess", se = F, color = "blue", alpha = 0.7, lwd = 0.4) +
  facet_wrap(vars(variable), scales = "free") +
  ylab("SHAP")

The SHAP dependence plots allow one to explore the variance of the Shapley values with different data points of a feature.

Let’s explore some of these relationships further by looking at some SHAP interaction values. SHAP interaction values separate the impact of variable into main effects and interaction effects. They add up roughly to the dependence plot.

Warning

The preparation of SHAP interaction values take time since it calculates all the combinations. The code blow took about 4 hours to run.

Code

# Prepare the interaction SHAP values from predict.xgb.Booster
shap_int <- shap.prep.interaction(
  xgb_model = xgb_wf %>% extract_fit_engine(),
  X_train = boost_recipe %>%
    prep() %>%
    bake(has_role("predictor"),
    new_data = NULL,      
    composition = "matrix"))

saveRDS(shap_int, "opinion_interaction")

How effect of network distance varies with the vehicle used

Code

shap_int = readRDS("opinion_interaction")
shap.plot.dependence(
  data_long = opinion_shap,
  data_int = shap_int,
  x = "network_dist",
  y = "vehic_moto",
  color_feature = "vehic_moto",
  size0 = 1.2,
  smooth = F
  )

Code

shap.plot.dependence(
  data_long = opinion_shap,
  data_int = shap_int,
  x = "network_dist",
  y = "vehic_car",
  color_feature = "vehic_car",
  size0 = 1.2,
  smooth = F
  )

Usage of cars for distances > 17 KM is associated with lower SHAP interaction values compared to use of motorcycles.

Influence of owning/renting at certain locations

Code

shap.plot.dependence(
  data_long = opinion_shap,
  data_int = shap_int,
  x = "homelon",
  y = "own_owner",
  color_feature = "own_owner",
  size0 = 1.2,
  smooth = F
  )

Code

shap.plot.dependence(
  data_long = opinion_shap,
  data_int = shap_int,
  x = "homelon",
  y = "own_rent",
  color_feature = "own_rent",
  size0 = 1.2,
  smooth = F
  )

Code

shap.plot.dependence(
  data_long = opinion_shap,
  data_int = shap_int,
  x = "homelat",
  y = "own_owner",
  color_feature = "own_owner",
  size0 = 1.2,
  smooth = F
  )

Code

shap.plot.dependence(
  data_long = opinion_shap,
  data_int = shap_int,
  x = "homelat",
  y = "own_rent",
  color_feature = "own_rent",
  size0 = 1.2,
  smooth = F
  )

Owning/renting a home at certain locations has a significant influence on agreeing/disagreeing with the ban.

Let’s explore the interaction between the number of cars one owns and their occupation, and how this influences opinion:

Code

shap.plot.dependence(
  data_long = opinion_shap,
  data_int = shap_int,
  x = "own_car",
  y = "occup_state",
  color_feature = "occup_state",
  size0 = 1.2,
  smooth = F
  )

Code

shap.plot.dependence(
  data_long = opinion_shap,
  data_int = shap_int,
  x = "own_car",
  y = "occup_fdi",
  color_feature = "occup_fdi",
  size0 = 1.2,
  smooth = F
  )

Generally, having 1 car and occupation is FDI increases the probability of agreeing with the ban while having 2 cars and occupation is FDI reduces probability of agreeing to ban.

This non-linear interactions will be explored further if required.

7.2 Local explanations

Local model explanations provide information about a prediction for a single observation.

Probably the most commonly asked question when trying to understand a model’s prediction for a single observation is: which variables contribute to this result the most? There is no single best approach that can be used to answer this question. In this section, we explore break-down (BD) plots, which offer a possible solution.

Tip

Break down plots compute how contributions attributed to individual features change the mean model’s prediction for a particular observation. We can use the fact that these break-down explanations change based on the order of features to compute the most important features over all (or many) possible orderings. This is the idea behind Shapley Additive Explanations (Lundberg and Lee 2017), where the average contributions of features are computed under different combinations or “coalitions” of feature orderings.

The underlying idea of BD plots is thus to capture the contribution of an explanatory variable to the model’s prediction by computing the shift in the expected value of \(Y\), while fixing the values of other variables.

To compute any kind of model explanation, global or local, using DALEX package, we first create an explainer for each model.

Code

# Load DALEX package
library(DALEXtra)

# Vector of predictors
vip_features <- xgb_wf %>% 
   extract_preprocessor() %>% 
   summary() %>% 
   filter(role == "predictor") %>% 
   pull(variable)

vip_train <- train %>%
  select(all_of(vip_features)) 

# Explainer for xgboost
explainer_xgboost <- explain_tidymodels(
  # a fitted workflow
  model <- xgb_wf,
  # Data should be passed without target column 
  # this shall be provided as the y argument
  data = vip_train,
  y = train$opinion_ban,
  label = "xg_boost",
  verbose = F
  
)

Let’s use the fact that these break-down explanations change based on order to compute the most important features over all (or many) possible orderings and compute SHAP attributions for some sample observations, using B = 25 random orderings (takes a while to compute).

Code

set.seed(2056)
# Pick an observation
sample_observation <- vip_train %>% 
  dplyr::slice(2)

glimpse(sample_observation)

Rows: 1
Columns: 23
$ age          <chr> "18_25"
$ occup        <chr> "student"
$ gender       <chr> "male"
$ vehic        <chr> "moto"
$ freq_bike    <chr> "0"
$ freq_ebike   <chr> "1_5"
$ freq_bus     <chr> "0"
$ freq_motob   <chr> "1_5"
$ freq_car     <chr> "0"
$ freq_taxi    <chr> "0"
$ own          <chr> "rent"
$ freqpweek    <chr> "4_7"
$ purp         <chr> "education"
$ aware_ban    <chr> "yes"
$ fut_veh      <chr> "car"
$ own_car      <dbl> 0
$ own_bike     <dbl> 0
$ own_ebike    <dbl> 1
$ dist_to_pub  <dbl> 500
$ homelat      <dbl> 21.01729
$ homelon      <dbl> 105.8085
$ alt_veh      <chr> "car lighttrain"
$ network_dist <dbl> 5.515824

Computing the break down explanations for the above observation:

Code

# Compute SHAP attributions
shap_bd <- predict_parts(
  explainer = explainer_xgboost,
  new_observation = sample_observation,
  type = "shap",
  # Default is 25
  B = 25
)

# Make break down plots
shap_bd %>% 
  group_by(variable) %>% 
  mutate(mean_val = mean(contribution)) %>% 
  ungroup() %>% 
  mutate(variable = fct_reorder(variable, abs(mean_val))) %>% 
  filter(contribution != 0) %>% 
  ggplot(mapping = aes(x = contribution, y = variable, fill = mean_val > 0)) +
  geom_boxplot() +
  geom_col(data = ~distinct(., variable, mean_val),
           aes(x = mean_val, y = variable),
           alpha = 0.5) +
  theme(legend.position = "none") +
  scale_fill_viridis_d() +
  labs(y = NULL)

Code

xgboost_breakdown <- predict_parts(
  explainer = explainer_xgboost,
  new_observation = sample_observation,
  order = shap_bd %>% 
    group_by(variable_name) %>% 
    mutate(mean_val = mean(contribution)) %>% 
    ungroup() %>% 
    mutate(variable_name = fct_reorder(variable_name, desc(abs(mean_val)))) %>% distinct(variable_name, mean_val) %>% pull(variable_name) %>% levels()
)

plot(xgboost_breakdown, max_vars = 1000)

Tip

purple variables increase the probability of agreeing with the ban while yellow variables increase probability of disagreeing with the ban.

I this package, probability of:

0-0.5: agree

0.5-1: disagree

Let’s interpret the above plots a little bit:

Note

Theintercept shows the distribution and the mean value of the model’s predictions for all data.

The next rows show the distribution and the mean value of the predictions when fixing values of subsequent explanatory variables. For instance the fact that the individual’s alternative vehicle is car lighttrain, is aware of ban and future vehicle is car reduces the probability of disagreeing with ban while the fact that the individual uses a motorcycle, the trip purpose is eduaction and rents increases the probability of disagreeing with ban.

The prediction shows the prediction for the particular instance of interest.

Let’s do the same for an observation that disagreed with the ban:

Code

set.seed(2056)
# Pick an observation
sample_observation <- train %>%
  select(all_of(vip_features), opinion_ban) %>% 
  dplyr::slice(10234)

glimpse(sample_observation)

Rows: 1
Columns: 24
$ age          <chr> "26_35"
$ occup        <chr> "state"
$ gender       <chr> "female"
$ vehic        <chr> "car"
$ freq_bike    <chr> "1_5"
$ freq_ebike   <chr> "1_5"
$ freq_bus     <chr> "1_5"
$ freq_motob   <chr> "6_10"
$ freq_car     <chr> "1_5"
$ freq_taxi    <chr> "1_5"
$ own          <chr> "parent_house"
$ freqpweek    <chr> "17_20"
$ purp         <chr> "work"
$ aware_ban    <chr> "donotcare"
$ fut_veh      <chr> "bike"
$ own_car      <dbl> 1
$ own_bike     <dbl> 1
$ own_ebike    <dbl> 1
$ dist_to_pub  <dbl> 50
$ homelat      <dbl> 21.02412
$ homelon      <dbl> 105.8232
$ alt_veh      <chr> "ebike"
$ network_dist <dbl> 5.225949
$ opinion_ban  <fct> disagree

Computing the breakdown explanations for the above observation:

Code

# Compute SHAP attributions
shap_bd <- predict_parts(
  explainer = explainer_xgboost,
  new_observation = sample_observation,
  type = "shap",
  # Default is 25
  B = 25
)

# Make break down plots
shap_bd %>% 
  group_by(variable) %>% 
  mutate(mean_val = mean(contribution)) %>% 
  ungroup() %>% 
  mutate(variable = fct_reorder(variable, abs(mean_val))) %>% 
  filter(contribution != 0) %>% 
  ggplot(mapping = aes(x = contribution, y = variable, fill = mean_val > 0)) +
  geom_boxplot() +
  geom_col(data = ~distinct(., variable, mean_val),
           aes(x = mean_val, y = variable),
           alpha = 0.5) +
  theme(legend.position = "none") +
  scale_fill_viridis_d() +
  labs(y = NULL)

Code

xgboost_breakdown <- predict_parts(
  explainer = explainer_xgboost,
  new_observation = sample_observation,
  order = shap_bd %>% 
    group_by(variable_name) %>% 
    mutate(mean_val = mean(contribution)) %>% 
    ungroup() %>% 
    mutate(variable_name = fct_reorder(variable_name, desc(abs(mean_val)))) %>% distinct(variable_name, mean_val) %>% pull(variable_name) %>% levels()
)

plot(xgboost_breakdown, max_vars = 1000)

Note

The fact that the respondent does not care about ban, makes 17-20 trips per week, alternative vehicle is ebike, uses taxi 1-5 times a week increases the probability of disagreeing with the ban. The fact that the respondent uses a car 1_5 times a week, future vehicle purchase is bike, lives at lat 21.02 lon 105.8, increases the probability of agreeing with the ban.

The model’s prediction is 0.793 indicating the individual disagreed with the ban.

More breakdown plots will be made available in the Shiny App.

We’ll wrap up the adventure here. In this notebook:

EDA was done to uncover any relationships between the predictors and the outcome variable.
An xgboost model was tuned and fitted to the data
Global and Local model explanations were computed to uncover some of the factors driving opinions towards the ban

8 Next steps

Shiny dashboard for model explainability. A WIP can be found here: https://r-icntay.shinyapps.io/Explainable_ML/
Linear model explainability as suggested by Lex.
Transfer outputs to paper with Minh.
Explore ML devops features such as deploying model, github actions to automate some processes, integration with shiny etc.

References

Chawla, Nitesh V, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. “SMOTE: Synthetic Minority over-Sampling Technique.” Journal of Artificial Intelligence Research 16: 321–57.

McKay, MD. 1979. “McKay, MD, Conover, WJ and RJ Beckman (1979).” A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics 21: 239–45.