This Notebook is created using Quarto. Quarto is a multi-language, next-generation version of R Markdown from RStudio, and includes dozens of new features and capabilities while at the same being able to render most existing Rmd files without modification. To learn more about Quarto see https://quarto.org.
Summary: the important parts
This document implements a Classification model in Machine Learning on the Hanoi’s travel survey data, as part of the project Urban transport modelling for sustainable well-being in Hanoi. This is me learning and trying to build upon Minh Kieu’s work.
The objectives of the model were to:
See if we can predict whether a person would like or dislike the motorbike ban
See if how many variables we can take off the training dataset, and still make good prediction? This is important because what if we can anticipate travellers’ perception to transport policies using limited dataset (e.g. census data)
Explain the model’s outcomes to tell variable importance, and understand more about the data
See if we can have some policy implications while doing this modelling work
The following packages will be required for this adventure:
Let’s see how we’ll handle these - that is, if they will be even important in our model.
1.2 Data Preprocessing
Here we do the following steps to make sure that the data is ready for modelling:
Reduce the label column from 5 outcomes (Agree, Strongly Agree, Neutral, Disagree, Strongly Disagree) to two: Agree or Disagree
What’s the various levels of the outcome?
Code
theme_set(theme_minimal())# Distribution of opinionshn_survey %>%count(opinion_ban) %>%mutate(opinion_ban =fct_reorder(opinion_ban, desc(n))) %>%ggplot(mapping =aes(x = opinion_ban, y = n)) +geom_col(aes(fill = opinion_ban), show.legend =FALSE, alpha =0.7) +geom_text(aes(label = n, y = n +200)) + paletteer::scale_fill_paletteer_d("ggsci::default_nejm")
Count of various opinions towards a potential motorbike ban
Note
In this survey, majority of the respondents agree with the ban. This is quite different from what we had at the beginning when majority of respondents were disagreeing with the ban.
# Scatter plothn_survey_agg %>%group_by(age =bind_count(age)) %>%summarize_opinionban() %>%mutate(age =fct_reorder(age, pct_agree)) %>%ggplot(mapping =aes(x = pct_agree, y = age)) +geom_point(aes(size = pct), show.legend = T) +geom_errorbarh(aes(xmin = low, xmax = high), height = .3) +labs(x ="Percentage of folks in each category who agree",title ="Distribution of opinions by age",size ="%prevalence",subtitle ="The number in brackets indicates the total count of respondents in that category" ) +scale_x_continuous(labels = percent) +scale_size_continuous(labels = percent) +theme(plot.title =element_text(hjust =0.5))
Note
It seems that cumulatively, people in the age group > 18 years tend to agree with the ban of motorcycles from a section of Hanoi CBD. 75 % of respondents > 75 years agree though they only make 0.8% of the total respondents. Their opinion makes sense because of their age and are less likely to use motorcycles.
Interesting to observe that a higher proportion of respondents in the bracket 18-25 tend to agree than in 26-35.
Majority of 18 - 25 would be students/young professionals. Are their opinions based on their educational knowledge of the effects of motorbikes? But as we’ll see, trips for education purposes have lower probabilities of agreeing to ban.
Why are respondents less that 18 disagreeing with the ban?
Majority of 26-35 would be working. Around 58% of them agree to the ban. Probably a lot of them use motorcyles for work?
Good questions to explore with the team!
Seems like Age matters.
2.2 Does occupation influence opinion?
Code
# Investigate r/ship between occupation and opinionhn_survey_agg %>%group_by(occup) %>%summarize_opinionban()
# Does gender influence one's opinionbar_plot(vehic)
Code
# Narrowing our analysis to the different factor levelsdot_plot(vehic)
Note
Majority of respondents who use cars and bikes tend to agree with the ban. Taxi, walk and bus users tend to disagree with the ban. Perhaps they use motorcycles at some point in their journeys?
Drop tram since it’s sparse?
2.5 How opinion ban looks like to frequency of usage of a particular means of transport
The pattern seems pretty much the same across all amenities - not much signal can be derived from this. Has lots of NA values. Won’t include this for now.
2.7 Is ownership of residence related to opinion on ban?
Code
# Does residence ownership influence one's opinionbar_plot(own)
Code
# Narrowing our analysis to the different factor levelsdot_plot(own)
Note
Respondents who own and live in parent’s house tend to agree to ban than those who mortgage and rent.
2.8 Do people who make more trips per week agree/disagree with ban?
# A tibble: 5 x 7
fut_veh n_agree n_total pct_agree low high pct
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 bike 552 715 0.772 0.742 0.803 0.0361
2 ebike 665 914 0.728 0.699 0.757 0.0461
3 car 4499 6760 0.666 0.654 0.677 0.341
4 no 4589 7753 0.592 0.581 0.603 0.391
5 moto 2016 3668 0.550 0.534 0.566 0.185
Code
# R/ship between future vehicle and one's opinionbar_plot(fut_veh)
Code
# Narrowing our analysis to the different factor levelsdot_plot(fut_veh)
Note
Majority of respondents (39%) have no future plans of purchasing a vehicle. However, the remaining would purchase a car and 66% of them agree with the ban of motorcycles. Respondents who considered purchasing Bike and ebike are more likely to agree to ban than those intending to buy motorcycles which is plausible.
3 How are opinions spread out based on home location?
Caution
An assumption was made that respondents do actually live in the home location and this sought to explore whether living in certain areas makes one susceptible to agreeing/disagreeing with the ban.
Also, the analysis narrowed down to people who live in Hanoi and within a 5 KM buffer area and trips made within Hanoi (will be useful is exploring how distance and time of travel relate to ban)
Preprocessing spatial variables to only include respondents who live in Hanoi and trips within Hanoi:
Code
library(tmap)tmap_mode("view")# Load VN full data setvn_orig <-st_read("../../data-synced/raw_data/VN.shp", quiet = T) hn <-st_read("../../data-synced/raw_data/Hanoi_commune.shp", quiet = T)# Create a HN outlinehn_out <- hn %>%st_union() # Put a 5KM buffer around ithn_buf <- hn_out %>%st_transform(3405) %>%st_buffer(dist =5000) %>%st_transform(4326)# Subset VN datavn = vn_orig[hn_buf, ]# Assign area codes to column codevn <- vn %>%rownames_to_column(var ="code") # Home and OD coordinateshome_coord <-st_as_sf(hn_survey_agg, coords =c("homelon", "homelat"), agr ="constant", crs =4326, remove =FALSE)origin <- hn_survey_agg %>%st_as_sf(agr ="constant", coords =c("origlon", "origlat"), crs =4326)destination <- hn_survey_agg %>%st_as_sf(agr ="constant", coords =c("destlon", "destlat"), crs =4326)# Ensure that home coordinates are within HNindex <-st_intersects(home_coord, vn, sparse = F)home_true <-rowSums(index)# Ensure that OD coordinates fall within hanoiindex <-st_intersects(origin, vn, sparse = F)or_true <-rowSums(index)index <-st_intersects(destination, vn, sparse = F)de_true <-rowSums(index)# Filter for those coordinates that fall within Hanoi i.e home_true = 1home_coord <- home_coord %>%mutate(home_true = home_true, or_true = or_true,de_true = de_true) %>%filter(home_true >0, or_true >0, de_true >0)# Add district names# District namesindex <-st_intersects(home_coord, vn, sparse = F)home_true <-rowSums(index)valid_mat <- index %>%as_tibble()names(valid_mat) <- vn$NAME_2# Pivot longer then obtain the home area codes for regions that intersecthome_code <- valid_mat %>%pivot_longer(everything(), names_to ="homecode", values_to ="is_intersect") %>%filter(is_intersect ==TRUE) %>%pull(homecode)# Add district namehome_coord <- home_coord %>%mutate(district_name = home_code) %>%relocate(district_name, .after = homelon)# Print out datahome_coord %>%slice_sample(n =5)
Distribution of opinions based on respondents’ home location.
Code
pal <-c("#264653", "#2a9d8f", "#e9c46a", "#f4a261", "#e76f51")# Aggregate areas and group them according to opinionagg_lat_long = home_coord %>%as_tibble() %>%# Aggregate certain areasgroup_by(homelat =round(homelat, 2),homelon =round(homelon, 2), opinion_ban) %>%summarise(n =n()) %>%st_as_sf(coords =c("homelon", "homelat"), agr ="constant", crs =4326)# Plot opinions based on locationagg_lat_long %>%tm_shape() +tm_dots(size ="n", col ="opinion_ban", palette =c("dodgerblue", "firebrick3"), alpha =0.5) +tm_basemap("OpenStreetMap")
Note
Some areas have predominantly agree/disagree opinions. But most are overlayed on each other, probably due to Nick’s preprocessing? More living further from the city center (especially to the West) tend to agree more? Why lots of red dots in locations across the river? Try out two heatmaps.
Alternative: Summarize opinions based on the district names:
Most of the significant districts (with a large number of respondents) have a large number agreeing with the ban. Long Bien district has a large number of respondents and less than 30% of them agree to ban. Why?
4 A little Exploratory Data Analysis of numeric variables
Let’s explore the numeric predictors. From a previous analysis, there seems to be some incorrect values for dist_to_pub feature:
Code
as_tibble(home_coord) %>%ggplot(mapping =aes(x = dist_to_pub, y =1)) +geom_boxplot()
Code
# Summary statistics for distance to public transportas_tibble(home_coord) %>%select(dist_to_pub) %>%summary()
dist_to_pub
Min. : -6.0
1st Qu.: 120.0
Median : 200.0
Mean : 516.5
3rd Qu.: 500.0
Max. :636578.0
Caution
The following preprocessing is done:
Distance is made absolute
Only distances < 10 KM are considered. Perhaps pick a better threshold for this.
# Summary stats of distance to public transportas_tibble(home_coord) %>%select(dist_to_pub) %>%summary()
dist_to_pub
Min. : 0
1st Qu.: 120
Median : 200
Mean : 425
3rd Qu.: 500
Max. :8400
Code
# Distribution of distnace to public transportas_tibble(home_coord) %>%select(dist_to_pub) %>%ggplot(mapping =aes(x = dist_to_pub, y =1)) +geom_boxplot(color ="midnightblue")
Code
#geom_jitter(size = 0.1, alpha = 0.2)
Seems there are quite a number of outliers. Perhaps a threshold of 1KM would be better. Ask Minh’s opinion.
A estimate of how numeric variables relate to the opinion on ban.
# Use ROc_AUC to determine how well each numeric predictor# would sway opinionsnumeric_vars %>%group_by(metric) %>%roc_auc(opinion_ban, value, event_level ="second") %>%mutate(metric =fct_reorder(metric, .estimate)) %>%ggplot(mapping =aes(x = .estimate, y = metric)) +geom_point() +geom_vline(aes(xintercept = .5)) +labs(x ="AUC estimates of agreeing",title ="How predictive is each numeric predictor by itself?",subtitle =">.5 means positively influences agree and <.5 means negatively influences agree" )
Note
As distance to public transport increases, people tend to disagree with the ban
As the network distance increases, people tend to agree with ban. Why would this be the case? Congestion?
Respondents who own 2 motorbikes tend to agree with ban
Ownership of a car is generally associated with agreeance of the ban.
Respondents who own 1, 3 or 4 motorbikes tend to disagree with the ban.
Generally respondents who own ebikes tend to disgaree with ban
5 Text Analysis of opinions
Word tokenization to explore the relationship between alternative vehicle and opinion ban.
Summary tibble of what how opinions on ban vary with future alternative vehicle in case of ban:
# Category wise analysisas_tibble(home_coord) %>%unnest_tokens(input = alt_veh, output = alt_veh, token ="words") %>%group_by(alt_veh =bind_count(alt_veh)) %>%summarize_opinionban() %>%mutate(alt_veh =fct_reorder(alt_veh, pct_agree)) %>%ggplot(mapping =aes(x = pct_agree, y = alt_veh)) +geom_point(aes(size = pct), show.legend = F) +geom_errorbarh(aes(xmax = low, xmin = high), height = .3) +labs(x ="Percentage of folks in each category who agree",title ="Distribution of opinions by future vehicle preference",size ="%prevalence" ) +scale_x_continuous(labels = percent) +scale_size_continuous(labels = percent) +theme(plot.title =element_text(hjust =0.5))
Note
If motorcycles were banned, majority of respondents would result to cars.
Majority of respondents who would result to cars, bike agree with ban.
Interesting to see that respondents who would result to ebikes have lowest probability of agreeing with ban.
In general, we have predictors that clearly distinguish the observations in terms of the opinion on ban, while some, not so much. Will begin by including everything in the model, then use Variable Importance to eliminate non-predictive features.
6 Creating a Boosted Tree model
There are several steps to create a useful model, including parameter estimation, model selection and tuning, and performance assessment.
Tip
If some of the methodologies are a bit over-explained, that’s for future references for my talks, blogs, etc. 🙂
6.1 Data Budgeting
In this section, the data is split into two distinct sets, the training set and the test set. The training set (typically larger) is used to develop and optimize the model by fitting different models and investigating various feature engineering strategies etc.
The other portion of the data is the test set. This is held in reserve until one or two models are chosen as the methods that are most likely to succeed. The test set is then used as the final arbiter to determine the efficacy of the model.
Drop amenity accessibility columns
Remove tram since it’s too sparsely distributed for now.
Make a 70/30 split specification.
Code
set.seed(2056)# Drop opinions on amenity accessibilityhn_survey_ml <-as_tibble(home_coord) %>%mutate(district_name_eng =str_remove_all(district_name, "[^a-zA-z]")) %>%relocate(district_name_eng, .after = district_name) %>%select(!contains("_acc")) %>%drop_na() %>%mutate(opinion_ban =factor(opinion_ban, levels =c("agree", "disagree"))) %>%filter(vehic !="tram")# Make a split specificationhn_survey_split <- hn_survey_ml %>%select(!geometry) %>%as_tibble() %>%mutate(across(contains("own_"), as.character),across(contains("own_"), ~case_when(.x =="more_5"~"7", TRUE~ .x)),across(contains("own_"), ~as.numeric(.x) )) %>%# Stratified split based on opinion baninitial_split(strata = opinion_ban, prop =0.7)# Obtain train and test setstrain <-training(hn_survey_split)test <-testing(hn_survey_split)# Print out observations in each categoryglue('The training set has {nrow(train)} observations \n','The testing set has {nrow(test)} observations')
The training set has 13757 observations
The testing set has 5897 observations
6.2 Resampling for model evaluation and tuning
Boosted tree models typically have tuning parameters or hyperparameters must be specified ahead of time and can’t be directly found from training data. These are unknown structural or other kind of values that have significant impact on the model but cannot be directly estimated from the data. Instead, hyperparameters are estimated using simulated data sets created from a process called resampling such as cross-validation or bootstrap resampling.
The idea of resampling is to create simulated data sets that can be used to estimate the performance of your model.
Since this data has a spatial component, we’ll do a spatial resample. Spatial or cluster cross-validation splits the data into V groups of disjointed sets using k-means clustering of some variables, typically spatial coordinates. This ensures that we account for the presence of spatial autocorrelation in geospatial data.
Feature engineering entails reformatting predictor values to make them easier for a model to use effectively. The following transformation steps will be done before refining the model further down the line:
One hot encoding: xgboost does not have a means to translate factor predictors to grouped splits. Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine.
Remove zero variance: potentially remove variables that are highly sparse and unbalanced.
Impute missing values:
Normalize/transform: not really required for tree models
Collapse some categorical levels: potentially pool infrequently occurring values into an “other” category.
Upsampling: There is some class imbalance with the survey having more people who agree with ban. It is equally important to understand the reasons behind people disagreeing with the ban. A SMOTE algorithm (Chawla et al. (2002)) will be applied which synthetically generates new points based on existing points.
Tip
SMOTE requires all variables to be numeric with no missing data. Good thing is that these requirements can and should be handled by previous recipe steps e.g step_impute_mean and step_dummy
The recipes package allows you to combine different feature engineering and preprocessing tasks into a single object and then apply these transformations to different data sets.
The workflows package allows the user to bind modeling and preprocessing objects together. You can then fit the entire workflow to the data, so that the model encapsulates all of the pre-processing, modeling, and post-processing requests.
Code
set.seed(2056)library(textrecipes)# Function for prepping and baking a recipeprep_juice <-function(recipe){ recipe %>%prep() %>%juice()}library(beepr)# Data preprocessing with recipesboost_recipe <-recipe( opinion_ban ~ age + occup + gender + vehic + freq_bike + freq_ebike + freq_bus + freq_motob + freq_car + freq_taxi + own + freqpweek + purp + aware_ban + fut_veh + own_car + own_bike + own_ebike + dist_to_pub + homelat + homelon + alt_veh + network_dist, data = train) %>%# Tokenize future vehicle considereationstep_tokenize(alt_veh) %>%# One hot encode the tokensstep_tf(alt_veh) %>%# Pool infrequently occurring values into an "other" category.#step_other(purp, threshold = 0.05) %>%#step_other(freqpweek, threshold = 0.05) %>%step_other(contains("freq_"), threshold =0.05) %>%# Dummy variables for categorical #step_integer(all_nominal_predictors(), zero_based = TRUE) %>%step_dummy(all_nominal_predictors(), one_hot =TRUE) %>%step_upsample(opinion_ban)# %>%# # # # Near zero variance filter# step_nzv(all_predictors()) # Just for sanity check#View(prep_juice(boost_recipe))# Allow recipe to handle new levels# bp <- hardhat::default_recipe_blueprint(# allow_novel_levels = TRUE# )# Create boosted tree model specificationboost_spec <-boost_tree(mtry =tune(),trees =tune(),#min_n = tune(),#tree_depth = tune(),learn_rate =0.01,#loss_reduction = tune(),#sample_size = tune(),#stop_iter = tune() ) %>%set_engine("xgboost") %>%set_mode("classification")# Bind recipe and model specification togetherboost_wf <-workflow() %>%add_recipe(boost_recipe) %>%add_model(boost_spec)# Print workflowboost_wf
== Workflow ====================================================================
Preprocessor: Recipe
Model: boost_tree()
-- Preprocessor ----------------------------------------------------------------
5 Recipe Steps
* step_tokenize()
* step_tf()
* step_other()
* step_dummy()
* step_upsample()
-- Model -----------------------------------------------------------------------
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
learn_rate = 0.01
Computational engine: xgboost
6.4 Model tuning
Now that we have a tunable workflow, we’ll need to figure out a set of possible values to try out then choose the best.
Tuning parameter optimization usually falls into one of two categories:
grid_search: when a pre-defined set of parameter values is evaluated and the best set picked. We’ll use a non-regular grid in this case.
When grid search is infeasible or inefficient, iterative methods are a sensible approach for optimizing tuning parameters.
Iterative search or sequential search is when we sequentially discover new parameter combinations based on previous results e.g simulated annealing
Using racing methods to perform an efficient grid search.
In racing methods, the tuning process evaluates all hyperparameter combinations on an initial subset of resamples. Based on their current performance metrics, the process eliminates tuning parameter combinations that are unlikely to be the best results using a repeated measure ANOVA model. This is unlike a classical grid search where all models need to be fit across all resamples before any tuning parameters can be evaluated. This efficiently reduces training time.
grid: An integer or data frame. When an integer is used, the function creates a space-filling design with grid number of candidate parameter combinations. Non-regular space-filling designs generally find a configuration of points that cover the parameter space with the smallest chance of overlapping or redundant values. Examples of such designs are Latin hypercubes (McKay (1979))
Let’s go ahead and tune the model:
Code
doParallel::registerDoParallel()set.seed(2056)library(finetune)# Evaluation metrics during tuningeval_metrics <-metric_set(mn_log_loss, accuracy)# Efficient grid search via racingxgb_race <-tune_race_anova(object = boost_wf,resamples = train_folds,metrics = eval_metrics,# Try out 20 different hyperparameter combinationsgrid =20,control =control_race(verbose_elim =TRUE ))beepr::beep(9)#saveRDS(xgb_race, "smote_xgb_race")
We can observe the incremental elimination of tuning parameters that may not have likely improved model performance:
Code
# Plot racing resultsplot_race(xgb_race)
Let’ see which hyperparameter combinations performed best on the resamples:
Accuracy
Code
# Tibblexgb_race %>%show_best(metric ="accuracy")
# A tibble: 3 x 8
mtry trees .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 46 1725 accuracy binary 0.840 10 0.0156 Preprocessor1_Model12
2 35 1581 accuracy binary 0.835 10 0.0195 Preprocessor1_Model09
3 41 1936 accuracy binary 0.835 10 0.0143 Preprocessor1_Model10
How mean log loss varies with different combinations of mtry and trees
Plausible enough! Probably increasing trees would have lowered the mean log loss further? For this model, we’ll finalize the hyperparameters based on the values of the mean log loss.
Using resamples, we were able to get the best estimates of mtry and trees and a good indicator of how those values would perform on the actual train and test sets.
6.5 Finalizing the model, training and testing
Now that we have the best performance values, we can use tune::finalize_workflow() to update (or “finalize”) our workflow object with the best estimate values for our hyper-parameters.
Finally, we can return to our test data and estimate the model performance we expect to see with new data. We use the function last_fit() with our finalized model; this function fits the finalized model on the full training data set and evaluates the finalized model on the testing data.
Another performance metric associated with classification problems is the confusion matrix. A confusion matrix describes how well a classification model performs by tabulating how many examples in each class were correctly classified by a model.
Recall: TP/(TP + FN) defined as the proportion of positive results out of the number of samples which were actually positive. Also known as sensitivity.
Specificity: TN/(TN + FP) defined as the proportion of negative results out of the number of samples which were actually negative.
Precision: TP/(TP + FP) defined as the proportion of predicted positives that are actually positive. Also called positive predictive value
Accuracy: TP + TN/(TP + TN + FP + FN) The percentage of labels predicted accurately for a sample.
F Measure: A weighted average of the precision and recall, with best being 1 and worst being 0.
Note
Overall, the performance metrics look really good. Generally, the model did a better job in predicting observations that disagree with the ban.
With this model, we can now probe it to investigate/understand the underlying reasons for agreeing/disagreeing with ban.
7 Explaining the model and predictions
Tip
Some reference books that I found really helpful for this section include:
In this section, we explore why the model makes the predictions it does. Answering the question “why?” allows modeling practitioners to understand which features were important in predictions and even how model predictions would change under different values for the features.
Note
There are two types of model explanations, global and local. Global model explanations provide an overall understanding aggregated over a whole set of observations; local model explanations provide information about a prediction for a single observation.
7.1 Global model explanations
For global model interpretation, the following will be considered:
Variable Importance plots are one way of understanding which predictor has the largest effect on the model outcomes.
The main idea is to measure how much does a model’s performance changes if the effect of a selected explanatory variable, or of a group of variables, is removed. To remove the effect, we use perturbations, like resampling from an empirical distribution or permutation of the values of the variable.
If a variable is important, then we expect that, after permuting/shuffling the values of the variable, the model’s performance will worsen. The larger the change in the performance, the more important is the variable. Please see Explanatory Model Analysis and Interpretable Machine Learning
The method may be applied for several purposes.
Model simplification: variables that do not influence a model’s predictions may be excluded from the model.
Model exploration: comparison of variables’ importance in different models may help in discovering interrelations between the variables. Also, the ordering of variables in the function of their importance is helpful in deciding in which order should we perform further model exploration.
Domain-knowledge-based model validation: identification of the most important variables may be helpful in assessing the validity of the model based on domain knowledge.
Knowledge generation: identification of the most important variables may lead to the discovery of new factors involved in a particular mechanism.
Let’s represent the above as Variable Importance Plots:
Code
# Make Variable Importance Plotsvi %>%slice_max(Importance, n =42) %>%mutate(Variable =fct_reorder(Variable, Importance)) %>%ggplot(mapping =aes(y = Variable, x = Importance)) +geom_point(size =3, color ="dodgerblue") +geom_segment(aes(y = Variable, yend = Variable, x =0, xend = Importance), size =2, color ="dodgerblue", alpha =0.7 ) +ggtitle(paste("Variable Importance plot of top", round(nrow(vi)/2), "variables")) +theme(plot.title =element_text(hjust =0.5))
Code
# Bottom n variablevi %>%slice_min(Importance, n =42) %>%mutate(Variable =fct_reorder(Variable, Importance)) %>%ggplot(mapping =aes(y = Variable, x = Importance)) +geom_point(size =3, color ="cyan4") +geom_segment(aes(y = Variable, yend = Variable, x =0, xend = Importance), size =2, color ="cyan4", alpha =0.7) +ggtitle(paste("Variable Importance plot of bottom", round(nrow(vi)/2), "variables")) +theme(plot.title =element_text(hjust =0.5))
From the VIP plot, we can observe which variables most/least influence the model performance. For instance the following quick inferences can be made:
The frequency of usage of cars, in this case whether an individual does not use a car, has the most predictive value in this model. This warrants further exploration. Also usage of motobike and bikes0 times a week is an important factor driving individuals’ opinion to ban.
distance to public transport and the distance to destination are among the most important variable in this model
In terms of vehicle ownership, number of cars and bikes are the most important.
Being aware of the ban has a significant effect on the model as well as a person’s home location.
Text preprocessing made it on the list with alternative vehicles car and train topping the list.
Making a trip 4-7 times a week was an important underlying factor in driving opinions towards ban.
Occupation, gender, age and trip purpose were not as important compared to other features
There are lots of inferences that can be made about this characteristics later and some of the less important features can be removed. Perhaps this can improve model importance.
Tip
The feature importance plot is useful, but contains no information beyond the importances i.e, do the features influence the predictions positively or negatively. For a more informative plot, we will next look at the SHAP summary plot.
7.1.2 Shapley Additive exPlanations (SHAP)
SHAP feature importance is an alternative to permutation feature importance above.
SHAP is based on the idea of averaging the value of a variable’s attribution over all (or a large number of) possible orderings.
As such, permutation feature importance is based on decrease in model’s performance with shuffling while SHAP is based on magnitude of feature attributions.
The code below calculates the SHAP values for the top 25 variables ranked by mean SHAP.
Code
# SHAP for xgboostlibrary(SHAPforxgboost)# Prepare shap values for plotting. Requires a matrixopinion_shap <-shap.prep(# Actual Boost enginexgb_model = xgb_wf %>%extract_fit_engine(),# predictors used to calculate SHAP valuesX_train = boost_recipe %>%prep() %>%bake(has_role("predictor"),new_data =NULL,composition ="matrix"),top_n =round(nrow(vi)/2) +10)shap.plot.summary(opinion_shap)
Your regular reminder: All effects describe the behavior of the model and are not necessarily causal in the real world.
The SHAP summary plot combines feature importance and feature effects with features being ordered according to their importance. Each point on the SHAP summary plot is a Shapley value for a feature and an instance and the color represents the value of the feature from low to high. Some inferences we can derive from our Shapley plot:
Use of cars 0 times a week is associated with lower shapley values and this pushes the model prediction towards disagree
High values of distance to public transport reduce probability of agreeing with the ban.
Owning more cars has a larger effect on the model (High Shapley values) and pushes the model towards agree This is in contrast to ebikes where high ownershing of ebikes pushes the model towards disagree
Individuals who make trips for leisure have a greater probability of agreeing to ban than those who make trips for visit
Having alternative vehicles as car or train in case of ban increases the probability of agreeing with ban in contrast to having taxis
High instances of of frequency of motorbikes being greater than 20 times a week pushes the model towards disagree
More inferences to be made while writing the paper.
SHAP summary plots therefore give us first indications of the relationship between a feature and the impact on predictions.
7.1.3 SHAP Dependence Plot
SHAP summary plot give us first indications of the relationship between a feature and the impact on predictions. We can take a closer look at the exact form of the relationship by making a SHAP dependence plot as follows:
Pick a feature.
For each data instance, plot a point with the feature value on the x-axis and the corresponding Shapley value on the y-axis.
The SHAP dependence plots allow one to explore the variance of the Shapley values with different data points of a feature.
Let’s explore some of these relationships further by looking at some SHAP interaction values. SHAP interaction values separate the impact of variable into main effects and interaction effects. They add up roughly to the dependence plot.
Warning
The preparation of SHAP interaction values take time since it calculates all the combinations. The code blow took about 4 hours to run.
Code
# Prepare the interaction SHAP values from predict.xgb.Boostershap_int <-shap.prep.interaction(xgb_model = xgb_wf %>%extract_fit_engine(),X_train = boost_recipe %>%prep() %>%bake(has_role("predictor"),new_data =NULL, composition ="matrix"))saveRDS(shap_int, "opinion_interaction")
How effect of network distance varies with the vehicle used
Generally, having 1 car and occupation is FDI increases the probability of agreeing with the ban while having 2 cars and occupation is FDI reduces probability of agreeing to ban.
This non-linear interactions will be explored further if required.
7.2 Local explanations
Local model explanations provide information about a prediction for a single observation.
Probably the most commonly asked question when trying to understand a model’s prediction for a single observation is: which variables contribute to this result the most? There is no single best approach that can be used to answer this question. In this section, we explore break-down (BD) plots, which offer a possible solution.
Tip
Break down plots compute how contributions attributed to individual features change the mean model’s prediction for a particular observation. We can use the fact that these break-down explanations change based on the order of features to compute the most important features over all (or many) possible orderings. This is the idea behind Shapley Additive Explanations (Lundberg and Lee 2017), where the average contributions of features are computed under different combinations or “coalitions” of feature orderings.
The underlying idea of BD plots is thus to capture the contribution of an explanatory variable to the model’s prediction by computing the shift in the expected value of \(Y\), while fixing the values of other variables.
To compute any kind of model explanation, global or local, using DALEX package, we first create an explainer for each model.
Code
# Load DALEX packagelibrary(DALEXtra)# Vector of predictorsvip_features <- xgb_wf %>%extract_preprocessor() %>%summary() %>%filter(role =="predictor") %>%pull(variable)vip_train <- train %>%select(all_of(vip_features)) # Explainer for xgboostexplainer_xgboost <-explain_tidymodels(# a fitted workflow model <- xgb_wf,# Data should be passed without target column # this shall be provided as the y argumentdata = vip_train,y = train$opinion_ban,label ="xg_boost",verbose = F)
Let’s use the fact that these break-down explanations change based on order to compute the most important features over all (or many) possible orderings and compute SHAP attributions for some sample observations, using B = 25 random orderings (takes a while to compute).
Code
set.seed(2056)# Pick an observationsample_observation <- vip_train %>% dplyr::slice(2)glimpse(sample_observation)
purple variables increase the probability of agreeing with the ban while yellow variables increase probability of disagreeing with the ban.
I this package, probability of:
0-0.5: agree
0.5-1: disagree
Let’s interpret the above plots a little bit:
Note
Theintercept shows the distribution and the mean value of the model’s predictions for all data.
The next rows show the distribution and the mean value of the predictions when fixing values of subsequent explanatory variables. For instance the fact that the individual’s alternative vehicle is car lighttrain, is aware of ban and future vehicle is car reduces the probability of disagreeing with ban while the fact that the individual uses a motorcycle, the trip purpose is eduaction and rents increases the probability of disagreeing with ban.
The prediction shows the prediction for the particular instance of interest.
Let’s do the same for an observation that disagreed with the ban:
Code
set.seed(2056)# Pick an observationsample_observation <- train %>%select(all_of(vip_features), opinion_ban) %>% dplyr::slice(10234)glimpse(sample_observation)
The fact that the respondent does not care about ban, makes 17-20 trips per week, alternative vehicle is ebike, uses taxi 1-5 times a week increases the probability of disagreeing with the ban. The fact that the respondent uses a car 1_5 times a week, future vehicle purchase is bike, lives at lat 21.02 lon 105.8, increases the probability of agreeing with the ban.
The model’s prediction is 0.793 indicating the individual disagreed with the ban.
More breakdown plots will be made available in the Shiny App.
We’ll wrap up the adventure here. In this notebook:
EDA was done to uncover any relationships between the predictors and the outcome variable.
An xgboost model was tuned and fitted to the data
Global and Local model explanations were computed to uncover some of the factors driving opinions towards the ban
Explore ML devops features such as deploying model, github actions to automate some processes, integration with shiny etc.
References
Chawla, Nitesh V, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. “SMOTE: Synthetic Minority over-Sampling Technique.”Journal of Artificial Intelligence Research 16: 321–57.
McKay, MD. 1979. “McKay, MD, Conover, WJ and RJ Beckman (1979).”A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics 21: 239–45.
Source Code
---title: "Factors influencing perceptions towards a potential Motorbike ban in Hanoi: A boosted trees approach"author: "Eric Wanjau, Minh Kieu"format: html: number-sections: true toc: true toc-location: left toc-depth: 4 code-tools: true code-fold: true code-link: trueeditor: visualexecute: warning: false message: falsebibliography: references.bib---**Quarto**This Notebook is created using Quarto.Quarto is a multi-language, next-generation version of R Markdown from RStudio, and includes dozens of new features and capabilities while at the same being able to render most existing Rmd files without modification.To learn more about Quarto see <https://quarto.org>.## Summary: the important parts {.unnumbered}This document implements a Classification model in Machine Learning on the Hanoi's travel survey data, as part of the project `Urban transport modelling for sustainable well-being in Hanoi`.This is me learning and trying to build upon [`Minh Kieu`](https://unidirectory.auckland.ac.nz/profile/minh-kieu)'s work.The objectives of the model were to:- See if we can predict whether a person would like or dislike the motorbike ban- See if how many variables we can take off the training dataset, and still make good prediction? This is important because what if we can anticipate travellers' perception to transport policies using limited dataset (e.g. census data)- Explain the model's outcomes to tell variable importance, and understand more about the data- See if we can have some policy implications while doing this modelling workThe following packages will be required for this adventure:```{r}suppressWarnings(if(!require("pacman")) install.packages("pacman"))pacman::p_load('tidyverse', 'here', 'sf', 'tmap', 'stplanr','broom','tidymodels', 'paletteer', 'patchwork', 'themis', 'stacks',"spatialsample", "glue","vip", "xgboost", "SHAPforxgboost", "tidytext", "textrecipes", "beepr", "finetune")#conflicted::conflict_prefer("slice", "dplyr")doParallel::registerDoParallel()```## Set Up### Loading dataThe code chunk below loads a copy of a cleaned survey data set.```{r}# Load the datahn_survey <-read_csv("hn_survey.csv", col_types =cols(.default ="c")) %>%mutate(across(contains(c("row","homel", "origl", "destl", "dist", "trav", "index")), ~as.numeric(.x)))# Dimensions of datadim(hn_survey)# User defined functions for EDA# Function for summarizing the distribution of opinionssummarize_opinionban <-function(tbl){ tbl %>%#group_by(occup) %>% summarise(n_agree =sum(opinion_ban =="agree"), n_total =n()) %>%ungroup() %>%mutate(pct_agree = n_agree / n_total) %>%arrange(desc(pct_agree)) %>%mutate(low =qbeta(0.025, n_agree +5, n_total - n_agree + .5),high =qbeta(0.975, n_agree +5, n_total - n_agree + .5),pct = n_total /sum(n_total)) } # Function for binding number of observations to above resultsbind_count =function(x){as_tibble(x) %>%add_count(value) %>%mutate(value =glue("{value} ({n})")) %>%pull(value)}```Let's take a peek at the data::: panel-tabset## glimpse```{r}glimpse(hn_survey)```## tibble```{r}hn_survey %>%slice_head(n =5)```:::Where do we have missing values?```{r}colSums(is.na(hn_survey)) %>%as.data.frame() %>%filter(. >0)```Let's see how we'll handle these - that is, if they will be even important in our model.### Data PreprocessingHere we do the following steps to make sure that the data is ready for modelling:- Reduce the label column from 5 outcomes (Agree, Strongly Agree, Neutral, Disagree, Strongly Disagree) to two: `Agree` or `Disagree`What's the various levels of the outcome?```{r}#| fig-subcap: "Count of various opinions towards a potential motorbike ban"theme_set(theme_minimal())# Distribution of opinionshn_survey %>%count(opinion_ban) %>%mutate(opinion_ban =fct_reorder(opinion_ban, desc(n))) %>%ggplot(mapping =aes(x = opinion_ban, y = n)) +geom_col(aes(fill = opinion_ban), show.legend =FALSE, alpha =0.7) +geom_text(aes(label = n, y = n +200)) + paletteer::scale_fill_paletteer_d("ggsci::default_nejm")```::: callout-noteIn this survey, majority of the respondents agree with the ban.This is quite different from what we had at the beginning when majority of respondents were disagreeing with the ban.:::Data preprocessing steps:- **Drop Neutral**- **Aggregate `disagree` and `strong disagree`**- **Aggregate `agree` and `strong agree`**```{r}# Aggregate datahn_survey_agg <- hn_survey %>%# Make Neutral Disagreefilter(opinion_ban !="neutral") %>%# Aggregate opinionsmutate(opinion_ban =case_when( opinion_ban =="strongdisagree"~"disagree",#opinion_ban == "neutral" ~ "disagree", opinion_ban =="strongagree"~"agree",TRUE~ opinion_ban )) %>%tibble()```::: panel-tabset## Tibble Count```{r}# Verifyhn_survey_agg %>%count(opinion_ban, sort =TRUE)```## Bar Plot```{r}hn_survey_agg %>%count(opinion_ban) %>%mutate(opinion_ban =fct_reorder(opinion_ban, (n))) %>%ggplot(mapping =aes(x = opinion_ban, y = n)) +geom_col(aes(fill = opinion_ban), show.legend =FALSE, alpha =0.7, width =0.5) +geom_text(aes(label = n, y = n +200)) + paletteer::scale_fill_paletteer_d("ggsci::default_nejm")```:::::: callout-noteNot much of a class imbalance but will upsample on the training set.:::## A little Exploratory Data Analysis of nominal variablesThe following sections try to do some Exploratory Data Analysis aimed at:- Getting a better understanding of how the variables related to opinion on ban- Identify which variables might be potentially predictive- Identify potential feature engineering possibilities to extract a better signal.### How is age related to opinion on ban?```{r}# How opinions vary with age hn_survey_agg %>%group_by(age) %>%summarize_opinionban() ```Let's explore this visually.```{r}#| layout-ncol: 2#| column: page# Bar plotshn_survey_agg %>%count(age, opinion_ban) %>%mutate(opinion_ban =factor(opinion_ban, levels =c("disagree", "agree"))) %>%mutate(age =factor(age,levels = hn_survey_agg %>%distinct(age) %>% dplyr::slice(c(2, 1, 3:6)) %>%pull(age) )) %>%ggplot(mapping =aes(x = age, y = n)) +geom_col(aes(fill = opinion_ban), position ="dodge", alpha =0.7) +scale_fill_manual(values =sample(c("#32A251FF" ,"#FF7F0FFF" , "#3CB7CCFF", "#FFD94AFF", "#39737CFF", "#B85A0DFF"), size =2, replace =FALSE)) +theme(legend.position ="top") +labs(fill ="")# Scatter plothn_survey_agg %>%group_by(age =bind_count(age)) %>%summarize_opinionban() %>%mutate(age =fct_reorder(age, pct_agree)) %>%ggplot(mapping =aes(x = pct_agree, y = age)) +geom_point(aes(size = pct), show.legend = T) +geom_errorbarh(aes(xmin = low, xmax = high), height = .3) +labs(x ="Percentage of folks in each category who agree",title ="Distribution of opinions by age",size ="%prevalence",subtitle ="The number in brackets indicates the total count of respondents in that category" ) +scale_x_continuous(labels = percent) +scale_size_continuous(labels = percent) +theme(plot.title =element_text(hjust =0.5))```::: callout-noteIt seems that cumulatively, people in the age group \> 18 years tend to **agree** with the **ban of motorcycles from a section of Hanoi CBD.** 75 % of respondents \> 75 years agree though they only make 0.8% of the total respondents.Their opinion makes sense because of their age and are less likely to use motorcycles.Interesting to observe that a higher proportion of respondents in the bracket 18-25 tend to agree than in 26-35.Majority of 18 - 25 would be students/young professionals.Are their opinions based on their educational knowledge of the effects of motorbikes?But as we'll see, trips for education purposes have lower probabilities of agreeing to ban.Why are respondents less that 18 disagreeing with the ban?Majority of 26-35 would be working.Around 58% of them agree to the ban.Probably a lot of them use motorcyles for work?Good questions to explore with the team!Seems like **Age matters.**:::### Does occupation influence opinion?```{r}# Investigate r/ship between occupation and opinionhn_survey_agg %>%group_by(occup) %>%summarize_opinionban()```Exploring this visually:```{r}#| layout-ncol: 2#| column: page# General overviewhn_survey_agg %>%count(occup, opinion_ban) %>%mutate(opinion_ban =factor(opinion_ban, levels =c("disagree", "agree"))) %>%ggplot(mapping =aes(x = occup, y = n)) +geom_col(aes(fill = opinion_ban), position ="dodge", alpha =0.7) +scale_fill_manual(values =sample(c("#32A251FF" ,"#FF7F0FFF" , "#3CB7CCFF", "#FFD94AFF", "#39737CFF", "#B85A0DFF"), size =2, replace =FALSE))+theme(legend.position ="top") +labs(fill ="")# Narrowing our analysis to the different factor levelshn_survey_agg %>%group_by(occup =bind_count(occup)) %>%summarize_opinionban() %>%mutate(occup =fct_reorder(occup, pct_agree)) %>%ggplot(mapping =aes(x = pct_agree, y = occup)) +geom_point(aes(size = pct), show.legend = F) +geom_errorbarh(aes(xmax = low, xmin = high), height = .3) +labs(x ="Percentage of folks in each category who agree",title ="Distribution of opinions by Occupation",size ="%prevalence" ) +scale_x_continuous(labels = percent) +scale_size_continuous(labels = percent) +theme(plot.title =element_text(hjust =0.5))```::: callout-noteRespondents employed by FDI and state tend to agree with ban.Is this due to influence by government?There is very little difference among respondents who agree/disagree with the ban in the other categories.:::### **How opinion on ban varies with gender of respondents.**Let's create some user defined functions that take a particular returns the plots above.::: callout-tipPlease see [rlang documentation](https://rlang.r-lib.org/reference/glue-operators.html) for more info on Tidy evaluation.:::```{r}# Plot bar chartsbar_plot <-function(column){ hn_survey_agg %>%count({{ column }}, opinion_ban) %>%mutate(opinion_ban =factor(opinion_ban, levels =c("disagree", "agree"))) %>%ggplot(mapping =aes(x = {{ column }}, y = n)) +geom_col(aes(fill = opinion_ban), position ="dodge", alpha =0.7) +scale_fill_manual(values =sample(c("#32A251FF" ,"#FF7F0FFF" , "#3CB7CCFF", "#FFD94AFF", "#39737CFF", "#EB5291FF"), size =2, replace =FALSE)) +theme(legend.position ="top") +labs(fill ="")}# Make dot plotdot_plot <-function(column){ hn_survey_agg %>%group_by({{ column }} :=bind_count({{ column }})) %>%summarize_opinionban() %>%mutate({{ column }} :=fct_reorder({{ column }}, pct_agree)) %>%ggplot(mapping =aes(x = pct_agree, y = {{ column }})) +geom_point(aes(size = pct), show.legend =FALSE) +geom_errorbarh(aes(xmax = low, xmin = high), height = .3) +labs(x ="Percentage of folks in each category who agree",title ="",size ="%prevalence" ) +scale_x_continuous(labels = percent) +scale_size_continuous(labels = percent) +theme(plot.title =element_text(hjust =0.5))}```Influence of gender on opinion:```{r}hn_survey_agg %>%group_by(gender) %>%summarize_opinionban()``````{r}#| layout-ncol: 2#| column: page#| # Does gender influence one's opinionbar_plot(gender)# Narrowing our analysis to the different factor levelsdot_plot(gender)```::: callout-noteAround 68% of male respondents agree with the ban.Does this imply usage of motorcyles is greater among female respondents?:::### **How opinion on ban varies with the means of transport used.**```{r}hn_survey_agg %>%group_by(vehic) %>%summarize_opinionban()``````{r}#| layout-ncol: 2#| column: page#| # Does gender influence one's opinionbar_plot(vehic)# Narrowing our analysis to the different factor levelsdot_plot(vehic)```::: callout-noteMajority of respondents who use `cars` and `bikes` tend to agree with the ban.`Taxi`, `walk` and `bus` users tend to disagree with the ban.Perhaps they use motorcycles at some point in their journeys?Drop `tram` since it's sparse?:::### How opinion ban looks like to frequency of usage of a particular means of transport```{r}#| fig-width: 10#| fig-height: 10hn_survey_agg %>%select(contains("freq_"), opinion_ban) %>%pivot_longer(!opinion_ban, names_to ="vehic_freq", values_to ="freq") %>%mutate(opinion_ban =factor(opinion_ban, levels =c("disagree", "agree")),freq =factor(freq, levels =c("0", "1_5", "6_10", "11_15", "16_20", "more_20"))) %>%ggplot(mapping =aes(x = freq)) +geom_bar(aes(fill = opinion_ban), position ="dodge", alpha =0.7) +scale_fill_manual(values =c("#BC3C29FF", "#0072B5FF")) +facet_wrap(~ vehic_freq) +theme(legend.position ="top") +labs(fill ="")```::: callout-noteRespondents who use cars more frequently tend to agree with the ban.Respondents using motorcycles more than 15 times a week tend to disagree with ban.How do I include this in the model?Only cars/motorcycles?:::### Does access to schools, market influence opinion on ban?```{r}#| fig-width: 10#| fig-height: 10hn_survey_agg %>%select(contains("_acc"), opinion_ban) %>%pivot_longer(!opinion_ban, names_to ="amenities", values_to ="access") %>%mutate(opinion_ban =factor(opinion_ban, levels =c("disagree", "agree"))) %>%ggplot(mapping =aes(x = access)) +geom_bar(aes(fill = opinion_ban), position ="dodge", alpha =0.7) +scale_fill_manual(values =c("#BC3C29FF", "#0072B5FF")) +facet_wrap(~ amenities)+theme(legend.position ="top") +labs(fill ="")```::: callout-noteThe pattern seems pretty much the same across all amenities - not much signal can be derived from this.Has lots of `NA` values.Won't include this for now.:::### Is ownership of residence related to opinion on ban?```{r}#| layout-ncol: 2#| column: page#| # Does residence ownership influence one's opinionbar_plot(own)# Narrowing our analysis to the different factor levelsdot_plot(own)```::: callout-noteRespondents who own and live in parent's house tend to agree to ban than those who mortgage and rent.:::### Do people who make more trips per week agree/disagree with ban?```{r}hn_survey_agg %>%group_by(freqpweek) %>%summarize_opinionban()``````{r}#| layout-ncol: 2#| column: page#| # Does frequency of trips influence one's opinionbar_plot(freqpweek)# Narrowing our analysis to the different factor levelsdot_plot(freqpweek)```::: callout-noteMajority of respondents make `4-7` trips per week most of them tend to agree with the ban.Generally, as the number of trips per week increases, the number of people agreeing to ban decreases.Are motorbikes cheap for frequent travels?:::### How opinion on ban varies with trip purpose.A summary of opinions based on trip purpose:```{r}hn_survey_agg %>%group_by(fut_veh) %>%summarize_opinionban()``````{r}#| layout-ncol: 2#| column: page#| # Does trip purpose influence one's opinionbar_plot(purp)# Narrowing our analysis to the different factor levelsdot_plot(purp)```::: callout-noteRespondents who make trips for `leisure` and `shopping` are more likely to `agree` to ban.though they cumulatively make up to 10% of the respondents.Trips made for `education` purpose have a lower probability of agreeing to ban.Why?Are motorbikes the most used vehicle for education trips?:::### Do opinions on other means of transport influence opinion on motorbike ban?```{r}#| fig-width: 10#| fig-height: 10hn_survey_agg %>%select(contains("opinion")) %>%pivot_longer(!opinion_ban, names_to ="mode", values_to ="opinion") %>%mutate(opinion_ban =factor(opinion_ban, levels =c("disagree", "agree"))) %>%ggplot(mapping =aes(x = opinion)) +geom_bar(aes(fill = opinion_ban), position ="dodge", alpha =0.7) +scale_fill_manual(values =c("#BC3C29FF", "#0072B5FF")) +facet_wrap(~ mode)```::: callout-noteReally interesting to see that respondents who think `motorcycles` are `good` tend to `agree` to the ban.Also majority of opinions on `cars` is `very _good`.Are people more inclined to use cars next?:::### Does being aware of the ban influence opinion on the ban?```{r}hn_survey_agg %>%group_by(aware_ban) %>%summarize_opinionban()``````{r}#| layout-ncol: 2#| column: page#| # Does awareness of ban influence one's opinionbar_plot(aware_ban)# Narrowing our analysis to the different factor levelsdot_plot(aware_ban)```::: callout-noteGenerally, respondents `aware` of the ban tend to `agree` to the ban than those who do not care.Majority of respondents who are `unaware` of the ban tend to `disagree` to it.Would sensitizing the public about the ban lead to more cooperation?:::### Any relation between future vehicle and opinion on ban?Summary of opinions based on future vehicles```{r}hn_survey_agg %>%group_by(fut_veh) %>%summarize_opinionban()``````{r}#| layout-ncol: 2#| column: page#| # R/ship between future vehicle and one's opinionbar_plot(fut_veh)# Narrowing our analysis to the different factor levelsdot_plot(fut_veh)```::: callout-noteMajority of respondents (39%) have no future plans of purchasing a vehicle.However, the remaining would purchase a `car` and 66% of them agree with the ban of motorcycles.Respondents who considered purchasing `Bike` and `ebike` are more likely to `agree` to ban than those intending to buy `motorcycles` which is plausible.:::## How are opinions spread out based on home location?::: callout-cautionAn assumption was made that respondents do actually live in the home location and this sought to explore whether living in certain areas makes one susceptible to agreeing/disagreeing with the ban.Also, the analysis narrowed down to people who live in Hanoi and within a 5 KM buffer area and trips made within Hanoi (will be useful is exploring how distance and time of travel relate to ban):::Preprocessing spatial variables to only include respondents who live in Hanoi and trips within Hanoi:```{r}library(tmap)tmap_mode("view")# Load VN full data setvn_orig <-st_read("../../data-synced/raw_data/VN.shp", quiet = T) hn <-st_read("../../data-synced/raw_data/Hanoi_commune.shp", quiet = T)# Create a HN outlinehn_out <- hn %>%st_union() # Put a 5KM buffer around ithn_buf <- hn_out %>%st_transform(3405) %>%st_buffer(dist =5000) %>%st_transform(4326)# Subset VN datavn = vn_orig[hn_buf, ]# Assign area codes to column codevn <- vn %>%rownames_to_column(var ="code") # Home and OD coordinateshome_coord <-st_as_sf(hn_survey_agg, coords =c("homelon", "homelat"), agr ="constant", crs =4326, remove =FALSE)origin <- hn_survey_agg %>%st_as_sf(agr ="constant", coords =c("origlon", "origlat"), crs =4326)destination <- hn_survey_agg %>%st_as_sf(agr ="constant", coords =c("destlon", "destlat"), crs =4326)# Ensure that home coordinates are within HNindex <-st_intersects(home_coord, vn, sparse = F)home_true <-rowSums(index)# Ensure that OD coordinates fall within hanoiindex <-st_intersects(origin, vn, sparse = F)or_true <-rowSums(index)index <-st_intersects(destination, vn, sparse = F)de_true <-rowSums(index)# Filter for those coordinates that fall within Hanoi i.e home_true = 1home_coord <- home_coord %>%mutate(home_true = home_true, or_true = or_true,de_true = de_true) %>%filter(home_true >0, or_true >0, de_true >0)# Add district names# District namesindex <-st_intersects(home_coord, vn, sparse = F)home_true <-rowSums(index)valid_mat <- index %>%as_tibble()names(valid_mat) <- vn$NAME_2# Pivot longer then obtain the home area codes for regions that intersecthome_code <- valid_mat %>%pivot_longer(everything(), names_to ="homecode", values_to ="is_intersect") %>%filter(is_intersect ==TRUE) %>%pull(homecode)# Add district namehome_coord <- home_coord %>%mutate(district_name = home_code) %>%relocate(district_name, .after = homelon)# Print out datahome_coord %>%slice_sample(n =5)rm(index)```Distribution of opinions based on respondents' home location.```{r}pal <-c("#264653", "#2a9d8f", "#e9c46a", "#f4a261", "#e76f51")# Aggregate areas and group them according to opinionagg_lat_long = home_coord %>%as_tibble() %>%# Aggregate certain areasgroup_by(homelat =round(homelat, 2),homelon =round(homelon, 2), opinion_ban) %>%summarise(n =n()) %>%st_as_sf(coords =c("homelon", "homelat"), agr ="constant", crs =4326)# Plot opinions based on locationagg_lat_long %>%tm_shape() +tm_dots(size ="n", col ="opinion_ban", palette =c("dodgerblue", "firebrick3"), alpha =0.5) +tm_basemap("OpenStreetMap")```::: callout-noteSome areas have predominantly agree/disagree opinions.But most are overlayed on each other, probably due to Nick's preprocessing?More living further from the city center (especially to the West) tend to agree more?Why lots of red dots in locations across the river?Try out two heatmaps.:::Alternative: Summarize opinions based on the district names:::: panel-tabset## Plot```{r}# Opinions based on location plothome_coord %>%st_drop_geometry() %>%group_by(district_name =bind_count(district_name)) %>%summarize_opinionban() %>%filter(n_total >10) %>%mutate(district_name =fct_reorder(district_name, pct_agree)) %>%ggplot(mapping =aes(x = pct_agree, y = district_name)) +geom_point(aes(size = n_total), show.legend = T) +#geom_errorbarh(aes(xmax = low, xmin = high), height = .3) +labs(x ="Percentage of folks in each category who agree",title ="Distribution of opinions by location",size ="Total respondents" ) +scale_x_continuous(labels = percent) +theme(plot.title =element_text(hjust =0.5))```## Tibble```{r}# Opinions based on location tibblehome_coord %>%st_drop_geometry() %>%group_by(district_name = (district_name)) %>%summarize_opinionban() %>%filter(n_total >10)```:::::: callout-noteMost of the significant districts (with a large number of respondents) have a large number agreeing with the ban.`Long Bien` district has a large number of respondents and less than 30% of them agree to ban.Why?:::## A little Exploratory Data Analysis of numeric variablesLet's explore the numeric predictors.From a previous analysis, there seems to be some incorrect values for `dist_to_pub` feature:```{r}as_tibble(home_coord) %>%ggplot(mapping =aes(x = dist_to_pub, y =1)) +geom_boxplot()``````{r}# Summary statistics for distance to public transportas_tibble(home_coord) %>%select(dist_to_pub) %>%summary()```::: callout-cautionThe following preprocessing is done:- Distance is made absolute- Only distances \<`10 KM` are considered. Perhaps pick a better threshold for this.:::```{r}# Remove outliershome_coord <- home_coord %>%mutate(dist_to_pub =abs(dist_to_pub)) %>%filter(dist_to_pub <10000) ```::: panel-tabset## Summary```{r}# Summary stats of distance to public transportas_tibble(home_coord) %>%select(dist_to_pub) %>%summary()```## Boxplot```{r}# Distribution of distnace to public transportas_tibble(home_coord) %>%select(dist_to_pub) %>%ggplot(mapping =aes(x = dist_to_pub, y =1)) +geom_boxplot(color ="midnightblue") #geom_jitter(size = 0.1, alpha = 0.2)```:::Seems there are quite a number of outliers.Perhaps a threshold of `1KM` would be better.Ask Minh's opinion.A estimate of how numeric variables relate to the opinion on ban.```{r}#| column: page#| layout-ncol: 2numeric_vars <-as_tibble(home_coord) %>%mutate(across(everything(), as.character)) %>%mutate(across(contains("own_"), ~case_when(.x =="more_5"~"7", TRUE~ .x))) %>%select(contains("own_"), OD_dist, network_dist, travtime, dist_to_pub, opinion_ban) %>%mutate(across(!opinion_ban, ~abs(as.numeric(.x))),opinion_ban =factor(opinion_ban, levels =c("disagree", "agree"))) %>%pivot_longer(!opinion_ban, names_to ="metric", values_to ="value")# Density plotnumeric_vars %>%ggplot(mapping =aes(x = value, color = opinion_ban)) +geom_density() +scale_x_log10() +labs(color ="") +theme(legend.position ="top") +facet_wrap(~ metric, scales ="free")# Use ROc_AUC to determine how well each numeric predictor# would sway opinionsnumeric_vars %>%group_by(metric) %>%roc_auc(opinion_ban, value, event_level ="second") %>%mutate(metric =fct_reorder(metric, .estimate)) %>%ggplot(mapping =aes(x = .estimate, y = metric)) +geom_point() +geom_vline(aes(xintercept = .5)) +labs(x ="AUC estimates of agreeing",title ="How predictive is each numeric predictor by itself?",subtitle =">.5 means positively influences agree and <.5 means negatively influences agree" )```::: callout-note- As distance to public transport increases, people tend to disagree with the ban- As the network distance increases, people tend to agree with ban. Why would this be the case? Congestion?- Respondents who own 2 motorbikes tend to agree with ban- Ownership of a car is generally associated with agreeance of the ban.- Respondents who own 1, 3 or 4 motorbikes tend to disagree with the ban.- Generally respondents who own ebikes tend to disgaree with ban:::## Text Analysis of opinionsWord tokenization to explore the relationship between alternative vehicle and opinion ban.Summary tibble of what how opinions on ban vary with future alternative vehicle in case of ban:```{r}as_tibble(home_coord) %>%unnest_tokens(input = alt_veh, output = alt_veh, token ="words") %>%group_by(alt_veh) %>%summarize_opinionban()```Explore this visually:```{r}#| layout-ncol: 2#| column: pagelibrary(tidytext)# Split future vehicle into tokens and visualize opinionas_tibble(home_coord) %>%unnest_tokens(input = alt_veh, output = alt_veh, token ="words") %>%mutate(opinion_ban =factor(opinion_ban, levels =c("disagree", "agree"))) %>%ggplot(mapping =aes(x = alt_veh)) +geom_bar(aes(fill = opinion_ban), position ="dodge", alpha =0.7) +labs(fill ="") +scale_fill_manual(values =sample(c("#32A251FF" ,"#FF7F0FFF" , "#3CB7CCFF", "#FFD94AFF", "#39737CFF", "#B85A0DFF"), size =2, replace =FALSE)) +theme(legend.position ="top")# Category wise analysisas_tibble(home_coord) %>%unnest_tokens(input = alt_veh, output = alt_veh, token ="words") %>%group_by(alt_veh =bind_count(alt_veh)) %>%summarize_opinionban() %>%mutate(alt_veh =fct_reorder(alt_veh, pct_agree)) %>%ggplot(mapping =aes(x = pct_agree, y = alt_veh)) +geom_point(aes(size = pct), show.legend = F) +geom_errorbarh(aes(xmax = low, xmin = high), height = .3) +labs(x ="Percentage of folks in each category who agree",title ="Distribution of opinions by future vehicle preference",size ="%prevalence" ) +scale_x_continuous(labels = percent) +scale_size_continuous(labels = percent) +theme(plot.title =element_text(hjust =0.5))```::: callout-note- If `motorcycles` were banned, majority of respondents would result to `cars`.- Majority of respondents who would result to `cars`, `bike` agree with ban.- Interesting to see that respondents who would result to `ebikes` have lowest probability of agreeing with ban.:::In general, we have predictors that clearly distinguish the observations in terms of the opinion on ban, while some, not so much.Will begin by including everything in the model, then use `Variable Importance` to eliminate non-predictive features.## Creating a Boosted Tree modelThere are several steps to create a useful model, including parameter estimation, model selection and tuning, and performance assessment.::: callout-tipIf some of the methodologies are a bit over-explained, that's for future references for my talks, blogs, etc. 🙂:::### Data BudgetingIn this section, the data is split into two distinct sets, the training set and the test set.The training set (typically larger) is used to develop and optimize the model by fitting different models and investigating various feature engineering strategies etc.The other portion of the data is the *test set*.This is held in reserve until one or two models are chosen as the methods that are most likely to succeed.The test set is then used as the final arbiter to determine the efficacy of the model.- Drop amenity accessibility columns- Remove `tram` since it's too sparsely distributed for now.- Make a 70/30 split specification.```{r}set.seed(2056)# Drop opinions on amenity accessibilityhn_survey_ml <-as_tibble(home_coord) %>%mutate(district_name_eng =str_remove_all(district_name, "[^a-zA-z]")) %>%relocate(district_name_eng, .after = district_name) %>%select(!contains("_acc")) %>%drop_na() %>%mutate(opinion_ban =factor(opinion_ban, levels =c("agree", "disagree"))) %>%filter(vehic !="tram")# Make a split specificationhn_survey_split <- hn_survey_ml %>%select(!geometry) %>%as_tibble() %>%mutate(across(contains("own_"), as.character),across(contains("own_"), ~case_when(.x =="more_5"~"7", TRUE~ .x)),across(contains("own_"), ~as.numeric(.x) )) %>%# Stratified split based on opinion baninitial_split(strata = opinion_ban, prop =0.7)# Obtain train and test setstrain <-training(hn_survey_split)test <-testing(hn_survey_split)# Print out observations in each categoryglue('The training set has {nrow(train)} observations \n','The testing set has {nrow(test)} observations')```### Resampling for model evaluation and tuningBoosted tree models typically have *tuning parameters* or *hyperparameters* must be specified ahead of time and can't be directly found from training data.These are unknown structural or other kind of values that have significant impact on the model but cannot be directly estimated from the data.Instead, hyperparameters are estimated using simulated data sets created from a process called resampling such as cross-validation or bootstrap resampling.The idea of resampling is to create simulated data sets that can be used to estimate the performance of your model.Since this data has a spatial component, we'll do a **spatial resample.** Spatial or cluster cross-validation splits the data into V groups of disjointed sets using k-means clustering of some variables, typically spatial coordinates.This ensures that we account for the presence of spatial autocorrelation in geospatial data.::: callout-tipPlease see for more information on spatial resampling: [Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest](https://ieeexplore.ieee.org/document/6352393):::Perform 10 fold spatial cross validation:```{r}library(spatialsample)set.seed(2056)#Spatial V fold cross validationtrain_folds <-spatial_clustering_cv(data = train, coords =c(homelat, homelon), v =10)```::: panel-tabset## Visualize resamples```{r}#| column: screen-inset-shaded#| layout-nrow: 4plot_splits <-function(split) { p <-bind_rows(analysis(split) %>%mutate(analysis ="Analysis"),assessment(split) %>%mutate(analysis ="Assessment") ) %>%ggplot(aes(homelon, homelat, color = analysis)) +geom_point(size =1.5, alpha =0.8) +coord_fixed() +labs(color =NULL)print(p)}walk(train_folds$splits, plot_splits)```## Tibble resamples```{r}train_folds```:::### Feature Engineering with recipesFeature engineering entails reformatting predictor values to make them easier for a model to use effectively.The following transformation steps will be done before refining the model further down the line:- `One hot encoding`: xgboost does not have a means to translate factor predictors to grouped splits. Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine.- `Remove zero variance`: potentially remove variables that are highly sparse and unbalanced.- `Impute missing values`:- `Normalize/transform`: not really required for tree models- `Collapse some categorical levels`: potentially pool infrequently occurring values into an "other" category.- `Upsampling`: There is some class imbalance with the survey having more people who agree with ban. It is equally important to understand the reasons behind people disagreeing with the ban. A SMOTE algorithm (@chawla2002smote) will be applied which synthetically generates new points based on existing points.::: callout-tip`SMOTE` requires all variables to be numeric with no missing data.Good thing is that these requirements can and should be handled by previous recipe steps e.g `step_impute_mean` and `step_dummy`:::The [**recipes**](https://recipes.tidymodels.org/) package allows you to combine different feature engineering and preprocessing tasks into a single object and then apply these transformations to different data sets.The [**workflows**](https://workflows.tidymodels.org/) package allows the user to bind modeling and preprocessing objects together.You can then fit the entire workflow to the data, so that the model encapsulates all of the pre-processing, modeling, and post-processing requests.```{r}set.seed(2056)library(textrecipes)# Function for prepping and baking a recipeprep_juice <-function(recipe){ recipe %>%prep() %>%juice()}library(beepr)# Data preprocessing with recipesboost_recipe <-recipe( opinion_ban ~ age + occup + gender + vehic + freq_bike + freq_ebike + freq_bus + freq_motob + freq_car + freq_taxi + own + freqpweek + purp + aware_ban + fut_veh + own_car + own_bike + own_ebike + dist_to_pub + homelat + homelon + alt_veh + network_dist, data = train) %>%# Tokenize future vehicle considereationstep_tokenize(alt_veh) %>%# One hot encode the tokensstep_tf(alt_veh) %>%# Pool infrequently occurring values into an "other" category.#step_other(purp, threshold = 0.05) %>%#step_other(freqpweek, threshold = 0.05) %>%step_other(contains("freq_"), threshold =0.05) %>%# Dummy variables for categorical #step_integer(all_nominal_predictors(), zero_based = TRUE) %>%step_dummy(all_nominal_predictors(), one_hot =TRUE) %>%step_upsample(opinion_ban)# %>%# # # # Near zero variance filter# step_nzv(all_predictors()) # Just for sanity check#View(prep_juice(boost_recipe))# Allow recipe to handle new levels# bp <- hardhat::default_recipe_blueprint(# allow_novel_levels = TRUE# )# Create boosted tree model specificationboost_spec <-boost_tree(mtry =tune(),trees =tune(),#min_n = tune(),#tree_depth = tune(),learn_rate =0.01,#loss_reduction = tune(),#sample_size = tune(),#stop_iter = tune() ) %>%set_engine("xgboost") %>%set_mode("classification")# Bind recipe and model specification togetherboost_wf <-workflow() %>%add_recipe(boost_recipe) %>%add_model(boost_spec)# Print workflowboost_wf```### Model tuningNow that we have a **tunable workflow**, we'll need to figure out a set of possible values to try out then choose the best.Tuning parameter optimization usually falls into one of two categories:- `grid_search`: when a pre-defined set of parameter values is evaluated and the best set picked. We'll use a non-regular grid in this case.When grid search is infeasible or inefficient, iterative methods are a sensible approach for optimizing tuning parameters.- *Iterative search* or sequential search is when we sequentially discover new parameter combinations based on previous results e.g *simulated annealing*#### Using racing methods to perform an efficient grid search. {.unnumbered}In racing methods, the tuning process evaluates all hyperparameter combinations on an initial subset of resamples.Based on their current performance metrics, the process eliminates tuning parameter combinations that are unlikely to be the best results using a repeated measure ANOVA model.This is unlike a classical grid search where all models need to be fit across all resamples before any tuning parameters can be evaluated.This efficiently reduces training time.> `grid`: An integer or data frame.> When an integer is used, the function creates a space-filling design with `grid` number of candidate parameter combinations.> Non-regular *space-filling designs* generally find a configuration of points that cover the parameter space with the smallest chance of overlapping or redundant values.> Examples of such designs are Latin hypercubes (@mckay1979mckay)Let's go ahead and tune the model:```{r}#| cache: truedoParallel::registerDoParallel()set.seed(2056)library(finetune)# Evaluation metrics during tuningeval_metrics <-metric_set(mn_log_loss, accuracy)# Efficient grid search via racingxgb_race <-tune_race_anova(object = boost_wf,resamples = train_folds,metrics = eval_metrics,# Try out 20 different hyperparameter combinationsgrid =20,control =control_race(verbose_elim =TRUE ))beepr::beep(9)#saveRDS(xgb_race, "smote_xgb_race")```We can observe the incremental elimination of tuning parameters that may not have likely improved model performance:```{r}# Plot racing resultsplot_race(xgb_race)```Let' see which hyperparameter combinations performed best on the resamples:**Accuracy**```{r}#| fig-cap: How accuracy varies with different combinations of mtry and trees# Tibblexgb_race %>%show_best(metric ="accuracy")# Plotxgb_race %>%collect_metrics() %>%select(mtry, trees, .metric, mean) %>%filter(.metric =="accuracy") %>%pivot_longer(!c(mean, .metric), names_to ="metrics", values_to ="values") %>%ggplot(mapping =aes(x = values, y = mean)) +geom_point(size =2, color ="darkorange", alpha =0.7) +geom_line(color ="dodgerblue", size =1.2, alpha =0.7) +labs(y ="accuracy") +facet_wrap(vars(metrics), scales ="free_x")```**Mean log loss**```{r}#| fig-cap: How mean log loss varies with different combinations of mtry and trees# Tibblexgb_race %>%show_best(metric ="mn_log_loss")# Plotxgb_race %>%collect_metrics() %>%select(mtry, trees, .metric, mean) %>%filter(.metric =="mn_log_loss") %>%pivot_longer(!c(mean, .metric), names_to ="metrics", values_to ="values") %>%ggplot(mapping =aes(x = values, y = mean)) +geom_point(size =2, color ="dodgerblue", alpha =0.7) +geom_line(color ="darkorange", size =1.2, alpha =0.7) +labs(y ="mn_log_loss") +facet_wrap(vars(metrics), scales ="free_x")```Plausible enough!Probably increasing trees would have lowered the mean log loss further?For this model, we'll finalize the hyperparameters based on the values of the mean log loss.Using resamples, we were able to get the best estimates of `mtry` and `trees` and a good indicator of how those values would perform on the actual train and test sets.### Finalizing the model, training and testingNow that we have the best performance values, we can use `tune::finalize_workflow()` to update (or "finalize") our workflow object with the best estimate values for our hyper-parameters.Finally, we can return to our test data and estimate the model performance we expect to see with new data.We use the function [`last_fit()`](https://tidymodels.github.io/tune/reference/last_fit.html) with our finalized model; this function *fits* the finalized model on the full training data set and *evaluates* the finalized model on the testing data.```{r}# Finalize workflowfinal_boost_wf <- boost_wf %>%finalize_workflow(select_best(xgb_race, metric ="mn_log_loss"#mn_log_loss ))# Train then testxgb_model <- final_boost_wf %>%last_fit(hn_survey_split, metrics =metric_set(accuracy, recall, spec, ppv, roc_auc, mn_log_loss, f_meas))# Collect metricsxgb_model %>%collect_metrics() ```Another performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix).A confusion matrix describes how well a classification model performs by tabulating how many examples in each class were correctly classified by a model.```{r}# Plot confusion matrixxgb_model %>%collect_predictions() %>%conf_mat(truth = opinion_ban, estimate = .pred_class) %>%autoplot(type ="heatmap")# Prettier?update_geom_defaults(geom ="rect", new =list(fill ="midnightblue", alpha =0.7))xgb_model %>%collect_predictions() %>%conf_mat(opinion_ban, .pred_class) %>%autoplot()# Extract trained workflowxgb_wf <- xgb_model %>%extract_workflow()# Save modelsaveRDS(xgb_model %>%extract_workflow(), "smote2_boost_hnwf")#xgb_wf <- readRDS("smote_boost_hnwf")```The performance metrics considered are:Recall: `TP/(TP + FN)` defined as the proportion of positive results out of the number of samples which were actually positive.Also known as `sensitivity`.Specificity: `TN/(TN + FP)` defined as the proportion of negative results out of the number of samples which were actually negative.Precision: `TP/(TP + FP)` defined as the proportion of predicted positives that are actually positive.Also called [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value "Positive predictive value")Accuracy: `TP + TN/(TP + TN + FP + FN)` The percentage of labels predicted accurately for a sample.F Measure: A weighted average of the precision and recall, with best being 1 and worst being 0.::: callout-noteOverall, the performance metrics look really good.Generally, the model did a better job in predicting observations that `disagree` with the ban.With this model, we can now probe it to investigate/understand the underlying reasons for agreeing/disagreeing with ban.:::## Explaining the model and predictions::: callout-tipSome reference books that I found really helpful for this section include:- Kuhn, Max and Silge, Julia. 2022. Tidy Modeling with R. <https://www.tmwr.org/>- Biecek, Przemyslaw, and Tomasz Burzykowski. 2021. *Explanatory Model Analysis*. Chapman; Hall/CRC, New York. <https://ema.drwhy.ai/>.- Molnar, Christopher. 2020. *Interpretable Machine Learning*. lulu.com. <https://christophm.github.io/interpretable-ml-book/>.:::In this section, we explore why the model makes the predictions it does.Answering the question `“why?”` allows modeling practitioners to understand which features were important in predictions and even how model predictions would change under different values for the features.::: callout-noteThere are two types of model explanations, *global* and *local*.Global model explanations provide an overall understanding aggregated over a whole set of observations; local model explanations provide information about a prediction for a single observation.:::### Global model explanationsFor global model interpretation, the following will be considered:1. `Variable Importance`2. `Partial Dependency Plots`3. `SHAP Values`#### Variable Importance/Permutation feature importanceVariable Importance plots are one way of understanding which predictor has the largest effect on the model outcomes.The main idea is to measure how much does a model's performance changes if the effect of a selected explanatory variable, or of a group of variables, is removed.To remove the effect, we use perturbations, like resampling from an empirical distribution or permutation of the values of the variable.If a variable is important, then we expect that, after permuting/shuffling the values of the variable, the model's performance will worsen.The larger the change in the performance, the more important is the variable.Please see [Explanatory Model Analysis](https://ema.drwhy.ai/featureImportance.html) and [Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)The method may be applied for several purposes.- `Model simplification`: variables that do not influence a model's predictions may be excluded from the model.- `Model exploration`: comparison of variables' importance in different models may help in discovering interrelations between the variables. Also, the ordering of variables in the function of their importance is helpful in deciding in which order should we perform further model exploration.- `Domain-knowledge-based model validation`: identification of the most important variables may be helpful in assessing the validity of the model based on domain knowledge.- `Knowledge generation`: identification of the most important variables may lead to the discovery of new factors involved in a particular mechanism.**Variable Importance:**```{r}# Load saved model#xgb_wf <- readRDS("smote_boost_hnwf")options(scipen =999)# Extract variable importancelibrary(vip)vi <- xgb_wf %>%extract_fit_parsnip() %>%vi()vi```Let's represent the above as Variable Importance Plots:```{r}#| fig-width: 8#| fig-height: 8#| layout-ncol: 2#| column: page# Make Variable Importance Plotsvi %>%slice_max(Importance, n =42) %>%mutate(Variable =fct_reorder(Variable, Importance)) %>%ggplot(mapping =aes(y = Variable, x = Importance)) +geom_point(size =3, color ="dodgerblue") +geom_segment(aes(y = Variable, yend = Variable, x =0, xend = Importance), size =2, color ="dodgerblue", alpha =0.7 ) +ggtitle(paste("Variable Importance plot of top", round(nrow(vi)/2), "variables")) +theme(plot.title =element_text(hjust =0.5))# Bottom n variablevi %>%slice_min(Importance, n =42) %>%mutate(Variable =fct_reorder(Variable, Importance)) %>%ggplot(mapping =aes(y = Variable, x = Importance)) +geom_point(size =3, color ="cyan4") +geom_segment(aes(y = Variable, yend = Variable, x =0, xend = Importance), size =2, color ="cyan4", alpha =0.7) +ggtitle(paste("Variable Importance plot of bottom", round(nrow(vi)/2), "variables")) +theme(plot.title =element_text(hjust =0.5))```From the VIP plot, we can observe which variables most/least influence the model performance.For instance the following quick inferences can be made:- The `frequency of usage of cars`, in this case whether an individual does not use a car, has the most predictive value in this model. This warrants further exploration. Also usage of `motobike` and `bikes``0` times a week is an important factor driving individuals' opinion to ban.- distance to public transport and the distance to destination are among the most important variable in this model- In terms of vehicle ownership, number of cars and bikes are the most important.- Being aware of the ban has a significant effect on the model as well as a person's home location.- Text preprocessing made it on the list with alternative vehicles car and train topping the list.- Making a trip `4-7` times a week was an important underlying factor in driving opinions towards ban.- Occupation, gender, age and trip purpose were not as important compared to other featuresThere are lots of inferences that can be made about this characteristics later and some of the less important features can be removed.Perhaps this can improve model importance.::: callout-tipThe feature importance plot is useful, but contains no information beyond the importances i.e, do the features influence the predictions positively or negatively.For a more informative plot, we will next look at the SHAP summary plot.:::#### Shapley Additive exPlanations (SHAP)[SHAP](https://christophm.github.io/interpretable-ml-book/shap.html) feature importance is an alternative to permutation feature importance above.SHAP is based on the idea of averaging the value of a variable's attribution over all (or a large number of) possible orderings.As such, **permutation feature importance** is based on **decrease in model's performance with shuffling** while **SHAP** is based on **magnitude of feature attributions.**The code below calculates the SHAP values for the top 25 variables ranked by mean SHAP.```{r}#| fig-width: 10#| fig-height: 10#| fig-cap: "Your regular reminder: All effects describe the behavior of the model and are not necessarily causal in the real world."# SHAP for xgboostlibrary(SHAPforxgboost)# Prepare shap values for plotting. Requires a matrixopinion_shap <-shap.prep(# Actual Boost enginexgb_model = xgb_wf %>%extract_fit_engine(),# predictors used to calculate SHAP valuesX_train = boost_recipe %>%prep() %>%bake(has_role("predictor"),new_data =NULL,composition ="matrix"),top_n =round(nrow(vi)/2) +10)shap.plot.summary(opinion_shap)```The SHAP summary plot combines **feature importance** and **feature effects** with features being ordered according to their importance.Each point on the SHAP summary plot is a Shapley value for a feature and an instance and the color represents the value of the feature from low to high.Some inferences we can derive from our Shapley plot:- Use of cars 0 times a week is associated with lower shapley values and this pushes the model prediction towards `disagree`- High values of distance to public transport reduce probability of agreeing with the ban.- Owning more cars has a larger effect on the model (High Shapley values) and pushes the model towards `agree` This is in contrast to ebikes where high ownershing of ebikes pushes the model towards `disagree`- Individuals who make trips for `leisure` have a greater probability of agreeing to ban than those who make trips for `visit`- Having alternative vehicles as `car` or `train` in case of ban increases the probability of agreeing with ban in contrast to having `taxis`- High instances of of frequency of motorbikes being greater than 20 times a week pushes the model towards `disagree`More inferences to be made while writing the paper.SHAP summary plots therefore give us first indications of the relationship between a feature and the impact on predictions.#### SHAP Dependence PlotSHAP summary plot give us first indications of the relationship between a feature and the impact on predictions.We can take a closer look at the exact form of the relationship by making a SHAP dependence plot as follows:1) Pick a feature.2) For each data instance, plot a point with the feature value on the x-axis and the corresponding Shapley value on the y-axis.3) Done.SHAP Dependence plots for the top 20 variables:```{r}#| fig-width: 14#| fig-height: 14opinion_shap %>%filter(variable %in% (opinion_shap %>%distinct(variable) %>%slice_head(n =20) %>%pull())) %>%ggplot(mapping =aes(x = rfvalue, y = value)) +geom_point(size =0.5) +geom_smooth(method ="loess", se = F, color ="blue", alpha =0.7, lwd =0.4) +facet_wrap(vars(variable), scales ="free") +ylab("SHAP")```The SHAP dependence plots allow one to explore the variance of the Shapley values with different data points of a feature.Let's explore some of these relationships further by looking at some SHAP interaction values.SHAP interaction values separate the impact of variable into main effects and interaction effects.They add up roughly to the dependence plot.::: callout-warningThe preparation of SHAP interaction values take time since it calculates all the combinations.The code blow took about 4 hours to run.:::```{r eval=FALSE, include=TRUE}# Prepare the interaction SHAP values from predict.xgb.Boostershap_int <-shap.prep.interaction(xgb_model = xgb_wf %>%extract_fit_engine(),X_train = boost_recipe %>%prep() %>%bake(has_role("predictor"),new_data =NULL, composition ="matrix"))saveRDS(shap_int, "opinion_interaction")```#### How effect of network distance varies with the vehicle used {.unnumbered}```{r}#| layout-ncol: 2#| column: pageshap_int =readRDS("opinion_interaction")shap.plot.dependence(data_long = opinion_shap,data_int = shap_int,x ="network_dist",y ="vehic_moto",color_feature ="vehic_moto",size0 =1.2,smooth = F )shap.plot.dependence(data_long = opinion_shap,data_int = shap_int,x ="network_dist",y ="vehic_car",color_feature ="vehic_car",size0 =1.2,smooth = F )```Usage of cars for distances \> 17 KM is associated with lower SHAP interaction values compared to use of motorcycles.#### Influence of owning/renting at certain locations {.unnumbered}```{r}#| column: screen-inset-shaded#| layout-nrow: 2shap.plot.dependence(data_long = opinion_shap,data_int = shap_int,x ="homelon",y ="own_owner",color_feature ="own_owner",size0 =1.2,smooth = F )shap.plot.dependence(data_long = opinion_shap,data_int = shap_int,x ="homelon",y ="own_rent",color_feature ="own_rent",size0 =1.2,smooth = F )shap.plot.dependence(data_long = opinion_shap,data_int = shap_int,x ="homelat",y ="own_owner",color_feature ="own_owner",size0 =1.2,smooth = F )shap.plot.dependence(data_long = opinion_shap,data_int = shap_int,x ="homelat",y ="own_rent",color_feature ="own_rent",size0 =1.2,smooth = F )```Owning/renting a home at certain locations has a significant influence on agreeing/disagreeing with the ban.Let's explore the interaction between the number of cars one owns and their occupation, and how this influences opinion:```{r}#| layout-ncol: 2#| column: pageshap.plot.dependence(data_long = opinion_shap,data_int = shap_int,x ="own_car",y ="occup_state",color_feature ="occup_state",size0 =1.2,smooth = F )shap.plot.dependence(data_long = opinion_shap,data_int = shap_int,x ="own_car",y ="occup_fdi",color_feature ="occup_fdi",size0 =1.2,smooth = F )```Generally, having 1 car and occupation is FDI increases the probability of agreeing with the ban while having 2 cars and occupation is FDI reduces probability of agreeing to ban.This non-linear interactions will be explored further if required.### Local explanationsLocal model explanations provide information about a prediction for a single observation.Probably the most commonly asked question when trying to understand a model's prediction for a single observation is: *which variables contribute to this result the most?* There is no single best approach that can be used to answer this question.In this section, we explore [break-down (BD) plots](https://ema.drwhy.ai/breakDown.html), which offer a possible solution.::: callout-tipBreak down plots compute how contributions attributed to individual features change the mean model's prediction for a particular observation.We can use the fact that these break-down explanations change based on the order of features to compute the most important features over all (or many) possible orderings.This is the idea behind Shapley Additive Explanations ([Lundberg and Lee 2017](https://www.tmwr.org/explain.html#ref-Lundberg2017)), where the average contributions of features are computed under different combinations or "coalitions" of feature orderings.The underlying idea of BD plots is thus to capture the contribution of an explanatory variable to the model's prediction by computing the shift in the expected value of $Y$, while fixing the values of other variables.:::To compute any kind of model explanation, global or local, using **DALEX** package, we first create an *explainer* for each model.```{r}#| layout-ncol: 2#| column: page# Load DALEX packagelibrary(DALEXtra)# Vector of predictorsvip_features <- xgb_wf %>%extract_preprocessor() %>%summary() %>%filter(role =="predictor") %>%pull(variable)vip_train <- train %>%select(all_of(vip_features)) # Explainer for xgboostexplainer_xgboost <-explain_tidymodels(# a fitted workflow model <- xgb_wf,# Data should be passed without target column # this shall be provided as the y argumentdata = vip_train,y = train$opinion_ban,label ="xg_boost",verbose = F)```Let's use the fact that these break-down explanations change based on order to compute the most important features over all (or many) possible orderings and compute SHAP attributions for some sample observations, using `B = 25` random orderings (takes a while to compute).```{r}set.seed(2056)# Pick an observationsample_observation <- vip_train %>% dplyr::slice(2)glimpse(sample_observation)```Computing the break down explanations for the above observation:```{r}#| layout-ncol: 2#| column: page# Compute SHAP attributionsshap_bd <-predict_parts(explainer = explainer_xgboost,new_observation = sample_observation,type ="shap",# Default is 25B =25)# Make break down plotsshap_bd %>%group_by(variable) %>%mutate(mean_val =mean(contribution)) %>%ungroup() %>%mutate(variable =fct_reorder(variable, abs(mean_val))) %>%filter(contribution !=0) %>%ggplot(mapping =aes(x = contribution, y = variable, fill = mean_val >0)) +geom_boxplot() +geom_col(data =~distinct(., variable, mean_val),aes(x = mean_val, y = variable),alpha =0.5) +theme(legend.position ="none") +scale_fill_viridis_d() +labs(y =NULL)xgboost_breakdown <-predict_parts(explainer = explainer_xgboost,new_observation = sample_observation,order = shap_bd %>%group_by(variable_name) %>%mutate(mean_val =mean(contribution)) %>%ungroup() %>%mutate(variable_name =fct_reorder(variable_name, desc(abs(mean_val)))) %>%distinct(variable_name, mean_val) %>%pull(variable_name) %>%levels())plot(xgboost_breakdown, max_vars =1000)```::: callout-tip`purple` variables increase the probability of `agreeing` with the ban while `yellow` variables increase probability of `disagreeing` with the ban.I this package, probability of:`0-0.5`: agree`0.5-1`: disagree:::Let's interpret the above plots a little bit:::: callout-noteThe`intercept` shows the distribution and the mean value of the model's predictions for all data.The next rows show the distribution and the mean value of the predictions when fixing values of subsequent explanatory variables.For instance the fact that the individual's alternative vehicle is `car lighttrain`, is `aware` of ban and future vehicle is `car` reduces the probability of disagreeing with ban while the fact that the individual uses a `motorcycle`, the trip purpose is `eduaction` and `rents` increases the probability of disagreeing with ban.The `prediction` shows the prediction for the particular instance of interest.:::Let's do the same for an observation that `disagreed` with the ban:```{r}#| layout-ncol: 2#| column: pageset.seed(2056)# Pick an observationsample_observation <- train %>%select(all_of(vip_features), opinion_ban) %>% dplyr::slice(10234)glimpse(sample_observation)```Computing the breakdown explanations for the above observation:```{r}# Compute SHAP attributionsshap_bd <-predict_parts(explainer = explainer_xgboost,new_observation = sample_observation,type ="shap",# Default is 25B =25)# Make break down plotsshap_bd %>%group_by(variable) %>%mutate(mean_val =mean(contribution)) %>%ungroup() %>%mutate(variable =fct_reorder(variable, abs(mean_val))) %>%filter(contribution !=0) %>%ggplot(mapping =aes(x = contribution, y = variable, fill = mean_val >0)) +geom_boxplot() +geom_col(data =~distinct(., variable, mean_val),aes(x = mean_val, y = variable),alpha =0.5) +theme(legend.position ="none") +scale_fill_viridis_d() +labs(y =NULL)xgboost_breakdown <-predict_parts(explainer = explainer_xgboost,new_observation = sample_observation,order = shap_bd %>%group_by(variable_name) %>%mutate(mean_val =mean(contribution)) %>%ungroup() %>%mutate(variable_name =fct_reorder(variable_name, desc(abs(mean_val)))) %>%distinct(variable_name, mean_val) %>%pull(variable_name) %>%levels())plot(xgboost_breakdown, max_vars =1000)```::: callout-noteThe fact that the respondent `does not care about ban`, makes `17-20` trips per week, alternative vehicle is `ebike`, uses `taxi 1-5` times a week increases the probability of `disagreeing` with the ban.The fact that the respondent uses a `car 1_5 times a week`, future vehicle purchase is `bike`, lives at `lat 21.02 lon 105.8`, increases the probability of `agreeing` with the ban.The model's prediction is 0.793 indicating the individual disagreed with the ban.More breakdown plots will be made available in the Shiny App.:::We'll wrap up the adventure here.In this notebook:- EDA was done to uncover any relationships between the predictors and the outcome variable.- An xgboost model was tuned and fitted to the data- Global and Local model explanations were computed to uncover some of the factors driving opinions towards the ban## Next steps- Shiny dashboard for model explainability. A WIP can be found here: <https://r-icntay.shinyapps.io/Explainable_ML/>- Linear model explainability as suggested by Lex.- Transfer outputs to paper with Minh.- Explore ML devops features such as deploying model, github actions to automate some processes, integration with shiny etc.