Finding the most important single thing, the 'key driver'

Stephen Peplow

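This note is about the Shapley value, a product of cooperative game theory. The Shapley value assigns a score to each player: that player's marginal contribution to the joint payoff, averaged over all the orders in which the players could join the coalition. It is the only allocation that shares out the whole payoff, treats interchangeable players equally, gives nothing to a player who adds nothing, and is additive across games, so the final scores are a fair division taken over all combinations of players.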
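For reference, the standard textbook definition can be written as below. The notation is an assumption of this sketch and does not come from the note itself: N is the set of n players, v(S) is the payoff that a coalition S can obtain on its own (with v of the empty set equal to 0), and phi_i is the Shapley value of player i. In the regression use later in this note, the players would be the regressors and v(S) the R-squared of the model that contains only the regressors in S.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr],
\qquad
\sum_{i \in N} \phi_i(v) \;=\; v(N).
```

The second identity is the efficiency property: the individual scores add up to the payoff of the full coalition.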

Use in marketing research

The Shapley value is useful in the analysis of ranked customer responses. Well-known problems with ranked responses include multicollinearity (the responses tend to 'point the same way') and the responses being ordinal rather than interval-based. For example, is the 'gap' between 7 and 8 the same as the gap between 3 and 4? The Shapley value won't solve these problems, but at the least it suggests a way forward. It should be noted that there are other promising methods, especially artificial neural networks (ANN). I will write some more on this approach in a few days.

Example

Below is an example using synthetic data (from Joel Cadwell) of passenger responses to a questionnaire. There are a total of fifteen questions. Twelve are about the flight itself, for example 'Seat Comfort'. Three ask whether the respondent was 'satisfied' with the flight; would 'fly again'; would 'recommend' the airline to a third party.

The fake responses are under these headings:

names(ratings)
 [1] "Easy_Reservation" "Preferred_Seats"  "Flight_Options"  
 [4] "Ticket_Prices"    "Seat_Comfort"     "Seat_Roominess"  
 [7] "Overhead_Storage" "Clean_Aircraft"   "Courtesy"        
[10] "Friendliness"     "Helpfulness"      "Service"         
[13] "Satisfaction"     "Fly_Again"        "Recommend"       

There are 1,000 respondents who have answered the 15 questions in the synthetic dataset. A first step is to visualise the data with a graph whose interconnecting 'edges' form a network map.

[Figure: network map of the 15 rating variables]

What we want to know is how the responses to the 12 flight variables relate to the three outcome responses (satisfied with the flight; will fly again; will recommend). The ratings are scaled, so that each variable has mean 0 and standard deviation 1 and ordinary least squares yields standardized coefficients. There are therefore three regressions, identical except that the dependent variable is the score for 'satisfied', 'fly again' and 'recommend' in turn. Below is the output for 'recommend'.

rec <- lm(Recommend ~ Easy_Reservation + Preferred_Seats + Flight_Options + 
    Ticket_Prices + Seat_Comfort + Seat_Roominess + Overhead_Storage + Clean_Aircraft + 
    Courtesy + Friendliness + Helpfulness + Service, data = scaled_ratings)
summary(rec)

Call:
lm(formula = Recommend ~ Easy_Reservation + Preferred_Seats + 
    Flight_Options + Ticket_Prices + Seat_Comfort + Seat_Roominess + 
    Overhead_Storage + Clean_Aircraft + Courtesy + Friendliness + 
    Helpfulness + Service, data = scaled_ratings)

Residuals:
   Min     1Q Median     3Q    Max 
-2.627 -0.429  0.130  0.465  1.843 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -8.47e-17   2.10e-02    0.00  1.00000    
Easy_Reservation  1.91e-01   3.30e-02    5.80  9.0e-09 ***
Preferred_Seats   1.74e-01   3.11e-02    5.58  3.1e-08 ***
Flight_Options    8.77e-02   2.98e-02    2.94  0.00335 ** 
Ticket_Prices     1.30e-01   2.92e-02    4.46  9.0e-06 ***
Seat_Comfort      1.39e-01   3.59e-02    3.86  0.00012 ***
Seat_Roominess    8.00e-02   3.27e-02    2.45  0.01450 *  
Overhead_Storage  1.14e-01   3.29e-02    3.47  0.00054 ***
Clean_Aircraft    1.30e-01   3.37e-02    3.87  0.00012 ***
Courtesy          1.10e-02   3.42e-02    0.32  0.74685    
Friendliness      2.09e-02   3.62e-02    0.58  0.56443    
Helpfulness      -1.02e-01   3.79e-02   -2.69  0.00731 ** 
Service          -3.43e-02   3.76e-02   -0.91  0.36234    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.663 on 987 degrees of freedom
Multiple R-squared: 0.566,  Adjusted R-squared: 0.56 
F-statistic:  107 on 12 and 987 DF,  p-value: <2e-16 
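The construction of scaled_ratings is not shown in this note. A minimal sketch of one common way to do it, assuming base R's scale() was used (the note does not say), is given below on a small simulated data frame. The data frame toy and its values are invented for illustration, and the column names are borrowed from the questionnaire only to keep the example readable.

```r
# Simulated stand-in for the ratings; NOT the data used in this note.
set.seed(1)
n <- 200
toy <- data.frame(
  Seat_Comfort = sample(1:9, n, replace = TRUE),
  Courtesy     = sample(1:9, n, replace = TRUE)
)
toy$Recommend <- round(0.5 * toy$Seat_Comfort + 0.2 * toy$Courtesy + rnorm(n))

# scale() centres every column to mean 0 and rescales it to sd 1.
toy_scaled <- data.frame(scale(toy))

# With all variables scaled, the intercept is zero up to rounding error
# and the slopes are standardized (beta) coefficients.
fit <- lm(Recommend ~ Seat_Comfort + Courtesy, data = toy_scaled)
coef(fit)
```

This is consistent with the output above, where the intercept of -8.47e-17 is zero for practical purposes.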

Let's decompose the 'recommend' regression and find out what the R-squared is made of.

library(relaimpo)
This is the global version of package relaimpo. 
If you are a non-US user, a version with the interesting additional metric pmvd is available 
from Ulrike Groempings web site at http://prof.beuth-hochschule.de/groemping/relaimpo. 
calc.relimp(rec, type = c("lmg"), rela = TRUE, rank = TRUE)
Response variable: Recommend 
Total response variance: 1 
Analysis based on 1000 observations 

12 Regressors: 
Easy_Reservation Preferred_Seats Flight_Options Ticket_Prices Seat_Comfort Seat_Roominess Overhead_Storage Clean_Aircraft Courtesy Friendliness Helpfulness Service 
Proportion of variance explained by model: 56.56%
Metrics are normalized to sum to 100% (rela=TRUE). 

Relative importance metrics: 

                     lmg
Easy_Reservation 0.13192
Preferred_Seats  0.12013
Flight_Options   0.08873
Ticket_Prices    0.10155
Seat_Comfort     0.10560
Seat_Roominess   0.08013
Overhead_Storage 0.09549
Clean_Aircraft   0.10009
Courtesy         0.04711
Friendliness     0.04829
Helpfulness      0.04090
Service          0.04007

Average coefficients for different model sizes: 

                     1X    2Xs    3Xs     4Xs      5Xs       6Xs      7Xs
Easy_Reservation 0.6224 0.4659 0.3864 0.33718  0.30270  0.276573  0.25575
Preferred_Seats  0.5954 0.4256 0.3485 0.30297  0.27171  0.248309  0.22981
Flight_Options   0.5477 0.3614 0.2838 0.23770  0.20500  0.179492  0.15846
Ticket_Prices    0.5652 0.3850 0.3082 0.26279  0.23121  0.207242  0.18813
Seat_Comfort     0.5991 0.4294 0.3436 0.29083  0.25414  0.226574  0.20478
Seat_Roominess   0.5539 0.3597 0.2707 0.21835  0.18302  0.157151  0.13717
Overhead_Storage 0.5757 0.3930 0.3091 0.25888  0.22423  0.198153  0.17746
Clean_Aircraft   0.5882 0.4115 0.3248 0.27220  0.23621  0.209661  0.18909
Courtesy         0.4928 0.2679 0.1642 0.10760  0.07383  0.052558  0.03860
Friendliness     0.5000 0.2772 0.1704 0.11151  0.07664  0.055186  0.04171
Helpfulness      0.4710 0.2241 0.1004 0.02915 -0.01518 -0.044109 -0.06362
Service          0.4729 0.2318 0.1177 0.05527  0.01854 -0.003791 -0.01751
                      8Xs      9Xs     10Xs     11Xs     12Xs
Easy_Reservation  0.23858  0.22408  0.21165  0.20085  0.19138
Preferred_Seats   0.21467  0.20198  0.19117  0.18186  0.17377
Flight_Options    0.14052  0.12490  0.11110  0.09879  0.08773
Ticket_Prices     0.17243  0.15927  0.14810  0.13851  0.13021
Seat_Comfort      0.18694  0.17200  0.15928  0.14830  0.13876
Seat_Roominess    0.12118  0.10807  0.09714  0.08791  0.08001
Overhead_Storage  0.16047  0.14618  0.13398  0.12341  0.11416
Clean_Aircraft    0.17260  0.15910  0.14785  0.13836  0.13025
Courtesy          0.02913  0.02247  0.01761  0.01393  0.01104
Friendliness      0.03317  0.02775  0.02433  0.02218  0.02087
Helpfulness      -0.07710 -0.08661 -0.09341 -0.09836 -0.10199
Service          -0.02584 -0.03070 -0.03326 -0.03429 -0.03428

The author of the R package 'relaimpo', which I've just used, prefers the term 'lmg', but this is really a variant of the Shapley value: each regressor's contribution to R-squared is averaged over all the orders in which the regressors could enter the model. (There are alternative metrics, and we could also have asked for a vector of pairwise differences between relative contributions.) The first table, under 'Relative importance metrics', lists the lmg-Shapley values. It seems that 'Easy_Reservation' is the most important at 0.132. We can see that the other closely related variables (in the red circles) also have high values.

Because rela = TRUE was specified, the lmg figures add up to 1, or 100%, of the variance the model explains (56.56% of the total). In other words, for those clients who say they are going to recommend your airline, ease of reservation is the quality they will stress most, and it makes up about 13% of their argument in your favour.
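To make the averaging idea concrete, here is a minimal base-R sketch that computes lmg-style shares by brute force on simulated data. It is not relaimpo's implementation; the helper names r2, perms and lmg_by_hand, and the data frame toy, are all hypothetical. Enumerating every ordering is only practical for a handful of regressors, since the number of orderings grows as p!.

```r
# R-squared of the model using the regressors in `vars` (0 for the empty model).
r2 <- function(vars, response, data) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response), data = data))$r.squared
}

# All orderings of a vector, built recursively.
perms <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    for (rest in perms(x[-i])) out[[length(out) + 1]] <- c(x[i], rest)
  }
  out
}

# Average each regressor's gain in R-squared over every entry order.
lmg_by_hand <- function(predictors, response, data) {
  share <- setNames(numeric(length(predictors)), predictors)
  orderings <- perms(predictors)
  for (ord in orderings) {
    for (k in seq_along(ord)) {
      gain <- r2(ord[seq_len(k)], response, data) -
        r2(ord[seq_len(k - 1)], response, data)
      share[ord[k]] <- share[ord[k]] + gain
    }
  }
  share / length(orderings)
}

# Simulated data: x1 and x2 are correlated, x3 is independent.
set.seed(42)
n <- 500
common <- rnorm(n)
toy <- data.frame(x1 = common + rnorm(n), x2 = common + rnorm(n), x3 = rnorm(n))
toy$y <- 0.5 * toy$x1 + 0.3 * toy$x2 + 0.2 * toy$x3 + rnorm(n)

lmg <- lmg_by_hand(c("x1", "x2", "x3"), "y", toy)
full_r2 <- r2(c("x1", "x2", "x3"), "y", toy)
lmg             # raw shares, which sum to the full-model R-squared
lmg / sum(lmg)  # shares normalized to sum to 1, in the spirit of rela = TRUE
```

The raw shares add up to the R-squared of the full model, which is the efficiency property of the Shapley value; dividing by their sum gives proportions that add up to 100%.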

Interestingly, when the exercise is run again with 'satisfied' as the dependent variable, quite different, almost opposite, results appear.

Above I have used a synthetic dataset where N = 1000. For smaller and more realistic datasets, it is possible to bootstrap the decomposition to see how stable the importance shares are.
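The relaimpo package has its own bootstrap functions (boot.relimp and booteval.relimp). The sketch below does not use them; it is a base-R illustration of the idea on a small simulated sample, with a hypothetical helper lmg_shares that applies the subset-weighted form of the Shapley value to R-squared. The data, the number of resamples B and the percentile intervals are all assumptions made for the example.

```r
# Shapley/lmg shares of R-squared using the subset-weighted formula.
lmg_shares <- function(data, response, predictors) {
  p <- length(predictors)
  r2 <- function(vars) {
    if (length(vars) == 0) return(0)
    summary(lm(reformulate(vars, response), data = data))$r.squared
  }
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    total <- 0
    for (k in 0:(p - 1)) {
      subsets <- if (k == 0) list(character(0)) else combn(others, k, simplify = FALSE)
      w <- factorial(k) * factorial(p - k - 1) / factorial(p)
      for (s in subsets) total <- total + w * (r2(c(s, v)) - r2(s))
    }
    total
  })
}

# A small simulated sample (n = 80); NOT the airline data.
set.seed(7)
n <- 80
common <- rnorm(n)
small <- data.frame(x1 = common + rnorm(n), x2 = common + rnorm(n), x3 = rnorm(n))
small$y <- 0.5 * small$x1 + 0.3 * small$x2 + 0.2 * small$x3 + rnorm(n)
preds <- c("x1", "x2", "x3")

# Resample rows with replacement and recompute the normalized shares each time.
B <- 200
boot <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  s <- lmg_shares(small[idx, ], "y", preds)
  s / sum(s)
})

# Percentile intervals for each share, next to the full-sample estimate.
ci <- apply(boot, 1, quantile, probs = c(0.025, 0.975))
estimate <- lmg_shares(small, "y", preds)
estimate <- estimate / sum(estimate)
rbind(estimate = estimate, ci)
```

Wide or overlapping intervals would suggest that the ranking of the regressors should not be read too literally in a sample of that size.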

Extensions

It is possible to run a sequence of decompositions so that the Shapley value's change over time can be observed.
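As a rough sketch of what such a sequence might look like, the example below splits simulated data into four survey waves and recomputes the shares in each. Everything here is hypothetical: the data frame panel, the wave variable, and the helper lmg_two, which uses the closed form of the Shapley value for exactly two regressors (the average of a regressor's R-squared alone and its gain when added after the other).

```r
# Shapley/lmg shares of R-squared for exactly two regressors, a and b.
lmg_two <- function(d) {
  r2_a  <- cor(d$y, d$a)^2
  r2_b  <- cor(d$y, d$b)^2
  r2_ab <- summary(lm(y ~ a + b, data = d))$r.squared
  c(a = (r2_a + r2_ab - r2_b) / 2,
    b = (r2_b + r2_ab - r2_a) / 2)
}

# Simulated panel: the weight on a falls and the weight on b rises by wave.
set.seed(3)
waves <- 1:4
panel <- do.call(rbind, lapply(waves, function(w) {
  n <- 300
  a <- rnorm(n)
  b <- 0.5 * a + rnorm(n)
  y <- (1 - 0.2 * w) * a + (0.2 * w) * b + rnorm(n)
  data.frame(wave = w, a = a, b = b, y = y)
}))

# One decomposition per wave, normalized so each row sums to 1.
by_wave <- t(sapply(split(panel, panel$wave), function(d) {
  s <- lmg_two(d)
  s / sum(s)
}))
by_wave
```

Each row of by_wave is one wave, so reading down a column shows how a regressor's share moves over time.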

Caveats

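Reading the decomposition as a guide to action assumes a causal effect AND a static environment. Just pressing on one thing, such as improving the single highest-scoring item in isolation, won't necessarily work. The analyst needs to see the bigger picture.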