This note is about the Shapley value, a product of cooperative game theory. The Shapley value assigns a score to each 'player' equal to that player's average marginal contribution over every combination (coalition) the player could join. Because the scores are built from all combinations and sum exactly to the total payoff, they give a defensible way of dividing the credit among the players.
This result is useful in the analysis of ranked customer responses. Well-known problems with ranked responses include multicollinearity (the responses tend to 'point the same way') and the fact that the responses are ordinal rather than interval-scaled: is the 'gap' between 7 and 8 the same as the gap between 3 and 4? The Shapley value won't solve these problems, but at the least it suggests a way forward. It should be noted that there are other promising methods, especially artificial neural networks (ANNs). I will write some more on that approach in a few days.
Below is an example using synthetic data (from Joel Cadwell) of passenger responses to a questionnaire. There are fifteen questions in total. Twelve are about the flight itself, for example 'Seat_Comfort'. Three ask whether the respondent was 'satisfied' with the flight, would 'fly again', and would 'recommend' the airline to a third party.
The fake responses are under these headings:
names(ratings)
[1] "Easy_Reservation" "Preferred_Seats" "Flight_Options"
[4] "Ticket_Prices" "Seat_Comfort" "Seat_Roominess"
[7] "Overhead_Storage" "Clean_Aircraft" "Courtesy"
[10] "Friendliness" "Helpfulness" "Service"
[13] "Satisfaction" "Fly_Again" "Recommend"
There are 1,000 respondents who have answered the 15 questions in the synthetic dataset. A first step is to visualise the data as a network map: a graph whose interconnecting 'edges' link correlated questions.
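One way to draw such a map (a sketch of mine, assuming the qgraph package, and not necessarily how the original figure was produced) is to treat the correlation matrix of the fifteen ratings as a weighted graph:
library(qgraph)   # assumed package choice; other graph-drawing packages would also work
# edges are the pairwise correlations between the fifteen ratings;
# 'minimum' hides the weakest edges so the clusters stand out
qgraph(cor(ratings),
       layout = "spring",
       labels = abbreviate(names(ratings), 8),
       minimum = 0.3)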
What we want to know is how the responses to the twelve service variables relate to the three outcome variables (satisfied with the service; will fly again; will recommend). The ratings are first standardised so that ordinary least squares yields standardized coefficients. There are therefore three regressions, identical except that the dependent variable is the score for 'satisfied', 'fly again' and 'recommend' in turn.
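The scaling step itself is not shown in the output; a minimal sketch of it, producing the scaled_ratings data frame used below, would be:
# centre each rating and scale it to unit variance so that the OLS
# coefficients are standardized and directly comparable
scaled_ratings <- data.frame(scale(ratings))
The regression and output for 'recommend' are below.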
rec <- lm(Recommend ~ Easy_Reservation + Preferred_Seats + Flight_Options +
Ticket_Prices + Seat_Comfort + Seat_Roominess + Overhead_Storage + Clean_Aircraft +
Courtesy + Friendliness + Helpfulness + Service, data = scaled_ratings)
summary(rec)
Call:
lm(formula = Recommend ~ Easy_Reservation + Preferred_Seats +
Flight_Options + Ticket_Prices + Seat_Comfort + Seat_Roominess +
Overhead_Storage + Clean_Aircraft + Courtesy + Friendliness +
Helpfulness + Service, data = scaled_ratings)
Residuals:
Min 1Q Median 3Q Max
-2.627 -0.429 0.130 0.465 1.843
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.47e-17 2.10e-02 0.00 1.00000
Easy_Reservation 1.91e-01 3.30e-02 5.80 9.0e-09 ***
Preferred_Seats 1.74e-01 3.11e-02 5.58 3.1e-08 ***
Flight_Options 8.77e-02 2.98e-02 2.94 0.00335 **
Ticket_Prices 1.30e-01 2.92e-02 4.46 9.0e-06 ***
Seat_Comfort 1.39e-01 3.59e-02 3.86 0.00012 ***
Seat_Roominess 8.00e-02 3.27e-02 2.45 0.01450 *
Overhead_Storage 1.14e-01 3.29e-02 3.47 0.00054 ***
Clean_Aircraft 1.30e-01 3.37e-02 3.87 0.00012 ***
Courtesy 1.10e-02 3.42e-02 0.32 0.74685
Friendliness 2.09e-02 3.62e-02 0.58 0.56443
Helpfulness -1.02e-01 3.79e-02 -2.69 0.00731 **
Service -3.43e-02 3.76e-02 -0.91 0.36234
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.663 on 987 degrees of freedom
Multiple R-squared: 0.566, Adjusted R-squared: 0.56
F-statistic: 107 on 12 and 987 DF, p-value: <2e-16
Let's decompose the 'recommend' regression and find out what the R-squared is made of.
library(relaimpo)
This is the global version of package relaimpo.
If you are a non-US user, a version with the interesting additional metric pmvd is available
from Ulrike Groempings web site at http://prof.beuth-hochschule.de/groemping/relaimpo.
calc.relimp(rec, type = c("lmg"), rela = TRUE, rank = TRUE)
Response variable: Recommend
Total response variance: 1
Analysis based on 1000 observations
12 Regressors:
Easy_Reservation Preferred_Seats Flight_Options Ticket_Prices Seat_Comfort Seat_Roominess Overhead_Storage Clean_Aircraft Courtesy Friendliness Helpfulness Service
Proportion of variance explained by model: 56.56%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
Easy_Reservation 0.13192
Preferred_Seats 0.12013
Flight_Options 0.08873
Ticket_Prices 0.10155
Seat_Comfort 0.10560
Seat_Roominess 0.08013
Overhead_Storage 0.09549
Clean_Aircraft 0.10009
Courtesy 0.04711
Friendliness 0.04829
Helpfulness 0.04090
Service 0.04007
Average coefficients for different model sizes:
1X 2Xs 3Xs 4Xs 5Xs 6Xs 7Xs
Easy_Reservation 0.6224 0.4659 0.3864 0.33718 0.30270 0.276573 0.25575
Preferred_Seats 0.5954 0.4256 0.3485 0.30297 0.27171 0.248309 0.22981
Flight_Options 0.5477 0.3614 0.2838 0.23770 0.20500 0.179492 0.15846
Ticket_Prices 0.5652 0.3850 0.3082 0.26279 0.23121 0.207242 0.18813
Seat_Comfort 0.5991 0.4294 0.3436 0.29083 0.25414 0.226574 0.20478
Seat_Roominess 0.5539 0.3597 0.2707 0.21835 0.18302 0.157151 0.13717
Overhead_Storage 0.5757 0.3930 0.3091 0.25888 0.22423 0.198153 0.17746
Clean_Aircraft 0.5882 0.4115 0.3248 0.27220 0.23621 0.209661 0.18909
Courtesy 0.4928 0.2679 0.1642 0.10760 0.07383 0.052558 0.03860
Friendliness 0.5000 0.2772 0.1704 0.11151 0.07664 0.055186 0.04171
Helpfulness 0.4710 0.2241 0.1004 0.02915 -0.01518 -0.044109 -0.06362
Service 0.4729 0.2318 0.1177 0.05527 0.01854 -0.003791 -0.01751
8Xs 9Xs 10Xs 11Xs 12Xs
Easy_Reservation 0.23858 0.22408 0.21165 0.20085 0.19138
Preferred_Seats 0.21467 0.20198 0.19117 0.18186 0.17377
Flight_Options 0.14052 0.12490 0.11110 0.09879 0.08773
Ticket_Prices 0.17243 0.15927 0.14810 0.13851 0.13021
Seat_Comfort 0.18694 0.17200 0.15928 0.14830 0.13876
Seat_Roominess 0.12118 0.10807 0.09714 0.08791 0.08001
Overhead_Storage 0.16047 0.14618 0.13398 0.12341 0.11416
Clean_Aircraft 0.17260 0.15910 0.14785 0.13836 0.13025
Courtesy 0.02913 0.02247 0.01761 0.01393 0.01104
Friendliness 0.03317 0.02775 0.02433 0.02218 0.02087
Helpfulness -0.07710 -0.08661 -0.09341 -0.09836 -0.10199
Service -0.02584 -0.03070 -0.03326 -0.03429 -0.03428
The author of the R package 'relaimpo', which I've just used, prefers the term 'lmg', but this is really a variant of the Shapley value: each regressor's share of the R-squared is its average contribution over all the orderings in which it could enter the model. (There are alternatives; we could also have asked for pairwise differences between the relative contributions.) The first table lists the lmg-Shapley values under 'Relative importance metrics'. 'Easy_Reservation' is the most important at 0.132, and the other closely related booking variables also have high values. The lmg figures add up to 1, or 100%. In other words, for those passengers who say they are going to recommend your airline, ease of reservation is the quality they will stress most, and it makes up about 13% of their argument in your favour.
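To make the connection to the Shapley value concrete, here is a toy check of my own on a reduced problem with just two of the predictors. It will not reproduce the twelve-predictor table above; it only illustrates the averaging idea, namely that a predictor's share is its average increase in R-squared over the orderings in which it can enter the model.
# Shapley decomposition of R-squared by hand, two predictors only
r2 <- function(f) summary(lm(f, data = scaled_ratings))$r.squared
r2_a  <- r2(Recommend ~ Easy_Reservation)
r2_b  <- r2(Recommend ~ Preferred_Seats)
r2_ab <- r2(Recommend ~ Easy_Reservation + Preferred_Seats)
# ordering (a, b): a adds r2_a;  ordering (b, a): a adds r2_ab - r2_b
share_a <- mean(c(r2_a, r2_ab - r2_b))
share_b <- mean(c(r2_b, r2_ab - r2_a))
c(share_a, share_b, total = r2_ab)   # the two shares sum to the model's R-squared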
Interestingly, when the exercise is run again with 'Satisfaction' as the dependent variable, quite different, almost opposite, results appear.
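That repeated run is not shown here; it is just the same two calls with the dependent variable swapped, along the lines of:
# same model and decomposition, now with 'Satisfaction' as the outcome
sat <- lm(Satisfaction ~ Easy_Reservation + Preferred_Seats + Flight_Options +
    Ticket_Prices + Seat_Comfort + Seat_Roominess + Overhead_Storage +
    Clean_Aircraft + Courtesy + Friendliness + Helpfulness + Service,
    data = scaled_ratings)
calc.relimp(sat, type = c("lmg"), rela = TRUE, rank = TRUE)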
Above I have used a synthetic dataset where N = 1,000. For smaller and more realistic datasets it is possible to bootstrap the lmg shares and attach confidence intervals to them.
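relaimpo ships with bootstrap helpers for exactly this; a sketch (the number of resamples and the confidence level here are my own choices) might be:
# resample the respondents, recompute the lmg shares each time,
# then report confidence intervals for every share
boot_lmg <- boot.relimp(rec, b = 1000, type = "lmg", rela = TRUE)
booteval.relimp(boot_lmg, level = 0.95)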
It is also possible to run a sequence of decompositions so that changes in the Shapley values over time can be observed.
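A sketch of that, assuming a hypothetical 'wave' column recording when each respondent was surveyed (no such column exists in this synthetic data), could look like:
# split the respondents by survey wave, refit the same model in each
# wave and collect the lmg shares so they can be compared over time
lmg_by_wave <- lapply(split(scaled_ratings, scaled_ratings$wave), function(d) {
    fit <- update(rec, data = d)   # same formula, one wave's respondents
    calc.relimp(fit, type = "lmg", rela = TRUE)@lmg
})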
This model assumes both a causal effect AND a static environment. Simply pressing on one driver in isolation won't work; the analyst needs to see the bigger picture.