In November of last year I released a fictional dataset and analytic exercise here. The exercise was designed to encourage people to take an evidence-based approach to a specific business problem, with two main aims:
1. Decide on an approach and execute an analysis of the dataset to answer a specific question: which variables in the dataset may have a relationship with manager performance and, if possible, how strong is that relationship?
2. (Optional) Present the results to a role-played set of stakeholders, in order to practice delivering complex content and defending an evidence-based approach in front of stakeholders with strong pre-existing concerns and biases.
In this document I will provide a sample approach to Objective 1 - how to conduct this analysis using an evidence-based approach. The phrase ‘evidence-based’ implies that care will be taken to ensure that any conclusions from the analysis are appropriately supported by statistical evidence, that widely-accepted statistical standards are applied in the workflow, and that limitations are clearly articulated. There is no single, undisputedly correct approach to this analysis, so this should not be misinterpreted as a ‘textbook solution’. However, I believe it reflects an approach that a large number of professional statisticians would opt for.
This is a practical document, and I will not spend a lot of time diving into the underlying theory of my methodology - there are plenty of academic textbooks out there for those who are interested in that. However, I will explain my approach as I proceed, and touch on elements of theory as necessary to support it.
My approach uses the R statistical programming language, which is my analytic language of choice.
We have been provided with a dataset with a number of data columns. Each row is an observation of a specific manager in the company. Our stakeholder is interested in whether we can glean any information from this dataset about potential factors affecting manager performance. We have been provided with a description of the data fields, which is useful. Let’s pull the dataset and take a quick look at the data.
library(dplyr)

## pull the RData file direct from GitHub and load it into a dataframe
## (mode = "wb" ensures a binary-safe download on Windows; despite the
## .RData extension the file is a single serialized object, so readRDS() applies)
download.file("https://github.com/keithmcnulty/ebp_exercise/raw/master/data.RData",
              destfile = "data.RData", mode = "wb")
data <- readRDS("data.RData")

## briefly look at the first few rows
knitr::kable(data %>% head(5))
| employee_id | performance_group | yrs_employed | manager_hire | test_score | group_size | concern_flag | mobile_flag | customers | high_hours_flag | transfers | reduced_schedule | city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c4578853 | Bottom | 4.6 | N | 205 | 10 | N | N | 12 | N | 0 | Y | San Francisco |
| a7d7afd6 | Middle | 5.3 | N | 227 | 14 | N | Y | 18 | N | 0 | N | New York |
| 272b93f1 | Bottom | 5.2 | N | 227 | 10 | N | N | 12 | N | 0 | Y | Chicago |
| be8b6baa | Middle | 4.9 | N | 273 | 19 | N | N | 26 | Y | 0 | N | New York |
| a18ecc4e | Bottom | 4.9 | N | 227 | 17 | Y | N | 26 | Y | 5 | Y | Orlando |
So, it looks like each row has an Employee ID (which is a unique identifier and not relevant to this analysis), a performance rating (which is clearly our outcome/dependent variable here), and 11 other datapoints which may or may not have a relationship with the outcome variable.
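As a quick sanity check (cheap to do, and it protects the analysis against duplicate rows), we can confirm that employee_id really is a one-row-per-manager identifier:

## confirm employee_id uniquely identifies each row
dplyr::n_distinct(data$employee_id) == nrow(data)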
Let’s take a look at how the data is structured:
## look at data structure
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 571 obs. of 13 variables:
## $ employee_id : chr "c4578853" "a7d7afd6" "272b93f1" "be8b6baa" ...
## $ performance_group: Ord.factor w/ 3 levels "Bottom"<"Middle"<..: 1 2 1 2 1 2 2 2 2 2 ...
## $ yrs_employed : num 4.6 5.3 5.2 4.9 4.9 4.3 4.8 5 4.4 4.5 ...
## $ manager_hire : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ test_score : num 205 227 227 273 227 159 250 326 152 326 ...
## $ group_size : num 10 14 10 19 17 10 13 13 10 11 ...
## $ concern_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
## $ mobile_flag : Factor w/ 2 levels "N","Y": 1 2 1 1 1 1 1 1 1 1 ...
## $ customers : num 12 18 12 26 26 18 22 16 20 28 ...
## $ high_hours_flag : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 1 2 2 ...
## $ transfers : num 0 0 0 0 5 0 0 3 0 0 ...
## $ reduced_schedule : chr "Y" "N" "Y" "N" ...
## $ city : Factor w/ 6 levels "San Francisco",..: 1 2 3 2 4 4 2 5 3 3 ...
We see that, because we have imported the data as an RData file, our data is mostly already in the format we need. In particular:
- Our outcome variable is already an ordered factor "Bottom" < "Middle" < "Top", so R understands the ranking here. If this were not the case, it could easily be fixed using data$performance_group <- ordered(data$performance_group, levels = c("Bottom", "Middle", "Top")).
- Most other variables are either numeric or binary factors, except that reduced_schedule should be a binary factor, not a character. Even though R will coerce characters to factors automatically inside most modeling functions, let's do the conversion ourselves to be thorough.
- The city variable is an unordered factor, which is appropriate for this analysis.
## convert reduced_schedule to factor
data$reduced_schedule <- factor(data$reduced_schedule)
str(data$reduced_schedule)
## Factor w/ 2 levels "N","Y": 2 1 2 1 2 2 1 1 2 2 ...
Let’s look at the basic statistical properties of the data:
## quick summary
summary(data)
## employee_id performance_group yrs_employed manager_hire
## Length:571 Bottom:129 Min. :2.000 N:542
## Class :character Middle:376 1st Qu.:4.300 Y: 29
## Mode :character Top : 66 Median :4.600
## Mean :4.596
## 3rd Qu.:5.000
## Max. :6.000
## test_score group_size concern_flag mobile_flag customers
## Min. : 0.0 Min. : 5.00 N:514 N:433 Min. :10.00
## 1st Qu.:182.0 1st Qu.:10.00 Y: 57 Y:138 1st Qu.:18.00
## Median :235.0 Median :11.00 Median :20.00
## Mean :240.2 Mean :11.82 Mean :21.06
## 3rd Qu.:295.0 3rd Qu.:13.00 3rd Qu.:24.00
## Max. :500.0 Max. :25.00 Max. :40.00
## high_hours_flag transfers reduced_schedule city
## N:345 Min. :0.0000 N:428 San Francisco: 52
## Y:226 1st Qu.:0.0000 Y:143 New York :196
## Median :0.0000 Chicago : 62
## Mean :0.7968 Orlando : 23
## 3rd Qu.:2.0000 Toronto :213
## Max. :5.0000 Houston : 25
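Before commenting on these, it is worth quantifying the distribution of the outcome variable explicitly. Using the counts from the summary above:

## proportion of managers in each performance group
round(prop.table(table(data$performance_group)), 3)
## Bottom Middle    Top 
##  0.226  0.658  0.116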
Here are a few early observations we can make about this data - some of these observations will play an important role in how we will formulate conclusions from this analysis:
performance_group is quite clumped in the middle, with over 60% of individuals regarded as "Middle Performers". This is quite common in Human Capital contexts, but it can limit the ability of any model to differentiate performance.

The first thing to consider is the nature of the question being asked. We are not being asked to predict manager performance, but rather to explain it. Therefore we will need an approach that provides visibility into how the input variables relate to the outcome variable and how important each one is. For this type of analysis, and noting that our dataset is not particularly large, most statisticians would choose a regression approach.
There are various options for regression approaches, the most common being linear regression (for continuous outcome variables), binary logistic regression (for binary outcomes), ordinal logistic regression (for ordered categorical outcomes) and multinomial logistic regression (for unordered categorical outcomes).
In our case, our outcome variable (performance_group) is an ordered factor with three levels, so ordinal logistic regression is the most appropriate approach. (Binary logistic regression could also be used by collapsing the outcome variable to "Bottom" vs "Not Bottom" or "Top" vs "Not Top", but this discards information and is unlikely to provide as much value as the ordinal approach.)
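For illustration only, here is roughly what that binary alternative could look like. This is a sketch, not part of the main analysis; top_flag, binary_data and binary_model are names invented for the example, and the two predictors are arbitrary choices.

## sketch only: collapse the outcome to "Top" vs "Not Top" and fit a binary model
binary_data <- data %>%
  dplyr::mutate(top_flag = factor(ifelse(performance_group == "Top", "Y", "N")))
binary_model <- glm(top_flag ~ yrs_employed + group_size,
                    family = binomial, data = binary_data)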
An ordinal logistic regression model estimates the effect each input variable has on the odds of an individual being in a higher outcome (performance) level. This effect is estimated for each variable with all other variables held constant. The results are expressed in terms of odds ratios: the multiplicative change in the odds of higher performance for a one-unit change in the input variable.
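As a toy illustration (beta here is a made-up coefficient, not one from our data), converting a log-odds coefficient to an odds ratio is a single exponential:

## toy example: converting a log-odds coefficient to an odds ratio
beta <- 0.1
exp(beta)              ## ~1.105: the odds multiply by about 1.105 per unit increase
(exp(beta) - 1) * 100  ## ~10.5: equivalently, about a 10.5% increase in the odds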
It’s relatively easy to run an Ordinal Logistic Regression model in R. However, there are two important considerations before doing so:
If any of the input variables are already highly correlated with each other, this can substantially distort the results. We will look at this momentarily.
A test of the proportional odds/parallel regression assumption must be conducted in order to establish that the data is suitable for an ordinal logistic regression approach. I will not go into details on this here, however I will include the test of proportional odds in an appendix to this document (it’s fairly easy to do).
We will do a simple correlation plot to see if any of the input variables are substantially correlated with each other, to ensure that we address point 1 above.
library(corrplot)

## create input dataframe by converting all input cols except city to numeric
inputs <- data %>%
  dplyr::select(-employee_id, -performance_group, -city) %>%
  dplyr::mutate_if(is.factor, as.numeric)

## make correlation matrix
M <- cor(inputs)

## visualize
corrplot::corrplot(M, method = "circle")
None of these correlations appears particularly strong given the size of the dataset. The most notable is that manager_hire is somewhat negatively correlated with yrs_employed, which makes sense: managers hired directly into the position, rather than promoted into it internally, are likely to have been hired more recently. Based on the correlation matrix, it seems reasonable to proceed to run our model.
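If we want to go beyond eyeballing the plot, a quick convenience sketch for ranking the pairwise correlations numerically is:

## flatten the correlation matrix and list the strongest pairs
as.data.frame(as.table(M)) %>%
  dplyr::filter(as.integer(Var1) < as.integer(Var2)) %>%  ## keep each pair once
  dplyr::arrange(desc(abs(Freq))) %>%
  head(5)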
We start by running the basic model relating all the input variables to the single outcome variable of performance_group:
library(MASS)
## run model and sight results
(model <- MASS::polr(performance_group ~ yrs_employed + manager_hire + test_score + group_size
+ concern_flag + mobile_flag + customers + high_hours_flag
+ transfers + city, data = data, Hess = TRUE))
## Call:
## MASS::polr(formula = performance_group ~ yrs_employed + manager_hire +
## test_score + group_size + concern_flag + mobile_flag + customers +
## high_hours_flag + transfers + city, data = data, Hess = TRUE)
##
## Coefficients:
## yrs_employed manager_hireY test_score group_size
## -1.198804408 -1.534678427 0.003479351 0.096081098
## concern_flagY mobile_flagY customers high_hours_flagY
## -0.103407442 -0.111525321 -0.005049641 0.560651040
## transfers cityNew York cityChicago cityOrlando
## -0.236419495 0.416010949 0.179565007 0.388607219
## cityToronto cityHouston
## 0.408904061 0.309828281
##
## Intercepts:
## Bottom|Middle Middle|Top
## -4.791495 -1.104035
##
## Residual Deviance: 901.2862
## AIC: 933.2862
The coefficients will be helpful in understanding the strength of each relationship with performance. In the raw output these coefficients represent log odds, so we will need to transform them before interpreting them, which we will do shortly. The AIC is a measure of information loss and model fit, and will help us decide whether we can safely drop unimportant variables from the model.
First let's transform our model coefficients so they are expressed as odds ratios:
exp(model$coefficients)
## yrs_employed manager_hireY test_score group_size
## 0.3015545 0.2155250 1.0034854 1.1008483
## concern_flagY mobile_flagY customers high_hours_flagY
## 0.9017595 0.8944687 0.9949631 1.7518126
## transfers cityNew York cityChicago cityOrlando
## 0.7894494 1.5159025 1.1966967 1.4749251
## cityToronto cityHouston
## 1.5051673 1.3631910
We now see the relative odds of higher manager performance given a one-unit change in each input variable. For example, each additional member of group_size is associated with roughly a 10% increase in the odds of higher performance (odds ratio 1.10), while each additional year of yrs_employed is associated with roughly a 70% decrease (odds ratio 0.30).
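If percentage changes are easier to communicate to stakeholders than raw odds ratios, this small transformation (a convenience, not a required step) expresses each coefficient that way:

## express odds ratios as percentage changes in the odds of higher performance
round((exp(model$coefficients) - 1) * 100, 1)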
While this is interesting, it's important to determine whether such differences in odds are statistically meaningful. Every subgroup shows some differences from other groups, and we need to establish whether those differences are large enough to suggest that they are not simply random variation, and that a genuine underlying relationship may be at work.
We do this by looking at the p-value of each input variable. So let’s construct a table that contains both the odds ratios and the p-values, to allow us to better observe the importance of each variable:
## transform model summary: append odds ratios, and approximate p-values
## from the t-values using a normal distribution
ctable <- coef(summary(model))
odds_ratio <- exp(ctable[, "Value"])
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
coef_summary <- cbind(ctable, odds_ratio, "p value" = p)

## display, dropping the two intercept rows
knitr::kable(coef_summary[1:(nrow(coef_summary) - 2), ])
| Value | Std. Error | t value | odds_ratio | p value | |
|---|---|---|---|---|---|
| yrs_employed | -1.1988044 | 0.2118695 | -5.6582212 | 0.3015545 | 0.0000000 |
| manager_hireY | -1.5346784 | 0.4920017 | -3.1192540 | 0.2155250 | 0.0018131 |
| test_score | 0.0034794 | 0.0011753 | 2.9604005 | 1.0034854 | 0.0030724 |
| group_size | 0.0960811 | 0.0347292 | 2.7665783 | 1.1008483 | 0.0056648 |
| concern_flagY | -0.1034074 | 0.3096568 | -0.3339421 | 0.9017595 | 0.7384232 |
| mobile_flagY | -0.1115253 | 0.2216773 | -0.5030975 | 0.8944687 | 0.6148957 |
| customers | -0.0050496 | 0.0212884 | -0.2372020 | 0.9949631 | 0.8125001 |
| high_hours_flagY | 0.5606510 | 0.2033308 | 2.7573345 | 1.7518126 | 0.0058275 |
| transfers | -0.2364195 | 0.0840426 | -2.8130919 | 0.7894494 | 0.0049068 |
| cityNew York | 0.4160109 | 0.3512993 | 1.1842066 | 1.5159025 | 0.2363313 |
| cityChicago | 0.1795650 | 0.3985697 | 0.4505235 | 1.1966967 | 0.6523330 |
| cityOrlando | 0.3886072 | 0.5308373 | 0.7320646 | 1.4749251 | 0.4641291 |
| cityToronto | 0.4089041 | 0.3346954 | 1.2217200 | 1.5051673 | 0.2218135 |
| cityHouston | 0.3098283 | 0.5118803 | 0.6052748 | 1.3631910 | 0.5449964 |
To be confident that an input variable may be reflecting a non-random phenomenon, we are looking for small p-values (usually less than 0.05). We see a few here - in particular: yrs_employed, manager_hireY, test_score, group_size, high_hours_flagY and transfers.
Ideally, we don't want to complicate our model by including variables that are not statistically important: we want a lean model that we can explain more easily. One option is to manually remove the variables with high p-values from the preliminary model and then run it again without them, as sketched below.
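A manual refit might look like this (model_manual is a name invented for this sketch), keeping only the variables that showed small p-values above and checking the information loss against the full model:

## manual alternative: refit using only the low p-value variables
model_manual <- MASS::polr(performance_group ~ yrs_employed + manager_hire
                           + test_score + group_size + high_hours_flag
                           + transfers, data = data, Hess = TRUE)

## compare fit: lower AIC indicates less information loss
AIC(model, model_manual)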
A more systematic approach is to use the stepAIC() function in the MASS package. This function iterates through the model, removing variables that can be dropped without significant information loss. Let's do that now, and then repeat the transformation we performed above on our simpler model:
## simple model
model_simple <- MASS::stepAIC(model)
## Start: AIC=933.29
## performance_group ~ yrs_employed + manager_hire + test_score +
## group_size + concern_flag + mobile_flag + customers + high_hours_flag +
## transfers + city
##
## Df AIC
## - city 5 925.26
## - customers 1 931.34
## - concern_flag 1 931.40
## - mobile_flag 1 931.54
## <none> 933.29
## - high_hours_flag 1 939.00
## - group_size 1 939.00
## - transfers 1 939.25
## - test_score 1 940.13
## - manager_hire 1 941.20
## - yrs_employed 1 964.82
##
## Step: AIC=925.26
## performance_group ~ yrs_employed + manager_hire + test_score +
## group_size + concern_flag + mobile_flag + customers + high_hours_flag +
## transfers
##
## Df AIC
## - customers 1 923.30
## - concern_flag 1 923.36
## - mobile_flag 1 924.07
## <none> 925.26
## - high_hours_flag 1 931.07
## - transfers 1 931.34
## - test_score 1 932.14
## - group_size 1 932.42
## - manager_hire 1 933.55
## - yrs_employed 1 958.28
##
## Step: AIC=923.3
## performance_group ~ yrs_employed + manager_hire + test_score +
## group_size + concern_flag + mobile_flag + high_hours_flag +
## transfers
##
## Df AIC
## - concern_flag 1 921.40
## - mobile_flag 1 922.10
## <none> 923.30
## - high_hours_flag 1 929.71
## - transfers 1 929.74
## - test_score 1 930.33
## - group_size 1 930.81
## - manager_hire 1 931.74
## - yrs_employed 1 956.37
##
## Step: AIC=921.4
## performance_group ~ yrs_employed + manager_hire + test_score +
## group_size + mobile_flag + high_hours_flag + transfers
##
## Df AIC
## - mobile_flag 1 920.19
## <none> 921.40
## - high_hours_flag 1 927.88
## - transfers 1 928.03
## - test_score 1 928.34
## - group_size 1 928.88
## - manager_hire 1 930.45
## - yrs_employed 1 958.59
##
## Step: AIC=920.19
## performance_group ~ yrs_employed + manager_hire + test_score +
## group_size + high_hours_flag + transfers
##
## Df AIC
## <none> 920.19
## - high_hours_flag 1 926.63
## - test_score 1 927.24
## - transfers 1 927.40
## - group_size 1 928.24
## - manager_hire 1 929.11
## - yrs_employed 1 959.09
## transform the simple model summary in the same way as before
ctable <- coef(summary(model_simple))
odds_ratio <- exp(ctable[, "Value"])
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
coef_summary <- cbind(ctable, odds_ratio, "p value" = p)

knitr::kable(coef_summary[1:(nrow(coef_summary) - 2), ])
| Value | Std. Error | t value | odds_ratio | p value | |
|---|---|---|---|---|---|
| yrs_employed | -1.2550181 | 0.2011307 | -6.239815 | 0.2850707 | 0.0000000 |
| manager_hireY | -1.5694761 | 0.4807575 | -3.264590 | 0.2081542 | 0.0010962 |
| test_score | 0.0034804 | 0.0011621 | 2.994847 | 1.0034865 | 0.0027458 |
| group_size | 0.1024320 | 0.0324642 | 3.155229 | 1.1078619 | 0.0016037 |
| high_hours_flagY | 0.5496504 | 0.1908320 | 2.880285 | 1.7326472 | 0.0039732 |
| transfers | -0.2453308 | 0.0811026 | -3.024943 | 0.7824457 | 0.0024868 |
So, after stepAIC() has worked its magic, we can conclude that there are six meaningful variables that may be playing a role in manager performance: yrs_employed, manager_hire, test_score, group_size, high_hours_flag and transfers.
Most models provide some insight from the data, but it is important not to overstate that insight. In particular, we need some measure of the overall strength of the model. We do not want our stakeholders to believe that the six variables above explain everything to do with manager performance.
The overall fit of a logistic regression model can be determined using the Pseudo-R2 metric - an attempt to approximate the equivalent of the R2 metric from linear regression models. Numerous variants of Pseudo-R2 exist, and the DescTools package has a handy function to calculate them all:
library(DescTools)
r2 <- DescTools::PseudoR2(model_simple, which = "all")
knitr::kable(r2)
| x | |
|---|---|
| McFadden | 0.0799904 |
| CoxSnell | 0.1286216 |
| Nagelkerke | 0.1566365 |
| AldrichNelson | NA |
| VeallZimmermann | NA |
| Efron | NA |
| McKelveyZavoina | NA |
| Tjur | NA |
| AIC | 920.1867132 |
| BIC | 954.9658269 |
| logLik | -452.0933566 |
| logLik0 | -491.4007078 |
| G2 | 78.6147024 |
The function returns the three Pseudo-R2 variants that are available for ordinal models. They range from roughly 0.08 to 0.16, indicating that the model can explain only a relatively small proportion of manager performance. This is common in a Human Capital context: individual performance is driven by many factors, most of which are not captured in a dataset like this one.
It is always important to lead and close any discussion with a reminder of the overall model fit. This is a way of ensuring that any conclusions are understood in context and not over-interpreted.
The results of our ordinal regression model indicate that this data may explain a small proportion of manager performance. A large amount of the manager performance equation remains unexplained.
In particular, the following factors may be related to the performance of managers:

1. Managers who have been employed longer are associated with lower performance (odds ratio 0.29 per additional year employed).
2. Managers who were hired directly into the role, rather than promoted into it internally, are associated with lower performance (odds ratio 0.21).
3. Higher test scores are associated with very slightly higher performance (odds ratio 1.003 per test point).
4. Managers of larger groups are associated with higher performance (odds ratio 1.11 per additional group member).
5. Managers flagged as working high hours are associated with higher performance (odds ratio 1.73).
6. More transfer requests from a manager's group are associated with lower performance (odds ratio 0.78 per transfer request).
The results, combined with our observations about the data in an earlier section, can facilitate a rich discussion about how to interpret the model output. Here are a few observations of interest:
Conclusions 1 and 2 appear contradictory. However, we observed that only a very small number of managers were hired directly into the role, so it is entirely possible for both conclusions to be true. Despite the small numbers, the model determined the difference in performance for these direct hires to be large enough to be statistically significant. Nevertheless, some care should be taken not to over-interpret a result based on such a small population.
Models will often reveal conclusions that might be inconsistent with stated organizational values and culture. In this case, the observation about hours worked is likely to be problematic to stakeholders.
It is interesting that the number of transfer requests is detected as a significant variable, but whether or not a concern was raised about a manager has not been detected as significant. This could suggest a discussion around complaint processes in the organization and whether they are generating useful information about manager performance. Genuinely troubled direct reports may be more likely to vote with their feet than to participate in the complaints process. It also might call into question precisely how managers are evaluated and whether the evaluation rating is a valid indicator of genuine performance (this can of worms is out of scope here, thank goodness).
I referenced above that a proportional odds check should be performed in order to validly use an ordinal logistic regression methodology. Here are some further details and an example.
One of the mathematical assumptions behind the methods we have used here is that the explanatory effect of any input variable is consistent across the various levels of the outcome variable. In this case, there is an assumption that any input variable is similarly important in explaining the difference between a Bottom and Middle performer as it is in explaining the difference between a Middle and Top performer. This is known as the proportional odds assumption or the parallel regression assumption.
Of course, it is not guaranteed that this assumption is true. A good analytics professional is thorough and should check this assumption before generating conclusions from their work. They don’t need to explain this to their stakeholders - it’s more a test of whether they can be confident in their conclusions.
library(brant)
brant::brant(model_simple)
## ----------------------------------------------------
## Test for X2 df probability
## ----------------------------------------------------
## Omnibus 4.87 6 0.56
## yrs_employed 0.02 1 0.88
## manager_hireY 0.79 1 0.37
## test_score 0 1 0.95
## group_size 0.21 1 0.65
## high_hours_flagY 2.71 1 0.1
## transfers 0.51 1 0.47
## ----------------------------------------------------
##
## H0: Parallel Regression Assumption holds
## X2 df probability
## Omnibus 4.874474640 6 0.5600107
## yrs_employed 0.024181454 1 0.8764241
## manager_hireY 0.787454164 1 0.3748710
## test_score 0.004439927 1 0.9468741
## group_size 0.206083768 1 0.6498544
## high_hours_flagY 2.705539664 1 0.1000002
## transfers 0.512060323 1 0.4742483
The brant package has very quickly confirmed that we can be confident in our model: the p-values are all above 0.05, so we cannot reject the null hypothesis that the parallel regression (proportional odds) assumption holds.