In November of last year I released a fictional dataset and analytic exercise here. The exercise was designed to encourage people to take an evidence-based approach to a specific business problem, with two main aims:
1. Decide on an approach and execute an analysis of the dataset to answer a specific question: which variables in the dataset may have a relationship with manager performance and, if possible, how strong is that relationship?
2. (Optional) Present the results to a role-played set of stakeholders, in order to practice delivering complex content and defending an evidence-based approach in front of stakeholders with strong pre-existing concerns and biases.
In this document I will provide a sample approach to Objective 1 - how to conduct this analysis using an evidence-based approach. The phrase ‘evidence-based’ implies that care will be taken to ensure that any conclusions from the analysis are appropriately supported by statistical evidence, that widely-accepted statistical standards are applied in the workflow, and that limitations are clearly articulated. There is no single, undisputedly correct approach to this analysis, so this should not be misinterpreted as a ‘textbook solution’. However, I believe it reflects an approach that a large number of professional statisticians would opt for.
This is a practical document, and I will not spend a lot of time diving into the underlying theory of my methodology - there are plenty of academic textbooks out there for those who are interested in that. However, I will explain my approach as I proceed, and touch on elements of theory as necessary to support it.
My approach uses the R statistical programming language, which is my analytic language of choice.
We have been provided with a dataset with a number of data columns. Each row is an observation of a specific manager in the company. Our stakeholder is interested in whether we can glean any information from this dataset about potential factors affecting manager performance. We have been provided with a description of the data fields, which is useful. Let’s pull the dataset and take a quick look at the data.
library(dplyr)

## pull the RData file direct from GitHub and load it into a dataframe
## (mode = "wb" ensures a binary-safe download on Windows; despite the
## .RData extension the file is a single serialized object, so readRDS() applies)
download.file("https://github.com/keithmcnulty/ebp_exercise/raw/master/data.RData",
              destfile = "data.RData", mode = "wb")
data <- readRDS("data.RData")

## briefly look at the first few rows
knitr::kable(data %>% head(5))
| employee_id | performance_group | yrs_employed | manager_hire | test_score | group_size | concern_flag | mobile_flag | customers | high_hours_flag | transfers | reduced_schedule | city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c4578853 | Bottom | 4.6 | N | 205 | 10 | N | N | 12 | N | 0 | Y | San Francisco |
| a7d7afd6 | Middle | 5.3 | N | 227 | 14 | N | Y | 18 | N | 0 | N | New York |
| 272b93f1 | Bottom | 5.2 | N | 227 | 10 | N | N | 12 | N | 0 | Y | Chicago |
| be8b6baa | Middle | 4.9 | N | 273 | 19 | N | N | 26 | Y | 0 | N | New York |
| a18ecc4e | Bottom | 4.9 | N | 227 | 17 | Y | N | 26 | Y | 5 | Y | Orlando |
So, it looks like each row has an Employee ID (which is a unique identifier and not relevant to this analysis), a performance rating (which is clearly our outcome/dependent variable here), and 11 other datapoints which may or may not have a relationship with the outcome variable.
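As a quick sanity check (cheap to do, and it protects the analysis against duplicate rows), we can confirm that employee_id really is a one-row-per-manager identifier:

## confirm employee_id uniquely identifies each row
dplyr::n_distinct(data$employee_id) == nrow(data)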
Let’s take a look at how the data is structured:
## look at data structure
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 571 obs. of 13 variables:
## $ employee_id : chr "c4578853" "a7d7afd6" "272b93f1" "be8b6baa" ...
## $ performance_group: Ord.factor w/ 3 levels "Bottom"<"Middle"<..: 1 2 1 2 1 2 2 2 2 2 ...
## $ yrs_employed : num 4.6 5.3 5.2 4.9 4.9 4.3 4.8 5 4.4 4.5 ...
## $ manager_hire : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ test_score : num 205 227 227 273 227 159 250 326 152 326 ...
## $ group_size : num 10 14 10 19 17 10 13 13 10 11 ...
## $ concern_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
## $ mobile_flag : Factor w/ 2 levels "N","Y": 1 2 1 1 1 1 1 1 1 1 ...
## $ customers : num 12 18 12 26 26 18 22 16 20 28 ...
## $ high_hours_flag : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 1 2 2 ...
## $ transfers : num 0 0 0 0 5 0 0 3 0 0 ...
## $ reduced_schedule : chr "Y" "N" "Y" "N" ...
## $ city : Factor w/ 6 levels "San Francisco",..: 1 2 3 2 4 4 2 5 3 3 ...
We see that, because we have imported the data as an RData file, our data is mostly already in the format we need. In particular:
- Our outcome variable is already an ordered factor "Bottom" < "Middle" < "Top", so R understands the ranking here. If this were not the case, it could easily be fixed using data$performance_group <- ordered(data$performance_group, levels = c("Bottom", "Middle", "Top")).
- Most other variables are either numeric or binary factors, except that reduced_schedule should be a binary factor, not a character. Even though R will coerce characters to factors automatically inside most modeling functions, let's do the conversion ourselves to be thorough.
- The city variable is an unordered factor, which is appropriate for this analysis.
## convert reduced_schedule to factor
data$reduced_schedule <- factor(data$reduced_schedule)
str(data$reduced_schedule)
## Factor w/ 2 levels "N","Y": 2 1 2 1 2 2 1 1 2 2 ...
Let’s look at the basic statistical properties of the data:
## quick summary
summary(data)
## employee_id performance_group yrs_employed manager_hire
## Length:571 Bottom:129 Min. :2.000 N:542
## Class :character Middle:376 1st Qu.:4.300 Y: 29
## Mode :character Top : 66 Median :4.600
## Mean :4.596
## 3rd Qu.:5.000
## Max. :6.000
## test_score group_size concern_flag mobile_flag customers
## Min. : 0.0 Min. : 5.00 N:514 N:433 Min. :10.00
## 1st Qu.:182.0 1st Qu.:10.00 Y: 57 Y:138 1st Qu.:18.00
## Median :235.0 Median :11.00 Median :20.00
## Mean :240.2 Mean :11.82 Mean :21.06
## 3rd Qu.:295.0 3rd Qu.:13.00 3rd Qu.:24.00
## Max. :500.0 Max. :25.00 Max. :40.00
## high_hours_flag transfers reduced_schedule city
## N:345 Min. :0.0000 N:428 San Francisco: 52
## Y:226 1st Qu.:0.0000 Y:143 New York :196
## Median :0.0000 Chicago : 62
## Mean :0.7968 Orlando : 23
## 3rd Qu.:2.0000 Toronto :213
## Max. :5.0000 Houston : 25
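Before commenting on these, it is worth quantifying the distribution of the outcome variable explicitly. Using the counts from the summary above:

## proportion of managers in each performance group
round(prop.table(table(data$performance_group)), 3)
## Bottom Middle    Top 
##  0.226  0.658  0.116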
Here are a few early observations we can make about this data - some of these observations will play an important role in how we will formulate conclusions from this analysis:
performance_group is quite clumped in the middle, with over 60% of individuals regarded as "Middle Performers". This is quite common in Human Capital contexts, but it can limit the ability of any model to differentiate performance.

The first thing to consider is the nature of the question being asked. We are not being asked to predict manager performance, but rather to explain it. Therefore we will need an approach that provides visibility into how the input variables relate to the outcome variable and how important each one is. For this type of analysis, and noting that our dataset is not particularly large, most statisticians would choose a regression approach.
There are various options for regression approaches, the most common being linear regression (for continuous outcome variables), binary logistic regression (for binary outcomes), ordinal logistic regression (for ordered categorical outcomes) and multinomial logistic regression (for unordered categorical outcomes).
In our case, our outcome variable (performance_group) is an ordered factor with three levels, so ordinal logistic regression is the most appropriate approach. (Binary logistic regression could also be used by collapsing the outcome variable to "Bottom" vs "Not Bottom" or "Top" vs "Not Top", but this discards information and is unlikely to provide as much value as the ordinal approach.)
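For illustration only, here is roughly what that binary alternative could look like. This is a sketch, not part of the main analysis; top_flag, binary_data and binary_model are names invented for the example, and the two predictors are arbitrary choices.

## sketch only: collapse the outcome to "Top" vs "Not Top" and fit a binary model
binary_data <- data %>%
  dplyr::mutate(top_flag = factor(ifelse(performance_group == "Top", "Y", "N")))
binary_model <- glm(top_flag ~ yrs_employed + group_size,
                    family = binomial, data = binary_data)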
An ordinal logistic regression model estimates the effect each input variable has on the odds of an individual being in a higher outcome (performance) level. This effect is estimated for each variable with all other variables held constant. The results are expressed in terms of odds ratios: the multiplicative change in the odds of higher performance for a one-unit change in the input variable.
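As a toy illustration (beta here is a made-up coefficient, not one from our data), converting a log-odds coefficient to an odds ratio is a single exponential:

## toy example: converting a log-odds coefficient to an odds ratio
beta <- 0.1
exp(beta)              ## ~1.105: the odds multiply by about 1.105 per unit increase
(exp(beta) - 1) * 100  ## ~10.5: equivalently, about a 10.5% increase in the odds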
It’s relatively easy to run an Ordinal Logistic Regression model in R. However, there are two important considerations before doing so:
If any of the input variables are already highly correlated with each other, this can substantially distort the results. We will look at this momentarily.
A test of the proportional odds/parallel regression assumption must be conducted in order to establish that the data is suitable for an ordinal logistic regression approach. I will not go into details on this here, however I will include the test of proportional odds in an appendix to this document (it’s fairly easy to do).
We will do a simple correlation plot to see if any of the input variables are substantially correlated with each other, to ensure that we address point 1 above.
library(corrplot)

## create input dataframe by converting all input cols except city to numeric
inputs <- data %>%
  dplyr::select(-employee_id, -performance_group, -city) %>%
  dplyr::mutate_if(is.factor, as.numeric)

## make correlation matrix
M <- cor(inputs)

## visualize
corrplot::corrplot(M, method = "circle")
None of these correlations appears particularly strong given the size of the dataset. The most notable is that manager_hire is somewhat negatively correlated with yrs_employed, which makes sense: managers hired directly into the position, rather than promoted into it internally, are likely to have been hired more recently. Based on the correlation matrix, it seems reasonable to proceed to run our model.
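If we want to go beyond eyeballing the plot, a quick convenience sketch for ranking the pairwise correlations numerically is:

## flatten the correlation matrix and list the strongest pairs
as.data.frame(as.table(M)) %>%
  dplyr::filter(as.integer(Var1) < as.integer(Var2)) %>%  ## keep each pair once
  dplyr::arrange(desc(abs(Freq))) %>%
  head(5)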
We start by running the basic model relating all the input variables to the single outcome variable of performance_group:
library(MASS)
## run model and sight results
(model <- MASS::polr(performance_group ~ yrs_employed + manager_hire + test_score + group_size
+ concern_flag + mobile_flag + customers + high_hours_flag
+ transfers + city, data = data, Hess = TRUE))
## Call:
## MASS::polr(formula = performance_group ~ yrs_employed + manager_hire +
## test_score + group_size + concern_flag + mobile_flag + customers +
## high_hours_flag + transfers + city, data = data, Hess = TRUE)
##
## Coefficients:
## yrs_employed manager_hireY test_score group_size
## -1.198804408 -1.534678427 0.003479351 0.096081098
## concern_flagY mobile_flagY customers high_hours_flagY
## -0.103407442 -0.111525321 -0.005049641 0.560651040
## transfers cityNew York cityChicago cityOrlando
## -0.236419495 0.416010949 0.179565007 0.388607219
## cityToronto cityHouston
## 0.408904061 0.309828281
##
## Intercepts:
## Bottom|Middle Middle|Top
## -4.791495 -1.104035
##
## Residual Deviance: 901.2862
## AIC: 933.2862
The coefficients will be helpful in understanding the strength of each relationship with performance. In the raw output these coefficients represent log odds, so we will need to transform them before interpreting them, which we will do shortly. The AIC is a measure of information loss and model fit, and will help us decide whether we can safely drop unimportant variables from the model.
First let's transform our model coefficients so they are expressed as odds ratios:
exp(model$coefficients)
## yrs_employed manager_hireY test_score group_size
## 0.3015545 0.2155250 1.0034854 1.1008483
## concern_flagY mobile_flagY customers high_hours_flagY
## 0.9017595 0.8944687 0.9949631 1.7518126
## transfers cityNew York cityChicago cityOrlando
## 0.7894494 1.5159025 1.1966967 1.4749251
## cityToronto cityHouston
## 1.5051673 1.3631910
We now see the relative odds of higher manager performance given a one-unit change in each input variable. For example, each additional member of group_size is associated with roughly a 10% increase in the odds of higher performance (odds ratio 1.10), while each additional year of yrs_employed is associated with roughly a 70% decrease (odds ratio 0.30).
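If percentage changes are easier to communicate to stakeholders than raw odds ratios, this small transformation (a convenience, not a required step) expresses each coefficient that way:

## express odds ratios as percentage changes in the odds of higher performance
round((exp(model$coefficients) - 1) * 100, 1)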
While this is interesting, it's important to determine whether such differences in odds are statistically meaningful. Every subgroup shows some differences from other groups, and we need to establish whether those differences are large enough to suggest that they are not simply random variation, and that a genuine underlying relationship may be at work.
We do this by looking at the p-value of each input variable. So let’s construct a table that contains both the odds ratios and the p-values, to allow us to better observe the importance of each variable:
## transform model summary: append odds ratios, and approximate p-values
## from the t-values using a normal distribution
ctable <- coef(summary(model))
odds_ratio <- exp(ctable[, "Value"])
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
coef_summary <- cbind(ctable, odds_ratio, "p value" = p)

## display, dropping the two intercept rows
knitr::kable(coef_summary[1:(nrow(coef_summary) - 2), ])
| Value | Std. Error | t value | odds_ratio | p value | |
|---|---|---|---|---|---|
| yrs_employed | -1.1988044 | 0.2118695 | -5.6582212 | 0.3015545 | 0.0000000 |
| manager_hireY | -1.5346784 | 0.4920017 | -3.1192540 | 0.2155250 | 0.0018131 |
| test_score | 0.0034794 | 0.0011753 | 2.9604005 | 1.0034854 | 0.0030724 |
| group_size | 0.0960811 | 0.0347292 | 2.7665783 | 1.1008483 | 0.0056648 |
| concern_flagY | -0.1034074 | 0.3096568 | -0.3339421 | 0.9017595 | 0.7384232 |
| mobile_flagY | -0.1115253 | 0.2216773 | -0.5030975 | 0.8944687 | 0.6148957 |
| customers | -0.0050496 | 0.0212884 | -0.2372020 | 0.9949631 | 0.8125001 |
| high_hours_flagY | 0.5606510 | 0.2033308 | 2.7573345 | 1.7518126 | 0.0058275 |
| transfers | -0.2364195 | 0.0840426 | -2.8130919 | 0.7894494 | 0.0049068 |
| cityNew York | 0.4160109 | 0.3512993 | 1.1842066 | 1.5159025 | 0.2363313 |
| cityChicago | 0.1795650 | 0.3985697 | 0.4505235 | 1.1966967 | 0.6523330 |
| cityOrlando | 0.3886072 | 0.5308373 | 0.7320646 | 1.4749251 | 0.4641291 |
| cityToronto | 0.4089041 | 0.3346954 | 1.2217200 | 1.5051673 | 0.2218135 |
| cityHouston | 0.3098283 | 0.5118803 | 0.6052748 | 1.3631910 | 0.5449964 |
To be confident that an input variable may be reflecting a non-random phenomenon, we are looking for small p-values (usually less than 0.05). We see a few here - in particular: yrs_employed, manager_hireY, test_score, group_size, high_hours_flagY and transfers.
Ideally, we don't want to complicate our model by including variables that are not statistically important: we want a lean model that we can explain more easily. One option is to manually remove the variables with high p-values from the preliminary model and then run it again without them, as sketched below.
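A manual refit might look like this (model_manual is a name invented for this sketch), keeping only the variables that showed small p-values above and checking the information loss against the full model:

## manual alternative: refit using only the low p-value variables
model_manual <- MASS::polr(performance_group ~ yrs_employed + manager_hire
                           + test_score + group_size + high_hours_flag
                           + transfers, data = data, Hess = TRUE)

## compare fit: lower AIC indicates less information loss
AIC(model, model_manual)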
A more systematic approach is to use the stepAIC() function in the MASS package. This function iterates through the model, removing variables that can be dropped without significant information loss. Let's do that now, and then repeat the transformation we performed above on our simpler model:
## simple model
model_simple <- MASS::stepAIC(model)
## Start: AIC=933.29
## performance_group ~ yrs_employed + manager_hire + test_score +
## group_size + concern_flag + mobile_flag + customers + high_hours_flag +
## transfers + city
##
## Df AIC
## - city 5 925.26
## - customers 1 931.34
## - concern_flag 1 931.40
## - mobile_flag 1 931.54
## <none> 933.29
## - high_hours_flag 1 939.00
## - group_size 1 939.00
## - transfers 1 939.25
## - test_score 1 940.13
## - manager_hire 1 941.20
## - yrs_employed 1 964.82
##
## Step: AIC=925.26
## performance_group ~ yrs_employed + manager_hire + test_score +
## group_size + concern_flag + mobile_flag + customers + high_hours_flag +
## transfers
##
## Df AIC
## - customers 1 923.30
## - concern_flag 1 923.36
## - mobile_flag 1 924.07
## <none> 925.26
## - high_hours_flag 1 931.07
## - transfers 1 931.34
## - test_score 1 932.14
## - group_size 1 932.42
## - manager_hire 1 933.55
## - yrs_employed 1 958.28
##
## Step: AIC=923.3
## performance_group ~ yrs_employed + manager_hire + test_score +
## group_size + concern_flag + mobile_flag + high_hours_flag +
## transfers
##
## Df AIC
## - concern_flag 1 921.40
## - mobile_flag 1 922.10
## <none> 923.30
## - high_hours_flag 1 929.71
## - transfers 1 929.74
## - test_score 1 930.33
## - group_size 1 930.81
## - manager_hire 1 931.74
## - yrs_employed 1 956.37
##
## Step: AIC=921.4
## performance_group ~ yrs_employed + manager_hire + test_score +
## group_size + mobile_flag + high_hours_flag + transfers
##
## Df AIC
## - mobile_flag 1 920.19
## <none> 921.40
## - high_hours_flag 1 927.88
## - transfers 1 928.03
## - test_score 1 928.34
## - group_size 1 928.88
## - manager_hire 1 930.45
## - yrs_employed 1 958.59
##
## Step: AIC=920.19
## performance_group ~ yrs_employed + manager_hire + test_score +
## group_size + high_hours_flag + transfers
##
## Df AIC
## <none> 920.19
## - high_hours_flag 1 926.63
## - test_score 1 927.24
## - transfers 1 927.40
## - group_size 1 928.24
## - manager_hire 1 929.11
## - yrs_employed 1 959.09
## transform the simple model summary in the same way as before
ctable <- coef(summary(model_simple))
odds_ratio <- exp(ctable[, "Value"])
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
coef_summary <- cbind(ctable, odds_ratio, "p value" = p)

knitr::kable(coef_summary[1:(nrow(coef_summary) - 2), ])
| Value | Std. Error | t value | odds_ratio | p value | |
|---|---|---|---|---|---|
| yrs_employed | -1.2550181 | 0.2011307 | -6.239815 | 0.2850707 | 0.0000000 |
| manager_hireY | -1.5694761 | 0.4807575 | -3.264590 | 0.2081542 | 0.0010962 |
| test_score | 0.0034804 | 0.0011621 | 2.994847 | 1.0034865 | 0.0027458 |
| group_size | 0.1024320 | 0.0324642 | 3.155229 | 1.1078619 | 0.0016037 |
| high_hours_flagY | 0.5496504 | 0.1908320 | 2.880285 | 1.7326472 | 0.0039732 |
| transfers | -0.2453308 | 0.0811026 | -3.024943 | 0.7824457 | 0.0024868 |
So, after stepAIC() has worked its magic, we can conclude that there are six meaningful variables that may be playing a role in manager performance: yrs_employed, manager_hire, test_score, group_size, high_hours_flag and transfers.
Most models provide some insight from the data, but it is important not to overstate that insight. In particular, we need some measure of the overall strength of the model. We do not want our stakeholders to believe that the six variables above explain everything to do with manager performance.
The overall fit of a logistic regression model can be determined using the Pseudo-R2 metric - an attempt to approximate the equivalent of the R2 metric from linear regression models. Numerous variants of Pseudo-R2 exist, and the DescTools package has a handy function to calculate them all:
library(DescTools)
r2 <- DescTools::PseudoR2(model_simple, which = "all")
knitr::kable(r2)
| x | |
|---|---|
| McFadden | 0.0799904 |
| CoxSnell | 0.1286216 |
| Nagelkerke | 0.1566365 |
| AldrichNelson | NA |
| VeallZimmermann | NA |
| Efron | NA |
| McKelveyZavoina | NA |
| Tjur | NA |
| AIC | 920.1867132 |
| BIC | 954.9658269 |
| logLik | -452.0933566 |
| logLik0 | -491.4007078 |
| G2 | 78.6147024 |
The function returns the three Pseudo-R2 variants that are available for ordinal models. They range from roughly 0.08 to 0.16, indicating that the model can explain only a relatively small proportion of manager performance. This is common in a Human Capital context: individual performance is driven by many factors, most of which are not captured in a dataset like this one.
It is always important to lead and close any discussion with a reminder of the overall model fit. This is a way of ensuring that any conclusions are understood in context and not over-interpreted.
The results of our ordinal regression model indicate that this data may explain a small proportion of manager performance. A large amount of the manager performance equation remains unexplained.
In particular, the following factors may be related to the performance of managers:

1. Managers who have been employed longer are associated with lower performance (odds ratio 0.29 per additional year employed).
2. Managers who were hired directly into the role, rather than promoted into it internally, are associated with lower performance (odds ratio 0.21).
3. Higher test scores are associated with very slightly higher performance (odds ratio 1.003 per test point).
4. Managers of larger groups are associated with higher performance (odds ratio 1.11 per additional group member).
5. Managers flagged as working high hours are associated with higher performance (odds ratio 1.73).
6. More transfer requests from a manager's group are associated with lower performance (odds ratio 0.78 per transfer request).
The results, combined with our observations about the data in an earlier section, can facilitate a rich discussion about how to interpret the model output. Here are a few observations of interest:
Conclusions 1 and 2 appear contradictory. However, we observed that only a very small number of managers were hired directly into the role, so it is entirely possible for both conclusions to be true. Despite the small numbers, the model determined the difference in performance for these direct hires to be large enough to be statistically significant. Nevertheless, some care should be taken not to over-interpret a result based on such a small population.
Models will often reveal conclusions that might be inconsistent with stated organizational values and culture. In this case, the observation about hours worked is likely to be problematic to stakeholders.
It is interesting that the number of transfer requests is detected as a significant variable, but whether or not a concern was raised about a manager has not been detected as significant. This could suggest a discussion around complaint processes in the organization and whether they are generating useful information about manager performance. Genuinely troubled direct reports may be more likely to vote with their feet than to participate in the complaints process. It also might call into question precisely how managers are evaluated and whether the evaluation rating is a valid indicator of genuine performance (this can of worms is out of scope here, thank goodness).
I referenced above that a proportional odds check should be performed in order to validly use an ordinal logistic regression methodology. Here are some further details and an example.
One of the mathematical assumptions behind the methods we have used here is that the explanatory effect of any input variable is consistent across the various levels of the outcome variable. In this case, there is an assumption that any input variable is similarly important in explaining the difference between a Bottom and Middle performer as it is in explaining the difference between a Middle and Top performer. This is known as the proportional odds assumption or the parallel regression assumption.
Of course, it is not guaranteed that this assumption is true. A good analytics professional is thorough and should check this assumption before generating conclusions from their work. They don’t need to explain this to their stakeholders - it’s more a test of whether they can be confident in their conclusions.
library(brant)
brant::brant(model_simple)
## ----------------------------------------------------
## Test for X2 df probability
## ----------------------------------------------------
## Omnibus 4.87 6 0.56
## yrs_employed 0.02 1 0.88
## manager_hireY 0.79 1 0.37
## test_score 0 1 0.95
## group_size 0.21 1 0.65
## high_hours_flagY 2.71 1 0.1
## transfers 0.51 1 0.47
## ----------------------------------------------------
##
## H0: Parallel Regression Assumption holds
## X2 df probability
## Omnibus 4.874474640 6 0.5600107
## yrs_employed 0.024181454 1 0.8764241
## manager_hireY 0.787454164 1 0.3748710
## test_score 0.004439927 1 0.9468741
## group_size 0.206083768 1 0.6498544
## high_hours_flagY 2.705539664 1 0.1000002
## transfers 0.512060323 1 0.4742483
The brant package has very quickly confirmed that we can be confident in our model: the p-values are all above 0.05, so we cannot reject the null hypothesis that the parallel regression (proportional odds) assumption holds.