In a large service organisation producing nearly 48,000 working days per month management suspected that sickness absence was increasing.
Figure 1: Line chart of sickness absence
However, from looking at the line graph it is clear that sickness absence varies greatly with the season and that any change will be difficult if not impossible to identify by eye.
I might be tempting to add a regression line to the graph and to calculate the p-value for the null hypothesis that the coefficient equals zero. This would be very, very wrong. I quote myself from a previous post:
Never use trend lines to show change over time. A trend line suggests a linear relation between the outcome and time, which is an extremely rare situation.
Instead, always begin your analysis of time series data with a run chart. Read my previous post on run charts with R if you have not done that already.
Data for Figure 1 come from this data frame, d
. The first column is the month. The next two columns, absence
and norm
, are the total hours of sickness absence and the expected working hours respectively for that month. Thus, Figure 1 plots absence hours divided by norm hours multiplied by 100 percent.
# R version 3.2.1
# Print the first 10 rows of the raw data
d
## Source: local data frame [40 x 3]
##
## month absence norm
## 1 2012-01-01 54004 1106831
## 2 2012-02-01 50795 1059873
## 3 2012-03-01 50637 1109634
## 4 2012-04-01 36111 1061485
## 5 2012-05-01 37239 1166583
## 6 2012-06-01 41823 1065073
## 7 2012-07-01 31922 1123217
## 8 2012-08-01 37042 1187736
## 9 2012-09-01 44971 1032627
## 10 2012-10-01 54967 1194396
## .. ... ... ...
A run chart plots data versus time and performs tests for non-random variation.
# Load the qicharts package
library(qicharts)
# Plot the dots
qic(y = absence, # Numerator
n = norm, # Denominator
x = month, # X axis values
data = d, # Data frame containing y, n, and x
multiply = 100, # Multiply indicator by 100 percent
runvals = TRUE, # Display run chart analysis
main = 'Sickness Absence', # Chart title
ylab = 'Percent', # Y axis label
xlab = 'Month') # X axis label
Figure 2: Run chart of sickness absence
The run chart analysis clearly shows that data contains non-random variation since the graph crosses the median fewer times than expected (11 vs. 14). However, it is obvious that this cannot (solely) be ascribed to persistent changes in the overall level of sickness absence, rather than the seasonal pattern we already discovered. This is an example of data that are auto correlated, i.e. if one data point is high the next data point will probably also be high and vice versa.
Many methods for identifying change in seasonal data are available including but not limited to Arima models and dynamic linear models. As an ordinary physician, most of these methods are far beyond my grasping capacity. In this post I will demonstrate a very simple approach to identifying change in seasonal data.
The general idea is to estimate the expected value of sickness absence for each data point and then do the run chart analysis on the ratio of observed to expected values.
First we need to calculate the expected sickness absence for each month of the year. This could be done in several ways using more or less complicated approaches. I go for simple by calculating the average value for each month (sum(absence) / sum(norm)
.
The data frame now contains two new variables: p.obs
, the observed sickness absence (absence
/ norm
), and p.exp
, the expected sickness absence.
## Source: local data frame [40 x 5]
##
## month absence norm p.obs p.exp
## 1 2012-01-01 54004 1106831 0.04879154 0.05048592
## 2 2012-02-01 50795 1059873 0.04792554 0.05105568
## 3 2012-03-01 50637 1109634 0.04563395 0.04637784
## 4 2012-04-01 36111 1061485 0.03401934 0.03595128
## 5 2012-05-01 37239 1166583 0.03192143 0.03403684
## 6 2012-06-01 41823 1065073 0.03926772 0.03805956
## 7 2012-07-01 31922 1123217 0.02842016 0.03024238
## 8 2012-08-01 37042 1187736 0.03118706 0.03314012
## 9 2012-09-01 44971 1032627 0.04355009 0.04559338
## 10 2012-10-01 54967 1194396 0.04602074 0.04664736
## .. ... ... ... ... ...
The full data set is available in Appendix 1. If you are interested in the R code used to calculate the expected values, see Appendix 2.
Now for the run chart of the “normalised” data.
# Plot observed vs. expected
qic(p.obs, p.exp, month,
data = d,
runvals = TRUE,
main = 'Observed vs. expected sickness absence',
ylab = 'Ratio',
xlab = 'Month')
Figure 3: Run chart of observed vs. expected sickness absence
Figure 3 reveals the answer to our question: Sickness absence has indeed increased. Non-random variation is signalled by an unusually long run and too few crossings. And this time this is attributable solely to a shift in data since the auto correlation has been removed from data.
From a visual inspection it seem that the shift appeared in March 2014 and has been sustained. If we then split the run chart, we get two separate periods each showing only random variation.
# Add break at data point number 26
qic(p.obs, p.exp, month,
data = d,
runvals = TRUE,
decimals = 2, # Number of decimals (to prevent rounding median to one digit)
breaks = 26,
main = 'Observed vs. expected sickness absence',
ylab = 'Ratio',
xlab = 'Month')
Figure 4: Run chart with split
I have not been able to find a good explanation for the shift in March 2014. It seems curious that sickness absence should jump 9% from one month to the next, and I suspect that the shift may have other reasons than the staff suddenly getting more ill. Maybe there have been changes to the definitions regarding sickness or in source data. If ever, this is a good time to repeat the old saying “further studies are needed 😉”.
A logical first step in further analyses would be to stratify data by department to see if the shift is universal or local. I will leave this analysis to the organisation itself.
Now, there is another thing that strikes my eyes. The data point for January 2012 seems out of tune with the rest. The question is whether this represents something special or it is simply random. To answer this question, we consult a control chart.
# Plot I chart of observed vs. expected
qic(p.obs, p.exp, month,
data = d,
chart = 'i',
decimals = 2,
breaks = 26,
main = 'Observed vs. expected sickness absence',
ylab = 'Ratio',
xlab = 'Month')
The control chart identifies January 2012 as a special cause, i.e. this data point was under the influence of forces that is not present the rest of the time.
What happened in January 2012? Influenza. This winter, Denmark was hit hard by influenza.
I rarely use control charts, since run charts often answer my questions. But this an example of the proper use of control charts to identify large, sudden, and transient shift in data as special cause variation.
Want more on control charts? Read my previous Control Charts with qicharts for R.
I could have used the difference rather than the ratio between observed and expected values, which would have reached the same conclusion. The choice between ratio and difference is a matter of what communicates best. I this case, I personally find the ratio of two ratios easier to understand than the difference. In other cases, e.g. hospital infections, I would prefer the difference between observed and expected counts.
In this post I have demonstrated a simple method for analysing seasonal data by transforming data to a ratio (or difference) of observed vs. expected values and using run charts to identify non-random variation.
month | absence | norm | p.obs | p.exp |
---|---|---|---|---|
2012-01-01 | 54004 | 1106831 | 0.0487915 | 0.0504859 |
2012-02-01 | 50795 | 1059873 | 0.0479255 | 0.0510557 |
2012-03-01 | 50637 | 1109634 | 0.0456339 | 0.0463778 |
2012-04-01 | 36111 | 1061485 | 0.0340193 | 0.0359513 |
2012-05-01 | 37239 | 1166583 | 0.0319214 | 0.0340368 |
2012-06-01 | 41823 | 1065073 | 0.0392677 | 0.0380596 |
2012-07-01 | 31922 | 1123217 | 0.0284202 | 0.0302424 |
2012-08-01 | 37042 | 1187736 | 0.0311871 | 0.0331401 |
2012-09-01 | 44971 | 1032627 | 0.0435501 | 0.0455934 |
2012-10-01 | 54967 | 1194396 | 0.0460207 | 0.0466474 |
2012-11-01 | 51842 | 1147503 | 0.0451781 | 0.0480092 |
2012-12-01 | 50368 | 1095622 | 0.0459721 | 0.0445231 |
2013-01-01 | 65709 | 1201327 | 0.0546970 | 0.0504859 |
2013-02-01 | 56415 | 1053997 | 0.0535248 | 0.0510557 |
2013-03-01 | 47516 | 1108551 | 0.0428632 | 0.0463778 |
2013-04-01 | 40138 | 1160804 | 0.0345778 | 0.0359513 |
2013-05-01 | 40250 | 1214971 | 0.0331284 | 0.0340368 |
2013-06-01 | 38379 | 1056066 | 0.0363415 | 0.0380596 |
2013-07-01 | 37190 | 1222926 | 0.0304107 | 0.0302424 |
2013-08-01 | 37675 | 1170681 | 0.0321821 | 0.0331401 |
2013-09-01 | 50991 | 1112242 | 0.0458452 | 0.0455934 |
2013-10-01 | 54502 | 1221477 | 0.0446198 | 0.0466474 |
2013-11-01 | 52163 | 1114760 | 0.0467930 | 0.0480092 |
2013-12-01 | 49062 | 1165244 | 0.0421045 | 0.0445231 |
2014-01-01 | 57344 | 1216026 | 0.0471569 | 0.0504859 |
2014-02-01 | 51228 | 1068984 | 0.0479221 | 0.0510557 |
2014-03-01 | 53301 | 1125682 | 0.0473499 | 0.0463778 |
2014-04-01 | 43962 | 1180144 | 0.0372514 | 0.0359513 |
2014-05-01 | 43629 | 1176885 | 0.0370716 | 0.0340368 |
2014-06-01 | 43211 | 1121489 | 0.0385300 | 0.0380596 |
2014-07-01 | 39282 | 1238033 | 0.0317294 | 0.0302424 |
2014-08-01 | 40892 | 1130074 | 0.0361853 | 0.0331401 |
2014-09-01 | 55477 | 1176644 | 0.0471485 | 0.0455934 |
2014-10-01 | 60505 | 1227934 | 0.0492738 | 0.0466474 |
2014-11-01 | 55809 | 1066558 | 0.0523263 | 0.0480092 |
2014-12-01 | 55789 | 1225391 | 0.0455275 | 0.0445231 |
2015-01-01 | 59853 | 1168411 | 0.0512260 | 0.0504859 |
2015-02-01 | 58501 | 1066212 | 0.0548681 | 0.0510557 |
2015-03-01 | 58187 | 1176416 | 0.0494612 | 0.0463778 |
2015-04-01 | 44345 | 1174763 | 0.0377481 | 0.0359513 |
# Load the dplyr package
library(dplyr)
# Add variables "mo" identifying month as decimal number (01-12) and "p.obs" as the absence rate
d <- d %>%
mutate(mo = format(month, '%m'),
p.obs = absence / norm)
# Calculate the expected (average) sickness absence by month
pred <- d %>%
group_by(mo) %>%
summarise(absence = sum(absence),
norm = sum(norm)) %>%
mutate(p.exp = absence / norm) %>%
select(mo, p.exp)
# Join data frames, select, and reorder columns
d <- d %>%
full_join(pred, by = 'mo') %>%
select(month, absence, norm, p.obs, p.exp)
# Print first 10 rows of data frame with added variables
d