Introduction

In a large service organisation producing nearly 48,000 working days per month management suspected that sickness absence was increasing.

Figure 1: Line chart of sickness absence

Figure 1: Line chart of sickness absence

However, from looking at the line graph it is clear that sickness absence varies greatly with the season and that any change will be difficult if not impossible to identify by eye.

I might be tempting to add a regression line to the graph and to calculate the p-value for the null hypothesis that the coefficient equals zero. This would be very, very wrong. I quote myself from a previous post:

Never use trend lines to show change over time. A trend line suggests a linear relation between the outcome and time, which is an extremely rare situation.

Instead, always begin your analysis of time series data with a run chart. Read my previous post on run charts with R if you have not done that already.

An inconclusive run chart

Data for Figure 1 come from this data frame, d. The first column is the month. The next two columns, absence and norm, are the total hours of sickness absence and the expected working hours respectively for that month. Thus, Figure 1 plots absence hours divided by norm hours multiplied by 100 percent.

# R version 3.2.1
# Print the first 10 rows of the raw data
d
## Source: local data frame [40 x 3]
## 
##         month absence    norm
## 1  2012-01-01   54004 1106831
## 2  2012-02-01   50795 1059873
## 3  2012-03-01   50637 1109634
## 4  2012-04-01   36111 1061485
## 5  2012-05-01   37239 1166583
## 6  2012-06-01   41823 1065073
## 7  2012-07-01   31922 1123217
## 8  2012-08-01   37042 1187736
## 9  2012-09-01   44971 1032627
## 10 2012-10-01   54967 1194396
## ..        ...     ...     ...

A run chart plots data versus time and performs tests for non-random variation.

# Load the qicharts package
library(qicharts)

# Plot the dots
qic(y        = absence,            # Numerator 
    n        = norm,               # Denominator
    x        = month,              # X axis values
    data     = d,                  # Data frame containing y, n, and x
    multiply = 100,                # Multiply indicator by 100 percent
    runvals  = TRUE,               # Display run chart analysis
    main     = 'Sickness Absence', # Chart title
    ylab     = 'Percent',          # Y axis label
    xlab     = 'Month')            # X axis label
Figure 2: Run chart of sickness absence

Figure 2: Run chart of sickness absence

The run chart analysis clearly shows that data contains non-random variation since the graph crosses the median fewer times than expected (11 vs. 14). However, it is obvious that this cannot (solely) be ascribed to persistent changes in the overall level of sickness absence, rather than the seasonal pattern we already discovered. This is an example of data that are auto correlated, i.e. if one data point is high the next data point will probably also be high and vice versa.

Many methods for identifying change in seasonal data are available including but not limited to Arima models and dynamic linear models. As an ordinary physician, most of these methods are far beyond my grasping capacity. In this post I will demonstrate a very simple approach to identifying change in seasonal data.

The observed vs. expected trick

The general idea is to estimate the expected value of sickness absence for each data point and then do the run chart analysis on the ratio of observed to expected values.

First we need to calculate the expected sickness absence for each month of the year. This could be done in several ways using more or less complicated approaches. I go for simple by calculating the average value for each month (sum(absence) / sum(norm).

The data frame now contains two new variables: p.obs, the observed sickness absence (absence / norm), and p.exp, the expected sickness absence.

## Source: local data frame [40 x 5]
## 
##         month absence    norm      p.obs      p.exp
## 1  2012-01-01   54004 1106831 0.04879154 0.05048592
## 2  2012-02-01   50795 1059873 0.04792554 0.05105568
## 3  2012-03-01   50637 1109634 0.04563395 0.04637784
## 4  2012-04-01   36111 1061485 0.03401934 0.03595128
## 5  2012-05-01   37239 1166583 0.03192143 0.03403684
## 6  2012-06-01   41823 1065073 0.03926772 0.03805956
## 7  2012-07-01   31922 1123217 0.02842016 0.03024238
## 8  2012-08-01   37042 1187736 0.03118706 0.03314012
## 9  2012-09-01   44971 1032627 0.04355009 0.04559338
## 10 2012-10-01   54967 1194396 0.04602074 0.04664736
## ..        ...     ...     ...        ...        ...

The full data set is available in Appendix 1. If you are interested in the R code used to calculate the expected values, see Appendix 2.

Now for the run chart of the “normalised” data.

# Plot observed vs. expected
qic(p.obs, p.exp, month,
    data     = d,
    runvals  = TRUE,
    main     = 'Observed vs. expected sickness absence',
    ylab     = 'Ratio',
    xlab     = 'Month')
Figure 3: Run chart of observed vs. expected sickness absence

Figure 3: Run chart of observed vs. expected sickness absence

Figure 3 reveals the answer to our question: Sickness absence has indeed increased. Non-random variation is signalled by an unusually long run and too few crossings. And this time this is attributable solely to a shift in data since the auto correlation has been removed from data.

From a visual inspection it seem that the shift appeared in March 2014 and has been sustained. If we then split the run chart, we get two separate periods each showing only random variation.

# Add break at data point number 26
qic(p.obs, p.exp, month,
    data     = d,
    runvals  = TRUE,
    decimals = 2, # Number of decimals (to prevent rounding median to one digit)
    breaks   = 26,
    main     = 'Observed vs. expected sickness absence',
    ylab     = 'Ratio',
    xlab     = 'Month')
Figure 4: Run chart with split

Figure 4: Run chart with split

I have not been able to find a good explanation for the shift in March 2014. It seems curious that sickness absence should jump 9% from one month to the next, and I suspect that the shift may have other reasons than the staff suddenly getting more ill. Maybe there have been changes to the definitions regarding sickness or in source data. If ever, this is a good time to repeat the old saying “further studies are needed 😉”.

A logical first step in further analyses would be to stratify data by department to see if the shift is universal or local. I will leave this analysis to the organisation itself.

A rare control chart

Now, there is another thing that strikes my eyes. The data point for January 2012 seems out of tune with the rest. The question is whether this represents something special or it is simply random. To answer this question, we consult a control chart.

# Plot I chart of observed vs. expected
qic(p.obs, p.exp, month,
    data     = d,
    chart    = 'i',
    decimals = 2,
    breaks   = 26,
    main     = 'Observed vs. expected sickness absence',
    ylab     = 'Ratio',
    xlab     = 'Month')

The control chart identifies January 2012 as a special cause, i.e. this data point was under the influence of forces that is not present the rest of the time.

What happened in January 2012? Influenza. This winter, Denmark was hit hard by influenza.

I rarely use control charts, since run charts often answer my questions. But this an example of the proper use of control charts to identify large, sudden, and transient shift in data as special cause variation.

Want more on control charts? Read my previous Control Charts with qicharts for R.

Observed vs. expected – ratio or difference?

I could have used the difference rather than the ratio between observed and expected values, which would have reached the same conclusion. The choice between ratio and difference is a matter of what communicates best. I this case, I personally find the ratio of two ratios easier to understand than the difference. In other cases, e.g. hospital infections, I would prefer the difference between observed and expected counts.

Conclusion

In this post I have demonstrated a simple method for analysing seasonal data by transforming data to a ratio (or difference) of observed vs. expected values and using run charts to identify non-random variation.

Appendix 1: Data table

month absence norm p.obs p.exp
2012-01-01 54004 1106831 0.0487915 0.0504859
2012-02-01 50795 1059873 0.0479255 0.0510557
2012-03-01 50637 1109634 0.0456339 0.0463778
2012-04-01 36111 1061485 0.0340193 0.0359513
2012-05-01 37239 1166583 0.0319214 0.0340368
2012-06-01 41823 1065073 0.0392677 0.0380596
2012-07-01 31922 1123217 0.0284202 0.0302424
2012-08-01 37042 1187736 0.0311871 0.0331401
2012-09-01 44971 1032627 0.0435501 0.0455934
2012-10-01 54967 1194396 0.0460207 0.0466474
2012-11-01 51842 1147503 0.0451781 0.0480092
2012-12-01 50368 1095622 0.0459721 0.0445231
2013-01-01 65709 1201327 0.0546970 0.0504859
2013-02-01 56415 1053997 0.0535248 0.0510557
2013-03-01 47516 1108551 0.0428632 0.0463778
2013-04-01 40138 1160804 0.0345778 0.0359513
2013-05-01 40250 1214971 0.0331284 0.0340368
2013-06-01 38379 1056066 0.0363415 0.0380596
2013-07-01 37190 1222926 0.0304107 0.0302424
2013-08-01 37675 1170681 0.0321821 0.0331401
2013-09-01 50991 1112242 0.0458452 0.0455934
2013-10-01 54502 1221477 0.0446198 0.0466474
2013-11-01 52163 1114760 0.0467930 0.0480092
2013-12-01 49062 1165244 0.0421045 0.0445231
2014-01-01 57344 1216026 0.0471569 0.0504859
2014-02-01 51228 1068984 0.0479221 0.0510557
2014-03-01 53301 1125682 0.0473499 0.0463778
2014-04-01 43962 1180144 0.0372514 0.0359513
2014-05-01 43629 1176885 0.0370716 0.0340368
2014-06-01 43211 1121489 0.0385300 0.0380596
2014-07-01 39282 1238033 0.0317294 0.0302424
2014-08-01 40892 1130074 0.0361853 0.0331401
2014-09-01 55477 1176644 0.0471485 0.0455934
2014-10-01 60505 1227934 0.0492738 0.0466474
2014-11-01 55809 1066558 0.0523263 0.0480092
2014-12-01 55789 1225391 0.0455275 0.0445231
2015-01-01 59853 1168411 0.0512260 0.0504859
2015-02-01 58501 1066212 0.0548681 0.0510557
2015-03-01 58187 1176416 0.0494612 0.0463778
2015-04-01 44345 1174763 0.0377481 0.0359513

Appendix 2: Calculations of extra variables

# Load the dplyr package
library(dplyr)

# Add variables "mo" identifying month as decimal number (01-12) and "p.obs" as the absence rate
d <- d %>% 
  mutate(mo    = format(month, '%m'),
         p.obs = absence / norm)

# Calculate the expected (average) sickness absence by month
pred <- d %>%
  group_by(mo) %>%
  summarise(absence = sum(absence),
            norm    = sum(norm)) %>%
  mutate(p.exp = absence / norm) %>%
  select(mo, p.exp)

# Join data frames, select, and reorder columns
d <- d %>% 
  full_join(pred, by = 'mo') %>% 
  select(month, absence, norm, p.obs, p.exp)

# Print first 10 rows of data frame with added variables
d