Outlier detection for skewed distribution

[Hubert, M. and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Comput. Stat. Data Anal., 52(12):5186–5201.]

The boxplot is a very popular graphical tool to visualize the distribution of continuous unimodal data. It shows information about the location, spread, skewness as well as the tails of the data. However, when the data are skewed, usually many points exceed the whiskers and are often erroneously declared as outliers. An ad- justment of the boxplot is presented that includes a robust measure of skewness in the determination of the whiskers. This results in a more accurate representation of the data and of possible outliers. Consequently, this adjusted boxplot can also be used as a fast and automatic outlier detection tool without making any parametric assumption about the distribution of the bulk of the data.

Skewness adjustment to the boxplot

To measure the skewness of a univariate sample ${x_1, . . . , x_n}$ from a continuous unimodal distribution F, we use the medcouple (MC), introduced in Brys et al. (2004). It is defined as \[MC= med_{xi\leq Q2 \leq xj} h(x_i,x_j) \in [-1, 1]\] with Q2 the sample median and where for all $x_i \neq x_j$ the kernel function h is given by \[h(x_i,x_j)= \frac{(x_j − Q2) − (Q2 −x_i)}{ x_j −x_i}\] The medcouple thus equals the median of all $h(x_i,x_j)$ values for which $x_ i \leq Q2 \leq x_j$.

A distribution that is skewed to the right has a positive value for the medcouple, whereas the MC becomes negative at a left skewed distribution. Finally, a symmetric distribution has a zero medcouple.

In order to make the standard boxplot skewness-adjusted, we incorporate the medcouple into the definition of the whiskers. This can be done by introducing some functions $h_l(MC)$ and $h_u(MC)$ into the outlier cutoff values. Instead of using the fence \[[Q1 −1.5 IQR ; Q3 +1.5 IQR],\] we propose the boundaries of the interval to be defined as \[[Q1 − hl(MC) IQR ; Q3 + hu(MC) IQR].\] Additionally, we require that $h_l(0) = h_u(0) = 1.5$ in order to obtain the standard boxplot at symmetric distributions. As the medcouple is location and scale invariant, this interval is location and scale equivariant. Note that by using different functions hl and hu, we allow the fence to be asymmet- ric around the box, so that adjustment for skewness is indeed possible. Also note that at the population level, no distinction should be made between the whiskers and the fence. At finite samples on the other hand, the whiskers are drawn up to the most remote points before the fence.

The outlying observations are identified using the following bounds are: \[ [Q1 −1.5×e^a MC × IQR; Q3 +1.5 × e^b MC ×IQR] \] Among different but simple model, the exponential one is preferred, that is \[ h_l(MC) = 1.5 e^a MC, ~ h_u(MC) = 1.5 e^b MC \]

Where $MC$ is the medcouple; when $MC > 0$ (positive skewness) then $a = −4$ and $b = 3$; on the contrary $a = −3$ and $b = 4$ for negative skewness ($MC < 0$). This adjustment of the boxplot, according to Hubert and Vandervieren (2008), works with moderate skewness ($−0.6 \leq MC \leq 0.6$).

Note that $a$ and $b$ were rounded off to the nearest smaller integer (in the sense that exact computations leads to $ a = −3.79$ and $b = 3.87$), yielding a smaller fence and consequently a more robust model. The current values of $a$ and $b$ are constructed on outlier-free populations. But as robust estimators also show some bias at contaminated samples, the estimate of the medcouple might increase (with outliers on the upper side of the distribution) or decrease (with outliers on the lower side). To decrease the potential influence of outliers in our exponential model, we therefore prefer lower values of $a$ and $b$.

To summarize, we thus can say that when $MC \geq 0$, all observations outside the interval \[[Q1 − 1.5 e^{−4 MC} IQR ; Q3 + 1.5 e^{3 MC} IQR]\] will be marked as potential outlier. For $MC < 0$, the interval becomes \[[Q1 − 1.5 e^{−3MC} IQR ; Q3 + 1.5 e^{4MC} IQR].\]

Let’s see what happens with average expenditure of patients in Germany.

Plot the approximate density function and consider the skewness of the data.

ggplot(df, aes(x=EXPENDITURE_PER_PATIENT)) +
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666")

skewness(df$EXPENDITURE_PER_PATIENT)[1]

## [1] 28.30774

For symmetry skewness should be 0. Now compute the medcouple measure we talked about before.

mc(df$EXPENDITURE_PER_PATIENT)[1]

## [1] 0.08035983

$-0.6 \leq MC \leq 0.6$ as requested by Hubert and Vandervieren (2008) for their adjusted boxplot.

rownames(df) <- df$BSNR
adj_boxplot <- boxB(df$EXPENDITURE_PER_PATIENT, method="adjbox")

## The MedCouple skewness measure is: 0.0804

## No. of outliers in left tail: 1014

## No. of outliers in right tail: 4421

Alternatively, we can use log-transformed distribution for the average expenditure of the patients, in order to make the distribution of the data more symmetric. We can use longer whiskers for the traditional boxplot.

adj_boxplot <- boxB(df$EXPENDITURE_PER_PATIENT, k = 2, method="resistant", logt=TRUE)

## No. of outliers in left tail: 1815

## No. of outliers in right tail: 2739

Outlier detection with Gamma distribution

This is an attempt to model skewed data with a skewed distribution, Gamma. I empirically verified that it is not the best distribution (better attempts may be done with InvGamma or Weibull), but with $Gamma(\alpha, \beta)$ we can easily compute the distribution of the sample mean for each practice (that is what we have in the column EXPENDITURE_PER_PATIENT), indeed: \[ X_i \sim Gamma(\alpha, \beta) ~ iid \rightarrow \bar{X} \sim Gamma(\alpha, \beta) \] thus we can compute for each institution

$\mu$ = mean = $\frac{\alpha}{\beta}$
$\sigma^2$ = variance = $\frac{\alpha}{\beta^2}$
mode = $\frac{\alpha - 1}{\beta}$ if $\alpha > 1$

Since we want to model the distribution of the expenditure per patient $\bar{X}$ with a Gamma, we need to find $\alpha$ and $\beta$. To better mantain the shape of the empirical distribution, we estimate them starting from the mode and the variance.

First we compute $\sigma^2$ and mode, then we express $\alpha$ and $\beta$ in terms of $\sigma^2$ and mode.

$\frac{\sigma^2}{mode^2} \alpha^2 + (-2 \frac{\sigma^2}{mode^2} -1) \alpha + \frac{\sigma^2}{mode^2} = 0$
$\beta = \frac{\alpha - 1}{mode}$

mode = mlv(df$EXPENDITURE_PER_PATIENT, bw=0.9) # generic function for estimating the mode of a univariate distributio
var = var(df$EXPENDITURE_PER_PATIENT)

# find alpha
a = var/mode^2
b = -2*(var/mode^2) - 1
c = var/mode^2

delta = b^2 - 4*a*c
alpha1 = (- b + sqrt(delta))/(2*a)
alpha2 = (- b - sqrt(delta))/(2*a)

alpha = min(alpha1, alpha2)
if (alpha < 1)
  {alpha = max(alpha1, alpha2)}

beta= (alpha - 1)/mode

# plot(rlnorm(50000, mu, sqrt(sigma2)))
sim = data.frame(rgamma(50000, alpha, beta))
ggplot(sim, aes(x=rgamma.50000..alpha..beta.)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666")

At this point we can plot a Tunnel plot. It clearly do not account for inner variance of single institutions, but it is an option to visualize the behaviour of the practices vs a benchmark (mean = $\frac{\alpha}{\beta}$) and a confidence interval given by \[ \mathbb{P} (a < \bar{X} <b ) = 1 - \alpha ~ \longrightarrow a = q_{\alpha/2}, ~ b = q_{1 - \alpha/2}\] where $\alpha$ in this case is the significance of the CI.

val<-seq(0, max(df$AMOUNT_AVD_PATIENTS), 0.5)
p1<-0.99; q1 <- qgamma(p1, shape=alpha, rate=beta)
p2<-0.01; q2 <- qgamma(p2, shape=alpha, rate=beta)
p3<-0.975; q3 <- qgamma(p3, shape=alpha, rate=beta)
p4<-0.025; q4 <- qgamma(p4, shape=alpha, rate=beta)

theta<-alpha/beta          #### global mean of rehospitalization is the target theta_0
df$VARIANCE_EXPENDITURE = alpha/(beta^2 * df$EXPENDITURE_PER_PATIENT)

# m <- c(0, q2, q1, Inf)  
# cols <- cut(df$EXPENDITURE_PER_PATIENT, breaks = m)
plot(df$AMOUNT_AVD_PATIENTS,df$EXPENDITURE_PER_PATIENT,type="p", xlab="Volume of patients", ylab="Average expenditure per patient",  main="Tunnel plot for average expenditure per patient", pch=16, cex=0.3)
abline(h=theta)

points(val, rep(q1, length(val)), type='l', col='green')
points(val, rep(q2, length(val)), type='l', col='green')
points(val, rep(q3, length(val)), type='l', col='blue')
points(val, rep(q4, length(val)), type='l', col='blue')

print(paste('outliers for right tail, significance = 0.02: ', sum(df$EXPENDITURE_PER_PATIENT > qgamma(p1, shape=alpha, rate=beta))))

## [1] "outliers for right tail, significance = 0.02:  542"

print(paste('outliers for left tail, significance = 0.02: ', sum(df$EXPENDITURE_PER_PATIENT < qgamma(p2, shape=alpha, rate=beta))))

## [1] "outliers for left tail, significance = 0.02:  34"

print(paste('outliers for right tail, significance = 0.05: ', sum(df$EXPENDITURE_PER_PATIENT > qgamma(p3, shape=alpha, rate=beta))))

## [1] "outliers for right tail, significance = 0.05:  639"

print(paste('outliers for left tail, significance = 0.05: ', sum(df$EXPENDITURE_PER_PATIENT < qgamma(p4, shape=alpha, rate=beta))))

## [1] "outliers for left tail, significance = 0.05:  1488"

Outlier detection for skewed distribution

Dario Delle Vedove

Skewness adjustment to the boxplot

Outlier detection with Gamma distribution