library(rstan)
library(rethinking)

library(igraph)     # draw model graphs
library(dagitty)    # draw causal graphs (ref: http://www.dagitty.net/primer/)

library(latex2exp)  # use latex expressions

library(ggplot2)
library(ggthemes)
library(ggmcmc)
library(gridExtra)

A quote from E.T. Jaynes:

The problem of outliers in data has been a topic of lively discussion since the 18th century, when it arose in astronomy, geodesy, calorimetry, and doubtless many other measurements. Let us interpret ‘apparatus’ broadly as meaning any method for acquiring data. On the philosophical side, two opposite views have been expressed repeatedly.

(I) Something must have gone wrong with the apparatus; the outlier is not part of the good data and we must throw it out to avoid getting erroneous conclusions.

(II) No! It is dishonest to throw away any part of your data merely because it was unexpected. That outlier may well be the most significant datum you have, and it must be taken into account in your data analysis, otherwise you are ‘fudging’ the data arbitrarily and you can make no pretense of scientific objectivity. – ET Jaynes - Probability Theory, pgs.615–616

We will go through an example for each perspective:

Detecting and Removing Outliers

plot.errors <- function(x, y, e) {
  # scatter plot with vertical error bars of half-length e around each y value
  plot(x, y, pch=19, ylim=c(min(y)*0.75, max(y)*1.2))
  segments(x, y-e, x, y+e)                      # vertical error bars
  width.bar <- mean(e)/10                       # half-width of the caps
  segments(x-width.bar, y-e, x+width.bar, y-e)  # lower caps
  segments(x-width.bar, y+e, x+width.bar, y+e)  # upper caps
}

Consider the following data set of points \((x_i,y_i)\), where \(e_i\) is the measurement error associated with the i-th data point.

x <- c(0,  3,  9, 14, 15, 19, 20, 21, 30, 35, 40, 41, 42, 43, 54, 56, 67, 69, 72, 88)
y <- c(33, 68, 34, 34, 37, 71, 37, 44, 48, 49, 53, 49, 50, 48, 56, 60, 61, 63, 44, 71)
e <- c(3.6, 3.9, 2.6, 3.4, 3.8, 3.8, 2.2, 2.1, 2.3, 3.8, 2.2, 2.8, 3.9, 3.1, 3.4, 2.6, 3.4, 3.7, 2.0, 3.5) # error of y

plot.errors(x,y,e)

Let’s first apply a standard (non-robust) frequentist linear regression (red) and a robust version based on the Huber loss (green). Notice that the robust method does not tell us which points are considered outliers.

plot.errors(x,y,e)

#### Linear Regression
fit <- lm(y~x, data=data.frame(x=x,y=y))
abline(fit, col="red", lwd=2)                        # show linear fit

#### Robust Regression using Huber loss function
# a: residuals, ie, y - hat.y
huber <- function(a, delta) {
  ifelse(abs(a)<delta, a^2/2, delta*(abs(a)-delta/2))
}

# choosing the Huber loss and delta=3 is a subjective decision
huber.loss <- function(theta, x=x, y=y, e=e, delta=3) {
  sum( huber((y - theta[1] - theta[2]*x)/e, delta) )
}

# find best values using optimization
fit.huber <- optim(par=c(0,0), fn=huber.loss, x=x, y=y, e=e) 
abline(fit.huber$par[1], fit.huber$par[2], col="green3", lwd=2) # show robust fit

legend(45, 87, c("Linear", "Robust"), lty=c(1,1), lwd=4, 
       col=c("red", "green3"))

Let’s do this using a Bayesian approach.

Here we acknowledge the existence of outliers. That means we need to replace the standard linear regression model

\[y_i \sim \mathcal{N}(\theta_0 + \theta_1 ~ x_i,e_i)\]

with one that includes the possibility of a given point \((x_i,y_i)\) being an outlier:

\[y_i \sim \mathcal{N}(\theta_0 + \theta_1 ~ x_i,e_i) ~\text{ or } ~ y_i \sim \mathcal{N}(\theta_0 + \theta_1 ~ x_i,\sigma_B) \]

Each point has some probability \(1-\alpha_i\) of being an outlier, in which case it follows another normal distribution with the same mean \(\theta_0 + \theta_1 ~ x_i\) but a large standard deviation \(\sigma_B\). We assume that all outliers follow the same distribution.

The standard deviation \(\sigma_B\) could be treated as an extra nuisance parameter, or simply set to some high value. We’ll choose the second option and pass it in as data with value 50.

This implies the following likelihood needs to be coded in Stan:

\[p(x_i,y_i,e_i | \theta_0, \theta_1, \alpha_i,\sigma_B) = \frac{\color{orange}{\alpha_i}}{\sqrt{2\pi \color{red}{e_i^2}}} \exp \Big\{ -\frac{1}{2 \color{red}{e_i^2}} (y_i - \mu_i)^2 \Big\} + \frac{\color{orange}{1-\alpha_i}}{\sqrt{2\pi \color{red}{\sigma_B^2}}} \exp \Big\{ -\frac{1}{2\color{red}{\sigma_B^2}} (y_i - \mu_i)^2 \Big\}\] where parameters \(\theta_0, \theta_1\) define the linear fits \(\theta_0 + \theta_1 x_i\). The expression \(y_i - \mu_i\) is the residual between the true value \(y_i\) and the one estimated by the fit \(\mu_i = \hat{y}(x_i|\theta) = \theta_0 + \theta_1 ~ x_i\).

Stan includes a functions block that allows the construction of arbitrary likelihoods. To be specific, we need to code the log-likelihood (the function’s name must have the suffix _log; more recent Stan versions use the suffix _lpdf instead).

functions {
   real outliers_log(real y, real x, real e, 
                     real alpha, real theta0, real theta1, real sigmaB) {

      real likelihood1;
      real likelihood2;
      real mu;
      
      mu = theta0 + theta1 * x;
      
      likelihood1 = (alpha/sqrt(2*pi()*pow(e,2))) * 
                    exp(-0.5 * pow(y-mu,2)/pow(e,2));
                    
      likelihood2 = ((1-alpha)/sqrt(2*pi()*pow(sigmaB,2))) * 
                    exp(-0.5 * pow(y-mu,2)/pow(sigmaB,2));
                    
      return log(likelihood1 + likelihood2);
   }
}

data {
  int<lower=1>  N;  
  real x[N]; 
  real y[N]; 
  real e[N]; 
  real<lower=0> sigmaB;
}

parameters {
  real theta0;
  real theta1;
  real<lower=0, upper=1> alpha[N];
}

model {
  theta0 ~ cauchy(0,10);
  theta1 ~ cauchy(0,10);
  
  for ( i in 1:N ) {
    alpha[i] ~ beta(1,1);
  }

  for ( i in 1:N ) {
    y[i] ~ outliers(x[i], e[i], alpha[i], theta0, theta1, sigmaB);
  }

}

The values \(\alpha_i\) are also parameters (one parameter per datapoint), and will represent the probability that point \((x_i,y_i)\) is not an outlier. That means that, after the fit, we can consult those values to find which points were considered outliers by the model!
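
Before sampling, the Stan program must be compiled. A minimal sketch, assuming the program above is stored in a string called stan_code1 (a hypothetical name, not used in the original):

# compile the Stan program once (stan_code1 is a hypothetical variable holding the code above)
model1 <- rstan::stan_model(model_code = stan_code1)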

fit1 <- sampling(model1, 
                 data    = list(N = length(x),
                                x = x,
                                y = y,
                                e = e,
                                sigmaB = 50),
                 iter    = 5000,
                 chains  = 4,
                 cores   = 4,
                 refresh = 0
                 )

precis(fit1) # use precis(fit1, depth=2) to also check the alpha parameters
##              mean         sd       5.5%     94.5%    n_eff     Rhat4
## theta0 31.1881610 1.59557092 28.6682510 33.730039 11192.68 1.0000203
## theta1  0.4711307 0.03714094  0.4130898  0.530389 11228.50 0.9998951

Let’s extract the samples, draw the Bayesian fit alongside the previous two, and plot the outliers flagged by the model fit:

samples <- rstan::extract(fit1)

sample_theta0_mean  <- mean(samples$theta0)
sample_theta1_mean  <- mean(samples$theta1)

plot.errors(x,y,e)

abline(fit, col=rgb(1,0,0,0.3), lwd=2)                                   # linear fit
abline(fit.huber$par[1], fit.huber$par[2], col="darkolivegreen1", lwd=2) # robust fit
abline(sample_theta0_mean, sample_theta1_mean, col="blueviolet",lwd=2)   # Bayes fit

# define outliers as those points for which the posterior mean of alpha[i] is less than 0.5
posterior_alpha_means <- rep(NA, length(x))
for(i in 1:length(x)) {
  posterior_alpha_means[i] <- mean(samples$alpha[,i])
}
outliers <- which(posterior_alpha_means<0.5)
points(x[outliers], y[outliers], col="blueviolet", pch=0) # plot outliers

legend(45, 87, c("Bayes", "Linear", "Robust"), lty=c(1,1), lwd=4, 
       col=c("blueviolet", rgb(1,0,0,0.3), "darkolivegreen1"))
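
As a quick follow-up (not in the original), we can also inspect the flagged points together with their posterior probability of not being an outlier:

# flagged points and their posterior mean alpha (probability of not being an outlier)
data.frame(x = x[outliers], y = y[outliers],
           p_not_outlier = round(posterior_alpha_means[outliers], 2))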

Embracing Outliers

Let’s consider the following problem from Stephen F. Gull in 1988:

A lighthouse is somewhere off a piece of straight coastline at a position \(\alpha\) along the shore and a distance \(\beta\) out at sea. It emits a series of short highly collimated flashes at random intervals and hence at random azimuths. These pulses are intercepted on the coast by photo-detectors that record only the fact that a flash has occurred, but not the angle from which it came. A total of N flashes have so far been recorded at positions \(x_k\). Where is the lighthouse?’ – DS Sivia - Data Analysis, A Bayesian Tutorial, pg.29

Problem diagram (also from Sivia)
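
The data vector x_k is not generated in this excerpt. Here is a minimal sketch of how such data could be simulated, using the true values given later in the text (\(\alpha = 10\), \(\beta = 30\)); the seed and sample size are hypothetical:

set.seed(101)                      # hypothetical seed
alpha <- 10                        # true position along the shore
beta  <- 30                        # true distance out at sea
N     <- 1000                      # hypothetical number of recorded flashes
theta <- runif(N, -pi/2, pi/2)     # random azimuths
x_k   <- alpha + beta * tan(theta) # flash positions recorded on the coast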


Here’s the sample data for this problem:

par(mfrow=c(1,2)) 
boxplot(    x_k , horizontal = T, pch=20, xlab=TeX('$x_k$'))
boxplot(log(x_k), horizontal = T, pch=20, xlab=TeX('$\\log(x_k)$'))

Here we will not discard these ‘outliers’ as measurement error; instead, we will treat them as vital information about the problem.

Notice that using the mean of \(x_k\) as an estimate of \(\alpha\) does not work:

set.seed(2414)
n.resamplings <- 1e4

bootstrap.samples <- 
  replicate(n.resamplings, mean(sample(x_k, replace=TRUE)))

hist(bootstrap.samples, breaks=50, col="dodgerblue", 
     xlab=TeX('$\\hat{\\mu}$'), main="")

The spread of bootstrapped values is very large…

If we remove the outliers, say the 5% most extreme values, and estimate the mean of the remaining data:

range_no_outliers <- rethinking::PI(x_k, prob=0.95)
x_k2 <- x_k[ x_k > range_no_outliers[1] & x_k < range_no_outliers[2]]
estimation_alpha_no_outliers <- mean(x_k2)
estimation_alpha_no_outliers
## [1] 8.326329

Let’s return to this estimation later on…


This problem has a mathematical structure that helps us define an appropriate probabilistic model.

As implied in the description, the angle \(\theta\) is uniform over \((-\pi/2, +\pi/2)\), so

\[p(\theta | \alpha, \beta) = \frac{1}{\pi}\]

The position \(x\) is given by trigonometry

\[x = \beta \tan \theta + \alpha\]

The rule of variable transformation allows us to compute \(p(x | \alpha, \beta)\):

\[p(x | \alpha, \beta) = p(\theta | \alpha, \beta) \left| \frac{d\theta}{dx} \right|\]

Since

\[x = \beta \tan \theta + \alpha \iff \theta = \tan^{-1} \frac{x-\alpha}{\beta}\]

if we differentiate \(\theta\) with respect to \(x\) (using \(\frac{d}{du} \tan^{-1} u = \frac{1}{1+u^2}\)):

\[\frac{d\theta}{dx} = \frac{d}{dx} \left[ \tan^{-1} \frac{x-\alpha}{\beta} \right] = \frac{\beta}{(x-\alpha)^2 + \beta^2}\]

Getting back to \(p(x | \alpha, \beta)\):

\[p(x | \alpha, \beta) = p(\theta | \alpha, \beta) \left| \frac{d\theta}{dx} \right| = \frac{1}{\pi} \frac{\beta}{(x-\alpha)^2 + \beta^2}\]

We could code this likelihood directly in Stan, as we did in the previous linear regression example.

But this distribution is precisely the Cauchy distribution. It has two parameters, location \(x_0\) (which we called \(\alpha\)), and scale \(\beta>0\):

\[p(x|x_0,\beta) = \frac{1}{\pi} \frac{\beta}{(x-x_0)^2 + \beta^2}\]
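
As a quick numerical check (not in the original), the derived density matches R’s built-in dcauchy:

# hand-derived lighthouse density vs R's built-in Cauchy density
p_x <- function(x, x0, beta) (1/pi) * beta / ((x - x0)^2 + beta^2)
all.equal(p_x(-5:5, x0 = 2, beta = 3), dcauchy(-5:5, location = 2, scale = 3))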

A side note: this explains why the mean of \(x_k\) is not a good estimate for \(\alpha\). We cannot rely on the Central Limit Theorem (CLT) here: the CLT requires the data to be iid (which they are) and drawn from a distribution with finite mean and variance, and the Cauchy has neither a mean nor a variance.
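
A quick illustration (not in the original): the running mean of simulated Cauchy samples never settles down, no matter how much data we collect:

set.seed(1)                                # hypothetical seed
z <- rcauchy(1e5, location=10, scale=30)   # simulate flashes from the true lighthouse parameters
plot(cumsum(z)/seq_along(z), type="l",
     xlab="number of samples", ylab="running mean")  # the jumps never die out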

The Cauchy is symmetric and does have a median, \(x_0\), which can be used as an estimator here:

median(x_k)
## [1] 8.741815

Applying the bootstrap gives us a smaller range of values:

set.seed(2414)
n.resamplings <- 1e4

bootstrap.samples <- 
  replicate(n.resamplings, median(sample(x_k, replace=TRUE)))

hist(bootstrap.samples, breaks=50, col="dodgerblue", xlab=TeX('$\\hat{\\mu}$'))

In any case, either point estimate provides much less information than the posterior \(p(\alpha|\mathcal{D})\) will give us.

The probabilistic model is thus:

\[x_k \sim \text{Cauchy}(\alpha, \beta)\]

\[\alpha \sim \mathcal{N}(0,20)\]

\[\beta \sim \text{LogNormal}(0,1)\]

We allow \(\alpha\) to spread around zero with a large standard deviation. Since \(\beta\) must be positive, we assign it a log-normal prior.

Let’s do a quick prior predictive check to confirm that these priors allow a good exploration of the parameter space:

set.seed(132)
n <- 50
alpha.prior <- rnorm(n, 0, 20)
beta.prior <- rlnorm(n, 0, 1)
plot(NULL, xlim=c(-30,30), ylim=c(0,1.1), xlab=NA, ylab=NA)
for(i in 1:n) {
  curve(dcauchy(x,alpha.prior[i], beta.prior[i]), -30, 30, col="lightgrey", add=T)
}

The next Stan program models our problem:

data {
  int<lower=0> N;
  real x_[N];
}

parameters {
  real alpha; 
  real<lower=0> beta;
}

model {

  alpha ~ normal(0, 20);
  beta ~ lognormal(0,1);
  
  for (k in 1:N) {
     x_[k] ~ cauchy(alpha, beta);
  }
}
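
As before, the program is assumed to have been compiled before sampling (again with a hypothetical string name, here stan_code2):

# compile the second Stan program (stan_code2 is a hypothetical variable holding the code above)
model2 <- rstan::stan_model(model_code = stan_code2)
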
fit2 <- sampling(model2,
                 data    = list(N  = length(x_k),
                                x_ = x_k),
                 iter    = 20000,
                 chains  = 4,
                 cores   = 4,
                 refresh = 0
                 )

precis(fit2)
##            mean       sd      5.5%    94.5%    n_eff     Rhat4
## alpha  9.393625 1.352838  7.233679 11.54357 34662.82 0.9999779
## beta  30.690782 1.356504 28.568001 32.89772 34964.02 1.0000384
chains <- ggs(fit2)

fig1 <- ggs_histogram(chains, greek=TRUE, bins=50) +
   theme_minimal()
fig2 <- ggs_traceplot(chains, greek=TRUE)
fig3 <- ggs_density(chains)
grid.arrange(fig1, fig2, fig3, ncol = 2, nrow = 2)

ggs_pairs(chains, lower = list(continuous = "density"))

Let’s get the samples:

samples <- rstan::extract(fit2)

sample_alpha <- samples$alpha
sample_beta  <- samples$beta

alpha.estimate <- mean(sample_alpha)
beta.estimate  <- mean(sample_beta)

Here are the empirical distributions for the parameters (the vertical lines are the true parameter values used to create the data set):

par(mfrow=c(1,2)) 
hist(x=sample_alpha, breaks=30, col="dodgerblue",
     xlab=bquote(alpha), ylab="",main="", yaxt='n')
abline(v=alpha, col="orange", lwd=2)
hist(x=sample_beta, breaks=30, col="dodgerblue", 
     xlab=bquote(beta), ylab="",main="", yaxt='n')
abline(v=beta, col="orange", lwd=2)

The real values are \(\alpha\) = 10 and \(\beta\) = 30. The mean estimate was 8.33 (but it could have been any value whatsoever) and the median estimate was 8.74. The mean of the parameter chains is 9.39. Besides being closer, we also got an estimate of \(\beta\) = 30.69, which the previous point estimates could not provide.

To compare results, let’s compute the Cauchy parameter estimates using both the quantile method and maximum likelihood (MLE). Again, these are just point estimates; they don’t convey the uncertainty associated with each parameter the way the posterior distributions do.

qs <- quantile(x_k, probs = c(.25, .5, .75))
physics.alpha <- qs[2]              # the sample median estimates the location
physics.beta  <- (qs[3] - qs[1])/2  # for a Cauchy, the IQR equals 2*beta

library(fitdistrplus)
fit.cauchy <- fitdist(x_k, "cauchy")
MLE.alpha <- fit.cauchy$estimate[1]
MLE.beta  <- fit.cauchy$estimate[2]

Stan includes the function optimizing, which returns a point estimate obtained by maximizing the joint posterior. The result approximates the mode of the joint posterior, usually called the Maximum A Posteriori (MAP) estimate.

opt <- optimizing(model2,
                  list(N = length(x_k), x_ = x_k))

MAP.alpha <- opt$par[1]
MAP.beta  <- opt$par[2]

Let’s compare these estimations against the empirical posterior distribution that Stan computed.

plot(sample_alpha, sample_beta, pch=20, col=rgb(0,0,0,.2), 
     xlab=TeX("$\\alpha$"), ylab=TeX("$\\beta$"))

points(alpha, beta, pch=3, col="red", lwd=3) # true value
points(physics.alpha, physics.beta, pch=3, col="lightblue", lwd=3) # quantile estimation
points(MLE.alpha,     MLE.beta,     pch=3, col="orange",    lwd=3) # MLE estimation
points(MAP.alpha,     MAP.beta,     pch=3, col="purple",    lwd=3) # MAP estimation
points(mean(sample_alpha), mean(sample_beta), 
       pch=3, col="green3", lwd=3) # Bayes mean estimation

legend("topright", pch=3, pt.cex=1.2, pt.lwd=3,
       c("true position", "Bayes mean", "MAP", "MLE", "Quantile"), 
       col=c("red","green3", "purple", "orange", "lightblue")) 

Using a contour plot shows that the highest density area of the samples is quite close to the true value:

df <- data.frame(x=sample_alpha,
                 y=sample_beta)
ggplot(df, aes(x=x, y=y) ) +
  geom_density_2d(aes(colour = ..level..)) +
  scale_color_viridis_c() + 
  xlab(TeX("$\\alpha$")) +
  ylab(TeX("$\\beta$")) +
  geom_point(data=data.frame(x=alpha, y=beta), shape="+", size=5, col="red") +
  theme_clean()

We can also use the MCMC samples to make inferences about the angles \(\theta_k\).

Notice that

\[\theta_k = \arctan \frac{x_k - \alpha}{\beta}\]

that is, each \(\theta_k\) is a function of two random variables, so it is itself a random variable.

To find the empirical distribution of angle \(\theta_k\) we just need to use the samples of \(\alpha\) and \(\beta\).

Let’s plot the posterior distribution \(p(\theta_1 | \mathcal{D})\):

sample_theta <- atan((x_k[1] - sample_alpha)/sample_beta)

hist(sample_theta, breaks=40, yaxt='n', col="dodgerblue",
     xlab=TeX('$\\theta_1$ (in radians)'), ylab="", 
     main=TeX('$p(\\theta_1 | D)$'))