flexmix case study

Dean Attali & Jenny Bryan
June 21, 2014

We tried using flexmix to define two subpopulations in our bivariate data, but discovered that it did not perform as we were hoping.
Each population is roughly linear, though they are both extremely skewed to the higher end of the distribution.
This skewness could very well be the reason flexmix doesn't give us the wanted results.
Below is a replication of what we saw, but with mock data.

The code below is available in raw text format as a gist

Create test data

Each of the two populations is generated from a uniform distribution over most of the range, and from a gaussian with a low SD at the end of the range - hence the skewness of the data.
The two populations are differentiated by colour.
(Skip this code chunk if you're not interested in how the data is generated)

library(flexmix)
set.seed(50)

numRain <- 50
numClus <- 250

# population 1
x1rain <- runif(numRain, 0, 90)
y1rain <- x1rain + rnorm(50, 0, 3)
x1clus <- rnorm(numClus, 95, 3) 
y1clus <- rnorm(numClus, 95, 3)
x1 <- c(x1rain, x1clus)
y1 <- c(y1rain, y1clus)

# population 2
x2rain <- runif(numRain, 0, 30)
y2rain <- 3*x2rain + rnorm(50, 0, 9)
x2clus <- rnorm(numClus, 35, 3) 
y2clus <- rnorm(numClus, 95, 3)
x2 <- c(x2rain, x2clus)
y2 <- c(y2rain, y2clus)

data <- data.frame(x = c(x1, x2), y = c(y1, y2), pop = rep(1:2, each=length(x1)))

plot(y ~ x, data, col = pop)
title("2 true populations")

plot of chunk create-data

flexmix naive use

Here is what happens with a naive attempt of defining two populations with flexmix

set.seed(50)
m1 <- flexmix(y ~ x, data = data, k = 2)
plot(y ~ x, data, col=clusters(m1))
abline(a = parameters(m1)[1, 1], b = parameters(m1)[2, 1], col="blue")
abline(a = parameters(m1)[1, 2], b = parameters(m1)[2, 2], col="blue")
title("2 populations with basic flexmix")

plot of chunk naive-flexmix

flexmix with a-priori weights

To get results closer to what we want, we had to add weights to the tails of the populations. All the points that are not in the dense cluster have 10x the weight.

weights <- rep(c(rep(10, times=numRain),
                 rep(1, times=numClus)),
               times=2)
set.seed(50)
m2 <- flexmix(y ~ x, data = data, k = 2, weights = as.integer(weights))
plot(y ~ x, data, col=clusters(m2))
abline(a = parameters(m2)[1, 1], b = parameters(m2)[2, 1], col="blue")
abline(a = parameters(m2)[1, 2], b = parameters(m2)[2, 2], col="blue")
title("2 true with flexmix with weights")

plot of chunk weights-flexmix

Ideally, we'd like to be able to get this end result without having to supply the weights.