Dean Attali & Jenny Bryan
June 21, 2014
We tried using flexmix to define two subpopulations in our bivariate data, but discovered
that it did not perform as we were hoping.
Each population is roughly linear, though
they are both extremely skewed to the higher end of the distribution.
This skewness could very well be the reason flexmix doesn't give us the wanted results.
Below is a replication of what we saw, but with mock data.
The code below is available in raw text format as a gist
Each of the two populations is generated from a uniform distribution over most
of the range, and from a gaussian with a low SD at the end of the range - hence
the skewness of the data.
The two populations are differentiated by colour.
(Skip this code chunk if you're not interested in how the data is generated)
library(flexmix)
set.seed(50)
numRain <- 50
numClus <- 250
# population 1
x1rain <- runif(numRain, 0, 90)
y1rain <- x1rain + rnorm(50, 0, 3)
x1clus <- rnorm(numClus, 95, 3)
y1clus <- rnorm(numClus, 95, 3)
x1 <- c(x1rain, x1clus)
y1 <- c(y1rain, y1clus)
# population 2
x2rain <- runif(numRain, 0, 30)
y2rain <- 3*x2rain + rnorm(50, 0, 9)
x2clus <- rnorm(numClus, 35, 3)
y2clus <- rnorm(numClus, 95, 3)
x2 <- c(x2rain, x2clus)
y2 <- c(y2rain, y2clus)
data <- data.frame(x = c(x1, x2), y = c(y1, y2), pop = rep(1:2, each=length(x1)))
plot(y ~ x, data, col = pop)
title("2 true populations")
Here is what happens with a naive attempt of defining two populations with flexmix
set.seed(50)
m1 <- flexmix(y ~ x, data = data, k = 2)
plot(y ~ x, data, col=clusters(m1))
abline(a = parameters(m1)[1, 1], b = parameters(m1)[2, 1], col="blue")
abline(a = parameters(m1)[1, 2], b = parameters(m1)[2, 2], col="blue")
title("2 populations with basic flexmix")
To get results closer to what we want, we had to add weights to the tails of the populations. All the points that are not in the dense cluster have 10x the weight.
weights <- rep(c(rep(10, times=numRain),
rep(1, times=numClus)),
times=2)
set.seed(50)
m2 <- flexmix(y ~ x, data = data, k = 2, weights = as.integer(weights))
plot(y ~ x, data, col=clusters(m2))
abline(a = parameters(m2)[1, 1], b = parameters(m2)[2, 1], col="blue")
abline(a = parameters(m2)[1, 2], b = parameters(m2)[2, 2], col="blue")
title("2 true with flexmix with weights")
Ideally, we'd like to be able to get this end result without having to supply the weights.