What is clustering

The procedure of partitioning a set of observations into a set of meaningful subclasses or clusters. All the observations in a cluster share some similarities but are essentially distinct from the cluster observations

Clustering methods

Instead of optimizing each observation to clsuters’ centers, like k-means, we optimize a probabilistic model

library(ggplot2)
library(dplyr)
library(mixtools) # the poisson ditribution is not implemented
## Warning: package 'mixtools' was built under R version 4.0.3
library(flexmix)
## Warning: package 'flexmix' was built under R version 4.0.3
library(magrittr)
library(ellipse)
## Warning: package 'ellipse' was built under R version 4.0.3
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.3
gender <- read.csv("gender.csv")
data <- read.csv("digits.csv")
crimes <- read.csv("CoC_crimes.csv")
# Have a look to gender_with_probs (after clustering)
head(gender )
##   Gender   Height   Weight        BMI  probability
## 1   Male 73.84702 241.8936 0.04435662 5.778312e-06
## 2   Male 68.78190 162.3105 0.03430822 6.059525e-01
## 3   Male 74.11011 212.7409 0.03873433 2.625952e-05
## 4   Male 71.73098 220.0425 0.04276545 3.628734e-04
## 5   Male 69.88180 206.3498 0.04225479 4.611901e-03
## 6   Male 67.25302 152.2122 0.03365316 9.110687e-01
# Scatterplot with probabilities
gender %>% 
  ggplot(aes(x = Weight, y = BMI, col = probability))+
  geom_point(alpha = 0.5)

Gaussian Distribution

population_sample <- rnorm(1000, 10, 2)
mean_estimate <- mean(population_sample)
std_estimate <- sd(population_sample)

population_sample %<>% data.frame(x = population_sample)

ggplot(data =population_sample) +
  geom_histogram(aes(x = x, y = ..density..)) +
  stat_function(geom = "line", fun = dnorm, args = list(mean =mean_estimate,
                                                        sd= std_estimate)) +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Gender dataset

# Plot the sample with the estimated curve

mean_estimate <- gender$Weight %>% mean()
sd_estimate <- gender$Weight %>%  sd()

gender %>% 
  ggplot() + geom_histogram(aes(x = Weight, y = ..density..), bins = 100) + 
  stat_function(geom = "line", fun = dnorm,
                args = list(mean = mean_estimate, sd = sd_estimate)) +
  theme_minimal()

It appears to be two different graussian distributions, but we are treating them as one.

Gausiian mixture models - GMM

number_of_obs <- 500
coin <- sample(c(0, 1), size = number_of_obs,
               replace = TRUE, prob = c(0.5, 0.5))
head(coin)
## [1] 1 0 1 0 1 0
table(coin)
## coin
##   0   1 
## 237 263
# Gaussian 1: Heads
gauss_1 <- rnorm(n = number_of_obs, mean = 5, sd = 2)

# Gaussian 2: Tails
gauss_2 <- rnorm(n = number_of_obs)

mixture_simulation <- ifelse(coin, gauss_1, gauss_2)
head(cbind(coin, gauss_1, gauss_2, mixture_simulation))
##      coin  gauss_1    gauss_2 mixture_simulation
## [1,]    1 4.801685 -1.1031314          4.8016846
## [2,]    0 6.984278 -0.1769560         -0.1769560
## [3,]    1 4.207626 -1.5725863          4.2076261
## [4,]    0 1.798294 -1.5954626         -1.5954626
## [5,]    1 7.730304 -0.5768537          7.7303043
## [6,]    0 4.488410 -0.2602325         -0.2602325
mixture_simulation %<>% data.frame(x = mixture_simulation)

ggplot(mixture_simulation) +
  geom_histogram(aes(x=x, ..density..), bins = 40)

Mixture of three distributions

proportions <- sample(c(0, 1, 2), number_of_obs,
                       replace = TRUE, prob =c(1/3, 1/3, 1/3))

gauss_3 <- rnorm(n= number_of_obs,mean= 10, sd = 1)

mixture_simulation <- data.frame(x = ifelse(proportions == 0, gauss_1,
                                            ifelse(proportions == 1, gauss_2, gauss_3)))

ggplot(mixture_simulation) +
  geom_histogram(aes(x=x, y =..density..))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Structure of mixture models

To cluster the data with mixture model, there are three questions to be answered:

  1. Which is the suitable probability disitribution?
  1. How many clusters explain the data?
  1. What are the parameters and their estimations

Gender dataset applications

  1. Which distribution? Bivariate distribution
  2. How many clusters? Two clusters: two subpopulations: gender
  3. what are the estimates? Mans, standard deviations and proportions

Handwritten digits application

  1. Which distributions? Bernoulli distributions
  2. How many clusters? Two clusters
  3. What are the estimates? The mean probability of being 1 fro every dot and proportions

Crime types

  1. Which distributions? Multivariate poisson distribution
  2. How many clusters? Six clusters
  3. What are the estimates? Average number of crimes by type and proportions

Case

Imagine you work for a company that has M different products and N clients. You have been provided by a matrix where each row represents a client and each column, a product. The elements of this matrix are the quantity of the corresponding product bought last year by the client. If you are asked to find clusters of clients, which distribution would you use?

R: POISSON DISTRIBUTION

show_digit <- function(arr256, col=gray(4:1/4), ...) {
  arr256 <- as.numeric(arr256)
  image(matrix(arr256, nrow=16)[,16:1],col=col,...)
}


# Apply `glimpse` to the data
glimpse(data)
## Rows: 1,593
## Columns: 266
## $ V1   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V2   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V3   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ V4   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ V5   <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,...
## $ V6   <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0,...
## $ V7   <int> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V8   <int> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V9   <int> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0,...
## $ V10  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,...
## $ V11  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,...
## $ V12  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,...
## $ V13  <int> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,...
## $ V14  <int> 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ V15  <int> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ V16  <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V17  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V18  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ V19  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,...
## $ V20  <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,...
## $ V21  <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V22  <int> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0,...
## $ V23  <int> 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ V24  <int> 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ V25  <int> 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ V26  <int> 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0,...
## $ V27  <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,...
## $ V28  <int> 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0,...
## $ V29  <int> 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ V30  <int> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,...
## $ V31  <int> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ V32  <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V33  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ V34  <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,...
## $ V35  <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V36  <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,...
## $ V37  <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,...
## $ V38  <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ V39  <int> 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ V40  <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ V41  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ V42  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ V43  <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ V44  <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0,...
## $ V45  <int> 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,...
## $ V46  <int> 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1,...
## $ V47  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1,...
## $ V48  <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ V49  <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,...
## $ V50  <int> 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V51  <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,...
## $ V52  <int> 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,...
## $ V53  <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V54  <int> 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V55  <int> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V56  <int> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V57  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V58  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V59  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V60  <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ V61  <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1,...
## $ V62  <int> 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,...
## $ V63  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,...
## $ V64  <int> 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ V65  <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0,...
## $ V66  <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,...
## $ V67  <int> 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,...
## $ V68  <int> 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ V69  <int> 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V70  <int> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V71  <int> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V72  <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V73  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V74  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V75  <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V76  <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V77  <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,...
## $ V78  <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,...
## $ V79  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ V80  <int> 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,...
## $ V81  <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V82  <int> 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,...
## $ V83  <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0,...
## $ V84  <int> 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V85  <int> 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V86  <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V87  <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V88  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V89  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V90  <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V91  <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V92  <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V93  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V94  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1,...
## $ V95  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ V96  <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0,...
## $ V97  <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,...
## $ V98  <int> 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,...
## $ V99  <int> 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ V100 <int> 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V101 <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V102 <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V103 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V104 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V105 <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V106 <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V107 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V108 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V109 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V110 <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,...
## $ V111 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,...
## $ V112 <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V113 <int> 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,...
## $ V114 <int> 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,...
## $ V115 <int> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ V116 <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V117 <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V118 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V119 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V120 <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V121 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V122 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V123 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V124 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V125 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V126 <int> 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ V127 <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V128 <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V129 <int> 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,...
## $ V130 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,...
## $ V131 <int> 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ V132 <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V133 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V134 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V135 <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V136 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V137 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V138 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V139 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V140 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V141 <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V142 <int> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V143 <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V144 <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V145 <int> 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,...
## $ V146 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,...
## $ V147 <int> 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ V148 <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V149 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V150 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V151 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V152 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V153 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V154 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V155 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V156 <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V157 <int> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ V158 <int> 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V159 <int> 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V160 <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0,...
## $ V161 <int> 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,...
## $ V162 <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,...
## $ V163 <int> 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,...
## $ V164 <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,...
## $ V165 <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,...
## $ V166 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V167 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V168 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V169 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V170 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V171 <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V172 <int> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ V173 <int> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V174 <int> 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0,...
## $ V175 <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V176 <int> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ V177 <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ V178 <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,...
## $ V179 <int> 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0,...
## $ V180 <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,...
## $ V181 <int> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ V182 <int> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V183 <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V184 <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V185 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V186 <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V187 <int> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ V188 <int> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ V189 <int> 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,...
## $ V190 <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V191 <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V192 <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ V193 <int> 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ V194 <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ V195 <int> 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0,...
## $ V196 <int> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,...
## $ V197 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1,...
## $ V198 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V199 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V200 <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V201 <int> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ V202 <int> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ V203 <int> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ V204 <int> 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,...
## $ V205 <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V206 <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V207 <int> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0,...
## $ V208 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V209 <int> 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0,...
## $ V210 <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ V211 <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1,...
## $ V212 <int> 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ V213 <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1,...
## $ V214 <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,...
## $ V215 <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ V216 <int> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ V217 <int> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ V218 <int> 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1,...
## $ V219 <int> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1,...
## $ V220 <int> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,...
## $ V221 <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V222 <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0,...
## $ V223 <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ V224 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V225 <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V226 <int> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1,...
## $ V227 <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,...
## $ V228 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1,...
## $ V229 <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ V230 <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,...
## $ V231 <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,...
## $ V232 <int> 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V233 <int> 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V234 <int> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V235 <int> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,...
## $ V236 <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1,...
## $ V237 <int> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,...
## $ V238 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V239 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V240 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V241 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V242 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V243 <int> 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V244 <int> 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ V245 <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0,...
## $ V246 <int> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,...
## $ V247 <int> 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V248 <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ V249 <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ V250 <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,...
## $ V251 <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ V252 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ V253 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V254 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V255 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V256 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V257 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,...
## $ V258 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ V259 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V260 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V261 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V262 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V263 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V264 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V265 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ V266 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
# Digit in row 50
show_digit(data[50,])
## Warning in matrix(arr256, nrow = 16): la longitud de los datos [266] no es un
## submúltiplo o múltiplo del número de filas [16] en la matriz

# Digit in row 100
show_digit(data[100,])
## Warning in matrix(arr256, nrow = 16): la longitud de los datos [266] no es un
## submúltiplo o múltiplo del número de filas [16] en la matriz

Estimation given the probabilities

Parameters estimation for mixture models is not an easy task. But if you are provided with the probabilities of belonging to each cluster, the estimation of the means and the proportions of the clusters is not so difficult.

random_numbers <- c(rnorm(5000, 5, 2), rnorm(5000))
probs <- seq(10000, 1, by=-1) / 10000

gaussian_sample_with_probs <- data.frame(
  x = random_numbers,
  prob_cluster1 = probs
)  %>%  mutate(  prob_cluster2 = 1 - prob_cluster1) %>% 
  arrange(desc(prob_cluster1))
head(gaussian_sample_with_probs)
##             x prob_cluster1 prob_cluster2
## 1  1.66212734        1.0000         0e+00
## 2 -0.04419671        0.9999         1e-04
## 3  8.14058740        0.9998         2e-04
## 4  5.14076818        0.9997         3e-04
## 5  1.19245685        0.9996         4e-04
## 6  5.11770965        0.9995         5e-04
# Estimation of the means
means_estimates <- gaussian_sample_with_probs %>% 
  summarise(mean_cluster1= sum(x*prob_cluster1)/sum(prob_cluster1),
            mean_cluster2 = sum(x*prob_cluster2)/sum(prob_cluster2))
means_estimates
##   mean_cluster1 mean_cluster2
## 1      3.711209      1.223994
# Estimation of the proportions
props_estimates <- gaussian_sample_with_probs %>% 
  summarise(props_cluster1 = mean(prob_cluster1),
            props_cluster2 = 1 - mean(prob_cluster1))
props_estimates
##   props_cluster1 props_cluster2
## 1        0.50005        0.49995
# Transform to a vector
means_estimates <- as.numeric(means_estimates)

# Plot histogram with means estimates
ggplot(gaussian_sample_with_probs) + geom_histogram(aes(x = x), bins = 100) +
  geom_vline(xintercept = means_estimates)

Calculating the probabilities

gaussian_sample <- gaussian_sample_with_probs %>% select(x)

# Create data frame with probabilities
gaussian_sample_with_probs <- gaussian_sample %>% 
  mutate(prob_from_cluster1 = 0.35 * dnorm(x, mean = 10, sd = 10),
         prob_from_cluster2 = 0.65 * dnorm(x, mean = 50, sd = 10),
         prob_cluster1 = prob_from_cluster1/(prob_from_cluster1 + prob_from_cluster2),
         prob_cluster2 = prob_from_cluster2/(prob_from_cluster1 + prob_from_cluster2)) %>%
  select(x, prob_cluster1, prob_cluster2) 
         
head(gaussian_sample_with_probs)
##             x prob_cluster1 prob_cluster2
## 1  1.66212734     0.9999778  2.218416e-05
## 2 -0.04419671     0.9999888  1.121060e-05
## 3  8.14058740     0.9997040  2.960382e-04
## 4  5.14076818     0.9999108  8.918990e-05
## 5  1.19245685     0.9999816  1.838461e-05
## 6  5.11770965     0.9999116  8.837112e-05

EM algorithm

Iteration 0: initial parameters

  • Assume variances are equal to 1.
# Arbitrary mean
means_init <- c(1,2)
means_init
## [1] 1 2
# Arbitrary proportions
props_init <- c(0.5, 0.5)
props_init
## [1] 0.5 0.5
expectation <- function(data, means, proportions, sds){
  # Estimate the probabilities
  exp_data <- data %>% 
    mutate(prob_from_cluster1 = proportions[1] * dnorm(x, mean = means[1], sd = sds[1]),
           prob_from_cluster2 = proportions[2] * dnorm(x, mean = means[2], sd = sds[2]),
           prob_cluster1 = prob_from_cluster1/(prob_from_cluster1 + prob_from_cluster2),
           prob_cluster2 = prob_from_cluster2/(prob_from_cluster1 + prob_from_cluster2)) %>% 
    select(x, prob_cluster1, prob_cluster2)
    
  # Return data with probabilities
  return(exp_data)
}



maximization <- function(data_with_probs){
  means_estimates <- data_with_probs %>%
    summarise(mean_1 = sum(x * prob_cluster1) / sum(prob_cluster1),
              mean_2 = sum(x * prob_cluster2) / sum(prob_cluster2)) %>% 
    as.numeric()
  props_estimates <- data_with_probs %>% 
    summarise(proportion_1 = mean(prob_cluster1),
              proportion_2 = 1 - proportion_1) %>% 
    as.numeric()
  list(means_estimates, props_estimates)   
}

EM iteration

means_init <- c(0, 100)
props_init <- c(0.5, 0.5)

# Iterative process
for(i in 1:1){
  new_values <- maximization(expectation(gaussian_sample, means_init, props_init, c(10, 10)))
  means_init <- new_values[[1]]
  props_init <- new_values[[2]]
  cat(c(i, means_init, props_init), "\n")
}
## 1 2.467726 8.864184 1 0
fun_gaussian <- function(x, mean, proportion){
  proportion * dnorm(x, mean, sd = 10)
}

means_iter10 <- means_init
props_iter10 <- props_init

gaussian_sample %>% 
  ggplot() + geom_histogram(aes(x = x, y = ..density..), bins = 200) +
  stat_function(geom = "line", fun = fun_gaussian,
                args = list(mean = means_iter10[1], proportion = props_iter10[1])) +
  stat_function(geom = "line", fun = fun_gaussian,
                args = list(mean = means_iter10[2], proportion = props_iter10[2]))

## Univariate Gaussian Mixture models

To understand what type of mixture model we might use we must do EDA.

gender %>% 
  ggplot(aes(x=Weight)) + geom_histogram(bins=100)

Graphically we can identify Two clusters

Parameters:

  • Two means

  • Two Standard deviations

  • Two proportions

  • EM algorithm

set.seed(1515)
fit_mixture = flexmix(Weight ~ 1,
        data=gender,
        k = 2,
        model = FLXMCnorm1(),
        control = list(tol = 1e-15,
                       verbose = 0,
                       iter = 10000))
proportions <- prior(fit_mixture)
proportions
## [1] 0.5070334 0.4929666
comp_1 <- parameters(fit_mixture, component = 1)
comp_2 <- parameters(fit_mixture, component = 2)


parameters(fit_mixture)
##         Comp.1    Comp.2
## mean 186.61523 135.54712
## sd    19.96247  18.94874
fun_prop <- function(x, mean, sd, proportion){
  proportion * dnorm(x = x, mean = mean, sd = sd)
}


gender %>% 
  ggplot() + geom_histogram(aes(Weight, y = ..density..)) +
  stat_function(geom = "line", fun = fun_prop,
                args = list(mean = comp_1[1],
                            sd   = comp_1[2],
                            proportion = proportions[1])) +
   stat_function(geom = "line", fun = fun_prop,
                args = list(mean = comp_2[1],
                            sd   = comp_2[2],
                            proportion = proportions[2]))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Probability to belong to each cluster

posterior(fit_mixture) %>% head(20)
##            [,1]         [,2]
##  [1,] 0.9999931 6.850964e-06
##  [2,] 0.5578120 4.421880e-01
##  [3,] 0.9993998 6.002129e-04
##  [4,] 0.9997998 2.001852e-04
##  [5,] 0.9984504 1.549566e-03
##  [6,] 0.2455903 7.544097e-01
##  [7,] 0.9618190 3.818101e-02
##  [8,] 0.7318215 2.681785e-01
##  [9,] 0.8912543 1.087457e-01
## [10,] 0.3626267 6.373733e-01
## [11,] 0.9735642 2.643577e-02
## [12,] 0.9994832 5.168492e-04
## [13,] 0.7085447 2.914553e-01
## [14,] 0.9822151 1.778490e-02
## [15,] 0.9729320 2.706796e-02
## [16,] 0.8298096 1.701904e-01
## [17,] 0.9930274 6.972603e-03
## [18,] 0.8429897 1.570103e-01
## [19,] 0.9711951 2.880493e-02
## [20,] 0.9532200 4.677999e-02

Assign to cluster

clusters(fit_mixture) %>% head() 
## [1] 1 1 1 1 1 2
table(gender$Gender, clusters(fit_mixture))
##         
##             1    2
##   Female  500 4500
##   Male   4556  444

bivariate Gaussian Mixture models

Different to univariate GMM, in the bivariate case now we must estimate the means in two dimensions and the standard deviation and de covariance between the two variables (covariance matrix), and the proportions.

Bivariate GMM without joint variability between variables

set.seed(1313)
fit_with_covariance <- flexmix(cbind(Weight,BMI) ~ 1,
               data = gender,
               k = 2, 
               model = FLXMCmvnorm(diag = TRUE), # Change to FaLSE to allow covariance between variables
               control = list(tolerance = 1e-15, iter.max = 1000))
# Get the parameters
comp_1 <- parameters(fit_with_covariance, component = 1)
comp_2 <- parameters(fit_with_covariance, component = 2)

# The means
mean_comp_1 <- comp_1[1:2]
mean_comp_1
## [1] 186.30915383   0.03914529
mean_comp_2 <- comp_2[1:2]
mean_comp_2
## [1] 133.23110205   0.03293304
# The covariance matrices
covariance_comp_1 <- matrix(comp_1[3:6], nrow = 2)
covariance_comp_1
##          [,1]         [,2]
## [1,] 366.8305 0.000000e+00
## [2,]   0.0000 4.071906e-06
covariance_comp_2 <- matrix(comp_2[3:6], nrow = 2)
covariance_comp_2
##          [,1]         [,2]
## [1,] 286.8994 0.000000e+00
## [2,]   0.0000 6.202358e-06
# Create ellipse curve 1
ellipse_comp_1 <- ellipse(x = covariance_comp_1, 
                          centre = mean_comp_1,
                          npoints = nrow(gender))
head(ellipse_comp_1)
##             x          y
## [1,] 219.4592 0.04263790
## [2,] 219.4384 0.04264010
## [3,] 219.4175 0.04264229
## [4,] 219.3967 0.04264448
## [5,] 219.3758 0.04264667
## [6,] 219.3549 0.04264886
# Create ellipse curve 2
ellipse_comp_2 <- ellipse(x = covariance_comp_2, 
                          centre = mean_comp_2,
                          npoints = nrow(gender))
head(ellipse_comp_2)
##             x          y
## [1,] 162.5479 0.03724356
## [2,] 162.5295 0.03724627
## [3,] 162.5110 0.03724897
## [4,] 162.4926 0.03725168
## [5,] 162.4741 0.03725438
## [6,] 162.4556 0.03725708
# Plot the ellipses
gender %>% 
  ggplot(aes(x = Weight, y = BMI)) + geom_point()+
  geom_path(data = data.frame(ellipse_comp_1), aes(x=x,y=y), col = "red") +
  geom_path(data = data.frame(ellipse_comp_2), aes(x=x,y=y), col = "blue")

# Check the assignments
table(gender$Gender, clusters(fit_with_covariance))
##         
##             1    2
##   Female  595 4405
##   Male   4723  277

Bernoulli mixture models

# Create the vector of probabilities
p_cluster_1 <- c(0.8, 0.8, 0.2, 0.9)
# Create the sample for each pixel
pixel_1 <- sample(c(0, 1), 100, replace = TRUE, 
                  prob = c(1-p_cluster_1[1], p_cluster_1[1]))
pixel_2 <- sample(c(0, 1), 100, replace = TRUE, 
                  prob = c(1-p_cluster_1[2], p_cluster_1[2]))
pixel_3 <- sample(c(0, 1), 100, replace = TRUE, 
                  prob = c(1-p_cluster_1[3], p_cluster_1[3]))
pixel_4 <- sample(c(0, 1), 100, replace = TRUE, 
                  prob = c(1-p_cluster_1[4], p_cluster_1[4]))
# Combine the samples
sample_cluster_1 <- cbind(pixel_1, pixel_2, pixel_3, pixel_4)
# Have a look to the sample
head(sample_cluster_1)
##      pixel_1 pixel_2 pixel_3 pixel_4
## [1,]       1       0       1       1
## [2,]       1       1       0       1
## [3,]       1       1       0       1
## [4,]       1       1       1       1
## [5,]       1       1       0       1
## [6,]       1       1       0       1
data_matrix <- as.matrix(data[])


show_digit(data_matrix[4,])
## Warning in matrix(arr256, nrow = 16): la longitud de los datos [266] no es un
## submúltiplo o múltiplo del número de filas [16] en la matriz

show_digit(data_matrix[67,])
## Warning in matrix(arr256, nrow = 16): la longitud de los datos [266] no es un
## submúltiplo o múltiplo del número de filas [16] en la matriz

set.seed(1513)
# Fit Bernoulli mixture model
bernoulli_mix_model <- flexmix(data_matrix ~ 1,
                               k = 10,
                               model = FLXMCmvbinary(),
                               control = list(tolerance = 1e-15, iter.max = 1000))
# Check the proportions
prior(bernoulli_mix_model)
## [1] 0.07863443 0.08959832 0.11288063 0.15800281 0.17153762 0.20231154 0.08909212
## [8] 0.09794253
# Extract the parameters for each cluster
param_comp_1 <- parameters(bernoulli_mix_model, component = 1)
param_comp_2 <- parameters(bernoulli_mix_model, component = 2)
param_comp_3 <- parameters(bernoulli_mix_model, component = 3)
param_comp_4 <- parameters(bernoulli_mix_model, component = 4)
# Visualize the clusters
show_digit(param_comp_1)
## Warning in matrix(arr256, nrow = 16): la longitud de los datos [266] no es un
## submúltiplo o múltiplo del número de filas [16] en la matriz

show_digit(param_comp_2)
## Warning in matrix(arr256, nrow = 16): la longitud de los datos [266] no es un
## submúltiplo o múltiplo del número de filas [16] en la matriz

show_digit(param_comp_3)
## Warning in matrix(arr256, nrow = 16): la longitud de los datos [266] no es un
## submúltiplo o múltiplo del número de filas [16] en la matriz

show_digit(param_comp_4)
## Warning in matrix(arr256, nrow = 16): la longitud de los datos [266] no es un
## submúltiplo o múltiplo del número de filas [16] en la matriz

Poisson mixture models

Simulation

set.seed(1541)
# Create the vector of lambdas
lambda_1 <- c(150, 300, 50)
# Create the sample of each crime
assault_1 <- rpois(n = 10, lambda = lambda_1[1])
robbery_1 <- rpois(n = 10, lambda = lambda_1[2])
battery_1 <- rpois(n = 10, lambda = lambda_1[3])
# Combine the results
cities_1 <- cbind(assault_1, robbery_1, battery_1)
# Check the sample
cities_1
##       assault_1 robbery_1 battery_1
##  [1,]       155       287        56
##  [2,]       146       311        49
##  [3,]       151       304        40
##  [4,]       122       326        46
##  [5,]       149       283        48
##  [6,]       143       269        53
##  [7,]       142       295        52
##  [8,]       136       289        43
##  [9,]       153       297        54
## [10,]       163       311        58

Crimes Dataset

# Check the dimension
dim(crimes)
## [1] 77 13
# Check with glimpse
glimpse(crimes)
## Rows: 77
## Columns: 13
## $ COMMUNITY           <chr> "ALBANY PARK", "ARCHER HEIGHTS", "ARMOUR SQUARE...
## $ ASSAULT             <int> 123, 51, 74, 169, 708, 1198, 118, 135, 337, 63,...
## $ BATTERY             <int> 429, 134, 184, 448, 1681, 3347, 280, 350, 850, ...
## $ BURGLARY            <int> 147, 92, 55, 194, 339, 517, 76, 145, 327, 64, 1...
## $ CRIMINAL.DAMAGE     <int> 287, 114, 99, 379, 859, 1666, 150, 310, 528, 12...
## $ CRIMINAL.TRESPASS   <int> 38, 23, 56, 43, 228, 265, 29, 36, 88, 29, 27, 3...
## $ DECEPTIVE.PRACTICE  <int> 137, 67, 59, 178, 310, 767, 73, 200, 314, 89, 9...
## $ MOTOR.VEHICLE.THEFT <int> 176, 50, 37, 189, 281, 732, 58, 123, 411, 42, 6...
## $ NARCOTICS           <int> 27, 18, 9, 30, 345, 1456, 15, 22, 119, 10, 35, ...
## $ OTHER               <int> 107, 37, 48, 114, 584, 1261, 76, 88, 238, 22, 6...
## $ OTHER.OFFENSE       <int> 158, 44, 35, 164, 590, 1130, 94, 142, 339, 77, ...
## $ ROBBERY             <int> 144, 30, 98, 111, 349, 829, 65, 109, 172, 24, 5...
## $ THEFT               <int> 690, 180, 263, 461, 1201, 2137, 239, 669, 846, ...
# Transform into a matrix, without `community`
matrix_crimes <- crimes %>%
  select(-COMMUNITY) %>%  
  as.matrix()

# Check the first values
head(matrix_crimes)
##      ASSAULT BATTERY BURGLARY CRIMINAL.DAMAGE CRIMINAL.TRESPASS
## [1,]     123     429      147             287                38
## [2,]      51     134       92             114                23
## [3,]      74     184       55              99                56
## [4,]     169     448      194             379                43
## [5,]     708    1681      339             859               228
## [6,]    1198    3347      517            1666               265
##      DECEPTIVE.PRACTICE MOTOR.VEHICLE.THEFT NARCOTICS OTHER OTHER.OFFENSE
## [1,]                137                 176        27   107           158
## [2,]                 67                  50        18    37            44
## [3,]                 59                  37         9    48            35
## [4,]                178                 189        30   114           164
## [5,]                310                 281       345   584           590
## [6,]                767                 732      1456  1261          1130
##      ROBBERY THEFT
## [1,]     144   690
## [2,]      30   180
## [3,]      98   263
## [4,]     111   461
## [5,]     349  1201
## [6,]     829  2137
set.seed(2017)
# Fit the Poisson mixture model
poisson_mm <- stepFlexmix(matrix_crimes ~ 1, 
                                 k = 1:15, 
                                 nrep = 5, 
                                 model = FLXMCmvpois(),
                                 control = list(tolerance = 1e-15, iter.max = 1000))
## 1 : * * * * *
## 2 : * * * * *
## 3 : * * * * *
## 4 : * * * * *
## 5 : * * * * *
## 6 : * * * * *
## 7 : * * * * *
## 8 : * * * * *
## 9 : * * * * *
## 10 : * * * * *
## 11 : * * * * *
## 12 : * * * * *
## 13 : * * * * *
## 14 : * * * * *
## 15 : * * * * *
# Select the model that minimize the BIC
best_poisson_mm <- getModel(poisson_mm, which = "BIC")
# Get the parameters into a data frame
params_lambdas <- data.frame(parameters(best_poisson_mm))

# Add the column with the type of crime
params_lambdas_crime <- params_lambdas %>% 
  mutate(crime = colnames(matrix_crimes))

# Plot the clusters with their lambdas
params_lambdas_crime %>% 
  gather(cluster, lambdas, -crime) %>% 
  ggplot(aes(x = crime, y = lambdas, fill = crime)) + 
  geom_bar(stat = "identity") +
  facet_wrap(~ cluster) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), 
        legend.position = "none")

# Add the cluster assignments
crimes_with_clusters <- crimes %>% 
  mutate(cluster = factor(clusters(best_poisson_mm)))

# Enumerate the cluster's elements
crimes_with_clusters <- crimes_with_clusters %>% 
  group_by(cluster) %>% 
  mutate(number = row_number()) 

# Plot the clusters with the communities
crimes_with_clusters %>% 
  ggplot(aes(x = cluster, y = number, col = cluster)) + 
  geom_text(aes(label = COMMUNITY), size = 2.3)+
  theme(legend.position="none")