Numeric distributional data

Let’s load some data

library(HistDAWass)  # provides the BLOOD dataset and WH_hclust()
data<-BLOOD
plot(data)

Distributional dataset

A distributional dataset is a classical table with \(N\) observations on the rows and \(P\) variables on the columns, such that the generic term \(y_{ij}\) is a univariate numerical distribution

\[y_{ij}\sim f_{ij}(x_j)\] where \(x_j\in D_j \subset \Re\) and \(f_{ij}(x_j)\geq 0\).

The basic plot for the \(i\)-th observation

The \(i\)-th observation is the vector \[y_i=[y_{i1},\ldots,y_{ij},\ldots,y_{iP}]\]

Steps:

  1. Domain discretization

    • For continuous variables. For each variable \(Y_j\) we consider the domain \(D_j\) and, fixing an integer \(K_j\), we partition \(D_j\) into \(K_j\) equi-width intervals (bins), such that: \[D_j=\left\{ B_{jk}=(a_{jk},b_{jk}] \;\middle|\; b_{jk}>a_{jk},\ k=1,\ldots,K_j,\ \bigcup_{k=1}^{K_j}B_{jk}=[\min(D_j),\max(D_j)],\ B_{jk}\cap B_{jk'}=\emptyset \text{ for } k\neq k' \right\} \] (the first bin is taken closed on the left, so that \(\min(D_j)\) is included).
    • For discrete variables. For each variable \(Y_j\) we consider the domain \(D_j\) and, with \(\# D_j=K_j\) the cardinality of \(D_j\), we take the elements of \(D_j\) as the categories.
  2. Choice of a diverging colour palette. We consider a diverging colour palette with \(K_j\) levels, such that level \(1\) represents the lowest category and level \(K_j\) the highest one.

  3. Stacked percentage bar charts. We compute the mass observed in each bin/category for each \(y_{ij}\).

For the observation \(y_i\), \(P\) bars are generated. The order of the bars can be chosen by the user, or suggested by a preliminary correlation analysis of the whole dataset (for example, one may cluster the distributional variables with a hierarchical clustering based on the Wasserstein correlation and then use the order returned by the aggregation).

  4. Polar coordinates. Polar coordinates allow the stacked bar charts to be represented as circles that mimic an eye iris. We call this plot the Eye Iris plot (EI plot).
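The first three steps can be sketched in base R as follows. This is only an illustration under stated assumptions: each distribution is represented by a simulated sample, the domain extremes are hypothetical, and the helper `bin_masses()` is not part of any package.

```r
# Hypothetical data: one observation with P = 3 variables; each distribution
# is represented by a simulated sample (an assumption made for illustration).
set.seed(1)
obs <- list(
  Cholesterol = rnorm(500, mean = 180, sd = 25),
  Hemoglobin  = rnorm(500, mean = 12.5, sd = 0.8),
  Hematocrit  = rnorm(500, mean = 38, sd = 3)
)
# Assumed domain extremes [min(D_j), max(D_j)] of each variable.
domains <- list(
  Cholesterol = c(80, 270),
  Hemoglobin  = c(10.2, 15),
  Hematocrit  = c(30, 47)
)
K <- 10  # number of equi-width bins, here the same for every variable

# Steps 1 and 3: partition the domain and compute the mass falling in each bin.
bin_masses <- function(x, domain, K) {
  breaks <- seq(domain[1], domain[2], length.out = K + 1)
  x <- pmin(pmax(x, domain[1]), domain[2])  # clamp values outside the domain
  counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))
  as.numeric(counts) / length(x)
}
masses <- mapply(bin_masses, obs, domains, MoreArgs = list(K = K))
# masses is a K x P matrix whose columns each sum to 1.

# Step 2: any diverging palette with K levels (level 1 = lowest bin).
pal <- colorRampPalette(c("blue", "white", "red"))(K)

# Step 3: stacked percentage bar chart, one bar per variable.
barplot(masses, col = pal, border = NA, ylab = "Cumulated mass")
```

Clamping the values to the domain is a design choice of this sketch; with real histogram-valued data the masses would be read from the distributions themselves.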

NOTE: discretization can be done on standardized variables using the Wasserstein standard deviation
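As a rough sketch of what such a standardization could look like, the code below approximates the Wasserstein standard deviation of one distributional variable from quantile functions evaluated on a common grid. It assumes the definition in which the variance is the average squared \(L_2\) Wasserstein distance of the observations from their barycentre; the samples and all object names are hypothetical, and this is not the package implementation.

```r
# Each distribution is represented by a simulated sample (illustrative assumption).
set.seed(42)
samples <- list(rnorm(1000, 170, 20), rnorm(1000, 190, 30), rnorm(1000, 210, 15))

# Quantile functions evaluated on a common grid of probabilities.
p_grid <- seq(0.005, 0.995, by = 0.01)
Q <- sapply(samples, quantile, probs = p_grid)  # rows: probabilities, columns: observations

# Barycentre (mean distribution): average of the quantile functions.
Q_bar <- rowMeans(Q)

# Squared L2 Wasserstein distance of each observation from the barycentre,
# approximated by the mean squared gap between quantile functions.
w2 <- colMeans((Q - Q_bar)^2)

# Wasserstein standard deviation of the variable.
w_sd <- sqrt(mean(w2))

# Standardized quantile functions: centre on the barycentre mean, divide by w_sd.
Q_std <- (Q - mean(Q_bar)) / w_sd
```

After this transformation the barycentre has mean 0 and the variable has unit Wasserstein standard deviation, so a single set of bins can be shared across variables.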

Example using BLOOD data

The extremes of the domains of the variables

## [1] "Range of Cholesterol => [ 80 ; 270 ] "
## [1] "Range of Hemoglobin => [ 10.2 ; 15 ] "
## [1] "Range of Hematocrit => [ 30 ; 47 ] "

Choice of \(K\) and of a color palette

We fix \(K=50\) and we will use a color palette from Red (low values), passing through Yellow (middle values) to Green (high values).
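One possible way to build such a palette in base R is `colorRampPalette()`; the object names below are illustrative.

```r
K <- 50
# Interpolate K colours from red (lowest bin) through yellow to green (highest bin).
pal <- colorRampPalette(c("red", "yellow", "green"))(K)

# Quick preview of the K levels as a strip of adjacent bars.
barplot(rep(1, K), col = pal, border = NA, space = 0, axes = FALSE)
```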

Now, let’s take the first observation

Recode the distribution according to the \(K=50\) partition of the domains.
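A minimal sketch of the recoding, assuming the distribution is a histogram with uniform density inside each of its bins: the mass of each new bin is the difference of the cumulative distribution function at its two ends. The histogram below is hypothetical; the domain extremes are those reported for Cholesterol.

```r
# Hypothetical histogram-valued datum: bin bounds and the mass of each bin.
bounds <- c(150, 170, 190, 210, 230)
mass   <- c(0.1, 0.4, 0.3, 0.2)

# Target partition: K equi-width bins over the whole domain of the variable.
K <- 50
new_breaks <- seq(80, 270, length.out = K + 1)

# Piecewise-linear CDF of the histogram evaluated at the new break points;
# rule = 2 keeps the CDF at 0 below the support and at 1 above it.
cdf_at_breaks <- approx(x = bounds, y = c(0, cumsum(mass)),
                        xout = new_breaks, rule = 2)$y

# Mass of each new bin = difference of the CDF at its two ends.
recoded <- diff(cdf_at_breaks)
```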

Since the bins represent classes of values, we can consider them as ranked levels of the domain.

We propose to display all three distributions using a stacked percentage bar chart as follows. Note that each colour level has an area proportional to the mass associated with its bin.

The dashed line is drawn at level \(0.5\): the colour it crosses in each bar indicates the bin of the respective domain in which the median of that distribution falls.
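In terms of the bin masses, the bin crossed by the dashed line is the first one at which the cumulated mass reaches \(0.5\). A small sketch with hypothetical masses:

```r
# Hypothetical masses of the bins of one distribution, ordered from lowest to highest.
mass <- c(0.05, 0.15, 0.35, 0.30, 0.15)

# Cumulated mass along the stacked bar.
cum_mass <- cumsum(mass)

# Index of the bin that contains the median (first bin reaching 0.5).
median_bin <- which(cum_mass >= 0.5)[1]
```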

However, this kind of visualization does not make it easy to compare several observations. Let’s see an example:

For this reason, we propose a plot based on polar coordinates, adding a pupil to reduce the distortion due to the polar transformation, as follows:
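A minimal ggplot2 sketch of the idea, not the code used to produce the figures: the stacked bars are drawn with `geom_col()`, bent with `coord_polar()`, and the pupil is obtained by extending the radial axis below zero. The masses, the value of `pupil` and the object names are all hypothetical.

```r
library(ggplot2)

# Hypothetical bin masses for one observation: 3 variables, K = 5 bins each.
K <- 5
df <- data.frame(
  variable = rep(c("Cholesterol", "Hemoglobin", "Hematocrit"), each = K),
  bin      = rep(seq_len(K), times = 3),
  mass     = c(0.05, 0.15, 0.35, 0.30, 0.15,
               0.10, 0.30, 0.40, 0.15, 0.05,
               0.00, 0.10, 0.20, 0.40, 0.30)
)

pupil <- 0.4  # assumed size of the empty centre, in units of cumulated mass

ei <- ggplot(df, aes(x = variable, y = mass, fill = bin, group = bin)) +
  # reverse = TRUE stacks the lowest bin first, i.e. closest to the centre
  geom_col(width = 1, position = position_stack(reverse = TRUE)) +
  scale_fill_gradientn(colours = c("red", "yellow", "green")) +
  # a negative lower limit leaves a hole in the middle (the pupil)
  scale_y_continuous(limits = c(-pupil, 1.001)) +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  coord_polar(theta = "x") +
  theme_void()

print(ei)
```

Without the pupil the innermost bins would collapse towards the centre and look much smaller than their mass; starting the radius away from zero reduces that effect.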

Since humans readily perceive the shape and colour of eyes, we believe that this kind of visualization can be more interpretable. For example, let’s see all the 14 observations together.

Interpretation

According to the filling colours we can compare both observations and distributional values.

The Enriched plot

We propose to add information about dispersion and skewness.

The dispersion

Each variable in the dataset may have a different dispersion. The dispersion of each distribution \(y_{ij}\) is measured by its own standard deviation \(\sigma_{ij}\). We normalize each \(\sigma_{ij}\) by the maximum standard deviation observed for the \(j\)-th variable, \(\max_i(\sigma_{ij})\) with \(i=1,\ldots,N\). A segment, centered in the respective sector, shows the dispersion associated with each distribution.
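The normalization amounts to dividing each column of the \(N\times P\) matrix of standard deviations by its maximum. A sketch with hypothetical values:

```r
# Hypothetical N x P matrix of standard deviations sigma_ij (rows: observations).
sigma <- matrix(c(20, 0.5, 2.0,
                  35, 0.8, 3.5,
                  28, 0.4, 1.4),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("Cholesterol", "Hemoglobin", "Hematocrit")))

# Divide each column by its maximum: the most dispersed distribution
# of each variable gets 1, the others a value in (0, 1).
sigma_norm <- sweep(sigma, 2, apply(sigma, 2, max), "/")
```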

The skewness

Each \(y_{ij}\) has its skewness computed as the third standardized moment \(\gamma_{ij}\).

We draw the skewness segment of \(y_{ij}\) outside the dashed circle if the skewness is positive, and inside it if the skewness is negative. The distance from the dashed circle represents the absolute value of the skewness index. If the segment is very close to the dashed circle, the distribution is almost symmetric.
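As an illustration, the third standardized moment can be approximated from a discretized distribution using the bin midpoints, and then turned into a signed offset from the dashed circle. The midpoint approximation, the radius `r_dashed` and the scaling factor `offset_scale` are assumptions of this sketch.

```r
# Hypothetical discretized distribution: bin midpoints and bin masses.
mid  <- c(160, 180, 200, 220)
mass <- c(0.1, 0.4, 0.3, 0.2)

# Mean, standard deviation and third standardized moment
# (midpoint approximation, ignoring the spread inside each bin).
mu   <- sum(mass * mid)
sdev <- sqrt(sum(mass * (mid - mu)^2))
skew <- sum(mass * (mid - mu)^3) / sdev^3

# Radial position of the skewness segment: outside the dashed circle
# when skew > 0, inside it when skew < 0.
r_dashed     <- 0.5   # assumed radius of the dashed circle
offset_scale <- 0.1   # hypothetical scaling of the skewness index
r_skew <- r_dashed + offset_scale * skew
```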

An example applied to Hierarchical clustering

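library(ggplot2)
library(ggdendro)
# ppgro is assumed to be the list of EI-plot grobs (one per observation) created earlier
res3<-WH_hclust(data,standardize = TRUE)
dendro_data<-ggdendro::dendro_data(res3)
# dendrogram skeleton with the leaf labels
hplo<-ggplot() +
  geom_segment(data = segment(dendro_data), 
               aes(x = x, y = y, xend = xend, yend = yend)
  ) +
  geom_text(data = label(dendro_data), 
            aes(x = x, y = y, label = label, hjust = 0), 
            size = 3
  ) 

# place the EI plot of each observation next to its leaf, following the dendrogram order
size_mult<-0.6
for(i in seq_along(res3$order)){
  hplo<-hplo+
    annotation_custom(ppgro[[res3$order[i]]], 
                      ymin=1.3-size_mult, 
                      ymax=1.3+size_mult, 
                      xmin=dendro_data$labels$x[i]-size_mult, 
                      xmax=dendro_data$labels$x[i]+size_mult)
}
# horizontal dendrogram, with extra room for the EI plots
hplo+
  coord_flip() +
  scale_y_reverse(expand = c(0.4, 0))+theme_void()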

In a PCA

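library(ggplot2)
library(tibble)
# Pca_res is assumed to be a PCA result computed earlier, with $ind$coord and $eig components;
# ppgro_en is assumed to be the list of enriched EI-plot grobs, one per observation
coord<-as_tibble(Pca_res$ind$coord)
# first factorial plane with the two axes
pca12<-ggplot(coord,aes(x=Dim.1,y=Dim.2))+
  geom_point()+
  geom_hline(yintercept = 0,color="grey")+
  geom_vline(xintercept = 0,color="grey")
# draw the enriched EI plot of each observation at its coordinates
size_mult<-0.5
for(i in seq_len(nrow(coord))){
  pca12<-pca12+
    annotation_custom(ppgro_en[[i]], 
                      ymin=coord$Dim.2[i]-size_mult, 
                      ymax=coord$Dim.2[i]+size_mult, 
                      xmin=coord$Dim.1[i]-size_mult, 
                      xmax=coord$Dim.1[i]+size_mult)
}
# axis labels report the percentage of explained variance
pca12+
  scale_y_continuous(expand = c(0.2, 0.2))+
  scale_x_continuous(expand = c(0.2, 0.2))+
  xlab(paste0("Dim. 1 (",round(Pca_res$eig[1,2],2),"%) "))+
  ylab(paste0("Dim. 2 (",round(Pca_res$eig[2,2],2),"%) "))+
  theme_minimal()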

Heatmaps for distributional data