Exploratory Data Analysis of
data described by empirical distributions
using the HistDAWass R Package

Antonio Irpino, PhD
University of Campania ‘’L. Vanvitelli’’
Dept. of Mathematics and Physics
Caserta, Italy
antonio.irpino@unicampania.it

logo

April, 23th, 2021

A brief introduction to data described by distributions

Classical variables

A numerical variable measures an object using a number.
A categorical variable measures an object using a category.

Symbolic variables

Useful for describing an object coming from an aggregation of low-level objects (e.g., a group or a class of individuals), or repeated measures on the same object.

A set-valued variable measures an object using a set of values (numbers or categories)
- An interval variable measures an object using an interval of real numbers.
- A multinominal (or multicategorical) variable measures an object using a set of categories.
- A modal variable measures an object using a set or a distribution of positive weights for the support of the variable

Symbolic Data Analysis and distributional data (Bock and Diday,2000)

The measurement done on an object for a variable may have several values: namely, data are, or might be, multi-valued.

Especially, if an object is an higher order statistical unit, namely, generalizes a set of individual measurements (a Region, a City, a market segment, a typology,… ). But, it is not only this!

Distributional variable

A distributional variable is a modal variable measuring an object using a distribution having a numeric support.

Distribution of what? Of frequencies, if aggregated raw data are described by a classic numeric variable.

What if support is categorical?

Compositional data

Compositional data approach provides some useful answers to the treatment of such data.

Note: in this presentation we do not treat compositional data.

Main approaches to the statistical treatment of such variables

Symbolic Data Analysis

Distributions are the representation of the summary of raw data (histograms, empirical cdf,..)

Functional Data Analysis

Data are represented by density functions, some specific smoothing splines (for example) guarantee to define some spaces of representation of data where classical operations are allowed.

Compositional Data Analysis

The Aitchison’s geometry of compositions is used. Some extensions to density valued data have been proposed using, for example, the concept of Bayes Spaces defined by Egozcue et al (2012), where distributions are points in an Hilbert space.

All the three approaches have PROS and CONS. Unfortunately, no well-grounded inference is defined on these data, so it is not always possible to justify the superiority of one approach against others.

Exploratory data analysis of distributional data

Extending the principles of classical EDA, we aim at analyze and investigate a distributional data set and summarize its main characteristics.

EDA is a detective work – numerical detective work – or counting detective work – or graphical detective work.
(J.W. Tukey, 1977)

The National Institute of Standards and Technology (NIST) describes EDA as an approach to data analysis, not a model, that uses these techniques:

Maximize insights into a dataset.

Uncover underlying structures.

Extract important variables.

Detect outliers and anomalies.

Test underlying assumptions.

Develop parsimonious models.

Determine optimal factor settings

Distributional observation

A distributional observation (datum) comes operationally from the aggregation/summarization/model of raw measurements to a frequency/density distribution. It can be identified by

Empirical distribution	Empirical CDF	Empirical QF

The empirical density function $y=f(x)\geq0\,\textrm{ such that }\,\int\limits_{-\infty}^{+\infty}f(x)dx=1$,
the empirical cumulative distribution function $0\leq F(x_0)=\int\limits_{-\infty}^{x_0}f(x)dx \leq 1$ , and
the empirical quantile function $Q(F(x_0))=x_0$ or $Q=F^{-1}$.

They are in one-to-one relation.

Basic statistics for a single distributional variable

It is not straightforward to define basic statistics of a single distributional variable.

Let’s consider the concept of mean.

Several approaches in the literature. If $Y=\left\{ y_i|i\in \Omega \right\}$ are the observed data:

Wald (1950): given an aggregation function of errors and an error function for data, the mean is the value/object that minimize the function of errors. \[m=\arg \min_g \Phi\left(e(g,y_i)\right) \]
Frechét (1948): (a particular case of Wald) the mean is a value/object that minimizes the sum ($\Phi(.)$) of squared distances ($e(g,y_i)$). \[m=\arg \min_g \sum_id^2(g,y_i) \]
Chisini (1929): given a function of data, the mean is a value that, if substituted to the original data, returns the same function value. \[f(Y)=f(m)\]

Note: it is desirable that $m$ has the same nature of $y_i$’s

The mean of a set of distributions

Using Frechét approach we need to define a squared distance.

Several distances exist between distributions, among them we choose two Euclidean distances:

The Euclidean distance between density functions, or cdf’s (let’s say a vertical distance) \[ d_{E1}^2(y_i,y_{i'})= \int\limits_{-\infty}^{+\infty} \left[ f_i(x)-f_{i'}(x)\right]^2dx\,\, \mathrm{ or }\,\, d_{E2}^2(y_i,y_{i'})= \int\limits_{-\infty}^{+\infty} \left[ F_i(x)-F_{i'}(x)\right]^2dx\]
The Euclidean distance between quantile functions (let’s say an horizontal distance distance) \[ d_{W}^2(y_i,y_{i'})= \int\limits_{0}^{1} \left[ Q_i(p)-Q_{i'}(p)\right]^2dp\] (aka, squared $L2$ Wasserstein distance, or Mallow’s distance)

The mean distribution using Euclidean distance

Both $d^2_{E1}(\cdot,\cdot)$ and $d^2_{E2}(\cdot,\cdot)$ can be considered in the Frechét scheme and they lead to a mean distribution that is a mixture of the input ones. \[m=\arg \min_g \sum_i d^2_{E1}(g,y_i) =\arg \min_g \sum_i d^2_{E2}(g,y_i)\] $The mean using d^2_{E1}(\cdot,\cdot) and d^2_{E2}(\cdot,\cdot)$

The mean distribution using $L2$ Wass. distance

Using $d^2_{W}(\cdot,\cdot)$ in the Frechét formula, we obtain a different mean distribution with a different interpretation.

\[m=\arg \min_g \sum_i d^2_{W}(g,y_i) \] $The mean using d^2_{E1}(\cdot,\cdot) and d^2_{W}(\cdot,\cdot)$

An interesting interpretative property of $d^2_W(\cdot,\cdot)$

Being:

$\mu_1$ and $\mu_2$ the respective means of $y_1$ and $y_2$,
$\sigma_1$ and $\sigma_2$, their respective standard deviations, and
$\rho_{1,2}$ the Pearson correlation between the quantile functions of $y_1$ and $y_2$,

\[d^2_W (y_1,y_2)= \underbrace{\left(\mu_1-\mu_2\right)^2}_{Location}+\underbrace{\underbrace{\left(\sigma_1-\sigma_2\right)^2}_{Scale}+\underbrace{2\sigma_1 \sigma_2 (1-\rho_{1,2})}_{Shape}}_{Internal\,variability}\]

We used this property (Irpino and Romano, 2006) for developing a set of basic statistics and exploratory methods for distributional variables that are implemented in the HistDAWass package, freely available in R.

Visually

Decomposition of $d^2_W$

Order statistics and other types of means (geometric, harmonic, and so on)

Since we are dealing with set-valued descriptions, it is not straighforward to generalize these basic statistics

Order statistics: median, quartiles and quantiles of a distributional variable. It is not possible to define a general and natural ordering among set-valued data. Some interesting approaches could be borrowed from the functional data analysis and the depth concept generalized for them. Depth is the degree of centrality of a curve, thus the median of a set of curves is the most central one.
Other types of means requires the definition of mathematical operations (product, power,…) between distributions (or their related functions).

Dispersion measures

From now and ahead we will use $d^2_W$ as reference, but if anyone is interested in the implication of using $d^2_{E1}$ or $d^2_{E2}$, can refer to some resources listed at end of the presentation.

We focus on the so-called Frechét dispersion, which is very similar to the sum of square error SSQ we use for computing standard deviation.

Dispersion for classical data

A classical variable dispersion is a measure of the difference/heterogeneity of the observed values. Note: A single datum has null internal-dispersion.

Dispersion for aggregate data.

The dispersion of a variable describing objects coming from the aggregation of raw classical data has, at least, two sources of variability: one inherited by the aggregation, and one coming from the aggregate descriptions.

Assumpions about the dispersion of a distributional variable

A measure of dispersion of a single distributional variable should quantify how much different are the observed data.

As consequence, a set of identical distributions should produce a null measure of dispersion.

A desired interpretative aid
If two distributional variables have the same dispersion, is it desirable to interpret if there is a difference due to different variability induced by the set-valued data?

The SSQ of a distributional variable

Given $m$ as the mean distribution: $SSQ(Y)=\sum\limits_i d^2_W(m,y_i)$

from which we defined the variance $VAR(Y)=n^{-1}SSQ(Y)$

and the standard deviation of $Y$: $S(Y)=\sqrt{VAR(Y)}$ .

They are positive values and are equal to zero when the $y_i$’s distributions are the same.

SSQ interpretation

$SSQ(Y)$, thanks to the decomposition of presented above, can be expressed as the sum of (at least) two independent components: \[SSQ(Y)=\underbrace{\sum\limits_i\left(\mu_i-\mu_m\right)^2}_{Location}+\underbrace{\underbrace{\sum\limits_i\left(\sigma_i-\sigma_m\right)^2}_{Scale}+\underbrace{2\sigma_m\sum\limits_i\sigma_i (1-\rho_{i,m})}_{Shape}}_{Internal\,variability}\] Using this decomposition one can interpret the components of a distributional variable and use it for other analysis (we used this in clustering methods, for example).

Association measures

Let’s focus only on the covariance and Pearson’s correlation. Some of the well-known properties of the covariance and correlation indexes are:

The covariance between two identical variables corresponds to the variance of one of the two variables.
A classical variable correlates perfectly with itself.

Using the product of quantile functions that induces the L2 Wasserstein metric, we developed an extension of the covariance and of correlation indexes for a pair of distributional variables.

PRO:

the covariance satisfies the two above-mentioned properties
the covariance can be decomposed as SSQ, even if the terms can be both positive and negative
some other properties of the classical covariance is satisfied.

CONS:

the linear combination of distributional data, when working with quantile functions, is not easy to model. Indeed, since quantile functions are increasing functions, generally, only conic combinations are allowed.
the interpretation of linear combination of functions.

The `HistDAWass` package for the analysis of distributional (histogram) data

All the above mentioned statistics can be computed for distributions having finite the first two moments. Empirical distributions have this property. Further, it suffices that data can be represented by a cdf (or a qf) that is a $C^0$-class function (e.g.,piece-wise linear functions).

In 2015, we started to develop an package called HistDAWass

Histogram
Data
Analysis
using Wasserstein metric

The package can be downloaded from CRAN:

install.packages("HistDAWass") #install
library("HistDAWass") #use it in your environment

HistDAWass Package on CRAN

Outline

The package and how to install it
Main classes and methods
From raw data to histograms
Histogram-valued data table
Distances for histogram-valued data: the $L_2$ Wasserstein distance
Basic univariate statistics for histogram variables
Association measure for histogram variables

Classes of `HistDAWass`

Classes and methods were implemented in the package using S4

The main motivation is that the S4 paradigm is object-oriented, while S3 does not.

Main classes of the package

distributionH – the class representing a histogram-valued data (HD);
MatH – a matrix of HD;
TdistributionH – an HD with a timestamp, or a time interval (useful for the analysis of histogram time series)
TMatH – a matrix of HD’s with a timestamp, or a time interval;
HTS – a Histogram Time Series (an ordered list of TdistributionH);

The `distributionH` class: a class for describing and manipulating histogram-valued data

A class describing a 1-d histogram

mydist=distributionH(x=c(10,20,50,60,70,120),
                     p=c(0,0.1, 0.35,0.7,0.9,1))
str(mydist)
## Formal class 'distributionH' [package "HistDAWass"] with 4 slots
##   ..@ x: num [1:6] 10 20 50 60 70 120
##   ..@ p: num [1:6] 0 0.1 0.35 0.7 0.9 1
##   ..@ m: num 52
##   ..@ s: num 21.8

Show function

mydist 
##              X            p
## Bin_1  [ 10 ; 20 )          0.1
## Bin_2  [ 20 ; 50 )         0.25
## Bin_3  [ 50 ; 60 )         0.35
## Bin_4  [ 60 ; 70 )          0.2
## Bin_5 [ 70 ; 120 ]          0.1
## 
##  mean =  52   std  =  21.8174242292714 
##

Some options for plotting a distributionH object

plot(mydist, type="HISTO", col="red", border="blue")
#plots a density approximation for mydist
plot(mydist, type="DENS", col="red", border="blue") 
plot(mydist, type="HBOXPLOT") #plots a horizontal boxplot for mydist
plot(mydist, type="VBOXPLOT") #plots a vertical boxplot for mydist
#plots the cumulative distribution function of mydist
plot(mydist, type="CDF") 
#plots the quantile function of mydist
plot(mydist, type="QF")

How to retrieve some basic information from distributionH objects

Obtaining the histogram and the CDF of a distributionH object

mydist.histo<-get.histo(mydist) #this returns the histogram
mydist.cdf<-get.distr(mydist) #this returns the CDF
#into data.frame objects
mydist.histo
##   min.x max.x    p
## 1    10    20 0.10
## 2    20    50 0.25
## 3    50    60 0.35
## 4    60    70 0.20
## 5    70   120 0.10
mydist.cdf
##     x    p
## 1  10 0.00
## 2  20 0.10
## 3  50 0.35
## 4  60 0.70
## 5  70 0.90
## 6 120 1.00

Obtaining a single quantile or a probability from a distributionH object

# computes the CDF value for x=25
compP(object = mydist,q = 25) 
## [1] 0.1416667
# computes the quantile  for p=0.13
compQ(object = mydist,p = 0.13) 
## [1] 23.6

Other basic statistics for distributionH objects

meanH(mydist) #computes the mean
## [1] 52
stdH(mydist) #computes the standard deviation
## [1] 21.81742
skewH(mydist) #computes the 3rd stand. centr. moment
## [1] 0.3693756
kurtH(mydist) #computes the 4th stand. centr. moment
## [1] 3.701504

Being $Q(p)$ a quantile function, the four measures are, respectively, the histogram version of the following formulas (Gilchrist,2000):
\[\mu=\int\limits_0^1Q(p)dp;\quad \sigma=\sqrt{\int\limits_0^1Q(p)^2dp-\mu^2},\] \[sk=\int\limits_0^1\left(\frac{Q(p)-\mu}{\sigma}\right)^3dp; \quad ku=\int\limits_0^1\left(\frac{Q(p)-\mu}{\sigma}\right)^4dp.\]

The histogram trick

We can consider the histogram as a weighted mixture of $k$ (the number of classes or bins of a hstogram) disjointed uniform s.
The CDF of a histogram is piece-wise linear curve (the quantile function, too)
Even if a set of histograms have different bins and CDF levels, it is possible to recode them such that they share the same CDF levels, without modifing their distributions (registration).

It is possible to show that all the previous and the next statistics for histogram variables have exact solutions (namely the integrals are exactly computed), and have a time complexity depending from the number of bins only.

The histogram trick: the mean and the standard deviation

We have the distribution $H=\{([a_1,b_1],\pi_1),\ldots,([a_k,b_k],\pi_k)\}$. Each bin can be described in terms of center and radii: $c_{\ell}=\frac{a_{\ell}+b_{\ell}}{2}$ and $r_{\ell}=\frac{b_{\ell}-a_{\ell}}{2}$. The mean and the standard deviation of a (single) histogram is computed as follows:

\[\mu=\int\limits_0^1Q(p)dp=\sum\limits_{\ell=1}^k\int\limits_{F_{\ell-1}}^{F_\ell}Q(p)dp=\sum\limits_{\ell=1}^k\pi_{\ell}\frac{a_{\ell}+b_{\ell}}{2} =\sum\limits_{\ell=1}^k\pi_{\ell}c_{\ell} \]

\[\sigma=\sqrt{\int\limits_0^1Q(p)^2dp-\mu^2}=\sqrt{\sum\limits_{\ell=1}^k\int\limits_{F_{\ell-1}}^{F_\ell}Q(p)^2dp-\mu^2}= \sqrt{\sum\limits_{\ell=1}^k\pi_{\ell}\left[c_{\ell}^2+\frac{1}{3}r_{\ell}^2\right] -\mu^2}\]

How to recode two hstograms such that thay have the same number of bins and the same masses?

 dist1<-distributionH(c(1,2,3),c(0, 0.4, 1))
 dist1
##           X         p
## Bin_1 [ 1 ; 2 )       0.4
## Bin_2 [ 2 ; 3 ]       0.6
## 
##  mean =  2.1   std  =  0.568624070307733 
## 
 dist2<-distributionH(c(7,8,10,15),c(0, 0.2, 0.7, 1))
 dist2
##             X           p
## Bin_1   [ 7 ; 8 )         0.2
## Bin_2  [ 8 ; 10 )         0.5
## Bin_3 [ 10 ; 15 ]         0.3
## 
##  mean =  9.75   std  =  2.09065380523255 
##

Before (green) and after (yellow) registration

The register method in action

 registered<-register(dist1,dist2) ## register the two distributions
 #returns two registered distributionH objects
 registered[[1]]
##             X           p
## Bin_1 [ 1 ; 1.5 )         0.2
## Bin_2 [ 1.5 ; 2 )         0.2
## Bin_3 [ 2 ; 2.5 )         0.3
## Bin_4 [ 2.5 ; 3 ]         0.3
## 
##  mean =  2.1   std  =  0.568624070307733 
## 
 registered[[2]]
##              X            p
## Bin_1    [ 7 ; 8 )          0.2
## Bin_2  [ 8 ; 8.8 )          0.2
## Bin_3 [ 8.8 ; 10 )          0.3
## Bin_4  [ 10 ; 15 ]          0.3
## 
##  mean =  9.75   std  =  2.09065380523255 
##

Computing the $L_2$ Wasserstein distance

Using register method, the (Squared) $L_2$ Wasserstein distance between two histograms exactly:

WassSqDistH(dist1,dist2) #computes the squared L2 Wass. distance
## [1] 61.03667
WassSqDistH(dist1,dist2,details=TRUE) #computes with details
##  SQ_W_dist   POSITION       SIZE      SHAPE        rQQ 
## 61.0366667 58.5225000  2.3165745  0.1975922  0.9168940

\[d_W^2(f,g)=\underbrace{\left(\mu_f-\mu_g\right)^2}_{Position}+\underbrace{\underbrace{\left(\sigma_f-\sigma_g\right)^2}_{Size}+\underbrace{2\sigma_f\sigma_g\left[1-\rho_{QQ}(f,g)\right]}_{Shape}}_{Variability}\]

A bit more on rQQ

The method rQQ computes the Pearson correlation index of two quantile functions. It is equal to $1$ when the distributions have the same shape (except for the means and the standard deviations).

Operations between distributionH objects

Sum (Between two qf’s, or a qf and a number)
Multiplication (Between a qf and a positive number)
Difference (When admissible, namely the result is a qf, between two qf’s, or a qf and a number)

From raw data to `distributionH` objects

data2hist, a function for converting raw data to histogram-valued one.

The `MatH`class: a histogram-valued data table

A data table of HD: the `MatH` object

A MatH object is a table (a matrix): each row is an individual each column is a histogram variable.

plot(BLOOD)

The `MatH` class and its initialization

MatH(x = list(new("distributionH")), nrows = 1, ncols = 1,
  rownames = NULL, varnames = NULL, by.row = FALSE)

- x is a list of distributionH objects - nrows is the number of rows (the individuals) - ncols is the number of columns (the variables) - rownames is a vector of strings with the labels of the individuals - varnames is a vector of strings with the labels of variables - by.row indicates if the matrix must be filled by row (TRUE) or by column (FALSE this is the default)

An example of creation of a new `MatH` object

##---- create a list of six distributionH objects
ListOfDist<-vector("list",6)
ListOfDist[[1]]<-distributionH(c(1,2,3),c(0, 0.4, 1))
ListOfDist[[2]]<-distributionH(c(7,8,10,15),c(0, 0.2, 0.7, 1))
ListOfDist[[3]]<-distributionH(c(9,11,20),c(0, 0.5, 1))
ListOfDist[[4]]<-distributionH(c(2,5,8),c(0, 0.3, 1))
ListOfDist[[5]]<-distributionH(c(8,10,15),c(0,  0.75, 1))
ListOfDist[[6]]<-distributionH(c(20,22,24),c(0, 0.12, 1))

## create a MatH object filling it by columns
MyMAT=MatH(x=ListOfDist,nrows=3,ncols=2,
  rownames=c("I1","I2","I3"), varnames=c("Var1","Var2"),by.row=FALSE)

#bulding an empty 10 by 4 matrix of histograms
empty.MAT=MatH(nrows=10,ncols=4)

`show` method

show(MyMAT) #or simply type MyMAT
## a matrix of distributions 
##  2  variables  3  rows 
##  each distibution in the cell is represented by the mean and the standard deviation 
##               Var1                     Var2          
## I1  [m= 2.1  ,s= 0.56862 ]   [m= 5.6  ,s= 1.6248 ]  
## I2  [m= 9.75  ,s= 2.0907 ]  [m= 9.875  ,s= 1.7515 ] 
## I3 [m= 12.75  ,s= 3.3323 ]  [m= 22.76  ,s= 0.86933 ]

`plot` method

#plots a matrix of histograms
plot(BLOOD, type="HISTO",  border="blue")

#plots a matrix of densities
plot(BLOOD, type="DENS",  border="blue")

#plots a matrix of boxplots
plot(BLOOD, type="BOXPLOT")

Functions for manipulating `MatH` objects

Useful for the histogram trick

registerMH(MyMat), returns a new MatH object with all the distributions transformed. All the distributions have the same number of bins each one containing the same mass (It is useful for computing exactly the basic statistics based on $L_2$ Wasserstein metric)

Fuctions for subsetting of for justapposing `MatH` objects

WH.bind(MAT1, MAT2, byrow=TRUE) attaches MAT2 on the right of MAT1. Naturally, the two MatH objects must have the same number of rows;
WH.bind(MAT1, MAT2, byrow=FALSE) attaches MAT2 under MAT1. Naturally, the two MatH objects must have the same number of columns;
[,] an overloaded method for subsetting a matrix. It returns a new MatH object.

Matrix operation between `MatH` objects

Methods based on $L_2$ Wasserstein norm for summing and multiplying matrices

WH.mat.sum performs a classic cell by cell sum. In particular, the result is a new MatH object having in each cell a distribution associated with the quantile function resulting from the sum of the corresponding quantile functions (It is a sum consistent with the Wasserstein metric). As usual, the matrix must have the same dimensions.

MAT.sum=WH.mat.sum(MyMAT1,MyMAT2)

WH.mat.prod performs the matrix multiplication of two MatH objects. It returns a matrix of numbers according to the dot product defined for two distributions and associated with the $L_2$ Wasserstein metric. It is possible also to consider trasposition of matrices.

MAT.prod=WH.mat.prod(MyMAT1,MyMAT2, traspose1=FALSE, traspose2=FALSE)
MAT.prod=WH.mat.prod(MyMAT1,MyMAT2, traspose1=TRUE, traspose2=FALSE)
MAT.prod=WH.mat.prod(MyMAT1,MyMAT2, traspose1=FALSE, traspose2=TRUE)

Methods for univariate and bivariate statistics of histogram variables: the mean distribution

WH.vec.mean computes a distributionH that is the mean distribution of a vector or a matrix of distributions (a MatH object). The mean distribution is computed accordingly to the sum based on the $L_2$ Wasserstein distance, namely, it is the distribution associated with the average quantile function of the quantile functions of the vector of distributions. It is possible also to assign weights to the elements of vetor in order to obtain a weighted mean.
WH.vec.sum same as WH.vec.mean, but computes only the sum.

WH.vec.mean(BLOOD[,2]) #returns the average distribution of 
# the Hemoglobin variable as a distributionH object

Methods for univariate and bivariate statistics of histogram variables: the variability and the association measures

These methods returns a matrix of numbers accordingly to the number of the compared variables. The formulas are those presented in [@IrpVer2015].

Sum of squares

WH.SSQ(MyMAT, w=w) computes the wheighted Sum of Squares and the Sum of products of the deviations from the means of a MatH object. The weights are associated to the rows, if missing, the weights are considered equal. The result is square matrix of dimension equal to the number of variables of the MatH object.
WH.SSQ2(MyMAT1, MyMAT2, w=w) computes the wheighted Sum of products of the deviations from the means of two MatH objects. The two matrices must have the same number of rows. The result is rectangular matrix with rows equal to the number of variables of the first MatH object and columns the number of variables of the first MatH object.

Using these it is possible to compute covariances and correlations.

Covariance and correlation measures

Variance, covariance and correlation

WH.var.covar(MyMAT, w=w) computes the variance-covariance matrix of a MatH object.
WH.var.covar2(MyMAT1, MyMAT2, w=w) computes a covariance matrix between two MatH objects.
WH.correlation(MyMAT, w=w) computes the correlation matrix of a MatH object.
WH.correlation2(MyMAT1, MyMAT2, w=w) computes the correlation matrix of two MatH objects.

WH.var.covar(BLOOD)
##             Cholesterol Hemoglobin  Hematocrit
## Cholesterol  388.137633 -5.0005134 -14.9202378
## Hemoglobin    -5.000513  0.2802215   0.8266558
## Hematocrit   -14.920238  0.8266558   2.9779786
WH.correlation(BLOOD)
##             Cholesterol Hemoglobin Hematocrit
## Cholesterol   1.0000000 -0.4794806 -0.4388560
## Hemoglobin   -0.4794806  1.0000000  0.9049264
## Hematocrit   -0.4388560  0.9049264  1.0000000

# the variance of a histogram variable
VAR.Choresterol=WH.var.covar(BLOOD[,1])
VAR.Choresterol
##             Cholesterol
## Cholesterol    388.1376
# the standard deviation
STD.Choresterol=sqrt(VAR.Choresterol)
STD.Choresterol
##             Cholesterol
## Cholesterol    19.70121

The `TdistributionH` class: a class for a histogram-valued data observed along time

`TdistributionH` class (Not completely developed!!)

It is essentially a specilized class of distributionH equipped with a time-related information. Being a distribution, we considered to specialize the class such that it can contains

a tstamp, a numeric value that represents a time stamp, or;
a period, a list containing two slots start and end, two numbers (for now, a time/date dormat will be chosen in the future), indicating the initial and the final period of observation. Indeed, we imagine that a histogram (or a distribution) is the set of values observed during a time period.

My.Time.histo  = new('TdistributionH',tstamp=1, 
                      x=dist1@x, p=dist1@p)
My.Period.histo= new('TdistributionH',period=list(start=1,end=3),
                      x=dist1@x, p=dist1@p)

The `HTS` class: a class for histogram time series

A `HTS` is a list of `TdistributionH`

Not a very sophisticated class, but it is under development. ### Initializing a HTS

MyHTS=new('HTS', epocs=10, ListOfTimedElements = ListOfTdistributions)

The list of TdistributionH objects are organized into a vector of dimension equal to epoc. The slot containing the vector is called data. If we need to see what is in the HTS we need to call elements from @data slot.

# RetHTS is a HTS of the returns five-minutes returns of DOLL/YEN changes 
# rates observed along 108 days (it is one of the dataset available in the pkg)
RetHTS@data[2]
## [[1]]
##                   X                 p
## Bin_1  [ -0.1 ; -0.06 )           0.03833
## Bin_2 [ -0.06 ; -0.03 )            0.1394
## Bin_3  [ -0.03 ; 0.03 )            0.7038
## Bin_4   [ 0.03 ; 0.06 )           0.09059
## Bin_5   [ 0.06 ; 0.12 ]           0.02787
## 
##  mean =  -0.00275259208362382   std  =  0.0342206481885284 
##

Main methods

subsetHTS(HTS, from, to) extracts a piece (in the time sense) of HTS
show(HTS) returns a list of values and basic statistics related to the distributions in the HTS (wee see the results in the next slide)
plot(HTS,...) shows a plot of the HTS, (we see examples in the next slides)

The `show` method

# RetHTS is a HTS of the returns five-minutes returns of DOLL/YEN changes 
# rates observed along 108 days (it is one of the dataset available in the pkg)
options(width = 300)
show(RetHTS)

##     Tstamp T_period_start T_period_end          mean        std         skew      kurt
## 1    38749          38749        38749 -0.0019860297 0.04224402  0.696353637  6.433505
## 2    38750          38750        38750 -0.0027525921 0.03422065  0.161710205  3.860462
## 3    38751          38751        38751 -0.0022545180 0.04252923  1.054658950  6.334018
## 4    38754          38754        38754 -0.0056181618 0.03660866 -0.287735490  4.077056
## 5    38755          38755        38755 -0.0108187759 0.04469958 -0.691931506  5.949878
## 6    38756          38756        38756 -0.0015679186 0.03932478  0.676302020  4.805117
## 7    38757          38757        38757 -0.0038849961 0.03521140  0.474190871  4.194543
## 8    38758          38758        38758 -0.0057090553 0.05180512 -0.108275213  4.937425
## 9    38761          38761        38761 -0.0048545204 0.03741751 -0.264327106  4.290259
## 10   38762          38762        38762 -0.0060801071 0.04140493  0.276796508  6.148626
## 11   38763          38763        38763 -0.0018814832 0.05313136 -0.151476128  9.303889
## 12   38764          38764        38764 -0.0060526111 0.03971651 -0.041866347  4.189392
## 13   38765          38765        38765 -0.0019454181 0.05084949  0.354931260  6.160713
## 14   38768          38768        38768 -0.0051052438 0.02793509  0.009895520  5.072376
## 15   38769          38769        38769 -0.0034561197 0.02806190  0.028340867  4.330262
## 16   38770          38770        38770 -0.0069337719 0.03726113 -0.602640682  5.229702
## 17   38771          38771        38771 -0.0106445600 0.05501497 -0.995706447  6.528730
## 18   38772          38772        38772 -0.0073090483 0.05280985 -0.254510473  7.056863
## 19   38775          38775        38775 -0.0092402357 0.04994473 -2.456202932 16.218037
## 20   38776          38776        38776 -0.0070208812 0.03425427 -0.170867642  3.533228
## 21   38777          38777        38777 -0.0047212118 0.04867714  0.296046999  5.833572
## 22   38778          38778        38778 -0.0047038051 0.03674708  0.240445851  6.362662
## 23   38779          38779        38779 -0.0064233064 0.05259779 -1.269210426 11.922934
## 24   38782          38782        38782 -0.0011817858 0.04103728  0.718818282  5.971927
## 25   38783          38783        38783 -0.0047734896 0.03857179 -0.141197527  5.804973
## 26   38784          38784        38784 -0.0048606093 0.03369879 -0.515454518  3.729712
## 27   38785          38785        38785 -0.0047037812 0.05545347 -0.295411527  8.626437
## 28   38786          38786        38786 -0.0007272310 0.05204268  0.707027815  8.276109
## 29   38789          38789        38789 -0.0064363434 0.03191699 -0.078502518  3.762579
## 30   38790          38790        38790 -0.0105923052 0.04003190 -0.199008788  3.885703
## 31   38791          38791        38791 -0.0069511893 0.03898982 -0.086497721  5.293574
## 32   38792          38792        38792 -0.0074389851 0.04508932 -0.935993029  7.833595
## 33   38793          38793        38793 -0.0075729608 0.04550285  0.000219057  4.890152
## 34   38796          38796        38796 -0.0030633416 0.04627440  0.093724269  5.230596
## 35   38797          38797        38797 -0.0017307386 0.04216918  0.617796078  5.686636
## 36   38798          38798        38798 -0.0052264546 0.03984408  0.118317228  5.678807
## 37   38799          38799        38799 -0.0010626723 0.04243500  2.076246948 15.385329
## 38   38800          38800        38800 -0.0087817694 0.05320716 -2.306995986 13.792963
## 39   38803          38803        38803 -0.0072534963 0.04214570 -1.085068170  4.974648
## 40   38804          38804        38804 -0.0002786968 0.05085144  1.087733108  8.830547
## 41   38805          38805        38805 -0.0051219166 0.04418986 -0.011704014  6.374511
## 42   38806          38806        38806 -0.0076828832 0.05040461 -1.028022895  9.496948
## 43   38807          38807        38807 -0.0060545003 0.05235379 -0.606038718  8.880585
## 44   38810          38810        38810 -0.0061090452 0.05826492 -0.395657824  6.925153
## 45   38811          38811        38811 -0.0055985506 0.04355256 -0.486466779  6.884867
## 46   38812          38812        38812 -0.0046689580 0.04185633  0.459999525  5.036577
## 47   38813          38813        38813 -0.0029616471 0.03672018 -0.042695220  4.208889
## 48   38814          38814        38814 -0.0041635993 0.04382331  0.448764601  7.897283
## 49   38817          38817        38817 -0.0043612930 0.03274848  0.187488595  4.668999
## 50   38818          38818        38818 -0.0057142672 0.03244375 -0.246011044  3.562023
## 51   38819          38819        38819 -0.0036585120 0.03779608  0.122059841  5.802111
## 52   38820          38820        38820 -0.0049477194 0.02559904  0.261164825  4.069244
## 53   38821          38821        38821 -0.0045724760 0.01996054  0.563223610  5.127033
## 54   38824          38824        38824 -0.0091635871 0.04472303  0.237533944  9.820100
## 55   38825          38825        38825 -0.0081532739 0.04128826 -0.939945016  7.934035
## 56   38826          38826        38826 -0.0046166875 0.04686141  0.537764380  5.537140
## 57   38827          38827        38827 -0.0029616389 0.03641048 -0.028545536  6.009552
## 58   38828          38828        38828 -0.0089010695 0.04042375  0.713792677  6.595900
## 59   38831          38831        38831 -0.0101929316 0.05653616 -1.690331963 11.243804
## 60   38832          38832        38832 -0.0028571058 0.04468332  0.083454199  5.849040
## 61   38833          38833        38833 -0.0060278319 0.04528931 -0.440157943  6.474160
## 62   38834          38834        38834 -0.0093705645 0.06338602 -3.139906218 21.884934
## 63   38835          38835        38835 -0.0048908788 0.03923458 -0.850739442  5.869432
## 64   38838          38838        38838 -0.0075879791 0.05844668 -0.987439643  9.724247
## 65   38839          38839        38839 -0.0040766233 0.04768498 -0.616224072  4.198066
## 66   38840          38840        38840 -0.0047901832 0.03987908 -0.187875569  4.626865
## 67   38841          38841        38841 -0.0050174516 0.04519448 -0.363053027  3.945153
## 68   38842          38842        38842 -0.0100362910 0.06474940 -3.844042297 30.346126
## 69   38845          38845        38845 -0.0069999632 0.04992085 -0.780782148  5.403221
## 70   38846          38846        38846 -0.0075086792 0.04351507 -0.119390300  4.367429
## 71   38847          38847        38847 -0.0098083081 0.05765520 -0.452096610  7.850489
## 72   38848          38848        38848 -0.0054180625 0.06859318 -1.126436266  8.279600
## 73   38849          38849        38849 -0.0075272232 0.06127886  0.827625388  6.821111
## 74   38852          38852        38852 -0.0041985656 0.06142446  0.529495001  4.612564
## 75   38853          38853        38853 -0.0087804475 0.05342095 -0.070909200  6.192405
## 76   38854          38854        38854 -0.0054529214 0.06511938  0.008851600  5.422810
## 77   38855          38855        38855 -0.0081010001 0.05995930 -0.677938766  6.368277
## 78   38856          38856        38856  0.0004364151 0.06453323  0.156490593  6.391666
## 79   38859          38859        38859 -0.0039895079 0.05471918 -0.333312689  4.294341
## 80   38860          38860        38860 -0.0028146368 0.05766031  0.328591403  5.971598
## 81   38861          38861        38861 -0.0048257411 0.06988168  0.611671245  4.677875
## 82   38862          38862        38862 -0.0083797562 0.04348548 -0.346286093  5.958471
## 83   38863          38863        38863 -0.0021897476 0.04553626  0.064172498  5.382289
## 84   38866          38866        38866 -0.0051851697 0.02586819 -0.160231468  4.313987
## 85   38867          38867        38867 -0.0056097161 0.05844088  0.599221592  5.366215
## 86   38868          38868        38868 -0.0041637175 0.05226989 -1.287457703  9.113494
## 87   38869          38869        38869 -0.0031184325 0.04704296 -0.807542954  6.421688
## 88   38870          38870        38870 -0.0099453839 0.06255275 -4.200081947 33.147580
## 89   38873          38873        38873 -0.0027655402 0.03808986  1.089654951  7.042168
## 90   38874          38874        38874 -0.0005419267 0.04214455  0.897775010  6.642544
## 91   38875          38875        38875 -0.0033971822 0.03736942 -0.093831117  5.687776
## 92   38876          38876        38876 -0.0036933386 0.04352173  0.834360678  8.806083
## 93   38877          38877        38877 -0.0062726803 0.04113597  0.501480836  7.990728
## 94   38880          38880        38880 -0.0043090723 0.03271514 -0.149798527  3.469891
## 95   38881          38881        38881 -0.0002787105 0.04565224  1.053359064  7.260356
## 96   38882          38882        38882 -0.0059789644 0.04801573 -1.297302438 14.176536
## 97   38883          38883        38883 -0.0067595325 0.04059745 -0.074733516  9.526191
## 98   38884          38884        38884 -0.0035948660 0.03331588 -0.289386820  4.557180
## 99   38887          38887        38887 -0.0036701835 0.03245845  0.535800150  6.465983
## 100  38888          38888        38888 -0.0074911337 0.03855696 -0.354157836  6.562610
## 101  38889          38889        38889 -0.0052787230 0.03079112  0.191272381  4.722468
## 102  38890          38890        38890 -0.0007491067 0.03511747  0.840005560  5.816567
## 103  38891          38891        38891 -0.0049817761 0.04259499 -0.213627962  7.646801
## 104  38894          38894        38894 -0.0061031761 0.03335348  0.516790674  5.481879
## 105  38895          38895        38895 -0.0040592063 0.03850145  0.089547785  4.773725
## 106  38896          38896        38896 -0.0051567757 0.02822681 -0.303361576  4.315656
## 107  38897          38897        38897 -0.0105069555 0.04670969 -2.181569432 12.221400
## 108  38898          38898        38898 -0.0077189437 0.04557718 -0.555615542  6.150324

The `plot` method (1/4)

#plots the first 20 elements RetHTS dataset
plot(subsetHTS(RetHTS,from=1,to=20))

The `plot` method (2/4)

# plots the first 80 elements RetHTS dataset divided in 
# pieces of 40 epocs
plot(subsetHTS(RetHTS,from=1,to=80), maxno.perplot=40)

The `plot` method (3/4)

# plots HTS elements RetHTS dataset divided in 
# pieces of 36 epocs
plot(RetHTS, maxno.perplot=36)

The `plot` method (4/4): plotting using boxplots

# plots HTS elements RetHTS dataset divided in 
# pieces of 54 epocs
plot(RetHTS, type="BOXPLOT", maxno.perplot=54)

Implemented methods for the analysis of `MatH` tables

Basic statistics distributional variables
- mean distribution of each variable
- SSQ, Variance and standard deviation (with decomposition too)
- Covariance, correlation (with decomposition too)
Clustering methods
- Hard clustering
  - Hierarchical clustering
  - K-means (aka, Dynamic clustering)
  - Adaptive K_means (variable and components are authomaticly weighted, globally or cluster-wisely)
  - Kohonen maps
  - Adaptive Kohonen maps
- Fuzzy clustering
  - Fuzzy c-means
  - Adaptive c-means
Regression analysis methods
- Two components model

Dimension reduction
- Principal components analysis on one distributional variable (It sounds strange, isn’t it?)
- Multiple PCA
Time series analysis for histogram-valued time series
- Moving averages
- Exponential smoothing
- K-NN prediction
Visualization tools based on ggplot
Some datasets are available in the package.
- Blood dataset (small) (14 rows, 3 variables)
- Age-sex population pyramids of World countries in 2014
- A climatic dataset of China provinces
- An Ozone dataset of some US stations
- A histogram time series of returns

Clustering methods

An application on a temperature dataset of USA

In this example, we use data of mean monthly temperatures observed in 48 states of US. Raw data are free available at the National Climatic Data Center website of US (http://www1.ncdc.noaa.gov/pub/data/cirs/). The original dataset drd964x.tmpst.txt contains the sequential Time Biased Corrected state climatic division monthly Average Temperatures recorded in the 48 (Hawaii and Alaska are not present in the dataset) states of US from 1895 to 2014.

R code about this example is available here

First of all you can access to data and R scripts from this link.

A sketch of the data

load("USA_TMP.RData")
plot(USA_TMP_MAT)

Performing a K_means (Partitive model)

DCA.5k.resu<-WH_kmeans(USA_TMP_MAT,k = 5,rep = 20)
# we consider the best result among 20 runs
DCA.5k.resu$solution$cardinality
IDX
Cl.1 Cl.2 Cl.3 Cl.4 Cl.5 
   6   11   12    7   12 
DCA.5k.resu$solution$centers
a matrix of distributions 
 12  variables  5  rows 
....
$solution$Crit
[1] 3860.546
$quality
[1] 0.9015709

The prototypes

DCA on USA k=5: the map

An example on hierarchical clustering on World countries poulation pyramids

We considered population age-sex pyramids data collected by the Census Bureau of USA in 2014.

A population pyramid is a common way to represent jointly the distribution of sex and age of people living in a given administrative unit (city, region or country, for instance).

In this dataset (available in the package with the name Age_Pyramids_2014), each country (228 countries) is represented by two histograms describing the age distribution for the male and the female population. Both distributions are represented by vertically juxtaposing, and the representation is similar to a pyramid. The shape of pyramids varies according to the distribution of the age in the population and it is related to the development of a country.

Hierarchical clustering of population pyramids

Work_data=Age_Pyramids_2014[2:229,2:3] #Take a part of the data
Hward=WH_hclust(Work_data,method = "ward.D2") #Do the dirty work
# cut dendrogram in 4 clusters
  hc=Hward
  hcd=as.dendrogram(hc)
  clusMember = cutree(hc, 4)
  labelColors = c("red",  "yellow", "green", "purple")
  # function to get color labels
  ...
  # using dendrapply
  clusDendro = dendrapply(hcd, colLab)
  # make plot
  clusDendro<-assign_values_to_leaves_nodePar(clusDendro, 0.5, "lab.cex")
  plot(clusDendro)

The dendrogram

Map and prototypes

Some plots from the Kohonen map method

Countries assigned to neurons

Neurons’ prototypes (pyramids)

A PCA on histograms: using the BLOOD dataset (some outputs)

my_pca<-WH.MultiplePCA(BLOOD,list.of.vars = c(1:3))

Plans for the future

Implement some supervised learning algorithms: Discriminant analysis, SVM. (Not so difficult)
Replicate the most of the data analysis algorithms when compositional data are mixed with distributional ones.
- Indeed, it is much more frequent to observe categories than numbers in the big data era. (It requires some methodological in- depth analysis)
Improve the graphical outputs using interactivity, too. (It requires a fast coder)
Include new analysis methods. (It requires research)
Improve computational times (It requires a software engineer)
…
Translate all in Python? (I am still thinking about it)

Thank you for the attention

Questions?

Antonio Irpino, PhD

antonio.irpino@unicampania.it

This presentation is publicly available here:

https://rpubs.com/A_Irpino/ABCD_webinar_DD

Exploratory Data Analysis of data described by empirical distributions using the HistDAWass R Package