April, 23th, 2021
Useful for describing an object coming from an aggregation of low-level objects (e.g., a group or a class of individuals), or repeated measures on the same object.
A set-valued variable measures an object using a set of values (numbers or categories)
The measurement done on an object for a variable may have several values: namely, data are, or might be, multi-valued.
Especially, if an object is an higher order statistical unit, namely, generalizes a set of individual measurements (a Region, a City, a market segment, a typology,… ). But, it is not only this!A distributional variable is a modal variable measuring an object using a distribution having a numeric support.
Distribution of what? Of frequencies, if aggregated raw data are described by a classic numeric variable.
Compositional data approach provides some useful answers to the treatment of such data.
Note: in this presentation we do not treat compositional data.
Distributions are the representation of the summary of raw data (histograms, empirical cdf,..)
Data are represented by density functions, some specific smoothing splines (for example) guarantee to define some spaces of representation of data where classical operations are allowed.
The Aitchison’s geometry of compositions is used. Some extensions to density valued data have been proposed using, for example, the concept of Bayes Spaces defined by Egozcue et al (2012), where distributions are points in an Hilbert space.
All the three approaches have PROS and CONS. Unfortunately, no well-grounded inference is defined on these data, so it is not always possible to justify the superiority of one approach against others.
Extending the principles of classical EDA, we aim at analyze and investigate a distributional data set and summarize its main characteristics.
EDA is a detective work – numerical detective work – or counting detective work – or graphical detective work.
(J.W. Tukey, 1977)
The National Institute of Standards and Technology (NIST) describes EDA as an approach to data analysis, not a model, that uses these techniques:
A distributional observation (datum) comes operationally from the aggregation/summarization/model of raw measurements to a frequency/density distribution. It can be identified by
| Empirical distribution | Empirical CDF | Empirical QF |
They are in one-to-one relation.
It is not straightforward to define basic statistics of a single distributional variable.
Let’s consider the concept of mean.
Several approaches in the literature. If \(Y=\left\{ y_i|i\in \Omega \right\}\) are the observed data:
Note: it is desirable that \(m\) has the same nature of \(y_i\)’s
Using Frechét approach we need to define a squared distance.
Several distances exist between distributions, among them we choose two Euclidean distances:
The Euclidean distance between density functions, or cdf’s (let’s say a vertical distance) \[ d_{E1}^2(y_i,y_{i'})= \int\limits_{-\infty}^{+\infty} \left[ f_i(x)-f_{i'}(x)\right]^2dx\,\, \mathrm{ or }\,\, d_{E2}^2(y_i,y_{i'})= \int\limits_{-\infty}^{+\infty} \left[ F_i(x)-F_{i'}(x)\right]^2dx\]
The Euclidean distance between quantile functions (let’s say an horizontal distance distance) \[ d_{W}^2(y_i,y_{i'})= \int\limits_{0}^{1} \left[ Q_i(p)-Q_{i'}(p)\right]^2dp\] (aka, squared \(L2\) Wasserstein distance, or Mallow’s distance)
Using \(d^2_{W}(\cdot,\cdot)\) in the Frechét formula, we obtain a different mean distribution with a different interpretation.
\[m=\arg \min_g \sum_i d^2_{W}(g,y_i) \]Being:
\[d^2_W (y_1,y_2)= \underbrace{\left(\mu_1-\mu_2\right)^2}_{Location}+\underbrace{\underbrace{\left(\sigma_1-\sigma_2\right)^2}_{Scale}+\underbrace{2\sigma_1 \sigma_2 (1-\rho_{1,2})}_{Shape}}_{Internal\,variability}\]
We used this property (Irpino and Romano, 2006) for developing a set of basic statistics and exploratory methods for distributional variables that are implemented in the
HistDAWasspackage, freely available in R.
\[d^2_W (y_1,y_2)= \underbrace{\left(\mu_1-\mu_2\right)^2}_{Location}+\underbrace{\underbrace{\left(\sigma_1-\sigma_2\right)^2}_{Scale}+\underbrace{2\sigma_1 \sigma_2 (1-\rho_{1,2})}_{Shape}}_{Internal\,variability}\]
Decomposition of \(d^2_W\)
Order statistics: median, quartiles and quantiles of a distributional variable. It is not possible to define a general and natural ordering among set-valued data. Some interesting approaches could be borrowed from the functional data analysis and the depth concept generalized for them. Depth is the degree of centrality of a curve, thus the median of a set of curves is the most central one.
Other types of means requires the definition of mathematical operations (product, power,…) between distributions (or their related functions).
From now and ahead we will use \(d^2_W\) as reference, but if anyone is interested in the implication of using \(d^2_{E1}\) or \(d^2_{E2}\), can refer to some resources listed at end of the presentation.
We focus on the so-called Frechét dispersion, which is very similar to the sum of square error SSQ we use for computing standard deviation.
A classical variable dispersion is a measure of the difference/heterogeneity of the observed values. Note: A single datum has null internal-dispersion.
The dispersion of a variable describing objects coming from the aggregation of raw classical data has, at least, two sources of variability: one inherited by the aggregation, and one coming from the aggregate descriptions.
A measure of dispersion of a single distributional variable should quantify how much different are the observed data.
As consequence, a set of identical distributions should produce a null measure of dispersion.
A desired interpretative aid
If two distributional variables have the same dispersion, is it desirable to interpret if there is a difference due to different variability induced by the set-valued data?
The SSQ of a distributional variable
Given \(m\) as the mean distribution: \(SSQ(Y)=\sum\limits_i d^2_W(m,y_i)\)
from which we defined the variance \(VAR(Y)=n^{-1}SSQ(Y)\)
and the standard deviation of \(Y\): \(S(Y)=\sqrt{VAR(Y)}\) .
They are positive values and are equal to zero when the \(y_i\)’s distributions are the same.
\(SSQ(Y)\), thanks to the decomposition of presented above, can be expressed as the sum of (at least) two independent components: \[SSQ(Y)=\underbrace{\sum\limits_i\left(\mu_i-\mu_m\right)^2}_{Location}+\underbrace{\underbrace{\sum\limits_i\left(\sigma_i-\sigma_m\right)^2}_{Scale}+\underbrace{2\sigma_m\sum\limits_i\sigma_i (1-\rho_{i,m})}_{Shape}}_{Internal\,variability}\] Using this decomposition one can interpret the components of a distributional variable and use it for other analysis (we used this in clustering methods, for example).
Let’s focus only on the covariance and Pearson’s correlation. Some of the well-known properties of the covariance and correlation indexes are:
Using the product of quantile functions that induces the L2 Wasserstein metric, we developed an extension of the covariance and of correlation indexes for a pair of distributional variables.
PRO:
CONS:
HistDAWass All the above mentioned statistics can be computed for distributions having finite the first two moments. Empirical distributions have this property. Further, it suffices that data can be represented by a cdf (or a qf) that is a \(C^0\)-class function (e.g.,piece-wise linear functions).
|
In 2015, we started to develop an
The package can be downloaded from CRAN: |
HistDAWass Package on CRAN |
HistDAWassThe main motivation is that the S4 paradigm is object-oriented, while S3 does not.
distributionH class: a class for describing and manipulating histogram-valued datamydist=distributionH(x=c(10,20,50,60,70,120),
p=c(0,0.1, 0.35,0.7,0.9,1))
str(mydist)
## Formal class 'distributionH' [package "HistDAWass"] with 4 slots
## ..@ x: num [1:6] 10 20 50 60 70 120
## ..@ p: num [1:6] 0 0.1 0.35 0.7 0.9 1
## ..@ m: num 52
## ..@ s: num 21.8plot(mydist, type="HISTO", col="red", border="blue")
#plots a density approximation for mydist
plot(mydist, type="DENS", col="red", border="blue")
plot(mydist, type="HBOXPLOT") #plots a horizontal boxplot for mydist
plot(mydist, type="VBOXPLOT") #plots a vertical boxplot for mydist
#plots the cumulative distribution function of mydist
plot(mydist, type="CDF")
#plots the quantile function of mydist
plot(mydist, type="QF")Obtaining the histogram and the CDF of a distributionH object
mydist.histo<-get.histo(mydist) #this returns the histogram
mydist.cdf<-get.distr(mydist) #this returns the CDF
#into data.frame objects
mydist.histo
## min.x max.x p
## 1 10 20 0.10
## 2 20 50 0.25
## 3 50 60 0.35
## 4 60 70 0.20
## 5 70 120 0.10
mydist.cdf
## x p
## 1 10 0.00
## 2 20 0.10
## 3 50 0.35
## 4 60 0.70
## 5 70 0.90
## 6 120 1.00Obtaining a single quantile or a probability from a distributionH object
Other basic statistics for distributionH objects
meanH(mydist) #computes the mean
## [1] 52
stdH(mydist) #computes the standard deviation
## [1] 21.81742
skewH(mydist) #computes the 3rd stand. centr. moment
## [1] 0.3693756
kurtH(mydist) #computes the 4th stand. centr. moment
## [1] 3.701504 Being \(Q(p)\) a quantile function, the four measures are, respectively, the histogram version of the following formulas (Gilchrist,2000):
\[\mu=\int\limits_0^1Q(p)dp;\quad \sigma=\sqrt{\int\limits_0^1Q(p)^2dp-\mu^2},\] \[sk=\int\limits_0^1\left(\frac{Q(p)-\mu}{\sigma}\right)^3dp; \quad ku=\int\limits_0^1\left(\frac{Q(p)-\mu}{\sigma}\right)^4dp.\]
It is possible to show that all the previous and the next statistics for histogram variables have exact solutions (namely the integrals are exactly computed), and have a time complexity depending from the number of bins only.
We have the distribution \(H=\{([a_1,b_1],\pi_1),\ldots,([a_k,b_k],\pi_k)\}\). Each bin can be described in terms of center and radii: \(c_{\ell}=\frac{a_{\ell}+b_{\ell}}{2}\) and \(r_{\ell}=\frac{b_{\ell}-a_{\ell}}{2}\). The mean and the standard deviation of a (single) histogram is computed as follows:
\[\mu=\int\limits_0^1Q(p)dp=\sum\limits_{\ell=1}^k\int\limits_{F_{\ell-1}}^{F_\ell}Q(p)dp=\sum\limits_{\ell=1}^k\pi_{\ell}\frac{a_{\ell}+b_{\ell}}{2} =\sum\limits_{\ell=1}^k\pi_{\ell}c_{\ell} \]
\[\sigma=\sqrt{\int\limits_0^1Q(p)^2dp-\mu^2}=\sqrt{\sum\limits_{\ell=1}^k\int\limits_{F_{\ell-1}}^{F_\ell}Q(p)^2dp-\mu^2}= \sqrt{\sum\limits_{\ell=1}^k\pi_{\ell}\left[c_{\ell}^2+\frac{1}{3}r_{\ell}^2\right] -\mu^2}\]
dist1<-distributionH(c(1,2,3),c(0, 0.4, 1))
dist1
## X p
## Bin_1 [ 1 ; 2 ) 0.4
## Bin_2 [ 2 ; 3 ] 0.6
##
## mean = 2.1 std = 0.568624070307733
##
dist2<-distributionH(c(7,8,10,15),c(0, 0.2, 0.7, 1))
dist2
## X p
## Bin_1 [ 7 ; 8 ) 0.2
## Bin_2 [ 8 ; 10 ) 0.5
## Bin_3 [ 10 ; 15 ] 0.3
##
## mean = 9.75 std = 2.09065380523255
## The register method in action
registered<-register(dist1,dist2) ## register the two distributions
#returns two registered distributionH objects
registered[[1]]
## X p
## Bin_1 [ 1 ; 1.5 ) 0.2
## Bin_2 [ 1.5 ; 2 ) 0.2
## Bin_3 [ 2 ; 2.5 ) 0.3
## Bin_4 [ 2.5 ; 3 ] 0.3
##
## mean = 2.1 std = 0.568624070307733
##
registered[[2]]
## X p
## Bin_1 [ 7 ; 8 ) 0.2
## Bin_2 [ 8 ; 8.8 ) 0.2
## Bin_3 [ 8.8 ; 10 ) 0.3
## Bin_4 [ 10 ; 15 ] 0.3
##
## mean = 9.75 std = 2.09065380523255
## Using register method, the (Squared) \(L_2\) Wasserstein distance between two histograms exactly:
WassSqDistH(dist1,dist2) #computes the squared L2 Wass. distance
## [1] 61.03667
WassSqDistH(dist1,dist2,details=TRUE) #computes with details
## SQ_W_dist POSITION SIZE SHAPE rQQ
## 61.0366667 58.5225000 2.3165745 0.1975922 0.9168940\[d_W^2(f,g)=\underbrace{\left(\mu_f-\mu_g\right)^2}_{Position}+\underbrace{\underbrace{\left(\sigma_f-\sigma_g\right)^2}_{Size}+\underbrace{2\sigma_f\sigma_g\left[1-\rho_{QQ}(f,g)\right]}_{Shape}}_{Variability}\]
A bit more on rQQ
The method rQQ computes the Pearson correlation index of two quantile functions. It is equal to \(1\) when the distributions have the same shape (except for the means and the standard deviations).
distributionH objectsdata2hist, a function for converting raw data to histogram-valued one.
MatHclass: a histogram-valued data tableMatH objectA MatH object is a table (a matrix): each row is an individual each column is a histogram variable.
MatH class and its initializationMatH(x = list(new("distributionH")), nrows = 1, ncols = 1,
rownames = NULL, varnames = NULL, by.row = FALSE) - x is a list of distributionH objects - nrows is the number of rows (the individuals) - ncols is the number of columns (the variables) - rownames is a vector of strings with the labels of the individuals - varnames is a vector of strings with the labels of variables - by.row indicates if the matrix must be filled by row (TRUE) or by column (FALSE this is the default)
MatH object##---- create a list of six distributionH objects
ListOfDist<-vector("list",6)
ListOfDist[[1]]<-distributionH(c(1,2,3),c(0, 0.4, 1))
ListOfDist[[2]]<-distributionH(c(7,8,10,15),c(0, 0.2, 0.7, 1))
ListOfDist[[3]]<-distributionH(c(9,11,20),c(0, 0.5, 1))
ListOfDist[[4]]<-distributionH(c(2,5,8),c(0, 0.3, 1))
ListOfDist[[5]]<-distributionH(c(8,10,15),c(0, 0.75, 1))
ListOfDist[[6]]<-distributionH(c(20,22,24),c(0, 0.12, 1))
## create a MatH object filling it by columns
MyMAT=MatH(x=ListOfDist,nrows=3,ncols=2,
rownames=c("I1","I2","I3"), varnames=c("Var1","Var2"),by.row=FALSE)
#bulding an empty 10 by 4 matrix of histograms
empty.MAT=MatH(nrows=10,ncols=4)show methodshow(MyMAT) #or simply type MyMAT
## a matrix of distributions
## 2 variables 3 rows
## each distibution in the cell is represented by the mean and the standard deviation
## Var1 Var2
## I1 [m= 2.1 ,s= 0.56862 ] [m= 5.6 ,s= 1.6248 ]
## I2 [m= 9.75 ,s= 2.0907 ] [m= 9.875 ,s= 1.7515 ]
## I3 [m= 12.75 ,s= 3.3323 ] [m= 22.76 ,s= 0.86933 ]plot methodMatH objectsregisterMH(MyMat), returns a new MatH object with all the distributions transformed. All the distributions have the same number of bins each one containing the same mass (It is useful for computing exactly the basic statistics based on \(L_2\) Wasserstein metric)MatH objectsWH.bind(MAT1, MAT2, byrow=TRUE) attaches MAT2 on the right of MAT1. Naturally, the two MatH objects must have the same number of rows;WH.bind(MAT1, MAT2, byrow=FALSE) attaches MAT2 under MAT1. Naturally, the two MatH objects must have the same number of columns;[,] an overloaded method for subsetting a matrix. It returns a new MatH object.MatH objectsWH.mat.sum performs a classic cell by cell sum. In particular, the result is a new MatH object having in each cell a distribution associated with the quantile function resulting from the sum of the corresponding quantile functions (It is a sum consistent with the Wasserstein metric). As usual, the matrix must have the same dimensions.WH.mat.prod performs the matrix multiplication of two MatH objects. It returns a matrix of numbers according to the dot product defined for two distributions and associated with the \(L_2\) Wasserstein metric. It is possible also to consider trasposition of matrices.WH.vec.mean computes a distributionH that is the mean distribution of a vector or a matrix of distributions (a MatH object). The mean distribution is computed accordingly to the sum based on the \(L_2\) Wasserstein distance, namely, it is the distribution associated with the average quantile function of the quantile functions of the vector of distributions. It is possible also to assign weights to the elements of vetor in order to obtain a weighted mean.WH.vec.sum same as WH.vec.mean, but computes only the sum.WH.vec.mean(BLOOD[,2]) #returns the average distribution of
# the Hemoglobin variable as a distributionH object These methods returns a matrix of numbers accordingly to the number of the compared variables. The formulas are those presented in [@IrpVer2015].
WH.SSQ(MyMAT, w=w) computes the wheighted Sum of Squares and the Sum of products of the deviations from the means of a MatH object. The weights are associated to the rows, if missing, the weights are considered equal. The result is square matrix of dimension equal to the number of variables of the MatH object.WH.SSQ2(MyMAT1, MyMAT2, w=w) computes the wheighted Sum of products of the deviations from the means of two MatH objects. The two matrices must have the same number of rows. The result is rectangular matrix with rows equal to the number of variables of the first MatH object and columns the number of variables of the first MatH object.Using these it is possible to compute covariances and correlations.
WH.var.covar(MyMAT, w=w) computes the variance-covariance matrix of a MatH object.WH.var.covar2(MyMAT1, MyMAT2, w=w) computes a covariance matrix between two MatH objects.WH.correlation(MyMAT, w=w) computes the correlation matrix of a MatH object.WH.correlation2(MyMAT1, MyMAT2, w=w) computes the correlation matrix of two MatH objects.WH.var.covar(BLOOD)
## Cholesterol Hemoglobin Hematocrit
## Cholesterol 388.137633 -5.0005134 -14.9202378
## Hemoglobin -5.000513 0.2802215 0.8266558
## Hematocrit -14.920238 0.8266558 2.9779786
WH.correlation(BLOOD)
## Cholesterol Hemoglobin Hematocrit
## Cholesterol 1.0000000 -0.4794806 -0.4388560
## Hemoglobin -0.4794806 1.0000000 0.9049264
## Hematocrit -0.4388560 0.9049264 1.0000000TdistributionH class: a class for a histogram-valued data observed along timeTdistributionH class (Not completely developed!!)It is essentially a specilized class of distributionH equipped with a time-related information. Being a distribution, we considered to specialize the class such that it can contains
tstamp, a numeric value that represents a time stamp, or;period, a list containing two slots start and end, two numbers (for now, a time/date dormat will be chosen in the future), indicating the initial and the final period of observation. Indeed, we imagine that a histogram (or a distribution) is the set of values observed during a time period.HTS class: a class for histogram time seriesHTS is a list of TdistributionHNot a very sophisticated class, but it is under development. ### Initializing a HTS
The list of TdistributionH objects are organized into a vector of dimension equal to epoc. The slot containing the vector is called data. If we need to see what is in the HTS we need to call elements from @data slot.
# RetHTS is a HTS of the returns five-minutes returns of DOLL/YEN changes
# rates observed along 108 days (it is one of the dataset available in the pkg)
RetHTS@data[2]
## [[1]]
## X p
## Bin_1 [ -0.1 ; -0.06 ) 0.03833
## Bin_2 [ -0.06 ; -0.03 ) 0.1394
## Bin_3 [ -0.03 ; 0.03 ) 0.7038
## Bin_4 [ 0.03 ; 0.06 ) 0.09059
## Bin_5 [ 0.06 ; 0.12 ] 0.02787
##
## mean = -0.00275259208362382 std = 0.0342206481885284
## subsetHTS(HTS, from, to) extracts a piece (in the time sense) of HTSshow(HTS) returns a list of values and basic statistics related to the distributions in the HTS (wee see the results in the next slide)plot(HTS,...) shows a plot of the HTS, (we see examples in the next slides)show method# RetHTS is a HTS of the returns five-minutes returns of DOLL/YEN changes
# rates observed along 108 days (it is one of the dataset available in the pkg)
options(width = 300)
show(RetHTS)## Tstamp T_period_start T_period_end mean std skew kurt
## 1 38749 38749 38749 -0.0019860297 0.04224402 0.696353637 6.433505
## 2 38750 38750 38750 -0.0027525921 0.03422065 0.161710205 3.860462
## 3 38751 38751 38751 -0.0022545180 0.04252923 1.054658950 6.334018
## 4 38754 38754 38754 -0.0056181618 0.03660866 -0.287735490 4.077056
## 5 38755 38755 38755 -0.0108187759 0.04469958 -0.691931506 5.949878
## 6 38756 38756 38756 -0.0015679186 0.03932478 0.676302020 4.805117
## 7 38757 38757 38757 -0.0038849961 0.03521140 0.474190871 4.194543
## 8 38758 38758 38758 -0.0057090553 0.05180512 -0.108275213 4.937425
## 9 38761 38761 38761 -0.0048545204 0.03741751 -0.264327106 4.290259
## 10 38762 38762 38762 -0.0060801071 0.04140493 0.276796508 6.148626
## 11 38763 38763 38763 -0.0018814832 0.05313136 -0.151476128 9.303889
## 12 38764 38764 38764 -0.0060526111 0.03971651 -0.041866347 4.189392
## 13 38765 38765 38765 -0.0019454181 0.05084949 0.354931260 6.160713
## 14 38768 38768 38768 -0.0051052438 0.02793509 0.009895520 5.072376
## 15 38769 38769 38769 -0.0034561197 0.02806190 0.028340867 4.330262
## 16 38770 38770 38770 -0.0069337719 0.03726113 -0.602640682 5.229702
## 17 38771 38771 38771 -0.0106445600 0.05501497 -0.995706447 6.528730
## 18 38772 38772 38772 -0.0073090483 0.05280985 -0.254510473 7.056863
## 19 38775 38775 38775 -0.0092402357 0.04994473 -2.456202932 16.218037
## 20 38776 38776 38776 -0.0070208812 0.03425427 -0.170867642 3.533228
## 21 38777 38777 38777 -0.0047212118 0.04867714 0.296046999 5.833572
## 22 38778 38778 38778 -0.0047038051 0.03674708 0.240445851 6.362662
## 23 38779 38779 38779 -0.0064233064 0.05259779 -1.269210426 11.922934
## 24 38782 38782 38782 -0.0011817858 0.04103728 0.718818282 5.971927
## 25 38783 38783 38783 -0.0047734896 0.03857179 -0.141197527 5.804973
## 26 38784 38784 38784 -0.0048606093 0.03369879 -0.515454518 3.729712
## 27 38785 38785 38785 -0.0047037812 0.05545347 -0.295411527 8.626437
## 28 38786 38786 38786 -0.0007272310 0.05204268 0.707027815 8.276109
## 29 38789 38789 38789 -0.0064363434 0.03191699 -0.078502518 3.762579
## 30 38790 38790 38790 -0.0105923052 0.04003190 -0.199008788 3.885703
## 31 38791 38791 38791 -0.0069511893 0.03898982 -0.086497721 5.293574
## 32 38792 38792 38792 -0.0074389851 0.04508932 -0.935993029 7.833595
## 33 38793 38793 38793 -0.0075729608 0.04550285 0.000219057 4.890152
## 34 38796 38796 38796 -0.0030633416 0.04627440 0.093724269 5.230596
## 35 38797 38797 38797 -0.0017307386 0.04216918 0.617796078 5.686636
## 36 38798 38798 38798 -0.0052264546 0.03984408 0.118317228 5.678807
## 37 38799 38799 38799 -0.0010626723 0.04243500 2.076246948 15.385329
## 38 38800 38800 38800 -0.0087817694 0.05320716 -2.306995986 13.792963
## 39 38803 38803 38803 -0.0072534963 0.04214570 -1.085068170 4.974648
## 40 38804 38804 38804 -0.0002786968 0.05085144 1.087733108 8.830547
## 41 38805 38805 38805 -0.0051219166 0.04418986 -0.011704014 6.374511
## 42 38806 38806 38806 -0.0076828832 0.05040461 -1.028022895 9.496948
## 43 38807 38807 38807 -0.0060545003 0.05235379 -0.606038718 8.880585
## 44 38810 38810 38810 -0.0061090452 0.05826492 -0.395657824 6.925153
## 45 38811 38811 38811 -0.0055985506 0.04355256 -0.486466779 6.884867
## 46 38812 38812 38812 -0.0046689580 0.04185633 0.459999525 5.036577
## 47 38813 38813 38813 -0.0029616471 0.03672018 -0.042695220 4.208889
## 48 38814 38814 38814 -0.0041635993 0.04382331 0.448764601 7.897283
## 49 38817 38817 38817 -0.0043612930 0.03274848 0.187488595 4.668999
## 50 38818 38818 38818 -0.0057142672 0.03244375 -0.246011044 3.562023
## 51 38819 38819 38819 -0.0036585120 0.03779608 0.122059841 5.802111
## 52 38820 38820 38820 -0.0049477194 0.02559904 0.261164825 4.069244
## 53 38821 38821 38821 -0.0045724760 0.01996054 0.563223610 5.127033
## 54 38824 38824 38824 -0.0091635871 0.04472303 0.237533944 9.820100
## 55 38825 38825 38825 -0.0081532739 0.04128826 -0.939945016 7.934035
## 56 38826 38826 38826 -0.0046166875 0.04686141 0.537764380 5.537140
## 57 38827 38827 38827 -0.0029616389 0.03641048 -0.028545536 6.009552
## 58 38828 38828 38828 -0.0089010695 0.04042375 0.713792677 6.595900
## 59 38831 38831 38831 -0.0101929316 0.05653616 -1.690331963 11.243804
## 60 38832 38832 38832 -0.0028571058 0.04468332 0.083454199 5.849040
## 61 38833 38833 38833 -0.0060278319 0.04528931 -0.440157943 6.474160
## 62 38834 38834 38834 -0.0093705645 0.06338602 -3.139906218 21.884934
## 63 38835 38835 38835 -0.0048908788 0.03923458 -0.850739442 5.869432
## 64 38838 38838 38838 -0.0075879791 0.05844668 -0.987439643 9.724247
## 65 38839 38839 38839 -0.0040766233 0.04768498 -0.616224072 4.198066
## 66 38840 38840 38840 -0.0047901832 0.03987908 -0.187875569 4.626865
## 67 38841 38841 38841 -0.0050174516 0.04519448 -0.363053027 3.945153
## 68 38842 38842 38842 -0.0100362910 0.06474940 -3.844042297 30.346126
## 69 38845 38845 38845 -0.0069999632 0.04992085 -0.780782148 5.403221
## 70 38846 38846 38846 -0.0075086792 0.04351507 -0.119390300 4.367429
## 71 38847 38847 38847 -0.0098083081 0.05765520 -0.452096610 7.850489
## 72 38848 38848 38848 -0.0054180625 0.06859318 -1.126436266 8.279600
## 73 38849 38849 38849 -0.0075272232 0.06127886 0.827625388 6.821111
## 74 38852 38852 38852 -0.0041985656 0.06142446 0.529495001 4.612564
## 75 38853 38853 38853 -0.0087804475 0.05342095 -0.070909200 6.192405
## 76 38854 38854 38854 -0.0054529214 0.06511938 0.008851600 5.422810
## 77 38855 38855 38855 -0.0081010001 0.05995930 -0.677938766 6.368277
## 78 38856 38856 38856 0.0004364151 0.06453323 0.156490593 6.391666
## 79 38859 38859 38859 -0.0039895079 0.05471918 -0.333312689 4.294341
## 80 38860 38860 38860 -0.0028146368 0.05766031 0.328591403 5.971598
## 81 38861 38861 38861 -0.0048257411 0.06988168 0.611671245 4.677875
## 82 38862 38862 38862 -0.0083797562 0.04348548 -0.346286093 5.958471
## 83 38863 38863 38863 -0.0021897476 0.04553626 0.064172498 5.382289
## 84 38866 38866 38866 -0.0051851697 0.02586819 -0.160231468 4.313987
## 85 38867 38867 38867 -0.0056097161 0.05844088 0.599221592 5.366215
## 86 38868 38868 38868 -0.0041637175 0.05226989 -1.287457703 9.113494
## 87 38869 38869 38869 -0.0031184325 0.04704296 -0.807542954 6.421688
## 88 38870 38870 38870 -0.0099453839 0.06255275 -4.200081947 33.147580
## 89 38873 38873 38873 -0.0027655402 0.03808986 1.089654951 7.042168
## 90 38874 38874 38874 -0.0005419267 0.04214455 0.897775010 6.642544
## 91 38875 38875 38875 -0.0033971822 0.03736942 -0.093831117 5.687776
## 92 38876 38876 38876 -0.0036933386 0.04352173 0.834360678 8.806083
## 93 38877 38877 38877 -0.0062726803 0.04113597 0.501480836 7.990728
## 94 38880 38880 38880 -0.0043090723 0.03271514 -0.149798527 3.469891
## 95 38881 38881 38881 -0.0002787105 0.04565224 1.053359064 7.260356
## 96 38882 38882 38882 -0.0059789644 0.04801573 -1.297302438 14.176536
## 97 38883 38883 38883 -0.0067595325 0.04059745 -0.074733516 9.526191
## 98 38884 38884 38884 -0.0035948660 0.03331588 -0.289386820 4.557180
## 99 38887 38887 38887 -0.0036701835 0.03245845 0.535800150 6.465983
## 100 38888 38888 38888 -0.0074911337 0.03855696 -0.354157836 6.562610
## 101 38889 38889 38889 -0.0052787230 0.03079112 0.191272381 4.722468
## 102 38890 38890 38890 -0.0007491067 0.03511747 0.840005560 5.816567
## 103 38891 38891 38891 -0.0049817761 0.04259499 -0.213627962 7.646801
## 104 38894 38894 38894 -0.0061031761 0.03335348 0.516790674 5.481879
## 105 38895 38895 38895 -0.0040592063 0.03850145 0.089547785 4.773725
## 106 38896 38896 38896 -0.0051567757 0.02822681 -0.303361576 4.315656
## 107 38897 38897 38897 -0.0105069555 0.04670969 -2.181569432 12.221400
## 108 38898 38898 38898 -0.0077189437 0.04557718 -0.555615542 6.150324
plot method (1/4)plot method (2/4)# plots the first 80 elements RetHTS dataset divided in
# pieces of 40 epocs
plot(subsetHTS(RetHTS,from=1,to=80), maxno.perplot=40) plot method (3/4)plot method (4/4): plotting using boxplots# plots HTS elements RetHTS dataset divided in
# pieces of 54 epocs
plot(RetHTS, type="BOXPLOT", maxno.perplot=54) MatH tablesIn this example, we use data of mean monthly temperatures observed in 48 states of US. Raw data are free available at the National Climatic Data Center website of US (http://www1.ncdc.noaa.gov/pub/data/cirs/). The original dataset drd964x.tmpst.txt contains the sequential Time Biased Corrected state climatic division monthly Average Temperatures recorded in the 48 (Hawaii and Alaska are not present in the dataset) states of US from 1895 to 2014.
First of all you can access to data and R scripts from this link.
DCA.5k.resu<-WH_kmeans(USA_TMP_MAT,k = 5,rep = 20)
# we consider the best result among 20 runs
DCA.5k.resu$solution$cardinality
IDX
Cl.1 Cl.2 Cl.3 Cl.4 Cl.5
6 11 12 7 12
DCA.5k.resu$solution$centers
a matrix of distributions
12 variables 5 rows
....
$solution$Crit
[1] 3860.546
$quality
[1] 0.9015709The prototypes
DCA on USA k=5: the map
We considered population age-sex pyramids data collected by the Census Bureau of USA in 2014.
A population pyramid is a common way to represent jointly the distribution of sex and age of people living in a given administrative unit (city, region or country, for instance).
In this dataset (available in the package with the name Age_Pyramids_2014), each country (228 countries) is represented by two histograms describing the age distribution for the male and the female population. Both distributions are represented by vertically juxtaposing, and the representation is similar to a pyramid. The shape of pyramids varies according to the distribution of the age in the population and it is related to the development of a country.
Work_data=Age_Pyramids_2014[2:229,2:3] #Take a part of the data
Hward=WH_hclust(Work_data,method = "ward.D2") #Do the dirty work
# cut dendrogram in 4 clusters
hc=Hward
hcd=as.dendrogram(hc)
clusMember = cutree(hc, 4)
labelColors = c("red", "yellow", "green", "purple")
# function to get color labels
...
# using dendrapply
clusDendro = dendrapply(hcd, colLab)
# make plot
clusDendro<-assign_values_to_leaves_nodePar(clusDendro, 0.5, "lab.cex")
plot(clusDendro)The dendrogram
Countries assigned to neurons
Neurons’ prototypes (pyramids)
Thank you for the attention
Questions?
Antonio Irpino, PhD
This presentation is publicly available here: