Pawlowsky-Glahn, Vera, Juan José Egozcue, and Raimon Tolosana-Delgado. Modeling and analysis of compositional data. John Wiley & Sons, 2015.
http://www.stat.boogaart.de/compositionsRBook/
A value of central tendency of a compositional sample is the closed geometric mean. It is called center. For a data set of size \(n\), it is defined as \[ \operatorname{cen}(\mathbf{X})=\hat{\mathrm{g}}=C\left[\hat{g}_{1}, \hat{g}_{2}, \ldots, \hat{g}_{D}\right] \] with \(\hat{g}_{j}=\left(\prod_{i=1}^{n} x_{i j}\right)^{1 / n}, j=1,2, \ldots, D\)
library(compositions)
setwd("D:/Compositional Data analysis/CODA_OCT2021")
GeoChemSed= read.csv("geochemsed.csv",header=TRUE,skip=1)
head(GeoChemSed)## Sample GS Qz_added SiO2 TiO2 Al2O3 MnO MgO CaO Na2O K2O P2O5 Fe2O3t
## 1 RT1-1 -1 0 73.40 0.22 14.19 0.037 0.48 1.04 4.33 4.18 0.06 1.74
## 2 RT1-1 0 0 73.86 0.26 14.00 0.038 0.51 1.14 4.32 3.95 0.07 1.72
## 3 RT1-1 1 0 73.85 0.30 13.85 0.041 0.55 1.15 4.31 3.82 0.07 1.82
## 4 RT1-1 2 0 78.12 0.32 11.36 0.043 0.57 1.02 3.41 3.16 0.06 1.81
## 5 RT1-1 3 0 78.77 0.39 10.29 0.061 0.86 1.07 2.81 2.92 0.07 2.54
## 6 RT1-1 4 0 74.43 0.54 11.88 0.073 0.94 2.22 2.95 3.20 0.29 2.96
## Summe.HE. Ba Co Cr Cu Ga Nb Ni Pb Rb Sc Sr V Y Zn Zr Nd Summe.SE.
## 1 99.68 720 2 11 0 15 12 5 14 148 3 143 17 24 25 134 30 1303
## 2 99.87 689 1 8 1 15 13 4 16 137 3 148 17 23 26 145 33 1279
## 3 99.75 673 3 3 0 15 15 7 16 134 0 149 17 26 29 118 42 1247
## 4 99.87 566 1 10 1 12 17 5 17 114 1 126 21 27 32 108 44 1102
## 5 99.78 579 4 8 0 12 19 6 20 121 6 129 27 31 50 240 39 1291
## 6 99.48 736 5 10 3 16 30 12 20 136 10 221 33 67 52 767 57 2175
## Summe.SEHE.
## 1 99.81
## 2 99.99
## 3 99.88
## 4 99.98
## 5 99.91
## 6 99.70
## Cr Zn Pb
## [1,] 0.2200000 0.5000000 0.2800000
## [2,] 0.1600000 0.5200000 0.3200000
## [3,] 0.0625000 0.6041667 0.3333333
## [4,] 0.1694915 0.5423729 0.2881356
## [5,] 0.1025641 0.6410256 0.2564103
## [6,] 0.1219512 0.6341463 0.2439024
## attr(,"class")
## [1] acomp
## Cr Zn Pb
## 0.1508323 0.5772793 0.2718884
## attr(,"class")
## [1] acomp
Dispersion in a compositional data set can be described either by the variation matrix, originally defined by Aitchison (1986) as \[ \mathbf{T}=\left(\begin{array}{cccc} t_{11} & t_{12} & \ldots & t_{1 D} \\ t_{21} & t_{22} & \ldots & t_{2 D} \\ \vdots & \vdots & \ddots & \vdots \\ t_{D 1} & t_{D 2} & \ldots & t_{D D} \end{array}\right), \quad t_{i j}=\operatorname{var}\left(\ln \frac{x_{i}}{x_{j}}\right) \]
or by the normalized variation matrix \[ \mathbf{T}^{*}=\left(\begin{array}{cccc} t_{11}^{*} & t_{12}^{*} & \cdots & t_{1 D}^{*} \\ t_{21}^{*} & t_{22}^{*} & \cdots & t_{2 D}^{*} \\ \vdots & \vdots & \ddots & \vdots \\ t_{D 1}^{*} & t_{D 2}^{*} & \cdots & t_{D D}^{*} \end{array}\right), \quad t_{i j}^{*}=\operatorname{var}\left(\frac{1}{\sqrt{2}} \ln \frac{x_{i}}{x_{j}}\right) \]
## Cr Zn Pb
## Cr 0.0000000 0.2610508 0.3259089
## Zn 0.2610508 0.0000000 0.2780853
## Pb 0.3259089 0.2780853 0.0000000
## $mean
## Cr Zn Pb
## 0.1508323 0.5772793 0.2718884
## attr(,"class")
## [1] acomp
##
## $mean.ratio
## Cr Zn Pb
## Cr 1.000000 0.2612814 0.5547583
## Zn 3.827291 1.0000000 2.1232216
## Pb 1.802587 0.4709824 1.0000000
##
## $variation
## Cr Zn Pb
## Cr 0.0000000 0.2610508 0.3259089
## Zn 0.2610508 0.0000000 0.2780853
## Pb 0.3259089 0.2780853 0.0000000
##
## $expsd
## Cr Zn Pb
## Cr 1.000000 1.666843 1.769831
## Zn 1.666843 1.000000 1.694416
## Pb 1.769831 1.694416 1.000000
##
## $invexpsd
## Cr Zn Pb
## Cr 1.0000000 0.5999366 0.5650255
## Zn 0.5999366 1.0000000 0.5901740
## Pb 0.5650255 0.5901740 1.0000000
##
## $min
## Cr Zn Pb
## Cr 1.0000000 0.03571429 0.05555556
## Zn 1.4838710 1.00000000 1.06666667
## Pb 0.5409836 0.12500000 1.00000000
##
## $q1
## Cr Zn Pb
## Cr 1.000000 0.1917421 0.4285714
## Zn 2.619048 1.0000000 1.4545455
## Pb 1.309524 0.3514831 1.0000000
##
## $med
## Cr Zn Pb
## Cr 1.000000 0.2857143 0.5454545
## Zn 3.500000 1.0000000 1.8125000
## Pb 1.833333 0.5517241 1.0000000
##
## $q3
## Cr Zn Pb
## Cr 1.000000 0.3819444 0.7638889
## Zn 5.215385 1.0000000 2.8453947
## Pb 2.333333 0.6875000 1.0000000
##
## $max
## Cr Zn Pb
## Cr 1 0.673913 1.848485
## Zn 28 1.000000 8.000000
## Pb 18 0.937500 1.000000
##
## $missingness
## missingType
## variable NMV BDL MAR MNAR SZ Err
## Cr 87 0 0 0 0 0
## Zn 87 0 0 0 0 0
## Pb 87 0 0 0 0 0
##
## attr(,"class")
## [1] "summary.acomp"
#Boxplots of all pairwise log ratios. This is the result of boxplot(x) with an acomp object
opar <- par(mar=c(3,3,1,1))
boxplot(x)# get the statistics:
mn = mean(x)
mvr = mvar(x)
# center the data:
dat1 = scale(x,center=TRUE,scale=TRUE)
dat2 = (x-mn)/sqrt(mvr)
#If we only want to center the data, we can try either
dat3 = scale(x,center=TRUE,scale=FALSE)
dat4 = x-mnx = acomp(GeoChemSed,4:13)
pcx = princomp(x)
#The first thing we should do is to obtain the proportion of explained variance that the biplot will capture. This is obtained as
sum(pcx$sdev[1:2]^2)/mvar(x)## [1] 0.9026089
# 0.9026 quite a good value. The biplot of the data is obtained with
opar <- par(mar=c(1,1,1,1))
dots = rep(".",times=nrow(x))
biplot(pcx, xlabs=dots)A balance-dendrogram is the joint representation of the ollowing elements: 1. a sequential binary partition, in the form of a tree structure; 2. the sample variance and mean of each balance, as vertical bars and inception points of these bars; 3. a box-plot, summarizing the order statistics of each balance on horizontal axes.
# first example: take the data set from the example, select only
# compositional parts
data(Hydrochem)
x = acomp(Hydrochem[,-c(1:5)])
gr = Hydrochem[,4] # river groups (useful afterwards)
# use an ilr basis coming from a clustering of parts
dd = dist(t(clr(x)))
hc1 = hclust(dd,method="ward.D")
plot(hc1)mergetree=hc1$merge
CoDaDendrogram(X=acomp(x),mergetree=mergetree,col="red",range=c(-8,8),box.space=1)
# add the mean of each river
color=c("green3","red","blue","darkviolet")
aux = sapply(split(x,gr),mean)
aux## Anoia Cardener LowerLLobregat UpperLLobregat
## H 7.681283e-12 1.741958e-11 1.419453e-11 1.950521e-11
## Na 1.057750e-01 1.223121e-01 1.406311e-01 5.440953e-02
## K 7.570733e-03 1.894806e-02 2.572564e-02 4.633610e-03
## Mg 4.087110e-02 3.800140e-02 3.464647e-02 3.037682e-02
## Ca 1.277392e-01 1.308379e-01 1.096325e-01 1.821764e-01
## Sr 2.258993e-03 2.430692e-03 1.655782e-03 2.132551e-03
## Ba 3.657214e-05 8.477833e-05 6.427606e-05 1.025198e-04
## NH4 4.696969e-04 2.568855e-04 5.946778e-04 2.338011e-04
## Cl 1.419020e-01 2.037912e-01 2.352855e-01 7.236928e-02
## NO3 4.288144e-03 8.873105e-03 6.638845e-03 4.734691e-03
## PO4 6.054526e-04 7.044785e-04 7.825075e-04 5.715776e-04
## SO4 3.210349e-01 1.784094e-01 1.655095e-01 2.146219e-01
## HCO3 2.428401e-01 2.911536e-01 2.741530e-01 4.285757e-01
## TOC 4.608154e-03 4.196326e-03 4.680093e-03 5.061633e-03
# second example: box-plots by rivers (filled)
CoDaDendrogram(X=acomp(x),mergetree=mergetree,col="black",range=c(-8,8),type="l")
xsplit = split(x,gr)
for(i in 1:4){
CoDaDendrogram(X=xsplit[[i]],col=color[i],type="box",box.pos=i-2.5,box.space=0.5,add=TRUE)
}# third example: fewer parts, partition defined by a signary, and empty box-plots
x = acomp(Hydrochem[,c("Na","K","Mg","Ca","Sr","Ba","NH4")])
signary = t(matrix( c(1, 1, 1, 1, 1, 1, -1,
1, 1, -1, -1, -1, -1, 0,
1, -1, 0, 0, 0, 0, 0,
0, 0, -1, 1, -1, -1, 0,
0, 0, 1, 0, -1, 1, 0,
0, 0, 1, 0, 0, -1, 0),ncol=7,nrow=6,byrow=TRUE))
CoDaDendrogram(X=acomp(x),signary=signary,col="black",range=c(-8,8),type="l")
xsplit = split(x,gr)
for(i in 1:4){
CoDaDendrogram(X=acomp(xsplit[[i]]),border=color[i],
type="box",box.pos=i-2.5,box.space=1.5,add=TRUE)
CoDaDendrogram(X=acomp(xsplit[[i]]),col=color[i],
type="line",add=TRUE)
}