Analysing compositional data with classical Euclidean geometry is problematic and can give wrong or biased results.
Because each row is constrained to sum to 1 (100%), the data lie in a lower-dimensional subspace that standard statistical methods are not designed for.
Let’s plot the raw percentages of two variables (each one is a part of the composition: count / sum of counts). One could work with the counts alone, but then the size of the row total would be ignored.
scatterplot(natcomp$Aubonne.Total.pct.Plunger.stopper.Misplaced.or.multiple....,natcomp$Aubonne.Total.pct.Plunger.stopper.Missing....)
# Look at the pseudo-outliers along the x-axis and the pseudo-leverage points along the y-axis.
hist(natcomp$Aubonne.Total.pct.Plunger.stopper.Misplaced.or.multiple....,col="blue",xlab="Plunger Stopper Misplaced in %")
hist(natcomp$Aubonne.Total.pct.Plunger.stopper.Missing....,col="blue",xlab="Plunger Stopper MISSING in %")
Something isn't right: the data do not spread over the range, and the few non-zero points look like outliers.
One way around this: log-ratios and the transformations of the Aitchison geometry.
First we will transform these ratios by taking the log of both variables.
- Note: the sum of logs is not equal to the log of the sum.
- Note: the data are full of zeros, which cause problems for the log and for the regression line.
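Both notes can be checked on a toy vector before touching the real data. The sketch below uses base R only. The vector `p` is made up for illustration and is not taken from `natcomp`.

```r
# Hypothetical toy proportions (not from the dataset); the last part is a true zero
p <- c(0.2, 0.3, 0.5, 0)

# The sum of logs is the log of the product, not the log of the sum
sum_of_logs <- sum(log(p[1:3]))   # log(0.2 * 0.3 * 0.5)
log_of_sum  <- log(sum(p[1:3]))   # log of the total of the three parts

# log(0) is -Inf, which breaks regression lines and smoothers
log_zero <- log(p[4])

# Adding a small offset (0.001, the value used in the code that follows)
# keeps every value finite
log_offset <- log(p + 0.001)
```

The offset maps every true zero to the same floor value, `log(0.001)`, so all the zeros end up stacked on a single line of the log-scale plot.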
nanew=natcomp[,c(2,3,4)]
nanew1=nanew
## Some values are true zeros, so the log transform cannot be applied directly
# Workaround: add a small offset (here 0.001) before taking the log
nanew=log(nanew+0.001)
summary(nanew)
## Aubonne.Total.pct.Plunger.stopper.Height....
## Min. :-6.908
## 1st Qu.:-6.908
## Median :-6.908
## Mean :-6.840
## 3rd Qu.:-6.908
## Max. :-1.458
## Aubonne.Total.pct.Plunger.stopper.Misplaced.or.multiple....
## Min. :-6.908
## 1st Qu.:-6.908
## Median :-6.908
## Mean :-6.696
## 3rd Qu.:-6.908
## Max. :-1.103
## Aubonne.Total.pct.Plunger.stopper.Missing....
## Min. :-6.908
## 1st Qu.:-6.908
## Median :-6.908
## Mean :-6.589
## 3rd Qu.:-6.908
## Max. :-1.689
summary(nanew1)
## Aubonne.Total.pct.Plunger.stopper.Height....
## Min. :0.0000000
## 1st Qu.:0.0000000
## Median :0.0000000
## Mean :0.0006416
## 3rd Qu.:0.0000000
## Max. :0.2316370
## Aubonne.Total.pct.Plunger.stopper.Misplaced.or.multiple....
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.001607
## 3rd Qu.:0.000000
## Max. :0.330865
## Aubonne.Total.pct.Plunger.stopper.Missing....
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.003587
## 3rd Qu.:0.000000
## Max. :0.183767
scatterplot(nanew$Aubonne.Total.pct.Plunger.stopper.Misplaced.or.multiple....,nanew$Aubonne.Total.pct.Plunger.stopper.Missing....)
Now the true relationship appears more clearly and extends over the range. This dataset is full of zeros (no record). Removing the zeros from the scatter plot is not advisable: it could implicitly destroy the structure of your data matrix, so the scatter plot may no longer be drawn correctly and the structure of the correlation matrix is lost.
There is another solution: zoom in on the plot.
scatterplot(nanew$Aubonne.Total.pct.Plunger.stopper.Misplaced.or.multiple....,nanew$Aubonne.Total.pct.Plunger.stopper.Missing....,xlim=c(-6.4,0),ylim=c(-6.4,0))
If you want a plot with the exact geometry, a change of coordinates can be done with the compositions package.
Now look, there is a flaw:
Imagine that your data are given to you in percent only (of the order of 0.00016), but the data provider forgot to give you all the columns (some are missing). When you call the acomp function, every row is rescaled so that it sums to one (equivalently 100%).
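To make the rescaling concrete, here is a minimal base R sketch of what happens to one row when a column is missing. The shares and part names are hypothetical, and the row-wise division is written out by hand as a stand-in for the closure described above.

```r
# Hypothetical full composition of one batch: four defect shares summing to 1
full <- c(height = 0.10, misplaced = 0.20, missing = 0.30, other = 0.40)

# The provider only delivers three of the four columns
partial <- full[c("height", "misplaced", "missing")]
total_partial <- sum(partial)        # no longer 1

# Closure: rescale the remaining parts so that they sum to 1 again
reclosed <- partial / total_partial
total_reclosed <- sum(reclosed)      # back to 1, but every share is inflated

# Ratios between parts survive the rescaling; the shares themselves do not
ratio_before <- unname(full["misplaced"] / full["height"])
ratio_after  <- unname(reclosed["misplaced"] / reclosed["height"])
```

The ratios between parts are the only quantities left untouched, which is why the log-ratio approach works on ratios rather than on the shares.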
According to Pearson and Aitchison (1986), we should then observe, because of the missing columns, a negative correlation between the variables, which contradicts what we saw in our second plot.
dim(nanew1) ## only 3 columns are selected and will be rescaled to sum to 100%
## [1] 932 3
Acomp=acomp(nanew1)
pairs(Acomp,panel=panel.smooth)
# Moreover, if you receive compositional data from which some parts (%) have been removed, the remaining parts no longer sum to 1, and after rescaling the scatter plot will probably show a negative (spurious) correlation.
Note that the zeros make the smoother unstable and a bad estimator (use a regression line instead).
The negative correlation is obvious. Lesson: never leave out some of the percentage variables and rescale the rest to 100%, as your analysis will be seriously biased.
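The effect can be reproduced on simulated data. The sketch below is an illustration in base R, not the dataset used here: three independent positive variables are generated (hypothetical names `a`, `b`, `c`), then each row is rescaled to sum to 1.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Three independent, strictly positive hypothetical "defect amounts"
n <- 1000
raw <- cbind(a = rlnorm(n), b = rlnorm(n), c = rlnorm(n))

# Correlation between the raw parts: close to zero, since they are independent
cor_raw <- cor(raw[, "a"], raw[, "b"])

# Closure: rescale each row to sum to 1, as happens when a subset of
# percentage columns is re-normalised to 100%
closed <- raw / rowSums(raw)

# Correlation after closure: clearly negative, although nothing links a and b
cor_closed <- cor(closed[, "a"], closed[, "b"])

c(raw = cor_raw, closed = cor_closed)
```

With three parts that all follow the same distribution, the sum constraint alone forces the correlation between any two closed parts towards -1/2, whatever the relationship between the raw variables.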
Now we will use the correct methodology:
Use the counts of the variables and compute the total of the counts row-wise. Exact method: use a package that deals with the geometry of compositional data, or use the log-ratio technique for a better analysis.
AA=aplus(natcomp5[,4:37])
BB=acomp(natcomp5[,4:37])
RR=rcomp(natcomp5[,4:37])
## Row-wise, the selected defect columns (4:37) are constrained to sum to 1
## BB is the composition of these defect columns, making up 100% of the defects in each row; the total batch column is excluded for now.
# Apply the correct projection geometry
comp1=clr(BB) # centered log-ratio matrix (Aitchison 1986)
# The next plots show the correct projection, where each row totals 100% across all variables (excl. grand total count VI)
pairs(comp1[,5:10],panel = panel.smooth)
pairs(comp1[,11:17],panel = panel.smooth)
This type of projection may be adequate, but others exist, such as the isometric log-ratio (ilr).
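For readers who want to see what the centered log-ratio does, it can be written by hand in a few lines of base R. This is a sketch on hypothetical, strictly positive toy counts. The helper names `closure` and `clr_manual` are made up here and are not functions of the compositions package.

```r
# Hypothetical toy counts (3 rows, 4 parts), strictly positive so logs are finite
counts <- matrix(c(10, 20, 30, 40,
                    5,  5, 10, 80,
                   25, 25, 25, 25), nrow = 3, byrow = TRUE)

# Closure: divide each row by its total so that it sums to 1
closure <- function(x) x / rowSums(x)

# clr: log of each part minus the row mean of the logs
# (equivalently, log of part / geometric mean of the row)
clr_manual <- function(x) {
  lx <- log(x)
  lx - rowMeans(lx)
}

comp_toy <- closure(counts)
clr_toy  <- clr_manual(comp_toy)

# Each clr row sums to zero (up to rounding), and the result is the same
# whether or not the rows were rescaled before the transform
rowSums(clr_toy)
all.equal(clr_toy, clr_manual(counts))
```

Because the clr only depends on ratios between parts, the row total drops out. The price is that the clr columns are linearly dependent (each row sums to zero), which matters for PCA.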
Now for the PCA projected on these new coordinates:
#making names shorter
nam2=colnames(natpp)
necol=paste0("X",1:length(nam2))
colnames(natpp)<-necol
necol
## [1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9" "X10" "X11" "X12"
## [13] "X13" "X14" "X15" "X16" "X17" "X18" "X19" "X20" "X21" "X22" "X23" "X24"
## [25] "X25" "X26" "X27" "X28" "X29" "X30" "X31" "X32" "X33" "X34" "X35" "X36"
colnames(natpp)
## [1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9" "X10" "X11" "X12"
## [13] "X13" "X14" "X15" "X16" "X17" "X18" "X19" "X20" "X21" "X22" "X23" "X24"
## [25] "X25" "X26" "X27" "X28" "X29" "X30" "X31" "X32" "X33" "X34" "X35" "X36"
# Some eigenvalues are negative or null because of all-zero variables, which must be removed: X27, X25, X10, X12, X1, X2
natpp$X1<-NULL
natpp$X2<-NULL
natpp$X10<-NULL
natpp$X12<-NULL
natpp$X25<-NULL
natpp$X27<-NULL
## Direct PCA on the acomp data is not possible because of negative eigenvalues (although it can still be printed), so we run the PCA on clr and ilr coordinates, which visually give the same results
PCA2clr=princomp(clr(acomp(natpp)))
#centered log ratio
fviz_pca_biplot(PCA2clr, repel = TRUE, select.var = list(contrib = 20),cex.text=1.2)
fviz_pca_var(PCA2clr, col.var = "cos2")
fviz_pca_var(PCA2clr, col.var = "contrib")
fviz_screeplot(PCA2clr)
## Run the PCA on the clr transform, if not already done when the compositional data are passed to the PCA (prcomp):
## PCA2clr=princomp(x=clr(acomp(natpp)))
## plot(PCA2clr)
fviz_pca_var(PCA2clr, col.var = "contrib")
## via ilr coordinates
PCA2ilr=princomp(ilr(acomp(natpp)))
fviz_pca_biplot(PCA2ilr, repel = TRUE, select.var = list(contrib = 20),cex.text=1.2)
fviz_pca_var(PCA2ilr, col.var = "contrib")
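Why do the clr and ilr versions of the PCA look alike? A small base R sketch on simulated data gives one way to see it. The data are hypothetical, and the orthonormal basis `V` is built from Helmert contrasts as an illustrative choice, which is not necessarily the basis used by `ilr` in the compositions package.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical strictly positive data: 200 rows, 5 parts
x <- matrix(rlnorm(200 * 5), ncol = 5)

# clr coordinates: log minus the row mean of the logs
lx <- log(x)
clr_x <- lx - rowMeans(lx)

# One possible orthonormal basis of the clr plane, from Helmert contrasts
# (illustrative choice, assumed here; other bases are equally valid)
H <- contr.helmert(5)
V <- sweep(H, 2, sqrt(colSums(H^2)), "/")
ilr_x <- clr_x %*% V   # 4 coordinates instead of 5

# The clr covariance has one (numerically) zero eigenvalue ...
ev_clr <- eigen(cov(clr_x), symmetric = TRUE, only.values = TRUE)$values
# ... while the ilr covariance keeps the same non-zero eigenvalues
ev_ilr <- eigen(cov(ilr_x), symmetric = TRUE, only.values = TRUE)$values
```

The variances along the principal axes are therefore identical in both coordinate systems. The ilr simply drops the redundant direction that makes the clr covariance singular.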
Note that the data were randomly modified for the FAIR process.