        Sleep    Sedent     LowPA      MedPA       VigPA  Weight
186 0.4256066 0.2904476 0.2180762 0.04222257 0.023646978 Healthy
188 0.4364543 0.3080640 0.1961706 0.02998026 0.029330771 Healthy
189 0.3672305 0.3496161 0.2504670 0.02313998 0.009546540 Healthy
190 0.3932206 0.3264883 0.2345340 0.03293886 0.012818189 Healthy
191 0.4055074 0.3381771 0.2174014 0.02813187 0.010782160 Healthy
192 0.3824199 0.4042658 0.1815667 0.02298345 0.008764219 Healthy
Overview
In this brief tutorial, I will introduce compositional data analysis (CODA for short). First, I will define compositional data and describe the circumstances under which we would encounter or use it. Second, I will introduce basic summary statistics used to describe compositional data. Finally, I will demonstrate how to visualize compositional data, in particular using ternary diagrams.
Check out this awesome horizontal line break that you can use in Quarto below:
Part I: Introduction to CODA
Definition of Compositional Data
Data is considered “compositional” if it can be divided into individual parts, which are called the “components”. These components sum to a total, and the fraction of that total represented by each component (the component divided by the sum total) is known as the “portion”. In CODA, the total sum is irrelevant; what is relevant is the relationship among the portions. For example, we might want to understand which component makes up the largest portion of the total and what happens as we increase this component. What happens to the other components? Do they increase or decrease? Which components are highly related?
Compositions are by nature multivariate, and their portions are always positive and constrained to sum to a fixed total; therefore, they cannot be modeled directly by a Normal distribution, whose support is unbounded. Because of this constraint, researchers have traditionally used both log transformations and other distributions, most notably the Dirichlet distribution, to model compositional data.
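To make the definition concrete, here is a minimal sketch of turning components into portions. The hours below are hypothetical values chosen for illustration; they are not taken from the data used later in this tutorial.

```r
# Hypothetical hours spent on each activity in one 24-hour day
# (illustrative values only, not from the bmi_activity data).
hours <- c(Sleep = 9, Sedent = 8, LowPA = 5, MedPA = 1.5, VigPA = 0.5)

# Portions: each component divided by the sum total.
portions <- hours / sum(hours)

# Rescaling the components (hours to minutes) changes the total
# but leaves the portions unchanged.
minutes <- hours * 60
portions_from_minutes <- minutes / sum(minutes)

print(portions)
print(all.equal(portions, portions_from_minutes))
```

The second comparison illustrates why the total is irrelevant in CODA: the same portions come out whether the day is measured in hours or in minutes.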
Examples of Compositional Data
Compositional data is more common than we might realize and is found in numerous types of studies:
- Microbiome studies (relative abundance of certain bacteria in different parts of the human body) (Gloor et al. 2016)
- Percentage of different plate appearances (ex: home run, strikeout, etc.) for professional baseball players (Null 2009)
- The breakdown of the Scale of Psychological Well-Being (SPWB) scores: each dimension contributes to overall well-being score (Cortés-Rodríguez et al. 2023)
- Daily activity data (used in this tutorial) (Clarke and Janssen 2021)
  - % of time spent sleeping
  - % of time spent being sedentary
  - % of time doing physical activity
    - light physical activity
    - medium physical activity
    - vigorous physical activity
Daily Activity Data set
As noted above, we will use daily activity data in this tutorial. Specifically, we will use the bmi_activity data set from the coda.base R package. This data set features the daily percentage breakdown of five activities for 393 participants, as well as their sex and standardized BMI score. This BMI score is used to categorize participants as “healthy” or “obese”. For this analysis, we will only consider female participants. A sample of the cleaned data set is shown below:
Notice that each row of activities sums to 1! In CODA, we are interested in the portions (also called proportions) and how they relate to each other, not the sum total. The total is irrelevant to us!
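The sum-to-1 property can be checked directly. The sketch below re-enters the six sample rows of the cleaned data shown in this tutorial by hand (the name `bmi_sample` is my own, so that the example runs without the full data set) and confirms that each row sums to 1 up to rounding.

```r
# Six sample rows of the cleaned data, copied by hand from this tutorial.
# `bmi_sample` is a hypothetical name used only for this illustration.
bmi_sample <- data.frame(
  Sleep  = c(0.4256066, 0.4364543, 0.3672305, 0.3932206, 0.4055074, 0.3824199),
  Sedent = c(0.2904476, 0.3080640, 0.3496161, 0.3264883, 0.3381771, 0.4042658),
  LowPA  = c(0.2180762, 0.1961706, 0.2504670, 0.2345340, 0.2174014, 0.1815667),
  MedPA  = c(0.04222257, 0.02998026, 0.02313998, 0.03293886, 0.02813187, 0.02298345),
  VigPA  = c(0.023646978, 0.029330771, 0.009546540, 0.012818189, 0.010782160, 0.008764219),
  Weight = factor(rep("Healthy", 6)),
  row.names = c("186", "188", "189", "190", "191", "192")
)

# Sum the five activity columns for each participant.
activity_cols <- c("Sleep", "Sedent", "LowPA", "MedPA", "VigPA")
row_totals <- rowSums(bmi_sample[, activity_cols])

# Each total should equal 1 up to the rounding of the printed values.
print(round(row_totals, 6))
```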
Part II: Summary Statistics
Dirichlet Distribution
As mentioned previously, compositional data can be modeled using the Dirichlet distribution. Suppose that we have a compositional data set with \(k\) variables, denoted as \(X = [X_1,...,X_k]^T\), such that \(\sum_{j=1}^{k}X_j=1\). This multivariate random variable \(X\) is said to follow a Dirichlet distribution with parameters \(\alpha = [\alpha_1,...,\alpha_k]^T\), where each \(\alpha_j>0\) and \(A=\sum_{j=1}^{k}\alpha_j\), if its density function is defined as:
\[f(x|\alpha)=\frac{\Gamma(A)}{\Gamma(\alpha_1)...\Gamma(\alpha_k)}\prod_{j=1}^{k}{x_j}^{\alpha_j-1}\]
\[0\leq x_j \leq1 \text{ for } j=1,...,k\]
We will use this distribution to calculate the mean and variance of each component below.
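As a rough illustration of the density above, here is a minimal base-R sketch. The function name `ddirichlet_sketch` is my own and does not come from any package; it evaluates the density only at interior points of the simplex (all parts strictly positive) and works on the log scale for numerical stability.

```r
# Dirichlet density f(x | alpha), evaluated on the log scale.
# `ddirichlet_sketch` is a hypothetical helper written for this tutorial.
ddirichlet_sketch <- function(x, alpha) {
  stopifnot(
    length(x) == length(alpha),
    all(alpha > 0),
    all(x > 0),                 # interior points only in this sketch
    abs(sum(x) - 1) < 1e-8      # x must be a composition summing to 1
  )
  A <- sum(alpha)
  # log of Gamma(A) / (Gamma(alpha_1) ... Gamma(alpha_k))
  log_const <- lgamma(A) - sum(lgamma(alpha))
  exp(log_const + sum((alpha - 1) * log(x)))
}

# With all alpha_j = 1 the density is flat over the simplex: Gamma(3) = 2.
print(ddirichlet_sketch(c(0.2, 0.3, 0.5), c(1, 1, 1)))      # 2
# Gamma(4) / Gamma(2) * 0.5^(2 - 1) = 6 * 0.5 = 3
print(ddirichlet_sketch(c(0.5, 0.25, 0.25), c(2, 1, 1)))    # 3
```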
Mean and Standard Deviation
The mean of each component is simply \(E[X_i]=\frac{\alpha_i}{A}=\pi_i\). The variance is given by \(Var[X_i]=\frac{\pi_i(1-\pi_i)}{A+1}\). We use the following code to find the sample mean and standard deviation of each component across all participants, regardless of weight:
apply(bmi[,1:5],2,mean)
     Sleep     Sedent      LowPA      MedPA      VigPA
0.40172057 0.33933347 0.21784048 0.02720815 0.01389732
apply(bmi[,1:5],2,sd)
      Sleep      Sedent       LowPA       MedPA       VigPA
0.026983405 0.038101233 0.032576963 0.007398704 0.007090750
We see that sleep has the highest mean value at 0.402, followed by sedentary at 0.339. Sedentary has the highest standard deviation at 0.038.
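For comparison with the sample statistics, the Dirichlet mean and variance formulas from the start of this section can be sketched as a small function. The `alpha` values below are hypothetical, chosen only so that the implied means loosely resemble the sample means; they are not estimates fitted to the data.

```r
# Mean, variance, and standard deviation implied by Dirichlet parameters.
# `dirichlet_moments` is a hypothetical helper written for this tutorial.
dirichlet_moments <- function(alpha) {
  A <- sum(alpha)
  pi_i <- alpha / A                          # E[X_i] = alpha_i / A
  variance <- pi_i * (1 - pi_i) / (A + 1)    # Var[X_i] = pi_i (1 - pi_i) / (A + 1)
  data.frame(mean = pi_i, var = variance, sd = sqrt(variance))
}

# Hypothetical parameters (NOT fitted to the bmi data).
alpha <- c(Sleep = 40, Sedent = 34, LowPA = 22, MedPA = 3, VigPA = 1)
moments <- dirichlet_moments(alpha)
print(moments)
```

Note that a larger \(A\) shrinks every variance at once, so a single Dirichlet cannot match arbitrary component-wise standard deviations.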
Correlation
To calculate the correlation between the individual components we write cor(bmi[,-6]), which gives us:
             Sleep     Sedent      LowPA      MedPA       VigPA
Sleep   1.00000000 -0.3059890 -0.4451885 -0.1601442  0.05117943
Sedent -0.30598897  1.0000000 -0.6648791 -0.6351715 -0.49154456
LowPA  -0.44518851 -0.6648791  1.0000000  0.4955858  0.15538014
MedPA  -0.16014418 -0.6351715  0.4955858  1.0000000  0.70213587
VigPA   0.05117943 -0.4915446  0.1553801  0.7021359  1.00000000
We see that sleep is negatively correlated with sedentary, low physical activity, and medium physical activity, but positively correlated with vigorous physical activity, although that last correlation is weak (0.05). This seems to make intuitive sense because the more vigorous the activity, the more sleep and rest are needed to recover.
We can visualize the correlations between components for both the healthy and obese groups using the code below (results not shown; left as a future learning exercise).
library(GGally)
# Color by the group column, which is named Weight in the cleaned data sample
ggpairs(bmi,aes(color=Weight))
Part III: Visualization
Ternary Diagrams
Compositional data is commonly visualized using a ternary diagram, which is represented by a triangle. Each vertex (and adjacent side) of the triangle represents a different aspect of the data: two of the vertices represent two of the components found in the data, while the third vertex represents the other components combined. Below is an example ternary diagram using the daily activity data:
The three components represented in this ternary diagram are sleep, sedentary activity, and the rest of the components combined, which all happen to be physical activities. The points on the diagram are colored by BMI group (healthy or obese). Because the points for the two groups overlap, the diagram shows no visible evidence of a difference between the groups.
The left-hand side of the triangle features the sleep axis, the right-hand side features the sedentary axis, and the bottom features the physical activity axis. The points are fairly close to the center of the triangle, meaning that they are somewhat balanced, but they are closer to the sleep corner. This indicates that sleep makes up the biggest portion of the daily activity time, which we saw numerically when creating the summary statistics.
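The data preparation behind a diagram like this can be sketched in base R. This is not the code used to produce the figure: the helper names and the assignment of parts to vertices are my own assumptions. The first step combines the three physical activity parts into one; the second maps each three-part composition to a point inside an equilateral triangle.

```r
# Combine LowPA, MedPA, and VigPA into a single PA part.
# `to_three_parts` and `ternary_xy` are hypothetical helpers for this sketch.
to_three_parts <- function(df) {
  data.frame(
    Sleep  = df$Sleep,
    Sedent = df$Sedent,
    PA     = df$LowPA + df$MedPA + df$VigPA
  )
}

# Map a three-part composition (a, b, c) to 2-D coordinates in a triangle
# with vertices a = (0, 0), b = (1, 0), c = (0.5, sqrt(3) / 2).
# This vertex layout is an arbitrary choice made for illustration.
ternary_xy <- function(a, b, c) {
  total <- a + b + c
  data.frame(x = (b + 0.5 * c) / total, y = (sqrt(3) / 2) * c / total)
}

# Two rows copied from the sample of the cleaned data (rows 186 and 192).
activity <- data.frame(
  Sleep  = c(0.4256066, 0.3824199),
  Sedent = c(0.2904476, 0.4042658),
  LowPA  = c(0.2180762, 0.1815667),
  MedPA  = c(0.04222257, 0.02298345),
  VigPA  = c(0.023646978, 0.008764219)
)

three <- to_three_parts(activity)
xy <- ternary_xy(three$PA, three$Sedent, three$Sleep)
print(cbind(three, xy))
```

The resulting `x` and `y` columns could then be passed to any scatterplot function, with the triangle's edges drawn between the three vertices.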
Conclusion
I hope you enjoyed this very brief introduction to compositional data analysis! In particular, I hope that you begin to notice compositional data in future analyses and in your daily life. There are many more examples of compositional data that were not mentioned in the Examples Section of this tutorial! [Trick #3, Dr. South - internal link]
There are many possible extensions to CODA, and I hope to work on one (or two... or three!) over the next couple of years! Thank you for reading!