knitr::kable(head(data, n = 3), digits = 1)
| sleep | sed | lpa | mvpa | age | bmi | race | site | glucose |
|---|---|---|---|---|---|---|---|---|
| 485.1 | 639.4 | 257.0 | 0 | 36 | 29.6 | Black | A | 154.1 |
| 503.3 | 676.9 | 336.7 | 22 | 27 | 27.7 | Black | A | 131.7 |
| 601.5 | 601.3 | 240.2 | 0 | 21 | 30.4 | Black | A | 161.8 |
Compositional data represent parts of a whole, where only the ratios between parts matter, not their absolute values.
A composition is a list of positive numbers (e.g., percentages, proportions, counts).
Its parts always add up to a fixed total (e.g., 100%, 1, or 24 hours in a day).
Imagine a cocktail composed of gin, tonic water, and lime juice. The total volume is fixed: if you add more gin, you must reduce tonic or lime to keep the total the same. What matters is the proportion between ingredients, not their absolute amounts.
1. Parts of a Whole
Example: A cocktail made of gin, tonic water, and lime juice.
The total is always fixed = one drink (e.g., 250 mL).
2. Ratios, not absolute values:
Suppose the drink is composed of 40% gin, 50% tonic water, and 10% lime juice.
Relative: The proportion of each ingredient determines the drink’s flavor and strength.
Not absolute: Milliliters (e.g., 100 mL of gin) depend on the size of the glass.
This is why CoDA focuses on proportions or percentages rather than raw amounts.
3. Fixed total:
Original: 40% gin, 50% tonic water, 10% lime juice.
If you remove lime juice (10%), the new proportions adjust:
Gin becomes ~44.4%
Tonic water becomes ~55.6% (total still = 100%).
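The renormalization in the cocktail example can be checked with a few lines of base R. This is a minimal sketch; the helper name `close_parts` is hypothetical and simply rescales a vector of positive parts to a chosen total.

```r
# Original recipe as proportions of one drink
drink <- c(gin = 0.40, tonic = 0.50, lime = 0.10)

# Closure: rescale positive parts so that they sum to `total`
# (hypothetical helper, not taken from any package)
close_parts <- function(x, total = 1) x / sum(x) * total

# Drop the lime juice and re-close the remaining parts to 100%
no_lime <- close_parts(drink[c("gin", "tonic")], total = 100)
print(round(no_lime, 1))

# The gin-to-tonic ratio is the same before and after the closure
ratio_before <- drink[["gin"]] / drink[["tonic"]]
ratio_after  <- no_lime[["gin"]] / no_lime[["tonic"]]
print(c(before = ratio_before, after = ratio_after))
```

The percentages change when a part is removed, but the ratio between the remaining parts does not, which is the property CoDA builds on.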
When daily time is split into mutually exclusive categories (e.g., sleep, SED, LPA, MVPA), the resulting data form a composition.
Total time is fixed (24 hours = 1,440 minutes)
Increasing time in one domain requires decreasing time in another.
The parts are not independent.
Example: If a person sleeps 8 h (33.3 % of the day), is sedentary for 12 h 30 min (52.1 % of the day) and performs light activity for 3 h (12.5 % of the day), only 30 min (2.1 % of the day) remain for MVPA.
You can’t increase MVPA without reducing something else (e.g., sleep or sedentary time).
Standard regression models assume that predictors can vary independently of one another (no perfect collinearity).
However, in compositional data, this assumption is violated by the closure constraint (all parts sum to a constant), which induces dependencies among components.
Changing one necessarily affects the others.
Example: If you know someone’s sleep, SED, and LPA time, you can calculate MVPA exactly (MVPA = 24 h - sleep - SED - LPA).
MVPA isn’t a “free variable”, so a model that includes all four parts cannot estimate a separate coefficient for it.
Suppose the model says: “More MVPA improves health” and “Less sedentary time improves health”.
The model can’t tell whether the benefit comes from adding MVPA or from removing sedentary time.
In fixed-sum data, “increasing” one behavior automatically “decreases” another.
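The collinearity problem is easy to reproduce. The sketch below uses made-up data (all variable names and parameters are hypothetical, not from the case study): four parts are closed to 1,440 minutes and entered together in `lm()`.

```r
set.seed(1)
n <- 200

# Hypothetical raw durations, closed so every row sums to 1,440 minutes
raw <- cbind(
  sleep = runif(n, 400, 600),
  sed   = runif(n, 500, 700),
  lpa   = runif(n, 150, 350),
  mvpa  = runif(n, 5, 60)
)
tu <- as.data.frame(raw / rowSums(raw) * 1440)

# Hypothetical health outcome
tu$health <- 100 - 0.05 * tu$sed + rnorm(n, sd = 5)

# Intercept + all four parts are perfectly collinear (the parts sum to 1440)
fit_naive <- lm(health ~ sleep + sed + lpa + mvpa, data = tu)
print(coef(fit_naive))  # the coefficient for mvpa is reported as NA
```

Dropping one part makes the model estimable, but each remaining coefficient then describes a trade against the omitted part, which is rarely what the analyst intends to report.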
Hypothetical example: “More social media use correlates with higher anxiety.”
If someone spends more time on social media, they might reduce sleep or exercise time (which protect mental health).
The “harm” of social media could be due to the displacement of healthier activities, not social media itself.
- Apply CoDA transformations (e.g. isometric log-ratio, ilr) to remove the constant-sum constraint and generate variables suitable for regression.
Naïve model: Health ~ Sleep + Sed + PA
Compositional model: Health ~ log(Sleep/Sed) + log(PA/Sed)
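As a rough illustration of the compositional formula above, the following sketch fits it to simulated data. The data, effect sizes, and object names are hypothetical; the predictors are plain log-ratios with SED as the common denominator (often called additive log-ratios), which is what the formula shows, not the ilr coordinates used later.

```r
set.seed(2)
n <- 150

# Hypothetical three-part day (sleep, sed, pa), closed to 1,440 minutes
raw <- cbind(
  sleep = runif(n, 400, 600),
  sed   = runif(n, 500, 700),
  pa    = runif(n, 150, 400)
)
tu <- as.data.frame(raw / rowSums(raw) * 1440)

# Hypothetical outcome that depends on the PA-to-SED ratio
tu$health <- 50 + 4 * log(tu$pa / tu$sed) + rnorm(n)

# Two log-ratios carry all the relative information in three parts
fit_lr <- lm(health ~ log(sleep / sed) + log(pa / sed), data = tu)
print(coef(fit_lr))
```

Three parts need only two log-ratios, so nothing is redundant and every coefficient is estimable.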
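Example: Fat = 0.70 | Lean = 0.25 | Bone = 0.05
Compute the geometric mean of the “other two”: \[
\sqrt{\mathrm{Lean}\times \mathrm{Bone}}
= \sqrt{0.25 \times 0.05}
\approx 0.112
\]
Ratio: \[
R_1
= \frac{\mathrm{Fat}}{\sqrt{\mathrm{Lean}\times \mathrm{Bone}}}
= \frac{0.70}{0.112}
\approx 6.26
\]
Take the log and scale by the normalizing constant \(\sqrt{2/3}\): \[
\mathrm{ilr}_1
= \sqrt{\tfrac{2}{3}}\,\ln\bigl(R_1\bigr)
\approx 0.816 \times 1.83
\approx 1.50
\]
Simple ratio for \(\mathrm{ilr}_2\)
\[
R_2
= \frac{\mathrm{Lean}}{\mathrm{Bone}}
= \frac{0.25}{0.05}
= 5
\]
Take the log and scale by \(\sqrt{1/2}\) for \(\mathrm{ilr}_2\)
\[
\mathrm{ilr}_2
= \sqrt{\tfrac{1}{2}}\,\ln\bigl(R_2\bigr)
\approx 0.707 \times 1.61
\approx 1.14
\]
Result
\[
(\mathrm{ilr}_1,\,\mathrm{ilr}_2)\approx (1.50,\,1.14)
\]
The scaling constants make the coordinates orthonormal, which is what “isometric” refers to; without them the values are plain log-ratios (1.83 and 1.61).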
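The arithmetic of this worked example can be reproduced in base R. The sketch prints both the plain log-ratios and the versions multiplied by the usual ilr normalizing constants for three parts, \(\sqrt{2/3}\) and \(\sqrt{1/2}\); the choice of which part goes in the numerator follows the example and is an assumption about the basis, not a rule.

```r
parts <- c(Fat = 0.70, Lean = 0.25, Bone = 0.05)

# Coordinate 1: Fat against the geometric mean of Lean and Bone
gm_rest <- sqrt(parts[["Lean"]] * parts[["Bone"]])
log_r1  <- log(parts[["Fat"]] / gm_rest)

# Coordinate 2: Lean against Bone
log_r2 <- log(parts[["Lean"]] / parts[["Bone"]])

# Plain (unscaled) log-ratios
print(round(c(log_r1 = log_r1, log_r2 = log_r2), 2))

# Scaled coordinates: the constants below are the standard ilr
# normalizing factors for a three-part composition
ilr_scaled <- c(ilr1 = sqrt(2 / 3) * log_r1, ilr2 = sqrt(1 / 2) * log_r2)
print(round(ilr_scaled, 2))
```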
This case is based on a simulated dataset representing pregnant women in the second trimester of gestation.
Our aim is to explore how daily time-use (Sleep, SED, LPA and MVPA) composition is associated with the 1-hour 50-gram oral glucose screening test.
Sample:
- n = 411 simulated observations
Variables:
- Sleep (min/day)
- Sedentary behavior (min/day)
- Light physical activity – LPA (min/day)
- Moderate-to-vigorous physical activity – MVPA (min/day)
- Glucose screening test (mg/dL, from medical records)
- Age, pre-pregnancy BMI, race, study site
The dataset was simulated using normal distributions based on realistic parameters (mean and standard deviation) drawn from a real cohort study involving pregnant women in the second trimester. Negative values were set to zero to maintain plausible durations.
data |>
  summarise(
    Sleep_Mean = mean(sleep), Sleep_SD = sd(sleep),
    Sed_Mean = mean(sed), Sed_SD = sd(sed),
    LPA_Mean = mean(lpa), LPA_SD = sd(lpa),
    MVPA_Mean = mean(mvpa), MVPA_SD = sd(mvpa)
  ) |>
  pivot_longer(everything(), names_to = "Metric", values_to = "Value") |>
  separate(Metric, into = c("Behavior", "Statistic"), sep = "_") |>
  pivot_wider(names_from = Statistic, values_from = Value) |>
  knitr::kable(digits = 1, caption = "Mean and SD by behavior")
| Behavior | Mean | SD |
|---|---|---|
| Sleep | 516.5 | 52.9 |
| Sed | 614.3 | 97.6 |
| LPA | 267.1 | 93.7 |
| MVPA | 27.3 | 16.3 |
ggplot(data, aes(x = glucose)) +
  geom_histogram(fill = "skyblue", color = "white", bins = 30) +
  geom_vline(aes(xintercept = mean(glucose)), color = "red", linetype = "solid", size = 1) +
  geom_vline(aes(xintercept = quantile(glucose, 0.25)), color = "darkgreen", linetype = "dashed") +
  geom_vline(aes(xintercept = quantile(glucose, 0.75)), color = "darkgreen", linetype = "dashed") +
  labs(x = "Glucose (mg/dL)", y = "Count") +
  theme_minimal()
data |>
  summarise(
    Glucose_Mean = mean(glucose), Glucose_SD = sd(glucose),
    BMI_Mean = mean(bmi), BMI_SD = sd(bmi),
    Age_Mean = mean(age), Age_SD = sd(age)
  ) |>
  pivot_longer(everything(), names_to = "Metric", values_to = "Value") |>
  separate(Metric, into = c("Variable", "Statistic"), sep = "_") |>
  pivot_wider(names_from = Statistic, values_from = Value) |>
  knitr::kable(digits = 1, caption = "")
| Variable | Mean | SD |
|---|---|---|
| Glucose | 119.6 | 28.7 |
| BMI | 27.6 | 3.7 |
| Age | 30.2 | 5.1 |
In this case study, we use a set of R packages designed for compositional data analysis and model interpretation.
library(compositions) # For compositional data analysis
library(zCompositions) # For handling zero-replacement in compositional data
library(car) # For Type II ANOVA and model diagnostics
library(performance) # For model performance checks
library(tidyverse) # For data manipulation and visualization
library(ggtern) # For ternary plots (exploratory)
library(parameters) # Extract model parameters
📚 Key references on compositions package:
Min. 1st Qu. Median Mean 3rd Qu. Max.
1005 1334 1428 1425 1528 1916
Min. 1st Qu. Median Mean 3rd Qu. Max.
1440 1440 1440 1440 1440 1440
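The two summaries above are consistent with daily totals checked before and after rescaling every row to 1,440 minutes. The sketch below shows that check in base R on the first three rows printed at the top of this page (rounded to one decimal); the object names are hypothetical and the tutorial itself may use a package function for the closure instead.

```r
# First three rows of the dataset, as displayed above (minutes/day)
minutes <- cbind(
  sleep = c(485.1, 503.3, 601.5),
  sed   = c(639.4, 676.9, 601.3),
  lpa   = c(257.0, 336.7, 240.2),
  mvpa  = c(0, 22, 0)
)

# Raw daily totals do not equal 1,440 minutes
totals_raw <- rowSums(minutes)
print(totals_raw)

# Closure: rescale each row so that it sums to 1,440 minutes
# (dividing a matrix by a vector of row totals recycles row by row)
closed <- minutes / totals_raw * 1440
print(round(closed, 1))
print(rowSums(closed))
```

Closure changes the minutes but not the ratios between behaviors within a day.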
Log transformations are undefined for zero.
Check for zeros before applying log-ratio methods. In the summary below, the minimum of mvpa is 0.00, so zeros are present.
sleep sed lpa mvpa
Min. :351.0 Min. :396.7 Min. : 40.14 Min. : 0.00
1st Qu.:485.0 1st Qu.:575.2 1st Qu.:213.07 1st Qu.:15.21
Median :521.7 Median :622.9 Median :269.81 Median :27.81
Mean :525.7 Mean :620.3 Mean :266.24 Mean :27.83
3rd Qu.:568.8 3rd Qu.:669.1 3rd Qu.:318.95 3rd Qu.:39.15
Max. :781.0 Max. :799.0 Max. :458.21 Max. :85.83
Zeros are replaced using a Bayesian-multiplicative method for compositional data (zCompositions package).
No. adjusted imputations: 29
sleep sed lpa mvpa
Min. :351.0 Min. :396.7 Min. : 40.14 Min. : 0.02048
1st Qu.:485.0 1st Qu.:575.2 1st Qu.:213.07 1st Qu.:15.20846
Median :521.7 Median :622.9 Median :269.81 Median :27.81215
Mean :525.7 Mean :620.3 Mean :266.24 Mean :27.82837
3rd Qu.:568.8 3rd Qu.:669.1 3rd Qu.:318.95 3rd Qu.:39.14651
Max. :781.0 Max. :799.0 Max. :458.21 Max. :85.83144
sleep sed lpa mvpa
1 505.6735 666.4289 267.8976 0.02047755
2 470.9306 633.4337 315.0426 20.59319744
3 600.2396 600.0918 239.6686 0.02047755
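To show the general idea behind zero replacement, here is a sketch of the simpler, non-Bayesian multiplicative replacement: each zero becomes a small value and the non-zero parts are shrunk so the row total is unchanged. This is a swapped-in technique for illustration, not the Bayesian-multiplicative method used above; the function name `replace_zeros` and the value `delta = 0.5` are hypothetical.

```r
# Simple multiplicative zero replacement (illustration only)
replace_zeros <- function(x, delta) {
  x <- as.matrix(x)
  total   <- rowSums(x)                          # original row totals
  is_zero <- x == 0                              # which cells are zero
  shrink  <- 1 - rowSums(is_zero) * delta / total  # per-row shrink factor
  out <- x * shrink                              # shrink non-zero parts row by row
  out[is_zero] <- delta                          # fill zeros with the small value
  out
}

# Two hypothetical days in minutes; the first has zero MVPA
days <- rbind(
  c(sleep = 505.7, sed = 666.4, lpa = 267.9, mvpa = 0),
  c(sleep = 470.9, sed = 633.4, lpa = 315.0, mvpa = 20.7)
)

fixed <- replace_zeros(days, delta = 0.5)
print(round(fixed, 2))
print(rowSums(fixed))
```

Rows without zeros are left unchanged, and ratios among the non-zero parts are preserved in rows that are adjusted.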
sleep sed lpa mvpa
[1,] "0.35116" "0.46279" "0.18604" "0.00001"
[2,] "0.32704" "0.43988" "0.21878" "0.01430"
[3,] "0.41683" "0.41672" "0.16643" "0.00001"
attr(,"class")
[1] "acomp"
sleep sed lpa mvpa
"0.37152067" "0.43824898" "0.17983075" "0.01039961"
attr(,"class")
[1] "acomp"
sleep sed lpa mvpa
535.0 631.1 259.0 15.0
In compositional data analysis, the mean composition is calculated using the geometric mean across observations, preserving the relative structure of the data.
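First, calculate the geometric mean for each part:
\[ g_i = \left( \prod_{j=1}^n p_{ij} \right)^{1/n} \]
Then close the geometric means so that they sum to 1:
\[ \overline{p}_i = \frac{g_i}{\sum_{k=1}^{D} g_k} \]
where:
- \( g_i \) = geometric mean of part \(i\) across observations,
- \( \overline{p}_i \) = compositional mean proportion for part \(i\),
- \( p_{ij} \) = observed proportion for part \(i\) in observation \(j\),
- \( n \) = number of observations,
- \( D \) = number of parts.
Finally, rescale to the total time:
\[ \overline{m}_i = \overline{p}_i \times T \]
where \( T \) is the total time (e.g., 1,440 minutes).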
Example with 2 observations:
Observation 1:
8h sleep (33.33%), 10h sedentary (41.67%), 4h LPA (16.67%), 2h MVPA (8.33%).
Observation 2:
7h sleep (29.17%), 12h sedentary (50%), 3h LPA (12.5%), 2h MVPA (8.33%).
Step 1: Calculate the geometric mean for each part
\[ \sqrt{33.33\% \times 29.17\%} = \sqrt{0.3333 \times 0.2917} = \sqrt{0.09722} \approx 0.3118 = 31.18\% \]
\[ \sqrt{41.67\% \times 50\%} = \sqrt{0.4167 \times 0.5} = \sqrt{0.20835} \approx 0.4565 = 45.65\% \]
\[ \sqrt{16.67\% \times 12.5\%} = \sqrt{0.1667 \times 0.125} = \sqrt{0.0208375} \approx 0.1444 = 14.44\% \]
\[ \sqrt{8.33\% \times 8.33\%} = \sqrt{0.0833 \times 0.0833} = \sqrt{0.00694} \approx 0.0833 = 8.33\% \]
Step 2: Close and rescale to 1,440 minutes
The four geometric means sum to about 99.6%, not 100%, so divide each by their sum and multiply by 1,440: sleep ≈ 451 min, sedentary ≈ 660 min, LPA ≈ 209 min, MVPA ≈ 120 min (total = 1,440 min).
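The two-observation example can be computed directly from the hours given for each observation. This is a minimal base R sketch with hypothetical object names: it takes the geometric mean of each part, closes the result, and rescales it to 1,440 minutes.

```r
# The two observations, in hours per day
hours <- rbind(
  obs1 = c(sleep = 8, sed = 10, lpa = 4, mvpa = 2),
  obs2 = c(sleep = 7, sed = 12, lpa = 3, mvpa = 2)
)

# Proportions of the day
props <- hours / rowSums(hours)

# Step 1: geometric mean of each part (mean of logs, then exp)
geo <- exp(colMeans(log(props)))
print(round(100 * geo, 2))   # percentages; they do not sum to 100

# Step 2: close to 1, then rescale to 1,440 minutes
center  <- geo / sum(geo)
minutes <- 1440 * center
print(round(minutes, 1))
print(sum(minutes))
```

The closure step is what turns the per-part geometric means into a valid composition.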
A simple arithmetic mean (e.g., averaging minutes directly) ignores the relative nature of compositions. For example, if one person has 1000 min SED and another has 200 min, the arithmetic mean (600 min) is equidistant in minutes but not in ratios: it is 3 times the lower value and only 0.6 times the higher one.
The geometric mean, computed in log space and then closed, preserves ratios (e.g., Sleep/SED), which is key for CoDA, and yields a valid mean composition (sums to 1 or 1,440).
This respect for relative structure is what distinguishes CoDA from traditional descriptive statistics.
Two friends have different monthly budgets (total = $3,000 each).
Alice’s budget:
Rent: $1,500 (50%)
Groceries: $900 (30%)
Fun: $600 (20%)
Bob’s budget:
Rent: $900 (30%)
Groceries: $900 (30%)
Fun: $1,200 (40%)
Problem with the arithmetic mean
Step 1: Calculate the arithmetic mean of absolute dollars:
Mean Rent: ($1,500 + $900)/2 = $1,200
Mean Groceries: ($900 + $900)/2 = $900
Mean Fun: ($600 + $1,200)/2 = $900
Step 2: Sum the means:
$1,200 (Rent) + $900 (Groceries) + $900 (Fun) = $3,000 → Looks okay?
But the ratios are distorted!
Alice’s Rent/Groceries ratio: 50%/30% = 1.67:1
Bob’s Rent/Groceries ratio: 30%/30% = 1:1
Mean’s Rent/Groceries ratio: 1,200/900 = 1.33:1
The arithmetic mean doesn’t preserve the relative trade-offs between rent and groceries. It averages the absolute values but ignores how categories depend on each other.
The geometric mean finds a central ratio that respects the compositional structure. For Rent/Groceries, that central ratio is the geometric mean of the two individual ratios: \(\sqrt{1.67 \times 1} \approx 1.29\):1.
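The budget comparison can be verified numerically. The sketch below (base R, hypothetical object names) computes both the arithmetic mean and the closed geometric mean of the two budgets and compares the Rent/Groceries ratio under each.

```r
budgets <- rbind(
  Alice = c(Rent = 1500, Groceries = 900, Fun = 600),
  Bob   = c(Rent = 900,  Groceries = 900, Fun = 1200)
)

# Arithmetic mean of dollars
arith_mean <- colMeans(budgets)

# Closed geometric mean, rescaled to the $3,000 total
geo_mean  <- exp(colMeans(log(budgets)))
comp_mean <- geo_mean / sum(geo_mean) * 3000
print(round(comp_mean, 1))

# Rent-to-groceries ratio for each person and for each kind of mean
rent_to_groceries <- c(
  Alice      = budgets["Alice", "Rent"] / budgets["Alice", "Groceries"],
  Bob        = budgets["Bob", "Rent"] / budgets["Bob", "Groceries"],
  arithmetic = arith_mean[["Rent"]] / arith_mean[["Groceries"]],
  geometric  = comp_mean[["Rent"]] / comp_mean[["Groceries"]]
)
print(round(rent_to_groceries, 2))
```

The geometric-mean ratio equals the geometric mean of Alice's and Bob's ratios, so it sits at the multiplicative midpoint between them.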
sleep sed lpa mvpa
sleep 0.000 0.041 0.172 3.982
sed 0.041 0.000 0.185 3.987
lpa 0.172 0.185 0.000 4.256
mvpa 3.982 3.987 4.256 0.000
\[
\mathrm{Var}\bigl(\ln\tfrac{C_j}{C_i}\bigr)
\]
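Each entry of the matrix is the variance of the log-ratio between two parts. Example: sleep and sed have a low log-ratio variance (0.041), so people’s sleep/sedentary ratios don’t change much across the sample. Log-ratios involving mvpa have much higher variances (about 4), meaning MVPA varies a lot relative to the other behaviors.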
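A variation matrix of this kind can be computed by hand from the definition above. The following is a base R sketch on simulated data; the function name `variation_matrix` and the simulated ranges are hypothetical, chosen so that MVPA is the most variable part in relative terms.

```r
# Variance of ln(C_j / C_i) for every pair of parts
variation_matrix <- function(x) {
  lx <- log(as.matrix(x))
  D  <- ncol(lx)
  out <- matrix(0, D, D, dimnames = list(colnames(lx), colnames(lx)))
  for (i in seq_len(D)) {
    for (j in seq_len(D)) {
      out[i, j] <- var(lx[, j] - lx[, i])
    }
  }
  out
}

set.seed(3)
# Hypothetical minutes per day; log-ratios do not depend on the row totals
sim <- cbind(
  sleep = runif(50, 400, 600),
  sed   = runif(50, 500, 700),
  lpa   = runif(50, 150, 350),
  mvpa  = runif(50, 2, 60)
)

vm <- variation_matrix(sim)
print(round(vm, 3))
```

Small entries mark pairs of behaviors that move together proportionally; large entries mark pairs whose ratio differs widely between people.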
# Prepare compositional data with four components
comp_data <- data.frame(
  MVPA = comp[, "mvpa"],
  LPA = comp[, "lpa"],
  SED = comp[, "sed"],
  Sleep = comp[, "sleep"]
)
# Calculate the mean composition as the closed geometric mean
# (an arithmetic colMeans() would not match the compositional mean defined above)
geo_mean <- exp(colMeans(log(comp_data)))
mean_comp <- geo_mean / sum(geo_mean)
subtitle_text <- sprintf(
  "Red point = mean composition (Sleep: %.2f, SED: %.2f, LPA: %.2f)",
  mean_comp["Sleep"],
  mean_comp["SED"],
  mean_comp["LPA"]
)
# Plot with ggtern
ggtern::ggtern(data = comp_data, aes(x = Sleep, y = SED, z = LPA, color = MVPA)) +
  geom_point(size = 2, alpha = 0.8) +
  annotate("point",
    x = mean_comp["Sleep"],
    y = mean_comp["SED"],
    z = mean_comp["LPA"],
    color = "red", size = 3, shape = 18) +
  theme_rgbw() +
  theme(
    text = element_text(size = 11),
    tern.axis.title = element_text(size = 13, face = "bold"),
    legend.position = "right",
    plot.title = element_text(face = "bold", size = 14)
  ) +
  labs(
    x = "Sleep",
    y = "SED",
    z = "LPA",
    color = "MVPA (4th Component)",
    subtitle = subtitle_text
  ) +
  scale_color_gradient(low = "lightblue", high = "darkblue") +
  guides(color = guide_colorbar(barwidth = 0.8, barheight = 8))
Data Frame Summary
covariates
Dimensions: 411 x 4
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|---|
| 1 | age [numeric] | Mean (sd): 30.2 (5.1); min < med < max: 18 < 30 < 44; IQR (CV): 7 (0.2) | 27 distinct values | 411 (100.0%) | 0 (0.0%) |
| 2 | bmi [numeric] | Mean (sd): 27.6 (3.7); min < med < max: 18.5 < 27.7 < 40.1; IQR (CV): 4.7 (0.1) | 409 distinct values | 411 (100.0%) | 0 (0.0%) |
| 3 | race [factor] | 1. White; 2. Black; 3. Hispanic; 4. Asian; 5. Other | 247 (60.1%); 80 (19.5%); 52 (12.7%); 19 (4.6%); 13 (3.2%) | 411 (100.0%) | 0 (0.0%) |
| 4 | local [factor] | 1. A; 2. B; 3. C | 164 (39.9%); 145 (35.3%); 102 (24.8%) | 411 (100.0%) | 0 (0.0%) |
[,1] [,2] [,3]
[1,] 0.1951913153 -0.6314031 -8.655551
[2,] 0.2096179915 -0.4492569 -2.679971
[3,] -0.0001741719 -0.7495021 -8.642630
[4,] 0.0688752324 -0.7775520 -2.584114
[5,] -0.2726525586 -0.5099092 -2.565553
[6,] -0.0067887848 -1.5102301 -1.869668
attr(,"class")
[1] "rmult"
Our data show 411 people splitting their 24-hour day into Sleep, SED, LPA, and MVPA. These proportions sum to 100%, so they are interdependent: more Sleep means less time in at least one other behavior.
ILR coordinates transform the four proportions into three numbers that:
Keep the ratios (e.g., Sleep/SED).
Remove the 100% constraint.
Work in standard models.
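To make the transformation concrete, here is a hand-written sketch of one common ILR basis, in which coordinate \(i\) compares part \(i+1\) with the geometric mean of the parts before it. The function name `ilr_by_hand` is hypothetical. Applied to the first zero-replaced row shown earlier, it lands close to the first row of coordinates printed above (about 0.195, -0.631, -8.656), which suggests the same basis was used; that match is an observation, not documentation of the package.

```r
# ILR coordinates for a D-part composition using the basis
#   z_i = sqrt(i / (i + 1)) * ln( x_{i+1} / geometric_mean(x_1, ..., x_i) )
ilr_by_hand <- function(x) {
  lx <- unname(log(x))
  D  <- length(lx)
  vapply(seq_len(D - 1), function(i) {
    sqrt(i / (i + 1)) * (lx[i + 1] - mean(lx[seq_len(i)]))
  }, numeric(1))
}

# First row after zero replacement (minutes/day), as printed earlier
day1 <- c(sleep = 505.6735, sed = 666.4289, lpa = 267.8976, mvpa = 0.02047755)

z <- ilr_by_hand(day1)
print(round(z, 3))

# Log-ratios ignore scale: proportions give the same coordinates as minutes
z_prop <- ilr_by_hand(day1 / sum(day1))
print(all.equal(z, z_prop))
```

The very negative third coordinate reflects the tiny replaced MVPA value relative to the other three behaviors.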
Anova Table (Type II tests)
Response: glucose
Sum Sq Df F value Pr(>F)
ilr_comp 193616 3 184.0695 <2e-16 ***
age 826 1 2.3551 0.1257
bmi 366 1 1.0442 0.3075
race 345 4 0.2463 0.9119
local 760 2 1.0838 0.3393
Residuals 139898 399
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Parameter | Coefficient | SE | 95% CI | t(399) | p
-------------------------------------------------------------------------
(Intercept) | 67.56 | 9.38 | [ 49.11, 86.00] | 7.20 | < .001
ilr comp1 | 91.01 | 6.50 | [ 78.23, 103.79] | 14.00 | < .001
ilr comp2 | -25.93 | 2.84 | [-31.51, -20.35] | -9.14 | < .001
ilr comp3 | -9.09 | 0.54 | [-10.16, -8.03] | -16.85 | < .001
age | -0.28 | 0.18 | [ -0.64, 0.08] | -1.53 | 0.126
bmi | 0.26 | 0.26 | [ -0.24, 0.77] | 1.02 | 0.307
race [Black] | 0.75 | 2.44 | [ -4.05, 5.56] | 0.31 | 0.758
race [Hispanic] | -1.49 | 2.88 | [ -7.16, 4.18] | -0.52 | 0.605
race [Asian] | 1.51 | 4.53 | [ -7.39, 10.41] | 0.33 | 0.738
race [Other] | -3.39 | 5.37 | [-13.95, 7.17] | -0.63 | 0.528
local [B] | -2.91 | 2.16 | [ -7.17, 1.34] | -1.35 | 0.179
local [C] | -0.04 | 2.38 | [ -4.73, 4.65] | -0.02 | 0.987
Results: The way people split their day (the ILR coordinates, tested jointly) is strongly associated with glucose levels (F = 184.1, p < 0.001), whereas age, BMI, race, and study site aren’t significant here. Time-use patterns are key!
Limitation: We can’t isolate each activity’s effect (e.g., Sleep alone). For that, we’ll explore methods like isotemporal substitution.
Equation:
\[ \operatorname{\widehat{glucose}} = 67.56 + 91.01(\operatorname{ilr\_comp}_{\operatorname{1}}) - 25.93(\operatorname{ilr\_comp}_{\operatorname{2}}) - 9.09(\operatorname{ilr\_comp}_{\operatorname{3}}) - 0.28(\operatorname{age}) + 0.26(\operatorname{bmi}) + 0.75(\operatorname{race}_{\operatorname{Black}}) - 1.49(\operatorname{race}_{\operatorname{Hispanic}}) + 1.51(\operatorname{race}_{\operatorname{Asian}}) - 3.39(\operatorname{race}_{\operatorname{Other}}) - 2.91(\operatorname{local}_{\operatorname{B}}) - 0.04(\operatorname{local}_{\operatorname{C}}) \]
As an example, let’s theoretically exchange 30 minutes of SED for MVPA, keeping the total time (1440 minutes) fixed, to predict changes in glucose.
Steps:
Prepare the data and run a new model to make future prediction steps easier
# Load outcome variables and covariates from the dataset
glucose <- data$glucose # Outcome variable: glucose from screening test
bmi <- data$bmi # Covariate: body mass index
race <- factor(data$race) # Covariate: race category
local <- factor(data$site) # Covariate: study site
# Extract isometric log-ratio (ILR) coordinates from the compositional data
ilr_data <- as.data.frame(ilr_comp)
colnames(ilr_data) <- c("ilr1", "ilr2", "ilr3")
# Create dataframe for the regression model
# Combining outcome, ILR coordinates, and covariates
model_df <- data.frame(
  glucose = glucose,
  ilr1 = ilr_data[, 1],
  ilr2 = ilr_data[, 2],
  ilr3 = ilr_data[, 3],
  bmi = bmi,
  race = race,
  local = local
)
# Fit the regression model with the ILR coordinates, BMI, race, and site
model <- lm(glucose ~ ilr1 + ilr2 + ilr3 + bmi + race + local, data = model_df)
Obtain the compositional mean of the sample and transform it into isometric log-ratio (ILR) coordinates.
# Get the compositional mean (reference composition)
comp_mean <- mean(comp)
comp_mean_min <- clo(comp_mean, total = 1440) # Scale to minutes per day (24h = 1440 min)
# Convert to proper format (named numeric vector in minutes)
comp_ref <- as.numeric(comp_mean_min)
names(comp_ref) <- c("sleep", "sed", "lpa", "mvpa")
# Display mean composition in minutes
print(round(comp_ref, 1))
sleep sed lpa mvpa
535.0 631.1 259.0 15.0
[1] 0.1168018 -0.6598760 -2.9349900
attr(,"class")
[1] "rmult"
Prepare the theoretical substitution data and transform it into isometric log-ratio (ILR) coordinates.
Isotemporal: Total time remains constant (no extra time added).
ILR coordinates: Needed as input to the regression model.
[1] 1440
[1] 0.08236237 -0.63999237 -1.96855747
attr(,"class")
[1] "rmult"
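The substitution step can be sketched in base R. The example starts from the mean composition printed earlier (as proportions), moves 30 minutes from SED to MVPA, and recomputes the ILR coordinates with a hand-written helper. The helper `ilr_by_hand` and the object names are hypothetical, and the basis it uses is an assumption that happens to agree with the coordinates printed above.

```r
# Hand-written ILR: coordinate i compares part i+1 with the
# geometric mean of parts 1..i (assumed basis)
ilr_by_hand <- function(x) {
  lx <- unname(log(x))
  vapply(seq_len(length(lx) - 1), function(i) {
    sqrt(i / (i + 1)) * (lx[i + 1] - mean(lx[seq_len(i)]))
  }, numeric(1))
}

# Mean composition as printed earlier, rescaled to minutes per day
ref <- c(sleep = 0.37152067, sed = 0.43824898, lpa = 0.17983075, mvpa = 0.01039961)
ref <- ref / sum(ref) * 1440

# Isotemporal reallocation: 30 minutes from SED to MVPA
new <- ref
new["sed"]  <- new["sed"] - 30
new["mvpa"] <- new["mvpa"] + 30

# Sanity checks: all parts stay positive and the day still has 1,440 minutes
stopifnot(all(new > 0))
print(sum(new))

ref_z <- ilr_by_hand(ref)
new_z <- ilr_by_hand(new)
print(round(rbind(reference = ref_z, reallocated = new_z), 4))
```

Only the coordinates change; the total stays fixed, which is what makes the comparison isotemporal.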
Prepare the data for prediction.
# Automatically detect the most common categories for categorical variables
# (this overwrites the race and local vectors; model_df keeps the full data)
race <- names(which.max(table(race)))
local <- names(which.max(table(local)))
# Create dataframes for prediction
# For reference composition
data_ref <- data.frame(
  ilr1 = ref_ilr[1],
  ilr2 = ref_ilr[2],
  ilr3 = ref_ilr[3],
  bmi = mean(bmi),
  race = factor(race, levels = levels(model_df$race)),   # Most common race category
  local = factor(local, levels = levels(model_df$local)) # Most common site
)
head(data_ref)
ilr1 ilr2 ilr3 bmi race local
1 0.1168018 -0.659876 -2.93499 27.60357 White A
ilr1 ilr2 ilr3 bmi race local
1 0.08236237 -0.6399924 -1.968557 27.60357 White A
Make predictions with the model
fit lwr upr
1 120.6163 117.3079 123.9247
fit lwr upr
1 108.1861 104.6688 111.7035
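The prediction step follows the usual `predict()` pattern for `lm` objects. The sketch below uses a simulated toy model, not the tutorial's data, so its numbers will differ from the output above; `toy`, `toy_model`, and the simulated coefficients are hypothetical. It assumes confidence intervals for the mean, which is consistent with how narrow the intervals above are.

```r
set.seed(4)
n <- 120

# Simulated toy data with three ILR-like predictors and BMI
toy <- data.frame(
  ilr1 = rnorm(n),
  ilr2 = rnorm(n),
  ilr3 = rnorm(n, mean = -3),
  bmi  = rnorm(n, mean = 27.6, sd = 3.7)
)
toy$glucose <- 120 + 5 * toy$ilr1 - 3 * toy$ilr2 - 8 * toy$ilr3 + rnorm(n, sd = 10)

toy_model <- lm(glucose ~ ilr1 + ilr2 + ilr3 + bmi, data = toy)

# Row 1: reference composition; row 2: after the 30-minute reallocation
newdata <- data.frame(
  ilr1 = c(0.1168018, 0.08236237),
  ilr2 = c(-0.6598760, -0.63999237),
  ilr3 = c(-2.9349900, -1.96855747),
  bmi  = mean(toy$bmi)
)

pred <- predict(toy_model, newdata = newdata, interval = "confidence")
print(round(pred, 1))

# Estimated change in the outcome attributable to the reallocation
change <- pred[2, "fit"] - pred[1, "fit"]
print(round(change, 1))
```

Because covariates are held at the same values in both rows, the difference between the two fitted values depends only on the change in ILR coordinates.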
Compositional Data Analysis in Time-Use Epidemiology: What, Why, How
A compositional analysis study of body composition and cardiometabolic risk factors
Physical Activity and Women’s Health Lab (PAWH)