1 Introduction

Sweden comprises 290 municipalities that vary considerably in their socioeconomic and demographic characteristics. Understanding the multivariate structure of these differences is important for regional planning, policy evaluation, and the study of environmental inequalities. In 2017, Statistics Sweden (SCB) compiled a broad set of socioeconomic and environmental indicators at the municipal level, covering population dynamics, labour-market conditions, income, education, and emissions of greenhouse gases and dioxins.

The research question addressed here is: what are the main axes of variation among Swedish municipalities, and which socioeconomic and environmental factors drive these differences?

Principal Component Analysis (PCA) is a dimensionality-reduction technique that transforms a set of correlated continuous variables into a smaller number of uncorrelated components (principal components, PCs) ordered by the amount of variance they explain (Jolliffe, 2002). Because the 15 continuous variables in this dataset are measured on different scales and units, the PCA was performed on the correlation matrix (i.e. variables were centered and scaled to unit variance). The categorical variables region, county, and income were excluded from the PCA itself but are used to aid interpretation of the score plots.


2 Methods

2.1 Data

The dataset SCB_assign3.xlsx contains data for all 290 Swedish municipalities for the year 2017. It includes four categorical variables (municipality, region, county, income) and 15 continuous socioeconomic and environmental response variables (Table 1).

Table 1. Description of the 15 continuous response variables used in the PCA. All data are for the year 2017 and were obtained from Statistics Sweden (SCB).

Variable

Description

pop.size

Number of inhabitants

area

Area of the municipality (km2)

mean.age

Mean age of inhabitants (years)

mortality

Number of deaths / population size

natality

Number of births / population size

pop.change

Population change: (pop. 31 Dec - pop. 1 Jan) / pop. 31 Dec

immigration

Immigration rate (% of population)

emigration

Emigration rate (% of population)

tax.capacity

Tax base per inhabitant (SEK)

tax.equal

Economic equalisation per inhabitant (SEK); support based on tax capacity and structural costs

unemployment

Percentage of unemployed inhabitants

foreign.origin

Percentage of foreign-born inhabitants or inhabitants with two foreign-born parents

higher.edu

Percentage of inhabitants with university education

greenhouse.gases

Total emissions of greenhouse gases (kt CO2-equivalents)

dioxin.mg

Dioxin emissions (mg)

2.2 Statistical Analysis

PCA was performed in R (v4.x) using the built-in function prcomp() with center = TRUE and scale. = TRUE, so that each variable was standardised to zero mean and unit variance prior to analysis. Standardisation is necessary because the variables differ in units and magnitudes; without it, variables with large numeric ranges (e.g. pop.size, greenhouse.gases) would dominate the PCA solution.

The number of PCs to retain was decided using two complementary criteria:

  • Kaiser criterion: retain PCs with eigenvalue > 1, i.e. components that explain more variance than a single standardised variable (Kaiser, 1960).
  • Cumulative variance: retain enough PCs to explain a reasonable proportion of total variance (commonly >= 70–80%).

Score plots (PC coordinates of each municipality) were produced for the PC pairs PC1–PC2 and PC1–PC3, and municipalities were colour-coded by region and income class to aid interpretation. A biplot overlaying variable loading vectors on the score plot was used to link the position of municipalities to the underlying variables. Component loadings for PC1 and PC2 are additionally visualised as a bar chart to facilitate comparison across variables.


3 Results

3.1 Variance Explained and Choice of Principal Components

Table 2. Eigenvalues and proportion of variance explained by each principal component. Rows in bold have eigenvalue > 1 (Kaiser criterion).

PC

Eigenvalue

Variance explained (%)

Cumulative variance (%)

PC1

4.960

33.07

33.07

PC2

2.662

17.75

50.81

PC3

2.115

14.10

64.91

PC4

1.345

8.96

73.88

PC5

0.887

5.92

79.79

PC6

0.733

4.89

84.68

PC7

0.508

3.39

88.07

PC8

0.474

3.16

91.23

PC9

0.382

2.55

93.78

PC10

0.335

2.23

96.01

PC11

0.242

1.61

97.62

PC12

0.172

1.15

98.77

PC13

0.107

0.71

99.48

PC14

0.078

0.52

100.00

PC15

0.000

0.00

100.00

The PCA yielded 15 principal components. According to the Kaiser criterion (eigenvalue > 1), four components are worth retaining: PC1 (eigenvalue = 4.96, 33.1% of variance), PC2 (eigenvalue = 2.66, 17.7%), PC3 (eigenvalue = 2.12, 14.1%), and PC4 (eigenvalue = 1.35, 9.0%), together explaining 73.9% of the total variance (Table 2, Figure 1).

PC4, however, has an eigenvalue only marginally above 1 and adds relatively little interpretive value beyond PC1–PC3. The cumulative variance curve shows a pronounced elbow after PC3, and PC1–PC3 together account for 64.9% of total variance (Figure 2). For the main interpretation, this report therefore focuses on PC1, PC2, and PC3.

Figure 1. Scree plot showing the eigenvalue of each principal component. The dashed red line marks the Kaiser criterion (eigenvalue = 1); components above this line are retained. Four components exceed the threshold (PC1--PC4).

Figure 1. Scree plot showing the eigenvalue of each principal component. The dashed red line marks the Kaiser criterion (eigenvalue = 1); components above this line are retained. Four components exceed the threshold (PC1–PC4).

Figure 2. Cumulative variance explained by successive principal components. The dashed orange line marks the 80% threshold. PC1--PC3 explain 64.9% and PC1--PC5 explain 79.8% of total variance.

Figure 2. Cumulative variance explained by successive principal components. The dashed orange line marks the 80% threshold. PC1–PC3 explain 64.9% and PC1–PC5 explain 79.8% of total variance.

3.2 Component Loadings

The loadings of the 15 variables on PC1, PC2, and PC3 are shown in Figure 3 and Table 3.

PC1 (33.1%) represents a gradient from older, declining municipalities to younger, growing, and better-educated ones. Variables with strong positive loadings are higher.edu (0.36), pop.change (0.34), natality (0.26), pop.size (0.27), and tax.capacity (0.28), while mean.age (-0.39), mortality (-0.39), and tax.equal (-0.19) load negatively. Municipalities with high positive PC1 scores are populous, young, and have high tax capacity, whereas municipalities with negative PC1 scores are older, have higher mortality rates, and depend on fiscal equalisation transfers.

PC2 (17.7%) is driven primarily by mobility variables. Emigration (0.51) and immigration (0.49) load positively and almost equally, along with foreign.origin (0.33) and unemployment (0.25), while greenhouse.gases (-0.27) and dioxin.mg (-0.30) load negatively. PC2 therefore separates municipalities with high population turnover and foreign-origin inhabitants (positive scores) from municipalities with high industrial emissions (negative scores).

PC3 (14.1%) contrasts municipalities with high unemployment (-0.49), dioxin.mg (-0.43), foreign.origin (-0.31), and pop.size (-0.36) against those with high tax.capacity (0.30) and pop.change (0.19). This component highlights an industrial/socioeconomic stress dimension that is orthogonal to the gradients captured by PC1 and PC2.

Figure 3. Bar chart of PCA component loadings for PC1 and PC2. Blue bars indicate positive loadings and red bars indicate negative loadings. Variables are ordered by their loading value within each panel.

Figure 3. Bar chart of PCA component loadings for PC1 and PC2. Blue bars indicate positive loadings and red bars indicate negative loadings. Variables are ordered by their loading value within each panel.

Table 3. PCA component loadings for PC1, PC2, and PC3. Values closer to +/-1 indicate a stronger association between the variable and the component.

Variable

PC1

PC2

PC3

pop.size

0.273

-0.188

-0.365

area

-0.151

-0.198

-0.134

mean.age

-0.394

-0.072

-0.059

mortality

-0.392

-0.020

-0.172

natality

0.256

0.148

-0.224

pop.change

0.344

0.067

0.187

immigration

0.167

0.494

0.039

emigration

0.052

0.506

-0.072

tax.capacity

0.278

-0.145

0.298

tax.equal

-0.193

0.126

-0.011

unemployment

-0.125

0.250

-0.486

foreign.origin

0.211

0.330

-0.307

higher.edu

0.364

-0.135

0.119

greenhouse.gases

0.172

-0.274

-0.324

dioxin.mg

0.207

-0.302

-0.430

3.3 Score Plots

3.3.1 PC1 vs PC2 – Region and Income

Figure 4. PCA score plot of PC1 (33.1%) vs PC2 (17.7%) with municipalities colour-coded by Swedish region. Norrland municipalities (green) are distributed across negative PC1 values, reflecting older populations and higher mortality. Svealand municipalities (blue) show the widest spread along PC2, and several high-income municipalities in the Stockholm area appear as outliers at high positive PC1 scores.

Figure 4. PCA score plot of PC1 (33.1%) vs PC2 (17.7%) with municipalities colour-coded by Swedish region. Norrland municipalities (green) are distributed across negative PC1 values, reflecting older populations and higher mortality. Svealand municipalities (blue) show the widest spread along PC2, and several high-income municipalities in the Stockholm area appear as outliers at high positive PC1 scores.

Figure 5. PCA score plot of PC1 vs PC2 with municipalities colour-coded by median income class (Low, Middle, High). High-income municipalities (blue) cluster at positive PC1 scores, consistent with their higher tax capacity, higher education levels, and younger populations. Low-income municipalities (orange) dominate the negative PC1 region.

Figure 5. PCA score plot of PC1 vs PC2 with municipalities colour-coded by median income class (Low, Middle, High). High-income municipalities (blue) cluster at positive PC1 scores, consistent with their higher tax capacity, higher education levels, and younger populations. Low-income municipalities (orange) dominate the negative PC1 region.

3.3.2 Biplot PC1 vs PC2

Figure 6. PCA biplot of PC1 vs PC2. Points represent individual municipalities (colour-coded by region) and arrows represent the loading vectors of the 15 variables. Arrow direction and length indicate the contribution of each variable to the two components. Variables pointing in the same direction are positively correlated; variables pointing in opposite directions are negatively correlated.

Figure 6. PCA biplot of PC1 vs PC2. Points represent individual municipalities (colour-coded by region) and arrows represent the loading vectors of the 15 variables. Arrow direction and length indicate the contribution of each variable to the two components. Variables pointing in the same direction are positively correlated; variables pointing in opposite directions are negatively correlated.

3.3.3 PC1 vs PC3 – Region

Figure 7. PCA score plot of PC1 (33.1%) vs PC3 (14.1%) with municipalities colour-coded by region. PC3 separates municipalities with high unemployment and industrial dioxin emissions (negative PC3 scores) from municipalities with high tax capacity and population growth (positive PC3 scores). Several Norrland municipalities appear at very negative PC3 values, reflecting their historically industrial character.

Figure 7. PCA score plot of PC1 (33.1%) vs PC3 (14.1%) with municipalities colour-coded by region. PC3 separates municipalities with high unemployment and industrial dioxin emissions (negative PC3 scores) from municipalities with high tax capacity and population growth (positive PC3 scores). Several Norrland municipalities appear at very negative PC3 values, reflecting their historically industrial character.


4 Discussion

The PCA identified PC1, PC2, and PC3 as the most informative components, together explaining 64.9% of the total variance in 15 socioeconomic and environmental variables across 290 Swedish municipalities in 2017.

PC1 (33.1%) captures the most fundamental dimension of municipal variation in Sweden: a contrast between growing, young, educated, and economically strong municipalities and shrinking, ageing, and economically dependent ones. High positive PC1 scores are associated with high tax capacity, higher education, population growth, and high natality – characteristics typical of urban and suburban municipalities, particularly in Svealand and the larger cities of Gotaland. Negative PC1 scores reflect high mean age, high mortality, and dependence on fiscal equalisation transfers, which are hallmarks of many rural and remote municipalities in Norrland. This is clearly visible in Figure 4, where Norrland municipalities (green) cluster at negative PC1 values, and is corroborated by Figure 5, where high-income municipalities consistently appear at positive PC1 values.

PC2 (17.7%) adds a distinct dimension related to population mobility and diversity. The near-equal and high loadings of immigration and emigration suggest that this axis captures a general population turnover gradient rather than a net migration effect. The positive loading of foreign.origin implies that municipalities with high immigration and emigration rates also tend to have a larger proportion of inhabitants with foreign background. At the opposite end, negative loadings of greenhouse.gases and dioxin.mg indicate that high-emission municipalities tend to have lower population turnover – consistent with municipalities dominated by heavy industry rather than diverse urban populations.

PC3 (14.1%) reveals an additional industrial stress and socioeconomic disadvantage dimension that is largely orthogonal to PC1 and PC2. The strong negative loadings of unemployment, dioxin.mg, foreign.origin, and pop.size identify municipalities characterised by industrial emissions, unemployment, and large but disadvantaged populations – a pattern consistent with post-industrial towns and certain Norrland municipalities with a legacy of mining and heavy manufacturing (Figure 7).

Taken together, the three components suggest that differences among Swedish municipalities are primarily structured by: (1) a demographic and economic vitality gradient (PC1), (2) a population mobility and diversity gradient (PC2), and (3) an industrial and socioeconomic disadvantage gradient (PC3). These findings are consistent with well-documented patterns of urban–rural polarisation and regional inequality in Sweden (Statistics Sweden, 2017).


5 References

Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer.

Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20(1), 141–151.

Statistics Sweden (SCB). (2017). Municipal statistics 2017. Retrieved from https://www.scb.se/en/


6 Appendix: R Code

# ---- Load packages ----
library(readxl)
library(ggplot2)
library(ggrepel)
library(factoextra)
library(dplyr)
library(tidyr)
library(flextable)

# ---- Load data ----
scb <- read_excel("SCB_assign3.xlsx")

# ---- Define continuous variables ----
continuous_vars <- c("pop.size", "area", "mean.age", "mortality", "natality",
                     "pop.change", "immigration", "emigration", "tax.capacity",
                     "tax.equal", "unemployment", "foreign.origin", "higher.edu",
                     "greenhouse.gases", "dioxin.mg")

scb_num <- scb[, continuous_vars]

# ---- Run PCA ----
pca_result <- prcomp(scb_num, center = TRUE, scale. = TRUE)

# ---- Eigenvalues and variance ----
eigenvalues <- pca_result$sdev^2
prop_var    <- eigenvalues / sum(eigenvalues)
cum_var     <- cumsum(prop_var)

# ---- Scores data frame ----
scores_df <- as.data.frame(pca_result$x)
scores_df$municipality <- scb$municipality
scores_df$region       <- scb$region
scores_df$county       <- scb$county
scores_df$income       <- factor(scb$income, levels = c("Low", "Middle", "High"))

# ---- Loadings ----
loadings_df   <- as.data.frame(pca_result$rotation[, 1:3])
loadings_df$Variable <- rownames(loadings_df)

loadings_plot <- as.data.frame(pca_result$rotation[, 1:2])
loadings_plot$Variable <- rownames(loadings_plot)
arrow_scale <- 4

# ---- Scree plot ----
scree_df <- data.frame(PC = 1:15, Eigenvalue = eigenvalues)
ggplot(scree_df, aes(x = PC, y = Eigenvalue)) +
  geom_line(color = "steelblue", linewidth = 0.8) +
  geom_point(color = "steelblue", size = 2.5) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  scale_x_continuous(breaks = 1:15) +
  labs(x = "Principal Component", y = "Eigenvalue") +
  theme_bw()

# ---- Cumulative variance plot ----
cum_df <- data.frame(PC = 1:15, Cumulative = cum_var * 100)
ggplot(cum_df, aes(x = PC, y = Cumulative)) +
  geom_line(color = "darkgreen", linewidth = 0.8) +
  geom_point(color = "darkgreen", size = 2.5) +
  geom_hline(yintercept = 80, linetype = "dashed", color = "orange") +
  scale_x_continuous(breaks = 1:15) +
  labs(x = "Principal Component", y = "Cumulative Variance (%)") +
  theme_bw()

# ---- Loadings bar chart (PC1 & PC2) ----
loadings_long <- loadings_df %>%
  select(Variable, PC1, PC2) %>%
  pivot_longer(cols = c(PC1, PC2), names_to = "Component", values_to = "Loading")

ggplot(loadings_long,
       aes(x = reorder(Variable, Loading), y = Loading, fill = Loading > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ Component) +
  scale_fill_manual(values = c("FALSE" = "#D7191C", "TRUE" = "#2C7BB6")) +
  labs(x = "Variable", y = "Loading") +
  theme_bw()

# ---- Score plot: PC1 vs PC2, colored by region ----
ggplot(scores_df, aes(x = PC1, y = PC2, color = region)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "grey50") +
  geom_vline(xintercept = 0, linetype = "dotted", color = "grey50") +
  scale_color_manual(
    values = c("Gotaland" = "#E41A1C", "Svealand" = "#377EB8", "Norrland" = "#4DAF4A"),
    name = "Region"
  ) +
  labs(x = "PC1 (33.1% variance)", y = "PC2 (17.7% variance)") +
  theme_bw()

# ---- Score plot: PC1 vs PC2, colored by income ----
ggplot(scores_df, aes(x = PC1, y = PC2, color = income)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "grey50") +
  geom_vline(xintercept = 0, linetype = "dotted", color = "grey50") +
  scale_color_manual(
    values = c("Low" = "#FC8D59", "Middle" = "#FFFFBF", "High" = "#91BFDB"),
    name = "Median Income"
  ) +
  labs(x = "PC1 (33.1% variance)", y = "PC2 (17.7% variance)") +
  theme_bw()

# ---- Biplot: PC1 vs PC2 ----
ggplot() +
  geom_point(data = scores_df, aes(x = PC1, y = PC2, color = region),
             size = 1.8, alpha = 0.7) +
  geom_segment(data = loadings_plot,
               aes(x = 0, y = 0,
                   xend = PC1 * arrow_scale,
                   yend = PC2 * arrow_scale),
               arrow = arrow(length = unit(0.2, "cm")),
               color = "black", linewidth = 0.5) +
  geom_text_repel(data = loadings_plot,
                  aes(x = PC1 * arrow_scale,
                      y = PC2 * arrow_scale,
                      label = Variable),
                  size = 3, color = "black", max.overlaps = 20) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "grey50") +
  geom_vline(xintercept = 0, linetype = "dotted", color = "grey50") +
  scale_color_manual(
    values = c("Gotaland" = "#E41A1C", "Svealand" = "#377EB8", "Norrland" = "#4DAF4A"),
    name = "Region"
  ) +
  labs(x = "PC1 (33.1% variance)", y = "PC2 (17.7% variance)") +
  theme_bw()

# ---- Score plot: PC1 vs PC3, colored by region ----
ggplot(scores_df, aes(x = PC1, y = PC3, color = region)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "grey50") +
  geom_vline(xintercept = 0, linetype = "dotted", color = "grey50") +
  scale_color_manual(
    values = c("Gotaland" = "#E41A1C", "Svealand" = "#377EB8", "Norrland" = "#4DAF4A"),
    name = "Region"
  ) +
  labs(x = "PC1 (33.1% variance)", y = "PC3 (14.1% variance)") +
  theme_bw()

7 Part 2: NMDS – Gut Microbiome Composition


8 Introduction

The human gut microbiome – the community of bacteria residing in the gastrointestinal tract – is increasingly recognised as a key determinant of host health, metabolism, and immune function (Arumugam et al., 2011). Gut microbiome composition varies substantially among individuals and is shaped by factors including diet, age, health status, and geographic origin. Understanding whether nationality – as a proxy for long-term dietary and environmental exposures – is associated with differences in microbiome composition is therefore of broad scientific interest.

The research question addressed in this part is: are there any significant differences in gut microbiome composition among faecal samples from humans of different nationalities?

To address this question, Non-metric Multidimensional Scaling (NMDS) was used to visualise the multivariate structure of OTU (Operational Taxonomic Unit) composition among 33 samples from six nationalities. NMDS is a rank-based ordination method that preserves the rank order of pairwise dissimilarities between samples in a low-dimensional space (Kruskal, 1964). Following ordination, the homogeneity of group dispersions was assessed with betadisper(), and – given non-significant dispersion differences – a permutational multivariate analysis of variance (PERMANOVA) was used to formally test for differences in microbiome composition among nationalities.


9 Methods

9.1 Data

The dataset is a modified version of the enterotype dataset (Arumugam et al., 2011) from the phyloseq R package. Samples lacking data on Nationality, Gender, or ClinicalStatus, and samples from individuals below the age of one year, were removed prior to analysis. The final dataset comprises 33 faecal samples from six nationalities: American, Danish, French, Italian, Japanese, and Spanish (Table 1).

Table 1. Number of faecal samples per nationality in the modified enterotype dataset.

Nationality

N samples

American

2

Danish

4

French

8

Italian

6

Japanese

9

Spanish

4

The OTU table contains relative abundances of 553 OTUs per sample (each OTU’s count divided by the sample total). A substantial proportion of OTUs in the original dataset could not be taxonomically assigned and were grouped into a single unassigned OTU, which is retained in the analysis as instructed. No further transformation, standardisation, or rarefaction was applied prior to analysis, as the data are already in relative abundance form.

9.2 Statistical Analysis

NMDS was performed using metaMDS() from the vegan package (Oksanen et al., 2022), with Bray-Curtis dissimilarity (distance = "bray"), two dimensions (k = 2), and 100 random starts (trymax = 100). Bray-Curtis dissimilarity is widely used for species composition data because it is sensitive to differences in relative abundances and is bounded between 0 (identical) and 1 (completely different). A random seed was set (set.seed(123)) for reproducibility.

Stress was used to evaluate the quality of the NMDS solution. Stress < 0.10 indicates an excellent fit, stress < 0.20 indicates a fair to good fit, and stress > 0.30 indicates a poor representation (Clarke, 1993).

Homogeneity of multivariate dispersions was assessed using betadisper() followed by a permutation test (permutest(), 999 permutations). This step tests whether nationality groups differ in their within-group variability (dispersion) around the group centroid, which is a prerequisite for valid interpretation of PERMANOVA results.

PERMANOVA (adonis2(), 999 permutations) was used to test whether the centroids of nationality groups differ significantly in multivariate space. PERMANOVA partitions the total Bray-Curtis dissimilarity into among-group and within-group components and tests significance by permutation (Anderson, 2001). PERMANOVA was only proceeded with after confirming that group dispersions were homogeneous, to avoid confounding location and dispersion effects.


10 Results

10.1 NMDS Ordination

The NMDS ordination converged to a stress value of 0.149, indicating a fair to good representation of the multivariate Bray-Curtis dissimilarity structure in two dimensions (Clarke, 1993).

Figure 1. NMDS ordination of 33 faecal samples based on Bray-Curtis dissimilarity of OTU relative abundances. Points represent individual samples coloured by nationality. Dashed ellipses show 95% confidence ellipses for nationalities with three or more samples (american, n = 2, is excluded from the ellipse). Stress = 0.149, indicating a fair to good ordination fit.

Figure 1. NMDS ordination of 33 faecal samples based on Bray-Curtis dissimilarity of OTU relative abundances. Points represent individual samples coloured by nationality. Dashed ellipses show 95% confidence ellipses for nationalities with three or more samples (american, n = 2, is excluded from the ellipse). Stress = 0.149, indicating a fair to good ordination fit.

The NMDS plot (Figure 1) shows that samples from different nationalities overlap considerably in the ordination space, suggesting no complete separation of microbiome compositions by nationality. However, some tendency for grouping is visible: Japanese samples show a wider spread along NMDS1, and French samples cluster relatively tightly in the centre of the ordination. American samples (n = 2) are positioned as outliers at negative NMDS2 values.

10.2 Homogeneity of Multivariate Dispersions

Before testing for differences in microbiome composition between nationalities, the homogeneity of within-group dispersions was assessed. The permutation test on betadisper() yielded a non-significant result (F = 0.787, p = 0.581), indicating that the six nationality groups do not differ significantly in their within-group variability around the group centroid (Table 2). This result justifies proceeding with PERMANOVA.

Table 2. Permutation test (permutest) for homogeneity of multivariate dispersions (betadisper) among nationality groups. A non-significant p-value indicates homogeneous dispersions.

Source

Df

Sum Sq

Mean Sq

F

Permutations

p-value

Nationality

5

0.02889

0.00578

0.787

999

0.581

Figure 2. Boxplot of distances from individual samples to their nationality group centroid (Bray-Curtis dissimilarity). Individual sample points are overlaid as jittered dots. No significant differences in dispersion among nationalities were detected (permutation test, p = 0.581).

Figure 2. Boxplot of distances from individual samples to their nationality group centroid (Bray-Curtis dissimilarity). Individual sample points are overlaid as jittered dots. No significant differences in dispersion among nationalities were detected (permutation test, p = 0.581).

The boxplot (Figure 2) visualises the within-group spread for each nationality. Italian, Japanese, and Spanish samples show higher median distances to their group centroids compared to Danish and French samples, but this variation is not statistically significant.

10.3 PERMANOVA

Given the non-significant betadisper result, PERMANOVA was used to test whether nationality groups differ in their centroid location in multivariate space. The PERMANOVA revealed a significant effect of nationality on gut microbiome composition (F = 3.314, R2 = 0.38, p = 0.001; Table 3). Nationality explained 38% of the total variation in Bray-Curtis dissimilarities among samples.

Table 3. PERMANOVA (adonis2) results testing for differences in gut microbiome composition (Bray-Curtis dissimilarity) among nationality groups. 999 permutations were used.

Source

Df

Sum of Squares

R2

F

p-value

Nationality

5

0.9124

0.3803

3.314

0.001

Residual

27

1.4868

0.6197

Total

32

2.3991

1.0000


11 Discussion

The aim of this analysis was to determine whether gut microbiome composition differs significantly among faecal samples from individuals of six different nationalities. The results from the NMDS ordination, betadisper analysis, and PERMANOVA together support the conclusion that nationality is a significant predictor of gut microbiome composition.

The NMDS ordination (stress = 0.149) provided a fair to good two-dimensional representation of the Bray-Curtis dissimilarity structure among the 33 samples. While the score plot (Figure 1) shows considerable overlap between nationality groups – reflecting the inherent individual-level variation in microbiome composition – some group-level tendencies are visible. Japanese samples show the widest spread along NMDS1, potentially reflecting greater within-group diversity in OTU composition, while French samples cluster more tightly near the ordination centre.

Before applying PERMANOVA, the homogeneity of within-group dispersions was confirmed using betadisper() followed by a permutation test (F = 0.787, p = 0.581). The non-significant result indicates that the nationality groups do not differ in their within-group variability, satisfying the assumption of PERMANOVA and allowing differences in group location (centroid) rather than dispersion to be attributed to the significant PERMANOVA result (Anderson, 2001; see ?adonis2).

The PERMANOVA revealed a statistically significant effect of nationality on microbiome composition (F = 3.314, R2 = 0.38, p = 0.001), with nationality explaining 38% of the total Bray-Curtis dissimilarity. This is consistent with earlier studies showing that gut microbiome composition reflects long-term dietary patterns, which are strongly shaped by national food culture and geographic environment (Arumugam et al., 2011). The remaining ~62% of variation is attributable to within-nationality individual differences, driven by factors such as age, health status, antibiotic use, and host genetics – all of which vary considerably within the nationality groups in this dataset (e.g. Italian samples are exclusively elderly individuals, while Japanese samples span a wide age range including one infant).

In conclusion, there are significant differences in gut microbiome OTU composition among samples from different nationalities (PERMANOVA: p = 0.001, R2 = 0.38). The analysis further shows that these differences reflect true location differences in multivariate space rather than differences in within-group dispersion.


12 References

Anderson, M. J. (2001). A new method for non-parametric multivariate analysis of variance. Austral Ecology, 26(1), 32–46.

Arumugam, M., et al. (2011). Enterotypes of the human gut microbiome. Nature, 473(7346), 174–180.

Clarke, K. R. (1993). Non-parametric multivariate analyses of changes in community structure. Australian Journal of Ecology, 18(1), 117–143.

Kruskal, J. B. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2), 115–129.

Moore, R. E., & Townsend, S. D. (2019). Temporal development of the infant gut microbiome. Open Biology, 9(9), 190128.

Oksanen, J., et al. (2022). vegan: Community Ecology Package. R package version 2.6-4. https://CRAN.R-project.org/package=vegan


13 Appendix: R Code

# ---- Load packages ----
library(vegan)
library(ggplot2)
library(dplyr)
library(flextable)

# ---- Load data ----
otu_mat   <- readRDS("entero_vegan.rds")
sample_df <- readRDS("sampledf_vegan.rds")

# ---- Colour palette ----
nationality_colors <- c(
  "american" = "#E41A1C", "danish"   = "#377EB8",
  "french"   = "#4DAF4A", "italian"  = "#FF7F00",
  "japanese" = "#984EA3", "spanish"  = "#A65628"
)

# ---- Run NMDS ----
set.seed(123)
nmds_result <- metaMDS(otu_mat, distance = "bray", k = 2,
                       trymax = 100, trace = FALSE)
nmds_result$stress

# ---- Extract scores ----
scores_nmds <- as.data.frame(scores(nmds_result, display = "sites"))
scores_nmds$Nationality    <- sample_df$Nationality
scores_nmds$ClinicalStatus <- sample_df$ClinicalStatus

# Groups with >= 3 samples for ellipses
nat_counts     <- table(sample_df$Nationality)
ellipse_groups <- names(nat_counts[nat_counts >= 3])
scores_ellipse <- scores_nmds %>% filter(Nationality %in% ellipse_groups)

# ---- NMDS plot ----
ggplot(scores_nmds, aes(x = NMDS1, y = NMDS2, color = Nationality)) +
  stat_ellipse(data = scores_ellipse, aes(group = Nationality),
               type = "t", linetype = "dashed", linewidth = 0.6, level = 0.95) +
  geom_point(size = 3, alpha = 0.9) +
  annotate("text",
           x = min(scores_nmds$NMDS1) + 0.02, y = max(scores_nmds$NMDS2),
           label = paste0("Stress = ", round(nmds_result$stress, 3)),
           hjust = 0, size = 3.5) +
  scale_color_manual(values = nationality_colors, name = "Nationality") +
  labs(x = "NMDS1", y = "NMDS2") +
  theme_bw()

# ---- Bray-Curtis distance matrix ----
bray_dist <- vegdist(otu_mat, method = "bray")

# ---- betadisper ----
bd      <- betadisper(bray_dist, group = sample_df$Nationality)
set.seed(123)
bd_perm <- permutest(bd, permutations = 999)
print(bd_perm)

# ---- Boxplot: distance to centroid ----
centroid_distances <- data.frame(
  Distance    = bd$distances,
  Nationality = sample_df[names(bd$distances), "Nationality"]
)

ggplot(centroid_distances,
       aes(x = Nationality, y = Distance, fill = Nationality)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.15, size = 1.8, alpha = 0.7) +
  scale_fill_manual(values = nationality_colors) +
  labs(x = "Nationality", y = "Distance to centroid") +
  theme_bw() +
  theme(legend.position = "none")

# ---- PERMANOVA ----
set.seed(123)
permanova_result <- adonis2(
  bray_dist ~ Nationality,
  data = sample_df, permutations = 999
)
print(permanova_result)