Introduction

This report contains two multivariate statistical analyses:

  1. Principal Component Analysis (PCA) on Swedish municipalities
  2. Non-metric Multidimensional Scaling (NMDS) on microbiome data

Both methods reduce a dataset with many variables to a few dimensions that can be plotted and interpreted.


PART 1 — Principal Component Analysis (PCA)

Load Required Libraries

library(readxl)
library(tidyverse)
library(ggplot2)
library(factoextra)

Load Dataset

# Read Excel file

scb <- read_excel("SCB_assign3.xlsx")

# Check structure

str(scb)
## tibble [290 × 19] (S3: tbl_df/tbl/data.frame)
##  $ municipality    : chr [1:290] "Ale" "Alingsås" "Alvesta" "Aneby" ...
##  $ region          : chr [1:290] "Götaland" "Götaland" "Götaland" "Götaland" ...
##  $ county          : chr [1:290] "Västra Götalands län" "Västra Götalands län" "Kronobergs län" "Jönköpings län" ...
##  $ income          : chr [1:290] "Middle" "Middle" "Low" "Low" ...
##  $ pop.size        : num [1:290] 30223 40390 20026 6776 13934 ...
##  $ area            : num [1:290] 317 472 974 518 325 ...
##  $ mean.age        : num [1:290] 39.5 42.1 41.8 42.4 44.2 47 45.3 44.6 46.2 43.9 ...
##  $ mortality       : num [1:290] 0.84 1.005 0.939 0.945 1.184 ...
##  $ natality        : num [1:290] 1.18 1.05 1.31 1.45 1.08 ...
##  $ pop.change      : num [1:290] 2.23 0.854 0.879 2.553 0.222 ...
##  $ immigration     : num [1:290] 7.41 4.88 6.82 8.09 5.96 ...
##  $ emigration      : num [1:290] 5.52 4.12 6.31 6.02 5.76 ...
##  $ tax.capacity    : num [1:290] 197627 199056 174595 181317 177804 ...
##  $ tax.equal       : num [1:290] -462 -1025 2357 -837 -885 ...
##  $ unemployment    : num [1:290] 3.9 5.1 8.5 5.5 9.4 4.8 7.6 6.6 5.2 11.8 ...
##  $ foreign.origin  : num [1:290] 21.8 14.6 24.3 14.8 18.2 ...
##  $ higher.edu      : num [1:290] 21 24.2 17.2 17.7 18.9 ...
##  $ greenhouse.gases: num [1:290] 103773 105709 119691 50356 62066 ...
##  $ dioxin.mg       : num [1:290] 29.5 69.2 69.7 18.5 35 ...
summary(scb)
##  municipality          region             county             income         
##  Length:290         Length:290         Length:290         Length:290        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     pop.size           area             mean.age       mortality     
##  Min.   :  2451   Min.   :    8.69   Min.   :36.30   Min.   :0.5156  
##  1st Qu.:  9970   1st Qu.:  344.96   1st Qu.:41.40   1st Qu.:0.9211  
##  Median : 15937   Median :  670.47   Median :43.50   Median :1.0888  
##  Mean   : 34897   Mean   : 1404.52   Mean   :43.25   Mean   :1.0932  
##  3rd Qu.: 34905   3rd Qu.: 1291.24   3rd Qu.:45.08   3rd Qu.:1.2521  
##  Max.   :949761   Max.   :19155.37   Max.   :49.60   Max.   :2.1542  
##     natality        pop.change       immigration       emigration    
##  Min.   :0.4913   Min.   :-2.7589   Min.   : 3.490   Min.   : 3.038  
##  1st Qu.:0.9368   1st Qu.: 0.1163   1st Qu.: 5.446   1st Qu.: 4.766  
##  Median :1.0321   Median : 0.7146   Median : 6.369   Median : 5.605  
##  Mean   :1.0317   Mean   : 0.7555   Mean   : 6.645   Mean   : 5.868  
##  3rd Qu.:1.1337   3rd Qu.: 1.4155   3rd Qu.: 7.358   3rd Qu.: 6.567  
##  Max.   :1.6692   Max.   : 4.1021   Max.   :15.571   Max.   :14.183  
##   tax.capacity      tax.equal        unemployment    foreign.origin  
##  Min.   :146813   Min.   :-5214.0   Min.   : 2.100   Min.   : 7.168  
##  1st Qu.:174204   1st Qu.: -631.5   1st Qu.: 5.300   1st Qu.:12.904  
##  Median :182235   Median :  341.0   Median : 7.200   Median :16.428  
##  Mean   :188085   Mean   :  825.3   Mean   : 7.504   Mean   :18.554  
##  3rd Qu.:196206   3rd Qu.: 1787.8   3rd Qu.: 9.200   3rd Qu.:21.953  
##  Max.   :358342   Max.   :11864.0   Max.   :15.100   Max.   :58.555  
##    higher.edu    greenhouse.gases    dioxin.mg       
##  Min.   :11.94   Min.   :  10597   Min.   :   5.135  
##  1st Qu.:15.38   1st Qu.:  52604   1st Qu.:  24.855  
##  Median :18.07   Median :  87251   Median :  41.683  
##  Mean   :19.61   Mean   : 181646   Mean   :  85.663  
##  3rd Qu.:22.27   3rd Qu.: 152722   3rd Qu.:  82.013  
##  Max.   :42.39   Max.   :4052201   Max.   :1282.273

Select Continuous Variables

Categorical variables removed:

  • municipality
  • region
  • county
  • income

scb_numeric <- scb %>%
  select(pop.size:dioxin.mg)

head(scb_numeric)
## # A tibble: 6 × 15
##   pop.size   area mean.age mortality natality pop.change immigration emigration
##      <dbl>  <dbl>    <dbl>     <dbl>    <dbl>      <dbl>       <dbl>      <dbl>
## 1    30223   317.     39.5     0.840    1.18       2.23         7.41       5.52
## 2    40390   472      42.1     1.01     1.05       0.854        4.88       4.12
## 3    20026   974.     41.8     0.939    1.31       0.879        6.82       6.31
## 4     6776   518.     42.4     0.945    1.45       2.55         8.09       6.02
## 5    13934   325.     44.2     1.18     1.08       0.222        5.96       5.76
## 6     2821 12558.     47       1.42     0.922     -1.95         4.04       5.49
## # ℹ 7 more variables: tax.capacity <dbl>, tax.equal <dbl>, unemployment <dbl>,
## #   foreign.origin <dbl>, higher.edu <dbl>, greenhouse.gases <dbl>,
## #   dioxin.mg <dbl>

Standardize Data

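Standardization is necessary because the variables are measured on very different scales. For example, pop.size reaches 949761 while mortality stays between 0.5 and 2.2. Without standardization, the variables with the largest variances would dominate the principal components.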

scb_scaled <- scale(scb_numeric)
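
As an illustration of what `scale()` does, the sketch below standardizes a small toy data frame by hand and compares the result with `scale()`. The toy values and the names `toy`, `z_manual`, and `z_scale` are hypothetical and are not taken from the SCB file.

```r
# Toy data (hypothetical values): two variables on very different scales
toy <- data.frame(
  pop  = c(2500, 16000, 35000, 950000),
  rate = c(0.8, 1.0, 1.1, 1.3)
)

# Manual z-scores: subtract the column mean, divide by the column sd
z_manual <- sapply(toy, function(v) (v - mean(v)) / sd(v))

# scale() performs the same centering and scaling
z_scale <- scale(toy)

# Both versions agree
all.equal(unname(z_manual), unname(z_scale), check.attributes = FALSE)

colMeans(z_scale)       # each column now has mean (approximately) 0
apply(z_scale, 2, sd)   # and standard deviation 1
```

After standardization every variable contributes on the same footing, regardless of its original unit.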

Run PCA

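# scb_scaled is already standardized, so center and scale. are
# redundant here; they are kept for clarity and do not change the result
pca_result <- prcomp(
  scb_scaled,
  center = TRUE,
  scale. = TRUE
)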

summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2271 1.6315 1.4544 1.15955 0.94201 0.85625 0.71273
## Proportion of Variance 0.3307 0.1774 0.1410 0.08964 0.05916 0.04888 0.03387
## Cumulative Proportion  0.3307 0.5081 0.6491 0.73878 0.79794 0.84682 0.88068
##                            PC8     PC9    PC10   PC11    PC12   PC13    PC14
## Standard deviation     0.68860 0.61825 0.57896 0.4915 0.41455 0.3263 0.27902
## Proportion of Variance 0.03161 0.02548 0.02235 0.0161 0.01146 0.0071 0.00519
## Cumulative Proportion  0.91229 0.93778 0.96012 0.9762 0.98768 0.9948 0.99997
##                           PC15
## Standard deviation     0.02074
## Proportion of Variance 0.00003
## Cumulative Proportion  1.00000
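
Because the input was already passed through `scale()`, the `center` and `scale.` arguments repeat the standardization. The sketch below, run on R's built-in `mtcars` data as a stand-in (not the report's dataset), suggests that this repetition leaves the PCA unchanged.

```r
# Built-in example data used as a stand-in for the SCB variables
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Standardize once inside prcomp()
pca_once <- prcomp(x, center = TRUE, scale. = TRUE)

# Standardize with scale() first, then again inside prcomp()
pca_twice <- prcomp(scale(x), center = TRUE, scale. = TRUE)

# Same component standard deviations either way
all.equal(pca_once$sdev, pca_twice$sdev)
```

Either call style is therefore fine, as long as standardization happens at least once.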

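We retain enough principal components to explain roughly 60–70% of the total variance.

Here the first three components explain 64.9% cumulatively, and the first four explain 73.9%.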
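
The selection rule can be sketched directly from the standard deviations printed in the summary output above. The variable names (`sdev`, `prop_var`, `cum_var`, `n_60`, `n_70`) are illustrative only.

```r
# Standard deviations copied from the summary(pca_result) output
sdev <- c(2.2271, 1.6315, 1.4544, 1.15955, 0.94201, 0.85625, 0.71273,
          0.68860, 0.61825, 0.57896, 0.4915, 0.41455, 0.3263, 0.27902,
          0.02074)

prop_var <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum_var  <- cumsum(prop_var)       # cumulative proportion

# Smallest number of components reaching each threshold
n_60 <- which(cum_var >= 0.60)[1]
n_70 <- which(cum_var >= 0.70)[1]

round(cum_var[1:4], 3)
c(n_60 = n_60, n_70 = n_70)
```

With these values, three components are needed to pass 60% and four to pass 70%.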


Scree Plot

fviz_eig(pca_result)

Interpretation

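The scree plot shows the share of variance explained by each component. PC1 (33.1%) and PC2 (17.7%) are used in the plots below because together they capture 50.8% of the total variance, more than any other pair of components.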


PCA Plot Grouped by Region

fviz_pca_ind(
  pca_result,
  geom = "point",
  habillage = scb$region,
  addEllipses = TRUE,
  label = "none"
)

Interpretation

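Municipalities that plot close together have similar values across the 15 standardized variables. Overlapping ellipses indicate regions whose municipalities are not clearly separated on PC1 and PC2.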
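
The coordinates shown in such a plot are the PCA scores, which are the standardized data projected onto the loading vectors. A minimal sketch on the built-in `mtcars` data (a stand-in, not the SCB dataset):

```r
# Built-in example data used as a stand-in
x   <- mtcars[, c("mpg", "disp", "hp", "wt")]
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Scores = standardized data multiplied by the rotation (loading) matrix
scores_manual <- scale(x) %*% pca$rotation

# Matches the scores stored by prcomp()
all.equal(unname(scores_manual), unname(pca$x), check.attributes = FALSE)

# Distance between the first two observations in the PC1-PC2 plane
dist(pca$x[1:2, 1:2])
```

A small distance in this plane means two observations are similar on the variables that PC1 and PC2 summarize; differences on later components are not visible.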


PCA Loadings Plot

fviz_pca_var(pca_result)

Interpretation

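This plot shows how strongly each variable contributes to PC1 and PC2. Long arrows mark variables that are well represented on these two components. Arrows pointing in the same direction indicate positively correlated variables, and arrows pointing in opposite directions indicate negatively correlated variables.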
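
For a PCA on standardized data, the correlation between a variable and a component equals the loading multiplied by that component's standard deviation; correlation-circle plots of this kind typically draw these values as arrow coordinates. The sketch below checks the identity on the built-in `mtcars` data (a stand-in; `var_coord` and `var_cor` are illustrative names).

```r
# Built-in example data used as a stand-in
x   <- mtcars[, c("mpg", "disp", "hp", "wt")]
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Loadings scaled by the component standard deviations
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# Direct correlations between original variables and component scores
var_cor <- cor(x, pca$x)

all.equal(unname(var_coord), unname(var_cor))
round(var_coord[, 1:2], 2)
```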


PCA Biplot

fviz_pca_biplot(
  pca_result,
  habillage = scb$region,
  addEllipses = TRUE,
  label = "var"
)


PCA Interpretation

PC1 Interpretation

pca_result$rotation[,1]
##         pop.size             area         mean.age        mortality 
##       0.27320477      -0.15113010      -0.39419653      -0.39182306 
##         natality       pop.change      immigration       emigration 
##       0.25639831       0.34404101       0.16733833       0.05176083 
##     tax.capacity        tax.equal     unemployment   foreign.origin 
##       0.27750312      -0.19284174      -0.12458883       0.21110679 
##       higher.edu greenhouse.gases        dioxin.mg 
##       0.36357993       0.17163745       0.20708180

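The largest PC1 loadings (by absolute value) are:

  • mean age (-0.39) and mortality (-0.39), both negative
  • higher education (0.36) and population change (0.34), both positive
  • tax capacity (0.28) and population size (0.27), both positive

Immigration loads only weakly on PC1 (0.17). This suggests municipalities with high PC1 scores tend to be larger, growing, younger, and better educated, which is consistent with urban municipalities. The sign of a principal component is arbitrary, so only the contrast between the two groups of variables is meaningful.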
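
Loadings are easier to compare when sorted by absolute value. A minimal sketch, using the PC1 loadings printed above (the names `pc1` and `ranked` are illustrative):

```r
# PC1 loadings copied from the pca_result$rotation[,1] output
pc1 <- c(pop.size = 0.27320477, area = -0.15113010,
         mean.age = -0.39419653, mortality = -0.39182306,
         natality = 0.25639831, pop.change = 0.34404101,
         immigration = 0.16733833, emigration = 0.05176083,
         tax.capacity = 0.27750312, tax.equal = -0.19284174,
         unemployment = -0.12458883, foreign.origin = 0.21110679,
         higher.edu = 0.36357993, greenhouse.gases = 0.17163745,
         dioxin.mg = 0.20708180)

# Order by absolute size so the strongest contributors come first
ranked <- pc1[order(abs(pc1), decreasing = TRUE)]
round(head(ranked, 6), 2)

# Loading vectors have unit length, so squared loadings sum to 1
sum(pc1^2)
```

The same ordering can be applied to `pca_result$rotation[,2]` for PC2.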


PC2 Interpretation

pca_result$rotation[,2]
##         pop.size             area         mean.age        mortality 
##      -0.18834479      -0.19818064      -0.07166123      -0.02038489 
##         natality       pop.change      immigration       emigration 
##       0.14751696       0.06707752       0.49389563       0.50642689 
##     tax.capacity        tax.equal     unemployment   foreign.origin 
##      -0.14539976       0.12551530       0.24996009       0.33048716 
##       higher.edu greenhouse.gases        dioxin.mg 
##      -0.13473684      -0.27359438      -0.30237918

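The largest PC2 loadings (by absolute value) are:

  • emigration (0.51) and immigration (0.49)
  • foreign origin (0.33) and unemployment (0.25)
  • dioxin (-0.30) and greenhouse gases (-0.27), both negative

Mortality (-0.02) and mean age (-0.07) load close to zero on PC2, so this component does not describe age structure. Municipalities with high PC2 scores tend to have high migration turnover and higher values of foreign.origin, together with lower emissions.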


Overall PCA Conclusion

Age structure and mortality, education and economic capacity, and migration turnover explain the major differences among Swedish municipalities. Population size contributes mainly through PC1.


PART 2 — NMDS (Microbiome Data)

Load Required Libraries

library(vegan)
## Loading required package: permute
library(ggplot2)

Load NMDS Data

entero <- readRDS("entero_vegan.rds")

sampledf <- readRDS("sampledf_vegan.rds")

Run NMDS

NMDS is run on Bray–Curtis dissimilarities with two dimensions (k = 2) and up to 100 random starts (trymax = 100).
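
For reference, the Bray–Curtis dissimilarity between two samples is the sum of absolute abundance differences divided by the total abundance of both samples. The sketch below computes it by hand for three hypothetical samples; `bray_curtis`, `s1`, `s2`, and `s3` are illustrative names and the counts are invented.

```r
# Bray-Curtis dissimilarity between two abundance vectors
bray_curtis <- function(a, b) {
  sum(abs(a - b)) / sum(a + b)
}

# Hypothetical abundance counts for three taxa
s1 <- c(10, 0, 5)
s2 <- c(5, 5, 5)
s3 <- c(0, 8, 0)

bray_curtis(s1, s2)   # 10 / 30: partly shared composition
bray_curtis(s1, s3)   # 23 / 23 = 1: no taxa in common
bray_curtis(s1, s1)   # 0: identical samples
```

The measure runs from 0 (identical composition) to 1 (no shared taxa).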

nmds <- metaMDS(
  entero,
  distance = "bray",
  k = 2,
  trymax = 100
)
## Run 0 stress 0.1508602 
## Run 1 stress 0.1501291 
## ... New best solution
## ... Procrustes: rmse 0.1315384  max resid 0.5536647 
## Run 2 stress 0.1739011 
## Run 3 stress 0.1606822 
## Run 4 stress 0.1499106 
## ... New best solution
## ... Procrustes: rmse 0.0219446  max resid 0.1068088 
## Run 5 stress 0.1501291 
## ... Procrustes: rmse 0.02195007  max resid 0.1068811 
## Run 6 stress 0.1579322 
## Run 7 stress 0.1609625 
## Run 8 stress 0.1510025 
## Run 9 stress 0.1530473 
## Run 10 stress 0.1763764 
## Run 11 stress 0.1575809 
## Run 12 stress 0.1567907 
## Run 13 stress 0.1609626 
## Run 14 stress 0.1530473 
## Run 15 stress 0.1514533 
## Run 16 stress 0.1715047 
## Run 17 stress 0.1554413 
## Run 18 stress 0.1634385 
## Run 19 stress 0.1579323 
## Run 20 stress 0.1716996 
## Run 21 stress 0.1751966 
## Run 22 stress 0.1502523 
## ... Procrustes: rmse 0.02329153  max resid 0.1034964 
## Run 23 stress 0.1575807 
## Run 24 stress 0.1634385 
## Run 25 stress 0.1579322 
## Run 26 stress 0.1508602 
## Run 27 stress 0.1555186 
## Run 28 stress 0.1495229 
## ... New best solution
## ... Procrustes: rmse 0.01184774  max resid 0.06061541 
## Run 29 stress 0.1553503 
## Run 30 stress 0.1499105 
## ... Procrustes: rmse 0.01178412  max resid 0.06065181 
## Run 31 stress 0.1564722 
## Run 32 stress 0.1512937 
## Run 33 stress 0.1512932 
## Run 34 stress 0.1508601 
## Run 35 stress 0.1530475 
## Run 36 stress 0.1827846 
## Run 37 stress 0.1721044 
## Run 38 stress 0.1530473 
## Run 39 stress 0.1508602 
## Run 40 stress 0.1541209 
## Run 41 stress 0.1549409 
## Run 42 stress 0.1744663 
## Run 43 stress 0.1506509 
## Run 44 stress 0.1549411 
## Run 45 stress 0.1504013 
## Run 46 stress 0.1508601 
## Run 47 stress 0.182646 
## Run 48 stress 0.1506509 
## Run 49 stress 0.1631079 
## Run 50 stress 0.158413 
## Run 51 stress 0.1633539 
## Run 52 stress 0.3941036 
## Run 53 stress 0.1836335 
## Run 54 stress 0.1508603 
## Run 55 stress 0.1504013 
## Run 56 stress 0.1554107 
## Run 57 stress 0.1530473 
## Run 58 stress 0.1681186 
## Run 59 stress 0.1634386 
## Run 60 stress 0.1988293 
## Run 61 stress 0.1606821 
## Run 62 stress 0.163108 
## Run 63 stress 0.1506508 
## Run 64 stress 0.1530473 
## Run 65 stress 0.1506508 
## Run 66 stress 0.1530472 
## Run 67 stress 0.1960253 
## Run 68 stress 0.1530472 
## Run 69 stress 0.1606821 
## Run 70 stress 0.1530472 
## Run 71 stress 0.149028 
## ... New best solution
## ... Procrustes: rmse 0.02408859  max resid 0.09703141 
## Run 72 stress 0.1541217 
## Run 73 stress 0.1630893 
## Run 74 stress 0.1512932 
## Run 75 stress 0.1504013 
## Run 76 stress 0.1506513 
## Run 77 stress 0.1553503 
## Run 78 stress 0.1752673 
## Run 79 stress 0.1512932 
## Run 80 stress 0.1688955 
## Run 81 stress 0.3898181 
## Run 82 stress 0.1688955 
## Run 83 stress 0.1530473 
## Run 84 stress 0.1809899 
## Run 85 stress 0.1634386 
## Run 86 stress 0.154121 
## Run 87 stress 0.1503611 
## Run 88 stress 0.149028 
## ... New best solution
## ... Procrustes: rmse 7.815705e-05  max resid 0.0003023142 
## ... Similar to previous best
## *** Best solution repeated 1 times
nmds
## 
## Call:
## metaMDS(comm = entero, distance = "bray", k = 2, trymax = 100) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     entero 
## Distance: bray 
## 
## Dimensions: 2 
## Stress:     0.149028 
## Stress type 1, weak ties
## Best solution was repeated 1 time in 88 tries
## The best solution was from try 88 (random start)
## Scaling: centring, PC rotation, halfchange scaling 
## Species: expanded scores based on 'entero'

Stress Interpretation

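  • Stress < 0.2 → Acceptable
  • Stress < 0.1 → Good

The final stress of 0.149 is below 0.2 but above 0.1, so the two-dimensional ordination is an acceptable, though not good, representation of the Bray–Curtis dissimilarities.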
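
Stress type 1 measures how far the distances in the ordination depart from a monotone function of the original dissimilarities. The sketch below is a simplified base-R version using `isoreg()` for the monotone fit; it ignores tie handling and is not vegan's implementation. The function name `kruskal_stress1` and the random test points are hypothetical.

```r
# Simplified Kruskal stress-1 for a given configuration
kruskal_stress1 <- function(delta, config) {
  d     <- as.vector(dist(config))   # distances in the ordination
  delta <- as.vector(delta)          # original dissimilarities
  d_sorted <- d[order(delta)]        # ordination distances, ranked by dissimilarity
  d_hat <- isoreg(d_sorted)$yf       # best monotone (non-decreasing) fit
  sqrt(sum((d_sorted - d_hat)^2) / sum(d_sorted^2))
}

set.seed(1)
pts   <- matrix(rnorm(20), ncol = 2)   # 10 hypothetical points in 2D
delta <- dist(pts)^2                   # monotone transform of the true distances

# Rank order is preserved exactly, so stress is 0
stress_perfect <- kruskal_stress1(delta, pts)

# Shuffling the points breaks the rank agreement, so stress is positive
stress_shuffled <- kruskal_stress1(delta, pts[sample(10), ])

c(perfect = stress_perfect, shuffled = stress_shuffled)
```

Only the rank order of the dissimilarities matters, which is why squaring the distances still gives zero stress.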

NMDS Plot Colored by Nationality

nmds_points <- as.data.frame(nmds$points)

nmds_points$Nationality <- sampledf$Nationality

ggplot(nmds_points,
       aes(x = MDS1,
           y = MDS2,
           color = Nationality)) +
  geom_point(size = 3) +
  theme_minimal()

Interpretation

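Samples cluster partially by nationality. This suggests, but does not by itself establish, microbiome differences among nationalities; the tests below assess this formally.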


Betadisper Test (Dispersion)

dist_matrix <- vegdist(
  entero,
  method = "bray"
)

bd <- betadisper(
  dist_matrix,
  sampledf$Nationality
)

anova(bd)
## Analysis of Variance Table
## 
## Response: Distances
##           Df   Sum Sq   Mean Sq F value Pr(>F)
## Groups     5 0.028893 0.0057786  0.7874 0.5679
## Residuals 27 0.198156 0.0073391

Interpretation

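The test gives F = 0.79 with p = 0.568. Because:

p > 0.05

→ There is no evidence that within-group dispersion differs among nationalities
→ PERMANOVA can be performed, since a significant result is then unlikely to reflect unequal dispersion alone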
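
The idea behind the dispersion test can be sketched in plain Euclidean space: compute each sample's distance to its group centroid, then compare those distances among groups with an ANOVA. This is a simplified illustration on simulated data, not what `betadisper()` does internally (it works in the principal coordinate space of the dissimilarity matrix). All names and values below are hypothetical.

```r
set.seed(42)

# Three simulated groups; group C is deliberately more spread out
group  <- factor(rep(c("A", "B", "C"), each = 20))
spread <- c(A = 1, B = 1, C = 3)
xy <- cbind(rnorm(60, sd = spread[as.character(group)]),
            rnorm(60, sd = spread[as.character(group)]))

# Group centroids: one row per group, one column per axis
centroids <- apply(xy, 2, function(col) tapply(col, group, mean))

# Euclidean distance from each sample to its own group centroid
dist_to_centroid <- sqrt(rowSums((xy - centroids[as.character(group), ])^2))

# Mean dispersion per group, and an ANOVA comparing them
disp_means <- tapply(dist_to_centroid, group, mean)
disp_means
anova(lm(dist_to_centroid ~ group))
```

A small p-value in this ANOVA would indicate unequal spread among groups, which is the situation that complicates the interpretation of PERMANOVA.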


Boxplot of Dispersion

boxplot(
  bd,
  xlab = "Nationality",
  ylab = "Distance to centroid"
)


PERMANOVA Test

adonis_result <- adonis2(
  dist_matrix ~ Nationality,
  data = sampledf,
  permutations = 999
)

adonis_result
## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = dist_matrix ~ Nationality, data = sampledf, permutations = 999)
##          Df SumOfSqs     R2      F Pr(>F)    
## Model     5  0.91239 0.3803 3.3139  0.001 ***
## Residual 27  1.48675 0.6197                  
## Total    32  2.39914 1.0000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
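
The R2 and F columns follow directly from the sums of squares and degrees of freedom in the table. A short check, with the values copied from the output above (variable names are illustrative):

```r
# Values copied from the adonis2 table
ss_model <- 0.91239
ss_resid <- 1.48675
ss_total <- 2.39914
df_model <- 5
df_resid <- 27

# R2 is the share of total variation attributed to the model
r2 <- ss_model / ss_total

# Pseudo-F is the ratio of mean squares
f_stat <- (ss_model / df_model) / (ss_resid / df_resid)

round(c(R2 = r2, F = f_stat), 4)
```

The degrees of freedom also show the design: 5 model degrees of freedom correspond to 6 nationality groups, and 32 total degrees of freedom correspond to 33 samples.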

Interpretation

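The test gives F = 3.31, R2 = 0.38, and p = 0.001, the smallest p-value attainable with 999 permutations. Because:

p < 0.05

→ Microbiome composition differs significantly between nationalities.
→ Nationality accounts for about 38% of the variation in Bray–Curtis dissimilarities.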
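
The logic of the permutation test can be sketched in base R: compute a pseudo-F statistic from the distance matrix, then recompute it many times with shuffled group labels. This is a simplified one-factor illustration on simulated data, not vegan's implementation; `pseudo_f` and all data below are hypothetical.

```r
# Pseudo-F and R2 from a distance object and a grouping factor
pseudo_f <- function(d, group) {
  d2 <- as.matrix(d)^2
  n  <- nrow(d2)
  a  <- nlevels(group)
  # Total sum of squares from all pairwise squared distances
  ss_total <- sum(d2[upper.tri(d2)]) / n
  # Within-group sum of squares, one term per group
  ss_within <- sum(sapply(levels(group), function(g) {
    idx <- which(group == g)
    sub <- d2[idx, idx, drop = FALSE]
    sum(sub[upper.tri(sub)]) / length(idx)
  }))
  ss_among <- ss_total - ss_within
  c(F  = (ss_among / (a - 1)) / (ss_within / (n - a)),
    R2 = ss_among / ss_total)
}

set.seed(123)
group <- factor(rep(c("A", "B", "C"), each = 10))
shift <- c(A = 0, B = 2, C = 4)   # simulated group differences on axis 1
xy <- cbind(rnorm(30, mean = shift[as.character(group)]), rnorm(30))
d  <- dist(xy)

obs <- pseudo_f(d, group)

# Null distribution: shuffle the labels, keep the distances fixed
n_perm <- 999
perm_f <- replicate(n_perm, pseudo_f(d, sample(group))[["F"]])

# Share of permutations at least as extreme as the observed value
p_value <- (sum(perm_f >= obs[["F"]]) + 1) / (n_perm + 1)

obs
p_value
```

Because the observed statistic is counted among the permutations, the smallest possible p-value with 999 permutations is 1/1000.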


Overall NMDS Interpretation

The NMDS ordination (stress 0.149) showed partial clustering of samples by nationality, and PERMANOVA supported this pattern (R2 = 0.38, p = 0.001), indicating differences in gut microbiome composition.


Final Conclusion

The multivariate analyses demonstrated measurable variation in both datasets.

PCA Results

  • Age structure, mortality, education level, and population growth load most strongly on PC1
  • Immigration and emigration load most strongly on PC2

NMDS Results

  • Nationality is associated with microbiome composition (PERMANOVA: R2 = 0.38, p = 0.001)
  • Group dispersions did not differ significantly (p = 0.568), which supports the PERMANOVA result

These results highlight the usefulness of multivariate statistical methods for analyzing complex datasets.


End of Report