1 Part 1

1.1 Principal Component Analysis

The main task of Part 1 is to identify, using principal component analysis (PCA), which factors load most strongly on the principal components that describe Swedish municipalities. The dataset (SCB_assign3.xlsx) was extracted from Statistics Sweden (SCB) and covers the year 2017 only.

In principal component analysis (PCA), the loadings are defined as

\[\text{Loadings} = \text{Eigenvectors} \cdot \sqrt{\text{Eigenvalues}}\]
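As a minimal sketch of this definition, the smallest possible case can be worked by hand: two standardized variables with correlation r. The function name `loadings_2x2` and the value r = 0.6 are illustrative assumptions, not taken from the SCB data.

```python
import math

def loadings_2x2(r):
    """Loadings for two standardized variables with correlation r.

    The correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r
    with unit eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
    """
    s = 1.0 / math.sqrt(2.0)
    eigenpairs = [(1.0 + r, (s, s)), (1.0 - r, (s, -s))]
    # Order components by decreasing eigenvalue, as PCA output usually does.
    eigenpairs.sort(key=lambda pair: pair[0], reverse=True)
    # loading[variable][component] = eigenvector entry * sqrt(eigenvalue)
    return [[vec[i] * math.sqrt(val) for val, vec in eigenpairs]
            for i in range(2)]

# Hypothetical correlation, chosen only for illustration.
loadings = loadings_2x2(0.6)
for row in loadings:
    print([round(x, 4) for x in row])
```

For standardized variables, the squared loadings of each variable sum to 1 across all components, which is a quick sanity check on any loading matrix.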

In this study, we mainly use the “FactoMineR” package for principal component analysis. The PCA function in FactoMineR never uses the words “loading” or “score”. Instead, it reports “coordinates” (and “contributions”) for the variables, which correspond to standardized loadings, and for the individuals, which correspond to standardized scores. PCA is very useful for reducing the dimension of a highly correlated numerical dataset. It transforms the data from the original Cartesian coordinate system into an eigenspace. Each principal-component dimension (PCD, an orthogonal basis vector of the eigenspace) is a linear combination of all the original factors, but each factor is weighted differently on each PCD. The more an individual factor contributes to a PCD, the more it dominates that PCD. Conversely, a PCD to which the factors contribute very little, that is, one that explains very little of the total variance, can be regarded as insignificant and dropped for the purpose of dimension reduction. See Abdi and Williams (2010) and Kostov, Bécue-Bertaut, and Husson (2013) for further reference.
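The link between coordinates and contributions can be sketched as follows. This assumes the usual convention that a variable's contribution to a dimension is its squared coordinate divided by that dimension's eigenvalue, expressed in percent. The function name and the numbers are hypothetical.

```python
def variable_contributions(coords, eigenvalue):
    """Percent contribution of each variable to one dimension.

    coords     : coordinates (loadings) of the variables on that dimension
    eigenvalue : eigenvalue of that dimension; for a full PCA it equals the
                 sum of the squared coordinates
    """
    return [100.0 * c * c / eigenvalue for c in coords]

# Hypothetical coordinates of two variables on one dimension.
coords = [0.8, 0.6]
eigenvalue = sum(c * c for c in coords)  # 0.64 + 0.36
contrib = variable_contributions(coords, eigenvalue)
print([round(c, 2) for c in contrib])
```

Contributions computed this way sum to 100% within each dimension, so they rank the factors within a dimension but say nothing about how important the dimension itself is.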

1.2 Data Standardization

Data standardization is necessary before PCA when the factors are measured on widely different scales. Without it, the factors with the largest scales would dominate the results, leading to misleading loadings and scores. Each value is therefore standardized as \(z_{i} = \frac{x_{i} - mean(x)}{sd(x)}\)
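A minimal sketch of this standardization in Python is given below. It assumes `sd(x)` is the sample standard deviation (divisor n - 1); some PCA implementations divide by n instead, which rescales every value by the same constant. The input values are made up for illustration.

```python
import statistics

def standardize(values):
    """Return z-scores: (x_i - mean(x)) / sd(x)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

# Hypothetical raw values of one factor for three municipalities.
z = standardize([2.0, 4.0, 6.0])
print(z)
```

After this step every factor has mean 0 and standard deviation 1, so no factor dominates the PCA merely because of its unit of measurement.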

Table 1 describes the variables and their definitions: four categorical variables (region, county, municipality and income) and 15 numerical socioeconomic and environmental factors. The municipality variable is used to label the data points in the plots of individuals. Table 2 lists the raw data before standardization.

1.3 Results and Discussion

The objective of the PCA is to identify the dominant principal-component dimensions and the socioeconomic and environmental factors that dominate them for municipalities in Sweden. All raw data are standardized before the PCA is performed. The computations mainly use the “FactoMineR” and “factoextra” packages (Lê, Josse, and Husson (2008)); the R code is attached in Appendix A.

The results in Table 3 show that the first five principal-component dimensions (Dim-1, Dim-2, Dim-3, Dim-4, Dim-5) together account for \(79.79\%\) of the variance. The first PCD (Dim-1) accounts for \(33.07\%\) of the variance and the second (Dim-2) for \(17.74\%\). Figure 1 and Figure 2 show that each dimension from Dim-6 to Dim-15 accounts for less than \(5\%\) of the variance.
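Percentages of this kind follow directly from the eigenvalues: each dimension's share is its eigenvalue divided by the sum of all eigenvalues. The sketch below uses hypothetical eigenvalues, not the values behind Table 3, and the function name is an assumption.

```python
from itertools import accumulate

def explained_variance(eigenvalues):
    """Return (percent, cumulative percent) of variance per dimension."""
    total = sum(eigenvalues)
    percent = [100.0 * ev / total for ev in eigenvalues]
    cumulative = list(accumulate(percent))
    return percent, cumulative

# Hypothetical eigenvalues of a four-variable PCA (they sum to 4).
eigenvalues = [2.0, 1.0, 0.5, 0.5]
percent, cumulative = explained_variance(eigenvalues)
for dim, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"Dim-{dim}: {p:.2f}% (cumulative {c:.2f}%)")
```

The cumulative column is what supports a statement such as "the first five dimensions together account for a given share of the variance".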

The results in Table 4 indicate that

The first PCD (Dim-1) is dominated by the following factors, in descending order of contribution: “mean.age” > “mortality” > “higher.edu” > “pop.change” > “tax.capacity” > “pop.size” > “natality” (see Figure 3).
The second PCD (Dim-2) is dominated by the following factors, in descending order of contribution: “emigration” > “immigration” > “foreign.origin” > “dioxin.mg” > “greenhouse.gases” > “unemployment” > “area” (see Figure 4).

Table 5 gives the contributions of the socioeconomic and environmental factors on the Dim-1-and-Dim-2 plane. It indicates that

“emigration”, “immigration”, “mean.age”, “mortality”, “higher.edu” and “pop.change” contribute very strongly to the variation among municipalities in Sweden.
“foreign.origin”, “tax.capacity”, “pop.size”, “natality”, “dioxin.mg” and “greenhouse.gases” contribute moderately to the variation among municipalities in Sweden.
“unemployment”, “tax.equal” and “area” contribute little to the variation among municipalities in Sweden.
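A contribution on a two-dimensional plane is commonly obtained by weighting the per-dimension contributions by the eigenvalues of the two dimensions. The sketch below assumes that convention; whether Table 5 was computed exactly this way is an assumption, and all numbers are hypothetical.

```python
def plane_contribution(c1, c2, eig1, eig2):
    """Combined percent contribution of one factor on the Dim-1/Dim-2 plane.

    c1, c2     : percent contributions of the factor to Dim-1 and Dim-2
    eig1, eig2 : eigenvalues of Dim-1 and Dim-2 (used as weights)
    """
    return (c1 * eig1 + c2 * eig2) / (eig1 + eig2)

def expected_average(n_factors):
    """Contribution each factor would have if all contributed equally."""
    return 100.0 / n_factors

# Hypothetical factor: 10% on Dim-1, 4% on Dim-2, eigenvalues 3.0 and 1.0.
combined = plane_contribution(10.0, 4.0, 3.0, 1.0)
print(combined, expected_average(15))
```

Comparing each factor with the equal-share reference (100 divided by the number of factors) is one simple way to separate high, moderate and low contributors.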

Table 5 also shows that

individual municipality 199 (“Stockholm”) has a high contribution on the Dim-1-and-Dim-2 plane.
individual municipality 205 (“Sundbyberg”) has a moderate contribution on the Dim-1-and-Dim-2 plane.
individual municipality 194 (“Solna”) has a moderate contribution on the Dim-1-and-Dim-2 plane.
individual municipality 58 (“Göteborg”) has a moderate contribution on the Dim-1-and-Dim-2 plane.
individual municipality 131 (“Luleå”) has a moderate contribution on the Dim-1-and-Dim-2 plane.
the remaining municipalities have a low or very low contribution on the Dim-1-and-Dim-2 plane.

1.4 Conclusion: Part 1

Based on the PCA, it can be concluded that

the socioeconomic factors “emigration”, “immigration”, “mean.age”, “mortality”, “higher.edu” and “pop.change” contribute most to the variation among municipalities in Sweden. The municipalities of Stockholm, Göteborg, Solna, Sundbyberg, and Luleå have high or moderate contributions on the Dim-1-and-Dim-2 plane.

2 Part 2

2.1 Non-metric Multidimensional Scaling

The main objective of Part 2 is to test whether the gut microbiome differs significantly among faecal samples from humans of different nationalities. The dataset (“entero”) contains an OTU table (an inventory of bacterial “species”) with the relative abundance of each OTU in each sample. The OTU data in “entero” have already been standardized. A permutational multivariate analysis of variance (PERMANOVA) is used to test the main research question. For details of PERMANOVA, see Ramette (2007), Oksanen et al. (2007), Anderson et al. (2011) and Oksanen (2013).

Permutational multivariate analysis of variance (PERMANOVA) is a geometric partitioning of the variation in multivariate data, based on a dissimilarity (or similarity) measure. Its statistical inference is distribution-free: p-values are obtained by permutation, without assuming normality. The PERMANOVA pseudo-F statistic tests the null hypothesis that the groups do not differ in location (centroid). Because the test is also sensitive to differences in dispersion between groups, the homogeneity of group dispersions is tested separately.
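The logic of the test can be sketched in plain Python. This is not the vegan implementation used in this report, only an illustration of a one-factor pseudo-F computed from pairwise Bray distances, with labels permuted to obtain a p-value. The function names and the toy abundances are hypothetical.

```python
import random
from itertools import combinations

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    return (sum(abs(a - b) for a, b in zip(u, v))
            / sum(a + b for a, b in zip(u, v)))

def pseudo_f(dist, labels):
    """One-factor pseudo-F from a distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares from all pairwise squared distances.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares, group by group.
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(idx, 2)) / len(idx)
    ss_among = ss_total - ss_within
    a = len(groups)
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(samples, labels, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    n = len(samples)
    dist = [[bray_curtis(samples[i], samples[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]
    observed = pseudo_f(dist, labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # break any link between labels and samples
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical relative abundances of three OTUs in eight samples.
samples = [[0.70, 0.20, 0.10], [0.60, 0.30, 0.10],
           [0.65, 0.25, 0.10], [0.70, 0.25, 0.05],
           [0.10, 0.30, 0.60], [0.20, 0.20, 0.60],
           [0.15, 0.25, 0.60], [0.10, 0.20, 0.70]]
labels = ["A"] * 4 + ["B"] * 4
f_obs, p_value = permanova(samples, labels)
print(round(f_obs, 2), p_value)
```

A large pseudo-F with a small p-value indicates that the observed grouping separates the samples better than almost all random relabellings do.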

2.2 Results and Discussion

Table 5 shows a sample of the “entero” dataset and its variables.

Figure 10 plots the first two NMDS dimensions (NMDS1 against NMDS2) by nationality, clinical status, age, and gender. The plot indicates that the American, Japanese, Spanish, Danish, French and Italian groups show some degree of between-group heterogeneity in the gut microbiome of the faecal samples.

Figures 11, 12, 13 and 14 are the PCoA plots based on Bray distance for four models. Model 1 retains all nationality groups. Model 2 removes the American group, Model 3 removes the American and Japanese groups, and Model 4 removes the American, Japanese and Spanish groups.
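For reference, the Bray distance between two samples \(j\) and \(k\) is usually written as follows, where \(x_{ij}\) denotes the abundance of OTU \(i\) in sample \(j\) (the symbols are introduced here for illustration only):

\[d_{jk} = \frac{\sum_{i} |x_{ij} - x_{ik}|}{\sum_{i} (x_{ij} + x_{ik})}\]

When each sample holds relative abundances that sum to 1, the denominator equals 2. As a small made-up example, for samples \((0.5, 0.3, 0.2)\) and \((0.2, 0.3, 0.5)\):

\[d_{jk} = \frac{0.3 + 0 + 0.3}{2} = 0.3\]

The distance is 0 for identical samples and 1 for samples that share no OTU.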

Figures 15, 16, 17 and 18 are boxplots of the Bray distance to centroid for the four models. They indicate that in Model 4 (Danish, French and Italian groups) the groups have similar Bray distances to their centroids.

Table 6 and Table 7 give the ANOVA test and the permutation test for homogeneity of group dispersions. Both tests are statistically non-significant for all four models. There is therefore no reason to reject the null hypothesis (no difference in dispersion between groups) for any of the models. In other words, there is little risk that differences in dispersion confound the comparison of nationality groups in any of the four models.

Table 8 gives the Adonis test (permutational multivariate analysis of variance using Bray distance matrices). It indicates a highly significant between-group effect of nationality in all four models.

2.3 Conclusion: Part 2

Based on the results of the permutation test and the Adonis test, we conclude that the gut microbiome differs significantly among faecal samples from humans of different nationalities in all four models.

3 Appendix A

3.1 Program code: Part I

4 Appendix B

4.1 Model_01 output: TukeyHSD Test

4.2 Model_02 output: TukeyHSD Test

4.3 Model_03 output: TukeyHSD Test

4.4 Model_04 output: TukeyHSD Test

4.5 Program code: Part II

References

Abdi, Hervé, and Lynne J Williams. 2010. “Principal Component Analysis.” Wiley Interdisciplinary Reviews: Computational Statistics 2 (4): 433–59. https://doi.org/10.1002/wics.101.

Anderson, Marti J., Thomas O. Crist, Jonathan M. Chase, Mark Vellend, Brian D. Inouye, Amy L. Freestone, Nathan J. Sanders, et al. 2011. “Navigating the Multiple Meanings of \(\beta\) Diversity: A Roadmap for the Practicing Ecologist.” Ecology Letters 14 (1): 19–28. https://doi.org/10.1111/j.1461-0248.2010.01552.x.

Kostov, Belchin, Mónica Bécue-Bertaut, and François Husson. 2013. “Multiple Factor Analysis for Contingency Tables in the Factominer Package.” In.

Lê, Sébastien, Julie Josse, and François Husson. 2008. “FactoMineR: A Package for Multivariate Analysis.” Journal of Statistical Software 25 (1): 1–18. https://doi.org/10.18637/jss.v025.i01.

Oksanen, Jari. 2013. “Multivariate Analysis of Ecological Communities in R: Vegan Tutorial.” R Package Version, January, 1–43.

Oksanen, Jari, Roeland Kindt, Pierre Legendre, Bob O’Hara, and M. Henry H. Stevens. 2007. “The Vegan Package,” November.

Ramette, Alban. 2007. “Multivariate Analyses in Microbial Ecology.” FEMS Microbiology Ecology 62 (2): 142–60. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2121141/.

PCA and NMDS

DKCH

2020-04-19