The first part of assignment III is performed to analyze data on Swedish municipalities. The purpose of part 1 in assignment III is to statistically analyze Swedish municipality based data with a Principal Component Analysis (PCA) model. The second part of assignment III is performed to analyze data on faeces samples from humans. The purpose of part 2 in assignment III is to statistically analyze faeces based data with non-metric Multidimensional Scaling (NMDS).
An online version of the course curriculum “Modern Statistics with R” is provided by the university and is authored by Måns Thulin (2021), which is the literature that is used as a foundation to perform statistical tests by using the principal component analysis model. The default help function in R and online error searches are also used while completing assignment III in part 1 and part 2.
The research questions are literally mentioned in the assignment. The research question of part 1 is to look for possible factors that can be used to discuss differences among the municipalities. The research question of part 2 is on whether there are any significant differences in the gut microbiome among faeces samples from humans of different nationalities.
As the research question for part I is formulated in an exploratory manner, only a null hypothesis for part 2 is created below.
\(Statistical\) \(hypothesis\) \(for\) \(part\) \(2:\)
\(Hypothesis\)
\(H_0:\) \(There\) \(are\) \(no\) \(significant\) \(differences\) \(in\) \(the\) \(gut\) \(microbiome\) \(among\) \(faeces\) \(samples\) \(from\) \(humans\) \(of\) \(different\) \(nationalities.\)
\(H_A:\) \(There\) \(are\) \(significant\) \(differences\) \(in\) \(the\) \(gut\) \(microbiome\) \(among\) \(faeces\) \(samples\) \(from\) \(humans\) \(of\) \(different\) \(nationalities.\)
The data set used in part 1 is compiled, and possibly also in part gathered by the Swedish government agency Statistics Sweden (SCB). In part 1, 19 variables are included in the data set. The first four variables are categorical factor variables. The first three of the categorical factor variables “municipality”, “region” and “county” measure their respective geographical and administrative areas within Sweden. The fourth categorical variable “income” measures annual median income in the population above the age of 20, in “low”, “medium” and “high” income categories. The following 15 variables are numeric continuous and discrete response variables that measure municipality level quantitative values. The first response variable “pop.size” measures the population in the number of individuals. The second response variable “area” measures the area in square kilometers. The third response variable “mean.age” measures the average age of the population. The fourth response variable “mortality” measures the number of deaths divided by population size. The fifth response variable “natality” measures the number of births divided by population size. The sixth response variable “pop.change” measures annual population growth in percentage points. The seventh response variable “immigration” and eight response variable “emigration” are self-explanatory and are measured in percentage points. The ninth response variable “tax.capacity” measures the per capita tax base. The tenth response variable “tax.equal” measures per capita tax equalization. The eleventh response variable “unemployment” measures the unemployment rate in percentage points. The twelfth response variable “foreign.origin” measures the percentage of the population that is foreign born, as well as Swedish born with two foreign born parents. The thirteenth response variable “higher.edu” measures the percentage of the population with university education. The fourteenth response variable “greenhouse.gases” measures total emissions of greenhouse gases. The fifteenth response variable “dioxin.mg” measures dioxine emissions in milligrams.
As the assignment in part 1 quite explicitly states that the statistical model is to include all of the response variables above, no lengthy analysis regarding these variables is possible nor required. The more important aspect regarding the construction of the statistical model is the categorical variables other than “municipality”, which is used to visualize data points. The variables “region”, “county” and “income” are used to aid the viewer in finding possible factors that create differences between the municipalities. Also, these three categorical values will be visualized in separate plots, with a minimum of showing each plot with the first and second principal component.
The chosen statistical testing for part 1 is Principal Component Analysis (PCA). In order to determine the reasons behind the direction and magnitude of the variables, this method is suitable as it enables analysis of the substantial number of 15 variables. Orthogonality between the variables is a perquisite in PCA because it completely mitigates potential problems with potential collinearity between a large number variables. In addition to this, visualization in the form of PCA plots are used to specifically aid the reader in pinpointing the positions of different municipality based data points.
As apparent below, the results are divided into two sub chapters, with quantitative aspects and plots respectively. In the plots and PCA below, no prior mutation in the form of standardization or centering is included in this segment. Instead, this mutation is performed in the appendix, where it shows that standardization and centering is already an integral part of PCA functions in R. This is apparent as the results did not differ while comparing the output with and without the manually performed standardization and centering. As also included in the appendix, three interaction variables of which the respective variables within each interaction are highly correlated were created to try to increase the proportion of variance explained by the first few principal components by reducing the number of included components. Though, this defeated its own purpose as it had an effect that was opposite to its intended effect. Therefore, the number of principal components is still 15 which is the original number of components. To clarify this further, the standardization and centering, as well as the creation of the interaction variables are not used in the results below. Though, attempts were made to improve the model and its components. This only further shows that no such mutation is needed and that the optimal number of principal components, as well as the number of variables should equal the original number of 15. Further possible mutations that could be performed in this part of the assignment is to standardize the data by dividing “greenhouse.gases”, “dioxin.mg” and “area” respectively by “pop.size” to create per capita emission data, as well as population density figures. Though, this mutation is not performed because the assignment likely requires analysis on how for example a large population can affect the results of the performed PCA.
For simplicity when writing the code, the data set containing the 19 variables is renamed from “SCB_assign3” to “SCB”. As can be seen in the appendix it is tested whether there are missing observations, which is necessary to check because principal component analysis does not allow for missing observations. No missing observations were found.
| Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 | Comp.8 | Comp.9 | Comp.10 | Comp.11 | Comp.12 | Comp.13 | Comp.14 | Comp.15 |
Standard deviation | 2.227 | 1.631 | 1.454 | 1.1596 | 0.9420 | 0.8563 | 0.7127 | 0.6886 | 0.6182 | 0.5790 | 0.4915 | 0.4145 | 0.3263 | 0.27902 | 0.0207419 |
Proportion of Variance | 0.331 | 0.177 | 0.141 | 0.0896 | 0.0592 | 0.0489 | 0.0339 | 0.0316 | 0.0255 | 0.0223 | 0.0161 | 0.0115 | 0.0071 | 0.00519 | 0.0000287 |
Cumulative Proportion | 0.331 | 0.508 | 0.649 | 0.7388 | 0.7979 | 0.8468 | 0.8807 | 0.9123 | 0.9378 | 0.9601 | 0.9762 | 0.9877 | 0.9948 | 0.99997 | 1.0000000 |
In the summary of the PCA above, it shows that 2 PC:s explain approximately half of the cumulative proportion of variance, 3 PC:s explain nearly two thirds and that 85% in the cumulative proportion of variance is exceeded in the 7th variable. As will be further explained, in order to reduce the length of the report, principal components exceeding the sixth PC are not included in the plotting.
As also is present in the PCA table, the screeplot above shows that the first two PC:s only constitute about half of the cumulative proportion of variance, while the other PC:s in total constitute the remainder of the proportion of variance. This essentially means that the analysis can not be performed by limiting the PCA to the first two principal components, which is the main reason behind the inclusion of several principal components in the plotting of the following sub chapter.
As based on the results, the first to the sixth principal components will be plotted and are denoted by PCA1 to PCA6. Equivalent names that also appear are Comp.1 to Comp.6, as well as Dimension 1 to dimension 6. The inclusion of several principal components is based on the grounds that follow. There is likely no objective limitation regarding the number of principal components, but according to my judgement, the proportion of variance of 5.9% and 4.9% as explained by the fifth and sixth principal component respectively are necessary to increase to cumulative proportion of variance to approximately 85%. Though, the proportion of variance explained by the third to the sixth principal components are too low to be analyzed in depth in an assignment of this scope. The principal components will be given attention in an descending order by their respective proportion of variance. In a similar manner when the number of principal components is limited to 6, the third and fourth principal components are of greater importance as the explained proportion of variance is higher and thus the choice of including these two is obvious along this line of thinking. The correlations of the loadings and the third to sixth principal components also need to be compared to the equivalent aspects regarding the correlations of the the first and second principal components, which is an additional reason to include several PC:s when the loadings from the very same variables can deviate or have similarities between them.
eigenvalue | variance.percent | cumulative.variance.percent |
4.95993 | 33.06619 | 33.1 |
2.66179 | 17.74525 | 50.8 |
2.11541 | 14.10276 | 64.9 |
1.34457 | 8.96377 | 73.9 |
0.88738 | 5.91589 | 79.8 |
0.73317 | 4.88778 | 84.7 |
0.50798 | 3.38654 | 88.1 |
0.47418 | 3.16117 | 91.2 |
0.38223 | 2.54820 | 93.8 |
0.33520 | 2.23465 | 96.0 |
0.24154 | 1.61027 | 97.6 |
0.17185 | 1.14566 | 98.8 |
0.10650 | 0.71000 | 99.5 |
0.07785 | 0.51900 | 100.0 |
0.00043 | 0.00287 | 100.0 |
The results in table 2 show the eigenvalues, as well as repeating aspects that also are present in table 1, showing the proportion of variance and the cumulative proportion of variance. The eigenvalues are essentially the variance of each component, which is the square of each standard deviation.
The screeplot above shows eigenvalues by principal component and by comparing the relative differences of each principal components it indicates orderly consistency with the descending proportion of variance of each PCA in figure 1.
| Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 |
pop.size | 0.2732 | 0.1883 | 0.3645 | 0.1804 | 0.14052 | 0.1804 |
area | -0.1511 | 0.1982 | 0.1339 | 0.4137 | -0.55486 | -0.4577 |
mean.age | -0.3942 | 0.0717 | 0.0593 | 0.0627 | 0.30725 | -0.1379 |
mortality | -0.3918 | 0.0204 | 0.1722 | 0.1032 | 0.14356 | -0.1058 |
natality | 0.2564 | -0.1475 | 0.2241 | -0.2441 | -0.44296 | -0.2717 |
pop.change | 0.3440 | -0.0671 | -0.1867 | -0.2142 | -0.17115 | -0.0945 |
immigration | 0.1673 | -0.4939 | -0.0385 | 0.2428 | 0.16294 | -0.3028 |
emigration | 0.0518 | -0.5064 | 0.0720 | 0.3501 | 0.21491 | -0.2772 |
tax.capacity | 0.2775 | 0.1454 | -0.2980 | 0.3926 | 0.12182 | 0.1023 |
tax.equal | -0.1928 | -0.1255 | 0.0112 | 0.4877 | -0.36847 | 0.5168 |
unemployment | -0.1246 | -0.2500 | 0.4862 | -0.1478 | 0.02154 | 0.1821 |
foreign.origin | 0.2111 | -0.3305 | 0.3073 | -0.0139 | -0.04327 | 0.3318 |
higher.edu | 0.3636 | 0.1347 | -0.1190 | 0.2463 | 0.12407 | 0.0590 |
greenhouse.gases | 0.1716 | 0.2736 | 0.3242 | 0.0492 | 0.30130 | -0.2269 |
dioxin.mg | 0.2071 | 0.3024 | 0.4305 | 0.1331 | -0.00328 | -0.0274 |
In table 3 above, correlations by principal component are shown by each loading. As orthogonality is a statistical assumption when computing each principal component, it is difficult to idiosyncratically analyze the principal components by variable, by looking for patterns within each variable. Therefore, the analysis is instead performed by using the visualizations below where the viewer is aided by the categorical variables and municipality based observation labels.
As noticeable, the PCA plots are very large with the purpose of minimizing issues with overlapping labels. In order to minimize issues with cluttering, the width of each PCA plot is maximized. Though, as the center is the most cluttered, mainly deviating data points are labelled which can create a biased interpretation while directing the attention of the viewer towards municipalities that often are extreme outliers in the original data. Stockholm regarding its population, or Luleå regarding its extreme greenhouse gas emissions, of which both values are many standard deviations above the median. As I am born and raised in Luleå I know that the per capita greenhouse gas emissions in my home city exceed the levels in countries found in oil based economies that are situated in the Arabic gulf region.
In consecutive order, figure 3, figure 4 and figure 5 show the first two principal components by region, county and income respectively. Figure 6 shows the third and fourth principal component by income. Figure 7 shows the fifth and sixth principal component by income. The reason why the last two plots are shown by the categorical variable “income” is because it clearly dissects different geographical regions. To instantly be capable of analyzing data points from Swedish regions and counties requires knowledge about such differences within Sweden, especially in terms of demographics and socioeconomic differences between regions and counties. In order to visualize the most important aspects of the performed PCA, I therefore limit the number of PCA figures to five.
The loadings of mean age and mortality rates are highly negatively correlated with the first PC. A high age, as well as the municipality based mortality rate are both directed towards the data points of Norrland, which indicates that the population in this region is elderly and dies at a higher rate. As shown in the appendix, the variables behind the loadings of mean age and mortality have a correlation of 88%. As the statistical method of PCA uses orthogonality to remove bias created by correlation between variables, there is no such present bias due to collinearity. Though, the high correlation between the variables can possibly be an indicator to the similar correlation of mean age and mortality rates. The loadings of these two variables consistently also show similar visualized directions in the plots of the third to sixth principal components. In addition to this, the data points of Svealand are negatively connected to these loadings along the first PC, which indicates that this region is younger and dies at a lower rate. The correlation between natality and population change are clearly correlated with municipality based data points in the regions of Svealand and Götaland, thus indicating extreme values in some observations of these regions. As emigration and immigration rates are not directly associated with these data points, it further indicates that the positive population change is not driven by immigration, but mainly of high birth rates. The opposite relation holds for the previously mentioned municipality based data points of Norrland with small population levels.
As there are a multitude of different counties, in figure 4 it is difficult to make general conclusions regarding the data points. Figure 3 and figure 4 essentially are the same PCA plots that only differ in the categorical variable that is used to visually aid the viewer. The list of the counties reduce the horizontal size of figure 4. As combined with the issuse with the multitude of counties, the category of “county” is therefore the least useful of the three categories. Though, some observations can be made. Data points from Stockholms län are often outliers in the direction of immigration, emigration and foreign origin, with municipalities such as Sundbyberg, Solna and Botkyrka. When disregarding Gothenburg municipality and Stockholm municipality with high population levels, Norrbottens län, Västerbottens län and Västernorrlands län are situated in the direction of the emission based response variables despite not having large populations, most likely due to steel industries and mines being situated in these three counties. The loadings of immigration and emigration are largely in the direction of deviating municipality based observation points in Stockholms län, such as Botkyrka and Solna.
In figure 5 is a very clear connection between low income municipality based data points and mean age, mortality, tax equality and unemployment. The first three of these variables can probably be attributed to a high percentage of retirees, where tax equality is high possibly because the retirees are relatively poor and that the population works in the public sector. The connection between unemployment and a low average income is rather obvious, as people who are not in the labour force do not produce anything. Higher education and tax capacity are associated with medium and high income data points, that also tend to be in the direction of the loading for population size. Homogeneous municipalities with a low percentage of foreign born inhabitants tend to be poor, whereas the opposite holds for a group of data points with middle income.
The third and fourth principal components explain approximately 14% and 9% of the proportion of variance respectively, with a combined proportion of approximately 23%, thus indicating relatively high importance regarding these two principal components. The data points with high income are relatively densely situated in the same part of the plot, with the main exception of Stockholm. High income is clearly associated with tax capacity and higher education. The pull effect of extremes is high in this plot, where the emission variables and population size are clearly directed towards data points that are several standard deviations above the median in one or more of these aspects. The spread of low income data points is large, thus making it difficult to make general conclusions regarding this category, when there is a large spread around the center of the orthogonality, as long as a strong pull effect towards a high correlation to the fourth principal component.
The fifth and sixth principal components explain approximately 6% and 5% of the proportion of variance respectively, with a combined proportion of approximately 11%, thus indicating a lower importance regarding these two principal components. The only data points that are relatively clearly distributed in one direction are high income municipalities. An interesting aspect is that the loading of unemployment is directed towards high income data points. Though, if there is such an actual connection is unclear as the sixth principal component only explains 5% of the proportion of variance and the assumption of orthogonality can possibly create indications that are not relevant in the variance within the variable.
While only using 6 principal components and 15.3% of the remaining cumulative proportion of variance is left out, this can potentially create some degree of bias in analyzing the variables. To at least include principal components that constitute more than 90% of the proportion of variance would be preferable, but it would create a report that is too lengthy and possibly not clear in structure. In PCA, my guess is that the position of a data point is set by correlation to principal components and also, the correlation of data to principal components also determines the position of a data point. I assess that PCA in itself is not as straightforward as maximum likelihood OLS and similar regression based methods, as PCA in that sense would be dual instead of singular. When orthogonality is used while computing the PCA, it creates unbiased estimates. Though, I argue that the bias occurs in the division into principal components. By this line of thought it is optimal for a minimal level of bias if the first two principal components explain a large proportion of the variance, which is not the case in this PCA where the first and second principal components combined only explain approximately 50% of the variance.
The principal components do not only explain the proportion of variance in a descending order, but I also argue that it is possible that the preciseness of the loadings of each principal component decreases in a descending order, as the assumption of orthogonality must persist when the proportion of variance explained by each PC decreases, thus making each loading less and less precise. If one has overall knowledge of the municipalities used as data points, there is a pattern of the first two principal components being mostly related to ones preconceived notions of the municipialities, such as Jokkmokk having a large area, Stockholm big population or Luleå having high greenhouse gas emissions. Though, the weakness of such an analysis with a preconceived basis it that the conclusions can be post hoc, but with a sizable portion of previous knowledge on Swedish municipalities, such a perception can not be avoided. The pull effect in the third to sixth PC in a descending order is decreasingly related to such preconceived observances. In addition to the proportion of variance being the largest, the loadings are also seemingly most accurate in the first two principal components. Along this line of thinking, figure 3 to figure 5 with the first two principal components are clearly the most essential of the performed PCA. Figure 6 and figure 7 hold some relevance in explaining the variation in the response variables, but partially are included to show the essential importance of the first two principal components. The visual direction of the loadings of the first two principal components can for example be clearly analyzed by mean age by mortality, in the direction of low income municipality based data points in the regional category of Norrland. As the loadings are orthogonal to each other, these data points are also in the opposite direction of population change, thus indicating clearly negative and also in some cases stagnant changes in population closer to the center of the loadings. This is not as apparent in the following principal components of the sixth and seventh figures, where the pull of the loadings deviate from the previous figures. It needs to be specified that the following part of the discussion is on the first two principal components. The visualized category of income is the most useful, as it as previously mentioned dissects different regions by using a more generally comprehensible category. The variable for population creates a loading that controls for differences of many standard deviations in population. Though, in the other variables there is a presence of bias towards municipality based data points that with smaller populations, especially in Norrland with small populations that are less relevant in other per capita variables. A mitigating factor is that the emission variables of dioxin and greenhouse gases are in absolute numbers, often with many standard deviations above the median, that especially regarding the first two principal components pull the loadings towards data points such as Luleå and Gällivare, as well as population towards Gothenburg, Stockholm and Uppsala. Though, to minimize bias it would be preferable if a maximal number of response variables would be measured in per capita terms. The extremes of unemployment are mainly located in poor municipalities in Svealand and Götaland. These data points according to the visualization are in between the direction of the loadings of foreign origin and mean age respectively, which indicates the data points with the lowest incomes are the combination of relative diversity of foreign born inhabitants, as well as a relatively high percentage of retirees. The opposite relationship can be found in university cities with higher education and high tax capacity, where the latter indicates high incomes and therefore high average income tax brackets. As the orthogonal assumption in the computing of the following principal components potentially creates bias for some loadings, as well as their relatively small importance in explaining the cumulative proportion of variance, I choose not to continue with analyzing these principal components in detail. As a short summary of the final figures, for example the extreme pulls of population size and greenhouse gas emissions make the visualized loadings coincide with the first two principal components, but tax equality clearly deviates in this regard and in part it is the reason why the first two principal components are clearly the most essential to the performed PCA.
The data set used in part 2 is a heavily modified version of the dataset “enterotype” from “(Arumugam et al., 2011)”. It includes nine variables, with two doubles. Only eight variables are mentioned in this text. The first variable measures the specific enterotype by number. The second variable measures the sequencing tech. Third variable measures the sample ID by nationality and number. The fourth variable measures which project the sample is extracted in. The fifth variable measures the nationality of a sample. The sixth variable measures the gender of a sample. The seventh variables measures age by number of years. The eigth and final variable measures clinical status by weight category, such as healthy and obese.
The statistical method is non-metric Multidimensional Scaling (NMDS), that is used to determine whether there any significant differences in the gut microbiome among faeces samples from humans of different nationalities. The chosen statistical tests are as following. Whether there are risks with confounding location and dispersion effects tested by checking whether there significantly different beta dispersions between the grouped nationality based samples in gut microbiome composition, mainly by using Tukey’s range test and a box plot of distances to each group centroid as a visual aid to the Tukey’s range test. If such multivariate homogeneity exists, a Permanova test is used to determine the answer to the research question, by testing the null hypothesis of homogenous OTU composition between samples of different nationalities.
At first, the phyloseq-compatible object and the vegan-compatible objects are read into a file and renamed for the simplicity of the code. Then, the OTU:s that do not exist in any of the samples are pruned and therefore reduced from 553 to 224 taxa in the phyloseq-compatible object.
Enterotype | Sample_ID | SeqTech | SampleID | Project | Nationality | Gender | Age | ClinicalStatus |
AM.AD.1 | Sanger | AM.AD.1 | gill06 | american | F | 28 | healthy | |
AM.AD.2 | Sanger | AM.AD.2 | gill06 | american | M | 37 | healthy | |
2 | DA.AD.1 | Sanger | DA.AD.1 | MetaHIT | danish | F | 59 | healthy |
3 | DA.AD.2 | Sanger | DA.AD.2 | MetaHIT | danish | M | 54 | healthy |
3 | DA.AD.3 | Sanger | DA.AD.3 | MetaHIT | danish | F | 49 | obese |
2 | DA.AD.4 | Sanger | DA.AD.4 | MetaHIT | danish | M | 59 | obese |
Above, the sample meta-data is displayed where the variable that measures the different categories of nationality are shown. Also, different microbiome sample types are shown in the second column. Only the head with the first 6 rows out of 33 rows are shown with the purpose of providing the reader with examples of the different variables without making the report too extensive.
Next, the phyloseq-compatible object is pruned another time by taxa. As shown in the appendix in this second pruning, the taxa are reduced to the 100 most abundant OTUS:s. When I tried to rarefy the object to sequence the OTU:s of different time of extraction to the same sequencing depth and number of reads, an error message occurred as the OTU abundance data does not have non-zero dimensions. This can possibly can be interpreted as the object already having the same sequencing debt and number of reads, or already being rarefied prior to reading the object into R in the downloaded file that was provided in the course. Therefore, the only rarefying that is performed is by reducing the object to the 100 most abundant OTU:s, which hopefully can improve the results of the stress test and the distribution of the regressed interpoint distances.
An NMDS stress test is performed with the setting of a Bray-Curtis dissimilarity measure. As shown in the appendix, the stress level is approximately 0.15, which indicates a great representation of the original data when the phyloseq-object is reduced to two dimensions. If the number of dimensions is increased to 3, the stress level with a certain pseudo-random starting point is reduced to approximately 0.09. Though, by using more than 2 dimensions, it can be difficult to provide an intuitively interpretative NMDS plot. Also, a reduction of the stress level is not necessary, as a value of 0.15 already indicates a great representation of the original data.
The NMDS test is plotted above by nationality. Unfortunately it was not possible to create a visible nationality based level for each dot, but the color palette that differentiates between the nationalities is sufficiently clear. Instead of for example a dot indicating “Italy”, it instead is labeled as “IT.AD.X” where the “X” labels the number of the sample. This type of labeling can be preferable as it makes it possible to specify samples by numbers for providing explanations in writing.
There are general visualized differences regarding gut microbiome among faeces samples from humans of different nationalities. Nationalities of different samples, where for example the American and Japanese samples are not present in the concentration approximately at the position of (x,y) = (-0.2,-0.1), where “x” is the position on the x-axis and “y” is the position on the y-axis. Possibly, this could be because of the geographic distance to the other nationalities that are all located in Europe, where the differences in the estimated distances of gut microbiome values are smaller within this continent, than while comparing them to samples in other continents. Though, only six out of nine Japanese gut microbiome feaces samples are located far from the concentration to the right along NMDS1. Both American samples are located outside of the concentration, where the second American sample “AM.AD.2” clearly is an outlier with a highly negative distance value along the y-axis of NMDS2. Though, the deviation of the American samples can possibly have occured due to chance as they are few in number and the deviating relation to the other samples can therefore not be assesed with statistical certainty. The French samples are the most concentrated where seven out of eight samples are located in the concentrated clutter at approximately (x,y) = (-0.2,-0.1). The Spanish samples possibly have the largest spread in regards to no sample being in direct proximity of the other. To a certain degree, three out of these four Spanish samples are outliers that are located in the same direction. Though, these relations regarding NMDS based nationality sample distances are yet to be analyzed with certainty by using numeric methods for statistical assurance.
To provide an intuitive interpretation regarding the polygon in the NMDS plot above is seemingly more difficult than interpreting the structure of the points in the previous plot. The Spanish, Danish and American samples mainly form the exterior border of the red polygon, possibly due to being outliers in different general directions for each nationality. Possibly also due to being outliers, the Japanese and Italian samples are part of forming the exterior border of the transparent polygon, but in a different sense. Nearly every sample in the main concentration at approximately (x,y) = (-0.2,-0.1) are located within the overlap of the two polygons, thus further visualizing the concentrated clutter of mainly continental European countries and the visually estimated presence of such similarities.
After having converted the object to the form of a matrix and an array, the data is seemingly continuous as the variables of different types of feces based OTU categories measure values in a non-binary form. The contour lines are formed by randomness and change to a degree that is sufficient for creating deviating interpretations from different pseudorandom starting points. While curved and non-linear in shape, the contour lines in the plot above indicate that the original data according to the NMDS analysis has overall similarities in the concentrated clutter at approximately (x,y) = (-0.2,-0.1) and that the general interpoint distances are less uniform further the away from the concentrated clutter they are located. This can been seen as the shape of the contour lines while curved have their maximal bend along the outlying sample data points. Though, this is based solely on a certain pseudorandom starting point and can not be generally verified with statistical certainty as other pseuorandom starting points lead to other shapes in the contour lines.
The coefficient of determination indicates a very high non-metric fit and linear fit. In the non-metric fit, 97.8% of the variation in the ordination distance can be explained by the observed dissimilarity. In the linear fit, 91.3% of the variation in the ordination distance can be explained by the observed dissimilarity.
Nationality | AverageDistance |
american | 0.205 |
danish | 0.229 |
french | 0.241 |
italian | 0.286 |
japanese | 0.273 |
spanish | 0.199 |
The function for creating a vegan-compatible object from the phyloseq-compatible object and the data frame from the vegan-compatible are already generated and saved above. As based on the generated vegan-compatible object, a Bray-Curtis dissimilarity matrix is created above. Values for beta dispersions of the different nationalities are generated above, as based on the generated vegan-compatible object and its respective data frame. The average distances to the median do not vary greatly, with a range of beta dispersion of minimal values and maximal values of 0.199 and 0.273 of the for the feces microbiome samples of the Spanish nationality and Japanese nationality, respectively. By analyzing this information, the distribution of the distances to the centroid seemingly are homogeneous in their distribution, but the values in the table above only indicate average distances to the median, and not where the range of different samples intersect with each other.
nationality | diff | lwr | upr | p-value |
danish-american | 0.02408 | -0.0994 | 0.14759 | 0.9903 |
french-american | 0.03562 | -0.0771 | 0.14837 | 0.9241 |
italian-american | 0.08078 | -0.0357 | 0.19722 | 0.3048 |
japanese-american | 0.06789 | -0.0436 | 0.17938 | 0.4434 |
spanish-american | -0.00579 | -0.1293 | 0.11772 | 1.0000 |
french-danish | 0.01155 | -0.0758 | 0.09888 | 0.9984 |
italian-danish | 0.05670 | -0.0354 | 0.14876 | 0.4311 |
japanese-danish | 0.04382 | -0.0419 | 0.12952 | 0.6263 |
spanish-danish | -0.02987 | -0.1307 | 0.07097 | 0.9412 |
italian-french | 0.04515 | -0.0319 | 0.12218 | 0.4847 |
japanese-french | 0.03227 | -0.0370 | 0.10157 | 0.7110 |
spanish-french | -0.04142 | -0.1288 | 0.04592 | 0.6955 |
japanese-italian | -0.01288 | -0.0880 | 0.06228 | 0.9947 |
spanish-italian | -0.08657 | -0.1786 | 0.00549 | 0.0743 |
spanish-japanese | -0.07369 | -0.1594 | 0.01201 | 0.1232 |
The Tukey’s range test that is performed above is used as a post-hoc test to determine which samples that differ regarding their distribution to the group centroid. The only significant difference in the performed beta dispersion is between the Spanish and and Italian samples of gut microbiome feces samples. Though, the issue is only different at 10% significance with a p-value of approximately 0.074. Since the community of ecological research is unclear on whether the correct measure is to drop a sample that has a statistically different beta-diversity, I have chosen to keep the Spanish sample and perform the following Anova test and Permanova test. If the difference of the beta-diversity between the Spanish and the Italian samples for example would have been statistically significant at 3%, as p < 0.03, I would have chosen to drop the Spanish sample and redo the NMDS analysis instead. Also, 13 out of 15 statistical comparisons have p-values of p > 0.3 and several comparisons even are in the proximity of p-values of nearly 1. This general homogeneity indicates a further basis as to why there is a low risk of confounding location and dispersion effects. Therefore, the following Anova test and Permanova test will not suffer from sufficient heterogeneity in beta-dispersions between the samples and the results will be statistically valid from this standpoint.
As can be viewed above, the distributions of the Spanish and the Italian samples do not intersect at all along the y-axis. Though, uppermost extreme of the Spanish sample and the lowermost extreme of the Italian sample are in close proximity of each other, thus explaining the relatively low rejection of the null hypothesis at 10% significance. As an opposite example with the highest p-value of 1, the Spanish and American samples intersect each other as the medians of the distributions are nearly at the very same distances to the centroid. The remainder of the distributions of the samples intersect each other, thus further indicating a visualized basis as to why the Anova test and Permutation test are valid to perform without significant bias.
Df | Sum Sq | Mean Sq | F value | Pr(>F) |
5 | 0.0292 | 0.00584 | 2.69 | 0.0423 |
27 | 0.0585 | 0.00217 |
Df | SumSq | MeanSq | F_ | N.Perm | Prob_F |
5 | 0.0292 | 0.00584 | 2.69 | 999 | 0.043 |
27 | 0.0585 | 0.00217 |
A regular ANOVA test and a permutation test for testing the homogeneity the OTU compositions of sample types of different nationalities. Both are statistically significantly different from zero with approximately the same significance levels. According to the information in ?permutest(), a permutation test is similar to an ANOVA test and hence the probable reason to the similar p-values. The null hypotheses are based on homogeneous group dispersions and hence, a rejection of the null hypotheses would indicate overall heterogeneous group dispersions. This means that the group dispersions of different samples differ significantly in terms of distances to the group centroid. The Anova test and Permanova test are both significant at 5%, which means that there are significant differences in the gut microbiome among faeces samples from humans of different nationalities.
The stress level of approximately 0.15 indicates a great representation of the original data. This is an initial indicator of dissimilarity between the OTU composition in the data. Despite of the mentioning of transformations not being necessary, the taxa have been pruned to the 100 most abundant OTU:s in order to improve the results of the stress test and the distribution of the regressed interpoint distances. A minimization of the number of OTU:s reduces the size of the dataset, but can instead improve statistical the accuracy of the NMDS analysis.
The absence of signifcantly different beta dispersions at p < 0.03 indicates a clear multivariate homogeneity of the sample groups. The intersection of nearly all grouped distances to the centroid clearly show this in a visualized form. This is further shown by using the more statistically precise approach of the post-hoc Tukey’s range test with a clear absence of multivariate heterogeneity. Also, the stress test indicates good linear- and non metric fits with high coefficients of determination while regressing ordination distance on observed dissimilarity, thus indicating an absence of a relevant residual term and therefore a low level of bias for the Permanova test.
As the risks are low for confounding location and dispersion effects, the Permanova test is statistically valid. The Permanova test is statistically significant at 5%, which answers the research question of whether there any significant differences in the gut microbiome among faeces samples from humans of different nationalities. The rejection of the null hypothesis of the Permanova test shows that such significant differences exist, and that the included countries have different genetic bacetrical compositions in the faeces samples. The Anova test also coincides with the Permanova test with a similar level of statistical rejection and p-value.
As a last remark, as visualized in the NMDS plot of Figure 8, many samples are clustered together, but there are several outliers such as American, Japanese and Spanish samples that are at a clear distance from the clustered samples at approximately (x,y) = (-0.2,-0.1). This further indicates that there are significant differences in the gut microbiome among faeces samples from humans of different nationalities.
Måns Thulin (2021). Modern Statistics with R. From wrangling and exploring data to inference and predictive modelling. http://www.modernstatisticswithr.com/
# Correlation matrix as data frame
cor_dataframe <- as.data.frame(cor(SCB[,-1:-4]))
# Most highly correlated variables
# Mean age and mortality (0.88).
cor_dataframe$mean.age[4]
## [1] 0.877
# Immigration and emigration (0.89).
cor_dataframe$immigration[8]
## [1] 0.893
# Higher education and tax capacity (0.77)
cor_dataframe$higher.edu[9]
## [1] 0.768
# Renaming the data set
SCB <- SCB_assign3
# Test if there are missing observations. There are no missing observations as the number of rows are equal.
OMIT <- na.omit(SCB)
nrow(OMIT) == nrow(SCB)
# Perform principal component analysis. A cumulative proportion of 90% is exceeded in the 8th PC.
PCAPRINCOMP <- princomp(~pop.size+area+mean.age+mortality+natality+
pop.change+immigration+emigration+tax.capacity+tax.equal+
unemployment+foreign.origin+higher.edu+greenhouse.gases+
dioxin.mg, cor = TRUE,
data = SCB)
# Function to create table for PCA output. Taken from stackoverflow.
pca_importance <- function(x) {
vars <- x$sdev^2
vars <- vars/sum(vars)
rbind(`Standard deviation` = x$sdev, `Proportion of Variance` = vars,
`Cumulative Proportion` = cumsum(vars))
}
# Create table of standard deviations and variance of PCA.
PCASUM <- as.data.frame(pca_importance(summary(PCAPRINCOMP)))
PCASUM <- rownames_to_column(PCASUM, " ")
PCASUM %>%
flextable() %>%
set_caption("PCA stdev. and variance. Table 1.")
CONTVAR <-
SCB %>%
select(-1:-4)
PCAPCA<-PCA(CONTVAR, graph=FALSE)
fviz_screeplot(PCAPCA, addlabels = TRUE, ylim = c(0, 40), ncp = 15,
caption = "Screeplot. Proportion of variance by PC",
main = "Figure 1") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 16),
axis.text = element_text(size = 16),
plot.caption = element_text(size = 25, hjust = 0.5),
plot.title = element_text(size = 30, hjust = 0.5))
# Create table of eigenvalues and proportion of variance.
CONTVAR <-
SCB %>%
select(-1:-4)
PCAPCA <- PCA(CONTVAR, graph=FALSE)
PCAPCASUM <- as.data.frame(get_eig(PCAPCA))
PCAPCASUM %>%
flextable() %>%
set_caption("PCA eigenvalues and proportion of variance. Table 2.")
# Screeplot with eigenvalues by principal component.
eigenvalues <- PCAPRINCOMP$sdev^2
qplot(c(1:15), eigenvalues) +
geom_line(color = "blue") +
geom_bar(stat = "identity") +
labs(x = "Principal component",
y = "Eigenvalue",
title = "Figure 2",
caption = "Screeplot. Eigenvalue by PC.") +
scale_x_continuous(breaks = c(1:15)) +
theme(text = element_text(size = 20),
axis.title = element_text(size = 16),
axis.text = element_text(size = 16),
plot.title = element_text(hjust = 0.5, size = 28),
plot.caption = element_text(hjust = 0.5, size = 28))
# Table of the first 6 principal components.
ALOAD <- as.data.frame(unclass(loadings(PCAPRINCOMP)))[1:6]
ALOAD <- rownames_to_column(ALOAD, " ")
ALOAD %>%
flextable() %>%
set_caption("Principal component by loading. Table 3.")
# Plot PCA1 and PCA2 by region with data points labeled by municipality.
autoplot(PCAPRINCOMP, data = SCB, colour = 'region', size = 4L, loadings = TRUE,
loadings.colour = 'blue', loadings.label = TRUE,
loadings.label.size = 6L) +
theme_light() +
scale_colour_discrete("region") +
geom_text_repel(label = SCB$municipality, max.overlaps = 6, size = 6L) +
labs(x = "Principal component 1",
y = "Principal component 2",
title = "Figure 3",
caption = "Scatterplot of PCA1 and PCA2 by region.") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 20),
axis.text = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 50),
plot.caption = element_text(hjust = 0.5, size = 28))
# Plot PCA1 and PCA2 by county with data points labeled by municipality.
autoplot(PCAPRINCOMP, data = SCB, colour = 'county', size = 4L, loadings = TRUE,
loadings.colour = 'blue', loadings.label = TRUE,
loadings.label.size = 6L) +
theme_light() +
scale_colour_discrete("county") +
geom_text_repel(label = SCB$municipality, max.overlaps = 6, size = 6L) +
labs(x = "Principal component 1",
y = "Principal component 2",
title = "Figure 4",
caption = "Scatterplot of PCA1 and PCA2 by county") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 20),
axis.text = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 50),
plot.caption = element_text(hjust = 0.5, size = 28))
# Plot PCA1 and PCA2 by income with data points labeled by municipality.
autoplot(PCAPRINCOMP, data = SCB, colour = 'income', size = 4L, loadings = TRUE,
loadings.colour = 'blue', loadings.label = TRUE,
loadings.label.size = 6L) +
theme_light() +
scale_colour_discrete("income") +
geom_text_repel(label = SCB$municipality, max.overlaps = 6, size = 6L) +
labs(x = "Principal component 1",
y = "Principal component 2",
title = "Figure 5",
caption = "Scatterplot of PCA1 and PCA2 by income") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 20),
axis.text = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 50),
plot.caption = element_text(hjust = 0.5, size = 28))
# Plot PCA3 and PCA4 by income with data points labeled by municipality.
autoplot(PCAPRINCOMP, x=3, y=4, data = SCB, colour = 'income', size = 4L, loadings = TRUE,
loadings.colour = 'blue', loadings.label = TRUE,
loadings.label.size = 6L) +
theme_light() +
scale_colour_discrete("income") +
geom_text_repel(label = SCB$municipality, max.overlaps = 6, size = 6L) +
labs(x = "Principal component 3",
y = "Principal component 4",
title = "Figure 6",
caption = "Scatterplot of PCA3 and PCA4 by income") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 20),
axis.text = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 50),
plot.caption = element_text(hjust = 0.5, size = 28))
# Plot PCA5 and PCA6 by income with data points labeled by municipality.
autoplot(PCAPRINCOMP, x=5, y=6, data = SCB, colour = 'income', size = 4L, loadings = TRUE,
loadings.colour = 'blue', loadings.label = TRUE,
loadings.label.size = 6L) +
theme_light() +
scale_colour_discrete("income") +
geom_text_repel(label = SCB$municipality, max.overlaps = 6, size = 6L) +
labs(x = "Principal component 5",
y = "Principal component 6",
title = "Figure 7",
caption = "Scatterplot of PCA5 and PCA6 by income") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 20),
axis.text = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 50),
plot.caption = element_text(hjust = 0.5, size = 28))
# Reading and renaming the phyloseq object for the simplicity of the code.
entero<-readRDS("entero.rds")
ENT <- entero
# Reading and renaming the vegan object for the simplicity of the code.
entero_vegan <- readRDS("entero_vegan.rds")
VEG <- entero_vegan
# Displaying the sample meta-data.
# First check for OTU:s that might not exist in any of the samples. Most of the
# taxa were found to not exist in any of the samples. For the NMDS, perhaps the # pruned object should be used instead.
ENTPRUNED <- prune_taxa(taxa_sums(ENT) > 0, ENT)
ENTPRUNEDDF <- as.data.frame(sample_data(ENTPRUNED))
ENTPRUNEDTIBBLE <- as_tibble(ENTPRUNEDDF)
head(ENTPRUNEDTIBBLE) %>%
flextable() %>%
set_caption("Head of sample meta data. Table 4")
# Prune the object by taxa name.
TaxaNames <- names(sort(taxa_sums(ENTPRUNED), decreasing = TRUE)[1:100])
ENTPRUNED2 <- prune_taxa(TaxaNames, ENTPRUNED)
# Perform and save the stress test with 2 dimensions.
NDMS_stress_ENTPRUNED2 <- ordinate(ENTPRUNED2, method="NMDS", distance="bray")
# Print the stress test with 2 dimensions.
NDMS_stress_ENTPRUNED2
# Perform and save the stress test with 3 dimensions.
NDMS_stress_ENTPRUNED3 <- ordinate(ENTPRUNED2, method="NMDS", distance="bray", k=3)
# Print the stress test with 3 dimensions.
NDMS_stress_ENTPRUNED3
# Function for vegan object
veganotu = function(physeq) {
require("vegan")
OTU = otu_table(physeq)
if (taxa_are_rows(OTU)) {
OTU = t(OTU)
}
return(as(OTU, "matrix"))
}
# Transform the pruned phyloseq-compatible object to a vegan-compatible object.
ENTPRUNED2_VEG <- veganotu(ENTPRUNED2)
# Same the phyloseq-compatible object as a data frame with nationality and other usable categorical variables.
ENTPRUNED2DF <- data.frame(sample_data(ENTPRUNED2))
# Perform the stress test on the vegan-compatible object
ENTPRUNED2_VEG_mds <- metaMDS(ENTPRUNED2_VEG, k=2, trace = FALSE)
# Creat a color palette
colpal_ENT <- brewer.pal(6, "Set1")
# Next, the plotting begins. The code for the plotting is taken from the NMDS # # tutorial and is chosen because it properly labels each dot by nationality.
# First, plot an empty NMDS plot.
plot(ENTPRUNED2_VEG_mds,type="n", xlim = c(-0.8, 1), ylim = c(-0.8,0.8),
main = "Figure 8", sub = "NMDS plot by nationality", cex.main = 4,
cex.sub = 1.8, cex.lab = 1.6)
# Ordiplot can also be used. This line is written by the Teacher.
# Then, add the estimates positions as visualized by dots.
with(ENTPRUNED2DF, points(ENTPRUNED2_VEG_mds, display = "sites",
col = colpal_ENT[Nationality], pch = 16, cex=6))
# Label every estimate position dot by sample name.
text(ENTPRUNED2_VEG_mds, cex = 1.6, col = "black",pos=1)
# Show which group each estimate belongs to.
with(ENTPRUNED2DF, legend("topright", legend = levels(Nationality), bty = "n",
col = colpal_ENT, pch = 16, cex = 3))
# Manually add information about the stress with a specified position.
text(0.85, -0.85, "Stress = 0.149", cex=3)
# First, create a vector of color values corresponding to the same length as the
# vector of treatment values.
treat=c(rep("Treatment1",17),rep("Treatment2",16))
colors=c(rep("red",5),rep("blue",5))
ordiplot(ENTPRUNED2_VEG_mds,type="n", xlim = c(-0.6,0.9), ylim = c(-0.8,0.8),
main = "Figure 9", sub = "Polygon based NMDS plot by nationality",
cex.main = 3,cex.sub = 1.8, cex.lab = 1.4)
# Plot convex hulls with colors based on treatment.
for(i in unique(treat)) {
ordihull(ENTPRUNED2_VEG_mds$point[grep(i,treat),],draw="polygon",
groups=treat[treat==i],col=colors[grep(i,treat)],label=F) }
# orditorp(ENTPRUNED2_VEG_mds,display="species",col="red",air=0.01)
orditorp(ENTPRUNED2_VEG_mds,display="sites",col=c(rep("green",5),
rep("blue",5)),air=0.01,cex=1.25)
# NMDS plot with contour lines.
set.seed=2
# Define random elevations for previous example.
elevation=runif(33,0.5,1.5)
# Use the function ordisurf to plot contour lines.
ordisurf(ENTPRUNED2_VEG_mds,elevation,main="Figure 10",
sub="NMDS plot with contour lines",col="forestgreen")
# Finally, display species on plot.
orditorp(ENTPRUNED2_VEG_mds,display="sites",col="grey30",air=0.1,
cex=1)
# As based on the vegan based stress test, create the stressplot with stressplot().
stressplotjpg <- image_read("C:/Users/olle_/OneDrive/Skrivbord/Statistical analyses and visualization in R II/Part 4/Assignment 3 PCA and NMDS/StressplotJpg.jpg")
stressplotjpg %>%
image_resize("2000x1400")
# Creating a Bray-Curtis dissimilarity matrix.
ENTPRUNED2_VEG_BC <- vegdist(wisconsin(sqrt(ENTPRUNED2_VEG)), method = "bray")
# Test of multivariate homogeneity of group dispersions.
betadisp_ENTPRUNED <- betadisper(ENTPRUNED2_VEG_BC, ENTPRUNED2DF$Nationality)
Beta_Dispersion <- data.frame(Nationality = c("american", "danish", "french", "italian",
"japanese", "spanish"),
AverageDistance = c(0.205,0.229,0.241,0.286,0.273,0.199))
Beta_Dispersion %>%
flextable() %>%
set_caption("Avg beta dispersion distance to median. Table 5.")
# Post-hoc Tukey's range test to determine which samples that differ regarding the average distances to the centroid.
df1 <- as.data.frame(TukeyHSD(betadisp_ENTPRUNED)$group) %>%
tibble::rownames_to_column()
df1 <- rename(df1, nationality = rowname)
df1 <- rename(df1, "p-value" = "p adj")
df1 %>%
flextable() %>%
set_caption("Tukey's range test. Table 6.")
# Creating box plot for distribution to centroid.
boxplot(betadisp_ENTPRUNED, xlab = "Distribution of distance to centroid",
main = "Figure 12", las=2, cex.main = 3.4, cex.lab = 1.7)
# Anova test for beta dispersions.
as_tibble(anova(betadisp_ENTPRUNED)) %>%
flextable() %>%
set_caption("Anova test. Table 7.")
# Permanova test for beta dispersions.
permubetadisp <-
data.frame(Df = c(5, 27),
SumSq = c(0.0292, 0.0585),
MeanSq = c(0.00584, 0.00217),
F_ = c(2.69,""),
N.Perm = c(999,""),
Prob_F = c(0.043, ""))
permubetadisp %>%
flextable() %>%
set_caption("Permanova Test. Table 8.")