The first part of assignment III is performed to analyze data on Swedish municipalities. The purpose of part 1 in assignment III is to statistically analyze Swedish municipality based data with a Principal Component Analysis (PCA) model. The second part of assignment III is performed to analyze data on faeces samples from humans. The purpose of part 2 in assignment III is to statistically analyze faeces based data with non-metric Multidimensional Scaling (NMDS).
An online version of the course curriculum “Modern Statistics with R” is provided by the university and is authored by Måns Thulin (2021), which is the literature that is used as a foundation to perform statistical tests by using the principal component analysis model. The default help function in R and online error searches are also used while completing assignment III in part 1 and part 2.
The research questions are literally mentioned in the assignment. The research question of part 1 is to look for possible factors that can be used to discuss differences among the municipalities. The research question of part 2 is on whether there are any significant differences in the gut microbiome among faeces samples from humans of different nationalities.
As the research question for part I is formulated in an exploratory manner, only a null hypothesis for part 2 is created below.
\(Statistical\) \(hypothesis\) \(for\) \(part\) \(2:\)
\(Hypothesis\)
\(H_0:\) \(There\) \(are\) \(no\) \(significant\) \(differences\) \(in\) \(the\) \(gut\) \(microbiome\) \(among\) \(faeces\) \(samples\) \(from\) \(humans\) \(of\) \(different\) \(nationalities.\)
\(H_A:\) \(There\) \(are\) \(significant\) \(differences\) \(in\) \(the\) \(gut\) \(microbiome\) \(among\) \(faeces\) \(samples\) \(from\) \(humans\) \(of\) \(different\) \(nationalities.\)
The data set used in part 1 is compiled, and possibly also in part gathered by the Swedish government agency Statistics Sweden (SCB). In part 1, 19 variables are included in the data set. The first four variables are categorical factor variables. The first three of the categorical factor variables “municipality”, “region” and “county” measure their respective geographical and administrative areas within Sweden. The fourth categorical variable “income” measures annual median income in the population above the age of 20, in “low”, “medium” and “high” income categories. The following 15 variables are numeric continuous and discrete response variables that measure municipality level quantitative values. The first response variable “pop.size” measures the population in the number of individuals. The second response variable “area” measures the area in square kilometers. The third response variable “mean.age” measures the average age of the population. The fourth response variable “mortality” measures the number of deaths divided by population size. The fifth response variable “natality” measures the number of births divided by population size. The sixth response variable “pop.change” measures annual population growth in percentage points. The seventh response variable “immigration” and eight response variable “emigration” are self-explanatory and are measured in percentage points. The ninth response variable “tax.capacity” measures the per capita tax base. The tenth response variable “tax.equal” measures per capita tax equalization. The eleventh response variable “unemployment” measures the unemployment rate in percentage points. The twelfth response variable “foreign.origin” measures the percentage of the population that is foreign born, as well as Swedish born with two foreign born parents. The thirteenth response variable “higher.edu” measures the percentage of the population with university education. The fourteenth response variable “greenhouse.gases” measures total emissions of greenhouse gases. The fifteenth response variable “dioxin.mg” measures dioxine emissions in milligrams.
As the assignment in part 1 quite explicitly states that the statistical model is to include all of the response variables above, no lengthy analysis regarding these variables is possible nor required. The more important aspect regarding the construction of the statistical model is the categorical variables other than “municipality”, which is used to visualize data points. The variables “region”, “county” and “income” are used to aid the viewer in finding possible factors that create differences between the municipalities. Also, these three categorical values will be visualized in separate plots, with a minimum of showing each plot with the first and second principal component.
As apparent below, the results are divided into two sub chapters, with quantitative aspects and plots respectively. In the plots and PCA below, no prior mutation in the form of standardization or centering is included in this segment. Instead, this mutation is performed in the appendix, where it shows that standardization and centering is already an integral part of PCA functions in R. This is apparent as the results did not differ while comparing the output with and without the manually performed standardization and centering. As also included in the appendix, three interaction variables of which the respective variables within each interaction are highly correlated were created to try to increase the proportion of variance explained by the first few principal components by reducing the number of included components. Though, this defeated its own purpose as it had an effect that was opposite to its intended effect. Therefore, the number of principal components is still 15 which is the original number of components. To clarify this further, the standardization and centering, as well as the creation of the interaction variables are not used in the results below. Though, attempts were made to improve the model and its components. This only further shows that no such mutation is needed and that the optimal number of principal components, as well as the number of variables should equal the original number of 15. Further possible mutations that could be performed in this part of the assignment is to standardize the data by dividing “greenhouse.gases”, “dioxin.mg” and “area” respectively by “pop.size” to create per capita emission data, as well as population density figures. Though, this mutation is not performed because the assignment likely requires analysis on how for example a large population can affect the results of the performed PCA.
For simplicity when writing the code, the data set containing the 19 variables is renamed from “SCB_assign3” to “SCB”. As can be seen in the appendix it is tested whether there are missing observations, which is necessary to check because principal component analysis does not allow for missing observations. No missing observations were found.
| Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 | Comp.8 | Comp.9 | Comp.10 | Comp.11 | Comp.12 | Comp.13 | Comp.14 | Comp.15 |
Standard deviation | 2.227 | 1.631 | 1.454 | 1.1596 | 0.9420 | 0.8563 | 0.7127 | 0.6886 | 0.6182 | 0.5790 | 0.4915 | 0.4145 | 0.3263 | 0.27902 | 0.0207419 |
Proportion of Variance | 0.331 | 0.177 | 0.141 | 0.0896 | 0.0592 | 0.0489 | 0.0339 | 0.0316 | 0.0255 | 0.0223 | 0.0161 | 0.0115 | 0.0071 | 0.00519 | 0.0000287 |
Cumulative Proportion | 0.331 | 0.508 | 0.649 | 0.7388 | 0.7979 | 0.8468 | 0.8807 | 0.9123 | 0.9378 | 0.9601 | 0.9762 | 0.9877 | 0.9948 | 0.99997 | 1.0000000 |
In the summary of the PCA above, it shows that 2 PC:s explain approximately half of the cumulative proportion of variance, 3 PC:s explain nearly two thirds and that 85% in the cumulative proportion of variance is exceeded in the 7th variable. As will be further explained, in order to reduce the length of the report, principal components exceeding the sixth PC are not included in the plotting.
As also is present in the PCA table, the screeplot above shows that the first two PC:s only constitute about half of the cumulative proportion of variance, while the other PC:s in total constitute the remainder of the proportion of variance. This essentially means that the analysis can not be performed by limiting the PCA to the first two principal components, which is the main reason behind the inclusion of several principal components in the plotting of the following sub chapter.
As based on the results, the first to the sixth principal components will be plotted and are denoted by PCA1 to PCA6. Equivalent names that also appear are Comp.1 to Comp.6, as well as Dimension 1 to dimension 6. The inclusion of several principal components is based on the grounds that follow. There is likely no objective limitation regarding the number of principal components, but according to my judgement, the proportion of variance of 5.9% and 4.9% as explained by the fifth and sixth principal component respectively are necessary to increase to cumulative proportion of variance to approximately 85%. Though, the proportion of variance explained by the third to the sixth principal components are too low to be analyzed in depth in an assignment of this scope. The principal components will be given attention in an descending order by their respective proportion of variance. In a similar manner when the number of principal components is limited to 6, the third and fourth principal components are of greater importance as the explained proportion of variance is higher and thus the choice of including these two is obvious along this line of thinking. The correlations of the loadings and the third to sixth principal components also need to be compared to the equivalent aspects regarding the correlations of the the first and second principal components, which is an additional reason to include several PC:s when the loadings from the very same variables can deviate or have similarities between them.
eigenvalue | variance.percent | cumulative.variance.percent |
4.95993 | 33.06619 | 33.1 |
2.66179 | 17.74525 | 50.8 |
2.11541 | 14.10276 | 64.9 |
1.34457 | 8.96377 | 73.9 |
0.88738 | 5.91589 | 79.8 |
0.73317 | 4.88778 | 84.7 |
0.50798 | 3.38654 | 88.1 |
0.47418 | 3.16117 | 91.2 |
0.38223 | 2.54820 | 93.8 |
0.33520 | 2.23465 | 96.0 |
0.24154 | 1.61027 | 97.6 |
0.17185 | 1.14566 | 98.8 |
0.10650 | 0.71000 | 99.5 |
0.07785 | 0.51900 | 100.0 |
0.00043 | 0.00287 | 100.0 |
The results in table 2 show the eigenvalues, as well as repeating aspects that also are present in table 1, showing the proportion of variance and the cumulative proportion of variance. The eigenvalues are essentially the variance of each component, which is the square of each standard deviation.
The screeplot above shows eigenvalues by principal component and by comparing the relative differences of each principal components it indicates orderly consistency with the descending proportion of variance of each PCA in figure 1.
| Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 |
pop.size | 0.2732 | 0.1883 | 0.3645 | 0.1804 | 0.14052 | 0.1804 |
area | -0.1511 | 0.1982 | 0.1339 | 0.4137 | -0.55486 | -0.4577 |
mean.age | -0.3942 | 0.0717 | 0.0593 | 0.0627 | 0.30725 | -0.1379 |
mortality | -0.3918 | 0.0204 | 0.1722 | 0.1032 | 0.14356 | -0.1058 |
natality | 0.2564 | -0.1475 | 0.2241 | -0.2441 | -0.44296 | -0.2717 |
pop.change | 0.3440 | -0.0671 | -0.1867 | -0.2142 | -0.17115 | -0.0945 |
immigration | 0.1673 | -0.4939 | -0.0385 | 0.2428 | 0.16294 | -0.3028 |
emigration | 0.0518 | -0.5064 | 0.0720 | 0.3501 | 0.21491 | -0.2772 |
tax.capacity | 0.2775 | 0.1454 | -0.2980 | 0.3926 | 0.12182 | 0.1023 |
tax.equal | -0.1928 | -0.1255 | 0.0112 | 0.4877 | -0.36847 | 0.5168 |
unemployment | -0.1246 | -0.2500 | 0.4862 | -0.1478 | 0.02154 | 0.1821 |
foreign.origin | 0.2111 | -0.3305 | 0.3073 | -0.0139 | -0.04327 | 0.3318 |
higher.edu | 0.3636 | 0.1347 | -0.1190 | 0.2463 | 0.12407 | 0.0590 |
greenhouse.gases | 0.1716 | 0.2736 | 0.3242 | 0.0492 | 0.30130 | -0.2269 |
dioxin.mg | 0.2071 | 0.3024 | 0.4305 | 0.1331 | -0.00328 | -0.0274 |
In table 3 above, correlations by principal component are shown by each loading. As orthogonality is a statistical assumption when computing each principal component, it is difficult to idiosyncratically analyze the principal components by variable, by looking for patterns within each variable. Therefore, the analysis is instead performed by using the visualizations below where the viewer is aided by the categorical variables and municipality based observation labels.
As noticeable, the PCA plots are very large with the purpose of minimizing issues with overlapping labels. In order to minimize issues with cluttering, the width of each PCA plot is maximized. Though, as the center is the most cluttered, mainly deviating data points are labelled which can create a biased interpretation while directing the attention of the viewer towards municipalities that often are extreme outliers in the original data. Stockholm regarding its population, or Luleå regarding its extreme greenhouse gas emissions, of which both values are many standard deviations above the median. As I am born and raised in Luleå I know that the per capita greenhouse gas emissions in my home city exceed the levels in countries found in oil based economies that are situated in the Arabic gulf region.
In consecutive order, figure 3, figure 4 and figure 5 show the first two principal components by region, county and income respectively. Figure 6 shows the third and fourth principal component by income. Figure 7 shows the fifth and sixth principal component by income. The reason why the last two plots are shown by the categorical variable “income” is because it clearly dissects different geographical regions. To instantly be capable of analyzing data points from Swedish regions and counties requires knowledge about such differences within Sweden, especially in terms of demographics and socioeconomic differences between regions and counties. In order to visualize the most important aspects of the performed PCA, I therefore limit the number of PCA figures to five.
The loadings of mean age and mortality rates are highly negatively correlated with the first PC. A high age, as well as the municipality based mortality rate are both directed towards the data points of Norrland, which indicates that the population in this region is elderly and dies at a higher rate. As shown in the appendix, the variables behind the loadings of mean age and mortality have a correlation of 88%. As the statistical method of PCA uses orthogonality to remove bias created by correlation between variables, there is no such present bias due to collinearity. Though, the high correlation between the variables can possibly be an indicator to the similar correlation of mean age and mortality rates. The loadings of these two variables consistently also show similar visualized directions in the plots of the third to sixth principal components. In addition to this, the data points of Svealand are negatively connected to these loadings along the first PC, which indicates that this region is younger and dies at a lower rate. The correlation between natality and population change are clearly correlated with municipality based data points in the regions of Svealand and Götaland, thus indicating extreme values in some observations of these regions. As emigration and immigration rates are not directly associated with these data points, it further indicates that the positive population change is not driven by immigration, but mainly of high birth rates. The opposite relation holds for the previously mentioned municipality based data points of Norrland with small population levels.
As there are a multitude of different counties, in figure 4 it is difficult to make general conclusions regarding the data points. Figure 3 and figure 4 essentially are the same PCA plots that only differ in the categorical variable that is used to visually aid the viewer. The list of the counties reduce the horizontal size of figure 4. As combined with the issuse with the multitude of counties, the category of “county” is therefore the least useful of the three categories. Though, some observations can be made. Data points from Stockholms län are often outliers in the direction of immigration, emigration and foreign origin, with municipalities such as Sundbyberg, Solna and Botkyrka. When disregarding Gothenburg municipality and Stockholm municipality with high population levels, Norrbottens län, Västerbottens län and Västernorrlands län are situated in the direction of the emission based response variables despite not having large populations, most likely due to steel industries and mines being situated in these three counties. The loadings of immigration and emigration are largely in the direction of deviating municipality based observation points in Stockholms län, such as Botkyrka and Solna.
In figure 5 is a very clear connection between low income municipality based data points and mean age, mortality, tax equality and unemployment. The first three of these variables can probably be attributed to a high percentage of retirees, where tax equality is high possibly because the retirees are relatively poor and that the population works in the public sector. The connection between unemployment and a low average income is rather obvious, as people who are not in the labour force do not produce anything. Higher education and tax capacity are associated with medium and high income data points, that also tend to be in the direction of the loading for population size. Homogeneous municipalities with a low percentage of foreign born inhabitants tend to be poor, whereas the opposite holds for a group of data points with middle income.
The third and fourth principal components explain approximately 14% and 9% of the proportion of variance respectively, with a combined proportion of approximately 23%, thus indicating relatively high importance regarding these two principal components. The data points with high income are relatively densely situated in the same part of the plot, with the main exception of Stockholm. High income is clearly associated with tax capacity and higher education. The pull effect of extremes is high in this plot, where the emission variables and population size are clearly directed towards data points that are several standard deviations above the median in one or more of these aspects. The spread of low income data points is large, thus making it difficult to make general conclusions regarding this category, when there is a large spread around the center of the orthogonality, as long as a strong pull effect towards a high correlation to the fourth principal component.
The fifth and sixth principal components explain approximately 6% and 5% of the proportion of variance respectively, with a combined proportion of approximately 11%, thus indicating a lower importance regarding these two principal components. The only data points that are relatively clearly distributed in one direction are high income municipalities. An interesting aspect is that the loading of unemployment is directed towards high income data points. Though, if there is such an actual connection is unclear as the sixth principal component only explains 5% of the proportion of variance and the assumption of orthogonality can possibly create indications that are not relevant in the variance within the variable.
While only using 6 principal components and 15.3% of the remaining cumulative proportion of variance is left out, this can potentially create some degree of bias in analyzing the variables. To at least include principal components that constitute more than 90% of the proportion of variance would be preferable, but it would create a report that is too lengthy and possibly not clear in structure. In PCA, my guess is that the position of a data point is set by correlation to principal components and also, the correlation of data to principal components also determines the position of a data point. I assess that PCA in itself is not as straightforward as maximum likelihood OLS and similar regression based methods, as PCA in that sense would be dual instead of singular. When orthogonality is used while computing the PCA, it creates unbiased estimates. Though, I argue that the bias occurs in the division into principal components. By this line of thought it is optimal for a minimal level of bias if the first two principal components explain a large proportion of the variance, which is not the case in this PCA where the first and second principal components combined only explain approximately 50% of the variance.
The principal components do not only explain the proportion of variance in a descending order, but I also argue that it is possible that the preciseness of the loadings of each principal component decreases in a descending order, as the assumption of orthogonality must persist when the proportion of variance explained by each PC decreases, thus making each loading less and less precise. If one has overall knowledge of the municipalities used as data points, there is a pattern of the first two principal components being mostly related to ones preconceived notions of the municipialities, such as Jokkmokk having a large area, Stockholm big population or Luleå having high greenhouse gas emissions. Though, the weakness of such an analysis with a preconceived basis it that the conclusions can be post hoc, but with a sizable portion of previous knowledge on Swedish municipalities, such a perception can not be avoided. The pull effect in the third to sixth PC in a descending order is decreasingly related to such preconceived observances. In addition to the proportion of variance being the largest, the loadings are also seemingly most accurate in the first two principal components. Along this line of thinking, figure 3 to figure 5 with the first two principal components are clearly the most essential of the performed PCA. Figure 6 and figure 7 hold some relevance in explaining the variation in the response variables, but partially are included to show the essential importance of the first two principal components. The visual direction of the loadings of the first two principal components can for example be clearly analyzed by mean age by mortality, in the direction of low income municipality based data points in the regional category of Norrland. As the loadings are orthogonal to each other, these data points are also in the opposite direction of population change, thus indicating clearly negative and also in some cases stagnant changes in population closer to the center of the loadings. This is not as apparent in the following principal components of the sixth and seventh figures, where the pull of the loadings deviate from the previous figures. It needs to be specified that the following part of the discussion is on the first two principal components. The visualized category of income is the most useful, as it as previously mentioned dissects different regions by using a more generally comprehensible category. The variable for population creates a loading that controls for differences of many standard deviations in population. Though, in the other variables there is a presence of bias towards municipality based data points that with smaller populations, especially in Norrland with small populations that are less relevant in other per capita variables. A mitigating factor is that the emission variables of dioxin and greenhouse gases are in absolute numbers, often with many standard deviations above the median, that especially regarding the first two principal components pull the loadings towards data points such as Luleå and Gällivare, as well as population towards Gothenburg, Stockholm and Uppsala. Though, to minimize bias it would be preferable if a maximal number of response variables would be measured in per capita terms. The extremes of unemployment are mainly located in poor municipalities in Svealand and Götaland. These data points according to the visualization are in between the direction of the loadings of foreign origin and mean age respectively, which indicates the data points with the lowest incomes are the combination of relative diversity of foreign born inhabitants, as well as a relatively high percentage of retirees. The opposite relationship can be found in university cities with higher education and high tax capacity, where the latter indicates high incomes and therefore high average income tax brackets. As the orthogonal assumption in the computing of the following principal components potentially creates bias for some loadings, as well as their relatively small importance in explaining the cumulative proportion of variance, I choose not to continue with analyzing these principal components in detail. As a short summary of the final figures, for example the extreme pulls of population size and greenhouse gas emissions make the visualized loadings coincide with the first two principal components, but tax equality clearly deviates in this regard and in part it is the reason why the first two principal components are clearly the most essential to the performed PCA.
\(Chosen\) \(statistical\) \(model:\) \(Linear\) \(Mixed-effects\) \(Modelling\)
\(Model\) \(1\) \(for\) \(part\) \(2:\)
\(TRANSITIONS = δ_0+δ_1TREATMENT+δ_2GENDER + δ_3(1|FAMILY)\)
Måns Thulin (2021). Modern Statistics with R. From wrangling and exploring data to inference and predictive modelling. http://www.modernstatisticswithr.com/
# Correlation matrix as data frame
cor_dataframe <- as.data.frame(cor(SCB[,-1:-4]))
# Most highly correlated variables
# Mean age and mortality (0.88).
cor_dataframe$mean.age[4]
## [1] 0.877
# Immigration and emigration (0.89).
cor_dataframe$immigration[8]
## [1] 0.893
# Higher education and tax capacity (0.77)
cor_dataframe$higher.edu[9]
## [1] 0.768
# Renaming the data set
SCB <- SCB_assign3
# Test if there are missing observations. There are no missing observations as the number of rows are equal.
OMIT <- na.omit(SCB)
nrow(OMIT) == nrow(SCB)
# Perform principal component analysis. A cumulative proportion of 90% is exceeded in the 8th PC.
PCAPRINCOMP <- princomp(~pop.size+area+mean.age+mortality+natality+
pop.change+immigration+emigration+tax.capacity+tax.equal+
unemployment+foreign.origin+higher.edu+greenhouse.gases+
dioxin.mg, cor = TRUE,
data = SCB)
# Function to create table for PCA output. Taken from stackoverflow.
pca_importance <- function(x) {
vars <- x$sdev^2
vars <- vars/sum(vars)
rbind(`Standard deviation` = x$sdev, `Proportion of Variance` = vars,
`Cumulative Proportion` = cumsum(vars))
}
# Create table of standard deviations and variance of PCA.
PCASUM <- as.data.frame(pca_importance(summary(PCAPRINCOMP)))
PCASUM <- rownames_to_column(PCASUM, " ")
PCASUM %>%
flextable() %>%
set_caption("PCA stdev. and variance. Table 1.")
CONTVAR <-
SCB %>%
select(-1:-4)
PCAPCA<-PCA(CONTVAR, graph=FALSE)
fviz_screeplot(PCAPCA, addlabels = TRUE, ylim = c(0, 40), ncp = 15,
caption = "Screeplot. Proportion of variance by PC",
main = "Figure 1") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 16),
axis.text = element_text(size = 16),
plot.caption = element_text(size = 25, hjust = 0.5),
plot.title = element_text(size = 30, hjust = 0.5))
# Create table of eigenvalues and proportion of variance.
CONTVAR <-
SCB %>%
select(-1:-4)
PCAPCA <- PCA(CONTVAR, graph=FALSE)
PCAPCASUM <- as.data.frame(get_eig(PCAPCA))
PCAPCASUM %>%
flextable() %>%
set_caption("PCA eigenvalues and proportion of variance. Table 2.")
# Screeplot with eigenvalues by principal component.
eigenvalues <- PCAPRINCOMP$sdev^2
qplot(c(1:15), eigenvalues) +
geom_line(color = "blue") +
geom_bar(stat = "identity") +
labs(x = "Principal component",
y = "Eigenvalue",
title = "Figure 2",
caption = "Screeplot. Eigenvalue by PC.") +
scale_x_continuous(breaks = c(1:15)) +
theme(text = element_text(size = 20),
axis.title = element_text(size = 16),
axis.text = element_text(size = 16),
plot.title = element_text(hjust = 0.5, size = 28),
plot.caption = element_text(hjust = 0.5, size = 28))
# Table of the first 6 principal components.
ALOAD <- as.data.frame(unclass(loadings(PCAPRINCOMP)))[1:6]
ALOAD <- rownames_to_column(ALOAD, " ")
ALOAD %>%
flextable() %>%
set_caption("Principal component by loading. Table 3.")
# Plot PCA1 and PCA2 by region with data points labeled by municipality.
autoplot(PCAPRINCOMP, data = SCB, colour = 'region', size = 4L, loadings = TRUE,
loadings.colour = 'blue', loadings.label = TRUE,
loadings.label.size = 6L) +
theme_light() +
scale_colour_discrete("region") +
geom_text_repel(label = SCB$municipality, max.overlaps = 6, size = 6L) +
labs(x = "Principal component 1",
y = "Principal component 2",
title = "Figure 3",
caption = "Scatterplot of PCA1 and PCA2 by region.") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 20),
axis.text = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 50),
plot.caption = element_text(hjust = 0.5, size = 28))
# Plot PCA1 and PCA2 by county with data points labeled by municipality.
autoplot(PCAPRINCOMP, data = SCB, colour = 'county', size = 4L, loadings = TRUE,
loadings.colour = 'blue', loadings.label = TRUE,
loadings.label.size = 6L) +
theme_light() +
scale_colour_discrete("county") +
geom_text_repel(label = SCB$municipality, max.overlaps = 6, size = 6L) +
labs(x = "Principal component 1",
y = "Principal component 2",
title = "Figure 4",
caption = "Scatterplot of PCA1 and PCA2 by county") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 20),
axis.text = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 50),
plot.caption = element_text(hjust = 0.5, size = 28))
# Plot PCA1 and PCA2 by income with data points labeled by municipality.
autoplot(PCAPRINCOMP, data = SCB, colour = 'income', size = 4L, loadings = TRUE,
loadings.colour = 'blue', loadings.label = TRUE,
loadings.label.size = 6L) +
theme_light() +
scale_colour_discrete("income") +
geom_text_repel(label = SCB$municipality, max.overlaps = 6, size = 6L) +
labs(x = "Principal component 1",
y = "Principal component 2",
title = "Figure 5",
caption = "Scatterplot of PCA1 and PCA2 by income") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 20),
axis.text = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 50),
plot.caption = element_text(hjust = 0.5, size = 28))
# Plot PCA3 and PCA4 by income with data points labeled by municipality.
autoplot(PCAPRINCOMP, x=3, y=4, data = SCB, colour = 'income', size = 4L, loadings = TRUE,
loadings.colour = 'blue', loadings.label = TRUE,
loadings.label.size = 6L) +
theme_light() +
scale_colour_discrete("income") +
geom_text_repel(label = SCB$municipality, max.overlaps = 6, size = 6L) +
labs(x = "Principal component 3",
y = "Principal component 4",
title = "Figure 6",
caption = "Scatterplot of PCA3 and PCA4 by income") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 20),
axis.text = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 50),
plot.caption = element_text(hjust = 0.5, size = 28))
# Plot PCA5 and PCA6 by income with data points labeled by municipality.
autoplot(PCAPRINCOMP, x=5, y=6, data = SCB, colour = 'income', size = 4L, loadings = TRUE,
loadings.colour = 'blue', loadings.label = TRUE,
loadings.label.size = 6L) +
theme_light() +
scale_colour_discrete("income") +
geom_text_repel(label = SCB$municipality, max.overlaps = 6, size = 6L) +
labs(x = "Principal component 5",
y = "Principal component 6",
title = "Figure 7",
caption = "Scatterplot of PCA5 and PCA6 by income") +
theme(text = element_text(size = 20),
axis.title = element_text(size = 20),
axis.text = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 50),
plot.caption = element_text(hjust = 0.5, size = 28))
# Appendix Part 2