The fourth part of the project analyzes country-level data with a principal component analysis (PCA) model. The research question is which factors create differences between the countries of the world.
The statistical analysis builds on the online course text “Modern Statistics with R” by Måns Thulin (2021), which the university provides as course literature. The built-in help function in R and online searches for error messages were also used while completing part 1 and part 2 of assignment III.
The data used for the principal component analysis were compiled by the World Bank, the Center for Systemic Peace, the United Nations, Transparency International and the World Inequality Database. The data set covers the year 2019. This year was chosen to avoid the irregular macroeconomic volatility caused by the pandemic, and because more observations become available some time after the year they describe.
Thirteen variables are used, of which 2 are factor variables and 11 are numeric variables. The first numeric variable, “GDPGrowth”, measures the percentage growth in economic output per capita. The second, “PolityIV”, is an index of the level of democracy. The third, “NominalGDP”, measures total GDP without the purchasing power parity adjustment for exchange rates against the US dollar and for the domestic price level of the United States. The fourth, “PerCapitaGDP”, measures the average economic output per inhabitant of the country. The fifth, “CorruptionPerception”, is an index of the perceived level of corruption, as extracted from interviewed citizens. The sixth, “PoliticalStability”, measures the degree of political stability. The seventh, “StateFragility”, measures state fragility as a composite of eight component indicators of how easily the political system in a country can be replaced by violent means. The eighth, “WorkingAge”, measures the percentage of the population that is of working age. The ninth, “MedianAge”, measures the median age of the population. The tenth, “IncomeInequality”, measures the percentage share of income obtained by the richest 10% of the population. The eleventh, “Unemployment”, measures the unemployment rate in percent.
The first factor variable, “Country”, identifies the country of each data row. The second factor variable, “PoliticalSystem”, gives the category of political system and is derived from “PolityIV”, which ranges from -10 (extreme autocracy) to +10 (extreme democracy). Specifically for this part of the project, the categorical variable is generated with cut-off values of -10 to -5 for autocracy, -4 to 5 for the middle category of anocracy and 6 to 10 for democracy.
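The cut-off rule for “PoliticalSystem” can be sketched as follows. This is a minimal illustration in Python, not the R code of the appendix. The function name `political_system` is a hypothetical choice, and only the cut-off values are taken from the text.

```python
def political_system(polity_score):
    """Map a PolityIV score (-10 to +10) to the three categories used in the text."""
    if not -10 <= polity_score <= 10:
        raise ValueError("PolityIV scores range from -10 to +10")
    if polity_score <= -5:   # -10 to -5
        return "Autocracy"
    if polity_score <= 5:    # -4 to 5
        return "Anocracy"
    return "Democracy"       # 6 to 10


# Check the category on both sides of each cut-off value.
for score in (-10, -5, -4, 5, 6, 10):
    print(score, political_system(score))
```

The sketch assumes integer scores, so the boundaries -5/-4 and 5/6 leave no gap between the categories.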
To avoid computational problems in the principal component analysis, all rows with at least one missing observation are omitted, which reduces the data set from 188 rows to 144 rows. Many of the omitted countries have a small population, often coupled with a low level of development. This biases the analysis somewhat towards more populous and developed countries, but the step is necessary because principal component analysis does not allow missing observations.
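The row-wise omission described above can be illustrated with a small sketch. It is written in Python for illustration only, the rows are made up, and the helper name `drop_incomplete` is hypothetical. The effect is comparable to `na.omit` in R.

```python
# Made-up rows for illustration only; None marks a missing observation.
rows = [
    {"Country": "A", "GDPGrowth": 1.2, "Unemployment": 5.0},
    {"Country": "B", "GDPGrowth": None, "Unemployment": 7.5},
    {"Country": "C", "GDPGrowth": 3.4, "Unemployment": None},
    {"Country": "D", "GDPGrowth": -0.5, "Unemployment": 9.1},
]


def drop_incomplete(rows):
    """Keep only the rows in which every variable is observed."""
    return [row for row in rows
            if all(value is not None for value in row.values())]


complete = drop_incomplete(rows)
print(len(rows), "rows before,", len(complete), "rows after")
```

A single missing value removes the whole country, which is why the reduction from 188 to 144 rows can shift the sample towards countries with more complete statistics.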
| Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 | Comp.8 | Comp.9 | Comp.10 | Comp.11 |
Standard deviation | 2.07 | 1.172 | 1.104 | 1.0011 | 0.932 | 0.8402 | 0.749 | 0.6512 | 0.5117 | 0.4104 | 0.3599 |
Proportion of Variance | 0.39 | 0.125 | 0.111 | 0.0911 | 0.079 | 0.0642 | 0.051 | 0.0386 | 0.0238 | 0.0153 | 0.0118 |
Cumulative Proportion | 0.39 | 0.515 | 0.625 | 0.7164 | 0.795 | 0.8596 | 0.911 | 0.9491 | 0.9729 | 0.9882 | 1.0000 |
As can be seen in the appendix, a principal component object is first generated, with “Country” as the factor variable; “PoliticalSystem” has not yet been generated or used at this stage. As shown in the summary above, the first and second principal components together explain 51.5% of the variance, with 39.0% and 12.5% respectively. The first four components together explain approximately 71.6% of the variance, and the first six approximately 86.0%.
The scree plot above visualizes the data in table 1. Because the project includes other methods of statistical analysis, only the four principal components with the highest proportion of variance are visualized as figures, to limit the size of the project. As these four components explain approximately 71.6% of the variance, approximately 28.4% is left unexplained in this part of the project.
eigenvalue | variance.percent | cumulative.variance.percent |
4.287 | 38.97 | 39.0 |
1.374 | 12.49 | 51.5 |
1.218 | 11.07 | 62.5 |
1.002 | 9.11 | 71.6 |
0.869 | 7.90 | 79.5 |
0.706 | 6.42 | 86.0 |
0.561 | 5.10 | 91.1 |
0.424 | 3.86 | 94.9 |
0.262 | 2.38 | 97.3 |
0.168 | 1.53 | 98.8 |
0.129 | 1.18 | 100.0 |
The eigenvalues of all principal components are shown above. The ratio between any two eigenvalues equals the ratio between the corresponding proportions of variance, so the eigenvalues are directly proportional to the proportion of variance of each principal component.
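The proportionality can be checked directly from the reported eigenvalues. The sketch below is written in Python for illustration, not taken from the R appendix, and uses only the eigenvalues in the table above. It divides each eigenvalue by the sum of all eigenvalues and accumulates the result.

```python
# Eigenvalues as reported in the table above (rounded to three decimals).
eigenvalues = [4.287, 1.374, 1.218, 1.002, 0.869, 0.706,
               0.561, 0.424, 0.262, 0.168, 0.129]

total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

# Running sum of the proportions gives the cumulative proportion.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

print(f"sum of eigenvalues: {total:.3f}")
for i, (ev, p, c) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"Comp.{i}: eigenvalue={ev:.3f}  "
          f"variance={100 * p:.2f}%  cumulative={100 * c:.1f}%")
```

The eigenvalues sum to approximately 11, the number of numeric variables. This is what one would expect if the analysis is based on standardized variables, although the text does not state it explicitly.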
| Comp.1 | Comp.2 | Comp.3 | Comp.4 |
GDPGrowth | 0.09185 | 0.0682 | 0.56777 | 0.4875 |
PolityIV | -0.19718 | 0.6406 | -0.00938 | -0.0864 |
NominalGDP | -0.13304 | -0.1211 | 0.29369 | -0.7426 |
PerCapitaGDP | -0.39616 | -0.0996 | 0.10920 | -0.1702 |
CorruptionPerception | -0.41397 | 0.0678 | 0.06079 | -0.0640 |
PoliticalStability | -0.38854 | -0.0341 | -0.02478 | 0.1127 |
StateFragility | 0.42321 | -0.0126 | 0.17982 | -0.1527 |
WorkingAge | -0.20626 | -0.6356 | -0.20866 | 0.2427 |
MedianAge | -0.39889 | -0.0653 | 0.02472 | 0.0579 |
IncomeInequality | 0.27372 | -0.2614 | -0.24742 | -0.2662 |
Unemployment | -0.00351 | 0.2792 | -0.66133 | 0.0354 |
The loadings of each variable differ between the principal components. The value of a loading can only be compared with the other loadings within the same principal component, not between components. Because the components are orthogonal, a variable's loading changes from one component to the next. A result is best verified when the relation of a variable to the other variables is consistent across the included principal components, for example when certain variables are associated with the same countries throughout the components. If this consistency is missing in the more important principal components, relations between data points and deviating variables cannot be verified.
In the first principal component above, “PerCapitaGDP”, “CorruptionPerception” and “MedianAge” have similar loadings close to -0.4. Their loadings in the second principal component are all relatively close to 0. This indicates that a considerable number of countries combine a high standard of living, a low level of corruption and a large share of elderly inhabitants. For the reader, though, this is yet to be verified visually in the PCA-based scatterplot below.
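The orthogonality of the components can be checked on the loadings table above. The sketch below is written in Python for illustration, is not part of the R appendix, and uses only the reported loadings of the first two components. It computes their dot product and their lengths.

```python
import math

# Loadings of Comp.1 and Comp.2, in the row order of the table above.
comp1 = [0.09185, -0.19718, -0.13304, -0.39616, -0.41397, -0.38854,
         0.42321, -0.20626, -0.39889, 0.27372, -0.00351]
comp2 = [0.0682, 0.6406, -0.1211, -0.0996, 0.0678, -0.0341,
         -0.0126, -0.6356, -0.0653, -0.2614, 0.2792]


def dot(u, v):
    """Dot product of two equally long loading vectors."""
    return sum(a * b for a, b in zip(u, v))


print("dot product:", dot(comp1, comp2))
print("length of Comp.1:", math.sqrt(dot(comp1, comp1)))
print("length of Comp.2:", math.sqrt(dot(comp2, comp2)))
```

Up to the rounding of the table, the dot product is close to 0 and both lengths are close to 1. The loading vectors are therefore orthogonal and of unit length, which is why a large loading in one component says nothing about the size of the same variable's loading in another.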
First, the previously mentioned factor variable “PoliticalSystem” is created above. As is apparent in the plot above, it serves as a category that visually aids the viewer. The division between countries of different political systems is clear, with democracies to the left, anocracies to the right and autocracies in the lower part of the plot. The most politically fragile and unstable countries are anocracies, in the direction of the “StateFragility” loading. The most stable countries are democracies, but the plot also shows that many relatively politically stable countries are autocracies, such as China and Qatar. The figure also indicates that the poorest countries, in the direction opposite to the loading of GDP per capita, do not have the highest unemployment rates. Instead, a high level of unemployment is typical for middle-income countries, such as Armenia and Guyana. Finally, it indicates that the countries with high levels of unemployment are all democracies, and that autocracies at the opposite end have the lowest levels of unemployment, with anocracies in the middle.
The data points of the Arab countries on the southern side of the Persian Gulf, such as Kuwait, the United Arab Emirates and Qatar, are all located near the extreme of the working-age population loading. This indicates that they have a very high share of guest workers and few children and elderly people. Unemployment and working age have roughly opposite loadings, which suggests low birth rates and high emigration in the countries with high unemployment. Income inequality points mainly in the direction of autocracies, such as Bangladesh and Laos. According to the figure above, democracies such as Finland and France lie in the opposite direction of income inequality. Another finding, along the negative direction of the first principal component, is that a low level of corruption, a large share of elderly, a high average income and political stability are associated with largely rich democracies in the Western world. A low level of corruption can possibly be linked to a high per capita GDP and political stability. At the opposite end of these variables, one finds unstable, poor and politically corrupt countries with a young population. This indicates high recent birth rates, low productivity and instability for ruling governments.
Because of the extremely negative position of the United States, the figure had to be made very large in the vertical direction to reduce clutter and make room for more labels on the other data points. The loadings in the figure above also seem more sensitive to the pull of extreme data points. As a first remark, China and the United States lie in the direction of nominal GDP, which is consistent with the two countries being the two largest economies in the world. Nominal GDP is more important in this figure than in the previous one, as other countries with very large economies, such as Germany, France, Japan and the United Kingdom, are also located in the direction of nominal GDP. Some poorer countries with small populations, such as Djibouti and Montenegro, lie in the opposite direction, but the effect at this extreme is less clear, with fewer and less deviating countries. GDP growth has a relatively high positive loading on both the third and the fourth principal component. The countries in this direction largely consist of developing countries with high percentage economic growth, such as Rwanda, Vietnam, Cambodia and Bangladesh. Countries at the other extreme, such as Argentina, Lebanon and Brazil, are indicated to be in recession or in a state of low economic growth. Another interesting finding is that income inequality and economic growth point in opposite directions, which indicates that high levels of contemporary economic progress occur in countries with a relatively equal distribution of income.
Variables such as median age, corruption perception and the Polity IV democracy index are not highly relevant here, as their loadings on the third and fourth principal components are very small. This can be seen from how large the loadings are in absolute value, that is, how far they deviate from the origin (x,y) = (0,0). In figure 2, the variables were more even in their importance, as most of them had a sizeable loading on at least the first or the second principal component. In figure 3 above, this is clearly not the case: some variables have very small loadings in absolute value, while others have very large loadings.
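The deviation from the origin can be quantified as the distance of each variable's loading pair from (0,0). The sketch below is written in Python for illustration, is not part of the R appendix, and uses only the reported loadings of the third and fourth components. It ranks the variables by this distance.

```python
import math

# (Comp.3, Comp.4) loadings as reported in the loadings table.
loadings = {
    "GDPGrowth": (0.56777, 0.4875),
    "PolityIV": (-0.00938, -0.0864),
    "NominalGDP": (0.29369, -0.7426),
    "PerCapitaGDP": (0.10920, -0.1702),
    "CorruptionPerception": (0.06079, -0.0640),
    "PoliticalStability": (-0.02478, 0.1127),
    "StateFragility": (0.17982, -0.1527),
    "WorkingAge": (-0.20866, 0.2427),
    "MedianAge": (0.02472, 0.0579),
    "IncomeInequality": (-0.24742, -0.2662),
    "Unemployment": (-0.66133, 0.0354),
}

# Euclidean distance of each loading pair from the origin (0, 0).
distance = {name: math.hypot(c3, c4) for name, (c3, c4) in loadings.items()}

# Variables ordered from least to most important in this plane.
ranked = sorted(distance, key=distance.get)
for name in ranked:
    print(f"{name:22s} {distance[name]:.3f}")
```

In this ranking, median age, the Polity IV index and corruption perception have the three shortest distances, while nominal GDP, GDP growth and unemployment have the three longest. This matches the reading of the figure above.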
Center for Systemic Peace. (2018). Polity5 Annual Time-Series, 1946-2018. https://www.systemicpeace.org/inscrdata.html [2021-07-20].
Center for Systemic Peace. (2020). PITF State Failure Problem Set, 1955-2018. http://www.systemicpeace.org/inscrdata.html [2021-04-01].
International Monetary Fund. (2020). GDP per capita, current prices. https://www.imf.org/external/datamapper/PPPPC@WEO/THA [2021-07-17].
Thulin, M. (2021). Modern Statistics with R: From wrangling and exploring data to inference and predictive modelling. http://www.modernstatisticswithr.com/
Transparency International. (2019). Corruption Perceptions Index. https://www.transparency.org/en/cpi/2019/index/press-and-downloads [2021-03-09].
World Bank. (2019). World Bank national accounts data, and OECD National Accounts data files. https://data.worldbank.org/indicator/NY.GDP.PCAP.KD.ZG [2021-07-19].
World Inequality Database. (2019). WID Metadata. https://wid.world/data/ [2021-03-06].