Flipped Test 1: Principal Component Analysis
Summary
- Many differences between the countries can be explained by three
components.
- The first dimension shows the level of development of the countries.
The least developed countries are the ones with high fertility and child
mortality rates, while the most developed have higher life expectancy,
GDP and income per individual.
- The second component reflects the country’s international trade
and covers imports and exports.
- The third component, interestingly enough, shows that countries with high inflation and countries with high health spending sit at opposite ends.
Data
For this project I am going to use data about 167 countries and some indicators associated with them.
Source of the dataset: Kaggle
Here’s the original description of the variables:
country - country name
child_mort - death of children under 5 years of age per 1000 live births
exports - exports of goods and services. Given as % of the Total GDP
health - total health spending as % of Total GDP
imports - imports of goods and services. Given as % of the Total GDP
income - net income per person
inflation - the measurement of the annual growth rate of the Total GDP
life_expec - the average number of years a new born child would live if the current mortality patterns are to remain the same
total_fer - the number of children that would be born to each woman if the current age-fertility rates remain the same
gdpp - GDP per capita. Calculated as the Total GDP divided by the total population.
The goal of this work is to run PCA on the data, reduce dimensionality and identify whether some variables can be explained by underlying factors.
Exploration of variables
First, let us use a correlation matrix to check whether the variables are correlated and therefore suitable for PCA.
library(magrittr)  # provides %>% (assumed setup, not shown in the original)
library(corrplot)  # provides corrplot()
countries[3:11] %>%
  cor() %>%
  corrplot()
The matrix shows that some of the variables are strongly correlated, for example GDP per capita and income, or total fertility and child mortality. Other variables are negatively associated with each other, like life expectancy and child mortality. Correlations of this kind make the data suitable for PCA.
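To read the strongest relationships off a correlation matrix without a plot, the pairs can be listed and sorted. This is only a sketch in base R: the `toy` data frame and its values are made up as a stand-in for the indicator columns, and `cor_pairs` is a hypothetical name.

```r
# Toy stand-in for the numeric indicator columns (all values are simulated).
set.seed(1)
n <- 50
development <- rnorm(n)
toy <- data.frame(
  child_mort = -development + rnorm(n, sd = 0.3),
  life_expec =  development + rnorm(n, sd = 0.3),
  income     =  development + rnorm(n, sd = 0.3),
  health     =  rnorm(n)
)

cors <- cor(toy)

# Keep each pair once by blanking the diagonal and the lower triangle.
cors[lower.tri(cors, diag = TRUE)] <- NA
cor_pairs <- as.data.frame(as.table(cors), stringsAsFactors = FALSE)
names(cor_pairs) <- c("var1", "var2", "r")
cor_pairs <- cor_pairs[!is.na(cor_pairs$r), ]

# Sort by absolute correlation, strongest first.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)
```

On the real data, `toy` would be replaced by the numeric columns of the dataset.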
As a next step we will look at the variables’ statistics.
library(psych)  # provides describe() (assumed setup, not shown in the original)
desc <- describe(countries[3:11])
desc[10:12]
##                range  skew kurtosis
## child_mort    205.40  1.42     1.62
## exports       199.89  2.40     9.65
## health         16.09  0.69     0.59
## imports       173.93  1.87     6.41
## income     124391.00  2.19     6.67
## inflation     108.21  5.06    39.95
## life_expec     50.70 -0.95     1.03
## total_fer       6.34  0.95    -0.25
## gdpp       104769.00  2.18     5.23
As can be seen from the table above, the ranges of the variables differ greatly, so they require scaling. Some of the variables are clearly not normally distributed (like exports or inflation), but PCA does not strictly require normality, so I am not going to transform them.
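To illustrate why scaling matters, here is a small base R sketch with two made-up variables on very different scales (the `toy` data are simulated, not taken from the dataset). Without scaling, the first component is almost entirely the large-unit variable; after `scale()` both variables are on equal footing.

```r
# Simulated stand-in: one variable in large units, one in percent.
set.seed(2)
toy <- data.frame(
  income = rnorm(100, mean = 17000, sd = 5000),
  health = rnorm(100, mean = 7, sd = 3)
)

# Without scaling, the variable with the largest variance dominates PC1.
raw_pca <- prcomp(toy)
raw_share <- raw_pca$rotation[, "PC1"]^2

# After scale(), every column has mean 0 and standard deviation 1.
toy_scaled <- as.data.frame(scale(toy))
scaled_pca <- prcomp(toy_scaled)
scaled_share <- scaled_pca$rotation[, "PC1"]^2

# Share of each variable in PC1 (squared loadings sum to 1 per component).
round(rbind(raw = raw_share, scaled = scaled_share), 3)
```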
countries_scaled <- countries[3:11] %>%
  scale() %>%
  as.data.frame()
countries_scaled <- cbind(countries[1], countries_scaled)
PCA
Components
Now we will run the PCA and print the scree plot to show how much of the variance each component explains.
library(factoextra)  # provides fviz_eig() (assumed setup, not shown in the original)
pca_scores <- prcomp(countries_scaled[2:10])
fviz_eig(pca_scores, col.var="blue", addlabels = TRUE)
The first component is quite good and explains 46% of the variance. Together with the second component it explains 63%, which is not ideal, but it works.
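The percentages on a scree plot can also be computed directly from a `prcomp` result, since each component's variance is its squared standard deviation. A minimal sketch, using the built-in `USArrests` data as a stand-in for the country indicators:

```r
# Built-in data as a stand-in; scale. = TRUE standardizes the columns first.
pca <- prcomp(USArrests, scale. = TRUE)

# The variance of each component is the square of its standard deviation.
eigenvalues <- pca$sdev^2
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

# Proportion and cumulative proportion of variance, in percent.
data.frame(
  component  = paste0("PC", seq_along(prop_var)),
  percent    = round(100 * prop_var, 1),
  cumulative = round(100 * cum_var, 1)
)
```

The `cumulative` column is what tells you how much is retained when keeping the first two or three components.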
Additionally, I’ll run parallel analysis and see how many components it suggests keeping:
x <- fa.parallel(countries_scaled[2:10], fm="pa", fa="pc", n.iter=1)
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
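To make the idea behind parallel analysis concrete, here is a simplified base R sketch: the eigenvalues of the observed correlation matrix are compared with the average eigenvalues of pure-noise data of the same shape, and a component is kept only while it beats the noise. This is an illustration of the principle, not a reimplementation of `fa.parallel` (which differs in details such as resampling and the reference quantile). `USArrests` is a stand-in dataset and `n_sims`, `reference` and `n_keep` are hypothetical names.

```r
set.seed(42)
dat <- scale(USArrests)  # built-in stand-in data
n <- nrow(dat)
p <- ncol(dat)

# Eigenvalues of the observed correlation matrix.
observed <- eigen(cor(dat), symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues of correlation matrices of random normal data, same n and p.
n_sims <- 200
simulated <- replicate(n_sims, {
  noise <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eigen(cor(noise), symmetric = TRUE, only.values = TRUE)$values
})
reference <- rowMeans(simulated)  # one reference value per component

# Keep the leading components whose eigenvalue exceeds the noise reference.
above <- observed > reference
n_keep <- if (all(above)) p else which(!above)[1] - 1

round(rbind(observed, reference), 2)
n_keep
```

Using many simulations rather than one makes the reference eigenvalues, and therefore the suggested number of components, more stable.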
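The fa.parallel run above suggests 2 components. Because it uses a single iteration (n.iter=1), the suggestion can change between runs, and I will keep three components and explore the third one too. For now, let’s examine which variables contribute the most to the components.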
pca_scores$rotation[,1:3]
##                   PC1          PC2         PC3
## child_mort -0.4195194 -0.192883937  0.02954353
## exports     0.2838970 -0.613163494 -0.14476069
## health      0.1508378  0.243086779  0.59663237
## imports     0.1614824 -0.671820644  0.29992674
## income      0.3984411 -0.022535530 -0.30154750
## inflation  -0.1931729  0.008404473 -0.64251951
## life_expec  0.4258394  0.222706743 -0.11391854
## total_fer  -0.4037290 -0.155233106 -0.01954925
## gdpp        0.3926448  0.046022396 -0.12297749
PC1:
Life expectancy, total fertility, child mortality, income and GDP per capita have
the largest loadings on this component.
PC2:
Exports and imports are associated with this component.
PC3:
Health and inflation contribute to this component.
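The reading of the three components above can be double-checked numerically. Each loading vector has unit length, so a squared loading times 100 can be read as the percentage a variable contributes to a component. The sketch below retypes the loadings printed above into a matrix; `contrib` and `top_vars` are hypothetical helper names.

```r
# Loadings copied from the printed pca_scores$rotation[, 1:3] output.
loadings <- matrix(
  c(-0.4195194, -0.192883937,  0.02954353,
     0.2838970, -0.613163494, -0.14476069,
     0.1508378,  0.243086779,  0.59663237,
     0.1614824, -0.671820644,  0.29992674,
     0.3984411, -0.022535530, -0.30154750,
    -0.1931729,  0.008404473, -0.64251951,
     0.4258394,  0.222706743, -0.11391854,
    -0.4037290, -0.155233106, -0.01954925,
     0.3926448,  0.046022396, -0.12297749),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("child_mort", "exports", "health", "imports", "income",
      "inflation", "life_expec", "total_fer", "gdpp"),
    c("PC1", "PC2", "PC3")
  )
)

# Percentage contribution of each variable to each component.
contrib <- 100 * loadings^2

# Names of the k variables contributing most to a component.
top_vars <- function(component, k) {
  names(sort(contrib[, component], decreasing = TRUE))[seq_len(k)]
}

top_vars("PC1", 5)
top_vars("PC2", 2)
top_vars("PC3", 2)
round(contrib, 1)
```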
Visualization
Let’s plot the first two components:
fviz_pca_biplot(pca_scores, col.var="contrib", alpha.ind = 0.05, title = "Countries PCA", label = c("ind.sup", "quali", "var", "quanti.sup"), repel = TRUE) +
  scale_color_gradient2(low="#9CBAD5", mid="blue",
                        high="red", midpoint=10) +
  theme_minimal()
The first component appears to order countries from those with high fertility and high child mortality (less developed countries, I would assume) to more developed countries with higher life expectancy, GDP per capita and income per person.
The second component is orthogonal to the first one and captures the country’s trade indicators, imports and exports, which both point in the same direction.
Inflation and health don’t contribute much to those components, so let’s look at the first and third dimensions:
fviz_pca_biplot(pca_scores, col.var="contrib", axes = c(1,3), alpha.ind = 0.05, title = "Countries PCA", label = c("ind.sup", "quali", "var", "quanti.sup"), repel = TRUE) +
  scale_color_gradient2(low="#9CBAD5", mid="blue",
                        high="red", midpoint=10) +
  theme_minimal()
As suspected, the third component is driven by health and inflation, which point in opposite directions. This analysis suggests that countries with high health spending and countries with high inflation sit at opposite ends of this component. This might reflect a country’s overall policy orientation and how it decides to spend its budget: I suppose that countries with high levels of inflation decide not to spend much of their budget on healthcare.