Flipped Test 1: Principal Component Analysis

Summary

  • Many differences between the countries can be explained by three components.
  • The first dimension shows the level of development of the countries. The least developed countries are the ones with high fertility and child mortality rates, while the most developed have higher life expectancy, GDP and income per individual.
  • The second component reflects a country’s international trade and covers imports and exports.
  • The third component, interestingly enough, contrasts countries with high inflation against countries with high health spending: the two variables load in opposite directions.

Data

For this project I am going to use data about 167 countries and some indicators associated with them.

Source of the dataset: Kaggle

Here’s the original description of the variables:

country - country name
child_mort - death of children under 5 years of age per 1000 live births
exports - exports of goods and services. Given as % of the Total GDP
health - total health spending as % of Total GDP
imports - imports of goods and services. Given as % of the Total GDP
income - net income per person
inflation - the measurement of the annual growth rate of the Total GDP
life_expec - the average number of years a new born child would live if the current mortality patterns are to remain the same
total_fer - the number of children that would be born to each woman if the current age-fertility rates remain the same
gdpp - GDP per capita. Calculated as the Total GDP divided by the total population.

The goal of this work is to run PCA on the data, reduce dimensionality and identify whether some variables can be explained by underlying factors.

Exploration of variables

First, let us use a correlation matrix to check whether the variables are correlated enough to be suitable for PCA.

library(dplyr)     # provides the %>% pipe
library(corrplot)  # provides corrplot()

countries[3:11] %>% 
  cor() %>% 
  corrplot()

The matrix shows that some of the variables are strongly correlated, for example GDP per capita and income, or total fertility and child mortality. Other variables are negatively associated with each other, like life expectancy and child mortality. Correlated variables share variance that a few components can summarise, so this data is suitable for PCA.
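To read the strongest relationships off a correlation matrix without relying on the plot, the pairs can be listed and sorted by absolute correlation. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in for `countries`, and the selected columns are an arbitrary choice.

```r
# Stand-in data (assumption): mtcars replaces `countries`, whose CSV is not loaded here.
vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

cor_mat <- cor(vars)

# Keep each pair once by blanking the diagonal and the lower triangle.
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Reshape the matrix into one row per variable pair and drop the blanked cells.
pairs <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
pairs <- pairs[!is.na(pairs$Freq), ]
names(pairs) <- c("var1", "var2", "r")

# Sort by absolute correlation, strongest first.
pairs <- pairs[order(-abs(pairs$r)), ]
print(head(pairs, 3))
```

The same steps should work on `countries[3:11]` in place of `vars`.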

As a next step we will look at the variables’ statistics.

library(psych)  # provides describe()
desc <- describe(countries[3:11])
desc[10:12]  # range, skew and kurtosis columns
##                range  skew kurtosis
## child_mort    205.40  1.42     1.62
## exports       199.89  2.40     9.65
## health         16.09  0.69     0.59
## imports       173.93  1.87     6.41
## income     124391.00  2.19     6.67
## inflation     108.21  5.06    39.95
## life_expec     50.70 -0.95     1.03
## total_fer       6.34  0.95    -0.25
## gdpp       104769.00  2.18     5.23

As can be seen from the table above, the ranges of the variables differ by orders of magnitude, so the variables need to be scaled. Some of them are also clearly not normally distributed (exports and inflation, for example), but normality matters less for PCA, so I am going to leave them untransformed.

countries_scaled <- countries[3:11] %>%
  scale() %>% 
  as.data.frame()
countries_scaled <- cbind(countries[1], countries_scaled)
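As a sanity check on the scaling step, one can confirm that every scaled column has mean 0 and standard deviation 1, and that scaling by hand gives the same component variances as letting `prcomp()` scale internally. This is a minimal sketch on the built-in USArrests data, used here as a stand-in for `countries`.

```r
# Stand-in data (assumption): USArrests replaces the numeric columns of `countries`.
vars <- USArrests

scaled <- as.data.frame(scale(vars))

# After scale(), each column should have mean 0 and standard deviation 1.
col_means <- colMeans(scaled)
col_sds <- apply(scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)

# Scaling first, or asking prcomp() to scale, should give the same result.
pca_manual <- prcomp(scaled)
pca_builtin <- prcomp(vars, scale. = TRUE)
print(all.equal(pca_manual$sdev, pca_builtin$sdev))
```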

PCA

Components

Now we will run the PCA and print the scree plot to show how much of the variance each component explains.

library(factoextra)  # provides fviz_eig() and fviz_pca_biplot()
pca_scores <- prcomp(countries_scaled[2:10])
fviz_eig(pca_scores, col.var="blue", addlabels = TRUE)

The first component explains 46% of the variance, and the first two components together explain 63%, which is not ideal but workable.
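The percentages on a scree plot can also be computed directly from a `prcomp` object, since the variance of each component is its squared standard deviation. A minimal sketch, again on the built-in USArrests data as a stand-in for the country data:

```r
# Stand-in data (assumption): USArrests replaces the scaled country indicators.
pca <- prcomp(USArrests, scale. = TRUE)

# Component variances (eigenvalues of the correlation matrix).
eigenvalues <- pca$sdev^2

# Share of total variance per component, and the running total.
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

print(data.frame(
  component  = paste0("PC", seq_along(prop_var)),
  percent    = round(100 * prop_var, 1),
  cumulative = round(100 * cum_var, 1)
))
```

Applied to `pca_scores`, the `cumulative` column would give the combined share of the first two components.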

Additionally, I’ll run parallel analysis and see how many components it suggests retaining:

x <- fa.parallel(countries_scaled[2:10], fm="pa", fa="pc", n.iter=1)

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2
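The idea behind parallel analysis can be sketched in base R: compare the observed eigenvalues with the eigenvalues obtained from random data of the same shape, and keep the leading components that beat the random baseline. This is a simplified illustration, not a reimplementation of `fa.parallel()`; the stand-in data (mtcars), the number of simulations and the use of the mean as the threshold are all assumptions made for the example.

```r
set.seed(1)

# Stand-in data (assumption): six mtcars columns replace the country indicators.
vars <- scale(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
n <- nrow(vars)
p <- ncol(vars)

# Eigenvalues of the observed correlation matrix.
observed <- eigen(cor(vars))$values

# Eigenvalues of correlation matrices from pure-noise data of the same size.
n_iter <- 200
simulated <- replicate(n_iter, {
  noise <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eigen(cor(noise))$values
})

# Baseline: mean simulated eigenvalue at each rank.
threshold <- rowMeans(simulated)

# Retain the leading components whose eigenvalue exceeds the baseline.
above <- observed > threshold
n_keep <- if (all(above)) p else which(!above)[1] - 1
print(n_keep)
```

With only one simulated data set the baseline is noisy, which is why a larger number of iterations is used in this sketch.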

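The fa.parallel output shown here suggests retaining 2 components. With n.iter=1 the comparison rests on a single simulated data set, so the suggestion can change between runs; I’ll explore the third component too. For now, let’s examine which variables contribute the most to the components.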

pca_scores$rotation[,1:3]
##                   PC1          PC2         PC3
## child_mort -0.4195194 -0.192883937  0.02954353
## exports     0.2838970 -0.613163494 -0.14476069
## health      0.1508378  0.243086779  0.59663237
## imports     0.1614824 -0.671820644  0.29992674
## income      0.3984411 -0.022535530 -0.30154750
## inflation  -0.1931729  0.008404473 -0.64251951
## life_expec  0.4258394  0.222706743 -0.11391854
## total_fer  -0.4037290 -0.155233106 -0.01954925
## gdpp        0.3926448  0.046022396 -0.12297749

PC1:
Life expectancy, total fertility, child mortality, income and GDP per capita have the largest loadings on this component.

PC2:
Exports and imports are associated with this component.

PC3:
Health and inflation contribute to this component.
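One common way to quantify these contributions is to square the loadings: each column of the rotation matrix has unit length, so the squared loadings of a component sum to 1 and can be read as percentages. The sketch below applies this to the PC3 loadings printed above; treating squared loadings as "contribution" is a convention chosen for the illustration.

```r
# PC3 loadings copied from the rotation matrix printed above.
pc3 <- c(
  child_mort =  0.02954353, exports   = -0.14476069, health     =  0.59663237,
  imports    =  0.29992674, income    = -0.30154750, inflation  = -0.64251951,
  life_expec = -0.11391854, total_fer = -0.01954925, gdpp       = -0.12297749
)

# Percentage contribution of each variable to the component.
contrib <- 100 * pc3^2 / sum(pc3^2)
ranked <- sort(contrib, decreasing = TRUE)
print(round(ranked, 1))
```

The two largest entries are inflation and health, with opposite signs in `pc3`.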

Visualization

Let’s plot the first two components:

fviz_pca_biplot(pca_scores, col.var="contrib", alpha.ind = 0.05, title = "Countries PCA", label = c("ind.sup", "quali", "var", "quanti.sup"), repel = TRUE)+
 scale_color_gradient2(low="#9CBAD5", mid="blue",
           high="red", midpoint=10) +
 theme_minimal()

The first component appears to range from countries with high fertility and high child mortality (less developed countries, I would assume) to more developed countries with higher life expectancy, GDP per capita and income.

The second component is orthogonal to the first one and captures a country’s trade: imports and exports, which both point the same way.
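Principal components are mutually orthogonal by construction, which is why each one can be read as a separate dimension. A small check of this property, sketched on the built-in USArrests data as a stand-in for the country data:

```r
# Stand-in data (assumption): USArrests replaces the scaled country indicators.
pca <- prcomp(USArrests, scale. = TRUE)

# Loadings: t(R) %*% R should be the identity matrix (orthonormal columns).
gram <- crossprod(pca$rotation)
print(round(gram, 10))

# Scores: the components should be uncorrelated with each other.
score_cor <- cor(pca$x)
print(round(score_cor, 10))
```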

Inflation and health don’t contribute much to those components, so let’s look at the first and third dimensions:

fviz_pca_biplot(pca_scores, col.var="contrib", axes = c(1,3), alpha.ind = 0.05, title = "Countries PCA", label = c("ind.sup", "quali", "var", "quanti.sup"), repel = TRUE)+
 scale_color_gradient2(low="#9CBAD5", mid="blue",
           high="red", midpoint=10) +
 theme_minimal()

As suspected, this component is dominated by health and inflation, which point in opposite directions. The analysis suggests that countries with high health spending and countries with high inflation sit at opposite ends of this dimension. This might reflect a country’s overall policy orientation and how it decides to spend its budget: I suppose that countries with high levels of inflation decide not to spend much of their budget on healthcare.