The Index as constructed

The index is constructed using diverse techniques. Step by step, the process was:

  1. A main dataset comprising a number of candidate covariates was explored, its coverage determined by researchers and the variables selected by domain knowledge and regresion techniques.
  2. A dataset emerged from this process. This dataset is the OAS Atlas Index of Inclusive Growth or, within our programming environment, oas_index

The oas_index dataset

Integration, documentation and location

This dataset was delivered to me by email with two files. The R script session that loads, construcs and explores it, and a CSV file with the researcher selected countries and covariates from where it is constructed.

This section documents my loading of the dataset, its documentation and commitment to the OasW pin board.

The data is provided in order, without a country column. The researcher assigned the countries by hand using a character vector. I just load the file he gave me, and assign the countries as he assigned them.


 load("../OasW/data/oas_index.rda")

ctrys <-c("AGO","ARG","BOL","BRA","BGR","CHN", "COL", "CRI","HRV","CZE","DOM","ECU","SLV","EST","HND","HUN","LVA","LTU","MNE",
          "NGA","MKD","PAN","PRY","PER","POL","ROU","SRB","SVK","SVN","TUR","URY","AUT","BEL","CYP","FRA","DEU","GRC",
          "IRL","ITA","MLT","NLD","PRT","ESP","GBR","USA","SWE","DNK","CHE","NOR","FIN","CAN","CHL","MEX")
oas_index$iso3c <- ctrys
oas_index$idx <- seq_along(ctrys)

For convenience and practice, this dataset is included as a data source in the OasW package and can be loaded with data("oas_index",package = 'OasW') if you have it installed.

Analysis of the oas_index dataset

This is a dataset with a large number of variables relative to observations (e.g. countries). This kind of setup is prime for dimensionality reduction but also, as we are exploring a way to order or rank countries according to how they fare in the selected measures, PCA is an obvious solution.

For that, variables should be identified as predictors versus identifiers.

The R tidymodels suite is perhaps best suited to express what we are doing:

set.seed(123)
pca_recipe <- recipes::recipe(
~.,
data = oas_index
) %>%
  #Identify the idx and iso3c variables as ids, not to be operated on
update_role(c(idx,iso3c),new_role = 'id') %>%
  #PCA fares best with normalized predictors'
  step_normalize(all_numeric_predictors()) %>% 
  #We do the PCA
step_pca(all_numeric_predictors())

This is only the recipe. To perform the steps described in it, we must use the prep function:

And extract the data to operate on:

#The PCA data itself
pca_data <- tidy(pca_estimates,2)

#Further calculations to extract variance explained by component

sdev_components <- pca_estimates$steps[[2]]$res$sdev
percent_var_exp <- sdev_components^2 / sum(sdev_components^2)
percent_var_exp_cum <- cumsum(percent_var_exp) 


varexp_ds <- tibble(
  component=unique(pca_data$component),
  percent_var_exp,
  percent_var_exp_cum=cumsum(percent_var_exp)
)

And we can see how much variance this components explain:

varexp_ds %>% 
  mutate(
    component=fct_inorder(component)
  ) %>% 
  ggplot(aes(component,percent_var_exp))+
  geom_col()+
  scale_y_continuous(labels=scales::percent_format(),
                     breaks=seq(0,1,by=0.1))+
  labs(x=NULL,y="Percent of variance explained by Principal Component")+
  theme(axis.text.x = element_text(angle = -30, vjust = 1, hjust = 0))

So we can see the first two components explain most of the variance. But also the third one shares some signifficant proportion.

How are they composed?

As an exploratory tool, PCA can give us information per variable as to what explain most of the variance in each component. Here the researcher has informed us that the first two components are the most relevant and we can proceed with our analysis using only those two.

library(tidytext)
pca_data %>% 
    filter(component %in% c("PC1","PC2","PC3")) %>% 
    inner_join(varexp_ds) %>% 
    mutate(
        component=glue::glue("{component} - {scales::percent(percent_var_exp)}"),
        terms = reorder_within(terms, abs(value), component)
    ) %>%
  ggplot(aes(abs(value), terms, fill = value > 0)) +
  geom_col() +
  geom_label(aes(y=terms,x=abs(value),label=scales::number(abs(value),accuracy = 0.01)))+
  facet_wrap(~component, scales = "free_y") +
  scale_y_reordered() +
  scale_x_continuous(breaks=seq(0,1,by=0.1))+
  labs(
    x = "Absolute value of contribution",
    y = NULL, fill = "Positive?"
  )
#> Joining, by = "component"

So for the first component, explaining close to 60% variance, all variables seem almost of similar importance. Folowed by the second and third with similar and much smaller contribution to variance, although not necessarily explanatory power or importance.

In PCA we are interested in the ordering of countries and their distance to each component. This is much more powerful to appreaciate by plotting the countries and see where they lie relative to the components.

How do countries get ordered in this components?


plot_components <- function(comp1,comp2){
  ren_components <- list(
    "PC1"="Core Competitiveness",
    "PC2"="Rights and Government Effectiveness",
    "PC3"="Industrial Infrastructure"
)

data <- juice(pca_estimates) 
data%>% 
    mutate(across(matches("PC"),~((.x-mean(.x))/sd(.x)))) %>% 
    mutate(country=countrycode(iso3c,"iso3c","country.name"),
           is_oas= iso3c %in% roascountries::get_oas_countries()$iso3c ) %>%
    select(country,is_oas,1:PC3) %>% 
    ggplot(aes(y=.data[[comp1]],x=.data[[comp2]],
               label=country)) +
    geom_point(aes(color=is_oas)) +
    ggrepel::geom_text_repel(family="IBMPlexSans",
                             check_overlap = TRUE,
                             aes(color=is_oas),
                             max.overlaps = 20) +
    theme(legend.position = 'bottom')+
  labs(y=ren_components[[comp1]],x=ren_components[[comp2]])
}

PC1 vs PC2

Here is the plot of the first and second PCÑ

plot_components("PC1","PC2")
#> Warning: Ignoring unknown parameters: check_overlap

PC1 vs PC3

plot_components("PC1","PC3")
#> Warning: Ignoring unknown parameters: check_overlap

Analysis

It is immediatly evident that PC2 and PC3 more or less measure the same relationships.

The PC1 is very much dominant in almost all variables, so it makes sense that, with the caveat of it explaining only 72% of variance, is a good indicator (a good linear combination) of all the selected variables.