The index is constructed using a combination of techniques. Step by step, the process was as follows:
This dataset was delivered to me by email as two files: the R script session that loads, constructs, and explores it, and a CSV file with the researcher-selected countries and covariates from which it is constructed.
This section documents how I load the dataset, document it, and commit it to the OasW pin board.
The data is provided in order, without a country column. The researcher assigned the countries by hand using a character vector. I simply load the file he gave me and assign the countries as he assigned them.
load("../OasW/data/oas_index.rda")
#Countries in the exact row order of the data, as assigned by the researcher
ctrys <- c("AGO","ARG","BOL","BRA","BGR","CHN","COL","CRI","HRV","CZE","DOM","ECU","SLV","EST","HND","HUN","LVA","LTU","MNE",
           "NGA","MKD","PAN","PRY","PER","POL","ROU","SRB","SVK","SVN","TUR","URY","AUT","BEL","CYP","FRA","DEU","GRC",
           "IRL","ITA","MLT","NLD","PRT","ESP","GBR","USA","SWE","DNK","CHE","NOR","FIN","CAN","CHL","MEX")
oas_index$iso3c <- ctrys
oas_index$idx <- seq_along(ctrys)
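Because this assignment relies entirely on row order, a quick sanity check is worth running; a minimal sketch:
#The hand-assigned country vector must match the number of rows
stopifnot(length(ctrys) == nrow(oas_index))
#And no ISO3 code should appear twice
stopifnot(anyDuplicated(ctrys) == 0)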
For convenience and practice, this dataset is included as a data source in the OasW package and can be loaded with data("oas_index",package = 'OasW') if you have it installed.
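For readers with the package installed, a minimal sketch of that route:
#Load the packaged copy of the dataset instead of the raw .rda file
data("oas_index", package = "OasW")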
This is a dataset with a large number of variables relative to observations (i.e., countries). This kind of setup is a prime candidate for dimensionality reduction; moreover, since we are exploring a way to order or rank countries according to how they fare on the selected measures, PCA is an obvious choice.
For that, the variables must first be identified as either predictors or identifiers.
The R tidymodels suite is perhaps best suited to express what we are doing:
library(tidymodels)

set.seed(123)
pca_recipe <- recipes::recipe(
  ~ .,
  data = oas_index
) %>%
  #Identify the idx and iso3c variables as ids, not to be operated on
  update_role(c(idx, iso3c), new_role = "id") %>%
  #PCA fares best with normalized predictors
  step_normalize(all_numeric_predictors()) %>%
  #We do the PCA
  step_pca(all_numeric_predictors())
This is only the recipe. To perform the steps described in it, we must use the prep function:
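#prep() estimates every step defined in the recipe
pca_estimates <- prep(pca_recipe)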
And extract the results to work with:
#The PCA loadings themselves (step 2 of the recipe)
pca_data <- tidy(pca_estimates, 2)
#Further calculations to extract the variance explained by each component
sdev_components <- pca_estimates$steps[[2]]$res$sdev
percent_var_exp <- sdev_components^2 / sum(sdev_components^2)
percent_var_exp_cum <- cumsum(percent_var_exp)
varexp_ds <- tibble(
  component = unique(pca_data$component),
  percent_var_exp,
  percent_var_exp_cum
)
And we can see how much variance these components explain:
varexp_ds %>%
  mutate(
    component = forcats::fct_inorder(component)
  ) %>%
  ggplot(aes(component, percent_var_exp)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent_format(),
                     breaks = seq(0, 1, by = 0.1)) +
  labs(x = NULL, y = "Percent of variance explained by Principal Component") +
  theme(axis.text.x = element_text(angle = -30, vjust = 1, hjust = 0))
So we can see that the first two components explain most of the variance, but the third also accounts for a significant proportion.
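To check the exact figures behind the chart, the varexp_ds table can be formatted and printed; a minimal sketch:
varexp_ds %>%
  mutate(
    #Format both proportions as readable percentages
    percent_var_exp = scales::percent(percent_var_exp, accuracy = 0.1),
    percent_var_exp_cum = scales::percent(percent_var_exp_cum, accuracy = 0.1)
  )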
As an exploratory tool, PCA can give us information, per variable, on what explains most of the variance in each component. Here the researcher has informed us that the first two components are the most relevant, so we can proceed with our analysis using only those two.
library(tidytext)
pca_data %>%
  filter(component %in% c("PC1", "PC2", "PC3")) %>%
  inner_join(varexp_ds, by = "component") %>%
  mutate(
    component = glue::glue("{component} - {scales::percent(percent_var_exp)}"),
    terms = reorder_within(terms, abs(value), component)
  ) %>%
  ggplot(aes(abs(value), terms, fill = value > 0)) +
  geom_col() +
  geom_label(aes(label = scales::number(abs(value), accuracy = 0.01))) +
  facet_wrap(~component, scales = "free_y") +
  scale_y_reordered() +
  scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
  labs(
    x = "Absolute value of contribution",
    y = NULL, fill = "Positive?"
  )
So for the first component, which explains close to 60% of the variance, all variables seem to be of almost similar importance, followed by the second and third with similar and much smaller contributions to variance, though that does not necessarily translate to less explanatory power or importance.
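The same loadings can also be inspected as a table; a minimal sketch listing the five largest absolute contributions per component, using the pca_data object from above:
pca_data %>%
  filter(component %in% c("PC1", "PC2", "PC3")) %>%
  group_by(component) %>%
  #Keep the five variables with the largest absolute loading per component
  slice_max(abs(value), n = 5) %>%
  ungroup() %>%
  arrange(component, desc(abs(value)))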
In PCA we are interested in the ordering of countries and where they fall along each component. This is much easier to appreciate by plotting the countries and seeing where they lie relative to the components.
plot_components <- function(comp1, comp2) {
  ren_components <- list(
    "PC1" = "Core Competitiveness",
    "PC2" = "Rights and Government Effectiveness",
    "PC3" = "Industrial Infrastructure"
  )
  data <- juice(pca_estimates)
  data %>%
    #Standardize the component scores
    mutate(across(matches("PC"), ~ (.x - mean(.x)) / sd(.x))) %>%
    mutate(country = countrycode::countrycode(iso3c, "iso3c", "country.name"),
           is_oas = iso3c %in% roascountries::get_oas_countries()$iso3c) %>%
    select(country, is_oas, PC1:PC3) %>%
    ggplot(aes(y = .data[[comp1]], x = .data[[comp2]],
               label = country)) +
    geom_point(aes(color = is_oas)) +
    ggrepel::geom_text_repel(family = "IBMPlexSans",
                             aes(color = is_oas),
                             max.overlaps = 20) +
    theme(legend.position = "bottom") +
    labs(y = ren_components[[comp1]], x = ren_components[[comp2]])
}
Here is the plot of the first and second PCs:
plot_components("PC1","PC2")
#> Warning: Ignoring unknown parameters: check_overlap
plot_components("PC1","PC3")
#> Warning: Ignoring unknown parameters: check_overlap
It is immediately evident that PC2 and PC3 more or less measure the same relationships.
PC1 is very much dominant across almost all variables, so it makes sense that, with the caveat that it explains only 72% of the variance, it is a good indicator (a good linear combination) of all the selected variables.
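Since PC1 is the strongest candidate for the index itself, the country ranking it implies can be read straight off the prepped recipe. A minimal sketch (note that the sign of a principal component is arbitrary, so the direction of the ordering should be verified against a country whose standing is known):
#Rank countries by their score on the first principal component
juice(pca_estimates) %>%
  select(iso3c, PC1) %>%
  arrange(desc(PC1)) %>%
  mutate(rank = row_number())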