Let’s start by loading packages and reading the data:
library(tidyverse)
tech <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-19/technology.csv')
head(tech)
## variable label iso3c year group
## 1 BCG % children who received a BCG immunization AFG 1982 Consumption
## 2 BCG % children who received a BCG immunization AFG 1983 Consumption
## 3 BCG % children who received a BCG immunization AFG 1984 Consumption
## 4 BCG % children who received a BCG immunization AFG 1985 Consumption
## 5 BCG % children who received a BCG immunization AFG 1986 Consumption
## 6 BCG % children who received a BCG immunization AFG 1987 Consumption
## category value
## 1 Vaccines 10
## 2 Vaccines 10
## 3 Vaccines 11
## 4 Vaccines 17
## 5 Vaccines 18
## 6 Vaccines 27
names(tech)
## [1] "variable" "label" "iso3c" "year" "group" "category" "value"
Well this is a bit of a mess! We have 195 variables binned into one column, their values being on different scales. One potential strategy to addressing these issues would be to standardize them, but another potential approach is to expand the data out into Tidy format (one observation per row, one feature per column) and conduct a data reduction technique like PCA. That sounds like a bit more fun, so let’s do that after splitting the data into production and consumption and subsetting down to recent years:
tech %>%
filter(group == "Production") %>%
pivot_wider(id_cols = c("iso3c","year"),
names_from = "variable",
values_from = "value") %>%
filter(year >= 2000) -> tidy_production
tech %>%
filter(group == "Consumption") %>%
pivot_wider(id_cols = c("iso3c","year"),
names_from = "variable",
values_from = "value") %>%
filter(year >= 2000) -> tidy_consumption
There is a lot of missingness on some of these variables. Let’s figure out the extent of it and eliminate them.
tidy_consumption %>%
apply(.,2,is.na) %>%
apply(.,2,sum) %>%
sort() -> consumption_missing
tidy_production %>%
apply(.,2,is.na) %>%
apply(.,2,sum) %>%
sort() -> production_missing
sapply(1:length(consumption_missing),
function(i){
nrow(na.omit(tidy_consumption[,names(consumption_missing[1:i])]))
}) -> cons_data_remaining
sapply(1:length(production_missing),
function(i){
nrow(na.omit(tidy_production[,names(production_missing[1:i])]))
}) -> prod_data_remaining
cons_data_remaining
## [1] 4389 4389 4218 3685 3685 3685 3685 3159 3035 2515 2120 1651 1341 920 819
## [16] 528 442 254 69 0 0 0 0 0 0 0 0 0 0 0
## [31] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [46] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [61] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [76] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [91] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0
prod_data_remaining
## [1] 4720 4720 4248 4248 4248 4248 4248 4228 4228 4228 4228 2975 2460 2401 2382
## [16] 2360 2261 2254 1821 1501 746 602 341 250 176 176 138 117 65 57
## [31] 51 35 32 31 28 26 25 13 13 13 0 0 0 0 0
## [46] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [61] 0 0 0 0 0 0 0 0
Let’s just grab the first seven substantive variables of consumption and ten production for exposition, leaving us samples of around 3000 complete observations each.
tidy_consumption %>%
dplyr::select(all_of(names(consumption_missing[1:9]))) %>%
na.omit() -> tidy_cons_sub
tidy_production %>%
dplyr::select(all_of(names(production_missing[1:12]))) %>%
na.omit() -> tidy_prod_sub
names(tidy_cons_sub)
## [1] "iso3c" "year" "telephone_canning_wdi"
## [4] "DPT" "MCV1" "Pol3"
## [7] "pctimmunizmeas" "cell_subsc" "internetuser"
names(tidy_prod_sub)
## [1] "iso3c" "year" "elec_hydro" "elec_nuc"
## [5] "elec_renew_other" "elec_solar" "elec_wind" "elecprod"
## [9] "elec_coal" "elec_gas" "elec_oil" "aviation_pass"
Our consumption metric looks like it will be dominated by vaccination rates and communication technologies whereas our production metric will be dominated by electrical production. This is fine for our purposes, since we don’t want to deal with the missing data, but would have to be dealt with for a more rigorous analysis. Now let’s conduct PCA on each of the data frames:
pca_hand <- function(data){
X <- t(scale(data))
A <- X %*% t(X)
E <- eigen(A)
P <- t(E$vectors)
new <- t(P %*% X)
new
}
tidy_prod_sub %>%
dplyr::select(-iso3c,-year) %>%
pca_hand() %>%
as.data.frame() -> production_pca
colnames(production_pca) <- paste0("C",1:ncol(production_pca))
tidy_cons_sub %>%
dplyr::select(-iso3c,-year) %>%
pca_hand() %>%
as.data.frame() -> consumption_pca
colnames(consumption_pca) <- paste0("C",1:ncol(consumption_pca))
tidy_cons_sub <- data.frame(tidy_cons_sub, consumption_pca)
tidy_prod_sub <- data.frame(tidy_prod_sub, production_pca)
Let’s take a look at the first two components of each:
tidy_prod_sub %>%
ggplot(.,aes(x = C1, y = C2, color = year)) +
geom_point() +
theme_bw() +
labs(color = "Year") +
ggtitle("Production PrinComp")
tidy_cons_sub %>%
ggplot(.,aes(x = C1, y = C2, color = year)) +
geom_point() +
theme_bw() +
labs(color = "Year") +
ggtitle("Consumption PrinComp")
For both, there is clear clustering of some observations while others are clearly diverging from the mass of countries over time. Let’s take a look at a few of these components over time. Having already taken a glance, we will take a look at the first principal component for production (dominated by energy) and the second principal component for consumption (to avoid domination by vaccines).
tidy_prod_sub %>%
filter(year == 2020) %>%
dplyr::select(iso3c,year,C1) %>%
arrange(C1) -> ranks
tidy_prod_sub %>%
ggplot(aes(x=year,y=C1, color = iso3c)) +
geom_point() +
geom_line() +
theme_bw() +
theme(legend.position = "none") +
xlab("Year") +
scale_x_continuous(breaks = seq(1960,2020,by=1),
guide = guide_axis(angle = 90)) +
ggtitle("Production Patterns over Time") +
geom_text(data = ranks[1:4,], aes(x=year,y=C1,label=iso3c),
nudge_x = 1)
Here we see a clear divergence in production profiles over time, China first moving towards the United States and then moving steadily away driven largely by massive hydro-electric projects. Most other countries, by contrast, have kept largely the same profiles of energy production over time although the overall trend indicates differentiation.
we can see a similar pattern when we look at the second component of our selected consumption data, this component largely tracking the uptake of communication technologies. Here we see that the data is dominated by population size. Whereas China and India have made great progess in extending these technologies to their population over time, others have stayed more or less where they were at the turn of the century. We do see, however, that the data “fans out” over time, indicating at least a bit of differentiation rather than homogenization.
tidy_cons_sub %>%
filter(year == 2019) %>%
dplyr::select(iso3c,year,C2) %>%
arrange(C2) -> ranks
tidy_cons_sub %>%
ggplot(aes(x=year,y=C2, color = iso3c)) +
geom_point() +
geom_line() +
theme_bw() +
theme(legend.position = "none") +
xlab("Year") +
scale_x_continuous(breaks = seq(1960,2020,by=1),
guide = guide_axis(angle = 90)) +
ggtitle("Consumption Patterns over Time") +
geom_text(data = ranks[1:4,], aes(x=year,y=C2,label=iso3c),
nudge_x = 1)
This is about as far as I’ll go with the present analysis, ending with some comments on potential improvements. Perhaps the most pressing issue to be dealt with is that of missing data. In the above, we took a relatively crude approach to selecting a subset of data for which not too many observations were missing. An alternative approach would be to first pre-process the data by perhaps aggregating similar metrics together (like the vaccination rates) to improve data retention. In principle this would allow for a greater variety of indicators to be included in the data reduction step and for principle components to pick up on deeper underlying structure than currently modeled.
Regardless, with the quick-cut methods employed above we see some tentative evidence for greater differentiation than convergence in production and consumption profiles. A large group of countries are changing little over time while others continue to develop profiles which differ from these static countries. In both of the models above, China stand out as a primary mover.