Measuring Education similarity on the basis of occupation counts

Source code: github logo

Author

Richard Martin

Code
edu_noc <- read_csv(here("out","edu_noc.csv"))

all_cor <- vroom::vroom(here("out", "all_cor.csv"))|>
  group_by(cip1, highest1)|>
  mutate(broad1=str_sub(cip1, 1, 2),
         broad2=str_sub(cip2, 1, 2))

max_spearman <- all_cor|>
  filter(spearman_cor==max(spearman_cor, na.rm = TRUE))|>
  mutate(what_same=case_when(cip1==cip2 & highest1==highest2 ~ "Both  CIP and Highest attained",
                            cip1!=cip2 & highest1!=highest2 ~ "Neither  CIP nor Highest attained",
                            cip1==cip2 ~ " CIP",
                            highest1==highest2 ~ "Highest attained"
                            ))

max_pearson <- all_cor|>
  filter(pearson_cor==max(pearson_cor, na.rm = TRUE))|>
  mutate(what_same=case_when(cip1==cip2 & highest1==highest2 ~ "Both  CIP and Highest attained",
                            cip1!=cip2 & highest1!=highest2 ~ "Neither  CIP nor Highest attained",
                            cip1==cip2 ~ " CIP",
                            highest1==highest2 ~ "Highest attained"
                            ))

TL;DR

What types of education are similar? This is a tough question to answer, as we have little information about the skills and knowledge that are acquired. The CIP (field of study) hierarchy gives some information in this regard, but perhaps is unduly influenced by the organizational structure of post secondary institutions. Note that the lack of information regarding education stands in contrast with occupations, where we have extensive ONET data regarding the level and importance of many dimensions of skill and knowledge.

In this paper we conjecture that education similarity can be quantified on the basis of labour market outcomes: i.e. if the graduates from two different educations fan out across the labour market in similar manner, we hypothesize the reason why is that they are, in some sense, close substitutes.

Once we have identified the most similar educations, we can investigate how they are similar: either in terms of sharing the same CIP at differing levels of highest attainment, or sharing the same highest level of attainment, but in a different CIP. These measures of similarity shed some light on the question of whether education is more about signalling hidden ability (via highest attainment) or the acquisition of specific skills (i.e. same CIP).

Intro

Statistics Canada table 98-10-0403-01 forms the basis of the following analysis. This table provides us with employment counts on the basis of 435 four digit CIPs, 512 five digit NOCs and 8 levels of highest educational attainment. Crossing CIPs with highest attainment we create a new variable education, with \(8 \times 435 = 3480\) different values. Note that not all combinations of 4 digit CIP and highest attainment have positive employment counts. e.g. Ph.d. cosmetologists do not exist, nor are there apprentice dentists. We perform the following filtering of this (very sparse) matrix on the basis of its margins: only include rows and columns that have sums in excess of 1000 employed. Once filtered, we are left with 782 types of education and 475 occupations.

Example

Lets look at the NOC profiles of two educations: a BA in Economics vs. a MA in Economics:

Code
compare_econ <- edu_noc|>
  filter(Education %in% c("45.06 Economics: Master's degree", "45.06 Economics: Bachelor's degree"))|>
  pivot_wider(names_from = Education, values_from = value)

plt <- ggplot(compare_econ, aes(`45.06 Economics: Master's degree`, `45.06 Economics: Bachelor's degree`, text=NOC))+
  geom_jitter(alpha=.5)+
  geom_rug(col=rgb(.5,0,0, alpha=.5))+
  labs(title=paste("NOC counts for BA vs MA Economics"))
ggplotly(plt, tooltip = "text")

The most notable feature of the data is how skewed it is: For many NOCs the counts are either zero (or very close to zero), with a very small number of NOCs with high counts. Lets look at the same data on the log10 scale.

Code
plt <- ggplot(compare_econ, aes(`45.06 Economics: Master's degree`, `45.06 Economics: Bachelor's degree`, text=NOC))+
  geom_jitter(alpha=.5)+
  geom_rug(col=rgb(.5,0,0, alpha=.5))+
  scale_x_continuous(trans="log10")+
  scale_y_continuous(trans="log10")+
  labs(title=paste("NOC counts for BA vs MA Economics: Log10 scale"))

ggplotly(plt, tooltip = "text")

If you hover over the points in the plot above you can see that off diagonal, those with a MA tend to work in research oriented occupations whereas those with a BA tend to work more in sales.

One might wonder which education (a combination of CIP and highest attainment) has the maximal correlation with a MA in Economics? We use two measures of correlation:

  1. Pearson (linear) correlation on the log10 transformed data.
  2. Spearman (monotonic) correlation on the raw data.

The closest alternative education to a MA in Economics (based on Spearman)

Code
compare_closest <- edu_noc|>
  filter(Education %in% c("45.06 Economics: Master's degree", "52.02 Business administration, management and operations: Master's degree"))|>
  pivot_wider(names_from = Education, values_from = value)

closest_cor <- compare_closest|>
  column_to_rownames("NOC")|>
  correlate(method="spearman")|>
  pull()|>
  na.omit()|>
  round(digits=3)|>
  as.vector()

plt <- ggplot(compare_closest, aes(`45.06 Economics: Master's degree`, `52.02 Business administration, management and operations: Master's degree`, text=NOC))+
  geom_jitter(alpha=.5)+
  geom_rug(col=rgb(.5,0,0, alpha=.2))+
  scale_x_continuous(trans="log10")+
  scale_y_continuous(trans="log10")+
  labs(title=paste("NOC counts for MA Economics vs MBA: The Spearman correlation is", closest_cor))

ggplotly(plt, tooltip = "text")

Again, if we hover over the off diagonal points, we see management and engineering jobs more prevalent for those with a MBA, and research type jobs more common for people with a MA in Economics. So here we see that there is an (ever so slightly) higher correlation between two different CIPs as the same highest educational attainment than between the same CIP at two different levels of highest attainment. This suggests that the closest fields of study might be found by searching horizontally (across different CIPs at the same level of highest attainment) rather than vertically (within the same CIP at different levels of highest attainment).

How are maximally correlated Educations similar?

We look at two dimensions of similarity:

  1. Horizontal similarity is when two education paths share the same level of highest attained, but have differing CIP classifications.
  2. Vertical similarity occurs when two education paths share the same CIP classification, but have differing levels of highest attainment.
  3. … and it is possible that the two education paths share neither of the above measures of similarity.

Of course, just because an education is the most highly correlated of all other educations does not imply that it is highly correlated: some educations are pretty niche. So below we filter the results on the basis of some arbitrary minimal correlations, ranging from 0 (no correlation) to .9 (highly correlated). The relative frequency of the similarity counts is not sensitive to where we set the minimal correlation.

Code
filter_tbbl <- function(min_correlation, tbbl, filter_var){
  tbbl|>
    filter({{  filter_var  }} > min_correlation)|>
    pull(what_same)|>
    table()|>
    as.data.frame()
}

spearman_table <- tibble(min_correlation=seq(0,.9,.1))|>
  mutate(data=list(max_spearman))|>
  mutate(filtered=map2(min_correlation, data, filter_tbbl, spearman_cor))|>
  select(-data)|>
  unnest(filtered)

ggplot(spearman_table, aes(factor(min_correlation), 
                           Freq, 
                           fill=Var1))+
  geom_col(position = "dodge")+
  labs(x="Only consider correlations in excess of",
       y="Count",
       fill="Closest education shares the same",
       title="Closest education on the basis of Spearman correlation of NOC counts")

In many more cases the closest alternative education (in terms of NOC correlation) will share the same highest level attained rather than share the same CIP.

Same using Pearson correlation on log transformed NOC counts.

Code
pearson_table <- tibble(min_correlation=seq(0,.9,.1))|>
  mutate(data=list(max_pearson))|>
  mutate(filtered=map2(min_correlation, data, filter_tbbl, pearson_cor))|>
  select(-data)|>
  unnest(filtered)

ggplot(pearson_table, aes(factor(min_correlation), 
                           Freq, 
                           fill=Var1))+
  geom_col(position = "dodge")+
  labs(x="Only consider correlations in excess of",
       y="Count",
       fill="Closest education shares the same",
       title="Closest education on the basis of Pearson correlation of log10(1+NOC counts)")

Conclusion:

Based on the above it appears that signalling hidden ability (via highest attainment) is significantly more important than specific skill acquisition (via staying within the same CIP).

Caveats:

Code
ma_econ <- all_cor|>
  filter(cip1=="45.06 Economics",
         highest1=="Master's degree")

You might be thinking that this is not a fair comparison, as there are significant differences in the number of “competitor” educations in each of the three categories. For instance, consider a MA in Economics. There are 781 alternative educations, of which 135 are a Masters in a different CIP, whereas only 5 educations are in the same CIP but at different levels of highest attainment. The remainder of 641 are share neither the same CIP nor the same highest attainment. The best defense to this criticism I can think of is an analogy to a running race: if winner takes all, what matters is the speed of your fastest competitor, not the shear number of competitors you face: This:

you vs seven Olympic finalists

vs. this:

Sydney fun run

you vs. 70,000 weekend warriors

Top 5 correlated educations:

By sorting by correlation ascending, the list is populated mostly with the medical fields, with a smattering of programs I wasn’t even aware they existed: e.g. 29.05 Military technologies and applied sciences, 43.02 Fire protection, 14.34 Forest engineering.

If you sort descending, the list is populated with business degrees, or in fields of study with offerings at both College, CEGEP or other non-university certificate and diploma and Apprenticeship or trades certificate or diploma.

Code
all_cor|>
  group_by(cip1, highest1)|>
  select(-broad1,-broad2)|>
  mutate(ave_cor=(spearman_cor+pearson_cor)/2)|>
  slice_max(ave_cor, n=5)|>
  mutate(across(where(is.double), \(x) round(x, digits=2)))|>
  my_dt()