library(tidyverse)  # for data manipulation
library(jsonlite)   # to import json
library(stringdist) # to compute string distances
library(igraph)     # to visualise data as networks

Concept

I really liked the idea of producing an alternative mapping to navigate jobs, and the suggestion to look at skills was very interesting, so I’ll try to go for something like that. In this document, I’ll outline my attempt to build a new classification based on the distance from one job to another, computed from both the essential and the optional skill-sets. This classification will then be used to create an actual map.

The idea is to look at how hard it would be to retrain from one job to another (we will have to define what “hard” means here), and to create a map of the job market where you are placed according to your existing skills, in relation to how “far” you currently are from other jobs.

Let’s say we want to transition from a current job/skill-set \(a\) to a new job \(b\). Let \(D^{a\rightarrow b}\) be a \(2 \times 2\) matrix with elements \(d_{i,j}^{a\rightarrow b}, i,j=1,2\), defined as below:

Distances:

  • \(d_{1,1}^{a\rightarrow b}\) (essential-to-essential): how close the core skills of the two jobs are (this should be the main principle)
  • \(d_{1,2}^{a\rightarrow b}\) (essential-to-optional): cross-distance, how much of your existing essential skills can you re-employ as optional in the new job?
  • \(d_{2,1}^{a\rightarrow b}\) (optional-to-essential): cross-distance, how much do your existing optional skills cover the essentials of the new job?
  • \(d_{2,2}^{a\rightarrow b}\) (optional-to-optional): how close the optional skills are
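
In matrix form, with the first index referring to the skills of the current job \(a\) and the second to those of the target job \(b\):

\[
D^{a\rightarrow b} =
\begin{pmatrix}
d_{1,1}^{a\rightarrow b} & d_{1,2}^{a\rightarrow b} \\
d_{2,1}^{a\rightarrow b} & d_{2,2}^{a\rightarrow b}
\end{pmatrix}
\]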

Considerations

When looking at distances between jobs and at the “best” next options, I think it’d be interesting to set up a few criteria. Consider the following:

  • minimum retraining: the “best” next job \(b'\) from \(a\) is such that \(b'(a)= \underset{b}{\mathrm{argmin}} \sum_{i,j} w_{i,j}d_{i,j}^{a\rightarrow b}\), where the \(w_{i,j}\), with \(\sum_{i,j} w_{i,j}=1\), are weights expressing the relative importance of the distances (a minimal sketch of this criterion follows below)
  • relative importance: when looking for a new job, it might be reasonable to think that already having close essential skills (measured by \(d_{1,1}^{a\rightarrow b}\)), or optional skills close to the new essential skills (\(d_{2,1}^{a\rightarrow b}\)), might be more important than being close on the optional ones. This would result in something like \(w_{1,1},w_{2,1}>w_{1,2},w_{2,2}\) if we use the “minimum retraining” criterion or, more generally, we could use preference operators to express which indicator to look at first.

We are clearly ignoring personal preferences in terms of what the jobs consist of here, but given enough data those might be factored in too.
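
To make the “minimum retraining” criterion concrete, here is a minimal sketch in R. It assumes a hypothetical named list \(\texttt{dist_list}\) of \(2\times 2\) matrices \(D^{a\rightarrow b}\), one per candidate job \(b\) (the actual distances are only computed later in this document):

# sketch of the "minimum retraining" criterion: pick the candidate job b that
# minimises the weighted sum of the four distances in D^{a->b}
# `dist_list`: named list of 2x2 matrices; `weights`: 2x2 matrix summing to 1
best_next_job <- function(dist_list, weights = matrix(0.25, 2, 2)) {
  scores <- vapply(dist_list, function(D) sum(weights * D, na.rm = TRUE),
                   numeric(1))
  names(which.min(scores))
}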

Practical implementation

We have already identified the subset of information that we will use to assess (dis)similarity between jobs based on how close the skills are. We thus need to:

  1. extract information on essential and optional skills
  2. choose a suitable string distance to obtain \(d_{i,j}^{a\rightarrow b}\)
  3. compute all pairwise distances \(d_{i,j}^{a\rightarrow b}\) for every combination of jobs \((a,b)\) and store them
  4. create an interface whereby it would be possible to input the current job/skill-set and select search criteria, to obtain a map of the nearest jobs (and a suggestion of the best one(s) according to said criteria)

1. Extract information on essential and optional skills

Point 1 is easily carried out by loading the json file into the working environment and transforming it into a \(\texttt{data.frame}\) object, or anything else that makes working with it easier than accessing a nested list.

# load available data
dictionary <- read_json("ESCO_occup_skills.json")
occupations <- read_csv('occupations_en.csv')
## 
## -- Column specification --------------------------------------------------------
## cols(
##   conceptType = col_character(),
##   conceptUri = col_character(),
##   iscoGroup = col_character(),
##   preferredLabel = col_character(),
##   altLabels = col_character(),
##   description = col_character()
## )
# skills <- read_csv('skills_en.csv')

# transform the dictionary from a list object to a dataframe
dictionary_df <- enframe(unlist(dictionary))

# quick check to count nesting levels and split accordingly
rgx_split <- "\\."
n_cols_max <-
  dictionary_df %>%
  pull(name) %>% 
  str_split(rgx_split) %>% 
  map_dbl(~length(.)) %>% 
  max()

nms_sep <- paste0("name", 1:n_cols_max)

data_sep <- dictionary_df %>% 
  separate(name, into = nms_sep, sep = rgx_split, fill = "right")

Note 1: you’ll notice that I haven’t actually loaded the skills file; this is because all the information I’ll use in this analysis can be obtained from the other two data sources.

Note 2: the json file is small enough that it is still possible to just load it into memory, without having to resort to solutions such as loading it into a MongoDB Docker container and querying it from the outside.

Now that we have transformed the messy list into a tidy dataset, we can readily extract the essential and optional skill-sets. I would have extracted the \(\texttt{iscoGroup}\) code from here as well, but it wasn’t always available (?), so I’ll just merge that information from the occupations file together with the job description (not ideal, but it’ll do for today).

# create a filtered version of the dataset
data_filt <- data_sep %>%
  filter(
    (name2 == "_links" & name3 == "hasEssentialSkill" & name4 == 'title') | 
    (name2 == "_links" & name3 == "hasOptionalSkill" & name4 == 'title')) %>% 
  dplyr::select(name1,name3,value)

names(data_filt) <- c('job_name','aspect','aspect_value')

# merge iscoGroup code from the occupations dataset

data_final <- data_filt %>% 
  left_join(occupations,by=c('job_name'='preferredLabel')) %>% 
  dplyr::select(job_name,job_code=iscoGroup,
                job_description=description,
                aspect,aspect_value)

There are three main reasons why I included the \(\texttt{iscoGroup}\) code:

  1. to be able to perform a sanity check on the new mapping later: does it make any sense through the lens of the existing framework?
  2. to possibly use it as an additional piece of information when describing paths on the map we generate
  3. generating the full distance matrices amounts to computing 2942 squared pairwise comparisons (times 4, one for each distance); for the sake of time (and my desktop’s own good!) I’ll sub-sample the occupations to demonstrate the method in the next section, and I’ll do so using a stratified approach based on group codes, so as to make sure to have a good representation of ideally close-by and far-away jobs - more on this later.

2. Choose a suitable string distance

There are plenty of string distances to choose from, and it might be worth looking into whether a problem as specific as comparing skill-sets warrants the development of an ad-hoc solution.

My plan here is to concatenate all essential skills for a job into a single string, and do the same for the optional ones (where available; some jobs do not appear to have optional skills listed). Once this is done, a dissimilarity measure can be computed between these strings for any two jobs, as described earlier.

For the sake of this assignment, I’ll just go with one of the pre-implemented algorithms in the \(\texttt{stringdist}\) package. There are currently 10 criteria available (some with parameters) to choose from to obtain a quantification of distance between strings; I have decided to use the Jaro distance, mainly as it is (sort of) normalised to \([0,1]\), thus allowing a direct comparison across pairs of jobs.
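
As a quick illustration of the scale of these values, here are two made-up skill strings (not taken from the ESCO data) compared with the Jaro distance (\(\texttt{method = 'jw'}\) with the default \(p = 0\)):

# 0 means identical strings, values close to 1 mean very dissimilar strings
stringdist("manage statistical data, collect data",
           "manage quantitative data, collect data", method = "jw")
stringdist("manage statistical data, collect data",
           "perform welding, inspect metal sheets", method = "jw")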

The following function takes as inputs two job titles \(\texttt{a}\) and \(\texttt{b}\) and outputs the matrix \(D^{a\rightarrow b}\) defined earlier. In order to do this, the function extracts the essential and (where available) optional skills for the two jobs, creates a single string for each, and then computes \(d_{i,j}^{a\rightarrow b}, i,j=1,2\). It is possible to modify the distance function, and to pass additional parameters to \(\texttt{stringdist()}\) through the \(\texttt{...}\) argument.

string_distance <- function(a,b,distance='jw',...) {
  
  ## check the job titles are in the dataset
  ## a refinement to this might be to allow for a not-exact match
  ## and just suggest the closest, say "statisician" for "statistician"
  stopifnot(any(data_filt$job_name==a)*any(data_filt$job_name==b)==1)
  
  ## job 1 - extract essential and optional skills and concatenate them 
  ## in a string each
  job_1_essential <- data_final %>% 
    filter(job_name==a & aspect=='hasEssentialSkill') %>% 
    dplyr::select(aspect_value) %>% 
    paste
  job_1_essential <- gsub('\\n'," ",gsub('\\\"',"",job_1_essential))
  
  job_1_optional <- data_final %>% 
    filter(job_name==a & aspect=='hasOptionalSkill') %>% 
    dplyr::select(aspect_value)
  
  if(nrow(job_1_optional)>0) { # this is to check whether optional are available
    
    job_1_optional <- job_1_optional %>% 
      paste 
    job_1_optional <- gsub('\\n'," ",gsub('\\\"',"",job_1_optional))
      
  } else {
    
    job_1_optional <- NULL
    
  }
  
  ## job 2 - extract essential and optional skills and concatenate them 
  ## in a string each
  job_2_essential <- data_final %>% 
    filter(job_name==b & aspect=='hasEssentialSkill') %>% 
    dplyr::select(aspect_value) %>% 
    paste
  job_2_essential <- gsub('\\n'," ",gsub('\\\"',"",job_2_essential))
  
  job_2_optional <- data_final %>% 
    filter(job_name==b & aspect=='hasOptionalSkill') %>% 
    dplyr::select(aspect_value)
  
  if (nrow(job_2_optional)>0) { # this is to check whether optional are available
    
    job_2_optional <- job_2_optional %>% 
      paste
    job_2_optional <- gsub('\\n'," ",gsub('\\\"',"",job_2_optional))
    
  } else {
    
    job_2_optional <- NULL
    
  }
  
  ## return distance matrix - if optional are not available, set distance to NA
  dist_matrix <- matrix(NA,2,2)
  
  dist_matrix[1,1] <- stringdist(job_1_essential,job_2_essential,method=distance,...)
  
  dist_matrix[1,2] <- ifelse(length(job_2_optional)>0,
                             stringdist(job_1_essential,job_2_optional,method=distance,...),NA)
    
  dist_matrix[2,1] <- ifelse(length(job_1_optional)>0,
                             stringdist(job_1_optional,job_2_essential,method=distance,...),NA)
  
  # both optional skill-sets must be available to compute this distance
  dist_matrix[2,2] <- ifelse(length(job_1_optional)>0 & length(job_2_optional)>0,
                             stringdist(job_1_optional,job_2_optional,method=distance,...),NA)

  dist_matrix

}
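
As a usage example (the two job titles below are assumed to appear as \(\texttt{preferredLabel}\) values in the ESCO data; any other valid pair would do):

# distance matrix between two occupations, using the default Jaro distance
string_distance("data scientist", "statistician")

# same pair, passing an extra stringdist parameter (Jaro-Winkler prefix weight)
# through the ... argument
string_distance("data scientist", "statistician", distance = "jw", p = 0.1)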

3. Compute pairwise distances

This part is particularly computationally intensive. I will subset the available jobs to obtain just enough to produce a proof of concept. The good thing is that the distance computation would need to be carried out only once (or only partially updated as time goes by), stored in a (json? :) ) object together with the rest of the information, and then queried when needed.

To obtain a subset that contains jobs that can be considered close as well as far away, I’ll exploit the hierarchies provided by the \(\texttt{iscoGroup}\) codes. Specifically, I’ll draw a stratified sample of a fixed fraction of the 2942 jobs, using ISCO’s major group (the first digit of the code) as strata and a proportional allocation (number of sampled units proportional to stratum size).

set.seed(42) # for reproducibility

sampling_fraction <- 0.05 # select a sample of the whole

job_list_sample <- data_final %>%
  mutate(major_group=substr(job_code,1,1)) %>%
  group_by(job_name) %>% 
  slice(1) %>% ungroup %>% 
  add_count(major_group,name='major_group_size') %>% 
  sample_frac(sampling_fraction, weight=major_group_size) %>% 
  pull(job_name)

We can quickly check the distribution of jobs across groups

occupations %>% 
  filter(preferredLabel%in%job_list_sample) %>% 
  count(substr(iscoGroup,1,1))

to find out that the least numerous groups do not appear in the subsample; this happens because of the low sampling fraction and their small size relative to the other groups. It will not be a problem here, as this is just a worked example, and it has the upside of bringing the number of pairwise comparisons down to a mere 147 squared.

The distances can now be computed by looping over a grid of pairs of jobs and applying the previously defined \(\texttt{string_distance()}\) function. The result can be stored in a number of ways; for simplicity I’ll create an array of empty \(2\times 2\) matrices, each of which will contain the \(d_{i,j}^{a\rightarrow b}\)s for a specific pair.

# create the job pairs grid (keep the job names as characters, not factors)
job_pairs <- expand.grid(job_list_sample,job_list_sample,stringsAsFactors=FALSE)
names(job_pairs) <- c('job_1','job_2')

n_pairs <- nrow(job_pairs)

# pre-allocate space for distance results
distance_matrix <- array(dim=c(2,2,n_pairs))

# loop over the grid and populate the distance matrix

for (i in 1:n_pairs) {
  
  # the pair grid already holds the job names themselves
  job_1 <- job_pairs$job_1[i]
  job_2 <- job_pairs$job_2[i]
  
  distance_matrix[,,i] <- string_distance(job_1,job_2)
  
}

We should also keep track of some useful information, such as whether the pair is a job and itself (bound to return a distance of 0 on the essential skills), and which ISCO code transition it entails:

# keep track of which pairs have a=b (same job), as they'll return distance 0 
# between essential skills invariably
job_pairs$which_same <- job_pairs[,1]==job_pairs[,2]

# keep track of major groups associated to the pair
job_pairs <- job_pairs %>% 
  left_join(occupations %>% 
              dplyr::select(preferredLabel,iscoGroup),
            by=c('job_1'='preferredLabel')) %>% 
  left_join(occupations %>% dplyr::select(preferredLabel,iscoGroup),
            by=c('job_2'='preferredLabel'))
names(job_pairs)[4:5] <- c('isco_job_1','isco_job_2')

While we are at it, for ease of handling, let’s add the distances to the \(\texttt{job_pairs}\) object as new columns.

for (i in 1:n_pairs) {
  
  # c() unrolls the 2x2 matrix column-wise: [1,1], [2,1], [1,2], [2,2]
  distances <- c(distance_matrix[,,i])
  
  job_pairs$essential_to_essential[i] <- distances[1]
  job_pairs$optional_to_essential[i] <- distances[2]
  job_pairs$essential_to_optional[i] <- distances[3]
  job_pairs$optional_to_optional[i] <- distances[4]
  
}
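
Since the distances only need to be computed once, this is also a natural point to persist them so they can simply be reloaded later; a minimal sketch (the file name is just a placeholder):

# store the pre-computed pairwise distances alongside the pairs table
saveRDS(list(job_pairs = job_pairs, distance_matrix = distance_matrix),
        "esco_pairwise_distances.rds")

# ...to be read back when needed with
# precomputed <- readRDS("esco_pairwise_distances.rds")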

4. Create an interface

I will not get into details of the creation of a dashboard/webapp here. However, let’s write a couple of functions that will help us visualise the results.

The first one will take as arguments a job label to start \(\texttt{from}\), a distance \(\texttt{threshold}\) to define a neighbourhood, and which of the distances \(d^{a\rightarrow b}_{i,j}\) should be used (1) or not (\(\texttt{NA}\)) to assess proximity.

return_neighbours <- function(from,threshold=.3,
                              which_distances=c(1,NA,NA,NA)) {
  
  job_pairs %>% 
    filter(job_1==from) %>% 
    mutate(avg_distance=rowMeans(
      cbind(
        which_distances[1]*essential_to_essential,
        which_distances[2]*essential_to_optional,
        which_distances[3]*optional_to_essential,
        which_distances[4]*optional_to_optional),
      na.rm=TRUE)
      ) %>% 
    filter(avg_distance<threshold) %>% 
    filter(!which_same)

}

The way the \(\texttt{which_distances}\) argument works is essentially by assigning those “weights” \(w_{i,j}\) discussed above in a dichotomous in/out fashion. We use the value 1 to indicate “use this distance” and \(\texttt{NA}\) to indicate “do not use it”; the positions reflect those described earlier, so that \(w_{1,1}\) refers to the essential-to-essential distance, \(w_{1,2}\) to essential-to-optional, \(w_{2,1}\) to optional-to-essential and \(w_{2,2}\) to optional-to-optional. An overall average distance is then computed using only the distances that were selected.
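
For instance, to look for neighbours using both the essential-to-essential and the optional-to-essential distances (the job title is just an example and is assumed to be in the sampled subset):

# neighbours within the default threshold, averaging d_11 and d_21 only
return_neighbours("polygraph examiner",
                  threshold = 0.3,
                  which_distances = c(1, NA, 1, NA))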

The function \(\texttt{return_neighbours()}\) will be used within the next one, which takes the dataframe it returns and provides a visual representation of the closest (by skills) jobs using the \(\texttt{igraph}\) package. The visualisation is static, but with some work it could be made interactive (for example through plotly or similar graphics libraries), even if only to allow for information pop-ups on mouse hover, or rotation in three dimensions to adjust the view.

plot_neighbours <- function(from, threshold=.3, 
                            which_distances=c(1,NA,NA,NA),
                            ...) {
  
  tmp_data <- return_neighbours(from,threshold,
                                which_distances)
  
  d1 <- data.frame(from=from, to=tmp_data$job_2)
  vertices <- data.frame(name = unique(c(as.character(d1$from),
                                         as.character(d1$to)))) 
  
  mygraph <- graph_from_data_frame(d1, vertices=vertices)
  
  vertex_labels <- c(paste0(from,' [',substr(tmp_data$isco_job_1[1],1,1),']'),
                     paste0(tmp_data$job_2,' [',substr(tmp_data$isco_job_2,1,1),']'))
  
  plot(mygraph, vertex.label=vertex_labels, 
       edge.arrow.size=0, vertex.size=3,
       vertex.label.dist=2)
  
}

Demonstration

Let’s take a random job from the subset we have created and imagine it is our current job position:

set.seed(11235813)
current_job <- sample(job_list_sample,1)
current_job
## [1] "polygraph examiner"

Its \(\texttt{iscoGroup}\) code is 2634 and the job description reads

occupations %>% 
  filter(preferredLabel==current_job) %>% 
  pull(description)
## [1] "Polygraph examiners prepare individuals for polygraph testing, conduct the polygraph exam and interpret the results. They pay close attention to detail and use a range of instruments to monitor respiratory, sweat and cardiovascular responses to questions addressed during the process. Polygraph examiners write reports on the basis of the results and can provide courtroom testimony."

Nice. Let’s first take a look at what the essential-to-essential skill distances look like for this job:

percentile_threshold <- .03

job_pairs %>% 
    filter(job_1==current_job) %>% 
    ggplot()+
    geom_density(aes(essential_to_essential))+
    geom_vline(aes(xintercept=quantile(essential_to_essential,percentile_threshold)),lty=2)+
    theme_bw()

From this we can eyeball a reasonable lower threshold to identify close-by jobs. Let’s say we take the lowest 3\(\%\) (dashed vertical line in the plot), which corresponds to a distance of about 0.23:

threshold_distance <-  quantile(job_pairs$essential_to_essential[job_pairs$job_1==current_job],percentile_threshold)

return_neighbours(current_job,
                  threshold = threshold_distance)

We can see the algorithm returns 5 neighbours at that distance. Let’s take a look at them!

plot_neighbours(from=current_job,
                threshold=threshold_distance)

At the center, our current occupation. The numbers in square brackets indicate the major group according to the ISCO classification (finer groupings - more digits - can be reported/looked up in the database for sanity checks). The plot shows the closest jobs according to

  1. chosen distance (default: Jaro)
  2. weights (default: only look at essential-to-essential)
  3. threshold (here: third percentile of distances).

So, according to our algorithm, a polygraph examiner could consider retraining in quite a few funny ways… Where have I seen something similar already…

Let’s keep in mind we are using an arbitrary distance on a sample of the whole dataset, so the jobs flagged as close-by might not actually be that close, partly because of these choices too!
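
To get a feel for how sensitive the map is to these choices, the same call can be repeated with different criteria, for example averaging all four distances (note that the threshold was eyeballed on the essential-to-essential distance alone, so it may need re-tuning; output not shown):

# same starting job, but using all four distances to assess proximity
plot_neighbours(from = current_job,
                threshold = threshold_distance,
                which_distances = c(1, 1, 1, 1))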

Conclusions

I have tried to outline one possible approach to the problem of re-categorising jobs, starting from the ESCO framework and defining distances based on how similar the essential and optional skills for said jobs are. I have described a simple implementation and created a visual output that gives an immediate idea of the potential of this method.

Advantages The framework is quite intuitive and pretty flexible. The possibility to implement a custom distance and to choose the thresholds would make it possible to create a user-friendly interface (for example using Shiny apps/flexdashboards) to explore the job landscape under this new lens. While this notebook has been written using rmarkdown and the R language, the method is clearly platform-agnostic, and everything needed to implement it (maybe even better) already exists in Python and other widely used languages.

Limitations The implementation is very raw, and there is certainly scope for improvements there. Moreover, the choice of which distance to use is critical: as we are trying to measure something as complex as how similar two sets of skills are (and possibly more than two), the way we choose to do this will determine the meaningfulness of the whole exercise. Computational issues pertain to the calculation of pairwise distances for a large number of jobs, which - however - would only need to be done once, and which can certainly be optimised, for example by writing the loop in a lower-level language such as C++ or similar.
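
Even staying within R, a low-effort option is to parallelise the pairwise loop. A minimal sketch with the base \(\texttt{parallel}\) package, using the objects defined above (on Windows \(\texttt{mc.cores}\) must be 1, or \(\texttt{parLapply()}\) should be used instead):

library(parallel)

# compute each 2x2 distance matrix in parallel, then stack into a 2x2xN array
dist_list <- mclapply(seq_len(n_pairs), function(i) {
  string_distance(job_pairs$job_1[i], job_pairs$job_2[i])
}, mc.cores = 2)
distance_matrix <- simplify2array(dist_list)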

On a side note, there is an \(\texttt{R}\) package by the name of \(\texttt{labourR}\) that provides a host of functions to deal with the ESCO data, which “Includes the ESCO corpus and the respective ESCO to ISCO mappings. Allows a user to enter multilingual free-form text and receive its classification in the ESCO-ISCO hierarchy. Computations are fully vectorized and memory efficient. Includes facilities to assist research in text mining of labour market data.” (quote from the package vignette). I haven’t explored it too much, as I wanted to give this a go on my own, but it looks interesting.

Oh, also, I couldn’t resist and created a wee interactive version of the above, you can find it here. :)

Thanks for getting to the bottom of this!