Intro

The goal of this exercise is to create skill profiles specific to job titles. O*NET data provides skill profiles at the NOC unit group level, which serve as the baseline that we then adjust. The approach (sketched in code below the list) is to

  1. identify “skill words” that are “important” for each of the 35 O*NET skills.
  2. fuzzy join the dirty job titles from postings with the clean job titles from O*NET.
  3. aggregate all job postings with the same clean job title.
  4. calculate the “term frequency” for each word in each aggregate job title.
  5. inner join the skill words and the job words (retaining only words that occur in both).
  6. calculate a score for each word, which is the product of the term frequency in the job postings data and the “importance” of that word for each skill.
  7. for each job title/skill, calculate the sum of all the scores.
  8. take the log of this sum, then scale so that the mean is zero and the standard deviation is reasonable.
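
These eight steps map onto a fairly compact tidyverse pipeline. Here is a minimal sketch of the full flow; the input data frame names (skill_descriptions, onet_titles, postings), the max_dist threshold, and the 0.5 standard deviation are all assumptions for illustration. Each step is unpacked in the sections that follow.

```r
library(dplyr)
library(tidytext)
library(fuzzyjoin)

# Assumed inputs (names are illustrative):
#   skill_descriptions: one row per O*NET skill -- columns skill, description
#   onet_titles:        clean O*NET job titles -- column title
#   postings:           raw postings -- columns posting_title, description

# Step 1: TF-IDF over the skill descriptions gives each skill's word weights.
skill_words <- skill_descriptions |>
  unnest_tokens(word, description) |>
  anti_join(stop_words, by = "word") |>
  count(skill, word) |>
  bind_tf_idf(word, skill, n) |>
  select(skill, word, tf_idf)

# Steps 2-4: fuzzy join dirty titles to clean titles, pool postings sharing
# a clean title, and compute each word's term frequency within the pool.
job_words <- postings |>
  stringdist_inner_join(onet_titles,
                        by = c("posting_title" = "title"),
                        max_dist = 2) |>
  unnest_tokens(word, description) |>
  anti_join(stop_words, by = "word") |>
  count(title, word) |>
  group_by(title) |>
  mutate(tf = n / sum(n)) |>
  ungroup() |>
  select(title, word, tf)

# Steps 5-7: keep shared words, score each as tf * tf_idf, sum per title/skill.
scores <- job_words |>
  inner_join(skill_words, by = "word", relationship = "many-to-many") |>
  group_by(title, skill) |>
  summarise(score = sum(tf * tf_idf), .groups = "drop")

# Step 8: log, then centre each skill at zero with a chosen spread (0.5 here).
adjustments <- scores |>
  filter(score > 0) |>   # log() requires positive scores
  group_by(skill) |>
  mutate(adjustment = 0.5 * as.numeric(scale(log(score)))) |>
  ungroup()
```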

Identifying important skill words

I started by asking ChatGPT to describe each of the 35 O*NET skills in detail, forming a corpus (a list of documents, one per skill). I then calculated the TF-IDF for each word in each skill description.
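
As a concrete illustration of that calculation, here is a tidytext version run on two invented stand-ins for the ChatGPT descriptions (the skill names are genuine O*NET skills, but the blurbs are placeholders):

```r
library(dplyr)
library(tidytext)

# Toy stand-ins for the ChatGPT-generated skill descriptions.
skill_descriptions <- tribble(
  ~skill,             ~description,
  "Active Listening", "giving full attention to what other people are saying and asking good questions",
  "Programming",      "writing computer programs and debugging code to solve problems"
)

skill_words <- skill_descriptions |>
  unnest_tokens(word, description) |>   # one row per (skill, word)
  anti_join(stop_words, by = "word") |> # drop "the", "and", "to", ...
  count(skill, word) |>
  bind_tf_idf(word, skill, n)           # adds tf, idf, and tf_idf columns
```

A word that appears in only one of the two descriptions gets \(IDF = \log(2/1) \approx 0.69\), while a word appearing in both gets \(IDF = 0\), so vocabulary shared across every skill contributes nothing to a skill's profile. Note that bind_tf_idf() uses the unsmoothed IDF, without the +1 shown in the formula below.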

TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a numerical statistic that reflects how important a word is to a document within a collection (corpus). It is commonly used in information retrieval and text mining to identify the important words in a document. Here is a brief explanation of how it works:

Term Frequency (TF):

\(TF(t,d)= \frac{\mbox{Number of times term } t \mbox{ appears in document }d}{\mbox{Total number of terms in document }d}\)

Inverse Document Frequency (IDF):

\(IDF(t,D)=\log\left(\frac{\mbox{Total number of documents in the corpus }D}{1+\mbox{Number of documents containing term }t}\right)\)

Note: the 1 added to the denominator avoids division by zero when a term does not appear in any document.

TF-IDF Score:

\(\mbox{TF-IDF}(t,d,D)=TF(t,d)\times IDF(t,D)\)
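
As a quick worked example with invented numbers: suppose the corpus is the 35 skill descriptions, the word “negotiate” appears 4 times in a 200-word description, and 6 of the 35 descriptions contain it. Then \(TF = \frac{4}{200} = 0.02\), \(IDF = \log\left(\frac{35}{6+1}\right) = \log(5) \approx 1.61\), and the word's score for that skill is \(\mbox{TF-IDF} \approx 0.02 \times 1.61 \approx 0.032\).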

Identifying Important Words:

Words with a high TF-IDF score in a given document are frequent in that document but rare across the rest of the corpus. Applied to the skill corpus, each skill's “important” words are those with the highest TF-IDF in its description.
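
Continuing the sketch above, selecting each skill's word list is then a one-liner; the cut-off of 20 words per skill is an arbitrary illustrative choice:

```r
top_skill_words <- skill_words |>
  group_by(skill) |>
  slice_max(tf_idf, n = 20, with_ties = FALSE) |>  # 20 most distinctive words
  ungroup()
```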

Cleaning the job postings data

It is probably not surprising that when employers create a job posting, they do not use standardized job titles drawn from the list of titles that make up each NOC unit group. To clean up the job titles we perform a fuzzy join1, aggregate by clean job title, and then calculate the TF and TF-IDF for each word in each aggregated job description. Below we plot the term frequency for the NOC unit group Advertising, marketing and public relations managers; I chose this NOC because it has several different job titles, each with multiple postings.
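
To make the fuzzy join concrete, here is a minimal sketch using stringdist_inner_join() from the fuzzyjoin package (the footnoted paper may describe a different implementation; the titles, posting text, and max_dist = 2 threshold are all invented for illustration):

```r
library(dplyr)
library(tidytext)
library(fuzzyjoin)

postings <- tibble(
  posting_title = c("Marketing Manger",        # note the typo
                    "Markting Manager",
                    "Advertising Manager"),
  description   = c("plan campaigns and manage the marketing budget",
                    "lead the marketing team and plan campaigns",
                    "buy media and manage advertising budgets")
)

clean_titles <- tibble(title = c("marketing manager", "advertising manager"))

# A string-distance join tolerates small misspellings in the posting titles.
job_words <- postings |>
  mutate(posting_title = tolower(posting_title)) |>
  stringdist_inner_join(clean_titles,
                        by = c("posting_title" = "title"),
                        max_dist = 2) |>
  # Pool all postings matched to the same clean title...
  unnest_tokens(word, description) |>
  anti_join(stop_words, by = "word") |>
  count(title, word) |>
  # ...and compute each word's term frequency within that pool.
  group_by(title) |>
  mutate(tf = n / sum(n)) |>
  ungroup()
```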

Creating a score:

We now have a list of words associated with each of the 35 O*NET skills, and a list of words associated with each of the 12 different job titles within the NOC unit group Advertising, marketing and public relations managers. We perform an inner join (retaining only words that occur in both), and then calculate a “score” equal to the product of the skill TF-IDF (i.e. importance) and the job posting term frequency (how often that word occurs in postings for that job title). We then group by job title and skill and sum the scores. The sums are log-normally distributed, so we take the log (making the distribution roughly normal) and then scale the result to have a mean of zero (for each skill, across all job titles) and a “reasonable” standard deviation.
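
Assuming skill_words (columns skill, word, tf_idf) and job_words (columns title, word, tf) built as in the sketches above from the full data, the scoring and scaling steps look like this; the standard deviation of 0.5 is one possible “reasonable” choice, not something fixed by the method:

```r
library(dplyr)

scores <- job_words |>
  select(title, word, tf) |>
  inner_join(skill_words |> select(skill, word, tf_idf),
             by = "word", relationship = "many-to-many") |>
  mutate(score = tf * tf_idf) |>          # posting frequency x skill importance
  group_by(title, skill) |>
  summarise(score = sum(score), .groups = "drop")

adjustments <- scores |>
  filter(score > 0) |>                    # log() requires positive scores
  group_by(skill) |>                      # centre within each skill
  mutate(adjustment = 0.5 * as.numeric(scale(log(score)))) |>
  ungroup()
```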

What terms are important in determining adjustment?

The adjustments:


  1. https://joss.theoj.org/papers/10.21105/joss.05693↩︎