Adjusting a job’s skill profile based on job postings

Intro

The goal of this exercise is to create skill profiles specific to job titles. O*NET data provides skill profiles at the NOC unit group level, which will serve as the baseline to which we will adjust. The approach taken is to

identify “skill words” that are “important” for each of the 35 ONET skills.
fuzzyjoin the dirty job titles from postings with the clean job titles from O*NET.
aggregate all job postings with the same clean job title.
calculate the “term frequency” for each word in each aggregate job title.
inner join the skill words and the job words (retaining only words that occur in both)
calculate a score for each word, which is the product of the term frequency in the job postings data and the “importance” of that word for each skill.
for each job title/skill, calculate the sum of all the scores.
take log of this sum, then scale so the mean of the sum is zero, and the standard deviation is reasonable.

Identifying important skill words

I started off with asking ChatGPT to describe in detail each of the 35 O*NET skills, forming a corpus (a list of documents). I then calculated the TF-IDF for each word in each skill.

TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is commonly used in information retrieval and text mining to identify important words in a document. Here’s a brief explanation of how TF-IDF works and how it can be used to identify important words:

Term Frequency (TF):

Term Frequency is a measure of how often a term (word) appears in a document.
It is calculated as the ratio of the number of times a term appears in a document to the total number of terms in the document.

\(TF(t,d)= \frac{\mbox{Number of times term } t \mbox{ appears in document }d}{\mbox{Total number of terms in document }d}\)

Inverse Document Frequency (IDF):

Inverse Document Frequency is a measure of how unique or rare a term is across a collection of documents.
It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term, with the result being inverted.

\(IDF(t,D)=\log\left(\frac{\mbox{Total number of documents in the corpus }D}{\mbox{Number of documents containing term }t\mbox{ +1}}\right)\)

Note: Adding 1 in the denominator is to avoid division by zero in case a term is not present in any document.

TF-IDF Score:

The TF-IDF score for a term in a document is obtained by multiplying its TF and IDF values.

\(TF-IDF(t,d,D)=TF(t,d)×IDF(t,D)\)

Identifying Important Words:

Words with higher TF-IDF scores are considered more important in a specific document.
If a term has a high TF, it means it appears frequently in the document. If it has a high IDF, it means it is relatively rare across the entire corpus.
The combination of high TF and high IDF implies that the term is both frequent in the document and unique or rare in the corpus, making it a good candidate for being an important word in that document.

Cleaning the jobs postings data

It is probably not surprising that when employers create a job posting, they do not use standardized job titles drawn from the list of job titles that comprise each NOC unit group. In order to clean up the job titles we perform a fuzzy join¹, aggregate by clean job title, and then calculate the tf and tf-idf for each word in each job description. Below we have plotted the term frequency for the NOC unit group Advertising, marketing and public relations managers. I chose this NOC because it has several different job titles with multiple postings for each.

Creating a score:

So we have a list of words associated with each of the 35 O*NET skills, and a list of words associated with each of the 12 different job titles within the NOC unit group Advertising, marketing and public relations managers. We perform an inner join (retaining only words that occur in both), and then calculate a “score” equal to the product of the skill tf-idf (i.e. importance) and the job posting term frequency (how frequently that word occurs in job postings for that job title.) We then group by job title and skill and add up the scores. The scores are log-normally distributed, so we take the log (transforming the distribution to be normal-ish) and then scale the result to have a mean of zero (for each skill, across all job titles) and a “reasonable” standard deviation.

What terms are important in determining adjustment?

The adjustments:

https://joss.theoj.org/papers/10.21105/joss.05693 ↩︎