The goal of this exercise is to create skill profiles specific to job titles. O*NET data provides skill profiles at the NOC unit group level, which will serve as the baseline to which we will adjust. The approach taken is to
I started off with asking ChatGPT to describe in detail each of the 35 O*NET skills, forming a corpus (a list of documents). I then calculated the TF-IDF for each word in each skill.
TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is commonly used in information retrieval and text mining to identify important words in a document. Here’s a brief explanation of how TF-IDF works and how it can be used to identify important words:
Term Frequency (TF):
\(TF(t,d)= \frac{\mbox{Number of times term } t \mbox{ appears in document }d}{\mbox{Total number of terms in document }d}\)
Inverse Document Frequency (IDF):
\(IDF(t,D)=\log\left(\frac{\mbox{Total number of documents in the corpus }D}{\mbox{Number of documents containing term }t\mbox{ +1}}\right)\)
Note: Adding 1 in the denominator is to avoid division by zero in case a term is not present in any document.
TF-IDF Score:
\(TF-IDF(t,d,D)=TF(t,d)×IDF(t,D)\)
Identifying Important Words:
It is probably not surprising that when employers create a job posting, they do not use standardized job titles drawn from the list of job titles that comprise each NOC unit group. In order to clean up the job titles we perform a fuzzy join1, aggregate by clean job title, and then calculate the tf and tf-idf for each word in each job description. Below we have plotted the term frequency for the NOC unit group Advertising, marketing and public relations managers. I chose this NOC because it has several different job titles with multiple postings for each.
So we have a list of words associated with each of the 35 O*NET skills, and a list of words associated with each of the 12 different job titles within the NOC unit group Advertising, marketing and public relations managers. We perform an inner join (retaining only words that occur in both), and then calculate a “score” equal to the product of the skill tf-idf (i.e. importance) and the job posting term frequency (how frequently that word occurs in job postings for that job title.) We then group by job title and skill and add up the scores. The scores are log-normally distributed, so we take the log (transforming the distribution to be normal-ish) and then scale the result to have a mean of zero (for each skill, across all job titles) and a “reasonable” standard deviation.