source("shared_code.R")
People vary in how well matched their skills are to the skills required by their occupation. There are various reasons why people might be mis-matched. For instance, if there is an element of “winner takes all” in the labour market, one would expect that the highest quality workers and highest quality employers would end up with the highest quality matches: both the buyer and the seller have the luxury of being selective in who they match with. But being poorly matched does not necessarily indicate a worker is of low ability.
Regardless of the cause of poor match quality, we will want to control for these confounding variables. If we successfully control for all the factors that influence match quality, we can assume that match quality is “as good as” randomly assigned. The goal of this exercise is to investigate the impact of skill mis-match on:
employment income of those who work full time full year: Going from the 10th percentile of skill mis-match (relatively well matched) to the 90th percentile of skill mis-match (relatively poorly matched) causes a 7.9% reduction in employment income for those working full time, full year.
the probability of working full time full year: Going from the 10th percentile of skill mis-match (relatively well matched) to the 90th percentile of skill mis-match (relatively poorly matched) causes a 5.4% reduction in the probability of being employed full time full year.
We define skill mis-match to be the Euclidean distance between the skill profile of an occupation and the skills possessed by a worker. The skill profile of an occupation is relatively straightforward to derive from the O*NET skills and work activities. In contrast, there is no direct source of information regarding the skills of a worker. Here we presume that a worker’s skills are acquired during education, and that we can infer the skills acquired during education by looking at the relationship between occupations and field of study.
Statistics Canada Table: 98-10-0403-01 contains counts of workers by occupation and field of study. Suppose that we wanted to infer the average skill profile of someone whose field of study was “Health and related fields”. In the table below we show the top occupations (by count of workers) for the field of study “Health and related fields”. From this we derive a proportion, a truncated proportion (only proportions greater than .01), and an adjusted proportion to ensure the proportions we utilize sum to one. We then multiply the skill profile of each of the occupations by the adjusted proportion and then sum to create a weighted average skill profile for each field of study. At the extreme, if every person from a given field of study ended up in the same occupation, both skill profiles would be identical.
my_dt(cip2_noc5)
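The weighting steps described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the counts, occupation names, skill names and the helper `weighted_profile()` are all hypothetical and are not taken from the source data or from `shared_code.R`.

```r
# Hypothetical counts of workers by occupation for one field of study
counts <- c(nurse = 600, physician = 250, pharmacist = 145, other = 5)

# Hypothetical skill profiles: one row per occupation, one column per skill
skills <- rbind(
  nurse      = c(critical_thinking = 3.5, service_orientation = 4.0),
  physician  = c(critical_thinking = 4.5, service_orientation = 3.8),
  pharmacist = c(critical_thinking = 4.0, service_orientation = 3.5),
  other      = c(critical_thinking = 2.0, service_orientation = 2.0)
)

# Hypothetical helper: weighted average skill profile for a field of study
weighted_profile <- function(counts, skills, threshold = 0.01) {
  prop <- counts / sum(counts)            # raw proportion
  keep <- prop > threshold                # truncated proportion: drop small shares
  adj  <- prop[keep] / sum(prop[keep])    # adjusted proportion: sums to one
  # multiply each retained occupation's profile by its weight, then sum
  colSums(skills[names(adj), , drop = FALSE] * adj)
}

profile <- weighted_profile(counts, skills)
print(profile)
```

If a field of study fed into a single occupation, the adjusted proportion would be one and the function would simply return that occupation's profile.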
We undertake a similar exercise to create a weighted average skill profile of the 25 aggregate occupations based on the 5 digit NOC skill profiles and weights based on the counts of workers.
Once we have the skill profiles associated with each of the fields of study and occupations, we can measure the distance between them. To begin with, we scale the skills and work activities to have a mean of zero and a standard deviation of one. Next we perform principal component analysis, and retain only the first five principal components. Finally we compute the Euclidean distance between each of the occupations and fields of study based on the first five principal components. The distances are plotted below, where colour indicates distance. The most striking pattern is that for “Assisting occupations” the distance is large across all fields of study, i.e. the yellow stripe in the bottom row.
heatmaply::heatmaply(dist_mat)
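The scale, PCA and distance steps can be sketched with base R alone. The data here are random numbers, the row and column names are hypothetical, and stacking occupations and fields of study into one matrix before the PCA is an assumption rather than something the text confirms.

```r
set.seed(1)

# Hypothetical input: 4 occupation profiles and 4 field-of-study profiles,
# each measured on 12 skills / work activities (random numbers for illustration)
profiles <- matrix(
  rnorm(8 * 12), nrow = 8,
  dimnames = list(c(paste0("occ_", 1:4), paste0("field_", 1:4)),
                  paste0("skill_", 1:12))
)

# Scale each skill to mean 0 and sd 1, then run the PCA
pca <- prcomp(profiles, center = TRUE, scale. = TRUE)

# Retain only the first five principal components
scores <- pca$x[, 1:5]

# Euclidean distance between every pair of rows, as a full matrix
all_dist <- as.matrix(dist(scores, method = "euclidean"))

# Keep the occupation-by-field block of the distance matrix
dist_mat_sketch <- all_dist[grep("^occ_", rownames(all_dist)),
                            grep("^field_", colnames(all_dist))]
round(dist_mat_sketch, 2)
```

Working on a handful of components rather than the full set of skills keeps the distance from being dominated by many highly correlated skill measures.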
Why is the NOC group “Assisting Occupations…” poorly matched? Based on the NOC-CIP differences, they appear to be generically over-skilled. The skills they apparently have (based on their field of study) exceed the skills required in this broad set of occupations.
# Average skill surplus by skill for NOC group 14
plt <- cip_noc_diff |>
  filter(census_noc_code == 14) |>
  select(contains("noc-cip")) |>
  pivot_longer(cols = everything()) |>
  mutate(name = str_sub(name, start = 8),  # drop the first seven characters of the column name
         value = -value) |>                # flip the sign so that positive values are a surplus
  group_by(name) |>
  summarize(value = mean(value)) |>
  ggplot(aes(value,
             fct_reorder(name, value),
             text = paste0("Skill: ",
                           name,
                           "\n Skill surplus: ",
                           round(value, 2)))) +
  geom_col() +
  labs(x = NULL,
       y = NULL,
       title = 'Skill surplus for Assisting occupations, care providers, student monitors, crossing guards...') +
  theme_minimal(base_size = 8)
# Convert to an interactive plot that shows the tooltip text on hover
plotly::ggplotly(plt, tooltip = "text")
The census public use microdata file contains a sample of roughly 1 million Canadians. We perform the following filtering of the data prior to analysis of the impact of skill mis-match on the employment income of those who work full year full time.
colnames(filtering_info) <- c("Filter applied","Observations left")
my_dt(filtering_info)
Note that the proportion of Canadians working full year full time is likely lower than normal, given COVID. This filtering leaves us with 20122 observations, 90% of which are randomly allocated to a training set, with the remaining 10% going to a testing set.
Using our filtered dataset, we can look at what fields of study are associated with what occupations. Note that it is the dispersion of occupations associated with a given field of study that leads to our measure of skill mis-match. e.g. if there was a field of study where every single person ends up in the same occupation, their skill mis-match (distance) would be zero. In contrast, the larger the set of destination occupations, the more likely it is that the skills developed during education do not match exactly the skills required in one of the resulting occupations.
# Alluvial plot of flows from field of study (CIP) to occupation (NOC)
filtered |>
  group_by(`NOC vs. Admin and financ...:`, `CIP vs. Agriculture...:`) |>
  count() |>
  ggplot(aes(axis1 = `CIP vs. Agriculture...:`, axis2 = `NOC vs. Admin and financ...:`, y = n)) +
  geom_alluvium(aes(fill = `CIP vs. Agriculture...:`)) +
  geom_stratum() +
  ggfittext::geom_fit_text(stat = "stratum", aes(label = after_stat(stratum)), width = 1/3, min.size = 1) +
  labs(fill = "Fields of Study") +
  theme_void() +
  theme(legend.position = "bottom")
So what we are after is the causal impact of a mis-match in skills (proxied by distance) on employment income. Given that we are using observational data, we must lean on the Conditional Independence Assumption: i.e. we must assume that distance is “as good as randomly assigned” when we condition on all the variables that influence both distance and employment income. We fit the model
\[\log(Employment~Income)=Age+Highest~Degree+ Language+ Gender+ Ethnicity+Occupation+Field~of~Study+distance+\mu\]
The plot below shows (most of) the regression results: Language and Gender estimates can be found in the regression table in the appendix. From the results you can see that:
But the main result we are after is the monetary penalty associated with a skill mis-match: \(\beta_{distance}~=~\)-0.008, which can be interpreted as “A one unit increase in distance causes a \(100*(\exp(\beta_{distance})-1)~=~\)-0.8% change in employment income.” In terms of its economic relevance, going from the 10th percentile of distance 3.71 (relatively well matched) to the 90th percentile of distance 13.33 (relatively poorly matched) causes a -7.9% change in employment income, ceteris paribus.
# Coefficient estimates with confidence intervals, one facet per variable
ggplot(mod1_coef, aes(estimate,
                      reorder_within(level, within = variable, by = estimate),
                      xmin = conf.low,
                      xmax = conf.high)) +
  geom_vline(xintercept = 0, col = "grey", lty = 2) +
  geom_point(size = .5) +
  geom_errorbarh(height = 0) +
  facet_wrap(~variable, scales = "free") +
  scale_y_reordered() +
  scale_x_continuous(labels = scales::percent) +
  labs(x = NULL,
       y = NULL) +
  theme_minimal()
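The conversion from a coefficient in a log-income regression to a percentage change can be sketched as follows. The helper `pct_change()` is hypothetical, and the inputs are the rounded values quoted in the text, so the second result will not exactly reproduce the reported percentile effect, which presumably rests on the unrounded estimate.

```r
# Hypothetical helper: percentage change in income implied by a coefficient
# from a log-income regression, for a change of `delta` units in the regressor
pct_change <- function(beta, delta = 1) {
  100 * (exp(beta * delta) - 1)
}

beta_distance <- -0.008   # rounded estimate quoted in the text
p10 <- 3.71               # 10th percentile of distance
p90 <- 13.33              # 90th percentile of distance

pct_change(beta_distance)              # effect of a one unit increase in distance
pct_change(beta_distance, p90 - p10)   # effect of moving from the 10th to the 90th percentile
```

Exponentiating before subtracting one matters more as the change in distance grows, because the approximation "coefficient times 100" only holds for small effects.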
Note that the model only explains about 30% of the in-sample variation in employment income, but even this might be overly optimistic. It is always a good idea to investigate the model’s performance out of sample, to make sure that we have not over-fit the model. Over-fitting occurs when the model fits the in-sample data well, but does not perform well out of sample. Below we look at how well the model performed out of sample (using the 10% of the sample that we held back). The model does a fairly decent job of predicting employment income when it is low, but does not do a good job of explaining employment income in excess of $300,000. Note that the root mean squared error in sample is 0.444 whereas it is 0.437 out of sample: i.e. there is no evidence of over-fitting.
# Predicted vs. actual income in the test set, on log10 axes with a 45-degree reference line
ggplot(test_w_pred, aes(exp(prediction), exp(log_income))) +
  geom_abline(slope = 1, intercept = 0, col = "white", lwd = 2) +
  geom_point(alpha = .1) +
  scale_x_continuous(trans = "log10", labels = scales::dollar) +
  scale_y_continuous(trans = "log10", labels = scales::dollar) +
  labs(x = "Prediction",
       y = "Actual",
       title = "Test set prediction errors")
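The hold-out comparison can be illustrated on simulated data. Everything below is a sketch under assumptions: the data frame `sim`, its variables and the helper `rmse()` are hypothetical stand-ins, and the model is far smaller than the one fitted in the text.

```r
set.seed(42)
n <- 2000

# Simulated stand-in data; variable names and coefficients are hypothetical
sim <- data.frame(
  distance = runif(n, 2, 15),
  age      = sample(20:64, n, replace = TRUE)
)
sim$log_income <- 10.3 - 0.008 * sim$distance + 0.01 * sim$age + rnorm(n, sd = 0.44)

# Randomly hold back 10% of the rows as a test set
test_rows <- sample(n, size = round(0.1 * n))
train <- sim[-test_rows, ]
test  <- sim[test_rows, ]

# Fit on the training set only
fit <- lm(log_income ~ distance + age, data = train)

# Root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_in  <- rmse(train$log_income, fitted(fit))
rmse_out <- rmse(test$log_income, predict(fit, newdata = test))
c(in_sample = rmse_in, out_of_sample = rmse_out)
```

An out-of-sample error that is close to the in-sample error, as in the figures reported above, is what one would expect from a model that has not been over-fit.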
By costlessly displacing over a third of the labour market the social planner was able to marginally improve social outcomes, in the sense that any risk neutral or risk averse individual behind the veil of ignorance would prefer the government to intervene. In reality, switching occupations is not without cost, so a more realistic intervention would likely focus on those entering the labour market. Thus the above exercise puts an upper bound on what we can hope to achieve by improving labour market skill matching.
In the analysis above we only considered those who were working full year full time in 2020: here we look at whether skills mis-match causes a difference in the odds of working full year full time. We perform the following filtering of the data prior to analyzing the impact of skills mis-match on the odds of working full year full time.
colnames(filtering_info2) <- c("Filter applied","Observations left")
my_dt(filtering_info2)
So what we are after is the causal impact of a mis-match in skills (proxied by distance) on the probability of being employed full year full time. Given that we are using observational data, we must lean on the Conditional Independence Assumption: i.e. we must assume that distance is “as good as randomly assigned” when we condition on all the variables that influence both distance and whether employed full time full year. We fit the model
\[\operatorname{logit}(Employed~FT~FY)=Age+Highest~Degree+ Language+ Gender+ Ethnicity+Occupation+Field~of~Study+distance+\mu\]
The plot below shows (most of) the regression results: Language and Gender estimates can be found in the regression table which follows. From the results you can see that the probability of working full year full time is:
But the main result we are after is how distance affects the probability of being employed full time full year: the probability of working full year full time decreases by 0.5% for every one unit increase in distance. In terms of its economic relevance, going from the 10th percentile of distance 3.71 (relatively well matched) to the 90th percentile of distance 13.67 (relatively poorly matched) causes a 5.4% reduction in the probability of being employed full time full year, ceteris paribus.
# Average marginal effects (AME) with confidence intervals, one facet per variable
margins_mod2 |>
  mutate(variable = str_remove_all(variable, "`"),
         level = if_else(is.na(level), variable, level),
         level = str_trunc(level, 50)) |>
  filter(!variable %in% c("(Intercept)",
                          "Language vs. not english",
                          "Gender vs. Woman+")) |>
  arrange(variable, level) |>
  ggplot(aes(AME,
             reorder_within(level, within = variable, by = AME),
             xmin = lower,
             xmax = upper)) +
  geom_vline(xintercept = 0, col = "grey", lty = 2) +
  geom_point(size = .5) +
  geom_errorbarh(height = 0) +
  facet_wrap(~variable, scales = "free") +
  scale_y_reordered() +
  scale_x_continuous(labels = scales::percent) +
  labs(x = NULL,
       y = NULL) +
  theme_minimal()
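For a continuous regressor in a logit model, an average marginal effect is the coefficient multiplied by p(1 - p), averaged over the observations. The sketch below shows that calculation on simulated data; the data frame `sim` and its variables are hypothetical, and it is not claimed that `margins_mod2` was computed this way.

```r
set.seed(7)
n <- 5000

# Simulated stand-in data; names and coefficients are hypothetical
sim <- data.frame(distance = runif(n, 2, 15))
sim$ft_fy <- rbinom(n, 1, plogis(1 - 0.05 * sim$distance))

# Logit model of full time full year status on distance
fit <- glm(ft_fy ~ distance, family = binomial, data = sim)

# Fitted probabilities for each observation
p_hat <- fitted(fit)

# AME of distance: derivative of the fitted probability, averaged over the sample
ame_distance <- mean(coef(fit)["distance"] * p_hat * (1 - p_hat))
ame_distance
```

Because p(1 - p) can never exceed 0.25, the marginal effect on the probability scale is always at most a quarter of the log-odds coefficient in absolute value.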
We use the logit model to form predictions based on the test data. These predictions are converted to probabilities of being employed full time full year, and then rounded to either zero or one. We then can look at the out of sample prediction accuracy of the model via a confusion matrix:
confusion
                   prediction
full_time_full_year    0    1
                  0  610  812
                  1  340 1736
From the confusion matrix we can see that the model predicts full time full year employment with 67% accuracy. We can test the correspondence between the observed and predicted values using Pearson’s Chi-squared test with Yates’ continuity correction, i.e. what is the probability that we would get an association this strong between prediction and actual if the null hypothesis (no relationship) is true? \[p \approx 6.36 \times 10^{-67}\] Let’s compare this result to the null model, where we randomly assign either a zero or a one to each observation in the test set using the probabilities from the training set.
null_confusion
                   null_prediction
full_time_full_year    0    1
                  0  589  833
                  1  827 1249
The null model predicts full time full year employment with 53% accuracy. Again, we can use Pearson’s Chi-squared test with Yates’ continuity correction to assess the probability that we would get a result this extreme if the null hypothesis is true: p = 0.367.
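Accuracy and the Chi-squared test can be recomputed directly from the two printed tables. The sketch below re-enters the counts by hand into hypothetical objects (`confusion_sketch`, `null_sketch`) and uses a hypothetical helper `accuracy()`; base R's `chisq.test()` applies Yates' continuity correction to 2x2 tables by default.

```r
# Counts re-entered from the printed tables (rows = actual, columns = predicted)
confusion_sketch <- matrix(
  c(610, 340, 812, 1736), nrow = 2,
  dimnames = list(full_time_full_year = c("0", "1"), prediction = c("0", "1"))
)
null_sketch <- matrix(
  c(589, 827, 833, 1249), nrow = 2,
  dimnames = list(full_time_full_year = c("0", "1"), null_prediction = c("0", "1"))
)

# Hypothetical helper: share of observations on the diagonal (correct predictions)
accuracy <- function(m) sum(diag(m)) / sum(m)

accuracy(confusion_sketch)
accuracy(null_sketch)

# Pearson's Chi-squared test; correct = TRUE (Yates) is the default for 2x2 tables
chisq.test(confusion_sketch)$p.value
chisq.test(null_sketch)$p.value
```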
For those of you who like regression tables… Note that the logit coefficients are reported on the log-odds scale (not as probabilities).
stargazer::stargazer(mod1, mod2, type = "html",
se = list(robust_se1, robust_se2))
Dependent variable (standard errors in parentheses):

| | (1) log_income, OLS | (2) full_time_full_year, logistic |
|---|---|---|
| Age vs. 20-24: 25 to 29 years | 0.182*** (0.021) | 0.956*** (0.066) |
| Age vs. 20-24: 30 to 34 years | 0.345*** (0.021) | 1.201*** (0.065) |
| Age vs. 20-24: 35 to 39 years | 0.442*** (0.021) | 1.286*** (0.066) |
| Age vs. 20-24: 40 to 44 years | 0.507*** (0.022) | 1.526*** (0.069) |
| Age vs. 20-24: 45 to 49 years | 0.535*** (0.022) | 1.638*** (0.069) |
| Age vs. 20-24: 50 to 54 years | 0.561*** (0.022) | 1.677*** (0.070) |
| Age vs. 20-24: 55 to 59 years | 0.523*** (0.022) | 1.463*** (0.070) |
| Age vs. 20-24: 60 to 64 years | 0.511*** (0.024) | 1.125*** (0.075) |
| Age vs. 20-24: 65 to 69 years | 0.489*** (0.033) | 0.419*** (0.089) |
| Age vs. 20-24: Unknown | 0.424*** (0.106) | 0.751*** (0.270) |
| Degree vs. non-apprentice: Apprenticeship certificate | 0.075*** (0.022) | 0.117 (0.076) |
| Degree vs. non-apprentice: Less than 1 year College | 0.029 (0.021) | 0.112 (0.075) |
| Degree vs. non-apprentice: 1-2 years of College | 0.096*** (0.020) | 0.254*** (0.070) |
| Degree vs. non-apprentice: More than 2 years of College | 0.108*** (0.022) | 0.201*** (0.077) |
| Degree vs. non-apprentice: University certificate or diploma | 0.091*** (0.023) | 0.222*** (0.080) |
| Degree vs. non-apprentice: Bachelor’s degree | 0.220*** (0.020) | 0.332*** (0.069) |
| Degree vs. non-apprentice: University diploma above bachelor level | 0.272*** (0.027) | 0.237** (0.095) |
| Degree vs. non-apprentice: Medicine, dentistry, veterinary, optometry | 0.319*** (0.070) | 0.033 (0.162) |
| Degree vs. non-apprentice: Master’s degree | 0.296*** (0.022) | 0.203*** (0.077) |
| Degree vs. non-apprentice: PhD | 0.473*** (0.036) | 0.450*** (0.131) |
| Degree vs. non-apprentice: Unknown | 0.182** (0.079) | -0.265 (0.217) |
| Language vs. not english: English | 0.148*** (0.011) | 0.224*** (0.039) |
| Language vs. not english: Unknown | -0.404 (0.256) | -0.710 (0.526) |
| Gender vs. Woman+: Man+ | 0.187*** (0.008) | 0.514*** (0.030) |
| Ethnicity vs. White: South Asian | -0.067*** (0.014) | 0.062 (0.050) |
| Ethnicity vs. White: Chinese | -0.037*** (0.012) | 0.006 (0.046) |
| Ethnicity vs. White: Black | -0.138*** (0.039) | -0.167 (0.139) |
| Ethnicity vs. White: Filipino | -0.167*** (0.015) | 0.276*** (0.063) |
| Ethnicity vs. White: Arab | -0.001 (0.064) | -0.270 (0.199) |
| Ethnicity vs. White: Latin American | -0.104*** (0.028) | -0.192* (0.100) |
| Ethnicity vs. White: Southeast Asian | -0.083*** (0.031) | -0.172 (0.120) |
| Ethnicity vs. White: West Asian | -0.195*** (0.031) | -0.387*** (0.103) |
| Ethnicity vs. White: Korean | -0.163*** (0.030) | -0.421*** (0.102) |
| Ethnicity vs. White: Japanese | -0.034 (0.043) | -0.389*** (0.135) |
| Ethnicity vs. White: Other population groups, n.i.e. | -0.139*** (0.053) | -0.575*** (0.206) |
| Ethnicity vs. White: Other multiple population groups | -0.054*** (0.019) | 0.042 (0.071) |
| Ethnicity vs. White: Indigenous peoples | -0.060*** (0.018) | -0.166** (0.065) |
| Ethnicity vs. White: Unknown | -0.132*** (0.020) | -0.148** (0.069) |
| NOC vs. Admin and financ...: Administrative and financial support and supply chain logistics occupations | -0.101*** (0.022) | -0.424*** (0.097) |
| NOC vs. Admin and financ...: Administrative occupations and transportation logistics occupations | -0.083*** (0.020) | -0.269*** (0.092) |
| NOC vs. Admin and financ...: Assisting occupations in support of health services | -0.055** (0.026) | -0.509*** (0.109) |
| NOC vs. Admin and financ...: Assisting occupations, care providers, student monitors, crossing guards and related occupations in education and in legal and public protection | 0.060 (0.042) | -1.202*** (0.147) |
| NOC vs. Admin and financ...: Frontline public protection services and paraprofessional occupations in legal, social, community, education services | 0.030 (0.022) | -0.430*** (0.094) |
| NOC vs. Admin and financ...: General trades | -0.052* (0.028) | -0.549*** (0.118) |
| NOC vs. Admin and financ...: Helpers and labourers and other transport drivers, operators and labourers | 0.006 (0.038) | -0.727*** (0.126) |
| NOC vs. Admin and financ...: Mail and message distribution, other transport equipment operators and related maintenance workers | -0.095** (0.045) | -0.450** (0.193) |
| NOC vs. Admin and financ...: Middle management occupations | 0.304*** (0.022) | 0.413*** (0.095) |
| NOC vs. Admin and financ...: Occupations in natural resources, agriculture and related production | 0.060 (0.054) | -1.379*** (0.154) |
| NOC vs. Admin and financ...: Occupations in processing, manufacturing and utilities | 0.059* (0.032) | -0.415*** (0.119) |
| NOC vs. Admin and financ...: Occupations in sales and services | -0.063* (0.036) | -0.776*** (0.107) |
| NOC vs. Admin and financ...: Other occupations in art, culture and sport | -0.039 (0.049) | -1.633*** (0.166) |
| NOC vs. Admin and financ...: Professional and technical occupations in art, culture and sport | 0.164*** (0.028) | -0.518*** (0.112) |
| NOC vs. Admin and financ...: Professional occupations in business and finance | 0.223*** (0.022) | 0.191** (0.094) |
| NOC vs. Admin and financ...: Professional occupations in health | 0.375*** (0.024) | -0.295*** (0.101) |
| NOC vs. Admin and financ...: Professional occupations in law, education, social, community and government services | 0.130*** (0.021) | 0.153* (0.090) |
| NOC vs. Admin and financ...: Professional occupations in natural and applied sciences | 0.278*** (0.023) | 0.432*** (0.101) |
| NOC vs. Admin and financ...: Retail sales and service supervisors and specialized occupations in sales and services | -0.059* (0.031) | -0.404*** (0.115) |
| NOC vs. Admin and financ...: Sales and service representatives and other customer and personal services occupations | -0.157*** (0.025) | -1.168*** (0.090) |
| NOC vs. Admin and financ...: Sales and service support occupations | -0.260*** (0.030) | -1.320*** (0.106) |
| NOC vs. Admin and financ...: Technical occupations in health | 0.076*** (0.027) | -0.541*** (0.111) |
| NOC vs. Admin and financ...: Technical occupations related to natural and applied sciences | 0.030 (0.022) | -0.104 (0.104) |
| NOC vs. Admin and financ...: Technical trades and transportation officers and controllers | 0.106*** (0.022) | -0.458*** (0.095) |
| CIP vs. Agriculture...: Architecture, engineering, and related trades | 0.147*** (0.021) | 0.035 (0.094) |
| CIP vs. Agriculture...: Business, management and public administration | 0.083*** (0.021) | -0.048 (0.092) |
| CIP vs. Agriculture...: Education | -0.079*** (0.024) | -0.150 (0.104) |
| CIP vs. Agriculture...: Health and related fields | 0.048** (0.023) | -0.212** (0.097) |
| CIP vs. Agriculture...: Humanities | -0.078*** (0.025) | -0.364*** (0.102) |
| CIP vs. Agriculture...: Mathematics, computer and information sciences | 0.155*** (0.025) | 0.068 (0.106) |
| CIP vs. Agriculture...: Personal, protective and transportation services | 0.081*** (0.027) | -0.239** (0.106) |
| CIP vs. Agriculture...: Physical and life sciences and technologies | 0.040 (0.025) | -0.028 (0.106) |
| CIP vs. Agriculture...: Social and behavioural sciences and law | 0.047** (0.022) | -0.197** (0.094) |
| CIP vs. Agriculture...: Visual and performing arts, and communications technologies | -0.055** (0.027) | -0.410*** (0.108) |
| distance | -0.008*** (0.001) | -0.026*** (0.005) |
| Constant | 10.309*** (0.039) | -0.806*** (0.149) |
| Observations | 18,228 | 31,433 |
| R² | 0.301 | |
| Adjusted R² | 0.298 | |
| Log Likelihood | | -18,848.930 |
| Akaike Inf. Crit. | | 37,845.860 |
| Residual Std. Error | 0.444 (df = 18154) | |
| F Statistic | 107.239*** (df = 73; 18154) | |

Note: *p<0.1; **p<0.05; ***p<0.01
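To read a column (2) estimate as an effect on the odds, exponentiate it. The short sketch below does this for the distance coefficient, using the rounded value printed in the table; the variable names are hypothetical.

```r
# Rounded distance coefficient from column (2), on the log-odds scale
log_odds_distance <- -0.026

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(log_odds_distance)
odds_ratio

# The same effect expressed as a percentage change in the odds per unit of distance
100 * (odds_ratio - 1)
```

An odds ratio below one means the odds of working full time full year fall as distance rises. Translating this into a change in probability requires the marginal effects reported earlier.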