Historical extract from iworkforsa - lab book

ofchurches
28 June 2021

Background

This markdown is a preliminary investigation of the data provided by the Office of the Commissioner for Public Sector Employment.The data contains details from every advertisement on iworkforsa.gov.au.

Note that these data have been classified as "Public" by the OCPSE.

Packages

library(readxl)
library(tidyverse)
library(here)
library(lubridate)
library(kableExtra)
library(naniar)
library(patchwork)
library(skimr)
library(DT)
library(ComplexUpset)
library(tidytext)
library(tidyr)
library(stringr)
library(scales)
library(widyr)
library(tidygraph)
library(ggraph)
library(netrankr)

Import, arrange and format

path <- here("IWORK4SA Data 2017-2021, Classification-Public.xlsx")

advertiments <- path %>% 
  excel_sheets() %>%
  map_dfr(
    function(sheet) {
      temp = read_excel(sheet, path = path, skip = 8)
      mutate(temp, Year = sheet)
    }
  )

Checks and balances

Check the top and bottom of the data

Data read in from xlsx files often has bits of extra information above and below the actual "data". To check that we haven't accidentally imported that we can have a look at the top and bottom of our data set.

advertiments %>%
  slice_head(n = 6) %>%
  kable() %>%
  kable_styling()

Date Publish To Public Drill	Division	Position Status	I Workfor SA Classification Level	Position ID	Position ID Job Title	I Workfor SA Vacancy Type	I Workfor SA Part Time	I Workfor SA Part Time Hours	I Workfor SA Location	I Workfor SA Min Salary	I Workfor SA Max Salary	I Workfor SA Employment Status	I Workfor SA Job Category	Year
2017-11-04	Department for Child Protection	Position Closed	AHP3 - Allied Health Professional	294895	294895 - Supervisor	Open to Everyone	No	0	5540 - PORT PIRIE	92757	98896	Ongoing	Community and Social Services, Child, Aged and Disability Care	2017
2017-11-04	Department for Child Protection	Position Closed	PO3 - Professional Officer	294897	294897 - Supervisor	Open to Everyone	No	0	5540 - PORT PIRIE	90029	95487	Ongoing	Community and Social Services, Child, Aged and Disability Care	2017
2017-11-04	Department for Education	Position Closed	AHP3 - Allied Health Professional	294898	294898 - Psychologist, Centre for Hearing Impaired	Open to Everyone	No	45	5000 - ADELAIDE	92757	98896	Ongoing	Community and Social Services, Child, Aged and Disability Care	2017
2017-11-04	Department for Education	Position Closed	AHP2 - Allied Health Professional	294899	294899 - Speech Pathologist, Centre for Hearing Impaired	Open to Everyone	No	45	5000 - ADELAIDE	77752	90029	Ongoing	Medical	2017
2017-11-04	Department for Infrastructure and Transport	Position Closed	ASO5 - Administrative Services Officer	294900	294900 - Senior Asset Inspection Officer	Open to Everyone	No	0	5037 - NETLEY	75430	84497	Short Term Contract	Maintenance and Facilities	2017
2017-11-04	Department for Infrastructure and Transport	Position Closed	ASO6 - Administrative Services Officer	294901	294901 - Asset Assurance Manager	Open to Everyone	No	0	5037 - NETLEY	87384	92743	Short Term Contract	Maintenance and Facilities	2017

advertiments %>%
  slice_tail(n = 6) %>%
  kable() %>%
  kable_styling()

Date Publish To Public Drill	Division	Position Status	I Workfor SA Classification Level	Position ID	Position ID Job Title	I Workfor SA Vacancy Type	I Workfor SA Part Time	I Workfor SA Part Time Hours	I Workfor SA Location	I Workfor SA Min Salary	I Workfor SA Max Salary	I Workfor SA Employment Status	I Workfor SA Job Category	Year
NA	TAFE SA	Created	NA	436727	436727 - Default template	NA	No	NA	5000 - ADELAIDE	NA	NA	NA	NA	2021
NA	Office of the Commissioner for Public Sector Employment	Created	NA	422698	422698 - Default template	NA	No	NA	5000 - ADELAIDE	NA	NA	NA	NA	2021
NA	Department for Correctional Services	Created	NA	434022	434022 - DCS template	NA	No	NA	5000 - ADELAIDE	NA	NA	NA	NA	2021
NA	Green Industries SA	Created	NA	439435	439435 - Default template	NA	No	NA	5000 - ADELAIDE	NA	NA	NA	NA	2021
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2021
Report created on: 9/06/2021 3:42:28 PM	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2021

It looks like we've imported some extra rows that we don't want because they were footer details and some NA whole rows. We can remove them.

advertiments_clean <- advertiments%>%
  filter(str_detect(`Date Publish To Public Drill`, "Report created") == FALSE | is.na(`Date Publish To Public Drill`) == TRUE) %>%
  filter(is.na(`Position ID Job Title`) == FALSE)

Just to check this again...we imported five sheets from the xlsx file so advertiments_clean should now have five less rows than advertisments. And, the difference in rows is: 10!

Change format

Let's put the variables in a format that we can treat appropriately.

advertiments_formatted <- advertiments_clean %>% 
  mutate(`Date Publish To Public Drill` = ymd(`Date Publish To Public Drill`)) %>%
  mutate(`I Workfor SA Min Salary` = as.numeric(`I Workfor SA Min Salary`)) %>%
  mutate(`I Workfor SA Max Salary` = as.numeric(`I Workfor SA Max Salary`))

What are we working with

It would be good to have a look at the data types in each variable to work out if its suitable for what we'll want to do in the analysis.

advertiments_formatted %>%
  skim()

Data summary

Name	Piped data
Number of rows	35532
Number of columns	15
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
Column type frequency:
character	12
Date	1
numeric	2
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
Group variables	None

Variable type: character

skim\_variable	n\_missing	complete\_rate	min	max	n\_unique
Division	0	1.00	7	55	69
Position Status	0	1.00	4	32	9
I Workfor SA Classification Level	887	0.98	9	74	325
Position ID	0	1.00	6	6	35532
Position ID Job Title	0	1.00	8	163	35532
I Workfor SA Vacancy Type	448	0.99	16	31	2
I Workfor SA Part Time	0	1.00	2	3	2
I Workfor SA Part Time Hours	21410	0.40	1	14	383
I Workfor SA Location	40	1.00	11	44	320
I Workfor SA Employment Status	451	0.99	6	19	4
I Workfor SA Job Category	453	0.99	6	62	41
Year	0	1.00	4	4	5

Variable type: Date

skim\_variable	n\_missing	complete\_rate	min	max	median	n\_unique
Date Publish To Public Drill	2384	0.93	2017-11-04	2021-06-08	2019-09-24	1017

Variable type: numeric

skim\_variable	n\_missing	complete\_rate	mean	sd	p0	p25	p50	p75	p100	hist
I Workfor SA Min Salary	826	0.98	75939.52	30251.57	0	60681	75430	92784	324141	<U+2586><U+2587><U+2581><U+2581><U+2581>
I Workfor SA Max Salary	826	0.98	87480.98	38156.83	0	66368	88420	101685	413381	<U+2585><U+2587><U+2581><U+2581><U+2581>

The completeness of Position ID Job Title looks really promising and the completeness of I Workfor SA Min Salary and I Workfor SA Max Salary is hopeful too.

Missing data

advertiments_formatted %>%
  gg_miss_var(show_pct = TRUE, facet = Year) +
  labs(title = str_wrap("Percent of missing data for each variable in `advertisments`", width = 50))

It seems that all the variables except I Workfor SA Part Time Hours have minimal missing values and none of this changes much over time.

gg_miss_upset(advertiments_formatted)

It's reassuring that I Workfor SA Min Salary and I Workfor SA MAX Salary are only missing in combination with each other (and sometimes with I Workfor SA Part Time Hours as well).

The only note that appears particularly pertinent from these analyses is that I Workfor SA Part Time Hours should be treated with caution. We should note that the variables: Division, Position Status, Position ID, Position ID Job Title, I Workfor SA Part Time and Year are all complete with no missing values.

Analyses

"Data" jobs

The motivation for getting these data was to look at advertisements for "data" jobs to investigate questions such as:

Are they increasing as a proportion of the advertisements?
What departments are recruiting them?
What sort of work are they being anticipated to do?

Definition and caveats

To answer this question we need a definition of "data" jobs.

We can do this by including words and word-parts that signify the sort of job we are interested in. It is worth stating up front that this is necessarily a decision process rather than a data process. That is, the jobs we end up with classified as "data" and "not data" will be so classified because of our subjective decision not because of some objective data. In addition, this has a signal detection characteristic in that there will be errors due both to some "not data" jobs which end up classified as "data jobs" and some "data jobs" that end up classified as "not data jobs".

Definition

Regardless, we can set the definition and accept these caveats. Importantly, because the new variable Defined as a data job is based on Position ID Job Title, there will be no missing values in this new variable. Its worth noting that this step can (and should) be changed as we think more or less terms are applicable.

Its worth noting that there is a variable called I Workfor SA Job Category that is fairly complete. However, because its not actually complete, its not suitable for basing our definition on.

Data jobs will be defined as Position ID Job Title that contain:

data_job_definitions <- c("data", "analy", "information", "intelligence", "statistic", "scien", "research", "tech")

Then we can append this definition to the data frame:

advertiments_defined <- advertiments_formatted %>% 
  mutate(`Position ID Job Title` = str_to_lower(`Position ID Job Title`)) %>%
  mutate(`Defined as a data job` = ifelse(str_detect(`Position ID Job Title`, 
                                                     paste(data_job_definitions, collapse = "|")
                                                     ), 
         "Data", "Not data"))

As a check on our definitions, we should see how often each occur.

data_job_type <- advertiments_defined %>%
  select(`Position ID Job Title`) %>%
  mutate(`Is "data"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`), 
                                         data_job_definitions[1]), 1, 0)) %>%
  mutate(`Is "analy"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`), 
                                         data_job_definitions[2]), 1, 0)) %>%
  mutate(`Is "information"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`), 
                                         data_job_definitions[3]), 1, 0)) %>%
  mutate(`Is "intelligence"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`), 
                                         data_job_definitions[4]), 1, 0)) %>%
  mutate(`Is "statistic"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`), 
                                         data_job_definitions[5]), 1, 0)) %>%
  mutate(`Is "scien"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`),
                                         data_job_definitions[6]), 1, 0)) %>%
  mutate(`Is "research"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`),
                                         data_job_definitions[7]), 1, 0)) %>%
    mutate(`Is "tech"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`),
                                         data_job_definitions[8]), 1, 0)) 

data_job_type %>%
  select(-`Position ID Job Title`) %>%
  pivot_longer(cols = everything(), names_to = "Data job type", values_to = "Count") %>%
  group_by(`Data job type`) %>%
  summarise(Count = sum(Count)) %>%
  ggplot(aes(x = reorder(`Data job type`, -Count), y = Count, fill = `Data job type`)) + 
  geom_col() + 
  theme(legend.position = "none") + 
  labs(x = "Data job type", title = "Counts of data job types")

This seems pretty plausible. Lets also check how they join together within a single Position ID Job Title.

data_job_type %>%
  select(-`Position ID Job Title`) %>%
  filter_all(any_vars(. == 1)) %>%
  upset(colnames(data_job_type[2:9]))

So there were 5 "data scientists". Seems pretty plausible

Before we move onto the answers to our questions, lets just have a look at the most frequent "Data" and "Not data" jobs.

advertiments_title <- advertiments_defined %>%
  mutate("Position ID check" = str_extract(`Position ID Job Title`, `Position ID`)) %>%
  mutate("Job title" = str_remove(`Position ID Job Title`, paste0(`Position ID`, " - ")))

Did we extract the Position ID correctly? Yes!

So lets remove the Position ID from the Position ID Job Title to get the JOb Title.

advertiments_title %>%
  group_by(`Defined as a data job`) %>%
  count(`Job title`) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(`Defined as a data job` = as.factor(`Defined as a data job`),
         `Job title` = reorder_within(`Job title`, n, `Defined as a data job`)) %>%
  ggplot(aes(`Job title`, n, fill = `Defined as a data job`)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~`Defined as a data job`, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count of times 'Job title' was advertised",
       x = NULL,
       title = "Most common job titles by category")

## Selecting by n

This raises some serious side questions such as:

what does a "iworkfoesa template" do for a job?
There are a bunch of roles with the first word of "Senior". I wonder if there are comparable Job title without that prefix?

Answers

Are "data jobs" increasing as a proportion of the advertisments?

increasing_data <- advertiments_title %>%
  group_by(`Defined as a data job`, Year) %>%
  count(name = "Number of advertisments") %>%
  group_by(Year) %>%
  mutate(`Proportion within each year`= `Number of advertisments`/sum(`Number of advertisments`))

count_plot <- increasing_data %>%
  ggplot(aes(x = Year, y = `Number of advertisments`, colour = `Defined as a data job`, group = `Defined as a data job`)) + 
  geom_line() + 
  scale_y_log10() + 
  labs(title = "Count of jobs on iworkforsa each year by definition")

prop_plot <- increasing_data %>%
  ggplot(aes(x = Year, y = `Proportion within each year`, colour = `Defined as a data job`, group = `Defined as a data job`)) + 
  geom_line() + 
  scale_y_log10() + 
  labs(title = "Proportion of jobs on iworkforsa each year by definition")

count_plot/prop_plot

increasing_data %>%
  filter(`Defined as a data job` == "Data") %>%
  kable() %>%
  kable_styling()

Defined as a data job	Year	Number of advertisments	Proportion within each year
Data	2017	114	0.0566882
Data	2018	644	0.0705135
Data	2019	657	0.0732441
Data	2020	690	0.0673894
Data	2021	357	0.0689322

This certainly doesn't make it look as though there has been an explosion in the proportion of "data" jobs advertised over the last four years. There were 46 more "data" jobs in 2020 than in 2018. While this is a 93.3333333 percent change in the number of these jobs that were advertised, it is worth noting that this is only a change from a percent of 7.0513522 in 2018 to 6.7389393 in 2020 of the overall number of jobs that were advertised. That is a change of -0.3124129 percent.

What departments are recruiting them?

departments_data <- advertiments_title %>%
  group_by(Division, `Defined as a data job`) %>%
  count(name = "Number of advertisments in division") %>%
  ungroup() %>%
  complete(Division, `Defined as a data job`, fill = list(`Number of advertisments in division` = 0)) %>%
  group_by(Division) %>%
  mutate("Division total" = sum(`Number of advertisments in division`)) %>%
  ungroup() %>%
  mutate("Percent" = (`Number of advertisments in division`/`Division total`)*100)

most_data_departments <- departments_data %>%
  filter(`Defined as a data job` == "Data") %>%
  slice_max(order_by = Percent, n = 12) %>%
  pull(Division)

departments_data  %>%
  filter(Division %in% most_data_departments) %>%
  mutate(Division = factor(Division, levels = most_data_departments, ordered = TRUE)) %>%
  ggplot(aes(x = Division, 
             y = Percent, 
             fill = `Defined as a data job`)) + 
  geom_col(position = "dodge") + 
  labs(title = "Divisions with the greatest proportion of 'data' job advertiments", 
       x = "") + 
  scale_x_discrete(limits=rev) +
  coord_flip()

A more comprehensive investigation of these data can be made in the table below:

datatable(departments_data)

So, the Department of Health and Wellbeing has the greatest number of "data" job advertisements: 1325. But, it also has one of the smallest percentages of data job advertisements: 5.9727732%.

It is worth noting that there are multiple explanations for the differences in the number of advertisements of "data" jobs between the Divisions. * Divisions that "data" people like working in may keep their staff for longer and so advertise less * Some divisions may have recruited more "data" people in an earlier time period * Some divisions may not be hiring as many "data" people as they think they are

What sort of work are they being anticipated to do?

Using the Position ID Job Title, we can see what other words appear alongside those used to define Defined as a data job.

Lets start by looking at the words most frequently occurring in those jobs that we classified as Defined as a data job == "Data".

advertiments_unnest <- advertiments_title %>%
  #select(`Defined as a data job`, `Job title`) %>%
  unnest_tokens(Word, `Job title`)

advertiments_unnest %>%
  filter(`Defined as a data job` == "Data") %>%
  count(Word, name = "Number of occurances") %>%
  slice_max(order_by = `Number of occurances`, n = 12) %>%
  ggplot(aes(x = reorder(Word, `Number of occurances`), y = `Number of occurances`, fill = Word)) +
  geom_col() + 
  coord_flip()  + 
  theme(legend.position = "none") + 
  labs(x = "", title = "Most used words used in 'data' job advertisments")

It would be more instructive to see the words that most differentiate between "Data" and "Not data" jobs:

advertisments_frequency <- advertiments_unnest %>% 
  group_by(`Defined as a data job`) %>% 
  count(Word, sort = TRUE) %>% 
  left_join(advertiments_unnest %>% 
              group_by(`Defined as a data job`) %>% 
              summarise(total = n())) %>%
  mutate(Frequency = n/total) %>% 
  select(`Defined as a data job`, Word, Frequency) %>% 
  pivot_wider(names_from = `Defined as a data job`, values_from = Frequency) %>%
  arrange(Data, `Not data`)

ggplot(advertisments_frequency, aes(Data, `Not data`)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = Word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

Words near the red line are used with similar frequencies in the Position ID Job Title of "Data" and "Not data" jobs. Words further from the red line are used disproportionately in one set of Position ID Job Title. Hence, "programmer" is used predominantly in "Data" jobs while "social" is used predominantly in "Not data" jobs.

We could also look at the log-odds ratios between the categories:

word_ratios <- advertiments_unnest %>%
  count(Word, `Defined as a data job`) %>%
  group_by(Word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  pivot_wider(names_from = `Defined as a data job`, values_from = n, values_fill = 0) %>%
  mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
  mutate(logratio = log(`Data` / `Not data`)) %>%
  arrange(desc(logratio))

word_ratios %>%
  group_by(logratio < 0) %>%
  slice_max(abs(logratio), n = 15) %>% 
  ungroup() %>%
  mutate(word = reorder(Word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col() +
  coord_flip() +
  ylab("log odds ratio") +
  scale_fill_discrete(name = "", labels = c("Data", "Not dat"))

So, a lot of the words that are found have the actual stem we included in the definition of "Data" jobs in the first place. But removing them would remove both "technical" and "technician" and its interesting to note that both are present.

Finally, it might be instructive to see which words are used together in the "Data" job advertisements.

advertiments_unnest %>% 
  filter(`Defined as a data job` == "Data") %>%
  select(Word, `Position ID Job Title`) %>%
  rename(Title = `Position ID Job Title`) %>%
  pairwise_count(Word, Title, sort = TRUE, upper = FALSE) %>%
  filter(n >= 15) %>%
  as_tbl_graph() %>%
  to_undirected %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree() / local_ave_degree()) %>%
  mutate(community = as.factor(group_fast_greedy())) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(aes(size = degree, colour = community)) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() + 
  theme(legend.position = "none")

It appears that there are some meaningful clusters. Within the connected component, the two biggest clusters seem associated with roles that might be related to "business intelligence" and "freedom of information". Its worth noting that "data" itself is quite central to the network overall.

Its worth seeing what are the most central words in the network of "data" jobs:

top <- advertiments_unnest %>% 
  filter(`Defined as a data job` == "Data") %>%
  select(Word, `Position ID Job Title`) %>%
  rename(Title = `Position ID Job Title`) %>%
  pairwise_count(Word, Title, sort = TRUE, upper = FALSE) %>%
  as_tbl_graph() %>%
  to_undirected %>%
  activate(nodes) %>%
  mutate(centrality = centrality_betweenness_rsp_simple()) %>%
  as.data.frame() %>%
  slice_max(order_by = centrality, n = 15) %>%
  ggplot(aes(x = reorder(name, centrality), y = centrality, fill = name)) + 
  geom_col()+ 
  theme(legend.position = "none") + 
  coord_flip() + 
  labs(x = NULL)

bottom <- advertiments_unnest %>% 
  filter(`Defined as a data job` == "Data") %>%
  select(Word, `Position ID Job Title`) %>%
  rename(Title = `Position ID Job Title`) %>%
  pairwise_count(Word, Title, sort = TRUE, upper = FALSE) %>%
  as_tbl_graph() %>%
  to_undirected %>%
  activate(nodes) %>%
  mutate(centrality = centrality_betweenness_rsp_simple()) %>%
  as.data.frame() %>%
  slice_min(order_by = centrality, n = 15) %>%
  ggplot(aes(x = reorder(name, centrality), y = centrality, fill = name)) + 
  geom_col()+ 
  theme(legend.position = "none") + 
  coord_flip() + 
  labs(x = NULL)

top/bottom

So, "officer" is used in the lot of different types of roles. That makes sense because its super generic.