Historical extract from iworkforsa - lab book

ofchurches
28 June 2021

Background

This markdown is a preliminary investigation of the data provided by the Office of the Commissioner for Public Sector Employment.The data contains details from every advertisement on iworkforsa.gov.au.

Note that these data have been classified as "Public" by the OCPSE.

Packages

library(readxl)
library(tidyverse)
library(here)
library(lubridate)
library(kableExtra)
library(naniar)
library(patchwork)
library(skimr)
library(DT)
library(ComplexUpset)
library(tidytext)
library(tidyr)
library(stringr)
library(scales)
library(widyr)
library(tidygraph)
library(ggraph)
library(netrankr)

Import, arrange and format

path <- here("IWORK4SA Data 2017-2021, Classification-Public.xlsx")

advertiments <- path %>% 
  excel_sheets() %>%
  map_dfr(
    function(sheet) {
      temp = read_excel(sheet, path = path, skip = 8)
      mutate(temp, Year = sheet)
    }
  )

Checks and balances

Check the top and bottom of the data

Data read in from xlsx files often has bits of extra information above and below the actual "data". To check that we haven't accidentally imported that we can have a look at the top and bottom of our data set.

advertiments %>%
  slice_head(n = 6) %>%
  kable() %>%
  kable_styling()
Date Publish To Public Drill Division Position Status I Workfor SA Classification Level Position ID Position ID Job Title I Workfor SA Vacancy Type I Workfor SA Part Time I Workfor SA Part Time Hours I Workfor SA Location I Workfor SA Min Salary I Workfor SA Max Salary I Workfor SA Employment Status I Workfor SA Job Category Year
2017-11-04 Department for Child Protection Position Closed AHP3 - Allied Health Professional 294895 294895 - Supervisor Open to Everyone No 0 5540 - PORT PIRIE 92757 98896 Ongoing Community and Social Services, Child, Aged and Disability Care 2017
2017-11-04 Department for Child Protection Position Closed PO3 - Professional Officer 294897 294897 - Supervisor Open to Everyone No 0 5540 - PORT PIRIE 90029 95487 Ongoing Community and Social Services, Child, Aged and Disability Care 2017
2017-11-04 Department for Education Position Closed AHP3 - Allied Health Professional 294898 294898 - Psychologist, Centre for Hearing Impaired Open to Everyone No 45 5000 - ADELAIDE 92757 98896 Ongoing Community and Social Services, Child, Aged and Disability Care 2017
2017-11-04 Department for Education Position Closed AHP2 - Allied Health Professional 294899 294899 - Speech Pathologist, Centre for Hearing Impaired Open to Everyone No 45 5000 - ADELAIDE 77752 90029 Ongoing Medical 2017
2017-11-04 Department for Infrastructure and Transport Position Closed ASO5 - Administrative Services Officer 294900 294900 - Senior Asset Inspection Officer Open to Everyone No 0 5037 - NETLEY 75430 84497 Short Term Contract Maintenance and Facilities 2017
2017-11-04 Department for Infrastructure and Transport Position Closed ASO6 - Administrative Services Officer 294901 294901 - Asset Assurance Manager Open to Everyone No 0 5037 - NETLEY 87384 92743 Short Term Contract Maintenance and Facilities 2017
advertiments %>%
  slice_tail(n = 6) %>%
  kable() %>%
  kable_styling()
Date Publish To Public Drill Division Position Status I Workfor SA Classification Level Position ID Position ID Job Title I Workfor SA Vacancy Type I Workfor SA Part Time I Workfor SA Part Time Hours I Workfor SA Location I Workfor SA Min Salary I Workfor SA Max Salary I Workfor SA Employment Status I Workfor SA Job Category Year
NA TAFE SA Created NA 436727 436727 - Default template NA No NA 5000 - ADELAIDE NA NA NA NA 2021
NA Office of the Commissioner for Public Sector Employment Created NA 422698 422698 - Default template NA No NA 5000 - ADELAIDE NA NA NA NA 2021
NA Department for Correctional Services Created NA 434022 434022 - DCS template NA No NA 5000 - ADELAIDE NA NA NA NA 2021
NA Green Industries SA Created NA 439435 439435 - Default template NA No NA 5000 - ADELAIDE NA NA NA NA 2021
NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2021
Report created on: 9/06/2021 3:42:28 PM NA NA NA NA NA NA NA NA NA NA NA NA NA 2021

It looks like we've imported some extra rows that we don't want because they were footer details and some NA whole rows. We can remove them.

advertiments_clean <- advertiments%>%
  filter(str_detect(`Date Publish To Public Drill`, "Report created") == FALSE | is.na(`Date Publish To Public Drill`) == TRUE) %>%
  filter(is.na(`Position ID Job Title`) == FALSE)

Just to check this again...we imported five sheets from the xlsx file so advertiments_clean should now have five less rows than advertisments. And, the difference in rows is: 10!

Change format

Let's put the variables in a format that we can treat appropriately.

advertiments_formatted <- advertiments_clean %>% 
  mutate(`Date Publish To Public Drill` = ymd(`Date Publish To Public Drill`)) %>%
  mutate(`I Workfor SA Min Salary` = as.numeric(`I Workfor SA Min Salary`)) %>%
  mutate(`I Workfor SA Max Salary` = as.numeric(`I Workfor SA Max Salary`))

What are we working with

It would be good to have a look at the data types in each variable to work out if its suitable for what we'll want to do in the analysis.

advertiments_formatted %>%
  skim()
Data summary
Name Piped data
Number of rows 35532
Number of columns 15
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
Column type frequency:
character 12
Date 1
numeric 2
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
Group variables None

Variable type: character

skim\_variable n\_missing complete\_rate min max empty n\_unique whitespace
Division 0 1.00 7 55 0 69 0
Position Status 0 1.00 4 32 0 9 0
I Workfor SA Classification Level 887 0.98 9 74 0 325 0
Position ID 0 1.00 6 6 0 35532 0
Position ID Job Title 0 1.00 8 163 0 35532 0
I Workfor SA Vacancy Type 448 0.99 16 31 0 2 0
I Workfor SA Part Time 0 1.00 2 3 0 2 0
I Workfor SA Part Time Hours 21410 0.40 1 14 0 383 0
I Workfor SA Location 40 1.00 11 44 0 320 0
I Workfor SA Employment Status 451 0.99 6 19 0 4 0
I Workfor SA Job Category 453 0.99 6 62 0 41 0
Year 0 1.00 4 4 0 5 0

Variable type: Date

skim\_variable n\_missing complete\_rate min max median n\_unique
Date Publish To Public Drill 2384 0.93 2017-11-04 2021-06-08 2019-09-24 1017

Variable type: numeric

skim\_variable n\_missing complete\_rate mean sd p0 p25 p50 p75 p100 hist
I Workfor SA Min Salary 826 0.98 75939.52 30251.57 0 60681 75430 92784 324141 <U+2586><U+2587><U+2581><U+2581><U+2581>
I Workfor SA Max Salary 826 0.98 87480.98 38156.83 0 66368 88420 101685 413381 <U+2585><U+2587><U+2581><U+2581><U+2581>

The completeness of Position ID Job Title looks really promising and the completeness of I Workfor SA Min Salary and I Workfor SA Max Salary is hopeful too.

Missing data

advertiments_formatted %>%
  gg_miss_var(show_pct = TRUE, facet = Year) +
  labs(title = str_wrap("Percent of missing data for each variable in `advertisments`", width = 50))

It seems that all the variables except I Workfor SA Part Time Hours have minimal missing values and none of this changes much over time.

gg_miss_upset(advertiments_formatted)

It's reassuring that I Workfor SA Min Salary and I Workfor SA MAX Salary are only missing in combination with each other (and sometimes with I Workfor SA Part Time Hours as well).

The only note that appears particularly pertinent from these analyses is that I Workfor SA Part Time Hours should be treated with caution. We should note that the variables: Division, Position Status, Position ID, Position ID Job Title, I Workfor SA Part Time and Year are all complete with no missing values.

Analyses

"Data" jobs

The motivation for getting these data was to look at advertisements for "data" jobs to investigate questions such as:

Definition and caveats

To answer this question we need a definition of "data" jobs.

We can do this by including words and word-parts that signify the sort of job we are interested in. It is worth stating up front that this is necessarily a decision process rather than a data process. That is, the jobs we end up with classified as "data" and "not data" will be so classified because of our subjective decision not because of some objective data. In addition, this has a signal detection characteristic in that there will be errors due both to some "not data" jobs which end up classified as "data jobs" and some "data jobs" that end up classified as "not data jobs".

Definition

Regardless, we can set the definition and accept these caveats. Importantly, because the new variable Defined as a data job is based on Position ID Job Title, there will be no missing values in this new variable. Its worth noting that this step can (and should) be changed as we think more or less terms are applicable.

Its worth noting that there is a variable called I Workfor SA Job Category that is fairly complete. However, because its not actually complete, its not suitable for basing our definition on.

Data jobs will be defined as Position ID Job Title that contain:

data_job_definitions <- c("data", "analy", "information", "intelligence", "statistic", "scien", "research", "tech")

Then we can append this definition to the data frame:

advertiments_defined <- advertiments_formatted %>% 
  mutate(`Position ID Job Title` = str_to_lower(`Position ID Job Title`)) %>%
  mutate(`Defined as a data job` = ifelse(str_detect(`Position ID Job Title`, 
                                                     paste(data_job_definitions, collapse = "|")
                                                     ), 
         "Data", "Not data"))

As a check on our definitions, we should see how often each occur.

data_job_type <- advertiments_defined %>%
  select(`Position ID Job Title`) %>%
  mutate(`Is "data"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`), 
                                         data_job_definitions[1]), 1, 0)) %>%
  mutate(`Is "analy"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`), 
                                         data_job_definitions[2]), 1, 0)) %>%
  mutate(`Is "information"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`), 
                                         data_job_definitions[3]), 1, 0)) %>%
  mutate(`Is "intelligence"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`), 
                                         data_job_definitions[4]), 1, 0)) %>%
  mutate(`Is "statistic"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`), 
                                         data_job_definitions[5]), 1, 0)) %>%
  mutate(`Is "scien"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`),
                                         data_job_definitions[6]), 1, 0)) %>%
  mutate(`Is "research"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`),
                                         data_job_definitions[7]), 1, 0)) %>%
    mutate(`Is "tech"` = ifelse(str_detect(str_to_lower(`Position ID Job Title`),
                                         data_job_definitions[8]), 1, 0)) 

data_job_type %>%
  select(-`Position ID Job Title`) %>%
  pivot_longer(cols = everything(), names_to = "Data job type", values_to = "Count") %>%
  group_by(`Data job type`) %>%
  summarise(Count = sum(Count)) %>%
  ggplot(aes(x = reorder(`Data job type`, -Count), y = Count, fill = `Data job type`)) + 
  geom_col() + 
  theme(legend.position = "none") + 
  labs(x = "Data job type", title = "Counts of data job types")

This seems pretty plausible. Lets also check how they join together within a single Position ID Job Title.

data_job_type %>%
  select(-`Position ID Job Title`) %>%
  filter_all(any_vars(. == 1)) %>%
  upset(colnames(data_job_type[2:9]))

So there were 5 "data scientists". Seems pretty plausible

Before we move onto the answers to our questions, lets just have a look at the most frequent "Data" and "Not data" jobs.

advertiments_title <- advertiments_defined %>%
  mutate("Position ID check" = str_extract(`Position ID Job Title`, `Position ID`)) %>%
  mutate("Job title" = str_remove(`Position ID Job Title`, paste0(`Position ID`, " - ")))

Did we extract the Position ID correctly? Yes!

So lets remove the Position ID from the Position ID Job Title to get the JOb Title.

advertiments_title %>%
  group_by(`Defined as a data job`) %>%
  count(`Job title`) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(`Defined as a data job` = as.factor(`Defined as a data job`),
         `Job title` = reorder_within(`Job title`, n, `Defined as a data job`)) %>%
  ggplot(aes(`Job title`, n, fill = `Defined as a data job`)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~`Defined as a data job`, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count of times 'Job title' was advertised",
       x = NULL,
       title = "Most common job titles by category")
## Selecting by n

This raises some serious side questions such as:

Answers

Are "data jobs" increasing as a proportion of the advertisments?

increasing_data <- advertiments_title %>%
  group_by(`Defined as a data job`, Year) %>%
  count(name = "Number of advertisments") %>%
  group_by(Year) %>%
  mutate(`Proportion within each year`= `Number of advertisments`/sum(`Number of advertisments`))

count_plot <- increasing_data %>%
  ggplot(aes(x = Year, y = `Number of advertisments`, colour = `Defined as a data job`, group = `Defined as a data job`)) + 
  geom_line() + 
  scale_y_log10() + 
  labs(title = "Count of jobs on iworkforsa each year by definition")

prop_plot <- increasing_data %>%
  ggplot(aes(x = Year, y = `Proportion within each year`, colour = `Defined as a data job`, group = `Defined as a data job`)) + 
  geom_line() + 
  scale_y_log10() + 
  labs(title = "Proportion of jobs on iworkforsa each year by definition")

count_plot/prop_plot

increasing_data %>%
  filter(`Defined as a data job` == "Data") %>%
  kable() %>%
  kable_styling()
Defined as a data job Year Number of advertisments Proportion within each year
Data 2017 114 0.0566882
Data 2018 644 0.0705135
Data 2019 657 0.0732441
Data 2020 690 0.0673894
Data 2021 357 0.0689322

This certainly doesn't make it look as though there has been an explosion in the proportion of "data" jobs advertised over the last four years. There were 46 more "data" jobs in 2020 than in 2018. While this is a 93.3333333 percent change in the number of these jobs that were advertised, it is worth noting that this is only a change from a percent of 7.0513522 in 2018 to 6.7389393 in 2020 of the overall number of jobs that were advertised. That is a change of -0.3124129 percent.

What departments are recruiting them?

departments_data <- advertiments_title %>%
  group_by(Division, `Defined as a data job`) %>%
  count(name = "Number of advertisments in division") %>%
  ungroup() %>%
  complete(Division, `Defined as a data job`, fill = list(`Number of advertisments in division` = 0)) %>%
  group_by(Division) %>%
  mutate("Division total" = sum(`Number of advertisments in division`)) %>%
  ungroup() %>%
  mutate("Percent" = (`Number of advertisments in division`/`Division total`)*100)

most_data_departments <- departments_data %>%
  filter(`Defined as a data job` == "Data") %>%
  slice_max(order_by = Percent, n = 12) %>%
  pull(Division)

departments_data  %>%
  filter(Division %in% most_data_departments) %>%
  mutate(Division = factor(Division, levels = most_data_departments, ordered = TRUE)) %>%
  ggplot(aes(x = Division, 
             y = Percent, 
             fill = `Defined as a data job`)) + 
  geom_col(position = "dodge") + 
  labs(title = "Divisions with the greatest proportion of 'data' job advertiments", 
       x = "") + 
  scale_x_discrete(limits=rev) +
  coord_flip()

A more comprehensive investigation of these data can be made in the table below:

datatable(departments_data)

So, the Department of Health and Wellbeing has the greatest number of "data" job advertisements: 1325. But, it also has one of the smallest percentages of data job advertisements: 5.9727732%.

It is worth noting that there are multiple explanations for the differences in the number of advertisements of "data" jobs between the Divisions. * Divisions that "data" people like working in may keep their staff for longer and so advertise less * Some divisions may have recruited more "data" people in an earlier time period * Some divisions may not be hiring as many "data" people as they think they are

What sort of work are they being anticipated to do?

Using the Position ID Job Title, we can see what other words appear alongside those used to define Defined as a data job.

Lets start by looking at the words most frequently occurring in those jobs that we classified as Defined as a data job == "Data".

advertiments_unnest <- advertiments_title %>%
  #select(`Defined as a data job`, `Job title`) %>%
  unnest_tokens(Word, `Job title`)

advertiments_unnest %>%
  filter(`Defined as a data job` == "Data") %>%
  count(Word, name = "Number of occurances") %>%
  slice_max(order_by = `Number of occurances`, n = 12) %>%
  ggplot(aes(x = reorder(Word, `Number of occurances`), y = `Number of occurances`, fill = Word)) +
  geom_col() + 
  coord_flip()  + 
  theme(legend.position = "none") + 
  labs(x = "", title = "Most used words used in 'data' job advertisments")

It would be more instructive to see the words that most differentiate between "Data" and "Not data" jobs:

advertisments_frequency <- advertiments_unnest %>% 
  group_by(`Defined as a data job`) %>% 
  count(Word, sort = TRUE) %>% 
  left_join(advertiments_unnest %>% 
              group_by(`Defined as a data job`) %>% 
              summarise(total = n())) %>%
  mutate(Frequency = n/total) %>% 
  select(`Defined as a data job`, Word, Frequency) %>% 
  pivot_wider(names_from = `Defined as a data job`, values_from = Frequency) %>%
  arrange(Data, `Not data`)

ggplot(advertisments_frequency, aes(Data, `Not data`)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = Word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

Words near the red line are used with similar frequencies in the Position ID Job Title of "Data" and "Not data" jobs. Words further from the red line are used disproportionately in one set of Position ID Job Title. Hence, "programmer" is used predominantly in "Data" jobs while "social" is used predominantly in "Not data" jobs.

We could also look at the log-odds ratios between the categories:

word_ratios <- advertiments_unnest %>%
  count(Word, `Defined as a data job`) %>%
  group_by(Word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  pivot_wider(names_from = `Defined as a data job`, values_from = n, values_fill = 0) %>%
  mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
  mutate(logratio = log(`Data` / `Not data`)) %>%
  arrange(desc(logratio))

word_ratios %>%
  group_by(logratio < 0) %>%
  slice_max(abs(logratio), n = 15) %>% 
  ungroup() %>%
  mutate(word = reorder(Word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col() +
  coord_flip() +
  ylab("log odds ratio") +
  scale_fill_discrete(name = "", labels = c("Data", "Not dat"))

So, a lot of the words that are found have the actual stem we included in the definition of "Data" jobs in the first place. But removing them would remove both "technical" and "technician" and its interesting to note that both are present.

Finally, it might be instructive to see which words are used together in the "Data" job advertisements.

advertiments_unnest %>% 
  filter(`Defined as a data job` == "Data") %>%
  select(Word, `Position ID Job Title`) %>%
  rename(Title = `Position ID Job Title`) %>%
  pairwise_count(Word, Title, sort = TRUE, upper = FALSE) %>%
  filter(n >= 15) %>%
  as_tbl_graph() %>%
  to_undirected %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree() / local_ave_degree()) %>%
  mutate(community = as.factor(group_fast_greedy())) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(aes(size = degree, colour = community)) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() + 
  theme(legend.position = "none")

It appears that there are some meaningful clusters. Within the connected component, the two biggest clusters seem associated with roles that might be related to "business intelligence" and "freedom of information". Its worth noting that "data" itself is quite central to the network overall.

Its worth seeing what are the most central words in the network of "data" jobs:

top <- advertiments_unnest %>% 
  filter(`Defined as a data job` == "Data") %>%
  select(Word, `Position ID Job Title`) %>%
  rename(Title = `Position ID Job Title`) %>%
  pairwise_count(Word, Title, sort = TRUE, upper = FALSE) %>%
  as_tbl_graph() %>%
  to_undirected %>%
  activate(nodes) %>%
  mutate(centrality = centrality_betweenness_rsp_simple()) %>%
  as.data.frame() %>%
  slice_max(order_by = centrality, n = 15) %>%
  ggplot(aes(x = reorder(name, centrality), y = centrality, fill = name)) + 
  geom_col()+ 
  theme(legend.position = "none") + 
  coord_flip() + 
  labs(x = NULL)

bottom <- advertiments_unnest %>% 
  filter(`Defined as a data job` == "Data") %>%
  select(Word, `Position ID Job Title`) %>%
  rename(Title = `Position ID Job Title`) %>%
  pairwise_count(Word, Title, sort = TRUE, upper = FALSE) %>%
  as_tbl_graph() %>%
  to_undirected %>%
  activate(nodes) %>%
  mutate(centrality = centrality_betweenness_rsp_simple()) %>%
  as.data.frame() %>%
  slice_min(order_by = centrality, n = 15) %>%
  ggplot(aes(x = reorder(name, centrality), y = centrality, fill = name)) + 
  geom_col()+ 
  theme(legend.position = "none") + 
  coord_flip() + 
  labs(x = NULL)

top/bottom

So, "officer" is used in the lot of different types of roles. That makes sense because its super generic.