For this project we focused on answering the question “What are the most valued data science skills?”.
Our method for answering this question was to collect a number of Data Science and analysis-related job postings and query that data to identify the skills that appear most prominently across those postings.
The overall approach we took for this project was:
Data Collection - collect job postings from various job boards to create our initial corpus of data science skills.
Data Cleaning - clean the raw job postings by creating a single row for each skill requirement listed in a post, then build a data frame containing just the skills for each posting. Finally, clean those skills by a) removing non-alphanumeric characters, b) removing extra white space, c) converting all words to lower case, and d) removing stop words.
Word Tokenization - using the job skills data, create a corpus of the discrete words that appear in the posts by splitting each listed skill into its individual words and forming word groups of size n = {1, 2, 3, 4, 5}.
Word Classification - export the new corpus of candidate skill terms as a .csv file and manually label which terms are actual Data Science skills and which are not.
Labeling Original Data - using the resulting dictionary of Data Science skills, return to the original posts and label each job skill requirement with the discrete skills it contains, flagging up to seven skills per requirement.
Tidy Data - convert the labeled data frame from wide to long format.
Export to Database - with the data in a structured format, export it to a database.
Data Analysis - with the data in a tidy format, query and analyze it to answer our research question.
We set up our environment by loading the libraries that we will use.
knitr::opts_chunk$set(echo = TRUE)
rm(list=ls())
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(RPostgres)
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(ggwordcloud)
library(dotenv)
For this step we start by connecting to the database that we will use to store the cleaned data and later to retrieve it for analysis.
load_dot_env('../creds.env')
con <- dbConnect(Postgres(),
dbname = Sys.getenv("DB_NAME"),
host = Sys.getenv("DB_HOST"),
port = Sys.getenv("DB_PORT"),
user = Sys.getenv("DB_UID"),
password = Sys.getenv("DB_PWD"))
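The connection credentials are read from ../creds.env via load_dot_env(). Below is a minimal sketch of what that file is assumed to contain, shown as comment lines and using the standard KEY=VALUE dotenv layout; all values are placeholders, not the project's actual credentials.
# ../creds.env (placeholder values)
# DB_NAME=jobs_db
# DB_HOST=localhost
# DB_PORT=5432
# DB_UID=db_user
# DB_PWD=db_password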
We initiated this project by assigning each member of our team to a different job board and tasking them with collecting 25 job postings each (~100 postings in total). This raw data was used to generate our overall corpus of job skills across a range of data-related roles.
Core steps:
1. Input the job postings from the .csv file
2. Break out the job skills for each posting into separate rows, splitting on new lines
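Before importing, it helps to note the layout we assume for the raw file; this is inferred from the column renames later in this script, so treat it as an assumption rather than a specification.
# Assumed columns in '../input/Job Postings Data.csv': one row per posting with a
# Skills field holding newline-separated requirements, plus Job Board, Post URL,
# Company, Industry, Company Location, Job Title, Seniority Level, Job Type, and
# Years Experience.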
path = '../input/Job Postings Data.csv'
jobs_raw <- read.csv(path)
jobs_df <- jobs_raw %>% separate_rows(Skills, sep= "\n")
skills_list <- jobs_df %>%
select(Skills)
skills_list <- skills_list %>%
mutate(Skills = str_squish(Skills)) %>%
filter(Skills != "")
#clean the skills list data
skills_list <- skills_list %>%
mutate(Skills = tolower(Skills)) %>%
mutate(Skills = str_replace_all(Skills, '[^[:alnum:]]', ' '))
skills_list <- skills_list %>%
mutate(Skills = str_squish(Skills))
This step focused on breaking the words used in the job skills portion of each posting into discrete skill terms that we could then use to build our overall corpus of terms.
This step had the following core steps:
1. Create a function that can be used to get strings of multiple lengths
2. Create a list of the skills words
3. Remove the stop words from the data
#Import list of stop words
stop_words_list = 'https://raw.githubusercontent.com/igorbrigadir/stopwords/ab85d86c3fac0360020a921b91ccf9d697b54757/en/terrier.txt'
stop_words <- read.table(stop_words_list)
stop_words <- stop_words$V1
#Remove "R" and "C" from stop words
stop_words <- stop_words[stop_words != 'r' & stop_words != 'c']
# Create function to capture word groups
get_words <- function(vec, vec_length, word_size) {
words_vec = data.frame()
start_index = 1
end_index = start_index + (word_size-1)
while(end_index <= vec_length) {
str = paste(vec[start_index:end_index],collapse=" ")
str_df <- data.frame(str)
words_vec = rbind(words_vec,str_df)
start_index = start_index+1
end_index = end_index+1
}
return(words_vec)
}
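As a quick illustration of the helper (a toy input, not a row from the dataset), calling it with word_size = 2 returns the bigrams of the phrase; the call is left as a comment so it does not affect the pipeline.
# get_words(c("machine", "learning", "with", "python"), 4, 2)
# returns a one-column data frame (column 'str') containing:
#   "machine learning", "learning with", "with python"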
max_word_groups = 5
skills_vec = data.frame()
## Generate list of skills from the job postings
for(i in 1:nrow(skills_list)) {
skill = skills_list[[i,1]]
if(skill != "") {
skill_split <- strsplit(skill," ")
skills = skill_split[[1]]
skills <- skills[! skills %in% stop_words]
vec_length = length(skills)
for(word_size in 1:max_word_groups) {
words <- get_words(skills,vec_length,word_size)
skills_vec = rbind(skills_vec, words)
}
}
}
Once we tokenized the terms used in the job postings, we cleaned the data to create a simplified view of the distinct terms used across the various postings. We then exported this data so that we could manually determine which terms relate to Data Science work and which could be discarded/ignored:
The core steps here were:
1. Clean the tokenized Data Science skills list
2. Create metrics showing how frequently each term appears in job postings and how many times a version of the term appears within other terms in the dataset
3. Export the skills dataset for manual processing
# Rename the skill str column to skill
skills_vec <- skills_vec %>%
rename(skill = str)
# Convert skills to lower case
skills_df <- skills_vec %>% mutate(skill = tolower(skill))
# Remove special characters from skills list
skills_df <- skills_df %>%
mutate(skill = str_replace_all(skill, '[^[:alnum:]]', ' '))
# Remove leading and trailing white space characters
skills_df <- skills_df %>%
mutate(skill = str_squish(skill))
# Generate distinct list of skills words
distinct_skills <- skills_df %>%
distinct(skill)
# Loop through distinct skills and determine the number of job postings that have the word
word_count = data.frame()
for(i in 1:nrow(distinct_skills)) {
word <- distinct_skills[[i,1]]
num_posts <- skills_list %>%
filter(grepl(word,Skills)) %>%
nrow()
word_count = rbind(word_count,num_posts)
}
skills_data <- cbind(distinct_skills, word_count)
columns <- c("skill", "num_postings")
colnames(skills_data) <- columns
# Remove skills words that do not appear in any of the postings
skills_data <- skills_data %>%
filter(num_postings >= 1)
skills_data <- skills_data %>%
mutate(pct_postings = round(num_postings/nrow(skills_list),4))
# Filter the list to remove the stop words but keep the R and C since those could be in reference to the coding languages. Also remove any blank skills
skills_data <- skills_data %>%
filter(!skill %in% stop_words) %>%
filter(skill != "") %>%
arrange(desc(pct_postings))
#how many times is a word nested in other skills
word_occurrence = data.frame()
for(i in 1:nrow(skills_data)) {
num_occurrences = 0
word = skills_data[[i,"skill"]]
#print(word)
num_occurrences <- skills_data %>%
filter(grepl(word, skill)) %>%
nrow()
word_occurrence <- rbind(word_occurrence, num_occurrences)
}
skills_data <- cbind(skills_data, word_occurrence)
skills_data <- data.frame(skills_data)
columns = c("skill", "num_postings", "pct_postings", "num_occurence")
colnames(skills_data) <- columns
skills_data <- as_tibble(skills_data)
skills_data <- skills_data %>% mutate(pct_occurence = round(num_occurence/n(),4))
write_csv(skills_data, '../output/skills_data.csv')
Using our new corpus of Data Science skills, we can go back to our original list of job skills and identify the skills included in each post. This phase ended up requiring a good deal of manual processing: using the skills list, we labeled the terms in the exported file to indicate which ones should be flagged as relevant skills.
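The manual pass produced ../input/skills_df.csv from the exported skills_data.csv. The exact convention is inferred from the filtering below, so treat this note as an assumption.
# Assumed labeling convention for '../input/skills_df.csv': an is_skill column was
# added by hand; rows describing genuine Data Science skills received a non-empty
# label, and rows left blank (NA) are dropped by the filter that follows.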
skills_path = '../input/skills_df.csv'
skills_labeled <- read_csv(skills_path)
## Rows: 3576 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): skill, is_skill
## dbl (4): num_postings, pct_postings, num_occurence, pct_occurence
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skills_labeled <- skills_labeled %>%
mutate(is_skill = tolower(is_skill))
#Remove unlabeled skills from dataset
skills_labeled_filtered <- skills_labeled %>%
filter(!is.na(is_skill))
#Take original skills_list and flag the skill if the skill is in our labeled dataset
skills_list_labeled <- skills_list %>%
mutate(skill_short1 = NA,
skill_short2 = NA,
skill_short3 = NA,
skill_short4 = NA,
skill_short5 = NA,
skill_short6 = NA,
skill_short7 = NA)
for(i in 1:nrow(skills_labeled_filtered)) {
skill = skills_labeled_filtered[[i,1]]
skill_regex = paste0("\\b", skill, "\\b")
#Set first skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short1 = ifelse(is.na(skill_short1) & str_detect(Skills, skill_regex),skill,skill_short1))
#Set second skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short2 = ifelse(!is.na(skill_short1) & skill_short1 != skill & is.na(skill_short2) & str_detect(Skills, skill_regex),
skill,skill_short2))
#Set third skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short3 = ifelse(!is.na(skill_short2) & skill_short2 != skill & is.na(skill_short3) & str_detect(Skills, skill_regex),
skill,skill_short3))
#Set fourth skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short4 = ifelse(!is.na(skill_short3) & skill_short3 != skill & is.na(skill_short4) & str_detect(Skills, skill_regex),
skill,skill_short4))
#Set fifth skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short5 = ifelse(!is.na(skill_short4) & skill_short4 != skill & is.na(skill_short5) & str_detect(Skills, skill_regex),
skill,skill_short5))
#Set sixth skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short6 = ifelse(!is.na(skill_short5) & skill_short5 != skill & is.na(skill_short6) & str_detect(Skills, skill_regex),
skill,skill_short6))
#Set seventh skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short7 = ifelse(!is.na(skill_short6) & skill_short6 != skill & is.na(skill_short7) & str_detect(Skills, skill_regex),
skill,skill_short7))
}
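As a side note, the seven repeated mutate() blocks above could be collapsed into a single pass. The sketch below is not the method used to produce the results in this report; it assumes the same skills_list and skills_labeled_filtered objects, gathers every dictionary match per posting with str_extract_all(), and keeps the first seven unique matches. Match order follows the posting text rather than the dictionary, so the column assignments can differ from the loop above, and the combined regex may be slow for very large dictionaries.
# Build one alternation pattern from the labeled dictionary
skill_pattern <- paste0("\\b", skills_labeled_filtered$skill, "\\b", collapse = "|")
skills_list_labeled_alt <- skills_list %>%
  # one vectorized regex pass per posting instead of one pass per dictionary term
  mutate(matches = str_extract_all(Skills, skill_pattern)) %>%
  # keep at most seven unique matched skills per posting (shorter lists pad with NA)
  mutate(matches = purrr::map(matches, ~ unique(.x)[1:7])) %>%
  # spread into skill_short1..skill_short7 to mirror the columns built above
  tidyr::unnest_wider(matches, names_sep = "") %>%
  rename_with(~ str_replace(.x, "^matches", "skill_short"), starts_with("matches"))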
Now that we have a labeled list of skills requirements from the original job postings, we can then join these new fields with the original data.
This process primarily consists of:
1. Further preprocessing of the skills and job postings datasets to account for terms that are semantically similar
2. Removing bad character strings
3. Imputing Job Position information
4. Imputing Job Seniority Level information
jobs_df_clean <- jobs_df %>%
filter(Skills != "") %>%
mutate(Skills = tolower(Skills)) %>%
mutate(Skills = str_replace_all(Skills, '[^[:alnum:]]', ' ')) %>%
mutate(Skills = str_squish(Skills))
j1 <- left_join(jobs_df_clean, skills_list_labeled, by=c("Skills"="Skills"), multiple = "all")
j1_pivot <- j1 %>% pivot_longer(cols = c("skill_short1", "skill_short2", "skill_short3", "skill_short4", "skill_short5", "skill_short6", "skill_short7"),
names_to = "skill_number",
values_to = "discrete_skill")
columns = c("Job Board", "Post URL", "Company", "Industry", "Company Location",
"Job Title", "Listed Skill", "Seniority Level", "Job Type",
"Years Experience", "skill_number", "discrete_skill")
colnames(j1_pivot) <- columns
j1_pivot <- janitor::clean_names(j1_pivot)
j1_pivot <- j1_pivot %>%
mutate(discrete_skill = case_when(
str_detect(discrete_skill, '(communication|communicate)') ~ 'communication skills',
str_detect(discrete_skill, 'visualization') ~ 'data visualization skills',
str_detect(discrete_skill, '\\br\\b') ~ 'r programming',
str_detect(discrete_skill, 'stat') ~ 'applied statistics',
str_detect(discrete_skill,'\\bai\\b') ~ 'artificial intelligence',
str_detect(discrete_skill,'agile') ~ 'agile development',
str_detect(discrete_skill, '(^analys|analyz)') ~ 'analysis skills',
str_detect(discrete_skill, '^analyt') ~ 'analytics skills',
str_detect(discrete_skill, '(machine learning|ml)') ~ 'machine learning',
str_detect(discrete_skill, '\\bci\\b') ~ 'ci',
str_detect(discrete_skill, 'strat') ~ 'strategy skills',
str_detect(discrete_skill, 'workflow') ~ 'workflow automation',
str_detect(discrete_skill, '\\bcloud\\b') ~ 'cloud computing',
str_detect(discrete_skill, 'cluster') ~ 'cluster analysis',
str_detect(discrete_skill, 'critical') ~ 'critical thinking skills',
str_detect(discrete_skill, 'project') ~ 'project management skills',
str_detect(discrete_skill, 'customer') ~ 'customer service skills',
str_detect(discrete_skill, 'database') ~ 'database management',
str_detect(discrete_skill, 'big data') ~ 'big data',
str_detect(discrete_skill, 'deep') ~ 'deep learning',
str_detect(discrete_skill, '\\betl\\b') ~ 'etl',
str_detect(discrete_skill, 'python') ~ 'python programming',
str_detect(discrete_skill, 'innov') ~ 'innovative',
str_detect(discrete_skill, 'regression') ~ 'regression modeling',
str_detect(discrete_skill, 'ggplot') ~ 'ggplot',
str_detect(discrete_skill, 'model') ~ 'model development',
str_detect(discrete_skill, 'data transformation') ~ 'data transformation',
str_detect(discrete_skill, '\\bsql\\b') ~ 'sql',
str_detect(discrete_skill, '(\\bbi\\b|business intelligence)') ~ 'business intelligence skills',
str_detect(discrete_skill, '\\bai\\b') ~ 'artificial intelligence',
str_detect(discrete_skill, 'neural') ~ 'neural networks',
str_detect(discrete_skill, 'keras') ~ 'keras',
str_detect(discrete_skill, '(sas|saas)') ~ 'saas',
str_detect(discrete_skill, 'software development') ~ 'software development',
str_detect(discrete_skill, 'algorithm') ~ 'algorithm design',
str_detect(discrete_skill, '(\\bmip\\b|\\bqp\\b)') ~ 'mixed-integer programming',
str_detect(discrete_skill, '\\bnlp\\b') ~ 'natural language processing',
str_detect(discrete_skill, '(\\boop\\b|object)') ~ 'object oriented programming',
str_detect(discrete_skill, 'consulting') ~ 'consulting',
str_detect(discrete_skill, 'predictive') ~ 'predictive modeling',
str_detect(discrete_skill, 'curio') ~ 'curiosity',
str_detect(discrete_skill, 'data cleaning') ~ 'data wrangling',
str_detect(discrete_skill, 'probab') ~ 'probabilistic modeling',
str_detect(discrete_skill, '(programming|scripting|coding)') & !str_detect(discrete_skill, '(python|sql|database|\\br\\b)') ~ 'computer programming',
str_detect(discrete_skill, 'solving') ~ 'problem solving',
str_detect(discrete_skill, 'team') ~ 'teamwork skills',
str_detect(discrete_skill, 'unsupervised') ~ 'unsupervised learning',
TRUE ~ discrete_skill
))
## Export cleaned data frame
write_csv(j1_pivot, '../output/jobs_data_clean.csv')
company_df <- j1_pivot %>%
select(company, industry, company_location) %>%
distinct() %>%
mutate(company_id = row_number())
skills_df <- j1_pivot %>%
select(discrete_skill) %>%
distinct(discrete_skill) %>%
na.omit(discrete_skill) %>%
arrange(discrete_skill) %>%
mutate(skill_id = row_number())
job_board_df <- j1_pivot %>%
select(job_board) %>%
distinct() %>%
mutate(job_board_id = row_number())
job_posting_df <- j1_pivot %>%
group_by(job_board, post_url, company, job_title) %>%
summarize(posting_id = cur_group_id())
## `summarise()` has grouped output by 'job_board', 'post_url', 'company'. You can
## override using the `.groups` argument.
j2 <- left_join(j1_pivot,job_posting_df)
## Joining with `by = join_by(job_board, post_url, company, job_title)`
j3 <- left_join(j2, skills_df)
## Joining with `by = join_by(discrete_skill)`
j4 <- left_join(j3, company_df)
## Joining with `by = join_by(company, industry, company_location)`
j5 <- left_join(j4, job_board_df)
## Joining with `by = join_by(job_board)`
company_data <- j5 %>%
select(company_id, company, company_location, industry) %>%
distinct()
posting_data <- j5 %>%
select(posting_id, company_id, job_board_id, job_title, job_type, seniority_level, years_experience) %>%
distinct() %>%
arrange(posting_id)
#Clean up the job type field
posting_data <- posting_data %>%
mutate(job_type = case_when(
str_detect(job_type,"\\bHybrid\\b") ~ "Hybrid",
str_detect(job_type, "\\b[rR]emote\\b") ~ "Remote",
str_detect(job_type, "\\bIn [pP]erson\\b") ~ "On-Site",
str_detect(job_type, 'On Premise') ~ 'On-Site',
str_detect(job_type, "character(0)") ~ "",
str_detect(job_type, '0') ~ "",
TRUE ~ job_type
))
#Clean up the seniority_level field
posting_data <- posting_data %>%
mutate(seniority_level = case_when(
str_detect(seniority_level,"(Entry|Assistant)") ~ 'Entry Level',
str_detect(seniority_level, "(Mid|Manager|Supervisor)") ~ 'Mid Level',
str_detect(seniority_level, "(Senior)") ~ 'Senior Level',
seniority_level == 'character(0)' ~ "",
seniority_level == '0' ~ "",
TRUE ~ seniority_level
))
# Clean up the years experience field
posting_data <- posting_data %>%
mutate(years_experience = str_match(years_experience, '[\\d][~]*[\\d]*')) %>%
mutate(years_experience = str_replace(years_experience, "~", "-"))
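For illustration, the pattern above keeps the first run of digits plus an optional ~-separated second number, and the follow-up replace turns ~ into a dash. Assuming inputs like the following, the two mutate() calls behave as:
# "3~5 years of experience" -> "3~5" -> "3-5"
# "5+ years"                -> "5"   -> "5"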
#Impute seniority level based on years experience
posting_data <- posting_data %>%
mutate(seniority_level = ifelse(seniority_level == "" | is.na(seniority_level),
case_when(
years_experience %in% c("0","1","2") ~ 'Entry Level',
years_experience %in% c("3","4") ~ 'Mid Level',
TRUE ~ seniority_level
),seniority_level))
posting_data <- posting_data %>%
mutate(job_title_category = case_when(
str_detect(job_title, "(Analyst|Analytics)") ~ 'Data Analyst',
str_detect(job_title, "Data S") ~ 'Data Scientist',
str_detect(job_title, 'Engineer') ~ 'Data Engineer',
str_detect(job_title, 'Machine Learning') ~ 'Data Scientist',
TRUE ~ 'Other'
))
job_board_data <- j5 %>%
select(job_board_id, job_board) %>%
distinct()
job_skills_data <- j5 %>%
select(skill_id, posting_id, discrete_skill) %>%
distinct() %>%
arrange(skill_id)
Now that we’ve cleaned the data and created our data frames, we want to create the relevant tables in our PostgreSQL database and then import the data from the data frames into these tables.
drop_query = "DROP TABLE company,job_board, skills, job_posting"
dbSendQuery(con, drop_query)
## <PqResult>
## SQL DROP TABLE company,job_board, skills, job_posting
## ROWS Fetched: 0 [complete]
## Changed: 0
## Write data to database
dbWriteTable(con, "job_board", job_board_data, overwrite=TRUE)
## Warning in result_create(conn@ptr, statement, immediate): Closing open result
## set, cancelling previous query
dbWriteTable(con, "company", company_data, overwrite=TRUE)
dbWriteTable(con, "skills", job_skills_data, overwrite=TRUE)
dbWriteTable(con, "job_posting", posting_data, overwrite=TRUE)
Now that we have cleaned data available through our database, we are ready to answer our core data analysis questions. For this project we focused on the following:
1. What are the most valued Data Science skills overall?
2. How do the most valued skills differ across data roles (Data Scientist, Data Analyst, Data Engineer)?
3. How do the most valued skills differ across seniority levels?
query = "SELECT * FROM skills"
skills_data <- dbGetQuery(con, query)
query = "SELECT * FROM job_posting"
posting_data <- dbGetQuery(con,query)
num_postings = posting_data %>% nrow()
top50_skills <- skills_data %>%
group_by(discrete_skill) %>%
summarize(num_jobs = n_distinct(posting_id)) %>%
mutate(pct_of_jobs = num_jobs/num_postings) %>%
na.omit() %>%
arrange(desc(pct_of_jobs)) %>%
head(50)
top25_skills <- skills_data %>%
group_by(discrete_skill) %>%
summarize(num_jobs = n_distinct(posting_id)) %>%
mutate(pct_of_jobs = num_jobs/num_postings) %>%
na.omit() %>%
arrange(desc(pct_of_jobs)) %>%
head(25) %>%
mutate(discrete_skill = factor(discrete_skill, levels=discrete_skill, ordered=TRUE))
## Table of the top 25 skills across job postings
top25_skills %>%
mutate(pct_of_jobs = scales::percent(pct_of_jobs)) %>%
kable(
col.names = c("Skill", "No. Jobs", "Percent of Jobs"),
row.names = TRUE,
caption = "Top 25 Skills for Data Scientists & Data Professionals (N = 89 Postings)",
align = c("l", "c", "c")
) %>%
kable_material(c("striped"))
| | Skill | No. Jobs | Percent of Jobs |
|---|---|---|---|
| 1 | analysis skills | 52 | 58.4% |
| 2 | python programming | 51 | 57.3% |
| 3 | machine learning | 39 | 43.8% |
| 4 | communication skills | 38 | 42.7% |
| 5 | analytics skills | 37 | 41.6% |
| 6 | sql | 34 | 38.2% |
| 7 | applied statistics | 31 | 34.8% |
| 8 | model development | 30 | 33.7% |
| 9 | programming | 28 | 31.5% |
| 10 | database management | 24 | 27.0% |
| 11 | r programming | 24 | 27.0% |
| 12 | problem solving | 22 | 24.7% |
| 13 | algorithm design | 21 | 23.6% |
| 14 | big data | 16 | 18.0% |
| 15 | aws | 13 | 14.6% |
| 16 | tableau | 13 | 14.6% |
| 17 | business intelligence skills | 12 | 13.5% |
| 18 | deep learning | 12 | 13.5% |
| 19 | saas | 12 | 13.5% |
| 20 | azure | 11 | 12.4% |
| 21 | curiosity | 11 | 12.4% |
| 22 | programming languages | 11 | 12.4% |
| 23 | project management skills | 11 | 12.4% |
| 24 | strategy skills | 11 | 12.4% |
| 25 | cluster analysis | 10 | 11.2% |
## Top 25 skills for Data Scientists and Data Professionals
ggplot(data = top25_skills, aes(x=reorder(discrete_skill, num_jobs), y=pct_of_jobs)) +
geom_bar(stat='identity', aes(fill=discrete_skill)) +
coord_flip() +
labs(
x = "Job Skills",
y = "Percent of Jobs",
title = 'Top 25 Skills For Data Scientist and Data Professionals',
subtitle = 'N = 89 Postings',
fill = 'Skills'
) +
scale_y_continuous(labels = scales::percent)
## Wordcloud of top 50 Skills
ggplot(top50_skills, aes(label=discrete_skill, size=num_jobs)) +
geom_text_wordcloud() +
theme_minimal()
Answer:
For this question we pulled the top 25 skills across our data science job postings and found that the top 5 were:
1. Analysis Skills - 58.4% of postings
2. Python Programming - 57.3% of postings
3. Machine Learning - 43.8% of postings
4. Communication Skills - 42.7% of postings
5. Analytics Skills - 41.6% of postings
query <- "SELECT * FROM skills LEFT JOIN job_posting ON skills.posting_id = job_posting.posting_id"
combined_data <- dbGetQuery(con, query)
combined_data <- combined_data %>%
select(-posting_id..4)
num_postings_by_cat <- combined_data %>%
group_by(job_title_category) %>%
summarize(total_postings = n_distinct(posting_id)) %>%
arrange(desc(total_postings))
num_postings_by_skill <- combined_data %>%
group_by(job_title_category, discrete_skill) %>%
summarize(num_postings = n_distinct(posting_id)) %>%
ungroup()
## `summarise()` has grouped output by 'job_title_category'. You can override
## using the `.groups` argument.
combined_postings_by_cat <- left_join(num_postings_by_skill, num_postings_by_cat) %>%
mutate(pct_postings = num_postings/total_postings)
## Joining with `by = join_by(job_title_category)`
## Top Skills for Data Scientist Positions
combined_postings_by_cat %>%
filter(job_title_category == 'Data Scientist') %>%
group_by(job_title_category) %>%
na.omit(discrete_skill) %>%
arrange(desc(num_postings)) %>%
filter(row_number() <= 10) %>%
ungroup() %>%
select(discrete_skill, pct_postings) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science Skills for Data Scientists",
align = c("l","c")
) %>%
kable_material(font_size = 15, c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | analysis skills | 63.5% |
| 2 | python programming | 60.3% |
| 3 | communication skills | 42.9% |
| 4 | machine learning | 42.9% |
| 5 | model development | 41.3% |
| 6 | analytics skills | 39.7% |
| 7 | applied statistics | 36.5% |
| 8 | programming | 34.9% |
| 9 | algorithm design | 31.7% |
| 10 | sql | 31.7% |
## Top Skills for Data Analyst Positions
combined_postings_by_cat %>%
filter(job_title_category == 'Data Analyst') %>%
group_by(job_title_category) %>%
na.omit(discrete_skill) %>%
arrange(desc(num_postings)) %>%
filter(row_number() <= 10) %>%
ungroup() %>%
select(discrete_skill, pct_postings) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science Skills for Data Analysts",
align = c("l","c")
) %>%
kable_material(font_size = 15, c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | analytics skills | 55.6% |
| 2 | sql | 55.6% |
| 3 | analysis skills | 50.0% |
| 4 | machine learning | 44.4% |
| 5 | python programming | 44.4% |
| 6 | applied statistics | 33.3% |
| 7 | communication skills | 33.3% |
| 8 | database management | 33.3% |
| 9 | problem solving | 27.8% |
| 10 | tableau | 27.8% |
## Top Skills for Data Engineer Positions
combined_postings_by_cat %>%
filter(job_title_category == 'Data Engineer') %>%
group_by(job_title_category) %>%
na.omit(discrete_skill) %>%
arrange(desc(num_postings)) %>%
filter(row_number() <= 10) %>%
ungroup() %>%
select(discrete_skill, pct_postings) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science Skills for Data Engineers",
align = c("l","c")
) %>%
kable_material(font_size = 15, c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | aws | 75% |
| 2 | azure | 75% |
| 3 | python programming | 75% |
| 4 | sql | 75% |
| 5 | ci | 50% |
| 6 | communication skills | 50% |
| 7 | data pipelines | 50% |
| 8 | machine learning | 50% |
| 9 | pipelines | 50% |
| 10 | programming | 50% |
top10_by_cat <- combined_postings_by_cat %>%
group_by(job_title_category) %>%
na.omit(discrete_skill) %>%
arrange(desc(num_postings)) %>%
filter(row_number() <= 10)
## Top Data Science Skills common across different data functions
combined_postings_by_cat %>%
group_by(job_title_category) %>%
na.omit(discrete_skill) %>%
arrange(desc(num_postings)) %>%
ungroup() %>%
select(job_title_category, discrete_skill, pct_postings) %>%
mutate(pct_postings = scales::percent(round(pct_postings,3))) %>%
mutate(pct_postings = ifelse(pct_postings == 'NA',"-",pct_postings)) %>%
pivot_wider(names_from = job_title_category, values_from = pct_postings) %>%
na.omit() %>%
filter(row_number() <= 10) %>%
kable(
col.names = (c("Skill", "Data Scientist", "Data Analyst", "Data Engineer", "Other")),
row.names = TRUE,
caption = "Top 10 Data Science Skills Consistent Across Each Data Function/Title",
align = c("l",rep("c",4))
) %>%
kable_material(font_size = 15, c("striped", "condensed"))
| | Skill | Data Scientist | Data Analyst | Data Engineer | Other |
|---|---|---|---|---|---|
| 1 | analysis skills | 63.5% | 50.0% | 25.0% | 50.0% |
| 2 | python programming | 60.3% | 44.4% | 75.0% | 50.0% |
| 3 | communication skills | 42.9% | 33.3% | 50.0% | 75.0% |
| 4 | machine learning | 42.9% | 44.4% | 50.0% | 50.0% |
| 5 | model development | 41.3% | 11.1% | 25.0% | 25.0% |
| 6 | analytics skills | 39.7% | 55.6% | 25.0% | 25.0% |
| 7 | sql | 31.7% | 55.6% | 75.0% | 25.0% |
| 8 | r programming | 28.6% | 16.7% | 25.0% | 50.0% |
| 9 | problem solving | 23.8% | 27.8% | 25.0% | 25.0% |
| 10 | strategy skills | 12.7% | 5.6% | 25.0% | 25.0% |
Answer: Overall, there seemed to be similarity in the importance of skills across the functional roles. However, as expected, we do find some stark differences between positions. For example, Data Engineers are expected to have more familiarity with cloud-based architectures and technologies, such as AWS and Azure, and there is much more focus on their programming skills (e.g., computer programming and Python programming). For Data Scientists and Data Analysts, however, there is more of a focus on analytical skills and the ability to develop models.
query <- "SELECT * FROM skills LEFT JOIN job_posting ON skills.posting_id = job_posting.posting_id"
combined_data <- dbGetQuery(con, query)
combined_data <- combined_data %>%
select(-posting_id..4)
filtered_posts <- combined_data %>%
filter(!is.na(seniority_level), seniority_level != "")
group_by_seniority <- filtered_posts %>%
group_by(seniority_level) %>%
summarize(group_size = n_distinct(posting_id)) %>%
ungroup()
group_by_skill <- filtered_posts %>%
group_by(seniority_level, discrete_skill) %>%
summarize(num_jobs = n_distinct(posting_id)) %>%
ungroup()
## `summarise()` has grouped output by 'seniority_level'. You can override using
## the `.groups` argument.
combined_by_seniority <- left_join(group_by_skill, group_by_seniority)
## Joining with `by = join_by(seniority_level)`
## Top 10 Skills for Entry Level professionals
combined_by_seniority %>%
filter(seniority_level == 'Entry Level') %>%
ungroup() %>%
arrange(desc(num_jobs)) %>%
mutate(pct_postings = round((num_jobs/group_size),3)) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
na.omit() %>%
select(discrete_skill, pct_postings) %>%
filter(row_number() <= 10) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science for Entry Level Professionals",
align = c("l","c")
) %>%
kable_material(c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | python programming | 83.3% |
| 2 | analysis skills | 66.7% |
| 3 | applied statistics | 50.0% |
| 4 | machine learning | 50.0% |
| 5 | model development | 50.0% |
| 6 | communication skills | 41.7% |
| 7 | database management | 41.7% |
| 8 | deep learning | 41.7% |
| 9 | analytics skills | 33.3% |
| 10 | pytorch | 33.3% |
## Top 10 Skills for Mid Level professionals
combined_by_seniority %>%
filter(seniority_level == 'Mid Level') %>%
ungroup() %>%
arrange(desc(num_jobs)) %>%
mutate(pct_postings = round((num_jobs/group_size),3)) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
na.omit() %>%
select(discrete_skill, pct_postings) %>%
filter(row_number() <= 10) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science for Mid Level Professionals",
align = c("l","c")
) %>%
kable_material(c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | python programming | 73.1% |
| 2 | communication skills | 61.5% |
| 3 | analytics skills | 57.7% |
| 4 | sql | 57.7% |
| 5 | analysis skills | 53.8% |
| 6 | machine learning | 46.2% |
| 7 | problem solving | 42.3% |
| 8 | applied statistics | 34.6% |
| 9 | r programming | 34.6% |
| 10 | model development | 30.8% |
## Top 10 Skills for Senior Level professionals
combined_by_seniority %>%
filter(seniority_level == 'Senior Level') %>%
ungroup() %>%
arrange(desc(num_jobs)) %>%
mutate(pct_postings = round((num_jobs/group_size),3)) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
na.omit() %>%
select(discrete_skill, pct_postings) %>%
filter(row_number() <= 10) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science for Senior Level Professionals",
align = c("l","c")
) %>%
kable_material(c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | python programming | 100% |
| 2 | sql | 100% |
| 3 | tableau | 75% |
| 4 | analysis skills | 50% |
| 5 | analytics skills | 50% |
| 6 | communication skills | 50% |
| 7 | database management | 50% |
| 8 | ggplot | 50% |
| 9 | machine learning | 50% |
| 10 | model development | 50% |
combined_by_seniority %>%
group_by(seniority_level) %>%
mutate(pct_postings = num_jobs/group_size) %>%
na.omit(discrete_skill) %>%
arrange(desc(pct_postings)) %>%
ungroup() %>%
select(seniority_level, discrete_skill, pct_postings) %>%
mutate(pct_postings = scales::percent(round(pct_postings,3))) %>%
mutate(pct_postings = ifelse(pct_postings == 'NA',"-",pct_postings)) %>%
pivot_wider(names_from = seniority_level, values_from = pct_postings) %>%
na.omit() %>%
select(discrete_skill, 'Entry Level', 'Mid Level', 'Senior Level') %>%
filter(row_number() <= 10) %>%
kable(
col.names = (c("Skill", "Entry Level", "Mid Level", "Senior Level")),
row.names = TRUE,
caption = "Top 10 Data Science Skills Common Amongst Each Seniority Level",
align = c("l",rep("c",3))
) %>%
kable_material(c("striped", "condensed"))
| | Skill | Entry Level | Mid Level | Senior Level |
|---|---|---|---|---|
| 1 | python programming | 83.3% | 73.1% | 100.0% |
| 2 | sql | 33.3% | 57.7% | 100.0% |
| 3 | tableau | 8.3% | 11.5% | 75.0% |
| 4 | analysis skills | 66.7% | 53.8% | 50.0% |
| 5 | communication skills | 41.7% | 61.5% | 50.0% |
| 6 | analytics skills | 33.3% | 57.7% | 50.0% |
| 7 | machine learning | 50.0% | 46.2% | 50.0% |
| 8 | model development | 50.0% | 30.8% | 50.0% |
| 9 | database management | 41.7% | 26.9% | 50.0% |
| 10 | project management skills | 8.3% | 19.2% | 50.0% |
Answer: Across the different seniority levels, the top skills common to every level include: a) python programming, b) sql, c) tableau, d) analysis skills, and e) communication skills.
This project provided a very practical way for our team to explore the top skills that we should be developing on our journey toward becoming Data Science professionals. In addition to building our knowledge and awareness of these core skills, it gave us a hands-on way to go through the full Data Analysis and Data Science life-cycle: define a problem, pull raw data, clean it, load it into a database, and then analyze it to answer the questions of interest.
I think that one of the key surprises throughout this project was the number of times we had to go back to the drawing board to account for data that was missing or difficult to acquire, or for other challenges we faced along the way. Additionally, there were a number of manual pre-processing steps that, given more time, we would have tried to solve programmatically.
Overall, this project was a great exercise in working collaboratively in a remote setting. It required us to meet tight deadlines and to leverage the strengths of each member of the team.