Introduction

For this project we focused on answering the question “What are the most valued data science skills?”.

Our method for answering this question was to collect data from a number of Data Science and Analysis related job postings and query that data to identify the skills that are most prominent across these postings.

The overall approach we took for this project was:

  1. Data Collection - collect job postings from various job boards to create our initial corpus of data science skills.

  2. Data Cleaning - clean the raw job postings by creating a single row for each skill requirement listed in a post, then build a data frame of just the skills for each posting. Finally, clean those skills by a) removing non-alphanumeric characters; b) removing extra white space; c) converting all words to lower case; and d) removing stop words.

  3. Word Tokenization - using the job skills data, create a corpus of the discrete terms used across the posts by splitting each listed skill into its component words and building word groups of size n = {1,2,3,4,5}.

  4. Word Classification - export the new corpus of job skills as a .csv file so that we could manually label which words were actual Data Science skills and which were not.

  5. Labeling Original Data - using our resulting dictionary of Data Science skill terms, return to the original posts and label each job skill with the discrete skills it contains, flagging up to seven skills per job requirement.

  6. Tidy Data - convert the new data frame from wide to long format.

  7. Export to Database - with the data in a structured format, export it to a database.

  8. Data Analysis - with the data in a tidy format, begin analyzing it to answer our research question.

Load Packages

We set up our environment by loading the libraries that we will be using.

knitr::opts_chunk$set(echo = TRUE)

rm(list=ls())

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RPostgres)
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(ggwordcloud)
library(dotenv)

Database Connection

We start by connecting to the database that we will use to store the cleaned data and later retrieve it for analysis.

load_dot_env('../creds.env')

con <- dbConnect(Postgres(),
                 dbname = Sys.getenv("DB_NAME"),
                 host = Sys.getenv("DB_HOST"), 
                 port = Sys.getenv("DB_PORT"),
                 user = Sys.getenv("DB_UID"),
                 password = Sys.getenv("DB_PWD"))
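
For reference, a minimal creds.env might look like the following (hypothetical values; the variable names match the Sys.getenv() calls above):

DB_NAME=jobs
DB_HOST=localhost
DB_PORT=5432
DB_UID=postgres
DB_PWD=changeme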

Data Collection

We initiated this project by assigning each member of our team to a different job board and tasking them with collecting 25 job postings each (~100 postings total). This raw data was used to generate our overall corpus of job skills across the different postings.

Core steps:

  1. Input the job postings from the .csv file
  2. Break out the job skills for each posting, with one row per newline-separated skill

path = '../input/Job Postings Data.csv'

jobs_raw <-  read.csv(path)
  
jobs_df <- jobs_raw %>% separate_rows(Skills, sep= "\n")

skills_list <- jobs_df %>% 
  select(Skills)

skills_list <- skills_list %>% 
  mutate(Skills = str_squish(Skills)) %>%
  filter(Skills != "")

#clean the skills list data
skills_list <- skills_list %>% 
  mutate(Skills = tolower(Skills)) %>%
  mutate(Skills = str_replace_all(Skills, '[^[:alnum:]]', ' '))

skills_list <- skills_list %>% 
  mutate(Skills = str_squish(Skills))

Word Tokenization

This step focused on breaking the words used in the job skills portion of each posting into discrete skill terms that we could then use to build our overall corpus of terms.

This step had the following core parts:

  1. Create a function that extracts word groups of multiple lengths
  2. Create a list of the skills words
  3. Remove the stop words from the data

#Import list of stop words

stop_words_list = 'https://raw.githubusercontent.com/igorbrigadir/stopwords/ab85d86c3fac0360020a921b91ccf9d697b54757/en/terrier.txt'


stop_words <- read.table(stop_words_list)
stop_words <- stop_words$V1

#Remove "R" and "C" from stop words
stop_words <- stop_words[stop_words != 'r' & stop_words != 'c']

# Create function to capture word groups
get_words <- function(vec, vec_length, word_size) {
  words_vec = data.frame()
  
  start_index = 1
  end_index = start_index + (word_size-1)
  
  while(end_index <= vec_length) {
    str = paste(vec[start_index:end_index],collapse=" ")
    str_df <- data.frame(str)
    words_vec = rbind(words_vec,str_df)
    start_index = start_index+1
    end_index = end_index+1
  }
  
  return(words_vec)
}
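
As a quick illustration, calling the function above on a hypothetical three-word skill with word_size = 2 yields the two bigrams:

# Example input (hypothetical); returns a data frame with one n-gram per row
words <- c("machine", "learning", "models")
get_words(words, length(words), 2)
##                str
## 1 machine learning
## 2  learning models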


max_word_groups = 5

skills_vec = data.frame()


## Generate list of skills from the job postings
for(i in 1:nrow(skills_list)) {
  skill = skills_list[[i,1]]
  
  if(skill != "") {
    skill_split <- strsplit(skill," ")

    skills = skill_split[[1]]
    skills <- skills[! skills %in% stop_words]
    
    vec_length = length(skills)
    
    for(word_size in 1:max_word_groups) {
      words <- get_words(skills,vec_length,word_size)
      skills_vec = rbind(skills_vec, words)  
    }
    
    
  }
  
}  
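As an aside, growing a data frame with rbind() inside a loop re-copies the accumulated rows on every iteration. A build-then-bind sketch of the same idea (not the code we ran; it assumes the get_words(), stop_words, and max_word_groups objects defined above):

# Sketch: collect each posting's n-gram data frames in a list, bind once.
skills_vec_alt <- skills_list$Skills %>%
  purrr::map(function(s) {
    words <- strsplit(s, " ")[[1]]
    words <- words[!words %in% stop_words]
    purrr::map(1:max_word_groups, function(n) get_words(words, length(words), n))
  }) %>%
  purrr::flatten() %>%
  dplyr::bind_rows()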

Once we tokenized the terms used in the job postings, we cleaned the data to create a simplified view of the distinct terms used across the various postings. We then exported this data so that we could manually review the terms and determine which ones relate to Data Science work and which ones can be discarded/ignored:

The core steps here were:

  1. Clean the tokenized Data Science skills list
  2. Create metrics showing how frequently each term appears in job postings and how many times a version of the term appears inside other terms in the dataset
  3. Export the skills dataset for manual processing

# Rename the skill str column to skill
skills_vec <- skills_vec %>%
  rename(skill = str)

# Convert skills to lower case
skills_df <- skills_vec %>% mutate(skill = tolower(skill))

# Remove special characters from skills list
skills_df <- skills_df %>% 
  mutate(skill = str_replace_all(skill, '[^[:alnum:]]', ' '))

# Remove leading and trailing white space characters 
skills_df <- skills_df %>% 
  mutate(skill = str_squish(skill))

# Generate distinct list of skills words
distinct_skills <- skills_df %>%
  distinct(skill)


# Loop through distinct skills and determine the number of job postings that have the word
word_count = data.frame()

for(i in 1:nrow(distinct_skills)) {
  word <- distinct_skills[[i,1]]
  num_posts <- skills_list %>%
    filter(grepl(word,Skills)) %>%
    nrow()
  
  word_count = rbind(word_count,num_posts)
}

skills_data <- cbind(distinct_skills, word_count)

columns <- c("skill", "num_postings")

colnames(skills_data) <- columns 

# Remove skills words that do not appear in any of the postings
skills_data <- skills_data %>% 
  filter(num_postings >= 1) 


skills_data <- skills_data %>%
  mutate(pct_postings = round(num_postings/nrow(skills_list),4))


# Filter the list to remove the stop words but keep the R and C since those could be in reference to the coding languages. Also remove any blank skills



skills_data <- skills_data %>%
  filter(!skill %in% stop_words) %>%
  filter(skill != "") %>%
  arrange(desc(pct_postings))

#how many times is a word nested in other skills

word_occurrence = data.frame()

for(i in 1:nrow(skills_data)) {
  num_occurrences = 0
  word = skills_data[[i,"skill"]]
  #print(word)
  
  num_occurrences <- skills_data %>%
    filter(grepl(word, skill)) %>%
    nrow()
    
  
  
  word_occurrence <- rbind(word_occurrence, num_occurrences)
  
}

skills_data <- cbind(skills_data, word_occurrence)

skills_data <- data.frame(skills_data)

columns = c("skill", "num_postings", "pct_postings", "num_occurence")

colnames(skills_data) <- columns

skills_data <- as_tibble(skills_data) 

skills_data <- skills_data %>% mutate(pct_occurence = round(num_occurence/n(),4))


write_csv(skills_data, '../output/skills_data.csv')

Using our new corpus of Data Science skills, we can go back to our original list of job skills and identify the job skills that are included in each of the posts. This phase ended up requiring a lot of manual processing. These steps included:

  1. We manually went through the skills_data rows and marked which ones should be considered a skill (a sketch of the resulting file format follows this list).
  2. Removed NA values.
  3. Sorted through the original postings and set the skill to the labeled skill wherever the word appears in the posting.
  4. Tallied the skills.
  5. Conducted a manual review and audit of the data to ensure we weren't excluding anything, then revised the data appropriately.
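
For illustration, the manually labeled file takes the shape sketched below (hypothetical rows and flag value; the real file also carries the num_postings, pct_postings, num_occurence, and pct_occurence metric columns, and the downstream code only checks that is_skill is non-NA):

example_labeled <- tibble::tribble(
  ~skill,             ~is_skill,
  "machine learning", "y",       # "y" is an assumed flag value
  "years experience", NA         # unlabeled rows are dropped later
)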

Word Labeling

Using our list of skills, we were able to go through the output file and flag which words should be treated as relevant skills.

skills_path = '../input/skills_df.csv'

skills_labeled <- read_csv(skills_path)
## Rows: 3576 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): skill, is_skill
## dbl (4): num_postings, pct_postings, num_occurence, pct_occurence
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skills_labeled <- skills_labeled %>% 
  mutate(is_skill = tolower(is_skill))

#Remove unlabeled skills from dataset
skills_labeled_filtered <- skills_labeled %>%
  filter(!is.na(is_skill))


#Take original skills_list and flag the skill if the skill is in our labeled dataset

skills_list_labeled <- skills_list %>% 
  mutate(skill_short1 = NA,
         skill_short2 = NA,
         skill_short3 = NA, 
         skill_short4 = NA,
         skill_short5 = NA,
         skill_short6 = NA,
         skill_short7 = NA)

for(i in 1:nrow(skills_labeled_filtered)) {
  skill = skills_labeled_filtered[[i,1]]
  skill_regex = paste0("\\b", skill, "\\b")
  
  #Set first skill_short value
  skills_list_labeled <- skills_list_labeled %>%
    mutate(skill_short1 = ifelse(is.na(skill_short1) & str_detect(Skills, skill_regex),skill,skill_short1))
  
  #Set second skill_short value
  skills_list_labeled <- skills_list_labeled %>%
    mutate(skill_short2 = ifelse(!is.na(skill_short1) & skill_short1 != skill & is.na(skill_short2) & str_detect(Skills, skill_regex),
                                 skill,skill_short2))
  
  #Set third skill_short value
  skills_list_labeled <- skills_list_labeled %>%
    mutate(skill_short3 = ifelse(!is.na(skill_short2) & skill_short2 != skill & is.na(skill_short3) & str_detect(Skills, skill_regex),
                                 skill,skill_short3))
  
  
  #Set fourth skill_short value
  skills_list_labeled <- skills_list_labeled %>%
    mutate(skill_short4 = ifelse(!is.na(skill_short3) & skill_short3 != skill & is.na(skill_short4) & str_detect(Skills, skill_regex),
                                 skill,skill_short4))
  
  
  #Set fifth skill_short value
  skills_list_labeled <- skills_list_labeled %>%
    mutate(skill_short5 = ifelse(!is.na(skill_short4) & skill_short4 != skill & is.na(skill_short5) & str_detect(Skills, skill_regex),
                                 skill,skill_short5))
  
  #Set sixth skill_short value
  skills_list_labeled <- skills_list_labeled %>%
    mutate(skill_short6 = ifelse(!is.na(skill_short5) & skill_short5 != skill & is.na(skill_short6) & str_detect(Skills, skill_regex),
                                 skill,skill_short6))
  
  #Set seventh skill_short value
  skills_list_labeled <- skills_list_labeled %>%
    mutate(skill_short7 = ifelse(!is.na(skill_short6) & skill_short6 != skill & is.na(skill_short7) & str_detect(Skills, skill_regex),
                                 skill,skill_short7))
  
  
}  
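The loop above repeats the same mutate() seven times. A more compact alternative would be to extract all labeled skills per posting in one pass and keep the first seven; the sketch below (not the code we ran) assumes the skill terms contain no regex metacharacters, and its match order follows position in the text rather than the order of skills_labeled_filtered:

# Sketch: one-pass extraction of up to seven labeled skills per posting.
skill_pattern <- paste0("\\b(", paste(skills_labeled_filtered$skill, collapse = "|"), ")\\b")

skills_list_alt <- skills_list %>%
  mutate(matches = str_extract_all(Skills, skill_pattern)) %>%
  mutate(matches = purrr::map(matches, ~ head(unique(.x), 7))) %>%
  tidyr::unnest_wider(matches, names_sep = "_")  # yields matches_1 ... matches_7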

Tidy Data

Now that we have a labeled list of skills requirements from the original job postings, we can then join these new fields with the original data.

This process primarily consists of:

  1. Further preprocessing the skills and job postings datasets to account for terms that are semantically similar
  2. Removing bad character strings
  3. Imputing Job Position information
  4. Imputing Job Seniority Level information

Join Tables

jobs_df_clean <- jobs_df %>%
  filter(Skills != "") %>%
  mutate(Skills = tolower(Skills)) %>%
  mutate(Skills = str_replace_all(Skills, '[^[:alnum:]]', ' ')) %>%
  mutate(Skills = str_squish(Skills))

  
j1 <- left_join(jobs_df_clean, skills_list_labeled, by=c("Skills"="Skills"), multiple = "all")


j1_pivot <- j1 %>% pivot_longer(cols = c("skill_short1", "skill_short2", "skill_short3", "skill_short4", "skill_short5", "skill_short6", "skill_short7"),
                    names_to = "skill_number", 
                    values_to = "discrete_skill")


columns = c("Job Board", "Post URL", "Company", "Industry", "Company Location", 
            "Job Title", "Listed Skill", "Seniority Level", "Job Type",
            "Years Experience", "skill_number", "discrete_skill")

colnames(j1_pivot) <- columns

j1_pivot <- janitor::clean_names(j1_pivot)

j1_pivot <- j1_pivot %>%
  mutate(discrete_skill = case_when(
    str_detect(discrete_skill, '(communication|communicate)') ~ 'communication skills',
    str_detect(discrete_skill, 'visualization') ~ 'data visualization skills',
    str_detect(discrete_skill, '\\br\\b') ~ 'r programming', 
    str_detect(discrete_skill, 'stat') ~ 'applied statistics', 
    str_detect(discrete_skill,'\\bai\\b') ~ 'artificial intelligence',
    str_detect(discrete_skill,'agile') ~ 'agile development',
    str_detect(discrete_skill, '(^analys|analyz)') ~ 'analysis skills',
    str_detect(discrete_skill, '^analyt') ~ 'analytics skills',
    str_detect(discrete_skill, '(machine learning|ml)') ~ 'machine learning',
    str_detect(discrete_skill, '\\bci\\b') ~ 'ci',
    str_detect(discrete_skill, 'strat') ~ 'strategy skills',
    str_detect(discrete_skill, 'workflow') ~ 'workflow automation',
    str_detect(discrete_skill, '\\bcloud\\b') ~ 'cloud computing',
    str_detect(discrete_skill, 'cluster') ~ 'cluster analysis',
    str_detect(discrete_skill, 'critical') ~ 'critical thinking skills',
    str_detect(discrete_skill, 'project') ~ 'project management skills',
    str_detect(discrete_skill, 'customer') ~ 'customer service skills',
    str_detect(discrete_skill, 'database') ~ 'database management',
    str_detect(discrete_skill, 'big data') ~ 'big data',
    str_detect(discrete_skill, 'deep') ~ 'deep learning',
    str_detect(discrete_skill, '\\betl\\b') ~ 'etl',
    str_detect(discrete_skill, 'python') ~ 'python programming',
    str_detect(discrete_skill, 'innov') ~ 'innovative',
    str_detect(discrete_skill, 'regression') ~ 'regression modeling',
    str_detect(discrete_skill, 'ggplot') ~ 'ggplot',
    str_detect(discrete_skill, 'model') ~ 'model development',
    str_detect(discrete_skill, 'data transformation') ~ 'data transformation',
    str_detect(discrete_skill, '\\bsql\\b') ~ 'sql',
    str_detect(discrete_skill, '(\\bbi\\b|business intelligence)') ~ 'business intelligence skills',
    str_detect(discrete_skill, 'neural') ~ 'neural networks',
    str_detect(discrete_skill, 'keras') ~ 'keras',
    str_detect(discrete_skill, '(sas|saas)') ~ 'saas',
    str_detect(discrete_skill, 'software development') ~ 'software development',
    str_detect(discrete_skill, 'algorithm') ~ 'algorithm design',
    str_detect(discrete_skill, '(\\bmip\\b|\\bqp\\b)') ~ 'mixed-integer programming',
    str_detect(discrete_skill, '\\bnlp\\b') ~ 'natural language processing',
    str_detect(discrete_skill, '(\\boop\\b|object)') ~ 'object oriented programming',
    str_detect(discrete_skill, 'consulting') ~ 'consulting',
    str_detect(discrete_skill, 'predictive') ~ 'predictive modeling',
    str_detect(discrete_skill, 'curio') ~ 'curiosity',
    str_detect(discrete_skill, 'data cleaning') ~ 'data wrangling',
    str_detect(discrete_skill, 'probab') ~ 'probabilistic modeling',
    str_detect(discrete_skill, '(programming|scripting|coding)') & !str_detect(discrete_skill, '(python|sql|database|\\br\\b)') ~ 'computer programming',
    str_detect(discrete_skill, 'solving') ~ 'problem solving',
    str_detect(discrete_skill, 'team') ~ 'teamwork skills',
    str_detect(discrete_skill, 'unsupervised') ~ 'unsupervised learning',
    TRUE ~ discrete_skill
  ))
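
Since case_when() returns the value of the first rule that matches, rule order matters above; e.g., on a hypothetical input:

# 'stat' precedes 'machine learning' in the rule list, so:
dplyr::case_when(
  str_detect("statistical machine learning", 'stat') ~ 'applied statistics',
  str_detect("statistical machine learning", 'machine learning') ~ 'machine learning'
)
## [1] "applied statistics"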
  


## Export cleaned data frame
write_csv(j1_pivot, '../output/jobs_data_clean.csv')

Clean data to prepare for importing into database

company_df <- j1_pivot %>%
  select(company, industry, company_location) %>%
  distinct() %>%
  mutate(company_id = row_number())

skills_df <- j1_pivot %>%
  select(discrete_skill) %>%
  distinct(discrete_skill) %>%
  na.omit(discrete_skill) %>%
  arrange(discrete_skill) %>%
  mutate(skill_id = row_number())

job_board_df <- j1_pivot %>%
  select(job_board) %>%
  distinct() %>%
  mutate(job_board_id = row_number())

job_posting_df <- j1_pivot %>%
  group_by(job_board, post_url, company, job_title) %>%
  summarize(posting_id = cur_group_id())
## `summarise()` has grouped output by 'job_board', 'post_url', 'company'. You can
## override using the `.groups` argument.
j2 <- left_join(j1_pivot,job_posting_df) 
## Joining with `by = join_by(job_board, post_url, company, job_title)`
j3 <- left_join(j2, skills_df)
## Joining with `by = join_by(discrete_skill)`
j4 <- left_join(j3, company_df)
## Joining with `by = join_by(company, industry, company_location)`
j5 <- left_join(j4, job_board_df)
## Joining with `by = join_by(job_board)`
company_data <- j5 %>%
  select(company_id, company, company_location, industry) %>%
  distinct()

posting_data <- j5 %>%
  select(posting_id, company_id, job_board_id, job_title, job_type, seniority_level, years_experience) %>%
  distinct() %>%
  arrange(posting_id)

#Clean up the job type field

posting_data <- posting_data %>%
  mutate(job_type = case_when(
    str_detect(job_type,"\\bHybrid\\b") ~ "Hybrid",
    str_detect(job_type, "\\b[rR]emote\\b") ~ "Remote",
    str_detect(job_type, "\\bIn [pP]erson\\b") ~ "On-Site",
    str_detect(job_type, 'On Premise') ~ 'On-Site',
    str_detect(job_type, "character(0)") ~ "",
    str_detect(job_type, '0') ~ "",
    TRUE ~ job_type
  ))

#Clean up the seniority_level field

posting_data <- posting_data %>%
  mutate(seniority_level = case_when(
    str_detect(seniority_level,"(Entry|Assistant)") ~ 'Entry Level',
    str_detect(seniority_level, "(Mid|Manager|Supervisor)") ~ 'Mid Level',
    str_detect(seniority_level, "(Senior)") ~ 'Senior Level',
    seniority_level == 'character(0)' ~ "",
    seniority_level == '0' ~ "",
    TRUE ~ seniority_level
  ))


# Clean up the years experience field
posting_data <- posting_data %>%
  mutate(years_experience = str_match(years_experience, '[\\d][~]*[\\d]*')) %>%
  mutate(years_experience = str_replace(years_experience, "~", "-"))
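
To illustrate the extraction pattern on hypothetical inputs:

# The pattern grabs the leading digits plus an optional "~" range separator,
# which the str_replace() above then rewrites as "-".
str_match(c("3~5 years experience", "2+ years"), '[\\d][~]*[\\d]*')[, 1]
## [1] "3~5" "2"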
  
#Impute seniority level based on years experience

posting_data <- posting_data %>%
  mutate(seniority_level = ifelse(seniority_level == "" | is.na(seniority_level),
    case_when(
      years_experience %in% c("0","1","2") ~ 'Entry Level',
      years_experience %in% c("3","4") ~ 'Mid Level',
      TRUE ~ seniority_level
    ),seniority_level))


posting_data <- posting_data %>%
  mutate(job_title_category = case_when(
    str_detect(job_title, "(Analyst|Analytics)") ~ 'Data Analyst',
    str_detect(job_title, "Data S") ~ 'Data Scientist',
    str_detect(job_title, 'Engineer') ~ 'Data Engineer',
    str_detect(job_title, 'Machine Learning') ~ 'Data Scientist',
    TRUE ~ 'Other'
  )) 
  


job_board_data <- j5 %>%
  select(job_board_id, job_board) %>%
  distinct()

job_skills_data <- j5 %>%
  select(skill_id, posting_id, discrete_skill) %>%
  distinct() %>%
  arrange(skill_id)

Import dataframe into database

Now that we’ve cleaned the data and created our dataframes, we want to create the relevant tables in our PostgreSQL database and then import the data from the dataframes into these tables.

drop_query = "DROP TABLE company,job_board, skills, job_posting"
dbSendQuery(con, drop_query)
## <PqResult>
##   SQL  DROP TABLE company,job_board, skills, job_posting
##   ROWS Fetched: 0 [complete]
##        Changed: 0
## Write data to database
dbWriteTable(con, "job_board", job_board_data, overwrite=TRUE)
## Warning in result_create(conn@ptr, statement, immediate): Closing open result
## set, cancelling previous query
dbWriteTable(con, "company", company_data, overwrite=TRUE)
dbWriteTable(con, "skills", job_skills_data, overwrite=TRUE)
dbWriteTable(con, "job_posting", posting_data, overwrite=TRUE)

Analysis of Data

Now that we have cleaned data available through our database, we are ready to answer our core data analysis questions. For this project we focused on answering the following questions:

  1. What skills are the most commonly listed across all job positions?
  2. Are there differences in skills based on position/title?
  3. What skills are the most commonly listed across jobs based on seniority level?

Question 1 - What skills are the most commonly listed across all job positions?

query = "SELECT * FROM skills"
skills_data <- dbGetQuery(con, query)

query = "SELECT * FROM job_posting"
posting_data <- dbGetQuery(con,query)

num_postings = posting_data %>% nrow()

top50_skills <- skills_data %>%
  group_by(discrete_skill) %>%
  summarize(num_jobs = n_distinct(posting_id)) %>%
  mutate(pct_of_jobs = num_jobs/num_postings) %>%
  na.omit() %>%
  arrange(desc(pct_of_jobs)) %>%
  head(50)


top25_skills <- skills_data %>%
  group_by(discrete_skill) %>%
  summarize(num_jobs = n_distinct(posting_id)) %>%
  mutate(pct_of_jobs = num_jobs/num_postings) %>%
  na.omit() %>%
  arrange(desc(pct_of_jobs)) %>% 
  head(25) %>%
  mutate(discrete_skill = factor(discrete_skill, levels=discrete_skill, ordered=TRUE))
  
  
## Table of top 25 Job Postings
top25_skills %>%
  mutate(pct_of_jobs = scales::percent(pct_of_jobs)) %>%
  kable(
    col.names = c("Skill", "No. Jobs", "Percent of Jobs"),
    row.names = TRUE,
    caption = "Top 25 Skills for Data Scientists & Data Professionals (N = 89 Postings)",
    align = c("l", "c", "c")
  ) %>%
  kable_material(c("striped"))
Top 25 Skills for Data Scientists & Data Professionals (N = 89 Postings)
Skill No. Jobs Percent of Jobs
1 analysis skills 52 58.4%
2 python programming 51 57.3%
3 machine learning 39 43.8%
4 communication skills 38 42.7%
5 analytics skills 37 41.6%
6 sql 34 38.2%
7 applied statistics 31 34.8%
8 model development 30 33.7%
9 programming 28 31.5%
10 database management 24 27.0%
11 r programming 24 27.0%
12 problem solving 22 24.7%
13 algorithm design 21 23.6%
14 big data 16 18.0%
15 aws 13 14.6%
16 tableau 13 14.6%
17 business intelligence skills 12 13.5%
18 deep learning 12 13.5%
19 saas 12 13.5%
20 azure 11 12.4%
21 curiosity 11 12.4%
22 programming languages 11 12.4%
23 project management skills 11 12.4%
24 strategy skills 11 12.4%
25 cluster analysis 10 11.2%
## Shows top 25 skills for Data Scientist
ggplot(data = top25_skills, aes(x=reorder(discrete_skill, num_jobs), y=pct_of_jobs)) +
  geom_bar(stat='identity', aes(fill=discrete_skill)) +
  coord_flip() + 
  labs(
    x = "Job Skills",
    y = "Percent of Jobs",
    title = 'Top 25 Skills For Data Scientist and Data Professionals',
    subtitle = 'N = 89 Postings',
    fill = 'Skills'
  ) + 
  scale_y_continuous(labels = scales::percent)

## Wordcloud of top 50 Skills
ggplot(top50_skills, aes(label=discrete_skill, size=num_jobs)) + 
  geom_text_wordcloud() + 
  theme_minimal()

Answer:

For this question we pulled the top 25 skills across our data science jobs and found that the top 5 were:

  1. Analysis Skills - 58.4% of postings
  2. Python Programming - 57.3% of postings
  3. Machine Learning - 43.8% of postings
  4. Communication Skills - 42.7% of postings
  5. Analytics Skills - 41.6% of postings

Question 2 - Are there differences in skills based on position/title?

query <- "SELECT * FROM skills LEFT JOIN job_posting ON skills.posting_id = job_posting.posting_id"

combined_data <- dbGetQuery(con, query)

combined_data <- combined_data %>%
  select(-posting_id..4)

num_postings_by_cat <- combined_data %>% 
  group_by(job_title_category) %>%
  summarize(total_postings = n_distinct(posting_id)) %>%
  arrange(desc(total_postings))

num_postings_by_skill <- combined_data %>%
  group_by(job_title_category, discrete_skill) %>%
  summarize(num_postings = n_distinct(posting_id)) %>%
  ungroup()
## `summarise()` has grouped output by 'job_title_category'. You can override
## using the `.groups` argument.
combined_postings_by_cat <- left_join(num_postings_by_skill, num_postings_by_cat) %>%
  mutate(pct_postings = num_postings/total_postings)
## Joining with `by = join_by(job_title_category)`
## Top Skills for Data Scientist Positions
combined_postings_by_cat %>%
  filter(job_title_category == 'Data Scientist') %>%
  group_by(job_title_category) %>%
  na.omit(discrete_skill) %>%
  arrange(desc(num_postings)) %>%
  filter(row_number() <= 10) %>%
  ungroup() %>%
  select(discrete_skill, pct_postings) %>%
  mutate(pct_postings = scales::percent(pct_postings)) %>%
  kable(
    col.names = (c("Skill", "Percent of Postings")),
    row.names = TRUE,
    caption = "Top 10 Data Science Skills for Data Scientists",
    align = c("l","c")
  ) %>%
  kable_material(font_size = 15, c("striped", "condensed"))
Top 10 Data Science Skills for Data Scientists
Skill Percent of Postings
1 analysis skills 63.5%
2 python programming 60.3%
3 communication skills 42.9%
4 machine learning 42.9%
5 model development 41.3%
6 analytics skills 39.7%
7 applied statistics 36.5%
8 programming 34.9%
9 algorithm design 31.7%
10 sql 31.7%
## Top Skills for Data Analyst Positions
combined_postings_by_cat %>%
  filter(job_title_category == 'Data Analyst') %>%
  group_by(job_title_category) %>%
  na.omit(discrete_skill) %>%
  arrange(desc(num_postings)) %>%
  filter(row_number() <= 10) %>%
  ungroup() %>%
  select(discrete_skill, pct_postings) %>%
  mutate(pct_postings = scales::percent(pct_postings)) %>%
  kable(
    col.names = (c("Skill", "Percent of Postings")),
    row.names = TRUE,
    caption = "Top 10 Data Science Skills for Data Analysts",
    align = c("l","c")
  ) %>%
  kable_material(font_size = 15, c("striped", "condensed"))
Top 10 Data Science Skills for Data Analysts
Skill Percent of Postings
1 analytics skills 55.6%
2 sql 55.6%
3 analysis skills 50.0%
4 machine learning 44.4%
5 python programming 44.4%
6 applied statistics 33.3%
7 communication skills 33.3%
8 database management 33.3%
9 problem solving 27.8%
10 tableau 27.8%
## Top Skills for Data Engineer Positions
combined_postings_by_cat %>%
  filter(job_title_category == 'Data Engineer') %>%
  group_by(job_title_category) %>%
  na.omit(discrete_skill) %>%
  arrange(desc(num_postings)) %>%
  filter(row_number() <= 10) %>%
  ungroup() %>%
  select(discrete_skill, pct_postings) %>%
  mutate(pct_postings = scales::percent(pct_postings)) %>%
  kable(
    col.names = (c("Skill", "Percent of Postings")),
    row.names = TRUE,
    caption = "Top 10 Data Science Skills for Data Engineers",
    align = c("l","c")
  ) %>%
  kable_material(font_size = 15, c("striped", "condensed"))
Top 10 Data Science Skills for Data Engineers
Skill Percent of Postings
1 aws 75%
2 azure 75%
3 python programming 75%
4 sql 75%
5 ci 50%
6 communication skills 50%
7 data pipelines 50%
8 machine learning 50%
9 pipelines 50%
10 programming 50%
top10_by_cat <- combined_postings_by_cat %>%
  group_by(job_title_category) %>%
  na.omit(discrete_skill) %>%
  arrange(desc(num_postings)) %>%
  filter(row_number() <= 10)
  

## Top Data Science Skills common across different data functions

combined_postings_by_cat %>%
  group_by(job_title_category) %>%
  na.omit(discrete_skill) %>%
  arrange(desc(num_postings)) %>%
  ungroup() %>%
  select(job_title_category, discrete_skill, pct_postings) %>%
  mutate(pct_postings = scales::percent(round(pct_postings,3))) %>%
  mutate(pct_postings = ifelse(pct_postings == 'NA',"-",pct_postings)) %>%
  pivot_wider(names_from = job_title_category, values_from = pct_postings) %>%
  na.omit() %>%
  filter(row_number() <= 10) %>%
  kable(
    col.names = (c("Skill", "Data Scientist", "Data Analyst", "Data Engineer", "Other")),
    row.names = TRUE,
    caption = "Top 10 Data Science Skills Consistent Across Each Data Function/Title",
    align = c("l",rep("c",4))
  ) %>%
  kable_material(font_size = 15, c("striped", "condensed"))
Top 10 Data Science Skills Consistent Across Each Data Function/Title
Skill Data Scientist Data Analyst Data Engineer Other
1 analysis skills 63.5% 50.0% 25.0% 50.0%
2 python programming 60.3% 44.4% 75.0% 50.0%
3 communication skills 42.9% 33.3% 50.0% 75.0%
4 machine learning 42.9% 44.4% 50.0% 50.0%
5 model development 41.3% 11.1% 25.0% 25.0%
6 analytics skills 39.7% 55.6% 25.0% 25.0%
7 sql 31.7% 55.6% 75.0% 25.0%
8 r programming 28.6% 16.7% 25.0% 50.0%
9 problem solving 23.8% 27.8% 25.0% 25.0%
10 strategy skills 12.7% 5.6% 25.0% 25.0%

Answer: Overall there was broad similarity in the importance of skills across the functional roles. However, as expected, we found some stark differences between positions. For example, Data Engineers are expected to have more familiarity with cloud-based architectures and technologies, such as AWS and Azure, and there is far more focus on their programming skills (e.g., computer programming and Python programming). For Data Scientists and Data Analysts, the focus is more on analytical skills and the ability to develop models.

Question 3 - What skills are the most commonly listed across jobs based on seniority level?

query <- "SELECT * FROM skills LEFT JOIN job_posting ON skills.posting_id = job_posting.posting_id"

combined_data <- dbGetQuery(con, query)

combined_data <- combined_data %>%
  select(-posting_id..4)

filtered_posts <- combined_data %>%
  filter(!is.na(seniority_level), seniority_level != "")


group_by_seniority <- filtered_posts %>%
  group_by(seniority_level) %>%
  summarize(group_size = n_distinct(posting_id)) %>%
  ungroup()



group_by_skill <- filtered_posts %>%
  group_by(seniority_level, discrete_skill) %>%
  summarize(num_jobs = n_distinct(posting_id)) %>%
  ungroup()
## `summarise()` has grouped output by 'seniority_level'. You can override using
## the `.groups` argument.
combined_by_seniority <- left_join(group_by_skill, group_by_seniority)
## Joining with `by = join_by(seniority_level)`
## Top 10 Skills for Entry Level professionals
combined_by_seniority %>%
  filter(seniority_level == 'Entry Level') %>%
  ungroup() %>%
  arrange(desc(num_jobs)) %>%
  mutate(pct_postings = round((num_jobs/group_size),3)) %>%
  mutate(pct_postings = scales::percent(pct_postings)) %>%
  na.omit() %>%
  select(discrete_skill, pct_postings) %>%
  filter(row_number() <= 10) %>%
  kable(
    col.names = (c("Skill", "Percent of Postings")),
    row.names = TRUE,
    caption = "Top 10 Data Science for Entry Level Professionals",
    align = c("l","c")
  ) %>%
  kable_material(c("striped", "condensed"))
Top 10 Data Science Skills for Entry Level Professionals
Skill Percent of Postings
1 python programming 83.3%
2 analysis skills 66.7%
3 applied statistics 50.0%
4 machine learning 50.0%
5 model development 50.0%
6 communication skills 41.7%
7 database management 41.7%
8 deep learning 41.7%
9 analytics skills 33.3%
10 pytorch 33.3%
## Top 10 Skills for Mid Level professionals
combined_by_seniority %>%
  filter(seniority_level == 'Mid Level') %>%
  ungroup() %>%
  arrange(desc(num_jobs)) %>%
  mutate(pct_postings = round((num_jobs/group_size),3)) %>%
  mutate(pct_postings = scales::percent(pct_postings)) %>%
  na.omit() %>%
  select(discrete_skill, pct_postings) %>%
  filter(row_number() <= 10) %>%
  kable(
    col.names = (c("Skill", "Percent of Postings")),
    row.names = TRUE,
    caption = "Top 10 Data Science for Mid Level Professionals",
    align = c("l","c")
  ) %>%
  kable_material(c("striped", "condensed"))
Top 10 Data Science Skills for Mid Level Professionals
Skill Percent of Postings
1 python programming 73.1%
2 communication skills 61.5%
3 analytics skills 57.7%
4 sql 57.7%
5 analysis skills 53.8%
6 machine learning 46.2%
7 problem solving 42.3%
8 applied statistics 34.6%
9 r programming 34.6%
10 model development 30.8%
## Top 10 Skills for Senior Level professionals
combined_by_seniority %>%
  filter(seniority_level == 'Senior Level') %>%
  ungroup() %>%
  arrange(desc(num_jobs)) %>%
  mutate(pct_postings = round((num_jobs/group_size),3)) %>%
  mutate(pct_postings = scales::percent(pct_postings)) %>%
  na.omit() %>%
  select(discrete_skill, pct_postings) %>%
  filter(row_number() <= 10) %>%
  kable(
    col.names = (c("Skill", "Percent of Postings")),
    row.names = TRUE,
    caption = "Top 10 Data Science for Senior Level Professionals",
    align = c("l","c")
  ) %>%
  kable_material(c("striped", "condensed"))
Top 10 Data Science Skills for Senior Level Professionals
Skill Percent of Postings
1 python programming 100%
2 sql 100%
3 tableau 75%
4 analysis skills 50%
5 analytics skills 50%
6 communication skills 50%
7 database management 50%
8 ggplot 50%
9 machine learning 50%
10 model development 50%
combined_by_seniority %>%
  group_by(seniority_level) %>%
  mutate(pct_postings = num_jobs/group_size) %>%
  na.omit(discrete_skill) %>%
  arrange(desc(pct_postings)) %>%
  ungroup() %>%
  select(seniority_level, discrete_skill, pct_postings) %>%
  mutate(pct_postings = scales::percent(round(pct_postings,3))) %>%
  mutate(pct_postings = ifelse(pct_postings == 'NA',"-",pct_postings)) %>%
  pivot_wider(names_from = seniority_level, values_from = pct_postings) %>%
  na.omit() %>%
  select(discrete_skill, 'Entry Level', 'Mid Level', 'Senior Level') %>% 
  filter(row_number() <= 10) %>%
  kable(
      col.names = (c("Skill", "Entry Level", "Mid Level", "Senior Level")),
      row.names = TRUE,
      caption = "Top 10 Data Science Skills Common Amongst Each Seniority Level",
      align = c("l",rep("c",3))
    ) %>%
    kable_material(c("striped", "condensed"))
Top 10 Data Science Skills Common Amongst Each Seniority Level
Skill Entry Level Mid Level Senior Level
1 python programming 83.3% 73.1% 100.0%
2 sql 33.3% 57.7% 100.0%
3 tableau 8.3% 11.5% 75.0%
4 analysis skills 66.7% 53.8% 50.0%
5 communication skills 41.7% 61.5% 50.0%
6 analytics skills 33.3% 57.7% 50.0%
7 machine learning 50.0% 46.2% 50.0%
8 model development 50.0% 30.8% 50.0%
9 database management 41.7% 26.9% 50.0%
10 project management skills 8.3% 19.2% 50.0%

Answer: Across the different seniority levels, we see that the top skills common to every level of seniority include: a) python programming, b) sql, c) tableau, d) analysis skills, and e) communication skills.

Conclusion

This project provided a very practical way for our team to explore the top skills that we should be developing on our journey towards becoming Data Science professionals. In addition to building up our knowledge and awareness of these core skills, this project gave us a hands-on way to go through the full Data Analysis and Data Science life-cycle: coming up with a problem, pulling raw data, cleaning it, importing it into a database, and then analyzing it to answer the questions of importance.

One of the key surprises throughout this project was the number of times we had to go back to the drawing board to account for data that was missing or difficult to acquire, along with the other challenges we faced over the life of the project. Additionally, there were a number of manual pre-processing steps that, given more time, we might have solved programmatically.

Overall, this project was a great exercise in working collaboratively in a remote setting. It also pushed us to meet tight deadlines and to leverage the strengths of each member of the team.