For this project we focused on answering the question “What are the most valued data science skills?”.
Our method for answering this question was to collect a number of Data Science and analysis-related job postings and query that data to identify the skills that appear most prominently across those postings.
The overall approach we took for this project was:
Data Collection - collect job postings from various job boards to create our initial corpus of data science skills.
Data Cleaning - clean the raw job postings by creating a single row for each skill requirement listed in a post, then build a data frame containing just the skills for each posting. Finally, clean those skills by a) removing non-alphanumeric characters, b) removing extra white space, c) converting all words to lower case, and d) removing stop words.
Word Tokenization - using the job skills data, create a corpus of the discrete words that appear in the posts by splitting each listed skill into its individual words and forming word groups of size n = {1, 2, 3, 4, 5}.
Word Classification - export the new corpus of candidate skill terms as a .csv file and manually label which terms are actual Data Science skills and which are not.
Labeling Original Data - using the resulting dictionary of Data Science skills, return to the original posts and label each job skill requirement with the discrete skills it contains, flagging up to seven skills per requirement.
Tidy Data - convert the labeled data frame from wide to long format.
Export to Database - with the data in a structured format, export it to a database.
Data Analysis - with the data in a tidy format, query and analyze it to answer our research question.
We set up our environment by loading the libraries that we will use.
knitr::opts_chunk$set(echo = TRUE)
rm(list=ls())
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(RPostgres)
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(ggwordcloud)
library(dotenv)
For this step we start by connecting to the database that we will use to store the cleaned data and later to retrieve it for analysis.
load_dot_env('../creds.env')
con <- dbConnect(Postgres(),
dbname = Sys.getenv("DB_NAME"),
host = Sys.getenv("DB_HOST"),
port = Sys.getenv("DB_PORT"),
user = Sys.getenv("DB_UID"),
password = Sys.getenv("DB_PWD"))
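The connection credentials are read from ../creds.env via load_dot_env(). Below is a minimal sketch of what that file is assumed to contain, shown as comment lines and using the standard KEY=VALUE dotenv layout; all values are placeholders, not the project's actual credentials.
# ../creds.env (placeholder values)
# DB_NAME=jobs_db
# DB_HOST=localhost
# DB_PORT=5432
# DB_UID=db_user
# DB_PWD=db_password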
We initiated this project by assigning each member of our team to a different job board and tasking them with collecting 25 job postings each (~100 postings in total). This raw data was used to generate our overall corpus of job skills across a range of data-related roles.
Core steps:
1. Input the job postings from the .csv file
2. Break out the job skills for each posting into separate rows, splitting on new lines
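Before importing, it helps to note the layout we assume for the raw file; this is inferred from the column renames later in this script, so treat it as an assumption rather than a specification.
# Assumed columns in '../input/Job Postings Data.csv': one row per posting with a
# Skills field holding newline-separated requirements, plus Job Board, Post URL,
# Company, Industry, Company Location, Job Title, Seniority Level, Job Type, and
# Years Experience.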
path = '../input/Job Postings Data.csv'
jobs_raw <- read.csv(path)
jobs_df <- jobs_raw %>% separate_rows(Skills, sep= "\n")
skills_list <- jobs_df %>%
select(Skills)
skills_list <- skills_list %>%
mutate(Skills = str_squish(Skills)) %>%
filter(Skills != "")
#clean the skills list data
skills_list <- skills_list %>%
mutate(Skills = tolower(Skills)) %>%
mutate(Skills = str_replace_all(Skills, '[^[:alnum:]]', ' '))
skills_list <- skills_list %>%
mutate(Skills = str_squish(Skills))
This step focused on breaking the words used in the job skills portion of each posting into discrete skill terms that we could then use to build our overall corpus of terms.
This step had the following core steps:
1. Create a function that can be used to get strings of multiple lengths
2. Create a list of the skills words
3. Remove the stop words from the data
#Import list of stop words
stop_words_list = 'https://raw.githubusercontent.com/igorbrigadir/stopwords/ab85d86c3fac0360020a921b91ccf9d697b54757/en/terrier.txt'
stop_words <- read.table(stop_words_list)
stop_words <- stop_words$V1
#Remove "R" and "C" from stop words
stop_words <- stop_words[stop_words != 'r' & stop_words != 'c']
# Create function to capture word groups
get_words <- function(vec, vec_length, word_size) {
words_vec = data.frame()
start_index = 1
end_index = start_index + (word_size-1)
while(end_index <= vec_length) {
str = paste(vec[start_index:end_index],collapse=" ")
str_df <- data.frame(str)
words_vec = rbind(words_vec,str_df)
start_index = start_index+1
end_index = end_index+1
}
return(words_vec)
}
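As a quick illustration of the helper (a toy input, not a row from the dataset), calling it with word_size = 2 returns the bigrams of the phrase; the call is left as a comment so it does not affect the pipeline.
# get_words(c("machine", "learning", "with", "python"), 4, 2)
# returns a one-column data frame (column 'str') containing:
#   "machine learning", "learning with", "with python"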
max_word_groups = 5
skills_vec = data.frame()
## Generate list of skills from the job postings
for(i in 1:nrow(skills_list)) {
skill = skills_list[[i,1]]
if(skill != "") {
skill_split <- strsplit(skill," ")
skills = skill_split[[1]]
skills <- skills[! skills %in% stop_words]
vec_length = length(skills)
for(word_size in 1:max_word_groups) {
words <- get_words(skills,vec_length,word_size)
skills_vec = rbind(skills_vec, words)
}
}
}
Once we tokenized the terms used in the job postings, we cleaned the data to create a simplified view of the distinct terms used across the various postings. We then exported this data so that we could manually determine which terms relate to Data Science work and which could be discarded/ignored:
The core steps here were:
1. Clean the tokenized Data Science skills list
2. Create metrics showing how frequently each term appears in job postings and how many times a version of the term appears within other terms in the dataset
3. Export the skills dataset for manual processing
# Rename the skill str column to skill
skills_vec <- skills_vec %>%
rename(skill = str)
# Convert skills to lower case
skills_df <- skills_vec %>% mutate(skill = tolower(skill))
# Remove special characters from skills list
skills_df <- skills_df %>%
mutate(skill = str_replace_all(skill, '[^[:alnum:]]', ' '))
# Remove leading and trailing white space characters
skills_df <- skills_df %>%
mutate(skill = str_squish(skill))
# Generate distinct list of skills words
distinct_skills <- skills_df %>%
distinct(skill)
# Loop through distinct skills and determine the number of job postings that have the word
word_count = data.frame()
for(i in 1:nrow(distinct_skills)) {
word <- distinct_skills[[i,1]]
num_posts <- skills_list %>%
filter(grepl(word,Skills)) %>%
nrow()
word_count = rbind(word_count,num_posts)
}
skills_data <- cbind(distinct_skills, word_count)
columns <- c("skill", "num_postings")
colnames(skills_data) <- columns
# Remove skills words that do not appear in any of the postings
skills_data <- skills_data %>%
filter(num_postings >= 1)
skills_data <- skills_data %>%
mutate(pct_postings = round(num_postings/nrow(skills_list),4))
# Filter the list to remove the stop words but keep the R and C since those could be in reference to the coding languages. Also remove any blank skills
skills_data <- skills_data %>%
filter(!skill %in% stop_words) %>%
filter(skill != "") %>%
arrange(desc(pct_postings))
#how many times is a word nested in other skills
word_occurrence = data.frame()
for(i in 1:nrow(skills_data)) {
num_occurrences = 0
word = skills_data[[i,"skill"]]
#print(word)
num_occurrences <- skills_data %>%
filter(grepl(word, skill)) %>%
nrow()
word_occurrence <- rbind(word_occurrence, num_occurrences)
}
skills_data <- cbind(skills_data, word_occurrence)
skills_data <- data.frame(skills_data)
columns = c("skill", "num_postings", "pct_postings", "num_occurence")
colnames(skills_data) <- columns
skills_data <- as_tibble(skills_data)
skills_data <- skills_data %>% mutate(pct_occurence = round(num_occurence/n(),4))
write_csv(skills_data, '../output/skills_data.csv')
Using our new corpus of Data Science skills, we can go back to our original list of job skills and identify the skills included in each post. This phase ended up requiring a good deal of manual processing: using the skills list, we labeled the terms in the exported file to indicate which ones should be flagged as relevant skills.
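The manual pass produced ../input/skills_df.csv from the exported skills_data.csv. The exact convention is inferred from the filtering below, so treat this note as an assumption.
# Assumed labeling convention for '../input/skills_df.csv': an is_skill column was
# added by hand; rows describing genuine Data Science skills received a non-empty
# label, and rows left blank (NA) are dropped by the filter that follows.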
skills_path = '../input/skills_df.csv'
skills_labeled <- read_csv(skills_path)
## Rows: 3576 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): skill, is_skill
## dbl (4): num_postings, pct_postings, num_occurence, pct_occurence
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skills_labeled <- skills_labeled %>%
mutate(is_skill = tolower(is_skill))
#Remove unlabeled skills from dataset
skills_labeled_filtered <- skills_labeled %>%
filter(!is.na(is_skill))
#Take original skills_list and flag the skill if the skill is in our labeled dataset
skills_list_labeled <- skills_list %>%
mutate(skill_short1 = NA,
skill_short2 = NA,
skill_short3 = NA,
skill_short4 = NA,
skill_short5 = NA,
skill_short6 = NA,
skill_short7 = NA)
for(i in 1:nrow(skills_labeled_filtered)) {
skill = skills_labeled_filtered[[i,1]]
skill_regex = paste0("\\b", skill, "\\b")
#Set first skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short1 = ifelse(is.na(skill_short1) & str_detect(Skills, skill_regex),skill,skill_short1))
#Set second skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short2 = ifelse(!is.na(skill_short1) & skill_short1 != skill & is.na(skill_short2) & str_detect(Skills, skill_regex),
skill,skill_short2))
#Set third skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short3 = ifelse(!is.na(skill_short2) & skill_short2 != skill & is.na(skill_short3) & str_detect(Skills, skill_regex),
skill,skill_short3))
#Set fourth skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short4 = ifelse(!is.na(skill_short3) & skill_short3 != skill & is.na(skill_short4) & str_detect(Skills, skill_regex),
skill,skill_short4))
#Set fifth skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short5 = ifelse(!is.na(skill_short4) & skill_short4 != skill & is.na(skill_short5) & str_detect(Skills, skill_regex),
skill,skill_short5))
#Set sixth skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short6 = ifelse(!is.na(skill_short5) & skill_short5 != skill & is.na(skill_short6) & str_detect(Skills, skill_regex),
skill,skill_short6))
#Set seventh skill_short value
skills_list_labeled <- skills_list_labeled %>%
mutate(skill_short7 = ifelse(!is.na(skill_short6) & skill_short6 != skill & is.na(skill_short7) & str_detect(Skills, skill_regex),
skill,skill_short7))
}
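As a side note, the seven repeated mutate() blocks above could be collapsed into a single pass. The sketch below is not the method used to produce the results in this report; it assumes the same skills_list and skills_labeled_filtered objects, gathers every dictionary match per posting with str_extract_all(), and keeps the first seven unique matches. Match order follows the posting text rather than the dictionary, so the column assignments can differ from the loop above, and the combined regex may be slow for very large dictionaries.
# Build one alternation pattern from the labeled dictionary
skill_pattern <- paste0("\\b", skills_labeled_filtered$skill, "\\b", collapse = "|")
skills_list_labeled_alt <- skills_list %>%
  # one vectorized regex pass per posting instead of one pass per dictionary term
  mutate(matches = str_extract_all(Skills, skill_pattern)) %>%
  # keep at most seven unique matched skills per posting (shorter lists pad with NA)
  mutate(matches = purrr::map(matches, ~ unique(.x)[1:7])) %>%
  # spread into skill_short1..skill_short7 to mirror the columns built above
  tidyr::unnest_wider(matches, names_sep = "") %>%
  rename_with(~ str_replace(.x, "^matches", "skill_short"), starts_with("matches"))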
Now that we have a labeled list of skills requirements from the original job postings, we can then join these new fields with the original data.
This process primarily consists of:
1. Further preprocessing of the skills and job postings datasets to account for terms that are semantically similar
2. Removing bad character strings
3. Imputing Job Position information
4. Imputing Job Seniority Level information
jobs_df_clean <- jobs_df %>%
filter(Skills != "") %>%
mutate(Skills = tolower(Skills)) %>%
mutate(Skills = str_replace_all(Skills, '[^[:alnum:]]', ' ')) %>%
mutate(Skills = str_squish(Skills))
j1 <- left_join(jobs_df_clean, skills_list_labeled, by=c("Skills"="Skills"), multiple = "all")
j1_pivot <- j1 %>% pivot_longer(cols = c("skill_short1", "skill_short2", "skill_short3", "skill_short4", "skill_short5", "skill_short6", "skill_short7"),
names_to = "skill_number",
values_to = "discrete_skill")
columns = c("Job Board", "Post URL", "Company", "Industry", "Company Location",
"Job Title", "Listed Skill", "Seniority Level", "Job Type",
"Years Experience", "skill_number", "discrete_skill")
colnames(j1_pivot) <- columns
j1_pivot <- janitor::clean_names(j1_pivot)
j1_pivot <- j1_pivot %>%
mutate(discrete_skill = case_when(
str_detect(discrete_skill, '(communication|communicate)') ~ 'communication skills',
str_detect(discrete_skill, 'visualization') ~ 'data visualization skills',
str_detect(discrete_skill, '\\br\\b') ~ 'r programming',
str_detect(discrete_skill, 'stat') ~ 'applied statistics',
str_detect(discrete_skill,'\\bai\\b') ~ 'artificial intelligence',
str_detect(discrete_skill,'agile') ~ 'agile development',
str_detect(discrete_skill, '(^analys|analyz)') ~ 'analysis skills',
str_detect(discrete_skill, '^analyt') ~ 'analytics skills',
str_detect(discrete_skill, '(machine learning|ml)') ~ 'machine learning',
str_detect(discrete_skill, '\\bci\\b') ~ 'ci',
str_detect(discrete_skill, 'strat') ~ 'strategy skills',
str_detect(discrete_skill, 'workflow') ~ 'workflow automation',
str_detect(discrete_skill, '\\bcloud\\b') ~ 'cloud computing',
str_detect(discrete_skill, 'cluster') ~ 'cluster analysis',
str_detect(discrete_skill, 'critical') ~ 'critical thinking skills',
str_detect(discrete_skill, 'project') ~ 'project management skills',
str_detect(discrete_skill, 'customer') ~ 'customer service skills',
str_detect(discrete_skill, 'database') ~ 'database management',
str_detect(discrete_skill, 'big data') ~ 'big data',
str_detect(discrete_skill, 'deep') ~ 'deep learning',
str_detect(discrete_skill, '\\betl\\b') ~ 'etl',
str_detect(discrete_skill, 'python') ~ 'python programming',
str_detect(discrete_skill, 'innov') ~ 'innovative',
str_detect(discrete_skill, 'regression') ~ 'regression modeling',
str_detect(discrete_skill, 'ggplot') ~ 'ggplot',
str_detect(discrete_skill, 'model') ~ 'model development',
str_detect(discrete_skill, 'data transformation') ~ 'data transformation',
str_detect(discrete_skill, '\\bsql\\b') ~ 'sql',
str_detect(discrete_skill, '(\\bbi\\b|business intelligence)') ~ 'business intelligence skills',
str_detect(discrete_skill, '\\bai\\b') ~ 'artificial intelligence',
str_detect(discrete_skill, 'neural') ~ 'neural networks',
str_detect(discrete_skill, 'keras') ~ 'keras',
str_detect(discrete_skill, '(sas|saas)') ~ 'saas',
str_detect(discrete_skill, 'software development') ~ 'software development',
str_detect(discrete_skill, 'algorithm') ~ 'algorithm design',
str_detect(discrete_skill, '(\\bmip\\b|\\bqp\\b)') ~ 'mixed-integer programming',
str_detect(discrete_skill, '\\bnlp\\b') ~ 'natural language processing',
str_detect(discrete_skill, '(\\boop\\b|object)') ~ 'object oriented programming',
str_detect(discrete_skill, 'consulting') ~ 'consulting',
str_detect(discrete_skill, 'predictive') ~ 'predictive modeling',
str_detect(discrete_skill, 'curio') ~ 'curiosity',
str_detect(discrete_skill, 'data cleaning') ~ 'data wrangling',
str_detect(discrete_skill, 'probab') ~ 'probabilistic modeling',
str_detect(discrete_skill, '(programming|scripting|coding)') & !str_detect(discrete_skill, '(python|sql|database|\\br\\b)') ~ 'computer programming',
str_detect(discrete_skill, 'solving') ~ 'problem solving',
str_detect(discrete_skill, 'team') ~ 'teamwork skills',
str_detect(discrete_skill, 'unsupervised') ~ 'unsupervised learning',
TRUE ~ discrete_skill
))
## Export cleaned data frame
write_csv(j1_pivot, '../output/jobs_data_clean.csv')
company_df <- j1_pivot %>%
select(company, industry, company_location) %>%
distinct() %>%
mutate(company_id = row_number())
skills_df <- j1_pivot %>%
select(discrete_skill) %>%
distinct(discrete_skill) %>%
na.omit(discrete_skill) %>%
arrange(discrete_skill) %>%
mutate(skill_id = row_number())
job_board_df <- j1_pivot %>%
select(job_board) %>%
distinct() %>%
mutate(job_board_id = row_number())
job_posting_df <- j1_pivot %>%
group_by(job_board, post_url, company, job_title) %>%
summarize(posting_id = cur_group_id())
## `summarise()` has grouped output by 'job_board', 'post_url', 'company'. You can
## override using the `.groups` argument.
j2 <- left_join(j1_pivot,job_posting_df)
## Joining with `by = join_by(job_board, post_url, company, job_title)`
j3 <- left_join(j2, skills_df)
## Joining with `by = join_by(discrete_skill)`
j4 <- left_join(j3, company_df)
## Joining with `by = join_by(company, industry, company_location)`
j5 <- left_join(j4, job_board_df)
## Joining with `by = join_by(job_board)`
company_data <- j5 %>%
select(company_id, company, company_location, industry) %>%
distinct()
posting_data <- j5 %>%
select(posting_id, company_id, job_board_id, job_title, job_type, seniority_level, years_experience) %>%
distinct() %>%
arrange(posting_id)
#Clean up the job type field
posting_data <- posting_data %>%
mutate(job_type = case_when(
str_detect(job_type,"\\bHybrid\\b") ~ "Hybrid",
str_detect(job_type, "\\b[rR]emote\\b") ~ "Remote",
str_detect(job_type, "\\bIn [pP]erson\\b") ~ "On-Site",
str_detect(job_type, 'On Premise') ~ 'On-Site',
str_detect(job_type, "character(0)") ~ "",
str_detect(job_type, '0') ~ "",
TRUE ~ job_type
))
#Clean up the seniority_level field
posting_data <- posting_data %>%
mutate(seniority_level = case_when(
str_detect(seniority_level,"(Entry|Assistant)") ~ 'Entry Level',
str_detect(seniority_level, "(Mid|Manager|Supervisor)") ~ 'Mid Level',
str_detect(seniority_level, "(Senior)") ~ 'Senior Level',
seniority_level == 'character(0)' ~ "",
seniority_level == '0' ~ "",
TRUE ~ seniority_level
))
# Clean up the years experience field
posting_data <- posting_data %>%
mutate(years_experience = str_match(years_experience, '[\\d][~]*[\\d]*')) %>%
mutate(years_experience = str_replace(years_experience, "~", "-"))
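For illustration, the pattern above keeps the first run of digits plus an optional ~-separated second number, and the follow-up replace turns ~ into a dash. Assuming inputs like the following, the two mutate() calls behave as:
# "3~5 years of experience" -> "3~5" -> "3-5"
# "5+ years"                -> "5"   -> "5"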
#Impute seniority level based on years experience
posting_data <- posting_data %>%
mutate(seniority_level = ifelse(seniority_level == "" | is.na(seniority_level),
case_when(
years_experience %in% c("0","1","2") ~ 'Entry Level',
years_experience %in% c("3","4") ~ 'Mid Level',
TRUE ~ seniority_level
),seniority_level))
posting_data <- posting_data %>%
mutate(job_title_category = case_when(
str_detect(job_title, "(Analyst|Analytics)") ~ 'Data Analyst',
str_detect(job_title, "Data S") ~ 'Data Scientist',
str_detect(job_title, 'Engineer') ~ 'Data Engineer',
str_detect(job_title, 'Machine Learning') ~ 'Data Scientist',
TRUE ~ 'Other'
))
job_board_data <- j5 %>%
select(job_board_id, job_board) %>%
distinct()
job_skills_data <- j5 %>%
select(skill_id, posting_id, discrete_skill) %>%
distinct() %>%
arrange(skill_id)
Now that we’ve cleaned the data and created our data frames, we want to create the relevant tables in our PostgreSQL database and then import the data from the data frames into these tables.
drop_query = "DROP TABLE company,job_board, skills, job_posting"
dbSendQuery(con, drop_query)
## <PqResult>
## SQL DROP TABLE company,job_board, skills, job_posting
## ROWS Fetched: 0 [complete]
## Changed: 0
## Write data to database
dbWriteTable(con, "job_board", job_board_data, overwrite=TRUE)
## Warning in result_create(conn@ptr, statement, immediate): Closing open result
## set, cancelling previous query
dbWriteTable(con, "company", company_data, overwrite=TRUE)
dbWriteTable(con, "skills", job_skills_data, overwrite=TRUE)
dbWriteTable(con, "job_posting", posting_data, overwrite=TRUE)
Now that we have cleaned data available through our database, we are ready to answer our core data analysis questions. For this project we focused on the following:
1. What are the most valued Data Science skills overall?
2. How do the most valued skills differ across data roles (Data Scientist, Data Analyst, Data Engineer)?
3. How do the most valued skills differ across seniority levels?
query = "SELECT * FROM skills"
skills_data <- dbGetQuery(con, query)
query = "SELECT * FROM job_posting"
posting_data <- dbGetQuery(con,query)
num_postings = posting_data %>% nrow()
top50_skills <- skills_data %>%
group_by(discrete_skill) %>%
summarize(num_jobs = n_distinct(posting_id)) %>%
mutate(pct_of_jobs = num_jobs/num_postings) %>%
na.omit() %>%
arrange(desc(pct_of_jobs)) %>%
head(50)
top25_skills <- skills_data %>%
group_by(discrete_skill) %>%
summarize(num_jobs = n_distinct(posting_id)) %>%
mutate(pct_of_jobs = num_jobs/num_postings) %>%
na.omit() %>%
arrange(desc(pct_of_jobs)) %>%
head(25) %>%
mutate(discrete_skill = factor(discrete_skill, levels=discrete_skill, ordered=TRUE))
## Table of the top 25 skills across job postings
top25_skills %>%
mutate(pct_of_jobs = scales::percent(pct_of_jobs)) %>%
kable(
col.names = c("Skill", "No. Jobs", "Percent of Jobs"),
row.names = TRUE,
caption = "Top 25 Skills for Data Scientists & Data Professionals (N = 89 Postings)",
align = c("l", "c", "c")
) %>%
kable_material(c("striped"))
| | Skill | No. Jobs | Percent of Jobs |
|---|---|---|---|
| 1 | analysis skills | 52 | 58.4% |
| 2 | python programming | 51 | 57.3% |
| 3 | machine learning | 39 | 43.8% |
| 4 | communication skills | 38 | 42.7% |
| 5 | analytics skills | 37 | 41.6% |
| 6 | sql | 34 | 38.2% |
| 7 | applied statistics | 31 | 34.8% |
| 8 | model development | 30 | 33.7% |
| 9 | programming | 28 | 31.5% |
| 10 | database management | 24 | 27.0% |
| 11 | r programming | 24 | 27.0% |
| 12 | problem solving | 22 | 24.7% |
| 13 | algorithm design | 21 | 23.6% |
| 14 | big data | 16 | 18.0% |
| 15 | aws | 13 | 14.6% |
| 16 | tableau | 13 | 14.6% |
| 17 | business intelligence skills | 12 | 13.5% |
| 18 | deep learning | 12 | 13.5% |
| 19 | saas | 12 | 13.5% |
| 20 | azure | 11 | 12.4% |
| 21 | curiosity | 11 | 12.4% |
| 22 | programming languages | 11 | 12.4% |
| 23 | project management skills | 11 | 12.4% |
| 24 | strategy skills | 11 | 12.4% |
| 25 | cluster analysis | 10 | 11.2% |
## Top 25 skills for Data Scientists and Data Professionals
ggplot(data = top25_skills, aes(x=reorder(discrete_skill, num_jobs), y=pct_of_jobs)) +
geom_bar(stat='identity', aes(fill=discrete_skill)) +
coord_flip() +
labs(
x = "Job Skills",
y = "Percent of Jobs",
title = 'Top 25 Skills For Data Scientist and Data Professionals',
subtitle = 'N = 89 Postings',
fill = 'Skills'
) +
scale_y_continuous(labels = scales::percent)
## Wordcloud of top 50 Skills
ggplot(top50_skills, aes(label=discrete_skill, size=num_jobs)) +
geom_text_wordcloud() +
theme_minimal()
Answer:
For this question we pulled the top 25 skills across our data science job postings and found that the top 5 were:
1. Analysis Skills - 58.4% of postings
2. Python Programming - 57.3% of postings
3. Machine Learning - 43.8% of postings
4. Communication Skills - 42.7% of postings
5. Analytics Skills - 41.6% of postings
query <- "SELECT * FROM skills LEFT JOIN job_posting ON skills.posting_id = job_posting.posting_id"
combined_data <- dbGetQuery(con, query)
combined_data <- combined_data %>%
select(-posting_id..4)
num_postings_by_cat <- combined_data %>%
group_by(job_title_category) %>%
summarize(total_postings = n_distinct(posting_id)) %>%
arrange(desc(total_postings))
num_postings_by_skill <- combined_data %>%
group_by(job_title_category, discrete_skill) %>%
summarize(num_postings = n_distinct(posting_id)) %>%
ungroup()
## `summarise()` has grouped output by 'job_title_category'. You can override
## using the `.groups` argument.
combined_postings_by_cat <- left_join(num_postings_by_skill, num_postings_by_cat) %>%
mutate(pct_postings = num_postings/total_postings)
## Joining with `by = join_by(job_title_category)`
## Top Skills for Data Scientist Positions
combined_postings_by_cat %>%
filter(job_title_category == 'Data Scientist') %>%
group_by(job_title_category) %>%
na.omit(discrete_skill) %>%
arrange(desc(num_postings)) %>%
filter(row_number() <= 10) %>%
ungroup() %>%
select(discrete_skill, pct_postings) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science Skills for Data Scientists",
align = c("l","c")
) %>%
kable_material(font_size = 15, c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | analysis skills | 63.5% |
| 2 | python programming | 60.3% |
| 3 | communication skills | 42.9% |
| 4 | machine learning | 42.9% |
| 5 | model development | 41.3% |
| 6 | analytics skills | 39.7% |
| 7 | applied statistics | 36.5% |
| 8 | programming | 34.9% |
| 9 | algorithm design | 31.7% |
| 10 | sql | 31.7% |
## Top Skills for Data Analyst Positions
combined_postings_by_cat %>%
filter(job_title_category == 'Data Analyst') %>%
group_by(job_title_category) %>%
na.omit(discrete_skill) %>%
arrange(desc(num_postings)) %>%
filter(row_number() <= 10) %>%
ungroup() %>%
select(discrete_skill, pct_postings) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science Skills for Data Analysts",
align = c("l","c")
) %>%
kable_material(font_size = 15, c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | analytics skills | 55.6% |
| 2 | sql | 55.6% |
| 3 | analysis skills | 50.0% |
| 4 | machine learning | 44.4% |
| 5 | python programming | 44.4% |
| 6 | applied statistics | 33.3% |
| 7 | communication skills | 33.3% |
| 8 | database management | 33.3% |
| 9 | problem solving | 27.8% |
| 10 | tableau | 27.8% |
## Top Skills for Data Engineer Positions
combined_postings_by_cat %>%
filter(job_title_category == 'Data Engineer') %>%
group_by(job_title_category) %>%
na.omit(discrete_skill) %>%
arrange(desc(num_postings)) %>%
filter(row_number() <= 10) %>%
ungroup() %>%
select(discrete_skill, pct_postings) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science Skills for Data Engineers",
align = c("l","c")
) %>%
kable_material(font_size = 15, c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | aws | 75% |
| 2 | azure | 75% |
| 3 | python programming | 75% |
| 4 | sql | 75% |
| 5 | ci | 50% |
| 6 | communication skills | 50% |
| 7 | data pipelines | 50% |
| 8 | machine learning | 50% |
| 9 | pipelines | 50% |
| 10 | programming | 50% |
top10_by_cat <- combined_postings_by_cat %>%
group_by(job_title_category) %>%
na.omit(discrete_skill) %>%
arrange(desc(num_postings)) %>%
filter(row_number() <= 10)
## Top Data Science Skills common across different data functions
combined_postings_by_cat %>%
group_by(job_title_category) %>%
na.omit(discrete_skill) %>%
arrange(desc(num_postings)) %>%
ungroup() %>%
select(job_title_category, discrete_skill, pct_postings) %>%
mutate(pct_postings = scales::percent(round(pct_postings,3))) %>%
mutate(pct_postings = ifelse(pct_postings == 'NA',"-",pct_postings)) %>%
pivot_wider(names_from = job_title_category, values_from = pct_postings) %>%
na.omit() %>%
filter(row_number() <= 10) %>%
kable(
col.names = (c("Skill", "Data Scientist", "Data Analyst", "Data Engineer", "Other")),
row.names = TRUE,
caption = "Top 10 Data Science Skills Consistent Across Each Data Function/Title",
align = c("l",rep("c",4))
) %>%
kable_material(font_size = 15, c("striped", "condensed"))
| | Skill | Data Scientist | Data Analyst | Data Engineer | Other |
|---|---|---|---|---|---|
| 1 | analysis skills | 63.5% | 50.0% | 25.0% | 50.0% |
| 2 | python programming | 60.3% | 44.4% | 75.0% | 50.0% |
| 3 | communication skills | 42.9% | 33.3% | 50.0% | 75.0% |
| 4 | machine learning | 42.9% | 44.4% | 50.0% | 50.0% |
| 5 | model development | 41.3% | 11.1% | 25.0% | 25.0% |
| 6 | analytics skills | 39.7% | 55.6% | 25.0% | 25.0% |
| 7 | sql | 31.7% | 55.6% | 75.0% | 25.0% |
| 8 | r programming | 28.6% | 16.7% | 25.0% | 50.0% |
| 9 | problem solving | 23.8% | 27.8% | 25.0% | 25.0% |
| 10 | strategy skills | 12.7% | 5.6% | 25.0% | 25.0% |
Answer: Overall, there seemed to be similarity in the importance of skills across the functional roles. However, as expected, we do find some stark differences between positions. For example, Data Engineers are expected to have more familiarity with cloud-based architectures and technologies, such as AWS and Azure, and there is much more focus on their programming skills (e.g., computer programming and Python programming). For Data Scientists and Data Analysts, however, there is more of a focus on analytical skills and the ability to develop models.
query <- "SELECT * FROM skills LEFT JOIN job_posting ON skills.posting_id = job_posting.posting_id"
combined_data <- dbGetQuery(con, query)
combined_data <- combined_data %>%
select(-posting_id..4)
filtered_posts <- combined_data %>%
filter(!is.na(seniority_level), seniority_level != "")
group_by_seniority <- filtered_posts %>%
group_by(seniority_level) %>%
summarize(group_size = n_distinct(posting_id)) %>%
ungroup()
group_by_skill <- filtered_posts %>%
group_by(seniority_level, discrete_skill) %>%
summarize(num_jobs = n_distinct(posting_id)) %>%
ungroup()
## `summarise()` has grouped output by 'seniority_level'. You can override using
## the `.groups` argument.
combined_by_seniority <- left_join(group_by_skill, group_by_seniority)
## Joining with `by = join_by(seniority_level)`
## Top 10 Skills for Entry Level professionals
combined_by_seniority %>%
filter(seniority_level == 'Entry Level') %>%
ungroup() %>%
arrange(desc(num_jobs)) %>%
mutate(pct_postings = round((num_jobs/group_size),3)) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
na.omit() %>%
select(discrete_skill, pct_postings) %>%
filter(row_number() <= 10) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science for Entry Level Professionals",
align = c("l","c")
) %>%
kable_material(c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | python programming | 83.3% |
| 2 | analysis skills | 66.7% |
| 3 | applied statistics | 50.0% |
| 4 | machine learning | 50.0% |
| 5 | model development | 50.0% |
| 6 | communication skills | 41.7% |
| 7 | database management | 41.7% |
| 8 | deep learning | 41.7% |
| 9 | analytics skills | 33.3% |
| 10 | pytorch | 33.3% |
## Top 10 Skills for Mid Level professionals
combined_by_seniority %>%
filter(seniority_level == 'Mid Level') %>%
ungroup() %>%
arrange(desc(num_jobs)) %>%
mutate(pct_postings = round((num_jobs/group_size),3)) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
na.omit() %>%
select(discrete_skill, pct_postings) %>%
filter(row_number() <= 10) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science for Mid Level Professionals",
align = c("l","c")
) %>%
kable_material(c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | python programming | 73.1% |
| 2 | communication skills | 61.5% |
| 3 | analytics skills | 57.7% |
| 4 | sql | 57.7% |
| 5 | analysis skills | 53.8% |
| 6 | machine learning | 46.2% |
| 7 | problem solving | 42.3% |
| 8 | applied statistics | 34.6% |
| 9 | r programming | 34.6% |
| 10 | model development | 30.8% |
## Top 10 Skills for Senior Level professionals
combined_by_seniority %>%
filter(seniority_level == 'Senior Level') %>%
ungroup() %>%
arrange(desc(num_jobs)) %>%
mutate(pct_postings = round((num_jobs/group_size),3)) %>%
mutate(pct_postings = scales::percent(pct_postings)) %>%
na.omit() %>%
select(discrete_skill, pct_postings) %>%
filter(row_number() <= 10) %>%
kable(
col.names = (c("Skill", "Percent of Postings")),
row.names = TRUE,
caption = "Top 10 Data Science for Senior Level Professionals",
align = c("l","c")
) %>%
kable_material(c("striped", "condensed"))
| | Skill | Percent of Postings |
|---|---|---|
| 1 | python programming | 100% |
| 2 | sql | 100% |
| 3 | tableau | 75% |
| 4 | analysis skills | 50% |
| 5 | analytics skills | 50% |
| 6 | communication skills | 50% |
| 7 | database management | 50% |
| 8 | ggplot | 50% |
| 9 | machine learning | 50% |
| 10 | model development | 50% |
combined_by_seniority %>%
group_by(seniority_level) %>%
mutate(pct_postings = num_jobs/group_size) %>%
na.omit(discrete_skill) %>%
arrange(desc(pct_postings)) %>%
ungroup() %>%
select(seniority_level, discrete_skill, pct_postings) %>%
mutate(pct_postings = scales::percent(round(pct_postings,3))) %>%
mutate(pct_postings = ifelse(pct_postings == 'NA',"-",pct_postings)) %>%
pivot_wider(names_from = seniority_level, values_from = pct_postings) %>%
na.omit() %>%
select(discrete_skill, 'Entry Level', 'Mid Level', 'Senior Level') %>%
filter(row_number() <= 10) %>%
kable(
col.names = (c("Skill", "Entry Level", "Mid Level", "Senior Level")),
row.names = TRUE,
caption = "Top 10 Data Science Skills Common Amongst Each Seniority Level",
align = c("l",rep("c",3))
) %>%
kable_material(c("striped", "condensed"))
| | Skill | Entry Level | Mid Level | Senior Level |
|---|---|---|---|---|
| 1 | python programming | 83.3% | 73.1% | 100.0% |
| 2 | sql | 33.3% | 57.7% | 100.0% |
| 3 | tableau | 8.3% | 11.5% | 75.0% |
| 4 | analysis skills | 66.7% | 53.8% | 50.0% |
| 5 | communication skills | 41.7% | 61.5% | 50.0% |
| 6 | analytics skills | 33.3% | 57.7% | 50.0% |
| 7 | machine learning | 50.0% | 46.2% | 50.0% |
| 8 | model development | 50.0% | 30.8% | 50.0% |
| 9 | database management | 41.7% | 26.9% | 50.0% |
| 10 | project management skills | 8.3% | 19.2% | 50.0% |
Answer: Across the different seniority levels, the top skills common to every level include: a) python programming, b) sql, c) tableau, d) analysis skills, and e) communication skills.
This project provided a very practical way for our team to explore the top skills that we should be developing on our journey toward becoming Data Science professionals. In addition to building our knowledge and awareness of these core skills, it gave us a hands-on way to go through the full Data Analysis and Data Science life-cycle: define a problem, pull raw data, clean it, load it into a database, and then analyze it to answer the questions of interest.
I think that one of the key surprises throughout this project was the number of times we had to go back to the drawing board to account for data that was missing or difficult to acquire, or for other challenges we faced along the way. Additionally, there were a number of manual pre-processing steps that, given more time, we would have tried to solve programmatically.
Overall, this project was a great exercise in working collaboratively in a remote setting. It required us to meet tight deadlines and to leverage the strengths of each member of the team.