To answer the question, “Which are the most valued data science skills?”, we used the Data Science Job Postings & Skills (2024) dataset from Kaggle, available at https://www.kaggle.com/datasets/asaniczka/data-science-job-postings-and-skills.
The dataset was built from a raw dump of data science-related job postings collected from LinkedIn, with attributes such as job title, description, and requirements. We examine the job_postings.csv and job_skills.csv files in depth and analyze these tables.
Note: please see https://github.com/gillianmcgovern0/cuny-data-607-project-3/blob/40091bedcf60f343699a1c6c44f2f9c24d8efa61/Project3_DATA607_JobSkills.Rmd for details on how we normalized the datasets.
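For orientation, here is a minimal sketch of the core join involved, assuming (consistent with the columns shown later) that the raw Kaggle files share a job_link key; the file paths are illustrative, not the exact steps from the Rmd:

library(readr)
library(dplyr)

# Illustrative local copies of the raw Kaggle files
raw_postings <- read_csv("job_postings.csv", show_col_types = FALSE)
raw_skills   <- read_csv("job_skills.csv", show_col_types = FALSE)

# Attach each posting's comma-separated skill string via the shared job_link key
job_posting_with_skills <- raw_postings %>%
  left_join(raw_skills, by = "job_link")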
Our group’s motivation for choosing this dataset was our desire to combine academic theory with actual industry insights. As students pursuing an MS in Data Science, we recognized that this dataset offers real-world insight into the demand for skills, roles, and requirements in the industry. By analyzing the job titles, descriptions, and required skills, we can identify emerging trends and areas where our own skills may be lacking. As a bonus, we also get to put some of these skills to use as we transform the data, sharpening our technical abilities and preparing us for real-world challenges in our future careers.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(slackr)
library(dplyr)
library(tidytext)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(stringr)
library(topicmodels)
library(knitr)
job_postings_norm <- read_csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-607-project-3/refs/heads/main/job_postings_norm.csv", show_col_types = FALSE)
job_posting_with_skills <- read_csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-607-project-3/refs/heads/main/job_postings_with_skills.csv", show_col_types = FALSE)
sep_skills <- read_csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-607-project-3/refs/heads/main/sep_skills.csv", show_col_types = FALSE)
We applied natural language processing (the tidytext library) to scan job titles and retain only those related to data science roles, such as “machine learning engineer,” “lead data engineer,” or “senior data analyst.” We chose the tidytext library because it provides an efficient way to perform text mining tasks. This step narrows our focus to relevant positions and removes most outliers and irrelevant job postings.
To perform word frequency analysis on the job titles, we used the unnest_tokens function with the token = "ngrams" argument, which tokenizes by a given number of adjacent words. After some trial and error, we settled on the most common trigrams (consecutive sequences of 3 words), which proved the most effective way to surface the most popular data science related job titles:
head(job_posting_with_skills, 5)
## # A tibble: 5 × 16
## job_link last_processed_time last_status got_summary got_ner is_being_worked
## <chr> <dttm> <chr> <lgl> <lgl> <lgl>
## 1 https://w… 2024-01-21 08:08:48 Finished N… TRUE TRUE FALSE
## 2 https://w… 2024-01-20 04:02:12 Finished N… TRUE TRUE FALSE
## 3 https://w… 2024-01-21 08:08:31 Finished N… TRUE TRUE FALSE
## 4 https://w… 2024-01-20 15:30:55 Finished N… TRUE TRUE FALSE
## 5 https://w… 2024-01-21 08:08:58 Finished N… TRUE TRUE FALSE
## # ℹ 10 more variables: job_title <chr>, company <chr>, job_location <chr>,
## # first_seen <date>, search_city <chr>, search_country <chr>,
## # search_position <chr>, job_level <chr>, job_type <chr>, job_skills <chr>
# Break each job title into all sequences of 3 consecutive words (trigrams)
titles_broken_3_words <- job_posting_with_skills %>%
unnest_tokens(title, job_title, token = "ngrams", n = 3) %>%
filter(!is.na(title)) %>%
count(title, sort = TRUE)
head(titles_broken_3_words, 5)
## # A tibble: 5 × 2
## title n
## <chr> <int>
## 1 senior data engineer 453
## 2 machine learning engineer 426
## 3 senior data analyst 321
## 4 senior data scientist 257
## 5 lead data engineer 252
# Find data science related titles
titles_broken_3_words <- titles_broken_3_words %>%
mutate(
# Replace "sr" with "senior" as a whole word
title = str_replace_all(title, "\\bsr\\b", "senior"),
# Recode specific variants to join similar titles
title = case_when(
title %in% c("data loss prevention", "loss prevention dlp") ~ "prevention dlp engineer",
title %in% c("service representative data") ~ "customer service representative",
TRUE ~ title
)
) %>%
# Group by the recoded title and sum their counts
group_by(title) %>%
summarise(n = sum(n)) %>%
ungroup() %>%
# Filter to include only titles that contain one of the keywords (case-insensitive)
filter(str_detect(title, regex("analytic|model|engineer|data|machine", ignore_case = TRUE))) %>%
arrange(desc(n))
# Find the top 20 data science related titles
popular_data_science_skills_vector <- titles_broken_3_words$title[1:20]
popular_data_science_skills_vector
## [1] "senior data engineer" "machine learning engineer"
## [3] "senior data analyst" "senior data scientist"
## [5] "prevention dlp engineer" "lead data engineer"
## [7] "senior machine learning" "senior mlops engineer"
## [9] "staff machine learning" "business data analyst"
## [11] "manager data engineering" "data entry clerk"
## [13] "engineer series a" "learning engineer series"
## [15] "data analyst data" "manager data center"
## [17] "representative data analyst" "analyst data entry"
## [19] "senior database administrator" "data center construction"
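A caveat: even after the keyword filter, a few clearly non-data-science trigrams survive in this top 20 (for example, “data entry clerk” and “data center construction”). One possible refinement, not applied here, would be an explicit exclusion list built from the noise visible above before taking the top 20:

# Illustrative exclusion list; the titles are taken verbatim from the output above
noise_titles <- c("data entry clerk", "data center construction",
                  "engineer series a", "learning engineer series")

titles_broken_3_words %>%
  filter(!title %in% noise_titles) %>%
  slice_head(n = 20)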
# Refresh job_posting_with_skills_filtered using the updated popular titles
job_posting_with_skills_filtered <- job_posting_with_skills %>%
mutate(job_title_duplicate = job_title) %>% # Keep the original job_title variable
unnest_tokens(title, job_title, token = "ngrams", n = 3) %>%
filter(title %in% popular_data_science_skills_vector) %>%
left_join(titles_broken_3_words, by = "title") %>%
group_by(job_title_duplicate) %>%
arrange(desc(n)) %>%
slice(1) %>%
ungroup()
head(job_posting_with_skills_filtered, 5)
## # A tibble: 5 × 18
## job_link last_processed_time last_status got_summary got_ner is_being_worked
## <chr> <dttm> <chr> <lgl> <lgl> <lgl>
## 1 https://w… 2024-01-19 14:29:15 Finished N… TRUE TRUE FALSE
## 2 https://w… 2024-01-19 09:45:09 Finished N… TRUE TRUE FALSE
## 3 https://w… 2024-01-19 09:45:09 Finished N… TRUE TRUE FALSE
## 4 https://w… 2024-01-20 08:57:35 Finished N… TRUE TRUE FALSE
## 5 https://w… 2024-01-19 22:38:30 Finished N… TRUE TRUE FALSE
## # ℹ 12 more variables: company <chr>, job_location <chr>, first_seen <date>,
## # search_city <chr>, search_country <chr>, search_position <chr>,
## # job_level <chr>, job_type <chr>, job_skills <chr>,
## # job_title_duplicate <chr>, title <chr>, n <int>
After filtering for relevant jobs, we were still left with an untidy dataset in which all skills for a posting were packed into a single variable, job_skills. This structure makes it difficult to add a new skill to a job posting. We transformed the dataset by splitting job_skills so that each observation represents a unique job title and single skill combination. This restructuring simplifies the subsequent analysis and visualization of individual job titles and their related skills.
Additionally, to produce a data frame of skill frequencies for the filtered job titles, we counted each skill’s occurrences across those titles to build a frequency table. This new data frame highlights the most in-demand skills and gives us clearer insight into industry trends:
# Make the data frame tidy - break up the `job_skills` variable (a list of skills represented as a string) so each observation is a job title/single skill combo
tidy_top_skills <- job_posting_with_skills_filtered %>%
unnest_tokens(skill, job_skills, token = 'regex', pattern=",") %>%
count(skill, sort = TRUE) # get the frequency
Here is the final tidy data frame for analysis:
# Final tidy data frame
head(tidy_top_skills)
## # A tibble: 6 × 2
## skill n
## <chr> <int>
## 1 " python" 300
## 2 " sql" 241
## 3 " machine learning" 142
## 4 " communication" 135
## 5 " data visualization" 121
## 6 " data analysis" 117
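Note that splitting on bare commas leaves a leading space on each skill token (e.g., " python"). We left the raw tokens in place so the outputs throughout the report match, but a str_trim() call would normalize them; a minimal sketch:

# Same tidying step, with whitespace stripped from each skill token
tidy_top_skills_clean <- job_posting_with_skills_filtered %>%
  unnest_tokens(skill, job_skills, token = "regex", pattern = ",") %>%
  mutate(skill = str_trim(skill)) %>%
  count(skill, sort = TRUE)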
We created a new data frame for the top 10 skills grouped by job title. This new data frame shows how skills vary between different roles.
# For each title, break up the 'job_skills' column into individual skills and count them.
top_skills_by_job_title <- job_posting_with_skills_filtered |>
group_by(title) |> # Group by the data science job title
# Split the comma-separated skills into individual tokens
unnest_tokens(skill, job_skills, token = "regex", pattern = ",") |>
count(skill, sort = TRUE) |>
# Optionally, limit to the top 10 skills per job title
group_by(title) |>
slice_max(n, n = 10) |>
ungroup() |>
arrange(title, desc(n))
# View the resulting summary table
print(top_skills_by_job_title)
## # A tibble: 247 × 3
## title skill n
## <chr> <chr> <int>
## 1 business data analyst " sql" 29
## 2 business data analyst " data analysis" 23
## 3 business data analyst " data visualization" 17
## 4 business data analyst " project management" 17
## 5 business data analyst " communication" 15
## 6 business data analyst " excel" 14
## 7 business data analyst " business intelligence" 13
## 8 business data analyst " data governance" 13
## 9 business data analyst " data mining" 12
## 10 business data analyst " problem solving" 11
## # ℹ 237 more rows
# Create separate plots for each job title
unique_titles <- unique(top_skills_by_job_title$title)
print(unique_titles)
## [1] "business data analyst" "data analyst data"
## [3] "data center construction" "data entry clerk"
## [5] "lead data engineer" "machine learning engineer"
## [7] "manager data center" "manager data engineering"
## [9] "prevention dlp engineer" "representative data analyst"
## [11] "senior data analyst" "senior data engineer"
## [13] "senior data scientist" "senior database administrator"
## [15] "senior machine learning" "senior mlops engineer"
## [17] "staff machine learning"
# Keep only data-related job titles (the uppercase conversion below is left commented out)
pattern <- "analytic|model|engineer|data|machine"
unique_titles <- unique_titles[grepl(pattern, unique_titles, ignore.case = TRUE)]
#unique_titles <- toupper(unique_titles)
print(unique_titles)
## [1] "business data analyst" "data analyst data"
## [3] "data center construction" "data entry clerk"
## [5] "lead data engineer" "machine learning engineer"
## [7] "manager data center" "manager data engineering"
## [9] "prevention dlp engineer" "representative data analyst"
## [11] "senior data analyst" "senior data engineer"
## [13] "senior data scientist" "senior database administrator"
## [15] "senior machine learning" "senior mlops engineer"
## [17] "staff machine learning"
for (job in unique_titles) {
p <- top_skills_by_job_title |>
filter(title == job) |>
ggplot(aes(x = reorder(skill, n), y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = paste("Top Skills for:", job),
x = "Skill",
y = "Count")
print(p)
}
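With 17 titles, this loop prints 17 separate charts. An alternative, sketched below, is a single faceted figure; tidytext’s reorder_within() and scale_y_reordered() keep each facet’s bars sorted independently, at the cost of smaller panels:

# One faceted chart instead of 17 individual plots (a sketch)
top_skills_by_job_title |>
  mutate(skill = reorder_within(skill, n, title)) |>
  ggplot(aes(x = n, y = skill)) +
  geom_col(fill = "steelblue") +
  scale_y_reordered() +
  facet_wrap(~title, scales = "free_y") +
  labs(title = "Top Skills by Job Title", x = "Count", y = "Skill")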
For a data scientist role such as “Senior Data Scientist”, the top 3 skills match the general top skills across all roles, but Machine Learning ranks higher than SQL. The next 2 skills within the top 5, R and Statistics, are more technical for a data scientist than the corresponding general top skills, communication and data visualization.
For a data analyst role such as “Business Data Analyst”, SQL, Data Analysis, Project Management, Data Visualization, and Communication are the top skills. This suggests that a data science position may require more specialized technical skills than an analyst role.
We extracted state information from job location data using regex, ranked the top 10 states by job openings, and visualized this distribution.
state_summary <- job_posting_with_skills_filtered %>%
mutate(state = str_extract(job_location, "[A-Z]{2}$")) %>%
# This excludes NA states
filter(!is.na(state)) %>%
group_by(state) %>%
summarise(openings = n(), .groups = "drop") %>%
arrange(desc(openings)) %>%
slice_head(n = 10) # keep only the top 10 states
# Display the summary table
knitr::kable(state_summary, caption = "Top 10 States by Job Openings")
Top 10 States by Job Openings

state | openings
---|---
CA | 82
TX | 48
VA | 44
IL | 36
NY | 35
MA | 22
FL | 16
WA | 16
AZ | 14
GA | 12
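One caveat with this approach: the regex [A-Z]{2}$ captures any trailing two-letter token and silently drops locations without one (such as “United States”). Validating the extracted codes against R’s built-in state.abb vector would guard against false matches; a sketch:

# Keep only rows whose trailing token is a genuine US state code (plus DC)
state_summary_checked <- job_posting_with_skills_filtered %>%
  mutate(state = str_extract(job_location, "[A-Z]{2}$")) %>%
  filter(state %in% c(state.abb, "DC")) %>%
  count(state, name = "openings", sort = TRUE) %>%
  slice_head(n = 10)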
# Plot the top 10 states by job openings
ggplot(state_summary, aes(x = reorder(state, openings), y = openings)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 States by Job Openings",
x = "State",
y = "Number of Openings") +
theme_minimal()
CA clearly dominates the Data Science job market with 82 openings, much more than the next highest state, TX, which has 48.
This is likely due to California’s booming tech scene (such as Silicon Valley and the Bay Area) which is home to a large number of tech companies and startups that may create a high demand for data science expertise.
We compared the top 10 skills in New York and California to examine the regional differences.
tidy_top_skills_ny <- job_posting_with_skills_filtered %>%
unnest_tokens(skill, job_skills, token = 'regex', pattern=",") %>%
mutate(state = str_extract(job_location, "[A-Z]{2}$")) %>%
filter(state == "NY") %>%
group_by(state) %>%
count(skill, sort = TRUE) %>%
slice(1:10)
head(tidy_top_skills_ny)
## # A tibble: 6 × 3
## # Groups: state [1]
## state skill n
## <chr> <chr> <int>
## 1 NY " python" 20
## 2 NY " sql" 16
## 3 NY " communication" 11
## 4 NY " data analysis" 10
## 5 NY " data modeling" 10
## 6 NY " machine learning" 10
# Plot the top 10 skills in NY
ggplot(tidy_top_skills_ny, aes(x = reorder(skill, n), y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Skills in NY",
x = "State",
y = "Count") +
theme_minimal()
tidy_top_skills_ca <- job_posting_with_skills_filtered %>%
unnest_tokens(skill, job_skills, token = 'regex', pattern=",") %>%
mutate(state = str_extract(job_location, "[A-Z]{2}$")) %>%
filter(state == "CA") %>%
group_by(state) %>%
count(skill, sort = TRUE) %>%
slice(1:10)
head(tidy_top_skills_ca)
## # A tibble: 6 × 3
## # Groups: state [1]
## state skill n
## <chr> <chr> <int>
## 1 CA " python" 51
## 2 CA " sql" 41
## 3 CA " machine learning" 30
## 4 CA " pytorch" 24
## 5 CA " tensorflow" 22
## 6 CA " deep learning" 20
# Plot the top 10 skills in CA
ggplot(tidy_top_skills_ca, aes(x = reorder(skill, n), y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Skills in CA",
x = "State",
y = "Count") +
theme_minimal()
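Since the NY and CA pipelines above are identical except for the state filter, a small helper function (top_skills_for_state is our own name, not part of the original analysis) would remove the duplication:

# Top k skills among postings in a given state (a sketch)
top_skills_for_state <- function(df, st, k = 10) {
  df %>%
    unnest_tokens(skill, job_skills, token = "regex", pattern = ",") %>%
    mutate(state = str_extract(job_location, "[A-Z]{2}$")) %>%
    filter(state == st) %>%
    count(skill, sort = TRUE) %>%
    slice_head(n = k)
}

tidy_top_skills_ny <- top_skills_for_state(job_posting_with_skills_filtered, "NY")
tidy_top_skills_ca <- top_skills_for_state(job_posting_with_skills_filtered, "CA")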
The top skills for CA vs. NY also reflect CA’s booming tech scene: CA shows more specialized technical skills such as TensorFlow and PyTorch, while NY’s top skills contain more general skills such as data analysis.
We compared mid-senior and associate roles by grouping postings on the job_level field and charting the top skills for each group. This classification allowed us to analyze trends and differences in skill requirements across career levels.
tidy_top_skills_by_job_level <- job_posting_with_skills_filtered %>%
unnest_tokens(skill, job_skills, token = 'regex', pattern=",") %>%
group_by(job_level) %>%
count(skill, sort = TRUE) %>%
slice(1:5) %>%
mutate(percent = n/sum(n))
print(tidy_top_skills_by_job_level)
## # A tibble: 10 × 4
## # Groups: job_level [2]
## job_level skill n percent
## <chr> <chr> <int> <dbl>
## 1 Associate " sql" 14 0.28
## 2 Associate " python" 12 0.24
## 3 Associate " data visualization" 9 0.18
## 4 Associate " data analysis" 8 0.16
## 5 Associate " aws" 7 0.14
## 6 Mid senior " python" 288 0.323
## 7 Mid senior " sql" 227 0.254
## 8 Mid senior " machine learning" 137 0.153
## 9 Mid senior " communication" 129 0.144
## 10 Mid senior " data visualization" 112 0.125
# Plot the top 5 by job level
ggplot(tidy_top_skills_by_job_level, aes(x = reorder(skill, n), y = percent)) +
geom_bar(stat = "identity", fill = "steelblue") +
facet_wrap(~job_level) +
coord_flip() +
labs(title = "Top 5 Skills by Job Level",
x = "Job Level",
y = "Percent as Decimal") +
theme_minimal()
Associate-level positions show a mix of various skills, suggesting that employers value a well-rounded foundation rather than specialization early in a career.
Job titles with terms like “senior”, “lead”, or “manager” often include responsibilities beyond technical work. This points to a growing need for leadership and communication: a combination of technical knowledge with the interpersonal skills needed to work well with others, and potentially help lead a team.
Please see below for a summary of our findings:
The general top 3 skills across all data science related roles are:
Python
SQL
Machine Learning
For a data scientist role specifically, the top 3 skills are the same, but Machine Learning ranks higher than SQL. For a data analyst role, SQL, Data Analysis, and Project Management are the top skills. This suggests that a data science position may require more specialized technical skills than an analyst role.
Associate-level positions show a mix of various skills, suggesting that employers value a well-rounded foundation rather than specialization early in a career.
Job titles with terms like senior, lead, or manager often include responsibilities beyond technical work. This points to a growing need for leadership and communication: a combination of technical knowledge with the interpersonal skills needed to work well with others.
CA clearly dominates the Data Science job market with 82 openings, much more than the next highest state, TX, which has 48.
This is likely due to California’s booming tech scene (such as Silicon Valley and the Bay Area) which is home to a large number of tech companies and startups that may create a high demand for data science expertise.
The top skills for CA vs. NY also reflect CA’s booming tech scene: CA shows more specialized technical skills such as TensorFlow and PyTorch, while NY’s top skills contain more general skills such as data analysis.