“Which Are The Most Valued Data Science Skills?”

Team Members: Jayden Jiang, Sergio Belich

Introduction

Data science continues to grow in demand across industries, yet the skill sets required for these roles vary widely by region, company, and job level.
For this project, we aim to answer the central question: “Which are the most valued data science skills?”
To explore this, we collect and analyze job postings from multiple countries using a structured, reproducible workflow. The goal is to extract insights about the most frequently listed skills across geographies, companies, and job levels, and to provide evidence-based guidance for job seekers and educators.

Overall Project Approach

1. Collaboration Tools - to ensure effective team collaboration and transparency, we will use Zoom for video communication, GitHub for code sharing, Google Docs for project documentation, RStudio for loading, cleaning, tidying, transforming, and analyzing data, and Microsoft PowerPoint for presentation slides.

2. Data Collection & Loading - load a structured CSV dataset of over 12,000 job postings from LinkedIn/Kaggle, containing fields such as job title, company, location, skills, job level, and country.

3. Word Tokenization - break the job_skills string field into individual skill tokens for processing.

4. Word Classification - count the frequency of raw skill tokens, export them for manual review, create a refined dictionary of core skills, and filter the data to retain only relevant data science skills.

5. Labeling Original Data - assign a binary indicator for each classified skill in job postings and pivot the data into a job-skill matrix for subsequent analysis.

6. Data Tidying & Transformation - clean and split the job_skills field, normalize skill names, and create tidy long-format tables for analysis.

7. Exploratory Data Analysis - geographic comparisons, employer expectations, employee skill profiles, and job-level trends.

Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(ggraph)
## Warning: package 'ggraph' was built under R version 4.4.3
library(stringi)
library(tidytext)
# stringr, purrr, dplyr, tidyr, readr, and ggplot2 are already attached by tidyverse above

Load and Prepare Dataset

job_data <- read.csv("https://raw.githubusercontent.com/JaydeeJan/Data-607-Project-3-Team-6/refs/heads/main/Data-Science%20Job%20Postings.csv")
#job_data <- read.csv("Data-Science Job Postings.csv")
job_data <- job_data %>%
  filter(!is.na(job_skills)) %>%
  mutate(job_id = row_number())
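As a quick sanity check (a minimal sketch; the column names follow the dataset fields listed above), we can inspect the structure and confirm the posting count before tokenizing:

# Inspect column types and confirm the number of postings retained
glimpse(job_data)
nrow(job_data)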

Tokenization of Skills

# Step 1: Validate UTF-8 encoding and trim leading whitespace in job_skills
job_data <- job_data %>%
  mutate(job_skills = stringi::stri_enc_toutf8(job_skills, validate = TRUE)) %>%
  mutate(job_skills = str_trim(job_skills, side = "left"))

# Step 2: Extract and clean individual skills
skill_terms <- job_data %>%
  pull(job_skills) %>%
  tolower() %>%
  str_split(",") %>%
  unlist() %>%
  str_replace_all("[^a-z0-9 ]", "") %>%
  str_squish()

# Step 3: Remove stop words and count term frequency
data("stop_words")
filter_stop_words <- stop_words %>%
  filter(word != "r")  # keep "r" so the R language isn't dropped as a stop word

skill_terms_clean <- tibble(term = skill_terms) %>%
  filter(term != "") %>%
  anti_join(filter_stop_words, by = c("term" = "word")) %>%
  count(term, sort = TRUE)

head(skill_terms_clean, 20)
## # A tibble: 20 × 2
##    term                     n
##    <chr>                <int>
##  1 python                4814
##  2 sql                   4612
##  3 data analysis         3296
##  4 machine learning      2688
##  5 communication         2507
##  6 data visualization    2331
##  7 aws                   1742
##  8 project management    1738
##  9 data engineering      1715
## 10 communication skills  1714
## 11 tableau               1705
## 12 data science          1694
## 13 r                     1545
## 14 data modeling         1520
## 15 data management       1472
## 16 java                  1435
## 17 problem solving       1411
## 18 data analytics        1410
## 19 data warehousing      1408
## 20 spark                 1397

# Tokenize skills per posting: one row per (job_id, skill) pair
skill_tokens <- job_data %>%
  select(job_id, job_skills, search_country, company, job_level, first_seen) %>%
  separate_rows(job_skills, sep = ",") %>%
  mutate(skill = stringi::stri_trim_both(stringi::stri_trans_tolower(job_skills))) %>%
  filter(skill != "")
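A quick peek at the tokenized table (a minimal check) confirms one row per job-skill pair and shows the size of the raw skill vocabulary:

# One row per (job_id, skill) pair; count distinct raw skill strings
head(select(skill_tokens, job_id, skill))
n_distinct(skill_tokens$skill)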

Word Classification

# Step 1: Create frequency-based metrics
term_metrics <- skill_tokens %>%
  count(skill, sort = TRUE) %>%
  rename(frequency = n)

# Step 2: Export terms to CSV for manual labeling
write_csv(term_metrics, "term_metrics_for_labeling.csv")

# Step 3: Extract the top N skills from skill_terms_clean
top_n_skills <- 20  # starting with 20
top_skills <- skill_terms_clean %>%
  slice_head(n = top_n_skills) %>%
  pull(term)

# Create skill dictionary using these top skills
skill_dict <- tibble(
  skill = top_skills,
  is_data_skill = TRUE
)

# Now use skill_dict in classified_skills code
classified_skills <- skill_tokens %>%
  left_join(skill_dict, by = "skill") %>%
  filter(is_data_skill == TRUE)
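Because the dictionary keeps only the top 20 terms, it is worth checking how much of the raw token stream those terms cover (a minimal sketch):

# Fraction of raw skill tokens retained by the top-20 dictionary
nrow(classified_skills) / nrow(skill_tokens)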

# Step 3: Define labeled skill dictionary manually
#skill_dict <- tibble(
  # skill = c(
    # "python", "sql", "r", "aws", "excel", "communication", "machine learning", 
    # "data analysis", "data visualization", "tableau", "power bi", "hadoop",
    # "spark", "statistics", "cloud computing", "nlp", "deep learning",
    # "big data", "data wrangling", "data engineering"
  # ),
  # is_data_skill = TRUE
# )

# Alternative: classify skill tokens based on the manual labels
#classified_skills <- skill_tokens %>%
#  left_join(skill_dict, by = "skill") %>%
#  filter(is_data_skill == TRUE)

Label Original Data

# Step 1: Remove NAs
labeled_skills <- classified_skills %>%
  filter(!is.na(skill))

# Step 2: Tag each matched job-skill pair with 1 to indicate presence
labeled_skills <- labeled_skills %>%
  mutate(value = 1)

# Step 3: Create job-skill matrix for analysis
job_skill_matrix <- labeled_skills %>%
  select(job_id, skill, value) %>%
  distinct() %>%
  pivot_wider(
    id_cols = job_id,
    names_from = skill,
    values_from = value,
    values_fill = list(value = 0)
  )
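A quick dimension check (a minimal sketch) confirms one row per posting with at least one dictionary skill, and one column per skill plus job_id:

# Rows: postings with >= 1 classified skill; columns: job_id + one per skill
dim(job_skill_matrix)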

Tidy Data

# Convert to long format for analysis
job_skill_long <- job_skill_matrix %>%
  pivot_longer(
    cols = -job_id,
    names_to = "skill",
    values_to = "has_skill"
  ) %>%
  filter(has_skill == 1) %>%
  left_join(job_data %>% select(job_id, search_country, company, job_level), by = "job_id")

print(job_skill_long)
## # A tibble: 41,477 × 6
##    job_id skill                has_skill search_country company        job_level
##     <int> <chr>                    <dbl> <chr>          <chr>          <chr>    
##  1      1 machine learning             1 United States  Jobs for Huma… Mid seni…
##  2      1 python                       1 United States  Jobs for Huma… Mid seni…
##  3      1 java                         1 United States  Jobs for Huma… Mid seni…
##  4      1 data engineering             1 United States  Jobs for Huma… Mid seni…
##  5      1 data visualization           1 United States  Jobs for Huma… Mid seni…
##  6      1 spark                        1 United States  Jobs for Huma… Mid seni…
##  7      2 python                       1 United States  Aurora         Mid seni…
##  8      2 communication skills         1 United States  Aurora         Mid seni…
##  9      3 data warehousing             1 United States  Adame Service… Associate
## 10      3 data modeling                1 United States  Adame Service… Associate
## # ℹ 41,467 more rows
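As a quick profile of the tidy table (a minimal sketch), we can summarize how many classified skills a typical posting lists:

# Mean and median number of classified skills per posting
job_skill_long %>%
  count(job_id, name = "n_skills") %>%
  summarise(mean_skills = mean(n_skills), median_skills = median(n_skills))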

Analysis of Data

# Let's start easy with Top 10 Data Science Skills Requested (for all data)
# Get the top 10 most frequently requested skills across all countries
top_skills_all <- classified_skills %>%
  count(skill, sort = TRUE) %>%
  slice_head(n = 10)

# Create a bar chart
ggplot(top_skills_all, aes(x = reorder(skill, n), y = n, fill = skill)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Most Requested Data Science Skills",
    subtitle = "Based on job posting analysis across all countries",
    x = "Skill",
    y = "Number of Job Postings",
    caption = "Data source: LinkedIn/Kaggle job postings dataset"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.title = element_text(face = "bold"),
    legend.position = "none",  # Remove the legend since skill names are on the y-axis
    plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")  # Adjust margins (top, right, bottom, left)
    ) +
    scale_fill_viridis_d(option = "D")  # Use a color-blind friendly palette
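Since knitr is already loaded, the same top-10 counts can also be rendered as a table (a minimal sketch using kable):

# Present the top-10 skill counts as a formatted table
kable(top_skills_all, col.names = c("Skill", "Job Postings"))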

# Next: Top 10 Data Science Skills Requested by Country

# Filter for major countries with sufficient data
major_countries <- c("United States", "United Kingdom", "Canada", "Australia", "Germany", "India")

# Get top skills by country
skills_by_country <- classified_skills %>%
  filter(search_country %in% major_countries) %>%
  count(search_country, skill, sort = TRUE) %>%
  group_by(search_country) %>%
  slice_max(order_by = n, n = 10) %>%
  ungroup()

# Create a faceted bar chart for top skills by country
ggplot(skills_by_country, aes(x = reorder_within(skill, n, search_country), y = n, fill = search_country)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ search_country, scales = "free_y") +
  scale_x_reordered() +  # Required for reorder_within
  labs(
    title = "Top 10 Data Science Skills by Country",
    subtitle = "Based on job posting analysis",
    x = "Skill",
    y = "Number of Job Postings",
    caption = "Data source: LinkedIn/Kaggle job postings dataset"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.title = element_text(face = "bold"),
    strip.background = element_rect(fill = "lightgray"),
    strip.text = element_text(face = "bold"),
    plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")
  )
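Step 7 of our approach also calls for job-level trends. The sketch below (assuming job_level is populated as in the printed sample, e.g. "Mid senior" and "Associate") extends the same faceting pattern to job levels:

# Top 5 skills within each job level
skills_by_level <- classified_skills %>%
  filter(!is.na(job_level), job_level != "") %>%
  count(job_level, skill) %>%
  group_by(job_level) %>%
  slice_max(order_by = n, n = 5) %>%
  ungroup()

ggplot(skills_by_level, aes(x = reorder_within(skill, n, job_level), y = n, fill = job_level)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ job_level, scales = "free_y") +
  scale_x_reordered() +
  labs(
    title = "Top 5 Data Science Skills by Job Level",
    x = "Skill",
    y = "Number of Job Postings"
  ) +
  theme_minimal()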

Conclusion

Python, SQL, and data analysis are the three most frequently requested skills worldwide, and technical skills dominate the rankings in every country examined, though regional variations shift the relative importance of specific tools. Despite this emphasis on technical expertise, communication remains highly valued as a key soft skill. Cloud platforms such as AWS also feature prominently in job requirements, underscoring their growing significance in the data science workforce.