Introduction

Data science has become one of the most in-demand and interdisciplinary career paths. Professionals in this field are expected to combine technical, analytical, and communication skills to extract insights from data and use them to drive decisions.

This project explores the question:

“Which are the most valued data science skills?”

Using a dataset from Kaggle (Data Science Job Postings & Skills, 2024), we analyze more than 12,000 job listings to identify the most frequently requested skills among employers in the data science job market.

Data Source

Dataset: Data Science Job Postings & Skills (2024)
Author: asaniczka
Platform: Kaggle
Source: LinkedIn job postings
License: ODC Attribution License

This dataset provides a raw dump of data science-related job postings collected from LinkedIn. It includes details such as job titles, companies, locations, and, most importantly, a list of skills mentioned for each posting. The job_skills.csv file used below contains two columns: the posting link (job_link) and a comma-separated list of skills (job_skills). The main objective of the dataset is to allow users to practice data cleaning and to explore which skills are most relevant in the current job market.

library(tidyverse)

Step 1 - Import the Dataset

job_skills <- read_csv("job_skills.csv")
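
Because each skill list is itself comma-separated, the import relies on the list being quoted in the CSV. The sketch below illustrates this on a small temporary file; the two rows and the name toy_jobs are hypothetical and not taken from the dataset. It also assumes that declaring both columns as character is what is wanted, which matches the column types shown in the output of Step 2.

```r
library(tidyverse)

# Write a tiny stand-in CSV (hypothetical rows, not from the Kaggle dataset)
tmp <- tempfile(fileext = ".csv")
write_lines(
  c("job_link,job_skills",
    'link-a,"Python, SQL"',
    'link-b,"R, Tableau, Communication"'),
  tmp
)

# Declare every column as character so readr does not have to guess types
toy_jobs <- read_csv(tmp, col_types = cols(.default = col_character()))

# The quoted skill list stays in one field despite its embedded commas
toy_jobs$job_skills[1]
```

Declaring col_types also suppresses the column specification message that read_csv otherwise prints.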

Step 2 - Explore the Data

head(job_skills)
## # A tibble: 6 × 2
##   job_link                                                            job_skills
##   <chr>                                                               <chr>     
## 1 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Machine L…
## 2 https://www.linkedin.com/jobs/view/principal-software-engineer-ml-… C++, Pyth…
## 3 https://www.linkedin.com/jobs/view/senior-etl-data-warehouse-speci… ETL, Data…
## 4 https://www.linkedin.com/jobs/view/senior-data-warehouse-developer… Data Lake…
## 5 https://www.linkedin.com/jobs/view/lead-data-engineer-at-dice-3805… Java, Sca…
## 6 https://www.linkedin.com/jobs/view/senior-data-engineer-at-univers… Data Ware…
glimpse(job_skills)
## Rows: 12,217
## Columns: 2
## $ job_link   <chr> "https://www.linkedin.com/jobs/view/senior-machine-learning…
## $ job_skills <chr> "Machine Learning, Programming, Python, Scala, Java, Data E…
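
Before transforming the data, it may be worth checking for postings with no skill list and for repeated links, since the filter in Step 3 silently drops rows whose skill value is NA. The following is a minimal sketch on hypothetical toy data (toy_jobs is not part of the original analysis); whether the real file contains missing or duplicated entries is not established here.

```r
library(tidyverse)

# Toy stand-in for job_skills (hypothetical rows, not from the dataset)
toy_jobs <- tibble(
  job_link   = c("link-a", "link-b", "link-c", "link-c"),
  job_skills = c("Python, SQL", NA, "R, Tableau", "R, Tableau")
)

# Number of postings with no skill list at all
n_missing <- sum(is.na(toy_jobs$job_skills))

# Number of rows whose link already appeared earlier in the table
n_duplicated <- sum(duplicated(toy_jobs$job_link))

n_missing
n_duplicated
```

The same two checks can be run on job_skills directly by swapping in the real tibble.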

Step 3 - Clean and Transform the Skills Column

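skills_clean <- job_skills %>%
  # Split each comma-separated skill list into one row per skill
  separate_rows(job_skills, sep = ",") %>%
  # Remove the whitespace left around each skill after splitting
  mutate(job_skills = str_trim(job_skills)) %>%
  # Drop empty entries (for example, from trailing commas)
  filter(job_skills != "")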

skills_clean
## # A tibble: 314,950 × 2
##    job_link                                                           job_skills
##    <chr>                                                              <chr>     
##  1 https://www.linkedin.com/jobs/view/senior-machine-learning-engine… Machine L…
##  2 https://www.linkedin.com/jobs/view/senior-machine-learning-engine… Programmi…
##  3 https://www.linkedin.com/jobs/view/senior-machine-learning-engine… Python    
##  4 https://www.linkedin.com/jobs/view/senior-machine-learning-engine… Scala     
##  5 https://www.linkedin.com/jobs/view/senior-machine-learning-engine… Java      
##  6 https://www.linkedin.com/jobs/view/senior-machine-learning-engine… Data Engi…
##  7 https://www.linkedin.com/jobs/view/senior-machine-learning-engine… Distribut…
##  8 https://www.linkedin.com/jobs/view/senior-machine-learning-engine… Statistic…
##  9 https://www.linkedin.com/jobs/view/senior-machine-learning-engine… Optimizat…
## 10 https://www.linkedin.com/jobs/view/senior-machine-learning-engine… Data Pipe…
## # ℹ 314,940 more rows
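
Splitting and trimming leave the original spelling of each skill untouched, so variants such as "Python" and "python" would be counted as different skills. As a hedged sketch of one possible extra cleaning step, the example below normalizes case and internal whitespace on hypothetical toy data; whether the real dataset contains such variants, and how many, is an assumption that should be checked first.

```r
library(tidyverse)

# Toy stand-in for skills_clean (hypothetical rows, not from the dataset)
toy_skills <- tibble(
  job_link   = c("link-a", "link-a", "link-b", "link-b"),
  job_skills = c("Python", "machine  learning", "python", "Machine Learning")
)

# str_squish() trims and collapses repeated spaces; str_to_lower() unifies case
skills_normalized <- toy_skills %>%
  mutate(job_skills = str_to_lower(str_squish(job_skills)))

# After normalization the four spellings collapse into two distinct skills
normalized_counts <- count(skills_normalized, job_skills, sort = TRUE)
normalized_counts
```

One trade-off of this approach is that labels also change in the output (for example, "R" becomes "r"), so display names may need to be restored before plotting.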

Step 4 - Count Skill Frequencies

skill_counts <- skills_clean %>%
  count(job_skills, sort = TRUE)

head(skill_counts, 10)
## # A tibble: 10 × 2
##    job_skills             n
##    <chr>              <int>
##  1 Python              4801
##  2 SQL                 4606
##  3 Communication       2498
##  4 Data Analysis       2181
##  5 Machine Learning    1966
##  6 AWS                 1740
##  7 Tableau             1685
##  8 Data Visualization  1562
##  9 R                   1542
## 10 Java                1414
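
Raw counts are easier to interpret as a share of postings. If each posting lists a skill at most once, Python's 4,801 mentions across 12,217 postings would correspond to roughly 39% of listings. The sketch below shows one way to compute such shares while guarding against a skill repeated within a single posting; the toy data and the names toy_skills, n_postings, and skill_share are hypothetical.

```r
library(tidyverse)

# Toy stand-in for skills_clean (hypothetical rows, not from the dataset)
toy_skills <- tibble(
  job_link   = c("link-a", "link-a", "link-a", "link-b", "link-c", "link-d"),
  job_skills = c("Python", "SQL", "Python", "Python", "SQL", "Python")
)

# Total number of distinct postings
n_postings <- n_distinct(toy_skills$job_link)

skill_share <- toy_skills %>%
  # Count each skill at most once per posting
  distinct(job_link, job_skills) %>%
  count(job_skills, sort = TRUE, name = "postings") %>%
  # Fraction of postings that mention the skill
  mutate(share = postings / n_postings)

skill_share
```

Here "Python" is listed twice for link-a but contributes only one posting, so it appears in 3 of 4 postings and "SQL" in 2 of 4.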

Step 5 - Top 10 Skills Visualization

skill_counts %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%   # keep exactly 10 rows even if counts tie
  ggplot(aes(x = reorder(job_skills, n), y = n)) +
  geom_col(fill = "orange") +
  coord_flip() +
  labs(
    title = "Top 10 Most Valued Data Science Skills (Job Postings)",
    x = "Skill",
    y = "Frequency"
  ) +
  theme_minimal()

Summary

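The analysis shows that Python (4,801 mentions) and SQL (4,606) are by far the most frequently requested skills, each appearing nearly twice as often as the third-ranked skill, Communication (2,498). Data Analysis (2,181) and Machine Learning (1,966) follow, and AWS, Tableau, Data Visualization, R, and Java complete the top ten. The presence of Communication and Data Visualization alongside programming languages and tools suggests that data scientists must balance technical proficiency with analytical and storytelling skills.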