Project 3:Most Valued Data Science Skills

Author

Madina Kudanova, Muhammad Khan, Leon Lin, Xiaofei Mei

Introduction

This project investigated which technical skills are most valued in the data science job market. As the demand for data science professionals has grown significantly over the last decade, it is important to understand which tools, programming languages, and analytical skills are most frequently required by employers.

To explore this question, the project analyzed a dataset containing job postings for data science-related roles. Job postings provide insight into employer expectations because they typically list the technical skills required for each position. In this project, the frequency of skills listed in job postings was used as a proxy for measuring the demand for specific data science skills.

Approach

Team members

  • Madina Kudanova

  • Muhammad Khan

  • Leon Lin

  • Xiaofei Mei

Communication

Our team communicate primarily through WhatsApp/Slack group chats and school email to coordinate meetings, discuss progress, and share quick updates related to the project.

Code and File Sharing

A shared GitHub repository https://github.com/MKudanova/Project3_DataScienceSkills/tree/main is being used to store project code, scripts, and datasets.

Project Documentation

A shared Google Docs document is being used to collaboratively write project notes and draft written sections of the project before final versions are uploaded to the GitHub repository.

Data Sources

We obtained the data from Kaggle. The data was scraped from Glassdoor with information such as job title, job description, company information, industry. Skills required mostly can be listed through the job description column.

Data Loading and Database Design

We built a relational PostgreSQL database to organize the dataset using a normalized structure. We defined tables, primary keys, and relationships to ensure data integrity and reduce redundancy.

Because one job posting can require multiple skills and a single skill may appear in multiple job postings, we implemented a many-to-many relationship between jobs and skills.

The database included three tables: Jobs (job_id, job_title, company, location), Skills (skill_id, skill_name), and Job_Skills (job_id, skill_id). The Job_Skills table served as a bridge table with foreign keys referencing both Jobs and Skills.

We used SQL to create the schema and manage the database in pgAdmin4. The data were then accessed from PostgreSQL in R using the RPostgres package for querying and analysis.

ER Diagram

An Entity–Relationship (ER) diagram was created to document the structure of the database. It illustrates the relationships between the Jobs, Skills, and Job_Skills tables, showing how the bridge table resolves the many-to-many relationship between jobs and skills. The diagram was created using draw.io

Exploratory Analysis and Visualization

After cleaning the data in R, we explored which skills appeared most often in job postings, how they relate to salary, and how they vary across sectors. We created visualizations such as bar charts to show the most common skills and scatter plots to compare skill demand with average salary.

These visuals made it easier to understand which data science skills are most in demand, how they are valued in the job market, and how they differ by sector.

Code base

library(DBI)
library(RPostgres)
library(tidyverse)
library(patchwork)

Connect to PostgreSQL Database

# Establish connection to local PostgreSQL database
con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "project3_skills",
  host = "localhost",
  port = 5432,
  user = "postgres",
  password = ""  
  
)
# Set schema to project3 so we can access our tables
dbExecute(con, "SET search_path TO project3;")
[1] 0

Verify Connection

# Check available tables
dbListTables(con)
[1] "job_skills"           "jobs"                 "skills"              
[4] "vw_job_skill_details"
# Retrieve all tables specifically within the "project3" schema
# This confirms that our project tables (jobs, skills, job_skills) are correctly loaded
# and avoids listing unrelated system tables
dbGetQuery(con, "
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'project3'
ORDER BY table_name;
")
  table_schema           table_name
1     project3           job_skills
2     project3                 jobs
3     project3               skills
4     project3 vw_job_skill_details

Skill Demand Analysis

# Query the database to count how many job postings require each skill
# COUNT(DISTINCT js.job_id) ensures each job is counted only once per skill
skills_demand <- dbGetQuery(con, "
SELECT
    s.skill_name,
    COUNT(DISTINCT js.job_id) AS job_count
FROM project3.job_skills js
JOIN project3.skills s
    ON js.skill_id = s.skill_id
GROUP BY s.skill_name
ORDER BY job_count DESC;
")

# Convert job_count from integer64 (database type) to numeric for compatibility with R and plotting
skills_demand <- skills_demand %>%
  mutate(job_count = as.numeric(job_count))


skills_demand
           skill_name job_count
1              python       482
2          statistics       439
3    machine learning       405
4                 sql       322
5                   r       308
6               excel       291
7  data visualization       238
8               spark       186
9       data analysis       185
10                aws       147
11             hadoop       140
12           big data       136
13            tableau       122
14      deep learning        97
15                nlp        87
16           power bi        34

Visualization: Demand by Skill

ggplot(skills_demand, aes(x = reorder(skill_name, job_count), y = job_count)) +
  geom_col(fill = "orange") +
  coord_flip() +
  labs(
    title = "Demand by Top Data Science Skills",
    x = "Skill",
    y = "Number of Job Postings"
  )

Salary Analysis by Skill

# Calculate the average salary associated with each skill
# This links job postings with salary information to estimate the market value of each skill
skills_salary <- dbGetQuery(con, "
SELECT
    s.skill_name,
    ROUND(AVG(j.avg_salary), 2) AS avg_salary
FROM project3.job_skills js
JOIN project3.jobs j
    ON js.job_id = j.job_id
JOIN project3.skills s
    ON js.skill_id = s.skill_id
GROUP BY s.skill_name
ORDER BY avg_salary DESC;
")

# Select the top 8 highest-paying skills for clearer comparison
skills_salary_top <- skills_salary %>%
  slice_max(avg_salary, n = 8)


skills_salary_top
        skill_name avg_salary
1         power bi     130.38
2              nlp     127.90
3    data analysis     126.26
4           hadoop     126.12
5          tableau     125.75
6 machine learning     125.65
7    deep learning     124.89
8            spark     124.74

Visualization: Salary by Skill

combined <- skills_demand %>%
  inner_join(skills_salary, by = "skill_name")

ggplot(combined, aes(x = job_count, y = avg_salary, label = skill_name)) +
  geom_point(color = "orange", size = 3) +
  geom_text(nudge_y = 0.3, size = 3) +
  labs(
    title = "Demand vs Salary for Data Science Skills",
    x = "Number of Job Postings",
    y = "Average Salary"
  )

Plot interpretation:

This plot compares the demand for data science skills with their corresponding average salaries. The results show that higher demand does not necessarily translate to higher pay.

For example, Python and SQL appear among the most frequently requested skills, but their average salaries are relatively moderate compared to more specialized skills. In contrast, tools such as Power BI and NLP show higher average salaries despite lower demand, suggesting that niche or specialized skills may command a salary premium.

Overall, the relationship between demand and salary is not strictly linear. Core skills are widely required across many roles, while more specialized skills tend to be less common but offer higher compensation.

Skill Demand by Sector Analysis

# Retrieve skill demand by sector from the database
# Each row represents how many job postings in a sector require a given skill
skills_sector <- dbGetQuery(con, "
SELECT
    j.sector,
    s.skill_name,
    COUNT(DISTINCT j.job_id) AS job_count
FROM project3.job_skills js
JOIN project3.jobs j
    ON js.job_id = j.job_id
JOIN project3.skills s
    ON js.skill_id = s.skill_id
WHERE j.sector IS NOT NULL
AND j.sector <> '-1'
GROUP BY j.sector, s.skill_name
ORDER BY j.sector, job_count DESC;
")

# Convert job_count to numeric for compatibility with R operations and plotting
skills_sector <- skills_sector %>%
  mutate(job_count = as.numeric(job_count))

# Select three major sectors for focused comparison
top_sectors <- skills_sector %>%
  filter(sector %in% c("Information Technology", "Finance", "Health Care"))

# Calculate percentage of skill mentions within each sector
# This normalizes differences in total job counts across sectors
top_skills_sector <- top_sectors %>%
  group_by(sector) %>%
  mutate(percent = job_count / sum(job_count) * 100) %>%
  slice_max(percent, n = 5) %>%
  ungroup()


top_skills_sector
# A tibble: 15 × 4
   sector                 skill_name         job_count percent
   <chr>                  <chr>                  <dbl>   <dbl>
 1 Finance                python                    19   12.3 
 2 Finance                sql                       17   11.0 
 3 Finance                excel                     17   11.0 
 4 Finance                statistics                16   10.3 
 5 Finance                r                         14    9.03
 6 Health Care            data analysis             14   13.0 
 7 Health Care            data visualization        11   10.2 
 8 Health Care            sql                       11   10.2 
 9 Health Care            tableau                   10    9.26
10 Health Care            statistics                10    9.26
11 Information Technology python                   144   14.3 
12 Information Technology machine learning         124   12.3 
13 Information Technology statistics               114   11.3 
14 Information Technology r                         83    8.24
15 Information Technology sql                       82    8.14
# Create separate datasets for each sector (needed for plotting)
finance_skills <- top_skills_sector %>%
  filter(sector == "Finance")

health_skills <- top_skills_sector %>%
  filter(sector == "Health Care")

it_skills <- top_skills_sector %>%
  filter(sector == "Information Technology")

Visualization: Skill Demand by Sector

# Finance
p1 <- ggplot(finance_skills, aes(x = reorder(skill_name, percent), y = percent)) +
  geom_col(fill = "green") +
  coord_flip() +
  ylim(0, 15) +
  labs(
    title = "Top Skills in Finance",
    x = "Skill",
    y = "   
    Percentage of Skill 
    Mentions"
  ) +
  theme_minimal()

# HealthCare
p2 <- ggplot(health_skills, aes(x = reorder(skill_name, percent), y = percent)) +
  geom_col(fill = "red") +
  coord_flip() +
  labs(
    title = "Top Skills in Health Care",
    x = "Skill",
    y = "   
    Percentage of Skill
    Mentions"
  ) +
  theme_minimal()

# IT
p3 <- ggplot(it_skills, aes(x = reorder(skill_name, percent), y = percent)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(
    title = "Top Skills in IT",
    x = "Skill",
    y = "    
    Percentage of Skill 
    Mentions"
  ) +
  theme_minimal()
p1 | p2 | p3

Conclusion

Based on the Glassdoor dataset analyzed, which was transformed into a relational database using SQL, core data science skills, particularly Python, SQL, machine learning, and statistical analysis, consistently emerge as the most in-demand. These skills appear across multiple salary levels and industry sectors represented in the dataset, indicating that their value is consistently recognized across different roles and domains.

While these core skills dominate, other tools such as R and Excel appear more context-dependent. R is present in the dataset but does not consistently rank among the top skills across salary levels or sectors, suggesting it is more specialized. Excel, on the other hand, appears more frequently in business-oriented roles and lower to mid salary ranges, indicating its role as a supporting tool rather than a core technical requirement.

Overall, the findings suggest that a strong foundation in key programming and analytical skills forms a reliable “value pack” for data scientists. This core skill set remains relevant from entry-level to senior positions, while additional tools contribute to specialization depending on industry and role requirements.