library(DBI)
library(RPostgres)
library(tidyverse)
library(patchwork)Project 3:Most Valued Data Science Skills
Introduction
This project investigated which technical skills are most valued in the data science job market. As the demand for data science professionals has grown significantly over the last decade, it is important to understand which tools, programming languages, and analytical skills are most frequently required by employers.
To explore this question, the project analyzed a dataset containing job postings for data science-related roles. Job postings provide insight into employer expectations because they typically list the technical skills required for each position. In this project, the frequency of skills listed in job postings was used as a proxy for measuring the demand for specific data science skills.
Approach
Team members
Madina Kudanova
Muhammad Khan
Leon Lin
Xiaofei Mei
Communication
Our team communicate primarily through WhatsApp/Slack group chats and school email to coordinate meetings, discuss progress, and share quick updates related to the project.
Code and File Sharing
A shared GitHub repository https://github.com/MKudanova/Project3_DataScienceSkills/tree/main is being used to store project code, scripts, and datasets.
Project Documentation
A shared Google Docs document is being used to collaboratively write project notes and draft written sections of the project before final versions are uploaded to the GitHub repository.
Data Sources
We obtained the data from Kaggle. The data was scraped from Glassdoor with information such as job title, job description, company information, industry. Skills required mostly can be listed through the job description column.
Data Loading and Database Design
We built a relational PostgreSQL database to organize the dataset using a normalized structure. We defined tables, primary keys, and relationships to ensure data integrity and reduce redundancy.
Because one job posting can require multiple skills and a single skill may appear in multiple job postings, we implemented a many-to-many relationship between jobs and skills.
The database included three tables: Jobs (job_id, job_title, company, location), Skills (skill_id, skill_name), and Job_Skills (job_id, skill_id). The Job_Skills table served as a bridge table with foreign keys referencing both Jobs and Skills.
We used SQL to create the schema and manage the database in pgAdmin4. The data were then accessed from PostgreSQL in R using the RPostgres package for querying and analysis.
ER Diagram
An Entity–Relationship (ER) diagram was created to document the structure of the database. It illustrates the relationships between the Jobs, Skills, and Job_Skills tables, showing how the bridge table resolves the many-to-many relationship between jobs and skills. The diagram was created using draw.io
Exploratory Analysis and Visualization
After cleaning the data in R, we explored which skills appeared most often in job postings, how they relate to salary, and how they vary across sectors. We created visualizations such as bar charts to show the most common skills and scatter plots to compare skill demand with average salary.
These visuals made it easier to understand which data science skills are most in demand, how they are valued in the job market, and how they differ by sector.
Code base
Connect to PostgreSQL Database
# Establish connection to local PostgreSQL database
con <- dbConnect(
RPostgres::Postgres(),
dbname = "project3_skills",
host = "localhost",
port = 5432,
user = "postgres",
password = ""
)
# Set schema to project3 so we can access our tables
dbExecute(con, "SET search_path TO project3;")[1] 0
Verify Connection
# Check available tables
dbListTables(con)[1] "job_skills" "jobs" "skills"
[4] "vw_job_skill_details"
# Retrieve all tables specifically within the "project3" schema
# This confirms that our project tables (jobs, skills, job_skills) are correctly loaded
# and avoids listing unrelated system tables
dbGetQuery(con, "
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'project3'
ORDER BY table_name;
") table_schema table_name
1 project3 job_skills
2 project3 jobs
3 project3 skills
4 project3 vw_job_skill_details
Skill Demand Analysis
# Query the database to count how many job postings require each skill
# COUNT(DISTINCT js.job_id) ensures each job is counted only once per skill
skills_demand <- dbGetQuery(con, "
SELECT
s.skill_name,
COUNT(DISTINCT js.job_id) AS job_count
FROM project3.job_skills js
JOIN project3.skills s
ON js.skill_id = s.skill_id
GROUP BY s.skill_name
ORDER BY job_count DESC;
")
# Convert job_count from integer64 (database type) to numeric for compatibility with R and plotting
skills_demand <- skills_demand %>%
mutate(job_count = as.numeric(job_count))
skills_demand skill_name job_count
1 python 482
2 statistics 439
3 machine learning 405
4 sql 322
5 r 308
6 excel 291
7 data visualization 238
8 spark 186
9 data analysis 185
10 aws 147
11 hadoop 140
12 big data 136
13 tableau 122
14 deep learning 97
15 nlp 87
16 power bi 34
Visualization: Demand by Skill
ggplot(skills_demand, aes(x = reorder(skill_name, job_count), y = job_count)) +
geom_col(fill = "orange") +
coord_flip() +
labs(
title = "Demand by Top Data Science Skills",
x = "Skill",
y = "Number of Job Postings"
)Salary Analysis by Skill
# Calculate the average salary associated with each skill
# This links job postings with salary information to estimate the market value of each skill
skills_salary <- dbGetQuery(con, "
SELECT
s.skill_name,
ROUND(AVG(j.avg_salary), 2) AS avg_salary
FROM project3.job_skills js
JOIN project3.jobs j
ON js.job_id = j.job_id
JOIN project3.skills s
ON js.skill_id = s.skill_id
GROUP BY s.skill_name
ORDER BY avg_salary DESC;
")
# Select the top 8 highest-paying skills for clearer comparison
skills_salary_top <- skills_salary %>%
slice_max(avg_salary, n = 8)
skills_salary_top skill_name avg_salary
1 power bi 130.38
2 nlp 127.90
3 data analysis 126.26
4 hadoop 126.12
5 tableau 125.75
6 machine learning 125.65
7 deep learning 124.89
8 spark 124.74
Visualization: Salary by Skill
combined <- skills_demand %>%
inner_join(skills_salary, by = "skill_name")
ggplot(combined, aes(x = job_count, y = avg_salary, label = skill_name)) +
geom_point(color = "orange", size = 3) +
geom_text(nudge_y = 0.3, size = 3) +
labs(
title = "Demand vs Salary for Data Science Skills",
x = "Number of Job Postings",
y = "Average Salary"
)Plot interpretation:
This plot compares the demand for data science skills with their corresponding average salaries. The results show that higher demand does not necessarily translate to higher pay.
For example, Python and SQL appear among the most frequently requested skills, but their average salaries are relatively moderate compared to more specialized skills. In contrast, tools such as Power BI and NLP show higher average salaries despite lower demand, suggesting that niche or specialized skills may command a salary premium.
Overall, the relationship between demand and salary is not strictly linear. Core skills are widely required across many roles, while more specialized skills tend to be less common but offer higher compensation.
Skill Demand by Sector Analysis
# Retrieve skill demand by sector from the database
# Each row represents how many job postings in a sector require a given skill
skills_sector <- dbGetQuery(con, "
SELECT
j.sector,
s.skill_name,
COUNT(DISTINCT j.job_id) AS job_count
FROM project3.job_skills js
JOIN project3.jobs j
ON js.job_id = j.job_id
JOIN project3.skills s
ON js.skill_id = s.skill_id
WHERE j.sector IS NOT NULL
AND j.sector <> '-1'
GROUP BY j.sector, s.skill_name
ORDER BY j.sector, job_count DESC;
")
# Convert job_count to numeric for compatibility with R operations and plotting
skills_sector <- skills_sector %>%
mutate(job_count = as.numeric(job_count))
# Select three major sectors for focused comparison
top_sectors <- skills_sector %>%
filter(sector %in% c("Information Technology", "Finance", "Health Care"))
# Calculate percentage of skill mentions within each sector
# This normalizes differences in total job counts across sectors
top_skills_sector <- top_sectors %>%
group_by(sector) %>%
mutate(percent = job_count / sum(job_count) * 100) %>%
slice_max(percent, n = 5) %>%
ungroup()
top_skills_sector# A tibble: 15 × 4
sector skill_name job_count percent
<chr> <chr> <dbl> <dbl>
1 Finance python 19 12.3
2 Finance sql 17 11.0
3 Finance excel 17 11.0
4 Finance statistics 16 10.3
5 Finance r 14 9.03
6 Health Care data analysis 14 13.0
7 Health Care data visualization 11 10.2
8 Health Care sql 11 10.2
9 Health Care tableau 10 9.26
10 Health Care statistics 10 9.26
11 Information Technology python 144 14.3
12 Information Technology machine learning 124 12.3
13 Information Technology statistics 114 11.3
14 Information Technology r 83 8.24
15 Information Technology sql 82 8.14
# Create separate datasets for each sector (needed for plotting)
finance_skills <- top_skills_sector %>%
filter(sector == "Finance")
health_skills <- top_skills_sector %>%
filter(sector == "Health Care")
it_skills <- top_skills_sector %>%
filter(sector == "Information Technology")Visualization: Skill Demand by Sector
# Finance
p1 <- ggplot(finance_skills, aes(x = reorder(skill_name, percent), y = percent)) +
geom_col(fill = "green") +
coord_flip() +
ylim(0, 15) +
labs(
title = "Top Skills in Finance",
x = "Skill",
y = "
Percentage of Skill
Mentions"
) +
theme_minimal()
# HealthCare
p2 <- ggplot(health_skills, aes(x = reorder(skill_name, percent), y = percent)) +
geom_col(fill = "red") +
coord_flip() +
labs(
title = "Top Skills in Health Care",
x = "Skill",
y = "
Percentage of Skill
Mentions"
) +
theme_minimal()
# IT
p3 <- ggplot(it_skills, aes(x = reorder(skill_name, percent), y = percent)) +
geom_col(fill = "lightblue") +
coord_flip() +
labs(
title = "Top Skills in IT",
x = "Skill",
y = "
Percentage of Skill
Mentions"
) +
theme_minimal()
p1 | p2 | p3Conclusion
Based on the Glassdoor dataset analyzed, which was transformed into a relational database using SQL, core data science skills, particularly Python, SQL, machine learning, and statistical analysis, consistently emerge as the most in-demand. These skills appear across multiple salary levels and industry sectors represented in the dataset, indicating that their value is consistently recognized across different roles and domains.
While these core skills dominate, other tools such as R and Excel appear more context-dependent. R is present in the dataset but does not consistently rank among the top skills across salary levels or sectors, suggesting it is more specialized. Excel, on the other hand, appears more frequently in business-oriented roles and lower to mid salary ranges, indicating its role as a supporting tool rather than a core technical requirement.
Overall, the findings suggest that a strong foundation in key programming and analytical skills forms a reliable “value pack” for data scientists. This core skill set remains relevant from entry-level to senior positions, while additional tools contribute to specialization depending on industry and role requirements.