In today’s data-driven world, data science has become one of the most in-demand and interdisciplinary career paths. Professionals are expected to combine technical, analytical, and communication skills to extract insights and drive decisions using data.
This project explores the question: “Which are the most valued data science skills?”
Using a dataset from Kaggle (Data Science Job Postings & Skills, 2024), the goal is to identify the most frequently requested skills among employers in data science job postings.
This project was completed individually using:
This workflow demonstrates data acquisition, transformation, storage, and analysis from start to finish.
Dataset: Data Science Job Postings & Skills (2024)
Author: asaniczka
Platform: Kaggle
Source: LinkedIn job postings
License: ODC Attribution License
The dataset includes two columns:
job_link: a unique link for each job postingjob_skills: a comma-separated list of skills mentioned
in the listingA PostgreSQL database named data_science_jobs was created, and the dataset was imported into a raw table:
CREATE TABLE raw_job_skills (
job_link TEXT,
job_skills TEXT
);
To comply with relational design principles, the data was normalized into three tables:
The normalization process ensured a many-to-many relationship between job postings and skills, allowing efficient querying and analysis.
The following SQL script summarizes the normalization process:
-- Create normalized tables
CREATE TABLE jobs (
job_id SERIAL PRIMARY KEY,
job_link TEXT UNIQUE
);
CREATE TABLE skills (
skill_id SERIAL PRIMARY KEY,
skill_name TEXT UNIQUE
);
CREATE TABLE job_skills (
job_id INTEGER REFERENCES jobs(job_id),
skill_id INTEGER REFERENCES skills(skill_id),
PRIMARY KEY (job_id, skill_id)
);
-- Insert distinct job links
INSERT INTO jobs (job_link)
SELECT DISTINCT job_link
FROM raw_job_skills
WHERE job_link IS NOT NULL;
-- Insert distinct skills
INSERT INTO skills (skill_name)
SELECT DISTINCT trim(skill)
FROM (
SELECT unnest(string_to_array(job_skills, ',')) AS skill
FROM raw_job_skills
) AS all_skills
WHERE trim(skill) <> '';
-- Link jobs to skills safely
INSERT INTO job_skills (job_id, skill_id)
SELECT j.job_id, s.skill_id
FROM raw_job_skills r
JOIN jobs j ON r.job_link = j.job_link
JOIN LATERAL unnest(string_to_array(r.job_skills, ',')) AS skill_array(skill_text) ON TRUE
JOIN skills s ON trim(skill_array.skill_text) = s.skill_name
ON CONFLICT DO NOTHING;
The normalized tables were connected to RStudio through the RPostgres and DBI libraries. A secure password management approach was implemented using the .Renviron file, which stores the PostgreSQL password as an environment variable (POSTGRES_PASS). This ensures reproducibility while keeping credentials confidential.
con <- dbConnect(
RPostgres::Postgres(),
dbname = "data_science_jobs",
host = "localhost",
port = 5432,
user = "postgres",
password = Sys.getenv("POSTGRES_PASS")
)
The query below retrieves the Top 10 Most Frequently Mentioned Skills:
top_skills <- dbGetQuery(con, "
SELECT s.skill_name, COUNT(js.job_id) AS frequency
FROM job_skills js
JOIN skills s ON js.skill_id = s.skill_id
GROUP BY s.skill_name
ORDER BY frequency DESC
LIMIT 10;
")
top_skills
## skill_name frequency
## 1 Python 4746
## 2 SQL 4516
## 3 Communication 2481
## 4 Data Analysis 2146
## 5 Machine Learning 1934
## 6 AWS 1714
## 7 Tableau 1651
## 8 Data Visualization 1528
## 9 R 1527
## 10 Java 1399
top_skills %>%
ggplot(aes(x = reorder(skill_name, frequency), y = frequency)) +
geom_col(fill = 'orange') +
coord_flip() +
labs(
title = 'Top 10 Most Valued Data Science Skills',
x = 'Skill',
y = 'Frequency'
) +
theme_minimal()
After completing the visualization, the database connection was closed:
dbDisconnect(con)
The analysis revealed that:
These results emphasize that employers value professionals who combine technical expertise with strong analytical and communication abilities.
This project demonstrates the complete process of:
Through this end-to-end workflow, I gained practical experience in data acquisition, database design, and data visualization — all key components of modern data management.