DATA 607: Project 3

Author

Desiree Thomas, Denise Atherley, Kiera Griffiths

Approach -

This project is meant to explore the question, “Which data science skills are most valued?” in a collaborative process with a group of our choosing. The goal is to work effectively as a team to perform an exploratory analysis to answer this question. Because we’ve had experience working together in a past assignment and noticed that we collaborate pretty well, we (Desiree Thomas, Denise Atherley and Kiera Griffiths) decided to work together for this Project as well. We plan to fulfill the various project requirements by breaking down each key deliverable.

Collaboration:

We will be using GitHub and Slack for our primary communication. GitHub will be the host for our shared repository and we will be using GitHub Projects to track issues, manage our workflow, assign tasks to specific members and track their completion. We will use Slack to coordinate meetings and Teams for video calls. GitHub issues may be used for handling bugs and errors. We have created a README.md for documentation as well.

Data Acquisition:

To acquire a relevant data source, we used the assistance of LLM Google Gemini to suggest some data sources that could highlight which data science skills are most valued. Various options were suggested but we ultimately chose a data source from Kaggle titled, “Data Science Job Postings with Salaries 2025”, which appeared to be the most compelling. The data is a processed version of scraped data collected in 2025 with transformations made to respect company privacy and avoid redistribution of raw proprietary content.

Logical model for normalized database:

To design a normalized relational schema, we will breakdown the Data Science Job Postings table into four interconnected tables.

  • companies Table: Stores unique information about each company.

    • company_id (Primary Key)

    • company_name (from the company column)

    • headquarter, industry, ownership, company_size, revenue

  • jobs Table: Stores the specific job postings and links to the company.

    • job_id (Primary Key)

    • company_id (Foreign Key ->companies.company_id)

    • job_title, seniority_level, status, location, post_date, salary

  • skills Table: Stores a unique list of all possible skills.

    • skill_id (Primary Key)

    • skill_name

  • job_skills Table: A mapping (bridge) table to handle the many-to-many relationship between jobs and skills (since a job requires multiple skills, and a skill is required by multiple jobs).

    • job_id (Foreign Key -> jobs.job_id)

    • skill_id (Foreign Key -> skills.skill_id)

An Entity-Relationship (ER) diagram will be produced to document this design, illustrating how entities relate to one another. We can use resources such as draw.io, Excalidraw or Lucidchart.

How we will load the data:

We will load the .csv file into a database, most likely SQLite, and use SQL code to generate our structured tables with the appropriate primary keys so that we can link them.

Once the relational database is established, we can use the DBI, RSQLite and tidyverse libraries to perform all data tidying, transformations and exploratory data analysis in R. Hopefully, we can successfully store our database in the cloud which will allow our team to connect to it. We will pull only the data that we need for our analysis and address any data quality issues like missing data and outliers that deviate from the standard format.

Analysis strategy:

In order to answer the question of which data science skills are most valued, we will likely analyze the data through two main lenses:

  • Market demand - what percentage of job postings require the specific skill?

  • Salary to skill correlation - which skills have the strongest positive correlation with the higher salary ranges.

Data Transformation:

#chunk 1 
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(lubridate)
library(DBI)
library(RSQLite)
library(ggplot2)
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
# Load raw data and standardize headers
url <- "https://raw.githubusercontent.com/desithomas/607-Project-3/refs/heads/main/data_science_job_posts_2025.csv"

job_posts_raw <- read_csv(url) %>% 
  clean_names()
Rows: 944 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): job_title, seniority_level, status, company, location, post_date, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# meets the requirement to tidy the data using R
job_posts_tidy <- job_posts_raw %>%
  # Create the ID first
  mutate(job_id = row_number()) %>% 
  mutate(
    # Use str_extract_all to pull in the entire salary range
    salary_extract = str_extract_all(salary, "\\d+,?\\d*"),
    salary_numeric = map_dbl(salary_extract, ~{
      nums <- as.numeric(str_replace_all(.x, ",", ""))
      if(length(nums) > 0) mean(nums) else NA_real_
    })
  ) %>%
  mutate(skills_clean = str_remove_all(skills, "[\\[\\]']")) %>%
  mutate(post_date_dt = today() - days(as.numeric(str_extract(post_date, "\\d+")))) %>%
  drop_na(salary_numeric, skills_clean) %>%
  select(-salary_extract)
# normalizing the database

# Table A: Companies (Unique entities) 
companies <- job_posts_tidy %>%
  select(company, headquarter, industry, company_size, revenue) %>%
  distinct(company, .keep_all = TRUE) %>% 
  mutate(company_id = row_number())

# Table B: Jobs (The Fact Table) 
jobs_table <- job_posts_tidy %>%
  left_join(companies, by = "company") %>%
  select(job_id, job_title, salary_numeric, location, post_date_dt, company_id)

# Table C/D: Skills & Bridge (Many-to-Many handling) 
skills_long <- job_posts_tidy %>%
  separate_rows(skills_clean, sep = ",\\s*") %>%
  mutate(skill_name = str_to_lower(str_trim(skills_clean))) %>%
  filter(skill_name != "")

skills_lookup <- skills_long %>%
  distinct(skill_name) %>%
  mutate(skill_id = row_number())

job_skills_bridge <- skills_long %>%
  left_join(skills_lookup, by = "skill_name") %>%
  select(job_id, skill_id)

# Establish Database Connection and Load 
con <- dbConnect(SQLite(), "ds_job_posts.db")
dbWriteTable(con, "companies", companies, overwrite = TRUE)
dbWriteTable(con, "jobs", jobs_table, overwrite = TRUE)
dbWriteTable(con, "skills_lookup", skills_lookup, overwrite = TRUE)
dbWriteTable(con, "job_skills_bridge", job_skills_bridge, overwrite = TRUE)

Data Visualization:

# Pulling the joined data from the database using a SQL Query 

skill_valuation <- dbGetQuery(con, "
  SELECT 
    s.skill_name,
    COUNT(j.job_id) AS demand_count,
    AVG(j.salary_numeric) AS avg_salary
  FROM job_skills_bridge b
  JOIN jobs j ON b.job_id = j.job_id
  JOIN skills_lookup s ON b.skill_id = s.skill_id
  GROUP BY s.skill_name
  HAVING demand_count > 5
  ORDER BY avg_salary DESC
")



# Visualization in R
ggplot(skill_valuation, aes(x = demand_count, y = avg_salary, label = skill_name)) +
  geom_point(color = "darkgreen", size = 3, alpha = 0.7) +
  geom_text(vjust = -1, check_overlap = TRUE, size = 3) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Data Science Skills: Value vs. Demand (2025)",
       subtitle = "High-Value skills are at the top; High-Demand skills are to the right",
       x = "Demand (Number of Job Postings)",
       y = "Average Salary (Market Value)",
       caption = "Source: Project 3 Normalized SQLite Database") +
  theme_minimal()

#Statistical analysis 

# Most Common Skills (Top 10)
# query database for counts and filter out empty strings
top_skills <- dbGetQuery(con, "
  SELECT s.skill_name, COUNT(j.job_id) as demand
  FROM job_skills_bridge b
  JOIN skills_lookup s ON b.skill_id = s.skill_id
  JOIN jobs j ON b.job_id = j.job_id
  WHERE s.skill_name != ''
  GROUP BY s.skill_name
  ORDER BY demand DESC
  LIMIT 10
")

ggplot(top_skills, aes(x = reorder(skill_name, demand), y = demand)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Most Common Data Science Skills (2025)",
       x = "Skill", y = "Number of Job Postings") +
  theme_minimal()

# Skill -> Salary Correlation
# average salary associated with the most frequent skills
skill_salary_corr <- dbGetQuery(con, "
  SELECT s.skill_name, AVG(j.salary_numeric) as avg_salary
  FROM job_skills_bridge b
  JOIN skills_lookup s ON b.skill_id = s.skill_id
  JOIN jobs j ON b.job_id = j.job_id
  WHERE s.skill_name IN (SELECT skill_name FROM (
      SELECT skill_name, COUNT(*) as c FROM job_skills_bridge JOIN skills_lookup USING(skill_id) GROUP BY 1 ORDER BY c DESC LIMIT 15
  ))
  GROUP BY s.skill_name
  ORDER BY avg_salary DESC
")

ggplot(skill_salary_corr, aes(x = reorder(skill_name, avg_salary), y = avg_salary)) +
  geom_point(size = 4, color = "darkorange") +
  scale_y_continuous(labels = scales::label_dollar()) +
  coord_flip() +
  labs(title = "Salary Correlation for Top 15 Skills",
       subtitle = "Average salary associated with the most frequently requested skills",
       x = "Skill", y = "Average Market Salary") +
  theme_minimal()

Conclusion -

This project consisted of identifying a data source that highlighted jobs posted in 2025 that included “data scientist” or “machine learning” in the position title and included key words on the skills necessary to fulfill the job. Because of how robust the data set was, the team decided to gather the information tidy it and normalize it in a relational database. Once the distinct tables were created and the relationships identified, we were able to analyze further. The analysis successfully categorizes data science skills into distinct tiers based on their frequency in job postings (demand) and their financial return (value). Our findings show that based on demand, python coding is the most frequently requested data science skill with over 600 distinct job postings requiring it as a skill set. Python’s popularity stems from a unique blend of simple, readable syntax and immense versatility across numerous domains.

In terms of the data science skill with the most financial return, scala coding is the highest paying skill set. Scala, short for “Scalable Language,” is a high-level, general-purpose programming language designed to integrate object-oriented and functional programming paradigms. In data science, Scala is primarily used for big data engineering, large-scale data processing, and distributed computing, largely due to its native support for Apache Spark.

These findings are incredibly interesting because it showcases how the market is moving to employ individuals skilled in flexible coding language with a large financial investment in scalable language that can support big data. It represents a desire for the market to grow its data and generate meaningful insights from it.

The assistance of and LLM was used to enhance code:

Google DeepMind. (2025). Gemini 3 Flash [Large language model]. https://gemini.google.com. Accessed March 16-22, 2026.