project_03

Team members:

Joshua Henry, Shawn Ivan Ganz, Zineb Tamnat, Jonnathan Zuna Largo, Izza Khan

Approach

Collaboration Tools

Our group will use the following collaboration tools throughout the project:

Communication

  • Slack direct message group chat

Code Sharing

Project Documentation


Data Sources

We identified two primary data sources for this project: **O*NET Online and Kaggle**.

1. O*NET Online

O*NET Online provides structured occupational data for the role of Data Scientists, including information on relevant skills and labor demand.

Resources: - Occupational summary:
O*NET Summary for Data Scientists - Skills CSV:
onetooonline_datascience_skills.csv - Demand page:
O*NET Demand for Data Scientists - Skills demand CSV:
onetooonline_datascience_sklls_demand.csv

2. Kaggle

The Kaggle dataset contains data science job postings and salary-related information for 2025.

Resources: - Dataset page:
Data Science Careers and Salaries 2025 - CSV file:
kaggle_data_science_job_posts_2025.csv


Data Loading

The Kaggle dataset will be downloaded as a CSV file and loaded into R using read_csv() from the tidyverse package.

The O*NET Online data can be accessed directly from the website and imported from the linked CSV files. Once loaded into R, both datasets will be cleaned and transformed using packages such as dplyr and tidyr.

Relevant variables such as: - job title - employer type - required skills - skill rankings

will be extracted to support analysis of the most frequently requested data science skills across job postings.


Logical Model

The logical model integrates job posting data from Kaggle with occupational skill data from O*NET Online.

O*NET Entity

  • Skills
  • Ranking

Kaggle Entity

  • Job_ID
  • Employer_Type
  • Job_Title
  • Skills

Entity-Relationship (ER) Diagram

The database is designed around a many-to-many relationship between jobs and skills.

Jobs

  • Job_ID
  • Employer_Type
  • Job_Title

Skills

  • Skill_ID
  • Skill_Name

JobSkills

  • Job_ID
  • Skill_ID

The JobSkills table acts as a junction table that links jobs and skills. This structure allows: - one job posting to have many skills - one skill to appear in many job postings

This normalized design reduces redundancy and supports flexible querying of relationships between jobs and required skills.

Codebase

ERD (Entity Relationship Diagram)

ERD

Initializing

We will initialize the project, define the url, read the csv, and glimpse.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url_kaggle_data <- "https://raw.githubusercontent.com/Siganz/607_project03/refs/heads/main/data/kaggle_data_science_job_posts_2025.csv"
url_onetonline_skills <- "https://raw.githubusercontent.com/Siganz/607_project03/refs/heads/main/data/onetoonline_datascience_skills.csv"
url_onetonline_demand <- "https://raw.githubusercontent.com/Siganz/607_project03/refs/heads/main/data/onetoonline_datascience_skills_demand.csv"

data_science_posts_df <- read.csv(url_kaggle_data)
data_science_skills_df <- read.csv(url_onetonline_skills)
# Might not need demand?
data_science_demand_df <- read.csv(url_onetonline_demand)

glimpse(data_science_posts_df)
Rows: 944
Columns: 13
$ job_title       <chr> "data scientist", "data scientist", "data scientist", …
$ seniority_level <chr> "senior", "lead", "senior", "senior", "", "lead", "jun…
$ status          <chr> "hybrid", "hybrid", "on-site", "hybrid", "on-site", ""…
$ company         <chr> "company_003", "company_005", "company_007", "company_…
$ location        <chr> "Grapevine, TX . Hybrid", "Fort Worth, TX . Hybrid", "…
$ post_date       <chr> "17 days ago", "15 days ago", "a month ago", "8 days a…
$ headquarter     <chr> "Bentonville, AR, US", "Detroit, MI, US", "Redwood Cit…
$ industry        <chr> "Retail", "Manufacturing", "Technology", "Technology",…
$ ownership       <chr> "Public", "Public", "Public", "Public", "Private", "Pr…
$ company_size    <chr> "€352.44B", "155,030", "25,930", "34,690", "1,800", "1…
$ revenue         <chr> "Public", "€51.10B", "€33.80B", "€81.71B", "Private", …
$ salary          <chr> "€100,472 - €200,938", "€118,733", "€94,987 - €159,559…
$ skills          <chr> "['spark', 'r', 'python', 'scala', 'machine learning',…
glimpse(data_science_skills_df)
Rows: 52
Columns: 3
$ Section  <chr> "Technology", "Technology", "Technology", "Technology", "Tech…
$ Category <chr> "Analytical or scientific software", "Analytical or scientifi…
$ Example  <chr> "IBM SPSS Statistics", "SAS", "TensorFlow", "The MathWorks MA…
glimpse(data_science_demand_df)
Rows: 22
Columns: 3
$ Percentage       <int> 65, 50, 35, 22, 17, 17, 12, 11, 11, 10, 9, 8, 8, 8, 7…
$ Technology.Skill <chr> "Python", "Structured query language SQL", "R", "Tabl…
$ Hot.Technology   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes…

Data Cleaning & DF Creating

We are going to clean and standardize skill names across the job posting, category, and demand datasets, then merge them into one master skill table with unique IDs and category mappings. We are also going to create the related company, job posting, posting-skill, skill category, and skill demand tables so the final dataset follows the project ERD.

# Create IDs + clean base
data_science_posts_clean_df <- data_science_posts_df |>
  mutate(
    posting_id = row_number(),
    company_id = as.integer(factor(company))
  ) |>
  relocate(posting_id, company_id)

# Company df creation to match ERD
company_df <- data_science_posts_clean_df |>
  distinct(company_id, .keep_all = TRUE) |>
  select(company_id, industry, ownership, headquarter, company_size, revenue)

# Job Posting df creation to match ERD
job_posting_df <- data_science_posts_clean_df |>
  select(
    posting_id,
    company_id,
    seniority_level,
    work_status = status,
    location,
    post_date,
    salary_range = salary
  )

# Posting Skill df - cleaning skills field to prep for mapping
posting_skill_df <- data_science_posts_clean_df |>
  select(posting_id, skills) |>
  mutate(skills = str_remove_all(skills, "\\[|\\]|'")) |>
  separate_rows(skills, sep = ",") |>
  mutate(
    skills = str_trim(skills),
    skills = str_to_lower(skills)
  ) |>
  rename(skill_name = skills) |>
  mutate(
    skill_name = case_when(
      skill_name %in% c("sklearn", "scikit learn") ~ "scikit-learn",
      skill_name %in% c("amazon", "amazon web services") ~ "aws",
      TRUE ~ skill_name
    ),
    skill_name = str_to_title(skill_name)
  ) |>
  filter(skill_name != "", !is.na(skill_name))

# Posting skill mapping to O*Net skill sources
skill_crosswalk <- tibble::tribble(
  ~skill_id, ~job_skill, ~onet_skill_name,
  1, "Spark", "Apache Spark",
  2, "R", "R",
  3, "Python", "Python",
  4, "Scala", "Scala",
  5, "Machine Learning", NA,
  6, "Tensorflow", "TensorFlow",
  7, "Sql", "Structured query language SQL",
  8, "Aws", "Amazon Web Services AWS software",
  9, "Git", "Git",
  10, "Docker", "Docker",
  11, "Gcp", "Google Cloud software",
  12, "Kubernetes", "Kubernetes",
  13, "Deep Learning", NA,
  14, "Scikit-Learn", "Scikit-learn",
  15, "Pytorch", "PyTorch",
  16, "Keras", NA,
  17, "Java", "Oracle Java",
  18, "Pandas", "pandas",
  19, "Powerbi", "Microsoft Power BI",
  20, "Tableau", "Tableau",
  21, "Hadoop", "Apache Hadoop",
  22, "Azure", "Microsoft Azure software",
  23, "Airflow", "Apache Airflow",
  24, "Linux", "Linux",
  25, "Bash", "Bash",
  26, "Numpy", "NumPy",
  27, "Neural Network", NA,
  28, "Matplotlib", NA,
  29, "Database", NA,
  30, "Scipy", NA,
  31, "Opencv", NA
)

# Posting skills remapping
posting_skill_df <- posting_skill_df |>
  left_join(skill_crosswalk, by = c("skill_name" = "job_skill")) |>
  mutate(skill_name = coalesce(onet_skill_name, skill_name)) |>
  select(posting_id, skill_name)

# Skill Category creation to match ERD
skill_category_df <- data_science_skills_df |>
  rename(section = Section, category_name = Category, skill_name = Example) |>
  distinct(section, category_name, skill_name) |>
  arrange(section, category_name, skill_name) |>
  mutate(category_id = row_number()) |>
  relocate(category_id)

# Prepping skill for eventual mapping
skill <- data_science_demand_df |>
  rename(skill_name = Technology.Skill,
         is_hot_technology = Hot.Technology,
         demand_percentage = Percentage
         )

# Building master skill table from all unique skill names from all 3 sources
all_skills_df <- bind_rows(
  skill_category_df |> select(skill_name),
  skill   |> select(skill_name),
  posting_skill_df  |> select(skill_name)
) |>
  distinct(skill_name) |>
  arrange(skill_name)

# Using skill_demand_df as the base for demand fields
skill <- all_skills_df |>
  left_join(
    skill |>
      select(skill_name, demand_percentage, is_hot_technology),
    by = "skill_name"
  ) |>
  mutate(is_hot_technology = if_else(is.na(is_hot_technology) | is_hot_technology == "", "No",is_hot_technology)) |>
  arrange(skill_name) |>
  mutate(skill_id = row_number()) |>
  select(skill_id, skill_name, demand_percentage, is_hot_technology)

# Skill df Category join
skill <- skill |>
  left_join(
    skill_category_df |>
      select(skill_name, category_id),
    by = "skill_name"
  ) |>
  relocate(skill_id, category_id)

# Skill category clean
skill_category_df <- skill_category_df |>
  select(category_id, section, category_name) |>
  distinct()

# Posting skill_id join and clean
posting_skill_df <- posting_skill_df |>
  left_join(
    skill |> select(skill_id, skill_name),
    by = "skill_name"
  ) |>
  select(posting_id, skill_id, skill_name)

# Skill demand creation 
skill_demand_df <- skill |>
  select(skill_id, skill_name, demand_percentage) |>
  filter(!is.na(demand_percentage)) |>
  mutate(
    demand_id = row_number(),
    source = "O*Net"
  ) |>
  select(demand_id, skill_id, demand_percentage, source)

Data Tables ERD

cat("\n--- company_df ---\n")

--- company_df ---
glimpse(company_df)
Rows: 420
Columns: 6
$ company_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 16, 18, 19, 20…
$ industry     <chr> "Retail", "Manufacturing", "Technology", "Technology", "F…
$ ownership    <chr> "Public", "Public", "Public", "Public", "Private", "Priva…
$ headquarter  <chr> "Bentonville, AR, US", "Detroit, MI, US", "Redwood City, …
$ company_size <chr> "€352.44B", "155,030", "25,930", "34,690", "1,800", "150"…
$ revenue      <chr> "Public", "€51.10B", "€33.80B", "€81.71B", "Private", "€2…
cat("\n--- job_posting_df ---\n")

--- job_posting_df ---
glimpse(job_posting_df)
Rows: 944
Columns: 7
$ posting_id      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ company_id      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 16, 18, 19,…
$ seniority_level <chr> "senior", "lead", "senior", "senior", "", "lead", "jun…
$ work_status     <chr> "hybrid", "hybrid", "on-site", "hybrid", "on-site", ""…
$ location        <chr> "Grapevine, TX . Hybrid", "Fort Worth, TX . Hybrid", "…
$ post_date       <chr> "17 days ago", "15 days ago", "a month ago", "8 days a…
$ salary_range    <chr> "€100,472 - €200,938", "€118,733", "€94,987 - €159,559…
cat("\n--- posting_skill_df ---\n")

--- posting_skill_df ---
glimpse(posting_skill_df)
Rows: 4,181
Columns: 3
$ posting_id <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ skill_id   <int> 13, 52, 51, 57, 35, 65, 13, 52, 51, 63, 35, 6, 24, 51, 21, …
$ skill_name <chr> "Apache Spark", "R", "Python", "Scala", "Machine Learning",…
cat("\n--- skill_category_df ---\n")

--- skill_category_df ---
glimpse(skill_category_df)
Rows: 52
Columns: 3
$ category_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ section       <chr> "Technology", "Technology", "Technology", "Technology", …
$ category_name <chr> "Analytical or scientific software", "Analytical or scie…
cat("\n--- skill ---\n")

--- skill ---
glimpse(skill)
Rows: 68
Columns: 5
$ skill_id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ category_id       <int> 8, 20, 21, 49, 12, 22, 46, 15, NA, 16, 25, 34, 9, 47…
$ skill_name        <chr> "Alteryx software", "Amazon Elastic Compute Cloud EC…
$ demand_percentage <int> NA, NA, NA, NA, NA, 17, NA, NA, 8, NA, NA, NA, 11, N…
$ is_hot_technology <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Ye…
cat("\n--- skill_demand_df ---\n")

--- skill_demand_df ---
glimpse(skill_demand_df)
Rows: 22
Columns: 4
$ demand_id         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ skill_id          <int> 6, 9, 13, 18, 24, 38, 39, 41, 42, 45, 47, 50, 51, 52…
$ demand_percentage <int> 17, 8, 11, 7, 5, 12, 8, 17, 5, 5, 5, 9, 65, 35, 10, …
$ source            <chr> "O*Net", "O*Net", "O*Net", "O*Net", "O*Net", "O*Net"…

Data Analysis

Skill Frequency Analysis

skills_analysis <- posting_skill_df %>%
  filter(!is.na(skill_name), skill_name != "") %>%
  count(skill_name, sort = TRUE)

head(skills_analysis, 10)
# A tibble: 10 × 2
   skill_name                           n
   <chr>                            <int>
 1 Python                             640
 2 Machine Learning                   580
 3 Structured query language SQL      442
 4 R                                  343
 5 Amazon Web Services AWS software   269
 6 Deep Learning                      178
 7 TensorFlow                         165
 8 Apache Spark                       161
 9 Microsoft Azure software           155
10 PyTorch                            148

This analysis uses the normalized posting_skill_df table to identify the most frequently requested skills across job postings.

skills_analysis %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = reorder(skill_name, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Most Requested Data Science Skills",
    x = "Skill",
    y = "Frequency"
  )

The frequency counts show which technical skills appear most often across the job postings in the Kaggle dataset. This helps answer the project question by identifying which data science skills employers most commonly request.

Compare Kaggle Results with O*NET Demand

top_kaggle <- skills_analysis %>%
  slice_max(n, n = 10)

top_onet <- skill_demand_df %>%
  left_join(skill %>% select(skill_id, skill_name), by = "skill_id") %>%
  slice_max(demand_percentage, n = 10)

top_kaggle
# A tibble: 10 × 2
   skill_name                           n
   <chr>                            <int>
 1 Python                             640
 2 Machine Learning                   580
 3 Structured query language SQL      442
 4 R                                  343
 5 Amazon Web Services AWS software   269
 6 Deep Learning                      178
 7 TensorFlow                         165
 8 Apache Spark                       161
 9 Microsoft Azure software           155
10 PyTorch                            148
top_onet
   demand_id skill_id demand_percentage source                       skill_name
1         13       51                65  O*Net                           Python
2         18       63                50  O*Net    Structured query language SQL
3         14       52                35  O*Net                                R
4         19       64                22  O*Net                          Tableau
5          1        6                17  O*Net Amazon Web Services AWS software
6          8       41                17  O*Net               Microsoft Power BI
7          6       38                12  O*Net         Microsoft Azure software
8          3       13                11  O*Net                     Apache Spark
9         20       65                11  O*Net                       TensorFlow
10        15       56                10  O*Net                              SAS

The results from the Kaggle dataset show which skills appear most frequently in job postings, while the O*NET dataset provides demand percentages based on industry data.

By comparing these two sources, we can evaluate whether the most commonly listed skills in job postings align with broader industry demand. Skills such as Python, SQL, and machine learning are expected to appear in both datasets, reinforcing their importance in the data science field.

Conclusion

This analysis identified the most valued data science skills by combining job posting data with industry demand data. The Kaggle dataset highlighted the most frequently requested skills in real job postings, while O*NET provided a structured view of skill demand across the industry. Together, these findings suggest that programming languages such as Python and SQL, along with machine learning tools, are consistently among the most important skills for data science roles. This demonstrates that both employer demand and industry data align on the core technical competencies required in the field.