Project 3

Project 3 Code Base


Group Members:

Long Lin, Emily Elmouaquite, Zihao Yu, Pascal Hermann Kouogang Tafo


First Data Source: https://www.kaggle.com/datasets/joerakhimov/data-scientist-skills

https://raw.githubusercontent.com/longflin/DATA-607-Project-3/refs/heads/main/data_scientist_skills.csv

Second Data Source: https://www.kaggle.com/datasets/fahadrehman07/data-science-jobs-and-salary-glassdoor?select=glassdoor_jobs.csv

https://raw.githubusercontent.com/longflin/DATA-607-Project-3/refs/heads/main/Glassdoor_Salary_Cleaned_Version.csv


Introduction

In this project, we are interested in finding the most valuable data science skills for data science jobs. In order to figure out these skills, we’ll take data from online sources like Kaggle to create a data set of the various skills mentioned in job postings. Then, we’ll parse the data into data frames and plot the data in order to provide analysis on the data. In the end, we should have a better idea of what skills are relevant for the data science job market and will be able to prioritize gaining the desired skills in our own skillset to match the market demands.

First Data Source

For the first data source, we grabbed a dataset from kaggle on data scientist skills.

We created a data frame from first data source by reading in the .csv file.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/longflin/DATA-607-Project-3/refs/heads/main/data_scientist_skills.csv"

ds_skills <- read_csv2(
  file = url,
  show_col_types = FALSE,
  progress = FALSE
)
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
head(ds_skills)
# A tibble: 6 × 30
  Company   Country Platform Bachelor Masters   PhD Communication English German
  <chr>     <chr>   <chr>       <dbl>   <dbl> <dbl>         <dbl>   <dbl>  <dbl>
1 Procter … Switze… Glassdo…        0       1     0             1       1      0
2 Philip M… Switze… Glassdo…        0       1     0             1       0      0
3 Swiss Re  Switze… Glassdo…        1       0     0             1       1      0
4 SAP       Switze… Glassdo…        0       1     0             0       1      1
5 Sigma     Switze… Glassdo…        0       0     0             0       0      0
6 Swarowski Switze… Glassdo…        1       0     0             0       1      0
# ℹ 21 more variables: Statistics <dbl>, ML <dbl>, DL <dbl>, AI <dbl>,
#   Optimization <dbl>, Analytics <dbl>, Visualization <dbl>, Python <dbl>,
#   R <dbl>, Matlab <dbl>, SAS <dbl>, SQL <dbl>, Spark <dbl>, Hadoop <dbl>,
#   Docker <dbl>, Keras <dbl>, Tensorflow <dbl>, Pytorch <dbl>, NLP <dbl>,
#   AWS <dbl>, Azure <dbl>

Next, we transformed the data from a wide data format to a long data format using pivot_longer on the columns pertaining to skills.

ds_skills_long <- ds_skills |>
  pivot_longer(
    cols = c(Statistics:Azure),
    names_to = "Skills",
    values_to = "Value"
  )

head(ds_skills_long)
# A tibble: 6 × 11
  Company   Country Platform Bachelor Masters   PhD Communication English German
  <chr>     <chr>   <chr>       <dbl>   <dbl> <dbl>         <dbl>   <dbl>  <dbl>
1 Procter … Switze… Glassdo…        0       1     0             1       1      0
2 Procter … Switze… Glassdo…        0       1     0             1       1      0
3 Procter … Switze… Glassdo…        0       1     0             1       1      0
4 Procter … Switze… Glassdo…        0       1     0             1       1      0
5 Procter … Switze… Glassdo…        0       1     0             1       1      0
6 Procter … Switze… Glassdo…        0       1     0             1       1      0
# ℹ 2 more variables: Skills <chr>, Value <dbl>

Next, we removed the columns with information that we were not interested in using the select function.

library(dplyr)
# Remove unncessary columns from ds_skills_long
ds_subset <- select(ds_skills_long, -Platform, -Bachelor, -Masters, -PhD, -Communication, -English, -German)
head(ds_subset)
# A tibble: 6 × 4
  Company          Country     Skills       Value
  <chr>            <chr>       <chr>        <dbl>
1 Procter & Gamble Switzerland Statistics       1
2 Procter & Gamble Switzerland ML               1
3 Procter & Gamble Switzerland DL               0
4 Procter & Gamble Switzerland AI               0
5 Procter & Gamble Switzerland Optimization     1
6 Procter & Gamble Switzerland Analytics        1

Next, we removed the rows that had a 0 in the Value column because the job did not require the skill.

ds_subset_remove_0s <- ds_subset[ds_subset$Value != 0, ]
head(ds_subset_remove_0s)
# A tibble: 6 × 4
  Company          Country     Skills       Value
  <chr>            <chr>       <chr>        <dbl>
1 Procter & Gamble Switzerland Statistics       1
2 Procter & Gamble Switzerland ML               1
3 Procter & Gamble Switzerland Optimization     1
4 Procter & Gamble Switzerland Analytics        1
5 Procter & Gamble Switzerland Python           1
6 Procter & Gamble Switzerland Spark            1

Next, we removed the Value Column altogether because all of the values in that column were 1.

ds_subset_remove_value_column <- select(ds_subset_remove_0s, -Value)
head(ds_subset_remove_value_column)
# A tibble: 6 × 3
  Company          Country     Skills      
  <chr>            <chr>       <chr>       
1 Procter & Gamble Switzerland Statistics  
2 Procter & Gamble Switzerland ML          
3 Procter & Gamble Switzerland Optimization
4 Procter & Gamble Switzerland Analytics   
5 Procter & Gamble Switzerland Python      
6 Procter & Gamble Switzerland Spark       

Next, we added a column named Job Title and assigned a value of Data Scientist to each of the rows because the data source is about data scientist jobs.

ds_job_title <- ds_subset_remove_value_column |>
  mutate(
    `Job Title` = "Data Scientist"
  ) %>%
  relocate(`Job Title`, .before = 1)

head(ds_job_title)
# A tibble: 6 × 4
  `Job Title`    Company          Country     Skills      
  <chr>          <chr>            <chr>       <chr>       
1 Data Scientist Procter & Gamble Switzerland Statistics  
2 Data Scientist Procter & Gamble Switzerland ML          
3 Data Scientist Procter & Gamble Switzerland Optimization
4 Data Scientist Procter & Gamble Switzerland Analytics   
5 Data Scientist Procter & Gamble Switzerland Python      
6 Data Scientist Procter & Gamble Switzerland Spark       

Second Data Source

For the second data source, we created a dataframe from the .csv file.

library(tidyverse)

url2 <- "https://raw.githubusercontent.com/longflin/DATA-607-Project-3/refs/heads/main/Glassdoor_Salary_Cleaned_Version.csv"

glassdoor_skills <- read_csv(
  file = url2,
  show_col_types = FALSE,
  progress = FALSE
)

head(glassdoor_skills)
# A tibble: 6 × 28
  `Job Title` `Salary Estimate` `Job Description` Rating `Company Name` Location
  <chr>       <chr>             <chr>              <dbl> <chr>          <chr>   
1 Data Scien… $53K-$91K (Glass… "Data Scientist\…    3.8 "Tecolote Res… Albuque…
2 Healthcare… $63K-$112K (Glas… "What You Will D…    3.4 "University o… Linthic…
3 Data Scien… $80K-$90K (Glass… "KnowBe4, Inc. i…    4.8 "KnowBe4\n4.8" Clearwa…
4 Data Scien… $56K-$97K (Glass… "*Organization a…    3.8 "PNNL\n3.8"    Richlan…
5 Data Scien… $86K-$143K (Glas… "Data Scientist\…    2.9 "Affinity Sol… New Yor…
6 Data Scien… $71K-$119K (Glas… "CyrusOne is see…    3.4 "CyrusOne\n3.… Dallas,…
# ℹ 22 more variables: Headquarters <chr>, Size <chr>, Founded <dbl>,
#   `Type of ownership` <chr>, Industry <chr>, Sector <chr>, Revenue <chr>,
#   Competitors <chr>, hourly <dbl>, employer_provided <dbl>, min_salary <dbl>,
#   max_salary <dbl>, avg_salary <dbl>, company_txt <chr>, job_state <chr>,
#   same_state <dbl>, age <dbl>, python_yn <dbl>, R_yn <dbl>, spark <dbl>,
#   aws <dbl>, excel <dbl>

Next, we renamed the columns in the data frame to match with the first data frame’s format for the same skills.

glassdoor_skills_rename_columns <- rename(glassdoor_skills, Python = python_yn, R = R_yn, Spark = spark, AWS = aws, Excel = excel)

head(glassdoor_skills_rename_columns)
# A tibble: 6 × 28
  `Job Title` `Salary Estimate` `Job Description` Rating `Company Name` Location
  <chr>       <chr>             <chr>              <dbl> <chr>          <chr>   
1 Data Scien… $53K-$91K (Glass… "Data Scientist\…    3.8 "Tecolote Res… Albuque…
2 Healthcare… $63K-$112K (Glas… "What You Will D…    3.4 "University o… Linthic…
3 Data Scien… $80K-$90K (Glass… "KnowBe4, Inc. i…    4.8 "KnowBe4\n4.8" Clearwa…
4 Data Scien… $56K-$97K (Glass… "*Organization a…    3.8 "PNNL\n3.8"    Richlan…
5 Data Scien… $86K-$143K (Glas… "Data Scientist\…    2.9 "Affinity Sol… New Yor…
6 Data Scien… $71K-$119K (Glas… "CyrusOne is see…    3.4 "CyrusOne\n3.… Dallas,…
# ℹ 22 more variables: Headquarters <chr>, Size <chr>, Founded <dbl>,
#   `Type of ownership` <chr>, Industry <chr>, Sector <chr>, Revenue <chr>,
#   Competitors <chr>, hourly <dbl>, employer_provided <dbl>, min_salary <dbl>,
#   max_salary <dbl>, avg_salary <dbl>, company_txt <chr>, job_state <chr>,
#   same_state <dbl>, age <dbl>, Python <dbl>, R <dbl>, Spark <dbl>, AWS <dbl>,
#   Excel <dbl>

Next, we used the pivot_longer function to convert the data frame from a wide format to a long format.

glassdoor_skills_long <- glassdoor_skills_rename_columns |>
  pivot_longer(
    cols = c(Python:Excel),
    names_to = "Skills",
    values_to = "Value"
  )

head(glassdoor_skills_long)
# A tibble: 6 × 25
  `Job Title` `Salary Estimate` `Job Description` Rating `Company Name` Location
  <chr>       <chr>             <chr>              <dbl> <chr>          <chr>   
1 Data Scien… $53K-$91K (Glass… "Data Scientist\…    3.8 "Tecolote Res… Albuque…
2 Data Scien… $53K-$91K (Glass… "Data Scientist\…    3.8 "Tecolote Res… Albuque…
3 Data Scien… $53K-$91K (Glass… "Data Scientist\…    3.8 "Tecolote Res… Albuque…
4 Data Scien… $53K-$91K (Glass… "Data Scientist\…    3.8 "Tecolote Res… Albuque…
5 Data Scien… $53K-$91K (Glass… "Data Scientist\…    3.8 "Tecolote Res… Albuque…
6 Healthcare… $63K-$112K (Glas… "What You Will D…    3.4 "University o… Linthic…
# ℹ 19 more variables: Headquarters <chr>, Size <chr>, Founded <dbl>,
#   `Type of ownership` <chr>, Industry <chr>, Sector <chr>, Revenue <chr>,
#   Competitors <chr>, hourly <dbl>, employer_provided <dbl>, min_salary <dbl>,
#   max_salary <dbl>, avg_salary <dbl>, company_txt <chr>, job_state <chr>,
#   same_state <dbl>, age <dbl>, Skills <chr>, Value <dbl>

Next, we removed the columns that we were not interested in using the select function.

library(dplyr)
# Remove unncessary columns
glassdoor_subset <- select(glassdoor_skills_long, -"Salary Estimate", -"Job Description", -Rating, -Headquarters, -Size, -Founded, -"Type of ownership", -Industry, -Sector, -Revenue, -Competitors, -hourly, -employer_provided, -min_salary, -max_salary, -avg_salary, -company_txt, -job_state, -same_state, -age)

head(glassdoor_subset)
# A tibble: 6 × 5
  `Job Title`               `Company Name`                 Location Skills Value
  <chr>                     <chr>                          <chr>    <chr>  <dbl>
1 Data Scientist            "Tecolote Research\n3.8"       Albuque… Python     1
2 Data Scientist            "Tecolote Research\n3.8"       Albuque… R          0
3 Data Scientist            "Tecolote Research\n3.8"       Albuque… Spark      0
4 Data Scientist            "Tecolote Research\n3.8"       Albuque… AWS        0
5 Data Scientist            "Tecolote Research\n3.8"       Albuque… Excel      1
6 Healthcare Data Scientist "University of Maryland Medic… Linthic… Python     1

Next, we removed the rows that had a Value of 0.

glassdoor_subset_remove_0s <- glassdoor_subset[glassdoor_subset$Value != 0, ]

head(glassdoor_subset_remove_0s)
# A tibble: 6 × 5
  `Job Title`               `Company Name`                 Location Skills Value
  <chr>                     <chr>                          <chr>    <chr>  <dbl>
1 Data Scientist            "Tecolote Research\n3.8"       Albuque… Python     1
2 Data Scientist            "Tecolote Research\n3.8"       Albuque… Excel      1
3 Healthcare Data Scientist "University of Maryland Medic… Linthic… Python     1
4 Data Scientist            "KnowBe4\n4.8"                 Clearwa… Python     1
5 Data Scientist            "KnowBe4\n4.8"                 Clearwa… Spark      1
6 Data Scientist            "KnowBe4\n4.8"                 Clearwa… Excel      1

Next, we removed the Value column because the only rows remaining had a Value of 1.

glassdoor_remove_value_column <- select(glassdoor_subset_remove_0s, -Value)

head(glassdoor_remove_value_column)
# A tibble: 6 × 4
  `Job Title`               `Company Name`                       Location Skills
  <chr>                     <chr>                                <chr>    <chr> 
1 Data Scientist            "Tecolote Research\n3.8"             Albuque… Python
2 Data Scientist            "Tecolote Research\n3.8"             Albuque… Excel 
3 Healthcare Data Scientist "University of Maryland Medical Sys… Linthic… Python
4 Data Scientist            "KnowBe4\n4.8"                       Clearwa… Python
5 Data Scientist            "KnowBe4\n4.8"                       Clearwa… Spark 
6 Data Scientist            "KnowBe4\n4.8"                       Clearwa… Excel 

Next, we filtered jobs based on if they had the word data in the Job Title or not, in order to get relevant job listings. We did this using the stringr library and the str_detect function.

library(stringr)
glassdoor_remove_value_column$`has_data_in_name` <- str_detect(glassdoor_remove_value_column$`Job Title`, "Data")

head(glassdoor_remove_value_column)
# A tibble: 6 × 5
  `Job Title`               `Company Name`      Location Skills has_data_in_name
  <chr>                     <chr>               <chr>    <chr>  <lgl>           
1 Data Scientist            "Tecolote Research… Albuque… Python TRUE            
2 Data Scientist            "Tecolote Research… Albuque… Excel  TRUE            
3 Healthcare Data Scientist "University of Mar… Linthic… Python TRUE            
4 Data Scientist            "KnowBe4\n4.8"      Clearwa… Python TRUE            
5 Data Scientist            "KnowBe4\n4.8"      Clearwa… Spark  TRUE            
6 Data Scientist            "KnowBe4\n4.8"      Clearwa… Excel  TRUE            

Next, we removed the columns that did not have the word data in the Job Title.

glassdoor_data_jobs_remove_false <- glassdoor_remove_value_column[glassdoor_remove_value_column$has_data_in_name != FALSE, ]

head(glassdoor_data_jobs_remove_false)
# A tibble: 6 × 5
  `Job Title`               `Company Name`      Location Skills has_data_in_name
  <chr>                     <chr>               <chr>    <chr>  <lgl>           
1 Data Scientist            "Tecolote Research… Albuque… Python TRUE            
2 Data Scientist            "Tecolote Research… Albuque… Excel  TRUE            
3 Healthcare Data Scientist "University of Mar… Linthic… Python TRUE            
4 Data Scientist            "KnowBe4\n4.8"      Clearwa… Python TRUE            
5 Data Scientist            "KnowBe4\n4.8"      Clearwa… Spark  TRUE            
6 Data Scientist            "KnowBe4\n4.8"      Clearwa… Excel  TRUE            

Next, we removed the has_data_in_name column because the value is TRUE for all of the rows since we removed the columns that had a value of FALSE.

glassdoor_final <- select(glassdoor_data_jobs_remove_false, -has_data_in_name)

head(glassdoor_final)
# A tibble: 6 × 4
  `Job Title`               `Company Name`                       Location Skills
  <chr>                     <chr>                                <chr>    <chr> 
1 Data Scientist            "Tecolote Research\n3.8"             Albuque… Python
2 Data Scientist            "Tecolote Research\n3.8"             Albuque… Excel 
3 Healthcare Data Scientist "University of Maryland Medical Sys… Linthic… Python
4 Data Scientist            "KnowBe4\n4.8"                       Clearwa… Python
5 Data Scientist            "KnowBe4\n4.8"                       Clearwa… Spark 
6 Data Scientist            "KnowBe4\n4.8"                       Clearwa… Excel 

Join Data Frames

Before joining the data frames, we made the format of the first data frame similar to that of the second data frame.

ds_final <- rename(ds_job_title, "Company Name" = Company, Location = Country)
head(ds_final)
# A tibble: 6 × 4
  `Job Title`    `Company Name`   Location    Skills      
  <chr>          <chr>            <chr>       <chr>       
1 Data Scientist Procter & Gamble Switzerland Statistics  
2 Data Scientist Procter & Gamble Switzerland ML          
3 Data Scientist Procter & Gamble Switzerland Optimization
4 Data Scientist Procter & Gamble Switzerland Analytics   
5 Data Scientist Procter & Gamble Switzerland Python      
6 Data Scientist Procter & Gamble Switzerland Spark       

Next, we joined the two data frames using the dplyr library and the rbind function.

library(dplyr)

combined_df <- rbind(ds_final, glassdoor_final)

head(combined_df)
# A tibble: 6 × 4
  `Job Title`    `Company Name`   Location    Skills      
  <chr>          <chr>            <chr>       <chr>       
1 Data Scientist Procter & Gamble Switzerland Statistics  
2 Data Scientist Procter & Gamble Switzerland ML          
3 Data Scientist Procter & Gamble Switzerland Optimization
4 Data Scientist Procter & Gamble Switzerland Analytics   
5 Data Scientist Procter & Gamble Switzerland Python      
6 Data Scientist Procter & Gamble Switzerland Spark       
colnames(combined_df)
[1] "Job Title"    "Company Name" "Location"     "Skills"      

Database

#Create database tables
#Job
jobTable <- combined_df %>%
  distinct(`Job Title`, `Company Name`, Location) %>%
  mutate(jobId = row_number()) %>%
  rename(jobName = `Job Title`)
#Skill
skillTable <- combined_df %>%
  distinct(Skills) %>%
  mutate(skillId = row_number()) %>%
  rename(skillName = Skills)
#jobSkill
jobSkillTable <- combined_df %>%
  left_join(jobTable,
            by = c("Job Title" = "jobName",
                   "Company Name",
                   "Location")) %>%
  left_join(skillTable,
            by = c("Skills" = "skillName")) %>%
  select(jobId, skillId) %>%
  distinct()

#Load database/ SQL libraries
library(DBI)
library(RSQLite)
Warning: package 'RSQLite' was built under R version 4.5.3
#Connect to database
con <- dbConnect(SQLite(), "ds_jobs.db")

#Create db tables
dbExecute(con, "DROP TABLE IF EXISTS job;")
[1] 0
dbExecute(con, "DROP TABLE IF EXISTS skill;")
[1] 0
dbExecute(con, "DROP TABLE IF EXISTS jobSkill;")
[1] 0
dbExecute(con, "
CREATE TABLE job (
  jobId INTEGER PRIMARY KEY,
  jobName TEXT,
  location TEXT
);
")
[1] 0
dbExecute(con, "
CREATE TABLE skill (
  skillId INTEGER PRIMARY KEY,
  skillName TEXT
);
")
[1] 0
dbExecute(con, "
CREATE TABLE jobSkill (
  jobId INTEGER,
  skillId INTEGER,
  PRIMARY KEY (jobId, skillId),
  FOREIGN KEY (jobId) REFERENCES Job(jobId),
  FOREIGN KEY (skillId) REFERENCES Skill(skillId)
);
")
[1] 0
#Populate tables with data
dbWriteTable(con, "job", jobTable %>% select(jobId, jobName, location = Location),append = TRUE)
dbWriteTable(con, "skill", skillTable %>% select(skillId, skillName), append = TRUE)
dbWriteTable(con, "jobSkill", jobSkillTable, append = TRUE)
#Check
DBI::dbListTables(con)
[1] "job"      "jobSkill" "skill"   
head(DBI::dbReadTable(con, "job"))
  jobId        jobName    location
1     1 Data Scientist Switzerland
2     2 Data Scientist Switzerland
3     3 Data Scientist Switzerland
4     4 Data Scientist Switzerland
5     5 Data Scientist Switzerland
6     6 Data Scientist Switzerland
head(DBI::dbReadTable(con, "skill"))
  skillId    skillName
1       1   Statistics
2       2           ML
3       3 Optimization
4       4    Analytics
5       5       Python
6       6        Spark
head(DBI::dbReadTable(con, "jobSkill"))
  jobId skillId
1     1       1
2     1       2
3     1       3
4     1       4
5     1       5
6     1       6

Analysis

Top Ten Most Valued Skills

combined_df %>%
  count(Skills, sort = TRUE) %>%
  head(10) %>%
  ggplot(aes(x = reorder(Skills, n), y = n, fill = Skills)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = n), 
            hjust = -0.05,
            size = 4,
            color = "black") +
  coord_flip() +
  labs(title = "Top 10 Most Demanded Skills", x = "Skill", y = "Frequency")+
  theme_minimal()

Based on the plot above for the combined data set, we’re able to see the top 10 skills that are in demand. Python is the clear leader with Excel and Spark as the second and third most in demand, respectively.

Data Science Skills Clustering by Category

library(ggplot2)
library(dplyr)

# Let's Define the 2026 Data Science Skills Framework
skill_framework <- data.frame(
  Skills = c("AI", "ML", "DL", "NLP", "Pytorch", "Tensorflow", "Keras", 
             "AWS", "Azure", "Docker", "Spark", "Hadoop",              
             "Python", "R", "SQL", "Statistics", "Analytics",          
             "Visualization", "Matlab", "SAS", "Excel", "Optimization"),
  Cluster = c(rep("Agentic & Advanced AI", 7),
              rep("Cloud & Infrastructure", 5),
              rep("Core Data Foundations", 10))
)

# 1. Aggregate frequencies from the raw combined data
# This ensures we have both the "Skills" and "Frequency" columns

skill_counts <- combined_df %>%
  group_by(Skills) %>%
  summarise(Frequency = n(), .groups = "drop")

# 2. Join with the 2026 Skill Framework to add the Cluster variable
plot_ready_data <- skill_counts %>%
  left_join(skill_framework, by = "Skills") %>%
  # Remove any skills not categorized in our 2026 clusters
  filter(!is.na(Cluster))

# 3. Run the Visualization

ggplot(plot_ready_data, aes(x = reorder(Skills, Frequency), y = Frequency, fill = Cluster)) +
  geom_bar(stat = "identity", alpha = 0.85) +
  geom_text(aes(label = Frequency), 
            hjust = -0.2, 
            size = 3.5, 
            color = "gray30") +
  coord_flip() +
  facet_grid(Cluster ~ ., scales = "free_y", space = "free_y") +
  scale_fill_manual(values = c(
    "Agentic & Advanced AI"  = "#E63946", 
    "Cloud & Infrastructure" = "#1D3557", 
    "Core Data Foundations"  = "#A8DADC"
  )) +
  # Use expansion to ensure the numeric labels fit on the screen
  
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) + 
  labs(
    title    = "Valuable Skills Demand By Category",
    subtitle = "Strategic clustering of data science requirements ",
    x        = "Specific Skill",
    y        = "Frequency in Dataset"
  ) +
  theme_minimal(base_size = 13) +
  theme(strip.text.y   = element_text(angle = 0, face = "bold"),
        legend.position = "none")

Based on this chart result, while the core foundational skills like Python and Excel remain the most frequent requirements in job postings, the market is strategically pivoting toward specialized clusters in Agentic AI and Cloud Infrastructure.

Skills Valued in 2020 vs 2024

#Add year columns to final data frames for both sources
df1_final <- ds_final %>% mutate(Year = "2020")
df2_final <- glassdoor_final %>% mutate(Year = "2024")
#Find skills shared between both data sources
shared_skills <- intersect(df1_final$Skills, df2_final$Skills)
#Create new data frame with years & only keeping the shared skills
dfForPlot <- bind_rows(df1_final, df2_final) %>%
  filter(Skills %in% shared_skills)
#Create another new data frame with proportions
dfForPlotProp <- dfForPlot %>%
  count(Year, Skills) %>%
  group_by(Year) %>%
  mutate(prop = n / sum(n))
#Plot
ggplot(dfForPlotProp, aes(x = reorder(Skills, prop), y = prop, fill = Year)) + 
  geom_col(position = "dodge") + 
  coord_flip() + 
  labs(title = "Proportion of Jobs Requiring Skills (2020 vs 2024)",
  x = "Skill",
  y = "Proportion of Job Listings") +
  scale_fill_manual(values = c("2020" = "orchid","2024" = "lightpink")) +
  geom_text(aes(label = round(prop, 2)),
          position = position_dodge(width = 0.9),
          hjust = -0.1,
          size = 3)

The plot shows the commonly shared skills found within both of the data sets, and the proportion of jobs that required them. The first data source that we used was from 2020, while the second one was from 2024. In both years, about half of the jobs required Python. The proportion of jobs requiring Spark increased from 2020, when about 9% of jobs required it, to 2024, when about 24% of jobs required it. Similarly, the proportion of jobs requiring AWS increased from 5% to 23%. On the other hand, the proportion of jobs requiring R decreased from 33% in 2020, to 0% in 2024 based on this selection of jobs.

A Comparative Analysis of Skills Needs Across Different Countries

library(tidytext)
Warning: package 'tidytext' was built under R version 4.5.3
location_list <-
  combined_df |>
  filter(!is.na(Location), Location != "") |>
  mutate(Location = str_trim(Location)) |>
  distinct(Location) |>
  arrange(Location)

location_list
# A tibble: 156 × 1
   Location              
   <chr>                 
 1 Agoura Hills, CA      
 2 Albuquerque, NM       
 3 Alexandria, VA        
 4 Aliso Viejo, CA       
 5 Allentown, PA         
 6 Ann Arbor, MI         
 7 Annapolis Junction, MD
 8 Arlington, VA         
 9 Armonk, NY            
10 Arvada, CO            
# ℹ 146 more rows
countries_count <-
  combined_df |>
  filter(!is.na(Location), Location != "") |>
  mutate(
    Location = str_trim(Location),
    Location = if_else(
      str_detect(Location, ",\\s*[A-Z]{2}$"),
      "United States",
      Location
      )
) |>
  count(Location, sort = TRUE)

countries_count
# A tibble: 4 × 2
  Location          n
  <chr>         <int>
1 United States   982
2 Germany         192
3 Switzerland     143
4 Austria         129
top10_skills_by_country <- 
   combined_df |>
   filter(!is.na(Location), Location != "") |>
   mutate(
     Location = str_trim(Location),
     Country = case_when(
       Location %in% c("Germany", "Switzerland", "Austria") ~ Location,
       str_detect(Location, ",\\s*[A-Z]{2}$") ~ "United States"
     )
   ) |>
   filter(!is.na(Country)) |>
   count(Country, Skills, sort = TRUE) |>
   group_by(Country) |>
   slice_head(n = 10) |>
   ungroup()

 top10_skills_by_country
# A tibble: 35 × 3
   Country Skills         n
   <chr>   <chr>      <int>
 1 Austria Python        21
 2 Austria ML            15
 3 Austria R             15
 4 Austria SQL           13
 5 Austria Analytics     11
 6 Austria Statistics     8
 7 Austria Hadoop         6
 8 Austria Spark          6
 9 Austria Tensorflow     6
10 Austria AI             4
# ℹ 25 more rows
library(tidytext)
ggplot(
   top10_skills_by_country,
   aes(x = n, y = reorder_within(Skills, n, Country), fill = Country)
 ) +
   geom_col(show.legend = FALSE) +
   geom_text(aes(label = n), hjust = -0.1, size = 3) +
   facet_wrap(~ Country, scales = "free_y") +
   scale_y_reordered() +
   labs(
     title = "Top 10 Skill Requirements by Country",
     x = "Count",
     y = "Skill"
   ) +
   theme_minimal() +
   expand_limits(x = max(top10_skills_by_country$n) * 1.05)

With the cleaning data, one dataset includes Germany, Switzerland, and Austria , the other set of data is from the states in the United States. After get the top 10 skills for the three countries, then compiled the top 10 skills for each U.S. state to represent the U.S. data.

By comparing the skills across the four countries, although there are differences in skill requirements, it is clear that Python is the most common.

In the U.S., only two states reported a need for R, whereas in the other three countries, R ranked relatively high—second in both Austria and Germany; in the U.S., Excel ranked second.

Conclusion

The results of our exploratory analysis show that Python seems to be the most valued data science skill. It is the most valued skill, not only overall, but also in the four countries we looked at, in both 2020 and 2024, and in the Core Data Foundations category, which is the most in demand skill category. Excel is also highly valued, with over half of the jobs in our data requiring it. By category, core data foundations including Python, Excel, R and SQL were highest in demand. However, there were also jobs requiring agentic and advanced AI skills, such as ML. Since these jobs were from either 2020 or 2024, more insights into the value of this skill category might be gained from job postings from 2025 or 2026 due to the fast paced evolution of these technologies and shifting employer expectations.

With these results, we are better equipped to focus on gaining the desired skills for the job market. As the market evolves with time, further analysis can be done to update the dataset and analysis.