Introduction

This project explores the question “Which are the most valued data science skills?” To answer it, the following methodology was used:

  • Data Collection - Import data from various sources such as Bureau of Labor Statistics, Projections Central, and O*Net.
  • Data Normalization - Clean and normalize the data using various processing techniques.
  • Export to Database - Store the processed data.
  • Data Analysis - Conduct analysis on the structured data.
  • Summary of Findings - Summarize key insights.

Data Normalization

To reduce redundancy and improve data integrity, the data in this project was normalized. This helps ensure that data is stored efficiently, avoids duplication and inconsistencies, and allows for a better-managed database. To normalize the datasets, five core tables were developed, along with supporting reference and link tables.
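
The general pattern applied in the normalization steps below is to pull distinct categorical values into a reference table and keep only the key column in the core table. A minimal sketch of that pattern with a toy data frame (not project data):

# toy illustration of the normalization pattern (not project data)
library(dplyr)

toy_raw <- tibble(
  id = 1:3,
  category_code = c("A1", "A1", "B2"),
  category_title = c("Alpha", "Alpha", "Beta")
)

# reference table: one row per distinct categorical value
toy_category_ref <- toy_raw |> select(category_code, category_title) |> distinct()

# core table: keep only the key that links back to the reference table
toy_core <- toy_raw |> select(id, category_code)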

Research Questions

  1. Which skills are considered the most important in the data science field?
  2. What is the relationship between projected employment and the importance of job skills?
  3. What types of technical skills (based on commodity categories) are most frequently included on data science job postings (in-demand or hot)?
  4. What is the distribution of skill importance across different skill categories (e.g., cognitive, interpersonal)?

Data Sources

  • Industry Profile for Data Scientists - This data source provides detailed information on the data science occupation and projected employment.
  • O*NET Database - The O*NET database provides information describing work and worker characteristics, including skill requirements for specific occupations. This data source was used to explore the skill sets of occupations related to data science.

Technology Skills Data

Updated column names

# updated column names
colnames(tech_skills) <- c(
  'onet_soc', 'onet_soc_title', 'tech_skill_example',
  'commodity_code', 'commodity_title', 'hot_technology',
  'in_demand')

# rendering table 
knit_table(head(tech_skills), 'View Raw Tech Skills Data')
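
knit_table() is a helper defined elsewhere in the project. Since its output is later piped into kableExtra::scroll_box(), a minimal sketch of what such a helper might look like, assuming kableExtra, is:

# hypothetical sketch of the knit_table() helper used throughout this document;
# the project's actual definition may differ
library(kableExtra)

knit_table <- function(df, caption = NULL) {
  df |>
    kbl(caption = caption) |>
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
}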

Normalize Tech Skills Data

# create tech skills data reference tables
commodity_ref <- 
  tech_skills |>
  select(commodity_code, commodity_title) |>
  distinct()

Final Data Handling Note: this dataset was already in a tidy format, so no additional transformations were needed.

# remove reference table columns from main data frame
tech_skills_df <- tech_skills |>
  select('onet_soc','commodity_code', 'hot_technology','in_demand')

# rendering table 
knit_table(head(tech_skills_df), 'View Normalized Tidy Tech Skills Data')

Skills Data

# update col names
colnames(ep_project_skills) <- c(
  'soc_title', 'soc', 'employment_2023', 'employment_2033', 'employment_change_num_2023_33',
  'employment_change_prct_2023_33', 'ep_skills_category_id', 'ep_skills_category', 'ep_skills_score',
  'onet_soc', 'onet_element_id', 'onet_element_name', 'onet_rating_value'
  )

# render table
knit_table(head(ep_project_skills), 'View Raw Skills Data') |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")

Normalize Skills Data

# create skills data reference tables

skills_element_ref <- 
  ep_project_skills |>
  select(onet_element_id, onet_element_name) |>
  distinct()
skills_category_ref <- ep_project_skills |>
  select('ep_skills_category_id', 'ep_skills_category') |>
  distinct()

# create soc and onet soc link table
soc_onet_soc_lnk <- 
  ep_project_skills |>
  select(soc, onet_soc) |>
  distinct()

Final Data Handling - Remove Partial Dependencies

# split out skills values based on source (onet and ep)
ep_skills_df <- ep_project_skills |>
  select('soc', 'ep_skills_category_id', 'ep_skills_score') |>
  distinct()

onet_skills_df <- ep_project_skills |>
  select('soc', 'onet_element_id', 'onet_rating_value') |>
  distinct()

knit_table(head(ep_skills_df), 'View Normalized Tidy EP Skills Data')
knit_table(head(onet_skills_df), 'View Normalized Tidy ONET Skills Data')

Industry and Occupation Projection Data

# update column names
colnames(soc_industry_project) = c(
  'soc_type', 'industry_type', 'soc', 'soc_title', 'industry_code', 'industry_title',
  'employment_2023', 'prct_industry_2023', 'prct_soc_2023', 
  'employment_2033', 'prct_industry_2033', 'prct_soc_2033',
  'employment_change_num_2023_33', 'employment_change_prct_2023_33')

# rendering table 
knit_table(head(soc_industry_project), 'View Raw Industry and Occupation Projection Data') |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")

Normalize Projection Data

# create industry and soc code reference tables
industry_ref <- 
  soc_industry_project |>
  select(industry_code, industry_title, industry_type) |>
  distinct()

# removing dups
soc_ref <- 
  soc_industry_project |>
  select(soc_title, soc, soc_type) |>
  distinct()


# create soc and industry code link table
soc_industry_lnk <- 
  soc_industry_project |>
  select(soc, industry_code) |>
  distinct()

Data Handling

# employment values are reported in thousands in the raw data file; convert to actual counts
soc_industry_project <- soc_industry_project |>
  mutate(across(c('employment_2023', 'employment_2033', 'employment_change_num_2023_33'), ~ .x * 1000))

  • Tidy Data - Split out 2023 and 2033 employment data and stack dataframes to make each variable its own column

# subset 2023 employment data
soc_industry_project_2023 <- soc_industry_project |>
  select('industry_code', 'soc', 'employment_2023',
         'prct_industry_2023', 'prct_soc_2023') |>
  mutate(year = '2023')
colnames(soc_industry_project_2023) = c('industry_code', 'soc', 'employment', 'prct_industry', 'prct_soc', 'year')

# subset 2033 projection employment data
soc_industry_project_2033 <- soc_industry_project |>
  select('industry_code', 'soc', 'employment_2033',
         'prct_industry_2033', 'prct_soc_2033') |>
  mutate(year = '2033')
colnames(soc_industry_project_2033) = c('industry_code', 'soc', 'employment', 'prct_industry', 'prct_soc', 'year')

# stack data to create final soc industry employment dataframe
soc_industry_project_df <- bind_rows(soc_industry_project_2023, soc_industry_project_2033)

  • Create employment change dataframe by pivoting the employment change metrics columns from wide to long format

# subset employment change data and pivot longer
soc_industry_project_change_df <- soc_industry_project |> 
  select('industry_code', 'soc', 'employment_change_num_2023_33', 'employment_change_prct_2023_33') |>
  pivot_longer(
    cols = c('employment_change_num_2023_33', 'employment_change_prct_2023_33'),
    names_to = 'employment_change_2023_33_stat', 
    values_to = "employment_change_2023_33_value",
    names_pattern="employment_change_(.*)_2023_33"
  )

# rendering table 
knit_table(head(soc_industry_project_df), 'View Normalized Tidy Industry and Occupation Projection Data') |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")

knit_table(head(soc_industry_project_change_df), 'View Normalized Tidy Industry and Occupation Projection Change Data') |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")

Tidy Data

To tidy and normalize the data the team performed the following:

  1. Renamed columns to provide more intuitive names and to ensure that columns representing the same data values are named consistently across all data frames.

  2. Developed reference tables to store distinct categorical values (ex: skill categories) and remove partial dependencies.

  3. Removed redundant columns (such as columns that are represented in reference tables) from the core data tables, retaining only relevant fields for analysis.

Data Cleaning

# Data Cleaning Phase

# 1. ep_skills_df Cleaning
# Remove NAs, ensure correct data types, and keep scores within the expected range
ep_skills_df_clean <- ep_skills_df |>
  drop_na() |>
  mutate(across(c(ep_skills_score), as.numeric)) |>
  filter(ep_skills_score >= 0 & ep_skills_score <= 100)

# 2. onet_skills_df_clean Cleaning
onet_skills_df_clean = onet_skills_df |>
  drop_na() |>
  mutate(across(c(onet_rating_value), as.numeric)) |>
  filter(onet_rating_value >= 0 & onet_rating_value <= 100)

# 3. tech_skills_df Cleaning
tech_skills_df_clean <- tech_skills_df |> 
  drop_na(commodity_code) |> 
  mutate(
    hot_technology = case_when(
      str_to_lower(hot_technology) %in% c("yes", "y", "true", "1") ~ TRUE,
      str_to_lower(hot_technology) %in% c("no", "n", "false", "0") ~ FALSE,
      TRUE ~ NA
    ),
    in_demand = case_when(
      str_to_lower(in_demand) %in% c("yes", "y", "true", "1") ~ TRUE,
      str_to_lower(in_demand) %in% c("no", "n", "false", "0") ~ FALSE,
      TRUE ~ NA
    )
  )

# 4. soc_industry_project_df Cleaning
soc_industry_project_df_clean <- soc_industry_project_df |> 
  drop_na() |> 
  mutate(across(c("employment", 'year', 'prct_industry', 'prct_soc'), as.numeric))

# 5. soc_industry_project_change_df Cleaning
soc_industry_project_change_df_clean <- soc_industry_project_change_df |> 
  drop_na() |> 
  mutate(across(c("employment_change_2023_33_value"), as.numeric))

# Output summaries (for validation)
ep_skills_df_clean_summ <- summary(ep_skills_df_clean)
onet_skills_df_clean_summ <- summary(onet_skills_df_clean)
tech_skills_df_clean_summ <- summary(tech_skills_df_clean)
soc_industry_project_df_clean_summ <- summary(soc_industry_project_df_clean)
soc_industry_project_change_df_clean_summ <- summary(soc_industry_project_change_df_clean)

list(
  ep_skills_df_clean_summ = ep_skills_df_clean_summ,
  tech_skills_df_clean_summ = tech_skills_df_clean_summ,
  onet_skills_df_clean_summ = onet_skills_df_clean_summ,
  soc_industry_project_df_clean_summ = soc_industry_project_df_clean_summ,
  soc_industry_project_change_df_clean_summ = soc_industry_project_change_df_clean_summ
)
knit_table(head(ep_skills_df_clean), 'Preview Cleaned ep_skills_df')
knit_table(head(onet_skills_df_clean), 'Preview Cleaned onet_skills_df')
knit_table(head(tech_skills_df_clean), 'Preview Cleaned tech_skills_df')
knit_table(head(soc_industry_project_df_clean), 'Preview Cleaned soc_industry_project_df')
knit_table(head(soc_industry_project_change_df_clean), 'Preview Cleaned soc_industry_project_change_df')

MySQL Database

To ensure our processed data is securely stored and easily queryable for analysis, we structured and uploaded it into a MySQL relational database. We created a schema called project3_data using MySQL Workbench and defined tables for each of our core cleaned datasets: ep_skills, tech_skills, and soc_industry_project. This allows us to organize the data using normalized table structures and access it through SQL queries.
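
As a sketch of the table-definition step described above, a core table such as ep_skills could also be created from R with a DDL statement issued through DBI::dbExecute(); the column types shown are assumptions for illustration, not the project's actual schema, and the call is commented out like the connection code below because it requires the active conn object:

# illustrative DDL for one core table (column types are assumed for this sketch)
# dbExecute(conn, "
#   CREATE TABLE IF NOT EXISTS ep_skills (
#     soc                   VARCHAR(10),
#     ep_skills_category_id VARCHAR(20),
#     ep_skills_score       DOUBLE
#   );
# ")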

Database Connection

In this section, we:

  1. Export the cleaned datasets to .csv files for transparency and backup.
  2. Securely connect to the MySQL database from R using credentials stored in environment variables (to protect sensitive information).
  3. Use R’s dbWriteTable() function to insert each dataset into its corresponding SQL table.
  4. Run a test query to verify the connection and ensure data was inserted successfully.
  5. Disconnect from the database once the upload is complete to free up system resources and maintain good practice.

This setup creates a reproducible and secure workflow for integrating R with SQL, allowing us to bridge data cleaning, storage, and analysis in a single environment.
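
Step 2 above relies on credentials living outside the source code. A common approach, sketched here, is to define the credential in a user-level .Renviron file and read it with Sys.getenv(); the MYSQL_PASSWORD variable name is illustrative rather than the project's actual variable:

# illustrative credential handling (the variable name MYSQL_PASSWORD is hypothetical)
# 1. add a line such as  MYSQL_PASSWORD=your-password  to ~/.Renviron
# 2. restart R so the variable is loaded, then read it without hard-coding the value:
password <- Sys.getenv("MYSQL_PASSWORD")
if (!nzchar(password)) stop("MYSQL_PASSWORD is not set; add it to ~/.Renviron and restart R")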

Database Connection

# Core tables
write.csv(ep_skills_df_clean, "Data/ep_skills_df_clean.csv", row.names = FALSE)
write.csv(onet_skills_df_clean, "Data/onet_skills_df_clean.csv", row.names = FALSE)
write.csv(tech_skills_df_clean, "Data/tech_skills_df_clean.csv", row.names = FALSE)
write.csv(soc_industry_project_df_clean, "Data/soc_industry_project_df_clean.csv", row.names = FALSE)
write.csv(soc_industry_project_change_df_clean, "Data/soc_industry_project_change_df_clean.csv", row.names = FALSE)
# Link tables
write.csv(soc_onet_soc_lnk, "Data/soc_onet_soc_lnk.csv", row.names = FALSE)
write.csv(soc_industry_lnk, "Data/soc_industry_lnk.csv", row.names = FALSE)
# Reference tables
write.csv(commodity_ref, "Data/commodity_ref.csv", row.names = FALSE)
write.csv(skills_element_ref, "Data/skills_element_ref.csv", row.names = FALSE)
write.csv(soc_ref, "Data/soc_ref.csv", row.names = FALSE)
write.csv(skills_category_ref, "Data/skills_category_ref.csv", row.names = FALSE)
write.csv(industry_ref, "Data/industry_ref.csv", row.names = FALSE)
# Securely fetching the password
# password <- Sys.getenv("")
# 
# # Attempt to connect
# conn <- tryCatch({
#   dbConnect(
#     MySQL(),
#     user = "root",
#     password = password,
#     dbname = "project3_data",
#     host= '192.168.1.52',
#     port= 3306,
#     client.flag = CLIENT_LOCAL_FILES
#   )
# }, error = function(e) {
#   message("Error: ", e$message)
#   return(NULL)
# })
# 
# # Proceed only if connection succeeded
# if (!is.null(conn)) {
#   print("Database connection successful.")
# 
#   # Sample query
#   ep_sample <- dbGetQuery(conn, "SELECT * FROM ep_skills LIMIT 5;")
#   print(ep_sample)
# 
#   # Inserting data frames
#   # Core tables
#   dbWriteTable(conn, "ep_skills", ep_skills_df_clean, row.names = FALSE, overwrite = TRUE)
#   dbWriteTable(conn, "onet_skills", onet_skills_df_clean, row.names = FALSE, overwrite = TRUE)
#   dbWriteTable(conn, "tech_skills", tech_skills_df_clean, row.names = FALSE, overwrite = TRUE)
#   dbWriteTable(conn, "soc_industry_project", soc_industry_project_df_clean, row.names = FALSE, overwrite = TRUE)
#   dbWriteTable(conn, "soc_industry_project_change", soc_industry_project_change_df_clean, row.names = FALSE, overwrite = TRUE)
#   # Link tables
#   dbWriteTable(conn, "soc_onet_soc_lnk", soc_onet_soc_lnk, row.names = FALSE, overwrite = TRUE)
#   dbWriteTable(conn, "soc_industry_lnk", soc_industry_lnk, row.names = FALSE, overwrite = TRUE)
#   # Reference tables
#   dbWriteTable(conn, "commodity_ref", commodity_ref, row.names = FALSE, overwrite = TRUE)
#   dbWriteTable(conn, "skills_element_ref", skills_element_ref, row.names = FALSE, overwrite = TRUE)
#   dbWriteTable(conn, "soc_ref", soc_ref, row.names = FALSE, overwrite = TRUE)
#   dbWriteTable(conn, "skills_category_ref", skills_category_ref, row.names = FALSE, overwrite = TRUE)
#   dbWriteTable(conn, "industry_ref", industry_ref, row.names = FALSE, overwrite = TRUE)
# 
#   # Disconnect when done
#   dbDisconnect(conn)
# 
# } else {
#   stop("Database connection failed. Check credentials and try again.")
# }

Data Analysis and Visualization

Question 1:

Which skills are considered the most important in the data science field?

To explore the most critical skills in the data science field, we begin by analyzing the ep_skills_df_clean dataset. The goal is to identify which skills are disproportionately used in the data science field compared to all other occupations. This was done by calculating, for each EP skill category, the percent of occupations with a lower skill importance score than data science.

# create subset of skills data that only includes data science for comparison
ds_ep_skills = ep_skills_df_clean |>
  filter(soc == data_science_soc) |>
  select(ep_skills_category_id, ep_skills_score)|>
  arrange(desc(ep_skills_score))
colnames(ds_ep_skills) = c('ep_skills_category_id', 'ds_ep_skills_score')
  
# create subset data frame that includes all occupations other than data science
less_ds_ep_skills <- ep_skills_df_clean |>
  filter(soc != data_science_soc) |>
  left_join(skills_category_ref, by = join_by(ep_skills_category_id)) |>
  left_join(soc_ref, by = join_by(soc)) |>
  # join data science skills subset data
  left_join(ds_ep_skills, by = join_by(ep_skills_category_id)) |>
  # group by skill category
  group_by(ep_skills_category) |>
  # calculate percent of occupations with lower score
  mutate(less_score = ifelse(ep_skills_score < ds_ep_skills_score, 1, 0)) |>
  summarise(prct_less = mean(less_score)* 100) |>
  arrange(desc(prct_less))

knit_table(less_ds_ep_skills) |> scroll_box(width = "100%", box_css = "border: 1px solid #FFFFFF;")

In the above code chunk, we calculated, for each skill, the percent of occupations with lower skill importance scores (ep_skills_score) than data science. This aided in identifying the top three EP skill categories that are disproportionately utilized by data scientists. To make the output more readable, we merged these results with the skill labels from the skills_category_ref reference table. The resulting table displays each EP skill category and the percent of occupations with a skill importance score less than that of data science.

The results highlight three skills for which the data scientist skill importance value exceeds that of 90% of all other occupations: ‘Mathematics’, ‘Critical and analytical thinking’, and ‘Computers and information technology’. These reflect a strong demand for technical knowledge and cognitive skills, which are essential for navigating the dynamic challenges within data science roles.

To better visualize these insights, we created the following horizontal bar chart:

less_ds_ep_skills <- less_ds_ep_skills |>
  mutate(fill = ifelse(prct_less >= 90, "Yes", "No"))

less_ds_ep_skills |>
  ggplot(aes(x = reorder(ep_skills_category, prct_less), y = prct_less, fill = fill)) + 
  geom_col() +
  scale_fill_manual(values = c("steelblue", "indianred2")) +
  coord_flip() +
  labs(
    title = "Percent of Occupations with EP Skill Importance
    Scores Less Than Data Science",
    x = "EP Skill Category",
    y = "Percent",
    fill = 'Data Science Skill Score Greater Than 90% Occupations'
  ) +
  theme_classic() +
  theme(
     plot.title = element_text(hjust = 0.5),
     legend.position = 'bottom',
     legend.justification = "left",
     legend.title =element_text(size = 10))

This plot highlights the same top three skills and makes it easier to compare their occupation proportions. Ordering the bars by the percent of occupations with lower scores and flipping the coordinates improves readability, especially for longer skill names. The visual reinforces the table’s insights, showing that technical knowledge, adaptability, precision, and decision-making are viewed as highly important across data science occupations.

The same procedure employed above is applied below to the O*NET skill rating scores data. Analyzing both occupation skill scores and ratings using the same methodology allows us to compare results and verify their consistency, as well as gain a more comprehensive understanding of important data science skills.

# create subset of skills data that only includes data science for comparison
ds_onet_skills = onet_skills_df |>
  filter(soc == data_science_soc) |>
  select(onet_element_id, onet_rating_value)|>
  arrange(desc(onet_rating_value))
colnames(ds_onet_skills) = c('onet_element_id', 'ds_onet_rating_value')

# create subset data frame that includes all occupations other than data science
less_ds_onet_skills <- onet_skills_df |>
  filter(soc != data_science_soc) |>
  left_join(skills_element_ref, by = join_by(onet_element_id)) |>
  # join data science skills subset data
  left_join(ds_onet_skills, by = join_by(onet_element_id)) |>
  # group by skill category
  group_by(onet_element_name) |>
  # calculate percent of occupations with lower score
  mutate(less_score = ifelse(onet_rating_value < ds_onet_rating_value, 1, 0)) |>
  summarise(prct_less = mean(less_score)* 100) |>
  arrange(desc(prct_less))

knit_table(less_ds_onet_skills) |> scroll_box(width = "100%", height = '450px', box_css = "border: 1px solid #FFFFFF;")

The results highlight 11 skills for which the data scientist skill importance value exceeds that of 90% of all other occupations. Four of these skills overlapped with the EP skill importance scores. The seven newly identified skills include ‘Number Facility’, ‘Mathematical Reasoning’, ‘Analyzing Data or Information’, ‘Programming’, ‘Interpreting the Meaning of Information for Others’, ‘Inductive Reasoning’, and ‘Science’.

To better visualize these insights, we created the following horizontal bar chart. For readability, only the top 24 skills are displayed.

less_ds_onet_skills <- less_ds_onet_skills |>
  mutate(fill = ifelse(prct_less >= 90, "Yes", "No"))

head(less_ds_onet_skills, 24) |>
  ggplot(aes(x = reorder(onet_element_name, prct_less), y = prct_less, fill = fill)) + 
  geom_col() +
  scale_fill_manual(values = c("steelblue", "indianred2")) +
  coord_flip() +
  labs(
    title = "Percent of Occupations with O*NET Skill Importance
    Scores Less Than Data Science",
    x = "O*NET Skill Category",
    y = "Percent",
    fill = 'Data Science Skill Score Greater Than 90% Occupations'
  ) +
  theme_classic() +
  theme(
     plot.title = element_text(hjust = 0.5),
     legend.position = 'bottom',
     legend.justification = "left",
     legend.title =element_text(size = 8))

Question 2:

What is the relationship between projected employment and the importance of job skills?

To explore this question, we examine how the need for different occupational skills may change in the future by comparing the average score of each EP skill category across all occupations weighted by base (2023) and projected (2033) employment.

The analysis begins by creating a function to calculate the employment-weighted average of skill scores for the base and projection years. The weights reflect projected changes in employment; applying them lets us see how aggregate skill demand shifts with projected employment, even though the skill scores themselves are not projected.

weighted_avg <- function(skill, employment) {
  return(sum(skill * employment) / sum(employment))
}
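
As a quick check with made-up numbers (not project data), two occupations that score a skill at 4 and 2 with employment weights of 100 and 50 give a weighted average of (4*100 + 2*50) / 150 ≈ 3.33:

# illustrative call with made-up values (not project data)
weighted_avg(skill = c(4, 2), employment = c(100, 50))
# returns 3.333333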

# get average weights skill scores
skills_project = ep_project_skills |>
  select(soc, ep_skills_category, employment_2023, employment_2033, ep_skills_score) |>
  distinct() |>
  group_by(ep_skills_category) |>
  summarise(
    wgt_23 = weighted_avg(ep_skills_score, employment_2023),
    wgt_33 = weighted_avg(ep_skills_score, employment_2033)
  ) |>
  mutate(prct_diff = ((wgt_33-wgt_23)/wgt_23)*100) |>
  arrange(desc(prct_diff))

Percent change in projected weighted average skill importance is visualized below:

skills_project = skills_project |>
  mutate(fill = ifelse(
    ep_skills_category %in% c(
      'Mathematics', 'Critical and analytical thinking', 'Computers and information technology'
      ), "Yes", "No"))

skills_project |>
  ggplot(aes(x = reorder(ep_skills_category, prct_diff), y = prct_diff, fill=fill)) + 
  geom_col() +
  scale_fill_manual(values = c("steelblue", "indianred2")) +
  coord_flip() +
  labs(
    title = "Percent Change in Skill Score per 
    Projected Changes in Occupational Employment",
    x = "EP Skill Category",
    y = "Percent Change",
    fill = 'Important Data Science EP Skills') +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = 'bottom',
    legend.justification = "left",
    legend.title =element_text(size = 10))

This chart helps to visually assess the relationship between skills and projected changes in employment. The three bars in red call out the top three data science EP skills identified in question one. These three skills show varying degrees of projected increase in skill score relative to the other categories. The small increase in mathematics could be due to the world becoming more automated and technologically advanced. In addition, the relatively small changes across all occupations may be explained by slow structural change in the economy. Further analysis is needed to draw meaningful conclusions about causation.

Question 3:

What types of technical skills (based on commodity categories) are most frequently included on data science job postings (in-demand or hot)?

This analysis explores which categories of technical tools or platforms - referred to as commodity categories - are most frequently used in job posting requirements. Commodities labeled as “hot technologies” are those frequently included across all employer job postings, while commodities labeled “in-demand” are those frequently included in job postings for a specific occupation. The goal is to understand what kinds of software or systems are frequently required for data science-related roles.

The analysis begins by joining the cleaned tech_skills_df with its corresponding commodity_ref descriptions. Then, for each unique commodity_title, we count how many times it appears with hot_technology = TRUE and in_demand = TRUE. These counts are reshaped into a long format for plotting:

# Join tech skills with commodity reference
tech_commodity_summary <- tech_skills_df_clean |>
  filter(onet_soc == '15-2051.00') |>
  left_join(commodity_ref, by = "commodity_code") |>
  group_by(commodity_title) |>
  summarise(
    count_hot = sum(hot_technology == TRUE, na.rm = TRUE),
    count_demand = sum(in_demand == TRUE, na.rm = TRUE),
    total = sum(in_demand == TRUE | hot_technology == TRUE, na.rm = TRUE)
  ) |>
  top_n(12, total) |>
  pivot_longer(cols = c(count_hot, count_demand),
               names_to = "label", values_to = "count") |>
  mutate(label = recode(label, 
                        count_hot = "Hot Technology",
                        count_demand = "In Demand"))

After calculating the counts, the twelve most frequently referenced commodity categories (based on total references across both labels) are retained. The visualization plots these categories side by side by count and label:

# Visualization: bar chart
top_commodities <- tech_commodity_summary |>
  group_by(commodity_title) |>
  summarise(total_count = sum(count))

filtered_commodity_plot <- tech_commodity_summary |>
  filter(commodity_title %in% top_commodities$commodity_title)

ggplot(filtered_commodity_plot, aes(x = reorder(commodity_title, count), y = count, fill = label)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_fill_manual(values = c("steelblue", "indianred2")) +
  labs(title = "Top Hot and In-Demand Skills by Commodity Category",
       x = "Commodity Category", y = "Count", fill = "Label") +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = 'bottom',
    legend.justification = "left",
    legend.title =element_text(size = 10))

From the plot, it’s clear that “object or component oriented development software” and “data base user interface and query software” dominate in both “hot” and “in-demand” counts for data science occupations. These tools are fundamental to data handling, analysis, and model development—so their prominence isn’t surprising.

In addition to those commodity categories, “development environment software”, “business intelligence and data analysis software”, and “analytical or scientific software” all have an “in-demand” count that is more than 50% of their “hot” count. This means that these categories occur on data science job postings more than 50% of the time relative to their frequency across all job postings. Thus, these commodities can be seen as valuable to the data science field. Interestingly, many categories show up as “hot” but not “in-demand”. This contrast may reflect the difference between technologies that are currently trending and those that are persistently needed across job roles.

Question 4:

What is the distribution of skill importance across different skill categories (e.g., cognitive, interpersonal)?

This final analysis examines how skill importance scores are distributed across various skill categories - such as communication, analytical thinking, adaptability, and more. It’s an attempt to identify which broad categories of skills tend to receive higher importance ratings in data science-related occupations.

The analysis begins by joining the cleaned ep_skills_df_clean dataset with the skills_category_ref lookup to pull in readable category names:

# Join skills with category reference
skills_by_category <- ep_skills_df_clean |>
  left_join(skills_category_ref, by = "ep_skills_category_id") |>
  filter(startsWith(soc, data_science_soc))

To visualize the distribution, a horizontal bar chart is created. Each bar represents a skill category, with its length showing the EP skill importance score (on a 0 to 5 scale):

# Summary plot by skill category
ggplot(skills_by_category, aes(
  x = reorder(ep_skills_category, ep_skills_score),
  y = ep_skills_score)) +
  geom_col() +
  coord_flip() +
  labs(title = "EP Skill Importance by Category",
       x = "Skill Category", y = "Importance Score") +
  theme_classic()

The resulting plot reveals similar results to our first analysis. Categories like Reading and Writing, Adaptability, Computers and Information Technology, and Creativity and Innovation tend to have higher importance scores, suggesting they are especially valued in data science-related roles. On the other hand, categories like Fine Motor, Customer Service, or Physical Strength and Stamina generally receive lower scores, which aligns with the nature of data-oriented work.

This distributional view helps distinguish which categories house the most critical competencies in the field and highlights the multidimensional nature of skill expectations - even in technical roles.

Summary of Findings

The analysis identified a variety of valued data science skills based on skill importance, skill ranking, projected employment, and the frequency of skills in job posting requirements (hot and in-demand skills). Overall, the following data science skills were found to be of value:

  • Mathematics
  • Critical and analytical thinking
  • Computers and information technology
  • Number Facility
  • Analyzing Data or Information
  • Programming
  • Interpreting the Meaning of Information for Others
  • Inductive Reasoning
  • Science
  • Reading and Writing
  • Adaptability
  • Creativity and Innovation

While the majority of these skills are technical, there is considerable diversity in the range of skills found to be of value. This combination of technical knowledge, cognitive abilities, communication, and creativity highlights that data science is both a science and an art. Three of the top valued data science skills are projected to increase slightly over time across all occupations based on 2033 employment projection data. This supports the trend of data science growing and evolving rapidly alongside emerging developments such as AI.

The presence of “hot technologies” did not significantly impact skill importance, indicating that trendy tools are not necessarily considered more essential. Finally, the following commodities were also found to be of value in the data science field because they occur on data science job postings more than 50% of the time relative to their frequency across job postings for all occupations.

  • Object or component-oriented development software
  • Data base user interface and query software
  • Development environment software
  • Business intelligence and data analysis software
  • Analytical or scientific software