Project 3: Project Presentation

Group Members: Alina Vikhnevich, Olivia Azevedo, Alyssa Gurkas

2025-03-26

Introduction

This project explores which data science skills are most valuable. To answer this question, we followed this methodology:

Data Collection - Import data from various sources such as the Bureau of Labor Statistics, Projections Central, and O*NET.
Data Normalization - Clean and normalize the data using various processing techniques.
Export to Database - Store the processed data.
Data Analysis - Conduct analysis on the structured data.
Summary of Findings - Summarize key insights.

Research Questions

Which skills are considered the most important in the data science field?
What is the relationship between projected employment and the importance of job skills?
What types of technical skills (based on commodity categories) are most frequently included in data science job postings (in-demand or hot)?
What is the distribution of skill importance across different skill categories (e.g., cognitive, interpersonal)?

Data Sources

Industry Profile for Data Scientists - This data source provides detailed information on the data science occupation and its projected employment.
O*NET Database - The O*NET database contains information describing work and worker characteristics, including skill requirements for specific occupations. This data source was used to explore the skill sets of occupations related to data science.

Logic Model

Entity Relationship Diagram

Data Normalization

To reduce redundancy and improve data integrity, we normalized the data in this project. Normalization helps ensure that data is stored efficiently, without duplication or inconsistencies, and makes the database easier to manage. The normalized design consists of five core tables and five reference tables.

Connecting to MySQL Database

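# Load DBI, which provides dbReadTable() and dbDisconnect()
library(DBI)

# `conn` is assumed to have been created earlier (e.g., by DBI::dbConnect()
# wrapped so that it returns NULL on failure); that step is not shown here.
# Proceed only if connection succeeded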
if (!is.null(conn)) {
  print("Database connection successful.")
  
  # Load Core tables
  ep_skills_df_clean <- dbReadTable(conn, "ep_skills")
  onet_skills_df_clean <- dbReadTable(conn, "onet_skills")
  tech_skills_df_clean <- dbReadTable(conn, "tech_skills")
  soc_industry_project_df_clean <- dbReadTable(conn, "soc_industry_project")
  soc_industry_project_change_df_clean <- dbReadTable(conn, "soc_industry_project_change")
  
  # Load Link tables
  soc_onet_soc_lnk <- dbReadTable(conn, "soc_onet_soc_lnk")
  soc_industry_lnk <- dbReadTable(conn, "soc_industry_lnk")
  
  # Load Reference tables
  commodity_ref <- dbReadTable(conn, "commodity_ref")
  skills_element_ref <- dbReadTable(conn, "skills_element_ref")
  soc_ref <- dbReadTable(conn, "soc_ref")
  skills_category_ref <- dbReadTable(conn, "skills_category_ref")
  industry_ref <- dbReadTable(conn, "industry_ref")
  
  # Optionally, print a sample from one table for verification
  print(head(ep_skills_df_clean))
  
  # Disconnect when done
  dbDisconnect(conn)
  
} else {
  stop("Database connection failed. Check credentials and try again.")
}
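
The code above checks whether `conn` is NULL but does not show how `conn` is created. One possible way to get that behavior is sketched below. The helper names `safe_connect` and `connect_mysql`, the choice of the RMariaDB driver, and the environment variable names are all assumptions for illustration, not the project's actual setup.

```r
# Hypothetical helper: run a connection function and return NULL on failure,
# so that a later `if (!is.null(conn))` check behaves as intended.
safe_connect <- function(connect_fn) {
  tryCatch(
    connect_fn(),
    error = function(e) {
      message("Connection attempt failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Assumed connection function. The driver, host, database name, and
# environment variable names are placeholders, not the project's settings.
# Nothing here runs until the function is called.
connect_mysql <- function() {
  DBI::dbConnect(
    RMariaDB::MariaDB(),
    host     = Sys.getenv("DB_HOST"),
    dbname   = Sys.getenv("DB_NAME"),
    username = Sys.getenv("DB_USER"),
    password = Sys.getenv("DB_PASSWORD")
  )
}

# Usage (requires a reachable server and the DBI and RMariaDB packages):
# conn <- safe_connect(connect_mysql)
```

Reading credentials from environment variables keeps them out of the presentation source; any other secure mechanism would serve equally well.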

Tidy Data

To tidy and normalize the data, the team performed the following steps:

1. Renamed columns to make their names more intuitive and to ensure that columns representing the same data values are named consistently across all data frames.

2. Developed reference tables to store distinct categorical values (e.g., skill categories) and remove partial dependencies.

3. Removed redundant columns (such as columns that are represented in reference tables) from the core data tables, retaining only relevant fields for analysis.
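
As a rough illustration of steps 2 and 3, the sketch below pulls a categorical column out into a reference table and replaces it with a key in the core table. The data frame, column names, and values are invented for this example and do not reflect the project's actual schema.

```r
# Toy data; column names and values are illustrative assumptions.
skills_raw <- data.frame(
  soc_code       = c("15-2051", "15-2051", "15-1252"),
  skill_category = c("Cognitive", "Interpersonal", "Cognitive"),
  score          = c(4.2, 3.1, 3.8),
  stringsAsFactors = FALSE
)

# Reference table: one row per distinct category, with a surrogate key.
categories <- sort(unique(skills_raw$skill_category))
skills_category_ref_demo <- data.frame(
  skill_category_id = seq_along(categories),
  skill_category    = categories,
  stringsAsFactors  = FALSE
)

# Core table: attach the key, then drop the now-redundant text column.
skills_core_demo <- merge(skills_raw, skills_category_ref_demo,
                          by = "skill_category")
skills_core_demo <- skills_core_demo[, c("soc_code", "skill_category_id", "score")]
```

The category label is then stored once, in the reference table, and the core table carries only the key.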

Research Question 1

Which skills are considered the most important in the data science field?

To explore the most critical skills in the data science field, we begin by analyzing the ep_skills_df_clean dataset. The goal is to identify which skills matter disproportionately in data science compared with all other occupations. For each EP skill category, we calculate the percentage of occupations whose skill importance score is lower than the score for data science.
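
A minimal sketch of this calculation on made-up data follows. The function `pct_lower_than_ds`, the column names, the occupation labels, and the scores are all assumptions; the real ep_skills_df_clean table may be structured differently.

```r
# Toy data; occupation names, category names, and scores are made up.
ep_demo <- data.frame(
  occupation  = rep(c("Data Scientists", "Occ A", "Occ B", "Occ C", "Occ D"),
                    times = 2),
  ep_category = rep(c("Mathematics", "Mechanical"), each = 5),
  score       = c(4.5, 3.0, 2.0, 4.8, 1.5,
                  1.0, 3.5, 2.5, 4.0, 0.5),
  stringsAsFactors = FALSE
)

# For each category, the percentage of other occupations whose score is
# lower than the target occupation's score (assumes one target row per category).
pct_lower_than_ds <- function(df, target = "Data Scientists") {
  by_cat <- split(df, df$ep_category)
  out <- vapply(by_cat, function(d) {
    ds_score <- d$score[d$occupation == target]
    others   <- d$score[d$occupation != target]
    100 * mean(others < ds_score)
  }, numeric(1))
  sort(out, decreasing = TRUE)
}

pct_lower <- pct_lower_than_ds(ep_demo)
```

Categories near 100 are those where data science scores above almost every other occupation.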

Research Question 2

What is the relationship between projected employment and the importance of job skills?

To explore this question, we examine how the need for different occupational skills may change in the future. We compare the average score of each EP skill category across all occupations, weighted first by base-year (2023) employment and then by projected (2033) employment.
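
The weighting can be sketched for a single skill category as follows. The data frame, column names, and numbers are invented for illustration; in practice the same calculation would be repeated for each EP skill category.

```r
# Toy data for one skill category; all values are made up.
emp_demo <- data.frame(
  occupation = c("Occ A", "Occ B", "Occ C"),
  score      = c(4.0, 2.0, 3.0),
  emp_2023   = c(100, 300, 100),  # base-year employment
  emp_2033   = c(200, 300, 100)   # projected employment
)

# Average score across occupations, weighted by employment in each year.
weighted_base      <- weighted.mean(emp_demo$score, emp_demo$emp_2023)
weighted_projected <- weighted.mean(emp_demo$score, emp_demo$emp_2033)

# A positive change means employment is shifting toward occupations
# that score higher on this skill.
change <- weighted_projected - weighted_base
```

Because the scores themselves are held fixed, any change in the weighted average comes only from shifts in where employment is concentrated.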

Research Question 3

What types of technical skills (based on commodity categories) are most frequently included in data science job postings (in-demand or hot)?

This analysis explores which categories of technical tools or platforms - referred to as commodity categories - appear most frequently in job posting requirements. Commodities labeled “hot technologies” are frequently included across all employer job postings, while those labeled “in-demand” are frequently included in job postings for a specific occupation. The goal is to understand what kinds of software or systems are frequently required for data science-related roles.
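
One way to summarize these flags by commodity category is sketched below. The column names `in_demand` and `hot_technology`, the tool names, and the flag values are assumptions made for this example, not the contents of the project's tech_skills table.

```r
# Toy data; tools and flags are made up.
tech_demo <- data.frame(
  commodity = c("Development environment software",
                "Development environment software",
                "Analytical or scientific software",
                "Analytical or scientific software",
                "Word processing software"),
  tool           = c("Tool A", "Tool B", "Tool C", "Tool D", "Tool E"),
  in_demand      = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  hot_technology = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Share of tools in each commodity category carrying each flag.
share_in_demand <- tapply(tech_demo$in_demand, tech_demo$commodity, mean)
share_hot       <- tapply(tech_demo$hot_technology, tech_demo$commodity, mean)

commodity_summary <- data.frame(
  commodity       = names(share_in_demand),
  share_in_demand = as.numeric(share_in_demand),
  share_hot       = as.numeric(share_hot),
  stringsAsFactors = FALSE
)

# Most in-demand categories first.
commodity_summary <- commodity_summary[order(-commodity_summary$share_in_demand), ]
```

Counts (using `sum` instead of `mean`) may be the better summary when categories differ greatly in the number of tools they contain.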

Research Question 4

What is the distribution of skill importance across different skill categories (e.g., cognitive, interpersonal)?

This final analysis examines how skill importance scores are distributed across skill categories such as communication, analytical thinking, and adaptability. The aim is to identify which broad categories of skills tend to receive higher importance ratings in data science-related occupations.
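
A simple way to compare these distributions is to compute summary statistics per category, as in the sketch below. The category labels, column names, and scores are invented for illustration.

```r
# Toy data; categories and importance scores are made up.
onet_demo <- data.frame(
  skill_category = rep(c("Cognitive", "Interpersonal"), each = 4),
  importance     = c(4.5, 4.0, 3.5, 4.0,
                     3.0, 2.5, 3.5, 3.0),
  stringsAsFactors = FALSE
)

# One row per category, one column per summary statistic.
importance_by_category <- t(sapply(
  split(onet_demo$importance, onet_demo$skill_category),
  function(x) c(min = min(x), median = median(x), mean = mean(x), max = max(x))
))
```

The same grouping could also be shown visually, for example with `boxplot(importance ~ skill_category, data = onet_demo)`.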

Findings: Top Twelve Skills

The analysis identified the top twelve most important skills in data science, highlighting both technical knowledge and cognitive abilities like adaptability and reasoning. Although most of these skills are technical, a wide range of skills proved valuable.

Findings: Projections

Three of the top-valued data science skills are projected to increase slightly over time across all occupations, based on 2033 employment projection data. This is consistent with the view that data science is growing and evolving rapidly alongside emerging trends such as AI.

Findings: Skills Found in Job Postings

Finally, the following commodities were also found to be valuable in the data science field because they occur frequently in job postings, appearing more than 50% of the time when compared with job postings for all occupations.

Object or component-oriented development software
Data base user interface and query software
Development environment software
Business intelligence and data analysis software
Analytical or scientific software