For Project 3, our team was asked to answer the question, "Which are the most valued data science skills?" To answer it, we decided to explore the top skills that companies search for in a job candidate. We sourced our data from Indeed, a popular job listing site, obtaining a dataset of over 12,000 US data science jobs scraped from Indeed in January 2024*.
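Throughout, we rely on the following packages. The original setup chunk is not displayed, so this list is inferred from the functions used in the code below:
# packages assumed loaded in a setup chunk
library(tidyverse)  # readr, dplyr, tidyr, stringr, purrr, ggplot2
library(digest)     # md5 hashing for job_id
library(arrow)      # read_parquet() / write_parquet()
library(duckdb)     # embedded database; attaches DBI for dbConnect()/dbGetQuery()
library(patchwork)  # combining ggplot objects with +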
We began by normalizing the dataset to match our proposed Entity Relationship Diagram (ERD) shown below:
After inspecting the Indeed dataset files, we noticed that while the files did not have an id column, each job listing, job summary, and set of job skills could be identified by the listing's URL (the job_link field). Because this field is verbose, we decided to create a new id for the job listings while retaining the existing unique relationships by deriving the id from the link. We originally planned to use the 10-digit sequence at the end of the slug, but we found 23 instances where the sequence was repeated for related but distinct jobs at the same company. Instead, we wrote a custom function, called when preparing each of the dataframes, that extracts the page slug and hashes it with digest.
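A quick check along these lines is how one might surface those repeated sequences; the regular expression is an assumption about where the sequence sits in the URL:
# hypothetical check for repeated trailing sequences in the job links
read_csv("source_files/job_postings.csv") |>
  mutate(seq_id = str_extract(job_link, "\\d{10}$")) |>  # assumes the 10-digit sequence ends the URL
  count(seq_id, sort = TRUE) |>
  filter(n > 1)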
Next, we imported the job summary, job listing, and job skills files one at a time and transformed them to match our Entity Relationship Diagram.
# function to create a primary key column from the url
convert_key <- function(df) {
  df |>
    mutate(
      job_id = str_extract(job_link, "\\/[^\\/]*$"),
      job_id = map_chr(job_id, digest, algo = "md5", seed = 1234)
    ) |>
    relocate(job_id)
}
# read in job summary csv
orig_job_summary <- read_csv("source_files/job_summary.csv") |>
  convert_key() |>
  subset(select = -c(job_link))

# read in job postings csv
orig_job_postings <- read_csv("source_files/job_postings.csv") |>
  convert_key()

# subset companies and add col id
companies_df <- subset(orig_job_postings, select = c(company)) |>
  distinct(company) |>
  mutate(company_id = row_number()) |>
  relocate(company_id, .before = "company")

# add job description and subset
jobs_df <- orig_job_postings |>
  left_join(orig_job_summary, join_by(job_id == job_id)) |>  # add Job Description
  subset(select = c(job_id, job_title, job_summary, job_location, search_position, job_level, job_type, company)) |>
  left_join(companies_df, join_by(company == company)) |>    # add Job Company field
  subset(select = -c(company))
To create a “skills” dataframe and a “job_skills” joiner dataframe, we read in the job skills file, transformed the url field into a hash to use as the foreign key connecting to the job record, subset to just the job_id and job_skills fields, and used separate_longer_delim to break the job_skills field into separate rows on the comma delimiter. We will update this dataframe later to replace each skill listed in the skill_name column with its unique id.
To create the “skills” dataframe, we first copy the “job_skills” dataframe, use distinct to remove the duplicates, then add a row number to use as the skill_id field. This leaves a table of unique skills with their respective ids.
As a final step, we use left_join to join the “skills” dataframe to the “job_skills” dataframe by the skill_name field. Dropping skill_name leaves two columns: job_id, which serves as a foreign key to join with our “jobs” dataframe or SQL table, and skill_id, which serves as a foreign key to join with our “skills” dataframe or SQL table.
# read in job skills csv and rename the skills column
job_skills_dirty <- read_csv("source_files/job_skills.csv") |>
  convert_key() |>
  subset(select = c(job_id, job_skills)) |>
  rename(skill_name = job_skills)

# create joiner table: one row per job/skill pair
job_skills_distinct <- job_skills_dirty |>
  separate_longer_delim(cols = c(skill_name), delim = ",")

# make distinct list of unique skills and give them their own id
skills_distinct <- job_skills_distinct |>
  distinct(skill_name) |>
  mutate(skill_id = row_number()) |>
  relocate(skill_id, .before = skill_name)

# update joiner to bring in the unique skill_id and drop the skill name
job_skills_distinct <- job_skills_distinct |>
  left_join(skills_distinct, join_by(skill_name == skill_name)) |>
  subset(select = -c(skill_name))
# write parquet files
write_parquet(jobs_df, "datasets/jobs.parquet")
write_parquet(companies_df, "datasets/companies.parquet")
write_parquet(skills_distinct, "datasets/skills.parquet")
write_parquet(job_skills_distinct, "datasets/job_skills.parquet")
We used DuckDB, a popular open-source embeddable database, to manage our relational data. We chose DuckDB as a way to try working with a different type of database within R and for its portability (though we don't recommend storing sensitive data in a public repository, as we have for this exercise).
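That portability comes from the fact that a DuckDB database can live in a single file. We kept ours in memory, but a persistent variant might look like the sketch below; the file path and dbWriteTable calls are illustrative, not what we ran:
# hypothetical on-disk variant: persist the normalized tables to a single .duckdb file
con_file <- dbConnect(duckdb::duckdb(), dbdir = "datasets/project3.duckdb")
dbWriteTable(con_file, "jobs", jobs_df, overwrite = TRUE)
dbWriteTable(con_file, "companies", companies_df, overwrite = TRUE)
dbWriteTable(con_file, "skills", skills_distinct, overwrite = TRUE)
dbWriteTable(con_file, "job_skills", job_skills_distinct, overwrite = TRUE)
dbDisconnect(con_file, shutdown = TRUE)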
After loading the duckdb package, we initialized DuckDB by establishing a connection (#A1) and registering our pre-normalized parquet files as tables (#A2). Next, we used SQL to join the tables into a single dataframe (#A3).
# A1. make an in-memory db and store the connection in a variable
con <- dbConnect(duckdb::duckdb())

# A2. read parquet files and register them as tables to the database
tables <- c('jobs', 'skills', 'job_skills', 'companies')
for (t in tables) {
  parquet_file <- paste('datasets/', t, '.parquet', sep = "")
  duckdb_register(con, t, read_parquet(parquet_file))
}
# show what we loaded
glimpse(tbl(con, "jobs"))
## Rows: ??
## Columns: 8
## Database: DuckDB v1.1.1 [steve@Windows 10 x64:R 4.3.3/:memory:]
## $ job_id <chr> "009e71ae0eaad02dbf565187a0ba5357", "9dd473793b10ad4df…
## $ job_title <chr> "Senior Machine Learning Engineer", "Principal Softwar…
## $ job_summary <chr> "Company Description\r\nJobs for Humanity is partnerin…
## $ job_location <chr> "New Haven, CT", "San Francisco, CA", "New York, NY", …
## $ search_position <chr> "Agricultural-Research Engineer", "Set-Key Driver", "T…
## $ job_level <chr> "Mid senior", "Mid senior", "Associate", "Mid senior",…
## $ job_type <chr> "Onsite", "Onsite", "Onsite", "Onsite", "Onsite", "Ons…
## $ company_id <int> 1, 2, 3, 4, 5, 6, 1, 1, 7, 8, 9, 10, 11, 5, 12, 13, 14…
glimpse(tbl(con, "skills"))
## Rows: ??
## Columns: 2
## Database: DuckDB v1.1.1 [steve@Windows 10 x64:R 4.3.3/:memory:]
## $ skill_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ skill_name <chr> "Machine Learning", " Programming", " Python", " Scala", " …
glimpse(tbl(con, "job_skills"))
## Rows: ??
## Columns: 2
## Database: DuckDB v1.1.1 [steve@Windows 10 x64:R 4.3.3/:memory:]
## $ job_id <chr> "009e71ae0eaad02dbf565187a0ba5357", "009e71ae0eaad02dbf565187…
## $ skill_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
glimpse(tbl(con, "companies"))
## Rows: ??
## Columns: 2
## Database: DuckDB v1.1.1 [steve@Windows 10 x64:R 4.3.3/:memory:]
## $ company_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ company <chr> "Jobs for Humanity", "Aurora", "Adame Services LLC", "Morph…
# A3. join the tables with SQL and save as dataframe
full_df <- dbGetQuery(con,"WITH skills_by_id AS (
SELECT job_id,
STRING_AGG (skill_name, ',' ) AS skills
FROM job_skills AS x
LEFT JOIN skills AS s
ON s.skill_id = x.skill_id
GROUP BY job_id
)
SELECT j.*, c.company, s.skills
FROM jobs AS j
LEFT JOIN skills_by_id AS s
ON s.job_id = j.job_id
LEFT JOIN companies AS c
ON c.company_id = j.company_id
")
glimpse(full_df)
## Rows: 12,217
## Columns: 10
## $ job_id <chr> "f7cb39b176e8222147322790e588cae9", "64d338df9a9e9dbf6…
## $ job_title <chr> "Lead Data Engineer", "Manager, Cyber Risk & Analysis …
## $ job_summary <chr> "Dice is the leading career destination for tech exper…
## $ job_location <chr> "Plano, TX", "Boston, MA", "Lake Buena Vista, FL", "Su…
## $ search_position <chr> "Maintenance Data Analyst", "Manager Reports Analysis"…
## $ job_level <chr> "Mid senior", "Mid senior", "Mid senior", "Mid senior"…
## $ job_type <chr> "Onsite", "Onsite", "Onsite", "Onsite", "Onsite", "Ons…
## $ company_id <int> 5, 1, 7, 8, 14, 16, 18, 27, 12, 30, 31, 34, 49, 51, 55…
## $ company <chr> "Dice", "Jobs for Humanity", "The Walt Disney Company"…
## $ skills <chr> "Java, Scala, Python, RDBMS, NoSQL, Redshift, Snowflak…
With our dataframe loaded, we moved on to cleaning and tidying the data. First, we standardized the column names by lowercasing them and replacing spaces with “_” to convert them to snake case. We then removed any leftover duplicate rows using the distinct function. Finally, we replaced any NA values with “Unknown”.
# B1. Standardize Column Names
full_df <- full_df %>%
  rename_all(~str_to_lower(.)) %>%            # Convert all column names to lowercase
  rename_all(~str_replace_all(., " ", "_"))   # Replace spaces with underscores

# B2. Remove Duplicates
full_df <- full_df %>%
  distinct()   # Keep only unique rows

# B3. Handle Missing Values
# Replace NA values with 'Unknown' in character columns
full_df <- full_df %>%
  mutate(across(where(is.character), ~replace_na(., "Unknown")))

# Display the cleaned data frame
glimpse(full_df)
## Rows: 12,217
## Columns: 10
## $ job_id <chr> "f7cb39b176e8222147322790e588cae9", "64d338df9a9e9dbf6…
## $ job_title <chr> "Lead Data Engineer", "Manager, Cyber Risk & Analysis …
## $ job_summary <chr> "Dice is the leading career destination for tech exper…
## $ job_location <chr> "Plano, TX", "Boston, MA", "Lake Buena Vista, FL", "Su…
## $ search_position <chr> "Maintenance Data Analyst", "Manager Reports Analysis"…
## $ job_level <chr> "Mid senior", "Mid senior", "Mid senior", "Mid senior"…
## $ job_type <chr> "Onsite", "Onsite", "Onsite", "Onsite", "Onsite", "Ons…
## $ company_id <int> 5, 1, 7, 8, 14, 16, 18, 27, 12, 30, 31, 34, 49, 51, 55…
## $ company <chr> "Dice", "Jobs for Humanity", "The Walt Disney Company"…
## $ skills <chr> "Java, Scala, Python, RDBMS, NoSQL, Redshift, Snowflak…
Our dataframe had multiple skills in a single column. We used separate_rows to put each skill onto its own row, trimmed the surrounding whitespace, and used count to create a column showing the number of times (frequency) each skill was mentioned.
# Assuming 'full_df' contains job postings and skill information
# C1. Separate the skills column back into individual rows, renaming it to "skill_name":
skill_counts <- full_df %>%
  separate_rows(skills, sep = ",") %>%            # Split skills by commas into individual rows
  rename(skill_name = skills) %>%                 # Rename the column to "skill_name"
  mutate(skill_name = str_trim(skill_name)) %>%   # Remove extra spaces around skill names
  count(skill_name, sort = TRUE) %>%              # Count occurrences of each skill and sort by frequency
  rename(frequency = n)                           # Rename count column to 'frequency'

# C2. Display the top skills
head(skill_counts, 10)   # Show the top 10 skills by frequency
## # A tibble: 10 × 2
## skill_name frequency
## <chr> <int>
## 1 Python 4801
## 2 SQL 4606
## 3 Communication 2498
## 4 Data Analysis 2181
## 5 Machine Learning 1966
## 6 AWS 1740
## 7 Tableau 1685
## 8 Data Visualization 1562
## 9 R 1542
## 10 Java 1414
We used our frequencies table to create a plot showing the top 10 skills.
ggplot(skill_counts %>% head(10),
       aes(x = reorder(skill_name, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'green') +
  labs(title = "Top 10 Skills in Data Science Job Postings",
       x = "Skill",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Diving deeper into the analysis, we can check whether the top 10 skills are influenced by the job level. Our data contains two values for the job_level field, ‘Mid senior’ and ‘Associate.’ We take the same approach we did for the skill_counts data frame, making sure to include job_level in the output. Once we create the skill_by_level data frame, we use it to merge two subsets, one filtered for each job level, combined by row name. Doing so produces the table seen below.
# Skill frequency by job level
skill_by_level <- full_df %>%
  separate_rows(skills, sep = ",") %>%
  rename(skill_name = skills) %>%
  mutate(skill_name = str_trim(skill_name)) %>%
  group_by(skill_name, job_level) %>%
  count(job_level, sort = TRUE) %>%
  rename(frequency = n)

# Comparison of top 10 skills based on job level
job_level_comparison <- merge(
  skill_by_level %>%
    filter(job_level == 'Mid senior') %>%
    head(10),
  skill_by_level %>%
    filter(job_level == 'Associate') %>%
    head(10),
  by = 'row.names'
) %>%
  arrange(desc(frequency.x)) %>%
  rename(top_mid_senior_skill = skill_name.x,
         top_associate_skill = skill_name.y) %>%
  select(2, 5)
job_level_comparison
## top_mid_senior_skill top_associate_skill
## 1 Python SQL
## 2 SQL Python
## 3 Communication Communication
## 4 Data Analysis Data Analysis
## 5 Machine Learning Tableau
## 6 AWS Data Visualization
## 7 Tableau Machine Learning
## 8 Data Visualization R
## 9 R Teamwork
## 10 Spark Data analysis
mid_senior_plot <- ggplot(skill_by_level %>%
                            filter(job_level == 'Mid senior') %>%
                            head(10),
                          aes(x = reorder(skill_name, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'purple') +
  labs(title = "Top 10 Skills in Data Science Job Postings by Job Level",
       x = "Mid Senior Skill",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

associate_plot <- ggplot(skill_by_level %>%
                           filter(job_level == 'Associate') %>%
                           head(10),
                         aes(x = reorder(skill_name, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'blue') +
  labs(x = "Associate Skill",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

mid_senior_plot + associate_plot
The above data frame and plot illustrate the slight differences in skills between ‘Mid senior’ and ‘Associate’ level positions. Below are a few more interesting statistics regarding the job titles, companies, and regions within our data set. The first tabulates the top job titles in our data set, which are mostly ‘Mid senior’ engineering and analytics positions. The next gives a glimpse into which companies have the most job postings; a bit of research shows that the majority of these are recruiting companies. Last, but not least, we have a distribution of the regions with the most job postings. Of the 10 listed, the majority fall on the East Coast of the United States, aside from California, Texas, Illinois, and the UK.
# Distribution of top job titles, companies, and regions
job_title_distribution <- full_df %>%
  count(job_title, sort = TRUE) %>%
  head(10) %>%   # Top 10 job titles
  rename(frequency = n)

company_distribution <- full_df %>%
  count(company, sort = TRUE) %>%
  head(10) %>%   # Top 10 companies
  rename(frequency = n)

regional_distribution <- full_df %>%
  # drop everything up to the first comma, keeping the remainder (state/country) as the region
  mutate(region = str_replace(job_location, '(.*?),(.*?)', '\\2')) %>%
  count(region, sort = TRUE) %>%
  head(10) %>%   # Top 10 regions
  rename(frequency = n)
job_title_distribution
## job_title frequency
## 1 Senior Data Engineer 285
## 2 Senior Data Analyst 163
## 3 Data Engineer 149
## 4 Senior MLOps Engineer 138
## 5 Data Analyst 137
## 6 Data Scientist 128
## 7 Lead Data Engineer 123
## 8 Senior Data Scientist 119
## 9 Data Architect 111
## 10 Staff Machine Learning Engineer, Series A 101
company_distribution
## company frequency
## 1 Jobs for Humanity 732
## 2 Recruiting from Scratch 390
## 3 Dice 193
## 4 Agoda 172
## 5 ClearanceJobs 166
## 6 ClickJobs.io 152
## 7 Capital One 92
## 8 Energy Jobline 73
## 9 Amazon Web Services (AWS) 72
## 10 Deloitte 67
regional_distribution
## region frequency
## 1 CA 1299
## 2 TX 861
## 3 England, United Kingdom 853
## 4 VA 687
## 5 NY 550
## 6 IL 445
## 7 NJ 415
## 8 PA 401
## 9 FL 391
## 10 MA 389
# Plots
job_plot <- ggplot(job_title_distribution,
                   aes(x = reorder(job_title, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'red') +
  labs(title = "Top 10 Job Titles, Companies, and Regions in Data Science Job Postings",
       x = "Job Title",
       y = "Frequency") +
  guides(fill = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

company_plot <- ggplot(company_distribution,
                       aes(x = reorder(company, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'orange') +
  labs(x = "Company",
       y = "") +
  guides(fill = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

region_plot <- ggplot(regional_distribution,
                      aes(x = reorder(region, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'yellow') +
  labs(x = "Region",
       y = "") +
  guides(fill = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

job_plot + company_plot + region_plot
This project relied on a pre-assembled web scrape of Indeed job listings for “Data Science” positions compiled in January 2024. It represents a snapshot of the skills employers were searching for on a single job listing site at a particular time, and may not be fully representative of the most important data skills overall. We would need current data to check whether the top skills we identified remain important today, or data from additional sources to cross-check our findings.
While we attempted to scrape the Indeed site as part of this work, we ran into authorization issues that prevented us from recreating a current dataset. However, we were able to scrape a dataset from a UK-based jobs engine, which could be used to compare US and UK top skills in a future version of this project.
* Our team made an attempt to scrape current data from Indeed, but we were unable to replicate the procedure. For demonstrative purposes, we were able to [scrape data](https://github.com/mcastro64/D607_Project3/blob/bcdc0748160cd2e9abca7ece7aff03f774bfda12/Team5-Project3-scraper.Rmd) from another source.