For Project 3, our team was asked to answer the question, "Which are the most valued data science skills?" To answer it, we decided to explore the top skills that companies search for in a job candidate. We sourced our data from Indeed, a popular job listing site, obtaining a dataset of over 12,000 US data science jobs scraped from Indeed in January 2024*.
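Throughout, we rely on the following packages. The original setup chunk is not displayed, so this list is inferred from the functions used in the code below:
# packages assumed loaded in a setup chunk
library(tidyverse)  # readr, dplyr, tidyr, stringr, purrr, ggplot2
library(digest)     # md5 hashing for job_id
library(arrow)      # read_parquet() / write_parquet()
library(duckdb)     # embedded database; attaches DBI for dbConnect()/dbGetQuery()
library(patchwork)  # combining ggplot objects with +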
We began by normalizing the dataset to match our proposed Entity Relationship Diagram (ERD) shown below:
After inspecting the Indeed dataset files, we noticed that while the files did not have an id column, each job listing, job summary, and set of job skills could be identified by the listing's URL (the job_link field). Because this field is verbose, we decided to create a new id for the job listings while retaining the existing unique relationships by deriving the id from the link. We originally planned to use the 10-digit sequence at the end of the slug, but we found 23 instances where the sequence was repeated for related but distinct jobs at the same company. Instead, we wrote a custom function, called when preparing each of the dataframes, that extracts the page slug and hashes it with digest.
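A quick check along these lines is how one might surface those repeated sequences; the regular expression is an assumption about where the sequence sits in the URL:
# hypothetical check for repeated trailing sequences in the job links
read_csv("source_files/job_postings.csv") |>
  mutate(seq_id = str_extract(job_link, "\\d{10}$")) |>  # assumes the 10-digit sequence ends the URL
  count(seq_id, sort = TRUE) |>
  filter(n > 1)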
Next, we imported the job summary, job listing, and job skills files one at a time and transformed them to match our Entity Relationship Diagram.
# function to create a primary key column from the url
convert_key <- function(df) {
  df |>
    mutate(
      job_id = str_extract(job_link, "\\/[^\\/]*$"),
      job_id = map_chr(job_id, digest, algo = "md5", seed = 1234)
    ) |>
    relocate(job_id)
}
# read in job summary csv
orig_job_summary <- read_csv("source_files/job_summary.csv") |>
  convert_key() |>
  subset(select = -c(job_link))

# read in job postings csv
orig_job_postings <- read_csv("source_files/job_postings.csv") |>
  convert_key()

# subset companies and add col id
companies_df <- subset(orig_job_postings, select = c(company)) |>
  distinct(company) |>
  mutate(company_id = row_number()) |>
  relocate(company_id, .before = "company")

# add job description and subset
jobs_df <- orig_job_postings |>
  left_join(orig_job_summary, join_by(job_id == job_id)) |>  # add Job Description
  subset(select = c(job_id, job_title, job_summary, job_location, search_position, job_level, job_type, company)) |>
  left_join(companies_df, join_by(company == company)) |>    # add Job Company field
  subset(select = -c(company))
To create a “skills” dataframe and a “job_skills” joiner dataframe, we read in the job skills file, transformed the url field into a hash to use as the foreign key connecting to the job record, subset to just the job_id and job_skills fields, and used separate_longer_delim to break the job_skills field into separate rows on the comma delimiter. We will update this dataframe later to replace each skill listed in the skill_name column with its unique id.
To create the “skills” dataframe, we first copy the “job_skills” dataframe, use distinct to remove the duplicates, then add a row number to use as the skill_id field. This leaves a table of unique skills with their respective ids.
As a final step, we use left_join to join the “skills” dataframe to the “job_skills” dataframe by the skill_name field. Dropping skill_name leaves two columns: job_id, which serves as a foreign key to join with our “jobs” dataframe or SQL table, and skill_id, which serves as a foreign key to join with our “skills” dataframe or SQL table.
# read in job skills csv and rename the skills column
job_skills_dirty <- read_csv("source_files/job_skills.csv") |>
  convert_key() |>
  subset(select = c(job_id, job_skills)) |>
  rename(skill_name = job_skills)

# create joiner table: one row per job/skill pair
job_skills_distinct <- job_skills_dirty |>
  separate_longer_delim(cols = c(skill_name), delim = ",")

# make distinct list of unique skills and give them their own id
skills_distinct <- job_skills_distinct |>
  distinct(skill_name) |>
  mutate(skill_id = row_number()) |>
  relocate(skill_id, .before = skill_name)

# update joiner to bring in the unique skill_id and drop the skill name
job_skills_distinct <- job_skills_distinct |>
  left_join(skills_distinct, join_by(skill_name == skill_name)) |>
  subset(select = -c(skill_name))
# write parquet files
write_parquet(jobs_df, "datasets/jobs.parquet")
write_parquet(companies_df, "datasets/companies.parquet")
write_parquet(skills_distinct, "datasets/skills.parquet")
write_parquet(job_skills_distinct, "datasets/job_skills.parquet")
We used DuckDB, a popular open-source embeddable database, to manage our relational data. We chose DuckDB as a way to try working with a different type of database within R and for its portability (though we don't recommend storing sensitive data in a public repository, as we have for this exercise).
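That portability comes from the fact that a DuckDB database can live in a single file. We kept ours in memory, but a persistent variant might look like the sketch below; the file path and dbWriteTable calls are illustrative, not what we ran:
# hypothetical on-disk variant: persist the normalized tables to a single .duckdb file
con_file <- dbConnect(duckdb::duckdb(), dbdir = "datasets/project3.duckdb")
dbWriteTable(con_file, "jobs", jobs_df, overwrite = TRUE)
dbWriteTable(con_file, "companies", companies_df, overwrite = TRUE)
dbWriteTable(con_file, "skills", skills_distinct, overwrite = TRUE)
dbWriteTable(con_file, "job_skills", job_skills_distinct, overwrite = TRUE)
dbDisconnect(con_file, shutdown = TRUE)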
After loading the duckdb package, we initialized DuckDB by establishing a connection (#A1) and registering our pre-normalized parquet files as tables (#A2). Next, we used SQL to join the tables into a single dataframe (#A3).
# A1. make an in-memory db and store the connection in a variable
con <- dbConnect(duckdb::duckdb())

# A2. read parquet files and register them as tables to the database
tables <- c('jobs', 'skills', 'job_skills', 'companies')
for (t in tables) {
  parquet_file <- paste('datasets/', t, '.parquet', sep = "")
  duckdb_register(con, t, read_parquet(parquet_file))
}
# show what we loaded
glimpse(tbl(con, "jobs"))
## Rows: ??
## Columns: 8
## Database: DuckDB v1.1.1 [steve@Windows 10 x64:R 4.3.3/:memory:]
## $ job_id <chr> "009e71ae0eaad02dbf565187a0ba5357", "9dd473793b10ad4df…
## $ job_title <chr> "Senior Machine Learning Engineer", "Principal Softwar…
## $ job_summary <chr> "Company Description\r\nJobs for Humanity is partnerin…
## $ job_location <chr> "New Haven, CT", "San Francisco, CA", "New York, NY", …
## $ search_position <chr> "Agricultural-Research Engineer", "Set-Key Driver", "T…
## $ job_level <chr> "Mid senior", "Mid senior", "Associate", "Mid senior",…
## $ job_type <chr> "Onsite", "Onsite", "Onsite", "Onsite", "Onsite", "Ons…
## $ company_id <int> 1, 2, 3, 4, 5, 6, 1, 1, 7, 8, 9, 10, 11, 5, 12, 13, 14…
glimpse(tbl(con, "skills"))
## Rows: ??
## Columns: 2
## Database: DuckDB v1.1.1 [steve@Windows 10 x64:R 4.3.3/:memory:]
## $ skill_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ skill_name <chr> "Machine Learning", " Programming", " Python", " Scala", " …
glimpse(tbl(con, "job_skills"))
## Rows: ??
## Columns: 2
## Database: DuckDB v1.1.1 [steve@Windows 10 x64:R 4.3.3/:memory:]
## $ job_id <chr> "009e71ae0eaad02dbf565187a0ba5357", "009e71ae0eaad02dbf565187…
## $ skill_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
glimpse(tbl(con, "companies"))
## Rows: ??
## Columns: 2
## Database: DuckDB v1.1.1 [steve@Windows 10 x64:R 4.3.3/:memory:]
## $ company_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ company <chr> "Jobs for Humanity", "Aurora", "Adame Services LLC", "Morph…
# A3. join the tables with SQL and save as dataframe
full_df <- dbGetQuery(con,"WITH skills_by_id AS (
SELECT job_id,
STRING_AGG (skill_name, ',' ) AS skills
FROM job_skills AS x
LEFT JOIN skills AS s
ON s.skill_id = x.skill_id
GROUP BY job_id
)
SELECT j.*, c.company, s.skills
FROM jobs AS j
LEFT JOIN skills_by_id AS s
ON s.job_id = j.job_id
LEFT JOIN companies AS c
ON c.company_id = j.company_id
")
glimpse(full_df)
## Rows: 12,217
## Columns: 10
## $ job_id <chr> "f7cb39b176e8222147322790e588cae9", "64d338df9a9e9dbf6…
## $ job_title <chr> "Lead Data Engineer", "Manager, Cyber Risk & Analysis …
## $ job_summary <chr> "Dice is the leading career destination for tech exper…
## $ job_location <chr> "Plano, TX", "Boston, MA", "Lake Buena Vista, FL", "Su…
## $ search_position <chr> "Maintenance Data Analyst", "Manager Reports Analysis"…
## $ job_level <chr> "Mid senior", "Mid senior", "Mid senior", "Mid senior"…
## $ job_type <chr> "Onsite", "Onsite", "Onsite", "Onsite", "Onsite", "Ons…
## $ company_id <int> 5, 1, 7, 8, 14, 16, 18, 27, 12, 30, 31, 34, 49, 51, 55…
## $ company <chr> "Dice", "Jobs for Humanity", "The Walt Disney Company"…
## $ skills <chr> "Java, Scala, Python, RDBMS, NoSQL, Redshift, Snowflak…
With our dataframe loaded, we moved on to cleaning and tidying the data. First, we standardized the column names by lowercasing them and replacing spaces with “_” to convert them to snake case. We then removed any leftover duplicate rows using the distinct function. Finally, we replaced any NA values with “Unknown”.
# B1. Standardize Column Names
full_df <- full_df %>%
  rename_all(~str_to_lower(.)) %>%            # Convert all column names to lowercase
  rename_all(~str_replace_all(., " ", "_"))   # Replace spaces with underscores

# B2. Remove Duplicates
full_df <- full_df %>%
  distinct()   # Keep only unique rows

# B3. Handle Missing Values
# Replace NA values with 'Unknown' in character columns
full_df <- full_df %>%
  mutate(across(where(is.character), ~replace_na(., "Unknown")))

# Display the cleaned data frame
glimpse(full_df)
## Rows: 12,217
## Columns: 10
## $ job_id <chr> "f7cb39b176e8222147322790e588cae9", "64d338df9a9e9dbf6…
## $ job_title <chr> "Lead Data Engineer", "Manager, Cyber Risk & Analysis …
## $ job_summary <chr> "Dice is the leading career destination for tech exper…
## $ job_location <chr> "Plano, TX", "Boston, MA", "Lake Buena Vista, FL", "Su…
## $ search_position <chr> "Maintenance Data Analyst", "Manager Reports Analysis"…
## $ job_level <chr> "Mid senior", "Mid senior", "Mid senior", "Mid senior"…
## $ job_type <chr> "Onsite", "Onsite", "Onsite", "Onsite", "Onsite", "Ons…
## $ company_id <int> 5, 1, 7, 8, 14, 16, 18, 27, 12, 30, 31, 34, 49, 51, 55…
## $ company <chr> "Dice", "Jobs for Humanity", "The Walt Disney Company"…
## $ skills <chr> "Java, Scala, Python, RDBMS, NoSQL, Redshift, Snowflak…
Our dataframe had multiple skills in a single column. We used separate_rows to put each skill onto its own row, trimmed the surrounding whitespace, and used count to create a column showing the number of times (frequency) each skill was mentioned.
# Assuming 'full_df' contains job postings and skill information
# C1. Separate the skills column back into individual rows, renaming it to "skill_name":
skill_counts <- full_df %>%
  separate_rows(skills, sep = ",") %>%            # Split skills by commas into individual rows
  rename(skill_name = skills) %>%                 # Rename the column to "skill_name"
  mutate(skill_name = str_trim(skill_name)) %>%   # Remove extra spaces around skill names
  count(skill_name, sort = TRUE) %>%              # Count occurrences of each skill and sort by frequency
  rename(frequency = n)                           # Rename count column to 'frequency'

# C2. Display the top skills
head(skill_counts, 10)   # Show the top 10 skills by frequency
## # A tibble: 10 × 2
## skill_name frequency
## <chr> <int>
## 1 Python 4801
## 2 SQL 4606
## 3 Communication 2498
## 4 Data Analysis 2181
## 5 Machine Learning 1966
## 6 AWS 1740
## 7 Tableau 1685
## 8 Data Visualization 1562
## 9 R 1542
## 10 Java 1414
We used our frequencies table to create a plot showing the top 10 skills.
ggplot(skill_counts %>% head(10),
       aes(x = reorder(skill_name, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'green') +
  labs(title = "Top 10 Skills in Data Science Job Postings",
       x = "Skill",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Diving deeper into the analysis, we can check whether the top 10 skills are influenced by the job level. Our data contains two values for the job_level field, ‘Mid senior’ and ‘Associate.’ We take the same approach we did for the skill_counts data frame, making sure to include job_level in the output. Once we create the skill_by_level data frame, we use it to merge two subsets, one filtered for each job level, combined by row name. Doing so produces the table seen below.
# Skill frequency by job level
skill_by_level <- full_df %>%
  separate_rows(skills, sep = ",") %>%
  rename(skill_name = skills) %>%
  mutate(skill_name = str_trim(skill_name)) %>%
  group_by(skill_name, job_level) %>%
  count(job_level, sort = TRUE) %>%
  rename(frequency = n)

# Comparison of top 10 skills based on job level
job_level_comparison <- merge(
  skill_by_level %>%
    filter(job_level == 'Mid senior') %>%
    head(10),
  skill_by_level %>%
    filter(job_level == 'Associate') %>%
    head(10),
  by = 'row.names'
) %>%
  arrange(desc(frequency.x)) %>%
  rename(top_mid_senior_skill = skill_name.x,
         top_associate_skill = skill_name.y) %>%
  select(2, 5)
job_level_comparison
## top_mid_senior_skill top_associate_skill
## 1 Python SQL
## 2 SQL Python
## 3 Communication Communication
## 4 Data Analysis Data Analysis
## 5 Machine Learning Tableau
## 6 AWS Data Visualization
## 7 Tableau Machine Learning
## 8 Data Visualization R
## 9 R Teamwork
## 10 Spark Data analysis
mid_senior_plot <- ggplot(skill_by_level %>%
                            filter(job_level == 'Mid senior') %>%
                            head(10),
                          aes(x = reorder(skill_name, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'purple') +
  labs(title = "Top 10 Skills in Data Science Job Postings by Job Level",
       x = "Mid Senior Skill",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

associate_plot <- ggplot(skill_by_level %>%
                           filter(job_level == 'Associate') %>%
                           head(10),
                         aes(x = reorder(skill_name, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'blue') +
  labs(x = "Associate Skill",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

mid_senior_plot + associate_plot
The above data frame and plot illustrate the slight differences in skills between ‘Mid senior’ and ‘Associate’ level positions. Below are a few more interesting statistics regarding the job titles, companies, and regions within our data set. The first tabulates the top job titles in our data set, which are mostly ‘Mid senior’ engineering and analytics positions. The next gives a glimpse into which companies have the most job postings; a bit of research shows that the majority of these are recruiting companies. Last, but not least, we have a distribution of the regions with the most job postings. Of the 10 listed, the majority fall on the East Coast of the United States, aside from California, Texas, Illinois, and the UK.
# Distribution of top job titles, companies, and regions
job_title_distribution <- full_df %>%
  count(job_title, sort = TRUE) %>%
  head(10) %>%   # Top 10 job titles
  rename(frequency = n)

company_distribution <- full_df %>%
  count(company, sort = TRUE) %>%
  head(10) %>%   # Top 10 companies
  rename(frequency = n)

regional_distribution <- full_df %>%
  # drop everything up to the first comma, keeping the remainder (state/country) as the region
  mutate(region = str_replace(job_location, '(.*?),(.*?)', '\\2')) %>%
  count(region, sort = TRUE) %>%
  head(10) %>%   # Top 10 regions
  rename(frequency = n)
job_title_distribution
## job_title frequency
## 1 Senior Data Engineer 285
## 2 Senior Data Analyst 163
## 3 Data Engineer 149
## 4 Senior MLOps Engineer 138
## 5 Data Analyst 137
## 6 Data Scientist 128
## 7 Lead Data Engineer 123
## 8 Senior Data Scientist 119
## 9 Data Architect 111
## 10 Staff Machine Learning Engineer, Series A 101
company_distribution
## company frequency
## 1 Jobs for Humanity 732
## 2 Recruiting from Scratch 390
## 3 Dice 193
## 4 Agoda 172
## 5 ClearanceJobs 166
## 6 ClickJobs.io 152
## 7 Capital One 92
## 8 Energy Jobline 73
## 9 Amazon Web Services (AWS) 72
## 10 Deloitte 67
regional_distribution
## region frequency
## 1 CA 1299
## 2 TX 861
## 3 England, United Kingdom 853
## 4 VA 687
## 5 NY 550
## 6 IL 445
## 7 NJ 415
## 8 PA 401
## 9 FL 391
## 10 MA 389
# Plots
job_plot <- ggplot(job_title_distribution,
                   aes(x = reorder(job_title, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'red') +
  labs(title = "Top 10 Job Titles, Companies, and Regions in Data Science Job Postings",
       x = "Job Title",
       y = "Frequency") +
  guides(fill = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

company_plot <- ggplot(company_distribution,
                       aes(x = reorder(company, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'orange') +
  labs(x = "Company",
       y = "") +
  guides(fill = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

region_plot <- ggplot(regional_distribution,
                      aes(x = reorder(region, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = 'yellow') +
  labs(x = "Region",
       y = "") +
  guides(fill = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

job_plot + company_plot + region_plot
This project relied on a pre-assembled web scrape of Indeed job listings for “Data Science” positions compiled in January 2024. It represents a snapshot of the skills employers were searching for on a single job listing site at a particular time, and may not be fully representative of the most important data skills overall. We would need current data to check whether the top skills we identified remain important today, or data from additional sources to cross-check our findings.
While we attempted to scrape the Indeed site as part of this work, we ran into authorization issues that prevented us from recreating a current dataset. However, we were able to scrape a dataset from a UK-based jobs engine, which could be used to compare US and UK top skills in a future version of this project.
* Our team made an attempt to scrape current data from Indeed, but we were unable to replicate the procedure. For demonstrative purposes, we were able to [scrape data](https://github.com/mcastro64/D607_Project3/blob/bcdc0748160cd2e9abca7ece7aff03f774bfda12/Team5-Project3-scraper.Rmd) from another source.