1 Objective

Answer the question: Which are the most valued data science skills?

2 Motivation

The primary goal of this project is to explore and identify the most valued skills in the field of Data Science, using real job market data.

We selected the “Data Science Job Postings & Skills (2024)” dataset from Kaggle because it provides a rich collection of LinkedIn job postings, including detailed job titles, locations, and required skills. This dataset allows us to examine patterns in the job market and derive insights about which technical and analytical competencies employers value most.

This project also serves as a practical exercise in applying key principles of data acquisition, cleaning, management, and normalization. In addition, it demonstrates the team’s ability to collaborate effectively in a remote setting, leveraging shared tools and workflows (GitHub, Google Docs, Slack, and RStudio).

3 Approach

We use the Kaggle dataset Data Science Job Postings & Skills (2024). With it, we can trim the job listings down to only those most representative of data science roles, then investigate which skills were most commonly required for those jobs.

4 Reading the Data from the GitHub Repo

# Packages used throughout this analysis
library(tidyverse)    # dplyr, tidyr, stringr, ggplot2
library(rstudioapi)   # getSourceEditorContext()
library(DBI)          # dbConnect(), dbWriteTable(), dbGetQuery()
library(RSQLite)      # SQLite() driver
library(DiagrammeR)   # grViz() for the ER diagram

projPath <- dirname(getSourceEditorContext()$path)   # getting where the .Rmd is located

job_postings_raw <- read.csv("https://raw.githubusercontent.com/cdube89128/DATA_607_Project_3/refs/heads/main/data/job_postings.csv")   
job_skills_raw <- read.csv("https://raw.githubusercontent.com/cdube89128/DATA_607_Project_3/refs/heads/main/data/job_skills.csv")
job_summary_raw <- read.csv("https://media.githubusercontent.com/media/cdube89128/DATA_607_Project_3/refs/heads/main/data/job_summary.csv")
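
As a quick sanity check (a minimal sketch; output omitted here), we can confirm the row counts of the three raw tables and that they all carry the job_link key used for joins later:

# Row counts of the raw tables, plus a spot check on the shared join key
nrow(job_postings_raw); nrow(job_skills_raw); nrow(job_summary_raw)
sapply(list(job_postings_raw, job_skills_raw, job_summary_raw),
       function(d) "job_link" %in% names(d))   # should be TRUE for all three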

5 Cleaning the data to prepare for SQLite

# Getting just jobs, job names, skills
job_postings_with_skills <- merge(
    x = job_postings_raw, y = job_skills_raw,
    by = "job_link"
  ) %>% 
  select(job_link, job_title, job_skills)

# Cleaning up the skills
job_skills_long_normalized <- job_postings_with_skills %>%
  select(job_link, job_skills) %>%
  mutate(job_skills = str_split(job_skills, ",")) %>%     # Split skills by comma
  unnest(job_skills) %>%
  mutate(job_skills = str_trim(job_skills))

# job_skills_long_normalized will be written to our database as job_skills_long
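
To make the reshape concrete, here is a toy example (hypothetical input, not a row from the dataset): one posting whose skills are stored as a comma-separated string becomes one row per skill.

# "Python, SQL, AWS" in one row becomes three (job_link, job_skills) rows
toy <- tibble(job_link = "job-1", job_skills = "Python, SQL, AWS")
toy %>%
  mutate(job_skills = str_split(job_skills, ",")) %>%
  unnest(job_skills) %>%
  mutate(job_skills = str_trim(job_skills))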

6 Making the SQLite Database

# Create a new SQLite database 
db_file <- "jobs_database.sqlite"
con <- dbConnect(SQLite(), dbname = db_file)

# Write each dataframe to the database as a table
dbWriteTable(con, "job_skills_long", job_skills_long_normalized, overwrite = TRUE)
dbWriteTable(con, "job_postings", job_postings_raw, overwrite = TRUE)
dbWriteTable(con, "job_summary", job_summary_raw, overwrite = TRUE)

These are the tables in our SQLite database: job_postings, job_skills_long, job_summary.

We retain SQLite as the data “Source of Truth” and pull the data from here going forward.

job_postings <- dbReadTable(con, "job_postings")
job_summary <- dbReadTable(con, "job_summary")
job_skills_long <- dbReadTable(con, "job_skills_long")
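
Because the tables now live in SQLite, summaries can also be computed directly in SQL against the connection (before disconnecting). Here is a minimal sketch of the same top-skill count we build later in R:

# Top 5 skills by number of distinct postings, computed in SQL
dbGetQuery(con, "
  SELECT job_skills, COUNT(DISTINCT job_link) AS jobs_with_skill
  FROM job_skills_long
  GROUP BY job_skills
  ORDER BY jobs_with_skill DESC
  LIMIT 5
")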

Disconnect from the SQLite database at the end.

# Disconnect from the database when done
dbDisconnect(con)

7 Let’s trim (and a bit of exploratory analysis)

Trimming this down to a shorter list of job titles that we think are more relevant to the question.

filtered_jobs <- job_postings %>%
  filter(
    str_detect(job_title, regex("data.*science", ignore_case = TRUE)) |
    str_detect(job_title, regex("analyst", ignore_case = TRUE)) |
    str_detect(job_title, regex("database", ignore_case = TRUE)) |
    str_detect(job_title, regex("data.*engineer", ignore_case = TRUE)) |
    str_detect(job_title, regex("machine.*learning", ignore_case = TRUE))
  )

# What percent of the original dataset is left?
trimmed_pct <- round(nrow(filtered_jobs) / nrow(job_postings)*100, 2)

# Overwrite job_postings with the trimmed dataset
job_postings <- filtered_jobs

We have trimmed down our initial dataset, and now we are working with only 53.31% of the original job listings. The filter kept listings whose titles referenced data science, analyst, database, data engineering, or machine learning. This subset is more representative of the data science jobs we are interested in.
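
To see how the trimmed set breaks down across the title patterns used in the filter, here is a quick sketch (the patterns can overlap, so these counts need not sum to the number of rows):

# Count matches per title pattern within the trimmed set
patterns <- c(data_science     = "data.*science",
              analyst          = "analyst",
              database         = "database",
              data_engineer    = "data.*engineer",
              machine_learning = "machine.*learning")
sapply(patterns, function(p)
  sum(str_detect(filtered_jobs$job_title, regex(p, ignore_case = TRUE))))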

# Joining our trimmed job_postings with job_skills_long
job_postings_with_skills <- job_skills_long %>%
  inner_join(job_postings, by = "job_link")

# Count how many jobs mention each skill
skill_counts <- job_postings_with_skills %>%
  group_by(job_skills) %>%
  summarise(
    jobs_with_skill = n_distinct(job_link),
    .groups = "drop"
  ) %>%
  mutate(
    percent_jobs = (jobs_with_skill / n_distinct(job_postings_with_skills$job_link)) * 100
  ) %>%
  arrange(desc(jobs_with_skill)) %>%
  slice_head(n = 25)  # Top 25 skills for readability

#head(skill_counts, 25)

Here are some exploratory summary stats about the skills these jobs are looking for.

# Brief Exploratory Analysis
summary(skill_counts)
##   job_skills        jobs_with_skill  percent_jobs   
##  Length:25          Min.   : 587    Min.   : 9.013  
##  Class :character   1st Qu.: 645    1st Qu.: 9.903  
##  Mode  :character   Median : 838    Median :12.867  
##                     Mean   :1045    Mean   :16.047  
##                     3rd Qu.:1136    3rd Qu.:17.442  
##                     Max.   :3212    Max.   :49.317

Above, we see that among these top 25 skills, the median number of jobs referencing a skill is 838, and the mean is 1,045. If we plucked a skill from this list at random, we might expect it to be referenced in roughly 13% of the jobs in this set.

Here is a peek into the most common job titles in our trimmed data set.

# Brief Exploratory Analysis Cont.
filtered_jobs %>%
  group_by(job_title) %>%
  summarise(
    number_of_entries = n()
  ) %>%
  arrange(desc(number_of_entries)) %>%
  slice_head(n = 50)

8 Top 25 Skills for Data Science!

ggplot(skill_counts, aes(x = reorder(job_skills, percent_jobs), y = percent_jobs)) +
  geom_col(fill = "cornflowerblue") +
  coord_flip() +  # Horizontal bars for better readability
  labs(
    title = "Top 25 Skills for Data Science Positions in 2024",
    x = "Skill",
    y = "Percentage of Job Listings"
  ) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +  # Add % sign to axis
  theme_minimal()

9 Distribution of Unique Skills Required by Jobs

Let’s take a look at the number of unique skills required by the jobs in our dataset.

# Counting the number of unique skills required by each job posting
UniqueSkills_ByPosting <- job_postings_with_skills %>%
  summarise(
    num_skill   = n_distinct(job_skills),
    title       = first(job_title),
    job_company = first(company),
    seniority   = first(job_level),
    .by = job_link
  )

What if we want to know the mean number of skills required by jobs?

# Calculating the mean
nskill_mean <- round(mean(UniqueSkills_ByPosting$num_skill), digits = 0)
nskill_mean
## [1] 26

What if we want to know the median number of skills required by jobs?

# Calculating the median
nskill_median <- round(median(UniqueSkills_ByPosting$num_skill), digits = 2)

nskill_median
## [1] 24

Okay, great: now we know that the mean and median of the distribution of unique skills required by jobs differ, but not by much. The mean is 26 (rounded) and the median is 24. The mean sitting above the median is a first hint that the distribution is right-skewed, which the plot below confirms.

Now let’s plot the distribution of the unique skills required by the data science jobs in our dataset.

ggplot(UniqueSkills_ByPosting, aes(x = num_skill)) +
  geom_histogram(binwidth = 1, fill = "steelblue") +
  labs(
    title = "Distribution of Skills Required by Jobs",
    x = "Number of Unique Skills",
    y = "Number of Jobs"
  )+
  geom_vline(aes(xintercept = nskill_mean), color = "black", 
             linetype ="dashed", linewidth =1) +
 annotate("text",                        
           x = nskill_mean + 5,
           y = 340,
           label = paste("Mean =", nskill_mean),
           col = "black",
           size = 5,
          hjust = 0.05)

We can see from the plot that the distribution is right-skewed because of a few outliers, especially one very large outlier that gives the distribution a long right tail. But we know that the average number of unique skills required by the data science jobs in our data set is 26, though the spread ranges from near zero to around 200.
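
A quick numeric check of that spread (output omitted):

# Quantiles of the per-posting skill counts, to quantify the long right tail
quantile(UniqueSkills_ByPosting$num_skill, probs = c(0, 0.5, 0.9, 0.99, 1))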

10 Top 25 Job Postings with the Heaviest Skill Requirements

# Take the 25 postings with the most unique skills (slice_max sorts for us)
top_skill_postings <- UniqueSkills_ByPosting %>%
  slice_max(num_skill, n = 25, with_ties = FALSE)
#Cleaning up job titles
top_skill_postings <- top_skill_postings %>%
  mutate(
    short_title = str_replace(title, "\\s*[-–—/(),]\\s*.*$", "") %>% str_trim()
  )
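
The str_replace() pattern above truncates a title at the first separator character; a toy example (hypothetical title, not from the dataset):

# Everything from the first separator (-, /, parenthesis, comma) onward is dropped
str_replace("Data Scientist - Remote (NY)", "\\s*[-–—/(),]\\s*.*$", "")
## expected: "Data Scientist"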

11 Plotting the 25 Job Listings with the Heaviest Skill Requirements

#Plot


ggplot(top_skill_postings, aes(x = reorder(short_title, num_skill), y = num_skill, fill = job_company)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 25 Job Postings with the Heaviest Skill Requirements",
    subtitle = "By Job Title and Company",
    x = "Job Title",
    y = "Number of Skills Required for the Job",
    fill = "Company" 
  ) +
  theme_minimal()+
  guides(fill = guide_legend(ncol = 1, keywidth = 1, keyheight = 0.8))

12 The 25 Job Listings with the Lowest Number of Skill Requirements

# Take the 25 postings with the fewest unique skills
bottom_skill_postings <- UniqueSkills_ByPosting %>%
  slice_min(num_skill, n = 25, with_ties = FALSE)


#Cleaning up job titles
bottom_skill_postings <- bottom_skill_postings %>%
  mutate(
    bshort_title = title %>%
      # trying to remove 100% ONSITE and 100% ONSITE JOB
      str_replace(regex("^\\s*100% ONSITE( JOB)?\\s*:?", ignore_case = TRUE), "") %>%
      str_replace_all("\\s*[-–—/(),].*$", "") %>%
      str_trim()
    )
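# A toy check of the two-step cleanup above (hypothetical title, not from
# the dataset): strip the "100% ONSITE" prefix, then cut at the first separator.
"100% ONSITE JOB: Data Analyst - Dallas" %>%
  str_replace(regex("^\\s*100% ONSITE( JOB)?\\s*:?", ignore_case = TRUE), "") %>%
  str_replace_all("\\s*[-–—/(),].*$", "") %>%
  str_trim()
# expected: "Data Analyst"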
library(RColorBrewer)

n_colors <- 25   # one color per posting in the bottom-25 chart
colors <- colorRampPalette(brewer.pal(12, "Set3"))(n_colors)


ggplot(bottom_skill_postings, aes(x = reorder(bshort_title, num_skill), y = num_skill, fill = job_company)) +
  geom_col() +
  scale_fill_manual(values = colors ) +
  coord_flip() +
  labs(
    title = "The 25 Job Postings with the Lowest Skill Requirements",
    subtitle = "By Job Title and Company",
    x = "Job Title",
    y = "Number of Skills Required for the Job",
    fill = "Company" 
  ) +
  theme_minimal() +
  guides(fill = guide_legend(ncol = 1, keywidth = 1, keyheight = 0.8))

13 Conclusion

The top 5 most valued data science skills are (in order of importance): Python, SQL, Data Analysis, Communication, and AWS!
Python is referenced in 49.3% of these data science job postings, and SQL in 48.6%.
These two lead the pack by a large margin, since the 3rd most referenced skill appears in only 21.9% of postings.

We can now say that these are the most valued data science skills!

From our dataset we have learned that, on average, a data science job looks for a set of 26 unique skills, though many postings require far fewer, and an exceptional few require more than 150.

14 Extra

14.1 Notes

Here is some extra information on the database built for this project!

14.2 ER Diagram Code

er_diagram <- grViz("
digraph ERD {
  graph [layout = dot, rankdir = LR]
  node [shape = plaintext, fontname = Helvetica, fontsize = 10]

  job_postings_raw [
    label=<
      <TABLE BORDER='1' CELLBORDER='0' CELLSPACING='0' BGCOLOR='lightyellow'>
        <TR><TD WIDTH='130' ALIGN='CENTER' BGCOLOR='gold'><B>Job_Postings</B></TD></TR>
        <TR><TD ALIGN='LEFT' BALIGN='LEFT'>job_link (PK)</TD></TR>
        <TR><TD ALIGN='LEFT'>last_processed_time</TD></TR>
        <TR><TD ALIGN='LEFT'>last_status</TD></TR>
        <TR><TD ALIGN='LEFT'>got_summary</TD></TR>
        <TR><TD ALIGN='LEFT'>got_ner</TD></TR>
        <TR><TD ALIGN='LEFT'>is_being_worked</TD></TR>
        <TR><TD ALIGN='LEFT'>job_title</TD></TR>
        <TR><TD ALIGN='LEFT'>company</TD></TR>
        <TR><TD ALIGN='LEFT'>job_location</TD></TR>
        <TR><TD ALIGN='LEFT'>first_seen</TD></TR>
        <TR><TD ALIGN='LEFT'>search_city</TD></TR>
        <TR><TD ALIGN='LEFT'>search_country</TD></TR>
        <TR><TD ALIGN='LEFT'>search_position</TD></TR>
        <TR><TD ALIGN='LEFT'>job_level</TD></TR>
        <TR><TD ALIGN='LEFT'>job_type</TD></TR>
      </TABLE>
    >
  ]

  job_summary_raw [
    label=<
      <TABLE BORDER='1' CELLBORDER='0' CELLSPACING='0' BGCOLOR='lightyellow'>
        <TR><TD WIDTH='120' ALIGN='CENTER' BGCOLOR='gold'><B>Job_Summary</B></TD></TR>
        <TR><TD ALIGN='LEFT'>job_link (PK, FK)</TD></TR>
        <TR><TD ALIGN='LEFT'>job_summary</TD></TR>
      </TABLE>
    >
  ]

  skills_long [
    label=<
      <TABLE BORDER='1' CELLBORDER='0' CELLSPACING='0' BGCOLOR='lightyellow'>
        <TR><TD WIDTH='100' ALIGN='CENTER' BGCOLOR='gold'><B>Job_Skills_Long</B></TD></TR>
        <TR><TD ALIGN='LEFT'>job_link (PK, FK)</TD></TR>
        <TR><TD ALIGN='LEFT'>job_skills</TD></TR>
      </TABLE>
    >
  ]

  # Relationships
  job_postings_raw -> job_summary_raw [label = '1 : 1']
  job_postings_raw -> skills_long [label = '1 : many']
}
")

14.3 ER Diagram
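
Printing the grViz object from the chunk above renders the diagram:

er_diagram   # printing the htmlwidget renders the ER diagram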

14.4 Deleting the Local SQLite Instance

file.remove("jobs_database.sqlite")
## [1] TRUE