Packages

library(RMySQL)
library(DBI)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(wordcloud2) # needed for the word cloud generated later in this report

Bringing the data from SQL to R

Connection to the SQL Server
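
The connection object con used in the retrieval step below is created here. The credentials are not shown in this report, so the following is only a sketch of what the DBI/RMySQL connection call might look like; the host, database name, and credential variables are placeholders, not the actual values used.

# connection sketch -- host, dbname, and credentials are placeholders, not the actual values
con <- DBI::dbConnect(
  RMySQL::MySQL(),
  host     = "localhost",
  dbname   = "data_science_jobs",
  user     = Sys.getenv("MYSQL_USER"),
  password = Sys.getenv("MYSQL_PWD")
)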

Importing the DB

# retrieving
sql_query <- "
SELECT
  jp.posting_id,
  jp.URL,
  jp.first_seen,
  jp.last_processed,
  jpswd.skills_id,
  jpswd.skills_desc,
  t.title_desc,
  ct.city_desc,
  cn.country_desc,
  jl.job_level_desc,
  osf.onsite_desc
FROM
  tbl_job_posting AS jp
  LEFT JOIN tbl_job_posting_skills_w_desc AS jpswd ON jp.posting_id = jpswd.posting_id
  LEFT JOIN tbl_title AS t ON jp.title_id = t.title_id
  LEFT JOIN tbl_city AS ct ON jp.city_id = ct.city_id
  LEFT JOIN tbl_country AS cn ON jp.country_id = cn.country_id
  LEFT JOIN tbl_job_level AS jl ON jp.job_level_id = jl.job_level_id
  LEFT JOIN tbl_onsite_flag AS osf ON jp.onsite_flag_id = osf.onsite_flag_id
"

# run the query; the joins are performed in SQL, and the joined result is returned as a data frame
joined_data <- dbGetQuery(con, sql_query)

DBI::dbDisconnect(con)
## [1] TRUE
# save
write.csv(joined_data, "C:/Users/lenov/Dropbox/_CUNY SPS MSDS/607/Project 3/merged_data3.csv", row.names = FALSE)

Tidying and Transformations

# if needed to load the dataframe
joined_data <- read.csv("C:/Users/lenov/Dropbox/_CUNY SPS MSDS/607/Project 3/merged_data3.csv")

Most of the tidying was already done in SQL, a benefit of working from a normalized relational database organized within a logical framework.

What does the data look like?

Exploring the data
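
As a quick first look (not part of the original output), one could inspect the structure of the joined data frame; this is only a sketch and its output is not reproduced here.

# quick structural overview of the joined data (sketch; output not shown)
glimpse(joined_data)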

# unique values for country_desc
unique_countries <- distinct(joined_data, country_desc) %>% arrange(country_desc)
unique_countries
##       country_desc
## 1      Australia\r
## 2         Canada\r
## 3 United Kingdom\r
## 4  United States\r
# Remove the carriage return character from the country descriptions
joined_data <- joined_data %>%
  mutate(country_desc = gsub("\r", "", country_desc))


unique_postings_by_country <- joined_data %>%
  select(posting_id, country_desc) %>%
  distinct()

country_frequency <- unique_postings_by_country %>%
  filter(country_desc %in% c("Australia", "Canada", "United Kingdom", "United States")) %>%
  group_by(country_desc) %>%
  summarise(frequency = n()) %>%
  ungroup() %>%
  arrange(desc(frequency))
country_frequency
## # A tibble: 4 × 2
##   country_desc   frequency
##   <chr>              <int>
## 1 United States      10277
## 2 United Kingdom       994
## 3 Canada               630
## 4 Australia            301

So this dataset only includes job postings in the USA (the vast majority, 10,277), UK (994), Canada (630), and Australia (301).

joined_data <- joined_data %>%
  mutate(job_level_desc = gsub("\r", "", job_level_desc))
joined_data <- joined_data %>%
  mutate(job_level_desc = gsub("\n", "", job_level_desc))

unique_job_levels_vector <- distinct(joined_data, job_level_desc) %>% arrange(job_level_desc) %>% pull(job_level_desc)

# unique values for job_level_desc
unique_postings_by_joblevel <- joined_data %>%
  select(posting_id, job_level_desc) %>%
  distinct() 

job_level_frequency <- unique_postings_by_joblevel %>%
  filter(job_level_desc %in% c("Associate", "Mid senior")) %>%
  group_by(job_level_desc) %>%
  summarise(frequency = n()) %>%
  ungroup() %>%
  arrange(desc(frequency))

job_level_frequency
## # A tibble: 2 × 2
##   job_level_desc frequency
##   <chr>              <int>
## 1 Mid senior         10905
## 2 Associate           1297

The data includes only two job levels: Associate (1,297) and Mid senior (10,905).

# unique values for onsite_desc
unique_onsite_flags <- distinct(joined_data, onsite_desc) %>% arrange(onsite_desc)
unique_onsite_flags
##   onsite_desc
## 1      Hybrid
## 2     On_Site
## 3      Remote
# unique postings by onsite_desc
unique_postings_by_onsite <- joined_data %>%
  select(posting_id, onsite_desc) %>%
  distinct() 

job_onsite_frequency <- unique_postings_by_onsite %>%
  filter(onsite_desc %in% c("Hybrid", "Remote", "On_Site")) %>%
  group_by(onsite_desc) %>%
  summarise(frequency = n()) %>%
  ungroup() %>%
  arrange(desc(frequency))

job_onsite_frequency
## # A tibble: 3 × 2
##   onsite_desc frequency
##   <chr>           <int>
## 1 On_Site         12174
## 2 Remote             18
## 3 Hybrid             10

As expected, there are three work arrangements: onsite (the overwhelming majority, 12,174 postings), remote (18), and hybrid (10).

Which skills came up the most in the LinkedIn job postings?

Visualization

# after joining, the data is now in long-form and is ready to check the frequency of skills
skill_frequency <- joined_data %>%
  group_by(skills_desc) %>%
  summarise(frequency = n()) %>%
  ungroup() %>%
  arrange(desc(frequency))

skill_frequency
## # A tibble: 74,868 × 2
##    skills_desc            frequency
##    <chr>                      <int>
##  1 "Python\r"                  4801
##  2 "SQL\r"                     4606
##  3 "Communication\r"           2498
##  4 "Data Analysis\r"           2181
##  5 "Machine Learning\r"        1966
##  6 "AWS\r"                     1740
##  7 "Tableau\r"                 1685
##  8 "Data Visualization\r"      1562
##  9 "R\r"                       1543
## 10 "Java\r"                    1414
## # ℹ 74,858 more rows
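
The skill labels above still carry the trailing "\r" character (e.g., "Python\r"), so it will also appear in the axis labels of the charts below. An optional cleanup, mirroring the gsub() fixes applied to country_desc and job_level_desc earlier, would strip it before recomputing skill_frequency; a minimal sketch:

# optional cleanup sketch: strip the carriage return from skills_desc as well,
# mirroring the fixes applied to country_desc and job_level_desc above
joined_data <- joined_data %>%
  mutate(skills_desc = gsub("\r", "", skills_desc))
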
# visualization - bar chart
ggplot(head(skill_frequency, 10), aes(x = reorder(skills_desc, frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  coord_flip() + 
  labs(x = "Skill", y = "Frequency", title = "Top 10 Skills by Frequency")

# making it nicer
ggplot(head(skill_frequency, 10), aes(x = reorder(skills_desc, frequency), y = frequency, fill = skills_desc)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_fill_viridis_d() + 
  coord_flip() + 
  labs(x = "Skill",
       y = "Frequency",
       title = "Top 10 Skills by Frequency",
       subtitle = "Based on job postings data") +
  theme_minimal() +
  theme(
    axis.title.y = element_blank(), 
    axis.text.x = element_text(size = 12), 
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold"), 
    plot.subtitle = element_text(size = 14), 
    plot.margin = margin(1, 1, 1, 1, "cm")
  )

Heat Map

skill_frequency$index <- seq_along(skill_frequency$skills_desc)

top_skills <- skill_frequency %>%
  filter(frequency >= 100) %>%
  top_n(20, frequency)

ggplot(top_skills, aes(x = factor(1), y = reorder(skills_desc, frequency), fill = frequency)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "blue") +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_text(size = 8),
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5)
  ) +
  labs(fill = "Frequency", y = "Skill", title = "Top 20 Skills by Frequency") +
  coord_fixed(ratio = 1)

==============================================================

Generating a Word Cloud for the skills most in demand in Data Science

Below we import tbl_job_posting_skills_w_desc from our database, a table that contains the posting ID, skill ID, and skill description. As the glimpse() output shows, the data set is large: 314,926 rows.
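
The import chunk itself is not shown here; a minimal sketch of what it might look like, assuming a DBI connection con like the one opened earlier (which would need to be reopened, since it was closed after the first import), is:

# sketch only -- assumes an open DBI connection 'con' to the same database
job_skillsDF <- dbGetQuery(con, "SELECT posting_id, skills_id, skills_desc
                                 FROM tbl_job_posting_skills_w_desc")
glimpse(job_skillsDF)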

## Rows: 314,926
## Columns: 3
## $ posting_id  <chr> "6334263", "6334263", "6334263", "6334263", "6334263", "63…
## $ skills_id   <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "…
## $ skills_desc <chr> "Machine Learning\r", "Programming\r", "Python\r", "Scala\…

Cleaning the data slightly more for a Word Cloud

Below we convert all of the skills to lower case so that matching is case-insensitive (i.e., "Python" and "python" would otherwise be counted as two different skills). We also remove the "\r" carriage-return character.

#lower case
job_skillsDF$skills_desc <- str_to_lower(job_skillsDF$skills_desc)
job_skillsDF$skills_desc <- str_replace_all(job_skillsDF$skills_desc, '\\r','')

Evaluating the length of skill strings

Below we add a column that contains the length of the strings. This will be necessary to remove outliers.

#length of string
job_skillsDF$skill_desc_length <- str_count(job_skillsDF$skills_desc)

job_skillsDF <- arrange(job_skillsDF, skill_desc_length)

tail(job_skillsDF, n = 10)
##        posting_id skills_id                                        skills_desc
## 314917    3189276     74732 ability to perform and adapt in high stress enviro
## 314918    3189276     74733 ability to manage communicate and coordinate proje
## 314919    3189276     74736 demonstrated strong skills in collaborating and in
## 314920    7386807      6945 bachelor's degree in cybersecurity systems enginee
## 314921    4707968     74871 technical challenges for transportation technologi
## 314922    8598957     74900 hardware/software/operatingsystem implementation a
## 314923    8598957     74906 larger environments and working around the custome
## 314924    8598957     74909 ability to use knowledge sound judgment and resour
## 314925    8598957     74911 ability to balance multiple priorities and meet de
## 314926    8598957     74913 desktop mobile os and/or hw design and implementat
##        skill_desc_length
## 314917                50
## 314918                50
## 314919                50
## 314920                50
## 314921                50
## 314922                50
## 314923                50
## 314924                50
## 314925                50
## 314926                50

As we can see above, some of the "skills" are essentially whole sentences, truncated at 50 characters. These do not provide much insight, since we are interested in specific skills (e.g., Python, Excel).

Removing outliers

Below we use a histogram and summary statistics to examine the distribution of skill string lengths.

hist(job_skillsDF$skill_desc_length)

summary(job_skillsDF$skill_desc_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   15.00   15.36   20.00   50.00

Based on the above, removing skill strings that are 25 characters or longer keeps most of the data (the third quartile is 20 characters) while dropping the entries that are really sentences.

job_skillsDF_filtered <- job_skillsDF |> filter(skill_desc_length < 25)

job_skillsDF_filtered <- arrange(job_skillsDF_filtered, skill_desc_length)

tail(job_skillsDF_filtered, n = 10)
##        posting_id skills_id              skills_desc skill_desc_length
## 278110    3019943     74771 construction regulations                24
## 278111    6728169     74784 ba/bs or technical certs                24
## 278112    2730123     74790 leadingedge technologies                24
## 278113    8020065       175 green belt certification                24
## 278114    3161415     74836 online and short courses                24
## 278115    4707968     74854 diverse team supervision                24
## 278116    4707968     74855 energy sector experience                24
## 278117    4707968     74860 relevant data experience                24
## 278118    6138795     12446 procedural documentation                24
## 278119    4050887     23476 blood bank refrigerators                24

Tallying the frequency of each skill

Below we create a table that counts the frequency of each unique skill

freqOfSkills <- as.data.frame(table(job_skillsDF_filtered$skills_desc))

#freqOfSkills <- arrange(freqOfSkills, Freq)

#tail(freqOfSkills, n = 10)
colnames(freqOfSkills) <- c('word', 'freq')
rownames(freqOfSkills) <- freqOfSkills$word


summary(freqOfSkills$freq) # note: the count column was renamed to 'freq' above, so $Freq would return NULL

Removing skills that rarely appear

Based on the above, the data could use further filtering: the top skill appears nearly 5,000 times while many skills appear only once. We will keep only the skills that appear at least 200 times.

filterValue <- 200

freqOfSkillsFiltered <- freqOfSkills |> filter(freq >= filterValue)

Word cloud

Below we create a word cloud of the skills for data science jobs

set.seed(1234)
wordcloud2(data = freqOfSkillsFiltered, size = 0.4, color = 'random-dark')

============================

Statistical Analyses: a nuanced look at the data science skills most in demand.

To explore potential analysis options, we look for relationships between skill frequency and the other variables available in this data set. We start with job level (i.e., seniority).

Model 1: relationship with job level.

skill_freq_by_job_level <- joined_data %>%
  filter(job_level_desc %in% c("Associate", "Mid senior")) %>%
  count(job_level_desc, skills_desc) %>%
  group_by(skills_desc) %>%
  mutate(total_freq = sum(n)) %>%
  ungroup() %>%
  arrange(desc(total_freq)) %>%
  slice(1:10) # 10 rows = the 5 most frequent skills x 2 job levels

ggplot(skill_freq_by_job_level, aes(x = reorder(skills_desc, total_freq), y = n, fill = job_level_desc)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_viridis_d() +
  coord_flip() +
  labs(x = "Skill", 
       y = "Frequency", 
       title = "Top 10 Skills by Frequency", 
       subtitle = "Stratified by Job Level") +
  theme_minimal() +
  theme(
    axis.title.y = element_blank(), 
    axis.text.x = element_text(size = 12), 
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold"), 
    plot.subtitle = element_text(size = 14), 
    plot.margin = margin(1, 1, 1, 1, "cm"), 
    legend.position = "bottom"
  )

# The chart shows large absolute differences between the two levels simply because there are far more Mid senior postings than Associate postings. To make the comparison fair, we normalize the frequencies: within each job level, we show each skill's share of all skill mentions in that level, rather than the raw counts.

skill_freq_by_job_level <- joined_data %>%
  filter(job_level_desc %in% c("Associate", "Mid senior")) %>%
  count(job_level_desc, skills_desc) %>%
  group_by(job_level_desc) %>%
  mutate(frequency_proportion = n / sum(n)) %>%
  ungroup() %>%
  group_by(skills_desc) %>%
  mutate(total_freq = sum(n)) %>%
  ungroup() %>%
  arrange(desc(total_freq)) %>%
  slice(1:10) # again, 10 rows = top 5 skills x 2 job levels

ggplot(skill_freq_by_job_level, aes(x = reorder(skills_desc, total_freq), y = frequency_proportion, fill = job_level_desc)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_viridis_d() +
  coord_flip() +
  labs(x = "Skill", 
       y = "Proportion", 
       title = "Top 10 Skills by Normalized Frequency", 
       subtitle = "Stratified by Job Level") +
  theme_minimal() +
  theme(
    axis.title.y = element_blank(), 
    axis.text.x = element_text(size = 12), 
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold"), 
    plot.subtitle = element_text(size = 14), 
    plot.margin = margin(1, 1, 1, 1, "cm"), 
    legend.position = "bottom"
  )

Testing the statistical relationship between skill frequency and job level:

# contingency table of skill-mention counts (top 5 skills) by job level
contingency_table <- xtabs(n ~ job_level_desc + skills_desc, data = skill_freq_by_job_level)

# chi-squared test
chi_squared_result <- chisq.test(contingency_table)
chi_squared_result
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 49.852, df = 4, p-value = 3.877e-10

The chi-squared test is statistically significant. The very low p-value indicates that the distribution of skill mentions differs between the two job levels: the skills mentioned in job postings are not distributed independently of the 'Associate' and 'Mid senior' levels, so the relative demand for these skills depends on seniority.
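
To see which skills drive this difference, one could inspect the standardized residuals of the test (a quick follow-up, not part of the original analysis); cells with large absolute values contribute most to the chi-squared statistic.

# standardized residuals: cells with |value| > 2 contribute most to the significant result
round(chi_squared_result$stdres, 2)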

Logistic regression model for the skills frequency by job level:

binary_data <- skill_freq_by_job_level %>%
  uncount(n) %>%
  mutate(mentioned = 1) # note: every expanded row is a mention, so the outcome contains only 1s

logistic_model <- glm(mentioned ~ job_level_desc + skills_desc, data = binary_data, family = "binomial")
## Warning: glm.fit: algorithm did not converge
logistic_model
## 
## Call:  glm(formula = mentioned ~ job_level_desc + skills_desc, family = "binomial", 
##     data = binary_data)
## 
## Coefficients:
##                   (Intercept)       job_level_descMid senior  
##                     2.657e+01                     -3.083e-08  
##    skills_descData Analysis\r  skills_descMachine Learning\r  
##                     1.964e-09                      4.796e-06  
##           skills_descPython\r               skills_descSQL\r  
##                    -1.161e-08                     -1.175e-08  
## 
## Degrees of Freedom: 16051 Total (i.e. Null);  16046 Residual
## Null Deviance:       0 
## Residual Deviance: 9.313e-08     AIC: 12

This warning means that the iterative fitting procedure did not converge to a stable maximum-likelihood solution. The reason is visible in the output: after uncount(n), every row has mentioned = 1 and there are no zero cases, so the outcome has no variation (note the null deviance of 0) and the coefficients cannot be estimated meaningfully.
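
A formulation that would give the model both positive and negative cases is to build the outcome at the posting level: for each posting, record whether it mentions a given skill, and model that indicator against job level. A rough sketch for a single skill follows (hypothetical, not part of the original analysis; trimws() guards against the trailing "\r" in skills_desc).

# sketch: posting-level 0/1 outcome for one skill, so the regression has both 0s and 1s
python_by_posting <- joined_data %>%
  filter(job_level_desc %in% c("Associate", "Mid senior")) %>%
  group_by(posting_id, job_level_desc) %>%
  summarise(mentions_python = as.integer(any(trimws(skills_desc) == "Python", na.rm = TRUE)),
            .groups = "drop")

python_model <- glm(mentions_python ~ job_level_desc,
                    data = python_by_posting, family = "binomial")
summary(python_model)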

Models by country or onsite/remote status:

Unfortunately, the counts for the other levels of these variables are too small to support a meaningful model: the vast majority of these job postings are for onsite jobs located in the United States (12,174 of the 12,202 postings with an onsite flag, about 99.8%, are On_Site).

==========================================================

Conclusion

This is a large and recent (2024) data set of Data Science job postings scraped from LinkedIn. We imported the data into SQL and normalized the tables within a logical framework. We then queried the database from R, using left joins to assemble a single data frame suited to finding which skills are most in demand for data scientists.

After some additional cleaning (most of the tidying and transformation had already been done in SQL), we produced summary statistics and visualizations to explore the data set, including bar charts, a heat map, and a word cloud of the skills most in demand for data scientists.

We then tested the relationship between skill frequency and job level (i.e., seniority) and found a statistically significant difference between the two job levels present in the data (Associate vs. Mid senior); in other words, the skills required depend on the job level. We visualized these differences with bar charts, first as raw counts and then as within-level proportions to normalize for the large difference in the number of postings at each level. An attempt to explore this relationship further with a logistic regression model did not converge (the likely reason is discussed above; a fuller treatment is beyond the scope of this work). We also wanted to run similar analyses for the relationship between skill frequency and location (i.e., country) as well as onsite/remote status, but the distributions of those variables were too skewed to support such analyses.

Main takeaways:

  • The data from job postings for data scientists showed that the required skills are a mix of both technical and soft skills.
  • The skills most in demand for data scientists were (in descending order): Python, SQL, communication, data analysis, and machine learning.
  • The data came mostly from the United States, with much smaller proportions from the UK, Canada, and Australia.
  • The vast majority of the job postings were for onsite positions.
  • Some skills were expressed in several forms within this data set (e.g., "communication" and "communication skills"). If we were to consolidate the many variants of certain skills, some might rank higher in frequency; however, we do not think this would change the current ranking of the top 5 skills.
  • The desired data science skills in the job market represented by this data set depended on the job level (Associate vs. Mid senior).

Limitations:

Despite its strengths (a large, recent data set drawn from a popular job-seeking website), the data has some limitations. The vast majority of the job postings were from the US, which limits the generalizability of these findings to other countries. A few skills were expressed in several different ways, which may have under-estimated their frequency and ranking. The finding that virtually all of the job postings were "onsite" seems anomalous, given that a large proportion of the data science jobs currently on the market are hybrid or remote. Finally, the data set included only a few variables, which prevented us from examining the relationship between skill frequency and industry/domain, salary, or full-time vs. part-time status.

Future directions:

Addressing the above-mentioned limitations would help clarify, in a more nuanced way, which data science skills are most desired. A more representative sample covering more countries (with a more balanced distribution) would improve the generalizability of these findings.