Packages
library(RMySQL)
library(DBI)
library(tidyverse)   # loads dplyr, ggplot2, and stringr, all used below
library(wordcloud2)  # needed for the word cloud section
Connection to the SQL Server
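The connection object con used below is created here, but the credentials are not shown in the knitted output. The following is a minimal sketch of the step, assuming a local MySQL server; the database name, host, and environment variables are placeholders, not the project's actual values.
# hedged sketch of the connection step; dbname/host/credentials are
# placeholders, not the actual project values
con <- DBI::dbConnect(
  RMySQL::MySQL(),
  dbname = "job_postings",          # hypothetical database name
  host = "localhost",
  user = Sys.getenv("MYSQL_USER"),  # credentials kept out of the script
  password = Sys.getenv("MYSQL_PWD")
)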
Importing the DB
# retrieve all of the joined tables in a single query
sql_query <- "
SELECT
    jp.posting_id,
    jp.URL,
    jp.first_seen,
    jp.last_processed,
    jpswd.skills_id,
    jpswd.skills_desc,
    t.title_desc,
    ct.city_desc,
    cn.country_desc,
    jl.job_level_desc,
    osf.onsite_desc
FROM
    tbl_job_posting AS jp
    LEFT JOIN tbl_job_posting_skills_w_desc AS jpswd ON jp.posting_id = jpswd.posting_id
    LEFT JOIN tbl_title AS t ON jp.title_id = t.title_id
    LEFT JOIN tbl_city AS ct ON jp.city_id = ct.city_id
    LEFT JOIN tbl_country AS cn ON jp.country_id = cn.country_id
    LEFT JOIN tbl_job_level AS jl ON jp.job_level_id = jl.job_level_id
    LEFT JOIN tbl_onsite_flag AS osf ON jp.onsite_flag_id = osf.onsite_flag_id
"
# run the query (the join happens in SQL) and pull the result into R
joined_data <- dbGetQuery(con, sql_query)
DBI::dbDisconnect(con)
## [1] TRUE
# save
write.csv(joined_data, "C:/Users/lenov/Dropbox/_CUNY SPS MSDS/607/Project 3/merged_data3.csv", row.names = FALSE)
Exploring the data
# unique values for country_desc
unique_countries <- distinct(joined_data, country_desc) %>% arrange(country_desc)
unique_countries
## country_desc
## 1 Australia\r
## 2 Canada\r
## 3 United Kingdom\r
## 4 United States\r
# Remove the stray carriage-return character from the country descriptions
joined_data <- joined_data %>%
  mutate(country_desc = gsub("\r", "", country_desc))
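The same stray "\r" (and "\n") characters turn up in other text columns below, so a more general alternative, sketched here as a suggestion rather than what this report actually ran, would strip them from every character column in one pass. It is not applied here, which is why the skills_desc outputs further down still show the trailing "\r".
# sketch only: strip stray \r and \n from all character columns at once
# (this report used per-column gsub() calls instead)
joined_data <- joined_data %>%
  mutate(across(where(is.character), ~ gsub("[\r\n]", "", .x)))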
unique_postings_by_country <- joined_data %>%
  select(posting_id, country_desc) %>%
  distinct()

country_frequency <- unique_postings_by_country %>%
  filter(country_desc %in% c("Australia", "Canada", "United Kingdom", "United States")) %>%
  group_by(country_desc) %>%
  summarise(frequency = n()) %>%
  ungroup() %>%
  arrange(desc(frequency))
country_frequency
## # A tibble: 4 × 2
## country_desc frequency
## <chr> <int>
## 1 United States 10277
## 2 United Kingdom 994
## 3 Canada 630
## 4 Australia 301
So this dataset only includes job postings in the USA (the vast
majority, 10,277), UK (994), Canada (630), and Australia (301).
# strip both \r and \n from the job level descriptions in one pass
joined_data <- joined_data %>%
  mutate(job_level_desc = gsub("[\r\n]", "", job_level_desc))

unique_job_levels_vector <- joined_data %>%
  distinct(job_level_desc) %>%
  arrange(job_level_desc) %>%
  pull(job_level_desc)
# unique values for job_level_desc
unique_postings_by_joblevel <- joined_data %>%
  select(posting_id, job_level_desc) %>%
  distinct()

job_level_frequency <- unique_postings_by_joblevel %>%
  filter(job_level_desc %in% c("Associate", "Mid senior")) %>%
  group_by(job_level_desc) %>%
  summarise(frequency = n()) %>%
  ungroup() %>%
  arrange(desc(frequency))
job_level_frequency
## # A tibble: 2 × 2
## job_level_desc frequency
## <chr> <int>
## 1 Mid senior 10905
## 2 Associate 1297
It looks like we have only two job levels: Associate (1,297) and Mid
senior (10,905).
# unique values for onsite_desc
unique_onsite_flags <- distinct(joined_data, onsite_desc) %>% arrange(onsite_desc)
unique_onsite_flags
## onsite_desc
## 1 Hybrid
## 2 On_Site
## 3 Remote
# frequency of postings by onsite status
unique_postings_by_onsite <- joined_data %>%
  select(posting_id, onsite_desc) %>%
  distinct()

job_onsite_frequency <- unique_postings_by_onsite %>%
  filter(onsite_desc %in% c("Hybrid", "Remote", "On_Site")) %>%
  group_by(onsite_desc) %>%
  summarise(frequency = n()) %>%
  ungroup() %>%
  arrange(desc(frequency))
job_onsite_frequency
## # A tibble: 3 × 2
## onsite_desc frequency
## <chr> <int>
## 1 On_Site 12174
## 2 Remote 18
## 3 Hybrid 10
As expected, there are three possibilities for the work arrangement:
onsite (the overwhelming majority, 12,174), remote (18), and hybrid (10).
Visualization
# after the join, the data is in long form: each row is one posting-skill
# pair, so we can count skill frequency directly
# (note: skills_desc still carries a trailing \r here; it is stripped
# later, before the word cloud)
skill_frequency <- joined_data %>%
  group_by(skills_desc) %>%
  summarise(frequency = n()) %>%
  ungroup() %>%
  arrange(desc(frequency))
skill_frequency
## # A tibble: 74,868 × 2
## skills_desc frequency
## <chr> <int>
## 1 "Python\r" 4801
## 2 "SQL\r" 4606
## 3 "Communication\r" 2498
## 4 "Data Analysis\r" 2181
## 5 "Machine Learning\r" 1966
## 6 "AWS\r" 1740
## 7 "Tableau\r" 1685
## 8 "Data Visualization\r" 1562
## 9 "R\r" 1543
## 10 "Java\r" 1414
## # ℹ 74,858 more rows
# visualization - bar chart
ggplot(head(skill_frequency, 10), aes(x = reorder(skills_desc, frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  coord_flip() +
  labs(x = "Skill", y = "Frequency", title = "Top 10 Skills by Frequency")

# making it nicer
ggplot(head(skill_frequency, 10), aes(x = reorder(skills_desc, frequency), y = frequency, fill = skills_desc)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_fill_viridis_d() +
  coord_flip() +
  labs(x = "Skill",
       y = "Frequency",
       title = "Top 10 Skills by Frequency",
       subtitle = "Based on job postings data") +
  theme_minimal() +
  theme(
    axis.title.y = element_blank(),
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 14),
    plot.margin = margin(1, 1, 1, 1, "cm")
  )

Heat Map
skill_frequency$index <- seq_along(skill_frequency$skills_desc)  # row index (not used below)

top_skills <- skill_frequency %>%
  filter(frequency >= 100) %>%
  top_n(20, frequency)

ggplot(top_skills, aes(x = factor(1), y = reorder(skills_desc, frequency), fill = frequency)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "blue") +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_text(size = 8),
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5)
  ) +
  labs(fill = "Frequency", y = "Skill", title = "Top 20 Skills by Frequency") +
  coord_fixed(ratio = 1)

==============================================================
Generating a Word Cloud for the skills most in demand in Data Science
Below we import tbl_job_posting_skills_w_desc from our database, a
table containing the job posting ID, skill ID, and skill description. As
the glimpse() output shows, the data set is large: 314,926 rows.
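The exact import call is not shown in the knitted output. A minimal sketch, assuming the database connection con is reopened as above (it was disconnected earlier):
# sketch of the import step (assumes `con` is open; re-run dbConnect
# first if it was closed above)
job_skillsDF <- dbGetQuery(con, "SELECT * FROM tbl_job_posting_skills_w_desc")
glimpse(job_skillsDF)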
## Rows: 314,926
## Columns: 3
## $ posting_id <chr> "6334263", "6334263", "6334263", "6334263", "6334263", "63…
## $ skills_id <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "…
## $ skills_desc <chr> "Machine Learning\r", "Programming\r", "Python\r", "Scala\…
Cleaning the data slightly more for a Word Cloud
Below we convert all of the skills to lower case so that matching is
case insensitive (i.e., "Python" and "python" would otherwise count as
two different skills). We also remove the stray "\r" character.
#lower case
job_skillsDF$skills_desc <- str_to_lower(job_skillsDF$skills_desc)
job_skillsDF$skills_desc <- str_replace_all(job_skillsDF$skills_desc, '\\r','')
Evaluating the length of skill strings
Below we add a column containing the length of each skill string. This
will be needed to remove outliers.
#length of string
job_skillsDF$skill_desc_length <- str_count(job_skillsDF$skills_desc)
job_skillsDF <- arrange(job_skillsDF, skill_desc_length)
tail(job_skillsDF, n = 10)
## posting_id skills_id skills_desc
## 314917 3189276 74732 ability to perform and adapt in high stress enviro
## 314918 3189276 74733 ability to manage communicate and coordinate proje
## 314919 3189276 74736 demonstrated strong skills in collaborating and in
## 314920 7386807 6945 bachelor's degree in cybersecurity systems enginee
## 314921 4707968 74871 technical challenges for transportation technologi
## 314922 8598957 74900 hardware/software/operatingsystem implementation a
## 314923 8598957 74906 larger environments and working around the custome
## 314924 8598957 74909 ability to use knowledge sound judgment and resour
## 314925 8598957 74911 ability to balance multiple priorities and meet de
## 314926 8598957 74913 desktop mobile os and/or hw design and implementat
## skill_desc_length
## 314917 50
## 314918 50
## 314919 50
## 314920 50
## 314921 50
## 314922 50
## 314923 50
## 314924 50
## 314925 50
## 314926 50
As we can see above, some "skills" are essentially whole sentences.
These provide little insight, since we are interested in specific skills
(e.g., Python, Excel).
Removing outliers
Below we use a histogram to visualize the distribution of string lengths.
hist(job_skillsDF$skill_desc_length)

summary(job_skillsDF$skill_desc_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 15.00 15.36 20.00 50.00
Based on the summary above, keeping only skills shorter than 25
characters should capture most of the data (the third quartile is 20)
while removing the skills that are sentences.
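To quantify how much of the data this threshold keeps, a quick check (not in the original output) is the share of rows below the cutoff:
# share of skill rows shorter than 25 characters (a quick sanity check)
mean(job_skillsDF$skill_desc_length < 25)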
job_skillsDF_filtered <- job_skillsDF |> filter(skill_desc_length < 25)
job_skillsDF_filtered <- arrange(job_skillsDF_filtered, skill_desc_length)
tail(job_skillsDF_filtered, n = 10)
## posting_id skills_id skills_desc skill_desc_length
## 278110 3019943 74771 construction regulations 24
## 278111 6728169 74784 ba/bs or technical certs 24
## 278112 2730123 74790 leadingedge technologies 24
## 278113 8020065 175 green belt certification 24
## 278114 3161415 74836 online and short courses 24
## 278115 4707968 74854 diverse team supervision 24
## 278116 4707968 74855 energy sector experience 24
## 278117 4707968 74860 relevant data experience 24
## 278118 6138795 12446 procedural documentation 24
## 278119 4050887 23476 blood bank refrigerators 24
Tallying the frequency of each skill
Below we create a table that counts the frequency of each unique skill.
freqOfSkills <- as.data.frame(table(job_skillsDF_filtered$skills_desc))
#freqOfSkills <- arrange(freqOfSkills, Freq)
#tail(freqOfSkills, n = 10)
colnames(freqOfSkills) <- c('word', 'freq')
rownames(freqOfSkills) <- freqOfSkills$word
# the count column was renamed to lower-case 'freq' above, so it must be
# referenced as $freq ($Freq would silently return NULL)
summary(freqOfSkills$freq)
Removing skills that rarely appear
Based on the above, the data could use further filtering: the top skill
appears nearly 5,000 times while many skills appear only once. We will
keep only skills that appear at least 200 times.
filterValue <- 200
freqOfSkillsFiltered <- freqOfSkills |> filter(freq >= filterValue)
Word cloud
Below we create a word cloud of the skills for data science jobs.
set.seed(1234)
wordcloud2(data = freqOfSkillsFiltered, size = 0.4, color = 'random-dark')
============================
Statistical Analyses
To explore potential analysis options, we look for relationships
between skill frequency and the other variables available in this data
set, starting with job level (i.e., seniority).
Model 1:
skill_freq_by_job_level <- joined_data %>%
  filter(job_level_desc %in% c("Associate", "Mid senior")) %>%
  count(job_level_desc, skills_desc) %>%
  group_by(skills_desc) %>%
  mutate(total_freq = sum(n)) %>%
  ungroup() %>%
  arrange(desc(total_freq)) %>%
  slice(1:10)   # 10 rows = the top 5 skills x 2 job levels
ggplot(skill_freq_by_job_level, aes(x = reorder(skills_desc, total_freq), y = n, fill = job_level_desc)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_viridis_d() +
  coord_flip() +
  labs(x = "Skill",
       y = "Frequency",
       title = "Top 10 Skills by Frequency",
       subtitle = "Stratified by Job Level") +
  theme_minimal() +
  theme(
    axis.title.y = element_blank(),
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 14),
    plot.margin = margin(1, 1, 1, 1, "cm"),
    legend.position = "bottom"
  )

But this shows huge differences between the two levels mostly because
there are far more Mid senior postings than Associate postings to begin
with. To make the comparison fair and account for that imbalance, we
normalize the frequencies: we show each skill's share of all skill
mentions within its job level, rather than the raw counts.
skill_freq_by_job_level <- joined_data %>%
  filter(job_level_desc %in% c("Associate", "Mid senior")) %>%
  count(job_level_desc, skills_desc) %>%
  group_by(job_level_desc) %>%
  mutate(frequency_proportion = n / sum(n)) %>%   # share of that level's skill mentions
  ungroup() %>%
  group_by(skills_desc) %>%
  mutate(total_freq = sum(n)) %>%
  ungroup() %>%
  arrange(desc(total_freq)) %>%
  slice(1:10)   # again the top 5 skills x 2 job levels
ggplot(skill_freq_by_job_level, aes(x = reorder(skills_desc, total_freq), y = frequency_proportion, fill = job_level_desc)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_viridis_d() +
  coord_flip() +
  labs(x = "Skill",
       y = "Proportion",
       title = "Top 10 Skills by Normalized Frequency",
       subtitle = "Stratified by Job Level") +
  theme_minimal() +
  theme(
    axis.title.y = element_blank(),
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 14),
    plot.margin = margin(1, 1, 1, 1, "cm"),
    legend.position = "bottom"
  )

Relationship between skill frequency and job level:
# a contingency table of counts of skills by job level
contingency_table <- xtabs(n ~ job_level_desc + skills_desc, data = skill_freq_by_job_level)
# chi-squared test
chi_squared_result <- chisq.test(contingency_table)
chi_squared_result
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 49.852, df = 4, p-value = 3.877e-10
The chi-squared test is statistically significant. The very low
p-value indicates a significant association between skill frequency and
job level: the skills mentioned in job postings are not distributed
independently of the "Associate" and "Mid senior" levels, so the
frequency of skill mentions depends on the job level.
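As a follow-up, not part of the original analysis, the standardized residuals stored by chisq.test() show which cells drive the significance; large positive values mark skills over-represented at that job level:
# standardized residuals: which skill/level cells deviate most from
# what independence would predict
round(chi_squared_result$stdres, 2)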
Logistic regression model for skill frequency by job level:
binary_data <- skill_freq_by_job_level %>%
  uncount(n) %>%
  mutate(mentioned = 1)
logistic_model <- glm(mentioned ~ job_level_desc + skills_desc, data = binary_data, family = "binomial")
## Warning: glm.fit: algorithm did not converge
logistic_model
##
## Call: glm(formula = mentioned ~ job_level_desc + skills_desc, family = "binomial",
## data = binary_data)
##
## Coefficients:
## (Intercept) job_level_descMid senior
## 2.657e+01 -3.083e-08
## skills_descData Analysis\r skills_descMachine Learning\r
## 1.964e-09 4.796e-06
## skills_descPython\r skills_descSQL\r
## -1.161e-08 -1.175e-08
##
## Degrees of Freedom: 16051 Total (i.e. Null); 16046 Residual
## Null Deviance: 0
## Residual Deviance: 9.313e-08 AIC: 12
This warning means the iterative fitting procedure did not converge to
a stable maximum-likelihood solution. The cause is visible in the setup:
every row of binary_data has mentioned = 1, so the response has no
variation (hence the null deviance of 0), the fitted probabilities are
driven toward 1, and the coefficients are not identifiable.
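One better-posed alternative, sketched here as a suggestion rather than part of the original analysis, is to treat the job level itself as the binary outcome and ask whether the skill predicts it:
# sketch: model the job level as the outcome instead, so the response varies
binary_alt <- skill_freq_by_job_level %>%
  uncount(n) %>%
  mutate(is_mid_senior = as.integer(job_level_desc == "Mid senior"))
alt_model <- glm(is_mid_senior ~ skills_desc, data = binary_alt, family = "binomial")
summary(alt_model)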
Models by country or onsite/remote status:
Unfortunately, the counts for most values of these variables are far
too small to support any meaningful model: the vast majority of these
job postings were for onsite jobs located in the United States.
==========================================================
Conclusion
This is a large and recent (2024) data set of Data Science job
postings scraped from LinkedIn. We imported the data into SQL and
normalized the tables into a logical framework. We then loaded the
tables into R and performed a left join to build a data frame suited to
finding which skills are most in demand for data scientists. After some
additional cleanup (most of the tidying and transformation had already
been done in SQL), we computed summary statistics and explored the data
set visually, creating several graphical representations of the skills
most in demand for data scientists, including bar charts, a heat map,
and a word cloud. We then performed a statistical analysis of the
relationship between skill frequency and job level (i.e., seniority) and
found a statistically significant difference between the two job levels
included in the data (Associate vs. Mid senior); we therefore concluded
that the skills required depend on the job level. We visualized the
differences in skills across these two categories with bar charts, first
as raw counts and then as within-level proportions to normalize for the
large difference in the number of postings in each category. We then
attempted to build a logistic regression model to explore this
relationship further, but the model did not converge (the likely reason
is discussed above; a fuller treatment is beyond the scope of this
work). We also wanted to conduct a similar analysis relating skill
frequency to location (i.e., country) and to onsite/remote status, but
the distributions of those variables were too skewed to permit it.
Main takeaways:
- The skills most in demand for data scientists were (in descending
order): Python, SQL, communication, data analysis, and machine
learning.
- The data came mostly from the United States, with small proportions
from the UK, Canada, and Australia.
- The vast majority of the job postings were for onsite positions.
- Some skills were expressed in several forms within this data set
(e.g., "communication" and "communication skills"). Consolidating the
many variants might raise some skills in the frequency ranking, but we
doubt it would change the current top 5.
- The desired data science skills in this job market depended on the
job level (Associate vs. Mid senior).
Limitations:
Despite its strengths (a large, recent data set drawn from a popular
job-seeking website), the data has limitations. The vast majority of the
job postings were from the US, which limits the generalizability of
these findings to other countries. A few skills were expressed in
several different ways, which may have underestimated their frequency
and ranking. The finding that virtually all of the postings were
"onsite" seems anomalous, given that a large proportion of data science
jobs currently on the market are hybrid or remote. Finally, the data set
included only a few variables, which prevented us from examining the
relationship between skill frequency and domain/discipline/industry,
salary, or full-time/part-time status.
Future directions:
Addressing the above limitations would help clarify which data science
skills are most desired in a more nuanced way. A more representative
sample covering more countries, in a more balanced distribution, would
improve the generalizability of these findings.