Analysis & Visualizations

Introduction

This work is the fourth of five stages of an analysis whose main objective was to identify the most valued data science skills. Our approach was to collect job postings from various job boards and extract the skills mentioned in them. This file analyzes the skills extracted from those postings.

Code for web scraping, cleaning, and skill extraction can be found in the project repository on GitHub.

Read in data

Read the data from a CSV file that contains the job listings and their extracted skills.

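library(tidyverse)    # read_csv(), dplyr verbs, ggplot2, stringr
library(ggwordcloud)  # geom_text_wordcloud(), used for the word cloud
import_df <- read_csv('output/job_listings_skills_long.csv')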
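The file name and the later use of n_distinct(job_id) suggest that the data is in long format, with one row per job-skill pair. That layout is an assumption, and the toy values below are invented for illustration. The sketch shows why listings are counted with n_distinct(job_id) rather than n():

```r
library(dplyr)

# Hypothetical long-format data: one row per (job, skill) pair
toy_long <- tibble(
  job_id  = c(1, 1, 1, 2, 2),
  website = c("ai-jobs.net", "ai-jobs.net", "ai-jobs.net",
              "usajobs.gov", "usajobs.gov"),
  skill   = c("Python", "SQL", "Statistics", "SQL", "Tableau")
)

# n() counts rows (job-skill pairs); n_distinct(job_id) counts listings
toy_summary <- toy_long |>
  group_by(website) |>
  summarise(rows = n(), listings = n_distinct(job_id))

toy_summary
```

Counting rows would overstate the number of listings for any job that mentions more than one skill.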

Job Listings Based on Data Source

distinct_job_website <- 
  import_df |> 
  group_by(website) |> 
  summarise(count = n_distinct(job_id)) |> 
  arrange(desc(count))

ggplot(distinct_job_website, aes(x = count, y = reorder(website, count))) +
  geom_bar(stat = 'identity', fill = '#4E79A7') +
  labs(title = "Job Listings per Website", x = '', y = '') +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        axis.text = element_text(face = 'bold'))

knitr::kable(distinct_job_website)

website	count
ai-jobs.net	3967
dataanalyst.com	700
usajobs.gov	58
data.cityofnewyork.us	44

Most Requested Degree

Among the job postings that had college degree information, a Master’s was the most requested degree (498 postings), followed by a Bachelor’s (456) and a PhD (313).

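high_degree <- 
  import_df |> 
  # na.omit() ignores the `highest_ed` argument and drops every row that has
  # an NA in any column, not only rows with a missing degree
  na.omit() |> 
  group_by(highest_ed) |> 
  summarize(count = n_distinct(job_id))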

high_degree %>%
  ggplot(aes(x = highest_ed, y = count)) +
  geom_bar(stat = 'identity', fill = '#4E79A7') + 
  labs(title = "Most Requested Degrees", x = '', y = '') +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        axis.text = element_text(face = 'bold')) +
  geom_text(aes(label = count), 
            position = position_stack(vjust = 0.8), 
            size = 4,
            fontface = 'bold',
            color = 'white')

knitr::kable(high_degree)

highest_ed	count
bachelor	456
master	498
phd	313
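One caveat about the chunk above: na.omit() applied to a data frame drops rows with a missing value in any column, and the column name passed to it is ignored. A minimal sketch on invented toy data, comparing it with a filter on the education column alone:

```r
library(dplyr)

# Invented toy data: job 2 and job 4 lack a salary, job 3 lacks a degree
toy <- tibble(
  job_id     = c(1, 2, 3, 4),
  highest_ed = c("master", "bachelor", NA, "phd"),
  salary_max = c(150000, NA, 120000, NA)
)

# na.omit() ignores the extra argument and drops every row
# that contains an NA in any column
complete_rows <- toy |> na.omit(highest_ed)

# filter() on the one column keeps every row that has a degree,
# whatever else is missing
with_degree <- toy |> filter(!is.na(highest_ed))

nrow(complete_rows)
nrow(with_degree)
```

Here only job 1 survives na.omit(), while the filter keeps jobs 1, 2, and 4. Which behavior is wanted depends on whether the degree counts should be restricted to fully complete postings.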

Maximum Salary Distribution

The histogram shows the distribution of maximum salary for all jobs in the dataset, with missing values removed. The distribution is bimodal, with a first peak around $125,000 and a second around $175,000.

salaries <- 
  import_df |> 
  filter(!is.na(salary_max)) |> 
  distinct(job_id, .keep_all = TRUE)
  

salaries |> 
  ggplot(aes(x = salary_max)) +
  geom_histogram(binwidth = 25000, fill = '#4E79A7') +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Maximum Salary Distribution", x = '', y = '') +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        axis.text = element_text(face = 'bold')) +
  stat_bin(binwidth=25000, geom='text', color='white', size=3,
           aes(label=after_stat(count)), position=position_stack(vjust=0.8))

Maximum Salary vs. Education

Splitting the salary distribution by education level, we still see the same bimodal shape and the same peaks as before.

salaries_edu <- 
  import_df |> 
  filter(!is.na(salary_max), !is.na(highest_ed)) |> 
  distinct(job_id, .keep_all = TRUE)
  

salaries_edu |> 
  ggplot(aes(x = salary_max)) +
  geom_histogram(binwidth = 25000, fill = '#4E79A7') +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Maximum Salary Distribution vs. Education", x = '', y = '') +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        axis.text = element_text(face = 'bold')) +
  stat_bin(binwidth=25000, geom='text', color='white', size=3,
           aes(label=after_stat(count)), position=position_stack(vjust=0.8)) +
  facet_grid(rows = vars(highest_ed))

Maximum Salary vs. Years of Experience

Splitting the salary distribution by years of experience, we again see the same bimodal shape and the same peaks as before.

salaries_exp <- 
  import_df |> 
  filter(!is.na(salary_max), !is.na(years_exp)) |> 
  distinct(job_id, .keep_all = TRUE) |> 
  mutate(exp_bins = case_when(years_exp <= 2 ~ "0-2",
                              years_exp > 2 & years_exp <= 5 ~ "3-5",
                              years_exp > 5 & years_exp <= 8 ~ "6-8",
                              years_exp > 8  ~ "9+")) |> 
  mutate(exp_bins = factor(exp_bins))


salaries_exp |> 
  ggplot(aes(x = salary_max)) +
  geom_histogram(binwidth = 25000, fill = '#4E79A7') +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Maximum Salary Distribution vs. Years of Experience", x = '', y = '') +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        axis.text = element_text(face = 'bold')) +
  stat_bin(binwidth=25000, geom='text', color='white', size=3,
           aes(label=after_stat(count)), position=position_stack(vjust=0.8)) +
  facet_grid(rows = vars(exp_bins))
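The experience bins built with case_when() above can also be expressed with base R's cut(). This is an alternative sketch, not the method used in the analysis, and the years_exp values are invented to cover each boundary:

```r
library(dplyr)

# Invented values, including the bin edges and one non-integer
years_exp <- c(0, 2, 2.5, 3, 5, 6, 8, 9, 15)

# Same rules as in the analysis
bins_case <- case_when(years_exp <= 2 ~ "0-2",
                       years_exp > 2 & years_exp <= 5 ~ "3-5",
                       years_exp > 5 & years_exp <= 8 ~ "6-8",
                       years_exp > 8 ~ "9+")

# cut() uses right-closed intervals by default: (-Inf, 2], (2, 5], (5, 8], (8, Inf)
bins_cut <- cut(years_exp,
                breaks = c(-Inf, 2, 5, 8, Inf),
                labels = c("0-2", "3-5", "6-8", "9+"))

identical(as.character(bins_cut), bins_case)
```

cut() returns an ordered set of factor levels directly, so the separate conversion to a factor is not needed.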

Education and Years of Experience

Based on the chart, most jobs look for candidates with 3-5 years of experience. The plot also suggests that the distribution of required experience differs by education level. Among postings that ask for a Bachelor’s degree, the distribution is skewed toward the lower end, suggesting that many of these positions require relatively little experience. Among postings that ask for a PhD, the distribution is more evenly spread out, suggesting that these positions have more varied experience requirements.

edu_exp <- 
  import_df |> 
  filter(!is.na(highest_ed), !is.na(years_exp)) |> 
  distinct(job_id, .keep_all = TRUE) |> 
  mutate(exp_bins = case_when(years_exp <= 2 ~ "0-2",
                              years_exp > 2 & years_exp <= 5 ~ "3-5",
                              years_exp > 5 & years_exp <= 8 ~ "6-8",
                              years_exp > 8  ~ "9+")) |> 
  mutate(exp_bins = factor(exp_bins))

edu_exp |> ggplot(aes(exp_bins)) +
  geom_bar(aes(fill = exp_bins)) +
  facet_grid(~highest_ed,
             labeller = label_wrap_gen(width=2, multi_line=TRUE)) +
  labs(title = "Education vs. Years of Experience", 
       x = '', y = '',
       fill = "Years of Experience") +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        axis.text = element_text(face = 'bold'),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        legend.position="top") +
  scale_fill_brewer(palette = 'Set1')
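Because each education level has a different number of postings, raw bar heights can make the within-level distributions hard to compare. A minimal sketch, on invented toy data with the same column names, of computing the share of each experience bin within an education level:

```r
library(dplyr)

# Invented toy data: one row per job posting
toy_edu_exp <- tibble(
  highest_ed = c("bachelor", "bachelor", "bachelor",
                 "master", "master",
                 "phd", "phd", "phd", "phd"),
  exp_bins   = c("0-2", "0-2", "3-5",
                 "3-5", "6-8",
                 "0-2", "3-5", "6-8", "9+")
)

# Count postings per (education, bin), then normalize within education level
exp_share <- toy_edu_exp |>
  count(highest_ed, exp_bins) |>
  group_by(highest_ed) |>
  mutate(share = n / sum(n)) |>
  ungroup()

exp_share
```

The shares sum to 1 within each education level, so a skewed level and an evenly spread level can be compared directly.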

Top 20 Most Frequently Cited Skills

Graph

skill_counts <- import_df |> 
  group_by(skill) |> 
  summarize(count = n()) |> 
  # Exclude generic terms (note: %in% matching is case-sensitive)
  filter(!skill %in% c('engineering', 'data science', 'research', 'programming')) |> 
  arrange(desc(count))

skill_counts |> 
  head(20) |> 
  ggplot(aes(x = count, y = reorder(skill, count))) +
  geom_bar(stat = 'identity', fill = '#4E79A7') +
    labs(title = "Top 20 Most Frequently Cited Skills", x = '', y = '') +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        axis.text = element_text(face = 'bold')) +
  geom_text(aes(label = count), 
            position = position_stack(vjust = 0.9), 
            size = 3,
            fontface = 'bold',
            color = 'white')

Table

knitr::kable(skill_counts |> head(20))

skill	count
Python	2915
SQL	2809
Collaboration	2683
Engineering	2393
Communication	2173
Machine Learning	1865
Data visualization	1590
Data science	1552
Computer Science	1548
Statistics	1414
Research	1326
Programming	1274
Tableau	1195
R	1193
Architecture	1127
Data analysis	1115
Pipelines	1086
AWS	1027
Mathematics	1025
Leadership	1005
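The table above still includes Engineering, Data science, Research, and Programming, even though the preceding filter() lists those terms in lowercase. %in% compares strings exactly, so the capitalized skill names are not matched. A minimal sketch on invented toy counts, assuming the intent was to drop those generic terms regardless of case:

```r
library(dplyr)

# Invented toy counts with capitalized skill names, as in the table
toy_counts <- tibble(
  skill = c("Python", "Engineering", "SQL", "Data science"),
  count = c(5, 4, 3, 2)
)

generic_terms <- c("engineering", "data science", "research", "programming")

# Exact matching: nothing is removed because the case differs
case_sensitive <- toy_counts |>
  filter(!skill %in% generic_terms)

# Lowercasing the skill before matching removes the generic terms
case_insensitive <- toy_counts |>
  filter(!tolower(skill) %in% generic_terms)

case_insensitive
```

Applying this to the real data would change which skills appear in the top 20, so the table and chart would need to be regenerated.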

What are the top 10 most important skills for data science?

Graph

top_10 <- import_df %>%
  count(skill) %>%
  arrange(desc(n)) %>%
  head(n = 10)

top_10 %>%
  ggplot(aes(y = reorder(skill, n), x = n, fill = skill)) +
    geom_bar(stat = "identity") +
    labs(title = "Top 10 Skills for Data Science", x = "Count", y = "Skill") +
    theme(legend.position = "none")

Table

knitr::kable(top_10, col.names = c("Skills", "Count"), format.args = list(big.mark = ","))

Skills	Count
Python	2,915
SQL	2,809
Collaboration	2,683
Engineering	2,393
Communication	2,173
Machine Learning	1,865
Data visualization	1,590
Data science	1,552
Computer Science	1,548
Statistics	1,414

What are the top skills for Data Scientists vs. Data Analysts?

The highlighted bars show the skills that appear in the top 10 for both job titles.

# filter for just data scientists and analysts
data_scientist <- import_df %>% 
  filter(str_detect(job_title, "data scientist")) %>%
  mutate(job = "Data Scientist")
  
data_analyst <- import_df %>% 
  filter(str_detect(job_title, "data analyst")) %>%
  mutate(job = "Data Analyst")

# top 10 for each
data_scientist_top10 <- data_scientist %>%
  count(skill) %>%
  arrange(desc(n)) %>%
  head(n = 10)

data_analyst_top10 <- data_analyst %>%
  count(skill) %>%
  arrange(desc(n)) %>%
  head(n = 10)

common_skills <- data_analyst_top10 |> 
  inner_join(data_scientist_top10, by = 'skill') |> 
  select(skill)

vec_common_skills <- common_skills[['skill']]

data_scientist_top10 <-
  data_scientist_top10 |> mutate(f_color = ifelse(skill %in% vec_common_skills, "Yes", "No"))

data_analyst_top10 <-
  data_analyst_top10 |> mutate(f_color = ifelse(skill %in% vec_common_skills, "Yes", "No"))

# plot top 10
data_scientist_top10 %>%
  ggplot(aes(y = reorder(skill, n), x = n, fill = f_color)) +
    geom_bar(stat = "identity") +
    labs(title = "Top 10 Skills for Data Scientists", x = "Count", y='') +
    theme_bw() +
    theme(legend.position = "none") +
    scale_fill_manual(values = c("Yes" = "#F28E2B", "No" = "gray")) +
    geom_text(aes(label = n), 
            position = position_stack(vjust = 0.9), 
            size = 3,
            fontface = 'bold',
            color = 'black')

data_analyst_top10 %>%
  ggplot(aes(y = reorder(skill, n), x = n, fill = f_color)) +
    geom_bar(stat = "identity") +
    labs(title = "Top 10 Skills for Data Analysts", x = "Count", y = '') +
    theme_bw() +
    theme(legend.position = "none") +
    scale_fill_manual(values = c("Yes" = "#F28E2B", "No" = "gray")) +
    geom_text(aes(label = n), 
            position = position_stack(vjust = 0.9), 
            size = 3,
            fontface = 'bold',
            color = 'black')
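The overlap that drives the highlighting can also be read off with base R set operations. The skill vectors below are invented stand-ins for the two top 10 lists, not results from the data:

```r
# Invented stand-ins for the two top-skill lists
ds_top <- c("Python", "SQL", "Machine Learning", "Statistics")
da_top <- c("SQL", "Tableau", "Python", "Excel")

# Skills present in both lists (the highlighted bars)
shared <- intersect(da_top, ds_top)

# Skills that appear in only one list (the gray bars)
ds_only <- setdiff(ds_top, da_top)
da_only <- setdiff(da_top, ds_top)

shared
```

This gives the same set as the inner_join() on skill used above, without building an intermediate data frame.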

Wordcloud

Sort the skills by the number of citations in descending order and select the 100 most cited. Create the plot using the ggplot2 and ggwordcloud packages.

skill_counts <- skill_counts %>%
  arrange(desc(count)) %>%
  head(100)

ggplot(skill_counts,
       aes(label = skill, size = count,
           # random color group (1-8) per word, for visual variety only
           color = factor(sample.int(8, nrow(skill_counts), replace = TRUE)))) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 10) +
  theme_minimal()

This code generates a word cloud of the most frequently cited skills. It takes skill_counts, the per-skill frequency table computed earlier from import_df, arranges it in descending order, and keeps the top 100 skills. It then uses ggplot() and geom_text_wordcloud() to plot the skill names as words, with the size of each word indicating how often the skill is cited. The colors are assigned at random and carry no meaning.
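Because the color groups come from sample.int(), each render of the report can color the words differently. A small sketch, assuming reproducible colors are wanted, of fixing the random seed before sampling (the seed value is arbitrary):

```r
# Fix the seed so the random color groups are the same on every run
set.seed(2023)
colors_a <- sample.int(8, 10, replace = TRUE)

# Resetting the same seed reproduces the same draw
set.seed(2023)
colors_b <- sample.int(8, 10, replace = TRUE)

identical(colors_a, colors_b)
```

Calling set.seed() once before the ggplot() call would have the same effect on the color assignment in the word cloud.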

Conclusion

A Master’s degree is the most requested education level (498 postings), followed by a Bachelor’s (456) and a PhD (313).

The majority of jobs have a maximum salary between 100,000 and 200,000 USD. Fewer jobs have a maximum salary below 100,000 USD, fewer still fall between 200,000 and 300,000 USD, and only a few exceed 300,000 USD.

The distribution of required years of experience differs across education levels. Among postings that ask for a Bachelor’s degree, the distribution is skewed toward the lower end, suggesting that many of these positions require relatively little experience. Among postings that ask for a PhD, it is more evenly spread out, suggesting more varied experience requirements.

The most frequently cited skills in the job listings are Python (2,915), SQL (2,809), and Collaboration (2,683); Machine Learning ranks sixth (1,865). The top skills cited in the job listings vary depending on the highest education level required and the job location.

Waheeb Algabri, Keith Collela, John Cruz, Shoshana Farber, Kayleah Griffen

March 19, 2023
