Tidying Skills from Job Posts

Research Question:

What specific skills are essential for various data science positions across different industries?

After reviewing several articles to better understand how data science skills can be classified into categories, we’ll compare how those categories apply to the skills found in actual job posts.

We used a Python program to scrape data from a set of job post URLs (see the script on GitHub: https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/Project3/Skill_Search.py) and stored every entry in a .csv file.

Once we have a .csv file containing the list of skills for each job post, we can organize the information further.

Load the needed libraries

library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

And load the data, which in this case is hosted on GitHub

skills_url <- "https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/Project3/skills.csv"  
companies_skills_url <- "https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/Project3/companies_skills.csv"

# Load the CSV files into R data frames
skills_df <- read.csv(skills_url, stringsAsFactors = FALSE)
companies_skills_df <- read.csv(companies_skills_url, header = FALSE, stringsAsFactors = FALSE)

# Display the first few rows of each table to confirm the data was loaded successfully
head(skills_df)
##                skill              category
## 1             Python Programming Languages
## 2                  R Programming Languages
## 3      Data Analysis      Technical Skills
## 4     Data Wrangling      Technical Skills
## 5 Data Visualization      Technical Skills
## 6   Machine Learning      Technical Skills
head(companies_skills_df)
##                                                                                                   V1
## 1 https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&from=searchOnHP&vjk=26a6924bb59873b4
## 2 https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&from=searchOnHP&vjk=7c45500a7feedb5e
## 3 https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&from=searchOnHP&vjk=41b9d84df3cdeb15
## 4 https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&from=searchOnHP&vjk=2bab98cf98ac2c23
## 5 https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&from=searchOnHP&vjk=bc3912859daaf1ef
## 6 https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&from=searchOnHP&vjk=9fd9ee455e6c924b
##                 V2
## 1  10/8/2024 20:22
## 2  10/8/2024 20:26
## 3  10/8/2024 20:56
## 4 10/10/2024 10:46
## 5 10/10/2024 10:48
## 6 10/10/2024 10:49
##                                                                                                                                                                                                                                                                                                                                       V3
## 1                                                                                                                                       R, SQL, Python, Data Visualization, Machine Learning, Statistical Analysis, Data Mining, Hadoop, Spark, Relational Databases, Project Management, Leadership, Git, Java, Data Science, Analytics
## 2                                                                                                                                                                                                  R, SQL, Python, Data Analysis, Data Visualization, Machine Learning, Statistical Analysis, Critical Thinking, Data Science, Analytics
## 3                                                        C, C++, SQL, Python, Data Analysis, Data Visualization, Machine Learning, Statistical Analysis, Spark, Critical Thinking, Tableau, Pandas, Scikit-Learn, Feature Engineering, Web Scraping, Software Development, MLOps, Data Science, Analytics, Automation, Feature Selection
## 4                                                                                                                                                                                                                         Python, Data Analysis, Data Visualization, Machine Learning, Matplotlib, Seaborn, Git, Data Science, Analytics
## 5                                                                           R, C, C++, SQL, Python, Data Visualization, Machine Learning, Statistical Analysis, Predictive Modeling, Big Data, Azure, Communication, Business Intelligence, Excel, Scikit-Learn, TensorFlow, PyTorch, C#, Data Science, Analytics, Mentoring, Innovation
## 6 SQL, Python, Data Visualization, Machine Learning, Natural Language Processing, Neural Networks, Reinforcement Learning, Data Engineering, Azure, Data Modeling, Project Management, Communication, Leadership, Excel, TensorFlow, Keras, PyTorch, Git, MLOps, Data Science, Analytics, Stakeholder Management, Automation, Innovation
##                                         V4                    V5
## 1 Senior Data Scientist, Decision Sciences         NBC Universal
## 2             Staff Data Scientist, Search                Google
## 3           Customer Facing Data Scientist             Explorium
## 4                    Senior Data Scientist West Coast Consulting
## 5                Data Scientist, Corporate        Oak View Group
## 6                    Senior Data Scientist                  Amex
##                        V6
## 1 Media and Entertainment
## 2                    Tech
## 3                    Tech
## 4                    Tech
## 5       Sports and Events
## 6                 Finance

We can see that we have two initial data sets:

skills_df: works as a dictionary mapping each skill to the category it belongs to.

companies_skills_df: the scraped data stored in a .csv, with the columns url, timestamp, skills, position, company, and industry. However, it currently has no headers.

We’ll assign headers to the companies_skills_df table and remove the columns that are not needed for the analysis, namely URL and Timestamp.

# To assign column names to the companies_skills_df table
colnames(companies_skills_df) <- c("URL", "Timestamp", "Skills", "Position", "Company", "Industry")

# And then remove the "URL" and "Timestamp" columns using dplyr's select function
companies_skills_df <- companies_skills_df %>%
  select(-URL, -Timestamp)

To facilitate comparison with the other data set in this project (data science skills according to articles), we’ll use three skill categories instead of four, merging Business Skills into Soft Skills.

# Replace "Business Skills" with "Soft Skills" in the skills_df
skills_df <- skills_df %>%
  mutate(category = ifelse(category == "Business Skills", "Soft Skills", category))

# Check the updated table
head(skills_df)
##                skill              category
## 1             Python Programming Languages
## 2                  R Programming Languages
## 3      Data Analysis      Technical Skills
## 4     Data Wrangling      Technical Skills
## 5 Data Visualization      Technical Skills
## 6   Machine Learning      Technical Skills
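
Since head() only shows the first six rows, an extra check (not part of the original code) could confirm that only three categories remain after the merge:

# Optional sanity check: count how many skills fall into each category
table(skills_df$category)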

To organize the data further, we’ll split the companies_skills_df table into two tables:

company_industry_table: stores each company’s name and the industry it belongs to.

skills_job_post: stores the main information we’ll be working with: position, skills, and company.

# Reordering the columns using dplyr's select function
companies_skills_df <- companies_skills_df %>%
  select(Position, Skills, Company, Industry)

# Table 1: Skills Job Post (without Industry)
skills_job_post <- companies_skills_df %>%
  select(Position, Skills, Company)

# Table 2: Company and Industry Table
company_industry_table <- companies_skills_df %>%
  select(Company, Industry) %>%
  distinct()  # using distinct to ensure there's no duplicated company-industry rows
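
As another optional check, not in the original pipeline, we could look at how many distinct companies fall into each industry; this gives useful context for the industry-level comparison later on:

# Optional check: number of distinct companies per industry
company_industry_table %>%
  count(Industry, sort = TRUE)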

Since the list of skills in the job post table does not distinguish between categories, we’ll split the Skills column into three columns, one per category, to facilitate the analysis:

#First, we make a copy of the "Skills" column into a new column called "Technical_Skills"
skills_job_post <- skills_job_post %>%
  mutate(Technical_Skills = Skills)

#Then we create a list of all Technical Skills from skills_df to compare against
technical_skills_full_list <- skills_df %>%
  filter(category == "Technical Skills") %>%
  pull(skill)

#Comparing the new column and the new list, we filter only the Technical Skills in the "Technical_Skills" column
# We will keep only the skills that are present in both the job post skills and the technical skills list
skills_job_post <- skills_job_post %>%
  rowwise() %>%
  mutate(Technical_Skills = paste(
    intersect(strsplit(Technical_Skills, ",\\s*")[[1]], technical_skills_full_list),
    collapse = ", "
  ))

#Repeat process for Programming Languages
skills_job_post <- skills_job_post %>%
  mutate(Programming_Languages = Skills)

programming_skills_full_list <- skills_df %>%
  filter(category == "Programming Languages") %>%
  pull(skill)

skills_job_post <- skills_job_post %>%
  rowwise() %>%
  mutate(Programming_Languages = paste(
    intersect(strsplit(Programming_Languages, ",\\s*")[[1]], programming_skills_full_list),
    collapse = ", "
  ))

# And repeat the process for Soft Skills
skills_job_post <- skills_job_post %>%
  mutate(Soft_Skills = Skills)

soft_skills_full_list <- skills_df %>%
  filter(category == "Soft Skills") %>%
  pull(skill)

skills_job_post <- skills_job_post %>%
  rowwise() %>%
  mutate(Soft_Skills = paste(
    intersect(strsplit(Soft_Skills, ",\\s*")[[1]], soft_skills_full_list),
    collapse = ", "
  ))

#Remove the original "Skills" column
skills_job_post <- skills_job_post %>%
  select(-Skills)

#View updated table to check our progress
head(skills_job_post)
## # A tibble: 6 × 5
## # Rowwise: 
##   Position            Company Technical_Skills Programming_Languages Soft_Skills
##   <chr>               <chr>   <chr>            <chr>                 <chr>      
## 1 Senior Data Scient… NBC Un… Data Visualizat… R, SQL, Python, Java  "Project M…
## 2 Staff Data Scienti… Google  Data Analysis, … R, SQL, Python        "Critical …
## 3 Customer Facing Da… Explor… Data Analysis, … C++, SQL, Python, Pa… "Critical …
## 4 Senior Data Scient… West C… Data Analysis, … Python, Matplotlib, … ""         
## 5 Data Scientist, Co… Oak Vi… Data Visualizat… R, C++, SQL, Python,… "Communica…
## 6 Senior Data Scient… Amex    Data Visualizat… SQL, Python, TensorF… "Project M…
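
As a design note, the three rowwise blocks above repeat the same intersect-and-paste logic for each category. A small helper could do the same job in one pass; the following is only a sketch, not part of the original pipeline, and the name filter_skills_by_category is ours:

# Minimal sketch (filter_skills_by_category is our own name): keep only the skills
# in a comma-separated string that belong to a given category in skills_df
filter_skills_by_category <- function(skill_string, category_name) {
  category_list <- skills_df %>%
    filter(category == category_name) %>%
    pull(skill)
  paste(intersect(strsplit(skill_string, ",\\s*")[[1]], category_list),
        collapse = ", ")
}

# Applied row by row, this should reproduce the three columns built above
skills_job_post_alt <- companies_skills_df %>%
  select(Position, Skills, Company) %>%
  rowwise() %>%
  mutate(
    Technical_Skills      = filter_skills_by_category(Skills, "Technical Skills"),
    Programming_Languages = filter_skills_by_category(Skills, "Programming Languages"),
    Soft_Skills           = filter_skills_by_category(Skills, "Soft Skills")
  ) %>%
  ungroup() %>%
  select(-Skills)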

Analysis by skill category

Now that we have the skills_job_post table tidied to our liking, we can start creating smaller summary tables and analyzing them.

We can start by creating a summary table for Technical Skills:

#Extract all technical skills from the "Technical_Skills" column, and split them into individual elements
all_technical_skills <- skills_job_post %>%
  pull(Technical_Skills) %>%
  strsplit(",\\s*") %>%
  unlist()

#Create a summary table that counts the occurrences of each technical skill
technical_skills_summary <- as.data.frame(table(all_technical_skills))

#rename the columns for clarity
colnames(technical_skills_summary) <- c("Name", "Count")

#calculate the percentage of occurrences
total_posts <- 50  # Total number of job posts scraped (could also be computed as nrow(skills_job_post))
technical_skills_summary <- technical_skills_summary %>%
  mutate(Percentage = (Count / total_posts) * 100)

#organize the table by Count, in descending order
technical_skills_summary <- technical_skills_summary %>%
  arrange(desc(Count))

#visualize tables
print(technical_skills_summary)
##                           Name Count Percentage
## 1             Machine Learning    50        100
## 2                        Excel    39         78
## 3                  Data Mining    36         72
## 4                Data Analysis    29         58
## 5           Data Visualization    29         58
## 6         Statistical Analysis    29         58
## 7      Artificial Intelligence    22         44
## 8                Deep Learning    22         44
## 9                          Git    20         40
## 10                    Big Data    19         38
## 11                     Tableau    16         32
## 12               Data Modeling    14         28
## 13             Experimentation    14         28
## 14              Data Pipelines    13         26
## 15         Predictive Modeling    13         26
## 16                       Spark    12         24
## 17                 Text Mining    12         24
## 18      Reinforcement Learning    10         20
## 19              Classification     8         16
## 20                  Automation     7         14
## 21            Data Engineering     6         12
## 22                    Power BI     6         12
## 23 Natural Language Processing     5         10
## 24                       Azure     4          8
## 25                  Clustering     4          8
## 26             Data Governance     4          8
## 27                      Hadoop     4          8
## 28        Relational Databases     4          8
## 29        Software Development     4          8
## 30                 A/B Testing     3          6
## 31         Feature Engineering     3          6
## 32            Model Monitoring     3          6
## 33                  Networking     3          6
## 34             API Development     2          4
## 35             Cloud Computing     2          4
## 36           Data Architecture     2          4
## 37               Data Cleaning     2          4
## 38            Data Integration     2          4
## 39            Data Warehousing     2          4
## 40              Data Wrangling     2          4
## 41         Database Management     2          4
## 42           Feature Selection     2          4
## 43                       MLOps     2          4
## 44            Model Deployment     2          4
## 45       Statistical Inference     2          4
## 46               Cybersecurity     1          2
## 47                  Data Lakes     1          2
## 48                      DevOps     1          2
## 49       Google Cloud Platform     1          2
## 50           Gradient Boosting     1          2
## 51          Hypothesis Testing     1          2
## 52          Linear Programming     1          2
## 53                       Linux     1          2
## 54            Model Validation     1          2
## 55             Neural Networks     1          2
## 56            Python Libraries     1          2
## 57        Qualitative Research     1          2
## 58       Quantitative Research     1          2
## 59               Random Forest     1          2
## 60         Regression Analysis     1          2
## 61             Shell Scripting     1          2
## 62        Time Series Analysis     1          2
## 63                        Unix     1          2
## 64             Version Control     1          2
## 65                Web Scraping     1          2

And if we want to visualize it using ggplot:

top_10_tech_skills <- technical_skills_summary %>%
  arrange(desc(Count)) %>%
  head(10)

# Generate the plot
ggplot(top_10_tech_skills, aes(x = Count, y = reorder(Name, Count), fill = Count)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightblue", high = "blue") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.4), hjust = -0.1, color = "white", size = 4) + 
  labs(
    title = "Top 10 Most Common Technical Skills",
    x = "Count of Occurrences",
    y = "Skills"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 16)
  )

What does this tell us? We’ll hold the interpretation until the Findings section, once we’ve looked at the other skill categories and at specific industries.

Now what do we find for Programming Languages?

#Extract all programming languages from the "Programming Languages" column, and split them into individual elements
all_programming_skills <- skills_job_post %>%
  pull(Programming_Languages) %>%
  strsplit(",\\s*") %>%
  unlist()

#Create a summary table that counts the occurrences of each programming language
programming_languages_summary <- as.data.frame(table(all_programming_skills))

#rename the columns for clarity
colnames(programming_languages_summary) <- c("Name", "Count")

#calculate the percentage of occurrences
total_posts <- 50  # Total number of job posts scraped (could also be computed as nrow(skills_job_post))
programming_languages_summary <- programming_languages_summary %>%
  mutate(Percentage = (Count / total_posts) * 100)

#organize the table by Count, in descending order
programming_languages_summary <- programming_languages_summary %>%
  arrange(desc(Count))

#visualize tables
print(programming_languages_summary)
##            Name Count Percentage
## 1        Python    46         92
## 2           SQL    46         92
## 3             R    28         56
## 4    TensorFlow    13         26
## 5           C++     5         10
## 6       PyTorch     5         10
## 7  Scikit-Learn     5         10
## 8          Java     4          8
## 9        MATLAB     4          8
## 10   Matplotlib     4          8
## 11        Keras     3          6
## 12       Pandas     3          6
## 13      Seaborn     3          6
## 14           C#     2          4
## 15        MySQL     2          4
## 16   PostgreSQL     2          4
## 17         Bash     1          2
## 18   JavaScript     1          2
## 19      MongoDB     1          2
## 20        NoSQL     1          2

And the visual for programming languages:

top_10_prog_skills <- programming_languages_summary %>%
  arrange(desc(Count)) %>%
  head(10)

# Generate the plot
ggplot(top_10_prog_skills, aes(x = Count, y = reorder(Name, Count), fill = Count)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "red", high ="darkred") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.4), hjust = -0.1, color = "white", size = 4) +
  labs(
    title = "Top 10 Most Common Programming Languages",
    x = "Count of Occurrences",
    y = "Skills"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 16)
  )

Finally, let’s take a look at Soft Skills:

#Extract all soft skills from the "Soft_Skills" column, and split them into individual elements
all_soft_skills <- skills_job_post %>%
  pull(Soft_Skills) %>%
  strsplit(",\\s*") %>%
  unlist()

#Create a summary table that counts the occurrences of each soft skill
soft_skills_summary <- as.data.frame(table(all_soft_skills))

#rename the columns for clarity
colnames(soft_skills_summary) <- c("Name", "Count")

#calculate the percentage of occurrences
total_posts <- 50  # Total number of job posts scraped (could also be computed as nrow(skills_job_post))
soft_skills_summary <- soft_skills_summary %>%
  mutate(Percentage = (Count / total_posts) * 100)

#organize the table by Count, in descending order
soft_skills_summary <- soft_skills_summary %>%
  arrange(desc(Count))

#visualize tables
print(soft_skills_summary)
##                           Name Count Percentage
## 1                   Leadership    38         76
## 2                Communication    30         60
## 3           Project Management    29         58
## 4            Critical Thinking    26         52
## 5                Collaboration    24         48
## 6                   Innovation    24         48
## 7        Business Intelligence    17         34
## 8          Attention to Detail     7         14
## 9                    Mentoring     4          8
## 10         Presentation Skills     3          6
## 11             Business Acumen     2          4
## 12 Data-Driven Decision Making     2          4
## 13             Problem Solving     2          4
## 14          Team Collaboration     2          4
## 15                    Teamwork     2          4
## 16             Time Management     2          4
## 17                Adaptability     1          2
## 18       Customer Segmentation     1          2
## 19           Data Storytelling     1          2
## 20           Knowledge Sharing     1          2
## 21                 Negotiation     1          2
## 22                       Scrum     1          2
## 23      Stakeholder Management     1          2

And the plot for soft skills:

top_10_soft_skills <- soft_skills_summary %>%
  arrange(desc(Count)) %>%
  head(10)

# Generate the plot
ggplot(top_10_soft_skills, aes(x = Count, y = reorder(Name, Count), fill = Count)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightgreen", high ="darkgreen") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.4), hjust = -0.1, color = "white", size = 4) + # Add percentage inside bars
  labs(
    title = "Top 10 Most Common Soft Skills",
    x = "Count of Occurrences",
    y = "Skills"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 16)
  )
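
The count, percentage, and sort steps above were written out separately for each category, and the same steps are reused for the industry-level analysis below. As a design note, they could be wrapped in a small helper; the following is only a sketch under that assumption, and the name summarize_skill_column is ours:

# Minimal sketch (summarize_skill_column is our own name): split a comma-separated
# skill column, count each skill, and add a percentage based on the number of rows
# (job posts) in the supplied data frame
summarize_skill_column <- function(df, column_name) {
  all_skills <- df %>%
    pull({{ column_name }}) %>%
    strsplit(",\\s*") %>%
    unlist()

  skill_summary <- as.data.frame(table(all_skills))
  colnames(skill_summary) <- c("Name", "Count")

  skill_summary %>%
    mutate(Percentage = (Count / nrow(df)) * 100) %>%
    arrange(desc(Count))
}

# For example, this should reproduce the soft skills summary above:
# summarize_skill_column(skills_job_post, Soft_Skills)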

Analysis of Two Industries

Our previous findings are based on the data for all job posts. But what if we wanted to apply a similar analysis only to job posts from companies in specific industries, such as Technology and Healthcare?

Let’s focus on the Tech Industry first:

# Merge the skills_job_post table with company_industry_table to add the Industry column
skills_job_post_with_industry <- merge(skills_job_post, company_industry_table, by = "Company")

# Filter for Technology industry
skills_technology <- skills_job_post_with_industry %>%
  filter(Industry == "Tech")

# Extract all technical skills from the "Technical_Skills" column for the Technology industry
all_technical_skills_technology <- skills_technology %>%
  pull(Technical_Skills) %>%
  strsplit(",\\s*") %>%
  unlist()

# Create a summary table that counts the occurrences of each technical skill in the Technology industry
technical_skills_summary_technology <- as.data.frame(table(all_technical_skills_technology))

# Rename the columns for clarity
colnames(technical_skills_summary_technology) <- c("Name", "Count")

# Calculate the percentage of occurrences for technical skills in Technology industry
total_technology_posts <- nrow(skills_technology)  # Total number of job posts in Technology industry
technical_skills_summary_technology <- technical_skills_summary_technology %>%
  mutate(Percentage = (Count / total_technology_posts) * 100) %>%
  arrange(desc(Count))

# Extract all programming languages from the "Programming_Languages" column for Technology industry
all_programming_skills_technology <- skills_technology %>%
  pull(Programming_Languages) %>%
  strsplit(",\\s*") %>%
  unlist()

# Create a summary table for programming languages in Technology
programming_languages_summary_technology <- as.data.frame(table(all_programming_skills_technology))

# Rename columns and calculate percentages for programming languages in Technology industry
colnames(programming_languages_summary_technology) <- c("Name", "Count")
programming_languages_summary_technology <- programming_languages_summary_technology %>%
  mutate(Percentage = (Count / total_technology_posts) * 100) %>%
  arrange(desc(Count))

# Extract all soft skills from the "Soft_Skills" column for Technology industry
all_soft_skills_technology <- skills_technology %>%
  pull(Soft_Skills) %>%
  strsplit(",\\s*") %>%
  unlist()

# Create a summary table for soft skills in Technology
soft_skills_summary_technology <- as.data.frame(table(all_soft_skills_technology))

# Rename columns and calculate percentages for soft skills in Technology industry
colnames(soft_skills_summary_technology) <- c("Name", "Count")
soft_skills_summary_technology <- soft_skills_summary_technology %>%
  mutate(Percentage = (Count / total_technology_posts) * 100) %>%
  arrange(desc(Count))

# Select the top 10 most common technical skills in Technology
top_10_tech_skills_technology <- technical_skills_summary_technology %>%
  arrange(desc(Count)) %>%
  head(10)

# Select the top 10 most common programming languages in Technology
top_10_programming_languages_technology <- programming_languages_summary_technology %>%
  arrange(desc(Count)) %>%
  head(10)

# Select the top 10 most common soft skills in Technology
top_10_soft_skills_technology <- soft_skills_summary_technology %>%
  arrange(desc(Count)) %>%
  head(10)

# Generate the plots for Technology industry
ggplot(top_10_tech_skills_technology, aes(x = Count, y = reorder(Name, Count), fill = Count)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightblue", high = "blue") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.4), hjust = -0.1, color = "white", size = 4) + 
  labs(
    title = "Most Common Technical Skills in Technology Industry",
    x = "Count of Occurrences",
    y = "Skills"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 16)
  )

ggplot(top_10_programming_languages_technology, aes(x = Count, y = reorder(Name, Count), fill = Count)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "red", high = "darkred") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.4), hjust = -0.1, color = "white", size = 4) + 
  labs(
    title = "Most Common Programming Languages in Technology Industry",
    x = "Count of Occurrences",
    y = "Skills"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 16)
  )

ggplot(top_10_soft_skills_technology, aes(x = Count, y = reorder(Name, Count), fill = Count)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.4), hjust = -0.1, color = "white", size = 4) + 
  labs(
    title = "Most Common Soft Skills in Technology Industry",
    x = "Count of Occurrences",
    y = "Skills"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 16)
  )
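
The bar-plot layout is also repeated for every category and industry, changing only the data, title, and color gradient. It could be captured in a small plotting function; the following is only a sketch, not part of the original code, and the name plot_top_skills and its arguments are ours:

# Minimal sketch (plot_top_skills is our own name): the bar-plot layout used
# throughout, parameterized by summary table, title, fill colors, and top N
plot_top_skills <- function(summary_df, plot_title,
                            low_col = "lightblue", high_col = "blue",
                            top_n = 10) {
  summary_df %>%
    arrange(desc(Count)) %>%
    head(top_n) %>%
    ggplot(aes(x = Count, y = reorder(Name, Count), fill = Count)) +
    geom_bar(stat = "identity") +
    scale_fill_gradient(low = low_col, high = high_col) +
    geom_text(aes(label = paste0(round(Percentage, 1), "%")),
              position = position_stack(vjust = 0.4), hjust = -0.1,
              color = "white", size = 4) +
    labs(title = plot_title, x = "Count of Occurrences", y = "Skills") +
    theme_minimal() +
    theme(
      axis.text.y = element_text(size = 12),
      axis.title.x = element_text(size = 14),
      axis.title.y = element_text(size = 14),
      plot.title = element_text(hjust = 0.5, size = 16)
    )
}

# For example:
# plot_top_skills(technical_skills_summary_technology,
#                 "Most Common Technical Skills in Technology Industry")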

And let’s see the Healthcare Industry:

# Merge the skills_job_post table with company_industry_table to add the Industry column
skills_job_post_with_industry <- merge(skills_job_post, company_industry_table, by = "Company")

# Filter for a specific industry, for example, "Healthcare"
skills_healthcare <- skills_job_post_with_industry %>%
  filter(Industry == "Healthcare")

# Extract all technical skills from the "Technical_Skills" column for the Healthcare industry
all_technical_skills_healthcare <- skills_healthcare %>%
  pull(Technical_Skills) %>%
  strsplit(",\\s*") %>%
  unlist()

# Create a summary table that counts the occurrences of each technical skill in the Healthcare industry
technical_skills_summary_healthcare <- as.data.frame(table(all_technical_skills_healthcare))

# Rename the columns for clarity
colnames(technical_skills_summary_healthcare) <- c("Name", "Count")

# Calculate the percentage of occurrences
total_healthcare_posts <- nrow(skills_healthcare)  # Total number of job posts in Healthcare industry
technical_skills_summary_healthcare <- technical_skills_summary_healthcare %>%
  mutate(Percentage = (Count / total_healthcare_posts) * 100)

# Organize the table by Count, in descending order
technical_skills_summary_healthcare <- technical_skills_summary_healthcare %>%
  arrange(desc(Count))

# Extract all programming languages from the "Programming_Languages" column for Healthcare industry
all_programming_skills_healthcare <- skills_healthcare %>%
  pull(Programming_Languages) %>%
  strsplit(",\\s*") %>%
  unlist()

# Create a summary table for programming languages in Healthcare
programming_languages_summary_healthcare <- as.data.frame(table(all_programming_skills_healthcare))

# Rename columns and calculate percentages
colnames(programming_languages_summary_healthcare) <- c("Name", "Count")
programming_languages_summary_healthcare <- programming_languages_summary_healthcare %>%
  mutate(Percentage = (Count / total_healthcare_posts) * 100) %>%
  arrange(desc(Count))

# Extract all soft skills from the "Soft_Skills" column for Healthcare industry
all_soft_skills_healthcare <- skills_healthcare %>%
  pull(Soft_Skills) %>%
  strsplit(",\\s*") %>%
  unlist()

# Create a summary table for soft skills in Healthcare
soft_skills_summary_healthcare <- as.data.frame(table(all_soft_skills_healthcare))

# Rename columns and calculate percentages
colnames(soft_skills_summary_healthcare) <- c("Name", "Count")
soft_skills_summary_healthcare <- soft_skills_summary_healthcare %>%
  mutate(Percentage = (Count / total_healthcare_posts) * 100) %>%
  arrange(desc(Count))

# Select the top 10 most common technical skills in Healthcare
top_10_tech_skills_healthcare <- technical_skills_summary_healthcare %>%
  arrange(desc(Count)) %>%
  head(10)

# Select the top 10 most common programming languages in Healthcare
top_10_programming_languages_healthcare <- programming_languages_summary_healthcare %>%
  arrange(desc(Count)) %>%
  head(10)

# Select the top 10 most common soft skills in Healthcare
top_10_soft_skills_healthcare <- soft_skills_summary_healthcare %>%
  arrange(desc(Count)) %>%
  head(10)

# Generate the plots for Healthcare industry
ggplot(top_10_tech_skills_healthcare, aes(x = Count, y = reorder(Name, Count), fill = Count)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightblue", high = "blue") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.4), hjust = -0.1, color = "white", size = 4) + 
  labs(
    title = "Most Common Technical Skills in Healthcare Industry",
    x = "Count of Occurrences",
    y = "Skills"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 16)
  )

ggplot(top_10_programming_languages_healthcare, aes(x = Count, y = reorder(Name, Count), fill = Count)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "red", high = "darkred") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.4), hjust = -0.1, color = "white", size = 4) + 
  labs(
    title = "Most Common Programming Languages in Healthcare Industry",
    x = "Count of Occurrences",
    y = "Skills"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 16)
  )

ggplot(top_10_soft_skills_healthcare, aes(x = Count, y = reorder(Name, Count), fill = Count)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.4), hjust = -0.1, color = "white", size = 4) + 
  labs(
    title = "Most Common Soft Skills in Healthcare Industry",
    x = "Count of Occurrences",
    y = "Skills"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 16)
  )
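
Since the Technology and Healthcare pipelines repeat the same extraction and summary steps, adding more industries would be easier with a wrapper around the helpers sketched earlier. Again, this is only a sketch under those assumptions, and the name summarize_industry is ours:

# Minimal sketch (summarize_industry is our own name): filter the merged table to
# one industry and build the three category summaries, reusing the
# summarize_skill_column helper sketched earlier
summarize_industry <- function(industry_name) {
  industry_df <- skills_job_post_with_industry %>%
    filter(Industry == industry_name)

  list(
    technical   = summarize_skill_column(industry_df, Technical_Skills),
    programming = summarize_skill_column(industry_df, Programming_Languages),
    soft        = summarize_skill_column(industry_df, Soft_Skills)
  )
}

# For example:
# finance_summaries <- summarize_industry("Finance")
# head(finance_summaries$technical, 10)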

Findings

Technical Skills: Machine Learning is highly sought after, appearing in 100% of the job posts for both industries. Other common skills include Data Analysis, Excel, Data Mining, and Data Visualization, each appearing in over 50% of the job posts. The focus is on both core analytical skills and the tools required to manage large data sets, with Artificial Intelligence, Git, and Deep Learning also significant in technology. In healthcare, skills like Data Modeling, Statistical Analysis, and Deep Learning carry a higher relative importance, and Classification also shows up, indicating that more specific modeling techniques are valued there. Big Data is also more prominent in healthcare, suggesting that handling massive healthcare datasets is critical in that sector.

Programming Languages: Python and SQL dominate the technology sector, appearing in almost all job posts, while R appears in just over half of them. This shows the strong demand for versatile languages like Python and SQL, with TensorFlow and MATLAB serving as tools for machine learning tasks. In healthcare, SQL is required in 100% of the posts, likely due to the need to manage large databases. Python is almost equally important, but R appears more frequently in healthcare than in technology, which may reflect its statistical capabilities.

Soft Skills: Critical Thinking emerges as the most important soft skill in technology, appearing in over 66% of job posts. Leadership and Project Management are also highly valued, emphasizing the need for strong team management and project execution. In healthcare, soft skills like Project Management, Leadership, and Innovation appear in 80% of job postings, showing a need for individuals who can manage healthcare projects and lead teams effectively. Interestingly, Critical Thinking appears less frequently (60%) in healthcare than in technology, but Business Intelligence is equally important, suggesting a need for people who understand both the medical and business sides of healthcare.

Conclusions

  • Overall, both industries emphasize Machine Learning, but healthcare places greater emphasis on data modeling, classification, and handling large healthcare datasets. Tools like Git and more specialized machine learning techniques seem more relevant to the technology sector.
  • SQL is the dominant language in healthcare, but Python is essential across both industries. Healthcare shows a higher demand for R due to its statistical capabilities.
  • Leadership and project management are highly valued across both industries, but healthcare emphasizes communication, collaboration, and innovation slightly more than technology.