Data607: Project 3 - Data Science Skills

Authors

Anthony Roman, John Ferrara, Alinzon Simon, Akeem Lawrence, Ben Wolin

Introduction

The aim of this project is to identify which data science skills are most in demand in today's booming labor market. Data science has become a highly important field across all sectors, where insights drawn from available data power innovation and decision-making. Increasingly, companies look to data scientists to make sense of vast datasets, develop predictive models, and deliver actionable insights that inform their business strategies.

With the growth of the data science field, the required skills are also changing; they now range from knowing programming languages like Python and R to machine learning, advanced data visualization techniques, cloud computing, and big data analytics. Understanding which skills are considered valuable helps aspiring data scientists and professionals currently working in the field orient their development efforts toward market needs.

The project will analyze current job postings to determine the key skills in demand today and how professionals and organizations can remain competitive within this rapidly growing industry. By studying trends across different regions and sectors, we can learn how specific skills are valued differently depending on the industry or location.

The data used in this project was obtained from Kaggle, a platform for predictive modeling and analytics competitions. The dataset contains job postings for data science positions, including information on the job title, location, company, job description, and required skills. The dataset was collected from Indeed.com, a popular job search engine, and contains job postings from various countries and industries.

The overall approach to the analysis is as follows:

  1. Data Collection: The dataset was obtained from Kaggle and loaded into R for analysis.
  2. Data Cleaning: The dataset was cleaned to remove missing values and standardize the format of the data.
  3. Word Tokenization: The job descriptions were tokenized to extract the skills required for each job posting.
  4. Word Classification: The skills were classified into categories such as programming languages, machine learning, and data visualization.
  5. Data Analysis: The skills were analyzed to determine the most in-demand skills in the data science field.
  6. Visualization: The results were visualized using bar charts and word clouds to highlight the key skills in demand.

Loading Packages

The following packages are used for this project:

  • readr: For reading rectangular data files.
  • RCurl: For fetching the raw data files over HTTP.
  • stringr: For string manipulation.
  • dplyr: For data manipulation.
  • tidyr: For reshaping data (pivoting and unnesting).
  • tidyverse: Meta-package that loads the core tidyverse packages.
  • ggplot2: For data visualization.
  • kableExtra: For styling tables.
  • knitr: For report generation and kable() tables.
  • wordcloud: For creating word clouds.
  • tm: For text mining.
  • ggwordcloud: For ggplot2-based word clouds.
  • tidytext: For tidy text mining.
  • colorspace: For color palettes.
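
The packages are loaded at the top of the document. A representative setup chunk (a sketch of the calls implied by the list above) is:

# Load the packages used throughout the analysis
library(readr)       # reading rectangular data
library(RCurl)       # getURL() for fetching raw CSVs
library(stringr)     # string manipulation
library(dplyr)       # data manipulation
library(tidyr)       # reshaping (pivot_longer, unnest)
library(tidyverse)   # core tidyverse meta-package
library(ggplot2)     # plotting
library(kableExtra)  # table styling
library(knitr)       # kable() and report generation
library(wordcloud)   # classic word clouds
library(tm)          # text mining (Corpus, tm_map)
library(ggwordcloud) # geom_text_wordcloud_area()
library(tidytext)    # tidy text mining
library(colorspace)  # rainbow_hcl() palettes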

Data Collection

The dataset used in this project was obtained from Kaggle and contains job postings for data science positions. The dataset was collected from Indeed.com and includes information on the job title, location, company, job description, and required skills. The raw data consists of three files: job_postings.csv, job_skills.csv, and job_summary.csv.

The raw data contains information such as the locations of the entities hiring, the companies performing the hiring, and the job titles for the open positions, along with additional details about each role. Further documentation and the dataset itself are available on Kaggle. Lastly, the dataset files and their respective column names can be found in Table 1 below.

Table 1: Dataset Files and Columns

File Name      Columns
job_postings   job_link, last_processed_time, last_status, got_summary, got_ner, is_being_worked, job_title, company, job_location, first_seen, search_city, search_country, search_position, job_level, job_type
job_skills     job_link, job_skills
job_summary    job_link, job_summary

Structuring of the Data

The data was structured into three tables for analysis:

  1. job_postings: Contains information about the job postings, including the job title, company, location, and job description.
  2. job_skills: Contains information about the skills required for each job posting.
  3. job_summary: Contains a summary of the job description for each job posting.
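
Because the three tables share job_link as a common key, the joins performed later assume that job_link uniquely identifies each row. Once the files are loaded (see Data Cleaning below), that assumption can be verified with a short sketch like the following, using the dataframes jpdf, jskdf, and jssdf defined there:

# Sketch: confirm job_link uniquely identifies rows in each table, so that
# the left joins on job_link performed later are one-to-one
c(
  postings  = n_distinct(jpdf$job_link)  == nrow(jpdf),
  skills    = n_distinct(jskdf$job_link) == nrow(jskdf),
  summaries = n_distinct(jssdf$job_link) == nrow(jssdf)
)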

The relationships among the three tables are illustrated in the diagram below:

Proposed Database Table Structures

Sources

The raw data files are hosted in the project's GitHub repository (the full URLs appear in the Data Cleaning code below); the original dataset is available on Kaggle.

Data Cleaning

The data was cleaned to remove missing values and standardize the format of the data. The following steps were taken to clean the data:

  1. Remove Missing Values: Rows with missing values were removed from the dataset.
  2. Standardize Format: The format of the data was standardized to ensure consistency across the dataset.
  3. Remove Duplicates: Duplicate rows were removed from the dataset (a sketch of this step appears after the tables are combined below).
  4. Remove Special Characters: Special characters were removed from the data to ensure accurate analysis (also sketched below).

The cleaned data was then used for further analysis to determine the key skills in demand in the data science field.

# Load the raw data
jobpostings <- getURL("https://raw.githubusercontent.com/jhnboyy/DATA607_Project3_FALL2024/refs/heads/main/data/raw/job_postings.csv")
jobskills <- getURL("https://raw.githubusercontent.com/jhnboyy/DATA607_Project3_FALL2024/refs/heads/main/data/raw/job_skills.csv")
jobsummary <- getURL("https://raw.githubusercontent.com/jhnboyy/DATA607_Project3_FALL2024/refs/heads/main/data/raw/job_summary.csv")

# Read the downloaded text into dataframes (read.csv already returns a
# data frame, so no extra data.frame() wrapper is needed)
jpdf <- read.csv(text = jobpostings, stringsAsFactors = FALSE, check.names = FALSE)
jskdf <- read.csv(text = jobskills, stringsAsFactors = FALSE, check.names = FALSE)
jssdf <- read.csv(text = jobsummary, stringsAsFactors = FALSE, check.names = FALSE)

Next, we check each dataframe for missing values.

# Check for missing values in the entire dataset
sum(is.na(jpdf))  # For job postings data
[1] 0
sum(is.na(jskdf)) # For job skills data
[1] 0
sum(is.na(jssdf)) # For job summary data
[1] 0
# Identify which columns have missing values
colSums(is.na(jpdf))
           job_link last_processed_time         last_status         got_summary 
                  0                   0                   0                   0 
            got_ner     is_being_worked           job_title             company 
                  0                   0                   0                   0 
       job_location          first_seen         search_city      search_country 
                  0                   0                   0                   0 
    search_position           job_level            job_type 
                  0                   0                   0 
colSums(is.na(jskdf))
  job_link job_skills 
         0          0 
colSums(is.na(jssdf))
   job_link job_summary 
          0           0 
# Check the column names in each dataframe
colnames(jpdf)
 [1] "job_link"            "last_processed_time" "last_status"        
 [4] "got_summary"         "got_ner"             "is_being_worked"    
 [7] "job_title"           "company"             "job_location"       
[10] "first_seen"          "search_city"         "search_country"     
[13] "search_position"     "job_level"           "job_type"           
colnames(jssdf)
[1] "job_link"    "job_summary"
colnames(jskdf)
[1] "job_link"   "job_skills"

After checking, no missing values were found in any of the dataframes, so no rows needed to be removed. The next step is to combine the three dataframes into a single dataframe labeled combined_data, using job_link as the common key. The combined dataframe also gets a job_id column that serves as a unique identifier for each job posting.

# Combine the dataframes using 'job_link' as the common key
combined_data <- jpdf %>%
  left_join(jssdf, by = "job_link") %>%
  left_join(jskdf, by = "job_link")

# Assign a unique job_id for each row in the combined data
combined_data <- combined_data %>%
  mutate(job_id = row_number())
# Reorder the columns so that job_id is the first column
combined_data <- combined_data %>%
  select(job_id, everything())
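
Cleaning steps 3 and 4 from the list above (duplicate and special-character removal) are not shown explicitly; a minimal sketch against combined_data, assuming stringr is loaded, might look like this:

# Step 3 (sketch): drop duplicate postings, keeping one row per job_link
combined_data <- combined_data %>%
  distinct(job_link, .keep_all = TRUE)

# Step 4 (sketch): replace special characters in the summary text with spaces,
# keeping "+" and "#" so terms like "c++" and "c#" survive
combined_data <- combined_data %>%
  mutate(job_summary = str_replace_all(job_summary, "[^[:alnum:][:space:]+#.,;:()/-]", " "))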

We will now proceed to tokenize the job descriptions to extract the skills required for each job posting.

Word Tokenization

The job descriptions were tokenized to extract the skills required for each job posting. The following steps were taken to tokenize the job descriptions:

  1. Tokenization: The job descriptions were tokenized to extract individual words.
  2. Remove Stopwords: Common words such as “and,” “the,” and “is” were removed from the tokenized words.
  3. Remove Punctuation: Punctuation marks were removed from the tokenized words.
  4. Convert to Lowercase: The tokenized words were converted to lowercase for consistency.

The tokenized words were then used to extract the skills required for each job posting.

# Tokenize the job descriptions

# Create a corpus from the job descriptions
corpus <- Corpus(VectorSource(combined_data$job_summary))

# Convert the corpus to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
transformation drops documents
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
documents
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
documents
# Remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("en"))
Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("en")):
transformation drops documents
# Create a document term matrix
dtm <- DocumentTermMatrix(corpus)

# Convert the document term matrix to a matrix
m <- as.matrix(dtm)

# Get the word frequency
word_freq <- colSums(m)

# Convert the word frequency to a data frame
word_freq_df <- data.frame(word = names(word_freq), freq = word_freq)

# Order the data frame by frequency
word_freq_df <- word_freq_df[order(-word_freq_df$freq), ]

# Check the top 10 words by frequency
head(word_freq_df, 10)
                 word   freq
data             data 122816
experience experience  65779
will             will  36024
work             work  35060
team             team  29087
business     business  28372
show             show  24670
skills         skills  22715
years           years  21511
management management  20956
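
For comparison, the same counts can be produced with tidytext (loaded above). A sketch follows; note that tidytext's stop_words lexicon is broader than tm's stopwords("en"), so the resulting counts may differ slightly:

# Alternative tokenization with tidytext (a sketch)
word_freq_tidy <- combined_data %>%
  select(job_id, job_summary) %>%
  unnest_tokens(word, job_summary) %>%        # lowercases and strips punctuation
  filter(!str_detect(word, "^[0-9]+$")) %>%   # drop tokens that are only digits
  anti_join(stop_words, by = "word") %>%      # remove English stopwords
  count(word, sort = TRUE)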

Word Classification

The skills extracted from the job descriptions were classified into categories such as programming languages, machine learning, and data visualization. The following steps were taken to classify the skills:

  1. Create Skill Categories: Categories such as programming languages, machine learning, and data visualization were created to classify the skills.
  2. Match Skills to Categories: The skills extracted from the job descriptions were matched to the corresponding categories.
  3. Count Skills by Category: The skills were counted by category to determine the most in-demand skills in each category.

The skills were then analyzed to determine the most in-demand skills in the data science field.

# Create a list of programming languages
programming_languages <- c("python", "r", "java", "c++", "c#", "javascript", "sql", "scala", "ruby", "perl", "php", "swift", "kotlin", "typescript", "go", "rust", "dart")

# Create a list of machine learning skills
machine_learning <- c("machine learning", "deep learning", "neural networks", "artificial intelligence", "natural language processing", "reinforcement learning", "supervised learning", "unsupervised learning", "semi-supervised learning", "ensemble learning", "transfer learning", "deep reinforcement learning", "deep neural networks", "convolutional neural networks", "recurrent neural networks", "generative adversarial networks", "support vector machines", "random forests", "decision trees", "gradient boosting", "xgboost", "lightgbm", "catboost", "k-means clustering", "hierarchical clustering", "dbscan", "apriori", "frequent pattern mining", "association rule learning", "collaborative filtering", "content-based filtering", "matrix factorization", "recommender systems", "anomaly detection", "time series forecasting", "sequence prediction", "image recognition", "object detection", "semantic segmentation", "instance segmentation", "image classification", "image generation", "image synthesis", "image super-resolution", "image denoising", "image inpainting", "image captioning", "image style transfer", "image translation", "image segmentation", "image registration", "image restoration", "image enhancement", "image compression", "image processing", "image analysis", "image understanding", "image interpretation")

# Create a list of data visualization skills
data_visualization <- c("data visualization", "data analysis", "data exploration", "data interpretation", "data presentation", "data storytelling", "data reporting", "data communication", "data visualization tools", "data visualization techniques", "data visualization best practices", "data visualization libraries", "data visualization frameworks", "data visualization software", "data visualization platforms", "data visualization dashboards", "data visualization charts", "data visualization graphs", "data visualization maps", "data visualization tables", "data visualization infographics", "data visualization reports", "data visualization insights", "data visualization trends", "data visualization patterns", "data visualization principles", "data visualization guidelines", "data visualization standards", "data visualization design", "data visualization aesthetics", "data visualization color theory", "data visualization typography", "data visualization layout", "data visualization composition", "data visualization hierarchy", "data visualization alignment", "data visualization contrast", "data visualization proximity", "data visualization repetition", "data visualization scale", "data visualization size", "data visualization shape", "data visualization texture", "data visualization value", "data visualization color", "data visualization form", "data visualization space", "data visualization motion", "data visualization pattern", "data visualization rhythm", "data visualization unity", "data visualization balance", "data visualization emphasis", "data visualization variety", "data visualization harmony", "data visualization contrast", "data visualization proportion", "data visualization movement", "data visualization direction", "data visualization emphasis", "data visualization variety", "data visualization harmony", "data visualization contrast", "data visualization proportion", "data visualization movement", "data visualization direction", "data visualization emphasis", "data visualization variety", "data visualization harmony", "data visualization contrast", "data visualization proportion", "data visualization movement", "data visualization direction", "data visualization emphasis", "data visualization variety", "data visualization harmony", "data visualization contrast", "data visualization proportion", "data visualization movement", "data visualization direction", "data visualization emphasis", "data visualization variety", "data visualization harmony", "data visualization contrast", "data visualization proportion", "data visualization movement", "data visualization direction", "data visualization emphasis", "data visualization variety", "data visualization harmony", "data visualization contrast", "data visualization proportion", "data visualization movement", "data visualization direction", "data visualization emphasis", "data visualization variety", "data visualization harmony", "data visualization contrast", "data visualization proportion", "data visualization")

# Classify the skills into categories.
# A plain alternation is fragile here: "r" would match any letter r inside a
# word, and "c++" contains regex metacharacters. This small helper (assuming
# stringr >= 1.5.0 for str_escape()) escapes each term and adds boundary
# lookarounds so terms only match as standalone tokens.
build_pattern <- function(skills) {
  paste0("(?<![a-z0-9+#])(", paste(str_escape(skills), collapse = "|"), ")(?![a-z0-9+#])")
}

combined_data <- combined_data %>%
  mutate(programming_language = str_extract(tolower(job_summary), build_pattern(programming_languages)),
         machine_learning = str_extract(tolower(job_summary), build_pattern(machine_learning)),
         data_visualization = str_extract(tolower(job_summary), build_pattern(data_visualization)))
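
As a quick sanity check, the share of postings that matched at least one term in each category can be inspected (a sketch):

# Fraction of postings with a match in each category
combined_data %>%
  summarise(across(c(programming_language, machine_learning, data_visualization),
                   ~ mean(!is.na(.x))))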

Data Analysis

The skills were analyzed to determine the most in-demand skills in the data science field. The following steps were taken to analyze the skills:

  1. Count Skills by Category: The skills were counted by category to determine the most in-demand skills in each category.
  2. Visualize Skills: The results were visualized using bar charts and word clouds to highlight the key skills in demand.

The analysis was conducted to identify the key skills in demand in the data science field.

# Count the number of job postings by skill category
skill_counts <- combined_data %>%
  select(job_id, programming_language, machine_learning, data_visualization) %>%
  pivot_longer(cols = -job_id, names_to = "skill_category", values_to = "skill") %>%
  filter(!is.na(skill)) %>%
  group_by(skill_category, skill) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
`summarise()` has grouped output by 'skill_category'. You can override using
the `.groups` argument.

Visualization

The results were visualized using bar charts and word clouds to highlight the key skills in demand. The following visualizations were created to showcase the key skills in the data science field:

  1. Bar Chart: A bar chart was created to show the skills in demand by category.
  2. Word Cloud: A word cloud was created to visualize the most in-demand skills in the data science field.

The visualizations provide insights into the key skills required for data science positions.

top_skills <- skill_counts %>%
  arrange(desc(count)) %>%
  head(20)

top_skills %>%
  ggplot(aes(x = reorder(skill, count), y = count, fill = skill_category)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Top 20 Skills in Demand by Category",
       x = "Skill",
       y = "Count",
       fill = "Skill Category") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = "Set3")

The top 20 skills fall under three main categories: programming languages, machine learning, and data visualization. Among programming languages, Python, R, and SQL occupy the top positions, confirming their pivotal role in data manipulation, analysis, and database management. In machine learning, terms such as Machine Learning, Artificial Intelligence, and Reinforcement Learning rank highly, underscoring the growing importance of predictive modeling and advanced analytics. Tableau leads the visualization category, showing how important it is for data practitioners to turn hard-to-view data into understandable visualizations. The takeaway from this breakdown is that data science is a deeply interdisciplinary field in which technical, analytical, and communication skills interweave.

# Split job skills, count occurrences, and visualize
skills_series <- combined_data %>%
  filter(!is.na(job_skills)) %>%
  mutate(job_skills = strsplit(as.character(job_skills), ",")) %>%
  unnest(job_skills) %>%
  mutate(job_skills = trimws(job_skills)) 

skill_counts <- skills_series %>%
  count(job_skills, sort = TRUE)

# Standardize similar skill names (e.g., combine "Data Analytics" and "Data Analysis")
skills_series <- skills_series %>%
  mutate(job_skills = case_when(
    job_skills %in% c("Data Analytics", "Data analysis") ~ "Data Analysis",
    TRUE ~ job_skills
  ))

# Recount the occurrences after standardization
skill_counts <- skills_series %>%
  count(job_skills, sort = TRUE)

# Plot top skills
top_skills <- head(skill_counts, 20)
ggplot(top_skills, aes(x = reorder(job_skills, n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Data Science Skills", x = "Skills", y = "Frequency")

This bar chart gives a more structured breakdown of the top 20 most frequent data science skills. Python and SQL head the chart, reinforcing their integral place in any data scientist's toolkit. Data Analysis, Communication, and Machine Learning come next, underlining the dual importance of analytical skill and the ability to clearly present findings to non-technical stakeholders. Tools like Tableau, AWS, and R underscore the importance of data visualization and cloud computing. The combination of hard skills, such as Spark and Data Engineering, and soft skills, such as Teamwork and Problem Solving, indicates that successful data scientists bring a variety of competencies to their roles.

# Use the skill_counts dataframe, where 'job_skills' is the word and 'n' is the frequency
# Use the top 50 skills to generate a word cloud
top_skills <- skill_counts %>%
  head(50)  # Use the top 50 skills for the word cloud

# Generate a dynamic color palette with the same number of colors as unique skills
color_count <- length(unique(top_skills$job_skills))
color_palette <- rainbow_hcl(color_count)  # Generate dynamic colors

# Create the word cloud using dynamic colors
top_skills %>%
  ggplot(aes(label = job_skills, size = n, color = job_skills)) +  # Assign color to each word
  geom_text_wordcloud_area() +  # Use the word cloud function for better layout
  scale_size_area(max_size = 25) +  # Adjust the max size of the words (play with this for the right scale)
  scale_color_manual(values = color_palette) +  # Use the generated color palette
  theme_minimal() +  # Use a minimal theme
  labs(title = "Top In-Demand Skills") +  # Add a title
  theme(plot.title = element_text(hjust = 0.5))  # Center the title

The word cloud above shows the most in-demand skills in data science, where the size of each word reflects how often that skill occurs in job postings. Python, SQL, and Data Analysis stand out most prominently, indicating they are the most in-demand skills in the field. Other prominent technical skills include Java, AWS, Tableau, and Machine Learning, showing demand for proficiency with programming languages, cloud computing platforms, and data visualization tools. Some soft skills are also represented, such as Communication, Leadership, and Teamwork, which means a data science professional is expected not only to deliver on technical tasks but also to collaborate and communicate insights effectively.

# Filter the top 20 skills
top_20_skills <- skill_counts %>%
  arrange(desc(n)) %>%
  head(20)

# Calculate the total count of just the top 20 skills
total_top_20_count <- sum(top_20_skills$n)

# Calculate percentages for the top 20 skills based on the total top 20 count
top_20_skills <- top_20_skills %>%
  mutate(percentage = (n / total_top_20_count) * 100)

# Display the table
top_20_skills_table <- top_20_skills %>%
  select(job_skills, n, percentage) %>%
  rename(Skill = job_skills, Frequency = n, Percentage = percentage)

# Format the percentage to two decimal places
top_20_skills_table$Percentage <- round(top_20_skills_table$Percentage, 2)

# Display the table using kable for a neat format

kable(top_20_skills_table, caption = "Top 20 Data Science Skills with Percentages")
Top 20 Data Science Skills with Percentages

Skill                  Frequency   Percentage
Python                      4801        12.67
SQL                         4606        12.16
Data Analysis               4368        11.53
Communication               2498         6.59
Machine Learning            1966         5.19
AWS                         1740         4.59
Tableau                     1685         4.45
Data Visualization          1562         4.12
R                           1542         4.07
Java                        1414         3.73
Spark                       1392         3.67
Data Science                1285         3.39
Data Engineering            1262         3.33
Teamwork                    1218         3.21
Project Management          1213         3.20
Problem Solving             1093         2.88
Hadoop                      1074         2.83
Collaboration               1072         2.83
Data Management             1059         2.80
Power BI                    1036         2.73

Analyzing the top 20 data science skills, Python (12.67%), SQL (12.16%), and Data Analysis (11.53%) are the most in-demand skills in job postings, together accounting for over 36% of mentions within the top 20. This ranking signals that both programming and analytical capabilities are critical in data science roles. Communication, at 6.59%, also ranks highly, indicating demand for professionals who can convey complex insights effectively. Of note as well are Machine Learning at 5.19% and cloud platforms such as AWS at 4.59%, pointing to the need for skills in building intelligent systems and managing cloud infrastructure. Data visualization tools remain essential, with Tableau at 4.45% and Data Visualization techniques at 4.12%, both necessary for communicating data-driven insights to stakeholders. On the whole, the data shows that a data science professional needs the right blend of technical, analytical, and communication skills.
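
The combined share of the top three skills quoted above can be recomputed directly from the percentage column (a quick check):

# Share of the top three skills within the top 20
sum(top_20_skills$percentage[1:3])  # 12.67 + 12.16 + 11.53 = 36.36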

Conclusion

The trends in data science job postings point to a set of crucial skills for success. The most wanted skills fall under three categories: programming languages, machine learning, and data visualization. Among programming languages, Python, R, and Java are the most sought-after, while in machine learning the key techniques include deep learning, neural networks, and artificial intelligence. Additionally, data visualization, data analysis, and data exploration are critical for conveying insights effectively.

These insights are helpful to professionals and organizations wishing to become or remain competitive. For professionals in particular, developing skills in programming, machine learning, and data visualization is likely to pay off with excellent career prospects, improving their contribution to data-driven decision-making. Equally, organizations can drive innovation, growth, and strategic outcomes by attracting the best talent with these competencies.

As the field of data science continues to expand, the demand for informed insights from data that drive decision-making and strategic planning will only increase. To position themselves for success, professionals should keep their knowledge of industry trends current and develop the competencies organizations demand. Correspondingly, organizations that emphasize these same competencies will be well placed to navigate the fast-changing landscape of data science.

This project maps out a well-rounded view of the major skills expected in a data science position, along with actionable insights for professionals and companies. By focusing on the most valued skills and keeping up with industry changes, professionals and organizations alike can better prepare for the challenges and opportunities of this high-growth field.