Data607: Project 3 - Data Science Skills

# Load the raw data
jobpostings <- getURL("https://raw.githubusercontent.com/jhnboyy/DATA607_Project3_FALL2024/refs/heads/main/data/raw/job_postings.csv")
jobskills <- getURL("https://raw.githubusercontent.com/jhnboyy/DATA607_Project3_FALL2024/refs/heads/main/data/raw/job_skills.csv")
jobsummary <- getURL("https://raw.githubusercontent.com/jhnboyy/DATA607_Project3_FALL2024/refs/heads/main/data/raw/job_summary.csv")

# Read the data into dataframes
jpdf <- data.frame(read.csv(text = jobpostings, sep = ",", stringsAsFactors = FALSE, check.names = FALSE))
jskdf <- data.frame(read.csv(text = jobskills, sep = ",", stringsAsFactors = FALSE, check.names = FALSE))
jssdf <- data.frame(read.csv(text = jobsummary, sep = ",", stringsAsFactors = FALSE, check.names = FALSE))
Introduction
The aim of this project is to identify which data science skills are most in demand in today's booming labor market. Data science has become a highly important field across all sectors, where insights drawn from available data power innovation and decision-making. Increasingly, companies rely on data scientists to make sense of vast datasets, develop predictive models, and deliver actionable insights that inform their business strategies.
As the data science field grows, the required skills keep changing; they now range from programming languages like Python and R to machine learning, advanced data visualization techniques, cloud computing, and big data analytics. Understanding which skills are considered valuable helps aspiring data scientists and professionals currently working in the field orient their development efforts toward market needs.
This project analyzes current job postings to determine the key skills in demand today and how professionals and organizations can remain competitive within this rapidly growing industry. By studying trends across different regions and sectors, we learn how specific skills are valued differently depending on the industry or location.
The data used in this project was obtained from Kaggle, a platform for predictive modeling and analytics competitions. The dataset contains job postings for data science positions, including information on the job title, location, company, job description, and required skills. The dataset was collected from Indeed.com, a popular job search engine, and contains job postings from various countries and industries.
The overall approach to the analysis is as follows:
- Data Collection: The dataset was obtained from Kaggle and loaded into R for analysis.
- Data Cleaning: The dataset was cleaned to remove missing values and standardize the format of the data.
- Word Tokenization: The job descriptions were tokenized to extract the skills required for each job posting.
- Word Classification: The skills were classified into categories such as programming languages, machine learning, and data visualization.
- Data Analysis: The skills were analyzed to determine the most in-demand skills in the data science field.
- Visualization: The results were visualized using bar charts and word clouds to highlight the key skills in demand.
Loading Packages
The following packages are used for this project:
- readr: For reading in the dataset.
- RCurl: For reading in the dataset.
- stringr: For string manipulation.
- dplyr: For data manipulation.
- tidyr: For data manipulation.
- tidyverse: For data manipulation.
- ggplot2: For data visualization.
- kableExtra: For creating tables.
- knitr: For creating reports.
- wordcloud: For creating word clouds.
- tm: For text mining.
- ggwordcloud: For creating word clouds.
- tidytext: For text mining.
- colorspace: For color palettes.
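The report does not show the setup chunk itself; a minimal sketch, assuming all of the packages above are already installed, would simply load them:
# Load the packages used throughout the analysis
library(readr)        # reading data
library(RCurl)        # fetching the raw CSVs over HTTPS
library(stringr)      # string manipulation
library(dplyr)        # data manipulation
library(tidyr)        # reshaping (pivot_longer, unnest)
library(tidyverse)    # core tidyverse; overlaps with the packages above but is harmless
library(ggplot2)      # plotting
library(kableExtra)   # styled tables
library(knitr)        # kable() and report rendering
library(wordcloud)    # classic word clouds
library(tm)           # text mining (Corpus, tm_map, DocumentTermMatrix)
library(ggwordcloud)  # ggplot-based word clouds
library(tidytext)     # tidy text mining
library(colorspace)   # rainbow_hcl() palettes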
Data Collection
The dataset used in this project was obtained from Kaggle and contains job postings for data science positions. The dataset was collected from Indeed.com and includes information on the job title, location, company, job description, and required skills. The raw data lives in the project's GitHub repository (https://github.com/jhnboyy/DATA607_Project3_FALL2024) and consists of three files: job_postings.csv, job_skills.csv, and job_summary.csv.
The raw data captures the locations of the entities hiring, the companies performing the hiring, and the job titles for the open positions, along with additional information about each position. Further details, and the dataset itself, can be found on Kaggle. The dataset files and their respective column names are listed in Table 1 below.
Table 1: Dataset Files and Columns
File Name | Columns |
---|---|
job_postings | job_link, last_processed_time, last_status, got_summary, got_ner, is_being_worked, job_title, company, job_location, first_seen, search_city, search_country, search_position, job_level, job_type |
job_skills | job_link, job_skills |
job_summary | job_link, job_summary |
Structuring of the Data
The data was structured into three tables for analysis:
- job_postings: Contains information about the job postings, including the job title, company, location, and job description.
- job_skills: Contains the skills required for each job posting.
- job_summary: Contains a summary of the job description for each job posting.
Sources
The raw data files are pulled directly from the project's GitHub repository:
- https://raw.githubusercontent.com/jhnboyy/DATA607_Project3_FALL2024/refs/heads/main/data/raw/job_postings.csv
- https://raw.githubusercontent.com/jhnboyy/DATA607_Project3_FALL2024/refs/heads/main/data/raw/job_skills.csv
- https://raw.githubusercontent.com/jhnboyy/DATA607_Project3_FALL2024/refs/heads/main/data/raw/job_summary.csv
Data Cleaning
The data was cleaned to remove missing values and standardize the format of the data. The following steps were taken to clean the data:
- Remove Missing Values: Rows with missing values were removed from the dataset.
- Standardize Format: The format of the data was standardized to ensure consistency across the dataset.
- Remove Duplicates: Duplicate rows were removed from the dataset.
- Remove Special Characters: Special characters were removed from the data to ensure accurate analysis.
The cleaned data was then used for further analysis to determine the key skills in demand in the data science field.
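Of these steps, only the missing-value check appears in the code below. A minimal sketch of the duplicate and special-character removal, assuming the three dataframes loaded earlier (the exact steps used in the project may differ), could look like this:
# Drop duplicate rows from each dataframe
jpdf  <- jpdf  %>% distinct()
jskdf <- jskdf %>% distinct()
jssdf <- jssdf %>% distinct()

# Strip special characters from the free-text skills column, keeping letters,
# digits, spaces, commas, and symbols like '+' and '#' (so "C++" and "C#" survive)
jskdf <- jskdf %>%
  mutate(job_skills = str_replace_all(job_skills, "[^\\w\\s,+#./-]", ""))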
This will check the dataframes to see if there are any missing values in the dataset.
# Check for missing values in the entire dataset
sum(is.na(jpdf)) # For job postings data
[1] 0
sum(is.na(jskdf)) # For job skills data
[1] 0
sum(is.na(jssdf)) # For job summary data
[1] 0
# Identify which columns have missing values
colSums(is.na(jpdf))
job_link last_processed_time last_status got_summary
0 0 0 0
got_ner is_being_worked job_title company
0 0 0 0
job_location first_seen search_city search_country
0 0 0 0
search_position job_level job_type
0 0 0
colSums(is.na(jskdf))
job_link job_skills
0 0
colSums(is.na(jssdf))
job_link job_summary
0 0
# Check the column names in each dataframe
colnames(jpdf)
[1] "job_link" "last_processed_time" "last_status"
[4] "got_summary" "got_ner" "is_being_worked"
[7] "job_title" "company" "job_location"
[10] "first_seen" "search_city" "search_country"
[13] "search_position" "job_level" "job_type"
colnames(jssdf)
[1] "job_link" "job_summary"
colnames(jskdf)
[1] "job_link" "job_skills"
After verifying the dataframes, the next step would have been to remove missing values, but fortunately there were none to remove. The next step is to combine the dataframes into a single dataframe labeled combined_data. The combined dataframe will also have a column named job_id, which serves as a unique identifier for each job posting.
# Combine the dataframes using 'job_link' as the common key
combined_data <- jpdf %>%
  left_join(jssdf, by = "job_link") %>%
  left_join(jskdf, by = "job_link")

# Assign a unique job_id for each row in the combined data
combined_data <- combined_data %>%
  mutate(job_id = row_number())

# Reorder the columns so that job_id is the first column
combined_data <- combined_data %>%
  select(job_id, everything())
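A quick sanity check, not in the original report, can confirm that the two left joins were one-to-one and did not duplicate any postings:
# Left joins on a unique job_link key should preserve the row count of jpdf
stopifnot(nrow(combined_data) == nrow(jpdf))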
We will now proceed to tokenize the job descriptions to extract the skills required for each job posting.
Word Tokenization
The job descriptions were tokenized to extract the skills required for each job posting. The following steps were taken to tokenize the job descriptions:
- Tokenization: The job descriptions were tokenized to extract individual words.
- Remove Stopwords: Common words such as “and,” “the,” and “is” were removed from the tokenized words.
- Remove Punctuation: Punctuation marks were removed from the tokenized words.
- Convert to Lowercase: The tokenized words were converted to lowercase for consistency.
The tokenized words were then used to extract the skills required for each job posting.
# Tokenize the job descriptions
# Create a corpus from the job descriptions
corpus <- Corpus(VectorSource(combined_data$job_summary))

# Convert the corpus to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
transformation drops documents
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
documents
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
documents
# Remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("en"))
Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("en")):
transformation drops documents
# Create a document term matrix
dtm <- DocumentTermMatrix(corpus)

# Convert the document term matrix to a matrix
m <- as.matrix(dtm)

# Get the word frequency
word_freq <- colSums(m)

# Convert the word frequency to a data frame
word_freq_df <- data.frame(word = names(word_freq), freq = word_freq)

# Order the data frame by frequency
word_freq_df <- word_freq_df[order(-word_freq_df$freq), ]
# Check the top 10 words by frequency
head(word_freq_df, 10)
word freq
data data 122816
experience experience 65779
will will 36024
work work 35060
team team 29087
business business 28372
show show 24670
skills skills 22715
years years 21511
management management 20956
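Generic resume words such as "will", "work", and "show" dominate these counts. If desired, they could be filtered out with an additional removeWords pass before building the document term matrix; this is a hypothetical refinement, not part of the original pipeline, and the word list is illustrative:
# Remove domain-generic words that crowd out actual skills (word list is illustrative)
custom_stops <- c("will", "work", "show", "years")
corpus <- tm_map(corpus, removeWords, custom_stops)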
Word Classification
The skills extracted from the job descriptions were classified into categories such as programming languages, machine learning, and data visualization. The following steps were taken to classify the skills:
- Create Skill Categories: Categories such as programming languages, machine learning, and data visualization were created to classify the skills.
- Match Skills to Categories: The skills extracted from the job descriptions were matched to the corresponding categories.
- Count Skills by Category: The skills were counted by category to determine the most in-demand skills in each category.
The skills were then analyzed to determine the most in-demand skills in the data science field.
# Create a list of programming languages
programming_languages <- c("python", "r", "java", "c++", "c#", "javascript", "sql", "scala", "ruby", "perl", "php", "swift", "kotlin", "typescript", "go", "rust", "dart")

# Create a list of machine learning skills
machine_learning <- c("machine learning", "deep learning", "neural networks", "artificial intelligence", "natural language processing", "reinforcement learning", "supervised learning", "unsupervised learning", "semi-supervised learning", "ensemble learning", "transfer learning", "deep reinforcement learning", "deep neural networks", "convolutional neural networks", "recurrent neural networks", "generative adversarial networks", "support vector machines", "random forests", "decision trees", "gradient boosting", "xgboost", "lightgbm", "catboost", "k-means clustering", "hierarchical clustering", "dbscan", "apriori", "frequent pattern mining", "association rule learning", "collaborative filtering", "content-based filtering", "matrix factorization", "recommender systems", "anomaly detection", "time series forecasting", "sequence prediction", "image recognition", "object detection", "semantic segmentation", "instance segmentation", "image classification", "image generation", "image synthesis", "image super-resolution", "image denoising", "image inpainting", "image captioning", "image style transfer", "image translation", "image segmentation", "image registration", "image restoration", "image enhancement", "image compression", "image processing", "image analysis", "image understanding", "image interpretation")

# Create a list of data visualization skills
data_visualization <- c("data visualization", "data analysis", "data exploration", "data interpretation", "data presentation", "data storytelling", "data reporting", "data communication", "data visualization tools", "data visualization techniques", "data visualization best practices", "data visualization libraries", "data visualization frameworks", "data visualization software", "data visualization platforms", "data visualization dashboards", "data visualization charts", "data visualization graphs", "data visualization maps", "data visualization tables", "data visualization infographics", "data visualization reports", "data visualization insights", "data visualization trends", "data visualization patterns", "data visualization principles", "data visualization guidelines", "data visualization standards", "data visualization design", "data visualization aesthetics", "data visualization color theory", "data visualization typography", "data visualization layout", "data visualization composition", "data visualization hierarchy", "data visualization alignment", "data visualization contrast", "data visualization proximity", "data visualization repetition", "data visualization scale", "data visualization size", "data visualization shape", "data visualization texture", "data visualization value", "data visualization color", "data visualization form", "data visualization space", "data visualization motion", "data visualization pattern", "data visualization rhythm", "data visualization unity", "data visualization balance", "data visualization emphasis", "data visualization variety", "data visualization harmony", "data visualization proportion", "data visualization movement", "data visualization direction")

# Classify the skills into categories
# Note: within an alternation, str_extract returns the earliest alternative that
# matches at a position, so the bare "data visualization" entry is matched before
# any of the longer "data visualization ..." variants
combined_data <- combined_data %>%
  mutate(programming_language = str_extract(tolower(job_summary), paste(programming_languages, collapse = "|")),
         machine_learning = str_extract(tolower(job_summary), paste(machine_learning, collapse = "|")),
         data_visualization = str_extract(tolower(job_summary), paste(data_visualization, collapse = "|")))
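One caveat with collapsing these vectors into a plain alternation: single-letter entries such as "r" will match anywhere (e.g., inside "experience"), and "c++" contains regex metacharacters. A tighter matcher is sketched below; this is not the original approach, safe_pattern is a hypothetical helper, and it assumes stringr 1.5+ for str_escape():
# Hypothetical helper: escape regex metacharacters, then require each term to
# stand alone (no word characters, '+', or '#' immediately before or after)
safe_pattern <- function(terms) {
  paste0("(?<![\\w+#])(", paste(str_escape(terms), collapse = "|"), ")(?![\\w+#])")
}

combined_data <- combined_data %>%
  mutate(programming_language = str_extract(tolower(job_summary),
                                            safe_pattern(programming_languages)))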
Data Analysis
The skills were analyzed to determine the most in-demand skills in the data science field. The following steps were taken to analyze the skills:
- Count Skills by Category: The skills were counted by category to determine the most in-demand skills in each category.
- Visualize Skills: The results were visualized using bar charts and word clouds to highlight the key skills in demand.
The analysis was conducted to identify the key skills in demand in the data science field.
# Count the number of job postings by skill category
skill_counts <- combined_data %>%
  select(job_id, programming_language, machine_learning, data_visualization) %>%
  pivot_longer(cols = -job_id, names_to = "skill_category", values_to = "skill") %>%
  filter(!is.na(skill)) %>%
  group_by(skill_category, skill) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
`summarise()` has grouped output by 'skill_category'. You can override using
the `.groups` argument.
Visualization
The results were visualized using bar charts and word clouds to highlight the key skills in demand. The following visualizations were created to showcase the key skills in the data science field:
- Bar Chart: A bar chart was created to show the skills in demand by category.
- Word Cloud: A word cloud was created to visualize the most in-demand skills in the data science field.
The visualizations provide insights into the key skills required for data science positions.
top_skills <- skill_counts %>%
  arrange(desc(count)) %>%
  head(20)

top_skills %>%
  ggplot(aes(x = reorder(skill, count), y = count, fill = skill_category)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Top 20 Skills in Demand by Category",
       x = "Skill",
       y = "Count",
       fill = "Skill Category") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = "Set3")
The top 20 skills fall under three main categories: programming languages, machine learning, and data visualization. Among programming languages, R, Python, and SQL occupy the top positions, confirming their pivotal role in data manipulation, analysis, and database management. In the machine learning category, techniques such as machine learning, artificial intelligence, and reinforcement learning rank highly, underscoring the growing importance of predictive modeling and advanced analytics. Tableau leads the visualization category, showing how important it is for data practitioners to turn hard-to-view data into understandable visualizations. The takeaway from this breakdown is that data science is a deeply interdisciplinary field where technical, analytical, and communication skills interweave.
# Split job skills, count occurrences, and visualize
skills_series <- combined_data %>%
  filter(!is.na(job_skills)) %>%
  mutate(job_skills = strsplit(as.character(job_skills), ",")) %>%
  unnest(job_skills) %>%
  mutate(job_skills = trimws(job_skills))

skill_counts <- skills_series %>%
  count(job_skills, sort = TRUE)

# Standardize similar skill names (e.g., combine "Data Analytics" and "Data Analysis")
skills_series <- skills_series %>%
  mutate(job_skills = case_when(
    job_skills %in% c("Data Analytics", "Data analysis") ~ "Data Analysis",
    TRUE ~ job_skills
  ))

# Recount the occurrences after standardization
skill_counts <- skills_series %>%
  count(job_skills, sort = TRUE)
# Plot top skills
top_skills <- head(skill_counts, 20)

ggplot(top_skills, aes(x = reorder(job_skills, n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Data Science Skills", x = "Skills", y = "Frequency")
This bar chart gives a more structured breakdown of the top 20 most frequent data science skills. Python and SQL head the chart, reinforcing their place at the core of any data scientist's toolkit. Data Analysis, Communication, and Machine Learning come next, underlining the dual importance of analytical skill and the ability to present findings clearly to non-technical stakeholders. Tools like Tableau, AWS, and R underscore the importance of data visualization and cloud computing. The combination of hard skills such as Spark and Data Engineering with soft skills like Teamwork and Problem Solving indicates that successful data scientists bring a range of competencies to their roles.
# Use the skill_counts dataframe, where 'job_skills' is the word and 'n' is the frequency
# Use the top 50 skills to generate a word cloud
top_skills <- skill_counts %>%
  head(50)

# Generate a dynamic color palette with the same number of colors as unique skills
color_count <- length(unique(top_skills$job_skills))
color_palette <- rainbow_hcl(color_count) # Generate dynamic colors

# Create the word cloud using dynamic colors
top_skills %>%
  ggplot(aes(label = job_skills, size = n, color = job_skills)) + # Assign a color to each word
  geom_text_wordcloud_area() + # Word cloud geom for a better layout
  scale_size_area(max_size = 25) + # Adjust the maximum word size
  scale_color_manual(values = color_palette) + # Use the generated palette
  theme_minimal() + # Use a minimal theme
  labs(title = "Top In-Demand Skills") + # Add a title
  theme(plot.title = element_text(hjust = 0.5)) # Center the title
The word cloud above shows the most in-demand skills in data science, where the size of each word reflects how often that skill occurs in job postings. Python, SQL, and Data Analysis stand out as the three most prominent, making them the most in-demand skills in the field. Other technical skills such as Java, AWS, Tableau, and Machine Learning show demand for proficiency with programming languages, cloud computing platforms, and data visualization tools. Soft skills such as Communication, Leadership, and Teamwork are also represented, meaning a data science professional is expected to deliver not only on technical tasks but also on collaboration and effective communication of insights.
# Filter the top 20 skills
top_20_skills <- skill_counts %>%
  arrange(desc(n)) %>%
  head(20)

# Calculate the total count of just the top 20 skills
total_top_20_count <- sum(top_20_skills$n)

# Calculate percentages for the top 20 skills based on the total top 20 count
top_20_skills <- top_20_skills %>%
  mutate(percentage = (n / total_top_20_count) * 100)

# Display the table
top_20_skills_table <- top_20_skills %>%
  select(job_skills, n, percentage) %>%
  rename(Skill = job_skills, Frequency = n, Percentage = percentage)

# Format the percentage to two decimal places
top_20_skills_table$Percentage <- round(top_20_skills_table$Percentage, 2)

# Display the table using kable for a neat format
kable(top_20_skills_table, caption = "Top 20 Data Science Skills with Percentages")
Skill | Frequency | Percentage |
---|---|---|
Python | 4801 | 12.67 |
SQL | 4606 | 12.16 |
Data Analysis | 4368 | 11.53 |
Communication | 2498 | 6.59 |
Machine Learning | 1966 | 5.19 |
AWS | 1740 | 4.59 |
Tableau | 1685 | 4.45 |
Data Visualization | 1562 | 4.12 |
R | 1542 | 4.07 |
Java | 1414 | 3.73 |
Spark | 1392 | 3.67 |
Data Science | 1285 | 3.39 |
Data Engineering | 1262 | 3.33 |
Teamwork | 1218 | 3.21 |
Project Management | 1213 | 3.20 |
Problem Solving | 1093 | 2.88 |
Hadoop | 1074 | 2.83 |
Collaboration | 1072 | 2.83 |
Data Management | 1059 | 2.80 |
Power BI | 1036 | 2.73 |
Analyzing the top 20 data science skills, Python (12.67%), SQL (12.16%), and Data Analysis (11.53%) are the most in-demand skills in the job postings, together accounting for just over 36% of the top-20 mentions. This ranking signals that both programming and analytical capabilities are critical in data science roles. Communication follows at 6.59%, indicating demand for professionals who can convey complex insights effectively. Machine Learning (5.19%) and cloud platforms such as AWS (4.59%) point to the growing need for skills in intelligent-system development and cloud infrastructure management. Visualization tools such as Tableau (4.45%) and data visualization techniques generally (4.12%) remain essential for communicating data-driven insights to stakeholders. On the whole, the data shows that a data science professional needs the right blend of technical, analytical, and communication skills.
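As a quick arithmetic check, the combined share of the top three skills can be computed directly from the table (a snippet added here for illustration, using the top_20_skills dataframe built above):
# Python + SQL + Data Analysis as a share of all top-20 mentions
sum(top_20_skills$percentage[1:3])  # 12.67 + 12.16 + 11.53 = ~36.4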
Conclusion
The job postings analyzed here point to a clear set of skills necessary for success in data science. The most wanted skills fall under three categories: programming languages, machine learning, and data visualization. The most sought-after programming languages are Python, R, and Java, while the leading machine learning techniques include deep learning, neural networks, and artificial intelligence. Additionally, data visualization, data analysis, and data exploration are critical for conveying insights effectively.
These insights are useful to professionals and organizations wishing to become or remain competitive. For professionals in particular, developing skills in programming, machine learning, and data visualization should pay off with strong career prospects, improving their contribution to data-driven decision-making. Equally, organizations can drive innovation, growth, and strategic outcomes by attracting talent with these competencies.
As the field of data science continues to expand, the demand for data-driven insights that inform decision-making and strategic planning will only increase. Professionals who keep their knowledge of industry trends current and develop the competencies organizations demand position themselves for success. Correspondingly, organizations that emphasize the same competencies will be well placed to navigate the fast-changing landscape of data science.
This project maps out a well-rounded view of the major skills expected in a data science position, along with actionable insights for professionals and companies. By focusing on the most valued skills and keeping up with industry changes, both individual professionals and organizations can better prepare for the challenges and opportunities of this high-growth field.