Our project aims to use data to answer the question: “Which data science skills are the most valued?” To achieve this, we performed web scraping on major job search platforms (Indeed, LinkedIn, and Glassdoor) to gather job posting data. Our raw CSV file contained several key attributes, including job title, job URL, company name, job level (e.g., senior, junior, associate), location, appointment type, job industry, job function, and the complete job description.
We then compiled two main lists: one for technical skills and another for soft/non-technical skills, extracted from job descriptions. Our analysis focused on identifying the top five most frequently mentioned skills in each category, highlighting the skills most commonly sought after in data science roles. The tidying process included creating additional columns to capture details such as job salary, city/state of the job location, remote work options, and education requirements. This comprehensive analysis provided insight into the most valued data science skills as reflected in job postings.
Communication: Our primary communication tool will be Discord, enabling real-time discussions, file sharing, and topic organization through channels. This will facilitate constant and efficient communication among team members.
Code Sharing: We will use GitHub as our version control system for sharing and collaborating on code. GitHub will help us manage contributions, track changes, and work seamlessly together on the codebase.
Project Documentation: Project documentation and shared resources, such as reports, meeting notes, and other essential documents, will be hosted on GitHub. This platform will also act as a backup repository for non-code-related materials.
The primary motivation for this project is to understand which data science skills are most valued in the job market. As data science continues to evolve rapidly, knowing the most in-demand skills is crucial for aspiring and current professionals. This project not only provides insights into the technical and non-technical competencies sought by employers but also enhances our ability to work collaboratively as a class, applying data acquisition, tidying, and analysis techniques to solve real-world problems.
We began our project by determining our data sources and methods for collecting job postings. Using web scraping tools like Octoparse and Parsehub, we gathered data from LinkedIn, focusing on job titles, descriptions, companies, locations, and required skills. The raw data was stored in a relational database with normalized tables to ensure efficient data management.
Next, we performed data tidying and transformation in R. This included removing duplicates, splitting the job location into city and state, and cleaning up inconsistencies such as extra spaces. We created new columns for each skill and used loops to check the presence of skills within job descriptions, assigning binary values (“1” for present, “0” for absent). Finally, we conducted exploratory data analysis, using visualization techniques to highlight the top technical and non-technical skills, as well as the geographic distribution of job opportunities.
Our analysis revealed several key insights: 1. General Skills: ‘Analysis,’ ‘database,’ and ‘technical’ emerged as the most commonly cited general skills in job descriptions. 2. Top Technical Skills: Excel, Business Intelligence, and Tableau were the most frequently mentioned technical skills. 3. Top Non-Technical Skills: Communication stood out as the most important non-technical skill, followed by organization and attention to detail. The prominence of communication underscores the importance of effectively conveying complex ideas in the data science field.
Our findings provide a comprehensive view of the current data science job market, guiding professionals on the essential skills to develop for career success.
In the beginning, we used the ‘rvest’ package in R to web scrape job postings from Indeed. However, we ran into a 403 error because the site restricts automated (API) access, so we switched to Parsehub and Octoparse to scrape job postings from LinkedIn using the keywords ‘data’, ‘data analyst’, ‘data scientist’, and ‘data engineer’.
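For reference, a minimal sketch of that first attempt is shown below. The search URL and CSS selector are illustrative rather than the exact ones we used; the request consistently came back with HTTP status 403.

library(rvest)
library(httr)

# Illustrative search URL -- not the exact query we used
search_url <- "https://www.indeed.com/jobs?q=data+scientist"
resp <- GET(search_url, user_agent("Mozilla/5.0"))
status_code(resp)   # returned 403 (forbidden), i.e. automated requests are blocked
# Had the request succeeded, we would have parsed the job cards, e.g.:
# page   <- read_html(resp)
# titles <- page %>% html_elements(".jobTitle") %>% html_text2()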
From Octoparse, we were able to obtain 7,236 job postings from LinkedIn and over 1,000 job postings from Indeed. However, because the Glassdoor data were incomplete, we decided to work with the LinkedIn dataset. After removing duplicate entries, we ended up with 1,342 job postings from LinkedIn.
library(stringr)
tidied_Linkedin <- read.csv('https://raw.githubusercontent.com/Shriyanshh/Project-Data-Science-Skills/refs/heads/main/final%20tidied%20Linkedin.csv')
Below are some of the things we did to tidy our dataframe (a condensed sketch of these steps appears after the list):
Remove leading and trailing white spaces
Split the Job location column into two columns: city and state
Fill in the missing values in the state column
Remove duplicate job postings
Create a column for job type (remote, hybrid, …), salary, and degree qualification (Bachelors, … )
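The sketch below condenses these steps into a single pipeline. The raw file name, the Location column, and the regular expressions for salary, remote status, and education are assumptions for illustration; the actual tidying script handled more edge cases.

library(tidyverse)

# Hypothetical raw export; column names (Titile, Company, Location, Description) are assumed
raw <- read.csv("raw_Linkedin.csv")

tidied <- raw %>%
  mutate(across(where(is.character), str_squish)) %>%                 # trim leading/trailing/extra spaces
  distinct(Titile, Company, Description, .keep_all = TRUE) %>%        # remove duplicate postings
  separate(Location, into = c("City", "State"), sep = ",\\s*",
           fill = "right", remove = FALSE) %>%                        # split "City, ST"
  mutate(
    State = if_else(is.na(State) | State == "", "Unknown", State),    # fill missing states
    Remote = case_when(
      str_detect(Description, regex("remote", ignore_case = TRUE)) ~ "Remote",
      str_detect(Description, regex("hybrid", ignore_case = TRUE)) ~ "Hybrid",
      TRUE ~ "On-site"),
    Salary = str_extract(Description, "\\$[0-9][0-9,]*(\\s*-\\s*\\$[0-9][0-9,]*)?"),
    Education = case_when(
      str_detect(Description, regex("ph\\.?d|doctorate", ignore_case = TRUE)) ~ "Doctorate",
      str_detect(Description, regex("master", ignore_case = TRUE)) ~ "Masters",
      str_detect(Description, regex("bachelor", ignore_case = TRUE)) ~ "Bachelors",
      TRUE ~ NA_character_)
  )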
Our database model is structured similarly to the ER diagram presented. At the core, we have a master table that contains all the job postings we scraped. This master table includes attributes common to any job listing, such as Job Title, Job ID Number (as the primary key), Job Level (e.g., internship, contract, senior), Associated Industry, Salary, Appointment Type (e.g., Remote, Hybrid, On-site), Education Level, and the associated skills.
Extending from the master table, we have several related tables to organize our data efficiently. First, there is a salary table that uses a foreign key linked to the Job ID to specify salary details for each job. We also created an employer table, which contains unique IDs for each employer, and from this table, an employer location table captures the city and state for each employer.
Additionally, a skills table branches from the master table, featuring a unique ID for each skill in our dataset. Further, a Skill_Group table references the specific skill group associated with each skill. We also normalized other attributes into separate tables, including a table for education levels, one for associated industries, another for job levels, a table for appointment types, and a normalized table for unique job titles (e.g., Data Scientist, Data Engineer, Data Analyst).
Although our final dataset didn’t fully adhere to this proposed model due to time constraints and challenges with data tidying, this was the ideal structure we aimed for to optimize organization and query efficiency.
url <- "https://raw.githubusercontent.com/Shriyanshh/Project-Data-Science-Skills/refs/heads/main/Data%20607%20ER%20Diagram.png"
knitr::include_graphics(url)
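To make the proposed model concrete, the following is a minimal sketch of a few of its tables using DBI and RSQLite. The table and column names simply mirror the ER description above; they are illustrative, not our final schema.

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Master table: one row per scraped posting
dbExecute(con, "
  CREATE TABLE job_posting (
    job_id           INTEGER PRIMARY KEY,
    job_title        TEXT,
    job_level        TEXT,
    industry         TEXT,
    appointment_type TEXT,
    education_level  TEXT
  )")

# Salary details keyed back to the posting
dbExecute(con, "
  CREATE TABLE salary (
    job_id     INTEGER REFERENCES job_posting(job_id),
    salary_min REAL,
    salary_max REAL
  )")

# Normalized skills plus a bridge table linking postings to skills
dbExecute(con, "CREATE TABLE skill (skill_id INTEGER PRIMARY KEY, skill_name TEXT, skill_group TEXT)")
dbExecute(con, "
  CREATE TABLE job_skill (
    job_id   INTEGER REFERENCES job_posting(job_id),
    skill_id INTEGER REFERENCES skill(skill_id)
  )")

dbDisconnect(con)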
We implemented a loop to generate a new column for each skill, assigning a “1” or “0” for each job posting. A “1” indicates that the skill is present in the “Job_description” column, while a “0” means the skill is absent.
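As a toy illustration of this encoding (using made-up descriptions rather than rows from our dataset):

library(stringr)

desc <- c("Requires SQL and Tableau experience",
          "Strong Excel and communication skills preferred")
+str_detect(desc, "(?i)tableau")   # 1 0
+str_detect(desc, "(?i)excel")     # 0 1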
The top 10 most frequent general skills found in the job postings are analysis, technical, database, research, programming, machine learning, modeling, statistics, mathematics, and data visualization.
general_skills <- tidied_Linkedin[c("Titile", "Company","City","State","Seniority_level","Employment_type","Industry","Function", "Education","Remote", "Salary","Description")]
general <- c("(?i)Statistical analysis","(?i)Machine Learning","(?i)data visual","(?i)wrangling","(?i)mathematics","(?i)programming|scripting language","(?i)statistics","(?i)big data","(?i)Artificial intelligence|\\bAI\\b","(?i)Deep learning","(?i)data modeling","(?i)data processing","(?i)data management","(?i)data manipulation","(?i)database","(?i)data architecture","(?i)data mining","(?i)research","(?i)modeling","(?i)analysis","(?i)cloud computing","(?i)technical")
# Collapse each posting's descriptive columns into one string, then flag each general skill with 1/0
general_text <- apply(general_skills, 1, paste0, collapse = " ")
for(i in general){
  general_skills[i] <- +str_detect(general_text, i)
}
# colnames(general_skills)
colnames(general_skills) <- c("Job_Title", "Company","City","State","Seniority_level","Employment_type","Industry","Job_Function", "Education","Remote", "Salary","Job_Description","Statistical analysis","Machine Learning","data visual","wrangling","mathematics","programming","statistics","big data","Artificial intelligence","Deep learning","data modeling","data processing","data management","data manipulation","database","data architecture","data mining","research","modeling","analysis","cloud computing","technical")
library(tidyverse)
general_long_table <- general_skills %>%
pivot_longer(cols = !c("Job_Title", "Company","City","State","Seniority_level","Employment_type","Industry","Job_Function", "Education","Remote", "Salary","Job_Description"), names_to = "skill", values_to = "count" )
frequent_general_skill <- general_long_table %>%
filter(count == 1)
frequent_general_skill <- frequent_general_skill %>%
count(skill) %>%
mutate(percentage = (n / nrow(tidied_Linkedin))*100)
frequent_general_skill$percentage <- round(frequent_general_skill$percentage, 2)
library(DT)
datatable(frequent_general_skill, colnames = c('Skill', 'Count', 'Percentage'))
The most frequently mentioned technical skills found in the job postings are Excel, Business Intelligence (BI), Scala, Spark, Tableau, SAS, Airflow, DevOps, and Hadoop.
technical_skills <- tidied_Linkedin[c("Titile", "Company","City","State","Seniority_level","Employment_type","Industry","Function", "Education","Remote", "Salary","Description")]
technical <- c("(?i)spss","(?i)stata", "(?i)sas","(?i)matlab", "(?i)Swift", "(?i)Julia","(?i)Hadoop","(?i)Spark","(?i)Pig","(?i)Tableau","(?i)REDCap", "(?i)Qualtrics", "(?i)Power BI", "(?i)Dedoose", "(?i)Atlas TI", "(?i)NVivo", "(?i)MPlus", "(?i)Mixor", "(?i)dbt", "(?i)BigQuery|big query", "(?i)Superset", "(?i)Baseten", "(?i)Airflow","(?i)ETL/ELT pipelines","(?i)Google Analytics", "(?i)Parse.ly", "(?i)Chartbeat","(?i)Excel", "(?i)Plotly", "(?i)Google Data Studio", "(?i)Looker","(?i)Spotfire","(?i)Smartsheet","(?i)\\bBI\\b|Business Intelligence","(?i)C#|C#.net","(?i)VBA","(?i)DevOps")
# Same approach for the technical skills
technical_text <- apply(technical_skills, 1, paste0, collapse = " ")
for(i in technical){
  technical_skills[i] <- +str_detect(technical_text, i)
}
colnames(technical_skills) <- c("Job_Title", "Company","City","State","Seniority_level","Employment_type","Industry","Job_Function", "Education","Remote", "Salary","Job_Description", "SPSS","Stata", "SAS","MATLAB", "Swift", "Julia","Hadoop","Spark","Pig","Tableau","REDCap", "Qualtrics", "Power BI", "Dedoose", "Atlas TI", "NVivo", "MPlus", "Mixor", "dbt", "BigQuery", "Superset", "Baseten", "Airflow","ETL/ELT pipelines","Google Analytics", "Parse.ly", "Chartbeat","Excel", "Plotly", "Google Data Studio", "Looker","Spotfire","Smartsheet","Business Intelligence","C#","VBA","DevOps")
#library(tidyverse)
technical_long_table <- technical_skills %>%
pivot_longer(cols = !c("Job_Title", "Company","City","State","Seniority_level","Employment_type","Industry","Job_Function", "Education","Remote", "Salary","Job_Description"), names_to = "skill", values_to = "count" )
frequent_technical_skill <- technical_long_table %>%
filter(count == 1)
frequent_technical_skill <- frequent_technical_skill %>%
count(skill) %>%
mutate(percentage = (n / nrow(tidied_Linkedin))*100)
frequent_technical_skill$percentage <- round(frequent_technical_skill$percentage, 2)
library(DT)
datatable(frequent_technical_skill, colnames = c('Skill', 'Count', 'Percentage'))
We researched the top 20 most popular programming languages (https://www.simplilearn.com/best-programming-languages-start-learning-today-article).
The top 10 most frequent programming skills found in the job postings are SQL, Python, Scala, R, Spark, Java, Rust, C, C++, and NoSQL.
programming_skills <- tidied_Linkedin[c("Titile", "Company","City","State","Seniority_level","Employment_type","Industry","Function", "Education","Remote", "Salary","Description")]
programming <- c("(?i)python", "(?i)java", "(?i)javascript", "(?i)sql", "(?i)\\bR\\b", "(?i)stata","(?i)scala","(?i)matlab", "(?i)\\bC\\b","(?i)C\\+\\+", "(?i)Swift", "(?i)Julia","(?i)Hadoop","(?i)Spark","(?i)Pig","\\bGo\\b","(?i)Kotlin","(?i)PHP","(?i)C#|C#.net","(?i)Ruby","(?i)TypeScript","(?i)HTML","(?i)CSS","(?i)NOSQL","(?i)Rust","(?i)Perl","(?i)Alteryx")
# Flag each programming language with 1/0 (patterns carry (?i) where case should be ignored)
programming_text <- apply(programming_skills, 1, paste0, collapse = " ")
for(i in programming){
  programming_skills[i] <- +str_detect(programming_text, i)
}
colnames(programming_skills) <- c("Job_Title", "Company","City","State","Seniority_level","Employment_type","Industry","Job_Function", "Education","Remote", "Salary","Job_Description","Python", "Java", "JavaScript", "SQL", "R", "Stata", "Scala","MATLAB", "C","C++", "Swift", "Julia","Hadoop","Spark","Pig","Go","Kotlin","PHP","C#","Ruby","TypeScript","HTML","CSS","NoSQL","Rust","Perl","Alteryx")
programming_long_table <- programming_skills %>%
pivot_longer(cols = !c("Job_Title", "Company","City","State","Seniority_level","Employment_type","Industry","Job_Function", "Education","Remote", "Salary","Job_Description"), names_to = "skill", values_to = "count" )
frequent_programming_skill <- programming_long_table %>%
filter(count == 1)
frequent_programming_skill <- frequent_programming_skill %>%
count(skill) %>%
mutate(percentage = (n / nrow(tidied_Linkedin))*100)
frequent_programming_skill$percentage <- round(frequent_programming_skill$percentage, 2)
datatable(frequent_programming_skill, colnames = c('Skill', 'Count', 'Percentage'))
The top 10 most frequently mentioned non-technical skills in the job postings are communication, organization, attention to detail, visualization, problem-solving, attention, accuracy, presentation, collaboration, and innovation.
nontechnical_skills <- tidied_Linkedin[c("Titile", "Company","City","State","Seniority_level","Employment_type","Industry","Function", "Education","Remote", "Salary","Description")]
nontechnical <- c("(?i)communicate|communication", "(?i)critical thinking|critical-thinking", "(?i)problem solving|problem-solving", "(?i)business acumen", "(?i)storytell|data narrative", "(?i)adaptability","(?i)team player|team spirit", "(?i)product understanding","(?i)innovation","(?i)collaboration","(?i)visualization","(?i)attention","(?i)interpersonal","(?i)detail","(?i)presentation","(?i)multitask","(?i)decision making|decision-making","(?i)accuracy", "(?i)organization|organized","(?i)time management")
# Flag each non-technical skill with 1/0
nontechnical_text <- apply(nontechnical_skills, 1, paste0, collapse = " ")
for(i in nontechnical){
  nontechnical_skills[i] <- +str_detect(nontechnical_text, i)
}
colnames(nontechnical_skills) <- c("Job_Title", "Company","City","State","Seniority_level","Employment_type","Industry","Job_Function", "Education","Remote", "Salary","Job_Description","communication", "critical thinking", "problem-solving", "business acumen", "data narrative", "adaptability","team player", "product understanding","innovation","collaboration","visualization","attention","interpersonal","detail","presentation","multitask","decision-making","accuracy", "organization","time management")
nontechnical_long_table <- nontechnical_skills %>%
pivot_longer(cols = !c("Job_Title", "Company","City","State","Seniority_level","Employment_type","Industry","Job_Function", "Education","Remote", "Salary","Job_Description"), names_to = "skill", values_to = "count" )
frequent_nontechnical_skill <- nontechnical_long_table %>%
filter(count == 1)
frequent_nontechnical_skill <- frequent_nontechnical_skill %>%
count(skill) %>%
mutate(percentage = (n / nrow(tidied_Linkedin))*100)
frequent_nontechnical_skill$percentage <- round(frequent_nontechnical_skill$percentage, 2)
datatable(frequent_nontechnical_skill, colnames = c('Skill', 'Count', 'Percentage'))
library(ggplot2)
frequent_general_skill %>%
arrange(desc(percentage)) %>%
slice(1:10) %>%
ggplot(., aes(x = reorder(skill, percentage), y = percentage)) +
geom_bar(stat = 'identity', fill = 'lightblue') +
coord_flip() + ggtitle("Top 10 General Skills") +
theme(plot.title = element_text(hjust = 0.5)) +
ylab("Percentage of Job Postings Mentioning the Skill") +
xlab("Skill")
frequent_technical_skill %>%
arrange(desc(percentage)) %>%
slice(1:10) %>%
ggplot(., aes(x = reorder(skill, percentage), y = percentage)) +
geom_bar(stat = 'identity', fill = 'lightblue') +
coord_flip() + ggtitle("Top 10 Technical Skills") +
theme(plot.title = element_text(hjust = 0.5)) +
ylab("Percentage of Job Postings Mentioning the Skill") +
xlab("Skill")
frequent_programming_skill %>%
arrange(desc(percentage)) %>%
slice(1:10) %>%
ggplot(., aes(x = reorder(skill, percentage), y = percentage)) +
geom_bar(stat = 'identity', fill = 'lightblue') +
coord_flip() + ggtitle("Top 10 Programming Languages") +
theme(plot.title = element_text(hjust = 0.5)) +
ylab("Percentage of Job Postings Mentioning the Skill") +
xlab("Skill")
frequent_nontechnical_skill %>%
arrange(desc(percentage)) %>%
slice(1:10) %>%
ggplot(., aes(x = reorder(skill, percentage), y = percentage)) +
geom_bar(stat = 'identity', fill = 'lightblue') +
coord_flip() + ggtitle("Top 10 Non-Technical Skills") +
theme(plot.title = element_text(hjust = 0.5)) +
ylab("Percentage of Job Postings Mentioning the Skill") +
xlab("Skill")
Most job postings do not state whether they require a degree. Among those that do, a bachelor's degree is the most common requirement; a master's degree is rarely requested, and a doctorate almost never.
education_tidy <- tidied_Linkedin[, c("X", "Education")]
ggplot(education_tidy, aes(x = Education)) +
geom_bar(stat = "count", fill = 'lightblue') +
coord_flip() + ggtitle("Distribution of Education Level") +
theme(plot.title = element_text(hjust = 0.5)) +
ylab("Number of Jobs Requiring Degree") +
xlab("Degree")
library(stringr)
all_skills <- tidied_Linkedin[c("Titile","Description")]
all_skills$Titile[grep("(?i)data analyst", all_skills$Titile)] <- "Data Analyst"
all_skills$Titile[grep("(?i)engineer", all_skills$Titile)] <- "Data Engineer"
all_skills$Titile[grep("(?i)data science", all_skills$Titile)] <- "Data Scientist"
all_skills$Titile[grep("(?i)scientist", all_skills$Titile)] <- "Data Scientist"
all_skills$Titile[grep("(?i)business", all_skills$Titile)] <- "Business Analyst"
all_skills$Titile[grep("(?i)data", all_skills$Titile)] <- "Data Analyst"
skills <- c("(?i)Statistical analysis","(?i)Machine Learning","(?i)data visual","(?i)wrangling","(?i)mathematics","(?i)programming|scripting language","(?i)statistics","(?i)big data","(?i)Artificial intelligence|\\bAI\\b","(?i)Deep learning","(?i)data modeling","(?i)data processing","(?i)data management","(?i)data manipulation","(?i)database","(?i)data architecture","(?i)data mining","(?i)research","(?i)modeling","(?i)analysis","(?i)cloud computing","(?i)technical","(?i)spss","(?i)stata", "(?i)sas","(?i)scala","(?i)matlab", "(?i)Swift", "(?i)Julia","(?i)Hadoop","(?i)Spark","(?i)Pig","(?i)Tableau","(?i)REDCap", "(?i)Qualtrics", "(?i)Power BI", "(?i)Dedoose", "(?i)Atlas TI", "(?i)NVivo", "(?i)MPlus", "(?i)Mixor", "(?i)dbt", "(?i)BigQuery|big query", "(?i)Superset", "(?i)Baseten", "(?i)Airflow","(?i)ETL/ELT pipelines","(?i)Google Analytics", "(?i)Parse.ly", "(?i)Chartbeat","(?i)Excel", "(?i)Plotly", "(?i)Google Data Studio", "(?i)Looker","(?i)Spotfire","(?i)Smartsheet","(?i)\\bBI\\b|Business Intelligence","(?i)C#|C#.net","(?i)VBA")
# Collapse title and description into one string per posting, then flag each skill with 1/0
all_text <- apply(all_skills, 1, paste0, collapse = " ")
for(i in skills){
  all_skills[i] <- +str_detect(all_text, i)
}
more_skills <- c("(?i)DevOps","(?i)python", "(?i)java", "(?i)javascript", "(?i)sql", "(?i)\\bR\\b", "(?i)stata","(?i)scala","(?i)matlab", "(?i)\\bC\\b","(?i)C\\+\\+", "(?i)Swift", "(?i)Julia","(?i)Hadoop","(?i)Spark","(?i)Pig","\\bGo\\b","(?i)Kotlin","(?i)PHP","(?i)C#|C#.net","(?i)Ruby","(?i)TypeScript","(?i)HTML","(?i)CSS","(?i)NOSQL","(?i)Rust","(?i)Perl","(?i)Alteryx","(?i)communicate|communication", "(?i)critical thinking|critical-thinking", "(?i)problem solving|problem-solving", "(?i)business acumen", "(?i)storytell|data narrative", "(?i)adaptability","(?i)team player|team spirit", "(?i)product understanding","(?i)innovation","(?i)collaboration","(?i)visualization","(?i)attention","(?i)interpersonal","(?i)detail","(?i)presentation","(?i)multitask","(?i)decision making|decision-making","(?i)accuracy", "(?i)organization|organized","(?i)time management")
# Flag the remaining skills using the same pre-computed text
for(i in more_skills){
  all_skills[i] <- +str_detect(all_text, i)
}
colnames(all_skills) <- c("Job_Title","Job_Description","Statistical analysis","Machine Learning","data visual","wrangling","mathematics","programming","statistics","big data","Artificial intelligence","Deep learning","data modeling","data processing","data management","data manipulation","database","data architecture","data mining","research","modeling","analysis","cloud computing","technical","SPSS","Stata", "SAS","Scala","MATLAB", "Swift", "Julia","Hadoop","Spark","Pig","Tableau","REDCap", "Qualtrics", "Power BI", "Dedoose", "Atlas TI", "NVivo", "MPlus", "Mixor", "dbt", "BigQuery", "Superset", "Baseten", "Airflow","ETL/ELT pipelines","Google Analytics", "Parse.ly", "Chartbeat","Excel", "Plotly", "Google Data Studio", "Looker","Spotfire","Smartsheet","Business Intelligence","C#","VBA","DevOps","Python", "Java", "JavaScript", "SQL", "R", "C","C++","Go","Kotlin","PHP","Ruby","TypeScript","HTML","CSS","NoSQL","Rust","Perl","Alteryx","communication", "critical thinking", "problem-solving", "business acumen", "data narrative", "adaptability","team player", "product understanding","innovation","collaboration","visualization","attention","interpersonal","detail","presentation","multitask","decision-making","accuracy", "organization","time management")
all_skills_long_table <- all_skills %>%
pivot_longer(cols = !c("Job_Title","Job_Description"), names_to = "skill", values_to = "count" )
job_title <- all_skills_long_table[c("Job_Title","skill","count")] %>%
filter(count == 1)
job_title <- job_title %>%
group_by(Job_Title,skill) %>%
mutate(count= sum(count))
head(job_title)
## # A tibble: 6 × 3
## # Groups: Job_Title, skill [6]
## Job_Title skill count
## <chr> <chr> <int>
## 1 Data Analyst Statistical analysis 86
## 2 Data Analyst database 469
## 3 Data Analyst analysis 555
## 4 Data Analyst technical 546
## 5 Data Analyst Power BI 107
## 6 Data Analyst Excel 536
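For reference, the same per-title totals can be produced in a single step with count(), which also collapses the repeated rows that mutate() leaves in place; a minimal equivalent sketch:

skill_by_title <- all_skills_long_table %>%
  filter(count == 1) %>%
  count(Job_Title, skill, name = "count") %>%   # one row per title/skill pair
  arrange(Job_Title, desc(count))
head(skill_by_title)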
As illustrated in our tables and graphs, ‘analysis,’ ‘database,’ and ‘technical’ emerged as the most frequently mentioned general skills across the job postings we collected. Among the technical skills, Excel, Business Intelligence, and Tableau were the most frequently cited, with Spark close behind. For programming languages, it came as no surprise that Python and SQL were the top two; however, Scala appeared to edge out R by a small margin. On the non-technical side, communication was the most frequently mentioned skill, significantly outpacing the others, followed by organization and attention to detail. The prominence of communication as the top skill was unsurprising.
Limitations: Initially, our dataset contained over 7,000 job postings from LinkedIn. However, after removing duplicate entries, we were left with only 1,342 job postings for our analysis. A larger, more comprehensive dataset could provide further validation of the most valued skills in data science.