For this assignment we were asked to use data to answer the question, “Which are the most valued data science skills?” We found a dataset of data science job postings on Kaggle and counted how often each skill was mentioned, treating the most frequently mentioned skills as the most valued.
Data: Kaggle, “Data Science Job Postings & Skills (2024)”: https://www.kaggle.com/datasets/asaniczka/data-science-job-postings-and-skills?select=job_postings.csv
License: Open Data Commons Attribution License (ODC-By) v1.0: https://opendatacommons.org/licenses/by/1-0/index.html
library(tidyverse)
library(stringr)
df <- read.csv("https://raw.githubusercontent.com/Andreina-A/Project-3/refs/heads/main/Data_merged.csv")
head(df)
## X
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## job_link
## 1 https://au.linkedin.com/jobs/view/%F0%9F%8C%9F-expression-of-interest-data-scientist-opportunities-at-hyre-3796352718
## 2 https://au.linkedin.com/jobs/view/aml-operations-analyst-at-boq-group-3754582056
## 3 https://au.linkedin.com/jobs/view/aps5-finance-data-analyst-at-talent-3796359082
## 4 https://au.linkedin.com/jobs/view/aps6-data-business-analyst-at-indigeco-pty-ltd-3805248464
## 5 https://au.linkedin.com/jobs/view/associate-professor-professor-artificial-intelligence-and-machine-learning-at-deakin-university-3784491631
## 6 https://au.linkedin.com/jobs/view/bar-teamleader-full-time-intercontinental-perth-at-ihg-hotels-resorts-3798301068
## last_processed_time last_status got_summary got_ner
## 1 2024-01-19 09:45:09.215838+00 Finished NER t t
## 2 2024-01-19 09:45:09.215838+00 Finished NER t t
## 3 2024-01-19 09:45:09.215838+00 Finished NER t t
## 4 2024-01-19 09:45:09.215838+00 Finished NER t t
## 5 2024-01-21 03:11:02.4548+00 Finished NER t t
## 6 2024-01-19 14:46:20.703535+00 Finished NER t t
## is_being_worked
## 1 f
## 2 f
## 3 f
## 4 f
## 5 f
## 6 f
## job_title
## 1 🌟 Expression of Interest - Data Scientist Opportunities
## 2 AML Operations Analyst
## 3 APS5 Finance Data Analyst
## 4 APS6 Data/Business Analyst
## 5 Associate Professor / Professor, Artificial Intelligence and Machine Learning
## 6 Bar Teamleader (Full Time) - InterContinental Perth
## company job_location
## 1 Hyre. Sydney, New South Wales, Australia
## 2 BOQ Group Melbourne, Victoria, Australia
## 3 Talent Richmond, Victoria, Australia
## 4 Indigeco Pty Ltd Canberra, Australian Capital Territory, Australia
## 5 Deakin University Belmont, Victoria, Australia
## 6 IHG Hotels & Resorts Perth, Western Australia, Australia
## first_seen search_city search_country search_position
## 1 2024-01-13 Sydney Australia Maintenance Data Analyst
## 2 2024-01-13 Victoria Australia Credit Authorizer
## 3 2024-01-13 Victoria Australia Data Entry Clerk
## 4 2024-01-13 Canberra Australia Data Entry Clerk
## 5 2024-01-16 Redcliffe Australia Instructor Driving
## 6 2024-01-16 Western Australia Australia Finer
## job_level job_type
## 1 Mid senior Onsite
## 2 Mid senior Onsite
## 3 Mid senior Onsite
## 4 Mid senior Onsite
## 5 Mid senior Onsite
## 6 Mid senior Onsite
## job_skills
## 1 Data analytics, Machine learning, Predictive modeling, Data visualization, Data pipelines, Tableau, Power BI, Python, MATLAB, R, Machine learning algorithms, Statistical analysis, Cloudbased platforms, AWS, Azure, Google Cloud, Big data technologies, Hadoop, Spark, Natural language processing, Deep learning, Data science certifications, Machine learning certifications, Problemsolving skills, Attention to detail, Communication skills, Teamwork, Bachelor's or Master's in Computer Science Statistics Mathematics or a related field, Proven experience as a Data Scientist
## 2 AML/CTF typologies, Financial Crime Operations, AML Operations Detection, Transaction monitoring alerts, Customer and payment screening alerts, Regulatory report exception handling (TTRs & IFTIs), AML Operations operational obligations, AML/CTF Act 2006, AML/CTF Rules, Financial crime assessment, Due diligence evidence, Banking systems, Threshold Transaction Reporting, International Funds Transfer Instruction Reporting, Internal processes, Policies, Regulatory reporting SLAs, KPIs, Quality targets, Continuous improvement, Project work, Communication, Escalation, Information sharing, Spirited, Optimistic, Curious, Inclusive, Accountable, Lionhearted, Financial crime analysis, Riskbased approach, Interpersonal skills, Written skills, Verbal skills, Financial Crime risk typologies, Indicators, Red flags, Analytical skills, Problemsolving skills, Riskbased decisionmaking skills, Eye for detail, AML/CTF Act and Rules, Financial Crime risk, Money Laundering, Terrorism Financing, Sanctions, Banking codes of practice, Privacy Act, Criminal Code, Netreveal (NROD), Temenos
## 3 Data Analytics, SAS, Statistical Techniques, Data Warehousing, Reporting, Data Interpretation, Actionable Insights, Pattern Identification, Trend Analysis, Communication, Stakeholder Engagement, Australian Citizenship
## 4 Data Analysis, Forecasting, Reporting, Demand and Supply Analysis, Contract Management, Trend Analysis, Data Interpretation, Risk Assessment, Mitigation Activities, Forecasting Accuracy, Internal/External Stakeholder Engagement, Written Communication, Team Collaboration, Microsoft Office Suite, Australian Citizenship, Baseline Clearance
## 5 Generative AI, Deep Learning, Machine Learning, Artificial Intelligence, Teaching, Learning, Assessment, Research
## 6 Food and Beverage Management, Leadership, Customer Service, Beverage Knowledge, Teamwork, Communication Skills, Responsible Service of Alcohol certification, Food Safety Course, Paid Parental Leave, Paid Wellness Days, Employee Discounts, Career Development Programs
We used str_split_fixed to split each posting's comma-separated job skills into individual skills, one per column, allowing up to 150 columns per posting.
df1 <- data.frame(str_split_fixed(df$job_skills, ",", 150))
head(df1)
## X1 X2
## 1 Data analytics Machine learning
## 2 AML/CTF typologies Financial Crime Operations
## 3 Data Analytics SAS
## 4 Data Analysis Forecasting
## 5 Generative AI Deep Learning
## 6 Food and Beverage Management Leadership
## X3 X4
## 1 Predictive modeling Data visualization
## 2 AML Operations Detection Transaction monitoring alerts
## 3 Statistical Techniques Data Warehousing
## 4 Reporting Demand and Supply Analysis
## 5 Machine Learning Artificial Intelligence
## 6 Customer Service Beverage Knowledge
## X5
## 1 Data pipelines
## 2 Customer and payment screening alerts
## 3 Reporting
## 4 Contract Management
## 5 Teaching
## 6 Teamwork
## X6
## 1 Tableau
## 2 Regulatory report exception handling (TTRs & IFTIs)
## 3 Data Interpretation
## 4 Trend Analysis
## 5 Learning
## 6 Communication Skills
## X7 X8
## 1 Power BI Python
## 2 AML Operations operational obligations AML/CTF Act 2006
## 3 Actionable Insights Pattern Identification
## 4 Data Interpretation Risk Assessment
## 5 Assessment Research
## 6 Responsible Service of Alcohol certification Food Safety Course
## X9 X10
## 1 MATLAB R
## 2 AML/CTF Rules Financial crime assessment
## 3 Trend Analysis Communication
## 4 Mitigation Activities Forecasting Accuracy
## 5
## 6 Paid Parental Leave Paid Wellness Days
## X11 X12
## 1 Machine learning algorithms Statistical analysis
## 2 Due diligence evidence Banking systems
## 3 Stakeholder Engagement Australian Citizenship
## 4 Internal/External Stakeholder Engagement Written Communication
## 5
## 6 Employee Discounts Career Development Programs
## X13
## 1 Cloudbased platforms
## 2 Threshold Transaction Reporting
## 3
## 4 Team Collaboration
## 5
## 6
## X14 X15
## 1 AWS Azure
## 2 International Funds Transfer Instruction Reporting Internal processes
## 3
## 4 Microsoft Office Suite Australian Citizenship
## 5
## 6
## X16 X17 X18 X19
## 1 Google Cloud Big data technologies Hadoop Spark
## 2 Policies Regulatory reporting SLAs KPIs Quality targets
## 3
## 4 Baseline Clearance
## 5
## 6
## X20 X21 X22
## 1 Natural language processing Deep learning Data science certifications
## 2 Continuous improvement Project work Communication
## 3
## 4
## 5
## 6
## X23 X24 X25
## 1 Machine learning certifications Problemsolving skills Attention to detail
## 2 Escalation Information sharing Spirited
## 3
## 4
## 5
## 6
## X26 X27
## 1 Communication skills Teamwork
## 2 Optimistic Curious
## 3
## 4
## 5
## 6
## X28
## 1 Bachelor's or Master's in Computer Science Statistics Mathematics or a related field
## 2 Inclusive
## 3
## 4
## 5
## 6
## X29 X30 X31
## 1 Proven experience as a Data Scientist
## 2 Accountable Lionhearted Financial crime analysis
## 3
## 4
## 5
## 6
## X32 X33 X34 X35
## 1
## 2 Riskbased approach Interpersonal skills Written skills Verbal skills
## 3
## 4
## 5
## 6
## X36 X37 X38 X39
## 1
## 2 Financial Crime risk typologies Indicators Red flags Analytical skills
## 3
## 4
## 5
## 6
## X40 X41 X42
## 1
## 2 Problemsolving skills Riskbased decisionmaking skills Eye for detail
## 3
## 4
## 5
## 6
## X43 X44 X45
## 1
## 2 AML/CTF Act and Rules Financial Crime risk Money Laundering
## 3
## 4
## 5
## 6
## X46 X47 X48 X49
## 1
## 2 Terrorism Financing Sanctions Banking codes of practice Privacy Act
## 3
## 4
## 5
## 6
## X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60 X61
## 1
## 2 Criminal Code Netreveal (NROD) Temenos
## 3
## 4
## 5
## 6
## X62 X63 X64 X65 X66 X67 X68 X69 X70 X71 X72 X73 X74 X75 X76 X77 X78 X79 X80
## 1
## 2
## 3
## 4
## 5
## 6
## X81 X82 X83 X84 X85 X86 X87 X88 X89 X90 X91 X92 X93 X94 X95 X96 X97 X98 X99
## 1
## 2
## 3
## 4
## 5
## 6
## X100 X101 X102 X103 X104 X105 X106 X107 X108 X109 X110 X111 X112 X113 X114
## 1
## 2
## 3
## 4
## 5
## 6
## X115 X116 X117 X118 X119 X120 X121 X122 X123 X124 X125 X126 X127 X128 X129
## 1
## 2
## 3
## 4
## 5
## 6
## X130 X131 X132 X133 X134 X135 X136 X137 X138 X139 X140 X141 X142 X143 X144
## 1
## 2
## 3
## 4
## 5
## 6
## X145 X146 X147 X148 X149 X150
## 1
## 2
## 3
## 4
## 5
## 6
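One detail of the split is worth illustrating: splitting on a bare comma keeps the space that follows each comma, so every skill after the first in a posting starts with a space. The sketch below uses a toy vector (`skills_raw` is an illustrative assumption, not the project data) to compare the bare-comma split with a pattern that also consumes the surrounding whitespace.

```r
library(stringr)

# Toy input standing in for df$job_skills (illustrative values only)
skills_raw <- c("Python, SQL, Tableau", "SQL, Python")

# Splitting on a bare comma keeps the space after each comma
split_plain <- str_split_fixed(skills_raw, ",", 3)

# Splitting on a comma plus any surrounding whitespace drops that space
split_trimmed <- str_split_fixed(skills_raw, "\\s*,\\s*", 3)

split_plain[1, 2]    # second skill of the first posting, with a leading space
split_trimmed[1, 2]  # same skill without the leading space
```

With the second pattern, "SQL" in first position and "SQL" in a later position would be identical strings and could be counted together.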
We then combined the job link and job title columns from the original dataset with the split skill columns, and renamed the two added variables to job_link and job_title.
df2 <- cbind(df$job_link, df$job_title, df1)
names(df2)[names(df2) == "df$job_link"] <- "job_link"
names(df2)[names(df2) == "df$job_title"] <- "job_title"
head(df2)
## job_link
## 1 https://au.linkedin.com/jobs/view/%F0%9F%8C%9F-expression-of-interest-data-scientist-opportunities-at-hyre-3796352718
## 2 https://au.linkedin.com/jobs/view/aml-operations-analyst-at-boq-group-3754582056
## 3 https://au.linkedin.com/jobs/view/aps5-finance-data-analyst-at-talent-3796359082
## 4 https://au.linkedin.com/jobs/view/aps6-data-business-analyst-at-indigeco-pty-ltd-3805248464
## 5 https://au.linkedin.com/jobs/view/associate-professor-professor-artificial-intelligence-and-machine-learning-at-deakin-university-3784491631
## 6 https://au.linkedin.com/jobs/view/bar-teamleader-full-time-intercontinental-perth-at-ihg-hotels-resorts-3798301068
## job_title
## 1 🌟 Expression of Interest - Data Scientist Opportunities
## 2 AML Operations Analyst
## 3 APS5 Finance Data Analyst
## 4 APS6 Data/Business Analyst
## 5 Associate Professor / Professor, Artificial Intelligence and Machine Learning
## 6 Bar Teamleader (Full Time) - InterContinental Perth
## X1 X2
## 1 Data analytics Machine learning
## 2 AML/CTF typologies Financial Crime Operations
## 3 Data Analytics SAS
## 4 Data Analysis Forecasting
## 5 Generative AI Deep Learning
## 6 Food and Beverage Management Leadership
## X3 X4
## 1 Predictive modeling Data visualization
## 2 AML Operations Detection Transaction monitoring alerts
## 3 Statistical Techniques Data Warehousing
## 4 Reporting Demand and Supply Analysis
## 5 Machine Learning Artificial Intelligence
## 6 Customer Service Beverage Knowledge
## X5
## 1 Data pipelines
## 2 Customer and payment screening alerts
## 3 Reporting
## 4 Contract Management
## 5 Teaching
## 6 Teamwork
## X6
## 1 Tableau
## 2 Regulatory report exception handling (TTRs & IFTIs)
## 3 Data Interpretation
## 4 Trend Analysis
## 5 Learning
## 6 Communication Skills
## X7 X8
## 1 Power BI Python
## 2 AML Operations operational obligations AML/CTF Act 2006
## 3 Actionable Insights Pattern Identification
## 4 Data Interpretation Risk Assessment
## 5 Assessment Research
## 6 Responsible Service of Alcohol certification Food Safety Course
## X9 X10
## 1 MATLAB R
## 2 AML/CTF Rules Financial crime assessment
## 3 Trend Analysis Communication
## 4 Mitigation Activities Forecasting Accuracy
## 5
## 6 Paid Parental Leave Paid Wellness Days
## X11 X12
## 1 Machine learning algorithms Statistical analysis
## 2 Due diligence evidence Banking systems
## 3 Stakeholder Engagement Australian Citizenship
## 4 Internal/External Stakeholder Engagement Written Communication
## 5
## 6 Employee Discounts Career Development Programs
## X13
## 1 Cloudbased platforms
## 2 Threshold Transaction Reporting
## 3
## 4 Team Collaboration
## 5
## 6
## X14 X15
## 1 AWS Azure
## 2 International Funds Transfer Instruction Reporting Internal processes
## 3
## 4 Microsoft Office Suite Australian Citizenship
## 5
## 6
## X16 X17 X18 X19
## 1 Google Cloud Big data technologies Hadoop Spark
## 2 Policies Regulatory reporting SLAs KPIs Quality targets
## 3
## 4 Baseline Clearance
## 5
## 6
## X20 X21 X22
## 1 Natural language processing Deep learning Data science certifications
## 2 Continuous improvement Project work Communication
## 3
## 4
## 5
## 6
## X23 X24 X25
## 1 Machine learning certifications Problemsolving skills Attention to detail
## 2 Escalation Information sharing Spirited
## 3
## 4
## 5
## 6
## X26 X27
## 1 Communication skills Teamwork
## 2 Optimistic Curious
## 3
## 4
## 5
## 6
## X28
## 1 Bachelor's or Master's in Computer Science Statistics Mathematics or a related field
## 2 Inclusive
## 3
## 4
## 5
## 6
## X29 X30 X31
## 1 Proven experience as a Data Scientist
## 2 Accountable Lionhearted Financial crime analysis
## 3
## 4
## 5
## 6
## X32 X33 X34 X35
## 1
## 2 Riskbased approach Interpersonal skills Written skills Verbal skills
## 3
## 4
## 5
## 6
## X36 X37 X38 X39
## 1
## 2 Financial Crime risk typologies Indicators Red flags Analytical skills
## 3
## 4
## 5
## 6
## X40 X41 X42
## 1
## 2 Problemsolving skills Riskbased decisionmaking skills Eye for detail
## 3
## 4
## 5
## 6
## X43 X44 X45
## 1
## 2 AML/CTF Act and Rules Financial Crime risk Money Laundering
## 3
## 4
## 5
## 6
## X46 X47 X48 X49
## 1
## 2 Terrorism Financing Sanctions Banking codes of practice Privacy Act
## 3
## 4
## 5
## 6
## X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60 X61
## 1
## 2 Criminal Code Netreveal (NROD) Temenos
## 3
## 4
## 5
## 6
## X62 X63 X64 X65 X66 X67 X68 X69 X70 X71 X72 X73 X74 X75 X76 X77 X78 X79 X80
## 1
## 2
## 3
## 4
## 5
## 6
## X81 X82 X83 X84 X85 X86 X87 X88 X89 X90 X91 X92 X93 X94 X95 X96 X97 X98 X99
## 1
## 2
## 3
## 4
## 5
## 6
## X100 X101 X102 X103 X104 X105 X106 X107 X108 X109 X110 X111 X112 X113 X114
## 1
## 2
## 3
## 4
## 5
## 6
## X115 X116 X117 X118 X119 X120 X121 X122 X123 X124 X125 X126 X127 X128 X129
## 1
## 2
## 3
## 4
## 5
## 6
## X130 X131 X132 X133 X134 X135 X136 X137 X138 X139 X140 X141 X142 X143 X144
## 1
## 2
## 3
## 4
## 5
## 6
## X145 X146 X147 X148 X149 X150
## 1
## 2
## 3
## 4
## 5
## 6
This code chunk pivots the job skills into long format with one skill per row, converts empty strings to NA and removes them, and changes all skills to lowercase so that skills differing only in capitalization are counted together.
final_df <- pivot_longer(df2, cols = starts_with("X"), names_to = "number", values_to = "skill")
final_df[final_df == ""] <- NA
final_df <- subset(final_df,!is.na(skill))
final_df$skill <- tolower(final_df$skill)
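The same long table could probably be reached without the 150-column intermediate. The sketch below is one possible alternative on toy data, not the approach we used: it relies on `tidyr::separate_rows()` to put each skill on its own row, and the `toy` data frame is an illustrative assumption.

```r
library(dplyr)
library(tidyr)

# Toy postings standing in for the real data (illustrative values only)
toy <- data.frame(
  job_title  = c("Data Scientist", "Data Analyst"),
  job_skills = c("Python, SQL, Tableau", "SQL, Python")
)

long_skills <- toy %>%
  separate_rows(job_skills, sep = ",") %>%         # one row per skill
  mutate(skill = tolower(trimws(job_skills))) %>%  # trim spaces, lowercase
  filter(skill != "") %>%                          # drop empty entries
  select(job_title, skill)

long_skills
```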
We then counted the number of occurrences for each skill and arranged the count from largest to smallest to pull out the most valued data skills.
occurrences <- final_df %>% count(skill)
occurrences %>% arrange(desc(n))
## # A tibble: 66,556 × 2
## skill n
## <chr> <int>
## 1 " python" 4437
## 2 " sql" 4275
## 3 " communication" 2501
## 4 " data analysis" 2359
## 5 " data visualization" 2295
## 6 " machine learning" 2045
## 7 " communication skills" 1694
## 8 " tableau" 1673
## 9 " aws" 1631
## 10 " project management" 1627
## # ℹ 66,546 more rows
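The count-then-arrange step can also be written as a single call, since `dplyr::count()` accepts `sort = TRUE`. A small sketch on a toy vector of skills (the values are illustrative, not taken from the job postings):

```r
library(dplyr)

# Toy long-format skills (illustrative values only)
toy_skills <- data.frame(
  skill = c("python", "sql", "python", "tableau", "python", "sql")
)

# count() with sort = TRUE returns the counts already in descending order
toy_counts <- toy_skills %>% count(skill, sort = TRUE)

toy_counts
```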
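We took the 33 most frequent skills, dropped rows 7, 18, and 25 (row 7 is " communication skills", a variant of " communication"), and manually set the counts for communication and data analysis to 4200 and 4639. This leaves 30 observations.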
df_occur <- occurrences %>% slice_max(n, n=33)
df_occur <- df_occur[-c(7, 18, 25), ]
# Assign integers, not strings, so n stays numeric for plotting
df_occur$n[df_occur$n == 2501] <- 4200L
df_occur$n[df_occur$n == 2359] <- 4639L
print(df_occur)
## # A tibble: 30 × 2
## skill n
## <chr> <int>
## 1 " python" 4437
## 2 " sql" 4275
## 3 " communication" 4200
## 4 " data analysis" 4639
## 5 " data visualization" 2295
## 6 " machine learning" 2045
## 7 " tableau" 1673
## 8 " aws" 1631
## 9 " project management" 1627
## 10 " r" 1532
## # ℹ 20 more rows
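As an alternative to dropping rows and overriding counts by hand, variant labels could be mapped to one canonical label and summed. The sketch below shows the idea on toy counts; the `synonyms` map and the numbers are hypothetical and are not the set of variants or totals used above.

```r
library(dplyr)

# Toy counts (illustrative values only)
toy_counts <- data.frame(
  skill = c("communication", "communication skills", "python"),
  n     = c(5L, 3L, 7L)
)

# Hypothetical synonym map: names are variants, values are canonical labels
synonyms <- c("communication skills" = "communication")

merged <- toy_counts %>%
  mutate(skill = ifelse(skill %in% names(synonyms),
                        unname(synonyms[skill]),  # replace known variants
                        skill)) %>%               # keep everything else
  group_by(skill) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  arrange(desc(n))

merged
```

Keeping the mapping in one named vector makes the merged totals reproducible from the raw counts.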
We created a variable called group that labels the skill type for each of the 30 observations.
df_occur$group <- c("program_lang", "program_lang", "job_skill", "data_skill", "data_visual", "data_skill", "data_visual", "data_tools", "job_skill", "program_lang", "data_visual", "job_skill", "data_tools", "data_skill", "program_lang", "job_skill", "data_skill", "data_skill", "data_skill", "data_skill", "job_skill", "job_skill", "data_tools", "data_skill", "data_skill", "data_tools", "data_visual", "data_skill", "data_skill", "data_skill")
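Because the vector above assigns labels by position, it depends on the exact row order of df_occur. A lookup table joined by skill name is one way to avoid that dependence. The sketch below uses four skills with the group labels reported in our findings; the `toy_top` and `skill_groups` data frames are illustrative assumptions, with skill names written without leading spaces.

```r
library(dplyr)

# Toy subset of the top skills (names shown without leading spaces)
toy_top <- data.frame(
  skill = c("python", "tableau", "communication", "aws"),
  n     = c(4437L, 1673L, 4200L, 1631L)
)

# Lookup table: one row per skill with its group label
skill_groups <- data.frame(
  skill = c("aws", "communication", "python", "tableau"),
  group = c("data_tools", "job_skill", "program_lang", "data_visual")
)

# Join by name, so the row order of either table does not matter
labelled <- left_join(toy_top, skill_groups, by = "skill")

labelled
```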
We plotted the frequency of each skill with ggplot, using facet_wrap to give each skill group its own panel.
ggplot(df_occur, aes(x = skill, y = n)) +
geom_col(fill = "forestgreen") +
ggtitle("Most Valued Data Science Skills") +
ylab("Frequency from Job Postings") + xlab("Skills") +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
facet_wrap(~group, scales = "free")
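By default the bars within each panel are ordered alphabetically by skill. If ordering by frequency is preferred, `reorder()` can be applied to the x variable. A minimal sketch on toy data, assuming n is numeric (the `toy_plot_df` data frame is an illustrative assumption):

```r
library(ggplot2)

# Toy data (illustrative subset; n must be numeric for reorder() to work)
toy_plot_df <- data.frame(
  skill = c("python", "sql", "r", "tableau", "power bi"),
  n     = c(4437, 4275, 1532, 1673, 900),
  group = c("program_lang", "program_lang", "program_lang",
            "data_visual", "data_visual")
)

# reorder(skill, -n) sorts the skill levels from most to least frequent
p <- ggplot(toy_plot_df, aes(x = reorder(skill, -n), y = n)) +
  geom_col(fill = "forestgreen") +
  xlab("Skills") + ylab("Frequency from Job Postings") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  facet_wrap(~group, scales = "free")

p
```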
Overall findings: Job postings looked for data scientists with data skills (statistics, business intelligence, data analysis, machine learning, and data warehousing), data tools (AWS, Spark, Hadoop, and Snowflake), data visualization skills (data visualization, Tableau, data modeling, and Power BI), job skills (communication, project management, problem solving, teamwork, and attention to detail), and programming languages (Python, SQL, R, and Java). The programming language result came as a bit of a surprise: Python was mentioned more often than R, even though we think of R as heavily used for data analysis and Python as more of a general programming language. In the future, we would look at which job titles value certain skill types and how salaries vary with each skill and job title.