We use data to answer the question: “Which are the most valued data science skills?”
Our general strategy is to start from a dataset of thousands of LinkedIn job postings, filter it down to data science (or data science-adjacent) roles, and then extract from the job descriptions the skills that employers are looking for.
To be a bit more specific, we will attempt to answer two questions:
It’s worth pointing out, we originally wanted to look at another question, namely how salaries of different jobs might be associated with different skills. However, there was a lot of salary data missing, and this approach was just not feasible.
library(DBI)
library(RMySQL)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
We start by establishing a connection to the Amazon RDS database:
con <- dbConnect(RMySQL::MySQL(),
dbname = "linkedin_job_postings_2023",
host = "database-1-instance-1.cngfftmjinac.us-east-2.rds.amazonaws.com",
port = 3306,
user = "admin",
password = "DDbna}rt2R0h")
dbListTables(con)
## [1] "benefits" "cleaned" "companies"
## [4] "company_industries" "company_specialities" "employee_counts"
## [7] "job_industries" "job_postings" "job_skills"
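The connection call above embeds the credentials directly in the script. As a minimal sketch of one alternative, the settings could be read from environment variables instead. The helper name `db_settings` and the `LINKEDIN_DB_*` variable names are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: collect connection settings from environment variables.
# The LINKEDIN_DB_* names are illustrative assumptions, not from the original.
db_settings <- function() {
  list(
    dbname   = Sys.getenv("LINKEDIN_DB_NAME", "linkedin_job_postings_2023"),
    host     = Sys.getenv("LINKEDIN_DB_HOST"),
    port     = as.integer(Sys.getenv("LINKEDIN_DB_PORT", "3306")),
    user     = Sys.getenv("LINKEDIN_DB_USER"),
    password = Sys.getenv("LINKEDIN_DB_PASSWORD")
  )
}

# Usage (not run here, since it needs a live database):
# con <- do.call(DBI::dbConnect, c(list(RMySQL::MySQL()), db_settings()))
```

With this arrangement the password lives in the environment (for example in a local `.Renviron` file) rather than in the rendered report.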
suppressWarnings({
query <- "SELECT * FROM benefits"
benefits <- dbGetQuery(con, query)
query <- "SELECT * FROM cleaned"
cleaned <- dbGetQuery(con, query)
query <- "SELECT * FROM companies"
companies <- dbGetQuery(con, query)
query <- "SELECT * FROM company_industries"
company_industries <- dbGetQuery(con, query)
query <- "SELECT * FROM company_specialities"
company_specialities <- dbGetQuery(con, query)
query <- "SELECT * FROM employee_counts"
employee_counts <- dbGetQuery(con, query)
query <- "SELECT * FROM job_industries"
job_industries <- dbGetQuery(con, query)
query <- "SELECT * FROM job_postings"
job_postings <- dbGetQuery(con, query)
query <- "SELECT * FROM job_skills"
job_skills <- dbGetQuery(con, query)
#closing connection
dbDisconnect(con)
})
## [1] TRUE
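The nine query-and-fetch pairs above all follow the same pattern, so they could also be written as a loop over the table names. The following is a sketch only: `read_all_tables` is a hypothetical helper, and it assumes `con` is an open DBI connection like the one created earlier.

```r
# Hypothetical helper: read every table reachable through a DBI connection
# into a named list of data frames.
read_all_tables <- function(con) {
  table_names <- DBI::dbListTables(con)
  tables <- lapply(table_names, function(tbl) DBI::dbReadTable(con, tbl))
  stats::setNames(tables, table_names)
}

# Usage (assumes an open connection `con`):
# tables <- read_all_tables(con)
# job_postings <- tables$job_postings
```

Keeping the tables in a list avoids nine near-identical assignments, at the cost of having to write `tables$job_postings` instead of `job_postings`.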
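The job_postings table is of special interest. However, we are only interested in the postings for data science (or at least data-related) roles. We’ll use the grepl function to keep every title that contains the string “data”, ignoring case. Because this is a substring match, titles such as “Teradata Developer” and “Datacenter Technician” are also included.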
ds_jobs <- job_postings %>%
filter(grepl("Data", title, ignore.case = TRUE))
head(ds_jobs)
## job_id company_id title
## 1 3586162459 69642092 Teradata Developer
## 2 3690692186 61242 Seasonal Payroll/Data Entry Clerk
## 3 3691795980 7573454 Data Engineer
## 4 3692302089 37768 Data Scientist/ Product Analyst
## 5 3692363778 2474970 Data Analytics Consultant
## 6 3693028967 89721371 Datacenter Technician
## description
## 1 Months Overview of should be able to develop and support application solutions with a focus on Teradata for a Financial conceptual understanding of the context and business Should be able to understand the business produce Data and design and implement code following the best to perform data quality checks methodically to understand how to accurately utilize client to communicate results and methodology with the project team and Should be able to work in to meet deadlines and thrive in a banking solutions for applications involving large and complex data and provides reconciliation and test Required SkillsShould be and be able to well under stringent deadlinesFinancial Domain KnowledgeExperience in Mainframes and interacting with business understanding requirements and providing quick have excellent exposure to project to learn and adapt to changes Desired Certifications
## 2 Universal Screen specialty is marketing home and much more direct to the consumer through our websites and social This results in a massive holiday rush and hundreds of seasonal jobs to be filled in our warehouse operations very We are looking for a Team Member who is enthusiastic and dynamic to help meet our HRIS needs during this This position will be responsible for updating multiple HR systems with a high volume of employment status including new and Additional responsibilities include weekly review and update of time processing of internal and temporary and attendance tracking and This position is seasonal in We are looking for the right candidate to start in August with an anticipated end to the assignment in in high volume data entryAccuracy and attention to detailExceptional communication skillsCreative problem solving skillsA Professional and Positive attitudeAbility to and adjust to frequently changing prioritiesExperience in payroll processing and HRIS systems preferredHighly organizedEffectively interact with people at all levelsThe ability to work in Microsoft Excel and Outlook Apply today for immediate
## 3 Job and launch extremely efficient and reliable data pipelines to move data and to provide intuitive analytics to our partner data more discoverable and easy to use for Data Scientists and Analysts across the service operations with other engineers and Data Scientists to discover the best and solve issues in our existing data pipelines and envision and build their understanding of one or more of the or JavaStrong understanding of SQLBroad knowledge of the data infrastructure ecosystemExperience with Hadoop or other working with large data volumesExperience in building Data Warehouses and data with any of the following is a or HiveInterest in learning Data Bricks
## 4 Looking for candidates with experience in data science or product analyst roles with a strong background in data statistical modelling and specific experience working with Note Need some one who can work on Data Product Analyst San CA to Months Project Overview Drive analysis for Fitbit product teams focused on and Analyses include optimizing user lifetime improving usage and evaluating content and suggesting future service innovations for Deliver effective presentations of findings and recommendations to multiple levels of creating visual displays of quantitative Overall Responsibilities Use causal inferential methods to quantify impact on product deliveries when experimentation is not Collaborate with stakeholders to formulate and complete full cycle analysis that includes data ongoing scaled deliverables and Help Google focus on key decisions to improve products and Top Daily Responsibilities Deep dive on core product user and relation to user satisfaction and Conduct data analysis to make business recommendations impact Develop and automate iteratively build and prototype dashboards to provide insights at solving for business Mandatory Skills Degree or equivalent experience years of experience Business acumen Knowledge of structured communicating risk considers business leverages for consumer Ability to extract relevant information from reading code in one or more core languages and frameworks and ability to leverage the code as a resource to create work output for users or Experience working with highly unstructured messy datasets and ability to clean and derive Communication and active Ability to clearly explain stats or domain knowledge to people not familiar with the subject matter or who lack a quant This includes the ability to explain reasons in a logical way that is very easy to understand by Leverage communication skills and active listening to manage stakeholders and to set proper technical direction for teams or Data analysis and Ability to 
analyze draw generate alternatives and and evaluate This includes the ability to use data to add value to business planning and Data and Ability to extract data and validate raw data to ensure it is valid and reliable and ability to clean data based on validation criteria and prepare for further Ability to define and rationalize appropriate create pipelines and dashboards that tell a Ability to measure the success of a given Knowledge of different ability to select the appropriate approach for the problem understanding of the broader context of and increasing the value of the business and revenue Modeling Ability to apply multiple approaches and select the right analysis for the Understanding of the mathematical and statistical concepts underpinning and Knowledge of essential statistical methods used to analyze data descriptive ability to identify and conduct appropriate basic statistical analysis to determine the basic parameters of a set of data and solve Product analysis Ability to interact and respectfully with especially senior to and clarify concerns or issues regarding an existing or ability to effectively address difficult handle pushback from a and maintain a professional demeanor while engaging in challenging or sometimes In addition to influencing this includes actively managing priorities across and Project execution and Ability to proactively communicate insights and influence stakeholders and subject matter experts to inform Ability to convert and uncover real world problems within the business context into trackable metrics or a structured as well as the ability to generate insights from data analysis in a way that is meaningful to the Ability to prepare effective presentations in content and and to speak competently to the level of the Ability to identify and debug product issues and user including the ability to carry out root cause analysis and quantitatively assess critical user think about and Please rate yourself on a scale of being the Please 
list the of years of experience you have with that particular Skills Rating Years of Experience with Skill Business acumen intuition extraction Communication and active listening Data analysis and synthesis Data and cleaning Analysis Modeling design Product analysis leadership Project execution and influence Provide me below information Name of the Candidate Location Address Number ID Rate on or Offers in Pipeline Availability Availability Status Warm Zainab Saba Talent Acquisition Specialist US XTN
## 5 About the CompanyDiLytics is a leading Information Technology Services Provider completely focused on Business Data ETL Data Integration and Enterprise Performance Management We have been growing for years with global offices in the Canada and Data Analytics CA implementing Analytics in one or more of the following Supply Human Sales Order implementing Analytics on one or more of the following data sources Oracle JD implementing Analytics on one or more of the following tools Power Oracle Analytics implementing Analytics on one or more of the Data Warehouse platforms Oracle Azure Data implementing ETL Data Integration in one or more of the following tools Oracle Data Azure Data years of IT to do offshore coordination working during hours that overlap with India time zone to a reasonable Travel whenever
## 6 GA About the We are looking for a dedicated Datacenter Technician to join our dynamic This is a contract position with the potential to transition to a role based on The ideal candidate is expected to be available at all times of the Key and unracking servers and other equipment in the data remote hands for troubleshooting and maintaining the networking tasks to ensure smooth communication within the with the IT team for the effective functioning of servers and promptly to any technical Qualifications and be based in or willing to relocate to experience as a Datacenter with racking and unracking understanding of networking protocols and experience with Integrated Lights Out is a to work flexibly throughout the communication both written and Contract This position starts as a contract with the opportunity for placement based on performance and mutual
## max_salary med_salary min_salary pay_period formatted_work_type
## 1 0 0 0 Contract
## 2 0 0 0 Temporary
## 3 0 0 0 Contract
## 4 80 0 70 HOURLY Contract
## 5 0 0 0 Full-time
## 6 50 0 15 HOURLY Contract
## location applies original_listed_time remote_allowed views
## 1 United States 1 0 1 56
## 2 Hudson, OH 5 0 325
## 3 United States 5 0 1 101
## 4 San Francisco, CA 7 0 37
## 5 Sacramento, CA 7 0 1 26
## 6 Alpharetta, GA 1 0 34
## job_posting_url
## 1 https://www.linkedin.com/jobs/view/3586162459/?trk=jobs_biz_prem_srch
## 2 https://www.linkedin.com/jobs/view/3690692186/?trk=jobs_biz_prem_srch
## 3 https://www.linkedin.com/jobs/view/3691795980/?trk=jobs_biz_prem_srch
## 4 https://www.linkedin.com/jobs/view/3692302089/?trk=jobs_biz_prem_srch
## 5 https://www.linkedin.com/jobs/view/3692363778/?trk=jobs_biz_prem_srch
## 6 https://www.linkedin.com/jobs/view/3693028967/?trk=jobs_biz_prem_srch
## application_url
## 1
## 2 https://recruiting.paylocity.com/recruiting/jobs/Details/1859929/Universal-Screen-Arts-Inc/Seasonal-PayrollData-Entry-Clerk
## 3
## 4
## 5
## 6
## application_type expiry closed_time formatted_experience_level skills_desc
## 1 ComplexOnsiteApply 0
## 2 OffsiteApply 0
## 3 ComplexOnsiteApply 0
## 4 ComplexOnsiteApply 0 Mid-Senior level
## 5 ComplexOnsiteApply 0
## 6 ComplexOnsiteApply 0
## listed_time posting_domain sponsored work_type currency compensation_type
## 1 0 0 CONTRACT
## 2 0 0 TEMPORARY
## 3 0 0 CONTRACT
## 4 0 0 CONTRACT USD BASE_SALARY
## 5 0 0 FULL_TIME
## 6 0 0 CONTRACT USD BASE_SALARY
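The first rows show that the substring match also picks up titles where “data” is only part of a longer word. As a small illustration (not the filter used in this analysis), a word-boundary pattern narrows the match. The titles below are taken from the output above.

```r
titles <- c("Teradata Developer", "Datacenter Technician",
            "Data Engineer", "Seasonal Payroll/Data Entry Clerk")

# Substring match, as in the filter above: every title contains "data"
substring_match <- grepl("data", titles, ignore.case = TRUE)
# TRUE TRUE TRUE TRUE

# Word-boundary match: "data" must stand alone as a word
word_match <- grepl("\\bdata\\b", titles, ignore.case = TRUE, perl = TRUE)
# FALSE FALSE TRUE TRUE
```

Note that the stricter pattern still keeps the data entry posting, so filtering on the title alone cannot fully separate data science roles from other data-related roles.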
Soon, we will take a look at all data science-related roles. In this section, we take a slightly different approach. After all, we are looking for the most desirable data science skills. Perhaps, then, we can look at the most desirable jobs and see which skills are associated with those jobs. There are different ways to rank the desirability of jobs. We opted to use views, on the assumption that a job is more desirable if its posting received more views.
The LinkedIn dataset is sparse; despite the “skills_desc” column, it is not entirely obvious what skills are associated with most jobs. As such, we will work to extract the skills from the postings’ descriptions. To accomplish this, we need to match words in the descriptions to words that refer to skills. We took part of an HTML document from a website that discusses resume skills: https://enhancv.com/resume-skills/
library(rvest)
html_content <- read_html('https://raw.githubusercontent.com/hbedros/data607_prj3/gss/enhancv_excerpt.html')
li_items <- html_content %>%
html_nodes("li") %>%
html_text() %>%
str_squish() #because there is otherwise \n\n at the end of lines
li_items = tolower(li_items)
sample(li_items, 10)
## [1] "goal setting"
## [2] "leadership"
## [3] "understanding different perspectives"
## [4] "angular"
## [5] "preparation"
## [6] "fostering a sense of ownership and confidence"
## [7] "cloud infrastructure"
## [8] "influence"
## [9] "critical thinking:"
## [10] "confidence"
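The sample shows that some scraped entries keep stray punctuation, such as the trailing colon in "critical thinking:". The sketch below runs the same scraping steps on a tiny inline HTML string, made up for illustration, and adds one optional cleaning step that is not part of the original pipeline.

```r
library(rvest)
library(stringr)

# Made-up HTML fragment standing in for the downloaded excerpt
demo_html <- read_html(
  "<ul><li>Goal setting\n\n</li><li>Critical thinking:</li><li>  SQL </li></ul>"
)

demo_skills <- demo_html %>%
  html_nodes("li") %>%
  html_text() %>%
  str_squish() %>%      # collapse whitespace and trim the ends
  tolower() %>%
  str_remove(":$") %>%  # extra step (assumption): drop a trailing colon
  unique()

demo_skills
```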
Now we have this collection of skills. We need a function to extract matches from individual job postings:
extract_skills <- function(description) {
  description <- tolower(description)              # to match li_items case
  skills <- unlist(strsplit(description, "\\s+"))  # split on whitespace into a vector of words
  skills <- skills[skills %in% li_items]           # keep only words that appear in li_items
  return(paste(unique(skills), collapse = ", "))   # combine matches into one string
}
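Because the function splits the description on whitespace, it can only match single-word skills; an entry such as "goal setting" can never match, and a word with punctuation attached (for example "sql,") is missed as well. A minimal sketch of a phrase-aware variant follows. The names `normalize_text` and `extract_skill_phrases` are hypothetical, and this is an illustration under those assumptions rather than the method used in this analysis.

```r
# Hypothetical helper: lower-case and replace punctuation with single spaces
normalize_text <- function(x) {
  trimws(gsub("[^a-z0-9+#]+", " ", tolower(x)))
}

# Hypothetical variant of extract_skills() that matches whole phrases
extract_skill_phrases <- function(description, skill_list) {
  text <- paste0(" ", normalize_text(description), " ")
  skills <- unique(normalize_text(skill_list))
  skills <- skills[nzchar(skills)]
  # a skill counts as present if it occurs as a whole, space-delimited phrase
  hits <- vapply(
    skills,
    function(s) grepl(paste0(" ", s, " "), text, fixed = TRUE),
    logical(1),
    USE.NAMES = FALSE
  )
  paste(skills[hits], collapse = ", ")
}

demo_result <- extract_skill_phrases(
  "Strong SQL, goal setting and critical thinking. Good to have.",
  c("sql", "goal setting", "go", "critical thinking:")
)
demo_result
```

Matching on padded phrases keeps "go" from matching inside "good", although it would still match the ordinary verb "go" wherever that appears as a separate word.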
Now we can apply this function, and create a new column in the dataframe with the extracted skills. While we’re at it, we will remove columns from the dataframe that aren’t as relevant to our analysis and add the company name for clarity.
ds_jobs$skills <- sapply(ds_jobs$description, extract_skills, USE.NAMES = FALSE)
ds_jobs_clean <- ds_jobs[c("job_id", "company_id", "title", "views", "skills")] %>%
left_join(select(companies, company_id, name), by = "company_id") %>%
relocate(name, .after = 'company_id') %>%
rename(company = name) %>%
select(-company_id)
head(ds_jobs_clean, 10)
## job_id company
## 1 3586162459 Sapience Inc
## 2 3690692186 Universal Screen Arts Inc.
## 3 3691795980 <NA>
## 4 3692302089 Milestone Technologies Inc.
## 5 3692363778 DiLytics
## 6 3693028967 CatalystVM LLC
## 7 3693040943 TekValue IT Solutions
## 8 3693041960 Publicis Sapient
## 9 3693043664 Premier International
## 10 3693043742 Publicis Sapient
## title views
## 1 Teradata Developer 56
## 2 Seasonal Payroll/Data Entry Clerk 325
## 3 Data Engineer 101
## 4 Data Scientist/ Product Analyst 37
## 5 Data Analytics Consultant 26
## 6 Datacenter Technician 34
## 7 Azure Data Engineer 534
## 8 Senior Data Engineer 86
## 9 Senior Oracle HCM Data & Technology Consultant 116
## 10 Senior Data Engineer 86
## skills
## 1 understanding
## 2 communication, excel
## 3 understanding
## 4 communication, understanding, influence, leadership
## 5
## 6 networking, communication, understanding
## 7 go
## 8 sql
## 9 respect, coaching, leadership, excel, communication, go, accountability
## 10 sql
A peek at the dataframe reveals a wide range of skills. Again, however, we are only interested in the postings that had a lot of views, since these reflect the most desirable roles, which plausibly demand the most desirable skills. Let’s look at only the top 25% of listings by views:
upper_quart <- quantile(ds_jobs_clean$views, 0.75, na.rm = TRUE)
many_views <- ds_jobs_clean %>%
filter(views >= upper_quart)
many_views
## job_id company
## 1 3690692186 Universal Screen Arts Inc.
## 2 3693040943 TekValue IT Solutions
## 3 3693043839 Publix Super Markets
## 4 3693044837 Insight Global
## 5 3693045284 Roth Staffing
## 6 3693045603 Vision Technology Services
## 7 3693045756 Irvine Technology Corporation
## 8 3693045914 The Judge Group
## 9 3693046487 SSi People
## 10 3693046569 The CARIAN Group
## 11 3693046734 LCG Inc.
## 12 3693047014 MethodHub
## 13 3693047232 LTIMindtree
## 14 3693047544 Ztek Consulting
## 15 3693047702 Sud Recruiting
## 16 3693048239 Insight Global
## 17 3693048249 Noom
## 18 3693049654 Insight Global
## 19 3693049706 LS Direct
## 20 3693050917 Akkodis
## 21 3693051072 The CARIAN Group
## 22 3693051129 Maxonic
## 23 3693051482 Sterrofox
## 24 3693052150 Amentum
## 25 3693053211 AccessHope
## 26 3693053263 US Tech Solutions
## 27 3693055046 Healthfirst
## 28 3693055168 Zillion Technologies Inc.
## 29 3693056161 The Judge Group
## 30 3693056244 Prosum
## 31 3693056342 AnyRoad
## 32 3693062969 Insight Global
## 33 3693063985 ASK Consulting
## 34 3693065764 HomeSphere
## 35 3693067708 Solari Inc.
## 36 3693069014 Prime Team Partners
## 37 3693069309 Signify Technology
## 38 3693070035 AMBRION
## 39 3693070192 ROR Partners
## 40 3693071184 Cost.U.Less
## 41 3693073454 Dynamite Jobs
## 42 3693074204 Community.com
## 43 3693586591 The Intersect Group
## 44 3694103473 Multi Media LLC
## 45 3697342482 Backpack Talent
## 46 3697353487 City of Atlanta
## 47 3697356241 Performant Corp
## 48 3697363377 Crunchyroll
## 49 3697381530 Lyft
## 50 3697382316 Lyft
## 51 3697386529 Phreesia
## 52 3697388794 Booz Allen Hamilton
## 53 3697390350 Mastercard
## 54 3697394852 Frontdoor Inc.
## 55 3697395737 Spectrum
## 56 3697397232 GE Digital
## 57 3699063216 ALKU
## 58 3699074702 Google
## 59 3699075688 Google
## 60 3699075689 Google
## 61 3699077610 Google
## 62 3699077628 Google
## 63 3699078392 Google
## 64 3699079423 Google
## 65 3701154588 GRAIL
## 66 3701198908 Veracity Solutions
## 67 3701300757 Software Technology Inc.
## 68 3701300819 YUPRO Placement
## 69 3701302398 Lendmark Financial Services
## 70 3701304425 Huxley
## 71 3701306205 LeadStack Inc.
## 72 3701307307 Quantum World Technologies Inc.
## 73 3701308059 Phaxis
## 74 3701308803 Geomagical Labs
## 75 3701309803 ActiveSoft Inc
## 76 3701312798 DISH Network
## 77 3701313682 Equifax
## 78 3701314004 Agile Datapro
## 79 3701315406 Davis Polk & Wardwell LLP
## 80 3701315556 Solve IT Strategies Inc.
## 81 3701315560 Snowflake
## 82 3701318091 Snowflake
## 83 3701318959 Levy Search
## 84 3701318976 springheadtechnologies
## 85 3701319581 Randstad
## 86 3701320040 AccruePartners
## 87 3701321316 ConnectiCare
## 88 3701322385 Pinnacle Group Inc.
## 89 3701323706 NBC Sports Next
## 90 3701323737 Pinnacle Group Inc.
## 91 3701325020 DocuSign
## 92 3701325300 RBW Consulting
## 93 3701369746 LinQuest
## title
## 1 Seasonal Payroll/Data Entry Clerk
## 2 Azure Data Engineer
## 3 Data Architect – Enterprise Architecture Team-REMOTE
## 4 Business Data Analyst
## 5 Project Manager (Data Center Move)
## 6 Data Scientist
## 7 Data Governance Specialist
## 8 DataStage Engineer
## 9 Data Analyst
## 10 Data Scientist – (Remote)
## 11 Database Developer
## 12 Data Analyst
## 13 Big Data Developer
## 14 Data Scientists / AIML Engineer
## 15 Marketing Sciences, Data Scientist
## 16 Data Analytics Manager
## 17 Lead Data Scientist
## 18 Data Engineer
## 19 Data Analyst
## 20 Data Analyst
## 21 Power BI Data Analyst – (Remote)
## 22 Global Data Insights Analyst (Only W2)
## 23 AWS Data Engineer
## 24 Data Scientist
## 25 Oncology Data Abstractor
## 26 Data Engineer
## 27 IT Data Analytics Analyst
## 28 Text Data Labeling Analyst - Machine Learning (Hybrid Role 1 Day a week Onsite)
## 29 Data Analytics Engineer
## 30 UX/UI Designer - Data Intelligence
## 31 Senior Backend Software Engineer, Data
## 32 Sr Data Engineer
## 33 Associate Data Analyst - Analytics and Insights
## 34 Senior Azure Database Engineer
## 35 Manager, Data and Reporting
## 36 Data Analytics Specialist
## 37 Senior Data Engineer (Scala/Spark)
## 38 Senior Data Analyst
## 39 Principal Data Engineer
## 40 Data Analyst
## 41 Senior Data Analyst
## 42 Senior Data Engineer
## 43 Data Engineer
## 44 Product Data Analyst
## 45 Environmental Engineer & Data Scientist
## 46 Data/Reporting Analyst
## 47 Database Administrator II (MSDA)
## 48 Senior Data Analyst - eCommerce
## 49 Senior Data Scientist - Algorithms
## 50 Data Analyst, Real-Time Supply Management (Hybrid)
## 51 Senior Product Manager, Data Products
## 52 Data Scientist
## 53 Director, Data Strategy, Retail and Commerce
## 54 Senior Data Analyst
## 55 Supply Chain Data Analyst
## 56 Implementation Data Engineer
## 57 Data Privacy analyst (NOT ATTORNEY) with 1-3 years of experience ONLY not a BI/technical role please no data architects/developers
## 58 Senior Data Scientist
## 59 Senior Data Scientist
## 60 Senior Data Scientist
## 61 Senior Data Scientist
## 62 Data Engineer, Google Nest
## 63 Senior Data Scientist
## 64 Senior Data Scientist
## 65 Eligibility Data Analyst # 3219
## 66 Database Engineer
## 67 Data Engineer
## 68 Data Optimization Analyst
## 69 Credit Data Analyst II
## 70 Software Data Engineer (Scala/Python)
## 71 Azure Data Engineer
## 72 100% Remote _ Azure Data Architect _ Immediate Interview !!!
## 73 Data Integration Engineer- REMOTE
## 74 Data Analyst - Remote / Contract
## 75 Data Engineer
## 76 New Grad - Data Analyst II
## 77 Data Engineer
## 78 Data Science & AI - ML Internship
## 79 RPA Developer, Business Operations & Data Analytics
## 80 Data Engineer (W2 ONLY)
## 81 Software Engineer - Database Engineering (San Mateo)
## 82 Software Engineer - Database Engineering (Seattle)
## 83 Senior Data Scientist
## 84 Senior Data Analyst (Onsite)
## 85 Call Center Data Entry Specialist
## 86 Pricing Data Analyst
## 87 Healthcare Business Data Leader
## 88 Data Analyst
## 89 Data Engineer II - NBC Sports Next
## 90 Business Analyst/Data Remediation
## 91 Data Engineer
## 92 Senior Clinical Data Manager
## 93 Data Scientist / Operations Research Analyst
## views
## 1 325
## 2 534
## 3 291
## 4 974
## 5 337
## 6 347
## 7 311
## 8 273
## 9 350
## 10 1334
## 11 199
## 12 193
## 13 209
## 14 333
## 15 525
## 16 196
## 17 956
## 18 568
## 19 724
## 20 271
## 21 1459
## 22 411
## 23 378
## 24 215
## 25 235
## 26 612
## 27 308
## 28 195
## 29 508
## 30 386
## 31 496
## 32 289
## 33 256
## 34 283
## 35 458
## 36 237
## 37 232
## 38 216
## 39 271
## 40 307
## 41 462
## 42 323
## 43 213
## 44 297
## 45 272
## 46 241
## 47 214
## 48 386
## 49 514
## 50 1038
## 51 293
## 52 248
## 53 330
## 54 197
## 55 260
## 56 276
## 57 234
## 58 258
## 59 224
## 60 279
## 61 219
## 62 945
## 63 299
## 64 256
## 65 301
## 66 559
## 67 516
## 68 237
## 69 244
## 70 275
## 71 906
## 72 263
## 73 630
## 74 1138
## 75 362
## 76 822
## 77 316
## 78 566
## 79 196
## 80 249
## 81 507
## 82 790
## 83 305
## 84 365
## 85 408
## 86 195
## 87 375
## 88 514
## 89 301
## 90 214
## 91 419
## 92 323
## 93 215
## skills
## 1 communication, excel
## 2 go
## 3 mentoring, understanding
## 4 mentoring, go
## 5 leadership, communication, understanding
## 6 vision, leadership, understanding, python
## 7 python, leadership, communication
## 8 python, sql
## 9 documentation
## 10 understanding
## 11 sql
## 12 sql, understanding, communication
## 13
## 14 containerization, understanding, reliability
## 15 influence, curiosity
## 16 excel
## 17 python
## 18 understanding, sql
## 19 documentation, communication
## 20 sql
## 21 understanding, sql
## 22 research, sql, communication, storytelling
## 23 understanding, python
## 24 presentation, tableau, matlab
## 25 leadership, documentation, integrity, honesty, research
## 26 sql
## 27 communication, understanding, sql, tableau, excel, preparation, presentation, vision
## 28
## 29 python, sql, understanding, networking, communication
## 30 research, figma, prototyping
## 31 understanding, communication, ruby
## 32 understanding, sql
## 33 sql, python, tableau
## 34 sql, influence, prototyping, presentation, vision
## 35 tableau, sql, compassion, leadership, vision
## 36 vision, sql, python, integrity, communication
## 37
## 38 tableau, presentation, preparation, excel
## 39 understanding, documentation, sql, python, docker
## 40 organization, excel, communication
## 41 research, clarity, respect
## 42 organization, leadership, integrity, confidence, python, understanding, tableau, communication
## 43 communication
## 44 research, sql, python
## 45
## 46 organization, leadership, excel
## 47 sql, understanding, integrity, leadership, communication
## 48 reasoning, understanding, sql, excel, transparency
## 49 organization, understanding, collaboration, python, communication, vision, flexibility
## 50 influence, research, sql, communication, vision, flexibility
## 51 understanding, leadership, vision, sql, communication, organization
## 52 research, leadership, understanding, flexibility
## 53 organization, understanding, communication, integrity
## 54 understanding, sql, python, communication, creativity, go
## 55 excel
## 56 sql, tableau, leadership, communication, initiative, research
## 57 understanding, communication
## 58 evaluation
## 59 evaluation
## 60 evaluation
## 61 evaluation
## 62 communication, presentation, organization
## 63 evaluation
## 64 evaluation
## 65 understanding, communication, evaluation, agility, influence, vision
## 66 sql, python, excel
## 67
## 68 communication, collaboration
## 69 preparation, excel, communication, sas
## 70 organization, python
## 71 python, sql, understanding, communication
## 72 communication
## 73 integrity, sql
## 74 integrity, sql, understanding, communication
## 75 sql
## 76 organization, communication, creativity, excel
## 77 organization, vision, understanding, sas, sql, communication
## 78 python, understanding
## 79 sql, python, communication
## 80 sql
## 81 sql
## 82 sql
## 83 python
## 84 sql, tableau
## 85 excel, communication, collaboration
## 86 confidence, excel, communication, tableau
## 87 leadership, documentation, sql, excel, sas, communication
## 88 research, understanding, leadership, vision
## 89 influence, python, understanding, sql, initiative
## 90 research, communication, presentation, collaboration, documentation, vision
## 91 communication, transparency
## 92 research, documentation
## 93 research, understanding, flexibility, vision
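To make the cutoff concrete, here is the same quantile filter applied to a small vector. The values are the view counts of the first eight rows of ds_jobs_clean shown earlier; everything else is just a worked example.

```r
# View counts of the first eight postings listed above
views <- c(56, 325, 101, 37, 26, 34, 534, 86)

# 75th percentile (R's default interpolation, type 7)
cutoff <- quantile(views, 0.75, na.rm = TRUE)   # 157

# Keep postings at or above the cutoff
kept <- views[views >= cutoff]                  # 325 534
```

Two of the eight values survive, which is the top quarter. With tied values at the cutoff, `>=` can keep slightly more than 25% of the rows.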
Let’s see which skills appear the most using tidytext:
library(tidytext)
skills_count <- many_views %>%
unnest_tokens(word, skills) %>%
count(word, sort = TRUE)
skills_count
## word n
## 1 communication 36
## 2 sql 36
## 3 understanding 31
## 4 python 20
## 5 excel 14
## 6 leadership 13
## 7 vision 13
## 8 research 12
## 9 organization 10
## 10 tableau 9
## 11 documentation 7
## 12 evaluation 7
## 13 integrity 7
## 14 presentation 6
## 15 influence 5
## 16 collaboration 4
## 17 flexibility 4
## 18 go 3
## 19 preparation 3
## 20 sas 3
## 21 confidence 2
## 22 creativity 2
## 23 initiative 2
## 24 mentoring 2
## 25 prototyping 2
## 26 transparency 2
## 27 agility 1
## 28 clarity 1
## 29 compassion 1
## 30 containerization 1
## 31 curiosity 1
## 32 docker 1
## 33 figma 1
## 34 honesty 1
## 35 matlab 1
## 36 networking 1
## 37 reasoning 1
## 38 reliability 1
## 39 respect 1
## 40 ruby 1
## 41 storytelling 1
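For readers who prefer to see the counting step without tidytext, the sketch below does the equivalent in base R on a few made-up rows. It splits on commas, so it would also keep multi-word skills intact, whereas unnest_tokens splits on words.

```r
# Made-up stand-in for the skills column
skills_col <- c("communication, excel", "go", "sql, communication", "")

# Split each row on commas, drop empty strings, then tabulate
tokens <- unlist(strsplit(skills_col, ",\\s*"))
tokens <- tokens[nzchar(tokens)]
demo_counts <- sort(table(tokens), decreasing = TRUE)
demo_counts
```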
The results are telling. In the most viewed (and therefore, perhaps, desirable) data science-related postings, many of the skills are not technical at all! In fact, of the top 3 skills, 2 are “communication” and “understanding.” We can visualize this nicely with a wordcloud as well.
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(words = skills_count$word, freq = skills_count$n, min.freq = 2, scale = c(3, .67))
In the next section, we’ll confirm this is no anomaly; soft skills are extremely desirable for data science-related jobs.
In this section, we are identifying the most frequently occurring skills across all data science jobs.
# Apply the function to all jobs, creating a new column with the extracted skills
ds_jobs$skills <- sapply(ds_jobs$description, extract_skills, USE.NAMES = FALSE)
# Create a dataset with skills from all data science jobs
all_skills <- ds_jobs %>%
unnest_tokens(word, skills) %>%
count(word, sort = TRUE)
head(all_skills, 20)
## word n
## 1 communication 164
## 2 understanding 142
## 3 sql 101
## 4 leadership 89
## 5 research 69
## 6 organization 60
## 7 python 57
## 8 documentation 49
## 9 vision 43
## 10 excel 35
## 11 integrity 34
## 12 presentation 26
## 13 collaboration 23
## 14 influence 23
## 15 flexibility 22
## 16 go 18
## 17 initiative 17
## 18 evaluation 16
## 19 networking 16
## 20 tableau 16
The output reveals the top skills sought after in data science job postings. Across all data science jobs, non-technical skills are highly emphasized, with ‘communication’ topping the list, followed by ‘understanding’. Once again, we can visualize this with a wordcloud.
library(wordcloud)
wordcloud(words = all_skills$word, freq = all_skills$n, min.freq = 2, scale = c(3, .67))
Graphs
Add visual
library(ggplot2)
ggplot(skills_count, aes(x=word, y = n))+
geom_bar(stat = "identity")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
ggtitle("Graph 1: Frequency count for each word")
Order factors by order in the data frame
skills_count$word = factor(skills_count$word,levels = unique(skills_count$word))
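An alternative to converting the column to a factor is to reorder the words inside the plot call, so that the bars are sorted by count. This is a sketch on a small stand-in data frame; its values are the top five rows of skills_count shown earlier, and `demo_counts` and `p` are illustrative names.

```r
library(ggplot2)

# Stand-in for skills_count (top five rows from the output above)
demo_counts <- data.frame(
  word = c("communication", "sql", "understanding", "python", "excel"),
  n    = c(36, 36, 31, 20, 14)
)

# reorder(word, -n) sorts the x axis from the highest count to the lowest
p <- ggplot(demo_counts, aes(x = reorder(word, -n), y = n)) +
  geom_col() +
  labs(x = "skill", y = "count") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```

Printing `p` draws the chart. `geom_col()` is shorthand for `geom_bar(stat = "identity")`.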
Check the dataframe
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
head(skills_count)
## word n
## 1 communication 36
## 2 sql 36
## 3 understanding 31
## 4 python 20
## 5 excel 14
## 6 leadership 13
tail(skills_count)
## word n
## 36 networking 1
## 37 reasoning 1
## 38 reliability 1
## 39 respect 1
## 40 ruby 1
## 41 storytelling 1
str(skills_count)
## 'data.frame': 41 obs. of 2 variables:
## $ word: Factor w/ 41 levels "communication",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ n : int 36 36 31 20 14 13 13 12 10 9 ...
summary(skills_count)
## word n
## communication: 1 Min. : 1.000
## sql : 1 1st Qu.: 1.000
## understanding: 1 Median : 2.000
## python : 1 Mean : 6.585
## excel : 1 3rd Qu.: 7.000
## leadership : 1 Max. :36.000
## (Other) :35
library(lattice)
histogram(~ n | word, data = skills_count, layout= c(1,41))
This plot draws one histogram panel of the count for each word, with the panels stacked in a single column rather than placed side by side. Visually, we can see which words have a high count versus a low count, and which have similar counts. It is an alternative way for us to visualize the distribution of the skills mentioned.
library(lattice)
histogram(~ n | word, data = skills_count, layout= c(1,10))
The skills that appear most often in the most-viewed data science postings are largely soft skills, such as communication and understanding, alongside a few technical skills that also rank high, notably SQL and Python. Skills such as integrity and evaluation follow further down the list. The size differences in the word cloud indicate differences in importance, and the bar graph and histograms confirm that the frequency of each word is not uniform. Most words are rare: 37 of the 41 skills have a count below 20, and a little over half (21 of 41) appear at most twice. Given this, the graphs do not provide a valid basis for a Poisson regression.
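The proportions of rare words can be checked directly from the counts. In the sketch below the vector `n` is typed in by hand from the skills_count output shown earlier. The comparison of mean and variance is added only as one informal check (a Poisson-distributed count has equal mean and variance); it is not necessarily the reasoning behind the conclusion above.

```r
# Counts copied from the skills_count output (41 words)
n <- c(36, 36, 31, 20, 14, 13, 13, 12, 10, 9, 7, 7, 7, 6, 5, 4, 4,
       3, 3, 3, rep(2, 6), rep(1, 15))

share_at_most_two <- mean(n <= 2)   # 21 of 41 words
share_below_20    <- mean(n < 20)   # 37 of 41 words

# Informal check: for a Poisson count these two would be roughly equal
count_mean <- mean(n)
count_var  <- var(n)
```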