Like any aspiring young professionals, data scientists face a question: which skills most raise their competitive value in today’s marketplace? Because data science is a rapidly changing industry, and so are its tools, we approached the problem from a generic data scientist's point of view and scraped more than 50,000 job postings on Indeed.
From this we created a simple list of the 40 hard technical skills most commonly sought by recruiters whose postings appeared under data scientist searches.
Our initial scraping pulled skills from the experience section of each listing and saved the state, job title and individual skills under a document identification number. Although our initial presentation shows simple frequency, this method allows us to go back and look more closely at skills in specific markets, which might make job hunting easier for those willing to relocate for their next position.
Strategy
Scraping: Peter wrote a script which obtains keywords from the experience section of job postings on Indeed for the entire United States.
QC, support and debugging of the scraper with Peter most frequently fell to Fahad, with team input on the tough decisions.
To support simplified searching and aggregation down the road, Bethany and Fahad implemented a script in R to create and populate an SQLite database of our results.
The creation, coding and design of the final visualization was Lidiia’s main focus, with input from the whole team on content, and a simple flexdashboard framework provided by Bethany.
library(rvest)  # also provides the %>% pipe
# `session`, `form` and `pages` are created earlier in the scraper (not shown here)
sessions <- submit_form(session, form, pages)
# Select based on classes; this is the experience class
experience <- sessions %>%
  lapply(html_nodes, xpath = "//div[(@class = 'row result' or @class = ' row result')
    and descendant::*[@class = 'experienceList']]//span[@class = 'experienceList']") %>%
  lapply(html_text) %>%
  unlist()
# Get job titles, restricted to listings that contain an experienceList element
job_title <- sessions %>%
  lapply(html_nodes, xpath = "//div[(@class = 'row result' or @class = ' row result')
    and descendant::*[@class = 'experienceList']]//a[@data-tn-element = 'jobTitle'
    and following::*[@class = 'experienceList']]") %>%
  lapply(html_text) %>%
  unlist()
The XPath query is the key here. It starts from every job listing and then keeps only the listings that contain an element with the class “experienceList”.
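To see what that filter does, here is a minimal sketch that runs simplified versions of the same XPath queries on a small hand-written page. The HTML, the job titles and the skill strings are invented for illustration and only mimic the class names used in the queries; the real Indeed markup may differ.

```r
library(rvest)

# Invented stand-in for one results page; only the class names mirror the real queries.
page <- read_html('<html><body>
  <div class="row result">
    <a data-tn-element="jobTitle">Data Scientist</a>
    <span class="experienceList">Python, R, SQL</span>
  </div>
  <div class="row result">
    <a data-tn-element="jobTitle">Quantitative Engineer</a>
  </div>
  <div class=" row result">
    <a data-tn-element="jobTitle">Data Analyst</a>
    <span class="experienceList">Tableau, SAS</span>
  </div>
</body></html>')

# Listings that contain at least one experienceList element.
listing <- "//div[(@class = 'row result' or @class = ' row result')
            and descendant::*[@class = 'experienceList']]"

experience <- page %>%
  html_nodes(xpath = paste0(listing, "//span[@class = 'experienceList']")) %>%
  html_text()

job_title <- page %>%
  html_nodes(xpath = paste0(listing, "//a[@data-tn-element = 'jobTitle']")) %>%
  html_text()

# Assumes one title and one experienceList per kept listing,
# so the two vectors pair up row by row.
data.frame(job_title, experience)
```

The listing without an experienceList element (“Quantitative Engineer” in this toy page) is dropped from both vectors, which is what keeps titles and skills aligned.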
We wanted to use the power of the Indeed search engine to do a lot of the heavy lifting for us. The search engine has been optimized over time to return listings relevant to a given query, so it knows that a job like “quantitative engineer” is relevant to the “data science” query. This allowed us to feed in one query and worry about selecting the relevant results at a later point in the process. We therefore pulled every job listing that Indeed returned in all 50 states for the search term “data science”.
A listing can include a list of desired experience. These terms equate to the skills needed for the given position. We needed to map skills to each position, but “desired experience” isn’t provided for every position. For this exercise to be successful, we needed to look only at positions where “desired experience” was given. We used this code to line up skills and job titles:
library(DBI)
library(RSQLite)
library(sqldf)
setwd("")  # set this to the project directory (the path is blank in the original)
db <- dbConnect(SQLite(), dbname = "indeed.sqlite")
sqldf("attach 'indeed.sqlite' as new")
# RSQLite runs only the first statement of a multi-statement string,
# so each DROP and CREATE is sent separately with dbExecute()
dbExecute(db, "DROP TABLE IF EXISTS Jobs")
dbExecute(db, "CREATE TABLE Jobs
  (job_id INTEGER PRIMARY KEY AUTOINCREMENT,
  job_name TEXT)")
dbExecute(db, "DROP TABLE IF EXISTS State")
dbExecute(db, "CREATE TABLE State
  (state_id TEXT PRIMARY KEY,
  name TEXT)")
dbExecute(db, "DROP TABLE IF EXISTS Skills")
dbExecute(db, "CREATE TABLE Skills
  (skill_id INTEGER PRIMARY KEY AUTOINCREMENT,
  job_id INTEGER,
  state_id TEXT,
  document_id INTEGER,
  skill TEXT)")
# These calls fill the tables; State and Jobs hold usable fields to query against with Skills.
# Note: overwrite = TRUE replaces each table definition above with the columns of the data frame.
State <- read.csv("official_csv/state.csv")
dbWriteTable(conn = db, name = "State", value = State, overwrite = TRUE)
Jobs <- read.csv("official_csv/job_title.csv")
dbWriteTable(conn = db, name = "Jobs", value = Jobs, overwrite = TRUE)
Skills <- read.csv("official_csv/final_data.csv")
dbWriteTable(conn = db, name = "Skills", value = Skills, overwrite = TRUE)
This photograph shows the database tables in the DB Browser for SQLite GUI. The database can be accessed by any SQL tool or programming language for further analysis.
This is the basic code used to import three tables which were created from simple data frames in R.
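Once the tables are loaded, follow-up questions become short SQL queries. The sketch below assumes the Skills columns from the CREATE TABLE statement and uses an in-memory database with a few invented rows, so it runs without indeed.sqlite; the counts are illustrative only and are not our results.

```r
library(DBI)
library(RSQLite)

# In-memory stand-in for indeed.sqlite; every row here is invented.
db <- dbConnect(SQLite(), ":memory:")

skills <- data.frame(
  job_id      = c(1L, 1L, 2L, 1L, 1L, 2L),
  state_id    = c("NY", "NY", "NY", "CA", "CA", "CA"),
  document_id = c(101L, 101L, 102L, 103L, 103L, 104L),
  skill       = c("Python", "R", "Python", "Python", "SQL", "R"),
  stringsAsFactors = FALSE
)
dbWriteTable(db, "Skills", skills)

# Overall frequency of each skill across all postings.
top_skills <- dbGetQuery(db, "
  SELECT skill, COUNT(*) AS mentions
  FROM Skills
  GROUP BY skill
  ORDER BY mentions DESC, skill")

# Same count restricted to one state, using a bound parameter.
ny_skills <- dbGetQuery(db, "
  SELECT skill, COUNT(*) AS mentions
  FROM Skills
  WHERE state_id = ?
  GROUP BY skill
  ORDER BY mentions DESC, skill", params = list("NY"))

dbDisconnect(db)
```

Filtering on state_id is the kind of market-specific view that storing the state alongside each skill makes possible.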
unique() from our scraped list
Based on the analysis done on over 80,000 job postings on the website Indeed.com, the most sought-after programming languages are:
This may not come as a surprise. However, the fact that R is closely followed by the proprietary data science language SAS, with Matlab not far behind, may surprise some.
What is even more striking at a casual glance is that SQL is not in the top three; that is, until you notice that SQL Server and MySQL are counted as separate categories.
It seems likely that if we were to go back and aggregate the many varieties of SQL, it would be right up there with Python, R and SAS.
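One way to try that aggregation is to fold every skill whose name contains "sql" into a single family before ranking. The sketch below uses base R only; the skill names and counts are hypothetical placeholders, not figures from our scrape.

```r
# Hypothetical counts, invented purely to illustrate the grouping step.
skill_counts <- c(Python = 120, R = 110, SAS = 90, SQL = 60,
                  "SQL Server" = 35, MySQL = 30, PostgreSQL = 10, Tableau = 50)

# Any name containing "sql" (case-insensitive) joins one SQL family.
# Caveat: this would also catch terms such as "NoSQL" if they were present.
family <- ifelse(grepl("sql", names(skill_counts), ignore.case = TRUE),
                 "SQL (all variants)", names(skill_counts))

# Sum the counts within each family and rank them.
combined <- sort(vapply(split(skill_counts, family), sum, numeric(1)),
                 decreasing = TRUE)
combined
```

With these made-up numbers the combined SQL family moves to the top of the ranking; whether the same happens with the scraped data would need to be checked against the database.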
Aside from popular analytical languages, the language-agnostic skills topping the list were:
If you are looking for work, it is important to show dexterity in one or more of these areas.
Despite the heavy focus on open-source tools, there is a clear need for people skilled in proprietary environments, with Tableau, Oracle and AWS making strong showings in the top 25 list.
| Fahad Arif | Lidiia Tronina |
| “Seek not to become a man of knowledge but a man of value.” (Einstein) | “The world is one big data problem.” (Andrew McAfee) |
| Peter Goodridge | Bethany Poulin |
| “Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.” (John Tukey) | “Fail your way to greatness!” |
Acknowledgement from the Project Manager
This project was the culmination of great teamwork, shared responsibility and insightful ideation and problem solving.
Although Peter’s herculean effort in extracting data was magnificent, Fahad’s diligence in problem solving, chasing down useful solutions and giving input on the database was equally valuable.
Lidiia was super proactive in getting everything ready for the last-minute surge to make a polished final product, and she handled the ticking clock with grace.
The process was remarkable in how democratically, smoothly and humanely it proceeded, making my job as the de facto project manager extremely easy!
Thank you, guys, for the best project I could have imagined!
We would also like to extend our thanks to those who put their understanding out on the web for all to see, use and grow from.