Group Project - Data Scientist Job Skills

Project: Data Science Skills

Approach: why we decided to attack this assignment as we did.

Like any aspiring young professional, data scientists face the question of which skills most raise their competitive value in today’s marketplace. Because data science is a rapidly changing industry, and so are its tools, we approached the problem from a generic data scientist's point of view and scraped more than 50,000 job postings on Indeed.

From this we created a simple list of the 40 hard technical skills most commonly sought by recruiters whose postings appeared under data scientist searches.

Our initial scraping pulled skills from the experience area of each listing and saved the state, job title and individual skills under a document identification number. Although our initial presentation shows simple frequency, this method allows us to go back and look more closely at skills in specific markets, which might make job hunting easier for those willing to relocate for their next position.

Strategy: How we divided up work and slid into natural roles as the project progressed

Scraping: Peter wrote a script which obtains keywords from the experience section of job postings on Indeed for the entire United States.
QC, support and debugging of the scraper with Peter most frequently fell to Fahad, with team input on the tough decisions.
To support simplified searching and aggregation down the road, Bethany and Fahad implemented a script in R to create and populate an SQLite database of our results.
The creation, coding and design of the final visualization was Lidiia’s main focus, with input from the whole team on content, and a simple flexdashboard framework provided by Bethany.

Scraping: The centerpiece of this analysis required some tricky code to extract our skills

library(rvest)

# `session`, `form` and `pages` are assumed to be defined earlier in the scraping script
sessions <- submit_form(session, form, pages)

# Select based on classes; 'experienceList' is the experience class.
# Only result blocks that contain such an element are kept.
experience <- sessions %>%
  lapply(html_nodes, xpath = "//div[(@class = 'row result' or @class = '  row  result')
             and descendant::*[@class = 'experienceList' ]]//span[@class = 'experienceList']") %>%
  lapply(html_text) %>%
  unlist()

# Get job titles from the same filtered result blocks
job_title <- sessions %>%
  lapply(html_nodes, xpath = "//div[(@class = 'row result' or @class = '  row  result')
             and descendant::*[@class = 'experienceList' ]]//a[@data-tn-element = 'jobTitle'
             and following::*[@class = 'experienceList' ]]") %>%
  lapply(html_text) %>%
  unlist()

The XPath query is the key here. It selects every job listing on the results page and then keeps only those listings that contain an element with the class “experienceList”.
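The filtering idea can be tried on a toy page. The following is only a minimal sketch: the HTML snippet is a made-up miniature of a results page (not real Indeed markup), and the variable names `page`, `with_skills`, `titles` and `skills` are hypothetical.

```r
library(rvest)

# Hypothetical miniature results page: two listings, only the first lists skills.
page <- read_html('<div class="row result">
    <a data-tn-element="jobTitle">Data Scientist</a>
    <span class="experienceList">Python, R, SQL</span>
  </div>
  <div class="row result">
    <a data-tn-element="jobTitle">Quantitative Engineer</a>
  </div>')

# Keep only result blocks that contain an experienceList element.
with_skills <- "//div[@class = 'row result' and descendant::*[@class = 'experienceList']]"

# Titles and skills are both drawn from the filtered blocks, so they stay aligned.
titles <- page %>%
  html_nodes(xpath = paste0(with_skills, "//a[@data-tn-element = 'jobTitle']")) %>%
  html_text()

skills <- page %>%
  html_nodes(xpath = paste0(with_skills, "//span[@class = 'experienceList']")) %>%
  html_text()
```

Because the listing without an “experienceList” element is dropped by the predicate, `titles` and `skills` end up with the same length.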

We wanted to use the power of the Indeed search engine to do a lot of the heavy lifting for us. The search engine has been optimized over time to return listings relevant to a given query, so it would know a job like “quantitative engineer” is relevant to the “data science” query. This allowed us to feed in one query and worry about selecting the relevant results from that query at a later point in the process. This led us to pull every job listing in all 50 states that Indeed returned for the search term “data science”.

With each listing, a list of desired experience is provided. These terms equate to the skills needed for the given position. We needed to map skills to each position, but “desired experience” isn’t provided for every position. For this exercise to be successful, we needed to look only at positions where “desired experience” was given. The XPath queries above are what we used to line up skills and job titles.
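Once titles and experience strings are aligned, they can be reshaped into one row per skill. This is a rough sketch rather than the project's actual code: the sample vectors are invented, and it assumes the experience text is a comma-separated string, which the source does not confirm.

```r
# Hypothetical scraped vectors: one title and one skills string per listing.
job_title  <- c("Data Scientist", "Data Analyst")
experience <- c("Python, R, Machine Learning", "SQL, Tableau")

# Assumption: skills are comma-separated. Split and trim whitespace.
skill_list <- lapply(strsplit(experience, ","), trimws)

# One row per (listing, skill); document_id ties each skill back to its listing.
n_skills <- lengths(skill_list)
long <- data.frame(
  document_id = rep(seq_along(job_title), times = n_skills),
  job_title   = rep(job_title, times = n_skills),
  skill       = unlist(skill_list),
  stringsAsFactors = FALSE
)
```

A long table of this shape is convenient because frequency counts and per-state breakdowns become simple group-by operations.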

Database: how we created a repository to store our skills for future analysis

library(DBI)
library(RSQLite)

# setwd("")  # set the working directory to the project folder first
db <- dbConnect(SQLite(), dbname = "indeed.sqlite")

# RSQLite runs one statement per call, so each DROP and CREATE is issued separately
dbExecute(db, "DROP TABLE IF EXISTS Jobs")
dbExecute(db, "CREATE TABLE Jobs
          (job_id INTEGER PRIMARY KEY AUTOINCREMENT,
          job_name TEXT)")

dbExecute(db, "DROP TABLE IF EXISTS State")
dbExecute(db, "CREATE TABLE State
          (state_id TEXT PRIMARY KEY,
          name TEXT)")

dbExecute(db, "DROP TABLE IF EXISTS Skills")
dbExecute(db, "CREATE TABLE Skills
          (skill_id INTEGER PRIMARY KEY AUTOINCREMENT,
          job_id INTEGER,
          state_id TEXT,
          document_id INTEGER,
          skill TEXT)")

# These initialize the tables; State and Jobs are filled with usable fields to query against with Skills.
# Note that overwrite = TRUE replaces each table with the structure of the data frame being written.
State <- read.csv("official_csv/state.csv")
dbWriteTable(conn = db, name = "State", value = State, overwrite = TRUE)

Jobs <- read.csv("official_csv/job_title.csv")
dbWriteTable(conn = db, name = "Jobs", value = Jobs, overwrite = TRUE)

Skills <- read.csv("official_csv/final_data.csv")
dbWriteTable(conn = db, name = "Skills", value = Skills, overwrite = TRUE)

dbDisconnect(db)

This screenshot shows the database tables in the DB Browser for SQLite GUI. The database can be accessed by any SQL tool or programming language for further analysis.

This is the basic code used to import three tables which were created from simple data frames in R.

States
- name - manually entered
- state_id - manually entered 2-character codes (Key)
Jobs
- job_name - extracted using unique() from our scraped list
- job_id - simple numerical identifier (Key)
Skills
- document_id - created during scraping
- job_title - pulled from Indeed listing, as-is
- state_id - pulled from Indeed and converted to 2-character code
- skill - pulled from listing, as-is
- skill_id - created for each unique skill and merged against skills in frame
Results: After scraping 50,000+ postings nationwide, here are the results

Based on our analysis of the more than 50,000 job postings scraped from Indeed.com, the most sought-after programming languages are:

Python
R

This may not come as a surprise. What may surprise some is that R is closely followed by the proprietary data science language SAS, with MATLAB not far behind.

What is even more striking at a casual glance is that SQL is not in the top three. That is, until you notice that SQL Server and MySQL are in separate categories.

It seems likely that if we were to go back and aggregate the many varieties of SQL, it would be right up there with Python, R and SAS.
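One possible way to do that aggregation is sketched below. The sample vector is invented, and matching on the substring "SQL" is an assumption about how the varieties are named in the scraped data; it would also sweep in terms such as "NoSQL", which may not be desirable.

```r
# Hypothetical sample of scraped skill strings.
skills <- c("Python", "R", "SQL Server", "MySQL", "SQL",
            "Python", "PostgreSQL", "SAS")

# Collapse any skill whose name contains "SQL" into a single "SQL" bucket.
normalized <- ifelse(grepl("SQL", skills, ignore.case = TRUE), "SQL", skills)

# Recount after normalization, most frequent first.
counts <- sort(table(normalized), decreasing = TRUE)
```

In this toy sample the combined SQL bucket moves ahead of Python, which is the effect the paragraph above anticipates for the real data.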

Aside from popular analytical languages, the language-agnostic skills topping the list were:

Machine Learning
Data Mining
Data Warehouse
Data Science

If you are looking for work, it would be important to show proficiency in one or more of these areas.

Despite the heavy focus on open-source tools, there is a clear need for people skilled in proprietary environments, with Tableau, Oracle and AWS making strong showings in the top 25 list.

About: This project was substantial and we could not have done it without the kindness of strangers

Bibliography

Law, J., & Rosenblum, J. (n.d.). rvest tutorial: scraping the web using R. Retrieved October 15, 2017, from

DevNami. (n.d.). R Programming Import Data from URL. Retrieved from

R and SQLite: Part 1. (2012, November 18). Retrieved October 22, 2017, from

Entrepreneur Tactics. (n.d.). Scraping Website Data From Multiple Pages For FREE Using Import io. Retrieved from

Melvin L. (n.d.). Simple Web Scraping using R. Retrieved from

Bourret, R. (n.d.). rpbourret.com - XPath in Five Paragraphs. Retrieved October 22, 2017, from


Fahad Arif
“Seek not to become a man of knowledge but a man of value.” (Einstein)

Lidiia Tronina
“The world is one big data problem.” (Andrew McAfee)

Peter Goodridge
“Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.” (John Tukey)

Bethany Poulin
“Fail your way to greatness!”

Acknowledgement from Project Manager

This project was the culmination of great teamwork, shared responsibility and insightful ideation and problem solving.

Although Peter’s herculean effort in extracting data was magnificent, Fahad’s diligence in problem solving, chasing down useful solutions and giving input on the database was equally valuable.

Lidiia was super proactive in getting everything ready for the last-minute surge to make a polished final product, and she handled the ticking clock with grace.

The process was remarkable in how democratic, smooth and humane it was, making my job as the de facto project manager extremely easy!

Thank you guys for the best project I could have imagined!

We would also like to extend our thanks to those who put their understanding out on the web for all to see, use and grow from.