Project: Data Science Skills

“Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.” Geoffrey Moore.

Approach: why we decided to attack this assignment as we did.


Like any aspiring young professionals, data scientists face a question: which skills most raise their competitive value in today’s marketplace? Because data science is a rapidly changing industry, and so are its tools, we approached the problem from a generic data scientist’s point of view and scraped more than 50,000 job postings on Indeed.

From this we created a simple list of the 40 hard technical skills most commonly sought by recruiters whose postings appeared under data scientist searches.

Our initial scraping pulled skills from the experience area of each listing and saved the state, job title, and individual skills under a document identification number. Although our initial presentation shows simple frequency, this method allows us to go back and look more closely at skills in specific markets, which might make job hunting easier for those willing to relocate for their next position.
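As a rough illustration of that storage layout, the sketch below turns aligned vectors of titles, states and experience strings into one row per skill, keyed by a document id, and then tallies simple frequency. The sample values, the variable names, and the assumption that skills within a listing are comma-separated are all hypothetical and are not taken from the project data.

```r
# Hypothetical scraped vectors: element i of each describes listing i.
job_title  <- c("Data Scientist", "Quantitative Engineer")
state      <- c("NY", "CA")
experience <- c("Python, R, SQL", "SAS, Matlab")

# Assumption: skills inside one listing are separated by commas.
skill_list <- strsplit(experience, ",\\s*")
n_skills   <- lengths(skill_list)

# One row per skill, repeating the listing-level fields for each skill.
skills_long <- data.frame(
  document_id = rep(seq_along(experience), times = n_skills),
  state       = rep(state, times = n_skills),
  job_title   = rep(job_title, times = n_skills),
  skill       = trimws(unlist(skill_list)),
  stringsAsFactors = FALSE
)

# Simple frequency, most common first; head() keeps at most the top 40.
skill_freq <- sort(table(skills_long$skill), decreasing = TRUE)
top_skills <- head(skill_freq, 40)
```

Because the state travels with every skill row, the same table can later be filtered to a single market without rescraping.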

Strategy: How we divided up work and slid into natural roles as the project progressed


Scraping: The centerpiece of this analysis required some tricky code to extract our skills

library(rvest)

# `session`, `form` and `pages` are assumed to be defined earlier in the script.
# lapply() below treats `sessions` as a list of result pages.
sessions <- submit_form(session, form, pages)

# Select based on classes; this is the experience class
experience <- sessions %>%
  lapply(html_nodes, xpath = "//div[(@class = 'row result' or @class = '  row  result')
             and descendant::*[@class = 'experienceList' ]]//span[@class = 'experienceList']") %>%
  lapply(html_text) %>%
  unlist()

# Get job titles, restricted to listings that also carry an experience list
job_title <- sessions %>%
  lapply(html_nodes, xpath = "//div[(@class = 'row result' or @class = '  row  result')
             and descendant::*[@class = 'experienceList' ]]//a[@data-tn-element = 'jobTitle'
             and following::*[@class = 'experienceList' ]]") %>%
  lapply(html_text) %>%
  unlist()

The XPath query is the key here. It selects every job listing and then keeps only those listings that contain an element with the class “experienceList”.
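To see the filter in isolation, here is a minimal sketch run against a small, made-up HTML fragment rather than a live Indeed page. The markup is a hypothetical simplification; only the class names and the `data-tn-element` attribute are taken from the query above.

```r
library(rvest)

# Hypothetical markup standing in for a results page.
# The second listing has no experienceList and should be skipped.
page <- read_html(paste0(
  '<div class="row result">',
  '<a data-tn-element="jobTitle">Data Scientist</a>',
  '<span class="experienceList">Python, R, SQL</span></div>',
  '<div class="row result">',
  '<a data-tn-element="jobTitle">Office Manager</a></div>',
  '<div class="  row  result">',
  '<a data-tn-element="jobTitle">Quantitative Engineer</a>',
  '<span class="experienceList">SAS, Matlab</span></div>'
))

# Listings that contain at least one experienceList descendant.
listing_xpath <- paste0(
  "//div[(@class = 'row result' or @class = '  row  result') ",
  "and descendant::*[@class = 'experienceList']]"
)

titles <- page %>%
  html_nodes(xpath = paste0(listing_xpath, "//a[@data-tn-element = 'jobTitle']")) %>%
  html_text()

experience <- page %>%
  html_nodes(xpath = paste0(listing_xpath, "//span[@class = 'experienceList']")) %>%
  html_text()

# Both vectors come from the same filtered listings, so they line up row by row.
listings <- data.frame(job_title = titles, experience = experience,
                       stringsAsFactors = FALSE)
```

Filtering on the listing first is what keeps the two vectors the same length; selecting all job titles without the `descendant` condition would also return titles that have no matching experience entry.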


We wanted to use the power of the Indeed search engine to do a lot of the heavy lifting for us. The search engine has been optimized over time to return listings relevant to a given query, so it would know that a job like “quantitative engineer” is relevant to the “data science” query. This allowed us to feed in one query and worry about selecting the relevant results from that query at a later point in the process. This led us to pull every job listing that Indeed returned in all 50 states when using “data science” as the search term.

With each listing, a list of desired experience is provided. These terms equate to the skills needed for the given position. We needed to map skills to each position, but “desired experience” isn’t provided for every position. For this exercise to be successful, we needed to look only at positions where “desired experience” was given. The code above lines up skills and job titles in this way.

Database: how we created a repository to store our skills for future analysis

library(DBI)
library(RSQLite)
library(sqldf)

setwd("")  # working directory left blank in the original; set it to the project folder
db <- dbConnect(SQLite(), dbname = "indeed.sqlite")
sqldf("attach 'indeed.sqlite' as new ")

# RSQLite executes only the first statement in a string,
# so each DROP and CREATE is issued as its own call.
dbExecute(db, "DROP TABLE IF EXISTS Jobs")
dbExecute(db,
          "CREATE TABLE Jobs
          (job_id INTEGER PRIMARY KEY AUTOINCREMENT,
          job_name TEXT)")

dbExecute(db, "DROP TABLE IF EXISTS State")
dbExecute(db,
          "CREATE TABLE State
          (state_id TEXT PRIMARY KEY,
          name TEXT)")

dbExecute(db, "DROP TABLE IF EXISTS Skills")
dbExecute(db,
          "CREATE TABLE Skills
          (skill_id INTEGER PRIMARY KEY AUTOINCREMENT,
          job_id INTEGER,
          state_id TEXT,
          document_id INTEGER,
          skill TEXT)")

# These initialize the tables; State and Jobs are filled with usable fields to query against with Skills.
# Note: overwrite = TRUE replaces each table, so its columns come from the CSV file.
State <- read.csv("official_csv/state.csv")
dbWriteTable(conn = db, name = "State", value = State, overwrite = TRUE)

Jobs <- read.csv("official_csv/job_title.csv")
dbWriteTable(conn = db, name = "Jobs", value = Jobs, overwrite = TRUE)

Skills <- read.csv("official_csv/final_data.csv")
dbWriteTable(conn = db, name = "Skills", value = Skills, overwrite = TRUE)

This screenshot shows the database tables in the DB Browser for SQLite GUI. The database can be accessed by any SQL tool or programming language for further analysis.

The code above is the basic code used to import the three tables, which were created from simple data frames in R.
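As an example of the kind of follow-up analysis the database allows, the sketch below counts skills overall and then for a single state. It uses an in-memory SQLite database with a handful of made-up rows so that it runs on its own; the column names follow the Skills table defined earlier, but the data and the choice of state are purely illustrative.

```r
library(DBI)
library(RSQLite)

# In-memory database with hypothetical rows shaped like the Skills table.
db <- dbConnect(SQLite(), dbname = ":memory:")
skills <- data.frame(
  job_id      = c(1, 1, 2, 2, 3),
  state_id    = c("NY", "NY", "NY", "NY", "CA"),
  document_id = c(101, 101, 102, 102, 103),
  skill       = c("Python", "R", "Python", "SQL", "Python"),
  stringsAsFactors = FALSE
)
dbWriteTable(db, "Skills", skills)

# Overall frequency of each skill, most common first.
top_skills <- dbGetQuery(db, "
  SELECT skill, COUNT(*) AS n
  FROM Skills
  GROUP BY skill
  ORDER BY n DESC, skill ASC")

# The same count restricted to one market, using a bound parameter.
ny_skills <- dbGetQuery(db, "
  SELECT skill, COUNT(*) AS n
  FROM Skills
  WHERE state_id = ?
  GROUP BY skill
  ORDER BY n DESC, skill ASC",
  params = list("NY"))

dbDisconnect(db)
```

Pointing `dbConnect()` at `indeed.sqlite` instead of `:memory:` would run the same queries against the real tables.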

Results: After scraping 50,000+ postings nationwide, here are the results


Based on the analysis of over 80,000 job postings on Indeed.com, the most sought-after programming languages are:

This may not come as a surprise. What may surprise some, however, is that R is closely followed by the proprietary data science language SAS, with Matlab not far behind.

What is even more striking at a casual glance is that SQL is not in the top three, at least until you notice that SQL Server and MySQL are counted in separate categories.

It seems likely that if we were to go back and aggregate the many varieties of SQL, it would be right up there with Python, R and SAS.
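A minimal sketch of that aggregation is shown below. The counts are invented for illustration and are not the project's results, and the list of names treated as SQL varieties is an assumption that would need to be checked against the actual skill labels.

```r
# Hypothetical counts for illustration only; not the project's results.
skill_counts <- c("Python" = 50, "R" = 45, "SQL" = 25,
                  "SQL Server" = 20, "MySQL" = 15, "NoSQL" = 5)

# Assumption: these labels are the SQL varieties to merge.
# An explicit list avoids sweeping in unrelated names such as "NoSQL".
sql_variants <- c("SQL", "SQL Server", "MySQL", "PostgreSQL")

# Map each skill to its family, leaving everything else unchanged.
family <- ifelse(names(skill_counts) %in% sql_variants,
                 "SQL (all varieties)", names(skill_counts))

# Sum the counts within each family and rank them.
aggregated <- vapply(split(skill_counts, family), sum, numeric(1))
aggregated <- sort(aggregated, decreasing = TRUE)
```

With these made-up numbers the merged SQL family moves to the top of the ranking, which is the effect the paragraph above anticipates.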

Aside from popular analytical languages, the language-agnostic skills topping the list were:

If you are looking for work, it would be important to show dexterity in one or more of these areas.

Despite the heavy focus on open-source tools, there is a clear need for people skilled in proprietary environments, with Tableau, Oracle and AWS making strong showings in the top 25 list.

About: This project was substantial and we could not have done it without the kindness of strangers

Bibliography

Law, J., & Rosenblum, J. (n.d.). rvest tutorial: scraping the web using R. Retrieved October 15, 2017, from

DevNami. (n.d.). R Programming Import Data from URL. Retrieved from

R and SQLite: Part 1. (2012, November 18). Retrieved October 22, 2017, from

Entrepreneur Tactics. (n.d.). Scraping Website Data From Multiple Pages For FREE Using Import io. Retrieved from

Melvin L. (n.d.). Simple Web Scraping using R. Retrieved from

Bourret, R. (n.d.). rpbourret.com - XPath in Five Paragraphs. Retrieved October 22, 2017, from


Fahad Arif Lidiia Tronina
“Seek not to become a man of knowledge but a man of value. (Einstein)”
“The world is one big data problem.” (Andrew McAfee)
Peter Goodridge Bethany Poulin
“Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.” John Tukey
“Fail your way to greatness!”

Acknowledgement from Project Manager

This project was the culmination of great teamwork, shared responsibility and insightful ideation and problem solving.

Although Peter’s herculean effort in extracting data was magnificent, Fahad’s diligence in problem solving, chasing down useful solutions and giving input on the database was equally valuable.

Lidiia was super proactive in getting everything ready for the last-minute surge to make a polished final product, and she handled the ticking clock with grace.

The process was remarkable in how democratically, smoothly and humanely it proceeded, making my job as the de facto project manager extremely easy!

Thank you, guys, for the best project I could have imagined!

We would also like to extend our thanks to those who put their understanding out on the web for all to see, use and grow from.