10/20/2019

“Which are the most valued data science skills?”

Scope

  • Impossible to answer precisely

  • Skills broadly fall into categories

    • Mathematical / Statistical
    • Computers / Coding
    • Communication / Presentation
    • Experimental Design
    • Data Engineering
  • Data Scientists need Domain Expertise to answer questions

DataScience Exchange

Data Collection

  • Pulled all tags from DataScience Exchange with the following SQL query:

SELECT Id, TagName, Count FROM Tags ORDER BY Count DESC;

  • There are 502 unique tags which range in frequency from 1 to 6296. Of these, 395 appear more than 5 times, 95 more than 100 times, and 10 more than 1000 times.

##    TagName              Count       
##  Length:502         Min.   :   1.0  
##  Class :character   1st Qu.:   7.0  
##  Mode  :character   Median :  22.0  
##                     Mean   : 113.2  
##                     3rd Qu.:  67.0  
##                     Max.   :6296.0
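The threshold counts quoted above ("more than 5 times", "more than 100 times", and so on) can be reproduced from any list of tag counts. The following is a minimal Python sketch, not the code used for this deck; the function name `frequency_profile` and the toy counts are assumptions made for illustration.

```python
from statistics import mean, median


def frequency_profile(counts, thresholds=(5, 100, 1000)):
    """Summarize tag counts: size, range, centre, and tags above each threshold."""
    return {
        "unique": len(counts),
        "min": min(counts),
        "max": max(counts),
        "median": median(counts),
        "mean": mean(counts),
        # "more than t times" is a strict inequality, so use c > t
        "more_than": {t: sum(1 for c in counts if c > t) for t in thresholds},
    }


# Made-up counts for illustration only, not the real tag data.
toy_counts = [1, 3, 6, 22, 101, 150, 1200]
profile = frequency_profile(toy_counts)
print(profile["more_than"])
```

Run against the real `Count` column, the same function should give the figures in the summary above, provided the thresholds are read as strict inequalities.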

Top 10 tags

Common skills are those we will see again in the Dice.com analysis. Unique skills appear only in the DataScience Exchange tags.

Wordcloud (Count > 100)

Dice.Com

Data Collection

We used a Jupyter Python notebook to scrape web data:

Python Libraries:

  • Selenium (with Chrome webdriver): automate browsing
  • BeautifulSoup: mine job links from search results
  • lxml and xpath: load individual job listings
  • re: regular expressions to scrape job listing skills

Processing Flow:

  1. Run a search on Dice.com for Data Scientist jobs
  2. Loop through each Results Page
    • scrape each Job Listing URL link
  3. Loop through each Job Listing URL link
    • load the Job Posting then scrape the job skill(s)
  4. Save all skills to a text file — one row per job listing
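The four steps above can be sketched in a few lines of Python. This is a dependency-free outline under stated assumptions, not the notebook used for this project: it swaps Selenium, BeautifulSoup, and lxml for a caller-supplied `fetch` function and plain regular expressions, and the two patterns (`JOB_LINK_RE`, `SKILLS_RE`) are hypothetical because the actual Dice.com markup is not shown here.

```python
import os
import re
import tempfile

# Hypothetical patterns: the real Dice.com page structure is not shown in this deck.
JOB_LINK_RE = re.compile(r'href="(/jobs/detail/[^"]+)"')
SKILLS_RE = re.compile(r'<div class="skills">(.*?)</div>', re.S)


def scrape_skills(fetch, results_urls, out_path):
    """Follow the processing flow; `fetch(url)` must return the page HTML as a string."""
    job_links = []
    for results_url in results_urls:              # step 2: loop through results pages
        for link in JOB_LINK_RE.findall(fetch(results_url)):
            if link not in job_links:             # keep order, skip duplicate listings
                job_links.append(link)

    rows = []
    for link in job_links:                        # step 3: loop through job listings
        match = SKILLS_RE.search(fetch(link))
        rows.append(match.group(1).strip() if match else "")

    with open(out_path, "w", encoding="utf-8") as fh:   # step 4: one row per listing
        fh.write("\n".join(rows) + "\n")
    return rows


# Stand-in for the live site (step 1 would produce the real results URLs).
fake_site = {
    "/search?page=1": '<a href="/jobs/detail/a1">A</a> <a href="/jobs/detail/b2">B</a>',
    "/search?page=2": '<a href="/jobs/detail/b2">B</a>',
    "/jobs/detail/a1": '<div class="skills">Python, SQL, R</div>',
    "/jobs/detail/b2": '<div class="skills">Machine Learning, SAS</div>',
}
out_path = os.path.join(tempfile.gettempdir(), "skills_demo.txt")
rows = scrape_skills(fake_site.__getitem__, ["/search?page=1", "/search?page=2"], out_path)
print(rows)
```

Passing `fetch` in as a parameter keeps the loop logic testable offline; in a real run it would wrap the browser automation.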

Results

This data was significantly less tidy than the DataScience Exchange tags. After basic cleanup including removing ampersand-hex codes, punctuation, and obviously non-skill words such as “and” and “or”, the following observations can be made.

There are 766 unique skills which range in frequency from 1 to 159. Of these, 59 appear more than 5 times and 1 more than 100 times.

##     skills                N          
##  Length:766         Min.   :  1.000  
##  Class :character   1st Qu.:  1.000  
##  Mode  :character   Median :  1.000  
##                     Mean   :  2.602  
##                     3rd Qu.:  2.000  
##                     Max.   :159.000
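The basic cleanup described in this section (removing ampersand-hex codes, punctuation, and non-skill words) might look roughly like the Python sketch below. It is an illustration rather than the cleanup actually applied: the comma-separated input format, the function name `clean_skills`, and the sample listings are all assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"and", "or"}  # the two non-skill words named in the text; extend as needed


def clean_skills(raw_line):
    """Turn one job listing's raw skill string into a list of normalized skills."""
    text = re.sub(r"&#x[0-9a-fA-F]+;", " ", raw_line)   # remove ampersand-hex codes
    skills = []
    for chunk in text.lower().split(","):               # assumption: comma-separated skills
        chunk = re.sub(r"[^\w\s]", " ", chunk)          # strip punctuation
        words = [w for w in chunk.split() if w not in STOP_WORDS]
        if words:
            skills.append(" ".join(words))
    return skills


# Invented sample rows, one per job listing.
listings = ["Python, Machine Learning &#x2F; AI, and SQL.", "SQL, python!"]
counts = Counter(skill for line in listings for skill in clean_skills(line))
print(counts.most_common())
```

Stripping all punctuation is blunt: a skill such as "c++" would collapse to "c", so a real cleanup pass would likely need a short list of exceptions.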

Top 10 skills

Common skills are those we also saw in the DataScience Exchange tags. Unique skills appear only in the Dice.com job skills.

Wordcloud (Count > 5)

Dice/DS Exchange Comparison

The following skills are represented in both the top 10 DataScience Exchange tags and the Dice.com job skills:

  • machine learning, python, r

The following are unique to the top 10 DataScience Exchange tags:

  • neural network, deep learning, classification, keras, scikit learn, tensorflow, nlp

and conversely these are unique to the top 10 Dice.com job skills:

  • sql, java, data mining, sas, engineering, analysis, modeling
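The comparison above is a plain set intersection and difference. The short Python sketch below rebuilds the two top-10 sets from the lists in this section (ranking order is not preserved, and the variable names are illustrative) and recovers the same three groups.

```python
# Top-10 sets reconstructed from the lists above; order within each set is not meaningful.
dse_top10 = {
    "machine learning", "python", "r", "neural network", "deep learning",
    "classification", "keras", "scikit learn", "tensorflow", "nlp",
}
dice_top10 = {
    "machine learning", "python", "r", "sql", "java",
    "data mining", "sas", "engineering", "analysis", "modeling",
}

common = dse_top10 & dice_top10        # skills in both top 10 lists
only_dse = dse_top10 - dice_top10      # unique to DataScience Exchange
only_dice = dice_top10 - dse_top10     # unique to Dice.com

print(sorted(common))                  # ['machine learning', 'python', 'r']
print(sorted(only_dse))
print(sorted(only_dice))
```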

Previous Analyses Published Online

KDNuggets—2018: Description

In November of 2018, Jeff Hale posted an entry on the KDNuggets blog where he described his findings based on a job-listing analysis performed against LinkedIn, Indeed, SimplyHired, and AngelList on October 10, 2018.

His findings, shown on the next slide, reinforce that the most requested skills are the analytical ones, such as computer science, analysis, statistics, and machine learning. However, employers also request a number of “softer” skills, such as communication and visualization.

Note that these softer skills would not necessarily show up as questions on DataScience Exchange.

KDNuggets—2018: Findings

LinkedIn—2018: Description

In April of 2018, Michael Li, VP of Data at Coinbase, posted a piece on LinkedIn where he listed his main desired skills for new data science hires.

  • Data wrangling / Munging / Manipulation

  • Experiment Design and A/B testing

  • Statistical Modeling / Machine learning

  • Soft Skills

  • Case studies and problem-solving

LinkedIn—2018: Discussion

What is important about Li’s piece is its focus on the “softer” skills of the data scientist. Of the five headings, only two would be considered classic “hard” data science: Data wrangling and Statistical modeling.

One could argue that Experimental Design is also a rigorous discipline, as in the work of Fisher, Neyman, and Pearson, yet it is a skill that many discussions fail to mention.

Li makes it clear that he views developing good case studies, visualizations, after-action summaries, communication, and persuasion skills as key to becoming a good data scientist. He concludes his section on soft skills with:

Ultimately, the goal is to take the insights generated from the analysis and effectively influence critical decision-making, which drives business impact. The “hard skills” and “soft skills” need to work together for the success of a data scientist.