Data Skills

Team Profound Specialists
10/28/2018

Question: Which skills are required to be a data scientist?

Team Members: simplymathematics, jemceach, ritwar01, ghh2001, & johnpannyc

Overview

  • Sources
  • Roles
  • Tools
  • Skills
  • Characteristics
  • Salary
  • Outlook
  • Technical skills
  • Indeed
  • Conclusion

Sources

Roles

  • Characteristics: @simplymathematics & @jemceach
  • Salary: @simplymathematics & @jemceach
  • Skills: @simplymathematics & @jemceach
  • Technical Skills: @simplymathematics & @jemceach
  • Indeed Skills: @ritwar01
  • SQL: @simplymathematics
  • Write up and analysis: @simplymathematics & @jemceach
  • Presentation: @ritwar01, @simplymathematics, @jemceach

Tools

Communication:

  • Email
  • Slack

Production:

  • GitHub
  • RStudio
  • Rpres

Figure: GitHub Issues tool

Skills

  • Scraped and cleaned data from O*NET
  • Evaluated data on a scale of importance (0-7).
  • Findings: Top 5 Skills for Math and Computer Science Jobs:
Element                         avgvalue   n
Critical Thinking               3.936667   16
Complex Problem Solving         3.714286   16
Reading Comprehension           3.660000   16
Judgment and Decision Making    3.638889   16
Active Listening                3.605000   16
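
The averaging step behind a table like this can be sketched in Python. This is a minimal illustration, not the team's R code: the ratings, the `top_skills` helper, and its `k` parameter are all hypothetical, and the sample values are not O*NET data.

```python
from collections import defaultdict

# Hypothetical (occupation, skill, importance) ratings; NOT the O*NET data.
ratings = [
    ("Statisticians", "Critical Thinking", 4.1),
    ("Statisticians", "Active Listening", 3.6),
    ("Actuaries", "Critical Thinking", 3.9),
    ("Actuaries", "Active Listening", 3.5),
    ("Web developers", "Critical Thinking", 3.6),
    ("Web developers", "Programming", 4.0),
]


def top_skills(rows, k=5):
    """Return up to k (skill, average, n) tuples, highest average first."""
    by_skill = defaultdict(list)
    for _occupation, skill, value in rows:
        by_skill[skill].append(value)
    # One summary row per skill: name, mean rating, number of ratings.
    summary = [(s, sum(v) / len(v), len(v)) for s, v in by_skill.items()]
    return sorted(summary, key=lambda t: t[1], reverse=True)[:k]


for skill, avg, n in top_skills(ratings):
    print(f"{skill:<20} {avg:.6f} {n}")
```

Keeping `n` next to each average shows how many occupations contributed to it, which matters when some skills are rated for only a few occupations.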

Characteristics

[Figure: characteristics plot]

Characteristics (Cont.)

[Figure: characteristics plot, continued]

Salary

Scraped Salary Data from BLS
SOC    Title                                         Mean.Annual.Wage
1133   Software developers, systems software         111780
1134   Web developers                                74110
1141   Database administrators                       89050
1142   Network and computer systems administrators   86340
1143   Computer network architects                   107870
1151   Computer user support specialists             54150
1152   Computer network support specialists          67510

Salary (cont.)

Scraped Salary Data from BLS
SOC    Title                                             Mean.Annual.Wage
1199   Computer occupations, all other                   91080
2011   Actuaries                                         114850
2021   Mathematicians                                    104700
2031   Operations research analysts                      86510
2041   Statisticians                                     88980
2090   Miscellaneous mathematical science occupations    73670

Evaluation of Annual Wage Data

Mean:

[1] 88507.69

Standard Deviation:

[1] 18113.61
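
These two figures can be checked with a short Python sketch. It assumes the thirteen occupations in the two salary tables are the full input, and that the standard deviation is the sample form (n - 1 denominator), which is what R's `sd()` computes.

```python
import statistics

# Mean annual wages (USD) copied from the two BLS salary tables.
wages = [
    111780, 74110, 89050, 86340, 107870, 54150, 67510,  # first table
    91080, 114850, 104700, 86510, 88980, 73670,         # continuation table
]

mean_wage = statistics.mean(wages)
sd_wage = statistics.stdev(wages)  # sample SD, n - 1 denominator

print(f"mean: {mean_wage:.2f}")
print(f"sd:   {sd_wage:.2f}")
```

The population form (`statistics.pstdev`) would give a smaller value, so the choice of denominator matters with only thirteen rows.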

[Figure: annual wage data plot]

Outlook Analysis

  • Significant deviation in pay amongst occupations.
  • Considerable room for advancement!
Title                          2016Wage   2016-26.EmplChange
Statisticians                  88980      33.8
Mathematicians                 104700     29.7
Operations research analysts   86510      27.4
Actuaries                      114850     22.5

Salary vs. Outlook (cont.)

Findings: Weak, negative correlation between salary and job sector growth. Correlation = -0.42
Linear Model: \[ \text{jobs}(\text{dollars}) = -0.003 \times \text{dollars} + 48 \]
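
The correlation and least-squares line can be computed as in the following Python sketch. It uses only the four occupations from the Outlook table, so its output will not match the correlation or the model coefficients reported here, which presumably came from the full dataset. The `pearson_and_fit` helper is a hypothetical name.

```python
import math

# Four rows from the Outlook table: 2016 mean wage (USD) and
# projected 2016-26 employment change (%). A subset, not the full data.
wages = [88980, 104700, 86510, 114850]
growth = [33.8, 29.7, 27.4, 22.5]


def pearson_and_fit(xs, ys):
    """Return (r, slope, intercept) for the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
    slope = sxy / sxx                # change in growth per extra dollar
    intercept = my - slope * mx      # predicted growth at zero dollars
    return r, slope, intercept


r, slope, intercept = pearson_and_fit(wages, growth)
print(f"r = {r:.2f}, jobs(dollars) = {slope:.6f} * dollars + {intercept:.1f}")
```

Because wages are in the tens of thousands and growth is in tens of percent, the slope is a very small number, so the number of decimal places kept when reporting it has a large effect on predictions.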

[Figure: salary vs. outlook plot]

Technical Skills

  • Scraped and cleaned data from O*NET
  • Evaluated the frequency of technical skills.
  • Findings: Top 5 technical skills for mathematical and computational occupations:
Commodity.Title                                      n
Development environment software                     437
Web platform development software                    371
Object or component oriented development software    303
Data base management system software                 289
Analytical or scientific software                    248

BLS/ONET Findings

  • Developed a broad picture of the computational and mathematical field.
  • Unable to draw any conclusions without further statistical analysis about:
    • Which skills are in growing fields
    • Salary variations within skillsets

We next consulted Indeed to compare the DOL datasets to current job postings in the NY/CA area.

Indeed (CA)

  • Web-scraped Indeed (CA) for specific data scientist skill sets.
  • Scraped and cleaned data from Indeed.
  • Totaled the count of each skill set.
  • Findings: Top 5 CA data scientist skill sets listed on Indeed:
Keyword            n
Machine Learning   173
SQL                146
Python             142
AI                 115
AWS                75
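
The counting step can be sketched in Python as follows. This is an illustration under stated assumptions, not the team's scraper: the posting snippets are made up, `count_keywords` is a hypothetical helper, and it counts postings that mention a keyword (not total occurrences), which may differ from how the figures above were tallied.

```python
import re
from collections import Counter

KEYWORDS = ["Machine Learning", "SQL", "Python", "AI", "AWS"]

# Made-up posting snippets for illustration; not scraped data.
postings = [
    "Seeking data scientist with Python, SQL and machine learning experience.",
    "AI team needs Python and AWS skills; SQL a plus.",
    "Must maintain legacy systems. Python required.",
]


def count_keywords(texts, keywords):
    """Count how many texts mention each keyword (case-insensitive, whole words)."""
    counts = Counter()
    for text in texts:
        for kw in keywords:
            # \b keeps short keywords such as "AI" from matching inside "maintain".
            pattern = r"\b" + re.escape(kw) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                counts[kw] += 1
    return counts


for kw, n in count_keywords(postings, KEYWORDS).most_common():
    print(kw, n)
```

Whole-word matching is the main design choice here: a plain substring search would inflate counts for short keywords like "AI" and "R".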

Indeed (NY)

Indeed.com

  • Scraped and cleaned data from Indeed.
  • Totaled the count of each skill set.
  • Findings: Top 5 NY data scientist skill sets listed on Indeed:
Keyword            n
Python             54
SQL                45
Hadoop             41
AI                 33
Machine Learning   26

Indeed State Comparisons

[Figure: Indeed state comparison plots]

Data Comparisons

  • Indeed postings list specific technical skills required for “Data Scientists.”
  • Unable to make a direct comparison to ONET/BLS data.
  • We were still able to draw parallels between the datasets:
ONET/BLS Data                                        Indeed Equivalence
Development environment software                     Python, Hadoop, AWS, Spark
Web platform development software                    Ruby, PHP, JS
Object or component oriented development software    R, AI, SAS, Spark, ML
Data base management system software                 SQL, Tableau
Analytical or scientific software                    R, Spark, Tableau, SAS, Statistics

Conclusion

  • Python is a far more popular data science tool than R. However, that is likely due to Python's applications outside of data science.
  • Statistics is rarely listed as a required skill for Data Science.
  • Machine learning/AI is a huge growth field but seemingly lacks mathematical rigor (see above point)
  • There appears to be only a weak, negative correlation (-0.42) between projected job growth and average salary for a given occupation
  • Group work is hard

Conclusion (cont.)

Data Challenges

  • BLS and ONET databases are very broad and do little to highlight specific requirements
  • Project required further regression analysis to map skills to occupational growth/salary
  • Time constraints with scraping data from Indeed.

Group Challenges

  • Steep learning curve associated with git
  • Version control
  • Coordination and management of work

SQL Database

  • SQLite database stored in GitHub repository: “csvs/db.sqlite”
  • Loaded CSV files into corresponding tables
  • Further documentation stored in “csvs/SQL.Rmd”
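
Loading a CSV file into a SQLite table can be sketched in Python as below. This is not the project's code (that is documented in “csvs/SQL.Rmd”): the `load_csv` helper and the `indeed_ny` table name are hypothetical, the CSV text is a stand-in built from three rows of the Indeed (NY) table, and an in-memory database is used instead of the repository file.

```python
import csv
import io
import sqlite3

# Stand-in for one CSV file; rows copied from the Indeed (NY) table.
csv_text = "Keyword,n\nPython,54\nSQL,45\nHadoop,41\n"


def load_csv(conn, table, text):
    """Create `table` with columns from the CSV header and insert every data row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the repository file
load_csv(conn, "indeed_ny", csv_text)

# Values are stored as text, so cast before sorting numerically.
query = "SELECT Keyword, n FROM indeed_ny ORDER BY CAST(n AS INTEGER) DESC"
for row in conn.execute(query):
    print(row)
```

Row values go through `?` placeholders, so only the table and column names are interpolated into the SQL text.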

Further Research

  • ANOVA test that maps skills to pay
  • Redo analysis with new SOC data that includes a specific 'data science' category
  • Include other job websites in our scraping and weight the results from each.