Introduction

This challenging project for DATA607 focuses on the team building, collaboration and leadership required to succeed in the field of data science and analytics. We were instructed to work closely in groups to answer a specific question: “Which are the most valued data science skills?” Our team quickly established a rapport and a communications channel on Slack, then set to the task with gusto and shared responsibility. We chose Duubar Villalobos Jimenez as our leader, and under his coordination we assigned a variety of tasks and deadlines to accomplish our goal. Of course, data is required. We chose to collect data science salary data from the website Paysa, which lists a wide variety of tech postings, the skills associated with each posting and several salary components. Details follow, but our analysis shows that Machine Learning is the most-requested skill in our sample of 390 job listings, while the highest-valued skill, measured by mean compensation, was Strategy, reflecting the higher pay for management, leadership and vision in this rapidly evolving profession.

Team members

Name Team Email
Pavan Akula Team 3 akulapavan@hotmail.com
Ambra Baboni Alexander Team 3 ambra8due@hotmail.com
Thomas Detzel Team 3 tomdetz@gmail.com
Dilip Ganesan Team 3 dilipgan@gmail.com
Kyle Gilde Team 3 kylegilde@gmail.com
Raghunathan Rammnath Team 3 raghu74us@gmail.com
Duubar Villalobos Jimenez Team 3 mydvtech@gmail.com

Our Process

Workspace preparation

Create a vector with all needed libraries.

 # Packages used throughout the report (each name listed once)
 load_packages <- c(
                    "knitr",
                    "RMySQL",
                    "tidyverse",
                    "tidyr",
                    "dplyr",
                    "stringr",
                    "plotly",
                    "htmlTable",
                    "prettydoc",
                    "shinythemes",
                    "treemap",
                    "data.tree",
                    "janitor",
                    "ggplot2",
                    "ggthemes",
                    "stats"
                  )
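The vector above only names the packages; it does not attach them. As a minimal sketch of one way to do that, the helpers below (missing_packages and attach_packages are hypothetical names, not part of the project code) install whatever is missing and then call library() on each entry. The usage line passes a short stand-in vector of base packages so that nothing is downloaded.

```r
# Hypothetical helper: report which packages are not yet installed
missing_packages <- function(pkgs) {
  pkgs <- unique(pkgs)
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install anything missing, then attach every package
attach_packages <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(lapply(unique(pkgs), library, character.only = TRUE))
}

# Stand-in vector of base packages; the report would pass load_packages
attach_packages(c("stats", "utils"))
```

Calling unique() first means a name repeated in the vector is only processed once.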

Organization and Communication

As a team, we held a brainstorming meeting in which we defined roles and lines of work. The most important agreements follow.

Github

We agreed to create a GitHub repository, D607-Group-Project. All team members had access and were able to post and read from a single repository location.

https://github.com/kylegilde/D607-Group-Project

Slack

We agreed to use Slack as our Team Collaboration platform. From Slack we were able to perform live meetups with “join.me”, which allows screen sharing and presentations to update and explore specific topics and problems. For example, we were able to look at code and discuss problems and refinements.

https://cuny-data607.slack.com

Google Docs

We created a spreadsheet in Google Docs to list deadlines and responsibilities.

https://docs.google.com/spreadsheets/d/1QNhmk6ebuFKYiyqrWJewhzT-PnrYgntZ09MC3yqpcx8/edit#gid=1903046315

Data

After some preliminary research and analysis, we decided to collect data from current data science job postings. We identified several web sites that offered raw information on position, location, company, salary and skills. In the end we selected Paysa (https://www.paysa.com) because it offered the most comprehensive set of variables.

Limitations

Please note that this data was collected from a single source and relates to job postings extant on March 14, 2017. No assumption should be made about past or future data science skills, and the conclusions of this project should not be generalized beyond this sample. A different sample from the same source will likely produce different results.

Collection

We were unable to find data in a table, csv file or other structured format. In addition, the Paysa proprietors declined our request to provide sample data for the project. We attempted to scrape the data, but Paysa has designed its web pages to prevent scraping. Because we did not require a huge sample of data for this process, we cut and pasted Paysa data into a text file that we then cleaned, organized and refined. We imported that raw text data into a SQL database for permanent storage, then exported it to R to conduct our analysis. The import-export process was also conducted via R. In total, we collected 390 job postings that listed 95 overall skills. Because some of those skills were the same but named differently, we collapsed the list into 35 unique skills and a miscellaneous group of skills that appeared fewer than 10 times in the data.

We named our raw text file paysa.txt and uploaded it to GitHub.

Data Preparation

Once we had our paysa.txt file with our desired information, we proceeded to extract valuable information from it and created a data frame. The code was shared among us to continue further cleaning and tidying.

Import

This code reads our raw text file into R from GitHub.

url <- "https://raw.githubusercontent.com/dilipganesan/D607-Group-Project/patch-1/scripts/paysa.txt"

# read_file() and default_locale() come from readr, which tidyverse attaches
mystring <- read_file(url, locale = default_locale())

Sample of the input file.

## Data Scientist Job Results, showing 6K recent job postings
## Update Saved SearchNotification Settings
## 0%
## MATCH
## Head of SBG Data Science Engineering Logo
## Head of SBG Data Science Engineering
## Intuit
## EXPECTED
## $338K
## MARKET SALARY
## Base Salary$253K
## Annual Bonus$86K
## Signing Bonus$24K
## Annual Equity$0
## APPLY NOW
## Head of SBG Data Science Engineering at Intuit in Mountain View, CA
## How you match this job:
## Add more skills  to see how you match this job
## You can learn valuable new skills like: Distributed Systems, Big Data, Algorithms, Data Science, Strategy, Databases and more.
##   Jobs at Intuit      Jobs in Mountain View, CA
## 0%
## MATCH
## Principal Lead Data Scientist Logo
## Principal Lead Data Scientist
## Akamai Technologies
## EXPECTED
## $317K
## MARKET SALARY
## Base Salary$204K
## Annual Bonus$27K
## Annual Equity$86K
## Signing B

From text file to data frame

This code uses string handling to separate the text into an initial set of columns.
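The extraction code itself is not reproduced here. As a rough sketch of how such string handling could work, the base R example below splits the text on the "0%" / "MATCH" marker that precedes each posting and pulls each field out with a regular expression. The helper names (make_posting, first_match, parse_posting), the patterns and the two-posting sample text are assumptions made for illustration; the team's actual code may differ.

```r
# Hypothetical helper: build one posting in the layout of paysa.txt
make_posting <- function(title, company, expected, base, annual, signing,
                         skills, location) {
  paste("0%", "MATCH", paste(title, "Logo"), title, company,
        "EXPECTED", expected, "MARKET SALARY",
        paste0("Base Salary", base), paste0("Annual Bonus", annual),
        paste0("Signing Bonus", signing), "APPLY NOW",
        paste(title, "at", company, "in", location),
        paste0("You can learn valuable new skills like: ", skills, " and more."),
        paste0("  Jobs at ", company, "      Jobs in ", location),
        sep = "\n")
}

# Illustrative stand-in for the contents of paysa.txt
raw_text <- paste(
  "Data Scientist Job Results, showing 6K recent job postings",
  make_posting("Head of SBG Data Science Engineering", "Intuit", "$338K",
               "$253K", "$86K", "$24K",
               "Distributed Systems, Big Data, Algorithms, Data Science, Strategy, Databases",
               "Mountain View, CA"),
  make_posting("Principal Lead Data Scientist", "Akamai Technologies", "$317K",
               "$204K", "$27K", "$18K",
               "Hadoop, Data Mining, Machine Learning, Python, Matlab, Ruby",
               "Santa Clara, CA"),
  sep = "\n")

# Return the first match of a Perl-style pattern in x, or NA if there is none
first_match <- function(x, pattern) {
  m <- regmatches(x, regexpr(pattern, x, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

# Turn one posting block into a one-row data frame
parse_posting <- function(block) {
  data.frame(
    applyposition = first_match(block, "APPLY NOW\\n[^\\n]+"),
    Base     = first_match(block, "Base Salary\\$[0-9]+K?"),
    Annual   = first_match(block, "Annual Bonus\\$[0-9]+K?"),
    signing  = first_match(block, "Signing Bonus\\$[0-9]+K?"),
    expected = first_match(block, "EXPECTED\\n\\$[0-9]+K?"),
    skillset = first_match(block, "You can learn valuable new skills like:[^\\n]+"),
    location = first_match(block, "Jobs in [^\\n]+"),
    stringsAsFactors = FALSE
  )
}

# Split on the marker and keep only blocks that contain a posting
blocks <- strsplit(raw_text, "0%\nMATCH\n", fixed = TRUE)[[1]]
blocks <- blocks[grepl("APPLY NOW", blocks, fixed = TRUE)]
postings <- do.call(rbind, lapply(blocks, parse_posting))

# Replace embedded line breaks so each field prints on one line
postings[] <- lapply(postings, function(col) gsub("\n", " ", col, fixed = TRUE))
postings <- cbind(ID = seq_len(nrow(postings)), postings)
```

A field that is absent from a posting comes back as NA rather than stopping the parse, which matters when a posting omits one of the salary components.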

Sample of the initial data frame.

ID | applyposition | Base | Annual | signing | expected | skillset | location
1 | APPLY NOW Head of SBG Data Science Engineering at Intuit in Mountain View, CA | Base Salary$253K | Annual Bonus$86K | Signing Bonus$24K | EXPECTED $338K | You can learn valuable new skills like: Distributed Systems, Big Data, Algorithms, Data Science, Strategy, Databases and more. | Jobs in Mountain View, CA
2 | APPLY NOW Principal Lead Data Scientist at Akamai Technologies in Santa Clara, CA | Base Salary$204K | Annual Bonus$27K | Signing Bonus$18K | EXPECTED $317K | You can learn valuable new skills like: Hadoop, Data Mining, Machine Learning, Python, Matlab, Ruby and more. | Jobs in Santa Clara, CA
3 | APPLY NOW Principal Lead Data Scientist at Akamai Technologies in Santa Clara, CA | Base Salary$204K | Annual Bonus$27K | Signing Bonus$18K | EXPECTED $317K | You can learn valuable new skills like: Hadoop, Data Mining, Big Data, Algorithms, Machine Learning, Python and more. | Jobs in Santa Clara, CA
4 | APPLY NOW Director, Data Scientist at Dropbox in San Francisco, CA | Base Salary$183K | Annual Bonus$0 | Signing Bonus$30K | EXPECTED $289K | You can learn valuable new skills like: Data Mining, Algorithms, Machine Learning, Data Science, SQL, Analytics and more. | Jobs in San Francisco, CA
5 | APPLY NOW Principal Data Scientist at Microsoft in Bellevue, WA | Base Salary$200K | Annual Bonus$49K | Signing Bonus$19K | EXPECTED $289K | You can learn valuable new skills like: Hadoop, Data Mining, Optimization, Algorithms, MapReduce, C++ and more. | Jobs in Bellevue, WA
6 | APPLY NOW Principal Data Scientist at Microsoft in Redmond, WA | Base Salary$198K | Annual Bonus$53K | Signing Bonus$19K | EXPECTED $287K | You can learn valuable new skills like: Algorithms, Machine Learning, C++, Python, Deep Learning, Data Science and more. | Jobs in Redmond, WA

MySQL

Next, we stored this initial data set on a remote MySQL server. We provided clear instructions for all team members about how to read tables into R.

The remote MySQL server is hosted at mydvtech.com. We chose a commercial site because in our day-to-day work we will be reading and storing company data in company portals, so this was good practice for all. Step-by-step instructions for connecting to the MySQL server through cPanel were written up in a manual distributed to the team.

http://rpubs.com/dvillalobos/confMySQLcPanel

Writing data into MySQL

This code connects to our remote MySQL server and writes a data frame into a table.

writeMySQLTable <- function(my.data = NULL, myLocalTableName = NULL){

  # myLocalUser, myLocalPassword, myLocalHost and myLocalMySQLSchema are
  # expected to be defined in the calling environment

  # Creating a schema if it doesn't exist by employing RMySQL() in R
  mydbconnection <- dbConnect(MySQL(),
                  user = myLocalUser,
                  password = myLocalPassword,
                  host = myLocalHost)

  MySQLcode <- paste0("CREATE SCHEMA IF NOT EXISTS ", myLocalMySQLSchema, ";")
  dbExecute(mydbconnection, MySQLcode)

  # Close the schema-less connection before reconnecting to the schema
  dbDisconnect(mydbconnection)

  # Write our data frame into MySQL
  mydbconnection <- dbConnect(MySQL(),
                  user = myLocalUser,
                  password = myLocalPassword,
                  host = myLocalHost,
                  dbname = myLocalMySQLSchema)

  # Replace any existing table of the same (lower-case) name
  myLocalTableName <- tolower(myLocalTableName)
  MySQLcode <- paste0("DROP TABLE IF EXISTS ", myLocalTableName, ";")

  dbExecute(mydbconnection, MySQLcode)
  dbWriteTable(mydbconnection, name = myLocalTableName, value = my.data)

  # Closing connection with local Schema
  dbDisconnect(mydbconnection)

  # Close all other open connections we might have
  lapply( dbListConnections( dbDriver( drv = "MySQL")), dbDisconnect)
}

Reading data from MySQL

For analysis and testing, we were encouraged to read from our MySQL server instead of GitHub (GitHub was our backup plan if we had problems with MySQL). One advantage of using a remote MySQL server is speed in reading and transferring data; we noticed that reading from our GitHub repository used considerably more resources than reading from MySQL.

The following code connects to the MySQL server and reads the data stored in a table into an R data frame.

readMySQLTable <- function(myLocalTableName = NULL){

  # Connecting to a schema by employing RMySQL() in R
  mydbconnection <- dbConnect(MySQL(),
                  user = myLocalUser,
                  password = myLocalPassword,
                  host = myLocalHost,
                  dbname = myLocalMySQLSchema)

  # Default result, returned when the table does not exist
  slookup <- NULL

  # Check to see if our table exists and read our data
  myLocalTableName <- tolower(myLocalTableName)
  if (dbExistsTable(mydbconnection, name = myLocalTableName)){
    slookup <- dbReadTable(mydbconnection, name = myLocalTableName)
  }

  # Closing connection with local Schema
  dbDisconnect(mydbconnection)

  # To close all open connections
  lapply( dbListConnections( dbDriver( drv = "MySQL")), dbDisconnect)

  return(slookup)
}
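Both functions rely on four variables (myLocalUser, myLocalPassword, myLocalHost and myLocalMySQLSchema) that must already exist when they are called. How the team set them is not shown; one possible approach, sketched below, reads them from environment variables so that credentials stay out of the shared repository. The environment-variable names and the demo table name are hypothetical.

```r
# Hypothetical environment-variable names; adjust to the local setup
myLocalUser        <- Sys.getenv("PAYSA_DB_USER", unset = "")
myLocalPassword    <- Sys.getenv("PAYSA_DB_PASSWORD", unset = "")
myLocalHost        <- Sys.getenv("PAYSA_DB_HOST", unset = "mydvtech.com")
myLocalMySQLSchema <- Sys.getenv("PAYSA_DB_SCHEMA", unset = "")

# Only touch the database when credentials are configured and the
# helper functions from this report have been defined
db_configured <- all(nzchar(c(myLocalUser, myLocalPassword, myLocalMySQLSchema))) &&
  exists("writeMySQLTable") && exists("readMySQLTable")

if (db_configured) {
  # "demo_table" is an illustrative table name, not one used in the project
  writeMySQLTable(my.data = head(mtcars), myLocalTableName = "demo_table")
  demo <- readMySQLTable("demo_table")
}
```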

Tidying and transformation

We divided our tidying and transformation into several tasks.

New cleaned table results.

ID | applyposition | Base | Annual | signing | expected | Skill1 | Skill2 | Skill3 | Skill4 | Skill5 | Skill6 | location
1 | Head of SBG Data Science Engineering at Intuit in Mountain View, CA | 253K | 86K | 24K | 338K | Distributed Systems | Big Data | Algorithms | Data Science | Strategy | Databases | Mountain View, CA
2 | Principal Lead Data Scientist at Akamai Technologies in Santa Clara, CA | 204K | 27K | 18K | 317K | Hadoop | Data Mining | Machine Learning | Python | Matlab | Ruby | Santa Clara, CA
3 | Principal Lead Data Scientist at Akamai Technologies in Santa Clara, CA | 204K | 27K | 18K | 317K | Hadoop | Data Mining | Big Data | Algorithms | Machine Learning | Python | Santa Clara, CA
4 | Director, Data Scientist at Dropbox in San Francisco, CA | 183K | | 30K | 289K | Data Mining | Algorithms | Machine Learning | Data Science | SQL | Analytics | San Francisco, CA
5 | Principal Data Scientist at Microsoft in Bellevue, WA | 200K | 49K | 19K | 289K | Hadoop | Data Mining | Optimization | Algorithms | MapReduce | C++ | Bellevue, WA
6 | Principal Data Scientist at Microsoft in Redmond, WA | 198K | 53K | 19K | 287K | Algorithms | Machine Learning | C++ | Python | Deep Learning | Data Science | Redmond, WA

Continuing with our clean up, we created new variable names, fixed dollar values and separated city and state.
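As an illustration of the dollar-value and location fixes, the sketch below converts strings such as "253K" into whole dollars and splits a location at its last comma. The function names parse_dollars and split_location are hypothetical, and this is one possible base R approach rather than the team's code.

```r
# Convert strings such as "253K", "$86K" or "$0" to whole dollars
parse_dollars <- function(x) {
  digits <- as.numeric(gsub("[^0-9.]", "", x))
  thousands <- grepl("K", x, fixed = TRUE)
  ifelse(thousands, digits * 1000, digits)
}

# Split "Mountain View, CA" into city and state at the last comma
split_location <- function(x) {
  data.frame(
    City  = trimws(sub(",[^,]*$", "", x)),
    State = trimws(sub("^.*,", "", x)),
    stringsAsFactors = FALSE
  )
}

parse_dollars(c("253K", "$86K", "$0"))   # returns 253000, 86000, 0
split_location(c("Mountain View, CA", "Santa Clara, CA"))
```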

Tidy table

Below is our tidy table, with a total of 2217 skill entries. (In a subsequent step, we collapse repetitive skills, e.g., “C” and “C++”.)

ID | Position | Company | City | State | Skills | Type | Salary
1 | Head of SBG Data Science Engineering | Intuit | Mountain View | CA | Distributed Systems | Base Salary | 253000
2 | Principal Lead Data Scientist | Akamai Technologies | Santa Clara | CA | Hadoop | Base Salary | 204000
3 | Principal Lead Data Scientist | Akamai Technologies | Santa Clara | CA | Hadoop | Base Salary | 204000
4 | Director, Data Scientist | Dropbox | San Francisco | CA | Data Mining | Base Salary | 183000
5 | Principal Data Scientist | Microsoft | Bellevue | WA | Hadoop | Base Salary | 200000
6 | Principal Data Scientist | Microsoft | Redmond | WA | Algorithms | Base Salary | 198000
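To make the reshaping concrete, here is a minimal base R sketch that stacks the Skill1, Skill2, ... columns into one Skills column and then collapses variant names with a lookup vector. The two sample rows are abbreviated to three skill columns, and the alias mapping is only an illustrative guess at how repetitive skills might be merged; the team's actual mapping is not shown in this report.

```r
# Two sample rows in the wide layout (three skill columns only, for brevity)
wide <- data.frame(
  ID     = 1:2,
  Base   = c(253000, 204000),
  Skill1 = c("Distributed Systems", "Hadoop"),
  Skill2 = c("Big Data", "Data Mining"),
  Skill3 = c("Algorithms", NA),
  stringsAsFactors = FALSE
)

skill_cols <- grep("^Skill[0-9]+$", names(wide), value = TRUE)

# Stack the skill columns so that each row holds one posting-skill pair
long <- do.call(rbind, lapply(skill_cols, function(col) {
  data.frame(ID = wide$ID, Skills = wide[[col]], Type = "Base Salary",
             Salary = wide$Base, stringsAsFactors = FALSE)
}))
long <- long[!is.na(long$Skills), ]
long <- long[order(long$ID), ]
rownames(long) <- NULL

# Illustrative alias mapping (an assumption, not the project's actual list)
skill_aliases <- c("C" = "C++", "Architectures" = "Architecture")

# Replace any skill that has an alias; leave all other names untouched
collapse_skills <- function(skills) {
  hit <- skills %in% names(skill_aliases)
  skills[hit] <- unname(skill_aliases[skills[hit]])
  skills
}

long$Skills <- collapse_skills(long$Skills)
```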

Analysis

Skills

Our gathered data contains a total of 72 distinct raw skill names.

Table: Data Science Skills Ranked by Frequency
Skills Count Percentage Rank
Machine Learning 260 11.73 % 1
Data Science 207 9.34 % 2
Algorithms 183 8.25 % 3
Hadoop 182 8.21 % 4
Big Data 177 7.98 % 5
Python 123 5.55 % 6
Data Mining 119 5.37 % 7
Optimization 97 4.38 % 8
Analytics 85 3.83 % 9
C++ 71 3.2 % 10
Management 61 2.75 % 11
SQL 56 2.53 % 12
Statistics 53 2.39 % 13
Matlab 40 1.8 % 14
Scala 38 1.71 % 15
Product Management 33 1.49 % 16
MapReduce 31 1.4 % 17
Strategy 27 1.22 % 18
Architectures 22 0.99 % 19
Technical Leadership 19 0.86 % 20
Deep Learning 18 0.81 % 22
Distributed Systems 18 0.81 % 22
Information Retrieval 18 0.81 % 22
AWS 17 0.77 % 24
ETL 16 0.72 % 25
Relational Databases 15 0.68 % 26
User Experience 14 0.63 % 28
Windows 14 0.63 % 28
Java 13 0.59 % 30
Ruby 13 0.59 % 30
Scalability 13 0.59 % 30
REST 11 0.5 % 32
Computer Vision 10 0.45 % 34
Leadership 10 0.45 % 34
Software Design 10 0.45 % 34
Apache Spark 8 0.36 % 36
Databases 8 0.36 % 36
C 7 0.32 % 39
Search 7 0.32 % 39
Time Series Analysis 7 0.32 % 39
Architecture 6 0.27 % 42
Automation 6 0.27 % 42
Natural Language Processing 6 0.27 % 42
PHP 6 0.27 % 42
Android 5 0.23 % 46
Game Development 5 0.23 % 46
OS X 5 0.23 % 46
Mathematics 4 0.18 % 48
Scripting 4 0.18 % 48
Business Intelligence 3 0.14 % 52
Cassandra 3 0.14 % 52
Functional Programming 3 0.14 % 52
Go 3 0.14 % 52
MySQL 3 0.14 % 52
Enterprise Software 2 0.09 % 58
Image Processing 2 0.09 % 58
LAMP 2 0.09 % 58
Recommender Systems 2 0.09 % 58
Signal Processing 2 0.09 % 58
Tomcat 2 0.09 % 58
Algorithm Design 1 0.05 % 66
Data Science Scripting 1 0.05 % 66
EMPTY 1 0.05 % 66
Engineering Management 1 0.05 % 66
Firewalls 1 0.05 % 66
HTTP 1 0.05 % 66
Mathematical Modeling 1 0.05 % 66
Network Architecture 1 0.05 % 66
Optimization Data Science 1 0.05 % 66
Product Design Data Science 1 0.05 % 66
Test Driven Development 1 0.05 % 66
Web Services 1 0.05 % 66

Top 10 most desired skills by employers, by count of raw skill names.
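A frequency table of this kind can be produced with a few lines of base R. The sketch below uses a short made-up skills vector in place of the Skills column of the tidy table. Tied skills share the lowest rank number here (ties.method = "min"), which is a choice made for this sketch and may differ from the tie handling in the table above.

```r
# Made-up skill column standing in for the Skills variable of the tidy table
skills <- c("Machine Learning", "Python", "Machine Learning", "Hadoop",
            "Python", "Machine Learning", "SQL")

# Count each skill and sort from most to least frequent
counts <- sort(table(skills), decreasing = TRUE)

skill_rank <- data.frame(
  Skills     = names(counts),
  Count      = as.integer(counts),
  Percentage = round(100 * as.integer(counts) / length(skills), 2),
  stringsAsFactors = FALSE
)

# Rank 1 is the most frequent skill; tied skills share the lowest rank number
skill_rank$Rank <- rank(-skill_rank$Count, ties.method = "min")

# The most requested skills
head(skill_rank, 10)
```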

Compensation (Exploratory Analysis)

The Paysa data lists Base Salary, Annual Bonus, Signing Bonus and a total called Expected Salary, providing another way to measure value beyond raw counts.

Tree Analysis

Another way of analyzing the above data is to perform a tree analysis. This tree shows Total Salary for jobs associated with Machine Learning skills in Washington and California.

Following are the highest-paid jobs in the data.

Total Salary

Table

Table: Companies and job offerings with the top 10 Total Salaries
Position Company City State Type Salary Rank
Head of SBG Data Science Engineering Intuit Mountain View CA Expected Salary 338000 1
Principal Data Science Manager Microsoft Redmond WA Expected Salary 322000 2
Principal Lead Data Scientist Akamai Technologies Santa Clara CA Expected Salary 317000 3
Data Science and Analytics Lead, Global Revenue Acceleration Google Mountain View CA Expected Salary 305000 4
Director, Data Scientist Dropbox San Francisco CA Expected Salary 289000 5
Principal Data Scientist Microsoft Bellevue WA Expected Salary 289000 6
Principal Data Scientist Microsoft Redmond WA Expected Salary 287000 7
Director Unknown Vienna VA Expected Salary 285000 8
Sr. Principal Data Scientist (A/B Platform) Coupang Palo Alto CA Expected Salary 281000 9
Manager of Data Science Yelp San Francisco CA Expected Salary 278000 10

Chart

Top paid skills

The below combination of skills is listed for the posting with the top Total Salary of $338,000.

Table: Top paid skills by top paid Total Salaries
Top Paid Skills Listed in top most desired skills by employers
Distributed Systems FALSE
Big Data TRUE
Algorithms TRUE
Data Science TRUE
Strategy FALSE
Databases FALSE

The below tree will present the top two highest-paid salaries and respective skills.

Signing Bonus

Table

Table: Companies and job offerings with the top 10 Signing Bonus
Position Company City State Type Salary Rank
Principal Data Science Engineer OpenTable San Francisco CA Signing Salary 43000 1
Data Scientist, Population and Survey Sciences Facebook Menlo Park CA Signing Salary 42000 2
Data Scientist, Infrastructure Facebook Menlo Park CA Signing Salary 42000 3
Data Scientist- Consumer Insights Facebook Menlo Park CA Signing Salary 42000 4
Data Visualization Scientist Facebook Menlo Park CA Signing Salary 42000 5
Data Scientist Facebook Menlo Park CA Signing Salary 42000 6
Data Scientist, Auction & Delivery Facebook Menlo Park CA Signing Salary 42000 7
Software Development Mgr - Advertising Analytics and Data Science Amazon Seattle WA Signing Salary 40000 8
Infrastructure Data Scientist & Strategy Analyst Facebook Menlo Park CA Signing Salary 40000 9
Senior Data Scientist Amazon Seattle WA Signing Salary 39000 10

Chart

Top paid skills

The below combination of skills is listed for the posting with the top Signing Bonus of $43,000.

Table: Top paid skills by top paid Signing Bonus
Top Paid Skills Listed in top most desired skills by employers
Product Management FALSE
Hadoop TRUE
Software Design FALSE
Information Retrieval FALSE
Machine Learning TRUE
ETL FALSE

The below tree will present the top two highest-paid signing bonuses and respective skills.

Annual Bonus

Table

Table: Companies and job offerings with the top 10 Annual Bonus
Position Company City State Type Salary Rank
Head of SBG Data Science Engineering Intuit Mountain View CA Annual Salary 86000 1
Principal Data Science Manager Microsoft Redmond WA Annual Salary 59000 2
Principal Data Scientist Microsoft Redmond WA Annual Salary 53000 3
Principal Data Scientist Microsoft Bellevue WA Annual Salary 49000 4
Principal Data Scientist Architect SAP Palo Alto CA Annual Salary 49000 5
Data Science and Analytics Lead, Global Revenue Acceleration Google Mountain View CA Annual Salary 48000 6
Principal Data Science Engineer OpenTable San Francisco CA Annual Salary 46000 7
Legal- Firmwide Initiatives Team OLO Data Scientist- VP JPMorgan Chase New York NY Annual Salary 46000 8
CIB-Rapid Prototyping Data Scientist Unknown New York NY Annual Salary 46000 9
Director Unknown Vienna VA Annual Salary 45000 10

Chart

Top paid skills

The below combination of skills is listed for the posting with the top Annual Bonus of $86,000.

Table: Top paid skills by top paid Annual Bonus
Top Paid Skills Listed in top most desired skills by employers
Distributed Systems FALSE
Big Data TRUE
Algorithms TRUE
Data Science TRUE
Strategy FALSE
Databases FALSE

The below tree will present the top two highest-paid annual bonuses and respective skills.

Base Salary

Table

Table: Companies and job offerings with the top 10 Base Salaries
Position Company City State Type Salary Rank
Head of Data Science and Engineering Amazon Seattle WA Base Salary 265000 1
Head of SBG Data Science Engineering Intuit Mountain View CA Base Salary 253000 2
Director Unknown Vienna VA Base Salary 240000 3
Principal Lead Data Scientist Akamai Technologies Santa Clara CA Base Salary 204000 4
Head of Data Science, Liquidity First Republic Bank San Francisco CA Base Salary 202000 5
Principal Data Scientist Microsoft Bellevue WA Base Salary 200000 6
Chief Data Scientist, Brilliant Manufacturing Job GE Seattle WA Base Salary 199000 7
Principal Data Scientist Microsoft Redmond WA Base Salary 198000 8
Corporate - Firmwide Forecasting & Analysis - Data Scientist/Engineer, Vice President JPMorgan Chase New York NY Base Salary 196000 9
Data Scientist, State Street Global Exchange, Vice President State Street Corporation New York NY Base Salary 195000 10

Chart

Top paid skills

The below combination of skills is listed for the posting with the top Base Salary of $265,000.

Table: Top paid skills by top paid Base Salaries
Top Paid Skills Listed in top most desired skills by employers
Product Management FALSE
Machine Learning TRUE
Data Science TRUE
Analytics FALSE
Statistics FALSE

The below tree will present the top two highest-paid base salaries and respective skills.

Combined Table of Top paid Skills

The table below displays which skills associate with the highest-paid compensation categories. For example, Data Science, Algorithms and Big Data are associated with the highest Expected (total) Salary. Skills in Machine Learning and Hadoop are associated with the highest Signing Bonuses.

Table: Top paid skills among the skills most in demand
Rank | Skills | Total Salary | Signing Bonus | Annual Bonus | Base Salary
1 | Machine Learning | | TRUE | | TRUE
2 | Data Science | TRUE | | TRUE | TRUE
3 | Algorithms | TRUE | | TRUE |
4 | Hadoop | | TRUE | |
5 | Big Data | TRUE | | TRUE |
6 | Python | | | |
7 | Data Mining | | | |
8 | Optimization | | | |
9 | Analytics | | | |
10 | C++ | | | |

Maps

Open Positions by State

Open Positions by City

Highest-Valued Skills Measured by Mean Compensation

Some of the highest-valued skills are not the most common skills. They include Strategy, Leadership, Management and Data Science, a catch-all. ETL (Extract, Transform and Load) is a critical area of data warehousing.

This part of the analysis looks at the value of skills based on what employers pay rather than the frequency of skills in a job posting. To do this, we compute a mean value for each skill across the database.

For example, the job ‘Principal Lead Data Scientist’ at Akamai is associated with six skills: Hadoop, Data Mining, Machine Learning, Python, Matlab and Ruby. The total compensation for this job is $317,000. To value each skill in this job, we divide total compensation by 6 to get $52,833. We do a similar computation for the skills in each job, then calculate the overall mean of those values across all jobs for each skill. We then plot those values to rank skills in descending order.
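The calculation can be sketched in base R as follows. The first posting reproduces the Akamai example (ID 2, $317,000 across six skills); the second posting (ID 7) is invented purely to show how the mean is taken across jobs. Column names are assumptions for this illustration.

```r
# Long-format data: one row per posting-skill pair
# ID 2 is the Akamai example; ID 7 is a made-up posting for illustration
postings <- data.frame(
  ID       = c(rep(2, 6), rep(7, 2)),
  Skills   = c("Hadoop", "Data Mining", "Machine Learning", "Python",
               "Matlab", "Ruby", "Python", "SQL"),
  Expected = c(rep(317000, 6), rep(100000, 2)),
  stringsAsFactors = FALSE
)

# Number of skills listed for each posting, repeated on every row of it
n_skills <- ave(postings$Expected, postings$ID, FUN = length)

# Each skill receives an equal share of the posting's total compensation
postings$Share <- postings$Expected / n_skills

# Mean share per skill across all postings, highest first
skill_value <- aggregate(Share ~ Skills, data = postings, FUN = mean)
skill_value <- skill_value[order(-skill_value$Share), ]

round(postings$Share[1])   # 52833, the per-skill value for the Akamai job
```

Python appears in both postings, so its value is the mean of its two shares; every other skill here keeps the share from its single posting.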

Relative Value of Skills

Using ANOVA, we can estimate how much a particular skill adds to or subtracts from the mean Expected Salary for Algorithms, the reference level, all other things being equal.

For example, the reference mean compensation for Algorithms is $30,323. Having ETL skills adds $9,283 to that mean; Matlab skills are worth $1,282 less.
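One way to obtain adjustments of this kind is to fit a one-way ANOVA as a linear model with Algorithms set as the reference level of the skill factor. The intercept is then the reference mean, and each remaining coefficient is that skill's difference from it. The sketch below uses a small made-up data set, not the project's data, and it is an assumption that the team's model was specified this way.

```r
# Made-up per-skill compensation shares in dollars (not the project's data)
shares <- data.frame(
  Skills = c("Algorithms", "Algorithms", "ETL", "ETL", "Matlab", "Matlab"),
  Share  = c(30000, 31000, 39000, 40000, 29000, 29500),
  stringsAsFactors = FALSE
)

# Make Algorithms the reference level so other skills are measured against it
shares$Skills <- relevel(factor(shares$Skills), ref = "Algorithms")

# One-way ANOVA fitted as a linear model
fit <- lm(Share ~ Skills, data = shares)

# (Intercept) is the Algorithms mean; SkillsETL and SkillsMatlab are the
# differences between those skills' means and the Algorithms mean
coef(fit)

# ANOVA table testing whether mean compensation differs by skill
anova(fit)
```

In this toy data the Algorithms mean is 30,500, ETL adds 9,000 and Matlab subtracts 1,250, mirroring how the adjustments in the table are read.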

The table below summarizes the skill values relative to the Algorithms baseline.

Table: Expected Salary for Algorithms
Skill Adjustment
Algorithm (reference) 30323
Analytics 748
Architecture -2114
AWS -1342
Big Data -1182
C++ -64
Computer Vision -2706
Data Mining 220
Data Science 3037
Deep Learning -2054
Distributed Systems 2631
ETL 9283
Hadoop -1953
Information Retrieval 1972
Java -1051
Machine Learning -401
Management 418
MapReduce -1317
Matlab -1282
Misc 7587
Optimization -224
Product Management -555
Python 1492
Relational Databases -1556
REST -3414
Ruby -3079
Scala -3665
Scalability -2002
Software Design -839
SQL 7387
Statistics 1038
Strategy 13775
Technical Leadership 2340
User Experience 3296
Windows -454

The below chart displays the salary weight composition for each skill desired by employers.

Conclusion

We were asked to work as a team to answer the question: Which are the most valued data science skills? Our examination of compensation data for data science jobs on the Paysa website first looked at “value” as a function of the frequency of skills advertised and determined that skills such as Machine Learning, Big Data and Algorithms ranked in the top 10. In terms of mean compensation, top skills included expertise in Strategy, ETL, SQL and User Experience. There are many possible ways to value skills; our study suggests that more data and additional refinement of skills into appropriate categories would provide a more confident assessment of the most-valued skills.

Our experience also shows the benefits of working in a collaborative environment, where team members can readily learn from each other and contribute ideas, creating synergy and improving the results. Teams also benefit from strong leadership that guides while giving team members the opportunity to succeed and sometimes fail, but always reach for improvement and professional growth.