Introduction

Kaggle ML and Data Science Survey, 2017

Five months ago, Kaggle, a website that hosts data science competitions with cash prizes, released its annual user survey. This comprehensive survey asked Kaggle members numerous questions in order to collect metrics on its user base. Our group selected this data set to determine the top hard and soft skills required of a data scientist.

Breakdown:

Nearly 3,000 observations of working data scientists, subset from a larger raw set of nearly 16,000 observations.
More than 200 questions on a variety of topics.

Credit to Amber Thomas for providing the following code used for extracting and summarizing answers to multiple-choice questions.

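library(dplyr)   # %>%, filter, group_by, summarise, mutate, arrange
library(tidyr)   # unnest

# Summarize a single-choice question: count and percent per answer.
# Note: this function reads from the global data frame exp_df.
chooseOne = function(question){
    exp_df %>%
        dplyr::filter((!!rlang::sym(question)) != "") %>% 
        dplyr::group_by(!!rlang::sym(question)) %>% 
        dplyr::summarise(count = dplyr::n()) %>% 
        dplyr::mutate(percent = (count / sum(count)) * 100) %>% 
        dplyr::arrange(desc(count)) 
}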

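# Summarize a multi-select question: split each answer into its selections,
# then report the count and the percent of respondents for each selection.
chooseMultiple = function(question,df){
  df %>% 
    dplyr::filter((!!rlang::sym(question)) != "") %>%
    dplyr::select(dplyr::all_of(question)) %>% 
    dplyr::mutate(totalCount = dplyr::n()) %>% 
    # Split on commas, skipping any comma that sits inside parentheses.
    dplyr::mutate(selections = strsplit(as.character(!!rlang::sym(question)), 
                                 '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl = TRUE)) %>%
    tidyr::unnest(selections) %>% 
    dplyr::group_by(selections) %>% 
    dplyr::summarise(totalCount = max(totalCount),
              count = dplyr::n()) %>% 
    dplyr::mutate(percent = (count / totalCount) * 100) %>% 
    dplyr::arrange(desc(count))
}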

# Same summary as chooseOne, but for a data frame passed in as df.
Academic_exploration = function(question,df){
    df %>%
        dplyr::filter((!!rlang::sym(question)) != "") %>% 
        dplyr::group_by(!!rlang::sym(question)) %>% 
        dplyr::summarise(count = dplyr::n()) %>% 
        dplyr::mutate(percent = (count / sum(count)) * 100) %>% 
        dplyr::arrange(desc(count)) 
}

proportion_function <- function(vec){
    vec/sum(vec)*100
}

# Convert a column to numeric and bin it into labelled, left-closed intervals.
create_breaks <- function(dfcolumn,breaks,labels){
    dfcolumn <- as.numeric(dfcolumn)
    cut(dfcolumn,breaks=breaks,labels=labels,right=FALSE)
}
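The splitting step in chooseMultiple is the least obvious part of these helpers, so here is a minimal base-R sketch of it in isolation. The answer strings are hypothetical examples, not survey responses; only the regular expression is taken from the code above.

```r
# Hypothetical multi-select answers (not survey data). The second answer
# contains a comma inside parentheses that must not trigger a split.
answers <- c(
  "Python,R,SQL",
  "Python,Ensemble Methods (boosting, bagging),SQL"
)

# Same pattern as chooseMultiple(): the first alternative consumes "(...,"
# and discards it via (*SKIP)(*FAIL); the second splits on remaining commas.
pattern <- "\\([^)]+,(*SKIP)(*FAIL)|,\\s*"
selections <- strsplit(answers, pattern, perl = TRUE)

# Count each selection and express it as a percentage of respondents.
counts <- sort(table(unlist(selections)), decreasing = TRUE)
summary_df <- data.frame(
  selection = names(counts),
  count     = as.integer(counts),
  percent   = as.integer(counts) / length(answers) * 100
)
print(summary_df)
```

Because the percentage is taken over respondents rather than over selections, the percent column can sum to more than 100 for multi-select questions.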

Profile of a Data Scientist

Data Scientist Demographics

This section pertains to the demographics within the Data Science community. While the Kaggle dataset seems to be a solid fit for our project goals, demographics describing the Data Science field are available from multiple sources. By exploring the demographics of our Kaggle dataset alongside another demographics report, we can better validate the Kaggle dataset as well as expose sampling biases that may exist in it. We used the Burtch Works study on the Data Science field for comparison. This study can be downloaded here.

CONCLUSION

Dataset Validation
- Comparing the Burtch Works study with our Kaggle dataset, our dataset matches well on gender demographics and has comparable educational achievement levels (the Burtch Works study shows these levels fluctuate yearly)
- Given the differences in tenure rates, the Kaggle dataset may be reflective of a younger demographic within the Data Science community
General demographic observations
- Data Scientists tend to have STEM backgrounds; fewer than ten percent come from a Social Science or Humanities background
- Finance, Tech, and Academia are the 3 largest employers in the Data Science field
- Moderate to significant job growth is occurring across the top fields
- Mean and median US salaries are around 121k and 100k respectively
  - Our sample size here is only 393 respondents
  - Outliers were not addressed
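
The gap between the mean and median salary is what one would expect when a few very high values are left in the sample. The sketch below illustrates the effect with a made-up salary vector (in thousands of USD). The numbers are hypothetical and are not drawn from the survey, and Tukey's 1.5 × IQR rule is only one possible way to flag outliers.

```r
# Hypothetical salaries in thousands of USD (not survey data),
# including one extreme value.
salaries <- c(60, 80, 95, 100, 105, 120, 140, 400)

mean_all   <- mean(salaries)
median_all <- median(salaries)

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
q     <- quantile(salaries, c(0.25, 0.75), names = FALSE)
fence <- 1.5 * (q[2] - q[1])
is_outlier <- salaries < q[1] - fence | salaries > q[2] + fence

# Mean recomputed after dropping the flagged values.
mean_trimmed <- mean(salaries[!is_outlier])

cat("mean:", mean_all,
    " median:", median_all,
    " mean without outliers:", mean_trimmed, "\n")
```

In this toy example a single extreme salary pulls the mean well above the median, while the mean of the remaining values sits close to the median.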

Learning Platform Usefulness

Usefulness of Various Learning Platforms

This section examines the usefulness of various learning platforms.

##  [1] "Select"        "Arxiv"         "Blogs"         "College"      
##  [5] "Company"       "Conferences"   "Friends"       "Kaggle"       
##  [9] "Newsletters"   "Communities"   "Documentation" "Courses"      
## [13] "Projects"      "Podcasts"      "SO"            "Textbook"     
## [17] "TradeBook"     "Tutoring"      "YouTube"

Data scientists largely agree on which platforms are helpful and unhelpful in their profession. Other than Kaggle (which appears unrepresentatively high due to the nature of the data set), Data Scientists find YouTube, blogs and textbooks to be great learning resources. There is also broad agreement that conferences and coursework are beneficial.

Broadly speaking, Data Scientists find benefit in meeting with and learning from more experienced members.

Learning Categories

How Data Scientists Learned Their Core Skills

In this section, we examine how data scientists gained their skill set. We believe there may be valuable insight into what makes a strong data scientist in examining how successful data scientists gained their skills.

Our data shows a great diversity in learning styles. This indicates that not only do data scientists learn from a variety of sources, but the importance of each source varies from one data scientist to the next. This highlights the idea that there is no right or wrong way to learn to become a data scientist. At the same time, as the four major categories account for nearly 100% of education, there are no “secret” learning sources.

It is interesting to note that nearly 75% of data scientists indicate they learned while on the job.
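
As a rough illustration of how figures like these can be derived, the sketch below works on a small long-format table with one row per respondent and learning category. The column names mirror learning_category.csv as described in the SQL section (ds_id, category, percent), but the category labels and all values are hypothetical.

```r
# Hypothetical data: four respondents, each splitting 100% of their
# learning across four categories. Labels and values are assumptions.
learning <- data.frame(
  ds_id    = rep(1:4, each = 4),
  category = rep(c("Self-taught", "Online courses", "Work", "University"),
                 times = 4),
  percent  = c(40, 20, 30, 10,
               50, 50,  0,  0,
               25, 25, 25, 25,
               10,  0, 60, 30)
)

# Average share of learning attributed to each category.
avg_share <- aggregate(percent ~ category, data = learning, FUN = mean)

# Share of respondents who attribute any learning at all to work.
work <- learning[learning$category == "Work", ]
pct_learned_at_work <- mean(work$percent > 0) * 100

print(avg_share)
print(pct_learned_at_work)
```

The first summary shows how much weight each source carries on average; the second counts how many respondents used a source at all, which is the kind of figure quoted for on-the-job learning.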

Common Job Algorithms

Common Algorithms and Methods Used by Data Scientists

INTRODUCTION

In this section, we explore the commonly used algorithms and methods that are presumably required as basic skills in the data science field.

CONCLUSION

It appears that, on average, data scientists use at least 3 algorithms and 7 methods in their work. As the bar graphs above show, the most commonly used algorithms and methods are as follows:

Algorithm
- Regression/Logistic Regression (15.65%)
- Decision Trees (12.96%)
- Random Forests (11.7%)
Methods
- Data Visualization (8%)
- Logistic Regression (6.83%)
- Cross-validation (6.74%)
- Decision Trees (5.93%)
- Random Forests (5.63%)
- Neural Networks (5.28%)
- Time Series Analysis (5.03%)

An average data scientist is able to use the algorithms and methods listed above as basic hard skills to meet standard industry expectations. An exceptional data scientist may be capable of handling 7 to 30 methods and 4 to 15 algorithms.

Furthermore, the most commonly used dataset size appears to fall in the 1 GB to 10 GB range (more than 50% of responses). For reference, the last graph displays the most used methods by size of dataset.
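
The per-respondent averages and the percentage shares above can be computed from a long table with one row per respondent and algorithm. The sketch below assumes the layout of algorithms.csv from the SQL section (ds_id, algorithm) and assumes the reported percentages are shares of all selections; the rows themselves are hypothetical.

```r
# Hypothetical long-format data: one row per (respondent, algorithm) pair.
algorithms <- data.frame(
  ds_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  algorithm = c("Regression/Logistic Regression", "Decision Trees",
                "Random Forests",
                "Regression/Logistic Regression", "Neural Networks",
                "Regression/Logistic Regression", "Decision Trees",
                "Random Forests", "SVMs")
)

# Number of algorithms selected by each respondent, then the average.
per_respondent <- as.vector(table(algorithms$ds_id))
avg_algorithms <- mean(per_respondent)

# Each algorithm's share of all selections, in percent, largest first.
share <- sort(prop.table(table(algorithms$algorithm)), decreasing = TRUE) * 100

print(avg_algorithms)
print(round(share, 2))
```

Shares computed this way sum to 100 across all algorithms, which is why even the most popular choice can have a modest percentage.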

Work Tools Frequency

Frequency of Use for Various Tools by Data Scientists

In this section, we determine which tools are the most frequently used by a data scientist. Data scientists need tools that will perform data analysis, data warehousing, data visualization and machine learning. We suspect that a typical data scientist uses a multitude of tools to satisfy these components.

The survey reports a wide variety of tools, all of which seemingly address the different needs of a data scientist. Among the top 15 tools, there was a healthy mix of analysis, warehousing and visualization tools. This indicates that data scientists choose an assortment of tools when addressing tasks. It also indicates a potential overlap in the functionality and features of the tools available. It will be interesting to see how the landscape of tools evolves as the field of data science matures.

Work Challenges

What Challenges Do Data Scientists Experience?

In this section, we address the challenges faced by Data Scientists, and how their time is typically spent at work. We believe that the time spent performing data science related tasks and their respective challenges will provide useful insights on the skills necessary to succeed as a data scientist.

The data shows that data scientists spend 34% of their time gathering and cleaning data, 25% selecting and building models, and 27% visualizing, discovering, and communicating insights to stakeholders. This is evidence that data scientists must have strong data cleaning and modeling skills, and that they must be able to communicate their findings to stakeholders both visually and verbally.

Accordingly, dirty data is the most prevalent challenge, at 48%. A staggering 39% of data scientists were challenged by issues related to company politics and financial/management support. Interpersonal skills are vital in navigating office politics.

One in four data scientists lack a clear question to answer and a direction to take with the data, one in five reported challenges explaining data science to others, and one in seven reported issues with maintaining reasonable expectations for data science projects. These all speak to communication skills and the ingenuity and creativity needed to frame questions and problems in a way that will garner proper responses from stakeholders.

Conclusion

Skill List

Data Scientists…

HARD SKILLS

can clean, explore, model and visualize data.
can use the most common algorithms (Regression/Logistic Regression, Decision Trees, Random Forests).
can use the most common problem solving methods (Data Visualization, Logistic Regression, Cross-validation).
can use at least one analysis tool, warehousing tool and visualization tool.
can convey complex information not just visually but verbally as well.

SOFT SKILLS

learn from diverse sources.
seek the aid of more experienced data scientists, both in person and on the internet.
continue to learn even after they have secured a job.
have strong interpersonal skills.
possess ingenuity and creativity.

In conclusion, there is no single most valuable skill for a data scientist. While our data found broad agreement in a number of areas, we found very few consensus selections. To put it as simply as possible, a data scientist must have a core skill set based on a strong understanding of fundamental data science concepts, but must also be ready and able to seek out new problem-solving techniques.

SQL

The original Kaggle data was in an untidy form. As part of data preparation, we each created tidy data subsets and saved them to a series of CSV files stored on our GitHub. The following SQL script imports them into a series of tables. We hope that this will aid future research and help to find connections that we may have missed.

DROP TABLE IF EXISTS teaching;
DROP TABLE IF EXISTS algorithm;
DROP TABLE IF EXISTS method;
DROP TABLE IF EXISTS tool;
DROP TABLE IF EXISTS platform;
DROP TABLE IF EXISTS datascience;

CREATE TABLE datascience (
  id INTEGER PRIMARY KEY NOT NULL,
  gender VARCHAR(255) NOT NULL,
  country VARCHAR(255),
  age VARCHAR(255)
  );

LOAD DATA LOCAL INFILE 'data_scientist.csv' 
INTO TABLE datascience
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS(id, gender, @country, @age)
SET
country = nullif(@country,'NA'),
age = nullif(@age,'NA')
;

CREATE TABLE teaching (
  id INTEGER AUTO_INCREMENT PRIMARY KEY NOT NULL,
  ds_id INTEGER NOT NULL,
  category VARCHAR(100) NOT NULL,
  percent INTEGER NOT NULL,
  foreign key(ds_id) references datascience(id)
  );
  
LOAD DATA LOCAL INFILE 'learning_category.csv' 
INTO TABLE teaching
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS(ds_id, category, percent)
;


CREATE TABLE algorithm(
    id INTEGER AUTO_INCREMENT PRIMARY KEY NOT NULL,
    ds_id INTEGER NOT NULL,
    algorithm VARCHAR(255) NOT NULL,
    foreign key(ds_id) references datascience(id)
    );
    
LOAD DATA LOCAL INFILE 'algorithms.csv' 
INTO TABLE algorithm
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS(ds_id, algorithm)
;

CREATE TABLE method(
    id INTEGER AUTO_INCREMENT PRIMARY KEY NOT NULL,
    ds_id INTEGER NOT NULL,
    datasetsize VARCHAR(50) NOT NULL,
    method VARCHAR(255) NOT NULL,
    frequency VARCHAR(50),
    foreign key(ds_id) references datascience(id)
    );
    
LOAD DATA LOCAL INFILE 'methods.csv' 
INTO TABLE method
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS(ds_id, datasetsize, method, @frequency)
SET 
frequency = nullif(@frequency,'NA')
;

CREATE TABLE tool(
    id INTEGER AUTO_INCREMENT PRIMARY KEY NOT NULL,
    ds_id INTEGER NOT NULL,
    tool_name VARCHAR(255) NOT NULL,
    frequency VARCHAR(255) NOT NULL,
    foreign key(ds_id) references datascience(id)
    );
    
LOAD DATA LOCAL INFILE 'tool_use.csv' 
INTO TABLE tool
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS(ds_id, tool_name, frequency)
;

CREATE TABLE platform(
    id INTEGER AUTO_INCREMENT PRIMARY KEY NOT NULL,
    ds_id INTEGER NOT NULL,
    platform_name VARCHAR(255) NOT NULL,
    usefulness VARCHAR(100) NOT NULL,
    foreign key(ds_id) references datascience(id)
    );
    
LOAD DATA LOCAL INFILE 'platform_usefulness.csv'
INTO TABLE platform
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS(ds_id, platform_name, usefulness)
;
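
Once the tables are loaded, each child table can be linked back to a respondent through ds_id. As a sketch of the kind of connection this makes possible, the R snippet below performs the equivalent join in memory with merge(). The two data frames mirror the datascience and tool tables defined above, but every row is hypothetical.

```r
# Hypothetical rows mirroring the datascience and tool tables.
datascience <- data.frame(
  id      = 1:3,
  gender  = c("Female", "Male", "Male"),
  country = c("United States", NA, "India"),
  age     = c("29", "41", NA)
)
tool <- data.frame(
  ds_id     = c(1, 1, 2, 3),
  tool_name = c("R", "SQL", "Python", "SQL"),
  frequency = c("Often", "Most of the time", "Often", "Sometimes")
)

# Equivalent of joining tool to datascience ON tool.ds_id = datascience.id.
joined <- merge(tool, datascience, by.x = "ds_id", by.y = "id")

# Cross-tabulate tool use by a respondent attribute, here gender.
tool_by_gender <- as.data.frame(
  table(tool_name = joined$tool_name, gender = joined$gender)
)
print(tool_by_gender)
```

The same pattern applies to the teaching, algorithm, method and platform tables, since each carries the same ds_id foreign key.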

What are the Most Valued Data Science Skills?

By Meaghan, Albert, Hovig, Justin, Rose and Brian

March 25, 2018