DATA607 Project 3 Presentation

Gabrielle Bartomeo, Binish Chandy, Zach Dravis, Burcu Kaniskan, Niteen Kumar, Betsy Rosalen

March 25, 2018

Research Question

Which are the most valued data science skills?

While the answer to the question is by definition subjective, the Kaggle Survey provides a good starting point for exploring the views of professionals in the field and what they value.


Importing data

We used survey data from the Kaggle ML and Data Science Survey, 2017.

The survey data included 2 different csv files consisting of:

  1. multiple choice items
  2. free-response items

We chose to focus on the multiple choice data only.


A Day in a Data Scientist’s Life

Demographics Information

Variables: Country, GenderSelect & Age

Demographics Insight:

  1. Number of Respondents - 16716
  2. Country - 50+
  3. Gender - 13610 males (84%), 2778 females (16%).
  4. Median Age - 30


Top 5

Country           count
United States      4034
India              2648
Other               974
Russia              570
United Kingdom      511
Brazil              461

Bottom 5

Country           count
Kenya                58
Romania              58
Belarus              53
Czech Republic       53
Chile                51
Norway               51

Top 5 Respondents

Country           GenderSelect   count
United States     Male            3195
India             Male            2244
United States     Female           839
Other             Male             833
Russia            Male             470
United Kingdom    Male             424
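
Counts like the ones above can be produced with a simple tally. The sketch below is only an illustration: the rows are made-up placeholders rather than survey records, and only the column names Country and GenderSelect come from the survey.

```python
from collections import Counter

# Hypothetical sample rows; the real survey has 16,716 respondents.
rows = [
    {"Country": "United States", "GenderSelect": "Male"},
    {"Country": "United States", "GenderSelect": "Female"},
    {"Country": "India", "GenderSelect": "Male"},
    {"Country": "United States", "GenderSelect": "Male"},
    {"Country": "Russia", "GenderSelect": "Male"},
]

# Count respondents per country, then per (country, gender) pair.
by_country = Counter(r["Country"] for r in rows)
by_country_gender = Counter((r["Country"], r["GenderSelect"]) for r in rows)

# most_common(n) returns the n largest counts in descending order.
top_countries = by_country.most_common(2)
top_pairs = by_country_gender.most_common(1)

print(top_countries)
print(top_pairs)
```

Sorting the same counter in ascending order would give the "Bottom 5" view.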

A Day in a Data Scientist’s Life

Exploratory Data Analysis (EDA) - Demographics

A Day in a Data Scientist’s Life

Let’s take a peek at a day in the life of a Data Scientist and try to figure out what a data scientist does.

A Day in a Data Scientist’s Life

Manipulating data - DS Activities

Key Attributes: TimeGatheringData, TimeModelBuilding, TimeProduction, TimeVisualizing, and TimeFindingInsights.

Exploratory Data Analysis (EDA)

A Day in a Data Scientist’s Life

Manipulating data - Learning Platforms and Learners’ Sentiments

Key Attributes: 18 Learning Platforms (Arxiv, Blogs, College, Company, Conferences, Friends, Kaggle, Newsletters, Communities, Documentation, Courses, Projects, Podcasts, SO, Textbook, TradeBook, Tutoring, YouTube)

Exploratory Data Analysis (EDA)


What do Data Scientists Want to Learn?

Next, we examine what survey takers of various educational backgrounds are excited to learn. Because technology, and by extension data science, is constantly evolving, practitioners must keep learning to stay relevant in their field. Understanding what working professionals want to learn could give us insight into which skills are most valued in the field.

Does a survey taker’s formal education have any relationship to the Machine Learning/Data Science method they are most excited about learning in the next year?

What do Data Scientists Want to Learn?

Variables and their definition

To do the analysis, we concentrate on two columns in the dataset:

  1. FormalEducation: Which level of formal education have you attained?
  2. MLMethodNextYearSelect: Which Machine Learning/Data Science method are you most excited about learning in the next year?

These questions were asked of all participants.

What do Data Scientists Want to Learn?

Exploratory Data Analysis (EDA)

First, we plot the distribution of formal education in the dataset.

The largest group of respondents holds a Master’s degree, followed by those with a Bachelor’s degree and then those with a doctoral degree.

What do Data Scientists Want to Learn?

Now let’s look at the different Machine Learning/Data Science methods in the dataset.

Machine Learning/Data Science
Random Forests
Deep learning
Neural Nets
Text Mining
Genetic & Evolutionary Algorithms
Link Analysis
Rule Induction
Regression
Proprietary Algorithms
I don’t plan on learning a new ML/DS method
Ensemble Methods (e.g. boosting, bagging)
Factor Analysis
Social Network Analysis
Monte Carlo Methods
Time Series Analysis
Other
Bayesian Methods
Survival Analysis
MARS
Anomaly Detection
Cluster Analysis
Decision Trees
Association Rules
Uplift Modeling
Support Vector Machines (SVM)

What do Data Scientists Want to Learn?

Now we can plot the distribution of Machine Learning/Data Science methods by formal education.
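
A cross-tabulation of the two columns is what such a plot is built on. The sketch below shows one way to count methods within each education level; the response pairs are hypothetical, and dropping respondents with a missing answer is an assumption about how gaps were handled.

```python
from collections import Counter, defaultdict

# Hypothetical (FormalEducation, MLMethodNextYearSelect) pairs, not survey data.
responses = [
    ("Master's degree", "Deep learning"),
    ("Master's degree", "Deep learning"),
    ("Master's degree", "Neural Nets"),
    ("Bachelor's degree", "Deep learning"),
    ("Doctoral degree", "Bayesian Methods"),
    ("Bachelor's degree", None),  # respondent skipped the method question
]

# Build one Counter of methods per education level, skipping missing answers.
crosstab = defaultdict(Counter)
for education, method in responses:
    if education is None or method is None:
        continue
    crosstab[education][method] += 1

for education, counts in sorted(crosstab.items()):
    print(education, dict(counts))
```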


Data Science Methods

What are the most frequently used data science (DS) methods by those writing code in DS professions? Do those relate to formal educational attainment?

The Kaggle dataset provides several variables for assessing what the most valuable data science skills may be. In the previous section, we examined which data science methods learners are most excited about learning. In this section, we look at which data science methods are used most frequently and whether that use has any relationship to educational attainment, a potential indicator of whether certain methods require advanced academic training.

Data Science Methods

Variables and their definition

The following variables label questions asking survey respondents how often they use each of these data science methods. Response options were: Rarely, Sometimes, Often, Most of the time

The additional variables used for this analysis will include:

Data Science Methods

Manipulating data

To answer the question of which methods are most popular among code writers, several transformations are needed. First, we filter the dataset down to those classified as code writers: respondents who were employed in some capacity in data science and wrote code as part of their job duties. We also include only participants who endorsed at least one data science method on the question “At work, which data science methods do you use? (Select all that apply)”, stored in the variable WorkMethodsSelect.

Once filtered, the endorsed data science methods were aggregated and plotted by frequency (see Exploratory Data Analysis below). The five most frequently endorsed methods were then each given a frequency score that represents how often respondents who endorsed the method actually use it.

The final transformation grouped respondents by formal education level and identified the most frequently endorsed data science methods for each group. This can help show whether those who use certain methods may benefit from pursuing advanced education, a valuable insight for prospective data science students.
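
The filtering and aggregation steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the report: the rows are hypothetical, the "CodeWriter" field name and its "Yes" value are assumed, and the select-all-that-apply answers are assumed to be stored as one comma-separated string.

```python
from collections import Counter

# Hypothetical rows. "CodeWriter" and the comma-separated list format are
# assumptions made for illustration; WorkMethodsSelect is the survey column.
rows = [
    {"CodeWriter": "Yes", "WorkMethodsSelect": "Data Visualization,Logistic Regression"},
    {"CodeWriter": "Yes", "WorkMethodsSelect": "Data Visualization,Decision Trees"},
    {"CodeWriter": "No", "WorkMethodsSelect": "Data Visualization"},
    {"CodeWriter": "Yes", "WorkMethodsSelect": ""},  # endorsed no method
]


def split_methods(cell):
    """Turn a select-all-that-apply cell into a list of method names."""
    return [m.strip() for m in cell.split(",") if m.strip()]


# Keep code writers who endorsed at least one method.
kept = [r for r in rows
        if r["CodeWriter"] == "Yes" and split_methods(r["WorkMethodsSelect"])]

# One tally per endorsed method across the kept respondents.
endorsements = Counter(m for r in kept for m in split_methods(r["WorkMethodsSelect"]))

for method, freq in endorsements.most_common():
    print(f"{method}: {freq} of {len(kept)} ({freq / len(kept):.0%})")
```

Dividing each count by the number of kept respondents gives the share that endorsed a method, which is how a statement such as "nearly two-thirds endorsed data visualization" can be checked.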

Data Science Methods

Exploratory Data Analysis (EDA)

Following manipulation of the Kaggle dataset, we created plots to address the research questions above. First, here is the frequency with which each data science method was endorsed among a total of 7,773 respondents. Nearly two-thirds of these respondents endorsed the first-place skill, data visualization. Over half endorsed logistic regression, and just under half endorsed cross-validation and decision trees.

Options Freq
Data Visualization 5022
Logistic Regression 4291
Cross-Validation 3868
Decision Trees 3695
Random Forests 3454
Time Series Analysis 3153
Neural Networks 2811
PCA and Dimensionality Reduction 2789
kNN and Other Clustering 2624
Text Analytics 2405
Ensemble Methods 2056
Segmentation 2050
SVMs 1973
Natural Language Processing 1949
A/B Testing 1936
Bayesian Techniques 1913
Naive Bayes 1902
Gradient Boosted Machines 1557
CNNs 1417
Simulation 1398
Recommender Systems 1158
Association Rules 1146
RNNs 891
Prescriptive Modeling 851
Collaborative Filtering 793
Lift Analysis 650
Evolutionary Approaches 436
HMMs 419
Other 391
Markov Logic Networks 255
GANs 244

Data Science Methods

The following plot graphically displays the frequency of endorsements for the data science methods asked about.

Data Science Methods

In this plot we show the “Frequency Score” for the top five most endorsed data science methods. It is important to look beyond endorsement, because the table and plot above only consider whether a respondent uses a method at all. A method being endorsed does not mean that individuals use it frequently; it may be essential but rarely needed. To get a finer-grained picture of how often a given method is used on the job, the Kaggle survey followed up each endorsed method by asking respondents whether they use it Rarely, Sometimes, Often, or Most of the time. We converted these responses to numeric values (Rarely = 1, Sometimes = 2, Often = 3, and Most of the time = 4) so that the categorical responses could be averaged into a score and graphed.

Of the top five data science methods endorsed, data visualization was the skill indicated to be used the most frequently.
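
The numeric coding described above amounts to a lookup followed by an average. A minimal sketch, using hypothetical answers and assuming that respondents who did not endorse a method are left out of its average:

```python
from statistics import mean

# Numeric coding described above.
SCORE = {"Rarely": 1, "Sometimes": 2, "Often": 3, "Most of the time": 4}

# Hypothetical follow-up answers per method (None = method not endorsed).
answers = {
    "Data Visualization": ["Often", "Most of the time", "Often", None],
    "Logistic Regression": ["Sometimes", "Often", None, "Rarely"],
}


def frequency_score(responses):
    """Average the numeric codes over respondents who endorsed the method."""
    coded = [SCORE[r] for r in responses if r is not None]
    return mean(coded) if coded else None


scores = {method: frequency_score(r) for method, r in answers.items()}

for method, score in scores.items():
    print(f"{method}: {score:.2f}")
```

Treating the four labels as equally spaced numbers is a simplification, since the survey responses are ordinal rather than interval data.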

Data Science Methods

The below plots show the frequency of methods endorsed for each formal education level assessed by Kaggle.

We see that in the majority of educational attainment brackets, data visualization remains the most frequently endorsed data science method.

Data Science Methods

The same information is also provided in tabular format:

Selections Freq RelativeFreq Degree
Data Visualization 1236 0.0944232 Bachelor’s Education
Logistic Regression 989 0.0755539 Bachelor’s Education
Decision Trees 847 0.0647059 Bachelor’s Education
Data Visualization 1129 0.0756348 Doctoral Education
Cross-Validation 1046 0.0700744 Doctoral Education
Logistic Regression 1031 0.0690695 Doctoral Education
Neural Networks 24 0.0808081 High School Education
Data Visualization 23 0.0774411 High School Education
Text Analytics 18 0.0606061 High School Education
Data Visualization 2331 0.0835873 Master’s Education
Logistic Regression 2022 0.0725069 Master’s Education
Cross-Validation 1821 0.0652992 Master’s Education
Data Visualization 150 0.0897129 Professional Education
Logistic Regression 121 0.0723684 Professional Education
Decision Trees 119 0.0711722 Professional Education
Data Visualization 137 0.1008837 Some Post Secondary Education
Logistic Regression 97 0.0714286 Some Post Secondary Education
Decision Trees 87 0.0640648 Some Post Secondary Education
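
A table of this shape can be built by counting selections within each degree group and keeping the top three. In the sketch below the selections are hypothetical, and RelativeFreq is assumed to be a method's count divided by the total number of selections in its degree group, which is one plausible reading of the column.

```python
from collections import Counter, defaultdict

# Hypothetical (degree, method) selections, one per endorsed method.
selections = [
    ("Master's", "Data Visualization"), ("Master's", "Data Visualization"),
    ("Master's", "Logistic Regression"), ("Master's", "Cross-Validation"),
    ("Doctoral", "Cross-Validation"), ("Doctoral", "Data Visualization"),
    ("Doctoral", "Cross-Validation"),
]

# Tally selections separately for each degree group.
by_degree = defaultdict(Counter)
for degree, method in selections:
    by_degree[degree][method] += 1


def top_methods(counts, k=3):
    """Return (method, freq, relative_freq) for the k most selected methods."""
    total = sum(counts.values())
    return [(m, n, n / total) for m, n in counts.most_common(k)]


table = {degree: top_methods(counts) for degree, counts in by_degree.items()}

for degree, top in table.items():
    for method, freq, rel in top:
        print(f"{method:22s} {freq:3d} {rel:.4f} {degree}")
```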

Data Science Methods

The research question of which data science skills are the most valued can be interpreted and answered in many ways. One way to explore this deceptively complex question is to analyze which data science methods code writers report using on the job. This analysis did just that, and then examined how frequently respondents who endorsed the top five methods actually use them.

The bottom line of this analysis is that data visualization, logistic regression, cross-validation, decision trees, and random forests are not only widely endorsed across data science code writers but also appear to be used frequently by the individual code writers who endorse them.

The second goal of this analysis was to understand how formal educational attainment relates to the data science methods used on the job. Judging from the plots for each educational level and the table that consolidates them, the methods used by code writers do not appear to differ much by educational level. Data visualization remains the most frequently endorsed method for the majority of educational groups. This has an important implication for students of data science: popular job functions are not performed only by those with advanced degrees. It also speaks to how central data visualization and the other frequently endorsed, commonly used methods are to data science as a whole.

Learners vs. Employed Data Scientists


Is there a difference between what Learners think are the important skills to learn…


…and what employed Data Scientists say are the skills and tools they use?

Learners vs. Employed Data Scientists

Variables and Manipulating Data - Likert Scales


105 variables in 4 Likert scale categories

Plus a few basic demographic fields

Learners vs. Employed Data Scientists

Employed Data Scientists - Demographics


30% go by “Data Scientist”

20% go by “Scientist/Researcher” or “Software Developer/Software Engineer”

CurrentJobTitleSelect total percent
Data Scientist 644 30.51
Scientist/Researcher 225 10.66
Software Developer/Software Engineer 212 10.04
Data Analyst 185 8.76
Other 177 8.38
Researcher 138 6.54
Machine Learning Engineer 102 4.83
Engineer 73 3.46
Statistician 71 3.36
NA 71 3.36
Business Analyst 59 2.79
Computer Scientist 47 2.23
Predictive Modeler 41 1.94
Programmer 21 0.99
DBA/Database Engineer 19 0.90
Operations Research Practitioner 16 0.76
Data Miner 10 0.47

Learners vs. Employed Data Scientists

Learners - Demographics


About 73% are college students

StudentStatus total percent
Yes 113 73.38
No 41 26.62

About 56% are “focused on learning mostly data science skills” regardless of academic status.

StudentStatus LearningDataScience total percent
Yes Yes, I’m focused on learning mostly data science skills 63 55.75
Yes Yes, but data science is a small part of what I’m focused on learning 50 44.25
No Yes, I’m focused on learning mostly data science skills 23 56.10
No Yes, but data science is a small part of what I’m focused on learning 18 43.90
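
The percentages in the two tables above are within-group shares and can be reproduced from the counts. The short focus labels in this sketch are abbreviations introduced here for readability; the counts are copied from the table.

```python
# Counts copied from the table above: (StudentStatus, focus) -> total.
counts = {
    ("Yes", "mostly data science"): 63,
    ("Yes", "small part"): 50,
    ("No", "mostly data science"): 23,
    ("No", "small part"): 18,
}

# Total learners per StudentStatus group.
group_totals = {}
for (status, _), n in counts.items():
    group_totals[status] = group_totals.get(status, 0) + n

# Percent of each focus within its StudentStatus group, rounded to 2 places.
percent = {key: round(100 * n / group_totals[key[0]], 2) for key, n in counts.items()}

# Share of students among all learners.
student_share = round(100 * group_totals["Yes"] / sum(group_totals.values()), 2)

print(group_totals)
print(percent)
print(student_share)
```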

Learners vs. Employed Data Scientists

Learning Platform Usefulness - Learners


Learners - Top 3 ways to learn data science:

  1. Courses
  2. Projects
  3. College

Friends are the least useful learning platform.

Learners vs. Employed Data Scientists

Learning Platform Usefulness - Employed Data Scientists


Projects and Courses belong in the top 3, but College is in 5th place

Much greater importance on Friends

46.5% say Friends are “Very useful”

97.5% say Friends are “Somewhat useful” or “Very useful”

Learners vs. Employed Data Scientists

Job Skills Importance to Learners


63.6% say Python is “Necessary”

39% say Data Visualization is “Necessary”

35% say R is “Necessary”

Learners vs. Employed Data Scientists

Work Tools Frequency


75.4% use Python either “Often” or “Most of the time”

63.6% use R either “Often” or “Most of the time”

Learners vs. Employed Data Scientists

Work Methods Frequency


Data Visualization is at the top of the list.

91% use it “Often” or “Most of the time”

0% use it “Rarely”

Only 38.8% of Learners said Data Visualization was a “Necessary” job skill to learn!


R vs. Python

R and Python are the most frequently used programming languages. But do those who use R recommend R or Python? And do those who use Python recommend R or Python? In other words, do survey takers feel that newcomers should first study the language they themselves have taken up, or does their experience lead them to suggest the one they did not learn?

Thus the following questions were explored:

  1. What is the distribution of the programming languages Kaggle survey takers used in the past year?
  2. What is the distribution of programming language recommendations by the programming languages Kaggle survey takers used in the past year?

R vs. Python

Variables and their definition

There are 2 variables used in this section of the analysis:

  1. LanguageRecommendationSelect: What programming language would you recommend a new data scientist learn first? (Select one option) - Selected Choice
  2. WorkToolsSelect: For work, which data science/analytics tools, technologies, and languages have you used in the past year? (Select all that apply) - Selected Choice
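
One way to relate these two variables is to classify each respondent by the languages they report using and then count recommendations within each class. The sketch below uses hypothetical rows, assumes WorkToolsSelect is a comma-separated string, and introduces the group labels ("R only", "Python only", "Both", "Neither") purely for illustration.

```python
from collections import Counter

# Hypothetical rows; the comma-separated WorkToolsSelect format is an assumption.
rows = [
    {"WorkToolsSelect": "Python,SQL", "LanguageRecommendationSelect": "Python"},
    {"WorkToolsSelect": "R,SQL", "LanguageRecommendationSelect": "R"},
    {"WorkToolsSelect": "R,Python", "LanguageRecommendationSelect": "Python"},
    {"WorkToolsSelect": "R", "LanguageRecommendationSelect": "Python"},
]


def language_group(cell):
    """Classify a respondent as an R user, a Python user, both, or neither."""
    # Exact token matching avoids counting tools whose names merely contain "R".
    tools = {t.strip() for t in cell.split(",")}
    uses_r, uses_py = "R" in tools, "Python" in tools
    if uses_r and uses_py:
        return "Both"
    if uses_r:
        return "R only"
    if uses_py:
        return "Python only"
    return "Neither"


# Count (language used, language recommended) pairs.
pairs = Counter(
    (language_group(r["WorkToolsSelect"]), r["LanguageRecommendationSelect"])
    for r in rows
)

for (used, recommended), n in sorted(pairs.items()):
    print(f"{used:12s} recommends {recommended}: {n}")
```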

R vs. Python

Exploratory Data Analysis (EDA)

## [1] 16716   229

R vs. Python

## [1] 7955  229

R vs. Python

Let’s examine the above graph of LanguageRecommendationSelect

R vs. Python

## [1] 7955    5

R vs. Python


Salary Comparison for Python vs. R

Finally, true to the word “value,” pay has to be considered. We quantify the compensation reported by survey takers who work in R or Python to see which language is associated with higher earnings for a data scientist.

Salary Comparison for Python vs. R

Contributing Variables

Three variables were used:

We also created the variable “id” for this report to identify each individual survey taker, and the variable “work_tools”, derived from WorkToolsSelect, which breaks each respondent’s list of tools into its individual components.
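
Deriving “id” and “work_tools” can be pictured as reshaping the data from one row per respondent to one row per respondent and tool. A minimal sketch with hypothetical rows, assuming the tools are stored as a comma-separated string and that ids are simply assigned in row order:

```python
# Hypothetical respondents; only the WorkToolsSelect column is shown.
rows = [
    {"WorkToolsSelect": "Python,R,SQL"},
    {"WorkToolsSelect": "R"},
    {"WorkToolsSelect": ""},  # no tools reported
]

# Assign a sequential id per respondent and emit one record per tool.
long_rows = []
for idx, row in enumerate(rows, start=1):
    for tool in row["WorkToolsSelect"].split(","):
        tool = tool.strip()
        if tool:
            long_rows.append({"id": idx, "work_tools": tool})

for record in long_rows:
    print(record)
```

Keeping the id on every record makes it possible to join each tool back to that respondent's other answers.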

Salary Comparison for Python vs. R

Exploration and Review of Compensations

         Minimum  1st Quartile       Median         Mean  3rd Quartile        Maximum  Standard Deviation
Python     $0.00    $53,000.00  $100,000.00  $112,826.14   $145,000.00  $2,000,000.00         $122,425.21
R          $0.00    $58,000.00   $87,000.00   $98,177.64   $130,000.00    $550,000.00          $67,487.91
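
The columns of this table are standard summary statistics. The sketch below computes them for hypothetical values; the numbers are placeholders, and the quartile method shown (linear interpolation that includes the minimum and maximum) is an assumption about how the reported quartiles were obtained.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical compensation values by language, not survey data.
compensation = {
    "Python": [10, 20, 30, 40, 50],
    "R": [15, 25, 35, 45, 55],
}


def summarize(values):
    """Return the summary statistics shown in the table for one group."""
    # method="inclusive" interpolates between observed values, min and max included.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values), "q1": q1, "median": median(values),
        "mean": mean(values), "q3": q3, "max": max(values),
        "sd": stdev(values),  # sample standard deviation
    }


summary = {lang: summarize(vals) for lang, vals in compensation.items()}

for lang, stats in summary.items():
    print(lang, stats)
```

With a maximum as extreme as the $2,000,000.00 reported for Python, the median is likely a more robust point of comparison than the mean.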


Conclusion

On the Machine Learning/Data Science methods Data Scientists are most excited about learning in the next year as the most valued Data Science Skills

On Data Science Methods Used on the Job

Learners vs. Employed Data Scientists

On Data Science Activities

On the contentious debate on R vs Python as the most valued Data Science Skills