Gabrielle Bartomeo, Binish Chandy, Zach Dravis, Burcu Kaniskan, Niteen Kumar, Betsy Rosalen
March 25, 2018
The survey data included:
Kaggle captured data about time spent in different activities:
| DSActivity | mean_precent |
|---|---|
| TimeGatheringData | 37.75491 |
| TimeModelBuilding | 19.23263 |
| TimeFindingInsights | 14.50524 |
| TimeVisualizing | 13.74509 |
| TimeProduction | 10.23198 |
Respondents rated the usefulness of 18 learning platforms:
| 18 Learning Platforms | |
|---|---|
| Arxiv | Blogs |
| College | Company |
| Conferences | Friends |
| Kaggle | Newsletters |
| Communities | Documentation |
| Courses | Projects |
| Podcasts | SO |
| Textbook | TradeBook |
| Tutoring | YouTube |
Sample rows from the learning platform data, with one row per respondent and platform:
| lid | Country | EmploymentStatus | LPlatform | LP_count | LearningPlatform |
|---|---|---|---|---|---|
| 1 | United States | Not employed, but looking for work | LearningPlatformUsefulnessKaggle | Somewhat useful | Kaggle |
| 3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessBlogs | Very useful | Blogs |
| 3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessCollege | Very useful | College |
| 3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessConferences | Very useful | Conferences |
| 3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessFriends | Very useful | Friends |
| 3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessDocumentation | Very useful | Documentation |
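The rows above look like one wide survey row per respondent reshaped into one row per rated platform. The following is a minimal Python sketch of that reshaping, not the authors' code. It assumes each response is a dict keyed by the survey's column names and that unanswered platforms are empty strings; the `to_long` helper and the sample respondents are hypothetical.

```python
PREFIX = "LearningPlatformUsefulness"

def to_long(responses):
    """Turn one wide dict per respondent into one row per rated platform."""
    rows = []
    for lid, resp in enumerate(responses, start=1):
        for column, rating in resp.items():
            # Keep only usefulness columns that were actually answered.
            if column.startswith(PREFIX) and rating:
                rows.append({
                    "lid": lid,
                    "Country": resp["Country"],
                    "EmploymentStatus": resp["EmploymentStatus"],
                    "LPlatform": column,
                    "LP_count": rating,
                    # Strip the common prefix to get the short platform name.
                    "LearningPlatform": column[len(PREFIX):],
                })
    return rows

# Made-up respondents modelled on the sample rows; the second rated nothing.
sample = [
    {"Country": "United States",
     "EmploymentStatus": "Not employed, but looking for work",
     "LearningPlatformUsefulnessKaggle": "Somewhat useful",
     "LearningPlatformUsefulnessBlogs": ""},
    {"Country": "United States",
     "EmploymentStatus": "Employed full-time",
     "LearningPlatformUsefulnessKaggle": "",
     "LearningPlatformUsefulnessBlogs": ""},
    {"Country": "United States",
     "EmploymentStatus": "Independent contractor, freelancer, or self-employed",
     "LearningPlatformUsefulnessKaggle": "",
     "LearningPlatformUsefulnessBlogs": "Very useful",
     "LearningPlatformUsefulnessCollege": "Very useful"},
]
long_rows = to_long(sample)
```

A respondent who rated no platform contributes no rows, which would explain why an identifier such as 2 can be missing from the reshaped table.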
Understanding what working professionals want to learn could give us insight into what skills are most valued in the field.
These questions were asked to all participants.
Most respondents hold a Master’s degree, followed by Bachelor’s and then doctoral degrees.
Let’s look at the different Machine Learning/Data Science methods.
| Machine Learning/Data Science |
|---|
| Random Forests |
| Deep learning |
| Neural Nets |
| Text Mining |
| Genetic & Evolutionary Algorithms |
| Link Analysis |
| Rule Induction |
| Regression |
| Proprietary Algorithms |
| I don’t plan on learning a new ML/DS method |
| Ensemble Methods (e.g. boosting, bagging) |
| Factor Analysis |
| Social Network Analysis |
| Monte Carlo Methods |
| Time Series Analysis |
| Other |
| Bayesian Methods |
| Survival Analysis |
| MARS |
| Anomaly Detection |
| Cluster Analysis |
| Decision Trees |
| Association Rules |
| Uplift Modeling |
| Support Vector Machines (SVM) |
Distribution of Machine Learning/Data Science methods with formal education.
In the previous section, we examined what data science methods learners are most excited about and working on.
In this section, we’ll look at which data science methods are the most frequently used and if that has any relationship to educational attainment.
This may indicate whether certain methods require advanced academic training.
The variables listed below indicate how often the respondents use each of these data science methods.
One additional variable: Formal Education.
Response options were: Rarely, Sometimes, Often, Most of the time
Filtered the dataset down to only those who were:
Endorsed data science methods were aggregated and plotted for frequency.
The top five were given a score for frequency of use.
Grouped by formal education to identify the most frequently endorsed data science methods for each group.
May help identify whether those writing certain types of code and using certain data analyses could benefit from pursuing advanced education.
Are there rare but essential methods in data science? Respondents were asked whether they use each method Rarely, Sometimes, Often, or Most of the time. We converted these responses to numeric values so that the categorical answers could be averaged and graphed as a score. Data visualization was the skill used most frequently.
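The conversion of categorical responses to a numeric score can be sketched as follows. This is an illustration only: the 1 to 4 mapping is an assumption (the report says only that responses were converted to numeric values), and the `mean_score` helper and the sample answers are made up.

```python
from statistics import mean

# Assumed mapping; the report does not state the actual numeric values used.
SCORES = {"Rarely": 1, "Sometimes": 2, "Often": 3, "Most of the time": 4}

def mean_score(responses):
    """Average the numeric scores, skipping blank or unknown answers."""
    scored = [SCORES[r] for r in responses if r in SCORES]
    return mean(scored) if scored else None

# Made-up answers for two methods, for illustration only.
answers = {
    "Data Visualization": ["Often", "Most of the time", "Sometimes", "Often"],
    "Survival Analysis": ["Rarely", "", "Sometimes"],
}
scores = {method: mean_score(r) for method, r in answers.items()}

# Methods ordered from most to least frequently used.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Treating the four labels as equally spaced is a simplification; a different spacing would change the averages but usually not the ordering of clearly separated methods.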
We see that in the majority of educational attainment brackets, data visualization remains the most frequently endorsed data science method.
| Selections | Freq | RelativeFreq | Degree |
|---|---|---|---|
| Data Visualization | 1236 | 0.0944232 | Bachelor’s Education |
| Logistic Regression | 989 | 0.0755539 | Bachelor’s Education |
| Decision Trees | 847 | 0.0647059 | Bachelor’s Education |
| Data Visualization | 1129 | 0.0756348 | Doctoral Education |
| Cross-Validation | 1046 | 0.0700744 | Doctoral Education |
| Logistic Regression | 1031 | 0.0690695 | Doctoral Education |
| Neural Networks | 24 | 0.0808081 | High School Education |
| Data Visualization | 23 | 0.0774411 | High School Education |
| Text Analytics | 18 | 0.0606061 | High School Education |
| Data Visualization | 2331 | 0.0835873 | Master’s Education |
| Logistic Regression | 2022 | 0.0725069 | Master’s Education |
| Cross-Validation | 1821 | 0.0652992 | Master’s Education |
| Data Visualization | 150 | 0.0897129 | Professional Education |
| Logistic Regression | 121 | 0.0723684 | Professional Education |
| Decision Trees | 119 | 0.0711722 | Professional Education |
| Data Visualization | 137 | 0.1008837 | Some Post Secondary Education |
| Logistic Regression | 97 | 0.0714286 | Some Post Secondary Education |
| Decision Trees | 87 | 0.0640648 | Some Post Secondary Education |
Data visualization, logistic regression, cross-validation, decision trees, and random forests are not only frequently endorsed methods but also essential ones, even if only used in small ways.
These methods appear to be popular across data science code writers as a group, and individual code writers also use them frequently.
This has important implications for students of data science in understanding that certain popular job functions are not only performed by those with advanced degrees.
105 variables in 4 Likert scale categories
Plus a few basic demographic fields
About 31% go by “Data Scientist”
About 21% combined go by “Scientist/Researcher” or “Software Developer/Software Engineer”
| CurrentJobTitleSelect | total | percent |
|---|---|---|
| Data Scientist | 644 | 30.51 |
| Scientist/Researcher | 225 | 10.66 |
| Software Developer/Software Engineer | 212 | 10.04 |
| Data Analyst | 185 | 8.76 |
| Other | 177 | 8.38 |
| Researcher | 138 | 6.54 |
| Machine Learning Engineer | 102 | 4.83 |
| Engineer | 73 | 3.46 |
| Statistician | 71 | 3.36 |
| NA | 71 | 3.36 |
| Business Analyst | 59 | 2.79 |
| Computer Scientist | 47 | 2.23 |
| Predictive Modeler | 41 | 1.94 |
| Programmer | 21 | 0.99 |
| DBA/Database Engineer | 19 | 0.90 |
| Operations Research Practitioner | 16 | 0.76 |
| Data Miner | 10 | 0.47 |
About 73% are college students
| StudentStatus | total | percent |
|---|---|---|
| Yes | 113 | 73.38 |
| No | 41 | 26.62 |
About 56% are “focused on learning mostly data science skills” regardless of student status (55.75% of students and 56.10% of non-students).
| StudentStatus | LearningDataScience | total | percent |
|---|---|---|---|
| Yes | Yes, I’m focused on learning mostly data science skills | 63 | 55.75 |
| Yes | Yes, but data science is a small part of what I’m focused on learning | 50 | 44.25 |
| No | Yes, I’m focused on learning mostly data science skills | 23 | 56.10 |
| No | Yes, but data science is a small part of what I’m focused on learning | 18 | 43.90 |
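The percentages in this table are shares within each StudentStatus group, not shares of all respondents. A short Python sketch using the counts from the table (the shortened answer labels `mostly` and `small part` are abbreviations introduced here for readability):

```python
from collections import defaultdict

# Counts copied from the table; answer labels abbreviated for this sketch.
counts = {
    ("Yes", "mostly"): 63,
    ("Yes", "small part"): 50,
    ("No", "mostly"): 23,
    ("No", "small part"): 18,
}

# Total respondents per StudentStatus value.
group_totals = defaultdict(int)
for (status, _answer), total in counts.items():
    group_totals[status] += total

# Percent within each StudentStatus group, rounded as in the table.
percent = {
    key: round(100 * total / group_totals[key[0]], 2)
    for key, total in counts.items()
}

# Share focused mostly on data science across both groups combined.
focused = counts[("Yes", "mostly")] + counts[("No", "mostly")]
overall_focused = round(100 * focused / sum(counts.values()), 2)
```

Dividing by the group total reproduces the 55.75 and 56.10 figures, while dividing by all 154 respondents gives the combined share of about 56%.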
Learners - Top 3 ways to learn data science:
Friends are the least useful learning platform.
Data Visualization is at the top of the list.
Only 38.8% of Learners said Visualizations were a “Necessary” job skill to learn!
R and Python are the most frequently used programming languages. But do R users recommend R or Python? And do Python users recommend R or Python? In other words, do survey takers feel that newcomers should first study the language they themselves took up, or does their experience lead them to suggest the language they did not learn?
Thus the following questions were explored:
What is the distribution of the following programming languages that Kaggle survey takers used in the past year:
How are programming language recommendations distributed across the following programming languages that Kaggle survey takers used in the past year:
Two variables are used in this section of the analysis:
LanguageRecommendationSelect = What programming language would you recommend a new data scientist learn first? (Select one option) - Selected Choice
WorkToolsSelect = For work, which data science/analytics tools, technologies, and languages have you used in the past year? (Select all that apply) - Selected Choice
## [1] 16716 229
## [1] 7955 229
Let’s examine the above graph of LanguageRecommendationSelect
## [1] 7955 5
Finally, true to the word “value,” pay has to be considered. Quantifying the compensation survey takers receive for their work in R or Python shows which language earns a data scientist more overall.
Three variables were used:
There was also the variable “id” that was created for the purpose of this report, acting as a way to identify each individual survey taker, and the variable “work_tools” which was a derivative of WorkToolsSelect, breaking the lists down into their individual components.
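Breaking a multi-select answer into its components, as described above, might look like the following Python sketch. It assumes WorkToolsSelect is stored as a comma-separated string, which the report does not state; the `explode_tools` helper and the sample responses are hypothetical.

```python
from collections import Counter

def explode_tools(responses, sep=","):
    """Return one row per (respondent, tool), tagged with a generated id."""
    rows = []
    for id_, resp in enumerate(responses, start=1):
        for tool in resp["WorkToolsSelect"].split(sep):
            tool = tool.strip()
            if tool:  # skip empty pieces from blank answers
                rows.append({
                    "id": id_,
                    "work_tools": tool,
                    "LanguageRecommendationSelect":
                        resp["LanguageRecommendationSelect"],
                })
    return rows

# Made-up responses, for illustration only.
sample = [
    {"WorkToolsSelect": "Python,R,SQL", "LanguageRecommendationSelect": "Python"},
    {"WorkToolsSelect": "R", "LanguageRecommendationSelect": "R"},
    {"WorkToolsSelect": "Python, Jupyter notebooks",
     "LanguageRecommendationSelect": "Python"},
]
rows = explode_tools(sample)

# Count recommendations among R users and among Python users.
recommended_by_users = Counter(
    (row["work_tools"], row["LanguageRecommendationSelect"])
    for row in rows
    if row["work_tools"] in ("R", "Python")
)
```

A respondent who uses both languages appears once under each, so the two user groups overlap and their counts should not be added together.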
| Language | Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum | Standard Deviation |
|---|---|---|---|---|---|---|---|
| Python | $0.00 | $53,000.00 | $100,000.00 | $112,826.14 | $145,000.00 | $2,000,000.00 | $122,425.21 |
| R | $0.00 | $58,000.00 | $87,000.00 | $98,177.64 | $130,000.00 | $550,000.00 | $67,487.91 |
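The summary statistics in the table can be computed per language along these lines. This is a sketch with made-up salaries, not the survey data; the `summarize` helper is hypothetical, and the quartile method used in the report is not stated, so the `inclusive` method here is an assumption.

```python
from statistics import mean, quantiles, stdev

def summarize(values):
    """Minimum, quartiles, mean, maximum and sample standard deviation."""
    # "inclusive" is an assumed choice; other quartile definitions exist.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
        "sd": stdev(values),
    }

# Made-up compensation values in dollars, for illustration only.
salaries = {
    "Python": [0, 50_000, 100_000, 150_000, 200_000],
    "R": [20_000, 60_000, 80_000, 100_000, 140_000],
}
summary = {language: summarize(v) for language, v in salaries.items()}
```

A single very large value raises the mean and standard deviation far more than the median, which is worth keeping in mind when comparing the two rows of the table.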
On which Machine Learning/Data Science methods Data Scientists are most excited about learning in the next year
On Data Science Methods Used on the Job
Learners vs. Employed Data Scientists
On Data Science Activities
On the contentious debate on R vs Python as the most valued Data Science Skills