Kaggle conducted an industry-wide global survey that presents a comprehensive view of the state of data science and machine learning. The survey attempts to understand broadly the profile, work activities, nature of projects undertaken, programming languages used, machine learning methods used at work and machine learning frameworks used by participants from 147 countries and territories. It also provides information about the usage of various products and services, such as cloud computing and hosted notebooks, in the field of data science and machine learning. It also sheds light on the status of training and training methods, public data sets and media sources that the participants depend on to enhance their knowledge in the area.
The survey was live for one week in October and, after data cleaning, finished with responses from 23,859 participants globally.
This survey received 23,859 usable responses from 147 countries and territories. If a country or territory had fewer than 50 respondents, it was placed in a group named “Other” for anonymity. Kaggle excluded respondents who were flagged by its survey system as “Spam”. Most of the respondents were found primarily through Kaggle channels, such as its email list, discussion forums and social media channels. The survey was live from October 22nd to October 29th. Respondents were allowed to complete the survey at any time during that window. The median response time for those who participated in the survey was 15-20 minutes.
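The anonymisation rule described above (countries with fewer than 50 respondents are folded into “Other”) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the label argument and the toy data are hypothetical and are not taken from Kaggle's own processing.

```python
from collections import Counter

MIN_RESPONDENTS = 50  # threshold stated in the survey description


def group_small_countries(countries, min_count=MIN_RESPONDENTS, other_label="Other"):
    """Replace every country with fewer than `min_count` respondents by `other_label`.

    `countries` holds one entry per respondent.
    """
    counts = Counter(countries)
    return [c if counts[c] >= min_count else other_label for c in countries]


# Toy data (not survey results): 60 respondents from one country, 3 from another.
sample = ["India"] * 60 + ["Iceland"] * 3
grouped = group_small_countries(sample)
print(Counter(grouped))
```

Counting first and relabelling afterwards keeps the rule independent of the order in which responses arrive.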
Approach to analysis
Broadly, the participants are grouped into two categories, “Employed” and “Students”. Starting from the section on work activities, results are provided separately for these two broad categories in addition to the overall result.
Wherever applicable, results are further provided by designation, industry and country of the participants. Results by other relevant groupings are also provided as necessary.
For the country-wise results, countries are combined into the following groups:
Latin American including “Argentina”, “Peru”, “Chile”, “Mexico” and “Colombia”
Major European including “Austria”, “Belgium”, “Denmark”, “Finland”, “Netherlands”, “Norway”, “Sweden” and “Switzerland”
Middle East & African including “Egypt”, “Iran, Islamic Republic of…”, “Kenya”, “Morocco”, “Nigeria”, “South Africa” and “Tunisia”
High income Asia including “Singapore”, “Israel”, “South Korea”, “Hong Kong (S.A.R.)” and “Republic of Korea”
Other Asian including “Bangladesh”, “Indonesia”, “Malaysia”, “Pakistan”, “Philippines”, “Thailand” and “Viet Nam”
Other European including “Belarus”, “Czech Republic”, “Greece”, “Hungary”, “Italy”, “Poland”, “Portugal”, “Spain”, “Romania”, “Turkey” and “Ukraine”
Australia & New Zealand
“United Kingdom of Great Britain and Northern Ireland” and “Ireland” combined as “UK”
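A grouping like the one listed above is commonly applied with a lookup table. The sketch below is a minimal illustration and not the author's actual code: the dictionary shows only three of the groups, the member names for “Australia & New Zealand” are assumed, and `to_group` is a hypothetical helper name.

```python
# Subset of the groups, for illustration; the remaining groups follow the same shape.
COUNTRY_GROUPS = {
    "Latin American": ["Argentina", "Peru", "Chile", "Mexico", "Colombia"],
    "Australia & New Zealand": ["Australia", "New Zealand"],  # member labels assumed
    "UK": ["United Kingdom of Great Britain and Northern Ireland", "Ireland"],
}

# Invert to a country -> group lookup.
COUNTRY_TO_GROUP = {
    country: group
    for group, members in COUNTRY_GROUPS.items()
    for country in members
}


def to_group(country):
    """Return the group for a country, or the country itself if it is not grouped."""
    return COUNTRY_TO_GROUP.get(country, country)


print(to_group("Peru"), "|", to_group("Ireland"), "|", to_group("India"))
```

Countries that are reported on their own (for example India or the US) simply pass through unchanged.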
Details of participants are provided by gender, age and country of residence. Click on the respective tabs to view the results for each grouping.
The plot below shows both the number and the percentage of each category, and this convention is followed throughout the report.
The majority of the participants are young, with 60% below the age of 30. This includes 90% of the students and 50% of the employed.
By Age
China and India have the youngest participants: 84% of those from China and 77% of those from India are less than 30 years old. More than half the participants from these countries are less than 24 years old.
The UK, Japan, Major European countries, and Australia & New Zealand have comparatively older participants, with more than half above 30 years of age.
The details of participants by their highest degree held and their undergraduate major are provided. Click on the relevant tabs for details.
By Country
By Educational Degree
Physics/Astronomy undergraduate majors have a higher proportion (14%) among doctoral degree holders compared to the overall level (5%).
Half the Bachelor’s degree holders have a major in Computer Science.
Participant details in terms of their designation, the industry where they work and the annual salaries they receive are provided. Click on the relevant tabs for details.
By Country
By Educational Degree
Students are not included in the remaining part of the study on job profile. It is interesting to note that 5% of those employed state that they are students instead of giving their current designation at work. 90% of these students are less than 30 years of age. It is possible that these are students doing an internship or some research activity in the industry.
Top Designations in Countries
India, Japan and China have proportionally more participants working in the computer/technology industry.
Top Industries for each designation
The majority of the participating Product/Project Managers and Chief Officers are from the Computers/Technology industry.
More than half of the Principal Investigators, Research Scientists and Research Assistants work in the Academics/Education industry.
By industry
By Designation
The median salary for all participants who disclosed their income is USD 40,000-50,000 annually.
Country-wise distribution
More than 75% of all participants from the US, Canada, UK, France, Germany, Major European countries, Japan and Australia & New Zealand have an annual salary greater than $30,000. The salary distribution in these countries is shown below.
The US has the highest salaries, with a median salary in the $90-100,000 range and more than 85% having an annual salary above $50,000.
Australia & New Zealand have a median salary of $70-80,000, followed by Canada and Germany at $60-70,000.
The UK and high-income Asian countries, including Japan, have a median salary in the range $50-60,000.
The lowest median salary in this group is in France and Major European countries, at $40-50,000.
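Because salaries are reported as ranges, the medians quoted above are necessarily ranges too. One way to find such a median range from binned counts is sketched below. It is an assumption about the method, not a description of the author's calculation, and the bin labels and counts are made-up toy values.

```python
def median_bin(bin_counts):
    """Return the label of the bin that contains the median respondent.

    `bin_counts` is a list of (label, count) pairs ordered from the lowest
    to the highest salary range. If the median falls exactly on a boundary
    between two bins, the lower bin is returned.
    """
    total = sum(count for _, count in bin_counts)
    if total == 0:
        raise ValueError("no respondents")
    cumulative = 0
    for label, count in bin_counts:
        cumulative += count
        if 2 * cumulative >= total:  # at least half of respondents reached
            return label


# Toy counts (not survey results), ordered from lowest to highest range.
toy = [("0-10,000", 30), ("10-20,000", 25), ("20-30,000", 45)]
print(median_bin(toy))
```

The same cumulative counts also give statements such as “more than 85% earn above $50,000” by reading off the share below a given bin.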
Salary distribution in comparatively low income Countries
The plot below gives the designation-wise distribution of annual salaries in the high-salary countries mentioned above.
In countries with lower annual salaries, Data Scientists have a comparatively better median salary of $20-30,000 than Software Engineers, Data Engineers and Database Engineers.
Median salaries, though much lower across the various designations, follow the same trend as in the high-salary countries.
Analyzing data to influence product or business decisions is the most widely mentioned work activity of the participants. Building ML models is an activity for 40% of the participants.
One out of 4 participants is engaged in research to advance the state of the art of machine learning.
As expected, more than 70% of Data Scientists and Data/Business/Marketing Analysts consider analyzing and understanding data to be an important part of their work.
Building prototypes for exploring new areas is an important role for 65% of Data Scientists and Principal Investigators.
Building ML services for operational improvement is clearly a more important role for Data Scientists than for any other designation.
As expected, building data infrastructure is the most important role for Data Engineers.
Doing research on ML is a comparatively more important role for Research Scientists and Principal Investigators.
A majority of Chief Officers consider these activities to be an important part of their work.
60% of participants from the US, Canada, UK, high-income European countries and Australia & New Zealand consider analyzing and understanding data to be an important part of their work.
50% of participants from Germany and high-income European countries consider building prototypes for exploring machine learning in new areas to be an important part of their work.
Overall, half the participants use local or hosted development environments like Jupyter Notebook and RStudio for data analysis, followed by basic statistical software like Excel. Cloud-based software and business intelligence software are used proportionally more by employed participants than by students.
Most often used tools in each of the categories
Excel is the clear leader among basic statistical tools for analysis, for both employed participants and students. SAS and SPSS are used by more than half the participants who mentioned using advanced statistical tools. In the business intelligence category Tableau leads, and in local or hosted environments Jupyter Notebook and RStudio dominate. Eclipse and Emacs account for 40% of the other tools category.
70% of Data Scientists use a local or hosted development environment as one of their tools for analysis, the highest among all designations, followed by Research Scientists at almost 60%.
Use of advanced statistical tools is highest among Statisticians.
Use of business intelligence software is comparatively higher among Business Analysts (18%). At the same time, 38% of Business Analysts use basic statistical software as their primary tool.
One out of four participants consider themselves to be definitely Data Scientists, and again one out of four doubt that they are Data Scientists. Half the participants rate themselves as probable Data Scientists. As expected, the proportion of those who consider themselves to be definitely or probably Data Scientists is higher among employed participants than among students.
Only 57% of those who are Data Scientists by designation consider themselves to be definitely Data Scientists.
1 out of 4 Data Analysts and 1 out of 3 Marketing & Business Analysts don’t consider themselves to be Data Scientists.
Among unemployed participants, 21% consider themselves to be definitely Data Scientists.
48% of the participants spend more than half their time at work coding.
Experience
70% of the employed participants have less than 5 years of coding experience.
75% of students have less than 2 years of coding experience.
Coding Exposure of definitely Data scientists
Those who consider themselves to be definitely Data Scientists spend more time coding and have comparatively more coding experience.
Python is the clear leader, with 83% of all participants using it, and is proportionately more popular among students. SQL and R, the next two most popular languages, are used proportionally more by employed participants than by students.
There is an interesting contrast in the use of the other languages between students and employed participants. The popularity of C/C++ and MATLAB among students is proportionally twice that among the employed. Java is more popular among students, while Bash and Visual Basic/VBA are more popular among the employed.
Swift, Perl, Fortran, Kotlin and Clojure are the main languages in the Other category.
Most often used
Python is the most often used programming language at 53%, with R a distant second at 13%.
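The two kinds of figures quoted for languages (share of participants who use a language at all, and share who use it most often) come from different question types and behave differently: multi-select shares can add up to more than 100%, while single-choice shares add up to 100%. The sketch below illustrates the distinction with toy answers; the function names and data are hypothetical, not taken from the survey files.

```python
from collections import Counter


def usage_share(responses):
    """Percent of respondents who selected each option in a multi-select question."""
    n = len(responses)
    if n == 0:
        return {}
    # set() guards against an option being listed twice by one respondent.
    counts = Counter(option for selected in responses for option in set(selected))
    return {option: 100 * c / n for option, c in counts.items()}


def most_often_share(primary):
    """Percent of respondents naming each option in a single-choice question."""
    n = len(primary)
    if n == 0:
        return {}
    return {option: 100 * c / n for option, c in Counter(primary).items()}


# Toy answers from four respondents (not survey data).
languages_used = [["Python", "SQL"], ["Python", "R"], ["R"], ["Python", "SQL", "R"]]
language_used_most = ["Python", "Python", "R", "Python"]

print(usage_share(languages_used))
print(most_often_share(language_used_most))
```

This is why a language can have a high usage share and still a small “most often used” share.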
Most Recommended
Again Python is the clear leader.
SQL is used proportionally more (60%) in the Accounting/Finance, Online Sales & Services, Marketing/CRM, Retail/Sales and Insurance & Risk Assessment industry sectors than in other sectors.
Chinese, Russian and Japanese participants have a clear preference for Python, with almost 90% of participants using it and usage of other languages much lower. C/C++ usage is proportionally higher among Chinese participants, at 40%.
The Python plotting libraries Matplotlib and Seaborn are used by most of the participants for data visualization, with 72% and 42% usage respectively.
The R plotting library ggplot2 follows, with 42% of participants using it. ggplot2 is used proportionally more by employed participants than by students.
The Others category includes Tableau, MATLAB, gnuplot, Highcharts and Excel.
Most often used library for data visualization
Matplotlib is the most often used of all data visualization libraries, with more than half the participants and 65% of the students using it most often.
Proportionally more Data Scientists use the top four libraries than other designations.
Among Data Analysts, Business Analysts and Marketing Analysts there is no large difference in the usage of Matplotlib and ggplot2. Among Marketing Analysts, ggplot2 is used by more participants than Matplotlib.
In the Govt/Public Service, Insurance/Risk Assessment and Hospitality industries there is no large difference between the proportions of participants using Matplotlib and ggplot2.
Russian and Japanese participants have a clear preference for the Python libraries Matplotlib and Seaborn over ggplot2.
60% of participants say that independent projects are more important than academic achievements, and only 7% think that academic achievements are more important. Both students and employed participants share this opinion.
Boxplots are used to show the share of time taken by each of the project activities. The solid dark line within the box is the median: half the participants spend at most that share of time on the activity. The bottom of the box marks the value that 25% of the participants do not exceed (the first quartile) and the top marks the value that 75% of the participants do not exceed (the third quartile).
The time taken by the majority of participants for each of the activities is given below. The range given is for 50%-75% of participants.
Gathering Data - 10-20%
Cleaning Data - 20-30%
Visualizing Data - 10-20%
Model Building - 20-30%
Model Production - 10%
Finding Insights and Communicating - 10-15%
Other Activities - 0-5%
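The three values a boxplot draws for each activity (bottom of the box, median line, top of the box) are the quartiles of the reported time shares. A minimal sketch using Python's standard library is shown below; the toy percentages are invented for illustration, and the “inclusive” quantile method is an assumption, since plotting tools differ slightly in how they interpolate quartiles.

```python
import statistics


def box_stats(values):
    """Return the first quartile, median and third quartile of `values`."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3}


# Toy data: percent of project time spent cleaning data, one value per respondent.
cleaning_share = [10, 15, 20, 20, 25, 30, 30, 40, 50]
print(box_stats(cleaning_share))
```

Read this way, a range such as “20-30%” for an activity corresponds to the span between two of these box values.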
Data cleaning and model building are the most time-consuming of the data science project activities.
At the median level there is no major difference between employed participants and students. At the 75% level in each category, students take less time finding insights and communicating, and more time on model production, than employed participants.
Data cleaning is the most time-consuming activity for the Business Analyst, Data Analyst and Marketing Analyst designations.
Those who consider themselves to be definitely Data Scientists spend on average 5% more time on model production and on finding insights and communicating them to stakeholders than others.
Numerical data is used by 65% of the participants, followed by text data at 52%. Categorical, time series and tabular data are used by proportionally more employed participants than students. Image data is used by proportionally more students than employed participants.
Numerical data is the most often used type of data. Tabular data is the second most often used type.
Understandably, 84% of Statisticians interact with numerical data. 75% of Data Scientists, Data Analysts, Business Analysts and Market Analysts also interact with numerical data.
70% of Market Analysts and Data Journalists interact with categorical data.
80% of Data Journalists interact with text data.
Interaction with time series data is proportionately higher among Statisticians.
Data Scientists’ and Consultants’ interaction with numerical, categorical, text, time series and tabular data is comparatively much more prevalent.
60% of the Data/Business/Market Analysts and Statisticians use numerical and tabular data the most.
Top 5
The overall status of usage is given below. Students are not included in this analysis.
47% of participants mention that their employers are currently using ML methods.
24% are exploring ML methods currently.
Half the participants from all industry types except Manufacturing, Non-profit/Service and Govt./Public Service are currently using ML methods.
Computers/Technology and Online/Internet-based Service industries have the highest proportion of well-established ML methods.
Two out of three participants who are employed have less than 3 years of experience in using ML methods at work.
Two out of three student participants have less than 2 years of experience in ML methods.
With experience in ML methods, the confidence of being a Data Scientist increases, as shown below. Some participants who mention that they don’t know machine learning consider themselves to be definitely Data Scientists. At the same time, the largest proportion of those who say they are definitely not Data Scientists is among those who don’t know machine learning.
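A comparison like the one above (self-assessment broken down by ML experience) is a cross-tabulation with percentages computed within each experience group. The sketch below shows one way to compute it; the category labels and the toy answers are illustrative assumptions, not the survey's actual coding.

```python
from collections import Counter, defaultdict


def row_percentages(pairs):
    """Cross-tabulate (row, column) pairs and return percentages within each row."""
    counts = defaultdict(Counter)
    for row, col in pairs:
        counts[row][col] += 1
    return {
        row: {col: 100 * c / sum(cols.values()) for col, c in cols.items()}
        for row, cols in counts.items()
    }


# Toy answers: (years of ML experience, "do you consider yourself a data scientist?")
answers = [
    ("< 1 year", "Definitely yes"),
    ("< 1 year", "Probably yes"),
    ("< 1 year", "Probably yes"),
    ("< 1 year", "Definitely not"),
    ("3-5 years", "Definitely yes"),
    ("3-5 years", "Definitely yes"),
]
table = row_percentages(answers)
print(table)
```

Percentages are taken within each experience group so that groups of very different sizes can be compared directly.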
As Python is the most popular programming language, the Python machine learning library Scikit-Learn leads in usage at 65%. TensorFlow and Keras, a more user-friendly high-level API that runs on top of it, follow, indicating an increased use of deep learning methods.
The random forest ensemble framework, the gradient boosting framework XGBoost and the training framework caret are used proportionately more by employed participants than by students.
Most often used
Scikit-Learn is the ML framework used most often by 47% of the participants and is far ahead of TensorFlow, used most often by 15% of participants.
A break-up of the ML frameworks used by those employed participants who are currently using ML methods at work follows. Click on the tabs to view details of other groupings.
Among those who currently use ML methods at work, usage of the top 5 machine learning frameworks is proportionally higher than the average, indicating use of multiple ML frameworks, probably in search of better results.
Usage of the top 5 machine learning frameworks is proportionally higher than the average, indicating use of multiple ML frameworks, probably in search of better results.
Data Scientists use all of the top 5 machine learning frameworks more than the overall usage, indicating use of multiple ML frameworks, probably in search of better results. Scikit-Learn is used by 87% of Data Scientists.
TensorFlow usage is proportionally higher among Software Engineers, Research Scientists & Assistants, Chief Officers and Principal Investigators, indicating a growing preference for deep learning.
Proportionally more participants from the Computer/Technology industry use all of the top 5 frameworks, with TensorFlow used by 71% of participants from this sector.
Scikit-Learn and TensorFlow are used by 85% of participants from the Military/Defense sector.
Use of TensorFlow and Keras is proportionally higher in Germany, France, Japan and high-income Asian countries.
CatBoost finds a place in the top 5 ML frameworks in Russia, understandably given its Russian origins.
60% of the participants consider accuracy to be an important metric for evaluating the success of an ML model. Revenue and business goals are considered important by proportionally more employed participants than students.
More than 80% of Data Scientists, Research Scientists, Statisticians and Principal Investigators who are associated with an organization that builds ML models consider accuracy to be an important metric for model success.
Business goals and revenue are important for 91% of Market Analysts, 75% of Chief Officers, 71% of Business Analysts and 68% of Data Scientists.
For service industries like Computers/Technology, Academics and Govt/Public Services, accuracy is important to more participants than revenue/business goals. For industries focused on marketing and sales, revenue/business goals are as important as accuracy.
Almost 90% of the participants felt that fairness and bias in ML algorithms, being able to explain ML model outputs and/or predictions, and reproducibility in data science are all important.
Almost 30% of participants have not done such projects. Less than 10% of participants have over 50% of their total projects involved in exploring unfair bias.
There are no significant differences between employed participants and students in the difficulties faced in such projects.
Exposure to more projects doesn’t affect the difficulties identified significantly.
12% of participants have no exposure to such projects.
For 60% of participants exploring model insights happens in less than 50% of their projects.
Exploring model insights and interpreting model’s prediction happens in all circumstances as evident from the plot below.
There is no significant difference in the circumstances for exploring models as the percentage of projects that involve exploring ML insights and interpreting predictions increases.
Only 10% of the participants are confident of explaining most if not all models.
Another 48% are confident that they can understand and explain the output of many models.
35% of participants consider ML models to be black boxes, but the majority of them feel the models can be explained by experts.
There is no significant difference from the overall pattern among participants who currently use ML methods at work and those who consider themselves to be definitely Data Scientists.
Again, there is no significant difference in opinion based on increased exposure to projects that involve exploring model insights and interpreting predictions.
The choice of method for explaining models seems to be based on simplicity and ease of explanation. The simplest method, plotting predicted vs actual results, is used by the majority of participants. The next easiest method to explain, examining feature importance, comes second.
The pattern is similar for those participants who use ML methods at work and those who are confident of explaining ML models, but the methods are more widely used.
Ensuring proper documentation and readability of code are the most widely used methods to ensure reproducibility of work.
The most common reason that prevents participants from sharing their work is that it is too time-consuming.
Following the trend of Python as the most widely used programming language, Jupyter/IPython is the most widely used integrated development tool among both employed and student participants. PyCharm, another Python-focused IDE, is the fourth most widely used.
RStudio, the second most widely used IDE, is used proportionally more by employed participants than by students.
MATLAB is used proportionally more by students than by the employed, similar to what was observed in programming language usage.
Eclipse, Emacs and NetBeans are the main IDEs in the Other category.
Data Scientists use all of the top 5 IDEs more widely than other designations, with Jupyter/IPython the most widely used, by 87% of Data Scientists.
RStudio is more widely used than Jupyter/IPython among Statisticians, at 83%, and Marketing Analysts, at 58%.
Among Data Analysts and Business Analysts, RStudio is almost as widely used as Jupyter/IPython.
Industry-wise, there are no significant changes from the overall usage pattern.
In Japan, Vim and Visual Studio Code are used by almost as many participants as RStudio and PyCharm.
Visual Studio/Visual Studio Code is among the top 5 IDEs used in Russia, China, Other Asian countries and Canada.
40% of the participants don’t use any hosted notebooks. Kaggle Kernels is used by more participants than any other hosted notebook.
Databricks and IBM Watson are the major notebooks in the Other category.
Usage is similar to the overall pattern, with comparatively more usage among Data Scientists and Data Engineers.
Usage is similar to the overall pattern.
Usage is similar to the overall pattern.
38% of total participants and half the students don’t use any cloud computing services.
Amazon Web Services (AWS) is the most widely used, with 40% of participants using it. Usage among employed participants is 45%.
DigitalOcean, Oracle, Heroku and Dropbox are the main cloud computing services in the Other category.
Cloud computing service usage is highest among Data Scientists, Software Engineers, Data Engineers, Chief Officers and Project Managers, at 75%. (Only 25% or less of the designations mentioned don’t use the services, as shown in the plot below.)
80% of participants from the Online/Internet-based Sales & Service industry use cloud computing services. 70% of participants from the Computer/Technology and Retail/Sales industries also use cloud computing services.
Comparatively wider usage of cloud computing services is seen among US participants, at 71%.
Amazon Web Services usage is proportionally higher among participants from the US and UK.
The cloud computing product AWS Elastic Compute Cloud (EC2), launched by Amazon in 2006, is used by more participants (35%) than any other product.
30% of the participants don’t use any products.
Others include AWS, Google Cloud, Azure, EMR and SageMaker.
Proportionally, usage of AWS Elastic Compute Cloud is highest in the Online Service industry.
Half the participants don’t use any of the machine learning products. The plots below exclude this 49% and also exclude any product used by less than 1% of the participants.
Others include KNIME and Weka.
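The filtering described for the product plots (drop the “no product” group and anything used by less than 1% of participants) can be sketched as below. The option label `"None"`, the function name and the toy counts are assumptions for illustration only. Percentages are computed against all participants before anything is dropped, which matches the way the figures are quoted in the text.

```python
def shares_for_plot(counts, total, none_label="None", min_percent=1.0):
    """Return percent shares, dropping the non-user group and very rare products.

    `counts` maps each answer option to the number of participants choosing it;
    `total` is the number of participants who answered the question.
    """
    shares = {option: 100 * c / total for option, c in counts.items()}
    return {
        option: pct
        for option, pct in shares.items()
        if option != none_label and pct >= min_percent
    }


# Toy counts for 1,000 participants (not survey results).
toy_counts = {"None": 490, "SAS": 80, "Product X": 10, "Product Y": 5}
print(shares_for_plot(toy_counts, total=1000))
```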
Employed vs Students
MySQL, the open source RDBMS from Oracle, is used by 58% of participants, followed by the open source object-relational database management system PostgreSQL with 36% usage. The C-based RDBMS SQLite has 35% usage. All of the above are open source.
Among the proprietary products, Microsoft SQL Server is used by 30% of participants, followed by Oracle Database.
Others include MongoDB, Redshift and Teradata.
Employed vs Students
Among the open source products, PostgreSQL is used proportionally much more by employed participants (39%) than by students (19%).
Database Engineers’ use of Microsoft SQL Server (72%), Oracle Database (57%) and Azure SQL Database (32%) is proportionally much higher than that of other designations.
Chinese participants lead in the usage of MySQL, at 74%.
Russian participants use PostgreSQL more (53%) than other relational database products. PostgreSQL usage is also proportionally higher among participants from Brazil (51%).
55% of the participants don’t use Big Data and Analytics products and are excluded from the plot below for better visualization.
Others include Hadoop, Spark, Cloudera and Hortonworks.
Considering 50% to 75% of participants as a range, the percentage of training received through each of the channels is as below.
Online Courses - 20-40%
Self-taught - 20-45%
At Work - 8-25%
Kaggle Competitions - 0-10%
University - 0-30%
As expected, employed participants receive more training at work and students receive more training at universities.
Participants from France, Japan and Russia receive more training at work than those from other countries.
Participants from China, India, Japan and Russia receive more training through Kaggle competitions than those from other countries.
Participants from Brazil, Canada, India, Middle East & African and Other countries receive more training through online courses.
Participants from Germany, India, Japan, Middle East & African, Other Asian countries and the UK are more self-taught than those from other countries.
Participants from France, Australia & New Zealand and the US receive more training through universities than those from other countries.
Coursera is the online training platform used by more participants than any other platform.
Others include YouTube, mlcourse.ai and Pluralsight.
Coursera is again the most used channel, followed by DataCamp.
Proportionally, participants from Russia use Coursera the most, at 62%.
52% of online participants consider the quality of online learning platforms and MOOCs to be better than the quality of the education provided by traditional brick and mortar institutions. 15% consider the quality to be worse than traditional institutions and 23% feel that the quality is the same. 10% have no opinion or don’t know.
One out of 3 participants have no opinion or don’t know about the quality of bootcamps in comparison with traditional institutions. 39% feel their quality is better than traditional institutions and 8% feel that it is worse. 19% feel it is the same.
More than half the participants from Germany and other Major European countries have no opinion or don’t know about the quality of bootcamps.
By Country
The findings of the survey can be broadly summarised as below.
Globally, those engaged in the field of data science and machine learning are relatively young, with a median age of less than 30 among those employed in this field.
They are also highly qualified, with 60% having a doctoral or Master’s degree.
The majority of them have a Computer Science/Engineering or Mathematics degree as their undergraduate major.
The median salary in high-income countries like the US, UK, Japan and Major European countries is USD 50-60,000.
More than half are engaged in analyzing data to influence product/business decisions.
One out of 4 is engaged in research that advances the state of the art of machine learning.
Local or hosted development environments like Jupyter Notebook and RStudio are the most widely used tools for analyzing data.
One out of 4 consider themselves to be definitely a Data Scientist. Two out of 4 rate themselves as probable Data Scientists and the remaining one out of 4 don’t consider themselves to be Data Scientists.
Python is the most widely used programming language, used by 83%, and is also the language used most often, by 54%. SQL and R lag far behind with 44% and 36% usage. R is a distant second as the most often used language, at 14%.
Continuing the trend, the Python-based libraries Matplotlib and Seaborn are the most widely used data visualization libraries. Usage of the R library ggplot2 is on par with Seaborn. Among the most often used libraries, Matplotlib leads with 55%, followed by ggplot2 at 24%.
47% are currently using ML methods at work and another 24% are exploring the possibility of using them.
Two out of three employed participants have less than 3 years of experience using ML methods at work, again reflecting a young and less experienced workforce in the field.
Among machine learning frameworks, the Python-based Scikit-Learn is used by 65%, followed by TensorFlow and Keras, indicating an increased use of deep learning methods.
62% consider accuracy-based metrics when deciding model success, followed by revenue and business goals at 42%.
35% consider machine learning models to be black boxes, but the majority of them feel the models can be explained by experts. Only 10% of the participants are confident of explaining most if not all models. Another 48% are confident that they can understand and explain the output of many models.
The simplest method, plotting predicted vs actual results, is used by the majority of participants for explaining decisions made by ML models. The next easiest method to explain, examining feature importance, comes second.
Jupyter/IPython is the most widely used integrated development tool among both employed and student participants. RStudio comes next.
40% of the participants don’t use any hosted notebooks. Kaggle Kernels is used by more participants than any other hosted notebook.
38% of total participants and half the students don’t use any cloud computing services. Amazon Web Services is the most widely used, with 40% of participants using it.
The cloud computing product AWS Elastic Compute Cloud (EC2), launched by Amazon in 2006, is used by more participants (35%) than any other product.
Half the participants don’t use any of the machine learning products. SAS leads with 8% usage.
Among the open source relational database products, MySQL from Oracle is used by 58% of participants, followed by PostgreSQL with 36% usage. Among the proprietary products, Microsoft SQL Server is used by 30% of participants, followed by Oracle Database.
55% of the participants don’t use any Big Data and Analytics products. Google BigQuery leads, used by 12%.
Among the channels for training in data science and machine learning, online courses are the most widely used, followed by self-learning. Coursera is the online training platform used by more participants than any other.
About the Author
A certified Data Scientist with an engineering background and 23 years of experience in data analysis and in communicating insights and recommendations across the organization’s leadership structure in banking, manufacturing and projects. Experienced in managing customer insights, quantitative and qualitative research, process analytics and quality metrics projects and initiatives. Can be contacted at rajvikraman@gmail.com