The data set used in this project is the Kaggle ML and Data Science Survey 2017. The survey results were stored in two separate data sets: a) multiple-choice items and b) free-response items, each in CSV format. We downloaded the multiple-choice results as a CSV file and placed it in our GitHub repo.
This project will answer the global research question: Which are the most valued data science skills?
Is there a difference between what ‘Learners’ think are the important skills to learn and what employed Data Scientists say are the skills and tools they are using?
Those who are new to data science and still learning may have different opinions about which tools and methods are most important to learn, and which are actually being used, than those who are already employed in the field. Comparing the answers of ‘Learners’ vs. employed Data Scientists may give us some insight into which skills each group values most and whether or not they are in agreement.
A large portion of the data collected in the Kaggle survey was in the form of Likert scales asking respondents to rate the value, importance, or frequency of use of certain skills, tools and methods on a 3- or 4-point scale. To answer our question about which are the ‘most valued’ data science skills, we looked at these scales and analyzed the differences between what ‘Learners’ thought was important to them vs. what employed Data Scientists say they are using in the field.
An oversight in the survey is that it failed to capture what employed respondents thought were the most important job skills. Employed respondents were not asked those questions, so comparisons between employed Data Scientists and ‘Learners’ are a little more difficult. We can only use the data about which tools and methods employed Data Scientists are actually using, and how frequently, to infer the importance of those tools and methods. The survey also failed to capture those who are employed and ALSO students or learners: employed respondents were not asked whether they were also studying data science. Professional development is critical in a field that is growing and changing as rapidly as data science, so asking employed professionals about their further studies could have produced very useful information. Working Data Scientists may have better insight into the direction the field is going than students who are still in the learning phase of their journey.
Out of the 229 variables collected in the survey, about 105 fall into 5 Likert scales, including “Learning Platform Usefulness” (which was asked of both learners and employed Data Scientists), “Job Skill Importance” (which only unemployed learners were asked), and “Work Tools Frequency” and “WorkMethodsFrequency” (which only employed Data Scientists were asked to answer). We narrowed the data down to these scales plus a few basic demographic fields to get a sense of who the respondents are.
We also chose to focus only on respondents located in the US, since international cultural and technological differences may skew results. Different tools and skills may be more or less valued in different countries, so we thought it best to narrow the focus to one country at a time. Further analysis to see whether the findings are consistent across countries would be interesting and worthwhile.
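
The narrowing step described above can be sketched roughly as follows. This is a minimal illustration, not the code used for this report: the column names, the "United States" label, and the tiny in-memory sample are all assumptions standing in for the real multiple-choice CSV.

```python
import csv
import io

# Tiny in-memory stand-in for the multiple-choice CSV (the real file has 229 columns).
# Column names and the "United States" label are assumptions for illustration.
SAMPLE_CSV = """Country,CurrentJobTitleSelect,LearningPlatformUsefulnessCourses,JobSkillImportancePython,WorkToolsFrequencyPython,GenderSelect
United States,Data Scientist,Very useful,,Often,Male
India,Data Analyst,Somewhat useful,,Sometimes,Female
United States,,Very useful,Necessary,,Female
"""

# Assumed prefixes for the four Likert scales named in the text.
LIKERT_PREFIXES = (
    "LearningPlatformUsefulness",
    "JobSkillImportance",
    "WorkToolsFrequency",
    "WorkMethodsFrequency",
)
DEMOGRAPHIC_COLUMNS = ("Country", "CurrentJobTitleSelect")


def select_us_likert_rows(csv_text):
    """Keep US respondents and only the Likert and demographic columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    keep = [
        name for name in reader.fieldnames
        if name in DEMOGRAPHIC_COLUMNS or name.startswith(LIKERT_PREFIXES)
    ]
    return [
        {name: row[name] for name in keep}
        for row in reader
        if row["Country"] == "United States"
    ]


us_rows = select_us_likert_rows(SAMPLE_CSV)
print(len(us_rows), sorted(us_rows[0]))
```

Selecting columns by prefix keeps the sketch short; with the real file the same idea would apply to whatever the actual column names turn out to be.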
First let’s take a look at the demographics of the survey respondents and what type of jobs they hold.
About 30% of our survey respondents call themselves “Data Scientists”, with about another 20% calling themselves either “Scientist/Researcher” or “Software Developer/Software Engineer”. So about 50% of our survey respondents fall into these three categories, with the other 50% in other roles.
| CurrentJobTitleSelect | total | percent |
|---|---|---|
| Data Scientist | 644 | 30.51 |
| Scientist/Researcher | 225 | 10.66 |
| Software Developer/Software Engineer | 212 | 10.04 |
| Data Analyst | 185 | 8.76 |
| Other | 177 | 8.38 |
| Researcher | 138 | 6.54 |
| Machine Learning Engineer | 102 | 4.83 |
| Engineer | 73 | 3.46 |
| Statistician | 71 | 3.36 |
| NA | 71 | 3.36 |
| Business Analyst | 59 | 2.79 |
| Computer Scientist | 47 | 2.23 |
| Predictive Modeler | 41 | 1.94 |
| Programmer | 21 | 0.99 |
| DBA/Database Engineer | 19 | 0.90 |
| Operations Research Practitioner | 16 | 0.76 |
| Data Miner | 10 | 0.47 |
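
As a rough check on the table above, the percent column is each count divided by the total number of rows (2,111 when the counts are summed). The sketch below rebuilds a column of job titles from two of the counts and lumps the rest into a placeholder label; `frequency_table` is a hypothetical helper, not a function from the original analysis.

```python
from collections import Counter


def frequency_table(values):
    """Return (label, count, percent) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (label, n, round(100 * n / total, 2))
        for label, n in counts.most_common()
    ]


# Counts for the first two rows are copied from the table above; the remaining
# 1,242 respondents are lumped into a placeholder label for brevity.
titles = (
    ["Data Scientist"] * 644
    + ["Scientist/Researcher"] * 225
    + ["Everything else"] * 1242
)

table = frequency_table(titles)
for label, n, pct in table:
    print(f"{label:25s} {n:5d} {pct:6.2f}")
```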
The Kaggle survey asked respondents if they were learning data science and their student status. 73% of the respondents who said they were learning data science are enrolled in an academic program.
| StudentStatus | total | percent |
|---|---|---|
| Yes | 113 | 73.38 |
| No | 41 | 26.62 |
About 56% of all respondents who said they are learning data science said they are “focused on learning mostly data science skills”, with the other 44% saying that “data science is a small part of what I’m focused on learning”.
The share of respondents who said they are “focused on learning mostly data science skills” stays at about 56% regardless of whether or not they are enrolled in an academic program.
| StudentStatus | LearningDataScience | total | percent |
|---|---|---|---|
| Yes | Yes, I’m focused on learning mostly data science skills | 63 | 55.75 |
| Yes | Yes, but data science is a small part of what I’m focused on learning | 50 | 44.25 |
| No | Yes, I’m focused on learning mostly data science skills | 23 | 56.10 |
| No | Yes, but data science is a small part of what I’m focused on learning | 18 | 43.90 |
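
The percent column in the table above is computed within each StudentStatus group rather than over all learners. A minimal sketch of that calculation, using the counts from the table and shortened labels chosen here only for readability:

```python
from collections import defaultdict

# Counts copied from the table above; the answer labels are shortened.
counts = {
    ("Yes", "mostly data science"): 63,
    ("Yes", "small part"): 50,
    ("No", "mostly data science"): 23,
    ("No", "small part"): 18,
}


def percent_within_group(counts):
    """Percent of each (group, answer) cell relative to its group total."""
    group_totals = defaultdict(int)
    for (group, _answer), n in counts.items():
        group_totals[group] += n
    return {
        key: round(100 * n / group_totals[key[0]], 2)
        for key, n in counts.items()
    }


within = percent_within_group(counts)
print(within)

# Overall share focused mostly on data science, ignoring student status.
overall = round(100 * (63 + 23) / sum(counts.values()), 2)
print(overall)
```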
‘Learners’ think that the top 3 best ways to learn data science are through Courses, Projects and College with Arxiv and YouTube coming in 4th and 5th respectively.
| | Item | low | neutral | high | mean | sd |
|---|---|---|---|---|---|---|
| 11 | LPU.Courses | 0.000000 | 22.22222 | 77.77778 | 2.777778 | 0.4190790 |
| 12 | LPU.Projects | 1.923077 | 25.00000 | 73.07692 | 2.711539 | 0.4984894 |
| 3 | LPU.College | 1.851852 | 33.33333 | 64.81481 | 2.629630 | 0.5247208 |
| 1 | LPU.Arxiv | 0.000000 | 38.88889 | 61.11111 | 2.611111 | 0.5016313 |
| 17 | LPU.YouTube | 0.000000 | 39.13043 | 60.86957 | 2.608696 | 0.4934352 |
| 14 | LPU.SO | 0.000000 | 43.47826 | 56.52174 | 2.565217 | 0.4993602 |
| 10 | LPU.Documentation | 10.526316 | 36.84211 | 52.63158 | 2.421053 | 0.6924826 |
| 2 | LPU.Blogs | 2.631579 | 47.36842 | 50.00000 | 2.473684 | 0.5568658 |
| 7 | LPU.Kaggle | 0.000000 | 50.64935 | 49.35065 | 2.493507 | 0.5032363 |
| 9 | LPU.Communities | 0.000000 | 55.55556 | 44.44444 | 2.444444 | 0.5270463 |
| 16 | LPU.Tutoring | 6.250000 | 50.00000 | 43.75000 | 2.375000 | 0.6191392 |
| 15 | LPU.Textbook | 7.142857 | 54.76190 | 38.09524 | 2.309524 | 0.6043781 |
| 13 | LPU.Podcasts | 6.666667 | 66.66667 | 26.66667 | 2.200000 | 0.5606119 |
| 4 | LPU.Company | 0.000000 | 75.00000 | 25.00000 | 2.250000 | 0.5000000 |
| 5 | LPU.Conferences | 14.285714 | 64.28571 | 21.42857 | 2.071429 | 0.6157279 |
| 8 | LPU.Newsletters | 0.000000 | 80.00000 | 20.00000 | 2.200000 | 0.4216370 |
| 6 | LPU.Friends | 17.647059 | 64.70588 | 17.64706 | 2.000000 | 0.6123724 |
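
To make the columns of the table above concrete, here is a minimal sketch of how a 3-point item could be summarized. It assumes responses are coded 1 to 3, that the lowest label reads "Not Useful" (a guess; only the two higher labels appear in the text), and that `sd` is the sample standard deviation. The four responses are made up, chosen so that they reproduce the LPU.Company row.

```python
from statistics import mean, stdev

# Assumed numeric coding of the 3-point usefulness scale.
SCORES = {"Not Useful": 1, "Somewhat useful": 2, "Very useful": 3}


def summarize_likert(responses):
    """Percent low/neutral/high plus mean and sample sd of the coded scores."""
    scores = [SCORES[r] for r in responses]
    n = len(scores)
    return {
        "low": 100 * sum(s == 1 for s in scores) / n,
        "neutral": 100 * sum(s == 2 for s in scores) / n,
        "high": 100 * sum(s == 3 for s in scores) / n,
        "mean": mean(scores),
        "sd": stdev(scores),
    }


# Three "Somewhat useful" and one "Very useful": illustrative data only.
company = summarize_likert(["Somewhat useful"] * 3 + ["Very useful"])
print(company)
```

Items with very few respondents can land anywhere in the ranking on the strength of a handful of answers, so the percentages are best read alongside the number of responses behind them.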
Employed Data Scientists agree with unemployed ‘Learners’ that Projects and Courses belong in the top 3, but put College in 5th place (vs. 3rd). They also include Tutoring and SO (Stack Overflow Q&A) in their top 5, with SO coming in 2nd place. YouTube (the learners’ 5th choice) comes in 14th place for employed Data Scientists, and Arxiv (the learners’ 4th choice) is 8th.
Another interesting difference is that Friends rate much higher as a learning resource among employed Data Scientists: about 46.6% say that Friends are “Very useful” and 97.5% say that Friends are either “Somewhat useful” or “Very useful”. ‘Learners’, by contrast, put Friends at the absolute bottom of their list, with only 17.6% saying that they are “Very useful” and 82.4% saying that Friends are either “Somewhat useful” or “Very useful”.
This may indicate a need to create a more robust community for Data Science ‘Learners’, who may feel somewhat isolated in their studies vs. employed Data Scientists who presumably have more established work and social networks that involve Data Science.
| | Item | low | neutral | high | mean | sd |
|---|---|---|---|---|---|---|
| 12 | LPU.Projects | 0.6265664 | 21.30326 | 78.07018 | 2.774436 | 0.4329562 |
| 14 | LPU.SO | 0.7954545 | 35.45455 | 63.75000 | 2.629545 | 0.4994101 |
| 11 | LPU.Courses | 1.3586957 | 35.19022 | 63.45109 | 2.620924 | 0.5127461 |
| 17 | LPU.Tutoring | 3.5928144 | 39.52096 | 56.88623 | 2.532934 | 0.5680704 |
| 3 | LPU.College | 1.5801354 | 41.53499 | 56.88488 | 2.553047 | 0.5286014 |
| 7 | LPU.Kaggle | 0.8860759 | 46.96203 | 52.15190 | 2.512658 | 0.5175910 |
| 15 | LPU.Textbook | 2.3041475 | 47.31183 | 50.38402 | 2.480799 | 0.5442143 |
| 1 | LPU.Arxiv | 1.8918919 | 48.10811 | 50.00000 | 2.481081 | 0.5368976 |
| 16 | LPU.TradeBook | 5.6818182 | 44.31818 | 50.00000 | 2.443182 | 0.6037803 |
| 4 | LPU.Company | 4.6413502 | 47.67932 | 47.67932 | 2.430380 | 0.5825909 |
| 2 | LPU.Blogs | 1.0159652 | 52.24964 | 46.73440 | 2.457184 | 0.5185329 |
| 6 | LPU.Friends | 2.5270758 | 50.90253 | 46.57040 | 2.440433 | 0.5459573 |
| 9 | LPU.Communities | 0.0000000 | 54.36242 | 45.63758 | 2.456376 | 0.4997732 |
| 18 | LPU.YouTube | 2.4038462 | 52.08333 | 45.51282 | 2.431090 | 0.5420324 |
| 10 | LPU.Documentation | 3.0470914 | 51.52355 | 45.42936 | 2.423823 | 0.5531604 |
| 5 | LPU.Conferences | 5.1044084 | 61.48492 | 33.41067 | 2.283063 | 0.5529337 |
| 8 | LPU.Newsletters | 5.3333333 | 66.66667 | 28.00000 | 2.226667 | 0.5327738 |
| 13 | LPU.Podcasts | 11.2840467 | 70.42802 | 18.28794 | 2.070039 | 0.5403243 |
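
The rank comparisons between the two groups can be read directly off the row order of the two Learning Platform Usefulness tables. A small sketch, with the item order copied from those tables (prefix `LPU.` dropped) and `rank_pairs` as a hypothetical helper:

```python
# Row order of the learners' table (highest "high" percentage first).
learner_order = [
    "Courses", "Projects", "College", "Arxiv", "YouTube", "SO",
    "Documentation", "Blogs", "Kaggle", "Communities", "Tutoring",
    "Textbook", "Podcasts", "Company", "Conferences", "Newsletters", "Friends",
]
# Row order of the employed Data Scientists' table.
employed_order = [
    "Projects", "SO", "Courses", "Tutoring", "College", "Kaggle", "Textbook",
    "Arxiv", "TradeBook", "Company", "Blogs", "Friends", "Communities",
    "YouTube", "Documentation", "Conferences", "Newsletters", "Podcasts",
]


def rank_pairs(order_a, order_b):
    """Map each item found in both lists to its (rank in a, rank in b)."""
    rank_a = {item: i for i, item in enumerate(order_a, start=1)}
    rank_b = {item: i for i, item in enumerate(order_b, start=1)}
    return {
        item: (rank_a[item], rank_b[item])
        for item in order_a
        if item in rank_b
    }


ranks = rank_pairs(learner_order, employed_order)
for item in ("College", "SO", "Arxiv", "YouTube", "Friends"):
    print(item, ranks[item])
```

Note that TradeBook appears only in the employed table, so it is left out of the paired ranks.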
‘Learners’ put Python at the top of their list of Job Skills, with 63.6% of respondents saying that it is “Necessary” and 99% saying it is either “Necessary” or “Nice to have”. Only 1% said it was “Unnecessary”. R is surprisingly far down the list: 34.7% say that R is “Necessary”, only a little more than half the share for Python. Many more agree that it is at least “Nice to have”, though, for a total of 95% in favor of learning R, which is just slightly less than the share in favor of Python.
Surprisingly, “Data Visualization” comes in only slightly above R, with 39% saying that it is “Necessary”, but it is actually slightly lower in terms of overall importance, with 8% saying it is “Unnecessary”.
| | Item | low | neutral | high | mean | sd |
|---|---|---|---|---|---|---|
| 5 | JSI.Python | 1.010101 | 35.35354 | 63.636364 | 2.626263 | 0.5068079 |
| 3 | JSI.Stats | 2.941177 | 48.03922 | 49.019608 | 2.460784 | 0.5570710 |
| 1 | JSI.BigData | 4.081633 | 54.08163 | 41.836735 | 2.377551 | 0.5655999 |
| 7 | JSI.SQL | 7.368421 | 52.63158 | 40.000000 | 2.326316 | 0.6091869 |
| 10 | JSI.Visualizations | 8.163265 | 53.06122 | 38.775510 | 2.306122 | 0.6160761 |
| 2 | JSI.Degree | 5.102041 | 60.20408 | 34.693878 | 2.295918 | 0.5599923 |
| 6 | JSI.R | 5.102041 | 60.20408 | 34.693878 | 2.295918 | 0.5599923 |
| 4 | JSI.EnterpriseTools | 17.582418 | 72.52747 | 9.890110 | 1.923077 | 0.5213395 |
| 8 | JSI.KaggleRanking | 35.051546 | 60.82474 | 4.123711 | 1.690722 | 0.5469770 |
| 9 | JSI.MOOC | 37.634409 | 59.13978 | 3.225807 | 1.655914 | 0.5416285 |
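
The Python vs. R comparison among ‘Learners’ comes down to two numbers per skill: the “Necessary” share (the `high` column) and the share rating it at least “Nice to have” (`neutral` plus `high`). A minimal sketch using the values from the table above:

```python
# (low, neutral, high) percentages copied from the Job Skill Importance table.
jsi = {
    "Python": (1.010101, 35.35354, 63.636364),
    "R": (5.102041, 60.20408, 34.693878),
}

# Share saying "Necessary" or "Nice to have" for each skill.
at_least_nice = {
    skill: round(neutral + high, 1)
    for skill, (_low, neutral, high) in jsi.items()
}

# How the "Necessary" share for R compares with the one for Python.
necessary_ratio = jsi["R"][2] / jsi["Python"][2]

print(at_least_nice)
print(round(necessary_ratio, 2))
```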
Not surprisingly, Python is also high on the list of tools that working Data Scientists use, with 75.4% of users saying that they use it either “Often” or “Most of the time” and only Statistica, SQL and Unix edging it out for the top 3 slots. What is surprising is that two of the top three tools used in the field (Statistica and Unix) aren’t even on the list of Job Skills that ‘Learners’ were asked to evaluate.
R comes in somewhat lower than Python, with 63.6% of users saying that they use it either “Often” or “Most of the time”, but the gap between R and Python is smaller in the field than the learners’ importance ratings would suggest.
| | Item | low | neutral | high | mean | sd |
|---|---|---|---|---|---|---|
| 44 | WTF.Statistica | 20.00000 | 0 | 80.00000 | 3.200000 | 0.8366600 |
| 42 | WTF.SQL | 20.51546 | 0 | 79.48454 | 3.264949 | 0.8637334 |
| 48 | WTF.Unix | 20.73171 | 0 | 79.26829 | 3.207317 | 0.8620225 |
| 31 | WTF.Python | 24.58522 | 0 | 75.41478 | 3.220965 | 0.9533485 |
| 18 | WTF.KNIMECommercial | 25.00000 | 0 | 75.00000 | 3.000000 | 0.8164966 |
| 17 | WTF.Jupyter | 32.30975 | 0 | 67.69025 | 2.982644 | 0.9911199 |
| 33 | WTF.R | 36.38941 | 0 | 63.61059 | 2.942344 | 1.0572465 |
| 38 | WTF.SASBase | 36.96682 | 0 | 63.03318 | 2.900474 | 1.0621406 |
| 2 | WTF.AWS | 44.32432 | 0 | 55.67568 | 2.686486 | 1.0642268 |
| 40 | WTF.SASJMP | 46.42857 | 0 | 53.57143 | 2.553571 | 1.1106041 |
| 23 | WTF.Excel | 47.64151 | 0 | 52.35849 | 2.613208 | 0.8823789 |
| 6 | WTF.DataRobot | 47.82609 | 0 | 52.17391 | 2.478261 | 1.2745611 |
| 41 | WTF.Spark | 48.33837 | 0 | 51.66163 | 2.655589 | 1.0250457 |
| 9 | WTF.Hadoop | 49.83607 | 0 | 50.16393 | 2.596721 | 1.0185721 |
| 12 | WTF.IBMSPSSStatistics | 50.66667 | 0 | 49.33333 | 2.453333 | 1.1424645 |
| 14 | WTF.Impala | 50.98039 | 0 | 49.01961 | 2.470588 | 1.0835671 |
| 45 | WTF.Tableau | 51.13350 | 0 | 48.86650 | 2.561713 | 1.0680602 |
| 8 | WTF.GCP | 52.63158 | 0 | 47.36842 | 2.482456 | 1.0064610 |
| 28 | WTF.Oracle | 52.94118 | 0 | 47.05882 | 2.470588 | 0.9919462 |
| 15 | WTF.Java | 53.53160 | 0 | 46.46840 | 2.446097 | 1.0008735 |
| 5 | WTF.Cloudera | 53.84615 | 0 | 46.15385 | 2.538461 | 1.0781434 |
| 7 | WTF.Flume | 54.54545 | 0 | 45.45455 | 2.318182 | 0.9454837 |
| 34 | WTF.RapidMinerCommercial | 54.54545 | 0 | 45.45455 | 2.545454 | 1.2933396 |
| 25 | WTF.MicrosoftSQL | 55.00000 | 0 | 45.00000 | 2.480000 | 1.0198039 |
| 46 | WTF.TensorFlow | 55.06912 | 0 | 44.93088 | 2.460830 | 0.9633397 |
| 47 | WTF.TIBCO | 56.66667 | 0 | 43.33333 | 2.366667 | 1.0333519 |
| 27 | WTF.NoSQL | 57.14286 | 0 | 42.85714 | 2.444015 | 0.8977539 |
| 36 | WTF.Salfrod | 57.14286 | 0 | 42.85714 | 2.142857 | 0.8997354 |
| 21 | WTF.MATLAB | 57.38832 | 0 | 42.61168 | 2.336770 | 1.1095146 |
| 24 | WTF.MicrosoftRServer | 58.33333 | 0 | 41.66667 | 2.404762 | 1.0193200 |
| 37 | WTF.SAPBusinessObjects | 58.33333 | 0 | 41.66667 | 2.416667 | 1.1645002 |
| 11 | WTF.IBMSPSSModeler | 58.82353 | 0 | 41.17647 | 2.215686 | 1.1715584 |
| 32 | WTF.Qlik | 60.60606 | 0 | 39.39394 | 2.303030 | 1.0748502 |
| 4 | WTF.C | 61.39241 | 0 | 38.60759 | 2.316456 | 1.0754909 |
| 10 | WTF.IBMCognos | 61.53846 | 0 | 38.46154 | 2.076923 | 1.0926327 |
| 39 | WTF.SASEnterprise | 62.88660 | 0 | 37.11340 | 2.329897 | 1.0870607 |
| 20 | WTF.Mathematica | 69.56522 | 0 | 30.43478 | 2.144928 | 0.9892862 |
| 13 | WTF.IBMWatson | 72.34043 | 0 | 27.65957 | 2.000000 | 0.9088933 |
| 30 | WTF.Perl | 72.41379 | 0 | 27.58621 | 1.919540 | 0.9550160 |
| 19 | WTF.KNIMEFree | 73.52941 | 0 | 26.47059 | 2.088235 | 0.8300291 |
| 3 | WTF.Angoss | 75.00000 | 0 | 25.00000 | 1.750000 | 0.9574271 |
| 26 | WTF.Minitab | 75.00000 | 0 | 25.00000 | 1.750000 | 0.9158109 |
| 22 | WTF.Azure | 75.86207 | 0 | 24.13793 | 1.965517 | 0.8133805 |
| 1 | WTF.AmazonML | 77.17391 | 0 | 22.82609 | 1.923913 | 0.8923816 |
| 16 | WTF.Julia | 77.50000 | 0 | 22.50000 | 1.850000 | 0.9486833 |
| 43 | WTF.Stan | 78.04878 | 0 | 21.95122 | 1.926829 | 0.9052691 |
| 35 | WTF.RapidMinerFree | 86.95652 | 0 | 13.04348 | 1.760870 | 0.7939992 |
| 29 | WTF.Orange | 88.23529 | 0 | 11.76471 | 1.470588 | 0.7174301 |
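
The frequency tables use a 4-point scale, so there is no middle category: the two lower answers are pooled into `low` and the two higher ones into `high`, which is why `neutral` is always 0. A minimal sketch, assuming the answers are coded 1 to 4 and that `sd` is the sample standard deviation. The five responses are made up, chosen so that they reproduce the WTF.Statistica row.

```python
from statistics import mean, stdev

# Assumed numeric coding of the 4-point frequency scale.
FREQUENCY = {"Rarely": 1, "Sometimes": 2, "Often": 3, "Most of the time": 4}


def summarize_frequency(responses):
    """Pool the two lower and two higher answers; report mean and sample sd."""
    scores = [FREQUENCY[r] for r in responses]
    n = len(scores)
    return {
        "low": 100 * sum(s <= 2 for s in scores) / n,
        "neutral": 0.0,  # an even-numbered scale has no midpoint
        "high": 100 * sum(s >= 3 for s in scores) / n,
        "mean": mean(scores),
        "sd": stdev(scores),
    }


# Illustrative data only: one "Sometimes", two "Often", two "Most of the time".
statistica = summarize_frequency(
    ["Sometimes", "Often", "Often", "Most of the time", "Most of the time"]
)
print(statistica)
```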
The big surprise here is that Data Visualization is at the top of the list, with nobody saying that they use it “Rarely” and only 8.6% saying that they use it just “Sometimes”; 91.4% say they use it “Often” or “Most of the time”. It is by far the most frequently used method or tool for working Data Science professionals, with Statistica as the runner-up at only 80% by comparison. Remember that only 38.8% of ‘Learners’ said that they thought Visualizations were a “Necessary” job skill to learn!
| | Item | low | neutral | high | mean | sd |
|---|---|---|---|---|---|---|
| 7 | WMF.DataVisualization | 8.566276 | 0 | 91.43372 | 3.550045 | 0.6639726 |
| 6 | WMF.Cross.Validation | 24.178404 | 0 | 75.82160 | 3.160798 | 0.8500851 |
| 22 | WMF.PrescriptiveModeling | 32.710280 | 0 | 67.28972 | 2.855140 | 0.8350012 |
| 16 | WMF.LogisticRegression | 33.599202 | 0 | 66.40080 | 2.879362 | 0.8584456 |
| 12 | WMF.GBM | 34.337349 | 0 | 65.66265 | 2.879518 | 0.8848861 |
| 27 | WMF.Simulation | 35.013263 | 0 | 64.98674 | 2.854111 | 0.9006146 |
| 19 | WMF.NLP | 35.469108 | 0 | 64.53089 | 2.839817 | 0.8683888 |
| 30 | WMF.TimeSeriesAnalysis | 36.211699 | 0 | 63.78830 | 2.866295 | 0.8703743 |
| 9 | WMF.EnsembleMethods | 36.546185 | 0 | 63.45382 | 2.847390 | 0.8930782 |
| 15 | WMF.LiftAnalysis | 37.430168 | 0 | 62.56983 | 2.720670 | 0.8279985 |
| 29 | WMF.TextAnalysis | 38.420108 | 0 | 61.57989 | 2.795332 | 0.8882404 |
| 4 | WMF.CNNs | 39.830509 | 0 | 60.16949 | 2.805085 | 0.9340689 |
| 20 | WMF.NeuralNetworks | 40.545809 | 0 | 59.45419 | 2.785575 | 0.8821286 |
| 26 | WMF.Segmentation | 40.633245 | 0 | 59.36675 | 2.751979 | 0.9007434 |
| 25 | WMF.RNNs | 41.509434 | 0 | 58.49057 | 2.735849 | 0.8378153 |
| 23 | WMF.RandomForests | 41.909814 | 0 | 58.09019 | 2.733422 | 0.8695970 |
| 21 | WMF.PCA | 42.045454 | 0 | 57.95455 | 2.724026 | 0.8468828 |
| 8 | WMF.DecisionTrees | 42.063492 | 0 | 57.93651 | 2.708995 | 0.8456972 |
| 28 | WMF.SVMs | 48.051948 | 0 | 51.94805 | 2.597403 | 0.8788567 |
| 1 | WMF.A.B | 50.539957 | 0 | 49.46004 | 2.555076 | 0.8757749 |
| 14 | WMF.KNN | 51.268116 | 0 | 48.73188 | 2.543478 | 0.8293138 |
| 24 | WMF.RecommenderSystems | 51.690821 | 0 | 48.30918 | 2.507246 | 0.8411133 |
| 10 | WMF.EvolutionaryApproaches | 52.702703 | 0 | 47.29730 | 2.540541 | 0.9094510 |
| 5 | WMF.CollaborativeFiltering | 53.846154 | 0 | 46.15385 | 2.440559 | 0.8275399 |
| 3 | WMF.Bayesian | 55.580357 | 0 | 44.41964 | 2.462054 | 0.8713112 |
| 18 | WMF.NaiveBayes | 61.479592 | 0 | 38.52041 | 2.375000 | 0.8338660 |
| 17 | WMF.MLN | 64.814815 | 0 | 35.18519 | 2.351852 | 0.9144007 |
| 11 | WMF.GANs | 65.116279 | 0 | 34.88372 | 2.325581 | 0.8083178 |
| 13 | WMF.HMMs | 66.326531 | 0 | 33.67347 | 2.244898 | 0.8622821 |
| 2 | WMF.AssociationRules | 68.316832 | 0 | 31.68317 | 2.217822 | 0.7411697 |
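
One way to line up the two groups is to set the share of ‘Learners’ calling a skill “Necessary” against the share of employed Data Scientists who use it “Often” or “Most of the time”. The sketch below does this for the four skills that appear on both sides. Pairing a Job Skill item with a Work Tools or Work Methods item of the same name is an assumption of this sketch, and the two questions measure different things (importance vs. frequency), so the gaps are only indicative.

```python
# "high" percentages from the Job Skill Importance table (Learners).
learner_necessary = {
    "Python": 63.636364,
    "R": 34.693878,
    "SQL": 40.000000,
    "Visualizations": 38.775510,
}
# "high" percentages from the Work Tools / Work Methods tables (employed).
employed_frequent = {
    "Python": 75.41478,
    "R": 63.61059,
    "SQL": 79.48454,
    "Visualizations": 91.43372,
}

# Percentage-point gap: frequent use in the field minus "Necessary" among Learners.
gaps = {
    skill: round(employed_frequent[skill] - learner_necessary[skill], 1)
    for skill in learner_necessary
}

for skill, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{skill:15s} {gap:5.1f}")

largest_gap = max(gaps, key=gaps.get)
```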