1 Importing data

The data set used in this project is the Kaggle ML and Data Science Survey 2017. The survey results were stored in two separate data sets: a) multiple-choice items and b) free-response items, each in CSV format. We downloaded the multiple-choice results as a CSV file and placed it in our GitHub repo.
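
As a rough illustration of the import step, the sketch below reads survey rows with Python's standard `csv` module. The helper name `read_survey`, the `Country` column and the sample rows are assumptions made for illustration; only `CurrentJobTitleSelect` and `StudentStatus` appear in the tables of this report.

```python
import csv
import io

def read_survey(handle):
    """Read survey rows from an open text file into a list of dicts.

    Empty cells become None so unanswered questions are easy to skip later.
    """
    reader = csv.DictReader(handle)
    return [
        {column: (value if value != "" else None) for column, value in row.items()}
        for row in reader
    ]

# Tiny in-memory stand-in for the real file (made-up rows, assumed layout).
sample = io.StringIO(
    "Country,CurrentJobTitleSelect,StudentStatus\n"
    "United States,Data Scientist,\n"
    "United States,,Yes\n"
)
rows = read_survey(sample)
print(len(rows), rows[1]["StudentStatus"])

# With the downloaded file, the call would look something like this
# (file name and encoding are assumptions, not taken from the report):
# with open("multipleChoiceResponses.csv", newline="", encoding="latin-1") as f:
#     rows = read_survey(f)
```

Reading every cell as text and mapping blanks to None keeps the later counting steps simple, since most respondents skipped the questions that did not apply to them.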

2 Research Question

This project will answer the global research question: Which are the most valued data science skills?

3 ‘Learners’ vs. Employed Data Scientists

3.1 Question

Is there a difference between what ‘Learners’ think are the important skills to learn and what employed Data Scientists say are the skills and tools they are using?

Those who are new to data science and still learning may have different opinions about which tools and methods are most important than those who are already employed in the field. Comparing the answers of ‘Learners’ vs. employed Data Scientists may give us some insight into which skills each group values most and whether or not they are in agreement.

3.2 Variables and their definition

A large portion of the data collected in the Kaggle survey was in the form of Likert scales asking respondents to rate the value, importance, or frequency of use of certain skills, tools and methods on a 3- or 4-point scale. To answer our question about which are the ‘most valued’ data science skills, we looked at these scales and analyzed the differences between what ‘Learners’ think is important and what employed data scientists say they are using in the field.

One limitation of the survey is that it did not capture what employed respondents think are the most important job skills. Those questions were asked only of learners, which makes a direct comparison between employed data scientists and learners more difficult. We can only use the data on which tools and methods employed data scientists actually use, and how often, to infer the importance of those tools and methods. The survey also did not identify respondents who are employed and also students or learners, because employed respondents were not asked whether they were studying data science. Professional development is critical in a field that is growing and changing as rapidly as data science, so asking employed professionals about their further studies could have provided very useful information. Working Data Scientists may also have better insight into where the field is heading than students who are still in the learning phase of their journey.

3.3 Manipulating data

Out of the 229 variables collected in the survey, about 105 fall into 5 Likert scales, including “Learning Platform Usefulness” (asked of both learners and employed data scientists), “Job Skill Importance” (asked only of unemployed learners), and “Work Tools Frequency” and “Work Methods Frequency” (asked only of employed data scientists). We narrowed the data down to these scales plus a few basic demographic fields to get a sense of who the respondents are.

We also chose to focus only on respondents located in the US, since international cultural and technological differences may skew results. Different tools and skills may be more or less valued in different countries, so we thought it best to narrow the focus to one country at a time. Further analysis to see whether the findings are consistent across countries would be interesting and worthwhile.
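
The narrowing described above could be sketched as follows. The column-name prefixes, the `Country` column, the value "United States" and the helper name `narrow` are assumptions based on the scale names in the text, not confirmed names from the data set.

```python
# Assumed column-name prefixes for the Likert scales named in the text.
LIKERT_PREFIXES = (
    "LearningPlatformUsefulness",
    "JobSkillImportance",
    "WorkToolsFrequency",
    "WorkMethodsFrequency",
)
# Assumed demographic fields to keep alongside the scales.
DEMOGRAPHIC_COLUMNS = {
    "Country",
    "CurrentJobTitleSelect",
    "StudentStatus",
    "LearningDataScience",
}

def narrow(rows, country="United States"):
    """Keep respondents from one country and only the columns of interest."""
    kept = []
    for row in rows:
        if row.get("Country") != country:
            continue
        kept.append({
            column: value
            for column, value in row.items()
            if column in DEMOGRAPHIC_COLUMNS or column.startswith(LIKERT_PREFIXES)
        })
    return kept

# Made-up rows for illustration only.
rows = [
    {"Country": "United States", "Age": "34", "JobSkillImportancePython": "Necessary"},
    {"Country": "Canada", "Age": "28", "JobSkillImportancePython": "Nice to have"},
]
us_rows = narrow(rows)
print(us_rows)
```

Passing a different `country` value would repeat the same narrowing for another country, which is one way to approach the cross-country follow-up suggested above.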

3.4 Exploratory Data Analysis (EDA)

First let’s take a look at the demographics of the survey respondents and what type of jobs they hold.

3.4.1 Employed Data Scientists

About 30% of our survey respondents call themselves “Data Scientists”, with about another 20% calling themselves either “Scientist/Researcher” or “Software Developer/Software Engineer”. So about 50% of our survey respondents fall into these three categories, with the other 50% in other roles.

CurrentJobTitleSelect total percent
Data Scientist 644 30.51
Scientist/Researcher 225 10.66
Software Developer/Software Engineer 212 10.04
Data Analyst 185 8.76
Other 177 8.38
Researcher 138 6.54
Machine Learning Engineer 102 4.83
Engineer 73 3.46
Statistician 71 3.36
NA 71 3.36
Business Analyst 59 2.79
Computer Scientist 47 2.23
Predictive Modeler 41 1.94
Programmer 21 0.99
DBA/Database Engineer 19 0.90
Operations Research Practitioner 16 0.76
Data Miner 10 0.47
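
The total and percent columns can be reproduced by counting labels. The sketch below is one way to do it in plain Python; the helper name `frequency_table` is illustrative, and the counts are copied from the table above and expanded back into one label per respondent.

```python
from collections import Counter

def frequency_table(values):
    """Return (label, total, percent) rows sorted by descending count."""
    counts = Counter(values)
    n = sum(counts.values())
    return [
        (label, total, round(100 * total / n, 2))
        for label, total in counts.most_common()
    ]

# Counts copied from the job-title table above.
job_title_counts = {
    "Data Scientist": 644, "Scientist/Researcher": 225,
    "Software Developer/Software Engineer": 212, "Data Analyst": 185,
    "Other": 177, "Researcher": 138, "Machine Learning Engineer": 102,
    "Engineer": 73, "Statistician": 71, "NA": 71, "Business Analyst": 59,
    "Computer Scientist": 47, "Predictive Modeler": 41, "Programmer": 21,
    "DBA/Database Engineer": 19, "Operations Research Practitioner": 16,
    "Data Miner": 10,
}
# One label per respondent, as the raw column would contain.
titles = [title for title, total in job_title_counts.items() for _ in range(total)]

table = frequency_table(titles)
top_three_share = round(sum(percent for _, _, percent in table[:3]), 2)
print(table[0], top_three_share)
```

The three largest categories add up to about 51% of respondents, which matches the "about 50%" figure quoted above.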

3.4.2 ‘Learners’

The Kaggle survey asked respondents if they were learning data science and their student status. 73% of the respondents who said they were learning data science are enrolled in an academic program.

StudentStatus total percent
Yes 113 73.38
No 41 26.62

About 56% of all respondents who said they are learning data science said they are “focused on learning mostly data science skills”, with the other 44% saying that “data science is a small part of what I’m focused on learning”.

The share of respondents who said they are “focused on learning mostly data science skills” stays at about 56% regardless of whether or not they are enrolled in an academic program.

StudentStatus LearningDataScience total percent
Yes Yes, I’m focused on learning mostly data science skills 63 55.75
Yes Yes, but data science is a small part of what I’m focused on learning 50 44.25
No Yes, I’m focused on learning mostly data science skills 23 56.10
No Yes, but data science is a small part of what I’m focused on learning 18 43.90
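
The percentages in this table are computed within each StudentStatus group, not over all learners. A minimal sketch of that calculation, using the counts from the table (the helper name `percent_within_group` is illustrative):

```python
from collections import defaultdict

def percent_within_group(counts):
    """counts maps (group, answer) to a respondent count.

    Returns the percent of each answer within its own group.
    """
    group_totals = defaultdict(int)
    for (group, _answer), total in counts.items():
        group_totals[group] += total
    return {
        key: round(100 * total / group_totals[key[0]], 2)
        for key, total in counts.items()
    }

FOCUSED = "Yes, I'm focused on learning mostly data science skills"
SMALL_PART = "Yes, but data science is a small part of what I'm focused on learning"

# Counts copied from the table above, keyed by (StudentStatus, LearningDataScience).
counts = {
    ("Yes", FOCUSED): 63,
    ("Yes", SMALL_PART): 50,
    ("No", FOCUSED): 23,
    ("No", SMALL_PART): 18,
}
within = percent_within_group(counts)

# Share of all learners who are mostly focused on data science.
overall_focused = round(100 * (63 + 23) / sum(counts.values()), 2)
print(within[("Yes", FOCUSED)], within[("No", FOCUSED)], overall_focused)
```

The within-group shares (55.75% and 56.10%) sit very close to the overall share (55.84%), which is why enrollment status appears to make little difference here.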

3.4.3 Learning Platform Usefulness - ‘Learners’

‘Learners’ think that the top 3 ways to learn data science are Courses, Projects and College, with Arxiv and YouTube coming in 4th and 5th respectively.

Item low neutral high mean sd
11 LPU.Courses 0.000000 22.22222 77.77778 2.777778 0.4190790
12 LPU.Projects 1.923077 25.00000 73.07692 2.711539 0.4984894
3 LPU.College 1.851852 33.33333 64.81481 2.629630 0.5247208
1 LPU.Arxiv 0.000000 38.88889 61.11111 2.611111 0.5016313
17 LPU.YouTube 0.000000 39.13043 60.86957 2.608696 0.4934352
14 LPU.SO 0.000000 43.47826 56.52174 2.565217 0.4993602
10 LPU.Documentation 10.526316 36.84211 52.63158 2.421053 0.6924826
2 LPU.Blogs 2.631579 47.36842 50.00000 2.473684 0.5568658
7 LPU.Kaggle 0.000000 50.64935 49.35065 2.493507 0.5032363
9 LPU.Communities 0.000000 55.55556 44.44444 2.444444 0.5270463
16 LPU.Tutoring 6.250000 50.00000 43.75000 2.375000 0.6191392
15 LPU.Textbook 7.142857 54.76190 38.09524 2.309524 0.6043781
13 LPU.Podcasts 6.666667 66.66667 26.66667 2.200000 0.5606119
4 LPU.Company 0.000000 75.00000 25.00000 2.250000 0.5000000
5 LPU.Conferences 14.285714 64.28571 21.42857 2.071429 0.6157279
8 LPU.Newsletters 0.000000 80.00000 20.00000 2.200000 0.4216370
6 LPU.Friends 17.647059 64.70588 17.64706 2.000000 0.6123724
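
The low, neutral, high, mean and sd columns appear consistent with the following reading: answers are scored 1 to k from lowest to highest option, "low" and "high" are the shares below and above the midpoint of the scale, and "neutral" is the share exactly at the midpoint (always 0 on a 4-point scale, as in the work-frequency tables later in this section). The sketch below follows that reading. The function name, the lowest answer label "Not Useful" and the response counts are assumptions; the counts were chosen so that the output matches the LPU.Courses row above and the WTF.Statistica row later on, since the actual respondent counts are not given in the text.

```python
from statistics import mean, stdev

def likert_summary(responses, levels):
    """Summarise one Likert item as low / neutral / high percent, mean and sd.

    `levels` lists the answer options from lowest to highest. Answers are
    scored 1..len(levels); None (unanswered) is ignored.
    """
    scores = [levels.index(r) + 1 for r in responses if r is not None]
    centre = (len(levels) + 1) / 2  # 2 on a 3-point scale, 2.5 on a 4-point scale
    n = len(scores)
    return {
        "low": 100 * sum(s < centre for s in scores) / n,
        "neutral": 100 * sum(s == centre for s in scores) / n,
        "high": 100 * sum(s > centre for s in scores) / n,
        "mean": mean(scores),
        "sd": stdev(scores),  # sample standard deviation (n - 1 denominator)
    }

# 3-point usefulness scale; the lowest label is an assumption.
usefulness = ["Not Useful", "Somewhat useful", "Very useful"]
courses = likert_summary(
    ["Somewhat useful"] * 14 + ["Very useful"] * 49 + [None] * 3,
    usefulness,
)

# 4-point frequency scale: there is no middle option, so neutral is 0.
frequency = ["Rarely", "Sometimes", "Often", "Most of the time"]
statistica = likert_summary(
    ["Sometimes", "Often", "Often", "Most of the time", "Most of the time"],
    frequency,
)
print(courses)
print(statistica)
```

Because unanswered items are dropped before scoring, each row of a table can rest on a different number of respondents, so rows with very few answers should be read with caution.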

3.4.4 Learning Platform Usefulness - Employed Data Scientists

Employed Data Scientists agree with unemployed ‘Learners’ that Projects and Courses belong in the top 3, but put College in 5th place (vs. 3rd for ‘Learners’). They also include Tutoring and SO (Stack Overflow Q&A) in their top 5, with SO coming in 2nd place. YouTube (the learners’ 5th choice) comes in 14th place for employed Data Scientists, and Arxiv (the learners’ 4th choice) is 8th.

Another interesting difference is that Friends matter much more to employed Data Scientists as a way of learning data science: about 46.6% say that Friends are “Very useful” and 97.5% say they are either “Somewhat useful” or “Very useful”. ‘Learners’, by contrast, put Friends at the absolute bottom of their list, with only 17.6% saying that they are “Very useful” and 82.4% saying that they are either “Somewhat useful” or “Very useful”.

This may indicate a need to create a more robust community for Data Science ‘Learners’, who may feel somewhat isolated in their studies vs. employed Data Scientists who presumably have more established work and social networks that involve Data Science.

Item low neutral high mean sd
12 LPU.Projects 0.6265664 21.30326 78.07018 2.774436 0.4329562
14 LPU.SO 0.7954545 35.45455 63.75000 2.629545 0.4994101
11 LPU.Courses 1.3586957 35.19022 63.45109 2.620924 0.5127461
17 LPU.Tutoring 3.5928144 39.52096 56.88623 2.532934 0.5680704
3 LPU.College 1.5801354 41.53499 56.88488 2.553047 0.5286014
7 LPU.Kaggle 0.8860759 46.96203 52.15190 2.512658 0.5175910
15 LPU.Textbook 2.3041475 47.31183 50.38402 2.480799 0.5442143
1 LPU.Arxiv 1.8918919 48.10811 50.00000 2.481081 0.5368976
16 LPU.TradeBook 5.6818182 44.31818 50.00000 2.443182 0.6037803
4 LPU.Company 4.6413502 47.67932 47.67932 2.430380 0.5825909
2 LPU.Blogs 1.0159652 52.24964 46.73440 2.457184 0.5185329
6 LPU.Friends 2.5270758 50.90253 46.57040 2.440433 0.5459573
9 LPU.Communities 0.0000000 54.36242 45.63758 2.456376 0.4997732
18 LPU.YouTube 2.4038462 52.08333 45.51282 2.431090 0.5420324
10 LPU.Documentation 3.0470914 51.52355 45.42936 2.423823 0.5531604
5 LPU.Conferences 5.1044084 61.48492 33.41067 2.283063 0.5529337
8 LPU.Newsletters 5.3333333 66.66667 28.00000 2.226667 0.5327738
13 LPU.Podcasts 11.2840467 70.42802 18.28794 2.070039 0.5403243
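
The rank comparisons quoted in this subsection can be checked directly from the "high" columns of the two Learning Platform Usefulness tables. The sketch below ranks each platform within each group and pairs the two ranks; the values are copied from the tables (rounded), and the helper name `ranks` is illustrative.

```python
def ranks(high_percent):
    """Rank items by descending 'high' percent (1 = most useful).

    sorted() is stable, so tied items keep the order in which they are listed.
    """
    ordered = sorted(high_percent, key=high_percent.get, reverse=True)
    return {item: position for position, item in enumerate(ordered, start=1)}

# "high" percent per platform, copied from the 'Learners' table.
learners = {
    "Courses": 77.78, "Projects": 73.08, "College": 64.81, "Arxiv": 61.11,
    "YouTube": 60.87, "SO": 56.52, "Documentation": 52.63, "Blogs": 50.00,
    "Kaggle": 49.35, "Communities": 44.44, "Tutoring": 43.75,
    "Textbook": 38.10, "Podcasts": 26.67, "Company": 25.00,
    "Conferences": 21.43, "Newsletters": 20.00, "Friends": 17.65,
}
# "high" percent per platform, copied from the employed Data Scientists table.
employed = {
    "Projects": 78.07, "SO": 63.75, "Courses": 63.45, "Tutoring": 56.886,
    "College": 56.885, "Kaggle": 52.15, "Textbook": 50.38, "Arxiv": 50.00,
    "TradeBook": 50.00, "Company": 47.68, "Blogs": 46.73, "Friends": 46.57,
    "Communities": 45.64, "YouTube": 45.51, "Documentation": 45.43,
    "Conferences": 33.41, "Newsletters": 28.00, "Podcasts": 18.29,
}

learner_rank = ranks(learners)
employed_rank = ranks(employed)

# (rank among learners, rank among employed) for platforms in both tables.
shifts = {
    item: (learner_rank[item], employed_rank[item])
    for item in learner_rank
    if item in employed_rank
}
for item in ("College", "Arxiv", "YouTube", "SO", "Friends"):
    print(item, shifts[item])
```

Note that the employed table has one extra item (TradeBook), so ranks in the two groups are out of 17 and 18 items respectively.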

3.4.5 Job Skills Importance to ‘Learners’

‘Learners’ put Python at the top of their list of job skills, with 63.6% of respondents saying it is “Necessary” and 99% saying it is either “Necessary” or “Nice to have”. Only 1% said it is “Unnecessary”. R is surprisingly far down the list: 34.7% said R is “Necessary”, only a little more than half the share for Python. Many more agreed that it is at least “Nice to have”, however, for a total of about 95% in favor of learning R, just slightly below Python.

Surprisingly, “Data Visualization” comes in only slightly above R, with 39% saying that it is “Necessary”. Its overall support is actually slightly lower than R’s, since 8% said it is “Unnecessary” compared with 5% for R.

Item low neutral high mean sd
5 JSI.Python 1.010101 35.35354 63.636364 2.626263 0.5068079
3 JSI.Stats 2.941177 48.03922 49.019608 2.460784 0.5570710
1 JSI.BigData 4.081633 54.08163 41.836735 2.377551 0.5655999
7 JSI.SQL 7.368421 52.63158 40.000000 2.326316 0.6091869
10 JSI.Visualizations 8.163265 53.06122 38.775510 2.306122 0.6160761
2 JSI.Degree 5.102041 60.20408 34.693878 2.295918 0.5599923
6 JSI.R 5.102041 60.20408 34.693878 2.295918 0.5599923
4 JSI.EnterpriseTools 17.582418 72.52747 9.890110 1.923077 0.5213395
8 JSI.KaggleRanking 35.051546 60.82474 4.123711 1.690722 0.5469770
9 JSI.MOOC 37.634409 59.13978 3.225807 1.655914 0.5416285

3.4.6 Work Tools Frequency

Not surprisingly, Python is also high on the list of tools that working Data Scientists use: 75.4% of users say they use it either “Often” or “Most of the time”, and only Statistica, SQL and Unix edge it out for the top 3 slots. What is surprising is that two of the top three tools used in the field (Statistica and Unix) are not even on the list of Job Skills that ‘Learners’ were asked to evaluate.

R comes in lower than Python, with 63.6% of users saying that they use it either “Often” or “Most of the time”, but the gap between R and Python is smaller in the field than the importance ratings given by ‘Learners’ would suggest.

Item low neutral high mean sd
44 WTF.Statistica 20.00000 0 80.00000 3.200000 0.8366600
42 WTF.SQL 20.51546 0 79.48454 3.264949 0.8637334
48 WTF.Unix 20.73171 0 79.26829 3.207317 0.8620225
31 WTF.Python 24.58522 0 75.41478 3.220965 0.9533485
18 WTF.KNIMECommercial 25.00000 0 75.00000 3.000000 0.8164966
17 WTF.Jupyter 32.30975 0 67.69025 2.982644 0.9911199
33 WTF.R 36.38941 0 63.61059 2.942344 1.0572465
38 WTF.SASBase 36.96682 0 63.03318 2.900474 1.0621406
2 WTF.AWS 44.32432 0 55.67568 2.686486 1.0642268
40 WTF.SASJMP 46.42857 0 53.57143 2.553571 1.1106041
23 WTF.Excel 47.64151 0 52.35849 2.613208 0.8823789
6 WTF.DataRobot 47.82609 0 52.17391 2.478261 1.2745611
41 WTF.Spark 48.33837 0 51.66163 2.655589 1.0250457
9 WTF.Hadoop 49.83607 0 50.16393 2.596721 1.0185721
12 WTF.IBMSPSSStatistics 50.66667 0 49.33333 2.453333 1.1424645
14 WTF.Impala 50.98039 0 49.01961 2.470588 1.0835671
45 WTF.Tableau 51.13350 0 48.86650 2.561713 1.0680602
8 WTF.GCP 52.63158 0 47.36842 2.482456 1.0064610
28 WTF.Oracle 52.94118 0 47.05882 2.470588 0.9919462
15 WTF.Java 53.53160 0 46.46840 2.446097 1.0008735
5 WTF.Cloudera 53.84615 0 46.15385 2.538461 1.0781434
7 WTF.Flume 54.54545 0 45.45455 2.318182 0.9454837
34 WTF.RapidMinerCommercial 54.54545 0 45.45455 2.545454 1.2933396
25 WTF.MicrosoftSQL 55.00000 0 45.00000 2.480000 1.0198039
46 WTF.TensorFlow 55.06912 0 44.93088 2.460830 0.9633397
47 WTF.TIBCO 56.66667 0 43.33333 2.366667 1.0333519
27 WTF.NoSQL 57.14286 0 42.85714 2.444015 0.8977539
36 WTF.Salfrod 57.14286 0 42.85714 2.142857 0.8997354
21 WTF.MATLAB 57.38832 0 42.61168 2.336770 1.1095146
24 WTF.MicrosoftRServer 58.33333 0 41.66667 2.404762 1.0193200
37 WTF.SAPBusinessObjects 58.33333 0 41.66667 2.416667 1.1645002
11 WTF.IBMSPSSModeler 58.82353 0 41.17647 2.215686 1.1715584
32 WTF.Qlik 60.60606 0 39.39394 2.303030 1.0748502
4 WTF.C 61.39241 0 38.60759 2.316456 1.0754909
10 WTF.IBMCognos 61.53846 0 38.46154 2.076923 1.0926327
39 WTF.SASEnterprise 62.88660 0 37.11340 2.329897 1.0870607
20 WTF.Mathematica 69.56522 0 30.43478 2.144928 0.9892862
13 WTF.IBMWatson 72.34043 0 27.65957 2.000000 0.9088933
30 WTF.Perl 72.41379 0 27.58621 1.919540 0.9550160
19 WTF.KNIMEFree 73.52941 0 26.47059 2.088235 0.8300291
3 WTF.Angoss 75.00000 0 25.00000 1.750000 0.9574271
26 WTF.Minitab 75.00000 0 25.00000 1.750000 0.9158109
22 WTF.Azure 75.86207 0 24.13793 1.965517 0.8133805
1 WTF.AmazonML 77.17391 0 22.82609 1.923913 0.8923816
16 WTF.Julia 77.50000 0 22.50000 1.850000 0.9486833
43 WTF.Stan 78.04878 0 21.95122 1.926829 0.9052691
35 WTF.RapidMinerFree 86.95652 0 13.04348 1.760870 0.7939992
29 WTF.Orange 88.23529 0 11.76471 1.470588 0.7174301

3.4.7 Work Methods Frequency

The big surprise here is that Data Visualization is at the top of the list. Nobody says they use it “Rarely”, only 8.6% say they use it only “Sometimes”, and 91.4% say they use it “Often” or “Most of the time”. It is by far the most frequently used method or tool among working Data Science professionals, with Statistica as the runner-up at only 80% by comparison. Remember that only 38.8% of ‘Learners’ said that they thought Visualizations were a “Necessary” job skill to learn!

Item low neutral high mean sd
7 WMF.DataVisualization 8.566276 0 91.43372 3.550045 0.6639726
6 WMF.Cross.Validation 24.178404 0 75.82160 3.160798 0.8500851
22 WMF.PrescriptiveModeling 32.710280 0 67.28972 2.855140 0.8350012
16 WMF.LogisticRegression 33.599202 0 66.40080 2.879362 0.8584456
12 WMF.GBM 34.337349 0 65.66265 2.879518 0.8848861
27 WMF.Simulation 35.013263 0 64.98674 2.854111 0.9006146
19 WMF.NLP 35.469108 0 64.53089 2.839817 0.8683888
30 WMF.TimeSeriesAnalysis 36.211699 0 63.78830 2.866295 0.8703743
9 WMF.EnsembleMethods 36.546185 0 63.45382 2.847390 0.8930782
15 WMF.LiftAnalysis 37.430168 0 62.56983 2.720670 0.8279985
29 WMF.TextAnalysis 38.420108 0 61.57989 2.795332 0.8882404
4 WMF.CNNs 39.830509 0 60.16949 2.805085 0.9340689
20 WMF.NeuralNetworks 40.545809 0 59.45419 2.785575 0.8821286
26 WMF.Segmentation 40.633245 0 59.36675 2.751979 0.9007434
25 WMF.RNNs 41.509434 0 58.49057 2.735849 0.8378153
23 WMF.RandomForests 41.909814 0 58.09019 2.733422 0.8695970
21 WMF.PCA 42.045454 0 57.95455 2.724026 0.8468828
8 WMF.DecisionTrees 42.063492 0 57.93651 2.708995 0.8456972
28 WMF.SVMs 48.051948 0 51.94805 2.597403 0.8788567
1 WMF.A.B 50.539957 0 49.46004 2.555076 0.8757749
14 WMF.KNN 51.268116 0 48.73188 2.543478 0.8293138
24 WMF.RecommenderSystems 51.690821 0 48.30918 2.507246 0.8411133
10 WMF.EvolutionaryApproaches 52.702703 0 47.29730 2.540541 0.9094510
5 WMF.CollaborativeFiltering 53.846154 0 46.15385 2.440559 0.8275399
3 WMF.Bayesian 55.580357 0 44.41964 2.462054 0.8713112
18 WMF.NaiveBayes 61.479592 0 38.52041 2.375000 0.8338660
17 WMF.MLN 64.814815 0 35.18519 2.351852 0.9144007
11 WMF.GANs 65.116279 0 34.88372 2.325581 0.8083178
13 WMF.HMMs 66.326531 0 33.67347 2.244898 0.8622821
2 WMF.AssociationRules 68.316832 0 31.68317 2.217822 0.7411697
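
To put the perception gap in one place, the sketch below lines up, for the four skills that appear on both sides, the share of ‘Learners’ who rated the skill “Necessary” against the share of employed respondents who use it “Often” or “Most of the time”. The values are copied from the tables above. The two scales measure different things (importance vs. frequency of use), so the differences are only indicative, and the variable names are illustrative.

```python
# Percent of 'Learners' rating the skill "Necessary" (JSI table, "high" column).
learner_necessary = {
    "Python": 63.64,
    "R": 34.69,
    "SQL": 40.00,
    "Visualizations": 38.78,
}
# Percent of employed respondents using it "Often" or "Most of the time"
# (WTF table for Python, R and SQL; WMF table for Data Visualization).
employed_frequent = {
    "Python": 75.41,
    "R": 63.61,
    "SQL": 79.48,
    "Visualizations": 91.43,
}

# Gap in percentage points: usage in the field minus perceived necessity.
gaps = {
    skill: round(employed_frequent[skill] - learner_necessary[skill], 2)
    for skill in learner_necessary
}
ranked = sorted(gaps, key=gaps.get, reverse=True)

# How far Python leads R in each group, as a ratio of the two shares.
python_to_r_learners = round(learner_necessary["Python"] / learner_necessary["R"], 2)
python_to_r_employed = round(employed_frequent["Python"] / employed_frequent["R"], 2)

for skill in ranked:
    print(skill, gaps[skill])
print(python_to_r_learners, python_to_r_employed)
```

Data Visualization shows the largest gap at roughly 53 percentage points, and the Python-to-R ratio falls from about 1.8 among ‘Learners’ to about 1.2 in the field.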