1 Research Question

The goal of this project is to answer the research question: which are the most valued data science skills?

In order to answer that question we found and used survey data from the Kaggle ML and Data Science Survey, 2017.

While the answer to the question is by definition subjective, the Kaggle survey was “an industry-wide survey to establish a comprehensive view of the state of data science and machine learning,” and with over 16,000 responses it provides a good starting point for exploring the views of professionals in the field and what they value.

2 Importing data

The survey was stored in 2 different files consisting of:

multiple choice items
free-response items

We chose to focus only on the multiple-choice data for statistical analysis. Kaggle stored each file in CSV format. We downloaded the multiple-choice survey results as a CSV file and placed it in our GitHub repo.
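As a rough sketch of the import step, the snippet below reads a multiple-choice export with Python's standard csv module. The inline sample stands in for the real file: the column names are variables mentioned in this report, but the rows are made up for illustration and the commented-out path is hypothetical.

```python
import csv
import io

# Made-up rows standing in for the multiple-choice CSV (illustration only).
SAMPLE_CSV = """GenderSelect,Country,Age
Male,United States,25
Female,India,31
Male,United States,
"""

def load_rows(handle):
    """Read a CSV file handle into a list of dicts, turning empty cells into None."""
    reader = csv.DictReader(handle)
    return [{key: (value if value != "" else None) for key, value in row.items()}
            for row in reader]

rows = load_rows(io.StringIO(SAMPLE_CSV))

# With a file on disk one might instead write (hypothetical path):
# with open("path/to/multiple_choice.csv", newline="") as f:
#     rows = load_rows(f)
print(len(rows), rows[0]["Country"])
```

Converting empty cells to None early keeps skipped survey items from being mistaken for real answers in later steps.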

3 A Day in a Data Scientist’s Life

We start by exploring the resources that Kaggle survey respondents use to learn data science. What data science activities do they perform, which learning platforms do they use, and how useful do they find those platforms?

We also look at demographics: how respondents are distributed across countries, along with some interesting facts about the gender distribution by country.

3.1 Variables and their definition

To begin with, we focused on respondents’ demographics to understand their age and gender.

After analyzing the variables GenderSelect and Age, it appears that out of 16,716 global Kaggle respondents there are 13,610 males and 2,778 females. In this subset, male respondents outnumber female respondents by almost 5 to 1 (~4.9). Also, from the plot below it is evident that respondents’ age peaks at 25 for both males and females, whereas the median age is about 30.
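The ratio quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the counts reported in the text, not the code used for the original analysis.

```python
# Counts quoted in the text above.
total_respondents = 16716
males = 13610
females = 2778

ratio = males / females
male_share = 100 * males / total_respondents
female_share = 100 * females / total_respondents

print(f"male-to-female ratio: {ratio:.2f}")
print(f"male share: {male_share:.1f}%, female share: {female_share:.1f}%")
```

The two shares do not sum to 100% because some respondents chose another option or skipped the gender question.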

Since we are trying to determine what the most important data science skills are, it is very important to understand what a data scientist does. What activities does a data scientist perform on a daily basis, and how much time does each activity typically take?

Let’s take a peek at a day in the life of a Data Scientist and try to figure out what a data scientist does.

The day typically starts with a question or business problem and involves the following activities/tasks:

GatheringData
FindingInsights
ModelBuilding
Visualizing
Production

3.2 Manipulating data

Kaggle captured respondents’ data about time spent on different activities. To analyze this question we looked at the attributes TimeGatheringData, TimeModelBuilding, TimeProduction, TimeVisualizing, and TimeFindingInsights.

To determine the usefulness of learning platforms, we tidied the data for the 18 learning platform attributes present in the data set and performed the analysis on the long format. We also manipulated the data to extract respondents’ ratings of each platform’s usefulness.
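The wide-to-long reshaping described above can be sketched as follows. The rows are invented and only three of the 18 platform columns are shown; the helper name to_long and the output key Rating are hypothetical choices for this illustration.

```python
# Shared prefix of the learning platform usefulness columns.
PREFIX = "LearningPlatformUsefulness"

# Invented wide rows: one column per learning platform (3 of the 18 shown).
wide_rows = [
    {"id": 1, "Country": "United States",
     PREFIX + "Kaggle": "Somewhat useful", PREFIX + "Blogs": None, PREFIX + "College": None},
    {"id": 3, "Country": "United States",
     PREFIX + "Kaggle": None, PREFIX + "Blogs": "Very useful", PREFIX + "College": "Very useful"},
]

def to_long(rows, prefix):
    """Melt every column starting with `prefix` into one row per (respondent, platform)."""
    long_rows = []
    for row in rows:
        for column, rating in row.items():
            # Skip non-platform columns and platforms the respondent did not rate.
            if column.startswith(prefix) and rating is not None:
                long_rows.append({
                    "id": row["id"],
                    "Country": row["Country"],
                    "LearningPlatform": column[len(prefix):],
                    "Rating": rating,
                })
    return long_rows

long_rows = to_long(wide_rows, PREFIX)
for record in long_rows:
    print(record)
```

In the long format each rating is its own row, which makes it straightforward to count ratings per platform.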

3.3 Exploratory Data Analysis (EDA)

After analyzing data for US respondents, it appears that data acquisition (gathering data) is the main activity, at 37.75%. This is where a data scientist spends most of their time. Model building ranks 2nd, at 19.23%, followed by finding insights (14.51%) and data visualization (13.75%). Only 10.23% of their total time appears to be taken by production activities.

DSActivity	mean_precent
TimeGatheringData	37.75491
TimeModelBuilding	19.23263
TimeFindingInsights	14.50524
TimeVisualizing	13.74509
TimeProduction	10.23198
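A table like the one above could be produced by averaging each activity column over the respondents who answered it. The sketch below uses invented responses, so its numbers will not match the table; treating a skipped item as None is an assumption.

```python
from statistics import mean

ACTIVITIES = ["TimeGatheringData", "TimeModelBuilding", "TimeProduction",
              "TimeVisualizing", "TimeFindingInsights"]

# Invented responses: percent of time per activity; None marks a skipped item.
responses = [
    {"TimeGatheringData": 40, "TimeModelBuilding": 20, "TimeProduction": 10,
     "TimeVisualizing": 15, "TimeFindingInsights": 15},
    {"TimeGatheringData": 30, "TimeModelBuilding": 25, "TimeProduction": None,
     "TimeVisualizing": 10, "TimeFindingInsights": 20},
]

def mean_percent(rows, columns):
    """Average each column over the rows that answered it, sorted from high to low."""
    means = {}
    for column in columns:
        answered = [row[column] for row in rows if row[column] is not None]
        means[column] = mean(answered)  # assumes at least one answer per column
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

ranking = mean_percent(responses, ACTIVITIES)
for activity, value in ranking:
    print(activity, value)
```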

Whether one is employed full-time, employed part-time, or a student, it’s worth exploring how people use different learning platforms and how they feel about them. We made use of the learning platform attributes captured in the dataset, which also include Kaggle as a learning platform.

lid	Country	EmploymentStatus	LPlatform	LP_count	LearningPlatform
1	United States	Not employed, but looking for work	LearningPlatformUsefulnessKaggle	Somewhat useful	Kaggle
3	United States	Independent contractor, freelancer, or self-employed	LearningPlatformUsefulnessBlogs	Very useful	Blogs
3	United States	Independent contractor, freelancer, or self-employed	LearningPlatformUsefulnessCollege	Very useful	College
3	United States	Independent contractor, freelancer, or self-employed	LearningPlatformUsefulnessConferences	Very useful	Conferences
3	United States	Independent contractor, freelancer, or self-employed	LearningPlatformUsefulnessFriends	Very useful	Friends
3	United States	Independent contractor, freelancer, or self-employed	LearningPlatformUsefulnessDocumentation	Very useful	Documentation

After analyzing respondents’ views of different learning platforms, it appears that learners benefited most from personal projects, as the majority of responses rate projects as very useful. Online courses rank 2nd, followed by StackOverflow and Kaggle. Blogs, textbooks and college also appear to be very useful, whereas newsletters, podcasts and trade books rank low.

4 What do Data Scientists Want to Learn?

Next, we examine what these survey takers of various educational backgrounds find themselves excited to learn. Due to the ever-evolving nature of technology and, by extension, data science, it is imperative that they remain relevant in their field and are passionate in their pursuit for relevance. Understanding what working professionals want to learn could give us insight into what skills are most valued in the field.

Does a survey taker’s formal education have any relationship to the Machine Learning/Data Science method they are most excited about learning in the next year?

4.1 Variables and their definition

To do the analysis, we concentrate on two columns in the dataset:

FormalEducation: Which level of formal education have you attained?
MLMethodNextYearSelect: Which Machine Learning/Data Science method are you most excited about learning in the next year?

These questions were asked to all participants.

4.2 Exploratory Data Analysis (EDA)

First we plot the distribution of formal education in the dataset.

The data set predominantly contains respondents with Master’s degrees, followed by Bachelor’s and then doctoral degrees.

Now let’s look at the different Machine Learning/Data Science methods in the dataset.

Machine Learning/Data Science
Random Forests
Deep learning
Neural Nets
Text Mining
Genetic & Evolutionary Algorithms
Link Analysis
Rule Induction
Regression
Proprietary Algorithms
I don’t plan on learning a new ML/DS method
Ensemble Methods (e.g. boosting, bagging)
Factor Analysis
Social Network Analysis
Monte Carlo Methods
Time Series Analysis
Other
Bayesian Methods
Survival Analysis
MARS
Anomaly Detection
Cluster Analysis
Decision Trees
Association Rules
Uplift Modeling
Support Vector Machines (SVM)

Now we can plot the distribution of Machine Learning/Data Science methods with formal education.

Our results revealed that Deep Learning is the top Machine Learning/Data Science method among the Kaggle survey takers regardless of their formal education. Interestingly, 40% of respondents with a bachelor’s degree and 40% of those with a master’s degree stated that Deep Learning is the technique they are most excited about learning in the next year. Similarly, 39% of respondents with a high school education named Deep Learning as the method they most want to learn.

Following Deep Learning, Neural Nets emerged as the second most popular Machine Learning/Data Science method that data scientists want to learn next year. Intriguingly, college dropouts have the highest percentage of interest in learning Neural Nets.

Time Series Analysis was the third most popular Machine Learning/Data Science method. High school graduates, however, picked Genetic & Evolutionary Algorithms as their third choice.

Among doctoral survey takers, Bayesian Methods is the third preference. This method was a top choice only among PhDs, not in the other groups.

The results suggest a clear trend: Deep Learning is the Machine Learning/Data Science method that data scientists most want to learn. As for the overall research question of which data science skills are valued most, this insight suggests that aspiring data scientists should consider learning Deep Learning.
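Percentages of this kind are computed within each education group rather than over the whole sample. A minimal sketch of that calculation, using invented (FormalEducation, MLMethodNextYearSelect) pairs and a hypothetical helper name, might look like this:

```python
from collections import Counter, defaultdict

# Invented (FormalEducation, MLMethodNextYearSelect) pairs for illustration.
pairs = [
    ("Master's degree", "Deep learning"),
    ("Master's degree", "Deep learning"),
    ("Master's degree", "Neural Nets"),
    ("Bachelor's degree", "Deep learning"),
    ("Bachelor's degree", "Time Series Analysis"),
    ("Doctoral degree", "Bayesian Methods"),
    ("Doctoral degree", "Deep learning"),
]

def percent_within_group(rows):
    """Return {education: [(method, percent), ...]} with percentages computed per group."""
    counts = defaultdict(Counter)
    for education, method in rows:
        counts[education][method] += 1
    result = {}
    for education, counter in counts.items():
        total = sum(counter.values())
        result[education] = [(method, round(100 * n / total, 1))
                             for method, n in counter.most_common()]
    return result

shares = percent_within_group(pairs)
for education, ranked in shares.items():
    print(education, ranked)
```

Normalizing within each group is what allows a small group, such as high school graduates, to be compared with a much larger one, such as master’s degree holders.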

5 Data Science Methods

What are the most frequently used data science (DS) methods by those writing code in DS professions? Do those relate to formal educational attainment?

The Kaggle dataset provides several variables for assessing what the most valuable data science skills may be. In the previous section, we examined which data science methods learners are most excited about. In this section, we’ll look at which data science methods are the most frequently used and whether that has any relationship to educational attainment, a potential indicator of whether certain methods require advanced academic training.

5.1 Variables and their definition

The following variables label questions asking survey respondents how often they use each of these data science methods. Response options were: Rarely, Sometimes, Often, Most of the time.

WorkMethodsFrequencyA/B
WorkMethodsFrequencyAssociationRules
WorkMethodsFrequencyBayesian
WorkMethodsFrequencyCNNs
WorkMethodsFrequencyCollaborativeFiltering
WorkMethodsFrequencyCross-Validation
WorkMethodsFrequencyDataVisualization
WorkMethodsFrequencyDecisionTrees
WorkMethodsFrequencyEnsembleMethods
WorkMethodsFrequencyEvolutionaryApproaches
WorkMethodsFrequencyGANs
WorkMethodsFrequencyGBM
WorkMethodsFrequencyHMMs
WorkMethodsFrequencyKNN
WorkMethodsFrequencyLiftAnalysis
WorkMethodsFrequencyLogisticRegression
WorkMethodsFrequencyMLN
WorkMethodsFrequencyNaiveBayes
WorkMethodsFrequencyNLP
WorkMethodsFrequencyNeuralNetworks
WorkMethodsFrequencyPCA
WorkMethodsFrequencyPrescriptiveModeling
WorkMethodsFrequencyRandomForests
WorkMethodsFrequencyRecommenderSystems
WorkMethodsFrequencyRNNs
WorkMethodsFrequencySegmentation
WorkMethodsFrequencySimulation
WorkMethodsFrequencySVMs
WorkMethodsFrequencyTextAnalysis
WorkMethodsFrequencyTimeSeriesAnalysis

The one additional variable used for this analysis is:

FormalEducation

5.2 Manipulating data

To answer the question of which methods are most popular among code writers, several transformations must first be done. First, we filter the dataset down to only those who were classified as code writers: those who were employed in some capacity working in data science and writing code as part of their job duties. Additionally, we include only participants who endorsed at least one data science method on the question “At work, which data science methods do you use? (Select all that apply)” (variable name: WorkMethodsSelect).

Once filtered, the endorsed data science methods were aggregated and their frequencies plotted (see Exploratory Data Analysis below). The five most frequently endorsed methods were then selected and each given a frequency score representing how often the respondents who endorse a method actually use it.

The final transformation performed on the data was grouping by formal education level and then identifying the most frequently endorsed data science methods for each group. This can help identify whether those writing certain types of code and using certain data analyses potentially benefit from pursuing advanced education, a valuable insight for prospective data science students.
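The filter-and-aggregate steps above can be sketched as follows. The rows are invented; the CodeWriter flag and the assumption that WorkMethodsSelect holds a comma-separated string are illustrative guesses about the data layout, not details confirmed in this report.

```python
from collections import Counter

# Invented respondents; WorkMethodsSelect is assumed to be a comma-separated string.
respondents = [
    {"CodeWriter": "Yes", "WorkMethodsSelect": "Data Visualization,Logistic Regression"},
    {"CodeWriter": "Yes", "WorkMethodsSelect": "Data Visualization,Decision Trees"},
    {"CodeWriter": "No", "WorkMethodsSelect": "Data Visualization"},
    {"CodeWriter": "Yes", "WorkMethodsSelect": None},
]

def count_endorsements(rows):
    """Keep code writers who endorsed at least one method, then tally each method."""
    kept = [r for r in rows if r["CodeWriter"] == "Yes" and r["WorkMethodsSelect"]]
    tally = Counter()
    for r in kept:
        # A set guards against the same method being counted twice for one respondent.
        tally.update({m.strip() for m in r["WorkMethodsSelect"].split(",")})
    return len(kept), tally

n_kept, tally = count_endorsements(respondents)
for method, freq in tally.most_common():
    print(f"{method}: {freq} ({100 * freq / n_kept:.0f}% of {n_kept} respondents)")
```

Dividing each count by the number of retained respondents gives the share of code writers who endorse a method at all, which is separate from how often they use it.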

5.3 Exploratory Data Analysis (EDA)

Following manipulation of the Kaggle data set, we created plots to visualize the aforementioned research questions. First, here is a look at the frequency with which the following data science methods were endorsed by a total of 7,773 respondents. Nearly 2/3 of the survey respondents endorsed the first place skill, data visualization. Over half endorse logistic regression and just shy of half endorse cross-validation and decision trees.

Options	Freq
Data Visualization	5022
Logistic Regression	4291
Cross-Validation	3868
Decision Trees	3695
Random Forests	3454
Time Series Analysis	3153
Neural Networks	2811
PCA and Dimensionality Reduction	2789
kNN and Other Clustering	2624
Text Analytics	2405
Ensemble Methods	2056
Segmentation	2050
SVMs	1973
Natural Language Processing	1949
A/B Testing	1936
Bayesian Techniques	1913
Naive Bayes	1902
Gradient Boosted Machines	1557
CNNs	1417
Simulation	1398
Recommender Systems	1158
Association Rules	1146
RNNs	891
Prescriptive Modeling	851
Collaborative Filtering	793
Lift Analysis	650
Evolutionary Approaches	436
HMMs	419
Other	391
Markov Logic Networks	255
GANs	244

The following plot graphically displays the frequency of endorsements for the data science methods asked about.

In this plot we show the “Frequency Score” for the top five most endorsed data science methods. It’s important to break this down further than endorsement, as the above table and plot only consider which data science methods one uses at all. Just because a method is endorsed doesn’t mean that individuals use it frequently. It may be a rare but essential method in data science. To get a more fine-grained understanding of how commonly one uses a given data science method on the job, the Kaggle survey followed up each endorsed method by asking respondents whether they use it Rarely, Sometimes, Often, or Most of the time. We converted these to numeric values (Rarely = 1, Sometimes = 2, Often = 3, and Most of the time = 4) so that we could average the categorical responses and graph the resulting score.
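The numeric coding described above can be illustrated with a short sketch. The answers are invented, and treating a non-endorsed method as None is an assumption made for this example.

```python
from statistics import mean

# Numeric coding described in the text.
SCORE = {"Rarely": 1, "Sometimes": 2, "Often": 3, "Most of the time": 4}

# Invented answers to one follow-up item; None means the method was not endorsed.
answers = ["Often", "Most of the time", None, "Sometimes", "Often"]

def frequency_score(values):
    """Average the numeric codes over the respondents who answered the item."""
    coded = [SCORE[v] for v in values if v is not None]
    return mean(coded) if coded else None

score = frequency_score(answers)
print(score)
```

Averaging ordinal codes treats the gaps between categories as equal, which is a simplification but a convenient one for comparing methods.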

Of the top five data science methods endorsed, data visualization was the skill indicated to be used the most frequently.

The plots below show the frequency of methods endorsed for each formal education level assessed by Kaggle.

We see that in the majority of educational attainment brackets, data visualization remains the most frequently endorsed data science method.

The same information is also provided in tabular format:

Selections	Freq	RelativeFreq	Degree
Data Visualization	1236	0.0944232	Bachelor’s Education
Logistic Regression	989	0.0755539	Bachelor’s Education
Decision Trees	847	0.0647059	Bachelor’s Education
Data Visualization	1129	0.0756348	Doctoral Education
Cross-Validation	1046	0.0700744	Doctoral Education
Logistic Regression	1031	0.0690695	Doctoral Education
Neural Networks	24	0.0808081	High School Education
Data Visualization	23	0.0774411	High School Education
Text Analytics	18	0.0606061	High School Education
Data Visualization	2331	0.0835873	Master’s Education
Logistic Regression	2022	0.0725069	Master’s Education
Cross-Validation	1821	0.0652992	Master’s Education
Data Visualization	150	0.0897129	Professional Education
Logistic Regression	121	0.0723684	Professional Education
Decision Trees	119	0.0711722	Professional Education
Data Visualization	137	0.1008837	Some Post Secondary Education
Logistic Regression	97	0.0714286	Some Post Secondary Education
Decision Trees	87	0.0640648	Some Post Secondary Education
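A table of this shape could be built by grouping the selected methods by degree and keeping the three most frequent per group. In the sketch below the pairs are invented, and RelativeFreq is assumed to be a method's share of all selections made within its degree group.

```python
from collections import Counter, defaultdict

# Invented (Degree, selected method) pairs, one per endorsed method.
selections = [
    ("Master's", "Data Visualization"), ("Master's", "Data Visualization"),
    ("Master's", "Logistic Regression"), ("Master's", "Cross-Validation"),
    ("Master's", "Data Visualization"), ("Master's", "Logistic Regression"),
    ("Master's", "SVMs"),
    ("Doctoral", "Cross-Validation"), ("Doctoral", "Data Visualization"),
    ("Doctoral", "Data Visualization"),
]

def top_methods(rows, n=3):
    """For each degree, return up to n rows of (method, Freq, RelativeFreq, degree)."""
    by_degree = defaultdict(Counter)
    for degree, method in rows:
        by_degree[degree][method] += 1
    table = []
    for degree, counter in by_degree.items():
        total = sum(counter.values())  # all selections made within this degree group
        for method, freq in counter.most_common(n):
            table.append((method, freq, freq / total, degree))
    return table

top3 = top_methods(selections)
for row in top3:
    print(row)
```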

The research question of which data science skills are the most important can be interpreted and answered in many ways. One way to explore this deceptively complex question is to analyze which data science methods code writers endorse using on the job. This analysis did just that, and further explored the top five most endorsed data science methods by looking at how frequently those who endorsed them actually use them on the job.

The bottom line of this analysis is that data visualization, logistic regression, cross-validation, decision trees, and random forests are not only widely endorsed methods but also methods that are used frequently. Across data science code writers these methods are popular, and individual code writers who endorse them use them often.

The second goal of this analysis was to understand how formal educational attainment relates to data science methods used on the job. When looking at the plots of each educational level and the table coalescing all of that data, it does not seem like data science methods used by code writers differ given the educational level. Data visualization remains the most frequently endorsed data science method for the majority of educational groups. This has important implications for students of data science in understanding that certain popular job functions are not only performed by those with advanced degrees. This speaks to how crucial skills like data visualization and the other frequently endorsed and commonly used methods are to data science as a whole.

6 ‘Learners’ vs. Employed Data Scientists

Is there a difference between what ‘Learners’ think are the important skills to learn and what employed Data Scientists say are the skills and tools they are using?

Those who are new to data science and learning new skills may have different opinions than those already employed in the field about which tools and methods are most important to learn and which are being used. Comparing the answers of ‘Learners’ vs. employed Data Scientists may give us some insight into which skills each group values most and whether or not they are in agreement.

6.1 Variables and their definition

A large portion of the data collected in the Kaggle survey was in the form of Likert scales asking respondents to rate the value, importance, or frequency of use of certain skills, tools and methods on a 3- or 4-point scale. To answer our question about which are the ‘most valued’ data science skills, we looked at these scales and analyzed the differences between what ‘Learners’ thought was important to them and what employed data scientists say they are using in the field.

An oversight in the survey is that it failed to capture the opinions of those who are employed about what they thought were the most important job skills. Employed respondents were not asked those questions, so comparisons between employed data scientists and learners are a little more difficult. We can only use the data about which tools and methods employed data scientists are actually using, and how frequently, to infer the importance of those tools and methods. The survey also failed to capture those who are employed and also students or learners: employed respondents were not asked whether they were also studying data science. Professional development is critical in a field that is growing and changing as rapidly as data science, so asking employed professionals about their further studies could have provided very useful information. Working Data Scientists may have better insight into the direction the field is going in than students who are just in the learning phase of their journey.

6.2 Manipulating data

Out of the 229 variables of data collected in the survey, about 105 fall into 5 Likert scales describing “Learning Platform Usefulness” (which was asked of both learners and employed data scientists), “Job Skill Importance” (which only unemployed learners were asked), and “Work Tools Frequency” and “Work Methods Frequency” (which only employed data scientists were asked to answer). We narrowed down the data to include only these, as well as a few more basic demographic fields to get a sense of who the respondents are.

We also chose to focus only on respondents who are located in the US, since international cultural and technological differences may skew results. Different tools and skills may be more or less valued in different countries, so we thought it best to narrow the focus to one country at a time. Further analysis to see if the findings are consistent across countries would be interesting and worthwhile.

6.3 Exploratory Data Analysis (EDA)

First let’s take a look at the demographics of the survey respondents and what type of jobs they hold.

6.3.1 Employed Data Scientists

About 30% of our survey respondents would call themselves “Data Scientists”, with about another 20% calling themselves either “Scientist/Researcher” or “Software Developer/Software Engineer”. So about 50% of our survey respondents fall into these three categories, with the other 50% in other roles.

CurrentJobTitleSelect	total	percent
Data Scientist	644	30.51
Scientist/Researcher	225	10.66
Software Developer/Software Engineer	212	10.04
Data Analyst	185	8.76
Other	177	8.38
Researcher	138	6.54
Machine Learning Engineer	102	4.83
Engineer	73	3.46
Statistician	71	3.36
NA	71	3.36
Business Analyst	59	2.79
Computer Scientist	47	2.23
Predictive Modeler	41	1.94
Programmer	21	0.99
DBA/Database Engineer	19	0.90
Operations Research Practitioner	16	0.76
Data Miner	10	0.47

6.3.2 ‘Learners’

The Kaggle survey asked respondents if they were learning data science and their student status. 73% of the respondents who said they were learning data science are enrolled in an academic program.

StudentStatus	total	percent
Yes	113	73.38
No	41	26.62

55% of all respondents who said they are learning data science said they are “focused on learning mostly data science skills” with the other 45% saying that “data science is a small part of what I’m focused on learning”.

The ratio of respondents who said they are “focused on learning mostly data science skills” remains the same at about 55% regardless of whether or not they are enrolled in an academic program.

StudentStatus	LearningDataScience	total	percent
Yes	Yes, I’m focused on learning mostly data science skills	63	55.75
Yes	Yes, but data science is a small part of what I’m focused on learning	50	44.25
No	Yes, I’m focused on learning mostly data science skills	23	56.10
No	Yes, but data science is a small part of what I’m focused on learning	18	43.90

6.3.3 Learning Platform Usefulness - ‘Learners’

‘Learners’ think that the top 3 best ways to learn data science are through Courses, Projects and College with Arxiv and YouTube coming in 4th and 5th respectively.

	Item	low	neutral	high	mean	sd
11	LPU.Courses	0.000000	22.22222	77.77778	2.777778	0.4190790
12	LPU.Projects	1.923077	25.00000	73.07692	2.711539	0.4984894
3	LPU.College	1.851852	33.33333	64.81481	2.629630	0.5247208
1	LPU.Arxiv	0.000000	38.88889	61.11111	2.611111	0.5016313
17	LPU.YouTube	0.000000	39.13043	60.86957	2.608696	0.4934352
14	LPU.SO	0.000000	43.47826	56.52174	2.565217	0.4993602
10	LPU.Documentation	10.526316	36.84211	52.63158	2.421053	0.6924826
2	LPU.Blogs	2.631579	47.36842	50.00000	2.473684	0.5568658
7	LPU.Kaggle	0.000000	50.64935	49.35065	2.493507	0.5032363
9	LPU.Communities	0.000000	55.55556	44.44444	2.444444	0.5270463
16	LPU.Tutoring	6.250000	50.00000	43.75000	2.375000	0.6191392
15	LPU.Textbook	7.142857	54.76190	38.09524	2.309524	0.6043781
13	LPU.Podcasts	6.666667	66.66667	26.66667	2.200000	0.5606119
4	LPU.Company	0.000000	75.00000	25.00000	2.250000	0.5000000
5	LPU.Conferences	14.285714	64.28571	21.42857	2.071429	0.6157279
8	LPU.Newsletters	0.000000	80.00000	20.00000	2.200000	0.4216370
6	LPU.Friends	17.647059	64.70588	17.64706	2.000000	0.6123724
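The columns in a summary table like this one (percent low, neutral and high, plus mean and standard deviation) can be derived from 3-point responses along the following lines. The responses below are toy data, chosen so that the output resembles the LPU.Courses row; the actual response counts are not given in this report, and the label of the lowest category is an assumption.

```python
from statistics import mean, stdev

# 3-point coding assumed for the usefulness items (lowest label is a guess).
CODES = {"Not Useful": 1, "Somewhat useful": 2, "Very useful": 3}

# Toy responses (not the survey data): 14 "Somewhat useful" and 49 "Very useful".
responses = ["Somewhat useful"] * 14 + ["Very useful"] * 49

def likert_summary(values):
    """Return percent low/neutral/high plus mean and sample sd of the numeric codes."""
    coded = [CODES[v] for v in values]
    n = len(coded)
    return {
        "low": 100 * coded.count(1) / n,
        "neutral": 100 * coded.count(2) / n,
        "high": 100 * coded.count(3) / n,
        "mean": mean(coded),
        "sd": stdev(coded),
    }

summary = likert_summary(responses)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The table appears to be sorted by the high column rather than by the mean, which would explain why a few rows with higher means sit below rows with lower ones.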

6.3.4 Learning Platform Usefulness - Employed Data Scientists

Employed Data Scientists agree with unemployed ‘Learners’ that Projects and Courses belong in the top 3, but put College in 5th place (vs. 3rd). They also include Tutoring and SO (Stack Overflow Q&A) in their top 5, with SO coming in 2nd place. YouTube (learners’ 5th choice) comes in 14th place for employed Data Scientists, and Arxiv (learners’ 4th choice) is 8th among employed Data Scientists.

Another interesting difference is that the importance of Friends to learning data science is much higher among employed Data Scientists: about 46.5% say that Friends are “Very useful” and 97.5% say that Friends are either “Somewhat useful” or “Very useful”. ‘Learners’, by contrast, put Friends at the absolute bottom of their list, with only 17.6% saying that they are “Very useful” and 82.4% saying that Friends are either “Somewhat useful” or “Very useful”.

This may indicate a need to create a more robust community for Data Science ‘Learners’, who may feel somewhat isolated in their studies vs. employed Data Scientists who presumably have more established work and social networks that involve Data Science.

	Item	low	neutral	high	mean	sd
12	LPU.Projects	0.6265664	21.30326	78.07018	2.774436	0.4329562
14	LPU.SO	0.7954545	35.45455	63.75000	2.629545	0.4994101
11	LPU.Courses	1.3586957	35.19022	63.45109	2.620924	0.5127461
17	LPU.Tutoring	3.5928144	39.52096	56.88623	2.532934	0.5680704
3	LPU.College	1.5801354	41.53499	56.88488	2.553047	0.5286014
7	LPU.Kaggle	0.8860759	46.96203	52.15190	2.512658	0.5175910
15	LPU.Textbook	2.3041475	47.31183	50.38402	2.480799	0.5442143
1	LPU.Arxiv	1.8918919	48.10811	50.00000	2.481081	0.5368976
16	LPU.TradeBook	5.6818182	44.31818	50.00000	2.443182	0.6037803
4	LPU.Company	4.6413502	47.67932	47.67932	2.430380	0.5825909
2	LPU.Blogs	1.0159652	52.24964	46.73440	2.457184	0.5185329
6	LPU.Friends	2.5270758	50.90253	46.57040	2.440433	0.5459573
9	LPU.Communities	0.0000000	54.36242	45.63758	2.456376	0.4997732
18	LPU.YouTube	2.4038462	52.08333	45.51282	2.431090	0.5420324
10	LPU.Documentation	3.0470914	51.52355	45.42936	2.423823	0.5531604
5	LPU.Conferences	5.1044084	61.48492	33.41067	2.283063	0.5529337
8	LPU.Newsletters	5.3333333	66.66667	28.00000	2.226667	0.5327738
13	LPU.Podcasts	11.2840467	70.42802	18.28794	2.070039	0.5403243

6.3.5 Job Skills Importance to ‘Learners’

‘Learners’ put Python at the top of their list of Job Skills, with 63.6% of respondents saying that it is “Necessary” and 99% saying it is either “Necessary” or “Nice to have”. Only 1% said it was “Unnecessary”. R is surprisingly further down the list: at 34.7%, only a little more than half as many respondents say that R is “Necessary” compared to Python. Many more agree that it is at least “Nice to have”, for a total of 95% in favor of learning R, which is just slightly less than those in favor of Python.

Surprisingly, “Data Visualization” comes in only slightly above R, with 39% saying that it is “Necessary”, but it is actually slightly lower in terms of overall importance, with 8% saying it is “Unnecessary”.

	Item	low	neutral	high	mean	sd
5	JSI.Python	1.010101	35.35354	63.636364	2.626263	0.5068079
3	JSI.Stats	2.941177	48.03922	49.019608	2.460784	0.5570710
1	JSI.BigData	4.081633	54.08163	41.836735	2.377551	0.5655999
7	JSI.SQL	7.368421	52.63158	40.000000	2.326316	0.6091869
10	JSI.Visualizations	8.163265	53.06122	38.775510	2.306122	0.6160761
2	JSI.Degree	5.102041	60.20408	34.693878	2.295918	0.5599923
6	JSI.R	5.102041	60.20408	34.693878	2.295918	0.5599923
4	JSI.EnterpriseTools	17.582418	72.52747	9.890110	1.923077	0.5213395
8	JSI.KaggleRanking	35.051546	60.82474	4.123711	1.690722	0.5469770
9	JSI.MOOC	37.634409	59.13978	3.225807	1.655914	0.5416285

6.3.6 Work Tools Frequency

Not surprisingly, Python is also high on the list of tools that working Data Scientists use, with 75.4% of users saying that they use it either “Often” or “Most of the time” and only Statistica, SQL and Unix edging it out for the top 3 slots. What is surprising is that two of the top three tools used in the field (Statistica and Unix) aren’t even on the list of Job Skills ‘Learners’ were asked to evaluate.

R comes in slightly lower than Python, with 63.6% of users saying that they use it either “Often” or “Most of the time”, so the gap between R and Python is smaller in the field than the importance ratings given by ‘Learners’ would suggest.

	Item	low	high	mean	sd
44	WTF.Statistica	20.00000	80.00000	3.200000	0.8366600
42	WTF.SQL	20.51546	79.48454	3.264949	0.8637334
48	WTF.Unix	20.73171	79.26829	3.207317	0.8620225
31	WTF.Python	24.58522	75.41478	3.220965	0.9533485
18	WTF.KNIMECommercial	25.00000	75.00000	3.000000	0.8164966
17	WTF.Jupyter	32.30975	67.69025	2.982644	0.9911199
33	WTF.R	36.38941	63.61059	2.942344	1.0572465
38	WTF.SASBase	36.96682	63.03318	2.900474	1.0621406
2	WTF.AWS	44.32432	55.67568	2.686486	1.0642268
40	WTF.SASJMP	46.42857	53.57143	2.553571	1.1106041
23	WTF.Excel	47.64151	52.35849	2.613208	0.8823789
6	WTF.DataRobot	47.82609	52.17391	2.478261	1.2745611
41	WTF.Spark	48.33837	51.66163	2.655589	1.0250457
9	WTF.Hadoop	49.83607	50.16393	2.596721	1.0185721
12	WTF.IBMSPSSStatistics	50.66667	49.33333	2.453333	1.1424645
14	WTF.Impala	50.98039	49.01961	2.470588	1.0835671
45	WTF.Tableau	51.13350	48.86650	2.561713	1.0680602
8	WTF.GCP	52.63158	47.36842	2.482456	1.0064610
28	WTF.Oracle	52.94118	47.05882	2.470588	0.9919462
15	WTF.Java	53.53160	46.46840	2.446097	1.0008735
5	WTF.Cloudera	53.84615	46.15385	2.538461	1.0781434
7	WTF.Flume	54.54545	45.45455	2.318182	0.9454837
34	WTF.RapidMinerCommercial	54.54545	45.45455	2.545454	1.2933396
25	WTF.MicrosoftSQL	55.00000	45.00000	2.480000	1.0198039
46	WTF.TensorFlow	55.06912	44.93088	2.460830	0.9633397
47	WTF.TIBCO	56.66667	43.33333	2.366667	1.0333519
27	WTF.NoSQL	57.14286	42.85714	2.444015	0.8977539
36	WTF.Salfrod	57.14286	42.85714	2.142857	0.8997354
21	WTF.MATLAB	57.38832	42.61168	2.336770	1.1095146
24	WTF.MicrosoftRServer	58.33333	41.66667	2.404762	1.0193200
37	WTF.SAPBusinessObjects	58.33333	41.66667	2.416667	1.1645002
11	WTF.IBMSPSSModeler	58.82353	41.17647	2.215686	1.1715584
32	WTF.Qlik	60.60606	39.39394	2.303030	1.0748502
4	WTF.C	61.39241	38.60759	2.316456	1.0754909
10	WTF.IBMCognos	61.53846	38.46154	2.076923	1.0926327
39	WTF.SASEnterprise	62.88660	37.11340	2.329897	1.0870607
20	WTF.Mathematica	69.56522	30.43478	2.144928	0.9892862
13	WTF.IBMWatson	72.34043	27.65957	2.000000	0.9088933
30	WTF.Perl	72.41379	27.58621	1.919540	0.9550160
19	WTF.KNIMEFree	73.52941	26.47059	2.088235	0.8300291
3	WTF.Angoss	75.00000	25.00000	1.750000	0.9574271
26	WTF.Minitab	75.00000	25.00000	1.750000	0.9158109
22	WTF.Azure	75.86207	24.13793	1.965517	0.8133805
1	WTF.AmazonML	77.17391	22.82609	1.923913	0.8923816
16	WTF.Julia	77.50000	22.50000	1.850000	0.9486833
43	WTF.Stan	78.04878	21.95122	1.926829	0.9052691
35	WTF.RapidMinerFree	86.95652	13.04348	1.760870	0.7939992
29	WTF.Orange	88.23529	11.76471	1.470588	0.7174301

6.3.7 Work Methods Frequency

The big surprise here is that Data Visualization is at the top of the list, with nobody saying that they use it “Rarely” and only 8.6% saying that they use it only “Sometimes”; 91.4% say they use it “Often” or “Most of the time”. It is by far the most frequently used method or tool for working Data Science professionals, with Statistica as the runner-up at only 80% by comparison. Remember that only 38.8% of ‘Learners’ said that they thought Visualizations were a “Necessary” job skill to learn!

	Item	low	high	mean	sd
7	WMF.DataVisualization	8.566276	91.43372	3.550045	0.6639726
6	WMF.Cross.Validation	24.178404	75.82160	3.160798	0.8500851
22	WMF.PrescriptiveModeling	32.710280	67.28972	2.855140	0.8350012
16	WMF.LogisticRegression	33.599202	66.40080	2.879362	0.8584456
12	WMF.GBM	34.337349	65.66265	2.879518	0.8848861
27	WMF.Simulation	35.013263	64.98674	2.854111	0.9006146
19	WMF.NLP	35.469108	64.53089	2.839817	0.8683888
30	WMF.TimeSeriesAnalysis	36.211699	63.78830	2.866295	0.8703743
9	WMF.EnsembleMethods	36.546185	63.45382	2.847390	0.8930782
15	WMF.LiftAnalysis	37.430168	62.56983	2.720670	0.8279985
29	WMF.TextAnalysis	38.420108	61.57989	2.795332	0.8882404
4	WMF.CNNs	39.830509	60.16949	2.805085	0.9340689
20	WMF.NeuralNetworks	40.545809	59.45419	2.785575	0.8821286
26	WMF.Segmentation	40.633245	59.36675	2.751979	0.9007434
25	WMF.RNNs	41.509434	58.49057	2.735849	0.8378153
23	WMF.RandomForests	41.909814	58.09019	2.733422	0.8695970
21	WMF.PCA	42.045454	57.95455	2.724026	0.8468828
8	WMF.DecisionTrees	42.063492	57.93651	2.708995	0.8456972
28	WMF.SVMs	48.051948	51.94805	2.597403	0.8788567
1	WMF.A.B	50.539957	49.46004	2.555076	0.8757749
14	WMF.KNN	51.268116	48.73188	2.543478	0.8293138
24	WMF.RecommenderSystems	51.690821	48.30918	2.507246	0.8411133
10	WMF.EvolutionaryApproaches	52.702703	47.29730	2.540541	0.9094510
5	WMF.CollaborativeFiltering	53.846154	46.15385	2.440559	0.8275399
3	WMF.Bayesian	55.580357	44.41964	2.462054	0.8713112
18	WMF.NaiveBayes	61.479592	38.52041	2.375000	0.8338660
17	WMF.MLN	64.814815	35.18519	2.351852	0.9144007
11	WMF.GANs	65.116279	34.88372	2.325581	0.8083178
13	WMF.HMMs	66.326531	33.67347	2.244898	0.8622821
2	WMF.AssociationRules	68.316832	31.68317	2.217822	0.7411697

7 R vs. Python

R and Python are the most frequently used programming languages. But do those who use R recommend R or Python? And do those who use Python recommend R or Python? In other words, do survey takers feel that others should first and foremost study the language they themselves have taken up, or does their experience lead them to suggest the language they did not learn?

Thus the following questions were explored:

What is the distribution of the following categories of programming language use among Kaggle survey takers in the past year?

R Only
Python only
Both Python and R
Neither Python nor R

What is the distribution of programming language recommendations within each of the following groups, defined by the languages Kaggle survey takers used in the past year?

Using R Only
Using Python only
Using Both Python and R
Using Neither Python nor R

7.1 Variables and their definition

Two variables are used in this section of the analysis:

LanguageRecommendationSelect: What programming language would you recommend a new data scientist learn first? (Select one option) - Selected Choice
WorkToolsSelect: For work, which data science/analytics tools, technologies, and languages have you used in the past year? (Select all that apply) - Selected Choice

7.2 Manipulating data

The major task in this part of the analysis was to create a tidy data structure. This can be accomplished using select function calls on the variables required for the analysis. Because respondents could choose every option that applied to them, the languages were captured as a single comma-separated string per respondent rather than as one column per language.
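The report does not show its own code, so the following is only a minimal sketch, written in Python, of how a multi-select string can be turned into the four usage groups discussed in this section. The function name `classify_language_use` and the example rows are hypothetical, and the sketch assumes the tool names "R" and "Python" appear as exact comma-separated tokens in WorkToolsSelect.

```python
def classify_language_use(work_tools_select):
    """Label one respondent as 'Both', 'R only', 'Python only', or 'Neither'."""
    # Split the comma-separated answer into a set of individual tool names.
    raw = work_tools_select or ""
    tools = {tool.strip() for tool in raw.split(",") if tool.strip()}

    # Exact token matching, so a tool whose name merely contains the
    # letter "R" is not mistaken for the R language.
    uses_r = "R" in tools
    uses_python = "Python" in tools

    if uses_r and uses_python:
        return "Both"
    if uses_r:
        return "R only"
    if uses_python:
        return "Python only"
    return "Neither"


# Hypothetical example rows, not taken from the survey file.
responses = [
    "Python,SQL,TensorFlow",
    "R,SQL",
    "Python,R,Tableau",
    "SAS Base,Microsoft Excel Data Mining",
]
labels = [classify_language_use(r) for r in responses]
```

Matching whole tokens rather than substrings is a deliberate choice here: a substring search for "R" would match almost every tool name.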

7.3 Exploratory Data Analysis (EDA)

## [1] 16716   229

## [1] 7955  229

Let’s examine the above graph of LanguageRecommendationSelect

## [1] 7955    5

7.4 Results of Exploring R vs. Python

We found that a little under half of the survey takers (N=3540, 44.5%) reported using both R and Python. The take-home message for aspiring data scientists is that both languages are widely used: the largest single group of Kaggle survey takers uses the two together. Among the remaining respondents, a small portion (N=714, 8.98%) use neither Python nor R. The rest use either R or Python: 2533 (31.84%) indicated using only Python, while only 1168 (14.68%) reported using only R.
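The percentages quoted above follow directly from the four group counts. As a quick check, here is a small Python sketch that recomputes them; the counts are the ones reported in the text, while the variable names are illustrative.

```python
# Group counts as reported in the text (N = 7955 respondents in total).
counts = {"Both": 3540, "Python only": 2533, "R only": 1168, "Neither": 714}

total = sum(counts.values())

# Share of each group as a percentage of all respondents, to two decimals.
shares = {group: round(100 * n / total, 2) for group, n in counts.items()}

# How many Python-only users there are for every R-only user.
python_to_r_ratio = round(counts["Python only"] / counts["R only"], 2)
```

The four counts sum to 7955, matching the row count shown in the EDA output, and the Python-only group is a little more than twice the size of the R-only group.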

The story of this contentious R vs. Python debate gets more interesting when we compare the languages respondents use with the languages they recommend. It is plausible to assume that Python users will recommend Python while R users will recommend R. We explore this hypothesis by comparing how R users split their recommendations between R and Python with how Python users split theirs.

Our results revealed that 72.17% of the Python users recommended Python while 53.77% of R users recommended R. This result is not surprising: there are more Python-only users than R-only users in this sample, so it makes sense for the recommendations to differ, since a different proportion of each group knows only the one language. What is surprising is the degree of difference in their recommendations for the other language: 15.92% of the R users recommend Python whereas only 1.42% of the Python users recommend R.

However, these results should be interpreted carefully because some survey takers did not make any recommendation. For instance, 18.87% of the Python users in the sample did not respond to this question. Similarly, 17.55% of R users did not give an opinion on their recommended language. This is a sizable portion of the sample, and if these users were to make recommendations, it is possible that more Python users would be recommending R.

Since nearly half of the sample consists of respondents who use both R and Python, their recommendation is particularly valuable: they have experience with both languages. Of this subset, 51.72% recommend Python while 25.65% recommend R. In other words, about a quarter of those who use both languages recommend R over Python.
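The comparison in this section amounts to row percentages of a cross-tabulation: for each usage group, what share recommends each language. Below is a hedged Python sketch of that calculation. The function name `recommendation_shares` and the example rows are hypothetical, and treating a missing recommendation as its own "No answer" category is an assumption made here so that non-responders stay in the denominator, as they do in the percentages above.

```python
from collections import Counter, defaultdict


def recommendation_shares(rows):
    """Return {usage_group: {recommendation: percent of that group}}."""
    counts = defaultdict(Counter)
    for usage_group, recommendation in rows:
        # Keep non-responses as a category so they remain in the denominator.
        counts[usage_group][recommendation or "No answer"] += 1

    shares = {}
    for usage_group, counter in counts.items():
        group_total = sum(counter.values())
        shares[usage_group] = {
            rec: round(100 * n / group_total, 2) for rec, n in counter.items()
        }
    return shares


# Hypothetical (usage group, recommended language) pairs, not survey data.
rows = [
    ("Python only", "Python"),
    ("Python only", "Python"),
    ("Python only", None),
    ("Python only", "R"),
    ("R only", "R"),
    ("R only", "Python"),
]
shares = recommendation_shares(rows)
```

Dropping the non-responders instead would raise every remaining percentage, which is why the caveat about unanswered recommendations matters when reading the figures.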

8 Salary Comparison for Python vs. R

Finally, true to the word “value,” pay has to be considered. The compensation survey takers receive for their work in either R or Python needs to be quantified to discover which language earns a data scientist more overall.

8.1 Contributing Variables

The question is primarily concerned with three variables:

WorkToolsSelect: This was a “select all that apply” variable with a list of various data science tools, technologies, and languages. Survey takers identified which tools were utilized in their work and the results were stored in this comma separated variable.
CompensationAmount: This was a numerical value to be entered by the survey taker indicating their annual pay.
CompensationCurrency: This was a simple character string that stated what currency the survey taker was receiving their annual salary in.

There was also the variable “id” that was created for the purpose of this report, acting as a way to identify each individual survey taker, and the variable “work_tools” which was a derivative of WorkToolsSelect, breaking the lists down into their individual components.

8.2 Preparing the data

In order to answer this question, the data first had to be transformed. All users who did not answer the question on the tools they use at work had their responses discarded. Similarly, all users whose compensation was not in US Dollars were disregarded in order to home in on a single socio-economic focus, the US market. A separate table was created pairing survey takers with each of the languages or methods used in their job via the id variable (a number assigned to each survey taker) and the work_tools variable, which stored each item in the list provided by WorkToolsSelect in its own row, matched to the id of the survey taker who provided it. Using this separate table, survey takers who used both Python and R in their jobs were removed, as were any who used neither, leaving a list of individuals who used Python or R, but not both, in their career. Lastly, the compensation amounts were converted to a numeric format, and any rows with an unlikely compensation amount (north of ten million annually) were removed to prevent such severe outliers from skewing the data.
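The preparation steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the report's actual code: the helper names `parse_amount` and `prepare` are hypothetical, the currency code "USD" and the comma-formatted amounts are assumptions about how the survey file stores these fields, and dropping rows whose amount cannot be parsed is an extra assumption not stated in the text.

```python
def parse_amount(raw):
    """Convert a compensation string such as '120,000' to a float.

    Returns None when the value is missing or not a number.
    """
    try:
        return float(str(raw).replace(",", "").strip())
    except ValueError:
        return None


def prepare(rows, max_amount=10_000_000):
    """Keep USD-paid respondents who use exactly one of R or Python."""
    kept = []
    for survey_id, row in enumerate(rows, start=1):
        tools_raw = row.get("WorkToolsSelect") or ""
        if not tools_raw.strip():
            continue  # no answer to the work-tools question
        if row.get("CompensationCurrency") != "USD":  # assumed currency code
            continue
        # One entry per tool, mirroring the report's work_tools variable.
        work_tools = {tool.strip() for tool in tools_raw.split(",")}
        uses_r = "R" in work_tools
        uses_python = "Python" in work_tools
        if uses_r == uses_python:
            continue  # uses both languages, or neither
        amount = parse_amount(row.get("CompensationAmount"))
        if amount is None or amount > max_amount:
            continue  # unparseable amount (assumption) or extreme outlier
        kept.append({
            "id": survey_id,
            "language": "R" if uses_r else "Python",
            "compensation": amount,
        })
    return kept


# Hypothetical rows for illustration only.
rows = [
    {"WorkToolsSelect": "Python,SQL", "CompensationAmount": "120,000", "CompensationCurrency": "USD"},
    {"WorkToolsSelect": "R", "CompensationAmount": "95000", "CompensationCurrency": "USD"},
    {"WorkToolsSelect": "Python,R", "CompensationAmount": "80000", "CompensationCurrency": "USD"},
    {"WorkToolsSelect": "", "CompensationAmount": "70000", "CompensationCurrency": "USD"},
    {"WorkToolsSelect": "R", "CompensationAmount": "50000", "CompensationCurrency": "EUR"},
    {"WorkToolsSelect": "Python", "CompensationAmount": "99,999,999", "CompensationCurrency": "USD"},
]
prepared = prepare(rows)
```

The id is assigned before any filtering so that each surviving row can still be traced back to its position in the original file.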

8.3 Exploration and Review of Compensations

As we can see from the boxplots, there is a relatively normal distribution in the compensation of those who solely used Python in their work. Conversely, there is a noticeable right skew in the compensation of those who solely used R in their work.

	Minimum	1st Quartile	Median	Mean	3rd Quartile	Maximum	Standard Deviation
Python	$0.00	$53,000.00	$100,000.00	$112,826.14	$145,000.00	$2,000,000.00	$122,425.21
R	$0.00	$58,000.00	$87,000.00	$98,177.64	$130,000.00	$550,000.00	$67,487.91

The average survey taker who used Python in their job made approximately $14,648.50 more than the average survey taker who used R in their job. While R users had a higher base pay at the low end (a first quartile $5,000.00 above that of their Python counterparts), their ability to achieve growth in salary was noticeably stymied in comparison. Outliers aside, if the data collected are considered representative of the data science population, there is some indication that a prospective Data Scientist should learn R first for a higher initial salary, and then learn Python to increase their chance of obtaining a job with more growth potential.
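The gaps discussed above can be reproduced from the summary table with simple arithmetic. The following Python sketch uses the table's values directly; the dictionary layout and variable names are illustrative only.

```python
# Summary statistics copied from the compensation table (US dollars).
summary = {
    "Python": {"q1": 53_000.00, "median": 100_000.00, "mean": 112_826.14, "q3": 145_000.00},
    "R": {"q1": 58_000.00, "median": 87_000.00, "mean": 98_177.64, "q3": 130_000.00},
}

# Python's advantage in mean and median pay.
mean_gap = round(summary["Python"]["mean"] - summary["R"]["mean"], 2)
median_gap = summary["Python"]["median"] - summary["R"]["median"]

# R's advantage at the first quartile (the lower end of the pay range).
q1_gap = summary["R"]["q1"] - summary["Python"]["q1"]

# Interquartile range of each group, a rough measure of spread.
iqr = {lang: stats["q3"] - stats["q1"] for lang, stats in summary.items()}
```

R is ahead only at the first quartile; at the median, mean, and third quartile the Python group earns more, and its interquartile range is wider.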

9 Conclusion

On the Machine Learning/Data Science methods Data Scientists are most excited about learning in the next year

Deep Learning is the top Machine Learning/Data Science method in all categories of formal education, followed by Neural Nets.
Most groups want to learn Time Series Analysis as their third method. High school graduates instead chose Genetic & Evolutionary Algorithms as their third choice.
Among doctoral survey takers, Bayesian Methods is the third preference.

On Data Science Methods Used on the Job

Data Visualization is a remarkably popular data science method. It is the most endorsed by nearly all education attainment levels.
Cross validation, random forests, logistic regression, and decision trees are also heavily endorsed.
These are not just brief tasks but required or essential ones: not only do many of those writing code use data visualization, they also use it quite frequently.
The data suggest that data science methods do not differ much between formal educational attainment groups.

‘Learners’ vs. Employed Data Scientists

Both Learners and employed Data Scientists agree that Courses, Projects and College are in the top three ways to learn Data Science
‘Learners’ place much less importance on Friends for learning than employed Data Scientists do
‘Learners’ place a higher importance on Python vs. R as compared with employed Data Scientists
Data Visualization is the top used skill for working Data Scientists even though learners put relatively little importance on it as a Job Skill

On Data Science Activities

Gathering Data is the main activity where data scientists spend most of their time followed by model building.
Personal projects and Online Courses appear to be very useful learning platforms.

On the contentious debate on R vs Python as the most valued Data Science Skills

Nearly half of the sample (44.5%) uses both R and Python
R-only and Python-only users are in a ratio of roughly 1:2
A larger share of R users recommended Python than Python users recommended R
Users of both languages recommend Python more often than they recommend R
R users are more likely to have a higher base salary, but Python users have the greater potential for wage growth

Project 3

Gabrielle Bartomeo, Binish Chandy, Zach Dravis, Burcu Kaniskan, Niteen Kumar, Betsy Rosalen

March 25, 2018