1 Importing data

The data set used in this project is the Kaggle ML and Data Science Survey 2017. The survey results were stored in two separate data sets: a) multiple choice items and b) free-response items, each in CSV format. We downloaded the multiple choice results as a CSV file and placed it in our GitHub repo.

Importing Multiple Choice data
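As a minimal sketch of this import step, the snippet below reads a CSV of multiple choice responses with pandas. It uses Python, which is not necessarily the tooling used in this project. The helper name `load_multiple_choice`, the ISO-8859-1 encoding, and the sample rows are assumptions for illustration; in practice `source` would be the path or URL of the file in the repo.

```python
import io

import pandas as pd


def load_multiple_choice(source):
    """Read the multiple choice survey responses into a DataFrame.

    `source` may be a file path, a URL, or a binary file-like object.
    """
    # Assumption: the file is not valid UTF-8, so a Latin-1 decoder is used.
    # low_memory=False makes pandas infer column types from the whole file.
    return pd.read_csv(source, encoding="ISO-8859-1", low_memory=False)


# Tiny in-memory stand-in for the real file (hypothetical rows).
sample_csv = io.BytesIO(
    (
        "CodeWriter,FormalEducation,WorkMethodsSelect\n"
        'Yes,Master\'s degree,"Data Visualization,Logistic Regression"\n'
        "No,Bachelor's degree,\n"
    ).encode("ISO-8859-1")
)

survey = load_multiple_choice(sample_csv)
print(survey.shape)
```

An empty `WorkMethodsSelect` cell is read as a missing value, which matters for the filtering described later.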

2 Research Question

This project will answer the global research question: Which are the most valued data science skills?

3 Data Science Methods

What are the most frequently used data science (DS) methods by those writing code in DS professions? Do those relate to formal educational attainment?

The Kaggle dataset provides multiple different variables to assess what the most valuable data science skills may be. In the previous section, we examined what data science methods learners are most excited about and working on. In this section, we’ll look at which data science methods are the most frequently used and whether that has any relationship to educational attainment, a potential indicator of whether certain methods require advanced academic training.

3.1 Variables and their definition

The following variables label questions asking survey respondents how often they use each of these data science methods. Response options were: Rarely, Sometimes, Often, Most of the time.

WorkMethodsFrequencyA/B
WorkMethodsFrequencyAssociationRules
WorkMethodsFrequencyBayesian
WorkMethodsFrequencyCNNs
WorkMethodsFrequencyCollaborativeFiltering
WorkMethodsFrequencyCross-Validation
WorkMethodsFrequencyDataVisualization
WorkMethodsFrequencyDecisionTrees
WorkMethodsFrequencyEnsembleMethods
WorkMethodsFrequencyEvolutionaryApproaches
WorkMethodsFrequencyGANs
WorkMethodsFrequencyGBM
WorkMethodsFrequencyHMMs
WorkMethodsFrequencyKNN
WorkMethodsFrequencyLiftAnalysis
WorkMethodsFrequencyLogisticRegression
WorkMethodsFrequencyMLN
WorkMethodsFrequencyNaiveBayes
WorkMethodsFrequencyNLP
WorkMethodsFrequencyNeuralNetworks
WorkMethodsFrequencyPCA
WorkMethodsFrequencyPrescriptiveModeling
WorkMethodsFrequencyRandomForests
WorkMethodsFrequencyRecommenderSystems
WorkMethodsFrequencyRNNs
WorkMethodsFrequencySegmentation
WorkMethodsFrequencySimulation
WorkMethodsFrequencySVMs
WorkMethodsFrequencyTextAnalysis
WorkMethodsFrequencyTimeSeriesAnalysis

The additional variables used for this analysis will include:

Formal Education

3.2 Manipulating data

In order to answer the question of which methods are most popular among code writers, several transformations must first be done. First, we filter the dataset down to only those who were classified as code writers: those who were employed in some capacity working in data science and writing code as part of their job duties. Additionally, we include only participants who endorsed at least one data science skill on the question, “At work, which data science methods do you use? (Select all that apply)”, with variable name WorkMethodsSelect.
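The filtering step can be sketched as follows, in Python with pandas (not necessarily the tooling used in this project). The column name `CodeWriter` with the value "Yes" is an assumption about how code writers are flagged, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; CodeWriter == "Yes" is an assumed flag for code writers.
survey = pd.DataFrame({
    "CodeWriter": ["Yes", "Yes", "No", "Yes"],
    "WorkMethodsSelect": [
        "Data Visualization,Logistic Regression",
        None,                 # code writer who endorsed no method
        "Decision Trees",     # endorsed a method but is not a code writer
        "Cross-Validation",
    ],
})


def filter_code_writers(df):
    """Keep code writers who endorsed at least one data science method."""
    is_code_writer = df["CodeWriter"] == "Yes"
    # Treat missing and blank answers as "no method endorsed".
    endorsed_any = df["WorkMethodsSelect"].fillna("").str.strip() != ""
    return df[is_code_writer & endorsed_any].reset_index(drop=True)


code_writers = filter_code_writers(survey)
print(len(code_writers))
```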

Once filtered, the endorsed data science methods were aggregated and their frequencies plotted (see Exploratory Data Analysis below). The five most frequently endorsed methods were then selected and each given a frequency score representing how often the respondents who endorse a method actually use it.
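One way to aggregate a select-all-that-apply answer is to split each response into individual methods and count them. The sketch below is in Python with pandas and assumes that the selections are stored as a single comma-separated string per respondent; the responses shown are hypothetical.

```python
import pandas as pd

# Hypothetical responses, one comma-separated string per respondent.
responses = pd.Series([
    "Data Visualization,Logistic Regression",
    "Data Visualization,Decision Trees",
    "Logistic Regression,Data Visualization",
])


def count_endorsements(selections):
    """Count how many respondents endorsed each method."""
    # One row per (respondent, method) pair, with stray whitespace removed.
    exploded = selections.str.split(",").explode().str.strip()
    # value_counts sorts from most to least endorsed.
    return exploded.value_counts()


counts = count_endorsements(responses)
top_two = counts.head(2)
print(top_two)
```

Taking `head(5)` of the resulting counts would give the five most endorsed methods.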

The final transformation performed on the data was grouping by formal education level and then identifying the most frequently endorsed data science methods for each group. This can help identify whether those writing certain types of code and using certain data analyses may benefit from pursuing advanced education, a valuable insight for prospective data science students.
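A sketch of this grouping step is shown below, in Python with pandas. It assumes a `FormalEducation` column and comma-separated selections in `WorkMethodsSelect`, and it computes relative frequency as a method's count divided by all endorsements in the same education group. That definition is an assumption, not a documented detail of the analysis, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows with shortened education labels.
survey = pd.DataFrame({
    "FormalEducation": ["Master's", "Master's", "Bachelor's", "Bachelor's"],
    "WorkMethodsSelect": [
        "Data Visualization,Logistic Regression",
        "Data Visualization,Cross-Validation",
        "Data Visualization,Decision Trees",
        "Decision Trees",
    ],
})


def top_methods_by_education(df, n=3):
    """Return the n most endorsed methods within each education group."""
    # One row per (respondent, method) pair.
    long = df.assign(
        Selections=df["WorkMethodsSelect"].str.split(",")
    ).explode("Selections", ignore_index=True)
    long["Selections"] = long["Selections"].str.strip()

    freq = (
        long.groupby(["FormalEducation", "Selections"])
        .size()
        .reset_index(name="Freq")
    )
    # Share of all endorsements made within the same education group.
    group_total = freq.groupby("FormalEducation")["Freq"].transform("sum")
    freq["RelativeFreq"] = freq["Freq"] / group_total

    freq = freq.sort_values(
        ["FormalEducation", "Freq"], ascending=[True, False]
    )
    return freq.groupby("FormalEducation").head(n).reset_index(drop=True)


top = top_methods_by_education(survey, n=1)
print(top)
```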

3.3 Exploratory Data Analysis (EDA)

Following manipulation of the Kaggle data set, we created plots to explore the research questions above. First, here is a look at the frequency with which the following data science methods were endorsed by a total of 7,773 respondents. Nearly two-thirds of the respondents endorsed the first-place skill, data visualization. Over half endorsed logistic regression, and just under half endorsed cross-validation and decision trees.

Options	Freq
Data Visualization	5022
Logistic Regression	4291
Cross-Validation	3868
Decision Trees	3695
Random Forests	3454
Time Series Analysis	3153
Neural Networks	2811
PCA and Dimensionality Reduction	2789
kNN and Other Clustering	2624
Text Analytics	2405
Ensemble Methods	2056
Segmentation	2050
SVMs	1973
Natural Language Processing	1949
A/B Testing	1936
Bayesian Techniques	1913
Naive Bayes	1902
Gradient Boosted Machines	1557
CNNs	1417
Simulation	1398
Recommender Systems	1158
Association Rules	1146
RNNs	891
Prescriptive Modeling	851
Collaborative Filtering	793
Lift Analysis	650
Evolutionary Approaches	436
HMMs	419
Other	391
Markov Logic Networks	255
GANs	244
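The shares quoted for the leading methods follow directly from the counts in the table and the total of 7,773 respondents. As a quick arithmetic check in Python (the variable names are illustrative):

```python
# Counts taken from the table above; total respondents as stated in the text.
TOTAL_RESPONDENTS = 7773
endorsements = {
    "Data Visualization": 5022,
    "Logistic Regression": 4291,
    "Cross-Validation": 3868,
    "Decision Trees": 3695,
}

# Share of respondents endorsing each method, rounded to three decimals.
shares = {
    method: round(count / TOTAL_RESPONDENTS, 3)
    for method, count in endorsements.items()
}

for method, share in shares.items():
    print(f"{method}: {share:.1%}")
```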

The following plot graphically displays the frequency of endorsements for the data science methods asked about.

This plot shows the “Frequency Score” for the five most endorsed data science methods. It’s important to go beyond endorsement, because the table and plot above only consider whether a respondent uses a method at all. Just because a method is endorsed doesn’t mean it is used frequently; it may be a rarely used but essential method. To get a more fine-grained understanding of how often a method is used on the job, the Kaggle survey followed up each endorsed method by asking respondents whether they use it Rarely, Sometimes, Often, or Most of the time. We converted these responses to numeric values (Rarely = 1, Sometimes = 2, Often = 3, Most of the time = 4) so that the categorical responses could be averaged into a score and graphed.
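The conversion and averaging can be sketched as below, in Python with pandas. The numeric mapping is the one described above; the answers are hypothetical, and skipping missing answers (respondents who were not asked about a method) is an assumption about how non-responses are handled.

```python
import pandas as pd

# Mapping from the categorical responses to numeric scores, as described above.
FREQUENCY_SCORES = {
    "Rarely": 1,
    "Sometimes": 2,
    "Often": 3,
    "Most of the time": 4,
}

# Hypothetical answers; None marks a respondent who did not endorse the method.
answers = pd.DataFrame({
    "WorkMethodsFrequencyDataVisualization": [
        "Often", "Most of the time", None, "Sometimes",
    ],
    "WorkMethodsFrequencyLogisticRegression": [
        "Rarely", None, "Sometimes", "Often",
    ],
})


def frequency_scores(df):
    """Average numeric frequency score per method, ignoring missing answers."""
    numeric = df.apply(lambda col: col.map(FREQUENCY_SCORES))
    return numeric.mean()  # missing values are skipped by default


scores = frequency_scores(answers)
print(scores)
```

Because only respondents who endorsed a method contribute to its average, the score describes how often users of the method rely on it, not how widespread the method is.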

Of the top five data science methods endorsed, data visualization was the skill indicated to be used the most frequently.

The plots below show the frequency of methods endorsed for each formal education level assessed by Kaggle.

We see that in the majority of educational attainment brackets, data visualization remains the most frequently endorsed data science method.

The same information is also provided in tabular format:

Selections	Freq	RelativeFreq	Degree
Data Visualization	1236	0.0944232	Bachelor’s Education
Logistic Regression	989	0.0755539	Bachelor’s Education
Decision Trees	847	0.0647059	Bachelor’s Education
Data Visualization	1129	0.0756348	Doctoral Education
Cross-Validation	1046	0.0700744	Doctoral Education
Logistic Regression	1031	0.0690695	Doctoral Education
Neural Networks	24	0.0808081	High School Education
Data Visualization	23	0.0774411	High School Education
Text Analytics	18	0.0606061	High School Education
Data Visualization	2331	0.0835873	Master’s Education
Logistic Regression	2022	0.0725069	Master’s Education
Cross-Validation	1821	0.0652992	Master’s Education
Data Visualization	150	0.0897129	Professional Education
Logistic Regression	121	0.0723684	Professional Education
Decision Trees	119	0.0711722	Professional Education
Data Visualization	137	0.1008837	Some Post Secondary Education
Logistic Regression	97	0.0714286	Some Post Secondary Education
Decision Trees	87	0.0640648	Some Post Secondary Education

The research question of which data science skills are the most important can be interpreted and answered in many ways. One way to explore this deceptively complex question is to analyze which data science methods code writers endorse using on the job. This analysis did just that, and further explored the five most endorsed data science methods by examining how frequently those who endorsed them actually use them on the job.

The bottom line of this analysis is that data visualization, logistic regression, cross-validation, decision trees, and random forests are not only frequently endorsed methods but also essential ones that are put to regular use. Across data science code writers these methods are popular, and the individual code writers who endorse them report using them frequently.

The second goal of this analysis was to understand how formal educational attainment relates to the data science methods used on the job. Looking at the plots for each educational level and the table combining all of that data, it does not seem that the data science methods used by code writers differ by educational level. Data visualization remains the most frequently endorsed data science method for the majority of educational groups. This has important implications for students of data science: certain popular job functions are not performed only by those with advanced degrees. It also speaks to how crucial data visualization and the other frequently endorsed and commonly used methods are to data science as a whole.

Project3

Gabrielle Bartomeo, Binish Chandy, Zach Dravis, Burcu Kaniskan, Niteen Kumar, Betsy Rosalen

March 25, 2018