The data set used in this project is the Kaggle ML and Data Science Survey 2017. The survey results were stored in two separate data sets: (a) multiple-choice items and (b) free-response items, each in CSV format. We downloaded the multiple-choice survey results as a CSV file and placed it in our GitHub repo.
Importing Multiple Choice data
linkMC<-"https://raw.githubusercontent.com/betsyrosalen/DATA_607_Project_3/master/project3_master/rawdata/multipleChoiceResponses.csv"
# Import the multiple-choice items (read_csv() comes from readr, loaded with the tidyverse)
library(tidyverse)
MC <- read_csv(linkMC)
survey.data <- MC
# Create a unique ID variable
MC$id <- seq.int(nrow(MC))
# Ignore this code: it imports the conversion rates data in case we want to analyze it later
# link_conversion<-"https://raw.githubusercontent.com/betsyrosalen/DATA_607_Project_3/master/project3_master/rawdata/conversionRates.csv"
# #importing MC items
# conversion<-read_csv (link_conversion)
# dim(conversion)
# #lets create a unique ID variable
# conversion$id <- seq.int(nrow(conversion))
This project will answer the global research question: Which are the most valued data science skills?
We start by exploring the resources that Kaggle survey respondents use to learn data science. What does a day in a data scientist's life look like, where do users go to learn, and how do they feel about different learning platforms?
Insights into the demographics: how respondents are distributed across countries, along with some interesting facts about country-wise gender distribution
To begin with, we focused on respondent demographics to understand their age groups and gender. After analyzing the variables GenderSelect and Age, it appears that out of 16716 global Kaggle respondents there are 13610 males and 2778 females. In this subset, male respondents outnumber female respondents almost 5 to 1 (~4.9 times). Also, from the above plot it is evident that respondents' ages peak at 25 for both males and females, whereas the median age stays at 30.
Since we are trying to determine what the most important data science skills are, it is very important to understand what a data scientist does: which activities a data scientist performs on a daily basis, and how much time each activity typically takes.
Let’s take a peek at a day in a data scientist’s life and try to figure out what a data scientist does.
The day typically starts with a question or business problem and involves the following activities/tasks:
Kaggle captured respondents' data and categorized it by the time spent on each activity. To analyze this question we looked at the attributes TimeGatheringData, TimeModelBuilding, TimeProduction, TimeVisualizing, and TimeFindingInsights.
To determine the usefulness of learning platforms, we tidied the data for the 18 learning platform attributes present in the data set and performed the analysis on the data in long format. We also manipulated the data to find users' sentiments/remarks about platform usefulness.
After analyzing the data for US respondents, it appears that data acquisition (gathering data) is the main activity, at 37.75%, and is where a data scientist spends the most time. Model building ranks 2nd at 19.23%, followed by time spent finding insights and visualizing data. Only 10.23% of the total appears to be taken up by production activities.
DSActivity | mean_percent |
---|---|
TimeGatheringData | 37.75491 |
TimeModelBuilding | 19.23263 |
TimeFindingInsights | 14.50524 |
TimeVisualizing | 13.74509 |
TimeProduction | 10.23198 |
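A table like the one above can be produced by averaging each time-allocation column over the respondents who answered it. The following is only a minimal sketch in base R: the `time_df` values are invented for illustration, and only the column names are taken from the survey attributes named earlier.

```r
# Toy stand-in for the survey's time-allocation columns (values are invented)
time_df <- data.frame(
  TimeGatheringData   = c(40, 30, NA, 50),
  TimeModelBuilding   = c(20, 30, 10, 20),
  TimeVisualizing     = c(10, 20, 15, NA),
  TimeFindingInsights = c(20, 10, 20, 6),
  TimeProduction      = c(10, 10, NA, 20)
)

# Average each activity over the respondents who answered it
mean_percent <- colMeans(time_df, na.rm = TRUE)

# Arrange as a two-column table, largest share first
activity_summary <- data.frame(
  DSActivity   = names(mean_percent),
  mean_percent = unname(mean_percent)
)
activity_summary <- activity_summary[order(-activity_summary$mean_percent), ]
print(activity_summary)
```

Because missing answers are dropped per column, each mean is taken over a slightly different set of respondents, so the shares need not sum to exactly 100.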
Whether one is employed full time, part time or a student, it's worth exploring how people use different learning platforms and how they feel about them. We made use of the learning platform attributes captured in the dataset, which include Kaggle itself as a learning platform.
lid | Country | EmploymentStatus | LPlatform | LP_count | LearningPlatform |
---|---|---|---|---|---|
1 | United States | Not employed, but looking for work | LearningPlatformUsefulnessKaggle | Somewhat useful | Kaggle |
3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessBlogs | Very useful | Blogs |
3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessCollege | Very useful | College |
3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessConferences | Very useful | Conferences |
3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessFriends | Very useful | Friends |
3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessDocumentation | Very useful | Documentation |
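The long format shown above has one row per respondent and platform. As a rough sketch of that reshaping step in base R (the toy `wide` data frame and the `Usefulness` column name are assumptions made for illustration; in a tidyverse workflow `tidyr::pivot_longer()` would be the more usual choice):

```r
# Toy wide data: one usefulness column per learning platform (values are invented)
wide <- data.frame(
  id = c(1, 3),
  Country = c("United States", "United States"),
  LearningPlatformUsefulnessKaggle = c("Somewhat useful", NA),
  LearningPlatformUsefulnessBlogs  = c(NA, "Very useful"),
  stringsAsFactors = FALSE
)
platform_cols <- grep("^LearningPlatformUsefulness", names(wide), value = TRUE)

# Stack one block of rows per platform column
long <- do.call(rbind, lapply(platform_cols, function(col) {
  data.frame(
    id = wide$id,
    Country = wide$Country,
    LPlatform = col,
    Usefulness = wide[[col]],
    stringsAsFactors = FALSE
  )
}))

# Drop unanswered items and derive the short platform name
long <- long[!is.na(long$Usefulness), ]
long$LearningPlatform <- sub("^LearningPlatformUsefulness", "", long$LPlatform)
print(long)
```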
After analyzing respondents' views on the different learning platforms, it appears that learners benefited most from personal projects, as the majority of responses rate them very useful. Online courses rank second, followed by StackOverflow and Kaggle. Blogs, textbooks and college also appear to be very useful, whereas newsletters, podcasts and trade books rank low.
Next, we examine what these survey takers of various educational backgrounds find themselves excited to learn. Due to the ever-evolving nature of technology and, by extension, data science, it is imperative that they remain relevant in their field and are passionate in their pursuit for relevance.
Does a survey taker’s formal education have any relationship to the ML/DS method he or she is most excited about learning in the next year?
To do the analysis, we concentrate on two columns in the dataset:
1. FormalEducation: Which level of formal education have you attained?
2. MLMethodNextYearSelect: Which ML/DS method are you most excited about learning in the next year?
These questions were asked of all participants.
First we plot the distribution of formal education in the dataset
The data set predominantly contains respondents with a Master’s degree, followed by those with a Bachelor’s degree and then a doctoral degree.
Now let’s look at the different ML/DS methods in the dataset
ML/DS Methods |
---|
Random Forests |
Deep learning |
Neural Nets |
Text Mining |
Genetic & Evolutionary Algorithms |
Link Analysis |
Rule Induction |
Regression |
Proprietary Algorithms |
I don’t plan on learning a new ML/DS method |
Ensemble Methods (e.g. boosting, bagging) |
Factor Analysis |
Social Network Analysis |
Monte Carlo Methods |
Time Series Analysis |
Other |
Bayesian Methods |
Survival Analysis |
MARS |
Anomaly Detection |
Cluster Analysis |
Decision Trees |
Association Rules |
Uplift Modeling |
Support Vector Machines (SVM) |
Now we can plot the distribution of ML/DS methods with formal education
Our results revealed that Deep Learning is the top ML/DS method among Kaggle survey takers regardless of their formal education. Interestingly, 40% of respondents with a bachelor’s degree and 40% of those with a master’s degree stated that Deep Learning is the technique they are most excited about learning in the next year. Similarly, 39% of respondents with a high school education named Deep Learning as the method they most want to learn.
Following Deep Learning, Neural Nets emerged as the second most popular ML/DS method that respondents want to learn next year. Intriguingly, college dropouts have the highest percentage choosing Neural Nets.
Time Series Analysis was the third most popular ML/DS method of interest. High school graduates chose Genetic & Evolutionary Algorithms as their third choice.
Among doctoral survey takers, Bayesian Methods is the third preference. This method was not among the top choices of any other group.
The results suggest a clear trend: Deep Learning is the ML/DS method that data scientists most want to learn. As to the global research question of which data science skills are valued the most, the results from this insight suggest that aspiring data scientists should consider learning Deep Learning.
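The comparison above rests on the share of each education group that chose each method. A minimal sketch of that cross-tabulation in base R follows; the six `responses` rows are invented for illustration, and only the two column names come from the survey.

```r
# Toy responses (invented); the real analysis uses the full survey columns
responses <- data.frame(
  FormalEducation = c("Bachelor's degree", "Bachelor's degree", "Master's degree",
                      "Master's degree", "Master's degree", "Doctoral degree"),
  MLMethodNextYearSelect = c("Deep learning", "Neural Nets", "Deep learning",
                             "Deep learning", "Time Series Analysis", "Bayesian Methods"),
  stringsAsFactors = FALSE
)

# Counts of method by education level, then row percentages within each level
counts  <- table(responses$FormalEducation, responses$MLMethodNextYearSelect)
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
print(row_pct)

# Most popular method within each education level (ties go to the first column)
top_method <- colnames(row_pct)[apply(row_pct, 1, which.max)]
names(top_method) <- rownames(row_pct)
print(top_method)
```

Using `margin = 1` normalizes within each education level, so groups of very different sizes can be compared on the same scale.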
What are the most frequently used data science (DS) methods by those writing code in DS professions? Do those relate to formal educational attainment? (Zach)
What are the most frequently used DS methods? Where is the most time spent in terms of working with data? Do either of these correlate with job title or level of education? (Zach)
The following variables label questions asking survey respondents how often they use each of these data science methods. Response options were: Rarely, Sometimes, Often, Most of the time.
The additional variable used for this analysis is FormalEducation.
# Subset data
MethodsData <- MC %>%
  select(FormalEducation, contains("WorkMethods"))
# Keep only respondents who answered the methods questions, which were asked only of coding workers employed in some capacity
MethodsData <- MethodsData[!is.na(MethodsData$WorkMethodsSelect), ]
# Most popular techniques that respondents report using at least rarely
MasterString <- MethodsData$WorkMethodsSelect
# Split each respondent's comma-separated selections and pool them into one vector
Options <- unlist(str_split(MasterString, pattern = ","))
Options <- as.data.frame(table(Options))
MethodsFrequency <- arrange(Options, desc(Freq))
# For respondents who report using a technique, how frequently do they use it?
# Scale: Rarely = 1, Sometimes = 2, Often = 3, Most of the time = 4
FrequencyLevels <- c("Rarely", "Sometimes", "Often", "Most of the time")
# Recode with match() instead of assigning numbers into character columns,
# which stricter tibble versions can reject
MethodsScored <- MethodsData %>%
  select(-c(WorkMethodsSelect)) %>%
  # match() returns each label's position on the scale (NA if unanswered)
  mutate(across(starts_with("WorkMethodsFrequency"), ~ match(.x, FrequencyLevels)))
# Average score per method among the respondents who rated it
AvgFreqDataVis <- mean(MethodsScored$WorkMethodsFrequencyDataVisualization, na.rm = TRUE)
AvgFreqLogRegress <- mean(MethodsScored$WorkMethodsFrequencyLogisticRegression, na.rm = TRUE)
AvgFreqCrossValid <- mean(MethodsScored$`WorkMethodsFrequencyCross-Validation`, na.rm = TRUE)
AvgFreqDecisionTrees <- mean(MethodsScored$WorkMethodsFrequencyDecisionTrees, na.rm = TRUE)
AvgFreqRandomForest <- mean(MethodsScored$WorkMethodsFrequencyRandomForests, na.rm = TRUE)
Top5FreqScore <- data.frame(Method = c("Data Visualization", "Logistic Regression", "Cross Validation", "Decision Trees", "Random Forests"), FrequencyScore = c(AvgFreqDataVis, AvgFreqLogRegress, AvgFreqCrossValid, AvgFreqDecisionTrees, AvgFreqRandomForest))
#Group and create methods frequency for each level of educational attainment
EducationData <- MethodsData %>%
select(FormalEducation, WorkMethodsSelect) %>%
filter(FormalEducation != "I prefer not to answer") %>%
na.omit()
HighSchoolEducation <- EducationData %>%
filter(FormalEducation == "I did not complete any formal education past high school")
SomePostSecondaryEducation <- EducationData %>%
filter(FormalEducation == "Some college/university study without earning a bachelor\'s degree")
ProfessionalEducation <- EducationData %>%
filter(FormalEducation == "Professional degree")
BachelorsEducation <- EducationData %>%
filter(FormalEducation == "Bachelor's degree")
MastersEducation <- EducationData %>%
filter(FormalEducation == "Master's degree")
DoctoralEducation <- EducationData %>%
filter(FormalEducation == "Doctoral degree")
MethodFrequencyFunction <- function(x){
  # Split each respondent's comma-separated selections and pool them into one vector
  Selections <- unlist(str_split(x$WorkMethodsSelect, pattern = ","))
  TempDF <- as.data.frame(table(Selections))
  # Share of all endorsements within this education group
  TempDF <- TempDF %>%
    mutate(RelativeFreq = Freq / sum(Freq))
  return(TempDF)
}
HighSchoolEducation <- MethodFrequencyFunction(HighSchoolEducation)
SomePostSecondaryEducation <- MethodFrequencyFunction(SomePostSecondaryEducation)
ProfessionalEducation <- MethodFrequencyFunction(ProfessionalEducation)
BachelorsEducation <- MethodFrequencyFunction(BachelorsEducation)
MastersEducation <- MethodFrequencyFunction(MastersEducation)
DoctoralEducation <- MethodFrequencyFunction(DoctoralEducation)
MethodsFrequencyPlot <- ggplot(MethodsFrequency, aes(x = reorder(Options, -Freq), y = Freq, fill = Options)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Frequency of Endorsements for DS Methods") +
  # xlab()/ylab() label the x and y aesthetics; coord_flip() only swaps where they are drawn
  xlab("Data Science Method") +
  ylab("Number of Endorsements") +
  guides(fill = "none")
Top5EndorsedFreqScorePlot <- ggplot(Top5FreqScore, aes(x = reorder(Method, -FrequencyScore), y = FrequencyScore, fill = Method)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Average Frequency Scores for Top 5 Endorsed DS Methods") +
  xlab("Data Science Method") +
  ylab("Average Frequency Score")
MethodsFrequencyPlot
Top5EndorsedFreqScorePlot
kable(MethodsFrequency)
Options | Freq |
---|---|
Data Visualization | 5022 |
Logistic Regression | 4291 |
Cross-Validation | 3868 |
Decision Trees | 3695 |
Random Forests | 3454 |
Time Series Analysis | 3153 |
Neural Networks | 2811 |
PCA and Dimensionality Reduction | 2789 |
kNN and Other Clustering | 2624 |
Text Analytics | 2405 |
Ensemble Methods | 2056 |
Segmentation | 2050 |
SVMs | 1973 |
Natural Language Processing | 1949 |
A/B Testing | 1936 |
Bayesian Techniques | 1913 |
Naive Bayes | 1902 |
Gradient Boosted Machines | 1557 |
CNNs | 1417 |
Simulation | 1398 |
Recommender Systems | 1158 |
Association Rules | 1146 |
RNNs | 891 |
Prescriptive Modeling | 851 |
Collaborative Filtering | 793 |
Lift Analysis | 650 |
Evolutionary Approaches | 436 |
HMMs | 419 |
Other | 391 |
Markov Logic Networks | 255 |
GANs | 244 |
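The counts in the table above come from a multi-select question stored as one comma-separated string per respondent. As a small illustration of that counting step, here is a sketch in base R on invented `selections`; the column names `Options` and `Freq` mirror the table, everything else is an assumption.

```r
# Toy multi-select answers: one comma-separated string per respondent (invented)
selections <- c(
  "Data Visualization,Logistic Regression",
  "Data Visualization,Decision Trees,Random Forests",
  NA,
  "Logistic Regression"
)

# Drop respondents who skipped the question, then split and pool all selections
answered <- selections[!is.na(selections)]
methods  <- unlist(strsplit(answered, ",", fixed = TRUE))

# Count endorsements per method and sort from most to least endorsed
freq <- as.data.frame(table(Options = methods), stringsAsFactors = FALSE)
freq <- freq[order(-freq$Freq), ]
print(freq)
```

Splitting each respondent's string separately matters: pasting all strings together without a separator first would fuse the last method of one respondent with the first method of the next.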
# Build one horizontal bar chart of relative method usage for an education group
EducationPlotFunction <- function(df, title){
  ggplot(df, aes(x = Selections, y = RelativeFreq, fill = Selections)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    ggtitle(title) +
    guides(fill = "none")
}
HighSchoolPlot <- EducationPlotFunction(HighSchoolEducation, "High School Education Methods Usage")
SomeSchoolPlot <- EducationPlotFunction(SomePostSecondaryEducation, "Some Post Secondary Education Methods Usage")
ProfessionalPlot <- EducationPlotFunction(ProfessionalEducation, "Professional Education Methods Usage")
BachelorsPlot <- EducationPlotFunction(BachelorsEducation, "Bachelor's Education Methods Usage")
MastersPlot <- EducationPlotFunction(MastersEducation, "Master's Education Methods Usage")
DoctoralPlot <- EducationPlotFunction(DoctoralEducation, "Doctoral Education Methods Usage")
# grid.arrange() comes from the gridExtra package
grid.arrange(HighSchoolPlot, SomeSchoolPlot, ProfessionalPlot, BachelorsPlot, MastersPlot, DoctoralPlot, nrow = 2)
HighSchoolEducation$Degree <- "High School Education"
SomePostSecondaryEducation$Degree <- "Some Post Secondary Education"
ProfessionalEducation$Degree <- "Professional Education"
BachelorsEducation$Degree <- "Bachelor's Education"
MastersEducation$Degree <- "Master's Education"
DoctoralEducation$Degree <- "Doctoral Education"
AllEducations <- rbind(HighSchoolEducation, SomePostSecondaryEducation, ProfessionalEducation, BachelorsEducation, MastersEducation, DoctoralEducation)
Top3EachDegree <- AllEducations %>%
group_by(Degree) %>%
arrange(desc(RelativeFreq)) %>%
slice(1:3)
kable(Top3EachDegree)
Selections | Freq | RelativeFreq | Degree |
---|---|---|---|
Data Visualization | 1236 | 0.0944232 | Bachelor’s Education |
Logistic Regression | 989 | 0.0755539 | Bachelor’s Education |
Decision Trees | 847 | 0.0647059 | Bachelor’s Education |
Data Visualization | 1129 | 0.0756348 | Doctoral Education |
Cross-Validation | 1046 | 0.0700744 | Doctoral Education |
Logistic Regression | 1031 | 0.0690695 | Doctoral Education |
Neural Networks | 24 | 0.0808081 | High School Education |
Data Visualization | 23 | 0.0774411 | High School Education |
Text Analytics | 18 | 0.0606061 | High School Education |
Data Visualization | 2331 | 0.0835873 | Master’s Education |
Logistic Regression | 2022 | 0.0725069 | Master’s Education |
Cross-Validation | 1821 | 0.0652992 | Master’s Education |
Data Visualization | 150 | 0.0897129 | Professional Education |
Logistic Regression | 121 | 0.0723684 | Professional Education |
Decision Trees | 119 | 0.0711722 | Professional Education |
Data Visualization | 137 | 0.1008837 | Some Post Secondary Education |
Logistic Regression | 97 | 0.0714286 | Some Post Secondary Education |
Decision Trees | 87 | 0.0640648 | Some Post Secondary Education |
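The table above keeps the three most-endorsed methods per degree, ranked by their share of that group's endorsements. A minimal base R sketch of the same idea is shown below; the `usage` counts are invented, and only the column names follow the table.

```r
# Toy endorsement counts per degree group (invented numbers)
usage <- data.frame(
  Degree = c(rep("Master's Education", 4), rep("Doctoral Education", 4)),
  Selections = rep(c("Data Visualization", "Logistic Regression",
                     "Cross-Validation", "GANs"), 2),
  Freq = c(50, 40, 30, 5, 30, 20, 25, 5),
  stringsAsFactors = FALSE
)

# Relative frequency within each degree group
usage$RelativeFreq <- usage$Freq / ave(usage$Freq, usage$Degree, FUN = sum)

# Sort by group and descending share, then keep the first three rows per group
usage <- usage[order(usage$Degree, -usage$RelativeFreq), ]
top3 <- do.call(rbind, lapply(split(usage, usage$Degree), head, n = 3))
rownames(top3) <- NULL
print(top3)
```

Note that `RelativeFreq` is a share of endorsements, not of respondents, so a value near 0.09 can still correspond to a method that most people in the group use.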
Is there a difference between what ‘Learners’ think are the important skills to learn and what employed Data Scientists say are the skills and tools they are using?
Those who are new to data science and learning new skills may have different opinions about which tools and methods are most important to learn, and which are actually being used, than those who are already employed in the field. Comparing the answers of ‘Learners’ vs. employed Data Scientists may give us some insight into which skills each group values most and whether or not they are in agreement.
A large portion of the data collected in the Kaggle survey was in the form of Likert scales asking respondents to rate the value, importance, or frequency of use of certain skills, tools and methods on a 3- or 4-point scale. To answer our question about which are the ‘most valued’ data science skills we looked at these scales and analyzed the differences between what ‘Learners’ thought was important to them vs. what employed data scientists say they are using in the field.
An oversight in the survey is that it failed to capture the opinions of those who are employed about what they thought were the most important job skills. Employed respondents were not asked those questions, so comparisons between employed data scientists and learners are a little more difficult. We can only use the data about which tools and methods employed data scientists are actually using, and the frequency of their use, to infer the importance of those tools and methods. The survey also failed to capture those who are employed and ALSO students or learners: employed respondents were not asked whether they were also studying data science. Professional development is critical in a field that is growing and changing as rapidly as data science, so asking employed professionals about their further studies could have been very useful information to have. Working Data Scientists may have better insight into what direction the field is going in than students who are just in the learning phase of their journey.
Out of the 229 variables of data collected in the survey, about 105 of them fall into 5 Likert scales describing “Learning Platform Usefulness” (which was asked of both learners and employed data scientists), “Job Skill Importance” (which only unemployed learners were asked), and “Work Tools Frequency” and “WorkMethodsFrequency” (which only employed data scientists were asked to answer). We narrowed down the data to include only these, as well as a few more basic demographic fields to get a sense of who the respondents are.
We also chose to focus only on respondents who are located in the US since international cultural and technological differences may skew results. Different tools and skills may be more or less valued in different countries, so we thought it best to narrow the focus on one country at a time. Further analysis to see if the findings are consistent across countries would be interesting and worthwhile.
First let’s take a look at the demographics of the survey respondents and what type of jobs they hold.
About 30% of our survey respondents would call themselves “Data Scientists”, with about another 20% calling themselves either “Scientist/Researcher” or “Software Developer/Software Engineer”. So about 50% of our survey respondents fall into these three categories, with the other 50% in other roles.
CurrentJobTitleSelect | total | percent |
---|---|---|
Data Scientist | 644 | 30.51 |
Scientist/Researcher | 225 | 10.66 |
Software Developer/Software Engineer | 212 | 10.04 |
Data Analyst | 185 | 8.76 |
Other | 177 | 8.38 |
Researcher | 138 | 6.54 |
Machine Learning Engineer | 102 | 4.83 |
Engineer | 73 | 3.46 |
Statistician | 71 | 3.36 |
NA | 71 | 3.36 |
Business Analyst | 59 | 2.79 |
Computer Scientist | 47 | 2.23 |
Predictive Modeler | 41 | 1.94 |
Programmer | 21 | 0.99 |
DBA/Database Engineer | 19 | 0.90 |
Operations Research Practitioner | 16 | 0.76 |
Data Miner | 10 | 0.47 |
The Kaggle survey asked respondents if they were learning data science and their student status. 73% of the respondents who said they were learning data science are enrolled in an academic program.
StudentStatus | total | percent |
---|---|---|
Yes | 113 | 73.38 |
No | 41 | 26.62 |
55% of all respondents who said they are learning data science said they are “focused on learning mostly data science skills” with the other 45% saying that “data science is a small part of what I’m focused on learning”.
The ratio of respondents who said they are “focused on learning mostly data science skills” remains the same at about 55% regardless of whether or not they are enrolled in an academic program.
StudentStatus | LearningDataScience | total | percent |
---|---|---|---|
Yes | Yes, I’m focused on learning mostly data science skills | 63 | 55.75 |
Yes | Yes, but data science is a small part of what I’m focused on learning | 50 | 44.25 |
No | Yes, I’m focused on learning mostly data science skills | 23 | 56.10 |
No | Yes, but data science is a small part of what I’m focused on learning | 18 | 43.90 |
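The percentages in the table above are shares within each StudentStatus group rather than shares of all learners. A short sketch that recomputes them from the table's own totals (the shortened answer labels are an abbreviation introduced here for readability):

```r
# Totals copied from the table above; answer labels shortened for readability
counts <- data.frame(
  StudentStatus = c("Yes", "Yes", "No", "No"),
  LearningDataScience = c("Mostly data science", "Small part",
                          "Mostly data science", "Small part"),
  total = c(63, 50, 23, 18),
  stringsAsFactors = FALSE
)

# Percent of each answer within its StudentStatus group
group_total    <- ave(counts$total, counts$StudentStatus, FUN = sum)
counts$percent <- round(100 * counts$total / group_total, 2)
print(counts)
```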
‘Learners’ think that the top 3 best ways to learn data science are through Courses, Projects and College with Arxiv and YouTube coming in 4th and 5th respectively.
Row | Item | low | neutral | high | mean | sd |
---|---|---|---|---|---|---|
11 | LPU.Courses | 0.000000 | 22.22222 | 77.77778 | 2.777778 | 0.4190790 |
12 | LPU.Projects | 1.923077 | 25.00000 | 73.07692 | 2.711539 | 0.4984894 |
3 | LPU.College | 1.851852 | 33.33333 | 64.81481 | 2.629630 | 0.5247208 |
1 | LPU.Arxiv | 0.000000 | 38.88889 | 61.11111 | 2.611111 | 0.5016313 |
17 | LPU.YouTube | 0.000000 | 39.13043 | 60.86957 | 2.608696 | 0.4934352 |
14 | LPU.SO | 0.000000 | 43.47826 | 56.52174 | 2.565217 | 0.4993602 |
10 | LPU.Documentation | 10.526316 | 36.84211 | 52.63158 | 2.421053 | 0.6924826 |
2 | LPU.Blogs | 2.631579 | 47.36842 | 50.00000 | 2.473684 | 0.5568658 |
7 | LPU.Kaggle | 0.000000 | 50.64935 | 49.35065 | 2.493507 | 0.5032363 |
9 | LPU.Communities | 0.000000 | 55.55556 | 44.44444 | 2.444444 | 0.5270463 |
16 | LPU.Tutoring | 6.250000 | 50.00000 | 43.75000 | 2.375000 | 0.6191392 |
15 | LPU.Textbook | 7.142857 | 54.76190 | 38.09524 | 2.309524 | 0.6043781 |
13 | LPU.Podcasts | 6.666667 | 66.66667 | 26.66667 | 2.200000 | 0.5606119 |
4 | LPU.Company | 0.000000 | 75.00000 | 25.00000 | 2.250000 | 0.5000000 |
5 | LPU.Conferences | 14.285714 | 64.28571 | 21.42857 | 2.071429 | 0.6157279 |
8 | LPU.Newsletters | 0.000000 | 80.00000 | 20.00000 | 2.200000 | 0.4216370 |
6 | LPU.Friends | 17.647059 | 64.70588 | 17.64706 | 2.000000 | 0.6123724 |
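Each row of the table above summarizes one 3-point item as the percentage of low, neutral and high answers plus the mean and standard deviation of the 1 to 3 scores. The sketch below reproduces the LPU.Friends row under two assumptions that are not stated in the source: that the row is based on 3, 11 and 3 responses (counts chosen because they match the reported percentages), and that the three answer labels are as written in `levels3`.

```r
# Assumed answer labels for the 3-point usefulness scale
levels3 <- c("Not Useful", "Somewhat useful", "Very useful")

# Assumed counts (3, 11, 3), chosen to match the LPU.Friends row above
answers <- factor(rep(levels3, times = c(3, 11, 3)), levels = levels3)
scores  <- as.integer(answers)  # 1 = low, 2 = neutral, 3 = high

# Percentage of answers at each level, plus mean and sd of the scores
pct <- as.numeric(100 * prop.table(table(answers)))
summary_row <- data.frame(
  low = pct[1], neutral = pct[2], high = pct[3],
  mean = mean(scores), sd = sd(scores)
)
print(summary_row)
```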
Employed Data Scientists agree with unemployed ‘Learners’ that Projects and Courses belong in the top 3, but put College in 5th place (vs. 3rd). They also include Tutoring and SO (Stack Overflow Q&A) in their top 5, with SO coming in 2nd place. YouTube (the learners’ 5th choice) comes in 14th place for employed Data Scientists, and Arxiv (the learners’ 4th choice) is 8th among employed Data Scientists.
Another interesting difference is that the importance of Friends to learning data science is much higher among employed Data Scientists: about 46.6% say that Friends are “Very useful” and 97.5% say that Friends are either “Somewhat useful” or “Very useful”. ‘Learners’, by contrast, put Friends at the absolute bottom of their list, with only 17.6% saying that they are “Very useful” and 82.4% saying that Friends are either “Somewhat useful” or “Very useful”.
This may indicate a need to create a more robust community for Data Science ‘Learners’, who may feel somewhat isolated in their studies vs. employed Data Scientists who presumably have more established work and social networks that involve Data Science.
Row | Item | low | neutral | high | mean | sd |
---|---|---|---|---|---|---|
12 | LPU.Projects | 0.6265664 | 21.30326 | 78.07018 | 2.774436 | 0.4329562 |
14 | LPU.SO | 0.7954545 | 35.45455 | 63.75000 | 2.629545 | 0.4994101 |
11 | LPU.Courses | 1.3586957 | 35.19022 | 63.45109 | 2.620924 | 0.5127461 |
17 | LPU.Tutoring | 3.5928144 | 39.52096 | 56.88623 | 2.532934 | 0.5680704 |
3 | LPU.College | 1.5801354 | 41.53499 | 56.88488 | 2.553047 | 0.5286014 |
7 | LPU.Kaggle | 0.8860759 | 46.96203 | 52.15190 | 2.512658 | 0.5175910 |
15 | LPU.Textbook | 2.3041475 | 47.31183 | 50.38402 | 2.480799 | 0.5442143 |
1 | LPU.Arxiv | 1.8918919 | 48.10811 | 50.00000 | 2.481081 | 0.5368976 |
16 | LPU.TradeBook | 5.6818182 | 44.31818 | 50.00000 | 2.443182 | 0.6037803 |
4 | LPU.Company | 4.6413502 | 47.67932 | 47.67932 | 2.430380 | 0.5825909 |
2 | LPU.Blogs | 1.0159652 | 52.24964 | 46.73440 | 2.457184 | 0.5185329 |
6 | LPU.Friends | 2.5270758 | 50.90253 | 46.57040 | 2.440433 | 0.5459573 |
9 | LPU.Communities | 0.0000000 | 54.36242 | 45.63758 | 2.456376 | 0.4997732 |
18 | LPU.YouTube | 2.4038462 | 52.08333 | 45.51282 | 2.431090 | 0.5420324 |
10 | LPU.Documentation | 3.0470914 | 51.52355 | 45.42936 | 2.423823 | 0.5531604 |
5 | LPU.Conferences | 5.1044084 | 61.48492 | 33.41067 | 2.283063 | 0.5529337 |
8 | LPU.Newsletters | 5.3333333 | 66.66667 | 28.00000 | 2.226667 | 0.5327738 |
13 | LPU.Podcasts | 11.2840467 | 70.42802 | 18.28794 | 2.070039 | 0.5403243 |
‘Learners’ put Python at the top of their list of Job Skills, with 63.6% of respondents saying that it is “Necessary” and 99% saying it is either “Necessary” or “Nice to have”. Only 1% said it was “Unnecessary”. R is surprisingly further down the list: only 34.7% say that R is “Necessary”, a little more than half as many as for Python. Many more agree that it is at least “Nice to have”, however, for a total of 95% in favor of learning R, which is just slightly less than the share in favor of Python.
Surprisingly, “Data Visualization” comes in only slightly above R, with 39% saying that it is “Necessary”, but is actually slightly lower in terms of overall importance, with 8% saying it is “Unnecessary”.
Row | Item | low | neutral | high | mean | sd |
---|---|---|---|---|---|---|
5 | JSI.Python | 1.010101 | 35.35354 | 63.636364 | 2.626263 | 0.5068079 |
3 | JSI.Stats | 2.941177 | 48.03922 | 49.019608 | 2.460784 | 0.5570710 |
1 | JSI.BigData | 4.081633 | 54.08163 | 41.836735 | 2.377551 | 0.5655999 |
7 | JSI.SQL | 7.368421 | 52.63158 | 40.000000 | 2.326316 | 0.6091869 |
10 | JSI.Visualizations | 8.163265 | 53.06122 | 38.775510 | 2.306122 | 0.6160761 |
2 | JSI.Degree | 5.102041 | 60.20408 | 34.693878 | 2.295918 | 0.5599923 |
6 | JSI.R | 5.102041 | 60.20408 | 34.693878 | 2.295918 | 0.5599923 |
4 | JSI.EnterpriseTools | 17.582418 | 72.52747 | 9.890110 | 1.923077 | 0.5213395 |
8 | JSI.KaggleRanking | 35.051546 | 60.82474 | 4.123711 | 1.690722 | 0.5469770 |
9 | JSI.MOOC | 37.634409 | 59.13978 | 3.225807 | 1.655914 | 0.5416285 |
Not surprisingly, Python is also high on the list of tools that working Data Scientists use, with 75.4% of users saying that they use it either “Often” or “Most of the time” and only Statistica, SQL and Unix edging it out for the top 3 slots. What is surprising is that two of the top three tools used in the field aren’t even on the list of Job Skills ‘Learners’ were asked to evaluate.
R comes in slightly lower than Python with 63.6% of users saying that they use it either “Often” or “Most of the time” but the difference between R and Python is less in the field than what ‘Learners’ seem to think is most important.
Row | Item | low | neutral | high | mean | sd |
---|---|---|---|---|---|---|
44 | WTF.Statistica | 20.00000 | 0 | 80.00000 | 3.200000 | 0.8366600 |
42 | WTF.SQL | 20.51546 | 0 | 79.48454 | 3.264949 | 0.8637334 |
48 | WTF.Unix | 20.73171 | 0 | 79.26829 | 3.207317 | 0.8620225 |
31 | WTF.Python | 24.58522 | 0 | 75.41478 | 3.220965 | 0.9533485 |
18 | WTF.KNIMECommercial | 25.00000 | 0 | 75.00000 | 3.000000 | 0.8164966 |
17 | WTF.Jupyter | 32.30975 | 0 | 67.69025 | 2.982644 | 0.9911199 |
33 | WTF.R | 36.38941 | 0 | 63.61059 | 2.942344 | 1.0572465 |
38 | WTF.SASBase | 36.96682 | 0 | 63.03318 | 2.900474 | 1.0621406 |
2 | WTF.AWS | 44.32432 | 0 | 55.67568 | 2.686486 | 1.0642268 |
40 | WTF.SASJMP | 46.42857 | 0 | 53.57143 | 2.553571 | 1.1106041 |
23 | WTF.Excel | 47.64151 | 0 | 52.35849 | 2.613208 | 0.8823789 |
6 | WTF.DataRobot | 47.82609 | 0 | 52.17391 | 2.478261 | 1.2745611 |
41 | WTF.Spark | 48.33837 | 0 | 51.66163 | 2.655589 | 1.0250457 |
9 | WTF.Hadoop | 49.83607 | 0 | 50.16393 | 2.596721 | 1.0185721 |
12 | WTF.IBMSPSSStatistics | 50.66667 | 0 | 49.33333 | 2.453333 | 1.1424645 |
14 | WTF.Impala | 50.98039 | 0 | 49.01961 | 2.470588 | 1.0835671 |
45 | WTF.Tableau | 51.13350 | 0 | 48.86650 | 2.561713 | 1.0680602 |
8 | WTF.GCP | 52.63158 | 0 | 47.36842 | 2.482456 | 1.0064610 |
28 | WTF.Oracle | 52.94118 | 0 | 47.05882 | 2.470588 | 0.9919462 |
15 | WTF.Java | 53.53160 | 0 | 46.46840 | 2.446097 | 1.0008735 |
5 | WTF.Cloudera | 53.84615 | 0 | 46.15385 | 2.538461 | 1.0781434 |
7 | WTF.Flume | 54.54545 | 0 | 45.45455 | 2.318182 | 0.9454837 |
34 | WTF.RapidMinerCommercial | 54.54545 | 0 | 45.45455 | 2.545454 | 1.2933396 |
25 | WTF.MicrosoftSQL | 55.00000 | 0 | 45.00000 | 2.480000 | 1.0198039 |
46 | WTF.TensorFlow | 55.06912 | 0 | 44.93088 | 2.460830 | 0.9633397 |
47 | WTF.TIBCO | 56.66667 | 0 | 43.33333 | 2.366667 | 1.0333519 |
27 | WTF.NoSQL | 57.14286 | 0 | 42.85714 | 2.444015 | 0.8977539 |
36 | WTF.Salford | 57.14286 | 0 | 42.85714 | 2.142857 | 0.8997354 |
21 | WTF.MATLAB | 57.38832 | 0 | 42.61168 | 2.336770 | 1.1095146 |
24 | WTF.MicrosoftRServer | 58.33333 | 0 | 41.66667 | 2.404762 | 1.0193200 |
37 | WTF.SAPBusinessObjects | 58.33333 | 0 | 41.66667 | 2.416667 | 1.1645002 |
11 | WTF.IBMSPSSModeler | 58.82353 | 0 | 41.17647 | 2.215686 | 1.1715584 |
32 | WTF.Qlik | 60.60606 | 0 | 39.39394 | 2.303030 | 1.0748502 |
4 | WTF.C | 61.39241 | 0 | 38.60759 | 2.316456 | 1.0754909 |
10 | WTF.IBMCognos | 61.53846 | 0 | 38.46154 | 2.076923 | 1.0926327 |
39 | WTF.SASEnterprise | 62.88660 | 0 | 37.11340 | 2.329897 | 1.0870607 |
20 | WTF.Mathematica | 69.56522 | 0 | 30.43478 | 2.144928 | 0.9892862 |
13 | WTF.IBMWatson | 72.34043 | 0 | 27.65957 | 2.000000 | 0.9088933 |
30 | WTF.Perl | 72.41379 | 0 | 27.58621 | 1.919540 | 0.9550160 |
19 | WTF.KNIMEFree | 73.52941 | 0 | 26.47059 | 2.088235 | 0.8300291 |
3 | WTF.Angoss | 75.00000 | 0 | 25.00000 | 1.750000 | 0.9574271 |
26 | WTF.Minitab | 75.00000 | 0 | 25.00000 | 1.750000 | 0.9158109 |
22 | WTF.Azure | 75.86207 | 0 | 24.13793 | 1.965517 | 0.8133805 |
1 | WTF.AmazonML | 77.17391 | 0 | 22.82609 | 1.923913 | 0.8923816 |
16 | WTF.Julia | 77.50000 | 0 | 22.50000 | 1.850000 | 0.9486833 |
43 | WTF.Stan | 78.04878 | 0 | 21.95122 | 1.926829 | 0.9052691 |
35 | WTF.RapidMinerFree | 86.95652 | 0 | 13.04348 | 1.760870 | 0.7939992 |
29 | WTF.Orange | 88.23529 | 0 | 11.76471 | 1.470588 | 0.7174301 |
The big surprise here is that Data Visualization is at the top of the list. Nobody said that they use it “Rarely”, and only 8.6% said that they use it only “Sometimes”; 91.4% said they use it “Often” or “Most of the time”. It is by far the most frequently used method or tool among working Data Science professionals, with Statistica the runner-up at only 80% by comparison. Remember that only 38.8% of ‘Learners’ said that they thought Visualizations were a “Necessary” job skill to learn!
Row | Item | low | neutral | high | mean | sd |
---|---|---|---|---|---|---|
7 | WMF.DataVisualization | 8.566276 | 0 | 91.43372 | 3.550045 | 0.6639726 |
6 | WMF.Cross.Validation | 24.178404 | 0 | 75.82160 | 3.160798 | 0.8500851 |
22 | WMF.PrescriptiveModeling | 32.710280 | 0 | 67.28972 | 2.855140 | 0.8350012 |
16 | WMF.LogisticRegression | 33.599202 | 0 | 66.40080 | 2.879362 | 0.8584456 |
12 | WMF.GBM | 34.337349 | 0 | 65.66265 | 2.879518 | 0.8848861 |
27 | WMF.Simulation | 35.013263 | 0 | 64.98674 | 2.854111 | 0.9006146 |
19 | WMF.NLP | 35.469108 | 0 | 64.53089 | 2.839817 | 0.8683888 |
30 | WMF.TimeSeriesAnalysis | 36.211699 | 0 | 63.78830 | 2.866295 | 0.8703743 |
9 | WMF.EnsembleMethods | 36.546185 | 0 | 63.45382 | 2.847390 | 0.8930782 |
15 | WMF.LiftAnalysis | 37.430168 | 0 | 62.56983 | 2.720670 | 0.8279985 |
29 | WMF.TextAnalysis | 38.420108 | 0 | 61.57989 | 2.795332 | 0.8882404 |
4 | WMF.CNNs | 39.830509 | 0 | 60.16949 | 2.805085 | 0.9340689 |
20 | WMF.NeuralNetworks | 40.545809 | 0 | 59.45419 | 2.785575 | 0.8821286 |
26 | WMF.Segmentation | 40.633245 | 0 | 59.36675 | 2.751979 | 0.9007434 |
25 | WMF.RNNs | 41.509434 | 0 | 58.49057 | 2.735849 | 0.8378153 |
23 | WMF.RandomForests | 41.909814 | 0 | 58.09019 | 2.733422 | 0.8695970 |
21 | WMF.PCA | 42.045454 | 0 | 57.95455 | 2.724026 | 0.8468828 |
8 | WMF.DecisionTrees | 42.063492 | 0 | 57.93651 | 2.708995 | 0.8456972 |
28 | WMF.SVMs | 48.051948 | 0 | 51.94805 | 2.597403 | 0.8788567 |
1 | WMF.A.B | 50.539957 | 0 | 49.46004 | 2.555076 | 0.8757749 |
14 | WMF.KNN | 51.268116 | 0 | 48.73188 | 2.543478 | 0.8293138 |
24 | WMF.RecommenderSystems | 51.690821 | 0 | 48.30918 | 2.507246 | 0.8411133 |
10 | WMF.EvolutionaryApproaches | 52.702703 | 0 | 47.29730 | 2.540541 | 0.9094510 |
5 | WMF.CollaborativeFiltering | 53.846154 | 0 | 46.15385 | 2.440559 | 0.8275399 |
3 | WMF.Bayesian | 55.580357 | 0 | 44.41964 | 2.462054 | 0.8713112 |
18 | WMF.NaiveBayes | 61.479592 | 0 | 38.52041 | 2.375000 | 0.8338660 |
17 | WMF.MLN | 64.814815 | 0 | 35.18519 | 2.351852 | 0.9144007 |
11 | WMF.GANs | 65.116279 | 0 | 34.88372 | 2.325581 | 0.8083178 |
13 | WMF.HMMs | 66.326531 | 0 | 33.67347 | 2.244898 | 0.8622821 |
2 | WMF.AssociationRules | 68.316832 | 0 | 31.68317 | 2.217822 | 0.7411697 |
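How the low, neutral, and high columns in the table above relate to the raw answers can be sketched in base R. This is a minimal illustration, assuming the four response levels are "Rarely", "Sometimes", "Often", and "Most of the time", that low groups the first two and high the last two, and that the mean and sd are taken over the integer codes 1 to 4. The `answers` vector is made-up toy data, not survey output.

```r
# Toy responses for one hypothetical item (not real survey data)
levels4 <- c("Rarely", "Sometimes", "Often", "Most of the time")
answers <- factor(
  c("Often", "Most of the time", "Sometimes", "Most of the time",
    "Often", "Most of the time", "Often", "Most of the time",
    "Most of the time", "Often"),
  levels = levels4
)

# Share of each level, in percent
pct <- 100 * prop.table(table(answers))

# With an even number of levels there is no middle category,
# so "neutral" is 0 and low + high add up to 100 (assumed grouping)
low_pct  <- sum(pct[c("Rarely", "Sometimes")])
high_pct <- sum(pct[c("Often", "Most of the time")])

# Mean and sd on the 1-4 integer codes of the ordered levels
codes     <- as.integer(answers)
item_mean <- mean(codes)
item_sd   <- sd(codes)

data.frame(low = low_pct, neutral = 0, high = high_pct,
           mean = item_mean, sd = item_sd)
```

Under these assumptions, an item with a high share close to 100 and a mean above 3.5, like WMF.DataVisualization, is one that almost every respondent uses at least "Often".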
(Burcu) R and Python are the most frequently used programming languages. But do their users agree on what to recommend? Do survey takers feel that newcomers should first and foremost study the language they themselves have taken up, or, with the benefit of experience, do they suggest the language of the two that they did not learn?
Thus the following questions were explored:
Two variables are used in this section of the analysis:
LanguageRecommendationSelect = What programming language would you recommend a new data scientist learn first? (Select one option) - Selected Choice
WorkToolsSelect = For work, which data science/analytics tools, technologies, and languages have you used in the past year? (Select all that apply) - Selected Choice
The major task in this part of the analysis was creating tidy data from the "Select all that apply" column WorkToolsSelect. Because respondents could choose every option that applied to them, each respondent's tools were captured as a single comma-separated string rather than as one column per tool.
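The reshaping described above can be illustrated on a toy example. The sketch below uses base R only and a made-up three-row data frame (`toy` and its values are hypothetical). It splits a comma-separated column into one row per tool and then builds a 0/1 indicator table per respondent.

```r
# Hypothetical mini version of the survey: one comma-separated string per respondent
toy <- data.frame(
  id = 1:3,
  WorkToolsSelect = c("Python,R,SQL", "R", "Python,Java"),
  stringsAsFactors = FALSE
)

# Split each string into its individual tools
tools <- strsplit(toy$WorkToolsSelect, ",", fixed = TRUE)

# Long format: repeat each id once per tool it selected
long <- data.frame(
  id = rep(toy$id, lengths(tools)),
  work_tools = unlist(tools),
  stringsAsFactors = FALSE
)

# Wide format: one row per id, one 0/1 column per tool
wide <- as.data.frame.matrix(table(long$id, long$work_tools))
wide
```

In the wide table, a respondent who uses both languages has a 1 in both the R and the Python column, which is what the language-use grouping relies on.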
dim(MC)
## [1] 16716 229
tb1<-MC %>%
select (id, WorkToolsSelect) %>%
filter (id %in% c(1:6))
datatable(tb1)
#removing NAs and empty values in column=WorkToolsSelect
df <- MC[!(MC$WorkToolsSelect == "" | is.na(MC$WorkToolsSelect)), ]
dim(df)
## [1] 7955 229
tb2<-df %>%
select (id, WorkToolsSelect) %>%
filter (id %in% c(1:6))
datatable(tb2)
#creating a new variable called work_tools where the original column values are split
#please note that this code will generate long data
df1<-df %>%
mutate(work_tools = strsplit(as.character(WorkToolsSelect), ",")) %>%
unnest(work_tools)
#check
tb3<-df1 %>%
select (id, WorkToolsSelect,work_tools) %>%
filter (id %in% c(1:3))
datatable(tb3)
df2<-df1 %>%
group_by(id, work_tools) %>%
summarize (total_count = n()) %>%
spread( work_tools, total_count, fill=0)
df3<-df2 %>%
mutate(lang_use = case_when (
(R==1 & Python==0) ~ "Using R Only",
(R==0 & Python==1) ~ "Using Python only",
(R==1 & Python==1) ~ "Using Both Python and R",
(R==0 & Python==0) ~ "Using Neither Python nor R"))%>%
select (id, R, Python, lang_use)
tb4<-df3 %>%
filter (id %in% c(1:10))
datatable(tb4)
#computing percentages
df4<-df3 %>%
group_by(lang_use) %>%
summarize (total_count = n()) %>%
mutate(percent = ((total_count / sum(total_count)) * 100), percent=round(percent, digit=2))
#checking
datatable(df4, colnames=c("Programming Language Survey takers use", "Count", "Percent"),class = 'cell-border stripe',caption = 'Table 1: Descriptive Statistics',options = list(pageLength = 4, dom = 'tip'))
p<-ggplot (df4, aes(x=lang_use,y=percent,fill=lang_use )) +
geom_bar(stat="identity", width =.5) +
labs (x="Language ", y="The distribution of R and Python among their users (%) " ,
title="Bar Graph of R and Python users") +
theme(axis.text.x = element_text(angle = 90), legend.position = 'none') +
scale_y_continuous (breaks=seq(0,100,10), limits = c(0,100))
p
# ggplotly(p)
Let’s examine the above graph by LanguageRecommendationSelect
#check
tb5<-df1 %>%
select (id, WorkToolsSelect,work_tools, LanguageRecommendationSelect) %>%
filter (id %in% c(1:3))
datatable(tb5)
df5<-df1 %>%
group_by(id, work_tools,LanguageRecommendationSelect) %>%
summarize (total_count = n()) %>%
spread( work_tools, total_count, fill=0) %>%
mutate(lang_use = case_when (
(R==1 & Python==0) ~ "Using R Only",
(R==0 & Python==1) ~ "Using Python only",
(R==1 & Python==1) ~ "Using Both Python and R",
(R==0 & Python==0) ~ "Using Neither Python nor R"),
lang_rec = case_when (
# missing or blank answers are checked first: comparisons with NA return NA and would match no other branch
(is.na(LanguageRecommendationSelect) | trimws(LanguageRecommendationSelect)=="") ~ "Recommending Nothing",
(LanguageRecommendationSelect=="R") ~ "Recommending R ",
(LanguageRecommendationSelect=="Python" ) ~ "Recommending Python ",
# every remaining answer names a language other than R or Python
TRUE ~ "Recommending Neither Python nor R"))%>%
select (id, R, Python, lang_use,lang_rec )
dim(df5)
## [1] 7955 5
tb6<-df5 %>%
filter (id %in% c(1:10))
datatable(tb6)
#computing percentages
df6<-df5 %>%
group_by(lang_use,lang_rec) %>%
summarize (total_count = n()) %>%
mutate(percent = ((total_count / sum(total_count)) * 100), percent=round(percent, digit=2))
#checking
datatable(df6, colnames=c("Programming Language Survey takers use", "Recommended Language", "Count", "Percent"),class = 'cell-border stripe',caption = 'Table 2: Descriptive Statistics',options = list(pageLength = 4, dom = 'tip'))
p1<-ggplot (df6, aes(x=lang_use,y=percent,fill=lang_use )) +
geom_bar(stat="identity", width =.5) +
labs (x="Language ", y="The distribution of R and Python among their users (%) " ,
title="Bar Graph of R and Python users and their recommended language") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous (breaks=seq(0,100,10), limits = c(0,100))+
facet_wrap(~lang_rec)+
theme(legend.position = 'none')
p1
# ggplotly(p1)
We found that a little under half of the survey takers (N=3540, 44.5%) reported using both R and Python. The take-home message for aspiring data scientists is that the largest single group of Kaggle survey takers uses both languages, and both are widely used. Among the remaining respondents, a small portion (N=714, 8.98%) use neither Python nor R. The rest use either R or Python: 2533 (31.84%) indicated using only Python, while 1168 (14.68%) reported using only R.
The contentious R vs Python debate gets more interesting when we compare the languages respondents use with the languages they recommend. Specifically, it is plausible to assume that Python users will recommend Python while R users will recommend R. Can we explore this hypothesis? How much do Python and R users differ in their recommended languages?
Our results revealed that 72.17% of the Python-only users recommended Python, while 53.77% of the R-only users recommended R. This result is consistent with our hypothesis. Since there are more Python-only users than R-only users in this sample, it makes sense to see differences in their recommendations. What is surprising, however, is the degree of difference in how often each group recommends the other language: 15.92% of the R users recommend Python, whereas only 1.42% of the Python users recommend R.
However, these results should be interpreted carefully because some survey takers did not make any recommendation. For instance, 18.87% of the Python users in the sample did not respond to this question. Similarly, 17.55% of R users gave no opinion on a recommended language. If these Python users had made a recommendation, would that have increased Python users’ overall recommendation of R? Who knows?
Since nearly half the sample uses both R and Python, let’s look at their recommendations as well: 51.72% of them recommend Python, while 25.65% recommend R. In other words, about a quarter of those who use both languages recommend R rather than Python.
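The percentages quoted above are computed within each user group, so each group's shares sum to 100. A minimal base R sketch of that row-wise calculation follows. The counts are invented for illustration only and are not the survey's numbers; the object names are hypothetical.

```r
# Hypothetical counts (rows: language used, columns: language recommended);
# these numbers are invented for illustration only
counts <- matrix(
  c(70,  5, 25,
    15, 55, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    used = c("Python only", "R only"),
    recommended = c("Python", "R", "Other or none")
  )
)

# margin = 1 divides each row by its own total, so every row sums to 100
row_pct <- round(100 * prop.table(counts, margin = 1), 2)
row_pct
```

Reading across a row answers "what do users of this language recommend?", which is the comparison the paragraphs above make. Reading down a column would answer a different question and would require `margin = 2`.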
Finally, true to the word “value,” pay has to be considered. The compensation survey takers receive for their work in either R or Python needs to be quantified to discover which language earns a data scientist more overall.
What the question is primarily concerned with is three-fold:
There was also the variable “id” that was created for the purpose of this report, acting as a way to identify each individual survey taker, and the variable “work_tools” which was a derivative of WorkToolsSelect, breaking the lists down into their individual components.
RQ6 <- MC %>%
mutate(work_tools = strsplit(as.character(WorkToolsSelect), ",")) %>%
unnest(work_tools)
RQ6 <- RQ6 %>%
filter(!is.na(WorkToolsSelect)) %>% # Filters out all rows with NA in the WorkToolsSelect column
filter(CompensationCurrency == "USD") %>% # Makes sure to only use rows whose currency is in USD
filter(work_tools == "Python" | work_tools == "R") %>% # The work tools are R or Python, period.
select(id, work_tools, CompensationAmount) # Only keep three columns to work with
RQ6_ids <- select(filter(as.data.frame(table(RQ6$id)), Freq == 1), Var1) # Only want people who use R or Python EXCLUSIVELY, not R and/or Python
RQ6_ids <- droplevels(RQ6_ids)$Var1 # Removed the levels so we can actually get the IDs
RQ6 <- filter(RQ6, id %in% RQ6_ids) # Only keep those rows whose id are inside of list of ids with R or Python exclusively used at work
RQ6 <- select(RQ6, -id) # No use for the ID anymore, it's done its job
RQ6$CompensationAmount <- gsub(",", "", RQ6$CompensationAmount) # Removed the commas from the compensation amount to prep for numeric transformation
RQ6$CompensationAmount <- as.numeric(RQ6$CompensationAmount) # made the column into a numeric for easier mathematical comparison and sorting
RQ6 <- filter(RQ6, CompensationAmount < 9999999) # ... let's just be a little realistic, nobody is earning ten million a year, and this one-dollar-off-from-ten-million entry is an anomaly in the data set
RQ6.summary <- rbind(summary(select(filter(RQ6, work_tools=="Python"), CompensationAmount)[[1]]), summary(select(filter(RQ6, work_tools=="R"), CompensationAmount)[[1]])) # Summary of the data
RQ6.summary <- as.data.frame(RQ6.summary)
colnames(RQ6.summary) <- c("Minimum", "1st Quartile", "Median", "Mean", "3rd Quartile", "Maximum") # Renamed the columns
RQ6.summary["Standard Deviation"] <- c( sd(filter(RQ6, work_tools == "Python")$CompensationAmount), sd(filter(RQ6, work_tools == "R")$CompensationAmount) )
RQ6.summary <- as.data.frame(sapply(RQ6.summary, function (x) paste("$", formatC(x, digits=2, format="f", big.mark=","), sep="") )) # Made all of the items in the data frame into dollar format
rownames(RQ6.summary) <- c("Python", "R") # Applied the rownames
rm(RQ6_ids) # remove the now-unused variable to save memory
In order to study the data, it first had to be transformed. All users who did not answer the question on the tools they use at work had their responses discarded. Similarly, all users whose compensation was not in US Dollars were disregarded in order to home in on a single socio-economic focus, the US market. A separate table was created pairing survey takers with each of the languages or methods used for their job via the id variable (a number assigned to each survey taker) and the work_tools variable, which stores each item in the list provided by WorkToolsSelect in its own row, matched to the id of the survey taker who provided it. Using this separate table, survey takers who used both Python and R in their jobs were removed, as were any who used neither, leaving a list of individuals who used exactly one of Python or R at work. Lastly, the compensation amounts were converted to a numeric format, and any rows with an unlikely compensation amount (roughly ten million or more annually) were removed to prevent such a severe outlier from skewing the data.
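The filtering steps described above can be sketched on a small example. This is a base R illustration under stated assumptions, not the report's own pipeline: the `toy` table and its values are hypothetical, and it is assumed to already contain only the Python and R rows of the long table.

```r
# Hypothetical long table: one row per (respondent, language) pair
toy <- data.frame(
  id = c(1, 1, 2, 3, 4),
  work_tools = c("Python", "R", "Python", "R", "R"),
  CompensationAmount = c("120,000", "120,000", "95,000", "80,000", "9,999,999"),
  stringsAsFactors = FALSE
)

# Count how many of the two languages each respondent reported
n_langs <- ave(seq_along(toy$id), toy$id, FUN = length)

# Keep respondents who reported exactly one of the two
exclusive <- toy[n_langs == 1, ]

# Strip thousands separators, then convert to numeric
exclusive$CompensationAmount <- as.numeric(gsub(",", "", exclusive$CompensationAmount))

# Drop implausibly large amounts, using the same threshold as the report
exclusive <- exclusive[exclusive$CompensationAmount < 9999999, ]
exclusive
```

Respondent 1 is dropped for using both languages and respondent 4 for the implausible amount, leaving one Python-only and one R-only row.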
RQ6_boxplot <- ggplot(RQ6) +
geom_boxplot( aes(x = factor(work_tools),
y = CompensationAmount,
fill = factor(work_tools)
)
) +
scale_y_continuous(breaks=seq(0,2000000,25000)) +
labs( x = "Programming Language",
y = "Annual Compensation in USD",
fill = "Programming Language")
RQ6_boxplot_ylim <- boxplot.stats(RQ6$CompensationAmount)$stats[c(1, 5)]
RQ6_boxplot <- RQ6_boxplot + coord_cartesian(ylim = RQ6_boxplot_ylim*1.05)
RQ6_boxplot
As we can see from the boxplots, compensation for those who used only Python at work is relatively normally distributed. Conversely, there is a noticeable right skew in the compensation for those who used only R at work.
knitr::kable(RQ6.summary)
Language | Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum | Standard Deviation |
---|---|---|---|---|---|---|---|
Python | $0.00 | $53,000.00 | $100,000.00 | $112,826.14 | $145,000.00 | $2,000,000.00 | $122,425.21 |
R | $0.00 | $58,000.00 | $87,000.00 | $98,177.64 | $130,000.00 | $550,000.00 | $67,487.91 |
The average survey taker who used Python in their job made approximately $14,648.50 more than the average survey taker who used R in their job. While R users had a higher first-quartile pay, to the tune of $5,000.00 more than their Python counterparts, their ability to achieve growth in salary was noticeably stymied in comparison. Outliers aside, if the data collected is considered representative of the data science population, there is some indication that a prospective Data Scientist should learn R first for a higher initial salary, and then learn Python to increase their chance of obtaining a job with more growth potential.
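The gaps quoted above can be checked directly against the summary table. The short sketch below copies the first quartile, median, and mean from that table. Treating the $5,000.00 figure as the first-quartile difference is an assumption based on the numbers matching, and the variable names are illustrative.

```r
# Values copied from the summary table above
python <- c(q1 = 53000, median = 100000, mean = 112826.14)
r_lang <- c(q1 = 58000, median = 87000,  mean = 98177.64)

# Gap in mean compensation (Python minus R)
mean_gap <- python[["mean"]] - r_lang[["mean"]]

# Gap in first-quartile compensation (R minus Python)
q1_gap <- r_lang[["q1"]] - python[["q1"]]

# Gap in median compensation (Python minus R)
median_gap <- python[["median"]] - r_lang[["median"]]

round(c(mean_gap = mean_gap, q1_gap = q1_gap, median_gap = median_gap), 2)
```

The median gap is included because the mean is sensitive to the $2,000,000.00 maximum in the Python group, so the median may be the more robust comparison.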
On the contentious debate on which ML/DS methods Data Scientists are most excited about learning in the next year as the most valued Data Science Skills
Deep Learning is the top ML/DS method in all categories of formal education, followed by Neural Nets.
Apart from the exceptions below, all groups want to learn Time Series Analysis as their third ML/DS method.
High school graduates want to learn Genetic & Evolutionary Algorithms as their third choice.
Among doctoral survey takers, Bayesian Methods is the third preference.
On the contentious debate on R vs Python as the most valued Data Science Skills