AIM

This is part 3 of a data analytics project for mining, analyzing and visualizing the data collected by the Kaggle Data Science Survey conducted in 2017.

Part 3 - This section analyzes the professional lives of the participants: their major degree, the time they have spent studying data science topics, the job titles they hold, which ML methods they actually use in industry, which blogs they prefer most for studying data science, etc.

Let’s get started.

  1. Let’s start with the most preferred blog sites for learning data science. This is a multiple-answer field. Let’s find the 15 most frequent answers.
blogs<-SurveyDf %>% group_by(BlogsPodcastsNewslettersSelect) %>%
  summarise(count=n()) %>% 
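  top_n(15, count) %>% 
  arrange(desc(count))

#the first (most frequent) row is assumed to be the unanswered response;
#mark it NA so that na.omit() drops it before plotting
blogs[1,1]<-NA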
colnames(blogs)<-c("Blogname","Count")

#let's plot them
hchart(na.omit(blogs),hcaes(x=Blogname,y=Count),type="column",color="#062D67") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Barplot of most preferred blogs for learning",align="center") %>%
  hc_add_theme(hc_theme_elementary()) 

[Figure: column chart "Barplot of most preferred blogs for learning"; x-axis: Blogname, y-axis: Count (0 to 800). The first bars, in order, are "Other (Separate different answers with semicolon)", "KDnuggets Blog" and "R Bloggers Blog Aggregator".]
Hence one can see that KDnuggets and R Bloggers are among the most popular and preferred blog sites.
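
Because this is a multiple-answer field, grouping on the raw column counts each combination of answers rather than each individual blog. A minimal sketch of counting individual choices follows. It uses base R only, runs on a small made-up data frame (`toy`), and assumes that answers are separated by commas and that no option label itself contains a comma. The helper `count_choices` is hypothetical and not part of the original analysis.

```r
# Toy stand-in for SurveyDf; only the column name is taken from the survey
toy <- data.frame(
  BlogsPodcastsNewslettersSelect = c(
    "KDnuggets Blog,R Bloggers Blog Aggregator",
    "KDnuggets Blog",
    "",
    "No Free Hunch Blog,KDnuggets Blog",
    NA
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each response on `sep` and count single choices
count_choices <- function(x, sep = ",") {
  x <- x[!is.na(x) & nzchar(x)]                 # drop unanswered responses
  choices <- trimws(unlist(strsplit(x, sep, fixed = TRUE)))
  counts <- sort(table(choices), decreasing = TRUE)
  data.frame(choice = names(counts),
             count = as.integer(counts),
             stringsAsFactors = FALSE)
}

blog_counts <- count_choices(toy$BlogsPodcastsNewslettersSelect)
print(blog_counts)
```

On the real data the same helper could be applied to `SurveyDf$BlogsPodcastsNewslettersSelect`, and the resulting data frame passed to `hchart()` in place of `blogs`.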


  2. Let’s now study how long participants have been learning data science.
table(LearningDataScienceTime)
## LearningDataScienceTime
##                < 1 year   1-2 years 10-15 years   15+ years   3-5 years 
##       12367        2093        1566          14          30         540 
##  5-10 years 
##         106
hchart(SurveyDf$LearningDataScienceTime,type="pie",name="count")
[Figure: pie chart of LearningDataScienceTime; slices: < 1 year, 1-2 years, 10-15 years, 15+ years, 3-5 years, 5-10 years]

So the largest group of participants who answered this question (2,093 of 4,349) started learning data science less than a year ago.
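
As a rough check on this reading, the counts printed by `table()` above can be turned into percentages of those who answered. The sketch below copies the counts by hand from that output and leaves out the blank response. The ordered factor at the end is an optional extra, shown on made-up answers, so that categories sort by duration rather than alphabetically.

```r
# Counts copied from the table() output above (blank responses left out)
learning_time <- c(
  "< 1 year"    = 2093,
  "1-2 years"   = 1566,
  "3-5 years"   = 540,
  "5-10 years"  = 106,
  "10-15 years" = 14,
  "15+ years"   = 30
)

# Share of each category among respondents who answered, in percent
shares <- round(100 * prop.table(learning_time), 1)
print(shares)

largest_group <- names(shares)[which.max(shares)]
print(largest_group)

# Ordered factor so that sorting follows duration, not alphabetical order
duration_levels <- names(learning_time)
example_answers <- factor(c("3-5 years", "< 1 year", "10-15 years"),
                          levels = duration_levels, ordered = TRUE)
print(sort(example_answers))
```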

Let’s check the age distribution of the participants against how long they have been learning data science.

hcboxplot(x=SurveyDf$Age,var=SurveyDf$LearningDataScienceTime,outliers = FALSE,color="#09870D",name="Age Distribution") %>%
  hc_chart(type="column")  %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Boxplot of ages and the learning time of participants",align="center") %>%
  hc_add_theme(hc_theme_elementary()) 
[Figure: box plot "Boxplot of ages and the learning time of participants"; x-axis categories: < 1 year, 1-2 years, 10-15 years, 15+ years, 3-5 years, 5-10 years; y-axis: age (0 to 70)]

The above plot is quite predictable: participants who have spent less time learning data science tend to be younger.


  3. Let’s now study which skills participants feel are important for becoming a data scientist.

Let’s do some data wrangling and transformations.

Making a separate data frame for each variable, for easier understanding.

#helper function: takes a data frame and the (unquoted) categorical variable
#to group by, and returns the count of each value in descending order
aggr<-function(df,var) 
{
  require(dplyr)
  var <- enquo(var) #capture the unquoted column name
  dfname<-df %>% 
    group_by_at(vars(!!var)) %>%  #group by the captured column
    summarise(count=n()) %>%
    arrange(desc(count))
 
  dfname #return the summarised data frame
}

RSkill<-aggr(SurveyDf,JobSkillImportanceR)
RSkill[1,]<-NA
SqlSkill<-aggr(SurveyDf,JobSkillImportanceSQL)
SqlSkill[1,]<-NA
PythonSkill<-aggr(SurveyDf,JobSkillImportancePython)
PythonSkill[1,]<-NA
BigDataSkill<-aggr(SurveyDf,JobSkillImportanceBigData)
BigDataSkill[1,]<-NA
StatsSkill<-aggr(SurveyDf,JobSkillImportanceStats)
StatsSkill[1,]<-NA
DegreeSkill<-aggr(SurveyDf,JobSkillImportanceDegree)
DegreeSkill[1,]<-NA
EnterToolsSkill<-aggr(SurveyDf,JobSkillImportanceEnterpriseTools)
EnterToolsSkill[1,]<-NA
MOOCSkill<-aggr(SurveyDf,JobSkillImportanceMOOC)
MOOCSkill[1,]<-NA
DataVisSkill<-aggr(SurveyDf,JobSkillImportanceVisualizations)
DataVisSkill[1,]<-NA
KaggleRankSkill<-aggr(SurveyDf,JobSkillImportanceKaggleRanking)
KaggleRankSkill[1,]<-NA
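
The ten near-identical calls above could also be written as a loop over the column names. The following is only a sketch under stated assumptions: it uses base R instead of the `aggr` helper, runs on a made-up two-column data frame (`toy`), and removes unanswered responses by value rather than by assuming they always land in the first row. The names `summarise_skill` and `skill_tables` are hypothetical.

```r
# Toy stand-in for SurveyDf with two of the JobSkillImportance* columns
toy <- data.frame(
  JobSkillImportanceR      = c("Necessary", "Nice to have", "", "Nice to have"),
  JobSkillImportancePython = c("Necessary", "Necessary", "", "Unnecessary"),
  stringsAsFactors = FALSE
)

skill_cols <- c("JobSkillImportanceR", "JobSkillImportancePython")

# Hypothetical helper: count the answers in one column, skipping blanks and NAs
summarise_skill <- function(df, col) {
  answers <- df[[col]]
  answers <- answers[!is.na(answers) & nzchar(answers)]
  counts <- sort(table(answers), decreasing = TRUE)
  data.frame(answer = names(counts),
             count = as.integer(counts),
             stringsAsFactors = FALSE)
}

# One summary table per skill column, stored in a named list
skill_tables <- lapply(setNames(skill_cols, skill_cols),
                       function(col) summarise_skill(toy, col))
print(skill_tables$JobSkillImportanceR)
```

Keeping the summaries in a named list means one more loop could produce all of the pie charts, at the cost of slightly less explicit code than the one-variable-per-skill version.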


hchart(na.omit(RSkill),hcaes(x=JobSkillImportanceR,y=count),type="pie",name="Count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Piechart of importance of R skill",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: pie chart "Piechart of importance of R skill"; slices: Nice to have, Necessary, Unnecessary]
hchart(na.omit(PythonSkill),hcaes(x=JobSkillImportancePython,y=count),type="pie",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
  hc_title(text="Piechart of importance of Python skill",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: pie chart "Piechart of importance of Python skill"; slices: Necessary, Nice to have, Unnecessary]
hchart(na.omit(SqlSkill),hcaes(x=JobSkillImportanceSQL,y=count),type="pie",name="Count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Piechart of importance of SQL skill",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: pie chart "Piechart of importance of SQL skill"; slices: Nice to have, Necessary, Unnecessary]
hchart(na.omit(BigDataSkill),hcaes(x=JobSkillImportanceBigData,y=count),type="pie",name="Count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Piechart of importance of Big Data skill",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: pie chart "Piechart of importance of Big Data skill"; slices: Nice to have, Necessary, Unnecessary]
hchart(na.omit(StatsSkill),hcaes(x=JobSkillImportanceStats,y=count),type="pie",name="Count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Piechart of importance of Statistics skill",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: pie chart "Piechart of importance of Statistics skill"; slices: Necessary, Nice to have, Unnecessary]
hchart(na.omit(DataVisSkill),hcaes(x=JobSkillImportanceVisualizations,y=count),type="pie",name="Count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Piechart of importance of Data Viz skill",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: pie chart "Piechart of importance of Data Viz skill"; slices: Nice to have, Necessary, Unnecessary]
hchart(na.omit(DegreeSkill),hcaes(x=JobSkillImportanceDegree,y=count),type="pie",name="Count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Piechart of importance of Degree",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: pie chart "Piechart of importance of Degree"; slices: Nice to have, Necessary, Unnecessary]
hchart(na.omit(EnterToolsSkill),hcaes(x=JobSkillImportanceEnterpriseTools,y=count),type="pie",name="Count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Piechart of importance of Enterprise Tools skill",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: pie chart "Piechart of importance of Enterprise Tools skill"; slices: Nice to have, Unnecessary, Necessary]
hchart(na.omit(MOOCSkill),hcaes(x=JobSkillImportanceMOOC,y=count),type="pie",name="Count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Piechart of importance of MOOCs",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: pie chart "Piechart of importance of MOOCs"; slices: Nice to have, Unnecessary, Necessary]
hchart(na.omit(KaggleRankSkill),hcaes(x=JobSkillImportanceKaggleRanking,y=count),type="pie",name="Count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Piechart of importance of Kaggle Rankings",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: pie chart "Piechart of importance of Kaggle Rankings"; slices: Nice to have, Unnecessary, Necessary]
  1. We can see from the above plots that the skills participants most often rated unnecessary are knowledge of enterprise tools, a degree, Kaggle rankings and MOOCs. These have the higher counts of "Unnecessary" answers.

  2. In contrast, knowledge of statistics, Python, R and big data is most often rated "Necessary" or "Nice to have" by the survey participants.


What proves that you have good Data science knowledge?

knowlegdeDf<-SurveyDf %>% group_by(ProveKnowledgeSelect) %>%
  summarise(count=n()) %>%
  arrange(desc(count))

knowlegdeDf[1,]<-NA 

hchart(na.omit(knowlegdeDf),hcaes(x=ProveKnowledgeSelect,y=count),type="column",color="#049382",name="count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Barplot of what proves you have Datascience knowledge",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: column chart "Barplot of what proves you have Datascience knowledge"; x-axis: ProveKnowledgeSelect, y-axis: count (0 to 1500). Bars, in order: Experience from work in a company related to ML, Kaggle Competitions, Online Courses and Certifications, Github Portfolio, Master's degree, PhD, Other.]

Let’s now check the formal education of participants:

table(FormalEducation)
## FormalEducation
##                                                                   
##                                                              1701 
##                                                 Bachelor's degree 
##                                                              4811 
##                                                   Doctoral degree 
##                                                              2347 
##          I did not complete any formal education past high school 
##                                                               257 
##                                            I prefer not to answer 
##                                                                90 
##                                                   Master's degree 
##                                                              6273 
##                                               Professional degree 
##                                                               451 
## Some college/university study without earning a bachelor's degree 
##                                                               786

Let’s check the machine learning techniques in which participants most often consider themselves competent.

Mltechique<-SurveyDf %>% group_by(MLTechniquesSelect) %>%
  summarise(count=n()) %>% 
  arrange(desc(count)) %>%
  top_n(20)
## Selecting by count
Mltechique[1,]<-NA

hchart(na.omit(Mltechique),hcaes(x=MLTechniquesSelect,y=count),type="column",color="purple",name="count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Barplot of competent ML techniques of participants",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: column chart "Barplot of competent ML techniques of participants"; x-axis: MLTechniquesSelect, y-axis: count (0 to 1000). The first bars, in order, are "Logistic Regression", "Decision Trees - Random Forests,Logistic Regression", "Other (please specify; separate by semi-colon)", "Neural Networks - CNNs" and "Decision Trees - Random Forests".]

So we can see that logistic regression and decision trees (random forests) are the techniques in which participants most often consider themselves competent and able to implement successfully.


Let’s check which learning algorithms participants use at work.

Now we will check which machine learning algorithm is most used by the participants at their work.

MLalgoWork<-SurveyDf %>% group_by(WorkAlgorithmsSelect) %>%
  summarise(count=n()) %>%
  arrange(desc(count)) %>%
  top_n(20)
## Selecting by count
MLalgoWork[c(1,3),]<-NA


hchart(na.omit(MLalgoWork),hcaes(x=WorkAlgorithmsSelect,y=count),type="column",color="green",name="count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Barplot of Most used ML algorithms at Work",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: column chart "Barplot of Most used ML algorithms at Work"; x-axis: WorkAlgorithmsSelect, y-axis: count (0 to 600). The first bars, in order, are "Regression/Logistic Regression", "Decision Trees,Random Forests,Regression/Logistic Regression", "Decision Trees,Regression/Logistic Regression", "Neural Networks" and "Bayesian Techniques".]

Again, as we can see from the above plot, regression/logistic regression and decision trees lead the pack as the learning algorithms most used at work by participants.
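
The summaries in this section drop unanswered responses by row position (for example rows 1 and 3), which depends on how the counts happen to sort. A more defensive alternative is to filter by value. The sketch below is a minimal base R illustration on made-up counts; the helper `drop_unanswered` is hypothetical, and the assumption that unanswered responses appear as empty strings or `NA` is mine and not confirmed by the source.

```r
# Made-up summary table shaped like MLalgoWork (counts are illustrative only)
toy_counts <- data.frame(
  WorkAlgorithmsSelect = c("", "Regression/Logistic Regression", NA, "Decision Trees"),
  count = c(50L, 20L, 10L, 5L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows whose answer is neither NA nor blank
drop_unanswered <- function(df, col) {
  answered <- !is.na(df[[col]]) & nzchar(trimws(df[[col]]))
  df[answered, , drop = FALSE]
}

clean_counts <- drop_unanswered(toy_counts, "WorkAlgorithmsSelect")
print(clean_counts)
```

With this approach the result no longer depends on where the blank category falls after `arrange(desc(count))`.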


Now let’s check which tools are used most at work?

This field answers the question: for work, which data science/analytics tools, technologies, and languages have the participants used in the past year?

We are going to find the top 20 tools.

ToolatWork<-SurveyDf %>% group_by(WorkToolsSelect) %>%
  summarise(count=n()) %>%
  arrange(desc(count)) %>%
  top_n(20)
## Selecting by count
ToolatWork[c(1),]<-NA


hchart(na.omit(ToolatWork),hcaes(x=WorkToolsSelect,y=count),type="column",color="#7C0E3E",name="count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Barplot of Most used data science tools used at Work",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: column chart "Barplot of Most used data science tools used at Work"; x-axis: WorkToolsSelect, y-axis: count (0 to 150). The first bars, in order, are "Python,R", "R", "Python", "Python,R,SQL" and "R,SQL".]

From the above plot we can see that Python and R, alone or in combination, are the tools data scientists use most, as entered by the survey participants. Hence Python and R still top the list of most used tools at work according to the survey.


Most used ML method at work?

MethodatWork<-SurveyDf %>% group_by(WorkMethodsSelect) %>%
  summarise(count=n()) %>%
  arrange(desc(count)) %>%
  top_n(20)
## Selecting by count
MethodatWork[c(1,3),]<-NA


hchart(na.omit(MethodatWork),hcaes(x=WorkMethodsSelect,y=count),type="column",color="#F14B5B",name="count") %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_title(text="Barplot of Most used ML and DS methods used at Work",align="center") %>%
  hc_add_theme(hc_theme_elementary())
[Figure: column chart "Barplot of Most used ML and DS methods used at Work"; x-axis: WorkMethodsSelect, y-axis: count (0 to 200). The first bars, in order, are "Data Visualization", "Logistic Regression", "Time Series Analysis", "Neural Networks" and "A/B Testing".]