1 Importing data

The data set used in this project is the Kaggle ML and Data Science Survey 2017. The survey results were stored in two separate data sets: a) multiple choice items and b) free-response items. Kaggle stored each data set in csv format. We downloaded the multiple choice survey results in csv format and placed them in our GitHub repo.

Importing Multiple Choice data

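# Packages used throughout this report
library(tidyverse) # read_csv, dplyr verbs, tidyr, ggplot2
library(janitor)   # helpers for dropping empty columns
library(DT)        # datatable
library(plotly)    # ggplotly
linkMC <- "https://raw.githubusercontent.com/betsyrosalen/DATA_607_Project_3/master/project3_master/rawdata/multipleChoiceResponses.csv"
# importing MC items
MC <- read_csv(linkMC)
dim(MC)
## [1] 16716   228
# let's create a unique ID variable
MC$id <- seq.int(nrow(MC))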

Ignore this code: importing the conversion rates data in case we want to use it in later analyses.

# link_conversion<-"https://raw.githubusercontent.com/betsyrosalen/DATA_607_Project_3/master/project3_master/rawdata/conversionRates.csv"
# #importing conversion rates
# conversion<-read_csv (link_conversion)
# dim(conversion)
# #lets create a unique ID variable
# conversion$id <- seq.int(nrow(conversion))

2 Research Question

This project will answer one global research question: Which are the most valued data science skills? The following six research questions will each contribute to answering this global question.

3 Research Question 1

What is the relationship between the most popular platforms for learning DS and X (Niteen)? Alternatively phrased: What data science learning resources and which locations of open data are utilized by people of varying levels of education?

3.1 Variables and their definition

This section will describe the variable names and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).

3.2 Manipulating data

3.3 Exploratory Data Analysis (EDA)

4 Research Question 2

Does a survey taker's formal education have any relationship to the ML/DS method he
or she is most excited about learning in the next year? (Binish)
To do the analysis, we concentrate on two columns in the dataset:
FormalEducation and MLMethodNextYearSelect.
FormalEducation: Which level of formal education have you attained?
MLMethodNextYearSelect: Which ML/DS method are you most excited about learning
in the next year?
These questions were asked of all participants.
First we plot the distribution of formal education in the dataset.

The data set predominantly contains respondents with a Master's degree.
Now let's look at the different ML/DS methods in the dataset.
ML Methods
Random Forests
Deep learning
Neural Nets
Text Mining
Genetic & Evolutionary Algorithms
Link Analysis
Rule Induction
Regression
Proprietary Algorithms
I don’t plan on learning a new ML/DS method
Ensemble Methods (e.g. boosting, bagging)
Factor Analysis
Social Network Analysis
Monte Carlo Methods
Time Series Analysis
Other
Bayesian Methods
Survival Analysis
MARS
Anomaly Detection
Cluster Analysis
Decision Trees
Association Rules
Uplift Modeling
Support Vector Machines (SVM)
Now we can plot the distribution of ML/DS methods by formal education.

FormalEducation  MLMethodNextYearSelect             percentage
Bachelor         Deep learning                      0.40
Bachelor         Neural Nets                        0.14
Bachelor         Time Series Analysis               0.06
College Dropout  Deep learning                      0.37
College Dropout  Neural Nets                        0.16
College Dropout  Time Series Analysis               0.07
Doctoral         Deep learning                      0.44
Doctoral         Neural Nets                        0.10
Doctoral         Bayesian Methods                   0.06
Doctoral         Time Series Analysis               0.06
Masters          Deep learning                      0.40
Masters          Neural Nets                        0.12
Masters          Time Series Analysis               0.07
Professional     Deep learning                      0.38
Professional     Neural Nets                        0.14
Professional     Time Series Analysis               0.05
High School      Deep learning                      0.39
High School      Neural Nets                        0.14
High School      Genetic & Evolutionary Algorithms  0.10
Deep learning is the most popular ML/DS method in every category of formal education,
followed by Neural Nets. Except for high school graduates, every group picks
Time Series Analysis as its third choice. High school graduates instead pick
Genetic & Evolutionary Algorithms as their third choice. Among doctoral
survey takers, Bayesian Methods ties with Time Series Analysis for third place (0.06 each).
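
The report does not show the code behind the table above, so the following is only a minimal base R sketch of one way such a summary could be produced: count each education/method pair, convert the counts to within-education proportions, and keep the most frequent methods per group. The toy data and the helper `top_methods` are hypothetical and are not taken from the survey or from the authors' code.

```r
# Toy responses; the values are made up for illustration, not survey data
toy <- data.frame(
  FormalEducation = c(rep("Masters", 6), rep("Doctoral", 4)),
  MLMethodNextYearSelect = c("Deep learning", "Deep learning", "Deep learning",
                             "Neural Nets", "Neural Nets", "Regression",
                             "Deep learning", "Deep learning",
                             "Bayesian Methods", "Neural Nets"),
  stringsAsFactors = FALSE
)

# Count each (education, method) pair, then convert to within-education proportions
counts <- table(toy$FormalEducation, toy$MLMethodNextYearSelect)
props  <- prop.table(counts, margin = 1)

# Hypothetical helper: names of the n most frequent methods in one row of props
top_methods <- function(row, n = 3) {
  names(sort(row, decreasing = TRUE))[seq_len(n)]
}

# One column per education level, one row per rank
top3 <- apply(props, 1, top_methods)

print(round(props, 2))
print(top3)
```

Because the proportions are computed within each education level, groups of very different sizes can be compared directly. Ties, like the one among doctoral respondents, are ordered arbitrarily by this sketch and should be checked by eye.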

5 Research Question 3

What are the most frequently used DS methods? Where is the most time spent in terms of working with data? Do either of these correlate with job title or level of education? (Zach)

5.1 Variables and their definition

This section will describe the variable names and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).

5.2 Manipulating data

5.3 Exploratory Data Analysis (EDA)

6 Research Question 4

Is there a difference between what ‘Learners’ think are the important skills to learn and what employed Data Scientists say are the skills and tools they are using? (Betsy)

6.1 Variables and their definition

This section will describe the variable names and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).

6.2 Manipulating data

6.3 Exploratory Data Analysis (EDA)

# Select only variables that seem most related to “Which are the most valued data science skills?”
# I may narrow down these columns even more later, but want to leave as much as possible for now
# Filter for US only
USOnly <- MC %>%
  select(-c(56:73, 76:79, 167:196, 198:228)) %>%
  filter(Country == 'United States')
# Separate those employed in Data Science from those who are not.
# Filter for employed only, TitleFit better than 'Poorly', and CodeWriters only
# Remove those that said they are "Employed by a company that doesn't perform advanced analytics"
# Note: the JobFunctionSelect comparison below uses a shortened label, so it does not match the
# full response text; the full label is filtered on separately further down.
employed <- USOnly %>%
  filter(!grepl('Not employed', EmploymentStatus),
         TitleFit != "Poorly",
         !grepl('doesn\'t perform advanced analytics', CurrentEmployerType),
         CodeWriter == "Yes",
         JobFunctionSelect != 'Build and/or run the data infrastructure')
# Filter for Data Science learners who are not employed.
# The survey failed to capture those who are employed and ALSO students or learners!
# It didn't ask employed respondents whether they were also studying Data Science.
learner <- USOnly %>%
  filter(grepl('Not employed', EmploymentStatus),
         grepl('Yes', LearningDataScience))
# Get rid of empty columns
employed <- remove_empty(employed, which = "cols")
learner <- remove_empty(learner, which = "cols")
glimpse(employed)
## Observations: 1,676
## Variables: 125
## $ GenderSelect                               <chr> "Male", "Male", "Ma...
## $ Country                                    <chr> "United States", "U...
## $ Age                                        <int> 35, 25, 33, NA, 35,...
## $ EmploymentStatus                           <chr> "Employed full-time...
## $ CodeWriter                                 <chr> "Yes", "Yes", "Yes"...
## $ CurrentJobTitleSelect                      <chr> "Computer Scientist...
## $ TitleFit                                   <chr> "Fine", "Fine", "Pe...
## $ CurrentEmployerType                        <chr> "Employed by govern...
## $ MLToolNextYearSelect                       <chr> "TensorFlow", "Amaz...
## $ MLMethodNextYearSelect                     <chr> "Text Mining", "Dee...
## $ LanguageRecommendationSelect               <chr> "R", "Python", "Mat...
## $ PublicDatasetsSelect                       <chr> "Dataset aggregator...
## $ LearningPlatformSelect                     <chr> "Arxiv,Blogs,Kaggle...
## $ LearningPlatformUsefulnessArxiv            <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessBlogs            <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessCollege          <chr> NA, "Very useful", ...
## $ LearningPlatformUsefulnessCompany          <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessConferences      <chr> NA, NA, "Not Useful...
## $ LearningPlatformUsefulnessFriends          <chr> NA, NA, "Somewhat u...
## $ LearningPlatformUsefulnessKaggle           <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessNewsletters      <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessCommunities      <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessDocumentation    <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessCourses          <chr> NA, NA, NA, "Very u...
## $ LearningPlatformUsefulnessProjects         <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessPodcasts         <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessSO               <chr> NA, "Very useful", ...
## $ LearningPlatformUsefulnessTextbook         <chr> "Very useful", "Som...
## $ LearningPlatformUsefulnessTradeBook        <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessTutoring         <chr> NA, "Very useful", ...
## $ LearningPlatformUsefulnessYouTube          <chr> NA, NA, NA, NA, NA,...
## $ BlogsPodcastsNewslettersSelect             <chr> NA, NA, NA, "KDnugg...
## $ DataScienceIdentitySelect                  <chr> "No", "Yes", "No", ...
## $ FormalEducation                            <chr> "Master's degree", ...
## $ UniversityImportance                       <chr> "Very important", "...
## $ JobFunctionSelect                          <chr> "Build and/or run t...
## $ WorkAlgorithmsSelect                       <chr> NA, "CNNs,Neural Ne...
## $ WorkToolsSelect                            <chr> "C/C++,Cloudera,Had...
## $ WorkToolsFrequencyAmazonML                 <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyAWS                      <chr> NA, NA, NA, "Often"...
## $ WorkToolsFrequencyAngoss                   <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyC                        <chr> "Sometimes", "Often...
## $ WorkToolsFrequencyCloudera                 <chr> "Most of the time",...
## $ WorkToolsFrequencyDataRobot                <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyFlume                    <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyGCP                      <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyHadoop                   <chr> "Most of the time",...
## $ WorkToolsFrequencyIBMCognos                <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyIBMSPSSModeler           <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyIBMSPSSStatistics        <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyIBMWatson                <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyImpala                   <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyJava                     <chr> "Most of the time",...
## $ WorkToolsFrequencyJulia                    <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyJupyter                  <chr> NA, "Most of the ti...
## $ WorkToolsFrequencyKNIMECommercial          <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyKNIMEFree                <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMathematica              <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMATLAB                   <chr> NA, NA, "Most of th...
## $ WorkToolsFrequencyAzure                    <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyExcel                    <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMicrosoftRServer         <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMicrosoftSQL             <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMinitab                  <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyNoSQL                    <chr> "Most of the time",...
## $ WorkToolsFrequencyOracle                   <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyOrange                   <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyPerl                     <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyPython                   <chr> NA, "Most of the ti...
## $ WorkToolsFrequencyQlik                     <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyR                        <chr> "Sometimes", NA, NA...
## $ WorkToolsFrequencyRapidMinerCommercial     <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyRapidMinerFree           <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySalfrod                  <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySAPBusinessObjects       <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySASBase                  <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySASEnterprise            <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySASJMP                   <chr> NA, NA, NA, NA, "Mo...
## $ WorkToolsFrequencySpark                    <chr> NA, NA, NA, "Someti...
## $ WorkToolsFrequencySQL                      <chr> NA, NA, NA, NA, "Of...
## $ WorkToolsFrequencyStan                     <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyStatistica               <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyTableau                  <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyTensorFlow               <chr> NA, "Often", NA, NA...
## $ WorkToolsFrequencyTIBCO                    <chr> NA, NA, NA, NA, "So...
## $ WorkToolsFrequencyUnix                     <chr> "Most of the time",...
## $ WorkToolsFrequencySelect1                  <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySelect2                  <chr> NA, NA, NA, NA, NA,...
## $ WorkFrequencySelect3                       <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsSelect                          <chr> "A/B Testing,Cross-...
## $ `WorkMethodsFrequencyA/B`                  <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyAssociationRules       <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyBayesian               <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyCNNs                   <chr> NA, "Most of the ti...
## $ WorkMethodsFrequencyCollaborativeFiltering <chr> NA, NA, NA, NA, NA,...
## $ `WorkMethodsFrequencyCross-Validation`     <chr> "Sometimes", NA, "S...
## $ WorkMethodsFrequencyDataVisualization      <chr> "Most of the time",...
## $ WorkMethodsFrequencyDecisionTrees          <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyEnsembleMethods        <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyEvolutionaryApproaches <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyGANs                   <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyGBM                    <chr> NA, NA, "Sometimes"...
## $ WorkMethodsFrequencyHMMs                   <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyKNN                    <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyLiftAnalysis           <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyLogisticRegression     <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyMLN                    <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyNaiveBayes             <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyNLP                    <chr> "Most of the time",...
## $ WorkMethodsFrequencyNeuralNetworks         <chr> NA, "Most of the ti...
## $ WorkMethodsFrequencyPCA                    <chr> NA, "Often", NA, NA...
## $ WorkMethodsFrequencyPrescriptiveModeling   <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyRandomForests          <chr> NA, NA, "Sometimes"...
## $ WorkMethodsFrequencyRecommenderSystems     <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyRNNs                   <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySegmentation           <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencySimulation             <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySVMs                   <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyTextAnalysis           <chr> "Most of the time",...
## $ WorkMethodsFrequencyTimeSeriesAnalysis     <chr> "Often", NA, NA, NA...
## $ WorkMethodsFrequencySelect1                <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySelect2                <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySelect3                <chr> NA, NA, NA, NA, NA,...
## $ WorkDataVisualizations                     <chr> "26-50% of projects...
## $ id                                         <int> 7, 22, 23, 25, 35, ...
glimpse(learner)
## Observations: 154
## Variables: 50
## $ GenderSelect                            <chr> "Female", "Male", "Non...
## $ Country                                 <chr> "United States", "Unit...
## $ Age                                     <int> 22, 47, 21, 27, 13, 23...
## $ EmploymentStatus                        <chr> "Not employed, and not...
## $ StudentStatus                           <chr> "Yes", "No", "Yes", "Y...
## $ LearningDataScience                     <chr> "Yes, but data science...
## $ MLToolNextYearSelect                    <chr> "SQL", "TensorFlow", "...
## $ MLMethodNextYearSelect                  <chr> "Deep learning", "Deep...
## $ LanguageRecommendationSelect            <chr> "R", "Python", "Python...
## $ PublicDatasetsSelect                    <chr> "GitHub,Google Search,...
## $ LearningPlatformSelect                  <chr> "College/University,St...
## $ LearningPlatformUsefulnessArxiv         <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessBlogs         <chr> NA, "Somewhat useful",...
## $ LearningPlatformUsefulnessCollege       <chr> "Very useful", NA, "So...
## $ LearningPlatformUsefulnessCompany       <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessConferences   <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessFriends       <chr> NA, NA, "Very useful",...
## $ LearningPlatformUsefulnessKaggle        <chr> NA, NA, "Somewhat usef...
## $ LearningPlatformUsefulnessNewsletters   <chr> NA, "Somewhat useful",...
## $ LearningPlatformUsefulnessCommunities   <chr> NA, NA, NA, NA, "Very ...
## $ LearningPlatformUsefulnessDocumentation <chr> NA, NA, "Not Useful", ...
## $ LearningPlatformUsefulnessCourses       <chr> NA, "Very useful", "Ve...
## $ LearningPlatformUsefulnessProjects      <chr> NA, NA, "Somewhat usef...
## $ LearningPlatformUsefulnessPodcasts      <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessSO            <chr> "Somewhat useful", NA,...
## $ LearningPlatformUsefulnessTextbook      <chr> NA, NA, "Not Useful", ...
## $ LearningPlatformUsefulnessTutoring      <chr> NA, NA, NA, NA, "Very ...
## $ LearningPlatformUsefulnessYouTube       <chr> "Somewhat useful", NA,...
## $ BlogsPodcastsNewslettersSelect          <chr> "Becoming a Data Scien...
## $ LearningDataScienceTime                 <chr> "< 1 year", "1-2 years...
## $ JobSkillImportanceBigData               <chr> "Necessary", "Nice to ...
## $ JobSkillImportanceDegree                <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceStats                 <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceEnterpriseTools       <chr> "Nice to have", "Unnec...
## $ JobSkillImportancePython                <chr> "Nice to have", "Neces...
## $ JobSkillImportanceR                     <chr> "Necessary", "Nice to ...
## $ JobSkillImportanceSQL                   <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceKaggleRanking         <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceMOOC                  <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceVisualizations        <chr> "Nice to have", "Neces...
## $ JobSkillImportanceOtherSelect1          <chr> NA, NA, NA, NA, NA, NA...
## $ JobSkillImportanceOtherSelect2          <chr> NA, NA, NA, NA, NA, NA...
## $ JobSkillImportanceOtherSelect3          <chr> NA, NA, NA, NA, NA, NA...
## $ CoursePlatformSelect                    <chr> NA, "Coursera,Udacity"...
## $ HardwarePersonalProjectsSelect          <chr> "Basic laptop (Macbook...
## $ TimeSpentStudying                       <chr> "2 - 10 hours", "2 - 1...
## $ ProveKnowledgeSelect                    <chr> "Experience from work ...
## $ DataScienceIdentitySelect               <chr> "No", "Yes", "Sort of ...
## $ FormalEducation                         <chr> "Some college/universi...
## $ id                                      <int> 44, 85, 208, 210, 212,...
# Take a peek at the demographics of those who are employed...
employed %>%
  group_by(CurrentJobTitleSelect) %>%
  summarise(n())
## # A tibble: 16 x 2
##    CurrentJobTitleSelect                `n()`
##    <chr>                                <int>
##  1 Business Analyst                        45
##  2 Computer Scientist                      36
##  3 Data Analyst                           155
##  4 Data Miner                               7
##  5 Data Scientist                         559
##  6 DBA/Database Engineer                   17
##  7 Engineer                                53
##  8 Machine Learning Engineer               79
##  9 Operations Research Practitioner        13
## 10 Other                                  149
## 11 Predictive Modeler                      38
## 12 Programmer                              17
## 13 Researcher                             110
## 14 Scientist/Researcher                   185
## 15 Software Developer/Software Engineer   151
## 16 Statistician                            62
# Need to tidy this data so that each response is in a separate row rather than all in one
employed %>%
  group_by(CurrentEmployerType) %>%
  summarise(n())
## # A tibble: 34 x 2
##    CurrentEmployerType                                               `n()`
##    <chr>                                                             <int>
##  1 Employed by a company that performs advanced analytics              549
##  2 Employed by a company that performs advanced analytics,Employed …     4
##  3 Employed by a company that performs advanced analytics,Self-empl…     3
##  4 Employed by college or university                                   291
##  5 Employed by college or university,Employed by a company that per…     1
##  6 Employed by college or university,Employed by a company that per…     1
##  7 Employed by college or university,Employed by government              2
##  8 Employed by college or university,Employed by non-profit or NGO       6
##  9 Employed by college or university,Employed by non-profit or NGO,…     1
## 10 Employed by company that makes advanced analytic software           182
## # ... with 24 more rows
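
The output above has 34 rows because each distinct combination of selected employer types counts as its own category. As a hedged illustration of the tidying the comment calls for, the base R sketch below splits comma-separated selections into one element per option and then tallies them. The toy vector is made up, and the sketch assumes that no option label itself contains a comma.

```r
# Toy multi-select responses (made up for illustration)
employer <- c(
  "Employed by college or university",
  "Employed by college or university,Employed by government",
  "Employed by government",
  NA
)

# Split each non-missing response on commas so every selected option
# becomes its own element
options_long <- unlist(strsplit(employer[!is.na(employer)], ",", fixed = TRUE))

# Tally how often each individual option was selected
option_counts <- sort(table(options_long), decreasing = TRUE)
print(option_counts)
```

After splitting, a respondent who ticked two options contributes to two counts, so the totals can exceed the number of respondents.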
employed %>%
  filter(JobFunctionSelect != 'Build and/or run the data infrastructure that your business uses for storing, analyzing, and operationalizing data') %>%
  group_by(CurrentJobTitleSelect) %>%
  summarise(n())
## # A tibble: 16 x 2
##    CurrentJobTitleSelect                `n()`
##    <chr>                                <int>
##  1 Business Analyst                        42
##  2 Computer Scientist                      29
##  3 Data Analyst                           144
##  4 Data Miner                               4
##  5 Data Scientist                         522
##  6 DBA/Database Engineer                    7
##  7 Engineer                                37
##  8 Machine Learning Engineer               76
##  9 Operations Research Practitioner        13
## 10 Other                                  132
## 11 Predictive Modeler                      38
## 12 Programmer                              11
## 13 Researcher                             109
## 14 Scientist/Researcher                   172
## 15 Software Developer/Software Engineer   100
## 16 Statistician                            60
# Take a peek at the demographics of those who are learners...
learner %>%
  group_by(StudentStatus, LearningDataScience) %>%
  summarise(n())
## # A tibble: 4 x 3
## # Groups: StudentStatus [?]
##   StudentStatus LearningDataScience                                  `n()`
##   <chr>         <chr>                                                <int>
## 1 No            Yes, but data science is a small part of what I'm f…    18
## 2 No            Yes, I'm focused on learning mostly data science sk…    23
## 3 Yes           Yes, but data science is a small part of what I'm f…    50
## 4 Yes           Yes, I'm focused on learning mostly data science sk…    63

7 Research Question 5

Is there any interaction between the Kaggle survey takers’ programming language use (R or Python) and the programming languages they recommend? (e.g. R users recommending R more than Python users recommending Python) (Burcu)

7.1 Variables and their definition

This section will describe the variable names and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).

7.2 Manipulating data

dim(MC)
## [1] 16716   229
tb1 <- MC %>%
  select(id, WorkToolsSelect) %>%
  filter(id %in% c(1:6))
datatable(tb1)
# removing NAs and empty values in column WorkToolsSelect
df <- MC[!(MC$WorkToolsSelect == "" | is.na(MC$WorkToolsSelect)), ]
dim(df)
## [1] 7955  229
tb2 <- df %>%
  select(id, WorkToolsSelect) %>%
  filter(id %in% c(1:6))
datatable(tb2)
# creating a new variable called work_tools where the original column values are split
# please note that this code will generate long data (one row per respondent per tool)
df1 <- df %>%
  mutate(work_tools = strsplit(as.character(WorkToolsSelect), ",")) %>%
  unnest(work_tools)
# check
tb3 <- df1 %>%
  select(id, WorkToolsSelect, work_tools) %>%
  filter(id %in% c(1:3))
datatable(tb3)
# back to one row per respondent, with one 0/1 indicator column per tool
df2 <- df1 %>%
  group_by(id, work_tools) %>%
  summarize(total_count = n()) %>%
  spread(work_tools, total_count, fill = 0)
# classify each respondent by whether they use R, Python, both, or neither
df3 <- df2 %>%
  mutate(lang_use = case_when(
    (R == 1 & Python == 0) ~ "Using R Only",
    (R == 0 & Python == 1) ~ "Using Python only",
    (R == 1 & Python == 1) ~ "Using Both Python and R",
    (R == 0 & Python == 0) ~ "Using Neither Python nor R")) %>%
  select(id, R, Python, lang_use)
tb4 <- df3 %>%
  filter(id %in% c(1:10))
datatable(tb4)
# computing percentages
df4 <- df3 %>%
  group_by(lang_use) %>%
  summarize(total_count = n()) %>%
  mutate(percent = (total_count / sum(total_count)) * 100,
         percent = round(percent, digits = 2))
# checking
datatable(df4, colnames = c("Programming Language Survey takers use", "Count", "Percent"), class = 'cell-border stripe', caption = 'Table 1: Descriptive Statistics', options = list(pageLength = 2, dom = 'tip'))
p <- ggplot(df4, aes(x = lang_use, y = percent, fill = lang_use)) +
  geom_bar(stat = "identity", width = .5) +
  labs(x = "Language ", y = "The distribution of R and Python among their users (%) ",
       title = "Bar Graph of R and Python users") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 100))
ggplotly(p)
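
The classification above relies on matching tool names exactly after splitting on commas. The sketch below shows the same idea in base R on made-up data; the helper `uses_tool` is hypothetical and not part of the original analysis. Exact matching matters here: a substring search for "R" would also match any longer tool name that happens to contain a capital R.

```r
# Toy multi-select answers (made up for illustration)
tools <- c("Python,R,SQL", "R,SAS Base", "Python,TensorFlow", "Java,SQL")

# Hypothetical helper: TRUE when `tool` is exactly one of the
# comma-separated selections in each response
uses_tool <- function(selection, tool) {
  vapply(strsplit(selection, ",", fixed = TRUE),
         function(parts) tool %in% trimws(parts),
         logical(1))
}

uses_r      <- uses_tool(tools, "R")
uses_python <- uses_tool(tools, "Python")

# Same four categories as in the report, derived from the two indicators
lang_use <- ifelse(uses_r & uses_python, "Using Both Python and R",
            ifelse(uses_r,               "Using R Only",
            ifelse(uses_python,          "Using Python only",
                                         "Using Neither Python nor R")))

print(data.frame(tools, lang_use))
```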

Let’s examine the above graph by LanguageRecommendationSelect

# check
tb5 <- df1 %>%
  select(id, WorkToolsSelect, work_tools, LanguageRecommendationSelect) %>%
  filter(id %in% c(1:3))
datatable(tb5)
df5 <- df1 %>%
  group_by(id, work_tools, LanguageRecommendationSelect) %>%
  summarize(total_count = n()) %>%
  spread(work_tools, total_count, fill = 0) %>%
  mutate(lang_use = case_when(
           (R == 1 & Python == 0) ~ "Using R Only",
           (R == 0 & Python == 1) ~ "Using Python only",
           (R == 1 & Python == 1) ~ "Using Both Python and R",
           (R == 0 & Python == 0) ~ "Using Neither Python nor R"),
         # case_when takes the first matching condition, so missing or blank answers are
         # handled first and everything other than R or Python falls through to the last line
         lang_rec = case_when(
           is.na(LanguageRecommendationSelect) | trimws(LanguageRecommendationSelect) == "" ~ "Recommending Nothing",
           LanguageRecommendationSelect == "R" ~ "Recommending R",
           LanguageRecommendationSelect == "Python" ~ "Recommending Python",
           TRUE ~ "Recommending Neither Python nor R")) %>%
  select(id, R, Python, lang_use, lang_rec)
dim(df5)
## [1] 7955    5
tb6 <- df5 %>%
  filter(id %in% c(1:10))
datatable(tb6)
# computing percentages (within each lang_use group)
df6 <- df5 %>%
  group_by(lang_use, lang_rec) %>%
  summarize(total_count = n()) %>%
  mutate(percent = (total_count / sum(total_count)) * 100,
         percent = round(percent, digits = 2))
# checking
datatable(df6, colnames = c("Programming Language Survey takers use", "Recommended language", "Count", "Percent"), class = 'cell-border stripe', caption = 'Table 2: Descriptive Statistics', options = list(pageLength = 2, dom = 'tip'))
p1 <- ggplot(df6, aes(x = lang_use, y = percent, fill = lang_use)) +
  geom_bar(stat = "identity", width = .5) +
  labs(x = "Language ", y = "The distribution of R and Python among their users (%) ",
       title = "Bar Graph of R and Python users and their recommended language") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 100)) +
  facet_wrap(~lang_rec) +
  theme(legend.position = 'none')
ggplotly(p1)
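
Whether "R users recommend R more than Python users recommend Python" depends on which group the percentages are computed within. The base R sketch below uses made-up data to illustrate row percentages, that is, the share of each recommendation among users of a given language. It is one possible reading of the research question, not a reproduction of the authors' results.

```r
# Toy data: language used vs. language recommended (made up for illustration)
toy <- data.frame(
  lang_use = c("R", "R", "R", "R",
               "Python", "Python", "Python", "Python", "Python"),
  lang_rec = c("R", "R", "R", "Python",
               "Python", "Python", "Python", "Python", "R"),
  stringsAsFactors = FALSE
)

# Cross-tabulate use (rows) against recommendation (columns)
counts <- table(toy$lang_use, toy$lang_rec)

# margin = 1 makes each row sum to 100: percentages within each user group
row_pct <- round(prop.table(counts, margin = 1) * 100, 2)
print(row_pct)
```

In this toy example 75% of R users recommend R and 80% of Python users recommend Python. Comparing those two diagonal cells is the kind of comparison the research question describes.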

7.3 Exploratory Data Analysis (EDA)

8 Research Question 6

Of those receiving pay in US Dollars, is Python or R more profitable overall for a Kaggle survey taker? (Gabby)

8.1 Variables and their definition

This section will describe the variable names and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).

8.2 Manipulating data

RQ6 <- MC %>%
  mutate(work_tools = strsplit(as.character(WorkToolsSelect), ",")) %>%
  unnest(work_tools)
RQ6 <- RQ6 %>%
  filter(!is.na(WorkToolsSelect)) %>% # Filters out all rows with NA in the WorkToolsSelect column
  filter(CompensationCurrency == "USD") %>% # Makes sure to only use rows whose currency is USD
  filter(work_tools == "Python" | work_tools == "R") %>% # The work tools are R or Python, period.
  select(id, work_tools, CompensationAmount) # Only have three columns to work with
RQ6_ids <- select(filter(as.data.frame(table(RQ6$id)), Freq == 1), Var1) # Only want people who use R or Python EXCLUSIVELY, not both
RQ6_ids <- droplevels(RQ6_ids)$Var1 # Removed the unused levels so we can actually get the IDs
RQ6 <- filter(RQ6, id %in% RQ6_ids) # Only keep those rows whose id is in the list of ids with R or Python exclusively used at work
RQ6 <- select(RQ6, -id) # No use for the ID anymore, it's done its job
RQ6$CompensationAmount <- gsub(",", "", RQ6$CompensationAmount) # Removed the commas from the compensation amount to prep for numeric transformation
RQ6$CompensationAmount <- as.numeric(RQ6$CompensationAmount) # Made the column numeric for easier mathematical comparison and sorting
RQ6 <- filter(RQ6, CompensationAmount < 9999999) # ... let's just be a little realistic: nobody here is earning ten million a year, and this one-dollar-off-from-ten-million entry is an anomaly in the data set
rm(RQ6_ids) # remove the now-unused variable to save memory
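
The two key steps above are keeping respondents who appear only once after filtering to R and Python, and turning comma-formatted amounts into numbers. They can be tried on a small made-up example. The base R sketch below is an illustration under those assumptions, not the authors' code. It also makes explicit that amounts that cannot be parsed become NA and are dropped.

```r
# Toy long-format data: one row per respondent per language (made up)
toy <- data.frame(
  id = c(1, 1, 2, 3, 4),
  work_tools = c("Python", "R", "Python", "R", "R"),
  CompensationAmount = c("120,000", "120,000", "95,000", "88,500", "-"),
  stringsAsFactors = FALSE
)

# Respondents listed exactly once use only one of the two languages
n_per_id  <- table(toy$id)
single    <- names(n_per_id)[n_per_id == 1]
exclusive <- toy[as.character(toy$id) %in% single, ]

# Strip thousands separators, then convert; unparseable entries become NA
exclusive$CompensationAmount <- suppressWarnings(
  as.numeric(gsub(",", "", exclusive$CompensationAmount, fixed = TRUE))
)

# Drop rows whose amount could not be parsed
exclusive <- exclusive[!is.na(exclusive$CompensationAmount), ]
print(exclusive)
```

Respondent 1 is removed for using both languages and respondent 4 for an unparseable amount, leaving one Python-only and one R-only row.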

8.3 Exploratory Data Analysis (EDA)

RQ6_boxplot <- ggplot(RQ6) +
  geom_boxplot(aes(x = factor(work_tools),
                   y = CompensationAmount,
                   fill = factor(work_tools))) +
  scale_y_continuous(breaks = seq(0, 2000000, 25000)) +
  labs(x = "Programming Language",
       y = "Annual Compensation in USD",
       fill = "Programming Language")
# Zoom the y axis to the whisker range so that outliers do not flatten the boxes
RQ6_boxplot_ylim <- boxplot.stats(RQ6$CompensationAmount)$stats[c(1, 5)]
RQ6_boxplot <- RQ6_boxplot + coord_cartesian(ylim = RQ6_boxplot_ylim * 1.05)
RQ6_boxplot

The average survey taker who used Python in their job made approximately $14,648.50 more than the average survey taker who used R in their job. While R users overall had a higher base pay (about $5,000.00 more than their Python counterparts), their salaries showed noticeably less room for growth in comparison. Outliers aside, if the data collected are considered representative of the data science population, there is some indication that a prospective Data Scientist should learn R first for a higher initial salary, and then learn Python to increase their chance of obtaining a job with more growth potential.
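
The report does not show how these figures were computed. As a hedged illustration of how a group can have the higher mean but the lower median, the base R sketch below compares both statistics on made-up salaries. The numbers are invented and are not the survey results.

```r
# Toy salaries (made up for illustration, not survey data)
toy <- data.frame(
  work_tools = c("Python", "Python", "Python", "R", "R", "R"),
  CompensationAmount = c(60000, 90000, 210000, 70000, 95000, 120000),
  stringsAsFactors = FALSE
)

# Mean and median compensation per language
means   <- tapply(toy$CompensationAmount, toy$work_tools, mean)
medians <- tapply(toy$CompensationAmount, toy$work_tools, median)

# Positive gap: Python higher; negative gap: R higher
mean_gap   <- means[["Python"]] - means[["R"]]
median_gap <- medians[["Python"]] - medians[["R"]]

print(rbind(means, medians))
print(c(mean_gap = mean_gap, median_gap = median_gap))
```

In this toy data one large Python salary pulls the Python mean above the R mean even though the typical (median) R salary is higher. Reporting both statistics side by side makes that kind of pattern visible.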