The data set used in this project is the Kaggle ML and Data Science Survey 2017. The survey results were stored in two different data sets: a) multiple choice items and b) free-response items. Kaggle stored each data set in CSV format. We downloaded the multiple choice survey results in CSV format and placed them in our GitHub repo.
Importing Multiple Choice data
linkMC<-"https://raw.githubusercontent.com/betsyrosalen/DATA_607_Project_3/master/project3_master/rawdata/multipleChoiceResponses.csv"
# Importing the multiple choice items (read_csv() comes from readr)
library(readr)
MC <- read_csv(linkMC)
dim(MC)
## [1] 16716 228
# Let's create a unique ID variable
MC$id <- seq.int(nrow(MC))
Ignore this code: importing conversion rates data in case we want to do analyses
# link_conversion<-"https://raw.githubusercontent.com/betsyrosalen/DATA_607_Project_3/master/project3_master/rawdata/conversionRates.csv"
# # importing conversion rates
# conversion<-read_csv (link_conversion)
# dim(conversion)
# # let's create a unique ID variable
# conversion$id <- seq.int(nrow(conversion))
This project will answer this global research question: Which are the most valued data science skills? The following 6 research questions will help answer this global question.
What is the relationship between the most popular platforms for learning DS and X (Niteen)? Alternatively phrased: What data science learning resources and which locations of open data are utilized by people of varying levels of education? (delete me if you need to!)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
Does a survey taker's formal education have any relationship to the ML/DS method they are
most excited about learning in the next year? (Binish)
To do the analysis, we concentrate on two columns in the dataset:
FormalEducation and MLMethodNextYearSelect
FormalEducation : Which level of formal education have you attained?
MLMethodNextYearSelect : Which ML/DS method are you most excited about learning
in the next year?
These questions were asked of all participants.
First we plot the distribution of formal education in the dataset
The data set predominantly contains respondents with a Master's degree.
Now let's look at the different ML/DS methods in the dataset
| ML Methods |
|---|
| Random Forests |
| Deep learning |
| Neural Nets |
| Text Mining |
| Genetic & Evolutionary Algorithms |
| Link Analysis |
| Rule Induction |
| Regression |
| Proprietary Algorithms |
| I don’t plan on learning a new ML/DS method |
| Ensemble Methods (e.g. boosting, bagging) |
| Factor Analysis |
| Social Network Analysis |
| Monte Carlo Methods |
| Time Series Analysis |
| Other |
| Bayesian Methods |
| Survival Analysis |
| MARS |
| Anomaly Detection |
| Cluster Analysis |
| Decision Trees |
| Association Rules |
| Uplift Modeling |
| Support Vector Machines (SVM) |
Now we can plot the distribution of ML/DS methods with formal education
| FormalEducation | MLMethodNextYearSelect | proportion |
|---|---|---|
| Bachelor | Deep learning | 0.40 |
| Bachelor | Neural Nets | 0.14 |
| Bachelor | Time Series Analysis | 0.06 |
| College Dropout | Deep learning | 0.37 |
| College Dropout | Neural Nets | 0.16 |
| College Dropout | Time Series Analysis | 0.07 |
| Doctoral | Deep learning | 0.44 |
| Doctoral | Neural Nets | 0.10 |
| Doctoral | Bayesian Methods | 0.06 |
| Doctoral | Time Series Analysis | 0.06 |
| Masters | Deep learning | 0.40 |
| Masters | Neural Nets | 0.12 |
| Masters | Time Series Analysis | 0.07 |
| Professional | Deep learning | 0.38 |
| Professional | Neural Nets | 0.14 |
| Professional | Time Series Analysis | 0.05 |
| High School | Deep learning | 0.39 |
| High School | Neural Nets | 0.14 |
| High School | Genetic & Evolutionary Algorithms | 0.10 |
Deep learning is the most popular ML/DS method at every level of formal education,
followed by Neural Nets. For most groups the third choice is Time Series Analysis.
The exceptions are high school graduates, whose third choice is Genetic &
Evolutionary Algorithms, and doctoral survey takers, for whom Bayesian Methods
ties with Time Series Analysis for third place.
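The code behind the table above is not shown in this report. As a minimal sketch of one way such a summary could be computed, the example below counts responses per education level and keeps the three most common methods in each group. The data frame `toy` and its values are invented for illustration; only the column names FormalEducation and MLMethodNextYearSelect come from the survey, and the use of `slice_max()` is an assumption about approach, not the original method.

```r
library(dplyr)

# Invented stand-in for the survey data (not real responses)
toy <- data.frame(
  FormalEducation = c(rep("Masters", 5), rep("Doctoral", 4)),
  MLMethodNextYearSelect = c("Deep learning", "Deep learning", "Deep learning",
                             "Neural Nets", "Time Series Analysis",
                             "Deep learning", "Deep learning",
                             "Bayesian Methods", "Neural Nets"),
  stringsAsFactors = FALSE
)

top_methods <- toy %>%
  # Drop respondents who skipped either question
  filter(!is.na(FormalEducation), !is.na(MLMethodNextYearSelect)) %>%
  # Count responses for each education level / method pair
  count(FormalEducation, MLMethodNextYearSelect, name = "responses") %>%
  group_by(FormalEducation) %>%
  # Share of each method within its education level
  mutate(proportion = round(responses / sum(responses), 2)) %>%
  # Keep the three most frequent methods per level (ties are kept)
  slice_max(responses, n = 3, with_ties = TRUE) %>%
  ungroup()

top_methods
```

Keeping ties matters here: when two methods share third place, as happens for doctoral respondents in the table above, both rows are returned.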
What are the most frequently used DS methods? Where is the most time spent in terms of working with data? Do either of these correlate with job title or level of education? (Zach)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
Is there a difference between what ‘Learners’ think are the important skills to learn and what employed Data Scientists say are the skills and tools they are using? (Betsy)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
# Select only variables that seem most related to “Which are the most valued data science skills?”
# I may narrow down these columns even more later, but want to leave as much as possible for now
# Filter for US only
USOnly <- MC %>%
  select(-c(56:73, 76:79, 167:196, 198:228)) %>%
  filter(Country == 'United States')
# Separate those employed in Data Science from those who are not.
# Filter for Employed only, TitleFit better than 'Poorly', and CodeWriters only
# Remove those that said they are "Employed by a company that doesn't perform advanced analytics"
employed <- USOnly %>%
  filter(!grepl('Not employed', EmploymentStatus),
         TitleFit != "Poorly",
         !grepl('doesn\'t perform advanced analytics', CurrentEmployerType),
         CodeWriter == "Yes",
         # Note: this exact-match test uses a shortened answer text, so it does not
         # exclude the full-length "Build and/or run the data infrastructure ..." answer;
         # it still drops rows where JobFunctionSelect is NA
         JobFunctionSelect != 'Build and/or run the data infrastructure')
# Filter for Data Science Learners who are not employed.
# The survey failed to capture those who are employed and ALSO students or learners!
# It didn't ask employed respondents if they were also studying Data Science.
learner <- USOnly %>%
  filter(grepl('Not employed', EmploymentStatus),
         grepl('Yes', LearningDataScience))
# Get rid of empty columns
# (remove_empty_cols() is deprecated in janitor; remove_empty() is its replacement)
employed <- remove_empty(employed, which = "cols")
learner <- remove_empty(learner, which = "cols")
glimpse(employed)
## Observations: 1,676
## Variables: 125
## $ GenderSelect <chr> "Male", "Male", "Ma...
## $ Country <chr> "United States", "U...
## $ Age <int> 35, 25, 33, NA, 35,...
## $ EmploymentStatus <chr> "Employed full-time...
## $ CodeWriter <chr> "Yes", "Yes", "Yes"...
## $ CurrentJobTitleSelect <chr> "Computer Scientist...
## $ TitleFit <chr> "Fine", "Fine", "Pe...
## $ CurrentEmployerType <chr> "Employed by govern...
## $ MLToolNextYearSelect <chr> "TensorFlow", "Amaz...
## $ MLMethodNextYearSelect <chr> "Text Mining", "Dee...
## $ LanguageRecommendationSelect <chr> "R", "Python", "Mat...
## $ PublicDatasetsSelect <chr> "Dataset aggregator...
## $ LearningPlatformSelect <chr> "Arxiv,Blogs,Kaggle...
## $ LearningPlatformUsefulnessArxiv <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessBlogs <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessCollege <chr> NA, "Very useful", ...
## $ LearningPlatformUsefulnessCompany <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessConferences <chr> NA, NA, "Not Useful...
## $ LearningPlatformUsefulnessFriends <chr> NA, NA, "Somewhat u...
## $ LearningPlatformUsefulnessKaggle <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessNewsletters <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessCommunities <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessDocumentation <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessCourses <chr> NA, NA, NA, "Very u...
## $ LearningPlatformUsefulnessProjects <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessPodcasts <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessSO <chr> NA, "Very useful", ...
## $ LearningPlatformUsefulnessTextbook <chr> "Very useful", "Som...
## $ LearningPlatformUsefulnessTradeBook <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessTutoring <chr> NA, "Very useful", ...
## $ LearningPlatformUsefulnessYouTube <chr> NA, NA, NA, NA, NA,...
## $ BlogsPodcastsNewslettersSelect <chr> NA, NA, NA, "KDnugg...
## $ DataScienceIdentitySelect <chr> "No", "Yes", "No", ...
## $ FormalEducation <chr> "Master's degree", ...
## $ UniversityImportance <chr> "Very important", "...
## $ JobFunctionSelect <chr> "Build and/or run t...
## $ WorkAlgorithmsSelect <chr> NA, "CNNs,Neural Ne...
## $ WorkToolsSelect <chr> "C/C++,Cloudera,Had...
## $ WorkToolsFrequencyAmazonML <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyAWS <chr> NA, NA, NA, "Often"...
## $ WorkToolsFrequencyAngoss <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyC <chr> "Sometimes", "Often...
## $ WorkToolsFrequencyCloudera <chr> "Most of the time",...
## $ WorkToolsFrequencyDataRobot <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyFlume <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyGCP <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyHadoop <chr> "Most of the time",...
## $ WorkToolsFrequencyIBMCognos <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyIBMSPSSModeler <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyIBMSPSSStatistics <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyIBMWatson <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyImpala <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyJava <chr> "Most of the time",...
## $ WorkToolsFrequencyJulia <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyJupyter <chr> NA, "Most of the ti...
## $ WorkToolsFrequencyKNIMECommercial <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyKNIMEFree <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMathematica <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMATLAB <chr> NA, NA, "Most of th...
## $ WorkToolsFrequencyAzure <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyExcel <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMicrosoftRServer <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMicrosoftSQL <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMinitab <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyNoSQL <chr> "Most of the time",...
## $ WorkToolsFrequencyOracle <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyOrange <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyPerl <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyPython <chr> NA, "Most of the ti...
## $ WorkToolsFrequencyQlik <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyR <chr> "Sometimes", NA, NA...
## $ WorkToolsFrequencyRapidMinerCommercial <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyRapidMinerFree <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySalfrod <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySAPBusinessObjects <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySASBase <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySASEnterprise <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySASJMP <chr> NA, NA, NA, NA, "Mo...
## $ WorkToolsFrequencySpark <chr> NA, NA, NA, "Someti...
## $ WorkToolsFrequencySQL <chr> NA, NA, NA, NA, "Of...
## $ WorkToolsFrequencyStan <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyStatistica <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyTableau <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyTensorFlow <chr> NA, "Often", NA, NA...
## $ WorkToolsFrequencyTIBCO <chr> NA, NA, NA, NA, "So...
## $ WorkToolsFrequencyUnix <chr> "Most of the time",...
## $ WorkToolsFrequencySelect1 <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySelect2 <chr> NA, NA, NA, NA, NA,...
## $ WorkFrequencySelect3 <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsSelect <chr> "A/B Testing,Cross-...
## $ `WorkMethodsFrequencyA/B` <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyAssociationRules <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyBayesian <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyCNNs <chr> NA, "Most of the ti...
## $ WorkMethodsFrequencyCollaborativeFiltering <chr> NA, NA, NA, NA, NA,...
## $ `WorkMethodsFrequencyCross-Validation` <chr> "Sometimes", NA, "S...
## $ WorkMethodsFrequencyDataVisualization <chr> "Most of the time",...
## $ WorkMethodsFrequencyDecisionTrees <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyEnsembleMethods <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyEvolutionaryApproaches <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyGANs <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyGBM <chr> NA, NA, "Sometimes"...
## $ WorkMethodsFrequencyHMMs <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyKNN <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyLiftAnalysis <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyLogisticRegression <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyMLN <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyNaiveBayes <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyNLP <chr> "Most of the time",...
## $ WorkMethodsFrequencyNeuralNetworks <chr> NA, "Most of the ti...
## $ WorkMethodsFrequencyPCA <chr> NA, "Often", NA, NA...
## $ WorkMethodsFrequencyPrescriptiveModeling <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyRandomForests <chr> NA, NA, "Sometimes"...
## $ WorkMethodsFrequencyRecommenderSystems <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyRNNs <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySegmentation <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencySimulation <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySVMs <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyTextAnalysis <chr> "Most of the time",...
## $ WorkMethodsFrequencyTimeSeriesAnalysis <chr> "Often", NA, NA, NA...
## $ WorkMethodsFrequencySelect1 <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySelect2 <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySelect3 <chr> NA, NA, NA, NA, NA,...
## $ WorkDataVisualizations <chr> "26-50% of projects...
## $ id <int> 7, 22, 23, 25, 35, ...
glimpse(learner)
## Observations: 154
## Variables: 50
## $ GenderSelect <chr> "Female", "Male", "Non...
## $ Country <chr> "United States", "Unit...
## $ Age <int> 22, 47, 21, 27, 13, 23...
## $ EmploymentStatus <chr> "Not employed, and not...
## $ StudentStatus <chr> "Yes", "No", "Yes", "Y...
## $ LearningDataScience <chr> "Yes, but data science...
## $ MLToolNextYearSelect <chr> "SQL", "TensorFlow", "...
## $ MLMethodNextYearSelect <chr> "Deep learning", "Deep...
## $ LanguageRecommendationSelect <chr> "R", "Python", "Python...
## $ PublicDatasetsSelect <chr> "GitHub,Google Search,...
## $ LearningPlatformSelect <chr> "College/University,St...
## $ LearningPlatformUsefulnessArxiv <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessBlogs <chr> NA, "Somewhat useful",...
## $ LearningPlatformUsefulnessCollege <chr> "Very useful", NA, "So...
## $ LearningPlatformUsefulnessCompany <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessConferences <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessFriends <chr> NA, NA, "Very useful",...
## $ LearningPlatformUsefulnessKaggle <chr> NA, NA, "Somewhat usef...
## $ LearningPlatformUsefulnessNewsletters <chr> NA, "Somewhat useful",...
## $ LearningPlatformUsefulnessCommunities <chr> NA, NA, NA, NA, "Very ...
## $ LearningPlatformUsefulnessDocumentation <chr> NA, NA, "Not Useful", ...
## $ LearningPlatformUsefulnessCourses <chr> NA, "Very useful", "Ve...
## $ LearningPlatformUsefulnessProjects <chr> NA, NA, "Somewhat usef...
## $ LearningPlatformUsefulnessPodcasts <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessSO <chr> "Somewhat useful", NA,...
## $ LearningPlatformUsefulnessTextbook <chr> NA, NA, "Not Useful", ...
## $ LearningPlatformUsefulnessTutoring <chr> NA, NA, NA, NA, "Very ...
## $ LearningPlatformUsefulnessYouTube <chr> "Somewhat useful", NA,...
## $ BlogsPodcastsNewslettersSelect <chr> "Becoming a Data Scien...
## $ LearningDataScienceTime <chr> "< 1 year", "1-2 years...
## $ JobSkillImportanceBigData <chr> "Necessary", "Nice to ...
## $ JobSkillImportanceDegree <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceStats <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceEnterpriseTools <chr> "Nice to have", "Unnec...
## $ JobSkillImportancePython <chr> "Nice to have", "Neces...
## $ JobSkillImportanceR <chr> "Necessary", "Nice to ...
## $ JobSkillImportanceSQL <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceKaggleRanking <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceMOOC <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceVisualizations <chr> "Nice to have", "Neces...
## $ JobSkillImportanceOtherSelect1 <chr> NA, NA, NA, NA, NA, NA...
## $ JobSkillImportanceOtherSelect2 <chr> NA, NA, NA, NA, NA, NA...
## $ JobSkillImportanceOtherSelect3 <chr> NA, NA, NA, NA, NA, NA...
## $ CoursePlatformSelect <chr> NA, "Coursera,Udacity"...
## $ HardwarePersonalProjectsSelect <chr> "Basic laptop (Macbook...
## $ TimeSpentStudying <chr> "2 - 10 hours", "2 - 1...
## $ ProveKnowledgeSelect <chr> "Experience from work ...
## $ DataScienceIdentitySelect <chr> "No", "Yes", "Sort of ...
## $ FormalEducation <chr> "Some college/universi...
## $ id <int> 44, 85, 208, 210, 212,...
# Take a peek at the demographics of those who are employed...
employed %>%
group_by(CurrentJobTitleSelect) %>%
summarise(n())
## # A tibble: 16 x 2
## CurrentJobTitleSelect `n()`
## <chr> <int>
## 1 Business Analyst 45
## 2 Computer Scientist 36
## 3 Data Analyst 155
## 4 Data Miner 7
## 5 Data Scientist 559
## 6 DBA/Database Engineer 17
## 7 Engineer 53
## 8 Machine Learning Engineer 79
## 9 Operations Research Practitioner 13
## 10 Other 149
## 11 Predictive Modeler 38
## 12 Programmer 17
## 13 Researcher 110
## 14 Scientist/Researcher 185
## 15 Software Developer/Software Engineer 151
## 16 Statistician 62
# Need to Tidy this data so that each response is in a separate row rather than all in one
employed %>%
group_by(CurrentEmployerType) %>%
summarise(n())
## # A tibble: 34 x 2
## CurrentEmployerType `n()`
## <chr> <int>
## 1 Employed by a company that performs advanced analytics 549
## 2 Employed by a company that performs advanced analytics,Employed … 4
## 3 Employed by a company that performs advanced analytics,Self-empl… 3
## 4 Employed by college or university 291
## 5 Employed by college or university,Employed by a company that per… 1
## 6 Employed by college or university,Employed by a company that per… 1
## 7 Employed by college or university,Employed by government 2
## 8 Employed by college or university,Employed by non-profit or NGO 6
## 9 Employed by college or university,Employed by non-profit or NGO,… 1
## 10 Employed by company that makes advanced analytic software 182
## # ... with 24 more rows
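The 34 employer-type combinations above come from a select-all question stored as one comma-separated string per respondent. A minimal sketch of the tidying step mentioned in the comment, assuming `tidyr::separate_rows()` is an acceptable tool and that no single answer option contains a comma, could look like this. The data frame `toy_employed` and its rows are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Invented example rows; the real column is CurrentEmployerType in `employed`
toy_employed <- data.frame(
  id = 1:3,
  CurrentEmployerType = c(
    "Employed by college or university",
    "Employed by college or university,Employed by government",
    "Employed by government"
  ),
  stringsAsFactors = FALSE
)

employer_counts <- toy_employed %>%
  # One row per respondent per selected employer type
  separate_rows(CurrentEmployerType, sep = ",") %>%
  # Count how many respondents selected each employer type
  count(CurrentEmployerType, name = "respondents", sort = TRUE)

employer_counts
```

After this step a respondent who selected two employer types contributes one row to each, so the counts describe selections rather than mutually exclusive groups.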
employed %>%
filter(JobFunctionSelect != 'Build and/or run the data infrastructure that your business uses for storing, analyzing, and operationalizing data') %>%
group_by(CurrentJobTitleSelect) %>%
summarise(n())
## # A tibble: 16 x 2
## CurrentJobTitleSelect `n()`
## <chr> <int>
## 1 Business Analyst 42
## 2 Computer Scientist 29
## 3 Data Analyst 144
## 4 Data Miner 4
## 5 Data Scientist 522
## 6 DBA/Database Engineer 7
## 7 Engineer 37
## 8 Machine Learning Engineer 76
## 9 Operations Research Practitioner 13
## 10 Other 132
## 11 Predictive Modeler 38
## 12 Programmer 11
## 13 Researcher 109
## 14 Scientist/Researcher 172
## 15 Software Developer/Software Engineer 100
## 16 Statistician 60
# Take a peek at the demographics of those who are learners...
learner %>%
group_by(StudentStatus, LearningDataScience) %>%
summarise(n())
## # A tibble: 4 x 3
## # Groups: StudentStatus [?]
## StudentStatus LearningDataScience `n()`
## <chr> <chr> <int>
## 1 No Yes, but data science is a small part of what I'm f… 18
## 2 No Yes, I'm focused on learning mostly data science sk… 23
## 3 Yes Yes, but data science is a small part of what I'm f… 50
## 4 Yes Yes, I'm focused on learning mostly data science sk… 63
Is there any interaction between the Kaggle survey takers’ programming language use (R or Python) and the programming languages they recommend? (e.g. R users recommending R more than Python users recommending Python) (Burcu)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
dim(MC)
## [1] 16716 229
tb1<-MC %>%
select (id, WorkToolsSelect) %>%
filter (id %in% c(1:6))
datatable(tb1)
#removing NAs and empty values in column=WorkToolsSelect
df <- MC[!(MC$WorkToolsSelect == "" | is.na(MC$WorkToolsSelect)), ]
dim(df)
## [1] 7955 229
tb2<-df %>%
select (id, WorkToolsSelect) %>%
filter (id %in% c(1:6))
datatable(tb2)
#creating a new variable called work_tools where the original column values are split
#please note that this code will generate long data
df1<-df %>%
mutate(work_tools = strsplit(as.character(WorkToolsSelect), ",")) %>%
unnest(work_tools)
#check
tb3<-df1 %>%
select (id, WorkToolsSelect,work_tools) %>%
filter (id %in% c(1:3))
datatable(tb3)
df2<-df1 %>%
group_by(id, work_tools) %>%
summarize (total_count = n()) %>%
spread( work_tools, total_count, fill=0)
df3<-df2 %>%
mutate(lang_use = case_when (
(R==1 & Python==0) ~ "Using R Only",
(R==0 & Python==1) ~ "Using Python only",
(R==1 & Python==1) ~ "Using Both Python and R",
(R==0 & Python==0) ~ "Using Neither Python nor R"))%>%
select (id, R, Python, lang_use)
tb4<-df3 %>%
filter (id %in% c(1:10))
datatable(tb4)
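The classification above works on a wide table with one 0/1 column per tool. As an alternative sketch (not the approach used in this report), the same four categories can be derived directly from the long data by checking, per respondent, whether "R" or "Python" appears among their tools. The data frame `toy_long` is invented, and `uses_r` and `uses_python` are hypothetical helper names.

```r
library(dplyr)

# Invented long-format data: one row per respondent per tool
toy_long <- data.frame(
  id = c(1, 1, 2, 3, 3, 4),
  work_tools = c("R", "Python", "Python", "R", "SQL", "SQL"),
  stringsAsFactors = FALSE
)

lang_use_tbl <- toy_long %>%
  group_by(id) %>%
  # Flag whether each respondent listed R and/or Python
  summarise(
    uses_r = any(work_tools == "R"),
    uses_python = any(work_tools == "Python"),
    .groups = "drop"
  ) %>%
  # Conditions are checked in order, so "both" must come first
  mutate(lang_use = case_when(
    uses_r & uses_python ~ "Using Both Python and R",
    uses_r ~ "Using R Only",
    uses_python ~ "Using Python only",
    TRUE ~ "Using Neither Python nor R"
  ))

lang_use_tbl
```

This avoids creating a column for every tool in the survey when only two of them are needed.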
#computing percentages
df4<-df3 %>%
  group_by(lang_use) %>%
  summarize (total_count = n()) %>%
  mutate(percent = ((total_count / sum(total_count)) * 100), percent=round(percent, digits=2))
#checking
datatable(df4, colnames=c("Programming Language Survey takers use", "Count", "Percent"),class = 'cell-border stripe',caption = 'Table 1: Descriptive Statistics',options = list(pageLength = 2, dom = 'tip'))
p<-ggplot (df4, aes(x=lang_use,y=percent,fill=lang_use )) +
geom_bar(stat="identity", width =.5) +
labs (x="Language ", y="The distribution of R and Python among their users (%) " ,
title="Bar Graph of R and Python users") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous (breaks=seq(0,100,10), limits = c(0,100))
ggplotly(p)
Let’s examine the above graph by LanguageRecommendationSelect
#check
tb5<-df1 %>%
select (id, WorkToolsSelect,work_tools, LanguageRecommendationSelect) %>%
filter (id %in% c(1:3))
datatable(tb5)
df5<-df1 %>%
  group_by(id, work_tools,LanguageRecommendationSelect) %>%
  summarize (total_count = n()) %>%
  spread( work_tools, total_count, fill=0) %>%
  mutate(lang_use = case_when (
      (R==1 & Python==0) ~ "Using R Only",
      (R==0 & Python==1) ~ "Using Python only",
      (R==1 & Python==1) ~ "Using Both Python and R",
      (R==0 & Python==0) ~ "Using Neither Python nor R"),
    lang_rec = case_when (
      # Missing answers are read in as real NA values (not the string "NA"), so test for them first
      is.na(LanguageRecommendationSelect) | trimws(LanguageRecommendationSelect)=="" ~ "Recommending Nothing",
      LanguageRecommendationSelect=="R" ~ "Recommending R",
      LanguageRecommendationSelect=="Python" ~ "Recommending Python",
      # Anything left is a language other than R or Python
      TRUE ~ "Recommending Neither Python nor R")) %>%
  select (id, R, Python, lang_use,lang_rec )
dim(df5)
## [1] 7955 5
tb6<-df5 %>%
filter (id %in% c(1:10))
datatable(tb6)
#computing percentages
df6<-df5 %>%
  group_by(lang_use,lang_rec) %>%
  summarize (total_count = n()) %>%
  mutate(percent = ((total_count / sum(total_count)) * 100), percent=round(percent, digits=2))
#checking
# df6 has four columns, so four column labels are supplied
datatable(df6, colnames=c("Programming Language Survey takers use", "Recommended Language", "Count", "Percent"),class = 'cell-border stripe',caption = 'Table 2: Descriptive Statistics by Recommended Language',options = list(pageLength = 2, dom = 'tip'))
p1<-ggplot (df6, aes(x=lang_use,y=percent,fill=lang_use )) +
geom_bar(stat="identity", width =.5) +
labs (x="Language ", y="The distribution of R and Python among their users (%) " ,
title="Bar Graph of R and Python users and their recommended language") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous (breaks=seq(0,100,10), limits = c(0,100))+
facet_wrap(~lang_rec)+
theme(legend.position = 'none')
ggplotly(p1)
Of those receiving pay in US Dollars, is Python or R overall most profitable for a Kaggle survey taker? (Gabby)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
RQ6 <- MC %>%
mutate(work_tools = strsplit(as.character(WorkToolsSelect), ",")) %>%
unnest(work_tools)
RQ6 <- RQ6 %>%
  filter(!is.na(WorkToolsSelect)) %>%                    # Drop rows with NA in the WorkToolsSelect column
  filter(CompensationCurrency == "USD") %>%              # Keep only rows whose compensation is in USD
  filter(work_tools == "Python" | work_tools == "R") %>% # Keep only the R and Python tool rows
  select(id, work_tools, CompensationAmount)             # Only three columns are needed from here on
# Keep only people who use R or Python EXCLUSIVELY (one of the two, not both)
RQ6_ids <- select(filter(as.data.frame(table(RQ6$id)), Freq == 1), Var1)
RQ6_ids <- droplevels(RQ6_ids)$Var1 # Drop unused factor levels and extract the IDs
RQ6 <- filter(RQ6, id %in% RQ6_ids) # Keep only rows whose id is in the list of exclusive R or Python users
RQ6 <- select(RQ6, -id) # No use for the ID anymore, it's done its job
RQ6$CompensationAmount <- gsub(",", "", RQ6$CompensationAmount) # Remove the commas from the compensation amount to prep for numeric conversion
RQ6$CompensationAmount <- as.numeric(RQ6$CompensationAmount) # Convert the column to numeric for easier comparison and sorting
# Be realistic: keep only amounts below 9,999,999. The entry one dollar short of
# ten million is treated as an anomaly in the data set.
RQ6 <- filter(RQ6, CompensationAmount < 9999999)
rm(RQ6_ids) # Remove the now-unused variable to save memory
RQ6_boxplot <- ggplot(RQ6) +
geom_boxplot( aes(x = factor(work_tools),
y = CompensationAmount,
fill = factor(work_tools)
)
) +
scale_y_continuous(breaks=seq(0,2000000,25000)) +
labs( x = "Programming Language",
y = "Annual Compensation in USD",
fill = "Programming Language")
RQ6_boxplot_ylim <- boxplot.stats(RQ6$CompensationAmount)$stats[c(1, 5)]
RQ6_boxplot <- RQ6_boxplot + coord_cartesian(ylim = RQ6_boxplot_ylim*1.05)
RQ6_boxplot
The average survey taker who used Python in their job made approximately $14,648.50 more than the average survey taker who used R. R users overall had a higher base pay, by roughly $5,000.00 compared with their Python counterparts, but their salary growth was noticeably more limited. Outliers aside, and if the data collected can be considered representative of the data science population, there is some indication that a prospective Data Scientist could learn R first for a higher initial salary and then learn Python to improve their chances of obtaining a job with more growth potential.
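The figures quoted above are not computed in the code shown. As a minimal sketch of how a comparison like this could be checked, the example below summarises compensation by language and takes the difference in means. The data frame `toy_pay` and all of its numbers are invented for illustration and do not reproduce the survey results; the column names follow the RQ6 data frame above.

```r
library(dplyr)

# Invented compensation values (not survey data)
toy_pay <- data.frame(
  work_tools = c("Python", "Python", "Python", "R", "R", "R"),
  CompensationAmount = c(60000, 90000, 150000, 70000, 80000, 90000),
  stringsAsFactors = FALSE
)

pay_summary <- toy_pay %>%
  group_by(work_tools) %>%
  summarise(
    respondents = n(),
    mean_pay = mean(CompensationAmount),     # sensitive to very high salaries
    median_pay = median(CompensationAmount), # more robust to outliers
    .groups = "drop"
  )

# Difference in average pay, Python minus R
mean_gap <- with(pay_summary,
                 mean_pay[work_tools == "Python"] - mean_pay[work_tools == "R"])

pay_summary
mean_gap
```

Reporting the median alongside the mean is useful here because a few very large salaries can move the mean considerably while leaving the median almost unchanged.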