Setup
library(knitr)
library(kableExtra)
library(prettydoc)
library(RCurl)
library(dplyr)
library(ggplot2)
library(forcats)
library(varhandle)
library(stringr)
Kaggle ML and Data Science Survey
Kaggle conducted a survey of more than 16,000 data scientists in 2017. Data used in the following analysis is pulled from https://www.kaggle.com/kaggle/kaggle-survey-2017.
At work, what proportion of your analytics projects incorporate data visualization?
One way to gauge how important a particular skill is to data scientists is to look at how much of a working data scientist’s job uses that skill. One question the survey asks is, “At work, what proportion of your analytics projects incorporate data visualization?”
As the analysis below shows, the most common response by far is that more than three-quarters of a respondent’s projects incorporate data visualization (the "76-99%" and "100%" answers are combined into a single category in the chart). Data visualization must be an important skill!
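To put a number on "most common response", the counts per category can be turned into shares. The following is a minimal sketch on made-up toy responses, not the survey data; the data frame `toy` and the column `share` are names introduced here for illustration only.

```r
library(dplyr)

# Toy responses (hypothetical values, not taken from the survey)
toy <- data.frame(
  WorkDataVisualizations = c("None", "10-25%", "76-100%",
                             "76-100%", "51-75%", "76-100%")
)

viz_share <- toy %>%
  count(WorkDataVisualizations, name = "n") %>%  # responses per category
  mutate(share = n / sum(n)) %>%                 # fraction of all responses
  arrange(desc(n))                               # most common category first

print(viz_share)
```

The same pipeline should apply to `Kaggle.Multi` once blank responses have been filtered out, assuming the column is named as in the chart code.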
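# stringsAsFactors = TRUE keeps the factor columns that the later steps rely on
# (read.csv defaults to FALSE from R 4.0 onward)
Kaggle.Multi <- read.csv("https://raw.githubusercontent.com/aliceafriedman/DATA607_Proj3/master/multipleChoiceResponses.csv",
                         sep = ",", header = TRUE, stringsAsFactors = TRUE)

# Order the response categories from lowest to highest share of projects
PercOrder <- c(
  "None",
  "10-25% of projects",
  "26-50% of projects",
  "51-75% of projects",
  "76-99% of projects",
  "100% of projects")
Kaggle.Multi$WorkDataVisualizations <- factor(Kaggle.Multi$WorkDataVisualizations, PercOrder)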
Kaggle.Multi %>%
  filter(WorkDataVisualizations != "") %>%
  # Shorten the labels and merge the two highest categories into "76-100%"
  mutate(WorkDataVisualizations = fct_recode(WorkDataVisualizations,
    "10-25%" = "10-25% of projects",
    "26-50%" = "26-50% of projects",
    "51-75%" = "51-75% of projects",
    "76-100%" = "76-99% of projects",
    "76-100%" = "100% of projects"
  )) %>%
  ggplot(aes(WorkDataVisualizations)) +
  geom_bar(aes(fill = WorkDataVisualizations)) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(title = "At work, what proportion of your analytics projects\n incorporate data visualizations?",
       x = "Percent of Projects Incorporating Visualization",
       y = "Count")
Who responded to this survey?
Kaggle.Multi %>%
  filter(CurrentJobTitleSelect != "") %>%
  mutate(CurrentJobTitleSelect = CurrentJobTitleSelect %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(fct_relevel(CurrentJobTitleSelect, "Other"))) +
  geom_bar(aes(fill = CurrentJobTitleSelect)) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position = "none") +
  labs(title = "Select the option that's most similar to your current job/professional \ntitle (or most recent title if retired)",
       x = "Job Title",
       y = "Count")
Salary Data
Another way to look at skills is to see how answers vary by salary: if a skill is important, we would expect it to draw a higher salary.
This data requires a bit of munging: salaries have been reported in several currencies, and a quick look at the responses suggests that many individuals entered their salaries in thousands, whereas others entered them in dollars. Some of the responses are in the millions. Are these liars, or just very well paid survey respondents?
To address the former, we will filter by responses reported in USD. To address the latter two, we have a few options:
1. Ignore outliers (e.g. responses less than $1,000, and more than $1,000,000)
2. Assume that everyone is telling the truth about their salaries
3. Assume that responses less than $1,000 are reporting in thousands (e.g. a response of $87 is intended to mean a salary of $87,000, something which would be expected based on overall salary data for data scientists)
4. Some combination of the above, taking into account known salary distributions for particular titles.
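The difference between ignoring small values and rescaling them can be sketched with a small helper. `clean_salary` and its arguments are hypothetical names introduced here for illustration, the input is a made-up vector, and the cutoffs are the ones quoted in the options above.

```r
# Hypothetical helper: clean a numeric vector of reported USD salaries.
# The default cutoffs (1000 and 1e6) come from the options listed above;
# reading small values as thousands is an assumption, not a verified fact.
clean_salary <- function(x, rescale_small = TRUE, low = 1000, high = 1e6) {
  small <- !is.na(x) & x < low
  if (rescale_small) {
    # Read values below `low` as thousands of dollars
    x <- ifelse(small, x * 1000, x)
  } else {
    # Treat values below `low` as outliers instead
    x[small] <- NA
  }
  # Values above `high` are treated as outliers in both cases
  x[!is.na(x) & x > high] <- NA
  x
}

# Toy input (hypothetical values, not survey responses)
toy_salaries <- c(87, 60000, 140000, 9999999, NA, 0)

rescaled <- clean_salary(toy_salaries)
dropped  <- clean_salary(toy_salaries, rescale_small = FALSE)
print(rescaled)
print(dropped)
```

Note that a reported salary of 0 stays 0 under rescaling, so a real analysis would probably need a separate rule for zero responses.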
The data munging needed to access the USD-reported salaries must also take into account that the data has been entered in a variety of formats, which means it cannot be easily converted to numeric.
First, let’s investigate the data to see which of these options is most reasonable.
USSalaries <- Kaggle.Multi %>%
  filter(CompensationCurrency == "USD") %>%
  select(CompensationAmount) %>%
  # First, we must unfactor the data using the varhandle library
  mutate(CompensationAmount = unfactor(CompensationAmount),
         # Then, we must remove all commas, which are included in some, but not all,
         # data entries before we can finally convert to numeric
         CompensationAmount = str_remove_all(CompensationAmount, ","),
         CompensationAmount = as.numeric(CompensationAmount))
## Warning in evalq(as.numeric(CompensationAmount), <environment>): NAs
## introduced by coercion
quantile(USSalaries[[1]], na.rm = TRUE)
##      0%     25%     50%     75%    100%
##       0   60000   96000  140000 9999999
ggplot(USSalaries, aes(y = CompensationAmount)) + geom_boxplot()
## Warning: Removed 38 rows containing non-finite values (stat_boxplot).
For the purpose of this analysis, I will choose option 4: ignoring outliers, limiting the sample to titles with known salary ranges, and assuming that reported salaries of less than $1,000 are intended as salaries in thousands of dollars per year.
I will also convert the compensation amounts from factor to numeric, and then sort them into bins.
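As a minimal sketch of those two steps on toy data: the factor is converted through its character labels, then binned with `cut()`. The values, bin edges, and labels below are illustrative assumptions and are not taken from the survey.

```r
# Toy factor of salary strings (hypothetical values)
comp <- factor(c("60000", "140,000", "96000", "250000"))

# as.numeric() on a factor returns the internal level codes, so convert to
# character first, stripping thousands separators along the way
comp_num <- as.numeric(gsub(",", "", as.character(comp)))

# Bin edges and labels are assumptions chosen for this example
breaks <- c(0, 50000, 100000, 150000, Inf)
labels <- c("<50k", "50-100k", "100-150k", "150k+")

# right = FALSE makes each bin include its lower edge: [0, 50000), ...
comp_bin <- cut(comp_num, breaks = breaks, labels = labels, right = FALSE)

print(table(comp_bin))
```

Base R's `gsub()` is used here to keep the sketch dependency-free; `str_remove_all()` from stringr would do the same job.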
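Kaggle.Multi %>%
  # Convert the factor to numeric via its level labels; entries that are not
  # plain numbers, such as those containing commas, become NA
  mutate(CompensationAmount = as.numeric(as.numeric(levels(CompensationAmount)[CompensationAmount])),
         # Treat values below 100 as reported in thousands
         # (a stricter cutoff than the $1,000 described above)
         CompensationAmount = ifelse(CompensationAmount < 100, CompensationAmount*1000, CompensationAmount)) %>%
  filter(CompensationCurrency=="USD",
         CompensationAmount < 1000000) %>%
  glimpse()
## Warning in evalq(as.numeric(as.numeric(levels(CompensationAmount)
## [CompensationAmount])), : NAs introduced by coercion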
## Observations: 1,288
## Variables: 228
## $ GenderSelect <fct> Male, Male, Male, ...
## $ Country <fct> United States, Uni...
## $ Age <int> 25, 33, 35, 37, 31...
## $ EmploymentStatus <fct> Employed part-time...
## $ StudentStatus <fct> , , , , , , , , , ...
## $ LearningDataScience <fct> , , , , , , , , , ...
## $ CodeWriter <fct> Yes, Yes, Yes, Yes...
## $ CareerSwitcher <fct> , , , , , , , , , ...
## $ CurrentJobTitleSelect <fct> Researcher, Scient...
## $ TitleFit <fct> Fine, Perfectly, F...
## $ CurrentEmployerType <fct> Employed by colleg...
## $ MLToolNextYearSelect <fct> Amazon Machine Lea...
## $ MLMethodNextYearSelect <fct> Deep learning, Dee...
## $ LanguageRecommendationSelect <fct> Python, Matlab, Py...
## $ PublicDatasetsSelect <fct> Dataset aggregator...
## $ LearningPlatformSelect <fct> Arxiv,College/Univ...
## $ LearningPlatformUsefulnessArxiv <fct> Somewhat useful, S...
## $ LearningPlatformUsefulnessBlogs <fct> , Not Useful, , So...
## $ LearningPlatformUsefulnessCollege <fct> Very useful, , , S...
## $ LearningPlatformUsefulnessCompany <fct> , , , Very useful,...
## $ LearningPlatformUsefulnessConferences <fct> , Not Useful, , So...
## $ LearningPlatformUsefulnessFriends <fct> , Somewhat useful,...
## $ LearningPlatformUsefulnessKaggle <fct> Very useful, , Ver...
## $ LearningPlatformUsefulnessNewsletters <fct> , , , Somewhat use...
## $ LearningPlatformUsefulnessCommunities <fct> , , , Somewhat use...
## $ LearningPlatformUsefulnessDocumentation <fct> , , , Somewhat use...
## $ LearningPlatformUsefulnessCourses <fct> , , Somewhat usefu...
## $ LearningPlatformUsefulnessProjects <fct> , , Somewhat usefu...
## $ LearningPlatformUsefulnessPodcasts <fct> , , , Somewhat use...
## $ LearningPlatformUsefulnessSO <fct> Very useful, , Ver...
## $ LearningPlatformUsefulnessTextbook <fct> Somewhat useful, ,...
## $ LearningPlatformUsefulnessTradeBook <fct> , , , Somewhat use...
## $ LearningPlatformUsefulnessTutoring <fct> Very useful, Very ...
## $ LearningPlatformUsefulnessYouTube <fct> , , , Somewhat use...
## $ BlogsPodcastsNewslettersSelect <fct> , , Becoming a Dat...
## $ LearningDataScienceTime <fct> , , , , , , , , , ...
## $ JobSkillImportanceBigData <fct> , , , , , , , , , ...
## $ JobSkillImportanceDegree <fct> , , , , , , , , , ...
## $ JobSkillImportanceStats <fct> , , , , , , , , , ...
## $ JobSkillImportanceEnterpriseTools <fct> , , , , , , , , , ...
## $ JobSkillImportancePython <fct> , , , , , , , , , ...
## $ JobSkillImportanceR <fct> , , , , , , , , , ...
## $ JobSkillImportanceSQL <fct> , , , , , , , , , ...
## $ JobSkillImportanceKaggleRanking <fct> , , , , , , , , , ...
## $ JobSkillImportanceMOOC <fct> , , , , , , , , , ...
## $ JobSkillImportanceVisualizations <fct> , , , , , , , , , ...
## $ JobSkillImportanceOtherSelect1 <fct> , , , , , , , , , ...
## $ JobSkillImportanceOtherSelect2 <fct> , , , , , , , , , ...
## $ JobSkillImportanceOtherSelect3 <fct> , , , , , , , , , ...
## $ CoursePlatformSelect <fct> , , , , , , , , , ...
## $ HardwarePersonalProjectsSelect <fct> , , , , , , , , , ...
## $ TimeSpentStudying <fct> , , , , , , , , , ...
## $ ProveKnowledgeSelect <fct> , , , , , , , , , ...
## $ DataScienceIdentitySelect <fct> Yes, No, Sort of (...
## $ FormalEducation <fct> Bachelor's degree,...
## $ MajorSelect <fct> Physics, Electrica...
## $ Tenure <fct> 3 to 5 years, Less...
## $ PastJobTitlesSelect <fct> , Engineer,Researc...
## $ FirstTrainingSelect <fct> University courses...
## $ LearningCategorySelftTaught <dbl> 30, 10, 10, 40, 35...
## $ LearningCategoryOnlineCourses <dbl> 20, 30, 50, 0, 20,...
## $ LearningCategoryWork <dbl> 0, 0, 15, 58, 15, ...
## $ LearningCategoryUniversity <dbl> 50, 10, 0, 2, 30, ...
## $ LearningCategoryKaggle <dbl> 0, 50, 25, 0, 0, 0...
## $ LearningCategoryOther <dbl> 0, 0, 0, 0, 0, 0, ...
## $ MLSkillsSelect <fct> Adversarial Learni...
## $ MLTechniquesSelect <fct> Decision Trees - G...
## $ ParentsEducation <fct> A master's degree,...
## $ EmployerIndustry <fct> Academic, Telecomm...
## $ EmployerSize <fct> I don't know, Fewe...
## $ EmployerSizeChange <fct> Increased signific...
## $ EmployerMLTime <fct> Don't know, Don't ...
## $ EmployerSearchMethod <fct> Some other way, A ...
## $ UniversityImportance <fct> Important, Somewha...
## $ JobFunctionSelect <fct> Research that adva...
## $ WorkHardwareSelect <fct> Basic laptop (Macb...
## $ WorkDataTypeSelect <fct> Image data,Text da...
## $ WorkProductionFrequency <fct> Never, Never, Rare...
## $ WorkDatasetSize <fct> 1GB, 100GB, 10MB, ...
## $ WorkAlgorithmsSelect <fct> CNNs,Neural Networ...
## $ WorkToolsSelect <fct> C/C++,Jupyter note...
## $ WorkToolsFrequencyAmazonML <fct> , , , , , Sometime...
## $ WorkToolsFrequencyAWS <fct> , , , , , , Someti...
## $ WorkToolsFrequencyAngoss <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyC <fct> Often, , , Often, ...
## $ WorkToolsFrequencyCloudera <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyDataRobot <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyFlume <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyGCP <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyHadoop <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyIBMCognos <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyIBMSPSSModeler <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyIBMSPSSStatistics <fct> , , , , , , , Rare...
## $ WorkToolsFrequencyIBMWatson <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyImpala <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyJava <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyJulia <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyJupyter <fct> Most of the time, ...
## $ WorkToolsFrequencyKNIMECommercial <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyKNIMEFree <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyMathematica <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyMATLAB <fct> , Most of the time...
## $ WorkToolsFrequencyAzure <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyExcel <fct> , , , , , , , , Of...
## $ WorkToolsFrequencyMicrosoftRServer <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyMicrosoftSQL <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyMinitab <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyNoSQL <fct> , , , , , Sometime...
## $ WorkToolsFrequencyOracle <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyOrange <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyPerl <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyPython <fct> Most of the time, ...
## $ WorkToolsFrequencyQlik <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyR <fct> , , Rarely, , , So...
## $ WorkToolsFrequencyRapidMinerCommercial <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyRapidMinerFree <fct> , , , , , , , , , ...
## $ WorkToolsFrequencySalfrod <fct> , , , , , , , , , ...
## $ WorkToolsFrequencySAPBusinessObjects <fct> , , , , , , , , , ...
## $ WorkToolsFrequencySASBase <fct> , , , , , , , Most...
## $ WorkToolsFrequencySASEnterprise <fct> , , , , , , , , , ...
## $ WorkToolsFrequencySASJMP <fct> , , Most of the ti...
## $ WorkToolsFrequencySpark <fct> , , , , , , , , , ...
## $ WorkToolsFrequencySQL <fct> , , Often, , , , ,...
## $ WorkToolsFrequencyStan <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyStatistica <fct> , , , , , , , , , ...
## $ WorkToolsFrequencyTableau <fct> , , , , , , , , Of...
## $ WorkToolsFrequencyTensorFlow <fct> Often, , , Often, ...
## $ WorkToolsFrequencyTIBCO <fct> , , Sometimes, , ,...
## $ WorkToolsFrequencyUnix <fct> , , , , , , Often,...
## $ WorkToolsFrequencySelect1 <fct> , , , , , , , , , ...
## $ WorkToolsFrequencySelect2 <fct> , , , , , , , , , ...
## $ WorkFrequencySelect3 <fct> , , , , , , , , , ...
## $ WorkMethodsSelect <fct> CNNs,Data Visualiz...
## $ WorkMethodsFrequencyA.B <fct> , , Most of the ti...
## $ WorkMethodsFrequencyAssociationRules <fct> , , , , , , Someti...
## $ WorkMethodsFrequencyBayesian <fct> , , , , , , Someti...
## $ WorkMethodsFrequencyCNNs <fct> Most of the time, ...
## $ WorkMethodsFrequencyCollaborativeFiltering <fct> , , , , , , , , , ...
## $ WorkMethodsFrequencyCross.Validation <fct> , Sometimes, , Mos...
## $ WorkMethodsFrequencyDataVisualization <fct> Most of the time, ...
## $ WorkMethodsFrequencyDecisionTrees <fct> , , , Often, , , O...
## $ WorkMethodsFrequencyEnsembleMethods <fct> , , , Often, , , O...
## $ WorkMethodsFrequencyEvolutionaryApproaches <fct> , , , , , , , , , ...
## $ WorkMethodsFrequencyGANs <fct> , , , Rarely, , , ...
## $ WorkMethodsFrequencyGBM <fct> , Sometimes, , Oft...
## $ WorkMethodsFrequencyHMMs <fct> , , , , , , Someti...
## $ WorkMethodsFrequencyKNN <fct> , , , Sometimes, ,...
## $ WorkMethodsFrequencyLiftAnalysis <fct> , , , , , , , , , ...
## $ WorkMethodsFrequencyLogisticRegression <fct> , , , Rarely, , , ...
## $ WorkMethodsFrequencyMLN <fct> , , , , , , , , , ...
## $ WorkMethodsFrequencyNaiveBayes <fct> , , , Rarely, , , ...
## $ WorkMethodsFrequencyNLP <fct> , , , , Often, , ,...
## $ WorkMethodsFrequencyNeuralNetworks <fct> Most of the time, ...
## $ WorkMethodsFrequencyPCA <fct> Often, , , Often, ...
## $ WorkMethodsFrequencyPrescriptiveModeling <fct> , , , , , , , , , ...
## $ WorkMethodsFrequencyRandomForests <fct> , Sometimes, , Som...
## $ WorkMethodsFrequencyRecommenderSystems <fct> , , , , , , , , , ...
## $ WorkMethodsFrequencyRNNs <fct> , , , Rarely, , , ...
## $ WorkMethodsFrequencySegmentation <fct> , , , Sometimes, ,...
## $ WorkMethodsFrequencySimulation <fct> , , , Sometimes, ,...
## $ WorkMethodsFrequencySVMs <fct> , , , Sometimes, ,...
## $ WorkMethodsFrequencyTextAnalysis <fct> , , , , , , , Rare...
## $ WorkMethodsFrequencyTimeSeriesAnalysis <fct> , , , Rarely, , , ...
## $ WorkMethodsFrequencySelect1 <fct> , , , , , , , , , ...
## $ WorkMethodsFrequencySelect2 <fct> , , , , , , , , , ...
## $ WorkMethodsFrequencySelect3 <fct> , , , , , , , , , ...
## $ TimeGatheringData <int> 0, 0, 30, 15, 30, ...
## $ TimeModelBuilding <dbl> 80, 0, 20, 35, 30,...
## $ TimeProduction <dbl> 0, 0, 5, 20, 30, 0...
## $ TimeVisualizing <dbl> 20, 0, 15, 10, 10,...
## $ TimeFindingInsights <dbl> 0, 0, 30, 20, 0, 0...
## $ TimeOtherSelect <int> 0, 0, 0, 0, 0, 0, ...
## $ AlgorithmUnderstandingLevel <fct> Enough to refine a...
## $ WorkChallengesSelect <fct> Explaining data sc...
## $ WorkChallengeFrequencyPolitics <fct> , , , , , , Someti...
## $ WorkChallengeFrequencyUnusedResults <fct> , , , , Sometimes,...
## $ WorkChallengeFrequencyUnusefulInstrumenting <fct> , , , , Sometimes,...
## $ WorkChallengeFrequencyDeployment <fct> , , , Sometimes, ,...
## $ WorkChallengeFrequencyDirtyData <fct> , Sometimes, Somet...
## $ WorkChallengeFrequencyExplaining <fct> Often, , Sometimes...
## $ WorkChallengeFrequencyPass <fct> , , , , , , , , , ...
## $ WorkChallengeFrequencyIntegration <fct> , , , , , , , , Of...
## $ WorkChallengeFrequencyTalent <fct> , , , , Often, Som...
## $ WorkChallengeFrequencyDataFunds <fct> , , , , , , , , So...
## $ WorkChallengeFrequencyDomainExpertise <fct> , , Often, , , , ,...
## $ WorkChallengeFrequencyML <fct> , , , Often, , , ,...
## $ WorkChallengeFrequencyTools <fct> , , , Often, , , ,...
## $ WorkChallengeFrequencyExpectations <fct> , , Sometimes, Som...
## $ WorkChallengeFrequencyITCoordination <fct> , , , Sometimes, ,...
## $ WorkChallengeFrequencyHiringFunds <fct> , , , , , , , , Of...
## $ WorkChallengeFrequencyPrivacy <fct> , , , , , , , , , ...
## $ WorkChallengeFrequencyScaling <fct> , , , , Sometimes,...
## $ WorkChallengeFrequencyEnvironments <fct> , , , , , , Often,...
## $ WorkChallengeFrequencyClarity <fct> , , , , , , Often,...
## $ WorkChallengeFrequencyDataAccess <fct> , , Sometimes, Som...
## $ WorkChallengeFrequencyOtherSelect <fct> Most of the time, ...
## $ WorkDataVisualizations <fct> 76-99% of projects...
## $ WorkInternalVsExternalTools <fct> More internal than...
## $ WorkMLTeamSeatSelect <fct> Standalone Team, S...
## $ WorkDatasets <fct> Mostly work with g...
## $ WorkDatasetsChallenge <fct> Generating good da...
## $ WorkDataStorage <fct> Flat files not in ...
## $ WorkDataSharing <fct> I don't typically ...
## $ WorkDataSourcing <fct> , , , , , , , , , ...
## $ WorkCodeSharing <fct> Git, Generic cloud...
## $ RemoteWork <fct> Sometimes, Sometim...
## $ CompensationAmount <dbl> 20000, 100000, 133...
## $ CompensationCurrency <fct> USD, USD, USD, USD...
## $ SalaryChange <fct> Has stayed about t...
## $ JobSatisfaction <fct> 6, 7, 8, 9, 6, 8, ...
## $ JobSearchResource <fct> , , , , , , , , , ...
## $ JobHuntTime <fct> , , , , , , , , , ...
## $ JobFactorLearning <fct> , , , , , , , , , ...
## $ JobFactorSalary <fct> , , , , , , , , , ...
## $ JobFactorOffice <fct> , , , , , , , , , ...
## $ JobFactorLanguages <fct> , , , , , , , , , ...
## $ JobFactorCommute <fct> , , , , , , , , , ...
## $ JobFactorManagement <fct> , , , , , , , , , ...
## $ JobFactorExperienceLevel <fct> , , , , , , , , , ...
## $ JobFactorDepartment <fct> , , , , , , , , , ...
## $ JobFactorTitle <fct> , , , , , , , , , ...
## $ JobFactorCompanyFunding <fct> , , , , , , , , , ...
## $ JobFactorImpact <fct> , , , , , , , , , ...
## $ JobFactorRemote <fct> , , , , , , , , , ...
## $ JobFactorIndustry <fct> , , , , , , , , , ...
## $ JobFactorLeaderReputation <fct> , , , , , , , , , ...
## $ JobFactorDiversity <fct> , , , , , , , , , ...
## $ JobFactorPublishingOpportunity <fct> , , , , , , , , , ...
#
# ggplot(aes(fct_relevel(CurrentJobTitleSelect, "Other")))+
#geom_bar(aes(fill=CurrentJobTitleSelect))+
#theme(axis.text.x=element_text(angle = 60, hjust = 1), legend.position="none")+
#labs(title = "Select the option that's most similar to your current job/professional \ntitle (or most recent title if retired)",
# x= "Job Title",
# y = "Count")