An October 2012 Harvard Business Review article called Data Scientist the “sexiest job of the 21st century.” In 21st-century time, 2012 is eons ago. Glassdoor, the popular career and company review site, has named Data Scientist the best job in America for three years running. We hear tales of huge starting salaries and high demand for talent. Still, for a number of reasons, we’re cautious.
First, data scientist jobs are prone to the catch-22 prevalent in many industries: nearly every listing seems to want a couple of years of experience in the field. How do you get experience if there seem to be no true entry-level jobs?
Second, data science degree programs and boot camps are popping up faster than Chick-fil-A franchises. Berkeley’s brand-new undergraduate Data Science program received nearly 800 major pre-declarations as soon as it was made available, which is unprecedented for a new academic program. (They have to use a giant lecture hall (below) for the first course in the sequence.) Competition for entry-level positions is already fierce and growing. It’s going to get crazier. [Big Classroom]
Also, job listings have not evolved to reflect the growth in data science-specific graduate programs. Many open positions call for graduate degrees in statistics or specific areas of computer science, or PhDs in quantitative areas, echoing the backgrounds of many early data science industry leaders. In larger companies, our resumes may not make it past HR to the hiring managers.
Finally, many of us are older students who already have good careers and have family obligations and outside lives. We’re very interested in data science, but we seek interesting, rewarding work that allows for balance. Our exploration will touch on this.
In short, as first- and second-semester students in an M.S. in Data Science program, we want to ensure we’re learning the skills future employers want and that we know how to demonstrate those abilities in projects and portfolios. While some of us might simply apply the skills we learn in the program to our current jobs, we want to have options.
To further our understanding of the needed skills - and fulfill the requirements of this project - we looked at three different sets of data related to data science skills. We start with a discussion of the results of a recent look at data science job listings. Next, two members of our team sliced and diced data from a large survey of data scientists gathered by Kaggle, the machine learning contest and education site, and we share their analysis. Finally, we built our own survey using a free online tool and elicited responses from data science leaders at companies where we are interested in eventually seeking employment, and we analyze those results.
library(knitr)
library(kableExtra)
library(prettydoc)
library(RCurl)
library(dplyr)
library(ggplot2)
library(forcats)
library(varhandle)
library(stringr)
library(tidyr)
library(tidyverse)
library(RMySQL)
library(plotly)
Connect to the database
connection <- dbConnect(MySQL(),
user="",
password="",
dbname="test",
host="localhost"
)
## Error in .local(drv, ...): Failed to connect to database: Error: Can't connect to MySQL server on 'localhost' (0)
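(The database chunks in this document error out because it was knit on a machine without access to our MySQL instance. A defensive pattern like the following sketch would let the document knit either way; the environment variable names and the fallback snapshot file DS_JobList.csv are hypothetical stand-ins, not part of our original pipeline.)
#Sketch (hypothetical): try the live database first; fall back to a local CSV snapshot
connection <- tryCatch(
  dbConnect(MySQL(),
            user = Sys.getenv("MYSQL_USER"),    #hypothetical credential via env var
            password = Sys.getenv("MYSQL_PWD"), #hypothetical credential via env var
            dbname = "test",
            host = "localhost"),
  error = function(e) NULL
)
#downstream chunks could then branch on is.null(connection) and read snapshots instead,
#e.g. query1 <- read.csv("DS_JobList.csv")  #hypothetical snapshot of the DS_JobList table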
What are employers looking for?
Methodology:
For the first chart, we searched job listings on LinkedIn, Indeed, SimplyHired, Monster, and AngelList on October 16, 2018, using the keyword “data scientist.” The chart below shows how many data scientist jobs each website listed.
qry_jobs <- "SELECT JobSites, Count as Listing FROM DS_JobList order by Count desc"
query1 <- dbGetQuery(connection, qry_jobs)
## Error in dbGetQuery(connection, qry_jobs): object 'connection' not found
query1$JobSites <- factor(query1$JobSites, levels = unique(query1$JobSites)[order(query1$Listing, decreasing = TRUE)])
## Error in factor(query1$JobSites, levels = unique(query1$JobSites)[order(query1$Listing, : object 'query1' not found
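(As an aside: since forcats is already loaded, the reordering idiom above - repeated twice more below - can be expressed more directly. The one-liner below is an equivalent sketch, not a change to the analysis.)
#equivalent reordering using forcats (sketch)
query1$JobSites <- fct_reorder(query1$JobSites, query1$Listing, .desc = TRUE)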
p <- plot_ly(data = query1, y = ~Listing, x = ~JobSites,
name = "Data Scientist Job Listing",
type = "bar", marker = list(color = "red")
)
## Error in is.data.frame(data): object 'query1' not found
p
## Error in eval(expr, envir, enclos): object 'p' not found
The most frequent general skills employers are looking for in data scientists.
Methodology:
For the second and third charts, we used the author’s data set, provided as a Google Sheet, and loaded it into our MySQL database in a normalized format. We summed the keyword occurrences and averaged them across the job listing sites.
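(The loading step itself is not shown in this document. A minimal sketch using DBI’s dbWriteTable - the CSV filename below is an illustrative stand-in for the Google Sheet export - would look like this.)
#Sketch: load the Google Sheet export into MySQL (filename is illustrative)
gen_skills <- read.csv("ds_general_skills_export.csv")
dbWriteTable(connection, "DS_GenSkills", gen_skills, row.names = FALSE, overwrite = TRUE)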
qry_dsjobs <- "SELECT Keyword, sum(LinkedIn + Indeed + SimplyHired + Monster)/4 as Ave_Listing from DS_GenSkills group by Keyword"
query <- dbGetQuery(connection, qry_dsjobs)
## Error in dbGetQuery(connection, qry_dsjobs): object 'connection' not found
query$Keyword <- factor(query$Keyword, levels = unique(query$Keyword)[order(query$Ave_Listing, decreasing = TRUE)])
## Error in factor(query$Keyword, levels = unique(query$Keyword)[order(query$Ave_Listing, : object 'query' not found
#knitr::kable(query)
p <- plot_ly(data = query, y = ~Ave_Listing, x = ~Keyword,
name = "General Skills in Data Scientist Job Listing",
type = "bar", marker = list(color = "blue")
)
## Error in is.data.frame(data): object 'query' not found
p
## Error in eval(expr, envir, enclos): object 'p' not found
The chart shows that, generally, employers are looking for data scientists who are proficient, if not expert, in analysis, machine learning, and statistics, to name a few. A computer science background and communication skills complete the top five general skill requirements.
The top 20 specific languages, libraries, and tech tools employers are looking for data scientists to have experience with.
qry_dsts <- "SELECT Keyword, `Avg %` as Ave_Perc FROM DS_SoftSkills order by 2 desc limit 20"
query2 <- dbGetQuery(connection, qry_dsts)
## Error in dbGetQuery(connection, qry_dsts): object 'connection' not found
query2$Keyword <- factor(query2$Keyword, levels = unique(query2$Keyword)[order(query2$Ave_Perc, decreasing = TRUE)])
## Error in factor(query2$Keyword, levels = unique(query2$Keyword)[order(query2$Ave_Perc, : object 'query2' not found
p <- plot_ly(data = query2, y = ~Ave_Perc, x = ~Keyword,
name = "Top 20 Technology Skills in Data Scientist Job Listing",
type = "bar", marker = list(color = "green")
)
## Error in is.data.frame(data): object 'query2' not found
p
## Error in eval(expr, envir, enclos): object 'p' not found
Among these top 20 technology skills, Python and R are currently the most in-demand languages, followed by SQL. Seeing all of this, we know that at CUNY SPS we are learning the tools and acquiring the skills that are current and in demand in the IT workplace.
Kaggle conducted a survey of more than 16,000 data scientists in 2017. Data used in the following analysis is pulled from https://www.kaggle.com/kaggle/kaggle-survey-2017.
Kaggle.Multi <- read.csv("https://raw.githubusercontent.com/aliceafriedman/DATA607_Proj3/master/multipleChoiceResponses.csv", sep=",", header = TRUE)
One way to evaluate how valuable different data science skills are is to determine how much time people spend using them.
The following analysis shows that, for the survey respondents as a whole, the most important skills are:
1. Gathering data
2. Model building
3. Visualizing data
4. Finding insights
5. Putting work into production (e.g., version control)
timeSpent <- Kaggle.Multi %>%
select(
TimeGatheringData,
TimeModelBuilding,
TimeProduction,
TimeVisualizing,
TimeFindingInsights,
TimeOtherSelect) %>%
rename(
'Gathering Data' = TimeGatheringData,
'Modeling' = TimeModelBuilding,
'Production' = TimeProduction,
'Visualizing' = TimeVisualizing,
'Finding Insights' = TimeFindingInsights,
'Other' = TimeOtherSelect) %>%
gather(key = "Skill", value = "PercTimeSpent", na.rm = TRUE) %>% glimpse()
## Observations: 45,140
## Variables: 2
## $ Skill <chr> "Gathering Data", "Gathering Data", "Gathering D...
## $ PercTimeSpent <dbl> 0, 50, 30, 60, 30, 60, 40, 30, 35, 40, 0, 0, 20,...
timeSpent %>%
group_by(Skill) %>%
summarize("avg" = mean(PercTimeSpent), "Median % Time Spent"= median(PercTimeSpent)) %>%
arrange(desc(avg)) %>%
rename('Average % Time Spent' = avg) %>%
kable(digits = 0)
| Skill | Average % Time Spent | Median % Time Spent |
|---|---|---|
| Gathering Data | 36 | 35 |
| Modeling | 21 | 20 |
| Visualizing | 14 | 10 |
| Finding Insights | 13 | 10 |
| Production | 11 | 10 |
| Other | 2 | 0 |
Along with the multiple-choice questions, respondents were given the option of selecting “Other,” accompanied by an open text box where they could type anything. For example, for the question about where respondents spend their time, the options were Gathering Data, Model Building, Production, Visualizing, Finding Insights, and Other; those who chose Other filled in the relevant information in the text box.
We checked for relevant data in a couple of the free-form fields: job title (CurrentJobTitleFreeForm) and time spent under the Other option (TimeOtherSelectFreeForm). Below is some analysis of this free-form data set, freeformResponses.csv.
getURL <- "https://raw.githubusercontent.com/aliceafriedman/DATA607_Proj3/master/freeformResponses.csv"
free.form.response.df1 <- read.csv(getURL, header = TRUE, sep = ",")
dim(free.form.response.df1)
## [1] 16716 62
#Keeping only columns which have at least some data, and removing all the columns which do not have any data
free.form.response.df2 <- free.form.response.df1[, apply(free.form.response.df1, 2, function(x){any(!is.na(x))})]
dim(free.form.response.df2)
## [1] 16716 51
apply(free.form.response.df2, 2, function(x){sum(!(x == ""), na.rm = TRUE)})
## GenderFreeForm
## 134
## KaggleMotivationFreeForm
## 746
## CurrentJobTitleFreeForm
## 1143
## MLToolNextYearFreeForm
## 385
## MLMethodNextYearFreeForm
## 227
## LanguageRecommendationFreeForm
## 81
## PublicDatasetsFreeForm
## 262
## PersonalProjectsChallengeFreeForm
## 3523
## LearningPlatformCommunityFreeForm
## 179
## LearningPlatformFreeForm1
## 330
## LearningPlatformFreeForm2
## 54
## LearningPlatformFreeForm3
## 45
## LearningPlatformUsefulnessFreeForm1Select
## 369
## LearningPlatformUsefulnessFreeForm2Select
## 64
## LearningPlatformUsefulnessFreeForm3Select
## 54
## BlogsPodcastsNewslettersFreeForm
## 1116
## JobSkillImportanceOtherSelect1FreeForm
## 201
## JobSkillImportanceOtherSelect2FreeForm
## 87
## JobSkillImportanceOtherSelect3FreeForm
## 35
## CoursePlatformFreeForm
## 333
## HardwarePersonalProjectsFreeForm
## 120
## ProveKnowledgeFreeForm
## 122
## ImpactfulAlgorithmFreeForm
## 4379
## InterestingProblemFreeForm
## 4467
## DataScienceIdentityFreeForm
## 2417
## MajorFreeForm
## 809
## PastJobTitlesFreeForm
## 2094
## FirstTrainingFreeForm
## 244
## LearningCategoryOtherFreeForm
## 476
## MLSkillsFreeForm
## 672
## MLTechniquesFreeform
## 759
## EmployerIndustryOtherFreeForm
## 926
## EmployerSearchMethodOtherFreeForm
## 640
## JobFunctionFreeForm
## 504
## WorkHardwareFreeForm
## 121
## WorkDataTypeFreeForm
## 662
## WorkLibrariesFreeForm
## 4504
## WorkAlgorithmsFreeForm
## 415
## WorkToolsFreeForm1
## 664
## WorkToolsFreeForm2
## 133
## WorkToolsFreeForm3
## 80
## WorkMethodsFreeForm1
## 189
## WorkMethodsFreeForm2
## 33
## WorkMethodsFreeForm3
## 61
## TimeOtherSelectFreeForm
## 358
## WorkChallengesFreeForm
## 214
## WorkMLTeamSeatFreeForm
## 931
## WorkDataStorageFreeForm
## 256
## WorkCodeSharingFreeForm
## 375
## SalaryChangeFreeForm
## 101
## JobSearchResourceFreeForm
## 199
Digging further into the free-form response data set for more insights, we could see some common text patterns in the two relevant columns: time spent (Other) and job title (Other). Most entries are idiosyncratic, but we can still show some trends in these two columns.
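(The categorization code is not echoed in the rendered document. The sketch below shows the kind of keyword bucketing used on TimeOtherSelectFreeForm; the keyword lists are illustrative, not our exact rules.)
#Sketch: keyword-based bucketing of the free-form "time spent" responses (illustrative keywords)
other_time <- free.form.response.df2 %>%
  filter(TimeOtherSelectFreeForm != "") %>%
  mutate(entry = tolower(as.character(TimeOtherSelectFreeForm)),
         timespent = case_when(
           str_detect(entry, "software|application|coding") ~ "software application related",
           str_detect(entry, "admin|meeting|discussion") ~ "admin/meeting/discussion",
           str_detect(entry, "manage") ~ "management related",
           str_detect(entry, "data") ~ "data related",
           str_detect(entry, "research") ~ "research related",
           TRUE ~ "other of other"))
df_other_time_spent <- other_time %>%
  group_by(timespent) %>%
  summarise(count_occurence = n()) %>%
  mutate(percent_occurence = count_occurence / sum(count_occurence))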
kable(df_other_time_spent)
| timespent | count_occurence | percent_occurence |
|---|---|---|
| software application related | 42 | 0.1169916 |
| admin/meeting/discussion | 14 | 0.0389972 |
| management related | 30 | 0.0835655 |
| data related | 23 | 0.0640669 |
| research related | 27 | 0.0752089 |
| other of other | 228 | 0.6350975 |
Another way to determine how important a particular skill is to data scientists is to look at how much of a working data scientist’s job uses that skill. One question the survey asks is, “At work, what proportion of your analytics projects incorporate data visualization?”
As the analysis below shows, the most common response by far is that more than three-quarters of a respondent’s assigned projects incorporate data visualization. Although the median respondent reports spending only 10% of their time on this skill, data visualization must be an important skill nonetheless!
PercOrder <- c(
"None",
"10-25% of projects",
"26-50% of projects",
"51-75% of projects",
"76-99% of projects",
"100% of projects")
Kaggle.Multi$WorkDataVisualizations <- factor(Kaggle.Multi$WorkDataVisualizations, PercOrder)
Kaggle.Multi %>%
filter(WorkDataVisualizations!="") %>%
mutate(WorkDataVisualizations = fct_recode(WorkDataVisualizations,
"10-25%" = "10-25% of projects",
"26-50%" = "26-50% of projects",
"51-75%" = "51-75% of projects",
"76-100%" = "76-99% of projects",
"76-100%" = "100% of projects"
)) %>%
ggplot(aes(WorkDataVisualizations))+
geom_bar(aes(fill=WorkDataVisualizations))+
theme(axis.text.x=element_text(angle = 60, hjust = 1))+
labs(title = "At work, what proportion of your analytics projects\n incorportate data visualizations?",
x = "Percent of Projects Incorporating Visualization",
y = "Count")
Kaggle.Multi %>%
filter(CurrentJobTitleSelect!="") %>%
mutate(CurrentJobTitleSelect = CurrentJobTitleSelect %>% fct_infreq() %>% fct_rev()) %>%
ggplot(aes(fct_relevel(CurrentJobTitleSelect, "Other")))+
geom_bar(aes(fill=CurrentJobTitleSelect))+
theme(axis.text.x=element_text(angle = 60, hjust = 1), legend.position="none")+
labs(title = "Select the option that's most similar to your current job/professional \ntitle (or most recent title if retired)",
x= "Job Title",
y = "Count")
Job title distribution check in the free response survey:
kable(df_other_job_title)
| job.category | count_occurence | proportion_occurence |
|---|---|---|
| data related jobs | 230 | 0.2012248 |
| software related jobs | 126 | 0.1102362 |
| executive leadership related | 81 | 0.0708661 |
| teaching related jobs | 78 | 0.0682415 |
| student | 26 | 0.0227472 |
| finance related jobs | 8 | 0.0069991 |
| math related jobs | 8 | 0.0069991 |
| manager related jobs | 114 | 0.0997375 |
Through this categorization, about 59% of the responses in CurrentJobTitleFreeForm fell under the categories above. The remaining 41% of the job titles are not consistent enough to show any other trend: because the free form accepted any text, many entries do not provide enough detail to categorize.
This data requires a bit of munging, as salaries have been reported in several currencies, and a quick look at the responses seems to indicate that many individuals have entered their salaries in “thousands” whereas others have entered them in dollars. Some of the responses are in the millions. Are these liars, or just very well paid survey respondents?
To address the former, we will filter by responses reported in USD. To address the latter two, we have a few options:
1. Ignore outliers (e.g., responses less than $1,000 and more than $1,000,000)
2. Assume that everyone is telling the truth about their salaries
3. Assume that responses less than $1,000 are reported in thousands (e.g., a response of $87 is intended to mean a salary of $87,000, which would be expected based on overall salary data for data scientists)
4. Some combination of the above, taking into account known salary distributions for particular titles.
The data munging to access US-reported salaries must also take into account that data has been entered in a variety of formats, resulting in values that cannot be easily converted to numeric.
First, let’s investigate the data to see which of these options is most reasonable.
USSalaries <- Kaggle.Multi %>% filter(CompensationCurrency=="USD") %>%
select(CompensationAmount) %>%
#First, we must unfactor the data using the varhandle library
mutate(CompensationAmount = unfactor(CompensationAmount),
#Then, we must remove all commas, which are included in some, but not all, data entries before we can finally convert to numeric
CompensationAmount = str_remove_all(CompensationAmount,","),
CompensationAmount = as.numeric(CompensationAmount))
## Warning in evalq(as.numeric(CompensationAmount), <environment>): NAs
## introduced by coercion
quantile(USSalaries[[1]], na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 0 60000 96000 140000 9999999
The entry “9999999” is obviously a false entry, and a number of the other outliers seem implausible as well. How many reported salaries are more than $1M?
USSalariesTest <- USSalaries %>% filter(CompensationAmount > 1000000)
USSalariesTest
## CompensationAmount
## 1 9999999
## 2 2500000
## 3 2000000
Only 3! Since it seems plausible that 2 people out of more than 16,000 respondents could earn more than $1M, we will assume those are real entries. We will exclude the “9999999” entry and rerun the results.
USSalaries <- USSalaries %>% filter(CompensationAmount != 9999999)
quantile(USSalaries[[1]], na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 0 60000 96000 140000 2500000
USSalariesTest2 <- USSalaries %>% filter(CompensationAmount > 0 & CompensationAmount < 1000)
USSalariesTest2
## CompensationAmount
## 1 205
## 2 155
## 3 96
## 4 80
## 5 485
## 6 102
## 7 185
## 8 100
## 9 110
## 10 900
## 11 265
## 12 1
Since the number of entries less than $1,000 is small (12), they will not meaningfully impact the analysis. We can proceed under the (possibly incorrect) assumption that all data except 9999999 is valid. The resulting distribution, summarized and plotted below, has a median salary roughly in line with what we would expect for data analytics roles, which confirms that we should proceed.
USSalaries <- USSalaries %>% filter(CompensationAmount < 9999999)
quantile(USSalaries[[1]], na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 0 60000 96000 140000 2500000
ggplot(USSalaries, aes(CompensationAmount)) +
geom_histogram(binwidth = 25000)
The Kaggle dataset covers over 16,000 respondents from multiple countries, with a wide range of salaries and job titles. Are the most important skills the same across these variables?
Limiting our data to salaries reported in USD for comparison’s sake, we will repeat the previous analysis after creating a factor for respondents’ salaries by quantile.
We will create levels from the numeric data using the cut function.
USD <- Kaggle.Multi %>%
mutate(CompensationAmount = as.numeric(as.numeric(levels(CompensationAmount)[CompensationAmount])),
CompensationAmount = ifelse(CompensationAmount < 100, CompensationAmount*1000, CompensationAmount)) %>%
filter(CompensationCurrency=="USD",
CompensationAmount < 9999999)
## Warning in evalq(as.numeric(as.numeric(levels(CompensationAmount)
## [CompensationAmount])), : NAs introduced by coercion
quantile(USD$CompensationAmount, na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 0 60000 96000 137500 2500000
#Create levels using cut()
USD <- USD %>% select(
TimeGatheringData,
TimeModelBuilding,
TimeProduction,
TimeVisualizing,
TimeFindingInsights,
TimeOtherSelect,
CompensationAmount) %>%
mutate(IncomeLevel = cut(x=CompensationAmount, breaks = c(0, 60000, 96000, 137500, 2500000)))
levels(USD$IncomeLevel) <- c("Low", "Medium", "High", "Very_high")
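(Equivalently, cut() can assign these labels directly via its labels argument, avoiding the separate levels assignment; a sketch:)
#Sketch: supply the income-level labels directly in cut()
USD$IncomeLevel <- cut(USD$CompensationAmount,
                       breaks = c(0, 60000, 96000, 137500, 2500000),
                       labels = c("Low", "Medium", "High", "Very_high"))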
timeSpentByUSDSalary <- USD %>%
rename(
'Gathering Data' = TimeGatheringData,
'Modeling' = TimeModelBuilding,
'Production' = TimeProduction,
'Visualizing' = TimeVisualizing,
'Finding Insights' = TimeFindingInsights,
'Other' = TimeOtherSelect) %>%
gather(key="Skill", value="PercTimeSpent", -CompensationAmount, -IncomeLevel, na.rm = TRUE)
timeSpentByUSDSalary%>%
group_by(Skill, IncomeLevel) %>%
summarize("avg" = mean(PercTimeSpent), "Median % Time Spent"= median(PercTimeSpent)) %>%
arrange(Skill, IncomeLevel) %>%
rename('Average % Time Spent' = avg) %>%
kable(digits = 0) %>%
kable_styling(bootstrap_options = "responsive")
| Skill | IncomeLevel | Average % Time Spent | Median % Time Spent |
|---|---|---|---|
| Finding Insights | Low | 14 | 10 |
| Finding Insights | Medium | 14 | 10 |
| Finding Insights | High | 15 | 10 |
| Finding Insights | Very_high | 15 | 10 |
| Finding Insights | NA | 13 | 15 |
| Gathering Data | Low | 36 | 30 |
| Gathering Data | Medium | 41 | 40 |
| Gathering Data | High | 38 | 40 |
| Gathering Data | Very_high | 36 | 35 |
| Gathering Data | NA | 29 | 25 |
| Modeling | Low | 22 | 20 |
| Modeling | Medium | 18 | 15 |
| Modeling | High | 18 | 15 |
| Modeling | Very_high | 20 | 20 |
| Modeling | NA | 18 | 10 |
| Other | Low | 2 | 0 |
| Other | Medium | 2 | 0 |
| Other | High | 3 | 0 |
| Other | Very_high | 3 | 0 |
| Other | NA | 0 | 0 |
| Production | Low | 10 | 10 |
| Production | Medium | 11 | 10 |
| Production | High | 11 | 10 |
| Production | Very_high | 12 | 10 |
| Production | NA | 6 | 2 |
| Visualizing | Low | 15 | 10 |
| Visualizing | Medium | 14 | 10 |
| Visualizing | High | 13 | 10 |
| Visualizing | Very_high | 13 | 10 |
| Visualizing | NA | 33 | 25 |
These results are remarkably consistent across the income levels, although the lowest earners report a somewhat lower median time spent gathering data than the other groups.
What about job satisfaction? Do more satisfied workers spend more time on certain parts of the job?
timeSpentBySat <- Kaggle.Multi %>% select(
TimeGatheringData,
TimeModelBuilding,
TimeProduction,
TimeVisualizing,
TimeFindingInsights,
TimeOtherSelect,
JobSatisfaction) %>%
rename(
'Gathering Data' = TimeGatheringData,
'Modeling' = TimeModelBuilding,
'Production' = TimeProduction,
'Visualizing' = TimeVisualizing,
'Finding Insights' = TimeFindingInsights,
'Other' = TimeOtherSelect) %>%
gather(key="Skill", value="PercTimeSpent", -JobSatisfaction, na.rm = TRUE) %>%
mutate(JobSatisfaction = fct_recode(JobSatisfaction,
"Low" = "1 - Highly Dissatisfied",
"Low" = "2",
"Low" = "3",
"Medium" = "4",
"Medium" = "5",
"Medium" = "6",
"Medium" = "7",
"High" = "8",
"High" = "9",
"High" = "10 - Highly Satisfied",
"NA" = "",
"NA" = "I prefer not to share")) %>%
filter(!is.na(JobSatisfaction),
JobSatisfaction !="NA")
#reorder the levels by name; assigning a character vector to levels() would rename them by position and could silently swap labels
timeSpentBySat$JobSatisfaction <- fct_relevel(timeSpentBySat$JobSatisfaction, "Low", "Medium", "High")
timeSpentBySat%>%
group_by(Skill, JobSatisfaction) %>%
summarize("avg" = mean(PercTimeSpent), "Median % Time Spent"= median(PercTimeSpent)) %>%
arrange(Skill, JobSatisfaction) %>%
rename('Average % Time Spent' = avg) %>%
kable(digits = 0) %>%
kable_styling(bootstrap_options = "responsive")
| Skill | JobSatisfaction | Average % Time Spent | Median % Time Spent |
|---|---|---|---|
| Finding Insights | Low | 12 | 10 |
| Finding Insights | Medium | 13 | 10 |
| Finding Insights | High | 13 | 10 |
| Gathering Data | Low | 41 | 40 |
| Gathering Data | Medium | 36 | 35 |
| Gathering Data | High | 36 | 35 |
| Modeling | Low | 18 | 15 |
| Modeling | Medium | 23 | 20 |
| Modeling | High | 21 | 20 |
| Other | Low | 3 | 0 |
| Other | Medium | 2 | 0 |
| Other | High | 2 | 0 |
| Production | Low | 9 | 5 |
| Production | Medium | 11 | 10 |
| Production | High | 11 | 10 |
| Visualizing | Low | 14 | 10 |
| Visualizing | Medium | 13 | 10 |
| Visualizing | High | 15 | 10 |
Interestingly, the least satisfied workers also seem to spend the most time gathering data!
In addition to examining the data and trends discussed in the article and doing our own analysis of the Kaggle data science skills survey results, we wanted to create our own survey looking at company needs. This was an excellent opportunity to learn more about designing and administering a survey in addition to wrangling and analyzing the results. Also, it allowed us to reach out and establish contacts at some companies at which we might want to work as data scientists.
We sent out the survey to at least 15 different contacts, attempting to target data science leads or leaders with a good grasp of how their teams hire and grow. As of 10/17/18, we had received only six responses. Obviously, a targeted survey with so few responses cannot be claimed to be representative of the overall industry. Still, it was a worthwhile and educational endeavor. (You can help us out and take the survey here. It will only take a couple of minutes.)
We worked as a team to develop a set of survey questions, listed below. We then configured these questions to be administered via a free account on SurveyMonkey. That free account does not include any export functionality, let alone API connectivity, so results were compiled manually in a tidy-ish format in a csv file.
#load the csv into R
df_dfs <- tbl_df(read.csv("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Project3/master/Data%20Science%20Skills%20Survey%20Monkey%20Responses_20181017.csv"))
#Display all six responses to our eight questions
kable(df_dfs)
| Respondent | Question.No. | Question | Answer |
|---|---|---|---|
| 6 | 1 | How large is your data science team? | 6-10 |
| 6 | 2 | Are data science roles differentiated on your team - meaning do some specialize in data engineering and others in data science? | Kinda, some people are better at things than others and we utilize our strengths |
| 6 | 3 | Where is data science located within your organization? | Centralized data science team |
| 6 | 4 | What kind of growth, if any, do you anticipate in the next two years in terms of your data science team? | 25-50% |
| 6 | 5 | What top 3 skills do you look for in a new applicant? | SQL |
| 6 | 5 | What top 3 skills do you look for in a new applicant? | Machine learning |
| 6 | 5 | What top 3 skills do you look for in a new applicant? | Statistical modeling |
| 6 | 6 | Which of these skills is hardest to find in applicants? | Machine learning |
| 6 | 7 | Which skills is most associated with advancement within your team? | Presentation skills |
| 6 | 8 | If you hire junior data scientists, what is the median starting salary? | $100,000-124,999 |
| 5 | 1 | How large is your data science team? | 3-5 |
| 5 | 2 | Are data science roles differentiated on your team - meaning do some specialize in data engineering and others in data science? | Kinda, some people are better at things than others and we utilize our strengths |
| 5 | 3 | Where is data science located within your organization? | Centralized data science team |
| 5 | 4 | What kind of growth, if any, do you anticipate in the next two years in terms of your data science team? | 50-100% |
| 5 | 5 | What top 3 skills do you look for in a new applicant? | Python |
| 5 | 5 | What top 3 skills do you look for in a new applicant? | Hadoop;Map/Reduce |
| 5 | 5 | What top 3 skills do you look for in a new applicant? | Project management |
| 5 | 6 | Which of these skills is hardest to find in applicants? | Project management |
| 5 | 7 | Which skills is most associated with advancement within your team? | Project management |
| 5 | 8 | If you hire junior data scientists, what is the median starting salary? | $100,000-124,999 |
| 4 | 1 | How large is your data science team? | 3-5 |
| 4 | 2 | Are data science roles differentiated on your team - meaning do some specialize in data engineering and others in data science? | No, everyone is versatile |
| 4 | 3 | Where is data science located within your organization? | Deployed within business units |
| 4 | 4 | What kind of growth, if any, do you anticipate in the next two years in terms of your data science team? | 1-25% |
| 4 | 5 | What top 3 skills do you look for in a new applicant? | Research design |
| 4 | 5 | What top 3 skills do you look for in a new applicant? | Project management |
| 4 | 5 | What top 3 skills do you look for in a new applicant? | Presentation skills |
| 4 | 6 | Which of these skills is hardest to find in applicants? | Ability to apply skills to practical problems |
| 4 | 7 | Which skills is most associated with advancement within your team? | Presentation skills |
| 4 | 8 | If you hire junior data scientists, what is the median starting salary? | $75,000-99,999 |
| 3 | 1 | How large is your data science team? | 3-5 |
| 3 | 2 | Are data science roles differentiated on your team - meaning do some specialize in data engineering and others in data science? | Kinda, some people are better at things than others and we utilize our strengths |
| 3 | 3 | Where is data science located within your organization? | Deployed within business units |
| 3 | 4 | What kind of growth, if any, do you anticipate in the next two years in terms of your data science team? | 25-50% |
| 3 | 5 | What top 3 skills do you look for in a new applicant? | Python |
| 3 | 5 | What top 3 skills do you look for in a new applicant? | SQL |
| 3 | 5 | What top 3 skills do you look for in a new applicant? | Research design |
| 3 | 6 | Which of these skills is hardest to find in applicants? | Research design |
| 3 | 7 | Which skills is most associated with advancement within your team? | Research design |
| 3 | 8 | If you hire junior data scientists, what is the median starting salary? | $50,000-74,999 |
| 2 | 1 | How large is your data science team? | 6-10 |
| 2 | 2 | Are data science roles differentiated on your team - meaning do some specialize in data engineering and others in data science? | Kinda, some people are better at things than others and we utilize our strengths |
| 2 | 3 | Where is data science located within your organization? | Centralized data science team |
| 2 | 4 | What kind of growth, if any, do you anticipate in the next two years in terms of your data science team? | 50-100% |
| 2 | 5 | What top 3 skills do you look for in a new applicant? | Machine learning |
| 2 | 5 | What top 3 skills do you look for in a new applicant? | Statistical modeling |
| 2 | 5 | What top 3 skills do you look for in a new applicant? | Presentation skills |
| 2 | 6 | Which of these skills is hardest to find in applicants? | Presentation skills |
| 2 | 7 | Which skills is most associated with advancement within your team? | Machine learning |
| 2 | 8 | If you hire junior data scientists, what is the median starting salary? | $100,000-124,999 |
| 1 | 1 | How large is your data science team? | 3-5 |
| 1 | 2 | Are data science roles differentiated on your team - meaning do some specialize in data engineering and others in data science? | Yes, some of our team members specialize in data engineering and others in data science |
| 1 | 3 | Where is data science located within your organization? | Deployed within business units |
| 1 | 4 | What kind of growth, if any, do you anticipate in the next two years in terms of your data science team? | 25-50% |
| 1 | 5 | What top 3 skills do you look for in a new applicant? | Machine learning |
| 1 | 5 | What top 3 skills do you look for in a new applicant? | Statistical modeling |
| 1 | 5 | What top 3 skills do you look for in a new applicant? | Project management |
| 1 | 6 | Which of these skills is hardest to find in applicants? | Graphical models |
| 1 | 7 | Which skills is most associated with advancement within your team? | Machine learning |
| 1 | 8 | If you hire junior data scientists, what is the median starting salary? | $100,000-124,999 |
Who did we survey? We don’t have statistics on this, but our responses likely came from smaller companies with smaller than average data science teams.
#limit data to question 1; droplevels removes unused factor levels - which are responses to other questions - this will be important later
team_size <- df_dfs %>% filter(df_dfs$Question.No. == 1) %>% droplevels()
ggplot(team_size,aes(x=Answer),na.rm = TRUE) + geom_bar(aes(fill=Answer)) + theme_light() +
labs(title = "Team size - from survey",
x= "Team size",
y = "Count")
So all our responses came from organizations with more than a couple but no more than 10 team members. Because of the way we created our survey result data, possible answers that no respondent selected are not included in the csv and therefore do not appear in the data frame or the resulting visualization. For data science team size, this does not really matter: we know there are teams larger than 10, but none of them responded to the survey. For other questions, such as those asking respondents to pick the three most important skills from a list, the options that received no responses are important.
#Summarize the responses, including a list of levels with the factor for answers for this question
team_size$Answer
## [1] 6-10 3-5 3-5 3-5 6-10 3-5
## Levels: 3-5 6-10
We want to add back the possible answers that did not receive responses. There is surely a better way to structure survey response data in R using one of the survey-oriented packages, but don’t call us Shirley. We display the code for adding the unused possible answers as factor levels here as a separate chunk; for later questions, we will include this step right after subsetting.
#add unused answers
levels(team_size$Answer) <- c(levels(team_size$Answer),"1-2","11-20","21+")
#check that the desired levels were added
levels(team_size$Answer)
## [1] "3-5" "6-10" "1-2" "11-20" "21+"
Now let’s re-graph the responses, adding some (slight) aesthetic touches, along with the answer choices that were not selected. As the data science team sizes indicate, we mostly reached out to (and heard back from) smaller teams.
#create an order to show the factors in logical manner
positions <- c('1-2', "3-5", '6-10', '11-20', '21+')
#include answers with no responses
ggplot(team_size,aes(x=Answer)) + geom_bar(aes(fill=Answer)) + theme_light() + scale_fill_discrete(drop=FALSE) + scale_x_discrete(drop=FALSE, limits = positions) + labs(x = "Data Science Team Size",y="Count")
#limit data to question 2; droplevels removes unused factor levels - which are responses to other questions
team_roles <- df_dfs %>% filter(df_dfs$Question.No. == 2) %>% droplevels()
#add unused answers
levels(team_roles$Answer) <- c(levels(team_roles$Answer),"Other")
#create an order to show the factors in logical manner
positions2 <- c('No, everyone is versatile', "Kinda, some people are better at things than others and we utilize our strengths", 'Yes, some of our team members specialize in data engineering and others in data science',"Other")
#include answers with no responses and adjust the axis in a few ways
ggplot(team_roles,aes(x=Answer)) + geom_bar(aes(fill = Answer)) + theme_light() + scale_fill_discrete(drop=FALSE) + scale_x_discrete(drop=FALSE, limits = positions2, labels = function(x) str_wrap(x, width = 20)) + labs(x = "Are data science roles on your team differentiated?",y="") + theme(axis.text.x = element_text(angle=45, hjust=1)) + theme(legend.position = "bottom")
team_roles %>% mutate(team.role.segregation = ifelse(Answer == "Kinda, some people are better at things than others and we utilize our strengths" | Answer == "Yes, some of our team members specialize in data engineering and others in data science", "Specialized roles", "versatile")) %>%
ggplot(aes(team.role.segregation)) +
geom_bar(aes(fill = Answer)) + labs(title = "Roles segregation within data science teams from the survey") + theme(axis.text.x=element_text(angle = 60, hjust = 1))
Even on the smaller data science teams from which we received responses, the majority use members in different ways. This is somewhat encouraging. Data science job listings tend to name an impossible number of skills that qualified applicants are seemingly expected to have. Our responses, though paltry in number, suggest we might be able to start contributing to a team without mastering everything. A “kinda” team might be a great opportunity to build some cool stuff using our strengths while also filling in skills gaps.
Now let’s look at responses related to where data science “lives” within an organization. At some companies, data scientists work together on a centralized team that attempts to solve the problems of business units or clients. In other firms, data scientists might work within an operational or other division and have a “dotted line” reporting relationship to a data science or analytics leader. They may have weekly or other iterative collaborations with the other data scientists at their company, but they might work more with the business users or analysts within their data domain.
#limit data to question 3; droplevels removes unused factor levels - which are responses to other questions
team_location <- df_dfs %>% filter(df_dfs$Question.No. == 3) %>% droplevels()
#adjust the axis in a few ways
team_location %>%
ggplot(aes(Answer)) +
geom_dotplot(aes(fill = Answer), binwidth = 0.5, dotsize = 1) + labs(title = "Data Science team location in the organization") + theme(axis.text.x=element_text(angle = 60, hjust = 1), axis.text.y = element_blank(), axis.title.x = element_blank()) +
ylim(0,nrow(team_location) + 2)
As is clearly evident, our responses were evenly split between central and deployed data science teams.
Next, we were curious about data science teams’ plans for growth. Again, selfishly, we particularly targeted companies at which we may want to work as data scientists, so these were especially interesting responses.
#limit data to question 4; droplevels removes unused factor levels - which are responses to other questions
team_growth <- df_dfs %>% filter(df_dfs$Question.No. == 4) %>% droplevels()
#add unused answers
levels(team_growth$Answer) <- c(levels(team_growth$Answer),"No growth - or negative growth","More than double")
#create an order to show the factors in logical manner
positions4 <- c("No growth - or negative growth", "1-25%",'25-50%',"50-100%","More than double")
#include answers with no responses and adjust the axis in a few ways
ggplot(team_growth,aes(x=Answer)) + geom_bar(aes(fill = Answer)) + theme_light() + scale_fill_discrete(drop=FALSE) + scale_x_discrete(drop=FALSE, limits = positions4, labels = function(x) str_wrap(x, width = 20)) + labs(x = "What kind of team growth do you anticipate in the next 2 years?",y="", title = "Expected growth of existing data science teams in the next 2 years") + theme(axis.text.x = element_text(angle=45, hjust=1))
All teams expect to grow, though none of the respondents expect to more than double. Encouraging, but the wild days of data scientist job growth might be slowing.
Perhaps most importantly, what skills are employers looking for in hiring data scientists? We asked team leaders to pick their top three. Based on our survey, we see the below trends for the most sought after skills.
#limit data to question 5; droplevels removes unused factor levels - which are responses to other questions
team_skills <- df_dfs %>% filter(df_dfs$Question.No. == 5) %>% droplevels()
#add unused answers
levels(team_skills$Answer) <- c(levels(team_skills$Answer),"R","Java","Data wrangling", "Unstructured data", "Natural language processing", "Optimization", "Graphical models", "Privacy and ethics", "Other soft skills", "Other (please specify)")
#create an order to show the factors in logical manner
positions5 <- c("Data wrangling", "Graphical models", "Hadoop;Map/Reduce", "Java", "Machine learning", "Natural language processing", "Optimization","Other (please specify)", "Other soft skills", "Presentation skills", "Privacy and ethics", "Project management", "Python", "R", "Research design", "SQL", "Statistical modeling", "Unstructured data")
#include answers with no responses and adjust the axis in a few ways
team_skills %>% within(Answer <- factor(Answer, levels = names(sort(table(Answer), decreasing = TRUE)))) %>% ggplot(aes(x=Answer)) + geom_bar(aes(fill = Answer)) + theme(axis.text.x = element_text(angle=45, hjust=1)) + labs(title = "Top wanted skills")
Machine learning and statistical modeling were each picked by half of the six respondents, along with the (relatively) soft skill of project management. Many of us have project management experience from our pre-data science careers, so that’s heartening. We take note of Python being picked by two respondents, while R went unpicked.
On the same topic, we asked which of these skills is hardest to find among applicants.
#limit data to question 6; droplevels removes unused factor levels - which are responses to other questions
team_skill2 <- df_dfs %>% filter(df_dfs$Question.No. == 6) %>% droplevels()
#add unused answers
levels(team_skill2$Answer) <- c(levels(team_skill2$Answer),"R","Java","Data wrangling", "Unstructured data", "Natural language processing", "Optimization", "Graphical models", "Privacy and ethics", "Other soft skills", "Other (please specify)")
#create an order to show the factors in logical manner
positions6 <- c("Ability to apply skills to practical problems","Data wrangling", "Graphical models", "Hadoop;Map/Reduce", "Java", "Machine learning", "Natural language processing", "Optimization","Other (please specify)", "Other soft skills", "Presentation skills", "Privacy and ethics", "Project management", "Python", "R", "Research design", "SQL", "Statistical modeling", "Unstructured data")
#include answers with no responses and adjust the axis in a few ways
team_skill2 %>% within(Answer <- factor(Answer, levels = names(sort(table(Answer), decreasing = TRUE)))) %>% ggplot(aes(x=Answer)) + geom_bar(aes(fill = Answer)) + theme(axis.text.x = element_text(angle=90, hjust=1), axis.title.x = element_blank()) + labs(title = "Hardest to find skills")
No two respondents chose the same skill! Interesting that four of the six responses are non-technical.
We also added a related question - what skill is most associated with advancement on your team?
#limit data to question 7; droplevels removes unused factor levels - which are responses to other questions
team_skill3 <- df_dfs %>% filter(df_dfs$Question.No. == 7) %>% droplevels()
#add unused answers
levels(team_skill3$Answer) <- c(levels(team_skill3$Answer),"R","Java","Data wrangling", "Unstructured data", "Natural language processing", "Optimization", "Graphical models", "Privacy and ethics", "Other soft skills", "Other (please specify)")
#include answers with no responses and adjust the axis in a few ways
#reusing position vector from 6
team_skill3 %>% within(Answer <- factor(Answer, levels = names(sort(table(Answer), decreasing = TRUE)))) %>% ggplot(aes(x=Answer)) + geom_bar(aes(fill = Answer)) + theme(axis.text.x = element_text(angle=90, hjust=1), axis.title.x = element_blank()) + labs(title = "Skills associated most with advancement") +
scale_fill_discrete(drop=FALSE) + scale_x_discrete(drop=FALSE, limits = positions6, labels = function(x) str_wrap(x, width = 30))
Machine learning showed up yet again with multiple selections. Soft skills are heavily represented as well, echoing industry feedback that project management and presentation skills are just as essential to success as technical competency.
Finally, if teams hire junior (or relatively entry-level) data scientists, what is the median starting salary?
#limit data to question 8; droplevels removes unused factor levels - which are responses to other questions
team_salary <- df_dfs %>% filter(df_dfs$Question.No. == 8) %>% droplevels()
#add unused answers
levels(team_salary$Answer) <- c(levels(team_salary$Answer),"Less than $50,000", "$125,000+")
#create an order to show the factors in logical manner
positions8 <- c("Less than $50,000","$50,000-74,999","$75,000-99,999","$100,000-124,999","$125,000+")
#include answers with no responses and adjust the axis in a few ways
ggplot(team_salary,aes(x=Answer)) + geom_bar(aes(fill = Answer)) + theme_light() + scale_fill_discrete(drop=FALSE) + scale_x_discrete(drop=FALSE, limits = positions8, labels = function(x) str_wrap(x, width = 30)) + labs(x = "What is the median starting salary for junior data scientists?",y="") + theme(axis.text.x = element_text(angle=45, hjust=1))
This is in line with what we’d expect given other industry data.
We see numerous possibilities for further exploration: a deeper look at the dearth of entry-level data science positions, perhaps using a combination of web scraping and machine learning to identify entry-level postings automatically. Additionally, a survey of recent graduates of the CUNY program would be of much interest.
Eleanor Romero - Job Listings Article Discussion, Presentation Slides
Alice Friedman - Kaggle Analysis 1, Technical Compilation, Presenter
Deepak Mongia - Kaggle Analysis 2, Graph Consistency Owner
Jeff Littlejohn - Custom Survey Build & Analysis, Content Compilation