Data Science is the “Hot New Field”, but is there Agreement About What Skills are Needed to Become a Data Scientist?

In this project we dig into varied data sets that support different points of view about the research question.


Research Question

Overview

Collaboration Model

The work on this project was fully collaborative. We divided the work into pieces, and for each piece we assigned a team of three or more.

Everyone was on more than one team. This way people who knew less could learn from those who knew more, and we could all contribute.

We held a number of meetings on Zoom where we reviewed each step of the process together and made consensus-based decisions about direction.

We used Slack, Google Drive, Azure and GitHub to share thoughts, ideas and code.

Who is “we”? We are:

  • Carlisle
  • Cassie
  • Dan
  • Eric
  • Esteban
  • Mari
  • Tyler

Packages used

library(tidytext)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(plotly)
library(stringr)
library(DT)
library(kableExtra)
library(wesanderson)
library(tidyr)
library(corpus)
library(keyring)
library(RODBC)
library(readr)
library(stringi)
library(tm)
library(wordcloud)
library(data.table)
library(RedditExtractoR)
library(readxl)
library(magrittr)
library(SnowballC)
library(RColorBrewer)

Methodology

Our methodology was to mine varied data sets which provided information that could support different points of view about the research question. We settled on three: a data set of 7000 Indeed job listings from 2018, a compendium of work skills from the University of Chicago’s Open Skills project, and a survey that we created for the purpose of this project.

We used the data sets to create a comprehensive word_catalog of data scientist skills, knowledge and abilities. Then we ran the word_catalog back over the Indeed data set so that we could pull out the skills that appeared most. Finally, we used our findings to compare the Indeed data set for data scientists, the Indeed data set for data analysts, and the survey.

1. Collect Data

Collect the Data to be Analyzed


In this step we read in the following data sets and write them back to a normalized database on Azure. From that point forward, we only pull data from the database so that we are confident we have consistent, secure, and persistent storage for our data:

  • Data Scientist Job Market In the US, Kaggle, downloaded into csv from the website
  • Data from the University of Chicago’s Center for Data Science & Public Policy’s Open Skills Project, accessed through their API
  • Our survey, loaded from Google Drive

Read in data and filter by position

conn_str <- paste0(
  'Driver={ODBC Driver 17 for SQL Server};
   Server=tcp:ehtmp.database.windows.net,1433;
   Database=ds_skills;
   Encrypt=yes;
   TrustServerCertificate=no;
   Connection Timeout=30;',
   'Uid=',keyring::key_get(service = "my-skills-db-username", keyring = "my-skills-db-keyring"),';',
   'Pwd=', keyring::key_get(service = "my-skills-db-pwd", keyring = "my-skills-db-keyring"), ';'
)
dbConnection <- odbcDriverConnect(conn_str)

all_data <- (sqlQuery(dbConnection, "SELECT * FROM ds_skills.kaggle.job_postings_raw"))

# If the database connection does not work, load the CSV file from GitHub instead:
#all_data<-read.csv("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/project_3/raw_jobdata.csv")
all_data$position<-tolower(all_data$position)
# Lower-case the descriptions, then strip mis-encoded characters and newlines
all_data$description<-tolower(all_data$description)%>%
  str_remove_all("â|€|™|\\n")

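## Filter for data science positions: "data" followed by a word starting with b-z
## (which skips "data analyst"), or any title containing "ai" or "machine".
## Note that "ai" also matches inside longer words such as "maintenance".
data_scientists<-all_data%>%
  mutate(contents = str_detect(tolower(position), "data [b-z]|ai|machine"))%>%
  filter(contents == TRUE)

## Filter for data analyst positions
data_analysts<-all_data%>%
  mutate(contents = str_detect(tolower(position), "anal"))%>%
  filter(contents == TRUE)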

rm(all_data) 
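
To make the effect of the two title filters concrete, the sketch below applies the same patterns to a handful of invented job titles. It uses base R's grepl() with perl = TRUE instead of str_detect() only to stay dependency-free. The titles and the names toy_titles, is_ds and is_da are assumptions for illustration and do not come from the Indeed data.

```r
# Invented job titles (not from the data set), already lower-cased
toy_titles <- c("data scientist", "data analyst", "machine learning engineer",
                "senior data engineer", "maintenance technician")

# Same patterns as the two filters above
is_ds <- grepl("data [b-z]|ai|machine", toy_titles, perl = TRUE)
is_da <- grepl("anal", toy_titles, perl = TRUE)

# Side-by-side view of which subset each title would land in
data.frame(title = toy_titles, data_scientist = is_ds, data_analyst = is_da)
```

In this toy run "data analyst" is kept out of the data scientist subset because the word after "data" starts with "a", while "maintenance technician" is pulled in by the bare "ai" alternative. The real data scientist subset may therefore contain some unrelated postings.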

Filter for targeted skills

Using the same approach as above, we create new columns that hold the text surrounding each keyword of interest, to be worked with later. One issue with this approach is the abundance of NA values.

ds<-data_scientists%>%
  mutate(skill = str_extract_all(data_scientists$description, " .{75,100}   skill.{150,200} "))%>%
  mutate(must_have=str_extract_all(data_scientists$description, "must have.{150,200} "))%>%
  mutate(knowledge=str_extract_all(data_scientists$description, " .{75,100} knowledge.{150,200} "))%>%
  mutate(experience=str_extract_all(data_scientists$description, " .{100,150} exper.{100,150} "))%>%
  mutate(excel=str_extract_all(data_scientists$description, "excel at.{150,200} |excel with.{150,200} |excel in.{150,200} "))%>%
  mutate(responsible=str_extract_all(data_scientists$description, "responsible.{150,200} "))%>%
  mutate(proficient=str_extract_all(data_scientists$description," .{100,150} profi.{110,160} "))%>%
  mutate(understands=str_extract_all(data_scientists$description, " .{100,150} understand.{150,200} "))%>%
  mutate(utilize=str_extract_all(data_scientists$description, "utilize.{150,200} "))%>%
  mutate(lead=str_extract_all(data_scientists$description, " .{150,200} lead.{150,200} "))%>%
  mutate(work=str_extract_all(data_scientists$description, " .{50,75} work.{150,200} "))%>%
  mutate(looking=str_extract_all(data_scientists$description, "looking.{150,200} "))


# Collapse the list of matches in each keyword column into a single string per posting
keyword_cols <- c("skill", "must_have", "knowledge", "understands", "experience",
                  "excel", "responsible", "proficient", "utilize", "lead", "work", "looking")

for (col in keyword_cols) {
  ds[[col]] <- lapply(ds[[col]], function(x) paste(unlist(x), collapse = ' '))
}
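
The keyword-in-context patterns above only match when enough text surrounds the keyword, which is one plausible reason many cells come back empty. The following is a minimal sketch of that behaviour on an invented description, using base R's regmatches() and gregexpr() in place of str_extract_all(). The helper extract_context and the shortened 10 to 20 character window are assumptions for illustration only.

```r
# Invented, lower-cased description (not from the data set)
description <- paste(
  "we are looking for someone who must have strong sql and python experience",
  "and can explain results to a non technical audience clearly"
)

# Hypothetical helper: return every match of `pattern` in `text`
extract_context <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# A short window finds the keyword plus a little trailing context
short_hit <- extract_context(description, "must have.{10,20} ")

# The window used in the project needs at least 150 characters after the keyword;
# this description is too short, so nothing is returned
long_hit <- extract_context(description, "must have.{150,200} ")

short_hit
length(long_hit)
```

Wide windows capture more context per match, but they silently skip keywords that sit near the end of a description or in short postings.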

Data Analyst

da<-data_analysts%>%
  mutate(skill = str_extract_all(data_analysts$description, " .{75,100}   skill.{150,200} "))%>%
  mutate(must_have=str_extract_all(data_analysts$description, "must have.{150,200} "))%>%
  mutate(knowledge=str_extract_all(data_analysts$description, " .{75,100} knowledge.{150,200} "))%>%
  mutate(experience=str_extract_all(data_analysts$description, " .{100,150} exper.{100,150} "))%>%
  mutate(excel=str_extract_all(data_analysts$description, "excel at.{150,200} |excel with.{150,200} |excel in.{150,200} "))%>%
  mutate(responsible=str_extract_all(data_analysts$description, "responsible.{150,200} "))%>%
  mutate(proficient=str_extract_all(data_analysts$description," .{100,150} profi.{110,160} "))%>%
  mutate(understands=str_extract_all(data_analysts$description, " .{100,150} understand.{150,200} "))%>%
  mutate(utilize=str_extract_all(data_analysts$description, "utilize.{150,200} "))%>%
  mutate(lead=str_extract_all(data_analysts$description, " .{150,200} lead.{150,200} "))%>%
  mutate(work=str_extract_all(data_analysts$description, " .{50,75} work.{150,200} "))%>%
  mutate(looking=str_extract_all(data_analysts$description, "looking.{150,200} "))


# Collapse the list of matches in each keyword column into a single string per posting
keyword_cols <- c("skill", "must_have", "knowledge", "understands", "experience",
                  "excel", "responsible", "proficient", "utilize", "lead", "work", "looking")

for (col in keyword_cols) {
  da[[col]] <- lapply(da[[col]], function(x) paste(unlist(x), collapse = ' '))
}

Transform Data Science dataframe to long

ds_long <- ds %>% gather(keyword, text, 7:18)

text <- ds_long %>% select(text)

Make corpus and remove punctuation, numbers, stopwords, convert cases, etc

corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, c("skill","responsible","proficient","knowledge","understands", "must", "experience", "character", "will", "looking", "excels at", "work", "lead", "utilize"))
corpus <- tm_map(corpus, stripWhitespace)
wordcloud(corpus, max.words = 50, colors = colorRampPalette(brewer.pal(7, "Dark2"))(32))

Tokenization of the text body into unigrams (one word), bigrams (two words), and trigrams (three words)

#Unigrams
unigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE) }
unigram <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, 20)))


#Bigrams
bigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) }
bigram <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))


#Trigrams
trigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE) }
trigram <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, 60),tokenize = trigramTokenizer))
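
To show what the tokenizers above produce, here is a dependency-free sketch of n-gram construction and counting in base R. The function make_ngrams and the example phrases are assumptions made for illustration; the project itself relies on ngrams() and TermDocumentMatrix() as shown above.

```r
# Hypothetical helper: split text on whitespace and join every run of n tokens
make_ngrams <- function(text, n) {
  tokens <- strsplit(text, "\\s+")[[1]]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Bigrams and trigrams of one invented phrase
bigrams  <- make_ngrams("natural language processing experience", 2)
trigrams <- make_ngrams("natural language processing experience", 3)

# Counting bigrams across several invented phrases, most frequent first
phrases <- c("machine learning models", "machine learning algorithms")
bigram_counts <- sort(table(unlist(lapply(phrases, make_ngrams, n = 2))),
                      decreasing = TRUE)
bigram_counts
```

Because "machine learning" occurs in both phrases it rises to the top of the count, which is the same effect the bigram and trigram plots rely on to surface multi-word skills.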

Plot unigram

#Unigrams

unigramrow <- sort(slam::row_sums(unigram), decreasing=T)
unigramfreq <- data.table(tok = names(unigramrow), freq = unigramrow)


ggplot(unigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = "coral") + theme_bw() +
     ggtitle("Top 25 Unigrams") +labs(x = "", y = "")

Plot bigram

#Bigrams

bigramrow <- sort(slam::row_sums(bigram), decreasing=T)
bigramfreq <- data.table(tok = names(bigramrow), freq = bigramrow)

ggplot(bigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = "coral") + theme_bw() +
     ggtitle("Top 25 Bigrams") +labs(x = "", y = "")

Plot trigram

#Trigrams

trigramrow <- sort(slam::row_sums(trigram), decreasing=T)
trigramfreq <- data.table(tok = names(trigramrow), freq = trigramrow)

ggplot(trigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = "coral") + theme_bw() +
     ggtitle("Top 25 Trigrams") +labs(x = "", y = "")

2. The Word Catalog

Create a Word Catalog

Building the word_catalog is the heart of our project. The Indeed database contains 7,000 lengthy job descriptions. Performing a word count or a simple regex on “skills” is very limited – consider, e.g., the sentence “The applicant must be familiar with all aspects of the data process, from collection to analysis, and be adept at communicating their findings.” It names several skills without ever using the word “skill”.

In order to capture all of the skills in our data set, we built a list of as many possible words and phrases describing data science skills as might exist in the data set.

We then used this list to go back over the data set to calculate those word and phrase frequencies which were most prominent.

We built this list by combining a deep dive into Regex with an n-gram analysis so that we could see not only what words appeared frequently, but how they appeared together with other words.

We used relevant keywords from both the Indeed data set and the Open Skills data set to supplement this comprehensive list of possible skills.

In the end, our word_catalog contained [put number here] individual words and phrases describing data science skills.

Create a word catalog and search for string in column

#Load word_catalog 

dictionary_analyst <- read.csv("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/project_3/Eric_DataAnalystDictionary.csv", header=FALSE, fileEncoding = "UTF-8-BOM")

dictionary_ngrams <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/dictionary/Ngrams_dictionary.csv?token=ASEY4BIFPB7FWOOYPWAQNWTANUIPC", fileEncoding = "UTF-8-BOM")

os <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/dictionary/OS_dictionary_skills.csv?token=ASEY4BNIW2YDLN2N57PN7ELANZUJK", fileEncoding = "UTF-8-BOM")

onet <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/dictionary/ONET%20Technology%20Skills.csv?token=ASEY4BJAZDGVPDDHSSZJZ43ANZUNY", fileEncoding = "UTF-8-BOM")


# create word catalogue with single skill column for merge

dictionary_onet <- onet %>% select(skill = Skill)
dictionary_os <- os %>% select(skill)

#Assign column name to analyst word catalog 

names(dictionary_analyst) <- ('skill')
names(dictionary_ngrams) <- ('skill')

#convert to lowercase and remove special characters where needed

dictionary_analyst <- dictionary_analyst %>% mutate(across(where(is.character), tolower))
dictionary_ngrams <- dictionary_ngrams  %>% mutate(across(where(is.character), tolower))
dictionary_os <- dictionary_os  %>% mutate(across(where(is.character), tolower))
dictionary_onet <- dictionary_onet  %>% mutate(across(where(is.character), tolower)) %>% mutate(across(everything(), ~ gsub("[[:punct:]]", "", .x)))


#merge dictionaries and remove duplicates 

MyMerge <- function(x, y){
  df <- merge(x, y, all = TRUE)
  return(df)
}

# merge all four dictionaries and delete duplicate skills

dictionary <- Reduce(MyMerge, list(dictionary_analyst, dictionary_ngrams, dictionary_onet, dictionary_os)) %>% distinct()


# remove common character strings found within words and phrases to analyze separately 

# remove skills that should be dropped when they appear alone but kept when part of a phrase

dictionary$skill <- str_remove(dictionary$skill, "(?! )(ai|science|business)(?! )")

# remove short skill names that would otherwise be picked up inside longer words

dictionary$skill <- str_remove(dictionary$skill, "(?:^|\\W)(r|c|ms|go)(?:$|\\W)")

# turn word_catalog into vector and remove empty rows

dictionary <- dictionary[!apply(dictionary == "", 1, all),]
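
The short-token cleanup can be tried on a small invented catalog. This sketch uses base R's sub() with perl = TRUE as a stand-in for str_remove(), since both remove only the first match. The vector toy_catalog and its entries are assumptions for illustration.

```r
# Invented catalog entries (not the project's dictionary)
toy_catalog <- c("r", "go", "ms excel", "c programming", "sql server", "python")

# Same pattern as above: drop r, c, ms or go when bounded by a non-word
# character or by the start/end of the string
cleaned <- sub("(?:^|\\W)(r|c|ms|go)(?:$|\\W)", "", toy_catalog, perl = TRUE)

# Entries that consisted only of a short token are now empty, so drop them
cleaned <- cleaned[cleaned != ""]
cleaned
```

Stand-alone entries such as "r" and "go" disappear, which is why they are counted separately in the next step. In this toy run the pattern also trims the short token out of phrases ("ms excel" becomes "excel"), a side effect to keep in mind when reading the results.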

3. Detect words in word_catalog

Detecting words in the word_catalog we created

Now we are ready to detect word and phrase frequencies in our data set. We will separate out job positions that include “data scientist” from those that include “data analyst” because we cannot answer our research question without investigating whether data science is really just a new fancy term for data analysis.

The detection will be done in two parts. The first will be counts of small words that are hard to isolate in large descriptions. The second will run all other identified skills in our word_catalog through our job descriptions. The resulting skill frequencies of these efforts will be merged for analysis.

Get counts for r, ms, ai, and go

data_sci_count <- data.frame(
  "r"  = sum(str_count(data_scientists$description, " r | r,| r\\.")),
  "ms" = sum(str_count(data_scientists$description, " ms | ms,| ms\\.")),
  Go   = sum(str_count(data_scientists$description, " go ")),
  "AI" = sum(str_count(data_scientists$description, " ai | ai,| ai\\."))
)

data_ana_count <- data.frame(
  "r"  = sum(str_count(data_analysts$description, " r | r,| r\\.")),
  "ms" = sum(str_count(data_analysts$description, " ms | ms,| ms\\.")),
  Go   = sum(str_count(data_analysts$description, " go ")),
  "AI" = sum(str_count(data_analysts$description, " ai | ai,| ai\\."))
)
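
The padded patterns used for the short tokens can be checked on a couple of invented descriptions. The sketch below uses base R in place of str_count(); the helper count_matches and the example sentences are assumptions for illustration.

```r
# Hypothetical helper: number of matches of `pattern` in each element of `text`
count_matches <- function(text, pattern) {
  lengths(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
}

# Invented, lower-cased descriptions (not from the data set)
toy_descriptions <- c(
  "experience with r, python and sql. proficiency in r.",
  "strong research record in forecasting"
)

# Same pattern as above: "r" preceded by a space and followed by a space, comma or period
r_pattern <- " r | r,| r\\."

per_posting <- count_matches(toy_descriptions, r_pattern)
total <- sum(per_posting)
per_posting
total
```

The leading space keeps the "r" inside "research" and "record" from being counted. Two caveats follow from the pattern itself: every mention is counted, so a posting that names R twice contributes two to the total, and an "r" at the very start of a description or next to other punctuation (for example "r/python") is not counted.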

Detect Word_catalog words in original descriptions

One way to assess the importance of a particular skill is to count how many times each word from our word_catalog is mentioned throughout the dataset. Here, we’re looking for an overall count of the word_catalog words that show up most frequently in the Kaggle dataset of job descriptions. We’ll run these counts on both ‘data scientist’ job descriptions and ‘data analyst’ job descriptions.

# Pulls skills out of description based on catalog


setDT(data_scientists)[, skills := paste(dictionary[unlist(lapply(dictionary, function(x) grepl(x, description, ignore.case = T)))], collapse = ","), by = 1:nrow(data_scientists)]

setDT(data_analysts)[, skills := paste(dictionary[unlist(lapply(dictionary, function(x) grepl(x, description, ignore.case = T)))], collapse = ","), by = 1:nrow(data_analysts)]


# Create a count of skills for data science

skillsfreq_ds <- data_scientists %>%
  separate_rows(skills, sep = ',') %>%
  group_by(skills = tolower(skills)) %>%
  summarise(count = n())

# Create a count of skills for data analyst

skillsfreq_da <- data_analysts %>%
  separate_rows(skills, sep = ',') %>%
  group_by(skills = tolower(skills)) %>%
  summarise(count = n())

# Merge counts for r, ms, ai, go

data_sci_count <- data_sci_count %>% gather(skills, count, 1:4)

data_ana_count <- data_ana_count %>% gather(skills, count, 1:4)

skillsfreq_ds <- merge(skillsfreq_ds, data_sci_count, all = TRUE)

skillsfreq_da <- merge(skillsfreq_da, data_ana_count, all = TRUE)


# Merge top data skills for data scientist and data analyst 

skillsfreq_all <- full_join(skillsfreq_ds ,skillsfreq_da,by="skills") %>% rename(count_ds = count.x, count_da = count.y)
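
The detect-then-tally step can be sketched in base R on invented data. The names toy_catalog, toy_descriptions and find_skills are assumptions for illustration; the project code above does the same job with data.table and separate_rows().

```r
# Invented catalog and descriptions (not from the data set)
toy_catalog <- c("python", "sql", "java")
toy_descriptions <- c("python and sql required",
                      "javascript developer with python",
                      "excellent communication")

# Hypothetical helper: comma-separated list of catalog terms found in one description
find_skills <- function(description, catalog) {
  hit <- vapply(catalog,
                function(term) grepl(term, description, ignore.case = TRUE),
                logical(1))
  paste(catalog[hit], collapse = ",")
}

skills <- vapply(toy_descriptions, find_skills, character(1),
                 catalog = toy_catalog, USE.NAMES = FALSE)

# Split the lists back into single skills, drop empty strings, and count
found <- unlist(strsplit(skills, ","))
found <- found[nzchar(found)]
skill_counts <- table(found)
skill_counts
```

Two things are visible in this toy run. Each term is recorded at most once per description, so these counts are counts of postings rather than of mentions. Matching is by substring, so "java" is also found inside "javascript"; catalog entries that are prefixes of other words may be over-counted.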

Let’s take a look at the top skills that show up for each of “data scientist” and “data analyst”:

Table top data scientist skills

top_skills_ds <- skillsfreq_ds %>% arrange(desc(count)) %>% select(skills, count) %>% mutate(rank = row_number()) 

top_skills_ds <- left_join(top_skills_ds, ds_jd_count) %>% rename(jd_count = freq) %>% unique.data.frame()
## Joining, by = "skills"
findings_table_ds <-(top_skills_ds) %>%
  kbl(caption = "Top Skills") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(height = "400px")

findings_table_ds
Top Skills
skills count rank jd_count
1 python 5250 1 1750
4 machine learning 1693 2 1693
5 design 1468 3 1468
6 computer 1460 4 1460
7 science 1384 5 1384
8 research 1254 6 1254
9 statistics 1249 7 1249
10 sql 1172 8 1172
11 communication 1168 9 1168
12 r 1118 10 950
13 math 1102 11 1102
14 solutions 1056 12 1056
15 algorithms 975 13 975
16 programming 960 14 960
17 leader 905 15 905
18 organization 844 16 844
19 passion 821 17 821
20 analytical 799 18 799
21 phd 789 19 789
22 scala 789 20 789
23 quantitative 779 21 779
24 spark 779 22 779
25 mathematics 768 23 768
26 communication skills 766 24 766
27 java 765 25 765
28 AI 729 26 NA
29 vision 714 27 714
30 hadoop 706 28 706
31 database 699 29 699
32 big data 695 30 695
33 written 684 31 684
34 ml 617 32 617
35 visualization 578 33 578
36 years of experience 576 34 576
37 data sets 566 35 566
38 leadership 536 36 536
39 data analysis 532 37 532
40 git 516 38 516
41 collaborative 511 39 511
42 office 506 40 506
43 data mining 483 41 483
44 verbal 481 42 481
45 innovation 471 43 471
46 presentation 463 44 463
47 creating 451 45 451
48 sas 445 46 445
49 deep learning 432 47 432
50 large data 398 48 398
51 natural language 390 49 390
52 data engineer 389 50 389
53 physics 337 51 337
54 collaboration 329 52 329
55 software development 327 53 327
56 language processing 320 54 320
57 artificial intelligence 318 55 318
58 natural language processing 318 56 318
59 economics 315 57 315
60 learning algorithms 315 58 315
61 programming languages 313 59 313
62 business problems 312 60 312
63 data visualization 312 61 312
64 machine learning techniques 298 62 298
65 writing 295 63 295
66 rtable 291 64 291
67 data analytics 285 65 285
68 tableau 279 66 279
69 machine learning algorithms 278 67 278
70 matlab 277 68 277
71 consulting 274 69 274
72 problem solving 273 70 273
73 learning models 268 71 268
74 nlp 254 72 254
75 influence 251 73 251
76 linux 251 74 251
77 flexible 244 75 244
78 etl 239 76 239
79 statistical modeling 237 77 237
80 large scale 235 78 235
81 nosql 232 79 232
82 machine learning models 231 80 231
83 data processing 227 81 227
84 large data sets 227 82 227
85 data pipeline 225 83 225
86 ms 223 84 253
87 predictive models 211 85 211
88 interpersonal 201 86 201
89 masters 201 87 201
90 data engineering 194 88 194
91 software engineers 194 89 194
92 data pipelines 179 90 179
93 decision making 173 91 173
94 organizational 172 92 172
95 forecasting 169 93 169
96 bachelor’s degree 164 94 164
97 monitoring 162 95 162
98 data management 161 96 161
99 have experience 157 97 157
100 creativity 155 98 155
101 microsoft 148 99 148
102 predictive analytics 146 100 146
103 project management 141 101 141
104 data collection 132 102 132
105 azure 127 103 127
106 javascript 127 104 127
107 modeling techniques 122 105 122
108 data architecture 120 106 120
109 data models 116 107 116
110 sap 112 108 112
111 unix 112 109 112
112 Go 107 110 NA
113 array 104 111 104
114 modelling 102 112 102
115 ruby 98 113 98
116 work independently 97 114 97
117 data warehousing 95 115 95
118 facebook 95 116 95
119 mysql 92 117 92
120 powerpoint 84 118 84
121 language understanding 83 119 83
122 ecommerce 80 120 80
123 data systems 78 121 78
124 solving problems 77 122 77
125 data extraction 75 123 75
126 mongodb 75 124 75
127 apache spark 65 125 65
128 natural language understanding 64 126 64
129 critical thinking 62 127 62
130 kpmg 60 128 60
131 data integration 59 129 59
132 big data architecture 57 130 57
133 github 57 131 57
134 postgresql 56 132 56
135 data manipulation 51 133 51
136 highly motivated 51 134 51
137 masters degree 49 135 49
138 elasticsearch 47 136 47
139 speaking 47 137 47
140 architecture capabilities 46 138 46
141 bachelors 46 139 46
142 covering technologies 46 140 46
143 multi-task 46 141 46
144 bash 45 142 45
145 nlu 40 143 40
146 shell script 40 144 40
147 troubleshooting 40 145 40
148 youtube 40 146 40
149 coordination 39 147 39
150 data gathering 39 148 39
151 time management 38 149 38
152 methodological 34 150 34
153 microsoft office 31 151 31
154 django 30 152 30
155 google analytics 30 153 30
156 market research 29 154 29
157 network analysis 28 155 28
158 data insights 27 156 27
159 microsoft azure 23 157 23
160 data preparation 22 158 22
161 negotiation 21 159 21
162 sales and marketing 19 160 19
163 vba 19 161 19
164 jupyter notebook 18 162 18
165 microstrategy 18 163 18
166 doctorate degree 17 164 17
167 manage multiple projects 16 165 16
168 analytics data 15 166 15
169 microsoft excel 15 167 15
170 machine learning data 13 168 13
171 strategic thinking 13 169 13
172 highly organized 10 170 10
173 jquery 10 171 10
174 9 172 NA
175 grammatical 9 173 9
176 symantec 9 174 9
177 telecommunications 9 175 9
178 data entry 8 176 8
179 data mapping 8 177 8
180 data reporting 8 178 8
181 eko 8 179 8
182 active learning 7 180 7
183 apache hadoop 7 181 7
184 confluence 7 182 7
185 amazon redshift 6 183 6
186 data transfer 6 184 6
187 microsoft word 6 185 6
188 swift 6 186 6
189 complex problem solving 5 187 5
190 english language 5 188 5
191 ibm db2 5 189 5
192 microsoft sql server 5 190 5
193 service orientation 5 191 5
194 unix shell 5 192 5
195 apache kafka 4 193 4
196 operations analysis 4 194 4
197 see the big picture 4 195 4
198 apache hive 3 196 3
199 data interpretation 3 197 3
200 experience in information technology 3 198 3
201 microsoft access 3 199 3
202 minitab 3 200 3
203 skype 3 201 3
204 systems analysis 3 202 3
205 technology design 3 203 3
206 ubuntu 3 204 3
207 work well in a team 3 205 3
208 active listening 2 206 2
209 citrix 2 207 2
210 clerical 2 208 2
211 client management 2 209 2
212 data cleanup 2 210 2
213 data organization 2 211 2
214 google adwords 2 212 2
215 mathematical reasoning 2 213 2
216 microsoft outlook 2 214 2
217 microsoft powerpoint 2 215 2
218 amazon dynamodb 1 216 1
219 bring creativity 1 217 1
220 design development 1 218 1
221 engineering and technology 1 219 1
222 filemaker pro 1 220 1
223 ibm infosphere datastage 1 221 1
224 judgment and decision making 1 222 1
225 microsoft dynamics 1 223 1
226 microsoft windows server 1 224 1
227 oracle hyperion 1 225 1
228 oracle java 1 226 1
229 organizational management 1 227 1
230 prepare data for analysis 1 228 1
231 quality control analysis 1 229 1
232 reading comprehension 1 230 1
233 report creation 1 231 1
234 teradata database 1 232 1
235 wireshark 1 233 1

Table top data analyst skills

top_skills_da <- skillsfreq_da %>% arrange(desc(count)) %>% select(skills, count) %>% mutate(rank = row_number())

top_skills_da <- left_join(top_skills_da, da_jd_count) %>% rename(jd_count = freq) %>% unique.data.frame()
## Joining, by = "skills"
findings_table_da <-(top_skills_da) %>%
  kbl(caption = "Top Skills") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  scroll_box(height = "400px")

findings_table_da
Top Skills
skills count rank jd_count
1 python 1275 1 425
4 research 785 2 785
5 communication 776 3 776
6 analytical 653 4 653
7 design 627 5 627
8 organization 605 6 605
9 written 514 7 514
10 quantitative 499 8 499
11 communication skills 488 9 488
12 leader 484 10 484
13 sql 467 11 467
14 statistics 466 12 466
15 office 463 13 463
16 math 403 14 403
17 r 401 15 326
18 solutions 400 16 400
19 computer 380 17 380
20 presentation 373 18 373
21 database 367 19 367
22 vision 358 20 358
23 leadership 327 21 327
24 verbal 324 22 324
25 passion 323 23 323
26 programming 313 24 313
27 science 312 25 312
28 sas 309 26 309
29 data analysis 304 27 304
30 writing 300 28 300
31 years of experience 286 29 286
32 collaborative 284 30 284
33 economics 268 31 268
34 mathematics 257 32 257
35 visualization 243 33 243
36 organizational 224 34 224
37 microsoft 223 35 223
38 innovation 207 36 207
39 machine learning 207 37 207
40 data sets 206 38 206
41 tableau 203 39 203
42 interpersonal 198 40 198
43 creating 196 41 196
44 git 195 42 195
45 powerpoint 190 43 190
46 consulting 182 44 182
47 ms 173 45 165
48 problem solving 172 46 172
49 data visualization 162 47 162
50 project management 159 48 159
51 collaboration 154 49 154
52 phd 151 50 151
53 bachelor’s degree 149 51 149
54 data analytics 140 52 140
55 flexible 137 53 137
56 java 134 54 134
57 ml 134 55 134
58 large data 133 56 133
59 algorithms 130 57 130
60 rtable 128 58 128
61 work independently 125 59 125
62 data collection 121 60 121
63 big data 120 61 120
64 data management 120 62 120
65 influence 118 63 118
66 monitoring 118 64 118
67 decision making 114 65 114
68 scala 113 66 113
69 data mining 105 67 105
70 microsoft office 100 68 100
71 forecasting 96 69 96
72 market research 95 70 95
73 hadoop 92 71 92
74 matlab 92 72 92
75 physics 89 73 89
76 programming languages 88 74 88
77 business problems 87 75 87
78 spark 87 76 87
79 masters 85 77 85
80 data engineer 81 78 81
81 critical thinking 77 79 77
82 multi-task 77 80 77
83 Go 74 81 NA
84 etl 73 82 73
85 large data sets 72 83 72
86 coordination 68 84 68
87 microsoft excel 68 85 68
88 statistical modeling 61 86 61
89 vba 60 87 60
90 have experience 58 88 58
91 facebook 53 89 53
92 sap 53 90 53
93 highly motivated 52 91 52
94 time management 51 92 51
95 creativity 50 93 50
96 data engineering 50 94 50
97 linux 46 95 46
98 bachelors 41 96 41
99 google analytics 41 97 41
100 software engineers 41 98 41
101 data processing 40 99 40
102 modelling 39 100 39
103 predictive analytics 39 101 39
104 predictive models 39 102 39
105 software development 38 103 38
106 javascript 37 104 37
107 modeling techniques 36 105 36
108 deep learning 35 106 35
109 array 34 107 34
110 troubleshooting 34 108 34
111 unix 34 109 34
112 artificial intelligence 33 110 33
113 machine learning techniques 33 111 33
114 data manipulation 31 112 31
115 data models 31 113 31
116 data systems 31 114 31
117 natural language 31 115 31
118 data warehousing 30 116 30
119 mysql 30 117 30
120 solving problems 30 118 30
121 youtube 29 119 29
122 data integration 28 120 28
123 manage multiple projects 28 121 28
124 masters degree 27 122 27
125 data pipeline 26 123 26
126 ecommerce 26 124 26
127 learning algorithms 26 125 26
128 methodological 26 126 26
129 speaking 26 127 26
130 AI 25 128 NA
131 data extraction 25 129 25
132 language processing 25 130 25
133 data gathering 24 131 24
134 machine learning algorithms 24 132 24
135 natural language processing 24 133 24
136 learning models 23 134 23
137 large scale 22 135 22
138 nosql 22 136 22
139 machine learning models 21 137 21
140 azure 19 138 19
141 data entry 19 139 19
142 data insights 19 140 19
143 doctorate degree 19 141 19
144 microsoft word 18 142 18
145 highly organized 17 143 17
146 ruby 17 144 17
147 data pipelines 16 145 16
148 data reporting 15 146 15
149 negotiation 15 147 15
150 systems analysis 15 148 15
151 microsoft powerpoint 14 149 14
152 nlp 14 150 14
153 data architecture 13 151 13
154 bash 12 152 12
155 network analysis 12 153 12
156 elasticsearch 11 154 11
157 postgresql 11 155 11
158 service orientation 11 156 11
159 strategic thinking 11 157 11
160 english language 10 158 10
161 github 10 159 10
162 mongodb 10 160 10
163 data preparation 8 161 8
164 data transfer 8 162 8
165 analytics data 7 163 7
166 client management 7 164 7
167 microsoft access 7 165 7
168 6 166 NA
169 apache spark 6 167 6
170 microsoft project 6 168 6
171 shell script 6 169 6
172 jupyter notebook 5 170 5
173 language understanding 5 171 5
174 microstrategy 5 172 5
175 sales and marketing 5 173 5
176 symantec 5 174 5
177 data interpretation 4 175 4
178 grammatical 4 176 4
179 minitab 4 177 4
180 natural language understanding 4 178 4
181 report creation 4 179 4
182 amazon redshift 3 180 3
183 complex problem solving 3 181 3
184 data mapping 3 182 3
185 data storytelling 3 183 3
186 kpmg 3 184 3
187 lexisnexis 3 185 3
188 microsoft outlook 3 186 3
189 telecommunications 3 187 3
190 work well in a team 3 188 3
191 active learning 2 189 2
192 administration and management 2 190 2
193 ajax 2 191 2
194 apache hadoop 2 192 2
195 clerical 2 193 2
196 confluence 2 194 2
197 experience in market research 2 195 2
198 google docs 2 196 2
199 mcafee 2 197 2
200 microsoft azure 2 198 2
201 nlu 2 199 2
202 operations analysis 2 200 2
203 see the big picture 2 201 2
204 swift 2 202 2
205 wireshark 2 203 2
206 apache kafka 1 204 1
207 apache tomcat 1 205 1
208 data cleanup 1 206 1
209 data organization 1 207 1
210 datadriven 1 208 1
211 deductive reasoning 1 209 1
212 design development 1 210 1
213 django 1 211 1
214 eko 1 212 1
215 epic systems 1 213 1
216 experience in information technology 1 214 1
217 filemaker pro 1 215 1
218 google adwords 1 216 1
219 ibm db2 1 217 1
220 jquery 1 218 1
221 judgment and decision making 1 219 1
222 machine learning data 1 220 1
223 mathematical reasoning 1 221 1
224 microsoft dynamics 1 222 1
225 microsoft sharepoint 1 223 1
226 microsoft sql server 1 224 1
227 microsoft sql server reporting services 1 225 1
228 microsoft windows server 1 226 1
229 oracle hyperion 1 227 1
230 organizational management 1 228 1
231 processing information 1 229 1
232 reading comprehension 1 230 1
233 skype 1 231 1
234 systems evaluation 1 232 1
235 tax software 1 233 1
236 technology design 1 234 1
237 ubuntu 1 235 1
238 unix shell 1 236 1

We can see that there are a few skills that stand out among both positions (Python and R among them). In order to compare the importance of each skill between the two roles, however, we need to look at their frequency in a slightly different way…

Count number of job descriptions containing each word-catalog entry

Here, we want to compare how many job descriptions within each dataset contain each word_catalog entry. Again, we run this code on both “data scientist” job descriptions and “data analyst” job descriptions. By counting job descriptions rather than mentions, we can calculate, for each skill, the proportion of postings in each dataset that contain it, which allows us to compare the skill’s relative importance to each role.

# data_sci_jd_count <- tibble(
#     "r" = nrow(filter(data_scientists, str_detect(data_scientists$description, " r | r,| r\\.") == TRUE)),
#     "ms" = nrow(filter(data_scientists, str_detect(data_scientists$description, " ms | ms,| ms\\.") == TRUE)),
#     "go" = nrow(filter(data_scientists, str_detect(data_scientists$description, " go ") == TRUE)),
#     "ai" = nrow(filter(data_scientists, str_detect(data_scientists$description, " ai | ai,| ai\\.") == TRUE))) %>% 
#   pivot_longer(names_to="skills", cols=c("r", "ms", "go", "ai"), values_to="freq")
# 
# dict_dsjd_freq <- tibble(
#   "skills" = dictionary,
#   "freq" = lapply(dictionary, function(x){
#     nrow(filter(data_scientists, str_detect(data_scientists$description, x) == TRUE))
#   }) %>% as.vector(mode="integer"))
# 
# ds_jd_count <- union_all(data_sci_jd_count, dict_dsjd_freq)
# 
# 
# data_ana_jd_count <- tibble(
#     "r" = nrow(filter(data_analysts, str_detect(data_analysts$description, " r | r,| r\\.") == TRUE)), 
#     "ms" = nrow(filter(data_analysts, str_detect(data_analysts$description, " ms | ms,| ms\\.") == TRUE)),
#     "go" = nrow(filter(data_analysts, str_detect(data_analysts$description, " go ") == TRUE)),
#     "ai" = nrow(filter(data_analysts, str_detect(data_analysts$description, " ai | ai,| ai\\.") == TRUE))) %>%
#   pivot_longer(names_to="skills", cols=c("r", "ms", "go", "ai"), values_to="freq")
# 
# dict_dajd_freq <- tibble(
#   "skills" = dictionary,
#   "freq" = lapply(dictionary, function(x){
#     nrow(filter(data_analysts, str_detect(data_analysts$description, x) == TRUE))
#   }) %>% as.vector(mode="integer"))
# 
# da_jd_count <- union_all(data_ana_jd_count, dict_dajd_freq)
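
The counting step can be illustrated on a toy example. This is a minimal sketch in base R, not the project's code: the `descriptions` vector and the `count_jds()` helper are made up for illustration. It assumes the regex word boundary `\\b` is an acceptable substitute for listing variants such as " r | r,| r\\." by hand, so that short entries like "r" or "go" only match as whole words.

```r
# Toy job descriptions (hypothetical data, for illustration only)
descriptions <- c(
  "experience with r, python and sql",
  "strong python skills. go is a plus",
  "we are hiring for our research team",
  "knowledge of r. ms degree preferred"
)

# Hypothetical helper: number of descriptions that mention each term as a whole word
count_jds <- function(terms, text) {
  vapply(terms, function(term) {
    pattern <- paste0("\\b", term, "\\b")
    sum(grepl(pattern, text, perl = TRUE))
  }, integer(1))
}

terms <- c("r", "ms", "go", "python")
jd_count <- count_jds(terms, descriptions)

# Proportion of descriptions that mention each term
jd_percent <- round(jd_count / length(descriptions), 3)
data.frame(skills = terms, jd_count = jd_count, jd_percent = jd_percent, row.names = NULL)
```

Because each description is counted at most once per term, `jd_count` can never exceed the number of descriptions. Terms containing regex metacharacters (for example "c++") would need to be escaped before being pasted into the pattern.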

Table of top skills across both jobs by count, rank within dataset, and number and percent of job descriptions

To facilitate the analysis, let’s join the results from the two datasets and add some comparative metrics:

  • rank_ds and rank_da give the rank of the skill’s frequency among all skills within each dataset
  • avg_rank is the mean of the two ranks and is used to sort the table
  • jd_percent is the share of job descriptions that mention the skill, expressed as a proportion (0.727 means 72.7%)
  • freq_per_jd measures how often a skill appears per job description in which it is mentioned

top_skills_all <- full_join(top_skills_ds, top_skills_da, by="skills") %>% 
  rename(count_ds = count.x, count_da = count.y, rank_ds = rank.x, rank_da = rank.y, jd_count_ds = jd_count.x, jd_count_da = jd_count.y) %>%
  select(skills, count_ds, rank_ds, jd_count_ds, count_da, rank_da, jd_count_da) %>% 
  mutate(avg_rank = (rank_ds + rank_da) / 2,
         jd_percent_ds = round(jd_count_ds / nrow(data_scientists), 3),
         freq_per_jd_ds= round(count_ds/ jd_count_ds, 3),
         jd_percent_da = round(jd_count_da / nrow(data_analysts), 3),
         freq_per_jd_da = round(count_da/ jd_count_da, 3)) %>%
  arrange(avg_rank)

findings_table_all<- top_skills_all %>%
  kbl(caption = "Top Skills") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  scroll_box(height = "400px")

findings_table_all
Top Skills
skills count_ds rank_ds jd_count_ds count_da rank_da jd_count_da avg_rank jd_percent_ds freq_per_jd_ds jd_percent_da freq_per_jd_da
python 5250 1 1750 1275 1 425 1.0 0.727 3.000 0.348 3.000
design 1468 3 1468 627 5 627 4.0 0.610 1.000 0.514 1.000
research 1254 6 1254 785 2 785 4.0 0.521 1.000 0.643 1.000
communication 1168 9 1168 776 3 776 6.0 0.485 1.000 0.636 1.000
statistics 1249 7 1249 466 12 466 9.5 0.519 1.000 0.382 1.000
sql 1172 8 1172 467 11 467 9.5 0.487 1.000 0.382 1.000
computer 1460 4 1460 380 17 380 10.5 0.607 1.000 0.311 1.000
organization 844 16 844 605 6 605 11.0 0.351 1.000 0.495 1.000
analytical 799 18 799 653 4 653 11.0 0.332 1.000 0.535 1.000
r 1118 10 950 401 15 326 12.5 0.395 1.177 0.267 1.230
math 1102 11 1102 403 14 403 12.5 0.458 1.000 0.330 1.000
leader 905 15 905 484 10 484 12.5 0.376 1.000 0.396 1.000
solutions 1056 12 1056 400 16 400 14.0 0.439 1.000 0.328 1.000
quantitative 779 21 779 499 8 499 14.5 0.324 1.000 0.409 1.000
science 1384 5 1384 312 25 312 15.0 0.575 1.000 0.256 1.000
communication skills 766 24 766 488 9 488 16.5 0.318 1.000 0.400 1.000
programming 960 14 960 313 24 313 19.0 0.399 1.000 0.256 1.000
written 684 31 684 514 7 514 19.0 0.284 1.000 0.421 1.000
machine learning 1693 2 1693 207 37 207 19.5 0.704 1.000 0.170 1.000
passion 821 17 821 323 23 323 20.0 0.341 1.000 0.265 1.000
vision 714 27 714 358 20 358 23.5 0.297 1.000 0.293 1.000
database 699 29 699 367 19 367 24.0 0.291 1.000 0.301 1.000
office 506 40 506 463 13 463 26.5 0.210 1.000 0.379 1.000
mathematics 768 23 768 257 32 257 27.5 0.319 1.000 0.210 1.000
leadership 536 36 536 327 21 327 28.5 0.223 1.000 0.268 1.000
presentation 463 44 463 373 18 373 31.0 0.192 1.000 0.305 1.000
years of experience 576 34 576 286 29 286 31.5 0.239 1.000 0.234 1.000
data analysis 532 37 532 304 27 304 32.0 0.221 1.000 0.249 1.000
verbal 481 42 481 324 22 324 32.0 0.200 1.000 0.265 1.000
visualization 578 33 578 243 33 243 33.0 0.240 1.000 0.199 1.000
phd 789 19 789 151 50 151 34.5 0.328 1.000 0.124 1.000
collaborative 511 39 511 284 30 284 34.5 0.212 1.000 0.233 1.000
algorithms 975 13 975 130 57 130 35.0 0.405 1.000 0.106 1.000
sas 445 46 445 309 26 309 36.0 0.185 1.000 0.253 1.000
data sets 566 35 566 206 38 206 36.5 0.235 1.000 0.169 1.000
java 765 25 765 134 54 134 39.5 0.318 1.000 0.110 1.000
innovation 471 43 471 207 36 207 39.5 0.196 1.000 0.170 1.000
git 516 38 516 195 42 195 40.0 0.214 1.000 0.160 1.000
scala 789 20 789 113 66 113 43.0 0.328 1.000 0.093 1.000
creating 451 45 451 196 41 196 43.0 0.187 1.000 0.161 1.000
ml 617 32 617 134 55 134 43.5 0.256 1.000 0.110 1.000
economics 315 57 315 268 31 268 44.0 0.131 1.000 0.219 1.000
big data 695 30 695 120 61 120 45.5 0.289 1.000 0.098 1.000
writing 295 63 295 300 28 300 45.5 0.123 1.000 0.246 1.000
spark 779 22 779 87 76 87 49.0 0.324 1.000 0.071 1.000
hadoop 706 28 706 92 71 92 49.5 0.293 1.000 0.075 1.000
collaboration 329 52 329 154 49 154 50.5 0.137 1.000 0.126 1.000
large data 398 48 398 133 56 133 52.0 0.165 1.000 0.109 1.000
tableau 279 66 279 203 39 203 52.5 0.116 1.000 0.166 1.000
data mining 483 41 483 105 67 105 54.0 0.201 1.000 0.086 1.000
data visualization 312 61 312 162 47 162 54.0 0.130 1.000 0.133 1.000
consulting 274 69 274 182 44 182 56.5 0.114 1.000 0.149 1.000
problem solving 273 70 273 172 46 172 58.0 0.113 1.000 0.141 1.000
data analytics 285 65 285 140 52 140 58.5 0.118 1.000 0.115 1.000
rtable 291 64 291 128 58 128 61.0 0.121 1.000 0.105 1.000
physics 337 51 337 89 73 89 62.0 0.140 1.000 0.073 1.000
interpersonal 201 86 201 198 40 198 63.0 0.084 1.000 0.162 1.000
organizational 172 92 172 224 34 224 63.0 0.071 1.000 0.183 1.000
data engineer 389 50 389 81 78 81 64.0 0.162 1.000 0.066 1.000
flexible 244 75 244 137 53 137 64.0 0.101 1.000 0.112 1.000
ms 223 84 253 173 45 165 64.5 0.105 0.881 0.135 1.048
programming languages 313 59 313 88 74 88 66.5 0.130 1.000 0.072 1.000
microsoft 148 99 148 223 35 223 67.0 0.062 1.000 0.183 1.000
business problems 312 60 312 87 75 87 67.5 0.130 1.000 0.071 1.000
influence 251 73 251 118 63 118 68.0 0.104 1.000 0.097 1.000
matlab 277 68 277 92 72 92 70.0 0.115 1.000 0.075 1.000
bachelor’s degree 164 94 164 149 51 149 72.5 0.068 1.000 0.122 1.000
project management 141 101 141 159 48 159 74.5 0.059 1.000 0.130 1.000
deep learning 432 47 432 35 106 35 76.5 0.180 1.000 0.029 1.000
AI 729 26 NA 25 128 NA 77.0 NA NA NA NA
software development 327 53 327 38 103 38 78.0 0.136 1.000 0.031 1.000
decision making 173 91 173 114 65 114 78.0 0.072 1.000 0.093 1.000
etl 239 76 239 73 82 73 79.0 0.099 1.000 0.060 1.000
data management 161 96 161 120 62 120 79.0 0.067 1.000 0.098 1.000
monitoring 162 95 162 118 64 118 79.5 0.067 1.000 0.097 1.000
powerpoint 84 118 84 190 43 190 80.5 0.035 1.000 0.156 1.000
forecasting 169 93 169 96 69 96 81.0 0.070 1.000 0.079 1.000
data collection 132 102 132 121 60 121 81.0 0.055 1.000 0.099 1.000
statistical modeling 237 77 237 61 86 61 81.5 0.099 1.000 0.050 1.000
natural language 390 49 390 31 115 31 82.0 0.162 1.000 0.025 1.000
masters 201 87 201 85 77 85 82.0 0.084 1.000 0.070 1.000
artificial intelligence 318 55 318 33 110 33 82.5 0.132 1.000 0.027 1.000
large data sets 227 82 227 72 83 72 82.5 0.094 1.000 0.059 1.000
linux 251 74 251 46 95 46 84.5 0.104 1.000 0.038 1.000
machine learning techniques 298 62 298 33 111 33 86.5 0.124 1.000 0.027 1.000
work independently 97 114 97 125 59 125 86.5 0.040 1.000 0.102 1.000
data processing 227 81 227 40 99 40 90.0 0.094 1.000 0.033 1.000
data engineering 194 88 194 50 94 50 91.0 0.081 1.000 0.041 1.000
learning algorithms 315 58 315 26 125 26 91.5 0.131 1.000 0.021 1.000
language processing 320 54 320 25 130 25 92.0 0.133 1.000 0.020 1.000
have experience 157 97 157 58 88 58 92.5 0.065 1.000 0.048 1.000
predictive models 211 85 211 39 102 39 93.5 0.088 1.000 0.032 1.000
software engineers 194 89 194 41 98 41 93.5 0.081 1.000 0.034 1.000
natural language processing 318 56 318 24 133 24 94.5 0.132 1.000 0.020 1.000
creativity 155 98 155 50 93 50 95.5 0.064 1.000 0.041 1.000
Go 107 110 NA 74 81 NA 95.5 NA NA NA NA
sap 112 108 112 53 90 53 99.0 0.047 1.000 0.043 1.000
machine learning algorithms 278 67 278 24 132 24 99.5 0.116 1.000 0.020 1.000
predictive analytics 146 100 146 39 101 39 100.5 0.061 1.000 0.032 1.000
learning models 268 71 268 23 134 23 102.5 0.111 1.000 0.019 1.000
facebook 95 116 95 53 89 53 102.5 0.039 1.000 0.043 1.000
data pipeline 225 83 225 26 123 26 103.0 0.094 1.000 0.021 1.000
critical thinking 62 127 62 77 79 77 103.0 0.026 1.000 0.063 1.000
javascript 127 104 127 37 104 37 104.0 0.053 1.000 0.030 1.000
modeling techniques 122 105 122 36 105 36 105.0 0.051 1.000 0.029 1.000
modelling 102 112 102 39 100 39 106.0 0.042 1.000 0.032 1.000
large scale 235 78 235 22 135 22 106.5 0.098 1.000 0.018 1.000
nosql 232 79 232 22 136 22 107.5 0.096 1.000 0.018 1.000
machine learning models 231 80 231 21 137 21 108.5 0.096 1.000 0.017 1.000
unix 112 109 112 34 109 34 109.0 0.047 1.000 0.028 1.000
array 104 111 104 34 107 34 109.0 0.043 1.000 0.028 1.000
microsoft office 31 151 31 100 68 100 109.5 0.013 1.000 0.082 1.000
data models 116 107 116 31 113 31 110.0 0.048 1.000 0.025 1.000
multi-task 46 141 46 77 80 77 110.5 0.019 1.000 0.063 1.000
nlp 254 72 254 14 150 14 111.0 0.106 1.000 0.011 1.000
market research 29 154 29 95 70 95 112.0 0.012 1.000 0.078 1.000
highly motivated 51 134 51 52 91 52 112.5 0.021 1.000 0.043 1.000
data warehousing 95 115 95 30 116 30 115.5 0.039 1.000 0.025 1.000
coordination 39 147 39 68 84 68 115.5 0.016 1.000 0.056 1.000
mysql 92 117 92 30 117 30 117.0 0.038 1.000 0.025 1.000
data pipelines 179 90 179 16 145 16 117.5 0.074 1.000 0.013 1.000
data systems 78 121 78 31 114 31 117.5 0.032 1.000 0.025 1.000
bachelors 46 139 46 41 96 41 117.5 0.019 1.000 0.034 1.000
solving problems 77 122 77 30 118 30 120.0 0.032 1.000 0.025 1.000
azure 127 103 127 19 138 19 120.5 0.053 1.000 0.016 1.000
time management 38 149 38 51 92 51 120.5 0.016 1.000 0.042 1.000
ecommerce 80 120 80 26 124 26 122.0 0.033 1.000 0.021 1.000
data manipulation 51 133 51 31 112 31 122.5 0.021 1.000 0.025 1.000
vba 19 161 19 60 87 60 124.0 0.008 1.000 0.049 1.000
data integration 59 129 59 28 120 28 124.5 0.025 1.000 0.023 1.000
google analytics 30 153 30 41 97 41 125.0 0.012 1.000 0.034 1.000
data extraction 75 123 75 25 129 25 126.0 0.031 1.000 0.020 1.000
microsoft excel 15 167 15 68 85 68 126.0 0.006 1.000 0.056 1.000
troubleshooting 40 145 40 34 108 34 126.5 0.017 1.000 0.028 1.000
data architecture 120 106 120 13 151 13 128.5 0.050 1.000 0.011 1.000
ruby 98 113 98 17 144 17 128.5 0.041 1.000 0.014 1.000
masters degree 49 135 49 27 122 27 128.5 0.020 1.000 0.022 1.000
speaking 47 137 47 26 127 26 132.0 0.020 1.000 0.021 1.000
youtube 40 146 40 29 119 29 132.5 0.017 1.000 0.024 1.000
methodological 34 150 34 26 126 26 138.0 0.014 1.000 0.021 1.000
data gathering 39 148 39 24 131 24 139.5 0.016 1.000 0.020 1.000
mongodb 75 124 75 10 160 10 142.0 0.031 1.000 0.008 1.000
manage multiple projects 16 165 16 28 121 28 143.0 0.007 1.000 0.023 1.000
postgresql 56 132 56 11 155 11 143.5 0.023 1.000 0.009 1.000
language understanding 83 119 83 5 171 5 145.0 0.034 1.000 0.004 1.000
github 57 131 57 10 159 10 145.0 0.024 1.000 0.008 1.000
elasticsearch 47 136 47 11 154 11 145.0 0.020 1.000 0.009 1.000
apache spark 65 125 65 6 167 6 146.0 0.027 1.000 0.005 1.000
bash 45 142 45 12 152 12 147.0 0.019 1.000 0.010 1.000
data insights 27 156 27 19 140 19 148.0 0.011 1.000 0.016 1.000
natural language understanding 64 126 64 4 178 4 152.0 0.027 1.000 0.003 1.000
doctorate degree 17 164 17 19 141 19 152.5 0.007 1.000 0.016 1.000
negotiation 21 159 21 15 147 15 153.0 0.009 1.000 0.012 1.000
network analysis 28 155 28 12 153 12 154.0 0.012 1.000 0.010 1.000
kpmg 60 128 60 3 184 3 156.0 0.025 1.000 0.002 1.000
shell script 40 144 40 6 169 6 156.5 0.017 1.000 0.005 1.000
highly organized 10 170 10 17 143 17 156.5 0.004 1.000 0.014 1.000
data entry 8 176 8 19 139 19 157.5 0.003 1.000 0.016 1.000
data preparation 22 158 22 8 161 8 159.5 0.009 1.000 0.007 1.000
data reporting 8 178 8 15 146 15 162.0 0.003 1.000 0.012 1.000
strategic thinking 13 169 13 11 157 11 163.0 0.005 1.000 0.009 1.000
microsoft word 6 185 6 18 142 18 163.5 0.002 1.000 0.015 1.000
analytics data 15 166 15 7 163 7 164.5 0.006 1.000 0.006 1.000
jupyter notebook 18 162 18 5 170 5 166.0 0.007 1.000 0.004 1.000
sales and marketing 19 160 19 5 173 5 166.5 0.008 1.000 0.004 1.000
microstrategy 18 163 18 5 172 5 167.5 0.007 1.000 0.004 1.000
9 172 NA 6 166 NA 169.0 NA NA NA NA
nlu 40 143 40 2 199 2 171.0 0.017 1.000 0.002 1.000
data transfer 6 184 6 8 162 8 173.0 0.002 1.000 0.007 1.000
english language 5 188 5 10 158 10 173.0 0.002 1.000 0.008 1.000
service orientation 5 191 5 11 156 11 173.5 0.002 1.000 0.009 1.000
symantec 9 174 9 5 174 5 174.0 0.004 1.000 0.004 1.000
grammatical 9 173 9 4 176 4 174.5 0.004 1.000 0.003 1.000
systems analysis 3 202 3 15 148 15 175.0 0.001 1.000 0.012 1.000
microsoft azure 23 157 23 2 198 2 177.5 0.010 1.000 0.002 1.000
data mapping 8 177 8 3 182 3 179.5 0.003 1.000 0.002 1.000
telecommunications 9 175 9 3 187 3 181.0 0.004 1.000 0.002 1.000
django 30 152 30 1 211 1 181.5 0.012 1.000 0.001 1.000
amazon redshift 6 183 6 3 180 3 181.5 0.002 1.000 0.002 1.000
microsoft access 3 199 3 7 165 7 182.0 0.001 1.000 0.006 1.000
microsoft powerpoint 2 215 2 14 149 14 182.0 0.001 1.000 0.011 1.000
complex problem solving 5 187 5 3 181 3 184.0 0.002 1.000 0.002 1.000
active learning 7 180 7 2 189 2 184.5 0.003 1.000 0.002 1.000
data interpretation 3 197 3 4 175 4 186.0 0.001 1.000 0.003 1.000
apache hadoop 7 181 7 2 192 2 186.5 0.003 1.000 0.002 1.000
client management 2 209 2 7 164 7 186.5 0.001 1.000 0.006 1.000
confluence 7 182 7 2 194 2 188.0 0.003 1.000 0.002 1.000
minitab 3 200 3 4 177 4 188.5 0.001 1.000 0.003 1.000
machine learning data 13 168 13 1 220 1 194.0 0.005 1.000 0.001 1.000
swift 6 186 6 2 202 2 194.0 0.002 1.000 0.002 1.000
jquery 10 171 10 1 218 1 194.5 0.004 1.000 0.001 1.000
eko 8 179 8 1 212 1 195.5 0.003 1.000 0.001 1.000
work well in a team 3 205 3 3 188 3 196.5 0.001 1.000 0.002 1.000
operations analysis 4 194 4 2 200 2 197.0 0.002 1.000 0.002 1.000
see the big picture 4 195 4 2 201 2 198.0 0.002 1.000 0.002 1.000
apache kafka 4 193 4 1 204 1 198.5 0.002 1.000 0.001 1.000
microsoft outlook 2 214 2 3 186 3 200.0 0.001 1.000 0.002 1.000
clerical 2 208 2 2 193 2 200.5 0.001 1.000 0.002 1.000
ibm db2 5 189 5 1 217 1 203.0 0.002 1.000 0.001 1.000
report creation 1 231 1 4 179 4 205.0 0.000 1.000 0.003 1.000
experience in information technology 3 198 3 1 214 1 206.0 0.001 1.000 0.001 1.000
microsoft sql server 5 190 5 1 224 1 207.0 0.002 1.000 0.001 1.000
data cleanup 2 210 2 1 206 1 208.0 0.001 1.000 0.001 1.000
data organization 2 211 2 1 207 1 209.0 0.001 1.000 0.001 1.000
unix shell 5 192 5 1 236 1 214.0 0.002 1.000 0.001 1.000
google adwords 2 212 2 1 216 1 214.0 0.001 1.000 0.001 1.000
design development 1 218 1 1 210 1 214.0 0.000 1.000 0.001 1.000
skype 3 201 3 1 231 1 216.0 0.001 1.000 0.001 1.000
mathematical reasoning 2 213 2 1 221 1 217.0 0.001 1.000 0.001 1.000
filemaker pro 1 220 1 1 215 1 217.5 0.000 1.000 0.001 1.000
wireshark 1 233 1 2 203 2 218.0 0.000 1.000 0.002 1.000
technology design 3 203 3 1 234 1 218.5 0.001 1.000 0.001 1.000
ubuntu 3 204 3 1 235 1 219.5 0.001 1.000 0.001 1.000
judgment and decision making 1 222 1 1 219 1 220.5 0.000 1.000 0.001 1.000
microsoft dynamics 1 223 1 1 222 1 222.5 0.000 1.000 0.001 1.000
microsoft windows server 1 224 1 1 226 1 225.0 0.000 1.000 0.001 1.000
oracle hyperion 1 225 1 1 227 1 226.0 0.000 1.000 0.001 1.000
organizational management 1 227 1 1 228 1 227.5 0.000 1.000 0.001 1.000
reading comprehension 1 230 1 1 230 1 230.0 0.000 1.000 0.001 1.000
big data architecture 57 130 57 NA NA NA NA 0.024 1.000 NA NA
architecture capabilities 46 138 46 NA NA NA NA 0.019 1.000 NA NA
covering technologies 46 140 46 NA NA NA NA 0.019 1.000 NA NA
apache hive 3 196 3 NA NA NA NA 0.001 1.000 NA NA
active listening 2 206 2 NA NA NA NA 0.001 1.000 NA NA
citrix 2 207 2 NA NA NA NA 0.001 1.000 NA NA
amazon dynamodb 1 216 1 NA NA NA NA 0.000 1.000 NA NA
bring creativity 1 217 1 NA NA NA NA 0.000 1.000 NA NA
engineering and technology 1 219 1 NA NA NA NA 0.000 1.000 NA NA
ibm infosphere datastage 1 221 1 NA NA NA NA 0.000 1.000 NA NA
oracle java 1 226 1 NA NA NA NA 0.000 1.000 NA NA
prepare data for analysis 1 228 1 NA NA NA NA 0.000 1.000 NA NA
quality control analysis 1 229 1 NA NA NA NA 0.000 1.000 NA NA
teradata database 1 232 1 NA NA NA NA 0.000 1.000 NA NA
microsoft project NA NA NA 6 168 6 NA NA NA 0.005 1.000
data storytelling NA NA NA 3 183 3 NA NA NA 0.002 1.000
lexisnexis NA NA NA 3 185 3 NA NA NA 0.002 1.000
administration and management NA NA NA 2 190 2 NA NA NA 0.002 1.000
ajax NA NA NA 2 191 2 NA NA NA 0.002 1.000
experience in market research NA NA NA 2 195 2 NA NA NA 0.002 1.000
google docs NA NA NA 2 196 2 NA NA NA 0.002 1.000
mcafee NA NA NA 2 197 2 NA NA NA 0.002 1.000
apache tomcat NA NA NA 1 205 1 NA NA NA 0.001 1.000
datadriven NA NA NA 1 208 1 NA NA NA 0.001 1.000
deductive reasoning NA NA NA 1 209 1 NA NA NA 0.001 1.000
epic systems NA NA NA 1 213 1 NA NA NA 0.001 1.000
microsoft sharepoint NA NA NA 1 223 1 NA NA NA 0.001 1.000
microsoft sql server reporting services NA NA NA 1 225 1 NA NA NA 0.001 1.000
processing information NA NA NA 1 229 1 NA NA NA 0.001 1.000
systems evaluation NA NA NA 1 232 1 NA NA NA 0.001 1.000
tax software NA NA NA 1 233 1 NA NA NA 0.001 1.000
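
To make the metrics in the table concrete, here is a small worked example on made-up numbers. It is a sketch in base R rather than the dplyr pipeline used for the table; the two toy tables and the totals `n_ds` and `n_da` are assumptions for illustration. It also shows why skills found in only one dataset have `NA` for `avg_rank` and sort to the bottom: the full join leaves their missing side as `NA`.

```r
# Toy per-dataset results (made-up numbers, for illustration only)
toy_ds <- data.frame(skills = c("python", "sql", "spark"),
                     count = c(30, 12, 8), rank = c(1, 2, 3), jd_count = c(10, 12, 8),
                     stringsAsFactors = FALSE)
toy_da <- data.frame(skills = c("python", "sql", "excel"),
                     count = c(6, 9, 5), rank = c(2, 1, 3), jd_count = c(3, 9, 5),
                     stringsAsFactors = FALSE)
n_ds <- 20  # assumed number of data scientist job descriptions
n_da <- 15  # assumed number of data analyst job descriptions

# Full outer join on skill name, keeping skills that appear in only one dataset
toy_all <- merge(toy_ds, toy_da, by = "skills", all = TRUE, suffixes = c("_ds", "_da"))

toy_all$avg_rank       <- (toy_all$rank_ds + toy_all$rank_da) / 2
toy_all$jd_percent_ds  <- round(toy_all$jd_count_ds / n_ds, 3)
toy_all$freq_per_jd_ds <- round(toy_all$count_ds / toy_all$jd_count_ds, 3)
toy_all$jd_percent_da  <- round(toy_all$jd_count_da / n_da, 3)
toy_all$freq_per_jd_da <- round(toy_all$count_da / toy_all$jd_count_da, 3)

# Rows with NA avg_rank (skills missing from one dataset) sort to the end
toy_all <- toy_all[order(toy_all$avg_rank), ]
toy_all
```

In this toy data "python" is mentioned 30 times across 10 data scientist descriptions, so its `freq_per_jd_ds` is 3, while "spark" and "excel" each appear in only one dataset and therefore have no average rank.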
4. Analyze our survey

Analyzing our Survey

To find out which data science skills people consider most important, our team wrote a survey and distributed it to our peers, colleagues, friends, and family. We received 32 responses. Each respondent was asked to name three skills: the most important, second most important, and third most important skill for a data scientist.

To analyze the survey data, we used n-gram analysis. Because the survey answers are much shorter and simpler than the job descriptions, we did not use the word_catalog; instead, we relied on our own judgment to pull the relevant data skills out of the n-gram results.

Loading Data

survey <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/survey-final.csv?token=AKGJZWL6KFYOLXBQ5XWZ7OLANHUIQ")

skills_only <- select(survey, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))
skills_only<-skills_only %>% rename(
  first = What.is.the.most.important.skill.for.a.data.scientist.,
  second = What.is.the.second.most.important.skill.for.a.data.scientist.,
  third = What.is.the.third.most.important.skill.for.a.data.scientist.
)

Exploratory Word Cloud

To get an initial look at the data, all the skills were collected into one data frame, and the wordcloud library was used to generate a word cloud of all the words survey participants submitted.

all <- pivot_longer(skills_only, 1:3)
corpus4 <- VCorpus(VectorSource(all$value))
corpus4 <- tm_map(corpus4, removePunctuation)
corpus4 <- tm_map(corpus4, content_transformer(tolower))
corpus4 <- tm_map(corpus4, removeNumbers)
corpus4 <- tm_map(corpus4, removeWords, stopwords_en)
corpus4 <- tm_map(corpus4, stripWhitespace)
wordcloud(corpus4, max.words = 50, colors = wes_palette(name = "Zissou1"))

Stemming

To further analyze the survey, stemming was used. In addition to removing stopwords_en, the words “etc”, “eg”, “skill”, “skills”, “ability”, and “abilities” were removed, since those aren’t standalone skills. The stem completion list was formulated by hand based on the generated list of stems.

# Lowercase before removing words: removeWords() is case-sensitive
no_punc <- tolower(removePunctuation(all$value))
fewer_words <- removeWords(no_punc, c(stopwords_en, "etc", "eg", "skill", "skills", "ability", "abilities"))
# Split into single words and drop the empty strings left where words were removed
unlisted <- unlist(strsplit(fewer_words, split = "\\s+"))
unlisted <- unlisted[unlisted != ""]
stems <- stemDocument(unlisted)
stem_corpus <- VCorpus(VectorSource(stems))
wordcloud(stem_corpus, max.words = 50, colors = wes_palette(name = "Zissou1"))

test_complete <- c("statistics", "visualization", "programming", "database", "analytics", "software", "solving", "thinking",  "communication", "code", "machine", "learning", "modeling", "munging", "interpretation", "recognition", "computer", "aptitude", "knowledge", "technical", "storytelling", "collaboration", "analysis", "data", "sql", "creativity", "python", "business")


test <- stemCompletion(stems, dictionary=test_complete)
testcorp <- VCorpus(VectorSource(test))
wordcloud(testcorp, max.words = 50, colors = wes_palette(name = "Zissou1"))
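
The idea behind stem completion can be shown with a few lines of base R. This is only an illustration of prefix matching against a hand-built dictionary, not how `tm::stemCompletion()` is implemented (that function offers several completion strategies through its `type` argument). The helper `complete_stems()` and the toy vectors are hypothetical.

```r
# Hypothetical helper: complete each stem with the first dictionary word it is a prefix of
complete_stems <- function(stems, dictionary) {
  vapply(stems, function(stem) {
    hits <- dictionary[startsWith(dictionary, stem)]
    if (length(hits) > 0) hits[1] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

# Toy stems and a toy hand-written dictionary (for illustration only)
toy_stems <- c("statist", "visual", "program", "communic", "python", "xyz")
toy_dictionary <- c("statistics", "visualization", "programming", "communication", "python")

completed <- complete_stems(toy_stems, toy_dictionary)
completed
```

Stems with no match in the dictionary (here "xyz") come back as `NA`, which is one reason the dictionary had to be built by hand from the stems that actually occurred.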

N Gram Analysis

Next, the frequencies of the ten most common unigrams, bigrams, and trigrams across the entire data set were analyzed.

#Unigrams
unigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE) }
unigram <- TermDocumentMatrix(corpus4, control = list(wordLengths = c(1, 20)))


#Bigrams
bigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) }
bigram <- TermDocumentMatrix(corpus4, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))


#Trigrams
trigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE) }
trigram <- TermDocumentMatrix(corpus4, control = list(wordLengths = c(3, 60),tokenize = trigramTokenizer))
unigramrow <- sort(slam::row_sums(unigram), decreasing=T)
unigramfreq <- data.table(tok = names(unigramrow), freq = unigramrow)

ggplot(unigramfreq[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Unigrams") +labs(x = "", y = "")

#Bigrams

bigramrow <- sort(slam::row_sums(bigram), decreasing=T)
bigramfreq <- data.table(tok = names(bigramrow), freq = bigramrow)

ggplot(bigramfreq[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Bigrams") +labs(x = "", y = "")

#Trigrams
trigramrow <- sort(slam::row_sums(trigram), decreasing=T)
trigramfreq <- data.table(tok = names(trigramrow), freq = trigramrow)

ggplot(trigramfreq[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Trigrams") +labs(x = "", y = "")
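
For readers unfamiliar with n-grams, the following base R sketch shows what the tokenizers above produce: every run of n consecutive words in a response. The helper `make_ngrams()` and the toy responses are hypothetical and are not part of the tm pipeline used in this report.

```r
# Hypothetical helper: all runs of n consecutive words in one response
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), character(1))
}

# Toy survey answers (made up for illustration)
responses <- c("machine learning", "statistics and machine learning", "data visualization")

# Collect the bigrams of every response and count how often each occurs
bigrams <- unlist(lapply(responses, make_ngrams, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 10)
```

A response with fewer than n words contributes no n-grams, which is why short survey answers yield few bigrams and even fewer trigrams.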

N Gram Analysis, With Filtering

To see how the responses differ between groups of respondents, the unigram and bigram frequencies were run again on filtered sections of the data set.

Filtering By Field and Occupation

First, the data set was filtered both by field and by occupation. The first group was respondents who worked full time in either computer science or data science. The second group was everyone else in those two fields: students, teachers or professors, and respondents who listed their occupation as “Other”.

industry <- filter(survey, Occupation == "Full-Time Work")
industry <- filter(industry, Field == "Data Science" | Field == "Computer Science")
not_industry <- filter(survey, Occupation == "Graduate Student (Full Time)" | Occupation == "Teacher / Professor" | Occupation == "High School Student" | Occupation == "Undergraduate" | Occupation == "Other")
not_industry <- filter(not_industry, Field == "Data Science" | Field == "Computer Science")

industry <- select(industry, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))

industry<- industry %>% rename(
  first = What.is.the.most.important.skill.for.a.data.scientist.,
  second = What.is.the.second.most.important.skill.for.a.data.scientist.,
  third = What.is.the.third.most.important.skill.for.a.data.scientist.
)

not_industry <- select(not_industry, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))

not_industry<-not_industry %>% rename(
  first = What.is.the.most.important.skill.for.a.data.scientist.,
  second = What.is.the.second.most.important.skill.for.a.data.scientist.,
  third = What.is.the.third.most.important.skill.for.a.data.scientist.
)
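
The same filter, select, and rename steps are repeated for every subgroup. One way to reduce the repetition is a small helper, sketched here in base R on made-up rows. The helper `skills_for()`, the toy data frame, and its short column names are assumptions for illustration; the real survey uses the long question text as column names.

```r
# Toy survey (made-up rows; the real column names are much longer)
toy_survey <- data.frame(
  Occupation = c("Full-Time Work", "Undergraduate", "Full-Time Work", "Other"),
  Field      = c("Data Science", "Computer Science", "Other STEM Field", "Data Science"),
  first      = c("python", "statistics", "excel", "sql"),
  second     = c("sql", "r", "communication", "python"),
  third      = c("communication", "curiosity", "writing", "statistics"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows matching the given fields and occupations,
# then return one skill per row (the same shape pivot_longer() produces)
skills_for <- function(data, fields, occupations = unique(data$Occupation)) {
  keep <- data$Field %in% fields & data$Occupation %in% occupations
  subset_rows <- data[keep, c("first", "second", "third")]
  data.frame(
    name  = rep(c("first", "second", "third"), times = nrow(subset_rows)),
    value = as.vector(t(as.matrix(subset_rows))),
    stringsAsFactors = FALSE
  )
}

cs_ds <- c("Data Science", "Computer Science")
industry_toy     <- skills_for(toy_survey, cs_ds, "Full-Time Work")
not_industry_toy <- skills_for(toy_survey, cs_ds,
                               setdiff(unique(toy_survey$Occupation), "Full-Time Work"))
```

Leaving `occupations` at its default keeps every occupation, so the same helper would also cover a comparison that filters by field alone.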

Initial Word Cloud Visualization

industry <- pivot_longer(industry, 1:3)
corpus_industry <- VCorpus(VectorSource(industry$value))
corpus_industry <- tm_map(corpus_industry, removePunctuation)
corpus_industry <- tm_map(corpus_industry, content_transformer(tolower))
corpus_industry <- tm_map(corpus_industry, removeNumbers)
corpus_industry <- tm_map(corpus_industry, removeWords, c(stopwords_en, "eg", "etc"))
corpus_industry <- tm_map(corpus_industry, stripWhitespace)
wordcloud(corpus_industry, max.words = 50, colors = wes_palette(name = "Zissou1"))

not_industry <- pivot_longer(not_industry, 1:3)
corpus_not_industry <- VCorpus(VectorSource(not_industry$value))
corpus_not_industry <- tm_map(corpus_not_industry, removePunctuation)
corpus_not_industry <- tm_map(corpus_not_industry, content_transformer(tolower))
corpus_not_industry <- tm_map(corpus_not_industry, removeNumbers)
corpus_not_industry <- tm_map(corpus_not_industry, removeWords, c(stopwords_en, "etc", "eg"))
corpus_not_industry <- tm_map(corpus_not_industry, stripWhitespace)
wordcloud(corpus_not_industry, max.words = 50, colors = wes_palette(name = "Zissou1"))

N Gram Analysis

unigram_ind <- TermDocumentMatrix(corpus_industry, control = list(wordLengths = c(1, 20)))

unigramrow_ind <- sort(slam::row_sums(unigram_ind), decreasing=T)
unigramfreq_ind <- data.table(tok = names(unigramrow_ind), freq = unigramrow_ind)

ggplot(unigramfreq_ind[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Unigrams - Computer Science and Data Science Full-Time Workers") +labs(x = "", y = "")

unigram_not <- TermDocumentMatrix(corpus_not_industry, control = list(wordLengths = c(1, 20)))
unigramrow_not <- sort(slam::row_sums(unigram_not), decreasing=T)
unigramfreq_not <- data.table(tok = names(unigramrow_not), freq = unigramrow_not)

ggplot(unigramfreq_not[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Unigrams - Computer Science and Data Science Professors and Students") +labs(x = "", y = "")

bigram_ind <- TermDocumentMatrix(corpus_industry, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
bigramrow_ind <- sort(slam::row_sums(bigram_ind), decreasing=T)
bigramfreq_ind <- data.table(tok = names(bigramrow_ind), freq = bigramrow_ind)

ggplot(bigramfreq_ind[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Bigrams - Computer Science and Data Science Full-Time Workers") +labs(x = "", y = "")

bigram_not <- TermDocumentMatrix(corpus_not_industry, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
bigramrow_not <- sort(slam::row_sums(bigram_not), decreasing=T)
bigramfreq_not <- data.table(tok = names(bigramrow_not), freq = bigramrow_not)

ggplot(bigramfreq_not[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Bigrams - Computer Science and Data Science Professors and Students") +labs(x = "", y = "")

Filtering Only By Field

The second comparison filtered only by field. The first group was respondents in data science or computer science, and the second group was respondents in any other field (other STEM or non-STEM).

csds <- filter(survey, Field == "Data Science" | Field == "Computer Science")

not_csds <- filter(survey, Field == "Other STEM Field" | Field == "Other Non-Stem Field")

csds <- select(csds, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))

csds<- csds %>% rename(
  first = What.is.the.most.important.skill.for.a.data.scientist.,
  second = What.is.the.second.most.important.skill.for.a.data.scientist.,
  third = What.is.the.third.most.important.skill.for.a.data.scientist.
)

not_csds <- select(not_csds, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))

not_csds<-not_csds %>% rename(
  first = What.is.the.most.important.skill.for.a.data.scientist.,
  second = What.is.the.second.most.important.skill.for.a.data.scientist.,
  third = What.is.the.third.most.important.skill.for.a.data.scientist.
)

Word Clouds

csds <- pivot_longer(csds, 1:3)
corpus_csds <- VCorpus(VectorSource(csds$value))
corpus_csds <- tm_map(corpus_csds, removePunctuation)
corpus_csds <- tm_map(corpus_csds, content_transformer(tolower))
corpus_csds <- tm_map(corpus_csds, removeNumbers)
corpus_csds <- tm_map(corpus_csds, removeWords, c(stopwords_en, "eg", "etc"))
corpus_csds <- tm_map(corpus_csds, stripWhitespace)
wordcloud(corpus_csds, max.words = 50, colors = wes_palette(name = "Zissou1"))

not_csds <- pivot_longer(not_csds, 1:3)
corpus_not_csds <- VCorpus(VectorSource(not_csds$value))
corpus_not_csds <- tm_map(corpus_not_csds, removePunctuation)
corpus_not_csds <- tm_map(corpus_not_csds, content_transformer(tolower))
corpus_not_csds <- tm_map(corpus_not_csds, removeNumbers)
corpus_not_csds <- tm_map(corpus_not_csds, removeWords, c(stopwords_en, "eg", "etc"))
corpus_not_csds <- tm_map(corpus_not_csds, stripWhitespace)
wordcloud(corpus_not_csds, max.words = 50, colors = wes_palette(name = "Zissou1"))

N Gram Analysis

unigram_csds <- TermDocumentMatrix(corpus_csds, control = list(wordLengths = c(1, 20)))

unigramrow_csds <- sort(slam::row_sums(unigram_csds), decreasing=T)
unigramfreq_csds <- data.table(tok = names(unigramrow_csds), freq = unigramrow_csds)

ggplot(unigramfreq_csds[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Unigrams - Computer Science and Data Science Fields") +labs(x = "", y = "")

# Same unigram counts for the respondents in other fields
unigram_not_csds <- TermDocumentMatrix(corpus_not_csds, control = list(wordLengths = c(1, 20)))

unigramrow_not_csds <- sort(slam::row_sums(unigram_not_csds), decreasing = TRUE)
unigramfreq_not_csds <- data.table(tok = names(unigramrow_not_csds), freq = unigramrow_not_csds)

ggplot(unigramfreq_not_csds[1:10, ], aes(x = reorder(tok, freq), y = freq)) +
  geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) +
  coord_flip() + theme_bw() +
  ggtitle("Top 10 Unigrams - Other Fields") + labs(x = "", y = "")

# Repeat the counts for two-word phrases; bigramTokenizer must already be defined
bigram_csds <- TermDocumentMatrix(corpus_csds, control = list(wordLengths = c(3, 40), tokenize = bigramTokenizer))
bigramrow_csds <- sort(slam::row_sums(bigram_csds), decreasing = TRUE)
bigramfreq_csds <- data.table(tok = names(bigramrow_csds), freq = bigramrow_csds)

ggplot(bigramfreq_csds[1:10, ], aes(x = reorder(tok, freq), y = freq)) +
  geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) +
  coord_flip() + theme_bw() +
  ggtitle("Top 10 Bigrams - Computer Science and Data Science") + labs(x = "", y = "")

# Same bigram counts for the respondents in other fields
bigram_not_csds <- TermDocumentMatrix(corpus_not_csds, control = list(wordLengths = c(3, 40), tokenize = bigramTokenizer))
bigramrow_not_csds <- sort(slam::row_sums(bigram_not_csds), decreasing = TRUE)
bigramfreq_not_csds <- data.table(tok = names(bigramrow_not_csds), freq = bigramrow_not_csds)

ggplot(bigramfreq_not_csds[1:10, ], aes(x = reorder(tok, freq), y = freq)) +
  geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) +
  coord_flip() + theme_bw() +
  ggtitle("Top 10 Bigrams - Other Fields") + labs(x = "", y = "")
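
The bigram counts above rely on a `bigramTokenizer` that is defined outside this section. As a rough illustration of what a bigram count involves, the base R sketch below pairs each word with the one that follows it and tallies the pairs. The helper `make_bigrams` and the sample answers are hypothetical, and this is not the tokenizer used by the project.

```r
# Split one answer into adjacent word pairs (bigrams).
make_bigrams <- function(text) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))  # a single word has no bigram
  paste(head(words, -1), tail(words, -1))
}

# Hypothetical cleaned survey answers
responses <- c("programming skills", "analytical skills", "strong programming skills")

# Tally every bigram across all answers, most frequent first
bigram_counts <- sort(table(unlist(lapply(responses, make_bigrams))), decreasing = TRUE)
print(bigram_counts)
```

With short answers like these, most bigrams occur only once, which is consistent with the sparse bigram charts produced from a small survey.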

Conclusions

The top broad skills identified by the survey were “programming skills” and “analytical skills”. “Statistics”, “R”, and “Python” were top skills identified by respondents working full-time in either computer science or data science. Few bigrams recurred in the filtered data sets, most likely because the survey sample was small. Interestingly, two survey respondents in the data science / computer science group identified “business knowledge” as a top skill.

Overall, most of the skills identified by the survey were either broad answers, such as “programming skills” or “analytical skills”, or specific programming languages such as Python, R, and SQL. A few respondents also identified abstract skills, such as creativity. While the answers of respondents working or studying in data science or computer science were more specific, more survey responses are needed before drawing any broader conclusions.

Findings

Top 20 Skills by Total Count and by Percentage of Job Descriptions, for Data Scientist and Data Analyst

From our Kaggle datasets, let’s look at the top 20 skills for each position, data scientist and data analyst, both by their raw number of mentions within the dataset and by the percentage of job descriptions in which they appear.

top20_countds <- top_skills_all %>% slice_max(order_by=count_ds, n=20)

ggplot(top20_countds, aes(x=reorder(skills, count_ds), y=count_ds)) +
  geom_col(fill="coral") + coord_flip() + labs(
    title = "Overall No. of Mentions/ Skill",
    subtitle = "Data Scientist",
    x = "Skill",
    y = "Mentions",
    caption= "Kaggle"
  )

top20_jdpercentds <- top_skills_all %>% slice_max(order_by=jd_percent_ds, n=20)

ggplot(top20_jdpercentds, aes(x=reorder(skills, jd_percent_ds), y=jd_percent_ds)) +
  geom_col(fill="coral") + coord_flip() + labs(
    title = "Percent of Job Descriptions Mentioning Skill",
    subtitle = "Data Scientist",
    x = "Skill",
    y = "% of Job Descriptions",
    caption= "Kaggle"
  )

top20_countda <- top_skills_all %>% slice_max(order_by=count_da, n=20)

ggplot(top20_countda, aes(x=reorder(skills, count_da), y=count_da)) +
  geom_col(fill="blue") + coord_flip() + labs(
    title = "Overall No. of Mentions/ Skill",
    subtitle = "Data Analyst",
    x = "Skill",
    y = "Mentions",
    caption= "Kaggle"
  )

top20_jdpercentda <- top_skills_all %>% slice_max(order_by=jd_percent_da, n=20)

ggplot(top20_jdpercentda, aes(x=reorder(skills, jd_percent_da), y=jd_percent_da)) +
  geom_col(fill="blue") + coord_flip() + labs(
    title = "Percent of Job Descriptions Mentioning Skill",
    subtitle = "Data Analyst", 
    x = "Skill",
    y = "% of Job Descriptions",
    caption= "Kaggle"
  )
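
Each of the four selections above uses `slice_max()`, which keeps tied rows by default (`with_ties = TRUE`), so a "top 20" can contain more than 20 rows when values tie at the cutoff. The base R sketch below mimics that behaviour on five rows copied from the Data Scientist table. The helper `top_n_with_ties` is hypothetical, and it assumes the ranking column has no missing values.

```r
# Keep the n largest rows of `col`, plus any rows tied with the n-th value.
# Assumes df[[col]] contains no NA values.
top_n_with_ties <- function(df, col, n) {
  ordered <- df[order(-df[[col]]), , drop = FALSE]
  if (nrow(ordered) <= n) return(ordered)
  cutoff <- ordered[[col]][n]
  ordered[ordered[[col]] >= cutoff, , drop = FALSE]
}

# Five rows copied from the Data Scientist counts
toy <- data.frame(
  skills   = c("phd", "scala", "quantitative", "spark", "analytical"),
  count_ds = c(789, 789, 779, 779, 799)
)

top3 <- top_n_with_ties(toy, "count_ds", 3)  # no tie at the cutoff
top4 <- top_n_with_ties(toy, "count_ds", 4)  # tie at 779 pulls in a fifth row
print(top3)
print(top4)
```

If exactly 20 bars are wanted regardless of ties, `slice_max()` accepts `with_ties = FALSE`.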

Python: One Skill to Rule them All?

First, we can see that while Python is far and away the skill with the most overall mentions for both positions, this is driven in large part by its being mentioned repeatedly within each job description in which it appears, whereas most other skills are counted once per description.
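
The arithmetic behind this observation is mentions divided by the number of job descriptions that contain the skill. The sketch below reproduces it for three rows copied from the Data Scientist columns of the findings table. The formula is inferred from the printed `freq_per_jd_ds` values, not taken from the project's code, so treat it as an assumption.

```r
# Three rows copied from the Data Scientist columns of the findings table
skills <- data.frame(
  skills      = c("python", "machine learning", "r"),
  count_ds    = c(5250, 1693, 1118),   # total mentions
  jd_count_ds = c(1750, 1693, 950)     # job descriptions mentioning the skill
)

# Assumed definition: average mentions per description that mentions the skill
skills$freq_per_jd_ds <- round(skills$count_ds / skills$jd_count_ds, 3)
print(skills)
```

Under this definition Python averages three mentions per description, machine learning one, and R slightly more than one, which matches the values shown in the table.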

Looking at the percentage of job descriptions that mention Python, we can see that its dominance over other skills in the Data Scientist job descriptions is less pronounced, and that among the Data Analyst job descriptions it falls to 13th place.
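
The rank of 13 can be checked by ordering skills on `jd_percent_da`. The sketch below uses the values printed in the findings table for the skills at or near the top of the Data Analyst list. The vector is hand-copied and covers only those rows, and ties are given the smallest shared rank, which is an assumption about how ties should be treated.

```r
# Share of Data Analyst job descriptions mentioning each skill (from the table)
jd_percent_da <- c(
  research = 0.643, communication = 0.636, analytical = 0.535,
  design = 0.514, organization = 0.495, written = 0.421,
  quantitative = 0.409, "communication skills" = 0.400, leader = 0.396,
  statistics = 0.382, sql = 0.382, office = 0.379,
  python = 0.348, math = 0.330
)

# Rank from highest share to lowest; tied skills share the lower rank number
rank_da <- rank(-jd_percent_da, ties.method = "min")
python_rank <- rank_da[["python"]]
print(python_rank)
```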

Scientist vs. Analyst

Looking at the other skills that round out each position’s top 20, we can draw another clear conclusion, one that we expect connects directly to the relative importance each role gives to Python: the second most frequently requested skill among Data Scientist job descriptions is machine learning, while Data Analyst descriptions do not include it among their 20 most requested skills.

While both roles mention research and SQL with similar frequency, a clear delineation emerges between the two. In addition to machine learning and Python, Data Scientist job descriptions are more likely to require knowledge of statistics, mathematics, and R, and to prefer candidates who have completed a Ph.D., indicating a desire for deeper subject matter expertise in these areas. Data Analyst roles, on the other hand, place more emphasis on softer skills (communication, vision, leadership, and organization) and may thus be a better entry point for working in the field.
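
One way to make this comparison concrete is to subtract the Data Analyst share from the Data Scientist share for each skill. The sketch below does so for a hand-copied subset of rows from the combined findings table. The `gap` column is a hypothetical helper, not part of the project's output.

```r
# Shares of job descriptions mentioning each skill, copied from the findings table
compare <- data.frame(
  skills = c("machine learning", "phd", "statistics", "math", "r",
             "vision", "leadership", "organization", "communication"),
  jd_percent_ds = c(0.704, 0.328, 0.519, 0.458, 0.395, 0.297, 0.223, 0.351, 0.485),
  jd_percent_da = c(0.170, 0.124, 0.382, 0.330, 0.267, 0.293, 0.268, 0.495, 0.636)
)

# Positive gap: more common in Data Scientist postings; negative: Data Analyst
compare$gap <- compare$jd_percent_ds - compare$jd_percent_da
compare <- compare[order(-compare$gap), ]
print(compare)
```

In this subset, machine learning shows the largest gap in favour of Data Scientist postings and communication the largest in favour of Data Analyst postings, while vision is nearly level between the two roles.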

findings_table_ds
Top Skills
row skills count rank jd_count
1 python 5250 1 1750
4 machine learning 1693 2 1693
5 design 1468 3 1468
6 computer 1460 4 1460
7 science 1384 5 1384
8 research 1254 6 1254
9 statistics 1249 7 1249
10 sql 1172 8 1172
11 communication 1168 9 1168
12 r 1118 10 950
13 math 1102 11 1102
14 solutions 1056 12 1056
15 algorithms 975 13 975
16 programming 960 14 960
17 leader 905 15 905
18 organization 844 16 844
19 passion 821 17 821
20 analytical 799 18 799
21 phd 789 19 789
22 scala 789 20 789
23 quantitative 779 21 779
24 spark 779 22 779
25 mathematics 768 23 768
26 communication skills 766 24 766
27 java 765 25 765
28 AI 729 26 NA
29 vision 714 27 714
30 hadoop 706 28 706
31 database 699 29 699
32 big data 695 30 695
33 written 684 31 684
34 ml 617 32 617
35 visualization 578 33 578
36 years of experience 576 34 576
37 data sets 566 35 566
38 leadership 536 36 536
39 data analysis 532 37 532
40 git 516 38 516
41 collaborative 511 39 511
42 office 506 40 506
43 data mining 483 41 483
44 verbal 481 42 481
45 innovation 471 43 471
46 presentation 463 44 463
47 creating 451 45 451
48 sas 445 46 445
49 deep learning 432 47 432
50 large data 398 48 398
51 natural language 390 49 390
52 data engineer 389 50 389
53 physics 337 51 337
54 collaboration 329 52 329
55 software development 327 53 327
56 language processing 320 54 320
57 artificial intelligence 318 55 318
58 natural language processing 318 56 318
59 economics 315 57 315
60 learning algorithms 315 58 315
61 programming languages 313 59 313
62 business problems 312 60 312
63 data visualization 312 61 312
64 machine learning techniques 298 62 298
65 writing 295 63 295
66 rtable 291 64 291
67 data analytics 285 65 285
68 tableau 279 66 279
69 machine learning algorithms 278 67 278
70 matlab 277 68 277
71 consulting 274 69 274
72 problem solving 273 70 273
73 learning models 268 71 268
74 nlp 254 72 254
75 influence 251 73 251
76 linux 251 74 251
77 flexible 244 75 244
78 etl 239 76 239
79 statistical modeling 237 77 237
80 large scale 235 78 235
81 nosql 232 79 232
82 machine learning models 231 80 231
83 data processing 227 81 227
84 large data sets 227 82 227
85 data pipeline 225 83 225
86 ms 223 84 253
87 predictive models 211 85 211
88 interpersonal 201 86 201
89 masters 201 87 201
90 data engineering 194 88 194
91 software engineers 194 89 194
92 data pipelines 179 90 179
93 decision making 173 91 173
94 organizational 172 92 172
95 forecasting 169 93 169
96 bachelor’s degree 164 94 164
97 monitoring 162 95 162
98 data management 161 96 161
99 have experience 157 97 157
100 creativity 155 98 155
101 microsoft 148 99 148
102 predictive analytics 146 100 146
103 project management 141 101 141
104 data collection 132 102 132
105 azure 127 103 127
106 javascript 127 104 127
107 modeling techniques 122 105 122
108 data architecture 120 106 120
109 data models 116 107 116
110 sap 112 108 112
111 unix 112 109 112
112 Go 107 110 NA
113 array 104 111 104
114 modelling 102 112 102
115 ruby 98 113 98
116 work independently 97 114 97
117 data warehousing 95 115 95
118 facebook 95 116 95
119 mysql 92 117 92
120 powerpoint 84 118 84
121 language understanding 83 119 83
122 ecommerce 80 120 80
123 data systems 78 121 78
124 solving problems 77 122 77
125 data extraction 75 123 75
126 mongodb 75 124 75
127 apache spark 65 125 65
128 natural language understanding 64 126 64
129 critical thinking 62 127 62
130 kpmg 60 128 60
131 data integration 59 129 59
132 big data architecture 57 130 57
133 github 57 131 57
134 postgresql 56 132 56
135 data manipulation 51 133 51
136 highly motivated 51 134 51
137 masters degree 49 135 49
138 elasticsearch 47 136 47
139 speaking 47 137 47
140 architecture capabilities 46 138 46
141 bachelors 46 139 46
142 covering technologies 46 140 46
143 multi-task 46 141 46
144 bash 45 142 45
145 nlu 40 143 40
146 shell script 40 144 40
147 troubleshooting 40 145 40
148 youtube 40 146 40
149 coordination 39 147 39
150 data gathering 39 148 39
151 time management 38 149 38
152 methodological 34 150 34
153 microsoft office 31 151 31
154 django 30 152 30
155 google analytics 30 153 30
156 market research 29 154 29
157 network analysis 28 155 28
158 data insights 27 156 27
159 microsoft azure 23 157 23
160 data preparation 22 158 22
161 negotiation 21 159 21
162 sales and marketing 19 160 19
163 vba 19 161 19
164 jupyter notebook 18 162 18
165 microstrategy 18 163 18
166 doctorate degree 17 164 17
167 manage multiple projects 16 165 16
168 analytics data 15 166 15
169 microsoft excel 15 167 15
170 machine learning data 13 168 13
171 strategic thinking 13 169 13
172 highly organized 10 170 10
173 jquery 10 171 10
174 (blank) 9 172 NA
175 grammatical 9 173 9
176 symantec 9 174 9
177 telecommunications 9 175 9
178 data entry 8 176 8
179 data mapping 8 177 8
180 data reporting 8 178 8
181 eko 8 179 8
182 active learning 7 180 7
183 apache hadoop 7 181 7
184 confluence 7 182 7
185 amazon redshift 6 183 6
186 data transfer 6 184 6
187 microsoft word 6 185 6
188 swift 6 186 6
189 complex problem solving 5 187 5
190 english language 5 188 5
191 ibm db2 5 189 5
192 microsoft sql server 5 190 5
193 service orientation 5 191 5
194 unix shell 5 192 5
195 apache kafka 4 193 4
196 operations analysis 4 194 4
197 see the big picture 4 195 4
198 apache hive 3 196 3
199 data interpretation 3 197 3
200 experience in information technology 3 198 3
201 microsoft access 3 199 3
202 minitab 3 200 3
203 skype 3 201 3
204 systems analysis 3 202 3
205 technology design 3 203 3
206 ubuntu 3 204 3
207 work well in a team 3 205 3
208 active listening 2 206 2
209 citrix 2 207 2
210 clerical 2 208 2
211 client management 2 209 2
212 data cleanup 2 210 2
213 data organization 2 211 2
214 google adwords 2 212 2
215 mathematical reasoning 2 213 2
216 microsoft outlook 2 214 2
217 microsoft powerpoint 2 215 2
218 amazon dynamodb 1 216 1
219 bring creativity 1 217 1
220 design development 1 218 1
221 engineering and technology 1 219 1
222 filemaker pro 1 220 1
223 ibm infosphere datastage 1 221 1
224 judgment and decision making 1 222 1
225 microsoft dynamics 1 223 1
226 microsoft windows server 1 224 1
227 oracle hyperion 1 225 1
228 oracle java 1 226 1
229 organizational management 1 227 1
230 prepare data for analysis 1 228 1
231 quality control analysis 1 229 1
232 reading comprehension 1 230 1
233 report creation 1 231 1
234 teradata database 1 232 1
235 wireshark 1 233 1
findings_table_da
Top Skills
row skills count rank jd_count
1 python 1275 1 425
4 research 785 2 785
5 communication 776 3 776
6 analytical 653 4 653
7 design 627 5 627
8 organization 605 6 605
9 written 514 7 514
10 quantitative 499 8 499
11 communication skills 488 9 488
12 leader 484 10 484
13 sql 467 11 467
14 statistics 466 12 466
15 office 463 13 463
16 math 403 14 403
17 r 401 15 326
18 solutions 400 16 400
19 computer 380 17 380
20 presentation 373 18 373
21 database 367 19 367
22 vision 358 20 358
23 leadership 327 21 327
24 verbal 324 22 324
25 passion 323 23 323
26 programming 313 24 313
27 science 312 25 312
28 sas 309 26 309
29 data analysis 304 27 304
30 writing 300 28 300
31 years of experience 286 29 286
32 collaborative 284 30 284
33 economics 268 31 268
34 mathematics 257 32 257
35 visualization 243 33 243
36 organizational 224 34 224
37 microsoft 223 35 223
38 innovation 207 36 207
39 machine learning 207 37 207
40 data sets 206 38 206
41 tableau 203 39 203
42 interpersonal 198 40 198
43 creating 196 41 196
44 git 195 42 195
45 powerpoint 190 43 190
46 consulting 182 44 182
47 ms 173 45 165
48 problem solving 172 46 172
49 data visualization 162 47 162
50 project management 159 48 159
51 collaboration 154 49 154
52 phd 151 50 151
53 bachelor’s degree 149 51 149
54 data analytics 140 52 140
55 flexible 137 53 137
56 java 134 54 134
57 ml 134 55 134
58 large data 133 56 133
59 algorithms 130 57 130
60 rtable 128 58 128
61 work independently 125 59 125
62 data collection 121 60 121
63 big data 120 61 120
64 data management 120 62 120
65 influence 118 63 118
66 monitoring 118 64 118
67 decision making 114 65 114
68 scala 113 66 113
69 data mining 105 67 105
70 microsoft office 100 68 100
71 forecasting 96 69 96
72 market research 95 70 95
73 hadoop 92 71 92
74 matlab 92 72 92
75 physics 89 73 89
76 programming languages 88 74 88
77 business problems 87 75 87
78 spark 87 76 87
79 masters 85 77 85
80 data engineer 81 78 81
81 critical thinking 77 79 77
82 multi-task 77 80 77
83 Go 74 81 NA
84 etl 73 82 73
85 large data sets 72 83 72
86 coordination 68 84 68
87 microsoft excel 68 85 68
88 statistical modeling 61 86 61
89 vba 60 87 60
90 have experience 58 88 58
91 facebook 53 89 53
92 sap 53 90 53
93 highly motivated 52 91 52
94 time management 51 92 51
95 creativity 50 93 50
96 data engineering 50 94 50
97 linux 46 95 46
98 bachelors 41 96 41
99 google analytics 41 97 41
100 software engineers 41 98 41
101 data processing 40 99 40
102 modelling 39 100 39
103 predictive analytics 39 101 39
104 predictive models 39 102 39
105 software development 38 103 38
106 javascript 37 104 37
107 modeling techniques 36 105 36
108 deep learning 35 106 35
109 array 34 107 34
110 troubleshooting 34 108 34
111 unix 34 109 34
112 artificial intelligence 33 110 33
113 machine learning techniques 33 111 33
114 data manipulation 31 112 31
115 data models 31 113 31
116 data systems 31 114 31
117 natural language 31 115 31
118 data warehousing 30 116 30
119 mysql 30 117 30
120 solving problems 30 118 30
121 youtube 29 119 29
122 data integration 28 120 28
123 manage multiple projects 28 121 28
124 masters degree 27 122 27
125 data pipeline 26 123 26
126 ecommerce 26 124 26
127 learning algorithms 26 125 26
128 methodological 26 126 26
129 speaking 26 127 26
130 AI 25 128 NA
131 data extraction 25 129 25
132 language processing 25 130 25
133 data gathering 24 131 24
134 machine learning algorithms 24 132 24
135 natural language processing 24 133 24
136 learning models 23 134 23
137 large scale 22 135 22
138 nosql 22 136 22
139 machine learning models 21 137 21
140 azure 19 138 19
141 data entry 19 139 19
142 data insights 19 140 19
143 doctorate degree 19 141 19
144 microsoft word 18 142 18
145 highly organized 17 143 17
146 ruby 17 144 17
147 data pipelines 16 145 16
148 data reporting 15 146 15
149 negotiation 15 147 15
150 systems analysis 15 148 15
151 microsoft powerpoint 14 149 14
152 nlp 14 150 14
153 data architecture 13 151 13
154 bash 12 152 12
155 network analysis 12 153 12
156 elasticsearch 11 154 11
157 postgresql 11 155 11
158 service orientation 11 156 11
159 strategic thinking 11 157 11
160 english language 10 158 10
161 github 10 159 10
162 mongodb 10 160 10
163 data preparation 8 161 8
164 data transfer 8 162 8
165 analytics data 7 163 7
166 client management 7 164 7
167 microsoft access 7 165 7
168 (blank) 6 166 NA
169 apache spark 6 167 6
170 microsoft project 6 168 6
171 shell script 6 169 6
172 jupyter notebook 5 170 5
173 language understanding 5 171 5
174 microstrategy 5 172 5
175 sales and marketing 5 173 5
176 symantec 5 174 5
177 data interpretation 4 175 4
178 grammatical 4 176 4
179 minitab 4 177 4
180 natural language understanding 4 178 4
181 report creation 4 179 4
182 amazon redshift 3 180 3
183 complex problem solving 3 181 3
184 data mapping 3 182 3
185 data storytelling 3 183 3
186 kpmg 3 184 3
187 lexisnexis 3 185 3
188 microsoft outlook 3 186 3
189 telecommunications 3 187 3
190 work well in a team 3 188 3
191 active learning 2 189 2
192 administration and management 2 190 2
193 ajax 2 191 2
194 apache hadoop 2 192 2
195 clerical 2 193 2
196 confluence 2 194 2
197 experience in market research 2 195 2
198 google docs 2 196 2
199 mcafee 2 197 2
200 microsoft azure 2 198 2
201 nlu 2 199 2
202 operations analysis 2 200 2
203 see the big picture 2 201 2
204 swift 2 202 2
205 wireshark 2 203 2
206 apache kafka 1 204 1
207 apache tomcat 1 205 1
208 data cleanup 1 206 1
209 data organization 1 207 1
210 datadriven 1 208 1
211 deductive reasoning 1 209 1
212 design development 1 210 1
213 django 1 211 1
214 eko 1 212 1
215 epic systems 1 213 1
216 experience in information technology 1 214 1
217 filemaker pro 1 215 1
218 google adwords 1 216 1
219 ibm db2 1 217 1
220 jquery 1 218 1
221 judgment and decision making 1 219 1
222 machine learning data 1 220 1
223 mathematical reasoning 1 221 1
224 microsoft dynamics 1 222 1
225 microsoft sharepoint 1 223 1
226 microsoft sql server 1 224 1
227 microsoft sql server reporting services 1 225 1
228 microsoft windows server 1 226 1
229 oracle hyperion 1 227 1
230 organizational management 1 228 1
231 processing information 1 229 1
232 reading comprehension 1 230 1
233 skype 1 231 1
234 systems evaluation 1 232 1
235 tax software 1 233 1
236 technology design 1 234 1
237 ubuntu 1 235 1
238 unix shell 1 236 1
findings_table_all
Top Skills
skills count_ds rank_ds jd_count_ds count_da rank_da jd_count_da avg_rank jd_percent_ds freq_per_jd_ds jd_percent_da freq_per_jd_da
python 5250 1 1750 1275 1 425 1.0 0.727 3.000 0.348 3.000
design 1468 3 1468 627 5 627 4.0 0.610 1.000 0.514 1.000
research 1254 6 1254 785 2 785 4.0 0.521 1.000 0.643 1.000
communication 1168 9 1168 776 3 776 6.0 0.485 1.000 0.636 1.000
statistics 1249 7 1249 466 12 466 9.5 0.519 1.000 0.382 1.000
sql 1172 8 1172 467 11 467 9.5 0.487 1.000 0.382 1.000
computer 1460 4 1460 380 17 380 10.5 0.607 1.000 0.311 1.000
organization 844 16 844 605 6 605 11.0 0.351 1.000 0.495 1.000
analytical 799 18 799 653 4 653 11.0 0.332 1.000 0.535 1.000
r 1118 10 950 401 15 326 12.5 0.395 1.177 0.267 1.230
math 1102 11 1102 403 14 403 12.5 0.458 1.000 0.330 1.000
leader 905 15 905 484 10 484 12.5 0.376 1.000 0.396 1.000
solutions 1056 12 1056 400 16 400 14.0 0.439 1.000 0.328 1.000
quantitative 779 21 779 499 8 499 14.5 0.324 1.000 0.409 1.000
science 1384 5 1384 312 25 312 15.0 0.575 1.000 0.256 1.000
communication skills 766 24 766 488 9 488 16.5 0.318 1.000 0.400 1.000
programming 960 14 960 313 24 313 19.0 0.399 1.000 0.256 1.000
written 684 31 684 514 7 514 19.0 0.284 1.000 0.421 1.000
machine learning 1693 2 1693 207 37 207 19.5 0.704 1.000 0.170 1.000
passion 821 17 821 323 23 323 20.0 0.341 1.000 0.265 1.000
vision 714 27 714 358 20 358 23.5 0.297 1.000 0.293 1.000
database 699 29 699 367 19 367 24.0 0.291 1.000 0.301 1.000
office 506 40 506 463 13 463 26.5 0.210 1.000 0.379 1.000
mathematics 768 23 768 257 32 257 27.5 0.319 1.000 0.210 1.000
leadership 536 36 536 327 21 327 28.5 0.223 1.000 0.268 1.000
presentation 463 44 463 373 18 373 31.0 0.192 1.000 0.305 1.000
years of experience 576 34 576 286 29 286 31.5 0.239 1.000 0.234 1.000
data analysis 532 37 532 304 27 304 32.0 0.221 1.000 0.249 1.000
verbal 481 42 481 324 22 324 32.0 0.200 1.000 0.265 1.000
visualization 578 33 578 243 33 243 33.0 0.240 1.000 0.199 1.000
phd 789 19 789 151 50 151 34.5 0.328 1.000 0.124 1.000
collaborative 511 39 511 284 30 284 34.5 0.212 1.000 0.233 1.000
algorithms 975 13 975 130 57 130 35.0 0.405 1.000 0.106 1.000
sas 445 46 445 309 26 309 36.0 0.185 1.000 0.253 1.000
data sets 566 35 566 206 38 206 36.5 0.235 1.000 0.169 1.000
java 765 25 765 134 54 134 39.5 0.318 1.000 0.110 1.000
innovation 471 43 471 207 36 207 39.5 0.196 1.000 0.170 1.000
git 516 38 516 195 42 195 40.0 0.214 1.000 0.160 1.000
scala 789 20 789 113 66 113 43.0 0.328 1.000 0.093 1.000
creating 451 45 451 196 41 196 43.0 0.187 1.000 0.161 1.000
ml 617 32 617 134 55 134 43.5 0.256 1.000 0.110 1.000
economics 315 57 315 268 31 268 44.0 0.131 1.000 0.219 1.000
big data 695 30 695 120 61 120 45.5 0.289 1.000 0.098 1.000
writing 295 63 295 300 28 300 45.5 0.123 1.000 0.246 1.000
spark 779 22 779 87 76 87 49.0 0.324 1.000 0.071 1.000
hadoop 706 28 706 92 71 92 49.5 0.293 1.000 0.075 1.000
collaboration 329 52 329 154 49 154 50.5 0.137 1.000 0.126 1.000
large data 398 48 398 133 56 133 52.0 0.165 1.000 0.109 1.000
tableau 279 66 279 203 39 203 52.5 0.116 1.000 0.166 1.000
data mining 483 41 483 105 67 105 54.0 0.201 1.000 0.086 1.000
data visualization 312 61 312 162 47 162 54.0 0.130 1.000 0.133 1.000
consulting 274 69 274 182 44 182 56.5 0.114 1.000 0.149 1.000
problem solving 273 70 273 172 46 172 58.0 0.113 1.000 0.141 1.000
data analytics 285 65 285 140 52 140 58.5 0.118 1.000 0.115 1.000
rtable 291 64 291 128 58 128 61.0 0.121 1.000 0.105 1.000
physics 337 51 337 89 73 89 62.0 0.140 1.000 0.073 1.000
interpersonal 201 86 201 198 40 198 63.0 0.084 1.000 0.162 1.000
organizational 172 92 172 224 34 224 63.0 0.071 1.000 0.183 1.000
data engineer 389 50 389 81 78 81 64.0 0.162 1.000 0.066 1.000
flexible 244 75 244 137 53 137 64.0 0.101 1.000 0.112 1.000
ms 223 84 253 173 45 165 64.5 0.105 0.881 0.135 1.048
programming languages 313 59 313 88 74 88 66.5 0.130 1.000 0.072 1.000
microsoft 148 99 148 223 35 223 67.0 0.062 1.000 0.183 1.000
business problems 312 60 312 87 75 87 67.5 0.130 1.000 0.071 1.000
influence 251 73 251 118 63 118 68.0 0.104 1.000 0.097 1.000
matlab 277 68 277 92 72 92 70.0 0.115 1.000 0.075 1.000
bachelor’s degree 164 94 164 149 51 149 72.5 0.068 1.000 0.122 1.000
project management 141 101 141 159 48 159 74.5 0.059 1.000 0.130 1.000
deep learning 432 47 432 35 106 35 76.5 0.180 1.000 0.029 1.000
AI 729 26 NA 25 128 NA 77.0 NA NA NA NA
software development 327 53 327 38 103 38 78.0 0.136 1.000 0.031 1.000
decision making 173 91 173 114 65 114 78.0 0.072 1.000 0.093 1.000
etl 239 76 239 73 82 73 79.0 0.099 1.000 0.060 1.000
data management 161 96 161 120 62 120 79.0 0.067 1.000 0.098 1.000
monitoring 162 95 162 118 64 118 79.5 0.067 1.000 0.097 1.000
powerpoint 84 118 84 190 43 190 80.5 0.035 1.000 0.156 1.000
forecasting 169 93 169 96 69 96 81.0 0.070 1.000 0.079 1.000
data collection 132 102 132 121 60 121 81.0 0.055 1.000 0.099 1.000
statistical modeling 237 77 237 61 86 61 81.5 0.099 1.000 0.050 1.000
natural language 390 49 390 31 115 31 82.0 0.162 1.000 0.025 1.000
masters 201 87 201 85 77 85 82.0 0.084 1.000 0.070 1.000
artificial intelligence 318 55 318 33 110 33 82.5 0.132 1.000 0.027 1.000
large data sets 227 82 227 72 83 72 82.5 0.094 1.000 0.059 1.000
linux 251 74 251 46 95 46 84.5 0.104 1.000 0.038 1.000
machine learning techniques 298 62 298 33 111 33 86.5 0.124 1.000 0.027 1.000
work independently 97 114 97 125 59 125 86.5 0.040 1.000 0.102 1.000
data processing 227 81 227 40 99 40 90.0 0.094 1.000 0.033 1.000
data engineering 194 88 194 50 94 50 91.0 0.081 1.000 0.041 1.000
learning algorithms 315 58 315 26 125 26 91.5 0.131 1.000 0.021 1.000
language processing 320 54 320 25 130 25 92.0 0.133 1.000 0.020 1.000
have experience 157 97 157 58 88 58 92.5 0.065 1.000 0.048 1.000
predictive models 211 85 211 39 102 39 93.5 0.088 1.000 0.032 1.000
software engineers 194 89 194 41 98 41 93.5 0.081 1.000 0.034 1.000
natural language processing 318 56 318 24 133 24 94.5 0.132 1.000 0.020 1.000
creativity 155 98 155 50 93 50 95.5 0.064 1.000 0.041 1.000
Go 107 110 NA 74 81 NA 95.5 NA NA NA NA
sap 112 108 112 53 90 53 99.0 0.047 1.000 0.043 1.000
machine learning algorithms 278 67 278 24 132 24 99.5 0.116 1.000 0.020 1.000
predictive analytics 146 100 146 39 101 39 100.5 0.061 1.000 0.032 1.000
learning models 268 71 268 23 134 23 102.5 0.111 1.000 0.019 1.000
facebook 95 116 95 53 89 53 102.5 0.039 1.000 0.043 1.000
data pipeline 225 83 225 26 123 26 103.0 0.094 1.000 0.021 1.000
critical thinking 62 127 62 77 79 77 103.0 0.026 1.000 0.063 1.000
javascript 127 104 127 37 104 37 104.0 0.053 1.000 0.030 1.000
modeling techniques 122 105 122 36 105 36 105.0 0.051 1.000 0.029 1.000
modelling 102 112 102 39 100 39 106.0 0.042 1.000 0.032 1.000
large scale 235 78 235 22 135 22 106.5 0.098 1.000 0.018 1.000
nosql 232 79 232 22 136 22 107.5 0.096 1.000 0.018 1.000
machine learning models 231 80 231 21 137 21 108.5 0.096 1.000 0.017 1.000
unix 112 109 112 34 109 34 109.0 0.047 1.000 0.028 1.000
array 104 111 104 34 107 34 109.0 0.043 1.000 0.028 1.000
microsoft office 31 151 31 100 68 100 109.5 0.013 1.000 0.082 1.000
data models 116 107 116 31 113 31 110.0 0.048 1.000 0.025 1.000
multi-task 46 141 46 77 80 77 110.5 0.019 1.000 0.063 1.000
nlp 254 72 254 14 150 14 111.0 0.106 1.000 0.011 1.000
market research 29 154 29 95 70 95 112.0 0.012 1.000 0.078 1.000
highly motivated 51 134 51 52 91 52 112.5 0.021 1.000 0.043 1.000
data warehousing 95 115 95 30 116 30 115.5 0.039 1.000 0.025 1.000
coordination 39 147 39 68 84 68 115.5 0.016 1.000 0.056 1.000
mysql 92 117 92 30 117 30 117.0 0.038 1.000 0.025 1.000
data pipelines 179 90 179 16 145 16 117.5 0.074 1.000 0.013 1.000
data systems 78 121 78 31 114 31 117.5 0.032 1.000 0.025 1.000
bachelors 46 139 46 41 96 41 117.5 0.019 1.000 0.034 1.000
solving problems 77 122 77 30 118 30 120.0 0.032 1.000 0.025 1.000
azure 127 103 127 19 138 19 120.5 0.053 1.000 0.016 1.000
time management 38 149 38 51 92 51 120.5 0.016 1.000 0.042 1.000
ecommerce 80 120 80 26 124 26 122.0 0.033 1.000 0.021 1.000
data manipulation 51 133 51 31 112 31 122.5 0.021 1.000 0.025 1.000
vba 19 161 19 60 87 60 124.0 0.008 1.000 0.049 1.000
data integration 59 129 59 28 120 28 124.5 0.025 1.000 0.023 1.000
google analytics 30 153 30 41 97 41 125.0 0.012 1.000 0.034 1.000
data extraction 75 123 75 25 129 25 126.0 0.031 1.000 0.020 1.000
microsoft excel 15 167 15 68 85 68 126.0 0.006 1.000 0.056 1.000
troubleshooting 40 145 40 34 108 34 126.5 0.017 1.000 0.028 1.000
data architecture 120 106 120 13 151 13 128.5 0.050 1.000 0.011 1.000
ruby 98 113 98 17 144 17 128.5 0.041 1.000 0.014 1.000
masters degree 49 135 49 27 122 27 128.5 0.020 1.000 0.022 1.000
speaking 47 137 47 26 127 26 132.0 0.020 1.000 0.021 1.000
youtube 40 146 40 29 119 29 132.5 0.017 1.000 0.024 1.000
methodological 34 150 34 26 126 26 138.0 0.014 1.000 0.021 1.000
data gathering 39 148 39 24 131 24 139.5 0.016 1.000 0.020 1.000
mongodb 75 124 75 10 160 10 142.0 0.031 1.000 0.008 1.000
manage multiple projects 16 165 16 28 121 28 143.0 0.007 1.000 0.023 1.000
postgresql 56 132 56 11 155 11 143.5 0.023 1.000 0.009 1.000
language understanding 83 119 83 5 171 5 145.0 0.034 1.000 0.004 1.000
github 57 131 57 10 159 10 145.0 0.024 1.000 0.008 1.000
elasticsearch 47 136 47 11 154 11 145.0 0.020 1.000 0.009 1.000
apache spark 65 125 65 6 167 6 146.0 0.027 1.000 0.005 1.000
bash 45 142 45 12 152 12 147.0 0.019 1.000 0.010 1.000
data insights 27 156 27 19 140 19 148.0 0.011 1.000 0.016 1.000
natural language understanding 64 126 64 4 178 4 152.0 0.027 1.000 0.003 1.000
doctorate degree 17 164 17 19 141 19 152.5 0.007 1.000 0.016 1.000
negotiation 21 159 21 15 147 15 153.0 0.009 1.000 0.012 1.000
network analysis 28 155 28 12 153 12 154.0 0.012 1.000 0.010 1.000
kpmg 60 128 60 3 184 3 156.0 0.025 1.000 0.002 1.000
shell script 40 144 40 6 169 6 156.5 0.017 1.000 0.005 1.000
highly organized 10 170 10 17 143 17 156.5 0.004 1.000 0.014 1.000
data entry 8 176 8 19 139 19 157.5 0.003 1.000 0.016 1.000
data preparation 22 158 22 8 161 8 159.5 0.009 1.000 0.007 1.000
data reporting 8 178 8 15 146 15 162.0 0.003 1.000 0.012 1.000
strategic thinking 13 169 13 11 157 11 163.0 0.005 1.000 0.009 1.000
microsoft word 6 185 6 18 142 18 163.5 0.002 1.000 0.015 1.000
analytics data 15 166 15 7 163 7 164.5 0.006 1.000 0.006 1.000
jupyter notebook 18 162 18 5 170 5 166.0 0.007 1.000 0.004 1.000
sales and marketing 19 160 19 5 173 5 166.5 0.008 1.000 0.004 1.000
microstrategy 18 163 18 5 172 5 167.5 0.007 1.000 0.004 1.000
(blank) 9 172 NA 6 166 NA 169.0 NA NA NA NA
nlu 40 143 40 2 199 2 171.0 0.017 1.000 0.002 1.000
data transfer 6 184 6 8 162 8 173.0 0.002 1.000 0.007 1.000
english language 5 188 5 10 158 10 173.0 0.002 1.000 0.008 1.000
service orientation 5 191 5 11 156 11 173.5 0.002 1.000 0.009 1.000
symantec 9 174 9 5 174 5 174.0 0.004 1.000 0.004 1.000
grammatical 9 173 9 4 176 4 174.5 0.004 1.000 0.003 1.000
systems analysis 3 202 3 15 148 15 175.0 0.001 1.000 0.012 1.000
microsoft azure 23 157 23 2 198 2 177.5 0.010 1.000 0.002 1.000
data mapping 8 177 8 3 182 3 179.5 0.003 1.000 0.002 1.000
telecommunications 9 175 9 3 187 3 181.0 0.004 1.000 0.002 1.000
django 30 152 30 1 211 1 181.5 0.012 1.000 0.001 1.000
amazon redshift 6 183 6 3 180 3 181.5 0.002 1.000 0.002 1.000
microsoft access 3 199 3 7 165 7 182.0 0.001 1.000 0.006 1.000
microsoft powerpoint 2 215 2 14 149 14 182.0 0.001 1.000 0.011 1.000
complex problem solving 5 187 5 3 181 3 184.0 0.002 1.000 0.002 1.000
active learning 7 180 7 2 189 2 184.5 0.003 1.000 0.002 1.000
data interpretation 3 197 3 4 175 4 186.0 0.001 1.000 0.003 1.000
apache hadoop 7 181 7 2 192 2 186.5 0.003 1.000 0.002 1.000
client management 2 209 2 7 164 7 186.5 0.001 1.000 0.006 1.000
confluence 7 182 7 2 194 2 188.0 0.003 1.000 0.002 1.000
minitab 3 200 3 4 177 4 188.5 0.001 1.000 0.003 1.000
machine learning data 13 168 13 1 220 1 194.0 0.005 1.000 0.001 1.000
swift 6 186 6 2 202 2 194.0 0.002 1.000 0.002 1.000
jquery 10 171 10 1 218 1 194.5 0.004 1.000 0.001 1.000
eko 8 179 8 1 212 1 195.5 0.003 1.000 0.001 1.000
work well in a team 3 205 3 3 188 3 196.5 0.001 1.000 0.002 1.000
operations analysis 4 194 4 2 200 2 197.0 0.002 1.000 0.002 1.000
see the big picture 4 195 4 2 201 2 198.0 0.002 1.000 0.002 1.000
apache kafka 4 193 4 1 204 1 198.5 0.002 1.000 0.001 1.000
microsoft outlook 2 214 2 3 186 3 200.0 0.001 1.000 0.002 1.000
clerical 2 208 2 2 193 2 200.5 0.001 1.000 0.002 1.000
ibm db2 5 189 5 1 217 1 203.0 0.002 1.000 0.001 1.000
report creation 1 231 1 4 179 4 205.0 0.000 1.000 0.003 1.000
experience in information technology 3 198 3 1 214 1 206.0 0.001 1.000 0.001 1.000
microsoft sql server 5 190 5 1 224 1 207.0 0.002 1.000 0.001 1.000
data cleanup 2 210 2 1 206 1 208.0 0.001 1.000 0.001 1.000
data organization 2 211 2 1 207 1 209.0 0.001 1.000 0.001 1.000
unix shell 5 192 5 1 236 1 214.0 0.002 1.000 0.001 1.000
google adwords 2 212 2 1 216 1 214.0 0.001 1.000 0.001 1.000
design development 1 218 1 1 210 1 214.0 0.000 1.000 0.001 1.000
skype 3 201 3 1 231 1 216.0 0.001 1.000 0.001 1.000
mathematical reasoning 2 213 2 1 221 1 217.0 0.001 1.000 0.001 1.000
filemaker pro 1 220 1 1 215 1 217.5 0.000 1.000 0.001 1.000
wireshark 1 233 1 2 203 2 218.0 0.000 1.000 0.002 1.000
technology design 3 203 3 1 234 1 218.5 0.001 1.000 0.001 1.000
ubuntu 3 204 3 1 235 1 219.5 0.001 1.000 0.001 1.000
judgment and decision making 1 222 1 1 219 1 220.5 0.000 1.000 0.001 1.000
microsoft dynamics 1 223 1 1 222 1 222.5 0.000 1.000 0.001 1.000
microsoft windows server 1 224 1 1 226 1 225.0 0.000 1.000 0.001 1.000
oracle hyperion 1 225 1 1 227 1 226.0 0.000 1.000 0.001 1.000
organizational management 1 227 1 1 228 1 227.5 0.000 1.000 0.001 1.000
reading comprehension 1 230 1 1 230 1 230.0 0.000 1.000 0.001 1.000
big data architecture 57 130 57 NA NA NA NA 0.024 1.000 NA NA
architecture capabilities 46 138 46 NA NA NA NA 0.019 1.000 NA NA
covering technologies 46 140 46 NA NA NA NA 0.019 1.000 NA NA
apache hive 3 196 3 NA NA NA NA 0.001 1.000 NA NA
active listening 2 206 2 NA NA NA NA 0.001 1.000 NA NA
citrix 2 207 2 NA NA NA NA 0.001 1.000 NA NA
amazon dynamodb 1 216 1 NA NA NA NA 0.000 1.000 NA NA
bring creativity 1 217 1 NA NA NA NA 0.000 1.000 NA NA
engineering and technology 1 219 1 NA NA NA NA 0.000 1.000 NA NA
ibm infosphere datastage 1 221 1 NA NA NA NA 0.000 1.000 NA NA
oracle java 1 226 1 NA NA NA NA 0.000 1.000 NA NA
prepare data for analysis 1 228 1 NA NA NA NA 0.000 1.000 NA NA
quality control analysis 1 229 1 NA NA NA NA 0.000 1.000 NA NA
teradata database 1 232 1 NA NA NA NA 0.000 1.000 NA NA
microsoft project NA NA NA 6 168 6 NA NA NA 0.005 1.000
data storytelling NA NA NA 3 183 3 NA NA NA 0.002 1.000
lexisnexis NA NA NA 3 185 3 NA NA NA 0.002 1.000
administration and management NA NA NA 2 190 2 NA NA NA 0.002 1.000
ajax NA NA NA 2 191 2 NA NA NA 0.002 1.000
experience in market research NA NA NA 2 195 2 NA NA NA 0.002 1.000
google docs NA NA NA 2 196 2 NA NA NA 0.002 1.000
mcafee NA NA NA 2 197 2 NA NA NA 0.002 1.000
apache tomcat NA NA NA 1 205 1 NA NA NA 0.001 1.000
datadriven NA NA NA 1 208 1 NA NA NA 0.001 1.000
deductive reasoning NA NA NA 1 209 1 NA NA NA 0.001 1.000
epic systems NA NA NA 1 213 1 NA NA NA 0.001 1.000
microsoft sharepoint NA NA NA 1 223 1 NA NA NA 0.001 1.000
microsoft sql server reporting services NA NA NA 1 225 1 NA NA NA 0.001 1.000
processing information NA NA NA 1 229 1 NA NA NA 0.001 1.000
systems evaluation NA NA NA 1 232 1 NA NA NA 0.001 1.000
tax software NA NA NA 1 233 1 NA NA NA 0.001 1.000

Survey

The top broad skills identified by the survey were “programming skills” and “analytical skills”. “Statistics”, “R”, and “Python” were top skills identified by individuals working full-time in either computer science or data science. Overall, in the filtered data sets, there weren’t many common bigrams, which was likely due to the small size of the survey. Interestingly, two survey respondents in the data science / computer science group identified “business knowledge” as a top skill.
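The bigram counting mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `survey` data frame, its `top_skills` column, and the four responses are hypothetical stand-ins for the real survey data, and it assumes the `dplyr` and `tidytext` packages already loaded for this project.

```r
library(dplyr)
library(tidytext)

# Hypothetical survey responses (assumption: the real survey data is not shown here)
survey <- tibble(
  respondent = 1:4,
  top_skills = c(
    "programming skills and statistics",
    "analytical skills, programming skills",
    "business knowledge and Python",
    "analytical skills and R"
  )
)

# Split each free-text answer into overlapping two-word sequences (bigrams),
# drop answers too short to form a bigram, and count how often each occurs
bigram_counts <- survey %>%
  unnest_tokens(bigram, top_skills, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)

bigram_counts
```

With only a handful of responses, most bigrams occur once or twice, which is consistent with the observation that a small survey yields few common bigrams. In practice one would probably also filter out bigrams made up of stop words such as "and".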

Overall, most of the skills identified by the survey were either broad answers, such as “programming skills” or “analytical skills”, or specific programming languages such as Python, R, and SQL. A few respondents also identified abstract skills, such as creativity. While there was more specificity in the survey answers of individuals working or studying in data science or computer science, more survey responses should be gathered before drawing any broader conclusions.

Conclusion

To review, the following conclusions emerged from our research:

  • Data science emphasizes a range of “hard” data-centric skills, especially programming skills like Python, SQL and R, as well as statistical and math skills and a knowledge of machine learning.

  • At least for employers, data scientist positions and data analyst positions point to very different skill sets. For data analysts, communication, research, analytical and organizational skills are most prominent. While there is overlap on almost all of the skills, the emphases are very different.

  • Working data scientists and computer scientists from our survey corroborate this point of view. Statistics and programming make up the bulk of the skills for data scientists.

  • Individuals from our survey not directly involved with data science are more likely to include softer skills. They see writing, thinking, creativity and communication as aspects of data science.

So what are the most valuable skills for a data scientist? Here we really answer a narrower question: what are the most valuable skills to learn in order to be hired and work in the field of data science? The answer is clear: programming, statistics, math, algorithms and computer science.

But if we wanted to answer the broader question of which skills would be most valuable for the next generation of data scientists to learn, we might want to include the thoughts of ethicists, philosophers, futurists, consumers, and representatives of vulnerable populations. But that will have to wait for another project.