In this project we examine varied data sets that support different points of view about the research question.
Overview
The work on this project was fully collaborative. We divided the work into pieces, and for each piece we assigned a team of three or more.
Everyone was on more than one team. This way people who knew less could learn from those who knew more, and we could all contribute.
We held a number of meetings on Zoom where we reviewed each step of the process together and made consensus-based decisions about direction.
We used Slack, Google Drive, Azure and GitHub to share thoughts, ideas and code.
Who is “we”? We are:
# Packages used
library(tidytext)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(plotly)
library(stringr)
library(DT)
library(kableExtra)
library(wesanderson)
library(tidyr)
library(corpus)
library(keyring)
library(RODBC)
library(readr)
library(stringi)
library(tm)
library(wordcloud)
library(data.table)
library(RedditExtractoR)
library(readxl)
library(magrittr)
library(SnowballC)
library(RColorBrewer)
Our methodology was to mine varied data sets that could support different points of view about the research question. We settled on three: a data set of 7,000 Indeed job listings from 2018, a compendium of work skills from the University of Chicago’s Open Skills project, and a survey that we created for this project.
We used the data sets to create a comprehensive word_catalog of data scientist skills, knowledge and abilities. Then we ran the word_catalog back over the Indeed data set so that we could pull out the skills that appeared most. Finally, we used our findings to compare the Indeed data set for data scientists, the Indeed data set for data analysts, and the survey.
Collect the Data to be Analyzed
In this step we read in the following data sets and write them back to a normalized database on Azure. From that point forward, we only pull data from the database so that we are confident we have consistent, secure, and persistent storage for our data:
Read in data and filter by position
conn_str <- paste0(
'Driver={ODBC Driver 17 for SQL Server};
Server=tcp:ehtmp.database.windows.net,1433;
Database=ds_skills;
Encrypt=yes;
TrustServerCertificate=no;
Connection Timeout=30;',
'Uid=',keyring::key_get(service = "my-skills-db-username", keyring = "my-skills-db-keyring"),';',
'Pwd=', keyring::key_get(service = "my-skills-db-pwd", keyring = "my-skills-db-keyring"), ';'
)
dbConnection <- odbcDriverConnect(conn_str)
all_data <- sqlQuery(dbConnection, "SELECT * FROM ds_skills.kaggle.job_postings_raw")
# If the connection doesn't work, use the line below to load the CSV file instead
#all_data<-read.csv("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/project_3/raw_jobdata.csv")
all_data$position<-tolower(all_data$position)
# Lowercase the descriptions, then strip mis-encoded characters and newlines
all_data$description<-tolower(all_data$description)%>%
  str_remove_all("â|€|™|\\n")
##filters for data science positions
data_scientists<-all_data%>%
mutate(contents = str_detect(tolower(position), "data [b-z]|ai|machine"))%>%
filter(contents == TRUE)
data_analysts<-all_data%>%
mutate(contents = str_detect(tolower(position), "anal"))%>%
filter(contents == TRUE)
rm(all_data)
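One caveat with the position filter is that an unanchored alternative such as `ai` also matches inside longer words. The sketch below uses base R's `grepl` on made-up job titles to show the difference. The `titles` vector and the word-boundary variant are illustrative assumptions, not part of the project's pipeline.

```r
# Hypothetical job titles, for illustration only
titles <- c("data scientist", "retail manager", "ai engineer", "machine learning lead")

# Pattern as used in the filter above: "ai" can also match inside "retail"
loose <- grepl("data [b-z]|ai|machine", titles, perl = TRUE)

# Assumed alternative: require "ai" to stand alone as a word
strict <- grepl("data [b-z]|\\bai\\b|machine", titles, perl = TRUE)

# Side-by-side comparison of which titles each pattern keeps
comparison <- data.frame(title = titles, loose = loose, strict = strict)
print(comparison)
```

Whether the stricter pattern is preferable depends on how many unrelated titles the loose one lets through. The code in this report uses the original, looser filter.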
Filter for targeted skills
Using the same search criteria we used above, we make new columns containing the strings of interest to be worked with later. One issue with this approach is the abundance of NA values.
ds<-data_scientists%>%
mutate(skill = str_extract_all(data_scientists$description, " .{75,100} skill.{150,200} "))%>%
mutate(must_have=str_extract_all(data_scientists$description, "must have.{150,200} "))%>%
mutate(knowledge=str_extract_all(data_scientists$description, " .{75,100} knowledge.{150,200} "))%>%
mutate(experience=str_extract_all(data_scientists$description, " .{100,150} exper.{100,150} "))%>%
mutate(excel=str_extract_all(data_scientists$description, "excel at.{150,200} |excel with.{150,200} |excel in.{150,200} "))%>%
mutate(responsible=str_extract_all(data_scientists$description, "responsible.{150,200} "))%>%
mutate(proficient=str_extract_all(data_scientists$description," .{100,150} profi.{110,160} "))%>%
mutate(understands=str_extract_all(data_scientists$description, " .{100,150} understand.{150,200} "))%>%
mutate(utilize=str_extract_all(data_scientists$description, "utilize.{150,200} "))%>%
mutate(lead=str_extract_all(data_scientists$description, " .{150,200} lead.{150,200} "))%>%
mutate(work=str_extract_all(data_scientists$description, " .{50,75} work.{150,200} "))%>%
mutate(looking=str_extract_all(data_scientists$description, "looking.{150,200} "))
# Collapse each list-column of matches into a single string per posting
keyword_cols <- c("skill", "must_have", "knowledge", "experience", "excel", "responsible",
                  "proficient", "understands", "utilize", "lead", "work", "looking")
for (col in keyword_cols) {
  ds[[col]] <- lapply(ds[[col]], function(x) paste(unlist(x), collapse = ' '))
}
Data Analyst
da<-data_analysts%>%
mutate(skill = str_extract_all(data_analysts$description, " .{75,100} skill.{150,200} "))%>%
mutate(must_have=str_extract_all(data_analysts$description, "must have.{150,200} "))%>%
mutate(knowledge=str_extract_all(data_analysts$description, " .{75,100} knowledge.{150,200} "))%>%
mutate(experience=str_extract_all(data_analysts$description, " .{100,150} exper.{100,150} "))%>%
mutate(excel=str_extract_all(data_analysts$description, "excel at.{150,200} |excel with.{150,200} |excel in.{150,200} "))%>%
mutate(responsible=str_extract_all(data_analysts$description, "responsible.{150,200} "))%>%
mutate(proficient=str_extract_all(data_analysts$description," .{100,150} profi.{110,160} "))%>%
mutate(understands=str_extract_all(data_analysts$description, " .{100,150} understand.{150,200} "))%>%
mutate(utilize=str_extract_all(data_analysts$description, "utilize.{150,200} "))%>%
mutate(lead=str_extract_all(data_analysts$description, " .{150,200} lead.{150,200} "))%>%
mutate(work=str_extract_all(data_analysts$description, " .{50,75} work.{150,200} "))%>%
mutate(looking=str_extract_all(data_analysts$description, "looking.{150,200} "))
# Collapse each list-column of matches into a single string per posting
keyword_cols <- c("skill", "must_have", "knowledge", "experience", "excel", "responsible",
                  "proficient", "understands", "utilize", "lead", "work", "looking")
for (col in keyword_cols) {
  da[[col]] <- lapply(da[[col]], function(x) paste(unlist(x), collapse = ' '))
}
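To make the keyword-window idea concrete, here is a minimal base R sketch of the same technique on two invented postings. The helper `extract_context`, the shortened window `.{10,20}`, and the sample text are assumptions for illustration. The project code uses `str_extract_all` with much wider windows.

```r
# Hypothetical helper: pull every match of a keyword-plus-context pattern
# out of each text, then collapse the matches into one string per text
extract_context <- function(text, pattern) {
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
  vapply(hits, function(x) paste(x, collapse = " "), character(1))
}

# Invented postings, for illustration only
postings <- c(
  "applicants must have strong sql and python skills for this role",
  "we offer free snacks"
)

# Keyword followed by 10 to 20 characters of context, ending at a space
context <- extract_context(postings, "must have.{10,20} ")
print(context)
```

In this sketch a posting without the keyword ends up as an empty string after collapsing, which is one reason many cells carry no usable text.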
Transform Data Science dataframe to long
ds_long <- ds %>% gather(keyword, text, 7:18)
text <- ds_long %>% select(text)
Make corpus and remove punctuation, numbers, stopwords, convert cases, etc
corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, c("skill","responsible","proficient","knowledge","understands", "must", "experience", "character", "will", "looking", "excels at", "work", "lead", "utilize"))
corpus <- tm_map(corpus, stripWhitespace)
wordcloud(corpus, max.words = 50, colors = colorRampPalette(brewer.pal(7, "Dark2"))(32))
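The cleaning steps above can be sketched on a single string with base R alone. The function `clean_text`, its `drop_words` argument, and the sample sentence are hypothetical. This is only an approximation of what the `tm` transformations do, not a replacement for them.

```r
# Hypothetical helper mirroring the cleaning steps on one string:
# lowercase, drop punctuation and digits, remove unwanted words,
# and collapse the remaining tokens with single spaces
clean_text <- function(x, drop_words = character(0)) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:digit:]]", "", x)
  tokens <- strsplit(x, "\\s+")[[1]]
  tokens <- tokens[tokens != "" & !(tokens %in% drop_words)]
  paste(tokens, collapse = " ")
}

# Invented sentence and stop list, for illustration only
cleaned <- clean_text(
  "Must have 5+ years of SQL experience!",
  drop_words = c("must", "have", "of", "experience")
)
print(cleaned)
```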
Tokenization of the text body into unigrams (one word), bigrams (two words), and trigrams (three words)
#Unigrams
unigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE) }
unigram <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, 20)))
#Bigrams
bigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) }
bigram <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
#Trigrams
trigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE) }
trigram <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, 60),tokenize = trigramTokenizer))
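As a rough illustration of what an n-gram tokenizer produces, the following base R sketch slides a window of `n` words across a sentence. The function `make_ngrams` and the sample sentence are assumptions for illustration. The tokenizers above rely on `ngrams()` and `words()` instead.

```r
# Hypothetical helper: split text on whitespace and join every run of
# n consecutive words into one n-gram
make_ngrams <- function(text, n) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
}

# Invented sentence, for illustration only
sentence <- "natural language processing with deep learning"
example_bigrams <- make_ngrams(sentence, 2)
example_trigrams <- make_ngrams(sentence, 3)
print(example_bigrams)
print(example_trigrams)
```

A sentence of six words yields five bigrams and four trigrams, which is why longer n-grams are sparser and need a larger body of text before their frequencies become informative.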
Plot unigram
#Unigrams
unigramrow <- sort(slam::row_sums(unigram), decreasing=T)
unigramfreq <- data.table(tok = names(unigramrow), freq = unigramrow)
ggplot(unigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = "coral") + theme_bw() +
ggtitle("Top 25 Unigrams") +labs(x = "", y = "")
Plot bigram
#Bigrams
bigramrow <- sort(slam::row_sums(bigram), decreasing=T)
bigramfreq <- data.table(tok = names(bigramrow), freq = bigramrow)
ggplot(bigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = "coral") + theme_bw() +
ggtitle("Top 25 Bigrams") +labs(x = "", y = "")
Plot trigram
#Trigrams
trigramrow <- sort(slam::row_sums(trigram), decreasing=T)
trigramfreq <- data.table(tok = names(trigramrow), freq = trigramrow)
ggplot(trigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = "coral") + theme_bw() +
ggtitle("Top 25 Trigrams") +labs(x = "", y = "")
Create a Word Catalog
Building the word_catalog is the heart of our project. The Indeed database contains 7,000 lengthy job descriptions. Performing a word count or a simple regex on “skills” is very limited – consider, e.g., the sentence “The applicant must be familiar with all aspects of the data process, from collection to analysis, and be adept at communicating their findings.”
In order to capture all of the skills in our data set, we built a list of as many possible words and phrases describing data science skills as might exist in the data set.
We then ran this list back over the data set to find which words and phrases appeared most frequently.
We built this list by combining a deep dive into Regex with an n-gram analysis so that we could see not only what words appeared frequently, but how they appeared together with other words.
We used relevant keywords from both the Indeed data set and the Open Skills data set to supplement this comprehensive list of possible skills.
In the end, our word_catalog contained [put number here] individual words and phrases describing data science skills.
Create a word catalog and search for string in column
#Load word_catalog
dictionary_analyst <- read.csv("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/project_3/Eric_DataAnalystDictionary.csv", header=FALSE, fileEncoding = "UTF-8-BOM")
dictionary_ngrams <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/dictionary/Ngrams_dictionary.csv?token=ASEY4BIFPB7FWOOYPWAQNWTANUIPC", fileEncoding = "UTF-8-BOM")
os <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/dictionary/OS_dictionary_skills.csv?token=ASEY4BNIW2YDLN2N57PN7ELANZUJK", fileEncoding = "UTF-8-BOM")
onet <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/dictionary/ONET%20Technology%20Skills.csv?token=ASEY4BJAZDGVPDDHSSZJZ43ANZUNY", fileEncoding = "UTF-8-BOM")
# create word catalogue with single skill column for merge
dictionary_onet <- onet %>% select(skill = Skill)
dictionary_os <- os %>% select(skill)
#Assign column name to analyst word catalog
names(dictionary_analyst) <- ('skill')
names(dictionary_ngrams) <- ('skill')
#convert to lowercase and remove special characters where needed
dictionary_analyst <- dictionary_analyst %>% mutate(across(where(is.character), tolower))
dictionary_ngrams <- dictionary_ngrams %>% mutate(across(where(is.character), tolower))
dictionary_os <- dictionary_os %>% mutate(across(where(is.character), tolower))
dictionary_onet <- dictionary_onet %>% mutate(across(where(is.character), tolower)) %>% mutate(across(everything(), ~ gsub("[[:punct:]]", "", .x)))
#merge dictionaries and remove duplicates
MyMerge <- function(x, y){
df <- merge(x, y, all = TRUE)
return(df)
}
# merge all four dictionaries and delete duplicate skills
dictionary <- Reduce(MyMerge, list(dictionary_analyst, dictionary_ngrams, dictionary_onet, dictionary_os)) %>% distinct()
# remove common character strings found within words and phrases to analyze separately
# remove skills that should be dropped when they appear alone but kept when part of a phrase
dictionary$skill <- str_remove(dictionary$skill, "(?! )(ai|science|business)(?! )")
#remove short character skills that will be picked up within words
dictionary$skill <- str_remove(dictionary$skill, "(?:^|\\W)(r|c|ms|go)(?:$|\\W)")
# turn word_catalog into vector and remove empty rows
dictionary <- dictionary[!apply(dictionary == "", 1, all),]
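The merge-and-deduplicate step can be pictured with plain character vectors. The two small catalogs below are invented, and the one-line `unique(tolower(...))` approach is a simplified stand-in for the data frame merge used above.

```r
# Invented mini catalogs, for illustration only
catalog_a <- c("Python", "SQL", "machine learning")
catalog_b <- c("sql", "Tableau", "")

# Normalize case and whitespace, combine, drop duplicates and empty entries
merged <- unique(tolower(trimws(c(catalog_a, catalog_b))))
merged <- merged[merged != ""]
print(merged)
```

Normalizing before deduplicating matters here: without `tolower`, "SQL" and "sql" would be kept as two separate skills.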
Detecting words in the word_catalog we created
Now we are ready to detect word and phrase frequencies in our data set. We will separate out job positions that include “data scientist” from those that include “data analyst” because we cannot answer our research question without investigating whether data science is really just a new fancy term for data analysis.
The detection will be done in two parts. The first will be counts of small words that are hard to isolate in large descriptions. The second will run all other identified skills in our word_catalog through our job descriptions. The resulting skill frequencies of these efforts will be merged for analysis.
Get counts for r, ms, ai, go
data_sci_count<-data.frame("r" = sum(str_count(data_scientists$description, " r | r,| r\\.")), "ms" = sum(str_count(data_scientists$description, " ms | ms,| ms\\.")), Go = sum(str_count(data_scientists$description, " go ")), "AI" = sum(str_count(data_scientists$description, " ai | ai,| ai\\.")))
data_ana_count<-data.frame("r" = sum(str_count(data_analysts$description, " r | r,| r\\.")), "ms" = sum(str_count(data_analysts$description, " ms | ms,| ms\\.")), Go = sum(str_count(data_analysts$description, " go ")), "AI" = sum(str_count(data_analysts$description, " ai | ai,| ai\\.")))
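Very short skill names need separate handling because they occur inside many ordinary words. The base R sketch below counts a token only when it is not attached to other letters or digits. The helper `count_token`, its lookaround pattern, and the sample sentence are assumptions for illustration, and the pattern differs from the space-and-punctuation patterns used above.

```r
# Hypothetical helper: count standalone occurrences of a short token,
# i.e. occurrences not preceded or followed by a letter or digit
count_token <- function(text, token) {
  pattern <- paste0("(?<![[:alnum:]])", token, "(?![[:alnum:]])")
  matches <- gregexpr(pattern, text, perl = TRUE)[[1]]
  sum(matches > 0)  # gregexpr returns -1 when there is no match
}

# Invented description, for illustration only
description <- "experience with r, python and sql. r is required"
r_count <- count_token(description, "r")
print(r_count)
```

A plain substring search for "r" would also hit "experience", "required" and many other words, which is why these tokens are counted separately from the rest of the catalog.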
Detect word_catalog words in original descriptions
One way to assess how important a particular skill is to count how many times each word from our word_catalog is mentioned throughout the dataset. Here, we’re looking for an overall count of the word_catalog words that show up most frequently in the Kaggle dataset of job descriptions. We’ll run these counts on both ‘data scientist’ and ‘data analyst’ job descriptions.
# Pulls skills out of description based on catalog
setDT(data_scientists)[, skills := paste(dictionary[unlist(lapply(dictionary, function(x) grepl(x, description, ignore.case = T)))], collapse = ","), by = 1:nrow(data_scientists)]
setDT(data_analysts)[, skills := paste(dictionary[unlist(lapply(dictionary, function(x) grepl(x, description, ignore.case = T)))], collapse = ","), by = 1:nrow(data_analysts)]
# Create a count of skills for data science
skillsfreq_ds <- data_scientists %>%
separate_rows(skills, sep = ',') %>%
group_by(skills = tolower(skills)) %>%
summarise(count = n())
# Create a count of skills for data analyst
skillsfreq_da <- data_analysts %>%
separate_rows(skills, sep = ',') %>%
group_by(skills = tolower(skills)) %>%
summarise(count = n())
# Merge counts for r, ms, ai, go
data_sci_count <- data_sci_count %>% gather(skills, count, 1:4)
data_ana_count <- data_ana_count %>% gather(skills, count, 1:4)
skillsfreq_ds <- merge(skillsfreq_ds, data_sci_count, all = TRUE)
skillsfreq_da <- merge(skillsfreq_da, data_ana_count, all = TRUE)
# Merge top data skills for data scientist and data analyst
skillsfreq_all <- full_join(skillsfreq_ds ,skillsfreq_da,by="skills") %>% rename(count_ds = count.x, count_da = count.y)
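The detection step boils down to asking, for every catalog term, which descriptions contain it, and then adding up the hits. A minimal base R sketch follows. The `catalog` and `descriptions` vectors are invented, and literal (`fixed = TRUE`) matching is a simplification of the regex matching used above.

```r
# Invented catalog and descriptions, for illustration only
catalog <- c("python", "sql", "machine learning")
descriptions <- c(
  "python and sql required",
  "machine learning with python",
  "excel only"
)

# One column per catalog term, one row per description;
# TRUE where the description contains the term
hits <- sapply(catalog, function(term) grepl(term, descriptions, fixed = TRUE))

# Number of descriptions mentioning each term, most frequent first
skill_counts <- sort(colSums(hits), decreasing = TRUE)
print(skill_counts)
```

Counting this way gives the number of descriptions that mention a term at least once, not the total number of mentions.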
Let’s take a look at the top skills that show up for each of “data scientist” and “data analyst”:
Table of top data scientist skills
top_skills_ds <- skillsfreq_ds %>% arrange(desc(count)) %>% select(skills, count) %>% mutate(rank = row_number())
top_skills_ds <- left_join(top_skills_ds, ds_jd_count) %>% rename(jd_count = freq) %>% unique.data.frame()
## Joining, by = "skills"
findings_table_ds <-(top_skills_ds) %>%
kbl(caption = "Top Skills") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(height = "400px")
findings_table_ds
row | skills | count | rank | jd_count |
---|---|---|---|---|
1 | python | 5250 | 1 | 1750 |
4 | machine learning | 1693 | 2 | 1693 |
5 | design | 1468 | 3 | 1468 |
6 | computer | 1460 | 4 | 1460 |
7 | science | 1384 | 5 | 1384 |
8 | research | 1254 | 6 | 1254 |
9 | statistics | 1249 | 7 | 1249 |
10 | sql | 1172 | 8 | 1172 |
11 | communication | 1168 | 9 | 1168 |
12 | r | 1118 | 10 | 950 |
13 | math | 1102 | 11 | 1102 |
14 | solutions | 1056 | 12 | 1056 |
15 | algorithms | 975 | 13 | 975 |
16 | programming | 960 | 14 | 960 |
17 | leader | 905 | 15 | 905 |
18 | organization | 844 | 16 | 844 |
19 | passion | 821 | 17 | 821 |
20 | analytical | 799 | 18 | 799 |
21 | phd | 789 | 19 | 789 |
22 | scala | 789 | 20 | 789 |
23 | quantitative | 779 | 21 | 779 |
24 | spark | 779 | 22 | 779 |
25 | mathematics | 768 | 23 | 768 |
26 | communication skills | 766 | 24 | 766 |
27 | java | 765 | 25 | 765 |
28 | AI | 729 | 26 | NA |
29 | vision | 714 | 27 | 714 |
30 | hadoop | 706 | 28 | 706 |
31 | database | 699 | 29 | 699 |
32 | big data | 695 | 30 | 695 |
33 | written | 684 | 31 | 684 |
34 | ml | 617 | 32 | 617 |
35 | visualization | 578 | 33 | 578 |
36 | years of experience | 576 | 34 | 576 |
37 | data sets | 566 | 35 | 566 |
38 | leadership | 536 | 36 | 536 |
39 | data analysis | 532 | 37 | 532 |
40 | git | 516 | 38 | 516 |
41 | collaborative | 511 | 39 | 511 |
42 | office | 506 | 40 | 506 |
43 | data mining | 483 | 41 | 483 |
44 | verbal | 481 | 42 | 481 |
45 | innovation | 471 | 43 | 471 |
46 | presentation | 463 | 44 | 463 |
47 | creating | 451 | 45 | 451 |
48 | sas | 445 | 46 | 445 |
49 | deep learning | 432 | 47 | 432 |
50 | large data | 398 | 48 | 398 |
51 | natural language | 390 | 49 | 390 |
52 | data engineer | 389 | 50 | 389 |
53 | physics | 337 | 51 | 337 |
54 | collaboration | 329 | 52 | 329 |
55 | software development | 327 | 53 | 327 |
56 | language processing | 320 | 54 | 320 |
57 | artificial intelligence | 318 | 55 | 318 |
58 | natural language processing | 318 | 56 | 318 |
59 | economics | 315 | 57 | 315 |
60 | learning algorithms | 315 | 58 | 315 |
61 | programming languages | 313 | 59 | 313 |
62 | business problems | 312 | 60 | 312 |
63 | data visualization | 312 | 61 | 312 |
64 | machine learning techniques | 298 | 62 | 298 |
65 | writing | 295 | 63 | 295 |
66 | rtable | 291 | 64 | 291 |
67 | data analytics | 285 | 65 | 285 |
68 | tableau | 279 | 66 | 279 |
69 | machine learning algorithms | 278 | 67 | 278 |
70 | matlab | 277 | 68 | 277 |
71 | consulting | 274 | 69 | 274 |
72 | problem solving | 273 | 70 | 273 |
73 | learning models | 268 | 71 | 268 |
74 | nlp | 254 | 72 | 254 |
75 | influence | 251 | 73 | 251 |
76 | linux | 251 | 74 | 251 |
77 | flexible | 244 | 75 | 244 |
78 | etl | 239 | 76 | 239 |
79 | statistical modeling | 237 | 77 | 237 |
80 | large scale | 235 | 78 | 235 |
81 | nosql | 232 | 79 | 232 |
82 | machine learning models | 231 | 80 | 231 |
83 | data processing | 227 | 81 | 227 |
84 | large data sets | 227 | 82 | 227 |
85 | data pipeline | 225 | 83 | 225 |
86 | ms | 223 | 84 | 253 |
87 | predictive models | 211 | 85 | 211 |
88 | interpersonal | 201 | 86 | 201 |
89 | masters | 201 | 87 | 201 |
90 | data engineering | 194 | 88 | 194 |
91 | software engineers | 194 | 89 | 194 |
92 | data pipelines | 179 | 90 | 179 |
93 | decision making | 173 | 91 | 173 |
94 | organizational | 172 | 92 | 172 |
95 | forecasting | 169 | 93 | 169 |
96 | bachelor’s degree | 164 | 94 | 164 |
97 | monitoring | 162 | 95 | 162 |
98 | data management | 161 | 96 | 161 |
99 | have experience | 157 | 97 | 157 |
100 | creativity | 155 | 98 | 155 |
101 | microsoft | 148 | 99 | 148 |
102 | predictive analytics | 146 | 100 | 146 |
103 | project management | 141 | 101 | 141 |
104 | data collection | 132 | 102 | 132 |
105 | azure | 127 | 103 | 127 |
106 | javascript | 127 | 104 | 127 |
107 | modeling techniques | 122 | 105 | 122 |
108 | data architecture | 120 | 106 | 120 |
109 | data models | 116 | 107 | 116 |
110 | sap | 112 | 108 | 112 |
111 | unix | 112 | 109 | 112 |
112 | Go | 107 | 110 | NA |
113 | array | 104 | 111 | 104 |
114 | modelling | 102 | 112 | 102 |
115 | ruby | 98 | 113 | 98 |
116 | work independently | 97 | 114 | 97 |
117 | data warehousing | 95 | 115 | 95 |
118 | | 95 | 116 | 95 |
119 | mysql | 92 | 117 | 92 |
120 | powerpoint | 84 | 118 | 84 |
121 | language understanding | 83 | 119 | 83 |
122 | ecommerce | 80 | 120 | 80 |
123 | data systems | 78 | 121 | 78 |
124 | solving problems | 77 | 122 | 77 |
125 | data extraction | 75 | 123 | 75 |
126 | mongodb | 75 | 124 | 75 |
127 | apache spark | 65 | 125 | 65 |
128 | natural language understanding | 64 | 126 | 64 |
129 | critical thinking | 62 | 127 | 62 |
130 | kpmg | 60 | 128 | 60 |
131 | data integration | 59 | 129 | 59 |
132 | big data architecture | 57 | 130 | 57 |
133 | github | 57 | 131 | 57 |
134 | postgresql | 56 | 132 | 56 |
135 | data manipulation | 51 | 133 | 51 |
136 | highly motivated | 51 | 134 | 51 |
137 | masters degree | 49 | 135 | 49 |
138 | elasticsearch | 47 | 136 | 47 |
139 | speaking | 47 | 137 | 47 |
140 | architecture capabilities | 46 | 138 | 46 |
141 | bachelors | 46 | 139 | 46 |
142 | covering technologies | 46 | 140 | 46 |
143 | multi-task | 46 | 141 | 46 |
144 | bash | 45 | 142 | 45 |
145 | nlu | 40 | 143 | 40 |
146 | shell script | 40 | 144 | 40 |
147 | troubleshooting | 40 | 145 | 40 |
148 | youtube | 40 | 146 | 40 |
149 | coordination | 39 | 147 | 39 |
150 | data gathering | 39 | 148 | 39 |
151 | time management | 38 | 149 | 38 |
152 | methodological | 34 | 150 | 34 |
153 | microsoft office | 31 | 151 | 31 |
154 | django | 30 | 152 | 30 |
155 | google analytics | 30 | 153 | 30 |
156 | market research | 29 | 154 | 29 |
157 | network analysis | 28 | 155 | 28 |
158 | data insights | 27 | 156 | 27 |
159 | microsoft azure | 23 | 157 | 23 |
160 | data preparation | 22 | 158 | 22 |
161 | negotiation | 21 | 159 | 21 |
162 | sales and marketing | 19 | 160 | 19 |
163 | vba | 19 | 161 | 19 |
164 | jupyter notebook | 18 | 162 | 18 |
165 | microstrategy | 18 | 163 | 18 |
166 | doctorate degree | 17 | 164 | 17 |
167 | manage multiple projects | 16 | 165 | 16 |
168 | analytics data | 15 | 166 | 15 |
169 | microsoft excel | 15 | 167 | 15 |
170 | machine learning data | 13 | 168 | 13 |
171 | strategic thinking | 13 | 169 | 13 |
172 | highly organized | 10 | 170 | 10 |
173 | jquery | 10 | 171 | 10 |
174 | | 9 | 172 | NA |
175 | grammatical | 9 | 173 | 9 |
176 | symantec | 9 | 174 | 9 |
177 | telecommunications | 9 | 175 | 9 |
178 | data entry | 8 | 176 | 8 |
179 | data mapping | 8 | 177 | 8 |
180 | data reporting | 8 | 178 | 8 |
181 | eko | 8 | 179 | 8 |
182 | active learning | 7 | 180 | 7 |
183 | apache hadoop | 7 | 181 | 7 |
184 | confluence | 7 | 182 | 7 |
185 | amazon redshift | 6 | 183 | 6 |
186 | data transfer | 6 | 184 | 6 |
187 | microsoft word | 6 | 185 | 6 |
188 | swift | 6 | 186 | 6 |
189 | complex problem solving | 5 | 187 | 5 |
190 | english language | 5 | 188 | 5 |
191 | ibm db2 | 5 | 189 | 5 |
192 | microsoft sql server | 5 | 190 | 5 |
193 | service orientation | 5 | 191 | 5 |
194 | unix shell | 5 | 192 | 5 |
195 | apache kafka | 4 | 193 | 4 |
196 | operations analysis | 4 | 194 | 4 |
197 | see the big picture | 4 | 195 | 4 |
198 | apache hive | 3 | 196 | 3 |
199 | data interpretation | 3 | 197 | 3 |
200 | experience in information technology | 3 | 198 | 3 |
201 | microsoft access | 3 | 199 | 3 |
202 | minitab | 3 | 200 | 3 |
203 | skype | 3 | 201 | 3 |
204 | systems analysis | 3 | 202 | 3 |
205 | technology design | 3 | 203 | 3 |
206 | ubuntu | 3 | 204 | 3 |
207 | work well in a team | 3 | 205 | 3 |
208 | active listening | 2 | 206 | 2 |
209 | citrix | 2 | 207 | 2 |
210 | clerical | 2 | 208 | 2 |
211 | client management | 2 | 209 | 2 |
212 | data cleanup | 2 | 210 | 2 |
213 | data organization | 2 | 211 | 2 |
214 | google adwords | 2 | 212 | 2 |
215 | mathematical reasoning | 2 | 213 | 2 |
216 | microsoft outlook | 2 | 214 | 2 |
217 | microsoft powerpoint | 2 | 215 | 2 |
218 | amazon dynamodb | 1 | 216 | 1 |
219 | bring creativity | 1 | 217 | 1 |
220 | design development | 1 | 218 | 1 |
221 | engineering and technology | 1 | 219 | 1 |
222 | filemaker pro | 1 | 220 | 1 |
223 | ibm infosphere datastage | 1 | 221 | 1 |
224 | judgment and decision making | 1 | 222 | 1 |
225 | microsoft dynamics | 1 | 223 | 1 |
226 | microsoft windows server | 1 | 224 | 1 |
227 | oracle hyperion | 1 | 225 | 1 |
228 | oracle java | 1 | 226 | 1 |
229 | organizational management | 1 | 227 | 1 |
230 | prepare data for analysis | 1 | 228 | 1 |
231 | quality control analysis | 1 | 229 | 1 |
232 | reading comprehension | 1 | 230 | 1 |
233 | report creation | 1 | 231 | 1 |
234 | teradata database | 1 | 232 | 1 |
235 | wireshark | 1 | 233 | 1 |
Table of top data analyst skills
top_skills_da <- skillsfreq_da %>% arrange(desc(count)) %>% select(skills, count) %>% mutate(rank = row_number())
top_skills_da <- left_join(top_skills_da, da_jd_count) %>% rename(jd_count = freq) %>% unique.data.frame()
## Joining, by = "skills"
findings_table_da <-(top_skills_da) %>%
kbl(caption = "Top Skills") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
scroll_box(height = "400px")
findings_table_da
row | skills | count | rank | jd_count |
---|---|---|---|---|
1 | python | 1275 | 1 | 425 |
4 | research | 785 | 2 | 785 |
5 | communication | 776 | 3 | 776 |
6 | analytical | 653 | 4 | 653 |
7 | design | 627 | 5 | 627 |
8 | organization | 605 | 6 | 605 |
9 | written | 514 | 7 | 514 |
10 | quantitative | 499 | 8 | 499 |
11 | communication skills | 488 | 9 | 488 |
12 | leader | 484 | 10 | 484 |
13 | sql | 467 | 11 | 467 |
14 | statistics | 466 | 12 | 466 |
15 | office | 463 | 13 | 463 |
16 | math | 403 | 14 | 403 |
17 | r | 401 | 15 | 326 |
18 | solutions | 400 | 16 | 400 |
19 | computer | 380 | 17 | 380 |
20 | presentation | 373 | 18 | 373 |
21 | database | 367 | 19 | 367 |
22 | vision | 358 | 20 | 358 |
23 | leadership | 327 | 21 | 327 |
24 | verbal | 324 | 22 | 324 |
25 | passion | 323 | 23 | 323 |
26 | programming | 313 | 24 | 313 |
27 | science | 312 | 25 | 312 |
28 | sas | 309 | 26 | 309 |
29 | data analysis | 304 | 27 | 304 |
30 | writing | 300 | 28 | 300 |
31 | years of experience | 286 | 29 | 286 |
32 | collaborative | 284 | 30 | 284 |
33 | economics | 268 | 31 | 268 |
34 | mathematics | 257 | 32 | 257 |
35 | visualization | 243 | 33 | 243 |
36 | organizational | 224 | 34 | 224 |
37 | microsoft | 223 | 35 | 223 |
38 | innovation | 207 | 36 | 207 |
39 | machine learning | 207 | 37 | 207 |
40 | data sets | 206 | 38 | 206 |
41 | tableau | 203 | 39 | 203 |
42 | interpersonal | 198 | 40 | 198 |
43 | creating | 196 | 41 | 196 |
44 | git | 195 | 42 | 195 |
45 | powerpoint | 190 | 43 | 190 |
46 | consulting | 182 | 44 | 182 |
47 | ms | 173 | 45 | 165 |
48 | problem solving | 172 | 46 | 172 |
49 | data visualization | 162 | 47 | 162 |
50 | project management | 159 | 48 | 159 |
51 | collaboration | 154 | 49 | 154 |
52 | phd | 151 | 50 | 151 |
53 | bachelor’s degree | 149 | 51 | 149 |
54 | data analytics | 140 | 52 | 140 |
55 | flexible | 137 | 53 | 137 |
56 | java | 134 | 54 | 134 |
57 | ml | 134 | 55 | 134 |
58 | large data | 133 | 56 | 133 |
59 | algorithms | 130 | 57 | 130 |
60 | rtable | 128 | 58 | 128 |
61 | work independently | 125 | 59 | 125 |
62 | data collection | 121 | 60 | 121 |
63 | big data | 120 | 61 | 120 |
64 | data management | 120 | 62 | 120 |
65 | influence | 118 | 63 | 118 |
66 | monitoring | 118 | 64 | 118 |
67 | decision making | 114 | 65 | 114 |
68 | scala | 113 | 66 | 113 |
69 | data mining | 105 | 67 | 105 |
70 | microsoft office | 100 | 68 | 100 |
71 | forecasting | 96 | 69 | 96 |
72 | market research | 95 | 70 | 95 |
73 | hadoop | 92 | 71 | 92 |
74 | matlab | 92 | 72 | 92 |
75 | physics | 89 | 73 | 89 |
76 | programming languages | 88 | 74 | 88 |
77 | business problems | 87 | 75 | 87 |
78 | spark | 87 | 76 | 87 |
79 | masters | 85 | 77 | 85 |
80 | data engineer | 81 | 78 | 81 |
81 | critical thinking | 77 | 79 | 77 |
82 | multi-task | 77 | 80 | 77 |
83 | Go | 74 | 81 | NA |
84 | etl | 73 | 82 | 73 |
85 | large data sets | 72 | 83 | 72 |
86 | coordination | 68 | 84 | 68 |
87 | microsoft excel | 68 | 85 | 68 |
88 | statistical modeling | 61 | 86 | 61 |
89 | vba | 60 | 87 | 60 |
90 | have experience | 58 | 88 | 58 |
91 | | 53 | 89 | 53 |
92 | sap | 53 | 90 | 53 |
93 | highly motivated | 52 | 91 | 52 |
94 | time management | 51 | 92 | 51 |
95 | creativity | 50 | 93 | 50 |
96 | data engineering | 50 | 94 | 50 |
97 | linux | 46 | 95 | 46 |
98 | bachelors | 41 | 96 | 41 |
99 | google analytics | 41 | 97 | 41 |
100 | software engineers | 41 | 98 | 41 |
101 | data processing | 40 | 99 | 40 |
102 | modelling | 39 | 100 | 39 |
103 | predictive analytics | 39 | 101 | 39 |
104 | predictive models | 39 | 102 | 39 |
105 | software development | 38 | 103 | 38 |
106 | javascript | 37 | 104 | 37 |
107 | modeling techniques | 36 | 105 | 36 |
108 | deep learning | 35 | 106 | 35 |
109 | array | 34 | 107 | 34 |
110 | troubleshooting | 34 | 108 | 34 |
111 | unix | 34 | 109 | 34 |
112 | artificial intelligence | 33 | 110 | 33 |
113 | machine learning techniques | 33 | 111 | 33 |
114 | data manipulation | 31 | 112 | 31 |
115 | data models | 31 | 113 | 31 |
116 | data systems | 31 | 114 | 31 |
117 | natural language | 31 | 115 | 31 |
118 | data warehousing | 30 | 116 | 30 |
119 | mysql | 30 | 117 | 30 |
120 | solving problems | 30 | 118 | 30 |
121 | youtube | 29 | 119 | 29 |
122 | data integration | 28 | 120 | 28 |
123 | manage multiple projects | 28 | 121 | 28 |
124 | masters degree | 27 | 122 | 27 |
125 | data pipeline | 26 | 123 | 26 |
126 | ecommerce | 26 | 124 | 26 |
127 | learning algorithms | 26 | 125 | 26 |
128 | methodological | 26 | 126 | 26 |
129 | speaking | 26 | 127 | 26 |
130 | AI | 25 | 128 | NA |
131 | data extraction | 25 | 129 | 25 |
132 | language processing | 25 | 130 | 25 |
133 | data gathering | 24 | 131 | 24 |
134 | machine learning algorithms | 24 | 132 | 24 |
135 | natural language processing | 24 | 133 | 24 |
136 | learning models | 23 | 134 | 23 |
137 | large scale | 22 | 135 | 22 |
138 | nosql | 22 | 136 | 22 |
139 | machine learning models | 21 | 137 | 21 |
140 | azure | 19 | 138 | 19 |
141 | data entry | 19 | 139 | 19 |
142 | data insights | 19 | 140 | 19 |
143 | doctorate degree | 19 | 141 | 19 |
144 | microsoft word | 18 | 142 | 18 |
145 | highly organized | 17 | 143 | 17 |
146 | ruby | 17 | 144 | 17 |
147 | data pipelines | 16 | 145 | 16 |
148 | data reporting | 15 | 146 | 15 |
149 | negotiation | 15 | 147 | 15 |
150 | systems analysis | 15 | 148 | 15 |
151 | microsoft powerpoint | 14 | 149 | 14 |
152 | nlp | 14 | 150 | 14 |
153 | data architecture | 13 | 151 | 13 |
154 | bash | 12 | 152 | 12 |
155 | network analysis | 12 | 153 | 12 |
156 | elasticsearch | 11 | 154 | 11 |
157 | postgresql | 11 | 155 | 11 |
158 | service orientation | 11 | 156 | 11 |
159 | strategic thinking | 11 | 157 | 11 |
160 | english language | 10 | 158 | 10 |
161 | github | 10 | 159 | 10 |
162 | mongodb | 10 | 160 | 10 |
163 | data preparation | 8 | 161 | 8 |
164 | data transfer | 8 | 162 | 8 |
165 | analytics data | 7 | 163 | 7 |
166 | client management | 7 | 164 | 7 |
167 | microsoft access | 7 | 165 | 7 |
168 | | 6 | 166 | NA |
169 | apache spark | 6 | 167 | 6 |
170 | microsoft project | 6 | 168 | 6 |
171 | shell script | 6 | 169 | 6 |
172 | jupyter notebook | 5 | 170 | 5 |
173 | language understanding | 5 | 171 | 5 |
174 | microstrategy | 5 | 172 | 5 |
175 | sales and marketing | 5 | 173 | 5 |
176 | symantec | 5 | 174 | 5 |
177 | data interpretation | 4 | 175 | 4 |
178 | grammatical | 4 | 176 | 4 |
179 | minitab | 4 | 177 | 4 |
180 | natural language understanding | 4 | 178 | 4 |
181 | report creation | 4 | 179 | 4 |
182 | amazon redshift | 3 | 180 | 3 |
183 | complex problem solving | 3 | 181 | 3 |
184 | data mapping | 3 | 182 | 3 |
185 | data storytelling | 3 | 183 | 3 |
186 | kpmg | 3 | 184 | 3 |
187 | lexisnexis | 3 | 185 | 3 |
188 | microsoft outlook | 3 | 186 | 3 |
189 | telecommunications | 3 | 187 | 3 |
190 | work well in a team | 3 | 188 | 3 |
191 | active learning | 2 | 189 | 2 |
192 | administration and management | 2 | 190 | 2 |
193 | ajax | 2 | 191 | 2 |
194 | apache hadoop | 2 | 192 | 2 |
195 | clerical | 2 | 193 | 2 |
196 | confluence | 2 | 194 | 2 |
197 | experience in market research | 2 | 195 | 2 |
198 | google docs | 2 | 196 | 2 |
199 | mcafee | 2 | 197 | 2 |
200 | microsoft azure | 2 | 198 | 2 |
201 | nlu | 2 | 199 | 2 |
202 | operations analysis | 2 | 200 | 2 |
203 | see the big picture | 2 | 201 | 2 |
204 | swift | 2 | 202 | 2 |
205 | wireshark | 2 | 203 | 2 |
206 | apache kafka | 1 | 204 | 1 |
207 | apache tomcat | 1 | 205 | 1 |
208 | data cleanup | 1 | 206 | 1 |
209 | data organization | 1 | 207 | 1 |
210 | datadriven | 1 | 208 | 1 |
211 | deductive reasoning | 1 | 209 | 1 |
212 | design development | 1 | 210 | 1 |
213 | django | 1 | 211 | 1 |
214 | eko | 1 | 212 | 1 |
215 | epic systems | 1 | 213 | 1 |
216 | experience in information technology | 1 | 214 | 1 |
217 | filemaker pro | 1 | 215 | 1 |
218 | google adwords | 1 | 216 | 1 |
219 | ibm db2 | 1 | 217 | 1 |
220 | jquery | 1 | 218 | 1 |
221 | judgment and decision making | 1 | 219 | 1 |
222 | machine learning data | 1 | 220 | 1 |
223 | mathematical reasoning | 1 | 221 | 1 |
224 | microsoft dynamics | 1 | 222 | 1 |
225 | microsoft sharepoint | 1 | 223 | 1 |
226 | microsoft sql server | 1 | 224 | 1 |
227 | microsoft sql server reporting services | 1 | 225 | 1 |
228 | microsoft windows server | 1 | 226 | 1 |
229 | oracle hyperion | 1 | 227 | 1 |
230 | organizational management | 1 | 228 | 1 |
231 | processing information | 1 | 229 | 1 |
232 | reading comprehension | 1 | 230 | 1 |
233 | skype | 1 | 231 | 1 |
234 | systems evaluation | 1 | 232 | 1 |
235 | tax software | 1 | 233 | 1 |
236 | technology design | 1 | 234 | 1 |
237 | ubuntu | 1 | 235 | 1 |
238 | unix shell | 1 | 236 | 1 |
A few skills stand out for both positions (Python and R among them). To compare the importance of each skill between the two roles, however, we need to look at frequency in a slightly different way.
Count number of job descriptions containing each word-catalog entry
Here, we compare how many job descriptions within each dataset contain each word_catalog entry. Again, we run this code on both “data scientist” and “data analyst” job descriptions. Counting job descriptions rather than raw occurrences lets us calculate, for each skill, the proportion of descriptions in each dataset that mention it, which in turn lets us compare the skill's relative importance to each role.
# Short tokens ("r", "ms", "go", "ai") are matched with explicit delimiters so that
# they are not counted inside longer words. The dot is escaped (\\.) so that it
# matches a literal period rather than any character.
# data_sci_jd_count <- tibble(
#   "r" = nrow(filter(data_scientists, str_detect(data_scientists$description, " r | r,| r\\."))),
#   "ms" = nrow(filter(data_scientists, str_detect(data_scientists$description, " ms | ms,| ms\\."))),
#   "go" = nrow(filter(data_scientists, str_detect(data_scientists$description, " go "))),
#   "ai" = nrow(filter(data_scientists, str_detect(data_scientists$description, " ai | ai,| ai\\.")))) %>%
#   pivot_longer(names_to = "skills", cols = c("r", "ms", "go", "ai"), values_to = "freq")
#
# # Number of data scientist job descriptions containing each dictionary entry
# dict_dsjd_freq <- tibble(
#   "skills" = dictionary,
#   "freq" = lapply(dictionary, function(x) {
#     nrow(filter(data_scientists, str_detect(data_scientists$description, x)))
#   }) %>% as.vector(mode = "integer"))
#
# ds_jd_count <- union_all(data_sci_jd_count, dict_dsjd_freq)
#
# # Same two steps for the data analyst job descriptions
# data_ana_jd_count <- tibble(
#   "r" = nrow(filter(data_analysts, str_detect(data_analysts$description, " r | r,| r\\."))),
#   "ms" = nrow(filter(data_analysts, str_detect(data_analysts$description, " ms | ms,| ms\\."))),
#   "go" = nrow(filter(data_analysts, str_detect(data_analysts$description, " go "))),
#   "ai" = nrow(filter(data_analysts, str_detect(data_analysts$description, " ai | ai,| ai\\.")))) %>%
#   pivot_longer(names_to = "skills", cols = c("r", "ms", "go", "ai"), values_to = "freq")
#
# dict_dajd_freq <- tibble(
#   "skills" = dictionary,
#   "freq" = lapply(dictionary, function(x) {
#     nrow(filter(data_analysts, str_detect(data_analysts$description, x)))
#   }) %>% as.vector(mode = "integer"))
#
# da_jd_count <- union_all(data_ana_jd_count, dict_dajd_freq)
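As a rough illustration of the counting idea, the sketch below works on a handful of made-up descriptions rather than the Indeed data. The `descriptions` and `catalog` vectors are hypothetical, and base R `grepl` stands in for `str_detect` so the example runs without extra packages. It also shows why short tokens such as "r" need a delimited pattern.

```r
# Hypothetical toy data; the real analysis uses the Indeed job descriptions.
descriptions <- c(
  "experience with python and sql required",
  "python, python and more python; knowledge of r is a plus",
  "strong communication skills and sql, great for career growth"
)
catalog <- c("python", "sql", "communication")

# Count descriptions (not occurrences) that contain each catalog entry.
jd_count <- vapply(
  catalog,
  function(term) sum(grepl(term, descriptions, fixed = TRUE)),
  integer(1)
)

# Proportion of descriptions mentioning each skill.
jd_summary <- data.frame(
  skills = catalog,
  jd_count = unname(jd_count),
  jd_percent = round(unname(jd_count) / length(descriptions), 3)
)

# Short tokens need delimiters: a bare "r" matches almost any text,
# while the delimited pattern only matches "r" as a standalone token.
naive_r <- sum(grepl("r", descriptions, fixed = TRUE))
delimited_r <- sum(grepl(" r | r,| r\\.", descriptions))

print(jd_summary)
```

Here "python" is counted once for the second description even though it appears three times, which is the difference between the description count and the raw frequency.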
Table of top skills across both jobs: count, rank within dataset, and number and percent of job descriptions
To facilitate analysis, let’s join the results from the two datasets and add some comparative metrics:
# Join the two result sets on skill name and give the suffixed columns clearer names
top_skills_all <- full_join(top_skills_ds, top_skills_da, by = "skills") %>%
  rename(count_ds = count.x, count_da = count.y,
         rank_ds = rank.x, rank_da = rank.y,
         jd_count_ds = jd_count.x, jd_count_da = jd_count.y) %>%
  select(skills, count_ds, rank_ds, jd_count_ds, count_da, rank_da, jd_count_da) %>%
  # Comparative metrics: average rank, share of job descriptions, mentions per description
  mutate(avg_rank = (rank_ds + rank_da) / 2,
         jd_percent_ds = round(jd_count_ds / nrow(data_scientists), 3),
         freq_per_jd_ds = round(count_ds / jd_count_ds, 3),
         jd_percent_da = round(jd_count_da / nrow(data_analysts), 3),
         freq_per_jd_da = round(count_da / jd_count_da, 3)) %>%
  arrange(avg_rank)
findings_table_all <- top_skills_all %>%
  kbl(caption = "Top Skills") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(height = "400px")
findings_table_all
skills | count_ds | rank_ds | jd_count_ds | count_da | rank_da | jd_count_da | avg_rank | jd_percent_ds | freq_per_jd_ds | jd_percent_da | freq_per_jd_da |
---|---|---|---|---|---|---|---|---|---|---|---|
python | 5250 | 1 | 1750 | 1275 | 1 | 425 | 1.0 | 0.727 | 3.000 | 0.348 | 3.000 |
design | 1468 | 3 | 1468 | 627 | 5 | 627 | 4.0 | 0.610 | 1.000 | 0.514 | 1.000 |
research | 1254 | 6 | 1254 | 785 | 2 | 785 | 4.0 | 0.521 | 1.000 | 0.643 | 1.000 |
communication | 1168 | 9 | 1168 | 776 | 3 | 776 | 6.0 | 0.485 | 1.000 | 0.636 | 1.000 |
statistics | 1249 | 7 | 1249 | 466 | 12 | 466 | 9.5 | 0.519 | 1.000 | 0.382 | 1.000 |
sql | 1172 | 8 | 1172 | 467 | 11 | 467 | 9.5 | 0.487 | 1.000 | 0.382 | 1.000 |
computer | 1460 | 4 | 1460 | 380 | 17 | 380 | 10.5 | 0.607 | 1.000 | 0.311 | 1.000 |
organization | 844 | 16 | 844 | 605 | 6 | 605 | 11.0 | 0.351 | 1.000 | 0.495 | 1.000 |
analytical | 799 | 18 | 799 | 653 | 4 | 653 | 11.0 | 0.332 | 1.000 | 0.535 | 1.000 |
r | 1118 | 10 | 950 | 401 | 15 | 326 | 12.5 | 0.395 | 1.177 | 0.267 | 1.230 |
math | 1102 | 11 | 1102 | 403 | 14 | 403 | 12.5 | 0.458 | 1.000 | 0.330 | 1.000 |
leader | 905 | 15 | 905 | 484 | 10 | 484 | 12.5 | 0.376 | 1.000 | 0.396 | 1.000 |
solutions | 1056 | 12 | 1056 | 400 | 16 | 400 | 14.0 | 0.439 | 1.000 | 0.328 | 1.000 |
quantitative | 779 | 21 | 779 | 499 | 8 | 499 | 14.5 | 0.324 | 1.000 | 0.409 | 1.000 |
science | 1384 | 5 | 1384 | 312 | 25 | 312 | 15.0 | 0.575 | 1.000 | 0.256 | 1.000 |
communication skills | 766 | 24 | 766 | 488 | 9 | 488 | 16.5 | 0.318 | 1.000 | 0.400 | 1.000 |
programming | 960 | 14 | 960 | 313 | 24 | 313 | 19.0 | 0.399 | 1.000 | 0.256 | 1.000 |
written | 684 | 31 | 684 | 514 | 7 | 514 | 19.0 | 0.284 | 1.000 | 0.421 | 1.000 |
machine learning | 1693 | 2 | 1693 | 207 | 37 | 207 | 19.5 | 0.704 | 1.000 | 0.170 | 1.000 |
passion | 821 | 17 | 821 | 323 | 23 | 323 | 20.0 | 0.341 | 1.000 | 0.265 | 1.000 |
vision | 714 | 27 | 714 | 358 | 20 | 358 | 23.5 | 0.297 | 1.000 | 0.293 | 1.000 |
database | 699 | 29 | 699 | 367 | 19 | 367 | 24.0 | 0.291 | 1.000 | 0.301 | 1.000 |
office | 506 | 40 | 506 | 463 | 13 | 463 | 26.5 | 0.210 | 1.000 | 0.379 | 1.000 |
mathematics | 768 | 23 | 768 | 257 | 32 | 257 | 27.5 | 0.319 | 1.000 | 0.210 | 1.000 |
leadership | 536 | 36 | 536 | 327 | 21 | 327 | 28.5 | 0.223 | 1.000 | 0.268 | 1.000 |
presentation | 463 | 44 | 463 | 373 | 18 | 373 | 31.0 | 0.192 | 1.000 | 0.305 | 1.000 |
years of experience | 576 | 34 | 576 | 286 | 29 | 286 | 31.5 | 0.239 | 1.000 | 0.234 | 1.000 |
data analysis | 532 | 37 | 532 | 304 | 27 | 304 | 32.0 | 0.221 | 1.000 | 0.249 | 1.000 |
verbal | 481 | 42 | 481 | 324 | 22 | 324 | 32.0 | 0.200 | 1.000 | 0.265 | 1.000 |
visualization | 578 | 33 | 578 | 243 | 33 | 243 | 33.0 | 0.240 | 1.000 | 0.199 | 1.000 |
phd | 789 | 19 | 789 | 151 | 50 | 151 | 34.5 | 0.328 | 1.000 | 0.124 | 1.000 |
collaborative | 511 | 39 | 511 | 284 | 30 | 284 | 34.5 | 0.212 | 1.000 | 0.233 | 1.000 |
algorithms | 975 | 13 | 975 | 130 | 57 | 130 | 35.0 | 0.405 | 1.000 | 0.106 | 1.000 |
sas | 445 | 46 | 445 | 309 | 26 | 309 | 36.0 | 0.185 | 1.000 | 0.253 | 1.000 |
data sets | 566 | 35 | 566 | 206 | 38 | 206 | 36.5 | 0.235 | 1.000 | 0.169 | 1.000 |
java | 765 | 25 | 765 | 134 | 54 | 134 | 39.5 | 0.318 | 1.000 | 0.110 | 1.000 |
innovation | 471 | 43 | 471 | 207 | 36 | 207 | 39.5 | 0.196 | 1.000 | 0.170 | 1.000 |
git | 516 | 38 | 516 | 195 | 42 | 195 | 40.0 | 0.214 | 1.000 | 0.160 | 1.000 |
scala | 789 | 20 | 789 | 113 | 66 | 113 | 43.0 | 0.328 | 1.000 | 0.093 | 1.000 |
creating | 451 | 45 | 451 | 196 | 41 | 196 | 43.0 | 0.187 | 1.000 | 0.161 | 1.000 |
ml | 617 | 32 | 617 | 134 | 55 | 134 | 43.5 | 0.256 | 1.000 | 0.110 | 1.000 |
economics | 315 | 57 | 315 | 268 | 31 | 268 | 44.0 | 0.131 | 1.000 | 0.219 | 1.000 |
big data | 695 | 30 | 695 | 120 | 61 | 120 | 45.5 | 0.289 | 1.000 | 0.098 | 1.000 |
writing | 295 | 63 | 295 | 300 | 28 | 300 | 45.5 | 0.123 | 1.000 | 0.246 | 1.000 |
spark | 779 | 22 | 779 | 87 | 76 | 87 | 49.0 | 0.324 | 1.000 | 0.071 | 1.000 |
hadoop | 706 | 28 | 706 | 92 | 71 | 92 | 49.5 | 0.293 | 1.000 | 0.075 | 1.000 |
collaboration | 329 | 52 | 329 | 154 | 49 | 154 | 50.5 | 0.137 | 1.000 | 0.126 | 1.000 |
large data | 398 | 48 | 398 | 133 | 56 | 133 | 52.0 | 0.165 | 1.000 | 0.109 | 1.000 |
tableau | 279 | 66 | 279 | 203 | 39 | 203 | 52.5 | 0.116 | 1.000 | 0.166 | 1.000 |
data mining | 483 | 41 | 483 | 105 | 67 | 105 | 54.0 | 0.201 | 1.000 | 0.086 | 1.000 |
data visualization | 312 | 61 | 312 | 162 | 47 | 162 | 54.0 | 0.130 | 1.000 | 0.133 | 1.000 |
consulting | 274 | 69 | 274 | 182 | 44 | 182 | 56.5 | 0.114 | 1.000 | 0.149 | 1.000 |
problem solving | 273 | 70 | 273 | 172 | 46 | 172 | 58.0 | 0.113 | 1.000 | 0.141 | 1.000 |
data analytics | 285 | 65 | 285 | 140 | 52 | 140 | 58.5 | 0.118 | 1.000 | 0.115 | 1.000 |
rtable | 291 | 64 | 291 | 128 | 58 | 128 | 61.0 | 0.121 | 1.000 | 0.105 | 1.000 |
physics | 337 | 51 | 337 | 89 | 73 | 89 | 62.0 | 0.140 | 1.000 | 0.073 | 1.000 |
interpersonal | 201 | 86 | 201 | 198 | 40 | 198 | 63.0 | 0.084 | 1.000 | 0.162 | 1.000 |
organizational | 172 | 92 | 172 | 224 | 34 | 224 | 63.0 | 0.071 | 1.000 | 0.183 | 1.000 |
data engineer | 389 | 50 | 389 | 81 | 78 | 81 | 64.0 | 0.162 | 1.000 | 0.066 | 1.000 |
flexible | 244 | 75 | 244 | 137 | 53 | 137 | 64.0 | 0.101 | 1.000 | 0.112 | 1.000 |
ms | 223 | 84 | 253 | 173 | 45 | 165 | 64.5 | 0.105 | 0.881 | 0.135 | 1.048 |
programming languages | 313 | 59 | 313 | 88 | 74 | 88 | 66.5 | 0.130 | 1.000 | 0.072 | 1.000 |
microsoft | 148 | 99 | 148 | 223 | 35 | 223 | 67.0 | 0.062 | 1.000 | 0.183 | 1.000 |
business problems | 312 | 60 | 312 | 87 | 75 | 87 | 67.5 | 0.130 | 1.000 | 0.071 | 1.000 |
influence | 251 | 73 | 251 | 118 | 63 | 118 | 68.0 | 0.104 | 1.000 | 0.097 | 1.000 |
matlab | 277 | 68 | 277 | 92 | 72 | 92 | 70.0 | 0.115 | 1.000 | 0.075 | 1.000 |
bachelor’s degree | 164 | 94 | 164 | 149 | 51 | 149 | 72.5 | 0.068 | 1.000 | 0.122 | 1.000 |
project management | 141 | 101 | 141 | 159 | 48 | 159 | 74.5 | 0.059 | 1.000 | 0.130 | 1.000 |
deep learning | 432 | 47 | 432 | 35 | 106 | 35 | 76.5 | 0.180 | 1.000 | 0.029 | 1.000 |
AI | 729 | 26 | NA | 25 | 128 | NA | 77.0 | NA | NA | NA | NA |
software development | 327 | 53 | 327 | 38 | 103 | 38 | 78.0 | 0.136 | 1.000 | 0.031 | 1.000 |
decision making | 173 | 91 | 173 | 114 | 65 | 114 | 78.0 | 0.072 | 1.000 | 0.093 | 1.000 |
etl | 239 | 76 | 239 | 73 | 82 | 73 | 79.0 | 0.099 | 1.000 | 0.060 | 1.000 |
data management | 161 | 96 | 161 | 120 | 62 | 120 | 79.0 | 0.067 | 1.000 | 0.098 | 1.000 |
monitoring | 162 | 95 | 162 | 118 | 64 | 118 | 79.5 | 0.067 | 1.000 | 0.097 | 1.000 |
powerpoint | 84 | 118 | 84 | 190 | 43 | 190 | 80.5 | 0.035 | 1.000 | 0.156 | 1.000 |
forecasting | 169 | 93 | 169 | 96 | 69 | 96 | 81.0 | 0.070 | 1.000 | 0.079 | 1.000 |
data collection | 132 | 102 | 132 | 121 | 60 | 121 | 81.0 | 0.055 | 1.000 | 0.099 | 1.000 |
statistical modeling | 237 | 77 | 237 | 61 | 86 | 61 | 81.5 | 0.099 | 1.000 | 0.050 | 1.000 |
natural language | 390 | 49 | 390 | 31 | 115 | 31 | 82.0 | 0.162 | 1.000 | 0.025 | 1.000 |
masters | 201 | 87 | 201 | 85 | 77 | 85 | 82.0 | 0.084 | 1.000 | 0.070 | 1.000 |
artificial intelligence | 318 | 55 | 318 | 33 | 110 | 33 | 82.5 | 0.132 | 1.000 | 0.027 | 1.000 |
large data sets | 227 | 82 | 227 | 72 | 83 | 72 | 82.5 | 0.094 | 1.000 | 0.059 | 1.000 |
linux | 251 | 74 | 251 | 46 | 95 | 46 | 84.5 | 0.104 | 1.000 | 0.038 | 1.000 |
machine learning techniques | 298 | 62 | 298 | 33 | 111 | 33 | 86.5 | 0.124 | 1.000 | 0.027 | 1.000 |
work independently | 97 | 114 | 97 | 125 | 59 | 125 | 86.5 | 0.040 | 1.000 | 0.102 | 1.000 |
data processing | 227 | 81 | 227 | 40 | 99 | 40 | 90.0 | 0.094 | 1.000 | 0.033 | 1.000 |
data engineering | 194 | 88 | 194 | 50 | 94 | 50 | 91.0 | 0.081 | 1.000 | 0.041 | 1.000 |
learning algorithms | 315 | 58 | 315 | 26 | 125 | 26 | 91.5 | 0.131 | 1.000 | 0.021 | 1.000 |
language processing | 320 | 54 | 320 | 25 | 130 | 25 | 92.0 | 0.133 | 1.000 | 0.020 | 1.000 |
have experience | 157 | 97 | 157 | 58 | 88 | 58 | 92.5 | 0.065 | 1.000 | 0.048 | 1.000 |
predictive models | 211 | 85 | 211 | 39 | 102 | 39 | 93.5 | 0.088 | 1.000 | 0.032 | 1.000 |
software engineers | 194 | 89 | 194 | 41 | 98 | 41 | 93.5 | 0.081 | 1.000 | 0.034 | 1.000 |
natural language processing | 318 | 56 | 318 | 24 | 133 | 24 | 94.5 | 0.132 | 1.000 | 0.020 | 1.000 |
creativity | 155 | 98 | 155 | 50 | 93 | 50 | 95.5 | 0.064 | 1.000 | 0.041 | 1.000 |
Go | 107 | 110 | NA | 74 | 81 | NA | 95.5 | NA | NA | NA | NA |
sap | 112 | 108 | 112 | 53 | 90 | 53 | 99.0 | 0.047 | 1.000 | 0.043 | 1.000 |
machine learning algorithms | 278 | 67 | 278 | 24 | 132 | 24 | 99.5 | 0.116 | 1.000 | 0.020 | 1.000 |
predictive analytics | 146 | 100 | 146 | 39 | 101 | 39 | 100.5 | 0.061 | 1.000 | 0.032 | 1.000 |
learning models | 268 | 71 | 268 | 23 | 134 | 23 | 102.5 | 0.111 | 1.000 | 0.019 | 1.000 |
| | 95 | 116 | 95 | 53 | 89 | 53 | 102.5 | 0.039 | 1.000 | 0.043 | 1.000 |
data pipeline | 225 | 83 | 225 | 26 | 123 | 26 | 103.0 | 0.094 | 1.000 | 0.021 | 1.000 |
critical thinking | 62 | 127 | 62 | 77 | 79 | 77 | 103.0 | 0.026 | 1.000 | 0.063 | 1.000 |
javascript | 127 | 104 | 127 | 37 | 104 | 37 | 104.0 | 0.053 | 1.000 | 0.030 | 1.000 |
modeling techniques | 122 | 105 | 122 | 36 | 105 | 36 | 105.0 | 0.051 | 1.000 | 0.029 | 1.000 |
modelling | 102 | 112 | 102 | 39 | 100 | 39 | 106.0 | 0.042 | 1.000 | 0.032 | 1.000 |
large scale | 235 | 78 | 235 | 22 | 135 | 22 | 106.5 | 0.098 | 1.000 | 0.018 | 1.000 |
nosql | 232 | 79 | 232 | 22 | 136 | 22 | 107.5 | 0.096 | 1.000 | 0.018 | 1.000 |
machine learning models | 231 | 80 | 231 | 21 | 137 | 21 | 108.5 | 0.096 | 1.000 | 0.017 | 1.000 |
unix | 112 | 109 | 112 | 34 | 109 | 34 | 109.0 | 0.047 | 1.000 | 0.028 | 1.000 |
array | 104 | 111 | 104 | 34 | 107 | 34 | 109.0 | 0.043 | 1.000 | 0.028 | 1.000 |
microsoft office | 31 | 151 | 31 | 100 | 68 | 100 | 109.5 | 0.013 | 1.000 | 0.082 | 1.000 |
data models | 116 | 107 | 116 | 31 | 113 | 31 | 110.0 | 0.048 | 1.000 | 0.025 | 1.000 |
multi-task | 46 | 141 | 46 | 77 | 80 | 77 | 110.5 | 0.019 | 1.000 | 0.063 | 1.000 |
nlp | 254 | 72 | 254 | 14 | 150 | 14 | 111.0 | 0.106 | 1.000 | 0.011 | 1.000 |
market research | 29 | 154 | 29 | 95 | 70 | 95 | 112.0 | 0.012 | 1.000 | 0.078 | 1.000 |
highly motivated | 51 | 134 | 51 | 52 | 91 | 52 | 112.5 | 0.021 | 1.000 | 0.043 | 1.000 |
data warehousing | 95 | 115 | 95 | 30 | 116 | 30 | 115.5 | 0.039 | 1.000 | 0.025 | 1.000 |
coordination | 39 | 147 | 39 | 68 | 84 | 68 | 115.5 | 0.016 | 1.000 | 0.056 | 1.000 |
mysql | 92 | 117 | 92 | 30 | 117 | 30 | 117.0 | 0.038 | 1.000 | 0.025 | 1.000 |
data pipelines | 179 | 90 | 179 | 16 | 145 | 16 | 117.5 | 0.074 | 1.000 | 0.013 | 1.000 |
data systems | 78 | 121 | 78 | 31 | 114 | 31 | 117.5 | 0.032 | 1.000 | 0.025 | 1.000 |
bachelors | 46 | 139 | 46 | 41 | 96 | 41 | 117.5 | 0.019 | 1.000 | 0.034 | 1.000 |
solving problems | 77 | 122 | 77 | 30 | 118 | 30 | 120.0 | 0.032 | 1.000 | 0.025 | 1.000 |
azure | 127 | 103 | 127 | 19 | 138 | 19 | 120.5 | 0.053 | 1.000 | 0.016 | 1.000 |
time management | 38 | 149 | 38 | 51 | 92 | 51 | 120.5 | 0.016 | 1.000 | 0.042 | 1.000 |
ecommerce | 80 | 120 | 80 | 26 | 124 | 26 | 122.0 | 0.033 | 1.000 | 0.021 | 1.000 |
data manipulation | 51 | 133 | 51 | 31 | 112 | 31 | 122.5 | 0.021 | 1.000 | 0.025 | 1.000 |
vba | 19 | 161 | 19 | 60 | 87 | 60 | 124.0 | 0.008 | 1.000 | 0.049 | 1.000 |
data integration | 59 | 129 | 59 | 28 | 120 | 28 | 124.5 | 0.025 | 1.000 | 0.023 | 1.000 |
google analytics | 30 | 153 | 30 | 41 | 97 | 41 | 125.0 | 0.012 | 1.000 | 0.034 | 1.000 |
data extraction | 75 | 123 | 75 | 25 | 129 | 25 | 126.0 | 0.031 | 1.000 | 0.020 | 1.000 |
microsoft excel | 15 | 167 | 15 | 68 | 85 | 68 | 126.0 | 0.006 | 1.000 | 0.056 | 1.000 |
troubleshooting | 40 | 145 | 40 | 34 | 108 | 34 | 126.5 | 0.017 | 1.000 | 0.028 | 1.000 |
data architecture | 120 | 106 | 120 | 13 | 151 | 13 | 128.5 | 0.050 | 1.000 | 0.011 | 1.000 |
ruby | 98 | 113 | 98 | 17 | 144 | 17 | 128.5 | 0.041 | 1.000 | 0.014 | 1.000 |
masters degree | 49 | 135 | 49 | 27 | 122 | 27 | 128.5 | 0.020 | 1.000 | 0.022 | 1.000 |
speaking | 47 | 137 | 47 | 26 | 127 | 26 | 132.0 | 0.020 | 1.000 | 0.021 | 1.000 |
youtube | 40 | 146 | 40 | 29 | 119 | 29 | 132.5 | 0.017 | 1.000 | 0.024 | 1.000 |
methodological | 34 | 150 | 34 | 26 | 126 | 26 | 138.0 | 0.014 | 1.000 | 0.021 | 1.000 |
data gathering | 39 | 148 | 39 | 24 | 131 | 24 | 139.5 | 0.016 | 1.000 | 0.020 | 1.000 |
mongodb | 75 | 124 | 75 | 10 | 160 | 10 | 142.0 | 0.031 | 1.000 | 0.008 | 1.000 |
manage multiple projects | 16 | 165 | 16 | 28 | 121 | 28 | 143.0 | 0.007 | 1.000 | 0.023 | 1.000 |
postgresql | 56 | 132 | 56 | 11 | 155 | 11 | 143.5 | 0.023 | 1.000 | 0.009 | 1.000 |
language understanding | 83 | 119 | 83 | 5 | 171 | 5 | 145.0 | 0.034 | 1.000 | 0.004 | 1.000 |
github | 57 | 131 | 57 | 10 | 159 | 10 | 145.0 | 0.024 | 1.000 | 0.008 | 1.000 |
elasticsearch | 47 | 136 | 47 | 11 | 154 | 11 | 145.0 | 0.020 | 1.000 | 0.009 | 1.000 |
apache spark | 65 | 125 | 65 | 6 | 167 | 6 | 146.0 | 0.027 | 1.000 | 0.005 | 1.000 |
bash | 45 | 142 | 45 | 12 | 152 | 12 | 147.0 | 0.019 | 1.000 | 0.010 | 1.000 |
data insights | 27 | 156 | 27 | 19 | 140 | 19 | 148.0 | 0.011 | 1.000 | 0.016 | 1.000 |
natural language understanding | 64 | 126 | 64 | 4 | 178 | 4 | 152.0 | 0.027 | 1.000 | 0.003 | 1.000 |
doctorate degree | 17 | 164 | 17 | 19 | 141 | 19 | 152.5 | 0.007 | 1.000 | 0.016 | 1.000 |
negotiation | 21 | 159 | 21 | 15 | 147 | 15 | 153.0 | 0.009 | 1.000 | 0.012 | 1.000 |
network analysis | 28 | 155 | 28 | 12 | 153 | 12 | 154.0 | 0.012 | 1.000 | 0.010 | 1.000 |
kpmg | 60 | 128 | 60 | 3 | 184 | 3 | 156.0 | 0.025 | 1.000 | 0.002 | 1.000 |
shell script | 40 | 144 | 40 | 6 | 169 | 6 | 156.5 | 0.017 | 1.000 | 0.005 | 1.000 |
highly organized | 10 | 170 | 10 | 17 | 143 | 17 | 156.5 | 0.004 | 1.000 | 0.014 | 1.000 |
data entry | 8 | 176 | 8 | 19 | 139 | 19 | 157.5 | 0.003 | 1.000 | 0.016 | 1.000 |
data preparation | 22 | 158 | 22 | 8 | 161 | 8 | 159.5 | 0.009 | 1.000 | 0.007 | 1.000 |
data reporting | 8 | 178 | 8 | 15 | 146 | 15 | 162.0 | 0.003 | 1.000 | 0.012 | 1.000 |
strategic thinking | 13 | 169 | 13 | 11 | 157 | 11 | 163.0 | 0.005 | 1.000 | 0.009 | 1.000 |
microsoft word | 6 | 185 | 6 | 18 | 142 | 18 | 163.5 | 0.002 | 1.000 | 0.015 | 1.000 |
analytics data | 15 | 166 | 15 | 7 | 163 | 7 | 164.5 | 0.006 | 1.000 | 0.006 | 1.000 |
jupyter notebook | 18 | 162 | 18 | 5 | 170 | 5 | 166.0 | 0.007 | 1.000 | 0.004 | 1.000 |
sales and marketing | 19 | 160 | 19 | 5 | 173 | 5 | 166.5 | 0.008 | 1.000 | 0.004 | 1.000 |
microstrategy | 18 | 163 | 18 | 5 | 172 | 5 | 167.5 | 0.007 | 1.000 | 0.004 | 1.000 |
| | 9 | 172 | NA | 6 | 166 | NA | 169.0 | NA | NA | NA | NA |
nlu | 40 | 143 | 40 | 2 | 199 | 2 | 171.0 | 0.017 | 1.000 | 0.002 | 1.000 |
data transfer | 6 | 184 | 6 | 8 | 162 | 8 | 173.0 | 0.002 | 1.000 | 0.007 | 1.000 |
english language | 5 | 188 | 5 | 10 | 158 | 10 | 173.0 | 0.002 | 1.000 | 0.008 | 1.000 |
service orientation | 5 | 191 | 5 | 11 | 156 | 11 | 173.5 | 0.002 | 1.000 | 0.009 | 1.000 |
symantec | 9 | 174 | 9 | 5 | 174 | 5 | 174.0 | 0.004 | 1.000 | 0.004 | 1.000 |
grammatical | 9 | 173 | 9 | 4 | 176 | 4 | 174.5 | 0.004 | 1.000 | 0.003 | 1.000 |
systems analysis | 3 | 202 | 3 | 15 | 148 | 15 | 175.0 | 0.001 | 1.000 | 0.012 | 1.000 |
microsoft azure | 23 | 157 | 23 | 2 | 198 | 2 | 177.5 | 0.010 | 1.000 | 0.002 | 1.000 |
data mapping | 8 | 177 | 8 | 3 | 182 | 3 | 179.5 | 0.003 | 1.000 | 0.002 | 1.000 |
telecommunications | 9 | 175 | 9 | 3 | 187 | 3 | 181.0 | 0.004 | 1.000 | 0.002 | 1.000 |
django | 30 | 152 | 30 | 1 | 211 | 1 | 181.5 | 0.012 | 1.000 | 0.001 | 1.000 |
amazon redshift | 6 | 183 | 6 | 3 | 180 | 3 | 181.5 | 0.002 | 1.000 | 0.002 | 1.000 |
microsoft access | 3 | 199 | 3 | 7 | 165 | 7 | 182.0 | 0.001 | 1.000 | 0.006 | 1.000 |
microsoft powerpoint | 2 | 215 | 2 | 14 | 149 | 14 | 182.0 | 0.001 | 1.000 | 0.011 | 1.000 |
complex problem solving | 5 | 187 | 5 | 3 | 181 | 3 | 184.0 | 0.002 | 1.000 | 0.002 | 1.000 |
active learning | 7 | 180 | 7 | 2 | 189 | 2 | 184.5 | 0.003 | 1.000 | 0.002 | 1.000 |
data interpretation | 3 | 197 | 3 | 4 | 175 | 4 | 186.0 | 0.001 | 1.000 | 0.003 | 1.000 |
apache hadoop | 7 | 181 | 7 | 2 | 192 | 2 | 186.5 | 0.003 | 1.000 | 0.002 | 1.000 |
client management | 2 | 209 | 2 | 7 | 164 | 7 | 186.5 | 0.001 | 1.000 | 0.006 | 1.000 |
confluence | 7 | 182 | 7 | 2 | 194 | 2 | 188.0 | 0.003 | 1.000 | 0.002 | 1.000 |
minitab | 3 | 200 | 3 | 4 | 177 | 4 | 188.5 | 0.001 | 1.000 | 0.003 | 1.000 |
machine learning data | 13 | 168 | 13 | 1 | 220 | 1 | 194.0 | 0.005 | 1.000 | 0.001 | 1.000 |
swift | 6 | 186 | 6 | 2 | 202 | 2 | 194.0 | 0.002 | 1.000 | 0.002 | 1.000 |
jquery | 10 | 171 | 10 | 1 | 218 | 1 | 194.5 | 0.004 | 1.000 | 0.001 | 1.000 |
eko | 8 | 179 | 8 | 1 | 212 | 1 | 195.5 | 0.003 | 1.000 | 0.001 | 1.000 |
work well in a team | 3 | 205 | 3 | 3 | 188 | 3 | 196.5 | 0.001 | 1.000 | 0.002 | 1.000 |
operations analysis | 4 | 194 | 4 | 2 | 200 | 2 | 197.0 | 0.002 | 1.000 | 0.002 | 1.000 |
see the big picture | 4 | 195 | 4 | 2 | 201 | 2 | 198.0 | 0.002 | 1.000 | 0.002 | 1.000 |
apache kafka | 4 | 193 | 4 | 1 | 204 | 1 | 198.5 | 0.002 | 1.000 | 0.001 | 1.000 |
microsoft outlook | 2 | 214 | 2 | 3 | 186 | 3 | 200.0 | 0.001 | 1.000 | 0.002 | 1.000 |
clerical | 2 | 208 | 2 | 2 | 193 | 2 | 200.5 | 0.001 | 1.000 | 0.002 | 1.000 |
ibm db2 | 5 | 189 | 5 | 1 | 217 | 1 | 203.0 | 0.002 | 1.000 | 0.001 | 1.000 |
report creation | 1 | 231 | 1 | 4 | 179 | 4 | 205.0 | 0.000 | 1.000 | 0.003 | 1.000 |
experience in information technology | 3 | 198 | 3 | 1 | 214 | 1 | 206.0 | 0.001 | 1.000 | 0.001 | 1.000 |
microsoft sql server | 5 | 190 | 5 | 1 | 224 | 1 | 207.0 | 0.002 | 1.000 | 0.001 | 1.000 |
data cleanup | 2 | 210 | 2 | 1 | 206 | 1 | 208.0 | 0.001 | 1.000 | 0.001 | 1.000 |
data organization | 2 | 211 | 2 | 1 | 207 | 1 | 209.0 | 0.001 | 1.000 | 0.001 | 1.000 |
unix shell | 5 | 192 | 5 | 1 | 236 | 1 | 214.0 | 0.002 | 1.000 | 0.001 | 1.000 |
google adwords | 2 | 212 | 2 | 1 | 216 | 1 | 214.0 | 0.001 | 1.000 | 0.001 | 1.000 |
design development | 1 | 218 | 1 | 1 | 210 | 1 | 214.0 | 0.000 | 1.000 | 0.001 | 1.000 |
skype | 3 | 201 | 3 | 1 | 231 | 1 | 216.0 | 0.001 | 1.000 | 0.001 | 1.000 |
mathematical reasoning | 2 | 213 | 2 | 1 | 221 | 1 | 217.0 | 0.001 | 1.000 | 0.001 | 1.000 |
filemaker pro | 1 | 220 | 1 | 1 | 215 | 1 | 217.5 | 0.000 | 1.000 | 0.001 | 1.000 |
wireshark | 1 | 233 | 1 | 2 | 203 | 2 | 218.0 | 0.000 | 1.000 | 0.002 | 1.000 |
technology design | 3 | 203 | 3 | 1 | 234 | 1 | 218.5 | 0.001 | 1.000 | 0.001 | 1.000 |
ubuntu | 3 | 204 | 3 | 1 | 235 | 1 | 219.5 | 0.001 | 1.000 | 0.001 | 1.000 |
judgment and decision making | 1 | 222 | 1 | 1 | 219 | 1 | 220.5 | 0.000 | 1.000 | 0.001 | 1.000 |
microsoft dynamics | 1 | 223 | 1 | 1 | 222 | 1 | 222.5 | 0.000 | 1.000 | 0.001 | 1.000 |
microsoft windows server | 1 | 224 | 1 | 1 | 226 | 1 | 225.0 | 0.000 | 1.000 | 0.001 | 1.000 |
oracle hyperion | 1 | 225 | 1 | 1 | 227 | 1 | 226.0 | 0.000 | 1.000 | 0.001 | 1.000 |
organizational management | 1 | 227 | 1 | 1 | 228 | 1 | 227.5 | 0.000 | 1.000 | 0.001 | 1.000 |
reading comprehension | 1 | 230 | 1 | 1 | 230 | 1 | 230.0 | 0.000 | 1.000 | 0.001 | 1.000 |
big data architecture | 57 | 130 | 57 | NA | NA | NA | NA | 0.024 | 1.000 | NA | NA |
architecture capabilities | 46 | 138 | 46 | NA | NA | NA | NA | 0.019 | 1.000 | NA | NA |
covering technologies | 46 | 140 | 46 | NA | NA | NA | NA | 0.019 | 1.000 | NA | NA |
apache hive | 3 | 196 | 3 | NA | NA | NA | NA | 0.001 | 1.000 | NA | NA |
active listening | 2 | 206 | 2 | NA | NA | NA | NA | 0.001 | 1.000 | NA | NA |
citrix | 2 | 207 | 2 | NA | NA | NA | NA | 0.001 | 1.000 | NA | NA |
amazon dynamodb | 1 | 216 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
bring creativity | 1 | 217 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
engineering and technology | 1 | 219 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
ibm infosphere datastage | 1 | 221 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
oracle java | 1 | 226 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
prepare data for analysis | 1 | 228 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
quality control analysis | 1 | 229 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
teradata database | 1 | 232 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
microsoft project | NA | NA | NA | 6 | 168 | 6 | NA | NA | NA | 0.005 | 1.000 |
data storytelling | NA | NA | NA | 3 | 183 | 3 | NA | NA | NA | 0.002 | 1.000 |
lexisnexis | NA | NA | NA | 3 | 185 | 3 | NA | NA | NA | 0.002 | 1.000 |
administration and management | NA | NA | NA | 2 | 190 | 2 | NA | NA | NA | 0.002 | 1.000 |
ajax | NA | NA | NA | 2 | 191 | 2 | NA | NA | NA | 0.002 | 1.000 |
experience in market research | NA | NA | NA | 2 | 195 | 2 | NA | NA | NA | 0.002 | 1.000 |
google docs | NA | NA | NA | 2 | 196 | 2 | NA | NA | NA | 0.002 | 1.000 |
mcafee | NA | NA | NA | 2 | 197 | 2 | NA | NA | NA | 0.002 | 1.000 |
apache tomcat | NA | NA | NA | 1 | 205 | 1 | NA | NA | NA | 0.001 | 1.000 |
datadriven | NA | NA | NA | 1 | 208 | 1 | NA | NA | NA | 0.001 | 1.000 |
deductive reasoning | NA | NA | NA | 1 | 209 | 1 | NA | NA | NA | 0.001 | 1.000 |
epic systems | NA | NA | NA | 1 | 213 | 1 | NA | NA | NA | 0.001 | 1.000 |
microsoft sharepoint | NA | NA | NA | 1 | 223 | 1 | NA | NA | NA | 0.001 | 1.000 |
microsoft sql server reporting services | NA | NA | NA | 1 | 225 | 1 | NA | NA | NA | 0.001 | 1.000 |
processing information | NA | NA | NA | 1 | 229 | 1 | NA | NA | NA | 0.001 | 1.000 |
systems evaluation | NA | NA | NA | 1 | 232 | 1 | NA | NA | NA | 0.001 | 1.000 |
tax software | NA | NA | NA | 1 | 233 | 1 | NA | NA | NA | 0.001 | 1.000 |
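The rows at the bottom of the table have NA for avg_rank because each of those skills appears in only one of the two datasets. A small sketch with made-up inputs shows how `full_join` produces this. The `toy_ds` and `toy_da` tibbles are hypothetical stand-ins for top_skills_ds and top_skills_da.

```r
library(dplyr)

# Hypothetical mini versions of top_skills_ds and top_skills_da.
toy_ds <- tibble(skills = c("python", "sql", "hive"), rank = c(1, 2, 3))
toy_da <- tibble(skills = c("sql", "python", "ajax"), rank = c(1, 2, 3))

# full_join keeps skills found in either input; the side that lacks
# the skill is filled with NA, so the average rank is NA as well.
toy_all <- full_join(toy_ds, toy_da, by = "skills") %>%
  rename(rank_ds = rank.x, rank_da = rank.y) %>%
  mutate(avg_rank = (rank_ds + rank_da) / 2) %>%
  arrange(avg_rank)  # arrange() places NA values last

print(toy_all)
```

Because `arrange` sorts NA values to the end, skills unique to one role collect at the bottom instead of being ranked against skills shared by both.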
Analyzing our Survey
In our quest to determine the top data science skills, our team formulated a survey and distributed it to our peers, colleagues, friends, and family. We received 32 survey responses and analyzed them to identify the skills respondents considered most important. Each respondent was prompted to provide three skills in all.
To analyze the survey data, we used n-gram analysis. Because the survey responses are much shorter and simpler than the job descriptions, we did not use the word_catalog; instead we relied on our own judgment to pull relevant data skills out of the n-gram results.
## Loading Data
survey <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/survey-final.csv?token=AKGJZWL6KFYOLXBQ5XWZ7OLANHUIQ")
skills_only <- select(survey, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))
skills_only<-skills_only %>% rename(
first = What.is.the.most.important.skill.for.a.data.scientist.,
second = What.is.the.second.most.important.skill.for.a.data.scientist.,
third = What.is.the.third.most.important.skill.for.a.data.scientist.
)
Exploratory Word Cloud
To get an initial look at the data, all the skills were collected into one data frame, and the wordcloud library was used to generate a word cloud of all the words survey participants submitted.
all <- pivot_longer(skills_only, 1:3)
corpus4 <- VCorpus(VectorSource(all$value))
corpus4 <- tm_map(corpus4, removePunctuation)
corpus4 <- tm_map(corpus4, content_transformer(tolower))
corpus4 <- tm_map(corpus4, removeNumbers)
corpus4 <- tm_map(corpus4, removeWords, stopwords_en)
corpus4 <- tm_map(corpus4, stripWhitespace)
wordcloud(corpus4, max.words = 50, colors = wes_palette(name = "Zissou1"))
Stemming
To further analyze the survey, stemming was used. In addition to removing stopwords_en, “skill”, “skills”, “ability”, and “abilities” were removed since those aren’t standalone skills. The stem completion list was formulated by hand based on the generated list of stems.
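# Lowercase first so that capitalized stop words ("The", "Skills", "Ability") are also removed
no_punc <- tolower(removePunctuation(all$value))
fewer_words <- removeWords(no_punc, c(stopwords_en, "etc", "eg", "skill", "skills", "ability", "abilities"))
# Split each response into single words, then reduce each word to its stem
unlisted <- unlist(strsplit(fewer_words, split = ' '))
stems <- stemDocument(unlisted)
stems <- stripWhitespace(stems)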
stem_corpus <- VCorpus(VectorSource(stems))
wordcloud(stem_corpus, max.words = 50, colors = wes_palette(name = "Zissou1"))
test_complete <- c("statistics", "visualization", "programming", "database", "analytics", "software", "solving", "thinking", "communication", "code", "machine", "learning", "modeling", "munging", "interpretation", "recognition", "computer", "aptitude", "knowledge", "technical", "storytelling", "collaboration", "analysis", "data", "sql", "creativity", "python", "business")
test <- stemCompletion(stems, dictionary=test_complete)
testcorp <- VCorpus(VectorSource(test))
wordcloud(testcorp, max.words = 50, colors = wes_palette(name = "Zissou1"))
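To make the stem-then-complete step concrete, here is a minimal sketch on a few made-up tokens. The `tokens` and `toy_dictionary` vectors are hypothetical. It assumes the tm and SnowballC packages are installed and uses the same `stemDocument` and `stemCompletion` calls as above.

```r
library(tm)        # stemDocument(), stemCompletion()
library(SnowballC) # stemmer backend used by stemDocument()

# Hypothetical survey-style tokens with several forms of the same word.
tokens <- c("statistics", "statistical", "programming", "programs")

# Reduce each token to its stem so related word forms collapse together.
toy_stems <- stemDocument(tokens)

# Map the stems back to readable words using a hand-made completion list.
toy_dictionary <- c("statistics", "programming")
toy_completed <- stemCompletion(toy_stems, dictionary = toy_dictionary)

# Tally the completed words; the different forms are now counted together.
print(table(unname(toy_completed)))
```

The completion list matters because raw stems are hard to read in a word cloud, and mapping them to one chosen form per stem keeps the counts merged while restoring legible labels.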
N Gram Analysis
Next, the frequencies of the ten most common unigrams, bigrams, and trigrams across the entire data set were analyzed.
#Unigrams
unigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE) }
unigram <- TermDocumentMatrix(corpus4, control = list(wordLengths = c(1, 20)))
#Bigrams
bigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) }
bigram <- TermDocumentMatrix(corpus4, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
#Trigrams
trigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE) }
trigram <- TermDocumentMatrix(corpus4, control = list(wordLengths = c(3, 60),tokenize = trigramTokenizer))
unigramrow <- sort(slam::row_sums(unigram), decreasing=T)
unigramfreq <- data.table(tok = names(unigramrow), freq = unigramrow)
ggplot(unigramfreq[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
ggtitle("Top 10 Unigrams") +labs(x = "", y = "")
#Bigrams
bigramrow <- sort(slam::row_sums(bigram), decreasing=T)
bigramfreq <- data.table(tok = names(bigramrow), freq = bigramrow)
ggplot(bigramfreq[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
ggtitle("Top 10 Bigrams") +labs(x = "", y = "")
#Trigrams
trigramrow <- sort(slam::row_sums(trigram), decreasing=T)
trigramfreq <- data.table(tok = names(trigramrow), freq = trigramrow)
ggplot(trigramfreq[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
ggtitle("Top 10 Trigrams") +labs(x = "", y = "")
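To make the counting step concrete, here is a small base R sketch of how n-grams are formed from a response and tallied. `make_ngrams` is a hypothetical helper and the three responses are invented; the analysis above relies on ngrams() and TermDocumentMatrix instead.

```r
# Hypothetical helper: slide a window of n tokens across one response
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Invented responses, already lower-cased and stripped of punctuation
responses <- c("machine learning", "programming skills", "machine learning models")

# Split each response into words, then build bigrams per response
tokens <- strsplit(responses, " ", fixed = TRUE)
bigrams <- unlist(lapply(tokens, make_ngrams, n = 2))

# Tally the bigrams and sort from most to least frequent
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(bigram_freq)
```

Building n-grams per response, rather than over one long string, keeps a bigram from spanning the end of one answer and the start of the next.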
N Gram Analysis, With Filtering
To get a closer look at where the responses break down, the unigram and bigram frequencies were run again on filtered sections of the data set.
Filtering By Field and Occupation
First, the data set was filtered both by field and by occupation. The first group was individuals who worked full time in either computer science or data science. The second group was everyone else in computer science or data science: students, teachers and professors, and respondents who listed their occupation as “Other”.
industry <- filter(survey, Occupation == "Full-Time Work")
industry <- filter(industry, Field == "Data Science" | Field == "Computer Science")
not_industry <- filter(survey, Occupation == "Graduate Student (Full Time)" | Occupation == "Teacher / Professor" | Occupation == "High School Student" | Occupation == "Undergraduate" | Occupation == "Other")
not_industry <- filter(not_industry, Field == "Data Science" | Field == "Computer Science")
industry <- select(industry, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))
industry<- industry %>% rename(
first = What.is.the.most.important.skill.for.a.data.scientist.,
second = What.is.the.second.most.important.skill.for.a.data.scientist.,
third = What.is.the.third.most.important.skill.for.a.data.scientist.
)
not_industry <- select(not_industry, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))
not_industry<-not_industry %>% rename(
first = What.is.the.most.important.skill.for.a.data.scientist.,
second = What.is.the.second.most.important.skill.for.a.data.scientist.,
third = What.is.the.third.most.important.skill.for.a.data.scientist.
)
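The select-and-rename step is repeated for every group, so it could be folded into one helper. The sketch below is one possible way to do that, not the code used for the results here: `pick_skills`, `skill_questions`, `cs_ds`, and the toy survey rows are all hypothetical, while the column names and category labels are the ones used above.

```r
library(dplyr)

# New name = original survey column
skill_questions <- c(
  first  = "What.is.the.most.important.skill.for.a.data.scientist.",
  second = "What.is.the.second.most.important.skill.for.a.data.scientist.",
  third  = "What.is.the.third.most.important.skill.for.a.data.scientist."
)

# Hypothetical helper: keep only the three skill columns and rename them
pick_skills <- function(responses) {
  select(responses, all_of(skill_questions))
}

# Invented rows that mimic the shape of the survey data
survey_toy <- data.frame(
  Occupation = c("Full-Time Work", "Undergraduate", "Full-Time Work"),
  Field = c("Data Science", "Computer Science", "Other STEM Field"),
  What.is.the.most.important.skill.for.a.data.scientist. = c("python", "statistics", "excel"),
  What.is.the.second.most.important.skill.for.a.data.scientist. = c("sql", "r", "communication"),
  What.is.the.third.most.important.skill.for.a.data.scientist. = c("communication", "python", "sql")
)

cs_ds <- c("Data Science", "Computer Science")

# %in% replaces the chained == ... | == ... comparisons
industry_toy <- survey_toy %>%
  filter(Occupation == "Full-Time Work", Field %in% cs_ds) %>%
  pick_skills()
print(industry_toy)
```

The same helper would apply unchanged to any other filtered group, so a change to the survey column names would only need to be made in one place.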
Initial Word Cloud Visualization
industry <- pivot_longer(industry, 1:3)
corpus_industry <- VCorpus(VectorSource(industry$value))
corpus_industry <- tm_map(corpus_industry, removePunctuation)
corpus_industry <- tm_map(corpus_industry, content_transformer(tolower))
corpus_industry <- tm_map(corpus_industry, removeNumbers)
corpus_industry <- tm_map(corpus_industry, removeWords, c(stopwords_en, "eg", "etc"))
corpus_industry <- tm_map(corpus_industry, stripWhitespace)
wordcloud(corpus_industry, max.words = 50, colors = wes_palette(name = "Zissou1"))
not_industry <- pivot_longer(not_industry, 1:3)
corpus_not_industry <- VCorpus(VectorSource(not_industry$value))
corpus_not_industry <- tm_map(corpus_not_industry, removePunctuation)
corpus_not_industry <- tm_map(corpus_not_industry, content_transformer(tolower))
corpus_not_industry <- tm_map(corpus_not_industry, removeNumbers)
corpus_not_industry <- tm_map(corpus_not_industry, removeWords, c(stopwords_en, "etc", "eg"))
corpus_not_industry <- tm_map(corpus_not_industry, stripWhitespace)
wordcloud(corpus_not_industry, max.words = 50, colors = wes_palette(name = "Zissou1"))
N Gram Analysis
unigram_ind <- TermDocumentMatrix(corpus_industry, control = list(wordLengths = c(1, 20)))
unigramrow_ind <- sort(slam::row_sums(unigram_ind), decreasing=T)
unigramfreq_ind <- data.table(tok = names(unigramrow_ind), freq = unigramrow_ind)
ggplot(unigramfreq_ind[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
ggtitle("Top 10 Unigrams - Computer Science and Data Science Full-Time Workers") +labs(x = "", y = "")
unigram_not <- TermDocumentMatrix(corpus_not_industry, control = list(wordLengths = c(1, 20)))
unigramrow_not <- sort(slam::row_sums(unigram_not), decreasing=T)
unigramfreq_not <- data.table(tok = names(unigramrow_not), freq = unigramrow_not)
ggplot(unigramfreq_not[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
ggtitle("Top 10 Unigrams - Computer Science and Data Science Professors and Students") +labs(x = "", y = "")
bigram_ind <- TermDocumentMatrix(corpus_industry, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
bigramrow_ind <- sort(slam::row_sums(bigram_ind), decreasing=T)
bigramfreq_ind <- data.table(tok = names(bigramrow_ind), freq = bigramrow_ind)
ggplot(bigramfreq_ind[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
ggtitle("Top 10 Bigrams - Computer Science and Data Science Full-Time Workers") +labs(x = "", y = "")
bigram_not <- TermDocumentMatrix(corpus_not_industry, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
bigramrow_not <- sort(slam::row_sums(bigram_not), decreasing=T)
bigramfreq_not <- data.table(tok = names(bigramrow_not), freq = bigramrow_not)
ggplot(bigramfreq_not[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
ggtitle("Top 10 Bigrams - Computer Science and Data Science Professors and Students") +labs(x = "", y = "")
Filtering Only By Field
The second filtered data set was only filtered by field. The first group was individuals in data science or computer science, and the second group was individuals who weren’t in computer science or data science.
csds <- filter(survey, Field == "Data Science" | Field == "Computer Science")
not_csds <- filter(survey, Field == "Other STEM Field" | Field == "Other Non-Stem Field")
csds <- select(csds, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))
csds<- csds %>% rename(
first = What.is.the.most.important.skill.for.a.data.scientist.,
second = What.is.the.second.most.important.skill.for.a.data.scientist.,
third = What.is.the.third.most.important.skill.for.a.data.scientist.
)
not_csds <- select(not_csds, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))
not_csds<-not_csds %>% rename(
first = What.is.the.most.important.skill.for.a.data.scientist.,
second = What.is.the.second.most.important.skill.for.a.data.scientist.,
third = What.is.the.third.most.important.skill.for.a.data.scientist.
)
Word Clouds
csds <- pivot_longer(csds, 1:3)
corpus_csds <- VCorpus(VectorSource(csds$value))
corpus_csds <- tm_map(corpus_csds, removePunctuation)
corpus_csds <- tm_map(corpus_csds, content_transformer(tolower))
corpus_csds <- tm_map(corpus_csds, removeNumbers)
corpus_csds <- tm_map(corpus_csds, removeWords, c(stopwords_en, "eg", "etc"))
corpus_csds <- tm_map(corpus_csds, stripWhitespace)
wordcloud(corpus_csds, max.words = 50, colors = wes_palette(name = "Zissou1"))
not_csds <- pivot_longer(not_csds, 1:3)
corpus_not_csds <- VCorpus(VectorSource(not_csds$value))
corpus_not_csds <- tm_map(corpus_not_csds, removePunctuation)
corpus_not_csds <- tm_map(corpus_not_csds, content_transformer(tolower))
corpus_not_csds <- tm_map(corpus_not_csds, removeNumbers)
corpus_not_csds <- tm_map(corpus_not_csds, removeWords, c(stopwords_en, "eg", "etc"))
corpus_not_csds <- tm_map(corpus_not_csds, stripWhitespace)
wordcloud(corpus_not_csds, max.words = 50, colors = wes_palette(name = "Zissou1"))
N Gram Analysis
unigram_csds <- TermDocumentMatrix(corpus_csds, control = list(wordLengths = c(1, 20)))
unigramrow_csds <- sort(slam::row_sums(unigram_csds), decreasing=T)
unigramfreq_csds <- data.table(tok = names(unigramrow_csds), freq = unigramrow_csds)
ggplot(unigramfreq_csds[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
ggtitle("Top 10 Unigrams - Computer Science and Data Science Fields") +labs(x = "", y = "")
unigram_not_csds <- TermDocumentMatrix(corpus_not_csds, control = list(wordLengths = c(1, 20)))
unigramrow_not_csds <- sort(slam::row_sums(unigram_not_csds), decreasing=T)
unigramfreq_not_csds <- data.table(tok = names(unigramrow_not_csds), freq = unigramrow_not_csds)
ggplot(unigramfreq_not_csds[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
ggtitle("Top 10 Unigrams - Other Fields") +labs(x = "", y = "")
bigram_csds <- TermDocumentMatrix(corpus_csds, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
bigramrow_csds <- sort(slam::row_sums(bigram_csds), decreasing=T)
bigramfreq_csds <- data.table(tok = names(bigramrow_csds), freq = bigramrow_csds)
ggplot(bigramfreq_csds[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
ggtitle("Top 10 Bigrams - Computer Science and Data Science") +labs(x = "", y = "")
bigram_not_csds <- TermDocumentMatrix(corpus_not_csds, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
bigramrow_not_csds <- sort(slam::row_sums(bigram_not_csds), decreasing=T)
bigramfreq_not_csds <- data.table(tok = names(bigramrow_not_csds), freq = bigramrow_not_csds)
ggplot(bigramfreq_not_csds[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
ggtitle("Top 10 Bigrams - Other Fields") +labs(x = "", y = "")
Conclusions
The top broad skills identified by the survey were “programming skills” and “analytical skills”. “Statistics”, “R”, and “Python” were top skills identified by individuals working full-time in either computer science or data science. Overall, in the filtered data sets, there weren’t many common bigrams, which was likely due to the small size of the survey. Interestingly, two survey respondents in the data science / computer science group identified “business knowledge” as a top skill.
Overall, most of the skills identified by the survey were either broad answers, such as “programming skills” or “analytical skills”, or specific programming languages such as Python, R, and SQL. A few respondents also identified abstract skills, such as creativity. While there was more specificity in the survey answers of individuals working or studying in data science or computer science, more survey responses should be gathered before drawing any larger conclusions.
Disclaimer: Findings from these blogs were not used to draw any conclusions or in the final analysis. They serve only to show what people say about the skills needed in data science.
We started by researching popular sites like Youtube, Quora and Reddit to have a better understanding of the research question.
Youtube
Keywords Everywhere is a Chrome extension that allows the user to perform keyword density checks on chosen websites.
The table below is the result of a density check for the keywords “data scientist skills” in Youtube.
We observe that most of the videos on the subject of data scientist skills were published in the past eleven months, and that the subject seems to be rising in popularity at a fast pace. In addition, the terms data scientist and data analyst are used interchangeably. Does that mean both positions are the same, or require the same skills?
Blogs
Some bloggers insist that Python and R are key, while others promote communication and presentation skills as most important.
A debate rages on Quora as to whether data scientists need high-level math skills or no math at all.
Some in the field argue that data science is just a fancy term for what we used to call a data analyst, while others insist the field is new and different.
Quora
Reddit
Analyzing Reddit Blog
content <- reddit_content("https://www.reddit.com/r/datascience/comments/m5mub0/why_do_so_many_of_us_suck_at_basic_programming/")
dfcomment <- content %>%
select(comment)
write_delim(dfcomment, file = "blog.csv", col_names = TRUE)
#write.csv(content, file = "untidy_blog.csv", row.names = FALSE)
EH Function
EH_WordCloudIt <- function(dfDataframe, sColumn, bStem = TRUE)
{
# Load the data as a corpus
docs <- Corpus(VectorSource(dfDataframe[[sColumn]]))
#Clean the data
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
if (bStem)
{
docs <- tm_map(docs, stemDocument)
}
#gets words by frequency
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
#head(d, 100)
#generates the word cloud. The set.seed appears to determine the display of similarly ranked elements.
#Without it, those elements are randomly displayed each time.
#https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer - other palettes
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
print("wordcloud finished")
return(d)
}
Experiment
findSkills <- ".............................................................................skills"
#findSkills <- "able to....................................................................."
ext<-str_extract(dfcomment$comment, findSkills )
dfExt<-as.data.frame(ext)
dfExt <- na.omit(dfExt)
freqSkills <- EH_WordCloudIt(dfExt, "ext", FALSE)
## [1] "wordcloud finished"
freqSkills
## word freq
## skills skills 16
## concepts concepts 5
## basic basic 4
## documentation documentation 4
## oop oop 3
## proper proper 3
## writing writing 3
## programming programming 2
## learned learned 2
## never never 2
## good good 2
## improve improve 2
## job job 2
## people people 2
## ’re ’re 1
## aren’t aren’t 1
## coders coders 1
## data data 1
## great great 1
## hired hired 1
## obviously obviously 1
## scientists scientists 1
## since since 1
## advanced advanced 1
## benefit benefit 1
## necessarily necessarily 1
## object object 1
## obs obs 1
## oriented oriented 1
## many many 1
## hours hours 1
## life life 1
## reading reading 1
## “talk “talk 1
## can can 1
## coding coding 1
## next next 1
## senior senior 1
## shop” shop” 1
## domain domain 1
## exist exist 1
## perhaps perhaps 1
## statistical statistical 1
## strong strong 1
## backgrounds backgrounds 1
## biostats biostats 1
## exchemistry exchemistry 1
## picked picked 1
## seem seem 1
## somehow somehow 1
## type type 1
## break break 1
## complaining complaining 1
## guy guy 1
## hear hear 1
## one one 1
## rogrammers rogrammers 1
## room room 1
## walked walked 1
## much much 1
## reason reason 1
## see see 1
## think think 1
## variation variation 1
## whatever whatever 1
## modularity modularity 1
## anyone anyone 1
## recommend recommend 1
## ancillary ancillary 1
## gotta gotta 1
## invest invest 1
## pick pick 1
## speed speed 1
## stay stay 1
## hire hire 1
## ling ling 1
## personally personally 1
## put put 1
## research research 1
## someone someone 1
## work work 1
## clients clients 1
## ing ing 1
## none none 1
## requested requested 1
## core core 1
## engineering engineering 1
## really really 1
## shows shows 1
## software software 1
## students students 1
## century century 1
## degree degree 1
## level level 1
## sexiest sexiest 1
“what are the most valued skills of the data scientist?”
Graph Top 20 Skills by total count, and percentage of job descriptions for each of Data Scientist and Data Analyst
From our Kaggle datasets, let’s look at the top 20 skills for each position - data scientist and data analyst - both by their straight number of mentions within the dataset, and by the percentage of job descriptions on which they appear within their dataset…
top20_countds <- top_skills_all %>% slice_max(order_by=count_ds, n=20)
ggplot(top20_countds, aes(x=reorder(skills, count_ds), y=count_ds)) +
geom_col(fill="coral") + coord_flip() + labs(
title = "Overall No. of Mentions/ Skill",
subtitle = "Data Scientist",
x = "Skill",
y = "Mentions",
caption= "Kaggle"
)
top20_jdpercentds <- top_skills_all %>% slice_max(order_by=jd_percent_ds, n=20)
ggplot(top20_jdpercentds, aes(x=reorder(skills, jd_percent_ds), y=jd_percent_ds)) +
geom_col(fill="coral") + coord_flip() +labs(
title = "Percent of Job Descriptions Mentioning Skill",
subtitle = "Data Scientist",
x = "Skill",
y = "% of Job Descriptions",
caption= "Kaggle"
)
top20_countda <- top_skills_all %>% slice_max(order_by=count_da, n=20)
ggplot(top20_countda, aes(x=reorder(skills, count_da), y=count_da)) +
geom_col(fill="blue") + coord_flip() + labs(
title = "Overall No. of Mentions/ Skill",
subtitle = "Data Analyst",
x = "Skill",
y = "Mentions",
caption= "Kaggle"
)
top20_jdpercentda <- top_skills_all %>% slice_max(order_by=jd_percent_da, n=20)
ggplot(top20_jdpercentda, aes(x=reorder(skills, jd_percent_da), y=jd_percent_da)) +
geom_col(fill="blue") + coord_flip() + labs(
title = "Percent of Job Descriptions Mentioning Skill",
subtitle = "Data Analyst",
x = "Skill",
y = "% of Job Descriptions",
caption= "Kaggle"
)
Python: One Skill to Rule them All?
First, we can see that while Python has by far the most overall mentions for both positions, this is driven in large part by its being mentioned several times within each job description in which it appears.
Looking at the percentage of job descriptions that mention Python, we can see that its dominance over other skills in the Data Scientist job descriptions is less pronounced, and that among the Data Analyst job descriptions it falls to rank 13.
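The difference between the two measures can be sketched in base R. The four job descriptions and the `count_mentions` helper below are hypothetical; the point is only that total mentions count every occurrence, while the percentage counts each description at most once.

```r
# Invented job descriptions, lower-cased
job_descriptions <- c(
  "python and sql required; python scripting daily",
  "strong sql and communication skills",
  "python, r, and statistics",
  "excel and powerpoint reporting"
)

# Hypothetical helper: occurrences of a literal skill string per description
count_mentions <- function(text, skill) {
  hits <- gregexpr(skill, text, fixed = TRUE)
  lengths(regmatches(text, hits))
}

python_mentions <- count_mentions(job_descriptions, "python")

# Total mentions: repeated occurrences in one description all count
total_mentions <- sum(python_mentions)

# Descriptions mentioning the skill at least once, as a count and a percentage
jd_count <- sum(python_mentions > 0)
jd_percent <- 100 * jd_count / length(job_descriptions)

print(c(total = total_mentions, jd_count = jd_count, jd_percent = jd_percent))
```

A skill that is repeated within a description inflates the first measure but not the second, which is the effect described for Python above. Plain substring matching is an assumption of this sketch and would overcount very short skill names such as “r”.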
Scientist vs. Analyst
Looking at the other skills that round out each position’s top 20, we can draw another clear conclusion, one that we expect connects directly to the relative importance each role gives to Python: the second most frequently requested skill among Data Scientist job descriptions is machine learning, while Data Analyst descriptions do not include it among their top 20 most requested skills.
While both roles mention research and SQL with similar frequency, a clear delineation emerges between the two. In addition to machine learning and Python, Data Scientist job descriptions are more likely to require knowledge of statistics, mathematics, and R, and to prefer that the candidate have completed a Ph.D., indicating a desire for deeper subject matter expertise in these areas. Data Analyst roles, on the other hand, place more emphasis on softer skills such as communication, vision, leadership, and organization, and may thus be a better entry point for working in the field.
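One way to put the two roles side by side is to join their tables on the skill name and look at the difference in rank. The sketch below uses six skills whose rank values are copied from the rank column of the two findings tables in this section; the data frame names and the `rank_gap` column are hypothetical.

```r
# Ranks copied from the rank column of the Data Scientist table
ds <- data.frame(
  skills  = c("python", "machine learning", "research", "statistics", "sql", "communication"),
  rank_ds = c(1, 2, 6, 7, 8, 9)
)

# Ranks copied from the rank column of the Data Analyst table
da <- data.frame(
  skills  = c("python", "research", "communication", "sql", "statistics", "machine learning"),
  rank_da = c(1, 2, 3, 11, 12, 37)
)

# Join on skill name and compute how far each skill moves between roles
both <- merge(ds, da, by = "skills")
both$rank_gap <- both$rank_da - both$rank_ds

# Positive gap: ranked higher for data scientists than for data analysts
both <- both[order(-both$rank_gap), ]
print(both)
```

Sorted this way, the skills that most separate the two roles appear at the top and bottom of the output, while skills with a gap near zero are shared.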
findings_table_ds
| | skills | count | rank | jd_count |
---|---|---|---|---|
1 | python | 5250 | 1 | 1750 |
4 | machine learning | 1693 | 2 | 1693 |
5 | design | 1468 | 3 | 1468 |
6 | computer | 1460 | 4 | 1460 |
7 | science | 1384 | 5 | 1384 |
8 | research | 1254 | 6 | 1254 |
9 | statistics | 1249 | 7 | 1249 |
10 | sql | 1172 | 8 | 1172 |
11 | communication | 1168 | 9 | 1168 |
12 | r | 1118 | 10 | 950 |
13 | math | 1102 | 11 | 1102 |
14 | solutions | 1056 | 12 | 1056 |
15 | algorithms | 975 | 13 | 975 |
16 | programming | 960 | 14 | 960 |
17 | leader | 905 | 15 | 905 |
18 | organization | 844 | 16 | 844 |
19 | passion | 821 | 17 | 821 |
20 | analytical | 799 | 18 | 799 |
21 | phd | 789 | 19 | 789 |
22 | scala | 789 | 20 | 789 |
23 | quantitative | 779 | 21 | 779 |
24 | spark | 779 | 22 | 779 |
25 | mathematics | 768 | 23 | 768 |
26 | communication skills | 766 | 24 | 766 |
27 | java | 765 | 25 | 765 |
28 | AI | 729 | 26 | NA |
29 | vision | 714 | 27 | 714 |
30 | hadoop | 706 | 28 | 706 |
31 | database | 699 | 29 | 699 |
32 | big data | 695 | 30 | 695 |
33 | written | 684 | 31 | 684 |
34 | ml | 617 | 32 | 617 |
35 | visualization | 578 | 33 | 578 |
36 | years of experience | 576 | 34 | 576 |
37 | data sets | 566 | 35 | 566 |
38 | leadership | 536 | 36 | 536 |
39 | data analysis | 532 | 37 | 532 |
40 | git | 516 | 38 | 516 |
41 | collaborative | 511 | 39 | 511 |
42 | office | 506 | 40 | 506 |
43 | data mining | 483 | 41 | 483 |
44 | verbal | 481 | 42 | 481 |
45 | innovation | 471 | 43 | 471 |
46 | presentation | 463 | 44 | 463 |
47 | creating | 451 | 45 | 451 |
48 | sas | 445 | 46 | 445 |
49 | deep learning | 432 | 47 | 432 |
50 | large data | 398 | 48 | 398 |
51 | natural language | 390 | 49 | 390 |
52 | data engineer | 389 | 50 | 389 |
53 | physics | 337 | 51 | 337 |
54 | collaboration | 329 | 52 | 329 |
55 | software development | 327 | 53 | 327 |
56 | language processing | 320 | 54 | 320 |
57 | artificial intelligence | 318 | 55 | 318 |
58 | natural language processing | 318 | 56 | 318 |
59 | economics | 315 | 57 | 315 |
60 | learning algorithms | 315 | 58 | 315 |
61 | programming languages | 313 | 59 | 313 |
62 | business problems | 312 | 60 | 312 |
63 | data visualization | 312 | 61 | 312 |
64 | machine learning techniques | 298 | 62 | 298 |
65 | writing | 295 | 63 | 295 |
66 | rtable | 291 | 64 | 291 |
67 | data analytics | 285 | 65 | 285 |
68 | tableau | 279 | 66 | 279 |
69 | machine learning algorithms | 278 | 67 | 278 |
70 | matlab | 277 | 68 | 277 |
71 | consulting | 274 | 69 | 274 |
72 | problem solving | 273 | 70 | 273 |
73 | learning models | 268 | 71 | 268 |
74 | nlp | 254 | 72 | 254 |
75 | influence | 251 | 73 | 251 |
76 | linux | 251 | 74 | 251 |
77 | flexible | 244 | 75 | 244 |
78 | etl | 239 | 76 | 239 |
79 | statistical modeling | 237 | 77 | 237 |
80 | large scale | 235 | 78 | 235 |
81 | nosql | 232 | 79 | 232 |
82 | machine learning models | 231 | 80 | 231 |
83 | data processing | 227 | 81 | 227 |
84 | large data sets | 227 | 82 | 227 |
85 | data pipeline | 225 | 83 | 225 |
86 | ms | 223 | 84 | 253 |
87 | predictive models | 211 | 85 | 211 |
88 | interpersonal | 201 | 86 | 201 |
89 | masters | 201 | 87 | 201 |
90 | data engineering | 194 | 88 | 194 |
91 | software engineers | 194 | 89 | 194 |
92 | data pipelines | 179 | 90 | 179 |
93 | decision making | 173 | 91 | 173 |
94 | organizational | 172 | 92 | 172 |
95 | forecasting | 169 | 93 | 169 |
96 | bachelor’s degree | 164 | 94 | 164 |
97 | monitoring | 162 | 95 | 162 |
98 | data management | 161 | 96 | 161 |
99 | have experience | 157 | 97 | 157 |
100 | creativity | 155 | 98 | 155 |
101 | microsoft | 148 | 99 | 148 |
102 | predictive analytics | 146 | 100 | 146 |
103 | project management | 141 | 101 | 141 |
104 | data collection | 132 | 102 | 132 |
105 | azure | 127 | 103 | 127 |
106 | javascript | 127 | 104 | 127 |
107 | modeling techniques | 122 | 105 | 122 |
108 | data architecture | 120 | 106 | 120 |
109 | data models | 116 | 107 | 116 |
110 | sap | 112 | 108 | 112 |
111 | unix | 112 | 109 | 112 |
112 | Go | 107 | 110 | NA |
113 | array | 104 | 111 | 104 |
114 | modelling | 102 | 112 | 102 |
115 | ruby | 98 | 113 | 98 |
116 | work independently | 97 | 114 | 97 |
117 | data warehousing | 95 | 115 | 95 |
118 | | 95 | 116 | 95 |
119 | mysql | 92 | 117 | 92 |
120 | powerpoint | 84 | 118 | 84 |
121 | language understanding | 83 | 119 | 83 |
122 | ecommerce | 80 | 120 | 80 |
123 | data systems | 78 | 121 | 78 |
124 | solving problems | 77 | 122 | 77 |
125 | data extraction | 75 | 123 | 75 |
126 | mongodb | 75 | 124 | 75 |
127 | apache spark | 65 | 125 | 65 |
128 | natural language understanding | 64 | 126 | 64 |
129 | critical thinking | 62 | 127 | 62 |
130 | kpmg | 60 | 128 | 60 |
131 | data integration | 59 | 129 | 59 |
132 | big data architecture | 57 | 130 | 57 |
133 | github | 57 | 131 | 57 |
134 | postgresql | 56 | 132 | 56 |
135 | data manipulation | 51 | 133 | 51 |
136 | highly motivated | 51 | 134 | 51 |
137 | masters degree | 49 | 135 | 49 |
138 | elasticsearch | 47 | 136 | 47 |
139 | speaking | 47 | 137 | 47 |
140 | architecture capabilities | 46 | 138 | 46 |
141 | bachelors | 46 | 139 | 46 |
142 | covering technologies | 46 | 140 | 46 |
143 | multi-task | 46 | 141 | 46 |
144 | bash | 45 | 142 | 45 |
145 | nlu | 40 | 143 | 40 |
146 | shell script | 40 | 144 | 40 |
147 | troubleshooting | 40 | 145 | 40 |
148 | youtube | 40 | 146 | 40 |
149 | coordination | 39 | 147 | 39 |
150 | data gathering | 39 | 148 | 39 |
151 | time management | 38 | 149 | 38 |
152 | methodological | 34 | 150 | 34 |
153 | microsoft office | 31 | 151 | 31 |
154 | django | 30 | 152 | 30 |
155 | google analytics | 30 | 153 | 30 |
156 | market research | 29 | 154 | 29 |
157 | network analysis | 28 | 155 | 28 |
158 | data insights | 27 | 156 | 27 |
159 | microsoft azure | 23 | 157 | 23 |
160 | data preparation | 22 | 158 | 22 |
161 | negotiation | 21 | 159 | 21 |
162 | sales and marketing | 19 | 160 | 19 |
163 | vba | 19 | 161 | 19 |
164 | jupyter notebook | 18 | 162 | 18 |
165 | microstrategy | 18 | 163 | 18 |
166 | doctorate degree | 17 | 164 | 17 |
167 | manage multiple projects | 16 | 165 | 16 |
168 | analytics data | 15 | 166 | 15 |
169 | microsoft excel | 15 | 167 | 15 |
170 | machine learning data | 13 | 168 | 13 |
171 | strategic thinking | 13 | 169 | 13 |
172 | highly organized | 10 | 170 | 10 |
173 | jquery | 10 | 171 | 10 |
174 | | 9 | 172 | NA |
175 | grammatical | 9 | 173 | 9 |
176 | symantec | 9 | 174 | 9 |
177 | telecommunications | 9 | 175 | 9 |
178 | data entry | 8 | 176 | 8 |
179 | data mapping | 8 | 177 | 8 |
180 | data reporting | 8 | 178 | 8 |
181 | eko | 8 | 179 | 8 |
182 | active learning | 7 | 180 | 7 |
183 | apache hadoop | 7 | 181 | 7 |
184 | confluence | 7 | 182 | 7 |
185 | amazon redshift | 6 | 183 | 6 |
186 | data transfer | 6 | 184 | 6 |
187 | microsoft word | 6 | 185 | 6 |
188 | swift | 6 | 186 | 6 |
189 | complex problem solving | 5 | 187 | 5 |
190 | english language | 5 | 188 | 5 |
191 | ibm db2 | 5 | 189 | 5 |
192 | microsoft sql server | 5 | 190 | 5 |
193 | service orientation | 5 | 191 | 5 |
194 | unix shell | 5 | 192 | 5 |
195 | apache kafka | 4 | 193 | 4 |
196 | operations analysis | 4 | 194 | 4 |
197 | see the big picture | 4 | 195 | 4 |
198 | apache hive | 3 | 196 | 3 |
199 | data interpretation | 3 | 197 | 3 |
200 | experience in information technology | 3 | 198 | 3 |
201 | microsoft access | 3 | 199 | 3 |
202 | minitab | 3 | 200 | 3 |
203 | skype | 3 | 201 | 3 |
204 | systems analysis | 3 | 202 | 3 |
205 | technology design | 3 | 203 | 3 |
206 | ubuntu | 3 | 204 | 3 |
207 | work well in a team | 3 | 205 | 3 |
208 | active listening | 2 | 206 | 2 |
209 | citrix | 2 | 207 | 2 |
210 | clerical | 2 | 208 | 2 |
211 | client management | 2 | 209 | 2 |
212 | data cleanup | 2 | 210 | 2 |
213 | data organization | 2 | 211 | 2 |
214 | google adwords | 2 | 212 | 2 |
215 | mathematical reasoning | 2 | 213 | 2 |
216 | microsoft outlook | 2 | 214 | 2 |
217 | microsoft powerpoint | 2 | 215 | 2 |
218 | amazon dynamodb | 1 | 216 | 1 |
219 | bring creativity | 1 | 217 | 1 |
220 | design development | 1 | 218 | 1 |
221 | engineering and technology | 1 | 219 | 1 |
222 | filemaker pro | 1 | 220 | 1 |
223 | ibm infosphere datastage | 1 | 221 | 1 |
224 | judgment and decision making | 1 | 222 | 1 |
225 | microsoft dynamics | 1 | 223 | 1 |
226 | microsoft windows server | 1 | 224 | 1 |
227 | oracle hyperion | 1 | 225 | 1 |
228 | oracle java | 1 | 226 | 1 |
229 | organizational management | 1 | 227 | 1 |
230 | prepare data for analysis | 1 | 228 | 1 |
231 | quality control analysis | 1 | 229 | 1 |
232 | reading comprehension | 1 | 230 | 1 |
233 | report creation | 1 | 231 | 1 |
234 | teradata database | 1 | 232 | 1 |
235 | wireshark | 1 | 233 | 1 |
findings_table_da
| | skills | count | rank | jd_count |
---|---|---|---|---|
1 | python | 1275 | 1 | 425 |
4 | research | 785 | 2 | 785 |
5 | communication | 776 | 3 | 776 |
6 | analytical | 653 | 4 | 653 |
7 | design | 627 | 5 | 627 |
8 | organization | 605 | 6 | 605 |
9 | written | 514 | 7 | 514 |
10 | quantitative | 499 | 8 | 499 |
11 | communication skills | 488 | 9 | 488 |
12 | leader | 484 | 10 | 484 |
13 | sql | 467 | 11 | 467 |
14 | statistics | 466 | 12 | 466 |
15 | office | 463 | 13 | 463 |
16 | math | 403 | 14 | 403 |
17 | r | 401 | 15 | 326 |
18 | solutions | 400 | 16 | 400 |
19 | computer | 380 | 17 | 380 |
20 | presentation | 373 | 18 | 373 |
21 | database | 367 | 19 | 367 |
22 | vision | 358 | 20 | 358 |
23 | leadership | 327 | 21 | 327 |
24 | verbal | 324 | 22 | 324 |
25 | passion | 323 | 23 | 323 |
26 | programming | 313 | 24 | 313 |
27 | science | 312 | 25 | 312 |
28 | sas | 309 | 26 | 309 |
29 | data analysis | 304 | 27 | 304 |
30 | writing | 300 | 28 | 300 |
31 | years of experience | 286 | 29 | 286 |
32 | collaborative | 284 | 30 | 284 |
33 | economics | 268 | 31 | 268 |
34 | mathematics | 257 | 32 | 257 |
35 | visualization | 243 | 33 | 243 |
36 | organizational | 224 | 34 | 224 |
37 | microsoft | 223 | 35 | 223 |
38 | innovation | 207 | 36 | 207 |
39 | machine learning | 207 | 37 | 207 |
40 | data sets | 206 | 38 | 206 |
41 | tableau | 203 | 39 | 203 |
42 | interpersonal | 198 | 40 | 198 |
43 | creating | 196 | 41 | 196 |
44 | git | 195 | 42 | 195 |
45 | powerpoint | 190 | 43 | 190 |
46 | consulting | 182 | 44 | 182 |
47 | ms | 173 | 45 | 165 |
48 | problem solving | 172 | 46 | 172 |
49 | data visualization | 162 | 47 | 162 |
50 | project management | 159 | 48 | 159 |
51 | collaboration | 154 | 49 | 154 |
52 | phd | 151 | 50 | 151 |
53 | bachelor’s degree | 149 | 51 | 149 |
54 | data analytics | 140 | 52 | 140 |
55 | flexible | 137 | 53 | 137 |
56 | java | 134 | 54 | 134 |
57 | ml | 134 | 55 | 134 |
58 | large data | 133 | 56 | 133 |
59 | algorithms | 130 | 57 | 130 |
60 | rtable | 128 | 58 | 128 |
61 | work independently | 125 | 59 | 125 |
62 | data collection | 121 | 60 | 121 |
63 | big data | 120 | 61 | 120 |
64 | data management | 120 | 62 | 120 |
65 | influence | 118 | 63 | 118 |
66 | monitoring | 118 | 64 | 118 |
67 | decision making | 114 | 65 | 114 |
68 | scala | 113 | 66 | 113 |
69 | data mining | 105 | 67 | 105 |
70 | microsoft office | 100 | 68 | 100 |
71 | forecasting | 96 | 69 | 96 |
72 | market research | 95 | 70 | 95 |
73 | hadoop | 92 | 71 | 92 |
74 | matlab | 92 | 72 | 92 |
75 | physics | 89 | 73 | 89 |
76 | programming languages | 88 | 74 | 88 |
77 | business problems | 87 | 75 | 87 |
78 | spark | 87 | 76 | 87 |
79 | masters | 85 | 77 | 85 |
80 | data engineer | 81 | 78 | 81 |
81 | critical thinking | 77 | 79 | 77 |
82 | multi-task | 77 | 80 | 77 |
83 | Go | 74 | 81 | NA |
84 | etl | 73 | 82 | 73 |
85 | large data sets | 72 | 83 | 72 |
86 | coordination | 68 | 84 | 68 |
87 | microsoft excel | 68 | 85 | 68 |
88 | statistical modeling | 61 | 86 | 61 |
89 | vba | 60 | 87 | 60 |
90 | have experience | 58 | 88 | 58 |
91 | | 53 | 89 | 53 |
92 | sap | 53 | 90 | 53 |
93 | highly motivated | 52 | 91 | 52 |
94 | time management | 51 | 92 | 51 |
95 | creativity | 50 | 93 | 50 |
96 | data engineering | 50 | 94 | 50 |
97 | linux | 46 | 95 | 46 |
98 | bachelors | 41 | 96 | 41 |
99 | google analytics | 41 | 97 | 41 |
100 | software engineers | 41 | 98 | 41 |
101 | data processing | 40 | 99 | 40 |
102 | modelling | 39 | 100 | 39 |
103 | predictive analytics | 39 | 101 | 39 |
104 | predictive models | 39 | 102 | 39 |
105 | software development | 38 | 103 | 38 |
106 | javascript | 37 | 104 | 37 |
107 | modeling techniques | 36 | 105 | 36 |
108 | deep learning | 35 | 106 | 35 |
109 | array | 34 | 107 | 34 |
110 | troubleshooting | 34 | 108 | 34 |
111 | unix | 34 | 109 | 34 |
112 | artificial intelligence | 33 | 110 | 33 |
113 | machine learning techniques | 33 | 111 | 33 |
114 | data manipulation | 31 | 112 | 31 |
115 | data models | 31 | 113 | 31 |
116 | data systems | 31 | 114 | 31 |
117 | natural language | 31 | 115 | 31 |
118 | data warehousing | 30 | 116 | 30 |
119 | mysql | 30 | 117 | 30 |
120 | solving problems | 30 | 118 | 30 |
121 | youtube | 29 | 119 | 29 |
122 | data integration | 28 | 120 | 28 |
123 | manage multiple projects | 28 | 121 | 28 |
124 | masters degree | 27 | 122 | 27 |
125 | data pipeline | 26 | 123 | 26 |
126 | ecommerce | 26 | 124 | 26 |
127 | learning algorithms | 26 | 125 | 26 |
128 | methodological | 26 | 126 | 26 |
129 | speaking | 26 | 127 | 26 |
130 | AI | 25 | 128 | NA |
131 | data extraction | 25 | 129 | 25 |
132 | language processing | 25 | 130 | 25 |
133 | data gathering | 24 | 131 | 24 |
134 | machine learning algorithms | 24 | 132 | 24 |
135 | natural language processing | 24 | 133 | 24 |
136 | learning models | 23 | 134 | 23 |
137 | large scale | 22 | 135 | 22 |
138 | nosql | 22 | 136 | 22 |
139 | machine learning models | 21 | 137 | 21 |
140 | azure | 19 | 138 | 19 |
141 | data entry | 19 | 139 | 19 |
142 | data insights | 19 | 140 | 19 |
143 | doctorate degree | 19 | 141 | 19 |
144 | microsoft word | 18 | 142 | 18 |
145 | highly organized | 17 | 143 | 17 |
146 | ruby | 17 | 144 | 17 |
147 | data pipelines | 16 | 145 | 16 |
148 | data reporting | 15 | 146 | 15 |
149 | negotiation | 15 | 147 | 15 |
150 | systems analysis | 15 | 148 | 15 |
151 | microsoft powerpoint | 14 | 149 | 14 |
152 | nlp | 14 | 150 | 14 |
153 | data architecture | 13 | 151 | 13 |
154 | bash | 12 | 152 | 12 |
155 | network analysis | 12 | 153 | 12 |
156 | elasticsearch | 11 | 154 | 11 |
157 | postgresql | 11 | 155 | 11 |
158 | service orientation | 11 | 156 | 11 |
159 | strategic thinking | 11 | 157 | 11 |
160 | english language | 10 | 158 | 10 |
161 | github | 10 | 159 | 10 |
162 | mongodb | 10 | 160 | 10 |
163 | data preparation | 8 | 161 | 8 |
164 | data transfer | 8 | 162 | 8 |
165 | analytics data | 7 | 163 | 7 |
166 | client management | 7 | 164 | 7 |
167 | microsoft access | 7 | 165 | 7 |
168 | | 6 | 166 | NA |
169 | apache spark | 6 | 167 | 6 |
170 | microsoft project | 6 | 168 | 6 |
171 | shell script | 6 | 169 | 6 |
172 | jupyter notebook | 5 | 170 | 5 |
173 | language understanding | 5 | 171 | 5 |
174 | microstrategy | 5 | 172 | 5 |
175 | sales and marketing | 5 | 173 | 5 |
176 | symantec | 5 | 174 | 5 |
177 | data interpretation | 4 | 175 | 4 |
178 | grammatical | 4 | 176 | 4 |
179 | minitab | 4 | 177 | 4 |
180 | natural language understanding | 4 | 178 | 4 |
181 | report creation | 4 | 179 | 4 |
182 | amazon redshift | 3 | 180 | 3 |
183 | complex problem solving | 3 | 181 | 3 |
184 | data mapping | 3 | 182 | 3 |
185 | data storytelling | 3 | 183 | 3 |
186 | kpmg | 3 | 184 | 3 |
187 | lexisnexis | 3 | 185 | 3 |
188 | microsoft outlook | 3 | 186 | 3 |
189 | telecommunications | 3 | 187 | 3 |
190 | work well in a team | 3 | 188 | 3 |
191 | active learning | 2 | 189 | 2 |
192 | administration and management | 2 | 190 | 2 |
193 | ajax | 2 | 191 | 2 |
194 | apache hadoop | 2 | 192 | 2 |
195 | clerical | 2 | 193 | 2 |
196 | confluence | 2 | 194 | 2 |
197 | experience in market research | 2 | 195 | 2 |
198 | google docs | 2 | 196 | 2 |
199 | mcafee | 2 | 197 | 2 |
200 | microsoft azure | 2 | 198 | 2 |
201 | nlu | 2 | 199 | 2 |
202 | operations analysis | 2 | 200 | 2 |
203 | see the big picture | 2 | 201 | 2 |
204 | swift | 2 | 202 | 2 |
205 | wireshark | 2 | 203 | 2 |
206 | apache kafka | 1 | 204 | 1 |
207 | apache tomcat | 1 | 205 | 1 |
208 | data cleanup | 1 | 206 | 1 |
209 | data organization | 1 | 207 | 1 |
210 | datadriven | 1 | 208 | 1 |
211 | deductive reasoning | 1 | 209 | 1 |
212 | design development | 1 | 210 | 1 |
213 | django | 1 | 211 | 1 |
214 | eko | 1 | 212 | 1 |
215 | epic systems | 1 | 213 | 1 |
216 | experience in information technology | 1 | 214 | 1 |
217 | filemaker pro | 1 | 215 | 1 |
218 | google adwords | 1 | 216 | 1 |
219 | ibm db2 | 1 | 217 | 1 |
220 | jquery | 1 | 218 | 1 |
221 | judgment and decision making | 1 | 219 | 1 |
222 | machine learning data | 1 | 220 | 1 |
223 | mathematical reasoning | 1 | 221 | 1 |
224 | microsoft dynamics | 1 | 222 | 1 |
225 | microsoft sharepoint | 1 | 223 | 1 |
226 | microsoft sql server | 1 | 224 | 1 |
227 | microsoft sql server reporting services | 1 | 225 | 1 |
228 | microsoft windows server | 1 | 226 | 1 |
229 | oracle hyperion | 1 | 227 | 1 |
230 | organizational management | 1 | 228 | 1 |
231 | processing information | 1 | 229 | 1 |
232 | reading comprehension | 1 | 230 | 1 |
233 | skype | 1 | 231 | 1 |
234 | systems evaluation | 1 | 232 | 1 |
235 | tax software | 1 | 233 | 1 |
236 | technology design | 1 | 234 | 1 |
237 | ubuntu | 1 | 235 | 1 |
238 | unix shell | 1 | 236 | 1 |
findings_table_all
skills | count_ds | rank_ds | jd_count_ds | count_da | rank_da | jd_count_da | avg_rank | jd_percent_ds | freq_per_jd_ds | jd_percent_da | freq_per_jd_da |
---|---|---|---|---|---|---|---|---|---|---|---|
python | 5250 | 1 | 1750 | 1275 | 1 | 425 | 1.0 | 0.727 | 3.000 | 0.348 | 3.000 |
design | 1468 | 3 | 1468 | 627 | 5 | 627 | 4.0 | 0.610 | 1.000 | 0.514 | 1.000 |
research | 1254 | 6 | 1254 | 785 | 2 | 785 | 4.0 | 0.521 | 1.000 | 0.643 | 1.000 |
communication | 1168 | 9 | 1168 | 776 | 3 | 776 | 6.0 | 0.485 | 1.000 | 0.636 | 1.000 |
statistics | 1249 | 7 | 1249 | 466 | 12 | 466 | 9.5 | 0.519 | 1.000 | 0.382 | 1.000 |
sql | 1172 | 8 | 1172 | 467 | 11 | 467 | 9.5 | 0.487 | 1.000 | 0.382 | 1.000 |
computer | 1460 | 4 | 1460 | 380 | 17 | 380 | 10.5 | 0.607 | 1.000 | 0.311 | 1.000 |
organization | 844 | 16 | 844 | 605 | 6 | 605 | 11.0 | 0.351 | 1.000 | 0.495 | 1.000 |
analytical | 799 | 18 | 799 | 653 | 4 | 653 | 11.0 | 0.332 | 1.000 | 0.535 | 1.000 |
r | 1118 | 10 | 950 | 401 | 15 | 326 | 12.5 | 0.395 | 1.177 | 0.267 | 1.230 |
math | 1102 | 11 | 1102 | 403 | 14 | 403 | 12.5 | 0.458 | 1.000 | 0.330 | 1.000 |
leader | 905 | 15 | 905 | 484 | 10 | 484 | 12.5 | 0.376 | 1.000 | 0.396 | 1.000 |
solutions | 1056 | 12 | 1056 | 400 | 16 | 400 | 14.0 | 0.439 | 1.000 | 0.328 | 1.000 |
quantitative | 779 | 21 | 779 | 499 | 8 | 499 | 14.5 | 0.324 | 1.000 | 0.409 | 1.000 |
science | 1384 | 5 | 1384 | 312 | 25 | 312 | 15.0 | 0.575 | 1.000 | 0.256 | 1.000 |
communication skills | 766 | 24 | 766 | 488 | 9 | 488 | 16.5 | 0.318 | 1.000 | 0.400 | 1.000 |
programming | 960 | 14 | 960 | 313 | 24 | 313 | 19.0 | 0.399 | 1.000 | 0.256 | 1.000 |
written | 684 | 31 | 684 | 514 | 7 | 514 | 19.0 | 0.284 | 1.000 | 0.421 | 1.000 |
machine learning | 1693 | 2 | 1693 | 207 | 37 | 207 | 19.5 | 0.704 | 1.000 | 0.170 | 1.000 |
passion | 821 | 17 | 821 | 323 | 23 | 323 | 20.0 | 0.341 | 1.000 | 0.265 | 1.000 |
vision | 714 | 27 | 714 | 358 | 20 | 358 | 23.5 | 0.297 | 1.000 | 0.293 | 1.000 |
database | 699 | 29 | 699 | 367 | 19 | 367 | 24.0 | 0.291 | 1.000 | 0.301 | 1.000 |
office | 506 | 40 | 506 | 463 | 13 | 463 | 26.5 | 0.210 | 1.000 | 0.379 | 1.000 |
mathematics | 768 | 23 | 768 | 257 | 32 | 257 | 27.5 | 0.319 | 1.000 | 0.210 | 1.000 |
leadership | 536 | 36 | 536 | 327 | 21 | 327 | 28.5 | 0.223 | 1.000 | 0.268 | 1.000 |
presentation | 463 | 44 | 463 | 373 | 18 | 373 | 31.0 | 0.192 | 1.000 | 0.305 | 1.000 |
years of experience | 576 | 34 | 576 | 286 | 29 | 286 | 31.5 | 0.239 | 1.000 | 0.234 | 1.000 |
data analysis | 532 | 37 | 532 | 304 | 27 | 304 | 32.0 | 0.221 | 1.000 | 0.249 | 1.000 |
verbal | 481 | 42 | 481 | 324 | 22 | 324 | 32.0 | 0.200 | 1.000 | 0.265 | 1.000 |
visualization | 578 | 33 | 578 | 243 | 33 | 243 | 33.0 | 0.240 | 1.000 | 0.199 | 1.000 |
phd | 789 | 19 | 789 | 151 | 50 | 151 | 34.5 | 0.328 | 1.000 | 0.124 | 1.000 |
collaborative | 511 | 39 | 511 | 284 | 30 | 284 | 34.5 | 0.212 | 1.000 | 0.233 | 1.000 |
algorithms | 975 | 13 | 975 | 130 | 57 | 130 | 35.0 | 0.405 | 1.000 | 0.106 | 1.000 |
sas | 445 | 46 | 445 | 309 | 26 | 309 | 36.0 | 0.185 | 1.000 | 0.253 | 1.000 |
data sets | 566 | 35 | 566 | 206 | 38 | 206 | 36.5 | 0.235 | 1.000 | 0.169 | 1.000 |
java | 765 | 25 | 765 | 134 | 54 | 134 | 39.5 | 0.318 | 1.000 | 0.110 | 1.000 |
innovation | 471 | 43 | 471 | 207 | 36 | 207 | 39.5 | 0.196 | 1.000 | 0.170 | 1.000 |
git | 516 | 38 | 516 | 195 | 42 | 195 | 40.0 | 0.214 | 1.000 | 0.160 | 1.000 |
scala | 789 | 20 | 789 | 113 | 66 | 113 | 43.0 | 0.328 | 1.000 | 0.093 | 1.000 |
creating | 451 | 45 | 451 | 196 | 41 | 196 | 43.0 | 0.187 | 1.000 | 0.161 | 1.000 |
ml | 617 | 32 | 617 | 134 | 55 | 134 | 43.5 | 0.256 | 1.000 | 0.110 | 1.000 |
economics | 315 | 57 | 315 | 268 | 31 | 268 | 44.0 | 0.131 | 1.000 | 0.219 | 1.000 |
big data | 695 | 30 | 695 | 120 | 61 | 120 | 45.5 | 0.289 | 1.000 | 0.098 | 1.000 |
writing | 295 | 63 | 295 | 300 | 28 | 300 | 45.5 | 0.123 | 1.000 | 0.246 | 1.000 |
spark | 779 | 22 | 779 | 87 | 76 | 87 | 49.0 | 0.324 | 1.000 | 0.071 | 1.000 |
hadoop | 706 | 28 | 706 | 92 | 71 | 92 | 49.5 | 0.293 | 1.000 | 0.075 | 1.000 |
collaboration | 329 | 52 | 329 | 154 | 49 | 154 | 50.5 | 0.137 | 1.000 | 0.126 | 1.000 |
large data | 398 | 48 | 398 | 133 | 56 | 133 | 52.0 | 0.165 | 1.000 | 0.109 | 1.000 |
tableau | 279 | 66 | 279 | 203 | 39 | 203 | 52.5 | 0.116 | 1.000 | 0.166 | 1.000 |
data mining | 483 | 41 | 483 | 105 | 67 | 105 | 54.0 | 0.201 | 1.000 | 0.086 | 1.000 |
data visualization | 312 | 61 | 312 | 162 | 47 | 162 | 54.0 | 0.130 | 1.000 | 0.133 | 1.000 |
consulting | 274 | 69 | 274 | 182 | 44 | 182 | 56.5 | 0.114 | 1.000 | 0.149 | 1.000 |
problem solving | 273 | 70 | 273 | 172 | 46 | 172 | 58.0 | 0.113 | 1.000 | 0.141 | 1.000 |
data analytics | 285 | 65 | 285 | 140 | 52 | 140 | 58.5 | 0.118 | 1.000 | 0.115 | 1.000 |
rtable | 291 | 64 | 291 | 128 | 58 | 128 | 61.0 | 0.121 | 1.000 | 0.105 | 1.000 |
physics | 337 | 51 | 337 | 89 | 73 | 89 | 62.0 | 0.140 | 1.000 | 0.073 | 1.000 |
interpersonal | 201 | 86 | 201 | 198 | 40 | 198 | 63.0 | 0.084 | 1.000 | 0.162 | 1.000 |
organizational | 172 | 92 | 172 | 224 | 34 | 224 | 63.0 | 0.071 | 1.000 | 0.183 | 1.000 |
data engineer | 389 | 50 | 389 | 81 | 78 | 81 | 64.0 | 0.162 | 1.000 | 0.066 | 1.000 |
flexible | 244 | 75 | 244 | 137 | 53 | 137 | 64.0 | 0.101 | 1.000 | 0.112 | 1.000 |
ms | 223 | 84 | 253 | 173 | 45 | 165 | 64.5 | 0.105 | 0.881 | 0.135 | 1.048 |
programming languages | 313 | 59 | 313 | 88 | 74 | 88 | 66.5 | 0.130 | 1.000 | 0.072 | 1.000 |
microsoft | 148 | 99 | 148 | 223 | 35 | 223 | 67.0 | 0.062 | 1.000 | 0.183 | 1.000 |
business problems | 312 | 60 | 312 | 87 | 75 | 87 | 67.5 | 0.130 | 1.000 | 0.071 | 1.000 |
influence | 251 | 73 | 251 | 118 | 63 | 118 | 68.0 | 0.104 | 1.000 | 0.097 | 1.000 |
matlab | 277 | 68 | 277 | 92 | 72 | 92 | 70.0 | 0.115 | 1.000 | 0.075 | 1.000 |
bachelor’s degree | 164 | 94 | 164 | 149 | 51 | 149 | 72.5 | 0.068 | 1.000 | 0.122 | 1.000 |
project management | 141 | 101 | 141 | 159 | 48 | 159 | 74.5 | 0.059 | 1.000 | 0.130 | 1.000 |
deep learning | 432 | 47 | 432 | 35 | 106 | 35 | 76.5 | 0.180 | 1.000 | 0.029 | 1.000 |
AI | 729 | 26 | NA | 25 | 128 | NA | 77.0 | NA | NA | NA | NA |
software development | 327 | 53 | 327 | 38 | 103 | 38 | 78.0 | 0.136 | 1.000 | 0.031 | 1.000 |
decision making | 173 | 91 | 173 | 114 | 65 | 114 | 78.0 | 0.072 | 1.000 | 0.093 | 1.000 |
etl | 239 | 76 | 239 | 73 | 82 | 73 | 79.0 | 0.099 | 1.000 | 0.060 | 1.000 |
data management | 161 | 96 | 161 | 120 | 62 | 120 | 79.0 | 0.067 | 1.000 | 0.098 | 1.000 |
monitoring | 162 | 95 | 162 | 118 | 64 | 118 | 79.5 | 0.067 | 1.000 | 0.097 | 1.000 |
powerpoint | 84 | 118 | 84 | 190 | 43 | 190 | 80.5 | 0.035 | 1.000 | 0.156 | 1.000 |
forecasting | 169 | 93 | 169 | 96 | 69 | 96 | 81.0 | 0.070 | 1.000 | 0.079 | 1.000 |
data collection | 132 | 102 | 132 | 121 | 60 | 121 | 81.0 | 0.055 | 1.000 | 0.099 | 1.000 |
statistical modeling | 237 | 77 | 237 | 61 | 86 | 61 | 81.5 | 0.099 | 1.000 | 0.050 | 1.000 |
natural language | 390 | 49 | 390 | 31 | 115 | 31 | 82.0 | 0.162 | 1.000 | 0.025 | 1.000 |
masters | 201 | 87 | 201 | 85 | 77 | 85 | 82.0 | 0.084 | 1.000 | 0.070 | 1.000 |
artificial intelligence | 318 | 55 | 318 | 33 | 110 | 33 | 82.5 | 0.132 | 1.000 | 0.027 | 1.000 |
large data sets | 227 | 82 | 227 | 72 | 83 | 72 | 82.5 | 0.094 | 1.000 | 0.059 | 1.000 |
linux | 251 | 74 | 251 | 46 | 95 | 46 | 84.5 | 0.104 | 1.000 | 0.038 | 1.000 |
machine learning techniques | 298 | 62 | 298 | 33 | 111 | 33 | 86.5 | 0.124 | 1.000 | 0.027 | 1.000 |
work independently | 97 | 114 | 97 | 125 | 59 | 125 | 86.5 | 0.040 | 1.000 | 0.102 | 1.000 |
data processing | 227 | 81 | 227 | 40 | 99 | 40 | 90.0 | 0.094 | 1.000 | 0.033 | 1.000 |
data engineering | 194 | 88 | 194 | 50 | 94 | 50 | 91.0 | 0.081 | 1.000 | 0.041 | 1.000 |
learning algorithms | 315 | 58 | 315 | 26 | 125 | 26 | 91.5 | 0.131 | 1.000 | 0.021 | 1.000 |
language processing | 320 | 54 | 320 | 25 | 130 | 25 | 92.0 | 0.133 | 1.000 | 0.020 | 1.000 |
have experience | 157 | 97 | 157 | 58 | 88 | 58 | 92.5 | 0.065 | 1.000 | 0.048 | 1.000 |
predictive models | 211 | 85 | 211 | 39 | 102 | 39 | 93.5 | 0.088 | 1.000 | 0.032 | 1.000 |
software engineers | 194 | 89 | 194 | 41 | 98 | 41 | 93.5 | 0.081 | 1.000 | 0.034 | 1.000 |
natural language processing | 318 | 56 | 318 | 24 | 133 | 24 | 94.5 | 0.132 | 1.000 | 0.020 | 1.000 |
creativity | 155 | 98 | 155 | 50 | 93 | 50 | 95.5 | 0.064 | 1.000 | 0.041 | 1.000 |
Go | 107 | 110 | NA | 74 | 81 | NA | 95.5 | NA | NA | NA | NA |
sap | 112 | 108 | 112 | 53 | 90 | 53 | 99.0 | 0.047 | 1.000 | 0.043 | 1.000 |
machine learning algorithms | 278 | 67 | 278 | 24 | 132 | 24 | 99.5 | 0.116 | 1.000 | 0.020 | 1.000 |
predictive analytics | 146 | 100 | 146 | 39 | 101 | 39 | 100.5 | 0.061 | 1.000 | 0.032 | 1.000 |
learning models | 268 | 71 | 268 | 23 | 134 | 23 | 102.5 | 0.111 | 1.000 | 0.019 | 1.000 |
| | 95 | 116 | 95 | 53 | 89 | 53 | 102.5 | 0.039 | 1.000 | 0.043 | 1.000 |
data pipeline | 225 | 83 | 225 | 26 | 123 | 26 | 103.0 | 0.094 | 1.000 | 0.021 | 1.000 |
critical thinking | 62 | 127 | 62 | 77 | 79 | 77 | 103.0 | 0.026 | 1.000 | 0.063 | 1.000 |
javascript | 127 | 104 | 127 | 37 | 104 | 37 | 104.0 | 0.053 | 1.000 | 0.030 | 1.000 |
modeling techniques | 122 | 105 | 122 | 36 | 105 | 36 | 105.0 | 0.051 | 1.000 | 0.029 | 1.000 |
modelling | 102 | 112 | 102 | 39 | 100 | 39 | 106.0 | 0.042 | 1.000 | 0.032 | 1.000 |
large scale | 235 | 78 | 235 | 22 | 135 | 22 | 106.5 | 0.098 | 1.000 | 0.018 | 1.000 |
nosql | 232 | 79 | 232 | 22 | 136 | 22 | 107.5 | 0.096 | 1.000 | 0.018 | 1.000 |
machine learning models | 231 | 80 | 231 | 21 | 137 | 21 | 108.5 | 0.096 | 1.000 | 0.017 | 1.000 |
unix | 112 | 109 | 112 | 34 | 109 | 34 | 109.0 | 0.047 | 1.000 | 0.028 | 1.000 |
array | 104 | 111 | 104 | 34 | 107 | 34 | 109.0 | 0.043 | 1.000 | 0.028 | 1.000 |
microsoft office | 31 | 151 | 31 | 100 | 68 | 100 | 109.5 | 0.013 | 1.000 | 0.082 | 1.000 |
data models | 116 | 107 | 116 | 31 | 113 | 31 | 110.0 | 0.048 | 1.000 | 0.025 | 1.000 |
multi-task | 46 | 141 | 46 | 77 | 80 | 77 | 110.5 | 0.019 | 1.000 | 0.063 | 1.000 |
nlp | 254 | 72 | 254 | 14 | 150 | 14 | 111.0 | 0.106 | 1.000 | 0.011 | 1.000 |
market research | 29 | 154 | 29 | 95 | 70 | 95 | 112.0 | 0.012 | 1.000 | 0.078 | 1.000 |
highly motivated | 51 | 134 | 51 | 52 | 91 | 52 | 112.5 | 0.021 | 1.000 | 0.043 | 1.000 |
data warehousing | 95 | 115 | 95 | 30 | 116 | 30 | 115.5 | 0.039 | 1.000 | 0.025 | 1.000 |
coordination | 39 | 147 | 39 | 68 | 84 | 68 | 115.5 | 0.016 | 1.000 | 0.056 | 1.000 |
mysql | 92 | 117 | 92 | 30 | 117 | 30 | 117.0 | 0.038 | 1.000 | 0.025 | 1.000 |
data pipelines | 179 | 90 | 179 | 16 | 145 | 16 | 117.5 | 0.074 | 1.000 | 0.013 | 1.000 |
data systems | 78 | 121 | 78 | 31 | 114 | 31 | 117.5 | 0.032 | 1.000 | 0.025 | 1.000 |
bachelors | 46 | 139 | 46 | 41 | 96 | 41 | 117.5 | 0.019 | 1.000 | 0.034 | 1.000 |
solving problems | 77 | 122 | 77 | 30 | 118 | 30 | 120.0 | 0.032 | 1.000 | 0.025 | 1.000 |
azure | 127 | 103 | 127 | 19 | 138 | 19 | 120.5 | 0.053 | 1.000 | 0.016 | 1.000 |
time management | 38 | 149 | 38 | 51 | 92 | 51 | 120.5 | 0.016 | 1.000 | 0.042 | 1.000 |
ecommerce | 80 | 120 | 80 | 26 | 124 | 26 | 122.0 | 0.033 | 1.000 | 0.021 | 1.000 |
data manipulation | 51 | 133 | 51 | 31 | 112 | 31 | 122.5 | 0.021 | 1.000 | 0.025 | 1.000 |
vba | 19 | 161 | 19 | 60 | 87 | 60 | 124.0 | 0.008 | 1.000 | 0.049 | 1.000 |
data integration | 59 | 129 | 59 | 28 | 120 | 28 | 124.5 | 0.025 | 1.000 | 0.023 | 1.000 |
google analytics | 30 | 153 | 30 | 41 | 97 | 41 | 125.0 | 0.012 | 1.000 | 0.034 | 1.000 |
data extraction | 75 | 123 | 75 | 25 | 129 | 25 | 126.0 | 0.031 | 1.000 | 0.020 | 1.000 |
microsoft excel | 15 | 167 | 15 | 68 | 85 | 68 | 126.0 | 0.006 | 1.000 | 0.056 | 1.000 |
troubleshooting | 40 | 145 | 40 | 34 | 108 | 34 | 126.5 | 0.017 | 1.000 | 0.028 | 1.000 |
data architecture | 120 | 106 | 120 | 13 | 151 | 13 | 128.5 | 0.050 | 1.000 | 0.011 | 1.000 |
ruby | 98 | 113 | 98 | 17 | 144 | 17 | 128.5 | 0.041 | 1.000 | 0.014 | 1.000 |
masters degree | 49 | 135 | 49 | 27 | 122 | 27 | 128.5 | 0.020 | 1.000 | 0.022 | 1.000 |
speaking | 47 | 137 | 47 | 26 | 127 | 26 | 132.0 | 0.020 | 1.000 | 0.021 | 1.000 |
youtube | 40 | 146 | 40 | 29 | 119 | 29 | 132.5 | 0.017 | 1.000 | 0.024 | 1.000 |
methodological | 34 | 150 | 34 | 26 | 126 | 26 | 138.0 | 0.014 | 1.000 | 0.021 | 1.000 |
data gathering | 39 | 148 | 39 | 24 | 131 | 24 | 139.5 | 0.016 | 1.000 | 0.020 | 1.000 |
mongodb | 75 | 124 | 75 | 10 | 160 | 10 | 142.0 | 0.031 | 1.000 | 0.008 | 1.000 |
manage multiple projects | 16 | 165 | 16 | 28 | 121 | 28 | 143.0 | 0.007 | 1.000 | 0.023 | 1.000 |
postgresql | 56 | 132 | 56 | 11 | 155 | 11 | 143.5 | 0.023 | 1.000 | 0.009 | 1.000 |
language understanding | 83 | 119 | 83 | 5 | 171 | 5 | 145.0 | 0.034 | 1.000 | 0.004 | 1.000 |
github | 57 | 131 | 57 | 10 | 159 | 10 | 145.0 | 0.024 | 1.000 | 0.008 | 1.000 |
elasticsearch | 47 | 136 | 47 | 11 | 154 | 11 | 145.0 | 0.020 | 1.000 | 0.009 | 1.000 |
apache spark | 65 | 125 | 65 | 6 | 167 | 6 | 146.0 | 0.027 | 1.000 | 0.005 | 1.000 |
bash | 45 | 142 | 45 | 12 | 152 | 12 | 147.0 | 0.019 | 1.000 | 0.010 | 1.000 |
data insights | 27 | 156 | 27 | 19 | 140 | 19 | 148.0 | 0.011 | 1.000 | 0.016 | 1.000 |
natural language understanding | 64 | 126 | 64 | 4 | 178 | 4 | 152.0 | 0.027 | 1.000 | 0.003 | 1.000 |
doctorate degree | 17 | 164 | 17 | 19 | 141 | 19 | 152.5 | 0.007 | 1.000 | 0.016 | 1.000 |
negotiation | 21 | 159 | 21 | 15 | 147 | 15 | 153.0 | 0.009 | 1.000 | 0.012 | 1.000 |
network analysis | 28 | 155 | 28 | 12 | 153 | 12 | 154.0 | 0.012 | 1.000 | 0.010 | 1.000 |
kpmg | 60 | 128 | 60 | 3 | 184 | 3 | 156.0 | 0.025 | 1.000 | 0.002 | 1.000 |
shell script | 40 | 144 | 40 | 6 | 169 | 6 | 156.5 | 0.017 | 1.000 | 0.005 | 1.000 |
highly organized | 10 | 170 | 10 | 17 | 143 | 17 | 156.5 | 0.004 | 1.000 | 0.014 | 1.000 |
data entry | 8 | 176 | 8 | 19 | 139 | 19 | 157.5 | 0.003 | 1.000 | 0.016 | 1.000 |
data preparation | 22 | 158 | 22 | 8 | 161 | 8 | 159.5 | 0.009 | 1.000 | 0.007 | 1.000 |
data reporting | 8 | 178 | 8 | 15 | 146 | 15 | 162.0 | 0.003 | 1.000 | 0.012 | 1.000 |
strategic thinking | 13 | 169 | 13 | 11 | 157 | 11 | 163.0 | 0.005 | 1.000 | 0.009 | 1.000 |
microsoft word | 6 | 185 | 6 | 18 | 142 | 18 | 163.5 | 0.002 | 1.000 | 0.015 | 1.000 |
analytics data | 15 | 166 | 15 | 7 | 163 | 7 | 164.5 | 0.006 | 1.000 | 0.006 | 1.000 |
jupyter notebook | 18 | 162 | 18 | 5 | 170 | 5 | 166.0 | 0.007 | 1.000 | 0.004 | 1.000 |
sales and marketing | 19 | 160 | 19 | 5 | 173 | 5 | 166.5 | 0.008 | 1.000 | 0.004 | 1.000 |
microstrategy | 18 | 163 | 18 | 5 | 172 | 5 | 167.5 | 0.007 | 1.000 | 0.004 | 1.000 |
| | 9 | 172 | NA | 6 | 166 | NA | 169.0 | NA | NA | NA | NA |
nlu | 40 | 143 | 40 | 2 | 199 | 2 | 171.0 | 0.017 | 1.000 | 0.002 | 1.000 |
data transfer | 6 | 184 | 6 | 8 | 162 | 8 | 173.0 | 0.002 | 1.000 | 0.007 | 1.000 |
english language | 5 | 188 | 5 | 10 | 158 | 10 | 173.0 | 0.002 | 1.000 | 0.008 | 1.000 |
service orientation | 5 | 191 | 5 | 11 | 156 | 11 | 173.5 | 0.002 | 1.000 | 0.009 | 1.000 |
symantec | 9 | 174 | 9 | 5 | 174 | 5 | 174.0 | 0.004 | 1.000 | 0.004 | 1.000 |
grammatical | 9 | 173 | 9 | 4 | 176 | 4 | 174.5 | 0.004 | 1.000 | 0.003 | 1.000 |
systems analysis | 3 | 202 | 3 | 15 | 148 | 15 | 175.0 | 0.001 | 1.000 | 0.012 | 1.000 |
microsoft azure | 23 | 157 | 23 | 2 | 198 | 2 | 177.5 | 0.010 | 1.000 | 0.002 | 1.000 |
data mapping | 8 | 177 | 8 | 3 | 182 | 3 | 179.5 | 0.003 | 1.000 | 0.002 | 1.000 |
telecommunications | 9 | 175 | 9 | 3 | 187 | 3 | 181.0 | 0.004 | 1.000 | 0.002 | 1.000 |
django | 30 | 152 | 30 | 1 | 211 | 1 | 181.5 | 0.012 | 1.000 | 0.001 | 1.000 |
amazon redshift | 6 | 183 | 6 | 3 | 180 | 3 | 181.5 | 0.002 | 1.000 | 0.002 | 1.000 |
microsoft access | 3 | 199 | 3 | 7 | 165 | 7 | 182.0 | 0.001 | 1.000 | 0.006 | 1.000 |
microsoft powerpoint | 2 | 215 | 2 | 14 | 149 | 14 | 182.0 | 0.001 | 1.000 | 0.011 | 1.000 |
complex problem solving | 5 | 187 | 5 | 3 | 181 | 3 | 184.0 | 0.002 | 1.000 | 0.002 | 1.000 |
active learning | 7 | 180 | 7 | 2 | 189 | 2 | 184.5 | 0.003 | 1.000 | 0.002 | 1.000 |
data interpretation | 3 | 197 | 3 | 4 | 175 | 4 | 186.0 | 0.001 | 1.000 | 0.003 | 1.000 |
apache hadoop | 7 | 181 | 7 | 2 | 192 | 2 | 186.5 | 0.003 | 1.000 | 0.002 | 1.000 |
client management | 2 | 209 | 2 | 7 | 164 | 7 | 186.5 | 0.001 | 1.000 | 0.006 | 1.000 |
confluence | 7 | 182 | 7 | 2 | 194 | 2 | 188.0 | 0.003 | 1.000 | 0.002 | 1.000 |
minitab | 3 | 200 | 3 | 4 | 177 | 4 | 188.5 | 0.001 | 1.000 | 0.003 | 1.000 |
machine learning data | 13 | 168 | 13 | 1 | 220 | 1 | 194.0 | 0.005 | 1.000 | 0.001 | 1.000 |
swift | 6 | 186 | 6 | 2 | 202 | 2 | 194.0 | 0.002 | 1.000 | 0.002 | 1.000 |
jquery | 10 | 171 | 10 | 1 | 218 | 1 | 194.5 | 0.004 | 1.000 | 0.001 | 1.000 |
eko | 8 | 179 | 8 | 1 | 212 | 1 | 195.5 | 0.003 | 1.000 | 0.001 | 1.000 |
work well in a team | 3 | 205 | 3 | 3 | 188 | 3 | 196.5 | 0.001 | 1.000 | 0.002 | 1.000 |
operations analysis | 4 | 194 | 4 | 2 | 200 | 2 | 197.0 | 0.002 | 1.000 | 0.002 | 1.000 |
see the big picture | 4 | 195 | 4 | 2 | 201 | 2 | 198.0 | 0.002 | 1.000 | 0.002 | 1.000 |
apache kafka | 4 | 193 | 4 | 1 | 204 | 1 | 198.5 | 0.002 | 1.000 | 0.001 | 1.000 |
microsoft outlook | 2 | 214 | 2 | 3 | 186 | 3 | 200.0 | 0.001 | 1.000 | 0.002 | 1.000 |
clerical | 2 | 208 | 2 | 2 | 193 | 2 | 200.5 | 0.001 | 1.000 | 0.002 | 1.000 |
ibm db2 | 5 | 189 | 5 | 1 | 217 | 1 | 203.0 | 0.002 | 1.000 | 0.001 | 1.000 |
report creation | 1 | 231 | 1 | 4 | 179 | 4 | 205.0 | 0.000 | 1.000 | 0.003 | 1.000 |
experience in information technology | 3 | 198 | 3 | 1 | 214 | 1 | 206.0 | 0.001 | 1.000 | 0.001 | 1.000 |
microsoft sql server | 5 | 190 | 5 | 1 | 224 | 1 | 207.0 | 0.002 | 1.000 | 0.001 | 1.000 |
data cleanup | 2 | 210 | 2 | 1 | 206 | 1 | 208.0 | 0.001 | 1.000 | 0.001 | 1.000 |
data organization | 2 | 211 | 2 | 1 | 207 | 1 | 209.0 | 0.001 | 1.000 | 0.001 | 1.000 |
unix shell | 5 | 192 | 5 | 1 | 236 | 1 | 214.0 | 0.002 | 1.000 | 0.001 | 1.000 |
google adwords | 2 | 212 | 2 | 1 | 216 | 1 | 214.0 | 0.001 | 1.000 | 0.001 | 1.000 |
design development | 1 | 218 | 1 | 1 | 210 | 1 | 214.0 | 0.000 | 1.000 | 0.001 | 1.000 |
skype | 3 | 201 | 3 | 1 | 231 | 1 | 216.0 | 0.001 | 1.000 | 0.001 | 1.000 |
mathematical reasoning | 2 | 213 | 2 | 1 | 221 | 1 | 217.0 | 0.001 | 1.000 | 0.001 | 1.000 |
filemaker pro | 1 | 220 | 1 | 1 | 215 | 1 | 217.5 | 0.000 | 1.000 | 0.001 | 1.000 |
wireshark | 1 | 233 | 1 | 2 | 203 | 2 | 218.0 | 0.000 | 1.000 | 0.002 | 1.000 |
technology design | 3 | 203 | 3 | 1 | 234 | 1 | 218.5 | 0.001 | 1.000 | 0.001 | 1.000 |
ubuntu | 3 | 204 | 3 | 1 | 235 | 1 | 219.5 | 0.001 | 1.000 | 0.001 | 1.000 |
judgment and decision making | 1 | 222 | 1 | 1 | 219 | 1 | 220.5 | 0.000 | 1.000 | 0.001 | 1.000 |
microsoft dynamics | 1 | 223 | 1 | 1 | 222 | 1 | 222.5 | 0.000 | 1.000 | 0.001 | 1.000 |
microsoft windows server | 1 | 224 | 1 | 1 | 226 | 1 | 225.0 | 0.000 | 1.000 | 0.001 | 1.000 |
oracle hyperion | 1 | 225 | 1 | 1 | 227 | 1 | 226.0 | 0.000 | 1.000 | 0.001 | 1.000 |
organizational management | 1 | 227 | 1 | 1 | 228 | 1 | 227.5 | 0.000 | 1.000 | 0.001 | 1.000 |
reading comprehension | 1 | 230 | 1 | 1 | 230 | 1 | 230.0 | 0.000 | 1.000 | 0.001 | 1.000 |
big data architecture | 57 | 130 | 57 | NA | NA | NA | NA | 0.024 | 1.000 | NA | NA |
architecture capabilities | 46 | 138 | 46 | NA | NA | NA | NA | 0.019 | 1.000 | NA | NA |
covering technologies | 46 | 140 | 46 | NA | NA | NA | NA | 0.019 | 1.000 | NA | NA |
apache hive | 3 | 196 | 3 | NA | NA | NA | NA | 0.001 | 1.000 | NA | NA |
active listening | 2 | 206 | 2 | NA | NA | NA | NA | 0.001 | 1.000 | NA | NA |
citrix | 2 | 207 | 2 | NA | NA | NA | NA | 0.001 | 1.000 | NA | NA |
amazon dynamodb | 1 | 216 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
bring creativity | 1 | 217 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
engineering and technology | 1 | 219 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
ibm infosphere datastage | 1 | 221 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
oracle java | 1 | 226 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
prepare data for analysis | 1 | 228 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
quality control analysis | 1 | 229 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
teradata database | 1 | 232 | 1 | NA | NA | NA | NA | 0.000 | 1.000 | NA | NA |
microsoft project | NA | NA | NA | 6 | 168 | 6 | NA | NA | NA | 0.005 | 1.000 |
data storytelling | NA | NA | NA | 3 | 183 | 3 | NA | NA | NA | 0.002 | 1.000 |
lexisnexis | NA | NA | NA | 3 | 185 | 3 | NA | NA | NA | 0.002 | 1.000 |
administration and management | NA | NA | NA | 2 | 190 | 2 | NA | NA | NA | 0.002 | 1.000 |
ajax | NA | NA | NA | 2 | 191 | 2 | NA | NA | NA | 0.002 | 1.000 |
experience in market research | NA | NA | NA | 2 | 195 | 2 | NA | NA | NA | 0.002 | 1.000 |
google docs | NA | NA | NA | 2 | 196 | 2 | NA | NA | NA | 0.002 | 1.000 |
mcafee | NA | NA | NA | 2 | 197 | 2 | NA | NA | NA | 0.002 | 1.000 |
apache tomcat | NA | NA | NA | 1 | 205 | 1 | NA | NA | NA | 0.001 | 1.000 |
datadriven | NA | NA | NA | 1 | 208 | 1 | NA | NA | NA | 0.001 | 1.000 |
deductive reasoning | NA | NA | NA | 1 | 209 | 1 | NA | NA | NA | 0.001 | 1.000 |
epic systems | NA | NA | NA | 1 | 213 | 1 | NA | NA | NA | 0.001 | 1.000 |
microsoft sharepoint | NA | NA | NA | 1 | 223 | 1 | NA | NA | NA | 0.001 | 1.000 |
microsoft sql server reporting services | NA | NA | NA | 1 | 225 | 1 | NA | NA | NA | 0.001 | 1.000 |
processing information | NA | NA | NA | 1 | 229 | 1 | NA | NA | NA | 0.001 | 1.000 |
systems evaluation | NA | NA | NA | 1 | 232 | 1 | NA | NA | NA | 0.001 | 1.000 |
tax software | NA | NA | NA | 1 | 233 | 1 | NA | NA | NA | 0.001 | 1.000 |
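The derived columns in findings_table_all appear to follow directly from the raw counts: avg_rank is the mean of the two ranks, jd_percent_* is the share of job descriptions that mention the skill (stored as a proportion), and freq_per_jd_* is the number of mentions per job description that mentions the skill. The sketch below recomputes them in base R for three rows copied from the table. It is an illustration, not the code that produced the table, and the totals `n_jd_ds` and `n_jd_da` are assumptions back-calculated from the jd_percent columns rather than figures reported by the analysis.

```r
# Three rows copied from findings_table_all (raw columns only).
findings_sample <- data.frame(
  skills      = c("python", "design", "r"),
  count_ds    = c(5250, 1468, 1118),
  rank_ds     = c(1, 3, 10),
  jd_count_ds = c(1750, 1468, 950),
  count_da    = c(1275, 627, 401),
  rank_da     = c(1, 5, 15),
  jd_count_da = c(425, 627, 326),
  stringsAsFactors = FALSE
)

# ASSUMPTION: approximate number of job descriptions in each group,
# inferred from the table's jd_percent columns (not stated in the source).
n_jd_ds <- 2407
n_jd_da <- 1221

# Recompute the derived columns from the raw counts.
findings_sample <- transform(
  findings_sample,
  avg_rank       = (rank_ds + rank_da) / 2,
  jd_percent_ds  = round(jd_count_ds / n_jd_ds, 3),
  freq_per_jd_ds = round(count_ds / jd_count_ds, 3),
  jd_percent_da  = round(jd_count_da / n_jd_da, 3),
  freq_per_jd_da = round(count_da / jd_count_da, 3)
)

print(findings_sample)
```

Read this way, a freq_per_jd value above 1 (as for python and r) means the term tends to appear more than once in the listings that mention it.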
Survey
The top broad skills identified by the survey were “programming skills” and “analytical skills”. “Statistics”, “R”, and “Python” were top skills identified by individuals working full-time in either computer science or data science. Overall, in the filtered data sets, there weren’t many common bigrams, which was likely due to the small size of the survey. Interestingly, two survey respondents in the data science / computer science group identified “business knowledge” as a top skill.
Overall, most of the skills identified by the survey were either broad answers, such as “programming skills” or “analytical skills”, or specific programming languages such as Python, R, and SQL. A few respondents also identified abstract skills, such as creativity. While there was more specificity in the survey answers of individuals working or studying in data science or computer science, more survey responses should be gathered before drawing any broader conclusions.
To review, the following conclusions emerged from our research:
Data science emphasizes a range of “hard” data-centric skills, especially programming skills like Python, SQL and R, as well as statistical and math skills and a knowledge of machine learning.
At least for employers, data scientist positions and data analyst positions point to very different skill sets. For data analysts, communication, research, analytical and organizational skills are most prominent. While there is overlap on almost all of the skills, the emphases are very different.
Working data scientists and computer scientists from our survey corroborate this point of view. Statistics and programming make up the bulk of the skills for data scientists.
Individuals from our survey not directly involved with data science are more likely to include softer skills. They see writing, thinking, creativity and communication as aspects of data science.
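The difference in emphasis between data scientist and data analyst listings can be made concrete by subtracting the two rank columns of findings_table_all. The following is a minimal sketch in base R using a handful of rows copied from that table; the column name `rank_gap` is a hypothetical helper introduced here for illustration and is not part of the original analysis.

```r
# A few rows copied from findings_table_all:
# rank_ds = rank among data scientist listings,
# rank_da = rank among data analyst listings.
findings_sample <- data.frame(
  skills  = c("python", "machine learning", "algorithms",
              "communication", "analytical", "written"),
  rank_ds = c(1, 2, 13, 9, 18, 31),
  rank_da = c(1, 37, 57, 3, 4, 7),
  stringsAsFactors = FALSE
)

# rank_gap (hypothetical): positive values mean the skill ranks higher
# (closer to 1) for data scientists than for data analysts.
findings_sample$rank_gap <- findings_sample$rank_da - findings_sample$rank_ds

# Sort so the skills emphasized by data scientist listings come first.
rank_gap_sorted <- findings_sample[order(-findings_sample$rank_gap),
                                   c("skills", "rank_gap")]
print(rank_gap_sorted)
```

In this sample the large positive gaps fall on machine learning and algorithms, while the negative gaps fall on communication, analytical and written skills.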
So what are the most valuable skills for a data scientist? Here we really answer a narrower question: what are the most valuable skills to learn in order to be hired and work in the field of data science? The answer to this is clear: programming, statistics, math, algorithms and computer science.
But if we wanted to answer the question more broadly (what skills would be most valuable for the next generation of data scientists to learn?), we might want to include the thoughts of ethicists, philosophers, futurists, consumers, and representatives of vulnerable populations. But that will have to wait for another project.