…

Research Question

Overview

Collaboration Model

The work on this project was fully collaborative. We divided the work into pieces, and for each piece we assigned a team of three or more.

Everyone was on more than one team. This way people who knew less could learn from those who knew more, and we could all contribute.

We held a number of meetings on Zoom, where we reviewed each step of the process together and made consensus-based decisions about direction.

We used Slack, Google Drive, Azure, and GitHub to share thoughts, ideas, and code.

Who is “we”? We are:

Carlisle
Cassie
Dan
Eric
Esteban
Mari
Tyler

#Packages used

library(tidytext)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(plotly)
library(stringr)
library(DT)
library(kableExtra)
library(wesanderson)
library(tidyr)
library(corpus)
library(keyring)
library(RODBC)
library(readr)
library(stringi)
library(tm)
library(wordcloud)
library(data.table)
library(RedditExtractoR)
library(readxl)
library(magrittr)
library(SnowballC)
library(RColorBrewer)

Methodology

Our methodology was to mine varied data sets which provided information that could support different points of view about the research question. We settled on three: a data set of 7000 Indeed job listings from 2018, a compendium of work skills from the University of Chicago’s Open Skills project, and a survey that we created for the purpose of this project.

We used the data sets to create a comprehensive word_catalog of data scientist skills, knowledge and abilities. Then we ran the word_catalog back over the Indeed data set so that we could pull out the skills that appeared most. Finally, we used our findings to compare the Indeed data set for data scientists, the Indeed data set for data analysts, and the survey.

1. Collect Data

Collect the Data to be Analyzed

In this step we read in the following data sets and write them back to a normalized database on Azure. From that point forward, we only pull data from the database so that we are confident we have consistent, secure, and persistent storage for our data:

Data Scientist Job Market In the US, Kaggle, downloaded into csv from the website
Data from the University of Chicago’s Center for Data Science & Public Policy’s Open Skills Project, accessed through their API
Our survey, loaded from Google Drive

Read in data and filter by position

conn_str <- paste0(
  'Driver={ODBC Driver 17 for SQL Server};
   Server=tcp:ehtmp.database.windows.net,1433;
   Database=ds_skills;
   Encrypt=yes;
   TrustServerCertificate=no;
   Connection Timeout=30;',
   'Uid=',keyring::key_get(service = "my-skills-db-username", keyring = "my-skills-db-keyring"),';',
   'Pwd=', keyring::key_get(service = "my-skills-db-pwd", keyring = "my-skills-db-keyring"), ';'
)
dbConnection <- odbcDriverConnect(conn_str)

all_data <- (sqlQuery(dbConnection, "SELECT * FROM ds_skills.kaggle.job_postings_raw"))

#if the connection doesnt work use the code below to upload the csv file
#all_data<-read.csv("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/project_3/raw_jobdata.csv")
all_data$position <- tolower(all_data$position)
# lowercase the descriptions, then strip encoding debris (mis-decoded curly quotes) and literal newlines
all_data$description <- tolower(all_data$description) %>%
  str_remove_all("â|€|™|\\n")

##filters for data science positions
data_scientists<-all_data%>%
  mutate(contents = str_detect(tolower(position), "data [b-z]|ai|machine"))%>%
  filter(contents == TRUE)

data_analysts<-all_data%>%
  mutate(contents = str_detect(tolower(position), "anal"))%>%
  filter(contents == TRUE)

rm(all_data)
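To see how the two position filters behave, here is a minimal sketch that applies the same patterns to a handful of made-up job titles. The titles are hypothetical and are not taken from the data set, and base R's `grepl` stands in for `str_detect`.

```r
# Toy job titles (hypothetical, not taken from the data set)
titles <- c("data scientist", "data analyst", "machine learning engineer",
            "maintenance technician", "data science analyst")

# Same patterns as the filters above, applied with base R
is_ds <- grepl("data [b-z]|ai|machine", titles, perl = TRUE)
is_da <- grepl("anal", titles, perl = TRUE)

# Side-by-side view of which group each title falls into
data.frame(title = titles, data_scientist = is_ds, data_analyst = is_da)
```

In this toy example "maintenance technician" is picked up by the data science filter because "ai" matches inside "maintenance", and "data science analyst" falls into both groups. Overlaps and stray matches of this kind are worth keeping in mind when reading the counts later on.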

Filter for targeted skills

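Working with the filtered postings from above, we create new columns containing the stretches of text that surround each keyword of interest, to be worked with later. One issue with this approach is the abundance of empty (or NA) values, since many postings never contain a given keyword.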

ds<-data_scientists%>%
  mutate(skill = str_extract_all(data_scientists$description, " .{75,100}   skill.{150,200} "))%>%
  mutate(must_have=str_extract_all(data_scientists$description, "must have.{150,200} "))%>%
  mutate(knowledge=str_extract_all(data_scientists$description, " .{75,100} knowledge.{150,200} "))%>%
  mutate(experience=str_extract_all(data_scientists$description, " .{100,150} exper.{100,150} "))%>%
  mutate(excel=str_extract_all(data_scientists$description, "excel at.{150,200} |excel with.{150,200} |excel in.{150,200} "))%>%
  mutate(responsible=str_extract_all(data_scientists$description, "responsible.{150,200} "))%>%
  mutate(proficient=str_extract_all(data_scientists$description," .{100,150} profi.{110,160} "))%>%
  mutate(understands=str_extract_all(data_scientists$description, " .{100,150} understand.{150,200} "))%>%
  mutate(utilize=str_extract_all(data_scientists$description, "utilize.{150,200} "))%>%
  mutate(lead=str_extract_all(data_scientists$description, " .{150,200} lead.{150,200} "))%>%
  mutate(work=str_extract_all(data_scientists$description, " .{50,75} work.{150,200} "))%>%
  mutate(looking=str_extract_all(data_scientists$description, "looking.{150,200} "))


# collapse each list of matches into a single string per posting
keyword_cols <- c("skill", "must_have", "knowledge", "experience", "excel", "responsible",
                  "proficient", "understands", "utilize", "lead", "work", "looking")

for (col in keyword_cols) {
  ds[[col]] <- lapply(ds[[col]], function(x) paste(unlist(x), collapse = ' '))
}
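The keyword-window idea can be sketched on a toy example. The two descriptions below are invented, the window is shortened from `.{150,200}` to `.{10,30}` so that it fits a short string, and base R's `regmatches`/`gregexpr` stand in for `str_extract_all`; all of these are assumptions made for illustration only.

```r
# Two invented descriptions: one contains the keyword, one does not
descriptions <- c(
  "we are hiring. you must have strong sql and python skills to join. benefits included.",
  "no matching keyword here"
)

# Shortened window (assumption): keyword, then 10 to 30 characters, ending on a space
pattern <- "must have.{10,30} "

# One character vector of matches per description (empty when nothing matches)
matches <- regmatches(descriptions, gregexpr(pattern, descriptions, perl = TRUE))

# Collapse each set of matches into a single string, as in the code above
must_have <- vapply(matches, function(x) paste(x, collapse = " "), character(1))
must_have
```

A description without the keyword yields an empty set of matches, which collapses to an empty string. This is where the many empty values mentioned above come from.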

Data Analyst

da<-data_analysts%>%
  mutate(skill = str_extract_all(data_analysts$description, " .{75,100}   skill.{150,200} "))%>%
  mutate(must_have=str_extract_all(data_analysts$description, "must have.{150,200} "))%>%
  mutate(knowledge=str_extract_all(data_analysts$description, " .{75,100} knowledge.{150,200} "))%>%
  mutate(experience=str_extract_all(data_analysts$description, " .{100,150} exper.{100,150} "))%>%
  mutate(excel=str_extract_all(data_analysts$description, "excel at.{150,200} |excel with.{150,200} |excel in.{150,200} "))%>%
  mutate(responsible=str_extract_all(data_analysts$description, "responsible.{150,200} "))%>%
  mutate(proficient=str_extract_all(data_analysts$description," .{100,150} profi.{110,160} "))%>%
  mutate(understands=str_extract_all(data_analysts$description, " .{100,150} understand.{150,200} "))%>%
  mutate(utilize=str_extract_all(data_analysts$description, "utilize.{150,200} "))%>%
  mutate(lead=str_extract_all(data_analysts$description, " .{150,200} lead.{150,200} "))%>%
  mutate(work=str_extract_all(data_analysts$description, " .{50,75} work.{150,200} "))%>%
  mutate(looking=str_extract_all(data_analysts$description, "looking.{150,200} "))


# collapse each list of matches into a single string per posting
keyword_cols <- c("skill", "must_have", "knowledge", "experience", "excel", "responsible",
                  "proficient", "understands", "utilize", "lead", "work", "looking")

for (col in keyword_cols) {
  da[[col]] <- lapply(da[[col]], function(x) paste(unlist(x), collapse = ' '))
}

Transform Data Science dataframe to long

ds_long <- ds %>% gather(keyword, text, 7:18)

text <- ds_long %>% select(text)
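For readers unfamiliar with the wide-to-long step, the following base R sketch shows the shape of the result on a tiny invented data frame. The column names and values are hypothetical; the real call uses `gather` on columns 7 to 18.

```r
# Tiny invented "wide" table: one row per posting, one column per keyword
wide <- data.frame(
  position  = c("data scientist", "ml engineer"),
  skill     = c("sql skills needed", ""),
  must_have = c("", "must have python"),
  stringsAsFactors = FALSE
)

keyword_cols <- c("skill", "must_have")

# Stack the keyword columns on top of each other: one row per posting and keyword
long <- data.frame(
  position = rep(wide$position, times = length(keyword_cols)),
  keyword  = rep(keyword_cols, each = nrow(wide)),
  text     = unlist(wide[keyword_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```

Each posting now appears once per keyword column, so two postings and two keyword columns give four rows.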

Make corpus and remove punctuation, numbers, stopwords, convert cases, etc

corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, c("skill","responsible","proficient","knowledge","understands", "must", "experience", "character", "will", "looking", "excels at", "work", "lead", "utilize"))
corpus <- tm_map(corpus, stripWhitespace)

wordcloud(corpus, max.words = 50, colors = colorRampPalette(brewer.pal(7, "Dark2"))(32))
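The effect of the cleaning steps can be followed on a single sentence. This is only a sketch: base R `gsub` calls stand in for the `tm` transformations, the helper `clean_text` and the short list of words to drop are hypothetical, and the final `trimws` is an extra step added here for readability.

```r
# Hypothetical helper that mirrors the order of the tm transformations above
clean_text <- function(x, drop_words) {
  x <- gsub("[[:punct:]]+", "", x)            # remove punctuation
  x <- tolower(x)                             # convert to lower case
  x <- gsub("[[:digit:]]+", "", x)            # remove numbers
  word_pattern <- sprintf("\\b(%s)\\b", paste(drop_words, collapse = "|"))
  x <- gsub(word_pattern, "", x, perl = TRUE) # remove unwanted whole words
  trimws(gsub("[[:space:]]+", " ", x))        # collapse runs of whitespace
}

# Short illustrative list (assumption), not the full English stopword list
drop_words <- c("must", "have", "years", "of", "with", "and", "experience")

cleaned <- clean_text("Must have 5+ years of experience with Python, SQL and   Spark.",
                      drop_words)
cleaned
```

Lower-casing before removing words matters in this sketch, because the word pattern is case-sensitive and would otherwise miss "Must".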

Tokenization of the text body into unigrams (one word), bigrams (two words), and trigrams (three words)

#Unigrams
unigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE) }
unigram <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, 20)))


#Bigrams
bigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) }
bigram <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))


#Trigrams
trigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE) }
trigram <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, 60),tokenize = trigramTokenizer))
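As a rough illustration of what the tokenizers produce, the sketch below builds n-grams from a short invented phrase in base R and counts them. The helper `make_ngrams` is hypothetical and is not part of `tm` or `NLP`.

```r
# Hypothetical helper: slide a window of n words across a token vector
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
}

# Invented phrase, split on spaces
tokens <- strsplit("machine learning models machine learning", " ", fixed = TRUE)[[1]]

bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

# Frequency of each bigram, most frequent first
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
bigram_freq
```

Sorting the counts in decreasing order is the same idea that the plots in the following sections apply to the term-document matrices.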

Plot unigram

#Unigrams

unigramrow <- sort(slam::row_sums(unigram), decreasing=T)
unigramfreq <- data.table(tok = names(unigramrow), freq = unigramrow)


ggplot(unigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = "coral") + theme_bw() +
     ggtitle("Top 25 Unigrams") +labs(x = "", y = "")

Plot bigram

#Bigrams

bigramrow <- sort(slam::row_sums(bigram), decreasing=T)
bigramfreq <- data.table(tok = names(bigramrow), freq = bigramrow)

ggplot(bigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = "coral") + theme_bw() +
     ggtitle("Top 25 Bigrams") +labs(x = "", y = "")

Plot trigram

#Trigrams

trigramrow <- sort(slam::row_sums(trigram), decreasing=T)
trigramfreq <- data.table(tok = names(trigramrow), freq = trigramrow)

ggplot(trigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = "coral") + theme_bw() +
     ggtitle("Top 25 Trigrams") +labs(x = "", y = "")

2. The Word Catalog

Create a Word Catalog

Building the word_catalog is the heart of our project. The Indeed database contains 7,000 lengthy job descriptions. Performing a word count or a simple regex on “skills” is very limited – consider, e.g., the sentence “The applicant must be familiar with all aspects of the data process, from collection to analysis, and be adept at communicating their findings.”

In order to capture all of the skills in our data set, we built a list of as many possible words and phrases describing data science skills as might exist in the data set.

We then ran this list back over the data set to find the words and phrases that appeared most frequently.

We built this list by combining a deep dive into Regex with an n-gram analysis so that we could see not only what words appeared frequently, but how they appeared together with other words.

We used relevant keywords from both the Indeed data set and the Open Skills data set to supplement this comprehensive list of possible skills.

In the end, our word_catalog contained [put number here] individual words and phrases describing data science skills.

Create a word catalog and search for string in column

#Load word_catalog 

dictionary_analyst <- read.csv("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/project_3/Eric_DataAnalystDictionary.csv", header=FALSE, fileEncoding = "UTF-8-BOM")

dictionary_ngrams <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/dictionary/Ngrams_dictionary.csv?token=ASEY4BIFPB7FWOOYPWAQNWTANUIPC", fileEncoding = "UTF-8-BOM")

os <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/dictionary/OS_dictionary_skills.csv?token=ASEY4BNIW2YDLN2N57PN7ELANZUJK", fileEncoding = "UTF-8-BOM")

onet <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/dictionary/ONET%20Technology%20Skills.csv?token=ASEY4BJAZDGVPDDHSSZJZ43ANZUNY", fileEncoding = "UTF-8-BOM")


# create word catalogue with single skill column for merge

dictionary_onet <- onet %>% select(skill = Skill)
dictionary_os <- os %>% select(skill)

#Assign column name to analyst word catalog 

names(dictionary_analyst) <- ('skill')
names(dictionary_ngrams) <- ('skill')

#convert to lowercase and remove special characters where needed

dictionary_analyst <- dictionary_analyst %>% mutate(across(where(is.character), tolower))
dictionary_ngrams <- dictionary_ngrams  %>% mutate(across(where(is.character), tolower))
dictionary_os <- dictionary_os  %>% mutate(across(where(is.character), tolower))
dictionary_onet <- dictionary_onet  %>% mutate(across(where(is.character), tolower)) %>% mutate(across(everything(), ~ gsub("[[:punct:]]", "", .x)))


#merge dictionaries and remove duplicates 

MyMerge <- function(x, y){
  df <- merge(x, y, all = TRUE)
  return(df)
}

# merge all four dictionaries and delete duplicate skills

dictionary <- Reduce(MyMerge, list(dictionary_analyst, dictionary_ngrams, dictionary_onet, dictionary_os)) %>% distinct()


# remove common character strings found within words and phrases to analyze separately 

#remove skills that need to be removed when they appear alone but not when in a phrase

dictionary$skill <- str_remove(dictionary$skill, "(?! )(ai|science|business)(?! )")

#remove short character skills that will be picked up within words

dictionary$skill <- str_remove(dictionary$skill, "(?:^|\\W)(r|c|ms|go)(?:$|\\W)")

# turn word_catalog into vector and remove empty rows

dictionary <- dictionary[!apply(dictionary == "", 1, all),]
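The merge-and-clean steps can be traced on three tiny invented dictionaries. The skill lists below are hypothetical, and base R's `sub` with `perl = TRUE` stands in for `str_remove`; the pattern for short tokens is the one used above.

```r
# Three invented single-column dictionaries
analyst <- data.frame(skill = c("python", "sql", "r"), stringsAsFactors = FALSE)
ngrams  <- data.frame(skill = c("machine learning", "sql"), stringsAsFactors = FALSE)
onet    <- data.frame(skill = c("ms office", "python"), stringsAsFactors = FALSE)

# Full outer merge on the shared "skill" column, then drop duplicate rows
full_merge <- function(x, y) merge(x, y, all = TRUE)
catalog <- unique(Reduce(full_merge, list(analyst, ngrams, onet)))

# Strip the short tokens that are counted separately (r, c, ms, go)
catalog$skill <- sub("(?:^|\\W)(r|c|ms|go)(?:$|\\W)", "", catalog$skill, perl = TRUE)

# Keep the non-empty entries as a plain character vector
catalog_vec <- catalog$skill[catalog$skill != ""]
catalog_vec
```

In this sketch the entry "r" becomes empty and is dropped, while "ms office" is shortened to "office" because the pattern also matches a short token at the start of a phrase. That side effect may or may not be wanted in the real catalog.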

3. Detect words in word_catalog

Detecting words in the word_catalog we created

Now we are ready to detect word and phrase frequencies in our data set. We will separate out job positions that include “data scientist” from those that include “data analyst” because we cannot answer our research question without investigating whether data science is really just a new fancy term for data analysis.

The detection will be done in two parts. The first will be counts of small words that are hard to isolate in large descriptions. The second will run all other identified skills in our word_catalog through our job descriptions. The resulting skill frequencies of these efforts will be merged for analysis.

Get counts for r, ms, ai, and go

data_sci_count<-data.frame("r" = sum(str_count(data_scientists$description, " r | r,| r\\.")), "ms" = sum(str_count(data_scientists$description, " ms | ms,| ms\\.")), Go = sum(str_count(data_scientists$description, " go ")), "AI" = sum(str_count(data_scientists$description, " ai | ai,| ai\\.")))

data_ana_count<-data.frame("r" = sum(str_count(data_analysts$description, " r | r,| r\\.")), "ms" = sum(str_count(data_analysts$description, " ms | ms,| ms\\.")), Go = sum(str_count(data_analysts$description, " go ")), "AI" = sum(str_count(data_analysts$description, " ai | ai,| ai\\.")))
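The reason for counting the short tokens separately becomes clear on a toy sentence. The sentence below is invented, and base R's `gregexpr`/`regmatches` stand in for `str_count`.

```r
# Invented description containing "r" both as a skill and inside other words
description <- "experience with r, python and sql. strong r skills. we use r."

# Helper: number of non-overlapping matches of a pattern in one string
count_matches <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

# Pattern from above: "r" preceded by a space and followed by a space, comma, or period
standalone_r <- count_matches(" r | r,| r\\.", description)

# Naive count of the letter "r" anywhere in the text
naive_r <- count_matches("r", description)

c(standalone = standalone_r, naive = naive_r)
```

The naive count also picks up the "r" in "experience" and "strong", which is why a bare "r" cannot simply be run through the catalog search.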

Detect Word_catalog words in original descriptions

One way to assess how important a particular skill is, is to look at how many times each word from our word_catalog is mentioned throughout the dataset. Here, we’re looking for an overall count of the word_catalog words that show up most frequently in the Kaggle dataset of job descriptions. We’ll run these counts on both ‘data scientist’ job descriptions and ‘data analyst’ job descriptions.

# Pulls skills out of description based on catalog


setDT(data_scientists)[, skills := paste(dictionary[unlist(lapply(dictionary, function(x) grepl(x, description, ignore.case = T)))], collapse = ","), by = 1:nrow(data_scientists)]

setDT(data_analysts)[, skills := paste(dictionary[unlist(lapply(dictionary, function(x) grepl(x, description, ignore.case = T)))], collapse = ","), by = 1:nrow(data_analysts)]


# Create a count of skills for data science

skillsfreq_ds <- data_scientists %>%
  separate_rows(skills, sep = ',') %>%
  group_by(skills = tolower(skills)) %>%
  summarise(count = n())

# Create a count of skills for data analyst

skillsfreq_da <- data_analysts %>%
  separate_rows(skills, sep = ',') %>%
  group_by(skills = tolower(skills)) %>%
  summarise(count = n())

# Merge counts for r, ms, ai, go

data_sci_count <- data_sci_count %>% gather(skills, count, 1:4)

data_ana_count <- data_ana_count %>% gather(skills, count, 1:4)

skillsfreq_ds <- merge(skillsfreq_ds, data_sci_count, all = TRUE)

skillsfreq_da <- merge(skillsfreq_da, data_ana_count, all = TRUE)


# Merge top data skills for data scientist and data analyst 

skillsfreq_all <- full_join(skillsfreq_ds ,skillsfreq_da,by="skills") %>% rename(count_ds = count.x, count_da = count.y)
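A small base R sketch of the detection step follows. The catalog and descriptions are invented, and fixed-string matching is used here instead of the case-insensitive regular expressions in the code above, so this is an approximation rather than the exact procedure.

```r
# Invented catalog and descriptions
catalog <- c("python", "sql", "scala")
descriptions <- c("python and sql required",
                  "build scalable pipelines in python")

# For each description, list the catalog entries found in it, comma separated
found <- vapply(descriptions, function(d) {
  hits <- vapply(catalog, function(term) grepl(term, d, fixed = TRUE), logical(1))
  paste(catalog[hits], collapse = ",")
}, character(1), USE.NAMES = FALSE)

# Split the lists back apart and count each skill across all descriptions
skill_counts <- table(unlist(strsplit(found, ",", fixed = TRUE)))
skill_counts
```

Because the search looks for substrings, "scala" is detected inside "scalable" in the second description. Matches of this kind could inflate the counts of short catalog entries.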

Let’s take a look at the top skills that show up for each of “data scientist” and “data analyst”:

Table top data scientist skills

top_skills_ds <- skillsfreq_ds %>% arrange(desc(count)) %>% select(skills, count) %>% mutate(rank = row_number()) 

top_skills_ds <- left_join(top_skills_ds, ds_jd_count) %>% rename(jd_count = freq) %>% unique.data.frame()

## Joining, by = "skills"

findings_table_ds <-(top_skills_ds) %>%
  kbl(caption = "Top Skills") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(height = "400px")

findings_table_ds

Top Skills
	skills	count	rank	jd_count
1	python	5250	1	1750
4	machine learning	1693	2	1693
5	design	1468	3	1468
6	computer	1460	4	1460
7	science	1384	5	1384
8	research	1254	6	1254
9	statistics	1249	7	1249
10	sql	1172	8	1172
11	communication	1168	9	1168
12	r	1118	10	950
13	math	1102	11	1102
14	solutions	1056	12	1056
15	algorithms	975	13	975
16	programming	960	14	960
17	leader	905	15	905
18	organization	844	16	844
19	passion	821	17	821
20	analytical	799	18	799
21	phd	789	19	789
22	scala	789	20	789
23	quantitative	779	21	779
24	spark	779	22	779
25	mathematics	768	23	768
26	communication skills	766	24	766
27	java	765	25	765
28	AI	729	26	NA
29	vision	714	27	714
30	hadoop	706	28	706
31	database	699	29	699
32	big data	695	30	695
33	written	684	31	684
34	ml	617	32	617
35	visualization	578	33	578
36	years of experience	576	34	576
37	data sets	566	35	566
38	leadership	536	36	536
39	data analysis	532	37	532
40	git	516	38	516
41	collaborative	511	39	511
42	office	506	40	506
43	data mining	483	41	483
44	verbal	481	42	481
45	innovation	471	43	471
46	presentation	463	44	463
47	creating	451	45	451
48	sas	445	46	445
49	deep learning	432	47	432
50	large data	398	48	398
51	natural language	390	49	390
52	data engineer	389	50	389
53	physics	337	51	337
54	collaboration	329	52	329
55	software development	327	53	327
56	language processing	320	54	320
57	artificial intelligence	318	55	318
58	natural language processing	318	56	318
59	economics	315	57	315
60	learning algorithms	315	58	315
61	programming languages	313	59	313
62	business problems	312	60	312
63	data visualization	312	61	312
64	machine learning techniques	298	62	298
65	writing	295	63	295
66	rtable	291	64	291
67	data analytics	285	65	285
68	tableau	279	66	279
69	machine learning algorithms	278	67	278
70	matlab	277	68	277
71	consulting	274	69	274
72	problem solving	273	70	273
73	learning models	268	71	268
74	nlp	254	72	254
75	influence	251	73	251
76	linux	251	74	251
77	flexible	244	75	244
78	etl	239	76	239
79	statistical modeling	237	77	237
80	large scale	235	78	235
81	nosql	232	79	232
82	machine learning models	231	80	231
83	data processing	227	81	227
84	large data sets	227	82	227
85	data pipeline	225	83	225
86	ms	223	84	253
87	predictive models	211	85	211
88	interpersonal	201	86	201
89	masters	201	87	201
90	data engineering	194	88	194
91	software engineers	194	89	194
92	data pipelines	179	90	179
93	decision making	173	91	173
94	organizational	172	92	172
95	forecasting	169	93	169
96	bachelor’s degree	164	94	164
97	monitoring	162	95	162
98	data management	161	96	161
99	have experience	157	97	157
100	creativity	155	98	155
101	microsoft	148	99	148
102	predictive analytics	146	100	146
103	project management	141	101	141
104	data collection	132	102	132
105	azure	127	103	127
106	javascript	127	104	127
107	modeling techniques	122	105	122
108	data architecture	120	106	120
109	data models	116	107	116
110	sap	112	108	112
111	unix	112	109	112
112	Go	107	110	NA
113	array	104	111	104
114	modelling	102	112	102
115	ruby	98	113	98
116	work independently	97	114	97
117	data warehousing	95	115	95
118	facebook	95	116	95
119	mysql	92	117	92
120	powerpoint	84	118	84
121	language understanding	83	119	83
122	ecommerce	80	120	80
123	data systems	78	121	78
124	solving problems	77	122	77
125	data extraction	75	123	75
126	mongodb	75	124	75
127	apache spark	65	125	65
128	natural language understanding	64	126	64
129	critical thinking	62	127	62
130	kpmg	60	128	60
131	data integration	59	129	59
132	big data architecture	57	130	57
133	github	57	131	57
134	postgresql	56	132	56
135	data manipulation	51	133	51
136	highly motivated	51	134	51
137	masters degree	49	135	49
138	elasticsearch	47	136	47
139	speaking	47	137	47
140	architecture capabilities	46	138	46
141	bachelors	46	139	46
142	covering technologies	46	140	46
143	multi-task	46	141	46
144	bash	45	142	45
145	nlu	40	143	40
146	shell script	40	144	40
147	troubleshooting	40	145	40
148	youtube	40	146	40
149	coordination	39	147	39
150	data gathering	39	148	39
151	time management	38	149	38
152	methodological	34	150	34
153	microsoft office	31	151	31
154	django	30	152	30
155	google analytics	30	153	30
156	market research	29	154	29
157	network analysis	28	155	28
158	data insights	27	156	27
159	microsoft azure	23	157	23
160	data preparation	22	158	22
161	negotiation	21	159	21
162	sales and marketing	19	160	19
163	vba	19	161	19
164	jupyter notebook	18	162	18
165	microstrategy	18	163	18
166	doctorate degree	17	164	17
167	manage multiple projects	16	165	16
168	analytics data	15	166	15
169	microsoft excel	15	167	15
170	machine learning data	13	168	13
171	strategic thinking	13	169	13
172	highly organized	10	170	10
173	jquery	10	171	10
174		9	172	NA
175	grammatical	9	173	9
176	symantec	9	174	9
177	telecommunications	9	175	9
178	data entry	8	176	8
179	data mapping	8	177	8
180	data reporting	8	178	8
181	eko	8	179	8
182	active learning	7	180	7
183	apache hadoop	7	181	7
184	confluence	7	182	7
185	amazon redshift	6	183	6
186	data transfer	6	184	6
187	microsoft word	6	185	6
188	swift	6	186	6
189	complex problem solving	5	187	5
190	english language	5	188	5
191	ibm db2	5	189	5
192	microsoft sql server	5	190	5
193	service orientation	5	191	5
194	unix shell	5	192	5
195	apache kafka	4	193	4
196	operations analysis	4	194	4
197	see the big picture	4	195	4
198	apache hive	3	196	3
199	data interpretation	3	197	3
200	experience in information technology	3	198	3
201	microsoft access	3	199	3
202	minitab	3	200	3
203	skype	3	201	3
204	systems analysis	3	202	3
205	technology design	3	203	3
206	ubuntu	3	204	3
207	work well in a team	3	205	3
208	active listening	2	206	2
209	citrix	2	207	2
210	clerical	2	208	2
211	client management	2	209	2
212	data cleanup	2	210	2
213	data organization	2	211	2
214	google adwords	2	212	2
215	mathematical reasoning	2	213	2
216	microsoft outlook	2	214	2
217	microsoft powerpoint	2	215	2
218	amazon dynamodb	1	216	1
219	bring creativity	1	217	1
220	design development	1	218	1
221	engineering and technology	1	219	1
222	filemaker pro	1	220	1
223	ibm infosphere datastage	1	221	1
224	judgment and decision making	1	222	1
225	microsoft dynamics	1	223	1
226	microsoft windows server	1	224	1
227	oracle hyperion	1	225	1
228	oracle java	1	226	1
229	organizational management	1	227	1
230	prepare data for analysis	1	228	1
231	quality control analysis	1	229	1
232	reading comprehension	1	230	1
233	report creation	1	231	1
234	teradata database	1	232	1
235	wireshark	1	233	1

Table top data analyst skills

top_skills_da <- skillsfreq_da %>% arrange(desc(count)) %>% select(skills, count) %>% mutate(rank = row_number())

top_skills_da <- left_join(top_skills_da, da_jd_count) %>% rename(jd_count = freq) %>% unique.data.frame()

## Joining, by = "skills"

findings_table_da <-(top_skills_da) %>%
  kbl(caption = "Top Skills") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  scroll_box(height = "400px")

findings_table_da

Top Skills
	skills	count	rank	jd_count
1	python	1275	1	425
4	research	785	2	785
5	communication	776	3	776
6	analytical	653	4	653
7	design	627	5	627
8	organization	605	6	605
9	written	514	7	514
10	quantitative	499	8	499
11	communication skills	488	9	488
12	leader	484	10	484
13	sql	467	11	467
14	statistics	466	12	466
15	office	463	13	463
16	math	403	14	403
17	r	401	15	326
18	solutions	400	16	400
19	computer	380	17	380
20	presentation	373	18	373
21	database	367	19	367
22	vision	358	20	358
23	leadership	327	21	327
24	verbal	324	22	324
25	passion	323	23	323
26	programming	313	24	313
27	science	312	25	312
28	sas	309	26	309
29	data analysis	304	27	304
30	writing	300	28	300
31	years of experience	286	29	286
32	collaborative	284	30	284
33	economics	268	31	268
34	mathematics	257	32	257
35	visualization	243	33	243
36	organizational	224	34	224
37	microsoft	223	35	223
38	innovation	207	36	207
39	machine learning	207	37	207
40	data sets	206	38	206
41	tableau	203	39	203
42	interpersonal	198	40	198
43	creating	196	41	196
44	git	195	42	195
45	powerpoint	190	43	190
46	consulting	182	44	182
47	ms	173	45	165
48	problem solving	172	46	172
49	data visualization	162	47	162
50	project management	159	48	159
51	collaboration	154	49	154
52	phd	151	50	151
53	bachelor’s degree	149	51	149
54	data analytics	140	52	140
55	flexible	137	53	137
56	java	134	54	134
57	ml	134	55	134
58	large data	133	56	133
59	algorithms	130	57	130
60	rtable	128	58	128
61	work independently	125	59	125
62	data collection	121	60	121
63	big data	120	61	120
64	data management	120	62	120
65	influence	118	63	118
66	monitoring	118	64	118
67	decision making	114	65	114
68	scala	113	66	113
69	data mining	105	67	105
70	microsoft office	100	68	100
71	forecasting	96	69	96
72	market research	95	70	95
73	hadoop	92	71	92
74	matlab	92	72	92
75	physics	89	73	89
76	programming languages	88	74	88
77	business problems	87	75	87
78	spark	87	76	87
79	masters	85	77	85
80	data engineer	81	78	81
81	critical thinking	77	79	77
82	multi-task	77	80	77
83	Go	74	81	NA
84	etl	73	82	73
85	large data sets	72	83	72
86	coordination	68	84	68
87	microsoft excel	68	85	68
88	statistical modeling	61	86	61
89	vba	60	87	60
90	have experience	58	88	58
91	facebook	53	89	53
92	sap	53	90	53
93	highly motivated	52	91	52
94	time management	51	92	51
95	creativity	50	93	50
96	data engineering	50	94	50
97	linux	46	95	46
98	bachelors	41	96	41
99	google analytics	41	97	41
100	software engineers	41	98	41
101	data processing	40	99	40
102	modelling	39	100	39
103	predictive analytics	39	101	39
104	predictive models	39	102	39
105	software development	38	103	38
106	javascript	37	104	37
107	modeling techniques	36	105	36
108	deep learning	35	106	35
109	array	34	107	34
110	troubleshooting	34	108	34
111	unix	34	109	34
112	artificial intelligence	33	110	33
113	machine learning techniques	33	111	33
114	data manipulation	31	112	31
115	data models	31	113	31
116	data systems	31	114	31
117	natural language	31	115	31
118	data warehousing	30	116	30
119	mysql	30	117	30
120	solving problems	30	118	30
121	youtube	29	119	29
122	data integration	28	120	28
123	manage multiple projects	28	121	28
124	masters degree	27	122	27
125	data pipeline	26	123	26
126	ecommerce	26	124	26
127	learning algorithms	26	125	26
128	methodological	26	126	26
129	speaking	26	127	26
130	AI	25	128	NA
131	data extraction	25	129	25
132	language processing	25	130	25
133	data gathering	24	131	24
134	machine learning algorithms	24	132	24
135	natural language processing	24	133	24
136	learning models	23	134	23
137	large scale	22	135	22
138	nosql	22	136	22
139	machine learning models	21	137	21
140	azure	19	138	19
141	data entry	19	139	19
142	data insights	19	140	19
143	doctorate degree	19	141	19
144	microsoft word	18	142	18
145	highly organized	17	143	17
146	ruby	17	144	17
147	data pipelines	16	145	16
148	data reporting	15	146	15
149	negotiation	15	147	15
150	systems analysis	15	148	15
151	microsoft powerpoint	14	149	14
152	nlp	14	150	14
153	data architecture	13	151	13
154	bash	12	152	12
155	network analysis	12	153	12
156	elasticsearch	11	154	11
157	postgresql	11	155	11
158	service orientation	11	156	11
159	strategic thinking	11	157	11
160	english language	10	158	10
161	github	10	159	10
162	mongodb	10	160	10
163	data preparation	8	161	8
164	data transfer	8	162	8
165	analytics data	7	163	7
166	client management	7	164	7
167	microsoft access	7	165	7
168		6	166	NA
169	apache spark	6	167	6
170	microsoft project	6	168	6
171	shell script	6	169	6
172	jupyter notebook	5	170	5
173	language understanding	5	171	5
174	microstrategy	5	172	5
175	sales and marketing	5	173	5
176	symantec	5	174	5
177	data interpretation	4	175	4
178	grammatical	4	176	4
179	minitab	4	177	4
180	natural language understanding	4	178	4
181	report creation	4	179	4
182	amazon redshift	3	180	3
183	complex problem solving	3	181	3
184	data mapping	3	182	3
185	data storytelling	3	183	3
186	kpmg	3	184	3
187	lexisnexis	3	185	3
188	microsoft outlook	3	186	3
189	telecommunications	3	187	3
190	work well in a team	3	188	3
191	active learning	2	189	2
192	administration and management	2	190	2
193	ajax	2	191	2
194	apache hadoop	2	192	2
195	clerical	2	193	2
196	confluence	2	194	2
197	experience in market research	2	195	2
198	google docs	2	196	2
199	mcafee	2	197	2
200	microsoft azure	2	198	2
201	nlu	2	199	2
202	operations analysis	2	200	2
203	see the big picture	2	201	2
204	swift	2	202	2
205	wireshark	2	203	2
206	apache kafka	1	204	1
207	apache tomcat	1	205	1
208	data cleanup	1	206	1
209	data organization	1	207	1
210	datadriven	1	208	1
211	deductive reasoning	1	209	1
212	design development	1	210	1
213	django	1	211	1
214	eko	1	212	1
215	epic systems	1	213	1
216	experience in information technology	1	214	1
217	filemaker pro	1	215	1
218	google adwords	1	216	1
219	ibm db2	1	217	1
220	jquery	1	218	1
221	judgment and decision making	1	219	1
222	machine learning data	1	220	1
223	mathematical reasoning	1	221	1
224	microsoft dynamics	1	222	1
225	microsoft sharepoint	1	223	1
226	microsoft sql server	1	224	1
227	microsoft sql server reporting services	1	225	1
228	microsoft windows server	1	226	1
229	oracle hyperion	1	227	1
230	organizational management	1	228	1
231	processing information	1	229	1
232	reading comprehension	1	230	1
233	skype	1	231	1
234	systems evaluation	1	232	1
235	tax software	1	233	1
236	technology design	1	234	1
237	ubuntu	1	235	1
238	unix shell	1	236	1

We can see that there are a few skills that stand out among both positions (Python and R among them). In order to compare the importance of each skill between the two roles, however, we need to look at their frequency in a slightly different way…

Count number of job descriptions containing each word_catalog entry

Here, we want to compare how many job descriptions within each dataset contain each word_catalog entry. Again, we run this code on both “data scientist” and “data analyst” job descriptions. Counting job descriptions rather than mentions lets us calculate, for each skill, the proportion of postings in each dataset that contain it, which in turn lets us compare the skill’s relative importance to each role.

# data_sci_jd_count <- tibble(
#     "r" = nrow(filter(data_scientists, str_detect(data_scientists$description, " r | r,| r\\.") == TRUE)), 
#     "ms" = nrow(filter(data_scientists, str_detect(data_scientists$description, " ms | ms,| ms\\.") == TRUE)), 
#     "go" = nrow(filter(data_scientists, str_detect(data_scientists$description, " go ") == TRUE)),
#     "ai" = nrow(filter(data_scientists, str_detect(data_scientists$description, " ai | ai,| ai\\.") == TRUE))) %>% 
#   pivot_longer(names_to="skills", cols=c("r", "ms", "go", "ai"), values_to="freq")
# 
# dict_dsjd_freq <- tibble(
#   "skills" = dictionary,
#   "freq" = lapply(dictionary, function(x){
#     nrow(filter(data_scientists, str_detect(data_scientists$description, x) == TRUE))
#   }) %>% as.vector(mode="integer"))
# 
# ds_jd_count <- union_all(data_sci_jd_count, dict_dsjd_freq)
# 
# 
# data_ana_jd_count <- tibble(
#     "r" = nrow(filter(data_analysts, str_detect(data_analysts$description, " r | r,| r\\.") == TRUE)), 
#     "ms" = nrow(filter(data_analysts, str_detect(data_analysts$description, " ms | ms,| ms\\.") == TRUE)), 
#     "go" = nrow(filter(data_analysts, str_detect(data_analysts$description, " go ") == TRUE)),
#     "ai" = nrow(filter(data_analysts, str_detect(data_analysts$description, " ai | ai,| ai\\.") == TRUE))) %>%
#   pivot_longer(names_to="skills", cols=c("r", "ms", "go", "ai"), values_to="freq")
# 
# dict_dajd_freq <- tibble(
#   "skills" = dictionary,
#   "freq" = lapply(dictionary, function(x){
#     nrow(filter(data_analysts, str_detect(data_analysts$description, x) == TRUE))
#   }) %>% as.vector(mode="integer"))
# 
# da_jd_count <- union_all(data_ana_jd_count, dict_dajd_freq)
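
The counting step can be illustrated on toy data. The sketch below is not the code used to produce the results in this report: the descriptions are invented, `jd_count()` is a hypothetical helper, and it swaps in a regular-expression word boundary (`\\b`) for the hand-written space and punctuation patterns, assuming the terms contain no regex metacharacters.

```r
# Toy job descriptions (illustrative only, not taken from the Indeed data)
descriptions <- c(
  "experience with python and r, plus sql",
  "strong research and communication skills",
  "python, spark and machine learning",
  "ms in statistics; programs in r."
)

# Hypothetical helper: number of descriptions that mention `term` as a whole word.
# \\b is a word boundary, so "r" does not match inside "research".
# Assumes `term` contains no regex metacharacters.
jd_count <- function(term, texts) {
  pattern <- paste0("\\b", term, "\\b")
  sum(grepl(pattern, texts, perl = TRUE))
}

terms <- c("python", "r", "sql", "research")
counts <- vapply(terms, jd_count, integer(1), texts = descriptions)

# Proportion of descriptions mentioning each term
jd_summary <- data.frame(
  skills = terms,
  jd_count = unname(counts),
  jd_percent = round(unname(counts) / length(descriptions), 3)
)
print(jd_summary)
```

Because each description is counted at most once per term, dividing by the number of descriptions gives a proportion that can be compared across datasets of different sizes.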

Tabulate top skills across both jobs by count, rank within dataset, and number and percent of job descriptions

In order to facilitate some analysis, let’s join the results from the two datasets and add some comparative metrics:

rank_ds and rank_da give each skill’s rank by frequency among all skills within its dataset
jd_percent is the proportion of job descriptions that mention the skill (so 0.727 means 72.7%)
freq_per_jd measures how often a skill appears per job description in which it is mentioned
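
As a worked example of these three metrics, here is a minimal base-R sketch on invented numbers. The column names mirror the ones used in this report, but the values, the dataset size `n_descriptions`, and the tie-breaking rule for ranks are assumptions made only for illustration.

```r
# Toy per-skill totals (illustrative numbers only)
toy <- data.frame(
  skills   = c("python", "sql", "r"),
  count    = c(30, 12, 15),   # total mentions across all descriptions
  jd_count = c(10, 12, 10)    # descriptions mentioning the skill at least once
)
n_descriptions <- 20          # assumed size of the toy dataset

# Rank 1 = most frequent; tied skills share the smallest rank
# (an assumption about the tie rule, which the report does not state)
toy$rank        <- rank(-toy$count, ties.method = "min")
toy$jd_percent  <- round(toy$jd_count / n_descriptions, 3)
toy$freq_per_jd <- round(toy$count / toy$jd_count, 3)
print(toy)
```

In this toy table, a skill mentioned 30 times across 10 descriptions has a freq_per_jd of 3, while a skill mentioned once in each of 12 descriptions has a freq_per_jd of 1.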

top_skills_all <- full_join(top_skills_ds, top_skills_da, by="skills") %>% 
  rename(count_ds = count.x, count_da = count.y, rank_ds = rank.x, rank_da = rank.y, jd_count_ds = jd_count.x, jd_count_da = jd_count.y) %>%
  select(skills, count_ds, rank_ds, jd_count_ds, count_da, rank_da, jd_count_da) %>% 
  mutate(avg_rank = (rank_ds + rank_da) / 2,
         jd_percent_ds = round(jd_count_ds / nrow(data_scientists), 3),
         freq_per_jd_ds= round(count_ds/ jd_count_ds, 3),
         jd_percent_da = round(jd_count_da / nrow(data_analysts), 3),
         freq_per_jd_da = round(count_da/ jd_count_da, 3)) %>%
  arrange(avg_rank)

findings_table_all<- top_skills_all %>%
  kbl(caption = "Top Skills") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  scroll_box(height = "400px")

findings_table_all

Top Skills
skills	count_ds	rank_ds	jd_count_ds	count_da	rank_da	jd_count_da	avg_rank	jd_percent_ds	freq_per_jd_ds	jd_percent_da	freq_per_jd_da
python	5250	1	1750	1275	1	425	1.0	0.727	3.000	0.348	3.000
design	1468	3	1468	627	5	627	4.0	0.610	1.000	0.514	1.000
research	1254	6	1254	785	2	785	4.0	0.521	1.000	0.643	1.000
communication	1168	9	1168	776	3	776	6.0	0.485	1.000	0.636	1.000
statistics	1249	7	1249	466	12	466	9.5	0.519	1.000	0.382	1.000
sql	1172	8	1172	467	11	467	9.5	0.487	1.000	0.382	1.000
computer	1460	4	1460	380	17	380	10.5	0.607	1.000	0.311	1.000
organization	844	16	844	605	6	605	11.0	0.351	1.000	0.495	1.000
analytical	799	18	799	653	4	653	11.0	0.332	1.000	0.535	1.000
r	1118	10	950	401	15	326	12.5	0.395	1.177	0.267	1.230
math	1102	11	1102	403	14	403	12.5	0.458	1.000	0.330	1.000
leader	905	15	905	484	10	484	12.5	0.376	1.000	0.396	1.000
solutions	1056	12	1056	400	16	400	14.0	0.439	1.000	0.328	1.000
quantitative	779	21	779	499	8	499	14.5	0.324	1.000	0.409	1.000
science	1384	5	1384	312	25	312	15.0	0.575	1.000	0.256	1.000
communication skills	766	24	766	488	9	488	16.5	0.318	1.000	0.400	1.000
programming	960	14	960	313	24	313	19.0	0.399	1.000	0.256	1.000
written	684	31	684	514	7	514	19.0	0.284	1.000	0.421	1.000
machine learning	1693	2	1693	207	37	207	19.5	0.704	1.000	0.170	1.000
passion	821	17	821	323	23	323	20.0	0.341	1.000	0.265	1.000
vision	714	27	714	358	20	358	23.5	0.297	1.000	0.293	1.000
database	699	29	699	367	19	367	24.0	0.291	1.000	0.301	1.000
office	506	40	506	463	13	463	26.5	0.210	1.000	0.379	1.000
mathematics	768	23	768	257	32	257	27.5	0.319	1.000	0.210	1.000
leadership	536	36	536	327	21	327	28.5	0.223	1.000	0.268	1.000
presentation	463	44	463	373	18	373	31.0	0.192	1.000	0.305	1.000
years of experience	576	34	576	286	29	286	31.5	0.239	1.000	0.234	1.000
data analysis	532	37	532	304	27	304	32.0	0.221	1.000	0.249	1.000
verbal	481	42	481	324	22	324	32.0	0.200	1.000	0.265	1.000
visualization	578	33	578	243	33	243	33.0	0.240	1.000	0.199	1.000
phd	789	19	789	151	50	151	34.5	0.328	1.000	0.124	1.000
collaborative	511	39	511	284	30	284	34.5	0.212	1.000	0.233	1.000
algorithms	975	13	975	130	57	130	35.0	0.405	1.000	0.106	1.000
sas	445	46	445	309	26	309	36.0	0.185	1.000	0.253	1.000
data sets	566	35	566	206	38	206	36.5	0.235	1.000	0.169	1.000
java	765	25	765	134	54	134	39.5	0.318	1.000	0.110	1.000
innovation	471	43	471	207	36	207	39.5	0.196	1.000	0.170	1.000
git	516	38	516	195	42	195	40.0	0.214	1.000	0.160	1.000
scala	789	20	789	113	66	113	43.0	0.328	1.000	0.093	1.000
creating	451	45	451	196	41	196	43.0	0.187	1.000	0.161	1.000
ml	617	32	617	134	55	134	43.5	0.256	1.000	0.110	1.000
economics	315	57	315	268	31	268	44.0	0.131	1.000	0.219	1.000
big data	695	30	695	120	61	120	45.5	0.289	1.000	0.098	1.000
writing	295	63	295	300	28	300	45.5	0.123	1.000	0.246	1.000
spark	779	22	779	87	76	87	49.0	0.324	1.000	0.071	1.000
hadoop	706	28	706	92	71	92	49.5	0.293	1.000	0.075	1.000
collaboration	329	52	329	154	49	154	50.5	0.137	1.000	0.126	1.000
large data	398	48	398	133	56	133	52.0	0.165	1.000	0.109	1.000
tableau	279	66	279	203	39	203	52.5	0.116	1.000	0.166	1.000
data mining	483	41	483	105	67	105	54.0	0.201	1.000	0.086	1.000
data visualization	312	61	312	162	47	162	54.0	0.130	1.000	0.133	1.000
consulting	274	69	274	182	44	182	56.5	0.114	1.000	0.149	1.000
problem solving	273	70	273	172	46	172	58.0	0.113	1.000	0.141	1.000
data analytics	285	65	285	140	52	140	58.5	0.118	1.000	0.115	1.000
rtable	291	64	291	128	58	128	61.0	0.121	1.000	0.105	1.000
physics	337	51	337	89	73	89	62.0	0.140	1.000	0.073	1.000
interpersonal	201	86	201	198	40	198	63.0	0.084	1.000	0.162	1.000
organizational	172	92	172	224	34	224	63.0	0.071	1.000	0.183	1.000
data engineer	389	50	389	81	78	81	64.0	0.162	1.000	0.066	1.000
flexible	244	75	244	137	53	137	64.0	0.101	1.000	0.112	1.000
ms	223	84	253	173	45	165	64.5	0.105	0.881	0.135	1.048
programming languages	313	59	313	88	74	88	66.5	0.130	1.000	0.072	1.000
microsoft	148	99	148	223	35	223	67.0	0.062	1.000	0.183	1.000
business problems	312	60	312	87	75	87	67.5	0.130	1.000	0.071	1.000
influence	251	73	251	118	63	118	68.0	0.104	1.000	0.097	1.000
matlab	277	68	277	92	72	92	70.0	0.115	1.000	0.075	1.000
bachelor’s degree	164	94	164	149	51	149	72.5	0.068	1.000	0.122	1.000
project management	141	101	141	159	48	159	74.5	0.059	1.000	0.130	1.000
deep learning	432	47	432	35	106	35	76.5	0.180	1.000	0.029	1.000
AI	729	26	NA	25	128	NA	77.0	NA	NA	NA	NA
software development	327	53	327	38	103	38	78.0	0.136	1.000	0.031	1.000
decision making	173	91	173	114	65	114	78.0	0.072	1.000	0.093	1.000
etl	239	76	239	73	82	73	79.0	0.099	1.000	0.060	1.000
data management	161	96	161	120	62	120	79.0	0.067	1.000	0.098	1.000
monitoring	162	95	162	118	64	118	79.5	0.067	1.000	0.097	1.000
powerpoint	84	118	84	190	43	190	80.5	0.035	1.000	0.156	1.000
forecasting	169	93	169	96	69	96	81.0	0.070	1.000	0.079	1.000
data collection	132	102	132	121	60	121	81.0	0.055	1.000	0.099	1.000
statistical modeling	237	77	237	61	86	61	81.5	0.099	1.000	0.050	1.000
natural language	390	49	390	31	115	31	82.0	0.162	1.000	0.025	1.000
masters	201	87	201	85	77	85	82.0	0.084	1.000	0.070	1.000
artificial intelligence	318	55	318	33	110	33	82.5	0.132	1.000	0.027	1.000
large data sets	227	82	227	72	83	72	82.5	0.094	1.000	0.059	1.000
linux	251	74	251	46	95	46	84.5	0.104	1.000	0.038	1.000
machine learning techniques	298	62	298	33	111	33	86.5	0.124	1.000	0.027	1.000
work independently	97	114	97	125	59	125	86.5	0.040	1.000	0.102	1.000
data processing	227	81	227	40	99	40	90.0	0.094	1.000	0.033	1.000
data engineering	194	88	194	50	94	50	91.0	0.081	1.000	0.041	1.000
learning algorithms	315	58	315	26	125	26	91.5	0.131	1.000	0.021	1.000
language processing	320	54	320	25	130	25	92.0	0.133	1.000	0.020	1.000
have experience	157	97	157	58	88	58	92.5	0.065	1.000	0.048	1.000
predictive models	211	85	211	39	102	39	93.5	0.088	1.000	0.032	1.000
software engineers	194	89	194	41	98	41	93.5	0.081	1.000	0.034	1.000
natural language processing	318	56	318	24	133	24	94.5	0.132	1.000	0.020	1.000
creativity	155	98	155	50	93	50	95.5	0.064	1.000	0.041	1.000
Go	107	110	NA	74	81	NA	95.5	NA	NA	NA	NA
sap	112	108	112	53	90	53	99.0	0.047	1.000	0.043	1.000
machine learning algorithms	278	67	278	24	132	24	99.5	0.116	1.000	0.020	1.000
predictive analytics	146	100	146	39	101	39	100.5	0.061	1.000	0.032	1.000
learning models	268	71	268	23	134	23	102.5	0.111	1.000	0.019	1.000
facebook	95	116	95	53	89	53	102.5	0.039	1.000	0.043	1.000
data pipeline	225	83	225	26	123	26	103.0	0.094	1.000	0.021	1.000
critical thinking	62	127	62	77	79	77	103.0	0.026	1.000	0.063	1.000
javascript	127	104	127	37	104	37	104.0	0.053	1.000	0.030	1.000
modeling techniques	122	105	122	36	105	36	105.0	0.051	1.000	0.029	1.000
modelling	102	112	102	39	100	39	106.0	0.042	1.000	0.032	1.000
large scale	235	78	235	22	135	22	106.5	0.098	1.000	0.018	1.000
nosql	232	79	232	22	136	22	107.5	0.096	1.000	0.018	1.000
machine learning models	231	80	231	21	137	21	108.5	0.096	1.000	0.017	1.000
unix	112	109	112	34	109	34	109.0	0.047	1.000	0.028	1.000
array	104	111	104	34	107	34	109.0	0.043	1.000	0.028	1.000
microsoft office	31	151	31	100	68	100	109.5	0.013	1.000	0.082	1.000
data models	116	107	116	31	113	31	110.0	0.048	1.000	0.025	1.000
multi-task	46	141	46	77	80	77	110.5	0.019	1.000	0.063	1.000
nlp	254	72	254	14	150	14	111.0	0.106	1.000	0.011	1.000
market research	29	154	29	95	70	95	112.0	0.012	1.000	0.078	1.000
highly motivated	51	134	51	52	91	52	112.5	0.021	1.000	0.043	1.000
data warehousing	95	115	95	30	116	30	115.5	0.039	1.000	0.025	1.000
coordination	39	147	39	68	84	68	115.5	0.016	1.000	0.056	1.000
mysql	92	117	92	30	117	30	117.0	0.038	1.000	0.025	1.000
data pipelines	179	90	179	16	145	16	117.5	0.074	1.000	0.013	1.000
data systems	78	121	78	31	114	31	117.5	0.032	1.000	0.025	1.000
bachelors	46	139	46	41	96	41	117.5	0.019	1.000	0.034	1.000
solving problems	77	122	77	30	118	30	120.0	0.032	1.000	0.025	1.000
azure	127	103	127	19	138	19	120.5	0.053	1.000	0.016	1.000
time management	38	149	38	51	92	51	120.5	0.016	1.000	0.042	1.000
ecommerce	80	120	80	26	124	26	122.0	0.033	1.000	0.021	1.000
data manipulation	51	133	51	31	112	31	122.5	0.021	1.000	0.025	1.000
vba	19	161	19	60	87	60	124.0	0.008	1.000	0.049	1.000
data integration	59	129	59	28	120	28	124.5	0.025	1.000	0.023	1.000
google analytics	30	153	30	41	97	41	125.0	0.012	1.000	0.034	1.000
data extraction	75	123	75	25	129	25	126.0	0.031	1.000	0.020	1.000
microsoft excel	15	167	15	68	85	68	126.0	0.006	1.000	0.056	1.000
troubleshooting	40	145	40	34	108	34	126.5	0.017	1.000	0.028	1.000
data architecture	120	106	120	13	151	13	128.5	0.050	1.000	0.011	1.000
ruby	98	113	98	17	144	17	128.5	0.041	1.000	0.014	1.000
masters degree	49	135	49	27	122	27	128.5	0.020	1.000	0.022	1.000
speaking	47	137	47	26	127	26	132.0	0.020	1.000	0.021	1.000
youtube	40	146	40	29	119	29	132.5	0.017	1.000	0.024	1.000
methodological	34	150	34	26	126	26	138.0	0.014	1.000	0.021	1.000
data gathering	39	148	39	24	131	24	139.5	0.016	1.000	0.020	1.000
mongodb	75	124	75	10	160	10	142.0	0.031	1.000	0.008	1.000
manage multiple projects	16	165	16	28	121	28	143.0	0.007	1.000	0.023	1.000
postgresql	56	132	56	11	155	11	143.5	0.023	1.000	0.009	1.000
language understanding	83	119	83	5	171	5	145.0	0.034	1.000	0.004	1.000
github	57	131	57	10	159	10	145.0	0.024	1.000	0.008	1.000
elasticsearch	47	136	47	11	154	11	145.0	0.020	1.000	0.009	1.000
apache spark	65	125	65	6	167	6	146.0	0.027	1.000	0.005	1.000
bash	45	142	45	12	152	12	147.0	0.019	1.000	0.010	1.000
data insights	27	156	27	19	140	19	148.0	0.011	1.000	0.016	1.000
natural language understanding	64	126	64	4	178	4	152.0	0.027	1.000	0.003	1.000
doctorate degree	17	164	17	19	141	19	152.5	0.007	1.000	0.016	1.000
negotiation	21	159	21	15	147	15	153.0	0.009	1.000	0.012	1.000
network analysis	28	155	28	12	153	12	154.0	0.012	1.000	0.010	1.000
kpmg	60	128	60	3	184	3	156.0	0.025	1.000	0.002	1.000
shell script	40	144	40	6	169	6	156.5	0.017	1.000	0.005	1.000
highly organized	10	170	10	17	143	17	156.5	0.004	1.000	0.014	1.000
data entry	8	176	8	19	139	19	157.5	0.003	1.000	0.016	1.000
data preparation	22	158	22	8	161	8	159.5	0.009	1.000	0.007	1.000
data reporting	8	178	8	15	146	15	162.0	0.003	1.000	0.012	1.000
strategic thinking	13	169	13	11	157	11	163.0	0.005	1.000	0.009	1.000
microsoft word	6	185	6	18	142	18	163.5	0.002	1.000	0.015	1.000
analytics data	15	166	15	7	163	7	164.5	0.006	1.000	0.006	1.000
jupyter notebook	18	162	18	5	170	5	166.0	0.007	1.000	0.004	1.000
sales and marketing	19	160	19	5	173	5	166.5	0.008	1.000	0.004	1.000
microstrategy	18	163	18	5	172	5	167.5	0.007	1.000	0.004	1.000
	9	172	NA	6	166	NA	169.0	NA	NA	NA	NA
nlu	40	143	40	2	199	2	171.0	0.017	1.000	0.002	1.000
data transfer	6	184	6	8	162	8	173.0	0.002	1.000	0.007	1.000
english language	5	188	5	10	158	10	173.0	0.002	1.000	0.008	1.000
service orientation	5	191	5	11	156	11	173.5	0.002	1.000	0.009	1.000
symantec	9	174	9	5	174	5	174.0	0.004	1.000	0.004	1.000
grammatical	9	173	9	4	176	4	174.5	0.004	1.000	0.003	1.000
systems analysis	3	202	3	15	148	15	175.0	0.001	1.000	0.012	1.000
microsoft azure	23	157	23	2	198	2	177.5	0.010	1.000	0.002	1.000
data mapping	8	177	8	3	182	3	179.5	0.003	1.000	0.002	1.000
telecommunications	9	175	9	3	187	3	181.0	0.004	1.000	0.002	1.000
django	30	152	30	1	211	1	181.5	0.012	1.000	0.001	1.000
amazon redshift	6	183	6	3	180	3	181.5	0.002	1.000	0.002	1.000
microsoft access	3	199	3	7	165	7	182.0	0.001	1.000	0.006	1.000
microsoft powerpoint	2	215	2	14	149	14	182.0	0.001	1.000	0.011	1.000
complex problem solving	5	187	5	3	181	3	184.0	0.002	1.000	0.002	1.000
active learning	7	180	7	2	189	2	184.5	0.003	1.000	0.002	1.000
data interpretation	3	197	3	4	175	4	186.0	0.001	1.000	0.003	1.000
apache hadoop	7	181	7	2	192	2	186.5	0.003	1.000	0.002	1.000
client management	2	209	2	7	164	7	186.5	0.001	1.000	0.006	1.000
confluence	7	182	7	2	194	2	188.0	0.003	1.000	0.002	1.000
minitab	3	200	3	4	177	4	188.5	0.001	1.000	0.003	1.000
machine learning data	13	168	13	1	220	1	194.0	0.005	1.000	0.001	1.000
swift	6	186	6	2	202	2	194.0	0.002	1.000	0.002	1.000
jquery	10	171	10	1	218	1	194.5	0.004	1.000	0.001	1.000
eko	8	179	8	1	212	1	195.5	0.003	1.000	0.001	1.000
work well in a team	3	205	3	3	188	3	196.5	0.001	1.000	0.002	1.000
operations analysis	4	194	4	2	200	2	197.0	0.002	1.000	0.002	1.000
see the big picture	4	195	4	2	201	2	198.0	0.002	1.000	0.002	1.000
apache kafka	4	193	4	1	204	1	198.5	0.002	1.000	0.001	1.000
microsoft outlook	2	214	2	3	186	3	200.0	0.001	1.000	0.002	1.000
clerical	2	208	2	2	193	2	200.5	0.001	1.000	0.002	1.000
ibm db2	5	189	5	1	217	1	203.0	0.002	1.000	0.001	1.000
report creation	1	231	1	4	179	4	205.0	0.000	1.000	0.003	1.000
experience in information technology	3	198	3	1	214	1	206.0	0.001	1.000	0.001	1.000
microsoft sql server	5	190	5	1	224	1	207.0	0.002	1.000	0.001	1.000
data cleanup	2	210	2	1	206	1	208.0	0.001	1.000	0.001	1.000
data organization	2	211	2	1	207	1	209.0	0.001	1.000	0.001	1.000
unix shell	5	192	5	1	236	1	214.0	0.002	1.000	0.001	1.000
google adwords	2	212	2	1	216	1	214.0	0.001	1.000	0.001	1.000
design development	1	218	1	1	210	1	214.0	0.000	1.000	0.001	1.000
skype	3	201	3	1	231	1	216.0	0.001	1.000	0.001	1.000
mathematical reasoning	2	213	2	1	221	1	217.0	0.001	1.000	0.001	1.000
filemaker pro	1	220	1	1	215	1	217.5	0.000	1.000	0.001	1.000
wireshark	1	233	1	2	203	2	218.0	0.000	1.000	0.002	1.000
technology design	3	203	3	1	234	1	218.5	0.001	1.000	0.001	1.000
ubuntu	3	204	3	1	235	1	219.5	0.001	1.000	0.001	1.000
judgment and decision making	1	222	1	1	219	1	220.5	0.000	1.000	0.001	1.000
microsoft dynamics	1	223	1	1	222	1	222.5	0.000	1.000	0.001	1.000
microsoft windows server	1	224	1	1	226	1	225.0	0.000	1.000	0.001	1.000
oracle hyperion	1	225	1	1	227	1	226.0	0.000	1.000	0.001	1.000
organizational management	1	227	1	1	228	1	227.5	0.000	1.000	0.001	1.000
reading comprehension	1	230	1	1	230	1	230.0	0.000	1.000	0.001	1.000
big data architecture	57	130	57	NA	NA	NA	NA	0.024	1.000	NA	NA
architecture capabilities	46	138	46	NA	NA	NA	NA	0.019	1.000	NA	NA
covering technologies	46	140	46	NA	NA	NA	NA	0.019	1.000	NA	NA
apache hive	3	196	3	NA	NA	NA	NA	0.001	1.000	NA	NA
active listening	2	206	2	NA	NA	NA	NA	0.001	1.000	NA	NA
citrix	2	207	2	NA	NA	NA	NA	0.001	1.000	NA	NA
amazon dynamodb	1	216	1	NA	NA	NA	NA	0.000	1.000	NA	NA
bring creativity	1	217	1	NA	NA	NA	NA	0.000	1.000	NA	NA
engineering and technology	1	219	1	NA	NA	NA	NA	0.000	1.000	NA	NA
ibm infosphere datastage	1	221	1	NA	NA	NA	NA	0.000	1.000	NA	NA
oracle java	1	226	1	NA	NA	NA	NA	0.000	1.000	NA	NA
prepare data for analysis	1	228	1	NA	NA	NA	NA	0.000	1.000	NA	NA
quality control analysis	1	229	1	NA	NA	NA	NA	0.000	1.000	NA	NA
teradata database	1	232	1	NA	NA	NA	NA	0.000	1.000	NA	NA
microsoft project	NA	NA	NA	6	168	6	NA	NA	NA	0.005	1.000
data storytelling	NA	NA	NA	3	183	3	NA	NA	NA	0.002	1.000
lexisnexis	NA	NA	NA	3	185	3	NA	NA	NA	0.002	1.000
administration and management	NA	NA	NA	2	190	2	NA	NA	NA	0.002	1.000
ajax	NA	NA	NA	2	191	2	NA	NA	NA	0.002	1.000
experience in market research	NA	NA	NA	2	195	2	NA	NA	NA	0.002	1.000
google docs	NA	NA	NA	2	196	2	NA	NA	NA	0.002	1.000
mcafee	NA	NA	NA	2	197	2	NA	NA	NA	0.002	1.000
apache tomcat	NA	NA	NA	1	205	1	NA	NA	NA	0.001	1.000
datadriven	NA	NA	NA	1	208	1	NA	NA	NA	0.001	1.000
deductive reasoning	NA	NA	NA	1	209	1	NA	NA	NA	0.001	1.000
epic systems	NA	NA	NA	1	213	1	NA	NA	NA	0.001	1.000
microsoft sharepoint	NA	NA	NA	1	223	1	NA	NA	NA	0.001	1.000
microsoft sql server reporting services	NA	NA	NA	1	225	1	NA	NA	NA	0.001	1.000
processing information	NA	NA	NA	1	229	1	NA	NA	NA	0.001	1.000
systems evaluation	NA	NA	NA	1	232	1	NA	NA	NA	0.001	1.000
tax software	NA	NA	NA	1	233	1	NA	NA	NA	0.001	1.000

4.Analyze our survey

Analyzing our Survey

In our quest to determine the top data science skills, our team formulated a survey and distributed it to our peers, colleagues, friends, and family. We received 32 survey responses, and analyzed the data to determine the top data science skills. Respondents were prompted to provide three skills in all.

In order to analyze our survey data, we used n-gram analysis. Because the survey responses are much shorter and simpler than the job descriptions, we did not use the word_catalog. Instead, we picked out the relevant data skills from the n-gram results by hand.

## Loading Data

survey <- read.csv("https://raw.githubusercontent.com/ericonsi/Project3/master/survey-final.csv?token=AKGJZWL6KFYOLXBQ5XWZ7OLANHUIQ")

skills_only <- select(survey, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))
skills_only<-skills_only %>% rename(
  first = What.is.the.most.important.skill.for.a.data.scientist.,
  second = What.is.the.second.most.important.skill.for.a.data.scientist.,
  third = What.is.the.third.most.important.skill.for.a.data.scientist.
)

Exploratory Word Cloud

To get an initial look at the data, all the skills were collected into one data frame, and the wordcloud library was used to generate a word cloud of all the words survey participants submitted.

all <- pivot_longer(skills_only, 1:3)
corpus4 <- VCorpus(VectorSource(all$value))
corpus4 <- tm_map(corpus4, removePunctuation)
corpus4 <- tm_map(corpus4, content_transformer(tolower))
corpus4 <- tm_map(corpus4, removeNumbers)
corpus4 <- tm_map(corpus4, removeWords, stopwords_en)
corpus4 <- tm_map(corpus4, stripWhitespace)
wordcloud(corpus4, max.words = 50, colors = wes_palette(name = "Zissou1"))

Stemming

To further analyze the survey, stemming was used. In addition to removing stopwords_en, “skill”, “skills”, “ability”, and “abilities” were removed since those aren’t standalone skills. The stem completion list was formulated by hand based on the generated list of stems.

# Lowercase before removing words, because removeWords() matches case-sensitively
no_punc <- tolower(removePunctuation(all$value))
fewer_words <- removeWords(no_punc, c(stopwords_en, "etc", "eg", "skill", "skills", "ability", "abilities"))
unlisted <-  unlist(strsplit(fewer_words, split = ' '))

stems <- stemDocument(unlisted)
stems<- stripWhitespace(stems)
stem_corpus <- VCorpus(VectorSource(stems))
wordcloud(stem_corpus, max.words = 50, colors = wes_palette(name = "Zissou1"))

test_complete <- c("statistics", "visualization", "programming", "database", "analytics", "software", "solving", "thinking",  "communication", "code", "machine", "learning", "modeling", "munging", "interpretation", "recognition", "computer", "aptitude", "knowledge", "technical", "storytelling", "collaboration", "analysis", "data", "sql", "creativity", "python", "business")


test <- stemCompletion(stems, dictionary=test_complete)
testcorp <- VCorpus(VectorSource(test))
wordcloud(testcorp, max.words = 50, colors = wes_palette(name = "Zissou1"))
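
Stem completion maps each stem back to a readable word from the hand-built list. The base-R sketch below illustrates the idea with simple prefix matching. `complete_stems()` is a hypothetical helper, the stems and dictionary are toy values, and the "first match wins" rule is a simplification: tm’s stemCompletion() offers several strategies for choosing among candidates.

```r
# Hypothetical helper: complete each stem to the first dictionary word that starts with it.
# Stems with no matching dictionary word are returned as NA.
complete_stems <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) > 0) hits[1] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

toy_stems <- c("statist", "visual", "program", "xyz")   # illustrative stems
toy_dict  <- c("statistics", "visualization", "programming")
completed <- complete_stems(toy_stems, toy_dict)
print(completed)
```

A stem that matches nothing in the list stays unresolved, which is why the completion list has to be checked against the generated stems.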

N Gram Analysis

Next, the frequencies of the ten most common unigrams, bigrams, and trigrams across the entire data set were analyzed.
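
To make the idea of n-gram counting concrete, here is a minimal base-R sketch on invented responses. It is separate from the tm-based code in this section: `make_ngrams()` is a hypothetical helper, the responses are not survey data, and no stop words are removed.

```r
# Hypothetical helper: all n-grams (as space-joined strings) in one response
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), character(1))
}

# Toy responses (illustrative only, not taken from the survey)
responses <- c("programming skills", "analytical skills",
               "machine learning", "programming skills in python")

# Collect bigrams from every response and count how often each occurs
bigrams <- unlist(lapply(responses, make_ngrams, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(head(bigram_freq, 10))
```

Setting `n` to 1 or 3 gives unigram or trigram counts in the same way. Responses shorter than `n` words contribute nothing, which is one reason short survey answers produce few repeated bigrams and trigrams.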

#Unigrams
unigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE) }
unigram <- TermDocumentMatrix(corpus4, control = list(wordLengths = c(1, 20)))


#Bigrams
bigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) }
bigram <- TermDocumentMatrix(corpus4, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))


#Trigrams
trigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE) }
trigram <- TermDocumentMatrix(corpus4, control = list(wordLengths = c(3, 60),tokenize = trigramTokenizer))

unigramrow <- sort(slam::row_sums(unigram), decreasing=T)
unigramfreq <- data.table(tok = names(unigramrow), freq = unigramrow)

ggplot(unigramfreq[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Unigrams") +labs(x = "", y = "")

#Bigrams

bigramrow <- sort(slam::row_sums(bigram), decreasing=T)
bigramfreq <- data.table(tok = names(bigramrow), freq = bigramrow)

ggplot(bigramfreq[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Bigrams") +labs(x = "", y = "")

#Trigrams
trigramrow <- sort(slam::row_sums(trigram), decreasing=T)
trigramfreq <- data.table(tok = names(trigramrow), freq = trigramrow)

ggplot(trigramfreq[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Trigrams") +labs(x = "", y = "")

N Gram Analysis, With Filtering

To get a closer look at where the responses break down, the unigram and bigram frequencies were run again on filtered sections of the data set.

Filtering By Field and Occupation

First, the data set was filtered both by field and by occupation. The first group was individuals who worked full time in either computer science or data science. The second group was everyone else in those two fields: students, teachers and professors, and respondents who selected “Other” as their occupation.

industry <- filter(survey, Occupation == "Full-Time Work")
industry <- filter(industry, Field == "Data Science" | Field == "Computer Science")
not_industry <- filter(survey, Occupation == "Graduate Student (Full Time)" | Occupation == "Teacher / Professor" | Occupation == "High School Student" | Occupation == "Undergraduate" | Occupation == "Other")
not_industry <- filter(not_industry, Field == "Data Science" | Field == "Computer Science")

industry <- select(industry, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))

industry<- industry %>% rename(
  first = What.is.the.most.important.skill.for.a.data.scientist.,
  second = What.is.the.second.most.important.skill.for.a.data.scientist.,
  third = What.is.the.third.most.important.skill.for.a.data.scientist.
)

not_industry <- select(not_industry, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))

not_industry<-not_industry %>% rename(
  first = What.is.the.most.important.skill.for.a.data.scientist.,
  second = What.is.the.second.most.important.skill.for.a.data.scientist.,
  third = What.is.the.third.most.important.skill.for.a.data.scientist.
)
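
The filter, select, and rename steps are repeated for every subgroup, so they could be wrapped in a single function. The sketch below shows one possible way in base R. `extract_skills()` is a hypothetical helper, and the toy survey uses shortened column names and invented answers rather than the real survey columns.

```r
# Toy survey with shortened, hypothetical column names
toy_survey <- data.frame(
  Occupation = c("Full-Time Work", "Undergraduate", "Full-Time Work", "Other"),
  Field      = c("Data Science", "Computer Science", "Other STEM Field", "Data Science"),
  first      = c("python", "statistics", "excel", "sql"),
  second     = c("sql", "r", "writing", "communication"),
  third      = c("communication", "curiosity", "math", "python"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows matching the given fields and occupations,
# then return only the three skill columns.
extract_skills <- function(df, fields, occupations) {
  keep <- df$Field %in% fields & df$Occupation %in% occupations
  df[keep, c("first", "second", "third"), drop = FALSE]
}

cs_ds <- c("Data Science", "Computer Science")
industry_toy <- extract_skills(toy_survey, cs_ds, "Full-Time Work")
not_industry_toy <- extract_skills(toy_survey, cs_ds, c("Undergraduate", "Other"))
print(industry_toy)
print(not_industry_toy)
```

Using `%in%` with a vector of allowed values keeps each group definition on one line and makes it easy to see exactly which occupations fall into each group.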

Initial Word Cloud Visualization

industry <- pivot_longer(industry, 1:3)
corpus_industry <- VCorpus(VectorSource(industry$value))
corpus_industry <- tm_map(corpus_industry, removePunctuation)
corpus_industry <- tm_map(corpus_industry, content_transformer(tolower))
corpus_industry <- tm_map(corpus_industry, removeNumbers)
corpus_industry <- tm_map(corpus_industry, removeWords, c(stopwords_en, "eg", "etc"))
corpus_industry <- tm_map(corpus_industry, stripWhitespace)
wordcloud(corpus_industry, max.words = 50, colors = wes_palette(name = "Zissou1"))

not_industry <- pivot_longer(not_industry, 1:3)
corpus_not_industry <- VCorpus(VectorSource(not_industry$value))
corpus_not_industry <- tm_map(corpus_not_industry, removePunctuation)
corpus_not_industry <- tm_map(corpus_not_industry, content_transformer(tolower))
corpus_not_industry <- tm_map(corpus_not_industry, removeNumbers)
corpus_not_industry <- tm_map(corpus_not_industry, removeWords, c(stopwords_en, "etc", "eg"))
corpus_not_industry <- tm_map(corpus_not_industry, stripWhitespace)
wordcloud(corpus_not_industry, max.words = 50, colors = wes_palette(name = "Zissou1"))

N Gram Analysis

unigram_ind <- TermDocumentMatrix(corpus_industry, control = list(wordLengths = c(1, 20)))

unigramrow_ind <- sort(slam::row_sums(unigram_ind), decreasing=T)
unigramfreq_ind <- data.table(tok = names(unigramrow_ind), freq = unigramrow_ind)

ggplot(unigramfreq_ind[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Unigrams - Computer Science and Data Science Full-Time Workers") +labs(x = "", y = "")

unigram_not <- TermDocumentMatrix(corpus_not_industry, control = list(wordLengths = c(1, 20)))
unigramrow_not <- sort(slam::row_sums(unigram_not), decreasing=T)
unigramfreq_not <- data.table(tok = names(unigramrow_not), freq = unigramrow_not)

ggplot(unigramfreq_not[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Unigrams - Computer Science and Data Science Professors and Students") +labs(x = "", y = "")

bigram_ind <- TermDocumentMatrix(corpus_industry, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
bigramrow_ind <- sort(slam::row_sums(bigram_ind), decreasing=T)
bigramfreq_ind <- data.table(tok = names(bigramrow_ind), freq = bigramrow_ind)

ggplot(bigramfreq_ind[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Bigrams - Computer Science and Data Science Full-Time Workers") +labs(x = "", y = "")

bigram_not <- TermDocumentMatrix(corpus_not_industry, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
bigramrow_not <- sort(slam::row_sums(bigram_not), decreasing=T)
bigramfreq_not <- data.table(tok = names(bigramrow_not), freq = bigramrow_not)

ggplot(bigramfreq_not[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Bigrams - Computer Science and Data Science Professors and Students") +labs(x = "", y = "")

Filtering Only By Field

The second filtered data set was only filtered by field. The first group was individuals in data science or computer science, and the second group was individuals who weren’t in computer science or data science.

csds <- filter(survey, Field == "Data Science" | Field == "Computer Science")

not_csds <- filter(survey, Field == "Other STEM Field" | Field == "Other Non-Stem Field")

csds <- select(csds, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))

csds<- csds %>% rename(
  first = What.is.the.most.important.skill.for.a.data.scientist.,
  second = What.is.the.second.most.important.skill.for.a.data.scientist.,
  third = What.is.the.third.most.important.skill.for.a.data.scientist.
)

not_csds <- select(not_csds, c(What.is.the.most.important.skill.for.a.data.scientist., What.is.the.second.most.important.skill.for.a.data.scientist., What.is.the.third.most.important.skill.for.a.data.scientist.))

not_csds<-not_csds %>% rename(
  first = What.is.the.most.important.skill.for.a.data.scientist.,
  second = What.is.the.second.most.important.skill.for.a.data.scientist.,
  third = What.is.the.third.most.important.skill.for.a.data.scientist.
)

Word Clouds

csds <- pivot_longer(csds, 1:3)
corpus_csds <- VCorpus(VectorSource(csds$value))
corpus_csds <- tm_map(corpus_csds, removePunctuation)
corpus_csds <- tm_map(corpus_csds, content_transformer(tolower))
corpus_csds <- tm_map(corpus_csds, removeNumbers)
corpus_csds <- tm_map(corpus_csds, removeWords, c(stopwords_en, "eg", "etc"))
corpus_csds <- tm_map(corpus_csds, stripWhitespace)
wordcloud(corpus_csds, max.words = 50, colors = wes_palette(name = "Zissou1"))

not_csds <- pivot_longer(not_csds, 1:3)
corpus_not_csds <- VCorpus(VectorSource(not_csds$value))
corpus_not_csds <- tm_map(corpus_not_csds, removePunctuation)
corpus_not_csds <- tm_map(corpus_not_csds, content_transformer(tolower))
corpus_not_csds <- tm_map(corpus_not_csds, removeNumbers)
corpus_not_csds <- tm_map(corpus_not_csds, removeWords, c(stopwords_en, "eg", "etc"))
corpus_not_csds <- tm_map(corpus_not_csds, stripWhitespace)
wordcloud(corpus_not_csds, max.words = 50, colors = wes_palette(name = "Zissou1"))

N Gram Analysis

unigram_csds <- TermDocumentMatrix(corpus_csds, control = list(wordLengths = c(1, 20)))

unigramrow_csds <- sort(slam::row_sums(unigram_csds), decreasing=T)
unigramfreq_csds <- data.table(tok = names(unigramrow_csds), freq = unigramrow_csds)

ggplot(unigramfreq_csds[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Unigrams - Computer Science and Data Science Fields") +labs(x = "", y = "")

unigram_not_csds <- TermDocumentMatrix(corpus_not_csds, control = list(wordLengths = c(1, 20)))

unigramrow_not_csds <- sort(slam::row_sums(unigram_not_csds), decreasing=T)
unigramfreq_not_csds <- data.table(tok = names(unigramrow_not_csds), freq = unigramrow_not_csds)

ggplot(unigramfreq_not_csds[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Unigrams - Other Fields") +labs(x = "", y = "")

bigram_csds <- TermDocumentMatrix(corpus_csds, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
bigramrow_csds <- sort(slam::row_sums(bigram_csds), decreasing=T)
bigramfreq_csds <- data.table(tok = names(bigramrow_csds), freq = bigramrow_csds)

ggplot(bigramfreq_csds[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Bigrams - Computer Science and Data Science") +labs(x = "", y = "")

bigram_not_csds <- TermDocumentMatrix(corpus_not_csds, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
bigramrow_not_csds <- sort(slam::row_sums(bigram_not_csds), decreasing=T)
bigramfreq_not_csds <- data.table(tok = names(bigramrow_not_csds), freq = bigramrow_not_csds)

ggplot(bigramfreq_not_csds[1:10,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = wes_palette(name = "Zissou1", 10, type = "continuous")) + theme_bw() +
     ggtitle("Top 10 Bigrams - Other Fields") +labs(x = "", y = "")
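
The bigram matrices above rely on bigramTokenizer, which is defined outside this section. As a rough illustration of what such a tokenizer does, the base R sketch below pairs each word with the word that follows it. The name `bigram_tokenizer_sketch` is hypothetical, and the project's actual tokenizer may be implemented differently.

```r
# Hypothetical stand-in for bigramTokenizer; the real definition lives elsewhere.
bigram_tokenizer_sketch <- function(x) {
  # as.character() works on plain strings; collapse multi-line content first
  text <- paste(as.character(x), collapse = " ")
  tokens <- unlist(strsplit(trimws(text), "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < 2) return(character(0))
  # Pair every token with its successor
  paste(head(tokens, -1), tail(tokens, -1))
}

bigram_tokenizer_sketch("strong programming skills")
```

A one-word answer yields no bigrams at all, which is one reason short survey responses produce few repeated bigrams.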

Conclusions

The top broad skills identified by the survey were “programming skills” and “analytical skills”. “Statistics”, “R”, and “Python” were top skills identified by individuals working full-time in either computer science or data science. Overall, in the filtered data sets, there weren’t many common bigrams, which was likely due to the small size of the survey. Interestingly, two survey respondents in the data science / computer science group identified “business knowledge” as a top skill.

Overall, most of the skills identified by the survey were either broad answers, such as “programming skills” or “analytical skills”, or specific programming languages such as Python, R, and SQL. A few respondents also identified abstract skills, such as creativity. While the answers of individuals working or studying in data science or computer science were more specific, more survey responses should be gathered before drawing any broader conclusions.

5.Analyze Popular Blogs

DISCLAIMER: Findings from these blogs were not used to draw any conclusions or in the final analysis. This section serves only to explore what people say about the skills needed in data science.

We started by researching popular sites like YouTube, Quora and Reddit to gain a better understanding of the research question.

YouTube

Keywords Everywhere is a Chrome extension that allows the user to perform keyword density checks on websites.

The table below is the result of a density check for the keywords “data scientist skills” on YouTube.

We observe that most of the videos on the subject of data scientist skills have been published in the past eleven months and seem to be rising in popularity at a fast pace. In addition, the terms data scientist and data analyst are used interchangeably. Does that mean both positions are the same, or require the same skills?

Blogs

Some bloggers insist that Python and R are key, while others promote communication and presentation skills as most important.

A debate rages on Quora as to whether data scientists need high-level math skills or no math at all.

Some in the field argue that data scientist is just a fancy term for what we used to call a data analyst, while others insist the field is new and different.

Quora

Reddit

Analyzing Reddit Blog

content <- reddit_content("https://www.reddit.com/r/datascience/comments/m5mub0/why_do_so_many_of_us_suck_at_basic_programming/")


dfcomment <- content %>%
  select(comment)
# write_delim() defaults to a space delimiter; write_csv() matches the .csv extension
write_csv(dfcomment, file = "blog.csv", col_names = TRUE)
#write.csv(content, file = "untidy_blog.csv", row.names = FALSE)

EH Function

EH_WordCloudIt <- function(dfDataframe, sColumn, bStem = TRUE)
{
  # Load the data as a corpus
  docs <- Corpus(VectorSource(dfDataframe[[sColumn]]))
  
  #Clean the data
  toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
  docs <- tm_map(docs, toSpace, "/")
  docs <- tm_map(docs, toSpace, "@")
  docs <- tm_map(docs, toSpace, "\\|")
  
  # Convert the text to lower case
  docs <- tm_map(docs, content_transformer(tolower))
  # Remove numbers
  docs <- tm_map(docs, removeNumbers)
  # Remove english common stopwords
  docs <- tm_map(docs, removeWords, stopwords("english"))
  # Remove your own stop word
  # specify your stopwords as a character vector
  docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 
  # Remove punctuations
  docs <- tm_map(docs, removePunctuation)
  # Eliminate extra white spaces
  docs <- tm_map(docs, stripWhitespace)
  
  if (bStem)
  {
    docs <- tm_map(docs, stemDocument)
    
  }
   
  #gets words by frequency
  dtm <- TermDocumentMatrix(docs)
  m <- as.matrix(dtm)
  v <- sort(rowSums(m),decreasing=TRUE)
  d <- data.frame(word = names(v),freq=v)
  #head(d, 100)
  
  #generates the word cloud.  The set.seed appears to determine the display of similarly ranked elements.  
  #Without it, those elements are randomly displayed each time.
  
  #https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer - other palettes
  set.seed(1234)
  wordcloud(words = d$word, freq = d$freq, min.freq = 1,
            max.words=200, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"))
  print("wordcloud finished")
  
  return(d)
}

Experiment

findSkills <- ".............................................................................skills"
#findSkills <- "able to....................................................................."
ext<-str_extract(dfcomment$comment, findSkills )
dfExt<-as.data.frame(ext)
dfExt <- na.omit(dfExt)
freqSkills <- EH_WordCloudIt(dfExt, "ext", FALSE)

## [1] "wordcloud finished"

freqSkills

##                        word freq
## skills               skills   16
## concepts           concepts    5
## basic                 basic    4
## documentation documentation    4
## oop                     oop    3
## proper               proper    3
## writing             writing    3
## programming     programming    2
## learned             learned    2
## never                 never    2
## good                   good    2
## improve             improve    2
## job                     job    2
## people               people    2
## ’re                     ’re    1
## aren’t               aren’t    1
## coders               coders    1
## data                   data    1
## great                 great    1
## hired                 hired    1
## obviously         obviously    1
## scientists       scientists    1
## since                 since    1
## advanced           advanced    1
## benefit             benefit    1
## necessarily     necessarily    1
## object               object    1
## obs                     obs    1
## oriented           oriented    1
## many                   many    1
## hours                 hours    1
## life                   life    1
## reading             reading    1
## “talk                 “talk    1
## can                     can    1
## coding               coding    1
## next                   next    1
## senior               senior    1
## shop”                 shop”    1
## domain               domain    1
## exist                 exist    1
## perhaps             perhaps    1
## statistical     statistical    1
## strong               strong    1
## backgrounds     backgrounds    1
## biostats           biostats    1
## exchemistry     exchemistry    1
## picked               picked    1
## seem                   seem    1
## somehow             somehow    1
## type                   type    1
## break                 break    1
## complaining     complaining    1
## guy                     guy    1
## hear                   hear    1
## one                     one    1
## rogrammers       rogrammers    1
## room                   room    1
## walked               walked    1
## much                   much    1
## reason               reason    1
## see                     see    1
## think                 think    1
## variation         variation    1
## whatever           whatever    1
## modularity       modularity    1
## anyone               anyone    1
## recommend         recommend    1
## ancillary         ancillary    1
## gotta                 gotta    1
## invest               invest    1
## pick                   pick    1
## speed                 speed    1
## stay                   stay    1
## hire                   hire    1
## ling                   ling    1
## personally       personally    1
## put                     put    1
## research           research    1
## someone             someone    1
## work                   work    1
## clients             clients    1
## ing                     ing    1
## none                   none    1
## requested         requested    1
## core                   core    1
## engineering     engineering    1
## really               really    1
## shows                 shows    1
## software           software    1
## students           students    1
## century             century    1
## degree               degree    1
## level                 level    1
## sexiest             sexiest    1

“what are the most valued skills of the data scientist?”

Findings

Graph Top 20 Skills by total count, and percentage of job descriptions for each of Data Scientist and Data Analyst

From our Kaggle datasets, let’s look at the top 20 skills for each position - data scientist and data analyst - both by their raw number of mentions within the dataset, and by the percentage of job descriptions in which they appear within their dataset.

top20_countds <- top_skills_all %>% slice_max(order_by=count_ds, n=20)

ggplot(top20_countds, aes(x=reorder(skills, count_ds), y=count_ds)) +
  geom_col(fill="coral") + coord_flip() + labs(
    title = "Overall No. of Mentions per Skill",
    subtitle = "Data Scientist",
    x = "Skill",
    y = "Mentions",
    caption= "Kaggle"
  )

top20_jdpercentds <- top_skills_all %>% slice_max(order_by=jd_percent_ds, n=20)

ggplot(top20_jdpercentds, aes(x=reorder(skills, jd_percent_ds), y=jd_percent_ds)) +
  geom_col(fill="coral") + coord_flip() +
  scale_y_continuous(labels = scales::percent) +  # jd_percent_ds is stored as a 0-1 fraction
  labs(
    title = "Percent of Job Descriptions Mentioning Skill",
    subtitle = "Data Scientist",
    x = "Skill",
    y = "% of Job Descriptions",
    caption= "Kaggle"
  )

top20_countda <- top_skills_all %>% slice_max(order_by=count_da, n=20)

ggplot(top20_countda, aes(x=reorder(skills, count_da), y=count_da)) +
  geom_col(fill="blue") + coord_flip() + labs(
    title = "Overall No. of Mentions per Skill",
    subtitle = "Data Analyst",
    x = "Skill",
    y = "Mentions",
    caption= "Kaggle"
  )

top20_jdpercentda <- top_skills_all %>% slice_max(order_by=jd_percent_da, n=20)

ggplot(top20_jdpercentda, aes(x=reorder(skills, jd_percent_da), y=jd_percent_da)) +
  geom_col(fill="blue") + coord_flip() +
  scale_y_continuous(labels = scales::percent) +  # jd_percent_da is stored as a 0-1 fraction
  labs(
    title = "Percent of Job Descriptions Mentioning Skill",
    subtitle = "Data Analyst", 
    x = "Skill",
    y = "% of Job Descriptions",
    caption= "Kaggle"
  )

Python: One Skill to Rule them All?

First, we can see that while Python is far and away the skill with the most overall mentions for both positions, this is driven in large part by its being mentioned more often within each job description in which it appears.

Looking at the percentage of job descriptions that mention Python, we can see that its dominance over other skills in the Data Scientist job descriptions is less pronounced, and that among the Data Analyst job descriptions it falls to rank 13.
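
The distinction between total mentions and mentions per job description can be reproduced from the findings tables. The sketch below copies three Data Scientist rows from those tables and recomputes the freq_per_jd_ds column as count divided by jd_count, which matches the reported values. The data frame name `skills_ds` is hypothetical. The jd_percent columns would additionally require the total number of job descriptions, which this section does not state, so they are not recomputed here.

```r
# Three rows copied from the Data Scientist findings table.
skills_ds <- data.frame(
  skills      = c("python", "r", "machine learning"),
  count_ds    = c(5250, 1118, 1693),
  jd_count_ds = c(1750, 950, 1693),
  stringsAsFactors = FALSE
)

# Average number of mentions in each job description that mentions the skill
skills_ds$freq_per_jd_ds <- round(skills_ds$count_ds / skills_ds$jd_count_ds, 3)
skills_ds
```

Python comes out at exactly 3.000 in both the Data Scientist and Data Analyst tables. A value that round is worth a second look, since it could also arise if the term were counted more than once per description, which this section does not let us confirm.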

Scientist vs. Analyst

Looking at the other skills that round out each position’s top 20, we can draw another clear conclusion, one that we expect connects directly to the relative importance each role gives to Python: the second most frequently requested skill among Data Scientist job descriptions is machine learning, while machine learning does not appear among the top 20 most requested skills in Data Analyst descriptions.

While both roles mention research and SQL with similar frequency, a clear delineation emerges between the two. In addition to machine learning and Python, Data Scientist job descriptions are more likely to require knowledge of statistics, mathematics, and R - and to prefer that the candidate has completed a Ph.D., indicating a desire for deeper subject matter expertise in these areas. Data Analyst roles, on the other hand, place more emphasis on softer skills - communication, vision, leadership, and organization - and may thus be a better entry point for working in the field.
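
One way to make this comparison explicit is to put the two roles' ranks side by side. The sketch below copies a few rank_ds and rank_da values from the combined findings table and recomputes avg_rank as their mean, which matches the table. The `rank_gap` column and the `ranks` data frame are hypothetical additions for illustration and are not part of the original analysis.

```r
# Ranks copied from the combined findings table (count-based ranks).
ranks <- data.frame(
  skills  = c("machine learning", "statistics", "sql", "communication", "phd", "research"),
  rank_ds = c(2, 7, 8, 9, 19, 6),
  rank_da = c(37, 12, 11, 3, 50, 2),
  stringsAsFactors = FALSE
)

ranks$avg_rank <- (ranks$rank_ds + ranks$rank_da) / 2
# Positive gap: ranked higher for Data Scientists; negative: higher for Data Analysts
ranks$rank_gap <- ranks$rank_da - ranks$rank_ds

sorted <- ranks[order(-ranks$rank_gap), ]
sorted
```

Sorting by the gap puts machine learning and phd at the Data Scientist end and communication and research at the Data Analyst end, while sql and statistics sit close together for both roles.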

findings_table_ds

Top Skills
	skills	count	rank	jd_count
1	python	5250	1	1750
4	machine learning	1693	2	1693
5	design	1468	3	1468
6	computer	1460	4	1460
7	science	1384	5	1384
8	research	1254	6	1254
9	statistics	1249	7	1249
10	sql	1172	8	1172
11	communication	1168	9	1168
12	r	1118	10	950
13	math	1102	11	1102
14	solutions	1056	12	1056
15	algorithms	975	13	975
16	programming	960	14	960
17	leader	905	15	905
18	organization	844	16	844
19	passion	821	17	821
20	analytical	799	18	799
21	phd	789	19	789
22	scala	789	20	789
23	quantitative	779	21	779
24	spark	779	22	779
25	mathematics	768	23	768
26	communication skills	766	24	766
27	java	765	25	765
28	AI	729	26	NA
29	vision	714	27	714
30	hadoop	706	28	706
31	database	699	29	699
32	big data	695	30	695
33	written	684	31	684
34	ml	617	32	617
35	visualization	578	33	578
36	years of experience	576	34	576
37	data sets	566	35	566
38	leadership	536	36	536
39	data analysis	532	37	532
40	git	516	38	516
41	collaborative	511	39	511
42	office	506	40	506
43	data mining	483	41	483
44	verbal	481	42	481
45	innovation	471	43	471
46	presentation	463	44	463
47	creating	451	45	451
48	sas	445	46	445
49	deep learning	432	47	432
50	large data	398	48	398
51	natural language	390	49	390
52	data engineer	389	50	389
53	physics	337	51	337
54	collaboration	329	52	329
55	software development	327	53	327
56	language processing	320	54	320
57	artificial intelligence	318	55	318
58	natural language processing	318	56	318
59	economics	315	57	315
60	learning algorithms	315	58	315
61	programming languages	313	59	313
62	business problems	312	60	312
63	data visualization	312	61	312
64	machine learning techniques	298	62	298
65	writing	295	63	295
66	rtable	291	64	291
67	data analytics	285	65	285
68	tableau	279	66	279
69	machine learning algorithms	278	67	278
70	matlab	277	68	277
71	consulting	274	69	274
72	problem solving	273	70	273
73	learning models	268	71	268
74	nlp	254	72	254
75	influence	251	73	251
76	linux	251	74	251
77	flexible	244	75	244
78	etl	239	76	239
79	statistical modeling	237	77	237
80	large scale	235	78	235
81	nosql	232	79	232
82	machine learning models	231	80	231
83	data processing	227	81	227
84	large data sets	227	82	227
85	data pipeline	225	83	225
86	ms	223	84	253
87	predictive models	211	85	211
88	interpersonal	201	86	201
89	masters	201	87	201
90	data engineering	194	88	194
91	software engineers	194	89	194
92	data pipelines	179	90	179
93	decision making	173	91	173
94	organizational	172	92	172
95	forecasting	169	93	169
96	bachelor’s degree	164	94	164
97	monitoring	162	95	162
98	data management	161	96	161
99	have experience	157	97	157
100	creativity	155	98	155
101	microsoft	148	99	148
102	predictive analytics	146	100	146
103	project management	141	101	141
104	data collection	132	102	132
105	azure	127	103	127
106	javascript	127	104	127
107	modeling techniques	122	105	122
108	data architecture	120	106	120
109	data models	116	107	116
110	sap	112	108	112
111	unix	112	109	112
112	Go	107	110	NA
113	array	104	111	104
114	modelling	102	112	102
115	ruby	98	113	98
116	work independently	97	114	97
117	data warehousing	95	115	95
118	facebook	95	116	95
119	mysql	92	117	92
120	powerpoint	84	118	84
121	language understanding	83	119	83
122	ecommerce	80	120	80
123	data systems	78	121	78
124	solving problems	77	122	77
125	data extraction	75	123	75
126	mongodb	75	124	75
127	apache spark	65	125	65
128	natural language understanding	64	126	64
129	critical thinking	62	127	62
130	kpmg	60	128	60
131	data integration	59	129	59
132	big data architecture	57	130	57
133	github	57	131	57
134	postgresql	56	132	56
135	data manipulation	51	133	51
136	highly motivated	51	134	51
137	masters degree	49	135	49
138	elasticsearch	47	136	47
139	speaking	47	137	47
140	architecture capabilities	46	138	46
141	bachelors	46	139	46
142	covering technologies	46	140	46
143	multi-task	46	141	46
144	bash	45	142	45
145	nlu	40	143	40
146	shell script	40	144	40
147	troubleshooting	40	145	40
148	youtube	40	146	40
149	coordination	39	147	39
150	data gathering	39	148	39
151	time management	38	149	38
152	methodological	34	150	34
153	microsoft office	31	151	31
154	django	30	152	30
155	google analytics	30	153	30
156	market research	29	154	29
157	network analysis	28	155	28
158	data insights	27	156	27
159	microsoft azure	23	157	23
160	data preparation	22	158	22
161	negotiation	21	159	21
162	sales and marketing	19	160	19
163	vba	19	161	19
164	jupyter notebook	18	162	18
165	microstrategy	18	163	18
166	doctorate degree	17	164	17
167	manage multiple projects	16	165	16
168	analytics data	15	166	15
169	microsoft excel	15	167	15
170	machine learning data	13	168	13
171	strategic thinking	13	169	13
172	highly organized	10	170	10
173	jquery	10	171	10
174		9	172	NA
175	grammatical	9	173	9
176	symantec	9	174	9
177	telecommunications	9	175	9
178	data entry	8	176	8
179	data mapping	8	177	8
180	data reporting	8	178	8
181	eko	8	179	8
182	active learning	7	180	7
183	apache hadoop	7	181	7
184	confluence	7	182	7
185	amazon redshift	6	183	6
186	data transfer	6	184	6
187	microsoft word	6	185	6
188	swift	6	186	6
189	complex problem solving	5	187	5
190	english language	5	188	5
191	ibm db2	5	189	5
192	microsoft sql server	5	190	5
193	service orientation	5	191	5
194	unix shell	5	192	5
195	apache kafka	4	193	4
196	operations analysis	4	194	4
197	see the big picture	4	195	4
198	apache hive	3	196	3
199	data interpretation	3	197	3
200	experience in information technology	3	198	3
201	microsoft access	3	199	3
202	minitab	3	200	3
203	skype	3	201	3
204	systems analysis	3	202	3
205	technology design	3	203	3
206	ubuntu	3	204	3
207	work well in a team	3	205	3
208	active listening	2	206	2
209	citrix	2	207	2
210	clerical	2	208	2
211	client management	2	209	2
212	data cleanup	2	210	2
213	data organization	2	211	2
214	google adwords	2	212	2
215	mathematical reasoning	2	213	2
216	microsoft outlook	2	214	2
217	microsoft powerpoint	2	215	2
218	amazon dynamodb	1	216	1
219	bring creativity	1	217	1
220	design development	1	218	1
221	engineering and technology	1	219	1
222	filemaker pro	1	220	1
223	ibm infosphere datastage	1	221	1
224	judgment and decision making	1	222	1
225	microsoft dynamics	1	223	1
226	microsoft windows server	1	224	1
227	oracle hyperion	1	225	1
228	oracle java	1	226	1
229	organizational management	1	227	1
230	prepare data for analysis	1	228	1
231	quality control analysis	1	229	1
232	reading comprehension	1	230	1
233	report creation	1	231	1
234	teradata database	1	232	1
235	wireshark	1	233	1

findings_table_da

Top Skills
	skills	count	rank	jd_count
1	python	1275	1	425
4	research	785	2	785
5	communication	776	3	776
6	analytical	653	4	653
7	design	627	5	627
8	organization	605	6	605
9	written	514	7	514
10	quantitative	499	8	499
11	communication skills	488	9	488
12	leader	484	10	484
13	sql	467	11	467
14	statistics	466	12	466
15	office	463	13	463
16	math	403	14	403
17	r	401	15	326
18	solutions	400	16	400
19	computer	380	17	380
20	presentation	373	18	373
21	database	367	19	367
22	vision	358	20	358
23	leadership	327	21	327
24	verbal	324	22	324
25	passion	323	23	323
26	programming	313	24	313
27	science	312	25	312
28	sas	309	26	309
29	data analysis	304	27	304
30	writing	300	28	300
31	years of experience	286	29	286
32	collaborative	284	30	284
33	economics	268	31	268
34	mathematics	257	32	257
35	visualization	243	33	243
36	organizational	224	34	224
37	microsoft	223	35	223
38	innovation	207	36	207
39	machine learning	207	37	207
40	data sets	206	38	206
41	tableau	203	39	203
42	interpersonal	198	40	198
43	creating	196	41	196
44	git	195	42	195
45	powerpoint	190	43	190
46	consulting	182	44	182
47	ms	173	45	165
48	problem solving	172	46	172
49	data visualization	162	47	162
50	project management	159	48	159
51	collaboration	154	49	154
52	phd	151	50	151
53	bachelor’s degree	149	51	149
54	data analytics	140	52	140
55	flexible	137	53	137
56	java	134	54	134
57	ml	134	55	134
58	large data	133	56	133
59	algorithms	130	57	130
60	rtable	128	58	128
61	work independently	125	59	125
62	data collection	121	60	121
63	big data	120	61	120
64	data management	120	62	120
65	influence	118	63	118
66	monitoring	118	64	118
67	decision making	114	65	114
68	scala	113	66	113
69	data mining	105	67	105
70	microsoft office	100	68	100
71	forecasting	96	69	96
72	market research	95	70	95
73	hadoop	92	71	92
74	matlab	92	72	92
75	physics	89	73	89
76	programming languages	88	74	88
77	business problems	87	75	87
78	spark	87	76	87
79	masters	85	77	85
80	data engineer	81	78	81
81	critical thinking	77	79	77
82	multi-task	77	80	77
83	Go	74	81	NA
84	etl	73	82	73
85	large data sets	72	83	72
86	coordination	68	84	68
87	microsoft excel	68	85	68
88	statistical modeling	61	86	61
89	vba	60	87	60
90	have experience	58	88	58
91	facebook	53	89	53
92	sap	53	90	53
93	highly motivated	52	91	52
94	time management	51	92	51
95	creativity	50	93	50
96	data engineering	50	94	50
97	linux	46	95	46
98	bachelors	41	96	41
99	google analytics	41	97	41
100	software engineers	41	98	41
101	data processing	40	99	40
102	modelling	39	100	39
103	predictive analytics	39	101	39
104	predictive models	39	102	39
105	software development	38	103	38
106	javascript	37	104	37
107	modeling techniques	36	105	36
108	deep learning	35	106	35
109	array	34	107	34
110	troubleshooting	34	108	34
111	unix	34	109	34
112	artificial intelligence	33	110	33
113	machine learning techniques	33	111	33
114	data manipulation	31	112	31
115	data models	31	113	31
116	data systems	31	114	31
117	natural language	31	115	31
118	data warehousing	30	116	30
119	mysql	30	117	30
120	solving problems	30	118	30
121	youtube	29	119	29
122	data integration	28	120	28
123	manage multiple projects	28	121	28
124	masters degree	27	122	27
125	data pipeline	26	123	26
126	ecommerce	26	124	26
127	learning algorithms	26	125	26
128	methodological	26	126	26
129	speaking	26	127	26
130	AI	25	128	NA
131	data extraction	25	129	25
132	language processing	25	130	25
133	data gathering	24	131	24
134	machine learning algorithms	24	132	24
135	natural language processing	24	133	24
136	learning models	23	134	23
137	large scale	22	135	22
138	nosql	22	136	22
139	machine learning models	21	137	21
140	azure	19	138	19
141	data entry	19	139	19
142	data insights	19	140	19
143	doctorate degree	19	141	19
144	microsoft word	18	142	18
145	highly organized	17	143	17
146	ruby	17	144	17
147	data pipelines	16	145	16
148	data reporting	15	146	15
149	negotiation	15	147	15
150	systems analysis	15	148	15
151	microsoft powerpoint	14	149	14
152	nlp	14	150	14
153	data architecture	13	151	13
154	bash	12	152	12
155	network analysis	12	153	12
156	elasticsearch	11	154	11
157	postgresql	11	155	11
158	service orientation	11	156	11
159	strategic thinking	11	157	11
160	english language	10	158	10
161	github	10	159	10
162	mongodb	10	160	10
163	data preparation	8	161	8
164	data transfer	8	162	8
165	analytics data	7	163	7
166	client management	7	164	7
167	microsoft access	7	165	7
168		6	166	NA
169	apache spark	6	167	6
170	microsoft project	6	168	6
171	shell script	6	169	6
172	jupyter notebook	5	170	5
173	language understanding	5	171	5
174	microstrategy	5	172	5
175	sales and marketing	5	173	5
176	symantec	5	174	5
177	data interpretation	4	175	4
178	grammatical	4	176	4
179	minitab	4	177	4
180	natural language understanding	4	178	4
181	report creation	4	179	4
182	amazon redshift	3	180	3
183	complex problem solving	3	181	3
184	data mapping	3	182	3
185	data storytelling	3	183	3
186	kpmg	3	184	3
187	lexisnexis	3	185	3
188	microsoft outlook	3	186	3
189	telecommunications	3	187	3
190	work well in a team	3	188	3
191	active learning	2	189	2
192	administration and management	2	190	2
193	ajax	2	191	2
194	apache hadoop	2	192	2
195	clerical	2	193	2
196	confluence	2	194	2
197	experience in market research	2	195	2
198	google docs	2	196	2
199	mcafee	2	197	2
200	microsoft azure	2	198	2
201	nlu	2	199	2
202	operations analysis	2	200	2
203	see the big picture	2	201	2
204	swift	2	202	2
205	wireshark	2	203	2
206	apache kafka	1	204	1
207	apache tomcat	1	205	1
208	data cleanup	1	206	1
209	data organization	1	207	1
210	datadriven	1	208	1
211	deductive reasoning	1	209	1
212	design development	1	210	1
213	django	1	211	1
214	eko	1	212	1
215	epic systems	1	213	1
216	experience in information technology	1	214	1
217	filemaker pro	1	215	1
218	google adwords	1	216	1
219	ibm db2	1	217	1
220	jquery	1	218	1
221	judgment and decision making	1	219	1
222	machine learning data	1	220	1
223	mathematical reasoning	1	221	1
224	microsoft dynamics	1	222	1
225	microsoft sharepoint	1	223	1
226	microsoft sql server	1	224	1
227	microsoft sql server reporting services	1	225	1
228	microsoft windows server	1	226	1
229	oracle hyperion	1	227	1
230	organizational management	1	228	1
231	processing information	1	229	1
232	reading comprehension	1	230	1
233	skype	1	231	1
234	systems evaluation	1	232	1
235	tax software	1	233	1
236	technology design	1	234	1
237	ubuntu	1	235	1
238	unix shell	1	236	1

findings_table_all

Top Skills
skills	count_ds	rank_ds	jd_count_ds	count_da	rank_da	jd_count_da	avg_rank	jd_percent_ds	freq_per_jd_ds	jd_percent_da	freq_per_jd_da
python	5250	1	1750	1275	1	425	1.0	0.727	3.000	0.348	3.000
design	1468	3	1468	627	5	627	4.0	0.610	1.000	0.514	1.000
research	1254	6	1254	785	2	785	4.0	0.521	1.000	0.643	1.000
communication	1168	9	1168	776	3	776	6.0	0.485	1.000	0.636	1.000
statistics	1249	7	1249	466	12	466	9.5	0.519	1.000	0.382	1.000
sql	1172	8	1172	467	11	467	9.5	0.487	1.000	0.382	1.000
computer	1460	4	1460	380	17	380	10.5	0.607	1.000	0.311	1.000
organization	844	16	844	605	6	605	11.0	0.351	1.000	0.495	1.000
analytical	799	18	799	653	4	653	11.0	0.332	1.000	0.535	1.000
r	1118	10	950	401	15	326	12.5	0.395	1.177	0.267	1.230
math	1102	11	1102	403	14	403	12.5	0.458	1.000	0.330	1.000
leader	905	15	905	484	10	484	12.5	0.376	1.000	0.396	1.000
solutions	1056	12	1056	400	16	400	14.0	0.439	1.000	0.328	1.000
quantitative	779	21	779	499	8	499	14.5	0.324	1.000	0.409	1.000
science	1384	5	1384	312	25	312	15.0	0.575	1.000	0.256	1.000
communication skills	766	24	766	488	9	488	16.5	0.318	1.000	0.400	1.000
programming	960	14	960	313	24	313	19.0	0.399	1.000	0.256	1.000
written	684	31	684	514	7	514	19.0	0.284	1.000	0.421	1.000
machine learning	1693	2	1693	207	37	207	19.5	0.704	1.000	0.170	1.000
passion	821	17	821	323	23	323	20.0	0.341	1.000	0.265	1.000
vision	714	27	714	358	20	358	23.5	0.297	1.000	0.293	1.000
database	699	29	699	367	19	367	24.0	0.291	1.000	0.301	1.000
office	506	40	506	463	13	463	26.5	0.210	1.000	0.379	1.000
mathematics	768	23	768	257	32	257	27.5	0.319	1.000	0.210	1.000
leadership	536	36	536	327	21	327	28.5	0.223	1.000	0.268	1.000
presentation	463	44	463	373	18	373	31.0	0.192	1.000	0.305	1.000
years of experience	576	34	576	286	29	286	31.5	0.239	1.000	0.234	1.000
data analysis	532	37	532	304	27	304	32.0	0.221	1.000	0.249	1.000
verbal	481	42	481	324	22	324	32.0	0.200	1.000	0.265	1.000
visualization	578	33	578	243	33	243	33.0	0.240	1.000	0.199	1.000
phd	789	19	789	151	50	151	34.5	0.328	1.000	0.124	1.000
collaborative	511	39	511	284	30	284	34.5	0.212	1.000	0.233	1.000
algorithms	975	13	975	130	57	130	35.0	0.405	1.000	0.106	1.000
sas	445	46	445	309	26	309	36.0	0.185	1.000	0.253	1.000
data sets	566	35	566	206	38	206	36.5	0.235	1.000	0.169	1.000
java	765	25	765	134	54	134	39.5	0.318	1.000	0.110	1.000
innovation	471	43	471	207	36	207	39.5	0.196	1.000	0.170	1.000
git	516	38	516	195	42	195	40.0	0.214	1.000	0.160	1.000
scala	789	20	789	113	66	113	43.0	0.328	1.000	0.093	1.000
creating	451	45	451	196	41	196	43.0	0.187	1.000	0.161	1.000
ml	617	32	617	134	55	134	43.5	0.256	1.000	0.110	1.000
economics	315	57	315	268	31	268	44.0	0.131	1.000	0.219	1.000
big data	695	30	695	120	61	120	45.5	0.289	1.000	0.098	1.000
writing	295	63	295	300	28	300	45.5	0.123	1.000	0.246	1.000
spark	779	22	779	87	76	87	49.0	0.324	1.000	0.071	1.000
hadoop	706	28	706	92	71	92	49.5	0.293	1.000	0.075	1.000
collaboration	329	52	329	154	49	154	50.5	0.137	1.000	0.126	1.000
large data	398	48	398	133	56	133	52.0	0.165	1.000	0.109	1.000
tableau	279	66	279	203	39	203	52.5	0.116	1.000	0.166	1.000
data mining	483	41	483	105	67	105	54.0	0.201	1.000	0.086	1.000
data visualization	312	61	312	162	47	162	54.0	0.130	1.000	0.133	1.000
consulting	274	69	274	182	44	182	56.5	0.114	1.000	0.149	1.000
problem solving	273	70	273	172	46	172	58.0	0.113	1.000	0.141	1.000
data analytics	285	65	285	140	52	140	58.5	0.118	1.000	0.115	1.000
rtable	291	64	291	128	58	128	61.0	0.121	1.000	0.105	1.000
physics	337	51	337	89	73	89	62.0	0.140	1.000	0.073	1.000
interpersonal	201	86	201	198	40	198	63.0	0.084	1.000	0.162	1.000
organizational	172	92	172	224	34	224	63.0	0.071	1.000	0.183	1.000
data engineer	389	50	389	81	78	81	64.0	0.162	1.000	0.066	1.000
flexible	244	75	244	137	53	137	64.0	0.101	1.000	0.112	1.000
ms	223	84	253	173	45	165	64.5	0.105	0.881	0.135	1.048
programming languages	313	59	313	88	74	88	66.5	0.130	1.000	0.072	1.000
microsoft	148	99	148	223	35	223	67.0	0.062	1.000	0.183	1.000
business problems	312	60	312	87	75	87	67.5	0.130	1.000	0.071	1.000
influence	251	73	251	118	63	118	68.0	0.104	1.000	0.097	1.000
matlab	277	68	277	92	72	92	70.0	0.115	1.000	0.075	1.000
bachelor’s degree	164	94	164	149	51	149	72.5	0.068	1.000	0.122	1.000
project management	141	101	141	159	48	159	74.5	0.059	1.000	0.130	1.000
deep learning	432	47	432	35	106	35	76.5	0.180	1.000	0.029	1.000
AI	729	26	NA	25	128	NA	77.0	NA	NA	NA	NA
software development	327	53	327	38	103	38	78.0	0.136	1.000	0.031	1.000
decision making	173	91	173	114	65	114	78.0	0.072	1.000	0.093	1.000
etl	239	76	239	73	82	73	79.0	0.099	1.000	0.060	1.000
data management	161	96	161	120	62	120	79.0	0.067	1.000	0.098	1.000
monitoring	162	95	162	118	64	118	79.5	0.067	1.000	0.097	1.000
powerpoint	84	118	84	190	43	190	80.5	0.035	1.000	0.156	1.000
forecasting	169	93	169	96	69	96	81.0	0.070	1.000	0.079	1.000
data collection	132	102	132	121	60	121	81.0	0.055	1.000	0.099	1.000
statistical modeling	237	77	237	61	86	61	81.5	0.099	1.000	0.050	1.000
natural language	390	49	390	31	115	31	82.0	0.162	1.000	0.025	1.000
masters	201	87	201	85	77	85	82.0	0.084	1.000	0.070	1.000
artificial intelligence	318	55	318	33	110	33	82.5	0.132	1.000	0.027	1.000
large data sets	227	82	227	72	83	72	82.5	0.094	1.000	0.059	1.000
linux	251	74	251	46	95	46	84.5	0.104	1.000	0.038	1.000
machine learning techniques	298	62	298	33	111	33	86.5	0.124	1.000	0.027	1.000
work independently	97	114	97	125	59	125	86.5	0.040	1.000	0.102	1.000
data processing	227	81	227	40	99	40	90.0	0.094	1.000	0.033	1.000
data engineering	194	88	194	50	94	50	91.0	0.081	1.000	0.041	1.000
learning algorithms	315	58	315	26	125	26	91.5	0.131	1.000	0.021	1.000
language processing	320	54	320	25	130	25	92.0	0.133	1.000	0.020	1.000
have experience	157	97	157	58	88	58	92.5	0.065	1.000	0.048	1.000
predictive models	211	85	211	39	102	39	93.5	0.088	1.000	0.032	1.000
software engineers	194	89	194	41	98	41	93.5	0.081	1.000	0.034	1.000
natural language processing	318	56	318	24	133	24	94.5	0.132	1.000	0.020	1.000
creativity	155	98	155	50	93	50	95.5	0.064	1.000	0.041	1.000
Go	107	110	NA	74	81	NA	95.5	NA	NA	NA	NA
sap	112	108	112	53	90	53	99.0	0.047	1.000	0.043	1.000
machine learning algorithms	278	67	278	24	132	24	99.5	0.116	1.000	0.020	1.000
predictive analytics	146	100	146	39	101	39	100.5	0.061	1.000	0.032	1.000
learning models	268	71	268	23	134	23	102.5	0.111	1.000	0.019	1.000
facebook	95	116	95	53	89	53	102.5	0.039	1.000	0.043	1.000
data pipeline	225	83	225	26	123	26	103.0	0.094	1.000	0.021	1.000
critical thinking	62	127	62	77	79	77	103.0	0.026	1.000	0.063	1.000
javascript	127	104	127	37	104	37	104.0	0.053	1.000	0.030	1.000
modeling techniques	122	105	122	36	105	36	105.0	0.051	1.000	0.029	1.000
modelling	102	112	102	39	100	39	106.0	0.042	1.000	0.032	1.000
large scale	235	78	235	22	135	22	106.5	0.098	1.000	0.018	1.000
nosql	232	79	232	22	136	22	107.5	0.096	1.000	0.018	1.000
machine learning models	231	80	231	21	137	21	108.5	0.096	1.000	0.017	1.000
unix	112	109	112	34	109	34	109.0	0.047	1.000	0.028	1.000
array	104	111	104	34	107	34	109.0	0.043	1.000	0.028	1.000
microsoft office	31	151	31	100	68	100	109.5	0.013	1.000	0.082	1.000
data models	116	107	116	31	113	31	110.0	0.048	1.000	0.025	1.000
multi-task	46	141	46	77	80	77	110.5	0.019	1.000	0.063	1.000
nlp	254	72	254	14	150	14	111.0	0.106	1.000	0.011	1.000
market research	29	154	29	95	70	95	112.0	0.012	1.000	0.078	1.000
highly motivated	51	134	51	52	91	52	112.5	0.021	1.000	0.043	1.000
data warehousing	95	115	95	30	116	30	115.5	0.039	1.000	0.025	1.000
coordination	39	147	39	68	84	68	115.5	0.016	1.000	0.056	1.000
mysql	92	117	92	30	117	30	117.0	0.038	1.000	0.025	1.000
data pipelines	179	90	179	16	145	16	117.5	0.074	1.000	0.013	1.000
data systems	78	121	78	31	114	31	117.5	0.032	1.000	0.025	1.000
bachelors	46	139	46	41	96	41	117.5	0.019	1.000	0.034	1.000
solving problems	77	122	77	30	118	30	120.0	0.032	1.000	0.025	1.000
azure	127	103	127	19	138	19	120.5	0.053	1.000	0.016	1.000
time management	38	149	38	51	92	51	120.5	0.016	1.000	0.042	1.000
ecommerce	80	120	80	26	124	26	122.0	0.033	1.000	0.021	1.000
data manipulation	51	133	51	31	112	31	122.5	0.021	1.000	0.025	1.000
vba	19	161	19	60	87	60	124.0	0.008	1.000	0.049	1.000
data integration	59	129	59	28	120	28	124.5	0.025	1.000	0.023	1.000
google analytics	30	153	30	41	97	41	125.0	0.012	1.000	0.034	1.000
data extraction	75	123	75	25	129	25	126.0	0.031	1.000	0.020	1.000
microsoft excel	15	167	15	68	85	68	126.0	0.006	1.000	0.056	1.000
troubleshooting	40	145	40	34	108	34	126.5	0.017	1.000	0.028	1.000
data architecture	120	106	120	13	151	13	128.5	0.050	1.000	0.011	1.000
ruby	98	113	98	17	144	17	128.5	0.041	1.000	0.014	1.000
masters degree	49	135	49	27	122	27	128.5	0.020	1.000	0.022	1.000
speaking	47	137	47	26	127	26	132.0	0.020	1.000	0.021	1.000
youtube	40	146	40	29	119	29	132.5	0.017	1.000	0.024	1.000
methodological	34	150	34	26	126	26	138.0	0.014	1.000	0.021	1.000
data gathering	39	148	39	24	131	24	139.5	0.016	1.000	0.020	1.000
mongodb	75	124	75	10	160	10	142.0	0.031	1.000	0.008	1.000
manage multiple projects	16	165	16	28	121	28	143.0	0.007	1.000	0.023	1.000
postgresql	56	132	56	11	155	11	143.5	0.023	1.000	0.009	1.000
language understanding	83	119	83	5	171	5	145.0	0.034	1.000	0.004	1.000
github	57	131	57	10	159	10	145.0	0.024	1.000	0.008	1.000
elasticsearch	47	136	47	11	154	11	145.0	0.020	1.000	0.009	1.000
apache spark	65	125	65	6	167	6	146.0	0.027	1.000	0.005	1.000
bash	45	142	45	12	152	12	147.0	0.019	1.000	0.010	1.000
data insights	27	156	27	19	140	19	148.0	0.011	1.000	0.016	1.000
natural language understanding	64	126	64	4	178	4	152.0	0.027	1.000	0.003	1.000
doctorate degree	17	164	17	19	141	19	152.5	0.007	1.000	0.016	1.000
negotiation	21	159	21	15	147	15	153.0	0.009	1.000	0.012	1.000
network analysis	28	155	28	12	153	12	154.0	0.012	1.000	0.010	1.000
kpmg	60	128	60	3	184	3	156.0	0.025	1.000	0.002	1.000
shell script	40	144	40	6	169	6	156.5	0.017	1.000	0.005	1.000
highly organized	10	170	10	17	143	17	156.5	0.004	1.000	0.014	1.000
data entry	8	176	8	19	139	19	157.5	0.003	1.000	0.016	1.000
data preparation	22	158	22	8	161	8	159.5	0.009	1.000	0.007	1.000
data reporting	8	178	8	15	146	15	162.0	0.003	1.000	0.012	1.000
strategic thinking	13	169	13	11	157	11	163.0	0.005	1.000	0.009	1.000
microsoft word	6	185	6	18	142	18	163.5	0.002	1.000	0.015	1.000
analytics data	15	166	15	7	163	7	164.5	0.006	1.000	0.006	1.000
jupyter notebook	18	162	18	5	170	5	166.0	0.007	1.000	0.004	1.000
sales and marketing	19	160	19	5	173	5	166.5	0.008	1.000	0.004	1.000
microstrategy	18	163	18	5	172	5	167.5	0.007	1.000	0.004	1.000
	9	172	NA	6	166	NA	169.0	NA	NA	NA	NA
nlu	40	143	40	2	199	2	171.0	0.017	1.000	0.002	1.000
data transfer	6	184	6	8	162	8	173.0	0.002	1.000	0.007	1.000
english language	5	188	5	10	158	10	173.0	0.002	1.000	0.008	1.000
service orientation	5	191	5	11	156	11	173.5	0.002	1.000	0.009	1.000
symantec	9	174	9	5	174	5	174.0	0.004	1.000	0.004	1.000
grammatical	9	173	9	4	176	4	174.5	0.004	1.000	0.003	1.000
systems analysis	3	202	3	15	148	15	175.0	0.001	1.000	0.012	1.000
microsoft azure	23	157	23	2	198	2	177.5	0.010	1.000	0.002	1.000
data mapping	8	177	8	3	182	3	179.5	0.003	1.000	0.002	1.000
telecommunications	9	175	9	3	187	3	181.0	0.004	1.000	0.002	1.000
django	30	152	30	1	211	1	181.5	0.012	1.000	0.001	1.000
amazon redshift	6	183	6	3	180	3	181.5	0.002	1.000	0.002	1.000
microsoft access	3	199	3	7	165	7	182.0	0.001	1.000	0.006	1.000
microsoft powerpoint	2	215	2	14	149	14	182.0	0.001	1.000	0.011	1.000
complex problem solving	5	187	5	3	181	3	184.0	0.002	1.000	0.002	1.000
active learning	7	180	7	2	189	2	184.5	0.003	1.000	0.002	1.000
data interpretation	3	197	3	4	175	4	186.0	0.001	1.000	0.003	1.000
apache hadoop	7	181	7	2	192	2	186.5	0.003	1.000	0.002	1.000
client management	2	209	2	7	164	7	186.5	0.001	1.000	0.006	1.000
confluence	7	182	7	2	194	2	188.0	0.003	1.000	0.002	1.000
minitab	3	200	3	4	177	4	188.5	0.001	1.000	0.003	1.000
machine learning data	13	168	13	1	220	1	194.0	0.005	1.000	0.001	1.000
swift	6	186	6	2	202	2	194.0	0.002	1.000	0.002	1.000
jquery	10	171	10	1	218	1	194.5	0.004	1.000	0.001	1.000
eko	8	179	8	1	212	1	195.5	0.003	1.000	0.001	1.000
work well in a team	3	205	3	3	188	3	196.5	0.001	1.000	0.002	1.000
operations analysis	4	194	4	2	200	2	197.0	0.002	1.000	0.002	1.000
see the big picture	4	195	4	2	201	2	198.0	0.002	1.000	0.002	1.000
apache kafka	4	193	4	1	204	1	198.5	0.002	1.000	0.001	1.000
microsoft outlook	2	214	2	3	186	3	200.0	0.001	1.000	0.002	1.000
clerical	2	208	2	2	193	2	200.5	0.001	1.000	0.002	1.000
ibm db2	5	189	5	1	217	1	203.0	0.002	1.000	0.001	1.000
report creation	1	231	1	4	179	4	205.0	0.000	1.000	0.003	1.000
experience in information technology	3	198	3	1	214	1	206.0	0.001	1.000	0.001	1.000
microsoft sql server	5	190	5	1	224	1	207.0	0.002	1.000	0.001	1.000
data cleanup	2	210	2	1	206	1	208.0	0.001	1.000	0.001	1.000
data organization	2	211	2	1	207	1	209.0	0.001	1.000	0.001	1.000
unix shell	5	192	5	1	236	1	214.0	0.002	1.000	0.001	1.000
google adwords	2	212	2	1	216	1	214.0	0.001	1.000	0.001	1.000
design development	1	218	1	1	210	1	214.0	0.000	1.000	0.001	1.000
skype	3	201	3	1	231	1	216.0	0.001	1.000	0.001	1.000
mathematical reasoning	2	213	2	1	221	1	217.0	0.001	1.000	0.001	1.000
filemaker pro	1	220	1	1	215	1	217.5	0.000	1.000	0.001	1.000
wireshark	1	233	1	2	203	2	218.0	0.000	1.000	0.002	1.000
technology design	3	203	3	1	234	1	218.5	0.001	1.000	0.001	1.000
ubuntu	3	204	3	1	235	1	219.5	0.001	1.000	0.001	1.000
judgment and decision making	1	222	1	1	219	1	220.5	0.000	1.000	0.001	1.000
microsoft dynamics	1	223	1	1	222	1	222.5	0.000	1.000	0.001	1.000
microsoft windows server	1	224	1	1	226	1	225.0	0.000	1.000	0.001	1.000
oracle hyperion	1	225	1	1	227	1	226.0	0.000	1.000	0.001	1.000
organizational management	1	227	1	1	228	1	227.5	0.000	1.000	0.001	1.000
reading comprehension	1	230	1	1	230	1	230.0	0.000	1.000	0.001	1.000
big data architecture	57	130	57	NA	NA	NA	NA	0.024	1.000	NA	NA
architecture capabilities	46	138	46	NA	NA	NA	NA	0.019	1.000	NA	NA
covering technologies	46	140	46	NA	NA	NA	NA	0.019	1.000	NA	NA
apache hive	3	196	3	NA	NA	NA	NA	0.001	1.000	NA	NA
active listening	2	206	2	NA	NA	NA	NA	0.001	1.000	NA	NA
citrix	2	207	2	NA	NA	NA	NA	0.001	1.000	NA	NA
amazon dynamodb	1	216	1	NA	NA	NA	NA	0.000	1.000	NA	NA
bring creativity	1	217	1	NA	NA	NA	NA	0.000	1.000	NA	NA
engineering and technology	1	219	1	NA	NA	NA	NA	0.000	1.000	NA	NA
ibm infosphere datastage	1	221	1	NA	NA	NA	NA	0.000	1.000	NA	NA
oracle java	1	226	1	NA	NA	NA	NA	0.000	1.000	NA	NA
prepare data for analysis	1	228	1	NA	NA	NA	NA	0.000	1.000	NA	NA
quality control analysis	1	229	1	NA	NA	NA	NA	0.000	1.000	NA	NA
teradata database	1	232	1	NA	NA	NA	NA	0.000	1.000	NA	NA
microsoft project	NA	NA	NA	6	168	6	NA	NA	NA	0.005	1.000
data storytelling	NA	NA	NA	3	183	3	NA	NA	NA	0.002	1.000
lexisnexis	NA	NA	NA	3	185	3	NA	NA	NA	0.002	1.000
administration and management	NA	NA	NA	2	190	2	NA	NA	NA	0.002	1.000
ajax	NA	NA	NA	2	191	2	NA	NA	NA	0.002	1.000
experience in market research	NA	NA	NA	2	195	2	NA	NA	NA	0.002	1.000
google docs	NA	NA	NA	2	196	2	NA	NA	NA	0.002	1.000
mcafee	NA	NA	NA	2	197	2	NA	NA	NA	0.002	1.000
apache tomcat	NA	NA	NA	1	205	1	NA	NA	NA	0.001	1.000
datadriven	NA	NA	NA	1	208	1	NA	NA	NA	0.001	1.000
deductive reasoning	NA	NA	NA	1	209	1	NA	NA	NA	0.001	1.000
epic systems	NA	NA	NA	1	213	1	NA	NA	NA	0.001	1.000
microsoft sharepoint	NA	NA	NA	1	223	1	NA	NA	NA	0.001	1.000
microsoft sql server reporting services	NA	NA	NA	1	225	1	NA	NA	NA	0.001	1.000
processing information	NA	NA	NA	1	229	1	NA	NA	NA	0.001	1.000
systems evaluation	NA	NA	NA	1	232	1	NA	NA	NA	0.001	1.000
tax software	NA	NA	NA	1	233	1	NA	NA	NA	0.001	1.000
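
Reading the rows above, the seventh numeric column matches the mean of the two rank columns (for example, microsoft: (99 + 35) / 2 = 67.0), and the table is sorted on that value. The following is a minimal sketch of that calculation, not the code that produced the table. The column names `ds_rank`, `da_rank` and `mean_rank` are our own labels, assumed here because the table header is not shown in this part of the output. The values are copied from four rows of the table.

```r
# Hypothetical column names; values copied from four rows of the table above.
ranks <- data.frame(
  term    = c("apache hive", "nlp", "microsoft", "deep learning"),
  ds_rank = c(196, 72, 99, 47),    # rank among data scientist listings
  da_rank = c(NA, 150, 35, 106),   # rank among data analyst listings
  stringsAsFactors = FALSE
)

# Mean of the two ranks; NA propagates when a term is missing from one set.
ranks$mean_rank <- rowMeans(ranks[, c("ds_rank", "da_rank")])

# Sort ascending so terms prominent in both sets come first; NA rows go last.
ranks <- ranks[order(ranks$mean_rank), ]
print(ranks)
```

A term that appears in only one set of listings has no rank in the other, so its mean rank is NA and it sorts to the bottom. This matches the rows at the end of the table.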

Survey

Conclusion

To review, the following conclusions emerged from our research:

Data science emphasizes a range of “hard” data-centric skills, especially programming skills like Python, SQL and R, as well as statistical and math skills and a knowledge of machine learning.
At least for employers, data scientist positions and data analyst positions point to very different skill sets. For data analysts, communication, research, analytical and organizational skills are most prominent. While there is overlap on almost all of the skills, the emphases are very different.
Working data scientists and computer scientists from our survey corroborate this point of view. Statistics and programming make up the bulk of the skills for data scientists.
Individuals from our survey not directly involved with data science are more likely to include softer skills. They see writing, thinking, creativity and communication as aspects of data science.

So what are the most valuable skills for a data scientist? Here we really answer a narrower question: what are the most valuable skills to learn in order to be hired and work in the field of data science? The answer is clear: programming, statistics, math, algorithms and computer science.

But if we wanted to answer the broader question of what skills would be most valuable for the next generation of data scientists to learn, we might want to include the thoughts of ethicists, philosophers, futurists, consumers, and representatives of vulnerable populations. That will have to wait for another project.

…
