William Edwards Deming is known for the quote “In God we trust, all others must bring data.” It is clear that data-driven decision making is superior and is in these times a requirement to maintain a competitive edge. Every industry now treats their data as an asset to get insights from and is a foundation in the decision making process. Turning data into insights not only requires appropriate systems in place, but also requires the appropriate subject matter experts to interpret and discover what the data has to say.
Data Science is the art of analyzing data to gain knowledge in order to make business decisions. Data Science at its core is a union of mathematics, statistics, and programming disciplines. Data Scientists need to possess various skills such as data/quantitative analysis, R/Python programming, Machine Learning, Data Visualization, etc. Some skills may be more important than others depending on the task at hand.
As future Data Scientists we seek to understand what are some of the most important skills to possess. We will gather this consensus by conducting exploratory analysis on what skills employers are looking for through job listings on LinkedIn and Indeed.
Given the prompt “Which are the most valued data science skills?”, our team has brainstormed various sources of information that describe the knowledge or skills sought after by current or prospective employers of data scientists. Job listings represent a logical starting point, and we have chosen to focus on LinkedIn and Indeed, two of the most popular online homes for this information.
Our general approach consists of:
1. Scraping job listings from LinkedIn and Indeed
2. Applying a whitelist of data science-relevant words, paired or non-paired, to the scraped text
3. Storing the resulting sets in a SQL database
4. Performing an exploratory analysis
For scraping, we use R’s rvest package and focus on results from searches for “data science” in the New York City area. Our whitelist is informed by a review of several online journalistic sources and includes tools, skills, and knowledge bases like “Python”, “Web Scraping”, and “Machine Learning”, among many others. We use MySQL for database storage. And our analysis relies on R’s tidyverse collection of packages.
Our workflow moves from preparation to scraping and pre-processing to storage to analysis and visualization. Each stage is described below in more detail.
The aforementioned whitelist of 83 data science-related words or terms is compiled from several articles focused on valued and demanded skills for data scientists. The articles are cited below, alphabetized by author’s last name.
Dataquest.com (2019). “How to write a great data science resume”. Dataquest Career Guide. Accessed 10/14/19 from https://www.dataquest.io/blog/how-data-science-resume-cv/
Doyle, A. (2019). “Important job skills for data scientists”. The balance careers. Accessed 10/14/19 from https://www.thebalancecareers.com/list-of-data-scientist-skills-2062381
Gerencer, T. (2019). “Data scientist resume: Sample and complete guide [20+ Examples]”. Zety. Accessed 10/14/19 from https://zety.com/blog/data-scientist-resume-example
Hale, J. (2018). “The most in demand skills for data scientists”. Towards Data Science. Accessed 10/14/2019 from https://towardsdatascience.com/the-most-in-demand-skills-for-data-scientists-4a4a8db896db
Indeed.com (2019). “Data scientist resume samples”. Indeed Career Guide. Accessed 10/14/19 from https://www.indeed.com/career-advice/resume-samples/information-technology-resumes/data-scientist
wlist <- read.csv('https://raw.githubusercontent.com/dhairavc/data607-project3/master/whitelist.csv')
wlist
## Whitelist
## 1 communication
## 2 data analysis
## 3 r programming
## 4 collaboration
## 5 mathematics
## 6 python
## 7 creative thinking
## 8 statistics
## 9 nosql
## 10 critical thinking
## 11 data visualization
## 12 visualization
## 13 hadoop
## 14 problem solving
## 15 programming
## 16 openrefine
## 17 active learning
## 18 quantitative analysis
## 19 tensorflow
## 20 perceptiveness
## 21 machine learning
## 22 matplotlib
## 23 generating hypotheses
## 24 debugging
## 25 cloudera
## 26 judgement
## 27 modeling
## 28 perl
## 29 interpersonal skills
## 30 probability
## 31 vb
## 32 big data
## 33 predictive models
## 34 data analytics
## 35 data manipulation
## 36 data wrangling
## 37 data science tools
## 38 data tools
## 39 data mining
## 40 analytical methodologies
## 41 interpreting data
## 42 metrics
## 43 modeling data
## 44 modeling tools
## 45 data visualizations
## 46 research
## 47 risk modeling
## 48 testing hypotheses
## 49 pandas
## 50 numpy
## 51 data cleaning
## 52 command line
## 53 git
## 54 sql
## 55 api
## 56 computer science
## 57 deep learning
## 58 nlp
## 59 natural language processing
## 60 neural networks
## 61 data engineering
## 62 information systems
## 63 azure
## 64 ssis
## 65 saas
## 66 ssrs
## 67 tableau
## 68 bi
## 69 html
## 70 css
## 71 javascript
## 72 etl
## 73 designing
## 74 hive
## 75 mapreduce
## 76 pearl
## 77 qlikview
## 78 spotfire
## 79 sas
## 80 r
## 81 matplotlib
## 82 scipy
## 83 mongodb
library(rvest)
library(tidyverse)
library(wordcloud)
library(ggrepel)
library(png)
library(RCurl)
library(grid)
library(RMySQL)
Data pre-processing uses five user-defined functions to extract words from job listings, and specifically, their skills sections; aggregate and sort the extracted words; apply the whitelist to filter out irrelevant words; and apply the prior three functions to each of LinkedIn and Indeed. Each function is described in more detail below.
The extract_words function will take a link to a particular job post and the selector which contains the list of interest and return a dataframe of words in that job post’s skills section and the number of occurrences of each particular words.
extract_words <- function(link_to_job_page, selector) {
# download the html and turn it into an XML file with read_html()
job_page <- read_html(link_to_job_page)
# extract specific nodes with html_nodes() using css selector
skills_tag <- html_nodes(job_page, selector)
# extract content from nodes
skills_text <- html_text(skills_tag)
# remove punctuation
skills_text <- gsub('[[:punct:]]', '', skills_text)
# split sentences into individual words
words <- unlist(strsplit(skills_text, " "))
words <- tolower(words)
# count the number of occurrences of each word
wordcount <- table(words)
wordcount_df <- as.data.frame(wordcount)
return(wordcount_df)
}
Given a list of links to job pages, call the extract_words function to get the total word counts from the chosen selector from each each link, aggregate them and return the counts in decreasing order in a data frame.
get_word_counts <- function(links_to_jobs, selector) {
# initialize a list
counts <- list()
for (i in 1:length(links_to_jobs)) {
df <- extract_words(links_to_jobs[i], selector)
counts[[i]] <- df
}
# combine into a dataframe
skill_count <- do.call(rbind, counts)
# sum multiple occurences of the same word
total_skill_count <- skill_count %>%
group_by(words) %>%
summarize(occurrences = sum(Freq)) %>%
arrange(desc(occurrences))
return(total_skill_count)
}
A function that applies the whitelist of data science key words and skills to our count of word occurences to filter our irrelevant words.
get_DS_skills <- function(word_count) {
# apply whitelist
total_skill_count <- word_count %>% filter(words %in% wlist$Whitelist)
return(total_skill_count)
}
Many results are returned when searching for data scientist jobs, each containing a a.result-card__full-card-link element as shown below. These elements contain href attributes which contain the links to the actual job pages. We loop through the search results to extract a list of links to individual job posts.
The skills of interest are found in list form between <ul> and <li> tags and the Selector Gadget tool produces the selector .description__text--rich li which we use to extract the content contained between each <li> tag.
linkedin_top_skills <- function() {
# save the url which contains the search results
root_url <- "https://www.linkedin.com/jobs/search/?keywords=data%20scientist&start="
# for each job, extract the href attribute from each job using the css selector
# create a list of links by extracting the href attribute from the nodes
links_to_jobs <- vector()
for (x in 0:39) {
paged_url <- paste(root_url, 25*x, collapse = NULL, sep = "")
search_results <- read_html(paged_url)
temp_paths_to_jobs <- search_results %>%
html_nodes("a.result-card__full-card-link") %>%
html_attr("href")
links_to_jobs <- c(links_to_jobs, temp_paths_to_jobs)
}
# for Linkedin job posts, skills are located in <li> tags with the following selector
selector <- ".description__text--rich li"
# get word counts from the list of links to job posts
word_count <- get_word_counts(links_to_jobs, selector)
# uncomment this to see the results pre-whitelist
#print(word_count)
# get data science related skills from the above word count
skill_count <- get_DS_skills(word_count)
return(skill_count)
}
In the case of sourcing data from Indeed, the links are actually contained in the href attributes of children of <div class="title"> elements.
As before, the skills of interest are found in list form between <ul> and <li> tags.
indeed_top_skills <- function() {
# save the url which contains the search results
domain <- "https://www.indeed.com"
root_url <- "https://www.indeed.com/jobs?q=data+science&l=New+York+City%2C+NY&start="
paths_to_jobs <- vector()
# for each job, extract the href attribute from each job using the css selector
# create a list of links by extracting the href attribute from the nodes
for (x in 0:50) {
paged_url <- paste(root_url, 10*x, collapse = NULL, sep = "")
search_results <- read_html(paged_url)
temp_paths_to_jobs <- search_results %>%
html_nodes(".title") %>%
html_children() %>%
html_attr("href")
paths_to_jobs <- c(paths_to_jobs, temp_paths_to_jobs)
}
# contatenate paths with the domain name to create valid links
links_to_jobs <- str_c(domain, paths_to_jobs)
# for Indeed job posts, skills are located in <li> tags so the selector is simple
selector <- "li"
# get word counts from the list of links to job posts
word_count <- get_word_counts(links_to_jobs, selector)
# uncomment this to see the results pre-whitelist
#print(word_count)
# get data science related skills from the above word count
skill_count <- get_DS_skills(word_count)
return(skill_count)
}
Call the functions defined above and scrape the data from the two sources and write the dataframes to a .csv file.
linkedin <- linkedin_top_skills()
write.csv(linkedin, file = "linkedin.csv", row.names = FALSE, col.names = TRUE, sep = "", quote = FALSE)
indeed <- indeed_top_skills()
write.csv(indeed, file = "indeed.csv", row.names = FALSE, col.names = TRUE, sep = "", quote = FALSE)
Take observations and occurrences of skills for both job boards, normalize and store in database for future reporting
Create database and 3 tables:
1. Board_Name: holds the name of the job boards 2. Skill_Name: table to store unique set of observed skills 3. Board_Summary: table to store occurrences of skills in relation to the job boards
#Create database
dbSendQuery(con2, "DROP DATABASE IF EXISTS dsskills;")
dbSendQuery(con2, "CREATE DATABASE dsskills;")
dbSendQuery(con2, "use dsskills;")
#Create and populate a table to store job boards
BoardID <- c(1,2)
Board_Name <- c("Indeed", "LinkedIn")
JobBoard <- data.frame(BoardID, Board_Name)
dbWriteTable(con2, name="JobBoard", value = JobBoard, row.names=FALSE)
dbSendQuery(con2, "ALTER TABLE jobboard ADD PRIMARY KEY (BoardID);")
#Create and populate table to store unique set of observed skills
allwords <- rbind(droplevels.data.frame(indeed), droplevels.data.frame(linkedin))
allwords2 <- allwords %>% group_by(words) %>% summarise(Occurrences = sum(occurrences))
uniquewords <- data.frame(noquote(as.character(allwords2$words)))
colnames(uniquewords) <- c("Word")
dbWriteTable(con2, name = "Skill_Name", value = uniquewords )
dbSendQuery(con2, "ALTER TABLE Skill_Name CHANGE row_names WordID INT;")
dbSendQuery(con2, "ALTER TABLE Skill_Name ADD PRIMARY KEY (WordID);")
dbSendQuery(con2, "DELETE FROM Skill_Name WHERE WordID IS NULL;")
dbSendQuery(con2, "ALTER TABLE Skill_Name MODIFY COLUMN WordID DOUBLE NOT NULL;")
#Create and populate table to store occurrences of skills in relation to the job boards
linkedin_summary <- cbind(Board_Name = JobBoard$Board_Name[2], droplevels.data.frame(linkedin))
indeed_summary <- cbind(Board_Name = JobBoard$Board_Name[1], droplevels.data.frame(indeed))
board_summary <- rbind(indeed_summary, linkedin_summary)
board_summary_INT <- data.frame(BoardID = as.numeric(board_summary$Board_Name), SkillID = as.numeric(board_summary$words), Occurences = board_summary$occurrences)
dbWriteTable(con2, name = "Board_summary", value = board_summary_INT)
dbSendQuery(con2, "ALTER TABLE Board_Summary CHANGE row_names MatchID INT;")
dbSendQuery(con2, "ALTER TABLE Board_Summary ADD PRIMARY KEY (MatchID);")
dbSendQuery(con2, "ALTER TABLE Board_Summary ADD FOREIGN KEY (BoardID) REFERENCES JobBoard (BoardID);")
dbSendQuery(con2, "ALTER TABLE Board_Summary ADD FOREIGN KEY (SkillID) REFERENCES Skill_Name (WordID);")
dbClearResult(dbListResults(con2)[[1]])
dbGetQuery(con2, "SELECT * FROM JobBoard;")
## BoardID Board_Name
## 1 1 Indeed
## 2 2 LinkedIn
head(dbGetQuery(con2, "SELECT * FROM Skill_Name;"), 10)
## WordID Word
## 1 1 programming
## 2 2 python
## 3 3 r
## 4 4 etl
## 5 5 sql
## 6 6 git
## 7 7 mathematics
## 8 8 matplotlib
## 9 9 nlp
## 10 10 nosql
head(dbGetQuery(con2, "SELECT * FROM Board_Summary;"))
## MatchID BoardID SkillID Occurences
## 1 1 1 5 800
## 2 2 1 2 755
## 3 3 1 18 590
## 4 4 1 13 545
## 5 5 1 22 525
## 6 6 1 15 425
head(dbGetQuery(con2, "SELECT JobBoard.Board_Name, Skill_Name.Word, Board_Summary.Occurences FROM Board_Summary
JOIN Skill_Name
ON Board_Summary.SkillID = Skill_Name.WordID
JOIN JobBoard
ON JobBoard.BoardID = Board_Summary.BoardID
ORDER BY Board_Summary.Occurences DESC;")
, 10)
## Board_Name Word Occurences
## 1 LinkedIn python 879
## 2 LinkedIn modeling 862
## 3 LinkedIn statistics 842
## 4 Indeed sql 800
## 5 Indeed python 755
## 6 LinkedIn sql 709
## 7 LinkedIn research 610
## 8 Indeed modeling 590
## 9 LinkedIn r 555
## 10 LinkedIn communication 549
indeed$site <- "Indeed"
linkedin$site <- "LinkedIn"
append <- bind_rows(indeed, linkedin)
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
ggplot(data = append, mapping=aes(x=site, y=occurrences)) +
geom_boxplot() +
labs(title="Distribution of occurrences by site", x="Site", y="Occurrences") +
coord_flip() +
geom_label_repel(aes(label = ifelse(is_outlier(occurrences),as.character(words),'')),
box.padding = 0.4,
point.padding = 0.3,
segment.color = 'grey50')
library(wordcloud)
###Linkedin
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Most common words on LinkedIn")
wordcloud(words=linkedin$words, freq=linkedin$occurrences, main="Title")
###Indeed
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Most common words on Indeed")
wordcloud(words=indeed$words, freq=indeed$occurrences, main="Title")
##LinkedIn
ggplot(data = linkedin[is_outlier(linkedin$occurrences),], aes(x=reorder(words, -occurrences), y=occurrences)) +
geom_bar(stat="identity", fill="blue") +
labs(title="Top Data Science Skills per LinkedIn", x="Skills", y="Occurrences (#)") +
theme(axis.text.x=element_text(angle=45, hjust=1))
##Indeed
ggplot(data = indeed[is_outlier(indeed$occurrences),], aes(x=reorder(words, -occurrences), y=occurrences)) +
geom_bar(stat="identity", fill="red") +
labs(title="Top Data Science Skills per Indeed", x="Skills", y="Occurrences (#)") +
theme(axis.text.x=element_text(angle=45, hjust=1))
library(ggplot2)
library(ggrepel)
merge <- merge(indeed, linkedin, by="words") #Only includes common words across both sites
ggplot(data = merge, aes(x = occurrences.x, y = occurrences.y)) +
geom_point() +
labs(title="Plot of occurrences of common words", x="Occurrences on Indeed (#)", y="Occurrences on Linkedin (#)") +
geom_label_repel(aes(label = ifelse((is_outlier(occurrences.x) | is_outlier(occurrences.y)), as.character(words),'')),
box.padding = 0.35,
point.padding = 0.5,
segment.color = 'grey50')
We chose to look at LinkedIn and Indeed since they host the most job postings for data scientists and took the most common words from both sites. We then filtered out the words that relate to data science skills from a whitelist we created from articles and posts that talked about the most demanded and needed data science skills. We created a dataframe for the words and stored them in a SQL database. Based on the plots, we see that LinkedIn and Indeed both have similar words with a few differences. LinkedIn seems to say to have skills in SAS(Statistical Analytics System), Statistics, and Python. Indeed seems to say to have skills in Python, SQL, and Modeling. Both talk about having communication and research skills. In conclusion, the most valued data science skills are the programming and mathematical skills, preferably Python. SQL, and R for programming and statistics for mathematical. Also communication, research, and modeling skills are valued skills a data scientist should consider valuable skills to have.