In this project we attempted to answer the question “which are the most valued data science skills”, using data obtained from the internet through web scraping. The analysis can be useful for students who aspire to work in the data science and analytics field. We hope it helps them complement their training plans and refine their resumes.
We identified two kinds of skills - the “hard” skills and the “soft” skills. The hard skills are technical and domain specific skills, such as the ability to code in a certain computer language or perform statistical analysis, etc. The soft skills are personal attributes that help you engage and succeed in a team. Hard skills are somewhat quantifiable. A person’s ability to code can be reflected by that person’s past works, degrees, and certificates. Soft skills are harder to measure. We recognize these inherent differences between the two kinds of skills, and we analyze them separately.
The project, including web scraping and analysis, was done entirely in R. We chose GitHub as the repository to store our results.
We used the following R packages: RCurl, XML, rvest, stringr, tidyr, dplyr, and ggplot2.
We also used SelectorGadget to identify some of the fields for scraping.
To collect information about hard skills, we investigated two websites: indeed.com and cybercoders.com.
We chose these two websites because they are relatively easy to scrape for information. Typing “data scientist” into either website leads to a search result page listing the job postings. For each job posting, there is a small section listing the “desired experience” or “required skills” for the job. Using SelectorGadget, we identified the nodes that hold this section: .experienceList on indeed.com and .skill-list span on cybercoders.com.
Using rvest, they can be scraped directly from the search result page.
To automate the scraping process, we also looked for patterns in the URL structures of both websites, which enabled us to generate links to all the search result pages automatically.
For indeed.com, the first result page is “https://www.indeed.com/jobs?q=data+scientist&start=”. To reach subsequent pages, one only needs to add a number after “start=”, in increments of ten. For example, Page 2 can be reached as “https://www.indeed.com/jobs?q=data+scientist&start=10”, Page 3 has “start=20”, and so on. We were able to use this strategy to scrape 2,370 pages from indeed.com, each containing about 14 to 15 job postings.
The URLs on cybercoders.com have a similar structure. The first result page is “https://www.cybercoders.com/search/?page=1&searchterms=Data%20Scientist&searchlocation=&newsearch=true&originalsearch=true&sorttype=”. We noticed that changing the “page=1” string takes us to other search result pages. For example, the 2nd page has “page=2”, and so on. We were able to scrape 6 pages from cybercoders.com, each containing about 20 job postings.
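The URL patterns described above can be wrapped in two small helper functions. This is a minimal sketch rather than the code used in the project: the function names are hypothetical, and it assumes that “start=0” returns the same first result page as the bare “start=” form.

```r
# Hypothetical helpers that map a result page number to its URL.

# indeed.com: page 1 -> start=0 (assumed equivalent to the bare "start="),
# page 2 -> start=10, page 3 -> start=20, and so on.
indeed_page_url <- function(page) {
  sprintf("https://www.indeed.com/jobs?q=data+scientist&start=%d",
          as.integer((page - 1) * 10))
}

# cybercoders.com: the page number is substituted directly into "page=".
# "%%" is a literal percent sign in sprintf, so "%%20" yields "%20".
cybercoders_page_url <- function(page) {
  sprintf(paste0("https://www.cybercoders.com/search/?page=%d",
                 "&searchterms=Data%%20Scientist&searchlocation=",
                 "&newsearch=true&originalsearch=true&sorttype="),
          as.integer(page))
}

indeed_urls <- indeed_page_url(1:3)      # first three indeed.com result pages
cyber_urls  <- cybercoders_page_url(1:6) # all six cybercoders.com result pages
```

Both helpers are vectorized, so a whole range of page numbers can be converted in one call.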
To gauge which soft skills employers are looking for, we did some research and found this website useful: https://www.thebalance.com/list-of-soft-skills-2063770. In particular, the webpage has a list of 146 soft skills. We copied them into a .txt file, which can be reached here: https://raw.githubusercontent.com/Tyllis/Data607/master/list_of_soft_skills.txt.
We decided to use regular expressions to collect information on the soft skills. Our strategy is simple: it is basically a keyword search on each relevant web page we scraped. Our program downloaded the raw html codes from the web page using readLines. Then functions from the stringr package were used to extract the words on the soft skill list from the html codes.
We decided to use this strategy because, unlike the search result pages, the web pages we were attempting to scrape vary widely. Each web page was coded in a different style with a different structure. You may be able to find the information on one website using rvest, but the same selector usually would not work on the next website. Given the extent of knowledge we currently have on web scraping, this is our best strategy.
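The keyword search can be illustrated on a toy page. This is only a sketch under stated assumptions: the three keywords and the html lines below are made up for illustration, standing in for the 146-item list and for what readLines would return for a real page.

```r
library(stringr)

# Hypothetical stand-ins for the real keyword list and a downloaded page
softskills <- c("communication", "leadership", "teamwork")
html_lines <- c("<html><body>",
                "<p>Strong communication and teamwork required.</p>",
                "</body></html>")

# Collapse the lines into one string, as readLines returns one element per line
page <- paste(html_lines, collapse = "")

# str_extract is vectorized over the pattern: one result per keyword,
# NA where the keyword does not occur in the page
found <- str_extract(page, softskills)
found <- found[!is.na(found)]

skills_on_page <- paste(found, collapse = ", ")
```

Each keyword is counted at most once per page, so the later frequency counts measure how many pages mention a skill rather than how often it is repeated.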
We used two data sources in our endeavor.
We found a useful website compiled by Ryan Swanstrom: http://101.datascience.community/2015/07/14/awesome-data-science-colleges-list/. The .csv file he compiled contains a list of 566 colleges that offer degrees or certificates related to data science. What’s useful for us is a column listing URL links to each school’s data science homepage. It can be reached here: https://github.com/ryanswanstrom/awesome-datascience-colleges/blob/master/data_science_colleges.csv.
With this list, we can use the readLines function to download the html codes for each URL link. Then we used str_extract from the stringr package to look for and extract the words on the soft skills list for each URL Link.
For indeed.com, it is a bit more complicated. The complication was that we needed to obtain the URL links to the actual job posting websites. We knew how to navigate the result pages easily, by manipulating the “start=” string in the URL. But each result page contains hundreds of links, and only 14 to 15 of these links are what we wanted: the ones that lead to the actual job posting websites.
To solve the problem, we used the getHTMLLinks function from the XML package. This function returns all the links in the search result page. Then, we found a pattern in the links that lead to actual job posting sites. They are links that contain one of the following strings: “/pagead/clk” or “/rc/clk”.
We can use str_detect from the stringr package to identify and extract these links, which enables our program to go to these job-posting websites automatically and use the readLines function to download the raw html codes.
We now present the R code written for this project. We would like to warn readers that some of this code is computationally intensive and might take a long time to run.
library(RCurl)
library(XML)
library(rvest)
library(stringr)
library(tidyr)
library(dplyr)
library(ggplot2)
First, we created a vector containing 2,370 URL links, each of which is a search result page.
indeed <- "https://www.indeed.com"
pages <- seq(from = 10, to = 23700, by = 10)
searchresult <- paste(indeed, "/jobs?q=data+scientist&start=", pages, sep = "")
The code below wraps the rvest functions into one function.
readLinks <- function(url, lookingfor){
  htmllink <- read_html(url)
  nodes <- htmllink %>% html_nodes(lookingfor) %>% html_text()
  return(nodes)
}
The code below executes the rvest functions on every search result page, using lapply.
indeed_hardskills <- lapply(searchresult, readLinks, lookingfor = ".experienceList")
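To see what the selector returns without hitting the live site, the same rvest calls can be run on an inline html string. This is a sketch on made-up markup: only the .experienceList class name comes from the project, and the real indeed.com pages are assumed to be far larger and differently nested.

```r
library(rvest)

# Hypothetical markup imitating three job cards on a search result page
html <- paste0(
  '<html><body>',
  '<div class="job"><span class="experienceList">Python, R, SQL</span></div>',
  '<div class="job"><span class="experienceList">Hadoop, Spark</span></div>',
  '<div class="job"><span class="summary">No skills listed</span></div>',
  '</body></html>'
)

# read_html also accepts a literal html string, not only a URL
page <- read_html(html)

# One character string per matching node; postings without the node are skipped
skills <- page %>% html_nodes(".experienceList") %>% html_text()
```

Because postings without an experience section contribute nothing, the number of strings returned per page can be smaller than the number of postings on it.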
The result is a list containing 2,370 elements, in which each element is a search result page from indeed.com. Each element contains 14 to 15 sub-elements, corresponding to the 14 to 15 job postings on each result page. Each of these sub-elements then contains the desired experience, or hard skills, as a vector.
We checked for uniqueness. It turns out that a lot of the result pages are duplicates. Using the unique function, the 2,370 search result pages shrink down to about 140 pages. Since each page contains 14 to 15 job links, we still ended up with 2,082 job links. We also unpacked the list in this step.
indeed_hardskills <- unlist(unique(indeed_hardskills))
write.csv(indeed_hardskills, file ="indeed_hardskills.csv", row.names=FALSE)
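The effect of unique followed by unlist can be checked on a toy list. The page contents below are invented for illustration; the point is only that unique compares whole list elements, so a page is dropped only when every posting on it matches an earlier page.

```r
# Hypothetical scrape result: three "pages", the third repeating the first
pages <- list(
  c("Python, R, SQL", "Hadoop, Spark"),
  c("Java, Scala"),
  c("Python, R, SQL", "Hadoop, Spark")
)

# unique() on a list removes elements that are identical as a whole
deduped <- unique(pages)

# unlist() flattens the remaining pages into one vector, one entry per posting
postings <- unlist(deduped)

n_pages_before <- length(pages)
n_pages_after  <- length(deduped)
```

Identical postings that appear on otherwise different pages survive this step, since deduplication here works at the page level.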
The result is a dataset with 2,082 rows for the 2,082 data scientist jobs found on indeed.com and one column containing the experience or skills required for the jobs. The dataset requires some tidying, which we leave for the Analysis section below. The data was saved and resides in this GitHub repository: https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_hardskills.csv
Similarly, we used the same method to scrape the cybercoders.com search results for data scientist. Instead of lapply, we created a function that contains a for-loop. The function also packages the results into a skill-frequency data frame.
scrape_cyber_coders <- function(){
  url_start <- "https://www.cybercoders.com/search/?page="
  url_end <- "&searchterms=Data%20Scientist&searchlocation=&newsearch=true&originalsearch=true&sorttype="
  num_pages <- 1:6
  vector <- character(0) # accumulates the skills found on every page
  for(p in num_pages){
    # Build the URL of result page p and parse it
    url <- sprintf("%s%d%s", url_start, p, url_end)
    h <- read_html(url)
    skills <- h %>% html_nodes(".skill-list span") %>% html_text()
    skills <- skills[skills != ""] # drop empty nodes
    vector <- append(vector, skills)
  }
  # Count each skill and sort by ascending frequency
  sorted_skills <- sort(table(vector), decreasing = FALSE)
  df <- as.data.frame(sorted_skills)
  write.csv(df, file = "cybercoders.csv", row.names = FALSE)
}
scrape_cyber_coders()
The above function also took care of the data tidying task as well as the aggregation. The result is a dataset with 146 rows corresponding to the 146 hard skills that appeared on the website. There are two columns. One column contains the hard skills, and the other contains how many times each skill is requested on the website, i.e. its frequency. The file was saved and resides in this GitHub repository: https://raw.githubusercontent.com/Tyllis/Data607/master/cybercoders_hardskills.csv
Two tryCatch wrappers were created to ignore any error and return NULL when one occurs.
getURLFun <- function(x){
  return(tryCatch(getURL(x), error = function(e) NULL))
}
getHTMLLinksFun <- function(x){
  return(tryCatch(getHTMLLinks(x), error = function(e) NULL))
}
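The behaviour of such a wrapper can be checked without any network access. In this sketch the risky function is a made-up stand-in for a download that sometimes fails; the wrapper follows the same tryCatch pattern as the two functions above.

```r
# Hypothetical stand-in for a call that can fail, such as a download
risky <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}

# Same pattern as above: swallow the error and return NULL instead
safe_risky <- function(x) {
  tryCatch(risky(x), error = function(e) NULL)
}

# lapply keeps going past the failing input; its slot holds NULL
results <- lapply(c(4, -1, 9), safe_risky)

# Flag which calls succeeded so failures can be counted or retried
succeeded <- !vapply(results, is.null, logical(1))
```

Returning NULL keeps a long scraping run alive, at the cost of hiding why a page failed; recording the failed inputs, as the last line does, makes it possible to revisit them later.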
The code below attempts to extract all URL links in each of the 2,370 pages. Note that this step is very computationally intensive. The process will take several hours, depending on your computer hardware. The resulting object, if saved to a file on disk, will take more than 30 MB of space.
rawlinks <- getHTMLLinksFun(getURLFun(searchresult))
We then extracted the actual links to the job posting sites, using the two string patterns we discovered. First we identified the index, then we used str_extract to extract the links.
findidx <- str_detect(rawlinks, pattern = "/pagead/clk|/rc/clk")
goodlinks <- rawlinks[findidx] %>%
str_extract(pattern = "/rc/clk[[:graph:]]*|/pagead/clk[[:graph:]]*")
We then built the full URLs using the paste function.
joblinks <- paste(indeed, goodlinks, sep = "")
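On a handful of made-up links, the detect, extract, and paste steps behave as follows. The link values are hypothetical; only the two patterns and the site prefix come from the code above.

```r
library(stringr)

indeed <- "https://www.indeed.com"

# Hypothetical links of the kind a search result page might contain
rawlinks <- c("/rc/clk?jk=abc123&fccid=1",
              "/about",
              "/pagead/clk?mo=r&ad=xyz",
              "https://www.indeed.com/hire")

# Keep only the links that contain one of the two job-posting patterns
findidx <- str_detect(rawlinks, pattern = "/pagead/clk|/rc/clk")

# Extract the path from the pattern up to the next whitespace character
goodlinks <- str_extract(rawlinks[findidx],
                         pattern = "/rc/clk[[:graph:]]*|/pagead/clk[[:graph:]]*")

# Prefix the site address to obtain complete URLs
joblinks <- paste(indeed, goodlinks, sep = "")
```

Navigation links such as "/about" are dropped by the first step, so only the two job-posting links reach the paste call.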
Now we can use readLines function to download the html codes from these links.
First we created a tryCatch wrapper to ignore any error encountered and return NULL when one occurs.
readLinesFun <- function(x){
  return(tryCatch(readLines(x), error = function(e) NULL))
}
We used lapply to apply the readLines function to each job link. We found that this step is even more computationally intensive. For this project, we chose to scrape only the first 500 job links.
temp <- lapply(joblinks[1:500], readLinesFun)
The result is a list containing html codes for the 500 job websites.
Below, we used a for-loop to extract the soft skills words from the html codes.
# `softskills` is assumed to be a character vector holding the keywords from the soft skills list described above
result <- c()
for (val in temp){
  meshcodes <- paste(val, collapse = "") # Collapse the html codes into one big string
  found <- str_extract(meshcodes, softskills) # Extract the soft skills words from the html string
  found <- found[!is.na(found)] # Remove NA elements (keywords not present on the page)
  result <- c(result, paste(found, collapse = ", ")) # Collapse and append to the result vector
}
Lastly, we created a data.frame object to hold the results for output file.
indeed_softskills <- data.frame(joblinks[1:500], soft_skills = result)
write.csv(indeed_softskills, file = "indeed_softskills.csv", row.names = FALSE)
The result is a data.frame object with 500 rows for the 500 job links we scraped. It has two columns, one containing the URLs of the job links, and the other containing the soft skills. The dataset requires a few more tidying steps, which we will do in the Analysis section below. The data is stored in this GitHub repository: https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_softskills.csv
First, we loaded the list of colleges that offer data science degrees.
# Use the raw file URL; the github.com "blob" page returns html rather than the csv itself
url <- "https://raw.githubusercontent.com/ryanswanstrom/awesome-datascience-colleges/master/data_science_colleges.csv"
dscolleges <- read.csv(url, stringsAsFactors = FALSE)
url_list <- dscolleges$URL
We again used the readLinesFun function to download the html codes for each URL link.
homepages <- sapply(url_list, readLinesFun, USE.NAMES = F)
Below, we used a for-loop similar to the indeed.com scheme above to extract the words matching the soft skills list. This time, we placed the result in a new column called skills in the original dscolleges data frame.
dscolleges$skills <- rep(NA, nrow(dscolleges))
rowcount <- 0
for (htmlcodes in homepages){
  meshcodes <- paste(htmlcodes, collapse = "") # Collapse the html codes into one big string
  rowcount <- rowcount + 1
  temp <- str_extract(meshcodes, softskills) # One result per keyword, NA if absent
  temp <- temp[!is.na(temp)]
  dscolleges[rowcount, "skills"] <- paste(temp, collapse = ", ") # Write into the matching college row
}
write.csv(dscolleges, file = "college_softskills.csv", row.names = FALSE)
The resulting data.frame object was just the original college table with one additional column - the “skills” column, which contains the soft skills found on those websites. The data file is stored in the following github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/college_softskills.csv
We are now ready to analyze the 4 datasets. First we read the scraping results into data.frame objects.
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_hardskills.csv"
indeed_hs <- read.csv(url, header = T, stringsAsFactors = F)
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/cybercoders_hardskills.csv"
cybercoders_hs <- read.csv(url, header = T, stringsAsFactors = F)
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_softskills.csv"
indeed_ss <- read.csv(url, header = T, stringsAsFactors = F)
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/college_softskills.csv"
colleges_ss <- read.csv(url, header = T, stringsAsFactors = F)
Before we could analyze the datasets, they required a little tidying. We then extracted the top 20 skills of each dataset, and plotted them on frequency graphs.
The code below prepares the indeed.com data for the hard skills analysis.
# Remove the row number from each row, split the strings, unlist to combine into one vector and remove white spaces
indeed_hs <- indeed_hs[,1] %>%
str_replace(pattern = "^[0-9]* ", replacement ="") %>%
strsplit(split = ", ") %>%
unlist() %>%
trimws()
# Remove any empty cells and remove "Data Science" element as this is a generic term and not a specific skill
indeed_hs <- indeed_hs[indeed_hs != ""]
indeed_hs <- indeed_hs[indeed_hs != 'Data Science']
# Sort by frequency and convert to a dataframe
indeed_hs <- indeed_hs %>%
table()%>%
sort(decreasing = F) %>%
as.data.frame()
# Extract only top hits
indeed_hstop20 <- tail(indeed_hs, 20)
Let’s see how many total skills show up in the indeed.com job postings:
# Total count of skills in the dataset
(total_indeedhs <- sum(indeed_hs$Freq))
## [1] 14670
# Number of types of skills
(num_indeedhs <- dim(indeed_hs)[1])
## [1] 109
So, there are 14670 skills appearing in our dataset, of which there are 109 unique types.
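The tidying and counting steps can be traced on three made-up rows. The rows below are hypothetical and only imitate the shape of the scraped column, a row number followed by comma-separated skills; base R functions are used here so the sketch runs without extra packages.

```r
# Hypothetical rows in the same shape as the scraped indeed.com column
raw <- c("1 Python, R, SQL",
         "2 Python, Machine Learning",
         "3 R, Python")

# Drop the leading row number, split on commas, flatten, and trim spaces
skills <- trimws(unlist(strsplit(sub("^[0-9]* ", "", raw), split = ", ")))

# Count each skill and sort by ascending frequency
freq <- as.data.frame(sort(table(skills), decreasing = FALSE),
                      stringsAsFactors = FALSE)

total_skills  <- sum(freq$Freq) # every skill mention
unique_skills <- nrow(freq)     # distinct skills
top_skill     <- tail(freq, 1)  # last row holds the most frequent skill
```

Because the table is sorted in ascending order, tail rather than head returns the most frequent skills, which is why the analysis uses tail(..., 20) for the top 20.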
We now plot them in a frequency plot.
names(indeed_hstop20)[1] <- "skills"
ggplot(indeed_hstop20,aes(x = skills,Freq,fill=Freq))+
geom_bar(position = 'dodge',stat = "identity") +
coord_flip() +
labs(title = "Frequency of Skills Appearing in Job Postings on Indeed.com",y="Frequency",x = "Hard Skills")
According to this graph, the top 5 most valuable hard skills for data scientists are:
The data from cybercoders.com was prepared beforehand and ready for analysis right away. We just needed to extract the top 20 skills.
# Extract top 20 skills
cc_top20 <- tail(cybercoders_hs, 20)
# Total count of skills in the dataset
(total_cchs <- sum(cybercoders_hs$Freq))
## [1] 519
# Number of types of skills
(num_cchs <- dim(cybercoders_hs)[1])
## [1] 146
There are 519 skills appearing in our dataset, of which 146 are unique.
The code below plots the frequency graph.
names(cc_top20)[1] <- "skills"
ggplot(cc_top20,aes(x = reorder(skills, Freq),Freq,fill=Freq))+
geom_bar(position = 'dodge',stat = "identity") +
coord_flip() +
labs(title = "Frequency of Skills Appearing in Job Postings on Cybercoders.com",y="Frequency",x = "Hard Skills")
According to this graph, the top 5 most valuable hard skills for data scientists are:
This largely agrees with the dataset from indeed.com, especially the top 3 hard skills. Both datasets suggest that Python, Machine Learning, and R are the most valuable hard skills for data scientists.
We now move on to analyze the datasets for soft skills. The code below prepares the indeed.com data for the soft skills analysis.
# Split the strings, unlist to combine into one vector and remove white spaces
indeed_ss <- indeed_ss$soft_skills %>%
strsplit(split = ", ") %>%
unlist() %>%
trimws()
# Sort by frequency and convert to a dataframe
indeed_ss <- indeed_ss %>%
table()%>%
sort(decreasing = F) %>%
as.data.frame()
# Extract only top hits
indeed_sstop20 <- tail(indeed_ss, 20)
# Total count of skills
(total_indeedss <- sum(indeed_ss$Freq))
## [1] 3106
# Number of type of skills
(num_indeedss <- dim(indeed_ss)[1])
## [1] 84
So, there are 3106 soft skills appearing in our dataset, of which there are 84 unique types.
Below is the frequency graph.
names(indeed_sstop20)[1] <- "skills"
ggplot(indeed_sstop20,aes(x = skills,Freq,fill=Freq))+
geom_bar(position = 'dodge',stat = "identity") +
coord_flip() +
labs(title = "Frequency of Skills Appearing in Job Postings on Indeed.com",y="Frequency",x = "Soft Skills")
The indeed.com dataset suggests that the top 5 soft skills employers deem most valuable in a data scientist are:
Lastly, let’s look at the data scraped from the various colleges’ homepages.
# Split the strings, unlist to combine into one vector and remove white spaces
colleges_ss <- colleges_ss$skills %>%
strsplit(split = ", ") %>%
unlist() %>%
trimws()
# Sort by frequency and convert to a dataframe
colleges_ss <- colleges_ss %>%
table()%>%
sort(decreasing = F) %>%
as.data.frame()
# Extract only top hits
colleges_sstop20 <- tail(colleges_ss, 20)
# Total count of skills
(total_colss <- sum(colleges_ss$Freq))
## [1] 2863
# Number of skills
(num_colss <- dim(colleges_ss)[1])
## [1] 64
So we observed 2863 soft skills appearing in the various colleges’ homepages, of which 64 are unique.
names(colleges_sstop20)[1] <- "skills"
ggplot(colleges_sstop20,aes(x = skills,Freq,fill=Freq))+
geom_bar(position = 'dodge',stat = "identity") +
coord_flip() +
labs(title = "Frequency of Skills Appearing in College Webpages",y="Frequency",x = "Soft Skills")
The college homepages dataset suggests that the top 5 most valuable soft skills for data scientists are:
Again, this agrees largely with the indeed.com dataset. In fact, both datasets agree exactly on the top 6 most valuable skills. Attributes such as experience, social, organization, and communication rank consistently at the very top of what employers are looking for.
This is particularly interesting because, as explained in the Methodology section above, the strategy we deployed for the soft skills is a simple keyword search in the raw html codes downloaded from the websites. This brute-force strategy is simple but it has a flaw: the word search does not take context into consideration. A word may appear on a website without describing a soft skill for data scientists. Our hope was that such occurrences are rare, and that by processing a large amount of data we would still be able to find something of interest. The results above show clearly that there is indeed a ranking order for these soft skills in terms of value. It could mean one of the following two things:
We tend to believe the former because there is no evidence for the latter. Going forward, it may be possible to devise a hypothesis test for the latter conclusion; or better yet, to devise a more sophisticated technique to extract those soft skills from the websites (for example, a machine learning program that takes context into account when extracting words). Considering the extent of the knowledge we currently have in web scraping, the above results should suffice for the scope of the project.
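A much smaller step than a machine learning model would be to strip the html tags and match whole words only. The sketch below is not the method used in this project; it is an assumption-laden illustration on a made-up page, showing how a plain substring search over raw html can pick up words from the markup itself.

```r
# Hypothetical page: two keywords occur only inside the markup
html <- paste0('<div class="social-share"><a href="/organization">About us</a></div>',
               '<p>Excellent Communication skills; experienced mentors.</p>')
keywords <- c("social", "organization", "communication", "experience")

# Naive search: case-sensitive substring match on the raw html
naive_hit <- vapply(keywords, function(k) grepl(k, html, fixed = TRUE), logical(1))
naive <- keywords[naive_hit]

# Stricter search: remove tags first, then match whole words, ignoring case
text <- gsub("<[^>]+>", " ", html)
strict_hit <- vapply(keywords, function(k) {
  grepl(paste0("\\b", k, "\\b"), text, ignore.case = TRUE, perl = TRUE)
}, logical(1))
strict <- keywords[strict_hit]
```

On this toy page the naive search reports "social" and "organization" from a class name and a link target, and "experience" from inside "experienced", while missing the capitalized "Communication". The stricter search keeps only "communication". Whole-word matching still ignores meaning, so it narrows the flaw discussed above rather than removing it.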
Using the aforementioned datasets, we were able to determine the current, shared consensus on the most valued skills, both “hard” and “soft”, that a data scientist should have in today’s corporate environment. By breaking out the demand (refer to our analysis of each dataset) with our web scraping, data aggregation, parsing, and analysis techniques, we were able to see that the job site datasets and the college dataset were relatively in line with each other on the most valued skills and traits. This is indicative of an industry movement towards universally accepted characteristics and abilities for the data science career path. This of course can be subject to major industry shifts, as the field as a whole is still in the early stages of development.
With regards to soft skills, attributes such as experience, social, organization, and communication consistently rank at the very top of what employers are looking for.
With regards to hard skills, the top three most valued skills were Python, Machine Learning, and R.
Knowing this information, students can now aim to refine their skill sets to be the best data scientists that they can be per the desired industry standard.