Group 5: Jun Yan, Hector Santana, Rafal Decowski, Chad Smith, Aryeh Sturm
October 22, 2017
In this project, we attempted to answer the question, “What are the most valued data science skills?”, using data obtained from the internet through web scraping.
The analysis done here can be useful for students who aspire to work in data science and analytics fields. We hope it can be helpful for students in complementing their training plan and refining their resumes.
We identified two kinds of skills - the “hard” skills and the “soft” skills.
The hard skills are technical and domain-specific skills, such as the ability to code in a certain computer language or to perform statistical analysis.
The soft skills are personal attributes which help you engage and succeed in a team.
Hard skills are somewhat quantifiable: a person's ability to code is reflected in that person's past work, degrees, and certificates. Soft skills are harder to measure. We recognize these inherent differences between the two kinds of skills and analyzed them separately.
The project, including web scraping and analysis, was done completely in R.
We chose GitHub as the repository for storing our results.
The following R packages were used:
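Based on the function calls throughout the code, these include rvest, stringr, XML, and RCurl; ggplot2 is assumed for the frequency plots. A minimal setup chunk under that assumption:
library(rvest)    # read_html(), html_nodes(), html_text()
library(stringr)  # str_detect(), str_extract(), str_replace()
library(XML)      # getHTMLLinks()
library(RCurl)    # getURL()
library(ggplot2)  # assumed: used for the frequency plots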
We also used SelectorGadget to identify some of the fields for scraping.
To collect information about hard skills, we investigated two websites: indeed.com and cybercoders.com.
We chose these two websites because they are relatively easy to scrape for information.
Typing "data scientist" into either site's search box leads to a search result page listing the job postings. Each job posting has a small section listing the "desired experience" or "required skills" for that job. Using SelectorGadget, the relevant nodes can be identified (".experienceList" on indeed.com and ".skill-list span" on cybercoders.com).
Using rvest, they can be directly scraped from the search result page.
To automate the scraping process, we leveraged patterns in the URL structures of both websites, which allowed us to automatically generate links to all the search result pages.
On indeed.com, subsequent pages are reached by incrementing the number after "start=" in steps of ten. For example, page 2 is "https://www.indeed.com/jobs?q=data+scientist&start=10", page 3 has "start=20", and so on. Using this strategy, we scraped 2,370 pages from indeed.com, each containing about 14 to 15 job postings.
On cybercoders.com, we noticed that manipulating the "page=1" string reaches the other search result pages; for example, the second page is "page=2", and so on. We scraped 6 pages from cybercoders.com, each containing about 20 job postings.
To gauge what type of soft skills employers look for, we did some research and found this website useful: https://www.thebalance.com/list-of-soft-skills-2063770.
The webpage has a list of 146 unique soft skills which we saved as a .txt file here: https://raw.githubusercontent.com/Tyllis/Data607/master/list_of_soft_skills.txt.
We decided to use regular expressions to collect information on the soft skills. Our strategy is simple: essentially a keyword search on each relevant web page we scraped. Our program downloads the raw html codes from a web page using readLines, and functions from the stringr package then extract any words from the soft-skill list found in those codes.
We chose this strategy because, unlike a search result page, the web pages we were attempting to scrape varied significantly: each was coded in a different style with a different structure. An rvest selector that finds the information on one website usually does not work on the next. Given the extent of our current web-scraping knowledge, this was our best strategy.
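The soft-skill extraction code below relies on a softskills pattern vector. A minimal sketch of how it could be built from the .txt file, assuming the skill names are used directly as plain string patterns:
# Load the 146 soft skills and trim whitespace; str_extract() is vectorized
# over patterns, so matching one page against this vector returns one match
# (or NA) per skill
softskills <- trimws(readLines("https://raw.githubusercontent.com/Tyllis/Data607/master/list_of_soft_skills.txt"))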
We used two data sources in our endeavor:
We found a useful website compiled by Ryan Swanstrom: http://101.datascience.community/2015/07/14/awesome-data-science-colleges-list/.
He compiled a .csv file containing a list of 566 colleges which offer degrees or certificates related to data science.
What's useful for us is a column with links to each school's data science homepage.
The file is accessible here: https://github.com/ryanswanstrom/awesome-datascience-colleges/blob/master/data_science_colleges.csv.
With this list, we used the readLines function to download the html codes for each URL link, then str_extract from the stringr package to look for and extract the words on the soft-skills list from each page.
For indeed.com, the process is a bit more complicated: we needed to obtain the URL links to the actual job postings. We knew how to navigate the result pages easily by manipulating the "start=" string in the URL, but each result page contains hundreds of links, and only 14 to 15 of them are the ones we wanted, namely the links to the actual job posting pages.
To solve the problem, we used the getHTMLLinks function from the XML package, which returns all the links on a search result page. We then found a pattern in the links that lead to actual job postings: they contain either "/pagead/clk" or "/rc/clk".
We can use str_detect from the stringr package to identify and extract these links, which enables our program to visit these job-posting pages automatically and download their raw html codes with the readLines function.
We now present the R code written for this project.
We would like to warn the readers that some of the code is computationally intensive and might take a long time to load/replicate.
# Generate the URLs of all 2,370 indeed.com search result pages
indeed <- "https://www.indeed.com"
pages <- seq(from = 10, to = 23700, by = 10)
searchresult <- paste(indeed, "/jobs?q=data+scientist&start=", pages, sep = "")
# Helper: read a page and return the text of all nodes matching a CSS selector
readLinks <- function(url, lookingfor){
  htmllink <- read_html(url)
  nodes <- htmllink %>% html_nodes(lookingfor) %>% html_text()
  return(nodes)
}
indeed_hardskills <- lapply(searchresult, readLinks, lookingfor = ".experienceList")
The result is a list containing 2,370 elements, in which each element is a search result page from indeed.com. Each element contains 14 to 15 sub-elements, corresponding to the 14-15 job postings from each result page.
Each of these sub-elements then contains the desired experience (hard skills) in vectors.
Using the unique function, the 2,370 search result pages shrank to about 140 unique pages. Since each page contains 14 to 15 job links, we still ended up with 2,082 job links. We also unpacked the list in this step.
indeed_hardskills <- unlist(unique(indeed_hardskills))
write.csv(indeed_hardskills, file ="indeed_hardskills.csv", row.names=FALSE)
The results are saved and reside in our github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_hardskills.csv
Similarly, we used the same method to scrape cybercoders.com search results for “data scientist”.
Instead of lapply, we created a function that contains a for-loop. The function also packages the results into a skill-frequency data frame.
scrape_cyber_coders <- function(){
  url_start <- "https://www.cybercoders.com/search/?page="
  url_end <- "&searchterms=Data%20Scientist&searchlocation=&newsearch=true&originalsearch=true&sorttype="
  num_pages <- c(1:6)
  skills_found <- c()
  for(p in num_pages){
    # Build the URL for result page p and pull the skill tags
    url <- sprintf("%s%d%s", url_start, p, url_end)
    h <- read_html(url)
    skills <- h %>% html_nodes(".skill-list span") %>% html_text()
    skills <- skills[skills != ""]
    skills_found <- append(skills_found, skills)
  }
  # Tabulate skill frequencies and save as a data frame
  sorted_skills <- sort(table(skills_found), decreasing = FALSE)
  df <- as.data.frame(sorted_skills)
  write.csv(df, file = "cybercoders_hardskills.csv", row.names = FALSE)
}
scrape_cyber_coders()
The results are saved and reside in our github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/cybercoders_hardskills.csv
For the indeed.com soft-skills scrape, two wrapper functions based on tryCatch() were created to ignore any errors and return NULL:
getURLFun <- function(x){
  return(tryCatch(getURL(x), error = function(e) NULL))
}
getHTMLLinksFun <- function(x){
  return(tryCatch(getHTMLLinks(x), error = function(e) NULL))
}
Note that this step is very computationally intensive. The process takes up to several hours, depending on your computer hardware. The resulting object, if saved to a file on disk, will take more than 30 MB of space.
# Harvest the links from every search result page, one page at a time so a
# single failed request does not abort the whole run
rawlinks <- unlist(lapply(searchresult, function(x) getHTMLLinksFun(getURLFun(x))))
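Since this harvest takes hours, caching the resulting links to disk avoids repeating it in later sessions; a minimal sketch (the file name is our own choice):
saveRDS(rawlinks, "indeed_rawlinks.rds")      # cache the harvested links (30+ MB)
# rawlinks <- readRDS("indeed_rawlinks.rds")  # reload in a later session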
We then extracted the actual links to the job posting sites, using the two string patterns we discovered.
findidx <- str_detect(rawlinks, pattern = "/pagead/clk|/rc/clk")
goodlinks <- rawlinks[findidx] %>%
str_extract(pattern = "/rc/clk[[:graph:]]*|/pagead/clk[[:graph:]]*")
joblinks <- paste(indeed, goodlinks, sep = "")
Now we can use the readLines function to download the html codes from these links.
readLinesFun <- function(x){
  return(tryCatch(readLines(x), error = function(e) NULL))
}
We found that this step is even more computationally intensive. For this project, we chose to perform scraping on the first 500 job links:
temp <- lapply(joblinks[1:500], readLinesFun)
The result is a list containing html codes for the 500 job websites.
result <- c()
for (val in temp){
  meshcodes <- paste(val, collapse = "")                 # Collapse the html codes into one big string
  matches <- str_extract(meshcodes, softskills)          # Extract the soft-skill words from the html string
  matches <- matches[!is.na(matches)]                    # Remove NA elements (skills not found on the page)
  result <- c(result, paste(matches, collapse = ", "))   # Collapse and append to the result vector
}
indeed_softskills <- data.frame(joblinks[1:500], soft_skills = result)
write.csv(indeed_softskills, file = "indeed_softskills.csv", row.names = FALSE)
This is stored in our github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_softskills.csv
# Use the raw file link so read.csv downloads the csv itself rather than the GitHub html page
url <- "https://raw.githubusercontent.com/ryanswanstrom/awesome-datascience-colleges/master/data_science_colleges.csv"
dscolleges <- read.csv(url, stringsAsFactors = F)
url_list <- dscolleges$URL
homepages <- sapply(url_list, readLinesFun, USE.NAMES = F)
The soft skills extracted from each homepage were then placed back into the original dscolleges data frame, which first gets an empty skills column:
dscolleges$skills <- c(rep(NA, dim(dscolleges)[1]))
rowcount <- 0
for (htmlcodes in homepages){
  meshcodes <- paste(htmlcodes, collapse = "")   # Collapse the html codes into one big string
  rowcount <- rowcount + 1
  temp <- str_extract(meshcodes, softskills)     # Extract the soft-skill words
  temp <- temp[!is.na(temp)]                     # Remove NA elements
  dscolleges[rowcount, "skills"] <- paste(temp, collapse = ", ")
}
write.csv(dscolleges, file = "college_softskills.csv", row.names = FALSE)
The file is stored in the following github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/college_softskills.csv
We are now ready to analyze the 4 datasets.
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_hardskills.csv"
indeed_hs <- read.csv(url, header = T, stringsAsFactors = F)
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/cybercoders_hardskills.csv"
cybercoders_hs <- read.csv(url, header = T, stringsAsFactors = F)
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_softskills.csv"
indeed_ss <- read.csv(url, header = T, stringsAsFactors = F)
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/college_softskills.csv"
colleges_ss <- read.csv(url, header = T, stringsAsFactors = F)
Before we could analyze the datasets, they required a little tidying.
We then extracted the top 20 skills from each dataset and plotted them as frequency graphs.
The code below prepares the indeed.com data for the hard-skills analysis.
# Remove the row number from each row, split the strings, unlist to combine into one vector and remove white spaces
indeed_hs <- indeed_hs[,1] %>%
str_replace(pattern = "^[0-9]* ", replacement ="") %>%
strsplit(split = ", ") %>%
unlist() %>%
trimws()
# Remove any empty cells and remove "Data Science" element as this is a generic term and not a specific skill
indeed_hs <- indeed_hs[indeed_hs != ""]
indeed_hs <- indeed_hs[indeed_hs != 'Data Science']
# Sort by frequency and convert to a dataframe
indeed_hs <- indeed_hs %>%
table()%>%
sort(decreasing = F) %>%
as.data.frame()
# Extract only top hits
indeed_hstop20 <- tail(indeed_hs, 20)
Let's see how many total skills show up in the indeed.com job postings:
# Total count of skills in the dataset
(total_indeedhs <- sum(indeed_hs$Freq))
[1] 14670
# Number of types of skills
(num_indeedhs <- dim(indeed_hs)[1])
[1] 109
There are 14,670 skill mentions in our dataset, covering 109 unique skills.
We now exhibit these skills in a frequency plot.
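A minimal ggplot2 sketch of how such a frequency chart can be drawn (the first column is renamed because the name produced by table()/as.data.frame() can vary; the axis labels and title are ours):
names(indeed_hstop20) <- c("skill", "Freq")   # normalize column names

ggplot(indeed_hstop20, aes(x = reorder(skill, Freq), y = Freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Hard skill", y = "Frequency", title = "Top 20 hard skills in indeed.com postings")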
According to this graph, the top 5 most valuable hard skills for data scientists are:
The data from cybercoders.com was prepared beforehand and ready for analysis right away. We just needed to extract the top 20 skills:
cc_top20 <- tail(cybercoders_hs, 20) # Extract top 20 skills
(total_cchs <- sum(cybercoders_hs$Freq)) # Total count of skills in the dataset
[1] 519
(num_cchs <- dim(cybercoders_hs)[1]) # Number of types of skills
[1] 146
There are 519 skill mentions in our dataset, of which 146 are unique.
According to this graph, the top 5 most valuable hard skills for data scientists are:
This largely agrees with the dataset from indeed.com, especially the top 3 hard skills. Both datasets suggest that Python, Machine Learning, and R are the most valuable hard skills for data scientists.
We now move on to analyze the datasets for soft skills. The code below prepares the indeed.com data for the soft-skills analysis:
# Split the strings, unlist to combine into one vector and remove white spaces
indeed_ss <- indeed_ss$soft_skills %>%
strsplit(split = ", ") %>%
unlist() %>%
trimws()
# Sort by frequency and convert to a dataframe
indeed_ss <- indeed_ss %>%
table()%>%
sort(decreasing = F) %>%
as.data.frame()
# Extract only top hits
indeed_sstop20 <- tail(indeed_ss, 20)
# Total count of skills
(total_indeedss <- sum(indeed_ss$Freq))
[1] 3106
# Number of type of skills
(num_indeedss <- dim(indeed_ss)[1])
[1] 84
There are 3,106 soft-skill mentions in our dataset, of which 84 are unique.
Move to the next slide for the frequency graph -->
The indeed.com dataset suggests that the top 5 soft skills employers deem most valuable in a data scientist are:
Lastly, let's look at the data scraped from the various colleges' homepages:
# Split the strings, unlist to combine into one vector and remove white spaces
colleges_ss <- colleges_ss$skills %>%
strsplit(split = ", ") %>%
unlist() %>%
trimws()
# Sort by frequency and convert to a dataframe
colleges_ss <- colleges_ss %>%
table()%>%
sort(decreasing = F) %>%
as.data.frame()
# Extract only top hits
colleges_sstop20 <- tail(colleges_ss, 20)
# Total count of skills
(total_colss <- sum(colleges_ss$Freq))
[1] 2863
# Number of skills
(num_colss <- dim(colleges_ss)[1])
[1] 64
We observed 2,863 soft-skill mentions on the various colleges' homepages, of which 64 are unique.
Move to the next slide for the frequency graph -->
The college homepage dataset suggests that the top 5 most valuable soft skills for a data scientist are:
Again, this agrees largely with the indeed.com dataset. In fact, both datasets agree exactly on the top 6 most valuable skills. Attributes such as experience, social, organization, and communication rank consistently at the very top of what employers are looking for.
Using the aforementioned datasets, we were able to determine the current consensus on the most valued skills, both "hard" and "soft", that a data scientist should have in today's corporate environment. Through our web scraping, data aggregation, parsing, and analysis, we found that the job-site datasets and the college datasets were largely in line with each other on the most valued skills and traits. This is indicative of an industry movement toward universally accepted characteristics and abilities for the data science career path. It is, of course, subject to major industry shifts, as the field as a whole is still in its early stages of development.
With regards to soft skills, attributes such as experience, social, organization, and communication consistently rank at the very top of what employers are looking for.
With regards to hard skills, the top three most valued skills were Python, Machine Learning, and R.
Knowing this, students can aim to refine their skill sets and become the best data scientists they can be, in line with the current industry standard.