DATA 607 Project 3

Introduction

In this project we attempted to answer the question “which are the most valued data science skills”, using data obtained from the internet through web scrapping. The analysis done here can be useful for students who are aspired to work in the data science and analytic field. We wish it it can be helpful for students in complementing their training plan, and refining their resumes.

We identified two kinds of skills - the “hard” skills and the “soft” skills. The hard skills are technical and domain specific skills, such as the ability to code in a certain computer language or perform statistical analysis, etc. The soft skills are personal attributes that help you engage and succeed in a team. Hard skills are somewhat quantifiable. A person’s ability to code can be reflected by that person’s past works, degrees, and certificates. Soft skills are harder to measure. We recognize these inherent differences between the two kinds of skills, and we analyze them separately.

Methodology

The project, including web scraping and analysis, was done completely in R. We also chose github to be our depository to store our results.

We used the following R packages:

RCurl
XML
rvest
stringr
tidyr
dplyr
ggplot2

We also used SelectorGadget to identify some of the fields for scrapping.

Hard Skills

To collect information about hard skills, we investigated the following two websites:

indeed.com
cybercoders.com

We chose these two websites because they are relatively easy to scrap for information. Typing in “data scientist” in either websites, you will be led to a search result webpage listing the job postings. For each job posting, there is a small section listing the “desired experience” or “required skills” for the job. Using SelectorGadget, the nodes can be identified:

For indeed.com, the node for “desire experience” is “.experienceList”
For cybercoders.com, the node for “required skills” is “.skill-list span”.

Using rvest, they can be directly scrapped from the search result page.

To automate the scrapping process, we also discovered patterns in the URL structures for both websites, which enable us to generate web links automatically that can take us to all the search result pages.

For indeed.com, the first result page is “https://www.indeed.com/jobs?q=data+scientist&start=”. To reach subsequent pages, we found that one only needs to add numbers after “start=”, in increments of ten. For example, Page 2 can be reached as “https://www.indeed.com/jobs?q=data+scientist&start=10”, and Page 3 would have “start=20”, etc. We were able to use this strategy to scrap 2,370 pages from indeed.com, each page contains about 14 to 15 job postings.

The cybercoders.com has similar structure for URL. The first result page is “https://www.cybercoders.com/search/?page=1&searchterms=Data%20Scientist&searchlocation=&newsearch=true&originalsearch=true&sorttype=”. We noticed that when the “page=1” string is manipulated, we can reach other search result pages. For example the 2nd page will be “page=2”, and so on. We were able to scrap 6 pages from cybercoders.com, each page contains about 20 job postings.

Soft Skills

To gauge what are the soft skills that employers are looking for, we did some research and found this website useful: https://www.thebalance.com/list-of-soft-skills-2063770. In particular, the webpage has a list of 146 soft skills. We copied them and pasted into a .txt file and it can be reached here: https://raw.githubusercontent.com/Tyllis/Data607/master/list_of_soft_skills.txt.

We decided to use regular expression to collect information on the soft skills. Our strategy is simple. It is basically a keyword search on each relevant web page we scrapped. Our program downloaded the raw html codes from the web page using readLine. Then the functions from the stringr package were used to extract the words on the soft skill list from the html codes.

We decided to use this strategy because, unlike a search result page, the web pages we were attempting to scrap vary largely. Each web page was coded to different styles with different structures. You may be able to find the information on one website using rvest, but it usually would not work on the next website. Given the extent of knowledge we currently have on web scrapping, this is our best strategy.

We used two data sources in our endeavor.

colleges that offer data scientist related degrees or certificates
indeed.com job postings for data scientists

We found a useful website compiled by Ryan Swanstrom: http://101.datascience.community/2015/07/14/awesome-data-science-colleges-list/. The .csv file he complied contains a list of 566 colleges that offers degrees or certificates related to data science. What’s useful for us is a column listing URL links to each school’s data science homepage. It can be reached here: https://github.com/ryanswanstrom/awesome-datascience-colleges/blob/master/data_science_colleges.csv.

With this list, we can use the readLines function to download the html codes for each URL link. Then we used str_extract from the stringr package to look for and extract the words on the soft skills list for each URL Link.

For indeed.com, it is a bit more comoplicated. The complication was that we needed to obtain the URL links to the actual job posting website. We knew how to navigate the result pages easily, by manipulating the “start=” string in the URL. But each result page contains hundreds of links, and only 14 to 15 of these links are what we wanted - ones that link to the actual job posting websites.

To solve the problem, we used getHTMLLinks function from the XML package. This function returns all the links in the search result page. Then, we found a pattern in the links that lead to actual job posting sites. They are links that containing one of the following strings:

/pagead/clk
/rc/clk

We can use str_detect from the stringr package to identify and extract these links, which enable our program to automatically go to these job-posting websites, and use readLines function to download the raw html codes.

Web Scraping Codes

We now present the R codes written of this project. We would like to warn the readers that some of these codes are computationally intensive. It might take a long time to load.

library(RCurl)
library(XML)
library(rvest)
library(stringr)
library(tidyr)
library(dplyr)
library(ggplot2)

Scraping Indeed.com for Hard Skills

First, we created a vector containing 2,370 URL links, each is a search result page.

indeed <- "https://www.indeed.com"
pages <- seq(from = 10, to = 23700, by = 10)
searchresult <- paste(indeed, "/jobs?q=data+scientist&start=", pages, sep = "")

Below codes wrapped the rvest functions into one function.

readLinks <- function(url, lookingfor){
            htmllink <- read_html(url)
            nodes <- htmllink %>% html_nodes(lookingfor) %>% html_text() 
            return(nodes)
            }

Below code execute the rvest functions, using lapply.

indeed_hardskills <- lapply(searchresult, readLinks, lookingfor = ".experienceList")

The result is a list containing 2,370 elements, in which each element is a search result page from indeed.com. Each element contains 14 to 15 sub-elements, corresponding to 14 to 15 job postings on each result page. Each of these sub-elements then contains the desire experience, or hard skills, in vectors.

We checked for uniqueness. It turns out that a lot the result pages are duplicates. Using the unique function, the 2,370 search result pages shrink down to about 140 pages. Since each page contains 14 to 15 job links, we still ended up with 2,082 job links. We also unpacked the list in this step.

indeed_hardskills <- unlist(unique(indeed_hardskills))
write.csv(indeed_hardskills, file ="indeed_hardskills.csv", row.names=FALSE)

The result is a dataset with 2,082 rows for the 2,082 data scientist jobs found on indeed.com and one column containing the experience or skills required for the jobs. The dataset requires some tidying. We will save that step in the Analysis section below. The data was saved and resides in this github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_hardskills.csv

Scraping Cybercoders.com

Similarly, we used the same method to scrap cybercoders.com search result for data scientist. Instead of lapply, we created a function that contains a for-loop. The function also packages the results into a skill-frequency data frame.

scrape_cyber_coders <- function(){
  url_start <- "https://www.cybercoders.com/search/?page="
  url_end <- "&searchterms=Data%20Scientist&searchlocation=&newsearch=true&originalsearch=true&sorttype="

  num_pages <- c(1:6)
  vector <- rep()
  
  for(p in num_pages){
      url <- sprintf("%s%d%s",url_start, p, url_end)
      h = read_html(url)
      skills <- h %>% html_nodes(".skill-list span") %>% html_text()
      skills <- skills[skills != ""]
      vector <- append(vector, skills)
  }
  sorted_skills <- sort(table(vector), decreasing = FALSE)
  df <- as.data.frame(sorted_skills) 
  write.csv(df,file ="cybercoders.csv", row.names=FALSE)
}
Scrape_cyber_coders()

The above function also took care of the data tidying task as well as aggregation. The result is a dataset with 146 rows corresponding to 146 hard skills appeared on the websites. There are two columns. One column contains the hard skills, and the other column contains how many times the skills are needed in the websites, i.e. frequency. The file was saved and resides in this github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/cybercoders_hardskills.csv

Scrapping Indeed.com for Soft Skills

Two tryCatch functions were created to ignore any errors and return NULL if it does.

getURLFun <- function(x){
  return(tryCatch(getURL(x), error = function(e) NULL))
}
getHTMLLinksFun <- function(x){
  return(tryCatch(getHTMLLinks(x), error = function(e) NULL))
}

Below code will attempt to extract all URL links in each of the 2,370 pages. Note that this step is very computationally intensive. The process will take several hours, depending on your computer hardware. The resulting object, if saved to a file on disk, will take more than 30 MB of space.

rawlinks <- getHTMLLinksFun(getURLFun(searchresult))

We then extracted the actual links to the job posting sites, using the two string patterns we discovered. First we identified the index, then we used str_extract to extract the links.

findidx <- str_detect(rawlinks, pattern = "/pagead/clk|/rc/clk")
goodlinks <- rawlinks[findidx] %>%
            str_extract(pattern = "/rc/clk[[:graph:]]*|/pagead/clk[[:graph:]]*")

We then concatenate the whole URL using paste function.

joblinks <- paste(indeed, goodlinks, sep = "")

Now we can use readLines function to download the html codes from these links.

First we created a tryCatch function to ignore any errors encounter and return a null if it does.

readLinesFun <- function(x){ 
      return(tryCatch(readLines(x), error = function(e) NULL)) 
}

We used lapply to apply readLines function on each job links. We found that this step is even more computationally intensive. For this project, we chose to perform scrapping on the first 500 job links.

temp <- lapply(joblinks[1:500], readLinesFun)

The result is a list containing html codes for the 500 job websites.

Below, we used a for-loop to extract the soft skills words from the html codes.

result <- c()
for (val in temp){
  meshcodes <- paste(val, collapse = "")     # Collapse the html codes into one big string
  temp <- str_extract(meshcodes, softskills) # extract the soft skills words from the html string
  temp <- temp[!is.na(temp)]                          # remove "na" elements
  result <- c(result, paste(temp, collapse = ", "))   # Collapse and append to the result vector
}

Lastly, we created a data.frame object to hold the results for output file.

indeed_softskills <- data.frame(joblinks[1:500], soft_skills = result)
write.csv(indeed_softskills, file ="indeed_softskills", row.names=FALSE)

The result is a data.frame object with 500 rows for the 500 job link we scrapped. It has two columns, one containing the URLs of the job links, and the other column containing the hard skills. The dataset requires few more tidying steps and we will do that in the Analysis section below. The data is stored in this github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_softskills.csv

Scrapping College Homepages

First, we loaded the list of colleges that offers data science degrees.

url <- "https://github.com/ryanswanstrom/awesome-datascience-colleges/blob/master/data_science_colleges.csv"
dscolleges <- read.csv(url, stringsAsFactor = F)
url_list <- dscolleges$URL

We again used the readLinesFun function to download all the html codes for each URL links.

homepages <- sapply(url_list, readLinesFun, USE.NAMES = F)

Below, we used a for-loop similar to the indeed.com scheme above to extract the words matching the soft skills list. This time, we placed the result in a new column called skills and placed back into the original dscolleges data frame.

dscolleges$skills <- c(rep(NA, dim(dscolleges)[1]))
rowcount = 0
for (htmlcodes in homepages){
  meshcodes <- paste(htmlcodes, collapse = "") 
    rowcount <- rowcount + 1
    temp <- str_extract(meshcodes, softskills)
    temp <- temp[!is.na(temp)]
    raw[rowcount, "skills"] <- paste(temp, collapse = ", ")
}
write.csv(dscolleges, file ="college_softskills", row.names=FALSE)

The resulting data.frame object was just the original college table with one additional column - the “skills” column, which contains the soft skills found on those websites. The data file is stored in the following github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/college_softskills.csv

Analysis of Results

We are now ready to analyze the 4 datasets. First we read the scrapping results into data.frame objects.

url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_hardskills.csv"
indeed_hs <- read.csv(url, header = T, stringsAsFactors = F)

url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/cybercoders_hardskills.csv"
cybercoders_hs <- read.csv(url, header = T, stringsAsFactors = F)

url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_softskills.csv"
indeed_ss <- read.csv(url, header = T, stringsAsFactors = F)

url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/college_softskills.csv"
colleges_ss <- read.csv(url, header = T, stringsAsFactors = F)

Before we can analyze the datasets, they required a little tidying. We then extracted top 20 skills of each dataset, and plotted them on frequency graphs.

Indeed.com Data for Hard Skills

Below codes prepare the indeed.com data for the hard skills analysis.

# Remove the row number from each row, split the strings, unlist to combine into one vector and remove white spaces
indeed_hs <- indeed_hs[,1] %>%
            str_replace(pattern = "^[0-9]* ", replacement ="") %>% 
            strsplit(split = ", ") %>%
            unlist() %>%
            trimws()

# Remove any empty cells and remove "Data Science" element as this is a generic term and not a specific skill
indeed_hs <- indeed_hs[indeed_hs != ""]
indeed_hs <- indeed_hs[indeed_hs != 'Data Science']

# Sort by frequency and convert to a dataframe
indeed_hs <- indeed_hs %>%
            table()%>%
            sort(decreasing = F) %>%
            as.data.frame()

# Extract only top hits             
indeed_hstop20 <- tail(indeed_hs, 20)

Let’s see how many total skills show up in the indeed.com job postings:

# Total count of skills in the dataset
(total_indeedhs <- sum(indeed_hs$Freq))

## [1] 14670

# Number of types of skills
(num_indeedhs <- dim(indeed_hs)[1])

## [1] 109

So, there are 14670 skills appearing in our dataset, of which there are 109 unique types.

We now plot them in a frequency plot.

names(indeed_hstop20)[1] <- "skills"
ggplot(indeed_hstop20,aes(x = skills,Freq,fill=Freq))+
  geom_bar(position = 'dodge',stat = "identity") +
  coord_flip() +
  labs(title = "Frequency of Skills Appearing in Job Postings on Indeed.com",y="Frequency",x = "Hard Skills")

According to this graph, the top 5 most valuable hard skills data scientists are:

Machine Learning
Python
R
Big Data
Hadoop

Cybercoders.com Data

The data from cybercoders.com was prepared beforehand and ready for analysis right away. We just needed to extract the top 20 skills.

# Extract top 20 skills
cc_top20 <- tail(cybercoders_hs, 20)

# Total count of skills in the dataset
(total_cchs <- sum(cybercoders_hs$Freq))

## [1] 519

# Number of types of skills
(num_cchs <- dim(cybercoders_hs)[1])

## [1] 146

There are 519 skills appearing in our dataset, of which 146 are unique.

Below codes plot the frequency graph.

names(cc_top20)[1] <- "skills"
ggplot(cc_top20,aes(x = reorder(skills, Freq),Freq,fill=Freq))+
  geom_bar(position = 'dodge',stat = "identity") +
  coord_flip() +
  labs(title = "Frequency of Skills Appearing in Job Postings on Cybercoders.com",y="Frequency",x = "Hard Skills")

According to this graph, the top 5 most valuable hard skills data scientists are:

Python
Machine Learning
R
Hadoop
Data Mining

This largely agrees with the dataset from indeed.com, especially the top 3 hard skills. Both datasets suggest that Python, Machine Learning, and R are the most valuable hard skills for data scientists.

Indeed.com Data for Soft Skills

We now move on to analyze the datasets for soft skills. Below codes prepare the indeed.com data for the soft skills analysis.

# Split the strings, unlist to combine into one vector and remove white spaces
indeed_ss <- indeed_ss$soft_skills %>%
            strsplit(split = ", ") %>%
            unlist() %>%
            trimws()

# Sort by frequency and convert to a dataframe
indeed_ss <- indeed_ss %>%
            table()%>%
            sort(decreasing = F) %>%
            as.data.frame()

# Extract only top hits             
indeed_sstop20 <- tail(indeed_ss, 20)

# Total count of skills
(total_indeedss <- sum(indeed_ss$Freq))

## [1] 3106

# Number of type of skills
(num_indeedss <- dim(indeed_ss)[1])

## [1] 84

So, there are 3106 soft skills appearing in our dataset, of which there are 84 unique types.

Below is the frequency graph.

names(indeed_sstop20)[1] <- "skills"
ggplot(indeed_sstop20,aes(x = skills,Freq,fill=Freq))+
  geom_bar(position = 'dodge',stat = "identity") +
  coord_flip() +
  labs(title = "Frequency of Skills Appearing in Job Postings on Indeed.com",y="Frequency",x = "Soft Skills")

The indeed.com dataset suggests that the top 5 soft skills employers deem most valuable in a data scientist are:

experience
social
research
communication
organization

College Homepages Data

Lastly, let’s look at the data scrapped from the various colleges’ homepages.

# Split the strings, unlist to combine into one vector and remove white spaces
colleges_ss <- colleges_ss$skills %>%
  strsplit(split = ", ") %>%
  unlist() %>%
  trimws()

# Sort by frequency and convert to a dataframe
colleges_ss <- colleges_ss %>%
  table()%>%
  sort(decreasing = F) %>%
  as.data.frame()

# Extract only top hits             
colleges_sstop20 <- tail(colleges_ss, 20)

# Total count of skills
(total_colss <- sum(colleges_ss$Freq))

## [1] 2863

# Number of skills
(num_colss <- dim(colleges_ss)[1])

## [1] 64

So we observed 2863 soft skills appearing in the various colleges’ homepages, of which 64 are unique.

names(colleges_sstop20)[1] <- "skills"
ggplot(colleges_sstop20,aes(x = skills,Freq,fill=Freq))+
  geom_bar(position = 'dodge',stat = "identity") +
  coord_flip() +
  labs(title = "Frequency of Skills Appearing in College Webpages",y="Frequency",x = "Soft Skills")

The college homepages datatsets suggest that the top 5 most valuable soft skills for data scientist are:

social
research
management
experience
organization

Again, this agrees largely with the indeed.com dataset. In fact, both datasets agree exactly on the top 6 most valuable skills. Attributes such as experience, social, organization, and communication rank consistently at the very top of what employers are looking for.

This is particularly interesting because, as explained in the Methodology section above, the strategy we deployed for the soft skills is a simple key word search in the raw html codes downloaded from the websites. This brute force strategy is simple but it has a flaw: the word search does not take context into consideration. The words may appear in the website but do not mean to describe the soft skills for data scientist. Our hope was that the chance of this occurrence is not the norm and low, and by processing large amount of data, we may be still able to find something of interests. The results above show clearly that there is indeed a ranking order for these soft skills in terms of values. It could mean one of the following two things:

that we have truely captured the underlying ranking order for soft skills.
that certain words are more susceptible to be captured out of context than others.

We tend to believe in the former because there is no evidence to believe in the later. Going forward, it may be possible to create some sorts of hypothesis test to test the later conclusion; or better yet, devise a more sophisticated technique to extract those soft skills from the websites (for example, a machine learning program that does take context into account when extracting words). Considering the extent of the knowledge we currently have in web scrapping, the above results should suffice the scope of the project.

Conclusion

Using the aforementioned datasets, we were able to determine the current, shared consensus on the most valued skills, both “hard” and “soft”, that a data scientist should have in today’s corporate environment. By breaking out the demand (refer to our analysis of each dataset) with our web scrapping, data aggregation, parsing, and analysis techniques we were able to see that the job site datasets and the college datasets were relatively in line with each other for the most valued skills and traits. This is indicative of a industry movement towards universally accepted characteristics and abilities for the data science career path. This of course can be subject to major industry shifts as the field as a whole is still in the early stages of development.

With regards to soft skills, attributes such as experience, social, organization, and communication consistently rank at the very top of what employers are looking for.

With regards to hard skills, the top three most valued skills were Python, Machine Learning, and R.

Knowing this information, students can now aim to refine their skill sets to be the best data scientists that they can be per the desired industry standard.