Group 5: Jun Yan, Hector Santana, Rafal Decowski, Chad Smith, Aryeh Sturm
October 22, 2017
In this project, we attempted to answer the question, “What are the most valued data science skills?”, using data obtained from the internet through web scraping.
The analysis done here can be useful for students who aspire to work in data science and analytics fields. We hope it can be helpful for students in complementing their training plan and refining their resumes.
We identified two kinds of skills - the “hard” skills and the “soft” skills.
The hard skills are technical and domain-specific skills, such as the ability to code in a certain computer language or to perform statistical analysis.
The soft skills are personal attributes which help you engage and succeed in a team.
Hard skills are somewhat quantifiable: a person's ability to code is reflected in that person's past work, degrees, and certificates. Soft skills are harder to measure. We recognize these inherent differences between the two kinds of skills and analyzed them separately.
The project, including web scraping and analysis, was done completely in R.
We chose GitHub as the repository for storing our results.
The following R packages were used:
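Based on the function calls throughout the code, these include rvest, stringr, XML, and RCurl; ggplot2 is assumed for the frequency plots. A minimal setup chunk under that assumption:
library(rvest)    # read_html(), html_nodes(), html_text()
library(stringr)  # str_detect(), str_extract(), str_replace()
library(XML)      # getHTMLLinks()
library(RCurl)    # getURL()
library(ggplot2)  # assumed: used for the frequency plots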
We also used SelectorGadget to identify some of the fields for scraping.
To collect information about hard skills, we investigated two websites: indeed.com and cybercoders.com.
We chose these two websites because they are relatively easy to scrape for information.
Typing "data scientist" into either site's search box leads to a search result page listing the job postings. Each job posting has a small section listing the "desired experience" or "required skills" for that job. Using SelectorGadget, the relevant nodes can be identified (".experienceList" on indeed.com and ".skill-list span" on cybercoders.com).
Using rvest, they can be directly scraped from the search result page.
To automate the scraping process, we leveraged patterns in the URL structures of both websites, which allowed us to automatically generate links to all the search result pages.
On indeed.com, subsequent pages are reached by incrementing the number after "start=" in steps of ten. For example, page 2 is "https://www.indeed.com/jobs?q=data+scientist&start=10", page 3 has "start=20", and so on. Using this strategy, we scraped 2,370 pages from indeed.com, each containing about 14 to 15 job postings.
On cybercoders.com, we noticed that manipulating the "page=1" string reaches the other search result pages; for example, the second page is "page=2", and so on. We scraped 6 pages from cybercoders.com, each containing about 20 job postings.
To gauge what type of soft skills employers look for, we did some research and found this website useful: https://www.thebalance.com/list-of-soft-skills-2063770.
The webpage has a list of 146 unique soft skills which we saved as a .txt file here: https://raw.githubusercontent.com/Tyllis/Data607/master/list_of_soft_skills.txt.
We decided to use regular expressions to collect information on the soft skills. Our strategy is simple: essentially a keyword search on each relevant web page we scraped. Our program downloads the raw html codes from a web page using readLines, and functions from the stringr package then extract any words from the soft-skill list found in those codes.
We chose this strategy because, unlike a search result page, the web pages we were attempting to scrape varied significantly: each was coded in a different style with a different structure. An rvest selector that finds the information on one website usually does not work on the next. Given the extent of our current web-scraping knowledge, this was our best strategy.
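The soft-skill extraction code below relies on a softskills pattern vector. A minimal sketch of how it could be built from the .txt file, assuming the skill names are used directly as plain string patterns:
# Load the 146 soft skills and trim whitespace; str_extract() is vectorized
# over patterns, so matching one page against this vector returns one match
# (or NA) per skill
softskills <- trimws(readLines("https://raw.githubusercontent.com/Tyllis/Data607/master/list_of_soft_skills.txt"))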
We used two data sources in our endeavor:
We found a useful website compiled by Ryan Swanstrom: http://101.datascience.community/2015/07/14/awesome-data-science-colleges-list/.
He compiled a .csv file containing a list of 566 colleges which offer degrees or certificates related to data science.
What's useful for us is a column with links to each school's data science homepage.
The file is accessible here: https://github.com/ryanswanstrom/awesome-datascience-colleges/blob/master/data_science_colleges.csv.
With this list, we used the readLines function to download the html codes for each URL link, then str_extract from the stringr package to look for and extract the words on the soft-skills list from each page.
For indeed.com, the process is a bit more complicated: we needed to obtain the URL links to the actual job postings. We knew how to navigate the result pages easily by manipulating the "start=" string in the URL, but each result page contains hundreds of links, and only 14 to 15 of them are the ones we wanted, namely the links to the actual job posting pages.
To solve the problem, we used the getHTMLLinks function from the XML package, which returns all the links on a search result page. We then found a pattern in the links that lead to actual job postings: they contain either "/pagead/clk" or "/rc/clk".
We can use str_detect from the stringr package to identify and extract these links, which enables our program to visit these job-posting pages automatically and download their raw html codes with the readLines function.
We now present the R code written for this project.
We would like to warn the readers that some of the code is computationally intensive and might take a long time to load/replicate.
# Generate the URLs of all 2,370 indeed.com search result pages
indeed <- "https://www.indeed.com"
pages <- seq(from = 10, to = 23700, by = 10)
searchresult <- paste(indeed, "/jobs?q=data+scientist&start=", pages, sep = "")
# Helper: read a page and return the text of all nodes matching a CSS selector
readLinks <- function(url, lookingfor){
  htmllink <- read_html(url)
  nodes <- htmllink %>% html_nodes(lookingfor) %>% html_text()
  return(nodes)
}
indeed_hardskills <- lapply(searchresult, readLinks, lookingfor = ".experienceList")
The result is a list containing 2,370 elements, in which each element is a search result page from indeed.com. Each element contains 14 to 15 sub-elements, corresponding to the 14-15 job postings from each result page.
Each of these sub-elements then contains the desired experience (hard skills) in vectors.
Using the unique function, the 2,370 search result pages shrank to about 140 unique pages. Since each page contains 14 to 15 job links, we still ended up with 2,082 job links. We also unpacked the list in this step.
indeed_hardskills <- unlist(unique(indeed_hardskills))
write.csv(indeed_hardskills, file ="indeed_hardskills.csv", row.names=FALSE)
The results are saved and reside in our github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_hardskills.csv
Similarly, we used the same method to scrape cybercoders.com search results for “data scientist”.
Instead of lapply, we created a function that contains a for-loop. The function also packages the results into a skill-frequency data frame.
scrape_cyber_coders <- function(){
  url_start <- "https://www.cybercoders.com/search/?page="
  url_end <- "&searchterms=Data%20Scientist&searchlocation=&newsearch=true&originalsearch=true&sorttype="
  num_pages <- c(1:6)
  skills_found <- c()
  for(p in num_pages){
    # Build the URL for result page p and pull the skill tags
    url <- sprintf("%s%d%s", url_start, p, url_end)
    h <- read_html(url)
    skills <- h %>% html_nodes(".skill-list span") %>% html_text()
    skills <- skills[skills != ""]
    skills_found <- append(skills_found, skills)
  }
  # Tabulate skill frequencies and save as a data frame
  sorted_skills <- sort(table(skills_found), decreasing = FALSE)
  df <- as.data.frame(sorted_skills)
  write.csv(df, file = "cybercoders_hardskills.csv", row.names = FALSE)
}
scrape_cyber_coders()
The results are saved and reside in our github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/cybercoders_hardskills.csv
For the indeed.com soft-skills scrape, two wrapper functions based on tryCatch() were created to ignore any errors and return NULL:
getURLFun <- function(x){
  return(tryCatch(getURL(x), error = function(e) NULL))
}
getHTMLLinksFun <- function(x){
  return(tryCatch(getHTMLLinks(x), error = function(e) NULL))
}
Note that this step is very computationally intensive. The process takes up to several hours, depending on your computer hardware. The resulting object, if saved to a file on disk, will take more than 30 MB of space.
# Harvest the links from every search result page, one page at a time so a
# single failed request does not abort the whole run
rawlinks <- unlist(lapply(searchresult, function(x) getHTMLLinksFun(getURLFun(x))))
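Since this harvest takes hours, caching the resulting links to disk avoids repeating it in later sessions; a minimal sketch (the file name is our own choice):
saveRDS(rawlinks, "indeed_rawlinks.rds")      # cache the harvested links (30+ MB)
# rawlinks <- readRDS("indeed_rawlinks.rds")  # reload in a later session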
We then extracted the actual links to the job posting sites, using the two string patterns we discovered.
findidx <- str_detect(rawlinks, pattern = "/pagead/clk|/rc/clk")
goodlinks <- rawlinks[findidx] %>%
str_extract(pattern = "/rc/clk[[:graph:]]*|/pagead/clk[[:graph:]]*")
joblinks <- paste(indeed, goodlinks, sep = "")
Now we can use the readLines function to download the html codes from these links.
readLinesFun <- function(x){
  return(tryCatch(readLines(x), error = function(e) NULL))
}
We found that this step is even more computationally intensive. For this project, we chose to perform scraping on the first 500 job links:
temp <- lapply(joblinks[1:500], readLinesFun)
The result is a list containing html codes for the 500 job websites.
result <- c()
for (val in temp){
  meshcodes <- paste(val, collapse = "")                 # Collapse the html codes into one big string
  matches <- str_extract(meshcodes, softskills)          # Extract the soft-skill words from the html string
  matches <- matches[!is.na(matches)]                    # Remove NA elements (skills not found on the page)
  result <- c(result, paste(matches, collapse = ", "))   # Collapse and append to the result vector
}
indeed_softskills <- data.frame(joblinks[1:500], soft_skills = result)
write.csv(indeed_softskills, file = "indeed_softskills.csv", row.names = FALSE)
This is stored in our github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_softskills.csv
# Use the raw file link so read.csv downloads the csv itself rather than the GitHub html page
url <- "https://raw.githubusercontent.com/ryanswanstrom/awesome-datascience-colleges/master/data_science_colleges.csv"
dscolleges <- read.csv(url, stringsAsFactors = F)
url_list <- dscolleges$URL
homepages <- sapply(url_list, readLinesFun, USE.NAMES = F)
The soft skills extracted from each homepage were then placed back into the original dscolleges data frame, which first gets an empty skills column:
dscolleges$skills <- c(rep(NA, dim(dscolleges)[1]))
rowcount <- 0
for (htmlcodes in homepages){
  meshcodes <- paste(htmlcodes, collapse = "")   # Collapse the html codes into one big string
  rowcount <- rowcount + 1
  temp <- str_extract(meshcodes, softskills)     # Extract the soft-skill words
  temp <- temp[!is.na(temp)]                     # Remove NA elements
  dscolleges[rowcount, "skills"] <- paste(temp, collapse = ", ")
}
write.csv(dscolleges, file = "college_softskills.csv", row.names = FALSE)
The file is stored in the following github repository: https://raw.githubusercontent.com/Tyllis/Data607/master/college_softskills.csv
We are now ready to analyze the 4 datasets.
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_hardskills.csv"
indeed_hs <- read.csv(url, header = T, stringsAsFactors = F)
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/cybercoders_hardskills.csv"
cybercoders_hs <- read.csv(url, header = T, stringsAsFactors = F)
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/indeed_softskills.csv"
indeed_ss <- read.csv(url, header = T, stringsAsFactors = F)
url <- "https://raw.githubusercontent.com/Tyllis/Data607/master/college_softskills.csv"
colleges_ss <- read.csv(url, header = T, stringsAsFactors = F)
Before we could analyze the datasets, they required a little tidying.
We then extracted the top 20 skills from each dataset and plotted them as frequency graphs.
The code below prepares the indeed.com data for the hard-skills analysis.
# Remove the row number from each row, split the strings, unlist to combine into one vector and remove white spaces
indeed_hs <- indeed_hs[,1] %>%
str_replace(pattern = "^[0-9]* ", replacement ="") %>%
strsplit(split = ", ") %>%
unlist() %>%
trimws()
# Remove any empty cells and remove "Data Science" element as this is a generic term and not a specific skill
indeed_hs <- indeed_hs[indeed_hs != ""]
indeed_hs <- indeed_hs[indeed_hs != 'Data Science']
# Sort by frequency and convert to a dataframe
indeed_hs <- indeed_hs %>%
table()%>%
sort(decreasing = F) %>%
as.data.frame()
# Extract only top hits
indeed_hstop20 <- tail(indeed_hs, 20)
Let's see how many total skills show up in the indeed.com job postings:
# Total count of skills in the dataset
(total_indeedhs <- sum(indeed_hs$Freq))
[1] 14670
# Number of types of skills
(num_indeedhs <- dim(indeed_hs)[1])
[1] 109
There are 14,670 skill mentions in our dataset, covering 109 unique skills.
We now exhibit these skills in a frequency plot.
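A minimal ggplot2 sketch of how such a frequency chart can be drawn (the first column is renamed because the name produced by table()/as.data.frame() can vary; the axis labels and title are ours):
names(indeed_hstop20) <- c("skill", "Freq")   # normalize column names

ggplot(indeed_hstop20, aes(x = reorder(skill, Freq), y = Freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Hard skill", y = "Frequency", title = "Top 20 hard skills in indeed.com postings")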
According to this graph, the top 5 most valuable hard skills for data scientists are:
The data from cybercoders.com was prepared beforehand and ready for analysis right away. We just needed to extract the top 20 skills:
cc_top20 <- tail(cybercoders_hs, 20) # Extract top 20 skills
(total_cchs <- sum(cybercoders_hs$Freq)) # Total count of skills in the dataset
[1] 519
(num_cchs <- dim(cybercoders_hs)[1]) # Number of types of skills
[1] 146
There are 519 skill mentions in our dataset, of which 146 are unique.
According to this graph, the top 5 most valuable hard skills for data scientists are:
This largely agrees with the dataset from indeed.com, especially the top 3 hard skills. Both datasets suggest that Python, Machine Learning, and R are the most valuable hard skills for data scientists.
We now move on to analyze the datasets for soft skills. The code below prepares the indeed.com data for the soft-skills analysis:
# Split the strings, unlist to combine into one vector and remove white spaces
indeed_ss <- indeed_ss$soft_skills %>%
strsplit(split = ", ") %>%
unlist() %>%
trimws()
# Sort by frequency and convert to a dataframe
indeed_ss <- indeed_ss %>%
table()%>%
sort(decreasing = F) %>%
as.data.frame()
# Extract only top hits
indeed_sstop20 <- tail(indeed_ss, 20)
# Total count of skills
(total_indeedss <- sum(indeed_ss$Freq))
[1] 3106
# Number of type of skills
(num_indeedss <- dim(indeed_ss)[1])
[1] 84
There are 3,106 soft-skill mentions in our dataset, of which 84 are unique.
Move to the next slide for the frequency graph -->
The indeed.com dataset suggests that the top 5 soft skills employers deem most valuable in a data scientist are:
Lastly, let's look at the data scraped from the various colleges' homepages:
# Split the strings, unlist to combine into one vector and remove white spaces
colleges_ss <- colleges_ss$skills %>%
strsplit(split = ", ") %>%
unlist() %>%
trimws()
# Sort by frequency and convert to a dataframe
colleges_ss <- colleges_ss %>%
table()%>%
sort(decreasing = F) %>%
as.data.frame()
# Extract only top hits
colleges_sstop20 <- tail(colleges_ss, 20)
# Total count of skills
(total_colss <- sum(colleges_ss$Freq))
[1] 2863
# Number of skills
(num_colss <- dim(colleges_ss)[1])
[1] 64
We observed 2,863 soft-skill mentions on the various colleges' homepages, of which 64 are unique.
Move to the next slide for the frequency graph -->
The college homepage dataset suggests that the top 5 most valuable soft skills for a data scientist are:
Again, this agrees largely with the indeed.com dataset. In fact, both datasets agree exactly on the top 6 most valuable skills. Attributes such as experience, social, organization, and communication rank consistently at the very top of what employers are looking for.
Using the aforementioned datasets, we were able to determine the current consensus on the most valued skills, both "hard" and "soft", that a data scientist should have in today's corporate environment. Through our web scraping, data aggregation, parsing, and analysis, we found that the job-site datasets and the college datasets were largely in line with each other on the most valued skills and traits. This is indicative of an industry movement toward universally accepted characteristics and abilities for the data science career path. It is, of course, subject to major industry shifts, as the field as a whole is still in its early stages of development.
With regards to soft skills, attributes such as experience, social, organization, and communication consistently rank at the very top of what employers are looking for.
With regards to hard skills, the top three most valued skills were Python, Machine Learning, and R.
Knowing this, students can aim to refine their skill sets and become the best data scientists they can be, in line with the current industry standard.