The goal of this project is to look at all current Data Scientist openings throughout the United States and determine which areas offer the best opportunities based on my personal preferences. It will consist of three parts.
Glassdoor isn’t the easiest website to scrape. I hit multiple roadblocks before arriving at this solution. Whenever tackling web scraping, the first task is to understand the link structure. If you run a job search, the first page’s link is a long, jumbled mess. The first step is to click on the second page, because that reveals the structure of a multi-page search. Below is what the link looks like when we search for “Data Science” in “United States”:
https://www.glassdoor.com/Job/us-data-scientist-jobs-SRCH_IL.0,2_IN1_KO3,17_IP2.htm
Let’s break this down into its key components after the /Job/ tag.
/us-data-scientist-jobs-
These are the search parameters: the location followed by the user’s search term.
SRCH_IL.0,2_IN1_KO3,17
My first roadblock: this chunk can’t be removed without breaking the link, and I couldn’t find a pattern to what it represents. I’m guessing it’s a proprietary code Glassdoor uses to make scraping its site more difficult. For that reason, it’s easier to pull in a text file of Glassdoor links to loop over.
_IP2.htm
The page number. If we change the link to end in _IP4.htm, it takes us to the fourth page of the search; if there are fewer than 4 pages, it leads to a 404 error. For some reason, Glassdoor cuts its job searches off at 30 pages despite showing more than 30 pages available. On page 31 you’ll get a 404 error even if you click on it manually. This limits each search to 900 job listings.
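As a quick sketch of how the page suffix swaps in, using the example link from above (the base/page4 names are just for illustration):

base  <- "https://www.glassdoor.com/Job/us-data-scientist-jobs-SRCH_IL.0,2_IN1_KO3,17"
page4 <- paste0(base, "_IP4.htm")   # points at the fourth page of the same search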
The way I got around this was to manually search Data Science jobs for every U.S. state and copy/paste the links, excluding the _IPXX.htm suffix, into a text file. That way, we can get up to 30 pages per state and increase our sample size. The text file Glassdoor-State-Links.txt can be found in my Glassdoor Github Repository.
I highly suggest using the Chrome extension Selector Gadget. This tool saves you from having to dig through HTML/CSS to find element names. When you click on an element, it highlights everything that matches the same selector: the selected element is green, similar elements are yellow, and excluded elements are red. Once you’ve narrowed down an element you’re interested in, the path it gives is what we use to scrape that element. The website has some detailed tutorials.
Now that we have a text file containing the links, we have to determine the total job listings for each search. If not careful, we’ll get a 404 error. Glassdoor has an element on the sidebar that shows the total number of jobs for your query. After using Selector Gadget, the element name can be determined:
This shows us that there are 24,987 job listings and that the element name is .jobsCount. Since element names can change, it’s best to store them in variables; that way we can quickly swap in new names if need be. While we’re at it, let’s find element names for the other fields we want: job title, company, city/state, company rating, salary range, and post date.
I reached another roadblock here! Not every job listing contains a salary range or company rating, and if we’re not careful we’ll end up with vectors of different lengths. We need a way to insert an NA as a placeholder whenever an element doesn’t exist. Luckily, each job listing has its own main container. As we can see from the photo below, each job listing sits in its own element called .jl.
A solution to the missing-element issue is to search each container for the element and, if it doesn’t exist, create an NA. The functions html_node() and html_nodes() simplify this: html_nodes() returns every match on the page, while html_node() returns exactly one result per parent node, so html_text() yields NA for listings that lack the element.
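Here is a minimal sketch of that pattern; CSSsalary is one of the selector variables defined in the next section, and the URL is just the example search from earlier:

pg <- read_html("https://www.glassdoor.com/Job/us-data-scientist-jobs-SRCH_IL.0,2_IN1_KO3,17_IP2.htm")

salaries <- pg %>%
  html_nodes(".jl") %>%        # one container per job listing
  html_node(CSSsalary) %>%     # one result per container; missing where absent
  html_text()                  # returns NA for listings with no salary element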
After using Selector Gadget on each desired element, the variables are as follows:
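The original selector strings aren’t reproduced here, so the values below are placeholders; only the variable names, which the scraper function further down expects, come from the original workflow:

CSSjobtitle  <- "<job title selector>"       # placeholder -- use what Selector Gadget reports
CSScompany   <- "<company name selector>"    # placeholder
CSScitystate <- "<city/state selector>"      # placeholder
CSSrating    <- "<company rating selector>"  # placeholder
CSSsalary    <- "<salary range selector>"    # placeholder
CSStime      <- "<post date selector>"       # placeholder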
Inside Glassdoor-State-Links.txt, I have 41 separate links that I convert to a list. Not all 50 U.S. states were used because, believe it or not, there aren’t many opportunities for a Data Scientist in Wyoming.
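A minimal sketch of reading that file in, assuming readr’s read_csv() (which is what produces the column-specification message shown below):

library(readr)

# One Glassdoor search link per line, no header, so readr names the column X1
URLjob <- as.list(read_csv("Glassdoor-State-Links.txt", col_names = FALSE)$X1)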
## Parsed with column specification:
## cols(
## X1 = col_character()
## )
Next, we need to determine the total number of job listings per link. To do this, I loop through each link and scrape the previously mentioned .jobsCount element. There are 30 job listings per page, and we only want to pull a maximum of 30 pages to avoid a 404 kickback, so the page count for each link is the smaller of 30 and the number of pages implied by the listing count.
# Scrape the total job count from each state-level search link
totJobs <- lapply(URLjob, function(i){
  read_html(i) %>%
    html_nodes(".jobsCount") %>%     # sidebar element with the total listings
    html_text() %>%
    gsub("[^0-9]", "", .) %>%        # keep only the digits
    as.integer()
})

# Pages needed per link, capped at Glassdoor's 30-page limit
totPages <- lapply(totJobs, function(i){ as.integer(min(ceiling(i/30), 30)) })
Now we can loop over all the links and compile a complete list by appending _IPi.htm. For instance, if a U.S. state has 5 pages, the loop appends _IP1.htm, ..., _IP5.htm to the end of its link to create the full listing. It’s a good idea to export the list so it doesn’t need to be recalculated.
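That loop isn’t reproduced in full here; a minimal sketch of one way to build it, using the URLjob and totPages objects from above to produce the URLall vector the scraper below expects (the export path is a placeholder):

# Expand each state link into one URL per page: link + "_IP1.htm", ..., "_IPn.htm"
URLall <- unlist(lapply(seq_along(URLjob), function(i){
  paste0(URLjob[[i]], "_IP", seq_len(totPages[[i]]), ".htm")
}))

# saveRDS(URLall, "./Data/URL-All.RDS")   # placeholder path; export so it isn't recalculated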
With all the possible links in hand, we can create a scraping tool to pull in all of the elements. It’s easier to turn it into a function so everything is pulled in at once and kept in a consistent data frame. The packages rvest, httr, and xml2 make this very easy.
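For completeness, a sketch of the packages the function below relies on; the exact set loaded in the original script is my assumption (purrr supplies map_df and magrittr the pipe):

library(rvest)     # html_nodes(), html_node(), html_text()
library(httr)      # GET()
library(xml2)      # read_html()
library(purrr)     # map_df()
library(magrittr)  # the %>% pipe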
GD_Scraper <- function(x = 1:100) {
  map_df(x, function(i) {
    cat(" P", i, sep = "")                # progress indicator: page being scraped
    pg <- read_html(GET(URLall[i]))

    # Elements present on every listing: scrape the whole page at once
    jobTitle  <- pg %>% html_nodes(CSSjobtitle) %>%
      html_text() %>% data.frame(Job_Title = ., stringsAsFactors = F)
    company   <- pg %>% html_nodes(CSScompany) %>%
      html_text() %>% data.frame(Company_Name = ., stringsAsFactors = F)
    cityState <- pg %>% html_nodes(CSScitystate) %>%
      html_text() %>% data.frame(City_State = ., stringsAsFactors = F)

    # Possibly-missing elements: search within each .jl container so
    # html_node() yields NA where a listing lacks the element
    rating    <- pg %>% html_nodes(".jl") %>% html_node(CSSrating) %>%
      html_text() %>% data.frame(Company_Rating = ., stringsAsFactors = F)
    salary    <- pg %>% html_nodes(".jl") %>% html_node(CSSsalary) %>%
      html_text() %>% data.frame(salary_Range = ., stringsAsFactors = F)
    postDate  <- pg %>% html_nodes(".jl") %>% html_node(CSStime) %>%
      html_text() %>% data.frame(Post_Date = ., stringsAsFactors = F)

    cbind(jobTitle, company, cityState, rating, salary, postDate)
  })
}
This function just pings the website (read_html), pulls the nodes (html_node & html_nodes), extracts the text (html_text), and converts it all to a data frame. It only requires the range of URLs you want to pull (defaulting to 1:100). I designed it this way so that if there are a large number of links, they can be pulled in as batches; that way, if there’s an error, the whole process won’t have to be run again.
Lastly, just run the scraper up to the number of URLs you have, rbind the results, and export the data frame.
data.raw1 <- GD_Scraper(1:100)
data.raw2 <- GD_Scraper(101:200)
data.raw3 <- GD_Scraper(201:300)
data.raw4 <- GD_Scraper(301:400)
data.raw5 <- GD_Scraper(401:500)
data.raw6 <- GD_Scraper(501:length(URLall))
data.total <- data.raw1 %>%
  rbind(data.raw2, data.raw3, data.raw4, data.raw5, data.raw6)

# saveRDS(data.total, "./Data/Data-Raw-Total.RDS")
So there we have it: a way to scrape Glassdoor and obtain information on all of its job listings. Considering the hoops we had to jump through, scraping other sites should be a lot easier. The process is always the same: understand the link structure, find the element selectors, handle any missing elements, and loop over the pages to compile a data frame.
You may want to check a website’s terms of service before scraping, because scraping can get your IP banned from the site. Some websites have additional anti-scraping measures. There are ways to set up scraping tools so that they scrape more slowly (Sys.sleep) and aren’t as easy to detect, but that’s outside the scope of this project. This approach also makes it easy to scrape financial and sports data without having to rely on an API.
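As a rough illustration, slowing things down is as simple as dropping a pause into the scraping loop; the two-second value here is arbitrary, not something used in the original scraper:

# e.g. inside GD_Scraper, before each read_html(GET(URLall[i])) call
Sys.sleep(2)   # arbitrary pause between requests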
The next step is to clean up the data, bring in additional sources of data, and create a tidy dataframe to do an exploratory analysis on.