Contents
Overview: a summary of findings and resulting plots.
Scope & Limitations: a brief description of the project’s scope and limitations.
Technical Details: reproducible details and code.
Technical Details
Setup
The rvest package is used to scrape Indeed.com, dplyr for data wrangling, and ggplot2 for plotting.
# Cleanup environment and set options
rm(list = ls())
options(scipen=999) # avoids scientific notation
# Install and load packages
install_packages <- function(pk){
new_pk <- pk[!(pk %in% installed.packages()[, "Package"])]
# if package is not installed only
if (length(new_pk)) {
install.packages(new_pk, dependencies=TRUE, warn.conflicts=FALSE)
}
# load packages
sapply(pk, require, character.only=TRUE)
}
packages <- c("rvest", "dplyr", "ggplot2")
suppressMessages(suppressWarnings(install_packages(packages)))
Main function
The main workhorse is the scrape_indeed function:
scrape_indeed <- function(job_title, City, State, radius, contract_type, seniority) {
# Scrapes Indeed.com given the following arguments:
#
# 1. job title (one to three terms)
# 2. city
# 3. state abbreviation
# 4. radius (in miles, around the city)
# 5. contract type (full time or part time, for now)
# 6. seniority (entry level, mid level, or senior level)
# format and define URL
# number of terms
n_terms <- length(strsplit(job_title, " ")[[1]])
if (n_terms == 2) {
new_job_title <- paste0(strsplit(job_title, " ")[[1]][1], '+',
strsplit(job_title, " ")[[1]][2], sep="")
} else if (n_terms == 3) {
new_job_title <- paste0(strsplit(job_title, " ")[[1]][1], '+',
strsplit(job_title, " ")[[1]][2], '+',
strsplit(job_title, " ")[[1]][3], sep="")
} else {
new_job_title <- job_title
}
location <- paste0('&l=', City, ',+', State, sep="")
contract_type <- paste0('&jt=', strsplit(contract_type, " ")[[1]][1],
strsplit(contract_type, " ")[[1]][2], sep="")
new_seniority <- paste0('&explvl=', strsplit(seniority, " ")[[1]][1],
"_", strsplit(seniority, " ")[[1]][2], sep="")
URL <- paste0('https://www.indeed.com/jobs?q=',
new_job_title,
location,
contract_type,
new_seniority,
paste0('&radius=', radius))
# scrape webpage: since the rbLabel and rbCount extraction is erratic
# the use of a while loop ensures correct results
cts = NULL
while (length(cts) == 0) {
webpage <- read_html(URL)
# get text from html - assumes 5 bins
vals <- html_text(html_nodes(webpage,'.rbLabel')[1:5])
cts <- html_text(html_nodes(webpage,'.rbCount')[1:5])
}
# checks 5-bin assumption, returns actual bins (if less)
num_cts <- sum(grepl('\\$', vals))
vals <- vals[1:num_cts]
cts <- cts[1:num_cts]
# format labels and counts
format_ <- function(x) {
x <- sub("\\+", "", x)
x <- sub("\\(", "", x)
x <- sub("\\)", "", x)
x <- sub("\\$", "", x)
x <- sub("\\,", "", x)
x
}
for (i in 1:length(vals)) vals[i] <- format_(vals[i])
for (i in 1:length(cts)) cts[i] <- format_(cts[i])
vals <- as.numeric(vals)
cts <- as.numeric(cts)
# return a dataframe
# create a salary_col based on number of bins found
salary_col <- NULL
for (i in 1:length(vals)) {
salary_col <- c(salary_col, rep(vals[i], cts[i]))
}
dfm <- data.frame("Salary" = salary_col,
"City" = rep(City, sum(cts)),
"Title" = rep(job_title, sum(cts)),
"Level" = rep(seniority, sum(cts)))
return(dfm)
}
We scrape the website’s source html and zoom on on Indeed.com’s aggregated bins. Note that this scraper is not getting the actual data, just the already-aggregated summary data displayed in the left bar:

Since rbLabel and rbCount extraction is erratic, we use a while loop to get the correct results. This does not take long.
Scraping
The function is applied for various job types, locations, and experience levels. I am only interested in full-time positions for this example. I made my job easier by creating a CSV file with the arguments to be passed to the scrape_indeed function. Here are the contents of this file:
args <- read.csv('./scraping_arguments.csv'
, stringsAsFactors = FALSE)
args
## Title City State Radius Contract Level
## 1 data engineer Denver CO 0 full time entry level
## 2 data engineer Boulder CO 0 full time entry level
## 3 data scientist Denver CO 0 full time entry level
## 4 data scientist Boulder CO 0 full time entry level
## 5 machine learning engineer Denver CO 0 full time entry level
## 6 machine learning engineer Boulder CO 0 full time entry level
## 7 data engineer Denver CO 0 full time mid level
## 8 data engineer Boulder CO 0 full time mid level
## 9 data scientist Denver CO 0 full time mid level
## 10 data scientist Boulder CO 0 full time mid level
## 11 machine learning engineer Denver CO 0 full time mid level
## 12 machine learning engineer Boulder CO 0 full time mid level
# scraping Indeed.com for each subgroup of interest
indeed_salaries <- NULL
for (i in 1:nrow(args)) {
indeed_salaries <- rbind(indeed_salaries,
scrape_indeed(args[i, 'Title']
,args[i, 'City']
,args[i, 'State']
,args[i, 'Radius']
,args[i, 'Contract']
,args[i, 'Level']))
}
The final dataset is long, each observation is a “job” and consists of the “mean salary” for the job in a given bin, aggregated at the job title, location, contract type and seniority level.
head(indeed_salaries, 10)
## Salary City Title Level
## 1 51100 Denver data engineer entry level
## 2 51100 Denver data engineer entry level
## 3 51100 Denver data engineer entry level
## 4 51100 Denver data engineer entry level
## 5 51100 Denver data engineer entry level
## 6 51100 Denver data engineer entry level
## 7 51100 Denver data engineer entry level
## 8 51100 Denver data engineer entry level
## 9 51100 Denver data engineer entry level
## 10 51100 Denver data engineer entry level
tail(indeed_salaries, 10)
## Salary City Title Level
## 3012 112800 Boulder machine learning engineer mid level
## 3013 112800 Boulder machine learning engineer mid level
## 3014 112800 Boulder machine learning engineer mid level
## 3015 112800 Boulder machine learning engineer mid level
## 3016 120700 Boulder machine learning engineer mid level
## 3017 120700 Boulder machine learning engineer mid level
## 3018 120700 Boulder machine learning engineer mid level
## 3019 120700 Boulder machine learning engineer mid level
## 3020 120700 Boulder machine learning engineer mid level
## 3021 120700 Boulder machine learning engineer mid level
Group averages & Summary Table
We calculate group averages for each sub-group (aggregate level) for ease in plotting and for summarizing the data. Note that this is in effect a mean of means type of average.
# creating summary dataframe for summary table
sdf <- data.frame(summary(indeed_salaries[,c('City','Level','Title')]))
sdf <- sdf[complete.cases(sdf), -1]
sdf$Freq <- as.character(sdf$Freq)
sdf$Name <- NULL
sdf$Value <- NULL
for (i in 1:nrow(sdf)) {
sdf$Name[i] <- trimws(strsplit(sdf$Freq[i], ":")[[1]][1])
sdf$Value[i] <- trimws(strsplit(sdf$Freq[i], ":")[[1]][2])
}
sdf$Freq <- NULL
sdf$Value <- as.numeric(sdf$Value)
colnames(sdf) <- c("Var","Name","Value")
# calculating group averages
group_avgs <- indeed_salaries %>%
group_by(City, Title, Level) %>%
summarize('mean_salary' = round(mean(Salary), 0),
'median_salary' = round(median(Salary), 0))
group_avgs <- data.frame(group_avgs)
# converting into accounting format for better display
accounting_format <- function(x) {
x <- as.character(x)
x1 <- substr(x, 1, nchar(x)-3)
x2 <- substr(x, nchar(x)-2, nchar(x))
out <- paste0('$', x1, ",", x2, sep="")
out
}
group_avgs$mean_salary <- accounting_format(group_avgs$mean_salary)
group_avgs$median_salary <- accounting_format(group_avgs$median_salary)
group_avgs$x_axis <- 110000 # for label plotting purposes
group_avgs$y_axis <- c(rep(500, 6), rep(470, 6)) # for label plotting purposes
group_avgs[, 1:5]
## City Title Level mean_salary median_salary
## 1 Denver data engineer entry level $69,405 66100
## 2 Denver data engineer mid level $89,738 90000
## 3 Denver data scientist entry level $88,143 83300
## 4 Denver data scientist mid level $103,338 100600
## 5 Denver machine learning engineer entry level $70,622 67600
## 6 Denver machine learning engineer mid level $103,103 100100
## 7 Boulder data engineer entry level $71,898 75700
## 8 Boulder data engineer mid level $91,037 90200
## 9 Boulder data scientist entry level $74,661 72200
## 10 Boulder data scientist mid level $94,772 91900
## 11 Boulder machine learning engineer entry level $93,786 97600
## 12 Boulder machine learning engineer mid level $95,266 96900
Plotting
Finally, we construct boxplots to comapre distributions, and barplots to look at them more closely. See the plots back in the Overview.
# BOXPLOTS
# plot title and y axis label
gg_title <- 'Indeed.com\'s Salary Distributions for Full-time Data Positions in Boulder and Denver.'
y_lab <- 'Salary Range'
tally_tb <- indeed_salaries %>%
group_by(City, Title, Level, Salary) %>% tally()
ggplot(data.frame(tally_tb), aes(City, fill = City)) +
geom_boxplot(aes(x = City, y = Salary)) +
facet_wrap(Level ~ Title) +
labs(title=gg_title, y=y_lab) +
theme_bw()
# BARPLOTS
# plot title and y axis label
gg_title <- 'Indeed.com\'s Mean Salaries for Full-time Data Positions in Boulder and Denver.'
y_lab <- 'Number of Jobs'
g <- ggplot(indeed_salaries, aes(Salary, fill = City)) +
geom_histogram(bins = 20) +
facet_wrap(Level ~ Title) +
labs(title=gg_title, y=y_lab) +
theme_bw()
# adding inner labels
g + geom_text(data = group_avgs,
mapping = aes(x=x_axis, y=y_axis),
label = paste0(group_avgs$City, " mean: ",
group_avgs$mean_salary, sep=""),
size=3)
Back to Contents