Indeed Salaries for Data Jobs in Denver & Boulder.

Marcelo Sanches

December 26, 2019

Contents

Overview: a summary of findings and resulting plots.

Scope & Limitations: a brief description of the project’s scope and limitations.

Technical Details: reproducible details and code.


Overview

This project demonstrates how one can scrape indeed.com job salaries using R, and create summary tables and descriptive plots of the data.

In this example I examine salaries for full-time data jobs in Denver and Boulder, CO.

Main Findings

  • There are 2 times as many jobs in Denver than in Boulder
  • There are 3 times as many mid level jobs than entry level ones
  • There are 5 times as many jobs for data engineers than jobs for data scientists and machine learning engineers combined
Number of Salaries
City Level Title
Denver: \(2070\) entry level: \(733\) data engineer: \(2504\)
Boulder: \(951\) mid level: \(2288\) data scientist: \(207\)
machine learning engineer: \(310\)
  • Median salaries range from \(66k\) to \(100k\)
  • Salaries increase as titles evolve from data engineer to data scientist to machine learning engineer
  • Entry-level Denver data engineers make less than Boulder ones
  • Denver data scientists make more than Boulder ones

A comparison of distributions show these findings visually:

While entry-level data engineers make the least amount, a look at the number of jobs shows better opportunity for data engineers:

Back to Contents


Scope & Limitations

This job board scraper is quite basic. It only scrapes aggregate data from www.indeed.com, based on job title search parameters. Because it does not scrape full titles and locations, and because it does not actualy get to the data itself (the job postings), it cannot be aggregated with other scrapers for other job boards, as it would be impossible to deduplicate jobs found.

Future iterations would include:

  1. full titles

  2. company name

  3. company size

  4. top skills

  5. other job boards

Back to Contents


Technical Details

Setup

The rvest package is used to scrape Indeed.com, dplyr for data wrangling, and ggplot2 for plotting.

# Cleanup environment and set options
rm(list = ls())
options(scipen=999)  # avoids scientific notation

# Install and load packages
install_packages <- function(pk){
  
    new_pk <- pk[!(pk %in% installed.packages()[, "Package"])]
  
    # if package is not installed only
    if (length(new_pk)) {
        install.packages(new_pk, dependencies=TRUE, warn.conflicts=FALSE)
    }
    
    # load packages
    sapply(pk, require, character.only=TRUE)
}

packages <- c("rvest", "dplyr", "ggplot2")
suppressMessages(suppressWarnings(install_packages(packages)))

Main function

The main workhorse is the scrape_indeed function:

scrape_indeed <- function(job_title, City, State, radius, contract_type, seniority) {
  
#    Scrapes Indeed.com given the following arguments:
# 
#       1. job title (one to three terms)
#       2. city 
#       3. state abbreviation
#       4. radius (in miles, around the city)  
#       5. contract type (full time or part time, for now)
#       6. seniority (entry level, mid level, or senior level)
  
    # format and define URL
  
    # number of terms
    n_terms <- length(strsplit(job_title, " ")[[1]])
    
    if (n_terms == 2) {
        new_job_title <- paste0(strsplit(job_title, " ")[[1]][1], '+',
                                strsplit(job_title, " ")[[1]][2], sep="")
    } else if (n_terms == 3) { 
        new_job_title <- paste0(strsplit(job_title, " ")[[1]][1], '+',
                                strsplit(job_title, " ")[[1]][2], '+', 
                                strsplit(job_title, " ")[[1]][3], sep="")
    } else {
        new_job_title <- job_title
    }
                        
    location <- paste0('&l=', City, ',+', State, sep="")
                        
    contract_type <- paste0('&jt=', strsplit(contract_type, " ")[[1]][1],
                            strsplit(contract_type, " ")[[1]][2], sep="")
                            
    new_seniority <- paste0('&explvl=', strsplit(seniority, " ")[[1]][1], 
                            "_", strsplit(seniority, " ")[[1]][2], sep="")
    
    URL <- paste0('https://www.indeed.com/jobs?q=', 
                        new_job_title, 
                        location,
                        contract_type,
                        new_seniority,
                        paste0('&radius=', radius))

    # scrape webpage: since the rbLabel and rbCount extraction is erratic
    #                 the use of a while loop ensures correct results
    
    cts = NULL
    while (length(cts) == 0) {
    
        webpage <- read_html(URL)
        
        # get text from html - assumes 5 bins 
        vals <- html_text(html_nodes(webpage,'.rbLabel')[1:5])
        cts <- html_text(html_nodes(webpage,'.rbCount')[1:5])
        
    }
    
    # checks 5-bin assumption, returns actual bins (if less)
    num_cts <- sum(grepl('\\$', vals))
    vals <- vals[1:num_cts]
    cts <- cts[1:num_cts]
    
    # format labels and counts 
    format_ <- function(x) {
        
        x <- sub("\\+", "", x)
        x <- sub("\\(", "", x)
        x <- sub("\\)", "", x)
        x <- sub("\\$", "", x)
        x <- sub("\\,", "", x)
        x
    }
    
    for (i in 1:length(vals)) vals[i] <- format_(vals[i])
    for (i in 1:length(cts)) cts[i] <- format_(cts[i])
    
    vals <- as.numeric(vals)
    cts <- as.numeric(cts)
    
    # return a dataframe 
    
    # create a salary_col based on number of bins found
    salary_col <- NULL 
    for (i in 1:length(vals)) {
        salary_col <- c(salary_col, rep(vals[i], cts[i]))
    }
    
    dfm <- data.frame("Salary" = salary_col,
                      "City" = rep(City, sum(cts)),
                      "Title" = rep(job_title, sum(cts)),
                      "Level" = rep(seniority, sum(cts)))
    
    return(dfm)
}

We scrape the website’s source html and zoom on on Indeed.com’s aggregated bins. Note that this scraper is not getting the actual data, just the already-aggregated summary data displayed in the left bar:

fig1

Since rbLabel and rbCount extraction is erratic, we use a while loop to get the correct results. This does not take long.

Scraping

The function is applied for various job types, locations, and experience levels. I am only interested in full-time positions for this example. I made my job easier by creating a CSV file with the arguments to be passed to the scrape_indeed function. Here are the contents of this file:

args <- read.csv('./scraping_arguments.csv'
                 , stringsAsFactors = FALSE)
args
##                        Title    City State Radius  Contract       Level
## 1              data engineer  Denver    CO      0 full time entry level
## 2              data engineer Boulder    CO      0 full time entry level
## 3             data scientist  Denver    CO      0 full time entry level
## 4             data scientist Boulder    CO      0 full time entry level
## 5  machine learning engineer  Denver    CO      0 full time entry level
## 6  machine learning engineer Boulder    CO      0 full time entry level
## 7              data engineer  Denver    CO      0 full time   mid level
## 8              data engineer Boulder    CO      0 full time   mid level
## 9             data scientist  Denver    CO      0 full time   mid level
## 10            data scientist Boulder    CO      0 full time   mid level
## 11 machine learning engineer  Denver    CO      0 full time   mid level
## 12 machine learning engineer Boulder    CO      0 full time   mid level
# scraping Indeed.com for each subgroup of interest 
indeed_salaries <- NULL
for (i in 1:nrow(args)) {
  
  indeed_salaries <- rbind(indeed_salaries, 
                           scrape_indeed(args[i, 'Title']
                                        ,args[i, 'City']
                                        ,args[i, 'State']
                                        ,args[i, 'Radius']
                                        ,args[i, 'Contract']
                                        ,args[i, 'Level']))

}

The final dataset is long, each observation is a “job” and consists of the “mean salary” for the job in a given bin, aggregated at the job title, location, contract type and seniority level.

head(indeed_salaries, 10)
##    Salary   City         Title       Level
## 1   51100 Denver data engineer entry level
## 2   51100 Denver data engineer entry level
## 3   51100 Denver data engineer entry level
## 4   51100 Denver data engineer entry level
## 5   51100 Denver data engineer entry level
## 6   51100 Denver data engineer entry level
## 7   51100 Denver data engineer entry level
## 8   51100 Denver data engineer entry level
## 9   51100 Denver data engineer entry level
## 10  51100 Denver data engineer entry level
tail(indeed_salaries, 10)
##      Salary    City                     Title     Level
## 3012 112800 Boulder machine learning engineer mid level
## 3013 112800 Boulder machine learning engineer mid level
## 3014 112800 Boulder machine learning engineer mid level
## 3015 112800 Boulder machine learning engineer mid level
## 3016 120700 Boulder machine learning engineer mid level
## 3017 120700 Boulder machine learning engineer mid level
## 3018 120700 Boulder machine learning engineer mid level
## 3019 120700 Boulder machine learning engineer mid level
## 3020 120700 Boulder machine learning engineer mid level
## 3021 120700 Boulder machine learning engineer mid level

Group averages & Summary Table

We calculate group averages for each sub-group (aggregate level) for ease in plotting and for summarizing the data. Note that this is in effect a mean of means type of average.

# creating summary dataframe for summary table
sdf <- data.frame(summary(indeed_salaries[,c('City','Level','Title')]))
sdf <- sdf[complete.cases(sdf), -1]
sdf$Freq <- as.character(sdf$Freq)

sdf$Name <- NULL
sdf$Value <- NULL
for (i in 1:nrow(sdf)) {
  sdf$Name[i] <- trimws(strsplit(sdf$Freq[i], ":")[[1]][1])
  sdf$Value[i] <- trimws(strsplit(sdf$Freq[i], ":")[[1]][2])
}

sdf$Freq <- NULL
sdf$Value <- as.numeric(sdf$Value)
colnames(sdf) <- c("Var","Name","Value")

# calculating group averages
group_avgs <- indeed_salaries %>% 
                group_by(City, Title, Level) %>%
                summarize('mean_salary' = round(mean(Salary), 0),
                          'median_salary' = round(median(Salary), 0))

group_avgs <- data.frame(group_avgs)

# converting into accounting format for better display
accounting_format <- function(x) {

    x <- as.character(x)
    x1 <- substr(x, 1, nchar(x)-3)
    x2 <- substr(x, nchar(x)-2, nchar(x))

    out <- paste0('$', x1, ",", x2, sep="")
    out
}

group_avgs$mean_salary <- accounting_format(group_avgs$mean_salary)
group_avgs$median_salary <- accounting_format(group_avgs$median_salary)
group_avgs$x_axis <- 110000  # for label plotting purposes
group_avgs$y_axis <- c(rep(500, 6), rep(470, 6))  # for label plotting purposes
group_avgs[, 1:5]
##       City                     Title       Level mean_salary median_salary
## 1   Denver             data engineer entry level     $69,405         66100
## 2   Denver             data engineer   mid level     $89,738         90000
## 3   Denver            data scientist entry level     $88,143         83300
## 4   Denver            data scientist   mid level    $103,338        100600
## 5   Denver machine learning engineer entry level     $70,622         67600
## 6   Denver machine learning engineer   mid level    $103,103        100100
## 7  Boulder             data engineer entry level     $71,898         75700
## 8  Boulder             data engineer   mid level     $91,037         90200
## 9  Boulder            data scientist entry level     $74,661         72200
## 10 Boulder            data scientist   mid level     $94,772         91900
## 11 Boulder machine learning engineer entry level     $93,786         97600
## 12 Boulder machine learning engineer   mid level     $95,266         96900

Plotting

Finally, we construct boxplots to comapre distributions, and barplots to look at them more closely. See the plots back in the Overview.

# BOXPLOTS

# plot title and y axis label 
gg_title <- 'Indeed.com\'s Salary Distributions for Full-time Data Positions in Boulder and Denver.'
y_lab <- 'Salary Range'
            
tally_tb <- indeed_salaries %>% 
            group_by(City, Title, Level, Salary) %>% tally()

ggplot(data.frame(tally_tb), aes(City, fill = City)) +
    geom_boxplot(aes(x = City, y = Salary)) + 
    facet_wrap(Level ~ Title) +
    labs(title=gg_title, y=y_lab) + 
    theme_bw()


# BARPLOTS

# plot title and y axis label 
gg_title <- 'Indeed.com\'s Mean Salaries for Full-time Data Positions in Boulder and Denver.'
y_lab <- 'Number of Jobs'

g <- ggplot(indeed_salaries, aes(Salary, fill = City)) +
        geom_histogram(bins = 20) + 
        facet_wrap(Level ~ Title) +
        labs(title=gg_title, y=y_lab) + 
        theme_bw()
        
# adding inner labels
g + geom_text(data = group_avgs, 
              mapping = aes(x=x_axis, y=y_axis), 
              label = paste0(group_avgs$City, " mean: ", 
                             group_avgs$mean_salary, sep=""), 
              size=3) 

Back to Contents