Confirm Use of Web Scraping on BLS.gov

I have been curious about what the job market will look like once I graduate in May, so I’ve been looking at job listings on sites such as Indeed.com. I intended to aggregate and analyze data from Indeed.com or JobsInME.com, but their terms of service explicitly prohibit web scraping, so I began looking for open-source job data. I stumbled upon the Bureau of Labor Statistics website, where there is open-source data about employment projections from 2019-2029. I reviewed their Privacy & Security Statement and the Linking & Copyright Info page and found that they prohibit the use of excessive robot activity. I also checked the robots.txt before I scraped the page that I wanted. Since I am not planning to scrape the site every second, and I did not see a “Disallow” command for the job projections page, I proceeded with my scraping of the BLS.gov website.

Scraping Steps

I begin by downloading the HTML using the read_html() function and then use the SelectorGadget tool in my browser to select the CSS tags of the information I want from the table. I chose the following data:

I copy each CSS tag into a html_nodes() function to get the data I want from this web page. After using the head() function on the retrieved data, I see I have some cleaning to do.

library(dplyr)
library(tidyverse)
library(rvest)
library(stringr)
library(reshape2)

bls_projections <- read_html("https://www.bls.gov/emp/tables/occupational-projections-and-characteristics.htm")
occupation_title <- bls_projections %>% html_nodes("tbody th p") %>% html_text()
head(occupation_title)
## [1] "Total, all occupations"          "Management occupations"         
## [3] "Top executives"                  "Chief executives"               
## [5] "General and operations managers" "Legislators"
occupation_type <- bls_projections %>% html_nodes("td:nth-child(3) p") %>% html_text()
head(occupation_type)
## [1] "Summary"   "Summary"   "Summary"   "Line item" "Line item" "Line item"
percent_employ_change <- bls_projections %>% html_nodes("td:nth-child(7) .datacell") %>% html_text()
percent_employ_change <-as.numeric(percent_employ_change)
head(percent_employ_change)
## [1]   3.7   4.7   4.2 -10.0   5.8   6.2

Issues with Scraping and Cleaning the Data

The first thing that stood out was the dollar sign ($) in front of the median_wage_2019 values and commas in the numbers. I know this will impact how the number is treated in r, so I remove the $ and the comma with the str_replace_all() function. I combined all of the data into a data frame, and the numbers look good until I do some analysis. The error code for an “atomic vector” would come up, and I realized my numbers were character vectors, so I had to re-write them as number vectors using the as.numeric() function.

median_wage_2019 <- bls_projections %>% html_nodes("td:nth-child(10) p") %>% html_text()
median_wage_2019 <- median_wage_2019 %>% str_replace_all("\\$ ", "")
median_wage_2019<-gsub(",","", median_wage_2019)
median_wage_2019 <-as.numeric(median_wage_2019)
head(median_wage_2019)
## [1]  39810 105660 103230 184460 100780  29270
education_needed <- bls_projections %>% html_nodes("td:nth-child(11) p") %>% html_text()
head(education_needed)
## [1] "—"                 "—"                 "—"                
## [4] "Bachelor's degree" "Bachelor's degree" "Bachelor's degree"
experience_needed <- bls_projections %>% html_nodes ("td:nth-child(12) p") %>% html_text()
head(experience_needed)
## [1] "—"                 "—"                 "—"                
## [4] "5 years or more"   "5 years or more"   "Less than 5 years"

Additional Issues

The other issue I ran into is that there is a lot of data on this web page. Many of the job titles were “summary items” (e.g., “Management occupations”) that did not provide specific job projections or median wage data, so I filtered them out. I also found that the table used “–” to denote a missing value, so I coded them as NA and then filtered those rows out as well.

Creating Data Frames

I was mostly interested in the jobs with the most projected growth and median annual incomes over $50,000. I created another data frame of jobs with percent change over 20.0 and median wage (2019) greater than or equal to $50,000. I then created a bar chart of the data frame to visualize the jobs that met my specifications.

occupation_data <-data.frame(Occupation = occupation_title, Type = occupation_type, Percent_Change = percent_employ_change, Median_Wage_2019 = median_wage_2019, Education_Needed = education_needed, Experience_Needed = experience_needed)
occupation_data[occupation_data == "—"] <- NA 
str(occupation_data)
## 'data.frame':    1048 obs. of  6 variables:
##  $ Occupation       : chr  "Total, all occupations" "Management occupations" "Top executives" "Chief executives" ...
##  $ Type             : chr  "Summary" "Summary" "Summary" "Line item" ...
##  $ Percent_Change   : num  3.7 4.7 4.2 -10 5.8 6.2 5.1 -1.1 4.8 6.7 ...
##  $ Median_Wage_2019 : num  39810 105660 103230 184460 100780 ...
##  $ Education_Needed : chr  NA NA NA "Bachelor's degree" ...
##  $ Experience_Needed: chr  NA NA NA "5 years or more" ...
occupation_percent_change <- select(filter(occupation_data, occupation_type == "Line item" & percent_employ_change >= 20.0 & median_wage_2019 >= 50000), c(1, 3:6)) %>% drop_na() %>% arrange(desc(Percent_Change))

Bar Charts

library(jcolors)
library(ggplot2)
library(scales)

ggplot(data = occupation_percent_change %>% filter(Occupation %in% c("Wind turbine service technicians", "Nurse practitioners", "Occupational therapy assistants", "Statisticians", "Physical therapist assistants", "Medical and health services managers", "Physician assistants", "Information security analysts", "Data scientists and mathematical science occupations, all other", "Rotary drill operators, oil and gas")),
       aes(x = reorder(Occupation, Percent_Change), y = Percent_Change, fill = as.factor(Median_Wage_2019))) + theme_bw() + 
geom_bar(stat="identity") + scale_fill_jcolors(palette = "pal8", name = "2019 Median Wage", labels = c("$52,910", "$54,980", "$58,790", "$61,510", "$91,160", "$94,280", "$99,730", "$100,980", "$109,820", "$112,260")) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) + scale_x_discrete(labels = wrap_format(30)) + theme(axis.text.x = element_text( hjust = 1)) + coord_flip() +
ggtitle("Percent Growth by Occupation (2019-2029) with Median Income >= $50,000") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Occupation") + ylab("Percent Change")

Data source: The U.S. Bureau of Labor Statistics

Discussion

Surprisingly, the job with the most projected growth (60.7) from 2019 to 2029, wind turbine service technicians, has the group’s lowest annual median income at $52,910. The data point made me curious about jobs with a 2019 median income below $50,000 and a high percentage of projected growth, so I created a new data frame filtering jobs with over 20.0 percent expected growth and earning less than $50,000 (median wage in 2019). I then used ggplot() to create a bar chart of the jobs and see that there is one low-income job (2019 median wage is $25,280) with a high projected percentage of growth (33.7 percent), home health aides.

lower_income_growth <- select(filter(occupation_data, occupation_type == "Line item" & percent_employ_change >= 20.0 & median_wage_2019 < 50000), c(1, 3:6)) %>% drop_na() %>% arrange(desc(Percent_Change))

ggplot(data = lower_income_growth %>% filter(Occupation %in% c("Solar photovoltaic installers", "Home health and personal care aides", "Derrick operators, oil and gas", "Roustabouts, oil and gas", "Substance abuse, behavioral disorder, and mental health counselors", "Forest fire inspectors and prevention specialists", "Cooks, restaurant", "Service unit operators, oil and gas", "Animal caretakers", "Marriage and family therapists")),
       aes(x = reorder(Occupation, Percent_Change), y = Percent_Change, fill = as.factor(Median_Wage_2019))) + theme_bw() + 
geom_bar(stat="identity") + scale_fill_jcolors(palette = "pal8", name = "2019 Median Wage", labels = c("$24,780", "$25,280", "$27,790", "$38,910", "$44,890", "$45,270", "$46,240", "$46,740", "$46,990", "$49,610")) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) + scale_x_discrete(labels = wrap_format(30)) + theme(axis.text.x = element_text( hjust = 1)) + coord_flip() +
ggtitle("Percent Growth by Occupation (2019-2029) with Median Income < $50,000") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Occupation") + ylab("Percent Change")

Data source: The U.S. Bureau of Labor Statistics

Many of the jobs listed in the high percentage of expected growth were not surprising to me (e.g., nurse practitioners, physicians assistants, and home health aides). I was concerned to see the high percentage of projected growth for home health aides with a 2019 median income of $25,280. There will need to be systemic and policy change to attract people to a high-stress job with low wages, but there is no doubt that we will need a larger home health workforce in the next decade.

Table Creation

I became curious about the education and experience required for some of these high-demand occupations that I was less familiar with (e.g., wind turbine service technicians). I made a table using the kable() function containing educational and experience requirements for jobs with percent change over 20.0 and median wage (2019) greater than or equal to $50,000. To my surprise, many of these high demand jobs do not require years of experience in the field or high-level education. Some only require associate’s degrees, bachelor’s degrees, and post-secondary non-degree education.

library(knitr)
kable(occupation_percent_change, align = "lccll", col.names = c('Occupation', 'Percent Change', 'Median Wage (2019)', 'Education Needed', 'Experience Needed'), caption = "Occupations Earning $50,000+ with Highest Percent Growth for 2019-2029")
Occupations Earning $50,000+ with Highest Percent Growth for 2019-2029
Occupation Percent Change Median Wage (2019) Education Needed Experience Needed
Wind turbine service technicians 60.7 52910 Postsecondary nondegree award None
Nurse practitioners 52.4 109820 Master’s degree None
Statisticians 34.6 91160 Master’s degree None
Occupational therapy assistants 34.6 61510 Associate’s degree None
Physical therapist assistants 32.6 58790 Associate’s degree None
Medical and health services managers 31.5 100980 Bachelor’s degree Less than 5 years
Physician assistants 31.3 112260 Master’s degree None
Information security analysts 31.2 99730 Bachelor’s degree Less than 5 years
Data scientists and mathematical science occupations, all other 30.9 94280 Bachelor’s degree None
Rotary drill operators, oil and gas 26.9 54980 No formal educational credential None
Speech-language pathologists 24.9 79120 Master’s degree None
Operations research analysts 24.8 84810 Bachelor’s degree None
Computer numerically controlled tool programmers 21.9 56450 Postsecondary nondegree award None
Film and video editors 21.6 63780 Bachelor’s degree None
Software developers and software quality assurance analysts and testers 21.5 107510 Bachelor’s degree None
Genetic counselors 21.5 81880 Master’s degree None
Health specialties teachers, postsecondary 20.5 97320 Doctoral or professional degree Less than 5 years
Interpreters and translators 20.0 51830 Bachelor’s degree None

Data source: The U.S. Bureau of Labor Statistics

Final Analysis and Thoughts

The last thing I was curious about were the jobs that showed negative growth for the next decade. I wanted to know which jobs requiring a bachelor’s degree had negative projected growth for 2019-2029. I was surprised to see some of the occupations listed in the table I created (e.g., adult education, computer programmers, fashion designers, commercial and industrial designers, and graphic designers). I know that graphic design is a saturated job market, but I wonder what other factors are at play for the decrease in demand for these jobs that require bachelor’s degrees.

The BLS.gov website is a great resource for people who wish to invest in a career change. Someone who wishes to become a Fashion Designer may be attracted to the higher median wage but can see the decline in projected job growth and may not want to invest in schooling for four years if the job market is highly competitive.

negative_percent_change <- select(filter(occupation_data, occupation_type == "Line item" & percent_employ_change < 0.0 & education_needed == "Bachelor's degree"), c(1, 3:6)) %>% drop_na() %>% arrange(Percent_Change)

kable(negative_percent_change, align = "lccll", col.names = c('Occupation', 'Percent Change', 'Median Wage (2019)', 'Education Needed', 'Experience Needed'), caption = "Occupations Requiring a Bachelor's Degree with Negative Percent Growth for 2019-2029")
Occupations Requiring a Bachelor’s Degree with Negative Percent Growth for 2019-2029
Occupation Percent Change Median Wage (2019) Education Needed Experience Needed
Nuclear engineers -12.6 113460 Bachelor’s degree None
News analysts, reporters, and journalists -11.2 46270 Bachelor’s degree None
Adult basic education, adult secondary education, and English as a Second Language instructors -10.4 54350 Bachelor’s degree None
Chief executives -10.0 184460 Bachelor’s degree 5 years or more
Computer programmers -9.4 86550 Bachelor’s degree None
Buyers and purchasing agents -8.8 64380 Bachelor’s degree None
Editors -7.3 61370 Bachelor’s degree Less than 5 years
Labor relations specialists -6.8 69020 Bachelor’s degree Less than 5 years
Insurance underwriters -6.2 70020 Bachelor’s degree None
Broadcast announcers and radio disc jockeys -5.3 34630 Bachelor’s degree None
Interior designers -4.9 56040 Bachelor’s degree None
Credit analysts -4.8 73650 Bachelor’s degree None
Fashion designers -4.4 73790 Bachelor’s degree None
Tax examiners and collectors, and revenue agents -4.1 54890 Bachelor’s degree None
Graphic designers -3.8 52110 Bachelor’s degree None
Commercial and industrial designers -3.5 68890 Bachelor’s degree None
Physical scientists, all other -3.0 109910 Bachelor’s degree None
Proofreaders and copy markers -2.7 40630 Bachelor’s degree None
Landscape architects -2.4 69360 Bachelor’s degree None
Writers and authors -2.3 63200 Bachelor’s degree None
Designers, all other -1.9 64620 Bachelor’s degree None
Personal service managers, all other; entertainment and recreation managers, except gambling; and managers, all other -1.8 110630 Bachelor’s degree Less than 5 years
Art directors -1.8 94220 Bachelor’s degree 5 years or more
Cost estimators -1.5 65250 Bachelor’s degree None
Geographers -1.2 81540 Bachelor’s degree None
Advertising and promotions managers -1.1 125510 Bachelor’s degree Less than 5 years

Data source: The U.S. Bureau of Labor Statistics

Note: The 2019-2029 occupational projections data scraped from the BLS.gov website has not been updated to reflect the effects of the 2020 Covid-19 pandemic.