I have been curious about what the job market will look like once I graduate in May, so I’ve been looking at job listings on sites such as Indeed.com. I intended to aggregate and analyze data from Indeed.com or JobsInME.com, but their terms of service explicitly prohibit web scraping, so I began looking for open-source job data. I stumbled upon the Bureau of Labor Statistics website, where there is open-source data about employment projections from 2019-2029. I reviewed their Privacy & Security Statement and the Linking & Copyright Info page and found that they prohibit the use of excessive robot activity. I also checked the robots.txt before I scraped the page that I wanted. Since I am not planning to scrape the site every second, and I did not see a “Disallow” command for the job projections page, I proceeded with my scraping of the BLS.gov website.
I begin by downloading the HTML using the read_html() function and then use the SelectorGadget tool in my browser to select the CSS tags of the information I want from the table. I chose the following data:
I copy each CSS tag into a html_nodes() function to get the data I want from this web page. After using the head() function on the retrieved data, I see I have some cleaning to do.
library(dplyr)
library(tidyverse)
library(rvest)
library(stringr)
library(reshape2)
bls_projections <- read_html("https://www.bls.gov/emp/tables/occupational-projections-and-characteristics.htm")
occupation_title <- bls_projections %>% html_nodes("tbody th p") %>% html_text()
head(occupation_title)
## [1] "Total, all occupations" "Management occupations"
## [3] "Top executives" "Chief executives"
## [5] "General and operations managers" "Legislators"
occupation_type <- bls_projections %>% html_nodes("td:nth-child(3) p") %>% html_text()
head(occupation_type)
## [1] "Summary" "Summary" "Summary" "Line item" "Line item" "Line item"
percent_employ_change <- bls_projections %>% html_nodes("td:nth-child(7) .datacell") %>% html_text()
percent_employ_change <-as.numeric(percent_employ_change)
head(percent_employ_change)
## [1] 3.7 4.7 4.2 -10.0 5.8 6.2
The first thing that stood out was the dollar sign ($) in front of the median_wage_2019 values and commas in the numbers. I know this will impact how the number is treated in r, so I remove the $ and the comma with the str_replace_all() function. I combined all of the data into a data frame, and the numbers look good until I do some analysis. The error code for an “atomic vector” would come up, and I realized my numbers were character vectors, so I had to re-write them as number vectors using the as.numeric() function.
median_wage_2019 <- bls_projections %>% html_nodes("td:nth-child(10) p") %>% html_text()
median_wage_2019 <- median_wage_2019 %>% str_replace_all("\\$ ", "")
median_wage_2019<-gsub(",","", median_wage_2019)
median_wage_2019 <-as.numeric(median_wage_2019)
head(median_wage_2019)
## [1] 39810 105660 103230 184460 100780 29270
education_needed <- bls_projections %>% html_nodes("td:nth-child(11) p") %>% html_text()
head(education_needed)
## [1] "—" "—" "—"
## [4] "Bachelor's degree" "Bachelor's degree" "Bachelor's degree"
experience_needed <- bls_projections %>% html_nodes ("td:nth-child(12) p") %>% html_text()
head(experience_needed)
## [1] "—" "—" "—"
## [4] "5 years or more" "5 years or more" "Less than 5 years"
The other issue I ran into is that there is a lot of data on this web page. Many of the job titles were “summary items” (e.g., “Management occupations”) that did not provide specific job projections or median wage data, so I filtered them out. I also found that the table used “–” to denote a missing value, so I coded them as NA and then filtered those rows out as well.
I was mostly interested in the jobs with the most projected growth and median annual incomes over $50,000. I created another data frame of jobs with percent change over 20.0 and median wage (2019) greater than or equal to $50,000. I then created a bar chart of the data frame to visualize the jobs that met my specifications.
occupation_data <-data.frame(Occupation = occupation_title, Type = occupation_type, Percent_Change = percent_employ_change, Median_Wage_2019 = median_wage_2019, Education_Needed = education_needed, Experience_Needed = experience_needed)
occupation_data[occupation_data == "—"] <- NA
str(occupation_data)
## 'data.frame': 1048 obs. of 6 variables:
## $ Occupation : chr "Total, all occupations" "Management occupations" "Top executives" "Chief executives" ...
## $ Type : chr "Summary" "Summary" "Summary" "Line item" ...
## $ Percent_Change : num 3.7 4.7 4.2 -10 5.8 6.2 5.1 -1.1 4.8 6.7 ...
## $ Median_Wage_2019 : num 39810 105660 103230 184460 100780 ...
## $ Education_Needed : chr NA NA NA "Bachelor's degree" ...
## $ Experience_Needed: chr NA NA NA "5 years or more" ...
occupation_percent_change <- select(filter(occupation_data, occupation_type == "Line item" & percent_employ_change >= 20.0 & median_wage_2019 >= 50000), c(1, 3:6)) %>% drop_na() %>% arrange(desc(Percent_Change))
library(jcolors)
library(ggplot2)
library(scales)
ggplot(data = occupation_percent_change %>% filter(Occupation %in% c("Wind turbine service technicians", "Nurse practitioners", "Occupational therapy assistants", "Statisticians", "Physical therapist assistants", "Medical and health services managers", "Physician assistants", "Information security analysts", "Data scientists and mathematical science occupations, all other", "Rotary drill operators, oil and gas")),
aes(x = reorder(Occupation, Percent_Change), y = Percent_Change, fill = as.factor(Median_Wage_2019))) + theme_bw() +
geom_bar(stat="identity") + scale_fill_jcolors(palette = "pal8", name = "2019 Median Wage", labels = c("$52,910", "$54,980", "$58,790", "$61,510", "$91,160", "$94,280", "$99,730", "$100,980", "$109,820", "$112,260")) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) + scale_x_discrete(labels = wrap_format(30)) + theme(axis.text.x = element_text( hjust = 1)) + coord_flip() +
ggtitle("Percent Growth by Occupation (2019-2029) with Median Income >= $50,000") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Occupation") + ylab("Percent Change")
Data source: The U.S. Bureau of Labor Statistics
Surprisingly, the job with the most projected growth (60.7) from 2019 to 2029, wind turbine service technicians, has the group’s lowest annual median income at $52,910. The data point made me curious about jobs with a 2019 median income below $50,000 and a high percentage of projected growth, so I created a new data frame filtering jobs with over 20.0 percent expected growth and earning less than $50,000 (median wage in 2019). I then used ggplot() to create a bar chart of the jobs and see that there is one low-income job (2019 median wage is $25,280) with a high projected percentage of growth (33.7 percent), home health aides.
lower_income_growth <- select(filter(occupation_data, occupation_type == "Line item" & percent_employ_change >= 20.0 & median_wage_2019 < 50000), c(1, 3:6)) %>% drop_na() %>% arrange(desc(Percent_Change))
ggplot(data = lower_income_growth %>% filter(Occupation %in% c("Solar photovoltaic installers", "Home health and personal care aides", "Derrick operators, oil and gas", "Roustabouts, oil and gas", "Substance abuse, behavioral disorder, and mental health counselors", "Forest fire inspectors and prevention specialists", "Cooks, restaurant", "Service unit operators, oil and gas", "Animal caretakers", "Marriage and family therapists")),
aes(x = reorder(Occupation, Percent_Change), y = Percent_Change, fill = as.factor(Median_Wage_2019))) + theme_bw() +
geom_bar(stat="identity") + scale_fill_jcolors(palette = "pal8", name = "2019 Median Wage", labels = c("$24,780", "$25,280", "$27,790", "$38,910", "$44,890", "$45,270", "$46,240", "$46,740", "$46,990", "$49,610")) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) + scale_x_discrete(labels = wrap_format(30)) + theme(axis.text.x = element_text( hjust = 1)) + coord_flip() +
ggtitle("Percent Growth by Occupation (2019-2029) with Median Income < $50,000") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Occupation") + ylab("Percent Change")
Data source: The U.S. Bureau of Labor Statistics
Many of the jobs listed in the high percentage of expected growth were not surprising to me (e.g., nurse practitioners, physicians assistants, and home health aides). I was concerned to see the high percentage of projected growth for home health aides with a 2019 median income of $25,280. There will need to be systemic and policy change to attract people to a high-stress job with low wages, but there is no doubt that we will need a larger home health workforce in the next decade.
I became curious about the education and experience required for some of these high-demand occupations that I was less familiar with (e.g., wind turbine service technicians). I made a table using the kable() function containing educational and experience requirements for jobs with percent change over 20.0 and median wage (2019) greater than or equal to $50,000. To my surprise, many of these high demand jobs do not require years of experience in the field or high-level education. Some only require associate’s degrees, bachelor’s degrees, and post-secondary non-degree education.
library(knitr)
kable(occupation_percent_change, align = "lccll", col.names = c('Occupation', 'Percent Change', 'Median Wage (2019)', 'Education Needed', 'Experience Needed'), caption = "Occupations Earning $50,000+ with Highest Percent Growth for 2019-2029")
| Occupation | Percent Change | Median Wage (2019) | Education Needed | Experience Needed |
|---|---|---|---|---|
| Wind turbine service technicians | 60.7 | 52910 | Postsecondary nondegree award | None |
| Nurse practitioners | 52.4 | 109820 | Master’s degree | None |
| Statisticians | 34.6 | 91160 | Master’s degree | None |
| Occupational therapy assistants | 34.6 | 61510 | Associate’s degree | None |
| Physical therapist assistants | 32.6 | 58790 | Associate’s degree | None |
| Medical and health services managers | 31.5 | 100980 | Bachelor’s degree | Less than 5 years |
| Physician assistants | 31.3 | 112260 | Master’s degree | None |
| Information security analysts | 31.2 | 99730 | Bachelor’s degree | Less than 5 years |
| Data scientists and mathematical science occupations, all other | 30.9 | 94280 | Bachelor’s degree | None |
| Rotary drill operators, oil and gas | 26.9 | 54980 | No formal educational credential | None |
| Speech-language pathologists | 24.9 | 79120 | Master’s degree | None |
| Operations research analysts | 24.8 | 84810 | Bachelor’s degree | None |
| Computer numerically controlled tool programmers | 21.9 | 56450 | Postsecondary nondegree award | None |
| Film and video editors | 21.6 | 63780 | Bachelor’s degree | None |
| Software developers and software quality assurance analysts and testers | 21.5 | 107510 | Bachelor’s degree | None |
| Genetic counselors | 21.5 | 81880 | Master’s degree | None |
| Health specialties teachers, postsecondary | 20.5 | 97320 | Doctoral or professional degree | Less than 5 years |
| Interpreters and translators | 20.0 | 51830 | Bachelor’s degree | None |
Data source: The U.S. Bureau of Labor Statistics
The last thing I was curious about were the jobs that showed negative growth for the next decade. I wanted to know which jobs requiring a bachelor’s degree had negative projected growth for 2019-2029. I was surprised to see some of the occupations listed in the table I created (e.g., adult education, computer programmers, fashion designers, commercial and industrial designers, and graphic designers). I know that graphic design is a saturated job market, but I wonder what other factors are at play for the decrease in demand for these jobs that require bachelor’s degrees.
The BLS.gov website is a great resource for people who wish to invest in a career change. Someone who wishes to become a Fashion Designer may be attracted to the higher median wage but can see the decline in projected job growth and may not want to invest in schooling for four years if the job market is highly competitive.
negative_percent_change <- select(filter(occupation_data, occupation_type == "Line item" & percent_employ_change < 0.0 & education_needed == "Bachelor's degree"), c(1, 3:6)) %>% drop_na() %>% arrange(Percent_Change)
kable(negative_percent_change, align = "lccll", col.names = c('Occupation', 'Percent Change', 'Median Wage (2019)', 'Education Needed', 'Experience Needed'), caption = "Occupations Requiring a Bachelor's Degree with Negative Percent Growth for 2019-2029")
| Occupation | Percent Change | Median Wage (2019) | Education Needed | Experience Needed |
|---|---|---|---|---|
| Nuclear engineers | -12.6 | 113460 | Bachelor’s degree | None |
| News analysts, reporters, and journalists | -11.2 | 46270 | Bachelor’s degree | None |
| Adult basic education, adult secondary education, and English as a Second Language instructors | -10.4 | 54350 | Bachelor’s degree | None |
| Chief executives | -10.0 | 184460 | Bachelor’s degree | 5 years or more |
| Computer programmers | -9.4 | 86550 | Bachelor’s degree | None |
| Buyers and purchasing agents | -8.8 | 64380 | Bachelor’s degree | None |
| Editors | -7.3 | 61370 | Bachelor’s degree | Less than 5 years |
| Labor relations specialists | -6.8 | 69020 | Bachelor’s degree | Less than 5 years |
| Insurance underwriters | -6.2 | 70020 | Bachelor’s degree | None |
| Broadcast announcers and radio disc jockeys | -5.3 | 34630 | Bachelor’s degree | None |
| Interior designers | -4.9 | 56040 | Bachelor’s degree | None |
| Credit analysts | -4.8 | 73650 | Bachelor’s degree | None |
| Fashion designers | -4.4 | 73790 | Bachelor’s degree | None |
| Tax examiners and collectors, and revenue agents | -4.1 | 54890 | Bachelor’s degree | None |
| Graphic designers | -3.8 | 52110 | Bachelor’s degree | None |
| Commercial and industrial designers | -3.5 | 68890 | Bachelor’s degree | None |
| Physical scientists, all other | -3.0 | 109910 | Bachelor’s degree | None |
| Proofreaders and copy markers | -2.7 | 40630 | Bachelor’s degree | None |
| Landscape architects | -2.4 | 69360 | Bachelor’s degree | None |
| Writers and authors | -2.3 | 63200 | Bachelor’s degree | None |
| Designers, all other | -1.9 | 64620 | Bachelor’s degree | None |
| Personal service managers, all other; entertainment and recreation managers, except gambling; and managers, all other | -1.8 | 110630 | Bachelor’s degree | Less than 5 years |
| Art directors | -1.8 | 94220 | Bachelor’s degree | 5 years or more |
| Cost estimators | -1.5 | 65250 | Bachelor’s degree | None |
| Geographers | -1.2 | 81540 | Bachelor’s degree | None |
| Advertising and promotions managers | -1.1 | 125510 | Bachelor’s degree | Less than 5 years |
Data source: The U.S. Bureau of Labor Statistics
Note: The 2019-2029 occupational projections data scraped from the BLS.gov website has not been updated to reflect the effects of the 2020 Covid-19 pandemic.