The goal of this project is to understand the relationship between income, education, and automation. It asks the following: Is a given job’s income correlated to its likelihood of automation? Are jobs which are predominantly less educated more or less likely to be automated? How many workers are in the industries that will be automated?
Begin by loading packages.
rm(list=ls())
library(ggplot2)
library(ggthemes)
library(dplyr)
library(ggrepel)
library(tools)
library(readxl)
library(tidyverse)
options(scipen=999)
There are three datasets for this project. 1) Educational attainment broke down by occupation, provided by BLS here 2) Salaries, median hourly/annual wages broke down by occupation, provided by BLS here 3) Risk of automation broken down by occupation, provided by Carl Benedikt Frey and Michael A. Osborne (but compiled here)
setwd("/Users/connorrothschild/Desktop/R/Automation Project")
education <- read_excel("education.xlsx", skip=1)
salary <- read_excel("national_M2017_dl.xlsx")
automation <- read_excel("raw_state_automation_data.xlsx")
I’ll begin by finding which occupations contribute most to the American economy (in USD).
salary1 <- salary %>%
group_by(OCC_TITLE) %>%
mutate(natlwage = TOT_EMP * as.numeric(A_MEAN)) %>%
filter(!is.na(TOT_EMP)) %>%
filter(!is.na(A_MEAN)) %>%
filter(!is.na(A_MEDIAN))
salary1$A_MEDIAN = as.numeric(as.character(salary1$A_MEDIAN))
salary2 <- select(salary1, OCC_TITLE, TOT_EMP, A_MEDIAN, natlwage) %>%
distinct()
options(scipen=999)
salary2 %>%
arrange(desc(natlwage))
## # A tibble: 1,105 x 4
## # Groups: OCC_TITLE [1,104]
## OCC_TITLE TOT_EMP A_MEDIAN natlwage
## <chr> <dbl> <dbl> <dbl>
## 1 All Occupations 1.43e8 37690. 7.22e12
## 2 Management Occupations 7.28e6 102590. 8.73e11
## 3 Office and Administrative Support Occupati… 2.20e7 34740. 8.34e11
## 4 Healthcare Practitioners and Technical Occ… 8.51e6 64770. 6.87e11
## 5 Sales and Related Occupations 1.45e7 27020. 5.91e11
## 6 Business and Financial Operations Occupati… 7.47e6 67710. 5.70e11
## 7 Health Diagnosing and Treating Practitione… 5.27e6 79480. 5.31e11
## 8 Education, Training, and Library Occupatio… 8.73e6 48740. 4.84e11
## 9 Computer and Mathematical Occupations 4.26e6 84560. 3.83e11
## 10 Transportation and Material Moving Occupat… 9.98e6 31600. 3.70e11
## # ... with 1,095 more rows
Management occupations contribute the most, followed by office and administrative support, healthcare practictioners and technical occupations.
This is a quick vizualization which represents the relationship between a job’s median wage and the number of Americans which have that job.
salary2 %>% filter(TOT_EMP < 15000000) %>%
ggplot(mapping=aes(x=TOT_EMP, y=A_MEDIAN)) +
geom_point(alpha=1/3, col="red") +
theme_bw() +
ggtitle("Median Wage vs Total Employment") +
xlab("Number of Americans in a Given Job") +
ylab("Median Wage of a Given Job")
We now have three data points:
The name of an occupation
The annual median wage of that occupation
The amount it contributes to the American economy (or, more specifically, the amount workers are paid as an entire occupation for their work, annually)
Next, I want to cross-reference the salary data with the education data.
library(plyr)
education1 <- education %>% select(-X__1)
education1 <- rename(education1, c("2016 National Employment Matrix title and code" = "occupation",
"Less than high school diploma" = "lessthanhs",
"High school diploma or equivalent" = "hsdiploma",
"Some college, no degree" = "somecollege",
"Associate's degree" = "associates",
"Bachelor's degree" = "bachelors",
"Master's degree" = "masters",
"Doctoral or professional degree" = "professional"))
education2 <- education1 %>%
group_by(occupation) %>%
mutate(hsorless = lessthanhs + hsdiploma,
somecollegeorassociates = somecollege + associates,
postgrad = masters + professional)
education2 <- education2 %>% drop_na()
Next, I want to join education2 and salary2 to start analysis of education’s effect on salary.
salary2 <- rename(salary2, c("OCC_TITLE" = "occupation"))
salary2$occupation <- tolower(salary2$occupation)
education2$occupation <- tolower(education2$occupation)
edsal <- merge(as.data.frame(education2), as.data.frame(salary2), by="occupation") %>% drop_na()
head(edsal)
## occupation
## 1 accountants and auditors
## 3 actuaries
## 4 adhesive bonding machine operators and tenders
## 5 administrative law judges, adjudicators, and hearing officers
## 6 administrative services managers
## 7 adult basic and secondary education and literacy teachers and instructors
## lessthanhs hsdiploma somecollege associates bachelors masters
## 1 0.0 4.0 7.6 9.2 55.6 20.8
## 3 0.0 0.0 1.2 1.5 62.1 24.9
## 4 21.6 49.2 21.5 2.2 4.9 0.5
## 5 0.2 0.4 0.6 0.5 5.0 4.4
## 6 2.1 17.9 26.1 12.1 29.7 10.3
## 7 1.8 10.2 18.4 8.5 35.9 20.5
## professional hsorless somecollegeorassociates postgrad TOT_EMP A_MEDIAN
## 1 2.8 4.0 16.8 23.6 1241000 69350
## 3 10.3 0.0 2.7 35.2 19210 101560
## 4 0.0 70.8 23.7 0.5 15860 32710
## 5 88.9 0.6 1.1 93.3 14480 94790
## 6 1.7 20.0 38.2 12.0 270100 94020
## 7 4.7 12.0 26.9 25.2 60670 52100
## natlwage
## 1 96698720000
## 3 2206268500
## 4 549231800
## 5 1423094400
## 6 27922938000
## 7 3446056000
At this point I’m realizing that having the educational breakdown (# of Bachelor’s degrees, PhD’s, etc.) per job is interesting but may not be able to reveal a lot of key insights.
So, I’m going to introduce a fourth dataset: the typical education of a worker in a given occupation, also provided by BLS and found here.
typicaleducation <- read_excel("typicaleducation.xlsx")
typicaleducation2 <- typicaleducation %>% select(occupation,typicaled,workexp)
typicaleducation2 <- typicaleducation2 %>% drop_na()
typicaleducation2$occupation <- tolower(typicaleducation2$occupation)
edsal2 <- merge(as.data.frame(edsal), as.data.frame(typicaleducation2), by="occupation")
head(edsal2)
## occupation
## 1 accountants and auditors
## 2 actuaries
## 3 adhesive bonding machine operators and tenders
## 4 administrative law judges, adjudicators, and hearing officers
## 5 administrative services managers
## 6 adult basic and secondary education and literacy teachers and instructors
## lessthanhs hsdiploma somecollege associates bachelors masters
## 1 0.0 4.0 7.6 9.2 55.6 20.8
## 2 0.0 0.0 1.2 1.5 62.1 24.9
## 3 21.6 49.2 21.5 2.2 4.9 0.5
## 4 0.2 0.4 0.6 0.5 5.0 4.4
## 5 2.1 17.9 26.1 12.1 29.7 10.3
## 6 1.8 10.2 18.4 8.5 35.9 20.5
## professional hsorless somecollegeorassociates postgrad TOT_EMP A_MEDIAN
## 1 2.8 4.0 16.8 23.6 1241000 69350
## 2 10.3 0.0 2.7 35.2 19210 101560
## 3 0.0 70.8 23.7 0.5 15860 32710
## 4 88.9 0.6 1.1 93.3 14480 94790
## 5 1.7 20.0 38.2 12.0 270100 94020
## 6 4.7 12.0 26.9 25.2 60670 52100
## natlwage typicaled workexp
## 1 96698720000 Bachelor's degree None
## 2 2206268500 Bachelor's degree None
## 3 549231800 High school diploma or equivalent None
## 4 1423094400 Doctoral or professional degree 5 years or more
## 5 27922938000 Bachelor's degree Less than 5 years
## 6 3446056000 Bachelor's degree None
This data allows us to ask: What is the median wage for each typical level of education?
detach(package:plyr)
edsal3 <- edsal2 %>%
group_by(typicaled) %>%
summarise(medianwage = mean(A_MEDIAN))
legend_ord <- levels(with(edsal3, reorder(typicaled, medianwage)))
ggplot(data=edsal3, aes(x = reorder(typicaled, medianwage), y = medianwage)) +
geom_col(aes(fill=typicaled)) +
ggtitle("Median Annual Income by Education Level") +
xlab("Education Level")+
ylab("Median Annual Income") +
labs(fill="Education Level") +
scale_fill_discrete(breaks=legend_ord) +
theme_minimal() +
theme(axis.text.x=element_blank())
The results are unsurpising: more educated people on average earn more.
Lastly, I bring in the automation data.
automationwstates <- automation %>% select(-soc)
automation1 <- automationwstates %>% select(occupation,probability,total)
head(automation)
## # A tibble: 6 x 55
## soc occupation probability alabama alaska arizona arkansas california
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11-10… Chief Exe… 0.0150 1030. 760. 5750. 2710. 31150.
## 2 11-10… General a… 0.160 26930. 6490. 43300. 20680. 261780.
## 3 2011-… Advertisi… 0.0390 50. 40. 470. 110. 3760.
## 4 2021-… Marketing… 0.0140 530. 200. 4790. 1090. 33390.
## 5 2022-… Sales Man… 0.0130 2510. 400. 10650. 2650. 69180.
## 6 2031-… Public Re… 0.0150 400. 150. 1240. 300. 7010.
## # ... with 47 more variables: colorado <dbl>, connecticut <dbl>,
## # delaware <dbl>, district_of_columbia <dbl>, florida <dbl>,
## # georgia <dbl>, hawaii <dbl>, idaho <dbl>, illinois <dbl>,
## # indiana <dbl>, iowa <dbl>, kansas <dbl>, kentucky <dbl>,
## # louisiana <dbl>, maine <dbl>, maryland <dbl>, massachusetts <dbl>,
## # michigan <dbl>, minnesota <dbl>, mississippi <dbl>, missouri <dbl>,
## # montana <dbl>, nebraska <dbl>, nevada <dbl>, new_hampshire <dbl>,
## # new_jersey <dbl>, new_mexico <dbl>, new_york <dbl>,
## # north_carolina <dbl>, north_dakota <dbl>, ohio <dbl>, oklahoma <dbl>,
## # oregon <dbl>, pennsylvania <dbl>, rhode_island <dbl>,
## # south_carolina <dbl>, south_dakota <dbl>, tennessee <dbl>,
## # texas <dbl>, utah <dbl>, vermont <dbl>, virginia <dbl>,
## # washington <dbl>, west_virginia <dbl>, wisconsin <dbl>, wyoming <dbl>,
## # total <dbl>
Next, I visualize the probability of automation vs total employment:
automation1 %>%
ggplot(mapping=aes(x=total, y=probability)) +
geom_point(alpha=1/3, col="blue") +
theme_bw() +
ggtitle("Probability of Automation vs Total Employment") +
xlab("Number of Americans in a Given Job") +
ylab("Probability of Automation") +
geom_label_repel(data=subset(automation1, total > 4000000),
aes(total, probability,label=occupation), label.size=.5, label.r=.05, size=2.5, nudge_y = .05, nudge_x= -10000)
There doesn’t seem to be a huge relationship between automation and number of employees, however there is some concentration at each of the poles.
Some final data cleaning, and the merge of the final dataset:
automation1$occupation <- str_replace_all(automation1$occupation, ";", ",")
automation1$occupation <- tolower(automation$occupation)
data <- merge(as.data.frame(edsal2), as.data.frame(automation1), by="occupation")
summary(data)
## occupation lessthanhs hsdiploma somecollege
## Length:457 Min. : 0.000 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.: 0.900 1st Qu.: 7.30 1st Qu.:13.40
## Mode :character Median : 2.900 Median :22.60 Median :22.30
## Mean : 7.933 Mean :23.48 Mean :20.89
## 3rd Qu.:11.900 3rd Qu.:38.20 3rd Qu.:28.70
## Max. :55.600 Max. :59.10 Max. :53.70
## associates bachelors masters professional
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 5.00 1st Qu.: 7.90 1st Qu.: 1.400 1st Qu.: 0.300
## Median : 8.70 Median :18.90 Median : 3.700 Median : 0.800
## Mean :10.33 Mean :23.25 Mean : 9.592 Mean : 4.531
## 3rd Qu.:12.50 3rd Qu.:35.60 3rd Qu.:13.000 3rd Qu.: 2.600
## Max. :75.70 Max. :76.50 Max. :83.500 Max. :98.300
## hsorless somecollegeorassociates postgrad
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 8.30 1st Qu.:19.80 1st Qu.: 1.80
## Median :26.00 Median :32.10 Median : 4.60
## Mean :31.42 Mean :31.22 Mean : 14.12
## 3rd Qu.:52.20 3rd Qu.:42.10 3rd Qu.: 19.00
## Max. :85.70 Max. :81.80 Max. :100.00
## TOT_EMP A_MEDIAN natlwage
## Min. : 430 Min. : 19820 Min. : 25370800
## 1st Qu.: 15220 1st Qu.: 35370 1st Qu.: 871074600
## Median : 44940 Median : 47400 Median : 2598419600
## Mean : 190453 Mean : 53988 Mean : 9575292012
## 3rd Qu.: 152990 3rd Qu.: 65760 3rd Qu.: 8579078300
## Max. :4442090 Max. :185150 Max. :273118212000
## typicaled workexp probability total
## Length:457 Length:457 Min. :0.0028 Min. : 0
## Class :character Class :character 1st Qu.:0.0770 1st Qu.: 14570
## Mode :character Mode :character Median :0.5900 Median : 44920
## Mean :0.5054 Mean : 179719
## 3rd Qu.:0.8800 3rd Qu.: 147310
## Max. :0.9900 Max. :4528570
We can create an initial visualization of the relationship between automation risk and education level.
autovsedu <- data %>%
group_by(typicaled) %>%
summarise(medianwage = mean(A_MEDIAN),
averageprobability = mean(probability))
legend_ord2 <- levels(with(autovsedu, reorder(typicaled, -averageprobability)))
ggplot(data=autovsedu, aes(x = reorder(typicaled, -averageprobability), y = averageprobability)) +
geom_col(aes(fill=typicaled)) +
ggtitle("Likelihood of Job Automation by Education Level") +
xlab("Education Level")+
ylab("Likelihood of Job Automation") +
labs(fill="Education Level") +
scale_fill_discrete(breaks=legend_ord2) +
theme_minimal() +
theme(axis.text.x=element_blank())
There is a rather clear correlation between level of education and automation risk: those who are more educated are better protected from automation.
We can then visualize this relationship by individual occupation:
ggplot(data=data) +
geom_point(mapping=aes(x=A_MEDIAN, y=probability, size=TOT_EMP, alpha=1/10, col=typicaled))+
scale_size(range = c(1, 20)) +
ylim(-.01,1) +
xlab("Median Income") +
ylab("Probability of Automation") +
ggtitle("Likelihood of Job Automation vs Median Income") +
labs(size="Total Employment", col="Education Level") +
labs(alpha=NULL) +
guides(alpha=FALSE) +
theme(legend.text = element_text(colour="black", size = 10)) +
guides(col = guide_legend(override.aes = list(size=5))) +
theme_minimal()
With labels, a final look:
data$occupation <- toTitleCase(data$occupation)
ggplot(data=data) +
geom_point(mapping=aes(x=A_MEDIAN, y=probability, size=TOT_EMP, alpha=0.05, col=typicaled))+
geom_smooth(aes(x=A_MEDIAN, y=probability), method="lm", se=FALSE) +
scale_size(range = c(1, 20)) +
ylim(-.05,1.05) +
xlim(25000,200000) +
xlab("Median Income") +
ylab("Probability of Automation") +
ggtitle("Likelihood of Job Automation vs Median Income") +
labs(size="Total Employment", col="Typical Education Level") +
labs(alpha=NULL) +
guides(alpha=FALSE) +
theme(legend.text = element_text(colour="black", size = 10)) +
guides(col = guide_legend(override.aes = list(size=5))) +
theme_minimal() +
geom_label_repel(aes(A_MEDIAN,probability,label=occupation),subset(data, A_MEDIAN > 175000 & probability < .05), label.size=.5, label.r=.05, size=2.5, nudge_y = .05, nudge_x= -10000) +
geom_label_repel(aes(A_MEDIAN,probability,label=occupation),subset(data, A_MEDIAN == 21030), label.size=.1, label.r=.01, size=1, nudge_y = 0,nudge_x=0) +
geom_label_repel(aes(A_MEDIAN,probability,label=occupation),subset(data, A_MEDIAN == 24540), label.size=.1, label.r=.01, size=1, nudge_y = 0,nudge_x=0) +
geom_label_repel(aes(A_MEDIAN,probability,label=occupation),subset(data, A_MEDIAN > 100000 & probability > .90), label.size=.5, label.r=.05, size=2.5, nudge_y = -.05,nudge_x=10000) +
annotate("text", x = 165000, y = 1.03, label = "Highest salary,\n highest automation risk", size=3, fontface=2) +
annotate("text", x = 165000, y = -0.035, label = "Highest salary,\n lowest automation risk", size=3, fontface=2) +
annotate("text", x = 45000, y = -0.035, label = "Lowest salary,\n lowest automation risk", size=3, fontface=2) +
annotate("text", x = 45000, y = 1.03, label = "Lowest salary,\n highest automation risk", size=3, fontface=2)
Using the dataset I’ve used in this project, researchers Carl Frey and Michael Osborne predicted that 47% of jobs are at risk of automation over the next couple decades.
The visuals above suggest that the ills of automation may not be evenly distributed across jobs.
Less educated workers are more likely to face job loss as a product of automation. Those with high school diplomas or less (green bubbles) find themself concentrated near the top of the y-axis, while those with Bachelor’s degrees or higher on average face a lower risk of automation.
A job’s salary is also predictive of automation probability. As the median income of a profession increases, the likelihood of automation displacing its workers decreases.
Automation’s impact on work necessitates a policy response. The fact that automation will differentially impact different industries reminds us that this public policy will have to be strategic and thoughtful.