Goal

The goal of this project is to understand the relationship between income, education, and automation. It asks the following: Is a given job’s income correlated to its likelihood of automation? Are jobs which are predominantly less educated more or less likely to be automated? How many workers are in the industries that will be automated?

Begin by loading packages.

rm(list=ls())
library(ggplot2)
library(ggthemes)
library(dplyr)
library(ggrepel)
library(tools)
library(readxl)
library(tidyverse)
options(scipen=999)

Load datasets:

There are three datasets for this project. 1) Educational attainment broke down by occupation, provided by BLS here 2) Salaries, median hourly/annual wages broke down by occupation, provided by BLS here 3) Risk of automation broken down by occupation, provided by Carl Benedikt Frey and Michael A. Osborne (but compiled here)

setwd("/Users/connorrothschild/Desktop/R/Automation Project")
education <- read_excel("education.xlsx", skip=1)
salary <- read_excel("national_M2017_dl.xlsx")
automation <- read_excel("raw_state_automation_data.xlsx")

Analysis

I’ll begin by finding which occupations contribute most to the American economy (in USD).

salary1 <- salary %>% 
  group_by(OCC_TITLE) %>% 
  mutate(natlwage = TOT_EMP * as.numeric(A_MEAN)) %>%
  filter(!is.na(TOT_EMP)) %>%
  filter(!is.na(A_MEAN)) %>%
  filter(!is.na(A_MEDIAN))
  
salary1$A_MEDIAN = as.numeric(as.character(salary1$A_MEDIAN))
salary2 <- select(salary1, OCC_TITLE, TOT_EMP, A_MEDIAN, natlwage) %>% 
  distinct()

options(scipen=999)
salary2 %>%
  arrange(desc(natlwage))
## # A tibble: 1,105 x 4
## # Groups:   OCC_TITLE [1,104]
##    OCC_TITLE                                     TOT_EMP A_MEDIAN natlwage
##    <chr>                                           <dbl>    <dbl>    <dbl>
##  1 All Occupations                                1.43e8   37690.  7.22e12
##  2 Management Occupations                         7.28e6  102590.  8.73e11
##  3 Office and Administrative Support Occupati…    2.20e7   34740.  8.34e11
##  4 Healthcare Practitioners and Technical Occ…    8.51e6   64770.  6.87e11
##  5 Sales and Related Occupations                  1.45e7   27020.  5.91e11
##  6 Business and Financial Operations Occupati…    7.47e6   67710.  5.70e11
##  7 Health Diagnosing and Treating Practitione…    5.27e6   79480.  5.31e11
##  8 Education, Training, and Library Occupatio…    8.73e6   48740.  4.84e11
##  9 Computer and Mathematical Occupations          4.26e6   84560.  3.83e11
## 10 Transportation and Material Moving Occupat…    9.98e6   31600.  3.70e11
## # ... with 1,095 more rows

Management occupations contribute the most, followed by office and administrative support, healthcare practictioners and technical occupations.

This is a quick vizualization which represents the relationship between a job’s median wage and the number of Americans which have that job.

salary2 %>% filter(TOT_EMP < 15000000) %>%
  ggplot(mapping=aes(x=TOT_EMP, y=A_MEDIAN)) +
  geom_point(alpha=1/3, col="red") +
  theme_bw() +
  ggtitle("Median Wage vs Total Employment") +
  xlab("Number of Americans in a Given Job") +
  ylab("Median Wage of a Given Job") 

We now have three data points:

  1. The name of an occupation

  2. The annual median wage of that occupation

  3. The amount it contributes to the American economy (or, more specifically, the amount workers are paid as an entire occupation for their work, annually)

Next, I want to cross-reference the salary data with the education data.

library(plyr)
education1 <- education %>% select(-X__1)
  
education1 <- rename(education1, c("2016 National Employment Matrix title and code" = "occupation",
                     "Less than high school diploma" = "lessthanhs", 
                     "High school diploma or equivalent" = "hsdiploma",
                     "Some college, no degree" = "somecollege",
                     "Associate's degree" = "associates",
                     "Bachelor's degree" = "bachelors",
                     "Master's degree" = "masters",
                     "Doctoral or professional degree" = "professional"))

education2 <- education1 %>% 
  group_by(occupation) %>%
  mutate(hsorless = lessthanhs + hsdiploma,
         somecollegeorassociates = somecollege + associates,
         postgrad = masters + professional)

education2 <- education2 %>% drop_na()

Next, I want to join education2 and salary2 to start analysis of education’s effect on salary.

salary2 <- rename(salary2, c("OCC_TITLE" = "occupation"))
salary2$occupation <- tolower(salary2$occupation)
education2$occupation <- tolower(education2$occupation)
edsal <- merge(as.data.frame(education2), as.data.frame(salary2), by="occupation") %>% drop_na()

head(edsal)
##                                                                  occupation
## 1                                                  accountants and auditors
## 3                                                                 actuaries
## 4                            adhesive bonding machine operators and tenders
## 5             administrative law judges, adjudicators, and hearing officers
## 6                                          administrative services managers
## 7 adult basic and secondary education and literacy teachers and instructors
##   lessthanhs hsdiploma somecollege associates bachelors masters
## 1        0.0       4.0         7.6        9.2      55.6    20.8
## 3        0.0       0.0         1.2        1.5      62.1    24.9
## 4       21.6      49.2        21.5        2.2       4.9     0.5
## 5        0.2       0.4         0.6        0.5       5.0     4.4
## 6        2.1      17.9        26.1       12.1      29.7    10.3
## 7        1.8      10.2        18.4        8.5      35.9    20.5
##   professional hsorless somecollegeorassociates postgrad TOT_EMP A_MEDIAN
## 1          2.8      4.0                    16.8     23.6 1241000    69350
## 3         10.3      0.0                     2.7     35.2   19210   101560
## 4          0.0     70.8                    23.7      0.5   15860    32710
## 5         88.9      0.6                     1.1     93.3   14480    94790
## 6          1.7     20.0                    38.2     12.0  270100    94020
## 7          4.7     12.0                    26.9     25.2   60670    52100
##      natlwage
## 1 96698720000
## 3  2206268500
## 4   549231800
## 5  1423094400
## 6 27922938000
## 7  3446056000

At this point I’m realizing that having the educational breakdown (# of Bachelor’s degrees, PhD’s, etc.) per job is interesting but may not be able to reveal a lot of key insights.

So, I’m going to introduce a fourth dataset: the typical education of a worker in a given occupation, also provided by BLS and found here.

typicaleducation <- read_excel("typicaleducation.xlsx")
typicaleducation2 <- typicaleducation %>% select(occupation,typicaled,workexp)
typicaleducation2 <- typicaleducation2 %>% drop_na()
typicaleducation2$occupation <- tolower(typicaleducation2$occupation)
edsal2 <- merge(as.data.frame(edsal), as.data.frame(typicaleducation2), by="occupation")

head(edsal2)
##                                                                  occupation
## 1                                                  accountants and auditors
## 2                                                                 actuaries
## 3                            adhesive bonding machine operators and tenders
## 4             administrative law judges, adjudicators, and hearing officers
## 5                                          administrative services managers
## 6 adult basic and secondary education and literacy teachers and instructors
##   lessthanhs hsdiploma somecollege associates bachelors masters
## 1        0.0       4.0         7.6        9.2      55.6    20.8
## 2        0.0       0.0         1.2        1.5      62.1    24.9
## 3       21.6      49.2        21.5        2.2       4.9     0.5
## 4        0.2       0.4         0.6        0.5       5.0     4.4
## 5        2.1      17.9        26.1       12.1      29.7    10.3
## 6        1.8      10.2        18.4        8.5      35.9    20.5
##   professional hsorless somecollegeorassociates postgrad TOT_EMP A_MEDIAN
## 1          2.8      4.0                    16.8     23.6 1241000    69350
## 2         10.3      0.0                     2.7     35.2   19210   101560
## 3          0.0     70.8                    23.7      0.5   15860    32710
## 4         88.9      0.6                     1.1     93.3   14480    94790
## 5          1.7     20.0                    38.2     12.0  270100    94020
## 6          4.7     12.0                    26.9     25.2   60670    52100
##      natlwage                         typicaled           workexp
## 1 96698720000                 Bachelor's degree              None
## 2  2206268500                 Bachelor's degree              None
## 3   549231800 High school diploma or equivalent              None
## 4  1423094400   Doctoral or professional degree   5 years or more
## 5 27922938000                 Bachelor's degree Less than 5 years
## 6  3446056000                 Bachelor's degree              None

This data allows us to ask: What is the median wage for each typical level of education?

detach(package:plyr)
edsal3 <- edsal2 %>% 
  group_by(typicaled) %>% 
  summarise(medianwage = mean(A_MEDIAN))

legend_ord <- levels(with(edsal3, reorder(typicaled, medianwage)))

ggplot(data=edsal3, aes(x = reorder(typicaled, medianwage), y = medianwage)) +
  geom_col(aes(fill=typicaled)) +
  ggtitle("Median Annual Income by Education Level") +
  xlab("Education Level")+
  ylab("Median Annual Income") +
  labs(fill="Education Level") +
  scale_fill_discrete(breaks=legend_ord) +
  theme_minimal() +
  theme(axis.text.x=element_blank())

The results are unsurpising: more educated people on average earn more.

Lastly, I bring in the automation data.

automationwstates <- automation %>% select(-soc)
automation1 <- automationwstates %>% select(occupation,probability,total)
head(automation)
## # A tibble: 6 x 55
##   soc    occupation probability alabama alaska arizona arkansas california
##   <chr>  <chr>            <dbl>   <dbl>  <dbl>   <dbl>    <dbl>      <dbl>
## 1 11-10… Chief Exe…      0.0150   1030.   760.   5750.    2710.     31150.
## 2 11-10… General a…      0.160   26930.  6490.  43300.   20680.    261780.
## 3 2011-… Advertisi…      0.0390     50.    40.    470.     110.      3760.
## 4 2021-… Marketing…      0.0140    530.   200.   4790.    1090.     33390.
## 5 2022-… Sales Man…      0.0130   2510.   400.  10650.    2650.     69180.
## 6 2031-… Public Re…      0.0150    400.   150.   1240.     300.      7010.
## # ... with 47 more variables: colorado <dbl>, connecticut <dbl>,
## #   delaware <dbl>, district_of_columbia <dbl>, florida <dbl>,
## #   georgia <dbl>, hawaii <dbl>, idaho <dbl>, illinois <dbl>,
## #   indiana <dbl>, iowa <dbl>, kansas <dbl>, kentucky <dbl>,
## #   louisiana <dbl>, maine <dbl>, maryland <dbl>, massachusetts <dbl>,
## #   michigan <dbl>, minnesota <dbl>, mississippi <dbl>, missouri <dbl>,
## #   montana <dbl>, nebraska <dbl>, nevada <dbl>, new_hampshire <dbl>,
## #   new_jersey <dbl>, new_mexico <dbl>, new_york <dbl>,
## #   north_carolina <dbl>, north_dakota <dbl>, ohio <dbl>, oklahoma <dbl>,
## #   oregon <dbl>, pennsylvania <dbl>, rhode_island <dbl>,
## #   south_carolina <dbl>, south_dakota <dbl>, tennessee <dbl>,
## #   texas <dbl>, utah <dbl>, vermont <dbl>, virginia <dbl>,
## #   washington <dbl>, west_virginia <dbl>, wisconsin <dbl>, wyoming <dbl>,
## #   total <dbl>

Next, I visualize the probability of automation vs total employment:

automation1 %>%
  ggplot(mapping=aes(x=total, y=probability)) +
  geom_point(alpha=1/3, col="blue") +
  theme_bw() +
  ggtitle("Probability of Automation vs Total Employment") +
  xlab("Number of Americans in a Given Job") +
  ylab("Probability of Automation") +
  geom_label_repel(data=subset(automation1, total > 4000000),
                   aes(total, probability,label=occupation), label.size=.5, label.r=.05, size=2.5, nudge_y = .05, nudge_x= -10000)

There doesn’t seem to be a huge relationship between automation and number of employees, however there is some concentration at each of the poles.

Some final data cleaning, and the merge of the final dataset:

automation1$occupation <- str_replace_all(automation1$occupation, ";", ",")
automation1$occupation <- tolower(automation$occupation)
data <- merge(as.data.frame(edsal2), as.data.frame(automation1), by="occupation")
summary(data)
##   occupation          lessthanhs       hsdiploma      somecollege   
##  Length:457         Min.   : 0.000   Min.   : 0.00   Min.   : 0.00  
##  Class :character   1st Qu.: 0.900   1st Qu.: 7.30   1st Qu.:13.40  
##  Mode  :character   Median : 2.900   Median :22.60   Median :22.30  
##                     Mean   : 7.933   Mean   :23.48   Mean   :20.89  
##                     3rd Qu.:11.900   3rd Qu.:38.20   3rd Qu.:28.70  
##                     Max.   :55.600   Max.   :59.10   Max.   :53.70  
##    associates      bachelors        masters        professional   
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 5.00   1st Qu.: 7.90   1st Qu.: 1.400   1st Qu.: 0.300  
##  Median : 8.70   Median :18.90   Median : 3.700   Median : 0.800  
##  Mean   :10.33   Mean   :23.25   Mean   : 9.592   Mean   : 4.531  
##  3rd Qu.:12.50   3rd Qu.:35.60   3rd Qu.:13.000   3rd Qu.: 2.600  
##  Max.   :75.70   Max.   :76.50   Max.   :83.500   Max.   :98.300  
##     hsorless     somecollegeorassociates    postgrad     
##  Min.   : 0.00   Min.   : 0.00           Min.   :  0.00  
##  1st Qu.: 8.30   1st Qu.:19.80           1st Qu.:  1.80  
##  Median :26.00   Median :32.10           Median :  4.60  
##  Mean   :31.42   Mean   :31.22           Mean   : 14.12  
##  3rd Qu.:52.20   3rd Qu.:42.10           3rd Qu.: 19.00  
##  Max.   :85.70   Max.   :81.80           Max.   :100.00  
##     TOT_EMP           A_MEDIAN         natlwage           
##  Min.   :    430   Min.   : 19820   Min.   :    25370800  
##  1st Qu.:  15220   1st Qu.: 35370   1st Qu.:   871074600  
##  Median :  44940   Median : 47400   Median :  2598419600  
##  Mean   : 190453   Mean   : 53988   Mean   :  9575292012  
##  3rd Qu.: 152990   3rd Qu.: 65760   3rd Qu.:  8579078300  
##  Max.   :4442090   Max.   :185150   Max.   :273118212000  
##   typicaled           workexp           probability         total        
##  Length:457         Length:457         Min.   :0.0028   Min.   :      0  
##  Class :character   Class :character   1st Qu.:0.0770   1st Qu.:  14570  
##  Mode  :character   Mode  :character   Median :0.5900   Median :  44920  
##                                        Mean   :0.5054   Mean   : 179719  
##                                        3rd Qu.:0.8800   3rd Qu.: 147310  
##                                        Max.   :0.9900   Max.   :4528570

We can create an initial visualization of the relationship between automation risk and education level.

autovsedu <- data %>% 
  group_by(typicaled) %>% 
  summarise(medianwage = mean(A_MEDIAN),
            averageprobability = mean(probability))

legend_ord2 <- levels(with(autovsedu, reorder(typicaled, -averageprobability)))

ggplot(data=autovsedu, aes(x = reorder(typicaled, -averageprobability), y = averageprobability)) +
  geom_col(aes(fill=typicaled)) +
  ggtitle("Likelihood of Job Automation by Education Level") +
  xlab("Education Level")+
  ylab("Likelihood of Job Automation") +
  labs(fill="Education Level") +
  scale_fill_discrete(breaks=legend_ord2) +
  theme_minimal() +
  theme(axis.text.x=element_blank())

There is a rather clear correlation between level of education and automation risk: those who are more educated are better protected from automation.

We can then visualize this relationship by individual occupation:

ggplot(data=data) +
  geom_point(mapping=aes(x=A_MEDIAN, y=probability, size=TOT_EMP, alpha=1/10, col=typicaled))+
  scale_size(range = c(1, 20)) +
  ylim(-.01,1) +
  xlab("Median Income") +
  ylab("Probability of Automation") +
  ggtitle("Likelihood of Job Automation vs Median Income") +
  labs(size="Total Employment", col="Education Level") +
  labs(alpha=NULL) +
  guides(alpha=FALSE) +
  theme(legend.text = element_text(colour="black", size = 10)) +
  guides(col = guide_legend(override.aes = list(size=5))) +
  theme_minimal()

With labels, a final look:

data$occupation <- toTitleCase(data$occupation)

ggplot(data=data) +
  geom_point(mapping=aes(x=A_MEDIAN, y=probability, size=TOT_EMP, alpha=0.05, col=typicaled))+
  geom_smooth(aes(x=A_MEDIAN, y=probability), method="lm", se=FALSE) +
  scale_size(range = c(1, 20)) +
  ylim(-.05,1.05) +
  xlim(25000,200000) +
  xlab("Median Income") +
  ylab("Probability of Automation") +
  ggtitle("Likelihood of Job Automation vs Median Income") +
  labs(size="Total Employment", col="Typical Education Level") +
  labs(alpha=NULL) +
  guides(alpha=FALSE) +
  theme(legend.text = element_text(colour="black", size = 10)) +
  guides(col = guide_legend(override.aes = list(size=5))) +
  theme_minimal() +
  geom_label_repel(aes(A_MEDIAN,probability,label=occupation),subset(data, A_MEDIAN > 175000 & probability < .05), label.size=.5, label.r=.05, size=2.5, nudge_y = .05, nudge_x= -10000) +
  geom_label_repel(aes(A_MEDIAN,probability,label=occupation),subset(data, A_MEDIAN == 21030), label.size=.1, label.r=.01, size=1, nudge_y = 0,nudge_x=0) +
  geom_label_repel(aes(A_MEDIAN,probability,label=occupation),subset(data, A_MEDIAN == 24540), label.size=.1, label.r=.01, size=1, nudge_y = 0,nudge_x=0) +
  geom_label_repel(aes(A_MEDIAN,probability,label=occupation),subset(data, A_MEDIAN > 100000 & probability > .90), label.size=.5, label.r=.05, size=2.5, nudge_y = -.05,nudge_x=10000) +
  annotate("text", x = 165000, y = 1.03, label = "Highest salary,\n highest automation risk", size=3, fontface=2) +
  annotate("text", x = 165000, y = -0.035, label = "Highest salary,\n lowest automation risk", size=3, fontface=2) +
  annotate("text", x = 45000, y = -0.035, label = "Lowest salary,\n lowest automation risk", size=3, fontface=2) +
  annotate("text", x = 45000, y = 1.03, label = "Lowest salary,\n highest automation risk", size=3, fontface=2)

Conclusions

Using the dataset I’ve used in this project, researchers Carl Frey and Michael Osborne predicted that 47% of jobs are at risk of automation over the next couple decades.

The visuals above suggest that the ills of automation may not be evenly distributed across jobs.

Less educated workers are more likely to face job loss as a product of automation. Those with high school diplomas or less (green bubbles) find themself concentrated near the top of the y-axis, while those with Bachelor’s degrees or higher on average face a lower risk of automation.

A job’s salary is also predictive of automation probability. As the median income of a profession increases, the likelihood of automation displacing its workers decreases.

Automation’s impact on work necessitates a policy response. The fact that automation will differentially impact different industries reminds us that this public policy will have to be strategic and thoughtful.