Overview

This document outlines a proposed process used by University of New Hampshire to review and assess NAICS and SOC assignments by NIOSH Industry and Occupation Computerized Coding System 4 (NIOCCS 4).

Purpose

NIOCCS 3 used a rules-based potentially labor intensive approach - and inherently introduced human error (though unlikely). NIOCCS 4 uses a machine-learning (ML) algorithm to automatically assign industry and occupation codes, based on millions of previously assigned records.

While NIOCCS 4 is a dramatic improvement internal to NIOCCS it poses a difficult problem for NIOSH resources using the system for research: what is the best way to review and assess the improved output? How can can one be certain that a dataset has been adequately coded? Manual review of an auto-coded dataset has the potential to make the auto-coding improvement moot, as the post processing review effort has the potential to be infeasible. Therefore, the method outlined here incrementally subsets and extracts small portions of the NIOCCS 4 auto-coded output, allowing researchers to review greatly row reduced datasets, a more accomplishable task, with discrete and explicit test criteria.

The NIOCCS 4 algorithm is based on millions of manually coded (by professional coders) training I-O responses. NIOCCS 4 uses a machine learning categorization method. While p-values are ubiquitous in research, here they has the potential to get confounded with the reported probabilities provided by the algorithm. Every occupation or industry code assigned by NIOCCS 4 is coupled with a probability value. These reported probabilities are vector components. The algorithm is assigning a probability to each industry, all of which then sum to 1. There are over 2,000 of 2012 NAICS industries; the probability reported by NIOCCS file coding is the most probable within the set of all industries - first-place, or most preferred. Alternatively, a p-value would be the probability given a distributional assumption of a null hypothesis being true. There isn’t a distributional assumption associated with these reported probabilities. They are entirely empirical, based on prior training data. Therefore, if the response provided by the ML algorithm was perfectly ambiguous, then the theoretical algorithm would return a very small uniform probability for each industry. Anything greater than that small uniform value means that the machine learning model has preferentially assigned a higher probability to a particular industry. Even if the model returns 0.30 probability – this is a very strong preference for this industry ML response – particularly when comparing it to the potentially tiny probabilities assigned to all other industries.

However, even in the case described above the second-place assignment could also be 0.29 probability. How do we differentiate what we would call a cut-off? Where is our “certain enough” point?

This document proposes that one isn’t necessarily interested in the absolute probability (because it’s split across all possible industries) but the ratio between the most empirically probable and the next most. If we take our cut-off to be 0.67 then we know, without any ambiguity, that the first place assigned industry is at least twice more empirically preferred that second place. Second-place assignment could at most be 0.33 probable. This is a strong case for that assignment of the first-place category.

Taking this line of thought from the other side - if a lower cut-off is assumed, 0.5, then the second-place assignment could conceivably be 0.499…, which makes this assignment very ambiguous. However, if the second-place assignment is 0.03 then while 0.5 is much less the our original 0.67 cut off its still a very strongly preferred assignment when compared to the field of all possible assignments.

Therefore, the method outlined below uses a comparative probability cutoff. This cutoff is not used as an automatic reason to remove or discount the industry or occupation assignment, but to subset these low probability assignments for manual review:

\[P_2 \gt \frac{P_1}{2}\]

With this as a baseline interpretation of the probabilities, this document outlines the method and R code used to post process SOC and NAICS coding.

Review Process

The cleaning process included the following steps (visualized in chart below):

Flowchart v0

Development Environment

This post-processing uses R version: x86_64-w64-mingw32, x86_64, mingw32, ucrt, x86_64, mingw32, , 4, 2.1, 2022, 06, 23, 82513, R, R version 4.2.1 (2022-06-23 ucrt), Funny-Looking Kid, along with the following libraries:

library(tidyverse)
library(httr)
library(jsonlite)
library(openxlsx)
load('.RData')

Raw dataset

The dataset used to develop this method exists here and can be accessed upon request. Send requests to .

Code example

The following is a code example which fully implements the proposed method above.

Read in dataset: Node 0

Upload Industry and Occupation infomration (Example File Format) information to NIOCCS Code a File.

The output file from NIOCCS is imported and cleaned.

node0 <- read.csv('../Code industry & occupation/NH Death Records I-O from 2000_forward - NIOCCS coded 20221129.csv') %>%
  mutate(Industry.Title = str_trim(Industry.Title),
         Occupation.Title = str_trim(Occupation.Title),
         Industry.Title = str_replace_all(Industry.Title,'&#39;',""),
         Occupation.Title = str_replace_all(Occupation.Title,'&#39;',""),
         Industry.Title = str_replace(Industry.Title,';',""),
         Occupation.Title = str_replace(Occupation.Title,';',""))

Combo Flag Review

The combo flag “Unexpected NAICS SOC Combo” provided by NIOCCS indicates unexpected industry and occupation combinations - as defined by the ML algorithm.

Split on Unexpected.NAICS.SOC.Combo flag: Node 1, 2

The dataset is split based on the combo flag.

node1 <- node0 %>% 
  filter(Unexpected.NAICS.SOC.Combo == "Y")

node2 <- node0 %>% 
  filter(!Unexpected.NAICS.SOC.Combo == "Y")

Node 1: Review and assess

The first output csv file for manual review is created: “1.1.csv”. The file is manually updated to include column named “Valid”. Each row is marked with either “Y” or “N” to indicate if this is a valid combination

node1 %>% 
  group_by(Industry.Title, Occupation.Title, NAICS.Title, SOC.Title) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  write.csv('1.1.csv')

node1 %>% 
  filter(NAICS.Probability < 0.67 | SOC.Probability < 0.67) %>%
  group_by(Industry.Title, Occupation.Title, NAICS.Title, SOC.Title) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  write.csv('1.1 ambiguous.csv')

Node 1.1 Read back manually reviewed combos

The manually reviewed dataset “1.1.csv” is re-imported.

node1.1 <- read.csv('1.1 with Valid (draft1).csv')

node1.2 <- node1.1 %>%
  filter(Valid == 'Y') %>%
  select(-X,-Count,-Valid) 

node1.3 <- node1.1 %>%
  filter(Valid == 'N') %>%
  select(-X,-Count,-Valid)

Define function to retreive second-place occ choices

“Second-place” as defined by the ML algorithm is pulled into this dataset using NIOCCS web serivce, located at the URL defined below.

Manual review is created: “1.3.csv”. The file is manually updated to include column named “Valid”. Each row is marked with either “Y” or “N” to indicate if this is a valid combination, and “Soc.Index” which indicates if the “first-place”, “second-place” or completely manually derived (Soc.Index == 3) is applicable, and thus will be carried forward.

second_occ <- function(it, ot){
  url <- "https://wwwn.cdc.gov/nioccs/IOCode"
  industry_text <- it
  occupation_text <- ot
  return_census_codes <- "1"  

  response <- GET(url, query = list(i = industry_text, o = occupation_text, n = "2"))

  http_error(response)
  
  text_results <- content(response, as="text")
  
  json_results <- fromJSON(text_results)
  
  return(data.frame(
             Industry.Title = it,
             Occupation.Title = ot,
             SOC.Code1 = json_results$Occupation$Code[1],
             SOC.Title1 = json_results$Occupation$Title[1],
             SOC.Probability1 = json_results$Occupation$Probability[1],
             SOC.Code2 = json_results$Occupation$Code[2],
             SOC.Title2 = json_results$Occupation$Title[2],
             SOC.Probability2 = json_results$Occupation$Probability[2]))
}

for(i in 1:nrow(node1.3)){
  print(i)
  row <- node1.3[i,c('Industry.Title','Occupation.Title')]
  it <- row$Industry.Title
  ot <- row$Occupation.Title
  if (i==1) {
    node1.3_append = second_occ(it, ot)
  } 
  else {
    next_row = second_occ(it, ot)
    node1.3_append <- node1.3_append %>% rbind(next_row)
  }
}

node1.3_joined <- node1.3 %>%
  left_join(node1.3_append, by = c('Industry.Title', 'Occupation.Title'))

node1.3_joined <- node1.3_joined %>%
  mutate(SOC.Code1 = paste("'",SOC.Code1,sep=""),
         SOC.Code2 = paste("'",SOC.Code2,sep=""))
node1.3_joined %>% write.csv('1.3.csv')

Post-process manual recode of unexpected combinations

The manually reviewed dataset “1.3.csv” is re-imported. SOC.Index is added with numerical values indicating the first place assignment (1), second place (2), and manually added (3). Occ.Valid column is added to show that a valid occupation was able to be determined. SOC.Code3 and SOC.Title3 were added manually as optional manual input occupation columns if needed.

node1.3_valid <- read.csv('1.3 append with Valid(draft2).csv')

node1.3_valid0 <- node1.3_valid %>%
  filter(Occ.Valid == 'Y')

node1.3_valid0.1 <- node1.3_valid0 %>%
  filter(Soc.Index == 1) %>%
  select(Industry.Title, Occupation.Title, NAICS.Title, SOC.Code1, SOC.Title1) %>%
  mutate(SOC.Code = SOC.Code1,
         SOC.Title = SOC.Title1,
         SOC.Probability = 1.0) %>%
  select(-SOC.Code1,-SOC.Title1)

node1.3_valid0.2 <- node1.3_valid0 %>%
  filter(Soc.Index == 2) %>%
  select(Industry.Title, Occupation.Title, NAICS.Title, SOC.Code2, SOC.Title2) %>%
  mutate(SOC.Code = SOC.Code2,
         SOC.Title = SOC.Title2,
         SOC.Probability = 1.0) %>%
  select(-SOC.Code2,-SOC.Title2)

node1.3_valid0.3 <- node1.3_valid0 %>%
  filter(Soc.Index == 3) %>%
  select(Industry.Title, Occupation.Title, NAICS.Title, SOC.Code3, SOC.Title3) %>%
  mutate(SOC.Code = SOC.Code3,
         SOC.Title = SOC.Title3,
         SOC.Probability = 1.0) %>%
  select(-SOC.Code3,-SOC.Title3)
  
node1.3_remove <- node1.3_valid %>%
  filter(!(Occ.Valid == 'Y')) %>%
  select(Industry.Title, Occupation.Title) %>%
  left_join(node1, 
            by = c('Industry.Title','Occupation.Title')) %>%
  arrange(ID)

node1.3_remove

Recombine to node3 after manual review

The dataset is recombined for further review.

node3 <- 
  node1.2 %>%
        left_join(node1 %>% select(-SOC.Title,-NAICS.Title), 
                  by = c('Industry.Title','Occupation.Title')) %>%
  union(node1.3_valid0.1 %>%
        left_join(node1 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability,-NAICS.Title), 
                  by = c('Industry.Title','Occupation.Title'))
  ) %>%
  union(node1.3_valid0.2 %>%
        left_join(node1 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability,-NAICS.Title), 
                  by = c('Industry.Title','Occupation.Title'))
  ) %>%
  union(node1.3_valid0.3 %>%
        left_join(node1 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability,-NAICS.Title), 
                  by = c('Industry.Title','Occupation.Title'))
  ) %>%
  union(
    node2) %>%
  arrange(ID)

Occupation Review

The SOC occupations provided by NIOCCS are reviewed for low probability criteria and those assigned “Insufficient…” are reviewed and recoded if applicable.

Split on Occ NOT “Insufficient…” & Prob <=0.67 flag: Node 4, 5

Abiguous occupations are separated and reviewed.

node4 <- node3 %>%
  filter(!(grepl('Insufficient', SOC.Title)) & SOC.Probability < 0.67) 

node5 <- node3 %>%
  filter(grepl('Insufficient', SOC.Title) | SOC.Probability >= 0.67)

Automate review of “second-place” Occ

“Second-place” as defined by the ML algorithm is pulled into this dataset using NIOCCS web serivce.

node4.1 <- node4 %>% 
  group_by(Industry.Title, Occupation.Title) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) 

for(i in 1:nrow(node4.1)){
  print(i)
  row <- node4.1[i,c('Industry.Title','Occupation.Title')]
  it <- row$Industry.Title
  ot <- row$Occupation.Title
  if (i==1) {
    node4.1_append = second_occ(it, ot)
  } 
  else {
    next_row = second_occ(it, ot)
    node4.1_append <- node4.1_append %>% rbind(next_row)
  }
}

The web service returns the probabilities as text, it is converted to a numeric variable.

node4.1_joined <- node4.1 %>%
  left_join(node4.1_append, by = c('Industry.Title', 'Occupation.Title') %>% unique())%>%
  mutate(SOC.Probability1 = as.double(SOC.Probability1),
         SOC.Probability2 = as.double(SOC.Probability2))

The low and high probability SOC assignments are split. Low and high probability criteria are defined in “Purpose” section above. Single quotes are added to ensure that SOC codes are not auto-formatted as dates by MS Excel if/when the file is opened.

node4.2 <- node4.1_joined %>%
  filter(SOC.Probability2 < SOC.Probability1/2)

node4.3 <- node4.1_joined %>%
  filter(SOC.Probability2 >= SOC.Probability1/2) %>%
  arrange(SOC.Probability1)

node4.3 %>%
  mutate(SOC.Code1 = paste("'",SOC.Code1,sep=""),
         SOC.Code2 = paste("'",SOC.Code2,sep="")) %>% write.csv('4.3.csv')

The manually reviewed dataset “4.3.csv” is re-imported. Column Occ.Value is added to indicate if a valid assignment has been determined. Column SOC.Index is added with numerical values indicating the first place assignment (1), second place (2) etc..

node4.3_valid <- read.csv('4.3 append with Valid(draft8).csv')

node4.3_valid0 <- node4.3_valid %>%
  filter(Occ.Valid == 'Y')

node4.3_valid0.1 <- node4.3_valid0 %>%
  filter(SOC.Index == 1) %>%
  select(Industry.Title, Occupation.Title, NAICS.Title, SOC.Code1, SOC.Title1) %>%
  mutate(SOC.Code = SOC.Code1,
         SOC.Title = SOC.Title1,
         SOC.Probability = 1.0) %>%
  select(-SOC.Code1,-SOC.Title1)

node4.3_valid0.2 <- node4.3_valid0 %>%
  filter(SOC.Index == 2) %>%
  select(Industry.Title, Occupation.Title, NAICS.Title, SOC.Code2, SOC.Title2) %>%
  mutate(SOC.Code = SOC.Code2,
         SOC.Title = SOC.Title2,
         SOC.Probability = 1.0) %>%
  select(-SOC.Code2,-SOC.Title2)
 
node4.3_remove <- node4.3_valid %>%
  filter(!Occ.Valid == 'Y') %>%
  select(Industry.Title, Occupation.Title) %>%
  left_join(node4, 
            by = c('Industry.Title','Occupation.Title')) %>%
  arrange(ID)

Recombine to node6 after manual review

The dataset is recombined for further review.

node6 <- 
  node4.2 %>% 
  select(-SOC.Code1,-SOC.Title1,-SOC.Probability1,-SOC.Code2,-SOC.Title2,-SOC.Probability2,-Count) %>%
        left_join(node4 %>% select(-SOC.Title,-NAICS.Title), 
                  by = c('Industry.Title','Occupation.Title')) %>%
  union(node4.3_valid0.1 %>%
        left_join(node4 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability,-NAICS.Title), 
                  by = c('Industry.Title','Occupation.Title'))
  ) %>%
  union(node4.3_valid0.2 %>%
        left_join(node4 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability,-NAICS.Title), 
                  by = c('Industry.Title','Occupation.Title'))
  )  %>%
  union(
    node5) %>%
  arrange(ID) 

The lost-cause SOC occupation “Insufficient..” assignments are removed.

node7.1 <-  node6 %>%
  filter(grepl('Insufficient', SOC.Title))%>% 
  group_by(Industry.Title, Occupation.Title) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) 

node7.1_valid <- node7.1 %>%
  filter(!(Occupation.Title %in% c(
    'UNKNOWN', 
    'NOT STATED', 
    'VARIOUS', 
    '' , 
    'UNK', 
    'UNKNOWN', 
    'NOT GIVEN', 
    'NOT AVAILABLE', 
    'LABOR', 'LABORER', 
    'OWNER/OPERATOR', 
    'UKNOWN', 
    'CONTRACTOR'
  ))) %>%
  filter(!is.na(Occupation.Title))

node7.1_remove <- anti_join(node7.1,node7.1_valid) %>% ungroup() %>%select(-Count) %>% left_join(node6, by = c('Industry.Title','Occupation.Title')) 

The dataset is recombined for further review.

node7<- anti_join(node6 %>% ungroup(),
                  node7.1_remove,
                  by = c('Industry.Title','Occupation.Title')) %>%
  arrange(ID)

The unemployed list is defined and set explicitly.

unemployed_list = c(
  'NONE', 
  'N/A', 
  'NA', 
  'NOT APPLICABLE', 
  'UNEMPLOYED')
node7.2 <- node7 %>%
  mutate(SOC.Code = ifelse(Occupation.Title %in% unemployed_list, "'00-9100",SOC.Code),
         SOC.Title = ifelse(Occupation.Title %in% unemployed_list, "Did Not Work (unpaid) -NIOSH",SOC.Title),
         SOC.Probability = ifelse(Occupation.Title %in% unemployed_list, 1.0, SOC.Probability))

The “Insufficient…” SOC assignments are split.

node8 <- node7.2%>%
  filter(grepl('Insufficient', SOC.Title))

node9 <- node7.2%>%
  filter(!grepl('Insufficient', SOC.Title))

… and combined to ensure only unique Industry and Occuplation titles are sent to the web service.

node8.1 <-  node8 %>%
  filter(grepl('Insufficient', SOC.Title))%>% 
  group_by(Industry.Title, Occupation.Title, NAICS.Title, SOC.Title) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) 

Second-place as defined by the ML algorithm is pulled into this dataset using NIOCCS web serivce.

for(i in 1:nrow(node8.1)){
  print(i)
  row <- node8.1[i,c('Industry.Title','Occupation.Title')]
  it <- row$Industry.Title
  ot <- row$Occupation.Title
  if (i==1) {
    node8.1_append = second_occ(it, ot)
  } 
  else {
    next_row = second_occ(it, ot)
    node8.1_append <- node8.1_append %>% rbind(next_row)
  }
}

Single quotes are added to ensure that SOC codes are not auto-formated as Dates by MS Excel if/when the file is opened.

node8.1_append %>%
  mutate(SOC.Code1 = paste("'",SOC.Code1,sep=""),
         SOC.Code2 = paste("'",SOC.Code2,sep=""))%>% write.csv('8.1 append with Valid(draft3).csv')

The manually reviewed dataset “8.1.csv” is re-imported. SOC.Index is added with numerical values indicating the first place assignment (1), second place (2) etc..

node8.1_valid <- read.csv('8.1 append with Valid(draft3).csv')

node8.1_valid0 <- node8.1_valid 

node8.1_valid0.1 <- node8.1_valid0 %>%
  filter(SOC.Index == 1) %>%
  select(Industry.Title, Occupation.Title, SOC.Code1, SOC.Title1) %>%
  mutate(SOC.Code = SOC.Code1,
         SOC.Title = SOC.Title1,
         SOC.Probability = 1.0) %>%
  select(-SOC.Code1,-SOC.Title1)

node8.1_valid0.2 <- node8.1_valid0 %>%
  filter(SOC.Index == 2) %>%
  select(Industry.Title, Occupation.Title, SOC.Code2, SOC.Title2) %>%
  mutate(SOC.Code = SOC.Code2,
         SOC.Title = SOC.Title2,
         SOC.Probability = 1.0) %>%
  select(-SOC.Code2,-SOC.Title2)

node8.2 <- 
  node8.1_valid0.1 %>%
        left_join(node8 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability), 
                  by = c('Industry.Title','Occupation.Title')) %>%
  union(node8.1_valid0.2 %>%
        left_join(node8 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability), 
                  by = c('Industry.Title','Occupation.Title'))
  )

node8.2_remove <- node8.2 %>%
  filter(grepl('Insufficient', SOC.Title) & grepl('Insufficient', NAICS.Title))

node8.2_append <- node8.2 %>%
  filter(!grepl('Insufficient', SOC.Title) | !grepl('Insufficient', NAICS.Title))

Recombine to node10 after manual review

node10 <- 
  node9 %>%
  union(node8.2_append) %>%
  arrange(ID) %>%
  mutate(SOC.Code = str_remove(SOC.Code,"'"))

Industry Review

Industry is accomplished similarly to Occupation (Node1 - Node10)

second_ind <- function(it, ot){
  url <- "https://wwwn.cdc.gov/nioccs/IOCode.ashx"
  industry_text <- it
  occupation_text <- ot
  return_census_codes <- "1"

  response <- GET(url, query = list(i = industry_text, o = occupation_text, n = "2"))

  http_error(response) 
  
  text_results <- content(response, as="text")
  
  json_results <- fromJSON(text_results)
  
  return(data.frame(
             Industry.Title = it,
             Occupation.Title = ot,
             NAICS.Code1 = json_results$Industry$Code[1],
             NAICS.Title1 = json_results$Industry$Title[1],
             NAICS.Probability1 = json_results$Industry$Probability[1],
             NAICS.Code2 = json_results$Industry$Code[2],
             NAICS.Title2 = json_results$Industry$Title[2],
             NAICS.Probability2 = json_results$Industry$Probability[2]))
}
for(i in 1:nrow(node1.3_remove)){
  print(i)
  row <- node1.3_remove[i,c('Industry.Title','Occupation.Title')]
  it <- row$Industry.Title
  ot <- row$Occupation.Title
  if (i==1) {
    node1.3_ind_append = second_ind(it, ot)
  } 
  else {
    next_row = second_ind(it, ot)
    node1.3_ind_append <- node1.3_ind_append %>% rbind(next_row)
  }
}

node1.3_ind_joined <- node1.3_ind_append %>%
  left_join(node1.3 %>% left_join(node1 %>% select(Industry.Title,Occupation.Title,SOC.Title,SOC.Code) %>% unique()))

node1.3_ind_joined %>%
  mutate(NAICS.Code1 = paste("'",NAICS.Code1,sep=""),
         NAICS.Code2 = paste("'",NAICS.Code2,sep=""),
         SOC.Code = paste("'",SOC.Code,sep=""))%>% write.csv('3.1 ind append with Valid(draft3).csv')

Split on Ind NOT “Insufficient…” & Prob <=0.67 flag: Node 12, 13

node11 <- node10 

node12 <- node11 %>%
  filter(!(grepl('Insufficient', NAICS.Title)) & NAICS.Probability < 0.67)

node13 <- node11 %>%
  filter(grepl('Insufficient', NAICS.Title) | NAICS.Probability >= 0.67)

Automate review of second-place Ind

node12.1 <- node12 %>% 
  group_by(Industry.Title, Occupation.Title, NAICS.Title, SOC.Title) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) 

for(i in 1:nrow(node12.1)){
  print(i)
  row <- node12.1[i,c('Industry.Title','Occupation.Title')]
  it <- row$Industry.Title
  ot <- row$Occupation.Title
  if (i==1) {
    node12.1_append = second_ind(it, ot)
  } 
  else {
    next_row = second_ind(it, ot)
    node12.1_append <- node12.1_append %>% rbind(next_row)
  }
}
node12.1_joined <- node12.1 %>%
  left_join(node12.1_append, by = c('Industry.Title', 'Occupation.Title'))%>%
  mutate(NAICS.Probability1 = as.double(NAICS.Probability1),
         NAICS.Probability2 = as.double(NAICS.Probability2))

node12.1_joined 
node12.2 <- node12.1_joined %>%
  filter(NAICS.Probability2 < NAICS.Probability1/2)

node12.3 <- node12.1_joined %>%
  filter(NAICS.Probability2 >= NAICS.Probability1/2) %>%
  arrange(NAICS.Probability1)

node12.3 %>%
  mutate(NAICS.Code1 = paste("'",NAICS.Code1,sep=""),
         NAICS.Code2 = paste("'",NAICS.Code2,sep="")) %>% write.csv('12.3 append with Valid(draft1).csv')
node12.3_valid <- read.csv('12.3 append with Valid(draft1).csv')

node12.3_valid0 <- node12.3_valid %>%
  filter(Valid == 'Y')

node12.3_valid0.1 <- node12.3_valid0 %>%
  filter(NAICS.Index == 1) %>%
  select(Industry.Title, Occupation.Title, NAICS.Title, NAICS.Code1, NAICS.Title1) %>%
  mutate(NAICS.Code = NAICS.Code1,
         NAICS.Title = NAICS.Title1,
         NAICS.Probability = 1.0) %>%
  select(-NAICS.Code1,-NAICS.Title1)

node12.3_valid0.2 <- node12.3_valid0 %>%
  filter(NAICS.Index == 2) %>%
  select(Industry.Title, Occupation.Title, NAICS.Title, NAICS.Code2, NAICS.Title2) %>%
  mutate(NAICS.Code = NAICS.Code2,
         NAICS.Title = NAICS.Title2,
         NAICS.Probability = 1.0) %>%
  select(-NAICS.Code2,-NAICS.Title2)
 
node12.3_remove <- node12.3_valid %>%
  filter(!Valid == 'Y') %>%
  select(Industry.Title, Occupation.Title) %>%
  left_join(node12, 
            by = c('Industry.Title','Occupation.Title')) %>%
  arrange(ID)

Recombine to node14 after manual review

node14 <- 
  node12.2 %>%
  select(-NAICS.Code1,-NAICS.Title1,-NAICS.Probability1,-NAICS.Code2,-NAICS.Title2,-NAICS.Probability2,-Count) %>%
        left_join(node12 %>% select(-SOC.Title,-NAICS.Title), 
                  by = c('Industry.Title','Occupation.Title')) %>%
  union(node12.3_valid0.1 %>%
        left_join(node12 %>% select(-NAICS.Title,-NAICS.Code,-NAICS.Probability), 
                  by = c('Industry.Title','Occupation.Title'))
  ) %>%
  union(node12.3_valid0.2 %>%
        left_join(node12 %>% select(-NAICS.Title,-NAICS.Code,-NAICS.Probability), 
                  by = c('Industry.Title','Occupation.Title'))
  )  %>%
  union(
    node13) %>%
  arrange(ID) 
node15.1 <-  node14 %>%
  filter(grepl('Insufficient', NAICS.Title))%>% 
  group_by(Industry.Title, Occupation.Title) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) 

node15.1_valid <- node15.1 %>%
  filter(!(Industry.Title %in% c(
    'UNKNOWN', 
    'NOT STATED', 
    'VARIOUS', 
    '' , 
    'UNK', 
    'UNKNOWN', 
    'NOT GIVEN', 
    'NOT AVAILABLE', 
    'LABOR', 'LABORER', 
    'OWNER/OPERATOR', 
    'UKNOWN', 
    'CONTRACTOR'
  ))) %>%
  filter(!is.na(Industry.Title))

node15.1_remove <- anti_join(node15.1,node15.1_valid) %>% ungroup() %>%select(-Count) %>% left_join(node14, by = c('Industry.Title','Occupation.Title'))

node15.1_remove
node15<- anti_join(node14 %>% ungroup(),
                  node15.1_remove,
                  by = c('Industry.Title','Occupation.Title')) %>%
  arrange(ID)
node16 <- node15%>%
  filter(grepl('Insufficient', NAICS.Title))

node17 <- node15%>%
  filter(!grepl('Insufficient', NAICS.Title))
node16.1 <-  node16 %>%
  filter(grepl('Insufficient', NAICS.Title))%>% 
  group_by(Industry.Title, Occupation.Title) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) 
for(i in 1:nrow(node16.1)){
  print(i)
  row <- node16.1[i,c('Industry.Title','Occupation.Title')]
  it <- row$Industry.Title
  ot <- row$Occupation.Title
  if (i==1) {
    node16.1_append = second_ind(it, ot)
  } 
  else {
    next_row = second_ind(it, ot)
    node16.1_append <- node16.1_append %>% rbind(next_row)
  }
}
node16.1_append %>%
  mutate(NAICS.Code1 = paste("'",NAICS.Code1,sep=""),
         NAICS.Code2 = paste("'",NAICS.Code2,sep=""))%>% write.csv('16.1 append with Valid(draft1).csv')
node16.1_valid <- read.csv('16.1 append with Valid(draft1).csv')

node16.1_valid0 <- node16.1_valid 

node16.1_valid0.1 <- node16.1_valid0 %>%
  filter(NAICS.Index == 1) %>%
  select(Industry.Title, Occupation.Title, NAICS.Code1, NAICS.Title1) %>%
  mutate(NAICS.Code = NAICS.Code1,
         NAICS.Title = NAICS.Title1,
         NAICS.Probability = 1.0) %>%
  select(-NAICS.Code1,-NAICS.Title1)

node16.1_valid0.2 <- node16.1_valid0 %>%
  filter(NAICS.Index == 2) %>%
  select(Industry.Title, Occupation.Title, NAICS.Code2, NAICS.Title2) %>%
  mutate(NAICS.Code = NAICS.Code2,
         NAICS.Title = NAICS.Title2,
         NAICS.Probability = 1.0) %>%
  select(-NAICS.Code2,-NAICS.Title2)

node16.2 <- 
  node16.1_valid0.1 %>%
        left_join(node16 %>% select(-NAICS.Title,-NAICS.Code,-NAICS.Probability), 
                  by = c('Industry.Title','Occupation.Title')) %>%
  union(node16.1_valid0.2 %>%
        left_join(node16 %>% select(-NAICS.Title,-NAICS.Code,-NAICS.Probability), 
                  by = c('Industry.Title','Occupation.Title'))
  )

node16.2_remove <- node16.2 %>%
  filter(grepl('Insufficient', NAICS.Title))

node16.2_append <- node16.2 %>%
  filter(!grepl('Insufficient', NAICS.Title))

Recombine to node18 after manual review

node18 <- 
  node17 %>%
  union(node16.2_append) %>%
  arrange(ID) %>%
  mutate(NAICS.Code = str_remove(NAICS.Code,"'")) %>% 
  select(ID,    
         Industry.Title,
         Occupation.Title,
         Census.Ind.code,
         Census.Ind.Title,
         Census.Occ.code,
         Census.Occ.Title,
         NAICS.Code,
         NAICS.Title,   
         NAICS.Probability,
         SOC.Code,
         SOC.Title, 
         SOC.Probability,
         Unexpected.NAICS.SOC.Combo
)

Exporting the cleaned I&O subset

Finally, the reviewed set is exported for use.

node18 %>% write.xlsx('Cleaned I&O.xlsx')
node19 <- node0 %>% 
  select(ID, Industry.Title,Occupation.Title) %>%
  left_join(node18, by = c('ID'))

The full set, cleaned and removed records, is recombined is also exported for use. The records which were uncoded are included in this set, but are left blank.

node19 %>% write.xlsx('Full I&O.xlsx')