This document outlines a proposed process used by University of New Hampshire to review and assess NAICS and SOC assignments by NIOSH Industry and Occupation Computerized Coding System 4 (NIOCCS 4).
NIOCCS 3 used a rules-based potentially labor intensive approach - and inherently introduced human error (though unlikely). NIOCCS 4 uses a machine-learning (ML) algorithm to automatically assign industry and occupation codes, based on millions of previously assigned records.
While NIOCCS 4 is a dramatic improvement internal to NIOCCS it poses a difficult problem for NIOSH resources using the system for research: what is the best way to review and assess the improved output? How can can one be certain that a dataset has been adequately coded? Manual review of an auto-coded dataset has the potential to make the auto-coding improvement moot, as the post processing review effort has the potential to be infeasible. Therefore, the method outlined here incrementally subsets and extracts small portions of the NIOCCS 4 auto-coded output, allowing researchers to review greatly row reduced datasets, a more accomplishable task, with discrete and explicit test criteria.
The NIOCCS 4 algorithm is based on millions of manually coded (by professional coders) training I-O responses. NIOCCS 4 uses a machine learning categorization method. While p-values are ubiquitous in research, here they has the potential to get confounded with the reported probabilities provided by the algorithm. Every occupation or industry code assigned by NIOCCS 4 is coupled with a probability value. These reported probabilities are vector components. The algorithm is assigning a probability to each industry, all of which then sum to 1. There are over 2,000 of 2012 NAICS industries; the probability reported by NIOCCS file coding is the most probable within the set of all industries - first-place, or most preferred. Alternatively, a p-value would be the probability given a distributional assumption of a null hypothesis being true. There isn’t a distributional assumption associated with these reported probabilities. They are entirely empirical, based on prior training data. Therefore, if the response provided by the ML algorithm was perfectly ambiguous, then the theoretical algorithm would return a very small uniform probability for each industry. Anything greater than that small uniform value means that the machine learning model has preferentially assigned a higher probability to a particular industry. Even if the model returns 0.30 probability – this is a very strong preference for this industry ML response – particularly when comparing it to the potentially tiny probabilities assigned to all other industries.
However, even in the case described above the second-place assignment could also be 0.29 probability. How do we differentiate what we would call a cut-off? Where is our “certain enough” point?
This document proposes that one isn’t necessarily interested in the absolute probability (because it’s split across all possible industries) but the ratio between the most empirically probable and the next most. If we take our cut-off to be 0.67 then we know, without any ambiguity, that the first place assigned industry is at least twice more empirically preferred that second place. Second-place assignment could at most be 0.33 probable. This is a strong case for that assignment of the first-place category.
Taking this line of thought from the other side - if a lower cut-off is assumed, 0.5, then the second-place assignment could conceivably be 0.499…, which makes this assignment very ambiguous. However, if the second-place assignment is 0.03 then while 0.5 is much less the our original 0.67 cut off its still a very strongly preferred assignment when compared to the field of all possible assignments.
Therefore, the method outlined below uses a comparative probability cutoff. This cutoff is not used as an automatic reason to remove or discount the industry or occupation assignment, but to subset these low probability assignments for manual review:
\[P_2 \gt \frac{P_1}{2}\]
With this as a baseline interpretation of the probabilities, this document outlines the method and R code used to post process SOC and NAICS coding.
The cleaning process included the following steps (visualized in chart below):
Unexpected Combinations: Review of records identified by the NIOCCS 4 ML algorithm as unexpected NAICS and SOC combination. This included identification of two criteria: [1] if the reported occupation and industry combination are minimally plausible (as assess by the researcher) – it is assumed to be a valid combination; [2] otherwise, each record is submitted programmatically to the NIOCCS web service to identify the “second-place” SOC Code. If applicable, the “second-place” SOC assignment is selected. (Node3)
Low Probability Occupations: Review of records with SOC not “Insufficient..” but with a low SOC Probability. In this step the potentially ambiguous SOC assignments provided by NIOCCS are identified and assessed. Ambiguity criteria is defined as empirical probability of second-place SOC Code being greater than half first-place empirical probability (as defined above). Therefore, all records with an assigned empirical first-place probability greater than 0.67 are excluded from review (assumed to be valid); by definition second place is mathematically constrained to be less than half empirically probable. Interestingly, based on this review it appears that NIOCCS prioritizes Occupation assignment accuracy over industry. (Node6)
Lost-Cause Occupations: Identification of records with an occupation within the following explicit list. These are assumed to be lost causes: ‘UNKNOWN’, ‘NOT STATED’, ‘VARIOUS’, ’’ , ‘UNK’, ‘UNKNOWN’, ‘NOT GIVEN’, ‘NOT AVAILABLE’, ‘LABOR’, ‘LABORER’, ‘OWNER/OPERATOR’, ‘UKNOWN’, ‘CONTRACTOR’. (Node7)
Remaining “Insufficient…” Occupations: Review of records with SOC “Insufficient…” compared to the second-place SOC assignment. If applicable, the second-place SOC assignment is selected. (Node10)
Low Probability Industries: Review of records with NAICS not “Insufficient” but with a low NAICS Probability. This is processed for NAICS exactly as described above for SOC. (Node14)
Lost-Cause Industries: Identification of records with an industry within the same explicit lost cause list outlined above. (Node15)
Remaining “Insufficient…” Industries: Review of records with NAICS “Insufficient…” compared to the second-place NAICS assignment. If applicable, the second-place NAICS assignment is selected. (Node18)
This post-processing uses R version: x86_64-w64-mingw32, x86_64, mingw32, ucrt, x86_64, mingw32, , 4, 2.1, 2022, 06, 23, 82513, R, R version 4.2.1 (2022-06-23 ucrt), Funny-Looking Kid, along with the following libraries:
library(tidyverse)
library(httr)
library(jsonlite)
library(openxlsx)
load('.RData')
The dataset used to develop this method exists here and can be accessed upon request. Send requests to nate.thomas@unh.edu.
The following is a code example which fully implements the proposed method above.
Upload Industry and Occupation infomration (Example File Format) information to NIOCCS Code a File.
The output file from NIOCCS is imported and cleaned.
node0 <- read.csv('../Code industry & occupation/NH Death Records I-O from 2000_forward - NIOCCS coded 20221129.csv') %>%
mutate(Industry.Title = str_trim(Industry.Title),
Occupation.Title = str_trim(Occupation.Title),
Industry.Title = str_replace_all(Industry.Title,''',""),
Occupation.Title = str_replace_all(Occupation.Title,''',""),
Industry.Title = str_replace(Industry.Title,';',""),
Occupation.Title = str_replace(Occupation.Title,';',""))
The combo flag “Unexpected NAICS SOC Combo” provided by NIOCCS indicates unexpected industry and occupation combinations - as defined by the ML algorithm.
The dataset is split based on the combo flag.
node1 <- node0 %>%
filter(Unexpected.NAICS.SOC.Combo == "Y")
node2 <- node0 %>%
filter(!Unexpected.NAICS.SOC.Combo == "Y")
The first output csv file for manual review is created: “1.1.csv”. The file is manually updated to include column named “Valid”. Each row is marked with either “Y” or “N” to indicate if this is a valid combination
node1 %>%
group_by(Industry.Title, Occupation.Title, NAICS.Title, SOC.Title) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
write.csv('1.1.csv')
node1 %>%
filter(NAICS.Probability < 0.67 | SOC.Probability < 0.67) %>%
group_by(Industry.Title, Occupation.Title, NAICS.Title, SOC.Title) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
write.csv('1.1 ambiguous.csv')
The manually reviewed dataset “1.1.csv” is re-imported.
node1.1 <- read.csv('1.1 with Valid (draft1).csv')
node1.2 <- node1.1 %>%
filter(Valid == 'Y') %>%
select(-X,-Count,-Valid)
node1.3 <- node1.1 %>%
filter(Valid == 'N') %>%
select(-X,-Count,-Valid)
“Second-place” as defined by the ML algorithm is pulled into this dataset using NIOCCS web serivce, located at the URL defined below.
Manual review is created: “1.3.csv”. The file is manually updated to include column named “Valid”. Each row is marked with either “Y” or “N” to indicate if this is a valid combination, and “Soc.Index” which indicates if the “first-place”, “second-place” or completely manually derived (Soc.Index == 3) is applicable, and thus will be carried forward.
second_occ <- function(it, ot){
url <- "https://wwwn.cdc.gov/nioccs/IOCode"
industry_text <- it
occupation_text <- ot
return_census_codes <- "1"
response <- GET(url, query = list(i = industry_text, o = occupation_text, n = "2"))
http_error(response)
text_results <- content(response, as="text")
json_results <- fromJSON(text_results)
return(data.frame(
Industry.Title = it,
Occupation.Title = ot,
SOC.Code1 = json_results$Occupation$Code[1],
SOC.Title1 = json_results$Occupation$Title[1],
SOC.Probability1 = json_results$Occupation$Probability[1],
SOC.Code2 = json_results$Occupation$Code[2],
SOC.Title2 = json_results$Occupation$Title[2],
SOC.Probability2 = json_results$Occupation$Probability[2]))
}
for(i in 1:nrow(node1.3)){
print(i)
row <- node1.3[i,c('Industry.Title','Occupation.Title')]
it <- row$Industry.Title
ot <- row$Occupation.Title
if (i==1) {
node1.3_append = second_occ(it, ot)
}
else {
next_row = second_occ(it, ot)
node1.3_append <- node1.3_append %>% rbind(next_row)
}
}
node1.3_joined <- node1.3 %>%
left_join(node1.3_append, by = c('Industry.Title', 'Occupation.Title'))
node1.3_joined <- node1.3_joined %>%
mutate(SOC.Code1 = paste("'",SOC.Code1,sep=""),
SOC.Code2 = paste("'",SOC.Code2,sep=""))
node1.3_joined %>% write.csv('1.3.csv')
The manually reviewed dataset “1.3.csv” is re-imported. SOC.Index is added with numerical values indicating the first place assignment (1), second place (2), and manually added (3). Occ.Valid column is added to show that a valid occupation was able to be determined. SOC.Code3 and SOC.Title3 were added manually as optional manual input occupation columns if needed.
node1.3_valid <- read.csv('1.3 append with Valid(draft2).csv')
node1.3_valid0 <- node1.3_valid %>%
filter(Occ.Valid == 'Y')
node1.3_valid0.1 <- node1.3_valid0 %>%
filter(Soc.Index == 1) %>%
select(Industry.Title, Occupation.Title, NAICS.Title, SOC.Code1, SOC.Title1) %>%
mutate(SOC.Code = SOC.Code1,
SOC.Title = SOC.Title1,
SOC.Probability = 1.0) %>%
select(-SOC.Code1,-SOC.Title1)
node1.3_valid0.2 <- node1.3_valid0 %>%
filter(Soc.Index == 2) %>%
select(Industry.Title, Occupation.Title, NAICS.Title, SOC.Code2, SOC.Title2) %>%
mutate(SOC.Code = SOC.Code2,
SOC.Title = SOC.Title2,
SOC.Probability = 1.0) %>%
select(-SOC.Code2,-SOC.Title2)
node1.3_valid0.3 <- node1.3_valid0 %>%
filter(Soc.Index == 3) %>%
select(Industry.Title, Occupation.Title, NAICS.Title, SOC.Code3, SOC.Title3) %>%
mutate(SOC.Code = SOC.Code3,
SOC.Title = SOC.Title3,
SOC.Probability = 1.0) %>%
select(-SOC.Code3,-SOC.Title3)
node1.3_remove <- node1.3_valid %>%
filter(!(Occ.Valid == 'Y')) %>%
select(Industry.Title, Occupation.Title) %>%
left_join(node1,
by = c('Industry.Title','Occupation.Title')) %>%
arrange(ID)
node1.3_remove
The dataset is recombined for further review.
node3 <-
node1.2 %>%
left_join(node1 %>% select(-SOC.Title,-NAICS.Title),
by = c('Industry.Title','Occupation.Title')) %>%
union(node1.3_valid0.1 %>%
left_join(node1 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability,-NAICS.Title),
by = c('Industry.Title','Occupation.Title'))
) %>%
union(node1.3_valid0.2 %>%
left_join(node1 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability,-NAICS.Title),
by = c('Industry.Title','Occupation.Title'))
) %>%
union(node1.3_valid0.3 %>%
left_join(node1 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability,-NAICS.Title),
by = c('Industry.Title','Occupation.Title'))
) %>%
union(
node2) %>%
arrange(ID)
The SOC occupations provided by NIOCCS are reviewed for low probability criteria and those assigned “Insufficient…” are reviewed and recoded if applicable.
Abiguous occupations are separated and reviewed.
node4 <- node3 %>%
filter(!(grepl('Insufficient', SOC.Title)) & SOC.Probability < 0.67)
node5 <- node3 %>%
filter(grepl('Insufficient', SOC.Title) | SOC.Probability >= 0.67)
“Second-place” as defined by the ML algorithm is pulled into this dataset using NIOCCS web serivce.
node4.1 <- node4 %>%
group_by(Industry.Title, Occupation.Title) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
for(i in 1:nrow(node4.1)){
print(i)
row <- node4.1[i,c('Industry.Title','Occupation.Title')]
it <- row$Industry.Title
ot <- row$Occupation.Title
if (i==1) {
node4.1_append = second_occ(it, ot)
}
else {
next_row = second_occ(it, ot)
node4.1_append <- node4.1_append %>% rbind(next_row)
}
}
The web service returns the probabilities as text, it is converted to a numeric variable.
node4.1_joined <- node4.1 %>%
left_join(node4.1_append, by = c('Industry.Title', 'Occupation.Title') %>% unique())%>%
mutate(SOC.Probability1 = as.double(SOC.Probability1),
SOC.Probability2 = as.double(SOC.Probability2))
The low and high probability SOC assignments are split. Low and high probability criteria are defined in “Purpose” section above. Single quotes are added to ensure that SOC codes are not auto-formatted as dates by MS Excel if/when the file is opened.
node4.2 <- node4.1_joined %>%
filter(SOC.Probability2 < SOC.Probability1/2)
node4.3 <- node4.1_joined %>%
filter(SOC.Probability2 >= SOC.Probability1/2) %>%
arrange(SOC.Probability1)
node4.3 %>%
mutate(SOC.Code1 = paste("'",SOC.Code1,sep=""),
SOC.Code2 = paste("'",SOC.Code2,sep="")) %>% write.csv('4.3.csv')
The manually reviewed dataset “4.3.csv” is re-imported. Column Occ.Value is added to indicate if a valid assignment has been determined. Column SOC.Index is added with numerical values indicating the first place assignment (1), second place (2) etc..
node4.3_valid <- read.csv('4.3 append with Valid(draft8).csv')
node4.3_valid0 <- node4.3_valid %>%
filter(Occ.Valid == 'Y')
node4.3_valid0.1 <- node4.3_valid0 %>%
filter(SOC.Index == 1) %>%
select(Industry.Title, Occupation.Title, NAICS.Title, SOC.Code1, SOC.Title1) %>%
mutate(SOC.Code = SOC.Code1,
SOC.Title = SOC.Title1,
SOC.Probability = 1.0) %>%
select(-SOC.Code1,-SOC.Title1)
node4.3_valid0.2 <- node4.3_valid0 %>%
filter(SOC.Index == 2) %>%
select(Industry.Title, Occupation.Title, NAICS.Title, SOC.Code2, SOC.Title2) %>%
mutate(SOC.Code = SOC.Code2,
SOC.Title = SOC.Title2,
SOC.Probability = 1.0) %>%
select(-SOC.Code2,-SOC.Title2)
node4.3_remove <- node4.3_valid %>%
filter(!Occ.Valid == 'Y') %>%
select(Industry.Title, Occupation.Title) %>%
left_join(node4,
by = c('Industry.Title','Occupation.Title')) %>%
arrange(ID)
The dataset is recombined for further review.
node6 <-
node4.2 %>%
select(-SOC.Code1,-SOC.Title1,-SOC.Probability1,-SOC.Code2,-SOC.Title2,-SOC.Probability2,-Count) %>%
left_join(node4 %>% select(-SOC.Title,-NAICS.Title),
by = c('Industry.Title','Occupation.Title')) %>%
union(node4.3_valid0.1 %>%
left_join(node4 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability,-NAICS.Title),
by = c('Industry.Title','Occupation.Title'))
) %>%
union(node4.3_valid0.2 %>%
left_join(node4 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability,-NAICS.Title),
by = c('Industry.Title','Occupation.Title'))
) %>%
union(
node5) %>%
arrange(ID)
The lost-cause SOC occupation “Insufficient..” assignments are removed.
node7.1 <- node6 %>%
filter(grepl('Insufficient', SOC.Title))%>%
group_by(Industry.Title, Occupation.Title) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
node7.1_valid <- node7.1 %>%
filter(!(Occupation.Title %in% c(
'UNKNOWN',
'NOT STATED',
'VARIOUS',
'' ,
'UNK',
'UNKNOWN',
'NOT GIVEN',
'NOT AVAILABLE',
'LABOR', 'LABORER',
'OWNER/OPERATOR',
'UKNOWN',
'CONTRACTOR'
))) %>%
filter(!is.na(Occupation.Title))
node7.1_remove <- anti_join(node7.1,node7.1_valid) %>% ungroup() %>%select(-Count) %>% left_join(node6, by = c('Industry.Title','Occupation.Title'))
The dataset is recombined for further review.
node7<- anti_join(node6 %>% ungroup(),
node7.1_remove,
by = c('Industry.Title','Occupation.Title')) %>%
arrange(ID)
The unemployed list is defined and set explicitly.
unemployed_list = c(
'NONE',
'N/A',
'NA',
'NOT APPLICABLE',
'UNEMPLOYED')
node7.2 <- node7 %>%
mutate(SOC.Code = ifelse(Occupation.Title %in% unemployed_list, "'00-9100",SOC.Code),
SOC.Title = ifelse(Occupation.Title %in% unemployed_list, "Did Not Work (unpaid) -NIOSH",SOC.Title),
SOC.Probability = ifelse(Occupation.Title %in% unemployed_list, 1.0, SOC.Probability))
The “Insufficient…” SOC assignments are split.
node8 <- node7.2%>%
filter(grepl('Insufficient', SOC.Title))
node9 <- node7.2%>%
filter(!grepl('Insufficient', SOC.Title))
… and combined to ensure only unique Industry and Occuplation titles are sent to the web service.
node8.1 <- node8 %>%
filter(grepl('Insufficient', SOC.Title))%>%
group_by(Industry.Title, Occupation.Title, NAICS.Title, SOC.Title) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
Second-place as defined by the ML algorithm is pulled into this dataset using NIOCCS web serivce.
for(i in 1:nrow(node8.1)){
print(i)
row <- node8.1[i,c('Industry.Title','Occupation.Title')]
it <- row$Industry.Title
ot <- row$Occupation.Title
if (i==1) {
node8.1_append = second_occ(it, ot)
}
else {
next_row = second_occ(it, ot)
node8.1_append <- node8.1_append %>% rbind(next_row)
}
}
Single quotes are added to ensure that SOC codes are not auto-formated as Dates by MS Excel if/when the file is opened.
node8.1_append %>%
mutate(SOC.Code1 = paste("'",SOC.Code1,sep=""),
SOC.Code2 = paste("'",SOC.Code2,sep=""))%>% write.csv('8.1 append with Valid(draft3).csv')
The manually reviewed dataset “8.1.csv” is re-imported. SOC.Index is added with numerical values indicating the first place assignment (1), second place (2) etc..
node8.1_valid <- read.csv('8.1 append with Valid(draft3).csv')
node8.1_valid0 <- node8.1_valid
node8.1_valid0.1 <- node8.1_valid0 %>%
filter(SOC.Index == 1) %>%
select(Industry.Title, Occupation.Title, SOC.Code1, SOC.Title1) %>%
mutate(SOC.Code = SOC.Code1,
SOC.Title = SOC.Title1,
SOC.Probability = 1.0) %>%
select(-SOC.Code1,-SOC.Title1)
node8.1_valid0.2 <- node8.1_valid0 %>%
filter(SOC.Index == 2) %>%
select(Industry.Title, Occupation.Title, SOC.Code2, SOC.Title2) %>%
mutate(SOC.Code = SOC.Code2,
SOC.Title = SOC.Title2,
SOC.Probability = 1.0) %>%
select(-SOC.Code2,-SOC.Title2)
node8.2 <-
node8.1_valid0.1 %>%
left_join(node8 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability),
by = c('Industry.Title','Occupation.Title')) %>%
union(node8.1_valid0.2 %>%
left_join(node8 %>% select(-SOC.Title,-SOC.Code,-SOC.Probability),
by = c('Industry.Title','Occupation.Title'))
)
node8.2_remove <- node8.2 %>%
filter(grepl('Insufficient', SOC.Title) & grepl('Insufficient', NAICS.Title))
node8.2_append <- node8.2 %>%
filter(!grepl('Insufficient', SOC.Title) | !grepl('Insufficient', NAICS.Title))
node10 <-
node9 %>%
union(node8.2_append) %>%
arrange(ID) %>%
mutate(SOC.Code = str_remove(SOC.Code,"'"))
Industry is accomplished similarly to Occupation (Node1 - Node10)
second_ind <- function(it, ot){
url <- "https://wwwn.cdc.gov/nioccs/IOCode.ashx"
industry_text <- it
occupation_text <- ot
return_census_codes <- "1"
response <- GET(url, query = list(i = industry_text, o = occupation_text, n = "2"))
http_error(response)
text_results <- content(response, as="text")
json_results <- fromJSON(text_results)
return(data.frame(
Industry.Title = it,
Occupation.Title = ot,
NAICS.Code1 = json_results$Industry$Code[1],
NAICS.Title1 = json_results$Industry$Title[1],
NAICS.Probability1 = json_results$Industry$Probability[1],
NAICS.Code2 = json_results$Industry$Code[2],
NAICS.Title2 = json_results$Industry$Title[2],
NAICS.Probability2 = json_results$Industry$Probability[2]))
}
for(i in 1:nrow(node1.3_remove)){
print(i)
row <- node1.3_remove[i,c('Industry.Title','Occupation.Title')]
it <- row$Industry.Title
ot <- row$Occupation.Title
if (i==1) {
node1.3_ind_append = second_ind(it, ot)
}
else {
next_row = second_ind(it, ot)
node1.3_ind_append <- node1.3_ind_append %>% rbind(next_row)
}
}
node1.3_ind_joined <- node1.3_ind_append %>%
left_join(node1.3 %>% left_join(node1 %>% select(Industry.Title,Occupation.Title,SOC.Title,SOC.Code) %>% unique()))
node1.3_ind_joined %>%
mutate(NAICS.Code1 = paste("'",NAICS.Code1,sep=""),
NAICS.Code2 = paste("'",NAICS.Code2,sep=""),
SOC.Code = paste("'",SOC.Code,sep=""))%>% write.csv('3.1 ind append with Valid(draft3).csv')
node11 <- node10
node12 <- node11 %>%
filter(!(grepl('Insufficient', NAICS.Title)) & NAICS.Probability < 0.67)
node13 <- node11 %>%
filter(grepl('Insufficient', NAICS.Title) | NAICS.Probability >= 0.67)
node12.1 <- node12 %>%
group_by(Industry.Title, Occupation.Title, NAICS.Title, SOC.Title) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
for(i in 1:nrow(node12.1)){
print(i)
row <- node12.1[i,c('Industry.Title','Occupation.Title')]
it <- row$Industry.Title
ot <- row$Occupation.Title
if (i==1) {
node12.1_append = second_ind(it, ot)
}
else {
next_row = second_ind(it, ot)
node12.1_append <- node12.1_append %>% rbind(next_row)
}
}
node12.1_joined <- node12.1 %>%
left_join(node12.1_append, by = c('Industry.Title', 'Occupation.Title'))%>%
mutate(NAICS.Probability1 = as.double(NAICS.Probability1),
NAICS.Probability2 = as.double(NAICS.Probability2))
node12.1_joined
node12.2 <- node12.1_joined %>%
filter(NAICS.Probability2 < NAICS.Probability1/2)
node12.3 <- node12.1_joined %>%
filter(NAICS.Probability2 >= NAICS.Probability1/2) %>%
arrange(NAICS.Probability1)
node12.3 %>%
mutate(NAICS.Code1 = paste("'",NAICS.Code1,sep=""),
NAICS.Code2 = paste("'",NAICS.Code2,sep="")) %>% write.csv('12.3 append with Valid(draft1).csv')
node12.3_valid <- read.csv('12.3 append with Valid(draft1).csv')
node12.3_valid0 <- node12.3_valid %>%
filter(Valid == 'Y')
node12.3_valid0.1 <- node12.3_valid0 %>%
filter(NAICS.Index == 1) %>%
select(Industry.Title, Occupation.Title, NAICS.Title, NAICS.Code1, NAICS.Title1) %>%
mutate(NAICS.Code = NAICS.Code1,
NAICS.Title = NAICS.Title1,
NAICS.Probability = 1.0) %>%
select(-NAICS.Code1,-NAICS.Title1)
node12.3_valid0.2 <- node12.3_valid0 %>%
filter(NAICS.Index == 2) %>%
select(Industry.Title, Occupation.Title, NAICS.Title, NAICS.Code2, NAICS.Title2) %>%
mutate(NAICS.Code = NAICS.Code2,
NAICS.Title = NAICS.Title2,
NAICS.Probability = 1.0) %>%
select(-NAICS.Code2,-NAICS.Title2)
node12.3_remove <- node12.3_valid %>%
filter(!Valid == 'Y') %>%
select(Industry.Title, Occupation.Title) %>%
left_join(node12,
by = c('Industry.Title','Occupation.Title')) %>%
arrange(ID)
node14 <-
node12.2 %>%
select(-NAICS.Code1,-NAICS.Title1,-NAICS.Probability1,-NAICS.Code2,-NAICS.Title2,-NAICS.Probability2,-Count) %>%
left_join(node12 %>% select(-SOC.Title,-NAICS.Title),
by = c('Industry.Title','Occupation.Title')) %>%
union(node12.3_valid0.1 %>%
left_join(node12 %>% select(-NAICS.Title,-NAICS.Code,-NAICS.Probability),
by = c('Industry.Title','Occupation.Title'))
) %>%
union(node12.3_valid0.2 %>%
left_join(node12 %>% select(-NAICS.Title,-NAICS.Code,-NAICS.Probability),
by = c('Industry.Title','Occupation.Title'))
) %>%
union(
node13) %>%
arrange(ID)
node15.1 <- node14 %>%
filter(grepl('Insufficient', NAICS.Title))%>%
group_by(Industry.Title, Occupation.Title) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
node15.1_valid <- node15.1 %>%
filter(!(Industry.Title %in% c(
'UNKNOWN',
'NOT STATED',
'VARIOUS',
'' ,
'UNK',
'UNKNOWN',
'NOT GIVEN',
'NOT AVAILABLE',
'LABOR', 'LABORER',
'OWNER/OPERATOR',
'UKNOWN',
'CONTRACTOR'
))) %>%
filter(!is.na(Industry.Title))
node15.1_remove <- anti_join(node15.1,node15.1_valid) %>% ungroup() %>%select(-Count) %>% left_join(node14, by = c('Industry.Title','Occupation.Title'))
node15.1_remove
node15<- anti_join(node14 %>% ungroup(),
node15.1_remove,
by = c('Industry.Title','Occupation.Title')) %>%
arrange(ID)
node16 <- node15%>%
filter(grepl('Insufficient', NAICS.Title))
node17 <- node15%>%
filter(!grepl('Insufficient', NAICS.Title))
node16.1 <- node16 %>%
filter(grepl('Insufficient', NAICS.Title))%>%
group_by(Industry.Title, Occupation.Title) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
for(i in 1:nrow(node16.1)){
print(i)
row <- node16.1[i,c('Industry.Title','Occupation.Title')]
it <- row$Industry.Title
ot <- row$Occupation.Title
if (i==1) {
node16.1_append = second_ind(it, ot)
}
else {
next_row = second_ind(it, ot)
node16.1_append <- node16.1_append %>% rbind(next_row)
}
}
node16.1_append %>%
mutate(NAICS.Code1 = paste("'",NAICS.Code1,sep=""),
NAICS.Code2 = paste("'",NAICS.Code2,sep=""))%>% write.csv('16.1 append with Valid(draft1).csv')
node16.1_valid <- read.csv('16.1 append with Valid(draft1).csv')
node16.1_valid0 <- node16.1_valid
node16.1_valid0.1 <- node16.1_valid0 %>%
filter(NAICS.Index == 1) %>%
select(Industry.Title, Occupation.Title, NAICS.Code1, NAICS.Title1) %>%
mutate(NAICS.Code = NAICS.Code1,
NAICS.Title = NAICS.Title1,
NAICS.Probability = 1.0) %>%
select(-NAICS.Code1,-NAICS.Title1)
node16.1_valid0.2 <- node16.1_valid0 %>%
filter(NAICS.Index == 2) %>%
select(Industry.Title, Occupation.Title, NAICS.Code2, NAICS.Title2) %>%
mutate(NAICS.Code = NAICS.Code2,
NAICS.Title = NAICS.Title2,
NAICS.Probability = 1.0) %>%
select(-NAICS.Code2,-NAICS.Title2)
node16.2 <-
node16.1_valid0.1 %>%
left_join(node16 %>% select(-NAICS.Title,-NAICS.Code,-NAICS.Probability),
by = c('Industry.Title','Occupation.Title')) %>%
union(node16.1_valid0.2 %>%
left_join(node16 %>% select(-NAICS.Title,-NAICS.Code,-NAICS.Probability),
by = c('Industry.Title','Occupation.Title'))
)
node16.2_remove <- node16.2 %>%
filter(grepl('Insufficient', NAICS.Title))
node16.2_append <- node16.2 %>%
filter(!grepl('Insufficient', NAICS.Title))
node18 <-
node17 %>%
union(node16.2_append) %>%
arrange(ID) %>%
mutate(NAICS.Code = str_remove(NAICS.Code,"'")) %>%
select(ID,
Industry.Title,
Occupation.Title,
Census.Ind.code,
Census.Ind.Title,
Census.Occ.code,
Census.Occ.Title,
NAICS.Code,
NAICS.Title,
NAICS.Probability,
SOC.Code,
SOC.Title,
SOC.Probability,
Unexpected.NAICS.SOC.Combo
)
Finally, the reviewed set is exported for use.
node18 %>% write.xlsx('Cleaned I&O.xlsx')
node19 <- node0 %>%
select(ID, Industry.Title,Occupation.Title) %>%
left_join(node18, by = c('ID'))
The full set, cleaned and removed records, is recombined is also exported for use. The records which were uncoded are included in this set, but are left blank.
node19 %>% write.xlsx('Full I&O.xlsx')