library(tidyverse)
library(caret)
library(ranger)
library(rmarkdown)
This document demonstrates how the resources generated from our work can be applied. All resources referenced here can be accessed at https://doi.org/10.18130/V3/6RUBPV. For a guide to implementing the modeling process, please refer to https://rpubs.com/mmc4cv/appendix.
The crosswalks can be found at https://github.com/IndOcc/CPScrosswalks. Separate crosswalks are available for Industry and Occupation.
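# URL of the full Industry crosswalk (raw CSV on GitHub)
xwalk_name <- "https://raw.githubusercontent.com/IndOcc/CPScrosswalks/main/IND_crosswalk_FULL.csv"
crosswalk <- read.csv(xwalk_name)
crosswalk <- janitor::clean_names(crosswalk) # standardize the column names
crosswalk <- crosswalk[-1] # drop the first column
paged_table(crosswalk)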
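One way a crosswalk like this might be used is to translate codes from one classification scheme into another by lookup. The sketch below is only an illustration on a made-up two-column table: the column names `ind_2003_2008` and `ind_1976` and every code value are hypothetical and are not taken from the actual crosswalk file. If the real crosswalk maps one code to several, a join would likely be more appropriate than a one-to-one lookup.

```r
# Toy crosswalk with hypothetical column names and made-up codes
toy_crosswalk <- data.frame(
  ind_2003_2008 = c(170, 180, 190),
  ind_1976      = c(17, 18, 19)
)

# Codes observed in some data set (also made up); 999 has no match
observed <- c(180, 170, 999)

# match() returns the row position of each observed code in the crosswalk,
# or NA when the code does not appear there
rows <- match(observed, toy_crosswalk$ind_2003_2008)
translated <- toy_crosswalk$ind_1976[rows]
translated
```

Codes that are missing from the crosswalk come back as `NA`, which makes unmatched values easy to spot before any further analysis.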
Our forest models can be downloaded from our repository: https://doi.org/10.18130/V3/6RUBPV. They are stored as RData files and can be read into R with load(), as in the example code chunk below.
An important note: these models require input data in the same format, with the same parameters, as defined in our process. They will not work on data that deviates from it. For a tutorial detailing how these forests were created, please refer to this document: (link to first vignette).
Evaluation metrics for these forest models (their performance on test data) are available at the same repository.
getwd()
## [1] "/gpfs/gpfs0/project/sdscap-kropko/sdscap-kropko-impute/CodeToPublish"
load('forest_IND1976from_ind_2003_2008.RData')
# Every model file stores its forest under the name 'ranger_forest', so assign it
# to a new name before loading another model to avoid overwriting it.
model <- ranger_forest
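Because every model file shares the object name `ranger_forest`, one possible way to avoid accidental overwrites is to load each file into its own environment. The following is a minimal sketch of that idea, not part of the original workflow: it uses a plain list as a stand-in for a real forest, and the names `rdata_path`, `model_env`, and `forest_ind1976` are hypothetical.

```r
# Stand-in object saved under the shared name 'ranger_forest'
# (a plain list here, not a real model)
ranger_forest <- list(label = "placeholder model")
rdata_path <- tempfile(fileext = ".RData")
save(ranger_forest, file = rdata_path)
rm(ranger_forest)

# Load into a fresh environment so nothing in the workspace is overwritten
model_env <- new.env()
loaded_names <- load(rdata_path, envir = model_env)

# Pull the object out under a descriptive name
forest_ind1976 <- model_env$ranger_forest
loaded_names
```

`load()` returns the names of the objects it restored, so `loaded_names` can be checked to confirm what a file contained.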
Predictions for all 38 million responses over this time period (1976-2022) are available here: https://doi.org/10.18130/V3/6RUBPV. These predictions can be matched directly to CPS data using the “YEAR”, “MONTH”, and “CPSID” columns.
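As a rough illustration of matching predictions to CPS records on the three key columns, the sketch below joins two small made-up data frames with `dplyr::left_join()`. All identifiers and values are invented, and the data frame names `cps`, `preds`, and `cps_with_preds` are hypothetical. Only the key column names and the prediction column name follow the files described in this document.

```r
library(dplyr)

# Made-up CPS extract (identifiers and values are illustrative only)
cps <- data.frame(
  YEAR  = c(2004, 2004, 2004),
  MONTH = c(1, 1, 2),
  CPSID = c(20040100000100, 20040100000200, 20040200000100),
  AGE   = c(30, 44, 67)
)

# Made-up predictions keyed by the same three columns
preds <- data.frame(
  YEAR  = c(2004, 2004),
  MONTH = c(1, 2),
  CPSID = c(20040100000100, 20040200000100),
  IND.Predicted.Value.1976.1982 = c(732, 900)
)

# Keep every CPS row; rows without a matching prediction get NA
cps_with_preds <- left_join(cps, preds, by = c("YEAR", "MONTH", "CPSID"))
cps_with_preds
```

A left join keeps all CPS rows, so any record that lacks a prediction shows up as `NA` in the prediction column and is not silently dropped.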
Due to the length of CPSIDs, R may display the column values in scientific notation, which hides the digits that distinguish one ID from another. To account for this, the scientific notation options can be changed as shown below.
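Two further ways of keeping long identifiers readable are sketched here on a tiny made-up CSV: formatting a numeric column without scientific notation, and reading the identifier column as text in the first place. This is only an illustration in base R. The file contents and the names `csv_path`, `as_numeric`, and `as_text` are hypothetical, and the column name `CPSIDP` simply mirrors the example file used later in this document.

```r
# Write a tiny CSV with made-up 14-digit identifiers
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("YEAR,MONTH,CPSIDP",
    "2004,1,20040100000101",
    "2004,1,20040100000102"),
  csv_path
)

# Option 1: keep the column numeric and format it without scientific notation
as_numeric <- read.csv(csv_path)
ids_formatted <- format(as_numeric$CPSIDP, scientific = FALSE)

# Option 2: read the identifier column as character so every digit is kept as text
as_text <- read.csv(csv_path, colClasses = c(CPSIDP = "character"))

ids_formatted
as_text$CPSIDP
```

Reading identifiers as text avoids any dependence on print settings, at the cost of needing the same column type on both sides of a later join.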
Results can be aggregated and grouped using dplyr or other standard data manipulation packages. An example of potential aggregation is below. Pre-aggregated results are also available for use in the data repository linked above.
options(scipen = 999)
temp <- read_csv("IND_OCC_2004_into_other_schemas.csv")
## Rows: 856844 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): CLASSWKR.recoded
## dbl (32): YEAR, MONTH, CPSIDP, Original.IND, Original.OCC, AGE, SEX, EDUC.re...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(temp)
## # A tibble: 6 × 33
##    YEAR MONTH  CPSIDP Original.IND Original.OCC   AGE   SEX EDUC.recoded
##   <dbl> <dbl>   <dbl>        <dbl>        <dbl> <dbl> <dbl>        <dbl>
## 1  2004     1 2.00e13         7980         3640    30     2           80
## 2  2004     1 2.00e13         2670         8220    31     1           70
## 3  2004     1 2.00e13         7890         2340    43     2          110
## 4  2004     1 2.00e13         6890         5100    44     1          110
## 5  2004     1 2.00e13         4090         4850    67     1           70
## 6  2004     1 2.00e13         8190         3300    32     2          110
## # … with 25 more variables: CLASSWKR.recoded <chr>,
## # IND.Predicted.Value.1976.1982 <dbl>,
## # IND.Prediction.Certainty.Score.1976.1982 <dbl>,
## # IND.Predicted.Value.1983.1991 <dbl>,
## # IND.Prediction.Certainty.Score.1983.1991 <dbl>,
## # IND.Predicted.Value.1992.2002 <dbl>,
## # IND.Prediction.Certainty.Score.1992.2002 <dbl>, …
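# summarize() has no aggregating function here, so it keeps every predicted value
# per occupation; dplyr >= 1.1.0 deprecates this usage in favor of reframe().
IND_preds_byOCC_2004 <- temp %>%
  filter(YEAR == 2004) %>%
  group_by(Original.OCC) %>%
  summarize(IND.Predicted.Value.1992.2002)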
## `summarise()` has grouped output by 'Original.OCC'. You can override using the
## `.groups` argument.
head(IND_preds_byOCC_2004)
## # A tibble: 6 × 2
## # Groups: Original.OCC [1]
##   Original.OCC IND.Predicted.Value.1992.2002
##          <dbl>                         <dbl>
## 1           10                           732
## 2           10                           900
## 3           10                           510
## 4           10                           580
## 5           10                           700
## 6           10                           891
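The output above keeps one row per response. If a single summary row per occupation is wanted instead, one possible approach is to count how often each predicted industry occurs within an occupation and keep the most frequent one. The sketch below shows this on made-up data. The values and the names `toy` and `modal_ind_by_occ` are hypothetical, and this is not necessarily how the pre-aggregated results in the repository were produced.

```r
library(dplyr)

# Made-up rows using the same column names as the prediction file
toy <- data.frame(
  YEAR = 2004,
  Original.OCC = c(10, 10, 10, 20, 20),
  IND.Predicted.Value.1992.2002 = c(732, 732, 900, 510, 510)
)

modal_ind_by_occ <- toy %>%
  filter(YEAR == 2004) %>%
  # Number of responses for each occupation / predicted-industry pair
  count(Original.OCC, IND.Predicted.Value.1992.2002, name = "n_responses") %>%
  group_by(Original.OCC) %>%
  # Keep the most frequent predicted industry within each occupation
  slice_max(n_responses, n = 1, with_ties = FALSE) %>%
  ungroup()

modal_ind_by_occ
```

Setting `with_ties = FALSE` guarantees exactly one row per occupation. Dropping that argument would keep all equally frequent industries instead.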