library(tidyverse)
library(caret)
library(ranger)
library(rmarkdown)

Overview

This document demonstrates how the resources generated from our work can be applied. All resources referenced here can be accessed at https://doi.org/10.18130/V3/6RUBPV. For a guide to implementing the modeling process, please refer to https://rpubs.com/mmc4cv/appendix.

Loading in Crosswalks

The crosswalks can be found at https://github.com/IndOcc/CPScrosswalks. Separate crosswalks are available for Industry and Occupation.

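# Read the Industry crosswalk directly from the GitHub repository
xwalk_name <- "https://raw.githubusercontent.com/IndOcc/CPScrosswalks/main/IND_crosswalk_FULL.csv"
crosswalk <- read.csv(xwalk_name)
crosswalk <- janitor::clean_names(crosswalk)  # standardize the column names
crosswalk <- crosswalk[-1]                    # drop the first column
paged_table(crosswalk)                        # paged_table() comes from rmarkdown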
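
Once a crosswalk is loaded, one possible use is as a lookup table joined onto coded responses. The sketch below is illustrative only: the column names `old_code` and `new_code` and all of the code values are hypothetical and are not taken from the real crosswalk files, whose columns should be inspected (for example with `names(crosswalk)`) before adapting this.

```r
library(dplyr)

# Toy crosswalk with made-up codes; `old_code` and `new_code` are hypothetical names.
toy_crosswalk <- data.frame(old_code = c(1, 2, 3), new_code = c(10, 20, 20))

# Toy responses coded under the old scheme (code 9 is absent from the crosswalk).
responses <- data.frame(id = 1:4, old_code = c(1, 3, 2, 9))

# Attach the new-scheme code to every response; unmatched codes become NA.
recoded <- left_join(responses, toy_crosswalk, by = "old_code")

# List the old codes that found no match so they can be reviewed by hand.
unmatched <- recoded %>%
  filter(is.na(new_code)) %>%
  distinct(old_code)
```

Checking `unmatched` before any analysis is a cheap way to catch codes that the lookup does not cover.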

Loading and Using Random Forests

Our forest models are available from our repository: https://doi.org/10.18130/V3/6RUBPV. They are stored as RData files, so they can be read into R with load() (see the example code chunk below).

An important note: these models require input data in the same format, and with the same parameters, as defined in our process. They will not work on data that deviates from it. For a tutorial detailing how these forests were created, please refer to this document: (link to first vignette).
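
Once a forest has been loaded, one way to guard against a format mismatch is to compare the columns the model expects with the columns in the new data before predicting. The sketch below uses a hypothetical vector of required names and a toy data frame. For a real ranger object the expected names are, to our understanding, stored in `model$forest$independent.variable.names`, but treat that as an assumption to verify against the loaded model.

```r
# Hypothetical predictor names; for a real forest these would come from the model object.
required_cols <- c("AGE", "SEX", "EDUC.recoded")

# Toy data that is missing one of the required columns.
new_data <- data.frame(AGE = c(30, 44), SEX = c(2, 1))

# Return the required columns that are absent from `data`.
missing_predictors <- function(required, data) {
  setdiff(required, names(data))
}

missing_cols <- missing_predictors(required_cols, new_data)
if (length(missing_cols) > 0) {
  message("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

This only checks column names. Matching types and category codings would still need to be confirmed against the modeling guide.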

Evaluation metrics for these forest models (their performance on test data) are available at the same repository.

getwd()

## [1] "/gpfs/gpfs0/project/sdscap-kropko/sdscap-kropko-impute/CodeToPublish"

load('forest_IND1976from_ind_2003_2008.RData')
# All models are saved as objects named 'ranger_forest', so assign each one to a new name to avoid overwriting it when the next model is loaded
model <- ranger_forest
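
Because every file stores its model under the same object name, loading several forests in one session can silently replace an earlier one. A minimal sketch of one way around this, loading each file into its own environment, is shown here. The helper `load_forest()` is hypothetical, and small lists stand in for real forests so the example runs without the actual model files.

```r
# Stand-in objects; a real file would contain a fitted ranger forest instead.
ranger_forest <- list(label = "toy model A")
path_a <- tempfile(fileext = ".RData")
save(ranger_forest, file = path_a)

ranger_forest <- list(label = "toy model B")
path_b <- tempfile(fileext = ".RData")
save(ranger_forest, file = path_b)
rm(ranger_forest)

# Load an .RData file into a private environment and return its `ranger_forest` object.
load_forest <- function(path) {
  env <- new.env()
  load(path, envir = env)
  env$ranger_forest
}

# Each model gets its own name, so neither overwrites the other.
model_a <- load_forest(path_a)
model_b <- load_forest(path_b)
```

With a real forest, predictions would then come from something like `predict(model_a, data = new_data)$predictions`, provided `new_data` is formatted as the model expects.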

Loading and Using Individual-level predictions

Predictions for all 38 million responses over this time period (1976-2022) are available here: https://doi.org/10.18130/V3/6RUBPV. These predictions can be merged directly onto CPS data using the “YEAR”, “MONTH”, and “CPSID” columns (note that the example file below names its identifier column CPSIDP).
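
A minimal sketch of such a merge, using dplyr and toy data, follows. All values are made up, `predicted_ind` is a hypothetical column name, and the identifier is assumed to be stored as text in a column called CPSIDP, so adjust the key names to match the actual files.

```r
library(dplyr)

# Toy CPS extract; IDs are kept as character so no digits are lost.
cps <- data.frame(
  YEAR = c(2004, 2004, 2004),
  MONTH = c(1, 1, 2),
  CPSIDP = c("20040100000101", "20040100000102", "20040100000101"),
  AGE = c(30, 31, 30)
)

# Toy predictions keyed by the same three columns.
preds <- data.frame(
  YEAR = c(2004, 2004),
  MONTH = c(1, 2),
  CPSIDP = c("20040100000101", "20040100000101"),
  predicted_ind = c(732, 732)
)

# Keep every CPS row and attach a prediction where all three keys match.
cps_with_preds <- left_join(cps, preds, by = c("YEAR", "MONTH", "CPSIDP"))

# Count the rows that did not receive a prediction.
n_unmatched <- sum(is.na(cps_with_preds$predicted_ind))
```

A left join keeps the CPS data intact, so a nonzero `n_unmatched` points to key mismatches rather than lost rows.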

Because CPSIDs are long, R may print or write out the column values in scientific notation, which hides the digits that distinguish one ID from another. To avoid this, the scientific notation option (scipen) can be changed as shown below.
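
Another option, sketched here with base R and made-up IDs, is to read the identifier as text so that it is never treated as a number at all. The inline CSV text is a stand-in for a real predictions file, and the column name CPSIDP is assumed from the example file in this section.

```r
# Toy CSV text standing in for a predictions file; the values are made up.
csv_text <- "YEAR,MONTH,CPSIDP\n2004,1,20040100000101\n2004,1,20040100000102"

# Read the ID column as character so every digit is preserved exactly.
ids_as_text <- read.csv(text = csv_text, colClasses = c(CPSIDP = "character"))

# Alternatively, keep the numeric column and format it without scientific notation.
ids_as_number <- read.csv(text = csv_text)
formatted <- format(ids_as_number$CPSIDP, scientific = FALSE)
```

With readr, the equivalent of the first approach would be `read_csv(file, col_types = cols(CPSIDP = col_character()))`.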

Results can be aggregated and grouped using dplyr commands and other standard data manipulation packages. An example of potential aggregation is below. Pre-aggregated results are also available for use in the data repository linked above.

options(scipen = 999)  # strongly favor fixed over scientific notation
temp <- read_csv("IND_OCC_2004_into_other_schemas.csv")

## Rows: 856844 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): CLASSWKR.recoded
## dbl (32): YEAR, MONTH, CPSIDP, Original.IND, Original.OCC, AGE, SEX, EDUC.re...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(temp)

## # A tibble: 6 × 33
##    YEAR MONTH  CPSIDP Original.IND Original.OCC   AGE   SEX EDUC.recoded
##   <dbl> <dbl>   <dbl>        <dbl>        <dbl> <dbl> <dbl>        <dbl>
## 1  2004     1 2.00e13         7980         3640    30     2           80
## 2  2004     1 2.00e13         2670         8220    31     1           70
## 3  2004     1 2.00e13         7890         2340    43     2          110
## 4  2004     1 2.00e13         6890         5100    44     1          110
## 5  2004     1 2.00e13         4090         4850    67     1           70
## 6  2004     1 2.00e13         8190         3300    32     2          110
## # … with 25 more variables: CLASSWKR.recoded <chr>,
## #   IND.Predicted.Value.1976.1982 <dbl>,
## #   IND.Prediction.Certainty.Score.1976.1982 <dbl>,
## #   IND.Predicted.Value.1983.1991 <dbl>,
## #   IND.Prediction.Certainty.Score.1983.1991 <dbl>,
## #   IND.Predicted.Value.1992.2002 <dbl>,
## #   IND.Prediction.Certainty.Score.1992.2002 <dbl>, …

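# Note: summarize() is passed a bare column here, so every response is kept
# (one row per prediction) rather than collapsed to one row per occupation
IND_preds_byOCC_2004 <- temp %>%
  filter(YEAR == 2004) %>%
  group_by(Original.OCC) %>%
  summarize(IND.Predicted.Value.1992.2002)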

## `summarise()` has grouped output by 'Original.OCC'. You can override using the
## `.groups` argument.

head(IND_preds_byOCC_2004)

## # A tibble: 6 × 2
## # Groups:   Original.OCC [1]
##   Original.OCC IND.Predicted.Value.1992.2002
##          <dbl>                         <dbl>
## 1           10                           732
## 2           10                           900
## 3           10                           510
## 4           10                           580
## 5           10                           700
## 6           10                           891
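
The output above still has one row per response. To collapse the predictions to one row per occupation, a summary function is needed. The sketch below, on toy data with made-up values, counts predicted industry codes within each occupation and keeps the most frequent one. The helper column names `n_responses` and `share` are hypothetical, and this is only one of many reasonable ways to aggregate.

```r
library(dplyr)

# Toy predictions; the column names follow the example file, the values are made up.
toy <- data.frame(
  Original.OCC = c(10, 10, 10, 20, 20),
  IND.Predicted.Value.1992.2002 = c(732, 732, 900, 510, 510)
)

# Count how often each predicted industry code appears within each occupation.
counts <- toy %>%
  count(Original.OCC, IND.Predicted.Value.1992.2002, name = "n_responses")

# Keep the most frequent predicted code per occupation, with its share of responses.
modal_pred <- counts %>%
  group_by(Original.OCC) %>%
  mutate(share = n_responses / sum(n_responses)) %>%
  slice_max(n_responses, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Setting `with_ties = FALSE` guarantees a single row per occupation, at the cost of breaking ties arbitrarily by row order.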

Vignette: Utilizing Resources from Random Forest CPS Predictions

Hannah Schmuckler & Cecile Johnson

4/17/2022
