1 Packages I’ll Use

library(here)
library(magrittr)
library(janitor)
library(haven)
library(nhanesA)
library(naniar)
library(rsample)
library(gtsummary)
library(tidyverse)

2 Objective

Suppose I wanted to include the following items in a data set, from both 2017-18 and 2015-16 NHANES.

From the DEMO files: DEMO_J for 2017-18 and DEMO_I for 2015-16

  • SEQN = Subject identifying code
  • RIDSTATR = Interview/Examination Status (categorical)
    • 1 = Interview only
    • 2 = Interview and MEC examination (MEC = mobile examination center)
  • RIDAGEYR = Age in years (quantitative, topcoded at 80)
    • All subjects ages 80 and over are coded as 80
  • RIAGENDR = Sex (categorical)
    • 1 = Male, 2 = Female
  • RIDRETH3 = Race/Ethnicity (categorical)
    • 1 = Mexican-American,
    • 2 = Other Hispanic (1 and 2 are often combined)
    • 3 = Non-Hispanic White
    • 4 = Non-Hispanic Black
    • 6 = Non-Hispanic Asian
    • 7 = Other Race including Multi-Racial
    • Note that categories 1 and 2 are often combined, and sometimes we leave out category 7, or combine it with 6.

From the Body Measures (BMX) files: BMX_J for 2017-18 and BMX_I for 2015-16 (BMX is part of the Examination data)

  • BMXWAIST = Waist Circumference in cm (quantitative)

From the Cholesterol - High-Density Lipoprotein (HDL) files: HDL_J for 2017-18 and HDL_I for 2015-16 (HDL is part of the Lab data)

  • LBDHDD = Direct HDL cholesterol in mg/dl (quantitative)

From the Current Health Status (HSQ) file HSQ_J for 2017-18 and HSQ_I for 2015-16 (HSQ is part of the Questionnaire data)

  • HSD010 = Self-reported general health (categorical)
    • 1 = Excellent
    • 2 = Very good
    • 3 = Good
    • 4 = Fair
    • 5 = Poor
    • 7 = Refused
    • 9 = Don’t Know

Here is the process I would use to get going.

3 Get and do initial cleaning on the 2017-18 data

3.1 Pull in the Data from each necessary database for 2017-18, zap the labels and convert to tibbles

demo_17raw <- nhanes('DEMO_J') %>%
    zap_label() %>% tibble()
Processing SAS dataset DEMO_J    ..
# check: DEMO_J should have 9254 observations on 46 variables
dim(demo_17raw)
[1] 9254   46
bmx_17raw <- nhanes('BMX_J') %>%
    zap_label() %>% tibble()
Processing SAS dataset BMX_J     ..
# check: BMX_J should have 8704 observations on 21 variables
dim(bmx_17raw)
[1] 8704   21
hdl_17raw <- nhanes('HDL_J') %>%
    zap_label() %>% tibble()
Processing SAS dataset HDL_J     ..
# check: HDL_J should have 7435 observations on 3 variables
dim(hdl_17raw)
[1] 7435    3
hsq_17raw <- nhanes('HSQ_J') %>%
    zap_label() %>% tibble()
Processing SAS dataset HSQ_J     ..
# check: HSQ_J should have 8366 observations on 9 variables
dim(hsq_17raw)
[1] 8366    9

3.2 Merge the data with inner_join()

# create temp1 which includes all rows in both DEMO and BMX
temp1 <- inner_join(demo_17raw, bmx_17raw, by = "SEQN")

# check: temp1 should have 8704 observations on 66 variables
# since everyone in BMX will also be in DEMO, by design
# but not everyone in DEMO should also be in BMX
# and since SEQN is in both files
dim(temp1)
[1] 8704   66
# create temp2 which includes all rows in both HDL and HSQ
temp2 <- inner_join(hdl_17raw, hsq_17raw, by = "SEQN")

# check: temp2 should have 7435 observations on 11 variables
# since everyone in HDL will also be in HSQ, by design
# but not everyone in HSQ should also be in HDL
# and since SEQN is in both files
dim(temp2)
[1] 7435   11
# create merge_17raw which is all rows in both temp1 and temp2
merge_17raw <- inner_join(temp1, temp2, by = "SEQN")

# check: merge_17raw should have 7435 observations on 76 variables
dim(merge_17raw)
[1] 7435   76
# clean up
rm(temp1, temp2)

3.3 Select the 8 variables we’ll need and convert categorical variables to factors

nh_17 <- merge_17raw %>%
    select(SEQN, RIDSTATR, RIDAGEYR, RIAGENDR, RIDRETH3,
           BMXWAIST, LBDHDD, HSD010) %>%
    mutate(SEQN = as.character(SEQN),
           RIDSTATR = factor(RIDSTATR),
           RIAGENDR = factor(RIAGENDR),
           RIDRETH3 = factor(RIDRETH3),
           HSD010 = factor(HSD010))

3.4 Add an indicator of the CYCLE in which the data were drawn

nh_17 <- nh_17 %>%
    mutate(CYCLE = "2017-18")

nh_17
# A tibble: 7,435 x 9
   SEQN  RIDSTATR RIDAGEYR RIAGENDR RIDRETH3 BMXWAIST LBDHDD HSD010 CYCLE  
   <chr> <fct>       <int> <fct>    <fct>       <dbl>  <int> <fct>  <chr>  
 1 93705 2              66 2        4           102.      60 3      2017-18
 2 93706 2              18 1        6            79.3     47 2      2017-18
 3 93707 2              13 1        7            64.1     68 3      2017-18
 4 93708 2              66 2        6            88.2     88 3      2017-18
 5 93709 2              75 2        4           113       65 <NA>   2017-18
 6 93711 2              56 1        6            86.6     72 2      2017-18
 7 93712 2              18 1        1            72       48 3      2017-18
 8 93713 2              67 1        3            99.7     48 2      2017-18
 9 93714 2              54 2        4           118.      42 3      2017-18
10 93715 2              71 1        7            89.7     57 4      2017-18
# ... with 7,425 more rows

The resulting nh_17 data should have 7435 observations on 9 variables

4 Get and do initial cleaning on the 2015-16 data

4.1 Pull in the Data from each necessary database for 2015-16, zap the labels and convert to tibbles

demo_15raw <- nhanes('DEMO_I') %>%
    zap_label() %>% tibble()
Processing SAS dataset DEMO_I    ..
# check: DEMO_I should have 9971 observations on 47 variables
dim(demo_15raw)
[1] 9971   47
bmx_15raw <- nhanes('BMX_I') %>%
    zap_label() %>% tibble()
Processing SAS dataset BMX_I     ..
# check: BMX_I should have 9544 observations on 26 variables
dim(bmx_15raw)
[1] 9544   26
hdl_15raw <- nhanes('HDL_I') %>%
    zap_label() %>% tibble()
Processing SAS dataset HDL_I     ..
# check: HDL_I should have 8021 observations on 3 variables
dim(hdl_15raw)
[1] 8021    3
hsq_15raw <- nhanes('HSQ_I') %>%
    zap_label() %>% tibble()
Processing SAS dataset HSQ_I     ..
# check: HSQ_I should have 9165 observations on 9 variables
dim(hsq_15raw)
[1] 9165    9

4.2 Merge the data with inner_join()

# create tempA which includes all rows in both DEMO and BMX
tempA <- inner_join(demo_15raw, bmx_15raw, by = "SEQN")

# check: tempA should have 9544 observations on 72 variables
dim(tempA)
[1] 9544   72
# create tempB which includes all rows in both HDL and HSQ
tempB <- inner_join(hdl_15raw, hsq_15raw, by = "SEQN")

# check: tempB should have 8021 observations on 11 variables
dim(tempB)
[1] 8021   11
# create merge_15raw which is all rows in both tempA and tempB
merge_15raw <- inner_join(tempA, tempB, by = "SEQN")

# check: merge_15raw should have 8021 observations on 82 variables
dim(merge_15raw)
[1] 8021   82

4.3 Select the 8 variables we’ll need and convert categorical variables to factors

nh_15 <- merge_15raw %>%
    select(SEQN, RIDSTATR, RIDAGEYR, RIAGENDR, RIDRETH3,
           BMXWAIST, LBDHDD, HSD010) %>%
    mutate(SEQN = as.character(SEQN),
           RIDSTATR = factor(RIDSTATR),
           RIAGENDR = factor(RIAGENDR),
           RIDRETH3 = factor(RIDRETH3),
           HSD010 = factor(HSD010))

4.4 Add an indicator of the CYCLE in which the data were drawn

nh_15 <- nh_15 %>%
    mutate(CYCLE = "2015-16")

nh_15
# A tibble: 8,021 x 9
   SEQN  RIDSTATR RIDAGEYR RIAGENDR RIDRETH3 BMXWAIST LBDHDD HSD010 CYCLE  
   <chr> <fct>       <int> <fct>    <fct>       <dbl>  <int> <fct>  <chr>  
 1 83732 2              62 1        3           101.      46 3      2015-16
 2 83733 2              53 1        3           108.      63 2      2015-16
 3 83734 2              78 1        3           116.      30 4      2015-16
 4 83735 2              56 2        3           110.      61 3      2015-16
 5 83736 2              42 2        4            80.4     53 4      2015-16
 6 83737 2              72 2        1            92.9     78 3      2015-16
 7 83738 2              11 2        1            67.5     43 <NA>   2015-16
 8 83741 2              22 1        4            86.6     48 3      2015-16
 9 83742 2              32 2        1            93.3     28 2      2015-16
10 83743 2              18 1        6            NA       53 <NA>   2015-16
# ... with 8,011 more rows

The resulting nh_15 data should have 8021 observations on 9 variables

5 Combine the 2017-18 and 2015-16 tibbles

nh_proj <- full_join(nh_15, nh_17)
Joining, by = c("SEQN", "RIDSTATR", "RIDAGEYR", "RIAGENDR", "RIDRETH3", "BMXWAIST", "LBDHDD", "HSD010", "CYCLE")

Result should have 15456 observations on the same 9 variables.

dim(nh_proj)
[1] 15456     9

Most of the time in working with NHANES data, I restrict my work to adult subjects. Suppose here we want people between the ages of 20 and 79, to avoid the fact that NHANES classifies everyone over age 80 as RIDAGEYR = 80. The code I’d use is:

nh_proj <- nh_proj %>%
    filter(RIDAGEYR > 19 & RIDAGEYR < 80)

nh_proj
# A tibble: 10,014 x 9
   SEQN  RIDSTATR RIDAGEYR RIAGENDR RIDRETH3 BMXWAIST LBDHDD HSD010 CYCLE  
   <chr> <fct>       <int> <fct>    <fct>       <dbl>  <int> <fct>  <chr>  
 1 83732 2              62 1        3           101.      46 3      2015-16
 2 83733 2              53 1        3           108.      63 2      2015-16
 3 83734 2              78 1        3           116.      30 4      2015-16
 4 83735 2              56 2        3           110.      61 3      2015-16
 5 83736 2              42 2        4            80.4     53 4      2015-16
 6 83737 2              72 2        1            92.9     78 3      2015-16
 7 83741 2              22 1        4            86.6     48 3      2015-16
 8 83742 2              32 2        1            93.3     28 2      2015-16
 9 83744 2              56 1        4           116       52 3      2015-16
10 83747 2              46 1        3           104.      44 4      2015-16
# ... with 10,004 more rows

and we wind up with 10,014 observations on 9 variables.

6 Checking The Variables

6.1 Quantitative variables

We have three quantitative variables: age (RIDAGEYR), waist circumference (BMXWAIST) and HDL Cholesterol (LBDHDD).

Let’s rename those variables.

nh_proj <- nh_proj %>%
    rename(AGE = RIDAGEYR, WAIST = BMXWAIST, HDL = LBDHDD)

Now, we’ll check their ranges, and how much missingness we’re dealing with using a quick summary().

nh_proj %>% select(AGE, WAIST, HDL) %>%
    summary()
      AGE            WAIST            HDL        
 Min.   :20.00   Min.   : 57.9   Min.   :  6.00  
 1st Qu.:34.00   1st Qu.: 88.4   1st Qu.: 42.00  
 Median :49.00   Median : 99.0   Median : 51.00  
 Mean   :48.28   Mean   :100.5   Mean   : 53.49  
 3rd Qu.:62.00   3rd Qu.:110.5   3rd Qu.: 62.00  
 Max.   :79.00   Max.   :171.6   Max.   :226.00  
                 NA's   :556     NA's   :596     

The ranges (minimum and maximum) for Age, Waist Circumference and HDL Cholesterol all seem fairly plausible to me in light of our earlier decisions. Perhaps you know better, and please do use that knowledge. We do have several hundred missing values in the waist circumference and HDL cholesterol values that we will need to deal with.

6.2 Categorical Variables

6.2.1 RIDSTATR and CYCLE

We’re just checking here to see that everyone has RIDSTATR status 2, meaning they completed both the questionnaire and the examination. Since they do, we’ll also make sure our CYCLE variable works to indicate the NHANES reporting CYCLE for each subject.

nh_proj %>% tabyl(CYCLE, RIDSTATR)
   CYCLE    2
 2015-16 5131
 2017-18 4883

6.2.2 RIAGENDR, which we’ll change to SEX or FEMALE

nh_proj %>% tabyl(RIAGENDR)
 RIAGENDR    n   percent
        1 4817 0.4810266
        2 5197 0.5189734

We certainly want to recode this and name it more effectively, and make it a non-numerical factor.

nh_proj <- nh_proj %>%
    mutate(SEX = ifelse(RIAGENDR == 1, "M", "F"))

Checking our results…

nh_proj %>% tabyl(SEX, RIAGENDR)
 SEX    1    2
   F    0 5197
   M 4817    0

Good. We can replace RIAGENDR with SEX moving forward. Another option would have been to use a 1/0 numeric variable, and if I’d wanted to do that, I would have run something like

nh_proj <- nh_proj %>%
    mutate(FEMALE = as.numeric(RIAGENDR) - 1)

nh_proj %>% tabyl(FEMALE, SEX)
 FEMALE    F    M
      0    0 4817
      1 5197    0

6.2.3 RIDRETH3, which we’ll change to RACE_ETH

Let’s do the following:

  • Recode the levels of the RIDRETH3 variable to use short, understandable group names.
  • Collapse the first two categories (Mexican-American and Other Hispanic) into a single category.
  • Change the variable name to RACE_ETH.
  • Sort the resulting factor in order of their counts.
nh_proj <- nh_proj %>%
    mutate(RACE_ETH = fct_recode(RIDRETH3,
                                 "Hispanic" = "1",
                                 "Hispanic" = "2",
                                 "NH_White" = "3",
                                 "NH_Black" = "4",
                                 "NH_Asian" = "6",
                                 "Other" = "7"),
           RACE_ETH = fct_infreq(RACE_ETH))

Let’s look at the results. Do we need to collapse further in this case?

nh_proj %>% tabyl(RACE_ETH, RIDRETH3)
 RACE_ETH    1    2    3    4    6   7
 NH_White    0    0 3117    0    0   0
 Hispanic 1605 1184    0    0    0   0
 NH_Black    0    0    0 2295    0   0
 NH_Asian    0    0    0    0 1366   0
    Other    0    0    0    0    0 447

The most common RACE_ETH is Non-Hispanic White, followed by Hispanic, then Non-Hispanic Black, Non-Hispanic Asian and finally Other Race (including Multi-Racial.) We won’t collapse any further for now.

6.2.4 HSD010, which we’ll change to SROH

  • Again, we need to recode the levels of this categorical variable.
  • I’ll also rename the variable SROH (which is an abbreviation for self-reported overall health.) Always explain abbreviations.
  • We also need to convert the values 7 (Refused) and 9 (Don’t Know) to missing values.
nh_proj <- nh_proj %>%
    mutate(HSD010 = na_if(HSD010, 7),
           HSD010 = na_if(HSD010, 9)) %>%
    mutate(SROH = fct_recode(HSD010,
                             "Excellent" = "1",
                             "Very_Good" = "2",
                             "Good" = "3",
                             "Fair" = "4",
                             "Poor" = "5")) %>%
    droplevels() # drop unused levels in all factors
nh_proj %>% tabyl(SROH, HSD010)
      SROH   1    2    3    4   5 NA_
 Excellent 772    0    0    0   0   0
 Very_Good   0 2216    0    0   0   0
      Good   0    0 3865    0   0   0
      Fair   0    0    0 2020   0   0
      Poor   0    0    0    0 322   0
      <NA>   0    0    0    0   0 819

6.3 Reset the variables we will actually use

  • We don’t need RIDSTATR any more because we’ve checked and see that all its values are 2, as needed.
  • We’ve renamed some variables and replaced others with better versions.
nh_proj <- nh_proj %>% 
    select(SEQN, CYCLE, AGE, SEX, RACE_ETH, WAIST, HDL, SROH)

We should now have 10,014 observations on these 8 variables.

dim(nh_proj)
[1] 10014     8

6.4 Missingness check

miss_var_summary(nh_proj)
# A tibble: 8 x 3
  variable n_miss pct_miss
  <chr>     <int>    <dbl>
1 SROH        819     8.18
2 HDL         596     5.95
3 WAIST       556     5.55
4 SEQN          0     0   
5 CYCLE         0     0   
6 AGE           0     0   
7 SEX           0     0   
8 RACE_ETH      0     0   

Three of our eight variables have missing values. We’ve got some imputation ahead of us.

6.5 What if HDL was our outcome?

If one of those three variables with missing values was our outcome, say, for example, HDL, then we would filter our data for complete cases on that variable.

nh_proj <- nh_proj %>% filter(complete.cases(HDL))

dim(nh_proj)
[1] 9418    8

and we’re now down to 9418 observations on 8 variables. Here’s the revised report on missing data.

miss_var_summary(nh_proj) %>% filter(n_miss > 0)
# A tibble: 2 x 3
  variable n_miss pct_miss
  <chr>     <int>    <dbl>
1 SROH        705     7.49
2 WAIST       436     4.63

How many observations have any missing values at all, now? How many are missing both SROH and WAIST?

miss_case_table(nh_proj)
# A tibble: 3 x 3
  n_miss_in_case n_cases pct_cases
           <int>   <int>     <dbl>
1              0    8444     89.7 
2              1     807      8.57
3              2     167      1.77

7 Saving a Tidy Tibble

saveRDS(nh_proj, here("data", "nh_proj.Rds"))

8 Codebook

A successful codebook for Project 2 will include:

  • a set of five or so statements about the data, followed by
  • a tbl_summary() report from gtsummary or some equivalent brief description of the data, and that should be followed by
  • a fuller report on the distributions from the describe() function from Hmisc.

I’ve chosen here to build a codebook with some useful information describing the entire sample. In your project 2, I would be happy with a codebook using only your model training sample, or including the whole sample (training and testing together), or whatever else you feel is most useful to present. I would begin the codebook with statements like the ones shown below to fix ideas about the sample size, the missingness, the outcome (and you might have more than one, of course) and a statement about the roles for the other variables shown in the table below. If you look at the R Markdown code that generated this document, you will see that I have used in-line coding to fill in the counts of subjects in statements 1 and 2. You should, too.

  1. Sample Size The data in our complete nh_proj sample consist of 9418 subjects from NHANES 2015-16 and NHANES 2017-18 between the ages of 20 and 79 in whom our outcome variable was measured. (If I had other inclusion/exclusion criteria, I would list them here.)
  2. Missingness Of the 9418 subjects, 8444 have complete data on all variables listed below.
  3. Our outcome variable is HDL, which is the subject’s serum HDL cholesterol measured in mg/dl.
  4. All other variables listed below will serve as candidate predictors for our models.
  5. The other variable contained in my tidy tibble is SEQN which is the subject identifying code.

8.1 Using gtsummary to present key information about the variables

Variables included in our analyses are summarized in the following table.

nh_proj %>% 
    select(HDL, AGE, WAIST, CYCLE, SEX, RACE_ETH, SROH) %>%
    tbl_summary(.,
        label = list(
            HDL = "HDL (HDL Cholesterol in mg/dl)",
            AGE = "AGE (in years)",
            WAIST = "WAIST (circumference in cm)",
            CYCLE = "CYCLE (NHANES reporting)",
            RACE_ETH = "RACE_ETH (Race/Ethnicity)",
            SROH = "SROH (Self-Reported Overall Health)"),
        stat = list( all_continuous() ~ 
                "{median} [{min} to {max}]" ))
Characteristic N = 9,4181
HDL (HDL Cholesterol in mg/dl) 51 [6 to 226]
AGE (in years) 49 [20 to 79]
WAIST (circumference in cm) 99 [59 to 172]
Unknown 436
CYCLE (NHANES reporting)
2015-16 4,838 (51%)
2017-18 4,580 (49%)
SEX
F 4,888 (52%)
M 4,530 (48%)
RACE_ETH (Race/Ethnicity)
NH_White 2,984 (32%)
Hispanic 2,678 (28%)
NH_Black 2,075 (22%)
NH_Asian 1,261 (13%)
Other 420 (4.5%)
SROH (Self-Reported Overall Health)
Excellent 726 (8.3%)
Very_Good 2,098 (24%)
Good 3,666 (42%)
Fair 1,922 (22%)
Poor 301 (3.5%)
Unknown 705

1 Median [Minimum to Maximum]; n (%)

8.2 Data Distribution Description

nh_proj %>% 
    select(HDL, AGE, WAIST, CYCLE, SEX, RACE_ETH, SROH) %>%
    Hmisc::describe() %>% Hmisc::html()
.

7 Variables   9418 Observations

HDL
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
94180132153.4917.9332.0035.0042.0051.0062.0075.0083.15
lowest : 6 10 11 15 16 , highest: 165 166 178 189 226
AGE
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
9418060148.3918.8423263449627074
lowest : 20 21 22 23 24 , highest: 75 76 77 78 79
WAIST
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
89824368691100.618.99 75.20 79.80 88.50 99.15110.70122.80131.70
lowest : 58.7 62.3 63.2 63.4 63.7 , highest: 164.9 165.0 166.0 169.5 171.6
CYCLE
nmissingdistinct
941802
 Value      2015-16 2017-18
 Frequency     4838    4580
 Proportion   0.514   0.486
 

SEX
nmissingdistinct
941802
 Value          F     M
 Frequency   4888  4530
 Proportion 0.519 0.481
 

RACE_ETH
image
nmissingdistinct
941805
lowest : NH_White Hispanic NH_Black NH_Asian Other , highest: NH_White Hispanic NH_Black NH_Asian Other
 Value      NH_White Hispanic NH_Black NH_Asian    Other
 Frequency      2984     2678     2075     1261      420
 Proportion    0.317    0.284    0.220    0.134    0.045
 

SROH
image
nmissingdistinct
87137055
lowest :ExcellentVery_GoodGood Fair Poor
highest:ExcellentVery_GoodGood Fair Poor
 Value      Excellent Very_Good      Good      Fair      Poor
 Frequency        726      2098      3666      1922       301
 Proportion     0.083     0.241     0.421     0.221     0.035
 

9 Splitting into Testing/Training Samples

9.1 Splitting by CYCLE

I’ve suggested to some of you that you use the CYCLE variable to split the data into a training sample (2015-16 data) and a test sample (2017-18 data). That would work like this:

nh_1516 <- nh_proj %>% filter(CYCLE == "2015-16")
nh_1718 <- nh_proj %>% filter(CYCLE == "2017-18")

and that’s fine: you fit the model on nh_1516 and then test it on the quality of predictions made for nh_1718.

9.2 Splitting The Data with initial_split

An even better option that takes advantage of the initial_split function would be to simply create a split of the data across the two cycles, ensuring that a similar fraction of subjects in each cycle are found in both the training and test samples.

Suppose we want to accomplish the following:

  1. We want to use initial_split() to do our splitting into a training set of 1000 subjects and a testing set of 2500 additional subjects from the NHANES data in nh_proj, which contains 9418 observations.

  2. We want the proportion of subjects within each combination of the categorical variable CYCLE within nh_proj to be preserved in both our training set and our testing set.

Note that the table of percentages for CYCLE for the full nh_proj data follows.

nh_proj %>% tabyl(CYCLE)
   CYCLE    n   percent
 2015-16 4838 0.5136972
 2017-18 4580 0.4863028

We first select 3500 subjects from our nh_proj data set in such a way as to preserve the percentages who fall into each CYCLE.

set.seed(432)
nh_3500 <- nh_proj %>% group_by(CYCLE) %>% 
    slice_sample(., prop = 3501/nrow(nh_proj))

nh_3500 %>% tabyl(CYCLE)
   CYCLE    n   percent
 2015-16 1798 0.5137143
 2017-18 1702 0.4862857
  • Note that these percentages are essentially the same as the percentages we saw across the full sample in nhanes_proj but now this is a (stratified) random sample of 3500 observations.
  • Note also that I had to fool slice_sample into thinking I wanted a sample of 3501 out of the total rather than 3500 until I actually got exactly 3500 sampled subjects.

Next, we use initial_split() to create training and testing samples according to our specifications (so 1000 in the training sample and 2500 in the testing sample, again preserving the fraction in each CYCLE.)

nh_split <- initial_split(nh_3500, prop = 1000/3500, 
                          strata = CYCLE)
nh_training <- training(nh_split)
nh_testing <- testing(nh_split)

dim(nh_training)
[1] 1001    8
dim(nh_testing)
[1] 2499    8

Now, let’s check the fraction of subjects by CYCLE in each of our split samples.

nh_training %>% tabyl(CYCLE)
   CYCLE   n   percent
 2015-16 514 0.5134865
 2017-18 487 0.4865135
nh_testing %>% tabyl(CYCLE)
   CYCLE    n   percent
 2015-16 1284 0.5138055
 2017-18 1215 0.4861945

OK. These match well to what we were looking for. We should be all set.

---
title: "NHANES Toy Example for Project 2"
author: "Thomas E. Love"
date: "`r Sys.Date()`"
output: 
    html_document:
        toc: TRUE
        toc_float: TRUE
        number_sections: TRUE
        code_folding: show
        code_download: TRUE
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(comment = NA)
```

# Packages I'll Use

```{r, message = FALSE}
library(here)
library(magrittr)
library(janitor)
library(haven)
library(nhanesA)
library(naniar)
library(rsample)
library(gtsummary)
library(tidyverse)
```

# Objective

Suppose I wanted to include the following items in a data set, from both 2017-18 and 2015-16 NHANES.

From the DEMO files: [`DEMO_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm) and [`DEMO_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm)

- SEQN = Subject identifying code 
- RIDSTATR = Interview/Examination Status (categorical)
    - 1 = Interview only
    - 2 = Interview and MEC examination (MEC = mobile examination center)
- RIDAGEYR = Age in years (quantitative, topcoded at 80)
    - All subjects ages 80 and over are coded as 80
- RIAGENDR = Sex (categorical)
    - 1 = Male, 2 = Female
- RIDRETH3 = Race/Ethnicity (categorical)
    - 1 = Mexican-American, 
    - 2 = Other Hispanic (1 and 2 are often combined)
    - 3 = Non-Hispanic White
    - 4 = Non-Hispanic Black
    - 6 = Non-Hispanic Asian
    - 7 = Other Race including Multi-Racial
    - Note that categories 1 and 2 are often combined, and sometimes we leave out category 7, or combine it with 6.

From the Body Measures (BMX) files: [`BMX_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BMX_J.htm) and [`BMX_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm) (BMX is part of the Examination data)

- BMXWAIST = Waist Circumference in cm (quantitative)

From the Cholesterol - High-Density Lipoprotein (HDL) files: [`HDL_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/HDL_J.htm) and [`HDL_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/HDL_I.htm) (HDL is part of the Lab data)

- LBDHDD = Direct HDL cholesterol in mg/dl (quantitative)

From the Current Health Status (HSQ) file [`HSQ_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/HSQ_J.htm) and [`HSQ_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/HSQ_I.htm) (HSQ is part of the Questionnaire data)

- HSD010 = Self-reported general health (categorical)
    - 1 = Excellent
    - 2 = Very good
    - 3 = Good
    - 4 = Fair
    - 5 = Poor
    - 7 = Refused
    - 9 = Don't Know

Here is the process I would use to get going.

# Get and do initial cleaning on the 2017-18 data

## Pull in the Data from each necessary database for 2017-18, zap the labels and convert to tibbles

```{r}
demo_17raw <- nhanes('DEMO_J') %>%
    zap_label() %>% tibble()

# check: DEMO_J should have 9254 observations on 46 variables
dim(demo_17raw)

bmx_17raw <- nhanes('BMX_J') %>%
    zap_label() %>% tibble()

# check: BMX_J should have 8704 observations on 21 variables
dim(bmx_17raw)

hdl_17raw <- nhanes('HDL_J') %>%
    zap_label() %>% tibble()

# check: HDL_J should have 7435 observations on 3 variables
dim(hdl_17raw)

hsq_17raw <- nhanes('HSQ_J') %>%
    zap_label() %>% tibble()

# check: HSQ_J should have 8366 observations on 9 variables
dim(hsq_17raw)
```

## Merge the data with `inner_join()`

```{r}
# create temp1 which includes all rows in both DEMO and BMX
temp1 <- inner_join(demo_17raw, bmx_17raw, by = "SEQN")

# check: temp1 should have 8704 observations on 66 variables
# since everyone in BMX will also be in DEMO, by design
# but not everyone in DEMO should also be in BMX
# and since SEQN is in both files
dim(temp1)

# create temp2 which includes all rows in both HDL and HSQ
temp2 <- inner_join(hdl_17raw, hsq_17raw, by = "SEQN")

# check: temp2 should have 7435 observations on 11 variables
# since everyone in HDL will also be in HSQ, by design
# but not everyone in HSQ should also be in HDL
# and since SEQN is in both files
dim(temp2)

# create merge_17raw which is all rows in both temp1 and temp2
merge_17raw <- inner_join(temp1, temp2, by = "SEQN")

# check: merge_17raw should have 7435 observations on 76 variables
dim(merge_17raw)

# clean up
rm(temp1, temp2)
```

## Select the 8 variables we'll need and convert categorical variables to factors

```{r}
nh_17 <- merge_17raw %>%
    select(SEQN, RIDSTATR, RIDAGEYR, RIAGENDR, RIDRETH3,
           BMXWAIST, LBDHDD, HSD010) %>%
    mutate(SEQN = as.character(SEQN),
           RIDSTATR = factor(RIDSTATR),
           RIAGENDR = factor(RIAGENDR),
           RIDRETH3 = factor(RIDRETH3),
           HSD010 = factor(HSD010))
```

## Add an indicator of the CYCLE in which the data were drawn

```{r}
nh_17 <- nh_17 %>%
    mutate(CYCLE = "2017-18")

nh_17
```

The resulting `nh_17` data should have 7435 observations on 9 variables

# Get and do initial cleaning on the 2015-16 data

## Pull in the Data from each necessary database for 2015-16, zap the labels and convert to tibbles

```{r}
demo_15raw <- nhanes('DEMO_I') %>%
    zap_label() %>% tibble()

# check: DEMO_I should have 9971 observations on 47 variables
dim(demo_15raw)

bmx_15raw <- nhanes('BMX_I') %>%
    zap_label() %>% tibble()

# check: BMX_I should have 9544 observations on 26 variables
dim(bmx_15raw)

hdl_15raw <- nhanes('HDL_I') %>%
    zap_label() %>% tibble()

# check: HDL_I should have 8021 observations on 3 variables
dim(hdl_15raw)

hsq_15raw <- nhanes('HSQ_I') %>%
    zap_label() %>% tibble()

# check: HSQ_I should have 9165 observations on 9 variables
dim(hsq_15raw)
```

## Merge the data with `inner_join()`

```{r}
# create tempA which includes all rows in both DEMO and BMX
tempA <- inner_join(demo_15raw, bmx_15raw, by = "SEQN")

# check: tempA should have 9544 observations on 72 variables
dim(tempA)

# create tempB which includes all rows in both HDL and HSQ
tempB <- inner_join(hdl_15raw, hsq_15raw, by = "SEQN")

# check: tempB should have 8021 observations on 11 variables
dim(tempB)

# create merge_15raw which is all rows in both tempA and tempB
merge_15raw <- inner_join(tempA, tempB, by = "SEQN")

# check: merge_15raw should have 8021 observations on 82 variables
dim(merge_15raw)
```

## Select the 8 variables we'll need and convert categorical variables to factors

```{r}
nh_15 <- merge_15raw %>%
    select(SEQN, RIDSTATR, RIDAGEYR, RIAGENDR, RIDRETH3,
           BMXWAIST, LBDHDD, HSD010) %>%
    mutate(SEQN = as.character(SEQN),
           RIDSTATR = factor(RIDSTATR),
           RIAGENDR = factor(RIAGENDR),
           RIDRETH3 = factor(RIDRETH3),
           HSD010 = factor(HSD010))
```

## Add an indicator of the CYCLE in which the data were drawn

```{r}
nh_15 <- nh_15 %>%
    mutate(CYCLE = "2015-16")

nh_15
```

The resulting `nh_15` data should have 8021 observations on 9 variables

# Combine the 2017-18 and 2015-16 tibbles

```{r}
nh_proj <- full_join(nh_15, nh_17)
```

Result should have 15456 observations on the same 9 variables.

```{r}
dim(nh_proj)
```

Most of the time in working with NHANES data, I restrict my work to adult subjects. Suppose here we want people between the ages of 20 and 79, to avoid the fact that NHANES classifies everyone over age 80 as `RIDAGEYR` = 80. The code I'd use is:

```{r}
nh_proj <- nh_proj %>%
    filter(RIDAGEYR > 19 & RIDAGEYR < 80)

nh_proj
```

and we wind up with 10,014 observations on 9 variables.

# Checking The Variables

## Quantitative variables 

We have three quantitative variables: age (`RIDAGEYR`), waist circumference (`BMXWAIST`) and HDL Cholesterol (`LBDHDD`).

Let's rename those variables.

```{r}
nh_proj <- nh_proj %>%
    rename(AGE = RIDAGEYR, WAIST = BMXWAIST, HDL = LBDHDD)
```

Now, we'll check their ranges, and how much missingness we're dealing with using a quick `summary()`.

```{r}
nh_proj %>% select(AGE, WAIST, HDL) %>%
    summary()
```

The ranges (minimum and maximum) for Age, Waist Circumference and HDL Cholesterol all seem fairly plausible to me in light of our earlier decisions. Perhaps you know better, and please do use that knowledge. We do have several hundred missing values in the waist circumference and HDL cholesterol values that we will need to deal with.

## Categorical Variables

### `RIDSTATR` and `CYCLE`

We're just checking here to see that everyone has `RIDSTATR` status 2, meaning they completed both the questionnaire and the examination. Since they do, we'll also make sure our `CYCLE` variable works to indicate the NHANES reporting CYCLE for each subject.

```{r}
nh_proj %>% tabyl(CYCLE, RIDSTATR)
```

### RIAGENDR, which we'll change to `SEX` or `FEMALE`

```{r}
nh_proj %>% tabyl(RIAGENDR)
```

We certainly want to recode this and name it more effectively, and make it a non-numerical factor.

```{r}
nh_proj <- nh_proj %>%
    mutate(SEX = ifelse(RIAGENDR == 1, "M", "F"))
```

Checking our results...

```{r}
nh_proj %>% tabyl(SEX, RIAGENDR)
```

Good. We can replace `RIAGENDR` with `SEX` moving forward. Another option would have been to use a 1/0 numeric variable, and if I'd wanted to do that, I would have run something like 

```{r}
nh_proj <- nh_proj %>%
    mutate(FEMALE = as.numeric(RIAGENDR) - 1)

nh_proj %>% tabyl(FEMALE, SEX)
```

### `RIDRETH3`, which we'll change to `RACE_ETH`

Let's do the following:

- Recode the levels of the `RIDRETH3` variable to use short, understandable group names.
- Collapse the first two categories (Mexican-American and Other Hispanic) into a single category.
- Change the variable name to `RACE_ETH`.
- Sort the resulting factor in order of their counts.


```{r}
nh_proj <- nh_proj %>%
    mutate(RACE_ETH = fct_recode(RIDRETH3,
                                 "Hispanic" = "1",
                                 "Hispanic" = "2",
                                 "NH_White" = "3",
                                 "NH_Black" = "4",
                                 "NH_Asian" = "6",
                                 "Other" = "7"),
           RACE_ETH = fct_infreq(RACE_ETH))
```

Let's look at the results. Do we need to collapse further in this case?

```{r}
nh_proj %>% tabyl(RACE_ETH, RIDRETH3)
```

The most common `RACE_ETH` is Non-Hispanic White, followed by Hispanic, then Non-Hispanic Black, Non-Hispanic Asian and finally Other Race (including Multi-Racial.) We won't collapse any further for now.

### `HSD010`, which we'll change to `SROH`

- Again, we need to recode the levels of this categorical variable. 
- I'll also rename the variable SROH (which is an abbreviation for self-reported overall health.) Always explain abbreviations.
- We also need to convert the values 7 (Refused) and 9 (Don't Know) to missing values.

```{r}
nh_proj <- nh_proj %>%
    mutate(HSD010 = na_if(HSD010, 7),
           HSD010 = na_if(HSD010, 9)) %>%
    mutate(SROH = fct_recode(HSD010,
                             "Excellent" = "1",
                             "Very_Good" = "2",
                             "Good" = "3",
                             "Fair" = "4",
                             "Poor" = "5")) %>%
    droplevels() # drop unused levels in all factors
```

```{r}
nh_proj %>% tabyl(SROH, HSD010)
```

## Reset the variables we will actually use

- We don't need `RIDSTATR` any more because we've checked and see that all its values are 2, as needed.
- We've renamed some variables and replaced others with better versions.

```{r}
nh_proj <- nh_proj %>% 
    select(SEQN, CYCLE, AGE, SEX, RACE_ETH, WAIST, HDL, SROH)
```

We should now have 10,014 observations on these 8 variables.

```{r}
dim(nh_proj)
```

## Missingness check

```{r}
miss_var_summary(nh_proj)
```

Three of our eight variables have missing values. We've got some imputation ahead of us.

## What if HDL was our outcome?

If one of those three variables with missing values was our outcome, say, for example, HDL, then we would filter our data for complete cases on that variable.

```{r}
nh_proj <- nh_proj %>% filter(complete.cases(HDL))

dim(nh_proj)
```

and we're now down to 9418 observations on 8 variables. Here's the revised report on missing data.

```{r}
miss_var_summary(nh_proj) %>% filter(n_miss > 0)
```

How many observations have any missing values at all, now? How many are missing both `SROH` and `WAIST`?

```{r}
miss_case_table(nh_proj)
```

# Saving a Tidy Tibble

```{r}
saveRDS(nh_proj, here("data", "nh_proj.Rds"))
```

# Codebook

A successful codebook for Project 2 will include:

- a set of five or so statements about the data, followed by 
- a `tbl_summary()` report from `gtsummary` or some equivalent brief description of the data, and that should be followed by 
- a fuller report on the distributions from the `describe()` function from `Hmisc`.

I've chosen here to build a codebook with some useful information describing the entire sample. In your project 2, I would be happy with a codebook using only your model training sample, or including the whole sample (training and testing together), or whatever else you feel is most useful to present. I would begin the codebook with statements like the ones shown below to fix ideas about the sample size, the missingness, the outcome (and you might have more than one, of course) and a statement about the roles for the other variables shown in the table below. If you look at the R Markdown code that generated this document, you will see that I have used in-line coding to fill in the counts of subjects in statements 1 and 2. You should, too.

1. **Sample Size** The data in our complete `nh_proj` sample consist of `r nrow(nh_proj)` subjects from NHANES 2015-16 and NHANES 2017-18 between the ages of 20 and 79 in whom our outcome variable was measured. (If I had other inclusion/exclusion criteria, I would list them here.)
2. **Missingness** Of the `r nrow(nh_proj)` subjects, `r n_case_complete(nh_proj %>% select(HDL, AGE, WAIST, CYCLE, SEX, RACE_ETH, SROH))` have complete data on all variables listed below.
3. Our **outcome** variable is `HDL`, which is the subject's serum HDL cholesterol measured in mg/dl.
4. All other variables listed below will serve as candidate **predictors** for our models.
5. The other variable contained in my tidy tibble is `SEQN` which is the subject identifying code.

## Using `gtsummary` to present key information about the variables

Variables included in our analyses are summarized in the following table.

```{r, warning = FALSE}
nh_proj %>% 
    select(HDL, AGE, WAIST, CYCLE, SEX, RACE_ETH, SROH) %>%
    tbl_summary(.,
        label = list(
            HDL = "HDL (HDL Cholesterol in mg/dl)",
            AGE = "AGE (in years)",
            WAIST = "WAIST (circumference in cm)",
            CYCLE = "CYCLE (NHANES reporting)",
            RACE_ETH = "RACE_ETH (Race/Ethnicity)",
            SROH = "SROH (Self-Reported Overall Health)"),
        stat = list( all_continuous() ~ 
                "{median} [{min} to {max}]" ))
```

## Data Distribution Description

```{r, warning = FALSE}
nh_proj %>% 
    select(HDL, AGE, WAIST, CYCLE, SEX, RACE_ETH, SROH) %>%
    Hmisc::describe() %>% Hmisc::html()
```

# Splitting into Testing/Training Samples

## Splitting by `CYCLE`

I've suggested to some of you that you use the `CYCLE` variable to split the data into a training sample (2015-16 data) and a test sample (2017-18 data). That would work like this:

```{r}
nh_1516 <- nh_proj %>% filter(CYCLE == "2015-16")
nh_1718 <- nh_proj %>% filter(CYCLE == "2017-18")
```

and that's fine: you fit the model on `nh_1516` and then test it on the quality of predictions made for `nh_1718`. 

## Splitting The Data with `initial_split`

An **even better** option that takes advantage of the `initial_split` function would be to simply create a split of the data across the two cycles, ensuring that a similar fraction of subjects in each cycle are found in both the training and test samples.

Suppose we want to accomplish the following:

1. We want to use `initial_split()` to do our splitting into a training set of 1000 subjects and a testing set of 2500 additional subjects from the NHANES data in `nh_proj`, which contains `r nrow(nh_proj)` observations.

2. We want the proportion of subjects within each combination of the categorical variable `CYCLE` within `nh_proj` to be preserved in both our training set and our testing set.

Note that the table of percentages for `CYCLE` for the full `nh_proj` data follows.

```{r}
nh_proj %>% tabyl(CYCLE)
```

We first select 3500 subjects from our `nh_proj` data set in such a way as to preserve the percentages who fall into each `CYCLE`. 

```{r}
set.seed(432)
nh_3500 <- nh_proj %>% group_by(CYCLE) %>% 
    slice_sample(., prop = 3501/nrow(nh_proj))

nh_3500 %>% tabyl(CYCLE)
```

- Note that these percentages are essentially the same as the percentages we saw across the full sample in `nhanes_proj` but now this is a (stratified) random sample of 3500 observations. 
- Note also that I had to fool `slice_sample` into thinking I wanted a sample of 3501 out of the total rather than 3500 until I actually got exactly 3500 sampled subjects.

Next, we use `initial_split()` to create training and testing samples according to our specifications (so 1000 in the training sample and 2500 in the testing sample, again preserving the fraction in each CYCLE.)

```{r}
nh_split <- initial_split(nh_3500, prop = 1000/3500, 
                          strata = CYCLE)
nh_training <- training(nh_split)
nh_testing <- testing(nh_split)

dim(nh_training)
dim(nh_testing)
```

Now, let's check the fraction of subjects by `CYCLE` in each of our split samples.

```{r}
nh_training %>% tabyl(CYCLE)
```

```{r}
nh_testing %>% tabyl(CYCLE)
```

OK. These match well to what we were looking for. We should be all set.
