Packages I’ll Use
library(here)
library(magrittr)
library(janitor)
library(haven)
library(nhanesA)
library(naniar)
library(rsample)
library(gtsummary)
library(tidyverse)
Objective
Suppose I wanted to include the following items in a data set, from both 2017-18 and 2015-16 NHANES.
From the DEMO files: DEMO_J
for 2017-18 and DEMO_I
for 2015-16
- SEQN = Subject identifying code
- RIDSTATR = Interview/Examination Status (categorical)
- 1 = Interview only
- 2 = Interview and MEC examination (MEC = mobile examination center)
- RIDAGEYR = Age in years (quantitative, topcoded at 80)
- All subjects ages 80 and over are coded as 80
- RIAGENDR = Sex (categorical)
- RIDRETH3 = Race/Ethnicity (categorical)
- 1 = Mexican-American,
- 2 = Other Hispanic (1 and 2 are often combined)
- 3 = Non-Hispanic White
- 4 = Non-Hispanic Black
- 6 = Non-Hispanic Asian
- 7 = Other Race including Multi-Racial
- Note that categories 1 and 2 are often combined, and sometimes we leave out category 7, or combine it with 6.
From the Body Measures (BMX) files: BMX_J
for 2017-18 and BMX_I
for 2015-16 (BMX is part of the Examination data)
- BMXWAIST = Waist Circumference in cm (quantitative)
From the Cholesterol - High-Density Lipoprotein (HDL) files: HDL_J
for 2017-18 and HDL_I
for 2015-16 (HDL is part of the Lab data)
- LBDHDD = Direct HDL cholesterol in mg/dl (quantitative)
From the Current Health Status (HSQ) file HSQ_J
for 2017-18 and HSQ_I
for 2015-16 (HSQ is part of the Questionnaire data)
- HSD010 = Self-reported general health (categorical)
- 1 = Excellent
- 2 = Very good
- 3 = Good
- 4 = Fair
- 5 = Poor
- 7 = Refused
- 9 = Don’t Know
Here is the process I would use to get going.
Get and do initial cleaning on the 2017-18 data
Pull in the Data from each necessary database for 2017-18, zap the labels and convert to tibbles
demo_17raw <- nhanes('DEMO_J') %>%
zap_label() %>% tibble()
Processing SAS dataset DEMO_J ..
# check: DEMO_J should have 9254 observations on 46 variables
dim(demo_17raw)
[1] 9254 46
bmx_17raw <- nhanes('BMX_J') %>%
zap_label() %>% tibble()
Processing SAS dataset BMX_J ..
# check: BMX_J should have 8704 observations on 21 variables
dim(bmx_17raw)
[1] 8704 21
hdl_17raw <- nhanes('HDL_J') %>%
zap_label() %>% tibble()
Processing SAS dataset HDL_J ..
# check: HDL_J should have 7435 observations on 3 variables
dim(hdl_17raw)
[1] 7435 3
hsq_17raw <- nhanes('HSQ_J') %>%
zap_label() %>% tibble()
Processing SAS dataset HSQ_J ..
# check: HSQ_J should have 8366 observations on 9 variables
dim(hsq_17raw)
[1] 8366 9
Merge the data with inner_join()
# create temp1 which includes all rows in both DEMO and BMX
temp1 <- inner_join(demo_17raw, bmx_17raw, by = "SEQN")
# check: temp1 should have 8704 observations on 66 variables
# since everyone in BMX will also be in DEMO, by design
# but not everyone in DEMO should also be in BMX
# and since SEQN is in both files
dim(temp1)
[1] 8704 66
# create temp2 which includes all rows in both HDL and HSQ
temp2 <- inner_join(hdl_17raw, hsq_17raw, by = "SEQN")
# check: temp2 should have 7435 observations on 11 variables
# since everyone in HDL will also be in HSQ, by design
# but not everyone in HSQ should also be in HDL
# and since SEQN is in both files
dim(temp2)
[1] 7435 11
# create merge_17raw which is all rows in both temp1 and temp2
merge_17raw <- inner_join(temp1, temp2, by = "SEQN")
# check: merge_17raw should have 7435 observations on 76 variables
dim(merge_17raw)
[1] 7435 76
# clean up
rm(temp1, temp2)
Select the 8 variables we’ll need and convert categorical variables to factors
nh_17 <- merge_17raw %>%
select(SEQN, RIDSTATR, RIDAGEYR, RIAGENDR, RIDRETH3,
BMXWAIST, LBDHDD, HSD010) %>%
mutate(SEQN = as.character(SEQN),
RIDSTATR = factor(RIDSTATR),
RIAGENDR = factor(RIAGENDR),
RIDRETH3 = factor(RIDRETH3),
HSD010 = factor(HSD010))
Add an indicator of the CYCLE in which the data were drawn
nh_17 <- nh_17 %>%
mutate(CYCLE = "2017-18")
nh_17
# A tibble: 7,435 x 9
SEQN RIDSTATR RIDAGEYR RIAGENDR RIDRETH3 BMXWAIST LBDHDD HSD010 CYCLE
<chr> <fct> <int> <fct> <fct> <dbl> <int> <fct> <chr>
1 93705 2 66 2 4 102. 60 3 2017-18
2 93706 2 18 1 6 79.3 47 2 2017-18
3 93707 2 13 1 7 64.1 68 3 2017-18
4 93708 2 66 2 6 88.2 88 3 2017-18
5 93709 2 75 2 4 113 65 <NA> 2017-18
6 93711 2 56 1 6 86.6 72 2 2017-18
7 93712 2 18 1 1 72 48 3 2017-18
8 93713 2 67 1 3 99.7 48 2 2017-18
9 93714 2 54 2 4 118. 42 3 2017-18
10 93715 2 71 1 7 89.7 57 4 2017-18
# ... with 7,425 more rows
The resulting nh_17
data should have 7435 observations on 9 variables
Get and do initial cleaning on the 2015-16 data
Pull in the Data from each necessary database for 2015-16, zap the labels and convert to tibbles
demo_15raw <- nhanes('DEMO_I') %>%
zap_label() %>% tibble()
Processing SAS dataset DEMO_I ..
# check: DEMO_I should have 9971 observations on 47 variables
dim(demo_15raw)
[1] 9971 47
bmx_15raw <- nhanes('BMX_I') %>%
zap_label() %>% tibble()
Processing SAS dataset BMX_I ..
# check: BMX_I should have 9544 observations on 26 variables
dim(bmx_15raw)
[1] 9544 26
hdl_15raw <- nhanes('HDL_I') %>%
zap_label() %>% tibble()
Processing SAS dataset HDL_I ..
# check: HDL_I should have 8021 observations on 3 variables
dim(hdl_15raw)
[1] 8021 3
hsq_15raw <- nhanes('HSQ_I') %>%
zap_label() %>% tibble()
Processing SAS dataset HSQ_I ..
# check: HSQ_I should have 9165 observations on 9 variables
dim(hsq_15raw)
[1] 9165 9
Merge the data with inner_join()
# create tempA which includes all rows in both DEMO and BMX
tempA <- inner_join(demo_15raw, bmx_15raw, by = "SEQN")
# check: tempA should have 9544 observations on 72 variables
dim(tempA)
[1] 9544 72
# create tempB which includes all rows in both HDL and HSQ
tempB <- inner_join(hdl_15raw, hsq_15raw, by = "SEQN")
# check: tempB should have 8021 observations on 11 variables
dim(tempB)
[1] 8021 11
# create merge_15raw which is all rows in both tempA and tempB
merge_15raw <- inner_join(tempA, tempB, by = "SEQN")
# check: merge_15raw should have 8021 observations on 82 variables
dim(merge_15raw)
[1] 8021 82
Select the 8 variables we’ll need and convert categorical variables to factors
nh_15 <- merge_15raw %>%
select(SEQN, RIDSTATR, RIDAGEYR, RIAGENDR, RIDRETH3,
BMXWAIST, LBDHDD, HSD010) %>%
mutate(SEQN = as.character(SEQN),
RIDSTATR = factor(RIDSTATR),
RIAGENDR = factor(RIAGENDR),
RIDRETH3 = factor(RIDRETH3),
HSD010 = factor(HSD010))
Add an indicator of the CYCLE in which the data were drawn
nh_15 <- nh_15 %>%
mutate(CYCLE = "2015-16")
nh_15
# A tibble: 8,021 x 9
SEQN RIDSTATR RIDAGEYR RIAGENDR RIDRETH3 BMXWAIST LBDHDD HSD010 CYCLE
<chr> <fct> <int> <fct> <fct> <dbl> <int> <fct> <chr>
1 83732 2 62 1 3 101. 46 3 2015-16
2 83733 2 53 1 3 108. 63 2 2015-16
3 83734 2 78 1 3 116. 30 4 2015-16
4 83735 2 56 2 3 110. 61 3 2015-16
5 83736 2 42 2 4 80.4 53 4 2015-16
6 83737 2 72 2 1 92.9 78 3 2015-16
7 83738 2 11 2 1 67.5 43 <NA> 2015-16
8 83741 2 22 1 4 86.6 48 3 2015-16
9 83742 2 32 2 1 93.3 28 2 2015-16
10 83743 2 18 1 6 NA 53 <NA> 2015-16
# ... with 8,011 more rows
The resulting nh_15
data should have 8021 observations on 9 variables
Combine the 2017-18 and 2015-16 tibbles
nh_proj <- full_join(nh_15, nh_17)
Joining, by = c("SEQN", "RIDSTATR", "RIDAGEYR", "RIAGENDR", "RIDRETH3", "BMXWAIST", "LBDHDD", "HSD010", "CYCLE")
Result should have 15456 observations on the same 9 variables.
dim(nh_proj)
[1] 15456 9
Most of the time in working with NHANES data, I restrict my work to adult subjects. Suppose here we want people between the ages of 20 and 79, to avoid the fact that NHANES classifies everyone over age 80 as RIDAGEYR
= 80. The code I’d use is:
nh_proj <- nh_proj %>%
filter(RIDAGEYR > 19 & RIDAGEYR < 80)
nh_proj
# A tibble: 10,014 x 9
SEQN RIDSTATR RIDAGEYR RIAGENDR RIDRETH3 BMXWAIST LBDHDD HSD010 CYCLE
<chr> <fct> <int> <fct> <fct> <dbl> <int> <fct> <chr>
1 83732 2 62 1 3 101. 46 3 2015-16
2 83733 2 53 1 3 108. 63 2 2015-16
3 83734 2 78 1 3 116. 30 4 2015-16
4 83735 2 56 2 3 110. 61 3 2015-16
5 83736 2 42 2 4 80.4 53 4 2015-16
6 83737 2 72 2 1 92.9 78 3 2015-16
7 83741 2 22 1 4 86.6 48 3 2015-16
8 83742 2 32 2 1 93.3 28 2 2015-16
9 83744 2 56 1 4 116 52 3 2015-16
10 83747 2 46 1 3 104. 44 4 2015-16
# ... with 10,004 more rows
and we wind up with 10,014 observations on 9 variables.
Checking The Variables
Quantitative variables
We have three quantitative variables: age (RIDAGEYR
), waist circumference (BMXWAIST
) and HDL Cholesterol (LBDHDD
).
Let’s rename those variables.
nh_proj <- nh_proj %>%
rename(AGE = RIDAGEYR, WAIST = BMXWAIST, HDL = LBDHDD)
Now, we’ll check their ranges, and how much missingness we’re dealing with using a quick summary()
.
nh_proj %>% select(AGE, WAIST, HDL) %>%
summary()
AGE WAIST HDL
Min. :20.00 Min. : 57.9 Min. : 6.00
1st Qu.:34.00 1st Qu.: 88.4 1st Qu.: 42.00
Median :49.00 Median : 99.0 Median : 51.00
Mean :48.28 Mean :100.5 Mean : 53.49
3rd Qu.:62.00 3rd Qu.:110.5 3rd Qu.: 62.00
Max. :79.00 Max. :171.6 Max. :226.00
NA's :556 NA's :596
The ranges (minimum and maximum) for Age, Waist Circumference and HDL Cholesterol all seem fairly plausible to me in light of our earlier decisions. Perhaps you know better, and please do use that knowledge. We do have several hundred missing values in the waist circumference and HDL cholesterol values that we will need to deal with.
Categorical Variables
RIDSTATR
and CYCLE
We’re just checking here to see that everyone has RIDSTATR
status 2, meaning they completed both the questionnaire and the examination. Since they do, we’ll also make sure our CYCLE
variable works to indicate the NHANES reporting CYCLE for each subject.
nh_proj %>% tabyl(CYCLE, RIDSTATR)
CYCLE 2
2015-16 5131
2017-18 4883
RIAGENDR, which we’ll change to SEX
or FEMALE
nh_proj %>% tabyl(RIAGENDR)
RIAGENDR n percent
1 4817 0.4810266
2 5197 0.5189734
We certainly want to recode this and name it more effectively, and make it a non-numerical factor.
nh_proj <- nh_proj %>%
mutate(SEX = ifelse(RIAGENDR == 1, "M", "F"))
Checking our results…
nh_proj %>% tabyl(SEX, RIAGENDR)
SEX 1 2
F 0 5197
M 4817 0
Good. We can replace RIAGENDR
with SEX
moving forward. Another option would have been to use a 1/0 numeric variable, and if I’d wanted to do that, I would have run something like
nh_proj <- nh_proj %>%
mutate(FEMALE = as.numeric(RIAGENDR) - 1)
nh_proj %>% tabyl(FEMALE, SEX)
FEMALE F M
0 0 4817
1 5197 0
RIDRETH3
, which we’ll change to RACE_ETH
Let’s do the following:
- Recode the levels of the
RIDRETH3
variable to use short, understandable group names.
- Collapse the first two categories (Mexican-American and Other Hispanic) into a single category.
- Change the variable name to
RACE_ETH
.
- Sort the resulting factor in order of their counts.
nh_proj <- nh_proj %>%
mutate(RACE_ETH = fct_recode(RIDRETH3,
"Hispanic" = "1",
"Hispanic" = "2",
"NH_White" = "3",
"NH_Black" = "4",
"NH_Asian" = "6",
"Other" = "7"),
RACE_ETH = fct_infreq(RACE_ETH))
Let’s look at the results. Do we need to collapse further in this case?
nh_proj %>% tabyl(RACE_ETH, RIDRETH3)
RACE_ETH 1 2 3 4 6 7
NH_White 0 0 3117 0 0 0
Hispanic 1605 1184 0 0 0 0
NH_Black 0 0 0 2295 0 0
NH_Asian 0 0 0 0 1366 0
Other 0 0 0 0 0 447
The most common RACE_ETH
is Non-Hispanic White, followed by Hispanic, then Non-Hispanic Black, Non-Hispanic Asian and finally Other Race (including Multi-Racial.) We won’t collapse any further for now.
HSD010
, which we’ll change to SROH
- Again, we need to recode the levels of this categorical variable.
- I’ll also rename the variable SROH (which is an abbreviation for self-reported overall health.) Always explain abbreviations.
- We also need to convert the values 7 (Refused) and 9 (Don’t Know) to missing values.
nh_proj <- nh_proj %>%
mutate(HSD010 = na_if(HSD010, 7),
HSD010 = na_if(HSD010, 9)) %>%
mutate(SROH = fct_recode(HSD010,
"Excellent" = "1",
"Very_Good" = "2",
"Good" = "3",
"Fair" = "4",
"Poor" = "5")) %>%
droplevels() # drop unused levels in all factors
nh_proj %>% tabyl(SROH, HSD010)
SROH 1 2 3 4 5 NA_
Excellent 772 0 0 0 0 0
Very_Good 0 2216 0 0 0 0
Good 0 0 3865 0 0 0
Fair 0 0 0 2020 0 0
Poor 0 0 0 0 322 0
<NA> 0 0 0 0 0 819
Reset the variables we will actually use
- We don’t need
RIDSTATR
any more because we’ve checked and see that all its values are 2, as needed.
- We’ve renamed some variables and replaced others with better versions.
nh_proj <- nh_proj %>%
select(SEQN, CYCLE, AGE, SEX, RACE_ETH, WAIST, HDL, SROH)
We should now have 10,014 observations on these 8 variables.
dim(nh_proj)
[1] 10014 8
Missingness check
miss_var_summary(nh_proj)
# A tibble: 8 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 SROH 819 8.18
2 HDL 596 5.95
3 WAIST 556 5.55
4 SEQN 0 0
5 CYCLE 0 0
6 AGE 0 0
7 SEX 0 0
8 RACE_ETH 0 0
Three of our eight variables have missing values. We’ve got some imputation ahead of us.
What if HDL was our outcome?
If one of those three variables with missing values was our outcome, say, for example, HDL, then we would filter our data for complete cases on that variable.
nh_proj <- nh_proj %>% filter(complete.cases(HDL))
dim(nh_proj)
[1] 9418 8
and we’re now down to 9418 observations on 8 variables. Here’s the revised report on missing data.
miss_var_summary(nh_proj) %>% filter(n_miss > 0)
# A tibble: 2 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 SROH 705 7.49
2 WAIST 436 4.63
How many observations have any missing values at all, now? How many are missing both SROH
and WAIST
?
miss_case_table(nh_proj)
# A tibble: 3 x 3
n_miss_in_case n_cases pct_cases
<int> <int> <dbl>
1 0 8444 89.7
2 1 807 8.57
3 2 167 1.77
Saving a Tidy Tibble
saveRDS(nh_proj, here("data", "nh_proj.Rds"))
Codebook
A successful codebook for Project 2 will include:
- a set of five or so statements about the data, followed by
- a
tbl_summary()
report from gtsummary
or some equivalent brief description of the data, and that should be followed by
- a fuller report on the distributions from the
describe()
function from Hmisc
.
I’ve chosen here to build a codebook with some useful information describing the entire sample. In your project 2, I would be happy with a codebook using only your model training sample, or including the whole sample (training and testing together), or whatever else you feel is most useful to present. I would begin the codebook with statements like the ones shown below to fix ideas about the sample size, the missingness, the outcome (and you might have more than one, of course) and a statement about the roles for the other variables shown in the table below. If you look at the R Markdown code that generated this document, you will see that I have used in-line coding to fill in the counts of subjects in statements 1 and 2. You should, too.
- Sample Size The data in our complete
nh_proj
sample consist of 9418 subjects from NHANES 2015-16 and NHANES 2017-18 between the ages of 20 and 79 in whom our outcome variable was measured. (If I had other inclusion/exclusion criteria, I would list them here.)
- Missingness Of the 9418 subjects, 8444 have complete data on all variables listed below.
- Our outcome variable is
HDL
, which is the subject’s serum HDL cholesterol measured in mg/dl.
- All other variables listed below will serve as candidate predictors for our models.
- The other variable contained in my tidy tibble is
SEQN
which is the subject identifying code.
Data Distribution Description
nh_proj %>%
select(HDL, AGE, WAIST, CYCLE, SEX, RACE_ETH, SROH) %>%
Hmisc::describe() %>% Hmisc::html()
.
7 Variables 9418 Observations
HDL
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
9418 | 0 | 132 | 1 | 53.49 | 17.93 | 32.00 | 35.00 | 42.00 | 51.00 | 62.00 | 75.00 | 83.15 |
lowest : 6 10 11 15 16 , highest: 165 166 178 189 226
AGE
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
9418 | 0 | 60 | 1 | 48.39 | 18.84 | 23 | 26 | 34 | 49 | 62 | 70 | 74 |
lowest : 20 21 22 23 24 , highest: 75 76 77 78 79
WAIST
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
8982 | 436 | 869 | 1 | 100.6 | 18.99 | 75.20 | 79.80 | 88.50 | 99.15 | 110.70 | 122.80 | 131.70 |
lowest : 58.7 62.3 63.2 63.4 63.7 , highest: 164.9 165.0 166.0 169.5 171.6
CYCLE
Value 2015-16 2017-18
Frequency 4838 4580
Proportion 0.514 0.486
SEX
Value F M
Frequency 4888 4530
Proportion 0.519 0.481
RACE_ETH
lowest : NH_White Hispanic NH_Black NH_Asian Other , highest: NH_White Hispanic NH_Black NH_Asian Other
Value NH_White Hispanic NH_Black NH_Asian Other
Frequency 2984 2678 2075 1261 420
Proportion 0.317 0.284 0.220 0.134 0.045
SROH
n | missing | distinct |
8713 | 705 | 5 |
lowest : | Excellent | Very_Good | Good | Fair | Poor |
highest: | Excellent | Very_Good | Good | Fair | Poor |
Value Excellent Very_Good Good Fair Poor
Frequency 726 2098 3666 1922 301
Proportion 0.083 0.241 0.421 0.221 0.035
Splitting into Testing/Training Samples
Splitting by CYCLE
I’ve suggested to some of you that you use the CYCLE
variable to split the data into a training sample (2015-16 data) and a test sample (2017-18 data). That would work like this:
nh_1516 <- nh_proj %>% filter(CYCLE == "2015-16")
nh_1718 <- nh_proj %>% filter(CYCLE == "2017-18")
and that’s fine: you fit the model on nh_1516
and then test it on the quality of predictions made for nh_1718
.
Splitting The Data with initial_split
An even better option that takes advantage of the initial_split
function would be to simply create a split of the data across the two cycles, ensuring that a similar fraction of subjects in each cycle are found in both the training and test samples.
Suppose we want to accomplish the following:
We want to use initial_split()
to do our splitting into a training set of 1000 subjects and a testing set of 2500 additional subjects from the NHANES data in nh_proj
, which contains 9418 observations.
We want the proportion of subjects within each combination of the categorical variable CYCLE
within nh_proj
to be preserved in both our training set and our testing set.
Note that the table of percentages for CYCLE
for the full nh_proj
data follows.
nh_proj %>% tabyl(CYCLE)
CYCLE n percent
2015-16 4838 0.5136972
2017-18 4580 0.4863028
We first select 3500 subjects from our nh_proj
data set in such a way as to preserve the percentages who fall into each CYCLE
.
set.seed(432)
nh_3500 <- nh_proj %>% group_by(CYCLE) %>%
slice_sample(., prop = 3501/nrow(nh_proj))
nh_3500 %>% tabyl(CYCLE)
CYCLE n percent
2015-16 1798 0.5137143
2017-18 1702 0.4862857
- Note that these percentages are essentially the same as the percentages we saw across the full sample in
nhanes_proj
but now this is a (stratified) random sample of 3500 observations.
- Note also that I had to fool
slice_sample
into thinking I wanted a sample of 3501 out of the total rather than 3500 until I actually got exactly 3500 sampled subjects.
Next, we use initial_split()
to create training and testing samples according to our specifications (so 1000 in the training sample and 2500 in the testing sample, again preserving the fraction in each CYCLE.)
nh_split <- initial_split(nh_3500, prop = 1000/3500,
strata = CYCLE)
nh_training <- training(nh_split)
nh_testing <- testing(nh_split)
dim(nh_training)
[1] 1001 8
dim(nh_testing)
[1] 2499 8
Now, let’s check the fraction of subjects by CYCLE
in each of our split samples.
nh_training %>% tabyl(CYCLE)
CYCLE n percent
2015-16 514 0.5134865
2017-18 487 0.4865135
nh_testing %>% tabyl(CYCLE)
CYCLE n percent
2015-16 1284 0.5138055
2017-18 1215 0.4861945
OK. These match well to what we were looking for. We should be all set.
---
title: "NHANES Toy Example for Project 2"
author: "Thomas E. Love"
date: "`r Sys.Date()`"
output: 
    html_document:
        toc: TRUE
        toc_float: TRUE
        number_sections: TRUE
        code_folding: show
        code_download: TRUE
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(comment = NA)
```

# Packages I'll Use

```{r, message = FALSE}
library(here)
library(magrittr)
library(janitor)
library(haven)
library(nhanesA)
library(naniar)
library(rsample)
library(gtsummary)
library(tidyverse)
```

# Objective

Suppose I wanted to include the following items in a data set, from both 2017-18 and 2015-16 NHANES.

From the DEMO files: [`DEMO_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm) and [`DEMO_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm)

- SEQN = Subject identifying code 
- RIDSTATR = Interview/Examination Status (categorical)
    - 1 = Interview only
    - 2 = Interview and MEC examination (MEC = mobile examination center)
- RIDAGEYR = Age in years (quantitative, topcoded at 80)
    - All subjects ages 80 and over are coded as 80
- RIAGENDR = Sex (categorical)
    - 1 = Male, 2 = Female
- RIDRETH3 = Race/Ethnicity (categorical)
    - 1 = Mexican-American, 
    - 2 = Other Hispanic (1 and 2 are often combined)
    - 3 = Non-Hispanic White
    - 4 = Non-Hispanic Black
    - 6 = Non-Hispanic Asian
    - 7 = Other Race including Multi-Racial
    - Note that categories 1 and 2 are often combined, and sometimes we leave out category 7, or combine it with 6.

From the Body Measures (BMX) files: [`BMX_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BMX_J.htm) and [`BMX_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm) (BMX is part of the Examination data)

- BMXWAIST = Waist Circumference in cm (quantitative)

From the Cholesterol - High-Density Lipoprotein (HDL) files: [`HDL_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/HDL_J.htm) and [`HDL_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/HDL_I.htm) (HDL is part of the Lab data)

- LBDHDD = Direct HDL cholesterol in mg/dl (quantitative)

From the Current Health Status (HSQ) file [`HSQ_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/HSQ_J.htm) and [`HSQ_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/HSQ_I.htm) (HSQ is part of the Questionnaire data)

- HSD010 = Self-reported general health (categorical)
    - 1 = Excellent
    - 2 = Very good
    - 3 = Good
    - 4 = Fair
    - 5 = Poor
    - 7 = Refused
    - 9 = Don't Know

Here is the process I would use to get going.

# Get and do initial cleaning on the 2017-18 data

## Pull in the Data from each necessary database for 2017-18, zap the labels and convert to tibbles

```{r}
demo_17raw <- nhanes('DEMO_J') %>%
    zap_label() %>% tibble()

# check: DEMO_J should have 9254 observations on 46 variables
dim(demo_17raw)

bmx_17raw <- nhanes('BMX_J') %>%
    zap_label() %>% tibble()

# check: BMX_J should have 8704 observations on 21 variables
dim(bmx_17raw)

hdl_17raw <- nhanes('HDL_J') %>%
    zap_label() %>% tibble()

# check: HDL_J should have 7435 observations on 3 variables
dim(hdl_17raw)

hsq_17raw <- nhanes('HSQ_J') %>%
    zap_label() %>% tibble()

# check: HSQ_J should have 8366 observations on 9 variables
dim(hsq_17raw)
```

## Merge the data with `inner_join()`

```{r}
# create temp1 which includes all rows in both DEMO and BMX
temp1 <- inner_join(demo_17raw, bmx_17raw, by = "SEQN")

# check: temp1 should have 8704 observations on 66 variables
# since everyone in BMX will also be in DEMO, by design
# but not everyone in DEMO should also be in BMX
# and since SEQN is in both files
dim(temp1)

# create temp2 which includes all rows in both HDL and HSQ
temp2 <- inner_join(hdl_17raw, hsq_17raw, by = "SEQN")

# check: temp2 should have 7435 observations on 11 variables
# since everyone in HDL will also be in HSQ, by design
# but not everyone in HSQ should also be in HDL
# and since SEQN is in both files
dim(temp2)

# create merge_17raw which is all rows in both temp1 and temp2
merge_17raw <- inner_join(temp1, temp2, by = "SEQN")

# check: merge_17raw should have 7435 observations on 76 variables
dim(merge_17raw)

# clean up
rm(temp1, temp2)
```

## Select the 8 variables we'll need and convert categorical variables to factors

```{r}
nh_17 <- merge_17raw %>%
    select(SEQN, RIDSTATR, RIDAGEYR, RIAGENDR, RIDRETH3,
           BMXWAIST, LBDHDD, HSD010) %>%
    mutate(SEQN = as.character(SEQN),
           RIDSTATR = factor(RIDSTATR),
           RIAGENDR = factor(RIAGENDR),
           RIDRETH3 = factor(RIDRETH3),
           HSD010 = factor(HSD010))
```

## Add an indicator of the CYCLE in which the data were drawn

```{r}
nh_17 <- nh_17 %>%
    mutate(CYCLE = "2017-18")

nh_17
```

The resulting `nh_17` data should have 7435 observations on 9 variables

# Get and do initial cleaning on the 2015-16 data

## Pull in the Data from each necessary database for 2015-16, zap the labels and convert to tibbles

```{r}
demo_15raw <- nhanes('DEMO_I') %>%
    zap_label() %>% tibble()

# check: DEMO_I should have 9971 observations on 47 variables
dim(demo_15raw)

bmx_15raw <- nhanes('BMX_I') %>%
    zap_label() %>% tibble()

# check: BMX_I should have 9544 observations on 26 variables
dim(bmx_15raw)

hdl_15raw <- nhanes('HDL_I') %>%
    zap_label() %>% tibble()

# check: HDL_I should have 8021 observations on 3 variables
dim(hdl_15raw)

hsq_15raw <- nhanes('HSQ_I') %>%
    zap_label() %>% tibble()

# check: HSQ_I should have 9165 observations on 9 variables
dim(hsq_15raw)
```

## Merge the data with `inner_join()`

```{r}
# create tempA which includes all rows in both DEMO and BMX
tempA <- inner_join(demo_15raw, bmx_15raw, by = "SEQN")

# check: tempA should have 9544 observations on 72 variables
dim(tempA)

# create tempB which includes all rows in both HDL and HSQ
tempB <- inner_join(hdl_15raw, hsq_15raw, by = "SEQN")

# check: tempB should have 8021 observations on 11 variables
dim(tempB)

# create merge_15raw which is all rows in both tempA and tempB
merge_15raw <- inner_join(tempA, tempB, by = "SEQN")

# check: merge_15raw should have 8021 observations on 82 variables
dim(merge_15raw)
```

## Select the 8 variables we'll need and convert categorical variables to factors

```{r}
nh_15 <- merge_15raw %>%
    select(SEQN, RIDSTATR, RIDAGEYR, RIAGENDR, RIDRETH3,
           BMXWAIST, LBDHDD, HSD010) %>%
    mutate(SEQN = as.character(SEQN),
           RIDSTATR = factor(RIDSTATR),
           RIAGENDR = factor(RIAGENDR),
           RIDRETH3 = factor(RIDRETH3),
           HSD010 = factor(HSD010))
```

## Add an indicator of the CYCLE in which the data were drawn

```{r}
nh_15 <- nh_15 %>%
    mutate(CYCLE = "2015-16")

nh_15
```

The resulting `nh_15` data should have 8021 observations on 9 variables

# Combine the 2017-18 and 2015-16 tibbles

```{r}
nh_proj <- full_join(nh_15, nh_17)
```

Result should have 15456 observations on the same 9 variables.

```{r}
dim(nh_proj)
```

Most of the time in working with NHANES data, I restrict my work to adult subjects. Suppose here we want people between the ages of 20 and 79, to avoid the fact that NHANES classifies everyone over age 80 as `RIDAGEYR` = 80. The code I'd use is:

```{r}
nh_proj <- nh_proj %>%
    filter(RIDAGEYR > 19 & RIDAGEYR < 80)

nh_proj
```

and we wind up with 10,014 observations on 9 variables.

# Checking The Variables

## Quantitative variables 

We have three quantitative variables: age (`RIDAGEYR`), waist circumference (`BMXWAIST`) and HDL Cholesterol (`LBDHDD`).

Let's rename those variables.

```{r}
nh_proj <- nh_proj %>%
    rename(AGE = RIDAGEYR, WAIST = BMXWAIST, HDL = LBDHDD)
```

Now, we'll check their ranges, and how much missingness we're dealing with using a quick `summary()`.

```{r}
nh_proj %>% select(AGE, WAIST, HDL) %>%
    summary()
```

The ranges (minimum and maximum) for Age, Waist Circumference and HDL Cholesterol all seem fairly plausible to me in light of our earlier decisions. Perhaps you know better, and please do use that knowledge. We do have several hundred missing values in the waist circumference and HDL cholesterol values that we will need to deal with.

## Categorical Variables

### `RIDSTATR` and `CYCLE`

We're just checking here to see that everyone has `RIDSTATR` status 2, meaning they completed both the questionnaire and the examination. Since they do, we'll also make sure our `CYCLE` variable works to indicate the NHANES reporting CYCLE for each subject.

```{r}
nh_proj %>% tabyl(CYCLE, RIDSTATR)
```

### RIAGENDR, which we'll change to `SEX` or `FEMALE`

```{r}
nh_proj %>% tabyl(RIAGENDR)
```

We certainly want to recode this and name it more effectively, and make it a non-numerical factor.

```{r}
nh_proj <- nh_proj %>%
    mutate(SEX = ifelse(RIAGENDR == 1, "M", "F"))
```

Checking our results...

```{r}
nh_proj %>% tabyl(SEX, RIAGENDR)
```

Good. We can replace `RIAGENDR` with `SEX` moving forward. Another option would have been to use a 1/0 numeric variable, and if I'd wanted to do that, I would have run something like 

```{r}
nh_proj <- nh_proj %>%
    mutate(FEMALE = as.numeric(RIAGENDR) - 1)

nh_proj %>% tabyl(FEMALE, SEX)
```

### `RIDRETH3`, which we'll change to `RACE_ETH`

Let's do the following:

- Recode the levels of the `RIDRETH3` variable to use short, understandable group names.
- Collapse the first two categories (Mexican-American and Other Hispanic) into a single category.
- Change the variable name to `RACE_ETH`.
- Sort the resulting factor in order of their counts.


```{r}
nh_proj <- nh_proj %>%
    mutate(RACE_ETH = fct_recode(RIDRETH3,
                                 "Hispanic" = "1",
                                 "Hispanic" = "2",
                                 "NH_White" = "3",
                                 "NH_Black" = "4",
                                 "NH_Asian" = "6",
                                 "Other" = "7"),
           RACE_ETH = fct_infreq(RACE_ETH))
```

Let's look at the results. Do we need to collapse further in this case?

```{r}
nh_proj %>% tabyl(RACE_ETH, RIDRETH3)
```

The most common `RACE_ETH` is Non-Hispanic White, followed by Hispanic, then Non-Hispanic Black, Non-Hispanic Asian and finally Other Race (including Multi-Racial.) We won't collapse any further for now.

### `HSD010`, which we'll change to `SROH`

- Again, we need to recode the levels of this categorical variable. 
- I'll also rename the variable SROH (which is an abbreviation for self-reported overall health.) Always explain abbreviations.
- We also need to convert the values 7 (Refused) and 9 (Don't Know) to missing values.

```{r}
nh_proj <- nh_proj %>%
    mutate(HSD010 = na_if(HSD010, 7),
           HSD010 = na_if(HSD010, 9)) %>%
    mutate(SROH = fct_recode(HSD010,
                             "Excellent" = "1",
                             "Very_Good" = "2",
                             "Good" = "3",
                             "Fair" = "4",
                             "Poor" = "5")) %>%
    droplevels() # drop unused levels in all factors
```

```{r}
nh_proj %>% tabyl(SROH, HSD010)
```

## Reset the variables we will actually use

- We don't need `RIDSTATR` any more because we've checked and see that all its values are 2, as needed.
- We've renamed some variables and replaced others with better versions.

```{r}
nh_proj <- nh_proj %>% 
    select(SEQN, CYCLE, AGE, SEX, RACE_ETH, WAIST, HDL, SROH)
```

We should now have 10,014 observations on these 8 variables.

```{r}
dim(nh_proj)
```

## Missingness check

```{r}
miss_var_summary(nh_proj)
```

Three of our eight variables have missing values. We've got some imputation ahead of us.

## What if HDL was our outcome?

If one of those three variables with missing values was our outcome, say, for example, HDL, then we would filter our data for complete cases on that variable.

```{r}
nh_proj <- nh_proj %>% filter(complete.cases(HDL))

dim(nh_proj)
```

and we're now down to 9418 observations on 8 variables. Here's the revised report on missing data.

```{r}
miss_var_summary(nh_proj) %>% filter(n_miss > 0)
```

How many observations have any missing values at all, now? How many are missing both `SROH` and `WAIST`?

```{r}
miss_case_table(nh_proj)
```

# Saving a Tidy Tibble

```{r}
saveRDS(nh_proj, here("data", "nh_proj.Rds"))
```

# Codebook

A successful codebook for Project 2 will include:

- a set of five or so statements about the data, followed by 
- a `tbl_summary()` report from `gtsummary` or some equivalent brief description of the data, and that should be followed by 
- a fuller report on the distributions from the `describe()` function from `Hmisc`.

I've chosen here to build a codebook with some useful information describing the entire sample. In your project 2, I would be happy with a codebook using only your model training sample, or including the whole sample (training and testing together), or whatever else you feel is most useful to present. I would begin the codebook with statements like the ones shown below to fix ideas about the sample size, the missingness, the outcome (and you might have more than one, of course) and a statement about the roles for the other variables shown in the table below. If you look at the R Markdown code that generated this document, you will see that I have used in-line coding to fill in the counts of subjects in statements 1 and 2. You should, too.

1. **Sample Size** The data in our complete `nh_proj` sample consist of `r nrow(nh_proj)` subjects from NHANES 2015-16 and NHANES 2017-18 between the ages of 20 and 79 in whom our outcome variable was measured. (If I had other inclusion/exclusion criteria, I would list them here.)
2. **Missingness** Of the `r nrow(nh_proj)` subjects, `r n_case_complete(nh_proj %>% select(HDL, AGE, WAIST, CYCLE, SEX, RACE_ETH, SROH))` have complete data on all variables listed below.
3. Our **outcome** variable is `HDL`, which is the subject's serum HDL cholesterol measured in mg/dl.
4. All other variables listed below will serve as candidate **predictors** for our models.
5. The other variable contained in my tidy tibble is `SEQN` which is the subject identifying code.

## Using `gtsummary` to present key information about the variables

Variables included in our analyses are summarized in the following table.

```{r, warning = FALSE}
nh_proj %>% 
    select(HDL, AGE, WAIST, CYCLE, SEX, RACE_ETH, SROH) %>%
    tbl_summary(.,
        label = list(
            HDL = "HDL (HDL Cholesterol in mg/dl)",
            AGE = "AGE (in years)",
            WAIST = "WAIST (circumference in cm)",
            CYCLE = "CYCLE (NHANES reporting)",
            RACE_ETH = "RACE_ETH (Race/Ethnicity)",
            SROH = "SROH (Self-Reported Overall Health)"),
        stat = list( all_continuous() ~ 
                "{median} [{min} to {max}]" ))
```

## Data Distribution Description

```{r, warning = FALSE}
nh_proj %>% 
    select(HDL, AGE, WAIST, CYCLE, SEX, RACE_ETH, SROH) %>%
    Hmisc::describe() %>% Hmisc::html()
```

# Splitting into Testing/Training Samples

## Splitting by `CYCLE`

I've suggested to some of you that you use the `CYCLE` variable to split the data into a training sample (2015-16 data) and a test sample (2017-18 data). That would work like this:

```{r}
nh_1516 <- nh_proj %>% filter(CYCLE == "2015-16")
nh_1718 <- nh_proj %>% filter(CYCLE == "2017-18")
```

and that's fine: you fit the model on `nh_1516` and then test it on the quality of predictions made for `nh_1718`. 

## Splitting The Data with `initial_split`

An **even better** option that takes advantage of the `initial_split` function would be to simply create a split of the data across the two cycles, ensuring that a similar fraction of subjects in each cycle are found in both the training and test samples.

Suppose we want to accomplish the following:

1. We want to use `initial_split()` to do our splitting into a training set of 1000 subjects and a testing set of 2500 additional subjects from the NHANES data in `nh_proj`, which contains `r nrow(nh_proj)` observations.

2. We want the proportion of subjects within each combination of the categorical variable `CYCLE` within `nh_proj` to be preserved in both our training set and our testing set.

Note that the table of percentages for `CYCLE` for the full `nh_proj` data follows.

```{r}
nh_proj %>% tabyl(CYCLE)
```

We first select 3500 subjects from our `nh_proj` data set in such a way as to preserve the percentages who fall into each `CYCLE`. 

```{r}
set.seed(432)
nh_3500 <- nh_proj %>% group_by(CYCLE) %>% 
    slice_sample(., prop = 3501/nrow(nh_proj))

nh_3500 %>% tabyl(CYCLE)
```

- Note that these percentages are essentially the same as the percentages we saw across the full sample in `nhanes_proj` but now this is a (stratified) random sample of 3500 observations. 
- Note also that I had to fool `slice_sample` into thinking I wanted a sample of 3501 out of the total rather than 3500 until I actually got exactly 3500 sampled subjects.

Next, we use `initial_split()` to create training and testing samples according to our specifications (so 1000 in the training sample and 2500 in the testing sample, again preserving the fraction in each CYCLE.)

```{r}
nh_split <- initial_split(nh_3500, prop = 1000/3500, 
                          strata = CYCLE)
nh_training <- training(nh_split)
nh_testing <- testing(nh_split)

dim(nh_training)
dim(nh_testing)
```

Now, let's check the fraction of subjects by `CYCLE` in each of our split samples.

```{r}
nh_training %>% tabyl(CYCLE)
```

```{r}
nh_testing %>% tabyl(CYCLE)
```

OK. These match well to what we were looking for. We should be all set.
