
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

1 Introduction to NHANES

In this assignment, you will work with NHANES data retrieved through the nhanesA package.

The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. Started in the early 1960s, the program has consisted of a series of surveys focusing on different population groups or health topics. In 1999 the survey became a continuous program with a changing focus on a variety of health and nutrition measurements. Continuous NHANES is conducted in two-year cycles, i.e., 1999-2000, 2001-2002, etc.

There are 5 types of continuous NHANES survey data available to the public:

  • Demographics
  • Dietary
  • Examination
  • Laboratory
  • Questionnaire

You can learn more about NHANES data at the CDC website.

2 Load packages and data

2.0.1 [Point 0.5] Load packages

Load the two packages: nhanesA and dplyr.

```{r}
library(nhanesA)
library(dplyr)
```


### [Point 1] Check data
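Check the Laboratory Data in survey cycle 2017-2018 using the `nhanesTables()` function.

```{r}
# nhanesTables() expects a data group such as "LAB" and a single year
# that falls within the cycle (2017 selects the 2017-2018 cycle)
lab_data <- nhanesTables("LAB", 2017)
head(lab_data)
```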

2.0.2 [Point 2.5] Load data #1

Import the Demographic Data in survey cycle 2011-2012 using the nhanes() function. Assign the data to the object named demo.

```{r}
# nhanes() takes a table name; tables from the 2011-2012 cycle carry the "_G" suffix
demo <- nhanes("DEMO_G")
```
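The "_G" in table names such as DEMO_G and BPX_G identifies the survey cycle. As a rough sketch, the helper below derives that suffix from the starting year of a cycle. The name cycle_suffix is hypothetical (it is not part of nhanesA), and the lettering is assumed only for the two-year cycles from 1999-2000 (no suffix) through 2017-2018 ("_J").

```r
# Hypothetical helper (not part of nhanesA): map the starting year of a
# two-year cycle to the suffix appended to NHANES table names.
cycle_suffix <- function(start_year) {
  # Assumption: the pattern holds only for cycles 1999-2000 through 2017-2018
  stopifnot(start_year >= 1999, start_year <= 2017, start_year %% 2 == 1)
  idx <- (start_year - 1999) / 2  # 0 for 1999-2000, 1 for 2001-2002, ...
  if (idx == 0) "" else paste0("_", LETTERS[idx + 1])
}

# Table names for the 2011-2012 cycle
demo_name <- paste0("DEMO", cycle_suffix(2011))
bpx_name <- paste0("BPX", cycle_suffix(2011))
print(c(demo_name, bpx_name))
```

Looking up the table with nhanesTables() remains the safer way to confirm a name before passing it to nhanes().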

2.0.3 [Point 0.5] Simple EDA

Show the dimensions and the first and last 3 rows of the demo table.

```{r}
dim(demo)
head(demo, 3)
tail(demo, 3)
```

2.0.4 [Point 2] Data encoding

Translate the coded values of the following variables into their descriptive labels using the nhanesTranslate() function, and assign the result to the object named demo_translate.

  • SEQN (Respondent sequence number)
  • RIAGENDR (Gender)
  • RIDAGEYR (Age in years at screening)

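```{r}
# Passing data = demo returns the translated data frame
# rather than only the code translation tables
demo_translate <- nhanesTranslate(
  "DEMO_G",
  c("SEQN", "RIAGENDR", "RIDAGEYR"),
  data = demo
)
```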

2.0.5 [Point 2.5] Load data #2

Load the 'Blood Pressure' table, which is part of the Examination survey data, from NHANES 2011-2012. Assign it to a new object named bpx. Show the first 3 rows of bpx.

```{r}
bpx <- nhanes("BPX_G")
head(bpx, 3)
```

3 Data wrangling

3.1 Missing values

3.1.1 [Point 1] Simple EDA #1

How many missing values are there for the gender attribute (RIAGENDR)?

```{r}
missing_gender_count <- sum(is.na(demo$RIAGENDR))
missing_gender_count
```

3.1.2 [Point 1] Simple EDA #2

How many missing values are there for each column?

```{r}
if (is.data.frame(demo)) {
  # Count NA values column by column
  colSums(is.na(demo))
} else {
  print("The object 'demo' is not a data frame.")
}
```
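To see what these counts look like without downloading anything, here is a minimal sketch on a made-up table. The object toy_demo and its values are illustrative assumptions, not NHANES data.

```r
# Toy stand-in for the demographics table (values are invented)
toy_demo <- data.frame(
  SEQN = c(1, 2, 3, 4),
  RIAGENDR = c("Male", NA, "Female", "Female"),
  RIDAGEYR = c(34, 70, NA, NA)
)

# is.na() returns TRUE/FALSE; summing counts the TRUE values
gender_na <- sum(is.na(toy_demo$RIAGENDR))

# On a data frame, is.na() returns a logical matrix, so colSums()
# gives one count per column
col_na <- colSums(is.na(toy_demo))

print(gender_na)
print(col_na)
```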


## Translate encoding
For the `bpx` dataset, we want to keep only the following variables of 
primary interest:

- SEQN (Respondent sequence number)
- PEASCST1 (Blood Pressure Status)
- PEASCTM1 (Blood Pressure Time in Seconds)
- BPXSY1 (Systolic: Blood pres (1st rdg) mm Hg)
- BPXDI1 (Diastolic: Blood pres (1st rdg) mm Hg)

### [Point 0.5]
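Translate the coded values using the `nhanesTranslate()` function and assign the result to a new object named `bpx_translate`. Show the top 3 rows of `bpx_translate`.

```{r}
# nhanesTranslate() needs the table name and the columns to translate;
# data = bpx returns the translated data frame
bpx_translate <- nhanesTranslate(
  "BPX_G",
  c("SEQN", "PEASCST1", "PEASCTM1", "BPXSY1", "BPXDI1"),
  data = bpx
)

head(bpx_translate, 3)
```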

3.2 Subset

3.2.1 [Point 1] Subset #1

Subset the table to only the variables we are interested in and assign it to a new object named new_bpx. Show the first 6 rows of the new_bpx table.

```{r}
new_bpx <- bpx[, c("SEQN", "PEASCST1", "PEASCTM1", "BPXSY1", "BPXDI1")]
head(new_bpx, 6)
```

3.2.2 [Point 1] Rename

Rename the variables in the following way:

  • SEQN -> id
  • PEASCST1 -> bp_status
  • PEASCTM1 -> bpt_sec
  • BPXSY1 -> systolic
  • BPXDI1 -> diastolic

Assign this updated version of the table as new_bpx and show the first 6 rows of new_bpx.

```{r}
names(new_bpx) <- c("id", "bp_status", "bpt_sec", "systolic", "diastolic")
head(new_bpx, 6)
```
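Assigning a vector to names() relies on the columns being in the expected order. As an alternative sketch, the code below renames by matching the old names, so it does not depend on column positions. The table toy_bpx, its values, and the lookup vector name_map are illustrative assumptions.

```r
# Toy stand-in for the blood pressure table (values are invented)
toy_bpx <- data.frame(
  SEQN = c(101, 102, 103),
  PEASCST1 = c("Complete", "Complete", "Partial"),
  PEASCTM1 = c(620, 540, 710),
  BPXSY1 = c(118, 132, NA),
  BPXDI1 = c(76, 84, NA),
  BPXPLS = c(70, 66, 80)  # extra column that the subset drops
)

# Old name -> new name
name_map <- c(
  SEQN = "id",
  PEASCST1 = "bp_status",
  PEASCTM1 = "bpt_sec",
  BPXSY1 = "systolic",
  BPXDI1 = "diastolic"
)

# Subset by the old names, then look up each new name by its old name
toy_new <- toy_bpx[, names(name_map)]
names(toy_new) <- unname(name_map[names(toy_new)])
print(toy_new)
```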

3.2.3 [Point 0.5] Subset #2

Keep only the top 10 rows and assign them to the object named final_bpx.

```{r}
final_bpx <- head(new_bpx, 10)
```

3.3 Filter

3.3.1 [Point 1.5] Handle missing values

From the final_bpx table, remove all the rows with any missing values. Assign this subset of the table as bpx_complete. Display the dimensions of bpx_complete.

```{r}
bpx_complete <- na.omit(final_bpx)
dim(bpx_complete)
```
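The sketch below shows on made-up data which rows na.omit() drops, and that filtering with complete.cases() keeps the same rows. The object toy and its values are illustrative assumptions.

```r
# Toy table with missing readings (values are invented)
toy <- data.frame(
  id = 1:5,
  systolic = c(118, NA, 132, 125, NA),
  diastolic = c(76, 80, NA, 82, NA)
)

# na.omit() drops every row that has at least one NA
toy_complete <- na.omit(toy)
print(dim(toy_complete))

# complete.cases() flags the fully observed rows, giving the same subset
toy_complete2 <- toy[complete.cases(toy), ]
print(toy_complete2$id)
```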

3.4 Reorder

3.4.1 [Point 1]

Re-order the rows in the final_bpx dataset by Blood Pressure Time in Seconds (bpt_sec) in descending order.

```{r}
final_bpx <- final_bpx[order(-final_bpx$bpt_sec), ]
head(final_bpx)
```
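Negating the column inside order() only works for numeric columns. As a small sketch on made-up data, the code below compares it with decreasing = TRUE, which also works for character columns. In both cases rows with a missing bpt_sec are placed last by default. The object toy and its values are illustrative assumptions.

```r
# Toy table with one missing time (values are invented)
toy <- data.frame(
  id = 1:5,
  bpt_sec = c(540, NA, 710, 620, 480)
)

# Descending order by negating a numeric column
by_neg <- toy[order(-toy$bpt_sec), ]

# Descending order with the decreasing argument
by_dec <- toy[order(toy$bpt_sec, decreasing = TRUE), ]

print(by_neg)
print(by_dec)
```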

3.5 Add new variables

3.5.1 [Point 1] Keep all

Create a new variable called rescale_bpt_min that records the Blood Pressure Time in minutes. Keep both original and new variables/columns.

```{r}
final_bpx$rescale_bpt_min <- final_bpx$bpt_sec / 60

head(final_bpx)
```
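For comparison, dplyr offers two verbs for derived columns: mutate() keeps the existing columns and transmute() keeps only the ones it creates. The following is a minimal sketch on made-up data, assuming dplyr is installed. The object toy and its values are illustrative assumptions.

```r
library(dplyr)

# Toy table (values are invented)
toy <- data.frame(
  id = 1:3,
  bpt_sec = c(540, 620, 710)
)

# mutate() adds the new column next to the original ones
keep_all <- toy %>% mutate(rescale_bpt_min = bpt_sec / 60)

# transmute() returns only the newly created column
only_new <- toy %>% transmute(rescale_bpt_min = bpt_sec / 60)

print(names(keep_all))
print(names(only_new))
```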



### [Point 1] Keep only new 
Create the `rescale_bpt_min` variable in the same way as above and 
**keep only the new column**. (Try to avoid using `select()`.)

```{r}
# Build a new data frame that holds only the derived column
bpx_new <- data.frame(rescale_bpt_min = final_bpx$bpt_sec / 60)
head(bpx_new)
```

3.6 Summary statistics

3.6.1 [Point 1.5]

Using a pipe operator (%>% or |>), add a new column (bpt_min) that records the Blood Pressure Time in minutes, then summarize the final_bpx table for the rows where systolic is above 120.

```{r}
final_bpx %>%
  filter(systolic > 120) %>%
  mutate(bpt_min = bpt_sec / 60) %>%
  summary()
```

4 [Extra] Summary statistics by gender

4.0.1 [Extra point 5 (no partial point available)]

  • Step 1: Select the following columns from demo_translate table

    • SEQN (Respondent sequence number)
    • RIAGENDR (Gender)
    • RIDAGEYR (Age in years at screening)
  • Step 2: Rename the selected columns to:

    • SEQN -> id
    • RIAGENDR -> gender
    • RIDAGEYR -> age
  • Step 3: Join the subset of the Demographic survey table from Step 2 with new_bpx, using id as the key.
    You can reference the data wrangling cheat sheet HERE.

  • Step 4: Subset the table to the participants with age over 65

  • Step 5: Summarize the average age, systolic, and diastolic by gender
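    The dimension of the final table should be 2 x 4.

```{r}
# Steps 1 and 2: select the columns of interest and rename them
demographics <- demo_translate %>%
  select(SEQN, RIAGENDR, RIDAGEYR) %>%
  rename(id = SEQN, gender = RIAGENDR, age = RIDAGEYR)

# Step 3: join with the blood pressure table on id
joined_data <- demographics %>%
  inner_join(new_bpx, by = "id")

# Step 4: keep participants older than 65
filtered_data <- joined_data %>%
  filter(age > 65)

# Step 5: average age, systolic, and diastolic by gender
summary_by_gender <- filtered_data %>%
  group_by(gender) %>%
  summarise(
    avg_age = mean(age, na.rm = TRUE),
    avg_systolic = mean(systolic, na.rm = TRUE),
    avg_diastolic = mean(diastolic, na.rm = TRUE)
  )

summary_by_gender
```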
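The five steps can be checked end to end on made-up data before running them on NHANES. The sketch below assumes dplyr is installed. The tables toy_demo and toy_bp and all their values are illustrative assumptions.

```r
library(dplyr)

# Toy demographics and blood pressure tables (values are invented)
toy_demo <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
  age = c(70, 68, 50, 80, 66, 40)
)
toy_bp <- data.frame(
  id = c(1, 2, 3, 4, 5, 7),
  systolic = c(140, 130, 120, NA, 150, 110),
  diastolic = c(80, 70, 75, 90, NA, 60)
)

toy_summary <- toy_demo %>%
  inner_join(toy_bp, by = "id") %>%  # keeps only ids present in both tables
  filter(age > 65) %>%               # participants older than 65
  group_by(gender) %>%
  summarise(
    avg_age = mean(age, na.rm = TRUE),
    avg_systolic = mean(systolic, na.rm = TRUE),
    avg_diastolic = mean(diastolic, na.rm = TRUE),
    .groups = "drop"
  )

print(toy_summary)
print(dim(toy_summary))
```

Because inner_join() drops ids that appear in only one table, participants without a blood pressure record do not contribute to the averages; na.rm = TRUE then ignores individual missing readings within each group.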