Lab01 - NHANES Data Exploration

##Part 1: Setting Up Your Workspace - loading the libraries

# Load required packages
library(tidyverse)    # Data manipulation (dplyr, ggplot2, etc.)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(NHANES)       # NHANES dataset

## Warning: package 'NHANES' was built under R version 4.5.2

library(knitr)        # For professional table output
library(kableExtra)   # Enhanced tables

## Warning: package 'kableExtra' was built under R version 4.5.2

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
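
The conflict messages above mean that, once the tidyverse is attached, a bare filter() call resolves to dplyr::filter() rather than stats::filter(). A minimal sketch of the difference follows, using a toy data frame that is an illustration only and not part of the lab data. The package::function() prefix makes the choice explicit.

```r
library(dplyr)

# Toy data (hypothetical, not NHANES): five rows so the result is easy to check
toy <- tibble(x = 1:5)

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(toy, x > 3)

# stats::filter() is a different function: it applies a linear filter,
# here a centered 3-point moving average over the values 1 to 5
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

print(kept)      # two rows: x = 4 and x = 5
print(smoothed)  # NA at both ends, where the 3-point window is incomplete
```

The same prefix works for kableExtra::group_rows() and dplyr::group_rows(), which the last message reports as masked.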

##Part 2: Loading and Exploring NHANES Data

# Load the NHANES data
data(NHANES)

##Part 3: Data Preparation and Exploration

# Select key variables for analysis
# %>% is the magrittr pipe (RStudio shortcut: Ctrl+Shift+M); the native pipe |> (R 4.1+) is an alternative
nhanes_analysis <- NHANES %>%
  # Write dplyr::select() instead if another loaded package also defines select()
  select(
    ID,
    Gender,           # Sex (Male/Female)
    Age,              # Age in years
    Race1,            # Race/ethnicity
    Education,        # Education level
    BMI,              # Body Mass Index
    Pulse,            # Resting heart rate
    BPSys1,           # Systolic blood pressure (1st reading)
    BPDia1,           # Diastolic blood pressure (1st reading)
    PhysActive,       # Physically active (Yes/No)
    SmokeNow,         # Current smoking status
    Diabetes,         # Diabetes diagnosis (Yes/No)
    HealthGen         # General health rating
  ) %>%
  # Create a binary hypertension indicator (BPSys1 >= 140 OR BPDia1 >= 90)
  mutate(
    Hypertension = ifelse(BPSys1 >= 140 | BPDia1 >= 90, "Yes", "No")
  )
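
The hypertension rule above combines two comparisons with `|`, so missing readings propagate in a specific way: a row is "Yes" if either reading crosses its threshold, "No" only when both readings are known and below threshold, and NA otherwise. A small sketch with made-up readings (hypothetical values, not NHANES rows) shows each case:

```r
# Hypothetical blood pressure readings, chosen to cover each case
sys <- c(150, 118, 120, NA, NA)   # systolic
dia <- c(80,  76,  NA,  95, NA)   # diastolic

# Same rule as the lab: systolic >= 140 OR diastolic >= 90
flag <- ifelse(sys >= 140 | dia >= 90, "Yes", "No")

print(flag)
# "Yes" "No"  NA    "Yes" NA
```

The third row stays NA because a normal systolic reading alone cannot rule out a high diastolic one.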


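# Keep only rows with no missing values in any selected variable
# (nhanes_analysis2 is not used in the tasks below)
nhanes_analysis2 <- nhanes_analysis %>% 
  filter(complete.cases(.))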
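
Complete-case filtering drops a row if any selected column is missing, so it can remove far more rows than expected. The sketch below uses a toy table (hypothetical column names and values) to compare `complete.cases()` with `tidyr::drop_na()`, which should give the same rows, and to count how many rows are lost:

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): rows 2 and 3 each have one missing value
toy <- tibble(
  id        = 1:4,
  bmi       = c(32.2, 15.3, NA, 27.2),
  smoke_now = c("No", NA, "Yes", "Yes")
)

# Base R approach used in the lab
complete_base <- toy %>% filter(complete.cases(.))

# tidyr equivalent
complete_tidy <- toy %>% drop_na()

# Number of rows removed by requiring complete data
n_dropped <- nrow(toy) - nrow(complete_tidy)
print(n_dropped)
# 2
```

Checking `n_dropped` before using the filtered data is a cheap way to see how much of the sample the filter removes.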

# View the processed dataset
head(nhanes_analysis, 10)

## # A tibble: 10 × 14
##       ID Gender   Age Race1 Education      BMI Pulse BPSys1 BPDia1 PhysActive
##    <int> <fct>  <int> <fct> <fct>        <dbl> <int>  <int>  <int> <fct>     
##  1 51624 male      34 White High School   32.2    70    114     88 No        
##  2 51624 male      34 White High School   32.2    70    114     88 No        
##  3 51624 male      34 White High School   32.2    70    114     88 No        
##  4 51625 male       4 Other <NA>          15.3    NA     NA     NA <NA>      
##  5 51630 female    49 White Some College  30.6    86    118     82 No        
##  6 51638 male       9 White <NA>          16.8    82     84     50 <NA>      
##  7 51646 male       8 White <NA>          20.6    72    114     46 <NA>      
##  8 51647 female    45 White College Grad  27.2    62    106     62 Yes       
##  9 51647 female    45 White College Grad  27.2    62    106     62 Yes       
## 10 51647 female    45 White College Grad  27.2    62    106     62 Yes       
## # ℹ 4 more variables: SmokeNow <fct>, Diabetes <fct>, HealthGen <fct>,
## #   Hypertension <chr>

##Lab 1 - NHANES Exploration

Task 1: Explore Health Disparities by Education (15 minutes)

Using the nhanes_analysis data, explore:

“How does hypertension prevalence vary by education level?”

Write code to:

- Group by education level
- Calculate sample size, mean systolic BP, and percent with hypertension
- Print the results

# Your code here:
health_by_education <- nhanes_analysis %>%
  group_by(Education) %>%
  summarise(
    N = n(),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2)
  )

print(health_by_education)

## # A tibble: 6 × 4
##   Education          N Mean_SysBP Pct_Hypertension
##   <fct>          <int>      <dbl>            <dbl>
## 1 8th Grade        451       128.            28.3 
## 2 9 - 11th Grade   888       124.            17.3 
## 3 High School     1517       124.            18.9 
## 4 Some College    2267       122.            16.6 
## 5 College Grad    2098       119.            13.1 
## 6 <NA>            2779       106.             0.72
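
The percentage in the summary above is the share of "Yes" among rows with a non-missing Hypertension value, so its denominator can be smaller than N, which counts every row in the group. A minimal sketch on toy data (hypothetical groups and values) shows this, and shows that `mean(..., na.rm = TRUE)` gives the same number as the sum-over-sum form:

```r
library(dplyr)

# Toy data (hypothetical): the "HS" group has one missing Hypertension value
toy <- tibble(
  Education    = c("HS", "HS", "HS", "College", "College", NA),
  Hypertension = c("Yes", "No", NA, "No", "No", "No")
)

pct <- toy %>%
  group_by(Education) %>%
  summarise(
    N = n(),  # all rows in the group, including missing Hypertension
    # Form used in the lab: count of "Yes" over count of non-missing values
    Pct_sum  = sum(Hypertension == "Yes", na.rm = TRUE) /
      sum(!is.na(Hypertension)) * 100,
    # Equivalent shorthand: the mean of a logical vector is a proportion
    Pct_mean = mean(Hypertension == "Yes", na.rm = TRUE) * 100
  )

print(pct)
```

In the "HS" group N is 3, but the percentage is 1 of 2 known values, or 50%.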

Task 2: Create a Visualization (10 minutes)

Create a bar chart showing hypertension by education level:

# Your visualization here:
health_by_education %>%
  filter(!is.na(Education)) %>%
  ggplot(aes(x = Education, y = Pct_Hypertension)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = paste0(Pct_Hypertension, "%")), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Hypertension Prevalence by Education Level",
    x = "Education Level",
    y = "Percent with Hypertension (%)",
    caption = "Source: NHANES"
  ) +
  ylim(0, 50) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Task 3: Write a Data Interpretation (5 minutes)

Write 2-3 sentences:

“What does this pattern tell us about health disparities and social determinants?”

Consider:

- Which education groups have the highest/lowest hypertension?
- What might explain these differences?
- Why does this matter for public health?

Arielle’s Response: This pattern shows that hypertension prevalence is associated with education level, a social determinant of health. Individuals with an 8th grade education have roughly twice the hypertension prevalence of college graduates (28.3% vs. 13.1%). This matters in public health because it is important to understand why people with less education are more prone to disease, in this case hypertension, and what can be done to help close the gap between groups.
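
The size of the gap can be checked directly from the summary table. The sketch below takes the two percentages printed in Task 1 (8th Grade and College Grad) and computes their ratio and difference; the variable names are illustrative.

```r
# Percentages copied from the Task 1 summary table
pct_8th_grade    <- 28.3
pct_college_grad <- 13.1

# Relative gap: how many times higher the 8th grade prevalence is
prevalence_ratio <- pct_8th_grade / pct_college_grad

# Absolute gap in percentage points
prevalence_difference <- pct_8th_grade - pct_college_grad

round(prevalence_ratio, 2)       # 2.16
round(prevalence_difference, 1)  # 15.2
```

These are unadjusted figures, so differences in age between education groups may account for part of the gap.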

Arielle Coq

2026-01-27