Part 1: Setting Up Your Workspace

Load Required Packages

# Load required packages
library(tidyverse)    # Data manipulation (dplyr, ggplot2, etc.)
library(NHANES)       # NHANES dataset
library(knitr)        # For professional table output
library(kableExtra)   # Enhanced tables

Part 2: Loading and Exploring the NHANES Data

Load and Examine the NHANES Dataset

# Load the NHANES data
data(NHANES)

# Examine the first few rows
head(NHANES, n = 10)

## # A tibble: 10 × 76
##       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education    MaritalStatus HHIncome  
##    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>        <fct>         <fct>     
##  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-349…
##  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-349…
##  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-349…
##  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>         <NA>          20000-249…
##  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some College LivePartner   35000-449…
##  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>         <NA>          75000-999…
##  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>         <NA>          55000-649…
##  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College Grad Married       75000-999…
##  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College Grad Married       75000-999…
## 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College Grad Married       75000-999…
## # ℹ 65 more variables: HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>,
## #   Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
## #   BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>, BPDiaAve <int>,
## #   BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
## #   Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
## #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>, HealthGen <fct>,
## #   DaysPhysHlthBad <int>, DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>, …

# View data structure
str(NHANES)

## tibble [10,000 × 76] (S3: tbl_df/tbl/data.frame)
##  $ ID              : int [1:10000] 51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
##  $ SurveyYr        : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender          : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Age             : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
##  $ AgeDecade       : Factor w/ 8 levels " 0-9"," 10-19",..: 4 4 4 1 5 1 1 5 5 5 ...
##  $ AgeMonths       : int [1:10000] 409 409 409 49 596 115 101 541 541 541 ...
##  $ Race1           : Factor w/ 5 levels "Black","Hispanic",..: 4 4 4 5 4 4 4 4 4 4 ...
##  $ Race3           : Factor w/ 6 levels "Asian","Black",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Education       : Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 NA 4 NA NA 5 5 5 ...
##  $ MaritalStatus   : Factor w/ 6 levels "Divorced","LivePartner",..: 3 3 3 NA 2 NA NA 3 3 3 ...
##  $ HHIncome        : Factor w/ 12 levels " 0-4999"," 5000-9999",..: 6 6 6 5 7 11 9 11 11 11 ...
##  $ HHIncomeMid     : int [1:10000] 30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
##  $ Poverty         : num [1:10000] 1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
##  $ HomeRooms       : int [1:10000] 6 6 6 9 5 6 7 6 6 6 ...
##  $ HomeOwn         : Factor w/ 3 levels "Own","Rent","Other": 1 1 1 1 2 2 1 1 1 1 ...
##  $ Work            : Factor w/ 3 levels "Looking","NotWorking",..: 2 2 2 NA 2 NA NA 3 3 3 ...
##  $ Weight          : num [1:10000] 87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
##  $ Length          : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HeadCirc        : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ Height          : num [1:10000] 165 165 165 105 168 ...
##  $ BMI             : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
##  $ BMICatUnder20yrs: Factor w/ 4 levels "UnderWeight",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BMI_WHO         : Factor w/ 4 levels "12.0_18.5","18.5_to_24.9",..: 4 4 4 1 4 1 2 3 3 3 ...
##  $ Pulse           : int [1:10000] 70 70 70 NA 86 82 72 62 62 62 ...
##  $ BPSysAve        : int [1:10000] 113 113 113 NA 112 86 107 118 118 118 ...
##  $ BPDiaAve        : int [1:10000] 85 85 85 NA 75 47 37 64 64 64 ...
##  $ BPSys1          : int [1:10000] 114 114 114 NA 118 84 114 106 106 106 ...
##  $ BPDia1          : int [1:10000] 88 88 88 NA 82 50 46 62 62 62 ...
##  $ BPSys2          : int [1:10000] 114 114 114 NA 108 84 108 118 118 118 ...
##  $ BPDia2          : int [1:10000] 88 88 88 NA 74 50 36 68 68 68 ...
##  $ BPSys3          : int [1:10000] 112 112 112 NA 116 88 106 118 118 118 ...
##  $ BPDia3          : int [1:10000] 82 82 82 NA 76 44 38 60 60 60 ...
##  $ Testosterone    : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ DirectChol      : num [1:10000] 1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
##  $ TotChol         : num [1:10000] 3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
##  $ UrineVol1       : int [1:10000] 352 352 352 NA 77 123 238 106 106 106 ...
##  $ UrineFlow1      : num [1:10000] NA NA NA NA 0.094 ...
##  $ UrineVol2       : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ UrineFlow2      : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ Diabetes        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DiabetesAge     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HealthGen       : Factor w/ 5 levels "Excellent","Vgood",..: 3 3 3 NA 3 NA NA 2 2 2 ...
##  $ DaysPhysHlthBad : int [1:10000] 0 0 0 NA 0 NA NA 0 0 0 ...
##  $ DaysMentHlthBad : int [1:10000] 15 15 15 NA 10 NA NA 3 3 3 ...
##  $ LittleInterest  : Factor w/ 3 levels "None","Several",..: 3 3 3 NA 2 NA NA 1 1 1 ...
##  $ Depressed       : Factor w/ 3 levels "None","Several",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ nPregnancies    : int [1:10000] NA NA NA NA 2 NA NA 1 1 1 ...
##  $ nBabies         : int [1:10000] NA NA NA NA 2 NA NA NA NA NA ...
##  $ Age1stBaby      : int [1:10000] NA NA NA NA 27 NA NA NA NA NA ...
##  $ SleepHrsNight   : int [1:10000] 4 4 4 NA 8 NA NA 8 8 8 ...
##  $ SleepTrouble    : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PhysActive      : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
##  $ PhysActiveDays  : int [1:10000] NA NA NA NA NA NA NA 5 5 5 ...
##  $ TVHrsDay        : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ CompHrsDay      : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ TVHrsDayChild   : int [1:10000] NA NA NA 4 NA 5 1 NA NA NA ...
##  $ CompHrsDayChild : int [1:10000] NA NA NA 1 NA 0 6 NA NA NA ...
##  $ Alcohol12PlusYr : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AlcoholDay      : int [1:10000] NA NA NA NA 2 NA NA 3 3 3 ...
##  $ AlcoholYear     : int [1:10000] 0 0 0 NA 20 NA NA 52 52 52 ...
##  $ SmokeNow        : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
##  $ Smoke100        : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ Smoke100n       : Factor w/ 2 levels "Non-Smoker","Smoker": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SmokeAge        : int [1:10000] 18 18 18 NA 38 NA NA NA NA NA ...
##  $ Marijuana       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AgeFirstMarij   : int [1:10000] 17 17 17 NA 18 NA NA 13 13 13 ...
##  $ RegularMarij    : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 1 1 1 ...
##  $ AgeRegMarij     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HardDrugs       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SexEver         : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ SexAge          : int [1:10000] 16 16 16 NA 12 NA NA 13 13 13 ...
##  $ SexNumPartnLife : int [1:10000] 8 8 8 NA 10 NA NA 20 20 20 ...
##  $ SexNumPartYear  : int [1:10000] 1 1 1 NA 1 NA NA 0 0 0 ...
##  $ SameSex         : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA 2 2 2 ...
##  $ SexOrientation  : Factor w/ 3 levels "Bisexual","Heterosexual",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PregnantNow     : Factor w/ 3 levels "Yes","No","Unknown": NA NA NA NA NA NA NA NA NA NA ...

# Dimensions: rows (observations) and columns (variables)
dim(NHANES)

## [1] 10000    76

Part 3: Data Preparation and Exploration

Create Analysis Dataset

# Select key variables for analysis
nhanes_analysis <- NHANES %>%
  dplyr::select(
    ID,
    Gender,           # Sex (Male/Female)
    Age,              # Age in years
    Race1,            # Race/ethnicity
    Education,        # Education level
    BMI,              # Body Mass Index
    Pulse,            # Resting heart rate
    BPSys1,           # Systolic blood pressure (1st reading)
    BPDia1,           # Diastolic blood pressure (1st reading)
    PhysActive,       # Physically active (Yes/No)
    SmokeNow,         # Current smoking status
    Diabetes,         # Diabetes diagnosis (Yes/No)
    HealthGen         # General health rating
  ) %>%
  # Create a binary hypertension indicator (BPSys1 >= 140 OR BPDia1 >= 90)
  mutate(
    Hypertension = factor(ifelse(BPSys1 >= 140 | BPDia1 >= 90, "Yes", "No"))
  )

# Remove rows with missing values for key variables
nhanes_analysis2 <- nhanes_analysis %>%
        filter(complete.cases(.))  # Complete cases only


# View the processed dataset
head(nhanes_analysis, 10)

## # A tibble: 10 × 14
##       ID Gender   Age Race1 Education      BMI Pulse BPSys1 BPDia1 PhysActive SmokeNow Diabetes
##    <int> <fct>  <int> <fct> <fct>        <dbl> <int>  <int>  <int> <fct>      <fct>    <fct>   
##  1 51624 male      34 White High School   32.2    70    114     88 No         No       No      
##  2 51624 male      34 White High School   32.2    70    114     88 No         No       No      
##  3 51624 male      34 White High School   32.2    70    114     88 No         No       No      
##  4 51625 male       4 Other <NA>          15.3    NA     NA     NA <NA>       <NA>     No      
##  5 51630 female    49 White Some College  30.6    86    118     82 No         Yes      No      
##  6 51638 male       9 White <NA>          16.8    82     84     50 <NA>       <NA>     No      
##  7 51646 male       8 White <NA>          20.6    72    114     46 <NA>       <NA>     No      
##  8 51647 female    45 White College Grad  27.2    62    106     62 Yes        <NA>     No      
##  9 51647 female    45 White College Grad  27.2    62    106     62 Yes        <NA>     No      
## 10 51647 female    45 White College Grad  27.2    62    106     62 Yes        <NA>     No      
## # ℹ 2 more variables: HealthGen <fct>, Hypertension <fct>

# Check dimensions
dim(nhanes_analysis)

## [1] 10000    14

Your Turn: Guided Practice

🎯 Task 1: Explore Health Disparities by Education (15 minutes)

Using the nhanes_analysis data, explore:

“How does hypertension prevalence vary by education level?”

Write code to:

Group by education level
Calculate sample size, mean systolic BP, and percent with hypertension
Print the results

# Your code here:
health_by_education <- nhanes_analysis %>%
  group_by(Education) %>%
  summarise(
    N = n(),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2)
  )

print(health_by_education)

## # A tibble: 6 × 4
##   Education          N Mean_SysBP Pct_Hypertension
##   <fct>          <int>      <dbl>            <dbl>
## 1 8th Grade        451       128.            28.3 
## 2 9 - 11th Grade   888       124.            17.3 
## 3 High School     1517       124.            18.9 
## 4 Some College    2267       122.            16.6 
## 5 College Grad    2098       119.            13.1 
## 6 <NA>            2779       106.             0.72

🎯 Task 2: Create a Visualization (10 minutes)

Create a bar chart showing hypertension by education level:

# Your visualization here:
health_by_education %>%
  filter(!is.na(Education)) %>%
  ggplot(aes(x = Education, y = Pct_Hypertension)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = paste0(Pct_Hypertension, "%")), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Hypertension Prevalence by Education Level",
    x = "Education Level",
    y = "Percent with Hypertension (%)",
    caption = "Source: NHANES"
  ) +
  ylim(0, 50) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

R output visualization

🎯 Task 3: Write a Data Interpretation (5 minutes)

Write 2-3 sentences:

“What does this pattern tell us about health disparities and social determinants?”

Higher Blood Pressure is more prevalent among people who completed only 8th grade of school (28.26) and less prevalent among other groups of completed education. So people with completed 9-11th school grade/college education might be more informed about risk factors and health outcomes like Hypertension. Or people with those education levels has higher income in comparison with people who only have 8th grade of school education and it’s might affect health outcomes as well. But to be sure we need to analyse with other variables like smoking status, BMI, chronic diseases and income.

This might be important for Public Health because people who don’t have completed college or any other high education might be a potential risk group, that needs to be studied and

Consider: - Which education groups have highest/lowest hypertension? - What might explain these differences? - Why does this matter for public health?

LETS SEE HOW OTHER VARIABLES CONNECTED TO HYPERTENSION AMONG DIFFERENT EDUCATION GROUPS

Stratified Analysis: Physical activity AND Education

Research Question: Are there Physical activity differences in hypertension that differ by education?

# Stratified by Physical activity and Education
health_stratified <- nhanes_analysis %>%
  group_by(PhysActive, Education) %>%
  summarise(
    N = n(),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    .groups = 'drop'
  )

print(health_stratified)

## # A tibble: 13 × 6
##    PhysActive Education          N Mean_SysBP Mean_BMI Pct_Hypertension
##    <fct>      <fct>          <int>      <dbl>    <dbl>            <dbl>
##  1 No         8th Grade        339       129.     29.7            29.8 
##  2 No         9 - 11th Grade   610       124.     29.7            18.9 
##  3 No         High School      882       125.     30.1            20.2 
##  4 No         Some College    1056       123.     30.0            18.4 
##  5 No         College Grad     546       121.     29.1            18.0 
##  6 No         <NA>             244       109.     24.6             1.35
##  7 Yes        8th Grade        112       127.     29.0            23.5 
##  8 Yes        9 - 11th Grade   278       122.     27.9            13.7 
##  9 Yes        High School      635       122.     28.5            17.2 
## 10 Yes        Some College    1211       121.     28.5            15.1 
## 11 Yes        College Grad    1552       118.     27.1            11.4 
## 12 Yes        <NA>             861       109.     23.6             0.74
## 13 <NA>       <NA>            1674       101.     17.9             0.4

Data Visualization

Hypertension Prevalence by Education and Physical Activity

# Create visualization
health_by_PhysActive_education <- nhanes_analysis %>%
   filter(
    !is.na(PhysActive),
    !is.na(Education),
    !is.na(Hypertension)
  ) %>%
  group_by(PhysActive, Education) %>%
  summarise(
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    .groups = 'drop'
  )

ggplot(health_by_PhysActive_education, aes(x = PhysActive, y = Pct_Hypertension, fill = Education)) +
  geom_col(position = "dodge", alpha = 0.8) +
  labs(
    title = "Hypertension Prevalence by Physical activity and Education",
    subtitle = "NHANES Data",
    x = "PhysActive",
    y = "Prevalence (%)",
    fill = "Education",
    caption = "Source: National Health and Nutrition Examination Survey"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Bar chart showing hypertension prevalence by PhysActive and Education

Hypertension is still more prevalent among people with completed 8th grade level of education regardless of Physical activity status.

OR lets Analyse Diabetes and Hypertension among different education groups

Stratified Analysis: Diabetes AND Education

Research Question: Are there Diabetes differences in hypertension that differ by education?

# Stratified by Diabetes and Education
health_stratified <- nhanes_analysis %>%
  group_by(Diabetes, Education) %>%
  summarise(
    N = n(),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    .groups = 'drop'
  )

print(health_stratified)

## # A tibble: 14 × 6
##    Diabetes Education          N Mean_SysBP Mean_BMI Pct_Hypertension
##    <fct>    <fct>          <int>      <dbl>    <dbl>            <dbl>
##  1 No       8th Grade        351       127.     28.8            25.6 
##  2 No       9 - 11th Grade   781       122.     28.6            14.6 
##  3 No       High School     1352       123.     28.9            18.2 
##  4 No       Some College    2039       121.     28.7            15.0 
##  5 No       College Grad    1954       118.     27.3            11.8 
##  6 No       <NA>            2621       106.     20.5             0.73
##  7 Yes      8th Grade        100       134.     32.0            36.8 
##  8 Yes      9 - 11th Grade   105       133.     33.1            36   
##  9 Yes      High School      165       127.     33.8            25   
## 10 Yes      Some College     228       129.     33.0            31.6 
## 11 Yes      College Grad     144       129.     31.3            29.6 
## 12 Yes      <NA>              18       100.     26.1             0   
## 13 <NA>     9 - 11th Grade     2       142      40.8           100   
## 14 <NA>     <NA>             140       100      22.2             0

Data Visualization

Hypertension Prevalence by Diabetes and Education

# Create visualization
health_by_Diabetes_education <- nhanes_analysis %>%
   filter(
    !is.na(Diabetes),
    !is.na(Education),
    !is.na(Hypertension)
  ) %>%
  group_by(Diabetes, Education) %>%
  summarise(
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    .groups = 'drop'
  )

ggplot(health_by_Diabetes_education, aes(x = Diabetes, y = Pct_Hypertension, fill = Education)) +
  geom_col(position = "dodge", alpha = 0.8) +
  labs(
    title = "Hypertension Prevalence by Diabetes and Education",
    subtitle = "NHANES Data",
    x = "Diabetes",
    y = "Prevalence (%)",
    fill = "Education",
    caption = "Source: National Health and Nutrition Examination Survey"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Bar chart showing hypertension prevalence by Diabetes and Education

We can see that Hypertension is prevalent in all education level groups among people who have Diabetes. And hypertension is still more prevalent among people with completed 8th grade level of education even if a person does not have Diabetes

EPI 553 - R Lab Activity 1: Azamatova Aigerim

Azamatova Aigerim

January 29, 2026

Part 1: Setting Up Your Workspace

Load Required Packages

Part 2: Loading and Exploring the NHANES Data

Load and Examine the NHANES Dataset

Part 3: Data Preparation and Exploration

Create Analysis Dataset

Your Turn: Guided Practice

🎯 Task 1: Explore Health Disparities by Education (15 minutes)

🎯 Task 2: Create a Visualization (10 minutes)

🎯 Task 3: Write a Data Interpretation (5 minutes)

LETS SEE HOW OTHER VARIABLES CONNECTED TO HYPERTENSION AMONG DIFFERENT EDUCATION GROUPS

Stratified Analysis: Physical activity AND Education

Data Visualization

Hypertension Prevalence by Education and Physical Activity

OR lets Analyse Diabetes and Hypertension among different education groups

Stratified Analysis: Diabetes AND Education

Data Visualization

Hypertension Prevalence by Diabetes and Education

the_end