Part 1: Setting Up Your Workspace

Load Required Packages

# Load required packages
library(tidyverse)    # Data manipulation (dplyr, ggplot2, etc.)
library(NHANES)       # NHANES dataset
library(knitr)        # For professional table output
library(kableExtra)   # Enhanced tables

Part 2: Loading and Exploring the NHANES Data

Load and Examine the NHANES Dataset

# Load the NHANES data
data(NHANES)

# Examine the first few rows
head(NHANES, n = 10)

## # A tibble: 10 × 76
##       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education    MaritalStatus HHIncome  
##    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>        <fct>         <fct>     
##  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-349…
##  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-349…
##  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-349…
##  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>         <NA>          20000-249…
##  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some College LivePartner   35000-449…
##  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>         <NA>          75000-999…
##  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>         <NA>          55000-649…
##  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College Grad Married       75000-999…
##  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College Grad Married       75000-999…
## 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College Grad Married       75000-999…
## # ℹ 65 more variables: HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>,
## #   Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
## #   BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>, BPDiaAve <int>,
## #   BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
## #   Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
## #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>, HealthGen <fct>,
## #   DaysPhysHlthBad <int>, DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>, …

# View data structure
str(NHANES)

## tibble [10,000 × 76] (S3: tbl_df/tbl/data.frame)
##  $ ID              : int [1:10000] 51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
##  $ SurveyYr        : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender          : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Age             : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
##  $ AgeDecade       : Factor w/ 8 levels " 0-9"," 10-19",..: 4 4 4 1 5 1 1 5 5 5 ...
##  $ AgeMonths       : int [1:10000] 409 409 409 49 596 115 101 541 541 541 ...
##  $ Race1           : Factor w/ 5 levels "Black","Hispanic",..: 4 4 4 5 4 4 4 4 4 4 ...
##  $ Race3           : Factor w/ 6 levels "Asian","Black",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Education       : Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 NA 4 NA NA 5 5 5 ...
##  $ MaritalStatus   : Factor w/ 6 levels "Divorced","LivePartner",..: 3 3 3 NA 2 NA NA 3 3 3 ...
##  $ HHIncome        : Factor w/ 12 levels " 0-4999"," 5000-9999",..: 6 6 6 5 7 11 9 11 11 11 ...
##  $ HHIncomeMid     : int [1:10000] 30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
##  $ Poverty         : num [1:10000] 1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
##  $ HomeRooms       : int [1:10000] 6 6 6 9 5 6 7 6 6 6 ...
##  $ HomeOwn         : Factor w/ 3 levels "Own","Rent","Other": 1 1 1 1 2 2 1 1 1 1 ...
##  $ Work            : Factor w/ 3 levels "Looking","NotWorking",..: 2 2 2 NA 2 NA NA 3 3 3 ...
##  $ Weight          : num [1:10000] 87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
##  $ Length          : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HeadCirc        : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ Height          : num [1:10000] 165 165 165 105 168 ...
##  $ BMI             : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
##  $ BMICatUnder20yrs: Factor w/ 4 levels "UnderWeight",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BMI_WHO         : Factor w/ 4 levels "12.0_18.5","18.5_to_24.9",..: 4 4 4 1 4 1 2 3 3 3 ...
##  $ Pulse           : int [1:10000] 70 70 70 NA 86 82 72 62 62 62 ...
##  $ BPSysAve        : int [1:10000] 113 113 113 NA 112 86 107 118 118 118 ...
##  $ BPDiaAve        : int [1:10000] 85 85 85 NA 75 47 37 64 64 64 ...
##  $ BPSys1          : int [1:10000] 114 114 114 NA 118 84 114 106 106 106 ...
##  $ BPDia1          : int [1:10000] 88 88 88 NA 82 50 46 62 62 62 ...
##  $ BPSys2          : int [1:10000] 114 114 114 NA 108 84 108 118 118 118 ...
##  $ BPDia2          : int [1:10000] 88 88 88 NA 74 50 36 68 68 68 ...
##  $ BPSys3          : int [1:10000] 112 112 112 NA 116 88 106 118 118 118 ...
##  $ BPDia3          : int [1:10000] 82 82 82 NA 76 44 38 60 60 60 ...
##  $ Testosterone    : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ DirectChol      : num [1:10000] 1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
##  $ TotChol         : num [1:10000] 3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
##  $ UrineVol1       : int [1:10000] 352 352 352 NA 77 123 238 106 106 106 ...
##  $ UrineFlow1      : num [1:10000] NA NA NA NA 0.094 ...
##  $ UrineVol2       : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ UrineFlow2      : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ Diabetes        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DiabetesAge     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HealthGen       : Factor w/ 5 levels "Excellent","Vgood",..: 3 3 3 NA 3 NA NA 2 2 2 ...
##  $ DaysPhysHlthBad : int [1:10000] 0 0 0 NA 0 NA NA 0 0 0 ...
##  $ DaysMentHlthBad : int [1:10000] 15 15 15 NA 10 NA NA 3 3 3 ...
##  $ LittleInterest  : Factor w/ 3 levels "None","Several",..: 3 3 3 NA 2 NA NA 1 1 1 ...
##  $ Depressed       : Factor w/ 3 levels "None","Several",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ nPregnancies    : int [1:10000] NA NA NA NA 2 NA NA 1 1 1 ...
##  $ nBabies         : int [1:10000] NA NA NA NA 2 NA NA NA NA NA ...
##  $ Age1stBaby      : int [1:10000] NA NA NA NA 27 NA NA NA NA NA ...
##  $ SleepHrsNight   : int [1:10000] 4 4 4 NA 8 NA NA 8 8 8 ...
##  $ SleepTrouble    : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PhysActive      : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
##  $ PhysActiveDays  : int [1:10000] NA NA NA NA NA NA NA 5 5 5 ...
##  $ TVHrsDay        : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ CompHrsDay      : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ TVHrsDayChild   : int [1:10000] NA NA NA 4 NA 5 1 NA NA NA ...
##  $ CompHrsDayChild : int [1:10000] NA NA NA 1 NA 0 6 NA NA NA ...
##  $ Alcohol12PlusYr : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AlcoholDay      : int [1:10000] NA NA NA NA 2 NA NA 3 3 3 ...
##  $ AlcoholYear     : int [1:10000] 0 0 0 NA 20 NA NA 52 52 52 ...
##  $ SmokeNow        : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
##  $ Smoke100        : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ Smoke100n       : Factor w/ 2 levels "Non-Smoker","Smoker": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SmokeAge        : int [1:10000] 18 18 18 NA 38 NA NA NA NA NA ...
##  $ Marijuana       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AgeFirstMarij   : int [1:10000] 17 17 17 NA 18 NA NA 13 13 13 ...
##  $ RegularMarij    : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 1 1 1 ...
##  $ AgeRegMarij     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HardDrugs       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SexEver         : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ SexAge          : int [1:10000] 16 16 16 NA 12 NA NA 13 13 13 ...
##  $ SexNumPartnLife : int [1:10000] 8 8 8 NA 10 NA NA 20 20 20 ...
##  $ SexNumPartYear  : int [1:10000] 1 1 1 NA 1 NA NA 0 0 0 ...
##  $ SameSex         : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA 2 2 2 ...
##  $ SexOrientation  : Factor w/ 3 levels "Bisexual","Heterosexual",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PregnantNow     : Factor w/ 3 levels "Yes","No","Unknown": NA NA NA NA NA NA NA NA NA NA ...

# Dimensions: rows (observations) and columns (variables)
dim(NHANES)

## [1] 10000    76

View All Available Variables

# What variables do we have?
names(NHANES)

##  [1] "ID"               "SurveyYr"         "Gender"           "Age"              "AgeDecade"       
##  [6] "AgeMonths"        "Race1"            "Race3"            "Education"        "MaritalStatus"   
## [11] "HHIncome"         "HHIncomeMid"      "Poverty"          "HomeRooms"        "HomeOwn"         
## [16] "Work"             "Weight"           "Length"           "HeadCirc"         "Height"          
## [21] "BMI"              "BMICatUnder20yrs" "BMI_WHO"          "Pulse"            "BPSysAve"        
## [26] "BPDiaAve"         "BPSys1"           "BPDia1"           "BPSys2"           "BPDia2"          
## [31] "BPSys3"           "BPDia3"           "Testosterone"     "DirectChol"       "TotChol"         
## [36] "UrineVol1"        "UrineFlow1"       "UrineVol2"        "UrineFlow2"       "Diabetes"        
## [41] "DiabetesAge"      "HealthGen"        "DaysPhysHlthBad"  "DaysMentHlthBad"  "LittleInterest"  
## [46] "Depressed"        "nPregnancies"     "nBabies"          "Age1stBaby"       "SleepHrsNight"   
## [51] "SleepTrouble"     "PhysActive"       "PhysActiveDays"   "TVHrsDay"         "CompHrsDay"      
## [56] "TVHrsDayChild"    "CompHrsDayChild"  "Alcohol12PlusYr"  "AlcoholDay"       "AlcoholYear"     
## [61] "SmokeNow"         "Smoke100"         "Smoke100n"        "SmokeAge"         "Marijuana"       
## [66] "AgeFirstMarij"    "RegularMarij"     "AgeRegMarij"      "HardDrugs"        "SexEver"         
## [71] "SexAge"           "SexNumPartnLife"  "SexNumPartYear"   "SameSex"          "SexOrientation"  
## [76] "PregnantNow"

# Count missing values in each column
missing_summary <- data.frame(
  Variable = names(NHANES),
  Missing_Count = colSums(is.na(NHANES)),
  Missing_Percent = round(colSums(is.na(NHANES)) / nrow(NHANES) * 100, 2)
)

# Show variables with the most missing data
print(missing_summary[order(-missing_summary$Missing_Count), ][1:15, ])

##                          Variable Missing_Count Missing_Percent
## HeadCirc                 HeadCirc          9912           99.12
## Length                     Length          9457           94.57
## DiabetesAge           DiabetesAge          9371           93.71
## TVHrsDayChild       TVHrsDayChild          9347           93.47
## CompHrsDayChild   CompHrsDayChild          9347           93.47
## BMICatUnder20yrs BMICatUnder20yrs          8726           87.26
## AgeRegMarij           AgeRegMarij          8634           86.34
## UrineFlow2             UrineFlow2          8524           85.24
## UrineVol2               UrineVol2          8522           85.22
## PregnantNow           PregnantNow          8304           83.04
## Age1stBaby             Age1stBaby          8116           81.16
## nBabies                   nBabies          7584           75.84
## nPregnancies         nPregnancies          7396           73.96
## AgeFirstMarij       AgeFirstMarij          7109           71.09
## SmokeAge                 SmokeAge          6920           69.20
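
The same percentages can be obtained from the column means of is.na(), because the mean of a logical vector is the share of TRUE values. A minimal sketch on a made-up data frame (the object `toy` and its columns are illustrative assumptions, not NHANES values):

```r
# Hypothetical toy data, used only to illustrate the calculation
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# Share of missing values per column, as a percentage
missing_pct <- round(colMeans(is.na(toy)) * 100, 2)

# Count-based version, matching the formula used for missing_summary
missing_pct_alt <- round(colSums(is.na(toy)) / nrow(toy) * 100, 2)

# Columns ordered from most to least missing
print(sort(missing_pct, decreasing = TRUE))
```

Both versions give the same numbers; the colMeans() form just avoids writing the denominator by hand.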

Part 3: Data Preparation and Exploration

Create Analysis Dataset

# Select key variables for analysis
nhanes_analysis <- NHANES %>%
  dplyr::select(
    ID,
    Gender,           # Sex (Male/Female)
    Age,              # Age in years
    Race1,            # Race/ethnicity
    Education,        # Education level
    BMI,              # Body Mass Index
    Pulse,            # Resting heart rate
    BPSys1,           # Systolic blood pressure (1st reading)
    BPDia1,           # Diastolic blood pressure (1st reading)
    PhysActive,       # Physically active (Yes/No)
    SmokeNow,         # Current smoking status
    Diabetes,         # Diabetes diagnosis (Yes/No)
    HealthGen         # General health rating
  ) %>%
  # Create a binary hypertension indicator (BPSys1 >= 140 OR BPDia1 >= 90)
  mutate(
    Hypertension = factor(ifelse(BPSys1 >= 140 | BPDia1 >= 90, "Yes", "No"))
  )
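
How the hypertension flag behaves when one blood pressure reading is missing depends on R's three-valued logic: `TRUE | NA` is TRUE, while `FALSE | NA` is NA. The sketch below uses made-up readings (the vectors `sys` and `dia` are illustrative assumptions, not NHANES rows) to show which cases end up as "Yes", "No", or NA:

```r
# Hypothetical readings: systolic and diastolic pressure, some missing
sys <- c(150, 120, NA, 120, NA, 120)
dia <- c( 80,  95, 95,  NA, NA,  80)

# Same rule as the Hypertension variable: systolic >= 140 OR diastolic >= 90
flag <- ifelse(sys >= 140 | dia >= 90, "Yes", "No")

# A row is "Yes" if either available reading is high, even when the other is NA.
# A row is NA when the available reading is normal and the other is missing,
# or when both are missing.
print(data.frame(sys, dia, flag))
```

This is why the prevalence calculations later divide by `sum(!is.na(Hypertension))` rather than by the group size.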

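# Complete-case version of the dataset (rows with no missing value in any
# selected column). It is stored under a separate name: the summaries below
# use nhanes_analysis and drop NAs per variable with na.rm = TRUE.
nhanes_analysis2 <- nhanes_analysis %>%
  filter(complete.cases(.))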


# View the processed dataset
head(nhanes_analysis, 10)

## # A tibble: 10 × 14
##       ID Gender   Age Race1 Education      BMI Pulse BPSys1 BPDia1 PhysActive SmokeNow Diabetes
##    <int> <fct>  <int> <fct> <fct>        <dbl> <int>  <int>  <int> <fct>      <fct>    <fct>   
##  1 51624 male      34 White High School   32.2    70    114     88 No         No       No      
##  2 51624 male      34 White High School   32.2    70    114     88 No         No       No      
##  3 51624 male      34 White High School   32.2    70    114     88 No         No       No      
##  4 51625 male       4 Other <NA>          15.3    NA     NA     NA <NA>       <NA>     No      
##  5 51630 female    49 White Some College  30.6    86    118     82 No         Yes      No      
##  6 51638 male       9 White <NA>          16.8    82     84     50 <NA>       <NA>     No      
##  7 51646 male       8 White <NA>          20.6    72    114     46 <NA>       <NA>     No      
##  8 51647 female    45 White College Grad  27.2    62    106     62 Yes        <NA>     No      
##  9 51647 female    45 White College Grad  27.2    62    106     62 Yes        <NA>     No      
## 10 51647 female    45 White College Grad  27.2    62    106     62 Yes        <NA>     No      
## # ℹ 2 more variables: HealthGen <fct>, Hypertension <fct>

# Check dimensions
dim(nhanes_analysis)

## [1] 10000    14
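
The dimensions above are for nhanes_analysis, which still contains rows with missing values; the complete-case filter was saved under a different name. A small sketch, on made-up data, of how complete.cases() decides which rows to keep (the data frame `toy` is an illustrative assumption):

```r
# Hypothetical toy data with missing values in two columns
toy <- data.frame(
  id    = 1:5,
  bmi   = c(22.1, NA, 30.4, 27.8, 19.5),
  smoke = c("No", "Yes", NA, "No", NA)
)

# TRUE only for rows where every column is non-missing
keep <- complete.cases(toy)

# Subset to the complete rows
toy_complete <- toy[keep, ]

cat("Rows before:", nrow(toy), " Rows after:", nrow(toy_complete), "\n")
```

A single sparsely recorded column can remove most rows, so it is worth comparing row counts before and after filtering.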

Part 4: Summary Statistics by Groups

Summary by Gender

Research Question: Do cardiovascular health indicators and physical activity differ by gender?

# Create summary statistics grouped by gender
health_by_gender <- nhanes_analysis %>%
  group_by(Gender) %>%
  summarise(
    N = n(),
    # Descriptive statistics
    Mean_Age = round(mean(Age, na.rm = TRUE), 2),
    Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Mean_DiaBP = round(mean(BPDia1, na.rm = TRUE), 2),
    # Prevalence estimates (percentages)
    Pct_Physically_Active = round(
      sum(PhysActive == "Yes", na.rm = TRUE) / sum(!is.na(PhysActive)) * 100, 2),
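    # SmokeNow is recorded only for respondents who have smoked 100+ cigarettes,
    # so this is the share of ever-smokers who currently smoke
    Pct_Current_Smoker = round(
      sum(SmokeNow == "Yes", na.rm = TRUE) / sum(!is.na(SmokeNow)) * 100, 2),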
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    Pct_Diabetes = round(
      sum(Diabetes == "Yes", na.rm = TRUE) / sum(!is.na(Diabetes)) * 100, 2)
  )

print(health_by_gender)

## # A tibble: 2 × 10
##   Gender     N Mean_Age Mean_BMI Mean_SysBP Mean_DiaBP Pct_Physically_Active Pct_Current_Smoker
##   <fct>  <int>    <dbl>    <dbl>      <dbl>      <dbl>                 <dbl>              <dbl>
## 1 female  5020     37.6     26.8       117.       67.0                  53.2               45.5
## 2 male    4980     35.8     26.6       121.       69.5                  58.6               45.8
## # ℹ 2 more variables: Pct_Hypertension <dbl>, Pct_Diabetes <dbl>
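
Each prevalence estimate above divides the number of "Yes" answers by the number of non-missing answers. The sketch below, on a made-up factor (`phys_active` is an illustrative assumption), shows that this equals `mean(..., na.rm = TRUE)` and differs from dividing by the full group size:

```r
# Hypothetical responses with two missing values
phys_active <- factor(c("Yes", "No", NA, "Yes", "No", "Yes", NA, "No"))

# Formula used in the summaries: "Yes" count over non-missing count
pct_long <- sum(phys_active == "Yes", na.rm = TRUE) / sum(!is.na(phys_active)) * 100

# Equivalent shorthand: mean of a logical vector, ignoring NAs
pct_short <- mean(phys_active == "Yes", na.rm = TRUE) * 100

# Dividing by the full length instead treats missing answers as "No"
pct_all_rows <- sum(phys_active == "Yes", na.rm = TRUE) / length(phys_active) * 100

print(c(long = pct_long, short = pct_short, all_rows = pct_all_rows))
```

Using the non-missing denominator reports prevalence among those with a recorded answer, which is the quantity the summary tables show.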

Summary by Age Group

Research Question: How do health outcomes and behaviors vary across the lifespan?

# What not to do: grouping by single year of age produces one row per age,
# which is too granular to read as a summary
health_by_age <- nhanes_analysis %>%
  group_by(Age) %>%
  summarise(
    N = n(),
    Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    Pct_Diabetes = round(
      sum(Diabetes == "Yes", na.rm = TRUE) / sum(!is.na(Diabetes)) * 100, 2),
    Pct_Physically_Active = round(
      sum(PhysActive == "Yes", na.rm = TRUE) / sum(!is.na(PhysActive)) * 100, 2)
  )

# Do this:
# Create age groups for analysis.
# cut() uses right-closed intervals by default, so the bins are (0, 20], (20, 35],
# (35, 50], (50, 65], (65, 100]; the labels below match those bins.
# Age 0 falls outside (0, 20] and becomes NA.
nhanes_analysis <- nhanes_analysis %>%
  mutate(
    Age_Group = cut(Age, 
                    breaks = c(0, 20, 35, 50, 65, 100),
                    labels = c("1-20", "21-35", "36-50", "51-65", "66+"))
  )

# Summary by age group
health_by_age <- nhanes_analysis %>%
  group_by(Age_Group) %>%
  summarise(
    N = n(),
    Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    Pct_Diabetes = round(
      sum(Diabetes == "Yes", na.rm = TRUE) / sum(!is.na(Diabetes)) * 100, 2),
    Pct_Physically_Active = round(
      sum(PhysActive == "Yes", na.rm = TRUE) / sum(!is.na(PhysActive)) * 100, 2)
  )

print(health_by_age)

## # A tibble: 6 × 7
##   Age_Group     N Mean_BMI Mean_SysBP Pct_Hypertension Pct_Diabetes Pct_Physically_Active
##   <fct>     <int>    <dbl>      <dbl>            <dbl>        <dbl>                 <dbl>
## 1 1-20       2769     20.8       107.             0.73         0.65                  76.8
## 2 21-35      1984     28.2       114.             4.72         1.46                  61.8
## 3 36-50      2138     29.1       118.            11.4          6.69                  55.5
## 4 51-65      1784     29.4       127.            22.6         17.7                   47.5
## 5 66+        1188     28.6       135.            40.4         21.4                   37.2
## 6 <NA>        137    NaN         NaN            NaN          NaN                    NaN
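
The `<NA>` row in the table above comes from how cut() treats its break points: intervals are right-closed by default, so the lowest break is excluded unless `include.lowest = TRUE` is set. A minimal sketch with a few made-up ages (the vector `ages` is an illustrative assumption):

```r
# Hypothetical ages chosen to sit on or next to the break points
ages   <- c(0, 1, 20, 21, 65, 66, 80)
breaks <- c(0, 20, 35, 50, 65, 100)

# Default: intervals are (0,20], (20,35], ..., so age 0 is not in any bin
default_bins <- cut(ages, breaks = breaks)

# include.lowest = TRUE closes the first interval on the left: [0,20]
closed_bins <- cut(ages, breaks = breaks, include.lowest = TRUE)

print(data.frame(ages, default_bins, closed_bins))
```

Printing the default interval labels before assigning custom ones is a quick way to check that each label describes the ages its bin actually contains.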

Stratified Analysis: Gender AND Race

Research Question: Are there racial/ethnic differences in hypertension that differ by gender?

# Stratified by gender and race/ethnicity
health_stratified <- nhanes_analysis %>%
  group_by(Gender, Race1) %>%
  summarise(
    N = n(),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    .groups = 'drop'
  )

print(health_stratified)

## # A tibble: 10 × 6
##    Gender Race1        N Mean_SysBP Mean_BMI Pct_Hypertension
##    <fct>  <fct>    <int>      <dbl>    <dbl>            <dbl>
##  1 female Black      614       120.     29.8            17.4 
##  2 female Hispanic   320       114.     26.3             8.88
##  3 female Mexican    452       112.     26.4             7.08
##  4 female White     3221       118.     26.6            13.8 
##  5 female Other      413       113.     24.0            10   
##  6 male   Black      583       123.     26.3            17.3 
##  7 male   Hispanic   290       119.     26.4            10.1 
##  8 male   Mexican    563       119.     26.6            12.4 
##  9 male   White     3151       121.     26.8            15.2 
## 10 male   Other      393       118.     24.8            12.8
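
For comparison, the same kind of two-way stratified prevalence can be sketched in base R with tapply(). The vectors below are made-up toy data (`gender`, `race`, and `htn` are illustrative assumptions, not NHANES values), and this is only an alternative illustration, not the approach used in the lab:

```r
# Hypothetical toy data: two grouping variables and a Yes/No outcome
gender <- c("female", "female", "female", "male", "male", "male", "male", "female")
race   <- c("A", "A", "B", "A", "B", "B", "A", "B")
htn    <- c("Yes", "No", "No", NA, "Yes", "Yes", "No", "Yes")

# Percentage of "Yes" within each gender-by-race cell, ignoring missing values
pct <- tapply(htn == "Yes", list(Gender = gender, Race = race),
              mean, na.rm = TRUE) * 100

print(pct)
```

The result is a gender-by-race matrix, which can be easier to scan for interaction patterns than the long format returned by summarise().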

Part 5: Data Visualization

Hypertension Prevalence by Gender and Age

# Create visualization
health_by_age_gender <- nhanes_analysis %>%
  group_by(Age_Group, Gender) %>%
  summarise(
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    .groups = 'drop'
  )

ggplot(health_by_age_gender, aes(x = Age_Group, y = Pct_Hypertension, fill = Gender)) +
  geom_col(position = "dodge", alpha = 0.8) +
  labs(
    title = "Hypertension Prevalence by Age and Gender",
    subtitle = "NHANES Data",
    x = "Age Group",
    y = "Prevalence (%)",
    fill = "Gender",
    caption = "Source: National Health and Nutrition Examination Survey"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Bar chart showing hypertension prevalence by age group and gender

Physical Activity by Age Group

activity_by_age <- health_by_age %>%
  filter(!is.na(Age_Group))

ggplot(activity_by_age, aes(x = Age_Group, y = Pct_Physically_Active)) +
  geom_col(fill = "#2ecc71", alpha = 0.8) +
  geom_text(aes(label = paste0(Pct_Physically_Active, "%")), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Physical Activity Participation by Age Group",
    subtitle = "NHANES Data",
    x = "Age Group",
    y = "Prevalence (%)",
    caption = "Source: National Health and Nutrition Examination Survey"
  ) +
  ylim(0, 100) +  # the youngest group is at 76.8%, so an upper limit of 70 would drop its bar
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Bar chart showing physical activity prevalence by age group

Hypertension by Race/Ethnicity

hypertension_by_race <- nhanes_analysis %>%
  group_by(Race1) %>%
  summarise(
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    N = n(),
    .groups = 'drop'
  ) %>%
  filter(!is.na(Race1))

ggplot(hypertension_by_race, aes(x = Race1, y = Pct_Hypertension)) +
  geom_col(fill = "#e74c3c", alpha = 0.8) +
  geom_text(aes(label = paste0(Pct_Hypertension, "%")), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Hypertension Prevalence by Race/Ethnicity",
    subtitle = "NHANES Data",
    x = "Race/Ethnicity",
    y = "Prevalence (%)",
    caption = "Source: National Health and Nutrition Examination Survey"
  ) +
  ylim(0, 50) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Bar chart showing hypertension prevalence by race/ethnicity

Part 6: Professional Summary Table

# Create summary table by gender
summary_table <- health_by_gender %>%
  select(
    Gender,
    N,
    `Mean Age` = Mean_Age,
    `Mean BMI` = Mean_BMI,
    `Mean SysBP` = Mean_SysBP,
    `% Hypertension` = Pct_Hypertension,
    `% Physically Active` = Pct_Physically_Active
  )

# Display with kableExtra
kable(summary_table, 
      caption = "Cardiovascular Health and Physical Activity by Gender",
      format = "html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )

Cardiovascular Health and Physical Activity by Gender
Gender      N  Mean Age  Mean BMI  Mean SysBP  % Hypertension  % Physically Active
female   5020     37.64     26.77       117.3           13.07                53.18
male     4980     35.83     26.55       120.9           14.68                58.56

🎯 Task 1: Explore Health Disparities by Education

“How does hypertension prevalence vary by education level?”

Code:

# Your code here:
health_by_education <- nhanes_analysis %>%
  group_by(Education) %>%
  summarise(
    N = n(),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2)
  )

print(health_by_education)

## # A tibble: 6 × 4
##   Education          N Mean_SysBP Pct_Hypertension
##   <fct>          <int>      <dbl>            <dbl>
## 1 8th Grade        451       128.            28.3 
## 2 9 - 11th Grade   888       124.            17.3 
## 3 High School     1517       124.            18.9 
## 4 Some College    2267       122.            16.6 
## 5 College Grad    2098       119.            13.1 
## 6 <NA>            2779       106.             0.72

Answer:

Hypertension prevalence is broadly inversely related to educational attainment. Adults with an 8th grade education have the highest prevalence (28.26%) and college graduates the lowest (13.09%), a difference of 15.2 percentage points and a prevalence about 2.16 times as high. The gradient is not strictly stepwise, since the 9 - 11th grade group (17.3%) falls slightly below the high school group (18.9%), but the overall pattern is that lower education is associated with worse cardiovascular health. These are unadjusted estimates, so differences in age across education groups may account for part of the gap.
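
The two comparison measures in the answer can be checked directly from the reported prevalences. A minimal sketch, assuming the two-decimal values quoted in the answer (28.26% and 13.09%); the variable names are illustrative:

```r
# Prevalences quoted in the answer (percent)
p_8th_grade    <- 28.26
p_college_grad <- 13.09

# Absolute difference, in percentage points
diff_pp <- p_8th_grade - p_college_grad

# Relative measure: prevalence ratio
ratio <- p_8th_grade / p_college_grad

cat("Difference:", round(diff_pp, 1), "percentage points\n")
cat("Prevalence ratio:", round(ratio, 2), "\n")
```

The difference is an absolute measure in percentage points, while the ratio is a relative measure; reporting both avoids the ambiguity of describing the gap as a "percent" difference.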

🎯 Task 2: Create a Visualization

Create a bar chart showing hypertension by education level:

# Your visualization here:
health_by_education %>%
  filter(!is.na(Education)) %>%
  ggplot(aes(x = Education, y = Pct_Hypertension)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = paste0(Pct_Hypertension, "%")), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Hypertension Prevalence by Education Level",
    x = "Education Level",
    y = "Percent with Hypertension (%)",
    caption = "Source: NHANES"
  ) +
  ylim(0, 50) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bar chart showing hypertension prevalence by education level

🎯 Task 3: Write a Data Interpretation

Write 2-3 sentences:

“What does this pattern tell us about health disparities and social determinants?”

Consider:
- Which education groups have highest/lowest hypertension?
- What might explain these differences?
- Why does this matter for public health?

Answer:

Hypertension prevalence is highest among adults with the lowest educational attainment (28.26%) and lowest among college graduates (13.09%), showing a clear social gradient in cardiovascular risk. These inequalities likely result from multiple overlapping pathways, including limited healthcare access, stressful occupational conditions, economic pressures affecting diet and housing, and unequal neighborhood environments that accumulate over the life course to create systematic health disparities (Marmot et al., 2010). These findings matter for public health because they suggest that a substantial share of hypertension could be prevented through upstream interventions targeting educational equity, working conditions, and community resources rather than relying solely on medical treatment after disease develops, which underscores the need for targeted screening, prevention programs, and structural policies in communities with fewer educational and socioeconomic resources (CDC, 2023; Braveman et al., 2011).

References

Braveman, P. A., Cubbin, C., Egerter, S., Williams, D. R., & Pamuk, E. (2011). Socioeconomic disparities in health in the United States: What the patterns tell us. American Journal of Public Health, 101(S1), S186–S196.

Centers for Disease Control and Prevention (CDC). (2023). Hypertension prevalence and control among adults.

Marmot, M., Allen, J., Goldblatt, P., et al. (2010). Fair society, healthy lives: The Marmot Review. London: UCL Institute of Health Equity.

553_Lab01_NHANES

Marina Frimpong

January 27, 2026
