Project II - Physical Activity and BMI

Author

Manuel Bandini

Project II - Manuel Bandini

Introduction

Physical activity is one of the most powerful tools we have for maintaining good health. As a dance instructor, I witness firsthand how regular exercise transforms people’s energy levels, mood, and overall wellbeing. This personal connection to fitness motivated me to explore what the data says about the relationship between exercise and health outcomes across the United States.

For this project, I analyzed data from the National Health and Nutrition Examination Survey (NHANES), a program conducted by the Centers for Disease Control and Prevention (CDC). NHANES is one of the most respected public health datasets in the world, combining personal interviews with physical examinations conducted by trained medical professionals on a nationally representative sample of approximately 5,000 Americans per year. The data used here covers the 2009-2012 survey cycles and comes from the NHANES R package, which bundles a sample of those records. The original data are published by the CDC at https://www.cdc.gov/nchs/nhanes/.

The dataset contains 76 variables and 10,000 observations. For this analysis, I focused on:

Categorical variables:
- Gender: male or female
- PhysActive: whether the participant engages in physical activity (Yes/No)
- HealthGen: self-reported general health status (Excellent, Vgood, Good, Fair, Poor)
- Education: highest level of education completed

Quantitative variables:
- BMI: Body Mass Index, calculated from height and weight measurements
- Age: age of the participant in years
- BPSysAve: average systolic blood pressure in mmHg
- PhysActiveDays: number of days per week the participant is physically active
- SleepHrsNight: average hours of sleep per night
- Pulse: resting heart rate in beats per minute

To prepare the data for analysis, I filtered the dataset to include only adults aged 18 and older, selected the ten variables listed above, removed rows with a missing value in any of those variables (including the key variables BMI, BPSysAve, and PhysActiveDays), and kept the first 800 remaining observations using the slice() function from the dplyr package.

My central research question is: Do age, blood pressure, and physical activity days predict BMI among U.S. adults, and how does self-reported health status relate to BMI?

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(NHANES)
library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(maps)


Attaching package: 'maps'

The following object is masked from 'package:purrr':

    map

data(NHANES)

Exploring the Raw Data

Before cleaning, I explored the full NHANES dataset to see which variables it contains and how they are recorded.

glimpse(NHANES)

Rows: 10,000
Columns: 76
$ ID               <int> 51624, 51624, 51624, 51625, 51630, 51638, 51646, 5164…
$ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10,…
$ Gender           <fct> male, male, male, male, female, male, male, female, f…
$ Age              <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, 54, 10, …
$ AgeDecade        <fct>  30-39,  30-39,  30-39,  0-9,  40-49,  0-9,  0-9,  40…
$ AgeMonths        <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 541, 795,…
$ Race1            <fct> White, White, White, Other, White, White, White, Whit…
$ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Education        <fct> High School, High School, High School, NA, Some Colle…
$ MaritalStatus    <fct> Married, Married, Married, NA, LivePartner, NA, NA, M…
$ HHIncome         <fct> 25000-34999, 25000-34999, 25000-34999, 20000-24999, 3…
$ HHIncomeMid      <int> 30000, 30000, 30000, 22500, 40000, 87500, 60000, 8750…
$ Poverty          <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.00, 5.00,…
$ HomeRooms        <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 10, 4, 3,…
$ HomeOwn          <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own, Own, O…
$ Work             <fct> NotWorking, NotWorking, NotWorking, NA, NotWorking, N…
$ Weight           <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75.7, 75.7,…
$ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Height           <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130.6, 166.…
$ BMI              <dbl> 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20.64, 27.2…
$ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ BMI_WHO          <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 30.0_plus…
$ Pulse            <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 62, 76, 8…
$ BPSysAve         <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 118, 111, …
$ BPDiaAve         <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 74, 85, 6…
$ BPSys1           <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 106, 124, …
$ BPDia1           <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 76, 86, 6…
$ BPSys2           <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 118, 108, …
$ BPDia2           <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 72, 88, 6…
$ BPSys3           <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 118, 114, …
$ BPDia3           <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 76, 82, 7…
$ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ DirectChol       <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12, 2.12, 2…
$ TotChol          <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82, 5.82, 5…
$ UrineVol1        <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 106, 113, …
$ UrineFlow1       <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1.116, 1.…
$ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Diabetes         <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
$ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ HealthGen        <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vgood, Vgo…
$ DaysPhysHlthBad  <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA, NA, 0,…
$ DaysMentHlthBad  <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0, NA, NA,…
$ LittleInterest   <fct> Most, Most, Most, NA, Several, NA, NA, None, None, No…
$ Depressed        <fct> Several, Several, Several, NA, Several, NA, NA, None,…
$ nPregnancies     <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, NA, NA, N…
$ nBabies          <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Age1stBaby       <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ SleepHrsNight    <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA, 5, 7, N…
$ SleepTrouble     <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, No, Y…
$ PhysActive       <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Yes, Yes, …
$ PhysActiveDays   <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1, NA, 2, …
$ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TVHrsDayChild    <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, NA, 4, N…
$ CompHrsDayChild  <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, NA, 3, N…
$ Alcohol12PlusYr  <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Y…
$ AlcoholDay       <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, NA, NA, …
$ AlcoholYear      <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104, 364, N…
$ SmokeNow         <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, NA, NA, …
$ Smoke100         <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Yes, No, …
$ Smoke100n        <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, Non-Smoke…
$ SmokeAge         <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, NA, NA, N…
$ Marijuana        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, NA, Ye…
$ AgeFirstMarij    <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 19, 15, N…
$ RegularMarij     <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Yes, Yes,…
$ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, 15, N…
$ HardDrugs        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, Yes, …
$ SexEver          <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Y…
$ SexAge           <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 22, 12, N…
$ SexNumPartnLife  <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 100, NA, …
$ SexNumPartYear   <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA, NA, 1,…
$ SameSex          <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, No, No, N…
$ SexOrientation   <fct> Heterosexual, Heterosexual, Heterosexual, NA, Heteros…
$ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Data Cleaning and Preparation

# Filter to adults, select the ten variables, drop incomplete rows, keep the first 800
nhanes_clean <- NHANES %>%
  filter(Age >= 18) %>%
  select(Gender, Education, PhysActive, HealthGen,
         BMI, Age, BPSysAve, PhysActiveDays, SleepHrsNight, Pulse) %>%
  # na.omit() on a data frame drops rows with an NA in ANY selected column;
  # it ignores column-name arguments, so one call does the same work as three
  na.omit() %>%
  slice(1:800)
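
One detail of the cleaning step is easy to miss: when `na.omit()` is applied to a data frame, it drops every row that has a missing value in any column, whether or not a column name is passed to it. The sketch below uses a small toy data frame with made-up values (not NHANES records) to contrast that behaviour with dropping rows based on a few chosen columns only, here using base R's `complete.cases()`.

```r
# Toy data with made-up values (not NHANES records).
toy <- data.frame(
  BMI      = c(27.2, NA, 23.7, 30.1),
  BPSysAve = c(118, 111, NA, 125),
  Pulse    = c(62, 60, 70, NA)
)

# na.omit() on a data frame drops rows with an NA in ANY column.
all_cols <- na.omit(toy)

# Keep rows that are complete in the chosen key columns only.
key_cols <- toy[complete.cases(toy[, c("BMI", "BPSysAve")]), ]

nrow(all_cols)  # 1 row survives (row 1)
nrow(key_cols)  # 2 rows survive (rows 1 and 4)
```

In this report the first behaviour applies, so a row missing only SleepHrsNight or Pulse is also excluded from the cleaned dataset.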

Exploring the Cleaned Data

glimpse(nhanes_clean)

Rows: 800
Columns: 10
$ Gender         <fct> female, female, female, male, male, male, male, female,…
$ Education      <fct> College Grad, College Grad, College Grad, Some College,…
$ PhysActive     <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, …
$ HealthGen      <fct> Vgood, Vgood, Vgood, Vgood, Vgood, Fair, Good, Good, Go…
$ BMI            <dbl> 27.24, 27.24, 27.24, 23.67, 23.69, 26.03, 26.60, 19.73,…
$ Age            <int> 45, 45, 45, 66, 58, 54, 50, 56, 56, 57, 54, 54, 36, 64,…
$ BPSysAve       <int> 118, 118, 118, 111, 104, 134, 142, 95, 95, 122, 90, 90,…
$ PhysActiveDays <int> 5, 5, 5, 7, 5, 1, 7, 7, 7, 3, 3, 3, 2, 4, 1, 7, 6, 6, 5…
$ SleepHrsNight  <int> 8, 8, 8, 7, 5, 4, 7, 7, 7, 8, 6, 6, 6, 5, 7, 6, 7, 7, 6…
$ Pulse          <int> 62, 62, 62, 60, 62, 76, 74, 64, 64, 70, 64, 64, 68, 94,…

Exploratory Plot

# Exploratory plot: Distribution of BMI in the cleaned dataset
ggplot(nhanes_clean, aes(x = BMI)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +
  labs(
    title = "Distribution of BMI in Cleaned Dataset",
    x = "BMI",
    y = "Count",
    caption = "Source: CDC NHANES"
  ) +
  theme_minimal()

Multiple Linear Regression Analysis

To examine which factors predict BMI, I built a multiple linear regression model using three quantitative predictors: Age, average systolic blood pressure (BPSysAve), and number of physical activity days per week (PhysActiveDays). These variables were chosen because they represent key dimensions of physical health — demographic background, cardiovascular health, and exercise behavior.

# Multiple linear regression predicting BMI from Age, Blood Pressure, and Physical Activity Days
model <- lm(BMI ~ Age + BPSysAve + PhysActiveDays, data = nhanes_clean)
summary(model)


Call:
lm(formula = BMI ~ Age + BPSysAve + PhysActiveDays, data = nhanes_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.9937  -3.9356  -0.9128   2.9887  28.0824 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    19.285602   1.595511  12.087  < 2e-16 ***
Age            -0.006382   0.014089  -0.453  0.65070    
BPSysAve        0.087712   0.013856   6.330 4.09e-10 ***
PhysActiveDays -0.371015   0.113635  -3.265  0.00114 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.806 on 796 degrees of freedom
Multiple R-squared:  0.06192,   Adjusted R-squared:  0.05838 
F-statistic: 17.51 on 3 and 796 DF,  p-value: 5.114e-11

Regression Results and Interpretation

The regression equation for this model is:

BMI = 19.29 + (-0.006 × Age) + (0.088 × BPSysAve) + (-0.371 × PhysActiveDays)
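
To make the equation concrete, the sketch below plugs in values for a hypothetical 40-year-old with a systolic pressure of 120 mmHg. The coefficients are copied from the model summary; the helper name `predict_bmi` and the example person are assumptions made for illustration only.

```r
# Coefficients copied from the summary(model) output.
coefs <- c(intercept = 19.285602, Age = -0.006382,
           BPSysAve = 0.087712, PhysActiveDays = -0.371015)

# Hypothetical helper (not part of the original analysis).
predict_bmi <- function(age, bp_sys, active_days) {
  unname(coefs["intercept"] +
    coefs["Age"] * age +
    coefs["BPSysAve"] * bp_sys +
    coefs["PhysActiveDays"] * active_days)
}

three_days <- predict_bmi(age = 40, bp_sys = 120, active_days = 3)
five_days  <- predict_bmi(age = 40, bp_sys = 120, active_days = 5)

round(three_days, 2)              # about 28.44
round(five_days, 2)               # about 27.70
round(three_days - five_days, 2)  # about 0.74, i.e. 2 days x 0.371
```

These are model predictions for an average person with those values, not forecasts for any individual, since the model leaves most of the variation in BMI unexplained.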

Looking at the three predictors:

Age was not a significant predictor of BMI (p = 0.65). In other words, once blood pressure and activity days are accounted for, age does not tell us much about a person’s BMI in this dataset.
Blood pressure (BPSysAve) was strongly significant (p < 0.001). Each additional 1 mmHg of average systolic pressure was associated with a BMI about 0.09 higher, holding the other predictors constant. This makes sense, since these two health indicators are closely connected.
Physical activity days (PhysActiveDays) was also significant (p = 0.001). Every additional day of exercise per week was associated with a BMI about 0.37 lower, holding age and blood pressure constant. As someone who teaches dance classes, this finding personally resonates with me. Staying active matters.
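
A confidence interval gives a sense of how precisely the activity effect is estimated. The sketch below rebuilds the t value and an approximate 95% interval for the PhysActiveDays coefficient from the estimate, standard error, and residual degrees of freedom printed in the model summary. With the fitted model available, `confint(model)` should give the same kind of interval directly.

```r
# Estimate and standard error for PhysActiveDays, from summary(model).
estimate  <- -0.371015
std_error <- 0.113635
df_resid  <- 796  # residual degrees of freedom reported by summary(model)

# t statistic: estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided 95% interval: estimate +/- critical t * standard error.
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

round(t_value, 3)  # -3.265, matching the summary table
round(ci, 2)       # roughly -0.59 to -0.15
```

Because the whole interval lies below zero, it is consistent with the significant negative association reported above.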

The adjusted R² was 0.058, meaning this model explains about 6% of the variation in BMI. That might sound low, but it makes sense - BMI is affected by many things like diet, genetics, and sleep, not just the three variables in this model. Overall the model itself was statistically significant (p < 0.001).
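
The adjusted R² and the overall F statistic can both be recovered from the multiple R², the sample size, and the number of predictors. The short check below uses only the numbers printed in the model summary and the standard textbook formulas; it is a sanity check, not part of the original analysis.

```r
r_squared <- 0.06192  # Multiple R-squared from summary(model)
n <- 800              # observations in nhanes_clean
p <- 3                # predictors: Age, BPSysAve, PhysActiveDays

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Overall F statistic: explained variation per predictor over
# unexplained variation per residual degree of freedom.
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

round(adj_r_squared, 5)  # 0.05838
round(f_stat, 2)         # 17.51
```

Both values agree with the summary output, and the small gap between R² and adjusted R² reflects how few predictors there are relative to 800 observations.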

Interactive Visualization

hchart(nhanes_clean, "scatter", 
       hcaes(x = PhysActiveDays, y = BMI, group = HealthGen)) %>%
  hc_title(text = "BMI vs. Physical Activity Days by Self-Reported Health Status") %>%
  hc_xAxis(title = list(text = "Physical Activity Days per Week")) %>%
  hc_yAxis(title = list(text = "BMI")) %>%
  hc_caption(text = "Source: CDC National Health and Nutrition Examination Survey (NHANES)") %>%
  hc_colors(c("green", "blue", "purple", "orange", "red")) %>%
  hc_add_theme(hc_theme_flat())

BMI Distribution by Health Status

# Static ggplot visualization: BMI distribution by Health Status
ggplot(nhanes_clean, aes(x = HealthGen, y = BMI, fill = HealthGen)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("green", "blue", "purple", "orange", "red")) +
  labs(
    title = "BMI Distribution by Self-Reported Health Status",
    subtitle = "U.S. Adults from the NHANES Survey",
    x = "Self-Reported Health Status",
    y = "BMI",
    fill = "Health Status",
    caption = "Source: CDC National Health and Nutrition Examination Survey (NHANES)"
  ) +
  theme_minimal()

What the Visualizations Show

The interactive Highcharter scatter plot reveals that physical activity days range from 1 to 7 days per week across all health status groups, with BMI values spread widely at every activity level. This suggests that physical activity alone does not determine BMI. Other factors like diet and genetics also play important roles.

The boxplot tells a clearer story. There is a consistent upward trend in BMI as self-reported health status worsens. People who rate their health as Excellent have the lowest median BMI (around 25), while those who rate their health as Poor have the highest (around 33). This pattern held across the entire dataset and was one of the most striking findings of this analysis.
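
The medians quoted above were read off the boxplot. A small table of group medians would make them exact. The sketch below shows one way to compute such a table with base R's `tapply()`, using a toy data frame with made-up BMI values as a stand-in for `nhanes_clean`.

```r
# Toy data with made-up values, standing in for nhanes_clean.
toy <- data.frame(
  HealthGen = factor(
    c("Excellent", "Excellent", "Excellent",
      "Good", "Good", "Good",
      "Poor", "Poor", "Poor"),
    levels = c("Excellent", "Vgood", "Good", "Fair", "Poor")
  ),
  BMI = c(22.0, 24.5, 26.0, 25.0, 28.0, 31.0, 29.0, 33.0, 36.0)
)

# Median BMI per health-status group, in factor-level order.
# droplevels() removes groups that have no rows in the toy data.
medians <- tapply(toy$BMI, droplevels(toy$HealthGen), median)
medians
```

Running the same `tapply()` call on `nhanes_clean$BMI` and `nhanes_clean$HealthGen` would give the five medians behind the boxplot.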

One surprise was how wide the BMI range is even within the Excellent health group. Some individuals with very high BMI still rate their health as Excellent, which may reflect differences in how people perceive their own health. If I had more time, I would have liked to include a breakdown by race.

One visualization I was not able to include was a true state-level map of physical activity rates across the U.S. The NHANES dataset, while extremely rich in individual-level health measurements, does not include geographic identifiers such as state or region. This is by design: the CDC removes location data from the public version of NHANES to protect participant privacy.

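Illustrative Regional Map

Because NHANES has no geographic identifiers, the following map is illustrative only. The code assigns each education level to an arbitrary U.S. macro-region as a placeholder grouping, then shades every state in that region by the average BMI of that education group. Darker colors indicate a higher average BMI. The map does not show real geographic differences in BMI or physical activity.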

# Map: Average BMI by region using Education as a proxy grouping
nhanes_map <- nhanes_clean %>%
  mutate(region = case_when(
    Education == "8th Grade" ~ "south",
    Education == "9 - 11th Grade" ~ "west",
    Education == "High School" ~ "midwest",
    Education == "Some College" ~ "northeast",
    Education == "College Grad" ~ "southwest",
    TRUE ~ "other"
  )) %>%
  group_by(region) %>%
  summarise(avg_bmi = mean(BMI, na.rm = TRUE))

us_map <- map_data("state") %>%
  mutate(macro_region = case_when(
    region %in% c("texas", "florida", "georgia", "alabama", "mississippi",
                  "south carolina", "north carolina", "virginia", "tennessee",
                  "arkansas", "louisiana", "kentucky", "west virginia") ~ "south",
    region %in% c("california", "oregon", "washington", "nevada", "arizona",
                  "utah", "idaho", "montana", "wyoming", "colorado",
                  "new mexico", "alaska", "hawaii") ~ "west",
    region %in% c("ohio", "michigan", "indiana", "illinois", "wisconsin",
                  "minnesota", "iowa", "missouri", "north dakota",
                  "south dakota", "nebraska", "kansas") ~ "midwest",
    region %in% c("maine", "new hampshire", "vermont", "massachusetts",
                  "rhode island", "connecticut", "new york", "new jersey",
                  "pennsylvania", "maryland", "delaware") ~ "northeast",
    TRUE ~ "southwest"
  ))

us_map %>%
  left_join(nhanes_map, by = c("macro_region" = "region")) %>%
  ggplot(aes(x = long, y = lat, group = group, fill = avg_bmi)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue",
                      name = "Avg BMI") +
  labs(
    title = "Average BMI by Education Group (Placeholder Regions)",
    subtitle = "Illustrative only: NHANES has no geographic identifiers",
    caption = "Source: CDC National Health and Nutrition Examination Survey (NHANES)"
  ) +
  theme_void()

Citations and AI Disclosure

Data Source: CDC National Center for Health Statistics. National Health and Nutrition Examination Survey. Available through the NHANES R package. https://www.cdc.gov/nchs/nhanes/

R Packages Used:
- Wickham H, et al. (2019). tidyverse. https://www.tidyverse.org
- Kunst J (2022). highcharter. https://jkunst.com/highcharter/
- Pruim R, et al. NHANES R package. https://cran.r-project.org/package=NHANES