DATA 110 Project 2: Nutrition, Physical Activity, and Obesity in America

Author

Emmanuel Gkatongoni

Image: “Obesity by State & Territories in 2022.” Visual Capitalist via Getty Images. Published in Forbes, March 11, 2025. Data source: CDC National Center for Chronic Disease Prevention and Health Promotion. Retrieved from https://www.forbes.com/sites/steveforbes/2025/03/11/how-to-put-america-on-both-a-fiscal-and-physical-diet/

Intro

Obesity is one of the more pressing public health crises in the United States today, yet it is also one of the most preventable. I picked this topic because I want to understand what drives obesity at the population level, not just at the individual level. As someone who has seen friends and family struggle with conditions linked to obesity, like diabetes and hypertension, I feel a personal and social familiarity with this topic. The data lets us move beyond anecdote and ask: at the state level, does being physically inactive or eating poorly actually predict higher obesity rates? And has anything changed over time?

The dataset I am using is Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System (BRFSS), which is published by the CDC and publicly available through Data.gov. The BRFSS is a nationwide telephone health survey conducted annually across all 50 U.S. states, the District of Columbia, and several U.S. territories. The dataset covers survey years 2011 through 2024 and contains over 110,000 rows in its raw form because it is structured in long format: each row represents one survey question, for one state, in one year, broken down by one demographic stratification (e.g. income, age group, sex, or “Total” for the full population).

For this project I used the following variables:

Categorical variables:

State — The U.S. state where the survey was conducted. This is a **nominal categorical** variable — there is no natural ordering between states.

Year — The year the survey was conducted, ranging from 2011 to 2024. Although year is technically numeric, I treat it as an **ordinal categorical** variable when grouping trends over time.

Quantitative variables (continuous, expressed as percentages):

obesity_pct — The percentage of adults aged 18 and older classified as obese (BMI ≥ 30). This is a continuous quantitative variable and serves as my primary outcome of interest.

inactivity_pct — The percentage of adults who report engaging in no leisure-time physical activity. This is a continuous quantitative predictor variable.

low_fruit_pct — The percentage of adults who report consuming fruit fewer than once per day. This is a continuous quantitative predictor variable representing dietary behavior.

low_veg_pct — The percentage of adults who report consuming vegetables fewer than once per day. This is also a continuous quantitative predictor variable.

To prepare the dataset for analysis, I cleaned it using dplyr and tidyr. I selected the relevant variables (state, year, health measures) and filtered for the obesity, physical inactivity, and fruit/vegetable consumption questions. I reshaped the data from long to wide format using pivot_wider(), so that each row represents a state-year combination. I also dropped rows with NA values in the variables of interest, which reduced the dataset to under 800 observations.

# Load required libraries
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)
library(scales)
library(RColorBrewer)

# Load dataset using readr::read_csv() 
raw <- readr::read_csv("Nutrition__Physical_Activity__and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System.csv", show_col_types = FALSE)

# Confirm dimensions of the raw dataset
dim(raw)

[1] 110880     33

#head(raw, 3)

Data Cleaning and Exploration

Before cleaning, it’s important to understand what the raw data actually looks like. The dataset is in long format, meaning that each survey question is stored as its own row rather than as its own column. The Question column tells us which metric is being measured, and Data_Value holds the numeric result. The StratificationCategory1 column tells us whether a row reflects the full population (“Total”) or a subgroup.
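To make the long-versus-wide distinction concrete, here is a minimal sketch on a toy data frame. The state names, question labels, and values are made up for illustration and are not taken from the BRFSS file; only the column names Question and Data_Value mirror the real data.

```r
library(tidyr)

# Toy long-format data (hypothetical values): one row per state-question pair
toy_long <- data.frame(
  state      = c("State A", "State A", "State B", "State B"),
  year       = c(2021, 2021, 2021, 2021),
  Question   = c("obesity", "inactivity", "obesity", "inactivity"),
  Data_Value = c(30.1, 22.4, 35.7, 28.9)
)

# Wide format: one row per state-year, one column per question
toy_wide <- pivot_wider(toy_long, names_from = Question, values_from = Data_Value)
print(toy_wide)
```

The four long rows collapse into two wide rows, which is the same reshaping applied to the full dataset later in this section.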

# View the unique survey questions available — confirms which metrics we can extract
unique(raw$Question)

[1] "Percent of adults aged 18 years and older who have obesity"                                                                                                                                                                                                                          
[2] "Percent of adults aged 18 years and older who have an overweight classification"                                                                                                                                                                                                     
[3] "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)"                                                                                
[4] "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity (or an equivalent combination) and engage in muscle-strengthening activities on 2 or more days a week"
[5] "Percent of adults who achieve more than 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)"                                                                              
[6] "Percent of adults who engage in muscle-strengthening activities on 2 or more days a week"                                                                                                                                                                                            
[7] "Percent of adults who engage in no leisure-time physical activity"                                                                                                                                                                                                                   
[8] "Percent of adults who report consuming fruit less than one time daily"                                                                                                                                                                                                               
[9] "Percent of adults who report consuming vegetables less than one time daily"

# View stratification categories — we will keep only "Total" (full population)
unique(raw$StratificationCategory1)

[1] "Income"         "Age (years)"    "Race/Ethnicity" "Education"     
[5] "Sex"            "Total"

# Check the year range covered in the dataset
range(raw$YearStart)

[1] 2011 2024

The dataset spans 2011 to 2024, giving us over a decade of trends. It contains nine unique questions across three topic classes: Obesity/Weight Status, Physical Activity, and Fruits and Vegetables. For this analysis I selected the four questions that best capture obesity outcomes and their behavioral predictors.

Cleaning: Filter, Reshape, and Apply Inclusion/Exclusion Criteria

# Define the four question strings we want to extract
obesity_q    <- "Percent of adults aged 18 years and older who have obesity"
inactivity_q <- "Percent of adults who engage in no leisure-time physical activity"
fruit_q      <- "Percent of adults who report consuming fruit less than one time daily"
veg_q        <- "Percent of adults who report consuming vegetables less than one time daily"

# filter() — keep Total stratification, our 4 questions, no NAs, no territories
# select() — keep only the columns we need
# mutate() — create a clean metric name column
df_long <- raw |>
  filter(
    StratificationCategory1 == "Total",
    Question %in% c(obesity_q, inactivity_q, fruit_q, veg_q),
    !is.na(Data_Value),
    !LocationDesc %in% c("National", "Virgin Islands", "Guam", "Puerto Rico")
  ) |>
  select(year = YearStart, state = LocationDesc, Question, Data_Value) |>
  mutate(
    metric = case_when(
      Question == obesity_q    ~ "obesity_pct",
      Question == inactivity_q ~ "inactivity_pct",
      Question == fruit_q      ~ "low_fruit_pct",
      Question == veg_q        ~ "low_veg_pct"
    )
  ) |>
  select(year, state, metric, Data_Value)

# pivot_wider() — reshape to one row per state-year
df_wide <- df_long |>
  pivot_wider(names_from = metric, values_from = Data_Value, values_fn = mean)

# filter() — inclusion criteria: only keep rows where all 4 metrics are present 
# arrange() — sort by state and year for readability
df_clean <- df_wide |>
  filter(
    !is.na(obesity_pct),
    !is.na(inactivity_pct),
    !is.na(low_fruit_pct),
    !is.na(low_veg_pct)
  ) |>
  arrange(state, year)

# Confirm row count — must be under 800 per rubric
nrow(df_clean)

[1] 151

head(df_clean)

# A tibble: 6 × 6
   year state   obesity_pct inactivity_pct low_fruit_pct low_veg_pct
  <dbl> <chr>         <dbl>          <dbl>         <dbl>       <dbl>
1  2017 Alabama        36.3           32            44.9        19.3
2  2019 Alabama        36.1           31.5          46.1        22.3
3  2021 Alabama        39.9           31.5          45.9        20.6
4  2017 Alaska         34.2           20.6          36.9        19  
5  2019 Alaska         30.5           21.7          42.8        19  
6  2021 Alaska         33.5           20.3          42.4        18.7

# Verify no missing values remain in our four key variables
colSums(is.na(df_clean))

          year          state    obesity_pct inactivity_pct  low_fruit_pct 
             0              0              0              0              0 
   low_veg_pct 
             0

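After cleaning, there are no missing values in any of the four quantitative variables. Each row now represents one U.S. state in one year with complete data on obesity, inactivity, and diet quality. Requiring all four metrics also narrows the time coverage: the rows shown by head() come only from 2017, 2019, and 2021, so the cleaned data appears to cover far fewer years than the raw 2011 to 2024 range.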
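One detail in the reshape step above is that values_fn = mean averages any repeated state-year-metric rows without warning. It can be worth checking whether such duplicates exist before pivoting. The sketch below shows one way to do that on a toy data frame (hypothetical values, deliberately built to contain one duplicate); on the real data the same count() call could be applied to df_long.

```r
library(dplyr)

# Toy data (hypothetical): State A has two obesity rows for the same year
toy_long <- data.frame(
  year       = c(2021, 2021, 2021),
  state      = c("State A", "State A", "State B"),
  metric     = c("obesity_pct", "obesity_pct", "obesity_pct"),
  Data_Value = c(30.0, 32.0, 35.7)
)

# Count rows per state-year-metric and keep only combinations seen more than once
dupes <- toy_long |>
  count(year, state, metric, name = "n_rows") |>
  filter(n_rows > 1)

print(dupes)
```

An empty result on the real data would mean the averaging never actually combines rows; a non-empty one shows exactly which values are being averaged.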

Exploratory Visualizations

Quantitative Variables:

Before building the final visualization or regression model, I went through each variable individually to understand its distribution.

# Histogram of obesity rates — the outcome variable
ggplot(df_clean, aes(x = obesity_pct)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Adult Obesity Rates Across State-Years",
       x = "Obesity Rate (%)", y = "Count") +
  theme_minimal()

– The distribution of obesity rates is roughly bell-shaped, centered around 30–35%. No extreme outliers are visible, which suggests the variable is suitable for regression without transformation.

# Scatter plot: does physical inactivity predict obesity? — exploring the core relationship
ggplot(df_clean, aes(x = inactivity_pct, y = obesity_pct)) +
  geom_point(alpha = 0.4, color = "tomato") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred") +
  labs(title = "Physical Inactivity vs. Obesity Rate",
       x = "Physical Inactivity Rate (%)",
       y = "Obesity Rate (%)") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

– There is a clear positive linear trend: states with more inactive adults tend to have higher obesity rates. This makes inactivity a strong candidate predictor for the regression model.

# Boxplot: compare obesity rates between states with high vs low fruit intake
# Create a high/low group based on median split
df_clean |>
  mutate(fruit_group = if_else(low_fruit_pct > median(low_fruit_pct, na.rm = TRUE),
                                "High Low-Fruit Group", "Low Low-Fruit Group")) |>
  ggplot(aes(x = fruit_group, y = obesity_pct, fill = fruit_group)) +
  geom_boxplot(show.legend = FALSE) +
  scale_fill_manual(values = c("High Low-Fruit Group" = "darkorange",
                                "Low Low-Fruit Group"  = "gold")) +
  labs(title = "Obesity Rates by Fruit Consumption Group",
       x = "Fruit Intake Group", y = "Obesity Rate (%)") +
  theme_minimal()

– States where more adults eat little fruit show noticeably higher obesity rates on average. The boxplot makes it easy to see the shift in the median and the spread between groups.

# Scatter: vegetable consumption vs obesity with year as color — adds a time dimension
ggplot(df_clean, aes(x = low_veg_pct, y = obesity_pct, color = as.factor(year))) +
  geom_point(alpha = 0.5, size = 1.5) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 0.8) +
  scale_color_viridis_d(name = "Year", option = "plasma") +
  labs(title = "Low Vegetable Consumption vs. Obesity Rate (colored by Year)",
       x = "Low Vegetable Consumption (%)",
       y = "Obesity Rate (%)") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

– Low vegetable intake also trends positively with obesity. Coloring by year shows that points from more recent years (lighter colors) cluster toward the upper right, suggesting that both variables have shifted upward over time, consistent with the worsening trend in the national average plot below.

Categorical Variables:

# Average obesity rate by year — shows the national trend over time
df_clean |>
  group_by(year) |>
  summarise(mean_obesity = mean(obesity_pct, na.rm = TRUE)) |>
  ggplot(aes(x = year, y = mean_obesity)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  geom_point(color = "steelblue", size = 3) +
  labs(title = "National Average Obesity Rate by Year",
       x = "Year", y = "Mean Obesity Rate (%)") +
  theme_minimal()

# Boxplot of obesity over time for a sample of high and low obesity states
df_clean |>
  filter(state %in% c("Mississippi", "West Virginia", "Alabama",
                       "Colorado", "Hawaii", "Massachusetts")) |>
  ggplot(aes(x = reorder(state, obesity_pct, median), y = obesity_pct, fill = state)) +
  geom_boxplot(show.legend = FALSE) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Obesity Rate Distribution Over Time: Selected States",
       x = "State", y = "Obesity Rate (%)") +
  coord_flip() +
  theme_minimal()

– These exploratory plots tell me that obesity rates have been rising nationally over time, and that there is enormous variation between states. Mississippi and West Virginia consistently rank among the highest, while Colorado and Hawaii rank among the lowest.

Multiple Linear Regression

Justification for Variables

The correlation matrix below quantifies the relationships seen in the exploratory scatter plots above. Any predictor with a strong positive correlation with obesity_pct is a good candidate for the model. This regression uses four quantitative variables: one outcome (obesity_pct) and three predictors (inactivity_pct, low_fruit_pct, low_veg_pct).

# Correlation matrix — used to justify which variables to include in the regression
cor_matrix <- df_clean |>
  select(obesity_pct, inactivity_pct, low_fruit_pct, low_veg_pct) |>
  cor(use = "complete.obs")

print(round(cor_matrix, 3))

               obesity_pct inactivity_pct low_fruit_pct low_veg_pct
obesity_pct          1.000          0.594         0.767       0.325
inactivity_pct       0.594          1.000         0.545       0.214
low_fruit_pct        0.767          0.545         1.000       0.513
low_veg_pct          0.325          0.214         0.513       1.000

– All three predictors show a positive correlation with `obesity_pct`. The correlations between the predictors themselves are moderate but not extreme, meaning multicollinearity is not a serious concern and all three variables can be retained in the model.
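One way to back up the multicollinearity claim is with variance inflation factors (VIFs), which equal the diagonal of the inverse of the predictor correlation matrix. The sketch below rebuilds that 3 × 3 matrix from the rounded correlations printed above, so the results are approximate. The common rule of thumb that a VIF above roughly 5 signals a problem is a general convention, not something taken from this project.

```r
# Predictor-only correlation matrix, copied from the rounded output above
predictors <- c("inactivity_pct", "low_fruit_pct", "low_veg_pct")
R_pred <- matrix(
  c(1.000, 0.545, 0.214,
    0.545, 1.000, 0.513,
    0.214, 0.513, 1.000),
  nrow = 3, byrow = TRUE,
  dimnames = list(predictors, predictors)
)

# VIF for each predictor = corresponding diagonal entry of the inverse matrix
vif <- diag(solve(R_pred))
print(round(vif, 2))
```

With these inputs all three VIFs come out below 2, which is consistent with keeping all three predictors in the model.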

# Multiple linear regression: obesity predicted by inactivity + diet variables
# Justified by correlation matrix and exploratory scatter plots above
model <- lm(obesity_pct ~ inactivity_pct + low_fruit_pct + low_veg_pct, data = df_clean)
summary(model)


Call:
lm(formula = obesity_pct ~ inactivity_pct + low_fruit_pct + low_veg_pct, 
    data = df_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6661 -1.4628  0.4296  1.6029  6.6138 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.69856    1.95119   1.896    0.060 .  
inactivity_pct  0.23984    0.05872   4.084 7.23e-05 ***
low_fruit_pct   0.62267    0.06290   9.900  < 2e-16 ***
low_veg_pct    -0.10753    0.08822  -1.219    0.225    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.513 on 147 degrees of freedom
Multiple R-squared:  0.6363,    Adjusted R-squared:  0.6289 
F-statistic: 85.72 on 3 and 147 DF,  p-value: < 2.2e-16

# Print the regression equation using actual model coefficients
s <- summary(model)
cat("**Regression Equation:**\n\n")

Regression Equation:

cat("obesity_pct =", round(coef(model)[1], 2),
    "+", round(coef(model)[2], 2), "× inactivity_pct",
    "+", round(coef(model)[3], 2), "× low_fruit_pct",
    "+", round(coef(model)[4], 2), "× low_veg_pct\n\n")

obesity_pct = 3.7 + 0.24 × inactivity_pct + 0.62 × low_fruit_pct + -0.11 × low_veg_pct

cat("**Adjusted R²:**", round(s$adj.r.squared, 4), "\n\n")

Adjusted R²: 0.6289
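As a quick sanity check, the adjusted R² can be reproduced by hand from numbers in the summary output above using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below plugs in the reported multiple R² of 0.6363, n = 151 rows, and p = 3 predictors. Because the R² is rounded, the result agrees with the reported value only to about four decimals.

```r
r2 <- 0.6363   # multiple R-squared from summary(model)
n  <- 151      # rows in df_clean
p  <- 3        # number of predictors

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The denominator n - p - 1 is 147, the same residual degrees of freedom shown in the summary output.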

Model Interpretation: The adjusted R² of 0.6289 means that approximately 62.9% of the variation in state-level obesity rates is explained by physical inactivity, low fruit intake, and low vegetable intake combined, after adjusting for the number of predictors. This is a strong fit for population-level health data, especially given that many unmeasured factors like income, healthcare access, and genetics also influence obesity. The regression equation is:

obesity_pct = 3.70 + 0.24(inactivity_pct) + 0.62(low_fruit_pct) - 0.11(low_veg_pct)

Looking at the p-values:

- inactivity_pct: p < 0.001, statistically significant. For every 1 percentage point increase in physical inactivity, obesity is predicted to rise by 0.24 percentage points, holding the diet variables constant.

- low_fruit_pct: p < 0.001, statistically significant and the strongest predictor in the model. For every 1 percentage point increase in adults eating little fruit, obesity is predicted to rise by 0.62 percentage points, holding the other predictors constant. This was somewhat surprising, since I expected inactivity to be the stronger predictor based on the scatter plots.

- low_veg_pct: p = 0.225, not statistically significant at the 0.05 level. Despite showing a positive correlation with obesity in the exploratory plots, once inactivity and fruit intake are controlled for, vegetable consumption does not add significant predictive power. This suggests its effect may be partially captured by the other two variables.

Overall, the model supports the view that both physical inactivity and poor fruit intake are meaningful independent predictors of obesity at the state level. The non-significance of vegetable intake is an interesting finding worth noting in the discussion.
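To show how the fitted equation is used, the sketch below plugs a hypothetical state profile into the coefficients from the summary table. The profile (25% inactive, 40% low fruit, 20% low vegetable) is invented for illustration and does not correspond to any row in the data.

```r
# Coefficients copied from the summary(model) output
coefs <- c(intercept      = 3.69856,
           inactivity_pct = 0.23984,
           low_fruit_pct  = 0.62267,
           low_veg_pct    = -0.10753)

# Hypothetical state profile (assumed values, not from the dataset)
profile <- c(inactivity_pct = 25, low_fruit_pct = 40, low_veg_pct = 20)

# Predicted obesity rate = intercept + sum of coefficient * predictor value
predicted <- coefs[["intercept"]] + sum(coefs[names(profile)] * profile)

# Same profile with inactivity one percentage point higher
profile_plus <- profile
profile_plus["inactivity_pct"] <- 26
predicted_plus <- coefs[["intercept"]] + sum(coefs[names(profile_plus)] * profile_plus)

print(round(predicted, 2))
print(round(predicted_plus - predicted, 2))
```

For this profile the predicted obesity rate is about 32.5%, and raising inactivity by one point adds about 0.24 points, which matches the coefficient interpretation given for inactivity_pct.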

Final Visualization: Physical Inactivity, Diet, and Obesity by State (Most Recent Year)

# Use most recent year for a clean cross-sectional view
# Color states by obesity tier: top quartile in red, bottom quartile in blue
latest_year <- max(df_clean$year)

df_latest <- df_clean |>
  filter(year == latest_year) |>
  mutate(
    diet_score = (low_fruit_pct + low_veg_pct) / 2,
    obesity_tier = case_when(
      obesity_pct >= quantile(obesity_pct, 0.75, na.rm=TRUE) ~ "High Obesity",
      obesity_pct <= quantile(obesity_pct, 0.25, na.rm=TRUE) ~ "Low Obesity",
      TRUE ~ "Moderate Obesity"
    )
  )

p <- ggplot(df_latest,
            aes(x = inactivity_pct,
                y = obesity_pct,
                color = obesity_tier,
                size  = diet_score,
                text  = paste0(state,
                               "\nObesity: ", round(obesity_pct, 1), "%",
                               "\nInactivity: ", round(inactivity_pct, 1), "%",
                               "\nLow Diet Score: ", round(diet_score, 1), "%"))) +
  geom_point(alpha = 0.85) +
  geom_smooth(method = "lm", se = TRUE, color = "gray40", linewidth = 0.8,
            inherit.aes = FALSE,
            mapping = aes(x = inactivity_pct, y = obesity_pct)) +
  scale_color_manual(
    values = c("High Obesity" = "#d73027",
               "Moderate Obesity" = "#fee090",
               "Low Obesity" = "#4575b4"),
    name = "Obesity Tier"
  ) +
  scale_size_continuous(name = "Avg. Low Fruit &\nVeg Consumption (%)",
                        range = c(2, 10)) +
  labs(
    title = paste("Physical Inactivity, Diet Quality, and Obesity by U.S. State (", latest_year, ")", sep = ""),
    subtitle = "Bubble size = average % of adults with low fruit & vegetable intake",
    x = "Physical Inactivity Rate (%)",
    y = "Obesity Prevalence (%)",
    caption = "Source: CDC BRFSS via Data.gov | https://catalog.data.gov/dataset/nutrition-physical-activity-and-obesity-behavioral-risk-factor-surveillance-system"
  ) +
  theme_bw(base_size = 13) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "gray40"),
    legend.position = "right"
  )

p

`geom_smooth()` using formula = 'y ~ x'

# Interactive version using plotly for mouseover interactivity
ggplotly(p, tooltip = "text")

`geom_smooth()` using formula = 'y ~ x'

Obesity Trends Over Time (Top 5 and Bottom 5 States)

# Identify the 5 highest and 5 lowest obesity states in the latest year
top5    <- df_latest |> slice_max(obesity_pct, n = 5) |> pull(state)
bottom5 <- df_latest |> slice_min(obesity_pct, n = 5) |> pull(state)

df_trends <- df_clean |>
  filter(state %in% c(top5, bottom5)) |>
  mutate(group = if_else(state %in% top5, "Highest Obesity States", "Lowest Obesity States"))

ggplot(df_trends, aes(x = year, y = obesity_pct, color = state, linetype = group)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_color_brewer(palette = "Paired", name = "State") +
  scale_linetype_manual(values = c("Highest Obesity States" = "solid",
                                   "Lowest Obesity States"  = "dashed"),
                        name = "Group") +
  labs(
    title = "Obesity Prevalence Over Time: Top 5 vs. Bottom 5 States",
    x     = "Year",
    y     = "Obesity Prevalence (%)",
    caption = "Source: CDC BRFSS via Data.gov"
  ) +
  theme_classic(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Discussion — What the Visualizations Reveal

The bubble chart for 2021 shows a clear positive relationship between physical inactivity and obesity. States like West Virginia (40.6%), Kentucky (40.3%), and Alabama (39.9%) sit in the upper right with high inactivity, high obesity, and large bubbles indicating poor diet quality. Meanwhile Colorado (25.1%), Hawaii (25.0%), and D.C. (24.7%) cluster in the lower left with smaller bubbles. This suggests that inactivity and poor diet tend to go together rather than vary independently across states.

The trend line chart shows that the gap between the highest and lowest obesity states has widened over time. The top 5 states averaged 36.5% obesity in 2017 and rose to 39.9% by 2021, while the bottom 5 only moved from 24.1% to 26.0%, so the gap grew from 12.4 to 13.9 percentage points. This divergence is concerning from a public health equity standpoint, as the states that need improvement the most are falling further behind.

One visualization I attempted but could not fully implement was a choropleth map showing obesity rates by state, which would have made the geographic clustering of high-obesity states in the South more visually obvious. I also would have liked to incorporate income stratification to show how socioeconomic factors mediate the inactivity-obesity relationship, but keeping observations under 800 made that difficult within this project’s scope.