DATA 110 Project 2 - Geographic Patterns of Obesity in the U.S.

Author

Catherine Z. Matenje

Geographic Patterns of Obesity and Risk Factors in the United States in 2021

Adult Obesity Prevalence Maps for 49 states, the District of Columbia (DC), and three U.S. territories show the proportion of adults with obesity.

Introduction:

My project focuses on geographic patterns in obesity and related risk factors across the United States using data from the Behavioral Risk Factor Surveillance System (BRFSS), a large cross-sectional survey conducted by the Centers for Disease Control and Prevention (CDC). The dataset comes from the CDC’s Division of Nutrition, Physical Activity, and Obesity (DNPAO) Data, Trends, and Maps database and includes state-level estimates of obesity, nutrition, and physical activity.

The dataset contains quantitative variables such as obesity prevalence (percentage of adults with obesity), physical inactivity, and low fruit and vegetable consumption, as well as a categorical variable for state (LocationDesc).

To prepare the data for exploration, I used dplyr to filter to observations from 2021, remove missing values in “Data_Value”, and exclude national-level data. Because the dataset is in long format (multiple rows per state, one for each indicator and stratification), I kept only the relevant indicators and restricted the data to the “Total” stratification category, which left 212 rows (four indicators for each of 53 locations). Reshaping to wide format then gives a single row per location, reducing the dataset from 93,249 observations to 53 to meet the project requirements.

I selected this topic because of my interest in addressing the growing global burden of noncommunicable diseases (NCDs). Obesity is a major shared risk factor for chronic NCDs such as cardiovascular disease, diabetes, cancer, and chronic respiratory diseases. Understanding how behavioral risk factors contribute to obesity patterns can help inform public health initiatives and improve population health outcomes.

Variables:

Variable Type           Variable Name              Description
Outcome                 Obesity                    % of adults with obesity
Predictor (Behavioral)  Physical Inactivity        % of adults with no leisure-time physical activity
Predictor (Behavioral)  Low Fruit Consumption      % consuming fruit < 1 time daily
Predictor (Behavioral)  Low Vegetable Consumption  % consuming vegetables < 1 time daily

Load Libraries

# Load required libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)
Warning: package 'leaflet' was built under R version 4.5.3
library(RColorBrewer)

Load Data set

# Load dataset
 setwd("C:/Users/cathe/OneDrive/Desktop/Montgomery College Transition/2025-2026 MONTGOMERY COLLEGE TRANSITION/MC COURSES 25-26/Spring 2026/DATA 110/02. Projects/Project 2")

data <- readr::read_csv("Nutrition__Physical_Activity__and_Obesity_Behavioral_Risk_Factor_Surveillance_System.csv")
Rows: 93249 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): LocationAbbr, LocationDesc, Datasource, Class, Topic, Question, Da...
dbl  (8): YearStart, YearEnd, Data_Value, Data_Value_Alt, Low_Confidence_Lim...
lgl  (1): Data_Value_Unit

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning & Filtering

data_clean <- data |>
  filter(
    YearStart == 2021, # Keep observations from 2021
    !is.na(Data_Value), # Remove missing values
    LocationDesc != "National" # Remove national-level data
  )

Exploring the Data

# View structure of dataset
head(data_clean)
# A tibble: 6 × 33
  YearStart YearEnd LocationAbbr LocationDesc  Datasource   Class Topic Question
      <dbl>   <dbl> <chr>        <chr>         <chr>        <chr> <chr> <chr>   
1      2021    2021 MO           Missouri      Behavioral … Obes… Obes… Percent…
2      2021    2021 VT           Vermont       Behavioral … Obes… Obes… Percent…
3      2021    2021 MI           Michigan      Behavioral … Frui… Frui… Percent…
4      2021    2021 NM           New Mexico    Behavioral … Frui… Frui… Percent…
5      2021    2021 SD           South Dakota  Behavioral … Frui… Frui… Percent…
6      2021    2021 NH           New Hampshire Behavioral … Obes… Obes… Percent…
# ℹ 25 more variables: Data_Value_Unit <lgl>, Data_Value_Type <chr>,
#   Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
#   High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
#   `Age(years)` <chr>, Education <chr>, Gender <chr>, Income <chr>,
#   `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
#   QuestionID <chr>, DataValueTypeID <chr>, LocationID <dbl>, …
# Summary statistics
summary(data_clean)
   YearStart       YearEnd     LocationAbbr       LocationDesc      
 Min.   :2021   Min.   :2021   Length:6746        Length:6746       
 1st Qu.:2021   1st Qu.:2021   Class :character   Class :character  
 Median :2021   Median :2021   Mode  :character   Mode  :character  
 Mean   :2021   Mean   :2021                                        
 3rd Qu.:2021   3rd Qu.:2021                                        
 Max.   :2021   Max.   :2021                                        
  Datasource           Class              Topic             Question        
 Length:6746        Length:6746        Length:6746        Length:6746       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 Data_Value_Unit Data_Value_Type      Data_Value    Data_Value_Alt 
 Mode:logical    Length:6746        Min.   : 5.50   Min.   : 5.50  
 NA's:6746       Class :character   1st Qu.:23.40   1st Qu.:23.40  
                 Mode  :character   Median :32.40   Median :32.40  
                                    Mean   :31.22   Mean   :31.22  
                                    3rd Qu.:38.17   3rd Qu.:38.17  
                                    Max.   :61.40   Max.   :61.40  
 Data_Value_Footnote_Symbol Data_Value_Footnote Low_Confidence_Limit
 Length:6746                Length:6746         Min.   : 2.60       
 Class :character           Class :character    1st Qu.:18.90       
 Mode  :character           Mode  :character    Median :27.50       
                                                Mean   :26.57       
                                                3rd Qu.:33.60       
                                                Max.   :56.10       
 High_Confidence_Limit  Sample_Size         Total            Age(years)       
 Min.   : 7.0          Min.   :   50.0   Length:6746        Length:6746       
 1st Qu.:28.7          1st Qu.:  408.0   Class :character   Class :character  
 Median :37.3          Median :  888.5   Mode  :character   Mode  :character  
 Mean   :36.5          Mean   : 1753.6                                        
 3rd Qu.:43.9          3rd Qu.: 2165.8                                        
 Max.   :77.9          Max.   :38964.0                                        
  Education            Gender             Income          Race/Ethnicity    
 Length:6746        Length:6746        Length:6746        Length:6746       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 GeoLocation          ClassID            TopicID           QuestionID       
 Length:6746        Length:6746        Length:6746        Length:6746       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 DataValueTypeID      LocationID    StratificationCategory1 Stratification1   
 Length:6746        Min.   : 1.00   Length:6746             Length:6746       
 Class :character   1st Qu.:18.00   Class :character        Class :character  
 Mode  :character   Median :31.00   Mode  :character        Mode  :character  
                    Mean   :31.43                                             
                    3rd Qu.:45.00                                             
                    Max.   :78.00                                             
 StratificationCategoryId1 StratificationID1 
 Length:6746               Length:6746       
 Class :character          Class :character  
 Mode  :character          Mode  :character  
                                             
                                             
                                             
# Checking key variables of interest (obesity outcome, physical inactivity, fruit/vegetable consumption)
unique(data_clean$Question)
[1] "Percent of adults aged 18 years and older who have obesity"                     
[2] "Percent of adults who report consuming fruit less than one time daily"          
[3] "Percent of adults who report consuming vegetables less than one time daily"     
[4] "Percent of adults who engage in no leisure-time physical activity"              
[5] "Percent of adults aged 18 years and older who have an overweight classification"

Checking variables between original data and clean data

unique(data$Question)
[1] "Percent of adults who engage in no leisure-time physical activity"                                                                                                                                                                                    
[2] "Percent of adults aged 18 years and older who have obesity"                                                                                                                                                                                           
[3] "Percent of adults aged 18 years and older who have an overweight classification"                                                                                                                                                                      
[4] "Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)"                                                
[5] "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week"
[6] "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)"                                                 
[7] "Percent of adults who engage in muscle-strengthening activities on 2 or more days a week"                                                                                                                                                             
[8] "Percent of adults who report consuming fruit less than one time daily"                                                                                                                                                                                
[9] "Percent of adults who report consuming vegetables less than one time daily"                                                                                                                                                                           
unique(data_clean$Question)
[1] "Percent of adults aged 18 years and older who have obesity"                     
[2] "Percent of adults who report consuming fruit less than one time daily"          
[3] "Percent of adults who report consuming vegetables less than one time daily"     
[4] "Percent of adults who engage in no leisure-time physical activity"              
[5] "Percent of adults aged 18 years and older who have an overweight classification"

Creating final data set with all variables

data_final <- data_clean |>
  filter(
    Question %in% c(
      "Percent of adults aged 18 years and older who have obesity",
      "Percent of adults who engage in no leisure-time physical activity",
      "Percent of adults who report consuming fruit less than one time daily",
      "Percent of adults who report consuming vegetables less than one time daily"
    ),
    StratificationCategory1 == "Total"
  )

# data_final has 212 observations (4 indicators x 53 locations)

unique(data_final$Question)
[1] "Percent of adults who report consuming vegetables less than one time daily"
[2] "Percent of adults who report consuming fruit less than one time daily"     
[3] "Percent of adults aged 18 years and older who have obesity"                
[4] "Percent of adults who engage in no leisure-time physical activity"         
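Before reshaping, it can help to confirm that each location contributes exactly one row per indicator, since duplicate pairs would multiply rows in a later join. The following is a minimal sketch of that check on a small made-up tibble; `toy_long`, its values, and the shortened indicator names are hypothetical and stand in for `data_final`.

```r
library(dplyr)

# Hypothetical long-format rows; values are illustrative, not BRFSS estimates
toy_long <- tibble::tibble(
  LocationDesc = c("State A", "State A", "State B", "State B"),
  Question     = c("obesity", "inactivity", "obesity", "inactivity"),
  Data_Value   = c(30.1, 22.4, 35.6, 28.9)
)

# Count rows for every location/indicator pair
pair_counts <- toy_long |>
  count(LocationDesc, Question, name = "n_rows")

# Any pair with more than one row would be duplicated by a later join
duplicates <- pair_counts |>
  filter(n_rows > 1)

nrow(duplicates)                    # 0 here: every pair is unique
n_distinct(toy_long$LocationDesc)   # number of distinct locations
```

Running the same two steps on `data_final` should return zero duplicated pairs if the “Total” filter worked as intended.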

Creating a Merged Data Set

The original dataset is in long format, where each state appears multiple times, once for each health indicator. To map the data later, it must be in wide format, where each row represents a state and each indicator sits in its own column.

So, with the help of AI (OpenAI ChatGPT), I created a separate data set for each of my variables of interest (obesity, physical inactivity, low fruit consumption, and low vegetable consumption) by filtering the data on the “Question” variable. In each data set I kept only the columns I needed and renamed “Data_Value” to the corresponding variable name. Lastly, I merged the individual data sets with left joins on the state name (LocationDesc).

# Creating separate data sets for each variable of interest

obesity <- data_final |>
  filter(Question == "Percent of adults aged 18 years and older who have obesity") |>
  select(LocationDesc, Data_Value, GeoLocation) |>
  rename(obesity = Data_Value)

inactivity <- data_final |>
  filter(Question == "Percent of adults who engage in no leisure-time physical activity") |>
  select(LocationDesc, Data_Value) |>
  rename(inactivity = Data_Value)

fruit <- data_final |>
  filter(Question == "Percent of adults who report consuming fruit less than one time daily") |>
  select(LocationDesc, Data_Value) |>
  rename(fruit = Data_Value)

veg <- data_final |>
  filter(Question == "Percent of adults who report consuming vegetables less than one time daily") |>
  select(LocationDesc, Data_Value) |>
  rename(veg = Data_Value)

# Merging all variables into one data set 
merged_data <- obesity |>
  left_join(inactivity, by = "LocationDesc") |>
  left_join(fruit, by = "LocationDesc") |>
  left_join(veg, by = "LocationDesc")

# Extracting the lat/lon 
merged_data <- merged_data |>
  separate(GeoLocation, into = c("lat", "lon"), sep = ",") |>
  mutate(
    lat = as.numeric(gsub("[()]", "", lat)),
    lon = as.numeric(gsub("[()]", "", lon))
  )
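As an alternative to the filter-and-join approach above, the same long-to-wide reshape could probably be done in one step with `tidyr::pivot_wider()`. The sketch below is not the method used in this project; it runs on a made-up tibble (`toy_long`, with hypothetical values and shortened indicator names) and assumes `GeoLocation` strings of the form "(lat, lon)".

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format rows; values are illustrative, not BRFSS estimates
toy_long <- tibble::tibble(
  LocationDesc = rep(c("State A", "State B"), each = 2),
  GeoLocation  = rep(c("(40.0, -100.0)", "(35.5, -90.25)"), each = 2),
  Question     = rep(c("obesity", "inactivity"), times = 2),
  Data_Value   = c(30.1, 22.4, 35.6, 28.9)
)

toy_wide <- toy_long |>
  # One column per indicator, one row per location
  pivot_wider(names_from = Question, values_from = Data_Value) |>
  # Drop the parentheses, then split on the comma and any following spaces
  mutate(GeoLocation = gsub("[()]", "", GeoLocation)) |>
  separate(GeoLocation, into = c("lat", "lon"), sep = ",\\s*") |>
  mutate(across(c(lat, lon), as.numeric))

toy_wide
```

With the real data, the long `Question` strings would become column names, so they would likely need to be recoded to short names first.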

Density Plot (Fruit and Vegetable Consumption)

ggplot(
  data_final |>
    filter(
      Question %in% c(
        "Percent of adults who report consuming fruit less than one time daily",
        "Percent of adults who report consuming vegetables less than one time daily"
      ),
      StratificationCategory1 == "Total"
    ),
  aes(x = Data_Value, fill = Question, color = Question)
) +
  geom_density(alpha = 0.4) +
  scale_fill_brewer(
    palette = "Set2",
    labels = c(
      "Low Fruit Consumption",
      "Low Vegetable Consumption"
    )
  ) +
  scale_color_brewer(
    palette = "Set2",
    labels = c(
      "Low Fruit Consumption",
      "Low Vegetable Consumption"
    )
  ) +
  labs(
    title = "Distribution of Low Fruit vs Vegetable Consumption (2021)",
    x = "Percentage (%)",
    y = "Density",
    fill = "Behavior",
    color = "Behavior"
  ) +
  theme_minimal()

This density plot compares the distributions of low fruit and low vegetable consumption across states. The two distributions differ markedly: low fruit consumption appears to sit at higher percentages than low vegetable consumption. This suggests that inadequate fruit consumption may be more prevalent than inadequate vegetable consumption, which highlights potential differences in dietary choices and possibly in access to fruit across communities.
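The visual comparison could be backed up with a few summary statistics per indicator. The sketch below shows one way to do that with `group_by()` and `summarise()`; `toy_diet` and its values are made up for illustration and are not BRFSS estimates.

```r
library(dplyr)

# Hypothetical state-level percentages; not BRFSS estimates
toy_diet <- tibble::tibble(
  Question   = rep(c("Low fruit", "Low vegetable"), each = 4),
  Data_Value = c(38, 41, 43, 46, 17, 19, 21, 23)
)

# Center and spread of each distribution
diet_summary <- toy_diet |>
  group_by(Question) |>
  summarise(
    median_pct = median(Data_Value),
    iqr_pct    = IQR(Data_Value),
    min_pct    = min(Data_Value),
    max_pct    = max(Data_Value)
  )

diet_summary
```

Applied to the fruit and vegetable rows of `data_final`, a table like this would show how far apart the two medians are and whether the ranges overlap.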

Scatter Plot

ggplot(merged_data, aes(x = inactivity, y = obesity, color = fruit)) +
  geom_jitter(width = 0.5, height = 0.5, size = 3, alpha = 0.8) +
  
  geom_smooth(method = "lm", color = "gray40", linewidth = 1) +
  
  scale_color_gradientn(
    colors = c("#f7c6c7", "#f4a6a8", "#c7e9c0", "#74c476"),
    name = "Low Fruit Consumption (%)"
  ) +
  
  labs(
    title = "Obesity vs Physical Inactivity (2021)",
    subtitle = "Colored by Low Fruit Consumption",
    x = "Physical Inactivity (%)",
    y = "Obesity (%)",
    caption = "Higher physical inactivity and low fruit consumption are associated with higher obesity rates.\nSource: CDC BRFSS 2021"
  ) +
  
  theme_minimal() +
  
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(size = 9, hjust = 0)
  )
`geom_smooth()` using formula = 'y ~ x'

Map

# Creating popup with all variables
popup_all <- paste0(
  "<b>State: </b>", merged_data$LocationDesc, "<br>",
  "<b>Obesity (%): </b>", round(merged_data$obesity, 1), "<br>",
  "<b>Physical Inactivity (%): </b>", round(merged_data$inactivity, 1), "<br>",
  "<b>Low Fruit Consumption (%): </b>", round(merged_data$fruit, 1), "<br>",
  "<b>Low Vegetable Consumption (%): </b>", round(merged_data$veg, 1)
)

pal <- colorNumeric(
  palette = c("#fde2e4", "#f4acb7", "#c9184a"),
  domain = merged_data$obesity
)

# Creating the map 
leaflet(data = merged_data) |>
  setView(lng = -98, lat = 39, zoom = 4) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(
    lng = ~lon,
    lat = ~lat,
    radius = ~obesity * 500,
    color = ~pal(obesity),
    fillColor = ~pal(obesity),
    fillOpacity = 0.7,
    popup = popup_all
  ) |>
  addLegend(
    position = "bottomright",
    pal = pal,
    values = merged_data$obesity,
    title = "Obesity (%)",
    opacity = 0.7
  )

Essay

Scatter Plot

The scatter plot shows the relationship between physical inactivity (%) and obesity prevalence (%) across U.S. states. Each point represents a state, and color indicates low fruit consumption (green = a higher share of adults with low fruit consumption, pink/coral = a lower share). Overall, there is a clear positive association between the two variables: states with higher levels of physical inactivity tend to have higher obesity rates. The fitted regression line shows this upward trend, which suggests that physical inactivity is an important behavioral risk factor associated with obesity.

Further, states with higher levels of low fruit consumption also tend to have higher physical inactivity and higher obesity, suggesting that a diet low in fruit (a source of fiber, vitamins, and minerals) and a lack of physical activity may together contribute to increased obesity rates. It is important to note that additional factors such as socioeconomic conditions or access to resources, which are not reflected here, could also influence obesity outcomes.
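The strength of the association described above could be quantified with a correlation coefficient and a simple linear model, which is the same kind of fit that `geom_smooth(method = "lm")` draws. The following is a minimal sketch on made-up numbers; `toy_states` and its values are hypothetical and are not BRFSS estimates.

```r
# Hypothetical state-level percentages; not BRFSS estimates
toy_states <- data.frame(
  inactivity = c(18, 21, 24, 27, 30, 33),
  obesity    = c(27, 30, 31, 35, 36, 40)
)

# Pearson correlation between the two percentages
r <- cor(toy_states$inactivity, toy_states$obesity)

# Simple linear regression of obesity on physical inactivity
fit <- lm(obesity ~ inactivity, data = toy_states)

r                        # direction and strength of the linear association
coef(fit)                # intercept and slope
summary(fit)$r.squared   # share of variance explained (equals r^2 with one predictor)
```

With the real data, the same calls on `merged_data` would put a number on the upward trend, although the state-level, cross-sectional design still would not support a causal reading.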

Map

The interactive map displays obesity prevalence across U.S. states, where circle size and color represent the percentage of adults with obesity. Larger and darker circles indicate higher obesity rates, while smaller and lighter circles indicate lower rates. The pop-up labels provide additional information on physical inactivity and fruit and vegetable consumption.
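To illustrate how a single obesity value drives both the color and the size of a circle, the sketch below applies the same kind of `colorNumeric()` palette and radius scaling to three made-up values. `toy_obesity` and the names prefixed with `toy_` are hypothetical and are not BRFSS estimates.

```r
library(leaflet)

# Hypothetical obesity percentages; not BRFSS estimates
toy_obesity <- c(25, 32, 40)

# Map the numeric range onto a light-to-dark ramp (same colors as the map)
toy_pal <- colorNumeric(
  palette = c("#fde2e4", "#f4acb7", "#c9184a"),
  domain  = toy_obesity
)

toy_colors <- toy_pal(toy_obesity)   # one hex color per value
toy_radius <- toy_obesity * 500      # addCircles() takes the radius in meters

data.frame(obesity = toy_obesity, color = toy_colors, radius_m = toy_radius)
```

Because the radius is proportional to the percentage, the difference in circle size between states is fairly small; color carries most of the contrast.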

The map shows clear regional differences. Obesity rates are highest in many southern states, including Mississippi, Alabama, Louisiana, Kentucky, Arkansas, and West Virginia, while northeastern and western states have lower obesity rates. These patterns also appear to align with behavioral and dietary trends, as states with higher obesity also tend to have higher levels of physical inactivity and low fruit consumption. Overall, low fruit consumption appears to be high across most states, which suggests that it is a widespread national issue rather than one limited to specific regions.

Citations:

Centers for Disease Control and Prevention (CDC). Behavioral Risk Factor Surveillance System (BRFSS).

AI: I used ChatGPT (OpenAI) for help converting the data set from long format to wide format.