Obesity and Physical Activity Across the U.S.

Author

Rahwa Hagos

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

brfss <- read_csv("Nutrition,_Physical_Activity,_and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System_20260617.csv")

Rows: 110880 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (26): LocationAbbr, LocationDesc, Datasource, Class, Topic, Question, Da...
dbl  (6): YearStart, YearEnd, Data_Value, Data_Value_Alt, Low_Confidence_Lim...
num  (1): Sample_Size

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

brfss

# A tibble: 110,880 × 33
   YearStart YearEnd LocationAbbr LocationDesc Datasource   Class Topic Question
       <dbl>   <dbl> <chr>        <chr>        <chr>        <chr> <chr> <chr>   
 1      2011    2011 AL           Alabama      Behavioral … Obes… Obes… Percent…
 2      2011    2011 AL           Alabama      Behavioral … Obes… Obes… Percent…
 3      2011    2011 AL           Alabama      Behavioral … Obes… Obes… Percent…
 4      2011    2011 AL           Alabama      Behavioral … Obes… Obes… Percent…
 5      2011    2011 AL           Alabama      Behavioral … Obes… Obes… Percent…
 6      2011    2011 AL           Alabama      Behavioral … Obes… Obes… Percent…
 7      2011    2011 AL           Alabama      Behavioral … Obes… Obes… Percent…
 8      2011    2011 AL           Alabama      Behavioral … Obes… Obes… Percent…
 9      2011    2011 AL           Alabama      Behavioral … Obes… Obes… Percent…
10      2011    2011 AL           Alabama      Behavioral … Obes… Obes… Percent…
# ℹ 110,870 more rows
# ℹ 25 more variables: Data_Value_Unit <chr>, Data_Value_Type <chr>,
#   Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
#   High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
#   `Age(years)` <chr>, Education <chr>, Sex <chr>, Income <chr>,
#   `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>, …

head(brfss)

# A tibble: 6 × 33
  YearStart YearEnd LocationAbbr LocationDesc Datasource    Class Topic Question
      <dbl>   <dbl> <chr>        <chr>        <chr>         <chr> <chr> <chr>   
1      2011    2011 AL           Alabama      Behavioral R… Obes… Obes… Percent…
2      2011    2011 AL           Alabama      Behavioral R… Obes… Obes… Percent…
3      2011    2011 AL           Alabama      Behavioral R… Obes… Obes… Percent…
4      2011    2011 AL           Alabama      Behavioral R… Obes… Obes… Percent…
5      2011    2011 AL           Alabama      Behavioral R… Obes… Obes… Percent…
6      2011    2011 AL           Alabama      Behavioral R… Obes… Obes… Percent…
# ℹ 25 more variables: Data_Value_Unit <chr>, Data_Value_Type <chr>,
#   Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
#   High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
#   `Age(years)` <chr>, Education <chr>, Sex <chr>, Income <chr>,
#   `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
#   QuestionID <chr>, DataValueTypeID <chr>, LocationID <chr>, …

summary(brfss)

   YearStart       YearEnd     LocationAbbr       LocationDesc      
 Min.   :2011   Min.   :2011   Length:110880      Length:110880     
 1st Qu.:2014   1st Qu.:2014   Class :character   Class :character  
 Median :2017   Median :2017   Mode  :character   Mode  :character  
 Mean   :2017   Mean   :2017                                        
 3rd Qu.:2020   3rd Qu.:2020                                        
 Max.   :2024   Max.   :2024                                        
                                                                    
  Datasource           Class              Topic             Question        
 Length:110880      Length:110880      Length:110880      Length:110880     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 Data_Value_Unit    Data_Value_Type      Data_Value    Data_Value_Alt 
 Length:110880      Length:110880      Min.   : 0.9    Min.   : 0.9   
 Class :character   Class :character   1st Qu.:24.9    1st Qu.:24.9   
 Mode  :character   Mode  :character   Median :31.8    Median :31.8   
                                       Mean   :31.8    Mean   :31.8   
                                       3rd Qu.:37.4    3rd Qu.:37.4   
                                       Max.   :85.3    Max.   :85.3   
                                       NA's   :13214   NA's   :13214  
 Data_Value_Footnote_Symbol Data_Value_Footnote Low_Confidence_Limit
 Length:110880              Length:110880       Min.   : 0.3        
 Class :character           Class :character    1st Qu.:20.4        
 Mode  :character           Mode  :character    Median :27.3        
                                                Mean   :27.4        
                                                3rd Qu.:33.3        
                                                Max.   :74.7        
                                                NA's   :13214       
 High_Confidence_Limit  Sample_Size        Total            Age(years)       
 Min.   : 3.00         Min.   :    50   Length:110880      Length:110880     
 1st Qu.:29.20         1st Qu.:   494   Class :character   Class :character  
 Median :36.50         Median :  1079   Mode  :character   Mode  :character  
 Mean   :36.77         Mean   :  3625                                        
 3rd Qu.:42.80         3rd Qu.:  2399                                        
 Max.   :92.40         Max.   :476876                                        
 NA's   :13214         NA's   :13214                                         
  Education             Sex               Income          Race/Ethnicity    
 Length:110880      Length:110880      Length:110880      Length:110880     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 GeoLocation          ClassID            TopicID           QuestionID       
 Length:110880      Length:110880      Length:110880      Length:110880     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 DataValueTypeID     LocationID        StratificationCategory1
 Length:110880      Length:110880      Length:110880          
 Class :character   Class :character   Class :character       
 Mode  :character   Mode  :character   Mode  :character       
                                                              
                                                              
                                                              
                                                              
 Stratification1    StratificationCategoryId1 StratificationID1 
 Length:110880      Length:110880             Length:110880     
 Class :character   Class :character          Class :character  
 Mode  :character   Mode  :character          Mode  :character

brfss <- brfss %>%
  rename(
    state = LocationDesc,
    class = Class,
    question = Question,
    value = Data_Value,
    category = StratificationCategory1,
    subgroup = Stratification1
  )
brfss

# A tibble: 110,880 × 33
   YearStart YearEnd LocationAbbr state   Datasource        class Topic question
       <dbl>   <dbl> <chr>        <chr>   <chr>             <chr> <chr> <chr>   
 1      2011    2011 AL           Alabama Behavioral Risk … Obes… Obes… Percent…
 2      2011    2011 AL           Alabama Behavioral Risk … Obes… Obes… Percent…
 3      2011    2011 AL           Alabama Behavioral Risk … Obes… Obes… Percent…
 4      2011    2011 AL           Alabama Behavioral Risk … Obes… Obes… Percent…
 5      2011    2011 AL           Alabama Behavioral Risk … Obes… Obes… Percent…
 6      2011    2011 AL           Alabama Behavioral Risk … Obes… Obes… Percent…
 7      2011    2011 AL           Alabama Behavioral Risk … Obes… Obes… Percent…
 8      2011    2011 AL           Alabama Behavioral Risk … Obes… Obes… Percent…
 9      2011    2011 AL           Alabama Behavioral Risk … Obes… Obes… Percent…
10      2011    2011 AL           Alabama Behavioral Risk … Obes… Obes… Percent…
# ℹ 110,870 more rows
# ℹ 25 more variables: Data_Value_Unit <chr>, Data_Value_Type <chr>,
#   value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
#   High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
#   `Age(years)` <chr>, Education <chr>, Sex <chr>, Income <chr>,
#   `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>, …

brfss_clean <- brfss %>%
  filter(!is.na(value))

brfss_clean <- brfss_clean %>%
  filter(!state %in% c("National", "Guam", "Puerto Rico", "Virgin Islands"))

obesity_data <- brfss_clean %>%
  filter(grepl("obesity", question, ignore.case = TRUE))

activity_data <- brfss_clean %>%
  filter(grepl("150 minutes", question))

obesity_state <- obesity_data %>%
  filter(subgroup == "Total") %>%
  group_by(state) %>%
  summarize(avg_obesity = mean(value))

activity_state <- activity_data %>%
  filter(subgroup == "Total") %>%
  group_by(state) %>%
  summarize(avg_activity = mean(value))

head(obesity_state)

# A tibble: 6 × 2
  state      avg_obesity
  <chr>            <dbl>
1 Alabama           36.2
2 Alaska            31.0
3 Arizona           29.7
4 Arkansas          36.2
5 California        26.2
6 Colorado          22.8

head(activity_state)

# A tibble: 6 × 2
  state      avg_activity
  <chr>             <dbl>
1 Alabama            30.8
2 Alaska             41.4
3 Arizona            38.4
4 Arkansas           31.4
5 California         39.7
6 Colorado           43.2

state_joined <- inner_join(obesity_state, activity_state, by = "state")

head(state_joined)

# A tibble: 6 × 3
  state      avg_obesity avg_activity
  <chr>            <dbl>        <dbl>
1 Alabama           36.2         30.8
2 Alaska            31.0         41.4
3 Arizona           29.7         38.4
4 Arkansas          36.2         31.4
5 California        26.2         39.7
6 Colorado          22.8         43.2

ggplot(state_joined, aes(avg_activity, avg_obesity)) +
  geom_point(size = 3, alpha = .6) +
  geom_smooth(se = FALSE) +
  theme_minimal() +
  labs(x = "Percent of Adults Meeting Physical Activity Guidelines",
       y = "Percent of Adults with Obesity",
       title = "Physical Activity vs Obesity Rate by State",
       caption = "Source: CDC BRFSS")

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

income_obesity <- brfss_clean %>%
  filter(grepl("obesity", question, ignore.case = TRUE),
         category == "Income",
         state %in% c("California", "Texas", "Florida", "New York", "Illinois")) %>%
  group_by(state, subgroup) %>%
  summarize(avg_value = mean(value), .groups = "drop")

income_obesity

# A tibble: 35 × 3
   state      subgroup           avg_value
   <chr>      <chr>                  <dbl>
 1 California $15,000 - $24,999       30.2
 2 California $25,000 - $34,999       30.8
 3 California $35,000 - $49,999       27.2
 4 California $50,000 - $74,999       27.5
 5 California $75,000 or greater      23.1
 6 California Data not reported       21.5
 7 California Less than $15,000       32.0
 8 Florida    $15,000 - $24,999       32.2
 9 Florida    $25,000 - $34,999       30.6
10 Florida    $35,000 - $49,999       29.2
# ℹ 25 more rows

ggplot(income_obesity, aes(x = subgroup, y = avg_value, fill = state)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("California" = "red",
                                "Texas" = "blue",
                                "Florida" = "green",
                                "New York" = "purple",
                                "Illinois" = "orange")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Income Group",
       y = "Percent with Obesity",
       fill = "State",
       title = "Obesity Rate by Income Group in 5 States",
       caption = "Source: CDC BRFSS")

library(treemap)

category_counts <- obesity_data %>%
  group_by(category) %>%
  summarize(count = n())

category_counts

# A tibble: 6 × 2
  category       count
  <chr>          <int>
1 Age (years)     4254
2 Education       2836
3 Income          4962
4 Race/Ethnicity  3865
5 Sex             1418
6 Total            709

treemap(category_counts,
        index = "category",
        vSize = "count",
        type = "index",
        title = "Number of Obesity Records by Stratification Category")

Conclusion

How I cleaned the data

Firstly, I shortened the column names from “Age(years)” and such to just age so I wouldn’t have to constantly write those weird character names. Then I dropped all the NAs because I didn’t need data that doesn’t contribute to averages and visuals. In addition, I filtered out national and territories like Puerto Rico and Guam to only have states. With such an extensive dataset with many different survey questions, it made sense to use grepl(“obesity”) and grepl(“150 minutes”) rather than write out the full text of the questions, as that was prone to error. To be able to visualize the relationship between physical activity and obesity, I needed to group_by state, then summarize to give me one average of obesity and one average of physical activity per state, and then inner join just like in the pfizer and fda merge but using the state name.

What the visualizations show

From the graph above, we can observe that there appears to be a tendency for states with high physical activity rates to experience relatively low obesity levels because of the descending trend in the curve although there is a wide variability between the data points such that physical activity is not the only factor that contributes to obesity rates. In the case of the bar chart comparing the obesity percentage among income groups within five different states, it is clear that in all states considered, the lowest income group exhibits an obesity percentage greater than those of the highest income group, confirming my expectations about obesity before beginning the visualization process. As for the treemap, it provided insights regarding the structure of the dataset, indicating that majority of the obesity records are grouped based on income and ethnicity while there are relatively few records that are divided based on sex and education.