── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.3 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
brfss <- read_csv ("Nutrition,_Physical_Activity,_and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System_20260617.csv" )
Rows: 110880 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (26): LocationAbbr, LocationDesc, Datasource, Class, Topic, Question, Da...
dbl (6): YearStart, YearEnd, Data_Value, Data_Value_Alt, Low_Confidence_Lim...
num (1): Sample_Size
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 110,880 × 33
YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2011 2011 AL Alabama Behavioral … Obes… Obes… Percent…
2 2011 2011 AL Alabama Behavioral … Obes… Obes… Percent…
3 2011 2011 AL Alabama Behavioral … Obes… Obes… Percent…
4 2011 2011 AL Alabama Behavioral … Obes… Obes… Percent…
5 2011 2011 AL Alabama Behavioral … Obes… Obes… Percent…
6 2011 2011 AL Alabama Behavioral … Obes… Obes… Percent…
7 2011 2011 AL Alabama Behavioral … Obes… Obes… Percent…
8 2011 2011 AL Alabama Behavioral … Obes… Obes… Percent…
9 2011 2011 AL Alabama Behavioral … Obes… Obes… Percent…
10 2011 2011 AL Alabama Behavioral … Obes… Obes… Percent…
# ℹ 110,870 more rows
# ℹ 25 more variables: Data_Value_Unit <chr>, Data_Value_Type <chr>,
# Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
# Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
# High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
# `Age(years)` <chr>, Education <chr>, Sex <chr>, Income <chr>,
# `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>, …
# A tibble: 6 × 33
YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2011 2011 AL Alabama Behavioral R… Obes… Obes… Percent…
2 2011 2011 AL Alabama Behavioral R… Obes… Obes… Percent…
3 2011 2011 AL Alabama Behavioral R… Obes… Obes… Percent…
4 2011 2011 AL Alabama Behavioral R… Obes… Obes… Percent…
5 2011 2011 AL Alabama Behavioral R… Obes… Obes… Percent…
6 2011 2011 AL Alabama Behavioral R… Obes… Obes… Percent…
# ℹ 25 more variables: Data_Value_Unit <chr>, Data_Value_Type <chr>,
# Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
# Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
# High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
# `Age(years)` <chr>, Education <chr>, Sex <chr>, Income <chr>,
# `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
# QuestionID <chr>, DataValueTypeID <chr>, LocationID <chr>, …
YearStart YearEnd LocationAbbr LocationDesc
Min. :2011 Min. :2011 Length:110880 Length:110880
1st Qu.:2014 1st Qu.:2014 Class :character Class :character
Median :2017 Median :2017 Mode :character Mode :character
Mean :2017 Mean :2017
3rd Qu.:2020 3rd Qu.:2020
Max. :2024 Max. :2024
Datasource Class Topic Question
Length:110880 Length:110880 Length:110880 Length:110880
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Data_Value_Unit Data_Value_Type Data_Value Data_Value_Alt
Length:110880 Length:110880 Min. : 0.9 Min. : 0.9
Class :character Class :character 1st Qu.:24.9 1st Qu.:24.9
Mode :character Mode :character Median :31.8 Median :31.8
Mean :31.8 Mean :31.8
3rd Qu.:37.4 3rd Qu.:37.4
Max. :85.3 Max. :85.3
NA's :13214 NA's :13214
Data_Value_Footnote_Symbol Data_Value_Footnote Low_Confidence_Limit
Length:110880 Length:110880 Min. : 0.3
Class :character Class :character 1st Qu.:20.4
Mode :character Mode :character Median :27.3
Mean :27.4
3rd Qu.:33.3
Max. :74.7
NA's :13214
High_Confidence_Limit Sample_Size Total Age(years)
Min. : 3.00 Min. : 50 Length:110880 Length:110880
1st Qu.:29.20 1st Qu.: 494 Class :character Class :character
Median :36.50 Median : 1079 Mode :character Mode :character
Mean :36.77 Mean : 3625
3rd Qu.:42.80 3rd Qu.: 2399
Max. :92.40 Max. :476876
NA's :13214 NA's :13214
Education Sex Income Race/Ethnicity
Length:110880 Length:110880 Length:110880 Length:110880
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
GeoLocation ClassID TopicID QuestionID
Length:110880 Length:110880 Length:110880 Length:110880
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
DataValueTypeID LocationID StratificationCategory1
Length:110880 Length:110880 Length:110880
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Stratification1 StratificationCategoryId1 StratificationID1
Length:110880 Length:110880 Length:110880
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
brfss <- brfss %>%
rename (
state = LocationDesc,
class = Class,
question = Question,
value = Data_Value,
category = StratificationCategory1,
subgroup = Stratification1
)
brfss
# A tibble: 110,880 × 33
YearStart YearEnd LocationAbbr state Datasource class Topic question
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2011 2011 AL Alabama Behavioral Risk … Obes… Obes… Percent…
2 2011 2011 AL Alabama Behavioral Risk … Obes… Obes… Percent…
3 2011 2011 AL Alabama Behavioral Risk … Obes… Obes… Percent…
4 2011 2011 AL Alabama Behavioral Risk … Obes… Obes… Percent…
5 2011 2011 AL Alabama Behavioral Risk … Obes… Obes… Percent…
6 2011 2011 AL Alabama Behavioral Risk … Obes… Obes… Percent…
7 2011 2011 AL Alabama Behavioral Risk … Obes… Obes… Percent…
8 2011 2011 AL Alabama Behavioral Risk … Obes… Obes… Percent…
9 2011 2011 AL Alabama Behavioral Risk … Obes… Obes… Percent…
10 2011 2011 AL Alabama Behavioral Risk … Obes… Obes… Percent…
# ℹ 110,870 more rows
# ℹ 25 more variables: Data_Value_Unit <chr>, Data_Value_Type <chr>,
# value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
# Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
# High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
# `Age(years)` <chr>, Education <chr>, Sex <chr>, Income <chr>,
# `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>, …
brfss_clean <- brfss %>%
filter (! is.na (value))
brfss_clean <- brfss_clean %>%
filter (! state %in% c ("National" , "Guam" , "Puerto Rico" , "Virgin Islands" ))
obesity_data <- brfss_clean %>%
filter (grepl ("obesity" , question, ignore.case = TRUE ))
activity_data <- brfss_clean %>%
filter (grepl ("150 minutes" , question))
obesity_state <- obesity_data %>%
filter (subgroup == "Total" ) %>%
group_by (state) %>%
summarize (avg_obesity = mean (value))
activity_state <- activity_data %>%
filter (subgroup == "Total" ) %>%
group_by (state) %>%
summarize (avg_activity = mean (value))
head (obesity_state)
# A tibble: 6 × 2
state avg_obesity
<chr> <dbl>
1 Alabama 36.2
2 Alaska 31.0
3 Arizona 29.7
4 Arkansas 36.2
5 California 26.2
6 Colorado 22.8
# A tibble: 6 × 2
state avg_activity
<chr> <dbl>
1 Alabama 30.8
2 Alaska 41.4
3 Arizona 38.4
4 Arkansas 31.4
5 California 39.7
6 Colorado 43.2
state_joined <- inner_join (obesity_state, activity_state, by = "state" )
head (state_joined)
# A tibble: 6 × 3
state avg_obesity avg_activity
<chr> <dbl> <dbl>
1 Alabama 36.2 30.8
2 Alaska 31.0 41.4
3 Arizona 29.7 38.4
4 Arkansas 36.2 31.4
5 California 26.2 39.7
6 Colorado 22.8 43.2
ggplot (state_joined, aes (avg_activity, avg_obesity)) +
geom_point (size = 3 , alpha = .6 ) +
geom_smooth (se = FALSE ) +
theme_minimal () +
labs (x = "Percent of Adults Meeting Physical Activity Guidelines" ,
y = "Percent of Adults with Obesity" ,
title = "Physical Activity vs Obesity Rate by State" ,
caption = "Source: CDC BRFSS" )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
income_obesity <- brfss_clean %>%
filter (grepl ("obesity" , question, ignore.case = TRUE ),
category == "Income" ,
state %in% c ("California" , "Texas" , "Florida" , "New York" , "Illinois" )) %>%
group_by (state, subgroup) %>%
summarize (avg_value = mean (value), .groups = "drop" )
income_obesity
# A tibble: 35 × 3
state subgroup avg_value
<chr> <chr> <dbl>
1 California $15,000 - $24,999 30.2
2 California $25,000 - $34,999 30.8
3 California $35,000 - $49,999 27.2
4 California $50,000 - $74,999 27.5
5 California $75,000 or greater 23.1
6 California Data not reported 21.5
7 California Less than $15,000 32.0
8 Florida $15,000 - $24,999 32.2
9 Florida $25,000 - $34,999 30.6
10 Florida $35,000 - $49,999 29.2
# ℹ 25 more rows
ggplot (income_obesity, aes (x = subgroup, y = avg_value, fill = state)) +
geom_bar (stat = "identity" , position = "dodge" ) +
scale_fill_manual (values = c ("California" = "red" ,
"Texas" = "blue" ,
"Florida" = "green" ,
"New York" = "purple" ,
"Illinois" = "orange" )) +
theme_minimal () +
theme (axis.text.x = element_text (angle = 45 , hjust = 1 )) +
labs (x = "Income Group" ,
y = "Percent with Obesity" ,
fill = "State" ,
title = "Obesity Rate by Income Group in 5 States" ,
caption = "Source: CDC BRFSS" )
library (treemap)
category_counts <- obesity_data %>%
group_by (category) %>%
summarize (count = n ())
category_counts
# A tibble: 6 × 2
category count
<chr> <int>
1 Age (years) 4254
2 Education 2836
3 Income 4962
4 Race/Ethnicity 3865
5 Sex 1418
6 Total 709
treemap (category_counts,
index = "category" ,
vSize = "count" ,
type = "index" ,
title = "Number of Obesity Records by Stratification Category" )
Conclusion
How I cleaned the data
Firstly, I shortened the column names from “Age(years)” and such to just age so I wouldn’t have to constantly write those weird character names. Then I dropped all the NAs because I didn’t need data that doesn’t contribute to averages and visuals. In addition, I filtered out national and territories like Puerto Rico and Guam to only have states. With such an extensive dataset with many different survey questions, it made sense to use grepl(“obesity”) and grepl(“150 minutes”) rather than write out the full text of the questions, as that was prone to error. To be able to visualize the relationship between physical activity and obesity, I needed to group_by state, then summarize to give me one average of obesity and one average of physical activity per state, and then inner join just like in the pfizer and fda merge but using the state name.
What the visualizations show
From the graph above, we can observe that there appears to be a tendency for states with high physical activity rates to experience relatively low obesity levels because of the descending trend in the curve although there is a wide variability between the data points such that physical activity is not the only factor that contributes to obesity rates. In the case of the bar chart comparing the obesity percentage among income groups within five different states, it is clear that in all states considered, the lowest income group exhibits an obesity percentage greater than those of the highest income group, confirming my expectations about obesity before beginning the visualization process. As for the treemap, it provided insights regarding the structure of the dataset, indicating that majority of the obesity records are grouped based on income and ethnicity while there are relatively few records that are divided based on sex and education.