BRFSS Health Risk Factors Analysis

Author

Rahwa Hagos

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(broom)      # for tidy regression output
library(ggplot2)    # for visualization

setwd("/Users/rahwahagos/Downloads/Hate_Crimes")

brfss_raw <- read_csv("Nutrition__Physical_Activity__and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System.csv")
Rows: 110880 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (26): LocationAbbr, LocationDesc, Datasource, Class, Topic, Question, Da...
dbl  (7): YearStart, YearEnd, Data_Value, Data_Value_Alt, Low_Confidence_Lim...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(brfss_raw)
Rows: 110,880
Columns: 33
$ YearStart                  <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2…
$ YearEnd                    <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2…
$ LocationAbbr               <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ LocationDesc               <chr> "Alabama", "Alabama", "Alabama", "Alabama",…
$ Datasource                 <chr> "Behavioral Risk Factor Surveillance System…
$ Class                      <chr> "Obesity / Weight Status", "Obesity / Weigh…
$ Topic                      <chr> "Obesity / Weight Status", "Obesity / Weigh…
$ Question                   <chr> "Percent of adults aged 18 years and older …
$ Data_Value_Unit            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Data_Value_Type            <chr> "Value", "Value", "Value", "Value", "Value"…
$ Data_Value                 <dbl> 34.8, 35.8, 32.3, 34.1, 28.8, 16.3, 27.8, 3…
$ Data_Value_Alt             <dbl> 34.8, 35.8, 32.3, 34.1, 28.8, 16.3, 27.8, 3…
$ Data_Value_Footnote_Symbol <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Data_Value_Footnote        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Low_Confidence_Limit       <dbl> 31.3, 31.1, 28.0, 29.7, 25.4, 12.6, 14.4, 3…
$ High_Confidence_Limit      <dbl> 38.5, 40.8, 36.8, 38.8, 32.5, 20.9, 46.9, 4…
$ Sample_Size                <dbl> 1367, 757, 861, 785, 1125, 356, 58, 598, 86…
$ Total                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `Age(years)`               <chr> NA, NA, NA, NA, NA, "18 - 24", NA, "25 - 34…
$ Education                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Sex                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Income                     <chr> "$15,000 - $24,999", "$25,000 - $34,999", "…
$ `Race/Ethnicity`           <chr> NA, NA, NA, NA, NA, NA, "2 or more races", …
$ GeoLocation                <chr> "(32.840571122, -86.631860762)", "(32.84057…
$ ClassID                    <chr> "OWS", "OWS", "OWS", "OWS", "OWS", "OWS", "…
$ TopicID                    <chr> "OWS1", "OWS1", "OWS1", "OWS1", "OWS1", "OW…
$ QuestionID                 <chr> "Q036", "Q036", "Q036", "Q036", "Q036", "Q0…
$ DataValueTypeID            <chr> "VALUE", "VALUE", "VALUE", "VALUE", "VALUE"…
$ LocationID                 <chr> "01", "01", "01", "01", "01", "01", "01", "…
$ StratificationCategory1    <chr> "Income", "Income", "Income", "Income", "In…
$ Stratification1            <chr> "$15,000 - $24,999", "$25,000 - $34,999", "…
$ StratificationCategoryId1  <chr> "INC", "INC", "INC", "INC", "INC", "AGEYR",…
$ StratificationID1          <chr> "INC1525", "INC2535", "INC3550", "INC5075",…
names(brfss_raw) <- tolower(names(brfss_raw))
names(brfss_raw) <- gsub(" ", "", names(brfss_raw))
names(brfss_raw) <- gsub("_", "", names(brfss_raw))

names(brfss_raw)
 [1] "yearstart"                 "yearend"                  
 [3] "locationabbr"              "locationdesc"             
 [5] "datasource"                "class"                    
 [7] "topic"                     "question"                 
 [9] "datavalueunit"             "datavaluetype"            
[11] "datavalue"                 "datavaluealt"             
[13] "datavaluefootnotesymbol"   "datavaluefootnote"        
[15] "lowconfidencelimit"        "highconfidencelimit"      
[17] "samplesize"                "total"                    
[19] "age(years)"                "education"                
[21] "sex"                       "income"                   
[23] "race/ethnicity"            "geolocation"              
[25] "classid"                   "topicid"                  
[27] "questionid"                "datavaluetypeid"          
[29] "locationid"                "stratificationcategory1"  
[31] "stratification1"           "stratificationcategoryid1"
[33] "stratificationid1"        
brfss_states <- brfss_raw %>%
  filter(!locationdesc %in% c("National", "Guam", "Puerto Rico", "Virgin Islands"))

brfss_clean <- brfss_states %>%
  filter(!is.na(datavalue))

brfss_clean <- brfss_clean %>%
  filter(question %in% c(
    "Percent of adults aged 18 years and older who have obesity",
    "Percent of adults aged 18 years and older who have an overweight classification",
    "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)"
  ))

brfss_analysis <- brfss_clean %>%
  select(locationdesc, question, datavalue, stratification1, stratificationcategory1)

brfss_analysis <- brfss_analysis %>%
  rename(
    state = locationdesc,
    category = stratificationcategory1,
    subgroup = stratification1,
    value = datavalue,
    health_question = question
  )

summary(brfss_analysis)
    state           health_question        value         subgroup        
 Length:43764       Length:43764       Min.   : 0.90   Length:43764      
 Class :character   Class :character   1st Qu.:30.40   Class :character  
 Mode  :character   Mode  :character   Median :34.90   Mode  :character  
                                       Mean   :36.04                     
                                       3rd Qu.:39.50                     
                                       Max.   :85.30                     
   category        
 Length:43764      
 Class :character  
 Mode  :character  
                   
                   
                   
state_avg <- brfss_clean %>%
  filter(stratification1 == "Total") %>%
  group_by(locationdesc, question) %>%
  summarise(
    avg_value = mean(datavalue, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = question,
    values_from = avg_value
  )

state_avg <- state_avg %>%
  rename(
    state = locationdesc,
    obesity = `Percent of adults aged 18 years and older who have obesity`,
    overweight = `Percent of adults aged 18 years and older who have an overweight classification`,
    physical_activity = `Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)`
  )

state_regression <- state_avg %>%
  drop_na()

head(state_regression)
# A tibble: 6 × 4
  state      overweight obesity physical_activity
  <chr>           <dbl>   <dbl>             <dbl>
1 Alabama          33.8    36.2              46.2
2 Alaska           35.7    31.0              58.7
3 Arizona          35.3    29.7              54.8
4 Arkansas         33.5    36.2              46.2
5 California       35.8    26.2              57.8
6 Colorado         35.7    22.8              61.3
# Research question: Does physical activity predict obesity rates across states?
regression_model <- lm(obesity ~ physical_activity, data = state_regression)

summary(regression_model)

Call:
lm(formula = obesity ~ physical_activity, data = state_regression)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3289 -1.0727  0.2795  1.2925  4.1259 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       63.50204    3.49257  18.182  < 2e-16 ***
physical_activity -0.61932    0.06612  -9.366 1.69e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.203 on 49 degrees of freedom
Multiple R-squared:  0.6416,    Adjusted R-squared:  0.6343 
F-statistic: 87.72 on 1 and 49 DF,  p-value: 1.691e-12
tidy(regression_model)
# A tibble: 2 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)         63.5      3.49       18.2  2.00e-23
2 physical_activity   -0.619    0.0661     -9.37 1.69e-12
regression_coefficients <- tidy(regression_model)
regression_glance <- glance(regression_model)

cat("\n========================================\n")

========================================
cat("REGRESSION EQUATION\n")
REGRESSION EQUATION
cat("========================================\n")
========================================
cat("Obesity_Percent = ", round(coef(regression_model)[1], 2), 
    " + ", round(coef(regression_model)[2], 2), 
    " * Physical_Activity_Percent\n", sep = "")
Obesity_Percent = 63.5 + -0.62 * Physical_Activity_Percent
cat("\n========================================\n")

========================================
cat("KEY STATISTICS\n")
KEY STATISTICS
cat("========================================\n")
========================================
cat("Adjusted R-squared:", round(regression_glance$adj.r.squared, 4), "\n")
Adjusted R-squared: 0.6343 
cat("P-value for Physical Activity coefficient:", round(regression_coefficients$p.value[2], 6), "\n")
P-value for Physical Activity coefficient: 0 
cat("F-statistic p-value:", round(regression_glance$p.value, 6), "\n")
F-statistic p-value: 0 
cat("\n========================================\n")

========================================
cat("INTERPRETATION\n")
INTERPRETATION
cat("========================================\n")
========================================
if(regression_coefficients$p.value[2] < 0.05) {
  cat("The relationship between physical activity and obesity is statistically significant (p < 0.05).\n")
  cat("For every 1% increase in physical activity adherence, obesity decreases by approximately", 
      abs(round(regression_coefficients$estimate[2], 2)), "%.\n")
} else {
  cat("The relationship between physical activity and obesity is NOT statistically significant (p > 0.05).\n")
}
The relationship between physical activity and obesity is statistically significant (p < 0.05).
For every 1% increase in physical activity adherence, obesity decreases by approximately 0.62 %.
if(regression_glance$adj.r.squared > 0.5) {
  cat("The model explains", round(regression_glance$adj.r.squared * 100, 1), 
      "% of the variation in obesity rates, which is a strong fit.\n")
} else if(regression_glance$adj.r.squared > 0.2) {
  cat("The model explains", round(regression_glance$adj.r.squared * 100, 1), 
      "% of the variation in obesity rates, which is a moderate fit.\n")
} else {
  cat("The model explains only", round(regression_glance$adj.r.squared * 100, 1), 
      "% of the variation in obesity rates, suggesting other factors are important.\n")
}
The model explains 63.4 % of the variation in obesity rates, which is a strong fit.
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))

plot(regression_model)

mtext("Regression Diagnostic Plots", side = 3, line = 0, outer = TRUE, cex = 1.2)

par(mfrow = c(1, 1), mar = c(5, 4, 4, 2), oma = c(0, 0, 0, 0))
income_obesity <- brfss_clean %>%
  filter(
    question == "Percent of adults aged 18 years and older who have obesity",
    stratificationcategory1 == "Income",
    stratification1 %in% c("Less than $15,000", "$15,000 - $24,999", 
                           "$25,000 - $34,999", "$35,000 - $49,999", 
                           "$50,000 - $74,999", "$75,000 or greater"),
    !is.na(datavalue)
  )

income_obesity <- income_obesity %>%
  mutate(
    income_level = factor(stratification1, 
                          levels = c("Less than $15,000", "$15,000 - $24,999",
                                     "$25,000 - $34,999", "$35,000 - $49,999",
                                     "$50,000 - $74,999", "$75,000 or greater"))
  )

top_states <- c("California", "Texas", "Florida", "New York", "Illinois")

income_filtered <- income_obesity %>%
  filter(locationdesc %in% top_states)

obesity_plot <- ggplot(income_filtered, aes(x = income_level, y = datavalue, fill = locationdesc)) +
  
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), alpha = 0.85) +
  
  scale_fill_manual(values = c("California" = "#E41A1C",
                               "Texas" = "#377EB8",
                               "Florida" = "#4DAF4A",
                               "New York" = "#984EA3",
                               "Illinois" = "#FF7F00")) +
  
  labs(
    title = "Obesity Rates by Income Level in Top 5 U.S. States (2011)",
    subtitle = "Higher income is consistently associated with lower obesity rates across all states",
    x = "Annual Household Income",
    y = "Obesity Rate (%)",
    fill = "State",
    caption = "Source: CDC Behavioral Risk Factor Surveillance System (BRFSS), 2011"
  ) +
  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
    axis.title = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.text.y = element_text(size = 10),
    legend.position = "bottom",
    legend.title = element_text(face = "bold", size = 11),
    legend.text = element_text(size = 10),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.caption = element_text(hjust = 0, size = 9, color = "gray50")
  ) +
  
  scale_y_continuous(limits = c(0, 45), breaks = seq(0, 45, 5))

print(obesity_plot)
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).

age_obesity <- brfss_clean %>%
  filter(
    question == "Percent of adults aged 18 years and older who have obesity",
    stratificationcategory1 == "Age (years)",
    stratification1 != "Total",
    !is.na(datavalue)
  ) %>%
  group_by(stratification1) %>%
  summarise(
    avg_obesity = mean(datavalue, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_obesity))

cat("\n=== Obesity Rates by Age Group (National Average) ===\n")

=== Obesity Rates by Age Group (National Average) ===
print(age_obesity)
# A tibble: 6 × 2
  stratification1 avg_obesity
  <chr>                 <dbl>
1 45 - 54                36.8
2 55 - 64                35.6
3 35 - 44                34.6
4 25 - 34                29.8
5 65 or older            28.6
6 18 - 24                18.5
race_obesity <- brfss_clean %>%
  filter(
    question == "Percent of adults aged 18 years and older who have obesity",
    stratificationcategory1 == "Race/Ethnicity",
    stratification1 != "Total",
    !is.na(datavalue)
  ) %>%
  group_by(stratification1) %>%
  summarise(
    avg_obesity = mean(datavalue, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_obesity))

cat("\n=== Obesity Rates by Race/Ethnicity (National Average) ===\n")

=== Obesity Rates by Race/Ethnicity (National Average) ===
print(race_obesity)
# A tibble: 8 × 2
  stratification1               avg_obesity
  <chr>                               <dbl>
1 Hawaiian/Pacific Islander            42.5
2 Non-Hispanic Black                   38.7
3 American Indian/Alaska Native        37.5
4 Hispanic                             32.8
5 2 or more races                      32.5
6 Non-Hispanic White                   29.4
7 Other                                28.2
8 Asian                                11.5
edu_obesity <- brfss_clean %>%
  filter(
    question == "Percent of adults aged 18 years and older who have obesity",
    stratificationcategory1 == "Education",
    stratification1 != "Total",
    !is.na(datavalue)
  ) %>%
  group_by(stratification1) %>%
  summarise(
    avg_obesity = mean(datavalue, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(avg_obesity)

cat("\n=== Obesity Rates by Education Level (National Average) ===\n")

=== Obesity Rates by Education Level (National Average) ===
print(edu_obesity)
# A tibble: 4 × 2
  stratification1                  avg_obesity
  <chr>                                  <dbl>
1 College graduate                        25.3
2 Some college or technical school        32.8
3 High school graduate                    33.1
4 Less than high school                   34.7

Data Cleaning Process

I loaded the BRFSS dataset (110,880 rows, 33 columns) and standardized column names using tolower() and gsub() to remove spaces and underscores. I filtered out non-state locations to focus only on U.S. states and removed rows with missing values. I then filtered for three specific health questions (obesity, overweight, and physical activity) and renamed columns for clarity. For the regression analysis, I calculated state-level averages by filtering for the “Total” subgroup.

Visualization Analysis

The bar chart shows obesity rates by income level across five major states. A clear pattern emerges: higher income is associated with lower obesity ratesacross all states. For example, in California, obesity drops from approximately 34% in the lowest income group to 20% in the highest. The regression analysis supports this, showing physical activity strongly predicts obesity (p < 0.001, adjusted R² = 0.634). Exploratory analysis also revealed that obesity peaks in middle age (45-54), varies significantly by race/ethnicity, and decreases with higher education.

Challenges

The main challenge was handling the complex stratification structure of the BRFSS data, which required careful filtering to extract the right subgroups for each analysis. I also needed to order income levels as a factor using factor() with specified levels to display them meaningfully from lowest to highest income.