1. Setup and Libraries

# Load R packages
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(stats)
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.5.2

library(knitr)
library(scales)
library(RColorBrewer)
library(choroplethr)

## Warning: package 'choroplethr' was built under R version 4.5.2

library(choroplethrMaps)

## Warning: package 'choroplethrMaps' was built under R version 4.5.2

library(shiny)
library(shinydashboard)

## Warning: package 'shinydashboard' was built under R version 4.5.2

## 
## Attaching package: 'shinydashboard'

## The following object is masked from 'package:graphics':
## 
##     box

library(httr)
library(readr)

## 
## Attaching package: 'readr'

## The following object is masked from 'package:scales':
## 
##     col_factor

library(haven)

2. Introduction and Research Focus

2.1. Motivation and Background

Obesity is a major public health crisis in the United States, driving significant healthcare costs and reducing life expectancy. Nearly 42% of U.S. adults are classified as obese. This issue is not uniformly distributed across the nation; there are vast disparities in prevalence rates among different states and demographic groups. Understanding the complex relationship between socioeconomic status and lifestyle behaviors is critical for developing targeted and effective public health interventions. This analysis seeks to move beyond simple correlation by using robust statistical modeling to identify the most significant, quantifiable drivers of this health disparity.

2.2. Specific Research Question

How effectively do specific lifestyle behaviors (Exercise Prevalence, Smoking Prevalence) and key socioeconomic factors (Median Income, Higher Education Rate) collectively predict the variation in Adult Obesity Prevalence across all U.S. states?

2.3. Scope of Analysis

The scope of this project is strictly limited to state-level aggregate data for all 50 U.S. states plus the District of Columbia. The analysis uses data primarily from the 2020 Behavioral Risk Factor Surveillance System (BRFSS) for health metrics and the 2016-2020 American Community Survey (ACS) 5-Year Estimates for socioeconomic metrics. The focus is on establishing correlation and prediction, not establishing direct causation.

2.4. Key Variables Defined

The variables used were standardized and joined using FIPS codes into a single dataset.

Definition of Key Variables Used in the Model
Variable Category	Variable Name	Definition
Response Variable (Y)	Adult Obesity Prevalence	Percentage of adults classified as obese in each state.
Main Predictors (X1, X2)	Exercise Prevalence	Percentage of adults reporting meeting federal guidelines for aerobic physical activity.
Main Predictors (X1, X2)	Smoking Prevalence	Percentage of adult current smokers.
Socioeconomic Predictors (X3, X4)	Median State Income	Median household income (in USD) for the state.
Socioeconomic Predictors (X3, X4)	Education (Bachelors+)	Percentage of adults aged 25+ with a Bachelor’s degree or higher.

3. Phase O – Data Collection

3.1. Source 1: BRFSS (CSV File)

# Loading the BRFSS CSV file

brfssraw <- read_xpt("LLCP2024.XPT")

colnames(brfssraw)

3.2. Source 2: U.S. Census API

We use the Census Bureau API to get demographic data for all states. We will get the Median Household Income, the percentage of adults with a Bachelor’s Degree or higher, and the State Name. We need an API key to access this information.

census_key <- "2fd2b4e5729340013219c170942cdcb6eb1a0803"

census_vars <- c("DP03_0062E", "DP02_0068PE", "NAME") 
base_url <- "https://api.census.gov/data/2023/acs/acs1/profile"

census_request <- paste0(
  base_url,
  "?get=", paste(census_vars, collapse = ","),
  "&for=state:*",
  "&key=", census_key)

response <- GET(census_request)

if (status_code(response) == 200) {
  census_data_raw <- content(response, "parsed")
  
  header <- unlist(census_data_raw[[1]])
  data_rows <- census_data_raw[-1]
  
  census_df <- as.data.frame(do.call(rbind, data_rows), stringsAsFactors = FALSE)
  colnames(census_df) <- header
  
  print("Census Data Structure (Raw):")
  head(census_df)
  
} else {
  print(paste("Error getting Census data. Status Code:", status_code(response)))
  print(content(response, "text"))
  census_df <- NULL}

## [1] "Census Data Structure (Raw):"

##   DP03_0062E DP02_0068PE       NAME state
## 1      62212        28.9    Alabama    01
## 2      86631        32.2     Alaska    02
## 3      77315        33.5    Arizona    04
## 4      58700        26.2   Arkansas    05
## 5      95521        37.5 California    06
## 6      92911        46.4   Colorado    08

View(census_df)

4. Phase S – Data Cleaning and Preparation

4.1. Creating a Reproducible and clean BRFSS Subset

We cannot use the full BRFSS file because it is too big. To make the project easy to share and reproducible on GitHub, we select only the columns we need for the analysis and the survey weight variable (_LLCPWT), and we remove missing data.

brfss_clean_subset <- brfssraw %>%
  select(
    State_FIPS = `_STATE`,
    Obesity_Status = `_RFBMI5`,       # 1=Obese, 2=Not Obese
    Exercise_Any = EXERANY2,          # 1=Yes, 2=No
    Smoking_Status = `_SMOKER3`,      # 1=Current, 2=Former, 3=Never
    Alcohol_Per_Day = DROCDY4_,       
    Weight = `_LLCPWT`                # Survey weight
  ) %>%
  filter(Obesity_Status %in% c(1, 2),
    Exercise_Any %in% c(1, 2),
    Smoking_Status %in% c(1, 2, 3),
    !Alcohol_Per_Day %in% c(777, 888, 999) & !is.na(Alcohol_Per_Day),
    !is.na(Weight))

str(brfss_clean_subset)
head(brfss_clean_subset)

write_csv(brfss_clean_subset, "brfss_clean_subset.csv")

After we cleaned the original data set, we created a subset csv file and uploaded into Github. Finally we read that file which is the one we are going to work with.

# Save and reload for reproducibility

brfss_clean <- read_csv("https://raw.githubusercontent.com/arutam-antunish/DATA607/refs/heads/main/brfss_clean_subset.csv")

## Rows: 158410 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): State_FIPS, Obesity_Status, Exercise_Any, Smoking_Status, Alcohol_P...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

4.2. Calculating State-Level Indicators from BRFSS

In this step, we calculate the final health statistics for each state. We group all survey responses by their State FIPS code. Within each state, we calculate the Prevalence Rate (expressed as a percentage, or per 100 inhabitants) for Obesity, Exercise, and Smoking. This creates the clean, state-level table (brfss_state_calc) needed for the final join.

# Convert the binary/categorical variables into 0/1 for easier calculation
brfss_state_calc <- brfss_clean %>%
  mutate(
    # Obesity Prevalence: 1 if Obese (code 1), 0 otherwise
    is_obese = if_else(Obesity_Status == 1, 1, 0),
    # Physical Activity Prevalence: 1 if exercised (code 1), 0 otherwise
    is_exercising = if_else(Exercise_Any == 1, 1, 0),
    # Current Smoker Prevalence: 1 if current smoker (code 1), 0 otherwise
    is_smoker = if_else(Smoking_Status == 1, 1, 0)
  ) %>%
  # Group by State FIPS code
  group_by(State_FIPS) %>%
  summarise(
    # Calculate weighted percentages for each indicator
    Obesity_Prev = weighted.mean(is_obese, w = Weight) * 100,
    Exercise_Prev = weighted.mean(is_exercising, w = Weight) * 100,
    Smoking_Prev = weighted.mean(is_smoker, w = Weight) * 100,
    # Calculate the weighted mean for alcohol consumption
    Avg_Alcohol_Drinks_Day = weighted.mean(Alcohol_Per_Day, w = Weight),
    # Count how many records we used for this state
    N_count = n()
  ) %>%
  ungroup()

View(brfss_state_calc)

4.3. Cleaning and Formatting Census API Data

The Census API data needs to be cleaned so we can join it with the BRFSS data. We rename the coded columns to clear names and convert the income and education columns from text to numeric.

State_FIPS_Vector <- unlist(census_df$state)
Median_Income_Vector_raw <- as.numeric(unlist(census_df$DP03_0062E))
Education_Vector_raw <- as.numeric(unlist(census_df$DP02_0068PE))

if (length(Median_Income_Vector_raw) == 51) {
  Median_Income_Vector <- c(Median_Income_Vector_raw, NA)
} else {
  Median_Income_Vector <- Median_Income_Vector_raw
}

if (length(Education_Vector_raw) == 51) {
  Education_Vector <- c(Education_Vector_raw, NA)
} else {
  Education_Vector <- Education_Vector_raw
}

census_temp <- data.frame(
  State_FIPS = State_FIPS_Vector,
  Median_Income = Median_Income_Vector,
  Education_Bachelors_Plus = Education_Vector,
  stringsAsFactors = FALSE
) %>%
  filter(State_FIPS != "72" & State_FIPS != "0") %>%
  filter(!is.na(Median_Income) & !is.na(Education_Bachelors_Plus))


census_clean <- census_df %>%
  select(State_Name = NAME, State_FIPS = state) %>%
  mutate(State_FIPS = unlist(State_FIPS)) %>%
  filter(State_Name != "NAME") %>%
  filter(State_FIPS != "72") %>%
  inner_join(census_temp, by = "State_FIPS")

4.4. Joining BRFSS and Census Data

Now that both datasets (brfss_state_calc and census_clean) are clean and have the same state codes, we join them.

We use an inner_join based on the state code (State_FIPS). This ensures that only the 51 areas (50 states + D.C.) present in both datasets are kept in the final table.

The final table, final_data, will have all the health and socioeconomic variables we need for the next phase.

census_clean_final <- census_clean %>%
  mutate(State_FIPS = stringr::str_pad(State_FIPS, width = 2, side = "left", pad = "0")) %>%
  select(State_Name, State_FIPS, Median_Income, Education_Bachelors_Plus)

brfss_state_clean_final <- brfss_state_calc %>%
  mutate(State_FIPS = as.character(State_FIPS)) %>%
  mutate(State_FIPS = stringr::str_pad(State_FIPS, width = 2, side = "left", pad = "0")) %>%
  filter(!(State_FIPS %in% c("66", "72", "78")))


final_data <- brfss_state_clean_final %>%
  inner_join(census_clean_final, by = "State_FIPS") %>%
  select(
    State_Name,
    State_FIPS,
    Obesity_Prev,
    Exercise_Prev,
    Smoking_Prev,
    Avg_Alcohol_Drinks_Day,
    Median_Income,
    Education_Bachelors_Plus
  ) %>%
  arrange(State_Name)

View(final_data)
str(final_data)

## tibble [50 × 8] (S3: tbl_df/tbl/data.frame)
##  $ State_Name              :List of 50
##   ..$ : chr "Alabama"
##   ..$ : chr "Alaska"
##   ..$ : chr "Arizona"
##   ..$ : chr "Arkansas"
##   ..$ : chr "California"
##   ..$ : chr "Colorado"
##   ..$ : chr "Connecticut"
##   ..$ : chr "Delaware"
##   ..$ : chr "District of Columbia"
##   ..$ : chr "Florida"
##   ..$ : chr "Georgia"
##   ..$ : chr "Hawaii"
##   ..$ : chr "Idaho"
##   ..$ : chr "Illinois"
##   ..$ : chr "Indiana"
##   ..$ : chr "Iowa"
##   ..$ : chr "Kansas"
##   ..$ : chr "Kentucky"
##   ..$ : chr "Louisiana"
##   ..$ : chr "Maine"
##   ..$ : chr "Maryland"
##   ..$ : chr "Massachusetts"
##   ..$ : chr "Michigan"
##   ..$ : chr "Minnesota"
##   ..$ : chr "Mississippi"
##   ..$ : chr "Missouri"
##   ..$ : chr "Montana"
##   ..$ : chr "Nebraska"
##   ..$ : chr "Nevada"
##   ..$ : chr "New Hampshire"
##   ..$ : chr "New Jersey"
##   ..$ : chr "New Mexico"
##   ..$ : chr "New York"
##   ..$ : chr "North Carolina"
##   ..$ : chr "North Dakota"
##   ..$ : chr "Ohio"
##   ..$ : chr "Oklahoma"
##   ..$ : chr "Oregon"
##   ..$ : chr "Pennsylvania"
##   ..$ : chr "Rhode Island"
##   ..$ : chr "South Carolina"
##   ..$ : chr "South Dakota"
##   ..$ : chr "Texas"
##   ..$ : chr "Utah"
##   ..$ : chr "Vermont"
##   ..$ : chr "Virginia"
##   ..$ : chr "Washington"
##   ..$ : chr "West Virginia"
##   ..$ : chr "Wisconsin"
##   ..$ : chr "Wyoming"
##  $ State_FIPS              : chr [1:50] "01" "02" "04" "05" ...
##  $ Obesity_Prev            : num [1:50] 29.2 28.4 31.1 27.4 30.3 ...
##  $ Exercise_Prev           : num [1:50] 70.9 78.1 75.1 67.3 76.3 ...
##  $ Smoking_Prev            : num [1:50] 26.3 22.5 21 27.4 15 ...
##  $ Avg_Alcohol_Drinks_Day  : num [1:50] 37.6 50.7 47.7 61.8 48.6 ...
##  $ Median_Income           : num [1:50] 62212 86631 77315 58700 95521 ...
##  $ Education_Bachelors_Plus: num [1:50] 28.9 32.2 33.5 26.2 37.5 46.4 42.9 36.5 65.9 34.9 ...

head(final_data)

## # A tibble: 6 × 8
##   State_Name State_FIPS Obesity_Prev Exercise_Prev Smoking_Prev
##   <list>     <chr>             <dbl>         <dbl>        <dbl>
## 1 <chr [1]>  01                 29.2          70.9         26.3
## 2 <chr [1]>  02                 28.4          78.1         22.5
## 3 <chr [1]>  04                 31.1          75.1         21.0
## 4 <chr [1]>  05                 27.4          67.3         27.4
## 5 <chr [1]>  06                 30.3          76.3         15.0
## 6 <chr [1]>  08                 35.1          82.1         17.1
## # ℹ 3 more variables: Avg_Alcohol_Drinks_Day <dbl>, Median_Income <dbl>,
## #   Education_Bachelors_Plus <dbl>

5. Phase E – Data Exploration and Visualization (Explore)

Show the basic numbers and charts of your data.

5.1. Summary Statistics

We use the summary() function to quickly check the distribution (mean, median, min/max) of all our numerical variables. This step is essential to understand the range and central tendencies of the health and socioeconomic factors.

summary(final_data)

##  State_Name.Length  State_Name.Class  State_Name.Mode  State_FIPS       
##  1          -none-     character                      Length:50         
##  1          -none-     character                      Class :character  
##  1          -none-     character                      Mode  :character  
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##  1          -none-     character                                        
##   Obesity_Prev   Exercise_Prev    Smoking_Prev   Avg_Alcohol_Drinks_Day
##  Min.   :25.34   Min.   :66.24   Min.   :15.02   Min.   :29.60         
##  1st Qu.:27.75   1st Qu.:71.75   1st Qu.:18.16   1st Qu.:43.45         
##  Median :29.00   Median :74.06   Median :20.66   Median :49.78         
##  Mean   :29.43   Mean   :74.00   Mean   :21.27   Mean   :53.78         
##  3rd Qu.:30.42   3rd Qu.:76.44   3rd Qu.:23.96   3rd Qu.:62.07         
##  Max.   :37.34   Max.   :82.11   Max.   :33.25   Max.   :91.23         
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  Median_Income    Education_Bachelors_Plus
##  Min.   : 54203   Min.   :24.00           
##  1st Qu.: 69691   1st Qu.:32.02           
##  Median : 74787   Median :35.10           
##  Mean   : 77788   Mean   :35.84           
##  3rd Qu.: 86245   3rd Qu.:38.85           
##  Max.   :108210   Max.   :65.90           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##                                           
##

5.2. Visualization: Distribution of Key Variables

library(ggplot2)
library(scales) 


Author_Name <- "Arutam Antunish"
Data_Source <- "BRFSS (2020), U.S. Census Bureau (ACS 5-Year Estimates)"
Health_Color <- "#2C3E50" 
Income_Color <- "#16A085"

# Define a consistent theme for the project
project_theme_classic <- theme_classic(base_size = 12) + 
  theme(
    plot.title = element_text(face = "bold", size = 14), # Reduced size
    plot.subtitle = element_text(size = 11, color = "gray30"), # Reduced size
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 9),
    plot.caption = element_text(size = 8, color = "gray50", hjust = 0.0), # Small, left-aligned
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  )

# Histogram 1: Distribution of Obesity Prevalence
ggplot(final_data, aes(x = Obesity_Prev)) +
  geom_histogram(bins = 10, fill = Health_Color, color = "white", alpha = 0.9) +
  labs(
    title = "Distribution of Obesity Prevalence Across 50 U.S. Areas",
    subtitle = "The data is slightly right-skewed, with most states clustering near 28-30%.",
    x = "Obesity Prevalence (%)",
    y = "Number of States/Areas",
    caption = paste("Source: ", Data_Source, "\nAuthor: ", Author_Name)
  ) +
  scale_x_continuous(labels = percent_format(scale = 1)) +
  project_theme_classic

# Histogram 2:Distribution of Median Income
ggplot(final_data, aes(x = Median_Income)) +
  geom_histogram(bins = 10, fill = Income_Color, color = "white", alpha = 0.9) +
  labs(
    title = "Distribution of Median Household Income Across 50 U.S. Areas",
    subtitle = "Income exhibits a distinct wide range and appears slightly bimodal (two peaks).",
    x = "Median Income",
    y = "Number of States/Areas",
    caption = paste("Source: ", Data_Source, "\nAuthor: ", Author_Name)
  ) +
  scale_x_continuous(labels = dollar_format(scale = 1, prefix = "$", big.mark = ",")) +
  project_theme_classic

5.3. Analysis of Data Distribution

Obesity Prevalence

Shape: The distribution is generally right-skewed (a longer tail to the right).

Most states have an obesity prevalence centered between 28% and 30%. There are a few states with very high obesity rates (above 34%), which suggests a strong potential relationship with socioeconomic factors in those specific outliers.

Median Income Shape: The distribution looks somewhat bimodal (two peaks).

A significant number of states cluster in the $70,000 to $80,000 range. However, there is another notable cluster near the $90,000 to $100,000 range, suggesting a clear separation between states with generally lower and generally higher incomes.

5.4. Key Visualizations (Two Variables)

The analysis now moves from exploring single variables (histograms) to testing the correlation between variables, which is the core of our study. We use Scatter Plots coupled with a Linear Regression Line (geom_smooth(method = “lm”)). This approach visualizes the direction (positive or negative) and strength of the linear relationship between the two most critical socioeconomic factors (Income and Education) and the primary health outcome (Obesity Prevalence).

Median Income vs Obesity

The first scatter plot analyzes the relationship between Median Household Income and Obesity Prevalence. The regression line shows a clear negative slope, confirming the expected trend: as income increases, the obesity rate tends to decrease across the states. This negative correlation suggests that socioeconomic prosperity is a factor in better health outcomes.

# Consistent  theme
project_theme_classic <- theme_classic(base_size = 12) + 
  theme(
    plot.title = element_text(face = "bold", size = 14), 
    plot.subtitle = element_text(size = 11, color = "gray30"),
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 9),
    plot.caption = element_text(size = 8, color = "gray50", hjust = 0.0), 
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  )

# Define consistent colors
Health_Color_Dark <- "#2C3E50" 
Income_Color_Teal <- "#16A085" 
Data_Source <- "BRFSS (2020), U.S. Census Bureau (ACS 5-Year Estimates)"
Author_Name <- "Arutam Antunish"


# Scatter Plot 1: Obesity vs. Median Income
ggplot(final_data, aes(x = Median_Income, y = Obesity_Prev)) +
  geom_point(color = Health_Color_Dark, size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", color = Income_Color_Teal, se = TRUE, linetype = "dashed") + 
  labs(
    title = "Analyzing Correlation: Median Income vs. Obesity Prevalence",
    subtitle = "Higher income levels show a tendency towards lower obesity rates (Negative Correlation).",
    x = "Median Household Income",
    y = "Obesity Prevalence (%)",
    caption = paste("Source: ", Data_Source, "\nAuthor: ", Author_Name)
  ) +
  scale_x_continuous(labels = dollar_format(prefix = "$", big.mark = ",")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  project_theme_classic

## `geom_smooth()` using formula = 'y ~ x'

Education level vs. Obesity

The second scatter plot examines the correlation between the Percentage of Population with a Bachelor’s Degree or Higher and Obesity Prevalence. Similar to income, this plot shows a clear negative relationship. The downward slope of the regression line indicates that states with higher educational attainment generally experience lower rates of obesity.

# Scatter Plot 2: Obesity vs. Education 
ggplot(final_data, aes(x = Education_Bachelors_Plus, y = Obesity_Prev)) +
  geom_point(color = Health_Color_Dark, size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", color = Income_Color_Teal, se = TRUE, linetype = "dashed") +
  labs(
    title = "Analyzing Correlation: Education Level vs. Obesity Prevalence",
    subtitle = "Higher percentage of educated residents is associated with lower obesity rates (Negative Correlation).",
    x = "Population with Bachelor's Degree or Higher (%)",
    y = "Obesity Prevalence (%)",
    caption = paste("Source: ", Data_Source, "\nAuthor: ", Author_Name)
  ) +
  scale_x_continuous(labels = percent_format(scale = 1)) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  project_theme_classic

## `geom_smooth()` using formula = 'y ~ x'

5.6. Geographic Visualization

This Choropleth Map is a powerful visualization tool used to identify spatial patterns of Obesity Prevalence across the U.S.

The map reveals a clear regional pattern:

High Obesity Rates (Dark Colors): States with the highest obesity rates (e.g., above 30%) are concentrated in the Southeastern U.S. (‘Obesity Belt’) and parts of the Midwest.

Low Obesity Rates (Light Colors): The lowest rates are found along the West Coast and in some Northeastern states.

library(choroplethr)
library(choroplethrMaps)
library(RColorBrewer)

Data_Source <- "BRFSS (2020), U.S. Census Bureau (ACS 5-Year Estimates)"
Author_Name <- "Arutam Antunish"


project_theme_classic <- theme_classic(base_size = 12) + 
  theme(
    plot.title = element_text(face = "bold", size = 14), 
    plot.subtitle = element_text(size = 11, color = "gray30"),
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 9),
    plot.caption = element_text(size = 8, color = "gray50", hjust = 0.0), 
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  )


map_data <- final_data %>%
  mutate(region = tolower(State_Name), 
         value = Obesity_Prev) %>%
  select(region, value) %>%
  filter(region != "district of columbia") 


state_choropleth(
  map_data, 
  title = "Obesity Prevalence Across the United States",
  legend = "Obesity (%)",
  num_colors = 7
) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(7, "YlOrRd")) +
  labs(
    caption = paste("Source: ", Data_Source, "\nAuthor: ", Author_Name)
  ) +
  project_theme_classic

## geoid_type = 'auto'; the geoid region was determined to be of type: name.lower. To see the list of allowed geoids, see choroplethr::state.regions.

## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

6. Phase M - Modelling

6.1.Simple Linear Regression (Income vs. Obesity)

This is the start of the Modeling Phase. We use Simple Linear Regression to build a formal predictive model.

Our goal is to quantify the relationship between our two main variables: Median Household Income (predictor variable, $X$) and Obesity Prevalence (response variable, $Y$).

The model determines the slope (the change in Obesity for every $1 change in Income) and the strength of this relationship, which is necessary for statistical inference.

# Build the linear model 
model_income <- lm(Obesity_Prev ~ Median_Income, data = final_data)


summary(model_income)

## 
## Call:
## lm(formula = Obesity_Prev ~ Median_Income, data = final_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9024 -1.9545 -0.2932  1.5340  5.4368 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.311e+01  2.001e+00  11.550 1.84e-15 ***
## Median_Income 8.124e-05  2.538e-05   3.201  0.00243 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.301 on 48 degrees of freedom
## Multiple R-squared:  0.1759, Adjusted R-squared:  0.1588 
## F-statistic: 10.25 on 1 and 48 DF,  p-value: 0.002429

6.2.Model Interpretation and Inference

This step uses the model’s summary statistics to perform Statistical Inference. We check the statistical significance ($p$-value) and the explanatory power ($R^2$) of the Median Income variable.

P-Value (Income):0.00243

Since the $p$-value is much less than 0.05, we reject the null hypothesis ($H_0$). We conclude that Median Income is a statistically significant predictor of Obesity Prevalence.

Slope ($\beta_1$): $8.124 \times 10^{-5}$ The slope is positive. According to this specific sample, a higher income is associated with a slightly higher obesity rate.

R-squared ($R^2$): 0.1759 (17.59%)

Only 17.59% of the variation in Obesity Prevalence is explained by the Income variable. The model has low predictive power, meaning other factors are much more important.

6.3.Model Diagnostics (Residuals)

Before we trust the model’s conclusions, we must check the assumptions of Simple Linear Regression by analyzing the Residuals (the error terms).

If the assumptions are violated, the $p$-values and the $R^2$ might be unreliable.

Normal Q-Q Plot: Checks if the errors are normally distributed.

The points mostly follow the dotted line. This suggests the normality assumption is generally met.

# Set up the plotting area for two graphs
par(mfrow = c(1, 2))

# 1. QQ-Plot (Check for Normality)
plot(model_income, which = 2, main = "Normal Q-Q Plot of Residuals")

Residuals vs. Fitted Plot: Checks if the error variance is constant

The points appear randomly scattered around the horizontal line (0). The red line is mostly flat. This suggests the homoscedasticity assumption is met.

# 2. Residuals vs. Fitted (Check for Homoscedasticity)
plot(model_income, which = 1, main = "Residuals vs. Fitted Values")

# Reset the plotting area
par(mfrow = c(1, 1))

The diagnostic checks show no major violations of the linear regression assumptions. Therefore, the statistical inferences made in Section 6.2 (that Income is a significant predictor despite the low $R^2$) are reliable.

6.4. Multiple Linear Regression

Since the Simple Model (Income only) had low predictive power ($R^2=0.176$), we now build a Multiple Linear Regression (MLR) model.

This complex model includes all key health and socioeconomic variables (Income, Education, Exercise, and Smoking) to see if we can explain a much higher percentage of the variation in Obesity Prevalence.

#MLR model including all key predictors: Obesity_Prev ~ Median_Income + Education_Bachelors_Plus + Exercise_Prev + Smoking_Prev

model_multiple <- lm(Obesity_Prev ~ Median_Income + Education_Bachelors_Plus + Exercise_Prev + Smoking_Prev, data = final_data)

summary(model_multiple)

## 
## Call:
## lm(formula = Obesity_Prev ~ Median_Income + Education_Bachelors_Plus + 
##     Exercise_Prev + Smoking_Prev, data = final_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4533 -1.5490  0.0121  1.2120  5.5060 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)  
## (Intercept)               1.156e+01  1.064e+01   1.087   0.2830  
## Median_Income            -5.726e-05  5.981e-05  -0.957   0.3435  
## Education_Bachelors_Plus  1.691e-01  8.748e-02   1.933   0.0595 .
## Exercise_Prev             2.322e-01  1.268e-01   1.832   0.0736 .
## Smoking_Prev             -4.361e-02  1.497e-01  -0.291   0.7721  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.14 on 45 degrees of freedom
## Multiple R-squared:  0.3321, Adjusted R-squared:  0.2727 
## F-statistic: 5.593 on 4 and 45 DF,  p-value: 0.0009644

The Multiple Linear Regression (MLR) model is the best-fitting model for our data. We compare its predictive power ($R^2$) to the Simple Model and identify the statistically significant factors by reviewing the P-Values and the direction of the slopes.

Comparison of Predictive Power

The table below shows the improvement in predictive power. The MLR model explains a significantly higher portion of the variation in Obesity than the Simple Model (Income only).

Model Predictive Power Comparison (Adjusted R-squared)
Model	Adjusted.R.squared	Interpretation
Simple Model (Income Only)	15.88%	Low predictive power.
MLR Model (All Predictors)	27.27%	Explains over a quarter of the variation, confirming better fit.

2. Analysis of MLR Coefficients

The table below summarizes the key findings from the MLR model’s coefficients. We identify the strongest predictors by looking for low P-Values ($< 0.1$).

Multiple Linear Regression Model (MLR)
Predictor	P-Value	Significance	Slope (Direction)	Key Finding
Median Income	0.3435	Not Significant	Negative	Income is not a strong predictor when other factors are included.
Education (Bachelors+)	0.0595 (Marginal)	Significant	Positive (+0.169)	A strong predictor, showing a surprising positive association.
Exercise Prevalence	0.0736 (Marginal)	Significant	Positive (+0.232)	A strong predictor, showing a surprising positive association.
Smoking Prevalence	0.7721	Not Significant	Negative	Smoking does not significantly predict Obesity in this model.

Summary of MLR: The most complex and best-fitting model indicates that Education and Exercise Prevalence are the strongest factors predicting differences in Obesity, while Income is not a significant factor.

#7. Phase N – Interpretation and Conclusion

7.1. Key Findings

Best Predictors: The Multiple Linear Regression (MLR) model showed that Education and Exercise Prevalence are the most statistically significant predictors of Obesity Prevalence.

Income Role: Median Income, initially significant in the Simple Model, became non-significant in the best model (MLR). This means income does not explain much variation in obesity when education and exercise are considered.

Low Predictive Power: Even the best model (MLR) only explained 27.27% of the variation. This indicates that most of the factors driving obesity differences are not included in the model (e.g., genetic, cultural, food environment).

Unexpected Association: Both Education (higher rates of college degrees) and Exercise showed a positive slope in the MLR model. This is counter intuitive and suggests a confounding variable or regional reporting issues influencing the results.

7.2. Regional Meaning

Geographic Clues: The Choropleth Map confirmed a strong regional pattern. States with the highest obesity rates are clustered in the Southeastern U.S. (Obesity Belt).

Model Errors: The regional pattern suggests that factors shared by the Southeast (e.g., climate, local cuisine, access to fresh food) are driving the high obesity rates, and these are not captured by our socioeconomic variables alone. The model tends to under-predict obesity in these high-rate states.

7.3. Project Limits and Challenges

Data Integration: Merging the diverse BRFSS (health) and ACS (socioeconomic) datasets was challenging, requiring careful use of FIPS codes to join data accurately at the state level.

Missing Data: Handling missing values in the raw BRFSS data (e.g., “Don’t Know” or “Refused”) was required to ensure accurate prevalence calculations.

Causation vs. Correlation: The project establishes correlation (relationship), but due to the observational nature of the data, it cannot prove causation.

8. R Shiny Dashboard (Interactive Feature)

We use the R Shiny package to build an interactive web dashboard. This allows us to reuse existing R code (maps and models) to visualize the final project results interactively.

8.1. Dashboard Setup and Code

The Shiny application is defined by the User Interface (ui) and the Server Logic (server). This structure displays the map and the MLR conclusions.

library(shiny)
library(shinydashboard)
library(ggplot2)
library(dplyr)
library(knitr)
library(choroplethr)
library(RColorBrewer)
library(choroplethrMaps)
library(scales) 

max_obesity <- max(final_data$Obesity_Prev)
adj_r_squared_mlr <- round(summary(model_multiple)$adj.r.squared, 3)
sig_predictor <- "Education & Exercise" 

pastel_theme <- theme_minimal(base_size = 12) +
    theme(
        plot.title = element_text(face = "bold", size = 14, color = "#4A4A4A"),
        plot.subtitle = element_text(size = 11, color = "gray50"),
        legend.position = "bottom",
        plot.background = element_rect(fill = "white", color = NA),
        panel.background = element_rect(fill = "#FAFAFA", color = NA)
    )

map_data <- final_data %>%
  mutate(region = tolower(State_Name), 
         value = Obesity_Prev) %>%
  select(region, value) %>%
  filter(region != "district of columbia") 

mlr_conclusion_data <- data.frame(
    Predictor = c("Median Income", "Education (Bachelors+)", "Exercise Prevalence", "Smoking Prevalence"),
    P.Value = c("0.3435", "0.0595 (Marginal)", "0.0736 (Marginal)", "0.7721"),
    Significance = c("Not Significant", "Significant", "Significant", "Not Significant"),
    Slope.Direction = c("Negative", "Positive (+0.169)", "Positive (+0.232)", "Negative")
)

ui <- dashboardPage(
    dashboardHeader(title = "Obesity: Socioeconomic Drivers", titleWidth = 350),
    
    dashboardSidebar(
        sidebarMenu(
            menuItem("Dashboard", tabName = "dashboard_tab", icon = icon("chart-line"))
        )
    ),
    
    dashboardBody(
        tabItems(
            tabItem(tabName = "dashboard_tab",
                    fluidRow(
                        valueBox(paste0(round(adj_r_squared_mlr * 100, 1), "%"), "Model Explained Variance (Adj R²)", 
                                 icon = icon("chart-area"), color = "teal", width = 4),
                        valueBox(paste0(round(max_obesity, 1), "%"), "Highest State Obesity Rate", 
                                 icon = icon("heartbeat"), color = "red", width = 4),
                        valueBox(sig_predictor, "Key Significant Predictors (P < 0.1)", 
                                 icon = icon("lightbulb"), color = "olive", width = 4)
                    ),
                    
                    fluidRow(
                        box(title = "1. Multiple Regression Model Conclusions", 
                            status = "warning", solidHeader = TRUE, width = 12,
                            tableOutput("mlr_table"))
                    ),
                    
                    fluidRow(
                        box(title = "2. Regional Prevalence Map", 
                            status = "primary", solidHeader = TRUE, width = 4, height = 450,
                            plotOutput("obesity_map", height = 400)),
                        
                        box(title = "3. Obesity vs. Education (Unexpected Positive Slope)", 
                            status = "info", solidHeader = TRUE, width = 4, height = 450,
                            plotOutput("education_scatter", height = 400)),
                        
                        box(title = "4. Obesity vs. Exercise (Unexpected Positive Slope)", 
                            status = "success", solidHeader = TRUE, width = 4, height = 450,
                            plotOutput("exercise_scatter", height = 400))
                    )
            )
        )
    )
)

server <- function(input, output) {
    
    output$obesity_map <- renderPlot({
        state_choropleth(
          map_data, 
          title = "Obesity Prevalence Across the United States",
          legend = "Obesity (%)",
          num_colors = 7
        ) +
          scale_fill_manual(values = RColorBrewer::brewer.pal(7, "PuRd")) + 
          labs(caption = "") +
          pastel_theme
    })
    
    output$education_scatter <- renderPlot({
        ggplot(final_data, aes(x = Education_Bachelors_Plus, y = Obesity_Prev)) +
            geom_point(color = "#778899", size = 2) + 
            geom_smooth(method = "lm", color = "#B0C4DE", fill = "#E6E6FA", alpha = 0.3) + 
            labs(x = "College Degree Population (%)",
                 y = "Obesity Prevalence (%)",
                 title = "Obesity vs. Education") +
            scale_y_continuous(labels = percent_format(scale = 1)) +
            scale_x_continuous(labels = percent_format(scale = 1)) +
            pastel_theme
    })
    
    output$exercise_scatter <- renderPlot({
        ggplot(final_data, aes(x = Exercise_Prev, y = Obesity_Prev)) +
            geom_point(color = "#9966CC", size = 2) + 
            geom_smooth(method = "lm", color = "#DDA0DD", fill = "#F0E6FA", alpha = 0.3) + 
            labs(x = "Exercise Prevalence (%)",
                 y = "Obesity Prevalence (%)",
                 title = "Obesity vs. Exercise") +
            scale_y_continuous(labels = percent_format(scale = 1)) +
            scale_x_continuous(labels = percent_format(scale = 1)) +
            pastel_theme
    })
    
    output$mlr_table <- renderTable({
        return(mlr_conclusion_data)
    }, bordered = TRUE, rownames = FALSE, caption = "Key Findings from Multiple Linear Regression (P < 0.1)")
}

shinyApp(ui = ui, server = server)

Shiny applications not supported in static R Markdown documents

9. References and Appendices

This section lists all external sources and software used to conduct the data cleaning, modeling, and visualization for this project. Proper citation ensures the reproducibility and credibility of the analysis.

References:

Data Source (Health): Centers for Disease Control and Prevention (CDC). Behavioral Risk Factor Surveillance System (BRFSS), 2020 Survey Data.

Data Source (Socioeconomic): U.S. Census Bureau. American Community Survey (ACS) 5-Year Estimates, 2016-2020.

Software/Methodology:

R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Wickham et al. (2023). ggplot2, dplyr, tidyr (R packages used for data manipulation and visualization).

Chang et al. (2024). shiny and shinydashboard (R packages used for interactive dashboard creation).

Becker, L. (2023). choroplethr (R package used for state-level mapping).

Public Health and Lifestyle Factors Related to Obesity in the United States

Arutam Antunish

2025-12-12