# Load R packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(stats)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
library(knitr)
library(scales)
library(RColorBrewer)
library(choroplethr)
## Warning: package 'choroplethr' was built under R version 4.5.2
library(choroplethrMaps)
## Warning: package 'choroplethrMaps' was built under R version 4.5.2
library(shiny)
library(shinydashboard)
## Warning: package 'shinydashboard' was built under R version 4.5.2
##
## Attaching package: 'shinydashboard'
## The following object is masked from 'package:graphics':
##
## box
library(httr)
library(readr)
##
## Attaching package: 'readr'
## The following object is masked from 'package:scales':
##
## col_factor
library(haven)
Obesity is a major public health crisis in the United States, driving significant healthcare costs and reducing life expectancy. Nearly 42% of U.S. adults are classified as obese. This issue is not uniformly distributed across the nation; there are vast disparities in prevalence rates among different states and demographic groups. Understanding the complex relationship between socioeconomic status and lifestyle behaviors is critical for developing targeted and effective public health interventions. This analysis seeks to move beyond simple correlation by using robust statistical modeling to identify the most significant, quantifiable drivers of this health disparity.
How effectively do specific lifestyle behaviors (Exercise Prevalence, Smoking Prevalence) and key socioeconomic factors (Median Income, Higher Education Rate) collectively predict the variation in Adult Obesity Prevalence across all U.S. states?
The scope of this project is strictly limited to state-level aggregate data for all 50 U.S. states plus the District of Columbia. The analysis uses data primarily from the 2020 Behavioral Risk Factor Surveillance System (BRFSS) for health metrics and the 2016-2020 American Community Survey (ACS) 5-Year Estimates for socioeconomic metrics. The focus is on establishing correlation and prediction, not establishing direct causation.
The variables used were standardized and joined using FIPS codes into a single dataset.
| Variable Category | Variable Name | Definition |
|---|---|---|
| Response Variable (Y) | Adult Obesity Prevalence | Percentage of adults classified as obese in each state. |
| Main Predictors (X1, X2) | Exercise Prevalence | Percentage of adults reporting meeting federal guidelines for aerobic physical activity. |
| Main Predictors (X1, X2) | Smoking Prevalence | Percentage of adult current smokers. |
| Socioeconomic Predictors (X3, X4) | Median State Income | Median household income (in USD) for the state. |
| Socioeconomic Predictors (X3, X4) | Education (Bachelors+) | Percentage of adults aged 25+ with a Bachelor’s degree or higher. |
# Loading the BRFSS CSV file
brfssraw <- read_xpt("LLCP2024.XPT")
colnames(brfssraw)
We use the Census Bureau API to get demographic data for all states. We will get the Median Household Income, the percentage of adults with a Bachelor’s Degree or higher, and the State Name. We need an API key to access this information.
census_key <- "2fd2b4e5729340013219c170942cdcb6eb1a0803"
census_vars <- c("DP03_0062E", "DP02_0068PE", "NAME")
base_url <- "https://api.census.gov/data/2023/acs/acs1/profile"
census_request <- paste0(
base_url,
"?get=", paste(census_vars, collapse = ","),
"&for=state:*",
"&key=", census_key)
response <- GET(census_request)
if (status_code(response) == 200) {
census_data_raw <- content(response, "parsed")
header <- unlist(census_data_raw[[1]])
data_rows <- census_data_raw[-1]
census_df <- as.data.frame(do.call(rbind, data_rows), stringsAsFactors = FALSE)
colnames(census_df) <- header
print("Census Data Structure (Raw):")
head(census_df)
} else {
print(paste("Error getting Census data. Status Code:", status_code(response)))
print(content(response, "text"))
census_df <- NULL}
## [1] "Census Data Structure (Raw):"
## DP03_0062E DP02_0068PE NAME state
## 1 62212 28.9 Alabama 01
## 2 86631 32.2 Alaska 02
## 3 77315 33.5 Arizona 04
## 4 58700 26.2 Arkansas 05
## 5 95521 37.5 California 06
## 6 92911 46.4 Colorado 08
View(census_df)
We cannot use the full BRFSS file because it is too big. To make the project easy to share and reproducible on GitHub, we select only the columns we need for the analysis and the survey weight variable (_LLCPWT), and we remove missing data.
brfss_clean_subset <- brfssraw %>%
select(
State_FIPS = `_STATE`,
Obesity_Status = `_RFBMI5`, # 1=Obese, 2=Not Obese
Exercise_Any = EXERANY2, # 1=Yes, 2=No
Smoking_Status = `_SMOKER3`, # 1=Current, 2=Former, 3=Never
Alcohol_Per_Day = DROCDY4_,
Weight = `_LLCPWT` # Survey weight
) %>%
filter(Obesity_Status %in% c(1, 2),
Exercise_Any %in% c(1, 2),
Smoking_Status %in% c(1, 2, 3),
!Alcohol_Per_Day %in% c(777, 888, 999) & !is.na(Alcohol_Per_Day),
!is.na(Weight))
str(brfss_clean_subset)
head(brfss_clean_subset)
write_csv(brfss_clean_subset, "brfss_clean_subset.csv")
After we cleaned the original data set, we created a subset csv file and uploaded into Github. Finally we read that file which is the one we are going to work with.
# Save and reload for reproducibility
brfss_clean <- read_csv("https://raw.githubusercontent.com/arutam-antunish/DATA607/refs/heads/main/brfss_clean_subset.csv")
## Rows: 158410 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): State_FIPS, Obesity_Status, Exercise_Any, Smoking_Status, Alcohol_P...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In this step, we calculate the final health statistics for each state. We group all survey responses by their State FIPS code. Within each state, we calculate the Prevalence Rate (expressed as a percentage, or per 100 inhabitants) for Obesity, Exercise, and Smoking. This creates the clean, state-level table (brfss_state_calc) needed for the final join.
# Convert the binary/categorical variables into 0/1 for easier calculation
brfss_state_calc <- brfss_clean %>%
mutate(
# Obesity Prevalence: 1 if Obese (code 1), 0 otherwise
is_obese = if_else(Obesity_Status == 1, 1, 0),
# Physical Activity Prevalence: 1 if exercised (code 1), 0 otherwise
is_exercising = if_else(Exercise_Any == 1, 1, 0),
# Current Smoker Prevalence: 1 if current smoker (code 1), 0 otherwise
is_smoker = if_else(Smoking_Status == 1, 1, 0)
) %>%
# Group by State FIPS code
group_by(State_FIPS) %>%
summarise(
# Calculate weighted percentages for each indicator
Obesity_Prev = weighted.mean(is_obese, w = Weight) * 100,
Exercise_Prev = weighted.mean(is_exercising, w = Weight) * 100,
Smoking_Prev = weighted.mean(is_smoker, w = Weight) * 100,
# Calculate the weighted mean for alcohol consumption
Avg_Alcohol_Drinks_Day = weighted.mean(Alcohol_Per_Day, w = Weight),
# Count how many records we used for this state
N_count = n()
) %>%
ungroup()
View(brfss_state_calc)
The Census API data needs to be cleaned so we can join it with the BRFSS data. We rename the coded columns to clear names and convert the income and education columns from text to numeric.
State_FIPS_Vector <- unlist(census_df$state)
Median_Income_Vector_raw <- as.numeric(unlist(census_df$DP03_0062E))
Education_Vector_raw <- as.numeric(unlist(census_df$DP02_0068PE))
if (length(Median_Income_Vector_raw) == 51) {
Median_Income_Vector <- c(Median_Income_Vector_raw, NA)
} else {
Median_Income_Vector <- Median_Income_Vector_raw
}
if (length(Education_Vector_raw) == 51) {
Education_Vector <- c(Education_Vector_raw, NA)
} else {
Education_Vector <- Education_Vector_raw
}
census_temp <- data.frame(
State_FIPS = State_FIPS_Vector,
Median_Income = Median_Income_Vector,
Education_Bachelors_Plus = Education_Vector,
stringsAsFactors = FALSE
) %>%
filter(State_FIPS != "72" & State_FIPS != "0") %>%
filter(!is.na(Median_Income) & !is.na(Education_Bachelors_Plus))
census_clean <- census_df %>%
select(State_Name = NAME, State_FIPS = state) %>%
mutate(State_FIPS = unlist(State_FIPS)) %>%
filter(State_Name != "NAME") %>%
filter(State_FIPS != "72") %>%
inner_join(census_temp, by = "State_FIPS")
Now that both datasets (brfss_state_calc and census_clean) are clean and have the same state codes, we join them.
We use an inner_join based on the state code (State_FIPS). This ensures that only the 51 areas (50 states + D.C.) present in both datasets are kept in the final table.
The final table, final_data, will have all the health and socioeconomic variables we need for the next phase.
census_clean_final <- census_clean %>%
mutate(State_FIPS = stringr::str_pad(State_FIPS, width = 2, side = "left", pad = "0")) %>%
select(State_Name, State_FIPS, Median_Income, Education_Bachelors_Plus)
brfss_state_clean_final <- brfss_state_calc %>%
mutate(State_FIPS = as.character(State_FIPS)) %>%
mutate(State_FIPS = stringr::str_pad(State_FIPS, width = 2, side = "left", pad = "0")) %>%
filter(!(State_FIPS %in% c("66", "72", "78")))
final_data <- brfss_state_clean_final %>%
inner_join(census_clean_final, by = "State_FIPS") %>%
select(
State_Name,
State_FIPS,
Obesity_Prev,
Exercise_Prev,
Smoking_Prev,
Avg_Alcohol_Drinks_Day,
Median_Income,
Education_Bachelors_Plus
) %>%
arrange(State_Name)
View(final_data)
str(final_data)
## tibble [50 × 8] (S3: tbl_df/tbl/data.frame)
## $ State_Name :List of 50
## ..$ : chr "Alabama"
## ..$ : chr "Alaska"
## ..$ : chr "Arizona"
## ..$ : chr "Arkansas"
## ..$ : chr "California"
## ..$ : chr "Colorado"
## ..$ : chr "Connecticut"
## ..$ : chr "Delaware"
## ..$ : chr "District of Columbia"
## ..$ : chr "Florida"
## ..$ : chr "Georgia"
## ..$ : chr "Hawaii"
## ..$ : chr "Idaho"
## ..$ : chr "Illinois"
## ..$ : chr "Indiana"
## ..$ : chr "Iowa"
## ..$ : chr "Kansas"
## ..$ : chr "Kentucky"
## ..$ : chr "Louisiana"
## ..$ : chr "Maine"
## ..$ : chr "Maryland"
## ..$ : chr "Massachusetts"
## ..$ : chr "Michigan"
## ..$ : chr "Minnesota"
## ..$ : chr "Mississippi"
## ..$ : chr "Missouri"
## ..$ : chr "Montana"
## ..$ : chr "Nebraska"
## ..$ : chr "Nevada"
## ..$ : chr "New Hampshire"
## ..$ : chr "New Jersey"
## ..$ : chr "New Mexico"
## ..$ : chr "New York"
## ..$ : chr "North Carolina"
## ..$ : chr "North Dakota"
## ..$ : chr "Ohio"
## ..$ : chr "Oklahoma"
## ..$ : chr "Oregon"
## ..$ : chr "Pennsylvania"
## ..$ : chr "Rhode Island"
## ..$ : chr "South Carolina"
## ..$ : chr "South Dakota"
## ..$ : chr "Texas"
## ..$ : chr "Utah"
## ..$ : chr "Vermont"
## ..$ : chr "Virginia"
## ..$ : chr "Washington"
## ..$ : chr "West Virginia"
## ..$ : chr "Wisconsin"
## ..$ : chr "Wyoming"
## $ State_FIPS : chr [1:50] "01" "02" "04" "05" ...
## $ Obesity_Prev : num [1:50] 29.2 28.4 31.1 27.4 30.3 ...
## $ Exercise_Prev : num [1:50] 70.9 78.1 75.1 67.3 76.3 ...
## $ Smoking_Prev : num [1:50] 26.3 22.5 21 27.4 15 ...
## $ Avg_Alcohol_Drinks_Day : num [1:50] 37.6 50.7 47.7 61.8 48.6 ...
## $ Median_Income : num [1:50] 62212 86631 77315 58700 95521 ...
## $ Education_Bachelors_Plus: num [1:50] 28.9 32.2 33.5 26.2 37.5 46.4 42.9 36.5 65.9 34.9 ...
head(final_data)
## # A tibble: 6 × 8
## State_Name State_FIPS Obesity_Prev Exercise_Prev Smoking_Prev
## <list> <chr> <dbl> <dbl> <dbl>
## 1 <chr [1]> 01 29.2 70.9 26.3
## 2 <chr [1]> 02 28.4 78.1 22.5
## 3 <chr [1]> 04 31.1 75.1 21.0
## 4 <chr [1]> 05 27.4 67.3 27.4
## 5 <chr [1]> 06 30.3 76.3 15.0
## 6 <chr [1]> 08 35.1 82.1 17.1
## # ℹ 3 more variables: Avg_Alcohol_Drinks_Day <dbl>, Median_Income <dbl>,
## # Education_Bachelors_Plus <dbl>
Show the basic numbers and charts of your data.
We use the summary() function to quickly check the distribution (mean, median, min/max) of all our numerical variables. This step is essential to understand the range and central tendencies of the health and socioeconomic factors.
summary(final_data)
## State_Name.Length State_Name.Class State_Name.Mode State_FIPS
## 1 -none- character Length:50
## 1 -none- character Class :character
## 1 -none- character Mode :character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## 1 -none- character
## Obesity_Prev Exercise_Prev Smoking_Prev Avg_Alcohol_Drinks_Day
## Min. :25.34 Min. :66.24 Min. :15.02 Min. :29.60
## 1st Qu.:27.75 1st Qu.:71.75 1st Qu.:18.16 1st Qu.:43.45
## Median :29.00 Median :74.06 Median :20.66 Median :49.78
## Mean :29.43 Mean :74.00 Mean :21.27 Mean :53.78
## 3rd Qu.:30.42 3rd Qu.:76.44 3rd Qu.:23.96 3rd Qu.:62.07
## Max. :37.34 Max. :82.11 Max. :33.25 Max. :91.23
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## Median_Income Education_Bachelors_Plus
## Min. : 54203 Min. :24.00
## 1st Qu.: 69691 1st Qu.:32.02
## Median : 74787 Median :35.10
## Mean : 77788 Mean :35.84
## 3rd Qu.: 86245 3rd Qu.:38.85
## Max. :108210 Max. :65.90
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
library(ggplot2)
library(scales)
Author_Name <- "Arutam Antunish"
Data_Source <- "BRFSS (2020), U.S. Census Bureau (ACS 5-Year Estimates)"
Health_Color <- "#2C3E50"
Income_Color <- "#16A085"
# Define a consistent theme for the project
project_theme_classic <- theme_classic(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14), # Reduced size
plot.subtitle = element_text(size = 11, color = "gray30"), # Reduced size
axis.title = element_text(size = 10),
axis.text = element_text(size = 9),
plot.caption = element_text(size = 8, color = "gray50", hjust = 0.0), # Small, left-aligned
plot.margin = unit(c(1, 1, 1, 1), "cm")
)
# Histogram 1: Distribution of Obesity Prevalence
ggplot(final_data, aes(x = Obesity_Prev)) +
geom_histogram(bins = 10, fill = Health_Color, color = "white", alpha = 0.9) +
labs(
title = "Distribution of Obesity Prevalence Across 50 U.S. Areas",
subtitle = "The data is slightly right-skewed, with most states clustering near 28-30%.",
x = "Obesity Prevalence (%)",
y = "Number of States/Areas",
caption = paste("Source: ", Data_Source, "\nAuthor: ", Author_Name)
) +
scale_x_continuous(labels = percent_format(scale = 1)) +
project_theme_classic
# Histogram 2:Distribution of Median Income
ggplot(final_data, aes(x = Median_Income)) +
geom_histogram(bins = 10, fill = Income_Color, color = "white", alpha = 0.9) +
labs(
title = "Distribution of Median Household Income Across 50 U.S. Areas",
subtitle = "Income exhibits a distinct wide range and appears slightly bimodal (two peaks).",
x = "Median Income",
y = "Number of States/Areas",
caption = paste("Source: ", Data_Source, "\nAuthor: ", Author_Name)
) +
scale_x_continuous(labels = dollar_format(scale = 1, prefix = "$", big.mark = ",")) +
project_theme_classic
Obesity Prevalence
Shape: The distribution is generally right-skewed (a longer tail to the right).
Most states have an obesity prevalence centered between 28% and 30%. There are a few states with very high obesity rates (above 34%), which suggests a strong potential relationship with socioeconomic factors in those specific outliers.
Median Income Shape: The distribution looks somewhat bimodal (two peaks).
A significant number of states cluster in the $70,000 to $80,000 range. However, there is another notable cluster near the $90,000 to $100,000 range, suggesting a clear separation between states with generally lower and generally higher incomes.
The analysis now moves from exploring single variables (histograms) to testing the correlation between variables, which is the core of our study. We use Scatter Plots coupled with a Linear Regression Line (geom_smooth(method = “lm”)). This approach visualizes the direction (positive or negative) and strength of the linear relationship between the two most critical socioeconomic factors (Income and Education) and the primary health outcome (Obesity Prevalence).
The first scatter plot analyzes the relationship between Median Household Income and Obesity Prevalence. The regression line shows a clear negative slope, confirming the expected trend: as income increases, the obesity rate tends to decrease across the states. This negative correlation suggests that socioeconomic prosperity is a factor in better health outcomes.
# Consistent theme
project_theme_classic <- theme_classic(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "gray30"),
axis.title = element_text(size = 10),
axis.text = element_text(size = 9),
plot.caption = element_text(size = 8, color = "gray50", hjust = 0.0),
plot.margin = unit(c(1, 1, 1, 1), "cm")
)
# Define consistent colors
Health_Color_Dark <- "#2C3E50"
Income_Color_Teal <- "#16A085"
Data_Source <- "BRFSS (2020), U.S. Census Bureau (ACS 5-Year Estimates)"
Author_Name <- "Arutam Antunish"
# Scatter Plot 1: Obesity vs. Median Income
ggplot(final_data, aes(x = Median_Income, y = Obesity_Prev)) +
geom_point(color = Health_Color_Dark, size = 3, alpha = 0.8) +
geom_smooth(method = "lm", color = Income_Color_Teal, se = TRUE, linetype = "dashed") +
labs(
title = "Analyzing Correlation: Median Income vs. Obesity Prevalence",
subtitle = "Higher income levels show a tendency towards lower obesity rates (Negative Correlation).",
x = "Median Household Income",
y = "Obesity Prevalence (%)",
caption = paste("Source: ", Data_Source, "\nAuthor: ", Author_Name)
) +
scale_x_continuous(labels = dollar_format(prefix = "$", big.mark = ",")) +
scale_y_continuous(labels = percent_format(scale = 1)) +
project_theme_classic
## `geom_smooth()` using formula = 'y ~ x'
The second scatter plot examines the correlation between the Percentage of Population with a Bachelor’s Degree or Higher and Obesity Prevalence. Similar to income, this plot shows a clear negative relationship. The downward slope of the regression line indicates that states with higher educational attainment generally experience lower rates of obesity.
# Scatter Plot 2: Obesity vs. Education
ggplot(final_data, aes(x = Education_Bachelors_Plus, y = Obesity_Prev)) +
geom_point(color = Health_Color_Dark, size = 3, alpha = 0.8) +
geom_smooth(method = "lm", color = Income_Color_Teal, se = TRUE, linetype = "dashed") +
labs(
title = "Analyzing Correlation: Education Level vs. Obesity Prevalence",
subtitle = "Higher percentage of educated residents is associated with lower obesity rates (Negative Correlation).",
x = "Population with Bachelor's Degree or Higher (%)",
y = "Obesity Prevalence (%)",
caption = paste("Source: ", Data_Source, "\nAuthor: ", Author_Name)
) +
scale_x_continuous(labels = percent_format(scale = 1)) +
scale_y_continuous(labels = percent_format(scale = 1)) +
project_theme_classic
## `geom_smooth()` using formula = 'y ~ x'
This Choropleth Map is a powerful visualization tool used to identify spatial patterns of Obesity Prevalence across the U.S.
The map reveals a clear regional pattern:
High Obesity Rates (Dark Colors): States with the highest obesity rates (e.g., above 30%) are concentrated in the Southeastern U.S. (‘Obesity Belt’) and parts of the Midwest.
Low Obesity Rates (Light Colors): The lowest rates are found along the West Coast and in some Northeastern states.
library(choroplethr)
library(choroplethrMaps)
library(RColorBrewer)
Data_Source <- "BRFSS (2020), U.S. Census Bureau (ACS 5-Year Estimates)"
Author_Name <- "Arutam Antunish"
project_theme_classic <- theme_classic(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "gray30"),
axis.title = element_text(size = 10),
axis.text = element_text(size = 9),
plot.caption = element_text(size = 8, color = "gray50", hjust = 0.0),
plot.margin = unit(c(1, 1, 1, 1), "cm")
)
map_data <- final_data %>%
mutate(region = tolower(State_Name),
value = Obesity_Prev) %>%
select(region, value) %>%
filter(region != "district of columbia")
state_choropleth(
map_data,
title = "Obesity Prevalence Across the United States",
legend = "Obesity (%)",
num_colors = 7
) +
scale_fill_manual(values = RColorBrewer::brewer.pal(7, "YlOrRd")) +
labs(
caption = paste("Source: ", Data_Source, "\nAuthor: ", Author_Name)
) +
project_theme_classic
## geoid_type = 'auto'; the geoid region was determined to be of type: name.lower. To see the list of allowed geoids, see choroplethr::state.regions.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
This is the start of the Modeling Phase. We use Simple Linear Regression to build a formal predictive model.
Our goal is to quantify the relationship between our two main variables: Median Household Income (predictor variable, \(X\)) and Obesity Prevalence (response variable, \(Y\)).
The model determines the slope (the change in Obesity for every $1 change in Income) and the strength of this relationship, which is necessary for statistical inference.
# Build the linear model
model_income <- lm(Obesity_Prev ~ Median_Income, data = final_data)
summary(model_income)
##
## Call:
## lm(formula = Obesity_Prev ~ Median_Income, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9024 -1.9545 -0.2932 1.5340 5.4368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.311e+01 2.001e+00 11.550 1.84e-15 ***
## Median_Income 8.124e-05 2.538e-05 3.201 0.00243 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.301 on 48 degrees of freedom
## Multiple R-squared: 0.1759, Adjusted R-squared: 0.1588
## F-statistic: 10.25 on 1 and 48 DF, p-value: 0.002429
This step uses the model’s summary statistics to perform Statistical Inference. We check the statistical significance (\(p\)-value) and the explanatory power (\(R^2\)) of the Median Income variable.
P-Value (Income):0.00243
Since the \(p\)-value is much less than 0.05, we reject the null hypothesis (\(H_0\)). We conclude that Median Income is a statistically significant predictor of Obesity Prevalence.
Slope (\(\beta_1\)): \(8.124 \times 10^{-5}\) The slope is positive. According to this specific sample, a higher income is associated with a slightly higher obesity rate.
R-squared (\(R^2\)): 0.1759 (17.59%)
Only 17.59% of the variation in Obesity Prevalence is explained by the Income variable. The model has low predictive power, meaning other factors are much more important.
Before we trust the model’s conclusions, we must check the assumptions of Simple Linear Regression by analyzing the Residuals (the error terms).
If the assumptions are violated, the \(p\)-values and the \(R^2\) might be unreliable.
The points mostly follow the dotted line. This suggests the normality assumption is generally met.
# Set up the plotting area for two graphs
par(mfrow = c(1, 2))
# 1. QQ-Plot (Check for Normality)
plot(model_income, which = 2, main = "Normal Q-Q Plot of Residuals")
The points appear randomly scattered around the horizontal line (0). The red line is mostly flat. This suggests the homoscedasticity assumption is met.
# 2. Residuals vs. Fitted (Check for Homoscedasticity)
plot(model_income, which = 1, main = "Residuals vs. Fitted Values")
# Reset the plotting area
par(mfrow = c(1, 1))
The diagnostic checks show no major violations of the linear regression assumptions. Therefore, the statistical inferences made in Section 6.2 (that Income is a significant predictor despite the low \(R^2\)) are reliable.
Since the Simple Model (Income only) had low predictive power (\(R^2=0.176\)), we now build a Multiple Linear Regression (MLR) model.
This complex model includes all key health and socioeconomic variables (Income, Education, Exercise, and Smoking) to see if we can explain a much higher percentage of the variation in Obesity Prevalence.
#MLR model including all key predictors: Obesity_Prev ~ Median_Income + Education_Bachelors_Plus + Exercise_Prev + Smoking_Prev
model_multiple <- lm(Obesity_Prev ~ Median_Income + Education_Bachelors_Plus + Exercise_Prev + Smoking_Prev, data = final_data)
summary(model_multiple)
##
## Call:
## lm(formula = Obesity_Prev ~ Median_Income + Education_Bachelors_Plus +
## Exercise_Prev + Smoking_Prev, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4533 -1.5490 0.0121 1.2120 5.5060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.156e+01 1.064e+01 1.087 0.2830
## Median_Income -5.726e-05 5.981e-05 -0.957 0.3435
## Education_Bachelors_Plus 1.691e-01 8.748e-02 1.933 0.0595 .
## Exercise_Prev 2.322e-01 1.268e-01 1.832 0.0736 .
## Smoking_Prev -4.361e-02 1.497e-01 -0.291 0.7721
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.14 on 45 degrees of freedom
## Multiple R-squared: 0.3321, Adjusted R-squared: 0.2727
## F-statistic: 5.593 on 4 and 45 DF, p-value: 0.0009644
The Multiple Linear Regression (MLR) model is the best-fitting model for our data. We compare its predictive power (\(R^2\)) to the Simple Model and identify the statistically significant factors by reviewing the P-Values and the direction of the slopes.
The table below shows the improvement in predictive power. The MLR model explains a significantly higher portion of the variation in Obesity than the Simple Model (Income only).
| Model | Adjusted.R.squared | Interpretation |
|---|---|---|
| Simple Model (Income Only) | 15.88% | Low predictive power. |
| MLR Model (All Predictors) | 27.27% | Explains over a quarter of the variation, confirming better fit. |
The table below summarizes the key findings from the MLR model’s coefficients. We identify the strongest predictors by looking for low P-Values (\(< 0.1\)).
| Predictor | P-Value | Significance | Slope (Direction) | Key Finding |
|---|---|---|---|---|
| Median Income | 0.3435 | Not Significant | Negative | Income is not a strong predictor when other factors are included. |
| Education (Bachelors+) | 0.0595 (Marginal) | Significant | Positive (+0.169) | A strong predictor, showing a surprising positive association. |
| Exercise Prevalence | 0.0736 (Marginal) | Significant | Positive (+0.232) | A strong predictor, showing a surprising positive association. |
| Smoking Prevalence | 0.7721 | Not Significant | Negative | Smoking does not significantly predict Obesity in this model. |
Summary of MLR: The most complex and best-fitting model indicates that Education and Exercise Prevalence are the strongest factors predicting differences in Obesity, while Income is not a significant factor.
#7. Phase N – Interpretation and Conclusion
Best Predictors: The Multiple Linear Regression (MLR) model showed that Education and Exercise Prevalence are the most statistically significant predictors of Obesity Prevalence.
Income Role: Median Income, initially significant in the Simple Model, became non-significant in the best model (MLR). This means income does not explain much variation in obesity when education and exercise are considered.
Low Predictive Power: Even the best model (MLR) only explained 27.27% of the variation. This indicates that most of the factors driving obesity differences are not included in the model (e.g., genetic, cultural, food environment).
Unexpected Association: Both Education (higher rates of college degrees) and Exercise showed a positive slope in the MLR model. This is counter intuitive and suggests a confounding variable or regional reporting issues influencing the results.
Geographic Clues: The Choropleth Map confirmed a strong regional pattern. States with the highest obesity rates are clustered in the Southeastern U.S. (Obesity Belt).
Model Errors: The regional pattern suggests that factors shared by the Southeast (e.g., climate, local cuisine, access to fresh food) are driving the high obesity rates, and these are not captured by our socioeconomic variables alone. The model tends to under-predict obesity in these high-rate states.
Data Integration: Merging the diverse BRFSS (health) and ACS (socioeconomic) datasets was challenging, requiring careful use of FIPS codes to join data accurately at the state level.
Missing Data: Handling missing values in the raw BRFSS data (e.g., “Don’t Know” or “Refused”) was required to ensure accurate prevalence calculations.
Causation vs. Correlation: The project establishes correlation (relationship), but due to the observational nature of the data, it cannot prove causation.
We use the R Shiny package to build an interactive web dashboard. This allows us to reuse existing R code (maps and models) to visualize the final project results interactively.
The Shiny application is defined by the User Interface (ui) and the Server Logic (server). This structure displays the map and the MLR conclusions.
library(shiny)
library(shinydashboard)
library(ggplot2)
library(dplyr)
library(knitr)
library(choroplethr)
library(RColorBrewer)
library(choroplethrMaps)
library(scales)
max_obesity <- max(final_data$Obesity_Prev)
adj_r_squared_mlr <- round(summary(model_multiple)$adj.r.squared, 3)
sig_predictor <- "Education & Exercise"
pastel_theme <- theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, color = "#4A4A4A"),
plot.subtitle = element_text(size = 11, color = "gray50"),
legend.position = "bottom",
plot.background = element_rect(fill = "white", color = NA),
panel.background = element_rect(fill = "#FAFAFA", color = NA)
)
map_data <- final_data %>%
mutate(region = tolower(State_Name),
value = Obesity_Prev) %>%
select(region, value) %>%
filter(region != "district of columbia")
mlr_conclusion_data <- data.frame(
Predictor = c("Median Income", "Education (Bachelors+)", "Exercise Prevalence", "Smoking Prevalence"),
P.Value = c("0.3435", "0.0595 (Marginal)", "0.0736 (Marginal)", "0.7721"),
Significance = c("Not Significant", "Significant", "Significant", "Not Significant"),
Slope.Direction = c("Negative", "Positive (+0.169)", "Positive (+0.232)", "Negative")
)
ui <- dashboardPage(
dashboardHeader(title = "Obesity: Socioeconomic Drivers", titleWidth = 350),
dashboardSidebar(
sidebarMenu(
menuItem("Dashboard", tabName = "dashboard_tab", icon = icon("chart-line"))
)
),
dashboardBody(
tabItems(
tabItem(tabName = "dashboard_tab",
fluidRow(
valueBox(paste0(round(adj_r_squared_mlr * 100, 1), "%"), "Model Explained Variance (Adj R²)",
icon = icon("chart-area"), color = "teal", width = 4),
valueBox(paste0(round(max_obesity, 1), "%"), "Highest State Obesity Rate",
icon = icon("heartbeat"), color = "red", width = 4),
valueBox(sig_predictor, "Key Significant Predictors (P < 0.1)",
icon = icon("lightbulb"), color = "olive", width = 4)
),
fluidRow(
box(title = "1. Multiple Regression Model Conclusions",
status = "warning", solidHeader = TRUE, width = 12,
tableOutput("mlr_table"))
),
fluidRow(
box(title = "2. Regional Prevalence Map",
status = "primary", solidHeader = TRUE, width = 4, height = 450,
plotOutput("obesity_map", height = 400)),
box(title = "3. Obesity vs. Education (Unexpected Positive Slope)",
status = "info", solidHeader = TRUE, width = 4, height = 450,
plotOutput("education_scatter", height = 400)),
box(title = "4. Obesity vs. Exercise (Unexpected Positive Slope)",
status = "success", solidHeader = TRUE, width = 4, height = 450,
plotOutput("exercise_scatter", height = 400))
)
)
)
)
)
server <- function(input, output) {
output$obesity_map <- renderPlot({
state_choropleth(
map_data,
title = "Obesity Prevalence Across the United States",
legend = "Obesity (%)",
num_colors = 7
) +
scale_fill_manual(values = RColorBrewer::brewer.pal(7, "PuRd")) +
labs(caption = "") +
pastel_theme
})
output$education_scatter <- renderPlot({
ggplot(final_data, aes(x = Education_Bachelors_Plus, y = Obesity_Prev)) +
geom_point(color = "#778899", size = 2) +
geom_smooth(method = "lm", color = "#B0C4DE", fill = "#E6E6FA", alpha = 0.3) +
labs(x = "College Degree Population (%)",
y = "Obesity Prevalence (%)",
title = "Obesity vs. Education") +
scale_y_continuous(labels = percent_format(scale = 1)) +
scale_x_continuous(labels = percent_format(scale = 1)) +
pastel_theme
})
output$exercise_scatter <- renderPlot({
ggplot(final_data, aes(x = Exercise_Prev, y = Obesity_Prev)) +
geom_point(color = "#9966CC", size = 2) +
geom_smooth(method = "lm", color = "#DDA0DD", fill = "#F0E6FA", alpha = 0.3) +
labs(x = "Exercise Prevalence (%)",
y = "Obesity Prevalence (%)",
title = "Obesity vs. Exercise") +
scale_y_continuous(labels = percent_format(scale = 1)) +
scale_x_continuous(labels = percent_format(scale = 1)) +
pastel_theme
})
output$mlr_table <- renderTable({
return(mlr_conclusion_data)
}, bordered = TRUE, rownames = FALSE, caption = "Key Findings from Multiple Linear Regression (P < 0.1)")
}
shinyApp(ui = ui, server = server)
This section lists all external sources and software used to conduct the data cleaning, modeling, and visualization for this project. Proper citation ensures the reproducibility and credibility of the analysis.
Data Source (Health): Centers for Disease Control and Prevention (CDC). Behavioral Risk Factor Surveillance System (BRFSS), 2020 Survey Data.
Data Source (Socioeconomic): U.S. Census Bureau. American Community Survey (ACS) 5-Year Estimates, 2016-2020.
R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Wickham et al. (2023). ggplot2, dplyr, tidyr (R packages used for data manipulation and visualization).
Chang et al. (2024). shiny and shinydashboard (R packages used for interactive dashboard creation).
Becker, L. (2023). choroplethr (R package used for state-level mapping).