1 Title Page

Predicting Diabetes Risk: The Role of Socioeconomic and Behavioral Factors in U.S. Adults

Caitlin Kennedy, PharmD, MHA
University of Rhode Island
DSP 552: Computer-Based Data Exploration
Instructor: Nathan Graff


2 Abstract

This study explored the relationship between socioeconomic and behavioral risk factors and the prevalence of diabetes among adults in the United States (U.S.), using data from the 2023 Behavioral Risk Factor Surveillance System (BRFSS). A survey-weighted logistic regression model was used to analyze the relationship between demographic, socioeconomic, and behavioral variables and a diabetes diagnosis. Additionally, a secondary survey-weighted linear regression model examined the predictors of body mass index (BMI).

Higher BMI, older age, lower income, and physical inactivity were each significantly associated with higher odds of a diabetes diagnosis. The female-coded group had lower odds of diabetes than the male-coded reference group. Physical inactivity was also strongly associated with higher BMI, which suggests both direct and indirect pathways through which behavioral and socioeconomic factors may contribute to diabetes risk.

These findings highlight the importance of addressing modifiable risk factors and disparities in social determinants of health (SDOH), both of which were associated with higher odds of diabetes in this sample.


3 Introduction, Problem Formulation, and Literature Review

Diabetes is one of the most prevalent and costly chronic conditions in the United States and continues to affect vulnerable populations disproportionately. Persistent disparities in diabetes prevalence and outcomes suggest that the condition is shaped not only by biological factors but also by broader social and behavioral determinants. This project examined the extent to which socioeconomic and behavioral factors are associated with the prevalence of diabetes among U.S. adults.

Social determinants of health are an important consideration in population health because identifying high-risk populations and modifiable risk factors can guide targeted prevention strategies and reduce long-term healthcare burden. The project was selected for its direct relevance to population health management and the value of data-driven approaches in better understanding and addressing chronic disease risk.

Prior literature shows that lower socioeconomic status, higher BMI, and lower physical activity are associated with higher diabetes prevalence. SDOH, including income and educational attainment, influence chronic disease risk through pathways such as access to care, access to healthy foods, preventive services, and opportunities for physical activity. These findings support the use of multivariable models that evaluate multiple predictors simultaneously.

This study specifically examined how socioeconomic and behavioral factors are associated with the likelihood of a diabetes diagnosis among U.S. adults, using a multivariable analytic approach.

3.1 Problem Formulation

The primary research question was: How do socioeconomic and behavioral factors influence the likelihood of diabetes among U.S. adults?

3.1.1 Response Variable

  • Diabetes status (binary: yes/no)

3.1.2 Explanatory Variables

  • BMI (continuous, kg/m²)
  • Age (continuous, years)
  • Sex (categorical)
  • Income (categorical)
  • Education (categorical)
  • Physical activity (categorical: active/inactive)

3.2 Hypotheses

  • Null hypothesis (H0): Socioeconomic and behavioral factors are not associated with diabetes prevalence.
  • Alternative hypothesis (H1): Higher BMI, older age, lower income, lower education, and physical inactivity are associated with greater odds of diabetes.

4 Data Characterization, Descriptive Analytics, and Visualization

4.1 Data Source

The data for this analysis came from the 2023 Behavioral Risk Factor Surveillance System (BRFSS), a CDC-supported telephone survey of noninstitutionalized U.S. adults. BRFSS collects demographic, socioeconomic, behavioral, and chronic disease information from a large national sample.

The code shown in this report loads previously saved analytic objects and reproduces the figures and tables presented. Due to memory constraints in the cloud environment, a subset of 15,000 observations from the 2023 BRFSS dataset was used for model fitting.

4.2 Load Saved Analytic Objects

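library(survey)   # assumed: needed to work with the saved survey design and model objects
library(ggplot2)  # ggplot() figures
library(knitr)    # kable() tables

analytic_df <- readRDS("analytic_df.rds")
analytic_design <- readRDS("analytic_design.rds")
odds_ratio_table <- readRDS("odds_ratio_table.rds")
logit_model <- readRDS("logit_model.rds")
bmi_model <- readRDS("bmi_model.rds")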

4.3 Variables Used

The primary outcome was diabetes status and the main predictors were age, BMI, income, education, physical activity, and sex.

4.4 Descriptive Visualizations

4.4.1 Figure 1. BMI Distribution by Diabetes Status

ggplot(analytic_df, aes(x = bmi, fill = factor(diabetes))) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  labs(
    title = "BMI Distribution by Diabetes Status",
    x = "BMI (kg/m²)",
    y = "Count",
    fill = "Diabetes"
  )

4.4.2 Figure 2. Diabetes Prevalence by Physical Activity

ggplot(analytic_df, aes(x = physical_activity, fill = factor(diabetes))) +
  geom_bar(position = "fill") +
  labs(
    title = "Diabetes Prevalence by Physical Activity",
    x = "Physical Activity",
    y = "Proportion",
    fill = "Diabetes"
  )

4.5 Descriptive Narrative

The final analytic dataset included adults with non-missing values for diabetes status, age, BMI, sex, income, education, and physical activity. The BMI distribution showed that respondents with diabetes were more concentrated at higher BMI values than those without diabetes. Diabetes prevalence also appeared higher among respondents classified as inactive than among those classified as active, suggesting that physical activity may be an important behavioral predictor of diabetes risk.


5 Methods

This study used survey-weighted regression methods to account for the complex sampling design of the BRFSS. The BRFSS incorporates weighting, stratification, and clustering, so the design variables _LLCPWT (final weight), _STSTR (stratum), and _PSU (primary sampling unit) were used to define the survey design for all regression analyses.
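
As a minimal sketch of how such a design might be declared, the example below uses the R survey package on simulated data. The report does not show this step, so the package choice, the column names psu, ststr, and llcpwt (hypothetical stand-ins for _PSU, _STSTR, and _LLCPWT), and the lonely-PSU setting are all assumptions rather than the settings actually used.

```r
library(survey)

# Simulated stand-in for a BRFSS extract; none of these values are real data.
set.seed(1)
n <- 200
toy <- data.frame(
  psu      = seq_len(n),                      # hypothetical stand-in for _PSU
  ststr    = sample(1:4, n, replace = TRUE),  # hypothetical stand-in for _STSTR
  llcpwt   = runif(n, 50, 500),               # hypothetical stand-in for _LLCPWT
  diabetes = rbinom(n, 1, 0.12)
)

# Subsetting can leave strata with a single PSU; "adjust" is one
# conservative way to handle that case (an assumption, not the report's setting).
options(survey.lonely.psu = "adjust")

# Declare clustering, stratification, and weights in one design object.
toy_design <- svydesign(
  ids     = ~psu,
  strata  = ~ststr,
  weights = ~llcpwt,
  data    = toy,
  nest    = TRUE
)

# Weighted prevalence with a design-based standard error.
toy_prevalence <- svymean(~diabetes, toy_design)
print(toy_prevalence)
```

Passing the design object, rather than the raw data frame, to later model calls is what lets the standard errors reflect stratification and clustering instead of assuming simple random sampling.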

The primary outcome variable was diabetes status, coded as a binary variable (1 = diabetes, 0 = no diabetes). The primary predictors were BMI, age, sex, income, education, and physical activity. A survey-weighted logistic regression model was used as the primary analytic approach because the outcome was binary and the goal was to estimate adjusted odds ratios.

This approach supports population-level estimates that reflect the BRFSS sampling design, although using a 15,000-observation subset may limit how closely they match full-sample estimates.

A secondary survey-weighted linear regression model was used to examine predictors of BMI as an upstream risk factor for diabetes. This second model helped identify socioeconomic and behavioral factors that may influence diabetes indirectly through obesity.

Several limitations should be considered when interpreting these results. First, because the BRFSS is a cross-sectional survey, causal relationships cannot be established. Second, all variables are self-reported and may be subject to recall or reporting bias. Third, due to computational constraints, this analysis used a subset of the full BRFSS dataset, which may limit generalizability. Despite these limitations, the findings remain consistent with the existing literature and provide useful insights into diabetes risk factors.


6 Data Analysis and Main Results

6.1 Analysis 1: Logistic Regression

6.1.1 Response Variable

  • Diabetes status (binary: yes/no)

6.1.2 Explanatory Variables

  • BMI (continuous, kg/m²)
  • Age (continuous, years)
  • Sex (categorical)
  • Income (categorical)
  • Education (categorical)
  • Physical activity (categorical)

6.1.3 Logistic Regression Model
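
The model-fitting code is not shown in this section. As a hedged sketch of how a survey-weighted logistic regression and an odds ratio table of this shape could be produced, the example below fits svyglm() from the survey package to simulated data. The data, the design column names (psu, ststr, llcpwt), the reduced predictor set, and the object names toy_logit and toy_or are illustrative assumptions, not the author's actual code.

```r
library(survey)

# Simulated data only; coefficients below are arbitrary illustration values.
set.seed(42)
n <- 500
toy <- data.frame(
  psu    = seq_len(n),
  ststr  = sample(1:5, n, replace = TRUE),
  llcpwt = runif(n, 50, 500),
  bmi    = rnorm(n, 28, 5),
  age    = sample(18:80, n, replace = TRUE),
  physical_activity = factor(sample(c("Active", "Inactive"), n, replace = TRUE))
)
linear_predictor <- -6 + 0.07 * toy$bmi + 0.045 * toy$age +
  0.5 * (toy$physical_activity == "Inactive")
toy$diabetes <- rbinom(n, 1, plogis(linear_predictor))

toy_design <- svydesign(ids = ~psu, strata = ~ststr, weights = ~llcpwt,
                        data = toy, nest = TRUE)

# quasibinomial() avoids warnings about non-integer weighted counts.
toy_logit <- svyglm(diabetes ~ bmi + age + physical_activity,
                    design = toy_design, family = quasibinomial())

# Exponentiate coefficients and Wald confidence limits to get odds ratios.
ci <- exp(confint(toy_logit))
toy_or <- data.frame(
  Term       = names(coef(toy_logit)),
  Odds_Ratio = round(exp(coef(toy_logit)), 3),
  CI_Lower   = round(ci[, 1], 3),
  CI_Upper   = round(ci[, 2], 3),
  row.names  = NULL   # avoids printing each term twice (row name plus Term column)
)
print(toy_or)
```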

6.1.4 Odds Ratio Table

kable(odds_ratio_table, caption = "Odds Ratios from Survey-Weighted Logistic Regression")
Odds Ratios from Survey-Weighted Logistic Regression

Term                                        Odds_Ratio  CI_Lower  CI_Upper
(Intercept)                                      0.001     0.000     0.002
bmi                                              1.074     1.058     1.090
age                                              1.047     1.041     1.054
sexFemale                                        0.777     0.641     0.942
income<$10,000                                   4.042     1.876     8.706
income$10,000-$14,999                            3.331     1.544     7.186
income$15,000-$19,999                            2.645     1.354     5.164
income$20,000-$24,999                            2.719     1.406     5.256
income$25,000-$34,999                            2.522     1.383     4.601
income$35,000-$49,999                            2.550     1.453     4.475
income$50,000-$74,999                            1.745     1.002     3.038
income$75,000-$99,999                            1.849     1.039     3.290
income$100,000-$149,999                          1.591     0.900     2.812
income$150,000-$199,999                          1.263     0.596     2.674
educationNever attended/Kindergarten only        0.000     0.000     0.000
educationGrades 1-8                              1.568     0.811     3.035
educationGrades 9-11                             0.823     0.517     1.311
educationHigh school graduate/GED                0.990     0.766     1.280
educationSome college/technical school           1.159     0.927     1.449
physical_activityInactive                        1.731     1.406     2.131

6.1.5 Logistic Regression Results Narrative

A survey-weighted logistic regression model was used to evaluate the association between BMI, age, sex, income, education, physical activity, and diabetes status. Higher BMI and older age were both significantly associated with increased odds of diabetes. Each 1-unit increase in BMI was associated with a 7.4% increase in the odds of diabetes (OR = 1.074, 95% CI: 1.058–1.090), and each additional year of age was associated with a 4.7% increase in the odds of diabetes (OR = 1.047, 95% CI: 1.041–1.054). Compared with the male-coded reference group, the female-coded group had significantly lower odds of diabetes (OR = 0.777, 95% CI: 0.641–0.942). Physical inactivity was also significantly associated with diabetes, with inactive respondents showing 73.1% higher odds of diabetes than active respondents (OR = 1.731, 95% CI: 1.406–2.131).

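Compared with the highest income category ($200,000+), respondents in every income category below $100,000 had significantly higher odds of diabetes, with the strongest association observed in the lowest income group (OR = 4.042, 95% CI: 1.876–8.706). The confidence intervals for the $100,000-$149,999 and $150,000-$199,999 categories included 1. No education category differed significantly from its reference category, since every estimable confidence interval included 1. The "Never attended/Kindergarten only" category produced a degenerate estimate (odds ratio and confidence limits of 0.000), which likely reflects very few respondents in that category and should not be interpreted.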
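
The percentage interpretations in this narrative come from converting each odds ratio with (OR - 1) × 100. The short base R sketch below reproduces that arithmetic from the reported odds ratios and also shows how per-unit odds ratios compound over larger differences. The helper name or_to_pct and the 5-unit and 10-year comparisons are illustrative choices, not part of the original analysis.

```r
# Convert an odds ratio to a percent change in odds (hypothetical helper).
or_to_pct <- function(or) (or - 1) * 100

# Odds ratios as reported in the odds ratio table.
reported_or <- c(bmi = 1.074, age = 1.047, sexFemale = 0.777, inactive = 1.731)
pct_change <- or_to_pct(reported_or)
print(round(pct_change, 1))

# Odds ratios multiply across units, so a k-unit difference corresponds to OR^k.
or_bmi_5  <- reported_or[["bmi"]]^5    # 5 BMI units: roughly 1.43
or_age_10 <- reported_or[["age"]]^10   # 10 years of age: roughly 1.58
print(round(c(bmi_5_units = or_bmi_5, age_10_years = or_age_10), 2))
```

Because the effects are multiplicative on the odds scale, a 5-unit BMI difference corresponds to about 43% higher odds, not 5 × 7.4% = 37%.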

6.2 Consistency with Descriptive Figures

Figure 1 supports the regression findings by showing that respondents with diabetes tended to have higher BMIs than those without diabetes. Figure 2 also shows a higher proportion of respondents with diabetes among those classified as inactive than among those classified as active.

6.3 Analysis 2: Linear Regression

6.3.1 Response Variable

  • BMI (continuous, kg/m²)

6.3.2 Explanatory Variables

  • Age (continuous, years)
  • Sex (categorical)
  • Income (categorical)
  • Education (categorical)
  • Physical activity (categorical)

6.3.3 Linear Regression Model
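
The fitting code for the secondary model is not shown in this section. A minimal sketch of a survey-weighted linear regression of BMI, assuming the survey package and using simulated data, might look like the following. The data, the design column names, the reduced predictor set, and the object names toy_bmi_model and toy_bmi_coefs are hypothetical.

```r
library(survey)

# Simulated data only; the 1.4-unit inactivity effect is an arbitrary illustration value.
set.seed(7)
n <- 500
toy <- data.frame(
  psu    = seq_len(n),
  ststr  = sample(1:5, n, replace = TRUE),
  llcpwt = runif(n, 50, 500),
  age    = sample(18:80, n, replace = TRUE),
  sex    = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),   # Male as the reference level
  physical_activity = factor(sample(c("Active", "Inactive"), n, replace = TRUE))
)
toy$bmi <- 27 + 1.4 * (toy$physical_activity == "Inactive") + rnorm(n, sd = 5)

toy_design <- svydesign(ids = ~psu, strata = ~ststr, weights = ~llcpwt,
                        data = toy, nest = TRUE)

# svyglm() defaults to a Gaussian family, i.e. weighted linear regression.
toy_bmi_model <- svyglm(bmi ~ age + sex + physical_activity, design = toy_design)

# Coefficients are adjusted mean differences in BMI units, with Wald limits.
toy_bmi_coefs <- cbind(Estimate = coef(toy_bmi_model), confint(toy_bmi_model))
print(round(toy_bmi_coefs, 2))
```

In a model of this form, the physical_activityInactive coefficient is read as the adjusted difference in mean BMI between inactive and active respondents.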

6.3.4 Linear Regression Results Narrative

A secondary survey-weighted linear regression model was used to examine predictors of BMI. Physical inactivity was significantly associated with higher BMI, with inactive respondents averaging approximately 1.42 BMI units higher than active respondents after adjusting for age, sex, income, and education. The female-coded group also had slightly higher BMI on average compared with the male-coded reference group. Several lower- and middle-income categories were associated with higher BMI relative to the highest-income category. These findings suggest that socioeconomic and behavioral factors may influence diabetes risk both directly and indirectly through BMI as an upstream risk factor.


7 Conclusions and Research Directions

This project examined the relationship between socioeconomic and behavioral factors and diabetes prevalence among U.S. adults using BRFSS 2023 data. The results indicated that higher BMI, older age, physical inactivity, and lower household income were associated with increased odds of diabetes. Supplementary analysis further showed that physical inactivity and several lower socioeconomic categories were associated with higher BMI, supporting the role of obesity as an upstream diabetes risk factor.

These findings are consistent with prior literature demonstrating that social determinants of health and behavioral risk factors play an important role in shaping diabetes outcomes and disparities. The results suggest that interventions focused on physical activity, obesity prevention, and support for lower-income populations may help reduce diabetes disparities. Future research could extend this work by using the full BRFSS dataset, examining additional social determinants such as geographic access to care and food environments, testing interaction effects, and comparing traditional regression approaches with machine-learning classification models.


8 Appendix

8.1 Data Sample

head(analytic_df, 10)

8.2 Additional Notes

Only the outputs referenced in the report are included. All code shown in this file reproduces the analyses presented in the main text.

8.3 Session Information


9 AI Use Statement

AI tools were used to support code refinement, editing, and organization of the written report. All data preparation, statistical analysis, interpretation of results, and final review were completed by the author.


10 References

Bewick, V., Cheek, L., & Ball, J. (2005). Statistics review 14: Logistic regression. Critical Care, 9(1), 112–118.

Centers for Disease Control and Prevention. (2023). National Diabetes Statistics Report 2023. U.S. Department of Health and Human Services.

Hill-Briggs, F., Adler, N. E., Berkowitz, S. A., Chin, M. H., Gary-Webb, T. L., Navas-Acien, A., Thornton, P. L., & Haire-Joshu, D. (2021). Social determinants of health and diabetes: A scientific review. Diabetes Care, 44(1), 258–279.

Walker, R. J., Smalls, B. L., & Egede, L. E. (2014). Social determinants of health in adults with type 2 diabetes—Contribution of mutable and immutable factors. Diabetes Research and Clinical Practice, 106(2), 193–201.