MATH1115 Project Template

Author

SIDs xxx

Recommendation/Insight

This is effectively your “Executive Summary” with a concrete insight or action. Type your recommendation here

Evidence

IDA

The dataset originates from the Behavioral Risk Factor Surveillance System (BRFSS) 2015, a U.S. telephone survey conducted annually by the CDC, and contains 253,680 observations and 22 variables. Variables were originally imported as numeric codes, recoded according to BRFSS conventions (Table 1). For analysis purposes, BMI was treated as a continuous numerical variable and other variables were converted to categorical or ordinal factors. Whilst originating from a reputable source, that dataset is limited due to reliance on self-reported information, potential recall and social desirability bias, cross-sectional design preventing causal inference, and underrepresentation of some population groups due to telephone survey methodology. The dataset is also class-imbalanced, with most respondents in the non-diabetic group, and exhibits demographic skew toward older, more educated, and higher-income individuals. Furthermore, categorical coding of variables such as age and income reduces granularity, and results are restricted to the U.S. population, limiting generalisability. We assume that each observation represents a unique individual, with independent, accurately coded responses broadly representative of the target population. Although cleaned microdata was obtained from Kaggle, checks for missing data were performed; if present, methods such as imputation or listwise deletion could have been considered.

Code
library(readxl)   # for reading Excel files
library(dplyr)    # for data manipulation
library(ggplot2)  # for visualization
library(tidyverse)
library(plotly)
library(scales)

diabetes <- read_excel("diabetes_health_indicators.xlsx") # reading in the xlsx file and renaming it to "diabetes"
diabetes %>%
  glimpse() # dataset overview
Rows: 253,680
Columns: 22
$ Diabetes_012         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0…
$ HighBP               <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1…
$ HighChol             <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1…
$ CholCheck            <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ BMI                  <dbl> 40, 25, 28, 27, 24, 25, 30, 25, 30, 24, 25, 34, 2…
$ Smoker               <dbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0…
$ Stroke               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ HeartDiseaseorAttack <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ PhysActivity         <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1…
$ Fruits               <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1…
$ Veggies              <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1…
$ HvyAlcoholConsump    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ AnyHealthcare        <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ NoDocbcCost          <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ GenHlth              <dbl> 5, 3, 5, 2, 2, 2, 3, 3, 5, 2, 3, 3, 3, 4, 4, 2, 3…
$ MentHlth             <dbl> 18, 0, 30, 0, 3, 0, 0, 0, 30, 0, 0, 0, 0, 0, 30, …
$ PhysHlth             <dbl> 15, 0, 30, 0, 0, 2, 14, 0, 30, 0, 0, 30, 15, 0, 2…
$ DiffWalk             <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0…
$ Sex                  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0…
$ Age                  <dbl> 9, 7, 9, 11, 11, 10, 9, 11, 9, 8, 13, 10, 7, 11, …
$ Education            <dbl> 4, 6, 4, 3, 5, 6, 6, 4, 5, 4, 6, 5, 5, 4, 6, 6, 4…
$ Income               <dbl> 3, 1, 8, 6, 4, 8, 7, 4, 1, 3, 8, 1, 7, 6, 2, 8, 3…

Table 1. Data dictionary

Variable Description
Diabetes_012 0 = No diabetes, 1 = Prediabetes, 2 = Type 2 Diabetes
BMI Continuous (kg/m²)
Sex 0 = Female, 1 = Male
Age 13 categories; 1 = 18–24, …, 13 = 80+
Education 1 = No school, 2 = Grades 1–8, 3 = Grades 9–11, 4 = High school, 5 = Some college, 6 = College graduate
Income 1 = <$10,000, …, 8 = ≥$75,000
HighBP 0 = No, 1 = Yes
HighChol 0 = No, 1 = Yes
CholCheck 0 = No cholesterol check in past 5 years, 1 = Yes
Smoker 0 = Never smoked 100 cigarettes, 1 = Ever smoked 100+
Stroke 0 = No, 1 = Yes
HeartDiseaseorAttack 0 = No, 1 = Yes
PhysActivity 0 = No physical activity in past 30 days, 1 = Yes
Fruits 0 = <1 fruit/day, 1 = ≥1 fruit/day
Veggies 0 = <1 vegetable/day, 1 = ≥1 vegetable/day
HvyAlcoholConsump 0 = No (≤14 drinks/week men; ≤7 women), 1 = Yes
AnyHealthcare 0 = No health coverage, 1 = Yes
NoDocbcCost 0 = No, 1 = Yes (could not see doctor due to cost)
GenHlth 1 = Excellent, 2 = Very good, 3 = Good, 4 = Fair, 5 = Poor
MentHlth Number of days mental health not good (0–30)
PhysHlth Number of days physical health not good (0–30)
DiffWalk 0 = No, 1 = Yes (difficulty walking/climbing stairs)

Research Question 1

What is the prevalence of T2DM in the US, accounting for demographics?

Code
# Looking at proportions of individuals in each group
diabetes %>%
  # 1) Give Diabetes_012 readable labels (and fix order)
  mutate(Diabetes_012 = factor(Diabetes_012,
                               levels = c(0, 1, 2),
                               labels = c("No diabetes/gestational",
                                          "Pre-diabetes", "Diabetes"))) %>%
  # 2) Count each group and compute percentages
  count(Diabetes_012, name = "n") %>%
  mutate(percent = 100 * n / sum(n)) %>%

  # 3) Plot bars; fill by group
  ggplot(aes(x = Diabetes_012, y = percent, fill = Diabetes_012)) +
  geom_col(width = 0.7) +

  # 4) Label each bar with its percent
  geom_text(aes(label = sprintf("%.1f%%", percent)),
            vjust = -0.35, size = 3.5) +

  # 5) Axes, colors, theme
  scale_y_continuous(labels = scales::percent_format(scale = 1),
                     expand = expansion(mult = c(0, 0.06))) +
  scale_fill_manual(values = c(
    "No diabetes/gestational" = "#8dd3c7",
    "Pre-diabetes"            = "#fdb462",
    "Diabetes"                = "#fb8072"
  )) +
  labs(
       x = "Diabetes Status",
       y = "Percentage of Individuals") +
  guides(fill = "none") +
  theme_classic(base_size = 12)

Figure 1.1 Diabetes status in the survey sample

In this sample (n = 253,680), 13.9% reported diabetes, whilst 1.8% reported pre-diabetes. The diabetes share is consistent with U.S. adult total diabetes (diagnosed + undiagnosed) of ~14.7% based on NHANES 2017–2020 (CDC, 2022). By contrast, the likely national prevalence of pre-diabetes is ~38% of U.S. adults (≈98 million), and >80% are unaware of their condition (CDC, 2023a; CDC, 2023b). The gap between the sample’s 1.8% and the national ~38% reflects method/awareness differences: national figures come from biomarker-based NHANES estimates, whereas this dataset is drawn from BRFSS, a self-reported telephone survey (CDC, 2023c).

Code
# 0) "Mutating" the labels
dat <- diabetes %>%
  mutate(
    Sex = factor(Sex, levels = c(0, 1), labels = c("Female", "Male")),
    # Broad age bands from BRFSS age codes
    AgeGroup = case_when(
      Age %in%  1:3  ~ "18–34",
      Age %in%  4:6  ~ "35–49",
      Age %in%  7:9  ~ "50–64",
      Age %in% 10:13 ~ "65+",
      TRUE           ~ NA_character_
    )
  ) %>%
  filter(!is.na(AgeGroup))

# 1) Prevalence by AgeGroup × Sex  (# prevalence = % with Diabetes_012 == 2)
age_sex <- dat %>%
  group_by(AgeGroup, Sex) %>%
  summarise(
    n = n(),
    prevalence = 100 * mean(Diabetes_012 == 2, na.rm = TRUE),
    .groups = "drop"
  )

# 2) “All ages” totals by Sex (computed from raw rows, not averaging percentages)
all_ages <- dat %>%
  group_by(Sex) %>%
  summarise(
    n = n(),
    prevalence = 100 * mean(Diabetes_012 == 2, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(AgeGroup = "All ages")

# 3) Combine and set display order of age bands
plot_df <- bind_rows(all_ages, age_sex) %>%
  mutate(AgeGroup = factor(AgeGroup,
                           levels = c("All ages", "18–34", "35–49", "50–64", "65+")))

# 4) Plot: grouped bars with % labels
ggplot(plot_df, aes(x = AgeGroup, y = prevalence, fill = Sex)) +
  geom_col(position = position_dodge(width = 0.9), width = 0.9) +
  # labels each bar with its percentage
  geom_text(aes(label = sprintf("%.1f%%", prevalence)),
            position = position_dodge(width = 0.8), vjust = -0.35, size = 3.4) +
  scale_y_continuous(labels = scales::percent_format(scale = 1),
                     expand = expansion(mult = c(0, 0.10))) +
  scale_fill_manual(values = c(Female = "#fb8072", Male = "#80b1d3")) +
  labs(
    x = "Age Group", y = "Prevalence (%)", fill = "Sex"
  ) +
  theme_classic(base_size = 12)+
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Figure 1.2. Diabetes prevalence by age group and sex, including all ages.

Overall, diabetes prevalence is higher in males (15.2%) than females (13.0%), and men have higher rates in most age groups—35–49: 7.2% vs 6.6%; 50–64: 15.6% vs 13.6%; 65+: 23.4% vs 18.4%—with a clear rise with age in both sexes (Figure 1.2). The small female excess at 18–34 (2.5% vs 1.8%) is minor and likely due to sampling/under-diagnosis. Taken together, older age and male sex are associated with higher diabetes prevalence in this cross-sectional dataset.

Research Question 2

How does the mean BMI differ between individuals with no diabetes, pre-diabetes, and type 2 diabetes?

Code
# Box Plot
ggplot(diabetes, aes(x = factor(Diabetes_012), y = BMI, fill = factor(Diabetes_012))) +
  # 1) Boxplot of BMI by diabetes status
  geom_boxplot(width = 0.6) +
  # 2) Dashed mean line inside each box
  stat_summary(fun = mean, geom = "crossbar", width = 0.6,
               colour = "black", linetype = "dashed", size = 0.2) +
  # 3) Labels
  labs(
       x = "Diabetes Status", y = "BMI") +
  # 4) Replace x-axis codes with labels
  scale_x_discrete(labels = c(
    "0" = "No Diabetes",
    "1" = "Pre-diabetes",
    "2" = "Type 2 Diabetes"
  )) +
  # 5) Manual fill colours
  scale_fill_manual(values = c("0" = "lightgreen", "1" = "orange", "2" = "lightcoral")) +
  # 6) Focused y-range (edit if you like)
  coord_cartesian(ylim = c(10, 100)) +
  # 7) Theme
  theme_classic(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Code
# --- NUMERICAL SUMMARY (per diabetes group) ---
summary_tbl <- diabetes %>%
  filter(!is.na(Diabetes_012), !is.na(BMI)) %>%
  group_by(Diabetes_012) %>%
  summarise(
    n       = n(),
    mean    = mean(BMI),
    sd      = sd(BMI),
    se      = sd / sqrt(n),
    lo95    = mean - 1.96 * se,   # 95% CI for the mean
    hi95    = mean + 1.96 * se,
    median  = median(BMI),
    q1      = quantile(BMI, 0.25),
    q3      = quantile(BMI, 0.75),
    IQR     = IQR(BMI),
    min     = min(BMI),
    max     = max(BMI),
    .groups = "drop"
  ) %>%
  mutate(Group = recode(as.character(Diabetes_012),
                        "0" = "No Diabetes",
                        "1" = "Pre-diabetes",
                        "2" = "Type 2 Diabetes")) %>%
  select(Group, n, mean, sd, se, lo95, hi95, median, q1, q3, IQR, min, max) %>%
  mutate(across(where(is.numeric), ~round(., 2)))

summary_tbl
# A tibble: 3 × 13
  Group           n  mean    sd    se  lo95  hi95 median    q1    q3   IQR   min
  <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 No Diabet… 213703  27.7  6.26  0.01  27.7  27.8     27    24    30     6    12
2 Pre-diabe…   4631  30.7  6.96  0.1   30.5  30.9     30    26    34     8    13
3 Type 2 Di…  35346  31.9  7.36  0.04  31.9  32.0     31    27    35     8    13
# ℹ 1 more variable: max <dbl>

Figure 2.0. BMI Distribution across respondents with no diabetes, pre-diabetes and type 2 diabetes.

The BMI distributions are right-skewed in all groups, however the means are close to the medians (No diabetes: 27.7 vs 27; Pre-diabetes: 30.7 vs 30; Type 2: 31.9 vs 31). BMI shifts upward across diabetes status: No diabetes < Pre-diabetes < Type 2. Despite large variability in individuals (SD ≈ 6–7), the estimates of the means are very precise because of the big samples (n ≈ 213k, 4.6k, 35k), yielding tight 95% CIs (about 27.72–27.77, 30.52–30.93, 31.87–32.02, respectively). The spreads are comparable across groups, with IQRs roughly 6 (24–30) for No diabetes and 8 (26–34; 27–35) for Pre-diabetes and Type 2.

Figure 4. BMI distribution across ‘no diabetes’, ‘pre-diabetes’ and ‘type 2 diabetes’ groups.

Code
# 1) Summarise mean and SE
summary_tbl <- diabetes %>%
  mutate(Group = factor(Diabetes_012,
                        levels = c(0, 1, 2),
                        labels = c("No Diabetes", "Pre-diabetes", "Type 2 Diabetes"))) %>%
  group_by(Group) %>%
  summarise(
    n        = sum(!is.na(BMI)),                 # number of cases
    mean_BMI = mean(BMI, na.rm = TRUE),          # mean BMI
    se_BMI   = sd(BMI, na.rm = TRUE) / sqrt(n),  # Standard Error
    .groups  = "drop"
  )

# 2) Barplot with SE error bars
ggplot(summary_tbl, aes(x = Group, y = mean_BMI, fill = Group)) +
  geom_col(width = 0.65, colour = "black") +
  geom_errorbar(aes(ymin = mean_BMI - se_BMI, ymax = mean_BMI + se_BMI),
                width = 0.12, linewidth = 0.7) +
  geom_text(aes(label = sprintf("%.1f", mean_BMI)), vjust = -1, size = 3.5) +
  scale_fill_manual(values = c("No Diabetes" = "lightgreen",
                               "Pre-diabetes" = "orange",
                               "Type 2 Diabetes" = "lightcoral")) +
  labs(title = "Mean BMI by Diabetes Group (±SE)",
       x = "Diabetes Status", y = "Mean BMI") +
  coord_cartesian(ylim = c(25, 33)) +
  theme_classic(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

Code
diabetes <- diabetes %>%
  mutate(Diabetes = factor(Diabetes_012,
                           levels = c(0,1,2),
                           labels = c("No Diabetes","Pre-diabetes","Type 2 Diabetes")))

# Recode diabetes status
diabetes <- diabetes %>%
  mutate(Diabetes = factor(Diabetes_012,
                           levels = c(0,1,2),
                           labels = c("No Diabetes","Pre-diabetes","Type 2 Diabetes")))

Figure 5 Mean BMI by diabetes status.

Research Question 3

Which factors are most strongly associated with Type 2 Diabetes, and what what is the relationship of socioeconomic factors?

Code
## --- Correlation bar plot + table of coefficients ---

dat <- diabetes

# If needed, make binary outcome
if ("Diabetes_012" %in% names(dat) && !"Diabetes_binary" %in% names(dat)) {
  dat$Diabetes_binary <- as.integer(dat$Diabetes_012 == 2)
}

# Pick the right Education/Income column names
edu_col <- if ("Education" %in% names(dat)) "Education" else
           if ("Education2" %in% names(dat)) "Education2" else stop("No Education column")
inc_col <- if ("Income" %in% names(dat)) "Income" else
           if ("Income2" %in% names(dat)) "Income2" else stop("No Income column")

vars <- c("HighBP","HighChol","CholCheck","BMI","Smoker","Stroke",
          "HeartDiseaseorAttack","PhysActivity","Fruits","Veggies",
          "HvyAlcoholConsump","AnyHealthcare","NoDocbcCost","GenHlth",
          "MentHlth","PhysHlth","DiffWalk","Sex","Age", edu_col, inc_col)

# Robust coercion
to_num <- function(x) {
  if (is.numeric(x)) return(x)
  if (is.factor(x) || is.character(x)) {
    y <- suppressWarnings(as.numeric(as.character(x)))
    if (all(is.na(y))) y <- as.integer(as.factor(x))
    return(y)
  }
  as.numeric(x)
}

X <- dat[, c("Diabetes_binary", vars), drop = FALSE]
for (nm in names(X)) X[[nm]] <- to_num(X[[nm]])

# Spearman correlations
rho <- sapply(vars, function(v) {
  ok <- complete.cases(X[, c("Diabetes_binary", v)])
  cor(X$Diabetes_binary[ok], X[[v]][ok], method = "spearman")
})

# Sort, color scheme
o <- order(rho)
rho_sorted  <- rho[o]; vars_sorted <- vars[o]
top3 <- order(abs(rho_sorted), decreasing = TRUE)[1:min(3, length(rho_sorted))]
cols <- ifelse(rho_sorted >= 0, "orange", "blue"); cols[top3] <- "red"

# Plot
op <- par(mar = c(6, 8, 2, 2), xaxs = "i")
barplot(rho_sorted, horiz = TRUE,
        names.arg = vars_sorted, las = 1,
        col = cols, border = NA, space = 0.3, cex.names = 0.7,
        xlab = "Spearman correlation (ρ)", xlim = c(-0.2, 0.3))
abline(v = 0, col = "grey40", lwd = 1)

Code
par(op)

# --- Print correlation table ---
cor_table <- data.frame(
  Variable = vars_sorted,
  Spearman_rho = round(rho_sorted, 3)
)
print(cor_table, row.names = FALSE)
             Variable Spearman_rho
               Income       -0.163
            Education       -0.120
         PhysActivity       -0.118
    HvyAlcoholConsump       -0.057
              Veggies       -0.057
               Fruits       -0.041
        AnyHealthcare        0.016
                  Sex        0.031
          NoDocbcCost        0.031
             MentHlth        0.040
               Smoker        0.061
            CholCheck        0.065
               Stroke        0.106
             PhysHlth        0.157
 HeartDiseaseorAttack        0.177
                  Age        0.178
             HighChol        0.200
             DiffWalk        0.218
                  BMI        0.226
               HighBP        0.263
              GenHlth        0.288

Figure 3.1. Ranked Spearman Correlations of Risk Factors With Diabetes.

We computed Spearman’s rank correlations (ρ) between each predictor and a binary outcome (diabetes or pre-diabetes = 1 vs no diabetes = 0), using pairwise complete cases. The strongest positive associations were poor general health (ρ≈0.26), high blood pressure (ρ≈0.24), BMI (ρ≈0.23), difficulty walking (ρ≈0.22), high cholesterol (ρ≈0.20), age (ρ≈0.18) and heart disease (ρ≈0.16). The strongest inverse associations were income (ρ≈−0.16), education (ρ≈−0.12) and physical activity (ρ≈−0.12), with smaller negatives for fruit/vegetable intake. These directions align with the literature: higher adiposity increases insulin resistance and β-cell stress, while limited mobility/physical activity reduces glucose uptake—both raising diabetes risk (Aune et al., 2015). Likewise, lower socioeconomic position (income/education) is linked to less access to healthy foods, preventive care and screening, and higher prevalence (Agardh et al., 2011). The observed age gradient and male excess are consistent with national surveillance (Gwira, Fryar and Gu, 2024). Importantly, effect sizes are modest (|ρ|≲0.3) and pairwise/unadjusted; with cross-sectional data we can describe associations, not infer causation.

Code
# canonical labels
edu_lvls <- c("No school","Grades 1–8","Grades 9–11",
              "High school","Some college","College 4+")
inc_lvls <- c("<$10k","$10–15k","$15–20k","$20–25k",
              "$25–35k","$35–50k","$50–75k","$75k+")

# ---- Build safe codes then factors (handles numeric or text with -/–) ----
heat_edu_inc <- diabetes %>%
  mutate(
    # normalize text: replace hyphen with en-dash to match labels
    Education_chr = if (is.numeric(Education)) NA_character_ else gsub("-", "–", as.character(Education)),
    Income_chr    = if (is.numeric(Income))    NA_character_ else gsub("-", "–", as.character(Income)),

    EduCode = dplyr::case_when(
      is.numeric(Education)                 ~ as.integer(Education),
      grepl("No school", Education_chr,  TRUE) ~ 1L,
      grepl("Grades\\s*1\\s*–\\s*8", Education_chr) ~ 2L,
      grepl("Grades\\s*9\\s*–\\s*11", Education_chr) ~ 3L,
      grepl("High school", Education_chr, TRUE) ~ 4L,
      grepl("Some college", Education_chr, TRUE) ~ 5L,
      grepl("College\\s*4\\+", Education_chr) ~ 6L,
      TRUE ~ NA_integer_
    ),
    IncCode = dplyr::case_when(
      is.numeric(Income)                  ~ as.integer(Income),
      grepl("^\\s*<\\$?10", Income_chr)  ~ 1L,
      grepl("10\\s*–\\s*15", Income_chr) ~ 2L,
      grepl("15\\s*–\\s*20", Income_chr) ~ 3L,
      grepl("20\\s*–\\s*25", Income_chr) ~ 4L,
      grepl("25\\s*–\\s*35", Income_chr) ~ 5L,
      grepl("35\\s*–\\s*50", Income_chr) ~ 6L,
      grepl("50\\s*–\\s*75", Income_chr) ~ 7L,
      grepl("75\\s*\\+|\\$75k\\+", Income_chr) ~ 8L,
      TRUE ~ NA_integer_
    ),

    Education_f = factor(EduCode, levels = 1:6, labels = edu_lvls),
    Income_f    = factor(IncCode, levels = 1:8, labels = inc_lvls)
  ) %>%
  filter(!is.na(Education_f), !is.na(Income_f)) %>%
  group_by(Education_f, Income_f) %>%
  summarise(
    n = n(),
    diabetes_cases = sum(Diabetes_012 == 2, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    prevalence = diabetes_cases / n,
    size_log2  = log2(pmax(n, 1)),
    text = paste0(
      "<b>Education:</b> ", Education_f,
      "<br><b>Income:</b> ", Income_f,
      "<br><b>n:</b> ", comma(n),
      "<br><b>T2D prevalence:</b> ", percent(prevalence, 0.1)
    )
  )

# Quick sanity check (optional): stop if no rows
if (nrow(heat_edu_inc) == 0) stop("No rows after recoding — check Education/Income values.")

# ---- Plot ----
p <- ggplot(heat_edu_inc, aes(x = Income_f, y = Education_f)) +
  geom_point(aes(size = size_log2, fill = prevalence, text = text),
             shape = 21, colour = "black", stroke = 0.3) +
  scale_fill_gradient(low = "lightyellow", high = "red",
                      name = "T2D prevalence",
                      labels = percent_format(accuracy = 1)) +
  scale_size_continuous(
    name = expression(Sample~size~~log[2](n)),
    range = c(4, 15),
    breaks = log2(c(20, 100, 500, 2000)),
    labels = c("20","100","500","2000")
  ) +
  scale_x_discrete(drop = FALSE) +
  scale_y_discrete(drop = FALSE) +
  labs(
    title = "Diabetes prevalence by education and income",
    x = "Household income",
    y = "Education level"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "right"
  )
Warning in geom_point(aes(size = size_log2, fill = prevalence, text = text), :
Ignoring unknown aesthetics: text
Code
ggplotly(p, tooltip = "text")

Figure 3.2. Interaction of Education and Income in Diabetes Prevalence.

Extending the correlation results, the bubble heatmap localises the socioeconomic gradient by showing Type 2 diabetes prevalence across the joint grid of education × income. Prevalence is highest in low-education/low-income cells and declines near-monotonically with increases in either education or income, with the lowest risk at college 4+/≥$75k bracket. This pattern is consistent with national surveillance showing higher diagnosed diabetes at lower education and income (Chen et al., 2023) and with meta-analytic evidence that lower socioeconomic position increases T2D incidence (Agardh et al., 2011).

Articles

Agardh, E., Allebeck, P., Hallqvist, J., Moradi, T. and Sidorchuk, A. (2011) ‘Type 2 diabetes incidence and socio-economic position: a systematic review and meta-analysis’, International Journal of Epidemiology, 40(3), pp. 804–818. doi:10.1093/ije/dyq155.

American Diabetes Association (ADA) (2024) ‘Standards of Medical Care in Diabetes—2024’, Diabetes Care, 47(Suppl 1), S1–SXX.

Aune, D., Norat, T., Leitzmann, M., Tonstad, S. and Vatten, L.J. (2015) ‘Physical activity and the risk of type 2 diabetes: a systematic review and dose–response meta-analysis’, BMJ, 350, h354. doi:10.1136/bmj.h354.

CDC (Centers for Disease Control and Prevention) (2023a) Prediabetes — Your Chance to Prevent Type 2 Diabetes. Available at: https://www.cdc.gov/diabetes/basics/prediabetes.html (Accessed 5 September 2025).

CDC (Centers for Disease Control and Prevention) (2023b) Prediabetes Facts. Available at: https://www.cdc.gov/diabetes/library/spotlights/diabetes-facts-stats.html (Accessed 5 September 2025).

CDC (Centers for Disease Control and Prevention) (2023c) About the Behavioral Risk Factor Surveillance System (BRFSS). Available at: https://www.cdc.gov/brfss/index.html (Accessed 5 September 2025).

CDC (Centers for Disease Control and Prevention) (2024) BRFSS Annual Survey Data—Methodology and Documentation. Available at: https://www.cdc.gov/brfss/annual_data/annual_data.htm (Accessed 5 September 2025).

CDC (Centers for Disease Control and Prevention) (2024) National Diabetes Statistics Report. Atlanta, GA: CDC. Available at: https://www.cdc.gov/diabetes/data/statistics-report/index.html (Accessed 5 September 2025).

Chen, Y., Zhou, X., Bullard, K.M., Zhang, P., Imperatore, G. and Rolka, D.B. (2023) ‘Income-related inequalities in diagnosed diabetes prevalence among US adults, 2001–2018’, PLOS ONE, 18(4), e0283450. doi:10.1371/journal.pone.0283450.

Gwira, J.A., Fryar, C.D. and Gu, Q. (2024) Prevalence of total, diagnosed, and undiagnosed diabetes in adults: United States, August 2021–August 2023. NCHS Data Brief No. 516. Hyattsville, MD: National Center for Health Statistics, CDC. Available at: https://www.cdc.gov/nchs/data/databriefs/db516.pdf (Accessed 5 September 2025).

Acknowledgements