Dset <- read.table(
  here("Data", "dataset_project23444.txt"),
  header = TRUE,
  sep = "\t"
)
# Table of NA counts for each variable
sapply(Dset, function(x) sum(is.na(x)))
##          ID         sex  family_ses     matsmok      matbmi     gestage 
##           0           0           0           0           0           0 
##     bweight   alcohol31   smoking31       ses31    height31    weight31 
##           0           0           0           0           0           0 
##    waistc31   ldlchol31 totalchol31       age46 chestpain46       t2d46 
##           0           0           0           0           0           0
# Sum of missing values in the data set
sum(is.na(Dset)) 
## [1] 0

Note that the data set has 0 missing values.

BMI at age 31 was calculated for each participant using measured weight and height (kg/m²), and values were rounded to two decimal places using half-up rounding. The resulting BMI values were then categorized into underweight (<18.5), normal weight (18.5–24.9), overweight (25.0–29.9), and obesity (≥30.0) according to standard BMI cut-offs.

# Step 1: calculate bmi for study participants at the age of 31

Dset$bmi31 <- (Dset$weight31)/(Dset$height31/100)^2

Dset <- Dset %>% 
  mutate(bmi31 = round_half_up(bmi31,2))

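# Step 2: categorize BMI into four classes
# Note: right = TRUE makes each interval right-closed, so a BMI of exactly
# 18.5, 25, or 30 is assigned to the lower category

Dset$bmi31_cate <- cut(Dset$bmi31,
                       breaks = c(0, 18.5, 25, 30, Inf),
                       labels = c("Underweight", "Normalweight", "Overweight", "Obesity"),
                       right = TRUE)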
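The handling of values that fall exactly on a cut-off depends on the `right` argument of `cut()`. The sketch below uses made-up BMI values (`bmi_demo` and `bmi_labels` are illustrative names, not objects from the analysis) to show how the two settings classify boundary cases.

```r
# Hypothetical BMI values sitting on, or just below, each cut-off
bmi_demo   <- c(18.49, 18.50, 24.99, 25.00, 29.99, 30.00)
bmi_labels <- c("Underweight", "Normalweight", "Overweight", "Obesity")
bmi_breaks <- c(0, 18.5, 25, 30, Inf)

# right = TRUE: intervals are (0,18.5], (18.5,25], (25,30], (30,Inf]
right_closed <- cut(bmi_demo, breaks = bmi_breaks, labels = bmi_labels, right = TRUE)

# right = FALSE: intervals are [0,18.5), [18.5,25), [25,30), [30,Inf)
left_closed <- cut(bmi_demo, breaks = bmi_breaks, labels = bmi_labels, right = FALSE)

data.frame(bmi = bmi_demo, right_closed, left_closed)
```

With `right = FALSE`, a BMI of exactly 25.00 counts as overweight and 30.00 as obesity, which is the reading implied by the ranges 25.0–29.9 and ≥30.0. Whether any participant sits exactly on a boundary would need to be checked in the data.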

Basic Information

This project uses data from 3,040 participants in a modified random sample of the Northern Finland Birth Cohort 1966 to investigate whether body mass index (BMI) in early adulthood is associated with the risk of doctor-diagnosed type 2 diabetes by age 46. The dataset includes detailed perinatal, socio-demographic, lifestyle, anthropometric, and clinical measures, and is displayed below using an interactive table to facilitate data inspection and exploration.

reactable::reactable(Dset, defaultPageSize = 10, striped = FALSE, highlight = TRUE)

Descriptive Statistics

Descriptive statistics for selected continuous variables

Descriptive statistics were calculated for selected continuous variables, including maternal, birth, and adult anthropometric and lipid measures, by computing the minimum, quartiles, median, mean, maximum, and standard deviation. These summaries were then organized into a formatted table using the gt package to provide a clear overview of the distribution of continuous variables in the study population.

# sapply to compute summary for selected numeric variables
summary_list <- sapply(
  Dset[c("matbmi","gestage","bweight","height31",
         "weight31","ldlchol31","totalchol31","bmi31")],
  function(x) c(summary(x), sd = sd(x, na.rm = TRUE))
)

# data frame for gt
summary_df <- as.data.frame(t(summary_list))
summary_df <- tibble::rownames_to_column(summary_df, "Variable")

# gt table
summary_df %>%
  gt() %>%
  tab_header(
    title = "Summary Statistics of Continuous Variables",
    subtitle = "Study population"
  ) %>%
  cols_label(
    Variable = "Variable",
    Min. = "Minimum",
    "1st Qu." = "Q1 (25th percentile)",
    Median = "Median",
    Mean = "Mean",
    "3rd Qu." = "Q3 (75th percentile)",
    Max. = "Maximum",
    sd = "Standard Deviation"
  ) %>%
  fmt_number(
    columns = c(Min., `1st Qu.`, Median, Mean, `3rd Qu.`, Max., sd),
    decimals = 2
  )
Summary Statistics of Continuous Variables
Study population

Variable      Minimum  Q1 (25th percentile)    Median      Mean  Q3 (75th percentile)   Maximum  Standard Deviation
matbmi          14.50                 21.00     22.70     23.21                 24.80     40.30                3.22
gestage         26.00                 39.00     40.00     40.06                 41.00     46.00                1.86
bweight      1,229.00              3,167.00  3,501.00  3,500.46              3,830.00  6,080.00              517.79
height31       136.00                164.00    171.00    171.19                178.00    203.00                9.27
weight31        40.00                 61.00     70.00     72.37                 81.00    151.00               14.87
ldlchol31        0.58                  2.38      2.90      2.97                  3.50      7.18                0.87
totalchol31      2.00                  4.38      4.94      5.04                  5.61      9.75                0.95
bmi31           15.35                 21.80     23.94     24.60                 26.53     52.89                4.24

The table above presents summary statistics for key continuous variables in the study population. Maternal BMI had a mean of 23.2 kg/m² (SD = 3.2), while gestational age was centered around full term with a mean of 40.1 weeks (SD = 1.9). Birth weight averaged 3,500 g (SD = 518), indicating a generally healthy birth weight distribution. At age 31, participants had a mean height of 171.2 cm (SD = 9.3) and a mean weight of 72.4 kg (SD = 14.9).

Cardiometabolic risk markers in early adulthood showed moderate variability. Mean BMI at age 31 was 24.6 kg/m² (SD = 4.2), with the interquartile range spanning from 21.8 to 26.5 kg/m², suggesting that a substantial proportion of participants were overweight. Mean LDL cholesterol was 3.0 mmol/L (SD = 0.9) and mean total cholesterol was 5.0 mmol/L (SD = 1.0), values consistent with a population-based adult cohort. Overall, the distributions indicate adequate variability across exposures relevant to the investigation of later type 2 diabetes risk.

Descriptive statistics for selected categorical variables

Descriptive statistics for categorical variables were generated by calculating frequency counts and corresponding percentages for each category, including sex, socio-economic indicators, lifestyle factors, clinical outcomes, and BMI categories at age 31. These results were compiled into a formatted table using the gt package to summarize the distribution of categorical characteristics in the study population and to facilitate interpretation of prevalence patterns relevant to type 2 diabetes risk.

# Categorical variables
cat_vars <- c(
  "sex", "family_ses", "matsmok", "alcohol31",
  "smoking31", "ses31", "chestpain46", "t2d46", "bmi31_cate"
)

# summary table
summary_cat <- do.call(rbind, lapply(cat_vars, function(v) {
  tab <- table(Dset[[v]], useNA = "ifany")
  
  data.frame(
    Variable = v,
    Category = names(tab),
    Count = as.numeric(tab),
    Percentage = as.numeric(round_half_up(100 * tab / sum(tab), 1))
  )
}))

# gt table
summary_cat %>%
  gt() %>%
  tab_header(
    title = "Summary Statistics of Categorical Variables",
    subtitle = "Study population distributions"
  ) %>%
  cols_label(
    Variable = "Variable",
    Category = "Category",
    Count = "Count",
    Percentage = "Percentage (%)"
  ) %>%
  fmt_number(
    columns = Percentage,
    decimals = 1
  )
Summary Statistics of Categorical Variables
Study population distributions

Variable     Category      Count  Percentage (%)
sex          1              1406            46.3
sex          2              1634            53.8
family_ses   1               224             7.4
family_ses   2               559            18.4
family_ses   3              1044            34.3
family_ses   4               594            19.5
family_ses   5               605            19.9
family_ses   6                14             0.5
matsmok      0              2640            86.8
matsmok      1               400            13.2
alcohol31    0               276             9.1
alcohol31    1              2732            89.9
alcohol31    2                32             1.1
smoking31    0              1888            62.1
smoking31    1              1152            37.9
ses31        1               772            25.4
ses31        2              1014            33.4
ses31        3               734            24.1
ses31        4               413            13.6
ses31        5               107             3.5
chestpain46  0              2776            91.3
chestpain46  1               256             8.4
chestpain46  2                 8             0.3
t2d46        1              2946            96.9
t2d46        2                94             3.1
bmi31_cate   Underweight      78             2.6
bmi31_cate   Normalweight   1778            58.5
bmi31_cate   Overweight      914            30.1
bmi31_cate   Obesity         270             8.9

The table above summarizes the distribution of categorical characteristics in the study population. Females comprised a slightly larger proportion of the cohort (53.8%) compared with males (46.3%). Family socio-economic status at birth was predominantly represented by skilled and unskilled worker categories, together accounting for over half of participants, while very few individuals had parents who were both retired (0.5%). Maternal smoking during pregnancy was reported for 13.2% of participants, whereas the majority (86.8%) were not exposed.

At age 31, most participants reported moderate alcohol use (89.9%), with relatively small proportions of non-users (9.1%) and heavy users (1.1%). Approximately 38% of the cohort were current smokers at age 31. Adult socio-economic status was broadly distributed, with skilled workers (33.4%) and professionals (25.4%) forming the largest groups.

By age 46, chest pain under strain was uncommon (8.7% combined mild or severe), and doctor-diagnosed type 2 diabetes was observed in 3.1% of participants. In terms of BMI at age 31, the majority were of normal weight (58.5%), while 30.1% were overweight and 8.9% were classified as obese, indicating that a substantial proportion of the cohort was exposed to elevated BMI in early adulthood.

Distribution of Continuous Variables by Type 2 Diabetes Status

Kernel density plots were generated for selected continuous variables and stratified by type 2 diabetes status at age 46 to visually compare their distributions between individuals with and without diabetes. The plots are faceted by variable with free scales, allowing differences in central tendency and spread across T2D groups to be examined without imposing common axis constraints.

# Define continuous variables
cont_vars <- c("matbmi","gestage","bweight","height31",
               "weight31","ldlchol31","totalchol31","bmi31")

Dset %>%
  select(t2d46, all_of(cont_vars)) %>%
  tidyr::pivot_longer(-t2d46, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value, fill = factor(t2d46))) +
  geom_density(alpha = 0.5) +
  facet_wrap(~Variable, scales = "free") +
  labs(title = "Continuous Variables Across T2D Status",
       x = "Value", y = "Density",
       fill = "T2D at 46 (1 = No, 2 = Yes)") +
  theme_minimal()

The figure presents kernel density plots of selected continuous variables stratified by type 2 diabetes (T2D) status at age 46, allowing visual comparison of their distributions between individuals with and without T2D. Variables related to adiposity, including BMI at age 31, weight at age 31, and maternal BMI, show a clear rightward shift among participants with T2D, indicating higher average values and greater variability in this group. This pattern suggests that higher body mass in early adulthood and maternal adiposity are associated with an increased risk of developing T2D later in life.

In contrast, variables such as height at age 31, gestational age, and birth weight display largely overlapping distributions between T2D groups, implying little evidence of association with later T2D status. Lipid measures (LDL cholesterol and total cholesterol at age 31) show modest rightward shifts for individuals with T2D, although substantial overlap remains. Overall, the figure highlights adiposity-related measures as the most distinct differentiators between T2D groups, supporting their relevance as key exposures in subsequent analyses.

Distribution of Type 2 Diabetes Across Categorical Predictors

Stacked bar charts were created to display the proportion of participants with and without type 2 diabetes at age 46 across categories of BMI at age 31, demographic, lifestyle, socio-economic, and clinical variables. The plots are faceted by predictor and show percentage distributions within each category, facilitating comparison of T2D prevalence patterns across key risk factors.

Dset %>%
  select(t2d46, bmi31_cate, sex, matsmok,smoking31, family_ses,ses31, alcohol31, chestpain46) %>%
  mutate(across(-t2d46, as.character)) %>%   # convert predictors to character
  pivot_longer(-t2d46, names_to = "Variable", values_to = "Category") %>%
  group_by(Variable, Category, t2d46) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Variable, Category) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  ggplot(aes(x = Category, y = percent, fill = factor(t2d46))) +
  geom_col(position = "stack") +   # percent already sums to 100 within each bar
  facet_wrap(~Variable, scales = "free_x") +
  labs(title = "Type 2 DM at 46 Across Predictors",
       y = "Percentage (%)",
       fill = "T2D (1=No, 2=Yes)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The stacked bar charts show the percentage distribution of type 2 diabetes (T2D) status at age 46 across categories of key demographic, lifestyle, socio-economic, and clinical predictors. Within each category, bars are normalized to 100%, allowing comparison of the relative proportion of individuals with and without T2D. A clear gradient is observed across BMI categories at age 31, with the proportion of T2D being lowest among normal-weight individuals and highest among those classified as obese, highlighting BMI as a strong risk factor for later T2D.

For most other predictors, including sex, smoking status, alcohol use, socio-economic status at birth and at age 31, and maternal smoking during pregnancy, the proportion of T2D remains low across categories, reflecting the overall low prevalence of T2D in the cohort. Slightly higher proportions of T2D are observed among participants reporting chest pain at age 46 and among current smokers, although differences are modest. Overall, the figure emphasizes the prominent role of early-adulthood adiposity compared with other categorical predictors in distinguishing T2D risk by midlife.

Outlier Inspection of Continuous Variables

Boxplots were created for all selected continuous variables to visually identify potential outliers, with points outside the whiskers highlighted in red. Each variable is displayed in a separate facet with free scales, allowing clear detection of extreme values and comparison of distributional spread across variables.

Dset %>%
  select(all_of(cont_vars)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(
    outlier.colour = "red", outlier.shape = 21, outlier.size = 3,
    fill = "#4DB6AC", color = "#00695C"
  ) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  labs(
    title = "Outlier Inspection via Boxplots",
    subtitle = "Red points indicate potential outliers",
    x = NULL,
    y = "Value"
  ) +
  theme_minimal(base_size = 16) +
  theme(
    strip.text = element_text(face = "bold", size = 15),
    axis.text.x = element_blank(),
    axis.text.y = element_text(size = 13),
    axis.title.y = element_text(size = 14),
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
    plot.subtitle = element_text(size = 14, hjust = 0.5),
    panel.spacing = unit(1.5, "lines")
  )

The figure above presents a series of boxplots used to assess the distributions and identify potential outliers across the eight selected continuous variables, including BMI, birth weight, and cholesterol levels. In these plots, the central box represents the interquartile range (IQR), while the “whiskers” extend to the most extreme observations lying within \(1.5 \times \text{IQR}\) of the hinges. Data points plotted in red fall outside this range, marking them as potential outliers that may require further checking or data cleaning prior to downstream modeling.
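The \(1.5 \times \text{IQR}\) rule can be reproduced by hand. The sketch below uses a small vector of made-up values (not study data) and base R's `fivenum()` hinges; ggplot2 documents its hinges as the 25th and 75th percentiles, which can differ slightly from `fivenum()` in small samples.

```r
# Hypothetical values, not taken from the study data
x <- c(18.2, 20.1, 21.8, 22.5, 23.9, 24.4, 26.5, 27.0, 29.3, 52.9)

# Tukey's hinges are elements 2 and 4 of the five-number summary
hinges <- fivenum(x)[c(2, 4)]
step   <- 1.5 * diff(hinges)
lower_fence <- hinges[1] - step
upper_fence <- hinges[2] + step

# Points beyond the fences are the ones a boxplot draws individually
flagged <- x[x < lower_fence | x > upper_fence]
flagged

# Base R applies the same rule (coef = 1.5 by default)
identical(sort(flagged), sort(boxplot.stats(x)$out))
```

Applied to a column of `Dset`, the same steps would list the flagged values themselves rather than only showing them as red points.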

A preliminary review of the plots indicates varying degrees of variance across the features. For instance, variables such as bmi31, weight31, and gestage exhibit a noticeable cluster of high-end outliers, suggesting a right-skewed distribution or the presence of extreme clinical cases within the cohort. Conversely, height31 and bweight show a more symmetric distribution of outliers at both the upper and lower bounds. These visualizations serve as a critical diagnostic step, ensuring that any influential data points are identified and handled appropriately to maintain the robustness of the subsequent analysis.

Correlation Analysis: Binary and Continuous Variable Associations

This analysis begins by encoding the categorical type 2 diabetes variable into a numeric binary format to facilitate the calculation of point-biserial correlation coefficients against all continuous predictors. The resulting coefficients are then visualized in a sorted bar chart, allowing for a clear identification of which clinical factors, such as BMI or cholesterol, share the strongest positive or negative associations with a diabetes diagnosis at age 46.

Point-Biserial Correlation Coefficient (\(r_{pb}\))

The point-biserial correlation is used to measure the strength and direction of the association between a continuous variable and a dichotomous (binary) variable. It is a specialized case of the Pearson correlation coefficient.

The formula for the point-biserial correlation coefficient is:

\[r_{pb} = \frac{M_1 - M_0}{S_n}\sqrt{pq}\]

where:

  • M1 = mean of the continuous variable in the group coded “1” on the binary variable.

  • M0 = mean of the continuous variable in the group coded “0”.

  • Sn = standard deviation of the continuous variable across all observations (computed with divisor n).

  • p = proportion of cases in the “0” group.

  • q = proportion of cases in the “1” group.
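Using the quantities listed above, the coefficient is \((M_1 - M_0)/S_n \cdot \sqrt{pq}\), and it coincides with the Pearson correlation between the continuous variable and the 0/1 indicator. The sketch below checks this on simulated data; `group` and `value` are illustrative names and the numbers are not from the cohort.

```r
set.seed(1)  # simulated data, purely illustrative
group <- rbinom(200, size = 1, prob = 0.3)          # binary variable coded 0/1
value <- rnorm(200, mean = 24 + 2 * group, sd = 4)  # continuous variable

mean1 <- mean(value[group == 1])
mean0 <- mean(value[group == 0])
p <- mean(group == 0)
q <- mean(group == 1)

# S_n: standard deviation with divisor n (not n - 1)
s_n <- sqrt(mean((value - mean(value))^2))

r_pb <- (mean1 - mean0) / s_n * sqrt(p * q)

# Agrees with Pearson's r up to floating-point error
all.equal(r_pb, cor(value, group))
```

This equivalence is why calling `cor()` on a 0/1 recoding of the outcome yields point-biserial coefficients without a dedicated function.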

Dset <- Dset %>%
  mutate(t2d46_num = ifelse(t2d46 == 2, 1, 0))  # 1 = Yes, 0 = No
cor_data <- Dset %>%
  select(t2d46_num, all_of(cont_vars)) %>%
  na.omit()

# Correlation matrix
cor_matrix <- cor(cor_data)

# Extract correlations with t2d46
cor_with_t2d <- cor_matrix["t2d46_num", -1]
cor_with_t2d
##      matbmi     gestage     bweight    height31    weight31   ldlchol31 
##  0.04192280  0.02165326 -0.02331914 -0.01575756  0.19476188  0.05644972 
## totalchol31       bmi31 
##  0.05812481  0.24561773
# Bar chart of the correlations with T2D
cor_df <- data.frame(
  Variable = names(cor_with_t2d),
  Correlation = cor_with_t2d
)

ggplot(cor_df, aes(x = reorder(Variable, Correlation), y = Correlation, fill = Correlation)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0) +
  labs(title = "Correlation of Continuous Variables with T2D at 46",
       x = "Variable", y = "Correlation (Point-Biserial)") +
  theme_minimal()

Correlation Results: Continuous Predictors of T2D at Age 46

The bar chart illustrates the strength and direction of the association between the continuous clinical variables and the presence of Type 2 Diabetes (T2D) at age 46, measured using point-biserial correlation coefficients. The results reveal that bmi31 (\(r_{pb} \approx 0.25\)) and weight31 (\(r_{pb} \approx 0.19\)) have the strongest positive correlations, indicating that higher body mass index and weight at age 31 are associated with an increased likelihood of a T2D diagnosis. Other factors such as total cholesterol, LDL cholesterol, and maternal BMI show weaker but positive correlations, suggesting a modest link between these markers and later diabetes.

In contrast, variables such as bweight (birth weight) and height31 display negligible negative correlations, with coefficients of approximately \(-0.023\) and \(-0.016\), respectively. These values suggest that as birth weight or height increases, there is a very slight, perhaps non-significant, decrease in T2D risk within this specific cohort. Collectively, this visualization prioritizes metabolic and weight-related metrics as the primary continuous variables of interest for further predictive modeling.

Model Development and Inferential Analysis

To investigate the aetiological risk factors for Type 2 Diabetes (T2D), an unadjusted logistic regression model was developed using the logit link function to assess the impact of early adulthood BMI (at age 31) on the risk of a doctor-verified diagnosis by age 46. This initial model serves as a baseline to quantify the crude association between body mass index and metabolic outcomes before incorporating pre-specified confounders such as sex, smoking, and socio-economic position.

Statistical Methods and Study Context

Type 2 diabetes represents a significant global disease burden, making the identification of early-life risk factors a public health priority. This study utilizes data from the Northern Finland Birth Cohort 1966 to examine the potential causal effect of BMI at age 31 on the risk of T2D at age 46. The analysis employs logistic regression, as it is the standard method for modeling binary outcomes in epidemiological research. In the data, t2d46 is coded 1 for “No” and 2 for “Yes”; because glm() treats the first level of a factor response as the reference, the fitted model estimates the probability of “Yes”.

Summary of Variables used in the Model:

  • Outcome (\(Y\)): t2d46 (Type 2 diabetes diagnosis by age 46).
  • Predictor (\(X\)): bmi31 (Body Mass Index calculated from weight and height at age 31).
  • Dataset: A prospective pregnancy-birth cohort sample (NFBC1966).
m1 <- glm(factor(t2d46) ~ bmi31, family = binomial(link = "logit"), data = Dset)
summary(m1)
## 
## Call:
## glm(formula = factor(t2d46) ~ bmi31, family = binomial(link = "logit"), 
##     data = Dset)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.61268    0.51443  -16.74   <2e-16 ***
## bmi31        0.19329    0.01737   11.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 838.61  on 3039  degrees of freedom
## Residual deviance: 719.91  on 3038  degrees of freedom
## AIC: 723.91
## 
## Number of Fisher Scoring iterations: 6

Model Estimation and Odds Ratio Interpretation

The unadjusted logistic regression model identifies a statistically significant association between Body Mass Index at age 31 and the risk of Type 2 Diabetes at age 46. Based on the maximum likelihood estimates from the model output, the relationship is expressed through the following estimated logit equation:

\[\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = -8.61 + 0.19 \times \text{bmi31}\]

The slope coefficient (\(\hat{\beta}_1 = 0.1933\)) indicates that for every one-unit increase in BMI (\(kg/m^2\)) at age 31, the log-odds of a diabetes diagnosis by age 46 increase by about 0.19. The association is strongly supported by the data (\(p < 2 \times 10^{-16}\)), although an unadjusted model alone cannot establish that early adulthood adiposity is a causal driver of later metabolic disease in the Northern Finland Birth Cohort 1966.
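To make the logit equation more tangible, the sketch below converts it into predicted probabilities for a few illustrative BMI values. The coefficients are copied from the `summary(m1)` output; `bmi_grid` is a hypothetical set of values chosen for illustration.

```r
# Coefficients copied from the summary(m1) output
b0 <- -8.61268
b1 <- 0.19329

bmi_grid <- c(20, 25, 30, 35)   # illustrative BMI values
log_odds <- b0 + b1 * bmi_grid
prob     <- plogis(log_odds)    # inverse logit: 1 / (1 + exp(-x))

data.frame(bmi31 = bmi_grid, log_odds, prob = round(prob, 4))
```

With the fitted object available, `predict(m1, newdata = data.frame(bmi31 = bmi_grid), type = "response")` should give the same probabilities up to the rounding of the copied coefficients.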

exp(coef(summary(m1))[2,1])
## [1] 1.213234
exp(coef(summary(m1))[2])
## [1] 1.213234
confint_or_m1 <- exp(confint(m1))
## Waiting for profiling to be done...
confint_or_m1
##                    2.5 %       97.5 %
## (Intercept) 0.0000646349 0.0004873959
## bmi31       1.1730546789 1.2558828436

Statistical Precision of the Effect Size

To provide a clinically meaningful interpretation, the coefficient was exponentiated to calculate the Odds Ratio (OR) along with its 95% confidence interval. The analysis yielded an OR of 1.213 (95% CI: 1.173, 1.256).

  • Interpretation: For each additional unit of BMI at age 31, the odds of developing Type 2 Diabetes by age 46 increase by 21.3%.

  • Significance: Since the 95% confidence interval [1.173, 1.256] is entirely above the null value of 1.0, the effect is considered statistically significant at the 5% level.

  • Precision: The narrow width of the interval suggests a high level of precision in our estimate, reinforcing BMI as a robust predictor in this population.
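As a rough cross-check on the interval, a Wald-type confidence interval can be computed directly from the estimate and its standard error. The sketch below uses the values printed in `summary(m1)`. It is an approximation to the profile-likelihood interval returned by `confint()`, not a replacement for it, and the 5-unit contrast is only an illustrative choice.

```r
# Estimate and standard error copied from the summary(m1) output
beta <- 0.19329
se   <- 0.01737

# Wald 95% interval on the log-odds scale, then exponentiated
z_crit  <- qnorm(0.975)
wald_ci <- exp(beta + c(-1, 1) * z_crit * se)
round(wald_ci, 3)

# Odds ratio for a 5 kg/m^2 difference: exponentiate 5 times the slope
or_per_5 <- exp(5 * beta)
round(or_per_5, 2)
```

With a sample of this size, the Wald and profile-likelihood limits are expected to agree closely. Scaling the slope before exponentiating is often easier to interpret than a per-unit odds ratio.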

Adjusted Logistic Regression Model

To distinguish a potential causal effect of early adulthood BMI, a multivariable logistic regression model was developed to adjust for a pre-specified set of confounders including sex, smoking status, socio-economic position, and alcohol use. The resulting model estimates the independent association between BMI at age 31 and the risk of a doctor-verified Type 2 Diabetes diagnosis by age 46 while holding these demographic and lifestyle factors constant.

Mathematical Model Representation

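Based on the variables included in model m2, the estimated logit equation is expressed in the following form:

\[\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1\,\text{bmi31} + \hat{\beta}_2\,\text{sex}_2 + \hat{\beta}_3\,\text{smoking31}_1 + \sum_{j=2}^{5}\hat{\gamma}_j\,\text{ses31}_j + \sum_{k=1}^{2}\hat{\delta}_k\,\text{alcohol31}_k\]

Where \(\hat{p}\) is the estimated probability of a T2D diagnosis by age 46, and each subscripted term is a 0/1 indicator for that level of the categorical variable. The lowest level of each factor (sex = 1, smoking31 = 0, ses31 = 1, alcohol31 = 0) serves as the reference category.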

m2 <- glm(factor(t2d46)~bmi31+factor(sex)+factor(smoking31)+factor(ses31)+factor(alcohol31), family = binomial(link = "logit"), data = Dset)

summary(m2)
## 
## Call:
## glm(formula = factor(t2d46) ~ bmi31 + factor(sex) + factor(smoking31) + 
##     factor(ses31) + factor(alcohol31), family = binomial(link = "logit"), 
##     data = Dset)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -8.73286    0.64576 -13.523   <2e-16 ***
## bmi31               0.18809    0.01767  10.646   <2e-16 ***
## factor(sex)2       -0.05775    0.23713  -0.244   0.8076    
## factor(smoking31)1  0.35789    0.22751   1.573   0.1157    
## factor(ses31)2      0.44826    0.34975   1.282   0.2000    
## factor(ses31)3      0.65518    0.34791   1.883   0.0597 .  
## factor(ses31)4      0.53525    0.40508   1.321   0.1864    
## factor(ses31)5      0.73881    0.57074   1.294   0.1955    
## factor(alcohol31)1 -0.36646    0.35023  -1.046   0.2954    
## factor(alcohol31)2  0.67587    0.73327   0.922   0.3567    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 838.61  on 3039  degrees of freedom
## Residual deviance: 709.44  on 3030  degrees of freedom
## AIC: 729.44
## 
## Number of Fisher Scoring iterations: 7

Results of the Adjusted Logistic Regression Model

The adjusted multivariable logistic regression confirms that Body Mass Index at age 31 remains a highly significant independent predictor of Type 2 Diabetes by age 46, even after controlling for sex, smoking, socio-economic status, and alcohol use. The association for this model is defined by the following estimated logit equation:

\[\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = -8.73 + 0.19 \times \text{bmi31} - 0.06 \times \text{sex}_2 + 0.36 \times \text{smoking}_{31} + \dots + \hat{\beta}_p x_p\]

The adjusted coefficient for bmi31 (\(\hat{\beta} = 0.188\), \(p < 2 \times 10^{-16}\)) indicates that for every unit increase in BMI, the log-odds of a diabetes diagnosis increase by \(0.188\) when all other lifestyle and demographic factors are held constant.

While BMI shows a robust association, other factors such as sex, smoking, and alcohol use did not reach statistical significance at the \(5\%\) level in this cohort, though ses31 level 3 (relative to level 1) showed a marginal trend toward higher risk (\(p = 0.0597\)).

exp(coef(summary(m2))[2,1])
## [1] 1.206946
exp(coef(summary(m2))[2])
## [1] 1.206946
confint_or_m2 <- exp(confint(m2))
## Waiting for profiling to be done...
confint_or_m2
##                           2.5 %       97.5 %
## (Intercept)        4.335957e-05 0.0005484571
## bmi31              1.166415e+00 1.2502550643
## factor(sex)2       5.919939e-01 1.5035157431
## factor(smoking31)1 9.139092e-01 2.2354502230
## factor(ses31)2     8.043873e-01 3.2040836511
## factor(ses31)3     9.924902e-01 3.9264064987
## factor(ses31)4     7.679660e-01 3.8165388078
## factor(ses31)5     6.206422e-01 6.0555479306
## factor(alcohol31)1 3.625562e-01 1.4476844214
## factor(alcohol31)2 3.932591e-01 7.5553870456

The Adjusted Odds Ratio (aOR) for bmi31 is 1.207, indicating that for every \(1\text{ kg/m}^2\) increase in BMI at age 31, the odds of a Type 2 Diabetes diagnosis by age 46 increase by 20.7%, independent of the other included risk factors. This value is nearly identical to the unadjusted OR (1.213), suggesting that the relationship between BMI and later-life diabetes is not strongly confounded by the demographic and lifestyle variables assessed here.
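The similarity of the two estimates can be quantified as a relative change. The sketch below uses the odds ratios printed above. Comparing the change with a threshold such as 10% is a common epidemiological rule of thumb, not something specified in this analysis.

```r
or_crude    <- 1.213234   # from the exp(coef(...)) output for m1
or_adjusted <- 1.206946   # from the exp(coef(...)) output for m2

# Relative change in the odds ratio after adjustment, in percent
pct_change_or <- 100 * (or_adjusted - or_crude) / or_crude

# The same comparison on the log-odds (coefficient) scale
pct_change_beta <- 100 * (log(or_adjusted) - log(or_crude)) / log(or_crude)

round(c(OR = pct_change_or, beta = pct_change_beta), 2)
```

Both changes are small in magnitude, in line with the statement that the measured covariates do not strongly confound the BMI association.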

Precision and Significance of Adjusted Effects

The reliability of the adjusted BMI estimate is confirmed by the 95% confidence interval of [1.166, 1.250]. Since this interval does not include the null value of 1.0, the effect of BMI is highly significant. In contrast, all other predictors, including sex and alcohol use, show confidence intervals that cross 1.0 (e.g., factor(sex)2 CI: [0.59, 1.50]), indicating that they are not statistically significant independent predictors in this specific model. The narrow interval for BMI reflects high precision in the estimate, reinforcing its role as a key aetiological factor for metabolic health in this cohort.

Comparative Analysis of Model Performance

This analysis consolidates the results from the unadjusted and adjusted logistic regression models into a summary data frame to facilitate a direct comparison of the odds ratios and their associated confidence intervals. The findings are then visualized using a forest plot, which provides a clear graphical representation of how the association between early adulthood BMI and Type 2 Diabetes remains robust and statistically significant (\(p < 0.05\)) even after controlling for potential confounders.

results <- data.frame(
  Model = c("Unadjusted", "Adjusted"),
  OR = c(exp(coef(m1)[2]), exp(coef(m2)[2])),
  CI_lower = c(confint_or_m1[2,1], confint_or_m2[2,1]),
  CI_upper = c(confint_or_m1[2,2], confint_or_m2[2,2]),
  Pvalue = c(coef(summary(m1))[2,4], coef(summary(m2))[2,4])
)

results
##        Model       OR CI_lower CI_upper       Pvalue
## 1 Unadjusted 1.213234 1.173055 1.255883 9.133512e-29
## 2   Adjusted 1.206946 1.166415 1.250255 1.821156e-26
ggplot(results, aes(x = OR, y = Model)) +
  geom_point(shape = 21, size = 5, fill = "steelblue", color = "black") +
  # width (not height) sets the cap size when orientation = "y"
  geom_errorbar(aes(xmin = CI_lower, xmax = CI_upper), orientation = "y", width = 0.2, linewidth = 1) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey40") +
  geom_text(aes(label = sprintf("OR=%.2f\np=%.1e", OR, Pvalue)), vjust = -1.5, fontface = "bold") +
  scale_x_log10(
    name = "Odds Ratio (log scale)",
    limits = c(0.5, 2),
    breaks = c(0.5, 1, 1.5, 2)
  ) +
  labs(y = "Model") +
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.major.y = element_blank(),
    axis.title = element_text(face = "bold")
  )

Analysis of Risk Factors and Model Predictive Value

The visual analysis begins with a point-biserial correlation mapping, which identifies bmi31 (\(r \approx 0.25\)) and weight31 (\(r \approx 0.19\)) as the primary continuous predictors positively associated with a Type 2 Diabetes diagnosis at age 46. While metabolic markers like cholesterol and maternal BMI show modest positive trends, birth weight (bweight) and height show negligible inverse relationships. This preliminary screening establishes early adulthood adiposity as the most substantial candidate for formal inferential modeling within this cohort.

The subsequent forest plot compares the unadjusted and multivariable-adjusted logistic regression models to isolate the specific impact of BMI. The two models yield nearly identical odds ratios (1.213 unadjusted and 1.207 adjusted, both 1.21 after rounding), so for every \(1\text{ kg/m}^2\) increase in BMI at age 31, the odds of developing Type 2 Diabetes by age 46 increase by about 21%. The high statistical significance (\(p < 0.001\)) and the narrow, stable confidence intervals, which remain entirely above the null value of 1.0 after adjusting for sex, smoking, alcohol use, and socio-economic status, are consistent with BMI being a robust, independent risk factor rather than a product of confounding by the measured lifestyle variables.

Model Diagnostics and Significance Testing

This stage of the analysis focuses on evaluating the statistical significance of each predictor within the adjusted logistic regression model using Wald statistics. By calculating the z-values (ratios of coefficients to their standard errors) and subsequent Wald chi-square statistics, we can rigorously test the null hypothesis for each individual parameter, specifically verifying the robust contribution of BMI at age 31 while accounting for the variance explained by the demographic and lifestyle confounders.

Mathematical Summary of the Wald Test

The Wald test determines whether a parameter \(\hat{\beta}\) is significantly different from zero using the following relationship:

\[Z = \frac{\hat{\beta}}{SE(\hat{\beta})}\]

The resulting z-values and their squared counterparts (chi-square statistics with one degree of freedom) provide a standardized measure of evidence for each variable.

coef(m2) 
##        (Intercept)              bmi31       factor(sex)2 factor(smoking31)1 
##        -8.73285918         0.18809297        -0.05774806         0.35789298 
##     factor(ses31)2     factor(ses31)3     factor(ses31)4     factor(ses31)5 
##         0.44826441         0.65517657         0.53525042         0.73881443 
## factor(alcohol31)1 factor(alcohol31)2 
##        -0.36646481         0.67587125
sqrt(diag(vcov(m2)))
##        (Intercept)              bmi31       factor(sex)2 factor(smoking31)1 
##         0.64576322         0.01766803         0.23712968         0.22751303 
##     factor(ses31)2     factor(ses31)3     factor(ses31)4     factor(ses31)5 
##         0.34975494         0.34791237         0.40507553         0.57073853 
## factor(alcohol31)1 factor(alcohol31)2 
##         0.35022789         0.73326971
# Wald z-statistics
z_values <- coef(m2) / sqrt(diag(vcov(m2)))
z_values
##        (Intercept)              bmi31       factor(sex)2 factor(smoking31)1 
##        -13.5233147         10.6459507         -0.2435295          1.5730659 
##     factor(ses31)2     factor(ses31)3     factor(ses31)4     factor(ses31)5 
##          1.2816528          1.8831655          1.3213596          1.2944884 
## factor(alcohol31)1 factor(alcohol31)2 
##         -1.0463610          0.9217226
# Wald chi-square
wald_chisq <- z_values^2
wald_chisq
##        (Intercept)              bmi31       factor(sex)2 factor(smoking31)1 
##        182.8800400        113.3362670          0.0593066          2.4745362 
##     factor(ses31)2     factor(ses31)3     factor(ses31)4     factor(ses31)5 
##          1.6426340          3.5463122          1.7459911          1.6757003 
## factor(alcohol31)1 factor(alcohol31)2 
##          1.0948714          0.8495725
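The statistics above can be turned into p-values. The sketch below is a minimal illustration that re-enters three of the printed estimates and standard errors by hand (the short names `ses31_3` and `smoking31_1` are labels chosen for this example); with the fitted object available, the same arithmetic would apply to the full `z_values` vector.

```r
# Estimates and standard errors copied from the printed output (three rows only)
est <- c(bmi31 = 0.18809297, ses31_3 = 0.65517657, smoking31_1 = 0.35789298)
se  <- c(bmi31 = 0.01766803, ses31_3 = 0.34791237, smoking31_1 = 0.22751303)

z     <- est / se   # Wald z-statistics
chisq <- z^2        # Wald chi-square statistics (1 df each)

# Two-sided p-values: the normal and chi-square routes are equivalent
p_from_z     <- 2 * pnorm(-abs(z))
p_from_chisq <- pchisq(chisq, df = 1, lower.tail = FALSE)

data.frame(z = round(z, 3), chisq = round(chisq, 3), p = signif(p_from_z, 3))
```

A coefficient is significant at the 5% level when \(|z| > 1.96\), which is the same as a Wald chi-square above 3.84.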

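The diagnostic output highlights the dominant influence of BMI relative to the other lifestyle and demographic variables. BMI at age 31 has \(z \approx 10.65\) (Wald \(\chi^2 \approx 113.3\)), whereas no other covariate reaches the conventional 5% threshold of \(|z| > 1.96\) (\(\chi^2 > 3.84\)); the largest, SES category 3, has \(z \approx 1.88\) (\(\chi^2 \approx 3.55\)).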

Model Comparison via Information Criteria:

This diagnostic step uses the Akaike Information Criterion (AIC) to compare the relative quality of the unadjusted model (\(m_1\)) and the multivariable-adjusted model (\(m_2\)). By penalizing the number of estimated parameters, this metric identifies which model achieves the better balance between goodness-of-fit and parsimony, helping to determine whether including the confounders improves the fit enough to justify the added complexity.

Mathematical Framework:

Akaike Information Criterion (AIC)

The AIC is a standard statistical tool used for model selection, defined by the following formula:

\[AIC = 2k - 2\ln(\hat{L})\]

Where:

  • \(k\): The number of estimated parameters in the model (e.g., intercept and slopes).

  • \(\hat{L}\): The maximum value of the likelihood function for the model.

  • Interpretation: When comparing models, the one with the lower AIC value is generally preferred as it indicates a better fit with fewer unnecessary variables.

AIC(m1, m2)
##    df      AIC
## m1  2 723.9104
## m2 10 729.4353

Based on the output, the unadjusted model (\(m_1\)) has an AIC of 723.91, while the multivariable-adjusted model (\(m_2\)) has an AIC of 729.44.

  • Model Selection: Because the unadjusted model (\(m_1\)) yields a lower AIC value than the adjusted model (\(m_2\)), it is considered the more parsimonious fit for the data.

  • Predictive Efficiency: The increase in AIC for the adjusted model indicates that the addition of extra parameters (sex, smoking, socio-economic status, and alcohol use) does not sufficiently improve the model’s likelihood to justify the added complexity.

  • Conclusion: These results suggest that BMI at age 31 is such a dominant predictor that the additional demographic and lifestyle variables provide little further explanatory power regarding the risk of Type 2 Diabetes in this cohort.
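The AIC values can be reproduced by hand from the formula. The sketch below assumes that `m1` and `m2` are the same fits as the `Unadjusted` and `Adjusted` models in the pseudo-\(R^2\) section, and takes their log-likelihoods from the `llh` entries printed there; the parameter counts come from the `df` column of `AIC(m1, m2)`. The likelihood-ratio test at the end is not part of the original analysis and is included only as a complementary check for nested models.

```r
# Log-likelihoods (pR2 output, llh) and parameter counts (AIC output, df)
llh <- c(m1 = -359.95519224, m2 = -354.71766399)
k   <- c(m1 = 2, m2 = 10)

# AIC = 2k - 2 ln(L)
aic <- 2 * k - 2 * llh
delta_aic <- unname(aic["m2"] - aic["m1"])

# Illustrative likelihood-ratio test of m2 against the nested m1
lr_stat <- unname(2 * (llh["m2"] - llh["m1"]))
lr_df   <- unname(k["m2"] - k["m1"])
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

round(aic, 2)
c(delta_aic = delta_aic, lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p)
```

The eight extra parameters raise the log-likelihood by only about 5.2 units, which is less than the AIC penalty of 8, so the adjusted model ends up with the higher AIC.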

Model Performance Evaluation using Pseudo-\(R^2\)

This analysis calculates several pseudo-\(R^2\) statistics to quantify and compare the fit of the unadjusted and multivariable-adjusted logistic regression models. These statistics are analogues of the linear-model \(R^2\) rather than literal proportions of variance explained; they provide a numerical assessment of how effectively BMI at age 31 and the additional lifestyle confounders predict a Type 2 Diabetes diagnosis.

Mathematical Framework:

McFadden’s Pseudo-\(R^2\)

In logistic regression, the standard \(R^2\) used in linear modeling is not applicable; therefore, McFadden’s \(R^2\) is frequently used to assess model fit by comparing the log-likelihood of the fitted model to a null (intercept-only) model.

\[R^2_{\text{McFadden}} = 1 - \frac{\ln(\hat{L}_{\text{full}})}{\ln(\hat{L}_{\text{null}})}\]

Where:

  • \(\ln(\hat{L}_{\text{full}})\): The log-likelihood of the model with predictors (BMI, sex, smoking, etc.).

  • \(\ln(\hat{L}_{\text{null}})\): The log-likelihood of the model with only the intercept.

  • Interpretation: Values typically range from 0 to 1, where values between 0.2 and 0.4 are generally considered to indicate an excellent model fit in the context of logistic regression.

# Create binary outcome
Dset <- Dset %>%
  mutate(t2d46_num = ifelse(t2d46 == 2, 1, 0))

# Unadjusted model
Unadjusted <- glm(t2d46_num ~ bmi31, family = binomial, data = Dset)

# Adjusted model
Adjusted <- glm(t2d46_num ~ bmi31 + factor(sex) + factor(smoking31) +
            factor(ses31) + factor(alcohol31),
          family = binomial, data = Dset)

# Pseudo-R²
library(pscl)
pR2(Unadjusted)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -359.95519224 -419.30538861  118.70039275    0.14154408    0.03829371 
##          r2CU 
##    0.15884237
pR2(Adjusted)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -354.71766399 -419.30538861  129.17544924    0.15403505    0.04160179 
##          r2CU 
##    0.17256433
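The three pseudo-\(R^2\) values can be recomputed from the printed log-likelihoods, which makes their definitions concrete. This is a minimal sketch for the unadjusted model; it assumes \(n = 3040\) (the number of participants, with no missing values) and uses the standard maximum-likelihood (Cox and Snell) and Cragg-Uhler (Nagelkerke) formulas, which reproduce the `r2ML` and `r2CU` entries above.

```r
n        <- 3040            # assumed sample size (all participants, no missing data)
llh      <- -359.95519224   # log-likelihood of the unadjusted model
llh_null <- -419.30538861   # log-likelihood of the intercept-only model

g2 <- -2 * (llh_null - llh)                    # likelihood-ratio statistic (G2)

mcfadden <- 1 - llh / llh_null                 # McFadden
r2_ml    <- 1 - exp(-g2 / n)                   # maximum likelihood (Cox and Snell)
r2_cu    <- r2_ml / (1 - exp(2 * llh_null / n))  # Cragg-Uhler (Nagelkerke)

round(c(G2 = g2, McFadden = mcfadden, r2ML = r2_ml, r2CU = r2_cu), 4)
```

Replacing `llh` with the adjusted model's value of -354.71766399 gives the second set of statistics in the same way.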

Goodness-of-Fit: Pseudo-\(R^2\)

The goodness-of-fit of the logistic regression models was evaluated using several pseudo-\(R^2\) metrics, which summarize how much the predictors improve on an intercept-only model of Type 2 Diabetes (T2D) risk. For the unadjusted model, McFadden’s \(R^2\) was 0.1415, indicating that BMI at age 31 alone provides a meaningful improvement over the null model. The adjusted model, which incorporates additional lifestyle and demographic confounders, showed a slight increase in McFadden’s \(R^2\) to 0.1540. While this suggests the more complex model captures slightly more of the underlying data structure, the marginal gain reinforces the earlier AIC findings that BMI remains the dominant explanatory variable in the dataset.

The Cragg and Uhler (Nagelkerke) \(R^2\) values, 0.1588 for the unadjusted and 0.1726 for the adjusted model, tell the same story. Both McFadden values (0.14 to 0.15) fall below the 0.2 to 0.4 range generally taken to indicate an excellent fit, so the models capture a modest but meaningful share of the variation in T2D risk. Pseudo-\(R^2\) statistics describe improvement over the null model; they do not by themselves establish that a model is well specified. Taken together with the Wald and AIC results, they support early adulthood BMI as a substantial marker of later-life metabolic health.