Logistic Regression Model Selection in R

1 Introduction

1.1 Background

Type 2 diabetes mellitus is a major public health concern worldwide, with risk influenced by a combination of metabolic, anthropometric, and demographic factors. Identifying predictors of diabetes is essential for understanding disease etiology and informing prevention strategies. The Pima Indians Diabetes dataset has been widely used to study these relationships and to demonstrate statistical modeling approaches in epidemiology and biostatistics.

1.2 Objective

This analysis applies logistic regression modeling to identify factors associated with diabetes status. Specifically, the objectives are to:

Examine univariable associations between selected clinical and demographic predictors and diabetes status
Fit a multivariable logistic regression model to estimate adjusted associations while controlling for confounding
Demonstrate a systematic and transparent approach to model selection using likelihood-based methods

1.3 Dataset

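The analysis uses the PimaIndiansDiabetes2 dataset, which includes 768 observations on adult women of Pima Indian heritage, a Native American population living near Phoenix, Arizona. The outcome variable indicates diabetes status, while predictors include plasma glucose concentration, body mass index, diastolic blood pressure, triceps skinfold thickness, serum insulin, diabetes pedigree function, age, and number of pregnancies. The corrected version of the dataset replaces physiologically implausible zero values with missing values. The tables below report N = 392 for every predictor, which is consistent with the models being fitted to complete cases only.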
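The code in the following sections refers to an object named data and to functions from several packages, but the setup itself is not shown. A minimal sketch of what it might look like is given below. The package list is inferred from the functions used later, and the na.omit() step is an assumption based on the N = 392 reported in the univariable table; the original preprocessing may differ.

```r
# Assumed setup (not shown in the original analysis)
library(dplyr)      # %>% and select()
library(gtsummary)  # tbl_uvregression(), tbl_regression(), bold_labels(), bold_p(), as_gt()
library(gt)         # gt(), tab_header(), tab_options(), fmt_number(), ...
library(lmtest)     # lrtest()
library(mlbench)    # provides the PimaIndiansDiabetes2 data

data("PimaIndiansDiabetes2", package = "mlbench")

# Assumption: keep complete cases only, which leaves 392 of the 768 rows
data <- na.omit(PimaIndiansDiabetes2)

# The outcome is a factor with levels "neg" and "pos";
# glm() with family = binomial models the probability of the second level ("pos")
table(data$diabetes)
```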

1.4 Analytic Approach

A stepwise modeling strategy commonly used in epidemiological research is employed. Univariable logistic regression models are first fitted to assess individual associations between predictors and diabetes status. A full multivariable model is then estimated, and Likelihood Ratio Tests are used to compare nested models and guide variable removal. The final model is selected based on statistical evidence and parsimony.

2 Univariable Model

Each predictor is first evaluated in a univariable logistic regression model with diabetes status as the outcome. This step provides an initial assessment of the strength and direction of association between individual predictors and the outcome. Univariable results are used for descriptive purposes and do not, on their own, determine inclusion in the final multivariable model.

tbl_uni <- data %>%
  select(diabetes, pregnant, glucose, pressure, triceps,
         insulin, mass, pedigree, age) %>%
  tbl_uvregression(
    method = glm,
    y = diabetes,
    method.args = list(family = binomial),
    exponentiate = TRUE,
    label = list(
      pregnant = "Number of pregnancies",
      glucose  = "Plasma glucose",
      pressure = "Diastolic BP",
      triceps  = "Triceps skinfold",
      insulin  = "Serum insulin",
      mass     = "Body Mass Index",
      pedigree = "Diabetes pedigree",
      age      = "Age (years)"
    )
  ) |>
  bold_labels() |>
  bold_p() |>
  as_gt() |>  # convert to a gt table so the gt styling functions below apply
  tab_header(
    title = md("**Univariable Logistic Regression Model**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )

tbl_uni

Univariable Logistic Regression Model
Predictors of Diabetes, Pima Indian Women
Characteristic	N	OR	95% CI	p-value
Number of pregnancies	392	1.18	1.11, 1.27	<0.001
Plasma glucose	392	1.04	1.03, 1.05	<0.001
Diastolic BP	392	1.04	1.02, 1.05	<0.001
Triceps skinfold	392	1.06	1.03, 1.08	<0.001
Serum insulin	392	1.01	1.00, 1.01	<0.001
Body Mass Index	392	1.09	1.06, 1.13	<0.001
Diabetes pedigree	392	3.60	1.93, 7.02	<0.001
Age (years)	392	1.08	1.05, 1.10	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases

3 Multivariable Logistic Regression Model

A full multivariable logistic regression model is then fitted, including all candidate predictors. This model serves as the reference point for subsequent model comparisons. Adjusted odds ratios (AORs) from this model reflect the association between each predictor and diabetes after controlling for the other variables in the model.

full_model <- glm(
  diabetes ~ pregnant + glucose + pressure + triceps +
             insulin + mass + pedigree + age,
  data = data,
  family = binomial
)

tbl_regression(
  full_model,
  exponentiate = TRUE,
  label = list(
    pregnant = "Number of pregnancies",
    glucose  = "Plasma glucose",
    pressure = "Diastolic BP",
    triceps  = "Triceps skinfold",
    insulin  = "Serum insulin",
    mass     = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age      = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |>  # convert to a gt table so the gt styling functions below apply
  tab_header(
    title = md("**Multivariable Logistic Regression Model**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )

Multivariable Logistic Regression Model
Predictors of Diabetes, Pima Indian Women
Characteristic	AOR	95% CI	p-value
Number of pregnancies	1.09	0.97, 1.21	0.14
Plasma glucose	1.04	1.03, 1.05	<0.001
Diastolic BP	1.00	0.98, 1.02	>0.9
Triceps skinfold	1.01	0.98, 1.05	0.5
Serum insulin	1.00	1.00, 1.00	0.5
Body Mass Index	1.07	1.02, 1.13	0.010
Diabetes pedigree	3.13	1.38, 7.37	0.008
Age (years)	1.03	1.00, 1.07	0.065
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
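Several AORs in the table above are close to 1 because they are expressed per one-unit change in the predictor. Rescaling to a larger increment can make such estimates easier to read. The sketch below uses R's built-in infert data and a hypothetical helper, or_per_k(), purely for illustration; neither is part of the original analysis.

```r
# Illustration on the built-in `infert` data (not the diabetes data)
fit <- glm(case ~ spontaneous + induced + age, data = infert, family = binomial)

# Hypothetical helper: odds ratio for a k-unit increase in one predictor.
# On the log-odds scale the effect is k * beta, so the odds ratio is exp(k * beta).
or_per_k <- function(model, term, k = 1) {
  exp(k * coef(model)[[term]])
}

or_per_k(fit, "age", k = 1)  # OR per 1-year increase in age
or_per_k(fit, "age", k = 5)  # OR per 5-year increase (the 1-year OR raised to the 5th power)
```

Applied to the models in this document, the equivalent call would be or_per_k(full_model, "glucose", k = 10).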

4 Variable Removal Strategy

Model simplification is performed by removing one variable at a time from the full model. Each reduced model is compared with the full model using a Likelihood Ratio Test (LRT). This approach allows assessment of whether the excluded variable contributes meaningfully to model fit.
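Base R's drop1() carries out this kind of single-term deletion in one call: each term is dropped in turn and the resulting reduced model is compared with the full model. The sketch below uses the built-in infert data for illustration only. It is a possible shortcut to fitting each reduced model by hand, not the procedure used in this analysis.

```r
# Illustration on the built-in `infert` data (not the diabetes data)
full <- glm(case ~ spontaneous + induced + age + parity,
            data = infert, family = binomial)

# One row per term: the LRT column holds the chi-square statistic for
# dropping that term, and Pr(>Chi) holds the corresponding p-value
single_term <- drop1(full, test = "LRT")
single_term
```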

5 Likelihood Ratio Test (LRT)

The Likelihood Ratio Test evaluates the null hypothesis that the reduced model fits the data as well as the full model.

A p-value < 0.05 suggests that the removed variable significantly improves model fit and should be retained.
A p-value ≥ 0.05 suggests that the simpler model is sufficient, and the variable may be excluded without substantially worsening model performance.
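For reference, the test statistic is -2 × (log-likelihood of the reduced model − log-likelihood of the full model), compared against a chi-square distribution whose degrees of freedom equal the number of parameters removed. A minimal sketch with the built-in infert data (for illustration only) shows the calculation next to the equivalent anova() call:

```r
# Illustration on the built-in `infert` data (not the diabetes data)
full    <- glm(case ~ spontaneous + induced + age, data = infert, family = binomial)
reduced <- glm(case ~ spontaneous + induced,       data = infert, family = binomial)

ll_full    <- logLik(full)
ll_reduced <- logLik(reduced)

# Test statistic and degrees of freedom (difference in number of parameters)
lrt_stat <- as.numeric(-2 * (ll_reduced - ll_full))
lrt_df   <- attr(ll_full, "df") - attr(ll_reduced, "df")

# Upper-tail chi-square probability
lrt_p <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
c(statistic = lrt_stat, df = lrt_df, p_value = lrt_p)

# The same comparison via anova() for nested glm fits
anova(reduced, full, test = "Chisq")
```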

5.1 Removal of Diastolic Blood Pressure

model_no_pressure <- glm(
  diabetes ~ pregnant + glucose +  triceps +
             insulin + mass + pedigree + age,
  data = data,
  family = binomial
) 

tbl_regression(
  model_no_pressure,
  exponentiate = TRUE,
  label = list(
    pregnant = "Number of pregnancies",
    glucose  = "Plasma glucose",
    triceps  = "Triceps skinfold",
    insulin  = "Serum insulin",
    mass     = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age      = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |>  # convert to a gt table so the gt styling functions below apply
  tab_header(
    title = md("**Model Without the Diastolic BP Variable**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )

Model Without the Diastolic BP Variable
Predictors of Diabetes, Pima Indian Women
Characteristic	AOR	95% CI	p-value
Number of pregnancies	1.09	0.97, 1.21	0.14
Plasma glucose	1.04	1.03, 1.05	<0.001
Triceps skinfold	1.01	0.98, 1.05	0.5
Serum insulin	1.00	1.00, 1.00	0.5
Body Mass Index	1.07	1.02, 1.13	0.009
Diabetes pedigree	3.14	1.38, 7.38	0.007
Age (years)	1.03	1.00, 1.07	0.063
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases

# Perform LRT
lrt_result <- lrtest(model_no_pressure, full_model)

# Build the display table. Note: the values are hard-coded (rounded), not read from lrt_result
lrt_df <- data.frame(
  Model = c("Without Pressure", "With Pressure"),
  `Parameters` = c(8, 9),
  `Log Likelihood` = c(-172.02, -172.01),
  `Df` = c(NA, 1),
  `Chi-square` = c(NA, 0.0144),
  `P-value` = c(NA, 0.9045),
  check.names = FALSE
)

# Create formatted table
lrt_df |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing models with and without `blood pressure` Variable"
  ) |>
  fmt_number(
    columns = c(`Log Likelihood`, `Chi-square`),
    decimals = 4
  ) |>
  fmt_number(
    columns = `P-value`,
    decimals = 4
  ) |>
  fmt_number(
    columns = Df,
    decimals = 0
  ) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent"
  ) |>
  cols_align(
    align = "center",
    columns = everything()
  ) |>
  cols_align(
    align = "left",
    columns = Model
  )

Likelihood Ratio Test
Comparing models with and without `blood pressure` Variable
Model	Parameters	Log Likelihood	Df	Chi-square	P-value¹
Without Pressure	8	−172.0200	NA	NA	NA
With Pressure	9	−172.0100	1	0.0144	0.9045
¹ A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.

After removing diastolic blood pressure from the model, the Likelihood Ratio Test yields a non-significant result (χ² = 0.0144, df = 1, p = 0.9045). Including diastolic blood pressure therefore does not significantly improve model fit, and the variable may be considered for exclusion.
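The display table above is built from numbers typed in by hand, so it will not update if the data or the models change. One way to avoid this is to read the values directly from the object returned by lrtest(), as the null-versus-full comparison later in this document does. The helper below, lrt_to_df(), is a hypothetical name introduced here, and the example uses the built-in infert data for illustration only.

```r
library(lmtest)

# Hypothetical helper: turn an lrtest() result for two nested models
# into the data frame layout used for the LRT tables in this document
lrt_to_df <- function(lrt, labels) {
  data.frame(
    Model            = labels,
    Parameters       = lrt$`#Df`,
    `Log Likelihood` = lrt$LogLik,
    Df               = lrt$Df,
    `Chi-square`     = lrt$Chisq,
    `P-value`        = lrt$`Pr(>Chisq)`,
    check.names = FALSE
  )
}

# Illustration on the built-in `infert` data (not the diabetes data)
full    <- glm(case ~ spontaneous + induced + age, data = infert, family = binomial)
reduced <- glm(case ~ spontaneous + induced,       data = infert, family = binomial)

lrt_df_demo <- lrt_to_df(lrtest(reduced, full), c("Without age", "With age"))
lrt_df_demo
```

The resulting data frame can be passed to gt() and formatted in the same way as the hand-built table.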

5.2 Removal of Triceps Skinfold Thickness

model_no_triceps <- glm(
  diabetes ~ pregnant + glucose + pressure +
             insulin + mass + pedigree + age,
  data = data,
  family = binomial
)

tbl_regression(
  model_no_triceps,
  exponentiate = TRUE,
  label = list(
    pregnant = "Number of pregnancies",
    glucose  = "Plasma glucose",
    pressure = "Diastolic BP",
    insulin  = "Serum insulin",
    mass     = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age      = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |>  # convert to a gt table so the gt styling functions below apply
  tab_header(
    title = md("**Model Without the Triceps Skinfold Variable**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )

Model Without the Triceps Skinfold Variable
Predictors of Diabetes, Pima Indian Women
Characteristic	AOR	95% CI	p-value
Number of pregnancies	1.09	0.97, 1.21	0.14
Plasma glucose	1.04	1.03, 1.05	<0.001
Diastolic BP	1.00	0.98, 1.02	>0.9
Serum insulin	1.00	1.00, 1.00	0.5
Body Mass Index	1.08	1.04, 1.13	<0.001
Diabetes pedigree	3.19	1.41, 7.50	0.006
Age (years)	1.04	1.00, 1.07	0.052
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases

# Perform LRT
lrt_result_triceps <- lrtest(model_no_triceps, full_model)

# Build the display table. Note: the values are hard-coded (rounded), not read from lrt_result_triceps
lrt_df_triceps <- data.frame(
  Model = c("Without Triceps skinfold", "With Triceps skinfold"),
  Parameters = c(8, 9),
  `Log Likelihood` = c(-172.23, -172.01),
  Df = c(NA, 1),
  `Chi-square` = c(NA, 0.4308),
  `P-value` = c(NA, 0.5116),
  check.names = FALSE
)

# Create formatted table
lrt_df_triceps |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing models with and without Triceps skinfold Variable"
  ) |>
  fmt_number(
    columns = c(`Log Likelihood`, `Chi-square`),
    decimals = 4
  ) |>
  fmt_number(
    columns = `P-value`,
    decimals = 4
  ) |>
  fmt_number(
    columns = Df,
    decimals = 0
  ) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black"
  ) |>
  cols_align(
    align = "center",
    columns = everything()
  ) |>
  cols_align(
    align = "left",
    columns = Model
  )

Likelihood Ratio Test
Comparing models with and without Triceps skinfold Variable
Model	Parameters	Log Likelihood	Df	Chi-square	P-value¹
Without Triceps skinfold	8	−172.2300	NA	NA	NA
With Triceps skinfold	9	−172.0100	1	0.4308	0.5116
¹ A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.

The Likelihood Ratio Test comparing models with and without triceps skinfold thickness is non-significant (χ² = 0.4308, df = 1, p = 0.5116). This suggests that triceps skinfold thickness does not contribute meaningfully to the explanatory power of the model and can be excluded in favor of a simpler model.

5.3 Removal of Serum Insulin

model_no_insulin <- glm(
  diabetes ~ pregnant + glucose + pressure + triceps +
      mass + pedigree + age,
  data = data,
  family = binomial
)

tbl_regression(
  model_no_insulin,
  exponentiate = TRUE,
  label = list(
    pregnant = "Number of pregnancies",
    glucose  = "Plasma glucose",
    pressure = "Diastolic BP",
    triceps  = "Triceps skinfold",
    mass     = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age      = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |>  # convert to a gt table so the gt styling functions below apply
  tab_header(
    title = md("**Model Without the Insulin Variable**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )

Model Without the Insulin Variable
Predictors of Diabetes, Pima Indian Women
Characteristic	AOR	95% CI	p-value
Number of pregnancies	1.09	0.98, 1.21	0.13
Plasma glucose	1.04	1.03, 1.05	<0.001
Diastolic BP	1.00	0.98, 1.02	>0.9
Triceps skinfold	1.01	0.98, 1.05	0.5
Body Mass Index	1.07	1.02, 1.13	0.012
Diabetes pedigree	3.09	1.36, 7.27	0.008
Age (years)	1.03	1.00, 1.07	0.071
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases

# Perform LRT
lrt_result_insulin <- lrtest(model_no_insulin, full_model)

# Build the display table. Note: the values are hard-coded (rounded), not read from lrt_result_insulin
lrt_df_insulin <- data.frame(
  Model = c("Without Insulin", "With Insulin"),
  Parameters = c(8, 9),
  `Log Likelihood` = c(-172.21, -172.01),
  Df = c(NA, 1),
  `Chi-square` = c(NA, 0.3971),
  `P-value` = c(NA, 0.5286),
  check.names = FALSE
)

# Create formatted table
lrt_df_insulin |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing models with and without Insulin Variable"
  ) |>
  fmt_number(
    columns = c(`Log Likelihood`, `Chi-square`),
    decimals = 4
  ) |>
  fmt_number(
    columns = `P-value`,
    decimals = 4
  ) |>
  fmt_number(
    columns = Df,
    decimals = 0
  ) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black"
  ) |>
  cols_align(
    align = "center",
    columns = everything()
  ) |>
  cols_align(
    align = "left",
    columns = Model
  )

Likelihood Ratio Test
Comparing models with and without Insulin Variable
Model	Parameters	Log Likelihood	Df	Chi-square	P-value¹
Without Insulin	8	−172.2100	NA	NA	NA
With Insulin	9	−172.0100	1	0.3971	0.5286
¹ A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.

Similarly, removing serum insulin results in a non-significant Likelihood Ratio Test (χ² = 0.3971, df = 1, p = 0.5286). This finding indicates that serum insulin does not significantly improve model fit when the other predictors are included.

5.4 Removal of Number of Pregnancies

model_no_pregnancies <- glm(
  diabetes ~ glucose + pressure + triceps +
             insulin + mass + pedigree + age,
  data = data,
  family = binomial
)

tbl_regression(
  model_no_pregnancies,
  exponentiate = TRUE,
  label = list(
    glucose  = "Plasma glucose",
    pressure = "Diastolic BP",
    triceps  = "Triceps skinfold",
    insulin  = "Serum insulin",
    mass     = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age      = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |>  # convert to a gt table so the gt styling functions below apply
  tab_header(
    title = md("**Model Without the Number of Pregnancies Variable**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )

Model Without the Number of Pregnancies Variable
Predictors of Diabetes, Pima Indian Women
Characteristic	AOR	95% CI	p-value
Plasma glucose	1.04	1.03, 1.05	<0.001
Diastolic BP	1.00	0.98, 1.02	>0.9
Triceps skinfold	1.01	0.98, 1.05	0.5
Serum insulin	1.00	1.00, 1.00	0.5
Body Mass Index	1.07	1.01, 1.13	0.014
Diabetes pedigree	2.94	1.31, 6.86	0.011
Age (years)	1.05	1.02, 1.08	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases

# Perform LRT
lrt_result_pregnant <- lrtest(model_no_pregnancies , full_model)

# Build the display table. Note: the values are hard-coded (rounded), not read from lrt_result_pregnant
lrt_df_pregnant <- data.frame(
  Model = c("Without Pregnant", "With Pregnant"),
  Parameters = c(8, 9),
  `Log Likelihood` = c(-173.12, -172.01),
  Df = c(NA, 1),
  `Chi-square` = c(NA, 2.2143),
  `P-value` = c(NA, 0.1367),
  check.names = FALSE
)

# Create formatted table
lrt_df_pregnant |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing models with and without Pregnant Variable"
  ) |>
  fmt_number(
    columns = c(`Log Likelihood`, `Chi-square`),
    decimals = 4
  ) |>
  fmt_number(
    columns = `P-value`,
    decimals = 4
  ) |>
  fmt_number(
    columns = Df,
    decimals = 0
  ) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black"
  ) |>
  cols_align(
    align = "center",
    columns = everything()
  ) |>
  cols_align(
    align = "left",
    columns = Model
  )

Likelihood Ratio Test
Comparing models with and without Pregnant Variable
Model	Parameters	Log Likelihood	Df	Chi-square	P-value¹
Without Pregnant	8	−173.1200	NA	NA	NA
With Pregnant	9	−172.0100	1	2.2143	0.1367
¹ A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.

Excluding the number of pregnancies also yields a non-significant result (χ² = 2.2143, df = 1, p = 0.1367), although the p-value is closer to the conventional significance threshold than those of the previously removed variables. While the simpler model is preferred at the 0.05 level, this result suggests that the variable may still have some explanatory relevance.

5.5 Removal of Age

model_no_age <- glm(
  diabetes ~ pregnant + glucose + pressure + triceps +
             insulin + mass + pedigree,
  data = data,
  family = binomial
)

tbl_regression(
  model_no_age,
  exponentiate = TRUE,
  label = list(
    pregnant = "Number of pregnancies",
    glucose  = "Plasma glucose",
    pressure = "Diastolic BP",
    triceps  = "Triceps skinfold",
    insulin  = "Serum insulin",
    mass     = "Body Mass Index",
    pedigree = "Diabetes pedigree"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |>  # convert to a gt table so the gt styling functions below apply
  tab_header(
    title = md("**Model Without the Age Variable**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )

Model Without the Age Variable
Predictors of Diabetes, Pima Indian Women
Characteristic	AOR	95% CI	p-value
Number of pregnancies	1.16	1.07, 1.26	<0.001
Plasma glucose	1.04	1.03, 1.05	<0.001
Diastolic BP	1.00	0.98, 1.03	0.8
Triceps skinfold	1.02	0.98, 1.05	0.4
Serum insulin	1.00	1.00, 1.00	0.6
Body Mass Index	1.07	1.01, 1.13	0.017
Diabetes pedigree	3.39	1.50, 7.94	0.004
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases

# Perform LRT
lrt_result_age <- lrtest(model_no_age, full_model)

# Build the display table. Note: the values are hard-coded (rounded), not read from lrt_result_age
lrt_df_age <- data.frame(
  Model = c("Without Age", "With Age"),
  Parameters = c(8, 9),
  `Log Likelihood` = c(-173.78, -172.01),
  Df = c(NA, 1),
  `Chi-square` = c(NA, 3.5285),
  `P-value` = c(NA, 0.06032),
  check.names = FALSE
)

# Create formatted table
lrt_df_age |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing models with and without Age Variable"
  ) |>
  fmt_number(
    columns = c(`Log Likelihood`, `Chi-square`),
    decimals = 4
  ) |>
  fmt_number(
    columns = `P-value`,
    decimals = 4
  ) |>
  fmt_number(
    columns = Df,
    decimals = 0
  ) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black"
  ) |>
  cols_align(
    align = "center",
    columns = everything()
  ) |>
  cols_align(
    align = "left",
    columns = Model
  )

Likelihood Ratio Test
Comparing models with and without Age Variable
Model	Parameters	Log Likelihood	Df	Chi-square	P-value¹
Without Age	8	−173.7800	NA	NA	NA
With Age	9	−172.0100	1	3.5285	0.0603
¹ A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.

After removing age from the model, the Likelihood Ratio Test produces a p-value marginally above 0.05 (χ² = 3.5285, df = 1, p = 0.0603). Age narrowly misses the conventional threshold and may warrant retention based on subject-matter knowledge.

6 Final Model

final_model <- glm(diabetes ~ glucose + mass + pedigree + age, 
                   data = data, 
                   family = binomial)

tbl_regression(
  final_model,
  exponentiate = TRUE,
  label = list(
    glucose  = "Plasma glucose",
    mass     = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age      = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |>  # convert to a gt table so the gt styling functions below apply
  tab_header(
    title = md("**Final Multivariable Logistic Regression Model**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )

Final Multivariable Logistic Regression Model
Predictors of Diabetes, Pima Indian Women
Characteristic	AOR	95% CI	p-value
Plasma glucose	1.04	1.03, 1.05	<0.001
Body Mass Index	1.08	1.04, 1.12	<0.001
Diabetes pedigree	2.97	1.33, 6.87	0.010
Age (years)	1.05	1.03, 1.08	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases

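Based on the sequence of Likelihood Ratio Tests, a final model including plasma glucose, body mass index, diabetes pedigree function, and age is selected. Diastolic blood pressure, triceps skinfold thickness, serum insulin, and number of pregnancies are dropped because none of their single-variable tests approached significance (p ≥ 0.1367). Age is retained despite its p-value of 0.0603: it is an established risk factor, and it appears to share information with number of pregnancies, since each becomes strongly significant (p < 0.001) once the other is removed from the full model. In the final model all four predictors are statistically significant (p ≤ 0.010), so the model balances goodness-of-fit with parsimony.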

7 Full Model Versus Null Model

# Create null model (intercept only)
null_model <- glm(diabetes ~ 1, data = data, family = binomial)

# Re-create the full model (same specification as in Section 3)
full_model <- glm(diabetes ~ pregnant + glucose + pressure + triceps + 
                    insulin + mass + pedigree + age, 
                  data = data, family = binomial)

# Perform LRT comparing null vs full model
lrt_full <- lrtest(null_model, full_model)


# Convert to data frame and create formatted table
lrt_df_full <- data.frame(
  Model = c("Null Model (Intercept only)", "Full Model (All predictors)"),
  Parameters = c(1, 9),
  `Log Likelihood` = c(lrt_full$LogLik[1], lrt_full$LogLik[2]),
  Df = c(NA, lrt_full$Df[2]),
  `Chi-square` = c(NA, lrt_full$Chisq[2]),
  `P-value` = c(NA, lrt_full$`Pr(>Chisq)`[2]),
  check.names = FALSE
)

# Create formatted table
lrt_df_full |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Testing Overall Model Significance: Null vs Full Model"
  ) |>
  fmt_number(
    columns = c(`Log Likelihood`, `Chi-square`),
    decimals = 4
  ) |>
  fmt_number(
    columns = `P-value`,
    decimals = 4
  ) |>
  fmt_number(
    columns = Df,
    decimals = 0
  ) |>
  tab_footnote(
    footnote = "A significant p-value (p < 0.05) indicates that the full model is significantly better than the null model.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black",
    table.width = px(650)
  ) |>
  cols_align(
    align = "center",
    columns = everything()
  ) |>
  cols_align(
    align = "left",
    columns = Model
  )

Likelihood Ratio Test
Testing Overall Model Significance: Null vs Full Model
Model	Parameters	Log Likelihood	Df	Chi-square	P-value¹
Null Model (Intercept only)	1	−249.0489	NA	NA	NA
Full Model (All predictors)	9	−172.0106	8	154.0766	0.0000
¹ A significant p-value (p < 0.05) indicates that the full model is significantly better than the null model.

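Comparison of the full model with the null (intercept-only) model shows a highly significant improvement in model fit (χ² = 154.08, df = 8). The p-value is displayed as 0.0000 only because it is rounded to four decimals and should be read as p < 0.0001. This confirms that the set of predictors collectively provides substantial explanatory power for diabetes status.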

8 Final Model Versus Full Model

# Perform LRT comparing final vs full model
lrt_final_vs_full <- lrtest(final_model, full_model)

# Build the display table. Note: the values are hard-coded (rounded), not read from lrt_final_vs_full
lrt_df_final_vs_full <- data.frame(
  Model = c("Final Model (glucose, mass, pedigree, age)", 
            "Full Model (All 8 predictors)"),
  Parameters = c(5, 9),
  `Log Likelihood` = c(-173.62, -172.01),
  Df = c(NA, 4),
  `Chi-square` = c(NA, 3.2138),
  `P-value` = c(NA, 0.5227),
  check.names = FALSE
)

# Create formatted table
lrt_df_final_vs_full |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing Final Model vs Full Model"
  ) |>
  fmt_number(
    columns = c(`Log Likelihood`, `Chi-square`),
    decimals = 4
  ) |>
  fmt_number(
    columns = `P-value`,
    decimals = 4
  ) |>
  fmt_number(
    columns = Df,
    decimals = 0
  ) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler final model is preferred - the additional predictors in the full model do not significantly improve fit.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black",
    table.width = px(800)
  ) |>
  cols_align(
    align = "center",
    columns = everything()
  ) |>
  cols_align(
    align = "left",
    columns = Model
  )

Likelihood Ratio Test
Comparing Final Model vs Full Model
Model	Parameters	Log Likelihood	Df	Chi-square	P-value¹
Final Model (glucose, mass, pedigree, age)	5	−173.6200	NA	NA	NA
Full Model (All 8 predictors)	9	−172.0100	4	3.2138	0.5227
¹ A non-significant p-value (p > 0.05) indicates that the simpler final model is preferred - the additional predictors in the full model do not significantly improve fit.

The Likelihood Ratio Test comparing the final model with the full model is non-significant (χ² = 3.2138, df = 4, p = 0.5227). Dropping the four additional variables jointly therefore does not significantly worsen fit, and the reduced model performs comparably to the full model with fewer predictors.
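As a quick consistency check, the p-values in the LRT tables can be recomputed from the reported chi-square statistics and degrees of freedom with base R's pchisq(). The numbers below are copied from the tables in this document; the object name reported is introduced here for illustration.

```r
# Chi-square statistics, degrees of freedom, and p-values as reported in the LRT tables
reported <- data.frame(
  comparison = c("pressure", "triceps", "insulin", "pregnant", "age", "final vs full"),
  chisq      = c(0.0144, 0.4308, 0.3971, 2.2143, 3.5285, 3.2138),
  df         = c(1, 1, 1, 1, 1, 4),
  p_reported = c(0.9045, 0.5116, 0.5286, 0.1367, 0.0603, 0.5227)
)

# Upper-tail probability of the chi-square distribution
reported$p_recomputed <- pchisq(reported$chisq, df = reported$df, lower.tail = FALSE)

# Any difference should be no larger than the rounding in the reported values
reported$abs_diff <- abs(reported$p_recomputed - reported$p_reported)
reported
```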

9 Conclusion

In conclusion, the final model consisting of plasma glucose, body mass index, diabetes pedigree function, and age provides a parsimonious and interpretable representation of diabetes risk in this dataset. The model achieves similar explanatory performance to the full model while avoiding unnecessary complexity, making it a suitable choice for inference and reporting.