Type 2 diabetes mellitus is a major public health concern worldwide, with risk influenced by a combination of metabolic, anthropometric, and demographic factors. Identifying predictors of diabetes is essential for understanding disease etiology and informing prevention strategies. The Pima Indians Diabetes dataset has been widely used to study these relationships and to demonstrate statistical modeling approaches in epidemiology and biostatistics.
1.2 Objective
This analysis applies logistic regression modeling to identify factors associated with diabetes status. Specifically, the objectives are to:
Examine univariable associations between selected clinical and demographic predictors and diabetes status
Fit a multivariable logistic regression model to estimate adjusted associations while controlling for confounding
Demonstrate a systematic and transparent approach to model selection using likelihood-based methods
1.3 Dataset
The analysis uses the PimaIndiansDiabetes2 dataset, which includes 768 observations on adult women of Pima Indian heritage. The outcome variable indicates diabetes status, while predictors include number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness, serum insulin, body mass index, diabetes pedigree function, and age. This corrected version of the dataset replaces physiologically implausible zero values with missing values. The regression tables below report N = 392, which suggests that the models are fitted to the observations with complete data on all variables.
1.4 Analytic Approach
A stepwise modeling strategy commonly used in epidemiological research is employed. Univariable logistic regression models are first fitted to assess individual associations between predictors and diabetes status. A full multivariable model is then estimated, and Likelihood Ratio Tests are used to compare nested models and guide variable removal. The final model is selected based on statistical evidence and parsimony.
2 Univariable Model
Each predictor is first evaluated in a univariable logistic regression model with diabetes status as the outcome. This step provides an initial assessment of the strength and direction of association between individual predictors and the outcome. Univariable results are used for descriptive purposes and do not, on their own, determine inclusion in the final multivariable model.
# Packages used throughout: dplyr (select, %>%), gtsummary (regression tables), gt (table styling)
library(dplyr)
library(gtsummary)
library(gt)

# `data` is assumed to hold the analysis data frame (N = 392 in the output below)
tbl_uni <- data %>%
  select(diabetes, pregnant, glucose, pressure, triceps, insulin, mass, pedigree, age) %>%
  tbl_uvregression(
    method = glm,
    y = diabetes,
    method.args = list(family = binomial),
    exponentiate = TRUE,
    label = list(
      pregnant = "Number of pregnancies",
      glucose = "Plasma glucose",
      pressure = "Diastolic BP",
      triceps = "Triceps skinfold",
      insulin = "Serum insulin",
      mass = "Body Mass Index",
      pedigree = "Diabetes pedigree",
      age = "Age (years)"
    )
  ) |>
  bold_labels() |>
  bold_p() |>
  as_gt() |> # Use as_gt() instead of gt()
  tab_header(
    title = md("**Univariable Logistic Regression Model**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )

tbl_uni
Univariable Logistic Regression Model
Predictors of Diabetes, Pima Indian Women

| Characteristic | N | OR | 95% CI | p-value |
|---|---|---|---|---|
| Number of pregnancies | 392 | 1.18 | 1.11, 1.27 | <0.001 |
| Plasma glucose | 392 | 1.04 | 1.03, 1.05 | <0.001 |
| Diastolic BP | 392 | 1.04 | 1.02, 1.05 | <0.001 |
| Triceps skinfold | 392 | 1.06 | 1.03, 1.08 | <0.001 |
| Serum insulin | 392 | 1.01 | 1.00, 1.01 | <0.001 |
| Body Mass Index | 392 | 1.09 | 1.06, 1.13 | <0.001 |
| Diabetes pedigree | 392 | 3.60 | 1.93, 7.02 | <0.001 |
| Age (years) | 392 | 1.08 | 1.05, 1.10 | <0.001 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
3 Multivariable Logistic Regression Model
A full multivariable logistic regression model is then fitted, including all candidate predictors. This model serves as the reference point for subsequent model comparisons. Adjusted odds ratios (AORs) from this model reflect the association between each predictor and diabetes after controlling for the other variables in the model.
full_model <- glm(
  diabetes ~ pregnant + glucose + pressure + triceps + insulin + mass + pedigree + age,
  data = data,
  family = binomial
)

tbl_regression(
  full_model,
  exponentiate = TRUE,
  label = list(
    pregnant = "Number of pregnancies",
    glucose = "Plasma glucose",
    pressure = "Diastolic BP",
    triceps = "Triceps skinfold",
    insulin = "Serum insulin",
    mass = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |> # Use as_gt() instead of gt()
  tab_header(
    title = md("**Multivariable Logistic Regression Model**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )
Multivariable Logistic Regression Model
Predictors of Diabetes, Pima Indian Women

| Characteristic | AOR | 95% CI | p-value |
|---|---|---|---|
| Number of pregnancies | 1.09 | 0.97, 1.21 | 0.14 |
| Plasma glucose | 1.04 | 1.03, 1.05 | <0.001 |
| Diastolic BP | 1.00 | 0.98, 1.02 | >0.9 |
| Triceps skinfold | 1.01 | 0.98, 1.05 | 0.5 |
| Serum insulin | 1.00 | 1.00, 1.00 | 0.5 |
| Body Mass Index | 1.07 | 1.02, 1.13 | 0.010 |
| Diabetes pedigree | 3.13 | 1.38, 7.37 | 0.008 |
| Age (years) | 1.03 | 1.00, 1.07 | 0.065 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
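Because each adjusted odds ratio refers to a one-unit increase in the predictor, small values such as 1.04 for plasma glucose can look negligible. The following minimal sketch rescales the reported estimate to a 10-unit increase. The variable names are illustrative and not part of the original analysis, and the input is the rounded value from the table, so the result is approximate.

```r
# Rounded adjusted odds ratio for plasma glucose, taken from the table above
aor_glucose_per_unit <- 1.04

# Odds ratios multiply across units, so a 10-unit increase
# corresponds to the per-unit odds ratio raised to the 10th power
aor_glucose_per_10 <- aor_glucose_per_unit^10

round(aor_glucose_per_10, 2)
```

With the fitted model available, the same quantity could be obtained without rounding error as `exp(10 * coef(full_model)["glucose"])`.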
4 Variable Removal Strategy
Model simplification is performed by removing one variable at a time from the full model. Each reduced model is compared with the full model using a Likelihood Ratio Test (LRT). This approach allows assessment of whether the excluded variable contributes meaningfully to model fit.
5 Likelihood Ratio Test (LRT)
The Likelihood Ratio Test evaluates the null hypothesis that the reduced model fits the data as well as the full model.
A p-value < 0.05 suggests that the removed variable significantly improves model fit and should be retained.
A p-value ≥ 0.05 suggests that the simpler model is sufficient, and the variable may be excluded without substantially worsening model performance.
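For reference, the statistic behind these decisions can be written out. The notation is introduced here for illustration: $\ell_{\text{full}}$ and $\ell_{\text{reduced}}$ denote the maximised log-likelihoods of the two nested models, and $k$ is the difference in their number of estimated parameters.

$$
G = -2\,\bigl(\ell_{\text{reduced}} - \ell_{\text{full}}\bigr) = 2\,\bigl(\ell_{\text{full}} - \ell_{\text{reduced}}\bigr), \qquad G \sim \chi^2_{k} \ \text{(approximately, under the null hypothesis)}
$$

As a worked check using the log-likelihoods reported in Section 7 for the null and full models:

$$
G = 2\,\bigl(-172.0106 - (-249.0489)\bigr) = 2 \times 77.0383 = 154.0766, \qquad k = 9 - 1 = 8
$$

This agrees with the chi-square value and degrees of freedom shown in that section's table.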
5.1 Removal of Diastolic Blood Pressure
model_no_pressure <- glm(
  diabetes ~ pregnant + glucose + triceps + insulin + mass + pedigree + age,
  data = data,
  family = binomial
)

tbl_regression(
  model_no_pressure,
  exponentiate = TRUE,
  label = list(
    pregnant = "Number of pregnancies",
    glucose = "Plasma glucose",
    triceps = "Triceps skinfold",
    insulin = "Serum insulin",
    mass = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |> # Use as_gt() instead of gt()
  tab_header(
    title = md("**Model Without Diastolic Blood Pressure**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )
Model Without Diastolic Blood Pressure
Predictors of Diabetes, Pima Indian Women

| Characteristic | AOR | 95% CI | p-value |
|---|---|---|---|
| Number of pregnancies | 1.09 | 0.97, 1.21 | 0.14 |
| Plasma glucose | 1.04 | 1.03, 1.05 | <0.001 |
| Triceps skinfold | 1.01 | 0.98, 1.05 | 0.5 |
| Serum insulin | 1.00 | 1.00, 1.00 | 0.5 |
| Body Mass Index | 1.07 | 1.02, 1.13 | 0.009 |
| Diabetes pedigree | 3.14 | 1.38, 7.38 | 0.007 |
| Age (years) | 1.03 | 1.00, 1.07 | 0.063 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
# lrtest() is assumed to come from the lmtest package
library(lmtest)

# Perform LRT
lrt_result <- lrtest(model_no_pressure, full_model)

# Convert to data frame and clean up
# (values are hard-coded and rounded, not extracted from lrt_result)
lrt_df <- data.frame(
  Model = c("Without Pressure", "With Pressure"),
  `Parameters` = c(8, 9),
  `Log Likelihood` = c(-172.02, -172.01),
  `Df` = c(NA, 1),
  `Chi-square` = c(NA, 0.0144),
  `P-value` = c(NA, 0.9045),
  check.names = FALSE
)

# Create formatted table
lrt_df |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing models with and without Blood Pressure Variable"
  ) |>
  fmt_number(columns = c(`Log Likelihood`, `Chi-square`), decimals = 4) |>
  fmt_number(columns = `P-value`, decimals = 4) |>
  fmt_number(columns = Df, decimals = 0) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent"
  ) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = Model)
Likelihood Ratio Test
Comparing models with and without Blood Pressure Variable

| Model | Parameters | Log Likelihood | Df | Chi-square | P-value [1] |
|---|---|---|---|---|---|
| Without Pressure | 8 | −172.0200 | NA | NA | NA |
| With Pressure | 9 | −172.0100 | 1 | 0.0144 | 0.9045 |

[1] A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.
After removing diastolic blood pressure from the model, the Likelihood Ratio Test yields a non-significant p-value. This indicates that inclusion of diastolic blood pressure does not significantly improve model fit relative to the full model. Therefore, this variable may be considered for exclusion.
5.2 Removal of Triceps Skinfold Thickness
model_no_triceps <- glm(
  diabetes ~ pregnant + glucose + pressure + insulin + mass + pedigree + age,
  data = data,
  family = binomial
)

tbl_regression(
  model_no_triceps,
  exponentiate = TRUE,
  label = list(
    pregnant = "Number of pregnancies",
    glucose = "Plasma glucose",
    pressure = "Diastolic BP",
    insulin = "Serum insulin",
    mass = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |> # Use as_gt() instead of gt()
  tab_header(
    title = md("**Model Without the Triceps Skinfold Variable**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )
Model Without the Triceps Skinfold Variable
Predictors of Diabetes, Pima Indian Women

| Characteristic | AOR | 95% CI | p-value |
|---|---|---|---|
| Number of pregnancies | 1.09 | 0.97, 1.21 | 0.14 |
| Plasma glucose | 1.04 | 1.03, 1.05 | <0.001 |
| Diastolic BP | 1.00 | 0.98, 1.02 | >0.9 |
| Serum insulin | 1.00 | 1.00, 1.00 | 0.5 |
| Body Mass Index | 1.08 | 1.04, 1.13 | <0.001 |
| Diabetes pedigree | 3.19 | 1.41, 7.50 | 0.006 |
| Age (years) | 1.04 | 1.00, 1.07 | 0.052 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
# Perform LRT
lrt_result_triceps <- lrtest(model_no_triceps, full_model)

# Convert to data frame and clean up
lrt_df_triceps <- data.frame(
  Model = c("Without Triceps skinfold", "With Triceps skinfold"),
  Parameters = c(8, 9),
  `Log Likelihood` = c(-172.23, -172.01),
  Df = c(NA, 1),
  `Chi-square` = c(NA, 0.4308),
  `P-value` = c(NA, 0.5116),
  check.names = FALSE
)

# Create formatted table
lrt_df_triceps |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing models with and without Triceps skinfold Variable"
  ) |>
  fmt_number(columns = c(`Log Likelihood`, `Chi-square`), decimals = 4) |>
  fmt_number(columns = `P-value`, decimals = 4) |>
  fmt_number(columns = Df, decimals = 0) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black"
  ) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = Model)
Likelihood Ratio Test
Comparing models with and without Triceps skinfold Variable

| Model | Parameters | Log Likelihood | Df | Chi-square | P-value [1] |
|---|---|---|---|---|---|
| Without Triceps skinfold | 8 | −172.2300 | NA | NA | NA |
| With Triceps skinfold | 9 | −172.0100 | 1 | 0.4308 | 0.5116 |

[1] A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.
The Likelihood Ratio Test comparing models with and without triceps skinfold thickness is non-significant. This suggests that triceps skinfold thickness does not contribute meaningfully to the explanatory power of the model and can be excluded in favor of a simpler model.
5.3 Removal of Serum Insulin
model_no_insulin <- glm(
  diabetes ~ pregnant + glucose + pressure + triceps + mass + pedigree + age,
  data = data,
  family = binomial
)

tbl_regression(
  model_no_insulin,
  exponentiate = TRUE,
  label = list(
    pregnant = "Number of pregnancies",
    glucose = "Plasma glucose",
    pressure = "Diastolic BP",
    triceps = "Triceps skinfold",
    mass = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |> # Use as_gt() instead of gt()
  tab_header(
    title = md("**Model Without the Insulin Variable**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )
Model Without the Insulin Variable
Predictors of Diabetes, Pima Indian Women

| Characteristic | AOR | 95% CI | p-value |
|---|---|---|---|
| Number of pregnancies | 1.09 | 0.98, 1.21 | 0.13 |
| Plasma glucose | 1.04 | 1.03, 1.05 | <0.001 |
| Diastolic BP | 1.00 | 0.98, 1.02 | >0.9 |
| Triceps skinfold | 1.01 | 0.98, 1.05 | 0.5 |
| Body Mass Index | 1.07 | 1.02, 1.13 | 0.012 |
| Diabetes pedigree | 3.09 | 1.36, 7.27 | 0.008 |
| Age (years) | 1.03 | 1.00, 1.07 | 0.071 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
# Perform LRT
lrt_result_insulin <- lrtest(model_no_insulin, full_model)

# Convert to data frame and clean up
lrt_df_insulin <- data.frame(
  Model = c("Without Insulin", "With Insulin"),
  Parameters = c(8, 9),
  `Log Likelihood` = c(-172.21, -172.01),
  Df = c(NA, 1),
  `Chi-square` = c(NA, 0.3971),
  `P-value` = c(NA, 0.5286),
  check.names = FALSE
)

# Create formatted table
lrt_df_insulin |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing models with and without Insulin Variable"
  ) |>
  fmt_number(columns = c(`Log Likelihood`, `Chi-square`), decimals = 4) |>
  fmt_number(columns = `P-value`, decimals = 4) |>
  fmt_number(columns = Df, decimals = 0) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black"
  ) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = Model)
Likelihood Ratio Test
Comparing models with and without Insulin Variable

| Model | Parameters | Log Likelihood | Df | Chi-square | P-value [1] |
|---|---|---|---|---|---|
| Without Insulin | 8 | −172.2100 | NA | NA | NA |
| With Insulin | 9 | −172.0100 | 1 | 0.3971 | 0.5286 |

[1] A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.
Similarly, removing serum insulin results in a non-significant Likelihood Ratio Test. This finding indicates that serum insulin does not significantly improve model fit when other predictors are included.
5.4 Removal of Number of Pregnancies
model_no_pregnancies <- glm(
  diabetes ~ glucose + pressure + triceps + insulin + mass + pedigree + age,
  data = data,
  family = binomial
)

tbl_regression(
  model_no_pregnancies,
  exponentiate = TRUE,
  label = list(
    glucose = "Plasma glucose",
    pressure = "Diastolic BP",
    triceps = "Triceps skinfold",
    insulin = "Serum insulin",
    mass = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |> # Use as_gt() instead of gt()
  tab_header(
    title = md("**Model Without the Number of Pregnancies Variable**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )
Model Without the Number of Pregnancies Variable
Predictors of Diabetes, Pima Indian Women

| Characteristic | AOR | 95% CI | p-value |
|---|---|---|---|
| Plasma glucose | 1.04 | 1.03, 1.05 | <0.001 |
| Diastolic BP | 1.00 | 0.98, 1.02 | >0.9 |
| Triceps skinfold | 1.01 | 0.98, 1.05 | 0.5 |
| Serum insulin | 1.00 | 1.00, 1.00 | 0.5 |
| Body Mass Index | 1.07 | 1.01, 1.13 | 0.014 |
| Diabetes pedigree | 2.94 | 1.31, 6.86 | 0.011 |
| Age (years) | 1.05 | 1.02, 1.08 | <0.001 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
# Perform LRT
lrt_result_pregnant <- lrtest(model_no_pregnancies, full_model)

# Convert to data frame and clean up
lrt_df_pregnant <- data.frame(
  Model = c("Without Pregnant", "With Pregnant"),
  Parameters = c(8, 9),
  `Log Likelihood` = c(-173.12, -172.01),
  Df = c(NA, 1),
  `Chi-square` = c(NA, 2.2143),
  `P-value` = c(NA, 0.1367),
  check.names = FALSE
)

# Create formatted table
lrt_df_pregnant |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing models with and without Pregnant Variable"
  ) |>
  fmt_number(columns = c(`Log Likelihood`, `Chi-square`), decimals = 4) |>
  fmt_number(columns = `P-value`, decimals = 4) |>
  fmt_number(columns = Df, decimals = 0) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black"
  ) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = Model)
Likelihood Ratio Test
Comparing models with and without Pregnant Variable

| Model | Parameters | Log Likelihood | Df | Chi-square | P-value [1] |
|---|---|---|---|---|---|
| Without Pregnant | 8 | −173.1200 | NA | NA | NA |
| With Pregnant | 9 | −172.0100 | 1 | 2.2143 | 0.1367 |

[1] A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.
Excluding the number of pregnancies yields a non-significant p-value, although it is closer to the conventional significance threshold compared to previously removed variables. While the simpler model is preferred at the 0.05 level, this result suggests that the variable may still have some explanatory relevance.
5.5 Removal of Age
model_no_age <- glm(
  diabetes ~ pregnant + glucose + pressure + triceps + insulin + mass + pedigree,
  data = data,
  family = binomial
)

tbl_regression(
  model_no_age,
  exponentiate = TRUE,
  label = list(
    pregnant = "Number of pregnancies",
    glucose = "Plasma glucose",
    pressure = "Diastolic BP",
    triceps = "Triceps skinfold",
    insulin = "Serum insulin",
    mass = "Body Mass Index",
    pedigree = "Diabetes pedigree"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |> # Use as_gt() instead of gt()
  tab_header(
    title = md("**Model Without the Age Variable**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )
Model Without the Age Variable
Predictors of Diabetes, Pima Indian Women

| Characteristic | AOR | 95% CI | p-value |
|---|---|---|---|
| Number of pregnancies | 1.16 | 1.07, 1.26 | <0.001 |
| Plasma glucose | 1.04 | 1.03, 1.05 | <0.001 |
| Diastolic BP | 1.00 | 0.98, 1.03 | 0.8 |
| Triceps skinfold | 1.02 | 0.98, 1.05 | 0.4 |
| Serum insulin | 1.00 | 1.00, 1.00 | 0.6 |
| Body Mass Index | 1.07 | 1.01, 1.13 | 0.017 |
| Diabetes pedigree | 3.39 | 1.50, 7.94 | 0.004 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
# Perform LRT
lrt_result_age <- lrtest(model_no_age, full_model)

# Convert to data frame and clean up
lrt_df_age <- data.frame(
  Model = c("Without Age", "With Age"),
  Parameters = c(8, 9),
  `Log Likelihood` = c(-173.78, -172.01),
  Df = c(NA, 1),
  `Chi-square` = c(NA, 3.5285),
  `P-value` = c(NA, 0.06032),
  check.names = FALSE
)

# Create formatted table
lrt_df_age |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing models with and without Age Variable"
  ) |>
  fmt_number(columns = c(`Log Likelihood`, `Chi-square`), decimals = 4) |>
  fmt_number(columns = `P-value`, decimals = 4) |>
  fmt_number(columns = Df, decimals = 0) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black"
  ) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = Model)
Likelihood Ratio Test
Comparing models with and without Age Variable

| Model | Parameters | Log Likelihood | Df | Chi-square | P-value [1] |
|---|---|---|---|---|---|
| Without Age | 8 | −173.7800 | NA | NA | NA |
| With Age | 9 | −172.0100 | 1 | 3.5285 | 0.0603 |

[1] A non-significant p-value (p > 0.05) indicates that the simpler model is preferred.
After removing age from the model, the Likelihood Ratio Test produces a p-value that is marginally above 0.05. This result suggests that age is close to being statistically significant and may warrant retention based on subject-matter knowledge, despite narrowly missing the conventional threshold.
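The five one-at-a-time comparisons above can also be obtained in a single call. The following is a minimal sketch, assuming base R's `drop1()` with `test = "Chisq"`, which runs one likelihood ratio test per term against the full model. It uses simulated stand-in data with hypothetical variable names (`x1`, `x2`, `x3`, `y`), not the Pima data, so the numbers it prints are not results of this analysis.

```r
# Simulated stand-in data (hypothetical; not the Pima data)
set.seed(1)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * sim$x1 + 0.5 * sim$x2))

# Full logistic regression model with all candidate predictors
full <- glm(y ~ x1 + x2 + x3, data = sim, family = binomial)

# One likelihood ratio test per predictor, each against the full model
single_deletions <- drop1(full, test = "Chisq")
print(single_deletions)

# Manual check for one term: refit without x3 and compare log-likelihoods
reduced <- update(full, . ~ . - x3)
lrt_stat <- as.numeric(2 * (logLik(full) - logLik(reduced)))
lrt_p <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)
```

Applied to the model in this report, the equivalent call would be `drop1(full_model, test = "Chisq")`. Fitting each reduced model separately, as done above, has the advantage of also showing how the remaining adjusted odds ratios shift when a variable is dropped.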
6 Final Model
final_model <- glm(diabetes ~ glucose + mass + pedigree + age, data = data, family = binomial)

tbl_regression(
  final_model,
  exponentiate = TRUE,
  label = list(
    glucose = "Plasma glucose",
    mass = "Body Mass Index",
    pedigree = "Diabetes pedigree",
    age = "Age (years)"
  )
) |>
  bold_labels() |>
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  as_gt() |> # Use as_gt() instead of gt()
  tab_header(
    title = md("**Final Multivariable Logistic Regression Model**"),
    subtitle = md("Predictors of Diabetes, Pima Indian Women")
  ) |>
  tab_source_note(
    source_note = md("Original owners: National Institute of Diabetes and Digestive and Kidney Diseases")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent",
    table.width = px(450)
  )
Final Multivariable Logistic Regression Model
Predictors of Diabetes, Pima Indian Women

| Characteristic | AOR | 95% CI | p-value |
|---|---|---|---|
| Plasma glucose | 1.04 | 1.03, 1.05 | <0.001 |
| Body Mass Index | 1.08 | 1.04, 1.12 | <0.001 |
| Diabetes pedigree | 2.97 | 1.33, 6.87 | 0.010 |
| Age (years) | 1.05 | 1.03, 1.08 | <0.001 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
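Based on the sequence of Likelihood Ratio Tests, a final model including plasma glucose, body mass index, diabetes pedigree function, and age is selected. The four variables whose removal produced non-significant tests (diastolic blood pressure, triceps skinfold thickness, serum insulin, and number of pregnancies) are dropped. Age is retained on subject-matter grounds, because its test only narrowly missed the 0.05 threshold (p = 0.060). This model balances goodness-of-fit with parsimony, retaining predictors that contribute meaningfully to explaining diabetes risk.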
7 Full Model Versus Null Model
# Create null model (intercept only)
null_model <- glm(diabetes ~ 1, data = data, family = binomial)

# Create full model (same specification as in Section 3)
full_model <- glm(
  diabetes ~ pregnant + glucose + pressure + triceps + insulin + mass + pedigree + age,
  data = data,
  family = binomial
)

# Perform LRT comparing null vs full model
lrt_full <- lrtest(null_model, full_model)

# Convert to data frame and create formatted table
lrt_df_full <- data.frame(
  Model = c("Null Model (Intercept only)", "Full Model (All predictors)"),
  Parameters = c(1, 9),
  `Log Likelihood` = c(lrt_full$LogLik[1], lrt_full$LogLik[2]),
  Df = c(NA, lrt_full$Df[2]),
  `Chi-square` = c(NA, lrt_full$Chisq[2]),
  `P-value` = c(NA, lrt_full$`Pr(>Chisq)`[2]),
  check.names = FALSE
)

# Create formatted table
lrt_df_full |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Testing Overall Model Significance: Null vs Full Model"
  ) |>
  fmt_number(columns = c(`Log Likelihood`, `Chi-square`), decimals = 4) |>
  fmt_number(columns = `P-value`, decimals = 4) |>
  fmt_number(columns = Df, decimals = 0) |>
  tab_footnote(
    footnote = "A significant p-value (p < 0.05) indicates that the full model is significantly better than the null model.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black",
    table.width = px(650)
  ) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = Model)
Likelihood Ratio Test
Testing Overall Model Significance: Null vs Full Model

| Model | Parameters | Log Likelihood | Df | Chi-square | P-value [1] |
|---|---|---|---|---|---|
| Null Model (Intercept only) | 1 | −249.0489 | NA | NA | NA |
| Full Model (All predictors) | 9 | −172.0106 | 8 | 154.0766 | 0.0000 |

[1] A significant p-value (p < 0.05) indicates that the full model is significantly better than the null model.
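Comparison of the full model with the null (intercept-only) model shows a highly significant improvement in model fit (chi-square = 154.08 on 8 degrees of freedom). The p-value is displayed as 0.0000 only because it is rounded to four decimal places; it should be read as p < 0.0001. This confirms that the set of predictors collectively provides substantial explanatory power for diabetes status.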
8 Final Model Versus Full Model
# Perform LRT comparing final vs full model
lrt_final_vs_full <- lrtest(final_model, full_model)

# Convert to data frame and create formatted table
lrt_df_final_vs_full <- data.frame(
  Model = c("Final Model (glucose, mass, pedigree, age)", "Full Model (All 8 predictors)"),
  Parameters = c(5, 9),
  `Log Likelihood` = c(-173.62, -172.01),
  Df = c(NA, 4),
  `Chi-square` = c(NA, 3.2138),
  `P-value` = c(NA, 0.5227),
  check.names = FALSE
)

# Create formatted table
lrt_df_final_vs_full |>
  gt() |>
  tab_header(
    title = md("**Likelihood Ratio Test**"),
    subtitle = "Comparing Final Model vs Full Model"
  ) |>
  fmt_number(columns = c(`Log Likelihood`, `Chi-square`), decimals = 4) |>
  fmt_number(columns = `P-value`, decimals = 4) |>
  fmt_number(columns = Df, decimals = 0) |>
  tab_footnote(
    footnote = "A non-significant p-value (p > 0.05) indicates that the simpler final model is preferred - the additional predictors in the full model do not significantly improve fit.",
    locations = cells_column_labels(columns = `P-value`)
  ) |>
  tab_options(
    table.border.top.color = "transparent",
    table_body.hlines.color = "transparent",
    table_body.border.bottom.color = "black",
    table.width = px(800)
  ) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = Model)
Likelihood Ratio Test
Comparing Final Model vs Full Model

| Model | Parameters | Log Likelihood | Df | Chi-square | P-value [1] |
|---|---|---|---|---|---|
| Final Model (glucose, mass, pedigree, age) | 5 | −173.6200 | NA | NA | NA |
| Full Model (All 8 predictors) | 9 | −172.0100 | 4 | 3.2138 | 0.5227 |

[1] A non-significant p-value (p > 0.05) indicates that the simpler final model is preferred - the additional predictors in the full model do not significantly improve fit.
The Likelihood Ratio Test comparing the final model with the full model is non-significant. This indicates that the reduced model performs comparably to the full model, despite including fewer predictors. The additional variables in the full model do not significantly improve fit.
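The joint test above can be checked by hand from the reported log-likelihoods and parameter counts. The helper below is a hypothetical illustration (`lrt_from_loglik` is not part of the original analysis or of any package), and because it starts from log-likelihoods rounded to two decimals, it reproduces the reported statistic and p-value only approximately.

```r
# Hypothetical helper: likelihood ratio test from two log-likelihoods
# and the number of estimated parameters in each nested model
lrt_from_loglik <- function(ll_reduced, ll_full, k_reduced, k_full) {
  statistic <- 2 * (ll_full - ll_reduced)  # twice the gain in log-likelihood
  df <- k_full - k_reduced                 # number of parameters removed
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
  list(statistic = statistic, df = df, p_value = p_value)
}

# Rounded values from the table above: final model (5 parameters)
# versus full model (9 parameters)
res <- lrt_from_loglik(
  ll_reduced = -173.62, ll_full = -172.01,
  k_reduced = 5, k_full = 9
)
res
```

The degrees of freedom equal the number of predictors dropped (four), which is why this comparison tests the removed variables jointly rather than one at a time.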
9 Conclusion
In conclusion, the final model consisting of plasma glucose, body mass index, diabetes pedigree function, and age provides a parsimonious and interpretable representation of diabetes risk in this dataset. The model achieves similar explanatory performance to the full model while avoiding unnecessary complexity, making it a suitable choice for inference and reporting.