Determinants of Type 2 Diabetes Mellitus Among Pima Indian Women: A Logistic Regression Approach to Identifying Risk Factors

Author: Amira Mandour
Biostatistician | Clinical Trials & Statistical Modeling Expert

2024-10-30

Introduction

Diabetes mellitus is a chronic metabolic disorder characterized by high blood sugar levels. Understanding the factors associated with diabetes risk can aid in prevention and early diagnosis.

The present analysis uses data from the PimaIndiansDiabetes R dataset (mlbench package) to investigate which physiological and demographic variables significantly influence diabetes status among Pima Indian women.

Baseline Characteristics of Pima Indian Women:

Table 1. Baseline Characteristics of Pima Indian Women
Characteristic                        N = 768¹
Women's Age (years)                   29 (24, 41)
Triceps skin fold thickness (mm)      23 (0, 32)
Diabetes Pedigree Function (DPF)      0.37 (0.24, 0.63)
Number of Times Pregnant
    0                                 111 (14%)
    1-2                               238 (31%)
    3-4                               143 (19%)
    5+                                276 (36%)
Plasma Glucose Concentration
    Low                               115 (15%)
    Normal                            461 (60%)
    High                              192 (25%)
Diastolic blood pressure (mm Hg)
    Low                               158 (21%)
    Normal                            445 (58%)
    High                              165 (21%)
Body mass index
    Underweight                       15 (2.0%)
    Normal                            102 (13%)
    Overweight                        179 (23%)
    Obese                             472 (61%)
Test for Diabetes
    Normal                            500 (65%)
    Diabetic                          268 (35%)
¹ Median (IQR); n (%)

Table 1 summarizes the characteristics of 768 Pima Indian women. The median age is 29 years (IQR: 24–41), the median triceps skin fold thickness is 23 mm (IQR: 0–32), and the median Diabetes Pedigree Function (DPF) score is 0.37 (IQR: 0.24–0.63). The lower quartile of 0 mm for triceps thickness shows that at least a quarter of the records carry a value of zero for this variable. Pregnancy history is categorized as 0 (14%), 1–2 (31%), 3–4 (19%), and ≥5 (36%) pregnancies. Plasma glucose levels are classified as low (15%), normal (60%), and high (25%), while diastolic blood pressure is low (21%), normal (58%), and high (21%). BMI categories are underweight (2%), normal (13%), overweight (23%), and obese (61%). Overall, 65% of women are non-diabetic and 35% are diabetic.

Factors Associated with Diabetes Among Pima Indian Women:
Table 2. Comparison of Clinical and Demographic Characteristics by Diabetes Status
Characteristic                        Normal, N = 500¹      Diabetic, N = 268¹    p-value²
Women's Age (years)                   27 (23, 37)           36 (28, 44)           <0.001
Triceps skin fold thickness (mm)      21 (0, 31)            27 (0, 36)            0.013
2-Hour serum insulin (mu U/ml)        39 (0, 105)           0 (0, 167)            0.066
Diabetes Pedigree Function (DPF)      0.34 (0.23, 0.56)     0.45 (0.26, 0.73)     <0.001
Number of Times Pregnant                                                          <0.001
    0                                 73 (15%)              38 (14%)
    1-2                               190 (38%)             48 (18%)
    3-4                               93 (19%)              50 (19%)
    5+                                144 (29%)             132 (49%)
Plasma Glucose Concentration                                                      <0.001
    Low                               106 (21%)             9 (3.4%)
    Normal                            334 (67%)             127 (47%)
    High                              60 (12%)              132 (49%)
Diastolic blood pressure (mm Hg)                                                  <0.001
    Low                               119 (24%)             39 (15%)
    Normal                            293 (59%)             152 (57%)
    High                              88 (18%)              77 (29%)
Body mass index                                                                   <0.001
    Underweight                       13 (2.6%)             2 (0.7%)
    Normal                            95 (19%)              7 (2.6%)
    Overweight                        139 (28%)             40 (15%)
    Obese                             253 (51%)             219 (82%)
¹ Median (IQR); n (%)
² Wilcoxon rank sum test; Pearson's Chi-squared test

Table 2 shows a comparison between characteristics of non-diabetic (N = 500) and diabetic (N = 268) Pima Indian women. Significant differences (p < 0.05) are observed in age, triceps skin fold thickness, Diabetes Pedigree Function (DPF), number of pregnancies, plasma glucose, diastolic blood pressure, and body mass index. Diabetic women are generally older, have higher triceps skin fold thickness, and elevated DPF scores. The distributions of pregnancies, plasma glucose, blood pressure, and BMI also differ significantly between the two groups.
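As a rough check on the categorical comparisons, the Pearson chi-squared statistic can be recomputed directly from the counts printed in Table 2. The sketch below does this for plasma glucose only. The function name `pearson_chi2` is illustrative and not part of the original analysis, and the p-value shortcut works only because this table has 2 degrees of freedom.

```python
import math

def pearson_chi2(table):
    """Return the Pearson chi-squared statistic and degrees of freedom
    for a table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Plasma glucose counts from Table 2.
# Rows: Low, Normal, High. Columns: Normal, Diabetic.
glucose_counts = [[106, 9], [334, 127], [60, 132]]

stat, df = pearson_chi2(glucose_counts)
# For exactly 2 degrees of freedom the chi-squared upper tail is exp(-x / 2)
p_value = math.exp(-stat / 2)
print(f"chi2 = {stat:.1f}, df = {df}, p < 0.001: {p_value < 0.001}")
```

The statistic is far into the upper tail, which agrees with the reported p < 0.001 for plasma glucose.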

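The 2-hour serum insulin level does not differ significantly between the groups (p = 0.066). Many records have an insulin value of 0 (the median in the diabetic group is 0), which likely denotes a missing measurement rather than a true reading, so this result alone does not establish that insulin is uninformative in this cohort.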

Distribution and Patterns of Risk Factors by Diabetes Status:

Distribution of Plasma Glucose Levels Among Pima Indian Women:

Figure 1 shows the distribution of plasma glucose concentration categorized as low, normal, and high. The majority of women have normal glucose levels (60%), while 25% have high glucose and 15% have low glucose.

Body Mass Index (BMI) Categories:

Figure 2 illustrates the BMI distribution of the cohort. Most women are classified as obese (61%), followed by overweight (23%), normal weight (13%), and underweight (2%).

Diabetes Status Among Pima Indian Women:

Figure 3 presents the proportion of women with and without diabetes: 35% of the cohort are diabetic and 65% are non-diabetic.

Age Distribution by Diabetes Status:

Figure 4 compares the age distribution between non-diabetic and diabetic Pima Indian women. Diabetic women tend to be older than non-diabetic women.

Triceps Skin Fold Thickness and DPF by Diabetes Status:

Figure 5 shows that diabetic women have higher triceps skin fold thickness and higher Diabetes Pedigree Function (DPF) scores than non-diabetic women, pointing to differences in body composition and familial risk.

Number of Pregnancies, Plasma Glucose, and Diastolic Blood Pressure by Diabetes Status:

Figure 6 shows that the distributions of number of pregnancies, plasma glucose concentration, and diastolic blood pressure differ significantly between diabetic and non-diabetic women (all p < 0.001, Table 2).

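Body Mass Index (BMI) by Diabetes Status:

Obesity is more common among diabetic women (82%) than among non-diabetic women (51%), and the BMI distribution differs significantly between the two groups (p < 0.001, Table 2).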

2-Hour Serum Insulin Levels by Diabetes Status:

Figure 8 shows 2-hour serum insulin levels by diabetes status. The difference between diabetic and non-diabetic women is not statistically significant (p = 0.066), so insulin may not be a reliable predictor in this cohort.

Multivariable Logistic Regression of Factors Associated with Diabetes in Pima Indian Women:

Table 3 presents the multivariable logistic regression analysis of factors associated with diabetes among Pima Indian women. Number of pregnancies, plasma glucose concentration, diastolic blood pressure, BMI, and Diabetes Pedigree Function (DPF) were statistically significant predictors of diabetes (p < 0.05). Each odds ratio is adjusted for the other variables in the model.

Number of pregnancies (OR = 1.13): each additional pregnancy is associated with 13% higher odds of diabetes.

Plasma glucose concentration (OR = 1.03): each one-unit increase in glucose is associated with 3% higher odds of diabetes.

BMI (OR = 1.09): each one-unit increase in BMI is associated with 9% higher odds of diabetes.

Diabetes Pedigree Function (DPF) (OR = 2.50; 95% CI: 1.40–4.52): each one-unit increase in DPF is associated with 2.5 times the odds of diabetes. A one-unit change is large relative to the observed spread of DPF (IQR: 0.24–0.63), so the effect across typical values is smaller than this figure suggests.

Diastolic blood pressure (OR = 0.99; 95% CI: 0.98–1.00; p = 0.012): the association is statistically significant but small and inverse, with about 1% lower odds of diabetes per mm Hg after adjustment for the other predictors.

Triceps skin fold thickness (p = 0.600) and age (p = 0.092) were not statistically significant.

Overall, these results indicate that reproductive history, glucose levels, body composition, and familial predisposition are the most relevant determinants of diabetes risk among Pima Indian women (Table 3).
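The arithmetic behind these interpretations can be sketched as follows. The odds ratios are the rounded values printed in Table 3, so the results are approximate. The helper names `percent_change_in_odds` and `odds_ratio_for_increase` are illustrative and are not part of the original analysis.

```python
import math

# Adjusted odds ratios as printed (rounded) in Table 3
odds_ratios = {
    "pregnancies": 1.13,
    "glucose": 1.03,
    "diastolic_bp": 0.99,
    "bmi": 1.09,
    "dpf": 2.50,
}

def percent_change_in_odds(odds_ratio):
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0

def odds_ratio_for_increase(odds_ratio, units):
    """Odds ratio for an increase of `units`: exp(units * beta), where beta = log(OR)."""
    beta = math.log(odds_ratio)
    return math.exp(units * beta)

for name, value in odds_ratios.items():
    print(f"{name}: {percent_change_in_odds(value):+.0f}% odds per unit")

# A small per-unit odds ratio compounds over a larger change in the predictor
glucose_10_units = odds_ratio_for_increase(odds_ratios["glucose"], 10)
print(f"OR for a 10-unit glucose increase: {glucose_10_units:.2f}")
```

Because odds ratios multiply, the per-unit OR of 1.03 for glucose corresponds to roughly 1.34 for a 10-unit increase, which is why a value close to 1 can still describe an important predictor.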

Table 3. Multivariable Logistic Regression Analysis of Factors Associated with Diabetes Among Pima Indian Women
Characteristic                        OR¹       95% CI¹       p-value
Number of Pregnancies                 1.13      1.06, 1.21    <0.001
Plasma Glucose Concentration          1.03      1.03, 1.04    <0.001
Diastolic Blood Pressure              0.99      0.98, 1.00    0.012
Triceps Skin Fold Thickness (mm)      1.00      0.98, 1.01    0.600
Body Mass Index                       1.09      1.06, 1.13    <0.001
Diabetes Pedigree Function (DPF)      2.50      1.40, 4.52    0.002
Women's Age (years)                   1.02      1.00, 1.03    0.092
¹ OR = Odds Ratio, CI = Confidence Interval

Forest Plot of Adjusted Odds Ratios for Predictors of Diabetes in Pima Indian Women:

Figure 9. Forest plot showing adjusted odds ratios (ORs) and 95% confidence intervals for the predictors of diabetes in Pima Indian women included in the logistic regression model.

Logistic Regression Curve - Predicted Probability of Diabetes by Number of Pregnancies:

The logistic regression curve demonstrates a positive association, with the probability of diabetes increasing as the number of pregnancies rises.
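The shape of such a curve follows from the logistic function. The sketch below is illustrative only: the intercept is a hypothetical value chosen for the example, and the slope is the log of the adjusted odds ratio for pregnancies in Table 3, which need not equal the slope of the plotted curve.

```python
import math

def predicted_probability(x, intercept, slope):
    """Logistic function: maps the linear predictor intercept + slope * x to a probability."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

# ASSUMPTION: hypothetical intercept for illustration, not a fitted value
HYPOTHETICAL_INTERCEPT = -1.2
# Slope taken as log(OR) using the Table 3 odds ratio for number of pregnancies
SLOPE = math.log(1.13)

pregnancies = list(range(0, 11))
probabilities = [
    predicted_probability(k, HYPOTHETICAL_INTERCEPT, SLOPE) for k in pregnancies
]

for k, p in zip(pregnancies, probabilities):
    print(f"{k:2d} pregnancies -> predicted probability {p:.3f}")
```

With a positive slope the predicted probability rises monotonically and stays between 0 and 1, which is the S-shaped pattern the curve displays.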

Logistic Regression Curve – Plasma Glucose Concentration:

Predicted probability of diabetes increases with higher plasma glucose concentrations. The logistic regression curve highlights a strong positive association between glucose levels and diabetes risk.

Logistic Regression Curve – Diabetes Pedigree Function (DPF):

Predicted probability of diabetes rises sharply with higher DPF values, demonstrating a strong familial/genetic risk contribution.

Logistic Regression Curve – Diastolic Blood Pressure:

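The curve shows the predicted probability of diabetes across diastolic blood pressure values. It has a slight upward trend, indicating a modest increase in diabetes risk with higher diastolic blood pressure in this cohort. In the multivariable model, by contrast, the adjusted odds ratio for diastolic blood pressure is 0.99 (Table 3), so the direction of this association depends on whether the other predictors are accounted for.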

Table 4: Comparison of Reduced and Full Logistic Regression Models:

The reduced model includes Number of Pregnancies, Plasma Glucose Concentration, Diastolic Blood Pressure, Body Mass Index, and Diabetes Pedigree Function. The full model adds Triceps Skin Fold Thickness and Age. The likelihood ratio test comparing the two models gives p = 0.185, indicating that adding triceps skin fold thickness and age does not significantly improve model fit. The reduced model is therefore preferred as the more parsimonious of the two.

Table 4. Likelihood Ratio Test Comparing Reduced and Full Logistic Regression Models
Model    Resid. Df    Resid. Dev    Df    Deviance    Pr(>Chi)
2        760          725.1871      2     3.372482    0.1852144
The likelihood ratio test (χ² test, p = 0.1852) indicates that adding Triceps Skin Fold Thickness and Age does not significantly improve model fit.
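The p-value in Table 4 can be reproduced from the deviance difference alone. The test has 2 degrees of freedom (two added predictors), and for exactly 2 degrees of freedom the chi-squared upper-tail probability has the closed form exp(-x / 2). The sketch relies on that special case, and the function name is illustrative.

```python
import math

def lr_test_p_value_df2(deviance_difference):
    """Upper-tail chi-squared probability for exactly 2 degrees of freedom.

    This closed form is valid only when df == 2.
    """
    return math.exp(-deviance_difference / 2.0)

# Deviance difference between the reduced and full models, from Table 4
deviance_difference = 3.372482

p_value = lr_test_p_value_df2(deviance_difference)
print(f"p = {p_value:.4f}")  # p = 0.1852
```

For other degrees of freedom a general chi-squared survival function is needed instead of this shortcut.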

Model Diagnostics for Logistic Regression

1. Linearity of the Logit:

To verify the linearity assumption, component-plus-residual (partial residual) plots were examined for each continuous predictor. The plots did not show any systematic deviations from a straight line, suggesting that the linearity of the logit assumption is reasonable for these predictors.

2. Multicollinearity Assessment:

Variance inflation factors (VIFs) were computed to check for multicollinearity among predictors. All VIF values were close to 1, indicating minimal collinearity and that each predictor contributes independent information to the model.
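For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below shows only this relationship. The R² inputs are hypothetical values chosen for illustration, not quantities from this analysis.

```python
def variance_inflation_factor(r_squared):
    """VIF for a predictor, given the R-squared from regressing it on the other predictors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)

# ASSUMPTION: hypothetical R-squared values, used only to show the scale of the VIF
for r_squared in (0.0, 0.1, 0.5, 0.9):
    vif = variance_inflation_factor(r_squared)
    print(f"R^2 = {r_squared:.1f} -> VIF = {vif:.2f}")
```

A VIF close to 1 therefore means that the other predictors explain very little of that predictor's variance.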

3. Residual Analysis and Model Diagnostics for Logistic Regression:

We assessed the fit and assumptions of our logistic regression model using standard residual diagnostic plots. Because the outcome is binary, these plots are harder to read than in linear regression and are interpreted with caution.

Residuals vs. Fitted Values: The plot indicates some potential non-linearity, suggesting that the relationship between the predictors and the logit may not be perfectly linear for all variables.

Normal Q-Q Plot: No points lie extremely far from the reference line, so there is no sign of extreme residual outliers. Influence is better judged from leverage and Cook's distance than from a Q-Q plot, so the conclusion that the estimates are robust to small changes in the data is tentative.

Scale-Location Plot: The inverted U-shape of the smoothed line indicates that the spread of the residuals is not constant across fitted values. For a binary outcome the variance, p(1 − p), depends on the fitted probability by construction, so some pattern of this kind is expected and should not be read as heteroscedasticity in the linear-regression sense.

ROC Curves and Model Performance:

The area under the ROC curve (AUC) for our model is 0.8387, which is generally considered good discrimination. It means that, for a randomly chosen diabetic woman and a randomly chosen non-diabetic woman, there is an 83.87% probability that the model assigns the higher predicted risk to the diabetic woman.
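This pairwise reading of the AUC can be made concrete with a small sketch. The predicted probabilities below are hypothetical values for illustration, not output from the fitted model, and the function name is illustrative.

```python
def auc_by_pairs(scores_positive, scores_negative):
    """AUC as the share of (positive, negative) pairs in which the positive case
    has the higher score. Ties count as half."""
    wins = 0.0
    for pos in scores_positive:
        for neg in scores_negative:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))

# ASSUMPTION: hypothetical predicted probabilities, not taken from the analysis
diabetic_scores = [0.9, 0.8, 0.3]
non_diabetic_scores = [0.7, 0.2, 0.1]

example_auc = auc_by_pairs(diabetic_scores, non_diabetic_scores)
print(f"AUC = {example_auc:.3f}")
```

Here 8 of the 9 pairs are ranked correctly, giving an AUC of about 0.889. An AUC of 0.5 would correspond to ranking no better than chance.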

Comparing the two models:

Model 1 AUC = 0.839

Model 2 AUC = 0.837

These AUC values are very similar, suggesting that both models have comparable discriminative performance.

Conclusion:

In this study, logistic regression models were developed and evaluated to predict diabetes risk among Pima Indian women based on clinical and demographic predictors. Through careful model building, diagnostics, and validation, the selected predictors were confirmed to provide meaningful insights without issues of multicollinearity or influential outliers. The reduced and full models performed similarly in distinguishing between diabetic and non-diabetic cases, with good discriminative ability as shown by the ROC curves (AUC of about 0.84 for both).

Overall, this analysis demonstrates a systematic approach to predictive modeling, emphasizes the importance of model diagnostics, and provides a framework for interpreting clinical risk factors in a robust and reproducible manner. The findings contribute to a better understanding of diabetes risk and illustrate best practices in applied statistical modeling.