Today, we will be working with a diabetes dataset, motivated by the growing prevalence and significant health impact of diabetes in the USA. Diabetes is a chronic disease that affects millions of Americans, leading to serious complications including heart disease, kidney failure, and vision loss. Understanding and analyzing diabetes data can help identify trends, risk factors, and potential preventive measures. In this analysis, we apply data science techniques to gain insight into the epidemiology of diabetes, improve patient outcomes, and inform public health strategies to combat this pervasive condition.
| Heading | Variable | Description |
|---|---|---|
| Patient ID | | |
| | PatientID | A unique identifier assigned to each patient (6000 to 7878). |
| Demographic Details | | |
| | Age | The age of the patients, ranging from 20 to 90 years. |
| | Gender | Gender of the patients, where 0 represents Male and 1 represents Female. |
| | Ethnicity | The ethnicity of the patients, coded as follows: 0: Caucasian, 1: African American, 2: Asian, 3: Other. |
| | SocioeconomicStatus | The socioeconomic status of the patients, coded as follows: 0: Low, 1: Middle, 2: High. |
| | EducationLevel | The education level of the patients, coded as follows: 0: None, 1: High School, 2: Bachelor's, 3: Higher. |
| Lifestyle Factors | | |
| | BMI | Body Mass Index of the patients, ranging from 15 to 40. |
| | Smoking | Smoking status, where 0 indicates No and 1 indicates Yes. |
| | AlcoholConsumption | Weekly alcohol consumption in units, ranging from 0 to 20. |
| | PhysicalActivity | Weekly physical activity in hours, ranging from 0 to 10. |
| | DietQuality | Diet quality score, ranging from 0 to 10. |
| | SleepQuality | Sleep quality score, ranging from 4 to 10. |
| Medical History | | |
| | FamilyHistoryDiabetes | Family history of diabetes, where 0 indicates No and 1 indicates Yes. |
| | GestationalDiabetes | History of gestational diabetes, where 0 indicates No and 1 indicates Yes. |
| | PolycysticOvarySyndrome | Presence of polycystic ovary syndrome, where 0 indicates No and 1 indicates Yes. |
| | PreviousPreDiabetes | History of previous pre-diabetes, where 0 indicates No and 1 indicates Yes. |
| | Hypertension | Presence of hypertension, where 0 indicates No and 1 indicates Yes. |
| Clinical Measurements | | |
| | SystolicBP | Systolic blood pressure, ranging from 90 to 180 mmHg. |
| | DiastolicBP | Diastolic blood pressure, ranging from 60 to 120 mmHg. |
| | FastingBloodSugar | Fasting blood sugar levels, ranging from 70 to 200 mg/dL. |
| | HbA1c | Hemoglobin A1c levels, ranging from 4.0% to 10.0%. |
| | SerumCreatinine | Serum creatinine levels, ranging from 0.5 to 5.0 mg/dL. |
| | BUNLevels | Blood Urea Nitrogen levels, ranging from 5 to 50 mg/dL. |
| | CholesterolTotal | Total cholesterol levels, ranging from 150 to 300 mg/dL. |
| | CholesterolLDL | Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL. |
| | CholesterolHDL | High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL. |
| | CholesterolTriglycerides | Triglyceride levels, ranging from 50 to 400 mg/dL. |
| Medications | | |
| | AntihypertensiveMedications | Use of antihypertensive medications, where 0 indicates No and 1 indicates Yes. |
| | Statins | Use of statins, where 0 indicates No and 1 indicates Yes. |
| | AntidiabeticMedications | Use of antidiabetic medications, where 0 indicates No and 1 indicates Yes. |
| Symptoms and Quality of Life | | |
| | FrequentUrination | Presence of frequent urination, where 0 indicates No and 1 indicates Yes. |
| | ExcessiveThirst | Presence of excessive thirst, where 0 indicates No and 1 indicates Yes. |
| | UnexplainedWeightLoss | Presence of unexplained weight loss, where 0 indicates No and 1 indicates Yes. |
| | FatigueLevels | Fatigue levels, ranging from 0 to 10. |
| | BlurredVision | Presence of blurred vision, where 0 indicates No and 1 indicates Yes. |
| | SlowHealingSores | Presence of slow-healing sores, where 0 indicates No and 1 indicates Yes. |
| | TinglingHandsFeet | Presence of tingling in hands or feet, where 0 indicates No and 1 indicates Yes. |
| | QualityOfLifeScore | Quality of life score, ranging from 0 to 100. |
| Environmental and Occupational Exposures | | |
| | HeavyMetalsExposure | Exposure to heavy metals, where 0 indicates No and 1 indicates Yes. |
| | OccupationalExposureChemicals | Occupational exposure to harmful chemicals, where 0 indicates No and 1 indicates Yes. |
| | WaterQuality | Quality of water, where 0 indicates Good and 1 indicates Poor. |
| Health Behaviors | | |
| | MedicalCheckupsFrequency | Frequency of medical check-ups per year, ranging from 0 to 4. |
| | MedicationAdherence | Medication adherence score, ranging from 0 to 10. |
| | HealthLiteracy | Health literacy score, ranging from 0 to 10. |
| Diagnosis Information (Target Variable) | | |
| | Diagnosis | Diagnosis status for diabetes, where 0 indicates No and 1 indicates Yes. |
We now take a short sample overview of our data:
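A minimal sketch of how this overview can be produced; the file name and the data frame name `diabetes_data` are illustrative assumptions:

```r
library(readr)  # read_csv()

# Load the data and take a quick overview (file name is hypothetical).
diabetes_data <- read_csv("diabetes_data.csv")
dim(diabetes_data)   # expected: 1879 rows, 46 columns
str(diabetes_data)   # variable names and types
```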
We observe that there are 1879 observations and 46 variables in our data set.
It is also evident that our response variable, Diagnosis, is categorical, taking only the values 0 and 1. Because of this, we will perform logistic regression, NOT linear regression, to predict whether or not a patient has diabetes.
Moving forward, we shall also apply a decision tree to our data in order to better interpret and evaluate which variables, if any, are essential for predicting diabetes.
Specifically, the question we hope to answer is how accurately we can predict Diagnosis, the response variable, as a function of a set of relevant predictor variables to be identified later in this report. The overarching goal is to determine which variables are most important according to each of the aforementioned statistical methods, and to compare the models via resampling methods to validate our results.
| Criteria | Details |
|---|---|
| Steps to Use Logistic Regression | Confirm the outcome is binary; check the model assumptions; fit the model by maximum likelihood; assess the significance of the coefficients; evaluate predictive performance on held-out data. |
| Advantages of Logistic Regression | Coefficients are directly interpretable as log-odds; the model outputs class probabilities; it is computationally efficient; it is well suited to binary classification problems. |
Now that we have defined what logistic regression is and what it can do, we must first inspect our dataset and prune any unnecessary variables that are not applicable to our regression; one such variable is surfaced by the check below.
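A minimal sketch of one such check, listing the character columns of the data (assuming the `diabetes_data` frame from above):

```r
# Character columns carry no numeric or coded information for the regression;
# this surfaces one variable not described in the data dictionary.
names(diabetes_data)[sapply(diabetes_data, is.character)]
```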
## [1] "DoctorInCharge"
We may now safely remove this variable from our dataset and begin the next stage of data processing.
We notice from our overview of the variables that several, such as Gender, SocioeconomicStatus, and the response variable Diagnosis, are categorical, as they take only coded values such as 0, 1, or 2. We account for this by converting them into factor variables before regression.
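A minimal sketch of this preprocessing step; the exact list of converted columns is an illustrative assumption, chosen to match the dummy-coded terms (Gender1, SocioeconomicStatus2, and so on) that appear in the model output later:

```r
diabetes_data$DoctorInCharge <- NULL   # drop the uninformative identifier

# Convert the coded categorical columns to factors (illustrative list).
factor_vars <- c("Gender", "SocioeconomicStatus", "Smoking",
                 "FamilyHistoryDiabetes", "GestationalDiabetes",
                 "PolycysticOvarySyndrome", "PreviousPreDiabetes", "Hypertension",
                 "AntihypertensiveMedications", "Statins",
                 "AntidiabeticMedications", "FrequentUrination", "ExcessiveThirst",
                 "UnexplainedWeightLoss", "BlurredVision", "SlowHealingSores",
                 "TinglingHandsFeet", "HeavyMetalsExposure",
                 "OccupationalExposureChemicals", "WaterQuality", "Diagnosis")
diabetes_data[factor_vars] <- lapply(diabetes_data[factor_vars], factor)

# Drop PatientID and keep the cleaned data set for modeling.
diabetes_data_final <- diabetes_data[, setdiff(names(diabetes_data), "PatientID")]
```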
We may now begin logistic regression, remembering to exclude PatientID from our predictors (as done in the sketch above), since it is only an identifier and carries no information about diabetes status.
The model we derive estimates the probability that a patient has diabetes given all of our predictors; a fitted probability exceeding 0.5 means we predict the patient has diabetes, and a probability below 0.5 means we predict the patient does NOT have diabetes.
The assumptions will now be checked to ensure our data set is suitable for testing:
| Assumptions | Definitions |
|---|---|
| Binary Outcome | The dependent variable should be binary. |
| Independence of Observations | Observations should be independent of each other. |
| Linearity of Independent Variables and Log Odds | There should be a linear relationship between the independent variables and the log odds of the dependent variable. |
| No Multicollinearity | Independent variables should not be highly correlated with each other. |
| No Outliers | There should be no significant outliers in the data. |
## diagnosis_check
## 0 1
## 1127 752
It is evident from our table that the binary outcome condition has been met, as the response takes only the values 0 and 1. Specifically, we find 1127 cases of no diabetes and 752 cases of diabetes, meaning approximately 59.98% of our data set does NOT have diabetes.
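The linearity p-values below can be produced with a Box–Tidwell-style check: for each numeric predictor x, an x·log(x) term is added to a logistic fit and its significance is tested, with p > 0.05 supporting linearity of the logit. The sketch below is one common way to do this and may differ in detail from the exact procedure used here:

```r
# Box-Tidwell-style linearity check for each numeric predictor.
num_vars <- names(Filter(is.numeric, diabetes_data_final))
bt_pvalues <- sapply(num_vars, function(v) {
  x <- diabetes_data_final[[v]]
  x <- x - min(x) + 1   # shift so log(x) is defined for all observations
  fit <- glm(diabetes_data_final$Diagnosis ~ x + I(x * log(x)), family = binomial)
  summary(fit)$coefficients["I(x * log(x))", "Pr(>|z|)"]
})
bt_pvalues
all(bt_pvalues > 0.05)
```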
## Age Ethnicity EducationLevel
## 0.34912261 0.59833617 0.30235741
## BMI AlcoholConsumption PhysicalActivity
## 0.67958323 0.63484280 0.48742042
## DietQuality SleepQuality SystolicBP
## 0.67710542 0.55295574 0.87822521
## DiastolicBP FastingBloodSugar HbA1c
## 0.09260123 0.15584485 0.54594599
## SerumCreatinine BUNLevels CholesterolTotal
## 0.57905077 0.20243680 0.06664950
## CholesterolLDL CholesterolHDL CholesterolTriglycerides
## 0.23120138 0.23512896 0.71803384
## FatigueLevels QualityOfLifeScore MedicalCheckupsFrequency
## 0.94076392 0.82019148 0.31525673
## MedicationAdherence HealthLiteracy
## 0.55756168 0.97563695
## [1] TRUE
Since all p-values > 0.05, we have met the linearity assumption and are safe to proceed with further statistical testing.
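The generalized VIF table below matches the output format of car::vif(); a minimal sketch, where the model object name `full_model` is an assumption:

```r
library(car)  # vif()

# Fit the full logistic model and compute generalized VIFs;
# GVIF^(1/(2*Df)) is comparable across predictors with differing Df.
full_model <- glm(Diagnosis ~ ., data = diabetes_data_final, family = binomial)
gvif <- vif(full_model)
gvif
all(gvif[, "GVIF"] < 5)
```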
## GVIF Df GVIF^(1/(2*Df))
## Age 1.044993 1 1.022249
## Gender 1.045455 1 1.022475
## Ethnicity 1.041471 1 1.020525
## SocioeconomicStatus 1.105555 2 1.025404
## EducationLevel 1.033036 1 1.016384
## BMI 1.039275 1 1.019448
## Smoking 1.045637 1 1.022564
## AlcoholConsumption 1.040251 1 1.019927
## PhysicalActivity 1.035862 1 1.017773
## DietQuality 1.030300 1 1.015037
## SleepQuality 1.038179 1 1.018911
## FamilyHistoryDiabetes 1.041220 1 1.020402
## GestationalDiabetes 1.041847 1 1.020709
## PolycysticOvarySyndrome 1.020634 1 1.010264
## PreviousPreDiabetes 1.037703 1 1.018677
## Hypertension 1.142892 1 1.069061
## SystolicBP 1.038834 1 1.019232
## DiastolicBP 1.048538 1 1.023981
## FastingBloodSugar 1.705955 1 1.306122
## HbA1c 1.722189 1 1.312322
## SerumCreatinine 1.056390 1 1.027808
## BUNLevels 1.061918 1 1.030494
## CholesterolTotal 1.032273 1 1.016008
## CholesterolLDL 1.038802 1 1.019216
## CholesterolHDL 1.042717 1 1.021135
## CholesterolTriglycerides 1.063894 1 1.031452
## AntihypertensiveMedications 1.035805 1 1.017745
## Statins 1.049630 1 1.024514
## AntidiabeticMedications 1.034804 1 1.017253
## FrequentUrination 1.176298 1 1.084573
## ExcessiveThirst 1.142393 1 1.068828
## UnexplainedWeightLoss 1.099520 1 1.048580
## FatigueLevels 1.035622 1 1.017655
## BlurredVision 1.069286 1 1.034063
## SlowHealingSores 1.042426 1 1.020992
## TinglingHandsFeet 1.054609 1 1.026941
## QualityOfLifeScore 1.044428 1 1.021973
## HeavyMetalsExposure 1.045855 1 1.022671
## OccupationalExposureChemicals 1.030387 1 1.015080
## WaterQuality 1.034034 1 1.016875
## MedicalCheckupsFrequency 1.050992 1 1.025179
## MedicationAdherence 1.060415 1 1.029764
## HealthLiteracy 1.063377 1 1.031202
## [1] TRUE
There is no multicollinearity detected (VIF < 5) and we are safe to proceed.
| VIF Range | Interpretation |
|---|---|
| VIF = 1 | No correlation between the predictor variable and any others. |
| 1 < VIF < 5 | Moderate correlation, but not severe enough to require correction. |
| VIF ≥ 5 | High correlation, indicating potential multicollinearity problems. Consider corrective actions. |
| VIF ≥ 10 | Very high correlation, requiring immediate corrective actions such as removing variables or using regularization techniques. |
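The remaining assumption, the absence of significant outliers, can be screened via standardized deviance residuals; a minimal sketch, where the threshold of 3 is a common rule of thumb and not necessarily the exact check used here:

```r
# Flag potential outliers via standardized deviance residuals of the full model.
std_resid <- rstandard(full_model)
sum(abs(std_resid) > 3)   # count of flagged observations
```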
As our conditions have been met, with only a few outliers present, we may begin further analysis, bearing in mind that these points could exert undue influence on the results.
We split our data set into a training and test set for further analysis and repeat this process 10 times:
```r
# Set the seed for reproducibility
set.seed(123)
library(caret)  # createDataPartition(), confusionMatrix()
response <- diabetes_data_final$Diagnosis
predictors <- diabetes_data_final[, !colnames(diabetes_data_final) %in% "Diagnosis"]
errors <- numeric(10)
sensitivities <- numeric(10)
specificities <- numeric(10)
for (i in 1:10) {
  set.seed(123 + i)
  # Stratified 80/20 train/test split
  trainIndex <- createDataPartition(response, p = 0.80, list = FALSE)
  trainData <- diabetes_data_final[trainIndex, ]
  testData <- diabetes_data_final[-trainIndex, ]
  # Fit the logistic model on the training data
  model <- glm(Diagnosis ~ ., data = trainData, family = binomial)
  # Classify test observations at the 0.5 probability threshold
  predictions <- predict(model, testData, type = "response")
  predicted_classes <- ifelse(predictions > 0.5, 1, 0)
  # Record test error, sensitivity, and specificity
  errors[i] <- mean(predicted_classes != testData$Diagnosis)
  cm <- confusionMatrix(as.factor(predicted_classes), as.factor(testData$Diagnosis))
  sensitivities[i] <- cm$byClass["Sensitivity"]
  specificities[i] <- cm$byClass["Specificity"]
}
```
From this, we attain the following outputs:
## Mean Test Error: 0.172
## Mean Sensitivity: 0.8711111
## Mean Specificity: 0.7633333
We define these terms as follows:
| Terms | Definitions |
|---|---|
| Test Error | The proportion of incorrect predictions made by the model on the test set. |
| Sensitivity | The proportion of actual positive cases correctly identified by the model. |
| Specificity | The proportion of actual negative cases correctly identified by the model. |
In the context of our problem statement, across our 10 repeated train/test splits, we found 17.2 percent of observations to be misclassified, with 87.11 percent of positives and 76.33 percent of negatives correctly identified. This gives us confidence in the validity of our model for predicting diabetes, as none of our error rates exceed 25%.
The full logistic regression equation is as follows:
## Logistic Regression Equation:
## logit(P) = -16.775 * (Intercept) + 0 * Age + 0.075 * Gender1 + -0.082 * Ethnicity + 0.158 * SocioeconomicStatus1 + 0.379 * SocioeconomicStatus2 + 0.029 * EducationLevel + 0.012 * BMI + 0.258 * Smoking1 + 0.004 * AlcoholConsumption + -0.014 * PhysicalActivity + -0.035 * DietQuality + -0.015 * SleepQuality + 0.256 * FamilyHistoryDiabetes1 + 0.547 * GestationalDiabetes1 + 0.081 * PolycysticOvarySyndrome1 + -0.186 * PreviousPreDiabetes1 + 1.632 * Hypertension1 + -0.004 * SystolicBP + 0.005 * DiastolicBP + 0.051 * FastingBloodSugar + 1.039 * HbA1c + 0.028 * SerumCreatinine + 0.006 * BUNLevels + 0 * CholesterolTotal + -0.001 * CholesterolLDL + -0.001 * CholesterolHDL + 0.001 * CholesterolTriglycerides + -0.107 * AntihypertensiveMedications1 + -0.139 * Statins1 + -0.136 * AntidiabeticMedications1 + 1.656 * FrequentUrination1 + 1.169 * ExcessiveThirst1 + 1.24 * UnexplainedWeightLoss1 + 0.017 * FatigueLevels + 0.732 * BlurredVision1 + 0.309 * SlowHealingSores1 + 0.136 * TinglingHandsFeet1 + 0.002 * QualityOfLifeScore + 0.098 * HeavyMetalsExposure1 + 0.019 * OccupationalExposureChemicals1 + 0.241 * WaterQuality1 + -0.002 * MedicalCheckupsFrequency + 0.004 * MedicationAdherence + -0.005 * HealthLiteracy
This equation is notably unwieldy and largely uninformative to the average reader. As such, we must trim our model to find the most significant predictors, first taking an overview of the relative importance of each of our variables:
##
## Call:
## glm(formula = Diagnosis ~ ., family = binomial, data = diabetes_data_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1178 -0.5430 -0.1495 0.4722 3.1559
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.677e+01 1.228e+00 -13.664 < 2e-16 ***
## Age 3.711e-04 3.443e-03 0.108 0.91415
## Gender1 7.493e-02 1.404e-01 0.534 0.59368
## Ethnicity -8.228e-02 6.791e-02 -1.212 0.22563
## SocioeconomicStatus1 1.583e-01 1.711e-01 0.925 0.35479
## SocioeconomicStatus2 3.786e-01 1.835e-01 2.063 0.03907 *
## EducationLevel 2.850e-02 7.941e-02 0.359 0.71964
## BMI 1.227e-02 9.636e-03 1.273 0.20291
## Smoking1 2.579e-01 1.543e-01 1.671 0.09463 .
## AlcoholConsumption 3.822e-03 1.193e-02 0.320 0.74862
## PhysicalActivity -1.379e-02 2.470e-02 -0.558 0.57660
## DietQuality -3.503e-02 2.405e-02 -1.457 0.14525
## SleepQuality -1.505e-02 4.065e-02 -0.370 0.71115
## FamilyHistoryDiabetes1 2.555e-01 1.626e-01 1.571 0.11613
## GestationalDiabetes1 5.467e-01 2.355e-01 2.321 0.02028 *
## PolycysticOvarySyndrome1 8.144e-02 3.341e-01 0.244 0.80740
## PreviousPreDiabetes1 -1.855e-01 1.929e-01 -0.962 0.33619
## Hypertension1 1.632e+00 2.005e-01 8.141 3.93e-16 ***
## SystolicBP -4.475e-03 2.727e-03 -1.641 0.10079
## DiastolicBP 5.184e-03 4.087e-03 1.269 0.20461
## FastingBloodSugar 5.061e-02 2.642e-03 19.152 < 2e-16 ***
## HbA1c 1.039e+00 5.611e-02 18.521 < 2e-16 ***
## SerumCreatinine 2.775e-02 5.378e-02 0.516 0.60592
## BUNLevels 6.415e-03 5.522e-03 1.162 0.24538
## CholesterolTotal -2.137e-04 1.605e-03 -0.133 0.89409
## CholesterolLDL -5.492e-04 1.625e-03 -0.338 0.73532
## CholesterolHDL -1.185e-03 3.022e-03 -0.392 0.69504
## CholesterolTriglycerides 1.377e-03 6.984e-04 1.972 0.04859 *
## AntihypertensiveMedications1 -1.067e-01 1.556e-01 -0.686 0.49266
## Statins1 -1.389e-01 1.430e-01 -0.971 0.33154
## AntidiabeticMedications1 -1.360e-01 1.543e-01 -0.881 0.37823
## FrequentUrination1 1.656e+00 1.820e-01 9.099 < 2e-16 ***
## ExcessiveThirst1 1.169e+00 1.854e-01 6.301 2.96e-10 ***
## UnexplainedWeightLoss1 1.240e+00 2.277e-01 5.446 5.15e-08 ***
## FatigueLevels 1.723e-02 2.411e-02 0.714 0.47497
## BlurredVision1 7.319e-01 2.327e-01 3.146 0.00166 **
## SlowHealingSores1 3.091e-01 2.236e-01 1.382 0.16694
## TinglingHandsFeet1 1.362e-01 2.185e-01 0.623 0.53320
## QualityOfLifeScore 1.991e-03 2.446e-03 0.814 0.41564
## HeavyMetalsExposure1 9.824e-02 3.108e-01 0.316 0.75194
## OccupationalExposureChemicals1 1.927e-02 2.225e-01 0.087 0.93098
## WaterQuality1 2.406e-01 1.721e-01 1.398 0.16209
## MedicalCheckupsFrequency -1.680e-03 6.279e-02 -0.027 0.97865
## MedicationAdherence 3.518e-03 2.432e-02 0.145 0.88499
## HealthLiteracy -5.175e-03 2.441e-02 -0.212 0.83213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2529.5 on 1878 degrees of freedom
## Residual deviance: 1336.2 on 1834 degrees of freedom
## AIC: 1426.2
##
## Number of Fisher Scoring iterations: 6
As is common practice in statistics, we note that only variables with p-values less than 0.05 are deemed statistically significant. We will therefore prune our selection, retaining only the variables that meet this criterion.
The variables with p-values less than 0.05 are as follows:
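A sketch of how these can be extracted from the fitted full model:

```r
# Extract the names of terms significant at the 0.05 level.
coefs <- summary(full_model)$coefficients
significant_vars <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
significant_vars
```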
## [1] "(Intercept)" "SocioeconomicStatus2"
## [3] "GestationalDiabetes1" "Hypertension1"
## [5] "FastingBloodSugar" "HbA1c"
## [7] "CholesterolTriglycerides" "FrequentUrination1"
## [9] "ExcessiveThirst1" "UnexplainedWeightLoss1"
## [11] "BlurredVision1"
Now that we have established our useful variables, we may condense our data set by first selecting only those relevant:
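A minimal sketch of this selection; the dummy-coded term names (e.g., Hypertension1) are mapped back to their source columns, and the resulting data set is named `diabetes_data_mod`, matching the code later in this section:

```r
# Keep the columns underlying the significant terms, plus the response.
keep_vars <- c("SocioeconomicStatus", "GestationalDiabetes", "Hypertension",
               "FastingBloodSugar", "HbA1c", "CholesterolTriglycerides",
               "FrequentUrination", "ExcessiveThirst", "UnexplainedWeightLoss",
               "BlurredVision", "Diagnosis")
diabetes_data_mod <- diabetes_data_final[, keep_vars]
str(diabetes_data_mod)
```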
This table represents our most important variables for predicting Diagnosis. We can clearly observe that the majority of our predictors are factor variables, with only three, FastingBloodSugar, HbA1c, and CholesterolTriglycerides, being numerical.
| Criteria | Details |
|---|---|
| Steps to Use Decision Trees | Choose a splitting criterion (e.g., Gini impurity); recursively partition the data on the predictor and cut-point that best separate the classes; stop when nodes are pure or a complexity limit is reached; prune the tree to avoid overfitting. |
| Advantages of Decision Trees | Easy to interpret and visualize; handle numeric and categorical predictors without scaling; capture nonlinear relationships and interactions; provide variable importance measures. |
Now that we have an overview of the significance and purpose of decision trees, it is time to apply them to our data set diabetes_data_final. We begin by constructing an unpruned decision tree from all variables and reporting its size:
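A minimal sketch of this step using rpart; growing the tree with cp = 0 to obtain a fully unpruned fit is an assumption:

```r
library(rpart)  # rpart(), prune()

# Grow a large, unpruned classification tree on all predictors.
unpruned_tree <- rpart(Diagnosis ~ ., data = diabetes_data_final,
                       method = "class", cp = 0)
nrow(unpruned_tree$frame)                    # total number of nodes
sum(unpruned_tree$frame$var == "<leaf>")     # number of terminal nodes
```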
And now we prune our tree, selecting the complexity parameter that minimizes cross-validation error. The motivations for pruning are summarized below, with a sketch of the pruning step following the table:
| Reason | Explanation |
|---|---|
| Reduce Overfitting | Pruning helps remove branches that capture noise, leading to better generalization on new data. |
| Improve Interpretability | A pruned tree is simpler and easier to understand, making it more interpretable for stakeholders. |
| Enhance Performance | By eliminating less significant splits, pruning can improve the model’s predictive performance and reduce complexity. |
| Prevent Over-complexity | Pruning avoids creating an overly complex model that may fit the training data too closely and fail on test data. |
| Ensure Robustness | Pruned trees are generally more robust and less sensitive to small variations in the data. |
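The pruning step itself mirrors the logic used inside the evaluation function later in this section:

```r
# Prune at the complexity parameter (cp) that minimizes the
# cross-validation error (xerror) in the cp table.
best_cp <- unpruned_tree$cptable[which.min(unpruned_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(unpruned_tree, cp = best_cp)
```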
We once again split the data into an 80/20 training and test split in order to evaluate the mean test prediction error of our findings:
```r
train_and_evaluate <- function(data, response, train_frac = 0.8, n_iterations = 10) {
  errors <- numeric(n_iterations)
  for (i in 1:n_iterations) {
    set.seed(i)
    # Split the data into training and testing sets
    train_index <- createDataPartition(data[[response]], p = train_frac, list = FALSE)
    train_data <- data[train_index, ]
    test_data <- data[-train_index, ]
    # Train the decision tree model
    formula <- as.formula(paste(response, "~ ."))
    model <- rpart(formula, data = train_data, method = "class", cp = 0.01)
    # Prune the tree at the cp value minimizing cross-validation error
    best_cp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
    pruned_model <- prune(model, cp = best_cp)
    # Predict on the test set with the pruned model
    pruned_predictions <- predict(pruned_model, newdata = test_data, type = "class")
    # Record the pruned model's prediction error
    confusion_matrix <- table(pruned_predictions, test_data[[response]])
    errors[i] <- 1 - sum(diag(confusion_matrix)) / sum(confusion_matrix)
  }
  # Return the mean test prediction error
  mean(errors)
}
response <- "Diagnosis"
# Calculate the mean test prediction error over 10 splits
mean_error <- train_and_evaluate(diabetes_data_mod, response, train_frac = 0.8, n_iterations = 10)
print(paste("Mean Test Prediction Error:", round(mean_error, 4)))
```
## [1] "Mean Test Prediction Error: 0.0872"
As our mean error of 0.0872 is well under 10%, our decision tree performs satisfactorily, with an error much lower than that of logistic regression.
It would also be wise to quantify which of our variables are most important via a variable importance plot:
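A minimal sketch of how the importance scores and plot can be obtained from the pruned rpart model; the plotting choices are illustrative:

```r
# rpart stores variable importance as the total impurity reduction
# attributable to each variable (including surrogate splits).
importance <- sort(pruned_tree$variable.importance, decreasing = TRUE)
barplot(importance, las = 2, cex.names = 0.7,
        main = "Variable Importance (Pruned Tree)")
head(importance, 5)
```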
We now select the top 5 most important variables and present them below:
| Variable | Importance |
|---|---|
| HbA1c | 471.58547 |
| FastingBloodSugar | 191.53513 |
| ExcessiveThirst | 65.57590 |
| Hypertension | 63.73770 |
| FrequentUrination | 61.85861 |
These importance values indicate that each of these variables contributes substantially to reducing impurity across the tree's splits, and hence to model performance.
When comparing to our logistic regression, we find that:
## [1] "HbA1c" "FastingBloodSugar" "ExcessiveThirst"
## [4] "Hypertension" "FrequentUrination"
all appear among the significant predictors in both models. This gives us confidence that the variables identified by our logistic regression:
## [1] "(Intercept)" "SocioeconomicStatus2"
## [3] "GestationalDiabetes1" "Hypertension1"
## [5] "FastingBloodSugar" "HbA1c"
## [7] "CholesterolTriglycerides" "FrequentUrination1"
## [9] "ExcessiveThirst1" "UnexplainedWeightLoss1"
## [11] "BlurredVision1"
are credible and useful for predicting Diagnosis without the need for other predictors. To test these claims, we now fit our models to the full data set:
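A sketch of this step: the logistic model is refit on the full data and its coefficient table summarized via broom::tidy(); the object name `test_model` matches the reference in the conclusion below:

```r
library(broom)  # tidy()

# Refit on the full data set and summarize the tidy coefficient table.
test_model <- glm(Diagnosis ~ ., data = diabetes_data_final, family = binomial)
tidy_coefs <- tidy(test_model)
summary(tidy_coefs)
```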
## term estimate std.error statistic
## Length:45 Min. :-16.77482 Min. :0.0006984 Min. :-13.6636
## Class :character 1st Qu.: -0.00168 1st Qu.:0.0119279 1st Qu.: -0.3381
## Mode :character Median : 0.01227 Median :0.0794062 Median : 0.5159
## Mean : -0.16261 Mean :0.1325002 Mean : 1.5131
## 3rd Qu.: 0.24058 3rd Qu.:0.1854485 3rd Qu.: 1.5712
## Max. : 1.65572 Max. :1.2277011 Max. : 19.1518
## p.value
## Min. :0.00000
## 1st Qu.:0.09463
## Median :0.33619
## Mean :0.38826
## 3rd Qu.:0.71115
## Max. :0.97865
The variables with p-values less than 0.05 are as follows:
| Term | Estimate | Std. Error | Statistic | P-value |
|---|---|---|---|---|
| (Intercept) | -16.7748247 | 1.2277011 | -13.663606 | 0.0000000 |
| SocioeconomicStatus2 | 0.3786291 | 0.1834956 | 2.063423 | 0.0390725 |
| GestationalDiabetes1 | 0.5466823 | 0.2355153 | 2.321218 | 0.0202751 |
| Hypertension1 | 1.6323338 | 0.2005136 | 8.140762 | 0.0000000 |
| FastingBloodSugar | 0.0506091 | 0.0026425 | 19.151820 | 0.0000000 |
| HbA1c | 1.0392162 | 0.0561106 | 18.520859 | 0.0000000 |
| CholesterolTriglycerides | 0.0013773 | 0.0006984 | 1.972148 | 0.0485927 |
| FrequentUrination1 | 1.6557201 | 0.1819685 | 9.098937 | 0.0000000 |
| ExcessiveThirst1 | 1.1685055 | 0.1854485 | 6.300969 | 0.0000000 |
| UnexplainedWeightLoss1 | 1.2403127 | 0.2277426 | 5.446117 | 0.0000001 |
| BlurredVision1 | 0.7319367 | 0.2326673 | 3.145851 | 0.0016560 |
From our tree, we also observe:
| Variable | Importance |
|---|---|
| HbA1c | 598.49772 |
| FastingBloodSugar | 247.84455 |
| Hypertension | 92.42778 |
| FrequentUrination | 89.13811 |
| ExcessiveThirst | 66.62333 |
The factors relevant to predicting Diagnosis are identical to those identified in our prior cross-validation testing. As such, we are confident in reporting that the significant predictors listed above from our full-data model are those most closely associated with whether or not a patient has diabetes.
In our analysis of the diabetes dataset, we employed logistic regression and a decision tree to determine which variables were most statistically significant for predicting Diagnosis, that is, whether or not a patient has diabetes.
The logistic regression model (test_model) identified several statistically significant predictors of diabetes diagnosis. The most notable predictors included:
| Term | Estimate | Std. Error | Statistic | P-value |
|---|---|---|---|---|
| (Intercept) | -16.7748247 | 1.2277011 | -13.663606 | 0.0000000 |
| SocioeconomicStatus2 | 0.3786291 | 0.1834956 | 2.063423 | 0.0390725 |
| GestationalDiabetes1 | 0.5466823 | 0.2355153 | 2.321218 | 0.0202751 |
| Hypertension1 | 1.6323338 | 0.2005136 | 8.140762 | 0.0000000 |
| FastingBloodSugar | 0.0506091 | 0.0026425 | 19.151820 | 0.0000000 |
| HbA1c | 1.0392162 | 0.0561106 | 18.520859 | 0.0000000 |
| CholesterolTriglycerides | 0.0013773 | 0.0006984 | 1.972148 | 0.0485927 |
| FrequentUrination1 | 1.6557201 | 0.1819685 | 9.098937 | 0.0000000 |
| ExcessiveThirst1 | 1.1685055 | 0.1854485 | 6.300969 | 0.0000000 |
| UnexplainedWeightLoss1 | 1.2403127 | 0.2277426 | 5.446117 | 0.0000001 |
| BlurredVision1 | 0.7319367 | 0.2326673 | 3.145851 | 0.0016560 |
When interpreting these results, we consider the following rough heuristic for AIC (noting that AIC is primarily a relative measure, best used to compare candidate models fit to the same data):
| AIC_Range | Interpretation |
|---|---|
| < 100 | Very strong evidence for the model |
| 100 - 200 | Strong evidence for the model |
| 200 - 300 | Weak evidence for the model |
| > 300 | Very weak evidence for the model |
As for R²:
| R2_Range | Interpretation |
|---|---|
| 0 - 0.1 | Very weak explanatory power |
| 0.1 - 0.3 | Weak explanatory power |
| 0.3 - 0.5 | Moderate explanatory power |
| > 0.5 | Strong explanatory power |
The model fit was evaluated using the AIC, which at 1426.22 indicates, per the heuristic above, very weak evidence for the model in absolute terms. We also calculated an R² value of 0.4717, and from our table, determine that the model has moderate explanatory power for our classification task. The coefficients specified beforehand were, due to their low p-values, the only ones of statistical significance in the regression, with all others having p-values too large to be considered useful.
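The reported R² is consistent with McFadden's pseudo-R², which can be computed directly from the null and residual deviances in the model summary above:

```r
# McFadden's pseudo-R^2: 1 - residual deviance / null deviance
# = 1 - 1336.2 / 2529.5 ~ 0.4717, matching the value reported above.
pseudo_r2 <- 1 - test_model$deviance / test_model$null.deviance
pseudo_r2
```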
The pruned decision tree (pruned_model) consisted of 37 nodes, of which 19 were terminal. The tree highlighted HbA1c, FastingBloodSugar, Hypertension, FrequentUrination, and ExcessiveThirst as the most influential factors in predicting Diagnosis. Pruning the tree helped to reduce complexity and improve the model's generalizability without significantly compromising its predictive accuracy.
Overall, both models provided valuable insights into the factors associated with diabetes diagnosis, with the logistic regression model offering a more straightforward interpretation of predictor significance, while the decision tree provided a clear hierarchical structure of decision rules.
The decision tree is ultimately the preferred model, as it maintained a test error of 0.0872, far lower than the logistic model's error of 0.172.
The moderate R² value of our logistic regression model indicates that the predictors used explain only part of the variability in diabetes diagnosis. This highlights the complexity of diabetes, suggesting that many factors influencing the disease may not be captured in our current model. To better understand the root causes of diabetes and improve health outcomes for future generations, further research is essential. This research should aim to identify and incorporate additional predictors, including genetic, environmental, and lifestyle factors, to develop more comprehensive and effective predictive models.
| Source |
|---|
| Kharoua, Rabie El. “Diabetes Health Dataset Analysis🩸.” Kaggle, 11 June 2024, www.kaggle.com/datasets/rabieelkharoua/diabetes-health-dataset-analysis. |