Dataset Overview

Today, we will be working with a diabetes dataset due to the growing prevalence and significant health impact of diabetes in the USA. Diabetes is a chronic disease that affects millions of Americans, leading to serious health complications, including heart disease, kidney failure, and vision loss. Understanding and analyzing diabetes data can help in identifying trends, risk factors, and potential preventive measures. In this analysis, we apply data science techniques to gain valuable insights into the epidemiology of diabetes, improve patient outcomes, and inform public health strategies to combat this pervasive condition.

Diabetes Dataset Variables and Descriptions
Heading Variable Description
Patient ID
PatientID A unique identifier assigned to each patient (6000 to 7878).
Demographic Details
Age The age of the patients ranges from 20 to 90 years.
Gender Gender of the patients, where 0 represents Male and 1 represents Female.
Ethnicity The ethnicity of the patients, coded as follows: 0: Caucasian, 1: African American, 2: Asian, 3: Other.
SocioeconomicStatus The socioeconomic status of the patients, coded as follows: 0: Low, 1: Middle, 2: High.
EducationLevel The education level of the patients, coded as follows: 0: None, 1: High School, 2: Bachelor’s, 3: Higher.
Lifestyle Factors
BMI Body Mass Index of the patients, ranging from 15 to 40.
Smoking Smoking status, where 0 indicates No and 1 indicates Yes.
AlcoholConsumption Weekly alcohol consumption in units, ranging from 0 to 20.
PhysicalActivity Weekly physical activity in hours, ranging from 0 to 10.
DietQuality Diet quality score, ranging from 0 to 10.
SleepQuality Sleep quality score, ranging from 4 to 10.
Medical History
FamilyHistoryDiabetes Family history of diabetes, where 0 indicates No and 1 indicates Yes.
GestationalDiabetes History of gestational diabetes, where 0 indicates No and 1 indicates Yes.
PolycysticOvarySyndrome Presence of polycystic ovary syndrome, where 0 indicates No and 1 indicates Yes.
PreviousPreDiabetes History of previous pre-diabetes, where 0 indicates No and 1 indicates Yes.
Hypertension Presence of hypertension, where 0 indicates No and 1 indicates Yes.
Clinical Measurements
SystolicBP Systolic blood pressure, ranging from 90 to 180 mmHg.
DiastolicBP Diastolic blood pressure, ranging from 60 to 120 mmHg.
FastingBloodSugar Fasting blood sugar levels, ranging from 70 to 200 mg/dL.
HbA1c Hemoglobin A1c levels, ranging from 4.0% to 10.0%.
SerumCreatinine Serum creatinine levels, ranging from 0.5 to 5.0 mg/dL.
BUNLevels Blood Urea Nitrogen levels, ranging from 5 to 50 mg/dL.
CholesterolTotal Total cholesterol levels, ranging from 150 to 300 mg/dL.
CholesterolLDL Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL.
CholesterolHDL High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL.
CholesterolTriglycerides Triglycerides levels, ranging from 50 to 400 mg/dL.
Medications
AntihypertensiveMedications Use of antihypertensive medications, where 0 indicates No and 1 indicates Yes.
Statins Use of statins, where 0 indicates No and 1 indicates Yes.
AntidiabeticMedications Use of antidiabetic medications, where 0 indicates No and 1 indicates Yes.
Symptoms and Quality of Life
FrequentUrination Presence of frequent urination, where 0 indicates No and 1 indicates Yes.
ExcessiveThirst Presence of excessive thirst, where 0 indicates No and 1 indicates Yes.
UnexplainedWeightLoss Presence of unexplained weight loss, where 0 indicates No and 1 indicates Yes.
FatigueLevels Fatigue levels, ranging from 0 to 10.
BlurredVision Presence of blurred vision, where 0 indicates No and 1 indicates Yes.
SlowHealingSores Presence of slow-healing sores, where 0 indicates No and 1 indicates Yes.
TinglingHandsFeet Presence of tingling in hands or feet, where 0 indicates No and 1 indicates Yes.
QualityOfLifeScore Quality of life score, ranging from 0 to 100.
Environmental and Occupational Exposures
HeavyMetalsExposure Exposure to heavy metals, where 0 indicates No and 1 indicates Yes.
OccupationalExposureChemicals Occupational exposure to harmful chemicals, where 0 indicates No and 1 indicates Yes.
WaterQuality Quality of water, where 0 indicates Good and 1 indicates Poor.
Health Behaviors
MedicalCheckupsFrequency Frequency of medical check-ups per year, ranging from 0 to 4.
MedicationAdherence Medication adherence score, ranging from 0 to 10.
HealthLiteracy Health literacy score, ranging from 0 to 10.
Diagnosis Information (Target Variable)
Diagnosis Diagnosis status for Diabetes, where 0 indicates No and 1 indicates Yes.

We now take a short sample overview of our data:
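A minimal sketch of how such an overview can be produced is given below; the file name is hypothetical, as the source data location is not specified in this report.

# Sketch: load the data and take a quick overview (file name is hypothetical)
diabetes_data <- read.csv("diabetes_data.csv")
dim(diabetes_data)   # 1879 rows, 46 columns
str(diabetes_data)   # variable names, types, and sample values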

We observe that there are 1879 observations, as well as 46 variables in our data set.

It is also evident that our response variable, Diagnosis, is categorical, taking on only the values 0 and 1. Because of this, we will perform logistic regression and NOT linear regression in order to predict whether or not a patient has diabetes.

Moving forward, we shall apply a decision tree to our data in order to better interpret and evaluate which variables, if any, are essential for predicting diabetes.

Specifically, the question we hope to answer is how accurately we can predict Diagnosis, the response variable, as a function of a set of relevant predictor variables to be determined later in this report. The overarching goal is to identify which variables are most essential to this prediction under each of the aforementioned statistical methods, and to compare the models via resampling methods to validate our results.

Logistic Regression

Criteria Details
Steps to Use Logistic Regression
  1. Prepare your data:
    - Collect and clean your data to ensure it is free from errors and missing values.
    - Encode categorical variables using methods such as as.factor() so they are handled correctly by the model.
    - Normalize or standardize continuous variables if necessary.
    - Split your data into training and testing sets to evaluate model performance.
  2. Fit the logistic regression model:
    - Select the appropriate logistic regression function from your statistical software or library (e.g., glm in R, LogisticRegression in scikit-learn).
    - Specify the response variable and predictor variables.
    - Train the model on the training dataset.
  3. Evaluate the model:
    - Use metrics like accuracy and precision to assess model performance on the test dataset.
    - Examine the confusion matrix to understand the model’s classification performance.
    - Consider plotting the ROC curve and calculating the AUC for a comprehensive evaluation.
    - Perform cross-validation to ensure the model’s robustness and generalizability.
Advantages of Logistic Regression
  1. Handles binary outcomes: Logistic regression is particularly well-suited for binary outcome variables, where the response variable can only take on two possible values (e.g., success/failure, yes/no).
  2. Provides probabilities: The model estimates probabilities for the occurrence of an event, allowing for a more nuanced understanding of the likelihood of outcomes, which can be useful for decision-making processes.
  3. Simple to implement: Logistic regression is relatively straightforward to implement and interpret, making it a popular choice for binary classification problems. It does not require complex computations and can be easily implemented using various statistical software packages.
  4. Feature importance: The coefficients obtained from logistic regression provide insights into the importance and influence of each predictor variable on the outcome, helping in feature selection and understanding the underlying relationships in the data.

Now that we have defined what logistic regression is and its capacities, we must first inspect our dataset and prune any unnecessary variables that are not applicable to our regression.

## [1] "DoctorInCharge"

We may now safely remove this variable from our dataset and begin the next stage of data processing.

We notice from our overview of variables that those such as Gender, Ethnicity, and the response variable Diagnosis are categorical, as they only take on values such as 0, 1, or 2. We account for this by converting them into factor variables before regression.

We may now begin logistic regression, remembering to exclude PatientID from our predictors, as it is only a unique identifier for each patient and carries no information about diabetes status.
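A minimal sketch of this preprocessing is given below; the raw data frame name diabetes_data is hypothetical, while diabetes_data_final matches the object used in the analysis code later in this report, and only a few of the coded variables are shown for illustration.

# Drop the identifier columns that carry no predictive information
diabetes_data_final <- subset(diabetes_data, select = -c(DoctorInCharge, PatientID))

# Convert coded variables to factors (subset shown for illustration)
coded_vars <- c("Gender", "SocioeconomicStatus", "Hypertension", "Diagnosis")
diabetes_data_final[coded_vars] <- lapply(diabetes_data_final[coded_vars], as.factor)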

The model we derive will estimate the probability of a patient having diabetes based on all of our predictors: a predicted probability exceeding 0.5 means we predict the patient has diabetes, while a probability of 0.5 or below means we predict the patient does NOT have diabetes.

The assumptions will now be checked to ensure our data set is suitable for testing:

Assumptions Definitions
Binary Outcome The dependent variable should be binary.
Independence of Observations Observations should be independent of each other.
Linearity of Independent Variables and Log Odds There should be a linear relationship between the independent variables and the log odds of the dependent variable.
No Multicollinearity Independent variables should not be highly correlated with each other.
No Outliers There should be no significant outliers in the data.
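A minimal sketch of the binary outcome check, assuming the cleaned data frame diabetes_data_final:

# Tabulate the response to confirm it only takes the values 0 and 1
diagnosis_check <- diabetes_data_final$Diagnosis
table(diagnosis_check)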
## diagnosis_check
##    0    1 
## 1127  752

It is evident from our table that the binary outcome condition has been met, as we have only 0s and 1s as outputs. Specifically, we find 1127 cases of no diabetes and 752 cases of diabetes, meaning that approximately 59.98% of patients in our data set do NOT have diabetes.

##                      Age                Ethnicity           EducationLevel 
##               0.34912261               0.59833617               0.30235741 
##                      BMI       AlcoholConsumption         PhysicalActivity 
##               0.67958323               0.63484280               0.48742042 
##              DietQuality             SleepQuality               SystolicBP 
##               0.67710542               0.55295574               0.87822521 
##              DiastolicBP        FastingBloodSugar                    HbA1c 
##               0.09260123               0.15584485               0.54594599 
##          SerumCreatinine                BUNLevels         CholesterolTotal 
##               0.57905077               0.20243680               0.06664950 
##           CholesterolLDL           CholesterolHDL CholesterolTriglycerides 
##               0.23120138               0.23512896               0.71803384 
##            FatigueLevels       QualityOfLifeScore MedicalCheckupsFrequency 
##               0.94076392               0.82019148               0.31525673 
##      MedicationAdherence           HealthLiteracy 
##               0.55756168               0.97563695
## [1] TRUE

Since all p-values > 0.05, we have met the linearity assumption and are safe to proceed with further statistical testing.
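One common way to obtain per-variable p-values like those above is a Box-Tidwell-style check, which adds an x * log(x) interaction term for each continuous predictor and tests its significance; the sketch below illustrates this under the assumption that the cleaned data frame is diabetes_data_final (the original analysis may have used a slightly different procedure).

# Box-Tidwell-style linearity check for one predictor at a time (sketch)
check_linearity <- function(data, predictor, response = "Diagnosis") {
  x <- data[[predictor]]
  x <- x - min(x) + 1                       # shift so log() is defined for all values
  y <- data[[response]]
  fit <- glm(y ~ x + I(x * log(x)), family = binomial)
  summary(fit)$coefficients[3, "Pr(>|z|)"]  # p-value of the interaction term
}

# Illustrative subset of the continuous predictors
sapply(c("Age", "BMI", "HbA1c"), check_linearity, data = diabetes_data_final)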

##                                   GVIF Df GVIF^(1/(2*Df))
## Age                           1.044993  1        1.022249
## Gender                        1.045455  1        1.022475
## Ethnicity                     1.041471  1        1.020525
## SocioeconomicStatus           1.105555  2        1.025404
## EducationLevel                1.033036  1        1.016384
## BMI                           1.039275  1        1.019448
## Smoking                       1.045637  1        1.022564
## AlcoholConsumption            1.040251  1        1.019927
## PhysicalActivity              1.035862  1        1.017773
## DietQuality                   1.030300  1        1.015037
## SleepQuality                  1.038179  1        1.018911
## FamilyHistoryDiabetes         1.041220  1        1.020402
## GestationalDiabetes           1.041847  1        1.020709
## PolycysticOvarySyndrome       1.020634  1        1.010264
## PreviousPreDiabetes           1.037703  1        1.018677
## Hypertension                  1.142892  1        1.069061
## SystolicBP                    1.038834  1        1.019232
## DiastolicBP                   1.048538  1        1.023981
## FastingBloodSugar             1.705955  1        1.306122
## HbA1c                         1.722189  1        1.312322
## SerumCreatinine               1.056390  1        1.027808
## BUNLevels                     1.061918  1        1.030494
## CholesterolTotal              1.032273  1        1.016008
## CholesterolLDL                1.038802  1        1.019216
## CholesterolHDL                1.042717  1        1.021135
## CholesterolTriglycerides      1.063894  1        1.031452
## AntihypertensiveMedications   1.035805  1        1.017745
## Statins                       1.049630  1        1.024514
## AntidiabeticMedications       1.034804  1        1.017253
## FrequentUrination             1.176298  1        1.084573
## ExcessiveThirst               1.142393  1        1.068828
## UnexplainedWeightLoss         1.099520  1        1.048580
## FatigueLevels                 1.035622  1        1.017655
## BlurredVision                 1.069286  1        1.034063
## SlowHealingSores              1.042426  1        1.020992
## TinglingHandsFeet             1.054609  1        1.026941
## QualityOfLifeScore            1.044428  1        1.021973
## HeavyMetalsExposure           1.045855  1        1.022671
## OccupationalExposureChemicals 1.030387  1        1.015080
## WaterQuality                  1.034034  1        1.016875
## MedicalCheckupsFrequency      1.050992  1        1.025179
## MedicationAdherence           1.060415  1        1.029764
## HealthLiteracy                1.063377  1        1.031202
## [1] TRUE

No multicollinearity is detected (all GVIF values are below 5), and we are safe to proceed.

VIF Range Interpretation
VIF = 1 No correlation between the predictor variable and any others.
1 < VIF < 5 Moderate correlation, but not severe enough to require correction.
VIF ≥ 5 High correlation, indicating potential multicollinearity problems. Consider corrective actions.
VIF ≥ 10 Very high correlation, requiring immediate corrective actions such as removing variables or using regularization techniques.
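The GVIF values reported above can be reproduced along the following lines, assuming the car package and the cleaned data frame diabetes_data_final:

# Fit the full logistic model and compute generalized VIFs (sketch)
library(car)
full_model <- glm(Diagnosis ~ ., data = diabetes_data_final, family = binomial)
vif_table <- vif(full_model)    # GVIF, Df, GVIF^(1/(2*Df)) when factors are present
all(vif_table[, "GVIF"] < 5)    # TRUE if no predictor exceeds the threshold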

As our conditions have been met, with only a few outliers present, we may begin further analysis, while remaining mindful that these outliers could influence the results.

We split our data set into a training and test set for further analysis and repeat this process 10 times:

# Load caret for createDataPartition() and confusionMatrix()
library(caret)

# Set the seed for reproducibility
set.seed(123)

response <- diabetes_data_final$Diagnosis
predictors <- diabetes_data_final[, !colnames(diabetes_data_final) %in% "Diagnosis"]

errors <- numeric(10)
sensitivities <- numeric(10)
specificities <- numeric(10)

for (i in 1:10) {

  # Vary the seed so each iteration draws a different 80/20 split
  set.seed(123 + i)
  
  trainIndex <- createDataPartition(response, p = 0.80, list = FALSE)
  trainData <- diabetes_data_final[trainIndex, ]
  testData <- diabetes_data_final[-trainIndex, ]

  # Fit the logistic regression model on the training set
  model <- glm(Diagnosis ~ ., data = trainData, family = binomial)

  # Predicted probabilities on the test set, classified at the 0.5 threshold
  predictions <- predict(model, testData, type = "response")
  predicted_classes <- ifelse(predictions > 0.5, 1, 0)

  # Misclassification rate on the test set
  test_error <- mean(predicted_classes != testData$Diagnosis)
  errors[i] <- test_error
  
  cm <- confusionMatrix(as.factor(predicted_classes), as.factor(testData$Diagnosis))

  sensitivities[i] <- cm$byClass["Sensitivity"]
  specificities[i] <- cm$byClass["Specificity"]
}

From this, we obtain the following outputs:

## Mean Test Error:  0.172
## Mean Sensitivity:  0.8711111
## Mean Specificity:  0.7633333

We define these terms as follows:

Terms Definitions
Test Error The proportion of incorrect predictions made by the model on the test set.
Sensitivity The proportion of actual positive cases correctly identified by the model.
Specificity The proportion of actual negative cases correctly identified by the model.

In the context of our problem statement, given our 10 repeated trials of cross-validation, we find that 17.2 percent of our observations are misclassified. Additionally, 87.1 percent of positives and 76.3 percent of negatives are correctly identified. This gives us confidence in the validity of our model for predicting diabetes, as none of our error rates exceed 25%.

The full logistic regression equation is as follows:

## Logistic Regression Equation:
##  logit(P) =  -16.775 * (Intercept) + 0 * Age + 0.075 * Gender1 + -0.082 * Ethnicity + 0.158 * SocioeconomicStatus1 + 0.379 * SocioeconomicStatus2 + 0.029 * EducationLevel + 0.012 * BMI + 0.258 * Smoking1 + 0.004 * AlcoholConsumption + -0.014 * PhysicalActivity + -0.035 * DietQuality + -0.015 * SleepQuality + 0.256 * FamilyHistoryDiabetes1 + 0.547 * GestationalDiabetes1 + 0.081 * PolycysticOvarySyndrome1 + -0.186 * PreviousPreDiabetes1 + 1.632 * Hypertension1 + -0.004 * SystolicBP + 0.005 * DiastolicBP + 0.051 * FastingBloodSugar + 1.039 * HbA1c + 0.028 * SerumCreatinine + 0.006 * BUNLevels + 0 * CholesterolTotal + -0.001 * CholesterolLDL + -0.001 * CholesterolHDL + 0.001 * CholesterolTriglycerides + -0.107 * AntihypertensiveMedications1 + -0.139 * Statins1 + -0.136 * AntidiabeticMedications1 + 1.656 * FrequentUrination1 + 1.169 * ExcessiveThirst1 + 1.24 * UnexplainedWeightLoss1 + 0.017 * FatigueLevels + 0.732 * BlurredVision1 + 0.309 * SlowHealingSores1 + 0.136 * TinglingHandsFeet1 + 0.002 * QualityOfLifeScore + 0.098 * HeavyMetalsExposure1 + 0.019 * OccupationalExposureChemicals1 + 0.241 * WaterQuality1 + -0.002 * MedicalCheckupsFrequency + 0.004 * MedicationAdherence + -0.005 * HealthLiteracy

This full equation is unwieldy and of little interest to the average reader. As such, we must trim our model to the most significant predictors, beginning with an overview of the relative importance of each of our variables:

## 
## Call:
## glm(formula = Diagnosis ~ ., family = binomial, data = diabetes_data_final)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1178  -0.5430  -0.1495   0.4722   3.1559  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -1.677e+01  1.228e+00 -13.664  < 2e-16 ***
## Age                             3.711e-04  3.443e-03   0.108  0.91415    
## Gender1                         7.493e-02  1.404e-01   0.534  0.59368    
## Ethnicity                      -8.228e-02  6.791e-02  -1.212  0.22563    
## SocioeconomicStatus1            1.583e-01  1.711e-01   0.925  0.35479    
## SocioeconomicStatus2            3.786e-01  1.835e-01   2.063  0.03907 *  
## EducationLevel                  2.850e-02  7.941e-02   0.359  0.71964    
## BMI                             1.227e-02  9.636e-03   1.273  0.20291    
## Smoking1                        2.579e-01  1.543e-01   1.671  0.09463 .  
## AlcoholConsumption              3.822e-03  1.193e-02   0.320  0.74862    
## PhysicalActivity               -1.379e-02  2.470e-02  -0.558  0.57660    
## DietQuality                    -3.503e-02  2.405e-02  -1.457  0.14525    
## SleepQuality                   -1.505e-02  4.065e-02  -0.370  0.71115    
## FamilyHistoryDiabetes1          2.555e-01  1.626e-01   1.571  0.11613    
## GestationalDiabetes1            5.467e-01  2.355e-01   2.321  0.02028 *  
## PolycysticOvarySyndrome1        8.144e-02  3.341e-01   0.244  0.80740    
## PreviousPreDiabetes1           -1.855e-01  1.929e-01  -0.962  0.33619    
## Hypertension1                   1.632e+00  2.005e-01   8.141 3.93e-16 ***
## SystolicBP                     -4.475e-03  2.727e-03  -1.641  0.10079    
## DiastolicBP                     5.184e-03  4.087e-03   1.269  0.20461    
## FastingBloodSugar               5.061e-02  2.642e-03  19.152  < 2e-16 ***
## HbA1c                           1.039e+00  5.611e-02  18.521  < 2e-16 ***
## SerumCreatinine                 2.775e-02  5.378e-02   0.516  0.60592    
## BUNLevels                       6.415e-03  5.522e-03   1.162  0.24538    
## CholesterolTotal               -2.137e-04  1.605e-03  -0.133  0.89409    
## CholesterolLDL                 -5.492e-04  1.625e-03  -0.338  0.73532    
## CholesterolHDL                 -1.185e-03  3.022e-03  -0.392  0.69504    
## CholesterolTriglycerides        1.377e-03  6.984e-04   1.972  0.04859 *  
## AntihypertensiveMedications1   -1.067e-01  1.556e-01  -0.686  0.49266    
## Statins1                       -1.389e-01  1.430e-01  -0.971  0.33154    
## AntidiabeticMedications1       -1.360e-01  1.543e-01  -0.881  0.37823    
## FrequentUrination1              1.656e+00  1.820e-01   9.099  < 2e-16 ***
## ExcessiveThirst1                1.169e+00  1.854e-01   6.301 2.96e-10 ***
## UnexplainedWeightLoss1          1.240e+00  2.277e-01   5.446 5.15e-08 ***
## FatigueLevels                   1.723e-02  2.411e-02   0.714  0.47497    
## BlurredVision1                  7.319e-01  2.327e-01   3.146  0.00166 ** 
## SlowHealingSores1               3.091e-01  2.236e-01   1.382  0.16694    
## TinglingHandsFeet1              1.362e-01  2.185e-01   0.623  0.53320    
## QualityOfLifeScore              1.991e-03  2.446e-03   0.814  0.41564    
## HeavyMetalsExposure1            9.824e-02  3.108e-01   0.316  0.75194    
## OccupationalExposureChemicals1  1.927e-02  2.225e-01   0.087  0.93098    
## WaterQuality1                   2.406e-01  1.721e-01   1.398  0.16209    
## MedicalCheckupsFrequency       -1.680e-03  6.279e-02  -0.027  0.97865    
## MedicationAdherence             3.518e-03  2.432e-02   0.145  0.88499    
## HealthLiteracy                 -5.175e-03  2.441e-02  -0.212  0.83213    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2529.5  on 1878  degrees of freedom
## Residual deviance: 1336.2  on 1834  degrees of freedom
## AIC: 1426.2
## 
## Number of Fisher Scoring iterations: 6

As is common practice in statistics, we note that only variables with p-values less than 0.05 are deemed statistically significant. We will therefore prune our selection, accounting only for the variables that meet this criterion.

The variables with p-values less than 0.05 are as follows:

##  [1] "(Intercept)"              "SocioeconomicStatus2"    
##  [3] "GestationalDiabetes1"     "Hypertension1"           
##  [5] "FastingBloodSugar"        "HbA1c"                   
##  [7] "CholesterolTriglycerides" "FrequentUrination1"      
##  [9] "ExcessiveThirst1"         "UnexplainedWeightLoss1"  
## [11] "BlurredVision1"
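A sketch of how this list can be extracted, using the full_model object from the earlier sketch:

# Pull the terms whose coefficients have p-values below 0.05
coefs <- summary(full_model)$coefficients
significant_terms <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
significant_terms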

Now that we have established our useful variables, we may condense our data set by first selecting only those relevant:

This table of data represents our most important variables for predicting Diagnosis. We may clearly observe that a majority of our predictors are factor variables, with only three, FastingBloodSugar, HbA1c, and CholesterolTriglycerides, being numerical.

Decision Trees

Criteria Details
Steps to Use Decision Trees
  1. Prepare your data:
  • Ensure that your dataset is free of missing values and outliers. Remove missing data as appropriate.
  • Identify the relevant features (variables) that will be used for building the decision tree. Remove irrelevant or redundant features.
  • Convert categorical variables into factors and normalize or scale numeric variables if necessary.
  • Divide the dataset into training and testing sets. A common split is 80% for training and 20% for testing.
  2. Fit the decision tree model:
  • Use library(rpart) to load the rpart package in R, which is commonly used for creating decision tree models.
  • Use the rpart() function to build the decision tree model on the training dataset. Specify the formula and the dataset.
  • Adjust the parameters such as minsplit, cp, and maxdepth to improve the model performance.
  3. Evaluate the model:
  • Use the predict() function to make predictions on the testing dataset.
  • Compare the predicted values with the actual values to calculate accuracy, confusion matrix, and other performance metrics.
  • Perform k-fold cross-validation to assess the model’s robustness and generalizability.
  • Use rpart.plot or similar functions to visualize the decision tree.
Advantages of Decision Trees
  1. Easy to interpret: Decision trees are easy to understand and interpret. The rules generated by decision trees can be easily explained to non-technical stakeholders.
  2. Requires little data preprocessing: Decision trees do not require scaling of data and can handle both numerical and categorical data.
  3. Handles non-linearity: Decision trees can capture non-linear relationships between the features and the target variable.
  4. Feature importance: Decision trees provide insights into the importance of different features, which can be useful for feature selection and understanding the data.

Now that we have an overview of the significance and purpose of decision trees, it is time to implement them on our data set diabetes_data_final. We begin by constructing our unpruned decision tree from all variables and reporting its size:

And now we prune our tree, choosing the complexity parameter (cp) that minimizes the cross-validation error:
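A minimal sketch of this construction and pruning, assuming the rpart package and the cleaned data frame diabetes_data_final:

# Grow the unpruned classification tree and report its size (sketch)
library(rpart)
unpruned_tree <- rpart(Diagnosis ~ ., data = diabetes_data_final, method = "class")
sum(unpruned_tree$frame$var == "<leaf>")   # number of terminal nodes

# Prune at the cp value that minimizes cross-validated error (xerror)
best_cp <- unpruned_tree$cptable[which.min(unpruned_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(unpruned_tree, cp = best_cp)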

Benefits of Pruning Decision Trees
Reason Explanation
Reduce Overfitting Pruning helps remove branches that capture noise, leading to better generalization on new data.
Improve Interpretability A pruned tree is simpler and easier to understand, making it more interpretable for stakeholders.
Enhance Performance By eliminating less significant splits, pruning can improve the model’s predictive performance and reduce complexity.
Prevent Over-complexity Pruning avoids creating an overly complex model that may fit the training data too closely and fail on test data.
Ensure Robustness Pruned trees are generally more robust and less sensitive to small variations in the data.

We once again split the data into an 80/20 training and test split in order to evaluate the mean test prediction error of our findings:

# Requires rpart (decision trees) and caret (createDataPartition), loaded above
library(rpart)

train_and_evaluate <- function(data, response, train_frac = 0.8, n_iterations = 10) {
  errors <- numeric(n_iterations)
  
  for (i in 1:n_iterations) {
    set.seed(i)
    
    # Split the data into training and testing sets
    train_index <- createDataPartition(data[[response]], p = train_frac, list = FALSE)
    train_data <- data[train_index, ]
    test_data <- data[-train_index, ]
    
    # Train the decision tree model
    formula <- as.formula(paste(response, "~ ."))
    model <- rpart(formula, data = train_data, method = "class", cp = 0.01)
    
    # Prune the tree at the cp value that minimizes cross-validation error
    best_cp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
    pruned_model <- prune(model, cp = best_cp)
    
    # Predict on the test set with the pruned model
    pruned_predictions <- predict(pruned_model, newdata = test_data, type = "class")
    
    # Record the pruned model's misclassification rate
    pruned_confusion_matrix <- table(pruned_predictions, test_data[[response]])
    errors[i] <- 1 - sum(diag(pruned_confusion_matrix)) / sum(pruned_confusion_matrix)
  }
  
  # Return the mean test prediction error
  mean(errors)
}

response <- "Diagnosis"

# Calculate mean test prediction error
mean_error <- train_and_evaluate(diabetes_data_mod, response, train_frac = 0.8, n_iterations = 10)

print(paste("Mean Test Prediction Error:", round(mean_error, 4)))
## [1] "Mean Test Prediction Error: 0.0872"

As we find our mean error to be 0.0872, we are satisfied with the rate being well under 10%. This indicates that our decision tree is satisfactory, with an error much lower than that of logistic regression.

It would also be wise to better quantify which of our variables are most significant via a variable importance plot:
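The importance values can be read from the variable.importance component of an rpart fit; a minimal sketch using the pruned_tree object from the sketch above:

# Sort variable importance and plot the leading predictors (sketch)
importance <- sort(pruned_tree$variable.importance, decreasing = TRUE)
barplot(head(importance, 10), las = 2, main = "Variable importance (pruned tree)")
head(importance, 5)   # top 5 most important variables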

We now select the top 5 most important variables and present them below:

Top 5 Most Important Variables in the Pruned Decision Tree Model
Variable Importance
11 HbA1c 471.58547
8 FastingBloodSugar 191.53513
7 ExcessiveThirst 65.57590
13 Hypertension 63.73770
10 FrequentUrination 61.85861

These importance values indicate that each of these variables contributes substantially to reducing impurity and improving model performance.

Results

When comparing with our logistic regression, we find that the most important decision tree variables:

## [1] "HbA1c"             "FastingBloodSugar" "ExcessiveThirst"  
## [4] "Hypertension"      "FrequentUrination"

also appear among the significant predictors of our logistic regression. This gives us confidence that the variables identified by logistic regression:

##  [1] "(Intercept)"              "SocioeconomicStatus2"    
##  [3] "GestationalDiabetes1"     "Hypertension1"           
##  [5] "FastingBloodSugar"        "HbA1c"                   
##  [7] "CholesterolTriglycerides" "FrequentUrination1"      
##  [9] "ExcessiveThirst1"         "UnexplainedWeightLoss1"  
## [11] "BlurredVision1"

are credible and useful for predicting Diagnosis without the need for other predictors. To test these claims, we now fit our models to the full data set:
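A sketch of this full-data fit and its summarized coefficient table, assuming the broom package and the full_model object from the earlier sketches:

# Tidy the coefficients of the full-data logistic fit and summarize them (sketch)
library(broom)
tidy_fit <- tidy(full_model)          # columns: term, estimate, std.error, statistic, p.value
summary(tidy_fit)
tidy_fit[tidy_fit$p.value < 0.05, ]   # significant predictors on the full data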

##      term              estimate           std.error           statistic       
##  Length:45          Min.   :-16.77482   Min.   :0.0006984   Min.   :-13.6636  
##  Class :character   1st Qu.: -0.00168   1st Qu.:0.0119279   1st Qu.: -0.3381  
##  Mode  :character   Median :  0.01227   Median :0.0794062   Median :  0.5159  
##                     Mean   : -0.16261   Mean   :0.1325002   Mean   :  1.5131  
##                     3rd Qu.:  0.24058   3rd Qu.:0.1854485   3rd Qu.:  1.5712  
##                     Max.   :  1.65572   Max.   :1.2277011   Max.   : 19.1518  
##     p.value       
##  Min.   :0.00000  
##  1st Qu.:0.09463  
##  Median :0.33619  
##  Mean   :0.38826  
##  3rd Qu.:0.71115  
##  Max.   :0.97865

The variables with p-values less than 0.05 are as follows:

Significant Predictors in Logistic Regression (p-value < 0.05)
Term Estimate Std. Error Statistic P-value
(Intercept) -16.7748247 1.2277011 -13.663606 0.0000000
SocioeconomicStatus2 0.3786291 0.1834956 2.063423 0.0390725
GestationalDiabetes1 0.5466823 0.2355153 2.321218 0.0202751
Hypertension1 1.6323338 0.2005136 8.140762 0.0000000
FastingBloodSugar 0.0506091 0.0026425 19.151820 0.0000000
HbA1c 1.0392162 0.0561106 18.520859 0.0000000
CholesterolTriglycerides 0.0013773 0.0006984 1.972148 0.0485927
FrequentUrination1 1.6557201 0.1819685 9.098937 0.0000000
ExcessiveThirst1 1.1685055 0.1854485 6.300969 0.0000000
UnexplainedWeightLoss1 1.2403127 0.2277426 5.446117 0.0000001
BlurredVision1 0.7319367 0.2326673 3.145851 0.0016560

From our tree, we also observe:

Top 5 Most Important Variables in Pruned Decision Tree
Variable Importance
15 HbA1c 598.49772
12 FastingBloodSugar 247.84455
17 Hypertension 92.42778
14 FrequentUrination 89.13811
11 ExcessiveThirst 66.62333

The factors relevant to predicting Diagnosis are identical to those identified in our prior cross-validation testing. As such, we are confident in reporting that the significant predictors listed above are those most closely associated with whether or not a patient has diabetes.

Conclusion

In our analysis of the Diabetes dataset, we employed logistic regression and a decision tree to determine which variables were most statistically significant for predicting Diagnosis, that is, whether or not a patient has diabetes.

The logistic regression model (test_model) identified several statistically significant predictors of diabetes diagnosis. The most notable predictors included:

Significant Predictors in Logistic Regression (p-value < 0.05)
Term Estimate Std. Error Statistic P-value
(Intercept) -16.7748247 1.2277011 -13.663606 0.0000000
SocioeconomicStatus2 0.3786291 0.1834956 2.063423 0.0390725
GestationalDiabetes1 0.5466823 0.2355153 2.321218 0.0202751
Hypertension1 1.6323338 0.2005136 8.140762 0.0000000
FastingBloodSugar 0.0506091 0.0026425 19.151820 0.0000000
HbA1c 1.0392162 0.0561106 18.520859 0.0000000
CholesterolTriglycerides 0.0013773 0.0006984 1.972148 0.0485927
FrequentUrination1 1.6557201 0.1819685 9.098937 0.0000000
ExcessiveThirst1 1.1685055 0.1854485 6.300969 0.0000000
UnexplainedWeightLoss1 1.2403127 0.2277426 5.446117 0.0000001
BlurredVision1 0.7319367 0.2326673 3.145851 0.0016560

When interpreting these results, we consider the following guidelines regarding AIC:

Interpretation of AIC Values
AIC_Range Interpretation
< 100 Very strong evidence for the model
100 - 200 Strong evidence for the model
200 - 300 Weak evidence for the model
> 300 Very weak evidence for the model

As for R²:

Interpretation of R² Values
R2_Range Interpretation
0 - 0.1 Very weak explanatory power
0.1 - 0.3 Weak explanatory power
0.3 - 0.5 Moderate explanatory power
> 0.5 Strong explanatory power

The model fit was evaluated using the AIC, which was 1426.22, indicating that our model provided weak evidence towards predicting whether or not a patient had diabetes from our chosen predictors. We also calculated an R² value of 0.472 and, from our table, determine that the model has moderate explanatory power for our classification task. The coefficients specified beforehand, due to their low p-values, are the only ones of statistical significance in the regression, with all other variables having p-values too large to be considered useful.
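The reported R² is consistent with McFadden's pseudo-R², computed from the null and residual deviances of the full model; a minimal sketch using the full_model object from earlier:

# McFadden's pseudo-R^2 = 1 - residual deviance / null deviance
pseudo_r2 <- 1 - full_model$deviance / full_model$null.deviance
pseudo_r2   # approximately 1 - 1336.2 / 2529.5 = 0.472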

The pruned decision tree (pruned_model) consisted of 37 nodes, of which 19 were terminal nodes. The tree highlighted HbA1c, FastingBloodSugar, Hypertension, FrequentUrination, and ExcessiveThirst as the most influential factors in predicting Diagnosis. Pruning the tree helped to reduce complexity and improve the model’s generalizability without significantly compromising its predictive accuracy.

Overall, both models provided valuable insights into the factors associated with diabetes diagnosis, with the logistic regression model offering a more straightforward interpretation of predictor significance, while the decision tree provided a clear hierarchical structure of decision rules.

The decision tree is ultimately the preferred model, as it maintained a test error of 0.0872, far lower than that of the logistic model, which had an error of 0.172.

A low R² value in our logistic regression model indicates that the predictors used explain only a small portion of the variability in diabetes diagnosis. This highlights the complexity of diabetes, suggesting that many factors influencing the disease may not be captured in our current model. To better understand the root causes of diabetes and improve health outcomes for future generations, further research is essential. This research should aim to identify and incorporate additional predictors, including genetic, environmental, and lifestyle factors, to develop more comprehensive and effective predictive models.

Bibliography

Source
Kharoua, Rabie El. “Diabetes Health Dataset Analysis🩸.” Kaggle, 11 June 2024, www.kaggle.com/datasets/rabieelkharoua/diabetes-health-dataset-analysis.