2024-08-27

Introduction

  • Shiny app for predicting diabetes risk
  • Based on health indicators from BRFSS 2015 (Kaggle)
  • Development process:
    • Started with 21 health indicators
    • Used a 20% stratified sample due to memory constraints
    • Applied an XGBoost model for feature selection
    • Selected the top 5 most impactful features
  • Uses an XGBoost model
  • Provides risk category and confidence score
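
The 20% stratified sample mentioned above can be illustrated with a minimal base R sketch. The toy data frame, its class sizes, and every variable name here are hypothetical stand-ins, not the app's actual code; the omitted sampling step may well use a helper such as caret's createDataPartition() instead.

```r
# Toy data standing in for the BRFSS table (hypothetical values).
set.seed(1)
toy <- data.frame(
  Diabetes_012 = factor(rep(c("Normal", "Prediabetes", "Diabetes"),
                            times = c(800, 50, 150))),
  BMI = round(rnorm(1000, mean = 28, sd = 5), 1)
)

# Draw 20% of the row indices within each class, so class shares are preserved.
idx_by_class <- split(seq_len(nrow(toy)), toy$Diabetes_012)
sample_index <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.2 * length(idx)))]
}))
toy_sample <- toy[sample_index, ]

# Class proportions in the full data and in the sample should match closely.
print(prop.table(table(toy$Diabetes_012)))
print(prop.table(table(toy_sample$Diabetes_012)))
```

Sampling within each class matters here because the rarest class is small: a plain random draw could leave it under-represented by chance.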

Data

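# Load the data and keep the five selected features plus the outcome
library(dplyr)

data <- read.csv("diabetes_012_health_indicators_BRFSS2015.csv")
data <- data %>%
  select(HighBP, BMI, DiffWalk, HighChol, GenHlth, Diabetes_012) %>%
  mutate(across(!BMI, factor))  # every column except BMI is categorical
# Outcome codes 0, 1, 2 map to these labels, in that order
levels(data$Diabetes_012) <- c("Normal", "Prediabetes", "Diabetes")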

(some code omitted)

# Display the summary table (kable() comes from the knitr package)
kable(summary_data, digits = 2,
      col.names = c("Diabetes Status", "Count", "High BP", "High Cholesterol", "Avg BMI", "Avg Gen. Health"),
      caption = "Summary of Key Health Indicators by Diabetes Status (20% Sample)")
Summary of Key Health Indicators by Diabetes Status (20% Sample)

Diabetes Status   Count   High BP   High Cholesterol   Avg BMI   Avg Gen. Health
Normal            42741     15740              16015     27.75              2.37
Prediabetes         927       589                567     30.51              2.99
Diabetes           7070      5352               4761     31.97              3.29
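
The raw counts in the summary table are easier to compare as rates. The sketch below copies the counts from the table and assumes that the High BP and High Cholesterol columns are numbers of respondents with each condition (they are integers smaller than Count); the column names are illustrative.

```r
# Counts copied from the summary table (20% sample).
summary_counts <- data.frame(
  status    = c("Normal", "Prediabetes", "Diabetes"),
  count     = c(42741, 927, 7070),
  high_bp   = c(15740, 589, 5352),
  high_chol = c(16015, 567, 4761)
)

# Share of each group with the condition, plus each group's share of the sample.
summary_counts$high_bp_rate   <- summary_counts$high_bp / summary_counts$count
summary_counts$high_chol_rate <- summary_counts$high_chol / summary_counts$count
summary_counts$class_share    <- summary_counts$count / sum(summary_counts$count)

print(format(summary_counts, digits = 3))
```

Under that reading, high blood pressure affects about 37% of the Normal group against about 76% of the Diabetes group, and the Normal group makes up roughly 84% of the sample.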

Model and App Features

  • User inputs:
    • BMI
    • High Blood Pressure
    • High Cholesterol
    • Difficulty Walking
    • General Health
  • Output:
    • Risk category (Normal, Prediabetes, Diabetes)
    • Confidence score
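
One plausible way to turn class probabilities into the two outputs listed above is to report the most likely class together with its probability. This is a sketch under that assumption, not necessarily how the app computes its score; the probabilities are hypothetical. With a caret model, such a table could come from predict(model, newdata = ..., type = "prob").

```r
# Hypothetical class probabilities for three users, one row per user.
probs <- data.frame(
  Normal      = c(0.81, 0.40, 0.15),
  Prediabetes = c(0.04, 0.15, 0.10),
  Diabetes    = c(0.15, 0.45, 0.75)
)

# Risk category: the class with the largest probability in each row.
risk_category <- names(probs)[max.col(as.matrix(probs), ties.method = "first")]

# Confidence score: that largest probability, shown as a percentage.
confidence <- apply(as.matrix(probs), 1, max)

result <- data.frame(risk_category, confidence_pct = round(100 * confidence, 1))
print(result)
```

The second user shows why the score is worth displaying: the predicted category is Diabetes, but at 45% the model is far from certain.
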
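# XGBoost modelling (caret tunes the hyperparameters by resampling)
library(caret)

# Note: 8-26-2024 is evaluated as arithmetic, so the seed is -2042
set.seed(8-26-2024)
final_xgb_model <- train(Diabetes_012 ~ ., data = sampled_data, method = "xgbTree")
saveRDS(final_xgb_model, "final_diabetes_xgb_model.rds")

# Load the model
model <- readRDS("final_diabetes_xgb_model.rds")

# Display the tuning result with the highest resampled accuracy
print(model$results[which.max(model$results$Accuracy), ])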
##    eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
## 18 0.3         1     0              0.8                1         1     150
##     Accuracy     Kappa  AccuracySD    KappaSD
## 18 0.8470395 0.1743532 0.002090137 0.01446783
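
The reported accuracy is easier to judge next to a majority-class baseline. The sketch below assumes that sampled_data is the same 20% sample described in the summary table, so the class counts are taken from there; the variable names are illustrative.

```r
# Class counts from the 20% sample summary table.
class_counts <- c(Normal = 42741, Prediabetes = 927, Diabetes = 7070)

# Accuracy of a trivial model that always predicts the largest class.
baseline_accuracy <- max(class_counts) / sum(class_counts)

# Resampled accuracy reported for the best tuning row.
model_accuracy <- 0.8470395

cat("Majority-class baseline:", round(baseline_accuracy, 4), "\n")
cat("Model accuracy:         ", round(model_accuracy, 4), "\n")
cat("Gain over baseline:     ", round(model_accuracy - baseline_accuracy, 4), "\n")
```

Under that assumption the baseline is about 0.842, so the model gains roughly half a percentage point over always predicting Normal. That is consistent with the low Kappa of 0.17, which measures agreement beyond what chance alone would produce.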

Model Performance

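# Create a test set by randomly sampling 100 instances from the remaining data
# (train_index is not defined in the code shown; presumably it comes from the omitted sampling step)
remaining_data <- data[-train_index, ]
test_index <- sample(seq_len(nrow(remaining_data)), 100)
test_data <- remaining_data[test_index, ]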

# Make predictions
predictions <- predict(model, newdata = test_data)

# Calculate performance metrics
confusion_matrix <- confusionMatrix(predictions, test_data$Diabetes_012)
cat("Test Accuracy:", confusion_matrix$overall['Accuracy'], "\n")
## Test Accuracy: 0.87
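
With only 100 test instances, the accuracy estimate carries noticeable sampling uncertainty. A minimal sketch of an exact binomial confidence interval follows, assuming 87 of the 100 predictions were correct, which is what the reported accuracy implies.

```r
n_test    <- 100                    # test instances, as in the code above
n_correct <- round(0.87 * n_test)   # 87 correct, from the reported accuracy

# Exact (Clopper-Pearson) 95% confidence interval for the true accuracy.
ci <- binom.test(n_correct, n_test)$conf.int
cat("95% CI for accuracy:", round(ci[1], 3), "to", round(ci[2], 3), "\n")
```

caret's confusionMatrix() reports a comparable interval as AccuracyLower and AccuracyUpper in confusion_matrix$overall, alongside AccuracyNull, the accuracy of always predicting the most frequent class.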

Remember: This app is for educational purposes only. Always consult healthcare professionals for medical advice.