Diabetes Prediction App

2024-08-27

Introduction

Shiny app for predicting diabetes risk
Based on health indicators from BRFSS 2015 (Kaggle)
Development process:
- Started with 21 health indicators
- Used a 20% stratified sample due to memory constraints
- Applied an XGBoost model for feature selection
- Selected the top 5 most impactful features
Uses an XGBoost model
Provides risk category and confidence score
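
The 20% stratified sample mentioned above can be sketched in base R as follows. The original sampling code is not shown, so this is only one plausible approach; `stratified_sample` and the toy data frame are hypothetical names introduced for illustration, and the toy values are not BRFSS data.

```r
# Hypothetical helper: draw the same fraction of rows from every class
# so that the class proportions are preserved in the sample.
stratified_sample <- function(df, class_col, frac) {
  idx_by_class <- split(seq_len(nrow(df)), df[[class_col]])
  keep <- unlist(lapply(idx_by_class, function(idx) {
    # sample.int() avoids the sample() pitfall when a class has a single row
    idx[sample.int(length(idx), size = round(length(idx) * frac))]
  }), use.names = FALSE)
  df[sort(keep), , drop = FALSE]
}

# Toy data standing in for the real file (assumption, for illustration only)
set.seed(1)
toy <- data.frame(
  Diabetes_012 = factor(rep(c("Normal", "Prediabetes", "Diabetes"),
                            times = c(850, 20, 130))),
  BMI = round(rnorm(1000, mean = 28, sd = 5))
)

toy_sample <- stratified_sample(toy, "Diabetes_012", frac = 0.2)
table(toy_sample$Diabetes_012)
```

Since the deck already relies on caret, `createDataPartition(y, p = 0.2, list = FALSE)` is another way to get a class-stratified set of row indices.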

Data

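# Load the data
library(dplyr)  # provides %>%, select(), mutate(), across()
data <- read.csv("diabetes_012_health_indicators_BRFSS2015.csv")
# Keep the five selected predictors plus the outcome; treat all but BMI as factors
data <- data %>%
  select(HighBP, BMI, DiffWalk, HighChol, GenHlth, Diabetes_012) %>%
  mutate(across(!BMI, factor))
# Outcome codes 0/1/2 become readable labels
levels(data$Diabetes_012) <- c("Normal", "Prediabetes", "Diabetes")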

(some code omitted)

# Display the summary table
library(knitr)  # provides kable()
kable(summary_data, digits = 2,
      col.names = c("Diabetes Status", "Count", "High BP", "High Cholesterol", "Avg BMI", "Avg Gen. Health"),
      caption = "Summary of Key Health Indicators by Diabetes Status (20% Sample)")

Summary of Key Health Indicators by Diabetes Status (20% Sample)
Diabetes Status	Count	High BP	High Cholesterol	Avg BMI	Avg Gen. Health
Normal	42741	15740	16015	27.75	2.37
Prediabetes	927	589	567	30.51	2.99
Diabetes	7070	5352	4761	31.97	3.29
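
The High BP and High Cholesterol columns are counts while the last two columns are averages, so the groups are easier to compare once the counts are converted to within-group proportions. A small sketch using only the figures from the table above; the data frame name `summary_tbl` and its column names are introduced here for illustration.

```r
# Figures copied from the summary table (20% sample)
summary_tbl <- data.frame(
  status    = c("Normal", "Prediabetes", "Diabetes"),
  count     = c(42741, 927, 7070),
  high_bp   = c(15740, 589, 5352),
  high_chol = c(16015, 567, 4761)
)

# Share of each group reporting the condition, and each group's share of the sample
summary_tbl$high_bp_rate   <- round(summary_tbl$high_bp / summary_tbl$count, 3)
summary_tbl$high_chol_rate <- round(summary_tbl$high_chol / summary_tbl$count, 3)
summary_tbl$class_share    <- round(summary_tbl$count / sum(summary_tbl$count), 3)

print(summary_tbl)
```

Expressed this way, about 37% of the Normal group reports high blood pressure, compared with about 64% of the Prediabetes group and 76% of the Diabetes group.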

Model and App Features

User inputs:
- BMI
- High Blood Pressure
- High Cholesterol
- Difficulty Walking
- General Health
Output:
- Risk category (Normal, Prediabetes, Diabetes)
- Confidence score
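
One common way to produce a risk category together with a confidence score is to pick the class with the highest predicted probability and report that probability. The app's scoring code is not shown, so the sketch below is an assumption rather than the app's actual logic; `score_prediction` is a hypothetical helper and the probabilities are made up for illustration. With a caret model, a table of class probabilities like this is typically obtained with `predict(model, newdata, type = "prob")`.

```r
# Hypothetical helper: one row of class probabilities per person in,
# predicted category and its probability out.
score_prediction <- function(probs) {
  probs <- as.matrix(probs)
  best <- max.col(probs, ties.method = "first")  # column index of the row maximum
  data.frame(
    risk_category = colnames(probs)[best],
    confidence    = probs[cbind(seq_len(nrow(probs)), best)]
  )
}

# Made-up probabilities for two users (illustration only)
example_probs <- data.frame(
  Normal      = c(0.80, 0.30),
  Prediabetes = c(0.05, 0.10),
  Diabetes    = c(0.15, 0.60)
)

score_prediction(example_probs)
```

Under this scheme the confidence score is simply the model's estimated probability for the reported category, so it inherits any miscalibration of the model.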

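# XGBoost modelling
library(caret)  # provides train(); method = "xgbTree" also needs the xgboost package
# Note: 8-26-2024 is evaluated as arithmetic, so the seed actually used is -2042
set.seed(8-26-2024)
final_xgb_model <- train(Diabetes_012 ~ ., data = sampled_data, method = "xgbTree")
saveRDS(final_xgb_model, "final_diabetes_xgb_model.rds")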

# Load the model
model <- readRDS("final_diabetes_xgb_model.rds")

# Display the best tuning result (highest resampled accuracy)
print(model$results[which.max(model$results$Accuracy), ])

##    eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
## 18 0.3         1     0              0.8                1         1     150
##     Accuracy     Kappa  AccuracySD    KappaSD
## 18 0.8470395 0.1743532 0.002090137 0.01446783
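
The resampled accuracy of about 0.847 is best read against the class imbalance: a model that always predicts the majority class would already score close to that, which is consistent with the low Kappa of about 0.17. A quick check using the class counts from the summary table, assuming the model was trained on that same 20% sample:

```r
# Class counts from the summary table (20% sample)
class_counts <- c(Normal = 42741, Prediabetes = 927, Diabetes = 7070)

# Accuracy of always predicting the most frequent class
baseline_accuracy <- max(class_counts) / sum(class_counts)

# Best resampled accuracy reported by caret above
cv_accuracy <- 0.8470395

round(baseline_accuracy, 4)
round(cv_accuracy - baseline_accuracy, 4)
```

The gain over the majority-class baseline is under one percentage point, so per-class metrics such as sensitivity for the Prediabetes and Diabetes classes are likely more informative than overall accuracy here.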

Model Performance

# Create a test set by randomly sampling 100 instances from the remaining data
remaining_data <- data[-train_index, ]
test_index <- sample(1:nrow(remaining_data), 100)
test_data <- remaining_data[test_index, ]

# Make predictions
predictions <- predict(model, newdata = test_data)

# Calculate performance metrics
confusion_matrix <- confusionMatrix(predictions, test_data$Diabetes_012)
cat("Test Accuracy:", confusion_matrix$overall['Accuracy'], "\n")

## Test Accuracy: 0.87
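
With only 100 test instances, the 0.87 estimate carries wide uncertainty. A sketch of an exact binomial confidence interval, assuming 87 of the 100 predictions were correct (which is what an accuracy of 0.87 on 100 instances implies):

```r
n_test    <- 100
n_correct <- 87

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- binom.test(n_correct, n_test)$conf.int
round(as.numeric(ci), 3)
```

`confusionMatrix()` reports an interval of this kind as well, in `confusion_matrix$overall` under `AccuracyLower` and `AccuracyUpper`.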

Remember: This app is for educational purposes only. Always consult healthcare professionals for medical advice.