Group Members:

Hussain Ali Kazim (23115338)
Wan Mohamad Hasif Bin W. Mohd Saleh (22119325)
Hanim Sofiah Bin Shahrom (22102228)
Nurul Hafizah binti Zaini (17172928)
Muhammad Hakim Bin Nasaruddin (23079722)

Introduction

Dataset

The dataset was obtained from Kaggle and originates from the National Institute of Diabetes and Digestive and Kidney Diseases. It includes several diagnostic measurements as independent variables and a single target variable, ‘Outcome’, indicating whether the patient has diabetes.

Link: https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset/data

Project Context

The term ‘diabetes’ has become increasingly familiar in recent years, as the prevalence of the condition continues to rise globally. Diabetes is a chronic medical condition that occurs when the body is unable to properly regulate blood sugar levels. This condition can lead to severe complications, including heart disease, kidney failure, and nerve damage. Therefore, early detection and intervention are crucial to managing and preventing the progression of diabetes.

This project aims to develop a machine learning classification model to predict the likelihood of an individual developing diabetes. Additionally, it focuses on predicting the progression of diabetes using BMI as a key feature, providing insights for early diagnosis and personalized healthcare interventions.

Objective

To develop classification models to predict the likelihood of an individual developing diabetes
To predict the progression of diabetes, using BMI as the key feature, by developing regression models

Load the dataset

raw_data <- read.csv("C:\\Users\\Nasaruddin\\Group12-Project\\raw_data_diabetes.csv")

Data Cleaning

Dimension: dim()

dim(raw_data)

## [1] 769   9

Head: head()

head(raw_data)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Summary: summary()

summary(raw_data)

##   Pregnancies        Glucose      BloodPressure      SkinThickness  
##  Min.   :-1.000   Min.   :  0.0   Length:769         Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   Class :character   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Mode  :character   Median :23.00  
##  Mean   : 3.839   Mean   :120.9                      Mean   :20.55  
##  3rd Qu.: 6.000   3rd Qu.:140.0                      3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0                      Max.   :99.00  
##    Insulin               BMI         DiabetesPedigreeFunction      Age        
##  Length:769         Min.   :-30.10   Min.   :0.0780           Min.   :-30.00  
##  Class :character   1st Qu.: 27.30   1st Qu.:0.2440           1st Qu.: 24.00  
##  Mode  :character   Median : 32.00   Median :0.3710           Median : 29.00  
##                     Mean   : 31.91   Mean   :0.4717           Mean   : 33.15  
##                     3rd Qu.: 36.60   3rd Qu.:0.6260           3rd Qu.: 41.00  
##                     Max.   : 67.10   Max.   :2.4200           Max.   : 81.00  
##     Outcome      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3485  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Structure: str()

str(raw_data)

## 'data.frame':    769 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : chr  "72" "66" "64" "66" ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : chr  "0" "0" "0" "94" ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Load necessary libraries

library(dplyr)

Check for missing values in the raw dataset

missing_values_summary <- sapply(raw_data, function(x) sum(is.na(x)))  # Count missing values per column
print(missing_values_summary)

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0
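The is.na() check above only counts true NA values. In this dataset, zeros in columns such as Glucose, BloodPressure and BMI (the summary reports minimums of 0) are physiologically implausible and probably stand in for missing readings. A minimal sketch of counting them, using a small made-up data frame rather than the project data:

```r
# Illustrative values only; column names mirror the dataset, rows are invented
toy <- data.frame(
  Glucose       = c(148, 0, 183),
  BloodPressure = c(72, 66, 0),
  BMI           = c(33.6, 0, 0)
)

# Count zeros per column; these would be candidates for NA recoding or imputation
zero_counts <- sapply(toy, function(x) sum(x == 0))
print(zero_counts)
```

Whether to recode such zeros as NA depends on the variable: a zero is valid for Pregnancies but not for BMI.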

Handle specific cases: a "Null" entry and a number written as a word

# Convert "Null" in Insulin column to 0 (row 763)
raw_data[763, "Insulin"] <- 0

# Convert "Seventy" in BloodPressure column to 70 (row 765)
raw_data[765, "BloodPressure"] <- 70
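The two fixes above rely on hard-coded row numbers. A more general way to locate such entries, sketched here on an illustrative character vector (not the project data), is to find values that are present but fail numeric parsing:

```r
# Stand-in for a character column such as raw_data$BloodPressure (values invented)
bp <- c("72", "66", "Seventy", "64", "Null")

# as.numeric() yields NA (with a warning) for anything it cannot parse
parsed <- suppressWarnings(as.numeric(bp))

# Positions that held a value but did not parse as a number
bad_idx <- which(is.na(parsed) & !is.na(bp))
print(bp[bad_idx])
```

Inspecting the flagged values first, and only then deciding on a replacement, avoids silently overwriting the wrong row if the file order changes.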

Convert data types to ensure consistency

# Convert all values in BloodPressure column to numeric
raw_data$BloodPressure <- as.numeric(raw_data$BloodPressure)

# Convert all values in Insulin column to numeric
raw_data$Insulin <- as.numeric(raw_data$Insulin)

Convert negative values to absolute

raw_data <- raw_data %>%
  mutate(across(where(is.numeric), ~ abs(.)))

Remove duplicate rows

raw_data <- distinct(raw_data)

Save the cleaned dataset

# write.csv() returns NULL, so its result is not assigned to a variable
write.csv(raw_data, "C:\\Users\\Nasaruddin\\Group12-Project\\diabetesprojectdata.csv", row.names = FALSE)

Updated dataset structure

data <- read.csv("C:\\Users\\Nasaruddin\\Group12-Project\\diabetesprojectdata.csv")
str(data)

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

EDA

Load necessary libraries

library(dplyr)
library(ggplot2)
library(corrplot)

Glimpse of the data

glimpse(data)

## Rows: 768
## Columns: 9
## $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …

Summary statistics for all variables

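# Pass na.rm inside each lambda; supplying extra arguments through across() is deprecated
data %>%
  summarise(across(everything(),
                   list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE),
                        min = ~ min(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE))))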

##   Pregnancies_mean Pregnancies_sd Pregnancies_min Pregnancies_max Glucose_mean
## 1         3.845052       3.369578               0              17     120.8945
##   Glucose_sd Glucose_min Glucose_max BloodPressure_mean BloodPressure_sd
## 1   31.97262           0         199           69.10547         19.35581
##   BloodPressure_min BloodPressure_max SkinThickness_mean SkinThickness_sd
## 1                 0               122           20.53646         15.95222
##   SkinThickness_min SkinThickness_max Insulin_mean Insulin_sd Insulin_min
## 1                 0                99     79.79948    115.244           0
##   Insulin_max BMI_mean  BMI_sd BMI_min BMI_max DiabetesPedigreeFunction_mean
## 1         846 31.99258 7.88416       0    67.1                     0.4718763
##   DiabetesPedigreeFunction_sd DiabetesPedigreeFunction_min
## 1                   0.3313286                        0.078
##   DiabetesPedigreeFunction_max Age_mean   Age_sd Age_min Age_max Outcome_mean
## 1                         2.42 33.24089 11.76023      21      81    0.3489583
##   Outcome_sd Outcome_min Outcome_max
## 1  0.4769514           0           1

Check for missing values

data %>%
  summarise(across(everything(), ~sum(is.na(.))))

##   Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1           0       0             0             0       0   0
##   DiabetesPedigreeFunction Age Outcome
## 1                        0   0       0

Histogram for numerical variables

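numeric_cols <- colnames(data)[sapply(data, is.numeric)]
for (col in numeric_cols) {
  print(
    # .data[[col]] replaces the deprecated aes_string(); bins = 30 adapts to each
    # variable's range, whereas a fixed binwidth of 10 collapses narrow-range
    # columns (e.g. DiabetesPedigreeFunction, Outcome) into a single bar
    ggplot(data, aes(x = .data[[col]])) +
      geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
      labs(title = paste("Distribution of", col), x = col, y = "Frequency")
  )
}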

Correlation matrix and visualization

cor_matrix <- cor(data %>% select_if(is.numeric), use = "complete.obs")
print(cor_matrix)

##                          Pregnancies    Glucose BloodPressure SkinThickness
## Pregnancies               1.00000000 0.12945867    0.14128198   -0.08167177
## Glucose                   0.12945867 1.00000000    0.15258959    0.05732789
## BloodPressure             0.14128198 0.15258959    1.00000000    0.20737054
## SkinThickness            -0.08167177 0.05732789    0.20737054    1.00000000
## Insulin                  -0.07353461 0.33135711    0.08893338    0.43678257
## BMI                       0.01768309 0.22107107    0.28180529    0.39257320
## DiabetesPedigreeFunction -0.03352267 0.13733730    0.04126495    0.18392757
## Age                       0.54434123 0.26351432    0.23952795   -0.11397026
## Outcome                   0.22189815 0.46658140    0.06506836    0.07475223
##                              Insulin        BMI DiabetesPedigreeFunction
## Pregnancies              -0.07353461 0.01768309              -0.03352267
## Glucose                   0.33135711 0.22107107               0.13733730
## BloodPressure             0.08893338 0.28180529               0.04126495
## SkinThickness             0.43678257 0.39257320               0.18392757
## Insulin                   1.00000000 0.19785906               0.18507093
## BMI                       0.19785906 1.00000000               0.14064695
## DiabetesPedigreeFunction  0.18507093 0.14064695               1.00000000
## Age                      -0.04216295 0.03624187               0.03356131
## Outcome                   0.13054795 0.29269466               0.17384407
##                                  Age    Outcome
## Pregnancies               0.54434123 0.22189815
## Glucose                   0.26351432 0.46658140
## BloodPressure             0.23952795 0.06506836
## SkinThickness            -0.11397026 0.07475223
## Insulin                  -0.04216295 0.13054795
## BMI                       0.03624187 0.29269466
## DiabetesPedigreeFunction  0.03356131 0.17384407
## Age                       1.00000000 0.23835598
## Outcome                   0.23835598 1.00000000

corrplot::corrplot(cor_matrix, method = "circle")

Scatter plot between Glucose and BMI grouped by Outcome

ggplot(data, aes(x = Glucose, y = BMI, color = as.factor(Outcome))) +
  geom_point(alpha = 0.7) +
  labs(title = "Glucose vs BMI by Outcome", x = "Glucose", y = "BMI")

Categorize Age into groups

data <- data %>%
  mutate(AgeGroup = case_when(
    Age < 30 ~ "Under 30",
    Age >= 30 & Age < 50 ~ "30-49",
    Age >= 50 ~ "50 and above"
  ))
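As an aside, the same grouping can be expressed with base R's cut(), which returns a factor whose levels follow the order of the breaks, so the bars plot in age order rather than alphabetically. A small sketch on made-up ages:

```r
# Invented ages chosen to sit on the group boundaries
age <- c(21, 29, 30, 49, 50, 81)

# right = FALSE makes the intervals [-Inf, 30), [30, 50), [50, Inf)
age_group <- cut(
  age,
  breaks = c(-Inf, 30, 50, Inf),
  right  = FALSE,
  labels = c("Under 30", "30-49", "50 and above")
)
print(table(age_group))
```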

Age group distribution by Outcome

data %>%
  group_by(AgeGroup, Outcome) %>%
  summarise(Count = n(), .groups = "drop") %>%
  ggplot(aes(x = AgeGroup, y = Count, fill = as.factor(Outcome))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Age Group Distribution by Outcome", x = "Age Group", y = "Count")

# Drop the temporary AgeGroup column so it is not used as a model feature
data <- data %>% select(-AgeGroup)

Classification Model

Classification modelling was executed to predict the likelihood of diabetes occurrence based on the input features. The classification models used in this project are as follows:

Logistic Regression
Decision Tree
Support Vector Machine (SVM)

library(caret)
library(rpart)
library(rpart.plot)
library(e1071)

Train Test Split

The dataset is split into an 80% training set and a 20% testing set. The split is stratified to ensure that the distribution of the target variable’s classes remains consistent between the training and testing datasets.

# Select all columns except the target variable 'Outcome'
X <- data[, setdiff(names(data), 'Outcome')]

# Select the target variable 'Outcome'
y <- data$Outcome

# Set seed for reproducibility
set.seed(123)

# Stratified train-test split
trainIndex <- createDataPartition(data$Outcome, p = 0.8, list = FALSE)

# Create training and testing sets
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
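To make the idea of stratification concrete, the sketch below draws 80% of the rows within each class using base R only. It illustrates the principle and is not caret's implementation; the class counts (65 and 35) are invented.

```r
set.seed(123)

# Hypothetical outcome vector with a 65/35 class balance
outcome <- factor(rep(c("non.diabetic", "diabetic"), times = c(65, 35)))

# Sample 80% of the row indices separately inside each class
train_idx <- unlist(lapply(
  split(seq_along(outcome), outcome),
  function(idx) sample(idx, size = round(0.8 * length(idx)))
))

# Class proportions should match in both partitions
train_prop <- prop.table(table(outcome[train_idx]))
test_prop  <- prop.table(table(outcome[-train_idx]))
print(round(train_prop, 2))
print(round(test_prop, 2))
```

Running the same comparison with prop.table(table(...)) on the real train and test sets is a quick way to confirm that the split preserved the class balance.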

Factor Transformation

The target variable for classification modelling is ‘Outcome’ which is a categorical variable. By using factor transformation on our target variable, the models will treat it as a categorical variable with distinct classes.

train$Outcome <- factor(train$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))
test$Outcome <- factor(test$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))

Model 1: Logistic Regression

Logistic regression is a supervised machine learning algorithm used for binary classification, where the goal is to predict the probability of an outcome belonging to one of two classes. It is widely used for its simplicity and interpretability.

logistic_model <- glm(Outcome ~ . , data = train, family = binomial)
#Pregnancies + Glucose + BMI + DiabetesPedigreeFunction + BloodPressure

# Model summary
summary(logistic_model)

## 
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.2216062  0.7781491 -10.566  < 2e-16 ***
## Pregnancies               0.1185211  0.0357929   3.311 0.000929 ***
## Glucose                   0.0352886  0.0041848   8.433  < 2e-16 ***
## BloodPressure            -0.0130815  0.0057276  -2.284 0.022374 *  
## SkinThickness            -0.0009780  0.0075288  -0.130 0.896648    
## Insulin                  -0.0009111  0.0009841  -0.926 0.354533    
## BMI                       0.0861702  0.0166338   5.180 2.21e-07 ***
## DiabetesPedigreeFunction  0.7824888  0.3212008   2.436 0.014845 *  
## Age                       0.0152434  0.0102944   1.481 0.138676    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 797.28  on 614  degrees of freedom
## Residual deviance: 583.64  on 606  degrees of freedom
## AIC: 601.64
## 
## Number of Fisher Scoring iterations: 5

# Predict probabilities and classes
predicted_probs <- predict(logistic_model, newdata = test, type = "response")
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)

Observation:

Significant predictors, such as Glucose, BMI, and Pregnancies, were identified based on their p-values and estimated coefficients. The model suggests a positive association of Pregnancies, Glucose, BMI, and DiabetesPedigreeFunction with diabetes likelihood and a slight negative association with BloodPressure.
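Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below uses the estimates copied from the model summary above; the variable names coefs, odds_ratios and glucose_per_10 are introduced here for illustration.

```r
# Estimates taken from the summary(logistic_model) output
coefs <- c(
  Pregnancies              = 0.1185211,
  Glucose                  = 0.0352886,
  BloodPressure            = -0.0130815,
  BMI                      = 0.0861702,
  DiabetesPedigreeFunction = 0.7824888
)

# Odds ratio per one-unit increase in each predictor
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Odds ratio for a 10-unit increase in Glucose
glucose_per_10 <- exp(10 * coefs[["Glucose"]])
print(round(glucose_per_10, 3))
```

A value above 1 means the odds of a diabetic outcome rise with the predictor, holding the others fixed; a value below 1, as for BloodPressure, means they fall.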

Model 2: Decision Tree

Decision Tree is a flowchart-like structure used to make predictions or decisions. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.

Train a decision tree model

tree_model <- rpart(Outcome ~ ., data = train, method = "class")

To plot the decision tree

rpart.plot(tree_model)

Observation:

The model identified Glucose as the most significant predictor at the root node, with a threshold of 144 used to split the data.

Model 3: Support Vector Machine (SVM)

SVM is also a supervised machine learning algorithm and can be used for the classification task. In classification, SVM seeks to find the optimal hyperplane that divides the data into different categories.

Train an SVM model without tuning (default settings)

svm_model <- svm(Outcome ~ ., data = train, type = "C-classification", kernel = "radial")

Predict on the test set

svm_predictions <- predict(svm_model, newdata = test)

Regression Model

Regression modeling is a fundamental statistical and machine learning technique used to understand and quantify relationships between variables. Regression helps to model the relationship between a dependent variable (also known as the response or target variable) and one or more independent variables (also called predictors or features).

This project also explores regression modelling in R. Using the same dataset, ‘BMI’ is chosen as the target variable.

Why BMI as target variable?

Key Health Indicator - BMI (Body Mass Index) is a widely recognized measure of body fat, associated with health risks such as diabetes and cardiovascular conditions.

Feature Relevance - Features like Glucose, Insulin, and SkinThickness are biologically linked to BMI and influence metabolic health.

Continuous Target Variable - BMI is continuous, making it suitable for regression modeling and enabling meaningful analysis of predictor relationships.

Data Completeness - BMI is fully available in the dataset, ensuring reliable and interpretable model results.

The regression models explored are as follows:

Linear Regression
Random Forest
Extreme Gradient Boosting (XGBoost)

# Load the package
library(randomForest)

Train Test Split

Data was split into 80% training and 20% test sets.

# Select the features and target variable
# We will predict BMI using other features
dataReg <- subset(data, select = -c(Outcome)) # Exclude the binary outcome column

# Split the data into training and testing sets
set.seed(123)  # For reproducibility
trainIndex <- createDataPartition(dataReg$BMI, p = 0.8, list = FALSE)
trainData <- dataReg[trainIndex, ]
testData <- dataReg[-trainIndex, ]

Model 1: Linear Regression

Linear regression is a simple and interpretable technique that models the relationship between a dependent variable and one or more predictors by fitting a linear equation.

Train a linear regression model

# Train a linear regression model
lm_model <- train(BMI ~ ., data = trainData, method = "lm")

View the model summaries

cat("Linear Regression Summary:\n")

## Linear Regression Summary:

summary(lm_model$finalModel)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.852  -3.960  -0.195   3.697  28.249 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              16.994090   1.521067  11.172  < 2e-16 ***
## Pregnancies               0.043360   0.101077   0.429   0.6681    
## Glucose                   0.051422   0.009903   5.193 2.83e-07 ***
## BloodPressure             0.085203   0.015587   5.466 6.72e-08 ***
## SkinThickness             0.165516   0.021356   7.751 3.85e-14 ***
## Insulin                  -0.003679   0.002906  -1.266   0.2060    
## DiabetesPedigreeFunction  1.434437   0.861407   1.665   0.0964 .  
## Age                      -0.029789   0.030587  -0.974   0.3305    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.956 on 609 degrees of freedom
## Multiple R-squared:  0.227,  Adjusted R-squared:  0.2181 
## F-statistic: 25.55 on 7 and 609 DF,  p-value: < 2.2e-16

Make predictions on the test set

lm_pred <- predict(lm_model, newdata = testData)

Model 2: Random Forest

Random forest is an ensemble learning method that uses multiple decision trees to model complex relationships between a dependent variable and its predicting features.

Train a random forest regression model

rf_model <- train(BMI ~ ., data = trainData, method = "rf",
                  tuneGrid = expand.grid(.mtry = seq(2, ncol(trainData) - 1, by = 1)),
                  trControl = trainControl(method = "cv", number = 5))

cat("\nRandom Forest Model Parameters:\n")

## 
## Random Forest Model Parameters:

print(rf_model$bestTune)

##   mtry
## 1    2

rf_pred <- predict(rf_model, newdata = testData)

Model 3: Extreme Gradient Boosting (XGBoost)

XGBoost is a powerful gradient boosting algorithm designed for efficiency and performance. This technique uses an ensemble of decision trees to optimize predictions.

Train an XGBoost regression model

xgb_model <- train(BMI ~ ., data = trainData, method = "xgbLinear",
                   tuneGrid = expand.grid(.nrounds = seq(50, 200, by = 50),
                                          .lambda = c(0, 0.1, 1),
                                          .alpha = c(0, 0.1, 1),
                                          .eta = c(0.01, 0.1, 0.3)),
                   trControl = trainControl(method = "cv", number = 5))

cat("\nXGBoost Model Best Parameters:\n")

## 
## XGBoost Model Best Parameters:

print(xgb_model$bestTune)

##    nrounds lambda alpha  eta
## 25      50      1     1 0.01

xgb_pred <- predict(xgb_model, newdata = testData)

Evaluation

Export

# Export the logistic regression model
saveRDS(logistic_model, file = "logistic_model.rds")

# Export the decision tree model
saveRDS(tree_model, file = "decision_tree_model.rds")

# Export the SVM model
saveRDS(svm_model, file = "svm_model.rds")

# Export the linear regression model
saveRDS(lm_model, file = "linear_regression_model.rds")

# Export the random forest model
saveRDS(rf_model, file = "random_forest_model.rds")

# Export the XGBoost model
saveRDS(xgb_model, file = "xgboost_model.rds")
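A quick way to confirm that a saved model survives the round trip is to reload it and compare predictions. The sketch below does this with a small lm() fit on R's built-in cars data and a temporary file, both chosen only for illustration:

```r
# Fit a small model on a built-in dataset
fit <- lm(dist ~ speed, data = cars)

# Save to a temporary .rds file and read it back
path <- tempfile(fileext = ".rds")
saveRDS(fit, path)
restored <- readRDS(path)

# The restored model should give the same predictions as the original
same <- isTRUE(all.equal(predict(fit, cars), predict(restored, cars)))
print(same)
```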

rm(list = ls())
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2753421 147.1    4420428 236.1  4420428 236.1
## Vcells 4812887  36.8   17644568 134.7 17642590 134.7

Import packages

library(dplyr)
library(ggplot2)
library(caret)
library(yardstick)  # For evaluation metrics
library(ROCR)      # For ROC and AUC
library(DALEX)     # For model explainability
library(shapviz)   # For SHAP values
library(vip)       # For variable importance plots
library(mlflow)    # For experiment tracking
library(ranger)  # For Random Forest with SHAP support
library(randomForest)

Load the Models

Classification Models

loaded_logistic_model <- readRDS("logistic_model.rds")
loaded_tree_model <- readRDS("decision_tree_model.rds")
loaded_svm_model <- readRDS("svm_model.rds")

Regression Models

loaded_lm_model <- readRDS("linear_regression_model.rds")
loaded_rf_model <- readRDS("random_forest_model.rds")
loaded_xgb_model <- readRDS("xgboost_model.rds")

Load the Data

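data <- read.csv("C:\\Users\\Nasaruddin\\Group12-Project\\diabetesprojectdata.csv")
# Re-use the training seed so the same rows are held out; without it the
# evaluation set overlaps rows the classification models were trained on
set.seed(123)
trainIndex <- createDataPartition(data$Outcome, p = 0.8, list = FALSE)
train <- data[trainIndex, ]
test <- data[-trainIndex, ]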
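createDataPartition() samples rows at random, so a split drawn at evaluation time only reproduces the training split if the random seed is set to the same value immediately before the call. The base R sketch below illustrates the principle with sample(); the numbers are arbitrary.

```r
# Same seed immediately before each draw: identical samples
set.seed(123)
first_draw <- sample(100, 10)

set.seed(123)
second_draw <- sample(100, 10)

# No seed reset before this draw: the generator continues from its current
# state, so the result is not tied to the earlier draws
unseeded_draw <- sample(100, 10)

print(identical(first_draw, second_draw))
```

If the held-out rows differ from those excluded during training, some test rows will have been seen by the model and the reported metrics will be optimistic.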

Prepare Classification Test Data

test$Outcome <- factor(test$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))
test$Outcome_numeric <- ifelse(test$Outcome == "diabetic", 1, 0)  # Numeric outcome for DALEX

Prepare Regression Test Data

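dataReg <- data %>% select(-Outcome)
set.seed(123)  # rebuild the BMI-based split the regression models were trained on
regIndex <- createDataPartition(dataReg$BMI, p = 0.8, list = FALSE)
testData <- dataReg[-regIndex, ]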

Classification Models Evaluation

Logistic Regression

logistic_probs <- predict(loaded_logistic_model, newdata = test, type = "response")
logistic_preds <- ifelse(logistic_probs > 0.5, "diabetic", "non.diabetic")
logistic_conf <- confusionMatrix(factor(logistic_preds, levels = levels(test$Outcome)), test$Outcome)

Logistic Regression Explainer

Ensure numeric targets for DALEX

test$Outcome_numeric <- ifelse(test$Outcome == "diabetic", 1, 0)

Create the Logistic Regression explainer

explainer_logistic <- explain(
  model = loaded_logistic_model,
  data = test,
  y = test$Outcome_numeric,
  predict_function = function(m, d) predict(m, newdata = d, type = "response"),
  label = "Logistic Regression"
)

## Preparation of a new explainer is initiated
##   -> model label       :  Logistic Regression 
##   -> data              :  153  rows  10  cols 
##   -> target variable   :  153  values 
##   -> predict function  :  function(m, d) predict(m, newdata = d, type = "response") 
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package stats , ver. 4.4.1 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.01023708 , mean =  0.3313538 , max =  0.9619726  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.8273228 , mean =  0.008515518 , max =  0.9319744  
##   A new explainer has been created!

Evaluate performance

performance_logistic <- model_performance(explainer_logistic)
plot(performance_logistic)

ggsave("logistic_performance.png")

Feature Importance

importance_logistic <- model_parts(explainer_logistic)
plot(importance_logistic)

ggsave("logistic_importance.png")

Decision Tree

tree_preds <- predict(loaded_tree_model, newdata = test, type = "class")
tree_conf <- confusionMatrix(tree_preds, test$Outcome)

SVM

svm_preds <- predict(loaded_svm_model, newdata = test)
svm_conf <- confusionMatrix(svm_preds, test$Outcome)

Plot ROC Curves for Classification Models

logistic_pred <- prediction(logistic_probs, test$Outcome_numeric)
logistic_perf <- performance(logistic_pred, "tpr", "fpr")

tree_pred <- prediction(as.numeric(tree_preds), test$Outcome_numeric)
tree_perf <- performance(tree_pred, "tpr", "fpr")

svm_pred <- prediction(as.numeric(svm_preds), test$Outcome_numeric)
svm_perf <- performance(svm_pred, "tpr", "fpr")


# ggsave() only saves ggplot objects, so write this base-graphics plot via png()/dev.off()
png("classification_roc_curves.png", width = 800, height = 600)
plot(logistic_perf, col = "blue", lty = 1, main = "ROC Curves")
plot(tree_perf, col = "red", lty = 2, add = TRUE)
plot(svm_perf, col = "green", lty = 3, add = TRUE)
legend("bottomright", legend = c("Logistic Regression", "Decision Tree", "SVM"),
       col = c("blue", "red", "green"), lty = 1:3)
dev.off()

Regression Models Evaluation

library(dplyr)
library(caret)
library(Metrics)

rf_preds <- predict(loaded_rf_model, newdata = testData)
rf_metrics <- data.frame(
  RMSE = RMSE(rf_preds, testData$BMI),
  Rsquared = R2(rf_preds, testData$BMI)
)

Linear Regression

lm_preds <- predict(loaded_lm_model, newdata = testData)
lm_metrics <- data.frame(
  RMSE = RMSE(lm_preds, testData$BMI),
  Rsquared = R2(lm_preds, testData$BMI)
)

XGBoost

xgb_preds <- predict(loaded_xgb_model, newdata = testData)
xgb_metrics <- data.frame(
  RMSE = RMSE(xgb_preds, testData$BMI),
  Rsquared = R2(xgb_preds, testData$BMI)
)

Combine Regression Metrics

regression_metrics <- rbind(
  Random_Forest = rf_metrics,
  Linear_Regression = lm_metrics,
  XGBoost = xgb_metrics
)

Plot Regression Performance

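# ggsave() only saves ggplot objects, so write this base-graphics plot via png()/dev.off()
png("regression_performance.png", width = 800, height = 600)
barplot(as.matrix(regression_metrics), beside = TRUE, col = c("blue", "red", "green"),
        main = "Regression Model Performance", legend = rownames(regression_metrics))
dev.off()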

Final Tables and Visualizations

Classification Summary

classification_summary <- data.frame(
  Model = c("Logistic Regression", "Decision Tree", "SVM"),
  Accuracy = c(logistic_conf$overall["Accuracy"],
               tree_conf$overall["Accuracy"],
               svm_conf$overall["Accuracy"]),
  Kappa = c(logistic_conf$overall["Kappa"],
            tree_conf$overall["Kappa"],
            svm_conf$overall["Kappa"])
)

ggplot(classification_summary, aes(x = Model, y = Accuracy)) +
  geom_col(fill = "steelblue") +
  ggtitle("Classification Model Accuracy") +
  theme_minimal()

ggsave("classification_accuracy.png")

FINAL RESULT

print("Classification Summary")

## [1] "Classification Summary"

print(classification_summary)

##                 Model  Accuracy     Kappa
## 1 Logistic Regression 0.7712418 0.4518374
## 2       Decision Tree 0.8104575 0.5675017
## 3                 SVM 0.7843137 0.4671240

## Regression Summary
print("Regression Metrics")

## [1] "Regression Metrics"

print(regression_metrics)

##                       RMSE  Rsquared
## Random_Forest     4.339702 0.7800229
## Linear_Regression 6.857145 0.2753592
## XGBoost           3.591581 0.8039952

Conclusion

The evaluation of models was conducted to analyze and compare the performance of the classification and regression models. Metrics such as Accuracy, Kappa, ROC Curves, RMSE and R-Squared are used to measure performance. The evaluation findings are summarized below.

Classification Models

Accuracy and Kappa Scores

The Accuracy and Kappa scores for the classification models are summarized in the following table.

[Figure: classification accuracy by model (classification_accuracy.png)]

Based on the classification summary printed under FINAL RESULT, the Decision Tree achieved the highest accuracy (81.05%) and Kappa score (0.5675), meaning its predictions agreed with the true labels well beyond what chance alone would produce. SVM followed with an accuracy of 78.43% and a Kappa of 0.4671. Logistic Regression had the lowest accuracy (77.12%) and Kappa (0.4518), which may reflect its limited ability to capture non-linear relationships within the dataset. The gaps are modest on a 153-row test set, so the ranking should be read as indicative rather than conclusive.
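Kappa measures agreement between predictions and true labels after subtracting the agreement expected by chance: kappa = (po - pe) / (1 - pe), where po is the observed accuracy and pe the chance agreement implied by the row and column totals. The sketch below works through the formula on a hypothetical confusion matrix; the counts are invented and are not the project's results.

```r
# Hypothetical 2x2 confusion matrix (rows = predicted, columns = actual)
conf <- matrix(
  c(90, 20, 10, 33),
  nrow = 2,
  dimnames = list(
    Predicted = c("non.diabetic", "diabetic"),
    Actual    = c("non.diabetic", "diabetic")
  )
)

n  <- sum(conf)
po <- sum(diag(conf)) / n                        # observed agreement (accuracy)
pe <- sum(rowSums(conf) * colSums(conf)) / n^2   # agreement expected by chance
kappa <- (po - pe) / (1 - pe)

print(round(c(accuracy = po, chance = pe, kappa = kappa), 4))
```

Because pe grows when one class dominates, two models with similar accuracy can have noticeably different Kappa values.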

ROC Curves

The ROC curves for all three classification models are visualized below to compare their true positive rates (TPR) against false positive rates (FPR).

The Logistic Regression ROC curve displayed the smoothest shape, indicative of stable performance across varying thresholds. However, its overall performance was slightly below that of the Decision Tree and SVM models. The Decision Tree showed strong initial performance but experienced a sharper drop-off in its curve, indicating potential overfitting to the training data. On the other hand, SVM demonstrated a less smooth curve but consistently outperformed Logistic Regression in regions requiring higher specificity, highlighting its strength in distinguishing between the classes.
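An ROC curve is traced by sweeping a threshold over continuous scores such as predicted probabilities; hard class labels yield only a single operating point. The base R sketch below computes TPR, FPR and AUC by hand on invented labels and scores, as an illustration of the mechanics rather than a replacement for ROCR.

```r
# Invented true labels (1 = diabetic) and predicted scores
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

# One ROC point per distinct threshold, from strictest to most lenient
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) sum(scores >= t & labels == 1) / sum(labels == 1))
fpr <- sapply(thresholds, function(t) sum(scores >= t & labels == 0) / sum(labels == 0))

# AUC as the probability that a random positive outranks a random negative
pos <- scores[labels == 1]
neg <- scores[labels == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

print(data.frame(threshold = thresholds, tpr = tpr, fpr = fpr))
print(auc)
```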

Logistic Regression Feature Importance

The feature importance of the logistic regression model, estimated with permutation-based variable importance, highlights which predictors contribute the most to the model’s predictions.
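Permutation importance shuffles one predictor at a time and records how much the model's loss worsens. The sketch below shows the idea in base R on synthetic data where only x1 drives the outcome; the data, the log-loss function and all variable names are assumptions made for illustration, and DALEX may use a different loss by default.

```r
set.seed(1)
n <- 300

# Synthetic data: y depends on x1 only, x2 is pure noise
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(2 * toy$x1))

fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Log loss: lower is better
log_loss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))
baseline <- log_loss(toy$y, predict(fit, toy, type = "response"))

# Increase in loss after shuffling each predictor in turn
perm_importance <- sapply(c("x1", "x2"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  log_loss(toy$y, predict(fit, shuffled, type = "response")) - baseline
})
print(round(perm_importance, 4))
```

A large increase for a predictor means the model relies on it; a value near zero means shuffling it barely changes the predictions.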

Logistic Regression Residuals Distribution

The residual distribution plot below demonstrates the reverse cumulative distribution of residuals for the logistic regression model.

Logistic Regression remains valuable for its interpretability, as seen in the feature importance chart. The model identified Glucose as the most significant predictor for diabetes likelihood, followed by BMI and Diabetes Pedigree Function, reinforcing the importance of these factors in understanding diabetes risk.

Regression Models

Regression Performance

The RMSE (Root Mean Squared Error) and R-Squared values for regression models are summarized in the following table.

XGBoost emerged as the most effective regression model, achieving the lowest RMSE (3.59) and the highest R-squared value (0.804) in the regression metrics printed under FINAL RESULT. These metrics indicate its ability to accurately model the relationship between features and BMI. Random Forest followed with an RMSE of 4.34 and R-squared of 0.780, showcasing its strength in capturing non-linear relationships but falling short of XGBoost’s optimization techniques. Linear Regression, in contrast, exhibited the highest RMSE (6.86) and the lowest R-squared (0.275), reflecting its limitations in addressing complex, non-linear patterns.
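For reference, both metrics can be computed by hand. RMSE is the square root of the mean squared error. For R-squared, caret's R2() defaults to the squared correlation between predictions and observations, which can differ from the traditional 1 - SSE/SST form. The sketch below uses invented BMI values to show both.

```r
# Invented observed and predicted BMI values
obs  <- c(30.1, 25.4, 33.6, 28.0, 41.2)
pred <- c(29.0, 27.1, 32.5, 30.2, 38.9)

# Root mean squared error, in BMI units
rmse <- sqrt(mean((obs - pred)^2))

# Squared-correlation form of R-squared
r2_corr <- cor(obs, pred)^2

# Traditional form: 1 - residual sum of squares / total sum of squares
r2_trad <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

print(round(c(RMSE = rmse, R2_corr = r2_corr, R2_traditional = r2_trad), 4))
```

The correlation form ignores systematic bias in the predictions, so reporting RMSE alongside it gives a fuller picture.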

Combined Summary Table

The combined results for classification and regression models are presented in the summary table below.

Feature Importance Insights

Among the predictors, Glucose levels were identified as the most critical feature for predicting diabetes likelihood, aligning with established medical knowledge. BMI and Diabetes Pedigree Function also played significant roles, emphasizing the importance of weight management and genetic predisposition in diabetes risk. Additional predictors like BloodPressure and Insulin offered insights into the biological factors influencing diabetes, suggesting opportunities for targeted interventions aimed at early prevention.

Recommendations

For classification, the Decision Tree is the strongest candidate on the reported accuracy and Kappa, with SVM a close alternative; both would benefit from hyperparameter tuning before deployment. Logistic Regression can still be utilized for initial feature exploration, as it provides interpretable insights into key predictors. For regression, XGBoost is the preferred model given its robust performance in handling complex relationships. Random Forest serves as an alternative where interpretability is slightly more important but performance remains critical. Finally, feature engineering for Linear Regression may further improve overall performance in regression tasks.

Diabetic Data Insights: Statistical Analysis and Predictive Modeling

Group 12 - Programming For Data Science (WQD7004)
