Group Members:

Hussain Ali Kazim (23115338)
Wan Mohamad Hasif Bin W. Mohd Saleh (22119325)
Hanim Sofiah Bin Shahrom (22102228)
Nurul Hafizah binti Zaini (17172928)
Muhammad Hakim Bin Nasaruddin (23079722)

Introduction

Dataset

The dataset was obtained from Kaggle and originates from the National Institute of Diabetes and Digestive and Kidney Diseases. It includes several diagnostic measurements as independent variables and a single target variable, ‘Outcome’, which indicates whether the patient has diabetes.

Link: https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset/data

Project Context

The term ‘diabetes’ has become increasingly familiar in recent years, as the prevalence of the condition continues to rise globally. Diabetes is a chronic medical condition that occurs when the body is unable to properly regulate blood sugar levels. It can lead to severe complications, including heart disease, kidney failure and nerve damage. Early detection and intervention are therefore crucial to managing diabetes and preventing its progression.

This project aims to develop a machine learning classification model to predict the likelihood of an individual developing diabetes. Additionally, it focuses on predicting the progression of diabetes using BMI as a key feature, providing insights for early diagnosis and personalized healthcare interventions.

Objective

To develop classification models that predict the likelihood of an individual developing diabetes
To predict the progression of diabetes, using BMI as a key feature, by developing regression models

Load the dataset

raw_data <- read.csv("raw_data_diabetes.csv")

Data Cleaning

Dimension: dim()

dim(raw_data)

## [1] 769   9

Head: head()

head(raw_data)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Summary: summary()

summary(raw_data)

##   Pregnancies        Glucose      BloodPressure      SkinThickness  
##  Min.   :-1.000   Min.   :  0.0   Length:769         Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   Class :character   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Mode  :character   Median :23.00  
##  Mean   : 3.839   Mean   :120.9                      Mean   :20.55  
##  3rd Qu.: 6.000   3rd Qu.:140.0                      3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0                      Max.   :99.00  
##    Insulin               BMI         DiabetesPedigreeFunction      Age        
##  Length:769         Min.   :-30.10   Min.   :0.0780           Min.   :-30.00  
##  Class :character   1st Qu.: 27.30   1st Qu.:0.2440           1st Qu.: 24.00  
##  Mode  :character   Median : 32.00   Median :0.3710           Median : 29.00  
##                     Mean   : 31.91   Mean   :0.4717           Mean   : 33.15  
##                     3rd Qu.: 36.60   3rd Qu.:0.6260           3rd Qu.: 41.00  
##                     Max.   : 67.10   Max.   :2.4200           Max.   : 81.00  
##     Outcome      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3485  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Structure: str()

str(raw_data)

## 'data.frame':    769 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : chr  "72" "66" "64" "66" ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : chr  "0" "0" "0" "94" ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Load necessary libraries

library(dplyr)

Check for missing values in the raw dataset

missing_values_summary <- sapply(raw_data, function(x) sum(is.na(x)))  # Count missing values per column
print(missing_values_summary)

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

Handle special cases: a "Null" entry and a number written as text

# Convert "Null" in Insulin column to 0 (row 763)
raw_data[763, "Insulin"] <- 0

# Convert "Seventy" in BloodPressure column to 70 (row 765)
raw_data[765, "BloodPressure"] <- 70
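
The two rows fixed above are addressed by position. As a hedged sketch of how such entries could be located programmatically rather than by hard-coded row number, the helper below flags values in a character column that fail numeric conversion. The function name `find_non_numeric` and the toy vector are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: return positions and values that cannot be parsed as numbers
find_non_numeric <- function(x) {
  parsed <- suppressWarnings(as.numeric(x))  # unparseable strings become NA
  bad <- is.na(parsed) & !is.na(x)           # exclude values that were already NA
  data.frame(row = which(bad), value = x[bad], stringsAsFactors = FALSE)
}

# Toy column mimicking the kinds of entries found in the raw file
toy_column <- c("0", "94", "Null", "168", "Seventy")
bad_entries <- find_non_numeric(toy_column)
print(bad_entries)
```

A check like this, applied to `raw_data$Insulin` and `raw_data$BloodPressure` before the numeric conversion, would also explain why `is.na()` reported no missing values: the placeholders are stored as text, not as `NA`.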

Convert data types to ensure consistency

# Convert all values in BloodPressure column to numeric
raw_data$BloodPressure <- as.numeric(raw_data$BloodPressure)

# Convert all values in Insulin column to numeric
raw_data$Insulin <- as.numeric(raw_data$Insulin)

Convert negative values to absolute

raw_data <- raw_data %>%
  mutate(across(where(is.numeric), ~ abs(.)))

Remove duplicate rows

raw_data <- distinct(raw_data)

Save the cleaned dataset

# write.csv() returns NULL, so its result is not assigned to a variable
write.csv(raw_data, "diabetesprojectdata.csv", row.names = FALSE)

Updated dataset structure

data <- read.csv("diabetesprojectdata.csv")
str(data)

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

EDA

Load necessary libraries

library(dplyr)
library(ggplot2)
library(corrplot)

Glimpse of the data

glimpse(data)

## Rows: 768
## Columns: 9
## $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …

Summary statistics for all variables

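# na.rm is set inside each lambda; passing it through across(...) is deprecated in recent dplyr
stat_fns <- list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE),
                 min = ~ min(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE))
data %>% summarise(across(everything(), stat_fns))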

##   Pregnancies_mean Pregnancies_sd Pregnancies_min Pregnancies_max Glucose_mean
## 1         3.845052       3.369578               0              17     120.8945
##   Glucose_sd Glucose_min Glucose_max BloodPressure_mean BloodPressure_sd
## 1   31.97262           0         199           69.10547         19.35581
##   BloodPressure_min BloodPressure_max SkinThickness_mean SkinThickness_sd
## 1                 0               122           20.53646         15.95222
##   SkinThickness_min SkinThickness_max Insulin_mean Insulin_sd Insulin_min
## 1                 0                99     79.79948    115.244           0
##   Insulin_max BMI_mean  BMI_sd BMI_min BMI_max DiabetesPedigreeFunction_mean
## 1         846 31.99258 7.88416       0    67.1                     0.4718763
##   DiabetesPedigreeFunction_sd DiabetesPedigreeFunction_min
## 1                   0.3313286                        0.078
##   DiabetesPedigreeFunction_max Age_mean   Age_sd Age_min Age_max Outcome_mean
## 1                         2.42 33.24089 11.76023      21      81    0.3489583
##   Outcome_sd Outcome_min Outcome_max
## 1  0.4769514           0           1

Check for missing values

data %>%
  summarise(across(everything(), ~sum(is.na(.))))

##   Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1           0       0             0             0       0   0
##   DiabetesPedigreeFunction Age Outcome
## 1                        0   0       0

Histogram for numerical variables

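numeric_cols <- colnames(data)[sapply(data, is.numeric)]
# .data[[col]] replaces the deprecated aes_string(); bins = 30 adapts to each variable's
# range, whereas a fixed binwidth of 10 is too coarse for small-scale variables
for (col in numeric_cols) {
  print(
    ggplot(data, aes(x = .data[[col]])) +
      geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
      labs(title = paste("Distribution of", col), x = col, y = "Frequency")
  )
}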

Correlation matrix and visualization

cor_matrix <- cor(data %>% select_if(is.numeric), use = "complete.obs")
print(cor_matrix)

##                          Pregnancies    Glucose BloodPressure SkinThickness
## Pregnancies               1.00000000 0.12945867    0.14128198   -0.08167177
## Glucose                   0.12945867 1.00000000    0.15258959    0.05732789
## BloodPressure             0.14128198 0.15258959    1.00000000    0.20737054
## SkinThickness            -0.08167177 0.05732789    0.20737054    1.00000000
## Insulin                  -0.07353461 0.33135711    0.08893338    0.43678257
## BMI                       0.01768309 0.22107107    0.28180529    0.39257320
## DiabetesPedigreeFunction -0.03352267 0.13733730    0.04126495    0.18392757
## Age                       0.54434123 0.26351432    0.23952795   -0.11397026
## Outcome                   0.22189815 0.46658140    0.06506836    0.07475223
##                              Insulin        BMI DiabetesPedigreeFunction
## Pregnancies              -0.07353461 0.01768309              -0.03352267
## Glucose                   0.33135711 0.22107107               0.13733730
## BloodPressure             0.08893338 0.28180529               0.04126495
## SkinThickness             0.43678257 0.39257320               0.18392757
## Insulin                   1.00000000 0.19785906               0.18507093
## BMI                       0.19785906 1.00000000               0.14064695
## DiabetesPedigreeFunction  0.18507093 0.14064695               1.00000000
## Age                      -0.04216295 0.03624187               0.03356131
## Outcome                   0.13054795 0.29269466               0.17384407
##                                  Age    Outcome
## Pregnancies               0.54434123 0.22189815
## Glucose                   0.26351432 0.46658140
## BloodPressure             0.23952795 0.06506836
## SkinThickness            -0.11397026 0.07475223
## Insulin                  -0.04216295 0.13054795
## BMI                       0.03624187 0.29269466
## DiabetesPedigreeFunction  0.03356131 0.17384407
## Age                       1.00000000 0.23835598
## Outcome                   0.23835598 1.00000000

corrplot::corrplot(cor_matrix, method = "circle")

Scatter plot between Glucose and BMI grouped by Outcome

ggplot(data, aes(x = Glucose, y = BMI, color = as.factor(Outcome))) +
  geom_point(alpha = 0.7) +
  labs(title = "Glucose vs BMI by Outcome", x = "Glucose", y = "BMI")

Categorize Age into groups

data <- data %>%
  mutate(AgeGroup = case_when(
    Age < 30 ~ "Under 30",
    Age >= 30 & Age < 50 ~ "30-49",
    Age >= 50 ~ "50 and above"
  ))

Age group distribution by Outcome

data %>%
  group_by(AgeGroup, Outcome) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = AgeGroup, y = Count, fill = as.factor(Outcome))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Age Group Distribution by Outcome", x = "Age Group", y = "Count")

# Remove the temporary AgeGroup column so it is not used as a model feature
data <- data %>% select(-AgeGroup)

Classification Model

Classification modelling was executed to predict the likelihood of diabetes occurrence based on the input features. The classification models used in this project are as follows:

Logistic Regression
Decision Tree
Support Vector Machine (SVM)

library(caret)
library(rpart)
library(rpart.plot)
library(e1071)

Train Test Split

The dataset is split into an 80% training set and a 20% testing set. The split is stratified to ensure that the distribution of the target variable’s classes remains consistent between the training and testing datasets.

# Select all columns except the target variable 'Outcome'
X <- data[, setdiff(names(data), 'Outcome')]

# Select the target variable 'Outcome'
y <- data$Outcome

# Set seed for reproducibility
set.seed(123)

# Stratified train-test split
trainIndex <- createDataPartition(data$Outcome, p = 0.8, list = FALSE)

# Create training and testing sets
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
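
To illustrate what stratification means here, the following base-R sketch samples 80% of the rows within each class of a toy outcome vector, so both subsets keep the same class proportions. The toy data and variable names are assumptions for illustration; the analysis itself relies on caret's `createDataPartition()`.

```r
set.seed(123)

# Toy outcome: 65 non-diabetic (0) and 35 diabetic (1) cases
toy_outcome <- c(rep(0, 65), rep(1, 35))

# Sample 80% of the row indices separately within each class
train_idx <- unlist(lapply(split(seq_along(toy_outcome), toy_outcome), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

# Share of class 1 in each subset
train_prop <- mean(toy_outcome[train_idx])
test_prop <- mean(toy_outcome[-train_idx])

cat("Share of class 1 in train:", train_prop, "\n")
cat("Share of class 1 in test: ", test_prop, "\n")
```

Because sampling happens inside each class, the share of diabetic cases is the same in both subsets, which a purely random split does not guarantee.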

Factor Transformation

The target variable for classification modelling is ‘Outcome’ which is a categorical variable. By using factor transformation on our target variable, the models will treat it as a categorical variable with distinct classes.

train$Outcome <- factor(train$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))
test$Outcome <- factor(test$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))

Model 1: Logistic Regression

Logistic regression is a supervised machine learning algorithm used for binary classification, where the goal is to predict the probability of an outcome belonging to one of two classes. It is widely used for its simplicity and interpretability.

logistic_model <- glm(Outcome ~ . , data = train, family = binomial)
#Pregnancies + Glucose + BMI + DiabetesPedigreeFunction + BloodPressure

# Model summary
summary(logistic_model)

## 
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.2216062  0.7781491 -10.566  < 2e-16 ***
## Pregnancies               0.1185211  0.0357929   3.311 0.000929 ***
## Glucose                   0.0352886  0.0041848   8.433  < 2e-16 ***
## BloodPressure            -0.0130815  0.0057276  -2.284 0.022374 *  
## SkinThickness            -0.0009780  0.0075288  -0.130 0.896648    
## Insulin                  -0.0009111  0.0009841  -0.926 0.354533    
## BMI                       0.0861702  0.0166338   5.180 2.21e-07 ***
## DiabetesPedigreeFunction  0.7824888  0.3212008   2.436 0.014845 *  
## Age                       0.0152434  0.0102944   1.481 0.138676    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 797.28  on 614  degrees of freedom
## Residual deviance: 583.64  on 606  degrees of freedom
## AIC: 601.64
## 
## Number of Fisher Scoring iterations: 5

# Predict probabilities and classes
predicted_probs <- predict(logistic_model, newdata = test, type = "response")
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)

Observation:

Glucose, BMI and Pregnancies were the most significant predictors (p < 0.001), followed by BloodPressure and DiabetesPedigreeFunction (p < 0.05). The coefficients suggest a positive association of Pregnancies, Glucose, BMI and DiabetesPedigreeFunction with diabetes likelihood and a slight negative association with BloodPressure. SkinThickness, Insulin and Age were not significant at the 5% level.
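
Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. The sketch below does this by hand for three of the estimates printed in the summary above; the vector name `coef_est` is an illustrative assumption.

```r
# Coefficient estimates copied from the model summary above (log-odds scale)
coef_est <- c(Glucose = 0.0352886, BMI = 0.0861702, BloodPressure = -0.0130815)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coef_est)
print(round(odds_ratios, 3))

# Odds ratio for a 10-unit increase in Glucose
glucose_10 <- exp(10 * coef_est[["Glucose"]])
print(round(glucose_10, 3))
```

With the fitted model in memory, `exp(coef(logistic_model))` would give the same ratios for all predictors. A ratio above 1 means higher odds of the diabetic class per unit increase, and a ratio below 1 means lower odds, holding the other predictors fixed.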

Model 2: Decision Tree

Decision Tree is a flowchart-like structure used to make predictions or decisions. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.

Train a decision tree model

tree_model <- rpart(Outcome ~ ., data = train, method = "class")

To plot the decision tree

rpart.plot(tree_model)

Observation:

The model identified Glucose as the most significant predictor at the root node, with a threshold of 144 used to split the data.

Model 3: Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm that can be used for classification. It seeks the optimal hyperplane that separates the data into different classes.

Train an SVM model without tuning (default settings)

svm_model <- svm(Outcome ~ ., data = train, type = "C-classification", kernel = "radial")

Predict on the test set

svm_predictions <- predict(svm_model, newdata = test)

Regression Model

Regression modeling is a fundamental statistical and machine learning technique used to understand and quantify relationships between variables. Regression helps to model the relationship between a dependent variable (also known as the response or target variable) and one or more independent variables (also called predictors or features).

This project also explores regression modelling using R. Using the same dataset, ‘BMI’ is chosen as the target variable.

Why BMI as target variable?

Key Health Indicator - BMI (Body Mass Index) is a widely recognized measure of body fat, associated with health risks such as diabetes and cardiovascular conditions.

Feature Relevance - Features like Glucose, Insulin and SkinThickness are biologically linked to BMI and influence metabolic health.

Continuous Target Variable - BMI is continuous, making it suitable for regression modeling and enabling meaningful analysis of predictor relationships.

Data Completeness - BMI has no missing (NA) values in the dataset, which supports reliable and interpretable model results.

The regression models explored are:

Linear Regression
Random Forest
Extreme Gradient Boosting (XGBoost)

# Load the package
library(randomForest)

Train Test Split

Data was split into 80% training and 20% test sets.

# Select the features and target variable
# We will predict BMI using other features
dataReg <- subset(data, select = -c(Outcome)) # Exclude the binary outcome column

# Split the data into training and testing sets
set.seed(123)  # For reproducibility
trainIndex <- createDataPartition(dataReg$BMI, p = 0.8, list = FALSE)
trainData <- dataReg[trainIndex, ]
testData <- dataReg[-trainIndex, ]

Model 1: Linear Regression

Linear regression is a simple and interpretable technique that models the relationship between a dependent variable and one or more predictors by fitting a linear equation.

Train a linear regression model

# Train a linear regression model
lm_model <- train(BMI ~ ., data = trainData, method = "lm")

View the model summaries

cat("Linear Regression Summary:\n")

## Linear Regression Summary:

summary(lm_model$finalModel)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.852  -3.960  -0.195   3.697  28.249 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              16.994090   1.521067  11.172  < 2e-16 ***
## Pregnancies               0.043360   0.101077   0.429   0.6681    
## Glucose                   0.051422   0.009903   5.193 2.83e-07 ***
## BloodPressure             0.085203   0.015587   5.466 6.72e-08 ***
## SkinThickness             0.165516   0.021356   7.751 3.85e-14 ***
## Insulin                  -0.003679   0.002906  -1.266   0.2060    
## DiabetesPedigreeFunction  1.434437   0.861407   1.665   0.0964 .  
## Age                      -0.029789   0.030587  -0.974   0.3305    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.956 on 609 degrees of freedom
## Multiple R-squared:  0.227,  Adjusted R-squared:  0.2181 
## F-statistic: 25.55 on 7 and 609 DF,  p-value: < 2.2e-16
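
As a worked example of how the fitted equation produces a prediction, the sketch below applies the printed coefficients to the first record shown by `head()` earlier. The vector names are illustrative assumptions, and whether this record fell in the training or the test split is not known.

```r
# Coefficients copied from the linear regression summary above
coefs <- c(Intercept = 16.994090, Pregnancies = 0.043360, Glucose = 0.051422,
           BloodPressure = 0.085203, SkinThickness = 0.165516, Insulin = -0.003679,
           DiabetesPedigreeFunction = 1.434437, Age = -0.029789)

# Predictor values of the first record (its observed BMI is 33.6)
record <- c(Intercept = 1, Pregnancies = 6, Glucose = 148, BloodPressure = 72,
            SkinThickness = 35, Insulin = 0, DiabetesPedigreeFunction = 0.627, Age = 50)

# Prediction = intercept + sum of (coefficient * predictor value)
predicted_bmi <- sum(coefs * record[names(coefs)])
cat("Predicted BMI:", round(predicted_bmi, 2), "\n")
cat("Residual (observed - predicted):", round(33.6 - predicted_bmi, 2), "\n")
```

The prediction of roughly 36.2 against an observed 33.6 lies well within the residual standard error of 6.956 reported above.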

Make predictions on the test set

lm_pred <- predict(lm_model, newdata = testData)

Model 2: Random Forest

Random forest is an ensemble learning method that uses multiple decision trees to model complex relationships between a dependent variable and its predicting features.

Train a random forest regression model

rf_model <- train(BMI ~ ., data = trainData, method = "rf",
                  tuneGrid = expand.grid(.mtry = seq(2, ncol(trainData) - 1, by = 1)),
                  trControl = trainControl(method = "cv", number = 5))

cat("\nRandom Forest Model Parameters:\n")

## 
## Random Forest Model Parameters:

print(rf_model$bestTune)

##   mtry
## 1    2

rf_pred <- predict(rf_model, newdata = testData)

Model 3: Extreme Gradient Boosting (XGBoost)

XGBoost is a powerful gradient boosting algorithm designed for efficiency and performance. This technique uses an ensemble of decision trees to optimize predictions.

Train an XGBoost regression model

xgb_model <- train(BMI ~ ., data = trainData, method = "xgbLinear",
                   tuneGrid = expand.grid(.nrounds = seq(50, 200, by = 50),
                                          .lambda = c(0, 0.1, 1),
                                          .alpha = c(0, 0.1, 1),
                                          .eta = c(0.01, 0.1, 0.3)),
                   trControl = trainControl(method = "cv", number = 5))

cat("\nXGBoost Model Best Parameters:\n")

## 
## XGBoost Model Best Parameters:

print(xgb_model$bestTune)

##    nrounds lambda alpha  eta
## 25      50      1     1 0.01

xgb_pred <- predict(xgb_model, newdata = testData)

# Export
saveRDS(logistic_model, file = "logistic_model.rds")
saveRDS(tree_model, file = "decision_tree_model.rds")
saveRDS(svm_model, file = "svm_model.rds")
saveRDS(lm_model, file = "linear_regression_model.rds")
saveRDS(rf_model, file = "random_forest_model.rds")
saveRDS(xgb_model, file = "xgboost_model.rds")

Evaluation

The evaluation of models was conducted to analyze and compare the performance of the classification and regression models. Metrics such as Accuracy, Kappa, ROC Curves, RMSE and R-Squared are used to measure performance. The evaluation findings are summarized below.

rm(list = ls())
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2747865 146.8    4384907 234.2  4384907 234.2
## Vcells 4802311  36.7   17625305 134.5 17605019 134.4

output_dir <- "plots"
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

library(dplyr)
library(ggplot2)
library(caret)
library(yardstick)  # evaluation metrics
library(ROCR)       # ROC and AUC
library(DALEX)      # model explainability
library(gridExtra)
library(grid)
library(Metrics)    # regression metrics

loaded_logistic_model <- readRDS("logistic_model.rds")
loaded_tree_model <- readRDS("decision_tree_model.rds")
loaded_svm_model <- readRDS("svm_model.rds")

loaded_lm_model <- readRDS("linear_regression_model.rds")
loaded_rf_model <- readRDS("random_forest_model.rds")
loaded_xgb_model <- readRDS("xgboost_model.rds")

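# Read the cleaned file written in the Data Cleaning step
data <- read.csv("diabetesprojectdata.csv")

# Re-create the classification split with the seed used at training time,
# so that the test set holds only rows the models were not trained on
set.seed(123)
trainIndex <- createDataPartition(data$Outcome, p = 0.8, list = FALSE)
train <- data[trainIndex, ]
test <- data[-trainIndex, ]

test$Outcome <- factor(test$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))
test$Outcome_numeric <- ifelse(test$Outcome == "diabetic", 1, 0)

# Re-create the regression split, which was partitioned on BMI at training time
dataReg <- data %>% select(-Outcome)
set.seed(123)
regIndex <- createDataPartition(dataReg$BMI, p = 0.8, list = FALSE)
testData <- dataReg[-regIndex, ]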

Classification Models

Accuracy and Kappa Scores

The Accuracy and Kappa scores for the classification models are computed below, and the accuracies are compared in a bar chart.

logistic_probs <- predict(loaded_logistic_model, newdata = test, type = "response")
logistic_preds <- ifelse(logistic_probs > 0.5, "diabetic", "non.diabetic")
logistic_conf <- confusionMatrix(factor(logistic_preds, levels = levels(test$Outcome)), test$Outcome)

tree_preds <- predict(loaded_tree_model, newdata = test, type = "class")
tree_conf <- confusionMatrix(tree_preds, test$Outcome)

svm_preds <- predict(loaded_svm_model, newdata = test)
svm_conf <- confusionMatrix(svm_preds, test$Outcome)

classification_summary <- data.frame(
  Model = c("Logistic Regression", "Decision Tree", "SVM"),
  Accuracy = c(logistic_conf$overall["Accuracy"],
               tree_conf$overall["Accuracy"],
               svm_conf$overall["Accuracy"]),
  Kappa = c(logistic_conf$overall["Kappa"],
            tree_conf$overall["Kappa"],
            svm_conf$overall["Kappa"])
)

classification_plot <- ggplot(classification_summary, aes(x = Model, y = Accuracy)) +
  geom_col(fill = "steelblue") +
  ggtitle("Classification Model Accuracy") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave(file.path(output_dir, "classification_accuracy.png"), plot = classification_plot)
knitr::include_graphics("plots/classification_accuracy.png")

The SVM model achieved the highest accuracy (78.41%) with a Kappa score of 0.4799, indicating that it classifies diabetic and non-diabetic cases effectively. This performance reflects the ability of its radial kernel to model non-linear decision boundaries. The Decision Tree model achieved slightly lower accuracy (77.78%) but the highest Kappa score (0.5136), which suggests that it may handle class imbalance more effectively. Logistic Regression had the lowest accuracy (75.82%) and Kappa (0.422), which highlights its limitations in capturing non-linear relationships.
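
Accuracy and Kappa can rank models differently because Kappa discounts the agreement expected by chance. The sketch below computes both by hand from a confusion matrix; the counts are hypothetical and chosen only to illustrate the arithmetic, not taken from the models above.

```r
# Hypothetical confusion matrix (rows = predicted, columns = actual)
cm <- matrix(c(85, 18,
               15, 35),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("non.diabetic", "diabetic"),
                             Actual = c("non.diabetic", "diabetic")))

n <- sum(cm)
observed_agreement <- sum(diag(cm)) / n                     # accuracy
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa_score <- (observed_agreement - expected_agreement) / (1 - expected_agreement)

cat("Accuracy:", round(observed_agreement, 4), "\n")
cat("Kappa:   ", round(kappa_score, 4), "\n")
```

When one class dominates, chance agreement is high, so a model can reach a respectable accuracy while its Kappa stays modest.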

ROC Curves

The ROC curves for all three classification models are visualized below to compare their true positive rates (TPR) against false positive rates (FPR).

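# Logistic regression supplies continuous probabilities, giving a full ROC curve
logistic_pred <- prediction(logistic_probs, test$Outcome_numeric)
logistic_perf <- performance(logistic_pred, "tpr", "fpr")

# The tree and SVM predictions are hard class labels converted to numbers,
# so each of these curves is built from a single operating point
tree_pred <- prediction(as.numeric(tree_preds), test$Outcome_numeric)
tree_perf <- performance(tree_pred, "tpr", "fpr")

svm_pred <- prediction(as.numeric(svm_preds), test$Outcome_numeric)
svm_perf <- performance(svm_pred, "tpr", "fpr")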

# Plot and save
png(file.path(output_dir, "classification_roc_curves.png"), width = 800, height = 600)
plot(logistic_perf, col = "blue", lty = 1, main = "ROC Curves")
plot(tree_perf, col = "red", lty = 2, add = TRUE)
plot(svm_perf, col = "green", lty = 3, add = TRUE)
legend("bottomright", legend = c("Logistic Regression", "Decision Tree", "SVM"),
       col = c("blue", "red", "green"), lty = 1:3)
dev.off()

## png 
##   2

knitr::include_graphics("plots/classification_roc_curves.png")

The Logistic Regression ROC curve had the smoothest shape, indicating stable performance across varying thresholds. However, its overall performance was slightly below that of the Decision Tree and SVM models. The Decision Tree showed strong initial performance but a sharper drop-off in its curve, which may indicate overfitting to the training data. SVM produced a less smooth curve but consistently outperformed Logistic Regression in regions requiring higher specificity, which highlights its strength in distinguishing between the classes.
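
An ROC curve is traced by sweeping a threshold over continuous scores, so a model that supplies only hard class labels contributes a single operating point rather than a full curve. The base-R sketch below illustrates the difference on simulated scores; the data, the function names `roc_points` and `auc_trapezoid`, and the 0.5 cut-off are illustrative assumptions.

```r
set.seed(123)

# Simulated truth and continuous scores (higher score = more likely positive)
truth <- rep(c(0, 1), each = 50)
scores <- c(rnorm(50, mean = 0.35, sd = 0.15), rnorm(50, mean = 0.65, sd = 0.15))

# TPR and FPR at every distinct threshold, starting from the (0, 0) corner
roc_points <- function(score, label) {
  thresholds <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) mean(score[label == 1] >= t))
  fpr <- sapply(thresholds, function(t) mean(score[label == 0] >= t))
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

# Area under the curve by the trapezoidal rule
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc_scores <- roc_points(scores, truth)                    # many thresholds
roc_labels <- roc_points(as.numeric(scores > 0.5), truth)  # hard labels only

cat("Curve points from scores:", nrow(roc_scores), "\n")
cat("Curve points from labels:", nrow(roc_labels), "\n")
cat("AUC from scores:", round(auc_trapezoid(roc_scores), 3), "\n")
cat("AUC from labels:", round(auc_trapezoid(roc_labels), 3), "\n")
```

The label-based curve consists of the two corners plus one interior point, which is why curves built from class labels look angular compared with curves built from probabilities.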

Logistic Regression Feature Importance

The feature importance of the logistic regression model, computed with permutation-based variable importance, highlights which predictors contribute the most to the model’s predictions.

explainer_logistic <- explain(
  model = loaded_logistic_model,
  data = test,
  y = test$Outcome_numeric,
  predict_function = function(m, d) predict(m, newdata = d, type = "response"),
  label = "Logistic Regression"
)

## Preparation of a new explainer is initiated
##   -> model label       :  Logistic Regression 
##   -> data              :  153  rows  10  cols 
##   -> target variable   :  153  values 
##   -> predict function  :  function(m, d) predict(m, newdata = d, type = "response") 
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package stats , ver. 4.4.1 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.01023708 , mean =  0.3313538 , max =  0.9619726  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.8273228 , mean =  0.008515518 , max =  0.9319744  
##   A new explainer has been created!

importance_logistic <- model_parts(explainer_logistic)

# Save and display plot
png(file.path(output_dir, "logistic_importance.png"), width = 800, height = 600)
print(plot(importance_logistic))
dev.off()

## png 
##   2

knitr::include_graphics("plots/logistic_importance.png")

Among the predictors, Glucose was identified as the most important feature for predicting diabetes likelihood, which aligns with established medical knowledge. BMI and DiabetesPedigreeFunction also played significant roles, emphasizing the importance of weight management and genetic predisposition in diabetes risk. Additional predictors such as BloodPressure and Insulin offer insight into the biological factors influencing diabetes and can help target interventions aimed at early prevention.
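
Permutation-based importance measures how much a model's loss grows when the values of one predictor are shuffled, which breaks that predictor's link with the target. The base-R sketch below reproduces the idea on simulated data with one informative and one uninformative predictor; the data, the mean-squared-error loss on predicted probabilities and all names are illustrative assumptions rather than the DALEX internals.

```r
set.seed(123)

# Simulated data: x_signal drives the outcome, x_noise does not
n <- 500
sim <- data.frame(x_signal = rnorm(n), x_noise = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(2 * sim$x_signal))

fit <- glm(y ~ x_signal + x_noise, data = sim, family = binomial)

# Loss: mean squared difference between outcome and predicted probability
loss <- function(model, d) {
  mean((d$y - predict(model, newdata = d, type = "response"))^2)
}

baseline <- loss(fit, sim)

# Increase in loss after shuffling each predictor, averaged over 10 repeats
permutation_importance <- sapply(c("x_signal", "x_noise"), function(v) {
  mean(replicate(10, {
    shuffled <- sim
    shuffled[[v]] <- sample(shuffled[[v]])
    loss(fit, shuffled) - baseline
  }))
})

print(round(permutation_importance, 4))
```

Shuffling the informative predictor raises the loss noticeably, while shuffling the noise predictor leaves it almost unchanged, which is the pattern the importance chart summarizes for the real features.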

Logistic Regression Residuals Distribution

The residual distribution plot below demonstrates the reverse cumulative distribution of residuals for the logistic regression model.

# DALEX residual analysis
performance_logistic <- model_performance(explainer_logistic)

# Save and display the residual plot
png(file.path(output_dir, "logistic_residuals.png"), width = 800, height = 600)
print(plot(performance_logistic))
dev.off()

## png 
##   2

knitr::include_graphics("plots/logistic_residuals.png")

Logistic Regression remains valuable for its interpretability, as seen in the feature importance chart. The model identified Glucose as the most significant predictor of diabetes likelihood, followed by BMI and DiabetesPedigreeFunction, which reinforces the importance of these factors in understanding diabetes risk.

Regression Models

Regression Performance

The RMSE (Root Mean Squared Error) and R-Squared values for the regression models are computed below, and the RMSE values are compared in a bar chart.

rf_preds <- predict(loaded_rf_model, newdata = testData)
rf_metrics <- data.frame(
  RMSE = RMSE(rf_preds, testData$BMI),
  Rsquared = R2(rf_preds, testData$BMI)
)

lm_preds <- predict(loaded_lm_model, newdata = testData)
lm_metrics <- data.frame(
  RMSE = RMSE(lm_preds, testData$BMI),
  Rsquared = R2(lm_preds, testData$BMI)
)

xgb_preds <- predict(loaded_xgb_model, newdata = testData)
xgb_metrics <- data.frame(
  RMSE = RMSE(xgb_preds, testData$BMI),
  Rsquared = R2(xgb_preds, testData$BMI)
)

regression_metrics <- rbind(
  Random_Forest = rf_metrics,
  Linear_Regression = lm_metrics,
  XGBoost = xgb_metrics
)

regression_plot <- ggplot(as.data.frame(regression_metrics), aes(x = rownames(regression_metrics), y = RMSE)) +
  geom_col(fill = "lightblue") +
  ggtitle("Regression Model Performance") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave(file.path(output_dir, "regression_performance.png"), plot = regression_plot)
knitr::include_graphics("plots/regression_performance.png")

XGBoost emerged as the most effective regression model, achieving the lowest RMSE (3.14) and the highest R-squared value (0.836). These metrics indicate its ability to accurately model the relationship between the features and BMI. Random Forest followed with an RMSE of 4.16 and an R-squared of 0.767, showing its strength in capturing non-linear relationships while falling short of XGBoost. Linear Regression had the highest RMSE (6.88) and the lowest R-squared (0.209), reflecting its limitations in addressing complex, non-linear patterns.
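
For reference, the two metrics can be computed by hand as follows. The toy vectors are illustrative assumptions. R-squared is shown in its squared-correlation form, which is understood to be the default of caret's `R2()`, alongside the 1 - SSE/SST form for comparison.

```r
# Toy observed and predicted BMI values
observed <- c(30, 25, 35, 28)
predicted <- c(28, 26, 33, 30)

# RMSE: square root of the mean squared error
rmse <- sqrt(mean((observed - predicted)^2))

# R-squared as the squared Pearson correlation between observed and predicted
r2_corr <- cor(observed, predicted)^2

# R-squared as 1 - SSE / SST
r2_traditional <- 1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)

cat("RMSE:", round(rmse, 3), "\n")
cat("R-squared (correlation form):", round(r2_corr, 3), "\n")
cat("R-squared (1 - SSE/SST form):", round(r2_traditional, 3), "\n")
```

The two R-squared forms need not agree on held-out data, so it helps to state which one a reported value uses.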

Conclusion

The summary tables and recommendations below conclude the analysis.

classification_summary_table <- tableGrob(
  classification_summary,
  theme = ttheme_default(
    core = list(fg_params = list(cex = 0.9)),
    colhead = list(fg_params = list(fontface = "bold", fontsize = 10))
  )
)

regression_summary_table <- tableGrob(
  regression_metrics,
  theme = ttheme_default(
    core = list(fg_params = list(cex = 0.9)),
    colhead = list(fg_params = list(fontface = "bold", fontsize = 10))
  )
)

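# Save the combined tables so the summary image included below exists, then display them
final_summary <- arrangeGrob(classification_summary_table, regression_summary_table, ncol = 1)
ggsave(file.path(output_dir, "final_results_summary.png"), plot = final_summary)
grid.arrange(classification_summary_table, regression_summary_table, ncol = 1)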

Combined Summary Table

The combined results for classification and regression models are presented in the summary below.

knitr::include_graphics("plots/final_results_summary.png")

Recommendations

To enhance classification performance, deploying SVM is recommended for applications that prioritize accuracy, as it achieved the highest accuracy of the three models. Logistic Regression can still be used for initial feature exploration, as it provides interpretable insights into key predictors. For regression, XGBoost is the preferred model given its robust performance in handling complex relationships. Random Forest serves as an alternative where interpretability is slightly more important but performance remains critical. Finally, hyperparameter tuning for the Decision Tree and SVM models in classification, together with feature engineering for Linear Regression in regression tasks, could improve overall performance.

Diabetic Data Insights: Statistical Analysis and Predictive Modeling

Group 12 - Programming For Data Science (WQD7004)
