The dataset was obtained from Kaggle, a public data-sharing platform, and originates from the National Institute of Diabetes and Digestive and Kidney Diseases. It includes several diagnostic measurements as independent variables and a single target variable, ‘Outcome’, which indicates whether the patient has diabetes.
Link: https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset/data
The term ‘diabetes’ has become increasingly familiar in recent years, as the prevalence of the condition continues to rise globally. Diabetes is a chronic medical condition that occurs when the body is unable to properly regulate blood sugar levels. It can lead to severe complications, including heart disease, kidney failure, and nerve damage. Early detection and intervention are therefore crucial to managing diabetes and preventing its progression.
This project aims to develop a machine learning classification model to predict the likelihood of an individual having diabetes. Additionally, it builds regression models that predict BMI, a key health indicator associated with diabetes, to provide insights for early diagnosis and personalized healthcare interventions.
raw_data <- read.csv("C:\\Users\\Nasaruddin\\Group12-Project\\raw_data_diabetes.csv")
Dimension: dim()
dim(raw_data)
## [1] 769 9
Head: head()
head(raw_data)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Summary: summary()
summary(raw_data)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. :-1.000 Min. : 0.0 Length:769 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 Class :character 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Mode :character Median :23.00
## Mean : 3.839 Mean :120.9 Mean :20.55
## 3rd Qu.: 6.000 3rd Qu.:140.0 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Length:769 Min. :-30.10 Min. :0.0780 Min. :-30.00
## Class :character 1st Qu.: 27.30 1st Qu.:0.2440 1st Qu.: 24.00
## Mode :character Median : 32.00 Median :0.3710 Median : 29.00
## Mean : 31.91 Mean :0.4717 Mean : 33.15
## 3rd Qu.: 36.60 3rd Qu.:0.6260 3rd Qu.: 41.00
## Max. : 67.10 Max. :2.4200 Max. : 81.00
## Outcome
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3485
## 3rd Qu.:1.0000
## Max. :1.0000
Structure: str()
str(raw_data)
## 'data.frame': 769 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : chr "72" "66" "64" "66" ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : chr "0" "0" "0" "94" ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Load necessary libraries
library(dplyr)
Check for missing values in the raw dataset
missing_values_summary <- sapply(raw_data, function(x) sum(is.na(x))) # Count missing values per column
print(missing_values_summary)
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
Handle specific cases: a "Null" entry and a number written out as a word
# Convert "Null" in Insulin column to 0 (row 763)
raw_data[763, "Insulin"] <- 0
# Convert "Seventy" in BloodPressure column to 70 (row 765)
raw_data[765, "BloodPressure"] <- 70
Convert data types to ensure consistency
# Convert all values in BloodPressure column to numeric
raw_data$BloodPressure <- as.numeric(raw_data$BloodPressure)
# Convert all values in Insulin column to numeric
raw_data$Insulin <- as.numeric(raw_data$Insulin)
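The two fixes above rely on hard-coded row numbers, and the earlier `is.na()` check reported no missing values because text such as "Null" is not `NA`. As a hedged alternative, the sketch below locates entries that cannot be parsed as numbers. The toy vector and the helper name `find_non_numeric` are illustrative assumptions, not part of the project code.

```r
# Hypothetical helper: positions of entries that as.numeric() cannot parse
find_non_numeric <- function(x) {
  parsed <- suppressWarnings(as.numeric(x))
  # Unparseable text becomes NA after conversion although the original is not NA
  which(is.na(parsed) & !is.na(x))
}

# Toy column mimicking the two kinds of problem values described above
toy_column <- c("72", "66", "Seventy", "64", "Null")
bad_rows <- find_non_numeric(toy_column)
print(bad_rows)
print(toy_column[bad_rows])
```

Running such a check before `as.numeric()` would show which rows need manual correction, without assuming their positions in advance.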
Convert negative values to absolute
raw_data <- raw_data %>%
mutate(across(where(is.numeric), ~ abs(.)))
Remove duplicate rows
raw_data <- distinct(raw_data)
Save the cleaned dataset
# write.csv() returns NULL, so its result is not assigned to a variable
write.csv(raw_data, "C:\\Users\\Nasaruddin\\Group12-Project\\diabetesprojectdata.csv", row.names = FALSE)
Updated dataset structure
data <- read.csv("C:\\Users\\Nasaruddin\\Group12-Project\\diabetesprojectdata.csv")
str(data)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Load necessary libraries
library(dplyr)
library(ggplot2)
library(corrplot)
Glimpse of the data
glimpse(data)
## Rows: 768
## Columns: 9
## $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
Summary statistics for all variables
data %>%
summarise(across(everything(), list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE))
## Pregnancies_mean Pregnancies_sd Pregnancies_min Pregnancies_max Glucose_mean
## 1 3.845052 3.369578 0 17 120.8945
## Glucose_sd Glucose_min Glucose_max BloodPressure_mean BloodPressure_sd
## 1 31.97262 0 199 69.10547 19.35581
## BloodPressure_min BloodPressure_max SkinThickness_mean SkinThickness_sd
## 1 0 122 20.53646 15.95222
## SkinThickness_min SkinThickness_max Insulin_mean Insulin_sd Insulin_min
## 1 0 99 79.79948 115.244 0
## Insulin_max BMI_mean BMI_sd BMI_min BMI_max DiabetesPedigreeFunction_mean
## 1 846 31.99258 7.88416 0 67.1 0.4718763
## DiabetesPedigreeFunction_sd DiabetesPedigreeFunction_min
## 1 0.3313286 0.078
## DiabetesPedigreeFunction_max Age_mean Age_sd Age_min Age_max Outcome_mean
## 1 2.42 33.24089 11.76023 21 81 0.3489583
## Outcome_sd Outcome_min Outcome_max
## 1 0.4769514 0 1
Check for missing values
data %>%
summarise(across(everything(), ~sum(is.na(.))))
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0 0 0 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 1 0 0 0
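The check above counts only `NA` values. The summary statistics, however, show a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, and such zeros may stand in for measurements that were never recorded. The sketch below counts zeros per column. It uses only the first six rows copied from the `head()` output, so the counts are illustrative rather than totals for the full dataset.

```r
# First six rows of selected columns, copied from the head() output
sample_rows <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 116),
  BloodPressure = c(72, 66, 64, 66, 40, 74),
  SkinThickness = c(35, 29, 0, 23, 35, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
)

# Count how many entries in each column are exactly zero
zero_counts <- colSums(sample_rows == 0)
print(zero_counts)
```

Applied to the full `data` frame, the same `colSums(data == 0)` call would indicate how many values per column might need imputation or exclusion.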
Histogram for numerical variables
numeric_cols <- colnames(data)[sapply(data, is.numeric)]
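# aes(x = .data[[col]]) replaces the deprecated aes_string();
# bins = 30 adapts to each variable's range, unlike a fixed binwidth of 10
for (col in numeric_cols) {
  print(
    ggplot(data, aes(x = .data[[col]])) +
      geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
      labs(title = paste("Distribution of", col), x = col, y = "Frequency")
  )
}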
Correlation matrix and visualization
cor_matrix <- cor(data %>% select_if(is.numeric), use = "complete.obs")
print(cor_matrix)
## Pregnancies Glucose BloodPressure SkinThickness
## Pregnancies 1.00000000 0.12945867 0.14128198 -0.08167177
## Glucose 0.12945867 1.00000000 0.15258959 0.05732789
## BloodPressure 0.14128198 0.15258959 1.00000000 0.20737054
## SkinThickness -0.08167177 0.05732789 0.20737054 1.00000000
## Insulin -0.07353461 0.33135711 0.08893338 0.43678257
## BMI 0.01768309 0.22107107 0.28180529 0.39257320
## DiabetesPedigreeFunction -0.03352267 0.13733730 0.04126495 0.18392757
## Age 0.54434123 0.26351432 0.23952795 -0.11397026
## Outcome 0.22189815 0.46658140 0.06506836 0.07475223
## Insulin BMI DiabetesPedigreeFunction
## Pregnancies -0.07353461 0.01768309 -0.03352267
## Glucose 0.33135711 0.22107107 0.13733730
## BloodPressure 0.08893338 0.28180529 0.04126495
## SkinThickness 0.43678257 0.39257320 0.18392757
## Insulin 1.00000000 0.19785906 0.18507093
## BMI 0.19785906 1.00000000 0.14064695
## DiabetesPedigreeFunction 0.18507093 0.14064695 1.00000000
## Age -0.04216295 0.03624187 0.03356131
## Outcome 0.13054795 0.29269466 0.17384407
## Age Outcome
## Pregnancies 0.54434123 0.22189815
## Glucose 0.26351432 0.46658140
## BloodPressure 0.23952795 0.06506836
## SkinThickness -0.11397026 0.07475223
## Insulin -0.04216295 0.13054795
## BMI 0.03624187 0.29269466
## DiabetesPedigreeFunction 0.03356131 0.17384407
## Age 1.00000000 0.23835598
## Outcome 0.23835598 1.00000000
corrplot::corrplot(cor_matrix, method = "circle")
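To make the matrix easier to read, the sketch below ranks the predictors by their correlation with Outcome. The values are copied from the Outcome column of the printed matrix; the variable names `outcome_cor` and `ranked` are chosen here for illustration.

```r
# Correlation of each predictor with Outcome, copied from the printed matrix
outcome_cor <- c(
  Pregnancies = 0.22189815, Glucose = 0.46658140, BloodPressure = 0.06506836,
  SkinThickness = 0.07475223, Insulin = 0.13054795, BMI = 0.29269466,
  DiabetesPedigreeFunction = 0.17384407, Age = 0.23835598
)

# Sort from strongest to weakest linear association
ranked <- sort(outcome_cor, decreasing = TRUE)
print(round(ranked, 3))
```

Glucose comes first and BMI second, while BloodPressure shows the weakest linear association with Outcome.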
Scatter plot between Glucose and BMI grouped by Outcome
ggplot(data, aes(x = Glucose, y = BMI, color = as.factor(Outcome))) +
geom_point(alpha = 0.7) +
labs(title = "Glucose vs BMI by Outcome", x = "Glucose", y = "BMI")
Categorize Age into groups
data <- data %>%
mutate(AgeGroup = case_when(
Age < 30 ~ "Under 30",
Age >= 30 & Age < 50 ~ "30-49",
Age >= 50 ~ "50 and above"
))
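The same age bands can be produced in base R with `cut()`. This is only an equivalent sketch of the grouping above; the example ages are chosen to sit on the band boundaries.

```r
# Ages on and around the boundaries of the three bands
ages <- c(21, 29, 30, 49, 50, 81)

# right = FALSE makes each interval closed on the left: [30, 50) is "30-49"
age_group <- cut(
  ages,
  breaks = c(-Inf, 30, 50, Inf),
  labels = c("Under 30", "30-49", "50 and above"),
  right = FALSE
)
print(data.frame(Age = ages, AgeGroup = age_group))
```

Unlike `case_when()`, `cut()` returns an ordered set of factor levels, which keeps the bands in a sensible order on plot axes.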
Age group distribution by Outcome
data %>%
group_by(AgeGroup, Outcome) %>%
summarise(Count = n()) %>%
ggplot(aes(x = AgeGroup, y = Count, fill = as.factor(Outcome))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Age Group Distribution by Outcome", x = "Age Group", y = "Count")
## Remove AgeGroup column if it was added accidentally
data <- data %>% select(-AgeGroup)
Classification modelling was carried out to predict the likelihood of diabetes based on the input features. The classification models used in this project are Logistic Regression, Decision Tree, and Support Vector Machine (SVM).
library(caret)
library(rpart)
library(rpart.plot)
library(e1071)
The dataset is split into an 80% training set and a 20% testing set. The train-test split is stratified to ensure that the distribution of the target variable’s classes remains consistent between the training and testing datasets.
# Select all columns except the target variable 'Outcome'
X <- data[, setdiff(names(data), 'Outcome')]
# Select the target variable 'Outcome'
y <- data$Outcome
# Set seed for reproducibility
set.seed(123)
# Stratified train-test split
trainIndex <- createDataPartition(data$Outcome, p = 0.8, list = FALSE)
# Create training and testing sets
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
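To illustrate what stratification achieves, the sketch below performs a stratified 80/20 split in base R on toy labels. It is not caret's actual implementation of `createDataPartition()`; the label vector is made up to resemble the class balance of about 35% positive cases.

```r
set.seed(123)

# Toy labels: 65 non-diabetic (0) and 35 diabetic (1)
outcome <- c(rep(0, 65), rep(1, 35))

# Sample 80% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(outcome), outcome), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

# The share of positive cases is the same in both partitions
train_rate <- mean(outcome[train_idx])
test_rate <- mean(outcome[-train_idx])
print(c(train = train_rate, test = test_rate))
```

Because sampling happens inside each class, both partitions keep the same proportion of diabetic cases, which a purely random split does not guarantee.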
The target variable for classification modelling is ‘Outcome’ which is a categorical variable. By using factor transformation on our target variable, the models will treat it as a categorical variable with distinct classes.
train$Outcome <- factor(train$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))
test$Outcome <- factor(test$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))
Logistic regression is a supervised machine learning algorithm used for binary classification, where the goal is to predict the probability of an outcome belonging to one of two classes. It is widely used for its simplicity and interpretability.
logistic_model <- glm(Outcome ~ . , data = train, family = binomial)
#Pregnancies + Glucose + BMI + DiabetesPedigreeFunction + BloodPressure
# Model summary
summary(logistic_model)
##
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.2216062 0.7781491 -10.566 < 2e-16 ***
## Pregnancies 0.1185211 0.0357929 3.311 0.000929 ***
## Glucose 0.0352886 0.0041848 8.433 < 2e-16 ***
## BloodPressure -0.0130815 0.0057276 -2.284 0.022374 *
## SkinThickness -0.0009780 0.0075288 -0.130 0.896648
## Insulin -0.0009111 0.0009841 -0.926 0.354533
## BMI 0.0861702 0.0166338 5.180 2.21e-07 ***
## DiabetesPedigreeFunction 0.7824888 0.3212008 2.436 0.014845 *
## Age 0.0152434 0.0102944 1.481 0.138676
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 797.28 on 614 degrees of freedom
## Residual deviance: 583.64 on 606 degrees of freedom
## AIC: 601.64
##
## Number of Fisher Scoring iterations: 5
# Predict probabilities and classes
predicted_probs <- predict(logistic_model, newdata = test, type = "response")
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)
Observation:
Significant predictors, such as Glucose, BMI, and Pregnancies, were identified based on their p-values and estimated coefficients. The model suggests a positive association of Pregnancies, Glucose, BMI, and DiabetesPedigreeFunction with diabetes likelihood and a slight negative association with BloodPressure.
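Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below does this for the statistically significant coefficients copied from the model summary; the variable names are illustrative.

```r
# Coefficients (log-odds) copied from the model summary
coefs <- c(
  Pregnancies = 0.1185211, Glucose = 0.0352886,
  BloodPressure = -0.0130815, BMI = 0.0861702,
  DiabetesPedigreeFunction = 0.7824888
)

# exp() converts a log-odds coefficient into a multiplicative odds ratio
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

An odds ratio above 1 means the odds of diabetes rise with the predictor. For Glucose, each additional unit multiplies the odds by roughly 1.04, holding the other predictors fixed, while the ratio for BloodPressure is slightly below 1.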
Decision Tree is a flowchart-like structure used to make predictions or decisions. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.
Train a decision tree model
tree_model <- rpart(Outcome ~ ., data = train, method = "class")
To plot the decision tree
rpart.plot(tree_model)
Observation:
The model identified Glucose as the most significant predictor at the root node, with a threshold of 144 used to split the data.
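A classification tree picks the split that most reduces class impurity, and `rpart` uses the Gini index by default for this. The sketch below evaluates two candidate Glucose thresholds on the six rows shown in the `head()` output. The helper names `gini` and `split_impurity` are assumptions for illustration, and with only six rows the best threshold naturally differs from the one found on the full training set.

```r
# Gini impurity of a set of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Size-weighted Gini impurity of the two groups created by a threshold
split_impurity <- function(x, labels, threshold) {
  left <- labels[x < threshold]
  right <- labels[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Glucose and Outcome for the first six rows, copied from head()
glucose <- c(148, 85, 183, 89, 137, 116)
outcome <- c(1, 0, 1, 0, 1, 0)

print(gini(outcome))                          # impurity before splitting
print(split_impurity(glucose, outcome, 144))  # one diabetic case stays on the left
print(split_impurity(glucose, outcome, 120))  # separates the six rows perfectly
```

The lower the weighted impurity after the split, the more useful the threshold is for separating the two classes.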
Support Vector Machine (SVM) is also a supervised machine learning algorithm that can be used for classification. In classification, SVM seeks the optimal hyperplane that separates the data into different categories.
Train an SVM model without tuning (default settings)
svm_model <- svm(Outcome ~ ., data = train, type = "C-classification", kernel = "radial")
Predict on the test set
svm_predictions <- predict(svm_model, newdata = test)
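The radial kernel used above measures similarity between two observations as exp(-gamma * squared distance). The sketch below computes it by hand for two toy points. The function name `rbf_kernel` and the value of `gamma` are arbitrary choices for illustration, not the settings used by the fitted model.

```r
# Radial basis function (RBF) kernel between two numeric vectors
rbf_kernel <- function(x, y, gamma) {
  exp(-gamma * sum((x - y)^2))
}

a <- c(0, 0)
b <- c(1, 1)

same_point <- rbf_kernel(a, a, gamma = 0.5)  # identical points give similarity 1
far_point <- rbf_kernel(a, b, gamma = 0.5)   # similarity decays with distance
print(c(same_point = same_point, far_point = far_point))
```

Since the kernel depends on distances, features on large scales would dominate it, which is why scaling the predictors matters for this kind of model.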
Regression modeling is a fundamental statistical and machine learning technique used to understand and quantify relationships between variables. Regression helps to model the relationship between a dependent variable (also known as the response or target variable) and one or more independent variables (also called predictors or features).
This project also explores regression modelling using R. To carry out regression modelling on the same dataset, ‘BMI’ is chosen as the target variable.
Why BMI as target variable?
Key Health Indicator - BMI (Body Mass Index) is a widely recognized measure of body fat, associated with health risks such as diabetes and cardiovascular conditions.
Feature Relevance - Features like Glucose, Insulin, and SkinThickness are biologically linked to BMI and influence metabolic health.
Continuous Target Variable - BMI is continuous, making it suitable for regression modeling and enabling meaningful analysis of predictor relationships.
Data Completeness - BMI has no missing (NA) entries in the dataset, although the summary statistics show a minimum of 0, so a few zero values may be placeholders for unrecorded measurements.
The regression models explored are:
1. Linear Regression
2. Random Forest
3. Extreme Gradient Boosting (XGBoost)
# Load the package
library(randomForest)
Data was split into 80% training and 20% test sets.
# Select the features and target variable
# We will predict BMI using other features
dataReg <- subset(data, select = -c(Outcome)) # Exclude the binary outcome column
# Split the data into training and testing sets
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(dataReg$BMI, p = 0.8, list = FALSE)
trainData <- dataReg[trainIndex, ]
testData <- dataReg[-trainIndex, ]
Linear regression is a simple and interpretable technique that models the relationship between a dependent variable and one or more predictors by fitting a linear equation.
Train a linear regression model
# Train a linear regression model
lm_model <- train(BMI ~ ., data = trainData, method = "lm")
View the model summaries
cat("Linear Regression Summary:\n")
## Linear Regression Summary:
summary(lm_model$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.852 -3.960 -0.195 3.697 28.249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.994090 1.521067 11.172 < 2e-16 ***
## Pregnancies 0.043360 0.101077 0.429 0.6681
## Glucose 0.051422 0.009903 5.193 2.83e-07 ***
## BloodPressure 0.085203 0.015587 5.466 6.72e-08 ***
## SkinThickness 0.165516 0.021356 7.751 3.85e-14 ***
## Insulin -0.003679 0.002906 -1.266 0.2060
## DiabetesPedigreeFunction 1.434437 0.861407 1.665 0.0964 .
## Age -0.029789 0.030587 -0.974 0.3305
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.956 on 609 degrees of freedom
## Multiple R-squared: 0.227, Adjusted R-squared: 0.2181
## F-statistic: 25.55 on 7 and 609 DF, p-value: < 2.2e-16
Make predictions on the test set
lm_pred <- predict(lm_model, newdata = testData)
Random forest is an ensemble learning method that uses multiple decision trees to model complex relationships between a dependent variable and its predicting features.
Train a random forest regression model
rf_model <- train(BMI ~ ., data = trainData, method = "rf",
tuneGrid = expand.grid(.mtry = seq(2, ncol(trainData) - 1, by = 1)),
trControl = trainControl(method = "cv", number = 5))
cat("\nRandom Forest Model Parameters:\n")
##
## Random Forest Model Parameters:
print(rf_model$bestTune)
## mtry
## 1 2
rf_pred <- predict(rf_model, newdata = testData)
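The `trainControl(method = "cv", number = 5)` setting evaluates each candidate `mtry` with 5-fold cross-validation. The sketch below shows the idea in base R. To stay dependency-free it uses `lm()` on simulated data as a stand-in for the random forest, so it illustrates the resampling scheme rather than reproducing caret's internals.

```r
set.seed(123)

# Simulated data: y depends linearly on x plus noise
n <- 20
k <- 5
dat <- data.frame(x = seq_len(n))
dat$y <- 2 * dat$x + rnorm(n)

# Assign each row to one of k equally sized folds, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

# Train on k - 1 folds, measure RMSE on the held-out fold, repeat for every fold
fold_rmse <- sapply(seq_len(k), function(f) {
  fit <- lm(y ~ x, data = dat[fold_id != f, ])
  pred <- predict(fit, newdata = dat[fold_id == f, ])
  sqrt(mean((dat$y[fold_id == f] - pred)^2))
})

cv_rmse <- mean(fold_rmse)
print(cv_rmse)
```

In the tuning above, this averaged RMSE is computed for every `mtry` value, and the value with the lowest average is reported as `bestTune`.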
XGBoost is a powerful gradient boosting algorithm designed for efficiency and performance. This technique uses an ensemble of decision trees to optimize predictions.
Train an XGBoost regression model
xgb_model <- train(BMI ~ ., data = trainData, method = "xgbLinear",
tuneGrid = expand.grid(.nrounds = seq(50, 200, by = 50),
.lambda = c(0, 0.1, 1),
.alpha = c(0, 0.1, 1),
.eta = c(0.01, 0.1, 0.3)),
trControl = trainControl(method = "cv", number = 5))
cat("\nXGBoost Model Best Parameters:\n")
##
## XGBoost Model Best Parameters:
print(xgb_model$bestTune)
## nrounds lambda alpha eta
## 25 50 1 1 0.01
xgb_pred <- predict(xgb_model, newdata = testData)
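A grid search fits one model per parameter combination and per fold, so the cost grows quickly. The sketch below rebuilds the same grid with `expand.grid()` (without the leading dots in the column names) and counts the fits implied by 5-fold cross-validation.

```r
# Same candidate values as in the tuning grid above
tune_grid <- expand.grid(
  nrounds = seq(50, 200, by = 50),
  lambda  = c(0, 0.1, 1),
  alpha   = c(0, 0.1, 1),
  eta     = c(0.01, 0.1, 0.3)
)

n_candidates <- nrow(tune_grid)  # 4 * 3 * 3 * 3 combinations
n_fits <- n_candidates * 5       # one fit per combination and fold
print(c(candidates = n_candidates, fits = n_fits))
```

With 108 combinations and 5 folds, tuning requires 540 model fits before the final model is refitted on the full training set, which explains why this step takes the longest.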
# Export the logistic regression model
saveRDS(logistic_model, file = "logistic_model.rds")
# Export the decision tree model
saveRDS(tree_model, file = "decision_tree_model.rds")
# Export the SVM model
saveRDS(svm_model, file = "svm_model.rds")
# Export the linear regression model
saveRDS(lm_model, file = "linear_regression_model.rds")
# Export the random forest model
saveRDS(rf_model, file = "random_forest_model.rds")
# Export the XGBoost model
saveRDS(xgb_model, file = "xgboost_model.rds")
rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2753421 147.1 4420428 236.1 4420428 236.1
## Vcells 4812887 36.8 17644568 134.7 17642590 134.7
library(dplyr)
library(ggplot2)
library(caret)
library(yardstick) # For evaluation metrics
library(ROCR) # For ROC and AUC
library(DALEX) # For model explainability
library(shapviz) # For SHAP values
library(vip) # For variable importance plots
library(mlflow) # For experiment tracking
library(ranger) # For Random Forest with SHAP support
library(randomForest)
Load the Models
Classification Models
loaded_logistic_model <- readRDS("logistic_model.rds")
loaded_tree_model <- readRDS("decision_tree_model.rds")
loaded_svm_model <- readRDS("svm_model.rds")
Regression Models
loaded_lm_model <- readRDS("linear_regression_model.rds")
loaded_rf_model <- readRDS("random_forest_model.rds")
loaded_xgb_model <- readRDS("xgboost_model.rds")
Load the Data
data <- read.csv("C:\\Users\\Nasaruddin\\Group12-Project\\diabetesprojectdata.csv")
set.seed(123) # Same seed as in training, so the identical train-test split is reproduced
trainIndex <- createDataPartition(data$Outcome, p = 0.8, list = FALSE)
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
Prepare Classification Test Data
test$Outcome <- factor(test$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))
test$Outcome_numeric <- ifelse(test$Outcome == "diabetic", 1, 0) # Numeric outcome for DALEX
Prepare Regression Test Data
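dataReg <- data %>% select(-Outcome)
set.seed(123) # Reproduce the regression split, which was partitioned on BMI during training
regIndex <- createDataPartition(dataReg$BMI, p = 0.8, list = FALSE)
testData <- dataReg[-regIndex, ]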
Classification Models Evaluation
Logistic Regression
logistic_probs <- predict(loaded_logistic_model, newdata = test, type = "response")
logistic_preds <- ifelse(logistic_probs > 0.5, "diabetic", "non.diabetic")
logistic_conf <- confusionMatrix(factor(logistic_preds, levels = levels(test$Outcome)), test$Outcome)
Logistic Regression Explainer
Ensure numeric targets for DALEX
test$Outcome_numeric <- ifelse(test$Outcome == "diabetic", 1, 0)
Create an explainer
explainer_logistic <- explain(
model = loaded_logistic_model,
data = test,
y = test$Outcome_numeric,
predict_function = function(m, d) predict(m, newdata = d, type = "response"),
label = "Logistic Regression"
)
## Preparation of a new explainer is initiated
## -> model label : Logistic Regression
## -> data : 153 rows 10 cols
## -> target variable : 153 values
## -> predict function : function(m, d) predict(m, newdata = d, type = "response")
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package stats , ver. 4.4.1 , task classification ( default )
## -> predicted values : numerical, min = 0.01023708 , mean = 0.3313538 , max = 0.9619726
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.8273228 , mean = 0.008515518 , max = 0.9319744
## A new explainer has been created!
Evaluate performance
performance_logistic <- model_performance(explainer_logistic)
plot(performance_logistic)
ggsave("logistic_performance.png")
Feature Importance
importance_logistic <- model_parts(explainer_logistic)
plot(importance_logistic)
ggsave("logistic_importance.png")
Decision Tree
tree_preds <- predict(loaded_tree_model, newdata = test, type = "class")
tree_conf <- confusionMatrix(tree_preds, test$Outcome)
SVM
svm_preds <- predict(loaded_svm_model, newdata = test)
svm_conf <- confusionMatrix(svm_preds, test$Outcome)
Plot ROC Curves for Classification Models
logistic_pred <- prediction(logistic_probs, test$Outcome_numeric)
logistic_perf <- performance(logistic_pred, "tpr", "fpr")
tree_pred <- prediction(as.numeric(tree_preds), test$Outcome_numeric)
tree_perf <- performance(tree_pred, "tpr", "fpr")
svm_pred <- prediction(as.numeric(svm_preds), test$Outcome_numeric)
svm_perf <- performance(svm_pred, "tpr", "fpr")
# Base graphics are saved with png()/dev.off(); ggsave() only captures ggplot objects
png("classification_roc_curves.png")
plot(logistic_perf, col = "blue", lty = 1, main = "ROC Curves")
plot(tree_perf, col = "red", lty = 2, add = TRUE)
plot(svm_perf, col = "green", lty = 3, add = TRUE)
legend("bottomright", legend = c("Logistic Regression", "Decision Tree", "SVM"),
       col = c("blue", "red", "green"), lty = 1:3)
dev.off()
Regression Models Evaluation
library(dplyr)
library(caret)
library(Metrics)
Random Forest
rf_preds <- predict(loaded_rf_model, newdata = testData)
rf_metrics <- data.frame(
RMSE = RMSE(rf_preds, testData$BMI),
Rsquared = R2(rf_preds, testData$BMI)
)
Linear Regression
lm_preds <- predict(loaded_lm_model, newdata = testData)
lm_metrics <- data.frame(
RMSE = RMSE(lm_preds, testData$BMI),
Rsquared = R2(lm_preds, testData$BMI)
)
XGBoost
xgb_preds <- predict(loaded_xgb_model, newdata = testData)
xgb_metrics <- data.frame(
RMSE = RMSE(xgb_preds, testData$BMI),
Rsquared = R2(xgb_preds, testData$BMI)
)
Combine Regression Metrics
regression_metrics <- rbind(
Random_Forest = rf_metrics,
Linear_Regression = lm_metrics,
XGBoost = xgb_metrics
)
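For reference, the two metrics can be computed by hand. The sketch below assumes that caret's `R2()` uses its documented default, the squared correlation between predictions and observations, and contrasts it with the traditional 1 - SSE/SST definition. The observed values are the BMI entries from the `head()` output, and the predictions are made up for illustration.

```r
obs <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)   # BMI values from head()
pred <- c(31.0, 28.0, 25.0, 29.0, 38.0, 27.0)  # made-up predictions

# Root mean squared error: typical size of a prediction error, in BMI units
rmse <- sqrt(mean((obs - pred)^2))

# Squared correlation (assumed to match caret's default R2)
r2_corr <- cor(obs, pred)^2

# Traditional R-squared: share of variance explained by the predictions
r2_trad <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

print(c(RMSE = rmse, R2_corr = r2_corr, R2_traditional = r2_trad))
```

The squared correlation is never lower than the traditional value, because it ignores systematic bias in the predictions, so the two definitions should not be mixed when comparing models.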
Plot Regression Performance
# Base graphics are saved with png()/dev.off(); ggsave() only captures ggplot objects
png("regression_performance.png")
barplot(as.matrix(regression_metrics), beside = TRUE, col = c("blue", "red", "green"),
        main = "Regression Model Performance", legend = rownames(regression_metrics))
dev.off()
Final Tables and Visualizations
Classification Summary
classification_summary <- data.frame(
Model = c("Logistic Regression", "Decision Tree", "SVM"),
Accuracy = c(logistic_conf$overall["Accuracy"],
tree_conf$overall["Accuracy"],
svm_conf$overall["Accuracy"]),
Kappa = c(logistic_conf$overall["Kappa"],
tree_conf$overall["Kappa"],
svm_conf$overall["Kappa"])
)
ggplot(classification_summary, aes(x = Model, y = Accuracy)) +
geom_col(fill = "steelblue") +
ggtitle("Classification Model Accuracy") +
theme_minimal()
ggsave("classification_accuracy.png")
FINAL RESULT
print("Classification Summary")
## [1] "Classification Summary"
print(classification_summary)
## Model Accuracy Kappa
## 1 Logistic Regression 0.7712418 0.4518374
## 2 Decision Tree 0.8104575 0.5675017
## 3 SVM 0.7843137 0.4671240
## Regression Summary
print("Regression Metrics")
## [1] "Regression Metrics"
print(regression_metrics)
## RMSE Rsquared
## Random_Forest 4.339702 0.7800229
## Linear_Regression 6.857145 0.2753592
## XGBoost 3.591581 0.8039952
The evaluation of models was conducted to analyze and compare the performance of the classification and regression models. Metrics such as Accuracy, Kappa, ROC Curves, RMSE and R-Squared are used to measure performance. The evaluation findings are summarized below.
Accuracy and Kappa Scores
The Accuracy and Kappa scores for the classification models are summarized in the following table.
{r ClassificationAccuracyPlot, echo=FALSE}
knitr::include_graphics("C:\\Users\\Nasaruddin\\Desktop\\classification_accuracy.png")
In the run reported above, the Decision Tree achieved the highest accuracy (81.05%) and Kappa score (0.5675), indicating the strongest agreement with the true labels beyond what chance would produce. The SVM model followed with an accuracy of 78.43% and a Kappa of 0.4671. Logistic Regression had the lowest accuracy (77.12%) and Kappa (0.4518), which may reflect its limited ability to capture non-linear relationships in the dataset. Because Kappa corrects for agreement expected by chance, it is more informative than accuracy alone when the classes are imbalanced, as they are here (about 35% of patients are diabetic).
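Both metrics can be derived directly from a confusion matrix. The sketch below uses made-up counts for 153 test cases, so its results are illustrative and do not correspond to any of the fitted models.

```r
# Toy confusion matrix (rows = predicted, columns = actual); counts are made up
cm <- matrix(
  c(90, 10,
    20, 33),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Predicted = c("non.diabetic", "diabetic"),
    Actual = c("non.diabetic", "diabetic")
  )
)

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                         # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2      # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)
print(c(Accuracy = accuracy, Kappa = cohen_kappa))
```

Kappa is lower than accuracy because a classifier that mostly predicts the majority class already agrees with the labels fairly often by chance.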
The ROC curves for all three classification models are visualized below to compare their true positive rates (TPR) against false positive rates (FPR).
The Logistic Regression ROC curve displayed the smoothest shape, indicative of stable performance across varying thresholds. However, its overall performance was slightly below that of the Decision Tree and SVM models. The Decision Tree showed strong initial performance but experienced a sharper drop-off in its curve, indicating potential overfitting to the training data. On the other hand, SVM demonstrated a less smooth curve but consistently outperformed Logistic Regression in regions requiring higher specificity, highlighting its strength in distinguishing between the classes.
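The smoothness of a ROC curve depends on whether it is drawn from predicted probabilities or from hard class labels. The sketch below computes the area under the curve (AUC) with the rank-based Mann-Whitney formula for both cases. The scores and labels are toy values, and `auc_rank` is a helper written for this illustration rather than a ROCR function.

```r
# AUC via the rank-sum (Mann-Whitney) formula; ties receive average ranks
auc_rank <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 0, 1, 1)
probs <- c(0.10, 0.30, 0.35, 0.60, 0.70, 0.90)  # toy predicted probabilities
hard <- as.numeric(probs > 0.5)                 # the same scores cut at 0.5

auc_probs <- auc_rank(probs, labels)
auc_hard <- auc_rank(hard, labels)
print(c(probabilities = auc_probs, hard_labels = auc_hard))
```

Hard labels collapse the curve to a single operating point and discard ranking information. Since the Decision Tree and SVM curves above are built from predicted classes, this may account for part of the difference in shape compared with Logistic Regression.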
The permutation-based feature importance of the logistic regression model highlights which predictors contribute the most to the model’s predictions.
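Permutation importance measures how much a model's loss worsens when one feature is shuffled, which breaks its link to the target. The sketch below shows the idea on simulated data with base R only. The use of log loss and the variable names are assumptions made here; DALEX may use a different loss function and repeats the shuffling several times.

```r
set.seed(123)

# Simulated data: 'signal' drives the outcome, 'noise' is unrelated to it
n <- 500
dat <- data.frame(signal = rnorm(n), noise = rnorm(n))
dat$y <- rbinom(n, 1, plogis(2 * dat$signal))
fit <- glm(y ~ signal + noise, data = dat, family = binomial)

# Log loss of the model on a data frame
log_loss <- function(model, d) {
  p <- predict(model, newdata = d, type = "response")
  -mean(d$y * log(p) + (1 - d$y) * log(1 - p))
}

baseline <- log_loss(fit, dat)

# Importance = increase in loss after shuffling one feature
importance <- sapply(c("signal", "noise"), function(v) {
  shuffled <- dat
  shuffled[[v]] <- sample(shuffled[[v]])
  log_loss(fit, shuffled) - baseline
})
print(round(importance, 3))
```

Shuffling the informative feature raises the loss clearly, while shuffling the unrelated one leaves it almost unchanged, and this is the pattern the importance chart summarizes for the real predictors.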
The residual distribution plot below demonstrates the reverse cumulative distribution of residuals for the logistic regression model.
Logistic Regression remains valuable for its interpretability, as seen in the feature importance chart. The model identified Glucose as the most significant predictor for diabetes likelihood, followed by BMI and Diabetes Pedigree Function, reinforcing the importance of these factors in understanding diabetes risk.
Regression Performance
The RMSE (Root Mean Squared Error) and R-Squared values for regression models are summarized in the following table.
XGBoost emerged as the most effective regression model in the reported run, achieving the lowest RMSE (3.59) and the highest R-squared value (0.804). Random Forest followed with an RMSE of 4.34 and an R-squared of 0.780, showing its strength in capturing non-linear relationships while falling short of XGBoost. Linear Regression, in contrast, had the highest RMSE (6.86) and the lowest R-squared (0.275), reflecting its limitations in modelling complex, non-linear patterns.
The combined results for classification and regression models are presented in the summary table below.
Among the predictors, Glucose levels were identified as the most critical feature for predicting diabetes likelihood, aligning with established medical knowledge. BMI and Diabetes Pedigree Function also played significant roles, emphasizing the importance of weight management and genetic predisposition in diabetes risk. Additional predictors like BloodPressure and Insulin offered insights into the biological factors influencing diabetes, suggesting opportunities for targeted interventions aimed at early prevention.
For classification, the Decision Tree is the recommended starting point, since it achieved the highest accuracy and Kappa score in the reported run, with SVM as a close alternative. Logistic Regression can still be used for initial feature exploration, as it provides interpretable insights into key predictors. For regression, XGBoost is the preferred model given its robust performance in handling complex relationships. Random Forest serves as an alternative where interpretability is slightly more important but performance remains critical. Finally, hyperparameter tuning for the Decision Tree and SVM models, along with feature engineering for Linear Regression, may further improve overall performance.