Group Members:

Hussain Ali Kazim (23115338)
Wan Mohamad Hasif Bin W. Mohd Saleh (22119325)
Hanim Sofiah Bin Shahrom (22102228)
Nurul Hafizah binti Zaini (17172928)
Muhammad Hakim Bin Nasaruddin (23079722)

Introduction

Dataset

The dataset was obtained from Kaggle, an open data-sharing platform, and originates from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains several diagnostic measurements as independent variables and a single binary target variable, ‘Outcome’, indicating whether the patient has diabetes.

Link: https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset/data

Project Context

The term ‘diabetes’ has become increasingly familiar in recent years, as the prevalence of the condition continues to rise globally. Diabetes is a chronic medical condition that occurs when the body is unable to properly regulate blood sugar levels. It can lead to severe complications, including heart disease, kidney failure, and nerve damage. Early detection and intervention are therefore crucial to managing diabetes and preventing its progression.

This project aims to develop a machine learning classification model to predict the likelihood of an individual developing diabetes. Additionally, it focuses on predicting the progression of diabetes using BMI as a key feature, providing insights for early diagnosis and personalized healthcare interventions.

Objective

To develop classification models that predict the likelihood of an individual developing diabetes
To develop regression models that predict the progression of diabetes using BMI as a key feature

Load the dataset

raw_data <- read.csv("C:\\Users\\Nasaruddin\\Group12-Project\\raw_data_diabetes.csv")

Data Cleaning

Dimension: dim()

dim(raw_data)

## [1] 769   9

Head: head()

head(raw_data)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Summary: summary()

summary(raw_data)

##   Pregnancies        Glucose      BloodPressure      SkinThickness  
##  Min.   :-1.000   Min.   :  0.0   Length:769         Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   Class :character   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Mode  :character   Median :23.00  
##  Mean   : 3.839   Mean   :120.9                      Mean   :20.55  
##  3rd Qu.: 6.000   3rd Qu.:140.0                      3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0                      Max.   :99.00  
##    Insulin               BMI         DiabetesPedigreeFunction      Age        
##  Length:769         Min.   :-30.10   Min.   :0.0780           Min.   :-30.00  
##  Class :character   1st Qu.: 27.30   1st Qu.:0.2440           1st Qu.: 24.00  
##  Mode  :character   Median : 32.00   Median :0.3710           Median : 29.00  
##                     Mean   : 31.91   Mean   :0.4717           Mean   : 33.15  
##                     3rd Qu.: 36.60   3rd Qu.:0.6260           3rd Qu.: 41.00  
##                     Max.   : 67.10   Max.   :2.4200           Max.   : 81.00  
##     Outcome      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3485  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Structure: str()

str(raw_data)

## 'data.frame':    769 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : chr  "72" "66" "64" "66" ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : chr  "0" "0" "0" "94" ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Load necessary libraries

library(dplyr)

Check for missing values in the raw dataset

missing_values_summary <- sapply(raw_data, function(x) sum(is.na(x)))  # Count missing values per column
print(missing_values_summary)

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0
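
The zero counts above only reflect true NA entries. Text placeholders such as "Null" or "Seventy" in a character column are not NA, so is.na() does not flag them. The following is a minimal sketch, on a small made-up data frame, of how such entries might be located instead of relying on known row numbers. The names toy and find_non_numeric are hypothetical and do not appear in the project code.

```r
# Toy data frame imitating character columns that should be numeric
toy <- data.frame(
  BloodPressure = c("72", "66", "Seventy", "64"),
  Insulin       = c("0", "Null", "94", "168"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: row positions whose value cannot be parsed as a number
find_non_numeric <- function(x) {
  parsed <- suppressWarnings(as.numeric(x))
  which(is.na(parsed) & !is.na(x))
}

bad_rows <- lapply(toy, find_non_numeric)
print(bad_rows)
```

Applied to the character columns of the raw data, a check like this would report the positions that need manual correction before as.numeric() is called.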

Handle special cases: a "Null" string and a number written as a word

# Convert "Null" in Insulin column to 0 (row 763)
raw_data[763, "Insulin"] <- 0

# Convert "Seventy" in BloodPressure column to 70 (row 765)
raw_data[765, "BloodPressure"] <- 70

Convert data types to ensure consistency

# Convert all values in BloodPressure column to numeric
raw_data$BloodPressure <- as.numeric(raw_data$BloodPressure)

# Convert all values in Insulin column to numeric
raw_data$Insulin <- as.numeric(raw_data$Insulin)

Convert negative values to absolute

raw_data <- raw_data %>%
  mutate(across(where(is.numeric), ~ abs(.)))

Remove duplicate rows

raw_data <- distinct(raw_data)
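
The row count reported by str() falls from 769 before cleaning to 768 afterwards, which is consistent with one duplicated row being removed. As a small sketch in base R, assuming a made-up data frame named toy, the number of duplicates could be confirmed before they are dropped:

```r
# Made-up data in which the fourth row repeats the second
toy <- data.frame(
  Glucose = c(148, 85, 183, 85),
  BMI     = c(33.6, 26.6, 23.3, 26.6)
)

# duplicated() flags every repeat of an earlier row
n_dup   <- sum(duplicated(toy))
deduped <- toy[!duplicated(toy), ]

cat("Duplicate rows:", n_dup, "\n")
cat("Rows before:", nrow(toy), "after:", nrow(deduped), "\n")
```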

Save the cleaned dataset

# write.csv() returns NULL invisibly, so its result is not assigned to a variable
write.csv(raw_data, "C:\\Users\\Nasaruddin\\Group12-Project\\diabetesprojectdata.csv", row.names = FALSE)

Updated dataset structure

data <- read.csv("C:\\Users\\Nasaruddin\\Group12-Project\\diabetesprojectdata.csv")
str(data)

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

EDA

Load necessary libraries

library(dplyr)
library(ggplot2)
library(corrplot)

Glimpse of the data

glimpse(data)

## Rows: 768
## Columns: 9
## $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …

Summary statistics for all variables

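data %>%
  # Lambdas replace the deprecated pattern of passing na.rm through across()
  summarise(across(everything(),
                   list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE),
                        min = ~ min(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE))))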

##   Pregnancies_mean Pregnancies_sd Pregnancies_min Pregnancies_max Glucose_mean
## 1         3.845052       3.369578               0              17     120.8945
##   Glucose_sd Glucose_min Glucose_max BloodPressure_mean BloodPressure_sd
## 1   31.97262           0         199           69.10547         19.35581
##   BloodPressure_min BloodPressure_max SkinThickness_mean SkinThickness_sd
## 1                 0               122           20.53646         15.95222
##   SkinThickness_min SkinThickness_max Insulin_mean Insulin_sd Insulin_min
## 1                 0                99     79.79948    115.244           0
##   Insulin_max BMI_mean  BMI_sd BMI_min BMI_max DiabetesPedigreeFunction_mean
## 1         846 31.99258 7.88416       0    67.1                     0.4718763
##   DiabetesPedigreeFunction_sd DiabetesPedigreeFunction_min
## 1                   0.3313286                        0.078
##   DiabetesPedigreeFunction_max Age_mean   Age_sd Age_min Age_max Outcome_mean
## 1                         2.42 33.24089 11.76023      21      81    0.3489583
##   Outcome_sd Outcome_min Outcome_max
## 1  0.4769514           0           1

Check for missing values

data %>%
  summarise(across(everything(), ~sum(is.na(.))))

##   Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1           0       0             0             0       0   0
##   DiabetesPedigreeFunction Age Outcome
## 1                        0   0       0
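
The summary statistics report a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. Whether these zeros are genuine readings or placeholders for unrecorded values is not stated in the source, so the following is only a sketch of how they might be counted. The data frame toy and the column list check_cols are illustrative assumptions.

```r
# Made-up values; zeros are planted deliberately
toy <- data.frame(
  Glucose       = c(148, 0, 183, 89),
  BloodPressure = c(72, 66, 0, 0),
  BMI           = c(33.6, 26.6, 0, 28.1)
)

check_cols <- c("Glucose", "BloodPressure", "BMI")

# Count zero entries per column
zero_counts <- sapply(toy[check_cols], function(x) sum(x == 0))
print(zero_counts)
```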

Histogram for numerical variables

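numeric_cols <- colnames(data)[sapply(data, is.numeric)]
for (col in numeric_cols) {
  print(
    # .data[[col]] replaces the deprecated aes_string(); bins = 30 adapts to each
    # variable's range, whereas a fixed binwidth of 10 collapses narrow-range columns
    ggplot(data, aes(x = .data[[col]])) +
      geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
      labs(title = paste("Distribution of", col), x = col, y = "Frequency")
  )
}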
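
Bin widths are best chosen per variable, since the columns sit on very different scales (DiabetesPedigreeFunction ranges from about 0.08 to 2.42, while Insulin reaches 846). One common data-driven choice is the Freedman-Diaconis rule, 2 * IQR / n^(1/3). The sketch below applies it to simulated values; fd_binwidth is a hypothetical helper and the simulated vectors only stand in for real columns.

```r
# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  2 * IQR(x) / length(x)^(1 / 3)
}

set.seed(1)
small_scale <- runif(500, min = 0, max = 2.5)   # stands in for a narrow-range variable
large_scale <- runif(500, min = 0, max = 850)   # stands in for a wide-range variable

bw_small <- fd_binwidth(small_scale)
bw_large <- fd_binwidth(large_scale)

cat("Narrow variable bin width:", round(bw_small, 3), "\n")
cat("Wide variable bin width:", round(bw_large, 1), "\n")
```

A value computed this way could be passed to the binwidth argument of geom_histogram() for the corresponding column.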

Correlation matrix and visualization

cor_matrix <- cor(data %>% select(where(is.numeric)), use = "complete.obs")
print(cor_matrix)

##                          Pregnancies    Glucose BloodPressure SkinThickness
## Pregnancies               1.00000000 0.12945867    0.14128198   -0.08167177
## Glucose                   0.12945867 1.00000000    0.15258959    0.05732789
## BloodPressure             0.14128198 0.15258959    1.00000000    0.20737054
## SkinThickness            -0.08167177 0.05732789    0.20737054    1.00000000
## Insulin                  -0.07353461 0.33135711    0.08893338    0.43678257
## BMI                       0.01768309 0.22107107    0.28180529    0.39257320
## DiabetesPedigreeFunction -0.03352267 0.13733730    0.04126495    0.18392757
## Age                       0.54434123 0.26351432    0.23952795   -0.11397026
## Outcome                   0.22189815 0.46658140    0.06506836    0.07475223
##                              Insulin        BMI DiabetesPedigreeFunction
## Pregnancies              -0.07353461 0.01768309              -0.03352267
## Glucose                   0.33135711 0.22107107               0.13733730
## BloodPressure             0.08893338 0.28180529               0.04126495
## SkinThickness             0.43678257 0.39257320               0.18392757
## Insulin                   1.00000000 0.19785906               0.18507093
## BMI                       0.19785906 1.00000000               0.14064695
## DiabetesPedigreeFunction  0.18507093 0.14064695               1.00000000
## Age                      -0.04216295 0.03624187               0.03356131
## Outcome                   0.13054795 0.29269466               0.17384407
##                                  Age    Outcome
## Pregnancies               0.54434123 0.22189815
## Glucose                   0.26351432 0.46658140
## BloodPressure             0.23952795 0.06506836
## SkinThickness            -0.11397026 0.07475223
## Insulin                  -0.04216295 0.13054795
## BMI                       0.03624187 0.29269466
## DiabetesPedigreeFunction  0.03356131 0.17384407
## Age                       1.00000000 0.23835598
## Outcome                   0.23835598 1.00000000

corrplot::corrplot(cor_matrix, method = "circle")
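
To read the matrix with the classification target in mind, the Outcome column can be ranked by absolute correlation. The sketch below copies the Outcome correlations from the printed matrix (rounded to three decimals) into a named vector; outcome_cor and ranked are names introduced here for illustration.

```r
# Correlations with Outcome, copied from the matrix above (rounded to 3 d.p.)
outcome_cor <- c(
  Pregnancies = 0.222, Glucose = 0.467, BloodPressure = 0.065,
  SkinThickness = 0.075, Insulin = 0.131, BMI = 0.293,
  DiabetesPedigreeFunction = 0.174, Age = 0.238
)

# Sort by absolute strength, strongest first
ranked <- sort(abs(outcome_cor), decreasing = TRUE)
print(ranked)
```

Glucose and BMI come out on top, with BloodPressure and SkinThickness showing the weakest linear relationship with Outcome.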

Scatter plot between Glucose and BMI grouped by Outcome

ggplot(data, aes(x = Glucose, y = BMI, color = as.factor(Outcome))) +
  geom_point(alpha = 0.7) +
  labs(title = "Glucose vs BMI by Outcome", x = "Glucose", y = "BMI")

Categorize Age into groups

data <- data %>%
  mutate(AgeGroup = case_when(
    Age < 30 ~ "Under 30",
    Age >= 30 & Age < 50 ~ "30-49",
    Age >= 50 ~ "50 and above"
  ))
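
As an alternative sketch, the same three bands can be produced with base R's cut(), which returns a factor whose levels follow the order of the labels. A character column, by contrast, is sorted alphabetically when plotted or modelled. The vector ages is made up for illustration.

```r
ages <- c(21, 29, 30, 49, 50, 81)

# right = FALSE makes each interval closed on the left: [30, 50) and so on
age_group <- cut(
  ages,
  breaks = c(-Inf, 30, 50, Inf),
  labels = c("Under 30", "30-49", "50 and above"),
  right  = FALSE
)

print(data.frame(ages, age_group))
```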

Age group distribution by Outcome

data %>%
  group_by(AgeGroup, Outcome) %>%
  summarise(Count = n(), .groups = "drop") %>%
  ggplot(aes(x = AgeGroup, y = Count, fill = as.factor(Outcome))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Age Group Distribution by Outcome", x = "Age Group", y = "Count")

Classification Model

Classification modelling was executed to predict the likelihood of diabetes occurrence based on the input features. The classification models used in this project are as follows:

Logistic Regression
Decision Tree
Support Vector Machine (SVM)

library(caret)
library(rpart)
library(rpart.plot)
library(e1071)

Train Test Split

The dataset is split into an 80% training set and a 20% testing set. The split is stratified to ensure that the distribution of the target variable’s classes remains consistent between the training and testing datasets.

# Select all columns except the target variable 'Outcome'
X <- data[, setdiff(names(data), 'Outcome')]

# Select the target variable 'Outcome'
y <- data$Outcome

# Set seed for reproducibility
set.seed(123)

# Stratified train-test split
trainIndex <- createDataPartition(data$Outcome, p = 0.8, list = FALSE)

# Create training and testing sets
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
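
One way to check that a split kept the class balance is to compare class proportions in both parts. The sketch below does this in base R on a simulated binary target. The helper stratified_index is hypothetical and only illustrates the idea of sampling within each class; it is not caret's implementation.

```r
set.seed(123)

# Simulated binary target with roughly 35% positives
outcome <- rbinom(200, size = 1, prob = 0.35)

# Hypothetical helper: sample a share p of the indices within each class
stratified_index <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx  <- stratified_index(outcome)
train_prop <- prop.table(table(outcome[train_idx]))
test_prop  <- prop.table(table(outcome[-train_idx]))

print(round(rbind(train = train_prop, test = test_prop), 3))
```

The same prop.table(table(...)) comparison could be run on train$Outcome and test$Outcome from the split above.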

Factor Transformation

The target variable for classification modelling is ‘Outcome’ which is a categorical variable. By using factor transformation on our target variable, the models will treat it as a categorical variable with distinct classes.

train$Outcome <- factor(train$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))
test$Outcome <- factor(test$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))

Model 1: Logistic Regression

Logistic regression is a supervised machine learning algorithm used for binary classification, where the goal is to predict the probability of an outcome belonging to one of two classes. It is widely used for its simplicity and interpretability.

logistic_model <- glm(Outcome ~ . , data = train, family = binomial)
#Pregnancies + Glucose + BMI + DiabetesPedigreeFunction + BloodPressure

# Model summary
summary(logistic_model)

## 
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -7.0010983  1.1096993  -6.309 2.81e-10 ***
## Pregnancies               0.0844375  0.0379026   2.228   0.0259 *  
## Glucose                   0.0353493  0.0042209   8.375  < 2e-16 ***
## BloodPressure            -0.0138576  0.0058109  -2.385   0.0171 *  
## SkinThickness            -0.0005479  0.0075540  -0.073   0.9422    
## Insulin                  -0.0008211  0.0009931  -0.827   0.4084    
## BMI                       0.0858561  0.0169059   5.078 3.80e-07 ***
## DiabetesPedigreeFunction  0.7303151  0.3251952   2.246   0.0247 *  
## Age                      -0.0020538  0.0228494  -0.090   0.9284    
## AgeGroup50 and above     -0.1829781  0.5634162  -0.325   0.7454    
## AgeGroupUnder 30         -0.8586742  0.3798471  -2.261   0.0238 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 797.28  on 614  degrees of freedom
## Residual deviance: 575.91  on 604  degrees of freedom
## AIC: 597.91
## 
## Number of Fisher Scoring iterations: 5

# Predict probabilities and classes
predicted_probs <- predict(logistic_model, newdata = test, type = "response")
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)
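
Note that predicted_classes holds 0/1 values, while test$Outcome is now a factor labelled non.diabetic and diabetic. To compare the two, the predictions may need to be mapped onto the same labels. The sketch below shows one way to do this and to build a confusion matrix with base R; the probabilities and true labels are made up for illustration.

```r
# Made-up predicted probabilities and true labels, for illustration only
probs  <- c(0.10, 0.80, 0.65, 0.30, 0.55, 0.20)
actual <- factor(
  c("non.diabetic", "diabetic", "non.diabetic", "non.diabetic", "diabetic", "diabetic"),
  levels = c("non.diabetic", "diabetic")
)

# Map probabilities to the same factor levels as the target
predicted <- factor(
  ifelse(probs > 0.5, "diabetic", "non.diabetic"),
  levels = levels(actual)
)

conf_mat <- table(Predicted = predicted, Actual = actual)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
cat("Accuracy:", round(accuracy, 3), "\n")
```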

Observation:

At the 5% level, the significant predictors are Glucose and BMI (p < 0.001), followed by Pregnancies, BloodPressure, DiabetesPedigreeFunction, and the ‘Under 30’ age group (p < 0.05). Pregnancies, Glucose, BMI, and DiabetesPedigreeFunction are positively associated with diabetes likelihood, BloodPressure shows a slight negative association, and the ‘Under 30’ group has lower odds than the 30-49 reference group.
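
The estimates in the model summary are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios per one-unit increase in each predictor. The sketch below does this for a few coefficients copied from the summary; the vector name log_odds is introduced here for illustration.

```r
# Log-odds estimates copied from the model summary above
log_odds <- c(
  Pregnancies = 0.0844375,
  Glucose = 0.0353493,
  BloodPressure = -0.0138576,
  BMI = 0.0858561,
  DiabetesPedigreeFunction = 0.7303151
)

# exp() turns a log-odds coefficient into an odds ratio per one-unit increase
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 3))
```

For example, the Glucose odds ratio of about 1.04 suggests that each additional unit of glucose multiplies the odds of the diabetic class by roughly 1.04, holding the other predictors fixed. With the fitted model, exp(coef(logistic_model)) would give the full set.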

Model 2: Decision Tree

A decision tree is a flowchart-like structure used to make predictions or decisions. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.

Train a decision tree model

tree_model <- rpart(Outcome ~ ., data = train, method = "class")

To plot the decision tree

rpart.plot(tree_model)

Observation:

The model identified Glucose as the most significant predictor at the root node, with a threshold of 144 used to split the data.
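
For classification, rpart chooses splits by default using the Gini index. The sketch below computes Gini impurity before and after one candidate split on made-up data, to illustrate why a good threshold reduces impurity. The helpers gini and split_gini are hypothetical and are not rpart internals.

```r
# Gini impurity of a set of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Weighted impurity of the two children after splitting on x < threshold
split_gini <- function(x, labels, threshold) {
  left  <- labels[x < threshold]
  right <- labels[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Made-up glucose readings and outcomes
glucose <- c(90, 100, 110, 120, 150, 160, 170, 180)
outcome <- c("no", "no", "no", "yes", "yes", "yes", "yes", "no")

parent_impurity <- gini(outcome)
child_impurity  <- split_gini(glucose, outcome, threshold = 115)

cat("Parent:", parent_impurity, " After split:", round(child_impurity, 3), "\n")
```

The tree-growing procedure evaluates many such thresholds and keeps the one with the largest impurity reduction.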

Model 3: Support Vector Machine (SVM)

SVM is another supervised machine learning algorithm that can be used for classification. It seeks the optimal hyperplane that separates the data into different classes.

Train an SVM model without tuning (default settings)

svm_model <- svm(Outcome ~ ., data = train, type = "C-classification", kernel = "radial")

Predict on the test set

svm_predictions <- predict(svm_model, newdata = test)
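
The radial kernel requested above measures similarity between two observations as exp(-gamma * ||u - v||^2). The sketch below evaluates this formula directly; rbf_kernel is a hypothetical helper and the gamma value is arbitrary, not the default that e1071 would select.

```r
# Radial basis function kernel between two numeric vectors
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

a <- c(1, 2)
b <- c(1, 2)
d <- c(4, 6)

same_point <- rbf_kernel(a, b, gamma = 0.5)   # identical points
far_point  <- rbf_kernel(a, d, gamma = 0.5)   # squared distance is 25

cat("Identical:", same_point, " Distant:", signif(far_point, 3), "\n")
```

Because the kernel depends on distances, features measured on larger scales would dominate without scaling; svm() in e1071 scales the inputs by default (scale = TRUE).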

Regression Model

Regression modeling is a fundamental statistical and machine learning technique used to understand and quantify relationships between variables. Regression helps to model the relationship between a dependent variable (also known as the response or target variable) and one or more independent variables (also called predictors or features).

This project also explores regression modelling in R. Using the same dataset, ‘BMI’ is chosen as the target variable.

Why BMI as target variable?

Key Health Indicator - BMI (Body Mass Index) is a widely recognized measure of body fat, associated with health risks such as diabetes and cardiovascular conditions.

Feature Relevance - Features like Glucose, Insulin, and SkinThickness are biologically linked to BMI and influence metabolic health.

Continuous Target Variable - BMI is continuous, making it suitable for regression modeling and enabling meaningful analysis of predictor relationships.

Data Completeness - BMI has no missing (NA) entries in the dataset, which supports reliable and interpretable model results.

The regression models explored are as follows:

Linear Regression
Random Forest
Extreme Gradient Boosting (XGBoost)

# Load the package
library(randomForest)

Train Test Split

Data was split into 80% training and 20% test sets.

# Select the features and target variable
# We will predict BMI using other features
dataReg <- subset(data, select = -c(Outcome)) # Exclude the binary outcome column

# Split the data into training and testing sets
set.seed(123)  # For reproducibility
trainIndex <- createDataPartition(dataReg$BMI, p = 0.8, list = FALSE)
trainData <- dataReg[trainIndex, ]
testData <- dataReg[-trainIndex, ]

Model 1: Linear Regression

Linear regression is a simple and interpretable technique that models the relationship between a dependent variable and one or more predictors by fitting a linear equation.

Train a linear regression model

# Train a linear regression model
lm_model <- train(BMI ~ ., data = trainData, method = "lm")

View the model summaries

cat("Linear Regression Summary:\n")

## Linear Regression Summary:

summary(lm_model$finalModel)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.286  -3.764  -0.351   3.649  27.636 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              13.618783   2.831287   4.810 1.91e-06 ***
## Pregnancies              -0.097112   0.107060  -0.907 0.364723    
## Glucose                   0.049947   0.009804   5.095 4.67e-07 ***
## BloodPressure             0.085913   0.015425   5.570 3.84e-08 ***
## SkinThickness             0.156152   0.021337   7.318 7.99e-13 ***
## Insulin                  -0.002440   0.002892  -0.844 0.399149    
## DiabetesPedigreeFunction  1.403859   0.853622   1.645 0.100572    
## Age                       0.113981   0.066938   1.703 0.089120 .  
## `AgeGroup50 and above`   -5.811361   1.629075  -3.567 0.000389 ***
## `AgeGroupUnder 30`        0.076655   1.085830   0.071 0.943743    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.874 on 607 degrees of freedom
## Multiple R-squared:  0.2476, Adjusted R-squared:  0.2364 
## F-statistic: 22.19 on 9 and 607 DF,  p-value: < 2.2e-16

Make predictions on the test set

lm_pred <- predict(lm_model, newdata = testData)
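
Test-set predictions for a regression model are usually summarised with error metrics such as RMSE, MAE, and R-squared. The sketch below computes them by hand on three made-up values; actual and predicted are illustrative vectors, not results from this project.

```r
# Made-up actual and predicted BMI values
actual    <- c(30, 25, 35)
predicted <- c(28, 27, 35)

errors <- actual - predicted

rmse <- sqrt(mean(errors^2))                                  # root mean squared error
mae  <- mean(abs(errors))                                     # mean absolute error
r2   <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)    # R-squared

cat("RMSE:", round(rmse, 3), " MAE:", round(mae, 3), " R2:", round(r2, 3), "\n")
```

The same formulas could be applied to testData$BMI and lm_pred; caret's postResample() reports comparable metrics.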

Model 2: Random Forest

Random forest is an ensemble learning method that uses multiple decision trees to model complex relationships between a dependent variable and its predicting features.

Train a random forest regression model

rf_model <- train(BMI ~ ., data = trainData, method = "rf",
                  tuneGrid = expand.grid(.mtry = seq(2, ncol(trainData) - 1, by = 1)),
                  trControl = trainControl(method = "cv", number = 5))

cat("\nRandom Forest Model Parameters:\n")

## 
## Random Forest Model Parameters:

print(rf_model$bestTune)

##   mtry
## 1    2

rf_pred <- predict(rf_model, newdata = testData)
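
The tuning above relies on 5-fold cross-validation requested through trainControl(). As a rough sketch of what that resampling does, the example below runs 5-fold cross-validation by hand, using the built-in mtcars data and a plain lm() fit as stand-ins. It is not caret's internal procedure, and the model formula is an arbitrary choice for illustration.

```r
set.seed(123)
k <- 5

# Randomly assign each row of the built-in mtcars data to one of k folds
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Fit on k - 1 folds, then score on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[fold_id != i, ])
  pred <- predict(fit, newdata = mtcars[fold_id == i, ])
  sqrt(mean((mtcars$mpg[fold_id == i] - pred)^2))
})

cv_rmse <- mean(fold_rmse)

cat("Per-fold RMSE:", round(fold_rmse, 2), "\n")
cat("Cross-validated RMSE:", round(cv_rmse, 2), "\n")
```

In the tuning step, an average of this kind is computed for every candidate mtry value, and the value with the lowest error is reported in bestTune.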

Model 3: Extreme Gradient Boosting (XGBoost)

XGBoost is a powerful gradient boosting algorithm designed for efficiency and performance. This technique uses an ensemble of decision trees to optimize predictions.

Train an XGBoost regression model

xgb_model <- train(BMI ~ ., data = trainData, method = "xgbLinear",
                   tuneGrid = expand.grid(.nrounds = seq(50, 200, by = 50),
                                          .lambda = c(0, 0.1, 1),
                                          .alpha = c(0, 0.1, 1),
                                          .eta = c(0.01, 0.1, 0.3)),
                   trControl = trainControl(method = "cv", number = 5))

cat("\nXGBoost Model Best Parameters:\n")

## 
## XGBoost Model Best Parameters:

print(xgb_model$bestTune)

##    nrounds lambda alpha  eta
## 19      50      1     0 0.01

xgb_pred <- predict(xgb_model, newdata = testData)
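
The size of the search is worth keeping in mind when reading the tuning result. The sketch below rebuilds the same grid (without the leading dots in the names) and counts the combinations. The figure for total fits assumes one fit per combination per fold, which is a simplification of how caret schedules the work.

```r
# Same parameter values as the tuning grid above
grid <- expand.grid(
  nrounds = seq(50, 200, by = 50),
  lambda  = c(0, 0.1, 1),
  alpha   = c(0, 0.1, 1),
  eta     = c(0.01, 0.1, 0.3)
)

n_combinations <- nrow(grid)
n_folds <- 5
n_fits  <- n_combinations * n_folds   # assumes one fit per combination per fold

cat("Parameter combinations:", n_combinations, "\n")
cat("Fits under 5-fold CV:", n_fits, "\n")
```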

Diabetic Data Insights: Statistical Analysis and Predictive Modeling

Group 12 - Programming For Data Science (WQD7004)
