The dataset was obtained from Kaggle, an open data-sharing platform, and originates from the National Institute of Diabetes and Digestive and Kidney Diseases. It includes several diagnostic measurements as independent variables and a single target variable, ‘Outcome’, which indicates whether the patient has diabetes.
Link: https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset/data
The term ‘diabetes’ has become increasingly familiar in recent years as the prevalence of the condition continues to rise globally. Diabetes is a chronic medical condition that occurs when the body is unable to properly regulate blood sugar levels. It can lead to severe complications, including heart disease, kidney failure, and nerve damage. Early detection and intervention are therefore crucial to managing diabetes and preventing its progression.
This project aims to develop a machine learning classification model to predict the likelihood of an individual developing diabetes. Additionally, it focuses on predicting the progression of diabetes using BMI as a key feature, providing insights for early diagnosis and personalized healthcare interventions.
raw_data <- read.csv("C:\\Users\\Nasaruddin\\Group12-Project\\raw_data_diabetes.csv")
Dimension: dim()
dim(raw_data)
## [1] 769 9
Head: head()
head(raw_data)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Summary: summary()
summary(raw_data)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. :-1.000 Min. : 0.0 Length:769 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 Class :character 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Mode :character Median :23.00
## Mean : 3.839 Mean :120.9 Mean :20.55
## 3rd Qu.: 6.000 3rd Qu.:140.0 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Length:769 Min. :-30.10 Min. :0.0780 Min. :-30.00
## Class :character 1st Qu.: 27.30 1st Qu.:0.2440 1st Qu.: 24.00
## Mode :character Median : 32.00 Median :0.3710 Median : 29.00
## Mean : 31.91 Mean :0.4717 Mean : 33.15
## 3rd Qu.: 36.60 3rd Qu.:0.6260 3rd Qu.: 41.00
## Max. : 67.10 Max. :2.4200 Max. : 81.00
## Outcome
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3485
## 3rd Qu.:1.0000
## Max. :1.0000
Structure: str()
str(raw_data)
## 'data.frame': 769 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : chr "72" "66" "64" "66" ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : chr "0" "0" "0" "94" ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Load necessary libraries
library(dplyr)
Check for missing values in the raw dataset
missing_values_summary <- sapply(raw_data, function(x) sum(is.na(x))) # Count missing values per column
print(missing_values_summary)
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
Handle special cases: a "Null" string and a number written out as a word
# Convert "Null" in Insulin column to 0 (row 763)
raw_data[763, "Insulin"] <- 0
# Convert "Seventy" in BloodPressure column to 70 (row 765)
raw_data[765, "BloodPressure"] <- 70
Convert data types to ensure consistency
# Convert all values in BloodPressure column to numeric
raw_data$BloodPressure <- as.numeric(raw_data$BloodPressure)
# Convert all values in Insulin column to numeric
raw_data$Insulin <- as.numeric(raw_data$Insulin)
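The two special cases above are patched by row number. As a minimal sketch of how such entries could be located programmatically instead, the snippet below flags values that fail numeric parsing. The `blood_pressure` vector is a hypothetical toy column, not the project data.

```r
# Hypothetical toy column mimicking a numeric field that was read as character
blood_pressure <- c("72", "66", "Seventy", "64", "Null")

# as.numeric() yields NA (with a warning) for anything it cannot parse
parsed <- suppressWarnings(as.numeric(blood_pressure))

# Entries that were not NA originally but failed to parse need manual review
bad_rows <- which(is.na(parsed) & !is.na(blood_pressure))
print(data.frame(row = bad_rows, value = blood_pressure[bad_rows]))
```

Applied to a real column, the same check would be run before `as.numeric()` so that no value is silently turned into `NA`.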
Convert negative values to absolute
raw_data <- raw_data %>%
  mutate(across(where(is.numeric), ~ abs(.)))
Remove duplicate rows
raw_data <- distinct(raw_data)
Save the cleaned dataset
# write.csv() is called for its side effect and returns NULL, so its result is not assigned
write.csv(raw_data, "C:\\Users\\Nasaruddin\\Group12-Project\\diabetesprojectdata.csv", row.names = FALSE)
Updated dataset structure
data <- read.csv("C:\\Users\\Nasaruddin\\Group12-Project\\diabetesprojectdata.csv")
str(data)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Load necessary libraries
library(dplyr)
library(ggplot2)
library(corrplot)
Glimpse of the data
glimpse(data)
## Rows: 768
## Columns: 9
## $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
Summary statistics for all variables
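# Lambdas pass na.rm explicitly; passing extra arguments through across() is deprecated
data %>%
  summarise(across(everything(), list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE),
                                      min = ~ min(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE))))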
## Pregnancies_mean Pregnancies_sd Pregnancies_min Pregnancies_max Glucose_mean
## 1 3.845052 3.369578 0 17 120.8945
## Glucose_sd Glucose_min Glucose_max BloodPressure_mean BloodPressure_sd
## 1 31.97262 0 199 69.10547 19.35581
## BloodPressure_min BloodPressure_max SkinThickness_mean SkinThickness_sd
## 1 0 122 20.53646 15.95222
## SkinThickness_min SkinThickness_max Insulin_mean Insulin_sd Insulin_min
## 1 0 99 79.79948 115.244 0
## Insulin_max BMI_mean BMI_sd BMI_min BMI_max DiabetesPedigreeFunction_mean
## 1 846 31.99258 7.88416 0 67.1 0.4718763
## DiabetesPedigreeFunction_sd DiabetesPedigreeFunction_min
## 1 0.3313286 0.078
## DiabetesPedigreeFunction_max Age_mean Age_sd Age_min Age_max Outcome_mean
## 1 2.42 33.24089 11.76023 21 81 0.3489583
## Outcome_sd Outcome_min Outcome_max
## 1 0.4769514 0 1
Check for missing values
data %>%
summarise(across(everything(), ~sum(is.na(.))))
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0 0 0 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 1 0 0 0
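Although no `NA` values are reported, the summary statistics above show a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. Zeros in measurements such as Glucose or BMI are not physiologically plausible and may be placeholders for missing readings (an assumption; the source does not say so). A minimal sketch of counting them, using a hypothetical toy data frame rather than the project data:

```r
# Hypothetical toy data frame with a few zero entries
toy <- data.frame(
  Glucose       = c(148, 0, 183),
  BloodPressure = c(72, 66, 0),
  BMI           = c(33.6, 0, 0)
)

# Count zero entries per column; these are candidates for missing values
zero_counts <- sapply(toy, function(x) sum(x == 0))
print(zero_counts)
```

On the project data the same call could be restricted to the columns where zero is not a valid measurement; Pregnancies and Outcome legitimately contain zeros.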
Histogram for numerical variables
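numeric_cols <- colnames(data)[sapply(data, is.numeric)]
for (col in numeric_cols) {
  print(
    # .data[[col]] replaces the deprecated aes_string()
    ggplot(data, aes(x = .data[[col]])) +
      # bins = 30 adapts to each variable's range; a fixed binwidth of 10 would
      # collapse small-range variables such as DiabetesPedigreeFunction into one bar
      geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
      labs(title = paste("Distribution of", col), x = col, y = "Frequency")
  )
}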
Correlation matrix and visualization
cor_matrix <- cor(data %>% select(where(is.numeric)), use = "complete.obs")
print(cor_matrix)
## Pregnancies Glucose BloodPressure SkinThickness
## Pregnancies 1.00000000 0.12945867 0.14128198 -0.08167177
## Glucose 0.12945867 1.00000000 0.15258959 0.05732789
## BloodPressure 0.14128198 0.15258959 1.00000000 0.20737054
## SkinThickness -0.08167177 0.05732789 0.20737054 1.00000000
## Insulin -0.07353461 0.33135711 0.08893338 0.43678257
## BMI 0.01768309 0.22107107 0.28180529 0.39257320
## DiabetesPedigreeFunction -0.03352267 0.13733730 0.04126495 0.18392757
## Age 0.54434123 0.26351432 0.23952795 -0.11397026
## Outcome 0.22189815 0.46658140 0.06506836 0.07475223
## Insulin BMI DiabetesPedigreeFunction
## Pregnancies -0.07353461 0.01768309 -0.03352267
## Glucose 0.33135711 0.22107107 0.13733730
## BloodPressure 0.08893338 0.28180529 0.04126495
## SkinThickness 0.43678257 0.39257320 0.18392757
## Insulin 1.00000000 0.19785906 0.18507093
## BMI 0.19785906 1.00000000 0.14064695
## DiabetesPedigreeFunction 0.18507093 0.14064695 1.00000000
## Age -0.04216295 0.03624187 0.03356131
## Outcome 0.13054795 0.29269466 0.17384407
## Age Outcome
## Pregnancies 0.54434123 0.22189815
## Glucose 0.26351432 0.46658140
## BloodPressure 0.23952795 0.06506836
## SkinThickness -0.11397026 0.07475223
## Insulin -0.04216295 0.13054795
## BMI 0.03624187 0.29269466
## DiabetesPedigreeFunction 0.03356131 0.17384407
## Age 1.00000000 0.23835598
## Outcome 0.23835598 1.00000000
corrplot::corrplot(cor_matrix, method = "circle")
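To read the matrix more easily, the correlations with Outcome can be ranked. The sketch below hard-codes the Outcome column printed above so that it runs on its own; with the real object this vector would presumably come from `cor_matrix[, "Outcome"]`.

```r
# Correlations with Outcome, copied from the printed correlation matrix
cor_with_outcome <- c(
  Pregnancies              = 0.22189815,
  Glucose                  = 0.46658140,
  BloodPressure            = 0.06506836,
  SkinThickness            = 0.07475223,
  Insulin                  = 0.13054795,
  BMI                      = 0.29269466,
  DiabetesPedigreeFunction = 0.17384407,
  Age                      = 0.23835598
)

# Rank features from strongest to weakest linear association with Outcome
ranked <- sort(cor_with_outcome, decreasing = TRUE)
print(round(ranked, 3))
```

Glucose, BMI, and Age come out on top, although none of the correlations exceeds 0.5, so no single feature separates the classes on its own.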
Scatter plot between Glucose and BMI grouped by Outcome
ggplot(data, aes(x = Glucose, y = BMI, color = as.factor(Outcome))) +
  geom_point(alpha = 0.7) +
  labs(title = "Glucose vs BMI by Outcome", x = "Glucose", y = "BMI")
Categorize Age into groups
data <- data %>%
  mutate(AgeGroup = case_when(
    Age < 30 ~ "Under 30",
    Age >= 30 & Age < 50 ~ "30-49",
    Age >= 50 ~ "50 and above"
  ))
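The same binning can be sketched in base R with `cut()`, which returns a factor whose levels follow the age order rather than alphabetical order. The `ages` vector below is a hypothetical example, and this is an alternative illustration, not the method used in this project.

```r
# Hypothetical ages covering each boundary of the three groups
ages <- c(21, 29, 30, 49, 50, 81)

# right = FALSE makes intervals closed on the left: [-Inf, 30), [30, 50), [50, Inf)
age_group <- cut(
  ages,
  breaks = c(-Inf, 30, 50, Inf),
  right  = FALSE,
  labels = c("Under 30", "30-49", "50 and above")
)
print(data.frame(ages, age_group))
```

An ordered set of levels keeps the groups in a sensible order on plot axes and makes the reference level in later models explicit.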
Age group distribution by Outcome
data %>%
  group_by(AgeGroup, Outcome) %>%
  summarise(Count = n(), .groups = "drop") %>%  # drop grouping to avoid the regrouping message
  ggplot(aes(x = AgeGroup, y = Count, fill = as.factor(Outcome))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Age Group Distribution by Outcome", x = "Age Group", y = "Count")
Classification modelling was carried out to predict the likelihood of diabetes based on the input features. The classification models used in this project are logistic regression, decision tree, and support vector machine (SVM).
library(caret)
library(rpart)
library(rpart.plot)
library(e1071)
The dataset is split into an 80% training set and a 20% testing set. The train-test split is stratified to ensure that the distribution of the target variable’s classes remains consistent between the training and testing datasets.
# Select all columns except the target variable 'Outcome'
X <- data[, setdiff(names(data), 'Outcome')]
# Select the target variable 'Outcome'
y <- data$Outcome
# Set seed for reproducibility
set.seed(123)
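# Stratified train-test split (createDataPartition() stratifies by class for a factor
# outcome; Outcome is still numeric at this point, so factor(data$Outcome) may be the safer input)
trainIndex <- createDataPartition(data$Outcome, p = 0.8, list = FALSE)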
# Create training and testing sets
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
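To illustrate what stratification is meant to achieve, here is a minimal base-R sketch that samples 80% within each class and compares the class proportions of the two sets. The class counts (500 and 268) are derived from the Outcome mean of about 0.349 over 768 rows reported earlier; the `outcome` vector is synthetic, and this is not caret's implementation.

```r
set.seed(123)

# Synthetic outcome vector with the same class balance as the cleaned data
outcome <- rep(c(0, 1), times = c(500, 268))

# Sample 80% of the row indices separately within each class
train_idx <- unlist(
  lapply(split(seq_along(outcome), outcome),
         function(idx) sample(idx, size = round(0.8 * length(idx)))),
  use.names = FALSE
)

# Share of positive cases should be nearly identical in both sets
train_rate <- mean(outcome[train_idx])
test_rate  <- mean(outcome[-train_idx])
print(c(train = train_rate, test = test_rate))
```

A similar comparison on the actual `train` and `test` sets, for example with `prop.table(table(train$Outcome))`, would confirm whether the split preserved the class balance.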
The target variable for classification modelling is ‘Outcome’ which is a categorical variable. By using factor transformation on our target variable, the models will treat it as a categorical variable with distinct classes.
train$Outcome <- factor(train$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))
test$Outcome <- factor(test$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))
Logistic regression is a supervised machine learning algorithm used for binary classification, where the goal is to predict the probability of an outcome belonging to one of two classes. It is widely used for its simplicity and interpretability.
logistic_model <- glm(Outcome ~ . , data = train, family = binomial)
# Reduced predictor set (not used here): Pregnancies + Glucose + BMI + DiabetesPedigreeFunction + BloodPressure
# Model summary
summary(logistic_model)
##
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.0010983 1.1096993 -6.309 2.81e-10 ***
## Pregnancies 0.0844375 0.0379026 2.228 0.0259 *
## Glucose 0.0353493 0.0042209 8.375 < 2e-16 ***
## BloodPressure -0.0138576 0.0058109 -2.385 0.0171 *
## SkinThickness -0.0005479 0.0075540 -0.073 0.9422
## Insulin -0.0008211 0.0009931 -0.827 0.4084
## BMI 0.0858561 0.0169059 5.078 3.80e-07 ***
## DiabetesPedigreeFunction 0.7303151 0.3251952 2.246 0.0247 *
## Age -0.0020538 0.0228494 -0.090 0.9284
## AgeGroup50 and above -0.1829781 0.5634162 -0.325 0.7454
## AgeGroupUnder 30 -0.8586742 0.3798471 -2.261 0.0238 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 797.28 on 614 degrees of freedom
## Residual deviance: 575.91 on 604 degrees of freedom
## AIC: 597.91
##
## Number of Fisher Scoring iterations: 5
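# Predict probabilities and classes
predicted_probs <- predict(logistic_model, newdata = test, type = "response")
# Map probabilities to the same factor labels as test$Outcome so the two can be compared
predicted_classes <- factor(ifelse(predicted_probs > 0.5, "diabetic", "non.diabetic"),
                            levels = levels(test$Outcome))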
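Predicted classes are usually compared with the true labels through a confusion matrix. The following is a minimal base-R sketch on hypothetical values (`actual` and `probs` are made up for illustration and are not results from this project):

```r
# Hypothetical true labels and predicted probabilities for five patients
actual <- factor(
  c("diabetic", "non.diabetic", "diabetic", "non.diabetic", "non.diabetic"),
  levels = c("non.diabetic", "diabetic")
)
probs <- c(0.81, 0.20, 0.43, 0.65, 0.10)

# Apply a 0.5 threshold and keep the same factor levels as the true labels
predicted <- factor(ifelse(probs > 0.5, "diabetic", "non.diabetic"),
                    levels = levels(actual))

# Rows are predictions, columns are true classes
conf_mat <- table(Predicted = predicted, Actual = actual)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

Because the classes are imbalanced (roughly 35% diabetic), sensitivity and specificity read from the same table are often more informative than accuracy alone.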
Observation:
Glucose and BMI are the strongest predictors (p < 0.001), while Pregnancies, BloodPressure, DiabetesPedigreeFunction, and the under-30 age group are significant at the 5% level. The coefficients indicate a positive association of Pregnancies, Glucose, BMI, and DiabetesPedigreeFunction with diabetes likelihood, and a slight negative association for BloodPressure.
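Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. A short sketch using three estimates copied from the model summary above (with the fitted object, `exp(coef(logistic_model))` would give the full set):

```r
# Estimates copied from the logistic regression summary (log-odds scale)
coefs <- c(Glucose = 0.0353493, BMI = 0.0858561, BloodPressure = -0.0138576)

# exp() converts log-odds coefficients to odds ratios per one-unit increase
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

An odds ratio above 1 means the odds of the diabetic class rise as the feature increases with the other features held fixed; a value below 1 means they fall.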
Decision Tree is a flowchart-like structure used to make predictions or decisions. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.
Train a decision tree model
tree_model <- rpart(Outcome ~ ., data = train, method = "class")
To plot the decision tree
rpart.plot(tree_model)
Observation:
The model identified Glucose as the most significant predictor at the root node, with a threshold of 144 used to split the data.
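For classification trees, `rpart` by default chooses splits that reduce Gini impurity. The sketch below shows how a single candidate threshold is scored, using the Glucose and Outcome values of the first six rows printed by `head()`; the threshold of 126.5 is a hypothetical choice for illustration, not the one found by the fitted tree.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Glucose and Outcome from the first six rows of the dataset
glucose <- c(148, 85, 183, 89, 137, 116)
outcome <- c(1, 0, 1, 0, 1, 0)

# Hypothetical candidate threshold for the split
threshold <- 126.5
left  <- outcome[glucose <  threshold]
right <- outcome[glucose >= threshold]

# Impurity before the split and size-weighted impurity after it
gini_before <- gini(outcome)
gini_after  <- (length(left) * gini(left) + length(right) * gini(right)) / length(outcome)
print(c(before = gini_before, after = gini_after))
```

The tree-growing procedure evaluates many such thresholds across all features and keeps the one with the largest impurity reduction at each node.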
Support Vector Machine (SVM) is also a supervised machine learning algorithm that can be used for classification. In classification, SVM seeks the optimal hyperplane that separates the data into different categories.
Train an SVM model without tuning (default settings)
svm_model <- svm(Outcome ~ ., data = train, type = "C-classification", kernel = "radial")
Predict on the test set
svm_predictions <- predict(svm_model, newdata = test)
Regression modeling is a fundamental statistical and machine learning technique used to understand and quantify relationships between variables. Regression helps to model the relationship between a dependent variable (also known as the response or target variable) and one or more independent variables (also called predictors or features).
This project aims to explore regression modelling using R. To carry out regression modelling on the same dataset, ‘BMI’ is chosen as the target variable.
Why BMI as target variable?
Key Health Indicator - BMI (Body Mass Index) is a widely recognized measure of body fat, associated with health risks such as diabetes and cardiovascular conditions.
Feature Relevance - Features like Glucose, Insulin, and SkinThickness are biologically linked to BMI and influence metabolic health.
Continuous Target Variable - BMI is continuous, making it suitable for regression modeling and enabling meaningful analysis of predictor relationships.
Data Completeness - BMI has no missing (NA) values in the dataset, which supports reliable and interpretable model results.
The regression models explored are:
1. Linear Regression
2. Random Forest
3. Extreme Gradient Boosting (XGBoost)
# Load the package
library(randomForest)
Data was split into 80% training and 20% test sets.
# Select the features and target variable
# We will predict BMI using other features
dataReg <- subset(data, select = -c(Outcome)) # Exclude the binary outcome column
# Split the data into training and testing sets
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(dataReg$BMI, p = 0.8, list = FALSE)
trainData <- dataReg[trainIndex, ]
testData <- dataReg[-trainIndex, ]
Linear regression is a simple and interpretable technique that models the relationship between a dependent variable and one or more predictors by fitting a linear equation.
Train a linear regression model
# Train a linear regression model
lm_model <- train(BMI ~ ., data = trainData, method = "lm")
View the model summaries
cat("Linear Regression Summary:\n")
## Linear Regression Summary:
summary(lm_model$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.286 -3.764 -0.351 3.649 27.636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.618783 2.831287 4.810 1.91e-06 ***
## Pregnancies -0.097112 0.107060 -0.907 0.364723
## Glucose 0.049947 0.009804 5.095 4.67e-07 ***
## BloodPressure 0.085913 0.015425 5.570 3.84e-08 ***
## SkinThickness 0.156152 0.021337 7.318 7.99e-13 ***
## Insulin -0.002440 0.002892 -0.844 0.399149
## DiabetesPedigreeFunction 1.403859 0.853622 1.645 0.100572
## Age 0.113981 0.066938 1.703 0.089120 .
## `AgeGroup50 and above` -5.811361 1.629075 -3.567 0.000389 ***
## `AgeGroupUnder 30` 0.076655 1.085830 0.071 0.943743
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.874 on 607 degrees of freedom
## Multiple R-squared: 0.2476, Adjusted R-squared: 0.2364
## F-statistic: 22.19 on 9 and 607 DF, p-value: < 2.2e-16
Make predictions on the test set
lm_pred <- predict(lm_model, newdata = testData)
Random forest is an ensemble learning method that uses multiple decision trees to model complex relationships between a dependent variable and its predicting features.
Train a random forest regression model
rf_model <- train(BMI ~ ., data = trainData, method = "rf",
                  tuneGrid = expand.grid(.mtry = seq(2, ncol(trainData) - 1, by = 1)),
                  trControl = trainControl(method = "cv", number = 5))
cat("\nRandom Forest Model Parameters:\n")
##
## Random Forest Model Parameters:
print(rf_model$bestTune)
## mtry
## 1 2
rf_pred <- predict(rf_model, newdata = testData)
XGBoost is a powerful gradient boosting algorithm designed for efficiency and performance. This technique uses an ensemble of decision trees to optimize predictions.
Train an XGBoost regression model
xgb_model <- train(BMI ~ ., data = trainData, method = "xgbLinear",
                   tuneGrid = expand.grid(.nrounds = seq(50, 200, by = 50),
                                          .lambda = c(0, 0.1, 1),
                                          .alpha = c(0, 0.1, 1),
                                          .eta = c(0.01, 0.1, 0.3)),
                   trControl = trainControl(method = "cv", number = 5))
cat("\nXGBoost Model Best Parameters:\n")
##
## XGBoost Model Best Parameters:
print(xgb_model$bestTune)
## nrounds lambda alpha eta
## 19 50 1 0 0.01
xgb_pred <- predict(xgb_model, newdata = testData)
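The three sets of predictions can be compared on the test set with standard regression metrics. Below is a minimal base-R sketch of RMSE, MAE, and R-squared; the helper functions and the `actual` and `predicted` vectors are hypothetical illustrations, not results from the models above.

```r
# Root mean squared error: penalises large errors more heavily
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: average size of the errors
mae <- function(actual, predicted) mean(abs(actual - predicted))

# R-squared: share of variance in the target explained by the predictions
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Hypothetical BMI values and predictions
actual    <- c(33.6, 26.6, 23.3, 28.1)
predicted <- c(31.6, 28.6, 25.3, 26.1)

metrics <- c(RMSE = rmse(actual, predicted),
             MAE  = mae(actual, predicted),
             R2   = r_squared(actual, predicted))
print(round(metrics, 3))
```

With the project objects, each helper would be called as, for example, `rmse(testData$BMI, lm_pred)`, and likewise for `rf_pred` and `xgb_pred`, so that the three models are judged on the same held-out rows.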