Overview: Using a Kaggle dataset to visualize trends among pregnancies, BMI, and diabetes, and training a logistic regression model to predict whether a patient has diabetes.
First we’ll load all necessary libraries and look at the structure of the data set, identifying the information we have to work with.
library(readr)
library(caTools)
library(caret)
library(e1071)
library(tidyverse)
library(ggplot2)
library(reshape2)
data <- read.csv("diabetes.csv")
head(data)
Next we’ll inspect the data by summarizing it and checking for missing values. Although there are no NAs in the dataset, some features such as Insulin and SkinThickness contain zeros that likely represent missing or unrecorded measurements; a sketch of how those zeros could be flagged follows the missing-value check below.
summary(data)
Pregnancies Glucose BloodPressure SkinThickness
Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
Insulin BMI DiabetesPedigreeFunction
Min. : 0.0 Min. : 0.00 Min. :0.0780
1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437
Median : 30.5 Median :32.00 Median :0.3725
Mean : 79.8 Mean :31.99 Mean :0.4719
3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262
Max. :846.0 Max. :67.10 Max. :2.4200
Age Outcome
Min. :21.00 Min. :0.000
1st Qu.:24.00 1st Qu.:0.000
Median :29.00 Median :0.000
Mean :33.24 Mean :0.349
3rd Qu.:41.00 3rd Qu.:1.000
Max. :81.00 Max. :1.000
data %>% is.na() %>% colSums()
Pregnancies Glucose
0 0
BloodPressure SkinThickness
0 0
Insulin BMI
0 0
DiabetesPedigreeFunction Age
0 0
Outcome
0
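As a quick check, we can count those suspicious zeros directly and, if we wanted to impute them later, recode them as NA. This is an optional sketch that the rest of the analysis does not use; zero_cols and data_clean are illustrative names.
# Columns where a literal zero is physiologically implausible
zero_cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")
# Count the zeros in each of those columns
colSums(data[zero_cols] == 0)
# Hypothetical recode to NA (not applied below), e.g. ahead of imputation
data_clean <- data
data_clean[zero_cols][data_clean[zero_cols] == 0] <- NA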
Let’s do some visualization to help identify which variables may be most predictive of diabetes. Strong positive correlations with the outcome variable include Glucose and BMI.
cor_matrix <- cor(data)
cor_melted <- melt(cor_matrix)
ggplot(cor_melted, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       limit = c(-1, 1), name = "Correlation") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Heatmap", x = "Features", y = "Features")
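As a numeric companion to the heatmap (an added check, not part of the original workflow), we can sort each feature’s correlation with the outcome:
# Correlations with Outcome, strongest positive first
sort(cor_matrix[, "Outcome"], decreasing = TRUE)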
We can also visualize the distribution of the outcome variable, which is clearly imbalanced. The imbalance can affect model performance and will be addressed later with synthetic oversampling (via the ROSE package). Outcome = 0 -> the patient does not have diabetes; Outcome = 1 -> the patient does have diabetes.
outcome_count <- table(data$Outcome)
outcome_df <- data.frame(Outcome = names(outcome_count),
                         Count = as.numeric(outcome_count))
ggplot(outcome_df, aes(x = Outcome, y = Count)) +
  geom_bar(stat = "identity", fill = "lightpink") +
  ggtitle("Distribution of Diabetes Outcomes")
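For a numeric view of the imbalance, a quick added check gives the class proportions, roughly 65% non-diabetic to 35% diabetic:
# Proportion of each outcome class
prop.table(table(data$Outcome))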
This histogram shows the distribution of the number of pregnancies among patients, separated by diabetes outcome. We can see that patients with diabetes tend to have a higher number of pregnancies. Also notice the distribution is positively skewed. This indicates that most patients have fewer pregnancies, while a small number have significantly more.
ggplot(data, aes(x = Pregnancies, fill = factor(Outcome))) +
  geom_histogram(bins = 30, col = "black") +
  facet_wrap(~Outcome, scales = "free_y") +
  ggtitle("Distribution of Pregnancies by Outcomes")
The box plot shows the distribution of BMI values for diabetic and non-diabetic patients. Patients with diabetes tend to have higher BMI scores, which is expected given known risk factors.
ggplot(data, aes(x = factor(Outcome), y = BMI, fill = factor(Outcome))) +
  geom_boxplot() +
  ylab("BMI") +
  ggtitle("BMI Distribution by Outcome")
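To put a number on the gap the box plot shows, here is a small added summary of median BMI per group:
# Median BMI for non-diabetic (0) vs. diabetic (1) patients
aggregate(BMI ~ Outcome, data = data, FUN = median)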
Let’s get ready to build and train our prediction model. First we’ll split the features into X and the outcome into Y, then form the final dataset by scaling the features with z-score standardization (mean 0, standard deviation 1). We’ll use logistic regression to predict whether a patient has diabetes.
X <- data[, 1:8]
Y <- data[, 9]
scaled_X <- as.data.frame(scale(X))
scaled_data <- cbind(scaled_X, Y)
X <- scaled_data[, 1:8]
Y <- scaled_data[, 9]
set.seed(123)
# sample.split flags a stratified 70% of the observations in Y as TRUE
sample <- sample.split(Y, SplitRatio = 0.7)
X_train <- X[sample == TRUE, ]
Y_train <- Y[sample == TRUE]
X_test <- X[sample == FALSE, ]
Y_test <- Y[sample == FALSE]
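Two optional sanity checks, using only the objects defined above: the scaled features should have mean ~0 and standard deviation ~1, and the split should be roughly 70/30.
# Each scaled feature should be centered near 0 with unit variance
round(colMeans(X), 10)
apply(X, 2, sd)
# Train/test sizes should be close to a 70/30 split
length(Y_train)
length(Y_test)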
Next we’ll train a logistic regression model and then evaluate its performance on the held-out test set using a confusion matrix.
log_model <- glm(Y_train ~ ., data=X_train, family=binomial)
summary(log_model)
Call:
glm(formula = Y_train ~ ., family = binomial, data = X_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8521 0.1185 -7.192 6.38e-13 ***
Pregnancies 0.3455 0.1354 2.551 0.0107 *
Glucose 1.4154 0.1592 8.892 < 2e-16 ***
BloodPressure -0.2875 0.1364 -2.108 0.0350 *
SkinThickness 0.1272 0.1404 0.906 0.3649
Insulin -0.3864 0.1447 -2.669 0.0076 **
BMI 0.7028 0.1496 4.699 2.61e-06 ***
DiabetesPedigreeFunction 0.2759 0.1279 2.158 0.0309 *
Age 0.1921 0.1392 1.380 0.1675
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 696.28 on 537 degrees of freedom
Residual deviance: 477.81 on 529 degrees of freedom
AIC: 495.81
Number of Fisher Scoring iterations: 5
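One optional way to read these coefficients: because the features are z-scored, exponentiating a coefficient gives the multiplicative change in the odds of diabetes per one-standard-deviation increase in that feature.
# Odds ratios per one-standard-deviation increase in each feature
exp(coef(log_model))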
predictions <- predict(log_model, newdata = X_test, type="response")
predictions <- factor(ifelse(predictions > 0.5, 1, 0),
                      levels = levels(as.factor(Y_test)))
confusionMatrix(predictions, as.factor(Y_test))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 127 37
1 23 43
Accuracy : 0.7391
95% CI : (0.6773, 0.7946)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 0.002949
Kappa : 0.4005
Mcnemar's Test P-Value : 0.093290
Sensitivity : 0.8467
Specificity : 0.5375
Pos Pred Value : 0.7744
Neg Pred Value : 0.6515
Prevalence : 0.6522
Detection Rate : 0.5522
Detection Prevalence : 0.7130
Balanced Accuracy : 0.6921
'Positive' Class : 0
The following predictions were made:
- 23 patients incorrectly predicted to have diabetes (false positives)
- 37 patients incorrectly predicted not to have diabetes (false negatives)
- 43 patients correctly predicted to have diabetes
- 127 patients correctly predicted not to have diabetes
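One caveat: caret treated class 0 (no diabetes) as the “positive” class, so the reported sensitivity of 0.85 describes how well the model detects non-diabetics. If detecting diabetics matters more, confusionMatrix accepts a positive argument:
# Report sensitivity/specificity with the diabetic class (1) as positive
confusionMatrix(predictions, as.factor(Y_test), positive = "1")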
m <- confusionMatrix(predictions, as.factor(Y_test))
prediction_results <- as.table(m)
matrix_df <- as.data.frame(prediction_results)
ggplot(matrix_df, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "black", size = 6) +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Confusion Matrix Heatmap", x = "Actual", y = "Predicted")
To address the class imbalance, we’ll generate synthetic examples with the ROSE package (Random Over-Sampling Examples), a SMOTE-style oversampling technique. Balancing the classes this way helps the model learn more balanced decision boundaries.
library(ROSE)
set.seed(199)
smote_data <- ROSE(Outcome ~ ., data = data, N = 1500, p = 0.5)$data
outcome_count_smote <- table(smote_data$Outcome)
outcome_df_smote <- data.frame(Outcome = names(outcome_count_smote),
                               Count = as.numeric(outcome_count_smote))
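A quick added check confirms the rebalanced proportions are near the 50/50 requested with p = 0.5:
# Class proportions after oversampling; should be close to 0.5 / 0.5
prop.table(table(smote_data$Outcome))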
Now the distribution of outcomes is significantly more balanced.
library(gridExtra)
p1 <- ggplot(outcome_df, aes(x = Outcome, y = Count)) +
  geom_bar(stat = "identity", fill = "lightpink") +
  ggtitle("Distribution of Diabetes Outcomes")
p2 <- ggplot(outcome_df_smote, aes(x = Outcome, y = Count)) +
  geom_bar(stat = "identity", fill = "lightpink") +
  ggtitle("Distribution of Diabetes Outcomes (SMOTE data)")
grid.arrange(p1, p2)
Also observe that the distribution of pregnancies across diabetes outcomes becomes more symmetric and closer to a normal distribution. This suggests the synthetic oversampling created a more representative dataset for model training.
p1 <- ggplot(data, aes(x = Pregnancies, fill = factor(Outcome))) +
  geom_histogram(bins = 30, col = "black") +
  facet_wrap(~Outcome, scales = "free_y") +
  ggtitle("Distribution of Pregnancies by Outcomes")
p2 <- ggplot(smote_data, aes(x = Pregnancies, fill = factor(Outcome))) +
  geom_histogram(bins = 30, col = "black") +
  facet_wrap(~Outcome, scales = "free_y") +
  ggtitle("Distribution of Pregnancies by Outcomes (SMOTE data)")
grid.arrange(p1, p2)
Next we repeat the scaling, splitting, and training steps on the rebalanced data and compare the resulting performance metrics.
X_primed <- smote_data[, 1:8]
Y_primed <- smote_data[, 9]
scaled_X_primed <- as.data.frame(scale(X_primed))
scaled_data_smote <- cbind(scaled_X_primed, Y_primed)
X_primed <- scaled_data_smote[, 1:8]
Y_primed <- scaled_data_smote[, 9]
set.seed(123)
sample <- sample.split(Y_primed, SplitRatio = 0.7)
X_train <- X_primed[sample == TRUE, ]
Y_train <- Y_primed[sample == TRUE]
X_test <- X_primed[sample == FALSE, ]
Y_test <- Y_primed[sample == FALSE]
log_model <- glm(Y_train ~ ., data=X_train, family=binomial)
summary(log_model)
Call:
glm(formula = Y_train ~ ., family = binomial, data = X_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.01366 0.07221 -0.189 0.850001
Pregnancies 0.36455 0.08104 4.498 6.85e-06 ***
Glucose 0.94790 0.08806 10.764 < 2e-16 ***
BloodPressure -0.15323 0.07888 -1.943 0.052069 .
SkinThickness -0.14594 0.08203 -1.779 0.075243 .
Insulin -0.06331 0.07993 -0.792 0.428351
BMI 0.57902 0.08438 6.862 6.77e-12 ***
DiabetesPedigreeFunction 0.32377 0.07714 4.197 2.70e-05 ***
Age 0.29627 0.08122 3.648 0.000264 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1455.4 on 1049 degrees of freedom
Residual deviance: 1146.4 on 1041 degrees of freedom
AIC: 1164.4
Number of Fisher Scoring iterations: 4
predictions <- predict(log_model, newdata = X_test, type="response")
predictions <- factor(ifelse(predictions > 0.5, 1, 0),
                      levels = levels(as.factor(Y_test)))
m <- confusionMatrix(predictions, as.factor(Y_test))
m
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 178 53
1 51 168
Accuracy : 0.7689
95% CI : (0.7271, 0.8071)
No Information Rate : 0.5089
P-Value [Acc > NIR] : <2e-16
Kappa : 0.5376
Mcnemar's Test P-Value : 0.9219
Sensitivity : 0.7773
Specificity : 0.7602
Pos Pred Value : 0.7706
Neg Pred Value : 0.7671
Prevalence : 0.5089
Detection Rate : 0.3956
Detection Prevalence : 0.5133
Balanced Accuracy : 0.7687
'Positive' Class : 0
The model’s accuracy improved from about 74% to 77%, and balanced accuracy rose from 0.69 to 0.77. This highlights the impact of preprocessing techniques like scaling and oversampling, though the two models are evaluated on different test sets, so the comparison is indicative rather than exact. While logistic regression is a simple model, it provides a strong baseline for future experimentation with more complex algorithms; a sketch of one possible next step follows the final results below.
prediction_results <- as.table(m)
matrix_df <- as.data.frame(prediction_results)
ggplot(matrix_df, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "black", size = 6) +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Confusion Matrix Heatmap", x = "Actual", y = "Predicted")
The final prediction results are as follows:
- 51 patients incorrectly predicted to have diabetes (false positives)
- 53 patients incorrectly predicted not to have diabetes (false negatives)
- 168 patients correctly predicted to have diabetes
- 178 patients correctly predicted not to have diabetes
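As a sketch of that future experimentation (illustrative only, and assuming the randomForest package is installed), we could compare the logistic baseline against a cross-validated random forest trained on the same rebalanced, scaled data:
# Illustrative next step: 5-fold cross-validated random forest via caret
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
rf_model <- train(x = X_train,
                  y = factor(Y_train, labels = c("No", "Yes")),
                  method = "rf", trControl = ctrl)
rf_model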
Dataset:
https://www.kaggle.com/datasets/mathchi/diabetes-data-set/data