Introduction

Diabetes is one of the most frequent diseases worldwide, and the number of diabetic patients is growing steadily over the years. Although the exact cause of diabetes remains unknown, scientists believe that both genetic factors and environmental lifestyle play a major role in its development.

This analysis specifically focuses on women, as they are particularly susceptible to certain diabetes risk factors, including pregnancy-related factors. Pregnancies, especially those resulting in gestational diabetes, can increase the likelihood of developing type 2 diabetes later in life.

Individuals with diabetes face an increased risk of developing secondary health issues such as heart disease and nerve damage. Early detection and treatment of diabetes are crucial in preventing complications and reducing the risk of severe health problems. While diabetes is incurable, it can be effectively managed with appropriate treatment and medication.

In this analysis, we will explore the distribution and correlation of various features related to diabetes, including factors such as pregnancies, glucose levels, BMI, age, and family history. We will build predictive models to assess diabetes risk and discuss the importance of these key factors identified through our analysis. Additionally, readers will be guided on how to use a prediction app designed to assist in evaluating diabetes risk based on input data.

For Further application and Analysis, you can use our Diabetes Prediction App to assess your risk: Click here to use the Diabetes Prediction App

Data Import and Overview

# Load the diabetes dataset
diabetes <- read_csv("diabetes.csv")
## Rows: 768 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, P...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the dataset
head(diabetes)
## # A tibble: 6 × 9
##   Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI Pedigree   Age
##         <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>    <dbl> <dbl>
## 1           6     148            72            35       0  33.6    0.627    50
## 2           1      85            66            29       0  26.6    0.351    31
## 3           8     183            64             0       0  23.3    0.672    32
## 4           1      89            66            23      94  28.1    0.167    21
## 5           0     137            40            35     168  43.1    2.29     33
## 6           5     116            74             0       0  25.6    0.201    30
## # ℹ 1 more variable: Class <dbl>
# Display the last few rows of the dataset
tail(diabetes)
## # A tibble: 6 × 9
##   Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI Pedigree   Age
##         <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>    <dbl> <dbl>
## 1           9      89            62             0       0  22.5    0.142    33
## 2          10     101            76            48     180  32.9    0.171    63
## 3           2     122            70            27       0  36.8    0.34     27
## 4           5     121            72            23     112  26.2    0.245    30
## 5           1     126            60             0       0  30.1    0.349    47
## 6           1      93            70            31       0  30.4    0.315    23
## # ℹ 1 more variable: Class <dbl>
# Provide a summary of the dataset
summary(diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI           Pedigree           Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##      Class      
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
# Check for missing values
sum(is.null(diabetes))
## [1] 0
# Get a quick overview of the dataset structure
glimpse(diabetes)
## Rows: 768
## Columns: 9
## $ Pregnancies   <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7,…
## $ Glucose       <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168,…
## $ BloodPressure <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 7…
## $ SkinThickness <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, …
## $ Insulin       <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0…
## $ BMI           <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.…
## $ Pedigree      <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, …
## $ Age           <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, …
## $ Class         <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,…

Data Visualization

# Define a function to plot histogram and boxplot
histogram_boxplot <- function(diabetes, feature, figsize = c(12, 7), kde = FALSE, bins = NULL) {
  ggplot(diabetes, aes(x = !!sym(feature))) +  # Use !!sym() for tidy evaluation
    geom_boxplot(fill = "pink", outlier.shape = 21, outlier.color = "red") +
    geom_histogram(binwidth = 1, fill = "skyblue", color="black") +
    geom_vline(xintercept = mean(diabetes[[feature]]), color = "green", linetype = "dashed") +
    geom_vline(xintercept = median(diabetes[[feature]]), color = "black", linetype = "solid") +
    labs(title = paste("Distribution of", feature), x = feature, y = "Count") +
    theme_classic()
}

# Plot for Pregnancies
histogram_boxplot(diabetes, "Pregnancies")

The distribution is heavily skewed to the right, indicating that there are a large number of individuals with fewer pregnancies and a smaller number with higher pregnancies. The peak of the distribution is around the value 0-1, suggesting that a significant portion of the individuals have no or one pregnancy. There are a few outliers present on the higher pregnancy side, indicating that there might be some individuals with unusually high numbers of pregnancies compared to the majority of the sample.

# Plot for Glucose
histogram_boxplot(diabetes, "Glucose")

The distribution of plasma glucose concentration looks like a bell-shaped curve, i.e., fairly normal. The boxplot shows that 0 value is an outlier for this variable, but a 0 value of Glucose concentration is not possible. We should treat the 0 values as missing data. From the boxplot, we can see that the third quartile (Q3) is equal to 140, which means 75% of women have less than 140 units of plasma glucose concentration.

# Plot for Blood Pressure
histogram_boxplot(diabetes, "BloodPressure")

There are a few observations with 0 blood pressure, which is biologically impossible. These values should be treated as missing data. The distribution is slightly skewed to the right, indicating a longer tail on the higher blood pressure side. The peak of the distribution is around the 80-90 mmHg range, suggesting that this is the most common blood pressure range in the sample. The blood pressure values range from approximately 0 to 120 mmHg, with a majority of the data points falling within the 40-100 mmHg range.

# Plot for Skin Thickness
histogram_boxplot(diabetes, "SkinThickness")

The distribution is heavily skewed to the right, indicating that there are a large number of individuals with low skin thickness and a smaller number with high skin thickness. There is one extreme value of 99 in this variable, indicating a potential outlier or data entry error. While there are many values with 0 skin thickness, this is likely an error as a value of 0 is not biologically plausible. These values should be treated as missing data.

# Plot for Insulin
histogram_boxplot(diabetes, "Insulin")

From the boxplot, we can see that the third quartile (Q3) is equal to 127 mu U/ml, which means 75% of women have less than 127 mu U/ml of insulin concentration and an average of 80 mu U/ml. The distribution is heavily skewed to the right, indicating that there are a large number of individuals with low insulin levels and a smaller number with high insulin levels. The peak of the distribution is around the value 0. A Zero value in insulin is not possible. We should treat the 0 values as missing data. There are a few outliers present on the higher insulin side, indicating that there might be some individuals with unusually high insulin levels compared to the majority.

# Plot for BMI
histogram_boxplot(diabetes, "BMI")

The distribution is slightly skewed to the right, indicating a longer tail on the higher BMI side. The peak of the distribution is around the 30-35 BMI range, suggesting that this is the most common BMI category in the sample. There are a few outliers present, particularly on the lower BMI side. A 0 value in mass is not possible, so we should treat these values as missing data. This could be due to various factors such as data entry errors or individuals with extremely low weight.

# Plot for Pedigree
histogram_boxplot(diabetes, "Pedigree")

From the boxplot, we can see that the third quartile (Q3) is equal to 0.62, which means 75% of women have less than 0.62 diabetes pedigree function value and an average of 0.47. The distribution is skewed to the right, particularly for the higher pedigree values. There are fewer individuals with higher pedigree values compared to those with lower values.

There are a few outliers present on the higher pedigree side, suggesting that there might be some individuals with unusually high pedigree values compared to the majority of the sample.

# Plot for Age
histogram_boxplot(diabetes, "Age")

The distribution is skewed to the right, indicating that there are more individuals in the younger age groups compared to the older age groups. The peak of the distribution appears to be around the age of 25-30, suggesting that this age range has the highest concentration of individuals in the dataset.

There are a few outliers present on the higher age side, indicating that there are a small number of individuals who are significantly older than the majority of the sample.

### Bar Plot of Diabetes Distribution
ggplot(diabetes, aes(x = Class)) +
  geom_bar(aes(fill = factor(Class)), color = "black") +
  labs(title = "Distribution of Diabetes and Non-Diabetes Patients",
       y = "Count") +
  scale_fill_manual(values = c("skyblue", "orange"),
                    labels = c("Non-Diabetic", "Diabetic")) +
  theme_minimal() +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
  theme(legend.title = element_blank())
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The “Non-Diabetic” class has a significantly higher count (500) compared to the “Diabetic” class (268). This indicates that a larger proportion of the individuals in the dataset are non-diabetic. The distribution exhibits a class imbalance, with a disproportionately larger number of non-diabetic cases. This is evident from the fact that there are only approximately 35% of the women in the data who are diabetic, while 65% are non-diabetic.

# Distribution of pregnancy
ggplot(diabetes, aes(x = Pregnancies)) +
  geom_bar(aes(fill = as.factor(Pregnancies))) +
  labs(
    title = "Distribution of Pregnancies",
    x = "Pregnancies",
    y = "Count"
  ) +
  scale_fill_hue(h = c(0, 360)) +  # Adjust hue range for better color differentiation
  scale_y_continuous(breaks = seq(0, max(table(diabetes$Pregnancies)), by = 20)) +  # Adjust y-axis breaks
  theme_minimal() +
  geom_text(
    stat = "count",
    aes(label = paste0(round(..count.. / sum(..count..) * 100, 1), "%")),
    vjust = -0.5
  ) +
  theme(legend.title = element_blank())

The peak of the distribution is around the value 0-1, suggesting that a significant portion of the individuals have no or one pregnancy. There are a few outliers present on the higher pregnancy side, indicating that there might be some individuals with unusually high numbers of pregnancies compared to the majority of the sample

Bivariate Analysis

diabetes$Class <- as.factor(diabetes$Class)
ggpairs(diabetes, 
        aes(color = Class, alpha = 0.5),  # Color by Class
        upper = list(continuous = "points"),  # Scatterplot in upper triangle
        lower = list(continuous = "smooth"),  # Smoothed line in lower triangle
        diag = list(continuous = "densityDiag"))  # Density plot on the diagonal
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

While most individuals without diabetes exhibit glucose levels below 100 and a BMI under 30, there is a degree of overlap between these groups. This suggests that these factors alone may not be sufficient to definitively differentiate between diabetic and non-diabetic individuals.

# Compute the correlation matrix
numeric_data <- diabetes %>% select(where(is.numeric))
corr <- round(cor(numeric_data, use = "complete.obs"), 2)

# Plot the correlation matrix
ggcorrplot(corr, 
           method = "square", 
           type = "full", 
           lab = TRUE, 
           lab_size = 3, 
           colors = c("red", "white", "blue"),
           outline.color = "white",
           title = "Correlation Matrix",
           ggtheme = ggplot2::theme_minimal())

Age and Pregnancies (0.54): There’s a moderate positive correlation between “Age” and “Pregnancies,” indicating that older women tend to have more pregnancies. Glucose and Class (0.47): Higher glucose levels are linked to a higher likelihood of diabetes.

BMI and SkinThickness (0.39): Higher BMI is associated with greater skin thickness.

BMI and Class (0.29): Higher BMI may increase diabetes risk. Insulin and Glucose (0.33): Insulin levels moderately correlate with glucose. Age and Class (0.24): Age has a slight positive influence on diabetes risk.

# Define the function to plot a boxplot
boxplot_function <- function(diabetes, x) {
  # Ensure that 'Class' is a factor
  diabetes$Class <- as.factor(diabetes$Class)
  
  ggplot(diabetes, aes(x = Class, y = .data[[x]], fill = Class)) +
    geom_boxplot() +
    scale_fill_brewer(palette = "Set3") +  # Using the Set3 palette for different colors
    theme_minimal() +
    labs(x = "Non-Diabetic And Diabetic", y = x) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Boxplot for Pregnancies
boxplot_function(diabetes, "Pregnancies")

There appears to be a positive correlation between the number of pregnancies and the prevalence of diabetes. Women with higher pregnancy counts tend to have a higher likelihood of developing diabetes.

# Boxplot for Glucose
boxplot_function(diabetes, "Glucose")

Plasma glucose levels are significantly elevated in women with diabetes compared to those without. This is a key diagnostic criterion for diabetes.

# Boxplot for BloodPressure
boxplot_function(diabetes, "BloodPressure")

While there’s some variability, the overall distribution of blood pressure levels between diabetic and non-diabetic women is relatively similar. This suggests that blood pressure may not be a strong predictor of diabetes risk in this dataset.

# Boxplot for SkinThickness
boxplot_function(diabetes, "SkinThickness")

Skin thickness appears to be generally comparable between diabetic and non-diabetic women. However, there’s a notable outlier with exceptionally high skin thickness in the diabetic group, which might warrant further investigation.

# Boxplot for Insulin
boxplot_function(diabetes, "Insulin")

Insulin levels are significantly higher in women with diabetes, indicating a potential underlying insulin resistance or deficiency.

# Boxplot for BMI
boxplot_function(diabetes, "BMI")

Women with diabetes tend to have higher BMI values, suggesting a link between obesity and diabetes risk.

# Boxplot for Age
boxplot_function(diabetes, "Age")

While diabetes is more prevalent in middle-aged to older women, there are some exceptions. The presence of outliers in the non-diabetic group indicates that age alone may not be a definitive predictor of diabetes risk.

# Boxplot for Pedigree
boxplot_function(diabetes, "Pedigree")

A higher diabetes pedigree function (DPF) value is associated with a greater risk of developing diabetes. This suggests a genetic predisposition to the disease in women with higher DPF values.

DATA PREPROCESSING

# Missing Values Treatment
diabetes <- diabetes %>%
  mutate(
    Glucose = ifelse(Glucose == 0, median(Glucose[Glucose != 0], na.rm = TRUE), Glucose),
    BloodPressure = ifelse(BloodPressure == 0, median(BloodPressure[BloodPressure != 0], na.rm = TRUE), BloodPressure),
    SkinThickness = ifelse(SkinThickness == 0, median(SkinThickness[SkinThickness != 0], na.rm = TRUE), SkinThickness),
    Insulin = ifelse(Insulin == 0, median(Insulin[Insulin != 0], na.rm = TRUE), Insulin),
    BMI = ifelse(BMI == 0, median(BMI[BMI != 0], na.rm = TRUE), BMI)
  )

Zero values were replaced by the median of the respective variable.

Data Preparation for Modeling

# Feature and Target Separation
features <- diabetes %>% select(-Class)  # Dropping the 'Class' column to get features
target <- diabetes$Class

# Preprocess features
preprocessParams <- preProcess(features, method = c("center", "scale"))
features_normalized <- predict(preprocessParams, features)

# Split data into training and test sets
set.seed(1)  # Set a seed for reproducibility
trainIndex <- createDataPartition(target, p = 0.7, list = FALSE)

# Training set
X_train <- features_normalized[trainIndex, ]
y_train <- target[trainIndex]

# Test set
X_test <- features_normalized[-trainIndex, ]
y_test <- target[-trainIndex]

# Print the dimensions of training and test sets
cat("Training set dimensions:", dim(X_train), "\n")
## Training set dimensions: 538 8
cat("Test set dimensions:", dim(X_test), "\n")
## Test set dimensions: 230 8
# Checking proportion in Target variable
prop.table(table(y_test))
## y_test
##         0         1 
## 0.6521739 0.3478261
prop.table(table(target))
## target
##         0         1 
## 0.6510417 0.3489583
# Combine training and test data with the target variable
trainData <- data.frame(X_train, Class = y_train)
testData <- data.frame(X_test, Class = y_test)

DECISION TREE MODDEL

d_tree <- rpart(Class ~ ., data = data.frame(X_train, Class = y_train), method = "class")

# Save the trained model to an RDS file
saveRDS(d_tree, file = "decision_tree_model.rds")
# Load the model from the RDS file
d_tree_loaded <- readRDS("decision_tree_model.rds")

evaluate_model <- function(model, predictors, target) {
  # Predict using the model
  pred <- predict(model, predictors, type = "class")
  
  # Convert to factor and match levels between pred and target
  pred <- factor(pred, levels = union(levels(pred), levels(target)))
  target <- factor(target, levels = union(levels(pred), levels(target)))
  
  # Filter out levels that are not in both pred and target
  valid_levels <- intersect(levels(pred), levels(target))
  pred <- factor(pred[pred %in% valid_levels], levels = valid_levels)
  target <- factor(target[target %in% valid_levels], levels = valid_levels)
  
  # Create confusion matrix
  confusion_matrix <- confusionMatrix(pred, target)
  
  # Extract metrics
  metrics <- data.frame(
    Accuracy = confusion_matrix$overall['Accuracy'],
    Recall = confusion_matrix$byClass['Sensitivity'],
    Precision = confusion_matrix$byClass['Pos Pred Value'],
    F1 = confusion_matrix$byClass['F1']
  )
  
  return(metrics)
}

# Evaluate model on training set
train_metrics <- evaluate_model(d_tree_loaded, X_train, y_train)
cat("Training Performance:\n")
## Training Performance:
print(train_metrics)
##          Accuracy    Recall Precision   F1
## Accuracy  0.83829 0.9114286 0.8506667 0.88
# Evaluate model on test set
test_metrics <- evaluate_model(d_tree_loaded, X_test, y_test)
cat("Testing Performance:\n")
## Testing Performance:
print(test_metrics)
##           Accuracy Recall Precision        F1
## Accuracy 0.7086957   0.82 0.7546012 0.7859425

Both the training and test accuracies are above 70%, indicating that the model is making correct predictions a significant portion of the time. The recall for the test set is 82%, which is generally considered good. This means that the model is able to correctly identify a large proportion of actual positive cases (people with diabetes).

# Function to plot confusion matrix 
plot_confusion_matrix <- function(model, predictors, target, set_name) {
  # Predict using the model
  pred <- predict(model, predictors, type = "class")
  
  # Convert to factors and ensure same levels between pred and target
  pred <- factor(pred, levels = union(levels(pred), levels(target)))
  target <- factor(target, levels = union(levels(pred), levels(target)))
  
  # Filter out levels that are not in both pred and target
  valid_levels <- intersect(levels(pred), levels(target))
  pred <- factor(pred[pred %in% valid_levels], levels = valid_levels)
  target <- factor(target[target %in% valid_levels], levels = valid_levels)
  
  # Create confusion matrix
  confusion_matrix <- confusionMatrix(pred, target)$table
  percentages <- prop.table(confusion_matrix) * 100
  
  labels <- sprintf("%d\n(%.1f%%)", confusion_matrix, percentages)
  cm_df <- as.data.frame(as.table(confusion_matrix))
  cm_df$labels <- as.vector(labels)
  
  # Plot confusion matrix
  ggplot(cm_df, aes(x = Prediction, y = Reference, fill = Freq)) +
    geom_tile(color = "white") +
    geom_text(aes(label = labels), color = "black", size = 5) +
    scale_fill_gradient(low = "white", high = "steelblue") +
    labs(title = paste("Confusion Matrix -", set_name), x = "Predicted Class", y = "True Class", fill = "Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Evaluate and plot model performance on test set
plot_confusion_matrix(d_tree_loaded, X_test, y_test, "Test Set")

The confusion matrix shows that the model can identify the majority of patients who are at risk of diabetes.

# Plot the decision tree
rpart.plot(d_tree_loaded, type = 3, extra = 101, under = TRUE, fallen.leaves = TRUE)

FEATURE IMPORTANCE

# Extract feature importance
importance <- d_tree_loaded$variable.importance

# Convert to a data frame for ggplot
importance_df <- data.frame(
  Feature = names(importance),
  Importance = importance
)

# Sort the data frame by importance
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]

# Create a horizontal bar plot of feature importances
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "violet") +
  coord_flip() +
  labs(title = "Feature Importance For Diabetes Prediction",
       x = "Feature",
       y = "Relative Importance") +
  theme_minimal()

The feature importance plot highlights the prominence of “Glucose” as the most critical factor in determining diabetes risk, with “Age,” “BMI,” and “Pedigree” also playing significant roles.

CONCLUSION .

RECOMMENDATIONS

Potential Applications and Future Directions