Libraries used

# Importing the 'modeest' package for estimating statistical modes or working with statistical distributions
library(modeest)

# Importing the 'ggplot2' package for creating visually appealing and customizable plots
library(ggplot2)

# Importing the 'lattice' package for creating conditioned plots
library(lattice)

# Importing the 'caret' package for machine learning tasks such as data preprocessing, model training, and performance evaluation
library(caret)

# Importing the 'mlbench' package for benchmark datasets commonly used in machine learning research
library(mlbench)

# Importing the 'dlookr' package for data exploration and outlier detection
library(dlookr)
## Registered S3 methods overwritten by 'dlookr':
##   method          from  
##   plot.transform  scales
##   print.transform scales
## 
## Attaching package: 'dlookr'
## The following object is masked from 'package:modeest':
## 
##     skewness
## The following object is masked from 'package:base':
## 
##     transform
# Load the dplyr library, which provides a grammar of data manipulation.
# dplyr offers a collection of functions like filter(), mutate(), select(), and group_by()
# for efficiently transforming, summarizing, and working with data frames.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the randomForest library, which implements the random forest algorithm.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin

Loading the dataset

# Reading the csv data file "diabetes.csv" into the 'diabetes' dataframe
diabetes <- read.csv("diabetes.csv", stringsAsFactors = TRUE, na.strings = c("","NA"))

Data pre-processing

Generating a summary

# Generating a summary of the 'diabetes' dataframe
summary(diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

Pregnancies: The mean number of pregnancies is 3.845, with a median of 3. This suggests that the distribution is slightly positively skewed, as the mean is slightly higher than the median. The minimum value of 0 indicates that some individuals in the dataset have never been pregnant, which is a possible scenario. However, it’s worth noting that the maximum value of 17 pregnancies is relatively high and may be considered an outlier. While it is technically possible for someone to have a high number of pregnancies, it is relatively uncommon and not representative of the majority of the population. The presence of this outlier suggests that there may be one or a few individuals with exceptionally high numbers of pregnancies in the dataset.

Glucose: The mean glucose level is 120.9, with a median of 117. The minimum value of 0 seems unrealistic for glucose levels and may indicate missing or invalid data. It is important to investigate and address the presence of such values in the dataset. The maximum value of 199 indicates a relatively wide range of glucose levels among the individuals included.

BloodPressure: The mean blood pressure is 69.11, with a median of 72. The minimum value of 0 is also concerning, as it is unrealistic for blood pressure readings and may indicate missing or invalid data. Similarly to the glucose variable, further investigation and data validation are necessary to address this issue. The maximum value of 122 suggests a range of blood pressure measurements in the dataset.

SkinThickness: The mean skin thickness is 20.54, with a median of 23. The minimum value of 0 is once again concerning, as it seems unrealistic for skin thickness measurements and may indicate missing or invalid data. It is crucial to investigate and address the presence of such values in the dataset. The maximum value of 99 indicates a wide range of skin thickness measurements among the individuals included.

Insulin: The mean insulin level is 79.8, with a median of 30.5. Similar to the previous variables, the presence of a minimum value of 0 is concerning, as it could indicate missing or invalid data. It is important to carefully investigate and handle these zero values appropriately. The maximum value of 846 suggests a wide range of insulin levels in the dataset.

BMI: The mean BMI (Body Mass Index) is 31.99, with a median of 32. The minimum value of 0 is once again concerning, as it is unrealistic for BMI measurements and may indicate missing or invalid data. It is important to investigate and address the presence of such values in the dataset. The maximum value of 67.1 indicates a wide range of BMI values among the individuals included.

DiabetesPedigreeFunction: The mean diabetes pedigree function is 0.4719, with a median of 0.3725. This function provides information about the genetic influence of diabetes based on family history. The values range from 0.0780 to 2.4200.

Age: The mean age is 33.24, with a median of 29. The minimum age is 21, and the maximum age is 81. These age statistics provide an overview of the age distribution within the dataset. The range of ages suggests that the dataset includes individuals spanning a wide age range.

Outcome: The outcome variable represents whether an individual has diabetes or not, with 0 indicating no diabetes and 1 indicating diabetes. The mean value is 0.349, indicating that approximately 34.9% of the individuals in the dataset have diabetes. Because the variable was read in as a number, it should be converted to a factor before modelling.

Change the target variable to factor

# Converting the "Outcome" variable to factor
diabetes$Outcome <- as.factor(diabetes$Outcome)

Displaying the structure

# Displaying the structure of the 'diabetes' dataframe
str(diabetes)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 2 ...

The structure of the dataset is confirmed: 768 observations of 9 variables, with Outcome now stored as a two-level factor.

Missing values analysis

# Calculating the column-wise sum of missing values in the 'diabetes' dataframe
colSums(is.na(diabetes))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

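No values are coded as NA. The zeros noted in the summary above are not detected by is.na(), so they are dealt with separately in the outlier handling step.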
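
Since an NA check cannot see placeholder zeros, counting zeros per column is a useful complement. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative assumptions, not rows from diabetes.csv):

```r
# Toy data frame standing in for 'diabetes' (values are illustrative only)
toy <- data.frame(
  Glucose = c(148, 0, 183, 89),
  BMI     = c(33.6, 26.6, 0, 0),
  Age     = c(50, 31, 32, 21)
)

# is.na() finds nothing, because zeros are ordinary numbers
na_counts <- colSums(is.na(toy))

# Counting exact zeros exposes the placeholder values
zero_counts <- colSums(toy == 0)
zero_counts
```

The same `colSums(x == 0)` pattern should apply to the numeric columns of the real data frame.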

Duplicates removal and variable name improvement

# Removing duplicate rows
diabetes <- unique(diabetes)

# Renaming the 'DiabetesPedigreeFunction' column to 'DiabetesPF'.
diabetes <- rename(diabetes, DiabetesPF = DiabetesPedigreeFunction)

No duplicate records were found.

Outliers handling

# Selecting numerical variables from diabetes data frame
numerical_vars <- diabetes[, sapply(diabetes, is.numeric)]

# Plot outliers in numerical variables
plot_outlier(numerical_vars)

# Identify nature of outliers using boxplot statistics
boxplot.stats(diabetes$Pregnancies)$out
## [1] 15 17 14 14
boxplot.stats(diabetes$Glucose)$out
## [1] 0 0 0 0 0
boxplot.stats(diabetes$BloodPressure)$out
##  [1]   0   0  30 110   0   0   0   0 108 122  30   0 110   0   0   0   0   0   0
## [20]   0   0   0   0 108   0   0   0   0   0   0   0   0   0   0 110   0  24   0
## [39]   0   0   0 114   0   0   0
boxplot.stats(diabetes$SkinThickness)$out
## [1] 99
boxplot.stats(diabetes$Insulin)$out
##  [1] 543 846 342 495 325 485 495 478 744 370 680 402 375 545 360 325 465 325 415
## [20] 579 474 328 480 326 330 600 321 440 540 480 335 387 392 510
boxplot.stats(diabetes$BMI)$out
##  [1]  0.0  0.0  0.0  0.0 53.2 55.0  0.0 67.1 52.3 52.3 52.9  0.0  0.0 59.4  0.0
## [16]  0.0 57.3  0.0  0.0
boxplot.stats(diabetes$DiabetesPF)$out 
##  [1] 2.288 1.441 1.390 1.893 1.781 1.222 1.400 1.321 1.224 2.329 1.318 1.213
## [13] 1.353 1.224 1.391 1.476 2.137 1.731 1.268 1.600 2.420 1.251 1.699 1.258
## [25] 1.282 1.698 1.461 1.292 1.394
boxplot.stats(diabetes$Age)$out
## [1] 69 67 72 81 67 67 70 68 69

For the variable “Pregnancies,” the outliers identified are 15, 17, 14, and 14. These values indicate that there are individuals in the dataset with exceptionally high numbers of pregnancies compared to the rest of the data.

For the variable “Glucose,” the outliers identified are all 0. These values indicate missing or invalid data points rather than true outliers.

Based on the observed outliers in the Blood Pressure variable, it is important to note that these values may not necessarily indicate wrong or erroneous data. Instead, they could be indicative of abnormal health conditions related to blood pressure. Outliers above the expected range may signify hypertension, while outliers below the range (excluding the zeros) may suggest low blood pressure or hypotension.

Outliers in the Skin Thickness variable, such as a value of 99, require further investigation. These extreme values may be indicative of unusual or uncommon measurements within the dataset and may warrant consideration in the analysis.

In the Insulin variable, outliers that are significantly higher than the rest of the data suggest individuals with exceptionally high insulin levels. These outliers may be related to specific medical conditions or other factors affecting insulin production or metabolism.

The outliers in the BMI variable fall into two groups. The zero values are invalid; they should be treated as missing values and imputed using appropriate techniques to ensure the integrity of the analysis. The high values (52.3 to 67.1) are physiologically possible and more likely reflect severe obesity than data errors.

The Diabetes Pedigree Function variable’s outliers may represent extreme values in the genetic diabetes score. These outliers may have implications for understanding the hereditary component of diabetes risk.

In the Age variable, all outliers lie at the upper end (ages 67 to 81). These are plausible ages rather than errors; they reflect the small number of older individuals in a dataset whose median age is 29.
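
The values flagged by boxplot.stats() follow Tukey's rule: a point is an outlier if it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. The sketch below reproduces that rule by hand on a short made-up vector (`x` is an illustrative assumption, not data from the file) and compares the result with boxplot.stats():

```r
# Illustrative values only
x <- c(21, 24, 25, 29, 33, 41, 45, 81)

# Tukey's hinges are the 2nd and 4th entries of fivenum()
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)

# Fences sit 1.5 times the hinge spread beyond each hinge
lower <- hinges[1] - 1.5 * spread
upper <- hinges[2] + 1.5 * spread

# Points outside the fences are flagged
manual_out <- x[x < lower | x > upper]
manual_out

# The built-in function should flag the same points
boxplot.stats(x)$out
```

Because the rule is purely distributional, it flags placeholder zeros and genuine extreme measurements alike, which is why each variable still has to be judged on domain grounds.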

Under normal health conditions, glucose level, blood pressure, BMI, and age cannot be zero in a living person. Zero values in such variables, seen in both the summary statistics and the outlier analysis, therefore most likely represent missing measurements rather than valid data points. Median imputation is used to resolve them: each zero in the affected numerical variables is replaced with that variable’s median. The median is a robust choice because, unlike the mean, it is barely affected by extreme values, so the imputed value stays close to the centre of the distribution.

# Replacing zero values in selected variables with the variable's median
# (each median is computed over all values of the column, zeros included)
# Caution: 0 is a plausible pregnancy count, so this line overwrites genuine values
diabetes$Pregnancies[which(diabetes$Pregnancies==0)] <- median(diabetes$Pregnancies)
diabetes$Glucose[which(diabetes$Glucose==0)] <- median(diabetes$Glucose)
diabetes$BloodPressure[which(diabetes$BloodPressure==0)] <- median(diabetes$BloodPressure)
diabetes$SkinThickness[which(diabetes$SkinThickness==0)] <- median(diabetes$SkinThickness)
diabetes$Insulin[which(diabetes$Insulin==0)] <- median(diabetes$Insulin)
# Zero BMI values are replaced in the BMI column itself
diabetes$BMI[which(diabetes$BMI==0)] <- median(diabetes$BMI)
diabetes$DiabetesPF[which(diabetes$DiabetesPF==0)] <- median(diabetes$DiabetesPF)
diabetes$Age[which(diabetes$Age==0)] <- median(diabetes$Age)
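
The assignments above take each median over all values of the column, zeros included, which pulls the imputed value down when zeros are common (the Insulin median in the summary is 30.5, and its first quartile is 0). One possible variant, sketched here on made-up data, first marks zeros as NA and then imputes the median of the observed values only. The helper `impute_zero_with_median()`, the `toy` data frame, and the choice of columns are hypothetical and not part of the original analysis:

```r
# Hypothetical helper: treat zeros as missing, then fill with the
# median of the remaining (non-zero) values
impute_zero_with_median <- function(x) {
  x[x == 0] <- NA
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data (illustrative values only)
toy <- data.frame(
  Insulin = c(0, 0, 94, 168, 88),
  BMI     = c(33.6, 0, 23.3, 28.1, 43.1)
)

# Hypothetical choice of columns in which zero is not a valid measurement
cols <- c("Insulin", "BMI")
toy[cols] <- lapply(toy[cols], impute_zero_with_median)
toy
```

In this toy example the Insulin zeros are filled with 94, the median of the three observed values, whereas a median taken over all five values would have been 88.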

Data analysis

Correlation analysis

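# Calculating correlation matrix and rounding to 2 decimal places
# (numerical_vars was extracted before the zero values were replaced,
# so these are correlations of the raw values)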
round(cor(numerical_vars), 2)
##               Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## Pregnancies          1.00    0.13          0.14         -0.08   -0.07 0.02
## Glucose              0.13    1.00          0.15          0.06    0.33 0.22
## BloodPressure        0.14    0.15          1.00          0.21    0.09 0.28
## SkinThickness       -0.08    0.06          0.21          1.00    0.44 0.39
## Insulin             -0.07    0.33          0.09          0.44    1.00 0.20
## BMI                  0.02    0.22          0.28          0.39    0.20 1.00
## DiabetesPF          -0.03    0.14          0.04          0.18    0.19 0.14
## Age                  0.54    0.26          0.24         -0.11   -0.04 0.04
##               DiabetesPF   Age
## Pregnancies        -0.03  0.54
## Glucose             0.14  0.26
## BloodPressure       0.04  0.24
## SkinThickness       0.18 -0.11
## Insulin             0.19 -0.04
## BMI                 0.14  0.04
## DiabetesPF          1.00  0.03
## Age                 0.03  1.00

The only pairwise correlation above 0.5 is between ‘Pregnancies’ and ‘Age’, at 0.54. This is a moderate positive correlation with an intuitive explanation: older individuals have had more time to have more pregnancies.

All remaining correlations among the independent variables are below 0.5, the largest being 0.44 between ‘SkinThickness’ and ‘Insulin’. Such low pairwise correlations suggest that multicollinearity is unlikely to be a serious problem, although pairwise correlations alone cannot rule it out.
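
Pairwise correlations can miss collinearity that involves three or more variables at once. A common complementary check is the variance inflation factor (VIF), which for predictor j equals 1 / (1 - R²), where R² comes from regressing predictor j on all the other predictors. The sketch below computes it in base R on simulated data; the variables `x1`, `x2`, `x3` are assumptions made up for illustration and have nothing to do with the diabetes data:

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.3)  # nearly a linear combination of x1 and x2
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the others
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  fit <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})

round(cor(X), 2)  # pairwise correlations
round(vif, 1)     # variance inflation factors
```

Rules of thumb differ between sources, but values above roughly 5 to 10 are often treated as a sign of problematic collinearity.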

Logistic regression analysis

# Attaching the 'diabetes' data frame so that its columns can be referenced by name
attach(diabetes)

# Building the logistic regression model
logitA <- glm(Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI + DiabetesPF + Age, family = binomial)

# Summarizing the logistic regression model
summary(logitA)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     SkinThickness + Insulin + BMI + DiabetesPF + Age, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6141  -0.7124  -0.4082   0.7050   2.4744  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -8.8279654  0.7994998 -11.042  < 2e-16 ***
## Pregnancies    0.1426196  0.0359056   3.972 7.12e-05 ***
## Glucose        0.0382817  0.0038628   9.910  < 2e-16 ***
## BloodPressure -0.0090341  0.0085999  -1.050  0.29350    
## SkinThickness  0.0042342  0.0121734   0.348  0.72797    
## Insulin       -0.0015045  0.0009196  -1.636  0.10182    
## BMI            0.0798234  0.0165767   4.815 1.47e-06 ***
## DiabetesPF     0.9445078  0.3021420   3.126  0.00177 ** 
## Age            0.0113604  0.0093738   1.212  0.22554    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 714.80  on 759  degrees of freedom
## AIC: 732.8
## 
## Number of Fisher Scoring iterations: 5

In this analysis, “Pregnancies,” “Glucose,” “BMI,” and “DiabetesPF” have coefficients with statistically significant p-values (p < 0.05), indicating that they are significantly associated with the log-odds of the outcome (dependent variable).

Among these variables, “Glucose” and “BMI” have the largest z-values (9.91 and 4.82), meaning their effects are the most precisely established relative to their standard errors. For each unit increase in “Glucose,” the log-odds of the outcome increase by 0.0382817, holding the other variables constant. Similarly, for each unit increase in “BMI,” the log-odds of the outcome increase by 0.0798234.

“Pregnancies” (z = 3.97) and “DiabetesPF” (z = 3.13) are also statistically significant, although with smaller z-values than “Glucose” and “BMI.” “DiabetesPF” has the largest raw coefficient (0.94), but raw coefficients are not directly comparable because the variables are measured on different scales. The remaining variables, “SkinThickness,” “BloodPressure,” “Insulin,” and “Age,” do not have statistically significant coefficients (p > 0.05), so the data provide no clear evidence that they are associated with the outcome once the other predictors are accounted for.

Odds ratio analysis

# Calculating the odds ratios by exponentiating the coefficients
exp(coef(logitA))
##   (Intercept)   Pregnancies       Glucose BloodPressure SkinThickness 
##  0.0001465762  1.1532909827  1.0390238523  0.9910066097  1.0042431701 
##       Insulin           BMI    DiabetesPF           Age 
##  0.9984966772  1.0830957581  2.5715473012  1.0114251402

Here are the interpretations of the odds ratios, each holding the other variables constant:

  • (Intercept): The odds of the dependent variable when all independent variables are zero, 0.0001465762. Because several predictors cannot realistically be zero, this value has little practical meaning.

  • Pregnancies: For each additional pregnancy, the odds of the dependent variable are multiplied by approximately 1.15 (a 15% increase).

  • Glucose: For each unit increase in the glucose level, the odds of the dependent variable are multiplied by approximately 1.04 (a 4% increase).

  • BloodPressure: For each unit increase in the blood pressure, the odds of the dependent variable are multiplied by approximately 0.99 (a decrease of about 1%).

  • SkinThickness: For each unit increase in the skin thickness, the odds of the dependent variable are multiplied by approximately 1.00 (an increase of less than 0.5%).

  • Insulin: For each unit increase in the insulin level, the odds of the dependent variable are multiplied by approximately 1.00 (a decrease of about 0.15%).

  • BMI: For each unit increase in the BMI (body mass index), the odds of the dependent variable are multiplied by approximately 1.08 (an 8% increase).

  • DiabetesPF: For each unit increase in the diabetes pedigree function, the odds of the dependent variable are multiplied by approximately 2.57.

  • Age: For each additional year of age, the odds of the dependent variable are multiplied by approximately 1.01 (an increase of about 1%).

Odds ratios further from 1 indicate a larger change in the odds per unit of the predictor. “DiabetesPF” has the largest odds ratio (2.57), but a one-unit increase covers a large part of its observed range (0.078 to 2.42), so this figure overstates its influence relative to variables measured in finer units. “Pregnancies” (1.15) and “BMI” (1.08) also have odds ratios clearly above 1. “Glucose” has an odds ratio of only 1.04 per unit, yet glucose differs by tens of units between individuals (interquartile range 99 to 140.2), so its cumulative effect is substantial. “BloodPressure,” “SkinThickness,” “Insulin,” and “Age” have odds ratios very close to 1 and are not statistically significant, suggesting they have little to no influence on the odds of the dependent variable in this model.
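
Because a one-unit change means very different things for different predictors, it can help to restate the odds ratios for increments that are meaningful on each variable's own scale. The odds ratio for an increase of d units is exp(d × coefficient). In the sketch below the coefficients are copied from the model summary above, while the increments in `delta` are hypothetical choices made purely for illustration:

```r
# Log-odds coefficients copied from the summary(logitA) output
beta <- c(Glucose = 0.0382817, BMI = 0.0798234, DiabetesPF = 0.9445078)

# Hypothetical increments, chosen for illustration only
delta <- c(Glucose = 10, BMI = 5, DiabetesPF = 0.1)

# Odds ratio for an increase of 'delta' units: exp(beta * delta)
or_scaled <- exp(beta * delta)
round(or_scaled, 2)
```

On these increments, a 10-unit rise in glucose corresponds to a larger multiplicative change in the odds than a 0.1-unit rise in the diabetes pedigree function, even though the per-unit odds ratio of the latter is much larger.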

In conclusion, the logistic regression analysis indicates that four factors are significantly associated with higher odds of diabetes: a greater number of pregnancies, higher glucose levels, higher BMI, and a higher diabetes pedigree function score, which reflects family history of diabetes.

Model training, testing, and evaluation

# Set the random seed for reproducibility
set.seed(1994)

# Determine the number of rows in the 'diabetes' data frame
n_rows <- nrow(diabetes)

# Create a random sample of indices for splitting the data into training and testing sets (70% for training)
idx <- sample(n_rows, n_rows*0.7)

# Create the training and testing data sets using the sampled indices
trainData <- diabetes[idx,]
testData <- diabetes[-idx,]

# Define the formula for the random forest model using the relevant features
formula <- Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI + DiabetesPF + Age

# Set up cross-validation parameters for the model (10 folds repeated 3 times, 30 resamples in total)
ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 3, savePredictions = "final")

# Train the random forest model using cross-validation
model <- train(formula, data = trainData, method='rf', trControl=ctrl)

# Display the cross-validation model
model
## Random Forest 
## 
## 537 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 484, 483, 483, 484, 483, 483, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.7672374  0.4763779
##   5     0.7591195  0.4617980
##   8     0.7578966  0.4604780
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Make predictions 
predictions <- predict(model, testData[,-9], type = 'raw')

# Computing the confusion matrix
co_matrix <- confusionMatrix(predictions, testData$Outcome)
co_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 122  26
##          1  30  53
##                                          
##                Accuracy : 0.7576         
##                  95% CI : (0.697, 0.8114)
##     No Information Rate : 0.658          
##     P-Value [Acc > NIR] : 0.000686       
##                                          
##                   Kappa : 0.4678         
##                                          
##  Mcnemar's Test P-Value : 0.688500       
##                                          
##             Sensitivity : 0.8026         
##             Specificity : 0.6709         
##          Pos Pred Value : 0.8243         
##          Neg Pred Value : 0.6386         
##              Prevalence : 0.6580         
##          Detection Rate : 0.5281         
##    Detection Prevalence : 0.6407         
##       Balanced Accuracy : 0.7368         
##                                          
##        'Positive' Class : 0              
## 

  • The model’s accuracy of approximately 76% (95% confidence interval: 0.697 to 0.8114) is significantly higher than the no-information rate of 65.8% (p = 0.000686), so the model predicts better than always guessing the majority class.

  • Because the ‘Positive’ class in this output is 0 (no diabetes), the sensitivity of 80.26% is the proportion of individuals without diabetes who are correctly classified (122 of 152).

  • Conversely, the specificity of 67.09% is the proportion of individuals with diabetes who are correctly identified (53 of 79). The model therefore misses about one in three diabetic individuals (26 false negatives), which matters if the goal is to screen for diabetes.
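
The reported statistics can be recomputed directly from the four counts in the confusion matrix, which makes it explicit which class each rate refers to. This is a minimal sketch in base R; the counts are copied from the output above, and the object names (`cm`, `recall_class0`, `recall_class1`) are illustrative:

```r
# Confusion matrix counts copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(c(122, 30, 26, 53), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Share of all test cases that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of true class-0 cases (no diabetes) predicted as 0;
# reported as Sensitivity because the positive class is '0'
recall_class0 <- cm["0", "0"] / sum(cm[, "0"])

# Share of true class-1 cases (diabetes) predicted as 1;
# reported as Specificity for the same reason
recall_class1 <- cm["1", "1"] / sum(cm[, "1"])

round(c(accuracy = accuracy,
        recall_class0 = recall_class0,
        recall_class1 = recall_class1), 4)
```

If diabetes should be the positive class in the report, confusionMatrix() accepts a `positive` argument (for example `positive = "1"`), which swaps the two labelled rates.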

Variable importance measure

# Extract variable importance
var_importance_rf_model_CV <- varImp(model)
var_importance_rf_model_CV
## rf variable importance
## 
##               Overall
## Glucose       100.000
## BMI            65.761
## Age            52.084
## DiabetesPF     32.320
## BloodPressure  12.927
## SkinThickness  10.683
## Pregnancies     7.175
## Insulin         0.000
# Create a bar plot of variable importance
# The scores are stored in the 'importance' data frame of the varImp object
importance_df <- var_importance_rf_model_CV$importance
importance_df$Variable <- rownames(importance_df)
gg <- ggplot(importance_df, aes(x = reorder(Variable, -Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "Variable",
       y = "Importance Score") +
  theme_minimal() +  # Use a minimal theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        # Customize label appearance
        axis.text = element_text(face = "bold", color = "red", size = 12))

# Save the plot as a file (e.g., in PNG format)
ggsave("variable_importance_plot.png", gg, width = 4, height = 2)

Glucose emerges as the most influential variable (score: 100), signifying its central role in the model’s predictions. BMI ranks second (score: 65.76) and Age third (score: 52.08). The Diabetes Pedigree Function (score: 32.32) contributes moderately, consistent with a hereditary component of diabetes risk. Blood Pressure, Skin Thickness, and Pregnancies contribute little (scores: 12.93, 10.68, and 7.17). Insulin has the lowest importance; because the scores are rescaled to a range of 0 to 100, its score of 0 marks it as the least important variable rather than one with no contribution at all.