# Importing the 'modeest' package for estimating statistical modes or working with statistical distributions
library(modeest)
# Importing the 'ggplot2' package for creating visually appealing and customizable plots
library(ggplot2)
# Importing the 'lattice' package for creating conditioned plots
library(lattice)
# Importing the 'caret' package for machine learning tasks such as data preprocessing, model training, and performance evaluation
library(caret)
# Importing the 'mlbench' package for benchmark datasets commonly used in machine learning research
library(mlbench)
# Importing the 'dlookr' package for data exploration and outlier detection
library(dlookr)
## Registered S3 methods overwritten by 'dlookr':
## method from
## plot.transform scales
## print.transform scales
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:modeest':
##
## skewness
## The following object is masked from 'package:base':
##
## transform
# Load the dplyr library, which provides a grammar of data manipulation.
# dplyr offers a collection of functions like filter(), mutate(), select(), and group_by()
# for efficiently transforming, summarizing, and working with data frames.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the randomForest library, which implements the random forest algorithm.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Reading the csv data file "diabetes.csv" into the 'diabetes' dataframe
diabetes <- read.csv("diabetes.csv", stringsAsFactors = TRUE, na.strings = c("","NA"))
# Generating a summary of the 'diabetes' dataframe
summary(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
Pregnancies: The mean number of pregnancies is 3.845 and the median is 3, so the distribution is slightly right-skewed. The minimum of 0 is plausible, since some individuals have never been pregnant. The maximum of 17 is high relative to the rest of the data and is a candidate outlier: such a count is possible but uncommon, and it suggests that one or a few individuals have exceptionally high numbers of pregnancies.
Glucose: The mean glucose level is 120.9, with a median of 117. The minimum of 0 is physiologically implausible and most likely marks a missing or invalid measurement, so these values need to be handled before modelling. The maximum of 199 shows a wide spread of glucose levels.
BloodPressure: The mean blood pressure is 69.11, with a median of 72. As with glucose, the minimum of 0 is implausible and most likely marks missing or invalid data. The maximum is 122.
SkinThickness: The mean skin thickness is 20.54, with a median of 23. The minimum of 0 is again implausible, and the first quartile is also 0, so at least a quarter of the records are affected. The maximum of 99 indicates a wide range of measurements.
Insulin: The mean insulin level is 79.8, with a median of 30.5; the large gap between the two points to strong right skew. The minimum and the first quartile are both 0, which could indicate missing or invalid data in at least a quarter of the records. The maximum is 846.
BMI: The mean BMI (Body Mass Index) is 31.99, with a median of 32. The minimum of 0 is impossible for a real measurement and most likely marks missing or invalid data. The maximum of 67.1 indicates a wide range of BMI values.
DiabetesPedigreeFunction: The mean diabetes pedigree function is 0.4719, with a median of 0.3725. This function provides information about the genetic influence of diabetes based on family history. The values range from 0.0780 to 2.4200.
Age: The mean age is 33.24, with a median of 29. The minimum age is 21, and the maximum age is 81. These age statistics provide an overview of the age distribution within the dataset. The range of ages suggests that the dataset includes individuals spanning a wide age range.
Outcome: The outcome variable records whether an individual has diabetes (1) or not (0). The mean of 0.349 indicates that approximately 34.9% of the individuals in the dataset have diabetes. Because the variable is categorical, it should be stored as a factor rather than as a number.
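The zero minimums noted above can be quantified before deciding how to treat them. The sketch below counts zeros per column on a small made-up data frame; `toy` and its values are illustrative assumptions, not rows from diabetes.csv.

```r
# Toy data frame standing in for the real dataset (values are illustrative only)
toy <- data.frame(
  Glucose       = c(148, 0, 183, 89),
  BloodPressure = c(72, 66, 0, 0),
  BMI           = c(33.6, 26.6, 23.3, 0)
)

# Number of zeros in each column, and the share of rows they represent
zero_counts <- colSums(toy == 0)
zero_share  <- round(zero_counts / nrow(toy), 2)
print(rbind(count = zero_counts, share = zero_share))
```

Running the same two lines on the real data frame would show how many records each later imputation step touches.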
# Converting the "Outcome" variable to factor
diabetes$Outcome <- as.factor(diabetes$Outcome)
# Displaying the structure of the 'diabetes' dataframe
str(diabetes)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 2 ...
The structure is confirmed: 768 observations of 9 variables, with Outcome now stored as a two-level factor.
# Calculating the column-wise sum of missing values in the 'diabetes' dataframe
colSums(is.na(diabetes))
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
No values are coded as NA. This check does not detect the zeros noted in the summary, which may still represent missing data.
# Removing duplicate rows
diabetes <- unique(diabetes)
# Renaming the 'DiabetesPedigreeFunction' column to 'DiabetesPF'.
diabetes <- rename(diabetes, DiabetesPF = DiabetesPedigreeFunction)
Confirmed no duplicate records.
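Because unique() drops duplicates silently, counting them first makes a "no duplicates" statement checkable. A minimal sketch on a made-up data frame (`toy` is an assumption for illustration):

```r
# Toy data with one exact duplicate row (rows 2 and 3 are identical)
toy <- data.frame(id = c(1, 2, 2, 3), value = c(10, 20, 20, 30))

# duplicated() flags every repeat of an earlier row
n_dup <- sum(duplicated(toy))
cat("Duplicate rows found:", n_dup, "\n")

# unique() keeps the first occurrence of each row
deduped <- unique(toy)
cat("Rows before:", nrow(toy), "after:", nrow(deduped), "\n")
```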
# Selecting numerical variables from diabetes data frame
numerical_vars <- diabetes[, sapply(diabetes, is.numeric)]
# Plot outliers in numerical variables
plot_outlier(numerical_vars)
# Identify nature of outliers using boxplot statistics
boxplot.stats(diabetes$Pregnancies)$out
## [1] 15 17 14 14
boxplot.stats(diabetes$Glucose)$out
## [1] 0 0 0 0 0
boxplot.stats(diabetes$BloodPressure)$out
## [1] 0 0 30 110 0 0 0 0 108 122 30 0 110 0 0 0 0 0 0
## [20] 0 0 0 0 108 0 0 0 0 0 0 0 0 0 0 110 0 24 0
## [39] 0 0 0 114 0 0 0
boxplot.stats(diabetes$SkinThickness)$out
## [1] 99
boxplot.stats(diabetes$Insulin)$out
## [1] 543 846 342 495 325 485 495 478 744 370 680 402 375 545 360 325 465 325 415
## [20] 579 474 328 480 326 330 600 321 440 540 480 335 387 392 510
boxplot.stats(diabetes$BMI)$out
## [1] 0.0 0.0 0.0 0.0 53.2 55.0 0.0 67.1 52.3 52.3 52.9 0.0 0.0 59.4 0.0
## [16] 0.0 57.3 0.0 0.0
boxplot.stats(diabetes$DiabetesPF)$out
## [1] 2.288 1.441 1.390 1.893 1.781 1.222 1.400 1.321 1.224 2.329 1.318 1.213
## [13] 1.353 1.224 1.391 1.476 2.137 1.731 1.268 1.600 2.420 1.251 1.699 1.258
## [25] 1.282 1.698 1.461 1.292 1.394
boxplot.stats(diabetes$Age)$out
## [1] 69 67 72 81 67 67 70 68 69
For the variable “Pregnancies,” the outliers identified are 15, 17, 14, and 14. These values indicate that there are individuals in the dataset with exceptionally high numbers of pregnancies compared to the rest of the data.
For the variable “Glucose,” the outliers identified are all 0. These values indicate missing or invalid data points rather than true outliers.
For the Blood Pressure variable, the zero values are physiologically implausible and point to missing data. The non-zero outliers are not necessarily errors and may instead reflect abnormal health conditions: the high values (108 to 122) may signify hypertension, while the low values (24 and 30) may suggest hypotension.
Outliers in the Skin Thickness variable, such as a value of 99, require further investigation. These extreme values may be indicative of unusual or uncommon measurements within the dataset and may warrant consideration in the analysis.
In the Insulin variable, outliers that are significantly higher than the rest of the data suggest individuals with exceptionally high insulin levels. These outliers may be related to specific medical conditions or other factors affecting insulin production or metabolism.
Outliers in the BMI variable fall into two groups. The zero values indicate missing or invalid data and should be treated as missing values and imputed. The high values (52.3 to 67.1) are extreme but not impossible measurements.
The Diabetes Pedigree Function variable’s outliers may represent extreme values in the genetic diabetes score. These outliers may have implications for understanding the hereditary component of diabetes risk.
In the Age variable, all flagged values lie at the upper end of the range (67 to 81 years). These are valid ages; they are flagged only because the sample is concentrated among younger adults, with a median age of 29.
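The values flagged by boxplot.stats() follow the usual boxplot rule: a point is an outlier if it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. The sketch below reproduces that rule by hand on a made-up vector (`x` is illustrative, not taken from the dataset) and compares the result with boxplot.stats().

```r
# Illustrative vector with one implausibly low and one high value
x <- c(0, 62, 64, 66, 70, 72, 72, 74, 80, 122)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)
hinge_spread <- five[4] - five[2]

# Fences at 1.5 times the hinge spread beyond each hinge
lower_fence <- five[2] - 1.5 * hinge_spread
upper_fence <- five[4] + 1.5 * hinge_spread

# Points outside the fences are the reported outliers
manual_out <- x[x < lower_fence | x > upper_fence]
print(manual_out)
print(boxplot.stats(x)$out)
```

The rule is purely distributional, which is why it flags both data-entry zeros and genuine extreme measurements without distinguishing between them.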
Glucose, blood pressure, skin thickness, insulin, and BMI cannot be zero in a living person, so the zeros found in these variables during the summary and outlier checks most likely signify missing values rather than valid measurements. They are handled with median imputation: each zero is replaced by the median of its variable. The median is preferred to the mean because it is robust to extreme values, so the imputed entries stay close to the centre of the distribution even when the variable is heavily skewed.
# Replacing zero values in selected variables with the column median (computed with the zeros included)
diabetes$Pregnancies[which(diabetes$Pregnancies==0)]<- median(diabetes$Pregnancies)
diabetes$Glucose[which(diabetes$Glucose==0)] <- median(diabetes$Glucose)
diabetes$BloodPressure[which(diabetes$BloodPressure==0)] <- median(diabetes$BloodPressure)
diabetes$SkinThickness[which(diabetes$SkinThickness==0)] <- median(diabetes$SkinThickness)
diabetes$Insulin[which(diabetes$Insulin==0)] <- median(diabetes$Insulin)
diabetes$BMI[which(diabetes$BMI==0)] <- median(diabetes$BMI)
diabetes$DiabetesPF[which(diabetes$DiabetesPF==0)]<- median(diabetes$DiabetesPF)
diabetes$Age[which(diabetes$Age==0)] <- median(diabetes$Age)
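One variation on the imputation above is to compute each median from the non-zero values only, so that the placeholder zeros do not pull the replacement value downward, and to apply it only to columns where zero is impossible. This is a sketch of that alternative, not the method used in the code above; the helper `impute_zero_with_median()` and the data frame `toy` are hypothetical.

```r
# Hypothetical helper: replace zeros with the median of the non-zero values
impute_zero_with_median <- function(x) {
  nonzero_median <- median(x[x != 0])
  x[x == 0] <- nonzero_median
  x
}

# Toy data (illustrative values only)
toy <- data.frame(
  Glucose     = c(148, 0, 183, 89, 137),
  BMI         = c(33.6, 26.6, 0, 28.1, 43.1),
  Pregnancies = c(6, 1, 0, 1, 0)
)

# Impute only columns where zero is not a valid measurement;
# Pregnancies is left alone because zero is a plausible count
cols_to_impute <- c("Glucose", "BMI")
toy[cols_to_impute] <- lapply(toy[cols_to_impute], impute_zero_with_median)
print(toy)
```

Listing the columns explicitly also avoids the copy-and-paste slips that are easy to make when the same assignment is repeated by hand for each variable.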
# Calculating correlation matrix and rounding to 2 decimal places
round(cor(numerical_vars), 2)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## Pregnancies 1.00 0.13 0.14 -0.08 -0.07 0.02
## Glucose 0.13 1.00 0.15 0.06 0.33 0.22
## BloodPressure 0.14 0.15 1.00 0.21 0.09 0.28
## SkinThickness -0.08 0.06 0.21 1.00 0.44 0.39
## Insulin -0.07 0.33 0.09 0.44 1.00 0.20
## BMI 0.02 0.22 0.28 0.39 0.20 1.00
## DiabetesPF -0.03 0.14 0.04 0.18 0.19 0.14
## Age 0.54 0.26 0.24 -0.11 -0.04 0.04
## DiabetesPF Age
## Pregnancies -0.03 0.54
## Glucose 0.14 0.26
## BloodPressure 0.04 0.24
## SkinThickness 0.18 -0.11
## Insulin 0.19 -0.04
## BMI 0.14 0.04
## DiabetesPF 1.00 0.03
## Age 0.03 1.00
The only correlation above 0.5 is between ‘Pregnancies’ and ‘Age’ (0.54). This is a moderate positive association: older individuals tend to have had more pregnancies.
All remaining pairwise correlations among the independent variables are below 0.5; the next largest are SkinThickness with Insulin (0.44) and SkinThickness with BMI (0.39). These low values suggest that multicollinearity is unlikely to be a serious problem, although pairwise correlations alone cannot rule it out.
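Pairwise correlations can miss collinearity that involves three or more variables, so a common complementary check is the variance inflation factor (VIF), defined for each predictor as 1 / (1 - R²) from regressing it on the other predictors. The sketch below computes it in base R on simulated data; the variables `x1`, `x2`, and `x3` are made up for illustration.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)  # moderately correlated with x1
x3 <- rnorm(n)                       # independent of the others
predictors <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on all the others and use 1 / (1 - R^2)
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif_manual, 2))
```

Values near 1 indicate little shared variance; a commonly quoted rule of thumb treats values above 5 or 10 as a warning sign.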
# Attaching the "Diabetes" data frame
attach(diabetes)
# Building the logistic regression model
logitA <- glm(Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI +DiabetesPF + Age,family = binomial)
# Summarizing the logistic regression model
summary(logitA)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## SkinThickness + Insulin + BMI + DiabetesPF + Age, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6141 -0.7124 -0.4082 0.7050 2.4744
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.8279654 0.7994998 -11.042 < 2e-16 ***
## Pregnancies 0.1426196 0.0359056 3.972 7.12e-05 ***
## Glucose 0.0382817 0.0038628 9.910 < 2e-16 ***
## BloodPressure -0.0090341 0.0085999 -1.050 0.29350
## SkinThickness 0.0042342 0.0121734 0.348 0.72797
## Insulin -0.0015045 0.0009196 -1.636 0.10182
## BMI 0.0798234 0.0165767 4.815 1.47e-06 ***
## DiabetesPF 0.9445078 0.3021420 3.126 0.00177 **
## Age 0.0113604 0.0093738 1.212 0.22554
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 714.80 on 759 degrees of freedom
## AIC: 732.8
##
## Number of Fisher Scoring iterations: 5
In this analysis, “Pregnancies,” “Glucose,” “BMI,” and “DiabetesPF” have statistically significant coefficients (p < 0.05), indicating that each is associated with the log-odds of the outcome after adjusting for the other variables.
Among these, “Glucose” and “BMI” have the largest z-statistics (9.910 and 4.815), so their effects are the most firmly established. For each unit increase in “Glucose,” the log-odds of the outcome increase by 0.0382817; for each unit increase in “BMI,” they increase by 0.0798234. The raw coefficients are not directly comparable across variables, because each is expressed per unit of its own scale.
“Pregnancies” and “DiabetesPF” are also statistically significant, with smaller z-statistics (3.972 and 3.126). The remaining variables, “SkinThickness,” “BloodPressure,” “Insulin,” and “Age,” are not statistically significant (p > 0.05), so the data give no clear evidence that they affect the outcome once the other variables are accounted for.
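Coefficient size depends on the unit of each predictor, so one way to compare influence is to refit the model on standardized predictors, where each coefficient is the change in log-odds per standard deviation. The sketch below uses simulated data; the variable names `glucose` and `pedigree` and all parameter values are assumptions chosen for illustration, not estimates from the dataset.

```r
set.seed(42)
n <- 500
toy <- data.frame(
  glucose  = rnorm(n, mean = 120, sd = 30),
  pedigree = rnorm(n, mean = 0.5, sd = 0.3)
)
# Simulate a binary outcome from a known logistic model
lin_pred    <- -6 + 0.04 * toy$glucose + 1.0 * toy$pedigree
toy$outcome <- rbinom(n, 1, plogis(lin_pred))

# Fit on the raw scale, passing the data explicitly instead of using attach()
raw_fit <- glm(outcome ~ glucose + pedigree, data = toy, family = binomial)

# Fit again after centring and scaling the predictors
toy_std <- toy
toy_std[c("glucose", "pedigree")] <- scale(toy[c("glucose", "pedigree")])
std_fit <- glm(outcome ~ glucose + pedigree, data = toy_std, family = binomial)

print(round(cbind(raw = coef(raw_fit), standardized = coef(std_fit)), 3))
```

Each standardized slope equals the raw slope multiplied by that predictor's standard deviation, which is why a variable with a small per-unit coefficient but a wide range can still dominate.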
# Calculating and interpreting the odds ratio
exp(coef(logitA))
## (Intercept) Pregnancies Glucose BloodPressure SkinThickness
## 0.0001465762 1.1532909827 1.0390238523 0.9910066097 1.0042431701
## Insulin BMI DiabetesPF Age
## 0.9984966772 1.0830957581 2.5715473012 1.0114251402
Here are the interpretations of the coefficients:
(Intercept): The odds of the outcome when all independent variables are zero. The value is 0.0001465762.
Pregnancies: For each additional pregnancy, the odds of the outcome are multiplied by approximately 1.15 (an increase of about 15%).
Glucose: For each unit increase in the glucose level, the odds of the outcome are multiplied by approximately 1.04 (an increase of about 4%).
BloodPressure: For each unit increase in blood pressure, the odds of the outcome are multiplied by approximately 0.99 (a decrease of about 1%).
SkinThickness: For each unit increase in skin thickness, the odds of the outcome are multiplied by approximately 1.004, which is essentially no change.
Insulin: For each unit increase in the insulin level, the odds of the outcome are multiplied by approximately 0.998 (a decrease of about 0.15%).
BMI: For each unit increase in BMI (body mass index), the odds of the outcome are multiplied by approximately 1.08 (an increase of about 8%).
DiabetesPF: For each unit increase in the diabetes pedigree function, the odds of the outcome are multiplied by approximately 2.57.
Age: For each additional year of age, the odds of the outcome are multiplied by approximately 1.01 (an increase of about 1%).
Each odds ratio describes the effect of a one-unit change, so the values are not directly comparable across variables measured on different scales. “DiabetesPF” has the largest odds ratio (2.57), but a single unit spans most of its observed range (0.078 to 2.42). “Pregnancies” (1.15) and “BMI” (1.08) show clear per-unit increases, and “Glucose” (1.04) looks small only because one unit of glucose is a small change. “BloodPressure,” “SkinThickness,” “Insulin,” and “Age” have odds ratios very close to 1 and were not statistically significant.
In conclusion, the logistic regression analysis indicates that four factors are significantly associated with higher odds of diabetes: a greater number of pregnancies, higher glucose levels, higher BMI, and a higher diabetes pedigree function score, which reflects family history.
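Point estimates of odds ratios are easier to judge alongside an interval and on a clinically meaningful step size. The sketch below, on simulated data with made-up parameters, adds Wald 95% confidence intervals via confint.default() and rescales one odds ratio to a 10-unit change. The last line applies the same rescaling to the Glucose coefficient reported in the model summary above.

```r
set.seed(7)
n   <- 400
toy <- data.frame(glucose = rnorm(n, mean = 120, sd = 30))
toy$outcome <- rbinom(n, 1, plogis(-5 + 0.04 * toy$glucose))

fit <- glm(outcome ~ glucose, data = toy, family = binomial)

# Odds ratios with Wald 95% confidence intervals (exponentiated log-odds scale)
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(round(or_table, 4))

# Odds ratio for a 10-unit increase: exponentiate 10 times the coefficient
or_per_10 <- exp(10 * coef(fit)[["glucose"]])
print(or_per_10)

# Same rescaling for the Glucose coefficient from the summary above
print(exp(10 * 0.0382817))
```

An interval that excludes 1 corresponds to a coefficient that is significant at the 5% level under the same Wald approximation.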
# Set the random seed for reproducibility
set.seed(1994)
# Determine the number of rows in the 'diabetes' data frame
n_rows <- nrow(diabetes)
# Create a random sample of indices for splitting the data into training and testing sets (70% for training)
idx <- sample(n_rows, n_rows*0.7)
# Create the training and testing data sets using the sampled indices
trainData <- diabetes[idx,]
testData <- diabetes[-idx,]
# Define the formula for the random forest model using the relevant features
formula <- Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI + DiabetesPF + Age
# Set up repeated cross-validation: 10 folds, repeated 3 times (30 resamples in total)
ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 3, savePredictions = "final")
# Train the random forest model using cross-validation
model <- train(formula, data = trainData, method='rf', trControl=ctrl)
# Display the cross-validation model
model
## Random Forest
##
## 537 samples
## 8 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 484, 483, 483, 484, 483, 483, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7672374 0.4763779
## 5 0.7591195 0.4617980
## 8 0.7578966 0.4604780
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
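The resampling summary above (10 folds, repeated 3 times, training sizes of 483 or 484) can be reproduced in outline with base R. This is a simplified sketch of repeated k-fold index generation: it uses plain random folds, whereas caret also balances the folds by outcome class, so it illustrates the idea rather than replicating caret's internals. The list name `resamples` and the naming pattern are assumptions.

```r
set.seed(1994)
n       <- 537  # size of the training set reported above
k       <- 10
repeats <- 3

resamples <- list()
for (r in seq_len(repeats)) {
  # Assign each row to one of k folds at random, with near-equal fold sizes
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    # Training indices for this resample: every row not in the held-out fold
    resamples[[sprintf("Fold%02d.Rep%d", f, r)]] <- which(fold_id != f)
  }
}

cat("Number of resamples:", length(resamples), "\n")
cat("Training-set sizes:", sort(unique(lengths(resamples))), "\n")
```

Each candidate mtry value is evaluated on all 30 held-out folds, and the reported Accuracy and Kappa are averages over them.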
# Make predictions
predictions <- predict(model, testData[,-9], type = 'raw')
# Computing the confusion matrix
co_matrix <- confusionMatrix(predictions, testData$Outcome)
co_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 122 26
## 1 30 53
##
## Accuracy : 0.7576
## 95% CI : (0.697, 0.8114)
## No Information Rate : 0.658
## P-Value [Acc > NIR] : 0.000686
##
## Kappa : 0.4678
##
## Mcnemar's Test P-Value : 0.688500
##
## Sensitivity : 0.8026
## Specificity : 0.6709
## Pos Pred Value : 0.8243
## Neg Pred Value : 0.6386
## Prevalence : 0.6580
## Detection Rate : 0.5281
## Detection Prevalence : 0.6407
## Balanced Accuracy : 0.7368
##
## 'Positive' Class : 0
##
The model’s test-set accuracy is approximately 76%, with a 95% confidence interval of (0.697, 0.8114). This is significantly higher than the no-information rate of 65.8% (p = 0.000686), so the model predicts better than always choosing the majority class.
The reported sensitivity of 80.26% refers to the ‘Positive’ class, which here is 0 (no diabetes). It is therefore the share of individuals without diabetes who were correctly classified (122 of 152).
Conversely, the specificity of 67.09% is the share of individuals with diabetes who were correctly detected (53 of 79). The model missed the remaining 26 diabetic cases, about one in three.
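The headline metrics can be recomputed directly from the four counts in the confusion matrix above, which makes it explicit which class each rate refers to. The variable names below are illustrative.

```r
# Counts copied from the confusion matrix above (rows: prediction, columns: reference)
cm <- matrix(c(122, 30, 26, 53), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Overall accuracy: correct predictions over all test cases
accuracy <- sum(diag(cm)) / sum(cm)

# Share of true class-0 cases (no diabetes) predicted as 0
recall_no_diabetes <- cm["0", "0"] / sum(cm[, "0"])

# Share of true class-1 cases (diabetes) predicted as 1
recall_diabetes <- cm["1", "1"] / sum(cm[, "1"])

# Share of predicted class-1 cases that truly have diabetes
precision_diabetes <- cm["1", "1"] / sum(cm["1", ])

print(round(c(accuracy = accuracy,
              recall_no_diabetes = recall_no_diabetes,
              recall_diabetes = recall_diabetes,
              precision_diabetes = precision_diabetes), 4))
```

To have caret report sensitivity for the diabetic class instead, confusionMatrix() accepts a positive argument, for example positive = "1".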
# Extract variable importance
var_importance_rf_model_CV <- varImp(model)
var_importance_rf_model_CV
## rf variable importance
##
## Overall
## Glucose 100.000
## BMI 65.761
## Age 52.084
## DiabetesPF 32.320
## BloodPressure 12.927
## SkinThickness 10.683
## Pregnancies 7.175
## Insulin 0.000
# Create a bar plot of variable importance
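# varImp() returns a list-like object; the scores are stored in its $importance data frame
imp_df <- var_importance_rf_model_CV$importance
imp_df$Variable <- rownames(imp_df)
# Order the bars from most to least important
gg <- ggplot(imp_df, aes(x = reorder(Variable, -Overall), y = Overall)) +
geom_col(fill = "blue") +
labs(x = "Variable",
y = "Importance Score") +
theme_minimal() +
# Bold red axis labels; rotate the x-axis labels so long names do not overlap
theme(axis.text = element_text(face = "bold", color = "red", size = 12),
axis.text.x = element_text(angle = 45, hjust = 1))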
# Save the plot as a file (e.g., in PNG format)
ggsave("variable_importance_plot.png", gg, width = 4, height = 2)
Glucose is the most influential variable (score: 100) in predicting diabetes. BMI follows (score: 65.76), then Age (score: 52.08) and the Diabetes Pedigree Function (score: 32.32), which reflects family history. Blood Pressure and Skin Thickness contribute less (scores: 12.93 and 10.68), and Pregnancies contributes little (score: about 7.2). Insulin ranks last. Its score of 0 results from the scores being rescaled to a 0 to 100 range, so it marks the least important variable in this model rather than one with no contribution at all.
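The 0 to 100 scores come from a min-max rescaling of the raw importance values, which is what caret's varImp() applies by default (scale = TRUE). The sketch below shows the arithmetic on made-up raw values; the numbers in `raw_importance` are hypothetical and are not the model's actual raw importances.

```r
# Hypothetical raw importance values (e.g. mean decrease in Gini), for illustration only
raw_importance <- c(Glucose = 60, BMI = 42, Age = 35, Insulin = 8)

# Min-max rescaling: the smallest value maps to 0 and the largest to 100
value_range <- max(raw_importance) - min(raw_importance)
scaled <- (raw_importance - min(raw_importance)) / value_range * 100
print(round(scaled, 2))
```

In this example the lowest-ranked variable has a positive raw importance and still receives a scaled score of 0. Calling varImp(model, scale = FALSE) would show the unscaled values.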