Diabetes has grown into a major global health issue, affecting millions of people worldwide. Early detection and accurate prediction of diabetes can play a crucial role in preventing complications and enhancing patient well-being. Within the realm of healthcare, machine learning algorithms have emerged as formidable tools, capable of accurately forecasting the risk of developing diabetes.
Machine learning algorithms harness the power of advanced data analysis to identify patterns, correlations, and predictive indicators that may elude human observation. By leveraging these algorithms, healthcare professionals can tap into comprehensive patient data, including medical history, demographic information, lifestyle factors, and clinical measurements, to develop accurate models for predicting the likelihood of diabetes occurrence.
These machine learning algorithms possess the ability to identify diabetes risk factors, such as obesity, family history, sedentary lifestyle, and abnormal glucose levels. By scrutinizing the complex interplay among various variables, these algorithms generate personalized risk assessments, enabling targeted interventions and preventive measures. The variables in this dataset are as follows: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, and Outcome.
Diabetes prediction is vital for improving patient outcomes, enhancing preventive care strategies, and optimizing diabetes management. In addition, glucose level is an important indicator of an individual's risk of developing diabetes. Therefore, this project addresses two research questions: can classification models accurately predict the occurrence of diabetes in an individual, and can regression models accurately predict an individual's glucose level?
In this project, our intention is to predict the occurrence of diabetes in an individual based on the input variables. The objectives of this study are as follows:
1. To predict the occurrence of diabetes using classification modelling.
2. To predict the glucose level using regression modelling.
This section summarizes the process conducted throughout this study.
Data Collection: This dataset was obtained from Kaggle and originates from the National Institute of Diabetes and Digestive and Kidney Diseases. The purpose of the dataset is to diagnostically predict whether a patient has diabetes based on diagnostic measurements.
Data Preprocessing: The collected data undergoes preprocessing to ensure its quality and suitability for analysis. This crucial step involves various tasks such as handling missing values, addressing outliers, and normalizing or standardizing variables.
Model Training: Model training is divided into two parts, namely classification modelling and regression modelling. Classification modelling is employed to predict the occurrence of diabetes in an individual based on the input variables, while regression modelling is used to predict an individual's glucose level. The preprocessed data is used to train the machine learning algorithms, enabling them to learn patterns and relationships between the input variables and the target variable. The classification models employed to predict the occurrence of diabetes are the Decision Tree Classifier (DTC), Naive Bayes Classifier, and Support Vector Machine (SVM). For regression modelling, the algorithms employed are Decision Tree Regression (DTR), Random Forest Regression (RF), k-Nearest Neighbour (kNN), Linear Regression, and Gradient Boosting Regression (GBR).
Model Evaluation: Model evaluation assesses each model's predictive capability and determines its generalization ability on unseen data. For classification, the models are evaluated by accuracy and the confusion matrix. For regression modelling, the models are evaluated by Mean Squared Error (MSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R^2).
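For reference, the regression metrics can be written as follows, where $y_i$ is the observed glucose level, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the observed values, and $n$ the number of observations evaluated:

$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$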
Check the dimensions, summary, and structure of the dataset.
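A minimal sketch of how this inspection might be done in R, assuming the Kaggle CSV has been read into a data frame named `diabetes` (the object and file names are assumptions, not the original code):

```r
# Load the dataset and inspect its summary, dimensions, and structure.
diabetes <- read.csv("diabetes.csv")   # assumed file name

summary(diabetes)                                   # min, quartiles, mean, max per variable
print(paste("Number of rows:", nrow(diabetes)))     # 768
print(paste("Number of columns:", ncol(diabetes)))  # 9
str(diabetes)                                       # variable types and example values
```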
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
## [1] "Number of rows: 768"
## [1] "Number of columns: 9"
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
From the summary of the dataset, we are able to check the minimum value, maximum value, median, mean, first quartile, and third quartile for each variable. By using the dim() function, it was revealed that the dataset consists of 768 rows and 9 columns. Besides, by looking at the structure of the data, we are able to determine whether data transformation is required. In this case, data transformation is not necessary.
The collected data is preprocessed to ensure its quality and suitability for analysis. In this step, the null values were checked and cleaned using the mean imputation method.
Checking for null values in glucose level and fixing them using the mean imputation method.
## [1] 5
Checking for null values in blood pressure and fixing them using the mean imputation method.
## [1] 35
Checking for null values in skin thickness and fixing them using the mean imputation method.
## [1] 227
Checking for null values in insulin and fixing them using the mean imputation method.
## [1] 374
Checking for null values in BMI and fixing them using the mean imputation method.
## [1] 11
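A hedged sketch of this cleaning step, under the common assumption for this dataset that zero readings in these clinical columns encode missing ("null") values; the loop and column selection are illustrative rather than the original code:

```r
# Count zero ("null") readings in each clinical column and replace them with
# the mean of the remaining (non-zero) values.
cols_to_fix <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

for (col in cols_to_fix) {
  is_null <- diabetes[[col]] == 0
  print(sum(is_null))                                          # number of null values
  diabetes[[col]][is_null] <- mean(diabetes[[col]][!is_null])  # mean imputation
}
```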
Outline:
Check the data summary, including the standard deviation and variance of each variable.
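A short sketch of how the post-imputation summary, standard deviation, and variance shown below might be produced (using base R; the use of `sapply` is an assumption):

```r
summary(diabetes)       # five-number summary and mean after imputation
sapply(diabetes, sd)    # standard deviation of each variable
sapply(diabetes, var)   # variance of each variable
```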
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.00 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.75 1st Qu.: 64.00 1st Qu.:20.54
## Median : 3.000 Median :117.00 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :121.68 Mean : 72.25 Mean :26.61
## 3rd Qu.: 6.000 3rd Qu.:140.25 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.00 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :32.45 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
## Pregnancies Glucose BloodPressure
## 3.3695781 30.4360156 12.1159316
## SkinThickness Insulin BMI
## 9.6312407 115.2440024 6.8753735
## DiabetesPedigreeFunction Age Outcome
## 0.3313286 11.7602315 0.4769514
## Pregnancies Glucose BloodPressure
## 1.135406e+01 9.263510e+02 1.467958e+02
## SkinThickness Insulin BMI
## 9.276080e+01 1.328118e+04 4.727076e+01
## DiabetesPedigreeFunction Age Outcome
## 1.097786e-01 1.383030e+02 2.274826e-01
Use subplots of histograms and boxplots to check the data distribution.
Histogram for number of times being pregnant
Histogram for glucose level
Histogram for blood pressure
Histogram for skin thickness
Histogram for Insulin
Histogram for body mass index
Histogram for diabetes pedigree function
Histogram for age
Boxplot and Outlier Detection
Pregnancies
Glucose
Blood Pressure
Skin Thickness
Insulin
BMI
Diabetes Pedigree Function
Age
Correlation Matrix for each of the Variables
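An illustrative sketch of the exploratory plots above, using base graphics for the histograms and boxplots and the corrplot package for the correlation matrix (the package choice and plotting parameters are assumptions):

```r
library(corrplot)

# Histograms of the eight predictors in a 2 x 4 grid.
par(mfrow = c(2, 4))
for (col in names(diabetes)[1:8]) {
  hist(diabetes[[col]], main = col, xlab = col, col = "lightblue")
}

# Boxplots of the same variables for outlier detection.
par(mfrow = c(2, 4))
for (col in names(diabetes)[1:8]) {
  boxplot(diabetes[[col]], main = col)
}

# Correlation matrix for all variables.
par(mfrow = c(1, 1))
corrplot(cor(diabetes), method = "color", addCoef.col = "black")
```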
Findings: Based on the bar graph plotted, it was observed that about 500 individuals in the dataset do not have diabetes, while approximately 268 individuals have diabetes.
According to the World Health Organization (WHO), BMI can be classified into four categories, namely underweight (<18.5), normal weight (18.5-24.9), overweight (25.0-29.9), and obesity (≥30.0).
Findings: In the dataset, it was observed that individuals with obesity form the largest group, followed by overweight, normal weight, and underweight.
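A hypothetical sketch of the BMI categorisation behind this bar chart, using the WHO cut-offs above (the variable name `bmi_cat` and the use of `cut()` are assumptions):

```r
# Bin BMI into the four WHO categories and plot the counts.
bmi_cat <- cut(diabetes$BMI,
               breaks = c(-Inf, 18.5, 25, 30, Inf),
               labels = c("Underweight", "Normal weight", "Overweight", "Obesity"),
               right  = FALSE)   # intervals are [lower, upper)
barplot(table(bmi_cat), main = "BMI Category", ylab = "Count")
```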
According to the World Health Organization (WHO), individuals with a glucose level greater than 126 mg/dL are considered diabetic.
Findings: Based on the dataset, individuals classified as diabetic by this glucose threshold outnumber those classified as non-diabetic.
1. To predict the occurrence of diabetes using classification modelling.
## [1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
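A hedged sketch of how the three classification models might be trained and evaluated with the `rpart`, `e1071`, and `caret` packages; the object names are assumptions, and the confusion-matrix totals below (768 observations) suggest that predictions were generated for the full dataset:

```r
library(rpart)   # Decision Tree Classifier
library(e1071)   # Naive Bayes and Support Vector Machine
library(caret)   # confusionMatrix()

diabetes$Outcome <- as.factor(diabetes$Outcome)

# Decision Tree Classifier (DTC)
dtc_model <- rpart(Outcome ~ ., data = diabetes, method = "class")
pred_dtc  <- predict(dtc_model, diabetes, type = "class")
confusionMatrix(pred_dtc, diabetes$Outcome)

# Naive Bayes Classifier
nb_model   <- naiveBayes(Outcome ~ ., data = diabetes)
pred_naive <- predict(nb_model, diabetes)
confusionMatrix(pred_naive, diabetes$Outcome)

# Support Vector Machine (SVM)
svm_model <- svm(Outcome ~ ., data = diabetes, kernel = "radial")
pred_svm  <- predict(svm_model, diabetes)
confusionMatrix(pred_svm, diabetes$Outcome)
```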
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 436 59
## 1 64 209
##
## Accuracy : 0.8398
## 95% CI : (0.812, 0.8651)
## No Information Rate : 0.651
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.649
##
## Mcnemar's Test P-Value : 0.7183
##
## Sensitivity : 0.8720
## Specificity : 0.7799
## Pos Pred Value : 0.8808
## Neg Pred Value : 0.7656
## Prevalence : 0.6510
## Detection Rate : 0.5677
## Detection Prevalence : 0.6445
## Balanced Accuracy : 0.8259
##
## 'Positive' Class : 0
##
## Accuracy
## 83.98438
##
## pred_naive 0 1
## 0 412 100
## 1 88 168
## [1] 0.7552083
## Confusion Matrix and Statistics
##
##
## pred_naive 0 1
## 0 412 100
## 1 88 168
##
## Accuracy : 0.7552
## 95% CI : (0.7232, 0.7852)
## No Information Rate : 0.651
## P-Value [Acc > NIR] : 3.033e-10
##
## Kappa : 0.4556
##
## Mcnemar's Test P-Value : 0.4224
##
## Sensitivity : 0.8240
## Specificity : 0.6269
## Pos Pred Value : 0.8047
## Neg Pred Value : 0.6562
## Prevalence : 0.6510
## Detection Rate : 0.5365
## Detection Prevalence : 0.6667
## Balanced Accuracy : 0.7254
##
## 'Positive' Class : 0
##
## Accuracy
## 75.52083
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 453 114
## 1 47 154
##
## Accuracy : 0.7904
## 95% CI : (0.7598, 0.8186)
## No Information Rate : 0.651
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5102
##
## Mcnemar's Test P-Value : 1.977e-07
##
## Sensitivity : 0.9060
## Specificity : 0.5746
## Pos Pred Value : 0.7989
## Neg Pred Value : 0.7662
## Prevalence : 0.6510
## Detection Rate : 0.5898
## Detection Prevalence : 0.7383
## Balanced Accuracy : 0.7403
##
## 'Positive' Class : 0
##
## Accuracy
## 79.03646
2. To predict the glucose level using regression modelling.
For the regression modelling to predict the glucose level, the diabetes outcome will be dropped. According to Frankum & Ogden (2005), the majority of individuals with diabetes are unable to estimate their glucose level accurately. Therefore, in this study, the diabetes outcome is removed for glucose level prediction.
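A hedged sketch of the regression step; the packages (`rpart`, `randomForest`, `FNN`, `gbm`), the 80/20 split, the seed, and the object names are assumptions rather than the original code:

```r
library(rpart)         # Decision Tree Regression
library(randomForest)  # Random Forest Regression
library(FNN)           # k-Nearest Neighbour regression
library(gbm)           # Gradient Boosting Regression

# Drop the diabetes outcome and use Glucose as the regression target.
reg_data <- subset(diabetes, select = -Outcome)

set.seed(123)                                              # assumed seed
idx   <- sample(nrow(reg_data), floor(0.8 * nrow(reg_data)))
train <- reg_data[idx, ]
test  <- reg_data[-idx, ]

# Helper returning the three evaluation metrics.
metrics <- function(actual, predicted) {
  c(MSE = mean((actual - predicted)^2),
    MAE = mean(abs(actual - predicted)),
    R2  = 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2))
}

dtr <- rpart(Glucose ~ ., data = train)
rf  <- randomForest(Glucose ~ ., data = train)
lin <- lm(Glucose ~ ., data = train)
gbr <- gbm(Glucose ~ ., data = train, distribution = "gaussian", n.trees = 100)
knn_pred <- knn.reg(train = subset(train, select = -Glucose),
                    test  = subset(test,  select = -Glucose),
                    y = train$Glucose, k = 5)$pred

metrics(test$Glucose, predict(dtr, test))                 # Decision Tree Regression
metrics(test$Glucose, predict(rf, test))                  # Random Forest Regression
metrics(test$Glucose, knn_pred)                           # k-Nearest Neighbour
metrics(test$Glucose, predict(lin, test))                 # Linear Regression
metrics(test$Glucose, predict(gbr, test, n.trees = 100))  # Gradient Boosting Regression
```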
## [1] "Mean Squared Error of Decision Tree Regression: 529.095117888269"
## [1] "Mean Absolute Error of Decision Tree Regression: 18.0256136269797"
## [1] "R-squared Score of Decision Tree Regression: 0.418595456922064"
## [1] "Mean Squared Error of Random Forest Regression: 679.919487320478"
## [1] "Mean Absolute Error of Random Forest Regression: 20.203739987259"
## [1] "R-squared Score of Random Forest Regression: 0.249926579726261"
## [1] "Mean Squared Error of K Nearest Neightbour: 818.519222955829"
## [1] "Mean Absolute Error of K Nearest Neightbour: 21.4859102074236"
## [1] "R-squared Score of K Nearest Neightbour: 0.143514746563834"
## [1] "Mean Squared Error of Linear Regression: 672.17498163963"
## [1] "Mean Absolute Error of Linear Regression: 19.9637127671935"
## [1] "R-squared Score of Linear Regression: 0.261813948647866"
## [1] "Mean Squared Error: 458.346669723837"
## [1] "Mean Absolute Error: 16.720538126506"
## [1] "R-squared Score: 0.51289603524034"
Findings: From the accuracy graph plotted, it was revealed that the Decision Tree Classifier achieves an accuracy of 83.99%, the Naive Bayes Classifier 75%, and the Support Vector Machine 78.65%.
Findings: From the mean squared error graph plotted, it was revealed that the MSE for Decision Tree Regression is 652.92, followed by Random Forest Regression with an MSE of 664.07 and k-NN at about 753.52, whereas the MSE for Linear Regression and Gradient Boosting Regression are 672.25 and 479.65 respectively.
Findings: From the mean absolute error graph plotted, it was revealed that the MAE for Decision Tree Regression is 20.23, followed by Random Forest Regression with an MAE of 20.25 and k-NN at about 21.96, whereas the MAE for Linear Regression and Gradient Boosting Regression are 20.27 and 17.27 respectively.
Findings: From the Coefficient of Determination graph plotted, it was revealed that the R^2 for Decision Tree Regression is 0.3444, followed by Random Forest Regression with an R^2 of 0.3328 and k-NN at about 0.2685, whereas the R^2 for Linear Regression and Gradient Boosting Regression are 0.3252 and 0.5233 respectively.
Since GBR has the best performance in predicting glucose level, GBR is used to perform the feature importance analysis. This analysis was performed to determine which input variable has the highest impact on glucose level prediction.
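A short sketch of how the relative importance below might be obtained with the gbm package, assuming `gbr` is the fitted Gradient Boosting model from the regression sketch above; `scale. = TRUE` rescales the influence so that the most important variable equals 1, which matches the output format:

```r
library(gbm)
# Relative influence of each predictor on the predicted glucose level,
# scaled so that the top variable has a value of 1.
relative.influence(gbr, scale. = TRUE)
```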
Relative Importance based on GBR
## n.trees not given. Using 100 trees.
## Pregnancies BloodPressure SkinThickness
## 0.1470813 0.3016122 0.1867144
## Insulin BMI DiabetesPedigreeFunction
## 1.0000000 0.5604199 0.4774896
## Age
## 0.6493679
By inspecting the correlation matrix, we determined that the glucose level variable played the most significant role in predicting diabetes.
Additionally, other essential features such as body mass index (BMI), age, and pregnancies were identified, emphasizing their relevance in predicting diabetes.
These findings underscore the importance of these variables in achieving accurate diabetes predictions.
Consequently, prioritizing these features during diagnostic assessments may enhance the precision of diabetes prediction.
The classification model was developed using a dataset comprising diverse diabetes diagnostic variables to predict the occurrence of diabetes in an individual.
When the performance of the machine learning models was evaluated on the test dataset, the best-performing model achieved an accuracy of 83.99%.
From the accuracy graph plotted, it was revealed that the Decision Tree Classifier achieves an accuracy of 83.99%, followed by the Support Vector Machine (78.65%), whereas the accuracy of the Naive Bayes Classifier is the lowest at 75%.
The high accuracy of the Decision Tree Classifier suggests that the model is proficient at distinguishing between individuals with diabetes and those without the condition.
By inspecting the confusion matrix developed by Decision Tree Classifier, we observe that the model correctly classified 87.2% of patients without diabetes (true negatives) and 77.9% of patients with diabetes (true positives).
However, there were instances where the model misclassified 12.8% of patients without diabetes as having diabetes (false positives) and 22.1% of patients with diabetes as not having diabetes (false negatives).
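These percentages follow directly from the Decision Tree confusion matrix reported above, with 500 patients without diabetes and 268 patients with diabetes in the reference data:

$$
\frac{436}{436 + 64} = 0.872, \qquad \frac{209}{209 + 59} \approx 0.78,
$$

with the misclassification rates being the complements, $64/500 = 0.128$ and $59/268 \approx 0.22$.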
These misclassifications indicate that the model may face challenges in accurately distinguishing certain cases, highlighting the potential for further enhancements to improve its performance.
The regression model was developed using a dataset comprising diverse diabetes diagnostic variables to predict the glucose level in an individual.
Based on the MSE and MAE graphs plotted, it was revealed that k-NN has the highest MAE and MSE, followed by Linear Regression, Random Forest Regression, and Decision Tree Regression, while Gradient Boosting Regression has the lowest MAE and MSE. This indicates that Gradient Boosting Regression produces the least error compared with the other machine learning models.
Based on the R^2 score, Gradient Boosting Regression obtained the highest R^2 score, followed by Decision Tree Regression, Random Forest Regression, Linear Regression, and k-NN. This indicates that Gradient Boosting Regression displays the best goodness of fit with the dataset.
By comparing the MAE, MSE, and R^2 scores, we can conclude that Gradient Boosting Regression is the machine learning model with the best performance in predicting the glucose level of an individual.
Since Gradient Boosting Regression performs best in predicting glucose level, it is employed to study which variables have the highest impact on glucose level prediction.
By inspecting the feature importance analysis of Gradient Boosting Regression using the gbm package in R, it was shown that Insulin (1), Age (0.7234), and BMI (0.5629) are the top three variables with the highest impact in predicting the glucose level. This indicates that these three variables play an important role in predicting glucose level, and they should be captured especially in diabetes prediction and management for diabetes patients.
-1. Frankum, S., & Ogden, J. (2005). Estimation of blood glucose levels by people with diabetes: a cross-sectional study. British Journal of General Practice, 55(521), 944-948.