Group Members


DIABETES PREDICTION


Introduction

Diabetes has grown into a major global health issue, impacting millions of people across the globe. Early detection and accurate prediction of diabetes can play a crucial role in preventing complications and enhancing patient well-being. Within the realm of healthcare, machine learning algorithms have emerged as formidable tools, capable of accurately forecasting the risk of developing diabetes.

Machine learning algorithms use data analysis to identify patterns, correlations, and predictive indicators that may elude human observation. With these algorithms, healthcare professionals can draw on comprehensive patient data, including medical history, demographic information, lifestyle factors, and clinical measurements, to develop models that predict the likelihood of diabetes.

These algorithms can identify diabetes risk factors such as obesity, family history, sedentary lifestyle, and abnormal glucose levels. By examining the interplay among these variables, they generate personalized risk assessments that enable targeted interventions and preventive measures. The dataset used in this project contains the following variables:

  • Pregnancies
  • Glucose level
  • Blood pressure
  • Skin Thickness
  • Insulin level
  • BMI
  • Diabetes Pedigree Function
  • Age
  • Outcome

Diabetes prediction is vital for improving patient outcomes, enhancing preventive care strategies, and optimizing diabetes management. In addition, glucose level is an important indicator of an individual's risk of developing diabetes. The research questions for this project are therefore as follows:

  • 1. How can we predict the occurrence of diabetes in an individual?
  • 2. How can we predict the glucose level of an individual and identify the factors that have the highest impact on glucose level prediction?

Objectives

In this project, we aim to predict the occurrence of diabetes in an individual based on the input variables. The objectives of this study are as follows:

  • 1. To predict the occurrence of diabetes using classification models.
  • 2. To predict the glucose level using regression models.
  • 3. To evaluate the impact of each input variable through feature importance analysis.

Methodology

This section summarizes the process followed throughout this study.

  • Data Collection: The dataset was obtained from Kaggle and originates from the National Institute of Diabetes and Digestive and Kidney Diseases. Its purpose is to predict, from diagnostic measurements, whether a patient has diabetes.

  • Data Preprocessing: The collected data undergoes preprocessing to ensure its quality and suitability for analysis. This crucial step involves various tasks such as handling missing values, addressing outliers, and normalizing or standardizing variables.

  • Model Training: Model training is divided into two parts: classification modelling and regression modelling. Classification models were employed to predict the occurrence of diabetes in an individual, while regression models were used to predict the individual's glucose level. The preprocessed data is used to train the machine learning algorithms, enabling them to learn patterns and relationships between the input variables and the target variable. The classification models employed are Decision Tree Classifier (DTC), Naive Bayes Classifier (Naive) and Support Vector Machine (SVM). The regression models employed are Decision Tree Regression (DTR), Random Forest Regression (RF), k-Nearest Neighbour (kNN), Linear Regression, and Gradient Boosting Regression (GBR).

  • Model Evaluation: Model evaluation assesses each model's predictive capability and its ability to generalize to unseen data. Classification models are evaluated by accuracy and the confusion matrix. Regression models are evaluated by Mean Squared Error (MSE), Mean Absolute Error (MAE) and the Coefficient of Determination (R^2).
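The three regression metrics named above can be written out directly. The following is a minimal base R sketch; the function names and the toy vectors are illustrative assumptions and are not taken from the project code.

```r
# Regression metrics written out in base R (no packages needed).
# Function names are hypothetical helpers for illustration.
mse <- function(actual, predicted) mean((actual - predicted)^2)
mae <- function(actual, predicted) mean(abs(actual - predicted))
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)      # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)   # total sum of squares
  1 - ss_res / ss_tot
}

# Toy values for illustration only (not taken from the dataset).
actual    <- c(100, 120, 140, 160)
predicted <- c(110, 115, 150, 155)

mse(actual, predicted)        # 62.5
mae(actual, predicted)        # 7.5
r_squared(actual, predicted)  # 0.875
```

Lower MSE and MAE indicate smaller prediction errors, while an R^2 closer to 1 indicates that the model explains more of the variance in the target.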


Data Preprocessing

Data Exploration

Check the dimensions, summary and structure of the dataset.

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
## [1] "Number of rows: 768"
## [1] "Number of columns: 9"
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

The summary of the dataset shows the minimum, first quartile, median, mean, third quartile and maximum of each variable. The dim() function shows that the dataset consists of 768 rows and 9 columns. The structure of the data indicates whether any data transformation is required. All nine variables are already stored as integer or numeric types, so no data transformation is necessary in this case.
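The exploration calls can be sketched as follows. Since the project code is not shown, this example rebuilds only the first ten rows from the str() output above; DiabetesPedigreeFunction is left out because only five of its values are printed. The object name `diabetes` is an assumption.

```r
# First ten rows of the dataset, copied from the str() output above.
# DiabetesPedigreeFunction is omitted because its values are truncated there.
diabetes <- data.frame(
  Pregnancies   = c(6, 1, 8, 1, 0, 5, 3, 10, 2, 8),
  Glucose       = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0),
  Age           = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54),
  Outcome       = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1)
)

dim(diabetes)           # number of rows and columns
str(diabetes)           # type and first values of each column
summary(diabetes)       # min, quartiles, median, mean and max per column
colSums(diabetes == 0)  # count of zero values per column
```

The zero counts are worth a look at this stage, because a value of 0 for variables such as blood pressure or BMI is not physiologically plausible.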

Data Cleaning

The collected data is preprocessed to ensure its quality and suitability for analysis. In this dataset, missing measurements are recorded as zeros (for example, the minimum glucose level and BMI of 0 in the summary above), so the zero values in each affected variable were counted and replaced using mean imputation.

Count the missing (zero) values in glucose level and replace them using mean imputation.

## [1] 5

Count the missing (zero) values in blood pressure and replace them using mean imputation.

## [1] 35

Count the missing (zero) values in skin thickness and replace them using mean imputation.

## [1] 227

Count the missing (zero) values in insulin and replace them using mean imputation.

## [1] 374

Count the missing (zero) values in BMI and replace them using mean imputation.

## [1] 11
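Mean imputation of zero placeholders can be sketched as below. This is an illustration under assumptions, not the project's own code: the helper name `impute_zero_with_mean` is hypothetical, and the mean is taken over the non-zero values only.

```r
# Replace zero placeholders with the mean of the non-zero values.
# The helper name is hypothetical; the original code is not shown.
impute_zero_with_mean <- function(x) {
  is_missing <- x == 0
  x[is_missing] <- mean(x[!is_missing])
  x
}

# First ten BloodPressure and BMI values from the str() output above.
blood_pressure <- c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96)
bmi            <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)

sum(blood_pressure == 0)        # zero placeholders before imputation
blood_pressure_clean <- impute_zero_with_mean(blood_pressure)
bmi_clean            <- impute_zero_with_mean(bmi)
sum(blood_pressure_clean == 0)  # none remain after imputation
```

Whether the mean is computed over all values or only over the non-zero ones changes the imputed value, so the choice should be stated explicitly when reporting results.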

Results

Outline:

Exploratory Data Analysis (EDA)

Check the data summary, including the standard deviation and variance of each variable.

##   Pregnancies        Glucose       BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:20.54  
##  Median : 3.000   Median :117.00   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :121.68   Mean   : 72.25   Mean   :26.61  
##  3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :32.45   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
##              Pregnancies                  Glucose            BloodPressure 
##                3.3695781               30.4360156               12.1159316 
##            SkinThickness                  Insulin                      BMI 
##                9.6312407              115.2440024                6.8753735 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                0.3313286               11.7602315                0.4769514
##              Pregnancies                  Glucose            BloodPressure 
##             1.135406e+01             9.263510e+02             1.467958e+02 
##            SkinThickness                  Insulin                      BMI 
##             9.276080e+01             1.328118e+04             4.727076e+01 
## DiabetesPedigreeFunction                      Age                  Outcome 
##             1.097786e-01             1.383030e+02             2.274826e-01

Boxplots arranged as subplots are used to check the distribution of each variable.

Data Visualisation

Histogram for number of times being pregnant

Histogram for glucose level

Histogram for blood pressure

Histogram for skin thickness

Histogram for Insulin

Histogram for body mass index

Histogram for diabetes pedigree function

Histogram for age

Boxplot and Outlier Detection

Pregnancies

Glucose

Blood Pressure

Skin Thickness

Insulin

BMI

Diabetes Pedigree Function

Age

Correlation Analysis

Correlation Matrix for each of the Variables

Scatterplot Matrix for each of the variable

To check the number of individuals with diabetes

Findings: Based on the bar graph, the dataset contains 500 individuals without diabetes and 268 individuals with diabetes.

Categorized BMI based on WHO definition

According to the World Health Organization (WHO), adult BMI is classified into four categories: underweight (<18.5), normal weight (18.5-24.9), overweight (25.0-29.9) and obesity (30.0 and above).

Findings: In the dataset, individuals with obesity form the largest group, followed by overweight, normal weight and underweight individuals.
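The BMI categorization can be sketched with cut(). The example uses the first nine BMI values from the str() output (the tenth is a zero placeholder and is left out); the variable names are assumptions.

```r
# First nine BMI values from the str() output (the tenth is a zero placeholder).
bmi <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5)

# Bin BMI into the four WHO categories.
bmi_category <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obesity"),
  right  = FALSE   # intervals are [lower, upper), so 30 counts as obesity
)

table(bmi_category)  # counts per category, including empty ones
```

Setting right = FALSE makes each interval include its lower bound, which matches the convention that a BMI of exactly 25 is overweight and exactly 30 is obesity.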

Categorized glucose level based on WHO definition

According to the World Health Organization (WHO), a fasting plasma glucose level of 126 mg/dL (7.0 mmol/L) or higher indicates diabetes. This threshold is applied here to the glucose level variable.

Findings: The median glucose level in the dataset is 117, which is below the 126 threshold, so fewer than half of the individuals are classified as diabetic on the basis of glucose level alone.
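A glucose-based grouping can be sketched with ifelse(), here on the first ten glucose values from the str() output. Treating a value of exactly 126 as diabetic is an assumption of this sketch, as are the variable names.

```r
# First ten Glucose values from the str() output above.
glucose <- c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125)

# Label each individual using a threshold of 126
# (including the boundary value is an assumption).
glucose_category <- ifelse(glucose >= 126, "Diabetic", "Non-diabetic")

table(glucose_category)              # counts per group
prop.table(table(glucose_category))  # share of each group
```

Comparing these shares with the median of the variable is a quick consistency check: when the median lies below the threshold, fewer than half of the observations can fall in the diabetic group.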


Data Modelling

Part one

To predict the occurrence of diabetes using classification models.

## [1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

Decision Tree Classifier (DTC)

Confusion Matrix and Accuracy (DTC)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 436  59
##          1  64 209
##                                          
##                Accuracy : 0.8398         
##                  95% CI : (0.812, 0.8651)
##     No Information Rate : 0.651          
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.649          
##                                          
##  Mcnemar's Test P-Value : 0.7183         
##                                          
##             Sensitivity : 0.8720         
##             Specificity : 0.7799         
##          Pos Pred Value : 0.8808         
##          Neg Pred Value : 0.7656         
##              Prevalence : 0.6510         
##          Detection Rate : 0.5677         
##    Detection Prevalence : 0.6445         
##       Balanced Accuracy : 0.8259         
##                                          
##        'Positive' Class : 0              
## 
## Accuracy 
## 83.98438
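The headline statistics in the output above can be recomputed from the four cell counts. The sketch below uses base R only; the object names are assumptions, and class 0 is treated as the 'Positive' class to match the output.

```r
# Decision tree confusion matrix from the output above
# (rows = prediction, columns = reference).
cm <- matrix(
  c(436, 64, 59, 209),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # class 0 is the 'Positive' class
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Cohen's kappa: agreement beyond what is expected by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = cohen_kappa), 4)
```

Because class 0 (no diabetes) is the positive class here, sensitivity is the share of non-diabetic individuals classified correctly and specificity is the share of diabetic individuals classified correctly.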

Naive Bayes Classifier

Confusion Matrix and Accuracy (Naive Bayes)
##           
## pred_naive   0   1
##          0 412 100
##          1  88 168
## [1] 0.7552083
## Confusion Matrix and Statistics
## 
##           
## pred_naive   0   1
##          0 412 100
##          1  88 168
##                                           
##                Accuracy : 0.7552          
##                  95% CI : (0.7232, 0.7852)
##     No Information Rate : 0.651           
##     P-Value [Acc > NIR] : 3.033e-10       
##                                           
##                   Kappa : 0.4556          
##                                           
##  Mcnemar's Test P-Value : 0.4224          
##                                           
##             Sensitivity : 0.8240          
##             Specificity : 0.6269          
##          Pos Pred Value : 0.8047          
##          Neg Pred Value : 0.6562          
##              Prevalence : 0.6510          
##          Detection Rate : 0.5365          
##    Detection Prevalence : 0.6667          
##       Balanced Accuracy : 0.7254          
##                                           
##        'Positive' Class : 0               
## 
## Accuracy 
## 75.52083

Support Vector Machine (SVM)

Confusion Matrix and Accuracy (SVM)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 453 114
##          1  47 154
##                                           
##                Accuracy : 0.7904          
##                  95% CI : (0.7598, 0.8186)
##     No Information Rate : 0.651           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5102          
##                                           
##  Mcnemar's Test P-Value : 1.977e-07       
##                                           
##             Sensitivity : 0.9060          
##             Specificity : 0.5746          
##          Pos Pred Value : 0.7989          
##          Neg Pred Value : 0.7662          
##              Prevalence : 0.6510          
##          Detection Rate : 0.5898          
##    Detection Prevalence : 0.7383          
##       Balanced Accuracy : 0.7403          
##                                           
##        'Positive' Class : 0               
## 
## Accuracy 
## 79.03646

Part two

2. To predict the glucose level using regression models.

For the regression models that predict the glucose level, the diabetes outcome variable is dropped. According to Frankum & Ogden (2005), the majority of individuals with diabetes are unable to estimate their glucose level accurately. Therefore, in this study, the diabetes outcome is excluded from the predictors for glucose level prediction.
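Dropping the outcome column before fitting a regression model can be sketched as follows. Everything here is an assumption for illustration: the data are synthetic stand-ins rather than the real dataset, and the 80/20 split, the seed and the object names are not taken from the project code.

```r
set.seed(42)  # fixed seed so the split and the metrics are reproducible

# Synthetic stand-in data for illustration only (not the real dataset).
n <- 200
toy <- data.frame(
  Insulin = runif(n, 15, 300),
  BMI     = runif(n, 18, 45),
  Age     = sample(21:81, n, replace = TRUE),
  Outcome = rbinom(n, 1, 0.35)
)
toy$Glucose <- 80 + 0.1 * toy$Insulin + 0.5 * toy$BMI +
  0.3 * toy$Age + rnorm(n, sd = 15)

# Drop the diabetes outcome so it cannot be used as a predictor.
predictors <- toy[, setdiff(names(toy), "Outcome")]

# Hold out 20% of the rows for testing (the split ratio is an assumption).
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- predictors[train_idx, ]
test  <- predictors[-train_idx, ]

# Fit a linear regression on the training rows and score the held-out rows.
fit  <- lm(Glucose ~ ., data = train)
pred <- predict(fit, newdata = test)
test_mse <- mean((test$Glucose - pred)^2)
test_mse
```

Fixing the seed before splitting keeps the reported metrics stable between runs, which makes the figures in the text easier to reconcile with the printed output.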

Decision Tree Regression (DTR)

Mean Squared Error (MSE), Mean Absolute Error (MAE), Coefficient of Determination (R^2) for Decision Tree Regression
## [1] "Mean Squared Error of Decision Tree Regression: 529.095117888269"
## [1] "Mean Absolute Error of Decision Tree Regression: 18.0256136269797"
## [1] "R-squared Score of Decision Tree Regression: 0.418595456922064"

Random Forest Regression (RF)

Mean Squared Error (MSE), Mean Absolute Error (MAE), Coefficient of Determination (R^2) for Random Forest Regression
## [1] "Mean Squared Error of Random Forest Regression: 679.919487320478"
## [1] "Mean Absolute Error of Random Forest Regression: 20.203739987259"
## [1] "R-squared Score of Random Forest Regression: 0.249926579726261"

K Nearest Neighbour (k-NN)

Mean Squared Error (MSE), Mean Absolute Error (MAE), Coefficient of Determination (R^2) for K Nearest Neighbour
## [1] "Mean Squared Error of K Nearest Neightbour: 818.519222955829"
## [1] "Mean Absolute Error of K Nearest Neightbour: 21.4859102074236"
## [1] "R-squared Score of K Nearest Neightbour: 0.143514746563834"

Linear Regression

Mean Squared Error (MSE), Mean Absolute Error (MAE), Coefficient of Determination (R^2) for Linear Regression
## [1] "Mean Squared Error of Linear Regression: 672.17498163963"
## [1] "Mean Absolute Error of Linear Regression: 19.9637127671935"
## [1] "R-squared Score of Linear Regression: 0.261813948647866"

Gradient Boosting Regression (GBR)

Mean Squared Error (MSE), Mean Absolute Error (MAE), Coefficient of Determination (R^2) for Gradient Boosting Regression
## [1] "Mean Squared Error: 458.346669723837"
## [1] "Mean Absolute Error: 16.720538126506"
## [1] "R-squared Score: 0.51289603524034"

Model Evaluation

Part one: Classification Model Evaluation

Findings: From the accuracy graph, the Decision Tree Classifier achieves an accuracy of 83.98%, followed by the Support Vector Machine at 79.04% and the Naive Bayes Classifier at 75.52%.


Part two: Regression Model Evaluation

Findings: From the mean squared error graph, the MSE is 529.10 for Decision Tree Regression, 679.92 for Random Forest Regression, 818.52 for k-NN, 672.17 for Linear Regression and 458.35 for Gradient Boosting Regression.

Findings: From the mean absolute error graph, the MAE is 18.03 for Decision Tree Regression, 20.20 for Random Forest Regression, 21.49 for k-NN, 19.96 for Linear Regression and 16.72 for Gradient Boosting Regression.

Findings: From the Coefficient of Determination graph, the R^2 is 0.4186 for Decision Tree Regression, 0.2499 for Random Forest Regression, 0.1435 for k-NN, 0.2618 for Linear Regression and 0.5129 for Gradient Boosting Regression.
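The regression models can be ranked by collecting the metrics printed in the Data Modelling outputs into one table. This is a small sketch; the values are rounded from those outputs and the object names are assumptions.

```r
# Metrics rounded from the printed regression outputs in Data Modelling, Part two.
results <- data.frame(
  model = c("Decision Tree", "Random Forest", "k-NN",
            "Linear Regression", "Gradient Boosting"),
  mse   = c(529.095, 679.919, 818.519, 672.175, 458.347),
  mae   = c(18.026, 20.204, 21.486, 19.964, 16.721),
  r2    = c(0.4186, 0.2499, 0.1435, 0.2618, 0.5129),
  stringsAsFactors = FALSE
)

# Sort from lowest to highest error.
ranked <- results[order(results$mse), ]
ranked

# Model with the lowest MSE.
best_model <- results$model[which.min(results$mse)]
best_model
```

Sorting by MAE, or by R^2 in decreasing order, gives the same ordering for these values, so the three metrics agree on the ranking.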

Since GBR has the best performance in predicting glucose level, it is used for the feature importance analysis. This analysis determines which input variable has the highest impact on glucose level prediction.

Part three: Feature Importance Analysis Based on the Best-Performing Machine Learning Model

Relative Importance based on GBR

## n.trees not given. Using 100 trees.
##              Pregnancies            BloodPressure            SkinThickness 
##                0.1470813                0.3016122                0.1867144 
##                  Insulin                      BMI DiabetesPedigreeFunction 
##                1.0000000                0.5604199                0.4774896 
##                      Age 
##                0.6493679
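The relative importance values printed above appear to be scaled so that the largest value equals 1. Ranking them is a one-line sort, sketched here with the printed values; the object names are assumptions.

```r
# Relative importance values copied from the output above.
importance <- c(
  Pregnancies = 0.1470813, BloodPressure = 0.3016122,
  SkinThickness = 0.1867144, Insulin = 1.0000000,
  BMI = 0.5604199, DiabetesPedigreeFunction = 0.4774896,
  Age = 0.6493679
)

# Sort from most to least important and keep the top three.
ranked <- sort(importance, decreasing = TRUE)
top_three <- head(ranked, 3)
top_three
```

The top three variables by this ranking are Insulin, Age and BMI.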

Discussion of Output

Discussion of Output-Part one

  • By inspecting the correlation matrix, we determined that glucose level is the variable most strongly associated with diabetes.

  • Additionally, body mass index (BMI), age and pregnancies were identified as other important features for predicting diabetes.

  • These findings underscore the importance of these variables in achieving accurate diabetes predictions.

  • Consequently, prioritizing these features during diagnostic assessments may enhance the precision of diabetes prediction.

  • The classification model was developed using a dataset comprising diverse diabetes diagnostic variables to predict the occurrence of diabetes in an individual.

  • When the classification models were evaluated, the best model achieved an accuracy of 83.98%. The confusion matrices reported above total 768 observations, which is the full dataset rather than a held-out test set, so this accuracy may overstate performance on unseen data.

  • From the accuracy graph, the Decision Tree Classifier achieves the highest accuracy at 83.98%, followed by the Support Vector Machine at 79.04%, while the Naive Bayes Classifier has the lowest accuracy at 75.52%.

  • The high accuracy of the Decision Tree Classifier suggests that the model is proficient at distinguishing between individuals with diabetes and those without the condition.

  • By inspecting the confusion matrix of the Decision Tree Classifier, we observe that the model correctly classified 87.2% of patients without diabetes (436 of 500) and 78.0% of patients with diabetes (209 of 268). Because the output treats class 0 as the 'Positive' class, these figures appear there as sensitivity and specificity respectively.

  • However, the model misclassified 12.8% of patients without diabetes as having diabetes (64 of 500) and 22.0% of patients with diabetes as not having diabetes (59 of 268).

  • These misclassifications indicate that the model may face challenges in accurately distinguishing certain cases, highlighting the potential for further enhancements to improve its performance.

Discussion of Output-Part two & three

  • The regression model was developed using a dataset comprising diverse diabetes diagnostic variables to predict the glucose level in an individual.

  • Based on the MSE and MAE graphs, k-NN has the highest MSE and MAE, followed by Random Forest Regression, Linear Regression and Decision Tree Regression, while Gradient Boosting Regression has the lowest. This indicates that Gradient Boosting Regression produces the least error compared with the other machine learning models.

  • Based on the R^2 score, Gradient Boosting Regression obtained the highest value, followed by Decision Tree Regression, Linear Regression, Random Forest Regression and k-NN. This indicates that Gradient Boosting Regression gives the best goodness of fit to the dataset.

  • Comparing the MAE, MSE and R^2 scores, we conclude that Gradient Boosting Regression is the best-performing machine learning model for predicting an individual's glucose level.

  • Gradient Boosting Regression was therefore employed to study which variables have the highest impact on glucose level prediction.

  • The feature importance analysis of Gradient Boosting Regression, performed with the gbm package in R, shows that Insulin (1.0000), Age (0.6494) and BMI (0.5604) are the three variables with the highest impact on glucose level prediction. These three variables therefore play an important role in predicting glucose level and should be captured in diabetes prediction and management for diabetes patients.


Conclusions


References

1. Frankum, S., & Ogden, J. (2005). Estimation of blood glucose levels by people with diabetes: a cross-sectional study. British Journal of General Practice, 55(521), 944-948.