1. Introduction

This study is the second part of the Diabetes Prediction Machine Learning Model research. Here is the link to the first study, in which I leveraged a Logistic Regression Model on the open-source dataset.

In this study, I will use a Random Forest Model on the exact same dataset and evaluate how well it performs on the prediction task.

Going over the variables

The dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains the following variables, which are treated as diagnostic measurements for predicting whether a subject has diabetes.

  • 1. pregnancies : This variable indicates the number of pregnancies a woman has had, regardless of whether each pregnancy resulted in a live birth, a miscarriage, or a stillbirth.

  • 2. glucose : This variable measures the plasma glucose concentration at 2 hours in an oral glucose tolerance test.

  • 3. blood pressure (mm Hg) : This variable refers to blood pressure, which is another risk factor for diabetes.

  • 4. skin thickness (mm) : This variable refers to the thickness of the skinfold.

  • 5. insulin (mu U/ml) : Values here represent the insulin level of the tested patients.

  • 6. bmi : The acronym for Body Mass Index. Empirically, a higher BMI is associated with an increased risk of diabetes.

  • 7. diabetes pedigree function : This variable is a measure of an individual’s risk of developing type 2 diabetes based on their family history. A higher value indicates a higher risk, and a value over 2.0 would be considered high risk.

  • 8. age (years) : This variable is more straightforward: it stands for the age of each subject at the time they participated in the test.

The classification target is outcome: 1 (positive) indicates that the subject is a diabetic patient, whereas 0 (negative) means that the subject is not.

What the dataset looks like

The first 5 rows of the dataset



2. Building the Random Forest Model on the training dataset

There are two important points that should be taken into account before working out an optimal Random Forest Model on the training dataset: (1) the number of variables randomly sampled at each split (mtry) and (2) the number of trees (ntree).

The optimal values can be determined algorithmically, as will be done below.

The Initial Model

From the output of the first trial, the default number of trees is 500 and the default number of variables at each split is 2. However, we don’t yet know whether these are the optimal values.

The OOB (Out-Of-Bag) error rate is 11.65%, which already indicates a well-performing model. See the linked explanation of the Out-of-Bag error to learn more about the OOB error.

Upon examining the confusion matrix, 333 non-patients are correctly labelled as non-diabetic, whereas 30 non-patients are incorrectly predicted as diabetic by Model 1. This translates to a class error rate of 0.083, which can equivalently be read as a specificity (true negative rate) of 0.917.

Likewise, 160 diabetes patients are accurately classified by Model 1, while 35 actual patients are erroneously labelled as non-diabetic, resulting in a sensitivity (true positive rate) of 0.821.
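Below is a minimal sketch of how such an initial model could be fitted with the randomForest package in R. The object and column names (train, outcome, model_1) and the seed are illustrative assumptions, not necessarily the ones used in the original scripts.

library(randomForest)

set.seed(123)                          # illustrative seed for reproducible OOB estimates
model_1 <- randomForest(outcome ~ ., data = train)

print(model_1)                         # reports ntree (500 by default), mtry and the OOB error rate
model_1$confusion                      # confusion matrix with per-class error rates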

Define the best number of trees

The above visual analysis of the error rates reveals that the error rates for both the positive and the negative class, as well as the OOB error rate, stabilize as the number of trees approaches 500. However, we can still experiment with a higher number of trees in the subsequent model to further explore its performance.
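As a sketch of how this error-rate plot could be produced (assuming model_1 is the illustrative object fitted above): plotting a fitted randomForest object draws the OOB error and the two class-wise error rates against the number of trees.

plot(model_1, main = "Error rates vs. number of trees")
legend("topright", colnames(model_1$err.rate), col = 1:3, lty = 1:3)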

Define the best number of variables at each split

| # of Variables | OOB Error |
| --- | --- |
| 1 | 0.1183 |
| 2 | 0.1111 |
| 3 | 0.1129 |
| 4 | 0.1129 |
| 5 | 0.1165 |
| 6 | 0.1147 |
| 7 | 0.1147 |
| 8 | 0.1111 |

It is clear that the lowest OOB error rate (0.1111) is achieved with both the default value of 2 variables and with 8 variables; however, we could not have known that the default was optimal until we had verified it.

Besides, the error rates are quite similar across the different numbers of variables.
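A possible sketch of the mtry search summarised in the table above: refit the forest for each candidate number of variables from 1 to 8 and record the OOB error after the last tree (train and outcome are the illustrative names used earlier; the default 500 trees and the seed are assumptions).

oob_errors <- sapply(1:8, function(m) {
  set.seed(123)                                        # keep the comparison reproducible
  rf <- randomForest(outcome ~ ., data = train, mtry = m)
  rf$err.rate[rf$ntree, "OOB"]                         # OOB error after the final tree
})
data.frame(mtry = 1:8, oob_error = round(oob_errors, 4))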

The tuned model contains 1,000 trees with 2 variables tried at each split. Its OOB error rate drops from 11.65% to 11.47%, a marginal improvement; both models perform quite well.

As this Random Forest Model (Model 2) is the one tuned on our training dataset, we can examine the importance of each variable to better understand which variables have the most impact on the classification task.

From the plot, we can clearly observe that glucose has the greatest impact compared to the other variables, followed by age and bmi.

Empirically, an elevated glucose level is more a consequence of having diabetes than a cause of it.

In the previous Logistic Regression study, the diabetes pedigree function was the most influential variable for predicting the diabetes outcome, whereas it ranks only fourth in importance in the current model.

More detailed explanations of MDA (Mean Decrease Accuracy) and MDG (Mean Decrease Gini) can be found online.
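The sketch below shows how the tuned forest and its importance measures could be produced; importance = TRUE is needed so that Mean Decrease Accuracy is recorded alongside Mean Decrease Gini (model_2 and the other names remain illustrative).

set.seed(123)
model_2 <- randomForest(outcome ~ ., data = train,
                        ntree = 1000, mtry = 2, importance = TRUE)

importance(model_2)       # MDA and MDG scores for each variable
varImpPlot(model_2)       # dot charts ranking the variables by both measures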



3. Test the prediction accuracy of the tweaked Model 2

The test dataset is now loaded and fed to the tuned model to check its performance.

3.1 Metric 1 of the Model: Accuracy (Hit Rate)

print(hit.rate)
## [1] 0.8708333

The hit rate is 87.08%, which indicates fairly good prediction of the outcome on the test dataset.

If we tabulate the confusion matrix, the same hit rate can be worked out.

\[Hit\ Rate = (148 + 61) / (14 + 61 + 148 + 17) = 0.8708\]
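A sketch of this calculation, assuming the test data sit in a data frame called test with the same outcome column as the training data, and model_2 is the tuned forest from the earlier sketch:

pred_class <- predict(model_2, newdata = test, type = "response")
conf_mat   <- table(actual = test$outcome, predicted = pred_class)

hit.rate <- sum(diag(conf_mat)) / sum(conf_mat)        # (148 + 61) / 240
print(hit.rate)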

3.2 Metric 2 of the Model: AUC

print(auc)
## [1] 0.9359596

The ROC curve of the model: the Area Under the Curve of the tuned Random Forest Model is 0.9360.
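One way to reproduce this metric is sketched below using the pROC package (an assumption; other packages such as ROCR would work equally well). The ROC curve and AUC are computed from the predicted class probabilities rather than the hard labels.

library(pROC)

pred_prob <- predict(model_2, newdata = test, type = "prob")[, "1"]   # probability of class 1
roc_obj   <- roc(response = test$outcome, predictor = pred_prob)

plot(roc_obj)                          # the ROC curve
auc <- auc(roc_obj)
print(auc)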

3.3 Predictive Accuracy Comparison Table

| model_name | hit_rate | AUC_value |
| --- | --- | --- |
| Logistic Regression Model 2 | 0.7583 | 0.8198 |
| Logistic Regression Model 3 | 0.7500 | 0.8293 |
| Random Forest Model | 0.8708 | 0.9360 |

The Random Forest Model has better predictive performance on the test dataset in terms of both hit rate and AUC.



4. Prediction in practice

In this study, we will use the same two subjects that we created in the Logistic Regression Model study.

The only difference between the two subjects is the value of the diabetes pedigree function: the first subject has a value of 0.51, while the second has 2.51.

| Subject | pregnancies | glucose | blood_pressure | skin_thickness | insulin | bmi | diabetes_pedigree_function | age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sub 01 | 3 | 121.31 | 71.27 | 29.06 | 115.93 | 32.81 | 0.51 | 31 |
| Sub 02 | 3 | 121.31 | 71.27 | 29.06 | 115.93 | 32.81 | 2.51 | 31 |

print(prediction)
## 1 2 
## 0 1 
## Levels: 0 1

The Random Forest Model classifies the first subject as non-diabetic, since the predicted outcome is 0, while the second subject is classified as diabetic, since the predicted outcome is 1.

The prediction results are the same as what the previous Logistic Regression Model provided.

It is also possible to work out the class probabilities behind these predictions; in this case each subject is predicted as non-diabetic, or as diabetic, with a probability of 100%.
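A sketch of this prediction step, assuming the two subjects are assembled into a data frame whose column names match the training data (all names here are illustrative):

new_subjects <- data.frame(
  pregnancies                = c(3, 3),
  glucose                    = c(121.31, 121.31),
  blood_pressure             = c(71.27, 71.27),
  skin_thickness             = c(29.06, 29.06),
  insulin                    = c(115.93, 115.93),
  bmi                        = c(32.81, 32.81),
  diabetes_pedigree_function = c(0.51, 2.51),
  age                        = c(31, 31)
)

prediction <- predict(model_2, newdata = new_subjects, type = "response")
print(prediction)                                         # predicted classes: 0 and 1

predict(model_2, newdata = new_subjects, type = "prob")   # class probabilities per subject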



5. Conclusions

Compared to the Logistic Regression Model, the Random Forest Model exhibits better prediction accuracy for this specific dataset.

There is no general conclusion about which model is better than the other; the choice always depends on the context of the problem and any additional constraints or insights.

Both case studies demonstrate that the relative importance of variables can vary across classification models. In one model, certain variables may be more influential in predicting the outcome, while in another, different variables may play a more prominent role.

The Random Forest Model has many advantages for classification tasks, and it is widely applied in banking, stock trading, medicine, bioscience, e-commerce and many other industries, although it works like a “black box”, making it more onerous to interpret in layman's terms than a Logistic Regression Model.



6. Disclaimer

The objective of Studies I and II is to train machine learning models for research purposes only.

The results do not necessarily reflect the real-world situation of predicting diabetes for any individual patient.

The findings of both studies are not intended to be used for any commercial or diagnostic purposes. I am not liable for any conclusions derived from this research for any other purposes.

Last but not least, I publish this case study in this repository as a showcase of my advanced data analytics abilities, since working with data and drawing insights from complicated datasets is one of my great interests. 😄



7. Contact of the Author

Ned lin MBA @ EMST Berlin

Email:

My LinkedIn