This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. This particular dataset is open-source and freely accessible online for research purposes.
All patients included in this dataset are Pima Indians (a subgroup of Native Americans) and are females aged 21 years and older. The objective is to develop a robust machine learning model to predict whether a patient has diabetes based on diagnostic measurements.
We would like to know how accurately we can predict whether a subject has diabetes from the given diagnostic measurement variables.
The dataset includes eight (8) independent variables used to predict whether the tested patient has diabetes:
1. pregnancies : This variable indicates the number of pregnancies a woman has had, regardless of whether each pregnancy resulted in a live birth, a miscarriage, or a stillbirth.
2. glucose : This variable measures the plasma glucose concentration 2 hours into an oral glucose tolerance test.
3. blood pressure (mm Hg) : This variable records diastolic blood pressure. High blood pressure is another risk factor for diabetes.
4. skin thickness (mm) : This variable refers to the thickness of the skinfold, which is used as an indicator of body fat distribution and insulin sensitivity.
5. insulin (mu U/ml) : This variable records the insulin level of the tested patients. I could not find out under which circumstances the insulin values were collected, e.g. before or after meals, or before bedtime, and this may make a significant difference. For the moment we can set this disputed issue aside and keep the point open for further attention.
6. bmi : Body Mass Index, calculated as weight in kilograms divided by height in meters squared. Empirically, a higher BMI is associated with an increased risk of diabetes.
7. diabetes pedigree function : This variable is a measure of an individual’s risk of developing type 2 diabetes based on their family history. A higher value indicates a higher risk of developing type 2 diabetes. A value over 2.0 would be considered to be at a high risk.
8. age (years) : This variable is more straightforward: it is the age of the individual subject at the time she took part in the test.
As opposed to the aforementioned eight (8) independent variables, there is one key classifying variable, namely outcome. This variable takes only two values: 1 (positive) states that the subject is a diabetic patient, whereas 0 (negative) means that the subject is not.
| pregnancies | glucose | blood_pressure | skin_thickness | insulin | bmi | diabetes_pedigree_function | age | outcome |
|---|---|---|---|---|---|---|---|---|
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 103 | 30 | 38 | 83 | 43.3 | 0.183 | 33 | 0 |
| 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 1 | 189 | 60 | 23 | 846 | 30.1 | 0.398 | 59 | 1 |
| 5 | 166 | 72 | 19 | 175 | 25.8 | 0.587 | 51 | 1 |
| 0 | 118 | 84 | 47 | 230 | 45.8 | 0.551 | 31 | 1 |
The first 10 rows of observations in the dataset.
The initial dataset has 798 observations (rows) and 9 variables (columns).
In this presentation, the main analytic method for classification is Logistic Regression. The whole study consists of two chapters; in the second, separately presented study, I intend to apply Random Forest to the same dataset.
The given initial dataset will be split into a training dataset and a test dataset. The training dataset will be used to build the best-fitting and most accurate Logistic Regression model, while the test dataset will be used to test the predictive accuracy of the trained model.
Exactly the same two datasets are to be used for the Random Forest model as well, whose analytic process will be presented in “Diabetics Prediction Model Study II”, since a focal aim of the studies is to compare the predictive accuracy of these two machine learning models on this given dataset.
The objective of this study is solely to train machine learning models and to support research; it does not necessarily reflect the real-world situation of predicting diabetes for any individual patient. Meanwhile, the original open-source data may contain errors, which cannot be verified due to limitations in accessing the authentic database.
Last but not least, the study findings are not intended to be used for any commercial or diagnostic purposes. I am not liable for any conclusions derived from this research for any other purposes.
First, I cleaned the original dataset, which included fixing or removing incorrectly formatted, duplicate, and incomplete data. As I cannot communicate with the data owner or any subject matter experts, I could not confirm where and why the errors arose, or to what extent they may affect the subsequent studies.
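The cleaning step can be illustrated with a minimal Python sketch. It is not the code used in this study: the function name `clean_rows`, the three-column toy rows and the rule "a missing value is `None`" are all assumptions made for illustration. It only drops incomplete rows and exact duplicates, and it deliberately leaves zero values (such as an insulin of 0) untouched, because the study does not say how those were handled.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def clean_rows(rows: List[Row], columns: List[str]) -> List[Row]:
    """Drop incomplete rows and exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in rows:
        # A row is incomplete if any expected column is absent or None.
        if any(row.get(col) is None for col in columns):
            continue
        # Two rows are duplicates if they agree on every expected column.
        key = tuple(row[col] for col in columns)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical toy rows, for illustration only.
columns = ["glucose", "age", "outcome"]
raw = [
    {"glucose": 85, "age": 31, "outcome": 0},
    {"glucose": 85, "age": 31, "outcome": 0},    # exact duplicate
    {"glucose": None, "age": 26, "outcome": 1},  # incomplete
    {"glucose": 148, "age": 50, "outcome": 1},
]
cleaned = clean_rows(raw, columns)
print(len(cleaned))
```

Keeping the first occurrence of each duplicate preserves the original row order, which makes the cleaned data easier to compare against the raw file.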
Next, I am going to randomly select 70% of the original dataset as the training dataset and keep the remaining 30% as the test dataset, which will be used to evaluate the accuracy of the best-fit model to be developed.
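A reproducible 70/30 split might look like the following Python sketch. The function name `train_test_split`, the seed value and the use of integer row ids as stand-ins for the 798 observations are assumptions, not details taken from this study. Fixing the seed is what allows the very same two subsets to be reused for another model.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    rows: Sequence[T], train_fraction: float = 0.7, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the rows with a fixed seed, then cut it in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in row ids; a real run would pass the cleaned observations instead.
rows = list(range(798))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Calling the function again with the same seed returns the same partition, so two different models can be trained and tested on identical data.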
This model’s accuracy will be compared with that of the Random Forest in a separate study. I will use the same training and test datasets for both models to determine which one can make better predictions.
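Except for the variable outcome, which is binary, most of the remaining variables exhibit a roughly normal, bell-shaped distribution.
Only age and insulin are obviously skewed: most of their values are concentrated at the low end, with a long tail to the right (a right, or positive, skew).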
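The direction of a skew can also be checked numerically. The Python sketch below computes the moment coefficient of skewness; the function name `sample_skewness` is an assumption, and the input is only the insulin column of the ten rows displayed earlier, not the full dataset, so the number is illustrative rather than a result of this study. A positive coefficient means the bulk of the values sits on the left with a long tail to the right.

```python
from statistics import fmean
from typing import Sequence


def sample_skewness(values: Sequence[float]) -> float:
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Insulin values from the ten rows shown in the table of first observations.
insulin_first_rows = [0, 94, 168, 88, 0, 83, 543, 846, 175, 230]
skew = sample_skewness(insulin_first_rows)
print(f"skewness = {skew:.2f}")
```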
To explore whether the values of the measurement variables differ between diabetic and non-diabetic patients, we can use the multiple plots shown below.
What can we draw from the above visuals?
Although the group of actual patients appears to have more pregnancies, to be older and to have higher glucose levels than its counterpart, the differences in those variables between diabetic and non-diabetic subjects do not appear statistically significant from the plots alone, let alone those in the other metrics.
Besides, an apparent peak for the non-patient group occurs in the curves of pregnancies, glucose and age; empirically, however, pregnancies may be clearly positively correlated with age. Therefore, we cannot draw any outstanding findings from the presented visuals.
Which measurement variables have an obviously positive correlation with the classifying variable, outcome?
What remarkable insights can be gleaned from these visuals?
All models are built programmatically and refined over several iterations to achieve optimal performance.
The formula of generic Model 0 is \[p = \frac{1}{1 + e^{0.6214 + \text{error}}}\]
Model 1 takes glucose and age into account.
The best-fit formula of Model 1 is \[p = \frac{1}{1 + e^{7.9737 - 0.0400 \cdot \text{glucose} - 0.0739 \cdot \text{age} + \text{error}}}\] The formula shows how the predicted probability of a subject being a diabetic patient is calculated.
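To make the formulas concrete, the Python sketch below evaluates Model 0 and Model 1 with the coefficients printed above. The function names are assumptions, and the error term is left out because a point prediction uses only the fitted coefficients. The glucose and age values are those of the example subject introduced later in this study; the study itself scores her only with Model 3, so the Model 1 figure here is purely an illustration of the arithmetic.

```python
import math


def model0_probability() -> float:
    """Intercept-only model: every subject gets the same probability."""
    return 1.0 / (1.0 + math.exp(0.6214))


def model1_probability(glucose: float, age: float) -> float:
    """Model 1: probability of diabetes from glucose and age."""
    exponent = 7.9737 - 0.0400 * glucose - 0.0739 * age
    return 1.0 / (1.0 + math.exp(exponent))


p0 = model0_probability()
p1 = model1_probability(glucose=121.31, age=31)
print(f"Model 0: {p0:.3f}")
print(f"Model 1: {p1:.3f}")
```

Because both coefficients enter the exponent with a minus sign, a higher glucose level or a higher age shrinks the exponent and therefore raises the predicted probability.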
Model 2 takes glucose, bmi, diabetes pedigree function and age into account.
The best-fit formula of Model 2 is \[p = \frac{1}{1 + e^{10.0973 - 0.03692 \cdot \text{glucose} - 0.0559 \cdot \text{bmi} - 1.4338 \cdot \text{diabetes pedigree function} - 0.07084 \cdot \text{age} + \text{error}}}\]
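Model 4 takes pregnancies, insulin, bmi, diabetes pedigree function and age into account.
The formula of Model 4 is \[p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{pregnancies} + \beta_2 \cdot \text{insulin} + \beta_3 \cdot \text{bmi} + \beta_4 \cdot \text{diabetes pedigree function} + \beta_5 \cdot \text{age} + \text{error})}}\]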
The test dataset will be fed into the different prediction models to check their hit rates.
## [1] 0.6875 0.7417 0.7583 0.7500 0.7250
The Hit Rate is the proportion of all test subjects whose outcome, whether negative or positive, is predicted correctly. The five values above correspond to Model 0 through Model 4, in that order.
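A hit rate of this kind can be computed as in the Python sketch below. The 0.5 cut-off for turning a probability into a positive prediction is an assumption, since the study does not state which threshold was used, and the four probabilities and outcomes are made-up illustration data.

```python
from typing import Sequence


def hit_rate(
    probabilities: Sequence[float], outcomes: Sequence[int], threshold: float = 0.5
) -> float:
    """Share of subjects whose predicted class equals the actual outcome."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    # Predict positive (1) when the probability reaches the threshold.
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    hits = sum(1 for pred, actual in zip(predictions, outcomes) if pred == actual)
    return hits / len(outcomes)


# Hypothetical predicted probabilities and actual outcomes.
probs = [0.10, 0.80, 0.35, 0.60]
actual = [0, 1, 1, 1]
print(hit_rate(probs, actual))  # 3 of 4 predictions are correct -> 0.75
```

A model that gives every subject the same probability below the threshold predicts "negative" for everyone, so its hit rate simply equals the share of non-diabetic subjects in the test data.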
Subsequently, the identical test dataset will be reintroduced to calculate the Area Under the Curve (AUC) of each model, which is another critical metric for assessing the model’s predictive accuracy.
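The AUC has a simple interpretation: it is the probability that a randomly chosen diabetic subject receives a higher predicted score than a randomly chosen non-diabetic subject, with ties counted as one half. The Python sketch below computes it directly from that definition; the function name `auc` and the toy scores are assumptions, and the pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def auc(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: three of the four positive/negative pairs are ordered correctly.
print(auc([0.2, 0.9, 0.6, 0.4], [0, 1, 0, 1]))  # 0.75
# A constant score cannot rank anyone, so every pair is a tie.
print(auc([0.35, 0.35, 0.35, 0.35], [0, 1, 0, 1]))  # 0.5
```

The second call shows why a model that assigns the same probability to every subject always ends up with an AUC of exactly 0.5.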
Interpretations of ROC and AUC
↑The Area Under the Curve of model 0 is only 0.5↑
The result suggests that the prediction of this model is no better than a random guess.
↑The Area Under the Curve of model 1 is 0.7722↑
↑The Area Under the Curve of model 2 is 0.8198↑
The number indicates a pretty good predictive power.
↑The Area Under the Curve of model 3 is 0.8293↑
The number indicates a pretty good predictive power.
↑The Area Under the Curve of model 4 is 0.7694↑
| model_name | hit_rate | AUC_value |
|---|---|---|
| model 0 | 0.6875 | 0.5000 |
| model 1 | 0.7417 | 0.7722 |
| model 2 | 0.7583 | 0.8198 |
| model 3 | 0.7500 | 0.8293 |
| model 4 | 0.7250 | 0.7694 |
Based on the Hit Rate, and especially on the AUC of the ROC curve, it is not difficult to conclude that both Model 2 and Model 3 have rather good predictive power.
I will use these figures for comparison with the prediction accuracy of another machine learning classification model, Random Forest, in a separate study, to see which model suits this case better.
Let’s try the model. The measures of a random female Pima Indian subject are shown below:
| pregnancies | glucose | blood_pressure | skin_thickness | insulin | bmi | diabetes_pedigree_function | age |
|---|---|---|---|---|---|---|---|
| 3 | 121.31 | 71.27 | 29.06 | 115.93 | 32.81 | 0.51 | 31 |
What is the probability that she is a diabetic patient?
Putting the data into the strongly predictive Model 3 gives the following result:
## 1
## "28.8%"
Based on the given measures of this female Pima Indian, Model 3 predicts a 28.8% probability that she has diabetes, perhaps due to her low diabetes pedigree function value.
What if another female subject from the same subgroup has identical measures on all metrics except diabetes pedigree function, which is 2.5 instead of 0.51? What is the probability of her being a diabetic patient?
| pregnancies | glucose | blood_pressure | skin_thickness | insulin | bmi | diabetes_pedigree_function | age |
|---|---|---|---|---|---|---|---|
| 3 | 121.31 | 71.27 | 29.06 | 115.93 | 32.81 | 2.5 | 31 |
The prediction result is:
## 1
## "88.4%"
Model 3 predicts an 88.4% probability of diabetes. Why does diabetes pedigree function have such a striking influence on the prediction?
The following confidence intervals of the variables, expressed as odds ratios, reveal the reason.
## 2.5 % 97.5 %
## (Intercept) 6.313267e-06 0.0003159715
## pregnancies 9.830843e-01 1.1913753028
## glucose 1.031521e+00 1.0511947258
## blood_pressure 9.713064e-01 1.0132755378
## skin_thickness 9.832566e-01 1.0440032497
## insulin 9.965176e-01 1.0006493050
## bmi 1.008959e+00 1.1126250668
## diabetes_pedigree_function 2.094397e+00 9.3011900959
## age 1.020779e+00 1.0917616911
Every unit increase in diabetes pedigree function is associated with an increase in the odds of having diabetes of between 109% and 830% for a subject from this subgroup, which indicates that this variable has the strongest per-unit effect on the predicted outcome.
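The percentages follow directly from the interval bounds, and the two example predictions can be used as a rough cross-check. The Python sketch below does both; the helper names are assumptions, and because the 28.8% and 88.4% figures are rounded, the implied odds ratio is only approximate. An odds ratio of r per unit corresponds to a change in odds of (r - 1) × 100%.

```python
import math


def percent_change_in_odds(odds_ratio: float) -> float:
    """Convert an odds ratio into a percentage change in the odds."""
    return (odds_ratio - 1.0) * 100.0


def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


# Bounds of the interval for diabetes_pedigree_function, as printed above.
ci_low, ci_high = 2.094397, 9.3011900959
low_pct = percent_change_in_odds(ci_low)
high_pct = percent_change_in_odds(ci_high)
print(f"odds increase per unit: {low_pct:.0f}% to {high_pct:.0f}%")

# Cross-check: only diabetes pedigree function differs between the two
# example subjects (0.51 vs 2.5), so the shift in log-odds divided by the
# difference gives the per-unit coefficient implied by the two predictions.
delta_log_odds = logit(0.884) - logit(0.288)
implied_odds_ratio = math.exp(delta_log_odds / (2.5 - 0.51))
print(f"implied odds ratio per unit: {implied_odds_ratio:.2f}")
```

The implied odds ratio lies inside the printed interval, which is consistent with the jump from 28.8% to 88.4% being driven by diabetes pedigree function alone.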