Predicting for Diabetes in Humans

Non-Teaching Credit Course (NTCC) Project, 6th semester

Debayan Das
B.Tech Undergraduate, 3rd year
2015-2019

1 Statistical Treatment of the dataset

1.1 “No more sweets for you!”

   

Diabetes can be explained as a simple metabolic disorder, caused by a complex set of myriad reasons (See Medical News).

  • Insulin Signalling : All dietary carbohydrates are broken down into glucose in the small intestine, which is then absorbed into the blood stream. This will either be used as an immediate energy source for ATP synthesis or will be stored, as directed by insulin. The primary job of insulin is to maintain safe and steady blood glucose levels of around 80-100 mg/dl. So when blood glucose levels rise above 100, insulin is secreted by the pancreas. Then, insulin “picks up” the extra glucose out of the blood and takes it to a storage depot for safe keeping.

  • The most anabolic hormone in the human body : Insulin is a hormone released into the blood by an internal organ called the pancreas. Insulin functions in many ways as an anabolic or a storage hormone; in fact it’s been called the most anabolic hormone, even more potent than the Growth Hormone (Somatotrophic hormine). When insulin is released into the bloodstream, it acts to shuttle glucose (carbohydrates), amino acids, and blood fats into the cells of the body. However, there’ a slight problem one may face if one is trying to lose body fat and change body composition: insulin is indiscriminately anabolic and doesn’t care whether it helps with the building of muscle or the accumulation of fat.

  • Understanding the disorder :

    1. Firstly, there’s no way to avoid insulin in the blood. Whenever you eat food, insulin is released.

    2. Secondly, if you theoretically could eliminate insulin, you would abolish all of its anabolic effects and its ability to store energy in the muscle. In fact, type 1 diabetics don’t produce insulin and as a result, if left untreated, they die.

    3. If blood levels of insulin are always highly elevated, trouble results. Chronic elevation of insulin leads to large amounts of fat gain, risk for cardiovascular disease, and ultimately to type 2 diabetes. This second type of diabetes is characterized by obesity, cardiovascular disease, and the poor ability of the muscle to store nutrients, which leads to muscle wasting and tons of fat storage. This is called insulin resistance.

  • Insulin Resistance and type 2 diabetes: If you’re insulin resistant, your cells – especially the muscle cells – don’t respond to the anabolic effects of normal levels of insulin, i.e. they resist insulin’s effects. If this is the case, the body then releases massive amounts of insulin to promote nutrient storage in the resistant cells. Chronic high levels of insulin in the blood are very bad and can cause type 2 diabetes.

   

1.2 “Show me the data”

   

Dataset: diabetes.csv; Obtained from the UCI Machine Learning Repository

“The Pima Indians Diabetes Data Set was compiled by researchers at the Johns Hopkins University School of Medicine, from a larger database owned by the National Institute of Diabetes and Digestive and Kidney Diseases. Included in the dataset is information on females of Pima Indian heritage who were at least 21 years old at the time of data collection. Pima Indians have one of the highest rates of diabetes in the world, and the researchers at Johns Hopkins collected this dataset with the intention of creating a model that would predict the onset of diabetes in the Pima Indian population”

  1. It is a CSV-format file. It is extracted to the working directory.
  2. read.csv() reads the data from file to an dataframe object named ‘DATA’.

( I prefer to code defensively. Thereby, my script checks if the CSV file is present or not, in the directory, for further actions.)

if(!("diabetes.csv" %in% list.files()))
# if the CSV file is not present in the working directory by the default name,
        download.file("https://raw.githubusercontent.com/Dexter1618/NTCC_S06_PredictingForDiabetesInHumans/master/diabetes.csv",quiet = TRUE)
# reading the CSV file
DATA<-read.csv("./diabetes.csv")

The dataset has 768 observations and 9 variables.

The variables available are:

  1. Pregnancies: Number of pregnancies
  2. Glucose: Plasma Glucose Concentration
  3. BloodPressure: Diastolic Blood Pressure
  4. SkinThickness: Triceps skinfold thickness
  5. Insulin: 2-hour serum insulin, in mmU/ml
  6. BMI: Body Mass Index
  7. DiabetesPedigreeFunction
  8. Age: Patient age, in number of years

The last column, named Outcome contains either 1 or 0, indicating whether the person with the corresponding recorded values of the above variables actually has diabetes or not. This variable as been read as a column of integer, but it is more appropiate to treat it as a factor variable.

My comments:

  • observations collected and sampled randomly from an available population of Pima Indian females only.
  • therefore, this is an observational study
  • the results that I conclude with, can be generalized to the population of Pima Indian females, only.
  • causality between variables cannot be explored in this kind of study

   

Let’s explore the relationships that may exist amongst the variables themselves. Outcome is not considered in this exploration.

pairs(~Pregnancies+Glucose+BloodPressure+SkinThickness+Insulin+BMI+DiabetesPedigreeFunction+Age,data = DATA)

If that seems quite clumsy to look at and see any trends in data, let me show the pairs of variables which I found that show a distinct positive trend:

My comments:

  • for pair plots showing a single large clustering: the ‘X’ and ‘Y’ variables generally remains constant over time, within a range
  • for pair plots showing a single horizontal or vertical line with a broad and dense ‘base’ on the Y-axis and X-axis respectively: with increase in one variable, the other variable doesn’t change
  • Glucose and Insulin are correlated positively
  • BMI and SkinThickness are correlated positively
  • The pair plots show there are many observations where the variables have value = 0.

   

What about the missing values?

Such variables with 0 values can affect future analytical operations. So, they need to removed or imputed. Deletion seems a more straight-forward to deal with this issue, and not introduce more bias into the data, but we will be losing a lot of data-points. I decided to replace the 0 values from variables which cannot have zero-values, with NA. The respective variable columns were modified as:

DATA$Glucose[DATA$Glucose==0]<-NA
DATA$BloodPressure[DATA$BloodPressure==0]<-NA
DATA$SkinThickness[DATA$SkinThickness==0]<-NA
DATA$Insulin[DATA$Insulin==0]<-NA
DATA$BMI[DATA$BMI==0]<-NA
DATA$Age[DATA$Age==0]<-NA

   

We might as well explore the frequency distributions for the variables themselves. Outcome may not be considered in this exploration also. The distributions for Glucose, Blood Pressure, SkinThickness and DiabetesPedigreeFunction were almost normal (as in normal distribution, bell-shape curve). The distributions for Pregnancies, Insulin, BMI and Age were prominently left-skewed.

My comments:

  • The median number of pregnancies recorded in this dataset is 3.
  • The median age of a female whose data was collected in this dataset is 29 years.
  • The median recorded value of BMI for a Pima Indian female is 32.3. Based on the following BMI chart, a randomly sampled Pima Indian female is most likely to be obese.

   

For further analyses, we ought to divide the cleaned dataset into a training dataset and a testing dataset. We perform all our analyses and our modelling, and significance testing on the training dataset. We test for accuracy of our model on the untouched testing dataset.

set.seed(nrow(DATA))
randomSampling<-sort(sample(nrow(DATA), nrow(DATA)*0.75))
TRAIN_DATA<-DATA[randomSampling,]
TEST_DATA<-DATA[-randomSampling,]

The training set has 576 observations while the test set has the remaining 192 observations.

379 females do not have diabetes, while 197 of them actually have diabetes, from the training dataset.

Now, I have a training set which I would like to work with.

   

1.3 “So, what can help me tell whether a human has diabetes or not?”

   

1.3.1 Investigating my claims

Investigating claim 1: Glucose and Insulin are correlated
cor(TRAIN_DATA$Glucose,TRAIN_DATA$Insulin,use = "complete.obs")

[1] 0.6074488

CLAIM01_LM_CLEAN<-lm(Insulin~Glucose,data = TRAIN_DATA)
kable(coefficients(summary(CLAIM01_LM_CLEAN)))
Estimate Std. Error t value Pr(>|t|)
(Intercept) -139.254623 23.9090882 -5.824338 0
Glucose 2.449686 0.1900895 12.887008 0

For the linear model between Glucose and Insulin:

  • Residual standard error: 99.17 on 295 degrees of freedon
  • Adjusted R2-squared: 0.3332
  • F-statistic: 148.9 on 1 and 295 DF
  • p-value: <2.2e-16

Glucose explains 33.32% of the variation in Insulin

The relationship between Glucose and Insulin is very significant, and one can help predict the other.

 

Investigating claim 2: SkinThickness and BMI are correlated
cor(TRAIN_DATA$SkinThickness,TRAIN_DATA$BMI,use = "complete.obs")

[1] 0.6339048

CLAIM02_LM_CLEAN<-lm(BMI~SkinThickness,data = TRAIN_DATA)
kable(coefficients(summary(CLAIM02_LM_CLEAN)))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.9125610 0.7820577 26.74043 0
SkinThickness 0.4106586 0.0249274 16.47420 0

For the linear model between BMI and SkinThickness:

  • Residual standard error: 5.089 on 401 degrees of freedon
  • Adjusted R2-squared: 0.4045
  • F-statistic: 274.1 on 1 and 401 DF
  • p-value: <2.2e-16

SkinThickness explains 40.45% of the variation in BMI

The relationship between BMI and SkinThickness is very significant, and one can help predict the other.

 

1.3.2 Proposed logistic models to predict Outcome

I propose four statistical models, with different predictors for each case. And by proposing these four models, I perform the Hypothesis Testing:

  1. The basic standpoint or the NULL hypothesis H0 is that none of the predictors in a given model is helpful to predict Outcome.
  2. The opposing standpoint: my proposal that any one of the predictors in a model is actually helpful to predict Outcome, is called the ALTERNATIVE hypothesis HA .

Mathematically speaking, if the slope or coefficient estimate for a predictor is 0, it means they are not at all helpful in predicting Outcome. For H0, estimate for all predictors in a model will be 0. For HA, estimate will NOT be 0 for any one of the predictors.  

Interpretations of columns (in the four tabs)

  • estimate : coefficient/‘slope’ (approximated)
  • z value : how far it is from the mean (in terms of number of standard deviation for that distribution)
  • p-value: the probability of finding the observed, or more extreme, results when H0 is true; the choice of significance level at which you reject H0 is arbitrary. If p-value < 0.05, it means that the chance H0 is true is less than “1 in a 100 cases”.
  • Deviance of an observation is computed as -2log(likelihood of that observation).
  • Null deviance is calculated from the model with no features, i.e.,only intercept. The null model predicts class via a constant probability.
  • Residual deviance is calculated from the model having all the features. The larger the difference between null and residual deviance, better is the model.
  • AIC is explained in tab 1.3.3
Logistic model A (without Glucose & SkinThickness)
MODEL01<-glm(Outcome~Pregnancies+BloodPressure+Insulin+BMI+DiabetesPedigreeFunction+Age,data = TRAIN_DATA,family = binomial(link = 'logit'))
kable(coefficients(summary(MODEL01)))
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.2020458 1.0709911 -5.7909408 0.0000000
Pregnancies 0.1160128 0.0599940 1.9337391 0.0531452
BloodPressure 0.0060550 0.0120399 0.5029112 0.6150267
Insulin 0.0040732 0.0012934 3.1490779 0.0016379
BMI 0.0688191 0.0225460 3.0523873 0.0022703
DiabetesPedigreeFunction 0.9756651 0.4307878 2.2648391 0.0235226
Age 0.0360316 0.0195117 1.8466660 0.0647956

For this model:

  • Null deviance: 365.0033156 on 296 degrees of freedom
  • Residual deviance: 297.1826521 on 290 degrees of freedom
  • AIC: 311.1826521

 

Logistic model B (without Insulin & BMI)
MODEL02<-glm(Outcome~Pregnancies+BloodPressure+Glucose+SkinThickness+DiabetesPedigreeFunction+Age,data = TRAIN_DATA,family = binomial(link = 'logit'))
kable(coefficients(summary(MODEL02)))
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.5321190 1.0529082 -8.1033836 0.0000000
Pregnancies 0.1516311 0.0501437 3.0239313 0.0024951
BloodPressure -0.0010183 0.0116409 -0.0874737 0.9302950
Glucose 0.0406761 0.0051078 7.9634959 0.0000000
SkinThickness 0.0374375 0.0136574 2.7411793 0.0061219
DiabetesPedigreeFunction 1.0917085 0.3989687 2.7363258 0.0062129
Age 0.0171673 0.0155791 1.1019463 0.2704850

For this model:

  • Null deviance: 508.9399138 on 400 degrees of freedom
  • Residual deviance: 351.230079 on 394 degrees of freedom
  • AIC: 365.230079

 

Logistic model C (without Insulin & SkinThickness)
MODEL03<-glm(Outcome~Pregnancies+BloodPressure+Glucose+BMI+DiabetesPedigreeFunction+Age,data = TRAIN_DATA,family = binomial(link = 'logit'))
kable(coefficients(summary(MODEL03)))
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.4419210 0.9870814 -9.5654935 0.0000000
Pregnancies 0.1391540 0.0412401 3.3742408 0.0007402
BloodPressure -0.0082974 0.0103087 -0.8048895 0.4208835
Glucose 0.0382592 0.0042822 8.9345478 0.0000000
BMI 0.0923435 0.0188110 4.9090035 0.0000009
DiabetesPedigreeFunction 0.8218230 0.3642402 2.2562665 0.0240539
Age 0.0155685 0.0116290 1.3387709 0.1806453

For this model:

  • Null deviance: 688.8132705 on 542 degrees of freedom
  • Residual deviance: 485.713029 on 536 degrees of freedom
  • AIC: 499.713029

 

Logistic model D (without Glucose & BMI)
MODEL04<-glm(Outcome~Pregnancies+BloodPressure+Insulin+SkinThickness+DiabetesPedigreeFunction+Age,data = TRAIN_DATA,family = binomial(link = 'logit'))
kable(coefficients(summary(MODEL04)))
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.2808971 0.9600638 -5.500569 0.0000000
Pregnancies 0.0974778 0.0579646 1.681677 0.0926314
BloodPressure 0.0122639 0.0118350 1.036239 0.3000905
Insulin 0.0043021 0.0012888 3.337995 0.0008439
SkinThickness 0.0366044 0.0140425 2.606682 0.0091424
DiabetesPedigreeFunction 0.9183737 0.4143449 2.216448 0.0266609
Age 0.0333513 0.0194375 1.715816 0.0861957

For this model:

  • Null deviance: 365.8194896 on 297 degrees of freedom
  • Residual deviance: 300.742368 on 291 degrees of freedom
  • AIC: 314.742368

 

1.3.3 The parsimonious model to predict Outcome

  • The predictors are independent variables. If two predictors are correlated with each other, they are called collinear variables.

  • Including collinear variables complicates model estimation due to multi-collinearity. One must always avoid adding collinear variables to a model. They bring nothing new to the table.

  • Employ Occam’s razor: The model hypothesis that makes fewer assumptions (i.e uses fewer predictors) is the best one, amongst competing model hypotheses.

  • The simplest model hypothesis is called the Parsimonious model.

  • The parsimonious model has the highest predicting power and uses minimal number of predictors.

  • Criterion used to judge how well a model fits the data and can predict reliably good for the target variable: AIC (Akaike Information Criterion) You may read more about this criterion and what it is from here and here.

So, the model with the least value of AIC would be the parsimonious model amongst the four that I have proposed earlier.

  • We notice that Logistic Model A has the minimum possible value for AIC amongst the four.

Hence, we choose the predictors in Logistic Model A to be the reliable predictors for the target variable Outcome. We shall also use this model to find the significance of each predictor in the model, in predicting for Outcome.

 

1.3.4 “Can’t you automate the whole process to find the best model?”

We can. Let’s automate the process of finding the parsimonious model.

We may use the stepwise model selection algorithm to find the model.

  • Firstly, we decide on a model selection criterion. We have already chosen it as AIC.
  • We perform a backward stepwise model selection.
  • We start with a linear model, containing all the predictors in the dataset. We evaluate the AIC value for this model. Then we repeat the above step again, with an updated linear model containing one less predictor than the previous. We continue this till we are left with only one predictor to leave out.
  • The model with the least value for AIC would be the parsimonious model.
  • We may also do this algorithm in a forward manner, where we start with one predictor and continue adding a predictor till we obtain the full model with all available predictors.
  • The model with the least value for AIC would be the parsimonious model, again.
fullModel<-glm(Outcome~Pregnancies+Glucose+BloodPressure+SkinThickness+Insulin+BMI+DiabetesPedigreeFunction+Age,data=na.omit(TRAIN_DATA),family = binomial(link = 'logit'))
leastAIC_MODEL<-MASS::stepAIC(fullModel)

The algorithm gives the following predictors (on the right hand side) to be the most effective ones to constitute the parsimonious model.

leastAIC_MODEL$formula

Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction  

1.3.5 “Who’s significant?”

For Logistic Model A,

kable((anova(MODEL01,test = "Chisq")))
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL NA NA 285 365.0033 NA
Pregnancies 1 19.304327 284 345.6990 0.0000111
BloodPressure 1 5.846204 283 339.8528 0.0156107
Insulin 1 22.416327 282 317.4365 0.0000022
BMI 1 10.963070 281 306.4734 0.0009295
DiabetesPedigreeFunction 1 5.760091 280 300.7133 0.0163942
Age 1 3.530645 279 297.1827 0.0602443

A Chi-square ANOVA tells us Pregnancies, Insulin and BMI are signficant predictors for Outcome.

 

For the model selected by stepwise algorithm,

kable((anova(leastAIC_MODEL,test = "Chisq")))
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL NA NA 284 364.1836 NA
Pregnancies 1 19.059821 283 345.1238 0.0000127
Glucose 1 77.896094 282 267.2277 0.0000000
BMI 1 11.166498 281 256.0612 0.0008329
DiabetesPedigreeFunction 1 5.797067 280 250.2641 0.0160529

A Chi-square ANOVA tells us Pregnancies, Glucose and BMI are signficant predictors for Outcome.

 

1.3.6 “Let the model speak”

We ought to see how well our models can predict the Outcome.

I shall be using the test dataset now. Our models haven’t used the test data all this time and for good reason: they mustn’t be biased, in the sense they mustn’t already know what the real value for Outcome is.

 

First, let’s use Logistic Model A, the one which I proposed by hypothesis testing and evaluation using AIC.

prediction_MODEL01<-predict(MODEL01,TEST_DATA,type = "response")

Now, let’s use the logistic model obtained via the stepwise algorithm to find its predictions.

prediction_leastAIC_MODEL<-predict(leastAIC_MODEL,TEST_DATA,type = "response")

 

We have our arrays of predictions from both the models. Now we get to see how good they are. Since these prediction values are log of odds, we ought to set a cut-off value of probability = C. If prediction value >= C, we treat it as a YES or 1. Else, we treat it as NO or 0.

prediction_MODEL01<-factor(ifelse(prediction_MODEL01>0.8,1,0),levels = c(0,1))
prediction_leastAIC_MODEL<-factor(ifelse(prediction_leastAIC_MODEL>0.6,1,0),levels = c(0,1))

We ought to check for the accuracy of predictions for both of the models:

accuracy_MODEL01<-mean(prediction_MODEL01==TEST_DATA$Outcome,na.rm = TRUE)
accuracy_leastAIC_MODEL<-mean(prediction_leastAIC_MODEL==TEST_DATA$Outcome,na.rm = TRUE)

 

Logistic Model A gives an accuracy of 74.7663551% on TEST_DATA. The right-hand side of the following gives the predictors in Model A: Outcome ~ Pregnancies + BloodPressure + Insulin + BMI + DiabetesPedigreeFunction + Age

The model selected by stepwise algorithm gives an accuracy of 75.1322751% on TEST_DATA. The right-hand side of the following gives the predictors in this model: Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction

 

2 Using Machine Learning algorithms to predict

2.1 “Get all what you need”

   

  • The pandas library supports for dynamic and ‘very-fast’ operations on a dataset, which is stored as a matrix.
  • numpy stands for Numerical Python. It contains methods for performing statistical and general mathematical calculations.
  • matplotlib is a Python plotting library based on MATLAB.
  • sklearn is the Machine Learning library in Python.
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
DATA = pd.read_csv("https://raw.githubusercontent.com/Dexter1618/NTCC_S06_PredictingForDiabetesInHumans/master/diabetes.csv")

 

2.2 Splitting DATA

   

We need to divide the dataset DATA into training and testing blocks, just like we did before. However, since we will be using two different machine-learning models, we need to create different splits of DATA to judge and find the best model.

  • These splits account for all the predictors in DATA.
TRAIN_DATA_PREDICTORS, TEST_DATA_PREDICTORS, TRAIN_DATA_OUTCOME, TEST_DATA_OUTCOME = train_test_split(DATA.loc[:, DATA.columns != 'Outcome'], DATA['Outcome'], stratify=DATA['Outcome'], random_state=100)
  • These splits account for the predictors that Logistic Model A had. They are Pregnancies,BloodPressure,Insulin,BMI,DiabetesPedigreeFunction and Age.
TRAIN_DATA_BEST_PREDICTORS_01, TEST_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01, TEST_DATA_OUTCOME_BEST_PREDICTORS_01 = train_test_split(DATA[["Pregnancies","BloodPressure","Insulin","BMI","DiabetesPedigreeFunction","Age"]], DATA['Outcome'], stratify=DATA['Outcome'], random_state=100)
  • These splits account for the predictors that leastAIC_MODEL had. They are Pregnancies,Glucose,BMI, and DiabetesPedigreeFunction.
TRAIN_DATA_BEST_PREDICTORS_02, TEST_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02, TEST_DATA_OUTCOME_BEST_PREDICTORS_02 = train_test_split(DATA[["Pregnancies","Glucose","BMI","DiabetesPedigreeFunction"]], DATA['Outcome'], stratify=DATA['Outcome'], random_state=100)

 

2.3 “Love Thy k-NN”

   

The k-NN algorithm is arguably the simplest machine learning algorithm. Building the model consists only of storing the training dataset. To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset—its “nearest neighbors”.

I took a range of 1 - 20 neighbours and then tried finding that k-NN model that requires the least number of neighbours to correctly determine the value of Outcome.

2.3.1 “Let’s use all of them”

Here I used all the available 8 predictors in DATA.

from sklearn.neighbors import KNeighborsClassifier

training_accuracy = []
test_accuracy = []
neighbourCount = range(1, 20)

for i in neighbourCount:
    # build the model
    knn = KNeighborsClassifier(i)
    knn.fit(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)
    # record training set accuracy
    training_accuracy.append(knn.score(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME))
    # record test set accuracy
    test_accuracy.append(knn.score(TEST_DATA_PREDICTORS, TEST_DATA_OUTCOME))

A k-NN Classifier with k = 9 has the maximum accuracy.

plt.plot(neighbourCount, training_accuracy, label="Accuracy on TRAIN_DATA")
plt.plot(neighbourCount, test_accuracy, label="Accuracy on TEST_DATA")

 

2.3.2 Logistic Model A - predictors

Here I used 6 predictors as used by Logistic Model A.

training_accuracy = []
test_accuracy = []
neighbourCount = range(1, 20)

for i in neighbourCount:
    # build the model
    knn = KNeighborsClassifier(i)
    knn.fit(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01)
    # record training set accuracy
    training_accuracy.append(knn.score(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01))
    # record test set accuracy
    test_accuracy.append(knn.score(TEST_DATA_BEST_PREDICTORS_01, TEST_DATA_OUTCOME_BEST_PREDICTORS_01))

A k-NN Classifier with k = 6 has the maximum accuracy.

plt.plot(neighbourCount, training_accuracy, label="Accuracy on TRAIN_DATA")
plt.plot(neighbourCount, test_accuracy, label="Accuracy on TEST_DATA")

 

2.3.3 leastAIC_MODEL - predictors

Here I used 4 predictors as used by leastAIC_MODEL.

training_accuracy = []
test_accuracy = []

neighbourCount = range(1, 20)

for i in neighbourCount:
    # build the model
    knn = KNeighborsClassifier(i)
    knn.fit(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02)
    # record training set accuracy
    training_accuracy.append(knn.score(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02))
    # record test set accuracy
    test_accuracy.append(knn.score(TEST_DATA_BEST_PREDICTORS_02, TEST_DATA_OUTCOME_BEST_PREDICTORS_02))

A k-NN Classifier with k = 14 has the maximum accuracy.

plt.plot(neighbourCount, training_accuracy, label="Accuracy on TRAIN_DATA")
plt.plot(neighbourCount, test_accuracy, label="Accuracy on TEST_DATA")

 

2.3.4 Accuracy onTRAIN_DATA and TEST_DATA

Now, I need to evaluate the accuracy of predictions for the best k-NN models I found.

  • a k-NN model which uses 9 neighbours and all the predictors
  • a k-NN model which uses 7 neighbours and predictors in Logistic Model A
  • a k-NN model which uses 14 neighbours and predictors in leastAIC_MODEL
knn = KNeighborsClassifier(9)
knn.fit(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)

print("When K = 9 and with all predictors in dataset: ")
print('Accuracy of K-NN classifier on training set: {:.5f}'.format(knn.score(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)))
print('Accuracy of K-NN classifier on test set: {:.5f}'.format(knn.score(TEST_DATA_PREDICTORS, TEST_DATA_OUTCOME)))
knn = KNeighborsClassifier(7)
knn.fit(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01)

print("\nPREDICTORS USED:","Pregnancies","BloodPressure","Insulin","BMI","DiabetesPedigreeFunction","Age")
print("When K = 6 and with the specific predictors (logistic model A) in dataset: ")
print('Accuracy of K-NN classifier on training set: {:.5f}'.format(knn.score(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01)))
print('Accuracy of K-NN classifier on test set: {:.5f}'.format(knn.score(TEST_DATA_BEST_PREDICTORS_01, TEST_DATA_OUTCOME_BEST_PREDICTORS_01)))
knn = KNeighborsClassifier(14)
knn.fit(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02)

print("\nPREDICTORS USED:","Pregnancies","Glucose","BMI","DiabetesPedigreeFunction")
print("When K = 14 and with the specific predictors (leastAIC_MODEL) in dataset: ")
print('Accuracy of K-NN classifier on training set: {:.5f}'.format(knn.score(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02)))
print('Accuracy of K-NN classifier on test set: {:.5f}'.format(knn.score(TEST_DATA_BEST_PREDICTORS_02, TEST_DATA_OUTCOME_BEST_PREDICTORS_02)))
When K = 9 and with all predictors in dataset: 
Accuracy of K-NN classifier on training set: 0.79167
Accuracy of K-NN classifier on test set: 0.77083


PREDICTORS USED: Pregnancies BloodPressure Insulin BMI DiabetesPedigreeFunction Age
When K = 6 and with the specific predictors (logistic model A) in dataset: 
Accuracy of K-NN classifier on training set: 0.73264
Accuracy of K-NN classifier on test set: 0.73438


PREDICTORS USED: Pregnancies Glucose BMI DiabetesPedigreeFunction
When K = 14 and with the specific predictors (leastAIC_MODEL) in dataset: 
Accuracy of K-NN classifier on training set: 0.76910
Accuracy of K-NN classifier on test set: 0.77083

 

2.3.5 “Cross-validation accuracy”

I also evaluated the accuracy of these three k-NN models in-sample, via the cross-validation technique.

For the KNN model with 9 neighbours and all predictors in dataset:
10-fold cross validation average accuracy: 0.73445

For the KNN model with 6 neighbours and specific predictors in Logistic Model A:
10-fold cross validation average accuracy: 0.66313

For the KNN model with 14 neighbours and specific predictors in leastAIC_MODEL:
10-fold cross validation average accuracy: 0.73276

 

2.4 “The prodigal model returns”

   

2.4.1 “Go simple”

from sklearn.linear_model import LogisticRegression

The trade-off parameter of logistic regression that determines the strength of the regularization is called C. Higher values of C correspond to less regularization C is actually the Inverse of regularization strength(lambda)

print("\t\t For all predictors:\n")
print("For C = 1")
LogisticModel01 = LogisticRegression().fit(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)
print("Training set accuracy: {:.4f}".format(LogisticModel01.score(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)))
print("Test set accuracy: {:.4f}".format(LogisticModel01.score(TEST_DATA_PREDICTORS, TEST_DATA_OUTCOME)))
print("\nFor C = 0.01")
LogisticModel02 = LogisticRegression(C=0.01).fit(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)
print("Training set accuracy: {:.4f}".format(LogisticModel02.score(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)))
print("Test set accuracy: {:.4f}".format(LogisticModel02.score(TEST_DATA_PREDICTORS, TEST_DATA_OUTCOME)))
print("\nFor C = 100")
LogisticModel03 = LogisticRegression(C=100).fit(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)
print("Training set accuracy: {:.4f}".format(LogisticModel03.score(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)))
print("Test set accuracy: {:.4f}".format(LogisticModel03.score(TEST_DATA_PREDICTORS, TEST_DATA_OUTCOME)))
print("\n\n\t\t For specific predictors in logistic model A and C = 1:\n")
print("\nPREDICTORS USED:","Pregnancies","BloodPressure","Insulin","BMI","DiabetesPedigreeFunction","Age")
LogisticModelA = LogisticRegression().fit(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01)
print("Training set accuracy: {:.4f}".format(LogisticModelA.score(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01)))
print("Test set accuracy: {:.4f}".format(LogisticModelA.score(TEST_DATA_BEST_PREDICTORS_01, TEST_DATA_OUTCOME_BEST_PREDICTORS_01)))
print("\n\n\t\t For specific predictors in leastAIC_model and C = 1:\n")
print("\nPREDICTORS USED:","Pregnancies","Glucose","BMI","DiabetesPedigreeFunction")
leastAIC_MODEL = LogisticRegression().fit(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02)
print("Training set accuracy: {:.4f}".format(leastAIC_MODEL.score(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02)))
print("Test set accuracy: {:.4f}".format(leastAIC_MODEL.score(TEST_DATA_BEST_PREDICTORS_02, TEST_DATA_OUTCOME_BEST_PREDICTORS_02)))
         For all predictors:

For C = 1
Training set accuracy: 0.7778
Test set accuracy: 0.7708

For C = 0.01
Training set accuracy: 0.6979
Test set accuracy: 0.7240

For C = 100
Training set accuracy: 0.7847
Test set accuracy: 0.7708


         For specific predictors in logistic model A and C = 1:


PREDICTORS USED: Pregnancies BloodPressure Insulin BMI DiabetesPedigreeFunction Age
Training set accuracy: 0.7014
Test set accuracy: 0.6823


         For specific predictors in leastAIC_model and C = 1:


PREDICTORS USED: Pregnancies Glucose BMI DiabetesPedigreeFunction
Training set accuracy: 0.7708
Test set accuracy: 0.7812

 

2.4.2 In-sampling testing for the prodigal model

kfold = model_selection.KFold(n_splits=10, random_state=7)
scoring = 'accuracy'
print("For the LR model with C = 1 and all predictors in dataset:")
results = model_selection.cross_val_score(LogisticModel01, TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.5f" % (results.mean()))
print("For the LR model with C = 1 and specific predictors in Logistic Model A:")
results = model_selection.cross_val_score(LogisticModelA, TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.5f" % (results.mean()))
print("For the LR model with C = 1 and specific predictors in leastAIC_MODEL:")
results = model_selection.cross_val_score(leastAIC_MODEL, TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.5f" % (results.mean()))
For the LR model with C = 1 and all predictors in dataset:
10-fold cross validation average accuracy: 0.76576
For the LR model with C = 1 and specific predictors in Logistic Model A:
10-fold cross validation average accuracy: 0.68237
For the LR model with C = 1 and specific predictors in leastAIC_MODEL:
10-fold cross validation average accuracy: 0.75532

 

2.5 “Stay simple, folks”

   

After all what we need, let’s see what’s the best model to use on DATA to predict whether the female has diabetes or not.

If I use leastAIC_MODEL with its 4 chosen predictors and with a value of regularization strength = 1, the model obtains an accuracy of 77.08% on the TRAIN_DATA and an accuracy of 78.12% on TEST_DATA. That’s really good. Cross-validating leastAIC_MODEL on TRAIN_DATA gives an accuracy of 75.53%.

If I use a k-NN classifier with K = 9 and with all predictors available in DATA, the model obtains an accuracy of 79.167% on TRAIN_DATA and an accuracy of 77.08% on TEST_DATA. Cross-validating this k-NN classifier on TRAIN_DATA gives an accuracy of 73.45%.

The simplest model wins, unambigously.

Let’s mention Occam’s Razor again: The simplest model possible that makes the minimal number of assumptions always have the highest predicting power.

leastAIC_MODEL uses only 4 variables, 3 of which are very significant predictors of diabetes in Pima Indian females (Pregnancies, Glucose and BMI) and is able to predict whether a female has diabetes or not using unseen values for these 4 variables.

It has been able to accurately predict on TEST_DATA more than it has on TRAIN_DATA, and definitely more than its competer - the k-NN classifier with k = 9.