Diabetes can be explained as a simple metabolic disorder, caused by a complex set of myriad reasons (See Medical News).
Insulin Signalling : All dietary carbohydrates are broken down into glucose in the small intestine, which is then absorbed into the blood stream. This will either be used as an immediate energy source for ATP synthesis or will be stored, as directed by insulin. The primary job of insulin is to maintain safe and steady blood glucose levels of around 80-100 mg/dl. So when blood glucose levels rise above 100, insulin is secreted by the pancreas. Then, insulin “picks up” the extra glucose out of the blood and takes it to a storage depot for safe keeping.
The most anabolic hormone in the human body : Insulin is a hormone released into the blood by an internal organ called the pancreas. Insulin functions in many ways as an anabolic or a storage hormone; in fact it’s been called the most anabolic hormone, even more potent than the Growth Hormone (Somatotrophic hormine). When insulin is released into the bloodstream, it acts to shuttle glucose (carbohydrates), amino acids, and blood fats into the cells of the body. However, there’ a slight problem one may face if one is trying to lose body fat and change body composition: insulin is indiscriminately anabolic and doesn’t care whether it helps with the building of muscle or the accumulation of fat.
Understanding the disorder :
Firstly, there’s no way to avoid insulin in the blood. Whenever you eat food, insulin is released.
Secondly, if you theoretically could eliminate insulin, you would abolish all of its anabolic effects and its ability to store energy in the muscle. In fact, type 1 diabetics don’t produce insulin and as a result, if left untreated, they die.
If blood levels of insulin are always highly elevated, trouble results. Chronic elevation of insulin leads to large amounts of fat gain, risk for cardiovascular disease, and ultimately to type 2 diabetes. This second type of diabetes is characterized by obesity, cardiovascular disease, and the poor ability of the muscle to store nutrients, which leads to muscle wasting and tons of fat storage. This is called insulin resistance.
Insulin Resistance and type 2 diabetes: If you’re insulin resistant, your cells – especially the muscle cells – don’t respond to the anabolic effects of normal levels of insulin, i.e. they resist insulin’s effects. If this is the case, the body then releases massive amounts of insulin to promote nutrient storage in the resistant cells. Chronic high levels of insulin in the blood are very bad and can cause type 2 diabetes.
Dataset: diabetes.csv; Obtained from the UCI Machine Learning Repository
“The Pima Indians Diabetes Data Set was compiled by researchers at the Johns Hopkins University School of Medicine, from a larger database owned by the National Institute of Diabetes and Digestive and Kidney Diseases. Included in the dataset is information on females of Pima Indian heritage who were at least 21 years old at the time of data collection. Pima Indians have one of the highest rates of diabetes in the world, and the researchers at Johns Hopkins collected this dataset with the intention of creating a model that would predict the onset of diabetes in the Pima Indian population”
( I prefer to code defensively. Thereby, my script checks if the CSV file is present or not, in the directory, for further actions.)
if(!("diabetes.csv" %in% list.files()))
# if the CSV file is not present in the working directory by the default name,
download.file("https://raw.githubusercontent.com/Dexter1618/NTCC_S06_PredictingForDiabetesInHumans/master/diabetes.csv",quiet = TRUE)
# reading the CSV file
DATA<-read.csv("./diabetes.csv")
The dataset has 768 observations and 9 variables.
The variables available are:
Pregnancies: Number of pregnanciesGlucose: Plasma Glucose ConcentrationBloodPressure: Diastolic Blood PressureSkinThickness: Triceps skinfold thicknessInsulin: 2-hour serum insulin, in mmU/mlBMI: Body Mass IndexDiabetesPedigreeFunctionAge: Patient age, in number of yearsThe last column, named Outcome contains either 1 or 0, indicating whether the person with the corresponding recorded values of the above variables actually has diabetes or not. This variable as been read as a column of integer, but it is more appropiate to treat it as a factor variable.
Let’s explore the relationships that may exist amongst the variables themselves. Outcome is not considered in this exploration.
pairs(~Pregnancies+Glucose+BloodPressure+SkinThickness+Insulin+BMI+DiabetesPedigreeFunction+Age,data = DATA)
If that seems quite clumsy to look at and see any trends in data, let me show the pairs of variables which I found that show a distinct positive trend:
0.
Such variables with 0 values can affect future analytical operations. So, they need to removed or imputed. Deletion seems a more straight-forward to deal with this issue, and not introduce more bias into the data, but we will be losing a lot of data-points. I decided to replace the 0 values from variables which cannot have zero-values, with NA. The respective variable columns were modified as:
DATA$Glucose[DATA$Glucose==0]<-NA
DATA$BloodPressure[DATA$BloodPressure==0]<-NA
DATA$SkinThickness[DATA$SkinThickness==0]<-NA
DATA$Insulin[DATA$Insulin==0]<-NA
DATA$BMI[DATA$BMI==0]<-NA
DATA$Age[DATA$Age==0]<-NA
We might as well explore the frequency distributions for the variables themselves. Outcome may not be considered in this exploration also. The distributions for Glucose, Blood Pressure, SkinThickness and DiabetesPedigreeFunction were almost normal (as in normal distribution, bell-shape curve). The distributions for Pregnancies, Insulin, BMI and Age were prominently left-skewed.
For further analyses, we ought to divide the cleaned dataset into a training dataset and a testing dataset. We perform all our analyses and our modelling, and significance testing on the training dataset. We test for accuracy of our model on the untouched testing dataset.
set.seed(nrow(DATA))
randomSampling<-sort(sample(nrow(DATA), nrow(DATA)*0.75))
TRAIN_DATA<-DATA[randomSampling,]
TEST_DATA<-DATA[-randomSampling,]
The training set has 576 observations while the test set has the remaining 192 observations.
379 females do not have diabetes, while 197 of them actually have diabetes, from the training dataset.
Now, I have a training set which I would like to work with.
OutcomeI propose four statistical models, with different predictors for each case. And by proposing these four models, I perform the Hypothesis Testing:
Outcome.Outcome, is called the ALTERNATIVE hypothesis HA .Mathematically speaking, if the slope or coefficient estimate for a predictor is 0, it means they are not at all helpful in predicting Outcome. For H0, estimate for all predictors in a model will be 0. For HA, estimate will NOT be 0 for any one of the predictors.
Interpretations of columns (in the four tabs)
estimate : coefficient/‘slope’ (approximated)z value : how far it is from the mean (in terms of number of standard deviation for that distribution)p-value: the probability of finding the observed, or more extreme, results when H0 is true; the choice of significance level at which you reject H0 is arbitrary. If p-value < 0.05, it means that the chance H0 is true is less than “1 in a 100 cases”.AIC is explained in tab 1.3.3Glucose & SkinThickness)MODEL01<-glm(Outcome~Pregnancies+BloodPressure+Insulin+BMI+DiabetesPedigreeFunction+Age,data = TRAIN_DATA,family = binomial(link = 'logit'))
kable(coefficients(summary(MODEL01)))
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -6.2020458 | 1.0709911 | -5.7909408 | 0.0000000 |
| Pregnancies | 0.1160128 | 0.0599940 | 1.9337391 | 0.0531452 |
| BloodPressure | 0.0060550 | 0.0120399 | 0.5029112 | 0.6150267 |
| Insulin | 0.0040732 | 0.0012934 | 3.1490779 | 0.0016379 |
| BMI | 0.0688191 | 0.0225460 | 3.0523873 | 0.0022703 |
| DiabetesPedigreeFunction | 0.9756651 | 0.4307878 | 2.2648391 | 0.0235226 |
| Age | 0.0360316 | 0.0195117 | 1.8466660 | 0.0647956 |
For this model:
365.0033156 on 296 degrees of freedom297.1826521 on 290 degrees of freedom311.1826521
Insulin & BMI)MODEL02<-glm(Outcome~Pregnancies+BloodPressure+Glucose+SkinThickness+DiabetesPedigreeFunction+Age,data = TRAIN_DATA,family = binomial(link = 'logit'))
kable(coefficients(summary(MODEL02)))
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -8.5321190 | 1.0529082 | -8.1033836 | 0.0000000 |
| Pregnancies | 0.1516311 | 0.0501437 | 3.0239313 | 0.0024951 |
| BloodPressure | -0.0010183 | 0.0116409 | -0.0874737 | 0.9302950 |
| Glucose | 0.0406761 | 0.0051078 | 7.9634959 | 0.0000000 |
| SkinThickness | 0.0374375 | 0.0136574 | 2.7411793 | 0.0061219 |
| DiabetesPedigreeFunction | 1.0917085 | 0.3989687 | 2.7363258 | 0.0062129 |
| Age | 0.0171673 | 0.0155791 | 1.1019463 | 0.2704850 |
For this model:
508.9399138 on 400 degrees of freedom351.230079 on 394 degrees of freedom365.230079
Insulin & SkinThickness)MODEL03<-glm(Outcome~Pregnancies+BloodPressure+Glucose+BMI+DiabetesPedigreeFunction+Age,data = TRAIN_DATA,family = binomial(link = 'logit'))
kable(coefficients(summary(MODEL03)))
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -9.4419210 | 0.9870814 | -9.5654935 | 0.0000000 |
| Pregnancies | 0.1391540 | 0.0412401 | 3.3742408 | 0.0007402 |
| BloodPressure | -0.0082974 | 0.0103087 | -0.8048895 | 0.4208835 |
| Glucose | 0.0382592 | 0.0042822 | 8.9345478 | 0.0000000 |
| BMI | 0.0923435 | 0.0188110 | 4.9090035 | 0.0000009 |
| DiabetesPedigreeFunction | 0.8218230 | 0.3642402 | 2.2562665 | 0.0240539 |
| Age | 0.0155685 | 0.0116290 | 1.3387709 | 0.1806453 |
For this model:
688.8132705 on 542 degrees of freedom485.713029 on 536 degrees of freedom499.713029
Glucose & BMI)MODEL04<-glm(Outcome~Pregnancies+BloodPressure+Insulin+SkinThickness+DiabetesPedigreeFunction+Age,data = TRAIN_DATA,family = binomial(link = 'logit'))
kable(coefficients(summary(MODEL04)))
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -5.2808971 | 0.9600638 | -5.500569 | 0.0000000 |
| Pregnancies | 0.0974778 | 0.0579646 | 1.681677 | 0.0926314 |
| BloodPressure | 0.0122639 | 0.0118350 | 1.036239 | 0.3000905 |
| Insulin | 0.0043021 | 0.0012888 | 3.337995 | 0.0008439 |
| SkinThickness | 0.0366044 | 0.0140425 | 2.606682 | 0.0091424 |
| DiabetesPedigreeFunction | 0.9183737 | 0.4143449 | 2.216448 | 0.0266609 |
| Age | 0.0333513 | 0.0194375 | 1.715816 | 0.0861957 |
For this model:
365.8194896 on 297 degrees of freedom300.742368 on 291 degrees of freedom314.742368
OutcomeThe predictors are independent variables. If two predictors are correlated with each other, they are called collinear variables.
Including collinear variables complicates model estimation due to multi-collinearity. One must always avoid adding collinear variables to a model. They bring nothing new to the table.
Employ Occam’s razor: The model hypothesis that makes fewer assumptions (i.e uses fewer predictors) is the best one, amongst competing model hypotheses.
The simplest model hypothesis is called the Parsimonious model.
The parsimonious model has the highest predicting power and uses minimal number of predictors.
Criterion used to judge how well a model fits the data and can predict reliably good for the target variable: AIC (Akaike Information Criterion) You may read more about this criterion and what it is from here and here.
So, the model with the least value of AIC would be the parsimonious model amongst the four that I have proposed earlier.
Hence, we choose the predictors in Logistic Model A to be the reliable predictors for the target variable Outcome. We shall also use this model to find the significance of each predictor in the model, in predicting for Outcome.
We can. Let’s automate the process of finding the parsimonious model.
We may use the stepwise model selection algorithm to find the model.
fullModel<-glm(Outcome~Pregnancies+Glucose+BloodPressure+SkinThickness+Insulin+BMI+DiabetesPedigreeFunction+Age,data=na.omit(TRAIN_DATA),family = binomial(link = 'logit'))
leastAIC_MODEL<-MASS::stepAIC(fullModel)
The algorithm gives the following predictors (on the right hand side) to be the most effective ones to constitute the parsimonious model.
leastAIC_MODEL$formula
Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction
For Logistic Model A,
kable((anova(MODEL01,test = "Chisq")))
| Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) | |
|---|---|---|---|---|---|
| NULL | NA | NA | 285 | 365.0033 | NA |
| Pregnancies | 1 | 19.304327 | 284 | 345.6990 | 0.0000111 |
| BloodPressure | 1 | 5.846204 | 283 | 339.8528 | 0.0156107 |
| Insulin | 1 | 22.416327 | 282 | 317.4365 | 0.0000022 |
| BMI | 1 | 10.963070 | 281 | 306.4734 | 0.0009295 |
| DiabetesPedigreeFunction | 1 | 5.760091 | 280 | 300.7133 | 0.0163942 |
| Age | 1 | 3.530645 | 279 | 297.1827 | 0.0602443 |
A Chi-square ANOVA tells us Pregnancies, Insulin and BMI are signficant predictors for Outcome.
For the model selected by stepwise algorithm,
kable((anova(leastAIC_MODEL,test = "Chisq")))
| Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) | |
|---|---|---|---|---|---|
| NULL | NA | NA | 284 | 364.1836 | NA |
| Pregnancies | 1 | 19.059821 | 283 | 345.1238 | 0.0000127 |
| Glucose | 1 | 77.896094 | 282 | 267.2277 | 0.0000000 |
| BMI | 1 | 11.166498 | 281 | 256.0612 | 0.0008329 |
| DiabetesPedigreeFunction | 1 | 5.797067 | 280 | 250.2641 | 0.0160529 |
A Chi-square ANOVA tells us Pregnancies, Glucose and BMI are signficant predictors for Outcome.
We ought to see how well our models can predict the Outcome.
I shall be using the test dataset now. Our models haven’t used the test data all this time and for good reason: they mustn’t be biased, in the sense they mustn’t already know what the real value for Outcome is.
First, let’s use Logistic Model A, the one which I proposed by hypothesis testing and evaluation using AIC.
prediction_MODEL01<-predict(MODEL01,TEST_DATA,type = "response")
Now, let’s use the logistic model obtained via the stepwise algorithm to find its predictions.
prediction_leastAIC_MODEL<-predict(leastAIC_MODEL,TEST_DATA,type = "response")
We have our arrays of predictions from both the models. Now we get to see how good they are. Since these prediction values are log of odds, we ought to set a cut-off value of probability = C. If prediction value >= C, we treat it as a YES or 1. Else, we treat it as NO or 0.
prediction_MODEL01<-factor(ifelse(prediction_MODEL01>0.8,1,0),levels = c(0,1))
prediction_leastAIC_MODEL<-factor(ifelse(prediction_leastAIC_MODEL>0.6,1,0),levels = c(0,1))
We ought to check for the accuracy of predictions for both of the models:
accuracy_MODEL01<-mean(prediction_MODEL01==TEST_DATA$Outcome,na.rm = TRUE)
accuracy_leastAIC_MODEL<-mean(prediction_leastAIC_MODEL==TEST_DATA$Outcome,na.rm = TRUE)
Logistic Model A gives an accuracy of 74.7663551% on
TEST_DATA. The right-hand side of the following gives the predictors in Model A:Outcome ~ Pregnancies + BloodPressure + Insulin + BMI + DiabetesPedigreeFunction + Age
The model selected by stepwise algorithm gives an accuracy of 75.1322751% on
TEST_DATA. The right-hand side of the following gives the predictors in this model:Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction
pandas library supports for dynamic and ‘very-fast’ operations on a dataset, which is stored as a matrix.numpy stands for Numerical Python. It contains methods for performing statistical and general mathematical calculations.matplotlib is a Python plotting library based on MATLAB.sklearn is the Machine Learning library in Python.import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
DATA = pd.read_csv("https://raw.githubusercontent.com/Dexter1618/NTCC_S06_PredictingForDiabetesInHumans/master/diabetes.csv")
DATA
We need to divide the dataset DATA into training and testing blocks, just like we did before. However, since we will be using two different machine-learning models, we need to create different splits of DATA to judge and find the best model.
DATA.TRAIN_DATA_PREDICTORS, TEST_DATA_PREDICTORS, TRAIN_DATA_OUTCOME, TEST_DATA_OUTCOME = train_test_split(DATA.loc[:, DATA.columns != 'Outcome'], DATA['Outcome'], stratify=DATA['Outcome'], random_state=100)
Pregnancies,BloodPressure,Insulin,BMI,DiabetesPedigreeFunction and Age.TRAIN_DATA_BEST_PREDICTORS_01, TEST_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01, TEST_DATA_OUTCOME_BEST_PREDICTORS_01 = train_test_split(DATA[["Pregnancies","BloodPressure","Insulin","BMI","DiabetesPedigreeFunction","Age"]], DATA['Outcome'], stratify=DATA['Outcome'], random_state=100)
leastAIC_MODEL had. They are Pregnancies,Glucose,BMI, and DiabetesPedigreeFunction.TRAIN_DATA_BEST_PREDICTORS_02, TEST_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02, TEST_DATA_OUTCOME_BEST_PREDICTORS_02 = train_test_split(DATA[["Pregnancies","Glucose","BMI","DiabetesPedigreeFunction"]], DATA['Outcome'], stratify=DATA['Outcome'], random_state=100)
The k-NN algorithm is arguably the simplest machine learning algorithm. Building the model consists only of storing the training dataset. To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset—its “nearest neighbors”.
I took a range of 1 - 20 neighbours and then tried finding that k-NN model that requires the least number of neighbours to correctly determine the value of Outcome.
Here I used all the available 8 predictors in DATA.
from sklearn.neighbors import KNeighborsClassifier
training_accuracy = []
test_accuracy = []
neighbourCount = range(1, 20)
for i in neighbourCount:
# build the model
knn = KNeighborsClassifier(i)
knn.fit(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)
# record training set accuracy
training_accuracy.append(knn.score(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME))
# record test set accuracy
test_accuracy.append(knn.score(TEST_DATA_PREDICTORS, TEST_DATA_OUTCOME))
A k-NN Classifier with k = 9 has the maximum accuracy.
plt.plot(neighbourCount, training_accuracy, label="Accuracy on TRAIN_DATA")
plt.plot(neighbourCount, test_accuracy, label="Accuracy on TEST_DATA")
Here I used 6 predictors as used by Logistic Model A.
training_accuracy = []
test_accuracy = []
neighbourCount = range(1, 20)
for i in neighbourCount:
# build the model
knn = KNeighborsClassifier(i)
knn.fit(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01)
# record training set accuracy
training_accuracy.append(knn.score(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01))
# record test set accuracy
test_accuracy.append(knn.score(TEST_DATA_BEST_PREDICTORS_01, TEST_DATA_OUTCOME_BEST_PREDICTORS_01))
A k-NN Classifier with k = 6 has the maximum accuracy.
plt.plot(neighbourCount, training_accuracy, label="Accuracy on TRAIN_DATA")
plt.plot(neighbourCount, test_accuracy, label="Accuracy on TEST_DATA")
leastAIC_MODEL - predictorsHere I used 4 predictors as used by leastAIC_MODEL.
training_accuracy = []
test_accuracy = []
neighbourCount = range(1, 20)
for i in neighbourCount:
# build the model
knn = KNeighborsClassifier(i)
knn.fit(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02)
# record training set accuracy
training_accuracy.append(knn.score(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02))
# record test set accuracy
test_accuracy.append(knn.score(TEST_DATA_BEST_PREDICTORS_02, TEST_DATA_OUTCOME_BEST_PREDICTORS_02))
A k-NN Classifier with k = 14 has the maximum accuracy.
plt.plot(neighbourCount, training_accuracy, label="Accuracy on TRAIN_DATA")
plt.plot(neighbourCount, test_accuracy, label="Accuracy on TEST_DATA")
TRAIN_DATA and TEST_DATANow, I need to evaluate the accuracy of predictions for the best k-NN models I found.
leastAIC_MODELknn = KNeighborsClassifier(9)
knn.fit(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)
print("When K = 9 and with all predictors in dataset: ")
print('Accuracy of K-NN classifier on training set: {:.5f}'.format(knn.score(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)))
print('Accuracy of K-NN classifier on test set: {:.5f}'.format(knn.score(TEST_DATA_PREDICTORS, TEST_DATA_OUTCOME)))
knn = KNeighborsClassifier(7)
knn.fit(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01)
print("\nPREDICTORS USED:","Pregnancies","BloodPressure","Insulin","BMI","DiabetesPedigreeFunction","Age")
print("When K = 6 and with the specific predictors (logistic model A) in dataset: ")
print('Accuracy of K-NN classifier on training set: {:.5f}'.format(knn.score(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01)))
print('Accuracy of K-NN classifier on test set: {:.5f}'.format(knn.score(TEST_DATA_BEST_PREDICTORS_01, TEST_DATA_OUTCOME_BEST_PREDICTORS_01)))
knn = KNeighborsClassifier(14)
knn.fit(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02)
print("\nPREDICTORS USED:","Pregnancies","Glucose","BMI","DiabetesPedigreeFunction")
print("When K = 14 and with the specific predictors (leastAIC_MODEL) in dataset: ")
print('Accuracy of K-NN classifier on training set: {:.5f}'.format(knn.score(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02)))
print('Accuracy of K-NN classifier on test set: {:.5f}'.format(knn.score(TEST_DATA_BEST_PREDICTORS_02, TEST_DATA_OUTCOME_BEST_PREDICTORS_02)))
When K = 9 and with all predictors in dataset:
Accuracy of K-NN classifier on training set: 0.79167
Accuracy of K-NN classifier on test set: 0.77083
PREDICTORS USED: Pregnancies BloodPressure Insulin BMI DiabetesPedigreeFunction Age
When K = 6 and with the specific predictors (logistic model A) in dataset:
Accuracy of K-NN classifier on training set: 0.73264
Accuracy of K-NN classifier on test set: 0.73438
PREDICTORS USED: Pregnancies Glucose BMI DiabetesPedigreeFunction
When K = 14 and with the specific predictors (leastAIC_MODEL) in dataset:
Accuracy of K-NN classifier on training set: 0.76910
Accuracy of K-NN classifier on test set: 0.77083
I also evaluated the accuracy of these three k-NN models in-sample, via the cross-validation technique.
For the KNN model with 9 neighbours and all predictors in dataset:
10-fold cross validation average accuracy: 0.73445
For the KNN model with 6 neighbours and specific predictors in Logistic Model A:
10-fold cross validation average accuracy: 0.66313
For the KNN model with 14 neighbours and specific predictors in leastAIC_MODEL:
10-fold cross validation average accuracy: 0.73276
from sklearn.linear_model import LogisticRegression
The trade-off parameter of logistic regression that determines the strength of the regularization is called C. Higher values of C correspond to less regularization C is actually the Inverse of regularization strength(lambda)
print("\t\t For all predictors:\n")
print("For C = 1")
LogisticModel01 = LogisticRegression().fit(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)
print("Training set accuracy: {:.4f}".format(LogisticModel01.score(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)))
print("Test set accuracy: {:.4f}".format(LogisticModel01.score(TEST_DATA_PREDICTORS, TEST_DATA_OUTCOME)))
print("\nFor C = 0.01")
LogisticModel02 = LogisticRegression(C=0.01).fit(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)
print("Training set accuracy: {:.4f}".format(LogisticModel02.score(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)))
print("Test set accuracy: {:.4f}".format(LogisticModel02.score(TEST_DATA_PREDICTORS, TEST_DATA_OUTCOME)))
print("\nFor C = 100")
LogisticModel03 = LogisticRegression(C=100).fit(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)
print("Training set accuracy: {:.4f}".format(LogisticModel03.score(TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME)))
print("Test set accuracy: {:.4f}".format(LogisticModel03.score(TEST_DATA_PREDICTORS, TEST_DATA_OUTCOME)))
print("\n\n\t\t For specific predictors in logistic model A and C = 1:\n")
print("\nPREDICTORS USED:","Pregnancies","BloodPressure","Insulin","BMI","DiabetesPedigreeFunction","Age")
LogisticModelA = LogisticRegression().fit(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01)
print("Training set accuracy: {:.4f}".format(LogisticModelA.score(TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01)))
print("Test set accuracy: {:.4f}".format(LogisticModelA.score(TEST_DATA_BEST_PREDICTORS_01, TEST_DATA_OUTCOME_BEST_PREDICTORS_01)))
print("\n\n\t\t For specific predictors in leastAIC_model and C = 1:\n")
print("\nPREDICTORS USED:","Pregnancies","Glucose","BMI","DiabetesPedigreeFunction")
leastAIC_MODEL = LogisticRegression().fit(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02)
print("Training set accuracy: {:.4f}".format(leastAIC_MODEL.score(TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02)))
print("Test set accuracy: {:.4f}".format(leastAIC_MODEL.score(TEST_DATA_BEST_PREDICTORS_02, TEST_DATA_OUTCOME_BEST_PREDICTORS_02)))
For all predictors:
For C = 1
Training set accuracy: 0.7778
Test set accuracy: 0.7708
For C = 0.01
Training set accuracy: 0.6979
Test set accuracy: 0.7240
For C = 100
Training set accuracy: 0.7847
Test set accuracy: 0.7708
For specific predictors in logistic model A and C = 1:
PREDICTORS USED: Pregnancies BloodPressure Insulin BMI DiabetesPedigreeFunction Age
Training set accuracy: 0.7014
Test set accuracy: 0.6823
For specific predictors in leastAIC_model and C = 1:
PREDICTORS USED: Pregnancies Glucose BMI DiabetesPedigreeFunction
Training set accuracy: 0.7708
Test set accuracy: 0.7812
kfold = model_selection.KFold(n_splits=10, random_state=7)
scoring = 'accuracy'
print("For the LR model with C = 1 and all predictors in dataset:")
results = model_selection.cross_val_score(LogisticModel01, TRAIN_DATA_PREDICTORS, TRAIN_DATA_OUTCOME, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.5f" % (results.mean()))
print("For the LR model with C = 1 and specific predictors in Logistic Model A:")
results = model_selection.cross_val_score(LogisticModelA, TRAIN_DATA_BEST_PREDICTORS_01, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_01, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.5f" % (results.mean()))
print("For the LR model with C = 1 and specific predictors in leastAIC_MODEL:")
results = model_selection.cross_val_score(leastAIC_MODEL, TRAIN_DATA_BEST_PREDICTORS_02, TRAIN_DATA_OUTCOME_BEST_PREDICTORS_02, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.5f" % (results.mean()))
For the LR model with C = 1 and all predictors in dataset:
10-fold cross validation average accuracy: 0.76576
For the LR model with C = 1 and specific predictors in Logistic Model A:
10-fold cross validation average accuracy: 0.68237
For the LR model with C = 1 and specific predictors in leastAIC_MODEL:
10-fold cross validation average accuracy: 0.75532
After all what we need, let’s see what’s the best model to use on DATA to predict whether the female has diabetes or not.
If I use
leastAIC_MODELwith its 4 chosen predictors and with a value of regularization strength =1, the model obtains an accuracy of77.08%on theTRAIN_DATAand an accuracy of78.12%onTEST_DATA. That’s really good. Cross-validatingleastAIC_MODELonTRAIN_DATAgives an accuracy of75.53%.
If I use a k-NN classifier with K =
9and with all predictors available inDATA, the model obtains an accuracy of79.167%onTRAIN_DATAand an accuracy of77.08%onTEST_DATA. Cross-validating this k-NN classifier onTRAIN_DATAgives an accuracy of73.45%.
The simplest model wins, unambigously.
Let’s mention Occam’s Razor again: The simplest model possible that makes the minimal number of assumptions always have the highest predicting power.
leastAIC_MODEL uses only 4 variables, 3 of which are very significant predictors of diabetes in Pima Indian females (Pregnancies, Glucose and BMI) and is able to predict whether a female has diabetes or not using unseen values for these 4 variables.
It has been able to accurately predict on TEST_DATA more than it has on TRAIN_DATA, and definitely more than its competer - the k-NN classifier with k = 9.