Diabetes is a condition that impairs the body’s ability to process blood glucose, otherwise known as blood sugar. We are going to predict whether a patient has diabetes based on glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function and age. The data come from the diabetes.csv dataset on www.kaggle.com.
Let’s take a look at the data we are going to use.
str(diabetes)
## 'data.frame': 768 obs. of 8 variables:
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
Our interest is in predicting diabetes. Let’s see what percentage of patients have diabetes and what percentage do not.
round(prop.table(table(diabetes$Outcome))*100,digits=1)
##
## 0 1
## 65.1 34.9
summary(diabetes)
## Outcome Glucose BloodPressure SkinThickness
## Min. :0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median :0.000 Median :117.0 Median : 72.00 Median :23.00
## Mean :0.349 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.:1.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :1.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
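About 35% of the patients have diabetes. The variables are on very different scales (for example, Insulin ranges from 0 to 846, while DiabetesPedigreeFunction ranges from 0.078 to 2.42). Because kNN relies on distances between observations, we build a function that rescales each variable with min-max normalization.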
normalize=function(x)
{
  # rescale x so that its minimum maps to 0 and its maximum to 1
  return ((x-min(x))/(max(x)-min(x)))
}
diabetes_n=as.data.frame(lapply(diabetes,normalize))
summary(diabetes_n)
## Outcome Glucose BloodPressure SkinThickness
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.4975 1st Qu.:0.5082 1st Qu.:0.0000
## Median :0.000 Median :0.5879 Median :0.5902 Median :0.2323
## Mean :0.349 Mean :0.6075 Mean :0.5664 Mean :0.2074
## 3rd Qu.:1.000 3rd Qu.:0.7048 3rd Qu.:0.6557 3rd Qu.:0.3232
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Insulin BMI DiabetesPedigreeFunction Age
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.4069 1st Qu.:0.07077 1st Qu.:0.0500
## Median :0.03605 Median :0.4769 Median :0.12575 Median :0.1333
## Mean :0.09433 Mean :0.4768 Mean :0.16818 Mean :0.2040
## 3rd Qu.:0.15041 3rd Qu.:0.5455 3rd Qu.:0.23409 3rd Qu.:0.3333
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000
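As a quick check of the min-max formula, the sketch below applies the same function to a small table. The data frame `toy` and its values are hypothetical and are not rows from diabetes.csv.

```r
# Min-max normalization: (x - min) / (max - min), applied column by column.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical toy data, not taken from diabetes.csv.
toy <- data.frame(
  Glucose = c(80, 100, 120, 200),
  BMI     = c(20, 25, 30, 40)
)

toy_n <- as.data.frame(lapply(toy, normalize))
print(toy_n)

# Each column should now span exactly 0 to 1.
ranges <- sapply(toy_n, range)
print(ranges)
```

Note that a column whose values are all equal would give 0/0 and therefore NaN, so this function assumes every column has at least two distinct values.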
At this point every variable ranges from 0 to 1, so we are ready to proceed with the kNN algorithm. There are 768 diagnosed patients. The training set will contain the first 667 patients and the test set the remaining 101.
diabetes_train=diabetes_n[1:667,]
diabetes_test=diabetes_n[668:768,]
diabetes_test_labels=diabetes[668:768,1]
diabetes_train_labels=diabetes[1:667,1]
Now it’s time to choose the value of k. A common rule of thumb sets k near the square root of the number of training observations. The training set has 667 rows and the square root of 667 is about 25.8, so we use k = 26.
library(class)
diabetes_test_pred=knn(train=diabetes_train,test=diabetes_test,cl=diabetes_train_labels,k=26)
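The value k = 26 can be reproduced with a couple of lines. This is only a sketch of the square-root rule of thumb; the names `n_train` and `k` are introduced here for illustration.

```r
# Number of rows in the training set, using the same row range as the split.
n_train <- length(1:667)

# Rule of thumb (a heuristic, not a guarantee): k near sqrt(n_train).
k <- round(sqrt(n_train))
print(k)  # 26
```

With only two classes, an even k such as 26 can produce tied votes, which knn from the class package breaks at random. An odd neighbour such as 25 or 27 would avoid that, and trying a few values of k is a reasonable alternative to the heuristic.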
Let’s evaluate the kNN algorithm used for predicting diabetes.
library(gmodels)
CrossTable(x=diabetes_test_labels,y=diabetes_test_pred,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 101
##
##
## | diabetes_test_pred
## diabetes_test_labels | 0 | 1 | Row Total |
## ---------------------|-----------|-----------|-----------|
## 0 | 63 | 0 | 63 |
## | 1.000 | 0.000 | 0.624 |
## | 1.000 | 0.000 | |
## | 0.624 | 0.000 | |
## ---------------------|-----------|-----------|-----------|
## 1 | 0 | 38 | 38 |
## | 0.000 | 1.000 | 0.376 |
## | 0.000 | 1.000 | |
## | 0.000 | 0.376 | |
## ---------------------|-----------|-----------|-----------|
## Column Total | 63 | 38 | 101 |
## | 0.624 | 0.376 | |
## ---------------------|-----------|-----------|-----------|
##
##
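The table shows no misclassifications: all 38 patients with diabetes and all 63 patients without diabetes in the test set are classified correctly, so the classification error is 0. This result should not be read as evidence that the method is perfect, however. diabetes_train and diabetes_test are taken from diabetes_n, which still contains the normalized Outcome column, so the label itself is one of the predictors. A fair evaluation would drop the first column from both sets before calling knn.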
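A minimal sketch of how the split could keep Outcome out of the predictor columns is given below. It runs on hypothetical synthetic data rather than diabetes.csv, so the variables `toy`, `x1` and `x2`, the seed and the split sizes are all assumptions made for illustration, and the numbers it prints say nothing about the real dataset.

```r
library(class)  # provides knn()

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical synthetic data standing in for diabetes_n: the label comes
# first, followed by two predictors already scaled to [0, 1].
n <- 200
toy <- data.frame(
  Outcome = rbinom(n, 1, 0.35),
  x1 = runif(n),
  x2 = runif(n)
)
# Make x1 weakly informative about the label (still within [0, 1]).
toy$x1 <- 0.7 * toy$x1 + 0.3 * toy$Outcome

train_idx <- 1:150
test_idx  <- 151:200

# Drop column 1 (Outcome) so the label is not used as a predictor.
train_x <- toy[train_idx, -1]
test_x  <- toy[test_idx, -1]
train_y <- toy[train_idx, 1]
test_y  <- toy[test_idx, 1]

pred <- knn(train = train_x, test = test_x, cl = train_y,
            k = round(sqrt(length(train_idx))))

# Confusion table and accuracy computed with base R.
conf <- table(actual = test_y, predicted = pred)
print(conf)
accuracy <- mean(as.character(pred) == as.character(test_y))
print(accuracy)
```

Applied to the real data, the same idea would amount to using diabetes_n[1:667,-1] and diabetes_n[668:768,-1] as the training and test sets, after which the error would be expected to be greater than 0.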