Diabetes is a condition that impairs the body’s ability to process blood glucose, otherwise known as blood sugar. We are going to predict whether a patient has diabetes based on glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function and age. The data come from the file diabetes.csv, available at www.kaggle.com.

Let’s take a look at the data we are going to use.

str(diabetes)
## 'data.frame':    768 obs. of  8 variables:
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...

Our interest is in predicting diabetes. Let’s see what share of the patients have diabetes and what share do not.

round(prop.table(table(diabetes$Outcome))*100,digits=1)
## 
##    0    1 
## 65.1 34.9
summary(diabetes)
##     Outcome         Glucose      BloodPressure    SkinThickness  
##  Min.   :0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.:0.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median :0.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   :0.349   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.:1.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :1.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00

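About 35% of the patients have diabetes. The variables are measured on very different scales: insulin ranges from 0 to 846, for example, while the diabetes pedigree function ranges from 0.078 to 2.42. Because kNN relies on distances between observations, variables with large ranges would dominate the result. For this reason, we are going to build a function that rescales the data using min-max normalization.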

normalize=function(x)
{
  return ((x-min(x))/(max(x)-min(x)))
}
diabetes_n=as.data.frame(lapply(diabetes,normalize))
summary(diabetes_n)
##     Outcome         Glucose       BloodPressure    SkinThickness   
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.4975   1st Qu.:0.5082   1st Qu.:0.0000  
##  Median :0.000   Median :0.5879   Median :0.5902   Median :0.2323  
##  Mean   :0.349   Mean   :0.6075   Mean   :0.5664   Mean   :0.2074  
##  3rd Qu.:1.000   3rd Qu.:0.7048   3rd Qu.:0.6557   3rd Qu.:0.3232  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     Insulin             BMI         DiabetesPedigreeFunction      Age        
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000          Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.4069   1st Qu.:0.07077          1st Qu.:0.0500  
##  Median :0.03605   Median :0.4769   Median :0.12575          Median :0.1333  
##  Mean   :0.09433   Mean   :0.4768   Mean   :0.16818          Mean   :0.2040  
##  3rd Qu.:0.15041   3rd Qu.:0.5455   3rd Qu.:0.23409          3rd Qu.:0.3333  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000          Max.   :1.0000
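
As a quick sanity check of the min-max formula, the sketch below applies the same function to a small vector. The values in `glucose_example` are made up for illustration and are not taken from diabetes.csv.

```r
# Same min-max function as above: maps the minimum to 0 and the maximum to 1
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical values, chosen only to make the arithmetic easy to follow
glucose_example <- c(50, 100, 150, 200)

scaled <- normalize(glucose_example)
scaled  # 0, 1/3, 2/3 and 1: each value's position between min and max
```

Note that a column in which every value is identical would give 0/0, which R reports as NaN, so this function assumes each column has at least two distinct values.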

At this point every variable ranges from 0 to 1, so we are ready to proceed with the kNN algorithm. There are 768 patients in total. The training set will contain the first 667 patients and the test set the remaining 101.

diabetes_train=diabetes_n[1:667,]
diabetes_test=diabetes_n[668:768,]
diabetes_test_labels=diabetes[668:768,1]
diabetes_train_labels=diabetes[1:667,1]
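
One thing to watch in the split above: Outcome is the first column of diabetes_n, so slicing whole rows keeps the label among the features. A minimal sketch of a split that drops the label column follows. It uses a small made-up data frame (`toy_n`, with hypothetical values) in place of diabetes_n, and assumes, as in the data shown earlier, that the label sits in column 1.

```r
# Toy stand-in for diabetes_n (hypothetical values); Outcome is column 1
toy_n <- data.frame(
  Outcome = c(1, 0, 1, 0, 1, 0),
  Glucose = c(0.74, 0.43, 0.92, 0.45, 0.69, 0.58),
  BMI     = c(0.50, 0.40, 0.35, 0.42, 0.64, 0.38)
)

n_train   <- 4
train_idx <- 1:n_train
test_idx  <- (n_train + 1):nrow(toy_n)

# Features only: drop column 1 so the label cannot act as a predictor
toy_train <- toy_n[train_idx, -1]
toy_test  <- toy_n[test_idx, -1]

# Labels are kept separately, as factors
toy_train_labels <- factor(toy_n[train_idx, 1])
toy_test_labels  <- factor(toy_n[test_idx, 1])
```

Applied to the real data, the same idea would amount to writing `diabetes_n[1:667,-1]` and `diabetes_n[668:768,-1]`. The results would then differ from the ones reported in this post.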

Now it’s time to decide the value of k. A common rule of thumb is to set k close to the square root of the number of training observations. The training set has 667 observations and the square root of 667 is about 25.8, so we use k = 26.

library(class)  # provides knn()
diabetes_test_pred=knn(train=diabetes_train,test=diabetes_test,cl=diabetes_train_labels,k=26)
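
The square-root rule of thumb behind k = 26 can be written out as a short calculation. This is only a sketch of the heuristic: the variable names `n_train` and `k` are illustrative, and other choices of k may work better on this data.

```r
# Number of observations in the training set
n_train <- 667

# Rule of thumb: take k near the square root of the training-set size
sqrt(n_train)            # about 25.83
k <- round(sqrt(n_train))
k                        # 26
```

With two classes, an even k such as 26 can produce tied votes, which `knn()` from the class package breaks at random. An odd neighbour such as 25 or 27 would avoid ties.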

Let’s evaluate the kNN algorithm used for predicting diabetes.

library(gmodels)  # provides CrossTable()
CrossTable(x=diabetes_test_labels,y=diabetes_test_pred,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  101 
## 
##  
##                      | diabetes_test_pred 
## diabetes_test_labels |         0 |         1 | Row Total | 
## ---------------------|-----------|-----------|-----------|
##                    0 |        63 |         0 |        63 | 
##                      |     1.000 |     0.000 |     0.624 | 
##                      |     1.000 |     0.000 |           | 
##                      |     0.624 |     0.000 |           | 
## ---------------------|-----------|-----------|-----------|
##                    1 |         0 |        38 |        38 | 
##                      |     0.000 |     1.000 |     0.376 | 
##                      |     0.000 |     1.000 |           | 
##                      |     0.000 |     0.376 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |        63 |        38 |       101 | 
##                      |     0.624 |     0.376 |           | 
## ---------------------|-----------|-----------|-----------|
## 
## 

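The table shows no misclassified patients: all 38 patients with diabetes and all 63 patients without it were predicted correctly, so the classification error on the test set is 0. This result should be read with caution, however. Outcome is the first column of diabetes_n and was not removed from diabetes_train or diabetes_test, so the classifier had access to the very label it was asked to predict. The perfect score therefore says little about how well the other variables predict diabetes.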
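
The classification error quoted above is the share of off-diagonal counts in the table. A small sketch of that arithmetic follows, using the four cell counts from the CrossTable output. The names `confusion`, `accuracy` and `error_rate` are illustrative.

```r
# Cell counts from the table above: rows are actual labels, columns are predictions
confusion <- matrix(c(63,  0,
                       0, 38),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual = c("0", "1"),
                                    predicted = c("0", "1")))

# Correct predictions sit on the diagonal
accuracy   <- sum(diag(confusion)) / sum(confusion)
error_rate <- 1 - accuracy

accuracy    # 1
error_rate  # 0
```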