The k-nearest neighbours (k-NN) algorithm is a non-parametric classification method that is widely used today. It is a lazy learner and, given its simplicity, one of the most popular machine learning algorithms.

This time we will use the pima dataset from the faraway library. It contains data on women of Pima Indian heritage, aged 21 or older, who were tested for diabetes.

To build our k-NN classification model we will use the class library (for knn()), along with vegan (for decostand()) and gmodels (for CrossTable()).
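
Assuming the required packages are installed, loading everything used below might look like this (a sketch, not shown in the original):

library(faraway)  # pima dataset
library(vegan)    # decostand() for normalization
library(class)    # knn()
# gmodels is only needed for CrossTable(), which is called below as gmodels::CrossTable()
data(pima)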

head(pima)
##   pregnant glucose diastolic triceps insulin  bmi diabetes age test
## 1        6     148        72      35       0 33.6    0.627  50    1
## 2        1      85        66      29       0 26.6    0.351  31    0
## 3        8     183        64       0       0 23.3    0.672  32    1
## 4        1      89        66      23      94 28.1    0.167  21    0
## 5        0     137        40      35     168 43.1    2.288  33    1
## 6        5     116        74       0       0 25.6    0.201  30    0
summary(pima)
##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin           bmi           diabetes           age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##       test      
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

1. Data cleaning

First, we have to convert the outcome variable into a factor and give its levels meaningful labels in order to perform a proper classification (otherwise the classifier will throw an error).

pima$test <- as.factor(pima$test)
levels(pima$test) <- c("Negative", "Positive")

table(pima$test)
## 
## Negative Positive 
##      500      268
round(prop.table(table(pima$test)) * 100, digits = 2)
## 
## Negative Positive 
##     65.1     34.9

Roughly two thirds of the sample tested negative.

2. Data preparation

Data transformation

We need to normalize our data because the predictors are measured on very different scales.

pima.norm <- decostand(pima[,-9], "normalize")  # column 9 (test) is the outcome, so it is left out
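
As a quick check (output not shown here), decostand()'s "normalize" method rescales each observation so that its row sum of squares equals one:

# Each row of pima.norm should now have a sum of squares of approximately 1
head(rowSums(pima.norm^2))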

Creating the training and test samples

Next, we create the training and test samples, plus two more objects holding the corresponding class labels.

pima.train <- pima.norm[1:427, ]
pima.test <- pima.norm[428:534, ]
pima.train.lab <- pima[1:427, 9]
pima.test.lab <- pima[428:534, 9]

3. Training the model

This time we will use the class library to do so.

To choose an appropriate value of k, a widely used rule of thumb is to take the square root of the number of observations.

sqrt(nrow(pima))
## [1] 27.71281
pima.k23 <- knn(train = pima.train, test = pima.test, cl = pima.train.lab, k = 23, prob = TRUE)

We can also check the degree of consensus among the neighbours inside the k-NN model, i.e. the proportion of votes for the winning class.

pima.k23.prob <- attr(pima.k23, "prob")
head(pima.k23)
## [1] Positive Negative Positive Positive Negative Negative
## Levels: Negative Positive
head(pima.k23.prob)
## [1] 0.5652174 0.6521739 0.5652174 0.5217391 0.8260870 1.0000000

4. Validation of the model

To do so, the best approach is to build a confusion matrix.

gmodels::CrossTable(x = pima.test.lab, y = pima.k23, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  107 
## 
##  
##               | pima.k23 
## pima.test.lab |  Negative |  Positive | Row Total | 
## --------------|-----------|-----------|-----------|
##      Negative |        67 |        16 |        83 | 
##               |     0.807 |     0.193 |     0.776 | 
##               |     0.870 |     0.533 |           | 
##               |     0.626 |     0.150 |           | 
## --------------|-----------|-----------|-----------|
##      Positive |        10 |        14 |        24 | 
##               |     0.417 |     0.583 |     0.224 | 
##               |     0.130 |     0.467 |           | 
##               |     0.093 |     0.131 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |        77 |        30 |       107 | 
##               |     0.720 |     0.280 |           | 
## --------------|-----------|-----------|-----------|
## 
## 
table(pima.k23, pima.test.lab)
##           pima.test.lab
## pima.k23   Negative Positive
##   Negative       67       10
##   Positive       16       14

We can also compute the overall accuracy.

mean(pima.k23 == pima.test.lab)
## [1] 0.7570093

We have built a classifier with around 75% accuracy. Whether that is good or bad depends on the cost-benefit trade-off. In conclusion:

False negatives: 10 (about 9% of the test set), usually the most dangerous kind of error. False positives: 16 (about 15%), which in this case would represent a burden on the health care system.
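
These percentages can be reproduced directly from the confusion table; a minimal sketch:

# Rows of the table are predictions, columns are the true labels
conf <- table(pima.k23, pima.test.lab)
fn <- conf["Negative", "Positive"]  # predicted Negative but actually Positive (10)
fp <- conf["Positive", "Negative"]  # predicted Positive but actually Negative (16)
round(100 * c(false_negatives = fn, false_positives = fp) / sum(conf), 1)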

5. Model improvement

5.1. Z-score standardization

Let’s see whether z-score standardization can improve our predictive accuracy. Unlike rescaling to a fixed range, this transformation does not compress extreme values, so it is often the better option; z-scores above 3 or below -3 indicate very rare values.

pima.z <- as.data.frame(scale(pima[,-9]))
pima.z.train <- pima.z[1:427, ]
pima.z.test <- pima.z[428:534, ]
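
A quick sanity check (not part of the original output): after scale(), every column should have a mean of approximately 0 and a standard deviation of 1.

round(colMeans(pima.z), 2)
round(apply(pima.z, 2, sd), 2)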

pima.z.k23 <- knn(train = pima.z.train, test = pima.z.test, cl = pima.train.lab, k = 23)

gmodels::CrossTable(x = pima.test.lab, y = pima.z.k23, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  107 
## 
##  
##               | pima.z.k23 
## pima.test.lab |  Negative |  Positive | Row Total | 
## --------------|-----------|-----------|-----------|
##      Negative |        75 |         8 |        83 | 
##               |     0.904 |     0.096 |     0.776 | 
##               |     0.852 |     0.421 |           | 
##               |     0.701 |     0.075 |           | 
## --------------|-----------|-----------|-----------|
##      Positive |        13 |        11 |        24 | 
##               |     0.542 |     0.458 |     0.224 | 
##               |     0.148 |     0.579 |           | 
##               |     0.121 |     0.103 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |        88 |        19 |       107 | 
##               |     0.822 |     0.178 |           | 
## --------------|-----------|-----------|-----------|
## 
## 
table(pima.z.k23, pima.test.lab)
##           pima.test.lab
## pima.z.k23 Negative Positive
##   Negative       75       13
##   Positive        8       11
mean(pima.z.k23 == pima.test.lab)
## [1] 0.8037383

Interestingly enough, the number of false negatives has increased (from 10 to 13) while the false positives and the overall accuracy have improved. Given that, in a health care setting, false negatives are considerably more problematic than false positives, we choose the model built on the normalized data rather than the z-scores.
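
To make the comparison concrete, here is a small sketch of each model's sensitivity (the share of actual positives it catches), computed from the predictions above:

# Sensitivity = true positives / all actual positives (24 in this test set)
sensitivity <- function(pred, actual) mean(pred[actual == "Positive"] == "Positive")
sensitivity(pima.k23, pima.test.lab)    # normalized model: 14/24, about 0.58
sensitivity(pima.z.k23, pima.test.lab)  # z-score model: 11/24, about 0.46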