The k-NN algorithm is a non-parametric classification method that is widely used nowadays. It is a lazy learner and one of the most popular machine learning algorithms given its simplicity.
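For intuition, the core of the algorithm fits in a few lines: to classify a new point, compute its distance to every training observation, keep the k nearest ones, and take a majority vote among their labels. A minimal sketch (illustration only, not the implementation we use below; knn_vote is a name of mine):
knn_vote <- function(train, labels, new_point, k) {
  # Euclidean distance from new_point to every training row
  d <- sqrt(rowSums(sweep(as.matrix(train), 2, new_point)^2))
  nearest <- order(d)[1:k]                   # indices of the k closest rows
  names(which.max(table(labels[nearest])))   # majority class among them
}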
This time we will use the pima dataset from the faraway library. It contains data on women of Pima Indian heritage, aged at least 21, who were tested for diabetes.
To build our k-NN classifier we are going to use the class library, together with vegan for the normalization step and gmodels for the confusion matrix.
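Assuming these packages are installed, we load them together with the data:
library(faraway)  # pima dataset
library(vegan)    # decostand() for normalization
library(class)    # knn()
library(gmodels)  # CrossTable() for the confusion matrix
data(pima)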
head(pima)
## pregnant glucose diastolic triceps insulin bmi diabetes age test
## 1 6 148 72 35 0 33.6 0.627 50 1
## 2 1 85 66 29 0 26.6 0.351 31 0
## 3 8 183 64 0 0 23.3 0.672 32 1
## 4 1 89 66 23 94 28.1 0.167 21 0
## 5 0 137 40 35 168 43.1 2.288 33 1
## 6 5 116 74 0 0 25.6 0.201 30 0
summary(pima)
## pregnant glucose diastolic triceps
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin bmi diabetes age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## test
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
First up, we have to convert the outcome to a factor and give its levels proper labels in order to make a proper classification (otherwise the 0/1 codes would be treated as plain numbers rather than classes).
pima$test <- as.factor(pima$test)
levels(pima$test) <- c("Negative", "Positive")
table(pima$test)
##
## Negative Positive
## 500 268
round(prop.table(table(pima$test)) * 100, digits = 2)
##
## Negative Positive
## 65.1 34.9
Roughly two thirds of the sample tested negative.
We need to normalize our data, given the differences in scale between the variables.
pima.norm <- decostand(pima[,-9], "normalize")
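As far as decostand's defaults go, the "normalize" method rescales each row so its sum of squares equals one. For comparison, the column-wise min-max rescaling to [0, 1] that many k-NN tutorials use instead would look like this quick sketch (minmax and pima.mm are names of mine and are not used later):
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
pima.mm <- as.data.frame(lapply(pima[, -9], minmax))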
Next, we create the train and test samples, and another two objects containing the labels of the data.
pima.train <- pima.norm[1:427, ]
pima.test <- pima.norm[428:534, ]
pima.train.lab <- pima[1:427, 9]
pima.test.lab <- pima[428:534, 9]
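Splitting by row position assumes the rows are not ordered by outcome. A reproducible random split would be a safer bet; a sketch with object names of mine, so the rest of the walkthrough keeps using the positional split above:
set.seed(123)
idx <- sample(nrow(pima.norm), 427)
pima.train.r <- pima.norm[idx, ]
pima.test.r  <- pima.norm[-idx, ]
pima.train.lab.r <- pima$test[idx]
pima.test.lab.r  <- pima$test[-idx]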
To fit the k-NN model we will use the knn() function from the class library.
To choose an appropriate value of k, the most widely used rule of thumb is to take the square root of the number of observations.
sqrt(nrow(pima))
## [1] 27.71281
pima.k23 <- knn(train = pima.train, test = pima.test, cl = pima.train.lab, k = 23, prob = TRUE)
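The rule of thumb is only a starting point; we could also scan a few values of k around it and compare test-set accuracy. A sketch (ks and acc are names of mine; ties in knn() are broken at random, so we set a seed):
set.seed(1)
ks <- seq(5, 35, by = 2)
acc <- sapply(ks, function(k) {
  pred <- knn(train = pima.train, test = pima.test, cl = pima.train.lab, k = k)
  mean(pred == pima.test.lab)
})
data.frame(k = ks, accuracy = acc)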
We can also check the degree of consensus among the voters inside the k-NN model, i.e. the proportion of neighbours that voted for the winning class.
pima.k23.prob <- attr(pima.k23, "prob")
head(pima.k23)
## [1] Positive Negative Positive Positive Negative Negative
## Levels: Negative Positive
head(pima.k23.prob)
## [1] 0.5652174 0.6521739 0.5652174 0.5217391 0.8260870 1.0000000
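Note that prob is the share of votes for the winning class, whichever it is. If we want the estimated probability of a Positive result specifically, we can flip the proportion for the cases predicted Negative (pima.k23.pos is a name of mine):
pima.k23.pos <- ifelse(pima.k23 == "Positive", pima.k23.prob, 1 - pima.k23.prob)
head(pima.k23.pos)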
To evaluate the predictions, the standard tool is a confusion matrix.
gmodels::CrossTable(x = pima.test.lab, y = pima.k23, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 107
##
##
## | pima.k23
## pima.test.lab | Negative | Positive | Row Total |
## --------------|-----------|-----------|-----------|
## Negative | 67 | 16 | 83 |
## | 0.807 | 0.193 | 0.776 |
## | 0.870 | 0.533 | |
## | 0.626 | 0.150 | |
## --------------|-----------|-----------|-----------|
## Positive | 10 | 14 | 24 |
## | 0.417 | 0.583 | 0.224 |
## | 0.130 | 0.467 | |
## | 0.093 | 0.131 | |
## --------------|-----------|-----------|-----------|
## Column Total | 77 | 30 | 107 |
## | 0.720 | 0.280 | |
## --------------|-----------|-----------|-----------|
##
##
table(pima.k23, pima.test.lab)
## pima.test.lab
## pima.k23 Negative Positive
## Negative 67 10
## Positive 16 14
We can also compute the overall accuracy.
mean(pima.k23 == pima.test.lab)
## [1] 0.7570093
We have built a classifier with about 75% accuracy. Whether that is good or bad depends on the cost of each type of error. In conclusion:
False negatives: 10 (usually the most dangerous kind of error), about 9% of the test set. False positives: 16 (which in this case would mostly represent a burden on the health care system), about 15%.
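These figures can be read straight off the confusion matrix; as a quick check, the shares of the whole test set (fn_share and fp_share are names of mine):
fn <- sum(pima.k23 == "Negative" & pima.test.lab == "Positive")  # 10 false negatives
fp <- sum(pima.k23 == "Positive" & pima.test.lab == "Negative")  # 16 false positives
round(c(fn_share = fn, fp_share = fp) / length(pima.test.lab), 2)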
Let’s see whether z-score standardization can improve our predictive accuracy. Each value is transformed into (x − mean) / sd of its column, so there is no predefined minimum or maximum and extreme values are not compressed towards the centre, which often makes it a better option; scores above 3 or below -3 represent very rare values.
pima.z <- as.data.frame(scale(pima[,-9]))
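With its default arguments, scale() centres each column on its mean and divides by its standard deviation; a quick sanity check on one column (glucose.z is a name of mine):
glucose.z <- (pima$glucose - mean(pima$glucose)) / sd(pima$glucose)
all.equal(pima.z$glucose, glucose.z)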
pima.z.train <- pima.z[1:427, ]
pima.z.test <- pima.z[428:534, ]
pima.z.k23 <- knn(train = pima.z.train, test = pima.z.test, cl = pima.train.lab, k = 23)
gmodels::CrossTable(x = pima.test.lab, y = pima.z.k23, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 107
##
##
## | pima.z.k23
## pima.test.lab | Negative | Positive | Row Total |
## --------------|-----------|-----------|-----------|
## Negative | 75 | 8 | 83 |
## | 0.904 | 0.096 | 0.776 |
## | 0.852 | 0.421 | |
## | 0.701 | 0.075 | |
## --------------|-----------|-----------|-----------|
## Positive | 13 | 11 | 24 |
## | 0.542 | 0.458 | 0.224 |
## | 0.148 | 0.579 | |
## | 0.121 | 0.103 | |
## --------------|-----------|-----------|-----------|
## Column Total | 88 | 19 | 107 |
## | 0.822 | 0.178 | |
## --------------|-----------|-----------|-----------|
##
##
table(pima.z.k23, pima.test.lab)
## pima.test.lab
## pima.z.k23 Negative Positive
## Negative 75 13
## Positive 8 11
mean(pima.z.k23 == pima.test.lab)
## [1] 0.8037383
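Before drawing conclusions, we can tally both error types for the two models side by side (the errors data frame is a name of mine):
errors <- data.frame(
  model = c("unit-norm", "z-score"),
  false_neg = c(sum(pima.k23 == "Negative" & pima.test.lab == "Positive"),
                sum(pima.z.k23 == "Negative" & pima.test.lab == "Positive")),
  false_pos = c(sum(pima.k23 == "Positive" & pima.test.lab == "Negative"),
                sum(pima.z.k23 == "Positive" & pima.test.lab == "Negative")))
errors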
Interestingly enough, the number of false negatives has increased, while the number of false positives has decreased and the global accuracy has improved. Given that in a health care setting false negatives are considerably more problematic than false positives, we will choose the model built on the normalized scores rather than the z-scores.