Learning Objectives

In this lesson students will learn …

  • How to implement the K-nearest neighbors (knn) algorithm
  • Produce stratified training and testing sets
  • The importance of standardizing data
  • How to tune the knn algorithm to pick the best value of \(k\) hyperparameter

BINARY CASE

Example 1: Pima Indigenous People

Motivation/Background:

  • “The Pima Indians of Arizona have the highest reported prevalences of obesity and non-insulin-dependent diabetes mellitus (NIDDM). In parallel with abrupt changes in lifestyle, these prevalences in Arizona Pimas have increased to epidemic proportions during the past decades.” (Ravussin et al. 1994)
  • “This study provides compelling evidence that changes in lifestyle associated with Westernization play a major role in the global epidemic of type 2 diabetes.” (Schultz et al. 2006)

Sources:

Step 1: Import the Data

pima<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/diabetes.csv", 
               header=TRUE)

str(pima)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...
Questions
  • What type of variable is the response?
  • What might be good explanatory variables based on our previous knowledge?

Step 2: Exploration

Our goal in this step is to find feature variables that best separates the two groups (0 = diabetes negative; 1 = diabetes positive).

We might want to use side-by-side boxplot or overlapping density plots.

library(tidyverse)

## BLOOD PRESSURE vs GLOCOSE
ggplot(data=pima, aes(x=Glucose, y=BloodPressure, color=factor(Outcome)))+
  geom_point()

## GLUCOSE BOXPLOT
ggplot(data=pima, aes(x=Glucose ,fill=factor(Outcome)))+
  geom_boxplot()

## GLUCOSE DENSITY
ggplot(data=pima, aes(x=Glucose ,fill=factor(Outcome)))+
  geom_density(alpha=.5)

Step 3: Why Not Linear?

We want to explore relationships between whether or not an individual has diabetes, which is contained in the Outcome variable, and other feature variables.

We’ve made a lot of scatterplot thus far. How about we try that?

ggplot(data=pima, aes(x=Glucose, y=Outcome))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'

That didn’t work out so well.

Questions
  • What was the problem?
  • What might be some problems with trying to fit a traditional linear model to these data?

K-Nearest Neighbors (KNN)

In the previous problem we encountered issues because we were treating Outcome as a number when it should have been treated as a categorical variable (or factor).

KNN is a popular, non-parametric, “lazy” machine learning approach. Predictions are built off defining a neighborhood of \(k\) close neighbors, where closeness is defined by a pre-specified distance metric. Thus, \(k\) is considered a hyperparameter than can be tuned to choose the appropriate model flexibility, whilst considering the bias and variance trade-off.

Step 4: Scaling

Look at the summary of our data

summary(pima)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

Calculating distance is central to this algorithm. Therefore, if features have values with large difference in ranges or units, this can highly influence the model. Therefore, we should pre-process the data first.

Consider the following methods:

  • Min-Max Scaling: Scales to range [0,1]

\[X_{sc}=\frac{X-X_{min}}{X_{max}-X_{min}}\]

  • Standardization (Z-score Scaling)

\[Z=\frac{X-\bar{X}}{SD(X)}\]

##### scale
#Normalization
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x))) }

## REMOVE NAs FIRST
pima<-na.omit(pima)

pimaNorm <- as.data.frame(lapply(pima[,1:8], normalize))%>%
  cbind(Outcome=pima$Outcome)

head(pimaNorm)
##   Pregnancies   Glucose BloodPressure SkinThickness   Insulin       BMI
## 1  0.35294118 0.7437186     0.5901639     0.3535354 0.0000000 0.5007452
## 2  0.05882353 0.4271357     0.5409836     0.2929293 0.0000000 0.3964232
## 3  0.47058824 0.9195980     0.5245902     0.0000000 0.0000000 0.3472429
## 4  0.05882353 0.4472362     0.5409836     0.2323232 0.1111111 0.4187779
## 5  0.00000000 0.6884422     0.3278689     0.3535354 0.1985816 0.6423249
## 6  0.29411765 0.5829146     0.6065574     0.0000000 0.0000000 0.3815201
##   DiabetesPedigreeFunction       Age Outcome
## 1               0.23441503 0.4833333       1
## 2               0.11656704 0.1666667       0
## 3               0.25362938 0.1833333       1
## 4               0.03800171 0.0000000       0
## 5               0.94363792 0.2000000       1
## 6               0.05251921 0.1500000       0

Step 5A: Training and Testing

Let’s make a 70-30 split for training and testing out model.

768*.7
## [1] 537.6
set.seed(999)
sampleInd<-sample(1:768, 537)
train<-pimaNorm[sampleInd,]
test<-pimaNorm[-sampleInd,]

## What proportion people in the training and testing sets have diabetes
mean(train$Outcome)
## [1] 0.3798883
mean(test$Outcome)
## [1] 0.2770563

Wow! There is a 10% difference! We want to have the same proportion of diabetes in the training set as the testing set because it is important that our training data is representative.

Let’s look at that

#### VISUALIZE THIS
pima2<-pimaNorm%>%
  mutate(train=FALSE)

pima2$train[sampleInd]<-TRUE

ggplot(data=pima2, aes(x=Glucose, y=BloodPressure, color=factor(Outcome)))+
  geom_point()+
  facet_grid(.~train)

Step 5B: Stratified Splitting

From Scratch

First, we will build it from scratch by using basic tools in dplyr.

### INSTEAD we want the same prop

## PARTITION DATA
pima0<-pimaNorm%>%
  filter(Outcome==0)
dim(pima0)
## [1] 500   9
pima1<-pimaNorm%>%
  filter(Outcome==1)
dim(pima1)
## [1] 268   9
## SAMPLE INDECES
sample0<-sample(1:500, 350)
sample1<-sample(1:268, 188)

## TRAINING AND TESTING SETS
trainStrat<-rbind(pima0[sample0, ],
                  pima1[sample1, ])

testStrat<-rbind(pima0[-sample0, ],
                  pima1[-sample1, ])

## PROPORITON OF OUTCOME
mean(trainStrat$Outcome)
## [1] 0.3494424
mean(testStrat$Outcome)
## [1] 0.3478261
Caret (Automated)
library(caret)
# Split the data into training and test set
set.seed(314)
caretSamp <- createDataPartition(pimaNorm$Outcome , 
                                        p = 0.7, 
                                        list = FALSE)

## SPLIT TESTING AND TRAINING
trainCaret  <- pimaNorm[caretSamp, ]
testCaret <- pimaNorm[-caretSamp, ]

## PROPORITON OF OUTCOME
mean(trainCaret$Outcome)
## [1] 0.3494424
mean(testCaret$Outcome)
## [1] 0.3478261

Step 6: Prediction with KNN

The knn function is within the class package. Take a moment to read the documentation.

library(class)
help(knn)

We will need to specify four arguments:

  • train : matrix or data frame of training set cases.
  • test : matrix or data frame of test set cases. A vector will be interpreted as a row vector for a single case.
  • cl : factor of true classifications of training set
  • k : number of neighbours considered.

Let’s specify the arguments

#### look at knn function
## need train, test, cl, and k

trainFea<-trainCaret%>%
  select(-Outcome)
dim(trainFea)
## [1] 538   8
testFea<-testCaret%>%
  select(-Outcome)

trainOut<-trainCaret$Outcome
testOut<-testCaret$Outcome

Now to implement the algorithm. If there is a “tie” in closeness to the datapoint in question, that tie will be split randomly.

set.seed(1)
knn.pred=knn(train = trainFea,
             test = testFea,
             cl = trainOut,
             k=1)

head(knn.pred)
## [1] 0 1 0 0 0 1
## Levels: 0 1

Step 7: Confusion Matrix

We assess model fit for classification models with the confusion matrix. This illustrates the concordance (diagonal) and discordance (off-diagonal) between our predictions and test data.

#Confusion matrix
cm<-table(knn.pred,testOut)
cm
##         testOut
## knn.pred   0   1
##        0 115  39
##        1  35  41

Thus the correct prediction rate can be calculated using concordance:

### correct rate
mean(knn.pred==testOut)
## [1] 0.6782609

And the error is the complement

### error rate
1-mean(knn.pred==testOut)
## [1] 0.3217391
## OR
mean(knn.pred!=testOut)
## [1] 0.3217391

Other popular metrics include sensitivity and specificity: * Sensitivity = True Positive / (True Positive + False Negative) * Specificity = True Negative / (False Positive + True Negative)

Questions
  • Calculate the sensitivity and specificity for the knn confusion matrix above.

Step 8: Choosing \(k\)

Low values of \(k\) result is too wiggly (overfitting) models that are influenced by noise (high variance). On the other hand, high values of \(k\) are response enough (underfitting) and result in biased models. We can use a grid search to find the best value of \(k\).

## Whats the best k
## Pick a neighborhood
set.seed(123)
error <- c()
for (i in 1:30){
  knnPima<- knn(train = trainFea,
                test = testFea,
                cl = trainOut, 
                k = i)
  error[i] = 1- mean(knnPima==testOut)
}

ggplot(data = data.frame(error), aes(x = 1:30, y = error)) +
  geom_line(color = "Blue")+
  xlab("Neighborhood Size")

which.min(error)
## [1] 12
## best pred...
knn.pred12=knn(train = trainFea,
              test = testFea,
              cl = trainOut,
              k=12)

#Confusion matrix
cm12<-table(knn.pred12,testOut)

mean(knn.pred12==testOut)
## [1] 0.726087

Caret Method

Caret is a useful tool that standardizes model fitting syntax; however, its a blackbox.

### THE CARET METHOD CARET
set.seed(314)
model <- train(
  factor(Outcome) ~., 
  data = trainCaret , 
  method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("range"),
  tuneLength = 20
)
# Plot model accuracy vs different values of k
plot(model)

# Print the best tuning parameter k that
# maximizes model accuracy
model$bestTune
##    k
## 4 11
predicted.classes <- model %>% predict(testCaret)
head(predicted.classes)
## [1] 0 1 0 0 1 0
## Levels: 0 1
# Compute model accuracy rate
cmCaret<-table(predicted.classes ,testOut)
cmCaret
##                  testOut
## predicted.classes   0   1
##                 0 120  37
##                 1  30  43
### THE CARET WAY!
confusionMatrix(cmCaret)
## Confusion Matrix and Statistics
## 
##                  testOut
## predicted.classes   0   1
##                 0 120  37
##                 1  30  43
##                                           
##                Accuracy : 0.7087          
##                  95% CI : (0.6454, 0.7666)
##     No Information Rate : 0.6522          
##     P-Value [Acc > NIR] : 0.04036         
##                                           
##                   Kappa : 0.3445          
##                                           
##  Mcnemar's Test P-Value : 0.46355         
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 0.5375          
##          Pos Pred Value : 0.7643          
##          Neg Pred Value : 0.5890          
##              Prevalence : 0.6522          
##          Detection Rate : 0.5217          
##    Detection Prevalence : 0.6826          
##       Balanced Accuracy : 0.6687          
##                                           
##        'Positive' Class : 0               
## 

MULTIPLE CLASSES

Example #2: Iris Species

The iris dataset is a popular dataset used for statistical examples. The documentation for this dataset in R states:

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Our goal is to be able to predict the species of an unknown iris flower, given measurements for four physical characteristics.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Step 1: Standardize the data

Since we are using a distance metric to assess closeness of neighbors, it is essential to standardize our data.

# unstandardized 
var(iris[ ,1])
## [1] 0.6856935
var(iris[ ,2])
## [1] 0.1899794
# use scale to standardize
irisS<-iris
irisS[,1:4] <-scale(irisS[,1:4])
irisS<-data.frame(irisS)

# now check the standardized variance
var(irisS[ ,1])
## [1] 1
var(irisS[ ,2])
## [1] 1

Step 2: Visualize the Data

Looking at the data do there appear to be any clear groupings?

### Are there groups?
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(irisS)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## explore
library(tidyverse)
ggplot(data = irisS, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) + 
  geom_point()

ggplot(data = irisS, aes(x = Petal.Length, y = Petal.Width, col = Species)) + 
  geom_point()

Step 3: Train and Test

We want to make sure that we have balanced representation in the training and testing sets.

### train and test
set.seed(239)


setosa<- irisS%>%
  filter(Species=="setosa")

versicolor<- irisS%>%
  filter(Species=="versicolor")

virginica<- irisS%>%
  filter(Species=="virginica")

# 50 observations from each
# Take 60% to train (30) and 40% to test (20)
train <- sample(1:50, 30)

iris.train<- rbind(setosa[train,], versicolor[train,], virginica[train,])
iris.test<- rbind(setosa[-train,], versicolor[-train,], virginica[-train,])

Step 4: Fit KNN

library(class)
knnIris<- knn(train = iris.train[,1:4], 
              test = iris.test[,1:4], 
              cl = iris.train$Species, 
              k = 3)

## correct
mean(knnIris==iris.test$Species)
## [1] 0.9833333
## error
mean(knnIris!=iris.test$Species)
## [1] 0.01666667

Step 5: Cross Validation to find k

## Pick a neighborhood
error <- c()
for (i in 1:15)
{
  knnIris<- knn(train = iris.train[,1:4], test = iris.test[,1:4], cl = iris.train$Species, k = i)
  error[i] = 1- mean(knnIris == iris.test$Species)
}

ggplot(data = data.frame(error), aes(x = 1:15, y = error)) +
  geom_line(color = "Blue")+
  xlab("Neighborhood Size")

Step 6: Prediction for the best k

iris_pred <- knn(train = iris.train[,1:4], 
                 test = iris.test[,1:4], 
                 cl = iris.train$Species, 
                 k=3)

table(iris.test$Species, iris_pred)
##             iris_pred
##              setosa versicolor virginica
##   setosa         20          0         0
##   versicolor      0         20         0
##   virginica       0          1        19