KNN: K-Nearest Neighbors

Similarity is a key framework for understanding data, and distance can serve as a measure of similarity: the closer two observations are, the more similar they are.

KNN is a machine learning algorithm that learns from a set of labeled “training” observations. Given a new observation with the same attributes, it predicts the target attribute from the training observations that are most similar to it.

Nearest-neighbors reasoning: find the training observations with the smallest Euclidean distance to the observation you are trying to classify.
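To make the distance idea concrete, here is a minimal sketch of Euclidean distance between two observations described by Height and Weight. The two observations and the helper name `euclidean` are made up for illustration; they are not part of the data created later.

```r
# Two made-up observations (illustrative values only)
a <- c(Height = 5.5, Weight = 150)
b <- c(Height = 7.0, Weight = 200)

# Hypothetical helper: square root of the summed squared differences
euclidean <- function(x, y) sqrt(sum((x - y)^2))

d_ab <- euclidean(a, b)   # sqrt(1.5^2 + 50^2), about 50.02

# Cross-check with base R's dist(), which uses Euclidean distance by default
d_check <- as.numeric(dist(rbind(a, b)))

print(d_ab)
print(d_check)
```

Note that almost all of the distance here comes from Weight, because its values are numerically much larger than those of Height.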

The k in KNN is the number of nearest neighbors used to classify the new observation; the predicted class is the one most common among those k neighbors.
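The voting step can be sketched by hand in a few lines. This is an illustrative toy version, not how the `class` package implements it: the function `knn_vote`, the six training rows, and the new observation are all hypothetical.

```r
# Hypothetical helper: classify one new observation by majority vote
knn_vote <- function(train, labels, new_obs, k) {
  train <- as.matrix(train)
  # Euclidean distance from new_obs to every training row
  dists <- sqrt(rowSums(sweep(train, 2, new_obs)^2))
  # Labels of the k closest training rows
  nearest <- labels[order(dists)[seq_len(k)]]
  # Most frequent label among them (a tie goes to the first label in table order)
  names(which.max(table(nearest)))
}

# Made-up training data: two observations per species
train <- data.frame(Height = c(5.4, 5.6, 7.1, 6.9, 4.0, 4.2),
                    Weight = c(150, 155, 205, 195, 120, 130))
labels <- c("Human", "Human", "Wookie", "Wookie", "Ewok", "Ewok")

# The three nearest rows are two Wookies and one Human, so the vote is "Wookie"
pred <- knn_vote(train, labels, new_obs = c(7.0, 200), k = 3)
print(pred)
```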

Example

Can weight and height predict species from Star Wars?

Classifications

Three species: Human, Wookie, Ewok

Create Data

set.seed(1234)

HumanHeight <- rnorm(200, mean = 5.5, sd = .5)
WookieHeight <- rnorm(200, mean = 7.0, sd = .75)
EwokHeight <- rnorm(200, mean = 4, sd = .5)

HumanWeight <- rnorm(200, mean = 150, sd = 30)
WookieWeight <- rnorm(200, mean = 200, sd = 50)
EwokWeight <- rnorm(200, mean = 125, sd = 30)

Create data frame in tidy form

SWSpecies <- data.frame(
  Species = c(rep("Human", 200), rep("Wookie", 200), rep("Ewok", 200)),
  Height = c(HumanHeight, WookieHeight, EwokHeight),
  Weight = c(HumanWeight, WookieWeight, EwokWeight)
)

Create scatterplot to visualize data

library(ggplot2)
ggplot(SWSpecies, aes(x=Height, y=Weight, color=Species, shape=Species)) +
  geom_point()

Create density plots to visualize data

#Weight
ggplot(SWSpecies, aes(Weight, fill = Species)) + geom_density(alpha = .5)

#Height
ggplot(SWSpecies, aes(Height, fill = Species)) + geom_density(alpha = .5)

Split data

library(dplyr)
#Split into training and test
TrainingData <- SWSpecies %>%
  group_by(Species) %>%
  slice(1:133) %>%
  ungroup()

TestData <- SWSpecies %>%
  group_by(Species) %>%
  slice(134:200) %>%
  ungroup()

Run knn

library(class)
SpeciesPrediction <- knn(
  train = select(TrainingData, Height, Weight),
  test = select(TestData, Height, Weight),
  cl = TrainingData$Species,
  k = 3
)
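One caveat with the call above: Height and Weight are on very different numeric scales, so Euclidean distance is driven mostly by Weight. A common option, which is not part of the original analysis, is to standardize both attributes using the training data before calling `knn`. The sketch below rebuilds the simulated data with base R so it runs on its own; the names `Demo`, `train`, `test`, and `pred_scaled` are assumptions made for this example.

```r
library(class)

# Rebuild the simulated data so this block is self-contained
set.seed(1234)
n <- 200
Demo <- data.frame(
  Species = rep(c("Human", "Wookie", "Ewok"), each = n),
  Height  = c(rnorm(n, 5.5, 0.5), rnorm(n, 7.0, 0.75), rnorm(n, 4, 0.5)),
  Weight  = c(rnorm(n, 150, 30), rnorm(n, 200, 50), rnorm(n, 125, 30))
)

# First 133 rows of each species for training, the remaining 67 for testing
idx   <- rep(seq_len(n), times = 3)
train <- Demo[idx <= 133, ]
test  <- Demo[idx > 133, ]

# Standardize with the training means and standard deviations only,
# then apply the same transformation to the test set
train_x <- scale(train[, c("Height", "Weight")])
test_x  <- scale(test[, c("Height", "Weight")],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

pred_scaled <- knn(train = train_x, test = test_x,
                   cl = factor(train$Species), k = 3)
print(table(pred_scaled))
```

Whether scaling helps depends on the data; comparing accuracy with and without it is the simplest way to find out.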

See Results

FirstTest <- cbind(TestData, SpeciesPrediction)
View(FirstTest)
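`View()` opens an interactive viewer, which makes it hard to judge overall performance at a glance. A confusion matrix and an overall accuracy give a quick summary instead. The following is a sketch under the same simulation setup, rebuilt here with base R so it runs on its own; `Demo`, `train`, `test`, `confusion`, and `accuracy` are names chosen for this example.

```r
library(class)

# Rebuild the simulated data so this block is self-contained
set.seed(1234)
n <- 200
Demo <- data.frame(
  Species = rep(c("Human", "Wookie", "Ewok"), each = n),
  Height  = c(rnorm(n, 5.5, 0.5), rnorm(n, 7.0, 0.75), rnorm(n, 4, 0.5)),
  Weight  = c(rnorm(n, 150, 30), rnorm(n, 200, 50), rnorm(n, 125, 30))
)
idx   <- rep(seq_len(n), times = 3)
train <- Demo[idx <= 133, ]
test  <- Demo[idx > 133, ]

pred <- knn(train = train[, c("Height", "Weight")],
            test  = test[, c("Height", "Weight")],
            cl    = factor(train$Species), k = 3)

# Rows are the true species, columns are the predicted species
confusion <- table(Actual = test$Species, Predicted = pred)

# Share of test observations classified correctly
accuracy <- mean(as.character(pred) == as.character(test$Species))

print(confusion)
print(accuracy)
```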

Increase K

SpeciesPrediction <- knn(
  train = select(TrainingData, Height, Weight),
  test = select(TestData, Height, Weight),
  cl = TrainingData$Species,
  k = 9
)

See results of second test

SecondTest <- cbind(TestData, SpeciesPrediction)
View(SecondTest)
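Rather than inspecting each value of k by eye, one option is to loop over several values and compare test accuracy. This is a minimal sketch, not part of the original analysis; the candidate values in `ks` are arbitrary, and the data is rebuilt with base R so the block runs on its own.

```r
library(class)

# Rebuild the simulated data so this block is self-contained
set.seed(1234)
n <- 200
Demo <- data.frame(
  Species = rep(c("Human", "Wookie", "Ewok"), each = n),
  Height  = c(rnorm(n, 5.5, 0.5), rnorm(n, 7.0, 0.75), rnorm(n, 4, 0.5)),
  Weight  = c(rnorm(n, 150, 30), rnorm(n, 200, 50), rnorm(n, 125, 30))
)
idx   <- rep(seq_len(n), times = 3)
train <- Demo[idx <= 133, ]
test  <- Demo[idx > 133, ]

# Arbitrary candidate values of k (an assumption for illustration)
ks <- c(1, 3, 5, 9, 15)

# Test accuracy for each k
acc <- sapply(ks, function(k) {
  pred <- knn(train = train[, c("Height", "Weight")],
              test  = test[, c("Height", "Weight")],
              cl    = factor(train$Species), k = k)
  mean(as.character(pred) == as.character(test$Species))
})

results <- data.frame(k = ks, accuracy = acc)
print(results)
```

Choosing k from test-set accuracy tends to give an optimistic estimate; a separate validation set or cross-validation is the usual safeguard.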

Applications

KNN can be used for:

- Medical risk prediction: given the features of people who have a particular medical condition (for example, risk genes), is a given person likely to develop that condition?

Open Questions