Similarity is a key framework for understanding data, and distance can serve as a measure of similarity: the closer two observations are, the more similar they are taken to be.
KNN (k-nearest neighbors) is a machine learning method that looks at the attributes of a set of labeled “training” data and then, given a new observation with the same attributes, predicts that observation's target attribute.
Nearest-neighbors reasoning: look at the observations with the smallest Euclidean distance from the observation you are trying to classify.
The k in KNN is the number of nearest neighbors used to classify the new observation; the predicted class is the one most common among those k neighbors.
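The reasoning above can be sketched in a few lines of base R. This is a minimal illustration, not how the `class` package implements `knn()`; the helper name `knn_one` and the toy data are assumptions made up for this example.

```r
# knn_one is a hypothetical helper: it classifies one new observation by
# majority vote among its k nearest training rows (Euclidean distance).
knn_one <- function(train_x, train_labels, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  # Subtract the new observation from every training row, column by column
  diffs <- sweep(train_x, 2, as.numeric(new_x))
  # Euclidean distance from the new observation to each training row
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Count the labels among those neighbors
  votes <- table(as.character(train_labels[nearest]))
  # Return the most common label (ties go to the first label in the table)
  names(votes)[which.max(votes)]
}

# Toy data (invented for illustration): two small groups
toy_x <- data.frame(Height = c(4.0, 4.2, 3.9, 7.0, 7.2, 6.8),
                    Weight = c(120, 125, 130, 200, 210, 190))
toy_labels <- c("Ewok", "Ewok", "Ewok", "Wookie", "Wookie", "Wookie")

toy_prediction <- knn_one(toy_x, toy_labels, c(Height = 4.1, Weight = 128), k = 3)
print(toy_prediction)
```

Here the three closest rows to the new observation are all labeled "Ewok", so the vote is unanimous.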
Can weight and height predict species from Star Wars?
Three species: Human, Wookie, Ewok
set.seed(1234)
HumanHeight <- rnorm(200, mean = 5.5, sd = .5)
WookieHeight <- rnorm(200, mean = 7.0, sd = .75)
EwokHeight <- rnorm(200, mean = 4, sd = .5)
HumanWeight <- rnorm(200, mean = 150, sd = 30)
WookieWeight <- rnorm(200, mean = 200, sd = 50)
EwokWeight <- rnorm(200, mean = 125, sd = 30)
# Combine the simulated heights and weights into one data frame, 200 rows per species
SWSpecies <- data.frame(
  Species = c(rep("Human", 200), rep("Wookie", 200), rep("Ewok", 200)),
  Height = c(HumanHeight, WookieHeight, EwokHeight),
  Weight = c(HumanWeight, WookieWeight, EwokWeight)
)
library(ggplot2)
ggplot(SWSpecies, aes(x=Height, y=Weight, color=Species, shape=Species)) +
geom_point()
#Weight
ggplot(SWSpecies, aes(Weight, fill = Species)) + geom_density(alpha = .5)
#Height
ggplot(SWSpecies, aes(Height, fill = Species)) + geom_density(alpha = .5)
library(dplyr)
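# Split into training and test sets: the first 133 rows of each species go to
# training and the remaining 67 to test. The rows were simulated independently,
# so taking the first 133 behaves like a random split.
TrainingData <- SWSpecies %>%
  group_by(Species) %>%
  slice(1:133) %>%
  ungroup()
TestData <- SWSpecies %>%
  group_by(Species) %>%
  slice(134:200) %>%
  ungroup()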
library(class)
SpeciesPrediction <- knn(train = select(TrainingData, Height, Weight), test = select(TestData, Height, Weight), cl = TrainingData$Species, k=3)
FirstTest <- cbind(TestData, SpeciesPrediction)
View(FirstTest)
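Instead of inspecting the predictions row by row, they can be summarized in a table of actual versus predicted species. The sketch below is self-contained: it rebuilds the simulated data with the same seed and the same split sizes in base R, so the variable names `in_train`, `confusion`, and `accuracy` are assumptions introduced here rather than part of the original walkthrough.

```r
library(class)

# Rebuild the simulated data (same seed and same order of rnorm() calls)
set.seed(1234)
n <- 200
SWSpecies <- data.frame(
  Species = rep(c("Human", "Wookie", "Ewok"), each = n),
  Height  = c(rnorm(n, 5.5, 0.5), rnorm(n, 7.0, 0.75), rnorm(n, 4, 0.5)),
  Weight  = c(rnorm(n, 150, 30), rnorm(n, 200, 50), rnorm(n, 125, 30))
)

# First 133 rows of each species for training, remaining 67 for testing
in_train <- rep(c(rep(TRUE, 133), rep(FALSE, 67)), times = 3)
TrainingData <- SWSpecies[in_train, ]
TestData <- SWSpecies[!in_train, ]

SpeciesPrediction <- knn(
  train = TrainingData[, c("Height", "Weight")],
  test  = TestData[, c("Height", "Weight")],
  cl    = factor(TrainingData$Species),
  k     = 3
)

# Rows are the true species, columns the predicted species
confusion <- table(Actual = as.character(TestData$Species),
                   Predicted = as.character(SpeciesPrediction))
print(confusion)

# Share of test observations classified correctly
accuracy <- mean(as.character(SpeciesPrediction) == as.character(TestData$Species))
print(accuracy)
```

Counts on the diagonal of the table are correct classifications; off-diagonal counts show which species get confused with each other.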
SpeciesPrediction <- knn(train = select(TrainingData, Height, Weight), test = select(TestData, Height, Weight), cl = TrainingData$Species, k=9)
SecondTest <- cbind(TestData, SpeciesPrediction)
View(SecondTest)
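To see how the choice of k affects the result, the same model can be fit for several values of k and the test accuracy compared. The sketch below also tries a standardized version of the features, which is an addition not in the original walkthrough: Weight varies over a much wider numeric range than Height, so it may dominate the Euclidean distance when the raw values are used. The list of k values and the names `results`, `raw`, and `scaled` are assumptions chosen for illustration.

```r
library(class)

# Rebuild the simulated data and the 133/67 split per species
set.seed(1234)
n <- 200
SWSpecies <- data.frame(
  Species = rep(c("Human", "Wookie", "Ewok"), each = n),
  Height  = c(rnorm(n, 5.5, 0.5), rnorm(n, 7.0, 0.75), rnorm(n, 4, 0.5)),
  Weight  = c(rnorm(n, 150, 30), rnorm(n, 200, 50), rnorm(n, 125, 30))
)
in_train <- rep(c(rep(TRUE, 133), rep(FALSE, 67)), times = 3)
train_x <- SWSpecies[in_train, c("Height", "Weight")]
test_x  <- SWSpecies[!in_train, c("Height", "Weight")]
train_cl <- factor(SWSpecies$Species[in_train])
test_cl  <- as.character(SWSpecies$Species[!in_train])

# Standardize both sets using the training means and standard deviations
train_scaled <- scale(train_x)
test_scaled <- scale(test_x,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))

# Test accuracy for one value of k on a given pair of feature sets
knn_accuracy <- function(train, test, k) {
  pred <- knn(train = train, test = test, cl = train_cl, k = k)
  mean(as.character(pred) == test_cl)
}

k_values <- c(1, 3, 5, 9, 15)
results <- data.frame(
  k      = k_values,
  raw    = sapply(k_values, function(k) knn_accuracy(train_x, test_x, k)),
  scaled = sapply(k_values, function(k) knn_accuracy(train_scaled, test_scaled, k))
)
print(results)
```

Each row of `results` gives the share of test observations classified correctly for one k, with and without standardizing. Picking k this way uses the test set for tuning, so the accuracies are best read as a rough comparison rather than a final estimate.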
KNN can be used for:
- Knowing certain features of people who have a particular medical condition (such as risk genes), is a given person likely to develop that condition?
- Knowing attributes of neighborhoods in which certain businesses have succeeded, is a given neighborhood likely to be a good area for a specific business to open?
- Knowing certain qualities of students who make use of a service, is a given student likely to take advantage of an extracurricular activity or service provided by a school?