KNN Presentation

library(tidyverse)

## ── Attaching packages ───────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
## ✔ tibble  1.4.2          ✔ dplyr   0.7.4     
## ✔ tidyr   0.8.0          ✔ stringr 1.3.0     
## ✔ readr   1.1.1          ✔ forcats 0.3.0

## ── Conflicts ──────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

load("~/Dropbox/RProjects/Module 8/cdc.Rdata")

Distance

The Euclidean distance between two points in the plane $(x_1,y_1)$ and $(x_2,y_2)$ is given by

\[ d=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}\]

Example:

Suppose the first point is $(1,1)$ and the second point is $(5,3)$.

d = sqrt((1-5)^2 +(1-3)^2)
d

## [1] 4.472136

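For vectors with more than two components, the formula extends term by term: the distance between $(p_1,\dots,p_n)$ and $(q_1,\dots,q_n)$ is $d=\sqrt{(p_1-q_1)^2+\cdots+(p_n-q_n)^2}$.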
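The extension to longer vectors can be sketched with a small helper. The name `euclid` is an illustrative assumption, not part of the original code; it takes two numeric vectors of equal length and sums one squared difference per component.

```r
# Hypothetical helper: Euclidean distance between two equal-length numeric vectors.
euclid <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((p - q)^2))
}

# Reproduces the two-dimensional example above.
euclid(c(1, 1), c(5, 3))

# The same function works unchanged in three dimensions.
euclid(c(1, 1, 1), c(5, 3, 2))
```

The first call returns the same 4.472136 as the hand computation, since it evaluates the same expression.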

Height and Weight

We can consider the distance between two people in Height-Weight space. Take a sample of people from the cdc dataset and place them in this space.

If you knew a person’s height and weight, how would you predict the person’s gender? A simple approach is to calculate the distance between the new person’s height and weight and those of each person whose gender we already know. Sorting the known data by that distance and looking at the gender of the closest people gives us our decision: assign the gender held by the majority of them. How many of these nearest neighbors should we take? This is the K in KNN (K nearest neighbors). With two classes, choose an odd K so the vote cannot end in a tie.

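set.seed(123)  # make the random sample reproducible
# Flag roughly 1% of the rows as the sample of known people
cdc$sample = sample(c(TRUE,FALSE),size=nrow(cdc),prob=c(.01,.99),replace=TRUE)
# Plot the sampled people in height-weight space, colored by gender
cdc %>% filter(sample) %>% 
  ggplot(aes(height,weight,color=gender)) + geom_point(alpha=.6)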

Exercise

Create the dataframe known from the cdc dataframe using the variable cdc$sample. Keep height, weight and gender.

The new person is 66 inches tall and weighs 155. Calculate the distance of each person in known from the new person and add this to the dataframe known.

Sort the dataframe by distance.

Examine the first 10 observations.

Does your choice of K make a difference in the gender you assign?

Answer

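# Build known from the sampled rows, add the distance to the new person (66, 155), and sort by it
known = cdc %>% filter(sample == TRUE) %>% 
  select(height,weight,gender) %>% 
  mutate(d=sqrt((66-height)^2 + (155-weight)^2)) %>% 
  arrange(d)
# The ten nearest neighbors
head(known, 10)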

##    height weight gender        d
## 1      65    155      f 1.000000
## 2      66    152      f 3.000000
## 3      67    158      f 3.162278
## 4      70    155      m 4.000000
## 5      62    153      f 4.472136
## 6      66    160      f 5.000000
## 7      71    155      m 5.000000
## 8      65    160      f 5.099020
## 9      65    160      f 5.099020
## 10     64    160      f 5.385165
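To check whether K matters here, the vote can be tallied for several values of K. This is a minimal sketch that copies the gender column from the output above by hand; the names `nearest` and `vote` are illustrative assumptions.

```r
# Genders of the ten nearest neighbors, copied in order from the output above.
nearest <- c("f", "f", "f", "m", "f", "f", "m", "f", "f", "f")

# Hypothetical helper: majority vote among the first k neighbors.
vote <- function(k) names(which.max(table(nearest[1:k])))

# Predicted gender for several odd values of K.
sapply(c(1, 3, 5, 7, 9), vote)
```

For this sample, every odd K from 1 to 9 gives "f", so the choice of K does not change the assignment.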

Note that height and weight have different scales. How does this affect the distance computations?

sd(cdc$height)

## [1] 4.125954

sd(cdc$weight)

## [1] 40.08097

A difference of 10 units is about 2.4 standard deviations in height but only about 0.25 standard deviations in weight.
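The arithmetic can be checked directly, and the same idea shows how scale changes which neighbor looks closer. This is a sketch that hard-codes the two standard deviations printed above; the variable names are illustrative assumptions. The two neighbors compared are taken from the first table: (70, 155), which differs from the new person only by 4 in height, and (66, 160), which differs only by 5 in weight.

```r
# Standard deviations as printed above.
sd_height <- 4.125954
sd_weight <- 40.08097

# A difference of 10 units, expressed in standard deviations.
10 / sd_height
10 / sd_weight

# Distance from the new person (66, 155) to two known neighbors,
# first in raw units, then in standard-deviation units.
raw <- c(height_diff_4 = 4, weight_diff_5 = 5)
in_sd_units <- c(height_diff_4 = 4 / sd_height, weight_diff_5 = 5 / sd_weight)
rbind(raw, in_sd_units)
```

In raw units the neighbor who differs in height is the closer of the two; in standard-deviation units the order reverses.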

Does this mean that weight has more influence on the gender classification than height? How could we correct this?

Create new variables zh and zw, the z-scores of height and weight. Use these to rework our process.

# z-score of the new person's height, using the mean and sd of the full cdc data
newh = (66-mean(cdc$height))/sd(cdc$height)
newh

## [1] -0.2866973

neww = (155-mean(cdc$weight))/sd(cdc$weight)
neww

## [1] -0.3663322

# Standardize on the full data, then keep the sampled rows and sort by distance in z-score space
cdc %>% 
  mutate(zw = (weight - mean(weight))/sd(weight),
         zh = (height - mean(height))/sd(height)) %>% 
  filter(sample == TRUE) %>% 
  select(height,zh,weight,zw,gender) %>% 
  mutate(d=sqrt((newh-zh)^2 + (neww-zw)^2)) %>% 
  arrange(d) %>% 
  head(10)

##    height          zh weight         zw gender          d
## 1      66 -0.28669731    152 -0.4411807      f 0.07484849
## 2      66 -0.28669731    160 -0.2415847      f 0.12474748
## 3      65 -0.52906548    155 -0.3663322      f 0.24236817
## 4      66 -0.28669731    145 -0.6158272      f 0.24949496
## 5      67 -0.04432914    158 -0.2914837      f 0.25366243
## 6      65 -0.52906548    160 -0.2415847      f 0.27258809
## 7      65 -0.52906548    160 -0.2415847      f 0.27258809
## 8      67 -0.04432914    146 -0.5908777      f 0.33039824
## 9      67 -0.04432914    165 -0.1168372      m 0.34783626
## 10     65 -0.52906548    145 -0.6158272      f 0.34783626
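The whole process (standardize, measure distance, sort, vote) can be wrapped in one function. The sketch below is illustrative only: `knn_vote`, `toy` and `toy_gender` are hypothetical names, the toy data are made up rather than drawn from cdc, and the function standardizes with the mean and standard deviation of the known points it is given, whereas the code above uses those of the full cdc data.

```r
# Hypothetical helper: classify one new point by majority vote among its
# k nearest neighbors, after standardizing each predictor.
knn_vote <- function(train, labels, new, k) {
  mu <- colMeans(train)
  s  <- apply(train, 2, sd)
  z_train <- scale(train, center = mu, scale = s)  # z-scores of the known points
  z_new   <- (new - mu) / s                        # same transformation for the new point
  d <- sqrt(rowSums(sweep(z_train, 2, z_new)^2))   # Euclidean distance to each known point
  nearest <- labels[order(d)][seq_len(k)]          # labels of the k closest points
  names(which.max(table(nearest)))                 # most common label among them
}

# Made-up illustration data, not drawn from cdc.
toy <- cbind(height = c(62, 64, 65, 70, 71, 73),
             weight = c(120, 135, 150, 170, 185, 200))
toy_gender <- c("f", "f", "f", "m", "m", "m")

knn_vote(toy, toy_gender, new = c(66, 155), k = 3)
```

With two classes and an odd `k` the vote cannot tie; for an even `k`, `which.max` would silently pick the first label in alphabetical order, so a real implementation would need an explicit tie rule.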