Harold Nelson
4/18/2018
library(tidyverse)
## ── Attaching packages ───────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
load("~/Dropbox/RProjects/Module 8/cdc.Rdata")
The Euclidean distance between two points in the plane \((x_1,y_1)\) and \((x_2,y_2)\) is given by
\[ d=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}\]
Example:
Suppose the first point is \((1,1)\) and the second point is \((5,3)\).
d = sqrt((1-5)^2 +(1-3)^2)
d
## [1] 4.472136
For longer vectors the formula extends coordinate by coordinate: the distance between \((a_1,\dots,a_n)\) and \((b_1,\dots,b_n)\) is \(\sqrt{\sum_{i=1}^{n}(a_i-b_i)^2}\).
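The general case can be sketched as a small base R function. The name `euclid` is made up for this illustration and is not part of any package.

```r
# Euclidean distance between two numeric vectors of equal length.
# `euclid` is a hypothetical helper name used only for this example.
euclid <- function(a, b) {
  stopifnot(length(a) == length(b))
  sqrt(sum((a - b)^2))
}

# The two-point example from above: (1,1) and (5,3)
euclid(c(1, 1), c(5, 3))

# The same function works unchanged with three coordinates
euclid(c(1, 1, 0), c(5, 3, 4))
```

Because R subtracts vectors element by element, the same function body serves for any number of coordinates.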
We can consider the distance between two people in Height-Weight space. Take a sample of people from the cdc dataset and place them in this space.
If you knew a person’s height and weight, how would you predict the person’s gender? A simple way is to calculate the distance between the new person’s height and weight and those of each person whose gender we already know, then sort the known data by that distance. The genders of the closest known people give us our decision. How many of these nearest neighbors should we take? This is the K in KNN. With two classes, choose an odd K so that the vote cannot end in a tie.
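set.seed(123)
# Flag roughly 1% of the rows as the sample of people whose gender is known
cdc$sample = sample(c(TRUE,FALSE),size=nrow(cdc),prob=c(.01,.99),replace=TRUE)
# Plot the sampled people in Height-Weight space, colored by gender
cdc %>% filter(sample==TRUE) %>%
  ggplot(aes(height,weight,color=gender)) + geom_point(alpha=.6)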
Create the dataframe known from the cdc dataframe using the variable cdc$sample. Keep height, weight and gender.
The new person is 66 inches tall and weighs 155 pounds. Calculate the distance of each person in known from the new person and add this to the dataframe known.
Sort the dataframe by distance.
Examine the first 10 observations.
Does your choice of K make a difference in the gender you assign?
cdc %>% filter(sample == TRUE) %>%
  select(height,weight,gender) %>%
  # Distance of each known person from the new person (66, 155)
  mutate(d=sqrt((66-height)^2 + (155-weight)^2)) %>%
  # Nearest neighbors first
  arrange(d) %>%
  head(10)
##    height weight gender        d
## 1      65    155      f 1.000000
## 2      66    152      f 3.000000
## 3      67    158      f 3.162278
## 4      70    155      m 4.000000
## 5      62    153      f 4.472136
## 6      66    160      f 5.000000
## 7      71    155      m 5.000000
## 8      65    160      f 5.099020
## 9      65    160      f 5.099020
## 10     64    160      f 5.385165
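To see whether K matters for this new person, a majority vote can be taken over the first K rows. This is a minimal sketch in base R: the vector `nearest` is typed in by hand from the gender column printed above, and `knn_vote` is a made-up helper name.

```r
# Genders of the 10 nearest neighbors, copied by hand from the output above
# (already sorted by distance from the new person).
nearest <- c("f", "f", "f", "m", "f", "f", "m", "f", "f", "f")

# Hypothetical helper: the most common label among the first k neighbors.
knn_vote <- function(labels, k) {
  counts <- table(labels[seq_len(k)])
  names(counts)[which.max(counts)]
}

# Try every odd K up to 9
ks <- c(1, 3, 5, 7, 9)
votes <- sapply(ks, function(k) knn_vote(nearest, k))
data.frame(K = ks, vote = votes)
```

For these ten rows every odd K gives f, because the m neighbors are outnumbered at each cutoff.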
Note that height and weight have different scales. How does this affect the distance computations?
sd(cdc$height)
## [1] 4.125954
sd(cdc$weight)
## [1] 40.08097
A difference of 10 is about 2.4 standard deviations in height but only about 0.25 standard deviations in weight.
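The arithmetic behind that comparison can be checked directly. The two standard deviations are typed in by hand from the output above so that the snippet runs without the cdc data.

```r
sd_height <- 4.125954   # sd(cdc$height), copied from the output above
sd_weight <- 40.08097   # sd(cdc$weight), copied from the output above

# A raw difference of 10 expressed in standard deviations of each variable
diff_in_sd <- c(height = 10 / sd_height, weight = 10 / sd_weight)
round(diff_in_sd, 2)
```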
Does this mean that weight has more influence on the gender classification than height? How could we correct this?
Create new variables zh and zw, the z-scores of height and weight. Use these to rework our process.
newh = (66-mean(cdc$height))/sd(cdc$height)
newh
## [1] -0.2866973
neww = (155-mean(cdc$weight))/sd(cdc$weight)
neww
## [1] -0.3663322
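cdc %>%
  # Standardize using the mean and sd of the full dataset
  mutate(zw = (weight - mean(weight))/sd(weight),
         zh = (height - mean(height))/sd(height)) %>%
  filter(sample == TRUE) %>%
  select(height,zh,weight,zw,gender) %>%
  # Distance from the new person, now measured in z-score units
  mutate(d=sqrt((newh-zh)^2 + (neww-zw)^2)) %>%
  arrange(d) %>%
  head(10)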
##    height          zh weight         zw gender          d
## 1      66 -0.28669731    152 -0.4411807      f 0.07484849
## 2      66 -0.28669731    160 -0.2415847      f 0.12474748
## 3      65 -0.52906548    155 -0.3663322      f 0.24236817
## 4      66 -0.28669731    145 -0.6158272      f 0.24949496
## 5      67 -0.04432914    158 -0.2914837      f 0.25366243
## 6      65 -0.52906548    160 -0.2415847      f 0.27258809
## 7      65 -0.52906548    160 -0.2415847      f 0.27258809
## 8      67 -0.04432914    146 -0.5908777      f 0.33039824
## 9      67 -0.04432914    165 -0.1168372      m 0.34783626
## 10     65 -0.52906548    145 -0.6158272      f 0.34783626
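One thing worth noticing about the standardized distance: when two z-scores are subtracted the means cancel, so each term is just the raw difference divided by the standard deviation. The sketch below checks this against rows of the output above. The numbers are typed in by hand from the printed output, and `scaled_dist` is a made-up helper name.

```r
sd_height <- 4.125954   # sd(cdc$height), copied from the earlier output
sd_weight <- 40.08097   # sd(cdc$weight), copied from the earlier output

# Hypothetical helper: distance in z-score space from the new person.
# The means drop out, so only the standard deviations are needed.
scaled_dist <- function(h, w, new_h = 66, new_w = 155) {
  sqrt(((new_h - h) / sd_height)^2 + ((new_w - w) / sd_weight)^2)
}

# Row 1 of the output: height 66, weight 152
scaled_dist(66, 152)

# Rows 9 and 10: (67, 165) and (65, 145) are equally far from (66, 155)
scaled_dist(67, 165)
scaled_dist(65, 145)
```

In these units a 1-inch height difference counts about as much as a 10-pound weight difference, which is why the ordering of neighbors changes from the unscaled version.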