Playing with KNN

Harold Nelson

4/14/2021

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Start with mtcars, keeping only the mpg and disp columns, and sort the rows by disp.

Answer

demo = mtcars %>% 
  select(mpg,disp) %>%  
  arrange(disp)

demo
##                      mpg  disp
## Toyota Corolla      33.9  71.1
## Honda Civic         30.4  75.7
## Fiat 128            32.4  78.7
## Fiat X1-9           27.3  79.0
## Lotus Europa        30.4  95.1
## Datsun 710          22.8 108.0
## Toyota Corona       21.5 120.1
## Porsche 914-2       26.0 120.3
## Volvo 142E          21.4 121.0
## Merc 230            22.8 140.8
## Ferrari Dino        19.7 145.0
## Merc 240D           24.4 146.7
## Mazda RX4           21.0 160.0
## Mazda RX4 Wag       21.0 160.0
## Merc 280            19.2 167.6
## Merc 280C           17.8 167.6
## Valiant             18.1 225.0
## Hornet 4 Drive      21.4 258.0
## Merc 450SE          16.4 275.8
## Merc 450SL          17.3 275.8
## Merc 450SLC         15.2 275.8
## Maserati Bora       15.0 301.0
## AMC Javelin         15.2 304.0
## Dodge Challenger    15.5 318.0
## Camaro Z28          13.3 350.0
## Ford Pantera L      15.8 351.0
## Hornet Sportabout   18.7 360.0
## Duster 360          14.3 360.0
## Pontiac Firebird    19.2 400.0
## Chrysler Imperial   14.7 440.0
## Lincoln Continental 10.4 460.0
## Cadillac Fleetwood  10.4 472.0

Let’s focus on a disp value of 110 and get a 5-NN estimate of mpg: the mean mpg of the five cars whose disp is closest to 110.

Answer

result = demo %>% 
  mutate(dist = abs(disp - 110)) %>% 
  arrange(dist) %>% 
  head(5) 

result
##                mpg  disp dist
## Datsun 710    22.8 108.0  2.0
## Toyota Corona 21.5 120.1 10.1
## Porsche 914-2 26.0 120.3 10.3
## Volvo 142E    21.4 121.0 11.0
## Lotus Europa  30.4  95.1 14.9
meank5 = mean(result$mpg)
  
meank5
## [1] 24.42
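
As a cross-check on the result above, here is a minimal base R sketch of the same calculation that does not depend on the tidyverse. The names d, dist0 and nearest are introduced here for illustration only and are not part of the original code.

```r
# Base R re-computation of the 5-NN estimate at disp = 110.
# d, dist0 and nearest are illustrative names, not from the original.
d <- mtcars[, c("mpg", "disp")]

# Distance of every car from the target displacement.
dist0 <- abs(d$disp - 110)

# Row indices of the five smallest distances, then those rows.
nearest <- d[order(dist0)[1:5], ]

# The KNN estimate is the mean mpg of the selected neighbors:
# (22.8 + 21.5 + 26.0 + 21.4 + 30.4) / 5
estimate <- mean(nearest$mpg)
estimate
```

The five selected rows should match the table printed above, so the mean agrees with meank5.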

Convert the code above to a function, which will return an estimated mpg value for a given value of disp. Make the value of k a parameter of the function. Test it with k = 3 and k = 5.

Answer

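# Estimate mpg at displacement disp0 as the mean mpg of the k cars
# in demo whose disp is closest to disp0.
mpgest = function(disp0, k) {
  result = demo %>% 
    mutate(dist = abs(disp - disp0)) %>% 
    arrange(dist) %>% 
    head(k)

  return(mean(result$mpg))
}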
  
mpgest(110,5)
## [1] 24.42
mpgest(110,3)
## [1] 23.43333
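
The function above reads demo from the global environment. One possible variation, sketched here in base R under the assumption that passing the data explicitly is acceptable, takes the data frame as an argument and checks that k is in range. The name mpgest2 and its data argument are hypothetical and are not part of the original exercise.

```r
# Hypothetical variant of mpgest(): the data frame is an argument
# (defaulting to mtcars) and k is validated before use.
mpgest2 <- function(disp0, k, data = mtcars) {
  stopifnot(k >= 1, k <= nrow(data))
  # Order rows by distance from disp0 and keep the k closest.
  nearest <- order(abs(data$disp - disp0))[seq_len(k)]
  mean(data$mpg[nearest])
}

mpgest2(110, 5)
mpgest2(110, 3)

# With k equal to the number of rows, every car is a neighbor,
# so the estimate is the overall mean mpg regardless of disp0.
mpgest2(110, nrow(mtcars))
mean(mtcars$mpg)
```

The last two lines illustrate the role of k: small values follow the local data closely, while large values pull every estimate toward the overall mean.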

Let’s use our function to estimate mpg at a set of values spanning the range of disp in the dataset. First, get the min and max values of disp for this dataset.

Answer

min(mtcars$disp)
## [1] 71.1
max(mtcars$disp)
## [1] 472

Use seq() to generate a grid of values from 70 to 470 with a step size of 10.

Answer

dispvalues = seq(from = 70, to = 470, by = 10)

Vectorize

Create vmpgest() as a vectorized version of mpgest().

Answer

vmpgest = Vectorize(mpgest)
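
To see what Vectorize() is doing here, the sketch below compares it with an explicit sapply() loop over the displacement values. It restates the estimator in base R under the illustrative name knn_mpg so that the block runs on its own; that name and the variables a and b are assumptions made for this example.

```r
# Illustrative base R restatement of the estimator, so the block
# is self-contained; knn_mpg is not a name from the original.
knn_mpg <- function(disp0, k) {
  nearest <- order(abs(mtcars$disp - disp0))[seq_len(k)]
  mean(mtcars$mpg[nearest])
}

dispvalues <- seq(from = 70, to = 470, by = 10)

# Vectorize() wraps the function so that it is called once per
# element of its arguments (k = 3 is recycled for every element).
a <- Vectorize(knn_mpg)(dispvalues, 3)

# The same computation written as an explicit loop with sapply().
b <- sapply(dispvalues, knn_mpg, k = 3)

all.equal(a, b)
length(a)
```

Both approaches call the scalar function once per displacement value, so neither is faster in any meaningful way; Vectorize() is simply a convenient wrapper.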

DF3 and DF10

Create data frames of estimated values based on k values of 3 and 10.

Answer

mpg3 = vmpgest(dispvalues,3)
mpg10 = vmpgest(dispvalues, 10)
df3 = data.frame(dispvalues,mpg3)
df10 = data.frame(dispvalues,mpg10)
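
An alternative layout, sketched below in base R, stacks the two sets of estimates into one long data frame with a column recording k. The names knn_mpg, est and long are illustrative assumptions, not part of the original answer.

```r
# Illustrative base R estimator so the block runs on its own.
knn_mpg <- function(disp0, k) {
  nearest <- order(abs(mtcars$disp - disp0))[seq_len(k)]
  mean(mtcars$mpg[nearest])
}

dispvalues <- seq(from = 70, to = 470, by = 10)

# Estimates over the whole grid for a given k.
est <- function(k) sapply(dispvalues, knn_mpg, k = k)

# One row per (disp, k) pair, with k stored as a label.
long <- rbind(
  data.frame(disp = dispvalues, mpg = est(3),  k = "3"),
  data.frame(disp = dispvalues, mpg = est(10), k = "10")
)

head(long)
table(long$k)
```

With the estimates in this shape, a plotting layer could map color to the k column instead of using a separate layer and a hard-coded color for each value of k.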

Plot

Create a plot showing the two KNN estimates (red for k = 3, green for k = 10), the actual data, and a linear model fit.

Answer

demo %>% 
  ggplot(aes(x = disp, y = mpg)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'blue') +
  geom_point(data = df3,aes(x = dispvalues, y = mpg3), color = 'red') +
  geom_point(data = df10,aes(x = dispvalues, y = mpg10), color = 'green') +
  ggtitle("3 and 10 Nearest Neighbors")
## `geom_smooth()` using formula 'y ~ x'