Harold Nelson
4/14/2021
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Look at mtcars, just mpg and disp. Sort by disp.
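The chunk that produced the listing below is not shown. A minimal sketch of what it might look like follows, assuming the result is stored under the name `demo` (the name the later code refers to) and that dplyr's `select()` and `arrange()` are used:

```r
library(dplyr)  # dplyr is part of the tidyverse attached above

# Keep only mpg and disp, then sort by displacement (ascending).
# `demo` is the name used by the code later in this document.
demo = mtcars %>%
  select(mpg, disp) %>%
  arrange(disp)

demo
```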
## mpg disp
## Toyota Corolla 33.9 71.1
## Honda Civic 30.4 75.7
## Fiat 128 32.4 78.7
## Fiat X1-9 27.3 79.0
## Lotus Europa 30.4 95.1
## Datsun 710 22.8 108.0
## Toyota Corona 21.5 120.1
## Porsche 914-2 26.0 120.3
## Volvo 142E 21.4 121.0
## Merc 230 22.8 140.8
## Ferrari Dino 19.7 145.0
## Merc 240D 24.4 146.7
## Mazda RX4 21.0 160.0
## Mazda RX4 Wag 21.0 160.0
## Merc 280 19.2 167.6
## Merc 280C 17.8 167.6
## Valiant 18.1 225.0
## Hornet 4 Drive 21.4 258.0
## Merc 450SE 16.4 275.8
## Merc 450SL 17.3 275.8
## Merc 450SLC 15.2 275.8
## Maserati Bora 15.0 301.0
## AMC Javelin 15.2 304.0
## Dodge Challenger 15.5 318.0
## Camaro Z28 13.3 350.0
## Ford Pantera L 15.8 351.0
## Hornet Sportabout 18.7 360.0
## Duster 360 14.3 360.0
## Pontiac Firebird 19.2 400.0
## Chrysler Imperial 14.7 440.0
## Lincoln Continental 10.4 460.0
## Cadillac Fleetwood 10.4 472.0
Let’s focus on a disp value of 110 and get a 5-NN estimate of mpg: the mean mpg of the five cars whose disp is closest to 110.
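The code for this step is not shown either. The sketch below mirrors the body of the `mpgest()` function defined later, with the target and k written out; the names `disp0`, `k` and `nearest` are chosen here for illustration:

```r
library(dplyr)

demo = mtcars %>% select(mpg, disp)

disp0 = 110  # displacement at which we want an mpg estimate
k = 5        # number of neighbors to average over

# Distance of every car from the target, closest first, keep the top k.
nearest = demo %>%
  mutate(dist = abs(disp - disp0)) %>%
  arrange(dist) %>%
  head(k)

nearest
mean(nearest$mpg)  # the k-NN estimate of mpg
```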
## mpg disp dist
## Datsun 710 22.8 108.0 2.0
## Toyota Corona 21.5 120.1 10.1
## Porsche 914-2 26.0 120.3 10.3
## Volvo 142E 21.4 121.0 11.0
## Lotus Europa 30.4 95.1 14.9
## [1] 24.42
Convert the code above to a function, which will return an estimated mpg value for a given value of disp. Make the value of k a parameter of the function. Test it with k = 3 and k = 5.
mpgest = function(disp0, k){
  # Average mpg of the k cars in demo whose disp is closest to disp0.
  result = demo %>%
    mutate(dist = abs(disp - disp0)) %>%
    arrange(dist) %>%
    head(k)
  return(mean(result$mpg))
}
mpgest(110,5)
## [1] 24.42
mpgest(110,3)
## [1] 23.43333
Let’s use our function to look at a set of values over the range of disp in the dataset. Get the min and max values of disp for this dataset.
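The calls behind the two values below are not shown; presumably they are something like the following, assuming `demo` holds the mpg and disp columns of mtcars:

```r
library(dplyr)

demo = mtcars %>% select(mpg, disp)

min(demo$disp)  # smallest displacement in the data
max(demo$disp)  # largest displacement in the data
```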
## [1] 71.1
## [1] 472
Use seq() to generate a range of values. Go from 70 to 470 with a step size of 10.
Create vpmgest() as a vectorized version of mpgest().
Create dataframes of estimated values based on k values of 3 and 10.
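The last three steps could be sketched as follows. The object and column names (`dispvalues`, `df3`, `mpg3`, `df10`, `mpg10`) are taken from the plotting code in this document; using base R's `Vectorize()` for the vectorized version is an assumption, since the original chunk is not shown.

```r
library(dplyr)

demo = mtcars %>% select(mpg, disp)

# Same function as defined earlier in this document.
mpgest = function(disp0, k){
  result = demo %>%
    mutate(dist = abs(disp - disp0)) %>%
    arrange(dist) %>%
    head(k)
  return(mean(result$mpg))
}

# Grid of displacement values: 70, 80, ..., 470.
dispvalues = seq(70, 470, by = 10)

# mpgest() handles one disp0 at a time; Vectorize() wraps it so that
# a whole vector of disp0 values can be passed in a single call.
vpmgest = Vectorize(mpgest, vectorize.args = "disp0")

# One data frame per choice of k.
df3  = data.frame(dispvalues = dispvalues, mpg3  = vpmgest(dispvalues, 3))
df10 = data.frame(dispvalues = dispvalues, mpg10 = vpmgest(dispvalues, 10))

head(df3)
```

A call such as `sapply(dispvalues, mpgest, k = 3)` would give the same vector without creating a wrapper function.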
Create a plot showing the two knn estimates, the actual data, and a linear model.
demo %>%
  ggplot(aes(x = disp, y = mpg)) +
  geom_point() +                                # actual data (black)
  geom_smooth(method = 'lm', color = 'blue') +  # linear model
  geom_point(data = df3, aes(x = dispvalues, y = mpg3), color = 'red') +      # k = 3
  geom_point(data = df10, aes(x = dispvalues, y = mpg10), color = 'green') +  # k = 10
  ggtitle("3 and 10 Nearest Neighbors")
## `geom_smooth()` using formula 'y ~ x'