This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

This is a script created by following the KNN tutorial at https://www.datacamp.com/community/tutorials/machine-learning-in-r#gs.SLKqIHA

The knn() function is provided by the class package, so that is the package to install:

install.packages("class")

We load the class package for classification and the ggvis library for plotting. You can check whether the class package is installed with:

"class" %in% rownames(installed.packages())
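A slightly more defensive version of that check, as a minimal sketch: requireNamespace() reports whether a package can be loaded without attaching it, so it can guard an install. The helper name ensure_package is hypothetical and not part of the tutorial.

```r
# Hypothetical helper: install a package only when it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can be loaded after the attempt.
  requireNamespace(pkg, quietly = TRUE)
}

# class is a recommended package bundled with standard R distributions,
# so this normally returns TRUE without installing anything.
has_class <- ensure_package("class")
has_class
```

The same call works for ggvis, which is not bundled with R and may trigger an actual install.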

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.

library(ggvis)
library(class)   # required for knn
iris %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()

iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>% layer_points()

Let’s take a look at the first few rows of the data:

head(iris)

A deeper look at the data. The ‘str’ function displays each column’s type and its first few values. The ‘table’ function counts the observations in each level of a categorical variable.

str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 

Calculate the proportion of each category with prop.table, expressed here as a percentage.

round(prop.table(table(iris$Species)) * 100, digits = 1)

    setosa versicolor  virginica 
      33.3       33.3       33.3 

Now let’s get an idea of the ranges. You can optionally select certain columns with: summary(iris[c("Petal.Width", "Sepal.Width")])

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  
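To see the ranges directly rather than reading them off the summary, one option is to apply range to each numeric column. This is a small sketch using base R only; iris_ranges and range_width are illustrative names.

```r
# Columns 1 to 4 are the numeric measurements; column 5 is Species.
iris_ranges <- sapply(iris[1:4], range)
rownames(iris_ranges) <- c("min", "max")
iris_ranges

# Width of each range (max minus min), useful when judging whether
# the variables are on comparable scales.
range_width <- iris_ranges["max", ] - iris_ranges["min", ]
range_width
```

For iris the widths follow from the Min. and Max. rows of the summary: 3.6, 2.4, 5.9 and 2.4 cm.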

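The iris data does not need normalizing, because all four measurements are in centimeters and span similar ranges, but here is how to do it. Min-max normalization rescales each column to the range 0 to 1.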

normalize <- function(x) { 
  num <- x - min(x) 
  denom <- max(x) - min(x) 
  return (num/denom) 
}
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))
summary(iris_norm)
  Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
 Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
 Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
 3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
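As a quick sanity check, the same min-max formula can be tried on a small vector. The sketch below also shows one caveat: a constant column has a zero denominator, so the result is NaN. The guarded variant normalize_safe is a hypothetical addition, not part of the tutorial, and mapping a constant column to zeros is just one possible convention.

```r
# Same min-max formula as above, repeated so this chunk runs on its own.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

normalize(c(2, 4, 6, 10))   # 0.00 0.25 0.50 1.00

# A constant column gives 0 / 0, which R evaluates to NaN.
normalize(c(3, 3, 3))

# Hypothetical guarded variant: map a constant column to all zeros instead.
normalize_safe <- function(x) {
  denom <- max(x) - min(x)
  if (denom == 0) {
    return(rep(0, length(x)))
  }
  (x - min(x)) / denom
}

normalize_safe(c(3, 3, 3))  # 0 0 0
```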

Consider normalizing before using KNN when the variables are meant to be comparable but lie within different ranges. KNN classifies by distance, so a variable with a much wider range would otherwise dominate the result.
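To make the effect of differing ranges concrete, the sketch below compares Euclidean distances (the metric used by knn() in the class package) before and after min-max normalization. The toy data frame is invented for illustration; its income and age columns and values are assumptions, not data from the tutorial.

```r
# Invented toy data: two variables on very different scales.
toy <- data.frame(
  income = c(30000, 31000, 90000),
  age    = c(25, 60, 26)
)

# Raw Euclidean distances: income differences swamp age differences.
raw_dist <- as.matrix(dist(toy))
raw_dist[1, ]   # row 1 is far closer to row 2 than to row 3

# Min-max normalize each column, then recompute the distances.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
toy_norm <- as.data.frame(lapply(toy, normalize))
norm_dist <- as.matrix(dist(toy_norm))
norm_dist[1, ]  # rows 2 and 3 are now almost equally far from row 1
```

On the raw data the 35-year age gap between rows 1 and 2 barely registers next to the income gap. After normalization both variables contribute on the same 0 to 1 scale.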
