Previously…

A brave fellow student gave a presentation that used the k-nearest neighbors (KNN) algorithm to predict the class of an unknown (and possibly alien) species…

KNN Example by Zach Dravis

Zach left us with a few questions at the end of the presentation, including:

In the comments section during the presentation, Prof. Catlin suggested using a confusion matrix to evaluate the models with different k values. In this example, I will present the concept of the confusion matrix and apply it to Zach’s KNN models.

Ewoks and Humans and Wookies (minus the Ewoks)

For Zach’s presentation, he created a data set containing 200 Ewoks, 200 humans, and 200 Wookies, then built a data frame with simulated height and weight data for each case. Then, he used a training set of 399 observations from his data set to build two models with different \(k\) values (\(k=3\) and \(k=9\)), which were then applied to the remaining 201 observations (the test data set).
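For readers who want to experiment with a similar setup, here is a minimal sketch of how such a data set and split could be built. It is not Zach's code: the means, standard deviations, and the object names `SimSpecies`, `SimTrain`, and `SimTest` are assumptions chosen for illustration, and I assume the split is balanced across species (133 training and 67 test rows each).

```r
# Hypothetical simulation: the distribution parameters below are placeholders,
# not the values used in the original presentation.
set.seed(42)

n_per_species <- 200
params <- data.frame(
  Species   = c("Ewok", "Human", "Wookie"),
  height_mu = c(3.5, 5.6, 7.0),   # assumed means
  height_sd = c(0.4, 0.5, 0.7),   # assumed standard deviations
  weight_mu = c(70, 150, 200),
  weight_sd = c(15, 28, 45)
)

# Build one block of simulated rows per species and stack them.
SimSpecies <- do.call(rbind, lapply(seq_len(nrow(params)), function(i) {
  data.frame(
    Species = params$Species[i],
    Height  = rnorm(n_per_species, params$height_mu[i], params$height_sd[i]),
    Weight  = rnorm(n_per_species, params$weight_mu[i], params$weight_sd[i])
  )
}))
SimSpecies$Species <- factor(SimSpecies$Species)

# Hold out 67 rows per species for testing (201 total); keep 399 for training.
rows_by_species <- split(seq_len(nrow(SimSpecies)), SimSpecies$Species)
test_idx <- unlist(lapply(rows_by_species, function(rows) sample(rows, 67)))
SimTest  <- SimSpecies[test_idx, ]
SimTrain <- SimSpecies[-test_idx, ]

table(SimTrain$Species)
table(SimTest$Species)
```

Sampling within each species keeps the class proportions identical in the training and test sets, which makes the later accuracy comparisons easier to interpret.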

To simplify the example today, I’ve reduced the problem to only two classes: Humans and Wookies. Their scatterplot of Height vs. Weight is shown here:

ggplot(SWSpecies, aes(x=Height, y=Weight, color=Species, shape=Species)) +
  geom_point()

Running the process with two different \(k\) values

Code from Zach Dravis to apply the two models:

\(k=3\)

SpeciesPrediction <- knn(train = select(TrainingData, Height, Weight), 
                         test = select(TestData, Height, Weight), 
                         cl = TrainingData$Species, 
                         k = 3)
FirstTest <- cbind(TestData, SpeciesPrediction)
FirstTest %>%  kable("html", caption = "Test1: k = 3") %>%
  kable_styling(bootstrap_options = c("striped")) %>%
  scroll_box(height = "500px")
Test1: k = 3
Species Height Weight SpeciesPrediction
Human 5.742613 119.28202 Human
Human 5.848384 108.37048 Human
Human 5.592757 148.52337 Human
Human 5.850367 204.32882 Wookie
Human 5.655841 147.01470 Human
Human 5.880231 173.31711 Wookie
Human 6.421232 116.92211 Human
Human 6.056181 143.35066 Human
Human 5.516332 166.98285 Human
Human 4.942775 139.35122 Human
Human 5.709029 173.55788 Human
Human 5.299882 170.82142 Human
Human 6.246747 131.01810 Human
Human 4.696459 126.84124 Human
Human 5.292124 220.73452 Wookie
Human 5.711004 144.22352 Human
Human 5.424132 150.19858 Human
Human 5.196924 148.36362 Human
Human 5.347639 132.95379 Human
Human 5.814768 122.19072 Human
Human 5.947586 115.92794 Human
Human 5.830106 177.41887 Wookie
Human 6.636742 120.41812 Human
Human 6.086749 170.02456 Human
Human 5.643855 192.19786 Human
Human 5.170115 184.73337 Human
Human 6.959570 145.83740 Human
Human 5.838708 133.82034 Human
Human 5.157840 117.63320 Human
Human 5.593246 149.44314 Human
Human 5.337803 143.02153 Human
Human 5.362648 154.68881 Human
Human 5.033248 131.87971 Human
Human 5.558423 168.87105 Human
Human 5.659580 171.42509 Human
Human 4.961229 182.27658 Human
Human 3.883424 217.49267 Wookie
Human 5.372563 155.93672 Human
Human 5.514759 132.73361 Human
Human 5.797137 147.68037 Human
Human 5.529568 101.16240 Human
Human 5.706699 153.89769 Human
Human 4.951114 137.22981 Human
Human 5.855588 162.18035 Human
Human 5.859444 141.93569 Human
Human 5.625826 108.57046 Human
Human 6.178637 151.09750 Wookie
Human 5.702234 109.40266 Human
Human 5.632182 176.91019 Human
Human 5.634022 95.11921 Human
Human 5.718465 145.74369 Human
Human 6.030062 163.73945 Wookie
Human 5.726095 104.41147 Human
Human 5.831599 191.97166 Human
Human 4.931813 123.43982 Human
Human 5.314751 134.81009 Human
Human 6.238485 154.89996 Human
Human 4.888048 139.88568 Human
Human 5.629034 118.92145 Human
Human 5.702501 162.28014 Human
Human 5.987902 134.27330 Human
Human 5.325562 165.19501 Human
Human 5.579313 134.85831 Human
Human 4.618373 121.28249 Human
Human 5.669298 147.16097 Human
Human 5.166717 169.48318 Human
Human 5.380677 157.26965 Human
Human 4.906117 148.23198 Human
Human 5.692468 91.56618 Human
Human 5.833290 193.54331 Human
Human 5.347693 148.80181 Human
Human 6.412505 124.08007 Human
Human 5.835280 143.94929 Human
Human 5.974316 157.83767 Wookie
Human 6.524701 104.44946 Human
Human 5.174443 171.51926 Human
Human 5.904310 157.40498 Wookie
Human 5.993290 186.08301 Human
Human 5.496915 179.58118 Human
Human 5.659526 133.86178 Human
Human 4.994089 135.29152 Human
Human 5.735084 120.13451 Human
Human 5.149515 139.14891 Human
Human 5.906841 178.33064 Wookie
Human 5.094285 203.76621 Human
Human 5.659699 177.85228 Wookie
Human 5.076739 95.10187 Human
Human 5.377118 174.65634 Human
Human 4.723571 181.08806 Human
Human 5.564217 168.52260 Human
Human 5.992722 124.35209 Human
Human 5.591624 181.93532 Human
Human 4.616885 171.88444 Human
Human 5.189733 169.87970 Human
Human 6.328022 138.55778 Wookie
Human 6.404903 175.37675 Human
Human 4.912482 153.34776 Human
Human 5.316648 147.96689 Human
Human 5.676813 194.61294 Human
Human 5.659578 172.71670 Wookie
Wookie 7.738585 205.66270 Wookie
Wookie 6.081447 171.74523 Human
Wookie 7.532295 204.03502 Wookie
Wookie 6.918085 165.21143 Wookie
Wookie 8.336956 145.18989 Wookie
Wookie 6.817417 240.22203 Wookie
Wookie 5.854670 232.83221 Wookie
Wookie 7.368876 171.59847 Wookie
Wookie 7.265878 263.63929 Wookie
Wookie 6.986780 190.37280 Wookie
Wookie 6.207587 185.81674 Wookie
Wookie 6.370814 152.31641 Human
Wookie 6.990531 198.78247 Wookie
Wookie 7.778583 193.15070 Wookie
Wookie 6.726657 203.37592 Wookie
Wookie 6.346413 222.25842 Wookie
Wookie 4.971848 245.33002 Wookie
Wookie 7.370687 216.37053 Wookie
Wookie 7.320505 249.62808 Wookie
Wookie 7.981893 253.27003 Wookie
Wookie 8.100042 53.47590 Wookie
Wookie 6.892708 181.86131 Wookie
Wookie 7.973885 228.49663 Wookie
Wookie 7.000985 144.01823 Human
Wookie 6.492692 151.96748 Wookie
Wookie 6.363853 213.68682 Wookie
Wookie 6.739794 177.92884 Wookie
Wookie 7.170610 113.42561 Wookie
Wookie 6.979449 245.29274 Wookie
Wookie 7.150658 157.95802 Wookie
Wookie 7.149207 219.31739 Wookie
Wookie 7.535950 220.62151 Wookie
Wookie 6.795114 276.06343 Wookie
Wookie 8.226511 209.82821 Wookie
Wookie 7.328270 233.28052 Wookie
Wookie 6.104848 257.02955 Wookie
Wookie 6.792513 179.68094 Wookie
Wookie 7.386912 117.22193 Human
Wookie 6.691708 207.14650 Wookie
Wookie 7.258746 266.29572 Wookie
Wookie 7.325797 129.68612 Human
Wookie 7.977608 250.60658 Wookie
Wookie 6.511228 196.99346 Wookie
Wookie 6.987339 204.97407 Wookie
Wookie 6.574231 182.53891 Wookie
Wookie 6.154650 251.84266 Wookie
Wookie 5.613272 271.20782 Wookie
Wookie 7.146647 278.54244 Wookie
Wookie 7.451174 151.62996 Wookie
Wookie 7.479369 237.96429 Wookie
Wookie 5.957505 181.19052 Human
Wookie 6.152742 150.82953 Human
Wookie 6.144134 219.49476 Wookie
Wookie 6.483496 199.10246 Human
Wookie 7.127648 239.58623 Wookie
Wookie 7.445863 289.61989 Wookie
Wookie 6.388725 251.83584 Wookie
Wookie 4.851740 165.21572 Human
Wookie 7.715368 206.98113 Wookie
Wookie 7.461464 270.66227 Wookie
Wookie 6.474347 258.72188 Wookie
Wookie 6.194754 120.76092 Human
Wookie 6.775007 247.32150 Wookie
Wookie 7.081080 182.97430 Wookie
Wookie 7.542401 168.73187 Wookie
Wookie 8.042571 144.93989 Wookie
Wookie 6.363338 184.41023 Human
Wookie 7.218042 234.43283 Wookie
Wookie 8.000632 271.76359 Wookie
Wookie 6.323993 206.94535 Wookie
Wookie 6.832927 196.34738 Wookie
Wookie 5.671399 189.88320 Wookie
Wookie 7.215567 247.68997 Wookie
Wookie 6.519106 145.94033 Human
Wookie 6.306297 212.50094 Wookie
Wookie 6.415017 251.08110 Wookie
Wookie 7.333378 253.09258 Wookie
Wookie 5.378148 220.85351 Wookie
Wookie 6.680867 141.88169 Human
Wookie 6.239712 158.94576 Wookie
Wookie 6.278411 163.97559 Human
Wookie 7.355976 127.83640 Human
Wookie 7.263816 202.00102 Wookie
Wookie 6.919735 261.14733 Wookie
Wookie 7.403138 77.66530 Human
Wookie 6.548421 237.91939 Wookie
Wookie 5.482527 189.86692 Wookie
Wookie 7.865816 161.54331 Wookie
Wookie 6.826980 238.50044 Wookie
Wookie 6.726183 232.41926 Wookie
Wookie 7.118138 204.81087 Wookie
Wookie 7.574449 154.01783 Wookie
Wookie 7.319565 164.75691 Wookie
Wookie 7.695449 119.13830 Human
Wookie 7.034601 294.04470 Wookie
Wookie 6.621786 192.22028 Human
Wookie 6.774663 162.27431 Wookie
Wookie 7.617187 91.95238 Human
Wookie 6.363048 178.52692 Wookie
Wookie 6.284792 172.54360 Wookie

\(k=9\)

SpeciesPrediction <- knn(train = select(TrainingData, Height, Weight), 
                         test = select(TestData, Height, Weight), 
                         cl = TrainingData$Species, 
                         k = 9)
SecondTest <- cbind(TestData, SpeciesPrediction)
SecondTest %>%  kable("html", caption = "Test2: k = 9") %>%
  kable_styling(bootstrap_options = c("striped")) %>%
  scroll_box(height = "500px")
Test2: k = 9
Species Height Weight SpeciesPrediction
Human 5.742613 119.28202 Human
Human 5.848384 108.37048 Human
Human 5.592757 148.52337 Human
Human 5.850367 204.32882 Wookie
Human 5.655841 147.01470 Human
Human 5.880231 173.31711 Wookie
Human 6.421232 116.92211 Human
Human 6.056181 143.35066 Human
Human 5.516332 166.98285 Human
Human 4.942775 139.35122 Human
Human 5.709029 173.55788 Wookie
Human 5.299882 170.82142 Human
Human 6.246747 131.01810 Human
Human 4.696459 126.84124 Human
Human 5.292124 220.73452 Wookie
Human 5.711004 144.22352 Human
Human 5.424132 150.19858 Human
Human 5.196924 148.36362 Human
Human 5.347639 132.95379 Human
Human 5.814768 122.19072 Human
Human 5.947586 115.92794 Human
Human 5.830106 177.41887 Wookie
Human 6.636742 120.41812 Human
Human 6.086749 170.02456 Human
Human 5.643855 192.19786 Human
Human 5.170115 184.73337 Human
Human 6.959570 145.83740 Human
Human 5.838708 133.82034 Human
Human 5.157840 117.63320 Human
Human 5.593246 149.44314 Human
Human 5.337803 143.02153 Human
Human 5.362648 154.68881 Human
Human 5.033248 131.87971 Human
Human 5.558423 168.87105 Human
Human 5.659580 171.42509 Human
Human 4.961229 182.27658 Human
Human 3.883424 217.49267 Wookie
Human 5.372563 155.93672 Human
Human 5.514759 132.73361 Human
Human 5.797137 147.68037 Human
Human 5.529568 101.16240 Human
Human 5.706699 153.89769 Human
Human 4.951114 137.22981 Human
Human 5.855588 162.18035 Human
Human 5.859444 141.93569 Human
Human 5.625826 108.57046 Human
Human 6.178637 151.09750 Human
Human 5.702234 109.40266 Human
Human 5.632182 176.91019 Wookie
Human 5.634022 95.11921 Human
Human 5.718465 145.74369 Human
Human 6.030062 163.73945 Human
Human 5.726095 104.41147 Human
Human 5.831599 191.97166 Human
Human 4.931813 123.43982 Human
Human 5.314751 134.81009 Human
Human 6.238485 154.89996 Human
Human 4.888048 139.88568 Human
Human 5.629034 118.92145 Human
Human 5.702501 162.28014 Human
Human 5.987902 134.27330 Human
Human 5.325562 165.19501 Human
Human 5.579313 134.85831 Human
Human 4.618373 121.28249 Human
Human 5.669298 147.16097 Human
Human 5.166717 169.48318 Human
Human 5.380677 157.26965 Human
Human 4.906117 148.23198 Human
Human 5.692468 91.56618 Human
Human 5.833290 193.54331 Human
Human 5.347693 148.80181 Human
Human 6.412505 124.08007 Human
Human 5.835280 143.94929 Human
Human 5.974316 157.83767 Human
Human 6.524701 104.44946 Human
Human 5.174443 171.51926 Human
Human 5.904310 157.40498 Human
Human 5.993290 186.08301 Wookie
Human 5.496915 179.58118 Human
Human 5.659526 133.86178 Human
Human 4.994089 135.29152 Human
Human 5.735084 120.13451 Human
Human 5.149515 139.14891 Human
Human 5.906841 178.33064 Wookie
Human 5.094285 203.76621 Wookie
Human 5.659699 177.85228 Wookie
Human 5.076739 95.10187 Human
Human 5.377118 174.65634 Human
Human 4.723571 181.08806 Human
Human 5.564217 168.52260 Human
Human 5.992722 124.35209 Human
Human 5.591624 181.93532 Human
Human 4.616885 171.88444 Human
Human 5.189733 169.87970 Human
Human 6.328022 138.55778 Human
Human 6.404903 175.37675 Wookie
Human 4.912482 153.34776 Human
Human 5.316648 147.96689 Human
Human 5.676813 194.61294 Wookie
Human 5.659578 172.71670 Wookie
Wookie 7.738585 205.66270 Wookie
Wookie 6.081447 171.74523 Wookie
Wookie 7.532295 204.03502 Wookie
Wookie 6.918085 165.21143 Human
Wookie 8.336956 145.18989 Human
Wookie 6.817417 240.22203 Wookie
Wookie 5.854670 232.83221 Wookie
Wookie 7.368876 171.59847 Wookie
Wookie 7.265878 263.63929 Wookie
Wookie 6.986780 190.37280 Wookie
Wookie 6.207587 185.81674 Wookie
Wookie 6.370814 152.31641 Human
Wookie 6.990531 198.78247 Wookie
Wookie 7.778583 193.15070 Wookie
Wookie 6.726657 203.37592 Wookie
Wookie 6.346413 222.25842 Wookie
Wookie 4.971848 245.33002 Wookie
Wookie 7.370687 216.37053 Wookie
Wookie 7.320505 249.62808 Wookie
Wookie 7.981893 253.27003 Wookie
Wookie 8.100042 53.47590 Human
Wookie 6.892708 181.86131 Wookie
Wookie 7.973885 228.49663 Wookie
Wookie 7.000985 144.01823 Human
Wookie 6.492692 151.96748 Human
Wookie 6.363853 213.68682 Wookie
Wookie 6.739794 177.92884 Wookie
Wookie 7.170610 113.42561 Human
Wookie 6.979449 245.29274 Wookie
Wookie 7.150658 157.95802 Wookie
Wookie 7.149207 219.31739 Wookie
Wookie 7.535950 220.62151 Wookie
Wookie 6.795114 276.06343 Wookie
Wookie 8.226511 209.82821 Wookie
Wookie 7.328270 233.28052 Wookie
Wookie 6.104848 257.02955 Wookie
Wookie 6.792513 179.68094 Wookie
Wookie 7.386912 117.22193 Human
Wookie 6.691708 207.14650 Wookie
Wookie 7.258746 266.29572 Wookie
Wookie 7.325797 129.68612 Human
Wookie 7.977608 250.60658 Wookie
Wookie 6.511228 196.99346 Wookie
Wookie 6.987339 204.97407 Wookie
Wookie 6.574231 182.53891 Wookie
Wookie 6.154650 251.84266 Wookie
Wookie 5.613272 271.20782 Wookie
Wookie 7.146647 278.54244 Wookie
Wookie 7.451174 151.62996 Wookie
Wookie 7.479369 237.96429 Wookie
Wookie 5.957505 181.19052 Human
Wookie 6.152742 150.82953 Human
Wookie 6.144134 219.49476 Wookie
Wookie 6.483496 199.10246 Wookie
Wookie 7.127648 239.58623 Wookie
Wookie 7.445863 289.61989 Wookie
Wookie 6.388725 251.83584 Wookie
Wookie 4.851740 165.21572 Human
Wookie 7.715368 206.98113 Wookie
Wookie 7.461464 270.66227 Wookie
Wookie 6.474347 258.72188 Wookie
Wookie 6.194754 120.76092 Human
Wookie 6.775007 247.32150 Wookie
Wookie 7.081080 182.97430 Wookie
Wookie 7.542401 168.73187 Human
Wookie 8.042571 144.93989 Human
Wookie 6.363338 184.41023 Wookie
Wookie 7.218042 234.43283 Wookie
Wookie 8.000632 271.76359 Wookie
Wookie 6.323993 206.94535 Wookie
Wookie 6.832927 196.34738 Wookie
Wookie 5.671399 189.88320 Wookie
Wookie 7.215567 247.68997 Wookie
Wookie 6.519106 145.94033 Human
Wookie 6.306297 212.50094 Wookie
Wookie 6.415017 251.08110 Wookie
Wookie 7.333378 253.09258 Wookie
Wookie 5.378148 220.85351 Wookie
Wookie 6.680867 141.88169 Human
Wookie 6.239712 158.94576 Human
Wookie 6.278411 163.97559 Human
Wookie 7.355976 127.83640 Human
Wookie 7.263816 202.00102 Wookie
Wookie 6.919735 261.14733 Wookie
Wookie 7.403138 77.66530 Human
Wookie 6.548421 237.91939 Wookie
Wookie 5.482527 189.86692 Wookie
Wookie 7.865816 161.54331 Wookie
Wookie 6.826980 238.50044 Wookie
Wookie 6.726183 232.41926 Wookie
Wookie 7.118138 204.81087 Wookie
Wookie 7.574449 154.01783 Human
Wookie 7.319565 164.75691 Wookie
Wookie 7.695449 119.13830 Human
Wookie 7.034601 294.04470 Wookie
Wookie 6.621786 192.22028 Human
Wookie 6.774663 162.27431 Wookie
Wookie 7.617187 91.95238 Human
Wookie 6.363048 178.52692 Wookie
Wookie 6.284792 172.54360 Wookie

Baseline: Random simulation

With the two models from Zach’s presentation in hand, I wanted a baseline model to serve as a control. I simulated one by randomly assigning a Species label to each of the 200 test observations, using 100 "Human" labels and 100 "Wookie" labels.

From Data Science for Business: “Comparison against a random model establishes that there is some information to be extracted from the data.”

If our models perform better than a random simulation, that is evidence that Height and Weight carry real information about Species, and that they are reasonable features to use in our models.
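A single random shuffle is only one draw, so it can help to see how much a random baseline varies. The sketch below repeats the shuffle many times on a stand-in label vector; `actual` and `random_accuracy` are names I made up for illustration, and the vector mimics only the class balance of the test set, not its real rows.

```r
# Self-contained sketch: 'actual' stands in for the true test labels
# (100 Humans, 100 Wookies); it is not the document's data.
set.seed(1)
actual <- rep(c("Human", "Wookie"), each = 100)

# Shuffle a balanced label vector and score it against the truth.
random_accuracy <- function(truth) {
  guess <- sample(truth)          # random permutation of the same labels
  mean(guess == truth)            # proportion of matching positions
}

# Repeat the random baseline 1000 times to see its spread.
baseline_acc <- replicate(1000, random_accuracy(actual))
summary(baseline_acc)
```

With balanced classes, the random accuracies should cluster around 0.5, which gives a sense of how far a real model has to move before it is clearly doing better than chance.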

simSpeciesPrediction <- c(rep("Wookie", 100), rep("Human", 100))
randOrder <- sample(1:200, 200, replace = FALSE)
simSpeciesPrediction <- simSpeciesPrediction[randOrder]
BaseTest <- cbind(TestData, simSpeciesPrediction)
BaseTest %>%  kable("html", caption = "Baseline: Random") %>%
  kable_styling(bootstrap_options = c("striped")) %>%
  scroll_box(height = "500px")
Baseline: Random
Species Height Weight simSpeciesPrediction
Human 5.742613 119.28202 Human
Human 5.848384 108.37048 Wookie
Human 5.592757 148.52337 Wookie
Human 5.850367 204.32882 Wookie
Human 5.655841 147.01470 Human
Human 5.880231 173.31711 Wookie
Human 6.421232 116.92211 Wookie
Human 6.056181 143.35066 Wookie
Human 5.516332 166.98285 Human
Human 4.942775 139.35122 Wookie
Human 5.709029 173.55788 Wookie
Human 5.299882 170.82142 Wookie
Human 6.246747 131.01810 Human
Human 4.696459 126.84124 Human
Human 5.292124 220.73452 Wookie
Human 5.711004 144.22352 Wookie
Human 5.424132 150.19858 Human
Human 5.196924 148.36362 Wookie
Human 5.347639 132.95379 Human
Human 5.814768 122.19072 Wookie
Human 5.947586 115.92794 Wookie
Human 5.830106 177.41887 Human
Human 6.636742 120.41812 Human
Human 6.086749 170.02456 Wookie
Human 5.643855 192.19786 Wookie
Human 5.170115 184.73337 Wookie
Human 6.959570 145.83740 Human
Human 5.838708 133.82034 Wookie
Human 5.157840 117.63320 Human
Human 5.593246 149.44314 Wookie
Human 5.337803 143.02153 Wookie
Human 5.362648 154.68881 Wookie
Human 5.033248 131.87971 Human
Human 5.558423 168.87105 Human
Human 5.659580 171.42509 Wookie
Human 4.961229 182.27658 Wookie
Human 3.883424 217.49267 Wookie
Human 5.372563 155.93672 Human
Human 5.514759 132.73361 Wookie
Human 5.797137 147.68037 Wookie
Human 5.529568 101.16240 Human
Human 5.706699 153.89769 Wookie
Human 4.951114 137.22981 Human
Human 5.855588 162.18035 Human
Human 5.859444 141.93569 Human
Human 5.625826 108.57046 Human
Human 6.178637 151.09750 Wookie
Human 5.702234 109.40266 Wookie
Human 5.632182 176.91019 Human
Human 5.634022 95.11921 Wookie
Human 5.718465 145.74369 Human
Human 6.030062 163.73945 Wookie
Human 5.726095 104.41147 Human
Human 5.831599 191.97166 Wookie
Human 4.931813 123.43982 Wookie
Human 5.314751 134.81009 Wookie
Human 6.238485 154.89996 Human
Human 4.888048 139.88568 Human
Human 5.629034 118.92145 Wookie
Human 5.702501 162.28014 Human
Human 5.987902 134.27330 Human
Human 5.325562 165.19501 Human
Human 5.579313 134.85831 Human
Human 4.618373 121.28249 Wookie
Human 5.669298 147.16097 Human
Human 5.166717 169.48318 Human
Human 5.380677 157.26965 Wookie
Human 4.906117 148.23198 Human
Human 5.692468 91.56618 Wookie
Human 5.833290 193.54331 Human
Human 5.347693 148.80181 Human
Human 6.412505 124.08007 Wookie
Human 5.835280 143.94929 Human
Human 5.974316 157.83767 Human
Human 6.524701 104.44946 Human
Human 5.174443 171.51926 Wookie
Human 5.904310 157.40498 Human
Human 5.993290 186.08301 Human
Human 5.496915 179.58118 Wookie
Human 5.659526 133.86178 Wookie
Human 4.994089 135.29152 Human
Human 5.735084 120.13451 Human
Human 5.149515 139.14891 Human
Human 5.906841 178.33064 Wookie
Human 5.094285 203.76621 Wookie
Human 5.659699 177.85228 Human
Human 5.076739 95.10187 Wookie
Human 5.377118 174.65634 Human
Human 4.723571 181.08806 Wookie
Human 5.564217 168.52260 Human
Human 5.992722 124.35209 Human
Human 5.591624 181.93532 Human
Human 4.616885 171.88444 Wookie
Human 5.189733 169.87970 Human
Human 6.328022 138.55778 Human
Human 6.404903 175.37675 Human
Human 4.912482 153.34776 Human
Human 5.316648 147.96689 Wookie
Human 5.676813 194.61294 Human
Human 5.659578 172.71670 Wookie
Wookie 7.738585 205.66270 Wookie
Wookie 6.081447 171.74523 Human
Wookie 7.532295 204.03502 Human
Wookie 6.918085 165.21143 Wookie
Wookie 8.336956 145.18989 Wookie
Wookie 6.817417 240.22203 Wookie
Wookie 5.854670 232.83221 Wookie
Wookie 7.368876 171.59847 Wookie
Wookie 7.265878 263.63929 Human
Wookie 6.986780 190.37280 Human
Wookie 6.207587 185.81674 Human
Wookie 6.370814 152.31641 Human
Wookie 6.990531 198.78247 Human
Wookie 7.778583 193.15070 Human
Wookie 6.726657 203.37592 Human
Wookie 6.346413 222.25842 Human
Wookie 4.971848 245.33002 Wookie
Wookie 7.370687 216.37053 Wookie
Wookie 7.320505 249.62808 Human
Wookie 7.981893 253.27003 Wookie
Wookie 8.100042 53.47590 Wookie
Wookie 6.892708 181.86131 Human
Wookie 7.973885 228.49663 Human
Wookie 7.000985 144.01823 Wookie
Wookie 6.492692 151.96748 Wookie
Wookie 6.363853 213.68682 Wookie
Wookie 6.739794 177.92884 Human
Wookie 7.170610 113.42561 Wookie
Wookie 6.979449 245.29274 Wookie
Wookie 7.150658 157.95802 Wookie
Wookie 7.149207 219.31739 Wookie
Wookie 7.535950 220.62151 Human
Wookie 6.795114 276.06343 Wookie
Wookie 8.226511 209.82821 Human
Wookie 7.328270 233.28052 Wookie
Wookie 6.104848 257.02955 Human
Wookie 6.792513 179.68094 Wookie
Wookie 7.386912 117.22193 Wookie
Wookie 6.691708 207.14650 Wookie
Wookie 7.258746 266.29572 Human
Wookie 7.325797 129.68612 Wookie
Wookie 7.977608 250.60658 Human
Wookie 6.511228 196.99346 Human
Wookie 6.987339 204.97407 Human
Wookie 6.574231 182.53891 Wookie
Wookie 6.154650 251.84266 Human
Wookie 5.613272 271.20782 Human
Wookie 7.146647 278.54244 Wookie
Wookie 7.451174 151.62996 Wookie
Wookie 7.479369 237.96429 Human
Wookie 5.957505 181.19052 Wookie
Wookie 6.152742 150.82953 Human
Wookie 6.144134 219.49476 Wookie
Wookie 6.483496 199.10246 Human
Wookie 7.127648 239.58623 Wookie
Wookie 7.445863 289.61989 Human
Wookie 6.388725 251.83584 Human
Wookie 4.851740 165.21572 Human
Wookie 7.715368 206.98113 Human
Wookie 7.461464 270.66227 Wookie
Wookie 6.474347 258.72188 Wookie
Wookie 6.194754 120.76092 Wookie
Wookie 6.775007 247.32150 Wookie
Wookie 7.081080 182.97430 Human
Wookie 7.542401 168.73187 Wookie
Wookie 8.042571 144.93989 Human
Wookie 6.363338 184.41023 Human
Wookie 7.218042 234.43283 Human
Wookie 8.000632 271.76359 Wookie
Wookie 6.323993 206.94535 Wookie
Wookie 6.832927 196.34738 Human
Wookie 5.671399 189.88320 Wookie
Wookie 7.215567 247.68997 Wookie
Wookie 6.519106 145.94033 Wookie
Wookie 6.306297 212.50094 Wookie
Wookie 6.415017 251.08110 Human
Wookie 7.333378 253.09258 Wookie
Wookie 5.378148 220.85351 Human
Wookie 6.680867 141.88169 Wookie
Wookie 6.239712 158.94576 Wookie
Wookie 6.278411 163.97559 Human
Wookie 7.355976 127.83640 Human
Wookie 7.263816 202.00102 Human
Wookie 6.919735 261.14733 Wookie
Wookie 7.403138 77.66530 Wookie
Wookie 6.548421 237.91939 Wookie
Wookie 5.482527 189.86692 Wookie
Wookie 7.865816 161.54331 Human
Wookie 6.826980 238.50044 Human
Wookie 6.726183 232.41926 Wookie
Wookie 7.118138 204.81087 Wookie
Wookie 7.574449 154.01783 Human
Wookie 7.319565 164.75691 Human
Wookie 7.695449 119.13830 Human
Wookie 7.034601 294.04470 Human
Wookie 6.621786 192.22028 Wookie
Wookie 6.774663 162.27431 Human
Wookie 7.617187 91.95238 Human
Wookie 6.363048 178.52692 Wookie
Wookie 6.284792 172.54360 Human

Is accuracy sufficient?

An intuitive first impulse to evaluate the model is to divide the number of correctly predicted results from the test data by the total number of observations in the test data. This is known as Accuracy.

#Accuracy of the Base model
(AccuracyBase <- sum(BaseTest$simSpeciesPrediction == BaseTest$Species)/length(BaseTest$simSpeciesPrediction))
## [1] 0.51
#Accuracy of the FirstTest model (k = 3)
(AccuracyFirst <- sum(FirstTest$SpeciesPrediction == FirstTest$Species)/length(FirstTest$SpeciesPrediction))
## [1] 0.84
#Accuracy of the SecondTest model (k = 9)
(AccuracySecond <- sum(SecondTest$SpeciesPrediction == SecondTest$Species)/length(SecondTest$SpeciesPrediction))
## [1] 0.805

The test models have much greater accuracy than the random assignment simulation, so it seems like we’re on the right track. But accuracy alone doesn’t tell the complete story of a model, and may in fact hide its flaws.

Defining the Confusion Matrix

Instead of relying on the single metric of accuracy, we can create a confusion matrix to examine the different types of results the model generated.

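While accuracy gives us the proportion of correct results, the confusion matrix separates the correct results into two sets: true positives (positive cases correctly predicted as positive) and true negatives (negative cases correctly predicted as negative).

The incorrect results are also divided in two: false positives (negative cases predicted as positive) and false negatives (positive cases predicted as negative).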

The correct predictions fall along the main diagonal of the matrix.
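To make the four cells concrete, here is a small base R sketch that builds a confusion matrix with `table()`. The eight labels are invented for illustration, and treating "Human" as the positive class is my assumption.

```r
# Toy labels invented for illustration; "Human" is treated as the positive class.
actual <- factor(
  c("Human", "Human", "Human", "Wookie", "Wookie", "Wookie", "Human", "Wookie"),
  levels = c("Human", "Wookie")
)
predicted <- factor(
  c("Human", "Wookie", "Human", "Wookie", "Human", "Wookie", "Human", "Wookie"),
  levels = c("Human", "Wookie")
)

# Rows are predictions, columns are the true classes.
cm <- table(Prediction = predicted, Actual = actual)
cm

TP <- cm["Human",  "Human"]   # predicted Human, actually Human
FP <- cm["Human",  "Wookie"]  # predicted Human, actually Wookie
FN <- cm["Wookie", "Human"]   # predicted Wookie, actually Human
TN <- cm["Wookie", "Wookie"]  # predicted Wookie, actually Wookie

# Accuracy is the share of observations on the main diagonal.
accuracy <- (TP + TN) / sum(cm)
accuracy
```

Here three Humans and three Wookies are classified correctly and one of each is missed, so the accuracy is 6/8 = 0.75.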

From the confusion matrix, many helpful statistics can be calculated to aid in analyzing the model. The figure below gives a visual representation of the different statistics.

This figure also shows how simple accuracy can be misleading. The accuracy is \(\frac{20 + 1820}{2030} \approx 90.6\%\); however, the confusion matrix is more informative: for example, the positive predictive value is only 10%.
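The arithmetic behind those numbers can be checked in a few lines. The false positive and false negative counts below are not quoted in the text; I inferred them from the quoted totals (20 true positives, 1820 true negatives, 2030 observations, and a positive predictive value of 10%), so treat them as an assumption.

```r
# Cell counts: TP, TN and the total come from the text; FP and FN are inferred.
TP <- 20
TN <- 1820
FP <- 180                        # PPV = TP / (TP + FP) = 0.10 implies TP + FP = 200
FN <- 2030 - TP - TN - FP        # the remaining observations, here 10

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
ppv         <- TP / (TP + FP)    # positive predictive value (precision)
npv         <- TN / (TN + FN)    # negative predictive value
sensitivity <- TP / (TP + FN)    # true positive rate (recall)
specificity <- TN / (TN + FP)    # true negative rate

round(c(accuracy = accuracy, ppv = ppv, npv = npv,
        sensitivity = sensitivity, specificity = specificity), 3)
```

Under these counts the accuracy is about 0.906 while the positive predictive value is 0.10 and the sensitivity is about 0.667: the large number of true negatives dominates the accuracy and masks how unreliable a positive prediction is.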

The Data School is a good resource for defining the statistics evaluated from the confusion matrix.

Back to our Star Wars example

In R, the confusionMatrix function from the caret package takes the predicted classes as its first argument and the true classes as its second, and returns the confusion matrix along with several of the summary statistics we reviewed in the previous section.

We can create confusion matrices for our three prediction models (the random baseline and the two KNN models) to evaluate their performance. Since our reduced data has only two levels, confusionMatrix reports a single set of summary statistics, with the first level (Human) treated as the 'positive' class.

Comparing the three models, we see that the two knn models fare significantly better than the baseline model based on random assignment.

library(caret)
# confusionMatrix(data, reference): predicted classes first, true classes second
confusionMatrix(BaseTest$simSpeciesPrediction, BaseTest$Species, dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction Human Wookie
##     Human     51     49
##     Wookie    49     51
##                                           
##                Accuracy : 0.51            
##                  95% CI : (0.4385, 0.5812)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.416           
##                                           
##                   Kappa : 0.02            
##  Mcnemar's Test P-Value : 1.000           
##                                           
##             Sensitivity : 0.510           
##             Specificity : 0.510           
##          Pos Pred Value : 0.510           
##          Neg Pred Value : 0.510           
##              Prevalence : 0.500           
##          Detection Rate : 0.255           
##    Detection Prevalence : 0.500           
##       Balanced Accuracy : 0.510           
##                                           
##        'Positive' Class : Human           
## 
# confusionMatrix(data, reference): predicted classes first, true classes second
confusionMatrix(FirstTest$SpeciesPrediction, FirstTest$Species, dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction Human Wookie
##     Human     87     19
##     Wookie    13     81
##                                           
##                Accuracy : 0.84            
##                  95% CI : (0.7817, 0.8879)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.68            
##  Mcnemar's Test P-Value : 0.3768          
##                                           
##             Sensitivity : 0.8700          
##             Specificity : 0.8100          
##          Pos Pred Value : 0.8208          
##          Neg Pred Value : 0.8617          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4350          
##    Detection Prevalence : 0.5300          
##       Balanced Accuracy : 0.8400          
##                                           
##        'Positive' Class : Human           
## 
# confusionMatrix(data, reference): predicted classes first, true classes second
confusionMatrix(SecondTest$SpeciesPrediction, SecondTest$Species, dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction Human Wookie
##     Human     86     25
##     Wookie    14     75
##                                           
##                Accuracy : 0.805           
##                  95% CI : (0.7432, 0.8575)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.61            
##  Mcnemar's Test P-Value : 0.1093          
##                                           
##             Sensitivity : 0.8600          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.7748          
##          Neg Pred Value : 0.8427          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4300          
##    Detection Prevalence : 0.5550          
##       Balanced Accuracy : 0.8050          
##                                           
##        'Positive' Class : Human           
## 

On this test set, the \(k=3\) model performs somewhat better than the \(k=9\) model (accuracy of 0.84 versus 0.805). So how should we choose \(k\)?

Values for \(k\)

Several online sources suggest a rule of thumb: choose \(k\) as roughly the square root of the number of observations in the training set. With our training set of 400 observations, how would the knn model fare with \(k=20\)?

source 1 source 2
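The square-root rule gives a single candidate, but it can also be informative to compare several values of \(k\) side by side. The sketch below does this on simulated two-class data; the distribution parameters, the 400/200 split, and the helper names `simulate_species` and `accuracy_for_k` are my assumptions, not part of Zach's code.

```r
library(class)  # provides knn()

# Hypothetical two-class data; the distribution parameters are assumptions.
set.seed(2024)
simulate_species <- function(n, species, h_mu, h_sd, w_mu, w_sd) {
  data.frame(Species = species,
             Height  = rnorm(n, h_mu, h_sd),
             Weight  = rnorm(n, w_mu, w_sd))
}
sim <- rbind(simulate_species(300, "Human",  5.6, 0.5, 150, 28),
             simulate_species(300, "Wookie", 6.9, 0.7, 200, 45))
sim$Species <- factor(sim$Species)

# 400 training rows, 200 test rows.
train_rows <- sample(nrow(sim), 400)
train <- sim[train_rows, ]
test  <- sim[-train_rows, ]

# Test-set accuracy for a single value of k.
accuracy_for_k <- function(k) {
  pred <- knn(train = train[, c("Height", "Weight")],
              test  = test[,  c("Height", "Weight")],
              cl    = train$Species,
              k     = k)
  mean(pred == test$Species)
}

# Compare a handful of candidate values, including the square-root rule (20).
k_values <- c(1, 3, 5, 9, 15, 20, 31)
results  <- data.frame(k = k_values,
                       accuracy = sapply(k_values, accuracy_for_k))
results
```

Because the comparison here is made on the test set itself, the best-looking \(k\) is likely to be somewhat optimistic; in practice one would usually pick \(k\) on a separate validation split or by cross-validation and keep the test set for a final check.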

## square root of 400 is 20
k_pref <- TrainingData$Species %>%
  length() %>%
  sqrt() %>%
  round()

SpeciesPrediction <- knn(train = select(TrainingData, Height, Weight), 
                         test = select(TestData, Height, Weight), 
                         cl = TrainingData$Species, 
                         k = k_pref)
ThirdTest <- cbind(TestData, SpeciesPrediction)
ThirdTest %>%  kable("html", caption = "Test3: k = 20") %>%
  kable_styling(bootstrap_options = c("striped")) %>%
  scroll_box(height = "500px")
Test3: k = 20
Species Height Weight SpeciesPrediction
Human 5.742613 119.28202 Human
Human 5.848384 108.37048 Human
Human 5.592757 148.52337 Human
Human 5.850367 204.32882 Wookie
Human 5.655841 147.01470 Human
Human 5.880231 173.31711 Human
Human 6.421232 116.92211 Human
Human 6.056181 143.35066 Human
Human 5.516332 166.98285 Human
Human 4.942775 139.35122 Human
Human 5.709029 173.55788 Human
Human 5.299882 170.82142 Human
Human 6.246747 131.01810 Human
Human 4.696459 126.84124 Human
Human 5.292124 220.73452 Wookie
Human 5.711004 144.22352 Human
Human 5.424132 150.19858 Human
Human 5.196924 148.36362 Human
Human 5.347639 132.95379 Human
Human 5.814768 122.19072 Human
Human 5.947586 115.92794 Human
Human 5.830106 177.41887 Wookie
Human 6.636742 120.41812 Human
Human 6.086749 170.02456 Human
Human 5.643855 192.19786 Wookie
Human 5.170115 184.73337 Human
Human 6.959570 145.83740 Human
Human 5.838708 133.82034 Human
Human 5.157840 117.63320 Human
Human 5.593246 149.44314 Human
Human 5.337803 143.02153 Human
Human 5.362648 154.68881 Human
Human 5.033248 131.87971 Human
Human 5.558423 168.87105 Human
Human 5.659580 171.42509 Human
Human 4.961229 182.27658 Human
Human 3.883424 217.49267 Wookie
Human 5.372563 155.93672 Human
Human 5.514759 132.73361 Human
Human 5.797137 147.68037 Human
Human 5.529568 101.16240 Human
Human 5.706699 153.89769 Human
Human 4.951114 137.22981 Human
Human 5.855588 162.18035 Human
Human 5.859444 141.93569 Human
Human 5.625826 108.57046 Human
Human 6.178637 151.09750 Human
Human 5.702234 109.40266 Human
Human 5.632182 176.91019 Wookie
Human 5.634022 95.11921 Human
Human 5.718465 145.74369 Human
Human 6.030062 163.73945 Human
Human 5.726095 104.41147 Human
Human 5.831599 191.97166 Wookie
Human 4.931813 123.43982 Human
Human 5.314751 134.81009 Human
Human 6.238485 154.89996 Human
Human 4.888048 139.88568 Human
Human 5.629034 118.92145 Human
Human 5.702501 162.28014 Human
Human 5.987902 134.27330 Human
Human 5.325562 165.19501 Human
Human 5.579313 134.85831 Human
Human 4.618373 121.28249 Human
Human 5.669298 147.16097 Human
Human 5.166717 169.48318 Human
Human 5.380677 157.26965 Human
Human 4.906117 148.23198 Human
Human 5.692468 91.56618 Human
Human 5.833290 193.54331 Wookie
Human 5.347693 148.80181 Human
Human 6.412505 124.08007 Human
Human 5.835280 143.94929 Human
Human 5.974316 157.83767 Human
Human 6.524701 104.44946 Human
Human 5.174443 171.51926 Human
Human 5.904310 157.40498 Human
Human 5.993290 186.08301 Wookie
Human 5.496915 179.58118 Wookie
Human 5.659526 133.86178 Human
Human 4.994089 135.29152 Human
Human 5.735084 120.13451 Human
Human 5.149515 139.14891 Human
Human 5.906841 178.33064 Wookie
Human 5.094285 203.76621 Wookie
Human 5.659699 177.85228 Wookie
Human 5.076739 95.10187 Human
Human 5.377118 174.65634 Wookie
Human 4.723571 181.08806 Human
Human 5.564217 168.52260 Human
Human 5.992722 124.35209 Human
Human 5.591624 181.93532 Human
Human 4.616885 171.88444 Human
Human 5.189733 169.87970 Human
Human 6.328022 138.55778 Human
Human 6.404903 175.37675 Wookie
Human 4.912482 153.34776 Human
Human 5.316648 147.96689 Human
Human 5.676813 194.61294 Wookie
Human 5.659578 172.71670 Human
Wookie 7.738585 205.66270 Wookie
Wookie 6.081447 171.74523 Human
Wookie 7.532295 204.03502 Wookie
Wookie 6.918085 165.21143 Human
Wookie 8.336956 145.18989 Human
Wookie 6.817417 240.22203 Wookie
Wookie 5.854670 232.83221 Wookie
Wookie 7.368876 171.59847 Human
Wookie 7.265878 263.63929 Wookie
Wookie 6.986780 190.37280 Wookie
Wookie 6.207587 185.81674 Wookie
Wookie 6.370814 152.31641 Human
Wookie 6.990531 198.78247 Wookie
Wookie 7.778583 193.15070 Wookie
Wookie 6.726657 203.37592 Wookie
Wookie 6.346413 222.25842 Wookie
Wookie 4.971848 245.33002 Wookie
Wookie 7.370687 216.37053 Wookie
Wookie 7.320505 249.62808 Wookie
Wookie 7.981893 253.27003 Wookie
Wookie 8.100042 53.47590 Human
Wookie 6.892708 181.86131 Human
Wookie 7.973885 228.49663 Wookie
Wookie 7.000985 144.01823 Human
Wookie 6.492692 151.96748 Human
Wookie 6.363853 213.68682 Wookie
Wookie 6.739794 177.92884 Wookie
Wookie 7.170610 113.42561 Human
Wookie 6.979449 245.29274 Wookie
Wookie 7.150658 157.95802 Human
Wookie 7.149207 219.31739 Wookie
Wookie 7.535950 220.62151 Wookie
Wookie 6.795114 276.06343 Wookie
Wookie 8.226511 209.82821 Wookie
Wookie 7.328270 233.28052 Wookie
Wookie 6.104848 257.02955 Wookie
Wookie 6.792513 179.68094 Wookie
Wookie 7.386912 117.22193 Human
Wookie 6.691708 207.14650 Wookie
Wookie 7.258746 266.29572 Wookie
Wookie 7.325797 129.68612 Human
Wookie 7.977608 250.60658 Wookie
Wookie 6.511228 196.99346 Wookie
Wookie 6.987339 204.97407 Wookie
Wookie 6.574231 182.53891 Wookie
Wookie 6.154650 251.84266 Wookie
Wookie 5.613272 271.20782 Wookie
Wookie 7.146647 278.54244 Wookie
Wookie 7.451174 151.62996 Human
Wookie 7.479369 237.96429 Wookie
Wookie 5.957505 181.19052 Wookie
Wookie 6.152742 150.82953 Human
Wookie 6.144134 219.49476 Wookie
Wookie 6.483496 199.10246 Wookie
Wookie 7.127648 239.58623 Wookie
Wookie 7.445863 289.61989 Wookie
Wookie 6.388725 251.83584 Wookie
Wookie 4.851740 165.21572 Human
Wookie 7.715368 206.98113 Wookie
Wookie 7.461464 270.66227 Wookie
Wookie 6.474347 258.72188 Wookie
Wookie 6.194754 120.76092 Human
Wookie 6.775007 247.32150 Wookie
Wookie 7.081080 182.97430 Human
Wookie 7.542401 168.73187 Human
Wookie 8.042571 144.93989 Human
Wookie 6.363338 184.41023 Human
Wookie 7.218042 234.43283 Wookie
Wookie 8.000632 271.76359 Wookie
Wookie 6.323993 206.94535 Wookie
Wookie 6.832927 196.34738 Wookie
Wookie 5.671399 189.88320 Wookie
Wookie 7.215567 247.68997 Wookie
Wookie 6.519106 145.94033 Human
Wookie 6.306297 212.50094 Wookie
Wookie 6.415017 251.08110 Wookie
Wookie 7.333378 253.09258 Wookie
Wookie 5.378148 220.85351 Wookie
Wookie 6.680867 141.88169 Human
Wookie 6.239712 158.94576 Human
Wookie 6.278411 163.97559 Wookie
Wookie 7.355976 127.83640 Human
Wookie 7.263816 202.00102 Wookie
Wookie 6.919735 261.14733 Wookie
Wookie 7.403138 77.66530 Human
Wookie 6.548421 237.91939 Wookie
Wookie 5.482527 189.86692 Wookie
Wookie 7.865816 161.54331 Human
Wookie 6.826980 238.50044 Wookie
Wookie 6.726183 232.41926 Wookie
Wookie 7.118138 204.81087 Wookie
Wookie 7.574449 154.01783 Human
Wookie 7.319565 164.75691 Human
Wookie 7.695449 119.13830 Human
Wookie 7.034601 294.04470 Wookie
Wookie 6.621786 192.22028 Wookie
Wookie 6.774663 162.27431 Human
Wookie 7.617187 91.95238 Human
Wookie 6.363048 178.52692 Wookie
Wookie 6.284792 172.54360 Human
(AccuracyThird <- sum(ThirdTest$SpeciesPrediction == ThirdTest$Species)/length(ThirdTest$SpeciesPrediction))
## [1] 0.755
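# caret's confusionMatrix() takes the predictions first (data) and the true classes second (reference).
# The true Species is passed first here, so the rows labelled "Prediction" in the output are the
# true species and the columns labelled "Actual" are the KNN predictions.
confusionMatrix(ThirdTest$Species, ThirdTest$SpeciesPrediction, dnn = c("Prediction", "Actual"))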
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction Human Wookie
##     Human     84     16
##     Wookie    33     67
##                                           
##                Accuracy : 0.755           
##                  95% CI : (0.6894, 0.8129)
##     No Information Rate : 0.585           
##     P-Value [Acc > NIR] : 3.611e-07       
##                                           
##                   Kappa : 0.51            
##  Mcnemar's Test P-Value : 0.02227         
##                                           
##             Sensitivity : 0.7179          
##             Specificity : 0.8072          
##          Pos Pred Value : 0.8400          
##          Neg Pred Value : 0.6700          
##              Prevalence : 0.5850          
##          Detection Rate : 0.4200          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.7626          
##                                           
##        'Positive' Class : Human           
## 
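The headline statistics in this output can be checked by hand from the four cell counts. The sketch below uses base R only and follows the row and column labels exactly as printed, with Human as the positive class. The names `cm`, `sens`, `spec`, `ppv`, and `npv` are mine, not part of Zach's code.

```r
# Cell counts copied from the confusion matrix printed above
# (matrix() fills column by column: first the "Human" column, then "Wookie")
cm <- matrix(c(84, 33, 16, 67), nrow = 2,
             dimnames = list(Prediction = c("Human", "Wookie"),
                             Actual     = c("Human", "Wookie")))

acc  <- sum(diag(cm)) / sum(cm)                       # (84 + 67) / 200
sens <- cm["Human", "Human"]   / sum(cm[, "Human"])   # 84 / 117, column total
spec <- cm["Wookie", "Wookie"] / sum(cm[, "Wookie"])  # 67 / 83, column total
ppv  <- cm["Human", "Human"]   / sum(cm["Human", ])   # 84 / 100, row total
npv  <- cm["Wookie", "Wookie"] / sum(cm["Wookie", ])  # 67 / 100, row total
bal  <- (sens + spec) / 2                             # balanced accuracy

round(c(Accuracy = acc, Sensitivity = sens, Specificity = spec,
        PosPredValue = ppv, NegPredValue = npv, BalancedAccuracy = bal), 4)
```

Sensitivity and specificity divide by column totals, while the two predictive values divide by row totals, so swapping rows and columns swaps those pairs of statistics.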

(https://cran.r-project.org/web/packages/heuristica/heuristica.pdf)

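# Collect one row of test-set statistics for each k from 1 to 30
Stats_df <- data.frame(Accuracy = double(),
                 Sensitivy = double(),
                 Specificity = double(),
                 Precision = double(),
                 stringsAsFactors=FALSE)
for(i in 1:30) {
  # Fit KNN with k = i and predict the species of every test observation
  SpeciesPrediction <- knn(train = select(TrainingData, Height, Weight), 
                         test = select(TestData, Height, Weight), 
                         cl = TrainingData$Species, 
                         k = i)
  Test <- cbind(TestData, SpeciesPrediction)
  # Rows are the true species, columns are the predicted species.
  # caret's sensitivity(), specificity() and precision() expect predictions in the rows,
  # so with this orientation the Sensitivy and Precision columns trade places and the
  # Specificity column holds the negative predictive value.
  TestTable <- table(Test$Species, Test$SpeciesPrediction)
  
  # Accuracy: share of test observations whose prediction matches the true species
  Stats_df[i,1] <- sum(Test$SpeciesPrediction == Test$Species)/length(Test$SpeciesPrediction)
  Stats_df[i,2] <- sensitivity(TestTable)
  Stats_df[i,3] <- specificity(TestTable)
  Stats_df[i,4] <- precision(TestTable)
}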
Stats_df
##    Accuracy Sensitivy Specificity Precision
## 1     0.830 0.8300000   0.8300000      0.83
## 2     0.810 0.7980769   0.8229167      0.83
## 3     0.840 0.8207547   0.8617021      0.87
## 4     0.795 0.7706422   0.8241758      0.84
## 5     0.810 0.7767857   0.8522727      0.87
## 6     0.810 0.7870370   0.8369565      0.85
## 7     0.805 0.7798165   0.8351648      0.85
## 8     0.810 0.7818182   0.8444444      0.86
## 9     0.805 0.7747748   0.8426966      0.86
## 10    0.790 0.7685185   0.8152174      0.83
## 11    0.790 0.7500000   0.8452381      0.87
## 12    0.780 0.7456140   0.8255814      0.85
## 13    0.780 0.7333333   0.8500000      0.88
## 14    0.770 0.7213115   0.8461538      0.88
## 15    0.765 0.7226891   0.8271605      0.86
## 16    0.765 0.7226891   0.8271605      0.86
## 17    0.755 0.7179487   0.8072289      0.84
## 18    0.750 0.7118644   0.8048780      0.84
## 19    0.745 0.7058824   0.8024691      0.84
## 20    0.750 0.7155172   0.7976190      0.83
## 21    0.740 0.7105263   0.7790698      0.81
## 22    0.725 0.6991150   0.7586207      0.79
## 23    0.735 0.7079646   0.7701149      0.80
## 24    0.740 0.7142857   0.7727273      0.80
## 25    0.725 0.7027027   0.7528090      0.78
## 26    0.740 0.7142857   0.7727273      0.80
## 27    0.735 0.7155963   0.7582418      0.78
## 28    0.740 0.7181818   0.7666667      0.79
## 29    0.735 0.7155963   0.7582418      0.78
## 30    0.735 0.7155963   0.7582418      0.78
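One simple way to read this table is to ask which \(k\) gives the highest test accuracy. A minimal sketch, with the Accuracy column copied from the output above into a vector I have named `acc` (with the real data frame, `Stats_df$Accuracy` would play the same role):

```r
# Accuracy for k = 1..30, copied from the Stats_df output above
acc <- c(0.830, 0.810, 0.840, 0.795, 0.810, 0.810, 0.805, 0.810, 0.805, 0.790,
         0.790, 0.780, 0.780, 0.770, 0.765, 0.765, 0.755, 0.750, 0.745, 0.750,
         0.740, 0.725, 0.735, 0.740, 0.725, 0.740, 0.735, 0.740, 0.735, 0.735)

# which.max() returns the position of the first maximum; row i holds k = i
best_k <- which.max(acc)
best_k        # 3
acc[best_k]   # 0.84
```

On this particular train/test split, \(k=3\) comes out on top. Because the same test set is used both to choose \(k\) and to report its accuracy, that figure is probably a little optimistic, and a different split could favour a neighbouring \(k\).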
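# Plot each statistic against k (row i of Stats_df holds the results for k = i):
# black = Accuracy, red = Sensitivy, green = Specificity, blue = Precision
ggplot(Stats_df, aes(x = seq_len(nrow(Stats_df)))) + 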
  geom_line(aes(y = Accuracy), colour="black") +
  geom_line(aes(y = Sensitivy), colour="red") +
  geom_line(aes(y = Specificity), colour="green") +
  geom_line(aes(y = Precision), colour="blue") + 
  ylab(label="Statistic") + 
  xlab("k")
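Since the colours above are hard-coded, the plot has no legend. One possible alternative, sketched here under the assumption that ggplot2 is installed, is to reshape the statistics into long format so that ggplot2 builds the legend itself. The data frame `stats` is a stand-in holding only the first five rows of the output above (column names kept as they appear there), and `long` and `p` are names I made up for the example.

```r
library(ggplot2)

# Stand-in for Stats_df: rows for k = 1..5 copied from the output above
stats <- data.frame(
  Accuracy    = c(0.830, 0.810, 0.840, 0.795, 0.810),
  Sensitivy   = c(0.8300000, 0.7980769, 0.8207547, 0.7706422, 0.7767857),
  Specificity = c(0.8300000, 0.8229167, 0.8617021, 0.8241758, 0.8522727),
  Precision   = c(0.83, 0.83, 0.87, 0.84, 0.87)
)

# Long format: one row per (k, statistic) pair.
# unlist() concatenates the columns in order, matching rep(..., each = nrow(stats)).
long <- data.frame(
  k         = rep(seq_len(nrow(stats)), times = ncol(stats)),
  Statistic = rep(names(stats), each = nrow(stats)),
  Value     = unlist(stats, use.names = FALSE)
)

# Mapping colour to Statistic draws one line per statistic and adds a legend
p <- ggplot(long, aes(x = k, y = Value, colour = Statistic)) +
  geom_line() +
  ylab("Statistic") +
  xlab("k")
p
```

With the full `Stats_df` in place of `stats`, the same reshaping covers all 30 values of \(k\).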