Classification with Nearest Neighbors
KNN - K Nearest Neighbors
Load Packages and Dataset
Packages used to import and analyse the data include class, dplyr, and googlesheets4.
class - various functions for classification, including k-nearest neighbor, learning vector quantization and self-organizing maps.
googlesheets4 - used for importing the next_sign dataset built in Google Sheets.
The next_sign data frame was also saved as a CSV file on the local drive so that the data loads faster on re-runs.
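The save-once, read-locally pattern described above can be sketched as follows. This is only an illustration: the cache path, the stand-in values in next_sign, and the choice of base R write.csv/read.csv are assumptions, not the original code.

```r
# Hypothetical cache location; the original file path is not given.
cache_file <- file.path(tempdir(), "next_sign.csv")

# Stand-in for the data frame imported from Google Sheets (made-up values).
next_sign <- data.frame(r1 = 204L, g1 = 227L, b1 = 220L)

# Write the local copy only if it does not exist yet.
if (!file.exists(cache_file)) {
  write.csv(next_sign, cache_file, row.names = FALSE)
}

# Later runs read the local copy instead of going back to the sheet.
next_sign_cached <- read.csv(cache_file)
```

Passing row.names = FALSE keeps write.csv from adding an extra index column that would otherwise come back as a spurious first variable.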
The data includes a traffic sign data frame called signs, plus a data frame called next_sign that holds the observation we want to classify.
First kNN training
We create a vector of sign labels to use with kNN and call the knn function to classify the signs in the test dataset. The output below lists one predicted label per test sign; the factor levels are pedestrian, speed, and stop.
The train argument is set to the observations in signs, without the label column.
library(class)
sign_types <- signs$sign_type
knn(train = signs[-1], test = test_signs[-1], cl = sign_types)
## [1] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [7] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [13] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [19] pedestrian speed speed speed speed speed
## [25] speed speed speed speed speed speed
## [31] speed speed speed speed speed speed
## [37] speed speed speed speed stop stop
## [43] stop stop stop stop stop stop
## [49] stop stop stop stop stop stop
## [55] stop stop stop stop stop
## Levels: pedestrian speed stop
The knn function correctly classifies a stop sign because it is in some way similar to another stop sign in the training data. In other words, kNN does not learn a model; it simply looks for the most similar examples.
Exploring the traffic sign dataset
Each previously observed street sign was divided into a 4x4 grid. The red, green, and blue level for each of the 16 center pixels is recorded as illustrated above.
This gives each sign 16 x 3 = 48 color properties.
Check out the structure of signs dataframe and view the signs dataset.
str(signs)
## 'data.frame': 206 obs. of 49 variables:
## $ sign_type: chr "pedestrian" "pedestrian" "pedestrian" "pedestrian" ...
## $ r1 : int 155 142 57 22 169 75 136 118 149 13 ...
## $ g1 : int 228 217 54 35 179 67 149 105 225 34 ...
## $ b1 : int 251 242 50 41 170 60 157 69 241 28 ...
## $ r2 : int 135 166 187 171 231 131 200 244 34 5 ...
## $ g2 : int 188 204 201 178 254 89 203 245 45 21 ...
## $ b2 : int 101 44 68 26 27 53 107 67 1 11 ...
## $ r3 : int 156 142 51 19 97 214 150 132 155 123 ...
## $ g3 : int 227 217 51 27 107 144 167 123 226 154 ...
## $ b3 : int 245 242 45 29 99 75 134 12 238 140 ...
## $ r4 : int 145 147 59 19 123 156 171 138 147 21 ...
## $ g4 : int 211 219 62 27 147 169 218 123 222 46 ...
## $ b4 : int 228 242 65 29 152 190 252 85 242 41 ...
## $ r5 : int 166 164 156 42 221 67 171 254 170 36 ...
## $ g5 : int 233 228 171 37 236 50 158 254 191 60 ...
## $ b5 : int 245 229 50 3 117 36 108 92 113 26 ...
## $ r6 : int 212 84 254 217 205 37 157 241 26 75 ...
## $ g6 : int 254 116 255 228 225 36 186 240 37 108 ...
## $ b6 : int 52 17 36 19 80 42 11 108 12 44 ...
## $ r7 : int 212 217 211 221 235 44 26 254 34 13 ...
## $ g7 : int 254 254 226 235 254 42 35 254 45 27 ...
## $ b7 : int 11 26 70 20 60 44 10 99 19 25 ...
## $ r8 : int 188 155 78 181 90 192 180 108 221 133 ...
## $ g8 : int 229 203 73 183 110 131 211 106 249 163 ...
## $ b8 : int 117 128 64 73 9 73 236 27 184 126 ...
## $ r9 : int 170 213 220 237 216 123 129 135 226 83 ...
## $ g9 : int 216 253 234 234 236 74 109 123 246 125 ...
## $ b9 : int 120 51 59 44 66 22 73 40 59 19 ...
## $ r10 : int 211 217 254 251 229 36 161 254 30 13 ...
## $ g10 : int 254 255 255 254 255 34 190 254 40 27 ...
## $ b10 : int 3 21 51 2 12 37 10 115 34 25 ...
## $ r11 : int 212 217 253 235 235 44 161 254 34 9 ...
## $ g11 : int 254 255 255 243 254 42 190 254 44 23 ...
## $ b11 : int 19 21 44 12 60 44 6 99 35 18 ...
## $ r12 : int 172 158 66 19 163 197 187 138 241 85 ...
## $ g12 : int 235 225 68 27 168 114 215 123 255 128 ...
## $ b12 : int 244 237 68 29 152 21 236 85 54 21 ...
## $ r13 : int 172 164 69 20 124 171 141 118 205 83 ...
## $ g13 : int 235 227 65 29 117 102 142 105 229 125 ...
## $ b13 : int 244 237 59 34 91 26 140 75 46 19 ...
## $ r14 : int 172 182 76 64 188 197 189 131 226 85 ...
## $ g14 : int 228 228 84 61 205 114 171 124 246 128 ...
## $ b14 : int 235 143 22 4 78 21 140 5 59 21 ...
## $ r15 : int 177 171 82 211 125 123 214 106 235 85 ...
## $ g15 : int 235 228 93 222 147 74 221 94 252 128 ...
## $ b15 : int 244 196 17 78 20 22 201 53 67 21 ...
## $ r16 : int 22 164 58 19 160 180 188 101 237 83 ...
## $ g16 : int 52 227 60 27 183 107 211 91 254 125 ...
## $ b16 : int 53 237 60 29 187 26 227 59 53 19 ...
View(signs)
The first column contains the sign types: pedestrian, speed, and stop.
Columns r1 through b16 contain the 16 x 3 = 48 color properties.
Overall there are 206 signs.
Count the number of signs of each type by passing the column containing the labels to table().
table(signs$sign_type)
##
## pedestrian speed stop
## 65 70 71
Check r10’s average red level by sign type. The result tells us whether the average red level might vary by sign type.
We can either use the aggregate function to calculate the results or use dplyr to achieve the same.
aggregate(r10 ~ sign_types, data = signs, mean)
## sign_types r10
## 1 pedestrian 108.78462
## 2 speed 83.08571
## 3 stop 142.50704
Using dplyr package:
signs %>%
  group_by(sign_type) %>%
  summarize(mean(r10))
## # A tibble: 3 × 2
## sign_type `mean(r10)`
## <chr> <dbl>
## 1 pedestrian 109.
## 2 speed 83.1
## 3 stop 143.
The only difference is that the dplyr output shows fewer digits than the aggregate output. The result is a tibble, which by default prints numbers with 3 significant digits.
Significant digits are not the same as digits after the decimal point. To print the full values, convert the result to a data.frame with as.data.frame. The code below adds that extra step, and the results then print the same way as they do from aggregate.
signs %>%
  group_by(sign_type) %>%
  summarize(mean(r10)) %>%
  as.data.frame(.)
## sign_type mean(r10)
## 1 pedestrian 108.78462
## 2 speed 83.08571
## 3 stop 142.50704
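The difference between significant digits and decimal places can be checked in base R. The sketch below uses the three group means printed above; signif() is used only to approximate the tibble display, since tibble's actual printing is handled by its own formatting code.

```r
# Group means of r10, copied from the aggregate() output.
means <- c(pedestrian = 108.78462, speed = 83.08571, stop = 142.50704)

# Three significant digits: counts digits from the first non-zero digit.
sig3 <- signif(means, 3)   # 109, 83.1, 143

# Three decimal places: counts digits after the decimal point.
dec3 <- round(means, 3)    # 108.785, 83.086, 142.507
```

The values from signif() match the tibble printout (109., 83.1, 143.), while round() keeps three digits after the decimal point regardless of the size of the number.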
As the aggregated results show, stop signs tend to have a higher average red value at r10. Differences like this in the color features are how kNN identifies similar signs.
Classify a collection of road signs
Before heading into the analysis, we need to create the “test” dataset this course is using.
test_signs <- signs_full %>%
  filter(sample == "test") %>%
  select(-sample, -id)
Now we use kNN to classify the test road signs. The cl argument is a factor holding the true classifications of the training set.
k-nearest neighbor classification for test set from training set. For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote.
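To make the procedure concrete, here is a minimal from-scratch sketch of the same idea for a single test point. It is not how class::knn is implemented: knn_sketch and the toy data are made up for illustration, and ties are resolved by taking the first winning label rather than at random.

```r
# Classify one new point by majority vote among its k nearest training rows.
knn_sketch <- function(train, labels, new_point, k = 1) {
  # Subtract the new point from every training row, column by column.
  diffs <- sweep(as.matrix(train), 2, new_point)
  # Euclidean distance from the new point to each training row.
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training rows.
  nearest <- order(dists)[seq_len(k)]
  # Count the labels among those neighbors and return the most common one.
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Toy training data with two made-up features (not the signs data).
toy_train  <- data.frame(x = c(1, 2, 3, 8, 9), y = c(1, 2, 1, 8, 9))
toy_labels <- c("a", "a", "a", "b", "b")

pred_near_a <- knn_sketch(toy_train, toy_labels, c(2, 1), k = 3)
pred_near_b <- knn_sketch(toy_train, toy_labels, c(8, 9), k = 1)
```

Here pred_near_a is "a" because all three nearest rows carry that label, and pred_near_b is "b" because the closest row is one of the two "b" points.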
sign_types <- signs$sign_type
signs_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)
Then create a confusion matrix of the predicted versus actual values. The confusion matrix lets us look for patterns in the classifier’s errors.
signs_actual <- test_signs$sign_type
table(signs_pred, signs_actual)
## signs_actual
## signs_pred pedestrian speed stop
## pedestrian 19 0 0
## speed 0 21 0
## stop 0 0 19
Compute the accuracy.
mean(signs_pred == signs_actual)
## [1] 1
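Because the result above is perfect, the confusion matrix has no off-diagonal entries to inspect. The toy sketch below, with made-up predictions and labels, shows what an error looks like in the table and how accuracy relates to its diagonal.

```r
# Made-up predictions and true labels, for illustration only.
sign_levels <- c("pedestrian", "speed", "stop")
predicted <- factor(c("stop", "stop", "speed", "pedestrian", "speed"),
                    levels = sign_levels)
actual    <- factor(c("stop", "speed", "speed", "pedestrian", "speed"),
                    levels = sign_levels)

# Rows are predictions, columns are true labels.
conf_mat <- table(predicted, actual)

# Correct predictions sit on the diagonal, so accuracy is diagonal / total.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
```

One speed sign is predicted as stop, so it shows up in the stop row and speed column, and the accuracy is 4/5 = 0.8, the same value that mean(predicted == actual) returns.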
K in the kNN
An introduction
k specifies the number of neighbors to consider when making the classification; in effect, it determines the size of the neighborhoods.
If k is not set in the knn() call, it defaults to 1, meaning only the single nearest, most similar neighbor is used to classify the unlabeled example.
But a bigger k is not always better. With smaller neighborhoods, kNN can identify more subtle patterns in the data.
A small k creates small neighborhoods. A larger k produces fuzzier boundaries and ignores some potentially noisy points.
Determine its value
The optimal value depends on the complexity of the pattern to be learned, as well as the impact of noisy data.
A rule of thumb is to start with k equal to the square root of the number of observations in the training data.
A better approach is to test several different values of k and compare their performance on data the model has not seen before.
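Both suggestions can be sketched in a few lines. The rule of thumb uses the 206 training signs reported earlier; the comparison loop uses the built-in iris data purely as a stand-in, with an arbitrary seed, split size, and set of k values.

```r
library(class)

# Rule-of-thumb starting point: sqrt(206) is about 14.35.
k_start <- round(sqrt(206))

# iris stands in for the signs data; hold out 50 rows as unseen test data.
set.seed(42)
test_idx <- sample(nrow(iris), 50)
train_x <- iris[-test_idx, 1:4]
test_x  <- iris[test_idx, 1:4]
train_y <- iris$Species[-test_idx]
test_y  <- iris$Species[test_idx]

# Fit kNN for several values of k and record the held-out accuracy of each.
k_values <- c(1, 5, 10, 15)
accuracy_by_k <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})
names(accuracy_by_k) <- k_values
accuracy_by_k
```

The k with the best held-out accuracy is a reasonable choice, though on a small test set differences of one or two observations should not be over-interpreted.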
Some coding examples
Compute the accuracy of the baseline model (default k = 1)
test_signs <- read.csv("/Users/katechen/Documents/R programming/Datacamp/DC_ML/Data/test_signs.csv")
k_1 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)
mean(k_1 == signs_actual)
## [1] 1
Modify the code to set k = 7
k_7 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 7)
mean(k_7 == signs_actual)
## [1] 0.9661017
Set k = 15 and compare the results
k_15 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 15)
mean(k_15 == signs_actual)
## [1] 0.9661017
See how each neighbor voted
When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely divided.
For example, knowing the voters’ confidence in a classification lets us treat low-confidence predictions with caution.
By adding the prob parameter, we can extract the proportion of votes for the winning class and see how the proportions vary from prediction to prediction.
sign_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 7, prob = TRUE)
We use the attr() function to obtain the vote proportions for the predicted class. The results are stored in the attribute “prob”.
sign_prob <- attr(sign_pred, "prob")
Examine the first several predictions.
head(sign_pred)
## [1] pedestrian pedestrian pedestrian stop pedestrian pedestrian
## Levels: pedestrian speed stop
Examine the proportion of votes for the winning class.
head(sign_prob)
## [1] 0.5714286 0.7142857 0.8571429 0.4285714 1.0000000 0.8571429
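With k = 7, each of these proportions is a count of winning votes divided by seven. The sketch below reconstructs the six printed values from vote counts inferred from that arithmetic, then flags weak majorities using a hypothetical cutoff of 0.6.

```r
k <- 7

# Winning-vote counts inferred from the printed proportions (4/7, 5/7, ...).
winning_votes <- c(4, 5, 6, 3, 7, 6)
vote_prop <- winning_votes / k
round(vote_prop, 7)

# Hypothetical cutoff: treat predictions below 60% agreement with caution.
low_confidence <- vote_prop < 0.6
```

The first and fourth predictions fall below the cutoff, so those are the ones worth a second look before acting on the predicted label.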
Data preparation for kNN
kNN assumes numeric data.
Categorical values can be dummy coded: each category except one gets a 0/1 indicator, and the left-out category is represented by all indicators being 0.
kNN benefits from normalized data: when calculating distance, each feature of the input data should be measured on the same range of values.
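Dummy coding can be sketched with base R's model.matrix(). The shape variable below is hypothetical (the signs data contains only numeric color values); it is here only to show the 0/1 columns and the left-out category.

```r
# Hypothetical categorical feature, not part of the signs data.
shape <- factor(c("rectangle", "diamond", "octagon", "diamond"))

# model.matrix() builds an intercept plus one 0/1 column per level except
# the first (alphabetically, "diamond"); drop the intercept column.
dummies <- model.matrix(~ shape)[, -1]
dummies
```

The result has two columns, shapeoctagon and shaperectangle. A diamond is encoded as 0 in both, which is the left-out item mentioned above.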
Normalize data in R
The function below, for example, rescales a vector x so that its minimum value is zero and its maximum value is one.
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
Note that rescaling reduces the influence of extreme values on kNN’s distance function, so features with large ranges do not dominate the distance.
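As a quick check, the sketch below redefines the same min-max function and applies it to the first five r1 values from the str() output, then to every column of a small made-up data frame. The lapply() pattern is one common way to rescale all features at once; it is an assumption here, not a step shown in the original analysis.

```r
# Same min-max rescaling as above, repeated so the example is self-contained.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# First five r1 values from the str(signs) output.
red <- c(155, 142, 57, 22, 169)
red_scaled <- normalize(red)

# Made-up data frame with features on very different scales.
toy <- data.frame(a = c(1, 5, 9), b = c(100, 300, 200))
toy_scaled <- as.data.frame(lapply(toy, normalize))
```

After rescaling, both columns of toy_scaled run from 0 to 1, so neither dominates a Euclidean distance. A constant column would divide by zero and return NaN, so such columns need to be dropped or handled separately.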