ML_Supervised Learning Classification

Kate C

2022-01-02

Classification with Nearest Neighbors

KNN - K Nearest Neighbors

Load Packages and Dataset

Packages used to import and analyse the data include class, dplyr, and googlesheets4.

  • class - various functions for classification, including k-nearest neighbor, learning vector quantization, and self-organizing maps.

  • googlesheets4 - used for importing the next_sign dataset built in Google Sheets.

  • a CSV copy of the next_sign dataframe is saved to the local drive so that the data reads faster on re-runs

The data includes a traffic sign dataframe called signs, plus a dataset called next_sign that holds the observation we want to classify.

First kNN training

We create a vector of sign labels to use with kNN, then use the knn function to classify the signs in the test dataset. Each sign is assigned one of three levels: pedestrian, speed, or stop.

The train argument is set to the observations in signs without the label column (signs[-1]).

library(class)

sign_types <- signs$sign_type

knn(train = signs[-1], test = test_signs[-1], cl = sign_types)
##  [1] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
##  [7] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [13] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [19] pedestrian speed      speed      speed      speed      speed     
## [25] speed      speed      speed      speed      speed      speed     
## [31] speed      speed      speed      speed      speed      speed     
## [37] speed      speed      speed      speed      stop       stop      
## [43] stop       stop       stop       stop       stop       stop      
## [49] stop       stop       stop       stop       stop       stop      
## [55] stop       stop       stop       stop       stop      
## Levels: pedestrian speed stop

The knn function correctly classifies a stop sign because it is in some way similar to another stop sign. In other words, kNN does not learn a model; it simply looks for the most similar examples.

Exploring the traffic sign dataset

Each previously observed street sign was divided into a 4x4 grid. The red, green, and blue levels for each of the 16 center pixels were recorded, as illustrated above.

Each sign therefore has 16 x 3 = 48 color properties, stored alongside its sign_type label.

Check the structure of the signs dataframe and view the dataset.

str(signs)
## 'data.frame':    206 obs. of  49 variables:
##  $ sign_type: chr  "pedestrian" "pedestrian" "pedestrian" "pedestrian" ...
##  $ r1       : int  155 142 57 22 169 75 136 118 149 13 ...
##  $ g1       : int  228 217 54 35 179 67 149 105 225 34 ...
##  $ b1       : int  251 242 50 41 170 60 157 69 241 28 ...
##  $ r2       : int  135 166 187 171 231 131 200 244 34 5 ...
##  $ g2       : int  188 204 201 178 254 89 203 245 45 21 ...
##  $ b2       : int  101 44 68 26 27 53 107 67 1 11 ...
##  $ r3       : int  156 142 51 19 97 214 150 132 155 123 ...
##  $ g3       : int  227 217 51 27 107 144 167 123 226 154 ...
##  $ b3       : int  245 242 45 29 99 75 134 12 238 140 ...
##  $ r4       : int  145 147 59 19 123 156 171 138 147 21 ...
##  $ g4       : int  211 219 62 27 147 169 218 123 222 46 ...
##  $ b4       : int  228 242 65 29 152 190 252 85 242 41 ...
##  $ r5       : int  166 164 156 42 221 67 171 254 170 36 ...
##  $ g5       : int  233 228 171 37 236 50 158 254 191 60 ...
##  $ b5       : int  245 229 50 3 117 36 108 92 113 26 ...
##  $ r6       : int  212 84 254 217 205 37 157 241 26 75 ...
##  $ g6       : int  254 116 255 228 225 36 186 240 37 108 ...
##  $ b6       : int  52 17 36 19 80 42 11 108 12 44 ...
##  $ r7       : int  212 217 211 221 235 44 26 254 34 13 ...
##  $ g7       : int  254 254 226 235 254 42 35 254 45 27 ...
##  $ b7       : int  11 26 70 20 60 44 10 99 19 25 ...
##  $ r8       : int  188 155 78 181 90 192 180 108 221 133 ...
##  $ g8       : int  229 203 73 183 110 131 211 106 249 163 ...
##  $ b8       : int  117 128 64 73 9 73 236 27 184 126 ...
##  $ r9       : int  170 213 220 237 216 123 129 135 226 83 ...
##  $ g9       : int  216 253 234 234 236 74 109 123 246 125 ...
##  $ b9       : int  120 51 59 44 66 22 73 40 59 19 ...
##  $ r10      : int  211 217 254 251 229 36 161 254 30 13 ...
##  $ g10      : int  254 255 255 254 255 34 190 254 40 27 ...
##  $ b10      : int  3 21 51 2 12 37 10 115 34 25 ...
##  $ r11      : int  212 217 253 235 235 44 161 254 34 9 ...
##  $ g11      : int  254 255 255 243 254 42 190 254 44 23 ...
##  $ b11      : int  19 21 44 12 60 44 6 99 35 18 ...
##  $ r12      : int  172 158 66 19 163 197 187 138 241 85 ...
##  $ g12      : int  235 225 68 27 168 114 215 123 255 128 ...
##  $ b12      : int  244 237 68 29 152 21 236 85 54 21 ...
##  $ r13      : int  172 164 69 20 124 171 141 118 205 83 ...
##  $ g13      : int  235 227 65 29 117 102 142 105 229 125 ...
##  $ b13      : int  244 237 59 34 91 26 140 75 46 19 ...
##  $ r14      : int  172 182 76 64 188 197 189 131 226 85 ...
##  $ g14      : int  228 228 84 61 205 114 171 124 246 128 ...
##  $ b14      : int  235 143 22 4 78 21 140 5 59 21 ...
##  $ r15      : int  177 171 82 211 125 123 214 106 235 85 ...
##  $ g15      : int  235 228 93 222 147 74 221 94 252 128 ...
##  $ b15      : int  244 196 17 78 20 22 201 53 67 21 ...
##  $ r16      : int  22 164 58 19 160 180 188 101 237 83 ...
##  $ g16      : int  52 227 60 27 183 107 211 91 254 125 ...
##  $ b16      : int  53 237 60 29 187 26 227 59 53 19 ...
View(signs)
  • first column contains the types of the signs: pedestrian, speed, and stop

  • columns from r1 to b16 contain the 16 *3 = 48 color properties

  • overall there are 206 signs

Count the number of signs of each type by passing the column containing the labels to table().

table(signs$sign_type)
## 
## pedestrian      speed       stop 
##         65         70         71

Check r10’s average red level by sign type. The result tells us whether the average red level might vary by sign type.

We can either use the aggregate function to calculate the results or use dplyr to achieve the same.

aggregate(r10 ~ sign_types, data = signs, mean)
##   sign_types       r10
## 1 pedestrian 108.78462
## 2      speed  83.08571
## 3       stop 142.50704

Using dplyr package:

signs %>% 
  group_by(sign_type) %>% 
  summarize(mean(r10))
## # A tibble: 3 × 2
##   sign_type  `mean(r10)`
##   <chr>            <dbl>
## 1 pedestrian       109. 
## 2 speed             83.1
## 3 stop             143.

The only difference is that the dplyr output shows fewer decimal places than the aggregate output. This is because the dplyr result is a tibble, which by default prints numbers with 3 significant digits.

Significant digits are not the same as the number of digits after the decimal point. To control the latter, simply convert the result to a data.frame with as.data.frame, which we add as an extra step to the code above. The results are then presented the same way as those from aggregate.

signs %>% 
  group_by(sign_type) %>% 
  summarize(mean(r10)) %>% 
  as.data.frame(.)
##    sign_type mean(r10)
## 1 pedestrian 108.78462
## 2      speed  83.08571
## 3       stop 142.50704

As the aggregated red levels show, stop signs tend to have a higher average red value. Differences like this across the color properties are what let kNN identify similar signs.

Classify a collection of road signs

Before heading into the analysis, we need to create the “test” dataset this course is using.

test_signs <- signs_full %>% 
  filter(sample == "test") %>% 
  select(-sample, -id)

Now we use kNN to classify the test road signs. The cl argument is a factor holding the true classifications of the training set.

knn performs k-nearest neighbor classification of a test set from a training set. For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote.
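The voting procedure described above can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions, not the implementation in the class package: the function name knn_sketch and the toy data are hypothetical, and ties are resolved by taking the first maximum rather than at random.

```r
# Minimal kNN sketch in base R (illustrative only; not class::knn itself).
knn_sketch <- function(train, test_row, cl, k = 1) {
  train <- as.matrix(train)
  # Euclidean distance from the test row to every training row
  dists <- sqrt(rowSums(sweep(train, 2, as.numeric(test_row))^2))
  # Labels of the k closest training rows
  nearest <- cl[order(dists)[seq_len(k)]]
  # Majority vote; ties go to the first maximum here, not a random pick
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

# Toy data: two features, two classes (hypothetical values)
toy_train <- data.frame(x = c(1, 2, 1, 8, 9, 8), y = c(1, 1, 2, 8, 8, 9))
toy_cl <- c("stop", "stop", "stop", "speed", "speed", "speed")

pred_near_stop  <- knn_sketch(toy_train, c(1.5, 1.5), toy_cl, k = 3)
pred_near_speed <- knn_sketch(toy_train, c(8.5, 8.5), toy_cl, k = 3)
```

Nothing is fitted ahead of time: every prediction recomputes the distances to all training rows, which is why kNN is often described as a lazy learner.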

sign_types <- signs$sign_type
signs_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)

Then create a confusion matrix of the predicted versus actual values. The confusion matrix lets us look for patterns in the classifier’s errors.

signs_actual <- test_signs$sign_type
table(signs_pred, signs_actual)
##             signs_actual
## signs_pred   pedestrian speed stop
##   pedestrian         19     0    0
##   speed               0    21    0
##   stop                0     0   19

Compute the accuracy.

mean(signs_pred == signs_actual)
## [1] 1
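The same accuracy can be read directly off the confusion matrix: correct predictions sit on the diagonal. The sketch below rebuilds the matrix from the counts printed above (the object names conf_mat, accuracy, and recall_by_class are introduced here only for illustration) and also derives a per-class rate, which is more informative than overall accuracy once errors start to appear.

```r
# Rebuild the confusion matrix from the counts reported above
conf_mat <- matrix(
  c(19,  0,  0,
     0, 21,  0,
     0,  0, 19),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    signs_pred   = c("pedestrian", "speed", "stop"),
    signs_actual = c("pedestrian", "speed", "stop")
  )
)

# Overall accuracy: correct predictions (diagonal) over all predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-class recall: correct predictions over the actual members of each class
recall_by_class <- diag(conf_mat) / colSums(conf_mat)
```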

K in the kNN

An introduction

  • k is a variable that specifies the number of neighbors to consider when making the classification; in effect, it determines the size of the neighborhoods.

  • If k is not set in the knn function, it defaults to 1, meaning that only the single nearest, most similar neighbor is used to classify the unlabeled example.

  • But a bigger k is not always better. With smaller neighborhoods, kNN can identify more subtle patterns in the data.

  • A small k creates small neighborhoods. A larger k means fuzzier boundaries, which lets the classifier ignore some potentially noisy points.

Determine its value

  • optimal value depends on the complexity of the pattern to be learned, as well as the impact of noisy data.

  • rule of thumb is to start with k equal to the square root of the number of observations in the training data.

  • a better approach is to test several different values of k and compare the performance on data the model has not seen before
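Both suggestions above can be combined: start near the square root of the training set size, then compare a handful of candidate values on held-out data. The sketch below is an illustration under stated assumptions: the built-in iris data stands in for the signs data, the 100/50 split and the candidate values are arbitrary, and the names k_start and accuracy_by_k are hypothetical. For the 206 rows in signs, the rule of thumb would suggest starting at about k = 14.

```r
library(class)  # provides knn()

set.seed(42)  # knn() breaks ties at random, so fix the seed

# Assumption: iris stands in for the signs data; the split is arbitrary
train_idx <- sample(nrow(iris), 100)
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

# Rule-of-thumb starting point: square root of the training set size
k_start <- round(sqrt(nrow(train_x)))

# Accuracy on held-out data for several candidate values of k
k_values <- c(1, 5, k_start, 15)
accuracy_by_k <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})
names(accuracy_by_k) <- paste0("k=", k_values)
accuracy_by_k
```

The value of k with the best held-out accuracy is a reasonable choice, though with a test set this small the differences between neighboring values may just be noise.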

Some coding examples

Compute the accuracy of the baseline model (default k = 1)

test_signs <- read.csv("/Users/katechen/Documents/R programming/Datacamp/DC_ML/Data/test_signs.csv")
k_1 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)
mean(k_1 == signs_actual)
## [1] 1

Modify the code to set k = 7

k_7 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 7)
mean(k_7 == signs_actual)
## [1] 0.9661017

Set k = 15 and compare the results

k_15 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 15)
mean(k_15 == signs_actual)
## [1] 0.9661017

See how each neighbor voted

When multiple nearest “neighbors” hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely divided.

That is, knowing the voters’ confidence in a classification lets us treat low-confidence predictions with caution.

By adding the prob parameter, we can extract the proportion of votes for the winning class and see how it varies from prediction to prediction.

sign_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 7, prob = TRUE)

We use the attr() function to obtain the vote proportions for the predicted class. The results are stored in the attribute “prob”.

sign_prob <- attr(sign_pred, "prob")

Examine the first several predictions.

head(sign_pred)
## [1] pedestrian pedestrian pedestrian stop       pedestrian pedestrian
## Levels: pedestrian speed stop

Examine the proportion of votes for the winning class.

head(sign_prob)
## [1] 0.5714286 0.7142857 0.8571429 0.4285714 1.0000000 0.8571429
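Because k = 7, each proportion above is simply a number of winning votes divided by 7. The sketch below converts the six printed values back into vote counts and flags the shakier predictions; the two-thirds cutoff and the names sign_prob_head, winning_votes, and low_confidence are illustrative assumptions, not part of the original analysis.

```r
# Vote proportions reported above for the first six predictions (k = 7)
sign_prob_head <- c(0.5714286, 0.7142857, 0.8571429,
                    0.4285714, 1.0000000, 0.8571429)
k <- 7

# Convert each proportion back to a number of winning votes
winning_votes <- round(sign_prob_head * k)

# Flag predictions where the winner had less than two thirds of the votes
# (the 2/3 cutoff is an arbitrary, illustrative threshold)
low_confidence <- sign_prob_head < 2 / 3
```

With three classes, a prediction can win with only 3 of 7 votes, as the fourth prediction does, so a low proportion is a signal to treat that classification with caution.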

Data preparation for kNN

  • kNN assumes numeric data

  • categorical values need dummy coding: each category becomes a 0/1 indicator, with one category left out as the reference

  • kNN benefits from normalized data - when calculating distance, each feature of the input data should be measured with the same range of values
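Dummy coding can be sketched with base R's model.matrix function. The shape feature below is hypothetical (the signs data contains only numeric color levels), and the name dummies is introduced just for this example.

```r
# Hypothetical categorical feature (not part of the signs data)
shape <- factor(c("octagon", "rectangle", "diamond", "octagon"))

# model.matrix() builds 0/1 indicator columns; the first level ("diamond")
# is left out and acts as the reference category.
# [, -1] drops the intercept column, which kNN does not need.
dummies <- model.matrix(~ shape)[, -1]
dummies
```

A row of all zeros therefore means the left-out category, so no information is lost by dropping it.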

Normalize data in R

The function below, for example, rescales a vector x so that its minimum value is zero and its maximum value is one.

normalize <- function (x) {
  return((x - min(x)) / (max(x) - min(x)))
}

Note that rescaling keeps features measured on larger scales from dominating kNN’s distance function.
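To rescale a whole dataframe, the function can be applied column by column with lapply. The sketch below repeats the normalize definition so it runs on its own; the toy dataframe and its column names are hypothetical.

```r
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical features measured on very different scales
toy <- data.frame(red = c(0, 51, 255), width_m = c(0.5, 0.75, 1.0))

# Rescale every column to the range 0 to 1
toy_norm <- as.data.frame(lapply(toy, normalize))
toy_norm
```

After rescaling, a difference of one unit means the same thing in every column, so no single feature dominates the Euclidean distance. Note that this version of normalize divides by zero for a constant column, which would need separate handling.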