kNN

Classification, supervised. Numeric features. Ch2.

The data is Activity recognition with healthy older people using a batteryless wearable sensor. There are eight features,

Column 1: Time in seconds Column 2: Acceleration reading in G for frontal axis Column 3: Acceleration reading in G for vertical axis Column 4: Acceleration reading in G for lateral axis Column 5: Id of antenna reading sensor Column 6: Received signal strength indicator (RSSI) Column 7: Phase Column 8: Frequency

The class variable is,

Column 9: Label of activity, 1: sit on bed, 2: sit on chair, 3: lying, 4: ambulating

Only one person’s activities will be analyzed, a 60 year old female.

Step 1: Data collection

Download dataset from: https://archive.ics.uci.edu/ml/datasets/Activity+recognition+with+healthy+older+people+using+a+batteryless+wearable+sensor

Changed the response column from a numeric to a factor with named levels.

library(readr)
participant <- read_csv("data/d1p60F", col_names = FALSE)

colnames(participant) <- c("sec", "acc_front", "acc_vert", "acc_lat", "att_id", "rssi", "phase", "frequency", "activity")

participant$activity <- factor(participant$activity)
levels(participant$activity) <- c("sit on bed", "sit on chair", "lying", "ambulating")

# Preview the dataset. acc_lat removed purely for display purposes; not enough space to display all variables.
rbind(head(participant[ , -4]), tail(participant[ , -4]))

## # A tibble: 12 x 8
##         sec acc_front acc_vert att_id  rssi phase frequency activity  
##       <dbl>     <dbl>    <dbl>  <int> <dbl> <dbl>     <dbl> <fct>     
##  1   0.        0.249     1.05       1 -45.5 4.43       920. ambulating
##  2   0.0250    0.342     1.18       1 -45.0 4.62       922. ambulating
##  3   0.250     0.0610    1.08       1 -44.5 5.84       922. ambulating
##  4   0.300     0.131     1.05       1 -49.5 6.22       923. ambulating
##  5   1.50      0.471     0.905      2 -56.0 1.25       921. ambulating
##  6   2.25      0.471     0.905      2 -57.5 5.89       924. ambulating
##  7 485.        0.331     0.951      3 -57.5 5.17       920. sit on bed
##  8 485.        0.331     0.951      1 -57.0 6.22       922. sit on bed
##  9 485.        0.331     0.951      3 -58.0 0.880      922. sit on bed
## 10 486.        0.331     0.951      1 -60.5 5.01       925. sit on bed
## 11 488.        0.120     0.824      4 -56.0 4.91       926. ambulating
## 12 489.        0.0727    1.32       4 -56.5 1.04       923. ambulating

Step 2: Exploring and preparing the data

Lying and sitting in bed made up the bulk of this person’s activities, at 66.5% and 26.2% respectively.

round(table(participant$activity) / length(participant$activity) , 3)

## 
##   sit on bed sit on chair        lying   ambulating 
##        0.267        0.167        0.539        0.028

There are 4 antennas collecting data in each person’s room, identified by “att_id.” For simplicity’s sake, I assume all antennas are equally accurate and effective at monitoring the participant’s activities. Column 5, the “att_id” variable is dropped.

Additionally, the “sec” feature is a time stamp in seconds, which could function as an identifier and thus highly bias kNN. It is also dropped.

participant2 <- participant[ , c(-1, -5)]

The following is a pairwise scatter plot of the 7 features.

pairs(~acc_front + acc_vert + acc_lat + rssi + phase + frequency, 
      data = participant2,
      main = 'Scaterplot of many variables')

Normalizing features so that they are all equally weighted, using Min-Max normalization.

normalize <- function(x){
  ((x-min(x)) / (max(x) - min(x)))
}

participant_n <- as.data.frame(
  lapply(participant2[1:6], normalize)
)

participant2 <- cbind(participant2[7], participant_n)

rm(participant_n)

Randomizing order of the dataframe, then making training and test sets and label sets. 80% of the observations are used for training, with the remaining 20% left for testing.

participant2 <- participant2[sample(1:nrow(participant2)), ]

training_set_length <- round(nrow(participant2)*.8)

training_set <- participant2[1:training_set_length, -1]
testing_set <- participant2[(training_set_length+1):nrow(participant2) , -1]

training_labels <- participant2[1:training_set_length, 1]
testing_labels <- participant2[(training_set_length+1):nrow(participant2) , 1]

Step 3: Training the model

For the first iteration of kNN, k = 27, the root of the length of the training set.

library(class); library(gmodels); library(tictoc)

tic()

print("k = 27")

## [1] "k = 27"

participant_pred27 <- knn(train = training_set, test = testing_set, cl = training_labels, k=27)

toc()

## 0.02 sec elapsed

Step 4: Evaluating model performance

The model performance is poor, never once correctly identifying when the participant was ambulating. The accuracy of correctly identifying the other activities ranged from 70% to 100%. Since lying made up the bulk of the participant’s activities, the model seems to be biased towards that.

CrossTable(x = testing_labels, y = participant_pred27, 
           prop.chisq=F, prop.c = F, prop.t = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  145 
## 
##  
##                | participant_pred27 
## testing_labels |   sit on bed | sit on chair |        lying |    Row Total | 
## ---------------|--------------|--------------|--------------|--------------|
##     sit on bed |           23 |           12 |            9 |           44 | 
##                |        0.523 |        0.273 |        0.205 |        0.303 | 
## ---------------|--------------|--------------|--------------|--------------|
##   sit on chair |           10 |           14 |            0 |           24 | 
##                |        0.417 |        0.583 |        0.000 |        0.166 | 
## ---------------|--------------|--------------|--------------|--------------|
##          lying |            0 |            0 |           73 |           73 | 
##                |        0.000 |        0.000 |        1.000 |        0.503 | 
## ---------------|--------------|--------------|--------------|--------------|
##     ambulating |            4 |            0 |            0 |            4 | 
##                |        1.000 |        0.000 |        0.000 |        0.028 | 
## ---------------|--------------|--------------|--------------|--------------|
##   Column Total |           37 |           26 |           82 |          145 | 
## ---------------|--------------|--------------|--------------|--------------|
## 
##

Step 5: Improving model performance

Smaller k’s chosen to lessen the impact of the more common labels. The following are k = 3 and 1.

Comparing the Confusion Matrices, k=1 is superior to k=3. Although the accuracy of identifying sitting on bed drops from 80% to 77%, sitting in chair and lying retain their high accuracies, and the accuracy of ID’ing ambulating jumps from 33% to 56%.

tic()

print("k = 3")

## [1] "k = 3"

participant_pred3 <- knn(train = training_set, test = testing_set, cl = training_labels, k=3)

toc()

## 0.02 sec elapsed

CrossTable(x = testing_labels, y = participant_pred3, 
           prop.chisq=F, prop.c = F, prop.t = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  145 
## 
##  
##                | participant_pred3 
## testing_labels |   sit on bed | sit on chair |        lying |   ambulating |    Row Total | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##     sit on bed |           32 |            8 |            4 |            0 |           44 | 
##                |        0.727 |        0.182 |        0.091 |        0.000 |        0.303 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##   sit on chair |            2 |           22 |            0 |            0 |           24 | 
##                |        0.083 |        0.917 |        0.000 |        0.000 |        0.166 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##          lying |            0 |            0 |           73 |            0 |           73 | 
##                |        0.000 |        0.000 |        1.000 |        0.000 |        0.503 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##     ambulating |            2 |            0 |            0 |            2 |            4 | 
##                |        0.500 |        0.000 |        0.000 |        0.500 |        0.028 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##   Column Total |           36 |           30 |           77 |            2 |          145 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
## 
##

tic()

print("k = 1")

## [1] "k = 1"

participant_pred1 <- knn(train = training_set, test = testing_set, cl = training_labels, k=1)

toc()

## 0 sec elapsed

CrossTable(x = testing_labels, y = participant_pred1, 
           prop.chisq=F, prop.c = F, prop.t = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  145 
## 
##  
##                | participant_pred1 
## testing_labels |   sit on bed | sit on chair |        lying |   ambulating |    Row Total | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##     sit on bed |           34 |            6 |            4 |            0 |           44 | 
##                |        0.773 |        0.136 |        0.091 |        0.000 |        0.303 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##   sit on chair |            3 |           21 |            0 |            0 |           24 | 
##                |        0.125 |        0.875 |        0.000 |        0.000 |        0.166 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##          lying |            0 |            0 |           73 |            0 |           73 | 
##                |        0.000 |        0.000 |        1.000 |        0.000 |        0.503 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##     ambulating |            0 |            0 |            0 |            4 |            4 | 
##                |        0.000 |        0.000 |        0.000 |        1.000 |        0.028 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##   Column Total |           37 |           27 |           77 |            4 |          145 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
## 
##

Naive Bayes

This is a classification problem involving mushrooms. Based on 21 categorical features, a Naïve Bayes algorithm was used to identify whether particular mushrooms are Poisonous or Edible.

Step 1: Data collection

The data was downloaded from UCI’s Machine Learning Repository, consisting of 8124 observations (hypothetical samples of many species of gilled mushrooms) and 23 variables. The class variable is Type, denoting whether the mushroom is definitely edible (“e”) or poisonous (“p”). The other 22 variables are various attributes of the mushrooms, like color, odor, and size.

A link to the dataset: https://archive.ics.uci.edu/ml/datasets/Mushroom

library(readr)
mushrooms <- read_csv("data/agaricus-lepiota.data", 
                      col_names = c("Type",
                                    # Start feature 1, by twos:
                                    "cap-shape", "cap-surface", 
                                    "cap-color", "bruises", 
                                    "odor", "gill-attachment", 
                                    "gill-spacing", "gill-size", 
                                    "gill-color", "stalk-shape",
                                    "stalk-root", "stalk-surface-above-ring", 
                                    "stalk-surface-below-ring", "stalk-color-above-ring",
                                    "stalk-color-below-ring", "veil-type", 
                                    "veil-color", "ring-number", 
                                    "ring-type", "spore-print-color",
                                    "population", "habitat"))

## Parsed with column specification:
## cols(
##   .default = col_character()
## )

## See spec(...) for full column specifications.

Step 2: Exploring and preparing the data

The feature “stalk-root” has a lot of missing values: 2480 out of 8124 observations. It was removed, leaving 21 variables for the algorithm. All variables were converted to categories, and the dataset was left as a dataframe type. After randomizing the order, the dataset was split 3:1 for training and testing.

# The ratio of edible to poisonous mushrooms
table(mushrooms$Type)

## 
##    e    p 
## 4208 3916

# Since feature "stalk-root" has a lot of missing values, remove it.
table(mushrooms$`stalk-root`)

## 
##    ?    b    c    e    r 
## 2480 3776  556 1120  192

mushrooms2 <- mushrooms[ , -12]

# All variables must be factors in a df! Matrix wouldn't work...
mushrooms2 <- as.data.frame(apply(mushrooms2, MARGIN = 2, FUN = factor))

Shuffle dataset and divide into training and testing.

mushrooms2 <- mushrooms2[sample(nrow(mushrooms2)),]

train.set <- mushrooms2[1:6093, -1]
test.set <- mushrooms2[6094:8124, -1]

train.labels <- mushrooms2[1:6093, 1]
test.labels <- mushrooms2[6094:8124, 1]

Step 3: Training the model

Naïve Bayes was used on the training subset.

library(e1071)
classifier <- naiveBayes(train.set, train.labels)

Step 4: Evaluating model performance

First 10 model predictions in the first column, then the label, and finally whether the prediction agrees with the label:

The classifier identified 99.5% of the edible mushrooms correctly, but only 89.6% of the poisonous ones. False Positives are more dangerous and should be minimized. Kappa value is .90, indicating very good agreement.

test.pred <- predict(classifier, test.set)

# Results
results <- cbind(test.pred, test.labels, test.pred == test.labels)
colnames(results)[3] <- "Agreement"
results[1:10, ]

##       test.pred test.labels Agreement
##  [1,]         1           1         1
##  [2,]         1           1         1
##  [3,]         2           2         1
##  [4,]         1           1         1
##  [5,]         1           1         1
##  [6,]         1           1         1
##  [7,]         1           1         1
##  [8,]         1           1         1
##  [9,]         1           1         1
## [10,]         1           1         1

library(gmodels)
CrossTable(test.pred, test.labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2031 
## 
##  
##              | actual 
##    predicted |         e |         p | Row Total | 
## -------------|-----------|-----------|-----------|
##            e |      1082 |       105 |      1187 | 
##              |     0.999 |     0.111 |           | 
## -------------|-----------|-----------|-----------|
##            p |         1 |       843 |       844 | 
##              |     0.001 |     0.889 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1083 |       948 |      2031 | 
##              |     0.533 |     0.467 |           | 
## -------------|-----------|-----------|-----------|
## 
##

Calculating Kappa, pg 323 \[ K = \frac {Pr(a) - Pr(e)} {1 - Pr(e)}\ \]

# pr_a = proportion of correct identifications across the whole table 
#      = TP + TN
pr_a <- .525 + .423

# pr_e = the sum of the probabilities that by chance the predicted and actual values agree
#      = TP*(proportion of data that is P) + TN*(proportion of data that is N)
pr_e <- .525*.528 + .423*.472

k <- (pr_a - pr_e) / (1 - pr_e)
k

## [1] 0.900601

Step 5: Improving the model

Although it’s unlikely given the moderate number of features compared to observations, some levels in certain factors may never appear in Edible or Poisonous mushrooms in this dataset. A Laplace estimator of 1 was used in the second iteration of this algorithm.

The performance was a little better, from 89.6% to 92.3% for poisonous mushrooms. The performance regarding edible mushrooms was virtually the same, from 99.5% to 99.4%. The kappa values were also equivalent.

classifier2 <- naiveBayes(train.set, train.labels, laplace = 1)
test.pred2 <- predict(classifier2, test.set)

# Results
results2 <- cbind(test.pred2, test.labels, test.pred2 == test.labels)
colnames(results)[3] <- "Agreement"
results2[1:10, ]

##       test.pred2 test.labels  
##  [1,]          1           1 1
##  [2,]          1           1 1
##  [3,]          2           2 1
##  [4,]          1           1 1
##  [5,]          1           1 1
##  [6,]          1           1 1
##  [7,]          1           1 1
##  [8,]          1           1 1
##  [9,]          1           1 1
## [10,]          1           1 1

library(gmodels)
CrossTable(test.pred2, test.labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2031 
## 
##  
##              | actual 
##    predicted |         e |         p | Row Total | 
## -------------|-----------|-----------|-----------|
##            e |      1082 |        76 |      1158 | 
##              |     0.999 |     0.080 |           | 
## -------------|-----------|-----------|-----------|
##            p |         1 |       872 |       873 | 
##              |     0.001 |     0.920 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1083 |       948 |      2031 | 
##              |     0.533 |     0.467 |           | 
## -------------|-----------|-----------|-----------|
## 
##

# Kappa
pr_a <- .525 + .423
pr_e <- .525*.528 + .423*.472

k <- (pr_a - pr_e) / (1 - pr_e)
k

## [1] 0.900601

STAT 6620 Supervised Learning Algorithms

Darren Keeley

kNN

Step 1: Data collection

Step 2: Exploring and preparing the data

Step 3: Training the model

Step 4: Evaluating model performance

Step 5: Improving model performance

Naive Bayes

Step 1: Data collection

Step 2: Exploring and preparing the data

Step 3: Training the model

Step 4: Evaluating model performance

Step 5: Improving the model