kNN

Classification, supervised. Numeric features. Ch2.

The data is Activity recognition with healthy older people using a batteryless wearable sensor. There are eight features,

Column 1: Time in seconds Column 2: Acceleration reading in G for frontal axis Column 3: Acceleration reading in G for vertical axis Column 4: Acceleration reading in G for lateral axis Column 5: Id of antenna reading sensor Column 6: Received signal strength indicator (RSSI) Column 7: Phase Column 8: Frequency

The class variable is,

Column 9: Label of activity, 1: sit on bed, 2: sit on chair, 3: lying, 4: ambulating

Only one person’s activities will be analyzed, a 60 year old female.

Step 1: Data collection

Download dataset from: https://archive.ics.uci.edu/ml/datasets/Activity+recognition+with+healthy+older+people+using+a+batteryless+wearable+sensor

Changed the response column from a numeric to a factor with named levels.

library(readr)
participant <- read_csv("data/d1p60F", col_names = FALSE)

colnames(participant) <- c("sec", "acc_front", "acc_vert", "acc_lat", "att_id", "rssi", "phase", "frequency", "activity")

participant$activity <- factor(participant$activity)
levels(participant$activity) <- c("sit on bed", "sit on chair", "lying", "ambulating")

# Preview the dataset. acc_lat removed purely for display purposes; not enough space to display all variables.
rbind(head(participant[ , -4]), tail(participant[ , -4]))

## # A tibble: 12 x 8
##         sec acc_front acc_vert att_id  rssi phase frequency activity  
##       <dbl>     <dbl>    <dbl>  <int> <dbl> <dbl>     <dbl> <fct>     
##  1   0.        0.249     1.05       1 -45.5 4.43       920. ambulating
##  2   0.0250    0.342     1.18       1 -45.0 4.62       922. ambulating
##  3   0.250     0.0610    1.08       1 -44.5 5.84       922. ambulating
##  4   0.300     0.131     1.05       1 -49.5 6.22       923. ambulating
##  5   1.50      0.471     0.905      2 -56.0 1.25       921. ambulating
##  6   2.25      0.471     0.905      2 -57.5 5.89       924. ambulating
##  7 485.        0.331     0.951      3 -57.5 5.17       920. sit on bed
##  8 485.        0.331     0.951      1 -57.0 6.22       922. sit on bed
##  9 485.        0.331     0.951      3 -58.0 0.880      922. sit on bed
## 10 486.        0.331     0.951      1 -60.5 5.01       925. sit on bed
## 11 488.        0.120     0.824      4 -56.0 4.91       926. ambulating
## 12 489.        0.0727    1.32       4 -56.5 1.04       923. ambulating

Step 2: Exploring and preparing the data

Lying and sitting in bed made up the bulk of this person’s activities, at 66.5% and 26.2% respectively.

round(table(participant$activity) / length(participant$activity) , 3)

## 
##   sit on bed sit on chair        lying   ambulating 
##        0.267        0.167        0.539        0.028

There are 4 antennas collecting data in each person’s room, identified by “att_id.” For simplicity’s sake, I assume all antennas are equally accurate and effective at monitoring the participant’s activities. Column 5, the “att_id” variable is dropped.

Additionally, the “sec” feature is a time stamp in seconds, which could function as an identifier and thus highly bias kNN. It is also dropped.

participant2 <- participant[ , c(-1, -5)]

The following is a pairwise scatter plot of the 7 features.

pairs(~acc_front + acc_vert + acc_lat + rssi + phase + frequency, 
      data = participant2,
      main = 'Scaterplot of many variables')

Normalizing features so that they are all equally weighted, using Min-Max normalization.

normalize <- function(x){
  ((x-min(x)) / (max(x) - min(x)))
}

participant_n <- as.data.frame(
  lapply(participant2[1:6], normalize)
)

participant2 <- cbind(participant2[7], participant_n)

rm(participant_n)

Randomizing order of the dataframe, then making training and test sets and label sets. 80% of the observations are used for training, with the remaining 20% left for testing.

participant2 <- participant2[sample(1:nrow(participant2)), ]

training_set_length <- round(nrow(participant2)*.8)

training_set <- participant2[1:training_set_length, -1]
testing_set <- participant2[(training_set_length+1):nrow(participant2) , -1]

training_labels <- participant2[1:training_set_length, 1]
testing_labels <- participant2[(training_set_length+1):nrow(participant2) , 1]

Step 3: Training the model

For the first iteration of kNN, k = 27, the root of the length of the training set.

library(class); library(gmodels); library(tictoc)

tic()

print("k = 27")

## [1] "k = 27"

participant_pred27 <- knn(train = training_set, test = testing_set, cl = training_labels, k=27)

toc()

## 0 sec elapsed

Step 4: Evaluating model performance

The model performance is poor, never once correctly identifying when the participant was ambulating. The accuracy of correctly identifying the other activities ranged from 70% to 100%. Since lying made up the bulk of the participant’s activities, the model seems to be biased towards that.

CrossTable(x = testing_labels, y = participant_pred27, 
           prop.chisq=F, prop.c = F, prop.t = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  145 
## 
##  
##                | participant_pred27 
## testing_labels |   sit on bed | sit on chair |        lying |    Row Total | 
## ---------------|--------------|--------------|--------------|--------------|
##     sit on bed |           23 |            9 |            4 |           36 | 
##                |        0.639 |        0.250 |        0.111 |        0.248 | 
## ---------------|--------------|--------------|--------------|--------------|
##   sit on chair |           11 |            9 |            0 |           20 | 
##                |        0.550 |        0.450 |        0.000 |        0.138 | 
## ---------------|--------------|--------------|--------------|--------------|
##          lying |            0 |            0 |           84 |           84 | 
##                |        0.000 |        0.000 |        1.000 |        0.579 | 
## ---------------|--------------|--------------|--------------|--------------|
##     ambulating |            4 |            1 |            0 |            5 | 
##                |        0.800 |        0.200 |        0.000 |        0.034 | 
## ---------------|--------------|--------------|--------------|--------------|
##   Column Total |           38 |           19 |           88 |          145 | 
## ---------------|--------------|--------------|--------------|--------------|
## 
##

Step 5:

Smaller k’s chosen to lessen the impact of the more common labels. The following are k = 3 and 1.

Comparing the Confusion Matrices, k=1 is superior to k=3. Although the accuracy of identifying sitting on bed drops from 80% to 77%, sitting in chair and lying retain their high accuracies, and the accuracy of ID’ing ambulating jumps from 33% to 56%.

tic()

print("k = 3")

## [1] "k = 3"

participant_pred3 <- knn(train = training_set, test = testing_set, cl = training_labels, k=3)

toc()

## 0.02 sec elapsed

CrossTable(x = testing_labels, y = participant_pred3, 
           prop.chisq=F, prop.c = F, prop.t = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  145 
## 
##  
##                | participant_pred3 
## testing_labels |   sit on bed | sit on chair |        lying |   ambulating |    Row Total | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##     sit on bed |           22 |           10 |            3 |            1 |           36 | 
##                |        0.611 |        0.278 |        0.083 |        0.028 |        0.248 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##   sit on chair |            6 |           14 |            0 |            0 |           20 | 
##                |        0.300 |        0.700 |        0.000 |        0.000 |        0.138 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##          lying |            0 |            0 |           84 |            0 |           84 | 
##                |        0.000 |        0.000 |        1.000 |        0.000 |        0.579 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##     ambulating |            4 |            0 |            0 |            1 |            5 | 
##                |        0.800 |        0.000 |        0.000 |        0.200 |        0.034 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##   Column Total |           32 |           24 |           87 |            2 |          145 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
## 
##

tic()

print("k = 1")

## [1] "k = 1"

participant_pred1 <- knn(train = training_set, test = testing_set, cl = training_labels, k=1)

toc()

## 0 sec elapsed

CrossTable(x = testing_labels, y = participant_pred1, 
           prop.chisq=F, prop.c = F, prop.t = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  145 
## 
##  
##                | participant_pred1 
## testing_labels |   sit on bed | sit on chair |        lying |   ambulating |    Row Total | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##     sit on bed |           24 |            8 |            3 |            1 |           36 | 
##                |        0.667 |        0.222 |        0.083 |        0.028 |        0.248 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##   sit on chair |            6 |           14 |            0 |            0 |           20 | 
##                |        0.300 |        0.700 |        0.000 |        0.000 |        0.138 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##          lying |            0 |            0 |           84 |            0 |           84 | 
##                |        0.000 |        0.000 |        1.000 |        0.000 |        0.579 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##     ambulating |            1 |            1 |            0 |            3 |            5 | 
##                |        0.200 |        0.200 |        0.000 |        0.600 |        0.034 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
##   Column Total |           31 |           23 |           87 |            4 |          145 | 
## ---------------|--------------|--------------|--------------|--------------|--------------|
## 
##

STAT 6620 Machine Learning Algorithms

Darren Keeley

kNN

Step 1: Data collection

Step 2: Exploring and preparing the data

Step 3: Training the model

Step 4: Evaluating model performance

Step 5: