Classification, supervised. Numeric features. Ch2.
The data is Activity recognition with healthy older people using a batteryless wearable sensor. There are eight features,
Column 1: Time in seconds Column 2: Acceleration reading in G for frontal axis Column 3: Acceleration reading in G for vertical axis Column 4: Acceleration reading in G for lateral axis Column 5: Id of antenna reading sensor Column 6: Received signal strength indicator (RSSI) Column 7: Phase Column 8: Frequency
The class variable is,
Column 9: Label of activity, 1: sit on bed, 2: sit on chair, 3: lying, 4: ambulating
Only one person’s activities will be analyzed, a 60 year old female.
Download dataset from: https://archive.ics.uci.edu/ml/datasets/Activity+recognition+with+healthy+older+people+using+a+batteryless+wearable+sensor
Changed the response column from a numeric to a factor with named levels.
library(readr)
participant <- read_csv("data/d1p60F", col_names = FALSE)
colnames(participant) <- c("sec", "acc_front", "acc_vert", "acc_lat", "att_id", "rssi", "phase", "frequency", "activity")
participant$activity <- factor(participant$activity)
levels(participant$activity) <- c("sit on bed", "sit on chair", "lying", "ambulating")
# Preview the dataset. acc_lat removed purely for display purposes; not enough space to display all variables.
rbind(head(participant[ , -4]), tail(participant[ , -4]))
## # A tibble: 12 x 8
## sec acc_front acc_vert att_id rssi phase frequency activity
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>
## 1 0. 0.249 1.05 1 -45.5 4.43 920. ambulating
## 2 0.0250 0.342 1.18 1 -45.0 4.62 922. ambulating
## 3 0.250 0.0610 1.08 1 -44.5 5.84 922. ambulating
## 4 0.300 0.131 1.05 1 -49.5 6.22 923. ambulating
## 5 1.50 0.471 0.905 2 -56.0 1.25 921. ambulating
## 6 2.25 0.471 0.905 2 -57.5 5.89 924. ambulating
## 7 485. 0.331 0.951 3 -57.5 5.17 920. sit on bed
## 8 485. 0.331 0.951 1 -57.0 6.22 922. sit on bed
## 9 485. 0.331 0.951 3 -58.0 0.880 922. sit on bed
## 10 486. 0.331 0.951 1 -60.5 5.01 925. sit on bed
## 11 488. 0.120 0.824 4 -56.0 4.91 926. ambulating
## 12 489. 0.0727 1.32 4 -56.5 1.04 923. ambulating
Lying and sitting in bed made up the bulk of this person’s activities, at 66.5% and 26.2% respectively.
round(table(participant$activity) / length(participant$activity) , 3)
##
## sit on bed sit on chair lying ambulating
## 0.267 0.167 0.539 0.028
There are 4 antennas collecting data in each person’s room, identified by “att_id.” For simplicity’s sake, I assume all antennas are equally accurate and effective at monitoring the participant’s activities. Column 5, the “att_id” variable is dropped.
Additionally, the “sec” feature is a time stamp in seconds, which could function as an identifier and thus highly bias kNN. It is also dropped.
participant2 <- participant[ , c(-1, -5)]
The following is a pairwise scatter plot of the 7 features.
pairs(~acc_front + acc_vert + acc_lat + rssi + phase + frequency,
data = participant2,
main = 'Scaterplot of many variables')
Normalizing features so that they are all equally weighted, using Min-Max normalization.
normalize <- function(x){
((x-min(x)) / (max(x) - min(x)))
}
participant_n <- as.data.frame(
lapply(participant2[1:6], normalize)
)
participant2 <- cbind(participant2[7], participant_n)
rm(participant_n)
Randomizing order of the dataframe, then making training and test sets and label sets. 80% of the observations are used for training, with the remaining 20% left for testing.
participant2 <- participant2[sample(1:nrow(participant2)), ]
training_set_length <- round(nrow(participant2)*.8)
training_set <- participant2[1:training_set_length, -1]
testing_set <- participant2[(training_set_length+1):nrow(participant2) , -1]
training_labels <- participant2[1:training_set_length, 1]
testing_labels <- participant2[(training_set_length+1):nrow(participant2) , 1]
For the first iteration of kNN, k = 27, the root of the length of the training set.
library(class); library(gmodels); library(tictoc)
tic()
print("k = 27")
## [1] "k = 27"
participant_pred27 <- knn(train = training_set, test = testing_set, cl = training_labels, k=27)
toc()
## 0 sec elapsed
The model performance is poor, never once correctly identifying when the participant was ambulating. The accuracy of correctly identifying the other activities ranged from 70% to 100%. Since lying made up the bulk of the participant’s activities, the model seems to be biased towards that.
CrossTable(x = testing_labels, y = participant_pred27,
prop.chisq=F, prop.c = F, prop.t = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 145
##
##
## | participant_pred27
## testing_labels | sit on bed | sit on chair | lying | Row Total |
## ---------------|--------------|--------------|--------------|--------------|
## sit on bed | 23 | 9 | 4 | 36 |
## | 0.639 | 0.250 | 0.111 | 0.248 |
## ---------------|--------------|--------------|--------------|--------------|
## sit on chair | 11 | 9 | 0 | 20 |
## | 0.550 | 0.450 | 0.000 | 0.138 |
## ---------------|--------------|--------------|--------------|--------------|
## lying | 0 | 0 | 84 | 84 |
## | 0.000 | 0.000 | 1.000 | 0.579 |
## ---------------|--------------|--------------|--------------|--------------|
## ambulating | 4 | 1 | 0 | 5 |
## | 0.800 | 0.200 | 0.000 | 0.034 |
## ---------------|--------------|--------------|--------------|--------------|
## Column Total | 38 | 19 | 88 | 145 |
## ---------------|--------------|--------------|--------------|--------------|
##
##
Smaller k’s chosen to lessen the impact of the more common labels. The following are k = 3 and 1.
Comparing the Confusion Matrices, k=1 is superior to k=3. Although the accuracy of identifying sitting on bed drops from 80% to 77%, sitting in chair and lying retain their high accuracies, and the accuracy of ID’ing ambulating jumps from 33% to 56%.
tic()
print("k = 3")
## [1] "k = 3"
participant_pred3 <- knn(train = training_set, test = testing_set, cl = training_labels, k=3)
toc()
## 0.02 sec elapsed
CrossTable(x = testing_labels, y = participant_pred3,
prop.chisq=F, prop.c = F, prop.t = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 145
##
##
## | participant_pred3
## testing_labels | sit on bed | sit on chair | lying | ambulating | Row Total |
## ---------------|--------------|--------------|--------------|--------------|--------------|
## sit on bed | 22 | 10 | 3 | 1 | 36 |
## | 0.611 | 0.278 | 0.083 | 0.028 | 0.248 |
## ---------------|--------------|--------------|--------------|--------------|--------------|
## sit on chair | 6 | 14 | 0 | 0 | 20 |
## | 0.300 | 0.700 | 0.000 | 0.000 | 0.138 |
## ---------------|--------------|--------------|--------------|--------------|--------------|
## lying | 0 | 0 | 84 | 0 | 84 |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.579 |
## ---------------|--------------|--------------|--------------|--------------|--------------|
## ambulating | 4 | 0 | 0 | 1 | 5 |
## | 0.800 | 0.000 | 0.000 | 0.200 | 0.034 |
## ---------------|--------------|--------------|--------------|--------------|--------------|
## Column Total | 32 | 24 | 87 | 2 | 145 |
## ---------------|--------------|--------------|--------------|--------------|--------------|
##
##
tic()
print("k = 1")
## [1] "k = 1"
participant_pred1 <- knn(train = training_set, test = testing_set, cl = training_labels, k=1)
toc()
## 0 sec elapsed
CrossTable(x = testing_labels, y = participant_pred1,
prop.chisq=F, prop.c = F, prop.t = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 145
##
##
## | participant_pred1
## testing_labels | sit on bed | sit on chair | lying | ambulating | Row Total |
## ---------------|--------------|--------------|--------------|--------------|--------------|
## sit on bed | 24 | 8 | 3 | 1 | 36 |
## | 0.667 | 0.222 | 0.083 | 0.028 | 0.248 |
## ---------------|--------------|--------------|--------------|--------------|--------------|
## sit on chair | 6 | 14 | 0 | 0 | 20 |
## | 0.300 | 0.700 | 0.000 | 0.000 | 0.138 |
## ---------------|--------------|--------------|--------------|--------------|--------------|
## lying | 0 | 0 | 84 | 0 | 84 |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.579 |
## ---------------|--------------|--------------|--------------|--------------|--------------|
## ambulating | 1 | 1 | 0 | 3 | 5 |
## | 0.200 | 0.200 | 0.000 | 0.600 | 0.034 |
## ---------------|--------------|--------------|--------------|--------------|--------------|
## Column Total | 31 | 23 | 87 | 4 | 145 |
## ---------------|--------------|--------------|--------------|--------------|--------------|
##
##