Introduction

Sometimes the simplest solution is the most efficient way to get started. In this case, that is K-Nearest Neighbours (KNN). Now that we have an algorithm to run, can we run it on all the features? Sure, why not? Yet the story does not end there: the 160 columns would easily choke any laptop, which hums away for hours without spitting out a result. Intuition (call it gut feeling) said there is a thing called feature selection (also known as dimension reduction or variable selection), and that same voice said the {x, y, z} readings of the accelerometers and magnetometers were the first batch to go. You could also go the other way and drop variables batch by batch, either by luck or by using a model such as an SVM to rank their importance (a small screening sketch appears at the start of the feature-selection section below).

So, after that lucky run, accuracy of over 90% doesn't look too bad. Submit the result, and it does score well.

Does the story end here? Nope. As scientists, we study and research. KNN is just one of many classification algorithms, and I would like to try others to see which performs better on this particular data set. A regression tree is next on the list, followed by random forest and naive Bayes.

library(caret)   # for createDataPartition() and knn3()

pml_train   <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"))
pml_testing <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"))
pml_train$classe <- factor(pml_train$classe)   # make sure the outcome is a factor

# split the training data into training and validation sets;
# the testing dataset is actually for the final prediction
trainIndex <- createDataPartition(pml_train$classe, p = .7, list = FALSE)
train      <- pml_train[trainIndex, ]
validate   <- pml_train[-trainIndex, ]
new.data   <- pml_testing

Feature selection and Model building
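
Before settling on the hand-picked list below, the batch-dropping idea from the introduction can be made concrete. This is only a sketch of one possible screening pass, not the filter actually used here; the 90%-NA cutoff and the use of caret's nearZeroVar() are my own assumptions.

    # drop columns that are mostly NA, then drop near-zero-variance columns (caret is loaded above)
    mostly.na  <- colMeans(is.na(pml_train)) > 0.9
    candidates <- pml_train[, !mostly.na]
    nzv        <- nearZeroVar(candidates)
    if (length(nzv) > 0) candidates <- candidates[, -nzv]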

features <- c("roll_belt",      # quoted so the data frames can be indexed by column name
              "pitch_belt",
              "yaw_belt",
              "accel_belt_x",
              "accel_belt_y",
              "accel_belt_z",
              "magnet_belt_x",
              "magnet_belt_y",
              "magnet_belt_z",
              "roll_arm",
              "pitch_arm",
              "yaw_arm",
              "accel_arm_x",
              "accel_arm_y",
              "accel_arm_z",
              "magnet_arm_x",
              "magnet_arm_y",
              "magnet_arm_z",
              "roll_dumbbell",
              "pitch_dumbbell",
              "yaw_dumbbell",
              "accel_dumbbell_x",
              "accel_dumbbell_y",
              "accel_dumbbell_z",
              "magnet_dumbbell_x",
              "magnet_dumbbell_y",
              "magnet_dumbbell_z",
              "roll_forearm",
              "pitch_forearm",
              "yaw_forearm",
              "accel_forearm_x",
              "accel_forearm_y",
              "accel_forearm_z",
              "magnet_forearm_x",
              "magnet_forearm_y",
              "magnet_forearm_z")
# knn3() from caret returns a model object so that predict() works below (class::knn() returns labels directly)
fit <- knn3(classe ~ ., data = train[, c(features, "classe")])

Validation

In-sample error = 0.329; out-of-sample error = 0.341.

    table(predict(fit, validate, type = "class"), validate$classe)

    pred \ classe     A     B     C     D     E
              A    1386    16     0     0     0
              B       6   874    13     0     4
              C       1    16   818    20     8
              D       4     2    14   775    12
              E       0     2     6     2   926
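
For reference, error rates like the ones quoted above are just one minus accuracy; a minimal sketch of computing them from the fitted KNN model (the variable names here are mine):

    # error rate = 1 - accuracy (the share of predictions matching the true classe)
    in.sample.error  <- 1 - mean(predict(fit, train,    type = "class") == train$classe)
    out.sample.error <- 1 - mean(predict(fit, validate, type = "class") == validate$classe)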

Considerations

I also tried a random forest; from the cross-validation, the results look far more encouraging than KNN's.

    library(randomForest)
    rf.formula <- as.formula(paste("classe ~", paste(features, collapse = " + ")))
    #fit.tree   <- rpart(rf.formula, data = train, method = "class")
    fit.forest <- randomForest(rf.formula, data = train, importance = TRUE)

For the training set, printing fit.forest reports:

   OOB estimate of  error rate: 0.61%
Confusion matrix:
     A    B    C    D    E  class.error
A 3904    2    0    0    0 0.0005120328
B   18 2633    7    0    0 0.0094055681
C    0   16 2377    3    0 0.0079298831
D    0    0   25 2225    2 0.0119893428
E    0    1    4    6 2514 0.0043564356
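
Since the model was grown with importance = TRUE, it is also worth glancing at which variables drive the forest. This step is my addition rather than part of the original analysis; varImpPlot() ships with the randomForest package.

    varImpPlot(fit.forest)   # plots mean decrease in accuracy and in Gini per variable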

Here is the confusion matrix on the validation set.
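
The exact call is not shown in the original write-up, but a table like the one below is what tabulating the forest's hold-out predictions against the true labels gives (cm is just a name I picked):

    cm <- table(predict(fit.forest, validate), validate$classe)
    cm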

       A    B    C    D    E
  A 1668   12    0    0    0
  B    3 1123    4    0    0
  C    3    4 1021    9    1
  D    0    0    1  955    2
  E    0    0    0    0 1079
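
To attach a single number to "more encouraging": the overall hold-out accuracy is the diagonal share of that table (using the cm object from the sketch above). For the matrix shown it works out to roughly 99.3%, i.e. an out-of-sample error of about 0.7%, in line with the 0.61% OOB estimate.

    sum(diag(cm)) / sum(cm)   # overall validation accuracy; 1 minus this is the out-of-sample error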

🖖