Sometimes the simplest solution is the most efficient way to get started. In this case, that is K-Nearest Neighbours (KNN). Now that we have an algorithm to run, can we run it on all the features? Sure, why not? Yet the story does not end there: the 160 features would easily choke any laptop, which would hum away for hours without spitting out a result. Intuition (call it gut feeling) told me there is such a thing as feature selection (dimension reduction, or variable selection). That same voice told me the {x, y, z} readings of the accelerometers and magnetometers were the first batch to go. You could also go the other way and drop variables batch by batch, either by luck or by using an SVM to gauge their importance. (A quick sketch of one such pruning pass appears after the data-loading code below.)
So now, after that lucky run, the over-90% accuracy doesn't look too bad. Submit the result, and it does score well.
Does the story end here? Nope. Being scientists, we study and research. KNN is only one of many classification algorithms, so I would like to try a few others and see which performs better on this particular data set. A regression tree is next on the list, followed by random forest and naive Bayes.
library(caret)          # createDataPartition, nearZeroVar
library(class)          # knn
library(randomForest)   # randomForest

pml_train <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"))
pml_testing <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"))

# split the dataset into training and validation;
# the testing dataset is actually for prediction
set.seed(1234)   # arbitrary seed, for reproducibility
trainIndex <- createDataPartition(pml_train$classe, p = .7, list = FALSE)
train <- pml_train[trainIndex, ]
validate <- pml_train[-trainIndex, ]
new.data <- pml_testing
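To make the pruning idea above concrete, here is a rough sketch, assuming one simply drops the mostly-NA summary columns and the near-zero-variance columns before hand-picking features; the 90% NA cut-off is an arbitrary choice of mine, not something taken from the original analysis.
# rough pruning sketch: drop columns that are mostly NA, then near-zero-variance columns
mostly.na <- sapply(pml_train, function(col) mean(is.na(col)) > 0.9)
reduced <- pml_train[, !mostly.na]
nzv <- nearZeroVar(reduced)                     # caret's near-zero-variance detector
if (length(nzv) > 0) reduced <- reduced[, -nzv]
dim(reduced)                                    # far fewer columns than the original 160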
# hand-picked raw sensor features (as character strings, so they can index the data frames)
features <- c("roll_belt",
              "pitch_belt",
              "yaw_belt",
              "accel_belt_x",
              "accel_belt_y",
              "accel_belt_z",
              "magnet_belt_x",
              "magnet_belt_y",
              "magnet_belt_z",
              "roll_arm",
              "pitch_arm",
              "yaw_arm",
              "accel_arm_x",
              "accel_arm_y",
              "accel_arm_z",
              "magnet_arm_x",
              "magnet_arm_y",
              "magnet_arm_z",
              "roll_dumbbell",
              "pitch_dumbbell",
              "yaw_dumbbell",
              "accel_dumbbell_x",
              "accel_dumbbell_y",
              "accel_dumbbell_z",
              "magnet_dumbbell_x",
              "magnet_dumbbell_y",
              "magnet_dumbbell_z",
              "roll_forearm",
              "pitch_forearm",
              "yaw_forearm",
              "accel_forearm_x",
              "accel_forearm_y",
              "accel_forearm_z",
              "magnet_forearm_x",
              "magnet_forearm_y",
              "magnet_forearm_z")
# class::knn returns the predicted labels for the validation rows directly
pred.knn <- knn(train[, features], validate[, features], cl = train$classe)
In-sample error = .329, out-of-sample error = .341.
table(pred.knn, validate$classe)
                classe
pred        A     B     C     D     E
   A     1386    16     0     0     0
   B        6   874    13     0     4
   C        1    16   818    20     8
   D        4     2    14   775    12
   E        0     2     6     2   926
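As a side note (my own convenience snippet, not part of the original run), the implied out-of-sample accuracy and error can be read straight off a confusion table like the one above:
cm <- table(pred.knn, validate$classe)    # rows = predictions, columns = truth
accuracy <- sum(diag(cm)) / sum(cm)       # correct predictions sit on the diagonal
out.of.sample.error <- 1 - accuracy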
I also tried a random forest; judging from the cross-validation, the results look far more encouraging than KNN.
# build the formula classe ~ roll_belt + pitch_belt + ... from the feature vector
rf.formula <- as.formula(paste("classe ~", paste(features, collapse = " + ")))
# fit.tree <- rpart(rf.formula, data = train)
fit.forest <- randomForest(rf.formula, data = train, importance = TRUE)
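Since importance = TRUE is set on the fit, randomForest can also report which predictors carry the most weight; this little peek is my addition rather than part of the original write-up.
importance(fit.forest)               # mean decrease in accuracy / Gini per predictor
varImpPlot(fit.forest, n.var = 15)   # plot the top 15 predictors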
For the training set:
OOB estimate of error rate: 0.61%
Confusion matrix:
A B C D E class.error
A 3904 2 0 0 0 0.0005120328
B 18 2633 7 0 0 0.0094055681
C 0 16 2377 3 0 0.0079298831
D 0 0 25 2225 2 0.0119893428
E 0 1 4 6 2514 0.0043564356
Here is the confusion matrix on the validation set:
A B C D E
A 1668 12 0 0 0
B 3 1123 4 0 0
C 3 4 1021 9 1
D 0 0 1 955 2
E 0 0 0 0 1079
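To close the loop, here is a sketch of how a validation table like the one above can be generated and how the fitted forest would be applied to the 20 test cases in new.data; the object names follow the earlier chunks.
pred.rf <- predict(fit.forest, newdata = validate)   # hold-out predictions
table(pred.rf, validate$classe)                      # confusion matrix as shown above
predict(fit.forest, newdata = new.data)              # the 20 cases for submission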