The purpose of this project is to predict the activities of 6 people using data recorded by accelerometers on the belt, forearm, arm, and dumbbell.
The plan is to divide the data into training, validation, and test sets. The validation set is treated as test data during model selection: competing models are trained on the training set and the best one is picked on the validation set. Performance is then assessed on the test set. Optionally, the data could be re-split and the procedure repeated to get a better estimate of the average out-of-sample error rate.
Training and test data were provided by http://groupware.les.inf.puc-rio.br/har.
Training data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Test data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data were examined for variable types and size. Based on this examination, irrelevant variables were removed to keep the dataset to a manageable size for modelling (the variable importance output lists 52 predictors).
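The column screening described above might look roughly like the sketch below. It runs on a small made-up data frame rather than the real file, and the screening rules (dropping identifier columns by name, dropping columns that are mostly missing or blank, the 0.5 threshold) are assumptions for illustration only; the report does not state which criteria were actually applied.

```r
# Toy data frame standing in for pml-training.csv (all values are made up).
raw <- data.frame(
  X             = 1:6,                               # row index, not a sensor reading
  user_name     = c("a", "a", "b", "b", "c", "c"),   # identifier column
  roll_belt     = c(1.4, 1.5, 120, 122, 0.3, 0.2),
  pitch_forearm = c(-60, -61, 20, 22, 5, 4),
  max_roll_belt = c(NA, NA, NA, NA, NA, 1.6),        # mostly missing
  kurtosis_yaw  = c("", "", "", "", "", "#DIV/0!"),  # blank or invalid
  classe        = factor(c("A", "A", "B", "B", "C", "C")),
  stringsAsFactors = FALSE
)

# Assumed rule: treat NA, empty strings and "#DIV/0!" as missing.
is_missing <- function(col) is.na(col) | col %in% c("", "#DIV/0!")

# Share of missing values per column.
missing_share <- vapply(raw, function(col) mean(is_missing(col)), numeric(1))

# Assumed rule: keep columns with at most 50% missing values,
# and drop identifier-style columns by name.
id_cols <- c("X", "user_name")
keep    <- missing_share <= 0.5 & !(names(raw) %in% id_cols)

clean <- raw[, keep, drop = FALSE]
names(clean)
```

On the toy data this leaves two sensor columns plus the outcome `classe`.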
For reproducibility, set.seed() was called first. The provided training data were then split into two sets: 70% for training and 30% for validation.
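A minimal base-R sketch of a seeded 70/30 split follows. The seed value, the toy data frame, and the choice to stratify by class are assumptions; the report gives neither the seed nor the splitting function. caret's createDataPartition() offers a comparable stratified split, but whether it was used here is not stated.

```r
set.seed(123)  # hypothetical seed; the original value is not given

# Toy stand-in for the cleaned training data: 100 rows, 5 classes.
training_raw <- data.frame(
  roll_belt = rnorm(100),
  classe    = factor(rep(c("A", "B", "C", "D", "E"), each = 20))
)

# Sample 70% of the row indices within each class so that
# class proportions are preserved in both subsets.
rows_by_class <- split(seq_len(nrow(training_raw)), training_raw$classe)
train_idx <- unlist(
  lapply(rows_by_class, function(rows) sample(rows, size = round(0.7 * length(rows)))),
  use.names = FALSE
)

train_set      <- training_raw[train_idx, ]
validation_set <- training_raw[-train_idx, ]

table(train_set$classe)       # 14 rows per class
table(validation_set$classe)  # 6 rows per class
```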
A random forest is used to model the training data. The technique aggregates the predictions of many decision trees so that the combined prediction is better than that of the best individual tree. Syntax: randomForest(formula, data, ntree = n, mtry = m, maxnodes = NULL)
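To make the syntax concrete, here is a small sketch that fits a random forest with the randomForest package. It uses the built-in iris data purely as a stand-in (it is not the project's data), and the seed and mtry value are arbitrary choices for this example.

```r
library(randomForest)

set.seed(42)  # arbitrary seed for this sketch

# iris is only a stand-in dataset with a factor outcome (Species).
rf_fit <- randomForest(
  Species ~ .,
  data       = iris,
  ntree      = 20,   # number of trees to grow
  mtry       = 2,    # predictors tried at each split (iris has 4 predictors)
  importance = TRUE  # keep variable importance measures
)

print(rf_fit)  # shows the OOB error estimate and the confusion matrix

# OOB error after the last tree has been added.
oob_error <- rf_fit$err.rate[rf_fit$ntree, "OOB"]

# Predicted classes for a few rows: each tree votes, the majority class wins.
pred <- predict(rf_fit, newdata = iris[c(1, 51, 101), ])
```

The printed summary has the same layout as the randomForest output shown in this report.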
The caret library is used to fit and evaluate the model with the train() function. Syntax: train(formula, data = df, method = "rf", metric = "Accuracy", trControl = trainControl(), tuneGrid = NULL)
K-fold cross-validation is configured through the trainControl() function. The training data are randomly split into `number` folds of nearly equal size; each fold is held out in turn while the model is fitted on the remaining folds and then evaluated on the held-out fold. Syntax: trainControl(method = "cv", number = n, search = "grid").
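The train() and trainControl() calls can be combined as in the sketch below. It again uses iris as a stand-in dataset; the number of folds, the mtry grid, and the seed are assumptions chosen for illustration, not the settings used in this report.

```r
library(caret)  # method = "rf" also requires the randomForest package

set.seed(42)  # arbitrary seed for this sketch

# 5-fold cross-validation with a grid search over the tuning parameter.
ctrl <- trainControl(method = "cv", number = 5, search = "grid")

cv_fit <- train(
  Species ~ .,
  data      = iris,
  method    = "rf",
  metric    = "Accuracy",
  trControl = ctrl,
  tuneGrid  = expand.grid(mtry = 1:4),  # candidate mtry values (assumed)
  ntree     = 20                        # passed through to randomForest()
)

cv_fit$results   # cross-validated accuracy for each candidate mtry
cv_fit$bestTune  # mtry with the highest cross-validated accuracy
```

The final model refitted on all training rows is available as `cv_fit$finalModel`.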
The selected model reaches an accuracy of 99.3% with a low out-of-bag (OOB) error estimate of 1.49% at mtry = 27. As such, further tuning of the model was not considered necessary. According to varImp(), roll_belt is the most important variable across the 5 levels of classe.
##
## Call:
## randomForest(x = x, y = y, ntree = 20, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 20
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 1.49%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3887   11    4    1    2 0.004609475
## B   36 2591   18    9    4 0.025206922
## C    1   32 2345   15    3 0.021285476
## D    1    4   36 2208    2 0.019102621
## E    2   11    5    8 2499 0.010297030
## Accuracy
## 0.9926933
## rf variable importance
##
## variables are sorted by maximum importance across the classes
## only 20 most important variables shown (out of 52)
##
##                          A     B     C     D      E
## roll_belt            64.62 91.09 67.63 88.71 100.00
## pitch_belt           20.91 98.29 59.03 35.51  30.04
## yaw_belt             85.97 62.69 65.50 61.73  39.65
## pitch_forearm        76.86 84.13 75.31 41.42  55.50
## magnet_dumbbell_z    75.50 44.05 60.45 38.06  45.68
## magnet_dumbbell_y    53.02 46.02 50.39 52.75  43.71
## pitch_arm            16.67 46.23 19.82 18.93  29.60
## accel_dumbbell_y     41.95 38.29 29.46 29.55  44.15
## accel_forearm_x      23.26 42.87 26.80 41.29  31.64
## yaw_arm              42.11 42.23 32.55 31.25  17.92
## gyros_arm_y          30.00 29.96 20.82 40.96  21.84
## magnet_belt_y        18.98 36.60 33.45 26.70  33.98
## magnet_belt_z        19.33 35.38 20.11 18.74  29.32
## magnet_forearm_z     35.19 19.82 24.38 31.85  17.89
## magnet_belt_x        16.72 34.82 26.61 12.51  19.91
## gyros_dumbbell_y     27.73 17.49 34.77 34.06  13.34
## roll_forearm         33.78 27.26 34.72 22.73  24.03
## total_accel_dumbbell 14.17 23.97 32.60 22.50  34.68
## magnet_forearm_y     33.58 34.15 17.13 23.34  23.74
## gyros_dumbbell_x     12.54 33.91 17.98 11.06  14.30
## [1] B A B A A E D B A A B C B A E E A B A B
## Levels: A B C D E
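
The class.error column and the OOB error estimate in the randomForest output above can be recomputed directly from the confusion matrix counts. The sketch below copies those counts by hand and assumes, as in randomForest's printed confusion matrix, that rows are actual classes and columns are predicted classes.

```r
# Confusion matrix counts copied from the randomForest output above
# (rows = actual class, columns = predicted class).
conf <- matrix(
  c(3887,   11,    4,    1,    2,
      36, 2591,   18,    9,    4,
       1,   32, 2345,   15,    3,
       1,    4,   36, 2208,    2,
       2,   11,    5,    8, 2499),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: share of all off-diagonal counts.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

round(class_error, 4)
round(100 * oob_error, 2)  # 1.49, matching the reported OOB estimate
```

Out of 13735 counted observations, 205 are off the diagonal, which gives the 1.49% reported in the output.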