Background

The purpose of this project is to predict the activities of 6 people using data from accelerometers on the belt, forearm, arm, and dumbbell.

The plan is to divide the data into training, validation, and test sets. Competing models are trained on the training set and the best one is picked on the validation set, which stands in for test data during model selection. To assess performance, the chosen model is then applied to the test set. The data could also be re-split and the procedure repeated to get a better estimate of the average out-of-sample error rate.

Training and test data were provided by http://groupware.les.inf.puc-rio.br/har.

Data Sources

Training data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

Test data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Exploratory Analysis

The data were examined for variable types and overall size. Based on this analysis, irrelevant variables were removed to keep the dataset manageable for computation; 52 predictors remain, as reported in the variable importance output.
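The report does not state which variables were judged irrelevant. As a rough sketch only, the following base R code assumes two common rules: drop bookkeeping columns, and drop columns that are mostly missing. The toy data frame, its column names, and the 50% threshold are all assumptions made for illustration.

```r
# Toy stand-in for the raw data; column names and values are invented.
raw <- data.frame(
  X         = 1:6,                             # row index, not a sensor reading
  user_name = c("a", "b", "a", "b", "a", "b"), # participant identifier
  roll_belt = c(1.4, 1.5, 1.3, 120, 122, 119),
  pitch_arm = c(22, 21, 23, -5, -6, -4),
  max_roll  = c(NA, NA, NA, NA, NA, 3.2),      # mostly missing summary column
  classe    = factor(c("A", "A", "A", "B", "B", "B"))
)

str(raw)  # variable types
dim(raw)  # number of rows and columns

# Assumed rule 1: bookkeeping columns identify rows, not movements.
id_cols <- c("X", "user_name")

# Assumed rule 2: columns with more than 50% missing values carry little signal.
na_share <- colMeans(is.na(raw))

keep  <- !(names(raw) %in% id_cols) & na_share <= 0.5
clean <- raw[, keep, drop = FALSE]
names(clean)
```

On the real data, the same pattern would be applied to the data frame read from pml-training.csv, with the rules adjusted to whatever the exploratory analysis suggests.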

Preparing the DataSet

For reproducibility, set.seed() was called before any random step. The provided training data was then split into two sets: 70% for training and 30% for validation.
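A minimal sketch of such a seeded 70/30 split, using base R on a toy data frame. The seed value, the toy data, and the use of sample() are assumptions for illustration; the report does not show which splitting function was used (caret's createDataPartition() is a stratified alternative).

```r
set.seed(123)  # arbitrary seed, chosen for this sketch only

# Toy stand-in for the provided training data (100 rows, 5 classes).
dat <- data.frame(
  x      = rnorm(100),
  classe = factor(rep(c("A", "B", "C", "D", "E"), each = 20))
)

# Draw 70% of the row indices at random; the remaining rows form the validation set.
n          <- nrow(dat)
train_idx  <- sample(n, size = round(0.7 * n))
training   <- dat[train_idx, ]
validation <- dat[-train_idx, ]

nrow(training)
nrow(validation)
```

Because the seed is fixed, rerunning the block yields the same split, which is what makes the later accuracy figures repeatable.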

Modeling and Evaluation of Validation Data

A random forest is used to model the training data. The technique aggregates the predictions of many decision trees, so that the combined prediction is typically better than that of the best individual tree. Syntax: randomForest(formula, ntree = n, mtry = m, maxnodes = NULL), where ntree is the number of trees and mtry is the number of variables tried at each split.
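The aggregation idea can be illustrated without fitting any trees. In the sketch below, the per-tree predictions are made up, and the helper majority_vote() is a hypothetical name introduced here; for classification, a random forest combines its trees by majority vote in a similar way.

```r
# Hypothetical predictions from 5 trees for 4 observations
# (one row per observation, one column per tree).
tree_votes <- matrix(
  c("A", "A", "B", "A", "A",
    "B", "B", "B", "C", "B",
    "C", "A", "C", "C", "B",
    "E", "E", "D", "E", "E"),
  nrow = 4, byrow = TRUE
)

# Majority vote: return the most frequent class label among the trees.
majority_vote <- function(votes) {
  names(which.max(table(votes)))
}

forest_pred <- apply(tree_votes, 1, majority_vote)
forest_pred
```

Individual trees disagree on every observation here, yet the vote settles on a single label per row, which is the sense in which the ensemble can outperform any one tree.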

The caret library is used to fit and evaluate the model with the train() function. Syntax: train(formula, df, method = "rf", metric = "Accuracy", trControl = trainControl(), tuneGrid = NULL)
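As a hedged sketch of how these pieces might fit together, the block below calls train() on the built-in iris data, which stands in for the cleaned training set (Species plays the role of classe). The seed, the 5 folds, and ntree = 20 are assumptions for this example; ntree = 20 mirrors the model summary in this report, but the original call is not shown. The block only runs when the caret and randomForest packages are installed.

```r
# Sketch only: skipped when caret or randomForest is not installed.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  library(caret)
  set.seed(123)  # arbitrary seed for this sketch

  # 5-fold cross-validation with a grid search over mtry.
  ctrl <- trainControl(method = "cv", number = 5, search = "grid")

  # iris stands in for the cleaned training data.
  fit <- train(
    Species ~ ., data = iris,
    method    = "rf",
    metric    = "Accuracy",
    trControl = ctrl,
    ntree     = 20   # passed through to randomForest()
  )

  print(fit$bestTune)  # mtry value selected by cross-validation
  print(varImp(fit))   # variable importance for the fitted model
}
```

With tuneGrid left at its default, caret picks a small set of candidate mtry values on its own and keeps the one with the best cross-validated accuracy.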

K-fold cross-validation is controlled by the trainControl() function. It randomly splits the training data into a given number of folds of nearly equal size; the model is fit on all folds but one and evaluated on the held-out fold, rotating through each fold in turn. Syntax: trainControl(method = "cv", number = n, search = "grid").
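The fold mechanics can be sketched in base R. This is an illustration of the general k-fold idea rather than caret's internal code; the row count, the number of folds, and the seed are hypothetical.

```r
set.seed(123)  # arbitrary seed for this sketch

n <- 103  # hypothetical number of training rows
k <- 5    # number of folds, the role played by `number` in trainControl()

# Assign each row to one of k folds; fold sizes differ by at most one row.
fold_id <- sample(rep(seq_len(k), length.out = n))
table(fold_id)

# Rotate through the folds: hold one out, fit on the rest.
for (i in seq_len(k)) {
  held_out <- which(fold_id == i)
  fit_rows <- which(fold_id != i)
  # A model would be fit on fit_rows and scored on held_out here.
  cat("fold", i, ": fit on", length(fit_rows),
      "rows, evaluate on", length(held_out), "rows\n")
}
```

Averaging the k held-out scores gives the cross-validated accuracy that is used to compare candidate mtry values.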

The selected model uses mtry = 27 and reaches an accuracy of 99.3%, with a low out-of-bag (OOB) error rate of about 1.5%. As such, further tuning was not considered necessary. Using the varImp() function, roll_belt was the most important variable across the 5 classes.

## 
## Call:
##  randomForest(x = x, y = y, ntree = 20, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 20
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 1.49%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3887   11    4    1    2 0.004609475
## B   36 2591   18    9    4 0.025206922
## C    1   32 2345   15    3 0.021285476
## D    1    4   36 2208    2 0.019102621
## E    2   11    5    8 2499 0.010297030
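The error figures in this summary can be recomputed from the confusion matrix itself. The sketch below copies the counts from the output above into a matrix; the variable names cm, class_error, and oob_error are introduced here for illustration.

```r
# Confusion matrix counts copied from the model summary
# (rows and columns are both ordered A to E).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(3887,   11,    4,    1,    2,
      36, 2591,   18,    9,    4,
       1,   32, 2345,   15,    3,
       1,    4,   36, 2208,    2,
       2,   11,    5,    8, 2499),
  nrow = 5, byrow = TRUE,
  dimnames = list(classes, classes)
)

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(cm) / rowSums(cm)
round(class_error, 4)

# Overall OOB error: all off-diagonal counts over all counts.
oob_error <- 1 - sum(diag(cm)) / sum(cm)
round(100 * oob_error, 2)  # 1.49, matching the reported OOB estimate
```

Of the 13,735 training rows, 205 fall off the diagonal, which gives the 1.49% OOB estimate; class B has the highest per-class error at about 2.5%.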

##  Accuracy 
## 0.9926933

## rf variable importance
## 
##   variables are sorted by maximum importance across the classes
##   only 20 most important variables shown (out of 52)
## 
##                          A     B     C     D      E
## roll_belt            64.62 91.09 67.63 88.71 100.00
## pitch_belt           20.91 98.29 59.03 35.51  30.04
## yaw_belt             85.97 62.69 65.50 61.73  39.65
## pitch_forearm        76.86 84.13 75.31 41.42  55.50
## magnet_dumbbell_z    75.50 44.05 60.45 38.06  45.68
## magnet_dumbbell_y    53.02 46.02 50.39 52.75  43.71
## pitch_arm            16.67 46.23 19.82 18.93  29.60
## accel_dumbbell_y     41.95 38.29 29.46 29.55  44.15
## accel_forearm_x      23.26 42.87 26.80 41.29  31.64
## yaw_arm              42.11 42.23 32.55 31.25  17.92
## gyros_arm_y          30.00 29.96 20.82 40.96  21.84
## magnet_belt_y        18.98 36.60 33.45 26.70  33.98
## magnet_belt_z        19.33 35.38 20.11 18.74  29.32
## magnet_forearm_z     35.19 19.82 24.38 31.85  17.89
## magnet_belt_x        16.72 34.82 26.61 12.51  19.91
## gyros_dumbbell_y     27.73 17.49 34.77 34.06  13.34
## roll_forearm         33.78 27.26 34.72 22.73  24.03
## total_accel_dumbbell 14.17 23.97 32.60 22.50  34.68
## magnet_forearm_y     33.58 34.15 17.13 23.34  23.74
## gyros_dumbbell_x     12.54 33.91 17.98 11.06  14.30

Predicting 20 Test Cases

##  [1] B A B A A E D B A A B C B A E E A B A B
## Levels: A B C D E

R Coursera 8.0 Machine Learning Project

Cora Hermoso

May 18, 2020