Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of six participants. They were asked to perform barbell lifts in five different ways: one entirely correct and four others, each with a common mistake. We will predict how well a given person performed the exercise using the data from these accelerometers.
More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The provided dataset contains the name of the volunteer, time stamps, 56 accelerometer measurements, 96 statistics calculated over a moving window on those measurements, and the class of activity: “A” for entirely correct and “B” to “E” for the four common mistakes. Because we are supposed to predict the class of activity from single accelerometer measurements, we cannot use the windowed statistics, so we are limited to the 56 measurements. These are still too many variables and make training too slow. To reduce the number of variables, we first omit the highly correlated ones, then train a classification tree on the remaining data and choose the most important variables based on this model. Finally, we train a random forest model on the selected variables. The final out-of-sample accuracy, based on 50% of the data for training and 50% for testing, is 98.5%. We did not use cross-validation because there are enough data (around 20 thousand observations) in the training set.
First of all, the data are downloaded and loaded into R.
Then I took a look at the variable names.
The names of the variables containing statistics start with one of the words “max”, “min”, “stddev”, “avg”, “amplitude”, “kurtosis”, “skewness”, or “var”. So I checked which columns contain these data and removed them.
Then I removed the first seven columns, which do not actually contain measurements.
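The two filtering steps above can be sketched in base R as follows. This is a minimal illustration, not the original code: the column names are a small hypothetical subset chosen for the example, and the pattern simply anchors the listed prefixes at the start of each name.

```r
# Hypothetical subset of column names, used only for illustration.
col_names <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
               "cvtd_timestamp", "new_window", "num_window",
               "roll_belt", "max_roll_belt", "var_accel_arm",
               "avg_pitch_forearm", "magnet_dumbbell_y", "classe")

# Prefixes that mark the windowed statistics, as listed in the text.
stat_prefixes <- c("max", "min", "stddev", "avg", "amplitude",
                   "kurtosis", "skewness", "var")

# "^" anchors the match at the start of the name, so only prefixes count.
stat_pattern <- paste0("^(", paste(stat_prefixes, collapse = "|"), ")")
is_stat <- grepl(stat_pattern, col_names)

kept <- col_names[!is_stat]   # drop the statistic columns
kept <- kept[-(1:7)]          # drop the first seven non-measurement columns
print(kept)
```

On a data frame, the same logical vector can be used for column indexing, for example `df[, !is_stat]`, where `df` is a hypothetical name for the loaded data.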
In this part I omit the less important features so that the training can be done on a not-so-powerful computer. First, I removed the variables that are highly correlated with others: 13 variables are removed, leaving 39. That is still too many for running a random forest. Next, I divided the dataset into training and testing subsets, with half of the data in each.
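As a rough sketch of these two steps, the base R code below drops one variable from each highly correlated pair and then splits the rows in half. It is a simplified stand-in, not the original code: the text does not say which function or cutoff was used, so the greedy filter, the 0.9 cutoff, the seed, and the toy data are all assumptions. caret also provides `findCorrelation()` and `createDataPartition()` for these purposes.

```r
set.seed(1)  # hypothetical seed, only to make the sketch reproducible

# Toy data: `b` is nearly a copy of `a`, `c` is independent noise.
n <- 200
a <- rnorm(n)
toy <- data.frame(a = a,
                  b = a + rnorm(n, sd = 0.01),
                  c = rnorm(n),
                  classe = factor(sample(c("A", "B"), n, replace = TRUE)))

# Greedy filter: keep a column only if its absolute correlation with
# every column kept so far does not exceed the cutoff (assumed 0.9).
drop_correlated <- function(x, cutoff = 0.9) {
  cm <- abs(cor(x))
  keep <- character(0)
  for (v in colnames(cm)) {
    if (all(cm[v, keep] <= cutoff)) keep <- c(keep, v)
  }
  keep
}

kept <- drop_correlated(toy[, c("a", "b", "c")], cutoff = 0.9)

# Simple random 50/50 split of the rows into training and testing sets.
train_idx <- sample(nrow(toy), size = floor(nrow(toy) / 2))
training <- toy[train_idx, c(kept, "classe")]
testing  <- toy[-train_idx, c(kept, "classe")]
```

A plain random split like this does not balance the classes between the two halves; a stratified split such as the one `createDataPartition()` produces would.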
One idea for reducing the number of variables would be PCA. I tried that as well, but it took much more processing time without giving better results, so I decided to use another method.
In the other approach I used the variable importance function in “caret”. I first trained a decision tree model (rpart) on all the data, which, by the way, did not show good out-of-sample accuracy.
Then I ran varImp() to get the most important variables.
| Variable Name | Overall Importance |
|---|---|
| magnet_dumbbell_y | 100.00000 |
| magnet_belt_y | 89.00120 |
| total_accel_belt | 80.94635 |
| yaw_belt | 74.73625 |
| pitch_forearm | 68.54771 |
| magnet_dumbbell_z | 52.16918 |
| magnet_arm_x | 31.78142 |
| gyros_belt_z | 27.87972 |
| roll_forearm | 23.78797 |
| roll_dumbbell | 21.83467 |
| magnet_dumbbell_x | 21.59404 |
| accel_dumbbell_y | 16.89352 |
| roll_arm | 15.58042 |
| accel_forearm_x | 14.49692 |
| magnet_forearm_z | 12.46572 |
| gyros_belt_x | 0.00000 |
| gyros_belt_y | 0.00000 |
| magnet_belt_x | 0.00000 |
| magnet_belt_z | 0.00000 |
| pitch_arm | 0.00000 |
I kept the variables that have an importance of 14% or more, which leaves us with 14 variables.
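The tree-based selection described above might look roughly like the sketch below. It assumes the caret and rpart packages are installed and uses the built-in `iris` data as a stand-in for the training set, so the variable names, the seed, and the resulting selection are illustrative only. The threshold of 14 is the one from the text.

```r
library(caret)  # assumes caret (and rpart) are installed

set.seed(1)  # hypothetical seed

# iris stands in for the training data; Species plays the role of the class.
tree_fit <- train(Species ~ ., data = iris, method = "rpart")

# varImp() on a caret model returns importances scaled to 0-100 by default;
# for rpart they sit in a data frame with a single "Overall" column.
imp <- varImp(tree_fit)$importance

# Keep the variables whose scaled importance is at least 14.
selected <- rownames(imp)[imp$Overall >= 14]
print(selected)
```

Because the importances are rescaled so that the top variable scores 100, a fixed cutoff such as 14 is relative to the best variable in the tree rather than an absolute measure.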
Next, I kept just these 14 variables and trained a random forest model on them.
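A minimal sketch of this step, again with `iris` as a stand-in and assuming the caret and randomForest packages are installed, could look like this. The text does not say how the random forest was fitted, so the use of `train(..., method = "rf")`, the seed, and the two "selected" variables are assumptions made for the example.

```r
library(caret)  # assumes caret and randomForest are installed

set.seed(1)  # hypothetical seed

# Stratified 50/50 split, with iris standing in for the real data.
in_train <- createDataPartition(iris$Species, p = 0.5, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# Hypothetical result of the variable selection step.
selected <- c("Petal.Length", "Petal.Width")

# Fit a random forest on the selected predictors only.
rf_fit <- train(x = training[, selected], y = training$Species, method = "rf")

# Apply the same column selection to the test set before predicting.
pred <- predict(rf_fit, newdata = testing[, selected])
cm <- confusionMatrix(pred, testing$Species)
print(cm$table)
```

In the table printed by `confusionMatrix()`, rows are predictions and columns are the reference classes.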
To test the model, the same important variables first have to be separated in the testing dataset. Then the classes are predicted using our RF model. Here is the confusion matrix for the test data.
| Prediction / Reference | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 2,783 | 16 | 2 | 0 | 0 |
| B | 6 | 1,847 | 19 | 1 | 1 |
| C | 0 | 34 | 1,683 | 43 | 5 |
| D | 1 | 1 | 3 | 1,564 | 9 |
| E | 0 | 0 | 4 | 0 | 1,788 |
The overall accuracy of the model is 98.52%, which is acceptably good. The better news is that the sensitivity and specificity for class “A” (doing the exercise right) are 0.9975 and 0.9974, respectively. This means that 99.75% of the lifts that were really done right are predicted as “A”, and 99.74% of the lifts that were really done wrong are predicted as something other than “A”.
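These figures can be recomputed directly from the confusion matrix above. The base R sketch below assumes that rows are predictions and columns are reference classes, which is the orientation consistent with the reported sensitivity. It also computes the positive and negative predictive values for class “A”, which are not reported in the text; these answer the related question of how often a given prediction is right.

```r
classes <- c("A", "B", "C", "D", "E")

# Confusion matrix from the table: rows = prediction, columns = reference.
cm <- matrix(c(2783,   16,    2,    0,    0,
                  6, 1847,   19,    1,    1,
                  0,   34, 1683,   43,    5,
                  1,    1,    3, 1564,    9,
                  0,    0,    4,    0, 1788),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy <- sum(diag(cm)) / sum(cm)

# One-vs-rest counts for class "A".
tp <- cm["A", "A"]
fn <- sum(cm[, "A"]) - tp   # truly A, predicted as another class
fp <- sum(cm["A", ]) - tp   # predicted A, truly another class
tn <- sum(cm) - tp - fn - fp

sensitivity <- tp / (tp + fn)   # share of true A that is predicted A
specificity <- tn / (tn + fp)   # share of true non-A predicted non-A
ppv <- tp / (tp + fp)           # share of predicted A that is truly A
npv <- tn / (tn + fn)           # share of predicted non-A that is truly non-A

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv), 4))
```

Rounded to four decimals, this gives an accuracy of 0.9852, a sensitivity of 0.9975, and a specificity of 0.9974, matching the values in the text, along with a positive predictive value of 0.9936 and a negative predictive value of 0.9990 for class “A”.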
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.