Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of six participants. They were asked to perform barbell lifts in five different ways: one fully correct, and four others each containing a common mistake. We will predict how well a given person performed the exercise from these accelerometer readings.

More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Synopsis

The provided dataset contains the name of the volunteer, time stamps, 56 accelerometer measurements, 96 statistics calculated over a moving window on those measurements, and the class of activity: “A” for a fully correct lift and “B” to “E” for the four common mistakes. Because we are supposed to predict the class of activity from single accelerometer measurements, we cannot use the windowed statistics, which limits us to the 56 measurements. These are still too many variables and make the training too slow. To reduce the number of variables, we first omit the highly correlated ones, then train a classification tree on the remaining data and keep the most important variables according to that model. Finally, we train a random forest model on the selected variables. The final out-of-sample accuracy, using 50% of the data for training and 50% for testing, is 98.5%. We did not use cross-validation because there is enough data (around 20 thousand observations) in the training set.

Details of the analysis

Loading the dataset

First, the data are downloaded and loaded into R.
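A minimal sketch of the loading step, not necessarily the exact code used for this report. The helper name `load_wle` is hypothetical, and treating blank cells and `#DIV/0!` as missing values is an assumption about the raw CSV. The download itself is left out because the file URL is not given in this text, so a tiny stand-in file is written instead.

```r
# Hypothetical helper: read a Weight Lifting Exercise CSV into a data frame.
# Assumption: blank cells and "#DIV/0!" entries should become NA.
load_wle <- function(path) {
  read.csv(path, na.strings = c("NA", "", "#DIV/0!"), stringsAsFactors = FALSE)
}

# Stand-in file with two rows, only to show the call; the real file is much larger.
tmp <- tempfile(fileext = ".csv")
writeLines(c("user_name,roll_belt,kurtosis_roll_belt,classe",
             "volunteer1,1.41,,A",
             "volunteer2,1.48,#DIV/0!,B"), tmp)

wle <- load_wle(tmp)
dim(wle)  # number of rows and columns
```

Reading the spreadsheet error strings as NA keeps the affected columns from being parsed as text.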

Exploring and cleaning the dataset

Then I looked at the variable names.

The names of the variables that contain statistics start with one of the words “max”, “min”, “stddev”, “avg”, “amplitude”, “kurtosis”, “skewness”, or “var”. So I checked which columns match these prefixes and removed them.
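The prefix filter can be sketched as follows. The helper name `drop_stat_columns` and the stand-in data frame are hypothetical, and the sketch assumes each prefix is followed by an underscore in the column name.

```r
# Prefixes that mark windowed summary statistics, as listed in the text.
stat_prefixes <- c("max", "min", "stddev", "avg", "amplitude",
                   "kurtosis", "skewness", "var")

# Anchor at the start of the name so that e.g. "total_accel_belt" is kept.
stat_pattern <- paste0("^(", paste(stat_prefixes, collapse = "|"), ")_")

# Hypothetical helper: drop every column whose name starts with a statistic prefix.
drop_stat_columns <- function(df) {
  df[, !grepl(stat_pattern, names(df)), drop = FALSE]
}

# Stand-in data frame with a few illustrative column names.
demo <- data.frame(roll_belt = 1, total_accel_belt = 3, max_roll_belt = 2,
                   var_accel_arm = 4, kurtosis_yaw_arm = 5, classe = "A")
kept <- names(drop_stat_columns(demo))
kept
```

Only `roll_belt`, `total_accel_belt`, and `classe` survive in this toy example.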

Cleaning and summarizing the dataset

Then I removed the first seven columns, which do not actually contain measurements.

In this part I omit the less important features so that the training can be done on a not-so-powerful computer. First, I removed the variables that are highly correlated with others. This removed 13 variables and left 39, which is still too many for running a random forest. Before going further, I divided the dataset into training and testing subsets, with half of the data in each.
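A sketch of the correlation filter and the 50/50 split, assuming caret's `findCorrelation()` and `createDataPartition()` were the tools used (the text does not say). The 0.9 cutoff, the seed, and the synthetic stand-in data are assumptions for illustration only.

```r
library(caret)

set.seed(2016)  # arbitrary seed, only to make the sketch reproducible

# Stand-in for the cleaned measurements: x2 nearly duplicates x1.
n <- 200
x1 <- rnorm(n)
demo <- data.frame(x1 = x1,
                   x2 = x1 + rnorm(n, sd = 0.01),
                   x3 = rnorm(n),
                   classe = factor(sample(c("A", "B"), n, replace = TRUE)))

# Find columns to drop so that no remaining pair is highly correlated.
# Assumption: the 0.9 cutoff is illustrative; the report does not state the one used.
predictors <- demo[, names(demo) != "classe"]
too_correlated <- findCorrelation(cor(predictors), cutoff = 0.9)
reduced <- demo[, !(names(demo) %in% names(predictors)[too_correlated])]

# Put half of the rows in training and half in testing, stratified by class.
in_train <- createDataPartition(reduced$classe, p = 0.5, list = FALSE)
training <- reduced[in_train, ]
testing <- reduced[-in_train, ]
```

Stratifying on `classe` keeps the class proportions similar in both halves.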

To reduce the number of variables, one idea would be to use PCA. I tried that as well, but it did not give better results and took much more processing time. So I decided to use another method.

In the other approach I used the variable importance function in “caret”. I first trained a decision tree model (rpart) on all the data, which, by the way, did not show good out-of-sample accuracy.

Then I ran varImp() to get the most important variables.
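A sketch of ranking variables with a tree, using the built-in iris data as a stand-in for the training set (an assumption, as is renaming its label column to `classe`). It fits an rpart tree through caret's `train()` and reads the scaled scores from `varImp()`.

```r
library(caret)

set.seed(2016)  # arbitrary seed for reproducible resampling

# Stand-in data: iris with its label column renamed to match the report.
demo <- iris
names(demo)[5] <- "classe"

# Fit a classification tree; here it is used only to rank the variables.
tree_fit <- train(classe ~ ., data = demo, method = "rpart")

# varImp() rescales importance to 0-100 by default; sort from most to least important.
importance <- varImp(tree_fit)$importance
importance <- importance[order(importance$Overall, decreasing = TRUE), , drop = FALSE]
importance
```

The tree does not need to predict well for this purpose; only the relative ordering of the scores is used.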

First 20 most important variables
Variable Name	Overall Importance
magnet_dumbbell_y	100.00000
magnet_belt_y	89.00120
total_accel_belt	80.94635
yaw_belt	74.73625
pitch_forearm	68.54771
magnet_dumbbell_z	52.16918
magnet_arm_x	31.78142
gyros_belt_z	27.87972
roll_forearm	23.78797
roll_dumbbell	21.83467
magnet_dumbbell_x	21.59404
accel_dumbbell_y	16.89352
roll_arm	15.58042
accel_forearm_x	14.49692
magnet_forearm_z	12.46572
gyros_belt_x	0.00000
gyros_belt_y	0.00000
magnet_belt_x	0.00000
magnet_belt_z	0.00000
pitch_arm	0.00000

I kept the variables with an importance of 14 or more (on the 0 to 100 scale), which leaves us with 14 variables.
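The threshold step can be checked directly against the importance table above. This sketch copies those scores into a named vector; the names `importance` and `selected` are only illustrative.

```r
# Importance scores copied from the table of the 20 most important variables.
importance <- c(
  magnet_dumbbell_y = 100.00000, magnet_belt_y = 89.00120,
  total_accel_belt = 80.94635, yaw_belt = 74.73625,
  pitch_forearm = 68.54771, magnet_dumbbell_z = 52.16918,
  magnet_arm_x = 31.78142, gyros_belt_z = 27.87972,
  roll_forearm = 23.78797, roll_dumbbell = 21.83467,
  magnet_dumbbell_x = 21.59404, accel_dumbbell_y = 16.89352,
  roll_arm = 15.58042, accel_forearm_x = 14.49692,
  magnet_forearm_z = 12.46572, gyros_belt_x = 0, gyros_belt_y = 0,
  magnet_belt_x = 0, magnet_belt_z = 0, pitch_arm = 0
)

# Keep the variables scoring 14 or more.
selected <- names(importance)[importance >= 14]
length(selected)  # 14
```

The cut falls between `accel_forearm_x` (14.50) and `magnet_forearm_z` (12.47).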

Training and testing the model

Next I selected just these 14 variables and trained a random forest model on them.
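A sketch of fitting the forest on a subset of columns and predicting on held-out rows. Calling `randomForest()` directly is an assumption (the model may equally have been fit through caret), and iris with two hand-picked columns stands in for the real data and the 14 selected variables.

```r
library(randomForest)

set.seed(2016)  # arbitrary seed for a reproducible split and forest

# Stand-in data: iris with its label column renamed to match the report.
demo <- iris
names(demo)[5] <- "classe"
selected <- c("Petal.Length", "Petal.Width")  # hypothetical "important" variables

# 50/50 split, keeping only the selected predictors plus the class label.
in_train <- sample(nrow(demo), nrow(demo) / 2)
training <- demo[in_train, c(selected, "classe")]
testing <- demo[-in_train, c(selected, "classe")]

# Fit the forest on the training half, then predict the held-out half.
rf_fit <- randomForest(classe ~ ., data = training)
predicted <- predict(rf_fit, newdata = testing)

# Cross-tabulate predictions against the true classes.
table(Prediction = predicted, Reference = testing$classe)
```

The same column selection is applied to both halves, so the test data has exactly the predictors the model was trained on.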

To test the model, the same important variables first have to be selected in the testing dataset. Then the classes are predicted using our RF model. Here is the confusion matrix of the test data.

Confusion matrix of the test dataset (rows: predicted class, columns: actual class)
	A	B	C	D	E
A	2,783	16	2	0	0
B	6	1,847	19	1	1
C	0	34	1,683	43	5
D	1	1	3	1,564	9
E	0	0	4	0	1,788

The overall accuracy of the model is 98.52%, which is acceptably good. Better still, the sensitivity and specificity for class “A” (doing the exercise right) are 0.9975 and 0.9974, respectively. This means that 99.75% of the lifts that were actually done right were predicted as “A”, and 99.74% of the lifts that were actually done wrong were predicted as something other than “A”.
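These figures can be recomputed from the confusion matrix above. The sketch assumes rows are predicted classes and columns are actual classes, and under that reading it reproduces the quoted accuracy, sensitivity, and specificity. It also computes the positive predictive value for “A”, the quantity that answers how often a predicted “A” is really an “A”.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the confusion matrix of the test dataset.
# Assumption: rows are predictions, columns are the actual classes.
cm <- matrix(c(2783,   16,    2,    0,    0,
                  6, 1847,   19,    1,    1,
                  0,   34, 1683,   43,    5,
                  1,    1,    3, 1564,    9,
                  0,    0,    4,    0, 1788),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy <- sum(diag(cm)) / sum(cm)

# One-vs-rest counts for class "A".
tp <- cm["A", "A"]
fn <- sum(cm[, "A"]) - tp   # actual "A" predicted as another class
fp <- sum(cm["A", ]) - tp   # predicted "A" but actually another class
tn <- sum(cm) - tp - fn - fp

sensitivity_A <- tp / (tp + fn)  # share of actual "A" predicted as "A"
specificity_A <- tn / (tn + fp)  # share of actual non-"A" predicted as non-"A"
ppv_A <- tp / (tp + fp)          # share of predicted "A" that are actually "A"

round(c(accuracy, sensitivity_A, specificity_A), 4)  # 0.9852 0.9975 0.9974
```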

Reference

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

Weight lifting Analysis

Reza

29 April 2016