Practical Machine Learning Course Project

Qualitative assessment of weight lifting exercises

The goal of this project is to build a machine learning model, from sample data acquired by motion sensors worn on participants' bodies, that predicts as accurately as possible the manner in which a weight lifting exercise was performed. The sensor data is used to investigate “how well” an activity was performed by the wearer. We will try three different classification algorithms, check their accuracy on a validation subset held out from the training data, and then use the best one to predict the class variable (“classe” in the original dataset, meaning “class” in Portuguese) in the test set, assigning each observation to one of 5 given values:

Class A: exactly according to the specification
Class B: throwing the elbows to the front
Class C: lifting the dumbbell only halfway
Class D: lowering the dumbbell only halfway
Class E: throwing the hips to the front.

Synopsis

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this analysis, we will use sensor data acquired from accelerometers on the belt, forearm, arm, and dumbbell of six participants aged 20 to 28 with little weight lifting experience. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the Human Activity Recognition project webpage (see the section on the Weight Lifting Exercise Dataset).

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
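
As a rough sketch of how these files might be loaded, the snippet below wraps read.csv in a small helper. The helper name read_pml and the choice to treat empty strings and "#DIV/0!" as missing values are assumptions, not taken from the original analysis. The demonstration reads a tiny stand-in file so that it runs offline; the same call should work with the two URLs above.

```r
# Hypothetical helper: the name and the na.strings choice are assumptions,
# not part of the original report.
read_pml <- function(path) {
  read.csv(path, na.strings = c("NA", "", "#DIV/0!"), stringsAsFactors = TRUE)
}

# Tiny stand-in file so the example runs without network access.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "user_name,roll_belt,kurtosis_roll_belt,classe",
  "carlitos,1.41,,A",
  "pedro,1.42,#DIV/0!,B"
), tmp)

demo <- read_pml(tmp)
str(demo)

# With network access, the real files could be read the same way:
# training <- read_pml("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
# test     <- read_pml("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
```

Reading the placeholder strings as NA at load time keeps mostly empty columns from being imported as factors, which makes the later missing-value filter simpler.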

Dataset description

Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.

Cleaning and processing the dataset

There are 159 attributes in the dataset and 19622 observations. The variables are of three data types: factor, integer and numeric.

The dataset contains time-series variables in columns 2 to 6 that are not related to the movement and are of no use to our analysis, so we discard them. In addition, many columns consist mostly of missing values (more than 90% of the data missing); we remove those from the dataset as well.

# Remove time-series data (columns 2 to 6)
training <- training[, -(2:6)]
test <- test[, -(2:6)]

# Count NAs per column and look at how the counts are distributed
nas <- colSums(is.na(training))
table(nas)
length(nas[nas == 0])  # columns with no missing values
length(nas[nas != 0])  # columns with at least one missing value

# Flag columns where more than 90% of the values are NA
rownum <- nrow(training)
NAcolumns <- sapply(training, function(x) sum(is.na(x)) > 0.9 * rownum)
training <- training[, !NAcolumns]
test <- test[, !NAcolumns]
attrNum <- length(names(training)) # number of variables left in the dataset

After removing the columns that consist mostly of missing values, the resulting dataset has 54 variables. The cleaned dataset looks like this:

# Describe dataset
str(training)

## 'data.frame':    19622 obs. of  54 variables:
##  $ user_name           : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ roll_belt           : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y        : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x        : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y        : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z        : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x       : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y       : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z       : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm           : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y         : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z         : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x         : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y         : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z         : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x        : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y        : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z        : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ roll_dumbbell       : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell      : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell        : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 0 0 -0.02 0 0 0 0 0 0 ...
##  $ accel_dumbbell_x    : int  -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
##  $ accel_dumbbell_y    : int  47 47 46 48 48 48 47 46 47 48 ...
##  $ accel_dumbbell_z    : int  -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
##  $ magnet_dumbbell_x   : int  -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
##  $ magnet_dumbbell_y   : int  293 296 298 303 292 294 295 300 292 291 ...
##  $ magnet_dumbbell_z   : num  -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
##  $ roll_forearm        : num  28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
##  $ yaw_forearm         : num  -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
##  $ gyros_forearm_y     : num  0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
##  $ gyros_forearm_z     : num  -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
##  $ accel_forearm_x     : int  192 192 196 189 189 193 195 193 193 190 ...
##  $ accel_forearm_y     : int  203 203 204 206 206 203 205 205 204 205 ...
##  $ accel_forearm_z     : int  -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
##  $ magnet_forearm_x    : int  -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
##  $ magnet_forearm_y    : num  654 661 658 658 655 660 659 660 653 656 ...
##  $ magnet_forearm_z    : num  476 473 469 469 473 478 470 474 476 473 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

To get a better look at the data, we make a few density plots of four different types of variables, grouped by the type of movement, in all three directions.
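
The plots themselves are not reproduced here. A minimal base R sketch of one such plot, overlaying the density of a single variable for each class, might look like the following. The data is synthetic and the helper plot_density_by_class is hypothetical; the original report may well have used a different plotting package.

```r
# Hypothetical helper: overlay one density curve per class for a numeric variable.
plot_density_by_class <- function(x, class, main = "") {
  dens <- lapply(split(x, class), density)   # one density estimate per class
  xlim <- range(sapply(dens, function(d) range(d$x)))
  ylim <- c(0, max(sapply(dens, function(d) max(d$y))))
  plot(NULL, xlim = xlim, ylim = ylim, xlab = "value", ylab = "density", main = main)
  for (i in seq_along(dens)) lines(dens[[i]], col = i)
  legend("topright", legend = names(dens), col = seq_along(dens), lty = 1)
  invisible(dens)
}

# Synthetic stand-in for one sensor variable (not the real data).
set.seed(1)
toy <- data.frame(
  classe = factor(rep(c("A", "B", "C", "D", "E"), each = 50)),
  accel_belt_x = rnorm(250, mean = rep(c(-20, -10, 0, 10, 20), each = 50), sd = 5)
)
dens <- plot_density_by_class(toy$accel_belt_x, toy$classe,
                              main = "accel_belt_x by class (synthetic)")
```

Calling the helper once per axis (x, y, z) of a sensor gives the "three directions" view described above.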

Partitioning the dataset into training and validation sets

Since the dataset we are given is fairly large, we keep 70% of the original data for training and put the remaining 30% in a validation set. The classification models are built on the training set, and their accuracy is then checked on the validation set.

library(caret)  # provides createDataPartition()

# 70/30 split on the outcome variable
inTrain <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
train <- training[inTrain, ]
validate <- training[-inTrain, ]
dim(train)
dim(validate)
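
For a factor outcome, createDataPartition samples within each class, so the class proportions are roughly preserved in both subsets. The base R sketch below illustrates that idea on synthetic labels; the helper stratified_split is hypothetical and is not caret's actual implementation.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_split <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Synthetic labels (not the real data).
set.seed(42)
y <- factor(sample(c("A", "B", "C", "D", "E"), 1000, replace = TRUE,
                   prob = c(0.28, 0.19, 0.17, 0.16, 0.20)))
in_train <- stratified_split(y, p = 0.7)

# Class proportions in the two subsets should be close to each other.
round(prop.table(table(y[in_train])), 3)
round(prop.table(table(y[-in_train])), 3)
```

Setting a seed before the split, as done here, makes the partition reproducible from run to run.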

Next, we are going to use three different prediction algorithms to determine which one can provide the best accuracy on our validation set. The three algorithms are: decision tree, random forests and generalized boosted regression.

Predicting with the decision tree algorithm

library(rpart)   # rpart()
library(rattle)  # fancyRpartPlot()

set.seed(909)
DSmodel <- rpart(classe ~ ., data = train, method = "class")
fancyRpartPlot(DSmodel, sub = "")

DSprediction <- predict(DSmodel, validate, type = "class")
DSconf <- confusionMatrix(DSprediction, validate$classe)
DSaccuracy <- round(DSconf$overall['Accuracy'], 4)
print(DSconf)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1368  151   22   39   14
##          B   62  614   58   79   81
##          C   36  118  797  141  130
##          D  190  193  126  639  160
##          E   18   63   23   66  697
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6992          
##                  95% CI : (0.6873, 0.7109)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6211          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8172   0.5391   0.7768   0.6629   0.6442
## Specificity            0.9463   0.9410   0.9125   0.8641   0.9646
## Pos Pred Value         0.8582   0.6868   0.6522   0.4885   0.8039
## Neg Pred Value         0.9287   0.8948   0.9509   0.9290   0.9233
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2325   0.1043   0.1354   0.1086   0.1184
## Detection Prevalence   0.2709   0.1519   0.2076   0.2223   0.1473
## Balanced Accuracy      0.8818   0.7400   0.8447   0.7635   0.8044

plot(DSconf$table, col = DSconf$byClass, main = paste("Decision Tree model confusion matrix: Accuracy =", round(DSconf$overall['Accuracy'], 4)))

The accuracy of our decision tree model on the validation set is 69.92%. Next, we make the prediction using the random forests algorithm.
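
The headline figures in the output above follow directly from the confusion table. As a worked check, the sketch below re-enters the decision tree table by hand and recomputes accuracy, Cohen's kappa and per-class sensitivity in base R; the variable names are illustrative.

```r
classes <- c("A", "B", "C", "D", "E")
# Rows are predictions, columns are reference classes (decision tree table above).
cm <- matrix(c(1368, 151,  22,  39,  14,
                 62, 614,  58,  79,  81,
                 36, 118, 797, 141, 130,
                190, 193, 126, 639, 160,
                 18,  63,  23,  66, 697),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                      # share of correct predictions
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (accuracy - expected) / (1 - expected)
sensitivity <- diag(cm) / colSums(cm)              # correct / actual, per class

round(c(accuracy = accuracy, kappa = kappa_hat), 4)
round(sensitivity, 4)
```

The recomputed accuracy (4115 correct out of 5885) and kappa agree with the 0.6992 and 0.6211 printed by confusionMatrix.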

Predicting with the random forests algorithm

library(randomForest)  # randomForest()

set.seed(808)
RFmodel <- randomForest(classe ~ ., data = train)
RFprediction <- predict(RFmodel, validate, type = "class")
RFconf <- confusionMatrix(RFprediction, validate$classe)
print(RFconf)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670    3    0    0    0
##          B    3 1128    2    0    0
##          C    0    8 1024   11    0
##          D    0    0    0  950    4
##          E    1    0    0    3 1078
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9941          
##                  95% CI : (0.9917, 0.9959)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9925          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9903   0.9981   0.9855   0.9963
## Specificity            0.9993   0.9989   0.9961   0.9992   0.9992
## Pos Pred Value         0.9982   0.9956   0.9818   0.9958   0.9963
## Neg Pred Value         0.9991   0.9977   0.9996   0.9972   0.9992
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2838   0.1917   0.1740   0.1614   0.1832
## Detection Prevalence   0.2843   0.1925   0.1772   0.1621   0.1839
## Balanced Accuracy      0.9984   0.9946   0.9971   0.9923   0.9977

RFaccuracy <- round(RFconf$overall['Accuracy'], 4)
plot(RFconf$table, col = RFconf$byClass, main = paste("Random Forest model confusion matrix: Accuracy =", round(RFconf$overall['Accuracy'], 4)))

For our random forests model, the prediction accuracy on the validation set is 99.41%, which is almost perfect. Finally, we use the generalized boosted regression algorithm for our last prediction.
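
The 95% confidence interval printed above can be cross-checked from the counts alone. The sketch below takes the number of correct predictions from the diagonal of the random forest table and computes an exact binomial interval with base R's binom.test. That this is the same method behind the printed interval is an assumption.

```r
correct <- 1670 + 1128 + 1024 + 950 + 1078   # diagonal of the random forest table
total <- 5885                                 # rows in the validation set

accuracy <- correct / total
ci <- binom.test(correct, total, conf.level = 0.95)$conf.int

round(accuracy, 4)
round(as.numeric(ci), 4)
```

Only 35 of the 5885 validation rows are misclassified, which is why the interval is so narrow.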

Predicting with the generalized boosted regression algorithm

set.seed(303)
fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 1)

GBRmodel <- train(classe ~ ., data=train, method="gbm",
                 trControl = fitControl,
                 verbose = FALSE)


GBRfinal <- GBRmodel$finalModel

GBRprediction <- predict(GBRmodel, newdata=validate)
GBRconf <- confusionMatrix(GBRprediction, validate$classe)
GBRaccuracy <- round(GBRconf$overall[1], 4)
print(GBRaccuracy)

## Accuracy 
##   0.9602

plot(GBRconf$table, col = GBRconf$byClass, main = paste("Generalized Boosted Regression model confusion matrix: Accuracy =", round(GBRconf$overall['Accuracy'], 4)))

Our generalized boosted regression model has a validation-set accuracy of 96.02%, which is good, but not as good as the result of the random forests algorithm.
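
The trainControl settings used for this model request 5-fold cross-validation repeated once: the training rows are divided into five folds, and each fold is held out in turn while the model is fitted on the other four. A minimal base R sketch of that resampling scheme follows, with a hypothetical helper make_folds and a placeholder row count. It assigns folds purely at random and is not caret's own fold construction.

```r
# Hypothetical helper: randomly assign each of n rows to one of k folds.
make_folds <- function(n, k = 5) {
  sample(rep(seq_len(k), length.out = n))
}

set.seed(303)
n <- 100                                  # placeholder row count, not the real training size
fold_id <- make_folds(n, k = 5)

for (fold in 1:5) {
  held_out <- which(fold_id == fold)      # rows used for evaluation in this round
  fitted_on <- which(fold_id != fold)     # rows used for fitting in this round
  cat("fold", fold, ": fit on", length(fitted_on),
      "rows, evaluate on", length(held_out), "rows\n")
}
```

Every row is held out exactly once, so the averaged fold accuracy uses all of the training data without ever scoring a row with a model that saw it.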

Model comparison

Comparing the three classification algorithms and their respective results, we conclude that the random forests algorithm produced the best accuracy, 99.41%, so we use it to make the predictions on the test dataset, where we expect an out-of-sample error of about 0.59% (1 - 0.9941).

MODEL	ACCURACY
Decision Tree	0.6992
Random Forests	0.9941
Generalized Boosted Regression	0.9602

Table 1. Accuracy comparison between three different models
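
To put the random forests accuracy of 0.9941 in context for a 20-case test set: if each case were classified correctly with probability 0.9941, independently of the others, the chance of getting all 20 right would be 0.9941^20. Both the independence and the idea that validation accuracy carries over unchanged to the test set are simplifying assumptions.

```r
accuracy <- 0.9941   # validation accuracy of the random forests model
n_test <- 20         # number of test observations

expected_error <- 1 - accuracy
p_all_correct <- accuracy^n_test               # probability that all 20 are correct
expected_mistakes <- n_test * expected_error   # expected number of misclassified cases

round(c(expected_error = expected_error,
        p_all_correct = p_all_correct,
        expected_mistakes = expected_mistakes), 4)
```

Under these assumptions a perfect score on the 20 cases is likely (roughly 89%) but not guaranteed.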

Making predictions on the test dataset using Random forests

The dataset we need to submit for evaluation has 20 observations. For each observation (data from different motion sensors for a particular participant), we use our random forests model to predict which class it belongs to.

library(randomForest)
RFpredictSubmit <- predict(RFmodel, test, type = "class")
results <- data.frame("Participant"=test$user_name, "Problem_id"=test$problem_id, "Class"=RFpredictSubmit)
print(results)

##    Participant Problem_id Class
## 1        pedro          1     B
## 2       jeremy          2     A
## 3       jeremy          3     B
## 4       adelmo          4     A
## 5       eurico          5     A
## 6       jeremy          6     E
## 7       jeremy          7     D
## 8       jeremy          8     B
## 9     carlitos          9     A
## 10     charles         10     A
## 11    carlitos         11     B
## 12      jeremy         12     C
## 13      eurico         13     B
## 14      jeremy         14     A
## 15      jeremy         15     E
## 16      eurico         16     E
## 17       pedro         17     A
## 18    carlitos         18     B
## 19       pedro         19     B
## 20      eurico         20     B

Prediction results

After submitting the predictions on the Coursera evaluation webpage, we received a result of 100% prediction accuracy on the test dataset, so we conclude that our random forest model was very successful in this case.

Citation:

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013. (http://groupware.les.inf.puc-rio.br/har)

This RMarkdown document was produced with RStudio v0.99.486 on R v3.2.2.