Practical Machine Learning Project

Human Activity Recognition and Prediction

Prepared by: Bernard Kiyanda

Summary

Human Activity Recognition (HAR) using wearable accelerometers has emerged as a key research area in recent years and is gaining increasing attention from the pervasive computing research community, especially for the development of context-aware systems. HAR has many potential applications, such as elderly monitoring, life-log systems for monitoring energy expenditure and supporting weight-loss programs, and digital assistants for weight-lifting exercises.

For this project, the error estimates on the provided data set indicated that the Random Forest model predicted the outcome more reliably (accuracy = 0.9963) than the Decision Tree model (accuracy = 0.7382). We therefore used the Random Forest model to predict the final outcome of the test data set.

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health or to find patterns in their behavior. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

References

Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6. Read more: http://groupware.les.inf.puc-rio.br/har

Preparing and cleaning the training data

The goal of this project is to predict the manner in which the participants performed the exercise. This is the “classe” variable in the training set, a factor variable with 5 levels. Participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways:
Class A: exactly according to the specification (correct)
Class B: throwing the elbows to the front
Class C: lifting the dumbbell only halfway
Class D: lowering the dumbbell only halfway
Class E: throwing the hips to the front

First let’s load the data into “trainingActivity”

set.seed(9876)
trainingActivity <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!", ""))
# Inspecting the CSV file shows that the first 7 columns are identifiers and timestamps, not sensor readings:
# row index, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window
trainingActivity<-trainingActivity[ ,-c(1:7)]

trainingActivity<-trainingActivity[,colSums(is.na(trainingActivity)) == 0]
dim(trainingActivity)

## [1] 19622    53
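The two cleaning steps above (mapping several missing-value spellings to NA on load, then keeping only columns with no NA) can be tried on a tiny inline CSV. This is a minimal sketch: the column `kurtosis_x` and all values are made up for illustration and are not taken from pml-training.csv.

```r
# Hypothetical three-row CSV; "#DIV/0!", "" and "NA" all stand for missing values.
csv_text <- paste(
  "roll_belt,kurtosis_x,classe",
  "1.41,#DIV/0!,A",
  "1.42,,A",
  "1.48,NA,B",
  sep = "\n"
)

# na.strings turns every listed spelling into a real NA while reading.
toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

# Count the NAs in each column, then keep only the columns that have none.
na_counts <- colSums(is.na(toy))
toy_clean <- toy[, na_counts == 0, drop = FALSE]

print(na_counts)
print(names(toy_clean))
```

Using `drop = FALSE` keeps the result a data frame even if only one column survives the filter.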

Summary of the data structure:

str(trainingActivity)

## 'data.frame':    19622 obs. of  53 variables:
##  $ roll_belt           : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y        : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x        : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y        : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z        : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x       : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y       : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z       : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm           : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y         : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z         : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x         : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y         : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z         : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x        : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y        : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z        : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ roll_dumbbell       : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell      : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell        : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 0 0 -0.02 0 0 0 0 0 0 ...
##  $ accel_dumbbell_x    : int  -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
##  $ accel_dumbbell_y    : int  47 47 46 48 48 48 47 46 47 48 ...
##  $ accel_dumbbell_z    : int  -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
##  $ magnet_dumbbell_x   : int  -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
##  $ magnet_dumbbell_y   : int  293 296 298 303 292 294 295 300 292 291 ...
##  $ magnet_dumbbell_z   : num  -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
##  $ roll_forearm        : num  28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
##  $ yaw_forearm         : num  -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
##  $ gyros_forearm_y     : num  0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
##  $ gyros_forearm_z     : num  -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
##  $ accel_forearm_x     : int  192 192 196 189 189 193 195 193 193 190 ...
##  $ accel_forearm_y     : int  203 203 204 206 206 203 205 205 204 205 ...
##  $ accel_forearm_z     : int  -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
##  $ magnet_forearm_x    : int  -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
##  $ magnet_forearm_y    : num  654 661 658 658 655 660 659 660 653 656 ...
##  $ magnet_forearm_z    : num  476 473 469 469 473 478 470 474 476 473 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

Cross validation - Partitioning the data

To validate the models, i.e. to check a model built on one subset of the training data against a separate subset that it has not seen, the training data set is partitioned into 2 sets: trainingActivity1 (75%) and trainingActivity2 (25%). The split is made by random partitioning without replacement.

library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

subsets <- createDataPartition(y=trainingActivity$classe, p=0.75, list=FALSE)
trainingActivity1 <- trainingActivity[subsets, ] 
trainingActivity2 <- trainingActivity[-subsets, ]
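For a factor outcome, createDataPartition draws its random sample within each class, so both subsets keep roughly the original class proportions. The base-R sketch below imitates that idea on a made-up outcome vector; it is an illustration of stratified splitting under that assumption, not caret's actual implementation, and the class counts are hypothetical.

```r
set.seed(9876)

# Hypothetical outcome with unbalanced classes (100 rows in total).
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(40, 20, 20, 10, 10)))

# Group the row indices by class, then sample 75% of each group without replacement.
idx_by_class <- split(seq_along(classe), classe)
train_idx <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE))

# Every row that was not sampled goes to the held-out subset.
holdout_idx <- setdiff(seq_along(classe), train_idx)

print(table(classe[train_idx]))
print(table(classe[holdout_idx]))
```

Because the sampling happens per class, a rare class cannot end up entirely in one subset by chance.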

Build machine learning algorithms

Now, let’s build machine learning algorithms to predict activity quality from the activity monitors. The models are fitted on the training data set, and the better one is later used to predict the final outcome of the test data set. Two prediction models will be developed.

Model 1: Decision tree

Build a first model using the training set trainingActivity1 and predict the outcome using trainingActivity2.

library(rpart)
modFit1 <- rpart(classe ~ .,method="class",data=trainingActivity1)
#print(modFit1)  # modFit1 is an rpart object (not a caret train object), so it has no $finalModel

#plot decision tree
#install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(modFit1, main="Decision Tree", extra=102, under=TRUE, faclen=0)

# Predicting on the second training set:
prediction1 <- predict(modFit1, trainingActivity2, type = "class")

Model 2: Random forest

Build a second model using the same training set trainingActivity1 and predict the outcome using trainingActivity2.

library(randomForest)

## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

# randomForest has no "method" argument; classification is chosen because classe is a factor
modFit2 <- randomForest(classe ~ ., data = trainingActivity1)
# Predicting on the second training set:
prediction2 <- predict(modFit2, trainingActivity2, type = "class")

Estimate the error for each model

Estimate the error for model 1:

confusionMatrix(prediction1, trainingActivity2$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1248  128   19   47   15
##          B   63  511   44   60   48
##          C   43  103  686  135  115
##          D   15   79   50  500   48
##          E   26  128   56   62  675
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7382          
##                  95% CI : (0.7256, 0.7504)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6685          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8946   0.5385   0.8023   0.6219   0.7492
## Specificity            0.9404   0.9456   0.9022   0.9532   0.9321
## Pos Pred Value         0.8566   0.7039   0.6340   0.7225   0.7128
## Neg Pred Value         0.9574   0.8952   0.9558   0.9278   0.9429
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2545   0.1042   0.1399   0.1020   0.1376
## Detection Prevalence   0.2971   0.1480   0.2206   0.1411   0.1931
## Balanced Accuracy      0.9175   0.7420   0.8523   0.7875   0.8406
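The headline figures in this output can be recomputed by hand from the confusion matrix, which helps when reading the table. The sketch below re-enters the decision tree counts printed above; the variable names (`cm1`, `accuracy1`, and so on) are introduced here for illustration only.

```r
# Decision tree confusion matrix from the output above
# (rows = predicted class, columns = reference class).
cm1 <- matrix(
  c(1248, 128,  19,  47,  15,
      63, 511,  44,  60,  48,
      43, 103, 686, 135, 115,
      15,  79,  50, 500,  48,
      26, 128,  56,  62, 675),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy1 <- sum(diag(cm1)) / sum(cm1)

# Sensitivity per class: correct predictions over all true members of the class.
sensitivity1 <- diag(cm1) / colSums(cm1)

# Positive predictive value per class: correct predictions over all predictions of the class.
ppv1 <- diag(cm1) / rowSums(cm1)

print(round(accuracy1, 4))
print(round(sensitivity1, 4))
print(round(ppv1, 4))
```

The low sensitivity for class B shows up directly in column B, where fewer than 60% of the true B cases sit on the diagonal.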

Estimate the error for model 2:

confusionMatrix(prediction2, trainingActivity2$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1394    2    0    0    0
##          B    1  943    5    0    0
##          C    0    4  850    5    0
##          D    0    0    0  798    0
##          E    0    0    0    1  901
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9963          
##                  95% CI : (0.9942, 0.9978)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9954          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9993   0.9937   0.9942   0.9925   1.0000
## Specificity            0.9994   0.9985   0.9978   1.0000   0.9998
## Pos Pred Value         0.9986   0.9937   0.9895   1.0000   0.9989
## Neg Pred Value         0.9997   0.9985   0.9988   0.9985   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2843   0.1923   0.1733   0.1627   0.1837
## Detection Prevalence   0.2847   0.1935   0.1752   0.1627   0.1839
## Balanced Accuracy      0.9994   0.9961   0.9960   0.9963   0.9999

The error estimates indicate that the Random Forest model predicts the outcome more reliably (accuracy: 0.9963) than the decision tree model (accuracy: 0.7382). Therefore we will use the Random Forest model to predict the outcome of the test data set.

Note that if confusionMatrix stops with the error “Error in confusionMatrix.default(prediction1, testActivity$classe) : the data and reference factors must have the same number of levels”, you can use identical() to check whether the factor levels of the predictions and of the reference match:

identical(levels(prediction1),levels(trainingActivity2$classe))

## [1] TRUE

levels(prediction1); levels(trainingActivity2$classe)

## [1] "A" "B" "C" "D" "E"

## [1] "A" "B" "C" "D" "E"
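One way such a mismatch can arise is when a class never appears among the predictions, so the predicted factor ends up with fewer levels than the reference. The sketch below reproduces that situation with made-up vectors and repairs it by re-declaring the levels; it is an illustration of the check above, not a diagnosis of any particular run.

```r
# Hypothetical reference labels and predictions; "E" is never predicted.
reference <- factor(c("A", "B", "C", "D", "E", "A"))
predicted <- factor(c("A", "B", "C", "D", "A", "A"))

# The predicted factor only has 4 levels, so the level sets differ.
levels_match_before <- identical(levels(predicted), levels(reference))

# Re-create the factor with the reference levels; unused levels are kept.
predicted_fixed <- factor(predicted, levels = levels(reference))
levels_match_after <- identical(levels(predicted_fixed), levels(reference))

# The cross-tabulation is now a square 5 x 5 table with an empty row for "E".
cross_tab <- table(Prediction = predicted_fixed, Reference = reference)

print(levels_match_before)
print(levels_match_after)
print(cross_tab)
```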

Expected out of sample error

The estimated out-of-sample error is 0.0037, or about 0.4%. It is calculated as 1 - accuracy for predictions made on the held-out validation set (trainingActivity2). Given that the model accuracy is above 99% (accuracy = 0.9963), we can expect misclassification to be low.
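The arithmetic behind this estimate can be written out from the Random Forest confusion matrix. The sketch below re-enters the counts printed above and then extrapolates to a batch of 20 new cases; that extrapolation assumes the new cases are independent and resemble the held-out data, which is an assumption of this example rather than something the data guarantees.

```r
# Random Forest confusion matrix from the output above
# (rows = predicted class, columns = reference class).
cm2 <- matrix(
  c(1394,   2,   0,   0,   0,
       1, 943,   5,   0,   0,
       0,   4, 850,   5,   0,
       0,   0,   0, 798,   0,
       0,   0,   0,   1, 901),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n_holdout <- sum(cm2)                  # size of the held-out subset
n_wrong   <- n_holdout - sum(diag(cm2)) # off-diagonal cells are the errors
oos_error <- n_wrong / n_holdout        # same as 1 - accuracy

# Under the independence assumption stated above:
expected_wrong_in_20 <- 20 * oos_error      # expected number of misclassified cases
p_all_20_right       <- (1 - oos_error)^20  # chance that all 20 are correct

print(round(oos_error, 4))
print(round(expected_wrong_in_20, 3))
print(round(p_all_20_right, 3))
```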

Analyzing the importance of each variable in our prediction model

varImp2 <- varImp(modFit2)
#varImp2[with(varImp2,order(varImp2$Overall)),]
varImp2

##                        Overall
## roll_belt            921.07404
## pitch_belt           515.84345
## yaw_belt             659.67257
## total_accel_belt     139.55256
## gyros_belt_x          72.74954
## gyros_belt_y          83.83180
## gyros_belt_z         214.91430
## accel_belt_x          87.15184
## accel_belt_y          92.32272
## accel_belt_z         311.79431
## magnet_belt_x        193.92199
## magnet_belt_y        311.76867
## magnet_belt_z        303.51828
## roll_arm             238.35762
## pitch_arm            129.24104
## yaw_arm              192.64289
## total_accel_arm       77.27176
## gyros_arm_x           98.45409
## gyros_arm_y          107.71552
## gyros_arm_z           43.75188
## accel_arm_x          163.35792
## accel_arm_y          120.73193
## accel_arm_z           98.17708
## magnet_arm_x         201.57630
## magnet_arm_y         164.01830
## magnet_arm_z         133.94216
## roll_dumbbell        310.58341
## pitch_dumbbell       141.18586
## yaw_dumbbell         182.01061
## total_accel_dumbbell 201.87726
## gyros_dumbbell_x     103.19062
## gyros_dumbbell_y     190.32112
## gyros_dumbbell_z      68.26597
## accel_dumbbell_x     184.63081
## accel_dumbbell_y     312.73727
## accel_dumbbell_z     249.24763
## magnet_dumbbell_x    370.45640
## magnet_dumbbell_y    513.92462
## magnet_dumbbell_z    573.78041
## roll_forearm         455.31805
## pitch_forearm        597.94589
## yaw_forearm          131.31054
## total_accel_forearm   81.62484
## gyros_forearm_x       54.63780
## gyros_forearm_y       89.85502
## gyros_forearm_z       62.22054
## accel_forearm_x      244.76172
## accel_forearm_y      112.52905
## accel_forearm_z      183.89611
## magnet_forearm_x     164.68869
## magnet_forearm_y     166.56791
## magnet_forearm_z     210.88238
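The table is easier to read when sorted from most to least important. The sketch below rebuilds a small data frame in the same shape as the output (one `Overall` column, variable names as row names) from six of the values printed above and sorts it; `imp` and `imp_sorted` are names introduced for this example.

```r
# Six of the importance values printed above, in their original order.
imp <- data.frame(
  Overall = c(921.07404, 515.84345, 659.67257, 139.55256, 573.78041, 597.94589),
  row.names = c("roll_belt", "pitch_belt", "yaw_belt",
                "total_accel_belt", "magnet_dumbbell_z", "pitch_forearm")
)

# Sort rows by decreasing importance; drop = FALSE keeps the row names,
# which would be lost if the single column collapsed to a plain vector.
imp_sorted <- imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE]

print(head(imp_sorted, 3))
```

Applied to the full table, the same two lines put roll_belt, yaw_belt and pitch_forearm at the top.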

Results - Use model 2 to predict the test data set

The final step is to predict the 20 observations of the test data set in pml-testing.csv.

# Cleaning the test data
testActivity <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!", ""))
# As in the training file, the first 7 columns are identifiers and timestamps, not sensor readings:
# row index, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window
testActivity<-testActivity[ ,-c(1:7)]
testActivity<-testActivity[,colSums(is.na(testActivity)) == 0]
dim(testActivity)

## [1] 20 53
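Here the NA filter happens to leave the same 53 columns in the test file as in the training file. A more defensive variant selects the test columns by name from the training predictors, so a differing NA pattern cannot silently change the feature set. The sketch below uses two made-up data frames; the column names `kurtosis_x` and `problem_id` and all values are assumptions of this example.

```r
# Hypothetical cleaned training data: two predictors plus the outcome.
train_toy <- data.frame(
  roll_belt  = c(1.41, 1.42, 1.48),
  pitch_belt = c(8.07, 8.07, 8.05),
  classe     = factor(c("A", "A", "B"))
)

# Hypothetical raw test data with extra columns and no outcome.
test_toy <- data.frame(
  X          = 1:2,
  roll_belt  = c(1.45, 1.43),
  kurtosis_x = c(NA, NA),
  pitch_belt = c(8.06, 8.16),
  problem_id = 1:2
)

# Predictors are all training columns except the outcome.
predictors <- setdiff(names(train_toy), "classe")

# Fail early if the test data lacks a predictor the model was trained on.
missing_cols <- setdiff(predictors, names(test_toy))
stopifnot(length(missing_cols) == 0)

# Keep exactly the training predictors, in the training order.
test_aligned <- test_toy[, predictors, drop = FALSE]

print(names(test_aligned))
```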

# Use the prediction model on the test data
TestPrediction <- predict(modFit2, testActivity, type = "class")
TestPrediction

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Project Submission

# Write one text file per prediction for submission
pml_write_files <- function(x) {
  # seq_along() is safe even if x is empty, unlike 1:length(x)
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}

pml_write_files(TestPrediction)
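To check the output format without cluttering the working directory, the same idea can be run against a temporary folder and read back. This is a hedged variant, not the function used above: `write_predictions`, its `dir` argument and the three toy predictions are hypothetical, and it uses writeLines instead of write.table.

```r
# Hypothetical variant: writes one label per file into `dir` and returns the paths.
write_predictions <- function(x, dir = tempdir()) {
  paths <- file.path(dir, paste0("problem_id_", seq_along(x), ".txt"))
  for (i in seq_along(x)) {
    # as.character() writes the class label rather than the factor's integer code
    writeLines(as.character(x[i]), paths[i])
  }
  invisible(paths)
}

# Three made-up predictions with the same levels as classe.
toy_pred <- factor(c("B", "A", "E"), levels = LETTERS[1:5])
paths <- write_predictions(toy_pred)

# Read each file back; every file should contain exactly one label.
read_back <- vapply(paths, readLines, character(1), USE.NAMES = FALSE)
print(basename(paths))
print(read_back)
```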