Human Activity Recognition (HAR) using wearable accelerometers has emerged as a key research area in recent years and is gaining increasing attention from the pervasive computing research community, especially for the development of context-aware systems. HAR has many potential applications, such as elderly monitoring, life-log systems that track energy expenditure and support weight-loss programs, and digital assistants for weight-lifting exercises.
For this project, the error estimates on the provided data set indicated that the Random Forest model was the more reliable predictor of the outcome (accuracy = 0.9963) versus the decision tree model (accuracy = 0.7382). We therefore used the Random Forest model to predict the final outcome on the test data set.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified-self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health and to find patterns in their behavior. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. In: Advances in Artificial Intelligence - SBIA 2012, Proceedings of the 21st Brazilian Symposium on Artificial Intelligence, Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6. Read more: http://groupware.les.inf.puc-rio.br/har
The goal of this project is to predict the manner in which the participants did the exercise. The outcome is the “classe” variable in the training set, a factor with 5 levels corresponding to performing barbell lifts correctly and incorrectly in 5 different ways:
Class A: exactly according to the specification (correct)
Class B: throwing the elbows to the front
Class C: lifting the dumbbell only halfway
Class D: lowering the dumbbell only halfway
Class E: throwing the hips to the front
First, let’s load the training data into “trainingActivity”:
set.seed(9876)
# Treat "NA", "#DIV/0!", and empty strings as missing values
trainingActivity <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!", ""))
# Inspecting the CSV shows that the first 7 columns are identifiers and timestamps
# (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp,
# new_window, num_window), which are not needed for the analysis:
trainingActivity <- trainingActivity[, -c(1:7)]
# Drop columns that contain any missing values
trainingActivity <- trainingActivity[, colSums(is.na(trainingActivity)) == 0]
dim(trainingActivity)
## [1] 19622 53
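As an optional sanity check (a minimal sketch, not part of the original cleaning; rawActivity and naFraction are illustrative names), we can verify that the dropped columns were mostly missing values, so removing them loses little information:

rawActivity <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
# Fraction of missing values in each raw column
naFraction <- colMeans(is.na(rawActivity))
# Columns are either fully populated or almost entirely NA
table(round(naFraction, 2))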
Summary of the data structure:
str(trainingActivity)
## 'data.frame': 19622 obs. of 53 variables:
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 -0.02 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
## $ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
## $ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
## $ magnet_dumbbell_x : int -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
## $ magnet_dumbbell_y : int 293 296 298 303 292 294 295 300 292 291 ...
## $ magnet_dumbbell_z : num -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
## $ roll_forearm : num 28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
## $ yaw_forearm : num -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
## $ gyros_forearm_y : num 0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
## $ gyros_forearm_z : num -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
## $ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
## $ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
## $ magnet_forearm_y : num 654 661 658 658 655 660 659 660 653 656 ...
## $ magnet_forearm_z : num 476 473 469 469 473 478 470 474 476 473 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
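As a further optional check (a sketch only, not part of the original analysis), caret's nearZeroVar() flags predictors with almost no variation; after the NA filtering above it should flag nothing:

# Identify near-zero-variance predictors; an empty result means none remain
nzv <- caret::nearZeroVar(trainingActivity)
length(nzv)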
To enable cross-validation, i.e. validating a model built on one subset of the training data against a separate subset held out for prediction, the training data set is partitioned into two sets: trainingActivity1 (75%) and trainingActivity2 (25%). The partitioning is random and without replacement.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
subsets <- createDataPartition(y=trainingActivity$classe, p=0.75, list=FALSE)
trainingActivity1 <- trainingActivity[subsets, ]
trainingActivity2 <- trainingActivity[-subsets, ]
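A quick illustrative check that the stratified split preserved the class proportions in both subsets:

# createDataPartition() stratifies on classe, so the proportions should match
round(prop.table(table(trainingActivity1$classe)), 3)
round(prop.table(table(trainingActivity2$classe)), 3)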
Now, let’s build machine learning models to predict activity quality from the activity monitors. The models will be trained on the training subset, evaluated on the hold-out subset, and later used to predict the final outcome on the test data set. Two prediction models will be developed.
Build a first model (a decision tree) using the training set trainingActivity1, and predict the outcome using trainingActivity2.
library(rpart)
modFit1 <- rpart(classe ~ ., method = "class", data = trainingActivity1)
# print(modFit1)   # uncomment to inspect the fitted tree
# Plot the decision tree:
# install.packages("rpart.plot")   # uncomment if rpart.plot is not installed
library(rpart.plot)
rpart.plot(modFit1, main="Decision Tree", extra=102, under=TRUE, faclen=0)
# Predicting on the second training set:
prediction1 <- predict(modFit1, trainingActivity2, type = "class")
Build a second model (a Random Forest) using the same training set trainingActivity1, and predict the outcome using trainingActivity2.
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
# randomForest() infers classification from the factor response; it has no "method" argument
modFit2 <- randomForest(classe ~ ., data = trainingActivity1)
# Predicting on the second training set:
prediction2 <- predict(modFit2, trainingActivity2, type = "class")
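Before examining the full confusion matrices, a one-line accuracy comparison (a minimal sketch) already hints at the gap between the two models:

# Fraction of correctly classified hold-out observations for each model
mean(prediction1 == trainingActivity2$classe)  # decision tree
mean(prediction2 == trainingActivity2$classe)  # random forest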
Estimate the error for model 1:
confusionMatrix(prediction1, trainingActivity2$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1248 128 19 47 15
## B 63 511 44 60 48
## C 43 103 686 135 115
## D 15 79 50 500 48
## E 26 128 56 62 675
##
## Overall Statistics
##
## Accuracy : 0.7382
## 95% CI : (0.7256, 0.7504)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6685
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8946 0.5385 0.8023 0.6219 0.7492
## Specificity 0.9404 0.9456 0.9022 0.9532 0.9321
## Pos Pred Value 0.8566 0.7039 0.6340 0.7225 0.7128
## Neg Pred Value 0.9574 0.8952 0.9558 0.9278 0.9429
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2545 0.1042 0.1399 0.1020 0.1376
## Detection Prevalence 0.2971 0.1480 0.2206 0.1411 0.1931
## Balanced Accuracy 0.9175 0.7420 0.8523 0.7875 0.8406
Estimate the error for model 2:
confusionMatrix(prediction2, trainingActivity2$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 2 0 0 0
## B 1 943 5 0 0
## C 0 4 850 5 0
## D 0 0 0 798 0
## E 0 0 0 1 901
##
## Overall Statistics
##
## Accuracy : 0.9963
## 95% CI : (0.9942, 0.9978)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9954
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9937 0.9942 0.9925 1.0000
## Specificity 0.9994 0.9985 0.9978 1.0000 0.9998
## Pos Pred Value 0.9986 0.9937 0.9895 1.0000 0.9989
## Neg Pred Value 0.9997 0.9985 0.9988 0.9985 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2843 0.1923 0.1733 0.1627 0.1837
## Detection Prevalence 0.2847 0.1935 0.1752 0.1627 0.1839
## Balanced Accuracy 0.9994 0.9961 0.9960 0.9963 0.9999
The error estimates indicate that the Random Forest model is more reliable for predicting the outcome (accuracy: 0.9963) than the decision tree model (accuracy: 0.7382). We will therefore use the Random Forest model to predict the outcome on the test data set.
Note: if confusionMatrix() fails with “Error in confusionMatrix.default(prediction1, testActivity$classe) : the data and reference factors must have the same number of levels”, compare the factor levels with identical() to see why they differ:
identical(levels(prediction1),levels(trainingActivity2$classe))
## [1] TRUE
levels(prediction1); levels(trainingActivity2$classe)
## [1] "A" "B" "C" "D" "E"
## [1] "A" "B" "C" "D" "E"
The estimated out-of-sample error is 0.004, or 0.4%. The out-of-sample error is calculated as 1 minus the accuracy of predictions made on the cross-validation set. Given that the model accuracy is above 99% (accuracy = 0.9963), we can expect misclassification to be low.
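The same figure can be computed directly from the hold-out predictions (a sketch; cm2 and oosError are illustrative names):

# Out-of-sample error = 1 - accuracy on the hold-out set
cm2 <- confusionMatrix(prediction2, trainingActivity2$classe)
oosError <- 1 - as.numeric(cm2$overall["Accuracy"])
round(oosError, 4)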
# Variable importance for the Random Forest model
varImp2 <- varImp(modFit2)
varImp2
## Overall
## roll_belt 921.07404
## pitch_belt 515.84345
## yaw_belt 659.67257
## total_accel_belt 139.55256
## gyros_belt_x 72.74954
## gyros_belt_y 83.83180
## gyros_belt_z 214.91430
## accel_belt_x 87.15184
## accel_belt_y 92.32272
## accel_belt_z 311.79431
## magnet_belt_x 193.92199
## magnet_belt_y 311.76867
## magnet_belt_z 303.51828
## roll_arm 238.35762
## pitch_arm 129.24104
## yaw_arm 192.64289
## total_accel_arm 77.27176
## gyros_arm_x 98.45409
## gyros_arm_y 107.71552
## gyros_arm_z 43.75188
## accel_arm_x 163.35792
## accel_arm_y 120.73193
## accel_arm_z 98.17708
## magnet_arm_x 201.57630
## magnet_arm_y 164.01830
## magnet_arm_z 133.94216
## roll_dumbbell 310.58341
## pitch_dumbbell 141.18586
## yaw_dumbbell 182.01061
## total_accel_dumbbell 201.87726
## gyros_dumbbell_x 103.19062
## gyros_dumbbell_y 190.32112
## gyros_dumbbell_z 68.26597
## accel_dumbbell_x 184.63081
## accel_dumbbell_y 312.73727
## accel_dumbbell_z 249.24763
## magnet_dumbbell_x 370.45640
## magnet_dumbbell_y 513.92462
## magnet_dumbbell_z 573.78041
## roll_forearm 455.31805
## pitch_forearm 597.94589
## yaw_forearm 131.31054
## total_accel_forearm 81.62484
## gyros_forearm_x 54.63780
## gyros_forearm_y 89.85502
## gyros_forearm_z 62.22054
## accel_forearm_x 244.76172
## accel_forearm_y 112.52905
## accel_forearm_z 183.89611
## magnet_forearm_x 164.68869
## magnet_forearm_y 166.56791
## magnet_forearm_z 210.88238
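For a sorted, visual view of the same information, randomForest's built-in varImpPlot() can display, say, the 20 most important predictors (a sketch; the resulting plot is not shown here):

# Plot the 20 predictors with the largest mean decrease in Gini
varImpPlot(modFit2, n.var = 20, main = "Top 20 predictors (Random Forest)")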
Finally, apply the Random Forest model to the 20 test cases in pml-testing.csv.
# Cleaning the test data
testActivity <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!", ""))
# Apply the same cleaning as for the training data: drop the first 7
# identifier/timestamp columns and any column containing missing values
testActivity <- testActivity[, -c(1:7)]
testActivity <- testActivity[, colSums(is.na(testActivity)) == 0]
dim(testActivity)
## [1] 20 53
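One pitfall worth checking (a hedged sanity check, not in the original analysis): after the same cleaning steps the test set should contain exactly the 52 predictor columns used in training, with problem_id in place of classe:

# Only the outcome/identifier columns should differ between the two sets
setdiff(names(trainingActivity), names(testActivity))  # expected: "classe"
setdiff(names(testActivity), names(trainingActivity))  # expected: "problem_id"

predict() matches predictors by name, so the extra problem_id column is harmless.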
# Use the prediction model on the test data
TestPrediction <- predict(modFit2, testActivity, type = "class")
TestPrediction
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# Write one text file per prediction for submission
pml_write_files <- function(x) {
  n <- length(x)
  for (i in 1:n) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(TestPrediction)