In this work, I analyze a Human Activity Recognition dataset provided by:
with the aim to determine how well a weight lifting exercise is being performed. To this aim, I use data from sensors located in the belt, arm, forearm and dumbbell used by the test subject. The data will be analyzed, summarized and used to train a random forest model which will allow me to classify new data points in one of five available categories.
For simplicity, all necessary libraries will be loaded now. If you wish to reproduce this work, make sure you have installed these libraries using the install.packages command.
library(caret)
library(dplyr)
library(ggplot2)
library(reshape2) # used for its melt function for the confusion matrix plot
I will load the data and choose only a handful of columns. The reason for this is that many columns are full of NAs, empty or do not represent variables important to the problem of determining how well the activity is being performed (for example raw_timestamp_part1, raw_timestamp_part2, new_window, etc). I made this decision after looking at head(mydf). I do not print this command here to save space.
mydf <- read.csv("./data/pml-training.csv")
mydf <- mydf %>% select(user_name, roll_belt, pitch_belt, yaw_belt,
total_accel_belt, gyros_belt_x, gyros_belt_y,
gyros_belt_z,
accel_belt_x, accel_belt_y, accel_belt_z,
magnet_belt_x, magnet_belt_y, magnet_belt_z,
roll_arm, pitch_arm, yaw_arm,
total_accel_arm, gyros_arm_x, gyros_arm_y, gyros_arm_z,
accel_arm_x, accel_arm_y, accel_arm_z,
magnet_arm_x, magnet_arm_y, magnet_arm_z,
roll_dumbbell, pitch_dumbbell, yaw_dumbbell,
total_accel_dumbbell, gyros_dumbbell_x, gyros_dumbbell_y,
gyros_dumbbell_z,
accel_dumbbell_x, accel_dumbbell_y, accel_dumbbell_z,
magnet_dumbbell_x, magnet_dumbbell_y, magnet_dumbbell_z,
roll_forearm, pitch_forearm, yaw_forearm,
total_accel_forearm, gyros_forearm_x, gyros_forearm_y,
gyros_forearm_z,
accel_forearm_x, accel_forearm_y, accel_forearm_z,
magnet_forearm_x, magnet_forearm_y, magnet_forearm_z,
classe)
Next, I will separate the data into training and testing sets using the createDataPartition command from the caret package. I will train my model on the former and test its accuracy on the later.
set.seed(123)
aux <- createDataPartition(mydf$classe, p = 0.75, list = FALSE)
dtrain <- mydf[aux, ]
dtest <- mydf[-aux, ]
The dimensions of our dtrain dataset are 14718 rows and 54 columns. As you can imagine, training a prediction algorithm with 52 numerical variables (user_name and the outcome, classe, are factor variables) would require a lot of computational time. For that reason, I will use Principal Component Analysis (PCA) to summarize the information contained in my dtrain data frame.
# We find the pre-processing model using the preProcess function from caret
ppm <- preProcess(dtrain[ , 2:53], method = "pca")
trainPC <- predict(ppm, dtrain[ , 2:53])
I will now train a random forest using my new data frame trainPC. I will use cross validation within the train command from the caret package. From the ?trainControl command, I learnt that method is the resampling method chosen, which in my case is cross-validation (coded as "cv"). The number of folds or resampling iterations is set by default as 10.
mdl <- train(dtrain$classe ~ ., data = trainPC, method = "rf",
trControl = trainControl(method = "cv"))
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.2.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
mdl
## Random Forest
##
## 14718 samples
## 24 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 13247, 13245, 13247, 13248, 13247, 13245, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9749972 0.9683676 0.004454494 0.005637988
## 13 0.9702422 0.9623589 0.006007687 0.007604784
## 25 0.9668457 0.9580620 0.006119953 0.007739255
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
mdl$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.3%
## Confusion matrix:
## A B C D E class.error
## A 4146 13 17 6 3 0.009318996
## B 56 2752 35 1 4 0.033707865
## C 4 33 2507 20 3 0.023373588
## D 3 1 97 2306 5 0.043946932
## E 2 7 14 15 2668 0.014042868
Now that I have my model, I will use it to predict the classe variable in the test data set. I will then use the confusionMatrix command from caret to determine the out-of-sample error.
testPC <- predict(ppm, dtest[ , 2:53])
prediction <- predict(mdl, newdata = testPC)
confusionMatrix(prediction, dtest$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1392 8 1 1 0
## B 2 926 10 0 1
## C 0 13 837 39 11
## D 0 0 5 764 5
## E 1 2 2 0 884
##
## Overall Statistics
##
## Accuracy : 0.9794
## 95% CI : (0.975, 0.9832)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9739
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9978 0.9758 0.9789 0.9502 0.9811
## Specificity 0.9972 0.9967 0.9844 0.9976 0.9988
## Pos Pred Value 0.9929 0.9862 0.9300 0.9871 0.9944
## Neg Pred Value 0.9991 0.9942 0.9955 0.9903 0.9958
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2838 0.1888 0.1707 0.1558 0.1803
## Detection Prevalence 0.2859 0.1915 0.1835 0.1578 0.1813
## Balanced Accuracy 0.9975 0.9862 0.9817 0.9739 0.9899
As it can be seen in the confusion matrix above (and in the Figure below), the model has an out-of-sample overall accuracy of 97.94%. In other words, the out-of-sample error is 2.06%. It is also remarkable that the sensitivity, i.e. the probability of positively identify the classe given that a data point belongs to it, is above 95% for all classes. The specificity, i.e. the probability that the method determines that a point does not belong to a classe when it does not, is above 98% in all cases.
confmatrix <- as.matrix(confusionMatrix(prediction, dtest$classe))
confmatrix <- apply(confmatrix, 2, function(x) round(x/sum(x), 3))
p <- ggplot(melt(confmatrix), aes(x=Var1, y=Var2, fill=value)) +
geom_tile()
p + xlab("Prediction") + ylab("Reference") + ggtitle("Confusion Matrix")
str(mydf)
## 'data.frame': 19622 obs. of 54 variables:
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 -0.02 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
## $ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
## $ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
## $ magnet_dumbbell_x : int -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
## $ magnet_dumbbell_y : int 293 296 298 303 292 294 295 300 292 291 ...
## $ magnet_dumbbell_z : num -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
## $ roll_forearm : num 28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
## $ yaw_forearm : num -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
## $ gyros_forearm_y : num 0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
## $ gyros_forearm_z : num -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
## $ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
## $ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
## $ magnet_forearm_y : num 654 661 658 658 655 660 659 660 653 656 ...
## $ magnet_forearm_z : num 476 473 469 469 473 478 470 474 476 473 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(mydf)
## user_name roll_belt pitch_belt yaw_belt
## adelmo :3892 Min. :-28.90 Min. :-55.8000 Min. :-180.00
## carlitos:3112 1st Qu.: 1.10 1st Qu.: 1.7600 1st Qu.: -88.30
## charles :3536 Median :113.00 Median : 5.2800 Median : -13.00
## eurico :3070 Mean : 64.41 Mean : 0.3053 Mean : -11.21
## jeremy :3402 3rd Qu.:123.00 3rd Qu.: 14.9000 3rd Qu.: 12.90
## pedro :2610 Max. :162.00 Max. : 60.3000 Max. : 179.00
## total_accel_belt gyros_belt_x gyros_belt_y gyros_belt_z
## Min. : 0.00 Min. :-1.040000 Min. :-0.64000 Min. :-1.4600
## 1st Qu.: 3.00 1st Qu.:-0.030000 1st Qu.: 0.00000 1st Qu.:-0.2000
## Median :17.00 Median : 0.030000 Median : 0.02000 Median :-0.1000
## Mean :11.31 Mean :-0.005592 Mean : 0.03959 Mean :-0.1305
## 3rd Qu.:18.00 3rd Qu.: 0.110000 3rd Qu.: 0.11000 3rd Qu.:-0.0200
## Max. :29.00 Max. : 2.220000 Max. : 0.64000 Max. : 1.6200
## accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## Min. :-120.000 Min. :-69.00 Min. :-275.00 Min. :-52.0
## 1st Qu.: -21.000 1st Qu.: 3.00 1st Qu.:-162.00 1st Qu.: 9.0
## Median : -15.000 Median : 35.00 Median :-152.00 Median : 35.0
## Mean : -5.595 Mean : 30.15 Mean : -72.59 Mean : 55.6
## 3rd Qu.: -5.000 3rd Qu.: 61.00 3rd Qu.: 27.00 3rd Qu.: 59.0
## Max. : 85.000 Max. :164.00 Max. : 105.00 Max. :485.0
## magnet_belt_y magnet_belt_z roll_arm pitch_arm
## Min. :354.0 Min. :-623.0 Min. :-180.00 Min. :-88.800
## 1st Qu.:581.0 1st Qu.:-375.0 1st Qu.: -31.77 1st Qu.:-25.900
## Median :601.0 Median :-320.0 Median : 0.00 Median : 0.000
## Mean :593.7 Mean :-345.5 Mean : 17.83 Mean : -4.612
## 3rd Qu.:610.0 3rd Qu.:-306.0 3rd Qu.: 77.30 3rd Qu.: 11.200
## Max. :673.0 Max. : 293.0 Max. : 180.00 Max. : 88.500
## yaw_arm total_accel_arm gyros_arm_x gyros_arm_y
## Min. :-180.0000 Min. : 1.00 Min. :-6.37000 Min. :-3.4400
## 1st Qu.: -43.1000 1st Qu.:17.00 1st Qu.:-1.33000 1st Qu.:-0.8000
## Median : 0.0000 Median :27.00 Median : 0.08000 Median :-0.2400
## Mean : -0.6188 Mean :25.51 Mean : 0.04277 Mean :-0.2571
## 3rd Qu.: 45.8750 3rd Qu.:33.00 3rd Qu.: 1.57000 3rd Qu.: 0.1400
## Max. : 180.0000 Max. :66.00 Max. : 4.87000 Max. : 2.8400
## gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## Min. :-2.3300 Min. :-404.00 Min. :-318.0 Min. :-636.00
## 1st Qu.:-0.0700 1st Qu.:-242.00 1st Qu.: -54.0 1st Qu.:-143.00
## Median : 0.2300 Median : -44.00 Median : 14.0 Median : -47.00
## Mean : 0.2695 Mean : -60.24 Mean : 32.6 Mean : -71.25
## 3rd Qu.: 0.7200 3rd Qu.: 84.00 3rd Qu.: 139.0 3rd Qu.: 23.00
## Max. : 3.0200 Max. : 437.00 Max. : 308.0 Max. : 292.00
## magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell
## Min. :-584.0 Min. :-392.0 Min. :-597.0 Min. :-153.71
## 1st Qu.:-300.0 1st Qu.: -9.0 1st Qu.: 131.2 1st Qu.: -18.49
## Median : 289.0 Median : 202.0 Median : 444.0 Median : 48.17
## Mean : 191.7 Mean : 156.6 Mean : 306.5 Mean : 23.84
## 3rd Qu.: 637.0 3rd Qu.: 323.0 3rd Qu.: 545.0 3rd Qu.: 67.61
## Max. : 782.0 Max. : 583.0 Max. : 694.0 Max. : 153.55
## pitch_dumbbell yaw_dumbbell total_accel_dumbbell
## Min. :-149.59 Min. :-150.871 Min. : 0.00
## 1st Qu.: -40.89 1st Qu.: -77.644 1st Qu.: 4.00
## Median : -20.96 Median : -3.324 Median :10.00
## Mean : -10.78 Mean : 1.674 Mean :13.72
## 3rd Qu.: 17.50 3rd Qu.: 79.643 3rd Qu.:19.00
## Max. : 149.40 Max. : 154.952 Max. :58.00
## gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
## Min. :-204.0000 Min. :-2.10000 Min. : -2.380
## 1st Qu.: -0.0300 1st Qu.:-0.14000 1st Qu.: -0.310
## Median : 0.1300 Median : 0.03000 Median : -0.130
## Mean : 0.1611 Mean : 0.04606 Mean : -0.129
## 3rd Qu.: 0.3500 3rd Qu.: 0.21000 3rd Qu.: 0.030
## Max. : 2.2200 Max. :52.00000 Max. :317.000
## accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x
## Min. :-419.00 Min. :-189.00 Min. :-334.00 Min. :-643.0
## 1st Qu.: -50.00 1st Qu.: -8.00 1st Qu.:-142.00 1st Qu.:-535.0
## Median : -8.00 Median : 41.50 Median : -1.00 Median :-479.0
## Mean : -28.62 Mean : 52.63 Mean : -38.32 Mean :-328.5
## 3rd Qu.: 11.00 3rd Qu.: 111.00 3rd Qu.: 38.00 3rd Qu.:-304.0
## Max. : 235.00 Max. : 315.00 Max. : 318.00 Max. : 592.0
## magnet_dumbbell_y magnet_dumbbell_z roll_forearm pitch_forearm
## Min. :-3600 Min. :-262.00 Min. :-180.0000 Min. :-72.50
## 1st Qu.: 231 1st Qu.: -45.00 1st Qu.: -0.7375 1st Qu.: 0.00
## Median : 311 Median : 13.00 Median : 21.7000 Median : 9.24
## Mean : 221 Mean : 46.05 Mean : 33.8265 Mean : 10.71
## 3rd Qu.: 390 3rd Qu.: 95.00 3rd Qu.: 140.0000 3rd Qu.: 28.40
## Max. : 633 Max. : 452.00 Max. : 180.0000 Max. : 89.80
## yaw_forearm total_accel_forearm gyros_forearm_x
## Min. :-180.00 Min. : 0.00 Min. :-22.000
## 1st Qu.: -68.60 1st Qu.: 29.00 1st Qu.: -0.220
## Median : 0.00 Median : 36.00 Median : 0.050
## Mean : 19.21 Mean : 34.72 Mean : 0.158
## 3rd Qu.: 110.00 3rd Qu.: 41.00 3rd Qu.: 0.560
## Max. : 180.00 Max. :108.00 Max. : 3.970
## gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
## Min. : -7.02000 Min. : -8.0900 Min. :-498.00 Min. :-632.0
## 1st Qu.: -1.46000 1st Qu.: -0.1800 1st Qu.:-178.00 1st Qu.: 57.0
## Median : 0.03000 Median : 0.0800 Median : -57.00 Median : 201.0
## Mean : 0.07517 Mean : 0.1512 Mean : -61.65 Mean : 163.7
## 3rd Qu.: 1.62000 3rd Qu.: 0.4900 3rd Qu.: 76.00 3rd Qu.: 312.0
## Max. :311.00000 Max. :231.0000 Max. : 477.00 Max. : 923.0
## accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z
## Min. :-446.00 Min. :-1280.0 Min. :-896.0 Min. :-973.0
## 1st Qu.:-182.00 1st Qu.: -616.0 1st Qu.: 2.0 1st Qu.: 191.0
## Median : -39.00 Median : -378.0 Median : 591.0 Median : 511.0
## Mean : -55.29 Mean : -312.6 Mean : 380.1 Mean : 393.6
## 3rd Qu.: 26.00 3rd Qu.: -73.0 3rd Qu.: 737.0 3rd Qu.: 653.0
## Max. : 291.00 Max. : 672.0 Max. :1480.0 Max. :1090.0
## classe
## A:5580
## B:3797
## C:3422
## D:3216
## E:3607
##
I share the code for this part of the asignment. However, I will not make the answers visible.
# Read the data
mytest <- read.csv("./data/pml-testing.csv")
# Select the appropriate variables
mytest2 <- mytest %>% select(user_name, roll_belt, pitch_belt, yaw_belt,
total_accel_belt, gyros_belt_x, gyros_belt_y,
gyros_belt_z,
accel_belt_x, accel_belt_y, accel_belt_z,
magnet_belt_x, magnet_belt_y, magnet_belt_z,
roll_arm, pitch_arm, yaw_arm,
total_accel_arm, gyros_arm_x, gyros_arm_y, gyros_arm_z,
accel_arm_x, accel_arm_y, accel_arm_z,
magnet_arm_x, magnet_arm_y, magnet_arm_z,
roll_dumbbell, pitch_dumbbell, yaw_dumbbell,
total_accel_dumbbell, gyros_dumbbell_x, gyros_dumbbell_y,
gyros_dumbbell_z,
accel_dumbbell_x, accel_dumbbell_y, accel_dumbbell_z,
magnet_dumbbell_x, magnet_dumbbell_y, magnet_dumbbell_z,
roll_forearm, pitch_forearm, yaw_forearm,
total_accel_forearm, gyros_forearm_x, gyros_forearm_y,
gyros_forearm_z,
accel_forearm_x, accel_forearm_y, accel_forearm_z,
magnet_forearm_x, magnet_forearm_y, magnet_forearm_z)
# Predict the values of the principal components
mytest2PC <- predict(ppm, newdata = mytest2[ , 2:53])
# Save the solution
answer <- data.frame(problem_id = mytest$problem_id)
answer$prediction <- predict(mdl, newdata = mytest2PC)
# Show the solution
answer
And this is the code I used to save the answers in text files. It is a slight variation as that proposed by the instructor of the course.
# We force the prediction to character format
answer$prediction <- as.character(answer$prediction)
# We save the answers
n = length(answer$problem_id)
for(i in 1:n){
filename = paste0("./answer/problem_id_",i,".txt")
write.table(answer$prediction[i], file = filename, quote = FALSE,
row.names = FALSE, col.names = FALSE)
}
I have submitted it for evaluation. I received 18 out of 20 points. Predictions for problem_id_6 and problem_id_11 were incorrect. From this, the accuracy could be estimated to be 90%.