Author: Marcos de Aguiar
This assignment consists in getting data from an experiment in which the participants uses on body sensors when doing a basic weight lifiting exercise. The subjects do the exercise correctly, and then do it incorrectly in 4 different ways.
The goal is to use the data collected trough various body sensors, to build a model, that can be used to detect incorrect exercise execution.
It is a classification problem, where “classe A” is the correct posture and the other “classes” are the exercise executed incorrectly by proper simulation of the subjects.
The main article describing it can be found here:
http://groupware.les.inf.puc-rio.br/public/papers/2013.Velloso.QAR-WLE.pdf
And the data the experiment generated can be also found here:
The main site of the project is:
library(caret)
library(randomForest)
Gets the dataset pml-training.csv, checks the initial data.
experimentDS <- read.csv("pml-training.csv")
head(experimentDS)
## X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp
## 1 1 carlitos 1323084231 788290 05/12/2011 11:23
## 2 2 carlitos 1323084231 808298 05/12/2011 11:23
## 3 3 carlitos 1323084231 820366 05/12/2011 11:23
## 4 4 carlitos 1323084232 120339 05/12/2011 11:23
## 5 5 carlitos 1323084232 196328 05/12/2011 11:23
## 6 6 carlitos 1323084232 304277 05/12/2011 11:23
## new_window num_window roll_belt pitch_belt yaw_belt total_accel_belt
## 1 no 11 1.41 8.07 -94.4 3
## 2 no 11 1.41 8.07 -94.4 3
## 3 no 11 1.42 8.07 -94.4 3
## 4 no 12 1.48 8.05 -94.4 3
## 5 no 12 1.48 8.07 -94.4 3
## 6 no 12 1.45 8.06 -94.4 3
## kurtosis_roll_belt kurtosis_picth_belt kurtosis_yaw_belt
## 1
## 2
## 3
## 4
## 5
## 6
## skewness_roll_belt skewness_roll_belt.1 skewness_yaw_belt max_roll_belt
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## max_picth_belt max_yaw_belt min_roll_belt min_pitch_belt min_yaw_belt
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## amplitude_roll_belt amplitude_pitch_belt amplitude_yaw_belt
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## var_total_accel_belt avg_roll_belt stddev_roll_belt var_roll_belt
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## avg_pitch_belt stddev_pitch_belt var_pitch_belt avg_yaw_belt
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## stddev_yaw_belt var_yaw_belt gyros_belt_x gyros_belt_y gyros_belt_z
## 1 NA NA 0.00 0.00 -0.02
## 2 NA NA 0.02 0.00 -0.02
## 3 NA NA 0.00 0.00 -0.02
## 4 NA NA 0.02 0.00 -0.03
## 5 NA NA 0.02 0.02 -0.02
## 6 NA NA 0.02 0.00 -0.02
## accel_belt_x accel_belt_y accel_belt_z magnet_belt_x magnet_belt_y
## 1 -21 4 22 -3 599
## 2 -22 4 22 -7 608
## 3 -20 5 23 -2 600
## 4 -22 3 21 -6 604
## 5 -21 2 24 -6 600
## 6 -21 4 21 0 603
## magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm var_accel_arm
## 1 -313 -128 22.5 -161 34 NA
## 2 -311 -128 22.5 -161 34 NA
## 3 -305 -128 22.5 -161 34 NA
## 4 -310 -128 22.1 -161 34 NA
## 5 -302 -128 22.1 -161 34 NA
## 6 -312 -128 22.0 -161 34 NA
## avg_roll_arm stddev_roll_arm var_roll_arm avg_pitch_arm stddev_pitch_arm
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## var_pitch_arm avg_yaw_arm stddev_yaw_arm var_yaw_arm gyros_arm_x
## 1 NA NA NA NA 0.00
## 2 NA NA NA NA 0.02
## 3 NA NA NA NA 0.02
## 4 NA NA NA NA 0.02
## 5 NA NA NA NA 0.00
## 6 NA NA NA NA 0.02
## gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y accel_arm_z magnet_arm_x
## 1 0.00 -0.02 -288 109 -123 -368
## 2 -0.02 -0.02 -290 110 -125 -369
## 3 -0.02 -0.02 -289 110 -126 -368
## 4 -0.03 0.02 -289 111 -123 -372
## 5 -0.03 0.00 -289 111 -123 -374
## 6 -0.03 0.00 -289 111 -122 -369
## magnet_arm_y magnet_arm_z kurtosis_roll_arm kurtosis_picth_arm
## 1 337 516
## 2 337 513
## 3 344 513
## 4 344 512
## 5 337 506
## 6 342 513
## kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm skewness_yaw_arm
## 1
## 2
## 3
## 4
## 5
## 6
## max_roll_arm max_picth_arm max_yaw_arm min_roll_arm min_pitch_arm
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## min_yaw_arm amplitude_roll_arm amplitude_pitch_arm amplitude_yaw_arm
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## roll_dumbbell pitch_dumbbell yaw_dumbbell kurtosis_roll_dumbbell
## 1 13.05217 -70.49400 -84.87394
## 2 13.13074 -70.63751 -84.71065
## 3 12.85075 -70.27812 -85.14078
## 4 13.43120 -70.39379 -84.87363
## 5 13.37872 -70.42856 -84.85306
## 6 13.38246 -70.81759 -84.46500
## kurtosis_picth_dumbbell kurtosis_yaw_dumbbell skewness_roll_dumbbell
## 1
## 2
## 3
## 4
## 5
## 6
## skewness_pitch_dumbbell skewness_yaw_dumbbell max_roll_dumbbell
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell min_pitch_dumbbell
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## min_yaw_dumbbell amplitude_roll_dumbbell amplitude_pitch_dumbbell
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## amplitude_yaw_dumbbell total_accel_dumbbell var_accel_dumbbell
## 1 37 NA
## 2 37 NA
## 3 37 NA
## 4 37 NA
## 5 37 NA
## 6 37 NA
## avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## avg_yaw_dumbbell stddev_yaw_dumbbell var_yaw_dumbbell gyros_dumbbell_x
## 1 NA NA NA 0
## 2 NA NA NA 0
## 3 NA NA NA 0
## 4 NA NA NA 0
## 5 NA NA NA 0
## 6 NA NA NA 0
## gyros_dumbbell_y gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 1 -0.02 0.00 -234 47
## 2 -0.02 0.00 -233 47
## 3 -0.02 0.00 -232 46
## 4 -0.02 -0.02 -232 48
## 5 -0.02 0.00 -233 48
## 6 -0.02 0.00 -234 48
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z
## 1 -271 -559 293 -65
## 2 -269 -555 296 -64
## 3 -270 -561 298 -63
## 4 -269 -552 303 -60
## 5 -270 -554 292 -68
## 6 -269 -558 294 -66
## roll_forearm pitch_forearm yaw_forearm kurtosis_roll_forearm
## 1 28.4 -63.9 -153
## 2 28.3 -63.9 -153
## 3 28.3 -63.9 -152
## 4 28.1 -63.9 -152
## 5 28.0 -63.9 -152
## 6 27.9 -63.9 -152
## kurtosis_picth_forearm kurtosis_yaw_forearm skewness_roll_forearm
## 1
## 2
## 3
## 4
## 5
## 6
## skewness_pitch_forearm skewness_yaw_forearm max_roll_forearm
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## max_picth_forearm max_yaw_forearm min_roll_forearm min_pitch_forearm
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## min_yaw_forearm amplitude_roll_forearm amplitude_pitch_forearm
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## amplitude_yaw_forearm total_accel_forearm var_accel_forearm
## 1 36 NA
## 2 36 NA
## 3 36 NA
## 4 36 NA
## 5 36 NA
## 6 36 NA
## avg_roll_forearm stddev_roll_forearm var_roll_forearm avg_pitch_forearm
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## stddev_yaw_forearm var_yaw_forearm gyros_forearm_x gyros_forearm_y
## 1 NA NA 0.03 0.00
## 2 NA NA 0.02 0.00
## 3 NA NA 0.03 -0.02
## 4 NA NA 0.02 -0.02
## 5 NA NA 0.02 0.00
## 6 NA NA 0.02 -0.02
## gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## 1 -0.02 192 203 -215
## 2 -0.02 192 203 -216
## 3 0.00 196 204 -213
## 4 0.00 189 206 -214
## 5 -0.02 189 206 -214
## 6 -0.03 193 203 -215
## magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
## 1 -17 654 476 A
## 2 -18 661 473 A
## 3 -18 658 469 A
## 4 -16 658 469 A
## 5 -17 655 473 A
## 6 -9 660 478 A
As we can see, there is a lot of empty fields and NA values. To fit all this to a model, considering that there is a lot of data to use as predictors, a lot of this noisy and falty data must be discarded, and only the consistent values will be used to predict.
## Clean data, remove colunms that are will not be used as predictors.
experimentDS$X <- NULL
experimentDS$user_name <- NULL
experimentDS$raw_timestamp_part_1 <- NULL
experimentDS$raw_timestamp_part_2 <- NULL
experimentDS$cvtd_timestamp <- NULL
experimentDS$new_window <- NULL
experimentDS$num_window <- NULL
This colunms are removed because they don’t have any relevance in the prediction, since user name, or time stamp will not help to make co relations.
The next step is to remove colunms that have NA and empty values.
res <- sapply(experimentDS, function(x)all(!is.na(x)))
completeCols <- names(which(res))
experimentDS <- experimentDS[, completeCols]
res <- sapply(experimentDS, function(x)all(!(x == "")))
completeCols <- names(which(res))
experimentDS <- experimentDS[, completeCols]
Splits the experiment data set into traning and test dataset. As the rule of the thumb suggests 70% of the data is used for training and the remainder for testing.
A seed is set to have the code results being reproducible.
set.seed(2323)
inTrain = createDataPartition(experimentDS$classe, p = 0.7)[[1]]
training = experimentDS[ inTrain,]
testing = experimentDS[-inTrain,]
Random forest model is used. The reason being is that it’s a good match and is the one used in the original article.
The randomForest function is used, because using the “train” with “rf” from caret takes a very long time to compute.
rfModel <- randomForest(classe~.,data=training)
Then the test dataset is used to verify the accuracy of the algorithm.
rfResult <- predict(rfModel, testing)
The accuracy result:
confusionMatrix(testing$classe, rfResult)$overall['Accuracy']
## Accuracy
## 0.9942226
The accuracy seems to be too good. But in the final part, it’s possible to see that the model is correct.
For the final part, the model is used to predict an unclassified data.
First the test dataset is loaded. It has only 20 rows to be classified.
testDS <- read.csv("pml-testing.csv")
nrow(testDS)
## [1] 20
Then the model is used to predict the “classe” of the data.
rfTestResult <- predict(rfModel, testDS)
rfTestResult
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Even tough the accuracy seems to high, which could be an alarm for overfitting. When testing in a unclassified data set and submitted as the test cases for the assignment, the prediction was correct in 100% of the 20 cases, which proves that the model was correctly fitted.