Devices like Jawbone(R) UpTM, Nike(R) FuelBandTM and Fitbit(R) allows us to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement -– a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
image:
This exercise will use supervised learning algorithms for predict the manner in which participants executed barbell lifts (correctly or incorrectly), and test the model on a test set
Accelerometer data was generously shared by the Laboratório de Engenharia de Software of the Pontific Catholic University of Rio de Janeiro.
First, we get the data from this link and check the structure of the file:
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : chr "carlitos" "carlitos" "carlitos" "carlitos" ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : chr "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
## $ new_window : chr "no" "no" "no" "no" ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_belt : logi NA NA NA NA NA NA ...
## $ skewness_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt.1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_belt : logi NA NA NA NA NA NA ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ skewness_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
When we inspect the dataframe, we found that the 7 first columns has useless data for the modeling process (like user name, time stamp, window time, etc.), so we’ll suppress it. Also, it’s necessary to define the outcome variable (classe) as a factor, and finally splitting the data in train and test set using caret’s > createDataPartition() function.
data[,1:7] <- NULL
data$classe <- factor(data$classe, levels = c('A','B','C','D','E'))
max_min <- data.frame(max = apply(data, 2 , max),
min = apply(data, 2, min),
columns = names(data))
max_min$useless <- FALSE
max_min$useless[(is.na(max_min$min) & is.na(max_min$max))] <- TRUE
useless_columns <- max_min$columns[max_min$useless == TRUE]
useless_columns <- paste("-", useless_columns, sep = "")
data <- data %>% select_(.dots = useless_columns)
inTrain <- createDataPartition(y = data$classe, p = 3/4, list = FALSE)
trainset <- data[inTrain, ]
testset <- data[-inTrain, ]
dim(trainset)
## [1] 14718 53
It’s necessary to explore relationships between variables, so we’ll explore correlations betwenn variables:
print(as_tibble(cor(trainset[ , names(data) != "classe"])))
## # A tibble: 52 × 52
## roll_belt pitch_belt yaw_belt total…¹ gyros…² gyros…³ gyros…⁴ accel…⁵ accel…⁶
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 -0.216 0.815 0.981 -0.121 0.466 -0.462 0.255 0.924
## 2 -0.216 1 -0.700 -0.138 -0.430 -0.400 -0.109 -0.965 0.0825
## 3 0.815 -0.700 1 0.762 0.141 0.533 -0.275 0.708 0.599
## 4 0.981 -0.138 0.762 1 -0.167 0.413 -0.477 0.171 0.927
## 5 -0.121 -0.430 0.141 -0.167 1 0.331 0.339 0.470 -0.268
## 6 0.466 -0.400 0.533 0.413 0.331 1 0.335 0.448 0.329
## 7 -0.462 -0.109 -0.275 -0.477 0.339 0.335 1 0.119 -0.510
## 8 0.255 -0.965 0.708 0.171 0.470 0.448 0.119 1 -0.0389
## 9 0.924 0.0825 0.599 0.927 -0.268 0.329 -0.510 -0.0389 1
## 10 -0.992 0.161 -0.776 -0.975 0.155 -0.435 0.476 -0.208 -0.932
## # … with 42 more rows, 43 more variables: accel_belt_z <dbl>,
## # magnet_belt_x <dbl>, magnet_belt_y <dbl>, magnet_belt_z <dbl>,
## # roll_arm <dbl>, pitch_arm <dbl>, yaw_arm <dbl>, total_accel_arm <dbl>,
## # gyros_arm_x <dbl>, gyros_arm_y <dbl>, gyros_arm_z <dbl>, accel_arm_x <dbl>,
## # accel_arm_y <dbl>, accel_arm_z <dbl>, magnet_arm_x <dbl>,
## # magnet_arm_y <dbl>, magnet_arm_z <dbl>, roll_dumbbell <dbl>,
## # pitch_dumbbell <dbl>, yaw_dumbbell <dbl>, total_accel_dumbbell <dbl>, …
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
As we can notice, there are several degrees of correlation between variables, so we propose that this dataset it’s useful for prediction. After cleaning the data set, we are left with 52 variables, and for this example we will use the random forest model with the default options:
start_time <- Sys.time()
rf_mod <- train(classe ~ ., method = 'rf', data = trainset)
end_time <- Sys.time()
end_time - start_time
## Time difference of 57.84605 mins
rf_mod
## Random Forest
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 14718, 14718, 14718, 14718, 14718, 14718, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9889060 0.9859673
## 27 0.9895211 0.9867461
## 52 0.9853930 0.9815262
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
Max accuracy of the model was 98.61%, with a model mtry = 27. This means that by using 27 predictors (variables) we can obtain an accuracy of 98.9%, which is quite good, so we will not try using other pre-processing options. When testing the model in the test set we obtained the following:
rfmod_pred <- predict(rf_mod, testset)
confusionMatrix(data = rfmod_pred, reference = testset$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1388 5 0 0 0
## B 6 942 3 0 0
## C 0 2 849 6 5
## D 0 0 3 798 3
## E 1 0 0 0 893
##
## Overall Statistics
##
## Accuracy : 0.9931
## 95% CI : (0.9903, 0.9952)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9912
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9950 0.9926 0.9930 0.9925 0.9911
## Specificity 0.9986 0.9977 0.9968 0.9985 0.9998
## Pos Pred Value 0.9964 0.9905 0.9849 0.9925 0.9989
## Neg Pred Value 0.9980 0.9982 0.9985 0.9985 0.9980
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2830 0.1921 0.1731 0.1627 0.1821
## Detection Prevalence 0.2841 0.1939 0.1758 0.1639 0.1823
## Balanced Accuracy 0.9968 0.9952 0.9949 0.9955 0.9954
As we can see, the accuracy of the model in the test set was high (accuracy = 99.39%, CI 95% [99.13 - 99.59]), besides the homogeneity between groups is high (kappa = 0.9923), the sensitivity and specificity values are excellent, so we can conclude that the model is suitable for use in exercise prediction with accelerometric data.
Modifications to the base model (such as variable pre-weighting or grouping of variables with PCA) could marginally improve the accuracy of the model.
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science. , pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.