Introduction

Devices like Jawbone(R) UpTM, Nike(R) FuelBandTM and Fitbit(R) allows us to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement -– a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

image:

Objectives

This exercise will use supervised learning algorithms for predict the manner in which participants executed barbell lifts (correctly or incorrectly), and test the model on a test set

Accelerometer data was generously shared by the Laboratório de Engenharia de Software of the Pontific Catholic University of Rio de Janeiro.

Exploratory data analysis

First, we get the data from this link and check the structure of the file:

## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : chr  "carlitos" "carlitos" "carlitos" "carlitos" ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : chr  "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
##  $ new_window              : chr  "no" "no" "no" "no" ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt.1    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_arm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_dumbbell   : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_dumbbell  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_dumbbell   : logi  NA NA NA NA NA NA ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

Cleaning data

When we inspect the dataframe, we found that the 7 first columns has useless data for the modeling process (like user name, time stamp, window time, etc.), so we’ll suppress it. Also, it’s necessary to define the outcome variable (classe) as a factor, and finally splitting the data in train and test set using caret’s > createDataPartition() function.

data[,1:7] <- NULL
data$classe <- factor(data$classe, levels = c('A','B','C','D','E'))
max_min <- data.frame(max = apply(data, 2 , max),
                      min = apply(data, 2, min),
                      columns = names(data))
max_min$useless <- FALSE
max_min$useless[(is.na(max_min$min) & is.na(max_min$max))] <- TRUE
useless_columns <- max_min$columns[max_min$useless == TRUE]
useless_columns <- paste("-", useless_columns, sep = "")
data <- data %>% select_(.dots = useless_columns)
inTrain <- createDataPartition(y = data$classe, p = 3/4, list = FALSE)
trainset <- data[inTrain, ]
testset <- data[-inTrain, ]
dim(trainset)
## [1] 14718    53

Correlation between variables

It’s necessary to explore relationships between variables, so we’ll explore correlations betwenn variables:

print(as_tibble(cor(trainset[ , names(data) != "classe"])))
## # A tibble: 52 × 52
##    roll_belt pitch_belt yaw_belt total…¹ gyros…² gyros…³ gyros…⁴ accel…⁵ accel…⁶
##        <dbl>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1     1        -0.216     0.815   0.981  -0.121   0.466  -0.462  0.255   0.924 
##  2    -0.216     1        -0.700  -0.138  -0.430  -0.400  -0.109 -0.965   0.0825
##  3     0.815    -0.700     1       0.762   0.141   0.533  -0.275  0.708   0.599 
##  4     0.981    -0.138     0.762   1      -0.167   0.413  -0.477  0.171   0.927 
##  5    -0.121    -0.430     0.141  -0.167   1       0.331   0.339  0.470  -0.268 
##  6     0.466    -0.400     0.533   0.413   0.331   1       0.335  0.448   0.329 
##  7    -0.462    -0.109    -0.275  -0.477   0.339   0.335   1      0.119  -0.510 
##  8     0.255    -0.965     0.708   0.171   0.470   0.448   0.119  1      -0.0389
##  9     0.924     0.0825    0.599   0.927  -0.268   0.329  -0.510 -0.0389  1     
## 10    -0.992     0.161    -0.776  -0.975   0.155  -0.435   0.476 -0.208  -0.932 
## # … with 42 more rows, 43 more variables: accel_belt_z <dbl>,
## #   magnet_belt_x <dbl>, magnet_belt_y <dbl>, magnet_belt_z <dbl>,
## #   roll_arm <dbl>, pitch_arm <dbl>, yaw_arm <dbl>, total_accel_arm <dbl>,
## #   gyros_arm_x <dbl>, gyros_arm_y <dbl>, gyros_arm_z <dbl>, accel_arm_x <dbl>,
## #   accel_arm_y <dbl>, accel_arm_z <dbl>, magnet_arm_x <dbl>,
## #   magnet_arm_y <dbl>, magnet_arm_z <dbl>, roll_dumbbell <dbl>,
## #   pitch_dumbbell <dbl>, yaw_dumbbell <dbl>, total_accel_dumbbell <dbl>, …
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Model prediction

As we can notice, there are several degrees of correlation between variables, so we propose that this dataset it’s useful for prediction. After cleaning the data set, we are left with 52 variables, and for this example we will use the random forest model with the default options:

start_time <- Sys.time()
rf_mod <- train(classe ~ ., method = 'rf', data = trainset)
end_time <- Sys.time()
end_time - start_time
## Time difference of 57.84605 mins
rf_mod
## Random Forest 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 14718, 14718, 14718, 14718, 14718, 14718, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9889060  0.9859673
##   27    0.9895211  0.9867461
##   52    0.9853930  0.9815262
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.

Max accuracy of the model was 98.61%, with a model mtry = 27. This means that by using 27 predictors (variables) we can obtain an accuracy of 98.9%, which is quite good, so we will not try using other pre-processing options. When testing the model in the test set we obtained the following:

rfmod_pred <- predict(rf_mod, testset)
confusionMatrix(data = rfmod_pred, reference = testset$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1388    5    0    0    0
##          B    6  942    3    0    0
##          C    0    2  849    6    5
##          D    0    0    3  798    3
##          E    1    0    0    0  893
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9931          
##                  95% CI : (0.9903, 0.9952)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9912          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9950   0.9926   0.9930   0.9925   0.9911
## Specificity            0.9986   0.9977   0.9968   0.9985   0.9998
## Pos Pred Value         0.9964   0.9905   0.9849   0.9925   0.9989
## Neg Pred Value         0.9980   0.9982   0.9985   0.9985   0.9980
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2830   0.1921   0.1731   0.1627   0.1821
## Detection Prevalence   0.2841   0.1939   0.1758   0.1639   0.1823
## Balanced Accuracy      0.9968   0.9952   0.9949   0.9955   0.9954

As we can see, the accuracy of the model in the test set was high (accuracy = 99.39%, CI 95% [99.13 - 99.59]), besides the homogeneity between groups is high (kappa = 0.9923), the sensitivity and specificity values are excellent, so we can conclude that the model is suitable for use in exercise prediction with accelerometric data.

Modifications to the base model (such as variable pre-weighting or grouping of variables with PCA) could marginally improve the accuracy of the model.

References:

Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science. , pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.