1. Introduction.

With the arrival of wearable devices such as Jawbone Up, Nike FuelBand, and Fitbit, collecting extensive data on personal activity has become increasingly popular. These devices are central to the quantified-self movement, in which individuals routinely track their own data to improve their health, identify behavioral patterns, or simply out of interest in technology. While many users quantify how often they perform an activity, they often pay little attention to how well they perform it.

This project aims to bridge that gap by analyzing data from accelerometers placed on the belt, forearm, arm, and dumbbell of six participants. The participants performed barbell lifts, as instructed, correctly and incorrectly in five distinct ways, recorded as classes A through E in the “classe” variable.

2. Data Cleaning and Preprocessing

The training data can be downloaded from: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : chr  "carlitos" "carlitos" "carlitos" "carlitos" ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : chr  "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
##  $ new_window              : chr  "no" "no" "no" "no" ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt.1    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_arm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_dumbbell   : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_dumbbell  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_dumbbell   : logi  NA NA NA NA NA NA ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

Table: Data

3. Pre-Processing

Preprocessing the data is crucial to ensure the model’s accuracy and performance. The following steps were taken:

Removal of Near-Zero Variance Predictors: Variables with very little variance were removed as they provide little to no information for model training.

Handling Missing Data: Columns with excessive missing values (more than 50% NA) were excluded from the dataset.

Removing Irrelevant Columns: Columns like user_name, raw_timestamp_part_1, raw_timestamp_part_2, and cvtd_timestamp were removed as they don’t contribute to predicting “classe”.

Factorizing the Target Variable: The “classe” variable was converted to a factor to ensure it was treated as a categorical variable.
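The missing-data, irrelevant-column, and factor steps can be sketched in base R as follows. This is a minimal illustration on a small made-up data frame (`raw` and its values are hypothetical), not the code that produced the report. Near-zero-variance filtering is left out of the sketch; caret provides `nearZeroVar` for that step.

```r
# Toy data frame standing in for the raw training data (hypothetical values).
raw <- data.frame(
  user_name     = c("carlitos", "carlitos", "pedro", "pedro"),
  roll_belt     = c(1.41, 1.42, 1.48, 1.45),
  max_roll_belt = c(NA, NA, NA, 5.2),   # 75% missing
  classe        = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Drop columns where more than half of the values are NA.
na_share <- colMeans(is.na(raw))
cleaned <- raw[, na_share <= 0.5, drop = FALSE]

# Drop identifier and timestamp columns if they are present.
id_cols <- c("user_name", "raw_timestamp_part_1",
             "raw_timestamp_part_2", "cvtd_timestamp")
cleaned <- cleaned[, setdiff(names(cleaned), id_cols), drop = FALSE]

# Treat the target as a categorical variable.
cleaned$classe <- factor(cleaned$classe)

str(cleaned)
```

On the toy data only `roll_belt` and `classe` survive, mirroring how the 160 raw columns shrink to the sensor readings plus the target.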

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.884173 -0.120137  0.001538  0.002088  0.097787  0.849132

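4. Data Splitting

The 19,622 cleaned observations were split into a training set of 13,737 observations (about 70%) and a test set of 5,885 observations (about 30%), each with 52 predictors plus “classe”.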

## 'data.frame':    13737 obs. of  53 variables:
##  $ roll_belt           : num  1.41 1.41 1.48 1.42 1.43 1.45 1.43 1.42 1.42 1.48 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.09 8.16 8.17 8.18 8.2 8.21 8.15 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0.02 0.02 0.02 0.02 0.03 0.02 0.02 0.02 0 ...
##  $ gyros_belt_y        : num  0 0 0.02 0 0 0 0 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.02 -0.02 -0.02 -0.02 0 -0.02 0 -0.02 0 ...
##  $ accel_belt_x        : int  -21 -22 -21 -22 -20 -21 -22 -22 -22 -21 ...
##  $ accel_belt_y        : int  4 4 2 3 2 4 2 4 4 4 ...
##  $ accel_belt_z        : int  22 22 24 21 24 22 23 21 21 23 ...
##  $ magnet_belt_x       : int  -3 -7 -6 -4 1 -3 -2 -3 -8 0 ...
##  $ magnet_belt_y       : int  599 608 600 599 602 609 602 606 598 592 ...
##  $ magnet_belt_z       : int  -313 -311 -302 -311 -312 -308 -319 -309 -310 -305 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -129 ...
##  $ pitch_arm           : num  22.5 22.5 22.1 21.9 21.7 21.6 21.5 21.4 21.4 21.3 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0 0.02 0 0 0.02 0.02 0.02 0.02 0.02 0.02 ...
##  $ gyros_arm_y         : num  0 -0.02 -0.03 -0.03 -0.03 -0.03 -0.03 -0.02 0 0 ...
##  $ gyros_arm_z         : num  -0.02 -0.02 0 0 -0.02 -0.02 0 -0.02 -0.03 -0.03 ...
##  $ accel_arm_x         : int  -288 -290 -289 -289 -288 -288 -288 -287 -288 -289 ...
##  $ accel_arm_y         : int  109 110 111 111 109 110 111 111 111 109 ...
##  $ accel_arm_z         : int  -123 -125 -123 -125 -122 -124 -123 -124 -124 -121 ...
##  $ magnet_arm_x        : int  -368 -369 -374 -373 -369 -376 -363 -372 -371 -367 ...
##  $ magnet_arm_y        : int  337 337 337 336 341 334 343 338 331 340 ...
##  $ magnet_arm_z        : int  516 513 506 509 518 516 520 509 523 509 ...
##  $ roll_dumbbell       : num  13.1 13.1 13.4 13.1 13.2 ...
##  $ pitch_dumbbell      : num  -70.5 -70.6 -70.4 -70.2 -70.4 ...
##  $ yaw_dumbbell        : num  -84.9 -84.7 -84.9 -85.1 -84.9 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0.02 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 0 0 0 0 0 0 -0.02 -0.02 0 ...
##  $ accel_dumbbell_x    : int  -234 -233 -233 -232 -232 -235 -233 -234 -234 -233 ...
##  $ accel_dumbbell_y    : int  47 47 48 47 47 48 47 48 48 48 ...
##  $ accel_dumbbell_z    : int  -271 -269 -270 -270 -269 -270 -270 -269 -268 -271 ...
##  $ magnet_dumbbell_x   : int  -559 -555 -554 -551 -549 -558 -554 -552 -554 -554 ...
##  $ magnet_dumbbell_y   : int  293 296 292 295 292 291 291 302 295 297 ...
##  $ magnet_dumbbell_z   : num  -65 -64 -68 -70 -65 -69 -65 -69 -68 -73 ...
##  $ roll_forearm        : num  28.4 28.3 28 27.9 27.7 27.7 27.5 27.2 27.2 27.1 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 -63.9 -63.9 -64 ...
##  $ yaw_forearm         : num  -153 -153 -152 -152 -152 -152 -152 -151 -151 -151 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.02 0.02 0.02 0.03 0.02 0.02 0 0 0.02 ...
##  $ gyros_forearm_y     : num  0 0 0 0 0 0 0.02 0 -0.02 0 ...
##  $ gyros_forearm_z     : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.03 -0.03 -0.03 0 ...
##  $ accel_forearm_x     : int  192 192 189 195 193 190 191 193 193 194 ...
##  $ accel_forearm_y     : int  203 203 206 205 204 205 203 205 202 204 ...
##  $ accel_forearm_z     : int  -215 -216 -214 -215 -214 -215 -215 -215 -214 -215 ...
##  $ magnet_forearm_x    : int  -17 -18 -17 -18 -16 -22 -11 -15 -14 -13 ...
##  $ magnet_forearm_y    : num  654 661 655 659 653 656 657 655 659 656 ...
##  $ magnet_forearm_z    : num  476 473 473 470 476 473 478 472 478 471 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

Table: Train Data

## 'data.frame':    5885 obs. of  53 variables:
##  $ roll_belt           : num  1.42 1.48 1.45 1.42 1.45 1.45 1.57 1.56 1.51 1.43 ...
##  $ pitch_belt          : num  8.07 8.05 8.06 8.13 8.18 8.2 8.09 8.1 8.1 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.3 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0.02 0.02 0.02 0.03 0 0.02 0.02 0.02 0 ...
##  $ gyros_belt_y        : num  0 0 0 0 0 0 0.02 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.03 -0.02 -0.02 -0.02 0 -0.02 -0.02 -0.02 -0.03 ...
##  $ accel_belt_x        : int  -20 -22 -21 -22 -21 -21 -21 -21 -20 -22 ...
##  $ accel_belt_y        : int  5 3 4 4 2 2 3 4 4 4 ...
##  $ accel_belt_z        : int  23 21 21 21 23 22 21 21 22 19 ...
##  $ magnet_belt_x       : int  -2 -6 0 -2 -5 -1 -2 -4 -3 4 ...
##  $ magnet_belt_y       : int  600 604 603 603 596 597 604 606 601 602 ...
##  $ magnet_belt_z       : int  -305 -310 -312 -313 -317 -310 -313 -311 -318 -316 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -129 -129 -129 -129 -129 ...
##  $ pitch_arm           : num  22.5 22.1 22 21.8 21.5 21.4 20.8 20.7 20.7 20.5 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.02 -0.02 0.03 ...
##  $ gyros_arm_y         : num  -0.02 -0.03 -0.03 -0.02 -0.03 0 -0.02 -0.02 0 -0.02 ...
##  $ gyros_arm_z         : num  -0.02 0.02 0 0 0 -0.03 -0.02 -0.02 -0.02 0 ...
##  $ accel_arm_x         : int  -289 -289 -289 -289 -290 -289 -289 -290 -289 -290 ...
##  $ accel_arm_y         : int  110 111 111 111 110 111 111 110 110 110 ...
##  $ accel_arm_z         : int  -126 -123 -122 -124 -123 -124 -123 -123 -125 -126 ...
##  $ magnet_arm_x        : int  -368 -372 -369 -372 -366 -374 -372 -373 -374 -375 ...
##  $ magnet_arm_y        : int  344 344 342 338 339 342 338 333 350 339 ...
##  $ magnet_arm_z        : int  513 512 513 510 509 510 510 509 516 508 ...
##  $ roll_dumbbell       : num  12.9 13.4 13.4 12.8 13.1 ...
##  $ pitch_dumbbell      : num  -70.3 -70.4 -70.8 -70.3 -70.6 ...
##  $ yaw_dumbbell        : num  -85.1 -84.9 -84.5 -85.1 -84.7 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 -0.02 0 0 0 0 0 0 0 -0.02 ...
##  $ accel_dumbbell_x    : int  -232 -232 -234 -234 -233 -234 -233 -234 -235 -234 ...
##  $ accel_dumbbell_y    : int  46 48 48 46 47 47 48 48 47 48 ...
##  $ accel_dumbbell_z    : int  -270 -269 -269 -272 -269 -270 -270 -270 -271 -272 ...
##  $ magnet_dumbbell_x   : int  -561 -552 -558 -555 -564 -554 -554 -557 -558 -556 ...
##  $ magnet_dumbbell_y   : int  298 303 294 300 299 294 301 294 291 298 ...
##  $ magnet_dumbbell_z   : num  -63 -60 -66 -74 -64 -63 -65 -69 -71 -62 ...
##  $ roll_forearm        : num  28.3 28.1 27.9 27.8 27.6 27.2 27 26.9 27.1 26.7 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.8 -63.8 -63.9 -63.9 -63.8 -63.7 -63.7 ...
##  $ yaw_forearm         : num  -152 -152 -152 -152 -152 -151 -151 -151 -151 -151 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.02 0.02 0.02 0.02 0 0.02 0.02 0.03 0 ...
##  $ gyros_forearm_y     : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.03 -0.02 -0.03 -0.02 ...
##  $ gyros_forearm_z     : num  0 0 -0.03 0 -0.02 -0.02 -0.02 -0.02 0 -0.02 ...
##  $ accel_forearm_x     : int  196 189 193 193 193 192 191 194 193 196 ...
##  $ accel_forearm_y     : int  204 206 203 205 205 201 206 206 203 207 ...
##  $ accel_forearm_z     : int  -213 -214 -215 -213 -214 -214 -213 -214 -213 -216 ...
##  $ magnet_forearm_x    : int  -18 -16 -9 -9 -17 -16 -17 -10 -11 -15 ...
##  $ magnet_forearm_y    : num  658 658 660 660 657 656 654 653 661 650 ...
##  $ magnet_forearm_z    : num  469 469 478 474 465 472 478 467 470 473 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

Table: Test Data
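A split of this kind (roughly 70/30, judging by the row counts in the two tables) can be sketched in base R by sampling within each class so that class proportions are preserved. The labels, the seed, and the 0.7 fraction below are assumptions for illustration; the report does not state how its split was produced.

```r
set.seed(123)  # arbitrary seed, only for reproducibility of the sketch

# Toy labels standing in for the "classe" column (hypothetical data).
classe <- factor(rep(c("A", "B", "C", "D", "E"),
                     times = c(40, 30, 30, 20, 30)))

# Sample 70% of the row indices within each class.
train_idx <- unlist(
  lapply(split(seq_along(classe), classe), function(idx) {
    sample(idx, size = round(0.7 * length(idx)))
  }),
  use.names = FALSE
)

# Everything not sampled for training forms the test set.
test_idx <- setdiff(seq_along(classe), train_idx)

table(classe[train_idx])
table(classe[test_idx])
```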

Predictors and Data Reduction:

Highly correlated predictors can cause multicollinearity, which can affect the stability and interpretability of the model.

Correlation Matrix: Helps to understand the relationships between numeric predictors.

findCorrelation Function: Efficiently identifies highly correlated predictors, based on a specified threshold, so that they can be removed.

Data Reduction: Improves model stability by removing redundant information.
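The idea behind correlation-based reduction can be illustrated with a simplified base-R filter. This is a stand-in for the general technique, not caret's `findCorrelation` algorithm, and the toy predictors and the 0.9 cutoff are assumptions rather than values from the report.

```r
set.seed(42)  # arbitrary seed

# Toy numeric predictors; x2 is nearly a copy of x1 (hypothetical data).
x1 <- rnorm(200)
predictors <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.05),  # highly correlated with x1
  x3 = rnorm(200)                   # unrelated to the others
)

cutoff <- 0.9  # assumed threshold

# Absolute pairwise correlations; zero out the lower triangle and diagonal
# so that each pair of predictors is inspected exactly once.
cor_abs <- abs(cor(predictors))
cor_abs[lower.tri(cor_abs, diag = TRUE)] <- 0

# Flag the later column of every pair whose correlation exceeds the cutoff.
drop_cols <- names(which(apply(cor_abs, 2, max) > cutoff))
reduced <- predictors[, setdiff(names(predictors), drop_cols), drop = FALSE]

names(reduced)
```

Here `x2` is flagged because it duplicates the information in `x1`, while `x3` is kept.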

5. Modeling

Random Forest

## [1] "Random Forest"
## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12362, 12363, 12364, 12364, 12364, 12363, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9916286  0.9894096
##   27    0.9924291  0.9904234
##   52    0.9873331  0.9839752
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.

Gradient Boosting

## [1] "Gradient Boosting"
## Stochastic Gradient Boosting 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12362, 12363, 12364, 12364, 12364, 12363, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7530025  0.6870463
##   1                  100      0.8187374  0.7706256
##   1                  150      0.8530229  0.8140483
##   2                   50      0.8561540  0.8178078
##   2                  100      0.9060186  0.8810658
##   2                  150      0.9304792  0.9120372
##   3                   50      0.8943715  0.8663085
##   3                  100      0.9418351  0.9264136
##   3                  150      0.9609081  0.9505472
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1653   32    0    2    3
##          B   12 1073   28    7   10
##          C    5   34  980   24   21
##          D    1    0   16  923   17
##          E    3    0    2    8 1031
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9618          
##                  95% CI : (0.9565, 0.9665)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9516          
##                                           
##  Mcnemar's Test P-Value : 9.061e-08       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9875   0.9421   0.9552   0.9575   0.9529
## Specificity            0.9912   0.9880   0.9827   0.9931   0.9973
## Pos Pred Value         0.9781   0.9496   0.9211   0.9645   0.9875
## Neg Pred Value         0.9950   0.9861   0.9905   0.9917   0.9895
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2809   0.1823   0.1665   0.1568   0.1752
## Detection Prevalence   0.2872   0.1920   0.1808   0.1626   0.1774
## Balanced Accuracy      0.9893   0.9650   0.9689   0.9753   0.9751

Support Vector Machine (SVM) Implementation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    5    0    1    0
##          B    0 1133    3    0    1
##          C    0    1 1016   12    0
##          D    0    0    7  947    1
##          E    0    0    0    4 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9941          
##                  95% CI : (0.9917, 0.9959)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9925          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9947   0.9903   0.9824   0.9982
## Specificity            0.9986   0.9992   0.9973   0.9984   0.9992
## Pos Pred Value         0.9964   0.9965   0.9874   0.9916   0.9963
## Neg Pred Value         1.0000   0.9987   0.9979   0.9966   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1925   0.1726   0.1609   0.1835
## Detection Prevalence   0.2855   0.1932   0.1749   0.1623   0.1842
## Balanced Accuracy      0.9993   0.9969   0.9938   0.9904   0.9987

For this multiclass classification problem, three models were fitted: Random Forest, Gradient Boosting, and a Support Vector Machine. The Random Forest model was selected due to its robustness, its ability to handle large datasets with high dimensionality, and its relatively minimal tuning requirements. Its ability to handle correlated features also made it a good fit for this dataset.

Each model was trained using the caret package with 10-fold cross-validation, so that its performance is estimated on data that was not used for fitting.

Cross-Validation: This technique divides the training data into 10 parts, trains the model on 9 parts, and validates it on the remaining part. This process is repeated 10 times, with each part serving as the validation set once. The results are averaged to provide an estimate of model performance on unseen data.
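The fold bookkeeping described above can be sketched in base R. The sketch assigns the 13,737 training rows to 10 folds purely at random; caret builds its folds with class stratification, so the sample sizes in its output (12362, 12363, 12364, ...) can differ slightly from the ones computed here. The seed is arbitrary.

```r
set.seed(1)   # arbitrary seed
n <- 13737    # training-set size reported in the model output
k <- 10       # number of folds

# Randomly assign each row to one of k folds of near-equal size.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For fold i, rows with fold_id == i are held out for validation
# and the remaining rows are used for training.
holdout_sizes <- tabulate(fold_id, nbins = k)
train_sizes <- n - holdout_sizes

train_sizes
```

Each row is held out exactly once, so every model in the loop is trained on roughly 90% of the data and validated on the remaining 10%.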

Model comparison

The three fitted models are compared below on their accuracy and kappa across the 10 cross-validation resamples.

## 
## Call:
## summary.resamples(object = resamples)
## 
## Models: rf, svm, gbm 
## Number of resamples: 10 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rf  0.9876184 0.9912632 0.9923578 0.9924291 0.9934486 0.9970888    0
## svm 0.9898108 0.9912600 0.9916305 0.9914830 0.9919927 0.9934450    0
## gbm 0.9533867 0.9581670 0.9606987 0.9609081 0.9632590 0.9679767    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rf  0.9843351 0.9889503 0.9903318 0.9904234 0.9917127 0.9963177    0
## svm 0.9871118 0.9889428 0.9894149 0.9892271 0.9898732 0.9917082    0
## gbm 0.9410101 0.9470939 0.9502992 0.9505472 0.9535224 0.9594746    0

6. Model Evaluation - Random Forest

Because of its slightly higher mean cross-validated accuracy (0.9924, versus 0.9915 for SVM and 0.9609 for gradient boosting), the random forest model is used for evaluation on the test set.

confusionMatrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    6    0    0    0
##          B    1 1126    5    0    0
##          C    0    7 1018   10    4
##          D    0    0    3  954    4
##          E    0    0    0    0 1074
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9932          
##                  95% CI : (0.9908, 0.9951)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9914          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9886   0.9922   0.9896   0.9926
## Specificity            0.9986   0.9987   0.9957   0.9986   1.0000
## Pos Pred Value         0.9964   0.9947   0.9798   0.9927   1.0000
## Neg Pred Value         0.9998   0.9973   0.9983   0.9980   0.9983
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1913   0.1730   0.1621   0.1825
## Detection Prevalence   0.2853   0.1924   0.1766   0.1633   0.1825
## Balanced Accuracy      0.9990   0.9937   0.9939   0.9941   0.9963

Table: Test set validation - Random Forest

Confusion Matrix: This provides a detailed breakdown of the model’s performance across all classes, showing how often predictions were correct versus incorrect.

Accuracy: The overall accuracy, derived from the confusion matrix, was 0.9932 on the test set (95% CI: 0.9908 to 0.9951), indicating effective classification of the different exercise forms.
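As a check on how these statistics follow from the table, the random forest confusion matrix above can be re-entered by hand and the accuracy and per-class sensitivity recomputed in base R. The counts are copied from the report; the variable names are illustrative.

```r
# Random forest confusion matrix from the report
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(1673,    6,    0,   0,    0,
       1, 1126,    5,   0,    0,
       0,    7, 1018,  10,    4,
       0,    0,    3, 954,    4,
       0,    0,    0,   0, 1074),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Overall accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over the reference class total.
sensitivity <- diag(cm) / colSums(cm)

round(accuracy, 4)
round(sensitivity, 4)
```

The recomputed values agree with the reported accuracy of 0.9932 and with the Sensitivity row of the per-class statistics.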

View feature importance

Summary of the Visualizations:

Feature Importance Plot: Shows which features contributed most to the model’s predictions. Higher importance values indicate more influential features.

Confusion Matrix Visualization: A heatmap-style view of the distribution of correct and incorrect predictions, making it easier to spot misclassification patterns.

Accuracy Plot from Cross-Validation: Shows how the model’s accuracy varied across the cross-validation folds, helping assess its consistency.

Expected Out-of-Sample Error

The expected out-of-sample error was estimated using the cross-validation results. Because cross-validation averages performance over held-out subsets of the training data, it gives a reasonable estimate of how the model will perform on unseen data.

Expected Out-of-Sample Error Estimate: The Random Forest model was evaluated using 10-fold cross-validation. The average accuracy across the folds was approximately 99.24%. Therefore, the expected out-of-sample error, which reflects the error rate when the model is applied to new, unseen data, is estimated to be around 0.76% (1 - 0.9924).

## [1] "expected out of sample_error"
## [1] 0.007570892
## [1] "mean accuracy"
## [1] 0.9924291
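The relationship between the two printed numbers is simple arithmetic, shown here in base R. The accuracy is copied from the output above as printed (so the last digits of the result differ slightly from the report's unrounded value); the variable names are illustrative.

```r
mean_cv_accuracy <- 0.9924291       # mean 10-fold accuracy reported for mtry = 27

# Expected out-of-sample error is the complement of the accuracy.
oos_error <- 1 - mean_cv_accuracy

# The same figure expressed as a percentage.
oos_error_pct <- 100 * oos_error

oos_error
round(oos_error_pct, 2)
```

Expressed as a percentage, the error is about 0.76%, that is, fewer than one misclassification per hundred new observations.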

Prediction on New Data - 20 new sets using Random Forest

accuracies

| Set | Accuracy  |
|-----|-----------|
| 1   | 0.9965986 |
| 2   | 0.9897959 |
| 3   | 1.0000000 |
| 4   | 0.9966102 |
| 5   | 0.9965986 |
| 6   | 0.9965986 |
| 7   | 0.9795918 |
| 8   | 0.9931973 |
| 9   | 1.0000000 |
| 10  | 0.9830508 |
| 11  | 0.9864407 |
| 12  | 0.9931973 |
| 13  | 0.9931973 |
| 14  | 0.9965986 |
| 15  | 0.9966102 |
| 16  | 1.0000000 |
| 17  | 0.9830508 |
| 18  | 0.9863946 |
| 19  | 1.0000000 |
| 20  | 0.9965986 |

mean accuracy: 0.9932065
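The mean can be reproduced from the 20 per-set accuracies listed above. The values are copied from the report as printed; the vector name is illustrative.

```r
# Per-set accuracies copied from the report (20 sets).
set_accuracy <- c(
  0.9965986, 0.9897959, 1.0000000, 0.9966102, 0.9965986,
  0.9965986, 0.9795918, 0.9931973, 1.0000000, 0.9830508,
  0.9864407, 0.9931973, 0.9931973, 0.9965986, 0.9966102,
  1.0000000, 0.9830508, 0.9863946, 1.0000000, 0.9965986
)

mean_accuracy <- mean(set_accuracy)   # average accuracy over the 20 sets
range(set_accuracy)                   # worst and best set
mean_accuracy
```

The lowest set accuracy is 0.9795918 and four sets are classified perfectly, so the spread across sets is about two percentage points.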

7. Predictions - test

## 'data.frame':    20 obs. of  53 variables:
##  $ roll_belt           : num  123 1.02 0.87 125 1.35 -5.92 1.2 0.43 0.93 114 ...
##  $ pitch_belt          : num  27 4.87 1.82 -41.6 3.33 1.59 4.44 4.15 6.72 22.4 ...
##  $ yaw_belt            : num  -4.75 -88.9 -88.5 162 -88.6 -87.7 -87.3 -88.5 -93.7 -13.1 ...
##  $ total_accel_belt    : int  20 4 5 17 3 4 4 4 4 18 ...
##  $ gyros_belt_x        : num  -0.5 -0.06 0.05 0.11 0.03 0.1 -0.06 -0.18 0.1 0.14 ...
##  $ gyros_belt_y        : num  -0.02 -0.02 0.02 0.11 0.02 0.05 0 -0.02 0 0.11 ...
##  $ gyros_belt_z        : num  -0.46 -0.07 0.03 -0.16 0 -0.13 0 -0.03 -0.02 -0.16 ...
##  $ accel_belt_x        : int  -38 -13 1 46 -8 -11 -14 -10 -15 -25 ...
##  $ accel_belt_y        : int  69 11 -1 45 4 -16 2 -2 1 63 ...
##  $ accel_belt_z        : int  -179 39 49 -156 27 38 35 42 32 -158 ...
##  $ magnet_belt_x       : int  -13 43 29 169 33 31 50 39 -6 10 ...
##  $ magnet_belt_y       : int  581 636 631 608 566 638 622 635 600 601 ...
##  $ magnet_belt_z       : int  -382 -309 -312 -304 -418 -291 -315 -305 -302 -330 ...
##  $ roll_arm            : num  40.7 0 0 -109 76.1 0 0 0 -137 -82.4 ...
##  $ pitch_arm           : num  -27.8 0 0 55 2.76 0 0 0 11.2 -63.8 ...
##  $ yaw_arm             : num  178 0 0 -142 102 0 0 0 -167 -75.3 ...
##  $ total_accel_arm     : int  10 38 44 25 29 14 15 22 34 32 ...
##  $ gyros_arm_x         : num  -1.65 -1.17 2.1 0.22 -1.96 0.02 2.36 -3.71 0.03 0.26 ...
##  $ gyros_arm_y         : num  0.48 0.85 -1.36 -0.51 0.79 0.05 -1.01 1.85 -0.02 -0.5 ...
##  $ gyros_arm_z         : num  -0.18 -0.43 1.13 0.92 -0.54 -0.07 0.89 -0.69 -0.02 0.79 ...
##  $ accel_arm_x         : int  16 -290 -341 -238 -197 -26 99 -98 -287 -301 ...
##  $ accel_arm_y         : int  38 215 245 -57 200 130 79 175 111 -42 ...
##  $ accel_arm_z         : int  93 -90 -87 6 -30 -19 -67 -78 -122 -80 ...
##  $ magnet_arm_x        : int  -326 -325 -264 -173 -170 396 702 535 -367 -420 ...
##  $ magnet_arm_y        : int  385 447 474 257 275 176 15 215 335 294 ...
##  $ magnet_arm_z        : int  481 434 413 633 617 516 217 385 520 493 ...
##  $ roll_dumbbell       : num  -17.7 54.5 57.1 43.1 -101.4 ...
##  $ pitch_dumbbell      : num  25 -53.7 -51.4 -30 -53.4 ...
##  $ yaw_dumbbell        : num  126.2 -75.5 -75.2 -103.3 -14.2 ...
##  $ total_accel_dumbbell: int  9 31 29 18 4 29 29 29 3 2 ...
##  $ gyros_dumbbell_x    : num  0.64 0.34 0.39 0.1 0.29 -0.59 0.34 0.37 0.03 0.42 ...
##  $ gyros_dumbbell_y    : num  0.06 0.05 0.14 -0.02 -0.47 0.8 0.16 0.14 -0.21 0.51 ...
##  $ gyros_dumbbell_z    : num  -0.61 -0.71 -0.34 0.05 -0.46 1.1 -0.23 -0.39 -0.21 -0.03 ...
##  $ accel_dumbbell_x    : int  21 -153 -141 -51 -18 -138 -145 -140 0 -7 ...
##  $ accel_dumbbell_y    : int  -15 155 155 72 -30 166 150 159 25 -20 ...
##  $ accel_dumbbell_z    : int  81 -205 -196 -148 -5 -186 -190 -191 9 7 ...
##  $ magnet_dumbbell_x   : int  523 -502 -506 -576 -424 -543 -484 -515 -519 -531 ...
##  $ magnet_dumbbell_y   : int  -528 388 349 238 252 262 354 350 348 321 ...
##  $ magnet_dumbbell_z   : int  -56 -36 41 53 312 96 97 53 -32 -164 ...
##  $ roll_forearm        : num  141 109 131 0 -176 150 155 -161 15.5 13.2 ...
##  $ pitch_forearm       : num  49.3 -17.6 -32.6 0 -2.16 1.46 34.5 43.6 -63.5 19.4 ...
##  $ yaw_forearm         : num  156 106 93 0 -47.9 89.7 152 -89.5 -139 -105 ...
##  $ total_accel_forearm : int  33 39 34 43 24 43 32 47 36 24 ...
##  $ gyros_forearm_x     : num  0.74 1.12 0.18 1.38 -0.75 -0.88 -0.53 0.63 0.03 0.02 ...
##  $ gyros_forearm_y     : num  -3.34 -2.78 -0.79 0.69 3.1 4.26 1.8 -0.74 0.02 0.13 ...
##  $ gyros_forearm_z     : num  -0.59 -0.18 0.28 1.8 0.8 1.35 0.75 0.49 -0.02 -0.07 ...
##  $ accel_forearm_x     : int  -110 212 154 -92 131 230 -192 -151 195 -212 ...
##  $ accel_forearm_y     : int  267 297 271 406 -93 322 170 -331 204 98 ...
##  $ accel_forearm_z     : int  -149 -118 -129 -39 172 -144 -175 -282 -217 -7 ...
##  $ magnet_forearm_x    : int  -714 -237 -51 -233 375 -300 -678 -109 0 -403 ...
##  $ magnet_forearm_y    : int  419 791 698 783 -787 800 284 -619 652 723 ...
##  $ magnet_forearm_z    : int  617 873 783 521 91 884 585 -32 469 512 ...
##  $ problem_id          : int  1 2 3 4 5 6 7 8 9 10 ...

Table: Testing - Validation Data

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
8. Conclusion.

In this analysis, a Random Forest model was built using the caret package to predict the “classe” variable from wearable sensor readings recorded during exercise. The model was trained with 10-fold cross-validation, which gave an expected out-of-sample error of about 0.76%, and it reached 99.3% accuracy on the held-out test set. This approach provides a reliable method for predicting exercise form from sensor data.

Note: The predictions on the testing (validation) data showed a 100% match.