Clean the environment.
rm(list = ls())
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
In this project for the Practical Machine Learning course, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise.
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this original source: http://groupware.les.inf.puc-rio.br/har.
I downloaded both files into the Data folder of my working directory.
The \(na.strings\) argument converts empty cells to \(NA\), so that the columns containing them can be removed later.
training.data <- read.csv("./Data/pml-training.csv", header = TRUE, sep = ",", stringsAsFactors = T, na.strings = c("", "NA"))
#class(training.data)
str(training.data)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 396 levels "-0.016850","-0.021024",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_belt : Factor w/ 316 levels "-0.021887","-0.060755",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt : Factor w/ 394 levels "-0.003095","-0.010002",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt.1 : Factor w/ 337 levels "-0.005928","-0.005960",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 3 levels "#DIV/0!","0.00",..: NA NA NA NA NA NA NA NA NA NA ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : Factor w/ 329 levels "-0.02438","-0.04190",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_arm : Factor w/ 327 levels "-0.00484","-0.01311",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_arm : Factor w/ 394 levels "-0.01548","-0.01749",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_arm : Factor w/ 330 levels "-0.00051","-0.00696",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_arm : Factor w/ 327 levels "-0.00184","-0.01185",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_arm : Factor w/ 394 levels "-0.00311","-0.00562",..: NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : Factor w/ 397 levels "-0.0035","-0.0073",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_dumbbell : Factor w/ 400 levels "-0.0163","-0.0233",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_dumbbell : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_dumbbell : Factor w/ 400 levels "-0.0082","-0.0096",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_dumbbell : Factor w/ 401 levels "-0.0053","-0.0084",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_dumbbell : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
# print(head(training.data, 1))
dim(training.data)
## [1] 19622 160
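To see what the na.strings setting changes, here is a minimal sketch on an invented three-row CSV (the column names and values are made up for illustration and are not taken from pml-training.csv). It compares the default behaviour, the setting used above, and a hypothetical variant that also treats "#DIV/0!" as missing.

```r
# Invented CSV text: one empty cell and one spreadsheet error string.
csv_text <- "id,kurtosis,roll\n1,,1.41\n2,0.5,1.42\n3,#DIV/0!,1.48"

# Default: only the literal string "NA" is treated as missing, so the
# empty cell survives as "" in a character column.
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Setting used in this report: empty cells become NA as well.
report_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                        na.strings = c("", "NA"))

# Hypothetical variant: also map "#DIV/0!" to NA, which lets R parse
# the column as numeric.
variant_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                         na.strings = c("", "NA", "#DIV/0!"))

# Number of missing values seen in the kurtosis column under each setting.
missing_per_read <- c(default = sum(is.na(default_read$kurtosis)),
                      report  = sum(is.na(report_read$kurtosis)),
                      variant = sum(is.na(variant_read$kurtosis)))
print(missing_per_read)
```

In this report the extra "#DIV/0!" entry is not strictly needed, because the columns that contain it also contain \(NA\) cells and are therefore dropped when columns with missing values are removed.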
This step removes the columns that contain \(NA\) values (including the empty cells converted to \(NA\) when the file was read), along with the first seven columns, which hold information that is unhelpful for the classification, such as the row index, timestamps and participant names.
training.cleaned.data <- training.data[8:length(training.data)]
remCol <- colSums(is.na(training.cleaned.data))
training.cleaned.data <- training.cleaned.data[, remCol == 0]
#print(head(training.data, 12))
#print(tail(training.data, 12))
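The same two-stage cleaning can be sketched on a toy data frame. The data below are invented, and only two bookkeeping columns are dropped here instead of the seven dropped in the report.

```r
# Invented data: two bookkeeping columns, one complete sensor column,
# one mostly-missing summary column, and the outcome.
toy <- data.frame(
  X             = 1:4,
  user_name     = c("a", "b", "a", "b"),
  roll_belt     = c(1.41, 1.42, 1.48, 1.45),
  max_roll_belt = c(NA, NA, NA, 5.2),
  classe        = factor(c("A", "A", "B", "B"))
)

# Stage 1: drop the leading bookkeeping columns by position.
toy_trimmed <- toy[, 3:ncol(toy)]

# Stage 2: count the NA cells per column and keep only complete columns.
na_counts <- colSums(is.na(toy_trimmed))
toy_clean <- toy_trimmed[, na_counts == 0]

print(na_counts)
print(names(toy_clean))
```

Because any column with at least one \(NA\) is discarded, this rule is strict; it suits this data set because the affected columns are almost entirely missing.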
This step splits the cleaned training data into a training set (75%) and a validation set (25%). The validation set is needed to estimate the performance of the classifier on data that it has not seen during training.
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(22519) # fix the random seed so that repeated runs give reproducible results
inTrain <- createDataPartition(training.cleaned.data$classe, p = 3/4)[[1]]
training.set <- training.cleaned.data[inTrain, ]
validation.set <- training.cleaned.data[-inTrain, ]
str(training.set)
## 'data.frame': 14718 obs. of 53 variables:
## $ roll_belt : num 1.41 1.42 1.48 1.42 1.42 1.43 1.45 1.43 1.45 1.48 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.09 8.13 8.16 8.18 8.18 8.2 8.15 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0 0.02 0.02 0.02 0.02 0.03 0.02 0 0 ...
## $ gyros_belt_y : num 0 0 0.02 0 0 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 0 0 ...
## $ accel_belt_x : int -21 -20 -21 -22 -22 -20 -21 -22 -21 -21 ...
## $ accel_belt_y : int 4 5 2 3 4 2 2 2 2 4 ...
## $ accel_belt_z : int 22 23 24 21 21 24 23 23 22 23 ...
## $ magnet_belt_x : int -3 -2 -6 -4 -2 1 -5 -2 -1 0 ...
## $ magnet_belt_y : int 599 600 600 599 603 602 596 602 597 592 ...
## $ magnet_belt_z : int -313 -305 -302 -311 -313 -312 -317 -319 -310 -305 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -129 -129 ...
## $ pitch_arm : num 22.5 22.5 22.1 21.9 21.8 21.7 21.5 21.5 21.4 21.3 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0 0 0.02 0.02 0.02 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.03 -0.03 -0.02 -0.03 -0.03 -0.03 0 0 ...
## $ gyros_arm_z : num -0.02 -0.02 0 0 0 -0.02 0 0 -0.03 -0.03 ...
## $ accel_arm_x : int -288 -289 -289 -289 -289 -288 -290 -288 -289 -289 ...
## $ accel_arm_y : int 109 110 111 111 111 109 110 111 111 109 ...
## $ accel_arm_z : int -123 -126 -123 -125 -124 -122 -123 -123 -124 -121 ...
## $ magnet_arm_x : int -368 -368 -374 -373 -372 -369 -366 -363 -374 -367 ...
## $ magnet_arm_y : int 337 344 337 336 338 341 339 343 342 340 ...
## $ magnet_arm_z : int 516 513 506 509 510 518 509 520 510 509 ...
## $ roll_dumbbell : num 13.1 12.9 13.4 13.1 12.8 ...
## $ pitch_dumbbell : num -70.5 -70.3 -70.4 -70.2 -70.3 ...
## $ yaw_dumbbell : num -84.9 -85.1 -84.9 -85.1 -85.1 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 0 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -234 -232 -233 -232 -234 -232 -233 -233 -234 -233 ...
## $ accel_dumbbell_y : int 47 46 48 47 46 47 47 47 47 48 ...
## $ accel_dumbbell_z : int -271 -270 -270 -270 -272 -269 -269 -270 -270 -271 ...
## $ magnet_dumbbell_x : int -559 -561 -554 -551 -555 -549 -564 -554 -554 -554 ...
## $ magnet_dumbbell_y : int 293 298 292 295 300 292 299 291 294 297 ...
## $ magnet_dumbbell_z : num -65 -63 -68 -70 -74 -65 -64 -65 -63 -73 ...
## $ roll_forearm : num 28.4 28.3 28 27.9 27.8 27.7 27.6 27.5 27.2 27.1 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 -63.8 -63.9 -64 ...
## $ yaw_forearm : num -153 -152 -152 -152 -152 -152 -152 -152 -151 -151 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.03 0.02 0.02 0.02 0.03 0.02 0.02 0 0.02 ...
## $ gyros_forearm_y : num 0 -0.02 0 0 -0.02 0 -0.02 0.02 -0.02 0 ...
## $ gyros_forearm_z : num -0.02 0 -0.02 -0.02 0 -0.02 -0.02 -0.03 -0.02 0 ...
## $ accel_forearm_x : int 192 196 189 195 193 193 193 191 192 194 ...
## $ accel_forearm_y : int 203 204 206 205 205 204 205 203 201 204 ...
## $ accel_forearm_z : int -215 -213 -214 -215 -213 -214 -214 -215 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -17 -18 -9 -16 -17 -11 -16 -13 ...
## $ magnet_forearm_y : num 654 658 655 659 660 653 657 657 656 656 ...
## $ magnet_forearm_z : num 476 469 473 470 474 476 465 478 472 471 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
str(validation.set)
## 'data.frame': 4904 obs. of 53 variables:
## $ roll_belt : num 1.41 1.48 1.45 1.45 1.42 1.42 1.51 1.6 1.57 1.56 ...
## $ pitch_belt : num 8.07 8.05 8.06 8.17 8.2 8.21 8.12 8.1 8.09 8.1 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.3 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0.02 0.02 0.02 0.03 0.02 0.02 0 0.02 0.02 0.02 ...
## $ gyros_belt_y : num 0 0 0 0 0 0 0 0 0.02 0 ...
## $ gyros_belt_z : num -0.02 -0.03 -0.02 0 0 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ accel_belt_x : int -22 -22 -21 -21 -22 -22 -21 -20 -21 -21 ...
## $ accel_belt_y : int 4 3 4 4 4 4 4 1 3 4 ...
## $ accel_belt_z : int 22 21 21 22 21 21 22 20 21 21 ...
## $ magnet_belt_x : int -7 -6 0 -3 -3 -8 -6 -10 -2 -4 ...
## $ magnet_belt_y : int 608 604 603 609 606 598 598 607 604 606 ...
## $ magnet_belt_z : int -311 -310 -312 -308 -309 -310 -317 -304 -313 -311 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -129 -129 -129 -129 ...
## $ pitch_arm : num 22.5 22.1 22 21.6 21.4 21.4 21.3 20.9 20.8 20.7 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.02 ...
## $ gyros_arm_y : num -0.02 -0.03 -0.03 -0.03 -0.02 0 0 -0.02 -0.02 -0.02 ...
## $ gyros_arm_z : num -0.02 0.02 0 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 ...
## $ accel_arm_x : int -290 -289 -289 -288 -287 -288 -289 -288 -289 -290 ...
## $ accel_arm_y : int 110 111 111 110 111 111 110 111 111 110 ...
## $ accel_arm_z : int -125 -123 -122 -124 -124 -124 -122 -124 -123 -123 ...
## $ magnet_arm_x : int -369 -372 -369 -376 -372 -371 -371 -375 -372 -373 ...
## $ magnet_arm_y : int 337 344 342 334 338 331 337 337 338 333 ...
## $ magnet_arm_z : int 513 512 513 516 509 523 512 513 510 509 ...
## $ roll_dumbbell : num 13.1 13.4 13.4 13.3 13.4 ...
## $ pitch_dumbbell : num -70.6 -70.4 -70.8 -70.9 -70.8 ...
## $ yaw_dumbbell : num -84.7 -84.9 -84.5 -84.4 -84.5 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0.02 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 -0.02 0 0 -0.02 -0.02 0 0 0 0 ...
## $ accel_dumbbell_x : int -233 -232 -234 -235 -234 -234 -233 -234 -233 -234 ...
## $ accel_dumbbell_y : int 47 48 48 48 48 48 47 48 48 48 ...
## $ accel_dumbbell_z : int -269 -269 -269 -270 -269 -268 -272 -269 -270 -270 ...
## $ magnet_dumbbell_x : int -555 -552 -558 -558 -552 -554 -551 -554 -554 -557 ...
## $ magnet_dumbbell_y : int 296 303 294 291 302 295 296 299 301 294 ...
## $ magnet_dumbbell_z : num -64 -60 -66 -69 -69 -68 -56 -72 -65 -69 ...
## $ roll_forearm : num 28.3 28.1 27.9 27.7 27.2 27.2 27.1 26.9 27 26.9 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.8 -63.9 -63.9 -64 -63.9 -63.9 -63.8 ...
## $ yaw_forearm : num -153 -152 -152 -152 -151 -151 -151 -151 -151 -151 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.02 0.02 0.02 0.02 0 0 0.02 0.03 0.02 0.02 ...
## $ gyros_forearm_y : num 0 -0.02 -0.02 0 0 -0.02 -0.02 -0.03 -0.03 -0.02 ...
## $ gyros_forearm_z : num -0.02 0 -0.03 -0.02 -0.03 -0.03 0 -0.02 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 189 193 190 193 193 192 194 191 194 ...
## $ accel_forearm_y : int 203 206 203 205 205 202 204 208 206 206 ...
## $ accel_forearm_z : int -216 -214 -215 -215 -215 -214 -213 -214 -213 -214 ...
## $ magnet_forearm_x : int -18 -16 -9 -22 -15 -14 -13 -11 -17 -10 ...
## $ magnet_forearm_y : num 661 658 660 656 655 659 653 654 654 653 ...
## $ magnet_forearm_z : num 473 469 478 473 472 478 481 469 478 467 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
#Dimensionality for comparison:
dim(training.data)
## [1] 19622 160
dim(training.cleaned.data)
## [1] 19622 53
dim(training.set)
## [1] 14718 53
dim(validation.set)
## [1] 4904 53
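createDataPartition samples within each class, so the training and validation sets keep similar class proportions. The base-R sketch below mimics that idea on invented labels; it is an approximation for illustration and not caret's actual implementation.

```r
set.seed(22519)

# Invented, imbalanced class labels.
labels <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Group the row indices by class, then draw 75% of each group.
idx_by_class <- split(seq_along(labels), labels)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  n_keep <- ceiling(0.75 * length(idx))
  idx[sample.int(length(idx), size = n_keep)]
}), use.names = FALSE))

train_labels <- labels[in_train]
valid_labels <- labels[-in_train]

# Class proportions are the same in both parts.
print(prop.table(table(train_labels)))
print(prop.table(table(valid_labels)))
```

Sampling per class matters most when some classes are rare, since a purely random split could leave very few examples of a class in the validation set.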
require(rpart)
## Loading required package: rpart
require(rpart.plot)
## Loading required package: rpart.plot
require(rattle)
## Loading required package: rattle
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
decision.tree <- rpart(classe ~ ., data = training.set, method = "class")
fancyRpartPlot(decision.tree, cex = 0.2, tweak = 2, palettes = c("Greys", "Oranges", "Reds", "Greens"), sub = "Decision tree")
# The gradient of the color in the decision tree represents the accuracy of that node.
decision.tree.2 <- rpart(classe ~ ., data = training.set, method = "class")
prp(decision.tree.2)
A predictive model is fitted using the Random Forest algorithm, which selects important variables automatically and is robust to correlated covariates and outliers. Five-fold cross-validation is used to tune the model, and each forest is limited to 10 trees (ntree = 10).
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
random.forest.control <- trainControl(method = "cv", number = 5)
random.forest.model <- train(classe ~ ., data = training.set, method = "rf", trControl = random.forest.control, ntree = 10)
random.forest.model
## Random Forest
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 11775, 11774, 11772, 11776, 11775
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9779867 0.9721520
## 27 0.9849839 0.9810045
## 52 0.9768308 0.9706881
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
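The mechanics of five-fold cross-validation can be sketched in base R. The example below uses the built-in iris data and a simple nearest-centroid classifier as a hypothetical stand-in for the random forest, so the numbers it produces say nothing about the model in this report; the fold assignment is also plain random rather than stratified.

```r
set.seed(1)
k <- 5

# Assign every row to one of k folds of equal size, in random order.
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Hypothetical stand-in classifier: predict the class whose mean
# feature vector is closest (squared Euclidean distance).
predict_nearest_centroid <- function(train, test) {
  features <- setdiff(names(train), "Species")
  # features x classes matrix of per-class means
  centroids <- sapply(split(train[features], train$Species), colMeans)
  x <- as.matrix(test[features])
  # rows x classes matrix of squared distances
  dist2 <- sapply(colnames(centroids), function(cl) {
    rowSums(sweep(x, 2, centroids[, cl])^2)
  })
  colnames(dist2)[apply(dist2, 1, which.min)]
}

# Train on k - 1 folds, score on the held-out fold, repeat for each fold.
fold_accuracy <- sapply(seq_len(k), function(f) {
  held_out <- iris[fold_id == f, ]
  held_in  <- iris[fold_id != f, ]
  predicted <- predict_nearest_centroid(held_in, held_out)
  mean(predicted == as.character(held_out$Species))
})

cv_accuracy <- mean(fold_accuracy)
print(fold_accuracy)
print(cv_accuracy)
```

caret repeats this loop once per candidate mtry value and keeps the value with the highest average accuracy, which is how mtry = 27 was selected above.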
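The confusion matrix for the training set is as follows. Because the true labels are passed as the first argument, the rows (labelled Prediction) hold the actual classes and the columns (labelled Reference) hold the predicted ones.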
random.forest.predict.1 <- predict(random.forest.model, training.set)
confusionMatrix(training.set$classe, random.forest.predict.1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4184 1 0 0 0
## B 1 2844 2 0 1
## C 0 1 2566 0 0
## D 0 0 2 2410 0
## E 0 0 0 2 2704
##
## Overall Statistics
##
## Accuracy : 0.9993
## 95% CI : (0.9988, 0.9997)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9991
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9998 0.9993 0.9984 0.9992 0.9996
## Specificity 0.9999 0.9997 0.9999 0.9998 0.9998
## Pos Pred Value 0.9998 0.9986 0.9996 0.9992 0.9993
## Neg Pred Value 0.9999 0.9998 0.9997 0.9998 0.9999
## Prevalence 0.2843 0.1934 0.1746 0.1639 0.1838
## Detection Rate 0.2843 0.1932 0.1743 0.1637 0.1837
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Balanced Accuracy 0.9998 0.9995 0.9992 0.9995 0.9997
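On the training set, both the accuracy (0.9993) and the kappa statistic (0.9991) are very high, which shows that the model fits the training data closely. Because these are in-sample figures, they are optimistic and do not estimate the out-of-sample error.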
The confusion matrix for the validation set is as follows (the true labels are again passed first, so the rows hold the actual classes):
random.forest.predict <- predict(random.forest.model, validation.set)
confusionMatrix(validation.set$classe, random.forest.predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1392 3 0 0 0
## B 7 927 12 3 0
## C 0 8 842 5 0
## D 0 1 11 792 0
## E 0 2 4 6 889
##
## Overall Statistics
##
## Accuracy : 0.9874
## 95% CI : (0.9838, 0.9903)
## No Information Rate : 0.2853
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.984
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9950 0.9851 0.9689 0.9826 1.0000
## Specificity 0.9991 0.9944 0.9968 0.9971 0.9970
## Pos Pred Value 0.9978 0.9768 0.9848 0.9851 0.9867
## Neg Pred Value 0.9980 0.9965 0.9933 0.9966 1.0000
## Prevalence 0.2853 0.1919 0.1772 0.1644 0.1813
## Detection Rate 0.2838 0.1890 0.1717 0.1615 0.1813
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9971 0.9898 0.9829 0.9899 0.9985
Both the accuracy (0.9874) and the kappa statistic (0.984) on the validation set indicate that the predictor has a low out-of-sample error rate.
model.accuracy <- postResample(random.forest.predict, validation.set$classe)
print(model.accuracy)
## Accuracy Kappa
## 0.9873573 0.9840081
oose <- 1 - as.numeric(confusionMatrix(validation.set$classe, random.forest.predict)$overall[1])
oose
## [1] 0.01264274
The estimated accuracy of the model is 98.74% and the estimated out-of-sample error is 1.26%.
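As a cross-check, the accuracy, the out-of-sample error and the kappa statistic can be recomputed by hand from the counts in the validation confusion matrix printed above. The sketch below retypes those counts; the dimension names actual and predicted are my own labels, chosen to reflect that the true classes were passed as the first argument.

```r
# Counts copied from the validation confusion matrix, row by row.
validation_counts <- matrix(
  c(1392,   3,   0,   0,   0,
       7, 927,  12,   3,   0,
       0,   8, 842,   5,   0,
       0,   1,  11, 792,   0,
       0,   2,   4,   6, 889),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

n_total <- sum(validation_counts)

# Accuracy is the share of cases on the diagonal.
accuracy <- sum(diag(validation_counts)) / n_total
oos_error <- 1 - accuracy

# Kappa compares the observed agreement with the agreement expected
# by chance from the row and column totals.
expected_agreement <- sum(rowSums(validation_counts) *
                          colSums(validation_counts)) / n_total^2
kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

print(round(c(accuracy = accuracy, oos_error = oos_error, kappa = kappa), 4))
```

The results agree with the reported values: 4842 of the 4904 validation cases are classified correctly, giving an accuracy of about 0.9874, an error of about 0.0126 and a kappa of about 0.9840.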
The test data are cleaned in the same way as the training data: the columns containing \(NA\) values or empty cells are removed, along with the columns that hold information unhelpful for the classification, such as the index, date and participant names.
test.data <- read.csv("./Data/pml-testing.csv", header = TRUE, sep = ",", stringsAsFactors = T, na.strings = c("", "NA"))
test.cleaned.data <- test.data[8:length(test.data)]
remCol <- colSums(is.na(test.cleaned.data))
test.cleaned.data <- test.cleaned.data[, remCol == 0]
str(test.cleaned.data)
## 'data.frame': 20 obs. of 53 variables:
## $ roll_belt : num 123 1.02 0.87 125 1.35 -5.92 1.2 0.43 0.93 114 ...
## $ pitch_belt : num 27 4.87 1.82 -41.6 3.33 1.59 4.44 4.15 6.72 22.4 ...
## $ yaw_belt : num -4.75 -88.9 -88.5 162 -88.6 -87.7 -87.3 -88.5 -93.7 -13.1 ...
## $ total_accel_belt : int 20 4 5 17 3 4 4 4 4 18 ...
## $ gyros_belt_x : num -0.5 -0.06 0.05 0.11 0.03 0.1 -0.06 -0.18 0.1 0.14 ...
## $ gyros_belt_y : num -0.02 -0.02 0.02 0.11 0.02 0.05 0 -0.02 0 0.11 ...
## $ gyros_belt_z : num -0.46 -0.07 0.03 -0.16 0 -0.13 0 -0.03 -0.02 -0.16 ...
## $ accel_belt_x : int -38 -13 1 46 -8 -11 -14 -10 -15 -25 ...
## $ accel_belt_y : int 69 11 -1 45 4 -16 2 -2 1 63 ...
## $ accel_belt_z : int -179 39 49 -156 27 38 35 42 32 -158 ...
## $ magnet_belt_x : int -13 43 29 169 33 31 50 39 -6 10 ...
## $ magnet_belt_y : int 581 636 631 608 566 638 622 635 600 601 ...
## $ magnet_belt_z : int -382 -309 -312 -304 -418 -291 -315 -305 -302 -330 ...
## $ roll_arm : num 40.7 0 0 -109 76.1 0 0 0 -137 -82.4 ...
## $ pitch_arm : num -27.8 0 0 55 2.76 0 0 0 11.2 -63.8 ...
## $ yaw_arm : num 178 0 0 -142 102 0 0 0 -167 -75.3 ...
## $ total_accel_arm : int 10 38 44 25 29 14 15 22 34 32 ...
## $ gyros_arm_x : num -1.65 -1.17 2.1 0.22 -1.96 0.02 2.36 -3.71 0.03 0.26 ...
## $ gyros_arm_y : num 0.48 0.85 -1.36 -0.51 0.79 0.05 -1.01 1.85 -0.02 -0.5 ...
## $ gyros_arm_z : num -0.18 -0.43 1.13 0.92 -0.54 -0.07 0.89 -0.69 -0.02 0.79 ...
## $ accel_arm_x : int 16 -290 -341 -238 -197 -26 99 -98 -287 -301 ...
## $ accel_arm_y : int 38 215 245 -57 200 130 79 175 111 -42 ...
## $ accel_arm_z : int 93 -90 -87 6 -30 -19 -67 -78 -122 -80 ...
## $ magnet_arm_x : int -326 -325 -264 -173 -170 396 702 535 -367 -420 ...
## $ magnet_arm_y : int 385 447 474 257 275 176 15 215 335 294 ...
## $ magnet_arm_z : int 481 434 413 633 617 516 217 385 520 493 ...
## $ roll_dumbbell : num -17.7 54.5 57.1 43.1 -101.4 ...
## $ pitch_dumbbell : num 25 -53.7 -51.4 -30 -53.4 ...
## $ yaw_dumbbell : num 126.2 -75.5 -75.2 -103.3 -14.2 ...
## $ total_accel_dumbbell: int 9 31 29 18 4 29 29 29 3 2 ...
## $ gyros_dumbbell_x : num 0.64 0.34 0.39 0.1 0.29 -0.59 0.34 0.37 0.03 0.42 ...
## $ gyros_dumbbell_y : num 0.06 0.05 0.14 -0.02 -0.47 0.8 0.16 0.14 -0.21 0.51 ...
## $ gyros_dumbbell_z : num -0.61 -0.71 -0.34 0.05 -0.46 1.1 -0.23 -0.39 -0.21 -0.03 ...
## $ accel_dumbbell_x : int 21 -153 -141 -51 -18 -138 -145 -140 0 -7 ...
## $ accel_dumbbell_y : int -15 155 155 72 -30 166 150 159 25 -20 ...
## $ accel_dumbbell_z : int 81 -205 -196 -148 -5 -186 -190 -191 9 7 ...
## $ magnet_dumbbell_x : int 523 -502 -506 -576 -424 -543 -484 -515 -519 -531 ...
## $ magnet_dumbbell_y : int -528 388 349 238 252 262 354 350 348 321 ...
## $ magnet_dumbbell_z : int -56 -36 41 53 312 96 97 53 -32 -164 ...
## $ roll_forearm : num 141 109 131 0 -176 150 155 -161 15.5 13.2 ...
## $ pitch_forearm : num 49.3 -17.6 -32.6 0 -2.16 1.46 34.5 43.6 -63.5 19.4 ...
## $ yaw_forearm : num 156 106 93 0 -47.9 89.7 152 -89.5 -139 -105 ...
## $ total_accel_forearm : int 33 39 34 43 24 43 32 47 36 24 ...
## $ gyros_forearm_x : num 0.74 1.12 0.18 1.38 -0.75 -0.88 -0.53 0.63 0.03 0.02 ...
## $ gyros_forearm_y : num -3.34 -2.78 -0.79 0.69 3.1 4.26 1.8 -0.74 0.02 0.13 ...
## $ gyros_forearm_z : num -0.59 -0.18 0.28 1.8 0.8 1.35 0.75 0.49 -0.02 -0.07 ...
## $ accel_forearm_x : int -110 212 154 -92 131 230 -192 -151 195 -212 ...
## $ accel_forearm_y : int 267 297 271 406 -93 322 170 -331 204 98 ...
## $ accel_forearm_z : int -149 -118 -129 -39 172 -144 -175 -282 -217 -7 ...
## $ magnet_forearm_x : int -714 -237 -51 -233 375 -300 -678 -109 0 -403 ...
## $ magnet_forearm_y : int 419 791 698 783 -787 800 284 -619 652 723 ...
## $ magnet_forearm_z : int 617 873 783 521 91 884 585 -32 469 512 ...
## $ problem_id : int 1 2 3 4 5 6 7 8 9 10 ...
dim(test.cleaned.data)
## [1] 20 53
test.results.predict <- predict(random.forest.model, test.cleaned.data[, -length(names(test.cleaned.data))])
test.results.predict
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
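The prediction above relies on the cleaned test set ending up with the same predictor columns as the training set, with problem_id in the last position. A more defensive alternative, sketched here on invented toy data frames, selects the test columns by name and stops if any predictor is missing.

```r
# Invented stand-ins for the cleaned training and test sets.
train_toy <- data.frame(
  roll_belt  = c(1.41, 1.42),
  pitch_belt = c(8.07, 8.06),
  classe     = factor(c("A", "B"))
)
test_toy <- data.frame(
  pitch_belt = 4.87,   # deliberately in a different column order
  roll_belt  = 1.02,
  problem_id = 1L
)

# Predictors are every training column except the outcome.
predictor_names <- setdiff(names(train_toy), "classe")

# Fail early if the test set lacks a predictor the model expects.
missing_cols <- setdiff(predictor_names, names(test_toy))
stopifnot(length(missing_cols) == 0)

# Select and order the test columns exactly as in the training set.
test_aligned <- test_toy[, predictor_names, drop = FALSE]
print(names(test_aligned))
```

Selecting by name keeps the predictors matched even if the \(NA\) filter were to remove a different set of columns from the much smaller test file.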
A model was built to classify how a physical exercise is performed, based on movement data. The estimated out-of-sample error is about 1.3%. This is a promising result for the use of machine learning to detect incorrectly performed exercises.
# Write each prediction to its own text file (test_case_1.txt, test_case_2.txt, ...).
function.write.files <- function(y) {
  path <- "I:/Coursera/Data Science Specialization/Course8_Machine learning/Assignments"
  # seq_along() is safe even if y is empty, unlike 1:length(y)
  for (j in seq_along(y)) {
    filename <- file.path(path, paste0("test_case_", j, ".txt"))
    write.table(y[j], file = filename, row.names = FALSE, col.names = FALSE, quote = FALSE)
  }
}
function.write.files(test.results.predict)
Sys.info()[1:2]
## sysname release
## "Windows" "7 x64"
R.version.string
## [1] "R version 3.2.4 Revised (2016-03-16 r70336)"