Practical Machine Learning Course - Project

Clean the environment.

rm(list = ls())

Install the required R packages and load the libraries.

install.packages("knitr")

install.packages("markdown")

library(knitr)

library(markdown)

Synopsis.

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Project assignment.

In this project for the Practical Machine Learning course, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise.

Goal of the project.

The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

Step 1: Perform the data exploration.

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this original source: http://groupware.les.inf.puc-rio.br/har.

For this analysis, both files were downloaded beforehand into a local Data directory under the working directory.
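If the files are not yet on disk, they can be fetched once and reused on later runs. The following is a minimal sketch and not necessarily how the files were obtained for this report: `fetch_if_missing` is a hypothetical helper name, and the `./Data` destination is taken from the `read.csv` calls used in this report.

```r
# Hypothetical helper: download a file only when it is not already on disk.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # Create the target directory if needed, then download in binary mode.
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Usage (commented out so that the sketch does not touch the network when sourced):
# fetch_if_missing("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
#                  "./Data/pml-training.csv")
# fetch_if_missing("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
#                  "./Data/pml-testing.csv")
```

Skipping the download when the file already exists keeps repeated runs of the report fast and independent of the network.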

The \(na.strings\) argument converts empty cells and the string NA to \(NA\) on import, so that the mostly empty columns can be removed later by checking for \(NA\) values.

training.data <- read.csv("./Data/pml-training.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE, na.strings = c("", "NA"))

#class(training.data)
str(training.data)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : Factor w/ 396 levels "-0.016850","-0.021024",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_belt     : Factor w/ 316 levels "-0.021887","-0.060755",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_belt       : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt      : Factor w/ 394 levels "-0.003095","-0.010002",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt.1    : Factor w/ 337 levels "-0.005928","-0.005960",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_belt       : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : Factor w/ 3 levels "#DIV/0!","0.00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : Factor w/ 329 levels "-0.02438","-0.04190",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_arm      : Factor w/ 327 levels "-0.00484","-0.01311",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_arm        : Factor w/ 394 levels "-0.01548","-0.01749",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_arm       : Factor w/ 330 levels "-0.00051","-0.00696",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_arm      : Factor w/ 327 levels "-0.00184","-0.01185",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_arm        : Factor w/ 394 levels "-0.00311","-0.00562",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : Factor w/ 397 levels "-0.0035","-0.0073",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_dumbbell : Factor w/ 400 levels "-0.0163","-0.0233",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_dumbbell   : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_dumbbell  : Factor w/ 400 levels "-0.0082","-0.0096",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_dumbbell : Factor w/ 401 levels "-0.0053","-0.0084",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_dumbbell   : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]
# print(head(training.data, 1))
dim(training.data)
## [1] 19622   160
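The effect of the `na.strings` setting can be checked on a small inline CSV. This is a sketch on made-up values, not on the project files; the variant that also lists "#DIV/0!" is an assumption about what one might want, since the listing above shows that marker surviving as a factor level.

```r
# Toy CSV (hypothetical values, not taken from pml-training.csv).
csv_text <- "id,kurtosis,roll\n1,,1.41\n2,NA,1.42\n3,#DIV/0!,1.48\n4,0.25,1.45"

# Same setting as in this report: only "" and "NA" become NA.
as_in_report <- read.csv(text = csv_text, na.strings = c("", "NA"),
                         stringsAsFactors = TRUE)

# Variant (assumption): also treat the spreadsheet error marker as missing.
with_div0 <- read.csv(text = csv_text, na.strings = c("", "NA", "#DIV/0!"),
                      stringsAsFactors = TRUE)

class(as_in_report$kurtosis)   # column type when "#DIV/0!" is kept as text
class(with_div0$kurtosis)      # column type when "#DIV/0!" is read as NA
colSums(is.na(with_div0))      # NA count per column
```

Either way, the columns that are almost entirely empty end up containing \(NA\) values, which is what the cleaning step relies on.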

Step 2: Perform the data cleaning.

Step 2.1: Clean the training data.

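This step removes the columns that contain \(NA\) values (including the cells that were empty in the file), along with the first seven columns, which hold information that is unhelpful for the classification, such as the row index, participant name, timestamps and window markers.

# Drop the first seven columns (row index, user name, timestamps, window markers).
training.cleaned.data <- training.data[8:length(training.data)]
# Count the NA values per column and keep only the columns without any.
remCol <- colSums(is.na(training.cleaned.data))
training.cleaned.data <- training.cleaned.data[, remCol == 0]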

#print(head(training.data, 12))
#print(tail(training.data, 12))
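The same cleaning logic can be tried on a toy data frame. The sketch below is an illustration under assumptions: the data are made up, and the metadata columns are dropped by name instead of by position, which is a variation on the code above and not the code used in this report.

```r
# Toy data frame (hypothetical values) with metadata, a sparse column and the outcome.
toy <- data.frame(
  X             = 1:4,
  user_name     = c("a", "a", "b", "b"),
  roll_belt     = c(1.41, 1.42, 1.48, 1.45),
  max_roll_belt = c(NA, NA, NA, 2.1),
  classe        = factor(c("A", "A", "B", "B"))
)

# Drop metadata columns by name, so a change in column order cannot break the step.
metadata_cols <- c("X", "user_name")
sensors <- toy[, !(names(toy) %in% metadata_cols), drop = FALSE]

# Keep only the columns that contain no NA at all.
na_count <- colSums(is.na(sensors))
cleaned  <- sensors[, na_count == 0, drop = FALSE]
names(cleaned)
```

Selecting by name makes the intent explicit, at the cost of listing the metadata columns by hand.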

Step 2.2: Split the cleaned training data into a training set and a validation set.

The cleaned training data are split into a training set (75%) and a validation set (25%). The validation set is needed to estimate the performance of the classifier on data that were not used to train it.

require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(22519) # set.seed is used to get reproducible results when the code is run many times.
inTrain <- createDataPartition(training.cleaned.data$classe, p = 3/4)[[1]]
training.set <- training.cleaned.data[inTrain, ]
validation.set <- training.cleaned.data[-inTrain, ]
str(training.set)
## 'data.frame':    14718 obs. of  53 variables:
##  $ roll_belt           : num  1.41 1.42 1.48 1.42 1.42 1.43 1.45 1.43 1.45 1.48 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.09 8.13 8.16 8.18 8.18 8.2 8.15 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0 0.02 0.02 0.02 0.02 0.03 0.02 0 0 ...
##  $ gyros_belt_y        : num  0 0 0.02 0 0 0 0 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 0 0 ...
##  $ accel_belt_x        : int  -21 -20 -21 -22 -22 -20 -21 -22 -21 -21 ...
##  $ accel_belt_y        : int  4 5 2 3 4 2 2 2 2 4 ...
##  $ accel_belt_z        : int  22 23 24 21 21 24 23 23 22 23 ...
##  $ magnet_belt_x       : int  -3 -2 -6 -4 -2 1 -5 -2 -1 0 ...
##  $ magnet_belt_y       : int  599 600 600 599 603 602 596 602 597 592 ...
##  $ magnet_belt_z       : int  -313 -305 -302 -311 -313 -312 -317 -319 -310 -305 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -128 -129 -129 ...
##  $ pitch_arm           : num  22.5 22.5 22.1 21.9 21.8 21.7 21.5 21.5 21.4 21.3 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0 0.02 0 0 0.02 0.02 0.02 0.02 0.02 0.02 ...
##  $ gyros_arm_y         : num  0 -0.02 -0.03 -0.03 -0.02 -0.03 -0.03 -0.03 0 0 ...
##  $ gyros_arm_z         : num  -0.02 -0.02 0 0 0 -0.02 0 0 -0.03 -0.03 ...
##  $ accel_arm_x         : int  -288 -289 -289 -289 -289 -288 -290 -288 -289 -289 ...
##  $ accel_arm_y         : int  109 110 111 111 111 109 110 111 111 109 ...
##  $ accel_arm_z         : int  -123 -126 -123 -125 -124 -122 -123 -123 -124 -121 ...
##  $ magnet_arm_x        : int  -368 -368 -374 -373 -372 -369 -366 -363 -374 -367 ...
##  $ magnet_arm_y        : int  337 344 337 336 338 341 339 343 342 340 ...
##  $ magnet_arm_z        : int  516 513 506 509 510 518 509 520 510 509 ...
##  $ roll_dumbbell       : num  13.1 12.9 13.4 13.1 12.8 ...
##  $ pitch_dumbbell      : num  -70.5 -70.3 -70.4 -70.2 -70.3 ...
##  $ yaw_dumbbell        : num  -84.9 -85.1 -84.9 -85.1 -85.1 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ accel_dumbbell_x    : int  -234 -232 -233 -232 -234 -232 -233 -233 -234 -233 ...
##  $ accel_dumbbell_y    : int  47 46 48 47 46 47 47 47 47 48 ...
##  $ accel_dumbbell_z    : int  -271 -270 -270 -270 -272 -269 -269 -270 -270 -271 ...
##  $ magnet_dumbbell_x   : int  -559 -561 -554 -551 -555 -549 -564 -554 -554 -554 ...
##  $ magnet_dumbbell_y   : int  293 298 292 295 300 292 299 291 294 297 ...
##  $ magnet_dumbbell_z   : num  -65 -63 -68 -70 -74 -65 -64 -65 -63 -73 ...
##  $ roll_forearm        : num  28.4 28.3 28 27.9 27.8 27.7 27.6 27.5 27.2 27.1 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 -63.8 -63.9 -64 ...
##  $ yaw_forearm         : num  -153 -152 -152 -152 -152 -152 -152 -152 -151 -151 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.03 0.02 0.02 0.02 0.03 0.02 0.02 0 0.02 ...
##  $ gyros_forearm_y     : num  0 -0.02 0 0 -0.02 0 -0.02 0.02 -0.02 0 ...
##  $ gyros_forearm_z     : num  -0.02 0 -0.02 -0.02 0 -0.02 -0.02 -0.03 -0.02 0 ...
##  $ accel_forearm_x     : int  192 196 189 195 193 193 193 191 192 194 ...
##  $ accel_forearm_y     : int  203 204 206 205 205 204 205 203 201 204 ...
##  $ accel_forearm_z     : int  -215 -213 -214 -215 -213 -214 -214 -215 -214 -215 ...
##  $ magnet_forearm_x    : int  -17 -18 -17 -18 -9 -16 -17 -11 -16 -13 ...
##  $ magnet_forearm_y    : num  654 658 655 659 660 653 657 657 656 656 ...
##  $ magnet_forearm_z    : num  476 469 473 470 474 476 465 478 472 471 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
str(validation.set)
## 'data.frame':    4904 obs. of  53 variables:
##  $ roll_belt           : num  1.41 1.48 1.45 1.45 1.42 1.42 1.51 1.6 1.57 1.56 ...
##  $ pitch_belt          : num  8.07 8.05 8.06 8.17 8.2 8.21 8.12 8.1 8.09 8.1 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.3 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0.02 0.02 0.02 0.03 0.02 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_belt_y        : num  0 0 0 0 0 0 0 0 0.02 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.03 -0.02 0 0 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ accel_belt_x        : int  -22 -22 -21 -21 -22 -22 -21 -20 -21 -21 ...
##  $ accel_belt_y        : int  4 3 4 4 4 4 4 1 3 4 ...
##  $ accel_belt_z        : int  22 21 21 22 21 21 22 20 21 21 ...
##  $ magnet_belt_x       : int  -7 -6 0 -3 -3 -8 -6 -10 -2 -4 ...
##  $ magnet_belt_y       : int  608 604 603 609 606 598 598 607 604 606 ...
##  $ magnet_belt_z       : int  -311 -310 -312 -308 -309 -310 -317 -304 -313 -311 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -129 -129 -129 -129 ...
##  $ pitch_arm           : num  22.5 22.1 22 21.6 21.4 21.4 21.3 20.9 20.8 20.7 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.02 ...
##  $ gyros_arm_y         : num  -0.02 -0.03 -0.03 -0.03 -0.02 0 0 -0.02 -0.02 -0.02 ...
##  $ gyros_arm_z         : num  -0.02 0.02 0 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 ...
##  $ accel_arm_x         : int  -290 -289 -289 -288 -287 -288 -289 -288 -289 -290 ...
##  $ accel_arm_y         : int  110 111 111 110 111 111 110 111 111 110 ...
##  $ accel_arm_z         : int  -125 -123 -122 -124 -124 -124 -122 -124 -123 -123 ...
##  $ magnet_arm_x        : int  -369 -372 -369 -376 -372 -371 -371 -375 -372 -373 ...
##  $ magnet_arm_y        : int  337 344 342 334 338 331 337 337 338 333 ...
##  $ magnet_arm_z        : int  513 512 513 516 509 523 512 513 510 509 ...
##  $ roll_dumbbell       : num  13.1 13.4 13.4 13.3 13.4 ...
##  $ pitch_dumbbell      : num  -70.6 -70.4 -70.8 -70.9 -70.8 ...
##  $ yaw_dumbbell        : num  -84.7 -84.9 -84.5 -84.4 -84.5 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0.02 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 -0.02 0 0 -0.02 -0.02 0 0 0 0 ...
##  $ accel_dumbbell_x    : int  -233 -232 -234 -235 -234 -234 -233 -234 -233 -234 ...
##  $ accel_dumbbell_y    : int  47 48 48 48 48 48 47 48 48 48 ...
##  $ accel_dumbbell_z    : int  -269 -269 -269 -270 -269 -268 -272 -269 -270 -270 ...
##  $ magnet_dumbbell_x   : int  -555 -552 -558 -558 -552 -554 -551 -554 -554 -557 ...
##  $ magnet_dumbbell_y   : int  296 303 294 291 302 295 296 299 301 294 ...
##  $ magnet_dumbbell_z   : num  -64 -60 -66 -69 -69 -68 -56 -72 -65 -69 ...
##  $ roll_forearm        : num  28.3 28.1 27.9 27.7 27.2 27.2 27.1 26.9 27 26.9 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.8 -63.9 -63.9 -64 -63.9 -63.9 -63.8 ...
##  $ yaw_forearm         : num  -153 -152 -152 -152 -151 -151 -151 -151 -151 -151 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.02 0.02 0.02 0.02 0 0 0.02 0.03 0.02 0.02 ...
##  $ gyros_forearm_y     : num  0 -0.02 -0.02 0 0 -0.02 -0.02 -0.03 -0.03 -0.02 ...
##  $ gyros_forearm_z     : num  -0.02 0 -0.03 -0.02 -0.03 -0.03 0 -0.02 -0.02 -0.02 ...
##  $ accel_forearm_x     : int  192 189 193 190 193 193 192 194 191 194 ...
##  $ accel_forearm_y     : int  203 206 203 205 205 202 204 208 206 206 ...
##  $ accel_forearm_z     : int  -216 -214 -215 -215 -215 -214 -213 -214 -213 -214 ...
##  $ magnet_forearm_x    : int  -18 -16 -9 -22 -15 -14 -13 -11 -17 -10 ...
##  $ magnet_forearm_y    : num  661 658 660 656 655 659 653 654 654 653 ...
##  $ magnet_forearm_z    : num  473 469 478 473 472 478 481 469 478 467 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
#Dimensionality for comparison:

dim(training.data)
## [1] 19622   160
dim(training.cleaned.data)
## [1] 19622    53
dim(training.set)
## [1] 14718    53
dim(validation.set)
## [1] 4904   53
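For a factor outcome, `createDataPartition` samples within each class, so that both subsets keep roughly the same class proportions. A rough base-R illustration of that idea is sketched below; the class counts are made up, and the code is not caret's actual implementation.

```r
set.seed(22519)

# Hypothetical outcome vector with unequal class sizes.
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(40, 28, 24, 24, 24)))

# Within each class, draw 3/4 of the row indices for the training set.
rows_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE))

# Class proportions should be close in both subsets.
round(prop.table(table(classe[in_train])), 3)
round(prop.table(table(classe[-in_train])), 3)
```

Sampling per class avoids a split in which a rare class is under-represented in the validation set by chance.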

Step 2.3: Assess the highly correlated variables.

To assess whether there are highly correlated variables, a correlation matrix is plotted.

library(corrplot)

correlMatrix <- cor(training.set[, -length(names(training.set))])
corrplot(correlMatrix, method = "color", tl.cex = 0.5)
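Besides the plot, the strongly correlated pairs can be listed directly from the correlation matrix. The sketch below uses made-up variables and a cutoff of 0.9; both are assumptions chosen for illustration and are not taken from this analysis.

```r
set.seed(1)
n <- 200
a <- rnorm(n)

# Toy predictors: b is almost a copy of a, while c and d are independent noise.
toy <- data.frame(a = a, b = a + rnorm(n, sd = 0.1), c = rnorm(n), d = rnorm(n))

cm <- cor(toy)
cm[lower.tri(cm, diag = TRUE)] <- NA          # keep each pair only once

# Positions of the pairs whose absolute correlation exceeds the cutoff.
high <- which(abs(cm) > 0.9, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
high_pairs
```

Applied to `correlMatrix`, the same steps would give a table of candidate variables to drop or combine, although a random forest usually tolerates correlated predictors.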

Step 3: Plot decision tree.

require(rpart)
## Loading required package: rpart
require(rpart.plot)
## Loading required package: rpart.plot
require(rattle)
## Loading required package: rattle
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
decision.tree <- rpart(classe ~ ., data = training.set, method = "class")
fancyRpartPlot(decision.tree, cex = 0.2, tweak = 2, palettes = c("Greys", "Oranges", "Reds", "Greens"), sub = "Decision tree")

# In the plot above, the color intensity of a node reflects how dominant its predicted class is.
decision.tree.2 <- rpart(classe ~ ., data = training.set, method = "class")
prp(decision.tree.2)

Step 4: Model the data.

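A predictive model is fitted with the random forest algorithm. Random forests rank the importance of the variables and are robust to correlated covariates and outliers. Five-fold cross-validation is used to choose the tuning parameter mtry. The forest is limited to 10 trees (ntree = 10, compared with the randomForest default of 500) to keep the training time short.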

require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
random.forest.control <- trainControl(method = "cv", number = 5)
random.forest.model <- train(classe ~ ., data = training.set, method = "rf", trControl = random.forest.control, ntree = 10)
random.forest.model
## Random Forest 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 11775, 11774, 11772, 11776, 11775 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9779867  0.9721520
##   27    0.9849839  0.9810045
##   52    0.9768308  0.9706881
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
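The mechanics of k-fold cross-validation can be sketched in base R. The example below is an illustration under assumptions: it uses the built-in iris data and a simple nearest-centroid classifier as a stand-in (the name `nearest_centroid` is hypothetical), not the random forest or the data from this report.

```r
set.seed(22519)
k <- 5

# Assign every row to one of k folds of equal size, in random order.
fold <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Stand-in classifier: predict the class whose mean vector is closest.
nearest_centroid <- function(train, test) {
  x_cols <- names(train) != "Species"
  centroids <- aggregate(train[, x_cols], by = list(Species = train$Species), FUN = mean)
  # Squared Euclidean distance from every test row to every class centroid.
  d <- sapply(seq_len(nrow(centroids)), function(i) {
    rowSums(sweep(as.matrix(test[, x_cols]), 2, unlist(centroids[i, -1]))^2)
  })
  factor(centroids$Species[max.col(-d, ties.method = "first")],
         levels = levels(train$Species))
}

# Train on k - 1 folds, score on the held-out fold, and repeat for each fold.
fold_accuracy <- sapply(seq_len(k), function(i) {
  pred <- nearest_centroid(iris[fold != i, ], iris[fold == i, ])
  mean(pred == iris$Species[fold == i])
})

fold_accuracy
mean(fold_accuracy)   # cross-validated accuracy estimate
```

caret carries out this kind of loop for each candidate value of mtry and reports the average accuracy per value, as in the table above.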

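The confusion matrix for the training set is shown below. In the call, the true classes are passed as the first argument, so the rows labelled Prediction hold the true classes and the columns labelled Reference hold the model's predictions; accuracy and kappa are unaffected by this ordering.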

random.forest.predict.1 <- predict(random.forest.model, training.set)
confusionMatrix(training.set$classe, random.forest.predict.1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4184    1    0    0    0
##          B    1 2844    2    0    1
##          C    0    1 2566    0    0
##          D    0    0    2 2410    0
##          E    0    0    0    2 2704
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9993          
##                  95% CI : (0.9988, 0.9997)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9991          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9998   0.9993   0.9984   0.9992   0.9996
## Specificity            0.9999   0.9997   0.9999   0.9998   0.9998
## Pos Pred Value         0.9998   0.9986   0.9996   0.9992   0.9993
## Neg Pred Value         0.9999   0.9998   0.9997   0.9998   0.9999
## Prevalence             0.2843   0.1934   0.1746   0.1639   0.1838
## Detection Rate         0.2843   0.1932   0.1743   0.1637   0.1837
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1839
## Balanced Accuracy      0.9998   0.9995   0.9992   0.9995   0.9997

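On the training set, the accuracy (0.9993) and the kappa statistic (0.9991) show that the model fits the training data almost perfectly. Because these figures are computed on the same data that were used for fitting, they are optimistic; the cross-validated accuracy of 0.985 is a better guide to the performance on new data.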

Step 5: Estimate the performance of the model on the validation set.

The confusion matrix for the validation set is as follows:

random.forest.predict <- predict(random.forest.model, validation.set)
confusionMatrix(validation.set$classe, random.forest.predict)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1392    3    0    0    0
##          B    7  927   12    3    0
##          C    0    8  842    5    0
##          D    0    1   11  792    0
##          E    0    2    4    6  889
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9874          
##                  95% CI : (0.9838, 0.9903)
##     No Information Rate : 0.2853          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.984           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9950   0.9851   0.9689   0.9826   1.0000
## Specificity            0.9991   0.9944   0.9968   0.9971   0.9970
## Pos Pred Value         0.9978   0.9768   0.9848   0.9851   0.9867
## Neg Pred Value         0.9980   0.9965   0.9933   0.9966   1.0000
## Prevalence             0.2853   0.1919   0.1772   0.1644   0.1813
## Detection Rate         0.2838   0.1890   0.1717   0.1615   0.1813
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9971   0.9898   0.9829   0.9899   0.9985

On the validation set, the accuracy (0.9874) and the kappa statistic (0.984) indicate that the predictor has a low out-of-sample error rate.

model.accuracy <- postResample(random.forest.predict, validation.set$classe)
print(model.accuracy)
##  Accuracy     Kappa 
## 0.9873573 0.9840081
oose <- 1 - as.numeric(confusionMatrix(validation.set$classe, random.forest.predict)$overall[1])
oose
## [1] 0.01264274

The estimated accuracy of the model is 98.74% and the estimated out-of-sample error is 1.26%.
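The accuracy, kappa and out-of-sample error can be recomputed by hand from the validation-set table printed above. The sketch below copies the counts from that table; the variable names are chosen here for illustration.

```r
# Counts copied from the validation-set confusion matrix in this report
# (rows and columns in the order A to E, as printed).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(1392,   3,   0,   0,   0,
                  7, 927,  12,   3,   0,
                  0,   8, 842,   5,   0,
                  0,   1,  11, 792,   0,
                  0,   2,   4,   6, 889),
             nrow = 5, byrow = TRUE, dimnames = list(classes, classes))

n <- sum(cm)

# Observed agreement: the share of cases on the diagonal.
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled to at most 1.
kappa_stat <- (accuracy - expected) / (1 - expected)

oos_error <- 1 - accuracy
round(c(accuracy = accuracy, kappa = kappa_stat, error = oos_error), 4)
```

The results should agree with the values reported by `confusionMatrix` and `postResample` above: 62 of the 4904 validation cases are misclassified.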

Step 6: Load the testing data and clean it.

The test data are cleaned in the same way as the training data: the columns containing \(NA\) values or empty cells are removed, along with the first seven columns, which hold information that is unhelpful for the classification, such as the row index, participant name, timestamps and window markers.

test.data <- read.csv("./Data/pml-testing.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE, na.strings = c("", "NA"))
test.cleaned.data <- test.data[8:length(test.data)]
remCol <-  colSums(is.na(test.cleaned.data))
test.cleaned.data <- test.cleaned.data[, remCol == 0] 
str(test.cleaned.data)
## 'data.frame':    20 obs. of  53 variables:
##  $ roll_belt           : num  123 1.02 0.87 125 1.35 -5.92 1.2 0.43 0.93 114 ...
##  $ pitch_belt          : num  27 4.87 1.82 -41.6 3.33 1.59 4.44 4.15 6.72 22.4 ...
##  $ yaw_belt            : num  -4.75 -88.9 -88.5 162 -88.6 -87.7 -87.3 -88.5 -93.7 -13.1 ...
##  $ total_accel_belt    : int  20 4 5 17 3 4 4 4 4 18 ...
##  $ gyros_belt_x        : num  -0.5 -0.06 0.05 0.11 0.03 0.1 -0.06 -0.18 0.1 0.14 ...
##  $ gyros_belt_y        : num  -0.02 -0.02 0.02 0.11 0.02 0.05 0 -0.02 0 0.11 ...
##  $ gyros_belt_z        : num  -0.46 -0.07 0.03 -0.16 0 -0.13 0 -0.03 -0.02 -0.16 ...
##  $ accel_belt_x        : int  -38 -13 1 46 -8 -11 -14 -10 -15 -25 ...
##  $ accel_belt_y        : int  69 11 -1 45 4 -16 2 -2 1 63 ...
##  $ accel_belt_z        : int  -179 39 49 -156 27 38 35 42 32 -158 ...
##  $ magnet_belt_x       : int  -13 43 29 169 33 31 50 39 -6 10 ...
##  $ magnet_belt_y       : int  581 636 631 608 566 638 622 635 600 601 ...
##  $ magnet_belt_z       : int  -382 -309 -312 -304 -418 -291 -315 -305 -302 -330 ...
##  $ roll_arm            : num  40.7 0 0 -109 76.1 0 0 0 -137 -82.4 ...
##  $ pitch_arm           : num  -27.8 0 0 55 2.76 0 0 0 11.2 -63.8 ...
##  $ yaw_arm             : num  178 0 0 -142 102 0 0 0 -167 -75.3 ...
##  $ total_accel_arm     : int  10 38 44 25 29 14 15 22 34 32 ...
##  $ gyros_arm_x         : num  -1.65 -1.17 2.1 0.22 -1.96 0.02 2.36 -3.71 0.03 0.26 ...
##  $ gyros_arm_y         : num  0.48 0.85 -1.36 -0.51 0.79 0.05 -1.01 1.85 -0.02 -0.5 ...
##  $ gyros_arm_z         : num  -0.18 -0.43 1.13 0.92 -0.54 -0.07 0.89 -0.69 -0.02 0.79 ...
##  $ accel_arm_x         : int  16 -290 -341 -238 -197 -26 99 -98 -287 -301 ...
##  $ accel_arm_y         : int  38 215 245 -57 200 130 79 175 111 -42 ...
##  $ accel_arm_z         : int  93 -90 -87 6 -30 -19 -67 -78 -122 -80 ...
##  $ magnet_arm_x        : int  -326 -325 -264 -173 -170 396 702 535 -367 -420 ...
##  $ magnet_arm_y        : int  385 447 474 257 275 176 15 215 335 294 ...
##  $ magnet_arm_z        : int  481 434 413 633 617 516 217 385 520 493 ...
##  $ roll_dumbbell       : num  -17.7 54.5 57.1 43.1 -101.4 ...
##  $ pitch_dumbbell      : num  25 -53.7 -51.4 -30 -53.4 ...
##  $ yaw_dumbbell        : num  126.2 -75.5 -75.2 -103.3 -14.2 ...
##  $ total_accel_dumbbell: int  9 31 29 18 4 29 29 29 3 2 ...
##  $ gyros_dumbbell_x    : num  0.64 0.34 0.39 0.1 0.29 -0.59 0.34 0.37 0.03 0.42 ...
##  $ gyros_dumbbell_y    : num  0.06 0.05 0.14 -0.02 -0.47 0.8 0.16 0.14 -0.21 0.51 ...
##  $ gyros_dumbbell_z    : num  -0.61 -0.71 -0.34 0.05 -0.46 1.1 -0.23 -0.39 -0.21 -0.03 ...
##  $ accel_dumbbell_x    : int  21 -153 -141 -51 -18 -138 -145 -140 0 -7 ...
##  $ accel_dumbbell_y    : int  -15 155 155 72 -30 166 150 159 25 -20 ...
##  $ accel_dumbbell_z    : int  81 -205 -196 -148 -5 -186 -190 -191 9 7 ...
##  $ magnet_dumbbell_x   : int  523 -502 -506 -576 -424 -543 -484 -515 -519 -531 ...
##  $ magnet_dumbbell_y   : int  -528 388 349 238 252 262 354 350 348 321 ...
##  $ magnet_dumbbell_z   : int  -56 -36 41 53 312 96 97 53 -32 -164 ...
##  $ roll_forearm        : num  141 109 131 0 -176 150 155 -161 15.5 13.2 ...
##  $ pitch_forearm       : num  49.3 -17.6 -32.6 0 -2.16 1.46 34.5 43.6 -63.5 19.4 ...
##  $ yaw_forearm         : num  156 106 93 0 -47.9 89.7 152 -89.5 -139 -105 ...
##  $ total_accel_forearm : int  33 39 34 43 24 43 32 47 36 24 ...
##  $ gyros_forearm_x     : num  0.74 1.12 0.18 1.38 -0.75 -0.88 -0.53 0.63 0.03 0.02 ...
##  $ gyros_forearm_y     : num  -3.34 -2.78 -0.79 0.69 3.1 4.26 1.8 -0.74 0.02 0.13 ...
##  $ gyros_forearm_z     : num  -0.59 -0.18 0.28 1.8 0.8 1.35 0.75 0.49 -0.02 -0.07 ...
##  $ accel_forearm_x     : int  -110 212 154 -92 131 230 -192 -151 195 -212 ...
##  $ accel_forearm_y     : int  267 297 271 406 -93 322 170 -331 204 98 ...
##  $ accel_forearm_z     : int  -149 -118 -129 -39 172 -144 -175 -282 -217 -7 ...
##  $ magnet_forearm_x    : int  -714 -237 -51 -233 375 -300 -678 -109 0 -403 ...
##  $ magnet_forearm_y    : int  419 791 698 783 -787 800 284 -619 652 723 ...
##  $ magnet_forearm_z    : int  617 873 783 521 91 884 585 -32 469 512 ...
##  $ problem_id          : int  1 2 3 4 5 6 7 8 9 10 ...
dim(test.cleaned.data)
## [1] 20 53
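Here the \(NA\) filter happens to keep the same 52 predictors in the test data as in the training data. A more defensive alternative is to select the test columns by the names of the training predictors. The sketch below illustrates this on toy data; the column names and values are made up for the example.

```r
# Names of the columns kept in the (toy) training set, outcome included.
train_cols <- c("roll_belt", "pitch_belt", "classe")

# Toy test data with extra columns that the model has never seen.
toy_test <- data.frame(
  X             = 1:2,
  roll_belt     = c(123, 1.02),
  pitch_belt    = c(27, 4.87),
  max_roll_belt = c(NA, NA),
  problem_id    = 1:2
)

# Predictors are all training columns except the outcome.
predictors <- setdiff(train_cols, "classe")

# Fail early if the test data lack a predictor that the model needs.
missing_in_test <- setdiff(predictors, names(toy_test))
stopifnot(length(missing_in_test) == 0)

test_aligned <- toy_test[, predictors, drop = FALSE]
names(test_aligned)
```

With this approach the test set cannot silently gain or lose a column relative to what the model was trained on.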

Step 7: Predict the test cases with the fitted model.

test.results.predict <- predict(random.forest.model, test.cleaned.data[, -length(names(test.cleaned.data))])
test.results.predict
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Step 8: Conclusions.

A random forest model was built to predict, from movement sensor data, the manner in which a weight lifting exercise was performed. The estimated out-of-sample error is about 1.3%. This is a promising result for the use of machine learning to detect incorrectly performed exercises.

function.write.files <- function(y, path = ".")  {
  # Write each prediction to its own file, test_case_<j>.txt, inside `path`
  # (the working directory by default).
  for(j in seq_along(y))  {
    filename <- file.path(path, paste0("test_case_", j, ".txt"))
    write.table(y[j], file = filename, row.names = FALSE, col.names = FALSE, quote = FALSE)
  }
}

function.write.files(test.results.predict)

Step 9: R version and System information for this analysis.

Sys.info()[1:2]
##   sysname   release 
## "Windows"   "7 x64"
R.version.string
## [1] "R version 3.2.4 Revised (2016-03-16 r70336)"