The goal of this data analysis is to predict the manner in which the user did the exercise. The training data set contains the target variable classe, of which other variables will be used for prediction. The best model that fits the data will be elicited using cross validation. Data preparation and cleaning will be done via removing columns that were not related to accelerometer reading and readings that were dominated by NA values reducing the variables from 160 to 53. The training data is partitioned into Atrain and Atest dataset which is used for training and validation respectively. The cleantest data set is used for the final prediction.
The decision tree model will be used first, followed by the Random forest modela and finally Generalised boosted regression model will be used. The best model will be finally used to predict the cleantest data set.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project is available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data is available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project comes from this original source: http://groupware.les.inf.puc-rio.br/har.
The following libraries are needed for the course of this project
library(rattle)
library(rsample)
library(caret)
library(rpart)
library(rpart.plot)
library(corrplot)
library(randomForest)
library(RColorBrewer)
library(scales)
set.seed(3030)
After downloading the data from the data source, we can read the two csv files into two data frames and check their structures. From the result below, our training set is a data frame of 19622 observations of 160 varaibles. The test data is a data frame of 20 observations of 160 varaibles. The first seven columns contains some informations that we might not need in the course of our analysis; also there are some features that have NA’s and missing observations.
train <- read.csv("./pml-training.csv", header = TRUE)
test <- read.csv("./pml-testing.csv", header = TRUE)
str(train, 5)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : chr "carlitos" "carlitos" "carlitos" "carlitos" ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : chr "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
## $ new_window : chr "no" "no" "no" "no" ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : chr "" "" "" "" ...
## $ kurtosis_picth_belt : chr "" "" "" "" ...
## $ kurtosis_yaw_belt : chr "" "" "" "" ...
## $ skewness_roll_belt : chr "" "" "" "" ...
## $ skewness_roll_belt.1 : chr "" "" "" "" ...
## $ skewness_yaw_belt : chr "" "" "" "" ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : chr "" "" "" "" ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : chr "" "" "" "" ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : chr "" "" "" "" ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : chr "" "" "" "" ...
## $ kurtosis_picth_arm : chr "" "" "" "" ...
## $ kurtosis_yaw_arm : chr "" "" "" "" ...
## $ skewness_roll_arm : chr "" "" "" "" ...
## $ skewness_pitch_arm : chr "" "" "" "" ...
## $ skewness_yaw_arm : chr "" "" "" "" ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : chr "" "" "" "" ...
## $ kurtosis_picth_dumbbell : chr "" "" "" "" ...
## $ kurtosis_yaw_dumbbell : chr "" "" "" "" ...
## $ skewness_roll_dumbbell : chr "" "" "" "" ...
## $ skewness_pitch_dumbbell : chr "" "" "" "" ...
## $ skewness_yaw_dumbbell : chr "" "" "" "" ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : chr "" "" "" "" ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : chr "" "" "" "" ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
str(test, 5)
## 'data.frame': 20 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : chr "pedro" "jeremy" "jeremy" "adelmo" ...
## $ raw_timestamp_part_1 : int 1323095002 1322673067 1322673075 1322832789 1322489635 1322673149 1322673128 1322673076 1323084240 1322837822 ...
## $ raw_timestamp_part_2 : int 868349 778725 342967 560311 814776 510661 766645 54671 916313 384285 ...
## $ cvtd_timestamp : chr "05/12/2011 14:23" "30/11/2011 17:11" "30/11/2011 17:11" "02/12/2011 13:33" ...
## $ new_window : chr "no" "no" "no" "no" ...
## $ num_window : int 74 431 439 194 235 504 485 440 323 664 ...
## $ roll_belt : num 123 1.02 0.87 125 1.35 -5.92 1.2 0.43 0.93 114 ...
## $ pitch_belt : num 27 4.87 1.82 -41.6 3.33 1.59 4.44 4.15 6.72 22.4 ...
## $ yaw_belt : num -4.75 -88.9 -88.5 162 -88.6 -87.7 -87.3 -88.5 -93.7 -13.1 ...
## $ total_accel_belt : int 20 4 5 17 3 4 4 4 4 18 ...
## $ kurtosis_roll_belt : logi NA NA NA NA NA NA ...
## $ kurtosis_picth_belt : logi NA NA NA NA NA NA ...
## $ kurtosis_yaw_belt : logi NA NA NA NA NA NA ...
## $ skewness_roll_belt : logi NA NA NA NA NA NA ...
## $ skewness_roll_belt.1 : logi NA NA NA NA NA NA ...
## $ skewness_yaw_belt : logi NA NA NA NA NA NA ...
## $ max_roll_belt : logi NA NA NA NA NA NA ...
## $ max_picth_belt : logi NA NA NA NA NA NA ...
## $ max_yaw_belt : logi NA NA NA NA NA NA ...
## $ min_roll_belt : logi NA NA NA NA NA NA ...
## $ min_pitch_belt : logi NA NA NA NA NA NA ...
## $ min_yaw_belt : logi NA NA NA NA NA NA ...
## $ amplitude_roll_belt : logi NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : logi NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : logi NA NA NA NA NA NA ...
## $ var_total_accel_belt : logi NA NA NA NA NA NA ...
## $ avg_roll_belt : logi NA NA NA NA NA NA ...
## $ stddev_roll_belt : logi NA NA NA NA NA NA ...
## $ var_roll_belt : logi NA NA NA NA NA NA ...
## $ avg_pitch_belt : logi NA NA NA NA NA NA ...
## $ stddev_pitch_belt : logi NA NA NA NA NA NA ...
## $ var_pitch_belt : logi NA NA NA NA NA NA ...
## $ avg_yaw_belt : logi NA NA NA NA NA NA ...
## $ stddev_yaw_belt : logi NA NA NA NA NA NA ...
## $ var_yaw_belt : logi NA NA NA NA NA NA ...
## $ gyros_belt_x : num -0.5 -0.06 0.05 0.11 0.03 0.1 -0.06 -0.18 0.1 0.14 ...
## $ gyros_belt_y : num -0.02 -0.02 0.02 0.11 0.02 0.05 0 -0.02 0 0.11 ...
## $ gyros_belt_z : num -0.46 -0.07 0.03 -0.16 0 -0.13 0 -0.03 -0.02 -0.16 ...
## $ accel_belt_x : int -38 -13 1 46 -8 -11 -14 -10 -15 -25 ...
## $ accel_belt_y : int 69 11 -1 45 4 -16 2 -2 1 63 ...
## $ accel_belt_z : int -179 39 49 -156 27 38 35 42 32 -158 ...
## $ magnet_belt_x : int -13 43 29 169 33 31 50 39 -6 10 ...
## $ magnet_belt_y : int 581 636 631 608 566 638 622 635 600 601 ...
## $ magnet_belt_z : int -382 -309 -312 -304 -418 -291 -315 -305 -302 -330 ...
## $ roll_arm : num 40.7 0 0 -109 76.1 0 0 0 -137 -82.4 ...
## $ pitch_arm : num -27.8 0 0 55 2.76 0 0 0 11.2 -63.8 ...
## $ yaw_arm : num 178 0 0 -142 102 0 0 0 -167 -75.3 ...
## $ total_accel_arm : int 10 38 44 25 29 14 15 22 34 32 ...
## $ var_accel_arm : logi NA NA NA NA NA NA ...
## $ avg_roll_arm : logi NA NA NA NA NA NA ...
## $ stddev_roll_arm : logi NA NA NA NA NA NA ...
## $ var_roll_arm : logi NA NA NA NA NA NA ...
## $ avg_pitch_arm : logi NA NA NA NA NA NA ...
## $ stddev_pitch_arm : logi NA NA NA NA NA NA ...
## $ var_pitch_arm : logi NA NA NA NA NA NA ...
## $ avg_yaw_arm : logi NA NA NA NA NA NA ...
## $ stddev_yaw_arm : logi NA NA NA NA NA NA ...
## $ var_yaw_arm : logi NA NA NA NA NA NA ...
## $ gyros_arm_x : num -1.65 -1.17 2.1 0.22 -1.96 0.02 2.36 -3.71 0.03 0.26 ...
## $ gyros_arm_y : num 0.48 0.85 -1.36 -0.51 0.79 0.05 -1.01 1.85 -0.02 -0.5 ...
## $ gyros_arm_z : num -0.18 -0.43 1.13 0.92 -0.54 -0.07 0.89 -0.69 -0.02 0.79 ...
## $ accel_arm_x : int 16 -290 -341 -238 -197 -26 99 -98 -287 -301 ...
## $ accel_arm_y : int 38 215 245 -57 200 130 79 175 111 -42 ...
## $ accel_arm_z : int 93 -90 -87 6 -30 -19 -67 -78 -122 -80 ...
## $ magnet_arm_x : int -326 -325 -264 -173 -170 396 702 535 -367 -420 ...
## $ magnet_arm_y : int 385 447 474 257 275 176 15 215 335 294 ...
## $ magnet_arm_z : int 481 434 413 633 617 516 217 385 520 493 ...
## $ kurtosis_roll_arm : logi NA NA NA NA NA NA ...
## $ kurtosis_picth_arm : logi NA NA NA NA NA NA ...
## $ kurtosis_yaw_arm : logi NA NA NA NA NA NA ...
## $ skewness_roll_arm : logi NA NA NA NA NA NA ...
## $ skewness_pitch_arm : logi NA NA NA NA NA NA ...
## $ skewness_yaw_arm : logi NA NA NA NA NA NA ...
## $ max_roll_arm : logi NA NA NA NA NA NA ...
## $ max_picth_arm : logi NA NA NA NA NA NA ...
## $ max_yaw_arm : logi NA NA NA NA NA NA ...
## $ min_roll_arm : logi NA NA NA NA NA NA ...
## $ min_pitch_arm : logi NA NA NA NA NA NA ...
## $ min_yaw_arm : logi NA NA NA NA NA NA ...
## $ amplitude_roll_arm : logi NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : logi NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : logi NA NA NA NA NA NA ...
## $ roll_dumbbell : num -17.7 54.5 57.1 43.1 -101.4 ...
## $ pitch_dumbbell : num 25 -53.7 -51.4 -30 -53.4 ...
## $ yaw_dumbbell : num 126.2 -75.5 -75.2 -103.3 -14.2 ...
## $ kurtosis_roll_dumbbell : logi NA NA NA NA NA NA ...
## $ kurtosis_picth_dumbbell : logi NA NA NA NA NA NA ...
## $ kurtosis_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ skewness_roll_dumbbell : logi NA NA NA NA NA NA ...
## $ skewness_pitch_dumbbell : logi NA NA NA NA NA NA ...
## $ skewness_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ max_roll_dumbbell : logi NA NA NA NA NA NA ...
## $ max_picth_dumbbell : logi NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ min_roll_dumbbell : logi NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : logi NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ amplitude_roll_dumbbell : logi NA NA NA NA NA NA ...
## [list output truncated]
The full training dataset will be split into a training dataset (Atrain) and a testing? validation dataset (Atest). The testing data will be used to cross validate our models. To clean the data we get rid of NA’s and near zero variables and blank spaces. Here we get the indexes of the columns having at least 70% of NA or blank values on the training dataset
indColToRemove <- which(colSums(is.na(train) | train == "") > 0.7*dim(train)[1])
cleantrain <- train[,-indColToRemove]
cleantrain <- cleantrain[,-c(1:7)]
dim(cleantrain)
## [1] 19622 53
# We do the same for the test set
indColToRemove <- which(colSums(is.na(test) | test == "") >0.7*dim(test)[1])
cleantest <- test[,-indColToRemove]
cleantest <- cleantest[,-c(1:7)]
dim(cleantest)
## [1] 20 53
After cleaning our training data set is reduced to a data frame of 19622 observations with 53 features; while our test data set is reduced to 20 observations of 53 features. To proceed , we will explore our taining data set by plotting the correlation matrix graphs of our training data set.
corrplot(cor(cleantrain[, -length(names(cleantrain))]), method = "color", tl.cex = 0.5)
DATA PREPARATION FOR PREDICTION ——————————- Preparing the training data for prediction by splitting the training data
cleantrain into 70% as train data Atrain and 30% as test data Atest. This splitting will serve also to compute the out-of-sample errors. The split training data is renamed Atrain for the part that will be used for training and the cross validation data renamed as Atest (validate data) will stay as it is and will be used later to test the prediction algorithm on the 20 cases. The Atrain is a dataframe of 13737 observations of 53 variables, while Atest is a dataframe of 5885 obs. of 53 variables. Hence the spliting is accurately done.
set.seed(123)
cleantrain_split <- createDataPartition(y = cleantrain$classe, p = 0.70, list = FALSE)
Atrain <- cleantrain[cleantrain_split, ]
Atest <- cleantrain[-cleantrain_split, ]
str(Atrain, 5)
## 'data.frame': 13737 obs. of 53 variables:
## $ roll_belt : num 1.41 1.41 1.48 1.42 1.43 1.45 1.43 1.42 1.42 1.48 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.09 8.16 8.17 8.18 8.2 8.21 8.15 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0.02 0.02 0.02 0.03 0.02 0.02 0.02 0 ...
## $ gyros_belt_y : num 0 0 0.02 0 0 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.02 -0.02 0 -0.02 0 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -21 -22 -20 -21 -22 -22 -22 -21 ...
## $ accel_belt_y : int 4 4 2 3 2 4 2 4 4 4 ...
## $ accel_belt_z : int 22 22 24 21 24 22 23 21 21 23 ...
## $ magnet_belt_x : int -3 -7 -6 -4 1 -3 -2 -3 -8 0 ...
## $ magnet_belt_y : int 599 608 600 599 602 609 602 606 598 592 ...
## $ magnet_belt_z : int -313 -311 -302 -311 -312 -308 -319 -309 -310 -305 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -129 ...
## $ pitch_arm : num 22.5 22.5 22.1 21.9 21.7 21.6 21.5 21.4 21.4 21.3 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0 0 0.02 0.02 0.02 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.03 -0.03 -0.03 -0.03 -0.03 -0.02 0 0 ...
## $ gyros_arm_z : num -0.02 -0.02 0 0 -0.02 -0.02 0 -0.02 -0.03 -0.03 ...
## $ accel_arm_x : int -288 -290 -289 -289 -288 -288 -288 -287 -288 -289 ...
## $ accel_arm_y : int 109 110 111 111 109 110 111 111 111 109 ...
## $ accel_arm_z : int -123 -125 -123 -125 -122 -124 -123 -124 -124 -121 ...
## $ magnet_arm_x : int -368 -369 -374 -373 -369 -376 -363 -372 -371 -367 ...
## $ magnet_arm_y : int 337 337 337 336 341 334 343 338 331 340 ...
## $ magnet_arm_z : int 516 513 506 509 518 516 520 509 523 509 ...
## $ roll_dumbbell : num 13.1 13.1 13.4 13.1 13.2 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.4 -70.2 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -84.9 -85.1 -84.9 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0.02 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 0 0 0 0 -0.02 -0.02 0 ...
## $ accel_dumbbell_x : int -234 -233 -233 -232 -232 -235 -233 -234 -234 -233 ...
## $ accel_dumbbell_y : int 47 47 48 47 47 48 47 48 48 48 ...
## $ accel_dumbbell_z : int -271 -269 -270 -270 -269 -270 -270 -269 -268 -271 ...
## $ magnet_dumbbell_x : int -559 -555 -554 -551 -549 -558 -554 -552 -554 -554 ...
## $ magnet_dumbbell_y : int 293 296 292 295 292 291 291 302 295 297 ...
## $ magnet_dumbbell_z : num -65 -64 -68 -70 -65 -69 -65 -69 -68 -73 ...
## $ roll_forearm : num 28.4 28.3 28 27.9 27.7 27.7 27.5 27.2 27.2 27.1 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 -63.9 -63.9 -64 ...
## $ yaw_forearm : num -153 -153 -152 -152 -152 -152 -152 -151 -151 -151 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.02 0.02 0.03 0.02 0.02 0 0 0.02 ...
## $ gyros_forearm_y : num 0 0 0 0 0 0 0.02 0 -0.02 0 ...
## $ gyros_forearm_z : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.03 -0.03 -0.03 0 ...
## $ accel_forearm_x : int 192 192 189 195 193 190 191 193 193 194 ...
## $ accel_forearm_y : int 203 203 206 205 204 205 203 205 202 204 ...
## $ accel_forearm_z : int -215 -216 -214 -215 -214 -215 -215 -215 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -17 -18 -16 -22 -11 -15 -14 -13 ...
## $ magnet_forearm_y : num 654 661 655 659 653 656 657 655 659 656 ...
## $ magnet_forearm_z : num 476 473 473 470 476 473 478 472 478 471 ...
## $ classe : chr "A" "A" "A" "A" ...
str(Atest, 5)
## 'data.frame': 5885 obs. of 53 variables:
## $ roll_belt : num 1.42 1.48 1.45 1.42 1.45 1.45 1.57 1.56 1.51 1.43 ...
## $ pitch_belt : num 8.07 8.05 8.06 8.13 8.18 8.2 8.09 8.1 8.1 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.3 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0.02 0.02 0.03 0 0.02 0.02 0.02 0 ...
## $ gyros_belt_y : num 0 0 0 0 0 0 0.02 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.03 -0.02 -0.02 -0.02 0 -0.02 -0.02 -0.02 -0.03 ...
## $ accel_belt_x : int -20 -22 -21 -22 -21 -21 -21 -21 -20 -22 ...
## $ accel_belt_y : int 5 3 4 4 2 2 3 4 4 4 ...
## $ accel_belt_z : int 23 21 21 21 23 22 21 21 22 19 ...
## $ magnet_belt_x : int -2 -6 0 -2 -5 -1 -2 -4 -3 4 ...
## $ magnet_belt_y : int 600 604 603 603 596 597 604 606 601 602 ...
## $ magnet_belt_z : int -305 -310 -312 -313 -317 -310 -313 -311 -318 -316 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -129 -129 -129 -129 -129 ...
## $ pitch_arm : num 22.5 22.1 22 21.8 21.5 21.4 20.8 20.7 20.7 20.5 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.02 -0.02 0.03 ...
## $ gyros_arm_y : num -0.02 -0.03 -0.03 -0.02 -0.03 0 -0.02 -0.02 0 -0.02 ...
## $ gyros_arm_z : num -0.02 0.02 0 0 0 -0.03 -0.02 -0.02 -0.02 0 ...
## $ accel_arm_x : int -289 -289 -289 -289 -290 -289 -289 -290 -289 -290 ...
## $ accel_arm_y : int 110 111 111 111 110 111 111 110 110 110 ...
## $ accel_arm_z : int -126 -123 -122 -124 -123 -124 -123 -123 -125 -126 ...
## $ magnet_arm_x : int -368 -372 -369 -372 -366 -374 -372 -373 -374 -375 ...
## $ magnet_arm_y : int 344 344 342 338 339 342 338 333 350 339 ...
## $ magnet_arm_z : int 513 512 513 510 509 510 510 509 516 508 ...
## $ roll_dumbbell : num 12.9 13.4 13.4 12.8 13.1 ...
## $ pitch_dumbbell : num -70.3 -70.4 -70.8 -70.3 -70.6 ...
## $ yaw_dumbbell : num -85.1 -84.9 -84.5 -85.1 -84.7 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 -0.02 0 0 0 0 0 0 0 -0.02 ...
## $ accel_dumbbell_x : int -232 -232 -234 -234 -233 -234 -233 -234 -235 -234 ...
## $ accel_dumbbell_y : int 46 48 48 46 47 47 48 48 47 48 ...
## $ accel_dumbbell_z : int -270 -269 -269 -272 -269 -270 -270 -270 -271 -272 ...
## $ magnet_dumbbell_x : int -561 -552 -558 -555 -564 -554 -554 -557 -558 -556 ...
## $ magnet_dumbbell_y : int 298 303 294 300 299 294 301 294 291 298 ...
## $ magnet_dumbbell_z : num -63 -60 -66 -74 -64 -63 -65 -69 -71 -62 ...
## $ roll_forearm : num 28.3 28.1 27.9 27.8 27.6 27.2 27 26.9 27.1 26.7 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.8 -63.8 -63.9 -63.9 -63.8 -63.7 -63.7 ...
## $ yaw_forearm : num -152 -152 -152 -152 -152 -151 -151 -151 -151 -151 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.02 0.02 0.02 0 0.02 0.02 0.03 0 ...
## $ gyros_forearm_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.03 -0.02 -0.03 -0.02 ...
## $ gyros_forearm_z : num 0 0 -0.03 0 -0.02 -0.02 -0.02 -0.02 0 -0.02 ...
## $ accel_forearm_x : int 196 189 193 193 193 192 191 194 193 196 ...
## $ accel_forearm_y : int 204 206 203 205 205 201 206 206 203 207 ...
## $ accel_forearm_z : int -213 -214 -215 -213 -214 -214 -213 -214 -213 -216 ...
## $ magnet_forearm_x : int -18 -16 -9 -9 -17 -16 -17 -10 -11 -15 ...
## $ magnet_forearm_y : num 658 658 660 660 657 656 654 653 661 650 ...
## $ magnet_forearm_z : num 469 469 478 474 465 472 478 467 470 473 ...
## $ classe : chr "A" "A" "A" "A" ...
The Dataset now consists of 53 variables with the observations divided as following: 1. Training Data (Atrain): 13737 observations. 2. Validation Data (Atest): 5885 observations. 3. Testing Data: 20 observations.
We fit a predictive model for activity recognition using Decision Tree algorithm and we estimate the performance of the model on the validation data set.
decitree <- rpart(classe ~ ., data = Atrain, method = "class")
# Validate the decison tre model with the validation data set
predictTree <- predict(decitree, Atest, type = "class")
# Plot out our model
rpart.plot(decitree, main = "Classification Tree", extra = 102, under = TRUE, faclen = 0, cex = .4)
prp(decitree)
# calcaulate the confusion matrix
confusionMatrix(factor(Atest$classe), predictTree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1552 48 39 24 11
## B 174 588 220 83 74
## C 18 43 888 75 2
## D 60 63 100 651 90
## E 6 64 148 86 778
##
## Overall Statistics
##
## Accuracy : 0.7573
## 95% CI : (0.7462, 0.7683)
## No Information Rate : 0.3076
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6926
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8575 0.72953 0.6366 0.7084 0.8147
## Specificity 0.9701 0.89151 0.9693 0.9370 0.9383
## Pos Pred Value 0.9271 0.51624 0.8655 0.6753 0.7190
## Neg Pred Value 0.9387 0.95407 0.8957 0.9455 0.9631
## Prevalence 0.3076 0.13696 0.2370 0.1562 0.1623
## Detection Rate 0.2637 0.09992 0.1509 0.1106 0.1322
## Detection Prevalence 0.2845 0.19354 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9138 0.81052 0.8029 0.8227 0.8765
# check the accuracy and standard error
deciAccu <- postResample(predictTree, factor(Atest$classe))
decioutSE <- 1 - as.numeric(confusionMatrix(factor(Atest$classe),predictTree)$overall[1])
deciAccu
## Accuracy Kappa
## 0.7573492 0.6925606
decioutSE
## [1] 0.2426508
percent(deciAccu)
## Accuracy Kappa
## "75.7%" "69.3%"
percent(decioutSE)
## [1] "24%"
The estimated out of sample error with the cross validation dataset for this model is 24%. Accuracy for the decision tree is 75.7% with kappa as 69.3% which is an indication of substantial (in other words problematic and not good enough), hence we will try some other model- Random forest
Set the random forest model with training and cross validation data
RFpml <- randomForest(factor(classe)~.,data = Atrain, ntree = 250, keep.forest = T, xtest = Atest[-53], ytest = factor(Atest$classe))
RFpml
##
## Call:
## randomForest(formula = factor(classe) ~ ., data = Atrain, ntree = 250, keep.forest = T, xtest = Atest[-53], ytest = factor(Atest$classe))
## Type of random forest: classification
## Number of trees: 250
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.52%
## Confusion matrix:
## A B C D E class.error
## A 3899 4 1 1 1 0.001792115
## B 10 2645 3 0 0 0.004890895
## C 0 15 2379 2 0 0.007095159
## D 0 0 23 2227 2 0.011101243
## E 0 0 2 7 2516 0.003564356
## Test set error rate: 0.44%
## Confusion matrix:
## A B C D E class.error
## A 1674 0 0 0 0 0.000000000
## B 4 1132 3 0 0 0.006145742
## C 0 4 1022 0 0 0.003898635
## D 0 0 7 957 0 0.007261411
## E 0 0 4 4 1074 0.007393715
# validate model with the validation test data set
rfpred <- predict(RFpml, Atest, type = "class")
# calculate the confusion matrix
confusionMatrix(rfpred, factor(Atest$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 4 0 0 0
## B 0 1132 4 0 0
## C 0 3 1022 7 4
## D 0 0 0 957 4
## E 0 0 0 0 1074
##
## Overall Statistics
##
## Accuracy : 0.9956
## 95% CI : (0.9935, 0.9971)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9944
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9939 0.9961 0.9927 0.9926
## Specificity 0.9991 0.9992 0.9971 0.9992 1.0000
## Pos Pred Value 0.9976 0.9965 0.9865 0.9958 1.0000
## Neg Pred Value 1.0000 0.9985 0.9992 0.9986 0.9983
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1924 0.1737 0.1626 0.1825
## Detection Prevalence 0.2851 0.1930 0.1760 0.1633 0.1825
## Balanced Accuracy 0.9995 0.9965 0.9966 0.9960 0.9963
# estimate the accuracy and out of sample error
accuracy <- postResample(Atest$classe, rfpred)
ose <- 1 - as.numeric(confusionMatrix(rfpred,factor(Atest$classe))$overall[1])
accuracy
## Accuracy Kappa
## 0.9955820 0.9944114
ose
## [1] 0.004418012
mean(RFpml$err.rate)
## [1] 0.01046482
mean(RFpml$test$err.rate)
## [1] 0.007403291
In-Sample Error Rate Is the error rate for the predictions for input data (training data i.e Atrain) to the model. OOB error rate is 0.016. The in-sample error rate is 0.01046482. Out Of Sample Error Rate which is the error rate for predictions on the cross validated data provided to the model. It is normally higher than the in-sample error rate and is 0.007403291. A close observation of the accuracy for the random forest model shows 99.58% and a 99% kappa with an out-of-sample-error of about 0.004 which is highly negligible.
we will simply try a third model just to be sure that we are actually gett ing what we desire from the Random forest model.
set.seed(2020)
# set control parameter
gbmcntl <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
#the gbm
gbmmod <- train(classe ~ ., data = Atrain, method = "gbm", trControl = gbmcntl, verbose = FALSE)
gbmmod$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 52 had non-zero influence.
print(gbmmod)
## Stochastic Gradient Boosting
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 10989, 10989, 10989, 10990, 10991
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7535853 0.6878119
## 1 100 0.8194660 0.7715761
## 1 150 0.8525149 0.8133894
## 2 50 0.8560101 0.8176176
## 2 100 0.9057300 0.8807080
## 2 150 0.9320822 0.9140652
## 3 50 0.8948098 0.8668407
## 3 100 0.9424913 0.9272318
## 3 150 0.9604727 0.9499917
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
#Validate the GBM model and
gbmpred <- predict(gbmmod, newdata = Atest)
gbmconfusion <- confusionMatrix(factor(Atest$classe), gbmpred)
gbmconfusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1648 17 6 1 2
## B 30 1075 33 1 0
## C 0 35 973 16 2
## D 1 6 28 920 9
## E 3 8 19 18 1034
##
## Overall Statistics
##
## Accuracy : 0.9601
## 95% CI : (0.9547, 0.9649)
## No Information Rate : 0.2858
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9495
##
## Mcnemar's Test P-Value : 9.35e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9798 0.9422 0.9188 0.9623 0.9876
## Specificity 0.9938 0.9865 0.9890 0.9911 0.9901
## Pos Pred Value 0.9845 0.9438 0.9483 0.9544 0.9556
## Neg Pred Value 0.9919 0.9861 0.9823 0.9927 0.9973
## Prevalence 0.2858 0.1939 0.1799 0.1624 0.1779
## Detection Rate 0.2800 0.1827 0.1653 0.1563 0.1757
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9868 0.9643 0.9539 0.9767 0.9888
accuracyGBM <- postResample(Atest$classe, gbmpred)
oseGBM <- 1 - as.numeric(confusionMatrix(gbmpred, factor(Atest$classe))$overall[1])
accuracyGBM
## Accuracy Kappa
## 0.960068 0.949484
oseGBM
## [1] 0.03993203
A gradient boosted generalized regression model with multinomial loss function is computed and 150 iterations were performed. There were 53 predictors of which 52 had non-zero influence. Also observing the accuaracy rate and out of sample of the gbm the sample is given as 96% and 0.0366.
Therefore we conclude that the Random forest model is out best choice for this work.
Prediction for the test data is provided by the following line of code:
result <- predict(RFpml, cleantest)
result
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# plot out the predicted result
plot(RFpml, main="Predictive model of the Random Forest Model", sub = "Random forest plot")
There is a certain element of randomness to randomforest, hence, hardware, operating system and time of use may effect the final answer. In the predicted plot above shows that increasing the number of trees improves accuracy, so increasing the number of trees, ntree would improve the accuracy of predictions. From the plot, it is apparent that error rates improve slowly for trees beyond about 50. Use the following answer key to check your output.
result=c("B","A","B","A","A","E","D","B","A","A","B","C","B","A","E","E","A","B","B","B")