Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
### Load Library
#install.packages(c("caret", "randomForest", "rpart.plot"))
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest) #Random forest for classification and regression
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart) # Recursive partitioning and regression trees
library(rpart.plot) # Plotting of rpart decision trees
set.seed(101)
# Save downloaded csv in working directory
# Missing values coded as "#DIV/0!", "" or "NA" are changed to NA.
# Columns that contain missing values will be deleted.
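# Load training set
# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where the read.csv default changed to FALSE
trainingset <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!", ""), stringsAsFactors = TRUE)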
#Load testing set
testingset <- read.csv('pml-testing.csv', na.strings=c("NA","#DIV/0!", ""))
# Check Dimensions of training and test data set
dim(trainingset)
## [1] 19622 160
dim(testingset)
## [1] 20 160
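To illustrate what the `na.strings` argument does, here is a minimal sketch on a made-up two-column CSV. The inline text and the column names `a` and `b` are hypothetical and not taken from the project data. Without this argument, a spreadsheet error marker such as `#DIV/0!` would force the whole column to be read as text.

```r
# Toy CSV supplied inline; the values are invented for illustration only.
csv_text <- "a,b\n1,#DIV/0!\n,NA\n3,4"

toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

# Every non-numeric marker became NA, so both columns are read as numbers.
str(toy)
colSums(is.na(toy))  # number of missing values per column
```

The same three markers are the ones passed to `read.csv` for both project files above.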
# Delete columns that contain any missing values (only columns with zero NAs are kept)
trainingset<-trainingset[,colSums(is.na(trainingset)) == 0]
testingset <-testingset[,colSums(is.na(testingset)) == 0]
# Remove columns not needed for prediction: X (row index), user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, and num_window (columns 1 to 7).
trainingset <-trainingset[,-c(1:7)]
testingset <-testingset[,-c(1:7)]
# Check Dimensions of training and test data set
dim(trainingset)
## [1] 19622 53
dim(testingset)
## [1] 20 53
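The filter `colSums(is.na(x)) == 0` keeps only columns with no missing values at all, which is stricter than dropping columns that are entirely missing. The sketch below contrasts the two on a toy data frame whose column names and values are invented for illustration.

```r
# Toy data frame; column names and values are invented for illustration.
toy <- data.frame(
  full    = c(1, 2, 3),    # no missing values
  partial = c(1, NA, 3),   # one missing value
  empty   = c(NA, NA, NA)  # entirely missing
)

na_counts <- colSums(is.na(toy))

# Filter used in this report: keep only columns with zero missing values.
keep_complete <- toy[, na_counts == 0, drop = FALSE]

# Looser alternative: drop only columns that are entirely missing.
keep_not_empty <- toy[, na_counts < nrow(toy), drop = FALSE]

names(keep_complete)
names(keep_not_empty)
```

With the strict filter, a column with even a single NA is discarded, so it is worth checking `na_counts` before deciding which rule suits a given data set.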
head(trainingset)
## roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x gyros_belt_y
## 1 1.41 8.07 -94.4 3 0.00 0.00
## 2 1.41 8.07 -94.4 3 0.02 0.00
## 3 1.42 8.07 -94.4 3 0.00 0.00
## 4 1.48 8.05 -94.4 3 0.02 0.00
## 5 1.48 8.07 -94.4 3 0.02 0.02
## 6 1.45 8.06 -94.4 3 0.02 0.00
## gyros_belt_z accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## 1 -0.02 -21 4 22 -3
## 2 -0.02 -22 4 22 -7
## 3 -0.02 -20 5 23 -2
## 4 -0.03 -22 3 21 -6
## 5 -0.02 -21 2 24 -6
## 6 -0.02 -21 4 21 0
## magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm
## 1 599 -313 -128 22.5 -161 34
## 2 608 -311 -128 22.5 -161 34
## 3 600 -305 -128 22.5 -161 34
## 4 604 -310 -128 22.1 -161 34
## 5 600 -302 -128 22.1 -161 34
## 6 603 -312 -128 22.0 -161 34
## gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## 1 0.00 0.00 -0.02 -288 109 -123
## 2 0.02 -0.02 -0.02 -290 110 -125
## 3 0.02 -0.02 -0.02 -289 110 -126
## 4 0.02 -0.03 0.02 -289 111 -123
## 5 0.00 -0.03 0.00 -289 111 -123
## 6 0.02 -0.03 0.00 -289 111 -122
## magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell pitch_dumbbell
## 1 -368 337 516 13.05217 -70.49400
## 2 -369 337 513 13.13074 -70.63751
## 3 -368 344 513 12.85075 -70.27812
## 4 -372 344 512 13.43120 -70.39379
## 5 -374 337 506 13.37872 -70.42856
## 6 -369 342 513 13.38246 -70.81759
## yaw_dumbbell total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 1 -84.87394 37 0 -0.02
## 2 -84.71065 37 0 -0.02
## 3 -85.14078 37 0 -0.02
## 4 -84.87363 37 0 -0.02
## 5 -84.85306 37 0 -0.02
## 6 -84.46500 37 0 -0.02
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## 1 0.00 -234 47 -271
## 2 0.00 -233 47 -269
## 3 0.00 -232 46 -270
## 4 -0.02 -232 48 -269
## 5 0.00 -233 48 -270
## 6 0.00 -234 48 -269
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## 1 -559 293 -65 28.4
## 2 -555 296 -64 28.3
## 3 -561 298 -63 28.3
## 4 -552 303 -60 28.1
## 5 -554 292 -68 28.0
## 6 -558 294 -66 27.9
## pitch_forearm yaw_forearm total_accel_forearm gyros_forearm_x
## 1 -63.9 -153 36 0.03
## 2 -63.9 -153 36 0.02
## 3 -63.9 -152 36 0.03
## 4 -63.9 -152 36 0.02
## 5 -63.9 -152 36 0.02
## 6 -63.9 -152 36 0.02
## gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
## 1 0.00 -0.02 192 203
## 2 0.00 -0.02 192 203
## 3 -0.02 0.00 196 204
## 4 -0.02 0.00 189 206
## 5 0.00 -0.02 189 206
## 6 -0.02 -0.03 193 203
## accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z
## 1 -215 -17 654 476
## 2 -216 -18 661 473
## 3 -213 -18 658 469
## 4 -214 -16 658 469
## 5 -214 -17 655 473
## 6 -215 -9 660 478
## classe
## 1 A
## 2 A
## 3 A
## 4 A
## 5 A
## 6 A
head(testingset)
## roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x gyros_belt_y
## 1 123.00 27.00 -4.75 20 -0.50 -0.02
## 2 1.02 4.87 -88.90 4 -0.06 -0.02
## 3 0.87 1.82 -88.50 5 0.05 0.02
## 4 125.00 -41.60 162.00 17 0.11 0.11
## 5 1.35 3.33 -88.60 3 0.03 0.02
## 6 -5.92 1.59 -87.70 4 0.10 0.05
## gyros_belt_z accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## 1 -0.46 -38 69 -179 -13
## 2 -0.07 -13 11 39 43
## 3 0.03 1 -1 49 29
## 4 -0.16 46 45 -156 169
## 5 0.00 -8 4 27 33
## 6 -0.13 -11 -16 38 31
## magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm
## 1 581 -382 40.7 -27.80 178 10
## 2 636 -309 0.0 0.00 0 38
## 3 631 -312 0.0 0.00 0 44
## 4 608 -304 -109.0 55.00 -142 25
## 5 566 -418 76.1 2.76 102 29
## 6 638 -291 0.0 0.00 0 14
## gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## 1 -1.65 0.48 -0.18 16 38 93
## 2 -1.17 0.85 -0.43 -290 215 -90
## 3 2.10 -1.36 1.13 -341 245 -87
## 4 0.22 -0.51 0.92 -238 -57 6
## 5 -1.96 0.79 -0.54 -197 200 -30
## 6 0.02 0.05 -0.07 -26 130 -19
## magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell pitch_dumbbell
## 1 -326 385 481 -17.73748 24.96085
## 2 -325 447 434 54.47761 -53.69758
## 3 -264 474 413 57.07031 -51.37303
## 4 -173 257 633 43.10927 -30.04885
## 5 -170 275 617 -101.38396 -53.43952
## 6 396 176 516 62.18750 -50.55595
## yaw_dumbbell total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 1 126.23596 9 0.64 0.06
## 2 -75.51480 31 0.34 0.05
## 3 -75.20287 29 0.39 0.14
## 4 -103.32003 18 0.10 -0.02
## 5 -14.19542 4 0.29 -0.47
## 6 -71.12063 29 -0.59 0.80
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## 1 -0.61 21 -15 81
## 2 -0.71 -153 155 -205
## 3 -0.34 -141 155 -196
## 4 0.05 -51 72 -148
## 5 -0.46 -18 -30 -5
## 6 1.10 -138 166 -186
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## 1 523 -528 -56 141
## 2 -502 388 -36 109
## 3 -506 349 41 131
## 4 -576 238 53 0
## 5 -424 252 312 -176
## 6 -543 262 96 150
## pitch_forearm yaw_forearm total_accel_forearm gyros_forearm_x
## 1 49.30 156.0 33 0.74
## 2 -17.60 106.0 39 1.12
## 3 -32.60 93.0 34 0.18
## 4 0.00 0.0 43 1.38
## 5 -2.16 -47.9 24 -0.75
## 6 1.46 89.7 43 -0.88
## gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
## 1 -3.34 -0.59 -110 267
## 2 -2.78 -0.18 212 297
## 3 -0.79 0.28 154 271
## 4 0.69 1.80 -92 406
## 5 3.10 0.80 131 -93
## 6 4.26 1.35 230 322
## accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z
## 1 -149 -714 419 617
## 2 -118 -237 791 873
## 3 -129 -51 698 783
## 4 -39 -233 783 521
## 5 172 375 -787 91
## 6 -144 -300 800 884
## problem_id
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
# The training data set contains 53 variables and 19622 obs.
# The testing data set contains 53 variables and 20 obs.
subsamples <- createDataPartition(y=trainingset$classe, p=0.60, list=FALSE)
subTraining <- trainingset[subsamples, ]
subTesting <- trainingset[-subsamples, ]
dim(subTraining)
## [1] 11776 53
dim(subTesting)
## [1] 7846 53
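For a factor outcome, `createDataPartition` samples within each class so that the class proportions are roughly preserved in both parts. The sketch below shows that idea in base R on toy labels. The helper `stratified_index` is hypothetical and is not caret's implementation; it only illustrates stratified sampling under simple assumptions.

```r
set.seed(101)

# Toy class labels with unequal class sizes (invented for illustration).
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(classe, p = 0.60)

# Class counts in the two parts of the split.
table(classe[train_idx])
table(classe[-train_idx])
```

Because the sampling is done per class, a rare class cannot end up missing from either part by chance, which a plain random split does not guarantee.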
head(subTraining)
## roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x gyros_belt_y
## 1 1.41 8.07 -94.4 3 0.00 0.00
## 2 1.41 8.07 -94.4 3 0.02 0.00
## 4 1.48 8.05 -94.4 3 0.02 0.00
## 5 1.48 8.07 -94.4 3 0.02 0.02
## 6 1.45 8.06 -94.4 3 0.02 0.00
## 8 1.42 8.13 -94.4 3 0.02 0.00
## gyros_belt_z accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## 1 -0.02 -21 4 22 -3
## 2 -0.02 -22 4 22 -7
## 4 -0.03 -22 3 21 -6
## 5 -0.02 -21 2 24 -6
## 6 -0.02 -21 4 21 0
## 8 -0.02 -22 4 21 -2
## magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm
## 1 599 -313 -128 22.5 -161 34
## 2 608 -311 -128 22.5 -161 34
## 4 604 -310 -128 22.1 -161 34
## 5 600 -302 -128 22.1 -161 34
## 6 603 -312 -128 22.0 -161 34
## 8 603 -313 -128 21.8 -161 34
## gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## 1 0.00 0.00 -0.02 -288 109 -123
## 2 0.02 -0.02 -0.02 -290 110 -125
## 4 0.02 -0.03 0.02 -289 111 -123
## 5 0.00 -0.03 0.00 -289 111 -123
## 6 0.02 -0.03 0.00 -289 111 -122
## 8 0.02 -0.02 0.00 -289 111 -124
## magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell pitch_dumbbell
## 1 -368 337 516 13.05217 -70.49400
## 2 -369 337 513 13.13074 -70.63751
## 4 -372 344 512 13.43120 -70.39379
## 5 -374 337 506 13.37872 -70.42856
## 6 -369 342 513 13.38246 -70.81759
## 8 -372 338 510 12.75083 -70.34768
## yaw_dumbbell total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 1 -84.87394 37 0 -0.02
## 2 -84.71065 37 0 -0.02
## 4 -84.87363 37 0 -0.02
## 5 -84.85306 37 0 -0.02
## 6 -84.46500 37 0 -0.02
## 8 -85.09708 37 0 -0.02
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## 1 0.00 -234 47 -271
## 2 0.00 -233 47 -269
## 4 -0.02 -232 48 -269
## 5 0.00 -233 48 -270
## 6 0.00 -234 48 -269
## 8 0.00 -234 46 -272
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## 1 -559 293 -65 28.4
## 2 -555 296 -64 28.3
## 4 -552 303 -60 28.1
## 5 -554 292 -68 28.0
## 6 -558 294 -66 27.9
## 8 -555 300 -74 27.8
## pitch_forearm yaw_forearm total_accel_forearm gyros_forearm_x
## 1 -63.9 -153 36 0.03
## 2 -63.9 -153 36 0.02
## 4 -63.9 -152 36 0.02
## 5 -63.9 -152 36 0.02
## 6 -63.9 -152 36 0.02
## 8 -63.8 -152 36 0.02
## gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
## 1 0.00 -0.02 192 203
## 2 0.00 -0.02 192 203
## 4 -0.02 0.00 189 206
## 5 0.00 -0.02 189 206
## 6 -0.02 -0.03 193 203
## 8 -0.02 0.00 193 205
## accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z
## 1 -215 -17 654 476
## 2 -216 -18 661 473
## 4 -214 -16 658 469
## 5 -214 -17 655 473
## 6 -215 -9 660 478
## 8 -213 -9 660 474
## classe
## 1 A
## 2 A
## 4 A
## 5 A
## 6 A
## 8 A
head(subTesting)
## roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x
## 3 1.42 8.07 -94.4 3 0.00
## 7 1.42 8.09 -94.4 3 0.02
## 12 1.43 8.18 -94.4 3 0.02
## 13 1.42 8.20 -94.4 3 0.02
## 22 1.57 8.09 -94.4 3 0.02
## 26 1.55 8.09 -94.4 3 0.02
## gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z
## 3 0.00 -0.02 -20 5 23
## 7 0.00 -0.02 -22 3 21
## 12 0.00 -0.02 -22 2 23
## 13 0.00 0.00 -22 4 21
## 22 0.02 -0.02 -21 3 21
## 26 0.00 0.00 -21 3 22
## magnet_belt_x magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm
## 3 -2 600 -305 -128 22.5 -161
## 7 -4 599 -311 -128 21.9 -161
## 12 -2 602 -319 -128 21.5 -161
## 13 -3 606 -309 -128 21.4 -161
## 22 -2 604 -313 -129 20.8 -161
## 26 -10 601 -312 -129 20.7 -161
## total_accel_arm gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x
## 3 34 0.02 -0.02 -0.02 -289
## 7 34 0.00 -0.03 0.00 -289
## 12 34 0.02 -0.03 0.00 -288
## 13 34 0.02 -0.02 -0.02 -287
## 22 34 0.03 -0.02 -0.02 -289
## 26 34 -0.02 -0.02 -0.02 -290
## accel_arm_y accel_arm_z magnet_arm_x magnet_arm_y magnet_arm_z
## 3 110 -126 -368 344 513
## 7 111 -125 -373 336 509
## 12 111 -123 -363 343 520
## 13 111 -124 -372 338 509
## 22 111 -123 -372 338 510
## 26 108 -123 -366 346 511
## roll_dumbbell pitch_dumbbell yaw_dumbbell total_accel_dumbbell
## 3 12.85075 -70.27812 -85.14078 37
## 7 13.12695 -70.24757 -85.09961 37
## 12 13.10321 -70.45975 -84.89472 37
## 13 13.38246 -70.81759 -84.46500 37
## 22 13.37872 -70.42856 -84.85306 37
## 26 12.80060 -70.31305 -85.11886 37
## gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z accel_dumbbell_x
## 3 0 -0.02 0.00 -232
## 7 0 -0.02 0.00 -232
## 12 0 -0.02 0.00 -233
## 13 0 -0.02 -0.02 -234
## 22 0 -0.02 0.00 -233
## 26 0 -0.02 -0.02 -233
## accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## 3 46 -270 -561 298
## 7 47 -270 -551 295
## 12 47 -270 -554 291
## 13 48 -269 -552 302
## 22 48 -270 -554 301
## 26 46 -271 -563 294
## magnet_dumbbell_z roll_forearm pitch_forearm yaw_forearm
## 3 -63 28.3 -63.9 -152
## 7 -70 27.9 -63.9 -152
## 12 -65 27.5 -63.8 -152
## 13 -69 27.2 -63.9 -151
## 22 -65 27.0 -63.9 -151
## 26 -72 27.0 -63.7 -151
## total_accel_forearm gyros_forearm_x gyros_forearm_y gyros_forearm_z
## 3 36 0.03 -0.02 0.00
## 7 36 0.02 0.00 -0.02
## 12 36 0.02 0.02 -0.03
## 13 36 0.00 0.00 -0.03
## 22 36 0.02 -0.03 -0.02
## 26 36 0.03 0.00 0.00
## accel_forearm_x accel_forearm_y accel_forearm_z magnet_forearm_x
## 3 196 204 -213 -18
## 7 195 205 -215 -18
## 12 191 203 -215 -11
## 13 193 205 -215 -15
## 22 191 206 -213 -17
## 26 190 203 -216 -16
## magnet_forearm_y magnet_forearm_z classe
## 3 658 469 A
## 7 659 470 A
## 12 657 478 A
## 13 655 472 A
## 22 654 478 A
## 26 658 462 A
The following bar plot shows the frequency of each level of classe in the subTraining data set, so that the levels can be compared with one another.
plot(subTraining$classe, col="grey", main="Bar Plot - levels of the variable classe within the subTraining data set", xlab="classe levels", ylab="Frequency")
# From the graph above, level A is the most frequent with roughly 3,350 occurrences, whereas level D is the least frequent with roughly 1,930 occurrences.
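Exact frequencies can also be computed rather than read off the plot. The sketch below uses a short toy vector standing in for `subTraining$classe`; the labels are invented for illustration.

```r
# Toy labels standing in for subTraining$classe (invented for illustration).
labels <- factor(c("A", "A", "A", "B", "B", "C", "D", "D", "E", "E"))

counts <- table(labels)       # absolute frequency of each level
shares <- prop.table(counts)  # relative frequency of each level

counts
round(shares, 2)

# Most and least frequent levels, taken from the counts instead of the plot.
most_frequent  <- names(counts)[which.max(counts)]
least_frequent <- names(counts)[which.min(counts)]
```

Applied to the real data, the same calls would be `table(subTraining$classe)` and `prop.table(table(subTraining$classe))`.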
testModel <- rpart(classe ~ ., data=subTraining, method="class")
# Predicting:
testprediction <- predict(testModel, subTesting, type = "class")
# Plot of the Decision Tree
rpart.plot(testModel, main="Classification Tree", extra=102, under=TRUE, faclen=0)
#Test results on our subTesting data set:
confusionMatrix(testprediction, subTesting$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1921 291 63 119 39
## B 87 1028 177 77 120
## C 85 88 1058 202 114
## D 127 96 58 849 112
## E 12 15 12 39 1057
##
## Overall Statistics
##
## Accuracy : 0.7536
## 95% CI : (0.7439, 0.7631)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6874
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8607 0.6772 0.7734 0.6602 0.7330
## Specificity 0.9088 0.9271 0.9245 0.9401 0.9878
## Pos Pred Value 0.7896 0.6904 0.6839 0.6836 0.9313
## Neg Pred Value 0.9425 0.9229 0.9508 0.9338 0.9426
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2448 0.1310 0.1348 0.1082 0.1347
## Detection Prevalence 0.3101 0.1898 0.1972 0.1583 0.1447
## Balanced Accuracy 0.8847 0.8022 0.8490 0.8001 0.8604
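The overall statistics can be recomputed by hand from the confusion matrix, which helps keep Accuracy and Kappa apart. The sketch below copies the decision-tree counts from the output above; the variable names are chosen here for illustration.

```r
# Decision-tree confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(1921,  291,   63,  119,   39,
      87, 1028,  177,   77,  120,
      85,   88, 1058,  202,  114,
     127,   96,   58,  849,  112,
      12,   15,   12,   39, 1057),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm)

# Accuracy: share of observations on the diagonal.
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: agreement beyond what the row and column totals alone would give.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = cohen_kappa), 4)
```

Kappa is lower than accuracy here because it discounts the agreement that would be expected from the class frequencies alone.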
# randomForest has no method argument; it fits a classification forest because classe is a factor
trainingModel <- randomForest(classe ~ ., data=subTraining)
# Predicting:
trainingPrediction <- predict(trainingModel, subTesting, type = "class")
# Test results on subTesting data set:
confusionMatrix(trainingPrediction, subTesting$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2229 15 0 0 0
## B 0 1500 10 0 0
## C 2 3 1358 10 2
## D 0 0 0 1275 2
## E 1 0 0 1 1438
##
## Overall Statistics
##
## Accuracy : 0.9941
## 95% CI : (0.9922, 0.9957)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9926
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9987 0.9881 0.9927 0.9914 0.9972
## Specificity 0.9973 0.9984 0.9974 0.9997 0.9997
## Pos Pred Value 0.9933 0.9934 0.9876 0.9984 0.9986
## Neg Pred Value 0.9995 0.9972 0.9985 0.9983 0.9994
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2841 0.1912 0.1731 0.1625 0.1833
## Detection Prevalence 0.2860 0.1925 0.1752 0.1628 0.1835
## Balanced Accuracy 0.9980 0.9933 0.9950 0.9956 0.9985
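The per-class rows of this table follow from one-versus-rest counts in the confusion matrix. As a minimal sketch, the code below recomputes sensitivity and specificity from the Random Forest counts shown above; the variable names are chosen here for illustration.

```r
# Random Forest confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(2229,   15,    0,    0,    0,
       0, 1500,   10,    0,    0,
       2,    3, 1358,   10,    2,
       0,    0,    0, 1275,    2,
       1,    0,    0,    1, 1438),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

tp <- diag(cm)               # predicted k and truly k
fn <- colSums(cm) - tp       # truly k but predicted as another class
fp <- rowSums(cm) - tp       # predicted k but truly another class
tn <- sum(cm) - tp - fn - fp # everything else

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

round(rbind(sensitivity, specificity), 4)
```

Each class is treated in turn as the positive class and the other four together as the negative class.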
As expected, the Random Forest algorithm performed better than the Decision Tree. Accuracy for the Random Forest model was 0.9941 (95% CI: (0.9922, 0.9957)) compared to 0.7536 (95% CI: (0.7439, 0.7631)) for the Decision Tree; the corresponding Kappa values were 0.9926 and 0.6874. The expected out-of-sample error of the Random Forest model is estimated at 0.0059, or 0.59%. It is calculated as 1 - accuracy for predictions made against the held-out subTesting set, which is a single 60/40 hold-out split of the training data rather than a k-fold cross-validation. Our test data set comprises 20 cases. With an accuracy above 99% on the held-out data, we can expect that very few, or none, of the test samples will be misclassified.
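The arithmetic behind the out-of-sample estimate can be sketched as follows, using the diagonal of the Random Forest confusion matrix and the number of rows in subTesting. The final step assumes that the 20 test cases are independent and are classified with the same accuracy as the held-out data, which is a simplification rather than a guarantee.

```r
# Correct predictions: diagonal of the Random Forest confusion matrix.
correct <- 2229 + 1500 + 1358 + 1275 + 1438
total   <- 7846  # rows in subTesting

accuracy  <- correct / total
oos_error <- 1 - accuracy

# Assumption: 20 independent test cases, each classified with the hold-out accuracy.
n_test <- 20
expected_misclassified <- n_test * oos_error
p_all_correct <- accuracy^n_test

round(c(accuracy = accuracy,
        oos_error = oos_error,
        expected_misclassified = expected_misclassified,
        p_all_correct = p_all_correct), 4)
```

Under these assumptions the chance of classifying all 20 cases correctly is roughly 0.89, and the expected number of misclassified cases is well below one.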