Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

Exploratory Data Analysis

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

#Downloading files
URLtrain="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URLtest="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(URLtrain,destfile = "./train.csv")
download.file(URLtest,destfile="./test.csv")
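
To avoid re-downloading on every run, the files could be fetched only when they are not already present; a minimal sketch using the same URLs and destination paths as above:

if(!file.exists("./train.csv")) download.file(URLtrain, destfile = "./train.csv")
if(!file.exists("./test.csv")) download.file(URLtest, destfile = "./test.csv")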

Loading files:

training=read.csv(file="./train.csv",na.strings=c("NA","#DIV/0!",""))
testing=read.csv(file="./test.csv",na.strings=c("NA","#DIV/0!",""))

We now have two different data sets: a training set and a test set. The data come from http://groupware.les.inf.puc-rio.br/har and contain accelerometer measurements from the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The way each lift was performed is recorded in the classe variable of the training set, which is the outcome we want to predict.
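
As a quick sanity check (not part of the original output), the dimensions of both data sets can be inspected; a minimal sketch:

dim(training)  # full training set (19622 rows)
dim(testing)   # test set with the 20 cases to predict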

Some exploratory analysis and cleaning of the training data set:

NAs=apply(training,2,function(x) sum(is.na(x))/length(x)) # proportion of NAs in each column

# Remove columns that are at least 50% NAs from both data sets
training=training[,NAs<0.5]
testing=testing[,NAs<0.5]
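
To confirm the filter worked, we can check that none of the remaining columns still contain missing values (in this data set, columns are either almost complete or almost entirely NA); a minimal sketch of such a check:

sum(colSums(is.na(training)) > 0)  # remaining columns with NAs, expected to be 0 here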

str(training)
## 'data.frame':    19622 obs. of  60 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name           : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1: int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2: int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp      : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window          : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt           : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y        : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x        : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y        : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z        : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x       : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y       : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z       : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm           : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y         : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z         : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x         : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y         : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z         : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x        : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y        : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z        : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ roll_dumbbell       : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell      : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell        : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 0 0 -0.02 0 0 0 0 0 0 ...
##  $ accel_dumbbell_x    : int  -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
##  $ accel_dumbbell_y    : int  47 47 46 48 48 48 47 46 47 48 ...
##  $ accel_dumbbell_z    : int  -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
##  $ magnet_dumbbell_x   : int  -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
##  $ magnet_dumbbell_y   : int  293 296 298 303 292 294 295 300 292 291 ...
##  $ magnet_dumbbell_z   : num  -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
##  $ roll_forearm        : num  28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
##  $ yaw_forearm         : num  -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
##  $ gyros_forearm_y     : num  0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
##  $ gyros_forearm_z     : num  -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
##  $ accel_forearm_x     : int  192 192 196 189 189 193 195 193 193 190 ...
##  $ accel_forearm_y     : int  203 203 204 206 206 203 205 205 204 205 ...
##  $ accel_forearm_z     : int  -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
##  $ magnet_forearm_x    : int  -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
##  $ magnet_forearm_y    : num  654 661 658 658 655 660 659 660 653 656 ...
##  $ magnet_forearm_z    : num  476 473 469 469 473 478 470 474 476 473 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
table(training$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
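
The class frequencies can also be expressed as proportions, which shows that the five classes are reasonably balanced (class A being the most frequent); a minimal sketch:

round(prop.table(table(training$classe)), 3)  # proportion of each classe level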

Data splitting

We create a partition of the training data set: 70% of the observations will be used to fit the models and the remaining 30% will be held out to validate them.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
inTrain=createDataPartition(y=training$classe,p=0.7,list=F)
myTraining=training[inTrain,]
myTesting=training[-inTrain,]

dim(myTraining); dim(myTesting)
## [1] 13737    60
## [1] 5885   60
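
Because createDataPartition samples within each level of classe, the class proportions in myTraining and myTesting should closely match those of the full training set; a quick check (a sketch, not in the original analysis):

round(prop.table(table(myTraining$classe)), 3)
round(prop.table(table(myTesting$classe)), 3)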

Preparing myTraining

Removing variables that are not predictors (columns 1 to 7: the row index X, user_name, the timestamp variables, and the window variables)

myTraining=myTraining[,8:dim(myTraining)[2]]
nzv=nearZeroVar(myTraining, saveMetrics = TRUE)
nzv
##                      freqRatio percentUnique zeroVar   nzv
## roll_belt             1.109453    8.08764650   FALSE FALSE
## pitch_belt            1.075188   12.31709980   FALSE FALSE
## yaw_belt              1.022161   13.01594235   FALSE FALSE
## total_accel_belt      1.052429    0.21110868   FALSE FALSE
## gyros_belt_x          1.070288    0.96090850   FALSE FALSE
## gyros_belt_y          1.128822    0.47317464   FALSE FALSE
## gyros_belt_z          1.073804    1.20841523   FALSE FALSE
## accel_belt_x          1.043478    1.15745796   FALSE FALSE
## accel_belt_y          1.120188    0.99730654   FALSE FALSE
## accel_belt_z          1.067742    2.10380724   FALSE FALSE
## magnet_belt_x         1.090535    2.20572177   FALSE FALSE
## magnet_belt_y         1.168565    2.13292568   FALSE FALSE
## magnet_belt_z         1.005917    3.17390988   FALSE FALSE
## roll_arm             49.916667   17.47834316   FALSE FALSE
## pitch_arm            79.900000   20.41202592   FALSE FALSE
## yaw_arm              32.378378   19.31280483   FALSE FALSE
## total_accel_arm       1.053140    0.47317464   FALSE FALSE
## gyros_arm_x           1.036723    4.55703574   FALSE FALSE
## gyros_arm_y           1.420765    2.67161680   FALSE FALSE
## gyros_arm_z           1.170391    1.71070831   FALSE FALSE
## accel_arm_x           1.024390    5.56162190   FALSE FALSE
## accel_arm_y           1.164474    3.82179515   FALSE FALSE
## accel_arm_z           1.125000    5.55434229   FALSE FALSE
## magnet_arm_x          1.031746    9.57268690   FALSE FALSE
## magnet_arm_y          1.000000    6.20950717   FALSE FALSE
## magnet_arm_z          1.038462    9.11407149   FALSE FALSE
## roll_dumbbell         1.146067   86.40168887   FALSE FALSE
## pitch_dumbbell        2.068627   84.18868749   FALSE FALSE
## yaw_dumbbell          1.259259   85.84115891   FALSE FALSE
## total_accel_dumbbell  1.086458    0.31302322   FALSE FALSE
## gyros_dumbbell_x      1.040000    1.65975104   FALSE FALSE
## gyros_dumbbell_y      1.281863    1.95821504   FALSE FALSE
## gyros_dumbbell_z      1.043981    1.40496469   FALSE FALSE
## accel_dumbbell_x      1.136564    2.98464002   FALSE FALSE
## accel_dumbbell_y      1.045198    3.31950207   FALSE FALSE
## accel_dumbbell_z      1.164706    2.91184393   FALSE FALSE
## magnet_dumbbell_x     1.065574    7.81102133   FALSE FALSE
## magnet_dumbbell_y     1.153846    5.99111888   FALSE FALSE
## magnet_dumbbell_z     1.000000    4.81910170   FALSE FALSE
## roll_forearm         11.040323   13.70022567   FALSE FALSE
## pitch_forearm        59.478261   18.96338356   FALSE FALSE
## yaw_forearm          15.116022   12.82667249   FALSE FALSE
## total_accel_forearm   1.146621    0.48773386   FALSE FALSE
## gyros_forearm_x       1.024390    2.07468880   FALSE FALSE
## gyros_forearm_y       1.010676    5.20492102   FALSE FALSE
## gyros_forearm_z       1.110787    2.09652763   FALSE FALSE
## accel_forearm_x       1.193548    5.68537526   FALSE FALSE
## accel_forearm_y       1.013889    7.09761957   FALSE FALSE
## accel_forearm_z       1.060345    4.07658150   FALSE FALSE
## magnet_forearm_x      1.035714   10.59911189   FALSE FALSE
## magnet_forearm_y      1.368421   13.24888986   FALSE FALSE
## magnet_forearm_z      1.022727   11.66921453   FALSE FALSE
## classe                1.469526    0.03639805   FALSE FALSE
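
Every variable has nzv equal to FALSE, so no near-zero-variance predictors need to be removed. If any had been flagged, they could be dropped with something like the following sketch:

# Hypothetical clean-up step: only needed if some predictors were flagged as nzv
if(any(nzv$nzv)) myTraining = myTraining[, !nzv$nzv]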

Algorithm

Decision Trees

library(rattle); library(rpart)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle and roll your data.
set.seed(1124)

treeModel=rpart(classe ~ .,data=myTraining,method="class")
fancyRpartPlot(treeModel,sub="")

predictions1=predict(treeModel,newdata=myTesting,type="class")
confusionMatrix(predictions1,myTesting$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1516  171   22   60   17
##          B   68  704   84   93   98
##          C   41  116  820  161  131
##          D   17   88   70  585   59
##          E   32   60   30   65  777
## 
## Overall Statistics
##                                           
##                Accuracy : 0.748           
##                  95% CI : (0.7367, 0.7591)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6805          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9056   0.6181   0.7992  0.60685   0.7181
## Specificity            0.9359   0.9277   0.9076  0.95245   0.9611
## Pos Pred Value         0.8488   0.6724   0.6462  0.71429   0.8060
## Neg Pred Value         0.9615   0.9101   0.9554  0.92519   0.9380
## Prevalence             0.2845   0.1935   0.1743  0.16381   0.1839
## Detection Rate         0.2576   0.1196   0.1393  0.09941   0.1320
## Detection Prevalence   0.3035   0.1779   0.2156  0.13917   0.1638
## Balanced Accuracy      0.9207   0.7729   0.8534  0.77965   0.8396

Random Forests

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
rfModel=randomForest(classe ~ .,data=myTraining)

predictions2=predict(rfModel,newdata=myTesting)
confusionMatrix(predictions2,myTesting$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    4    0    0    0
##          B    0 1134    3    0    0
##          C    0    1 1022   12    0
##          D    0    0    1  951    0
##          E    0    0    0    1 1082
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9963          
##                  95% CI : (0.9943, 0.9977)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9953          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9956   0.9961   0.9865   1.0000
## Specificity            0.9991   0.9994   0.9973   0.9998   0.9998
## Pos Pred Value         0.9976   0.9974   0.9874   0.9989   0.9991
## Neg Pred Value         1.0000   0.9989   0.9992   0.9974   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1927   0.1737   0.1616   0.1839
## Detection Prevalence   0.2851   0.1932   0.1759   0.1618   0.1840
## Balanced Accuracy      0.9995   0.9975   0.9967   0.9932   0.9999
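
Beyond the confusion matrix, the fitted forest can be inspected directly, for example its out-of-bag (OOB) error estimate and the most influential predictors; a minimal sketch using standard randomForest functions:

print(rfModel)                   # includes the OOB estimate of the error rate
varImpPlot(rfModel, n.var = 20)  # 20 most important predictors by mean decrease in Gini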

Predicting values in the test set

The Random Forests model is more accurate than the Decision Trees model: it achieves an accuracy of 99.63% on the validation partition (myTesting), versus 74.8% for the decision tree, so the expected out-of-sample error is well below 1%. We therefore use the Random Forests model to predict the 20 cases in the test set.
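
The expected out-of-sample error can be estimated directly from the validation accuracy; a minimal sketch:

acc = confusionMatrix(predictions2, myTesting$classe)$overall["Accuracy"]
1 - acc  # estimated out-of-sample error, roughly 0.4%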

predictions3=predict(rfModel,testing)
predictions3
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E