Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Only Class A corresponds to correct performance. The goal of this project is to predict the manner in which they did the exercise, i.e., Class A to E. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The data analysis in this report sets out to answer the following question:
Is it possible to classify the manner in which a dumbbell exercise was performed using sensor data from a glove, belt, arm-band, and dumbbell?
# Load data from the following URLs (download commands kept commented out after the first run)
# urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
# urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# download.file(urlTrain, "data/pml-training.csv", method="curl")
# download.file(urlTest, "data/pml-testing.csv", method="curl")
training <- read.csv("data/pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
testing <- read.csv("data/pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))
# remove columns that contain NA values
cols_vector <- (colSums(is.na(training))==0)
training <- training[,cols_vector]
testing <- testing[,cols_vector]
# remove non-predictor columns: X, user_name, raw_timestamp_part_1, raw_timestamp_part_2,
# cvtd_timestamp, new_window, num_window
trainOrig <- training[,-c(1:7)]
testOrig <- testing[,-c(1:7)]
In order to estimate the out-of-sample error, we split the data into a training set (70%) used for model fitting and a validation set (30%) used for validation.
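Note that for a fully reproducible split, a seed could be set before calling createDataPartition; the value below is arbitrary and an assumption (the original analysis sets its seed only later, before the correlation step).
set.seed(12345) # arbitrary seed (assumption) so that the 70/30 split is reproducible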
library(caret);
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y=trainOrig$classe, p=0.7, list=FALSE)
train <- trainOrig[inTrain,]
valid <- trainOrig[-inTrain,]
dim(train); dim(valid)
## [1] 13737 53
## [1] 5885 53
After data cleaning we are left with a training set of 52 predictors. To help choose a training algorithm, we examine the correlations between the variables.
library(corrplot)
set.seed(343007)
# "classe" index is 53
M <- abs(cor(train[,-53]))
diag(M) <- 0 # remove variables correlation with themselves
which(M > 0.9, arr.ind = TRUE)
## row col
## total_accel_belt 4 1
## accel_belt_y 9 1
## accel_belt_z 10 1
## accel_belt_x 8 2
## roll_belt 1 4
## accel_belt_y 9 4
## accel_belt_z 10 4
## pitch_belt 2 8
## roll_belt 1 9
## total_accel_belt 4 9
## accel_belt_z 10 9
## roll_belt 1 10
## total_accel_belt 4 10
## accel_belt_y 9 10
## gyros_arm_y 19 18
## gyros_arm_x 18 19
## gyros_dumbbell_z 33 31
## gyros_forearm_z 46 31
## gyros_dumbbell_x 31 33
## gyros_forearm_z 46 33
## gyros_dumbbell_x 31 46
## gyros_dumbbell_z 33 46
corrplot(corr=cor(train[, -53]), method = "circle", order="hclust", type='lower', tl.cex=0.5,mar=c(0,2,2,1),
tl.col='blue',tl.pos='ld', diag=FALSE,
title="Fig. 1 Pairwise correlations of the variables in the training set")
The data contain several highly correlated variables. Hence we use the Random Forest algorithm, because it automatically selects important variables and is generally robust to correlated covariates and outliers. Since the “classe” factor variable has 5 levels, we will use 5-fold cross-validation when applying the algorithm.
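As an aside, instead of relying on Random Forest's robustness, the highly correlated predictors could be dropped explicitly with caret::findCorrelation; a minimal sketch (not used for the final model; the 0.9 cutoff mirrors the threshold above):
# indices of predictors with pairwise correlation above 0.9 (illustrative sketch only)
highCorr <- findCorrelation(cor(train[, -53]), cutoff = 0.9)
trainReduced <- train[, -highCorr] # "classe" is kept, since it is not among the flagged columns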
predNames <- names(train)
predIdx <- grep("^classe", predNames, invert = TRUE)
predNames <- predNames[predIdx]
preProc <- preProcess(train[, predNames], method = "pca", thresh = 0.99)
preProc
## Created from 13737 samples and 52 variables
##
## Pre-processing:
## - centered (52)
## - ignored (0)
## - principal component signal extraction (52)
## - scaled (52)
##
## PCA needed 36 components to capture 99 percent of the variance
As we can see, 36 principal components (instead of 52 predictors) are enough to capture 99% of the variance. Even so, the amount of computation required is too large for my computer.
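If the PCA route were pursued despite the cost, the transformation would be applied to both the training and validation sets before fitting; a hedged sketch using the preProc object created above:
# apply the PCA transformation (illustrative sketch; the final model below is trained on the raw predictors)
trainPC <- predict(preProc, train[, predNames])
validPC <- predict(preProc, valid[, predNames])
# a model could then be fit on the 36 components, e.g.
# modFitPC <- train(x = trainPC, y = train$classe, method = "rf", trControl = trainControl(method = "cv", 5))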
Random Forest is a processor-intensive algorithm, so we have to choose a strategy to reduce the computation time.
Here we reduce the number of trees (ntree = 128) and check the accuracy after building the model.
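Another common way to cut the training time (not used in this report) is to run caret's resampling in parallel; a minimal sketch assuming the doParallel package is installed:
# register a parallel backend so train() can fit the resamples in parallel (illustrative sketch)
library(doParallel)
cl <- parallel::makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)
# ... run train() as in the next chunk ...
stopCluster(cl)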
library(rattle);library(randomForest); library(rpart)
## Rattle: A free graphical interface for data mining with R.
## Version 4.0.5 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
trCtl <- trainControl(method="cv", 5)
modFit <- train(classe ~., method="rf", trControl=trCtl, data=train, ntree=128)
# get the importance of the variables
varImp <- varImp(modFit)
plot(varImp, main="Fig.2 Variable importance in decreasing order")
modFit
## Random Forest
##
## 13737 samples
## 52 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10990, 10989, 10988, 10991
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9887888 0.9858167 0.002645835 0.003348998
## 27 0.9900264 0.9873818 0.002758008 0.003491834
## 52 0.9812181 0.9762365 0.003013268 0.003817149
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
Now we evaluate the model on the validation data set.
pred <- predict(modFit, valid)
cfm <- confusionMatrix(valid$classe, pred)
cfm$overall['Accuracy']
## Accuracy
## 0.9962617
The accuracy is quite good. Now let's estimate the out-of-sample error.
1 - as.numeric(cfm$overall['Accuracy'])
## [1] 0.003738318
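For a closer look at per-class performance on the validation set, the confusion matrix table and per-class statistics stored in the cfm object can also be inspected:
# per-class breakdown of the validation results
cfm$table
cfm$byClass[, c("Sensitivity", "Specificity")]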
Finally, we apply the model to predict the outcome classes on the original testing data set.
predFin <- predict(modFit, testOrig)
predFin
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
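For completeness, the 20 predictions could be written to individual files for submission; a hedged sketch (the helper name pml_write_files and the file naming are assumptions modeled on the course template, not part of the original analysis):
# hypothetical helper (assumption): writes one text file per test case
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
# pml_write_files(as.character(predFin))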
The Random Forest algorithm with cross-validated tuning control works well for the Weight Lifting Exercises analysis.
Here we plot an additional figure showing how the cross-validated accuracy varies with the number of randomly selected predictors.
plot(modFit, main="Fig.3 Number of selected predictors vs Accuracy")