Using fitness tracker data to quantify how well people perform barbell lifts

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

Synopsis

The goal of this analysis is to use personal activity data generated by fitness trackers to quantify how well people are exercising. We start from a labeled training data set and split it into a training set and a validation set. We fit a model on the training set, use it to predict the class of each dumbbell biceps curl in the validation set, and check the accuracy of those predictions. If the model is accurate, we apply it to the test set for prediction.

Data Processing

library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
train_set<-read.csv("pml-training.csv")
test_set<-read.csv("pml-testing.csv")

Predictor Selection

We remove variables that have near-zero variance, that contain NA values, or that are irrelevant to prediction (such as user name and time stamp).

set.seed(1234)
# Remove near zero variance variables
training_nz<-train_set[,-nearZeroVar(train_set)]

# Identify columns with NAs
natrain<-is.na(training_nz)

# Remove columns with NAs
train_clean<-training_nz[,names(which(colSums(natrain) == 0))]

# Remove the first 6 columns, which hold items such as user name and time stamp that should not act as predictors
train_clean<-train_clean[,-c(1,2,3,4,5,6)]
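
The filtering steps above can be illustrated on a small example. The sketch below uses base R only and a made-up data frame (`toy`, with hypothetical column names and values, not taken from pml-training.csv). The constant-column check is a simplified stand-in for `nearZeroVar`, which applies broader criteria, and the identifier column is dropped by name rather than by position.

```r
# Toy data frame (hypothetical values, not from pml-training.csv)
toy <- data.frame(
  user_name  = c("a", "b", "c", "d"),
  roll_belt  = c(1.2, 1.5, 1.1, 1.9),
  constant   = c(0, 0, 0, 0),          # single value, so zero variance
  sparse_avg = c(NA, NA, 3.4, NA),     # mostly missing
  classe     = factor(c("A", "B", "A", "C"))
)

# Flag columns that take a single unique value (simplified variance filter)
is_constant <- sapply(toy, function(col) length(unique(col)) == 1)

# Flag columns that contain at least one NA
has_na <- colSums(is.na(toy)) > 0

# Keep columns that pass both filters, then drop the identifier column by name
keep <- names(toy)[!is_constant & !has_na]
toy_clean <- toy[, setdiff(keep, "user_name"), drop = FALSE]

names(toy_clean)
```

Dropping columns by name is a little more robust than dropping by position, because it does not depend on which columns an earlier filter happened to remove.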

Partition the cleaned data into a training set (60%) and a validation set (40%).

set.seed(1234)
inTrain <- createDataPartition(y=train_clean$classe, p=0.6, list=FALSE)
training<-train_clean[inTrain,]
validation<-train_clean[-inTrain,]
dim(training)
## [1] 11776    53
dim(validation)
## [1] 7846   53
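
For a factor outcome, `createDataPartition` samples within each class so that the class distribution is roughly preserved in both subsets. The base R sketch below illustrates that idea on made-up labels. It is an approximation for illustration, not caret's implementation, and the `classe` vector here is hypothetical.

```r
set.seed(1234)

# Hypothetical class labels standing in for train_clean$classe
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Group row indices by class, then draw 60% of each group at random
idx_by_class <- split(seq_along(classe), classe)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}), use.names = FALSE)

train_labels <- classe[in_train]
valid_labels <- classe[-in_train]

# Class proportions should match closely in both subsets
table(train_labels) / length(train_labels)
table(valid_labels) / length(valid_labels)
```

Sampling per class matters most when some classes are rare, since a plain random split could leave the validation set with very few examples of them.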

Random Forest model

A random forest model is fit on the training set. The model is then used to predict the classes in the validation set, and the accuracy of that prediction is checked.

# Create randomForest model
modFitRF<-randomForest(classe~.,data=training)

# Predict values in the validation set
predRFV<-predict(modFitRF,newdata=validation)

# Use confusionMatrix to determine accuracy
confusionMatrix(predRFV,validation$classe)$overall[1]
## Accuracy 
## 0.993245

Because the accuracy of the random forest model on the validation set is very high (about 99.3%), and the estimated out-of-sample error is correspondingly low (about 0.7%), we can now use this model to get predictions for the test set.
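
To make the link between accuracy and error concrete, the sketch below computes both from a confusion table in base R. The labels are made up for illustration and are not the project's results. Accuracy is the share of cases on the diagonal of the table, and the estimated out-of-sample error is its complement.

```r
# Hypothetical predicted and true labels (not the project's results)
lv <- c("A", "B", "C")
truth     <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"), levels = lv)
predicted <- factor(c("A", "A", "B", "C", "C", "C", "B", "A"), levels = lv)

# Cross-tabulate predictions (rows) against true classes (columns)
cm <- table(Prediction = predicted, Reference = truth)

# Accuracy is the proportion of cases on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Estimated out-of-sample error is the complement of accuracy
oos_error <- 1 - accuracy

cm
accuracy
oos_error
```

Giving both factors the same `levels` keeps the table square even when a class is never predicted, which is what makes `diag(cm)` line up with correct predictions.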

Prediction on the test set

The random forest model is now used to predict the classes in the test set. The predictions are displayed alongside the row index, user name, and time stamp of each test case.

# Predict values in the test set
predRFT<-predict(modFitRF,newdata=test_set)

# create the outcome data frame
outcome<-cbind(test_set[,c(1,2,5)],predRFT)
outcome
##     X user_name   cvtd_timestamp predRFT
## 1   1     pedro 05/12/2011 14:23       B
## 2   2    jeremy 30/11/2011 17:11       A
## 3   3    jeremy 30/11/2011 17:11       B
## 4   4    adelmo 02/12/2011 13:33       A
## 5   5    eurico 28/11/2011 14:13       A
## 6   6    jeremy 30/11/2011 17:12       E
## 7   7    jeremy 30/11/2011 17:12       D
## 8   8    jeremy 30/11/2011 17:11       B
## 9   9  carlitos 05/12/2011 11:24       A
## 10 10   charles 02/12/2011 14:57       A
## 11 11  carlitos 05/12/2011 11:24       B
## 12 12    jeremy 30/11/2011 17:11       C
## 13 13    eurico 28/11/2011 14:14       B
## 14 14    jeremy 30/11/2011 17:10       A
## 15 15    jeremy 30/11/2011 17:12       E
## 16 16    eurico 28/11/2011 14:15       E
## 17 17     pedro 05/12/2011 14:22       A
## 18 18  carlitos 05/12/2011 11:24       B
## 19 19     pedro 05/12/2011 14:23       B
## 20 20    eurico 28/11/2011 14:14       B