Predicting the manner in which people performed an exercise

author: angelayuan
date: Thursday, May 21, 2015

Synopsis

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. In this project, our goal is to predict the manner in which people performed an exercise. The training data and test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv and https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

These data were collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts in 5 different ways, one correct and four incorrect: class A (exactly according to the specification), class B (throwing the elbows to the front), class C (lifting the dumbbell only halfway), class D (lowering the dumbbell only halfway), and class E (throwing the hips to the front).

Exploratory data analysis

First, we load the training data and test data and check the dimensions of the training data (the first six rows are not printed, to save space).

training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
dim(training)

## [1] 19622   160

We can see that there are 160 variables in total: 159 candidate predictors and 1 outcome (classe).

Data Preprocessing

Inspecting the data shows that some variables contain NA values and/or missing values, so data preprocessing is needed.

First, we delete every variable that contains NA values, in both the training data and the test data.

training <- training[, colSums(is.na(training)) == 0] 
testing <- testing[, colSums(is.na(testing)) == 0]
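The NA filter depends on how missing values are read in: by default, read.csv treats only the string NA as missing. The sketch below uses a small inline CSV as a stand-in for pml-training.csv and counts the share of missing values per column. Treating empty strings and "#DIV/0!" as missing through na.strings is an assumption about how the raw file encodes missing values, and the toy column values are hypothetical; this is an illustration, not the exact procedure used above.

```r
# Toy CSV standing in for pml-training.csv (hypothetical values).
csv_text <- "roll_belt,kurtosis_roll_belt,max_roll_belt,classe
1.41,,NA,A
1.42,#DIV/0!,NA,A
1.48,,NA,B
1.45,0.5,3.2,B"

# Treat "NA", empty strings, and "#DIV/0!" as missing (assumed encodings).
toy <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

# Fraction of missing values in each column.
na_share <- colMeans(is.na(toy))
print(na_share)

# Columns that are complete, i.e. that survive a "no NAs" filter.
complete_cols <- names(toy)[na_share == 0]
print(complete_cols)
```

With this kind of check, sparsely populated columns show up as a high missing share and are dropped by the same colSums(is.na(...)) == 0 rule.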

Second, we load the caret package and delete variables with near zero variance, which contribute little to a model.

library(caret)

## Warning: package 'caret' was built under R version 3.1.3

## Loading required package: lattice
## Loading required package: ggplot2

nzv_train <- nearZeroVar(training)
training <- training[,-nzv_train]
nzv_test <- nearZeroVar(testing)
testing <- testing[,-nzv_test]
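One caveat with this negative indexing: the filter returns column positions, and if it flags no columns the index vector is empty. Negating an empty index selects no columns rather than all of them. The base R sketch below illustrates the pitfall with a hypothetical empty index (nzv_idx) and a small guard function (drop_cols, a name introduced here for illustration); it does not call caret.

```r
toy <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Hypothetical result of a filter that flagged no columns.
nzv_idx <- integer(0)

# Pitfall: -integer(0) is still integer(0), so every column is dropped.
dropped_all <- toy[, -nzv_idx]
print(ncol(dropped_all))

# Guard: only subset when there is something to remove.
drop_cols <- function(df, idx) {
  if (length(idx) > 0) df[, -idx, drop = FALSE] else df
}
print(ncol(drop_cols(toy, nzv_idx)))
print(names(drop_cols(toy, 1L)))
```

In this analysis both filters flag at least one column, so the issue does not arise, but the guard makes the step safe to reuse on other data.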

Third, the first six remaining variables are identifiers such as user names and timestamps rather than sensor measurements, so they should not be used as predictors. Therefore we delete the first six variables.

training <- training[,-c(1:6)]
testing <- testing[,-c(1:6)]
dim(training)

## [1] 19622    53
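Because the filters above are computed separately for the training and test data, the two tables end up with matching predictors only if both filters happen to agree. A more defensive variant, sketched below on toy data, chooses the columns on the training data and then reuses those names for the test data. The column names, the values, and the problem_id column are assumptions for illustration.

```r
# Toy stand-ins for the two tables (hypothetical columns and values).
train_toy <- data.frame(user_name = c("u1", "u2", "u3"),
                        roll_belt = c(1.4, 1.5, 1.6),
                        max_roll_belt = c(NA, NA, 3.2),
                        classe = c("A", "B", "A"))
test_toy <- data.frame(user_name = c("u1", "u2"),
                       roll_belt = c(1.3, 1.7),
                       max_roll_belt = c(NA, NA),
                       problem_id = 1:2)

# Decide which columns to keep using the training data only.
complete <- names(train_toy)[colSums(is.na(train_toy)) == 0]
id_cols <- c("user_name")               # identifiers to drop, by name
keep <- setdiff(complete, id_cols)

# Predictors are the kept columns minus the outcome.
predictors <- setdiff(keep, "classe")

train_clean <- train_toy[, keep, drop = FALSE]
test_clean <- test_toy[, predictors, drop = FALSE]

print(names(train_clean))
print(names(test_clean))
```

Selecting by name instead of by position also keeps the step correct if the column order changes.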

Fitting Random Forest Model

After data preprocessing, 53 variables remain: 52 predictors and the outcome classe. We fit a random forest model with classe as the outcome and the other variables as predictors.

First, load the randomForest package.

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.1.3

## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

Second, split the training data into a training dataset (70%) and a cross validation dataset (30%).

inTrain <- createDataPartition(y=training$classe, p=0.7, list=FALSE)
training_set <- training[inTrain,]
cv_set <- training[-inTrain,]
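The split is random, so results such as the confusion matrices below will vary slightly from run to run unless a seed is set first. The base R sketch below shows the idea behind a stratified 70/30 split with a fixed seed. It is a simplified illustration, not caret's implementation of createDataPartition, and the seed value and toy labels are arbitrary.

```r
set.seed(1234)  # arbitrary seed, chosen only for reproducibility

# Toy outcome with unbalanced classes (hypothetical data).
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class.
idx_by_class <- split(seq_along(classe), classe)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

train_labels <- classe[in_train]
holdout_labels <- classe[-in_train]

# Class proportions are preserved in both parts.
print(table(train_labels))
print(table(holdout_labels))
```

Sampling within each class is what keeps the class proportions of the two parts close to those of the full data.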

Third, build the random forest model using the training dataset.

rf_fit <- randomForest(classe ~., data=training_set)

Accuracy check on the training dataset and the cross validation dataset (out of sample error)

training_pred <- predict(rf_fit, newdata=training_set)
confusionMatrix(training_pred, training_set$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3906    0    0    0    0
##          B    0 2658    0    0    0
##          C    0    0 2396    0    0
##          D    0    0    0 2252    0
##          E    0    0    0    0 2525
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

From the above results, we see that the prediction accuracy on the training dataset is 100%, and the sensitivity, specificity, etc. are all 1 as well. The in-sample error rate is therefore 0.0%.

Next, we test prediction accuracy on the cross validation dataset. Because the model was fitted to the training dataset, its error rate there is typically lower than the out of sample error. We therefore expect the out of sample error to be larger than 0.0%.

cv_pred <- predict(rf_fit, newdata=cv_set)
confusionMatrix(cv_pred, cv_set$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    7    0    0    0
##          B    1 1132    2    0    0
##          C    0    0 1023   12    0
##          D    0    0    1  952    2
##          E    0    0    0    0 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9958          
##                  95% CI : (0.9937, 0.9972)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9946          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9939   0.9971   0.9876   0.9982
## Specificity            0.9983   0.9994   0.9975   0.9994   1.0000
## Pos Pred Value         0.9958   0.9974   0.9884   0.9969   1.0000
## Neg Pred Value         0.9998   0.9985   0.9994   0.9976   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1924   0.1738   0.1618   0.1835
## Detection Prevalence   0.2855   0.1929   0.1759   0.1623   0.1835
## Balanced Accuracy      0.9989   0.9966   0.9973   0.9935   0.9991

From the above results, we see that the prediction accuracy on the cross validation dataset is high (99.58%), so the estimated out of sample error is low (0.42%). Sensitivity and specificity are also high for every class (all at least 0.9876), indicating that the model generalizes well to held-out data.
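These figures can be checked by hand from the confusion matrix above: accuracy is the sum of the diagonal divided by the total count, and the sensitivity of a class is its diagonal entry divided by its column (reference) total. The sketch below re-enters the matrix from the output and recomputes accuracy, error rate, and per-class sensitivity in base R.

```r
# Cross validation confusion matrix copied from the output above
# (rows = prediction, columns = reference).
cm <- matrix(c(1673,    7,    0,   0,    0,
                  1, 1132,    2,   0,    0,
                  0,    0, 1023,  12,    0,
                  0,    0,    1, 952,    2,
                  0,    0,    0,   0, 1080),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference = LETTERS[1:5]))

accuracy <- sum(diag(cm)) / sum(cm)   # 5860 / 5885
error_rate <- 1 - accuracy

# Sensitivity per class: correct predictions / reference total.
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(100 * error_rate, 2))
print(round(sensitivity, 4))
```

The 25 misclassified observations out of 5885 give the 0.42% out of sample error estimate quoted above.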

Applying random forest model to testing data

Finally, we apply the above model to the test data to predict the manner in which people performed the exercise.

test_pred <- predict(rf_fit, newdata=testing)
test_pred

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Reference

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har