Predicting People’s Exercise Manner

author: liuyubobobo
date: Sunday, May 24, 2015

Synopsis

The goal of this project is to predict the manner in which people performed an exercise. Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har

Data Reading

First of all, we read our data from CSV files. The data are separated into training data and test data. The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv and the test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

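# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where the read.csv default changed
training <- read.csv("pml-training.csv", stringsAsFactors = TRUE)
testing <- read.csv("pml-testing.csv", stringsAsFactors = TRUE)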

Exploratory Data Analysis

To understand our data better, we first do some exploratory data analysis. Here, we use the head, str and summary functions to inspect the data. To save space, we omit their output.

head(training)
str(training)
summary(training)

From the output of str, we can see that our training data set has 19622 observations and 160 variables.

Data Preprocessing

Clearly, 160 variables are too many for building a predictive model, so we need to reduce the dimension of our data.

First of all, the str output shows that many variables in the training data consist largely of NA values, which makes them of little use for prediction. Therefore, we keep only the columns that contain no NA values in the training data and eliminate the rest from both the training and testing data sets.

colBools <- colSums(is.na(training)) == 0
training <- training[,colBools]
testing <- testing[,colBools]
dim(training)
## [1] 19622    93

In this step, we decrease our data dimension from 160 to 93.
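The column filter above can be illustrated on a small data frame. This is a minimal sketch in base R. The data frame toy and its values are made up for illustration and are not part of the project data.

```r
# Toy data frame standing in for the real training data (hypothetical values).
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 5),        # mostly missing
  c = c("x", "y", "x", "y"),
  d = c(NA, 1, 2, 3)           # a single missing value
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))

# Keep only the columns that have no missing values at all.
keep <- na_counts == 0
toy_clean <- toy[, keep, drop = FALSE]

print(na_counts)
print(names(toy_clean))
```

Note that this rule drops a column even if it has a single NA. That is the same rule the report applies, and it is a strict one.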

Secondly, we can use the nearZeroVar function to remove variables with near-zero variance. These variables hardly change across observations and so contribute little to the model we build later.

library(lattice)
library(ggplot2)
library(caret)

nearZeroVars <- nearZeroVar(training)
training <- training[,-nearZeroVars]
testing <- testing[,-nearZeroVars]
dim(training)
## [1] 19622    59

In this step, we decrease our data dimension from 93 to 59.
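To give an intuition for what "near-zero variance" means, here is a rough base R sketch of the two statistics that nearZeroVar is documented to use: the ratio of the most common value's frequency to the second most common, and the percentage of distinct values. The helpers nzv_stats and is_nzv are hypothetical names written for this illustration, and the cutoffs 95/5 and 10 are assumed to match caret's defaults (freqCut and uniqueCut). This is not caret's actual implementation.

```r
# Hypothetical helper: frequency ratio and percent of unique values for one column.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most frequent value's count to the second most frequent one.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Number of distinct values as a percentage of the number of observations.
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Hypothetical helper: flag a column when both cutoffs are violated.
# The default cutoffs are assumed to mirror caret's freqCut and uniqueCut.
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  s <- nzv_stats(x)
  s[["freq_ratio"]] > freq_cut && s[["pct_unique"]] < unique_cut
}

flat <- c(rep(0, 99), 1)   # almost constant column
varied <- 1:100            # every value distinct

print(nzv_stats(flat))
print(is_nzv(flat))
print(is_nzv(varied))
```

A column is flagged only when one value dominates and there are few distinct values, so a genuinely continuous sensor reading is unlikely to be removed.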

Thirdly, by inspecting the output of the str and colnames functions, we see that the first six variables, X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp and num_window, are not semantically related to our outcome (i.e. classe). We can delete these variables safely.

training <- training[,-c(1,2,3,4,5,6)]
testing <- testing[,-c(1,2,3,4,5,6)]
dim(training)
## [1] 19622    53

In this step, we decrease our data dimension from 59 to 53.
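Dropping columns by position works here, but it depends on the column order left by the earlier steps. As an alternative sketch, the same kind of removal can be done by name. The data frame toy below is a made-up stand-in with only a few columns, and drop_cols is a name chosen for this example.

```r
# Toy data frame with two bookkeeping columns and one sensor column (hypothetical values).
toy <- data.frame(
  X = 1:3,
  user_name = c("a", "b", "c"),
  roll_belt = c(1.4, 1.5, 1.6),
  classe = c("A", "B", "A")
)

# Names of the columns to remove. Unlike positions, these stay valid if the order changes.
drop_cols <- c("X", "user_name")

toy_small <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy_small))
```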

Using PCA

To make our data set more compact, we can use principal component analysis (PCA) to process it further.

First, we can partition our training data set into 2 parts, training_set and validating_set.

set.seed(54321)
inTrain <- createDataPartition( y = training$classe , p=0.75, list=FALSE )
training_set <- training[ inTrain , ]
validating_set <- training[ -inTrain , ]
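For a factor outcome, createDataPartition draws its sample within each class, so the class proportions are roughly preserved in both parts. The base R sketch below imitates that idea on made-up labels. It is an illustration of stratified sampling under that assumption, not a reproduction of caret's code, and the names labels, in_train, train_labels and valid_labels are chosen for this example.

```r
set.seed(1)

# Made-up outcome with unbalanced classes.
labels <- factor(rep(c("A", "B", "C"), times = c(40, 40, 20)))

# For each class, pick 75% of that class's row indices.
in_train <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}))

train_labels <- labels[in_train]
valid_labels <- labels[-in_train]

# Both parts keep the 40:40:20 class balance.
print(table(train_labels))
print(table(valid_labels))
```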

Using colnames, we know that the last variable (the 53rd) is our outcome, classe. We exclude this variable when applying PCA.

compress <- preProcess(training_set[,-53], method = "pca")

training_PCA_set <- predict(compress, training_set[,-53])
validating_PCA_set <- predict(compress, validating_set[,-53])
testing_PCA_set <- predict(compress, testing[,-53])
dim(training_PCA_set)
## [1] 14718    25

In this step, we reduce the 52 predictors (53 columns including classe) to 25 principal components.
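The number of components is not chosen by hand. With method = "pca", preProcess centers and scales the predictors and, by default, keeps enough components to explain 95% of the variance (its thresh argument). The sketch below shows the same idea with base R's prcomp on synthetic data. The data, the variable names and the 0.95 cutoff written out explicitly are assumptions made for this illustration.

```r
set.seed(2)
n <- 200

# Synthetic predictors: two hidden signals plus noise, giving correlated columns.
z1 <- rnorm(n)
z2 <- rnorm(n)
x <- data.frame(
  p1 = z1 + rnorm(n, sd = 0.1),
  p2 = z1 + rnorm(n, sd = 0.1),
  p3 = z2 + rnorm(n, sd = 0.1),
  p4 = z2 + rnorm(n, sd = 0.1),
  p5 = z1 - z2 + rnorm(n, sd = 0.1)
)

# PCA on centered and scaled predictors.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the first k components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching the 95% cutoff.
n_comp <- which(cum_var >= 0.95)[1]

# Project the data onto the retained components.
scores <- predict(pca, newdata = x)[, seq_len(n_comp), drop = FALSE]

print(round(cum_var, 3))
print(n_comp)
print(dim(scores))
```

As in the report, the projection is learned on one data set and then applied with predict, which is what allows the validating and testing sets to be transformed with the training set's components.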

For the model-building step that follows, we put our outcome classe back into training_PCA_set.

training_PCA_set$classe <- training_set$classe

Building the Model

Now, we can build a random forest model on our new training_PCA_set.

library( randomForest )
## Warning: package 'randomForest' was built under R version 3.1.3
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
modelFit <- randomForest(classe ~ ., data = training_PCA_set)

For this model, we first use our validating_PCA_set to check the accuracy.

validating_pred <- predict( modelFit, newdata=validating_PCA_set)
confusionMatrix(validating_pred, validating_set$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1388   13    0    0    0
##          B    0  929   12    0    2
##          C    3    6  830   32    4
##          D    2    0   13  768   10
##          E    2    1    0    4  885
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9788          
##                  95% CI : (0.9744, 0.9826)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9732          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9950   0.9789   0.9708   0.9552   0.9822
## Specificity            0.9963   0.9965   0.9889   0.9939   0.9983
## Pos Pred Value         0.9907   0.9852   0.9486   0.9685   0.9922
## Neg Pred Value         0.9980   0.9950   0.9938   0.9912   0.9960
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2830   0.1894   0.1692   0.1566   0.1805
## Detection Prevalence   0.2857   0.1923   0.1784   0.1617   0.1819
## Balanced Accuracy      0.9956   0.9877   0.9798   0.9746   0.9902

From the results above, we can see that the prediction accuracy on our validating_PCA_set is 97.88%, so the estimated out-of-sample error is 2.12%. These results suggest that our model is suitable for the prediction task.
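The accuracy and the per-class sensitivity can be recomputed by hand from the confusion matrix printed above. The sketch below retypes that table into a matrix named cm (a name chosen for this example) and uses only base R.

```r
# Confusion matrix copied from the output above: rows are predictions, columns are references.
cm <- matrix(
  c(1388,  13,   0,   0,   0,
       0, 929,  12,   0,   2,
       3,   6, 830,  32,   4,
       2,   0,  13, 768,  10,
       2,   1,   0,   4, 885),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over the true count of that class.
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(1 - accuracy, 4))
print(round(sensitivity, 4))
```

The diagonal sums to 4800 out of 4904 validation cases, which gives the 97.88% accuracy and the 2.12% error estimate quoted above.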

Using the Model

Now, we can use our model on testing_PCA_set to predict the manner in which people performed the exercise.

testing_pred <- predict( modelFit, newdata=testing_PCA_set)
testing_pred
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E