author: angelayuan
date: Thursday, May 21, 2015
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. In this project, our goal is to predict the manner in which participants performed an exercise. The training data and test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv and https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
These data were collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways: class A (exactly according to the specification), class B (throwing the elbows to the front), class C (lifting the dumbbell only halfway), class D (lowering the dumbbell only halfway), and class E (throwing the hips to the front).
First, we load the training data and test data and check the dimensions of the training data (the first six rows are not printed here, to save space).
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
dim(training)
## [1] 19622 160
We can see that there are 160 variables in total: 159 are candidate predictors and 1 is the outcome (i.e. classe).
From the check above, we know that some variables contain NA values and/or missing values, so we need to preprocess the data.
First, we delete the variables that contain NA values from both the training data and the test data.
training <- training[, colSums(is.na(training)) == 0]
testing <- testing[, colSums(is.na(testing)) == 0]
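The NA filter above only catches entries that R parsed as NA. If the raw files also mark missing values with blank fields or spreadsheet error strings such as `#DIV/0!` (an assumption here, since the raw rows are not shown), those columns survive this step. A minimal sketch on a made-up three-row CSV shows the difference when the extra markers are declared through the `na.strings` argument of `read.csv`; the column values are hypothetical.

```r
# Toy CSV standing in for the real file (values are made up for illustration).
csv_text <- "roll_belt,kurtosis_roll_belt,max_roll_belt,classe
1.41,,NA,A
1.42,#DIV/0!,NA,A
1.48,,NA,B"

# Default parsing: only the literal string NA is treated as missing,
# so the blank / #DIV/0! column is read as text with zero NAs.
raw <- read.csv(text = csv_text)
colSums(is.na(raw))

# Declare the additional missing-value markers explicitly.
clean <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

# Same filter as in the report: keep only columns without any NA.
complete_cols <- colSums(is.na(clean)) == 0
names(clean)[complete_cols]
```

With the markers declared, the NA filter alone removes both sparse columns in this toy example.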
Second, we load the caret package and delete the variables with near-zero variance, which contribute little to a model.
library(caret)
## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
## Loading required package: ggplot2
nzv_train <- nearZeroVar(training)
training <- training[,-nzv_train]
nzv_test <- nearZeroVar(testing)
testing <- testing[,-nzv_test]
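To make the near-zero-variance criterion concrete, the sketch below recomputes the two statistics that `nearZeroVar` relies on: the ratio of the most common value's frequency to the second most common, and the percentage of distinct values. The helper names `nzv_stats` and `is_nzv` are hypothetical, and this is a simplified illustration rather than caret's implementation. The cutoffs mirror caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`).

```r
# Hand-rolled version of the two statistics (illustrative helper, not caret code).
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most frequent value to the runner-up; Inf if only one value.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of observations.
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# A variable is flagged when one value dominates AND few distinct values exist.
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  s <- nzv_stats(x)
  s[["freq_ratio"]] > freq_cut && s[["pct_unique"]] < unique_cut
}

flag   <- c(rep("no", 98), rep("yes", 2))  # dominated by a single value
signal <- seq_len(100) / 10                # every value distinct

nzv_stats(flag)
nzv_stats(signal)
is_nzv(flag)
is_nzv(signal)
```

Because the test file is much smaller than the training file, these statistics can differ between the two, which is one reason to select columns on the training data and reuse that selection for the test data.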
Third, the first six remaining variables are just bookkeeping fields such as user names and timestamps, which barely contribute to models. Therefore we delete the first six variables.
training <- training[,-c(1:6)]
testing <- testing[,-c(1:6)]
dim(training)
## [1] 19622 53
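Dropping columns by position works only as long as both data frames have the same columns in the same order. A name-based alternative is sketched below on two tiny made-up data frames. The column names and the `problem_id` column are assumptions for illustration, not taken from the output above.

```r
# Toy stand-ins for the training and test data (all values are made up).
train_toy <- data.frame(
  X = 1:4,
  user_name = c("p1", "p2", "p1", "p3"),
  raw_timestamp_part_1 = c(10L, 11L, 12L, 13L),
  roll_belt = c(1.41, 1.42, 1.48, 1.45),
  pitch_belt = c(8.07, 8.06, 8.05, 8.09),
  classe = c("A", "A", "B", "C")
)
test_toy <- data.frame(
  X = 1:2,
  user_name = c("p3", "p4"),
  raw_timestamp_part_1 = c(20L, 21L),
  pitch_belt = c(8.10, 8.20),   # note: different column order than training
  roll_belt = c(1.50, 1.60),
  problem_id = 1:2
)

# Drop bookkeeping columns by name rather than by position.
id_cols <- c("X", "user_name", "raw_timestamp_part_1")
train_clean <- train_toy[, setdiff(names(train_toy), id_cols)]

# Reuse the training predictors, in training order, for the test data.
predictors <- setdiff(names(train_clean), "classe")
test_clean <- test_toy[, predictors, drop = FALSE]

names(train_clean)
names(test_clean)
```

Selecting the test columns from the training column list guarantees that the model sees the same predictors at prediction time.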
After data preprocessing, 53 variables remain: 52 predictors plus the outcome classe. Here we fit a random forest model with classe as the outcome and the other variables as predictors.
First, load the randomForest package.
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.1.3
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
Second, split the training data into a training dataset (70%) and a held-out cross validation dataset (30%).
inTrain <- createDataPartition(y=training$classe, p=0.7, list=FALSE)
training_set <- training[inTrain,]
cv_set <- training[-inTrain,]
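The split is random, so the exact counts in the tables that follow change from run to run unless a seed is set first. The sketch below illustrates the idea behind a stratified 70/30 split in base R on a made-up outcome vector. It is a simplified stand-in for `createDataPartition`, not its actual algorithm, and the seed value is arbitrary.

```r
set.seed(2015)  # arbitrary seed, fixed only so the split can be reproduced

# Made-up outcome with unequal class sizes (98 observations in total).
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(28, 19, 17, 16, 18)))

# Sample 70% of the row indices within each class separately.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.7 * length(idx)))]
}), use.names = FALSE))

train_part <- classe[in_train]
valid_part <- classe[-in_train]

# Class proportions stay close to those of the full data in both parts.
round(prop.table(table(classe)), 2)
round(prop.table(table(train_part)), 2)
round(prop.table(table(valid_part)), 2)
```

Sampling within each class keeps the class mix of the two parts similar, which matters when some classes are rarer than others.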
Third, build the random forest model using the training set and check its predictions on that same set.
rf_fit <- randomForest(classe ~., data=training_set)
training_pred <- predict(rf_fit, newdata=training_set)
confusionMatrix(training_pred, training_set$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 3906    0    0    0    0
##          B    0 2658    0    0    0
##          C    0    0 2396    0    0
##          D    0    0    0 2252    0
##          E    0    0    0    0 2525
##
## Overall Statistics
##
##                Accuracy : 1
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 1
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000
From the above results, we see that the prediction accuracy on the training set is perfect (100%), as are the sensitivity, specificity, etc. The in-sample error rate is therefore 0.0%. Because the model is evaluated on the same data it was fitted to, this figure is optimistic and says little about performance on new data.
Next, we test prediction accuracy on the cross validation dataset. Because the model was fitted to the training dataset, its in-sample error rate is typically lower than the out-of-sample error. We therefore expect the out-of-sample error to be larger than 0.0%.
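A random forest also carries its own estimate of out-of-sample error, the out-of-bag (OOB) error, computed for each observation from the trees that did not see it during fitting. The sketch below contrasts it with accuracy on the fitting data. It uses the built-in `iris` data as a stand-in for the accelerometer data, and the seed and `ntree` value are arbitrary choices for illustration.

```r
library(randomForest)

set.seed(1)  # arbitrary seed for a reproducible forest
fit <- randomForest(Species ~ ., data = iris, ntree = 200)

# Accuracy on the same rows used for fitting (optimistic).
resub_acc <- mean(predict(fit, newdata = iris) == iris$Species)

# predict() without newdata returns the out-of-bag predictions.
oob_acc <- mean(predict(fit) == iris$Species)

# The same estimate as stored by the package, after the last tree.
oob_err <- fit$err.rate[fit$ntree, "OOB"]

resub_acc
oob_acc
oob_err
```

The OOB figure is usually the lower of the two accuracies and gives a first impression of generalization before any hold-out data are used.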
cv_pred <- predict(rf_fit, newdata=cv_set)
confusionMatrix(cv_pred, cv_set$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    7    0    0    0
##          B    1 1132    2    0    0
##          C    0    0 1023   12    0
##          D    0    0    1  952    2
##          E    0    0    0    0 1080
##
## Overall Statistics
##
##                Accuracy : 0.9958
##                  95% CI : (0.9937, 0.9972)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9946
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9939   0.9971   0.9876   0.9982
## Specificity            0.9983   0.9994   0.9975   0.9994   1.0000
## Pos Pred Value         0.9958   0.9974   0.9884   0.9969   1.0000
## Neg Pred Value         0.9998   0.9985   0.9994   0.9976   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1924   0.1738   0.1618   0.1835
## Detection Prevalence   0.2855   0.1929   0.1759   0.1623   0.1835
## Balanced Accuracy      0.9989   0.9966   0.9973   0.9935   0.9991
From the above results, we see that the prediction accuracy on the cross validation dataset is high (99.58%, 95% CI 99.37% to 99.72%), so the estimated out-of-sample error is low (0.42%). Sensitivity and specificity are also high for every class (at least 0.9876 and 0.9975, respectively), indicating that the model generalizes well to held-out data.
Finally, we apply the above model to the test data to predict the manner in which people exercised.
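The headline figures can be checked by hand from the confusion matrix. The sketch below re-enters the counts reported above and recomputes accuracy, the error rate, and two of the per-class statistics in base R.

```r
# Counts copied from the cross validation confusion matrix above
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(1673,    7,    0,   0,    0,
       1, 1132,    2,   0,    0,
       0,    0, 1023,  12,    0,
       0,    0,    1, 952,    2,
       0,    0,    0,   0, 1080),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correct predictions per class over the true class totals.
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions over predicted class totals.
ppv <- diag(cm) / rowSums(cm)

round(accuracy, 4)      # 0.9958
round(1 - accuracy, 4)  # 0.0042
round(sensitivity, 4)
round(ppv, 4)
```

Of the 5885 held-out cases, 5860 lie on the diagonal, which gives the reported accuracy of 99.58% and an estimated out-of-sample error of 0.42%.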
test_pred <- predict(rf_fit, newdata=testing)
test_pred
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B
## Levels: A B C D E
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har