Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har. The goal of this project is to predict the manner in which they did the exercise.
The dataset used in this project can be found here, and the experiment details are described in the original paper.
The datasets have been downloaded to the local machine and saved in the working directory. The code below loads the datasets and the necessary library.
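If the files are not already present locally, they could be fetched programmatically; the sketch below uses placeholder URLs, which would need to be replaced with the actual download links for the project data.
# Sketch only: the URLs below are placeholders, not the real download links.
train_url = "https://example.com/pml-training.csv"
test_url = "https://example.com/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(train_url, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(test_url, destfile = "pml-testing.csv")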
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
training <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!"))  # treat "#DIV/0!" entries as missing
testing <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!"))
dim(training)
## [1] 19622 160
table(training$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
There are 19622 observations in the training dataset, with 160 variables. The last column is the target variable classe. The most abundant class is A.
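As a quick sanity check (not part of the original analysis), the class balance can also be expressed as proportions:
round(prop.table(table(training$classe)), 3)  # share of each classe level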
Some variables have a lot of missing values; for simplicity, I removed all variables containing NA values. In addition, several variables are not directly related to the target variable classe, so I removed those as well: "X", "user_name", and all the time- and window-related variables, such as "raw_timestamp_part_1".
NA_Count = sapply(1:dim(training)[2], function(x) sum(is.na(training[, x])))  # count NAs in each column
NA_list = which(NA_Count > 0)  # columns containing at least one NA
colnames(training[, c(1:7)])  # the first 7 columns are identifiers and timestamps, not sensor readings
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window"
training = training[, -NA_list]  # drop columns with missing values
training = training[, -c(1:7)]  # drop identifier and timestamp columns
training$classe = factor(training$classe)  # make the target variable a factor
The testing dataset is processed in the same way.
testing = testing[,-NA_list]
testing = testing[,-c(1:7)]
The problem presented here is a classification problem. I used two classification methods from the caret package: a classification tree (rpart) and a random forest. I also carried out 3-fold cross-validation using the trainControl function.
set.seed(1234)
cv3 = trainControl(method="cv",number=3,allowParallel=TRUE,verboseIter=TRUE)
modrf = train(classe~., data=training, method="rf",trControl=cv3)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
## + Fold1: mtry= 2
## - Fold1: mtry= 2
## + Fold1: mtry=27
## - Fold1: mtry=27
## + Fold1: mtry=52
## - Fold1: mtry=52
## + Fold2: mtry= 2
## - Fold2: mtry= 2
## + Fold2: mtry=27
## - Fold2: mtry=27
## + Fold2: mtry=52
## - Fold2: mtry=52
## + Fold3: mtry= 2
## - Fold3: mtry= 2
## + Fold3: mtry=27
## - Fold3: mtry=27
## + Fold3: mtry=52
## - Fold3: mtry=52
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 27 on full training set
modtree = train(classe~.,data=training,method="rpart",trControl=cv3)
## Loading required package: rpart
## + Fold1: cp=0.03568
## - Fold1: cp=0.03568
## + Fold2: cp=0.03568
## - Fold2: cp=0.03568
## + Fold3: cp=0.03568
## - Fold3: cp=0.03568
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.0357 on full training set
We can check the in-sample performance of these two models on the training dataset.
prf=predict(modrf,training)
ptree=predict(modtree,training)
table(prf,training$classe)
##
## prf A B C D E
## A 5580 0 0 0 0
## B 0 3797 0 0 0
## C 0 0 3422 0 0
## D 0 0 0 3216 0
## E 0 0 0 0 3607
table(ptree,training$classe)
##
## ptree A B C D E
## A 5080 1581 1587 1449 524
## B 81 1286 108 568 486
## C 405 930 1727 1199 966
## D 0 0 0 0 0
## E 14 0 0 0 1631
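The contingency tables above can be summarised more compactly; as a sketch (not run in the original report), caret's confusionMatrix function reports overall accuracy and other statistics for each in-sample fit:
confusionMatrix(prf, training$classe)$overall["Accuracy"]  # random forest, in-sample
confusionMatrix(ptree, training$classe)$overall["Accuracy"]  # classification tree, in-sample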
For the testing dataset, which has no classe labels, we can only compare the predictions of the two models.
prf=predict(modrf,testing)
ptree=predict(modtree,testing)
table(prf,ptree)
## ptree
## prf A B C D E
## A 7 0 0 0 0
## B 3 0 5 0 0
## C 0 0 1 0 0
## D 0 0 1 0 0
## E 1 0 2 0 0
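Since the true labels are unavailable, the table above only shows where the two models agree. A quick sketch to quantify that agreement:
mean(as.character(prf) == as.character(ptree))  # fraction of the 20 test cases with identical predictions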
From the in-sample results, the random forest model clearly has the better accuracy, so it is chosen for the final predictions on the testing dataset.
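Its expected out-of-sample performance can be judged from the 3-fold cross-validation results stored in the fitted caret objects; as a sketch (exact numbers depend on the run):
modrf$results  # cross-validated accuracy for each mtry value tried
modtree$results  # cross-validated accuracy for each cp value tried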
Finally, I applied the random forest model to the testing dataset to generate the submission results.
answers = predict(modrf, testing)
pml_write_files = function(x) {
  # write each prediction to its own text file for submission
  n = length(x)
  for (i in 1:n) {
    filename = paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
answers
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
pml_write_files(answers)
The predicted classes for the 20 tests are: B A B A A E D B A A B C B A E E A B B B.