Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The goal of the project is to predict the manner in which they did the exercise, which is the “classe” variable in the training set, using any of the other variables in the dataset. The final prediction model is then used to predict 20 different test cases.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Load the data from the links above, treating “NA”, “” and “#DIV/0!” as missing values.
library(data.table)  ## provides fread()
library(caret)       ## provides createDataPartition(), nearZeroVar(), train(), confusionMatrix()
traindata<-fread("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",na.strings=c("NA","","#DIV/0!"))
testdata<-fread("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",na.strings=c("NA","","#DIV/0!"))
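As a small illustration of what the na.strings argument does, the sketch below parses a made-up three-row CSV with base R's read.csv, which accepts the same argument. The inline data and the column names x and y are hypothetical and are not part of the project files.

```r
## Hypothetical inline CSV, not taken from the project data
csv_text <- "x,y\n1,#DIV/0!\nNA,2\n,3"
toy <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))
## Every listed token is read as NA, so both columns stay numeric
na_per_column <- colSums(is.na(toy))
str(toy)
```

Without the “#DIV/0!” entry, the y column would be read as text, which is why the spreadsheet error token is listed explicitly.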
Split the training data into a training set (60%) and a testing set (40%).
set.seed(1000)
inTrain<-createDataPartition(y=traindata$classe,p=0.60,list=FALSE)
training<-traindata[inTrain,]
testing<-traindata[-inTrain,]
dim(training)
## [1] 11776 160
dim(testing)
## [1] 7846 160
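The idea behind a class-wise 60/40 split can be sketched in base R on made-up labels. This is only an illustration of sampling within each class, assuming that is the behaviour wanted from createDataPartition; it is not caret's implementation, and the labels vector is hypothetical.

```r
set.seed(1000)
## Hypothetical outcome labels: 50 A, 30 B, 20 C
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))
## Row indices grouped by class
idx_by_class <- split(seq_along(labels), labels)
## Draw 60% of the rows of each class
in_train <- sort(unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.6 * length(i)))
}), use.names = FALSE))
train_counts <- table(labels[in_train])
test_counts <- table(labels[-in_train])
```

Because the sampling is done per class, both parts keep the 5:3:2 class ratio of the toy labels.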
Each dataset consists of 160 variables.
Eliminate the variables that contain missing values, as well as the first 7 columns, which are not exercise measurements.
VarNA<-apply(training,2, function (x) length(which(is.na(x))))
VarNA<-as.data.frame(VarNA)
##Obtain the names of the columns with no missing values
Var<-row.names(apply(VarNA,2,function(x) x[x==0]))
##Eliminate the first 7 columns, which have nothing to do with the predicted variable 'classe'
Var<-Var[-c(1:7)]
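The same filtering logic can be written more compactly with colSums(is.na(...)). The sketch below runs on a made-up four-row data frame; the toy values and the choice of columns are hypothetical and only stand in for the real 160-column data.

```r
## Hypothetical toy data: one bookkeeping column, two sensor columns, the outcome
toy <- data.frame(
  user_name = c("a", "b", "c", "d"),
  roll_belt = c(1.41, 1.42, 1.48, 1.45),
  kurtosis_roll_belt = c(NA, NA, 3.1, NA),
  classe = c("A", "B", "A", "C")
)
## Count missing values per column
na_count <- colSums(is.na(toy))
## Keep columns with no missing values, then drop the bookkeeping column by name
keep <- setdiff(names(na_count)[na_count == 0], "user_name")
keep
```

Dropping bookkeeping columns by name rather than by position is a design choice of this sketch; it avoids removing the wrong columns if the column order ever changes.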
After filtering, 52 variables remain as predictors, with ‘classe’ as the outcome.
Var
## [1] "roll_belt" "pitch_belt" "yaw_belt"
## [4] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [7] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [10] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [13] "magnet_belt_z" "roll_arm" "pitch_arm"
## [16] "yaw_arm" "total_accel_arm" "gyros_arm_x"
## [19] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
## [22] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
## [25] "magnet_arm_y" "magnet_arm_z" "roll_dumbbell"
## [28] "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
## [31] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [34] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [37] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [40] "roll_forearm" "pitch_forearm" "yaw_forearm"
## [43] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [46] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
## [49] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [52] "magnet_forearm_z" "classe"
Keep only the selected variables in both the training and testing data.
training<-as.data.frame(training)
mytraining<-training[which(colnames(training)%in% Var)]
testing<-as.data.frame(testing)
mytesting<-testing[which(colnames(testing)%in% Var)]
Remove zero-variance covariates from the mytraining and mytesting datasets.
nsv1<-nearZeroVar(mytraining,saveMetrics=TRUE)
ntraining<-mytraining[,nsv1$zeroVar==FALSE]
##Apply the filter learned on the training set to the testing set so that both keep the same columns
ntesting<-mytesting[,nsv1$zeroVar==FALSE]
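What a zero-variance covariate means can be shown with a base R check on a made-up data frame. This is a minimal sketch of the idea only, not caret's implementation of nearZeroVar; the toy columns roll and flat are hypothetical.

```r
## Hypothetical toy predictors; 'flat' never changes
toy <- data.frame(roll = c(1.2, 0.5, 0.9, 1.1), flat = c(7, 7, 7, 7))
## A column has zero variance when it holds a single distinct value
zero_var <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))
## Keep only the columns that vary
filtered <- toy[, !zero_var, drop = FALSE]
names(filtered)
```

A constant column carries no information for separating the classes, so removing it does not change what a model can learn.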
Fit the model with ‘classe’ as the outcome and all the remaining variables as predictors.
set.seed(1000)
modfit<-train(classe~.,method="rpart",data=ntraining)
The accuracy of the model is only 0.5878310. If each prediction were independently correct with that probability, the chance of predicting all 20 test cases correctly would be only 0.5878310^20 = 2.426888e-05, so this model is of little use for the final prediction.
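The probability quoted above can be reproduced with one line of arithmetic. The sketch assumes that the 20 predictions are independent and that each is correct with probability equal to the resampled accuracy; the variable names are illustrative.

```r
## Accuracy reported by the resampling output
acc <- 0.5878310
n_cases <- 20
## Probability that all cases are right, assuming independent predictions
## that are each correct with probability acc
p_all_correct <- acc^n_cases
p_all_correct
```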
modfit
## CART
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.02456099 0.5878310 0.47634203
## 0.04347413 0.4661506 0.29167459
## 0.11687233 0.3285869 0.06472165
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02456099.
The nodes, splits, and class probabilities at each node of finalModel are shown below.
modfit$finalModel
## n= 11776
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 11776 8428 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 129.5 10707 7401 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -33.15 961 9 A (0.99 0.0094 0 0 0) *
## 5) pitch_forearm>=-33.15 9746 7392 A (0.24 0.23 0.21 0.2 0.12)
## 10) yaw_belt>=169.5 504 52 A (0.9 0.054 0 0.042 0.0079) *
## 11) yaw_belt< 169.5 9242 6999 B (0.21 0.24 0.22 0.21 0.12)
## 22) magnet_dumbbell_z< -93.5 1128 456 A (0.6 0.28 0.051 0.051 0.023) *
## 23) magnet_dumbbell_z>=-93.5 8114 6117 C (0.15 0.24 0.25 0.23 0.14)
## 46) roll_dumbbell< -64.76216 1234 513 C (0.14 0.15 0.58 0.04 0.077) *
## 47) roll_dumbbell>=-64.76216 6880 5078 D (0.15 0.25 0.19 0.26 0.15)
## 94) magnet_dumbbell_y>=317.5 3430 2184 B (0.13 0.36 0.053 0.31 0.15)
## 188) total_accel_dumbbell>=5.5 2637 1452 B (0.11 0.45 0.068 0.19 0.18) *
## 189) total_accel_dumbbell< 5.5 793 246 D (0.17 0.077 0.0013 0.69 0.061) *
## 95) magnet_dumbbell_y< 317.5 3450 2355 C (0.18 0.14 0.32 0.22 0.14) *
## 3) roll_belt>=129.5 1069 42 E (0.039 0 0 0 0.96) *
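As a reading aid for the printed tree, the share of rows belonging to a node's predicted class is 1 - loss / n. The sketch below checks this for the root and for node 3, using values copied from the output above; the variable names are illustrative.

```r
## Values copied from the printed tree: n and loss for the root and for node 3
root_n <- 11776
root_loss <- 8428
node3_n <- 1069
node3_loss <- 42
## Share of rows in the predicted class = 1 - loss / n
root_share_A <- 1 - root_loss / root_n
node3_share_E <- 1 - node3_loss / node3_n
shares <- round(c(root_share_A, node3_share_E), 2)
shares
```

The two results agree with the first yprob entry of the root (class A) and the last yprob entry of node 3 (class E).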
plot(modfit$finalModel,uniform=TRUE,main="Classification Tree")
text(modfit$finalModel,use.n=TRUE,all=TRUE,cex=.8)
testpredict<-predict(modfit,ntesting)
confusionMatrix(testpredict,ntesting$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1330 239 32 73 17
## B 196 767 113 343 331
## C 563 463 1223 529 394
## D 111 49 0 341 28
## E 32 0 0 0 672
##
## Overall Statistics
##
## Accuracy : 0.5523
## 95% CI : (0.5412, 0.5633)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4386
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.5959 0.50527 0.8940 0.26516 0.46602
## Specificity 0.9357 0.84466 0.6991 0.97134 0.99500
## Pos Pred Value 0.7865 0.43829 0.3856 0.64461 0.95455
## Neg Pred Value 0.8535 0.87680 0.9690 0.87085 0.89219
## Prevalence 0.2845 0.19347 0.1744 0.16391 0.18379
## Detection Rate 0.1695 0.09776 0.1559 0.04346 0.08565
## Detection Prevalence 0.2155 0.22304 0.4043 0.06742 0.08973
## Balanced Accuracy 0.7658 0.67496 0.7966 0.61825 0.73051
From the confusion matrix, the prediction accuracy on the testing set reaches only 0.5523.
The in-sample error, estimated from the resampled accuracy, is 1 - 0.5878310 = 0.412169.
The out-of-sample error is 1 - 0.5523 = 0.4477.
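The accuracy and the out-of-sample error can be recomputed directly from the confusion matrix counts: accuracy is the sum of the diagonal divided by the total. The sketch below re-enters the counts printed above by hand; the object name cm is illustrative.

```r
## Counts copied from the confusion matrix (rows: prediction, columns: reference)
cm <- matrix(c(1330, 239,   32,  73,  17,
                196, 767,  113, 343, 331,
                563, 463, 1223, 529, 394,
                111,  49,    0, 341,  28,
                 32,   0,    0,   0, 672),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5]))
## Correct predictions lie on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
out_of_sample_error <- 1 - accuracy
round(c(accuracy, out_of_sample_error), 4)
```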
Random forest is computationally expensive, which motivates the use of parallel processing. An additional tradeoff made in this analysis is changing the resampling method from the default bootstrap to k-fold cross-validation, which reduces processing time at the possible cost of some model accuracy. However, experiments indicate that 5-fold cross-validation delivers the same accuracy as the more computationally expensive bootstrap. Here we use 10-fold cross-validation.
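To make the resampling scheme concrete, the sketch below assigns the 11776 training rows to 10 folds in base R and reports how many rows each fit would see. It is a plain random assignment written for illustration, not the procedure caret uses internally, so the sizes can differ by a row or two from those caret prints.

```r
set.seed(1000)
n <- 11776   ## rows in the training set
k <- 10      ## number of folds
## Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = n))
## Each model is fitted on the rows outside one fold
train_sizes <- n - as.vector(table(fold_id))
range(train_sizes)
```

With 10 folds the model is fitted 10 times on roughly 90% of the rows, compared with 25 fits on full-size bootstrap samples under caret's default shown earlier.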
The process for fitting the random forest model in parallel is as follows.
1- Configure parallel processing
2- Configure trainControl object
3- Develop training model
4- De-register parallel processing cluster
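library(parallel)    ## provides makeCluster(), detectCores(), stopCluster()
library(doParallel)  ## provides registerDoParallel()
set.seed(1000)
##Configure parallel processing, leaving one core free for the operating system
cluster<-makeCluster(detectCores()-1)
registerDoParallel(cluster)
##Configure trainControl object
fitControl<-trainControl(method="cv",number=10,allowParallel=TRUE)
##Develop training model
modfit2<-train(classe~.,method="rf",data=ntraining,trControl=fitControl)
##De-register parallel processing cluster and return foreach to sequential processing
stopCluster(cluster)
registerDoSEQ()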
modfit2
## Random Forest
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10598, 10597, 10598, 10600, 10598, 10598, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9902344 0.9876457
## 27 0.9911687 0.9888288
## 52 0.9880266 0.9848536
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
The cross-validated accuracy of the model reaches 0.9912.
The in-sample error, estimated from cross-validation, is 1 - 0.9912 = 0.0088.
testpredict2<-predict(modfit2,ntesting)
confusionMatrix(testpredict2,ntesting$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2229 24 0 0 0
## B 3 1492 12 1 2
## C 0 2 1351 24 2
## D 0 0 5 1261 5
## E 0 0 0 0 1433
##
## Overall Statistics
##
## Accuracy : 0.9898
## 95% CI : (0.9873, 0.9919)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9871
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9987 0.9829 0.9876 0.9806 0.9938
## Specificity 0.9957 0.9972 0.9957 0.9985 1.0000
## Pos Pred Value 0.9893 0.9881 0.9797 0.9921 1.0000
## Neg Pred Value 0.9995 0.9959 0.9974 0.9962 0.9986
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2841 0.1902 0.1722 0.1607 0.1826
## Detection Prevalence 0.2872 0.1925 0.1758 0.1620 0.1826
## Balanced Accuracy 0.9972 0.9900 0.9916 0.9895 0.9969
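The accuracy of the prediction on the testing set reaches 0.9898.
The out-of-sample error is 1 - 0.9898 = 0.0102.
Finally, use the random forest model to predict the 20 test cases.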
testresult<-predict(modfit2,newdata=testdata)
testresult
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E