Creating a machine learning algorithm to predict activity type and performance from accelerometer data

What it is about:

The idea behind this project is to build a predictive model that recognizes the type of human workout activity and whether the workout is performed correctly, as defined in the training set. Training and testing data for such projects can come from personal activity sensors such as those made by Nike, Fitbit, etc. One possibility is to use the data set from Velloso et al. (2013), described at http://groupware.les.inf.puc-rio.br/har and available for download at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv. The Velloso et al. (2013) data set is licensed under the Creative Commons license (CC BY-SA).

Loading the data; empty cells and cells containing a single space are read as NA:

datar<-read.csv("pml-training.csv", na.strings = c("", " ", "NA"))
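
To see what the na.strings argument does, the following minimal sketch reads a tiny inline CSV and counts the missing values per column. The column names and values are made up for illustration and are not taken from the data set.

```r
# Toy CSV: column b holds an empty cell, a single-space cell and a literal "NA".
# The data are hypothetical and only illustrate the na.strings argument.
csv_text <- "a,b,c\n1,,x\n2, ,y\n3,NA,z\n4,5,w"
toy <- read.csv(text = csv_text, na.strings = c("", " ", "NA"))

# Number of missing values in each column after reading.
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

All three kinds of cell in column b end up as NA, which is what makes the column filter in the pre-processing step possible.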

Pre-processing

Dropping columns that contain NA values or empty cells; in this data set these are the columns that hold derived statistics such as min, max, std, etc. Columns 1:7, which contain supplemental data, are dropped as well.

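library(plyr)  # provides colwise()
# count missing values in each column
cols<-function(x){sum(is.na(x))}
y<-colwise(cols)(datar)
# drop every column that has at least one NA
namestodrop<-c(names(y[which(y>0)]))
datarm<-datar[,!colnames(datar) %in% namestodrop]
# drop the first seven columns (supplemental data)
datarm<-datarm[,-(1:7)]
rm(y, namestodrop, cols)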

This reduced the number of columns from 160 to 53 (52 predictors plus the classe outcome).
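
The same column filter can also be written in base R without plyr. The sketch below is an alternative formulation, not the code used in this report, and it runs on a small hypothetical data frame standing in for datar.

```r
# Hypothetical toy data frame standing in for datar.
toy <- data.frame(
  id        = 1:4,
  roll_belt = c(1.4, 1.5, 1.3, 1.6),
  max_roll  = c(NA, NA, 2.1, NA),   # derived column, mostly missing
  classe    = c("A", "B", "A", "C")
)

na_counts <- colSums(is.na(toy))                  # missing values per column
complete  <- toy[, na_counts == 0, drop = FALSE]  # keep fully observed columns
print(names(complete))
```

Here colSums(is.na(...)) plays the role of colwise(cols), and the logical index na_counts == 0 replaces the name lookup.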

Shuffling the rows and splitting them into training (75%) and testing (25%) data sets:

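library(caret)  # provides createDataPartition(), train(), confusionMatrix()
set.seed(101010)
# shuffle the rows
datas<-datarm[sample(nrow(datarm)),]
# stratified 75/25 split on the outcome
inTrain<-createDataPartition(y=datas$classe, p=0.75, list=FALSE)
training<-datas[inTrain,]
testing<-datas[-inTrain,]
remove(inTrain, datas, datarm)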
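
The split is stratified on the outcome, so each class keeps roughly the same share in both parts. The idea can be sketched in base R as follows. This is only an illustration of stratified sampling on hypothetical labels; caret's own implementation differs in detail.

```r
set.seed(1)
# Hypothetical labels with unequal class frequencies.
classe <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Sample 75% of the row indices within each class, then combine them.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE))

train_labels <- classe[in_train]
test_labels  <- classe[-in_train]
print(table(train_labels))
print(table(test_labels))
```

Because sampling happens inside each class, the class proportions of the training and testing parts match those of the full data.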

Modelling

Here a random forest model is applied. This model is computationally expensive, so the doParallel package is used to run the computation on all cores of my computer's processor.

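library(doParallel)  # also loads parallel: makeCluster(), detectCores()
set.seed(1010123)
cl<-makeCluster(detectCores(), type='PSOCK')
registerDoParallel(cl)
# 10-fold cross-validation (the default number of folds for method="cv")
trCon<-trainControl(method="cv")
# mtry is fixed at 10, so no tuning grid is searched
modelfit<-train(classe~., data = training, method="rf", tuneGrid=expand.grid(mtry = 10), trControl=trCon)
# shut down the workers and return to sequential execution
stopCluster(cl)
registerDoSEQ()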
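
The life cycle of a PSOCK cluster (create, use, stop) can be tried in isolation with the parallel package that ships with R. The sketch below is a toy example with an arbitrary worker count of two and a trivial task; it does not involve caret or doParallel.

```r
library(parallel)  # ships with R

cl <- makeCluster(2, type = "PSOCK")            # two workers, an arbitrary choice
squares <- parSapply(cl, 1:4, function(i) i^2)  # evaluated on the workers
stopCluster(cl)                                 # release the worker processes
print(squares)
```

Stopping the cluster once the work is done frees the background R processes that makeCluster started.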

This random forest model uses 10-fold cross-validation for resampling. I did not use repeated cross-validation, since it is not needed for this task. Model performance is:

modelfit

## Random Forest 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 13247, 13246, 13246, 13247, 13248, 13245, ... 
## Resampling results
## 
##   Accuracy   Kappa      Accuracy SD  Kappa SD   
##   0.9949039  0.9935533  0.001478     0.001869855
## 
## Tuning parameter 'mtry' was held constant at a value of 10
##
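
The sample sizes in the summary follow from the fold structure: in each of the 10 rounds, one fold is held out and the model is fit on the rest. A minimal base-R sketch of a plain random fold assignment is shown below. It is an assumption-laden illustration: caret's own fold assignment differs in detail, so its sizes vary slightly more.

```r
set.seed(1)
n <- 14718   # number of training samples reported in the output
k <- 10      # number of folds

# Assign each row to one of k folds at random, with near-equal fold sizes.
fold_id <- sample(rep(seq_len(k), length.out = n))

# In round i, fold i is held out and the model is fit on the remaining rows.
train_sizes <- n - as.vector(table(fold_id))
print(train_sizes)
```

Each round therefore trains on about 90% of the 14718 rows, in line with the sizes listed in the summary.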

Testing the machine learning model on data that was not used to train it

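This model (modelfit) is used to predict the activity type for the testing data set, and the predictions are compared with the actual values in the classe variable of the testing data set. Note that the actual values are passed as the first argument, so in the output the rows labelled Prediction hold the actual classes and the columns labelled Reference hold the predicted classes.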

confusionMatrix(testing$classe, predict(modelfit, testing))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1393    2    0    0    0
##          B    1  947    1    0    0
##          C    0    3  850    2    0
##          D    0    0    4  799    1
##          E    0    0    0    4  897
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9963          
##                  95% CI : (0.9942, 0.9978)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9954          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9993   0.9947   0.9942   0.9925   0.9989
## Specificity            0.9994   0.9995   0.9988   0.9988   0.9990
## Pos Pred Value         0.9986   0.9979   0.9942   0.9938   0.9956
## Neg Pred Value         0.9997   0.9987   0.9988   0.9985   0.9998
## Prevalence             0.2843   0.1941   0.1743   0.1642   0.1831
## Detection Rate         0.2841   0.1931   0.1733   0.1629   0.1829
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9994   0.9971   0.9965   0.9957   0.9989

Kappa is higher than 0.99 (0.9954), and the statistics by class show that the model performs well for every class. The model correctly predicted 4886 of the 4904 activities in the testing data set.
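
The accuracy and Kappa values can be re-derived by hand from the confusion matrix. The sketch below copies the counts from the output above and applies the standard formula for Cohen's kappa; the variable names are chosen here for illustration.

```r
# Confusion matrix copied from the output above (rows and columns A to E).
cm <- matrix(c(1393,   2,   0,   0,   0,
                  1, 947,   1,   0,   0,
                  0,   3, 850,   2,   0,
                  0,   0,   4, 799,   1,
                  0,   0,   0,   4, 897),
             nrow = 5, byrow = TRUE,
             dimnames = list(LETTERS[1:5], LETTERS[1:5]))

n         <- sum(cm)                                 # all test cases
correct   <- sum(diag(cm))                           # cases on the diagonal
accuracy  <- correct / n                             # observed agreement
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected by chance
kappa_hat <- (accuracy - expected) / (1 - expected)  # Cohen's kappa

print(c(correct = correct, total = n))
print(round(c(accuracy = accuracy, kappa = kappa_hat), 4))
```

Kappa corrects the raw accuracy for the agreement that would be expected by chance given the class frequencies, which is why it is slightly lower than the accuracy.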

Conclusion

The machine learning model was built with the random forest algorithm. The model performs well, with Kappa > 0.99 on the testing data set.

Literature

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.