Machine Learning to Assess Weight Lifting Quality

The following report fits a machine learning model to the Weight Lifting Exercise Dataset ¹, a dataset created to study how well the Dumbbell Biceps Curl exercise is executed. The exercise is performed by 6 different subjects, each repeating it in 5 different ways: the correct one and 4 incorrect ones. More information is available at the reference ².

Model selection and model fitting strategy.

The paper Machine Learning in Human Movement Biomechanics: Best Practices, Common Pitfalls, and New Opportunities ³ reviews a total of 129 papers that fit machine learning models to study human movement. Most of the studies involved datasets collected from accelerometers.

Summary of the meta analysis of machine learning models for human movement classification

According to this paper, the most common classification model used to address human movement was the Support Vector Machine. Therefore, this model will be fitted and tuned, and the results will be compared to the ones produced by the originators of the dataset, who decided to fit a random forest model.

Additionally, the meta analysis paper suggests the following practices:

Principal Components analysis to simplify the dataset
Data has to be centered and scaled.
Cross validation to improve predictions
Wide range of metrics for future model comparison.
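
The first two practices in the list above can be sketched in base R. This is only an illustration on the built-in iris data, not on the weight lifting data, and the variable names are chosen here for the example.

```r
# Minimal sketch on the built-in iris data (an assumption for illustration).
x <- iris[, 1:4]                      # numeric predictors only

# Centre each column to mean 0 and scale it to standard deviation 1.
x_scaled  <- scale(x, center = TRUE, scale = TRUE)
col_means <- colMeans(x_scaled)
col_sds   <- apply(x_scaled, 2, sd)

# Principal components of the standardised data.
pca    <- prcomp(x_scaled)
scores <- pca$x                       # observations expressed in PC coordinates
```

Centering and scaling before PCA keeps variables measured in large units from dominating the components.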

Data pre-processing

df<-read.csv("pml-training.csv")

library(caret)

# Remove near-zero-variance variables.
nzv<-nearZeroVar(df,saveMetrics = TRUE)
df<-df[,which(nzv$nzv==FALSE)]

# Count the NA values in each remaining variable.
na_count <-sapply(df, function(y) sum(is.na(y)))
# table(na_count) indicates that missing values are very consistent throughout the data: 67 variables miss 19216 values out of 19622. These variables will be removed.
rdf<-df[,which(na_count==0)]

# Remove the X and the general time variable (columns 1 and 5). These variables do not provide relevant information to the model, and they are highly correlated with the output because the data was recorded in order.
rdf<-rdf[,c(-1,-5)]

prepro<-preProcess(rdf,method=c("center", "scale", "pca"), thresh = 0.99)
training<-predict(prepro,rdf)

The data pre-processing consisted of:

Removing the variables with near-zero variance (~60 variables out of 160 were removed).
From the remaining variables, removing those that contained a high number of NA values (~43 variables out of 100 were removed).
Removing the variable X and the time variable, as these are highly correlated with the output due to the way the data was recorded (2 out of 57).
Centering and scaling the remaining variables and extracting their principal components. The principal components retained explain 99% of the variance; in total there are 39 predictor variables and 1 output variable.
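
The idea behind keeping enough principal components to reach a variance threshold can be sketched with base R's prcomp. This is an illustration on the built-in mtcars data under the assumption that the threshold works on cumulative explained variance; caret's preProcess may differ in implementation detail.

```r
# Sketch on the built-in mtcars data; 0.99 mirrors thresh = 0.99 used above.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component, then cumulated.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var       <- cumsum(var_explained)

# Smallest number of components whose cumulative variance reaches 99%.
n_comp  <- which(cum_var >= 0.99)[1]
reduced <- pca$x[, seq_len(n_comp), drop = FALSE]
```

The columns of reduced play the same role as the principal component predictors that are passed on to the model.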

Model tuning and out of sample error

A quick hyper-parameter tuning has been undertaken for the support vector machine with 3 different kernels (linear, polynomial and radial); the best performing one is the polynomial kernel. The cross validation uses 3 folds, to keep the model building simple.

Finally, the predetermined parameter tuning has selected the following parameters: degree = 3, scale = 0.1 and C = 1.

When performing cross validation, the accuracy reported for the model is the average over the 3 folds. The selected model reaches a cross-validated accuracy of 0.9907, that is, an estimated out-of-sample error of about 0.9%, so it is expected to perform well with the test data.

ctrl <- trainControl(method = "cv", number = 3, verboseIter = TRUE)

modellinear<-train(classe~.,data=training, method="svmLinear", trControl=ctrl)

modelpoly<-train(classe~.,data=training, method="svmPoly", trControl=ctrl)

modelradial<-train(classe~.,data=training, method="svmRadial", trControl=ctrl)
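
How a 3-fold cross-validated accuracy is obtained by averaging over folds can be sketched in base R. The example below is an illustration only: it uses the built-in iris data, an arbitrary seed, and a hypothetical nearest-centroid helper (predict_centroid) as a simple stand-in for the SVM.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility
k <- 3
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Randomly assign each row to one of k folds of (almost) equal size.
fold <- sample(rep(seq_len(k), length.out = nrow(x)))

# Hypothetical helper: nearest-centroid classifier, a stand-in for the SVM.
predict_centroid <- function(x_train, y_train, x_new) {
  centroids <- apply(x_train, 2, function(col) tapply(col, y_train, mean))
  # Squared distance from every new row to every class centroid.
  d <- sapply(rownames(centroids), function(cl)
    rowSums(sweep(x_new, 2, centroids[cl, ])^2))
  factor(colnames(d)[max.col(-d, ties.method = "first")],
         levels = levels(y_train))
}

# Hold out each fold in turn, fit on the rest, and score the held-out fold.
fold_acc <- sapply(seq_len(k), function(i) {
  test_idx <- fold == i
  pred <- predict_centroid(x[!test_idx, , drop = FALSE], y[!test_idx],
                           x[test_idx, , drop = FALSE])
  mean(pred == y[test_idx])
})

cv_accuracy <- mean(fold_acc)   # average accuracy over the folds
cv_error    <- 1 - cv_accuracy  # estimate of the out-of-sample error
```

Each observation is used for testing exactly once, so the averaged accuracy is never computed on data the corresponding fit has seen.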

The tuning process of the polynomial model:

modelpoly

## Support Vector Machines with Polynomial Kernel 
## 
## 19622 samples
##    39 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 13081, 13081, 13082 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     Accuracy   Kappa    
##   1       0.001  0.25  0.5802162  0.4544048
##   1       0.001  0.50  0.6336767  0.5313720
##   1       0.001  1.00  0.6528389  0.5572044
##   1       0.010  0.25  0.6720521  0.5821410
##   1       0.010  0.50  0.6806138  0.5930609
##   1       0.010  1.00  0.6897361  0.6048222
##   1       0.100  0.25  0.6975333  0.6149026
##   1       0.100  0.50  0.7013554  0.6199117
##   1       0.100  1.00  0.7070633  0.6272600
##   2       0.001  0.25  0.6408116  0.5405979
##   2       0.001  0.50  0.6633884  0.5707413
##   2       0.001  1.00  0.6832640  0.5965506
##   2       0.010  0.25  0.8357457  0.7916625
##   2       0.010  0.50  0.8718276  0.8375975
##   2       0.010  1.00  0.9011824  0.8748059
##   2       0.100  0.25  0.9743657  0.9675572
##   2       0.100  0.50  0.9798696  0.9745286
##   2       0.100  1.00  0.9839467  0.9796896
##   3       0.001  0.25  0.6635413  0.5707229
##   3       0.001  0.50  0.6868824  0.6010764
##   3       0.001  1.00  0.7106822  0.6316587
##   3       0.010  0.25  0.8966466  0.8690820
##   3       0.010  0.50  0.9228418  0.9022733
##   3       0.010  1.00  0.9475590  0.9335866
##   3       0.100  0.25  0.9895526  0.9867844
##   3       0.100  0.50  0.9902661  0.9876869
##   3       0.100  1.00  0.9907247  0.9882672
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 1.
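
The 27 parameter combinations in the table above can be written out explicitly with base R's expand.grid. This is a sketch for illustration; the object name grid is chosen here and does not appear in the original analysis.

```r
# The combinations of degree, scale and C shown in the tuning table above.
grid <- expand.grid(degree = 1:3,
                    scale  = c(0.001, 0.01, 0.1),
                    C      = c(0.25, 0.5, 1))
nrow(grid)   # 3 * 3 * 3 = 27 combinations

# A wider or finer grid could be passed to caret::train() through its
# tuneGrid argument, e.g. train(..., method = "svmPoly", tuneGrid = grid).
```

Since the best accuracy sits at the edge of the grid (the largest degree, scale and C), extending the grid in that direction would be a natural next step for improving the parameter tuning.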

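The confusion matrix of the selected model on the training data is shown below. Because it is computed on the same data used for fitting, its accuracy is more optimistic than the cross-validated estimate: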

confusionMatrix(predict(modelpoly,training),training$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 5580    1    0    0    0
##          B    0 3796    1    0    0
##          C    0    0 3419    6    0
##          D    0    0    2 3206    0
##          E    0    0    0    4 3607
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9993          
##                  95% CI : (0.9988, 0.9996)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9991          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9997   0.9991   0.9969   1.0000
## Specificity            0.9999   0.9999   0.9996   0.9999   0.9998
## Pos Pred Value         0.9998   0.9997   0.9982   0.9994   0.9989
## Neg Pred Value         1.0000   0.9999   0.9998   0.9994   1.0000
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2844   0.1935   0.1742   0.1634   0.1838
## Detection Prevalence   0.2844   0.1935   0.1745   0.1635   0.1840
## Balanced Accuracy      1.0000   0.9998   0.9994   0.9984   0.9999
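
The headline statistics can be recomputed by hand from the confusion matrix printed above. The sketch below copies the counts from that output into a base R matrix; the variable names are chosen here for illustration.

```r
# Counts copied from the output above (rows = prediction, columns = reference).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(5580,    1,    0,    0,    0,
                  0, 3796,    1,    0,    0,
                  0,    0, 3419,    6,    0,
                  0,    0,    2, 3206,    0,
                  0,    0,    0,    4, 3607),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy    <- sum(diag(cm)) / sum(cm)   # share of correct predictions
sensitivity <- diag(cm) / colSums(cm)    # per class: correct / actual members
pos_pred    <- diag(cm) / rowSums(cm)    # per class: correct / predicted members

round(accuracy, 4)
round(sensitivity, 4)
round(pos_pred, 4)
```

The rounded values agree with the Accuracy, Sensitivity and Pos Pred Value rows reported by confusionMatrix.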

Results

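test<- read.csv("pml-testing.csv")

# Apply the pre-processing learned on the training data to the test data.
testing<-predict(prepro, test)

# Predict the class of each test case (this call uses the radial-kernel model).
R<-predict(modelradial, testing)

data.frame(Question=test$problem_id, Solution=R)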

##    Question Solution
## 1         1        B
## 2         2        A
## 3         3        B
## 4         4        A
## 5         5        A
## 6         6        C
## 7         7        D
## 8         8        D
## 9         9        A
## 10       10        A
## 11       11        B
## 12       12        C
## 13       13        B
## 14       14        A
## 15       15        E
## 16       16        E
## 17       17        A
## 18       18        B
## 19       19        B
## 20       20        B

Conclusion

The Support Vector Machine seems to perform well with little tuning. To improve accuracy, the following activities could be undertaken:

Compare it with a random forest model,
Improve the parameter tuning,
Build an ensemble model with a random forest model.

¹ Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
² http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har
³ E. Halilaj, A. Rajagopal, M. Fiterau, J.L. Hicks, T.J. Hastie, S.L. Delp, Machine Learning in Human Movement Biomechanics: Best Practices, Common Pitfalls, and New Opportunities, Journal of Biomechanics (2018), doi:https://doi.org/10.1016/j.jbiomech.2018.09.009
