Nowadays, there are many technologies that can easily track statistics of your daily routine: standing, sitting, walking, running, and so on. The data generated by these tools can be used to predict the type of activity a person is doing. See this example paper, which attempts to detect activity type using triaxial accelerometers.
In this study, we will be working on a weight lifting exercises dataset from a study conducted by Velloso, Bulling, Gellersen, Ugulino, and Fuks. You can read more about their paper here. Six participants performed 10 repetitions of the unilateral dumbbell biceps curl in five different ways: one is the proper way of executing the exercise (denoted by class A), while the rest correspond to common mistakes made when doing the activity (classes B-E). Participants wore IMUs at specific body points to record triaxial accelerometer readings for each activity.
This study focuses on predicting the manner in which the exercise was performed, using the data collected from the experiment. Specifically, we will use the data from accelerometers on the belt, forearm, arm, and dumbbell.
The training data consist of 19,622 rows and 160 columns. As mentioned, the only features we will use are the accelerometer readings on the belt, forearm, arm, and dumbbell. Selecting only these columns gives us 12 features (13 columns in total, including the classe column, our outcome variable). The testing set has only 20 rows; it is really intended for the final quiz of the Practical Machine Learning course by Johns Hopkins University on Coursera, but we will use it as our actual test set.
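As a minimal sketch, the column selection might look like this with dplyr; the file name and data frame name here are illustrative, not necessarily those of the original script:

library(dplyr)

# Keep only the triaxial accelerometer readings plus the outcome.
# The accel_ prefix matches accel_belt_x/y/z, accel_arm_x/y/z,
# accel_dumbbell_x/y/z, and accel_forearm_x/y/z -- 12 features in total.
training_full <- read.csv("pml-training.csv") %>%
  select(starts_with("accel_"), classe)

dim(training_full)  # expect 19,622 rows and 13 columns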
The training set was partitioned in two: one part for the actual training set and one for the quiz set. 70% (13,737 rows) of the full training set was used to train the models, while the remaining 30% (5,885 rows) was used to test their accuracy. The quiz set serves as our pseudo test set in this study. For the sake of reproducibility, the set.seed() function was used.
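A sketch of the split using the caret package; the seed value and object names are illustrative:

library(caret)

set.seed(1234)  # illustrative seed; any fixed value makes the split reproducible
in_train <- createDataPartition(training_full$classe, p = 0.70, list = FALSE)
training <- training_full[in_train, ]   # ~13,737 rows
quiz     <- training_full[-in_train, ]  # ~5,885 rows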
My approach was to first train the models without any data pre-processing, and to set a threshold of a 90% accuracy rate on the quiz set for a model to be used on the test set. If none met the requirement, pre-processing would be conducted. Random forest, k-nearest neighbors (k-NN), and gradient boosting machine (GBM) are the models used in this study. All three used repeatedcv as the resampling method, with 3 folds and 3 repeats.
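The corresponding caret setup might look like the sketch below (model object names are illustrative):

ctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 3)

rf_fit  <- train(classe ~ ., data = training, method = "rf",  trControl = ctrl)
knn_fit <- train(classe ~ ., data = training, method = "knn", trControl = ctrl)
gbm_fit <- train(classe ~ ., data = training, method = "gbm", trControl = ctrl,
                 verbose = FALSE)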
In a separate script, the models were fit and the resulting objects saved. The graph below illustrates the accuracy rate of each model on both the training and quiz sets.
From the above visual, we can see that GBM and k-NN overfit, with gaps of about 3% and 8%, respectively, between training and quiz accuracy. Random forest, however, resulted in nearly equal accuracy rates, with only a 0.02% gap between training and quiz accuracy. Following the threshold we previously set, we will use random forest to predict the test set.
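Assuming the fitted objects from the sketch above, the quiz-set accuracies behind this comparison could be computed along these lines:

# Out-of-sample accuracy of each model on the held-out quiz set
sapply(list(rf = rf_fit, knn = knn_fit, gbm = gbm_fit), function(fit) {
  mean(predict(fit, newdata = quiz) == quiz$classe)
})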
Variable importance is, roughly, a measure of how useful a variable is to a model. We can get the variable importance for each model we used, but I'll present only that of our selected model, random forest.
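With caret, the importance scores for the random forest fit can be extracted roughly as follows (object names as in the earlier sketches):

# Scaled variable importance (0-100) for the random forest
rf_imp <- varImp(rf_fit)
print(rf_imp)
plot(rf_imp, top = 12)  # all 12 accelerometer features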
It's clear that the belt accelerometer's z-axis reading and the dumbbell accelerometer's y- and z-axis readings are the top 3 most useful variables in the data. Variable importance can be considered in feature selection to improve the model: we could set an importance threshold, or select the top-n most important features from the data to build another model aiming for better accuracy, as sketched below.
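As an illustrative sketch only, refitting on the top 3 features might look like this (the cutoff of 3 is arbitrary):

# Keep the n most important features and refit (n = 3 is illustrative)
imp <- varImp(rf_fit)$importance
top_vars <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:3]
rf_top <- train(classe ~ ., data = training[, c(top_vars, "classe")],
                method = "rf", trControl = ctrl)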
The plot above is a scatterplot of accel_dumbbell_y against accel_belt_z, colored by correct and incorrect predictions. Predicting the classe variable of the testing set using random forest gave us a 95% accuracy rate, or 19 correct predictions out of 20 observations. It's not too obvious from the chart, but the incorrect prediction was masked by a correct prediction: looking closely, the dot indicated by the arrow is a little darker because the blue and red points, each with some transparency, overlap. Looking at the actual data below confirms that there are two points with exactly the same values of accel_belt_z and accel_dumbbell_y.
# Inspect the two test-set rows that share the same accelerometer values
testing %>% filter(accel_belt_z == 49) %>% select(bool, accel_belt_z, accel_dumbbell_y)
## bool accel_belt_z accel_dumbbell_y
## 1 incorrect 49 155
## 2 correct 49 155
Having equal values on the two most important variables might have been the reason for this incorrect prediction. Setting data error aside, we could have worked around it by fitting the same model using only the most important features, or by using other models with different tuning parameters and some feature selection. We could also combine the models into an ensemble: the other two models we fit had fairly high accuracy rates on the quiz set, and combining all three could yield even higher accuracy.
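A simple majority-vote ensemble over the three fitted models could be sketched like this (object names as in the earlier sketches; stacking with a meta-learner would be another option):

# Majority vote across the three fitted models on the quiz set
preds <- data.frame(
  rf  = predict(rf_fit,  newdata = quiz),
  knn = predict(knn_fit, newdata = quiz),
  gbm = predict(gbm_fit, newdata = quiz)
)
vote <- apply(preds, 1, function(p) names(which.max(table(p))))
mean(vote == quiz$classe)  # ensemble accuracy on the quiz set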
Overall, the final model we used performed really well.