Note: This is an edited version of a project created in partial fulfillment of the Coursera “Practical Machine Learning” course by Johns Hopkins University. The project was first submitted through GitHub on April 6, 2024.

I. Summary

The objective was to use activity-sensor data to classify the activity of six users into one of five dumbbell lifts (classe A, B, C, D, E) using machine learning. Data from four activity sensors was cleaned, which removed 51 rows and 117 variables. To account for user-dependent variation, the training, test, and validation sets were normalised per user. Cross-validation was used to compute accuracy and sensitivity when choosing the number of trees and the predictors (with corresponding mtry). The final random forest model used to classify lifts had 150 trees, an mtry of 30, 41 predictors, and the default minimum node size of 1. It had an in-sample prediction error of 2.2% and an expected out-of-sample error of 2.4% (validation overall accuracy of 97.6%).

II. Background, Results and Analysis

A. Background

Data was provided to the Coursera Practical Machine Learning course by Velloso et al. 2013 (http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har), who conducted and published the original study.

Six participants (“user_name”) were asked to perform 5 types of dumbbell lifts to estimate how well combined information from four sensors (dumbbell, belt, arm, forearm) could differentiate the following lifts (i.e. “classe”):
A = according to specifications
B = throwing elbows to the front
C = lifting dumbbell only halfway
D = lowering dumbbell only halfway
E = throwing hips to the front

Two algorithms and two versions of the predictors were investigated: random forest and gradient boosting machines, each using either normalised predictors or principal components of the normalised predictors. These comparisons suggested that random forest with normalised predictors was best suited to this data (not shown).
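The comparison scripts themselves are not shown; a minimal caret sketch of the kind of comparison described might look like the following, where the object name train_norm and the formula interface are placeholders, not the original code:

#Hypothetical comparison sketch; object names are placeholders
ctrl <- trainControl(method = "cv", number = 10)

#random forest and gradient boosting on the normalised predictors
fit_rf  <- train(classe ~ ., data = train_norm, method = "ranger", trControl = ctrl)
fit_gbm <- train(classe ~ ., data = train_norm, method = "gbm", trControl = ctrl, verbose = FALSE)

#the same model type on principal components of the normalised predictors
fit_rf_pca <- train(classe ~ ., data = train_norm, method = "ranger", preProcess = "pca", trControl = ctrl)

Held-out accuracies of the fits (e.g. via caret::resamples) could then be compared to select the approach.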

B. Pre-processing: Clean and normalise

1. Cleaning

The original csv file contained 19622 observations and 160 variables; 19217 observations had missing values. Data cleaning involved removing severe outliers, variables with a high percentage of missing values, and highly correlated variables. This resulted in cleaned data with 19571 observations, 43 variables, and no observations with missing values (a sample cleaning script is sketched in Appendix III.B). Additional specifics for the cleaned data are summarised in Table 1.

Table 1. User_name and classe contributions as % of total observations, and the number of predictors contributed by each sensor to the cleaned data. Percentages (pct_un, pct_cl) may not add up to 100 due to rounding.
user_name   pct_un     classe   pct_cl     sensor     prednum
adelmo        19.6     A          28.5     belt             9
carlitos      15.9     B          19.2     dumbbell        13
charles       18.1     C          17.5     arm              9
eurico        15.7     D          16.4     forearm         10
jeremy        17.4     E          18.4
pedro         13.3

2. Normalising

The cleaned data was partitioned into a 60% training set and 40% testing + validation sets. Predictor means and ranges differed between users (Figure 1, column a), so per-user normalisation was done to improve consistency. The training mean and absolute maximum were computed for each user and used to normalise the training, test, and validation sets. This resulted in means of zero and values between -1 and 1 (Figure 1, columns b and c), though the normalised test and validation data deviated from these ideal values (not shown).
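One consistent reading of this procedure (an assumption; the normalisation script itself is not shown) is that, for each user u and predictor x, the training set supplies the centring and scaling constants:

x_norm = (x - mean_u(x_train)) / max_u(|x_train - mean_u(x_train)|)

which guarantees a training mean of zero and training values in [-1, 1], while test and validation values can fall outside that range, matching the deviations noted above. A corresponding sample script is sketched in Appendix III.C.2.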

C. Machine learning with random forest

1. Model comparisons

The final model parameters were 150 trees, a minimum node size of 1, an mtry of 30, and 41 variables. The in-sample prediction error was 2.2% (Figure 2b1, rf41).

The random forest model was optimised by varying the number of trees, mtry, and the number of variables. Cross-validation was used in two ways: during training, the training set was resampled with a 10x cross-validation setting in caret; in addition, models were cross-validated against out-of-sample observations to choose the best model parameters based on accuracy and sensitivity (Figure 2).

To determine changes in accuracy with different numbers of trees (Figure 2a), and to allow computation with limited processing capacity, the normalised training set (19571 observations) was further partitioned to give a treetrain set of 2350 observations. The remaining data was resampled into 7 test sets of 1881 observations each. Forty-one variables, which excluded "user_name", were used to predict "classe" (Figure 3d). A model with 150 trees (Figure 2a, r150) was chosen since it had accuracy and sensitivity similar to the 250-tree (Figure 2a, r250) and 500-tree (Figure 2a, r500) models.

Using 150 trees, models with different mtry values and variable combinations were trained on the full 19571-observation training set (Figure 2b). The variable combinations were based on importance in an initial model that used the full training set and 41 variables. Model accuracies were cross-validated using half of the test set (3913 observations, 20% of total observations) resampled 9 times. Although overall accuracy and sensitivity were similar for the models with 16, 25, and 41 variables (Figure 2b, rf16, rf25, and rf41), the 41-variable model (Figure 2b, rf41) was chosen because: i) the accuracy for some users improved (Figure 2, b2; Figure 4, c1 vs. c2); ii) the confusion-matrix prediction sensitivities were higher (Figure 4, a1 vs. a2); and iii) sensitivity for some users was more consistent with 41 variables (Figure 4, c1 vs. c2). Training with caret determined a corresponding optimum mtry of 30.

2. Variable importance

The 41 predictors were placed into 5 groups based on the final model (different colours in Figure 3). Combinations of these groups correspond to the x-axis labels in Figure 2b. Variable membership in the groups was broadly similar across models. The most influential variables came from the belt and dumbbell sensors, and all models shared the same top 3 predictors (in yellow: magnet_dumbbell_z, magnet_belt_z, pitch_belt).
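The importance ranking presumably relied on caret's varImp; a hedged sketch, assuming a final model object named final_rf trained with ranger's importance computation enabled:

#Hypothetical importance extraction; requires importance = "impurity"
#(or "permutation") to have been passed to ranger through train()
imp <- varImp(final_rf)
plot(imp, top = 41)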

The data distribution and density for the two most and two least important variables are illustrated in Figure 1 (columns b and c). Classes A and E can be visually distinguished in the data distribution and density of the top two variables. For the two least influential variables, classe A has higher data density at or near zero compared to the other classes (Figure 1, column c). That classes A and E could be easily distinguished visually in the plots of the two most important variables is consistent with the general trend of greater accuracy and sensitivity for these two classes compared to classes B, C, and D (Figures 2 and 4).

Overall out-of-sample prediction accuracies of >70% and >90% were possible using only the top 3 and top 8 predictors, respectively (Figure 2, b1). Using only the 8 most influential variables gave per-user accuracies and classe sensitivities of >80% (Figure 2, b2 and b3). Most of these predictors were from the belt and dumbbell sensors, with one from the forearm sensor (Figure 3, d), suggesting that the most influential features differentiating the 5 classes may be attributed to these 8 predictors from 3 of the 4 sensors. Still, accuracy and specificity improved from 8 to 41 variables (Figure 2b; Figure 4, row 1 vs. row 2), perhaps due to imperfections in the data (e.g. Figure 1, b2, user_name = eurico). Redundant predictors likely allowed activities to be properly classified that would otherwise have been excluded from the appropriate classe.

3. Validation

Classe prediction was done once using the validation set (3912 observations, corresponding to 20% of total observations), as summarised in Figure 4, row 3. Accuracy, sensitivity, and specificity were similar between the 9x-resampled test set and the validation set (Figure 4, a2 vs. a3 and b2 vs. b3). However, there were differences of >1% in per-user accuracy and sensitivity between the test and validation sets (Figure 4, c2 vs. c3). Classes B and D, and activities for users carlitos and jeremy, were more difficult to predict (Figure 4, row 3). From validation, the final model is expected to have an overall accuracy of 97.6% (error of 2.4%), sensitivities of 95-99% (Figure 4, a3), user-dependent accuracies of 96-99%, and user-dependent sensitivities of 91-100% (Figure 4, c3).

D. Figures

1.

Figure 1. Point (a, b) and density (c) plots of the most important (rows 1, 2) and least important (rows 3, 4) predictors in the final model. In the point plots (a, b), users are shown in different colours and classes are separated by vertical lines. In the density plots (c), line colours correspond to classe. Data are from before (a) and after (b, c) normalising.

2.

Figure 2. Comparing trees (a), and variable combinations + mtry (b) from out-of-sample predictions. The range (boxplots) and average (lineplots) of the 7 resamples used to cross-validate tree models (a), and of the 9 resamples used to cross-validate variable + mtry models (b), are shown. The predictors used in the variable + mtry models (b) are listed in Figure 3d (e.g. the model restricted to the 8 most important variables, rf08 = L03 + L08, i.e. yellow + light green). The tree models used 41 predictors (a). In-sample training errors are annotated.

3.

Figure 3. Relative variable importance in 3 tree models (a, b, c), and the final model (d). The variables used as predictors were organised into five groups (as coloured: L03-yellow, L08-light green, L16-green, L25-blue, L41-light blue) based on the final model (d). Variable names were removed from the y-axis.

4.

Figure 4. Confusion matrix summary and per-user sensitivity for the 9x-resampled test sets predicted using the 25-variable candidate model (1) and the 41-variable final model (2), and for the validation set (3), predicted once using the final model (150 trees, mtry 30, 41 variables). Annotations are from out-of-sample predictions. Observed totals in Predicted vs. Observed (a) may not add up to 1 due to rounding.

III. Appendix

  • System description: R 4.3.3; RStudio build 402; Windows 10 (64-bit); 1.1 GHz Intel Celeron; 4.00 GB RAM; 57 GB HD.
  • Only general sample scripts are included. These were modified to conduct the analyses.

A. Required packages

library(dplyr)
library(tidyr)
library(caret)
library(ggplot2)
library(ggpubr)

#The ranger random forest package was executed through caret.
#Table 1 was created using knitr::kable

B. Getting and cleaning data

1. Data source

urlTrain <- "https://...DELETED.../pml-training.csv"
urlTest <- "https://...DELETED.../pml-testing.csv"

2. Load and clean training data
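The original script is not shown; the following is a minimal sketch of the loading and cleaning steps described in section II.B.1. The NA encodings, thresholds, and object names are assumptions, not the original code.

#Hypothetical cleaning sketch; thresholds and names are assumptions
pml <- read.csv(urlTrain, na.strings = c("NA", "", "#DIV/0!"))

#drop variables with a high percentage of missing values
na_frac <- colMeans(is.na(pml))
pml <- pml[, na_frac < 0.5]

#drop highly correlated numeric predictors (caret::findCorrelation)
num_cols <- which(sapply(pml, is.numeric))
high_cor <- findCorrelation(cor(pml[num_cols], use = "pairwise.complete.obs"), cutoff = 0.9)
if (length(high_cor) > 0) pml <- pml[, -num_cols[high_cor]]

#severe outlier rows (51 in the original analysis) were also removed;
#the criterion is not shown, so that step is omitted here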

C. Training and prediction

1. Partition train to train and test sets

  • train will be used for training
  • test will be further partitioned to create testing and validation sets (not shown)
set.seed(1439)
#partition on "classe" with p = 0.6 for the 60% training split ("cleaned" is a placeholder name)
trainlist <- createDataPartition(y = cleaned$classe, p = 0.6, list = FALSE)

train <- cleaned[trainlist, ]   #training set
test  <- cleaned[-trainlist, ]  #test set (later split into test and validation)

2. Normalise dataframes
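No script was included for this step; the following is a minimal sketch of per-user normalisation under the interpretation given in section II.B.2 (centre on the per-user training mean, then scale by the per-user absolute maximum of the centred training values). It is shown for a single predictor, and all object names are placeholders.

#Hypothetical sketch for one predictor (pitch_belt); the real script covered all 41
constants <- train %>%
  group_by(user_name) %>%
  summarise(mu = mean(pitch_belt),
            amax = max(abs(pitch_belt - mean(pitch_belt))))

normalise <- function(df) {
  df %>%
    left_join(constants, by = "user_name") %>%
    mutate(pitch_belt = (pitch_belt - mu) / amax) %>%
    select(-mu, -amax)
}

train_n <- normalise(train)  #means 0, values in [-1, 1] by construction
test_n  <- normalise(test)   #may fall slightly outside [-1, 1]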

3. General training

  • The following script was modified to complete the analysis.
#xvars: dataframe of predictor variables; yvar: the "classe" factor to predict
#trControl adds the 10x cross-validation setting described in section II.C.1

trained_df <- train(y = yvar, x = xvars, method = "MethodToUse",
                    trControl = trainControl(method = "cv", number = 10))
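For reference, a hedged sketch of what the final model call might look like via caret's ranger method, using the parameters reported in section II.C.1 (150 trees, mtry 30, minimum node size 1); the tuning-grid form, splitrule, and importance settings are assumptions:

#Hypothetical final-model call; splitrule = "gini" and importance = "impurity" are assumptions
final_rf <- train(y = yvar, x = xvars, method = "ranger",
                  num.trees = 150,
                  importance = "impurity",
                  tuneGrid = data.frame(mtry = 30, splitrule = "gini",
                                        min.node.size = 1),
                  trControl = trainControl(method = "cv", number = 10))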

4. General prediction and confusion matrix

#predict "classe" based on chosen training parameters, using a test or validation set
predT <- predict(model, data)

#confusion matrix to estimate out of sample errors
cm <- confusionMatrix(prediction, observation)

D. Final application

  • The final, validated model was used to classify 20 observations, as sketched below.
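A minimal sketch of that final step, assuming the 20 observations come from the pml-testing.csv file (urlTest above) and receive the same cleaning and per-user normalisation as the training data; clean() below is a hypothetical stand-in for the cleaning steps, not a function from the original scripts:

#Hypothetical final prediction; clean() is a placeholder for the cleaning steps
final20 <- read.csv(urlTest)
answers <- predict(final_rf, newdata = normalise(clean(final20)))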