Classifying Barbell Lifting Techniques Using Random Forests and Sensor Data

1. Introduction

This project predicts how individuals perform barbell lifts from data collected by wearable devices. The dataset contains accelerometer measurements from sensors worn on the body and attached to the equipment (e.g., belt, forearm, and dumbbell). The target variable, classe, encodes the manner in which each lift was performed. A Random Forest model is used to assign each observation to one of these classes.

2. Data Cleaning and Preprocessing

Steps Taken

Missing Data: Columns with over 50% missing values were removed.
Near-Zero Variance: Features with negligible variability were excluded, since near-constant features add little predictive signal.
Irrelevant Features: Columns unrelated to prediction, such as timestamps and IDs, were removed.
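
The three cleaning steps above can be sketched on a toy table. This is a minimal illustration, not the code used for the analysis: the column names, the toy values, and the near-zero-variance rule (one value accounts for more than 95% of the rows) are assumptions chosen for the example. Only the 50% missing-value threshold comes from the text.

```python
from collections import Counter

MISSING_THRESHOLD = 0.5      # from the text: drop columns with over 50% missing
DOMINANCE_THRESHOLD = 0.95   # assumed cutoff for "near-zero variance"
IRRELEVANT = {"user_id", "timestamp"}  # hypothetical ID/timestamp columns


def missing_fraction(values):
    """Share of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)


def is_near_zero_variance(values):
    """True if a single value dominates the non-missing entries."""
    present = [v for v in values if v is not None]
    if not present:
        return True
    top_count = Counter(present).most_common(1)[0][1]
    return top_count / len(present) > DOMINANCE_THRESHOLD


def clean_columns(table):
    """Return the names of columns that survive all three filters."""
    kept = []
    for name, values in table.items():
        if name in IRRELEVANT:
            continue
        if missing_fraction(values) > MISSING_THRESHOLD:
            continue
        if is_near_zero_variance(values):
            continue
        kept.append(name)
    return kept


# Toy table with 40 rows; all values are made up for illustration.
toy = {
    "user_id": list(range(40)),
    "timestamp": list(range(1000, 1040)),
    "roll_belt": [1.0 + 0.1 * i for i in range(40)],
    "mostly_missing": [None] * 30 + [0.5] * 10,
    "near_constant": ["no"] * 39 + ["yes"],
}
print(clean_columns(toy))
```

In this toy table only roll_belt survives: the ID and timestamp columns are dropped by name, mostly_missing is 75% missing, and near_constant is dominated by a single value.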

3. Exploratory Data Analysis

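The classe variable has five classes (A, B, C, D, and E). Class A is the most frequent (about 28% of the testing subset), while classes B through E each account for roughly 16% to 19%. This imbalance is mild, so the dataset was used for classification without additional balancing techniques.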

4. Model Development

Training and Testing

The dataset was split into training (80%) and testing (20%) subsets.
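
One way to make an 80/20 split while preserving class proportions is to sample within each class. The sketch below is an assumption about how such a split could be done, not a record of the procedure used here; the function name, the seed, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=42):
    """Split row indices into train/test, sampling within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy labels: 50 rows of class A and 25 rows each of B and C.
labels = ["A"] * 50 + ["B"] * 25 + ["C"] * 25
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Sampling per class keeps the class mix of the testing subset close to that of the full data, which matters when one class (here A) is more frequent than the others.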

Model

A Random Forest model was trained with 500 trees. The number of candidate features sampled at random for each split (mtry) was set to the square root of the total number of features.
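
For classification, a common default for mtry is the floor of the square root of the number of predictors. The short sketch below shows that calculation and what it means at a single split: only mtry randomly chosen features are considered as split candidates. The predictor count of 52 and the feature names are hypothetical, since the number of predictors remaining after cleaning is not stated here.

```python
import math
import random

n_predictors = 52  # hypothetical count; not stated in the report
mtry = math.isqrt(n_predictors)  # floor(sqrt(52)) = 7

feature_names = [f"feature_{i}" for i in range(n_predictors)]  # placeholder names
rng = random.Random(0)

# At each split, a tree in the forest considers only a random subset of features.
candidates = rng.sample(feature_names, mtry)
print(mtry, len(candidates))
```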

Evaluation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    0    0    0    0
##          B    0  759    0    0    0
##          C    0    0  684    3    0
##          D    0    0    0  640    0
##          E    0    0    0    0  721
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9992          
##                  95% CI : (0.9978, 0.9998)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.999           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   0.9953   1.0000
## Specificity            1.0000   1.0000   0.9991   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   0.9956   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   0.9991   1.0000
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1935   0.1744   0.1631   0.1838
## Detection Prevalence   0.2845   0.1935   0.1751   0.1631   0.1838
## Balanced Accuracy      1.0000   1.0000   0.9995   0.9977   1.0000

Accuracy: The model achieved 99.92% accuracy on the testing subset (95% CI: 99.78% to 99.98%).
Confusion Matrix:
- Only 3 of the 3,923 test predictions were misclassified.
- All three errors were class D observations predicted as class C.
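
The headline statistics can be recomputed directly from the confusion matrix above, which is a useful check on the reported output. The sketch below copies the matrix by hand (rows are predictions, columns are reference classes) and derives accuracy, Cohen's kappa, and per-class sensitivity and specificity; the variable and function names are illustrative.

```python
classes = ["A", "B", "C", "D", "E"]
# Rows: predicted class; columns: reference (true) class.
matrix = [
    [1116, 0, 0, 0, 0],
    [0, 759, 0, 0, 0],
    [0, 0, 684, 3, 0],
    [0, 0, 0, 640, 0],
    [0, 0, 0, 0, 721],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(classes)))
accuracy = correct / total

row_totals = [sum(row) for row in matrix]        # predicted counts per class
col_totals = [sum(col) for col in zip(*matrix)]  # reference counts per class

# Cohen's kappa: agreement beyond what the marginals would give by chance.
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
kappa = (accuracy - expected) / (1 - expected)


def sensitivity(k):
    """True positives divided by all reference cases of class k."""
    return matrix[k][k] / col_totals[k]


def specificity(k):
    """True negatives divided by all reference cases not of class k."""
    false_pos = row_totals[k] - matrix[k][k]
    negatives = total - col_totals[k]
    return (negatives - false_pos) / negatives


print(f"accuracy={accuracy:.4f} kappa={kappa:.3f}")
for k, name in enumerate(classes):
    print(name, f"{sensitivity(k):.4f}", f"{specificity(k):.4f}")
```

This reproduces the reported accuracy of 0.9992 and kappa of 0.999. The only values below 1 are the sensitivity of class D (640/643) and the specificity of class C (3236/3239), both caused by the same three observations.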

5. Results

Predictions

##    problem_id predicted_classe
## 1           1                B
## 2           2                A
## 3           3                B
## 4           4                A
## 5           5                A
## 6           6                E
## 7           7                D
## 8           8                B
## 9           9                A
## 10         10                A
## 11         11                B
## 12         12                C
## 13         13                B
## 14         14                A
## 15         15                E
## 16         16                E
## 17         17                A
## 18         18                B
## 19         19                B
## 20         20                B

The table above lists the model's predictions for the 20 cases in the testing dataset (pml-testing.csv). The first six predictions were B, A, B, A, A, E.

Since the true labels for the test cases are unavailable, the accuracy of these predictions could not be directly evaluated.

Feature Importance

The most critical features identified were:

roll_belt
yaw_belt
pitch_belt

All three are orientation angles measured by the belt sensor. They had the highest impact on model accuracy, likely because they capture differences in body position between the lift techniques.

Error Analysis

##       True_Class Predicted_Class
## 12946          D               C
## 12961          D               C
## 15322          D               C

All three misclassified instances were class D observations predicted as class C, which suggests some overlap in movement characteristics between these two classes.
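
A quick tally of (true, predicted) pairs makes the direction of the errors explicit. The sketch below uses only the three misclassified rows listed above; the variable names are illustrative.

```python
from collections import Counter

# (row id, true class, predicted class), copied from the error table above.
misclassified = [
    (12946, "D", "C"),
    (12961, "D", "C"),
    (15322, "D", "C"),
]

pair_counts = Counter((true, pred) for _, true, pred in misclassified)
for (true, pred), count in pair_counts.most_common():
    print(f"true {true} -> predicted {pred}: {count}")
```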

6. Key Feature Investigation

## [1] "Correlation of roll_belt with classe: 0.0621513426488757"

## [1] "Correlation of yaw_belt with classe: 0.0136011047702848"

Correlation Analysis:

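roll_belt: Weak positive correlation with classe (0.062).
yaw_belt: Very weak positive correlation with classe (0.014).

Because classe is categorical, a correlation with it depends on coding the classes as numbers and captures only a linear trend across that coding. These low values therefore do not conflict with the high importance of both features in the Random Forest, which can use nonlinear, class-specific patterns.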
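
A small toy example shows why a low correlation with a numerically coded class label need not conflict with high importance in a tree-based model. The data below are made up: a hypothetical feature takes a high value for exactly one class (coded 3) and a low value otherwise, so a single threshold identifies that class perfectly even though the Pearson correlation with the class code is zero. The coding of classes as 1 to 5 is an assumption for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data: 10 rows per class, classes coded 1..5.
class_codes = [code for code in range(1, 6) for _ in range(10)]
feature = [10.0 if code == 3 else 0.0 for code in class_codes]

r = pearson(class_codes, feature)

# A single threshold on the feature separates class 3 from all others.
predicted_is_class_3 = [value > 5.0 for value in feature]
actual_is_class_3 = [code == 3 for code in class_codes]

print(abs(r) < 1e-12, predicted_is_class_3 == actual_is_class_3)
```

A decision tree can split on such a threshold directly, which is why tree-based importance and linear correlation can rank the same feature very differently.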

Visual Analysis:

Boxplots of roll_belt and yaw_belt by class showed notably different distributions for Class E, which helps explain the importance of these features in distinguishing this class.

7. Conclusion

This project demonstrated the application of machine learning to activity classification using wearable sensor data. The Random Forest model achieved 99.92% accuracy on the held-out testing subset, with roll_belt and yaw_belt emerging as key contributors. These findings underscore the potential of wearable sensors for fitness monitoring and activity classification.