In this project, we analyze a set of weight lifting exercise data and use a machine learning algorithm to build a predictive model that identifies mistakes in weight lifting with high accuracy.
The Random Forest algorithm without Principal Component Analysis achieves very high accuracy on the cross validation data set and correctly predicts the outcomes of all 20 testing observations.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
First, we download both the training and testing data files and load them into data frames.
# Load the packages used throughout the analysis
library(caret)          # createDataPartition, preProcess, confusionMatrix
library(randomForest)   # randomForest

trainLocal <- "pml-training.csv"
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testLocal <- "pml-testing.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists(trainLocal)) {
    download.file(trainURL, destfile = trainLocal)
}
if (!file.exists(testLocal)) {
    download.file(testURL, destfile = testLocal)
}
# Also treat empty string as "NA"
rawTrain <- read.csv(trainLocal, header = TRUE, na.strings = c("NA", ""))
rawTest <- read.csv(testLocal, header = TRUE, na.strings = c("NA", ""))
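Before cleaning, it is worth confirming that the two files share the same layout. A minimal sanity check (a sketch; in this data set the testing file is expected to replace the classe outcome with a problem_id column):
# The first 159 column names should be identical in both files
all(names(rawTrain)[-160] == names(rawTest)[-160])
# The final column differs: "classe" in training vs. "problem_id" in testing
c(names(rawTrain)[160], names(rawTest)[160])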
We then perform exploratory analysis and data cleaning, wherever needed, on the training data set.
dim(rawTrain)
## [1] 19622 160
dim(rawTest)
## [1] 20 160
The training data set has 19622 rows and 160 columns, and the testing data set has 20 rows and 160 columns.
# Display internal structure of the training data set
str(rawTrain, list.len = 20)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 396 levels "-0.016850","-0.021024",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_belt : Factor w/ 316 levels "-0.021887","-0.060755",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt : Factor w/ 394 levels "-0.003095","-0.010002",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt.1 : Factor w/ 337 levels "-0.005928","-0.005960",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
# Get a count of the NAs in each column
unname(colSums(is.na(rawTrain)))
## [1] 0 0 0 0 0 0 0 0 0 0 0
## [12] 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [23] 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [34] 19216 19216 19216 0 0 0 0 0 0 0 0
## [45] 0 0 0 0 0 19216 19216 19216 19216 19216 19216
## [56] 19216 19216 19216 19216 0 0 0 0 0 0 0
## [67] 0 0 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [78] 19216 19216 19216 19216 19216 19216 0 0 0 19216 19216
## [89] 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [100] 19216 19216 0 19216 19216 19216 19216 19216 19216 19216 19216
## [111] 19216 19216 0 0 0 0 0 0 0 0 0
## [122] 0 0 0 19216 19216 19216 19216 19216 19216 19216 19216
## [133] 19216 19216 19216 19216 19216 19216 19216 0 19216 19216 19216
## [144] 19216 19216 19216 19216 19216 19216 19216 0 0 0 0
## [155] 0 0 0 0 0 0
The internal structure of the training data set shows that a number of columns contain NA values.
A count of the NAs in each column shows that every column has either no NAs at all or exactly 19216 NAs, the latter accounting for 97.93% of all observations.
It is therefore safe to exclude these columns from further analysis, as they contribute almost no information:
# Exclude columns with NA value(s)
isColNA <- (colSums(is.na(rawTrain)) > 0)
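As a quick sanity check on the 97.93% figure quoted above (a sketch reusing the objects already defined):
# Largest per-column NA count as a percentage of all rows: 19216 / 19622
round(100 * max(colSums(is.na(rawTrain))) / nrow(rawTrain), 2)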
We also notice that the first 7 columns (row index, user name, timestamps, and window markers) are irrelevant to the sensor measurements we are interested in, so we exclude them as well:
# Remove the first 7 columns
isColUnrelated <- rep(FALSE, ncol(rawTrain))
isColUnrelated[1:7] <- TRUE
# Extract the training and validation data
cleanedTrain <- rawTrain[, !(isColNA | isColUnrelated)]
ncol(cleanedTrain)
## [1] 53
Now we have 53 columns left (52 predictors plus the classe outcome) for building the predictive model.
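As a quick verification (a sketch), we can list the dropped bookkeeping columns and confirm that no missing values remain:
# The seven identifier/bookkeeping columns that were removed
names(rawTrain)[1:7]
# The cleaned training data should contain no NAs
sum(is.na(cleanedTrain))  # expected 0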
Next, we split the training data into training and cross validation sets, with proportions of 80% and 20%, respectively:
set.seed(800)
# Create data partitions
idxTrain <- createDataPartition(y = cleanedTrain$classe, p = 0.8, list = FALSE)
# Set training data set
trainData <- cleanedTrain[idxTrain, ]
dim(trainData)
## [1] 15699 53
# Set CV data set
cvData <- cleanedTrain[-idxTrain, ]
dim(cvData)
## [1] 3923 53
The training and cross validation data sets contain 15699 and 3923 observations, respectively.
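A quick check (sketch) confirms the 80/20 split:
# Fraction of observations assigned to the training partition
round(nrow(trainData) / nrow(cleanedTrain), 4)  # approximately 0.8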
We compare two predictive models below: Random Forest with and without Principal Component Analysis (PCA). We will use whichever model achieves the higher accuracy.
# Create the Random Forest model
rfFit <- randomForest(classe ~ ., data = trainData)
rfFit
##
## Call:
## randomForest(formula = classe ~ ., data = trainData)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.4%
## Confusion matrix:
## A B C D E class.error
## A 4461 2 0 0 1 0.000672043
## B 13 3022 3 0 0 0.005266623
## C 0 13 2724 1 0 0.005113221
## D 0 0 23 2549 1 0.009327633
## E 0 0 1 5 2880 0.002079002
# Perform cross validation
cvPred <- predict(rfFit, cvData)
cvConf <- confusionMatrix(cvData$classe, cvPred)
cvConf$table
## Reference
## Prediction A B C D E
## A 1116 0 0 0 0
## B 2 755 2 0 0
## C 0 4 679 1 0
## D 0 0 2 640 1
## E 0 0 0 0 721
# Get accuracy
cvAccuracy <- cvConf$overall["Accuracy"]
round(cvAccuracy, 4)
## Accuracy
## 0.9969
From the confusion matrix summary above, we obtain an accuracy of 0.9969 on the cross validation data set.
# Get the names of predictors
predNames <- names(cleanedTrain)
predIdx <- grep("^classe", predNames, invert = TRUE)
predNames <- predNames[predIdx]
# Create preProcess object
preProc <- preProcess(trainData[, predNames], method = "pca", thresh = 0.99)
preProc$numComp
## [1] 36
PCA needs 36 components to capture 99% of the variance, so the number of variables randomly sampled as candidates at each split will be 6 (randomForest's default of the square root of 36 for classification).
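For reference, a minimal sketch of that default (for classification, randomForest samples floor(sqrt(p)) variables at each split):
# Default mtry for a classification forest on 36 principal components
floor(sqrt(preProc$numComp))  # 6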
# Calculate PCs for training data
trainPC <- predict(preProc, trainData[, predNames])
dim(trainPC)
## [1] 15699 36
The training data set projected onto the principal components has 36 columns.
# Create the Random Forest model
rfFitPC <- randomForest(trainData$classe ~ ., data = trainPC)
rfFitPC
##
## Call:
## randomForest(formula = trainData$classe ~ ., data = trainPC)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 1.75%
## Confusion matrix:
## A B C D E class.error
## A 4453 7 1 1 2 0.002464158
## B 53 2961 19 4 1 0.025345622
## C 7 26 2691 12 2 0.017165814
## D 4 1 88 2471 9 0.039642441
## E 1 9 17 11 2848 0.013167013
# Calculate PCs for the CV data
cvPC <- predict(preProc, cvData[, predNames])
# Evaluate the PCA model against the CV data
cvConfPC <- confusionMatrix(cvData$classe, predict(rfFitPC, cvPC))
cvConfPC$table
## Reference
## Prediction A B C D E
## A 1116 0 0 0 0
## B 22 730 4 1 2
## C 0 13 668 2 1
## D 0 0 14 626 3
## E 0 3 3 1 714
cvPCAccuracy <- cvConfPC$overall["Accuracy"]
round(cvPCAccuracy, 4)
## Accuracy
## 0.9824
From the confusion matrix summary above, we obtain an accuracy of 0.9824 on the cross validation data set with principal components.
modCompare <- rbind(c(round(cvAccuracy, 4), round(1 - cvAccuracy, 4)),
                    c(round(cvPCAccuracy, 4), round(1 - cvPCAccuracy, 4)))
colnames(modCompare) <- c("Accuracy", "Out-of-Sample Err")
rownames(modCompare) <- c("Without PCA", "With PCA")
modCompare
## Accuracy Out-of-Sample Err
## Without PCA 0.9969 0.0031
## With PCA 0.9824 0.0176
The table above shows that the Random Forest model without PCA has slightly higher accuracy, so we choose it to perform prediction on the testing data set.
Now we predict the testing data set using the Random Forest model without PCA that we derived above:
testPred <- predict(rfFit, rawTest)
testPred
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
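Because predict.randomForest matches predictors by name, passing rawTest directly works even though it still contains the unused columns. As a cross-check (a minimal sketch reusing predNames from above), the same predictions can be obtained from a test set reduced to the model's 52 predictors:
# Keep only the predictor columns the model was trained on
testClean <- rawTest[, predNames]
all(testPred == predict(rfFit, testClean))  # expected TRUE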
Among the algorithms we evaluated against the cross validation data set, Random Forest without Principal Component Analysis achieved an accuracy as high as 99.69% and correctly predicted the outcomes of all 20 testing observations. This machine learning model is thus able to identify mistakes in weight lifting with high accuracy.
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.