In this report, we use machine learning to build a model that predicts how well a person performs a specific fitness exercise. For the training set, several participants were supervised while performing the exercise in one correct manner and four incorrect manners, and their movement was recorded by four accelerometers placed on their body and on the fitness equipment.
The Human Activity Recognition study was performed, and its data opened for public use, by this research group. A paper on the construction and analysis of this dataset can be found here.
require(caret)  # caret provides train, createDataPartition, and confusionMatrix
I load the training and the testing set. The test data does not contain the classe column we wish to predict and is used purely for the final validation of the model in the Coursera quiz.
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "train.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "test.csv")
training <- read.csv("train.csv")
testing <- read.csv("test.csv")
As a first inspection, I check both datasets for missing values by summing the number of missing values per column and summarizing the result. The results show that both datasets have columns with zero missing values, but also columns in which almost all values are missing. Further inspection (not included here for brevity) reveals that only the columns of class numeric are free of missing values. These columns contain the x-, y-, and z-values of the gyros, accel, and magnet variables.
For the modeling, I create new datasets containing only the numeric variables. As a check, I compare the variable names in the resulting training2 and testing2 datasets and find that they are consistent. Finally, I verify that the new datasets do not contain any missing values.
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
summary(colSums(is.na(training)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       0       0       0    8047   19220   19220
summary(colSums(is.na(testing)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0     0.0    20.0    12.5    20.0    20.0
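To make the omitted inspection concrete, here is a minimal sketch of one way to see that only the columns of class numeric are complete (the exact counts depend on the dataset version):
# Cross-tabulate column class against completeness (sketch of the
# inspection omitted above for brevity)
table(class = sapply(training, class),
      complete = colSums(is.na(training)) == 0)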
# Keep only the columns that are of class "numeric" in both datasets
cnum_training <- sapply(training, function(col) class(col) == "numeric")
cnum_testing <- sapply(testing, function(col) class(col) == "numeric")
cnum <- cnum_training & cnum_testing
training2 <- training[,cnum]
testing2 <- testing[,cnum]
any(names(training2) != names(testing2))
## [1] FALSE
any(is.na(training2))
## [1] FALSE
any(is.na(testing2))
## [1] FALSE
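As a quick reference, the dimensions of the reduced datasets confirm that a substantial number of columns has been dropped (the exact column count may depend on the dataset version):
# Dimensions after dropping the incomplete, non-numeric columns
dim(training2)
dim(testing2)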
I split the training2 dataset into a training set and a validation set, so that I can quantify the behaviour and out-of-sample error of my models.
# Re-attach the outcome variable, then make a stratified 70/30 split
training2 <- cbind(training2, classe = training$classe)
inTrain <- createDataPartition(y = training2$classe, p = 0.7, list = FALSE)
validating <- training2[-inTrain,]
training2 <- training2[inTrain,]
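Since createDataPartition samples within each class, the class proportions should be nearly identical in the two partitions; a quick sketch to verify this (output not shown):
# Check that the stratified split preserved the class proportions
round(prop.table(table(training2$classe)), 3)
round(prop.table(table(validating$classe)), 3)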
Since the task we are facing is classification, I approach the dataset with a tree-based prediction model. To improve the prediction accuracy, I additionally fit random-forest models with a varying number of trees.
set.seed(123)
# Fit a single classification tree as a baseline, then random forests
# with an increasing number of trees
model_rpart <- train(classe ~ . , data = training2, method = "rpart")
model_rf1 <- train(classe ~ . , data = training2, method = "rf", ntree = 1)
model_rf5 <- train(classe ~ . , data = training2, method = "rf", ntree = 5)
model_rf10 <- train(classe ~ . , data = training2, method = "rf", ntree = 10)
model_rf20 <- train(classe ~ . , data = training2, method = "rf", ntree = 20)
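Training time with caret's default bootstrap resampling can be substantial; a sketch of a lighter alternative, assuming 5-fold cross-validation tunes the models similarly well (ctrl and model_rf20cv are illustrative names not used elsewhere in this report):
# Optional: 5-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 5)
model_rf20cv <- train(classe ~ . , data = training2, method = "rf",
                      ntree = 20, trControl = ctrl)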
I use the validation dataset to estimate the out-of-sample error of the five models.
# Predict on the validation set and compute the misclassification rate
models <- list(rpart = model_rpart, rf1 = model_rf1, rf5 = model_rf5, rf10 = model_rf10, rf20 = model_rf20)
predictions <- lapply(models, function(x) predict(x, validating))
err <- unlist(lapply(predictions, function(x) 1 - sum(x == validating$classe) / length(x)))
err
##      rpart        rf1        rf5       rf10       rf20
## 0.42395922 0.06932880 0.02022090 0.01478335 0.01104503
barplot(err, xlab = "Model name", ylab = "Out of sample error")
The results show that a single prediction tree misclassifies more than 40% of the validation samples. However, once we start combining trees in a random forest, the accuracy quickly increases.
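For a per-class view of the best model's behaviour, caret's confusionMatrix can be applied to the validation predictions (a sketch, assuming classe was read in as a factor; the exact table depends on the random seed):
# Detailed per-class performance of the 20-tree forest (output not shown)
confusionMatrix(predictions$rf20, validating$classe)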
For this classification problem, random forests return very good results. The out-of-sample error, found by comparing the predicted classe for the validation set with the actual value, decreases monotonically with the number of trees used in the random forest. Since computation time grows with the number of trees, I stopped modelling at 20 trees. In this particular problem, 20 trees is already close to the point of diminishing returns, as can be seen by plotting the out-of-sample error against the number of trees:
plot(x = c(1, 5, 10, 20), y = err[2:5], xlab = "Number of trees in random forest", ylab = "Out of sample error")
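As a final step, the best model can be applied to the 20 quiz cases in testing2 (a sketch; the predicted letters are omitted here, since they form the quiz answers):
# Predict the 20 quiz cases with the 20-tree forest (output omitted)
predict(model_rf20, testing2)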