In this report, we use machine learning to build a model that predicts how well a person performs a specific fitness exercise. For the training set, several participants were supervised while performing the exercise in one correct manner and four incorrect manners, and their movement was recorded by four accelerometers placed on their body and on the fitness equipment.
The Human Activity Recognition study was performed, and its data opened for public use, by this research group. A paper on the construction and analysis of this dataset can be found here.
require(caret)  # caret provides train, createDataPartition, and confusionMatrix
I load the training and the testing set. The test data does not contain the classe column we wish to predict and is used purely for the final validation of the model in the Coursera quiz.
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "train.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "test.csv")
training <- read.csv("train.csv")
testing <- read.csv("test.csv")
As a first inspection, I check both datasets for missing values by summing the number of missing values per column and summarizing the result. The results show that both datasets have columns with zero missing values, but also columns in which almost all values are missing. Further inspection (not included here for brevity) reveals that only the columns of class numeric are free of missing values. These columns contain the x-, y-, and z-values of the gyros, accel, and magnet variables.
For the modeling, I create new datasets containing only the numeric variables. As a check, I compare the variable names in the resulting training2 and testing2 datasets and find that they are consistent. Finally, I verify that the new datasets do not contain any missing values.
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
summary(colSums(is.na(training)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       0       0       0    8047   19220   19220
summary(colSums(is.na(testing)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0     0.0    20.0    12.5    20.0    20.0
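To make the omitted inspection concrete, here is a minimal sketch of one way to see that only the columns of class numeric are complete (the exact counts depend on the dataset version):
# Cross-tabulate column class against completeness (sketch of the
# inspection omitted above for brevity)
table(class = sapply(training, class),
      complete = colSums(is.na(training)) == 0)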
# Keep only the columns that are of class "numeric" in both datasets
cnum_training <- sapply(training, function(col) class(col) == "numeric")
cnum_testing <- sapply(testing, function(col) class(col) == "numeric")
cnum <- cnum_training & cnum_testing
training2 <- training[,cnum]
testing2 <- testing[,cnum]
any(names(training2) != names(testing2))
## [1] FALSE
any(is.na(training2))
## [1] FALSE
any(is.na(testing2))
## [1] FALSE
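As a quick reference, the dimensions of the reduced datasets confirm that a substantial number of columns has been dropped (the exact column count may depend on the dataset version):
# Dimensions after dropping the incomplete, non-numeric columns
dim(training2)
dim(testing2)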
I split the training2 dataset into a training set and a validation set, so that I can quantify the behaviour and out-of-sample error of my models.
# Re-attach the outcome variable, then make a stratified 70/30 split
training2 <- cbind(training2, classe = training$classe)
inTrain <- createDataPartition(y = training2$classe, p = 0.7, list = FALSE)
validating <- training2[-inTrain,]
training2 <- training2[inTrain,]
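Since createDataPartition samples within each class, the class proportions should be nearly identical in the two partitions; a quick sketch to verify this (output not shown):
# Check that the stratified split preserved the class proportions
round(prop.table(table(training2$classe)), 3)
round(prop.table(table(validating$classe)), 3)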
Since the task we are facing is classification, I approach the dataset with a tree-based prediction model. To improve the prediction accuracy, I additionally fit random-forest models with a varying number of trees.
set.seed(123)
# Fit a single classification tree as a baseline, then random forests
# with an increasing number of trees
model_rpart <- train(classe ~ . , data = training2, method = "rpart")
model_rf1 <- train(classe ~ . , data = training2, method = "rf", ntree = 1)
model_rf5 <- train(classe ~ . , data = training2, method = "rf", ntree = 5)
model_rf10 <- train(classe ~ . , data = training2, method = "rf", ntree = 10)
model_rf20 <- train(classe ~ . , data = training2, method = "rf", ntree = 20)
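Training time with caret's default bootstrap resampling can be substantial; a sketch of a lighter alternative, assuming 5-fold cross-validation tunes the models similarly well (ctrl and model_rf20cv are illustrative names not used elsewhere in this report):
# Optional: 5-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 5)
model_rf20cv <- train(classe ~ . , data = training2, method = "rf",
                      ntree = 20, trControl = ctrl)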
I use the validation dataset to estimate the out-of-sample error of the five models.
# Predict on the validation set and compute the misclassification rate
models <- list(rpart = model_rpart, rf1 = model_rf1, rf5 = model_rf5, rf10 = model_rf10, rf20 = model_rf20)
predictions <- lapply(models, function(x) predict(x, validating))
err <- unlist(lapply(predictions, function(x) 1 - sum(x == validating$classe) / length(x)))
err
##      rpart        rf1        rf5       rf10       rf20
## 0.42395922 0.06932880 0.02022090 0.01478335 0.01104503
barplot(err, xlab = "Model name", ylab = "Out of sample error")
The results show that a single prediction tree misclassifies more than 40% of the validation samples. However, once we start combining trees in a random forest, the accuracy quickly increases.
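For a per-class view of the best model's behaviour, caret's confusionMatrix can be applied to the validation predictions (a sketch, assuming classe was read in as a factor; the exact table depends on the random seed):
# Detailed per-class performance of the 20-tree forest (output not shown)
confusionMatrix(predictions$rf20, validating$classe)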
For this classification problem, random forests return very good results. The out-of-sample error, found by comparing the predicted classe for the validation set with the actual value, decreases monotonically with the number of trees used in the random forest. Since computation time grows with the number of trees, I stopped modelling at 20 trees. In this particular problem, 20 trees is already close to the point of diminishing returns, as can be seen by plotting the out-of-sample error against the number of trees:
plot(x = c(1, 5, 10, 20), y = err[2:5], xlab = "Number of trees in random forest", ylab = "Out of sample error")
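As a final step, the best model can be applied to the 20 quiz cases in testing2 (a sketch; the predicted letters are omitted here, since they form the quiz answers):
# Predict the 20 quiz cases with the 20-tree forest (output omitted)
predict(model_rf20, testing2)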