In this project an algorithm is built to predict whether a barbell lift is performed correctly. Participants perform the lift in one correct way and in 5 different incorrect ways while being monitored by accelerometers on the belt, forearm, arm and dumbbell. The data is used to train a random forest algorithm on PCA input components. The model based on 29 input components and 128 trees was shown to be the most accurate, classifying about 97.7% of the test cases correctly.
A random forest algorithm is used to predict the way a barbell lift is performed (correctly or in any of the 5 incorrect ways). As random forests are prone to overfitting, a held-out test set is used to evaluate the performance of the random forest. To further reduce the effect of overfitting, a PCA is applied to the input variables. The random forest algorithm is tested with 20, 23, 26 and 29 components (covering 91% - 98% of the total variance) and with 64 and 128 trees (as suggested in https://www.researchgate.net/publication/230766603_How_Many_Trees_in_a_Random_Forest).
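As a minimal sketch (not part of the original code) of the grid of settings that is evaluated in the loop further down, the combinations can be written out explicitly; the variable name settings is illustrative only.
# Candidate settings: 64 or 128 trees combined with 20, 23, 26 or 29 PCA components
settings <- expand.grid(ntrees = c(64, 128), ncomps = seq(20, 30, by = 3))
print(settings)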
Two datasets were loaded: pml-training.csv and pml-testing.csv.
# The caret package provides createDataPartition, preProcess and train used below
library(caret)
df <- read.csv('pml-training.csv')
df_evaluate <- read.csv('pml-testing.csv')
It was seen in pml-testing (renamed df_evaluate) that many columns contain only NA values. These columns were filtered out, along with the context columns (columns 1-7). The training data was then split 70/30 into a training set and a test set.
# Find the columns in the evaluation set that contain only NA values (all 20 rows)
col_names <- colnames(df_evaluate)
ind <- 1
indices <- c()
for (name in col_names){
    if (length(which(is.na(df_evaluate[name]))) == 20){
        indices <- c(indices, ind)
    }
    ind <- ind + 1
}
# Split the training data 70/30 and drop the all-NA columns from all three sets
inTrain <- createDataPartition(y=df$classe, p=0.7, list=FALSE)
df_train <- df[inTrain, -indices]
df_test <- df[-inTrain, -indices]
df_evaluate <- df_evaluate[, -indices]
# Drop the first 7 context columns from all three sets
df_train <- df_train[, 8:dim(df_train)[2]]
df_test <- df_test[, 8:dim(df_test)[2]]
df_evaluate <- df_evaluate[, 8:dim(df_evaluate)[2]]
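As a quick sanity check (not in the original code), the dimensions of the three sets after filtering and splitting can be inspected:
# Rows and columns remaining after dropping the all-NA and context columns
dim(df_train); dim(df_test); dim(df_evaluate)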
On the training data a PCA was done to determine the cumulative proportion of variance the PCA components contain. It can be seen that with 23 or more components, over 95% of the variance in the data is explained.
# PCA (centered and scaled) to inspect how much variance each component explains
df_train.pca <- prcomp(df_train[, 9:dim(df_train)[2]-1], center = TRUE, scale = TRUE)
# Proportion of variance per principal component
summary(df_train.pca)$importance[2, ]
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
## 0.15912 0.11668 0.09813 0.09240 0.07671 0.06103 0.04632 0.04364 0.03697
## PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18
## 0.03271 0.02877 0.02352 0.02103 0.01795 0.01699 0.01509 0.01219 0.01143
## PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27
## 0.01062 0.00912 0.00857 0.00766 0.00686 0.00638 0.00574 0.00528 0.00424
## PC28 PC29 PC30 PC31 PC32 PC33 PC34 PC35 PC36
## 0.00413 0.00315 0.00289 0.00263 0.00189 0.00159 0.00134 0.00125 0.00110
## PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44 PC45
## 0.00083 0.00076 0.00068 0.00066 0.00060 0.00052 0.00045 0.00022 0.00015
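The cumulative proportions for the candidate component counts can be computed directly from the same summary; this small check is not part of the original output:
# Cumulative proportion of variance covered by 20, 23, 26 and 29 components
cum_var <- cumsum(summary(df_train.pca)$importance[2, ])
cum_var[c(20, 23, 26, 29)]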
It is known that random forest models are prone to overfitting. To control overfitting, different numbers of PCA components and trees are tried in an iterative loop.
n_best <- 0
ntrees_best <- 0
ncomps_best <- 0
fit_best <- NULL
res <- NULL
for (ntrees in c(64, 128)){
    for (ncomps in seq(20, 30, by=3)) {
        print(paste(
            'working on rf model with ',
            ntrees,
            ' trees based on ',
            ncomps,
            ' input components'
        ))
        # Fit the PCA pre-processing on the training predictors (outcome column dropped)
        pre_process <- preProcess(
            df_train[, -dim(df_train)[2]],
            method='pca',
            pcaComp=ncomps)
        train_pc <- predict(pre_process, df_train[, -dim(df_train)[2]])
        # Train a random forest on the PCA components
        model_fit <- train(
            y=df_train[, dim(df_train)[2]],
            x=train_pc,
            method='rf',
            ntree=ntrees
        )
        # Project the test set onto the same components and count correct predictions
        test_pc <- predict(pre_process, df_test[, -dim(df_test)[2]])
        test_outcome <- predict(model_fit, newdata=test_pc)
        n_corr <- length(which(test_outcome == df_test[, dim(df_test)[2]]))
        res <- rbind(res, c(trees=ntrees, comps=ncomps, corr=n_corr))
        # Keep the best performing combination
        if (n_corr > n_best){
            n_best <- n_corr
            ntrees_best <- ntrees
            ncomps_best <- ncomps
            fit_best <- model_fit
        }
    }
}
## [1] "working on rf model with 64 trees based on 20 input components"
## [1] "working on rf model with 64 trees based on 23 input components"
## [1] "working on rf model with 64 trees based on 26 input components"
## [1] "working on rf model with 64 trees based on 29 input components"
## [1] "working on rf model with 128 trees based on 20 input components"
## [1] "working on rf model with 128 trees based on 23 input components"
## [1] "working on rf model with 128 trees based on 26 input components"
## [1] "working on rf model with 128 trees based on 29 input components"
print(paste('The best fit is ', n_best / dim(df_test)[1] * 100, '% correct with ', ntrees_best, ' trees and ', ncomps_best, ' input components'))
## [1] "The best fit is 97.6720475785896 % correct with 128 trees and 29 input components"
Of the models tested, the random forest with 128 trees and 29 input components performed best, with an estimated out-of-sample error on the held-out test set of only about 2.3%. The out-of-sample error is usually close to, but somewhat higher than, the in-sample error, which is shown below to be 0%. The out-of-sample error on new data can therefore be expected to be around 2-3% (unless the algorithm is overfitting excessively, which is unlikely here given the modest number of trees and the use of only a subset of the PCA components).
# In-sample error of the best model on the training-set PCA components
# (train_pc still holds the projection from the last, best, loop iteration)
fit_outcome <- predict(fit_best, newdata=train_pc)
expected_outcome <- df_train[, dim(df_train)[2]]
in_sample_error <- 100 - length(which(expected_outcome == fit_outcome)) / length(fit_outcome) * 100
print(paste('The in sample error is ', in_sample_error, '%'))
## [1] "The in sample error is 0 %"
The quiz error (1 wrong out of 20 predictions) was also low, as expected.
The resulting algorithm is then used on the evaluation data to predict the quiz answers.
# Re-fit the PCA pre-processing with the best number of components and
# project the evaluation data onto those components
pre_process <- preProcess(
    df_train[, -dim(df_train)[2]],
    method='pca',
    pcaComp=ncomps_best)
evaluate_pc <- predict(pre_process, df_evaluate[, -dim(df_evaluate)[2]])
evaluate_outcome <- predict(fit_best, newdata=evaluate_pc)
print(evaluate_outcome)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
In the exploratory data analysis it was seen that all predictors have 'normal' data ranges, so there seems to be no need for a log transform.
# Plot each predictor against the class to inspect its range and distribution
for (i in seq(1, dim(df_train)[2]-1, by=1)){
    plot(
        df_train[, dim(df_train)[2]],
        df_train[, i],
        xlab='Class',
        ylab=colnames(df_train)[i],
        main=paste('Exploration of ', colnames(df_train)[i])
    )
}
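As a quick numeric complement to the plots, the per-predictor ranges could also be summarised directly; this extra check is not in the original analysis:
# Minimum and maximum of every predictor column, confirming the ranges are
# reasonable and no log transform is needed
apply(df_train[, -dim(df_train)[2]], 2, range, na.rm = TRUE)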