This document presents an algorithm for classifying the ways of performing the dumbbell biceps curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E) [1].
Both linear discriminant analysis and logistic regression were considered; the best model was the logistic regression, with an accuracy of 73% on the test set. The approximate accuracies for predicting classes A, B, C, D and E are 87%, 64%, 67%, 72% and 67%, respectively.
The correct sequence for the twenty test cases provided by the Practical Machine Learning course is: “B” “A” “B” “A” “A” “E” “D” “B” “A” “A” “B” “C” “B” “A” “E” “E” “A” “B” “B” “B”; this result was obtained with the logistic regression model in two attempts.
Before building the model, it is necessary to preprocess the dataset obtained from http://groupware.les.inf.puc-rio.br/har. The preprocessing consists of the following steps:
library(stringr)
train = read.csv(file = "pml-training.csv", header = TRUE)
names_train <- names(train)
# Vector that will contain the indexes of the columns with arm, belt,
# dumbbell and forearm measurements
index <- vector()
# Counter used to store the indexes
j <- 0
for (i in 1:length(names_train)) {
    name <- str_split(names_train[i], "_")
    if ((any(name[[1]] == "forearm") | any(name[[1]] == "arm") |
         any(name[[1]] == "belt") | any(name[[1]] == "dumbbell")) &
        (class(train[, i]) == "numeric" | class(train[, i]) == "integer")) {
        j <- j + 1
        index[j] <- i
    }
}
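An equivalent, more compact way to build the same index is sketched below; it is not part of the original analysis and assumes the same train and names_train objects:
# Sketch of an equivalent selection (an added alternative, not the original
# code): keep numeric/integer columns whose split name contains a sensor keyword
is_sensor <- sapply(str_split(names_train, "_"),
    function(parts) any(parts %in% c("arm", "belt", "dumbbell", "forearm")))
is_numeric <- sapply(train, is.numeric)  # TRUE for numeric and integer columns
index_alt <- which(is_sensor & is_numeric)
# all(index_alt == index) should be TRUE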
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1)
inTrain <- createDataPartition(train$classe, p = 0.5, list = FALSE)
test <- train[-inTrain, ]
train <- train[inTrain, ]
train2 <- train[, c(index)]
# Eliminate the predictors with NA from the train set
train2 <- train2[, !apply(apply(train2, 2, is.na), 2, any)]
test2 <- test[, c(index)]
# Eliminate the predictors with NA from the test set
test2 <- test2[, !apply(apply(test2, 2, is.na), 2, any)]
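As an added safeguard (not in the original analysis), one can verify that removing the NA columns left the same predictors in both sets, since the columns are dropped independently:
# Sanity check (an added suggestion): both sets should keep identical columns
stopifnot(identical(names(train2), names(test2)))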
-Create a data frame with one column per class for training the models; each column contains 0s and 1s for the corresponding response: if the observation belongs to the class the value is 1, otherwise it is 0.
train_class <- list()
# Class labels (A-E), taken from the training response
class <- unique(train$classe)
for (q in 1:length(class)) {
    temp <- as.character(train$classe)
    temp[temp != class[q]] <- 0
    temp[temp == class[q]] <- 1
    train_class[[class[q]]] <- as.numeric(temp)
}
train_class <- as.data.frame(train_class)
names(train_class) <- class
head(train_class, 5)
## A B C D E
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 1 0 0 0 0
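The same indicator data frame can also be built without an explicit loop; a minimal vectorized sketch, assuming the same train and class objects:
# Vectorized equivalent (an added alternative): one indicator column per class
train_class_alt <- as.data.frame(sapply(as.character(class),
    function(cl) as.numeric(as.character(train$classe) == cl)))
# head(train_class_alt, 5) should match head(train_class, 5)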
The models use 52 predictors, covering all the information provided by the arm, belt, dumbbell and forearm sensors; to obtain them it was necessary to remove the predictors that do not contain values for every record, such as the amplitude variables and others.
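A quick way to verify that count (an added check, not part of the original run):
# The text above states 52 predictors remain after the NA columns are dropped
ncol(train2)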
The first model considered is a linear discriminant analysis, because its assumption of normality for each predictor makes the estimation of the parameters easier.
To measure the accuracy, cross-validation with 10 folds is used (the default number in the trainControl function).
ctrl <- trainControl(method = "cv")
model_lda <- train(train$classe ~ ., method = "lda", trControl = ctrl, data = train2)
## Loading required package: MASS
model_lda
## Linear Discriminant Analysis
##
## 9812 samples
## 51 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 8832, 8831, 8832, 8829, 8831, 8831, ...
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.7 0.6 0.01 0.02
##
##
The accuracy of the model is calculated in the following way:
pred_lda <- predict(model_lda, test2)
table_lda <- table(pred_lda, test$classe)
acc_lda <- sum(diag(table_lda))/(sum(apply(table_lda, 2, sum)))
acc_classes <- diag(table_lda)/(apply(table_lda, 2, sum))
# Table of results
table_lda
##
## pred_lda A B C D E
## A 2278 293 161 105 74
## B 67 1191 170 68 289
## C 209 240 1103 191 168
## D 226 73 234 1168 191
## E 10 101 43 76 1081
# Accuracy of the model
acc_lda
## [1] 0.6953
# Accuracy for A, B, C, D and E respectively
acc_classes
## A B C D E
## 0.8165 0.6275 0.6447 0.7264 0.5996
For the logistic regression it is necessary to fit one model per response, that is, one model for each of the classes A, B, C, D and E. To do this, the columns of the data frame train_class are used inside a loop, and the probability P(Y=1|X) is calculated for each class with its respective model.
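For clarity, in each of these binary models the fitted probability has the standard logistic form P(Y=1|X) = 1/(1 + exp(-(b0 + b1*x1 + ... + bp*xp))), with one coefficient vector estimated per class, so the five models together act as a one-vs-rest classifier.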
probs <- list()
for (nn in 1:5) {
    model_glm <- glm(train_class[, nn] ~ ., family = binomial, data = train2)
    probs[[nn]] <- predict(model_glm, test2, type = "response")
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
probs <- as.data.frame(probs)
names(probs) <- class
To test the model on the test set, the predicted response is the class with the highest probability.
pred <- vector()
for (w in 1:nrow(test2)) {
    # Select the class with the highest probability in each row:
    pred[w] <- names(probs)[which.max(probs[w, ])]
}
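The same row-wise argmax can be written without a loop; a minimal sketch using base R's max.col (an added alternative, not the original code):
# Vectorized equivalent: ties.method = "first" matches the behavior of which.max
pred_alt <- names(probs)[max.col(as.matrix(probs), ties.method = "first")]
# all(pred_alt == pred) should be TRUE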
The accuracy is obtained by creating a table between the predicted and the true test classes:
table_glm <- table(pred, test$classe)
acc_glm <- sum(diag(table_glm))/(sum(apply(table_glm, 2, sum)))
acc_classes_glm <- diag(table_glm)/(apply(table_glm, 2, sum))
# Table of results
table_glm
##
## pred A B C D E
## A 2428 260 190 123 93
## B 61 1210 166 55 228
## C 135 188 1145 174 99
## D 142 65 134 1160 173
## E 24 175 76 96 1210
# Accuracy of the model
acc_glm
## [1] 0.7292
# Accuracy for A, B, C, D and E respectively
acc_classes_glm
## A B C D E
## 0.8703 0.6375 0.6692 0.7214 0.6711
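The caret package loaded earlier also offers confusionMatrix, which reports the same overall accuracy together with per-class sensitivity (the per-class accuracies above) and other statistics; a minimal sketch:
# Alternative summary of the same table using caret (an added suggestion)
confusionMatrix(table_glm)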
The linear discriminant analysis has an approximate accuracy of 70% in both the cross-validation and the test set; its prediction accuracies for classes A, B, C, D and E are 82%, 63%, 64%, 73% and 60%, respectively.
The logistic regression model has an approximate accuracy of 73%; its prediction accuracies for classes A, B, C, D and E are 87%, 64%, 67%, 72% and 67%, respectively.
According to the results presented above, the model selected for predicting the twenty test cases is the logistic regression, because its accuracy is greater than that of the linear discriminant analysis.
For the prediction of the twenty cases supplied by the course, the following steps are necessary:
test_final = read.csv(file = "pml-testing.csv", header = TRUE)
probs_final <- list()
for (nn in 1:5) {
    model_glm <- glm(train_class[, nn] ~ ., family = binomial, data = train2)
    probs_final[[nn]] <- predict(model_glm, test_final, type = "response")
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
probs_final <- as.data.frame(probs_final)
names(probs_final) <- class
-Finally, the class with the highest probability in each row of the data frame probs_final is selected.
pred_final <- vector()
for (w in 1:nrow(test_final)) {
    pred_final[w] <- names(probs_final)[which.max(probs_final[w, ])]
}
The predictions for the twenty cases supplied by the Practical Machine Learning course are:
pred_final
## [1] "B" "A" "B" "A" "A" "E" "D" "E" "A" "A" "D" "A" "B" "A" "E" "A" "A"
## [18] "B" "B" "B"
Submitting these results to the course platform, sixteen of the twenty predicted classes were correct and four were incorrect. To resolve this, a different approach was considered for the wrong predictions: taking the class with the second highest probability, as follows:
wrong_pred <- c(8, 11, 12, 16)
-Next, the corresponding row of probs_final is sorted in decreasing order for each wrong prediction, and the name of the column with the second highest probability is selected.
for (w in wrong_pred) {
    # Order the class probabilities of row w in decreasing order:
    sorted_idx <- order(unlist(probs_final[w, ]), decreasing = TRUE)
    # Predict the class with the second highest probability:
    pred_final[w] <- names(probs_final)[sorted_idx[2]]
}
The new predicted values for these cases are:
pred_final[wrong_pred]
## [1] "B" "B" "C" "E"
Submitting these results to the course platform, we found that the new values are correct; finally, the predicted values for the twenty test cases are:
pred_final
## [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"
[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.