PREDICTION ALGORITHM FOR BARBELL LIFTS THROUGH CLASSIFICATION MODELS:LOGISTIC REGRESSION AND LINEAR DISCRIMINANT ANALYSIS

ABSTRACT

These document present the algorithm for classify the ways of doing dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E) [1].

In the process was considered a linear discriminant analysis and logistic regression, the best model was the logistic regression with an accuracy of 73% in the test set. The approximate accuracy for predict the classes A,B,C,D and E are 87%,64%,67%,72% and 67%, respectively.

The correct sequence provided by the course practice machine learning, for the twenty test is: “B” “A” “B” “A” “A” “E” “D” “B” “A” “A” “B” “C” “B” “A” “E” “E” “A” “B” “B” “B”, the result is obtained by a logistic regression model in two attempts.

PREPROCESSING

Before of build the model, is necessary preprocess the dataset obtained from http://groupware.les.inf.puc-rio.br/har, the preprocess consist in the following steps:

Open the file and extract the information with the measures registered from the arms, belt,dumbbell and forearms.

library(stringr)
train = read.csv(file = "pml-training.csv", header = TRUE)
names_train <- names(train)
# Vector that contain the index for the information of arms,belt,dumbbell
# and forearms
index <- vector()

# index for store the indexes
j <- 0
for (i in 1:length(names_train)) {
    name <- str_split(names_train[i], "_")
    if ((any(name[[1]] == "forearm") | any(name[[1]] == "arm") | any(name[[1]] == 
        "belt") | any(name[[1]] == "dumbbell")) & (class(train[, i]) == "numeric" | 
        class(train[, i]) == "integer")) {
        j <- j + 1
        index[j] <- i
    }
}

Divided the training dataset in the train and test sets, for these is considered a 50% of the data.

library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

set.seed(1)
inTrain <- createDataPartition(train$classe, p = 0.5, list = FALSE)
test <- train[-inTrain, ]
train <- train[inTrain, ]

Eliminate the predictors that not contain information for all the registers of the test and train set.

train2 <- train[, c(index)]
# Eliminate the predictors with NA from the train set
train2 <- train2[, !apply(apply(train2, 2, is.na), 2, any)]

test2 <- test[, c(index)]
# Eliminate the predictors with NA from the test set
test2 <- test2[, !apply(apply(test2, 2, is.na), 2, any)]

-Create a dataframe with all the classes for train the model, the columns of the dataframe contain 0 and 1 for each response, for example if the value correspond with the class the value is 1,otherwise the value is 0.

train_class <- list()
class <- unique(test$classe)

for (q in 1:length(class)) {

    temp <- as.character(train$classe)
    temp[temp != class[q]] <- 0
    temp[temp == class[q]] <- 1

    train_class[[class[q]]] <- as.numeric(temp)
}
train_class <- as.data.frame(train_class)
names(train_class) <- class
head(train_class, 5)

##   A B C D E
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 1 0 0 0 0

The amount of predictors for the models are 52, in these was considered all the information provided from the arms, belt, dumbbell and forearms, for this was necessary removed the predictors that not contain information for all the registers like amplitude and others.

DESCRIPTION OF THE MODELS

Linear Discriminant Analysis

The first model considered is a linear discriminant analysis because the assumption of normality for each predictor is an approach that made easier the estimation of the parameters.

For measure the accuracy is used cross validation with 10 k-folds (this is the number for default in the function trainControl).

ctrl <- trainControl(method = "cv")
model_lda <- train(train$classe ~ ., method = "lda", trControl = ctrl, data = train2)

## Loading required package: MASS

model_lda

## Linear Discriminant Analysis 
## 
## 9812 samples
##   51 predictors
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 8832, 8831, 8832, 8829, 8831, 8831, ... 
## 
## Resampling results
## 
##   Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.7       0.6    0.01         0.02    
## 
##

The accuracy of the model is calculated in the following way:

pred_lda <- predict(model_lda, test2)
table_lda <- table(pred_lda, test$classe)
acc_lda <- sum(diag(table_lda))/(sum(apply(table_lda, 2, sum)))
acc_classes <- diag(table_lda)/(apply(table_lda, 2, sum))

# Table of results
table_lda

##         
## pred_lda    A    B    C    D    E
##        A 2278  293  161  105   74
##        B   67 1191  170   68  289
##        C  209  240 1103  191  168
##        D  226   73  234 1168  191
##        E   10  101   43   76 1081


# Accuracy of the model
acc_lda

## [1] 0.6953


# Accuracy for A,B,C,D and E respectly
acc_classes

##      A      B      C      D      E 
## 0.8165 0.6275 0.6447 0.7264 0.5996

Logistic Regression

For the logistics regression is necessary fit a model for each of the response, this mean a model for the classes A,B,C,D and E, for doing that, was used the dataframe train_class into a loop and was calculated the probability P(Y=1|X) for each class within his respective model.

probs <- list()
for (nn in 1:5) {
    model_glm <- glm(train_class[, nn] ~ ., family = binomial, data = train2)
    probs[[nn]] <- predict(model_glm, test2, type = "response")
}

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred


probs <- as.data.frame(probs)
names(probs) <- class

For testing the model in the test set, the predicted response is the class with the hightest probability.

pred <- vector()
for (w in 1:nrow(test2)) {
    # Select the class for the hightest probability in each row:
    pred[w] <- names(probs)[which.max(probs[w, ])]
}

The accuracy is obtained creating a table between the predicted and test classes:

table_glm <- table(pred, test$classe)
acc_glm <- sum(diag(table_glm))/(sum(apply(table_glm, 2, sum)))
acc_classes_glm <- diag(table_glm)/(apply(table_glm, 2, sum))

# Table of results
table_glm

##     
## pred    A    B    C    D    E
##    A 2428  260  190  123   93
##    B   61 1210  166   55  228
##    C  135  188 1145  174   99
##    D  142   65  134 1160  173
##    E   24  175   76   96 1210


# Accuracy of the model
acc_glm

## [1] 0.7292


# Accuracy for A,B,C,D and E respectly
acc_classes_glm

##      A      B      C      D      E 
## 0.8703 0.6375 0.6692 0.7214 0.6711

RESULTS

The linear discriminant analysis have an approximate accuracy of 70% in the cross validation and the test set, the prediction accuracy for the classes A,B,C,D and E are 82%,63%,64%,73% and 60%, respectively.

The logistic regression model have an approximate accuracy of 73%, the prediction accuracy for the classes A,B,C,D and E are 87%,64%,67%,72% and 67%, respectively.

According with the results presented above, the model selected for predict the twenty test set is the logistics regression because his accuracy is greater than the linear discriminant analysis.

PREDICTION

For the prediction of the twenty cases supplied by the course is necessary doing the following steps:

Load the test set with the twenty cases:

test_final = read.csv(file = "pml-testing.csv", header = TRUE)

Estimate the probabilities P(Y=1|X) for each class of the test set

probs_final <- list()
for (nn in 1:5) {
    model_glm <- glm(train_class[, nn] ~ ., family = binomial, data = train2)
    probs_final[[nn]] <- predict(model_glm, test_final, type = "response")
}

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

probs_final <- as.data.frame(probs_final)
names(probs_final) <- class

-Finally are selected the classes with the hightest probability in each row of the dataframe probs_final.

pred_final <- vector()
for (w in 1:nrow(test_final)) {
    pred_final[w] <- names(probs_final)[which.max(probs_final[w, ])]
}

The prediction for the twenty cases supplied by the course of practical machine learning are:

pred_final

##  [1] "B" "A" "B" "A" "A" "E" "D" "E" "A" "A" "D" "A" "B" "A" "E" "A" "A"
## [18] "B" "B" "B"

Testing the result

Presenting the results in the platform of the course practical machine learning, of the twenty predicted classes sixteen was correct and four was incorrect. For give a solution to that, was considered a different approach for the wrong predictions, that is, consider the predicted class with the second highest probability, as follow:

First is consider the probabilities of the problems with the id 8,11,12 and 16, cause this are the wrong predictions.

wrong_pred <- c(8, 11, 12, 16)

-Second the rows of probs_second are sort in decreasing order and was selected the name of the column corresponding to the second index in the list.

for (w in wrong_pred) {
    sort_probs <- sort(probs_final[w, ], decreasing = TRUE, index.return = TRUE)
    # predict the class with the second hightest probability:
    pred_final[w] <- names(sort_probs)[2]
}

The new predict values for this cases are:

pred_final[wrong_pred]

## [1] "B" "B" "C" "E"

Presenting these results in the platform of the course we found that the new values are correct, finally the predicted values for the twenty test are:

pred_final

##  [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"

REFERENCES

[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13) . Stuttgart, Germany: ACM SIGCHI, 2013.