Summary

In this project, our goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants (that were asked to perform barbell lifts correctly and incorrectly in 5 different ways) for predict the manner in which they did the exercise.

Introduction

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants.

Loading the Database and Libraries

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(mlbench)

library(readr)
pmltrain <- read_csv("pml-training.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   user_name = col_character(),
##   cvtd_timestamp = col_character(),
##   new_window = col_character(),
##   kurtosis_roll_belt = col_character(),
##   kurtosis_picth_belt = col_character(),
##   kurtosis_yaw_belt = col_character(),
##   skewness_roll_belt = col_character(),
##   skewness_roll_belt.1 = col_character(),
##   skewness_yaw_belt = col_character(),
##   max_yaw_belt = col_character(),
##   min_yaw_belt = col_character(),
##   amplitude_yaw_belt = col_character(),
##   kurtosis_picth_arm = col_character(),
##   kurtosis_yaw_arm = col_character(),
##   skewness_pitch_arm = col_character(),
##   skewness_yaw_arm = col_character(),
##   kurtosis_yaw_dumbbell = col_character(),
##   skewness_yaw_dumbbell = col_character(),
##   kurtosis_roll_forearm = col_character(),
##   kurtosis_picth_forearm = col_character()
##   # ... with 8 more columns
## )
## See spec(...) for full column specifications.
## Warning: 182 parsing failures.
##  row               col expected  actual               file
## 2231 kurtosis_roll_arm a double #DIV/0! 'pml-training.csv'
## 2231 skewness_roll_arm a double #DIV/0! 'pml-training.csv'
## 2255 kurtosis_roll_arm a double #DIV/0! 'pml-training.csv'
## 2255 skewness_roll_arm a double #DIV/0! 'pml-training.csv'
## 2282 kurtosis_roll_arm a double #DIV/0! 'pml-training.csv'
## .... ................. ........ ....... ..................
## See problems(...) for more details.
pmltest <- read_csv("pml-testing.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_logical(),
##   X1 = col_double(),
##   user_name = col_character(),
##   raw_timestamp_part_1 = col_double(),
##   raw_timestamp_part_2 = col_double(),
##   cvtd_timestamp = col_character(),
##   new_window = col_character(),
##   num_window = col_double(),
##   roll_belt = col_double(),
##   pitch_belt = col_double(),
##   yaw_belt = col_double(),
##   total_accel_belt = col_double(),
##   gyros_belt_x = col_double(),
##   gyros_belt_y = col_double(),
##   gyros_belt_z = col_double(),
##   accel_belt_x = col_double(),
##   accel_belt_y = col_double(),
##   accel_belt_z = col_double(),
##   magnet_belt_x = col_double(),
##   magnet_belt_y = col_double(),
##   magnet_belt_z = col_double()
##   # ... with 40 more columns
## )
## See spec(...) for full column specifications.

Data Exporation

str(pmltrain)

In the first instance, we notice that a set of variables contains few data, with NA predominating, so we eliminate them from the database. The same with the identification, data and time. The Variable classe is the outcome: the manner in which the exercise was done.

The data will be divided into three populations (training, testing and validation) in order to make an assembly.

tr <- pmltrain[, c(7:11, 37:49, 60:68, 84:86, 102, 113:124, 140, 151:160)]
inTrain = createDataPartition(tr$classe, p = 0.7)[[1]]
training = tr[ inTrain,]
testing = tr[-inTrain,]

validation <- pmltest[, c(7:11, 37:49, 60:68, 84:86, 102, 113:124, 140, 151:160)]

str(training)
summary(training)
any(is.na(training)) #there is no missing data (NA)
featurePlot(x=training[, c(2, 39, 41, 42)], y = training$classe, plot="pairs")
## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion

## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion

## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion

## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion

## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion

## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion

## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion

## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion

Feature Selection

A mechanism to perform the selection of characteristics is to take advantage of the property of some algorithms to qualify the importance of some variables in the training process, such as decision trees or regularized random forests. I select the first six more important variables.

set.seed(523)
rPartMod <- train(classe ~ ., data=training, method="rpart")
rpartImp <- varImp(rPartMod)
print(rpartImp)
## rpart variable importance
## 
##   only 20 most important variables shown (out of 53)
## 
##                   Overall
## pitch_forearm      100.00
## roll_forearm        71.80
## roll_belt           70.22
## magnet_dumbbell_y   49.28
## accel_belt_z        42.53
## magnet_belt_y       41.10
## num_window          39.55
## yaw_belt            38.94
## total_accel_belt    35.65
## magnet_arm_x        26.81
## accel_arm_x         26.05
## roll_dumbbell       19.10
## magnet_dumbbell_z   18.61
## roll_arm            15.27
## gyros_dumbbell_y     0.00
## gyros_forearm_y      0.00
## gyros_forearm_z      0.00
## gyros_arm_z          0.00
## accel_dumbbell_y     0.00
## accel_forearm_y      0.00
plot(rpartImp, top = 15, main='Variable Importance')

finalTrain <- training [, c(2, 39, 41, 42, 54)]
finalTest <- testing [, c(2, 39, 41, 42, 54)]
finalValid<- validation [, c(2, 39, 41, 42, 54)]

Training the Machine Learning Models

Within this report I will use five different models in caret: a) CARD model: “rpart”, b) neural network: “nnet”, c) boosted trees: “gbm”, d) linear discriminant analysis: “lda”, e) support vector machines: “svmRadial”. As well as its respective assembly by rf.

set.seed(3452)
model1<- train(classe~ ., data= finalTrain, method="rpart")
model2<- train(classe~ ., data= finalTrain, method="nnet", verbose=FALSE)
model3<- train(classe~ ., data= finalTrain, method="gbm", verbose=FALSE)
model4<- train(classe~ ., data= finalTrain, method="lda")
model5<- train(classe~ ., data= finalTrain, method="svmRadial")
t11 <- predict(model1,newdata= finalTrain)
t21 <- predict(model2,newdata= finalTrain)
t31 <- predict(model3,newdata= finalTrain)
t41 <- predict(model4,newdata= finalTrain)
t51 <- predict(model5,newdata= finalTrain)

t12 <- predict(model1,newdata= finalTest)
t22 <- predict(model2,newdata= finalTest)
t32 <- predict(model3,newdata= finalTest)
t42 <- predict(model4,newdata= finalTest)
t52 <- predict(model5,newdata= finalTest)

predE1 <-data.frame(t12, t22, t32, t42, t52, classe = finalTest$classe) 
ensambl <- train(classe ~.,method="rf",data=predE1)
combPred1 <- predict(ensambl,predE1)

Results

The table presents the result of the models trained and tested, and its assemble.

models <- c("rpart", "nnet", "gbm", "lda", "svmRadial", "asemble")

train_Acc<- c(confusionMatrix(as.factor(finalTrain$classe),t11)$overall['Accuracy'], 
              confusionMatrix(as.factor(finalTrain$classe),t21)$overall['Accuracy'],
              confusionMatrix(as.factor(finalTrain$classe),t31)$overall['Accuracy'], 
              confusionMatrix(as.factor(finalTrain$classe),t41)$overall['Accuracy'],
              confusionMatrix(as.factor(finalTrain$classe),t51)$overall['Accuracy'], NA)
    
testAcc <- c(confusionMatrix(as.factor(finalTest$classe),t12)$overall['Accuracy'], 
             confusionMatrix(as.factor(finalTest$classe),t22)$overall['Accuracy'],
             confusionMatrix(as.factor(finalTest$classe),t32)$overall['Accuracy'], 
             confusionMatrix(as.factor(finalTest$classe),t42)$overall['Accuracy'],
             confusionMatrix(as.factor(finalTest$classe),t52)$overall['Accuracy'], 
             confusionMatrix(as.factor(finalTest$classe),combPred1)$overall['Accuracy'])

result<- cbind(models, train_Acc, testAcc)
result
##          models      train_Acc           testAcc            
## Accuracy "rpart"     "0.495887020455704" "0.494137638062872"
## Accuracy "nnet"      "0.382470699570503" "0.383177570093458"
## Accuracy "gbm"       "0.855718133508044" "0.840101954120646"
## Accuracy "lda"       "0.355390551066463" "0.348853016142736"
## Accuracy "svmRadial" "0.739608356992065" "0.722175021240442"
##          "asemble"   NA                  "0.855225148683093"

Comparing the Models

results <- resamples(list(CARD =model1, ANN =model2, BOOST=model3, LDA =model4, SVM = model5, ASEMBL = ensambl))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: CARD, ANN, BOOST, LDA, SVM, ASEMBL 
## Number of resamples: 25 
## 
## Accuracy 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CARD   0.4909919 0.5145631 0.5191847 0.5203092 0.5277613 0.5511532    0
## ANN    0.3065622 0.3441416 0.3614387 0.3647675 0.3850734 0.4475622    0
## BOOST  0.8180392 0.8272908 0.8298294 0.8303603 0.8336673 0.8414512    0
## LDA    0.3366733 0.3507194 0.3575119 0.3568576 0.3634558 0.3713171    0
## SVM    0.7183682 0.7259071 0.7306244 0.7305166 0.7343844 0.7403865    0
## ASEMBL 0.8298969 0.8373467 0.8394946 0.8404698 0.8423989 0.8555098    0
## 
## Kappa 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CARD   0.33545538 0.3681741 0.3839419 0.3817963 0.3950501 0.4248489    0
## ANN    0.05767577 0.1460961 0.1760066 0.1736843 0.2054347 0.2983788    0
## BOOST  0.77032615 0.7815913 0.7850809 0.7855712 0.7896929 0.7995154    0
## LDA    0.14142143 0.1567153 0.1650182 0.1636785 0.1692116 0.1819146    0
## SVM    0.64226365 0.6533644 0.6590059 0.6586836 0.6638222 0.6710906    0
## ASEMBL 0.78540780 0.7947401 0.7972362 0.7985521 0.8009400 0.8171923    0
bwplot(results)

dotplot(results)

Conclussion

We found that the worst performance was the simple perceptron and the linear discriminant analysis, followed by the CARD model. The SVM showed adequate performance. It is striking that the boosting was slightly worse than the assembly of all the models using random forest.

Referencia

  1. Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science. , pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.(http://groupware.les.inf.puc-rio.br/har)