In this project, our goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants (that were asked to perform barbell lifts correctly and incorrectly in 5 different ways) for predict the manner in which they did the exercise.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(mlbench)
library(readr)
pmltrain <- read_csv("pml-training.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## user_name = col_character(),
## cvtd_timestamp = col_character(),
## new_window = col_character(),
## kurtosis_roll_belt = col_character(),
## kurtosis_picth_belt = col_character(),
## kurtosis_yaw_belt = col_character(),
## skewness_roll_belt = col_character(),
## skewness_roll_belt.1 = col_character(),
## skewness_yaw_belt = col_character(),
## max_yaw_belt = col_character(),
## min_yaw_belt = col_character(),
## amplitude_yaw_belt = col_character(),
## kurtosis_picth_arm = col_character(),
## kurtosis_yaw_arm = col_character(),
## skewness_pitch_arm = col_character(),
## skewness_yaw_arm = col_character(),
## kurtosis_yaw_dumbbell = col_character(),
## skewness_yaw_dumbbell = col_character(),
## kurtosis_roll_forearm = col_character(),
## kurtosis_picth_forearm = col_character()
## # ... with 8 more columns
## )
## See spec(...) for full column specifications.
## Warning: 182 parsing failures.
## row col expected actual file
## 2231 kurtosis_roll_arm a double #DIV/0! 'pml-training.csv'
## 2231 skewness_roll_arm a double #DIV/0! 'pml-training.csv'
## 2255 kurtosis_roll_arm a double #DIV/0! 'pml-training.csv'
## 2255 skewness_roll_arm a double #DIV/0! 'pml-training.csv'
## 2282 kurtosis_roll_arm a double #DIV/0! 'pml-training.csv'
## .... ................. ........ ....... ..................
## See problems(...) for more details.
pmltest <- read_csv("pml-testing.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_logical(),
## X1 = col_double(),
## user_name = col_character(),
## raw_timestamp_part_1 = col_double(),
## raw_timestamp_part_2 = col_double(),
## cvtd_timestamp = col_character(),
## new_window = col_character(),
## num_window = col_double(),
## roll_belt = col_double(),
## pitch_belt = col_double(),
## yaw_belt = col_double(),
## total_accel_belt = col_double(),
## gyros_belt_x = col_double(),
## gyros_belt_y = col_double(),
## gyros_belt_z = col_double(),
## accel_belt_x = col_double(),
## accel_belt_y = col_double(),
## accel_belt_z = col_double(),
## magnet_belt_x = col_double(),
## magnet_belt_y = col_double(),
## magnet_belt_z = col_double()
## # ... with 40 more columns
## )
## See spec(...) for full column specifications.
str(pmltrain)
In the first instance, we notice that a set of variables contains few data, with NA predominating, so we eliminate them from the database. The same with the identification, data and time. The Variable classe is the outcome: the manner in which the exercise was done.
The data will be divided into three populations (training, testing and validation) in order to make an assembly.
tr <- pmltrain[, c(7:11, 37:49, 60:68, 84:86, 102, 113:124, 140, 151:160)]
inTrain = createDataPartition(tr$classe, p = 0.7)[[1]]
training = tr[ inTrain,]
testing = tr[-inTrain,]
validation <- pmltest[, c(7:11, 37:49, 60:68, 84:86, 102, 113:124, 140, 151:160)]
str(training)
summary(training)
any(is.na(training)) #there is no missing data (NA)
featurePlot(x=training[, c(2, 39, 41, 42)], y = training$classe, plot="pairs")
## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion
## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion
## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion
## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion
## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion
## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion
## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion
## Warning in panel.xyplot(..., identifier = identifier): NAs introduced by
## coercion
A mechanism to perform the selection of characteristics is to take advantage of the property of some algorithms to qualify the importance of some variables in the training process, such as decision trees or regularized random forests. I select the first six more important variables.
set.seed(523)
rPartMod <- train(classe ~ ., data=training, method="rpart")
rpartImp <- varImp(rPartMod)
print(rpartImp)
## rpart variable importance
##
## only 20 most important variables shown (out of 53)
##
## Overall
## pitch_forearm 100.00
## roll_forearm 71.80
## roll_belt 70.22
## magnet_dumbbell_y 49.28
## accel_belt_z 42.53
## magnet_belt_y 41.10
## num_window 39.55
## yaw_belt 38.94
## total_accel_belt 35.65
## magnet_arm_x 26.81
## accel_arm_x 26.05
## roll_dumbbell 19.10
## magnet_dumbbell_z 18.61
## roll_arm 15.27
## gyros_dumbbell_y 0.00
## gyros_forearm_y 0.00
## gyros_forearm_z 0.00
## gyros_arm_z 0.00
## accel_dumbbell_y 0.00
## accel_forearm_y 0.00
plot(rpartImp, top = 15, main='Variable Importance')
finalTrain <- training [, c(2, 39, 41, 42, 54)]
finalTest <- testing [, c(2, 39, 41, 42, 54)]
finalValid<- validation [, c(2, 39, 41, 42, 54)]
Within this report I will use five different models in caret: a) CARD model: “rpart”, b) neural network: “nnet”, c) boosted trees: “gbm”, d) linear discriminant analysis: “lda”, e) support vector machines: “svmRadial”. As well as its respective assembly by rf.
set.seed(3452)
model1<- train(classe~ ., data= finalTrain, method="rpart")
model2<- train(classe~ ., data= finalTrain, method="nnet", verbose=FALSE)
model3<- train(classe~ ., data= finalTrain, method="gbm", verbose=FALSE)
model4<- train(classe~ ., data= finalTrain, method="lda")
model5<- train(classe~ ., data= finalTrain, method="svmRadial")
t11 <- predict(model1,newdata= finalTrain)
t21 <- predict(model2,newdata= finalTrain)
t31 <- predict(model3,newdata= finalTrain)
t41 <- predict(model4,newdata= finalTrain)
t51 <- predict(model5,newdata= finalTrain)
t12 <- predict(model1,newdata= finalTest)
t22 <- predict(model2,newdata= finalTest)
t32 <- predict(model3,newdata= finalTest)
t42 <- predict(model4,newdata= finalTest)
t52 <- predict(model5,newdata= finalTest)
predE1 <-data.frame(t12, t22, t32, t42, t52, classe = finalTest$classe)
ensambl <- train(classe ~.,method="rf",data=predE1)
combPred1 <- predict(ensambl,predE1)
The table presents the result of the models trained and tested, and its assemble.
models <- c("rpart", "nnet", "gbm", "lda", "svmRadial", "asemble")
train_Acc<- c(confusionMatrix(as.factor(finalTrain$classe),t11)$overall['Accuracy'],
confusionMatrix(as.factor(finalTrain$classe),t21)$overall['Accuracy'],
confusionMatrix(as.factor(finalTrain$classe),t31)$overall['Accuracy'],
confusionMatrix(as.factor(finalTrain$classe),t41)$overall['Accuracy'],
confusionMatrix(as.factor(finalTrain$classe),t51)$overall['Accuracy'], NA)
testAcc <- c(confusionMatrix(as.factor(finalTest$classe),t12)$overall['Accuracy'],
confusionMatrix(as.factor(finalTest$classe),t22)$overall['Accuracy'],
confusionMatrix(as.factor(finalTest$classe),t32)$overall['Accuracy'],
confusionMatrix(as.factor(finalTest$classe),t42)$overall['Accuracy'],
confusionMatrix(as.factor(finalTest$classe),t52)$overall['Accuracy'],
confusionMatrix(as.factor(finalTest$classe),combPred1)$overall['Accuracy'])
result<- cbind(models, train_Acc, testAcc)
result
## models train_Acc testAcc
## Accuracy "rpart" "0.495887020455704" "0.494137638062872"
## Accuracy "nnet" "0.382470699570503" "0.383177570093458"
## Accuracy "gbm" "0.855718133508044" "0.840101954120646"
## Accuracy "lda" "0.355390551066463" "0.348853016142736"
## Accuracy "svmRadial" "0.739608356992065" "0.722175021240442"
## "asemble" NA "0.855225148683093"
results <- resamples(list(CARD =model1, ANN =model2, BOOST=model3, LDA =model4, SVM = model5, ASEMBL = ensambl))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: CARD, ANN, BOOST, LDA, SVM, ASEMBL
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CARD 0.4909919 0.5145631 0.5191847 0.5203092 0.5277613 0.5511532 0
## ANN 0.3065622 0.3441416 0.3614387 0.3647675 0.3850734 0.4475622 0
## BOOST 0.8180392 0.8272908 0.8298294 0.8303603 0.8336673 0.8414512 0
## LDA 0.3366733 0.3507194 0.3575119 0.3568576 0.3634558 0.3713171 0
## SVM 0.7183682 0.7259071 0.7306244 0.7305166 0.7343844 0.7403865 0
## ASEMBL 0.8298969 0.8373467 0.8394946 0.8404698 0.8423989 0.8555098 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CARD 0.33545538 0.3681741 0.3839419 0.3817963 0.3950501 0.4248489 0
## ANN 0.05767577 0.1460961 0.1760066 0.1736843 0.2054347 0.2983788 0
## BOOST 0.77032615 0.7815913 0.7850809 0.7855712 0.7896929 0.7995154 0
## LDA 0.14142143 0.1567153 0.1650182 0.1636785 0.1692116 0.1819146 0
## SVM 0.64226365 0.6533644 0.6590059 0.6586836 0.6638222 0.6710906 0
## ASEMBL 0.78540780 0.7947401 0.7972362 0.7985521 0.8009400 0.8171923 0
bwplot(results)
dotplot(results)
We found that the worst performance was the simple perceptron and the linear discriminant analysis, followed by the CARD model. The SVM showed adequate performance. It is striking that the boosting was slightly worse than the assembly of all the models using random forest.