Research on human activity recognition has traditionally focused on discriminating between different activities. Research on “how well” an activity is performed has received little attention so far, although it potentially provides useful information for a wide variety of applications, such as sports training (http://groupware.les.inf.puc-rio.br/har).
To study how individuals perform an assigned exercise, six young, healthy participants were asked to perform sets of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different ways: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).
The purpose of this report is to use machine learning algorithms to predict which variant of the exercise individuals were performing, using measurements available from devices such as Jawbone Up, Nike FuelBand, and Fitbit.
To begin, we download the data. Two data sets are obtained from the links shown in the code below:
library(caret)
library(corrplot)
library(knitr)
library(dplyr)
library(tidyr)

# TrainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
# TestURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# write.csv(read.csv(TrainURL), file = "train.csv")
# write.csv(read.csv(TestURL), file = "test.csv")
training <- read.csv("train.csv")
test <- read.csv("test.csv")

dim(training) # dimensions of the training set
[1] 19622   161
dim(test) # dimensions of the test set
[1]    20   161
Now we proceed to clean the data. We check for NA, NaN, or empty values; variables in which more than 95% of the entries are NA, NaN, or empty are removed, and the remaining cases with such values are reviewed individually. In addition, variables with variance close to zero are removed to avoid affecting the design of the models.
# Count how many NA there are in each variable and filter them if necessary
CountNA <- data.frame(colSums(1*is.na(training)))
CountNotNA <- data.frame(colSums(1*!is.na(training)))
ratioNA <- t(100*(1 - CountNA/(CountNA + CountNotNA)))
training <- training[,ratioNA > 5]
test <- test[,ratioNA > 5]
# Count how many "#DIV/0!" and empty values there are in each variable and filter them if necessary
valMiss <- ((training=="#DIV/0!") + (training==""))>=1
CountMiss <- data.frame(colSums(1*valMiss))
CountNotMiss <- data.frame(colSums(1*!valMiss))
ratioMiss <- t(100*(1 - CountMiss/(CountMiss + CountNotMiss)))
training <- training[,ratioMiss > 5]
test <- test[,ratioMiss > 5]
# Remove underscores from the variable names.
names(training) <- gsub("_","",names(training))
names(training)[1] <- "X1"
names(test) <- gsub("_","",names(test))
names(test)[1] <- "X1"
training <- training[,-1]
test <- test[,-1]
# Eliminate variables with variance close to zero
NZV <- nearZeroVar(training)
training <- training[,-NZV]
test <- test[,-NZV]
# Remove variables that are not relevant for prediction (identifiers, timestamps and window counters).
training <- training[,-c(1:6)]
test <- test[,-c(1:6)]
dim(training) # dimensions of the training set
[1] 19622    53
dim(test) # dimensions of the test set
[1] 20 53
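As a quick sanity check (not part of the original cleaning steps), we can confirm that both cleaned sets now share the same 52 predictor columns; only the last column should differ between them.
# Sanity check: the 52 predictor columns should match between the two sets
# (assumes the outcome/identifier occupies the last column in each, as above).
all(names(training)[-53] == names(test)[-53])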
A correlation analysis is performed between the variables before the modeling work is done. The “FPC” option orders the variables by the angular order of the first principal component.
corrplot(cor(training[,-53]), order = "FPC", method = "circle", type = "upper",
         tl.cex = 0.6, tl.col = rgb(0, 0, 0),
         title = "Figure 1 - Graphic correlation between variables",
         mar = c(0, 0, 5, 0))
If two variables are highly correlated, their circle is dark blue (positive correlation) or dark red (negative correlation). To further reduce the number of variables, a Principal Component Analysis (PCA) could be performed; however, since there are only a few strong correlations between the input variables, PCA will not be performed. Instead, several different prediction models will be constructed below.
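For reference, here is a minimal sketch of how PCA could be applied with caret had the correlations warranted it; the 95% retained-variance threshold is an illustrative choice, not part of this analysis.
# Hypothetical PCA preprocessing (not run here): estimate the rotation on
# the 52 predictors, keeping components that explain 95% of the variance.
pcaPre <- preProcess(training[,-53], method = "pca", thresh = 0.95)
pcaTraining <- predict(pcaPre, training[,-53])
dim(pcaTraining) # fewer columns than the original 52 predictors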
For the training and testing sets, the data are split in a ratio of 70% for training and 30% for testing.
set.seed(28916022)
Index1 <- createDataPartition(y = training$classe, p = 0.7,
list = FALSE)
testing <- training[-Index1,]
training <- training[Index1,]
table <- rbind(prop.table(table(training$classe)),
prop.table(table(testing$classe)))
rownames(table) <- c("training", "testing")
round(table,3) # Proportion of the different levels in each data set
             A     B     C     D     E
training 0.284 0.193 0.174 0.164 0.184
testing 0.284 0.194 0.174 0.164 0.184
Four models will be used for prediction: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Stochastic Gradient Boosting (GBM) and k-Nearest Neighbors (KNN).
In addition, 5-fold (k-fold) cross-validation is considered: the k-fold method divides the training data into k subsets; each subset is held out in turn while the model is trained on the remaining subsets. It is a robust method for estimating accuracy, and the choice of k adjusts the amount of bias in the estimate.
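To make the mechanics concrete, the following sketch performs 5-fold cross-validation by hand for LDA (caret automates exactly this through trainControl below; the fold assignments here are illustrative only).
# Illustrative only: manual 5-fold CV using caret's createFolds().
folds <- createFolds(training$classe, k = 5) # five lists of held-out row indices
cvAcc <- sapply(folds, function(idx) {
  fit <- train(classe ~ ., data = training[-idx,], method = "lda")
  mean(predict(fit, newdata = training[idx,]) == training$classe[idx])
})
mean(cvAcc) # average accuracy over the five held-out folds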
trC=trainControl(method="cv", number=5)
m="Accuracy"
set.seed(2891)
fitLDA <- train(classe~., data=training, method="lda", metric=m,
trControl=trC)
set.seed(2891)
fitQDA <- train(classe~., data=training, method="qda", metric=m,
trControl=trC)
set.seed(2891)
fitGBM <- train(classe~., data=training, method="gbm", metric=m,
trControl=trC, verbose=FALSE)
set.seed(2891)
fitKNN <- train(classe~., data=training, method="knn", metric=m,
trControl=trC)
With the models trained, we generate predictions for the training and testing partitions and evaluate their performance.
PredLDAtrain <- predict(fitLDA, newdata=training)
PredQDAtrain <- predict(fitQDA, newdata=training)
PredKNNtrain <- predict(fitKNN, newdata=training)
PredGBMtrain <- predict(fitGBM, newdata=training)
PredLDAtest <- predict(fitLDA, newdata=testing)
PredQDAtest <- predict(fitQDA, newdata=testing)
PredKNNtest <- predict(fitKNN, newdata=testing)
PredGBMtest <- predict(fitGBM, newdata=testing)
mSummaryTrain <- cbind(confusionMatrix(PredLDAtrain, training$classe)$overall[1],
confusionMatrix(PredQDAtrain, training$classe)$overall[1],
confusionMatrix(PredKNNtrain, training$classe)$overall[1],
confusionMatrix(PredGBMtrain, training$classe)$overall[1])
colnames(mSummaryTrain) <- c("LDA", "QDA", "KNN", "GBM")
rownames(mSummaryTrain) <- "Training"
mSummaryTest <- cbind(confusionMatrix(PredLDAtest, testing$classe)$overall[1],
confusionMatrix(PredQDAtest, testing$classe)$overall[1],
confusionMatrix(PredKNNtest, testing$classe)$overall[1],
confusionMatrix(PredGBMtest, testing$classe)$overall[1])
colnames(mSummaryTest) <- c("LDA", "QDA", "KNN", "GBM")
rownames(mSummaryTest) <- "Testing"
# Confusion matrix of the fitLDA predictor and the test set
confusionMatrix(PredLDAtest, testing$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1363 188 103 71 48
B 34 727 93 37 197
C 143 129 693 115 88
D 126 41 117 703 101
E 8 54 20 38 648
Overall Statistics
Accuracy : 0.7025
95% CI : (0.6906, 0.7141)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6232
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.8142 0.6383 0.6754 0.7293 0.5989
Specificity 0.9026 0.9239 0.9022 0.9218 0.9750
Pos Pred Value 0.7688 0.6682 0.5933 0.6461 0.8438
Neg Pred Value 0.9244 0.9141 0.9294 0.9456 0.9152
Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
Detection Rate 0.2316 0.1235 0.1178 0.1195 0.1101
Detection Prevalence 0.3013 0.1849 0.1985 0.1849 0.1305
Balanced Accuracy 0.8584 0.7811 0.7888 0.8255 0.7870
# Confusion matrix of the fitQDA predictor and the test set
confusionMatrix(PredQDAtest, testing$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1551 54 0 1 0
B 69 960 53 2 31
C 24 115 970 129 53
D 27 3 2 812 23
E 3 7 1 20 975
Overall Statistics
Accuracy : 0.8952
95% CI : (0.887, 0.9029)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8676
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9265 0.8428 0.9454 0.8423 0.9011
Specificity 0.9869 0.9673 0.9339 0.9888 0.9935
Pos Pred Value 0.9658 0.8610 0.7514 0.9366 0.9692
Neg Pred Value 0.9713 0.9625 0.9878 0.9697 0.9781
Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
Detection Rate 0.2636 0.1631 0.1648 0.1380 0.1657
Detection Prevalence 0.2729 0.1895 0.2194 0.1473 0.1709
Balanced Accuracy 0.9567 0.9051 0.9397 0.9156 0.9473
# Confusion matrix of the fitKNN predictor and the test set
confusionMatrix(PredKNNtest, testing$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1606 49 12 18 16
B 25 993 33 3 52
C 17 32 921 61 33
D 18 34 38 872 34
E 8 31 22 10 947
Overall Statistics
Accuracy : 0.9072
95% CI : (0.8995, 0.9145)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8826
Mcnemar's Test P-Value : 8.755e-10
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9594 0.8718 0.8977 0.9046 0.8752
Specificity 0.9774 0.9762 0.9706 0.9748 0.9852
Pos Pred Value 0.9442 0.8978 0.8656 0.8755 0.9303
Neg Pred Value 0.9837 0.9694 0.9782 0.9812 0.9723
Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
Detection Rate 0.2729 0.1687 0.1565 0.1482 0.1609
Detection Prevalence 0.2890 0.1879 0.1808 0.1692 0.1730
Balanced Accuracy 0.9684 0.9240 0.9341 0.9397 0.9302
# Confusion matrix of the fitGBM predictor and the test set
confusionMatrix(PredGBMtest, testing$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1643 26 0 2 1
B 16 1088 28 3 12
C 10 23 973 29 8
D 4 1 22 922 9
E 1 1 3 8 1052
Overall Statistics
Accuracy : 0.9648
95% CI : (0.9598, 0.9694)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9555
Mcnemar's Test P-Value : 0.002477
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9815 0.9552 0.9483 0.9564 0.9723
Specificity 0.9931 0.9876 0.9856 0.9927 0.9973
Pos Pred Value 0.9827 0.9486 0.9329 0.9624 0.9878
Neg Pred Value 0.9926 0.9892 0.9891 0.9915 0.9938
Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
Detection Rate 0.2792 0.1849 0.1653 0.1567 0.1788
Detection Prevalence 0.2841 0.1949 0.1772 0.1628 0.1810
Balanced Accuracy 0.9873 0.9714 0.9670 0.9746 0.9848
# Comparing accuracy of the training and test set
round(rbind(mSummaryTrain, mSummaryTest),3)
           LDA   QDA   KNN   GBM
Training 0.704 0.900 0.958 0.975
Testing 0.702 0.895 0.907 0.965
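Since all four models were fitted with the same seed and the same 5-fold trainControl, their cross-validation resamples are directly comparable; as an optional check (not in the original analysis), caret's resamples() helper summarizes them side by side.
# Optional: compare cross-validated accuracy across the four fitted models.
cvResults <- resamples(list(LDA = fitLDA, QDA = fitQDA, KNN = fitKNN, GBM = fitGBM))
summary(cvResults)$statistics$Accuracy # min/median/mean/max accuracy per model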
# Comparing error on the training and test sets
## Error is the fraction of cases in which the prediction differs from the reference.
errorLDAtrain <- sum(PredLDAtrain!=training$classe)/length(training$classe)
errorQDAtrain <- sum(PredQDAtrain!=training$classe)/length(training$classe)
errorKNNtrain <- sum(PredKNNtrain!=training$classe)/length(training$classe)
errorGBMtrain <- sum(PredGBMtrain!=training$classe)/length(training$classe)
errorLDAtest <- sum(PredLDAtest!=testing$classe)/length(testing$classe)
errorQDAtest <- sum(PredQDAtest!=testing$classe)/length(testing$classe)
errorKNNtest <- sum(PredKNNtest!=testing$classe)/length(testing$classe)
errorGBMtest <- sum(PredGBMtest!=testing$classe)/length(testing$classe)
mError <- rbind(cbind(errorLDAtrain,errorQDAtrain,errorKNNtrain,errorGBMtrain),
cbind(errorLDAtest,errorQDAtest,errorKNNtest,errorGBMtest))
colnames(mError) <- c("LDA", "QDA", "KNN", "GBM")
rownames(mError) <- c("Training","Testing")
round(mError,3) # Error of the training and test set
           LDA   QDA   KNN   GBM
Training 0.296 0.100 0.042 0.025
Testing 0.298 0.105 0.093 0.035
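Note that these error rates are simply the complements of the accuracies reported above, which the following line confirms.
# Error is 1 minus accuracy on the same predictions.
round(1 - rbind(mSummaryTrain, mSummaryTest), 3)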
We observe that predictions on the training set are more accurate than those on the test set; correspondingly, the error is higher on the test set than on the training set.
The model with the highest accuracy and the lowest error is GBM, so its test-set error of about 3.5% serves as our estimate of the out-of-sample error.
We therefore select the GBM model to predict on the data set “pml-testing.csv”.
PredictTest <- predict(fitGBM, newdata=test)
filename = "Results_problem_id.txt"
write.table(PredictTest,file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
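For easier review, the predictions can also be paired with their test-set case numbers; the case column below is introduced purely for illustration and is not part of the submitted output.
# Hypothetical convenience view: one row per test case with its predicted class.
data.frame(case = seq_len(nrow(test)), prediction = PredictTest)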