This report builds a model that predicts, as accurately as possible, how well a weight-lifting exercise is performed. We use the public Weight Lifting Exercises Dataset and compare four Machine Learning techniques, each fitted with Cross Validation: Random Forest, Generalized Boosted Regression Models, Linear Discriminant Analysis, and Recursive Partitioning And Regression Trees.
library(caret)  # createDataPartition, nearZeroVar, findCorrelation, train, confusionMatrix
URL.train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URL.test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
Fil.train <- "pml-training.csv"
Fil.test <- "pml-testing.csv"
# Download each file only if it is not already present
if (!file.exists(Fil.train))
  download.file(URL.train, destfile = Fil.train)
if (!file.exists(Fil.test))
  download.file(URL.test, destfile = Fil.test)
# Treat "NA", "#DIV/0!" and empty strings as missing values
Dat.train <- read.csv(Fil.train, na.strings = c("NA", "#DIV/0!", ""))
Dat.test <- read.csv(Fil.test, na.strings = c("NA", "#DIV/0!", ""))
We download the two datasets (training and testing) and load them into memory. An earlier inspection of the raw files showed spreadsheet error strings (#DIV/0!) and empty fields, which we convert to NA on import.
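As a small illustration of how the na.strings argument behaves, the following sketch reads a made-up three-row CSV and counts the resulting missing values. The column names and values are invented for the example and are not taken from the dataset.

```r
# Toy CSV text; the columns and values are hypothetical.
csv_text <- "roll,kurtosis,classe
1.5,#DIV/0!,A
NA,0.2,B
,0.3,A"

# "NA", "#DIV/0!" and empty fields are all read as NA.
toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

# Count the missing values per column after the conversion.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# With the error string mapped to NA, the column is read as numeric.
print(is.numeric(toy$kurtosis))
```

Without the na.strings argument, a column containing #DIV/0! would be read as text rather than as numbers, which is why the conversion is done at import time.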
set.seed(13)
# Hold out 30% of the training data for validation
Spl <- createDataPartition(Dat.train$classe, p = 0.7, list = FALSE)
Dat.train.train <- Dat.train[Spl, ]
Dat.train.valid <- Dat.train[-Spl, ]
# Drop the first five columns (identifiers and measurement metadata)
Dat.train.train <- Dat.train.train[, -c(1:5)]
Dat.train.valid <- Dat.train.valid[, -c(1:5)]
# Drop near-zero-variance columns, detected on the model-building part only
nz <- nearZeroVar(Dat.train.train)
Dat.train.train <- Dat.train.train[, -nz]
Dat.train.valid <- Dat.train.valid[, -nz]
# Drop columns in which more than 97% of the values are NA
vna <- sapply(Dat.train.train, function(x) mean(is.na(x))) > 0.97
Dat.train.train <- Dat.train.train[, !vna]
Dat.train.valid <- Dat.train.valid[, !vna]
dim(Dat.train.train)
## [1] 13737 54
dim(Dat.train.valid)
## [1] 5885 54
descrCor <- cor(Dat.train.train[, -length(Dat.train.train)])
highlyCorDescr <- findCorrelation(descrCor, cutoff = .8)
Dat.train.train <- Dat.train.train[,-highlyCorDescr]
Dat.train.valid <- Dat.train.valid[,-highlyCorDescr]
dim(Dat.train.train)
## [1] 13737 41
dim(Dat.train.valid)
## [1] 5885 41
Initially, each of the two datasets has 160 variables. We split the training data into two parts: 70% to build the models and 30% to validate them and choose the most accurate one. We then remove the first five columns, which are row identifiers or describe the measurement process itself and are of no use for prediction; next, the variables with near-zero variance; and finally, the variables that are almost entirely NA (more than 97% of their values). This leaves 54 columns, that is, 53 predictors plus the outcome ‘classe’. Last, we compute the correlations between the predictors and, using a cutoff of 0.8 on the absolute correlation, remove 13 highly correlated ones. The remaining 41 columns (40 predictors plus ‘classe’) are the ones we use to build the prediction models.
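The last two filters can be sketched on a toy data frame using base R only. This is a simplified illustration under stated assumptions: the data are simulated, and the pairwise rule below (drop the later column of any pair whose absolute correlation exceeds 0.8) is a stand-in for findCorrelation, which chooses the column to drop by its mean absolute correlation and so may remove a different member of a pair.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),      # nearly a copy of x1
  x3 = rnorm(n),                     # independent of the others
  mostly_na = c(1, rep(NA, n - 1))   # 99.5% missing
)

# Step 1: drop columns whose share of NA exceeds 97%.
na_share <- sapply(toy, function(x) mean(is.na(x)))
toy <- toy[, na_share <= 0.97, drop = FALSE]

# Step 2: keep only the upper triangle of the absolute correlation matrix,
# then drop every column that is highly correlated with an earlier column.
cors <- abs(cor(toy))
cors[lower.tri(cors, diag = TRUE)] <- 0
drop_cols <- names(which(apply(cors, 2, function(col) any(col > 0.8))))
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

print(names(toy))
```

Both filters are computed on the model-building data only, so that the validation part plays no role in deciding which variables are kept.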
Dat.test <- Dat.test[, -c(1:5)]
Dat.test <- Dat.test[, -nz]
Dat.test <- Dat.test[, vna==FALSE]
Dat.test <- Dat.test[,-highlyCorDescr]
dim(Dat.test)
## [1] 20 41
We apply the same transformations, with the same column indices, to the testing data provided. We use this data at the end of the report to make the prediction for the 20 samples (individuals).
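Removing columns by position works only while both data frames keep the same column order. A possible alternative, sketched here on two invented data frames, is to record the names of the retained predictors and select them by name. All names below are hypothetical and serve only the illustration.

```r
# Toy stand-ins for the filtered training data and the raw testing data.
train_toy <- data.frame(roll = c(1, 2), pitch = c(3, 4), classe = c("A", "B"))
test_toy  <- data.frame(id = 1:2, roll = c(5, 6), yaw = c(0, 0),
                        pitch = c(7, 8))

# Names of the predictors that survived filtering (all but the outcome).
keep <- setdiff(names(train_toy), "classe")

# Fail early if the testing data lacks any retained predictor.
stopifnot(all(keep %in% names(test_toy)))

# Select by name, so column order in the testing data does not matter.
test_toy <- test_toy[, keep, drop = FALSE]
print(names(test_toy))
```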
vControl <- trainControl(method="cv", number=4, verboseIter = FALSE)
vMetric <- "Accuracy"
We set the parameters shared by all of the models: 4-fold Cross Validation for resampling, and Accuracy as the metric used to select each model's tuning parameters.
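To make concrete what 4-fold cross-validation computes, here is a minimal base R sketch on simulated labels. It is an illustration only: the "model" simply predicts the most frequent class, and the folds are assigned at random, whereas caret builds its own folds and fits the real model in each round.

```r
set.seed(13)
y <- factor(sample(c("A", "B", "C"), 100, replace = TRUE,
                   prob = c(0.5, 0.3, 0.2)))
k <- 4

# Assign each row to one of k folds at random (25 rows per fold here).
fold <- sample(rep(seq_len(k), length.out = length(y)))

fold_accuracy <- sapply(seq_len(k), function(i) {
  train_y <- y[fold != i]   # the k - 1 folds used for fitting
  held_y  <- y[fold == i]   # the fold held out for evaluation
  # "Model": always predict the most frequent class in the training folds.
  majority <- names(which.max(table(train_y)))
  mean(held_y == majority)
})

# The cross-validated estimate is the mean over the k held-out folds.
cv_accuracy <- mean(fold_accuracy)
print(fold_accuracy)
print(cv_accuracy)
```

Every row is held out exactly once, so each accuracy in fold_accuracy is measured on data the corresponding fit never saw.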
Modfit.lda <- train(classe ~ ., method = "lda", data = Dat.train.train, verbose = FALSE, trControl = vControl, metric = vMetric)
Pre.lda <- predict(Modfit.lda, Dat.train.valid)
confusionMatrix(Pre.lda, Dat.train.valid$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1300  188  118   46   51
##          B  139  663   84   69  176
##          C  118  118  657  122  138
##          D   88   83  136  653  137
##          E   29   87   31   74  580
##
## Overall Statistics
##
##                Accuracy : 0.6547
##                  95% CI : (0.6424, 0.6669)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.5634
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7766   0.5821   0.6404   0.6774  0.53604
## Specificity            0.9043   0.9014   0.8979   0.9098  0.95399
## Pos Pred Value         0.7634   0.5862   0.5698   0.5953  0.72409
## Neg Pred Value         0.9106   0.8999   0.9220   0.9350  0.90126
## Prevalence             0.2845   0.1935   0.1743   0.1638  0.18386
## Detection Rate         0.2209   0.1127   0.1116   0.1110  0.09856
## Detection Prevalence   0.2894   0.1922   0.1959   0.1864  0.13611
## Balanced Accuracy      0.8404   0.7417   0.7691   0.7936  0.74502
We build the model on 70% of the training data and validate it on the remaining 30%. In this instance the Accuracy is under 66% (0.6547), so we conclude that Linear Discriminant Analysis is not an appropriate technique for our data.
Modfit.rpart <- train(classe ~ ., method = "rpart", data = Dat.train.train, trControl = vControl, metric = vMetric)
Pre.rpart <- predict(Modfit.rpart, Dat.train.valid)
confusionMatrix(Pre.rpart, Dat.train.valid$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1379  204   16   83   99
##          B   29  386   24  191  213
##          C  265  547  986  606  395
##          D    0    0    0    0    0
##          E    1    2    0   84  375
##
## Overall Statistics
##
##                Accuracy : 0.5312
##                  95% CI : (0.5183, 0.544)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.4057
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8238  0.33889   0.9610   0.0000  0.34658
## Specificity            0.9045  0.90371   0.6269   1.0000  0.98189
## Pos Pred Value         0.7743  0.45789   0.3523      NaN  0.81169
## Neg Pred Value         0.9281  0.85065   0.9870   0.8362  0.86963
## Prevalence             0.2845  0.19354   0.1743   0.1638  0.18386
## Detection Rate         0.2343  0.06559   0.1675   0.0000  0.06372
## Detection Prevalence   0.3026  0.14325   0.4756   0.0000  0.07850
## Balanced Accuracy      0.8642  0.62130   0.7939   0.5000  0.66423
Similarly, we build the model on 70% of the training data and validate it on the remaining 30%. In this case the Accuracy is under 54% (0.5312), and the tree never predicts class D (its sensitivity for that class is 0), so Recursive Partitioning is clearly not an appropriate technique for our data.
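The overall accuracy and the per-class sensitivities can be recomputed by hand from the rpart confusion matrix printed above. The sketch below copies the counts from that output; only the variable names are new.

```r
# Confusion matrix for the rpart model, copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(c(1379, 204,  16,  83,  99,
                 29, 386,  24, 191, 213,
                265, 547, 986, 606, 395,
                  0,   0,   0,   0,   0,
                  1,   2,   0,  84, 375),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

accuracy    <- sum(diag(cm)) / sum(cm)   # share of correct predictions
sensitivity <- diag(cm) / colSums(cm)    # per-class share of true cases found

print(round(accuracy, 4))
print(round(sensitivity, 4))
```

The row for D is all zeros, so none of the 964 true D cases is recovered. The overall accuracy hides this, which is why the per-class statistics are worth reading alongside it.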
Modfit.gbm <- train(classe ~ ., method = "gbm", data = Dat.train.train, trControl = vControl, metric = vMetric, verbose = FALSE)
Pre.gbm <- predict(Modfit.gbm, Dat.train.valid)
confusionMatrix(Pre.gbm, Dat.train.valid$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1669    7    0    0    0
##          B    5 1117    5    0    0
##          C    0   12 1018    8    1
##          D    0    3    3  955    5
##          E    0    0    0    1 1076
##
## Overall Statistics
##
##                Accuracy : 0.9915
##                  95% CI : (0.9888, 0.9937)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9893
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9970   0.9807   0.9922   0.9907   0.9945
## Specificity            0.9983   0.9979   0.9957   0.9978   0.9998
## Pos Pred Value         0.9958   0.9911   0.9798   0.9886   0.9991
## Neg Pred Value         0.9988   0.9954   0.9983   0.9982   0.9988
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2836   0.1898   0.1730   0.1623   0.1828
## Detection Prevalence   0.2848   0.1915   0.1766   0.1641   0.1830
## Balanced Accuracy      0.9977   0.9893   0.9939   0.9942   0.9971
We do the same for this model: we build it on 70% of the training data and validate it on the remaining 30%. In this case the Accuracy is very good, at 99.15%. Unless the final model does better, this is the one we would choose.
Modfit.rf <- train(classe ~ ., method = "rf", data = Dat.train.train, trControl = vControl, metric = vMetric)
Pre.rf <- predict(Modfit.rf, Dat.train.valid)
confusionMatrix(Pre.rf, Dat.train.valid$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    2    0    0    0
##          B    0 1135    1    0    0
##          C    0    2 1025    2    0
##          D    0    0    0  962    0
##          E    0    0    0    0 1082
##
## Overall Statistics
##
##                Accuracy : 0.9988
##                  95% CI : (0.9976, 0.9995)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9985
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9965   0.9990   0.9979   1.0000
## Specificity            0.9995   0.9998   0.9992   1.0000   1.0000
## Pos Pred Value         0.9988   0.9991   0.9961   1.0000   1.0000
## Neg Pred Value         1.0000   0.9992   0.9998   0.9996   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1929   0.1742   0.1635   0.1839
## Detection Prevalence   0.2848   0.1930   0.1749   0.1635   0.1839
## Balanced Accuracy      0.9998   0.9981   0.9991   0.9990   1.0000
As we can see, Random Forest is the most accurate of the four models, with an Accuracy of 99.88% on the validation data. This is the model we will use to predict the outcome variable ‘classe’ for each of the 20 samples (individuals). The value A represents the correct way of doing the exercise, and B, C, D and E represent the four most common errors made when doing the exercises specified in the experiment.
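The comparison can be summarized in a few lines. The sketch below simply collects the four validation accuracies reported in the confusion matrices above and ranks them; the variable names are illustrative.

```r
# Validation accuracies as reported in the four confusion matrices.
val_accuracy <- c(lda = 0.6547, rpart = 0.5312, gbm = 0.9915, rf = 0.9988)

# Rank the models from best to worst and name the winner.
ranked <- sort(val_accuracy, decreasing = TRUE)
best   <- names(ranked)[1]

print(ranked)
print(best)
```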
Accu <- sum(Pre.rf == Dat.train.valid$classe) / length(Pre.rf)
Accu
## [1] 0.9988105
Error <- 1 - Accu
Error
## [1] 0.001189465
pError <- Error * 100
pError
## [1] 0.1189465
We have calculated the out-of-sample error rate of the Random Forest model on the validation data and, as expected, it is very low: 0.12%, which corresponds to 7 misclassified samples out of 5885. Because the same validation data was also used to choose among the four models, this estimate may be slightly optimistic. Random Forest is the model we select.
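The error rate can also be derived directly from the counts in the Random Forest confusion matrix, together with an interval that conveys its uncertainty. This is a sketch, not part of the original analysis: it uses binom.test from base R to obtain an exact binomial 95% interval, and the variable names are illustrative.

```r
n_valid   <- 5885                              # rows in the validation data
n_correct <- 1674 + 1135 + 1025 + 962 + 1082   # diagonal of the RF matrix
n_wrong   <- n_valid - n_correct               # misclassified samples

error_rate <- n_wrong / n_valid

# Exact (Clopper-Pearson) 95% interval for the error rate.
ci <- binom.test(n_wrong, n_valid)$conf.int

print(n_wrong)
print(error_rate)
print(ci)
```

With only a handful of misclassified samples, the interval is wide relative to the point estimate, so the exact value of the error rate should be read with some caution.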
Pre.rf.testing <- predict(Modfit.rf, Dat.test)
Pre.rf.testing
All 20 predictions made by the selected model are correct. We verified this by submitting them to the Automated Grading Quiz.