Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how they performed the exercise (the classe variable). They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from this website.
The training data and the test data are downloaded from the provided links; the data for this project come from the provided source.
We load the data as below:
#Set the URL for the downloads
UrlTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
UrlTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
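#Download the datasets (stringsAsFactors = TRUE keeps classe a factor on R >= 4.0)
training <- read.csv(url(UrlTrain), stringsAsFactors = TRUE)
testing <- read.csv(url(UrlTest), stringsAsFactors = TRUE)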
The packages knitr, caret, rpart, rpart.plot, rattle, randomForest, e1071, gridExtra and gbm have to be loaded.
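The code in this report does not show the packages being loaded. A minimal sketch of one way to do it, assuming all nine packages are installed from CRAN, first checks which ones are available and only then attaches them (the names pkgs, installed and missing_pkgs are illustrative, not part of the original analysis):

```r
# Packages named in the text above
pkgs <- c("knitr", "caret", "rpart", "rpart.plot", "rattle",
          "randomForest", "e1071", "gridExtra", "gbm")

# requireNamespace() returns FALSE instead of stopping when a package is missing
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Report what has to be installed before the analysis can run
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach every package so that its functions can be called directly
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```

If anything is reported as missing, install.packages(missing_pkgs) would be the usual next step.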
Following the Practical Machine Learning course, the training data are split into a training set and a test set in a 70%/30% ratio.
#Make the research reproducible
set.seed(12345)
partition <- createDataPartition(training$classe, p=0.7, list=FALSE)
training_set <- training[partition, ]
test_set <- training[-partition, ]
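When the outcome is a factor, createDataPartition samples within each level, so the split keeps the class proportions. The idea can be illustrated in base R on toy data. This is only a sketch of stratified sampling, not caret's implementation, and the toy data frame and all names in it are made up for illustration:

```r
# Toy data: 100 rows with an unbalanced five-level outcome (illustrative only)
set.seed(12345)
toy <- data.frame(
  x = rnorm(100),
  classe = factor(rep(c("A", "B", "C", "D", "E"), times = c(30, 20, 20, 20, 10)))
)

# Draw 70% of the row indices separately within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$classe)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The class proportions are the same in both parts
prop.table(table(toy_train$classe))
prop.table(table(toy_test$classe))
```

Sampling within each class avoids a split in which a rare class ends up almost entirely in one of the two parts.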
Many variables consist almost entirely of NA values or have near-zero variance; they can be removed as below:
#Variables with Nearly Zero Variance are removed
NZV <- nearZeroVar(training_set)
training_set <- training_set[, -NZV]
test_set <- test_set[, -NZV]
#Variables that are more than 95% NA are removed
remove_NA <- sapply(training_set, function(x) mean(is.na(x))) > 0.95
training_set <- training_set[, !remove_NA]
test_set <- test_set[, !remove_NA]
#Identification variables (columns 1 to 5) are removed
training_set <- training_set[, -(1:5)]
test_set <- test_set[, -(1:5)]
dim(training_set)
## [1] 13737 54
dim(test_set)
## [1] 5885 54
The number of variables for the analysis has been reduced to 54 (53 predictors plus the outcome classe).
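The filtering logic can be seen more easily on a small example. The sketch below uses a made-up data frame and replaces nearZeroVar, which relies on a frequency-ratio and unique-value criterion, with a much simpler "more than one distinct value" check, so it is a stand-in for the idea rather than a reproduction of caret's rule:

```r
# Toy data standing in for training_set (column names are illustrative)
toy <- data.frame(
  mostly_na = c(rep(NA, 97), 1, 2, 3),   # 97% missing
  constant  = rep(5, 100),               # a single distinct value
  signal    = seq(0, 1, length.out = 100)
)

# Share of missing values per column
na_share <- sapply(toy, function(x) mean(is.na(x)))
# Number of distinct non-missing values per column
n_unique <- sapply(toy, function(x) length(unique(x[!is.na(x)])))

# Keep columns that are at most 95% NA and take more than one value
keep <- na_share <= 0.95 & n_unique > 1
kept_columns <- names(toy)[keep]
kept_columns
```

Only the signal column survives. As in the report, the filter is computed on the training part only and the same columns are then dropped from the test part, so both keep an identical set of variables.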
#rpart.control has no method or number arguments; 10-fold cross-validation is set through xval
model_decision_tree <- rpart(classe ~ ., data = training_set, method = "class", control = rpart.control(xval = 10))
fancyRpartPlot(model_decision_tree)
A single decision tree is not expected to reach a very high accuracy.
prediction_decision_tree <- predict(model_decision_tree, test_set, type = "class")
confusion_matrix_DT <- confusionMatrix(prediction_decision_tree, test_set$classe)
confusion_matrix_DT
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1530  269   51   79   16
##          B   35  575   31   25   68
##          C   17   73  743   68   84
##          D   39  146  130  702  128
##          E   53   76   71   90  786
##
## Overall Statistics
##
##                Accuracy : 0.7368
##                  95% CI : (0.7253, 0.748)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.6656
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9140  0.50483   0.7242   0.7282   0.7264
## Specificity            0.9014  0.96650   0.9502   0.9100   0.9396
## Pos Pred Value         0.7866  0.78338   0.7543   0.6131   0.7305
## Neg Pred Value         0.9635  0.89051   0.9422   0.9447   0.9384
## Prevalence             0.2845  0.19354   0.1743   0.1638   0.1839
## Detection Rate         0.2600  0.09771   0.1263   0.1193   0.1336
## Detection Prevalence   0.3305  0.12472   0.1674   0.1946   0.1828
## Balanced Accuracy      0.9077  0.73566   0.8372   0.8191   0.8330
The accuracy reaches 73.68%.
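The statistics in the output above follow directly from the confusion matrix. As a small worked example, re-entering the decision tree counts by hand reproduces the overall accuracy and the per-class sensitivities (cm_dt and the other names are introduced here for illustration only):

```r
# Decision tree confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
cm_dt <- matrix(
  c(1530, 269,  51,  79,  16,
      35, 575,  31,  25,  68,
      17,  73, 743,  68,  84,
      39, 146, 130, 702, 128,
      53,  76,  71,  90, 786),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy_dt <- sum(diag(cm_dt)) / sum(cm_dt)

# Sensitivity per class: correct predictions over the true count of that class
sensitivity_dt <- diag(cm_dt) / colSums(cm_dt)

round(accuracy_dt, 4)
round(sensitivity_dt, 4)
```

The low sensitivity for class B (about 0.50) shows where the tree struggles most: 269 of the 1139 true B cases are predicted as A.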
#randomForest() has no method or trControl arguments; its out-of-bag error estimate provides internal validation
model_random_forest <- randomForest(classe ~ ., data = training_set, importance = TRUE)
plot(model_random_forest)
With random forest, the out-of-sample error is expected to be small.
prediction_random_forest <- predict(model_random_forest, newdata=test_set)
confusion_matrix_RF <- confusionMatrix(prediction_random_forest, test_set$classe)
confusion_matrix_RF
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    6    0    0    0
##          B    0 1133    6    0    0
##          C    0    0 1020   12    0
##          D    0    0    0  952    4
##          E    0    0    0    0 1078
##
## Overall Statistics
##
##                Accuracy : 0.9952
##                  95% CI : (0.9931, 0.9968)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.994
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9947   0.9942   0.9876   0.9963
## Specificity            0.9986   0.9987   0.9975   0.9992   1.0000
## Pos Pred Value         0.9964   0.9947   0.9884   0.9958   1.0000
## Neg Pred Value         1.0000   0.9987   0.9988   0.9976   0.9992
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1925   0.1733   0.1618   0.1832
## Detection Prevalence   0.2855   0.1935   0.1754   0.1624   0.1832
## Balanced Accuracy      0.9993   0.9967   0.9958   0.9934   0.9982
The accuracy reaches 99.52%.
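The Kappa value in the output corrects the accuracy for the agreement that would occur by chance. A short sketch of the standard Cohen's kappa formula, applied to the random forest counts re-entered by hand (all variable names are illustrative), recovers the reported 0.994:

```r
# Random forest confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
cm_rf <- matrix(
  c(1674,    6,    0,   0,    0,
       0, 1133,    6,   0,    0,
       0,    0, 1020,  12,    0,
       0,    0,    0, 952,    4,
       0,    0,    0,   0, 1078),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm_rf)
# Observed agreement: the share of cases on the diagonal (the accuracy)
p_observed <- sum(diag(cm_rf)) / n
# Agreement expected by chance, from the row and column totals
p_expected <- sum(rowSums(cm_rf) * colSums(cm_rf)) / n^2

cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)
round(cohen_kappa, 3)
```

Because the classes are fairly balanced, chance agreement is only about 21%, so Kappa stays close to the raw accuracy.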
control_boosting <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
model_boosting <- train(classe ~ ., data=training_set, method = "gbm", trControl = control_boosting, verbose = FALSE)
plot(model_boosting)
prediction_boosting <- predict(model_boosting, test_set)
confusion_matrix_boosting <- confusionMatrix(prediction_boosting, test_set$classe)
confusion_matrix_boosting
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1674   16    0    3    0
##          B    0 1103   16    7    5
##          C    0   17 1005   14    4
##          D    0    3    5  939   15
##          E    0    0    0    1 1058
##
## Overall Statistics
##
##                Accuracy : 0.982
##                  95% CI : (0.9783, 0.9852)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9772
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9684   0.9795   0.9741   0.9778
## Specificity            0.9955   0.9941   0.9928   0.9953   0.9998
## Pos Pred Value         0.9888   0.9752   0.9663   0.9761   0.9991
## Neg Pred Value         1.0000   0.9924   0.9957   0.9949   0.9950
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1874   0.1708   0.1596   0.1798
## Detection Prevalence   0.2877   0.1922   0.1767   0.1635   0.1799
## Balanced Accuracy      0.9977   0.9812   0.9862   0.9847   0.9888
The accuracy reaches 98.2%.
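The three accuracies reported above can be put side by side, together with the estimated out-of-sample error (one minus the accuracy on the held-out 30%). The data frame below simply re-enters the reported figures; its name and column names are illustrative:

```r
# Accuracies copied from the three confusion matrix outputs above
results <- data.frame(
  model    = c("decision tree", "random forest", "boosting (gbm)"),
  accuracy = c(0.7368, 0.9952, 0.982)
)

# Estimated out-of-sample error on the held-out test set
results$oos_error <- 1 - results$accuracy

# Sort from the smallest to the largest error
results <- results[order(results$oos_error), ]
results

best_model <- results$model[1]
best_model
```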
The random forest model provides the best accuracy (99.52%), so we use it to predict the 20 cases of the testing data and expect nearly all of them to be classified correctly:
prediction_test <- predict(model_random_forest, newdata=testing)
prediction_test
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
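How likely a perfect prediction of all 20 cases is can be estimated with a rough back-of-the-envelope calculation. The sketch assumes that the 20 cases are classified independently and that the 99.52% accuracy measured on the held-out set carries over to them; both are simplifying assumptions, not results from the analysis:

```r
# Accuracy of the random forest on the held-out test set (from the output above)
accuracy_rf <- 0.9952
# Number of cases in the final testing data
n_cases <- 20

# Probability that every one of the 20 independent predictions is correct
p_all_correct <- accuracy_rf^n_cases
round(p_all_correct, 3)

# Expected number of misclassified cases among the 20
expected_errors <- n_cases * (1 - accuracy_rf)
round(expected_errors, 3)
```

Under these assumptions the chance of 20 correct answers out of 20 is roughly 91%, with about 0.1 errors expected on average.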
Finally, these predictions achieved a perfect score in the quiz.