Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).
The goal of the project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set.
# Packages
library(caret)
# Data download
url_train <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
url_test <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
download.file(url_train,'train.csv')
train_data <- read.csv('train.csv')
download.file(url_test,'test.csv')
test_data <- read.csv('test.csv')
The training data set contains 19622 rows and 160 columns; the classe column is the outcome to predict, and the remaining columns are candidate predictors. The testing set contains 20 rows and 160 columns.
dim(train_data)
## [1] 19622 160
dim(test_data)
## [1] 20 160
The variable we want to predict is classe, the manner in which the exercise was performed. In particular, six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
levels(train_data$classe)
## [1] "A" "B" "C" "D" "E"
Looking at the columns (variables), it is possible to notice that the first seven do not seem to be useful predictors: they are identifiers related to the participants and to the time when the experiment was conducted.
# Remove first seven columns
exper <- names(train_data[1:7])
exper
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window"
train_data <- train_data[,8:160]
The data set contains many NA values; for this reason, the columns containing NAs are not used for training the model. (Imputation could have been another option.)
# Columns with NA values are not considered in the model
nas <- apply(train_data,2,anyNA)
columns_notNA <- names(nas[!nas])
train_data <- train_data[,columns_notNA]
dim(train_data)
## [1] 19622 86
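To see why dropping these columns is reasonable rather than imputing them, one can check how sparse they actually are; a quick look at the share of missing values per column in the raw training file (output omitted):
# Share of NA values per column in the raw training file
raw_train <- read.csv('train.csv')
na_share <- colMeans(is.na(raw_train))
# Summary of the NA share for the columns that contain at least one NA
summary(na_share[na_share > 0])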
Another step is to remove the variables with near-zero variance, since they are not useful as predictors for the model.
# Check and remove near zero var columns
zerovar <- nearZeroVar(train_data,saveMetrics = T)
train_data <- train_data[,zerovar$nzv==FALSE]
dim(train_data)
## [1] 19622 53
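For reference, the columns flagged in this step can still be listed from the saved zerovar metrics (output omitted):
# Names of the columns flagged as having near-zero variance
rownames(zerovar)[zerovar$nzv]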
Before moving on to model fitting, the train_data set is split into two parts: one to train the different models and one to select (test) the best performing one. In this way, the testing set downloaded at the beginning is used only for the final prediction with the chosen model; using it for model selection as well would make its error estimate overly optimistic.
# From train_data create training and testing
inTrain <- createDataPartition(train_data$classe,p=0.7,list = FALSE)
train_model <- train_data[inTrain,]
test_model <- train_data[-inTrain,]
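A quick check of the resulting partition sizes (roughly 70% and 30% of the 19622 rows, output omitted):
# Sizes of the two partitions
dim(train_model)
dim(test_model)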
Because this is a classification problem, the machine learning models tested are Random Forest and Gradient Boosting Machine (GBM).
A Random Forest model is trained first, with 5-fold cross-validation to tune its parameters (fewer folds than the default of 10, to save computing time).
# Train a random forest
set.seed(123)
trControl <- trainControl(method = 'cv',number = 5)
fit_rf <- train(classe~., data = train_model,method='rf',trControl=trControl,prox=T)
fit_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = ..1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.68%
## Confusion matrix:
## A B C D E class.error
## A 3902 2 1 0 1 0.001024066
## B 20 2629 6 3 0 0.010910459
## C 0 14 2375 7 0 0.008764608
## D 0 1 24 2224 3 0.012433393
## E 0 0 3 8 2514 0.004356436
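The cross-validation results used to pick mtry can also be inspected directly from the caret object (output omitted):
# Cross-validated accuracy for each mtry value tried
fit_rf$results[, c('mtry', 'Accuracy', 'Kappa')]
# The mtry value selected for the final model
fit_rf$bestTune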
Now we verify the model on the test_model set to see how well it performs.
# Prediction on test data
pred_rf <- predict(fit_rf,test_model)
conf_rf <- confusionMatrix(pred_rf,test_model$classe)
conf_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1670 8 0 0 0
## B 4 1127 3 0 1
## C 0 4 1019 17 0
## D 0 0 4 947 1
## E 0 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9929
## 95% CI : (0.9904, 0.9949)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.991
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9976 0.9895 0.9932 0.9824 0.9982
## Specificity 0.9981 0.9983 0.9957 0.9990 1.0000
## Pos Pred Value 0.9952 0.9930 0.9798 0.9947 1.0000
## Neg Pred Value 0.9990 0.9975 0.9986 0.9966 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2838 0.1915 0.1732 0.1609 0.1835
## Detection Prevalence 0.2851 0.1929 0.1767 0.1618 0.1835
## Balanced Accuracy 0.9979 0.9939 0.9944 0.9907 0.9991
The following chart shows how the error decreases sharply as the number of trees increases.
# Plot error versus number of trees
plot(fit_rf$finalModel,main = 'Random Forest Model')
It is also possible to see the importance of the predictors used in the model.
# See predictors importance in the rf model
varImp(fit_rf)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.000
## pitch_forearm 59.515
## yaw_belt 54.281
## pitch_belt 44.923
## magnet_dumbbell_y 44.731
## magnet_dumbbell_z 42.217
## roll_forearm 41.142
## accel_dumbbell_y 21.784
## roll_dumbbell 16.691
## magnet_dumbbell_x 16.457
## accel_forearm_x 16.038
## magnet_belt_z 14.788
## accel_dumbbell_z 14.661
## total_accel_dumbbell 13.332
## magnet_forearm_z 13.021
## magnet_belt_y 12.067
## accel_belt_z 12.033
## gyros_belt_z 11.186
## yaw_arm 10.803
## magnet_belt_x 9.678
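The same ranking can be visualised with caret's plot method for variable importance:
# Plot the 20 most important predictors of the random forest model
plot(varImp(fit_rf), top = 20)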
The second model is a Gradient Boosting Machine (GBM), trained with the same 5-fold cross-validation control.
# Train gbm model
fit_gbm <- train(classe~.,data = train_model,method='gbm',trControl=trControl,verbose=F)
fit_gbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 43 had non-zero influence.
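The boosting parameters selected by cross-validation can be checked in the same way (output omitted):
# Tuning parameters chosen for the final GBM (number of trees, depth, shrinkage, ...)
fit_gbm$bestTune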
Now we verify the model on the test_model data set to see how well it performs.
# Prediction from gbm
pred_gbm <- predict(fit_gbm,test_model)
conf_gbm <- confusionMatrix(pred_gbm,test_model$classe)
conf_gbm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1642 38 0 2 0
## B 22 1058 29 6 12
## C 6 41 983 35 11
## D 3 2 12 913 8
## E 1 0 2 8 1051
##
## Overall Statistics
##
## Accuracy : 0.9596
## 95% CI : (0.9542, 0.9644)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9488
## Mcnemar's Test P-Value : 2.166e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9809 0.9289 0.9581 0.9471 0.9713
## Specificity 0.9905 0.9855 0.9809 0.9949 0.9977
## Pos Pred Value 0.9762 0.9388 0.9136 0.9733 0.9896
## Neg Pred Value 0.9924 0.9830 0.9911 0.9897 0.9936
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2790 0.1798 0.1670 0.1551 0.1786
## Detection Prevalence 0.2858 0.1915 0.1828 0.1594 0.1805
## Balanced Accuracy 0.9857 0.9572 0.9695 0.9710 0.9845
We can now compare the accuracy of the two models and see that Random Forest is the better performing one.
# Compare the two models
acc_rf <- round(conf_rf$overall[1],4)
acc_gbm <- round(conf_gbm$overall[1],4)
df <- data.frame('Measure' = c('Accuracy','Out of sample error estimate'),
'Random Forest'=c(acc_rf,1-acc_rf),
'GBM'=c(acc_gbm,1-acc_gbm))
print(df,row.names=F)
## Measure Random.Forest GBM
## Accuracy 0.9929 0.9596
## Out of sample error estimate 0.0071 0.0404
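Since both models were tuned with 5-fold cross-validation, caret's resamples() offers another way to compare them on the resampling results; note that this is only an approximate comparison here, because the folds are generated independently for each call to train() (a minimal sketch, output omitted):
# Compare the cross-validated performance of the two models
res_comp <- resamples(list(RandomForest = fit_rf, GBM = fit_gbm))
summary(res_comp)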
According to the accuracies measured above, the best model is the Random Forest. It can now be used to predict classe for the test data (20 different test cases).
# Predict on the 20 different test cases.
pred_final <- predict(fit_rf,test_data)
pred_final
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
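For convenience, the 20 predictions can be paired with their test-case identifiers; a minimal sketch, assuming the testing file contains the problem_id column of the original data set (output omitted):
# Pair each test case with its predicted classe
data.frame(problem_id = test_data$problem_id, prediction = pred_final)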