One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants are used to predict the manner in which they perform unilateral dumbbell biceps curls. The 5 possible methods are:
A: exactly according to the specification B: throwing the elbows to the front C: lifting the dumbbell only halfway D: lowering the dumbbell only halfway E: throwing the hips to the front
The library packages required and datasets were first loaded as below:
knitr::opts_chunk$set(warning=F)
set.seed(77)
library(corrplot)
library(rattle)
library(randomForest)
library(caret)
library(rpart.plot)
train_val <- read.csv("pml-training.csv")
test <- read.csv("pml-testing.csv")
The data structure were then examined:
names(train_val)
names(test)
# str(train_val)
# str(test)
It was found that the variables have either none missing entries or are mostly unfilled.
apply(X=train_val, MARGIN=2, FUN=function(x) {sum(is.na(x))})
## X user_name raw_timestamp_part_1
## 0 0 0
## raw_timestamp_part_2 cvtd_timestamp new_window
## 0 0 0
## num_window roll_belt pitch_belt
## 0 0 0
## yaw_belt total_accel_belt kurtosis_roll_belt
## 0 0 0
## kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## 0 0 0
## skewness_roll_belt.1 skewness_yaw_belt max_roll_belt
## 0 0 19216
## max_picth_belt max_yaw_belt min_roll_belt
## 19216 0 19216
## min_pitch_belt min_yaw_belt amplitude_roll_belt
## 19216 0 19216
## amplitude_pitch_belt amplitude_yaw_belt var_total_accel_belt
## 19216 0 19216
## avg_roll_belt stddev_roll_belt var_roll_belt
## 19216 19216 19216
## avg_pitch_belt stddev_pitch_belt var_pitch_belt
## 19216 19216 19216
## avg_yaw_belt stddev_yaw_belt var_yaw_belt
## 19216 19216 19216
## gyros_belt_x gyros_belt_y gyros_belt_z
## 0 0 0
## accel_belt_x accel_belt_y accel_belt_z
## 0 0 0
## magnet_belt_x magnet_belt_y magnet_belt_z
## 0 0 0
## roll_arm pitch_arm yaw_arm
## 0 0 0
## total_accel_arm var_accel_arm avg_roll_arm
## 0 19216 19216
## stddev_roll_arm var_roll_arm avg_pitch_arm
## 19216 19216 19216
## stddev_pitch_arm var_pitch_arm avg_yaw_arm
## 19216 19216 19216
## stddev_yaw_arm var_yaw_arm gyros_arm_x
## 19216 19216 0
## gyros_arm_y gyros_arm_z accel_arm_x
## 0 0 0
## accel_arm_y accel_arm_z magnet_arm_x
## 0 0 0
## magnet_arm_y magnet_arm_z kurtosis_roll_arm
## 0 0 0
## kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm
## 0 0 0
## skewness_pitch_arm skewness_yaw_arm max_roll_arm
## 0 0 19216
## max_picth_arm max_yaw_arm min_roll_arm
## 19216 19216 19216
## min_pitch_arm min_yaw_arm amplitude_roll_arm
## 19216 19216 19216
## amplitude_pitch_arm amplitude_yaw_arm roll_dumbbell
## 19216 19216 0
## pitch_dumbbell yaw_dumbbell kurtosis_roll_dumbbell
## 0 0 0
## kurtosis_picth_dumbbell kurtosis_yaw_dumbbell skewness_roll_dumbbell
## 0 0 0
## skewness_pitch_dumbbell skewness_yaw_dumbbell max_roll_dumbbell
## 0 0 19216
## max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell
## 19216 0 19216
## min_pitch_dumbbell min_yaw_dumbbell amplitude_roll_dumbbell
## 19216 0 19216
## amplitude_pitch_dumbbell amplitude_yaw_dumbbell total_accel_dumbbell
## 19216 0 0
## var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell
## 19216 19216 19216
## var_roll_dumbbell avg_pitch_dumbbell stddev_pitch_dumbbell
## 19216 19216 19216
## var_pitch_dumbbell avg_yaw_dumbbell stddev_yaw_dumbbell
## 19216 19216 19216
## var_yaw_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 19216 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 0 0 0
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## 0 0 0
## magnet_dumbbell_z roll_forearm pitch_forearm
## 0 0 0
## yaw_forearm kurtosis_roll_forearm kurtosis_picth_forearm
## 0 0 0
## kurtosis_yaw_forearm skewness_roll_forearm skewness_pitch_forearm
## 0 0 0
## skewness_yaw_forearm max_roll_forearm max_picth_forearm
## 0 19216 19216
## max_yaw_forearm min_roll_forearm min_pitch_forearm
## 0 19216 19216
## min_yaw_forearm amplitude_roll_forearm amplitude_pitch_forearm
## 0 19216 19216
## amplitude_yaw_forearm total_accel_forearm var_accel_forearm
## 0 0 19216
## avg_roll_forearm stddev_roll_forearm var_roll_forearm
## 19216 19216 19216
## avg_pitch_forearm stddev_pitch_forearm var_pitch_forearm
## 19216 19216 19216
## avg_yaw_forearm stddev_yaw_forearm var_yaw_forearm
## 19216 19216 19216
## gyros_forearm_x gyros_forearm_y gyros_forearm_z
## 0 0 0
## accel_forearm_x accel_forearm_y accel_forearm_z
## 0 0 0
## magnet_forearm_x magnet_forearm_y magnet_forearm_z
## 0 0 0
## classe
## 0
The variables with mostly missing values in the test set were first dropped:
train_val <- train_val[, colSums(is.na(test)) == 0]
test <- test[, colSums(is.na(test)) == 0]
The cleaning was followed by variables that have very few unique values relative to the number of sample:
nzv_vars <- nearZeroVar(train_val)
train_val <- train_val[, -nzv_vars]
test <- test[, -nzv_vars]
And also variables that only serve for identification purposes:
train_val <- train_val[, -(1:5)]
test <- test[, -(1:5)]
The data type of the variable to be predicted was converted into a categorical variable:
class(train_val$classe)
## [1] "character"
train_val$classe <- factor(train_val$classe)
The cleaned train dataset was split into datasets for training (70%) and validation (30%).
train_id <- createDataPartition(y=train_val$classe, p=0.70, list=F)
train <- train_val[train_id, ]
validation <- train_val[-train_id, ]
The correlations between all variables except response variable in the train set were then visualized:
corr_mat <- cor(train[, -ncol(train)])
corrplot(corr=corr_mat, method="color", type="lower", order="FPC", tl.cex=0.5, tl.col=rgb(0, 0, 0))
And the variables with correlations higher than 0.8 were listed as below:
findCorrelation(x=corr_mat, cutoff=0.8, names=T)
## [1] "accel_belt_z" "roll_belt" "accel_belt_y" "accel_dumbbell_z"
## [5] "accel_belt_x" "pitch_belt" "accel_arm_x" "accel_dumbbell_x"
## [9] "magnet_arm_y" "gyros_forearm_y" "gyros_dumbbell_x" "gyros_dumbbell_z"
## [13] "gyros_arm_x"
A Random Forest and a Generalized Boosted Model were chosen for the activity recognition prediction due to their robustness to correlated covariates and outliers. Besides, the feature selection could also be taken care of automatically. The model that achieves better accuracy using 5-fold cross validation will be used to predict the test set classe. The results were visualized using a confusion matrix.
The random forest model was trained as below:
controlRF <- trainControl(method="cv", number=5)
modelRF <- train(classe~., data=train, method="rf", trControl=controlRF)
modelRF$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.27%
## Confusion matrix:
## A B C D E class.error
## A 3904 1 0 0 1 0.0005120328
## B 8 2648 2 0 0 0.0037622272
## C 0 6 2386 4 0 0.0041736227
## D 0 0 7 2244 1 0.0035523979
## E 0 1 0 6 2518 0.0027722772
The model was then evaluated using the validation dataset:
predictRF <- predict(object=modelRF, newdata=validation)
confmatRF <- confusionMatrix(predictRF, validation$classe)
confmatRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 3 0 0 0
## B 0 1133 0 0 0
## C 0 2 1026 3 0
## D 0 1 0 961 1
## E 0 0 0 0 1081
##
## Overall Statistics
##
## Accuracy : 0.9983
## 95% CI : (0.9969, 0.9992)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9979
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9947 1.0000 0.9969 0.9991
## Specificity 0.9993 1.0000 0.9990 0.9996 1.0000
## Pos Pred Value 0.9982 1.0000 0.9952 0.9979 1.0000
## Neg Pred Value 1.0000 0.9987 1.0000 0.9994 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1925 0.1743 0.1633 0.1837
## Detection Prevalence 0.2850 0.1925 0.1752 0.1636 0.1837
## Balanced Accuracy 0.9996 0.9974 0.9995 0.9982 0.9995
The estimated out-of-sample error was:
round(1 - as.numeric(confmatRF$overall['Accuracy']), 4)
## [1] 0.0017
The confusion matrix was visualized below:
plot(confmatRF$table, col=confmatRF$byClass, main=paste("RF Accuracy:", round(confmatRF$overall['Accuracy'], 4)))
The GBM model was trained as below:
controlGBM <- trainControl(method="cv", number=5)
modelGBM <- train(classe~., data=train, method="gbm", trControl=controlGBM, verbose=F)
modelGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.
And similarly evaluated using the validation dataset:
predictGBM <- predict(object=modelGBM, newdata=validation)
confmatGBM <- confusionMatrix(predictGBM, validation$classe)
confmatGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1670 11 0 1 0
## B 3 1117 6 5 4
## C 0 10 1017 7 2
## D 1 1 3 948 6
## E 0 0 0 3 1070
##
## Overall Statistics
##
## Accuracy : 0.9893
## 95% CI : (0.9863, 0.9918)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9865
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9976 0.9807 0.9912 0.9834 0.9889
## Specificity 0.9972 0.9962 0.9961 0.9978 0.9994
## Pos Pred Value 0.9929 0.9841 0.9817 0.9885 0.9972
## Neg Pred Value 0.9990 0.9954 0.9981 0.9968 0.9975
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2838 0.1898 0.1728 0.1611 0.1818
## Detection Prevalence 0.2858 0.1929 0.1760 0.1630 0.1823
## Balanced Accuracy 0.9974 0.9884 0.9937 0.9906 0.9941
The estimated out-of-sample error was:
round(1 - as.numeric(confmatGBM$overall['Accuracy']), 4)
## [1] 0.0107
The confusion matrix was visualized below:
plot(confmatGBM$table, col=confmatGBM$byClass, main=paste("GBM Accuracy:", round(confmatGBM$overall['Accuracy'], 4)))
Since the Random Forest model yields higher accuracy, it was used to predict the classe of the test set:
predict(object=modelRF, newdata=test)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E