This project is devoted to building a prediction model for the quality of weight lifting exercises performed by a group of six young participants. The target activity quality is the outcome variable classe; the remaining variables are measurements from body-worn activity monitors.
The data obtained from the activity monitors are cleaned and analyzed, a prediction model is built with extreme gradient boosting and evaluated with cross-validation, and the out-of-sample error is estimated. The model is then used to predict 20 different test cases.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. This report is devoted to predicting the manner in which people do exercises, captured by the classe variable in the training set. The data were obtained from accelerometers on the belt, forearm, arm, and dumbbell of six young, healthy participants. They were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes.
Here we install all required libraries using pacman, a convenient R package management tool.
# R package management tool
if (!require("pacman")) {
install.packages("pacman", repos = "http://cran.us.r-project.org")
}
library(pacman)
# Load required libraries
p_load(plyr,dplyr,ggplot2,xgboost,knitr,caret,corrplot,rpart,rpart.plot,e1071,data.table)
The original training dataset consists of 19622 observations of 160 variables; the testing dataset has 20 observations of 160 variables. The target outcome is the classe variable, the last column of the training set. We take a look at this variable and count the observations corresponding to each type of exercise (A, B, C, D, E).
train.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
train.dl <- file.path(getwd(), "pml-training.csv")
download.file(train.url, train.dl, method = "curl")
train <- read.csv("./pml-training.csv", na.strings=c("NA", "", "#DIV/0!"))
dim(train)
## [1] 19622 160
# names(train)
test.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
test.dl <- file.path(getwd(), "pml-testing.csv")
download.file(test.url, test.dl, method = "curl")
test <- read.csv("./pml-testing.csv", na.strings=c("NA", "", "#DIV/0!"))
dim(test)
## [1] 20 160
# names(test)
# To find # of observations for each type of activity (A,B,C,D,E)
table(train$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
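As an optional visual check (a minimal sketch using the already loaded ggplot2 package), the same class distribution can be plotted:
# bar chart of the five exercise classes in the training data
ggplot(train, aes(x = classe)) + geom_bar() + labs(x = "classe", y = "count")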
# outcome variable
clas.se = train[, "classe"]
outcome = clas.se
# A, B, C, D, E character levels of outcome
levels(outcome)
## [1] "A" "B" "C" "D" "E"
# relabel the factor levels as 1:5 (xgboost requires numeric class labels)
levels(outcome) = 1:5
str(outcome)
## Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
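xgboost expects integer class labels for its multiclass objectives, and the letter-to-number mapping matters when predictions are translated back to A-E later. A quick sanity check (a minimal sketch) cross-tabulates the relabelled factor against the original letters:
# A..E should map one-to-one onto levels 1..5
table(clas.se, outcome)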
Analysis of the dataset shows that columns 1 to 7 contain only bookkeeping information (row id, user name, timestamps, and window indicators) that is irrelevant for building the prediction model, so they are dropped. All columns containing missing values are removed; in this dataset these columns consist almost entirely of NAs. Finally, as mentioned in the Synopsis, the subset of columns corresponding to the sensors on the belt, forearm, arm, and dumbbell is extracted by keyword.
# removing first 7 columns
train_c <- train[, -c(1:7)]
test_c <- test[, -c(1:7)]
# rate of NAs in each column
rate_na <- apply(train_c, 2, function(x) sum(is.na(x)))/dim(train_c)[1]
# keep only columns without missing values
training <- train_c[(rate_na==0)]
testing <- test_c[(rate_na==0)]
dim(training)
## [1] 19622 53
# Filter columns for the belt, forearm, arm, and dumbbell sensors plus the classe outcome
filt1 <- grepl("arm|belt|dumbell|classe", names(training))
# the same filter without the classe outcome (predictors only)
filt2 <- grepl("arm|belt|dumbell", names(training))
train_f <- training[, filt1]
test_f <- testing[, filt1]
train_wo <- training[, filt2]
test_wo <- testing[, filt2]
dim(train_f)
## [1] 19622 40
names(train_f)
## [1] "roll_belt" "pitch_belt" "yaw_belt"
## [4] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [7] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [10] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [13] "magnet_belt_z" "roll_arm" "pitch_arm"
## [16] "yaw_arm" "total_accel_arm" "gyros_arm_x"
## [19] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
## [22] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
## [25] "magnet_arm_y" "magnet_arm_z" "roll_forearm"
## [28] "pitch_forearm" "yaw_forearm" "total_accel_forearm"
## [31] "gyros_forearm_x" "gyros_forearm_y" "gyros_forearm_z"
## [34] "accel_forearm_x" "accel_forearm_y" "accel_forearm_z"
## [37] "magnet_forearm_x" "magnet_forearm_y" "magnet_forearm_z"
## [40] "classe"
The extracted training dataset now contains 39 predictor variables plus the classe outcome.
Variables with near-zero variability cannot contribute to the prediction model, so we check for them and would remove any that are found.
# Find variables with zero variance
NZV = nearZeroVar(train_f, saveMetrics=TRUE)
The result of the analysis confirms that none of the remaining variables has near-zero variance.
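Had any near-zero-variance variables been found, they could be dropped as sketched below (an illustration only; nearZeroVar with its default arguments returns the indices of the offending columns):
# drop near-zero-variance columns by name, if any exist
nzv_idx <- nearZeroVar(train_f)
if (length(nzv_idx) > 0) {
  nzv_names <- names(train_f)[nzv_idx]
  train_f <- train_f[, !(names(train_f) %in% nzv_names)]
  test_f <- test_f[, !(names(test_f) %in% nzv_names)]
}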
To check the correlation between variables we plot the correlation matrix. As the plot shows, most variables are only weakly correlated, which justifies proceeding without additional preprocessing such as Principal Component Analysis.
# Correlation plot
corrplot(cor(train_wo), order ="hclust", type = "upper")
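As an additional optional check (a minimal sketch; the 0.8 cutoff is an assumption, not part of the original analysis), caret's findCorrelation can count how many predictors would be flagged as highly correlated:
# indices of predictors with pairwise absolute correlation above the cutoff
high_corr <- findCorrelation(cor(train_wo), cutoff = 0.8)
length(high_corr)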
XGBoost works with numeric matrices, so we convert our data frames into matrices and the outcome into integer labels.
train.m <- as.matrix(train_wo)
mode(train.m) <- "numeric"
test.m <- as.matrix(test_wo)
mode(test.m) <- "numeric"
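# zero-based integer class labels, as required by the multi:softprob objective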
label <- as.matrix(as.integer(outcome)-1)
# list of xgb parameters
xgb_par<-list("objective" = "multi:softprob", "num_class" = 5, "eval_metric" = "merror",
"nthread" = 8, "max_depth" = 16, "eta" = 0.3, "gamma" = 0, "subsample" = 1,
"colsample_bytree" = 1, "min_child_weight" = 12)
set.seed(543)
# 5-fold cross-validation
cv7 <- xgb.cv(param=xgb_par, data=train.m, label=label, nfold=5, nrounds=100, prediction=TRUE, verbose=FALSE)
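The cross-validated multiclass error of the final boosting round gives a direct estimate of the out-of-sample error. A minimal sketch, assuming the older xgboost API used throughout this report (where xgb.cv returns the evaluation log as the data.table cv7$dt; newer versions expose the same information as cv7$evaluation_log$test_merror_mean):
# estimated out-of-sample error rate: mean CV merror of the last round
tail(cv7$dt[, test.merror.mean], 1)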
We reshape the cross-validated class probabilities into a matrix with one column per class, pick the most probable class for each observation, and compute the confusion matrix against the true labels.
cv = matrix(cv7$pred, nrow=length(cv7$pred)/5, ncol=5)
cv = max.col(cv, "last")
# get confusion matrix
confusionMatrix(factor(label+1), factor(cv))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 5566 8 4 2 0
## 2 9 3776 12 0 0
## 3 0 26 3380 16 0
## 4 0 0 21 3189 6
## 5 0 1 0 6 3600
##
## Overall Statistics
##
## Accuracy : 0.9943
## 95% CI : (0.9932, 0.9953)
## No Information Rate : 0.2841
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9928
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9984 0.9908 0.9892 0.9925 0.9983
## Specificity 0.9990 0.9987 0.9974 0.9984 0.9996
## Pos Pred Value 0.9975 0.9945 0.9877 0.9916 0.9981
## Neg Pred Value 0.9994 0.9978 0.9977 0.9985 0.9996
## Prevalence 0.2841 0.1942 0.1741 0.1637 0.1838
## Detection Rate 0.2837 0.1924 0.1723 0.1625 0.1835
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9987 0.9947 0.9933 0.9954 0.9989
Here we fit the gradient boosting model on the full training data, using the number of boosting rounds that minimized the cross-validated error.
# boosting round with the minimum cross-validated multiclass error
min.merror.idx <- which.min(cv7$dt[, test.merror.mean])
# fit the final model on the full training set
rmft <- xgboost(param=xgb_par, data=train.m, label=label, nrounds=min.merror.idx, verbose=0)
pred_test <- predict(rmft, test.m)
head(pred_test, 20)
## [1] 6.892200e-04 9.966874e-01 1.921948e-03 3.340804e-04 3.674268e-04
## [6] 9.983462e-01 1.128621e-03 4.870083e-04 9.535551e-06 2.868428e-05
## [11] 1.552407e-02 9.681228e-01 1.451811e-02 6.649956e-04 1.170040e-03
## [16] 9.942462e-01 1.829086e-04 7.629342e-04 4.696007e-03 1.120004e-04
Since the multi:softprob objective returns five class probabilities for every test case, we reshape the prediction vector into a matrix with one row per test case, pick the most probable class, and map the class index back to the letters A-E.
# one column per class, one row per test case
predictions = matrix(pred_test, nrow=5, ncol=length(pred_test)/5)
predictions = t(predictions)
# index of the most probable class for each test case
predictions = max.col(predictions, "last")
# map class indices 1..5 back to letters A..E
prd = toupper(letters[predictions])
mod = xgb.dump(rmft, with.stats=TRUE)
# get all feature names
allnames = dimnames(train.m)[[2]]
# feature importance
importance_m <- xgb.importance(allnames, model=rmft)
print(xgb.plot.importance(importance_m))
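Besides the plot, the importance table itself can be inspected (a minimal sketch; xgb.importance returns a data.table with the Gain, Cover and Frequency of each feature):
# top ten most important features
head(importance_m, 10)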
The Extreme Gradient Boosting method is a very efficient implementation of gradient boosting. It provides excellent accuracy of about 99.4% on cross-validation, corresponding to an estimated out-of-sample error rate of less than 1%. Finally, we write the predicted class for each of the 20 test cases to a separate file for submission.
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}
pml_write_files(prd)