This project is devoted to building a prediction model for the quality of weight lifting exercises performed by a group of six young participants. The target activity quality is the outcome variable classe; the remaining variables are measurements from body-worn activity monitors.
The data obtained from the activity monitors are cleaned and analyzed, a prediction model is built with extreme gradient boosting and evaluated with cross-validation, and the out-of-sample error is estimated. The model is then used to predict 20 different test cases.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. This report is devoted to predicting the manner in which people do exercises, captured by the classe variable in the training set. The data were obtained from accelerometers on the belt, forearm, arm, and dumbbell of six young, healthy participants. They were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes.
Here we install all required libraries using pacman, a convenient R package management tool.
# R package management tool
if (!require("pacman")) {
install.packages("pacman", repos = "http://cran.us.r-project.org")
}
library(pacman)
# Load required libraries
p_load(plyr,dplyr,ggplot2,xgboost,knitr,caret,corrplot,rpart,rpart.plot,e1071,data.table)
The original training dataset consists of 19622 observations of 160 variables; the testing dataset has 20 observations of 160 variables. The target outcome is the classe variable, the last column of the training set. We take a look at this variable and count the observations corresponding to each type of exercise (A, B, C, D, E).
train.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
train.dl <- file.path(getwd(), "pml-training.csv")
download.file(train.url, train.dl, method = "curl")
train <- read.csv("./pml-training.csv", na.strings=c("NA", "", "#DIV/0!"))
dim(train)
## [1] 19622 160
# names(train)
test.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
test.dl <- file.path(getwd(), "pml-testing.csv")
download.file(test.url, test.dl, method = "curl")
test <- read.csv("./pml-testing.csv", na.strings=c("NA", "", "#DIV/0!"))
dim(test)
## [1] 20 160
# names(test)
# To find # of observations for each type of activity (A,B,C,D,E)
table(train$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
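As an optional visual check (a minimal sketch using the already loaded ggplot2 package), the same class distribution can be plotted:
# bar chart of the five exercise classes in the training data
ggplot(train, aes(x = classe)) + geom_bar() + labs(x = "classe", y = "count")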
# outcome variable
clas.se = train[, "classe"]
outcome = clas.se
# A, B, C, D, E character levels of outcome
levels(outcome)
## [1] "A" "B" "C" "D" "E"
# relabel the factor levels as 1:5 (xgboost requires numeric class labels)
levels(outcome) = 1:5
str(outcome)
## Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
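xgboost expects integer class labels for its multiclass objectives, and the letter-to-number mapping matters when predictions are translated back to A-E later. A quick sanity check (a minimal sketch) cross-tabulates the relabelled factor against the original letters:
# A..E should map one-to-one onto levels 1..5
table(clas.se, outcome)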
Analysis of the dataset shows that columns 1 to 7 contain only bookkeeping information (row id, user name, timestamps, and window indicators) that is irrelevant for building the prediction model, so they are dropped. All columns containing missing values are removed; in this dataset these columns consist almost entirely of NAs. Finally, as mentioned in the Synopsis, the subset of columns corresponding to the sensors on the belt, forearm, arm, and dumbbell is extracted by keyword.
# removing first 7 columns
train_c <- train[, -c(1:7)]
test_c <- test[, -c(1:7)]
# rate of NAs in each column
rate_na <- apply(train_c, 2, function(x) sum(is.na(x)))/dim(train_c)[1]
# keep only columns without missing values
training <- train_c[(rate_na==0)]
testing <- test_c[(rate_na==0)]
dim(training)
## [1] 19622 53
# Filter columns for the belt, forearm, arm, and dumbbell sensors plus the classe outcome
filt1 <- grepl("arm|belt|dumbell|classe", names(training))
# the same filter without the classe outcome (predictors only)
filt2 <- grepl("arm|belt|dumbell", names(training))
train_f <- training[, filt1]
test_f <- testing[, filt1]
train_wo <- training[, filt2]
test_wo <- testing[, filt2]
dim(train_f)
## [1] 19622 40
names(train_f)
## [1] "roll_belt" "pitch_belt" "yaw_belt"
## [4] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [7] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [10] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [13] "magnet_belt_z" "roll_arm" "pitch_arm"
## [16] "yaw_arm" "total_accel_arm" "gyros_arm_x"
## [19] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
## [22] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
## [25] "magnet_arm_y" "magnet_arm_z" "roll_forearm"
## [28] "pitch_forearm" "yaw_forearm" "total_accel_forearm"
## [31] "gyros_forearm_x" "gyros_forearm_y" "gyros_forearm_z"
## [34] "accel_forearm_x" "accel_forearm_y" "accel_forearm_z"
## [37] "magnet_forearm_x" "magnet_forearm_y" "magnet_forearm_z"
## [40] "classe"
The extracted training dataset now contains 39 predictor variables plus the classe outcome.
Variables with near-zero variability cannot contribute to the prediction model, so we check for them and would remove any that are found.
# Find variables with zero variance
NZV = nearZeroVar(train_f, saveMetrics=TRUE)
The result of the analysis confirms that none of the remaining variables has near-zero variance.
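Had any near-zero-variance variables been found, they could be dropped as sketched below (an illustration only; nearZeroVar with its default arguments returns the indices of the offending columns):
# drop near-zero-variance columns by name, if any exist
nzv_idx <- nearZeroVar(train_f)
if (length(nzv_idx) > 0) {
  nzv_names <- names(train_f)[nzv_idx]
  train_f <- train_f[, !(names(train_f) %in% nzv_names)]
  test_f <- test_f[, !(names(test_f) %in% nzv_names)]
}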
To check the correlation between variables we plot the correlation matrix. As the plot shows, most variables are only weakly correlated, which justifies proceeding without additional preprocessing such as Principal Component Analysis.
# Correlation plot
corrplot(cor(train_wo), order ="hclust", type = "upper")
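As an additional optional check (a minimal sketch; the 0.8 cutoff is an assumption, not part of the original analysis), caret's findCorrelation can count how many predictors would be flagged as highly correlated:
# indices of predictors with pairwise absolute correlation above the cutoff
high_corr <- findCorrelation(cor(train_wo), cutoff = 0.8)
length(high_corr)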
XGBoost works with numeric matrices, so we convert our data frames into matrices and the outcome into integer labels.
train.m <- as.matrix(train_wo)
mode(train.m) <- "numeric"
test.m <- as.matrix(test_wo)
mode(test.m) <- "numeric"
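# zero-based integer class labels, as required by the multi:softprob objective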
label <- as.matrix(as.integer(outcome)-1)
# list of xgb parameters
xgb_par<-list("objective" = "multi:softprob", "num_class" = 5, "eval_metric" = "merror",
"nthread" = 8, "max_depth" = 16, "eta" = 0.3, "gamma" = 0, "subsample" = 1,
"colsample_bytree" = 1, "min_child_weight" = 12)
set.seed(543)
# 5-fold cross-validation
cv7 <- xgb.cv(param=xgb_par, data=train.m, label=label, nfold=5, nrounds=100, prediction=TRUE, verbose=FALSE)
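The cross-validated multiclass error of the final boosting round gives a direct estimate of the out-of-sample error. A minimal sketch, assuming the older xgboost API used throughout this report (where xgb.cv returns the evaluation log as the data.table cv7$dt; newer versions expose the same information as cv7$evaluation_log$test_merror_mean):
# estimated out-of-sample error rate: mean CV merror of the last round
tail(cv7$dt[, test.merror.mean], 1)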
We reshape the cross-validated class probabilities into a matrix with one column per class, pick the most probable class for each observation, and compute the confusion matrix against the true labels.
cv = matrix(cv7$pred, nrow=length(cv7$pred)/5, ncol=5)
cv = max.col(cv, "last")
# get confusion matrix
confusionMatrix(factor(label+1), factor(cv))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 5566 8 4 2 0
## 2 9 3776 12 0 0
## 3 0 26 3380 16 0
## 4 0 0 21 3189 6
## 5 0 1 0 6 3600
##
## Overall Statistics
##
## Accuracy : 0.9943
## 95% CI : (0.9932, 0.9953)
## No Information Rate : 0.2841
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9928
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9984 0.9908 0.9892 0.9925 0.9983
## Specificity 0.9990 0.9987 0.9974 0.9984 0.9996
## Pos Pred Value 0.9975 0.9945 0.9877 0.9916 0.9981
## Neg Pred Value 0.9994 0.9978 0.9977 0.9985 0.9996
## Prevalence 0.2841 0.1942 0.1741 0.1637 0.1838
## Detection Rate 0.2837 0.1924 0.1723 0.1625 0.1835
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9987 0.9947 0.9933 0.9954 0.9989
Here we fit the gradient boosting model on the full training data, using the number of boosting rounds that minimized the cross-validated error.
# boosting round with the minimum cross-validated multiclass error
min.merror.idx <- which.min(cv7$dt[, test.merror.mean])
# fit the final model on the full training set
rmft <- xgboost(param=xgb_par, data=train.m, label=label, nrounds=min.merror.idx, verbose=0)
pred_test <- predict(rmft, test.m)
head(pred_test, 20)
## [1] 6.892200e-04 9.966874e-01 1.921948e-03 3.340804e-04 3.674268e-04
## [6] 9.983462e-01 1.128621e-03 4.870083e-04 9.535551e-06 2.868428e-05
## [11] 1.552407e-02 9.681228e-01 1.451811e-02 6.649956e-04 1.170040e-03
## [16] 9.942462e-01 1.829086e-04 7.629342e-04 4.696007e-03 1.120004e-04
Since the multi:softprob objective returns five class probabilities for every test case, we reshape the prediction vector into a matrix with one row per test case, pick the most probable class, and map the class index back to the letters A-E.
# one column per class, one row per test case
predictions = matrix(pred_test, nrow=5, ncol=length(pred_test)/5)
predictions = t(predictions)
# index of the most probable class for each test case
predictions = max.col(predictions, "last")
# map class indices 1..5 back to letters A..E
prd = toupper(letters[predictions])
mod = xgb.dump(rmft, with.stats=TRUE)
# get all feature names
allnames = dimnames(train.m)[[2]]
# feature importance
importance_m <- xgb.importance(allnames, model=rmft)
print(xgb.plot.importance(importance_m))
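Besides the plot, the importance table itself can be inspected (a minimal sketch; xgb.importance returns a data.table with the Gain, Cover and Frequency of each feature):
# top ten most important features
head(importance_m, 10)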
The Extreme Gradient Boosting method is a very efficient implementation of gradient boosting. It provides excellent accuracy of about 99.4% on cross-validation, corresponding to an estimated out-of-sample error rate of less than 1%. Finally, we write the predicted class for each of the 20 test cases to a separate file for submission.
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}
pml_write_files(prd)