Introduction

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of six participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways, recorded as the classes A to E of the classe variable.

More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).

Data

For this project, the training and test data sets are the files pml-training.csv and pml-testing.csv, respectively. Their download URLs are given in the Data Processing code.

Goal

The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

Your submission for the Peer Review portion should consist of a link to a Github repo with your R markdown and compiled HTML file describing your analysis. Please constrain the text of the writeup to < 2000 words and the number of figures to be less than 5. It will make it easier for the graders if you submit a repo with a gh-pages branch so the HTML page can be viewed online (and you always want to make it easy on graders).

Results

Data Processing

# Required libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(rpart.plot) 
## Loading required package: rpart
library(e1071)  # provides the skewness() function

library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
# Setting seed for reproducibility 
set.seed(123)
# Setting parallel calculations
cluster <- makeCluster(detectCores()-1) 
registerDoParallel(cluster)
# Create the Data directory
if(!dir.exists('./Data')){dir.create('./Data')}

# Create the Figures directory
if(!dir.exists('./Figures')){dir.create('./Figures')}

# Download the train data set (only if it is not already present)
if(!file.exists('./Data/pml-training.csv')){
  fileUrl <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
  download.file(fileUrl, destfile='./Data/pml-training.csv', mode = 'wb')
}

# Download the test data set (only if it is not already present)
if(!file.exists('./Data/pml-testing.csv')){
  fileUrl <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
  download.file(fileUrl, destfile='./Data/pml-testing.csv', mode = 'wb')
}

# Load train data set
mydata_train <- read.csv("Data/pml-training.csv", na.strings=c("NA", ""))

# Load test data set
mydata_test <- read.csv("Data/pml-testing.csv", na.strings=c("NA", ""))

# Check dimension of data sets
dim(mydata_train); dim(mydata_test)
## [1] 19622   160
## [1]  20 160

The training and test data sets have \(19622\) and \(20\) observations, respectively. Both have \(160\) variables.

# Check structure of train data set 
str(mydata_train)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : Factor w/ 396 levels "-0.016850","-0.021024",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_belt     : Factor w/ 316 levels "-0.021887","-0.060755",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_belt       : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt      : Factor w/ 394 levels "-0.003095","-0.010002",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt.1    : Factor w/ 337 levels "-0.005928","-0.005960",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_belt       : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : Factor w/ 3 levels "#DIV/0!","0.00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : Factor w/ 329 levels "-0.02438","-0.04190",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_arm      : Factor w/ 327 levels "-0.00484","-0.01311",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_arm        : Factor w/ 394 levels "-0.01548","-0.01749",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_arm       : Factor w/ 330 levels "-0.00051","-0.00696",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_arm      : Factor w/ 327 levels "-0.00184","-0.01185",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_arm        : Factor w/ 394 levels "-0.00311","-0.00562",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : Factor w/ 397 levels "-0.0035","-0.0073",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_dumbbell : Factor w/ 400 levels "-0.0163","-0.0233",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_dumbbell   : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_dumbbell  : Factor w/ 400 levels "-0.0082","-0.0096",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_dumbbell : Factor w/ 401 levels "-0.0053","-0.0084",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_dumbbell   : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

We can see from the above that some variables (or predictors) contain a large number of NA values. Let’s check the presence of NAs in detail:

# Number of variables with strictly NA values
sum(colSums(is.na(mydata_train))==dim(mydata_train)[1])
## [1] 0
# Number of variables with at least 95% of NA values
sum(colSums(is.na(mydata_train))>=0.95*dim(mydata_train)[1])
## [1] 100
# Number of variables without NA values
sum(colSums(!is.na(mydata_train))==dim(mydata_train)[1])
## [1] 60

No predictor consists strictly of NAs. However, \(100\) variables contain at least \(95\)% NAs, while \(60\) variables have no NAs. We therefore keep only these \(60\) complete variables. Moreover, we also omit the first seven variables (row index, user name, timestamps, and window identifiers), which identify the recording rather than measure the movement and should not drive the prediction of the outcome classe:

# Find variables (without NAs)
NoNA_Var<- which(colSums(!is.na(mydata_train))==dim(mydata_train)[1])
# Take into account the above variables without the first seven variables
mydata_train <- mydata_train %>% select(NoNA_Var) %>% select(-c(1:7))
mydata_test <- mydata_test %>% select(NoNA_Var) %>% select(-c(1:7))
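
The NA screening above can be illustrated on a small scale. The following is a minimal sketch in base R, assuming a hypothetical toy data frame (`toy` and its values are not part of the project data). It computes the fraction of missing values per column and keeps the complete columns by name, which also makes it easy to apply the same selection to a second data set.

```r
# Toy data frame (hypothetical values) with one mostly-missing column
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 5),          # 75% missing
  c = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Fraction of missing values in each column
na_fraction <- colMeans(is.na(toy))

# Names of the columns without any NA
complete_cols <- names(na_fraction)[na_fraction == 0]

# Keep only the complete columns (selection by name, not by position)
toy_complete <- toy[, complete_cols, drop = FALSE]

print(na_fraction)
print(complete_cols)
```

Selecting by column name rather than by position is a design choice of this sketch; it guards against the two data sets having columns in a different order.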

Let’s check variables with very low variance:

# Find variables with near-zero variance
nearZeroVar(mydata_train)
## integer(0)

No variables with near-zero variance are found in our train data set. Let’s also omit highly correlated variables (absolute pairwise correlation above 0.9) as follows:

# Correlation values between variables 
correlations<-  cor(select(mydata_train,-classe))
# Cut off correlation over 0.9
highCorr<-  findCorrelation(correlations, cutoff=  0.9)
# Subset data with our correlation limit
mydata_train<- mydata_train %>% select(-highCorr)
mydata_test<- mydata_test %>% select(-highCorr)
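
To see what a correlation cutoff does, here is a minimal sketch in base R on simulated data (the variables `x1`, `x2`, `x3` are hypothetical). It only lists the pairs whose absolute correlation exceeds the cutoff; caret's `findCorrelation()` goes one step further and chooses which member of each pair to drop, which this sketch does not reproduce.

```r
set.seed(1)

# Simulated predictors: x2 is nearly a copy of x1, x3 is independent noise
x1 <- rnorm(100)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(100, sd = 0.05),
  x3 = rnorm(100)
)

cor_mat <- cor(toy)

# Keep only the upper triangle so that each pair appears once
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Positions of the pairs with absolute correlation above the cutoff
high_pairs <- which(abs(cor_mat) > 0.9, arr.ind = TRUE)

high_corr_pairs <- data.frame(
  var1 = rownames(cor_mat)[high_pairs[, "row"]],
  var2 = colnames(cor_mat)[high_pairs[, "col"]],
  correlation = cor_mat[high_pairs],
  stringsAsFactors = FALSE
)
print(high_corr_pairs)
```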

Many predictive models perform better when predictors are on a common scale and roughly symmetric. Let’s therefore center and scale our data sets and apply a Box-Cox transformation to reduce skewness:

# Preprocessing: scaling, skewness (without the outcome 'classe') 
trans<-  preProcess(select(mydata_train,-classe),method=  c('center','scale','BoxCox'))

# Transformed data (train and test) sets 
mydata_train_trans<- predict(trans,select(mydata_train,-classe))
mydata_test_trans<- predict(trans,select(mydata_test,-classe))
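
The important point in the step above is that the transformation is estimated on the train data only and then applied unchanged to the test data. A minimal base R sketch of that idea, restricted to centering and scaling (the Box-Cox part is left out, and `train_x` and `test_x` are hypothetical toy data):

```r
# Toy train and test sets (hypothetical values)
train_x <- data.frame(v = c(2, 4, 6, 8))
test_x  <- data.frame(v = c(5, 10))

# Estimate the centering and scaling parameters on the train data only
mu    <- sapply(train_x, mean)
sigma <- sapply(train_x, sd)

# Apply the same parameters to both sets
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled  <- scale(test_x,  center = mu, scale = sigma)

print(train_scaled[, "v"])
print(test_scaled[, "v"])
```

Re-estimating the mean and standard deviation on the test set would put the two sets on different scales, so the model would see inputs it was not trained on.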

We also drop the remaining highly skewed variables (skewness above 10) as follows:

# Variables with high skewness
Skew_var<- apply(mydata_train_trans,2,skewness) > 10
# Omit skew variables on data sets  
mydata_train_trans<- mydata_train_trans[!Skew_var]
mydata_test_trans<- mydata_test_trans[!Skew_var]
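
For reference, skewness can be computed by hand. The sketch below uses the plain moment-based definition (third central moment divided by the variance to the power 3/2); `sample_skewness` is a hypothetical helper, and its result may differ slightly from `e1071::skewness()`, which applies a sample-size correction by default.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
sample_skewness <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^(3/2)
}

symmetric  <- c(-2, -1, 0, 1, 2)     # no skew
right_tail <- c(1, 1, 1, 1, 100)     # one large value on the right

print(sample_skewness(symmetric))
print(sample_skewness(right_tail))
```

A value near zero indicates a symmetric distribution, and a large positive value indicates a long right tail, which is what the cutoff of 10 used above is meant to catch.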

# Add the 'classe' column 
mydata_train_trans<- mydata_train_trans %>% mutate(classe=mydata_train$classe)
mydata_test_trans<- mydata_test_trans %>% mutate(classe=mydata_test$classe)

# Check our transformed data sets 
dim(mydata_train_trans)
## [1] 19622    43
dim(mydata_test_trans)
## [1] 20 43

We reduced our initial data sets to \(43\) variables (including the outcome classe).

Data Split
The train data set is split into a subtrain part (to build the predictive model) and a validation part (to check the accuracy). The test data set is used last, to predict the required outcomes for this project.

# Indexes of splitting (subtrain 80% and validation 20% of the train data set)
Ind_part <- createDataPartition(y=mydata_train_trans$classe, p=0.8, list=F)
# Split into sub_training and validation parts
mydata_sub_train<- mydata_train_trans[Ind_part,] 
mydata_valid<- mydata_train_trans[-Ind_part,]

# Dimension of data sets
dim(mydata_sub_train)
## [1] 15699    43
dim(mydata_valid)
## [1] 3923   43

The subtrain and validation data sets contain \(80\)% and \(20\)% of the train data set, respectively.
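
The split above is stratified by the outcome, so each class keeps roughly the same proportion in both parts. A minimal base R sketch of a stratified 80/20 split, assuming a hypothetical label vector (this illustrates the idea and is not the internal code of `createDataPartition()`):

```r
set.seed(123)

# Hypothetical outcome with two unbalanced classes
labels <- factor(rep(c("A", "B"), times = c(10, 5)))

# Within each class, draw 80% of the row indices for the subtrain part
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))
train_idx <- sort(unname(train_idx))

# The remaining rows form the validation part
valid_idx <- setdiff(seq_along(labels), train_idx)

print(table(labels[train_idx]))
print(table(labels[valid_idx]))
```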

Predictive Models
The present project is a classification problem, for which rule-based (if-then) models such as trees are a natural fit. We first try a Decision Tree and, if necessary, a Random Forest, both through the caret package.

We use the k-fold cross validation resampling technique in this study:

# Type of resampling / 5-fold cross-validation / Parallel calculations
control  <-  trainControl(method= 'cv', number=  5, allowParallel = T)
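
As a reminder of what 5-fold cross validation does, the sketch below assigns the rows of a hypothetical data set of 20 rows to 5 folds. Each fold is held out once while the model would be fitted on the other four. This is only an illustration of the resampling scheme, with no model fitted, and it is not caret's own implementation.

```r
set.seed(123)

n <- 20   # hypothetical number of rows
k <- 5    # number of folds

# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  held_out <- which(fold_id == i)   # rows used to assess the model
  fit_rows <- which(fold_id != i)   # rows used to fit the model
  cat("fold", i, ": fit on", length(fit_rows),
      "rows, assess on", length(held_out), "rows\n")
}
```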

Decision Tree

# Build the predictive model on the subtrain data set
DT_model<- train(classe~. , data=mydata_sub_train, method= 'rpart')
# Predict on the validation data set
prediction<- predict(DT_model, mydata_valid)
# Confusion matrix on the validation data set
confusionMatrix(prediction, mydata_valid$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1015  308  304  258  149
##          B   16  262   26  132  132
##          C   68  155  280   62  168
##          D   17   34   74  155   23
##          E    0    0    0   36  249
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4999          
##                  95% CI : (0.4841, 0.5156)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.347           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9095  0.34519  0.40936  0.24106  0.34535
## Specificity            0.6370  0.90329  0.86014  0.95488  0.98876
## Pos Pred Value         0.4990  0.46127  0.38199  0.51155  0.87368
## Neg Pred Value         0.9465  0.85186  0.87335  0.86519  0.87026
## Prevalence             0.2845  0.19347  0.17436  0.16391  0.18379
## Detection Rate         0.2587  0.06679  0.07137  0.03951  0.06347
## Detection Prevalence   0.5185  0.14479  0.18685  0.07724  0.07265
## Balanced Accuracy      0.7732  0.62424  0.63475  0.59797  0.66706
# Plot the Decision Tree
png('./Figures/unnamed-chunk-13.png',width=800,height=600)

rpart.plot(DT_model$finalModel, main="Decision Tree", extra=102, under=T, faclen=0, cex = 0.5,branch = 1, type = 0, fallen.leaves = T)

dev.off()
## png 
##   2

Figure: Decision Tree (./Figures/unnamed-chunk-13.png)
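
The statistics reported for the Decision Tree can be recomputed by hand from its confusion matrix. The sketch below copies the counts from the output above into a matrix (the object names are chosen for this illustration only) and derives the overall accuracy, the per-class sensitivity, and the positive predictive value.

```r
# Decision Tree confusion matrix, copied from the output above
# (rows: predictions, columns: reference classes)
cm <- matrix(
  c(1015, 308, 304, 258, 149,
      16, 262,  26, 132, 132,
      68, 155, 280,  62, 168,
      17,  34,  74, 155,  23,
       0,   0,   0,  36, 249),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Overall accuracy: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correct predictions over the true members of each class
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions over all predictions of a class
pos_pred_value <- diag(cm) / rowSums(cm)

print(round(accuracy, 4))        # 0.4999
print(round(sensitivity, 4))
print(round(pos_pred_value, 4))
```

Class A is detected well (sensitivity of about 0.91), but only about half of the cases predicted as A are truly A, because many cases of the other classes are also assigned to A.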

The low accuracy (~\(50\)%) shows that a single Decision Tree is a poor classifier for the present study. Let’s check with the Random Forest model:

Random Forest

# Build the predictive model on the subtrain data set
RF_model<- train(classe~., data=mydata_sub_train, method = "rf", trControl = control)
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Predict on the validation data set
prediction <- predict(RF_model, mydata_valid)
# Confusion matrix on the validation data set
confusionMatrix(prediction, mydata_valid$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1115    8    0    0    0
##          B    1  746    3    0    0
##          C    0    5  680    4    0
##          D    0    0    1  638    1
##          E    0    0    0    1  720
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9939          
##                  95% CI : (0.9909, 0.9961)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9923          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9829   0.9942   0.9922   0.9986
## Specificity            0.9971   0.9987   0.9972   0.9994   0.9997
## Pos Pred Value         0.9929   0.9947   0.9869   0.9969   0.9986
## Neg Pred Value         0.9996   0.9959   0.9988   0.9985   0.9997
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2842   0.1902   0.1733   0.1626   0.1835
## Detection Prevalence   0.2863   0.1912   0.1756   0.1631   0.1838
## Balanced Accuracy      0.9981   0.9908   0.9957   0.9958   0.9992
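
The accuracy of the Random Forest and the corresponding out-of-sample error estimate can be checked from the confusion matrix above. The sketch below uses the diagonal counts and the validation set size from that output. The confidence interval is computed with `binom.test()`, on the assumption that this exact binomial interval is what the confusion matrix output reports.

```r
# Correctly classified validation cases (diagonal of the matrix above)
correct <- 1115 + 746 + 680 + 638 + 720
total   <- 3923                      # size of the validation set

accuracy  <- correct / total
oos_error <- 1 - accuracy            # estimated out-of-sample error

# Exact binomial (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- binom.test(correct, total)$conf.int

print(round(accuracy, 4))            # 0.9939
print(round(100 * oos_error, 2))     # 0.61 (in percent)
print(round(ci, 4))
```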

As expected, the Random Forest is a better predictive model than the Decision Tree: its accuracy on the validation set is much higher (99.4%). Let’s now refit the Random Forest on a reduced set of \(30\) predictors taken from its variable-importance table (to reduce computing cost):

# Names of the first 30 predictors in the varImp() table (taken as returned, without re-sorting)
Imp_vars<-rownames(varImp(RF_model)$importance)[1:30]
# Build the predictive model on the subtrain data set (with the most important predictors)
RF_model_2<- train(classe~., data=mydata_sub_train[c(Imp_vars,'classe')], method = "rf", trControl = control)
# Predict on the validation data set
prediction <- predict(RF_model_2, mydata_valid[c(Imp_vars,'classe')])
# Confusion matrix on the validation data set
confusionMatrix(prediction, mydata_valid$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1112   11    1    2    0
##          B    2  740    4    0    0
##          C    1    7  676   11    1
##          D    1    0    3  629    3
##          E    0    1    0    1  717
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9875          
##                  95% CI : (0.9835, 0.9907)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9842          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9750   0.9883   0.9782   0.9945
## Specificity            0.9950   0.9981   0.9938   0.9979   0.9994
## Pos Pred Value         0.9876   0.9920   0.9713   0.9890   0.9972
## Neg Pred Value         0.9986   0.9940   0.9975   0.9957   0.9988
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2835   0.1886   0.1723   0.1603   0.1828
## Detection Prevalence   0.2870   0.1902   0.1774   0.1621   0.1833
## Balanced Accuracy      0.9957   0.9865   0.9911   0.9880   0.9969
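
When predictors are selected by importance, it is worth making the ranking explicit. To the best of my knowledge, printing a `varImp()` object shows the variables sorted, while the underlying `$importance` data frame keeps the predictors in model order. The sketch below therefore sorts a stand-in importance table before taking the top rows; the values are hypothetical and only the structure (a column named `Overall` with predictor names as row names) is assumed to match.

```r
# Stand-in for varImp(model)$importance (hypothetical values)
importance <- data.frame(
  Overall = c(12.5, 100, 0, 47.3),
  row.names = c("gyros_belt_x", "yaw_belt", "gyros_arm_y", "magnet_belt_y")
)

# Sort the predictors by decreasing importance
ranked <- importance[order(importance$Overall, decreasing = TRUE), , drop = FALSE]

# Names of the two most important predictors
top_vars <- rownames(ranked)[1:2]
print(top_vars)
```

With an explicit sort, the selected subset does not depend on the order in which the predictors appear in the data set.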

With these \(30\) predictors, the accuracy on the validation set is \(0.9875\) (95% CI: [0.9835, 0.9907]). The estimated out-of-sample error is therefore \(1.25\)% (1 - accuracy), so we consider the Random Forest a good classifier and use it to predict the outcomes of the test data set as follows:

# Predict on the test data set (with all predictors)
result_1<-predict(RF_model, mydata_test_trans)
# Predict on the test data set (with the reduced set of 30 predictors)
result_2<-predict(RF_model_2, mydata_test_trans[c(Imp_vars,'classe')])

Note that we obtain identical predictions with the reduced set of predictors and with all predictors:

identical(result_1,result_2)
## [1] TRUE

The required outcomes for this project are therefore the following:

result_2
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# close clusters / Parallel calculations
stopCluster(cluster)

Conclusions

A study is presented to predict the class of barbell lifts. The train and test data sets are first reduced according to the characteristics of the predictors: the percentage of NA values, low variance, correlation, and skewness. The variables of the data sets are also centered and scaled. The train data set is split into subtrain and validation parts to construct a predictive model and evaluate its accuracy. Decision Tree and Random Forest models are applied; the latter is far more accurate and gives satisfactory results even with a reduced set of \(30\) predictors.

This project is reproducible and was done with the following environment:

# Software environment
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] doParallel_1.0.10 iterators_1.0.8   foreach_1.4.3    
##  [4] e1071_1.6-7       rpart.plot_1.5.3  rpart_4.1-10     
##  [7] dplyr_0.4.3       caret_6.0-64      ggplot2_2.0.0    
## [10] lattice_0.20-33  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3        formatR_1.2.1      nloptr_1.0.4      
##  [4] plyr_1.8.3         class_7.3-14       tools_3.2.0       
##  [7] digest_0.6.9       lme4_1.1-10        evaluate_0.8      
## [10] nlme_3.1-124       gtable_0.1.2       mgcv_1.8-11       
## [13] Matrix_1.2-0       DBI_0.3.1          yaml_2.1.13       
## [16] SparseM_1.7        stringr_1.0.0      knitr_1.12.3      
## [19] MatrixModels_0.4-1 stats4_3.2.0       grid_3.2.0        
## [22] nnet_7.3-12        R6_2.1.2           rmarkdown_0.9.2   
## [25] minqa_1.2.4        reshape2_1.4.1     car_2.1-1         
## [28] magrittr_1.5       scales_0.3.0       codetools_0.2-14  
## [31] htmltools_0.3      MASS_7.3-45        splines_3.2.0     
## [34] assertthat_0.1     pbkrtest_0.4-4     colorspace_1.2-6  
## [37] quantreg_5.19      stringi_1.0-1      lazyeval_0.1.10   
## [40] munsell_0.4.2