Background
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, we will use data from accelerometers on the belt, forearm, arm, and dumbbell of six participants. They were asked to perform barbell lifts correctly and incorrectly in five different classes, as follows:
Class A: Exactly according to the specification
Class B: Throwing the elbows to the front
Class C: Lifting the dumbbell only halfway
Class D: Lowering the dumbbell only halfway
Class E: Throwing the hips to the front
More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).
Data
For this project, the training and test data sets are available here and here, respectively.
Goal
The goal of your project is to predict the manner in which they did the exercise. This is the 'classe' variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
Your submission for the Peer Review portion should consist of a link to a Github repo with your R markdown and compiled HTML file describing your analysis. Please constrain the text of the writeup to fewer than 2000 words and the number of figures to fewer than five. It will make it easier for the graders if you submit a repo with a gh-pages branch so the HTML page can be viewed online (and you always want to make it easy on graders!).
Data Processing
# Required libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(rpart.plot)
## Loading required package: rpart
library(e1071) # for the skewness() function
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
# Setting seed for reproducibility
set.seed(123)
# Setting parallel calculations
cluster <- makeCluster(detectCores()-1)
registerDoParallel(cluster)
# Create Data directory
if(!dir.exists('./Data')){dir.create('./Data')}
# Create Figures directory
if(!dir.exists('./Figures')){dir.create('./Figures')}
# Download train data set
if(!file.exists('./Data/pml-training.csv')){
fileUrl<- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
download.file(fileUrl,destfile='./Data/pml-training.csv',mode = 'wb')
}
# Download test data set
if(!file.exists('./Data/pml-testing.csv')){
fileUrl<- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
download.file(fileUrl,destfile='./Data/pml-testing.csv',mode = 'wb')
}
# Load train data set
mydata_train <- read.csv("Data/pml-training.csv", na.strings=c("NA", ""))
# Load test data set
mydata_test <- read.csv("Data/pml-testing.csv", na.strings=c("NA", ""))
# Check dimension of data sets
dim(mydata_train); dim(mydata_test)
## [1] 19622 160
## [1] 20 160
The training and test data sets have \(19622\) and \(20\) observations, respectively. Both have \(160\) variables.
# Check structure of train data set
str(mydata_train)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 396 levels "-0.016850","-0.021024",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_belt : Factor w/ 316 levels "-0.021887","-0.060755",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt : Factor w/ 394 levels "-0.003095","-0.010002",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt.1 : Factor w/ 337 levels "-0.005928","-0.005960",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 3 levels "#DIV/0!","0.00",..: NA NA NA NA NA NA NA NA NA NA ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : Factor w/ 329 levels "-0.02438","-0.04190",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_arm : Factor w/ 327 levels "-0.00484","-0.01311",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_arm : Factor w/ 394 levels "-0.01548","-0.01749",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_arm : Factor w/ 330 levels "-0.00051","-0.00696",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_arm : Factor w/ 327 levels "-0.00184","-0.01185",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_arm : Factor w/ 394 levels "-0.00311","-0.00562",..: NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : Factor w/ 397 levels "-0.0035","-0.0073",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_dumbbell : Factor w/ 400 levels "-0.0163","-0.0233",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_dumbbell : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_dumbbell : Factor w/ 400 levels "-0.0082","-0.0096",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_dumbbell : Factor w/ 401 levels "-0.0053","-0.0084",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_dumbbell : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
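Note the '#DIV/0!' factor levels in the str() output above: these are spreadsheet error strings. Optionally (not applied in this analysis), they could also be treated as missing values when reading the files, as in this sketch:
# Optional sketch: also treat spreadsheet error strings as NA
mydata_train <- read.csv("Data/pml-training.csv", na.strings=c("NA", "", "#DIV/0!"))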
We can see from the above that some variables (or predictors) contain a large number of NA values. Let's examine this presence of NAs in detail:
# Number of variables consisting entirely of NA values
sum(colSums(is.na(mydata_train))==dim(mydata_train)[1])
## [1] 0
# Number of variables with at least 95% NA values
sum(colSums(is.na(mydata_train))>=0.95*dim(mydata_train)[1])
## [1] 100
# Number of variables without NA values
sum(colSums(!is.na(mydata_train))==dim(mydata_train)[1])
## [1] 60
No predictor consists entirely of NAs. However, \(100\) variables are composed of at least \(95\)% NAs, while \(60\) variables have no NAs at all. We therefore keep only the predictors with no NAs. Moreover, we also omit the first seven variables (identifiers and timestamps), which should have no influence on the outcome classe:
# Find variables (without NAs)
NoNA_Var<- which(colSums(!is.na(mydata_train))==dim(mydata_train)[1])
# Keep the above variables, dropping the first seven
mydata_train <- mydata_train %>% select(NoNA_Var) %>% select(-c(1:7))
mydata_test <- mydata_test %>% select(NoNA_Var) %>% select(-c(1:7))
Let’s check variables with very low variance:
# Check for near-zero-variance predictors
nearZeroVar(mydata_train)
## integer(0)
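As a side note, nearZeroVar() can also return the diagnostics this decision is based on (a quick sketch using caret's saveMetrics argument):
# Underlying diagnostics for the near-zero-variance check
nzv_metrics <- nearZeroVar(mydata_train, saveMetrics=TRUE)
head(nzv_metrics) # freqRatio, percentUnique, zeroVar and nzv per predictor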
No variables with very low variance are found in our train data set. Let's also omit highly correlated variables (absolute correlation above 0.9):
# Correlation values between variables
correlations<- cor(select(mydata_train,-classe))
# Cut off correlation over 0.9
highCorr<- findCorrelation(correlations, cutoff= 0.9)
# Subset data with our correlation limit
mydata_train<- mydata_train %>% select(-highCorr)
mydata_test<- mydata_test %>% select(-highCorr)
Many predictive models work best when the predictors are roughly normally distributed. Let's therefore center, scale, and Box-Cox transform our data sets:
# Preprocessing: center, scale and Box-Cox transform (without the outcome 'classe')
trans<- preProcess(select(mydata_train,-classe),method= c('center','scale','BoxCox'))
# Transformed data (train and test) sets
mydata_train_trans<- predict(trans,select(mydata_train,-classe))
mydata_test_trans<- predict(trans,select(mydata_test,-classe))
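As a quick sanity check (a sketch, not part of the original analysis), the transformed predictors should now have means close to 0 and standard deviations close to 1:
# Sanity check: centered and scaled predictors
summary(colMeans(mydata_train_trans)) # means close to 0
summary(apply(mydata_train_trans, 2, sd)) # standard deviations close to 1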
We also drop the remaining highly skewed variables:
# Flag highly skewed variables (skewness > 10)
Skew_var<- apply(mydata_train_trans,2,skewness) > 10
# Drop the skewed variables from both data sets
mydata_train_trans<- mydata_train_trans[!Skew_var]
mydata_test_trans<- mydata_test_trans[!Skew_var]
# Add the 'classe' column
mydata_train_trans<- mydata_train_trans %>% mutate(classe=mydata_train$classe)
mydata_test_trans<- mydata_test_trans %>% mutate(classe=mydata_test$classe)
# Check our transformed data sets
dim(mydata_train_trans)
## [1] 19622 43
dim(mydata_test_trans)
## [1] 20 43
We have thus reduced our initial data sets to \(43\) variables each.
Data Split
The train data set is split into a subtrain part (used to build the predictive model) and a validation part (used to check accuracy). The test data set is used last, to predict the required outcomes for this project.
# Indexes of splitting (subtrain 80% and validation 20% of the train data set)
Ind_part <- createDataPartition(y=mydata_train_trans$classe, p=0.8, list=F)
# Split into sub_training and validation parts
mydata_sub_train<- mydata_train_trans[Ind_part,]
mydata_valid<- mydata_train_trans[-Ind_part,]
# Dimension of data sets
dim(mydata_sub_train)
## [1] 15699 43
dim(mydata_valid)
## [1] 3923 43
The subtrain and validation data sets contain \(80\)% and \(20\)% of the train data set, respectively.
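Since createDataPartition() samples within each class, the class proportions should be preserved in both parts; a quick check (sketch):
# Class proportions should match closely between the two parts
round(prop.table(table(mydata_sub_train$classe)), 3)
round(prop.table(table(mydata_valid$classe)), 3)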
Predictive Models
The present project is a classification problem. We first use a Decision Tree model and, if necessary, the Random Forest algorithm, both through the caret package.
We use the k-fold cross validation resampling technique in this study:
# Type of resampling / 5-fold cross-validation / Parallel calculations
control <- trainControl(method= 'cv', number= 5, allowParallel = T)
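For illustration only (caret builds its own folds internally), createFolds() shows what a 5-fold split of the subtrain data looks like:
# Illustrative sketch: 5 folds, each holding ~1/5 of the subtrain rows
folds <- createFolds(mydata_sub_train$classe, k=5)
sapply(folds, length)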
Decision Tree
# Build the predictive model on the subtrain data set (with 5-fold CV)
DT_model<- train(classe~. , data=mydata_sub_train, method= 'rpart', trControl= control)
# Predict on the validation data set
prediction<- predict(DT_model, mydata_valid)
# Confusion matrix on the validation data set
confusionMatrix(prediction, mydata_valid$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1015 308 304 258 149
## B 16 262 26 132 132
## C 68 155 280 62 168
## D 17 34 74 155 23
## E 0 0 0 36 249
##
## Overall Statistics
##
## Accuracy : 0.4999
## 95% CI : (0.4841, 0.5156)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.347
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9095 0.34519 0.40936 0.24106 0.34535
## Specificity 0.6370 0.90329 0.86014 0.95488 0.98876
## Pos Pred Value 0.4990 0.46127 0.38199 0.51155 0.87368
## Neg Pred Value 0.9465 0.85186 0.87335 0.86519 0.87026
## Prevalence 0.2845 0.19347 0.17436 0.16391 0.18379
## Detection Rate 0.2587 0.06679 0.07137 0.03951 0.06347
## Detection Prevalence 0.5185 0.14479 0.18685 0.07724 0.07265
## Balanced Accuracy 0.7732 0.62424 0.63475 0.59797 0.66706
# Plot the Decision Tree
png('./Figures/unnamed-chunk-13.png',width=800,height=600)
rpart.plot(DT_model$finalModel, main="Decision Tree", extra=102, under=T, faclen=0, cex = 0.5,branch = 1, type = 0, fallen.leaves = T)
dev.off()
## png
## 2
The low accuracy (~\(50\)%) shows that the Decision Tree is a poor classifier for the present study. Let's check the Random Forest model:
Random Forest
# Build the predictive model on the subtrain data set
RF_model<- train(classe~., data=mydata_sub_train, method = "rf", trControl = control)
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Predict on the validation data set
prediction <- predict(RF_model, mydata_valid)
# Confusion matrix on the validation data set
confusionMatrix(prediction, mydata_valid$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1115 8 0 0 0
## B 1 746 3 0 0
## C 0 5 680 4 0
## D 0 0 1 638 1
## E 0 0 0 1 720
##
## Overall Statistics
##
## Accuracy : 0.9939
## 95% CI : (0.9909, 0.9961)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9923
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9829 0.9942 0.9922 0.9986
## Specificity 0.9971 0.9987 0.9972 0.9994 0.9997
## Pos Pred Value 0.9929 0.9947 0.9869 0.9969 0.9986
## Neg Pred Value 0.9996 0.9959 0.9988 0.9985 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2842 0.1902 0.1733 0.1626 0.1835
## Detection Prevalence 0.2863 0.1912 0.1756 0.1631 0.1838
## Balanced Accuracy 0.9981 0.9908 0.9957 0.9958 0.9992
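The per-fold cross-validation results used to tune the model can also be inspected directly from the train object (a quick sketch of standard caret fields):
# Hold-out accuracy for each of the 5 folds
RF_model$resample
# Averaged cross-validation performance for each tried value of mtry
RF_model$results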
As expected, the Random Forest is a better predictive model than the Decision Tree, with a much higher accuracy (99.4%). Let's now restrict the model to the \(30\) most important predictors of the Random Forest model (to reduce computing cost):
# Names of the 30 most important variables (the importance table returned by varImp is not sorted)
imp<-varImp(RF_model)$importance
Imp_vars<-rownames(imp)[order(imp$Overall,decreasing=TRUE)][1:30]
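# Optional sketch: visualize the ranked variable importances
# plot(varImp(RF_model), top=30)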
# Build the predictive model on the subtrain data set (with the most important predictors)
RF_model_2<- train(classe~., data=mydata_sub_train[c(Imp_vars,'classe')], method = "rf", trControl = control)
# Predict on the validation data set
prediction <- predict(RF_model_2, mydata_valid[c(Imp_vars,'classe')])
# Confusion matrix on the validation data set
confusionMatrix(prediction, mydata_valid$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1112 11 1 2 0
## B 2 740 4 0 0
## C 1 7 676 11 1
## D 1 0 3 629 3
## E 0 1 0 1 717
##
## Overall Statistics
##
## Accuracy : 0.9875
## 95% CI : (0.9835, 0.9907)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9842
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9750 0.9883 0.9782 0.9945
## Specificity 0.9950 0.9981 0.9938 0.9979 0.9994
## Pos Pred Value 0.9876 0.9920 0.9713 0.9890 0.9972
## Neg Pred Value 0.9986 0.9940 0.9975 0.9957 0.9988
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2835 0.1886 0.1723 0.1603 0.1828
## Detection Prevalence 0.2870 0.1902 0.1774 0.1621 0.1833
## Balanced Accuracy 0.9957 0.9865 0.9911 0.9880 0.9969
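The expected out-of-sample error can be read directly from this validation confusion matrix (a quick sketch; prediction holds the RF_model_2 predictions above):
# Expected out-of-sample error = 1 - validation accuracy
cm <- confusionMatrix(prediction, mydata_valid$classe)
as.numeric(1 - cm$overall['Accuracy']) # ~ 0.0125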
In the above condition, we have an accuracy of \(0.9875\) (95% CI: [0.9835, 0.9907]). The out of sample error is \(1.25\)% (= 1 - accuracy), which leads us to consider the Random Forest model a good classifier for predicting the outcomes of the test data set:
# Predict on the test data set
result_1<-predict(RF_model, mydata_test_trans)
# Predict on the test data set (with the 30 most important predictors)
result_2<-predict(RF_model_2, mydata_test_trans[c(Imp_vars,'classe')])
Note that we obtain identical predictions with the \(30\) most important predictors and with all predictors:
identical(result_1,result_2)
## [1] TRUE
Accordingly, the required outcomes for this project are the following:
result_2
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
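If separate files are needed for the course submission, a minimal helper can be used; this sketch (the function name and file naming are assumptions, not part of the original analysis) writes one text file per test case:
# Hypothetical helper: write one text file per predicted test case
write_results <- function(preds) {
  for (i in seq_along(preds)) {
    writeLines(as.character(preds[i]), sprintf('problem_id_%d.txt', i))
  }
}
# write_results(result_2) # uncomment to write the 20 files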
# Stop the cluster used for parallel calculations
stopCluster(cluster)
Conclusion
A study was presented to predict the manner in which barbell lifts were performed, according to five different classes. The train and test data sets were first reduced based on the characteristics of the predictors: the percentage of NA values, near-zero variance, high correlation, and high skewness. The variables were also centered and scaled. The train data set was split into subtrain and validation parts to build a predictive model and evaluate its accuracy. Decision Tree and Random Forest models were applied; the latter proved far more accurate and gives satisfactory results even when restricted to the most important predictors.
This project is reproducible and was done with the following environment:
# Software environment
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3
## [4] e1071_1.6-7 rpart.plot_1.5.3 rpart_4.1-10
## [7] dplyr_0.4.3 caret_6.0-64 ggplot2_2.0.0
## [10] lattice_0.20-33
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 formatR_1.2.1 nloptr_1.0.4
## [4] plyr_1.8.3 class_7.3-14 tools_3.2.0
## [7] digest_0.6.9 lme4_1.1-10 evaluate_0.8
## [10] nlme_3.1-124 gtable_0.1.2 mgcv_1.8-11
## [13] Matrix_1.2-0 DBI_0.3.1 yaml_2.1.13
## [16] SparseM_1.7 stringr_1.0.0 knitr_1.12.3
## [19] MatrixModels_0.4-1 stats4_3.2.0 grid_3.2.0
## [22] nnet_7.3-12 R6_2.1.2 rmarkdown_0.9.2
## [25] minqa_1.2.4 reshape2_1.4.1 car_2.1-1
## [28] magrittr_1.5 scales_0.3.0 codetools_0.2-14
## [31] htmltools_0.3 MASS_7.3-45 splines_3.2.0
## [34] assertthat_0.1 pbkrtest_0.4-4 colorspace_1.2-6
## [37] quantreg_5.19 stringi_1.0-1 lazyeval_0.1.10
## [40] munsell_0.4.2