Executive Summary
Background
Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. The data was collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways ("classe"):
A - Correctly, exactly according to the specification
B - Throwing the elbows to the front
C - Lifting the dumbbell only halfway
D - Lowering the dumbbell only halfway
E - Throwing the hips to the front
The goal of this project is to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
The training and test data for this project are available at:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
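The files can also be fetched directly from R; a minimal sketch, assuming a writable working directory (the destination file names are placeholders):
url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url_train, destfile = "pml-training.csv")  # download the training data
download.file(url_test, destfile = "pml-testing.csv")    # download the 20 test cases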
Questions Addressed
HOW WAS THE MODEL BUILT (CHOICE OF CLASSIFICATION METHODS)?
I chose the two most common methods for this analysis: CART (Classification and Regression Tree) and Random Forest. CART is a standard machine learning technique that, in ensemble terms, corresponds to a weak learner: an input is entered at the top of the decision tree and, as it traverses down, the data gets bucketed into smaller and smaller sets. Random Forest (Breiman, 2001) is an ensemble approach: a collection of "weak learners" that come together to form a "strong learner".
- CART is easy to interpret and fast to run, but its uncertainty is hard to estimate
- RANDOM FOREST has very good accuracy, but it is difficult to interpret and slow to run (it may take a couple of hours); a sketch of how the run time can be shortened follows this list
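If the Random Forest run time is a concern, caret's default bootstrap resampling (25 repetitions) can be replaced with a smaller cross validation. A minimal sketch, assuming the training data frame created later in this report (modFit_RF_fast is hypothetical; the models below use caret's defaults):
# Sketch: 5-fold cross validation instead of 25 bootstrap repetitions
ctrl <- trainControl(method = "cv", number = 5)
modFit_RF_fast <- train(classe ~ ., method = "rf", data = training, trControl = ctrl)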
HOW WAS THE CROSS VALIDATION METHOD CHOSEN?
There are several methods to estimate model accuracy, among them Data Split, Bootstrap, K-Fold Cross Validation, Repeated K-Fold Cross Validation, and Leave One Out Cross Validation. I chose DATA SPLIT due to the very large amount of data in this problem. Each scheme maps onto caret's trainControl, as sketched after the list below.
- Data Split: involves partitioning the data into an explicit Training dataset used to prepare the model (typically 70%) and an unseen Test dataset (typically 30%) used to evaluate the model's performance on unseen data. It is useful for very large datasets, where the test dataset can provide a meaningful estimate of performance, and for slow methods that need a quick approximation of performance.
- K-Fold Cross Validation: involves splitting the dataset into K subsets. Each subset is held out in turn while the model is trained on all the other subsets. The process continues until an accuracy is determined for each instance in the dataset, and an overall accuracy estimate is provided.
- Leave One Out Cross Validation: a single data instance is left out and a model is constructed on all the other data instances in the training set. This is repeated for all data instances.
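For illustration, here is a minimal sketch of how each resampling scheme is requested through caret's trainControl (none of these objects are used in the analysis below):
# Sketch: requesting each resampling scheme in caret
ctrl_boot <- trainControl(method = "boot", number = 25)                    # Bootstrap (caret's default)
ctrl_cv <- trainControl(method = "cv", number = 10)                        # K-Fold Cross Validation
ctrl_rcv <- trainControl(method = "repeatedcv", number = 10, repeats = 3)  # Repeated K-Fold
ctrl_loo <- trainControl(method = "LOOCV")                                 # Leave One Out
# Data Split is handled outside trainControl, e.g. with createDataPartition (used below)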
WHAT IS THE EXPECTED OUT-OF-SAMPLE ERROR (= 1 - ACCURACY)?
The two chosen methods provided different accuracies. The CART method provided an accuracy of 60.65%, so its expected out-of-sample error is 1 - 60.65% = 39.35%. The RANDOM FOREST method, on the other hand, provided a much better accuracy of 99.32%, giving an expected out-of-sample error of 1 - 99.32% = 0.68%.
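As a sketch of the arithmetic, the expected out-of-sample error can be read directly off the confusionMatrix output (the objects predictions_RF and testing are created later in this report):
# Out-of-sample error = 1 - accuracy on the held-out testing set
cm_rf <- confusionMatrix(predictions_RF, testing$classe)
1 - as.numeric(cm_rf$overall["Accuracy"])  # approximately 0.0068 for Random Forest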
USING THE FINAL PREDICTION MODEL TO PREDICT 20 DIFFERENT TEST CASES.
See TABLE A in the last section.
Data Analysis
Getting and Cleaning Data
Loading the libraries necessary to run the code:
library(dplyr)
library(caret)
library(rattle)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(randomForest)
Loading the files necessary to develop the project
training_original <- read.csv("J:/pml-training.csv", sep=";", stringsAsFactors = FALSE, na.strings=c("#DIV/0!"))
testing_original <- read.csv("J:/pml-testing.csv", sep=";", stringsAsFactors = FALSE, na.strings=c("#DIV/0!"))
str(training_original)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : chr "carlitos" "carlitos" "carlitos" "carlitos" ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : chr "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
## $ new_window : chr "no" "no" "no" "no" ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : chr "" "" "" "" ...
## $ kurtosis_picth_belt : chr "" "" "" "" ...
## $ kurtosis_yaw_belt : logi NA NA NA NA NA NA ...
## $ skewness_roll_belt : chr "" "" "" "" ...
## $ skewness_roll_belt.1 : chr "" "" "" "" ...
## $ skewness_yaw_belt : logi NA NA NA NA NA NA ...
## $ max_roll_belt : chr "NA" "NA" "NA" "NA" ...
## $ max_picth_belt : chr "NA" "NA" "NA" "NA" ...
## $ max_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_belt : chr "NA" "NA" "NA" "NA" ...
## $ min_pitch_belt : chr "NA" "NA" "NA" "NA" ...
## $ min_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_belt : chr "NA" "NA" "NA" "NA" ...
## $ amplitude_pitch_belt : chr "NA" "NA" "NA" "NA" ...
## $ amplitude_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_total_accel_belt : chr "NA" "NA" "NA" "NA" ...
## $ avg_roll_belt : chr "NA" "NA" "NA" "NA" ...
## $ stddev_roll_belt : chr "NA" "NA" "NA" "NA" ...
## $ var_roll_belt : chr "NA" "NA" "NA" "NA" ...
## $ avg_pitch_belt : chr "NA" "NA" "NA" "NA" ...
## $ stddev_pitch_belt : chr "NA" "NA" "NA" "NA" ...
## $ var_pitch_belt : chr "NA" "NA" "NA" "NA" ...
## $ avg_yaw_belt : chr "NA" "NA" "NA" "NA" ...
## $ stddev_yaw_belt : chr "NA" "NA" "NA" "NA" ...
## $ var_yaw_belt : chr "NA" "NA" "NA" "NA" ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : chr "NA" "NA" "NA" "NA" ...
## $ avg_roll_arm : chr "NA" "NA" "NA" "NA" ...
## $ stddev_roll_arm : chr "NA" "NA" "NA" "NA" ...
## $ var_roll_arm : chr "NA" "NA" "NA" "NA" ...
## $ avg_pitch_arm : chr "NA" "NA" "NA" "NA" ...
## $ stddev_pitch_arm : chr "NA" "NA" "NA" "NA" ...
## $ var_pitch_arm : chr "NA" "NA" "NA" "NA" ...
## $ avg_yaw_arm : chr "NA" "NA" "NA" "NA" ...
## $ stddev_yaw_arm : chr "NA" "NA" "NA" "NA" ...
## $ var_yaw_arm : chr "NA" "NA" "NA" "NA" ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : chr "" "" "" "" ...
## $ kurtosis_picth_arm : chr "" "" "" "" ...
## $ kurtosis_yaw_arm : chr "" "" "" "" ...
## $ skewness_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_arm : chr "NA" "NA" "NA" "NA" ...
## $ max_picth_arm : chr "NA" "NA" "NA" "NA" ...
## $ max_yaw_arm : chr "NA" "NA" "NA" "NA" ...
## $ min_roll_arm : chr "NA" "NA" "NA" "NA" ...
## $ min_pitch_arm : chr "NA" "NA" "NA" "NA" ...
## $ min_yaw_arm : chr "NA" "NA" "NA" "NA" ...
## $ amplitude_roll_arm : chr "NA" "NA" "NA" "NA" ...
## $ amplitude_pitch_arm : chr "NA" "NA" "NA" "NA" ...
## $ amplitude_yaw_arm : chr "NA" "NA" "NA" "NA" ...
## $ roll_dumbbell : chr "1.305.217.456" "1.313.073.959" "1.285.074.981" "1.343.119.971" ...
## $ pitch_dumbbell : chr "-7.049.400.371" "-7.063.750.507" "-7.027.811.982" "-7.039.379.464" ...
## $ yaw_dumbbell : chr "-8.487.393.888" "-8.471.064.711" "-8.514.078.134" "-8.487.362.553" ...
## $ kurtosis_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ skewness_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ max_roll_dumbbell : chr "NA" "NA" "NA" "NA" ...
## $ max_picth_dumbbell : chr "NA" "NA" "NA" "NA" ...
## $ max_yaw_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_dumbbell : chr "NA" "NA" "NA" "NA" ...
## $ min_pitch_dumbbell : chr "NA" "NA" "NA" "NA" ...
## $ min_yaw_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_dumbbell : chr "NA" "NA" "NA" "NA" ...
## [list output truncated]
The training dataset has the following dimensions (number of rows by number of columns):
dim(training_original)
## [1] 19622 160
In order to manipulate the data for the regression and correlation analysis, the columns must be converted to numeric. The conversion of the 160 columns was done with a "for" loop, excluding only the last column ("classe" in the training set):
a <- ncol(training_original)
# Convert every training column except the last ("classe") to numeric
for (i in 1:(a-1)){training_original[,i] <- as.numeric(training_original[,i])}
b <- ncol(testing_original)
# Same conversion for the test set, excluding its last column
for (i in 1:(b-1)){testing_original[,i] <- as.numeric(testing_original[,i])}
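A more idiomatic alternative to the explicit loops, shown here as a sketch, converts the same columns with lapply:
# Sketch: vectorized equivalent of the for loops above
idx <- seq_len(ncol(training_original) - 1)  # every column except the last ("classe")
training_original[idx] <- lapply(training_original[idx], as.numeric)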
Columns containing any missing values (NA) were excluded.
training_original <- training_original[, colSums(is.na(training_original)) ==0]
Columns with near-zero variance were also excluded.
classe <- training_original[,ncol(training_original)]
zeroVar <- nearZeroVar(training_original, saveMetrics=TRUE)
training_original <- training_original[, zeroVar$nzv==FALSE]
Variables highly correlated with another variable (correlation above 0.75) were removed to reduce collinearity.
#Removing columns with collinearity
correlation <- cor(training_original[,-ncol(training_original)])
top <- findCorrelation( correlation, cutoff =.75)
training_original <- training_original[, -top]
The identifier columns at the start of the dataset (such as the row index and timestamps) were eliminated because they carry no legitimate predictive information.
training_original <- select(training_original, -1,-2,-3,-4)
training_original$classe <- as.factor(training_original$classe)
In order to keep the project reproducible, the random seed was fixed with set.seed(1000).
set.seed(1000)
The data was split into 70% for the training dataset and 30% for the testing dataset.
inTrain <- createDataPartition(training_original$classe, p=0.7, list=FALSE)
training <- training_original[inTrain,]
testing <- training_original[-inTrain,]
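A quick sanity check of the partition sizes; with 19622 rows, the 70/30 split should yield about 13737 training and 5885 testing observations:
dim(training)  # expected: 13737 rows
dim(testing)   # expected: 5885 rows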
CART (Classification and Regression Tree) RESULTS.
modFit_T<- train(classe~., method="rpart", data= training)
modFit_T
## CART
##
## 13737 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
##
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.01759740 0.6081001 0.5033186 0.04442749 0.05680087
## 0.02120842 0.5720932 0.4582851 0.03226512 0.04151583
## 0.03092259 0.4261071 0.2257590 0.13692269 0.22390158
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0175974.
print(modFit_T$finalModel)
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)
## 2) pitch_forearm< -33.95 1101 7 A (0.99 0.0064 0 0 0) *
## 3) pitch_forearm>=-33.95 12636 9824 A (0.22 0.21 0.19 0.18 0.2)
## 6) gyros_belt_z< 0.06 11792 8989 A (0.24 0.22 0.2 0.18 0.16)
## 12) yaw_belt>=169.5 580 58 A (0.9 0.04 0 0.052 0.0086) *
## 13) yaw_belt< 169.5 11212 8623 B (0.2 0.23 0.21 0.18 0.17)
## 26) pitch_belt< -42.95 563 78 B (0.016 0.86 0.089 0.016 0.018) *
## 27) pitch_belt>=-42.95 10649 8309 C (0.21 0.2 0.22 0.19 0.18)
## 54) yaw_belt< 4.515 8569 6507 B (0.23 0.24 0.22 0.2 0.11)
## 108) roll_forearm< 124.5 4763 3078 A (0.35 0.25 0.13 0.2 0.061)
## 216) roll_forearm>=-42.25 2580 1210 A (0.53 0.23 0.12 0.12 0.0027) *
## 217) roll_forearm< -42.25 2183 1529 D (0.14 0.28 0.15 0.3 0.13)
## 434) yaw_forearm>=-101.5 1392 886 B (0.16 0.36 0.18 0.13 0.17) *
## 435) yaw_forearm< -101.5 791 315 D (0.12 0.12 0.091 0.6 0.062) *
## 109) roll_forearm>=124.5 3806 2568 C (0.075 0.23 0.33 0.21 0.17)
## 218) accel_forearm_x>=-104.5 2652 1666 C (0.085 0.27 0.37 0.071 0.2)
## 436) magnet_forearm_z< -245 217 43 A (0.8 0.18 0 0.014 0) *
## 437) magnet_forearm_z>=-245 2435 1449 C (0.021 0.28 0.4 0.076 0.22)
## 874) gyros_dumbbell_y< -0.425 486 185 B (0.035 0.62 0.12 0.037 0.19) *
## 875) gyros_dumbbell_y>=-0.425 1949 1021 C (0.017 0.2 0.48 0.086 0.22) *
## 219) accel_forearm_x< -104.5 1154 562 D (0.053 0.11 0.22 0.51 0.1) *
## 55) yaw_belt>=4.515 2080 1120 E (0.14 0.02 0.23 0.15 0.46)
## 110) yaw_belt>=158.5 1122 652 C (0.27 0.037 0.42 0.27 0.0053) *
## 111) yaw_belt< 158.5 958 4 E (0 0 0 0.0042 1) *
## 7) gyros_belt_z>=0.06 844 227 E (0.011 0.046 0.0071 0.2 0.73) *
predictions_T <- predict(modFit_T, newdata=testing)
confusionMatrix(predictions_T, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1317 293 145 134 4
## B 126 535 136 96 139
## C 160 190 609 224 194
## D 69 102 133 426 63
## E 2 19 3 84 682
##
## Overall Statistics
##
## Accuracy : 0.6065
## 95% CI : (0.5938, 0.619)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7867 0.46971 0.5936 0.44191 0.6303
## Specificity 0.8632 0.89528 0.8419 0.92542 0.9775
## Pos Pred Value 0.6957 0.51841 0.4423 0.53720 0.8633
## Neg Pred Value 0.9106 0.87554 0.9075 0.89434 0.9215
## Prevalence 0.2845 0.19354 0.1743 0.16381 0.1839
## Detection Rate 0.2238 0.09091 0.1035 0.07239 0.1159
## Detection Prevalence 0.3217 0.17536 0.2340 0.13475 0.1342
## Balanced Accuracy 0.8250 0.68250 0.7178 0.68367 0.8039
# Fit and plot a standalone classification tree for visualization
cart <- rpart(classe~., data=training, method="class")
prp(cart)
RANDOM FOREST RESULTS.
modFit_RF<- train(classe~., method="rf", data = training)
modFit_RF
## Random Forest
##
## 13737 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9879621 0.9847667 0.001334410 0.001689145
## 16 0.9858949 0.9821525 0.001979898 0.002501480
## 31 0.9761664 0.9698456 0.003464061 0.004372205
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
print(modFit_RF$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.71%
## Confusion matrix:
## A B C D E class.error
## A 3899 3 1 1 2 0.001792115
## B 9 2645 3 0 1 0.004890895
## C 0 22 2361 13 0 0.014607679
## D 0 0 38 2212 2 0.017761989
## E 0 1 0 2 2522 0.001188119
predictions_RF <- predict(modFit_RF, newdata=testing)
confusionMatrix(predictions_RF, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 5 0 0 0
## B 3 1132 8 0 0
## C 0 1 1014 15 0
## D 0 0 4 949 3
## E 0 1 0 0 1079
##
## Overall Statistics
##
## Accuracy : 0.9932
## 95% CI : (0.9908, 0.9951)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9914
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9939 0.9883 0.9844 0.9972
## Specificity 0.9988 0.9977 0.9967 0.9986 0.9998
## Pos Pred Value 0.9970 0.9904 0.9845 0.9927 0.9991
## Neg Pred Value 0.9993 0.9985 0.9975 0.9970 0.9994
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2839 0.1924 0.1723 0.1613 0.1833
## Detection Prevalence 0.2848 0.1942 0.1750 0.1624 0.1835
## Balanced Accuracy 0.9985 0.9958 0.9925 0.9915 0.9985
Most Important Variables
varimp <- varImp(modFit_RF, scale=FALSE)
plot(varimp, main="Most Important Variables")
head(getTree(modFit_RF$finalModel, k=2))
## left daughter right daughter split var split point status prediction
## 1 2 3 29 -18.500 1 0
## 2 4 5 1 -42.550 1 0
## 3 6 7 11 31.500 1 0
## 4 8 9 6 42.500 1 0
## 5 10 11 4 0.135 1 0
## 6 12 13 7 -497.500 1 0
# Sanity check: tabulating the testing-set classes against themselves
# shows the class distribution of the held-out testing set on the diagonal
pred <- testing$classe
table(pred, testing$classe)
##
## pred A B C D E
## A 1674 0 0 0 0
## B 0 1139 0 0 0
## C 0 0 1026 0 0
## D 0 0 0 964 0
## E 0 0 0 0 1082
The final prediction model with the better accuracy (Random Forest) was used to predict the 20 different test cases proposed in the initial request of the project.
predictions_RR <- predict(modFit_RF, newdata=testing_original)
Table A: the predictions for the 20 different test cases
table(predictions_RR)
## predictions_RR
## A B C D E
## 7 8 1 1 3
predictions_RR
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Sources:
http://groupware.les.inf.puc-rio.br/har
https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/
http://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/