The goal of this project is to predict the manner in which 6 participants performed an exercise. The outcome is the “classe” variable in the training set. The machine learning model selected here is then applied to the 20 test cases available in the test data.
library(caret)
library(rattle)
library(randomForest)
library(rpart)
library(e1071)
library(gbm)
library(corrplot)
trainData <- read.csv("./pml-training.csv",header=TRUE)
validData <- read.csv("./pml-testing.csv",header=TRUE)
dim(trainData)
## [1] 19622 160
dim(validData)
## [1] 20 160
str(trainData)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : chr "carlitos" "carlitos" "carlitos" "carlitos" ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : chr "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
## $ new_window : chr "no" "no" "no" "no" ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : chr "" "" "" "" ...
## $ kurtosis_picth_belt : chr "" "" "" "" ...
## $ kurtosis_yaw_belt : chr "" "" "" "" ...
## $ skewness_roll_belt : chr "" "" "" "" ...
## $ skewness_roll_belt.1 : chr "" "" "" "" ...
## $ skewness_yaw_belt : chr "" "" "" "" ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : chr "" "" "" "" ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : chr "" "" "" "" ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : chr "" "" "" "" ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : chr "" "" "" "" ...
## $ kurtosis_picth_arm : chr "" "" "" "" ...
## $ kurtosis_yaw_arm : chr "" "" "" "" ...
## $ skewness_roll_arm : chr "" "" "" "" ...
## $ skewness_pitch_arm : chr "" "" "" "" ...
## $ skewness_yaw_arm : chr "" "" "" "" ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : chr "" "" "" "" ...
## $ kurtosis_picth_dumbbell : chr "" "" "" "" ...
## $ kurtosis_yaw_dumbbell : chr "" "" "" "" ...
## $ skewness_roll_dumbbell : chr "" "" "" "" ...
## $ skewness_pitch_dumbbell : chr "" "" "" "" ...
## $ skewness_yaw_dumbbell : chr "" "" "" "" ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : chr "" "" "" "" ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : chr "" "" "" "" ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
The training data has 19622 observations and 160 columns. The structure above shows that many columns are mostly NA or blank. In addition, the first seven columns hold identifying information (row index, user name, timestamps and window markers). None of these columns carries useful information for the model, so we drop them.
# removing columns containing NA values
trainData <- trainData[, colSums(is.na(trainData)) == 0]
dim(trainData)
## [1] 19622 93
validData <- validData[, colSums(is.na(validData)) == 0]
dim(validData)
## [1] 20 60
# removing identity columns
trainData <- trainData[, -c(1:7)]
validData <- validData[, -c(1:7)]
dim(trainData)
## [1] 19622 86
dim(validData)
## [1] 20 53
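Note that is.na does not flag blank strings, which is likely why more columns survive the first step in trainData than in validData. One possible alternative, sketched here on a toy CSV with made-up values, is to treat blanks as missing at read time through the na.strings argument of read.csv and then drop columns by their missing fraction. The toy data, the 0.5 threshold and the names toy, naFraction and toyClean are assumptions for illustration only, not part of the original analysis.

```r
# Toy CSV standing in for pml-training.csv (hypothetical values).
csv_text <- "id,roll_belt,kurtosis_roll_belt,max_roll_belt,classe
1,1.41,,NA,A
2,1.42,,NA,A
3,1.48,0.5,3.2,B
4,1.45,,NA,B"

# Treat both the string NA and empty fields as missing when reading.
toy <- read.csv(text = csv_text, header = TRUE, na.strings = c("NA", ""))

# Fraction of missing values in each column.
naFraction <- colMeans(is.na(toy))

# Keep columns whose missing fraction is below a hypothetical threshold.
toyClean <- toy[, naFraction < 0.5, drop = FALSE]
names(toyClean)
```

With this approach the blank columns are removed in the same step as the NA columns, so both data sets would go through identical filtering.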
The training data now has 86 variables. Some of the remaining columns, such as the mostly blank ones, vary very little across observations, so we clean the data further by removing near-zero-variance features with nearZeroVar.
NZV <- nearZeroVar(trainData)
# guard: negative indexing with an empty NZV would drop every column
if (length(NZV) > 0) trainData <- trainData[, -NZV]
dim(trainData)
## [1] 19622 53
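To make the near-zero-variance idea concrete, the sketch below approximates in base R the two criteria that nearZeroVar combines: the ratio of the most common value to the second most common, and the percentage of distinct values. The helper nzvFlag and the toy data are hypothetical, and the cutoffs 95/5 and 10 are assumed to mirror the caret defaults; this is an approximation, not caret's implementation.

```r
# Hypothetical helper approximating the two near-zero-variance criteria.
nzvFlag <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common.
  freqRatio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of observations.
  percentUnique <- 100 * length(counts) / length(x)
  freqRatio > freqCut && percentUnique <= uniqueCut
}

# Toy data: one mostly blank column and one informative sensor column.
toy <- data.frame(
  mostlyBlank = c(rep("", 98), "0.5", "1.2"),
  sensor      = (1:100) / 10,
  stringsAsFactors = FALSE
)

flagged <- which(vapply(toy, nzvFlag, logical(1)))

# Only drop columns when at least one was flagged.
toyClean <- if (length(flagged) > 0) toy[, -flagged, drop = FALSE] else toy
names(toyClean)
```

The length check matters because indexing with a negated empty vector selects no columns at all.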
After cleaning, trainData is split into a training set (70%) and a test set (30%). The validData is kept aside for the final predictions.
trainData$classe <- as.factor(trainData$classe)
inTrain <- createDataPartition(trainData$classe, p=0.7, list=FALSE)
trainSet <- trainData[inTrain, ]
testSet <- trainData[-inTrain, ]
dim(trainSet)
## [1] 13737 53
dim(testSet)
## [1] 5885 53
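The split above is drawn before any call to set.seed, so rerunning the code gives a slightly different partition. The sketch below shows, in base R and on made-up data, the idea behind a stratified 70/30 split: rows are sampled within each class so that class proportions are preserved. The seed value, the toy data and the names trainToy and testToy are assumptions; createDataPartition handles the details differently.

```r
set.seed(123)  # hypothetical seed, fixed so the split is reproducible

# Toy data with three unbalanced classes.
toy <- data.frame(
  x = rnorm(100),
  classe = factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))
)

# Sample 70% of the row indices within each class.
rowsByClass <- split(seq_len(nrow(toy)), toy$classe)
inTrainToy <- unlist(
  lapply(rowsByClass, function(idx) sample(idx, size = round(0.7 * length(idx)))),
  use.names = FALSE
)

trainToy <- toy[inTrainToy, ]
testToy <- toy[-inTrainToy, ]
table(trainToy$classe)
```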
We now use trainSet to explore the variables and to build the models.
corrMatrix <- cor(trainSet[, -53])
par(ps=16)
corrplot(corrMatrix, order = "FPC", method = "color", type = "lower",
tl.cex = 0.5, tl.col = rgb(0, 0, 0))
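In this correlation matrix the variables are ordered by the first principal component (order = "FPC"). The darkest cells mark the pairs of variables with the strongest correlation, whether positive or negative.
We will use three methods to model this classification problem: a classification tree, a random forest and a gradient boosting machine (GBM).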
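A plot with this many variables is hard to read, so it can help to list the strongly correlated pairs directly. The following is a minimal sketch on simulated data; the 0.8 cutoff, the seed and the names toy, corrToy and highPairs are assumptions for illustration.

```r
set.seed(42)  # hypothetical seed for the simulated data

# Toy predictors: b is almost a copy of a, c is unrelated noise.
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

corrToy <- cor(toy)

# Look only above the diagonal so that each pair is listed once.
high <- which(abs(corrToy) > 0.8 & upper.tri(corrToy), arr.ind = TRUE)

highPairs <- data.frame(
  var1 = rownames(corrToy)[high[, "row"]],
  var2 = colnames(corrToy)[high[, "col"]],
  r    = corrToy[high]
)
highPairs
```

Applied to corrMatrix, the same indexing would return the pairs behind the darkest cells of the plot.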
# fitting model
set.seed(1)
modelCT <- rpart(classe ~ ., data=trainSet, method="class")
fancyRpartPlot(modelCT, cex=0.3)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
# testing model
predictCT <- predict(modelCT, newdata = testSet, type = "class")
cmCT <- confusionMatrix(predictCT, (testSet$classe))
cmCT
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1560 248 15 94 31
## B 38 613 102 44 91
## C 37 147 816 101 109
## D 29 80 63 647 65
## E 10 51 30 78 786
##
## Overall Statistics
##
## Accuracy : 0.7514
## 95% CI : (0.7402, 0.7624)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6839
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9319 0.5382 0.7953 0.6712 0.7264
## Specificity 0.9079 0.9421 0.9189 0.9518 0.9648
## Pos Pred Value 0.8008 0.6903 0.6744 0.7319 0.8230
## Neg Pred Value 0.9710 0.8947 0.9551 0.9366 0.9400
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2651 0.1042 0.1387 0.1099 0.1336
## Detection Prevalence 0.3310 0.1509 0.2056 0.1502 0.1623
## Balanced Accuracy 0.9199 0.7401 0.8571 0.8115 0.8456
cmCT$overall[1]
## Accuracy
## 0.7514019
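The accuracy of the classification tree is about 0.751, which is well above the no-information rate of 0.2845 but still leaves roughly a quarter of the test observations misclassified.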
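The statistics reported by confusionMatrix follow directly from the counts in the table. As a worked check, the sketch below re-enters the classification tree counts shown above by hand and recomputes accuracy, per-class sensitivity and positive predictive value in base R. The variable names are illustrative.

```r
# Counts copied from the classification tree confusion matrix above
# (rows are predictions, columns are the reference classes).
cm <- matrix(c(1560, 248,  15,  94,  31,
                 38, 613, 102,  44,  91,
                 37, 147, 816, 101, 109,
                 29,  80,  63, 647,  65,
                 10,  51,  30,  78, 786),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

# Accuracy: correctly classified cases over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correct predictions over the true count of each class.
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions over all predictions of a class.
posPredValue <- diag(cm) / rowSums(cm)

round(accuracy, 4)
round(sensitivity, 4)
```

The out-of-sample error estimate is simply 1 - accuracy.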
# fitting model
crossV <- trainControl(method="cv", number=5, verboseIter = FALSE)
modelRF <- train(classe ~ ., data=trainSet, method="rf", trControl=crossV)
# testing model
predictRF <- predict(modelRF, newdata = testSet)
cmRF <- confusionMatrix(predictRF, testSet$classe)
cmRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 2 0 0 0
## B 0 1135 2 0 0
## C 0 2 1022 3 0
## D 0 0 2 959 1
## E 0 0 0 2 1081
##
## Overall Statistics
##
## Accuracy : 0.9976
## 95% CI : (0.996, 0.9987)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.997
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9965 0.9961 0.9948 0.9991
## Specificity 0.9995 0.9996 0.9990 0.9994 0.9996
## Pos Pred Value 0.9988 0.9982 0.9951 0.9969 0.9982
## Neg Pred Value 1.0000 0.9992 0.9992 0.9990 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1929 0.1737 0.1630 0.1837
## Detection Prevalence 0.2848 0.1932 0.1745 0.1635 0.1840
## Balanced Accuracy 0.9998 0.9980 0.9975 0.9971 0.9993
cmRF$overall[1]
## Accuracy
## 0.9976211
Random forest gives a much higher accuracy of about 0.998 on the test set, which corresponds to an out-of-sample error of about 0.002.
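The trainControl(method="cv", number=5) setting asks caret to estimate performance by 5-fold cross-validation while tuning the model. The idea can be sketched by hand: split the rows into five folds, fit on four of them, and score on the fifth. The example below uses rpart on the built-in iris data purely as a small stand-in; the seed, the data set and the names folds and foldAccuracy are assumptions, and caret's internal resampling is more elaborate.

```r
library(rpart)

set.seed(1)  # hypothetical seed for the fold assignment
k <- 5

# Assign each row of iris to one of k folds at random.
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# For each fold: fit on the other folds, then score on the held-out fold.
foldAccuracy <- vapply(seq_len(k), function(i) {
  fit <- rpart(Species ~ ., data = iris[folds != i, ], method = "class")
  pred <- predict(fit, newdata = iris[folds == i, ], type = "class")
  mean(pred == iris$Species[folds == i])
}, numeric(1))

# The cross-validated accuracy is the mean over the held-out folds.
cvAccuracy <- mean(foldAccuracy)
cvAccuracy
```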
# fitting model
crossVgbm <- trainControl(method="repeatedcv", number=5, repeats=1)
modelGBM <- train(classe ~ ., data=trainSet, method="gbm",
trControl=crossVgbm, verbose = FALSE)
# testing model
predictGBM <- predict(modelGBM, newdata = testSet)
cmGBM <- confusionMatrix(predictGBM, testSet$classe)
cmGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1653 30 0 1 2
## B 12 1095 28 6 6
## C 5 14 989 29 10
## D 3 0 9 924 12
## E 1 0 0 4 1052
##
## Overall Statistics
##
## Accuracy : 0.9708
## 95% CI : (0.9661, 0.9749)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.963
##
## Mcnemar's Test P-Value : 2.848e-08
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9875 0.9614 0.9639 0.9585 0.9723
## Specificity 0.9922 0.9890 0.9881 0.9951 0.9990
## Pos Pred Value 0.9804 0.9547 0.9446 0.9747 0.9953
## Neg Pred Value 0.9950 0.9907 0.9924 0.9919 0.9938
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2809 0.1861 0.1681 0.1570 0.1788
## Detection Prevalence 0.2865 0.1949 0.1779 0.1611 0.1796
## Balanced Accuracy 0.9898 0.9752 0.9760 0.9768 0.9856
cmGBM$overall[1]
## Accuracy
## 0.9707732
The GBM method gives an accuracy of about 0.971, which is slightly lower than that of random forest.
Comparing the three methods, random forest has the highest accuracy on the test set, so we use that model to predict the 20 test cases.
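The three test-set accuracies reported above can be collected in one small table, together with the implied out-of-sample error (1 - accuracy). The values are copied from the cmCT, cmRF and cmGBM outputs; the names comparison and bestModel are illustrative.

```r
# Test-set accuracies copied from the outputs above.
accuracy <- c(ClassificationTree = 0.7514019,
              RandomForest       = 0.9976211,
              GBM                = 0.9707732)

comparison <- data.frame(
  accuracy         = accuracy,
  outOfSampleError = 1 - accuracy,
  row.names        = names(accuracy)
)

# Sort from most to least accurate and pick the top model.
comparison <- comparison[order(-comparison$accuracy), ]
bestModel <- rownames(comparison)[1]
comparison
```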
Output <- predict(modelRF, newdata=validData)
Output
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
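As a rough sanity check on these predictions, the random forest test-set accuracy can be turned into a probability that all 20 cases are classified correctly. This assumes, purely for illustration, that the 20 cases are independent and come from the same distribution as testSet; neither assumption is verified here.

```r
accuracyRF <- 0.9976211  # test-set accuracy reported by cmRF above
nCases <- 20             # number of cases in validData

# Probability that every one of the 20 predictions is correct,
# under the independence assumption stated above.
pAllCorrect <- accuracyRF^nCases

# Expected number of misclassified cases among the 20.
expectedErrors <- nCases * (1 - accuracyRF)

round(pAllCorrect, 3)
round(expectedErrors, 3)
```

Under these assumptions the chance of a perfect score on the 20 cases is about 0.95.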