Introduction:

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of an activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of six participants who were asked to perform barbell lifts correctly and incorrectly in five different ways. Specifically, six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different classes:

1. Class A: exactly according to the specification
2. Class B: throwing the elbows to the front
3. Class C: lifting the dumbbell only halfway
4. Class D: lowering the dumbbell only halfway
5. Class E: throwing the hips to the front

Goal: The goal of this project is to predict the manner in which the participants performed the exercise, recorded in the "classe" variable of the training dataset. Cross-validation will be used to build the machine learning models, the expected out-of-sample error will be calculated, and the chosen model will be used to predict 20 different test cases.

Note: This project uses the Weight Lifting Exercises (WLE) dataset. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.

I want to thank the authors for their generosity in allowing me to use their dataset for this assignment.

Sources of dataset: http://groupware.les.inf.puc-rio.br/har
Training set: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Testing set: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Data Preprocessing:
Loading of Dataset:
# Packages used throughout the analysis.
library(caret)
library(corrplot)
library(rattle)
trainurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainData <- read.csv(url(trainurl), header = TRUE)
testData <- read.csv(url(testurl), header = TRUE)
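One optional refinement, assuming the raw WLE CSVs encode missing values as "NA", "#DIV/0!", and empty strings (a property of the files, not shown in the original code): normalizing these to NA at read time makes the NA filter below more reliable.

# Optional: treat "NA", "#DIV/0!", and "" as missing values on read.
trainData <- read.csv(url(trainurl), header = TRUE,
                      na.strings = c("NA", "#DIV/0!", ""))
testData <- read.csv(url(testurl), header = TRUE,
                     na.strings = c("NA", "#DIV/0!", ""))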
Data Cleaning: Remove near-zero-variance variables, then drop variables that are almost entirely NA (using the mean proportion of NAs per column).
# Drop near-zero-variance predictors.
NZV <- nearZeroVar(trainData, saveMetrics = TRUE)
NZV1 <- nearZeroVar(testData, saveMetrics = TRUE)
trainData <- trainData[, NZV$nzv == FALSE]
testData <- testData[, NZV1$nzv == FALSE]
# Compute the proportion of NAs per column and keep only columns with no missing values.
AllNA <- sapply(trainData, function(x) mean(is.na(x)))
AllNA1 <- sapply(testData, function(x) mean(is.na(x)))
trainData <- trainData[, AllNA == 0]
testData <- testData[, AllNA1 == 0]
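As a quick check that the cleaning worked (an addition, not in the original analysis), no missing values should remain in either dataset:

# Both counts should be zero after the NA filter.
sum(is.na(trainData))
sum(is.na(testData))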
Remove the first five columns (row index, user name, and timestamps) from the training and testing datasets, since they are identifiers rather than sensor measurements. The number of variables is now reduced to 54.
trainData <- trainData[,-(1:5)]
testData <- testData[,-(1:5)]
dim(trainData)
## [1] 19622 54
dim(testData)
## [1] 20 54
Partition of Dataset:
# Ensure the outcome is a factor (R >= 4.0 reads strings as character by default).
trainData$classe <- factor(trainData$classe)
# Split the cleaned data: 70% for model training, 30% for validation.
inTrain <- createDataPartition(trainData$classe, p = 0.7, list = FALSE)
training <- trainData[inTrain, ]
testing <- trainData[-inTrain, ]
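As a sanity check on the split (an addition, not in the original analysis), the class proportions in the training partition should mirror the full dataset:

# Verify the 70/30 split and the class balance within the training partition.
dim(training); dim(testing)
round(prop.table(table(training$classe)), 3)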
Correlation Analysis:
# Correlation matrix of the 53 predictors (column 54 is the outcome, classe).
corMatrix <- cor(training[, -54])
corrplot(corMatrix, order = "FPC", method = "color", type = "lower",
         tl.cex = 0.8, tl.col = rgb(0, 0, 0))
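The plot highlights a handful of strongly correlated sensor pairs. As a complementary check, caret's findCorrelation can list the predictors above a chosen cutoff; the 0.8 threshold here is an assumption, not part of the original analysis:

# List predictors with pairwise correlation above 0.8 (cutoff is an assumption).
highCorr <- findCorrelation(corMatrix, cutoff = 0.8)
names(training)[highCorr]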
Prediction Models: 1. Generalized Boosted Model (GBM)
set.seed(12345)
# 5-fold cross-validation for the boosted model.
controlGBM <- trainControl(method = "cv", number = 5)
ModelFitGBM <- train(classe ~ ., data = training, method = "gbm",
                     trControl = controlGBM, verbose = FALSE)
print(ModelFitGBM$finalModel)
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 41 had non-zero influence.
# Evaluate the GBM on the held-out validation set.
predictionsGBM <- predict(ModelFitGBM, newdata = testing)
confusionMatrix(predictionsGBM, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 12 0 1 0
## B 1 1114 9 6 2
## C 0 13 1017 8 2
## D 1 0 0 947 14
## E 1 0 0 2 1064
##
## Overall Statistics
##
## Accuracy : 0.9878
## 95% CI : (0.9846, 0.9904)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9845
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9781 0.9912 0.9824 0.9834
## Specificity 0.9969 0.9962 0.9953 0.9970 0.9994
## Pos Pred Value 0.9923 0.9841 0.9779 0.9844 0.9972
## Neg Pred Value 0.9993 0.9947 0.9981 0.9965 0.9963
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2839 0.1893 0.1728 0.1609 0.1808
## Detection Prevalence 0.2862 0.1924 0.1767 0.1635 0.1813
## Balanced Accuracy 0.9976 0.9871 0.9932 0.9897 0.9914
plot(ModelFitGBM,ylim= c(0.7,1))
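As a cross-check on the validation-set estimate above, the cross-validated accuracy that caret used for model selection can be read directly from the train object (an optional inspection, not in the original write-up):

# Best cross-validated accuracy across the GBM tuning grid.
max(ModelFitGBM$results$Accuracy)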
2. Random Forest Model (using 5-fold cross-validation)
set.seed(12345)
# 5-fold cross-validation for the random forest.
controlRF <- trainControl(method = "cv", number = 5)
ModelFitRF <- train(classe ~ ., data = training, method = "rf",
                    trControl = controlRF, verbose = FALSE)
print(ModelFitRF$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.15%
## Confusion matrix:
## A B C D E class.error
## A 3906 0 0 0 0 0.0000000000
## B 4 2651 3 0 0 0.0026335591
## C 0 3 2393 0 0 0.0012520868
## D 0 0 9 2243 0 0.0039964476
## E 0 0 0 1 2524 0.0003960396
# Evaluate the random forest on the validation set.
predictionsRF <- predict(ModelFitRF, newdata = testing)
confusionMatrix(predictionsRF, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 3 0 0 0
## B 0 1133 1 0 2
## C 0 2 1025 4 0
## D 0 1 0 960 10
## E 1 0 0 0 1070
##
## Overall Statistics
##
## Accuracy : 0.9959
## 95% CI : (0.9939, 0.9974)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9948
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9947 0.9990 0.9959 0.9889
## Specificity 0.9993 0.9994 0.9988 0.9978 0.9998
## Pos Pred Value 0.9982 0.9974 0.9942 0.9887 0.9991
## Neg Pred Value 0.9998 0.9987 0.9998 0.9992 0.9975
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1925 0.1742 0.1631 0.1818
## Detection Prevalence 0.2848 0.1930 0.1752 0.1650 0.1820
## Balanced Accuracy 0.9993 0.9971 0.9989 0.9968 0.9944
plot(ModelFitRF)
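To see which sensor readings drive the random forest, caret's varImp gives a scaled importance ranking (an optional inspection, not part of the original analysis):

# Scaled variable importance for the random forest fit.
varImp(ModelFitRF)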
3. Decision Tree Model
set.seed(12345)
# Fit a single classification tree (rpart); caret's default bootstrap resampling is used here.
ModelFitDT <- train(classe ~ ., data = training, method = "rpart")
print(ModelFitDT$finalModel)
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 12570 8674 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -34 1150 6 A (0.99 0.0052 0 0 0) *
## 5) pitch_forearm>=-34 11420 8668 A (0.24 0.23 0.21 0.2 0.12)
## 10) num_window>=45.5 10924 8172 A (0.25 0.24 0.22 0.2 0.089)
## 20) magnet_dumbbell_y< 439.5 9297 6602 A (0.29 0.19 0.25 0.19 0.084)
## 40) num_window< 241.5 2220 911 A (0.59 0.14 0.12 0.12 0.031) *
## 41) num_window>=241.5 7077 5028 C (0.2 0.2 0.29 0.21 0.1)
## 82) magnet_dumbbell_z< -27.5 1593 595 A (0.63 0.23 0.064 0.062 0.019) *
## 83) magnet_dumbbell_z>=-27.5 5484 3537 C (0.071 0.2 0.36 0.25 0.12) *
## 21) magnet_dumbbell_y>=439.5 1627 729 B (0.035 0.55 0.046 0.25 0.12) *
## 11) num_window< 45.5 496 95 E (0 0 0 0.19 0.81) *
## 3) roll_belt>=130.5 1167 10 E (0.0086 0 0 0 0.99) *
# Evaluate the decision tree on the validation set.
predictionDT <- predict(ModelFitDT, newdata = testing)
confusionMatrix(predictionDT, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1474 252 147 177 38
## B 24 388 33 161 85
## C 172 499 846 575 313
## D 0 0 0 0 0
## E 4 0 0 51 646
##
## Overall Statistics
##
## Accuracy : 0.5699
## 95% CI : (0.5572, 0.5826)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4509
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8805 0.34065 0.8246 0.0000 0.5970
## Specificity 0.8542 0.93616 0.6792 1.0000 0.9885
## Pos Pred Value 0.7059 0.56151 0.3518 NaN 0.9215
## Neg Pred Value 0.9473 0.85541 0.9483 0.8362 0.9159
## Prevalence 0.2845 0.19354 0.1743 0.1638 0.1839
## Detection Rate 0.2505 0.06593 0.1438 0.0000 0.1098
## Detection Prevalence 0.3548 0.11742 0.4087 0.0000 0.1191
## Balanced Accuracy 0.8674 0.63840 0.7519 0.5000 0.7928
fancyRpartPlot(ModelFitDT$finalModel)
Applying the Selected Model to the Test Data:

The validation-set accuracies of the three models are:

1. Generalized Boosted Model (GBM): 0.9878
2. Random Forest (RF): 0.9959
3. Decision Tree (DT): 0.5699

Best model: the Random Forest. Its estimated accuracy and expected out-of-sample error are computed below.
# Accuracy and Kappa of the random forest on the validation set.
Accuracy <- postResample(predictionsRF, testing$classe)
Accuracy
## Accuracy Kappa
## 0.9959218 0.9948417
# Expected out-of-sample error = 1 - validation-set accuracy.
OoSerror <- 1 - as.numeric(confusionMatrix(testing$classe, predictionsRF)$overall[1])
OoSerror
## [1] 0.004078165
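Since the GBM and the random forest were trained with the same seed and the same 5-fold cross-validation, caret's resamples() can compare their fold-level accuracies side by side; the decision tree is excluded because it used caret's default bootstrap resampling. This comparison is an optional addition, not part of the original analysis:

# Fold-by-fold accuracy comparison of the two cross-validated models.
results <- resamples(list(GBM = ModelFitGBM, RF = ModelFitRF))
summary(results)$statistics$Accuracy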
Finally, the Random Forest model is applied to predict the 20 quiz cases in the testing dataset.
# Predict the 20 quiz cases with the selected random forest model.
predictionTest <- predict(ModelFitRF, newdata = testData)
predictionTest
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
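If the 20 predictions need to be submitted as one file per test case (as the original course assignment required), a small helper along these lines could be used; the function name and file-naming pattern are assumptions, not part of the original analysis:

# Hypothetical helper: write each prediction to problem_id_<i>.txt.
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(predictionTest))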