Introduction:

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of an activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of six participants who were asked to perform barbell lifts correctly and incorrectly in five different ways. Specifically, six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different classes:

1. Class A: exactly according to the specification
2. Class B: throwing the elbows to the front
3. Class C: lifting the dumbbell only halfway
4. Class D: lowering the dumbbell only halfway
5. Class E: throwing the hips to the front

Goal: The goal of this project is to predict the manner in which the participants performed the exercise, recorded in the "classe" variable of the training dataset. Cross-validation will be used to build the machine learning models, the expected out-of-sample error will be calculated, and the chosen model will be used to predict 20 different test cases.

Note: This project uses the Weight Lifting Exercises (WLE) dataset. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.

I want to thank the authors for their generosity in allowing me to use their dataset for this assignment.

Sources of dataset: http://groupware.les.inf.puc-rio.br/har
Training set: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Testing set: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Data Preprocessing:
Loading of Dataset:
# Packages used throughout the analysis.
library(caret)
library(corrplot)
library(rattle)
trainurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainData <- read.csv(url(trainurl), header = TRUE)
testData <- read.csv(url(testurl), header = TRUE)
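One optional refinement, assuming the raw WLE CSVs encode missing values as "NA", "#DIV/0!", and empty strings (a property of the files, not shown in the original code): normalizing these to NA at read time makes the NA filter below more reliable.

# Optional: treat "NA", "#DIV/0!", and "" as missing values on read.
trainData <- read.csv(url(trainurl), header = TRUE,
                      na.strings = c("NA", "#DIV/0!", ""))
testData <- read.csv(url(testurl), header = TRUE,
                     na.strings = c("NA", "#DIV/0!", ""))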
Data Cleaning: Remove near-zero-variance variables, then drop variables that are almost entirely NA (using the mean proportion of NAs per column).
# Drop near-zero-variance predictors.
NZV <- nearZeroVar(trainData, saveMetrics = TRUE)
NZV1 <- nearZeroVar(testData, saveMetrics = TRUE)
trainData <- trainData[, NZV$nzv == FALSE]
testData <- testData[, NZV1$nzv == FALSE]
# Compute the proportion of NAs per column and keep only columns with no missing values.
AllNA <- sapply(trainData, function(x) mean(is.na(x)))
AllNA1 <- sapply(testData, function(x) mean(is.na(x)))
trainData <- trainData[, AllNA == 0]
testData <- testData[, AllNA1 == 0]
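As a quick check that the cleaning worked (an addition, not in the original analysis), no missing values should remain in either dataset:

# Both counts should be zero after the NA filter.
sum(is.na(trainData))
sum(is.na(testData))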
Remove the first five columns (row index, user name, and timestamps) from the training and testing datasets, since they are identifiers rather than sensor measurements. The number of variables is now reduced to 54.
trainData <- trainData[,-(1:5)]
testData <- testData[,-(1:5)]
dim(trainData)
## [1] 19622 54
dim(testData)
## [1] 20 54
Partition of Dataset:
# Ensure the outcome is a factor (R >= 4.0 reads strings as character by default).
trainData$classe <- factor(trainData$classe)
# Split the cleaned data: 70% for model training, 30% for validation.
inTrain <- createDataPartition(trainData$classe, p = 0.7, list = FALSE)
training <- trainData[inTrain, ]
testing <- trainData[-inTrain, ]
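As a sanity check on the split (an addition, not in the original analysis), the class proportions in the training partition should mirror the full dataset:

# Verify the 70/30 split and the class balance within the training partition.
dim(training); dim(testing)
round(prop.table(table(training$classe)), 3)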
Correlation Analysis:
# Correlation matrix of the 53 predictors (column 54 is the outcome, classe).
corMatrix <- cor(training[, -54])
corrplot(corMatrix, order = "FPC", method = "color", type = "lower",
         tl.cex = 0.8, tl.col = rgb(0, 0, 0))
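The plot highlights a handful of strongly correlated sensor pairs. As a complementary check, caret's findCorrelation can list the predictors above a chosen cutoff; the 0.8 threshold here is an assumption, not part of the original analysis:

# List predictors with pairwise correlation above 0.8 (cutoff is an assumption).
highCorr <- findCorrelation(corMatrix, cutoff = 0.8)
names(training)[highCorr]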
Prediction Models: 1. Generalized Boosted Model (GBM)
set.seed(12345)
# 5-fold cross-validation for the boosted model.
controlGBM <- trainControl(method = "cv", number = 5)
ModelFitGBM <- train(classe ~ ., data = training, method = "gbm",
                     trControl = controlGBM, verbose = FALSE)
print(ModelFitGBM$finalModel)
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 41 had non-zero influence.
# Evaluate the GBM on the held-out validation set.
predictionsGBM <- predict(ModelFitGBM, newdata = testing)
confusionMatrix(predictionsGBM, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 12 0 1 0
## B 1 1114 9 6 2
## C 0 13 1017 8 2
## D 1 0 0 947 14
## E 1 0 0 2 1064
##
## Overall Statistics
##
## Accuracy : 0.9878
## 95% CI : (0.9846, 0.9904)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9845
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9781 0.9912 0.9824 0.9834
## Specificity 0.9969 0.9962 0.9953 0.9970 0.9994
## Pos Pred Value 0.9923 0.9841 0.9779 0.9844 0.9972
## Neg Pred Value 0.9993 0.9947 0.9981 0.9965 0.9963
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2839 0.1893 0.1728 0.1609 0.1808
## Detection Prevalence 0.2862 0.1924 0.1767 0.1635 0.1813
## Balanced Accuracy 0.9976 0.9871 0.9932 0.9897 0.9914
plot(ModelFitGBM,ylim= c(0.7,1))
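As a cross-check on the validation-set estimate above, the cross-validated accuracy that caret used for model selection can be read directly from the train object (an optional inspection, not in the original write-up):

# Best cross-validated accuracy across the GBM tuning grid.
max(ModelFitGBM$results$Accuracy)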
2. Random Forest Model (using 5-fold cross-validation)
set.seed(12345)
# 5-fold cross-validation for the random forest.
controlRF <- trainControl(method = "cv", number = 5)
ModelFitRF <- train(classe ~ ., data = training, method = "rf",
                    trControl = controlRF, verbose = FALSE)
print(ModelFitRF$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.15%
## Confusion matrix:
## A B C D E class.error
## A 3906 0 0 0 0 0.0000000000
## B 4 2651 3 0 0 0.0026335591
## C 0 3 2393 0 0 0.0012520868
## D 0 0 9 2243 0 0.0039964476
## E 0 0 0 1 2524 0.0003960396
# Evaluate the random forest on the validation set.
predictionsRF <- predict(ModelFitRF, newdata = testing)
confusionMatrix(predictionsRF, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 3 0 0 0
## B 0 1133 1 0 2
## C 0 2 1025 4 0
## D 0 1 0 960 10
## E 1 0 0 0 1070
##
## Overall Statistics
##
## Accuracy : 0.9959
## 95% CI : (0.9939, 0.9974)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9948
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9947 0.9990 0.9959 0.9889
## Specificity 0.9993 0.9994 0.9988 0.9978 0.9998
## Pos Pred Value 0.9982 0.9974 0.9942 0.9887 0.9991
## Neg Pred Value 0.9998 0.9987 0.9998 0.9992 0.9975
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1925 0.1742 0.1631 0.1818
## Detection Prevalence 0.2848 0.1930 0.1752 0.1650 0.1820
## Balanced Accuracy 0.9993 0.9971 0.9989 0.9968 0.9944
plot(ModelFitRF)
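To see which sensor readings drive the random forest, caret's varImp gives a scaled importance ranking (an optional inspection, not part of the original analysis):

# Scaled variable importance for the random forest fit.
varImp(ModelFitRF)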
3. Decision Tree Model
set.seed(12345)
# Fit a single classification tree (rpart); caret's default bootstrap resampling is used here.
ModelFitDT <- train(classe ~ ., data = training, method = "rpart")
print(ModelFitDT$finalModel)
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 12570 8674 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -34 1150 6 A (0.99 0.0052 0 0 0) *
## 5) pitch_forearm>=-34 11420 8668 A (0.24 0.23 0.21 0.2 0.12)
## 10) num_window>=45.5 10924 8172 A (0.25 0.24 0.22 0.2 0.089)
## 20) magnet_dumbbell_y< 439.5 9297 6602 A (0.29 0.19 0.25 0.19 0.084)
## 40) num_window< 241.5 2220 911 A (0.59 0.14 0.12 0.12 0.031) *
## 41) num_window>=241.5 7077 5028 C (0.2 0.2 0.29 0.21 0.1)
## 82) magnet_dumbbell_z< -27.5 1593 595 A (0.63 0.23 0.064 0.062 0.019) *
## 83) magnet_dumbbell_z>=-27.5 5484 3537 C (0.071 0.2 0.36 0.25 0.12) *
## 21) magnet_dumbbell_y>=439.5 1627 729 B (0.035 0.55 0.046 0.25 0.12) *
## 11) num_window< 45.5 496 95 E (0 0 0 0.19 0.81) *
## 3) roll_belt>=130.5 1167 10 E (0.0086 0 0 0 0.99) *
# Evaluate the decision tree on the validation set.
predictionDT <- predict(ModelFitDT, newdata = testing)
confusionMatrix(predictionDT, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1474 252 147 177 38
## B 24 388 33 161 85
## C 172 499 846 575 313
## D 0 0 0 0 0
## E 4 0 0 51 646
##
## Overall Statistics
##
## Accuracy : 0.5699
## 95% CI : (0.5572, 0.5826)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4509
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8805 0.34065 0.8246 0.0000 0.5970
## Specificity 0.8542 0.93616 0.6792 1.0000 0.9885
## Pos Pred Value 0.7059 0.56151 0.3518 NaN 0.9215
## Neg Pred Value 0.9473 0.85541 0.9483 0.8362 0.9159
## Prevalence 0.2845 0.19354 0.1743 0.1638 0.1839
## Detection Rate 0.2505 0.06593 0.1438 0.0000 0.1098
## Detection Prevalence 0.3548 0.11742 0.4087 0.0000 0.1191
## Balanced Accuracy 0.8674 0.63840 0.7519 0.5000 0.7928
fancyRpartPlot(ModelFitDT$finalModel)
Applying the Selected Model to the Test Data:

The validation-set accuracies of the three models are:

1. Generalized Boosted Model (GBM): 0.9878
2. Random Forest (RF): 0.9959
3. Decision Tree (DT): 0.5699

Best model: the Random Forest. Its estimated accuracy and expected out-of-sample error are computed below.
# Accuracy and Kappa of the random forest on the validation set.
Accuracy <- postResample(predictionsRF, testing$classe)
Accuracy
## Accuracy Kappa
## 0.9959218 0.9948417
# Expected out-of-sample error = 1 - validation-set accuracy.
OoSerror <- 1 - as.numeric(confusionMatrix(testing$classe, predictionsRF)$overall[1])
OoSerror
## [1] 0.004078165
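Since the GBM and the random forest were trained with the same seed and the same 5-fold cross-validation, caret's resamples() can compare their fold-level accuracies side by side; the decision tree is excluded because it used caret's default bootstrap resampling. This comparison is an optional addition, not part of the original analysis:

# Fold-by-fold accuracy comparison of the two cross-validated models.
results <- resamples(list(GBM = ModelFitGBM, RF = ModelFitRF))
summary(results)$statistics$Accuracy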
Finally, the Random Forest model is applied to predict the 20 quiz cases in the testing dataset.
# Predict the 20 quiz cases with the selected random forest model.
predictionTest <- predict(ModelFitRF, newdata = testData)
predictionTest
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
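If the 20 predictions need to be submitted as one file per test case (as the original course assignment required), a small helper along these lines could be used; the function name and file-naming pattern are assumptions, not part of the original analysis:

# Hypothetical helper: write each prediction to problem_id_<i>.txt.
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(predictionTest))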