Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do. We use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
In this analysis we use the HAR dataset to predict whether the exercise done by a participant was performed correctly. The outcome variable classe is a factor with 5 levels: A, B, C, D, E. Participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in 5 different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).
We use random forests and decision trees to train our models, cross-validate the results, and predict the outcome on a test set of 20 observations.
Cross-validation is done by subsampling our trainset data into a training (75%) dataset and a validation (25%) dataset. The model is trained on the training dataset and validated on the validation dataset. The test set is used to predict the outcome for 20 observations.
set.seed(123)
trainset<-read.csv("C:/Users/himan/Desktop/HAR-Analysis/pml-training.csv")
testing<-read.csv("C:/Users/himan/Desktop/HAR-Analysis/pml-testing.csv")
dim(trainset)
## [1] 19622 160
table(trainset$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
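Expressed as proportions, the largest class (A, about 0.284) corresponds to the no-information rate reported in the confusion matrices below; a quick check:
# Class proportions; Class A (~0.2843) is the no-information rate,
# i.e., the accuracy of always predicting the most frequent class
round(prop.table(table(trainset$classe)),4)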
Loading Required Libraries:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
Partitioning the Training Set:
partition<-createDataPartition(y=trainset$classe,p=0.75,list=FALSE)
training<-trainset[partition,]
validation<-trainset[-partition,]
Removing near-zero variance variables:
training<-training[,-nearZeroVar(training)]
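As a diagnostic (not part of the original pipeline), nearZeroVar() can also return the metrics behind each decision; recomputing on the pre-removal training rows shows which variables were flagged and why:
# saveMetrics=TRUE returns the frequency-ratio and percent-unique
# statistics for every column instead of just the offending indices
nzvmetrics<-nearZeroVar(trainset[partition,],saveMetrics=TRUE)
head(nzvmetrics[nzvmetrics$nzv,])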
Removing Descriptive Variables:
Cl<-c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
"cvtd_timestamp", "new_window", "num_window")
training <- training[,!names(training) %in% Cl]
Removing Variables with more than 60% NA values:
training[training==""] <- NA
naproportion<-sapply(training,function(x) sum(is.na(x))/nrow(training))
training<-training[!(naproportion>0.60)]
dim(training)
## [1] 14718 53
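caret's predict() selects the columns it needs from new data, so the validation and test sets do not strictly need the same column reduction. To keep the sets aligned explicitly, a minimal sketch (validation_aligned and testing_aligned are illustrative names; the test set has problem_id in place of classe, hence the intersect):
# Optional: restrict the other sets to the columns retained in training
validation_aligned<-validation[,names(training)]
testing_aligned<-testing[,intersect(names(training),names(testing))]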
Training With Decision Tree:
modelrp<-train(classe~.,data=training,method='rpart')
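By default, train() tunes rpart's complexity parameter with bootstrap resampling. To make the resampling an explicit k-fold cross-validation, a trainControl object can be passed in; a sketch with 5 folds (the fold count is our choice, not fixed by this analysis):
# 5-fold cross-validation in place of the default bootstrap resampling
ctrl<-trainControl(method='cv',number=5)
modelrp_cv<-train(classe~.,data=training,method='rpart',trControl=ctrl)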
Prediction with the trained model and accuracy on the training set:
predictrp<-predict(modelrp,training)
print(confusionMatrix(predictrp,training$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3811 1202 1207 1100 397
## B 58 966 85 433 370
## C 303 680 1275 879 715
## D 0 0 0 0 0
## E 13 0 0 0 1224
##
## Overall Statistics
##
## Accuracy : 0.4944
## 95% CI : (0.4863, 0.5025)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3388
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9106 0.33919 0.49669 0.0000 0.45233
## Specificity 0.6292 0.92030 0.78792 1.0000 0.99892
## Pos Pred Value 0.4938 0.50523 0.33100 NaN 0.98949
## Neg Pred Value 0.9466 0.85304 0.88110 0.8361 0.89007
## Prevalence 0.2843 0.19350 0.17441 0.1639 0.18386
## Detection Rate 0.2589 0.06563 0.08663 0.0000 0.08316
## Detection Prevalence 0.5243 0.12991 0.26172 0.0000 0.08405
## Balanced Accuracy 0.7699 0.62974 0.64230 0.5000 0.72562
fancyRpartPlot(modelrp$finalModel)
Accuracy on Validation Set:
print(confusionMatrix(predict(modelrp,validation),validation$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1271 384 380 349 127
## B 23 320 23 135 116
## C 100 245 452 320 251
## D 0 0 0 0 0
## E 1 0 0 0 407
##
## Overall Statistics
##
## Accuracy : 0.4996
## 95% CI : (0.4855, 0.5137)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3468
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9111 0.33720 0.52865 0.0000 0.45172
## Specificity 0.6466 0.92491 0.77377 1.0000 0.99975
## Pos Pred Value 0.5062 0.51864 0.33041 NaN 0.99755
## Neg Pred Value 0.9482 0.85328 0.88603 0.8361 0.89012
## Prevalence 0.2845 0.19352 0.17435 0.1639 0.18373
## Detection Rate 0.2592 0.06525 0.09217 0.0000 0.08299
## Detection Prevalence 0.5120 0.12582 0.27896 0.0000 0.08320
## Balanced Accuracy 0.7789 0.63105 0.65121 0.5000 0.72574
The decision tree had an accuracy of 0.4944 on the training set and 0.4996 on the validation set.
Training with Random Forest:
modelrf<-train(classe~.,data=training,method='rf',ntree=10)
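To see which sensor readings drive the forest, caret exposes the underlying variable importance; a quick sketch:
# Predictor importance (mean decrease in Gini impurity, scaled 0-100 by caret)
varImp(modelrf)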
Prediction with the trained model and accuracy on the training set:
predictrf<-predict(modelrf,training)
print(confusionMatrix(predictrf,training$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4185 5 0 0 0
## B 0 2843 1 0 0
## C 0 0 2566 3 0
## D 0 0 0 2409 1
## E 0 0 0 0 2705
##
## Overall Statistics
##
## Accuracy : 0.9993
## 95% CI : (0.9988, 0.9997)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9991
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9982 0.9996 0.9988 0.9996
## Specificity 0.9995 0.9999 0.9998 0.9999 1.0000
## Pos Pred Value 0.9988 0.9996 0.9988 0.9996 1.0000
## Neg Pred Value 1.0000 0.9996 0.9999 0.9998 0.9999
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2843 0.1932 0.1743 0.1637 0.1838
## Detection Prevalence 0.2847 0.1932 0.1745 0.1637 0.1838
## Balanced Accuracy 0.9998 0.9991 0.9997 0.9993 0.9998
Accuracy on Validation Set:
print(confusionMatrix(predict(modelrf,validation),validation$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1392 6 0 0 0
## B 3 941 10 1 0
## C 0 2 840 14 3
## D 0 0 5 788 4
## E 0 0 0 1 894
##
## Overall Statistics
##
## Accuracy : 0.99
## 95% CI : (0.9868, 0.9926)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9874
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9978 0.9916 0.9825 0.9801 0.9922
## Specificity 0.9983 0.9965 0.9953 0.9978 0.9998
## Pos Pred Value 0.9957 0.9853 0.9779 0.9887 0.9989
## Neg Pred Value 0.9991 0.9980 0.9963 0.9961 0.9983
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2838 0.1919 0.1713 0.1607 0.1823
## Detection Prevalence 0.2851 0.1947 0.1752 0.1625 0.1825
## Balanced Accuracy 0.9981 0.9940 0.9889 0.9890 0.9960
The random forest had an accuracy of 0.9993 on the training set and 0.9900 on the validation set.
The out-of-sample (validation set) accuracy of the random forest, 0.9900, was far greater than that of the decision tree, 0.4996, so the random forest model is used for the final predictions on the test set.
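Equivalently, the expected out-of-sample error (one minus the validation accuracy, roughly 0.01) can be pulled from the confusion matrix object rather than read off the printout:
# Out-of-sample error estimate from the validation confusion matrix
cmrf<-confusionMatrix(predict(modelrf,validation),validation$classe)
1-cmrf$overall["Accuracy"]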
Prediction on the Test Set:
predictions<-predict(modelrf,testing)
predictions
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
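For record-keeping, each prediction can be paired with its problem ID from the test file (assuming, as in the standard pml-testing.csv, a problem_id column):
# Pair each predicted class with its problem_id from the test set
data.frame(problem_id=testing$problem_id,classe=predictions)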