Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The outcome variable is classe, a factor variable with 5 levels:

- Class A: exactly according to the specification
- Class B: throwing the elbows to the front
- Class C: lifting the dumbbell only halfway
- Class D: lowering the dumbbell only halfway
- Class E: throwing the hips to the front
knitr::opts_chunk$set(echo = TRUE)
# Load the packages used throughout the analysis
library(caret)          # createDataPartition(), confusionMatrix()
library(rpart)          # classification trees
library(randomForest)   # random forests
library(rpart.plot)     # tree plotting
# Read the data, treating "NA", "#DIV/0!" and empty strings as missing values
training <- read.csv("C:/Users/Yaswanth Pulavarthi/Downloads/pml-training.csv", na.strings=c("NA","#DIV/0!",""))
testing  <- read.csv("C:/Users/Yaswanth Pulavarthi/Downloads/pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
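As a quick sanity check (not run in the original analysis), dim() confirms the raw shapes; the published pml files have 160 columns, with 19622 training rows and 20 test rows:

dim(training)   # expected: 19622 160
dim(testing)    # expected: 20 160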
# Drop columns that contain any missing values
training <- training[, colSums(is.na(training)) == 0]
testing  <- testing[, colSums(is.na(testing)) == 0]
# Drop the first 7 columns (row id, user name, timestamps, window info),
# which are metadata rather than sensor measurements
training <- training[, -c(1:7)]
testing  <- testing[, -c(1:7)]
# The outcome must be a factor for classification
training$classe <- as.factor(training$classe)
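Because the NA filter is applied to each file independently, it is worth verifying that the testing set still contains every predictor the model will be trained on. A minimal sanity-check sketch using base R (not part of the original run):

# Sanity check: the testing set should contain all training predictors
predictorNames <- setdiff(names(training), "classe")
stopifnot(all(predictorNames %in% names(testing)))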
# Split the cleaned training data 75/25 into a training and a validation set
subSamples  <- createDataPartition(y=training$classe, p=0.75, list=FALSE)
subTraining <- training[subSamples, ]
subTesting  <- training[-subSamples, ]
The expected out-of-sample error corresponds to 1 − accuracy on the cross-validation (hold-out) data, i.e. the proportion of misclassified cases in subTesting.
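The same estimate could also be obtained with explicit k-fold cross-validation through caret's train() interface. The following is a sketch of that alternative, not run in this report; the choice of 5 folds is an assumption, not part of the original analysis:

# Alternative sketch: 5-fold cross-validation instead of a single hold-out split
ctrl  <- trainControl(method = "cv", number = 5)
cvFit <- train(classe ~ ., data = subTraining, method = "rpart", trControl = ctrl)
# Expected out-of-sample error ≈ 1 - max(cvFit$results$Accuracy)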
# Frequency of each classe level in the training subset
barplot(table(subTraining$classe))
Class A occurs most often, while Class D is the least frequent level.
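The exact class proportions can be read off directly with base R (a small sketch, not run above; the values match the Prevalence row in the confusion matrices below):

# Class proportions in the training subset
round(prop.table(table(subTraining$classe)), 3)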
# Fit a classification tree to the training subset
modFitDTree <- rpart(classe ~ ., data=subTraining, method="class")
# Predict on the validation set
predictDTree <- predict(modFitDTree, subTesting, type = "class")
# Plot the fitted tree
rpart.plot(modFitDTree, main="Classification Tree", extra=102, under=TRUE, faclen=0)
confusionMatrix(predictDTree, subTesting$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1295 191 30 96 53
## B 33 563 72 23 67
## C 29 93 689 116 76
## D 20 69 43 523 48
## E 18 33 21 46 657
##
## Overall Statistics
##
## Accuracy : 0.76
## 95% CI : (0.7478, 0.7719)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6944
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9283 0.5933 0.8058 0.6505 0.7292
## Specificity 0.8946 0.9507 0.9224 0.9561 0.9705
## Pos Pred Value 0.7778 0.7427 0.6869 0.7440 0.8477
## Neg Pred Value 0.9691 0.9069 0.9574 0.9331 0.9409
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2641 0.1148 0.1405 0.1066 0.1340
## Detection Prevalence 0.3395 0.1546 0.2045 0.1434 0.1580
## Balanced Accuracy 0.9114 0.7720 0.8641 0.8033 0.8499
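From the 0.76 accuracy above, the implied out-of-sample error for the tree is roughly 0.24. It can be pulled out of the confusionMatrix object directly (a short sketch reusing the objects above):

# Out-of-sample error implied by the tree's validation accuracy
cmTree <- confusionMatrix(predictDTree, subTesting$classe)
1 - as.numeric(cmTree$overall["Accuracy"])   # ≈ 0.24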
# Fit a random forest; randomForest() infers classification from the factor
# outcome, so no method argument is needed
modFitRForest <- randomForest(classe ~ ., data=subTraining)
# Predict on the validation set
predictRForest <- predict(modFitRForest, subTesting, type = "class")
confusionMatrix(predictRForest, subTesting$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1393 4 0 0 0
## B 2 945 3 0 0
## C 0 0 852 5 0
## D 0 0 0 799 4
## E 0 0 0 0 897
##
## Overall Statistics
##
## Accuracy : 0.9963
## 95% CI : (0.9942, 0.9978)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9954
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9986 0.9958 0.9965 0.9938 0.9956
## Specificity 0.9989 0.9987 0.9988 0.9990 1.0000
## Pos Pred Value 0.9971 0.9947 0.9942 0.9950 1.0000
## Neg Pred Value 0.9994 0.9990 0.9993 0.9988 0.9990
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2841 0.1927 0.1737 0.1629 0.1829
## Detection Prevalence 0.2849 0.1937 0.1748 0.1637 0.1829
## Balanced Accuracy 0.9987 0.9973 0.9976 0.9964 0.9978
The random forest achieved over 99% accuracy on the validation set, for an expected out-of-sample error below 1%, so it is the model used to predict the final testing set.
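As with the tree, the expected out-of-sample error can be computed from the stored confusion matrix (a sketch reusing the objects above):

# Expected out-of-sample error for the random forest
cmRF <- confusionMatrix(predictRForest, subTesting$classe)
1 - as.numeric(cmRF$overall["Accuracy"])   # ≈ 0.0037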
# Apply the random forest to the 20 final test cases
finalClassePRED <- predict(modFitRForest, testing)
print(finalClassePRED)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
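If each prediction needs to be saved to its own text file (as the course submission once required), a small helper like the following could be used. writePredictionFiles is a hypothetical name and is not part of the analysis above:

# Hypothetical helper: write one prediction per file for submission
writePredictionFiles <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = sprintf("problem_id_%d.txt", i),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictionFiles(finalClassePRED)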