Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The outcome variable is classe, a factor variable with 5 levels:

- Class A: exactly according to the specification
- Class B: throwing the elbows to the front
- Class C: lifting the dumbbell only halfway
- Class D: lowering the dumbbell only halfway
- Class E: throwing the hips to the front
knitr::opts_chunk$set(echo = TRUE)
# Load the packages used throughout the analysis
library(caret)          # createDataPartition(), confusionMatrix()
library(rpart)          # classification trees
library(randomForest)   # random forests
library(rpart.plot)     # tree plotting
# Read the data, treating "NA", "#DIV/0!" and empty strings as missing values
training <- read.csv("C:/Users/Yaswanth Pulavarthi/Downloads/pml-training.csv", na.strings=c("NA","#DIV/0!",""))
testing  <- read.csv("C:/Users/Yaswanth Pulavarthi/Downloads/pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
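As a quick sanity check (not run in the original analysis), dim() confirms the raw shapes; the published pml files have 160 columns, with 19622 training rows and 20 test rows:

dim(training)   # expected: 19622 160
dim(testing)    # expected: 20 160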
# Drop columns that contain any missing values
training <- training[, colSums(is.na(training)) == 0]
testing  <- testing[, colSums(is.na(testing)) == 0]
# Drop the first 7 columns (row id, user name, timestamps, window info),
# which are metadata rather than sensor measurements
training <- training[, -c(1:7)]
testing  <- testing[, -c(1:7)]
# The outcome must be a factor for classification
training$classe <- as.factor(training$classe)
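Because the NA filter is applied to each file independently, it is worth verifying that the testing set still contains every predictor the model will be trained on. A minimal sanity-check sketch using base R (not part of the original run):

# Sanity check: the testing set should contain all training predictors
predictorNames <- setdiff(names(training), "classe")
stopifnot(all(predictorNames %in% names(testing)))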
# Split the cleaned training data 75/25 into a training and a validation set
subSamples  <- createDataPartition(y=training$classe, p=0.75, list=FALSE)
subTraining <- training[subSamples, ]
subTesting  <- training[-subSamples, ]
The expected out-of-sample error corresponds to 1 − accuracy on the cross-validation (hold-out) data, i.e. the proportion of misclassified cases in subTesting.
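The same estimate could also be obtained with explicit k-fold cross-validation through caret's train() interface. The following is a sketch of that alternative, not run in this report; the choice of 5 folds is an assumption, not part of the original analysis:

# Alternative sketch: 5-fold cross-validation instead of a single hold-out split
ctrl  <- trainControl(method = "cv", number = 5)
cvFit <- train(classe ~ ., data = subTraining, method = "rpart", trControl = ctrl)
# Expected out-of-sample error ≈ 1 - max(cvFit$results$Accuracy)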
# Frequency of each classe level in the training subset
barplot(table(subTraining$classe))
Class A occurs most often, while Class D is the least frequent level.
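The exact class proportions can be read off directly with base R (a small sketch, not run above; the values match the Prevalence row in the confusion matrices below):

# Class proportions in the training subset
round(prop.table(table(subTraining$classe)), 3)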
# Fit a classification tree to the training subset
modFitDTree <- rpart(classe ~ ., data=subTraining, method="class")
# Predict on the validation set
predictDTree <- predict(modFitDTree, subTesting, type = "class")
# Plot the fitted tree
rpart.plot(modFitDTree, main="Classification Tree", extra=102, under=TRUE, faclen=0)
confusionMatrix(predictDTree, subTesting$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1295 191 30 96 53
## B 33 563 72 23 67
## C 29 93 689 116 76
## D 20 69 43 523 48
## E 18 33 21 46 657
##
## Overall Statistics
##
## Accuracy : 0.76
## 95% CI : (0.7478, 0.7719)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6944
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9283 0.5933 0.8058 0.6505 0.7292
## Specificity 0.8946 0.9507 0.9224 0.9561 0.9705
## Pos Pred Value 0.7778 0.7427 0.6869 0.7440 0.8477
## Neg Pred Value 0.9691 0.9069 0.9574 0.9331 0.9409
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2641 0.1148 0.1405 0.1066 0.1340
## Detection Prevalence 0.3395 0.1546 0.2045 0.1434 0.1580
## Balanced Accuracy 0.9114 0.7720 0.8641 0.8033 0.8499
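From the 0.76 accuracy above, the implied out-of-sample error for the tree is roughly 0.24. It can be pulled out of the confusionMatrix object directly (a short sketch reusing the objects above):

# Out-of-sample error implied by the tree's validation accuracy
cmTree <- confusionMatrix(predictDTree, subTesting$classe)
1 - as.numeric(cmTree$overall["Accuracy"])   # ≈ 0.24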
# Fit a random forest; randomForest() infers classification from the factor
# outcome, so no method argument is needed
modFitRForest <- randomForest(classe ~ ., data=subTraining)
# Predict on the validation set
predictRForest <- predict(modFitRForest, subTesting, type = "class")
confusionMatrix(predictRForest, subTesting$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1393 4 0 0 0
## B 2 945 3 0 0
## C 0 0 852 5 0
## D 0 0 0 799 4
## E 0 0 0 0 897
##
## Overall Statistics
##
## Accuracy : 0.9963
## 95% CI : (0.9942, 0.9978)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9954
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9986 0.9958 0.9965 0.9938 0.9956
## Specificity 0.9989 0.9987 0.9988 0.9990 1.0000
## Pos Pred Value 0.9971 0.9947 0.9942 0.9950 1.0000
## Neg Pred Value 0.9994 0.9990 0.9993 0.9988 0.9990
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2841 0.1927 0.1737 0.1629 0.1829
## Detection Prevalence 0.2849 0.1937 0.1748 0.1637 0.1829
## Balanced Accuracy 0.9987 0.9973 0.9976 0.9964 0.9978
The random forest achieved over 99% accuracy on the validation set, for an expected out-of-sample error below 1%, so it is the model used to predict the final testing set.
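As with the tree, the expected out-of-sample error can be computed from the stored confusion matrix (a sketch reusing the objects above):

# Expected out-of-sample error for the random forest
cmRF <- confusionMatrix(predictRForest, subTesting$classe)
1 - as.numeric(cmRF$overall["Accuracy"])   # ≈ 0.0037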
# Apply the random forest to the 20 final test cases
finalClassePRED <- predict(modFitRForest, testing)
print(finalClassePRED)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
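If each prediction needs to be saved to its own text file (as the course submission once required), a small helper like the following could be used. writePredictionFiles is a hypothetical name and is not part of the analysis above:

# Hypothetical helper: write one prediction per file for submission
writePredictionFiles <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = sprintf("problem_id_%d.txt", i),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictionFiles(finalClassePRED)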