The goal of this project is to predict the manner in which participants did the exercise. This is the “classe” variable in the training set; any of the other variables may be used as predictors.
The source dataset description mentions 5 classes (sitting-down, standing-up, standing, walking, and sitting) collected over 8 hours of activities of 4 healthy subjects.
The goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).
The data for this project come from this source:
- The training data for this project are available here.
- The test data are available here.
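If the CSV files are not already in the working directory, they can be fetched first. This is a minimal sketch, assuming train_url and test_url are placeholders holding the training and test links above:
# fetch the CSVs if they are not already present (train_url / test_url stand in for the links above)
if (!file.exists("pml-training.csv")) download.file(train_url, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(test_url, destfile = "pml-testing.csv")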
# import libraries
library(plotly)
library(caret)
# import data
train <- read.csv("pml-training.csv")
test <- read.csv("pml-testing.csv")
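Before cleaning, it is worth checking the raw dimensions to see how many columns will need trimming:
# raw dimensions of the imported data
dim(train)
dim(test)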
First we need to clean up our dataset and only use columns that carry data:
- Handle NAs
- Remove unwanted columns
# removing columns with NAs
train <- train[, colSums(is.na(train)) == 0]
NZV <- nearZeroVar(train)
train <- train[, -NZV]
# removing the first seven bookkeeping columns (row id, user name, timestamps, window markers)
train <- train[, -c(1:7)]
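The same column selection can be mirrored on the test set so that both data frames carry identical predictors. This is optional, since predict() later matches columns by name; a minimal sketch (test_clean is an illustrative name and is not used below):
# optional: keep only the predictor columns that survived cleaning
test_clean <- test[, intersect(names(train), names(test))]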
# examine our objective column
plot_ly(x = train$classe, type = "histogram", histnorm = "probability") %>%
  layout(title = "Proportion of Classe variable in Train dataset",
         xaxis = list(title = "Classe"), yaxis = list(title = "Probability"))
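The plot shows that class A is the most frequent class, at roughly 28% of observations, with the remaining classes fairly evenly spread between about 16% and 19%.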
The two main modeling techniques will be a decision tree and a random forest.
library(rpart)
library(rpart.plot)
# create training and validation sets (70/30 split)
inTrain <- createDataPartition(train$classe, p = 0.7, list = FALSE)
training <- train[inTrain, ]
validation <- train[-inTrain, ]
training$classe <- as.factor(training$classe)
validation$classe <- as.factor(validation$classe)
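Note that createDataPartition samples randomly, so the exact counts below will differ slightly between runs; setting a seed before the split would make the results reproducible (the seed value here is arbitrary):
# example: set a seed, then re-run the partition above for a reproducible split
set.seed(2023)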
# build decision tree model & plot the result
mod1 <- rpart(classe ~ ., data = training, method = "class", na.action = na.pass)
rpart.plot(mod1, main = "Model #1: Decision Tree")
# use model #1 to predict on the validation set
pred1 <- predict(mod1, newdata=validation, type = "class")
confusionTree <- confusionMatrix(pred1, validation$classe)
confusionTree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1487 210 9 75 69
## B 34 636 90 31 125
## C 32 74 763 157 149
## D 95 178 134 658 133
## E 26 41 30 43 606
##
## Overall Statistics
##
## Accuracy : 0.7052
## 95% CI : (0.6933, 0.7168)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6263
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8883 0.5584 0.7437 0.6826 0.5601
## Specificity 0.9138 0.9410 0.9152 0.8903 0.9709
## Pos Pred Value 0.8038 0.6943 0.6494 0.5492 0.8123
## Neg Pred Value 0.9537 0.8988 0.9442 0.9347 0.9074
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2527 0.1081 0.1297 0.1118 0.1030
## Detection Prevalence 0.3144 0.1556 0.1997 0.2036 0.1268
## Balanced Accuracy 0.9010 0.7497 0.8294 0.7864 0.7655
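The decision tree reaches an accuracy of about 70.5% on the validation set, which corresponds to an estimated out-of-sample error of roughly 29.5%.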
library(randomForest)
# build random forest model & plot error rates vs. number of trees
mod2 <- randomForest(classe ~ ., data=training, ntree=50, mtry=5, importance=TRUE)
plot(mod2, log="y")
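The plot shows the out-of-bag and per-class error rates as trees are added to the forest.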
# use model #2 to predict on the validation set
pred2 <- predict(mod2, validation)
confusionRF <- confusionMatrix(pred2, validation$classe)
confusionRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 10 0 0 0
## B 2 1126 9 0 0
## C 0 3 1015 21 2
## D 0 0 2 940 0
## E 0 0 0 3 1080
##
## Overall Statistics
##
## Accuracy : 0.9912
## 95% CI : (0.9884, 0.9934)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9888
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9886 0.9893 0.9751 0.9982
## Specificity 0.9976 0.9977 0.9946 0.9996 0.9994
## Pos Pred Value 0.9941 0.9903 0.9750 0.9979 0.9972
## Neg Pred Value 0.9995 0.9973 0.9977 0.9951 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2841 0.1913 0.1725 0.1597 0.1835
## Detection Prevalence 0.2858 0.1932 0.1769 0.1601 0.1840
## Balanced Accuracy 0.9982 0.9931 0.9920 0.9873 0.9988
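The random forest reaches an accuracy of about 99.1% on the validation set (an estimated out-of-sample error below 1%), clearly outperforming the decision tree, so it is used to predict the 20 test cases. Since the model was fit with importance = TRUE, the most influential predictors can also be inspected; a minimal sketch:
# inspect the ten most important predictors in the random forest
varImpPlot(mod2, n.var = 10, main = "Model #2: Variable Importance (top 10)")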
# use model #2 to predict the 20 test cases
predict(mod2, newdata = test)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E