Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how an exercise was performed (the classe variable). The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data are downloaded and loaded into R.
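# Source files (same URLs as listed above)
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# Treat "NA" and blank fields as missing; keep text columns (including classe) as factors under R >= 4.0
training <- read.csv(url(trainUrl), na.strings = c("NA", ""), stringsAsFactors = TRUE)
testing <- read.csv(url(testUrl), na.strings = c("NA", ""), stringsAsFactors = TRUE)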
nrow(training)
## [1] 19622
ncol(training)
## [1] 160
nrow(testing)
## [1] 20
ncol(testing)
## [1] 160
The training dataset consists of:
* 19622 observations
* 160 columns
The testing dataset consists of:
* 20 observations
* 160 columns
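The effect of the `na.strings` argument used when loading the files can be seen on a small example. This is a minimal sketch on an invented three-row CSV (the column names and values are made up, not taken from the real files): by default only the literal "NA" is read as missing in a text column, while the setting used in this report also converts blank fields.

```r
# Toy CSV text standing in for a real file (invented values)
csv_text <- "id,reading,label
1,1.41,
2,1.42,NA
3,1.48,ok"

# Default behaviour: only the string "NA" is treated as missing
default_read <- read.csv(text = csv_text)

# Setting used in the report: blank fields are treated as missing too
report_read <- read.csv(text = csv_text, na.strings = c("NA", ""))

print(sum(is.na(default_read$label)))  # missing values with the default
print(sum(is.na(report_read$label)))   # missing values with na.strings = c("NA", "")
```

Marking blanks as missing at load time is what allows the later column filter to detect sparsely populated columns with a single `is.na()` check.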
Columns that contain any missing values are removed from both the training and the testing sets.
train_clean<-training[, colSums(is.na(training)) == 0]
test<-testing[, colSums(is.na(testing)) == 0]
Every column containing at least one NA is dropped, which reduces both the ‘train’ and ‘test’ data from 160 columns to 60. Removing these columns avoids having to impute or otherwise handle missing values during model fitting.
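The column filter can be illustrated on a toy data frame. The sketch below also shows an optional second step that is not part of this report: dropping bookkeeping columns such as a row index or a subject name before modelling. Since the model output in this report lists 59 predictors, every remaining column other than classe is used as a predictor, so it may be worth checking whether any of them are identifiers rather than sensor readings; a row index in particular can reveal the class if the file happens to be sorted by class. The column names and the contents of `id_cols` here are assumptions made for the toy data.

```r
# Toy stand-in for the raw data; all values are invented for illustration
toy <- data.frame(
  X = 1:6,
  user_name = c("a", "a", "b", "b", "c", "c"),
  roll_belt = c(1.4, 1.5, 1.3, 1.6, 1.2, 1.7),
  max_roll_belt = c(NA, NA, 2.0, NA, NA, NA),
  classe = factor(c("A", "A", "B", "B", "C", "C"))
)

# Step 1 (as in the report): keep only columns with no missing values
complete <- toy[, colSums(is.na(toy)) == 0]

# Step 2 (assumption, not in the report): drop bookkeeping columns by name
id_cols <- c("X", "user_name")  # hypothetical list; adjust to the real file
sensors_only <- complete[, !(names(complete) %in% id_cols)]

print(names(complete))
print(names(sensors_only))
```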
Before building models, we split the cleaned training data into two parts: a training set used to fit the models and a validation set used to estimate the out-of-sample error. We use 70% of the observations for training and 30% for validation.
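library(caret)  # provides createDataPartition(), train(), and confusionMatrix()
# 70/30 split of the cleaned data, sampled within each level of classe
trainPartition <- createDataPartition(train_clean$classe, p = 0.7, list = FALSE)
train <- train_clean[trainPartition, ]
valid <- train_clean[-trainPartition, ]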
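For a factor outcome, `createDataPartition()` samples within each class so that the class proportions are similar in both parts. The idea can be sketched in base R as follows. This is an illustration on invented data, not caret's actual implementation, and the seed is arbitrary.

```r
set.seed(123)  # arbitrary seed, only to make the sketch reproducible

# Toy outcome with unequal class sizes (invented data)
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class, then combine them
idx_by_class <- split(seq_along(classe), classe)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.7 * length(i)))
}))

train_y <- classe[train_idx]
valid_y <- classe[-train_idx]

print(table(train_y))  # class counts in the training part
print(table(valid_y))  # class counts in the validation part
```

Setting a seed before the split in the real analysis would make the partition, and therefore the reported accuracies, reproducible from run to run.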
We first fit a decision tree: a conditional inference tree from the ‘party’ package, trained through caret with 5-fold cross-validation.
control <- trainControl(method = "cv", number = 5)
model_ctree <- train(classe ~ ., data = train, method = "ctree", trControl = control)
print(model_ctree, digits = 5)
## Conditional Inference Tree
##
## 13737 samples
## 59 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10989, 10990, 10990, 10989
## Resampling results across tuning parameters:
##
## mincriterion Accuracy Kappa
## 0.01 0.99964 0.99954
## 0.50 0.99964 0.99954
## 0.99 0.99964 0.99954
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mincriterion = 0.99.
plot(model_ctree$finalModel)
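The "Summary of sample sizes" line in the output follows directly from 5-fold cross-validation: each candidate model is fitted on four fifths of the 13737 training rows and scored on the remaining fifth. A minimal sketch of the fold bookkeeping, assuming simple random fold assignment (caret's own fold construction may differ, for example by balancing classes across folds):

```r
set.seed(1)  # arbitrary seed for the sketch
n <- 13737   # training-set size from the output above
k <- 5       # number of folds

# Assign each row to one of k folds, as evenly as possible
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each model is fitted on the rows outside one fold and scored on that fold
held_out_sizes <- as.vector(table(fold_id))
fit_sizes <- n - held_out_sizes

print(sort(fit_sizes))
```

The reported accuracy for each tuning value is then the average of the five held-out accuracies.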
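The model was tuned with 5-fold cross-validation on the ‘train’ data. We now predict on the ‘valid’ dataset. The call below passes the true labels as its first argument, so the rows labelled Prediction in the output hold the actual classes and the columns labelled Reference hold the predicted ones; the overall accuracy is unaffected by this ordering.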
predict_ctree <- predict(model_ctree, valid)
confusionMatrix(valid$classe, predict_ctree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 0 1139 0 0 0
## C 0 0 1025 1 0
## D 0 0 0 964 0
## E 0 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 0.9998
## 95% CI : (0.9991, 1)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9998
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 0.9990 1.0000
## Specificity 1.0000 1.0000 0.9998 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 0.9990 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 0.9998 1.0000
## Prevalence 0.2845 0.1935 0.1742 0.1640 0.1839
## Detection Rate 0.2845 0.1935 0.1742 0.1638 0.1839
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 1.0000 1.0000 0.9999 0.9995 1.0000
On the validation set, the decision tree reaches an accuracy of 0.9998 (1 misclassification out of 5885 observations), so the estimated out-of-sample error is about 0.0002.
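The accuracy and the out-of-sample error can be recomputed by hand from the counts in the confusion matrix printed above. The sketch below copies those counts into a plain matrix; the interval at the end assumes an exact binomial confidence interval, which is how caret's documentation describes the one it reports.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the decision tree's confusion matrix above
cm <- matrix(
  c(1674, 0, 0, 0, 0,
    0, 1139, 0, 0, 0,
    0, 0, 1025, 1, 0,
    0, 0, 0, 964, 0,
    0, 0, 0, 0, 1082),
  nrow = 5, byrow = TRUE, dimnames = list(classes, classes)
)

accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
oos_error <- 1 - accuracy            # estimated out-of-sample error

print(round(accuracy, 4))
print(round(oos_error, 4))

# Transposing the table (swapping predicted and actual) leaves accuracy unchanged
accuracy_t <- sum(diag(t(cm))) / sum(t(cm))
print(isTRUE(all.equal(accuracy, accuracy_t)))

# Exact binomial 95% interval for the accuracy (assumed method)
print(binom.test(sum(diag(cm)), sum(cm))$conf.int)
```

Per-class statistics such as sensitivity and specificity do depend on which axis holds the predictions, so the argument order matters when those are interpreted.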
Now, we will use the Random Forest algorithm for the model creation and prediction.
control <- trainControl(method = "cv", number = 5)
model_rf <- train(classe ~ ., data = train, method = "rf", trControl = control)
print(model_rf, digits = 5)
## Random Forest
##
## 13737 samples
## 59 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10989, 10990, 10989, 10989, 10991
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.99505 0.99374
## 41 0.99993 0.99991
## 81 0.99985 0.99982
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 41.
plot(model_rf$finalModel)
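The tuning grid goes up to mtry = 81 even though the output lists 59 predictors. One plausible explanation, offered here as an assumption rather than something stated in this report, is that the formula interface expands each factor predictor into indicator columns, so the forest sees more columns than the data frame has. The three values tried are also consistent with an evenly spaced grid from 2 to 81, rounded down. Both points can be sketched on invented data:

```r
# Toy data: one numeric predictor and one three-level factor (invented values)
toy <- data.frame(
  y = factor(c("A", "B", "A", "B", "A", "B")),
  x1 = c(0.1, 0.4, 0.3, 0.9, 0.5, 0.7),
  user = factor(c("u1", "u2", "u3", "u1", "u2", "u3"))
)

# A factor with L levels becomes L - 1 indicator columns in the design matrix
design <- model.matrix(y ~ ., data = toy)[, -1]  # drop the intercept column
print(colnames(design))

# Evenly spaced three-point grid from 2 to 81, rounded down
mtry_grid <- floor(seq(2, 81, length.out = 3))
print(mtry_grid)
```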
As before, the model was tuned with 5-fold cross-validation on the ‘train’ data, and we now predict on the ‘valid’ dataset.
predict_rf <- predict(model_rf, valid)
confusionMatrix(valid$classe, predict_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 1 1138 0 0 0
## C 0 0 1026 0 0
## D 0 0 0 964 0
## E 0 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 0.9998
## 95% CI : (0.9991, 1)
## No Information Rate : 0.2846
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9998
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 0.9998 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 0.9991 1.0000 1.0000 1.0000
## Neg Pred Value 0.9998 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2846 0.1934 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1934 0.1743 0.1638 0.1839
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9997 0.9999 1.0000 1.0000 1.0000
On the validation set, the random forest reaches an accuracy of 0.9998, so the estimated out-of-sample error is 0.0002.
The validation accuracies reported in the two confusion matrices are:
* Decision Tree model: 0.9998
* Random Forest model: 0.9998
The two models are effectively tied on the validation set. We choose the Random Forest model as our prediction model for this project because it also had the slightly higher cross-validated accuracy (0.99993 versus 0.99964).
Finally, we use the chosen Random Forest model to predict the 20 test cases.
predict_final <- predict(model_rf, test)
predict_final
## [1] A A A A A A A A A A A A A A A A A A A A
## Levels: A B C D E
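A quick sanity check on the final predictions is to tabulate them and compare the result with the class mix seen earlier. The sketch below copies the 20 predictions from the output above and uses the class A prevalence of about 0.2845 from the validation results. The probability at the end treats the test cases as independent draws, which is a simplifying assumption used only to give a sense of scale. A single predicted class for every case may be correct, but it is unusual enough to justify a second look at which predictors drive the model.

```r
# Predictions copied from the output above
predict_final <- factor(rep("A", 20), levels = c("A", "B", "C", "D", "E"))

# Count of predictions per class
counts <- table(predict_final)
print(counts)

# Share of class A in the validation set, from the Prevalence row above
prev_A <- 0.2845

# Rough chance of 20 class-A cases in a row under independent draws
# (simplifying assumption, for scale only)
p_all_A <- prev_A^length(predict_final)
print(p_all_A)
```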