Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, and to predict the manner in which each lift was performed (the classe variable). More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source:
http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har
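The two CSV files can be fetched from the URLs above before running the analysis. The following is a minimal sketch, not part of the original analysis: `fetch.if.missing` and `base.url` are hypothetical names introduced here, and the download calls are left commented out because they need network access.

```r
# Hypothetical helper (not part of the original analysis): download a file
# only when it is not already present locally, and return its path.
fetch.if.missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Common prefix of the two data URLs listed above.
base.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn"

# Usage (requires network access; uncomment to run):
# fetch.if.missing(paste0(base.url, "/pml-training.csv"), "./pml-training.csv")
# fetch.if.missing(paste0(base.url, "/pml-testing.csv"),  "./pml-testing.csv")
```

Skipping the download when the file already exists keeps repeated runs of the script fast and avoids overwriting a local copy.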
# Don't forget to set the working directory
# 0. Load all the necessary packages
library(dplyr)
library(lubridate)
library(ggplot2)
library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
In the following section, the data are loaded into the workspace and cleaned. The main purpose of this step is to remove all columns that contain NA values (identified from the 20-case test file). Additionally, because this analysis does not treat the data as a time series, the time stamp and identification columns are removed as well.
# 1. Load the data
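# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where the default changed
test <- read.csv("./pml-testing.csv", stringsAsFactors = TRUE)
train <- read.csv("./pml-training.csv", stringsAsFactors = TRUE)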
# 2. Data cleaning
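# Create a partition with the training data
# Note: no seed is set before this call, so the split differs between runs.
# Note: the 70% partition is stored in test.set and the remaining 30% in train.set,
# so the models below are fit on 30% of the data and evaluated on 70%.
inTrain <- createDataPartition(train$classe, p = 0.7, list = FALSE)
test.set <- train[inTrain, ]
train.set <- train[-inTrain, ]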
# Remove the columns that contain NAs (identified in the 20-case test file)
na.list <- which(colSums(is.na(test)) > 0)
test.set <- test.set[, -na.list]
train.set <- train.set[, -na.list]
test <- test[, -na.list]
# Remove the first 6 columns (identification and time stamp columns)
test.set <- test.set[, -c(1:6)]
train.set <- train.set[, -c(1:6)]
test <- test[, -c(1:6)]
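The cleaning steps above can be tried on a toy example. This is a sketch under stated assumptions: the data frames `toy.train` and `toy.test` and their column names are invented for illustration, and the identification columns are dropped by name instead of by position, which is a variation on the original code rather than a copy of it.

```r
# Toy data standing in for the real files (column names are illustrative only).
toy.train <- data.frame(
  X         = 1:4,
  user_name = c("a", "b", "a", "b"),
  roll_belt = c(1.4, 1.5, 1.3, 1.6),
  max_roll  = c(NA, 2.1, NA, NA),
  classe    = factor(c("A", "B", "A", "C"))
)
toy.test <- data.frame(
  X          = 1:2,
  user_name  = c("a", "b"),
  roll_belt  = c(1.2, 1.7),
  max_roll   = c(NA, NA),
  problem_id = 1:2
)

# Columns with at least one NA in the test data, found by position
# in the same way as the original code.
na.cols <- which(colSums(is.na(toy.test)) > 0)

# Guard against the empty case: x[, -integer(0)] would select zero columns.
if (length(na.cols) > 0) {
  toy.train <- toy.train[, -na.cols]
  toy.test  <- toy.test[, -na.cols]
}

# Drop identification columns by name rather than by position.
id.cols   <- c("X", "user_name")
toy.train <- toy.train[, !(names(toy.train) %in% id.cols)]
toy.test  <- toy.test[, !(names(toy.test) %in% id.cols)]

names(toy.train)
names(toy.test)
```

Selecting by name is less fragile than `-c(1:6)` if the column order of the input file ever changes.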
A heat map gives a quick overview of the correlations among these variables.
# 3. Exploratory Analysis
# Heat map
map.Matrix <- cor(train.set[, -c(54)])
heatmap(map.Matrix, scale = 'column')
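A heat map shows the overall pattern, but it can also help to list the strongly correlated pairs explicitly. The sketch below uses only base R on simulated data; the variables `x1`, `x2`, `x3` and the `cutoff` of 0.8 are assumptions made for illustration and are not taken from the original analysis.

```r
set.seed(1)

# Toy numeric predictors (illustrative only): x2 is built to track x1 closely.
x1  <- rnorm(200)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.1), x3 = rnorm(200))

cor.mat <- cor(toy)

# Assumed threshold for calling a pair "strongly correlated".
cutoff <- 0.8

# Look only at the upper triangle so each pair is reported once
# and the diagonal (always 1) is ignored.
high <- which(abs(cor.mat) > cutoff & upper.tri(cor.mat), arr.ind = TRUE)

high.pairs <- data.frame(
  var1 = rownames(cor.mat)[high[, "row"]],
  var2 = colnames(cor.mat)[high[, "col"]],
  r    = cor.mat[high]
)
high.pairs
```

On the real data the same steps could be applied to `map.Matrix` in place of `cor.mat`.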
# 4. Prediction Model
# 4.1 Random forest
# Set for reproducibility
set.seed(2018)
# Random forest model fit
rf.mod <- randomForest(classe ~., data = train.set, ntree = 1000)
# Prediction
rf.pred <- predict(rf.mod, test.set, type = "class")
# Result evaluation
rf.confmat <- confusionMatrix(rf.pred, test.set$classe)
rf.confmat
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 3906   22    0    0    0
##          B    0 2630   53    0    0
##          C    0    6 2340   27    0
##          D    0    0    3 2221    2
##          E    0    0    0    4 2523
##
## Overall Statistics
##
##                Accuracy : 0.9915
##                  95% CI : (0.9898, 0.993)
##     No Information Rate : 0.2843
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9892
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9895   0.9766   0.9862   0.9992
## Specificity            0.9978   0.9952   0.9971   0.9996   0.9996
## Pos Pred Value         0.9944   0.9802   0.9861   0.9978   0.9984
## Neg Pred Value         1.0000   0.9975   0.9951   0.9973   0.9998
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1915   0.1703   0.1617   0.1837
## Detection Prevalence   0.2859   0.1953   0.1727   0.1620   0.1840
## Balanced Accuracy      0.9989   0.9923   0.9869   0.9929   0.9994
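The headline statistics can be recomputed by hand from the counts in the confusion matrix, which makes clear what each number measures. The sketch below copies the random forest counts printed above into a plain matrix (rows are predictions, columns are the reference classes); the object names are chosen here for illustration.

```r
# Counts copied from the random forest confusion matrix printed above
# (rows = prediction, columns = reference).
cm <- matrix(
  c(3906,   22,    0,    0,    0,
       0, 2630,   53,    0,    0,
       0,    6, 2340,   27,    0,
       0,    0,    3, 2221,    2,
       0,    0,    0,    4, 2523),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all cases
sensitivity <- diag(cm) / colSums(cm)    # per class: correct / actual members
pos.pred    <- diag(cm) / rowSums(cm)    # per class: correct / predicted members

round(accuracy, 4)
round(sensitivity, 4)
round(pos.pred, 4)
```

The rounded values match the Accuracy, Sensitivity, and Pos Pred Value rows reported by `confusionMatrix`.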
# 4.2 Decision tree
# Set for reproducibility
set.seed(2018)
# Decision tree model fit
dt.mod <- rpart(classe ~., data = train.set, method = "class")
fancyRpartPlot(dt.mod)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
# Decision tree prediction
dt.pred <- predict(dt.mod, newdata = test.set, type = "class")
dt.confmat <- confusionMatrix(dt.pred, test.set$classe)
dt.confmat
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 3374  209    3   24    7
##          B  392 2098  264  442  132
##          C    1  123 1951   63   14
##          D  107  105  142 1555  219
##          E   32  123   36  168 2153
##
## Overall Statistics
##
##                Accuracy : 0.8103
##                  95% CI : (0.8036, 0.8168)
##     No Information Rate : 0.2843
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.7604
##  Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8638   0.7893   0.8143   0.6905   0.8527
## Specificity            0.9753   0.8890   0.9823   0.9501   0.9680
## Pos Pred Value         0.9328   0.6304   0.9066   0.7307   0.8571
## Neg Pred Value         0.9474   0.9462   0.9616   0.9400   0.9669
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2456   0.1527   0.1420   0.1132   0.1567
## Detection Prevalence   0.2633   0.2423   0.1567   0.1549   0.1829
## Balanced Accuracy      0.9195   0.8391   0.8983   0.8203   0.9103
The Random Forest is clearly more accurate on the held-out data: its accuracy is 0.9915, whereas the Decision Tree reaches only 0.8103.
Hence the Random Forest model will be applied to the actual test cases.
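The accuracies printed by `confusionMatrix` (0.9915 for the random forest and 0.8103 for the decision tree) translate directly into estimated out-of-sample error rates. The sketch below is a back-of-the-envelope calculation, not part of the original analysis; treating the 20 test cases as independent is a simplifying assumption.

```r
# Accuracies as printed by confusionMatrix() above.
acc <- c(random.forest = 0.9915, decision.tree = 0.8103)

# Estimated out-of-sample error rate.
oos.error <- 1 - acc

# Number of rows in the final test file.
n.cases <- 20

# Expected number of misclassified cases among the 20.
expected.errors <- n.cases * oos.error

# Chance of getting all 20 right, assuming independent cases (a simplification).
p.all.correct <- acc^n.cases

round(oos.error, 4)
round(expected.errors, 2)
round(p.all.correct, 3)
```

Under these assumptions the random forest is expected to misclassify well under one of the 20 cases, while the decision tree is expected to miss several.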
# 5. Final Test
# Random Forest
rf.prediction.test <- predict(rf.mod, newdata = test)
rf.prediction.test
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B
## Levels: A B C D E
# Decision Tree
dt.prediction.test <- predict(dt.mod, newdata = test, type = "class")
dt.prediction.test
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  B  A  A  A  A  D  D  B  A  A  D  C  E  A  E  E  A  A  B  E
## Levels: A B C D E
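To see where the two models disagree on the 20 test cases, the predictions printed above can be compared element by element. The sketch below re-enters both prediction vectors by hand; `rf.out`, `dt.out`, and `lv` are names introduced here for illustration.

```r
# Predictions copied from the two outputs printed above.
rf.out <- c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A",
            "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")
dt.out <- c("B", "A", "A", "A", "A", "D", "D", "B", "A", "A",
            "D", "C", "E", "A", "E", "E", "A", "A", "B", "E")

agree <- rf.out == dt.out
sum(agree)      # number of cases where both models agree: 14
which(!agree)   # case numbers where they differ: 3 6 11 13 18 20

# Cross-tabulate the two sets of predictions over the five classes.
lv <- c("A", "B", "C", "D", "E")
table(RandomForest = factor(rf.out, lv), DecisionTree = factor(dt.out, lv))
```

The two models agree on 14 of the 20 cases; for the remaining six, the random forest prediction is the one carried forward, given its higher held-out accuracy.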