by Sandy Sng
1 June 2018
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
The goal of this project will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants, to predict the manner in which they did the exercise. This is the “classe” variable in the training set. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
This report describes how the model was built, how cross validation was used, the expected out of sample error, and reasons for choices made. The prediction model is also used to predict 20 different test cases.
Understanding the Data
We explore the “classe” variable in the training set. Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions (Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.):
Read more: http://groupware.les.inf.puc-rio.br/har#ixzz5H9tDLZig
Note that data has below structure:
str(trainingData) #‘data.frame’: 19622 obs. of 160 variables
str(testingData) #‘data.frame’: 20 obs. of 160 variables
Building the Model
Prediction evaluations will be based on maximizing the accuracy and minimizing the out-of-sample error. All other available variables after cleaning will be used for prediction. Two models will be tested using decision tree and random forest algorithms. The model with the highest accuracy will be chosen as our final model.
Cross-validation
Cross-validation will be performed by subsampling our training dataset randomly without replacement into 2 subsamples: subTraining (75% of the original training data) and subTesting (25%). Our models will be fitted on the subTraining, and tested on subTesting. Once the most accurate model is chosen, it will be tested on the original testing data.
Expected Out-of-Sample Error
The expected out-of-sample error will correspond to the quantity: 1-accuracy in the cross-validation data.
Accuracy is the proportion of (correct classified observation/the total sample in the subTesting).
Expected accuracy is the expected accuracy in the out-of-sample dataset (i.e. original testing dataset).
Thus, the expected value of the out-of-sample error will correspond to (the expected number of misclassified observations/total observations in testing dataset), which is the quantity: 1-accuracy found from the cross-validation dataset.
Choices and Reasons
Our outcome variable “classe” is an unordered factor variable. Thus, we can choose our error type as 1-accuracy. We have a large sample size with N = 19622 in the training dataset. This allow us to divide our training sample into subTraining and subTesting to allow cross-validation. Features with all missing values will be discarded as well as features that are irrelevant. All other features will be kept as relevant variables.
Decision tree and random forest algorithms are known for their ability of detecting the features that are important for classification. Feature selection is inherent, so it is not so necessary at the data preparation phase. Thus, there won’t be any feature selection section in this report.
Reproducibility
Installing packages, loading libraries, and setting the overall seed for reproduceability:
library(lattice); library(ggplot2); library(caret)
library(randomForest) # Random forest for classification and regression
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart) # Regressive Partitioning and Regression trees
library(rpart.plot) # Decision Tree plot
library(e1071) # Confusion Matrix model
set.seed(789)
Loading and Exploring Data
Download and read CSV files, explore training and testing datasets
trainingFileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(trainingFileUrl, "TrainingFile")
testingFileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(testingFileUrl, "TestingFile")
trainingData <- read.csv("TrainingFile", na.strings=c("NA","#DIV/0!", ""))
testingData <- read.csv("TestingFile", na.strings=c("NA","#DIV/0!", "")) # Only used with the final accurate model
str(trainingData) # 'data.frame': 19622 obs. of 160 variables
str(testingData) # 'data.frame': 20 obs. of 160 variables
Cleaning Data
Delete columns with all missing values
Remove irrelevant variables
trainingData <- trainingData[,colSums(is.na(trainingData)) == 0]
testingData <- testingData[,colSums(is.na(testingData)) == 0]
# We can delete these variables (columns 1 to 7): user_name, raw_timestamp_part_1, raw_timestamp_part_,2 cvtd_timestamp, new_window, and num_window
trainingData <- trainingData[,-c(1:7)]
testingData <- testingData[,-c(1:7)]
# Check cleaned datasets
str(trainingData) # 'data.frame': 19622 obs. of 53 variables
str(testingData) # 'data.frame': 20 obs. of 53 variables:
Partitioning the Training Dataset
This allows cross-validation.
Partition the original training dataset such that 75% subTraining and the remaining 25% to subTesting.
inTrain <- createDataPartition(y = trainingData$classe, p = 0.75, list = FALSE)
subTrainingData <- trainingData[inTrain, ] # 14718 x 53
subTestingData <- trainingData[-inTrain, ] # 4904 x 53
Plot of Classe Levels (Optional)
The variable “classe” contains 5 levels: A, B, C, D and E.
A plot of the outcome variable will allow us to see the frequency of each levels in the subTrainingData and compare one another.
We see that each classe frequency is within the same order of magnitude of each other. Level A is the most frequent while level D is the least frequent.
Prediction Model 1: Decision Trees
# Fit model on subTrainingData...
model1 <- rpart(classe ~ ., data = subTrainingData, method = "class")
rpart.plot(model1, main = "Classification Tree", extra = 102, under = TRUE, faclen = 0)
# ...predict with subTestingData.
prediction1 <- predict(model1, subTestingData, type = "class")
confusionMatrix(prediction1, subTestingData$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1250 147 56 94 31
## B 49 548 75 76 92
## C 38 118 650 113 106
## D 33 87 58 430 29
## E 25 49 16 91 643
##
## Overall Statistics
##
## Accuracy : 0.718
## 95% CI : (0.7052, 0.7305)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6415
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8961 0.5774 0.7602 0.53483 0.7137
## Specificity 0.9065 0.9262 0.9074 0.94951 0.9548
## Pos Pred Value 0.7921 0.6524 0.6341 0.67504 0.7803
## Neg Pred Value 0.9564 0.9013 0.9472 0.91235 0.9368
## Prevalence 0.2845 0.1935 0.1743 0.16395 0.1837
## Detection Rate 0.2549 0.1117 0.1325 0.08768 0.1311
## Detection Prevalence 0.3218 0.1713 0.2090 0.12989 0.1680
## Balanced Accuracy 0.9013 0.7518 0.8338 0.74217 0.8342
For Model 1: Decision Trees:
Accuracy = 0.7463
95% CI : (0.7339, 0.7585)
Prediction Model 2: Random Forest
# Fit model on subTrainingData...
model2 <- randomForest(classe ~. , data = subTrainingData, method = "class")
# ...predict with subTestingData.
prediction2 <- predict(model2, subTestingData, type = "class")
confusionMatrix(prediction2, subTestingData$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 5 0 0 0
## B 0 941 4 0 0
## C 0 3 849 9 1
## D 0 0 2 795 5
## E 0 0 0 0 895
##
## Overall Statistics
##
## Accuracy : 0.9941
## 95% CI : (0.9915, 0.996)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9925
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9916 0.9930 0.9888 0.9933
## Specificity 0.9986 0.9990 0.9968 0.9983 1.0000
## Pos Pred Value 0.9964 0.9958 0.9849 0.9913 1.0000
## Neg Pred Value 1.0000 0.9980 0.9985 0.9978 0.9985
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1919 0.1731 0.1621 0.1825
## Detection Prevalence 0.2855 0.1927 0.1758 0.1635 0.1825
## Balanced Accuracy 0.9993 0.9953 0.9949 0.9935 0.9967
For Model 2: Random Forest:
Accuracy : 0.9965
95% CI : (0.9945, 0.998)
Random Forest algorithm performed better than Decision Trees.
Apply Chosen Model 2: Random Forest, on Testing Dataset
predictfinal <- predict(model2, testingData, type = "class")
predictfinal
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# Write files for submission
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("problem_id_",i,".txt")
write.table(x[i],file = filename,
quote = FALSE,
row.names = FALSE,
col.names = FALSE)
}
}
pml_write_files(predictfinal)