Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing people regularly quantify is how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Using the collected data, machine learning models are built and the best-performing model is used to predict the 'classe' variable. The data are provided by http://groupware.les.inf.puc-rio.br/har.
A prediction model built with the random forest algorithm and evaluated against a validation set (extracted from the training data) yielded 99.8% accuracy.
First, all the necessary libraries are loaded.
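The functions used below come from the caret, rpart, and randomForest packages, so the setup is presumably along these lines:

library(caret)          # createDataPartition, nearZeroVar, confusionMatrix
library(rpart)          # decision tree models
library(randomForest)   # random forest models and rfcv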
The training and testing datasets are downloaded from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv and https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv, respectively.
The downloaded data are loaded into memory.
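A sketch of that step, assuming the files are saved to the working directory (the variable names train and test match the dimension checks below):

trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainURL, destfile = "pml-training.csv")
download.file(testURL,  destfile = "pml-testing.csv")
train <- read.csv("pml-training.csv")
test  <- read.csv("pml-testing.csv")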
The dimensions of the training data (train) and testing data (test) are as follows:
dim(train)
## [1] 19622 160
dim(test)
## [1] 20 160
A validation dataset is extracted from the downloaded training dataset: 60% of the observations are reserved for actual training and the remaining 40% for validation. The validation data are used to evaluate each prediction model before the best one is run once against the actual testing data.
inTrain <- createDataPartition(y=train$classe, p=0.6, list=FALSE)
subTrain <- train[inTrain, ]
subValidate <- train[-inTrain, ]
The dimensions of the training subset (subTrain) and the validation subset (subValidate) are as follows:
dim(subTrain)
## [1] 11776 160
dim(subValidate)
## [1] 7846 160
The criteria for removing variables as predictors are as follows:

- columns in which more than 90% of the observations are NA;
- columns containing '#DIV/0!' or empty strings;
- identification and timestamp variables that carry no predictive signal (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp);
- covariates with near zero variance (new_window).

The finalised list of variables selected for building the model is also applied to the validation and testing data.
# remove columns in which more than 90% of observations are NA
cleanSubTrain <- subTrain[, !(colSums(is.na(subTrain)) / nrow(subTrain) > 0.90)]
# remove columns containing '#DIV/0!' or empty strings
# (element-wise '|' is required here; '||' would inspect only the first element)
cleanSubTrain <- cleanSubTrain[, apply(cleanSubTrain, 2, function(x)
    sum(x == "#DIV/0!" | x == "", na.rm = TRUE)) == 0]
# remove identification and timestamp variables that will not impact the prediction
drops <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
           "cvtd_timestamp")
cleanSubTrain <- cleanSubTrain[, !(names(cleanSubTrain) %in% drops)]
# identify covariates with near zero variance and remove them
# (nearZeroVar flags new_window, which is dropped below)
nsv <- nearZeroVar(cleanSubTrain, saveMetrics = TRUE)
drops <- c("new_window")
cleanSubTrain <- cleanSubTrain[, !(names(cleanSubTrain) %in% drops)]
# apply the same cleaning process to the validation and test datasets:
# keep only the columns that survived cleaning in the training subset
cleanSubValidate <- subValidate[, intersect(names(subValidate), names(cleanSubTrain))]
cleanTest <- test[, intersect(names(test), names(cleanSubTrain))]
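As a quick sanity check on the cleaning (a sketch; the rfcv call later in the report indicates that classe ends up as column 54 of the cleaned training subset):

dim(cleanSubTrain)                                        # 11776 rows, 54 columns
identical(names(cleanSubTrain), names(cleanSubValidate))  # expected: TRUE
setdiff(names(cleanSubTrain), names(cleanTest))           # expected: "classe"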
First, the decision tree algorithm is used to build a model, which is validated against the validation data.
dt.stime <- proc.time()
modFit.DT <- rpart(classe ~ ., data=cleanSubTrain, method="class")
dt.etime <- proc.time()
predict.DT <- predict(modFit.DT, cleanSubValidate, type = "class")
CM.DT <- confusionMatrix(predict.DT, cleanSubValidate$classe)
print(CM.DT)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1935 114 3 22 6
## B 131 1130 86 125 69
## C 22 100 1129 48 5
## D 130 89 122 1012 154
## E 14 85 28 79 1208
##
## Overall Statistics
##
## Accuracy : 0.817
## 95% CI : (0.809, 0.826)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.77
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.867 0.744 0.825 0.787 0.838
## Specificity 0.974 0.935 0.973 0.925 0.968
## Pos Pred Value 0.930 0.733 0.866 0.672 0.854
## Neg Pred Value 0.948 0.938 0.963 0.957 0.964
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.247 0.144 0.144 0.129 0.154
## Detection Prevalence 0.265 0.196 0.166 0.192 0.180
## Balanced Accuracy 0.921 0.840 0.899 0.856 0.903
Next, the random forest algorithm is used to build a model, which is likewise validated against the validation data.
rf.stime <- proc.time()
modFit.RF <- randomForest(classe ~ ., data = cleanSubTrain, importance = TRUE)
rf.etime <- proc.time()
predict.RF <- predict(modFit.RF, cleanSubValidate, type = "response")
CM.RF <- confusionMatrix(predict.RF, cleanSubValidate$classe)
print(CM.RF)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 2 0 0 0
## B 0 1514 7 0 0
## C 0 2 1361 4 0
## D 0 0 0 1282 1
## E 0 0 0 0 1441
##
## Overall Statistics
##
## Accuracy : 0.998
## 95% CI : (0.997, 0.999)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.997
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 0.997 0.995 0.997 0.999
## Specificity 1.000 0.999 0.999 1.000 1.000
## Pos Pred Value 0.999 0.995 0.996 0.999 1.000
## Neg Pred Value 1.000 0.999 0.999 0.999 1.000
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.284 0.193 0.173 0.163 0.184
## Detection Prevalence 0.285 0.194 0.174 0.164 0.184
## Balanced Accuracy 1.000 0.998 0.997 0.998 1.000
Finally, random forest with cross-validation (rfcv) is run to estimate how the error rate varies with the number of predictors; the validation-set evaluation below reuses the random forest model from the previous section.
rfcv.stime <- proc.time()
# column 54 of cleanSubTrain is the 'classe' outcome
modFit.RFCV <- rfcv(trainx = cleanSubTrain[,-54], trainy = cleanSubTrain[,54],
                    scale = "log", step = 0.5, cv.fold = 3)
rfcv.etime <- proc.time()
# rfcv returns cross-validated error rates rather than a fitted model,
# so the random forest fit above is reused for the prediction here
predict.RFCV <- predict(modFit.RF, cleanSubValidate, type = "response")
CM.RFCV <- confusionMatrix(predict.RFCV, cleanSubValidate$classe)
print(CM.RFCV)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 2 0 0 0
## B 0 1514 7 0 0
## C 0 2 1361 4 0
## D 0 0 0 1282 1
## E 0 0 0 0 1441
##
## Overall Statistics
##
## Accuracy : 0.998
## 95% CI : (0.997, 0.999)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.997
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 0.997 0.995 0.997 0.999
## Specificity 1.000 0.999 0.999 1.000 1.000
## Pos Pred Value 0.999 0.995 0.996 0.999 1.000
## Neg Pred Value 1.000 0.999 0.999 0.999 1.000
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.284 0.193 0.173 0.163 0.184
## Detection Prevalence 0.285 0.194 0.174 0.164 0.184
## Balanced Accuracy 1.000 0.998 0.997 0.998 1.000
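The object returned by rfcv holds the cross-validated error rate for each nested subset of predictors; a quick way to inspect it (a sketch):

modFit.RFCV$error.cv   # named vector: CV error rate per number of predictors
with(modFit.RFCV, plot(n.var, error.cv, log = "x", type = "o",
                       xlab = "number of predictors", ylab = "CV error rate"))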
| Model Name | Accuracy | Elapsed Time (s) |
|---|---|---|
| Decision Tree | 0.817487 | 3.87 |
| Random Forest | 0.997961 | 105.11 |
| Random Forest with Cross Validation | 0.997961 | 272.13 |
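The figures in the table were assembled from the confusionMatrix objects and the proc.time() pairs recorded around each fit; for example, for the random forest row:

CM.RF$overall["Accuracy"]          # validation accuracy, 0.997961
(rf.etime - rf.stime)["elapsed"]   # model build time in seconds, 105.11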
The random forest model is the best choice: it matches the accuracy of random forest with cross-validation in well under half the elapsed time, and it is substantially more accurate than the decision tree.
print(modFit.RF)
##
## Call:
## randomForest(formula = classe ~ ., data = cleanSubTrain, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.36%
## Confusion matrix:
## A B C D E class.error
## A 3347 1 0 0 0 0.000298686
## B 6 2269 4 0 0 0.004387889
## C 0 8 2044 2 0 0.004868549
## D 0 0 16 1913 1 0.008808290
## E 0 0 0 4 2161 0.001847575
The out-of-bag (OOB) error estimate of 0.36% on the training data is low, consistent with the near-perfect accuracy observed on the validation set.
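Because the forest was fitted with importance = TRUE, the predictors driving the model can also be inspected, for example:

varImpPlot(modFit.RF, n.var = 10)   # plot the 10 most important predictors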
predict.RF.test <- predict(modFit.RF, cleanTest)
# write each prediction to its own text file for submission
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}
pml_write_files(predict.RF.test)
The predictions on the 20 test cases are: B, A, B, A, A, E, D, B, A, A, B, C, B, A, E, E, A, B, B, B.