Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
We first load the relevant R packages that will be needed for the analysis:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(17)
We download the two data files, save them in the R working directory, and then read them into R, treating "NA", "#DIV/0!", and empty strings as missing values:
train <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
test <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
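The download step itself is not shown above. A minimal sketch of how it could be done is given below, using the two URLs from the project description. The helper names `fetch_if_missing` and `fetch_project_data` are hypothetical and are not part of the original analysis.

```r
# URLs as given in the project description.
data_urls <- c(
  "pml-training.csv" = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
  "pml-testing.csv"  = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
)

# Hypothetical helper: download `url` to `destfile` unless the file already exists.
# Returns TRUE if a download was attempted, FALSE if the local copy was reused.
fetch_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) return(FALSE)
  download.file(url, destfile = destfile, mode = "wb")
  TRUE
}

# Hypothetical wrapper: fetch both project files into `dir` (the working directory by default).
fetch_project_data <- function(dir = ".") {
  for (f in names(data_urls)) {
    fetch_if_missing(data_urls[[f]], file.path(dir, f))
  }
  invisible(file.path(dir, names(data_urls)))
}
```

Calling `fetch_project_data()` once before the `read.csv()` lines would place both files in the working directory. Skipping files that already exist avoids repeated downloads when the report is re-run.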
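We count the NA values in each data field. Of the 160 columns, 60 contain no NAs and the remaining 100 are almost entirely NA (at least 19,216 of 19,622 values missing), so we remove those 100 columns from the train data: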
na_cols <- sapply(train, FUN=function(x) {sum(is.na(x))})
table(na_cols) ## investigate the NAs in the columns
## na_cols
##     0 19216 19217 19218 19220 19221 19225 19226 19227 19248 19293 19294
##    60    67     1     1     1     4     1     4     2     2     1     1
## 19296 19299 19300 19301 19622
##     2     1     4     2     6
newtrain <- train[,-which(na_cols >= 19216)] ## retain only the columns without any NAs
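The same idea can be written with a fraction-based cutoff and logical indexing. The sketch below uses a small made-up data frame rather than the project data, and the 50% cutoff is an assumption chosen for illustration. Logical indexing also sidesteps a pitfall of `-which(...)`: if no column meets the condition, `which()` returns an empty vector and the negative index then selects no columns at all.

```r
# Toy data frame standing in for `train`: column b is mostly NA.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(NA, NA, NA, NA, 1),
  c = c("x", "y", "x", "y", "x")
)

na_frac <- colMeans(is.na(toy))   # fraction of NA values per column
keep <- na_frac <= 0.5            # assumed cutoff: drop columns that are more than half NA
toy_clean <- toy[, keep, drop = FALSE]
names(toy_clean)
```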
We investigate the data fields with near-zero variance (NZV) and remove these columns from the train data:
NZV_set <- nearZeroVar(newtrain, saveMetrics=TRUE)
NZV_set # show the NZV analysis
## freqRatio percentUnique zeroVar nzv
## X 1.000000 100.00000000 FALSE FALSE
## user_name 1.100679 0.03057792 FALSE FALSE
## raw_timestamp_part_1 1.000000 4.26562022 FALSE FALSE
## raw_timestamp_part_2 1.000000 85.53154622 FALSE FALSE
## cvtd_timestamp 1.000668 0.10192641 FALSE FALSE
## new_window 47.330049 0.01019264 FALSE TRUE
## num_window 1.000000 4.37264295 FALSE FALSE
## roll_belt 1.101904 6.77810621 FALSE FALSE
## pitch_belt 1.036082 9.37722964 FALSE FALSE
## yaw_belt 1.058480 9.97349913 FALSE FALSE
## total_accel_belt 1.063160 0.14779329 FALSE FALSE
## gyros_belt_x 1.058651 0.71348486 FALSE FALSE
## gyros_belt_y 1.144000 0.35164611 FALSE FALSE
## gyros_belt_z 1.066214 0.86127816 FALSE FALSE
## accel_belt_x 1.055412 0.83579655 FALSE FALSE
## accel_belt_y 1.113725 0.72877383 FALSE FALSE
## accel_belt_z 1.078767 1.52379982 FALSE FALSE
## magnet_belt_x 1.090141 1.66649679 FALSE FALSE
## magnet_belt_y 1.099688 1.51870350 FALSE FALSE
## magnet_belt_z 1.006369 2.32901845 FALSE FALSE
## roll_arm 52.338462 13.52563449 FALSE FALSE
## pitch_arm 87.256410 15.73234125 FALSE FALSE
## yaw_arm 33.029126 14.65701763 FALSE FALSE
## total_accel_arm 1.024526 0.33635715 FALSE FALSE
## gyros_arm_x 1.015504 3.27693405 FALSE FALSE
## gyros_arm_y 1.454369 1.91621649 FALSE FALSE
## gyros_arm_z 1.110687 1.26388747 FALSE FALSE
## accel_arm_x 1.017341 3.95984099 FALSE FALSE
## accel_arm_y 1.140187 2.73672409 FALSE FALSE
## accel_arm_z 1.128000 4.03628580 FALSE FALSE
## magnet_arm_x 1.000000 6.82397309 FALSE FALSE
## magnet_arm_y 1.056818 4.44399144 FALSE FALSE
## magnet_arm_z 1.036364 6.44684538 FALSE FALSE
## roll_dumbbell 1.022388 84.20650290 FALSE FALSE
## pitch_dumbbell 2.277372 81.74498012 FALSE FALSE
## yaw_dumbbell 1.132231 83.48282540 FALSE FALSE
## total_accel_dumbbell 1.072634 0.21914178 FALSE FALSE
## gyros_dumbbell_x 1.003268 1.22821323 FALSE FALSE
## gyros_dumbbell_y 1.264957 1.41677709 FALSE FALSE
## gyros_dumbbell_z 1.060100 1.04984201 FALSE FALSE
## accel_dumbbell_x 1.018018 2.16593619 FALSE FALSE
## accel_dumbbell_y 1.053061 2.37488533 FALSE FALSE
## accel_dumbbell_z 1.133333 2.08949139 FALSE FALSE
## magnet_dumbbell_x 1.098266 5.74864948 FALSE FALSE
## magnet_dumbbell_y 1.197740 4.30129447 FALSE FALSE
## magnet_dumbbell_z 1.020833 3.44511263 FALSE FALSE
## roll_forearm 11.589286 11.08959331 FALSE FALSE
## pitch_forearm 65.983051 14.85577413 FALSE FALSE
## yaw_forearm 15.322835 10.14677403 FALSE FALSE
## total_accel_forearm 1.128928 0.35674243 FALSE FALSE
## gyros_forearm_x 1.059273 1.51870350 FALSE FALSE
## gyros_forearm_y 1.036554 3.77637346 FALSE FALSE
## gyros_forearm_z 1.122917 1.56457038 FALSE FALSE
## accel_forearm_x 1.126437 4.04647844 FALSE FALSE
## accel_forearm_y 1.059406 5.11160942 FALSE FALSE
## accel_forearm_z 1.006250 2.95586586 FALSE FALSE
## magnet_forearm_x 1.012346 7.76679238 FALSE FALSE
## magnet_forearm_y 1.246914 9.54031189 FALSE FALSE
## magnet_forearm_z 1.000000 8.57710733 FALSE FALSE
## classe 1.469581 0.02548160 FALSE FALSE
NZV_set$fields <- row.names(NZV_set) # copy the row names into a column so they survive dplyr's filter()
NZV_fields <- NZV_set %>% filter(nzv) %>% pull(fields) # character vector of the fields which are NZV
newtrain <- newtrain[, !(names(newtrain) %in% NZV_fields)] # retain only the columns that are non-NZV
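To make the two metrics in the table above concrete, here is a rough base-R sketch of how they can be computed for a single column. The helper `nzv_metrics` is hypothetical and is not caret's implementation. The cutoffs 95/5 and 10 are the documented defaults of `nearZeroVar()`. With these cutoffs, new_window is flagged because its freqRatio (47.33) exceeds 19 and its percentUnique (0.01) is below 10, whereas roll_arm is not flagged because its percentUnique (13.53) is above 10.

```r
# Hypothetical helper mirroring the two metrics reported by nearZeroVar().
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common value's count.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  # Number of distinct values as a percentage of the number of rows.
  percent_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) <= 1
  data.frame(
    freqRatio = freq_ratio,
    percentUnique = percent_unique,
    zeroVar = zero_var,
    nzv = zero_var || (freq_ratio > freq_cut && percent_unique < unique_cut)
  )
}

# Made-up flag column: 96 "no" values and 4 "yes" values.
flag <- c(rep("no", 96), rep("yes", 4))
nzv_metrics(flag)
```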
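We also remove columns from the train data that are irrelevant to the prediction. These are the first 6 columns of newtrain (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, and num_window), which are row identifiers or user-related or time-related data: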
newtrain <- newtrain[,-(1:6)] # retain only the columns that are relevant for prediction
Finally, we split newtrain into training and testing sets for development of the prediction algorithm:
inTrain <- createDataPartition(y=newtrain$classe, p=0.8, list=FALSE)
trainingset <- newtrain[inTrain,]
testingset <- newtrain[-inTrain,]
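For a factor outcome, createDataPartition samples within each class, so the class proportions are roughly the same in both parts. The sketch below is a rough base-R analogue on made-up labels, not caret's implementation. The helper `stratified_index` and the toy class sizes are assumptions for illustration.

```r
set.seed(17)
# Toy class labels standing in for newtrain$classe.
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

in_train <- stratified_index(y, p = 0.8)
table(y[in_train])   # class counts in the training part
table(y[-in_train])  # class counts in the held-out part
```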
I decided to apply the Random Forest algorithm to trainingset. As a first pass, I run the algorithm with all variables as predictors, with pre-processing (centre and scale) and 4-fold cross-validation applied:
set.seed(17)
model <- train(classe ~ ., method="rf", preProcess=c("center", "scale"), trControl=trainControl(method = "cv", number = 4), data=trainingset) # classe is looked up in trainingset via the data argument
print(model) # review the model
## Random Forest
##
## 15699 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (52), scaled (52)
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 11774, 11774, 11774, 11775
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9919742 0.9898467 0.001695710 0.002145392
## 27 0.9909551 0.9885581 0.003569432 0.004515449
## 52 0.9842030 0.9800155 0.001775883 0.002245550
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
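The "Summary of sample sizes" line follows from how 4-fold cross-validation works: each fold is held out once while the model is fitted on the other three. The sketch below reproduces those sizes with a simple random fold assignment. It is unstratified and only illustrative, whereas caret builds its folds with its own resampling code.

```r
set.seed(17)
n <- 15699  # number of rows in trainingset, as reported by print(model)
k <- 4

# Assign each row to one of k folds at random (unstratified, for illustration).
fold_id <- sample(rep(seq_len(k), length.out = n))

# In each round, one fold is held out and the remaining k - 1 folds are used for fitting.
held_out_sizes <- as.integer(table(fold_id))
fit_sizes <- n - held_out_sizes
sort(fit_sizes)
```

The accuracy reported for each mtry value is the average over the four held-out folds, and the mtry with the highest average is refitted on the full trainingset.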
In order to assess the model and estimate the out-of-sample error, we run the model against the testingset created earlier. We use the confusionMatrix function to review the results:
predictions <- predict(model, newdata=testingset)
print(confusionMatrix(predictions, testingset$classe), digits=4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 6 0 0 0
## B 0 751 5 0 0
## C 0 2 678 15 0
## D 0 0 1 628 1
## E 0 0 0 0 720
##
## Overall Statistics
##
## Accuracy : 0.9924
## 95% CI : (0.9891, 0.9948)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9903
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9895 0.9912 0.9767 0.9986
## Specificity 0.9979 0.9984 0.9948 0.9994 1.0000
## Pos Pred Value 0.9947 0.9934 0.9755 0.9968 1.0000
## Neg Pred Value 1.0000 0.9975 0.9981 0.9954 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1914 0.1728 0.1601 0.1835
## Detection Prevalence 0.2860 0.1927 0.1772 0.1606 0.1835
## Balanced Accuracy 0.9989 0.9939 0.9930 0.9880 0.9993
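The accuracy of the model on the held-out testingset is 99.24%, so the estimated out-of-sample error is 100% - 99.24% = 0.76%. This agrees closely with the cross-validated accuracy of 99.20% reported for mtry = 2 during training. On this basis, I have chosen the Random Forest algorithm as the model for the project. Cross-validation was used to select mtry. Centring and scaling are unlikely to have changed the accuracy, because tree-based models are largely insensitive to rescaling of the predictors.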
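The headline figures can be recomputed by hand from the counts in the confusion matrix above, which is a useful check on how accuracy, error rate, sensitivity, and Kappa are defined. This is a base-R sketch and does not use caret. The variable names are my own.

```r
classes <- c("A", "B", "C", "D", "E")
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(
  c(1116,   6,   0,   0,   0,
       0, 751,   5,   0,   0,
       0,   2, 678,  15,   0,
       0,   0,   1, 628,   1,
       0,   0,   0,   0, 720),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n <- sum(cm)                                       # number of held-out cases
accuracy <- sum(diag(cm)) / n                      # share of correct predictions
error_rate <- 1 - accuracy                         # estimated out-of-sample error
sensitivity <- diag(cm) / colSums(cm)              # per-class recall
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, error = error_rate, kappa = cohen_kappa), 4)
round(sensitivity, 4)
```

Of the 3923 held-out cases, 3893 are classified correctly and 30 are not. Half of the errors are class D cases predicted as class C, which is why class D has the lowest sensitivity.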
Once the model was chosen, I ran it against the 20 test cases, as required by the project:
print(predict(model, newdata=test))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These predictions are used to answer the project quiz on the 20 test cases.