Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The goal of this project is to predict the manner in which the participants did the exercise. The five exercise classes (labelled “A”, “B”, “C”, “D”, “E”) are captured in the “classe” variable in the training set, and our intent is to predict the class of each observation in the testing dataset.
Clear memory space
rm(list=ls()) #free up memory before loading the data sets
# .rs.restartR()
Check the memory limit and allocate more memory to R
memory.limit() #check the current memory limit (in MB)
## [1] 8090
memory.limit(10*10^10) #raise the limit; the size argument is in MB (Windows only)
## [1] 1e+11
Set the working directory
setwd("C:\\Users\\Andy's Home PC\\Documents\\Coursera Courses\\Data Science\\Practical Machine Learning\\Week 4\\Peer Graded Assignment")
Download the training data and read the downloaded file
if(!file.exists("./data")){dir.create("./data")} #create a "data" folder if it doesn't already exist
fileUrl = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(fileUrl, destfile="./data/pml-training.csv", method="curl") #download the file
training <- read.csv("./data/pml-training.csv", sep=",", header=TRUE, na.strings=c("NA", "#DIV/0!", ""))
Download the testing data and read the downloaded file
if(!file.exists("./data")){dir.create("./data")} #create a "data" folder if it doesn't already exist
fileUrl2 = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileUrl2, destfile="./data/pml-testing.csv", method="curl") #download the file
testing <- read.csv("./data/pml-testing.csv", sep=",", header=TRUE, na.strings=c("NA", "#DIV/0!", ""))
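Both files are read with na.strings=c("NA", "#DIV/0!", "") so that blank cells and division errors are treated as missing values. As a quick sanity check (not part of the original report), we can confirm that the two files share the same predictor columns; only the last column should differ (the outcome in training versus a case identifier in testing):
identical(names(training)[-160], names(testing)[-160]) #compare all column names except the last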
Load the essential packages
suppressMessages(library(caret))
suppressMessages(library(randomForest))
suppressMessages(library(rpart))
suppressMessages(library(rpart.plot))
suppressMessages(library(RColorBrewer))
suppressMessages(library(rattle))
suppressMessages(library(ggplot2))
Set the seed to ensure reproducibility
set.seed(650542)
Find the dimensions of the training and testing datasets
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
We further split our training dataset into two smaller subsets, one for building the model and one for validating it, so that we can estimate the out-of-sample error and guard against over-fitting. We use an 80:20 split for the training and testing subsets respectively.
inTrain <- createDataPartition(training$classe, p=0.8, list=FALSE)
subTraining <- training[inTrain, ]
subTesting <- training[-inTrain, ]
dim(subTraining)
## [1] 15699 160
dim(subTesting)
## [1] 3923 160
Find details of the “classe” variable
str(training$classe) #factor variable with 5 levels
## Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
table(subTraining$classe)
##
## A B C D E
## 4464 3038 2738 2573 2886
In our subTraining dataset, there are 15699 observations and 160 variables. One of these variables, the “classe” column, is a factor with 5 levels.
Let’s count the missing values in each column of the training dataset and record which columns contain any
na_count <- sapply(1:dim(training)[2], function(x) sum(is.na(training[,x]))) #number of NAs in each column
na_list <- which(na_count>0) #indices of columns that contain at least one NA
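As a quick check (not shown in the original output), we can see how many columns are affected and how sparse they are before dropping them:
length(na_list) #number of columns that contain missing values
summary(na_count[na_list]) #distribution of NA counts in those columns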
Let’s remove the columns that contain missing values from the subTraining, subTesting, and testing datasets.
subTraining<-subTraining[,-na_list]
subTesting<-subTesting[,-na_list]
testing<-testing[,-na_list]
Let’s print the first 7 columns of the training set
head(training[, c(1:7)])
## X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp
## 1 1 carlitos 1323084231 788290 05/12/2011 11:23
## 2 2 carlitos 1323084231 808298 05/12/2011 11:23
## 3 3 carlitos 1323084231 820366 05/12/2011 11:23
## 4 4 carlitos 1323084232 120339 05/12/2011 11:23
## 5 5 carlitos 1323084232 196328 05/12/2011 11:23
## 6 6 carlitos 1323084232 304277 05/12/2011 11:23
## new_window num_window
## 1 no 11
## 2 no 11
## 3 no 11
## 4 no 12
## 5 no 12
## 6 no 12
Note that these first 7 columns (row index, user name, timestamps, and window indicators) are not useful as predictors, so let’s remove them from all our datasets (i.e. subTraining, subTesting, and testing).
subTraining<-subTraining[,-c(1:7)]
subTesting<-subTesting[,-c(1:7)]
testing<-testing[,-c(1:7)]
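A quick dimension check (not in the original output) confirms the effect of the two cleaning steps; each dataset should now have 53 columns (52 predictors plus the outcome), which is why the “classe” column is referenced as column 53 later on:
dim(subTraining) #52 predictors plus "classe"
dim(subTesting)
dim(testing)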
Set up parallel processing. This step is important because training models such as a random forest on this dataset can otherwise take a very long time.
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
cluster <- makeCluster(detectCores()-1) #convention is to leave 1 core for the OS
registerDoParallel(cluster)
The key arguments to the trainControl object are: (1) “method”, the resampling method; (2) “number”, the number of folds for k-fold cross-validation; and (3) “allowParallel”, which tells caret to use the cluster registered in the earlier step.
Here, we use k-fold cross-validation (“cv”) with 5 folds and allow parallel processing.
fitControl <- trainControl(method="cv", number=5, allowParallel=TRUE, verboseIter = TRUE)
#train model using subTraining dataset
model_RandomForest <- train(classe~., data=subTraining, method="rf", trControl=fitControl, tuneGrid=data.frame(mtry=7))
## Aggregating results
## Fitting final model on full training set
#mtry: Number of variables randomly sampled as candidates at each split.
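Here mtry is fixed at 7 to keep the run time manageable. If you would rather let caret tune this parameter itself, a sketch of the alternative call (same data and fitControl, just a grid of candidate values; the object names are illustrative) would be:
rf_grid <- expand.grid(mtry=c(2, 7, 13, 26)) #candidate values to compare via cross-validation
model_RandomForest_tuned <- train(classe~., data=subTraining, method="rf", trControl=fitControl, tuneGrid=rf_grid) #slower, as each candidate is resampled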
#predict trained model on subTesting dataset
predict_RandomForest <- predict(model_RandomForest, subTesting[,-53]) #drop the "classe" variable in the 53rd column
#calculate confusion matrix
accuracy_RandomForest <- confusionMatrix(predict_RandomForest, subTesting$classe)
print(accuracy_RandomForest)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1115 2 0 0 0
## B 1 755 1 0 0
## C 0 2 682 6 0
## D 0 0 1 637 1
## E 0 0 0 0 720
##
## Overall Statistics
##
## Accuracy : 0.9964
## 95% CI : (0.994, 0.998)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9955
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9947 0.9971 0.9907 0.9986
## Specificity 0.9993 0.9994 0.9975 0.9994 1.0000
## Pos Pred Value 0.9982 0.9974 0.9884 0.9969 1.0000
## Neg Pred Value 0.9996 0.9987 0.9994 0.9982 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2842 0.1925 0.1738 0.1624 0.1835
## Detection Prevalence 0.2847 0.1930 0.1759 0.1629 0.1835
## Balanced Accuracy 0.9992 0.9970 0.9973 0.9950 0.9993
The accuracy of our random forest model on the subTesting dataset is 99.64%, so the estimated out-of-sample error is about 0.4%.
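The same figures can be pulled directly from the confusionMatrix object instead of being read off the printout; a small sketch:
rf_accuracy <- accuracy_RandomForest$overall["Accuracy"] #overall accuracy on subTesting
1 - rf_accuracy #estimated out-of-sample error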
#train model using subTraining dataset
model_rpart <- train(classe~., data=subTraining, method="rpart", trControl=fitControl)
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.0357 on full training set
#rpart has no mtry parameter, so no tuneGrid is specified
#predict trained model on subTesting dataset
predict_rpart <- predict(model_rpart, subTesting[,-53]) #drop the "classe" variable in the 53rd column
#calculate confusion matrix
accuracy_rpart <- confusionMatrix(predict_rpart, subTesting$classe)
print(accuracy_rpart)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1015 308 314 289 104
## B 19 257 30 116 98
## C 79 194 340 238 197
## D 0 0 0 0 0
## E 3 0 0 0 322
##
## Overall Statistics
##
## Accuracy : 0.493
## 95% CI : (0.4772, 0.5088)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3375
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9095 0.33860 0.49708 0.0000 0.44660
## Specificity 0.6384 0.91688 0.78141 1.0000 0.99906
## Pos Pred Value 0.5000 0.49423 0.32443 NaN 0.99077
## Neg Pred Value 0.9466 0.85248 0.88035 0.8361 0.88911
## Prevalence 0.2845 0.19347 0.17436 0.1639 0.18379
## Detection Rate 0.2587 0.06551 0.08667 0.0000 0.08208
## Detection Prevalence 0.5175 0.13255 0.26714 0.0000 0.08284
## Balanced Accuracy 0.7740 0.62774 0.63925 0.5000 0.72283
Accuracy from the decision tree (using “rpart”) is only about 49%.
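Since rpart.plot and rattle were loaded earlier but not used, one natural addition (a sketch, not part of the original report) is to plot the fitted tree, which helps show why a single decision tree struggles to separate the five classes:
fancyRpartPlot(model_rpart$finalModel) #visualise the final rpart tree selected by caret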
Since our random forest model has greater accuracy, we shall apply it to the testing dataset to predict the activity class for the 20 cases.
#predict with the trained random forest model on the final testing dataset
predict_RandomForest_validation <- predict(model_RandomForest, testing)
predict_RandomForest_validation
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Hence, the predicted activity classes for the 20 rows in the testing dataset are: B A B A A E D B A A B C B A E E A B B B
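Finally, since a parallel cluster was registered earlier, it is good practice to shut it down once model training is complete; a small cleanup step (not in the original code):
stopCluster(cluster) #release the worker processes
registerDoSEQ() #return foreach/caret to sequential execution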