Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
# Set working directory
setwd("~/R_Coursera/Practical Machine Learning")
# Download training dataset if it is not already in the directory
if (!file.exists("har.txt")) {
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url,"har.txt")
}
# Download testing dataset if it is not already in the directory
if (!file.exists("har2.txt")) {
url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url2,"har2.txt")
}
# Load datasets
train <- read.csv("har.txt",sep = ",", header = TRUE,na.strings= c('#DIV/0', '', 'NA'))
test <- read.csv("har2.txt",sep = ",", header = TRUE,na.strings= c('#DIV/0', '', 'NA'))
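As a side note on the `na.strings` argument used above: `read.csv()` compares each token against the whole field, so a token that differs by one character is not treated as missing. The sketch below illustrates this with a made-up inline CSV. The column names are hypothetical, and the `#DIV/0!` token (with a trailing `!`) is an assumption about how spreadsheet errors may appear in the raw file.

```r
# Hypothetical inline CSV: one spreadsheet error token, one blank, one NA
csv_text <- "id,kurtosis,roll\n1,#DIV/0!,1.5\n2,,2.5\n3,NA,3.5\n4,0.7,4.5"

# Token without the trailing "!" does not match the field "#DIV/0!"
loose  <- read.csv(text = csv_text, na.strings = c("#DIV/0", "", "NA"))
# Token written exactly as it appears in the field
strict <- read.csv(text = csv_text, na.strings = c("#DIV/0!", "", "NA"))

# In `loose` the error token survives as text, so the column is not numeric
is.numeric(loose$kurtosis)
sum(is.na(loose$kurtosis))

# In `strict` the token becomes NA and the column is parsed as numeric
is.numeric(strict$kurtosis)
sum(is.na(strict$kurtosis))
```

If the raw files do use the token with the trailing `!`, it may be worth checking whether any column is read as text instead of numbers.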
Let’s look at the dimensions of the training and testing datasets, and at how the outcome variable classe is distributed in the training data.
It is also important to check for NA values in both datasets, because they may cause errors in later steps.
# Check dataset dimensions
dim(train)
## [1] 19622 160
dim(test)
## [1] 20 160
# check outcome(classe) distribution
summary(train$classe)
## A B C D E
## 5580 3797 3422 3216 3607
# check NA values
sum(is.na(train))
## [1] 1921600
sum(is.na(test))
## [1] 2000
Note that there are many columns with NA values, and they need to be discarded. Also, exploring the content of the dataset, we can observe that the first seven columns hold bookkeeping information (such as the row index, participant name, and timestamps) rather than sensor measurements, so they are not suitable predictors of the outcome variable and will be discarded as well.
# Process training dataset
# Discard columns containing NA values, then the first seven non-sensor columns
Nac <- sapply(1:160,function(n){sum(is.na(train[,n]))})
cwithNA <- which(Nac>0)
train <- train[,-cwithNA]
train <- train[,-c(1:7)]
# Process test dataset
test <- test[,-cwithNA]
test <- test[,-c(1:7)]
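The same filtering idea can be sketched on a toy data frame, selecting columns by name instead of by negative position. This is only an illustration under assumptions: the toy columns are invented and much smaller than the real data. One reason to prefer names is that a negative index built from `which()` drops every column when no column matches.

```r
# Toy stand-in for the real data; column names are illustrative only
toy <- data.frame(
  X = 1:4,
  user_name = c("a", "b", "a", "b"),
  roll_belt = c(1.4, 1.5, 1.6, 1.7),
  kurtosis_roll_belt = c(NA, NA, NA, 0.3),
  classe = factor(c("A", "B", "A", "C"))
)

# Count NA values per column in one vectorised call
na_counts <- colSums(is.na(toy))

# Keep complete columns, then drop the bookkeeping columns by name
keep <- names(na_counts)[na_counts == 0]
keep <- setdiff(keep, c("X", "user_name"))
toy_clean <- toy[, keep, drop = FALSE]
names(toy_clean)

# Pitfall with negative positions: an empty index selects no columns at all
ncol(toy[, -integer(0)])
```

Applying the column names kept from the training set to the test set would also guard against the two files having columns in a different order.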
Now, after removing the columns with NA values, the train and test datasets have fewer columns and are smaller, which makes them less computationally demanding to process. The new dimensions are:
# Check new datasets dimensions
dim(train)
## [1] 19622 53
dim(test)
## [1] 20 53
Instead of doing a single random data partition, we will use cross-validation. A 10-fold cross-validation will be set up with the trainControl() function of the caret package.
Below we fit Random Forest (rf) and Linear Discriminant Analysis (lda) models to see how they perform.
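To make the idea of 10-fold cross-validation concrete, here is a minimal hand-rolled sketch. It is not how caret implements it internally. It assumes the built-in `iris` data as a stand-in for the accelerometer data and uses `lda()` from the MASS package, with simple unstratified fold assignment.

```r
library(MASS)  # provides lda()

set.seed(1234)
k <- 10

# Randomly assign each row to one of k folds of equal size (150 / 10 = 15)
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
fold_acc <- sapply(seq_len(k), function(i) {
  held_out <- iris[fold_id == i, ]
  fit <- lda(Species ~ ., data = iris[fold_id != i, ])
  mean(predict(fit, held_out)$class == held_out$Species)
})

# The cross-validated accuracy is the average over the k held-out folds
cv_accuracy <- mean(fold_acc)
cv_accuracy
```

Every observation is scored exactly once by a model that never saw it, which is what makes the averaged figure an estimate of out-of-sample accuracy.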
# Set the seed for reproducibility
set.seed(1234)
# Load necessary libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Cross validation - 10 folds
cv10 <- trainControl(method = "cv",allowParallel = TRUE, number = 10)
# Since it takes more than an hour to fit RF model, I'll check if it already exists.
# Start the clock!
# ptm <- proc.time()
# fit Random Forest(RF) model
# fitRf <- train(classe ~ ., data = train, method = "rf", trControl = cv10)
# Stop the clock
# (proc.time() - ptm)/60
# Time spent(in minutes) to fit the model : 73 minutes
# user system elapsed
# 62.1833333 0.2756667 73.0075000
# load saved fitted model from disk
fitRf <- readRDS("./fitRf_model.rds")
# predicting the outcome variable(classe) on training dataset
predRf <- predict(fitRf,train)
table(predRf,train$classe)
##
## predRf A B C D E
## A 5580 0 0 0 0
## B 0 3797 0 0 0
## C 0 0 3422 0 0
## D 0 0 0 3216 0
## E 0 0 0 0 3607
confusionMatrix(predRf,train$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5580 0 0 0 0
## B 0 3797 0 0 0
## C 0 0 3422 0 0
## D 0 0 0 3216 0
## E 0 0 0 0 3607
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9998, 1)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
# fit Linear discriminant analysis (LDA) model
# Start the clock!
ptm <- proc.time()
fitLda <- train(classe ~ ., data = train, method = "lda", trControl = cv10)
# Stop the clock
(proc.time() - ptm)/60
## user system elapsed
## 0.1881667 0.0135000 0.2023333
# Time spent(in minutes) to fit the model: 0.3 minute
# user system elapsed
# 0.19666667 0.01416667 0.28716667
# Predicting with Linear discriminant analysis(LDA) model
predLda <- predict(fitLda,train)
table(predLda,train$classe)
##
## predLda A B C D E
## A 4568 586 341 191 133
## B 121 2429 333 130 611
## C 444 455 2254 379 323
## D 429 148 411 2383 344
## E 18 179 83 133 2196
confusionMatrix(predLda,train$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4568 586 341 191 133
## B 121 2429 333 130 611
## C 444 455 2254 379 323
## D 429 148 411 2383 344
## E 18 179 83 133 2196
##
## Overall Statistics
##
## Accuracy : 0.7048
## 95% CI : (0.6984, 0.7112)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6264
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8186 0.6397 0.6587 0.7410 0.6088
## Specificity 0.9109 0.9245 0.9012 0.9188 0.9742
## Pos Pred Value 0.7850 0.6703 0.5847 0.6415 0.8417
## Neg Pred Value 0.9267 0.9145 0.9259 0.9476 0.9171
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2328 0.1238 0.1149 0.1214 0.1119
## Detection Prevalence 0.2966 0.1847 0.1965 0.1893 0.1330
## Balanced Accuracy 0.8648 0.7821 0.7799 0.8299 0.7915
The random forest model classified every training observation correctly: 100% accuracy on the training dataset (95% CI: 0.9998 to 1). Because this figure is computed on the same data used to fit the model, it is an optimistic estimate of out-of-sample accuracy.
The linear discriminant analysis model performed considerably worse, with 70.48% accuracy on the same data.
Based on the performance comparison above, the random forest model is the better fit and will be used for the out-of-sample prediction, in this case on the testing dataset with 20 new samples.
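Since accuracy measured on the training data tends to flatter a random forest, a rough sketch of a more honest check is shown here. It assumes the built-in `iris` data and a small forest as stand-ins for the real model, and compares accuracy on the training rows with the out-of-bag (OOB) accuracy that the randomForest package tracks while fitting.

```r
library(randomForest)

set.seed(1234)
rf <- randomForest(Species ~ ., data = iris, ntree = 200)

# Resubstitution accuracy: predict the same rows the forest was grown on
resub_acc <- mean(predict(rf, iris) == iris$Species)

# OOB accuracy: predict() without new data returns out-of-bag predictions,
# where each row is scored only by trees that did not see it
oob_acc <- mean(predict(rf) == iris$Species)

# The same quantity as an error rate, taken from the last row of err.rate
oob_err <- rf$err.rate[rf$ntree, "OOB"]

c(resubstitution = resub_acc, oob = oob_acc)
```

For a model fitted through caret, the underlying forest should be reachable as `fitRf$finalModel`, so the same OOB figure could be read from there.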
Below are the random forest predictions for the outcome variable classe on the testing dataset:
# Based on the outstanding accuracy of the Random Forest model, I will use it to do the predictions on the testing dataset.
predRftest <- predict(fitRf,test)
predRftest
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E