Human Activity Recognition (HAR) has emerged as a key research area in the last few years and is gaining increasing attention from the pervasive computing research community. Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
Data source: Human Activity Recognition. Refer to the section on the Weight Lifting Exercise Dataset.
The goal of your project is to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
# Download the HAR training dataset only if needed.
if (!file.exists("har_training_data.csv")) {
  fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  download.file(fileUrl, destfile = "har_training_data.csv", method = "curl")
}
# Download the HAR test dataset only if needed.
if (!file.exists("har_test_data.csv")) {
  fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
  download.file(fileUrl, destfile = "har_test_data.csv", method = "curl")
}
# List the files in the current working directory
list.files("./")
## [1] "har_test_data.csv" "har_training_data.csv"
## [3] "index.html" "index.Rmd"
## [5] "LICENSE" "PracticalMachineLearning.Rproj"
## [7] "README.md"
# Load the CSV files as dataframes
training_data <- read.csv("har_training_data.csv", na.strings=c("NA","#DIV/0!",""))
test_data <- read.csv("har_test_data.csv", na.strings=c("NA","#DIV/0!",""))
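The `na.strings` argument used above tells `read.csv` to treat the literal `NA`, the spreadsheet error marker `#DIV/0!`, and empty fields as missing values. The following is a minimal sketch of that behaviour on a made-up three-row CSV. The values are invented for illustration and are not taken from the dataset.

```r
# Hypothetical inline CSV; the values are invented for illustration only.
csv_text <- "roll_belt,kurtosis_roll_belt,classe
1.41,#DIV/0!,A
1.42,,A
1.48,NA,B"

toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

# All three markers in the second column are read as NA
print(toy)
print(colSums(is.na(toy)))
```

Without `na.strings`, a column containing `#DIV/0!` would be read as text rather than as a numeric column with missing entries.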
Before applying dimensionality reduction with standard techniques such as PCA, identify the columns in the dataset that need to be removed because most of their entries are missing.

### Cleaning
# Count the NAs in each column of the dataset
na_count <- sapply(training_data, function(y) sum(is.na(y)))
# Keep only the columns in which at most 60% of the entries are NA
training_data <- training_data[, na_count <= 0.6 * nrow(training_data)]
# Drop the first seven columns so the dataset starts at the roll_belt measurement
training_data <- subset(training_data, select = -c(1:7))
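To make the 60% threshold concrete, here is a small sketch of the same column filter applied to a toy data frame. The data frame and its column names `a`, `b`, and `c` are hypothetical; only the filtering logic mirrors the cleaning step above.

```r
# Toy data frame (invented values): "b" is 75% NA, "c" is 25% NA.
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c(1, NA, 3, 4),
  classe = c("A", "B", "A", "C")
)

# Count missing entries per column
na_count <- sapply(toy, function(y) sum(is.na(y)))

# Keep only columns where at most 60% of the entries are NA
keep <- na_count <= 0.6 * nrow(toy)
cleaned <- toy[, keep, drop = FALSE]

print(na_count)
print(names(cleaned))
```

Column `b` exceeds the threshold and is dropped, while `c` is kept despite having one missing entry.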
# Random Forest
library(caret)  # provides train() and trainControl()
rf_model_fit <- train(classe ~ ., method="rf", preProcess="pca", data=training_data, trControl = trainControl(method = "cv"))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
print(rf_model_fit)
## Random Forest
##
## 19622 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: principal component signal extraction (52), centered
## (52), scaled (52)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 17660, 17659, 17660, 17659, 17659, 17660, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9821627 0.9774343
## 27 0.9731422 0.9660286
## 52 0.9725815 0.9653176
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
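The "Cross-Validated (10 fold)" line in the summary means the rows were split into ten parts, each part was held out once, and the accuracies on the held-out parts were averaged. The sketch below illustrates that idea in base R only. It is not caret's implementation, and it uses the built-in `iris` data with a simple nearest-centroid classifier as stand-ins for the HAR data and the random forest.

```r
# Illustrative 10-fold cross-validation in base R (assumed stand-ins: iris data,
# nearest-centroid classifier). This is a sketch, not caret's internals.
set.seed(42)
k <- 10
n <- nrow(iris)

# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

predict_nearest_centroid <- function(train, test) {
  features <- setdiff(names(train), "Species")
  # One row of feature means per class
  centroids <- aggregate(train[features], by = list(Species = train$Species), FUN = mean)
  centroid_matrix <- as.matrix(centroids[features])
  # For each test row, pick the class whose centroid is closest (squared distance)
  nearest <- apply(as.matrix(test[features]), 1, function(obs) {
    which.min(colSums((t(centroid_matrix) - obs)^2))
  })
  centroids$Species[nearest]
}

# Hold out each fold once, fit on the rest, and score the held-out rows
fold_accuracy <- sapply(seq_len(k), function(i) {
  held_out <- iris[fold_id == i, ]
  fitted_on <- iris[fold_id != i, ]
  predicted <- predict_nearest_centroid(fitted_on, held_out)
  mean(as.character(predicted) == as.character(held_out$Species))
})

cv_accuracy <- mean(fold_accuracy)
print(fold_accuracy)
print(cv_accuracy)
```

The averaged value plays the same role as the Accuracy column in the summary above: an estimate of how the model performs on rows it was not fitted on.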
# Boosted Logistic Regression
logitboost_model_fit <- train(classe ~ ., method="LogitBoost", preProcess="pca", data=training_data, trControl = trainControl(method = "cv"))
## Loading required package: caTools
print(logitboost_model_fit)
## Boosted Logistic Regression
##
## 19622 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: principal component signal extraction (52), centered
## (52), scaled (52)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 17660, 17660, 17660, 17659, 17661, 17661, ...
## Resampling results across tuning parameters:
##
## nIter Accuracy Kappa
## 11 0.6183350 0.5024192
## 21 0.6318632 0.5186530
## 31 0.6269243 0.5161601
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was nIter = 21.
# Support Vector Machines with Linear Kernel
svmLinear_model_fit <- train(classe ~ ., method="svmLinear", preProcess="pca", data=training_data, trControl = trainControl(method = "cv"))
print(svmLinear_model_fit)
## Support Vector Machines with Linear Kernel
##
## 19622 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: principal component signal extraction (52), centered
## (52), scaled (52)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 17660, 17660, 17658, 17659, 17661, 17658, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5725221 0.4577581
##
## Tuning parameter 'C' was held constant at a value of 1
##
# Decision Tree
rpart_model_fit <- train(classe ~ ., method="rpart", preProcess="pca", data=training_data, trControl = trainControl(method = "cv"))
## Loading required package: rpart
print(rpart_model_fit)
## CART
##
## 19622 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: principal component signal extraction (52), centered
## (52), scaled (52)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 17660, 17660, 17661, 17660, 17659, 17659, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03567868 0.4015405 0.1998109
## 0.05998671 0.3769255 0.1605753
## 0.11515454 0.2843747 0.0000000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03567868.
library(rattle)
## Please install GTK+ from http://r.research.att.com/libs/GTK_2.24.17-X11.pkg
## If the package still does not load, please ensure that GTK+ is installed and that it is on your PATH environment variable
## IN ANY CASE, RESTART R BEFORE TRYING TO LOAD THE PACKAGE AGAIN
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(rpart_model_fit$finalModel)
# Random Forest
cat("Random Forest Model Accuracy = ", rf_model_fit$results$Accuracy, "\n")
## Random Forest Model Accuracy = 0.9821627 0.9731422 0.9725815
predict(rf_model_fit, test_data)
## [1] B A A A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# Boosted Logistic Regression
cat("Boosted Logistic Regression Model Accuracy = ", logitboost_model_fit$results$Accuracy, "\n")
## Boosted Logistic Regression Model Accuracy = 0.618335 0.6318632 0.6269243
predict(logitboost_model_fit, test_data)
## [1] <NA> <NA> <NA> <NA> A <NA> D A A A <NA> <NA> B A
## [15] E <NA> A <NA> A <NA>
## Levels: A B C D E
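Several of the LogitBoost predictions above are `<NA>`, meaning no class was returned for those test cases. A short sketch for counting them follows. The vector is copied by hand from the printed output, and the variable names are illustrative.

```r
# Predictions copied from the LogitBoost output above (NA = no class returned)
logitboost_pred <- c(NA, NA, NA, NA, "A", NA, "D", "A", "A", "A",
                     NA, NA, "B", "A", "E", NA, "A", NA, "A", NA)

n_unclassified <- sum(is.na(logitboost_pred))
cat("Unclassified test cases:", n_unclassified, "of", length(logitboost_pred), "\n")

# Distribution of predicted classes, including the NA bucket
print(table(factor(logitboost_pred, levels = c("A", "B", "C", "D", "E")), useNA = "ifany"))
```

In a live session the same count would come from `sum(is.na(predict(logitboost_model_fit, test_data)))`.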
# Support Vector Machines with Linear Kernel
cat("Support Vector Machines with Linear Kernel Model Accuracy = ", svmLinear_model_fit$results$Accuracy, "\n")
## Support Vector Machines with Linear Kernel Model Accuracy = 0.5725221
predict(svmLinear_model_fit, test_data)
## [1] C C A A A C D D A C A C E A E B A B A B
## Levels: A B C D E
# Decision Tree
cat("Decision Tree = ", rpart_model_fit$results$Accuracy, "\n")
## Decision Tree = 0.4015405 0.3769255 0.2843747
predict(rpart_model_fit, test_data)
## [1] C C C A C C D C A C A C A A E C A A A E
## Levels: A B C D E
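The four sets of predictions can be compared directly. The sketch below copies the printed vectors by hand and measures how often the weaker models agree with Random Forest on the 20 test cases. Agreement with Random Forest is not the same as accuracy, since the true labels of the test cases are not part of this report. The variable names are illustrative.

```r
# Predictions copied from the printed outputs above
rf_pred    <- c("B", "A", "A", "A", "A", "E", "D", "B", "A", "A",
                "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")
svm_pred   <- c("C", "C", "A", "A", "A", "C", "D", "D", "A", "C",
                "A", "C", "E", "A", "E", "B", "A", "B", "A", "B")
rpart_pred <- c("C", "C", "C", "A", "C", "C", "D", "C", "A", "C",
                "A", "C", "A", "A", "E", "C", "A", "A", "A", "E")

# Fraction of the 20 test cases where each model agrees with Random Forest
agreement <- c(
  svmLinear = mean(svm_pred == rf_pred),
  rpart     = mean(rpart_pred == rf_pred)
)
print(agreement)

# Classes that each model never predicts on these 20 cases
all_classes <- c("A", "B", "C", "D", "E")
never_predicted <- lapply(
  list(rf = rf_pred, svmLinear = svm_pred, rpart = rpart_pred),
  function(p) setdiff(all_classes, unique(p))
)
print(never_predicted)
```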
On the training dataset, Random Forest performed much better than any of its counterparts, with a cross-validated accuracy of about 98%. Its actual accuracy on the test data was approximately 95%. The other algorithms were far less accurate (refer to the model summaries above): Boosted Logistic Regression (LogitBoost) left 10 of the 20 test cases unclassified (NA), and the Decision Tree (rpart) never predicted class B for any test case.
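One simple way to turn the cross-validated accuracy into an expected out-of-sample error is to take one minus the best accuracy. The sketch below does this with the numbers copied from the model summaries above. The literal vectors stand in for `rf_model_fit$results$Accuracy` and its counterparts, and the variable names are illustrative.

```r
# Cross-validated accuracies copied from the Random Forest summary above
# (stand-in for rf_model_fit$results$Accuracy)
rf_accuracy_by_mtry <- c(0.9821627, 0.9731422, 0.9725815)

best_rf_accuracy <- max(rf_accuracy_by_mtry)
expected_oos_error <- 1 - best_rf_accuracy
cat(sprintf("Estimated out-of-sample error: %.2f%%\n", 100 * expected_oos_error))

# Best cross-validated accuracy per model, copied from the summaries above
model_summary <- data.frame(
  model = c("rf", "LogitBoost", "svmLinear", "rpart"),
  cv_accuracy = c(0.9821627, 0.6318632, 0.5725221, 0.4015405)
)
model_summary <- model_summary[order(-model_summary$cv_accuracy), ]
print(model_summary)
```

This gives an estimate of roughly 1.8% for Random Forest. The 95% observed on the 20 test cases corresponds to a single misclassification, so the two figures are not in conflict given how small the test set is.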