Practical Machine Learning Course Project

Background

Human Activity Recognition (HAR) has emerged as a key research area in the last few years and is gaining increasing attention from the pervasive computing research community. Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

Reference(s)

Human Activity Recognition. Refer to the section on the Weight Lifting Exercise Dataset.

Overview

The goal of your project is to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

Data Sources

Importing Dataset into R

# Download HAR training dataset only if needed.
if (!file.exists("har_training_data.csv")) {
  fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  download.file(fileUrl, destfile = "har_training_data.csv", method = "curl")
}
# Download HAR test data only if needed
if (!file.exists("har_test_data.csv")) {
  fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
  download.file(fileUrl, destfile = "har_test_data.csv", method = "curl")
}
# List the files in the current working directory
list.files("./")

## [1] "har_test_data.csv"              "har_training_data.csv"         
## [3] "index.html"                     "index.Rmd"                     
## [5] "LICENSE"                        "PracticalMachineLearning.Rproj"
## [7] "README.md"

# Load the CSV files as dataframes
training_data <- read.csv("har_training_data.csv", na.strings=c("NA","#DIV/0!",""))
test_data <- read.csv("har_test_data.csv", na.strings=c("NA","#DIV/0!",""))
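
To illustrate what the na.strings argument does, here is a minimal sketch on a made-up four-row CSV. The column values are hypothetical and not taken from the HAR files. Spreadsheet artefacts such as #DIV/0! and empty cells are read as NA, so the affected column can still be parsed as numeric instead of text.

```r
# Hypothetical CSV text; the real HAR files are far larger.
csv_text <- "roll_belt,kurtosis_roll_belt,classe
1.41,#DIV/0!,A
1.42,,A
1.48,NA,B
1.50,-1.2,B"

toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

# The three placeholder values are all read as NA,
# and the remaining value keeps the column numeric.
print(toy)
print(sapply(toy, class))
```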

Pre-processing of Dataset

Before applying a standard dimensionality reduction technique such as PCA, identify the columns of the dataset that need to be removed because most of their entries are missing.

Cleaning

# Count the NAs in each column of the dataset
na_count <- sapply(training_data, function(y) sum(is.na(y)))
# Remove every column in which more than 60% of the entries are NA
training_data <- training_data[, na_count <= 0.6 * nrow(training_data)]
# Drop the first 7 columns so that the dataset starts at the roll_belt measure
training_data <- subset(training_data, select = -c(1:7))
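
The NA filter can be checked on a small example. The sketch below uses a hypothetical toy data frame (columns a, b, c are invented for illustration) and the same 60% threshold as the cleaning step.

```r
# Hypothetical toy data: column "b" is 75% NA, column "c" is 25% NA.
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 1),
  c = c(1, NA, 3, 4)
)

# Fraction of NA entries per column.
na_share <- colMeans(is.na(toy))

# Keep only the columns with at most 60% missing entries.
kept <- toy[, na_share <= 0.6, drop = FALSE]

print(na_share)
print(names(kept))
```

Only column b exceeds the threshold, so a and c are kept.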

Correlated Predictors

Use the findCorrelation function from caret, which searches through a correlation matrix and returns a vector of integers corresponding to the columns to remove in order to reduce pair-wise correlations.

library(caret); library(kernlab);

## Loading required package: lattice

## Loading required package: ggplot2

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

# Exclude the last column (the classe outcome) before computing correlations
corr_cols <- findCorrelation(cor(training_data[, -ncol(training_data)]),
                             cutoff = 0.80,
                             verbose = TRUE)

## Compare row 10  and column  1 with corr  0.992 
##   Means:  0.27 vs 0.168 so flagging column 10 
## Compare row 1  and column  9 with corr  0.925 
##   Means:  0.25 vs 0.164 so flagging column 1 
## Compare row 9  and column  4 with corr  0.928 
##   Means:  0.233 vs 0.161 so flagging column 9 
## Compare row 36  and column  29 with corr  0.849 
##   Means:  0.251 vs 0.158 so flagging column 36 
## Compare row 8  and column  2 with corr  0.966 
##   Means:  0.239 vs 0.154 so flagging column 8 
## Compare row 2  and column  11 with corr  0.884 
##   Means:  0.221 vs 0.15 so flagging column 2 
## Compare row 21  and column  24 with corr  0.814 
##   Means:  0.194 vs 0.149 so flagging column 21 
## Compare row 34  and column  28 with corr  0.808 
##   Means:  0.185 vs 0.146 so flagging column 34 
## Compare row 25  and column  26 with corr  0.814 
##   Means:  0.154 vs 0.145 so flagging column 25 
## Compare row 19  and column  18 with corr  0.918 
##   Means:  0.096 vs 0.146 so flagging column 18 
## Compare row 46  and column  45 with corr  0.846 
##   Means:  0.11 vs 0.149 so flagging column 45 
## Compare row 46  and column  31 with corr  0.914 
##   Means:  0.091 vs 0.152 so flagging column 31 
## Compare row 46  and column  33 with corr  0.933 
##   Means:  0.07 vs 0.156 so flagging column 33 
## All correlations <= 0.8

The analysis flags the following columns as highly correlated with other predictors:

names(training_data)[corr_cols]

##  [1] "accel_belt_z"     "roll_belt"        "accel_belt_y"    
##  [4] "accel_dumbbell_z" "accel_belt_x"     "pitch_belt"      
##  [7] "accel_arm_x"      "accel_dumbbell_x" "magnet_arm_y"    
## [10] "gyros_forearm_y"  "gyros_dumbbell_x" "gyros_dumbbell_z"
## [13] "gyros_arm_x"
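
As a rough illustration of what such a correlation search looks for, the sketch below lists variable pairs whose absolute correlation exceeds the 0.80 cutoff. It uses base R only and hypothetical simulated data. It is not caret's algorithm, which additionally decides which member of each pair to drop based on mean absolute correlation.

```r
set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.1),  # nearly a copy of x1
  x3 = rnorm(200)                  # unrelated to x1 and x2
)

# Absolute correlations; blank out the upper triangle and the diagonal
# so that each pair is reported only once.
cm <- abs(cor(toy))
cm[upper.tri(cm, diag = TRUE)] <- 0
high <- which(cm > 0.8, arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  abs_cor = cm[high]
)
print(pairs)
```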

Because several predictors are highly correlated, it makes sense to use PCA to reduce the number of predictors and, with it, the noise.
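
The effect of PCA on correlated predictors can be sketched with base R's prcomp on hypothetical simulated data (three noisy copies of one signal plus one independent column). The 95% variance cutoff below is an assumption chosen to mirror what is, as far as I know, caret's default threshold for preProcess = "pca"; it is not stated in this report.

```r
set.seed(2)
z <- rnorm(300)
toy <- data.frame(
  a = z + rnorm(300, sd = 0.1),  # a, b, c are noisy copies of z
  b = z + rnorm(300, sd = 0.1),
  c = z + rnorm(300, sd = 0.1),
  d = rnorm(300)                 # independent of the others
)

# Centre and scale, as caret does before extracting components.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the leading components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Number of components needed to reach 95% of the variance.
n_comp <- which(cum_var >= 0.95)[1]

print(round(cum_var, 3))
print(n_comp)
```

The three correlated columns collapse into essentially one component, so fewer components than original columns are needed.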

Training Models and Their Specifications

# Random Forest
rf_model_fit <- train(classe ~ ., method="rf", preProcess="pca", data=training_data, trControl = trainControl(method = "cv"))

## Loading required package: randomForest

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

print(rf_model_fit)

## Random Forest 
## 
## 19622 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: principal component signal extraction (52), centered
##  (52), scaled (52) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17660, 17659, 17660, 17659, 17659, 17660, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9821627  0.9774343
##   27    0.9731422  0.9660286
##   52    0.9725815  0.9653176
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
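
The "Summary of sample sizes" line follows from 10-fold cross-validation: each model is fitted on roughly nine tenths of the 19622 rows and evaluated on the held-out tenth. The sketch below reproduces the arithmetic with plain random folds in base R. This is a simplification, since caret builds its folds with class stratification, so its sizes can differ by a row or two.

```r
set.seed(3)
n <- 19622   # number of training rows reported above
k <- 10      # number of folds

# Assign each row to one of k folds at random (plain, unstratified folds).
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each model is trained on all rows outside one fold.
train_sizes <- n - as.vector(table(fold_id))
print(train_sizes)
```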

# Boosted Logistic Regression
logitboost_model_fit <- train(classe ~ ., method="LogitBoost", preProcess="pca", data=training_data, trControl = trainControl(method = "cv"))

## Loading required package: caTools

print(logitboost_model_fit)

## Boosted Logistic Regression 
## 
## 19622 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: principal component signal extraction (52), centered
##  (52), scaled (52) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17660, 17660, 17660, 17659, 17661, 17661, ... 
## Resampling results across tuning parameters:
## 
##   nIter  Accuracy   Kappa    
##   11     0.6183350  0.5024192
##   21     0.6318632  0.5186530
##   31     0.6269243  0.5161601
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was nIter = 21.

# Support Vector Machines with Linear Kernel
svmLinear_model_fit <- train(classe ~ ., method="svmLinear", preProcess="pca", data=training_data, trControl = trainControl(method = "cv"))
print(svmLinear_model_fit)

## Support Vector Machines with Linear Kernel 
## 
## 19622 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: principal component signal extraction (52), centered
##  (52), scaled (52) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17660, 17660, 17658, 17659, 17661, 17658, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.5725221  0.4577581
## 
## Tuning parameter 'C' was held constant at a value of 1
##

# Decision Tree
rpart_model_fit <- train(classe ~ ., method="rpart", preProcess="pca", data=training_data, trControl = trainControl(method = "cv"))

## Loading required package: rpart

print(rpart_model_fit)

## CART 
## 
## 19622 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: principal component signal extraction (52), centered
##  (52), scaled (52) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17660, 17660, 17661, 17660, 17659, 17659, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.03567868  0.4015405  0.1998109
##   0.05998671  0.3769255  0.1605753
##   0.11515454  0.2843747  0.0000000
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.03567868.

library(rattle)

## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

fancyRpartPlot(rpart_model_fit$finalModel)

Prediction using the Trained Models

# Random Forest
cat("Random Forest Model Accuracy = ", rf_model_fit$results$Accuracy, "\n")

## Random Forest Model Accuracy =  0.9821627 0.9731422 0.9725815

predict(rf_model_fit, test_data)

##  [1] B A A A A E D B A A B C B A E E A B B B
## Levels: A B C D E

# Boosted Logistic Regression
cat("Boosted Logistic Regression Model Accuracy = ", logitboost_model_fit$results$Accuracy, "\n")

## Boosted Logistic Regression Model Accuracy =  0.618335 0.6318632 0.6269243

predict(logitboost_model_fit, test_data)

##  [1] <NA> <NA> <NA> <NA> A    <NA> D    A    A    A    <NA> <NA> B    A   
## [15] E    <NA> A    <NA> A    <NA>
## Levels: A B C D E
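
The NA entries are test cases for which the boosted model returned no label. My understanding of the caTools documentation is that this happens when the class votes are tied, but that is an assumption and has not been verified against this run. The sketch below simply counts the unclassified cases, using the predictions copied from the output above.

```r
# Predictions copied from the LogitBoost output above.
lb_pred <- factor(
  c(NA, NA, NA, NA, "A", NA, "D", "A", "A", "A",
    NA, NA, "B", "A", "E", NA, "A", NA, "A", NA),
  levels = c("A", "B", "C", "D", "E")
)

# Number and share of test cases left without a label.
n_unclassified <- sum(is.na(lb_pred))
share_unclassified <- mean(is.na(lb_pred))

cat("Unclassified test cases:", n_unclassified, "of", length(lb_pred), "\n")
```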

# Support Vector Machines with Linear Kernel
cat("Support Vector Machines with Linear Kernel Model Accuracy = ", svmLinear_model_fit$results$Accuracy, "\n")

## Support Vector Machines with Linear Kernel Model Accuracy =  0.5725221

predict(svmLinear_model_fit, test_data)

##  [1] C C A A A C D D A C A C E A E B A B A B
## Levels: A B C D E

# Decision Tree
cat("Decision Tree = ", rpart_model_fit$results$Accuracy, "\n")

## Decision Tree =  0.4015405 0.3769255 0.2843747

predict(rpart_model_fit, test_data)

##  [1] C C C A C C D C A C A C A A E C A A A E
## Levels: A B C D E

Conclusion / Remarks

On the given training dataset, Random Forest performed much better than any of its counterparts, with a cross-validated accuracy of about 98% (an estimated out-of-sample error of about 2%). Its actual accuracy on the test data was approximately 95%. The other three algorithms, Boosted Logistic Regression (LogitBoost), SVM with a linear kernel, and Decision Tree (rpart), were far less accurate (about 63%, 57%, and 40% respectively), and LogitBoost left some of the test cases unclassified (refer to the model summaries above).
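
The comparison can be summarised in one table. The sketch below copies the best cross-validated accuracy of each model from the summaries above and takes one minus that accuracy as a rough estimate of the out-of-sample error. Treating the cross-validated figure as the expected error is an assumption of this sketch, and the short model labels are chosen here for illustration.

```r
# Best cross-validated accuracy per model, copied from the summaries above.
cv_accuracy <- c(
  rf         = 0.9821627,
  LogitBoost = 0.6318632,
  svmLinear  = 0.5725221,
  rpart      = 0.4015405
)

summary_tbl <- data.frame(
  model     = names(cv_accuracy),
  accuracy  = unname(cv_accuracy),
  est_error = 1 - unname(cv_accuracy)  # rough out-of-sample error estimate
)

# Best model first.
summary_tbl <- summary_tbl[order(-summary_tbl$accuracy), ]
print(summary_tbl, row.names = FALSE)
```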