This report describes my approach to applying a machine learning algorithm to the Human Activity Recognition data set provided by PUC-Rio. The data were collected from 6 participants who performed barbell lifts in five different ways, correctly and incorrectly, using accelerometers on the belt, forearm, arm, and dumbbell. The goal of this project is to build a multiclass classification model that predicts which activity was performed in a hidden data set.
The multiclass classification algorithm chosen for this project was random forest, for two reasons: after pre-processing, the data contained no missing values, and random forests are reported to be among the most accurate current algorithms. The random forest model built in this project correctly classified 19 of the 20 hidden samples.
The code in this project was executed in R (version 3.1.3) with the caret, randomForest, and doMC libraries, running under OS X 10.10.3 (Yosemite) on a machine with a quad-core processor and 8 GB of RAM.
To begin, the necessary packages are loaded into R and a seed is set for reproducibility.
library(caret);
library(randomForest);
library(doMC);
set.seed(125);
Two data sets are used in this project. The first is used to build the multiclass classification model; the second contains unlabeled samples whose classes have to be predicted and submitted for automated grading. The following R code downloads both files into the current working directory.
download.file(url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
destfile = "pml-training.csv", method = "curl");
download.file(url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
destfile = "pml-testing.csv" , method = "curl");
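Re-running the report downloads both files again each time. A minimal sketch of one way to avoid that is shown below; `fetch_if_missing` is a hypothetical helper name that is not part of the original code, and it relies on the default download method rather than `"curl"`, on the assumption that this is acceptable on the reader's platform.

```r
# Hypothetical helper (not part of the original report): download a file
# only when it is not already present at the destination path.
fetch_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)  # nothing downloaded, the existing copy is reused
  }
  # download.file() returns 0 (invisibly) on success
  status <- download.file(url = url, destfile = destfile, mode = "wb")
  status == 0
}
```

It could then be called once per file, for example `fetch_if_missing(url, "pml-training.csv")` with `url` set to the training URL used above.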
With the data in the current working directory, the data set used to build the model is loaded into R. Three encodings of missing values ("NA", "#DIV/0!", and the empty string) are treated as NA while the file is read. The training data set has 19622 samples and 160 variables.
missing_values_flag = c("NA","#DIV/0!", "");
training <- read.csv(file = "pml-training.csv", na.strings = missing_values_flag);
dim(training);
## [1] 19622 160
# Shows the first 12 variable names
colnames(training[ , 1:12]);
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt" "pitch_belt"
## [10] "yaw_belt" "total_accel_belt" "kurtosis_roll_belt"
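To illustrate why the `na.strings` argument matters, the sketch below reads a tiny in-memory CSV with made-up values that contains all three missing-value encodings. The column names are borrowed from the real data, but the rows are invented for illustration only.

```r
# Toy CSV (made-up values) containing the three missing-value encodings
csv_text <- paste(
  "roll_belt,kurtosis_roll_belt,classe",
  "1.41,#DIV/0!,A",
  "1.42,,B",
  "NA,0.5,C",
  sep = "\n"
)

# All three encodings are mapped to NA, so both columns stay numeric
toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))
str(toy)

# Number of missing values per column
colSums(is.na(toy))
```

Without `"#DIV/0!"` in `na.strings`, the `kurtosis_roll_belt` column would be read as text rather than as numbers, which is why the flag list is passed when loading the training file.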
The next step is to pre-process the data. First, the first six variables (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window) are removed from the training data because they are not considered useful for prediction. Next, predictors with near-zero variance are removed. Finally, predictors in which at least 70% of the values are missing are removed. After this step, 54 columns remain (53 predictors plus the outcome) and no predictor has a missing value.
# Removes the first six variables
# from the data because they are not
# considered useful for prediction
training = training[ , -c(1, 2, 3, 4, 5, 6)];
# remove near zero variance predictors
# (the length check avoids dropping every column if none is flagged)
nzvar <- nearZeroVar(training);
if (length(nzvar) > 0) training = training[ , -nzvar];
# Check if a given column has at least a threshold (70%) of its values as NA
thresholdNA <- 0.7;
checkNA <- function(col) {
(sum(is.na(col)) >= (thresholdNA * length(col)))
}
# get variables to remove based on training set
lotNas <- sapply(training, checkNA);
# remove variables that have many NA values from the training set
training <- training[ , !lotNas];
# Reduced the number of predictors
dim(training);
## [1] 19622 54
# Now the training set has no missing value
any(is.na(training));
## [1] FALSE
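The two filters used above can be sketched in base R on a toy data frame. This is only an approximation of the idea: `near_zero_var_sketch` is a hypothetical function written for this example, and its cut-offs (a frequency ratio of 95/5 and 10% unique values) are assumed to mirror caret's documented defaults rather than reproduce `nearZeroVar` exactly. All values are made up.

```r
# Toy data frame with made-up values, for illustration only
toy <- data.frame(
  mostly_na = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, 2),  # 80% NA
  constant  = rep(0, 10),                                # a single value
  signal    = c(1.2, 3.4, 2.2, 5.1, 4.8, 0.3, 2.9, 3.3, 1.1, 4.0)
)

# Filter 1: fraction of NA values per column, keep columns below 70%
na_fraction <- colMeans(is.na(toy))
keep_na <- na_fraction < 0.7

# Filter 2 (simplified, hypothetical stand-in for nearZeroVar): flag a column
# when it has one distinct value, or when the most common value dominates
# the second most common one and there are few distinct values overall
near_zero_var_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)
  freq_ratio <- counts[[1]] / counts[[2]]
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique <= unique_cut
}
keep_var <- !sapply(toy, near_zero_var_sketch)

# Columns that survive both filters
keep <- keep_na & keep_var
names(toy)[keep]
```

In this toy example only the `signal` column survives: `mostly_na` fails the missing-value filter and `constant` fails the variance filter.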
After pre-processing, the training data set is split into two subsets: one is used to build the random forest model and the other to validate it. The split is stratified on the outcome variable (classe), which has 5 distinct values (A, B, C, D, E), one for each way of performing the exercise, with 60% of the samples used for training the model and 40% for testing. The following R code shows how the data was split.
###### Training data split based on the outcome
# 60% for training
# 40% for testing
trainIndex <- createDataPartition(training$classe, p = .6,
list = FALSE,
times = 1);
inTrain <- training[trainIndex, ];
inTest <- training[-trainIndex, ];
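The idea behind a split based on the outcome is that sampling happens within each class, so both subsets keep roughly the same class proportions. The base R sketch below illustrates this on a made-up outcome vector; it is a simplified stand-in for the concept, not a reimplementation of `createDataPartition`.

```r
set.seed(125)

# Made-up outcome vector with unequal class sizes (120 samples in total)
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 30, 30, 20, 20)))

# Draw 60% of the row indices within each class, then combine them
idx_by_class <- split(seq_along(classe), classe)
train_idx <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}), use.names = FALSE))

# Class proportions in the training and testing subsets
round(prop.table(table(classe[train_idx])), 2)
round(prop.table(table(classe[-train_idx])), 2)
```

Because each class contributes 60% of its own rows, the two printed tables of proportions should match here, which is the property that keeps a validation set representative of the training set.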
In the next step, the random forest model is built using caret's parallel random forest method (parRF). First, the number of cores to be used is registered. Second, the model is built to predict the classe variable from all other predictors. Finally, the model is used to make predictions on the testing set. According to the output below, it reaches a high accuracy (99.53%), misclassifying 37 of the 7846 test samples; most of the errors involve class C (16 class D samples and 7 class B samples were predicted as C, and 7 class C samples were predicted as B). The random forest is therefore expected to perform well on the unlabeled data, since those data are expected to come from the same experiment.
##### Random Forest
registerDoMC(4); # Explicitly register four cores
rfModel <- train(classe ~., data = inTrain, method = "parRF"); # call parallel random forest in caret
rfModel;
## Parallel Random Forest
##
## 11776 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9899493 0.9872866 0.001510135 0.001909256
## 27 0.9938631 0.9922386 0.001615010 0.002040862
## 53 0.9883917 0.9853191 0.004474800 0.005656878
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
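# Predict using random forest
predRF <- predict(rfModel, inTest);
# Note: confusionMatrix() expects (data = predictions, reference = truth);
# with the arguments in this order, rows are true classes and columns are predictions
confusionMatrix(inTest$classe, predRF);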
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 0 0 0 0
## B 5 1506 7 0 0
## C 0 7 1361 0 0
## D 0 0 16 1270 0
## E 0 2 0 0 1440
##
## Overall Statistics
##
## Accuracy : 0.9953
## 95% CI : (0.9935, 0.9967)
## No Information Rate : 0.2851
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.994
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9978 0.9941 0.9834 1.0000 1.0000
## Specificity 1.0000 0.9981 0.9989 0.9976 0.9997
## Pos Pred Value 1.0000 0.9921 0.9949 0.9876 0.9986
## Neg Pred Value 0.9991 0.9986 0.9964 1.0000 1.0000
## Prevalence 0.2851 0.1931 0.1764 0.1619 0.1835
## Detection Rate 0.2845 0.1919 0.1735 0.1619 0.1835
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9989 0.9961 0.9911 0.9988 0.9998
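The headline figures can be checked by hand from the confusion matrix printed above. The sketch below copies the counts into a matrix and recomputes the overall accuracy, the estimated out-of-sample error, and the per-class sensitivity, assuming caret's layout in which sensitivity is the diagonal divided by the column totals.

```r
# Counts copied from the confusion matrix printed above
cm <- matrix(
  c(2232,    0,    0,    0,    0,
       5, 1506,    7,    0,    0,
       0,    7, 1361,    0,    0,
       0,    0,   16, 1270,    0,
       0,    2,    0,    0, 1440),
  nrow = 5, byrow = TRUE,
  dimnames = list(row = LETTERS[1:5], col = LETTERS[1:5])
)

accuracy      <- sum(diag(cm)) / sum(cm)    # correct predictions / all predictions
error_rate    <- 1 - accuracy               # estimated out-of-sample error
misclassified <- sum(cm) - sum(diag(cm))    # number of off-diagonal samples

# Per-class sensitivity: diagonal divided by column totals
sensitivity <- diag(cm) / colSums(cm)

round(accuracy, 4)
round(error_rate, 4)
misclassified
round(sensitivity, 4)
```

The recomputed accuracy rounds to 0.9953, matching the reported value, which corresponds to 37 misclassified samples out of 7846 and an estimated out-of-sample error of about 0.47%.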