The data set is taken from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal is to quantify how well they do it by predicting the classe label of each observation.
training <- read.csv('pml-training.csv', na.strings=c('#DIV/0!', '', 'NA'))
dim(training)
## [1] 19622 160
A look at the data shows many sparse features that consist mostly of NAs. Columns in which at least 95% of the values are NA are identified and removed.
# Count NAs per column, convert to a fraction of rows, and keep columns below 95% NA
na_count <- sapply(training, function(y) sum(is.na(y)))
na_percent <- data.frame(na_count) / nrow(training)
training_remove_sparse_records <- training[, na_percent < 0.95]
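The same filter can be written more compactly with colMeans(is.na(...)). The following is a minimal sketch on a made-up data frame (toy, na_frac, and toy_kept are illustrative names, not part of the analysis) showing which columns survive the 95% cut-off.

```r
# Hypothetical toy data: column b is entirely NA, column c has one NA in five rows
toy <- data.frame(a = 1:5, b = rep(NA_real_, 5), c = c(1, NA, 3, 4, 5))

# Fraction of NA values per column, as a named numeric vector
na_frac <- colMeans(is.na(toy))

# Keep only columns whose NA fraction is below the 95% threshold
toy_kept <- toy[, na_frac < 0.95, drop = FALSE]
names(toy_kept)
```

Using drop = FALSE keeps the result a data frame even if only one column passes the filter.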
Columns 1:6 are removed as well. They are row identifiers, user names, timestamps, and a window flag, none of which contribute to the outcome.
str(training_remove_sparse_records[,1:6])
## 'data.frame': 19622 obs. of 6 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1: int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2: int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
training_clean<-training_remove_sparse_records[,-c(1:6)]
dim(training_clean)
## [1] 19622 54
The correlation matrix of the data set is plotted using absolute correlations, with classe converted to numeric and the diagonal elements set to 0.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
training_explore<-training_clean
training_explore$classe<-as.numeric(training_explore$classe)
cor_matrix<-abs(cor(training_explore))
diag(cor_matrix)<-0
library(corrplot)
corrplot(cor_matrix, method="square")
The correlation plot shows that many predictors are highly correlated with each other, so using PCA to reduce the dimensionality seems like a good option.
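To complement the plot, the strongly correlated pairs can also be listed explicitly. The sketch below uses the built-in mtcars data rather than the accelerometer data, and the 0.8 cut-off is an arbitrary assumption chosen for illustration.

```r
# Absolute correlations on a built-in data set (stand-in for the real predictors)
cm <- abs(cor(mtcars))
diag(cm) <- 0
cm[lower.tri(cm)] <- 0  # zero the lower triangle so each pair is reported once

# Positions of all pairs above the (assumed) 0.8 threshold
idx <- which(cm > 0.8, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  abs_cor = cm[idx]
)

# Strongest pairs first
high_pairs[order(-high_pairs$abs_cor), ]
```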
prComp<-prcomp(training_clean[,-54],scale. = TRUE)
std_dev <- prComp$sdev
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
sum(prop_varex[1:30])
## [1] 0.9741212
plot(cumsum(prop_varex), xlab = "Principal Component",ylab = "Cumulative Proportion of Variance Explained",type = "b")
abline(h=0.975,col='red',v=30)
The first 30 principal components explain about 97.4% of the total variance. So by using PCA, the dimensions can be reduced from 53 to 30.
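Instead of reading the cut-off from the plot, the smallest number of components that reaches a target share of variance can be computed directly. This is a minimal sketch on the built-in USArrests data; the 0.95 target and the names pc, cum_var, and n_comp are assumptions for illustration.

```r
# PCA on a built-in data set, with variables scaled to unit variance
pc <- prcomp(USArrests, scale. = TRUE)

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)

# Smallest k whose cumulative proportion reaches the target
target <- 0.95
n_comp <- which(cum_var >= target)[1]
n_comp
```

Applied to prComp from the analysis above, the same two lines would give the component count for any chosen threshold.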
For model building, a random forest is trained on the PCA-transformed data. Repeated cross-validation with 10 folds and 3 repeats is used to estimate out-of-sample accuracy and guard against overfitting. doMC is used to parallelize model training.
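train.data <- data.frame(classe = training_clean$classe, prComp$x)
# Column 1 is classe, so 1:30 keeps classe plus PC1 to PC29 (29 predictors, as the model summary reports)
train.data <- train.data[, 1:30]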
metric <- "Accuracy"
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# ncol(train.data) is 30 including classe, so mtry is sqrt(30), about 5.48 (not an integer)
mtry <- sqrt(ncol(train.data))
tunegrid <- expand.grid(.mtry=mtry)
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores = 4)
model_rf <- train(classe~.,data=train.data, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
print(model_rf)
## Random Forest
##
## 19622 samples
## 29 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 17659, 17660, 17661, 17660, 17659, 17659, ...
## Resampling results:
##
## Accuracy Kappa
## 0.983573 0.9792181
##
## Tuning parameter 'mtry' was held constant at a value of 5.477226
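The summary above reports 29 predictors because positional indexing with 1:30 counts the outcome column as one of the 30. One way to make the intended number of components explicit is to select columns by name. The sketch below illustrates this on the built-in iris data; k, pc_names, and toy_train are hypothetical names, and this is not the selection used for the fitted model.

```r
# PCA on the four numeric iris columns (stand-in for the real predictors)
pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Number of components to keep, stated explicitly
k <- 2
pc_names <- paste0("PC", seq_len(k))

# Outcome plus exactly k components, selected by name rather than position
toy_train <- data.frame(Species = iris$Species, pc$x)[, c("Species", pc_names)]

# Number of predictors, excluding the outcome column
ncol(toy_train) - 1
```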
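# Resubstitution check: predicting on the data the model was trained on gives an optimistic accuracy;
# the cross-validated accuracy of about 0.984 reported above is the more realistic estimate
prediction_rf <- predict(model_rf, newdata = train.data)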
confusionMatrix(prediction_rf,train.data$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5580 0 0 0 0
## B 0 3797 0 0 0
## C 0 0 3422 0 0
## D 0 0 0 3216 0
## E 0 0 0 0 3607
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9998, 1)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
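The headline figures in this output follow directly from the confusion matrix counts. As a worked check, the sketch below rebuilds the matrix from the diagonal counts shown above and recomputes the accuracy and the no-information rate (the share of the most frequent class). Because the predictions are made on the training data itself, the accuracy of 1 measures fit to the training set, not expected performance on new data.

```r
# Diagonal counts copied from the confusion matrix above; all off-diagonal cells are 0
classes <- c("A", "B", "C", "D", "E")
counts <- c(5580, 3797, 3422, 3216, 3607)
cmat <- diag(counts)
dimnames(cmat) <- list(Prediction = classes, Reference = classes)

# Accuracy: correctly classified observations over all observations
accuracy <- sum(diag(cmat)) / sum(cmat)

# No-information rate: proportion of the largest reference class
nir <- max(colSums(cmat)) / sum(cmat)

round(c(accuracy = accuracy, no_information_rate = nir), 4)
```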
The test data is cleaned and preprocessed using the same technique. It is then projected onto the principal components fitted on the training data, so the training centering, scaling, and rotation are reused. The random forest model built on the training data is finally applied to the projected test data to yield the predictions.
testing <- read.csv('pml-testing.csv', na.strings=c('#DIV/0!', '', 'NA'))
na_count <-sapply(testing, function(y) sum(is.na(y)))
na_percent <- data.frame(na_count)/nrow(testing)
testing_remove_sparse_records<-testing[,na_percent<0.95]
testing_clean<-testing_remove_sparse_records[,-c(1:6)]
test.data<-predict(prComp, newdata = testing_clean)
test.data <- as.data.frame(test.data)
# Keeps PC1 to PC30; the model was trained on PC1 to PC29 and picks those columns by name
test.data <- test.data[, 1:30]
pred_test <- predict(model_rf, test.data)
pred_test
## [1] B A A A A E D B A A B C B A E E A B B B
## Levels: A B C D E
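The projection step used for the test data can be checked on a small example. The sketch below, on the built-in iris data with an arbitrary 100/50 split (an assumption for illustration), shows that predict() on a prcomp object applies the training centering and scaling and then the training rotation, which is why the test set is never used to refit the PCA.

```r
set.seed(1)

# Arbitrary split of a built-in data set into "training" and "test" rows
idx <- sample(nrow(iris), 100)
train_x <- iris[idx, 1:4]
test_x <- iris[-idx, 1:4]

# PCA fitted on the training rows only
pc <- prcomp(train_x, scale. = TRUE)

# Projection of the test rows using the fitted object
proj <- predict(pc, newdata = test_x)

# The same projection done by hand: training center/scale, then training rotation
manual <- scale(test_x, center = pc$center, scale = pc$scale) %*% pc$rotation

all.equal(proj, manual, check.attributes = FALSE)
```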