Introduction

The use of personal devices such as Fitbit to monitor personal activity is becoming more popular. As part of the current project, accelerometer readings from 6 research study participants are provided. The data were recorded by accelerometers on the belt, forearm, arm, and dumbbell. The training data consist of accelerometer data and a label identifying the quality of the activity the participant was performing; the testing data consist of accelerometer data without the identifying label. The main goal is to predict the identifying label for the test data based on the training data.

Detailed approach to achieve the target:

Initial Analysis

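Loading the required libraries, reading the data, splitting the training data 70/30, and removing near-zero-variance variables, mostly-NA variables, and the first five columns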

install.packages("caret", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Reza/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'caret' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Reza\AppData\Local\Temp\RtmpIdLwjq\downloaded_packages
install.packages("e1071", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Reza/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'e1071' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Reza\AppData\Local\Temp\RtmpIdLwjq\downloaded_packages
library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
ptrain <- read.csv("pml-training.csv")
ptest <- read.csv("pml-testing.csv")
set.seed(10)
inTrain <- createDataPartition(y=ptrain$classe, p=0.7, list=F)
ptrain1 <- ptrain[inTrain, ]
ptrain2 <- ptrain[-inTrain, ]
nzv <- nearZeroVar(ptrain1)
ptrain1 <- ptrain1[, -nzv]
ptrain2 <- ptrain2[, -nzv]
mostlyNA <- sapply(ptrain1, function(x) mean(is.na(x))) > 0.95
ptrain1 <- ptrain1[, mostlyNA==F]
ptrain2 <- ptrain2[, mostlyNA==F]
ptrain1 <- ptrain1[, -(1:5)]
ptrain2 <- ptrain2[, -(1:5)]
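
The mostly-NA filter above can be illustrated on a small example. This is a minimal sketch in base R, assuming a toy data frame with made-up values; the names `toy`, `na_fraction`, `mostly_na`, and `toy_clean` are hypothetical and do not come from the study data.

```r
# Toy data frame (hypothetical values, not from the study)
toy <- data.frame(
  roll_belt  = c(1.4, 1.5, 1.3, 1.6),
  max_roll   = c(NA_real_, NA_real_, NA_real_, NA_real_),  # entirely missing
  pitch_belt = c(8.1, NA, 8.0, 8.2),                       # 25% missing
  classe     = factor(c("A", "A", "B", "B"))
)

# Fraction of missing values in each column
na_fraction <- sapply(toy, function(x) mean(is.na(x)))

# Flag columns where more than 95% of the values are NA
mostly_na <- na_fraction > 0.95

# Keep the remaining columns; drop = FALSE keeps the result a data frame
toy_clean <- toy[, !mostly_na, drop = FALSE]
print(names(toy_clean))
```

Only `max_roll` is dropped here, because `pitch_belt` is missing in just a quarter of its rows. Note that in the analysis the column selection is computed on `ptrain1` alone and then applied unchanged to `ptrain2`, so the held-out subset does not influence which variables are kept.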

Modeling & Evaluation

The model is a Random Forest fitted to the training subset (ptrain1) with 3-fold cross-validation and then evaluated on the held-out subset (ptrain2)

fitControl <- trainControl(method="cv", number=3, verboseIter=F)
fit <- train(classe ~ ., data=ptrain1, method="rf", trControl=fitControl)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
fit$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.26%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    0    0    0    1 0.0002560164
## B    7 2647    3    1    0 0.0041384500
## C    0    7 2389    0    0 0.0029215359
## D    0    0    9 2242    1 0.0044404973
## E    0    0    0    7 2518 0.0027722772
preds <- predict(fit, newdata=ptrain2)
confusionMatrix(ptrain2$classe, preds)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    3 1134    1    1    0
##          C    0    3 1023    0    0
##          D    0    0    2  962    0
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9981          
##                  95% CI : (0.9967, 0.9991)
##     No Information Rate : 0.285           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9976          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9974   0.9971   0.9979   1.0000
## Specificity            1.0000   0.9989   0.9994   0.9996   0.9998
## Pos Pred Value         1.0000   0.9956   0.9971   0.9979   0.9991
## Neg Pred Value         0.9993   0.9994   0.9994   0.9996   1.0000
## Prevalence             0.2850   0.1932   0.1743   0.1638   0.1837
## Detection Rate         0.2845   0.1927   0.1738   0.1635   0.1837
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9991   0.9982   0.9982   0.9988   0.9999
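
The reported accuracy can be reproduced by hand from the confusion matrix printed above. The following is a small sketch in base R; the matrix values are copied from that output, while the names `cm`, `n_correct`, and `n_total` are introduced here only for illustration.

```r
# Counts copied from the confusion matrix printed above
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(1674,    0,    0,   0,    0,
       3, 1134,    1,   1,    0,
       0,    3, 1023,   0,    0,
       0,    0,    2, 962,    0,
       0,    0,    0,   1, 1081),
  nrow = 5, byrow = TRUE,
  dimnames = list(classes, classes)
)

n_total   <- sum(cm)        # all held-out observations
n_correct <- sum(diag(cm))  # observations on the diagonal agree
accuracy  <- n_correct / n_total
error     <- 1 - accuracy   # estimated out-of-sample error

cat(sprintf("accuracy = %.4f, error = %.2f%%\n", accuracy, 100 * error))
```

This gives 5874 correct out of 5885, which matches the reported accuracy of 0.9981 and corresponds to an estimated out-of-sample error of about 0.19%. One caveat: caret's `confusionMatrix(data, reference)` expects the predictions first, whereas the call above passes the true labels first, so the rows labelled "Prediction" in the printed table are actually the true classes. Accuracy depends only on the diagonal and is unaffected, but the per-class sensitivity and positive predictive value are swapped as a result.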
# Retrain the model on the full training set, applying the same column filtering to the test set
nzv <- nearZeroVar(ptrain)
ptrain <- ptrain[, -nzv]
ptest <- ptest[, -nzv]
mostlyNA <- sapply(ptrain, function(x) mean(is.na(x))) > 0.95
ptrain <- ptrain[, mostlyNA==F]
ptest <- ptest[, mostlyNA==F]
ptrain <- ptrain[, -(1:5)]
ptest <- ptest[, -(1:5)]
fitControl <- trainControl(method="cv", number=3, verboseIter=F)
fit <- train(classe ~ ., data=ptrain, method="rf", trControl=fitControl)
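
Because the filtering above removes columns by position (`-nzv`, `-(1:5)`), it relies on the training and test files sharing the same column order. A quick check that every predictor survives in both sets can be sketched as follows. This is an illustration on toy data: the helper `missing_predictors` is hypothetical, and the `problem_id` column in the toy test set is an assumption about how the test file is laid out.

```r
# Hypothetical helper: predictors used in training that the test set lacks
missing_predictors <- function(train, test, outcome = "classe") {
  predictors <- setdiff(names(train), outcome)
  setdiff(predictors, names(test))
}

# Toy stand-ins for the filtered ptrain and ptest (made-up values)
toy_train <- data.frame(
  roll_belt = c(1.4, 1.5),
  yaw_belt  = c(-94, -93),
  classe    = factor(c("A", "B"))
)
toy_test <- data.frame(
  roll_belt  = 1.6,
  yaw_belt   = -92,
  problem_id = 1L  # assumed to take the place of the outcome column
)

# An empty result means predict() will find every predictor it needs
print(missing_predictors(toy_train, toy_test))
```

Running the same check on the real `ptrain` and `ptest` before calling `predict` would surface a column mismatch early instead of at prediction time.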

Prediction of Identifying Labels Using the Trained Model on the Test Data

preds <- predict(fit, newdata=ptest)
preds <- as.character(preds)
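# Write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, ...
pml_write_files <- function(x) {
    # seq_along() is safe even if x is empty, unlike 1:length(x)
    for(i in seq_along(x)) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}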
pml_write_files(preds)
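
To confirm that the files hold what was intended, they can be read back and compared with the predictions. The sketch below uses a hypothetical variant, `write_prediction_files`, that takes an output directory so the example can write to a temporary folder; the toy predictions are made up and are not the model's output.

```r
# Hypothetical variant of pml_write_files that writes into a given directory
write_prediction_files <- function(x, dir) {
  for (i in seq_along(x)) {
    filename <- file.path(dir, paste0("problem_id_", i, ".txt"))
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}

# Write toy predictions to a temporary directory
out_dir <- file.path(tempdir(), "pml_predictions")
dir.create(out_dir, showWarnings = FALSE)
toy_preds <- c("B", "A", "E")
write_prediction_files(toy_preds, out_dir)

# Read each file back and compare with the original vector
read_back <- vapply(
  seq_along(toy_preds),
  function(i) readLines(file.path(out_dir, paste0("problem_id_", i, ".txt"))),
  character(1)
)
stopifnot(identical(read_back, toy_preds))
```

Each file contains a single line with one class label, with no quotes or header, which is what the `quote`, `row.names`, and `col.names` arguments ensure.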