Introduction

Personal devices such as Fitbit are increasingly popular for monitoring personal activity. For this project, accelerometer readings from 6 research study participants are provided. The data were recorded by accelerometers at several positions: the belt, forearm, arm, and dumbbell. The training data consist of accelerometer data and a label identifying the quality of the activity the participant was performing; the testing data consist of accelerometer data without that label. The main goal is to predict the label for the test data based on the training data.

Detailed approach to achieve this goal:

Initial Analysis

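Load the required libraries, read the data, split the training data 70/30 into a fitting subset and a validation subset, and remove near-zero-variance (NZV) variables, variables that are more than 95% NA, and the first five remaining columns.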

install.packages("caret", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/Reza/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)

## package 'caret' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Reza\AppData\Local\Temp\RtmpIdLwjq\downloaded_packages

install.packages("e1071", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/Reza/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)

## package 'e1071' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Reza\AppData\Local\Temp\RtmpIdLwjq\downloaded_packages

library(e1071)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

# Read the training and test data
ptrain <- read.csv("pml-training.csv")
ptest <- read.csv("pml-testing.csv")
set.seed(10)
# Hold out 30% of the training data (ptrain2) for validation
inTrain <- createDataPartition(y=ptrain$classe, p=0.7, list=FALSE)
ptrain1 <- ptrain[inTrain, ]
ptrain2 <- ptrain[-inTrain, ]
# Remove near-zero-variance predictors
nzv <- nearZeroVar(ptrain1)
ptrain1 <- ptrain1[, -nzv]
ptrain2 <- ptrain2[, -nzv]
# Remove columns that are more than 95% NA
mostlyNA <- sapply(ptrain1, function(x) mean(is.na(x))) > 0.95
ptrain1 <- ptrain1[, !mostlyNA]
ptrain2 <- ptrain2[, !mostlyNA]
# Drop the first five remaining columns
ptrain1 <- ptrain1[, -(1:5)]
ptrain2 <- ptrain2[, -(1:5)]
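
The column filters above can be illustrated on a small example. The following is a minimal sketch in base R, assuming a made-up data frame `toy` (its column names and values are hypothetical, not taken from the study data). The single-distinct-value check is a simplified stand-in for caret's nearZeroVar(), not a reimplementation of it.

```r
# Toy data frame (hypothetical values, not from the study)
toy <- data.frame(
  mostly_missing = c(NA, NA, NA, 1.5),        # 75% NA
  constant       = c(1, 1, 1, 1),             # zero variance
  signal         = c(0.2, 1.7, -0.4, 3.1),
  classe         = factor(c("A", "B", "A", "C"))
)

# Flag columns whose share of NA values exceeds a threshold
# (0.5 here so the toy column qualifies; the report uses 0.95)
mostly_na <- sapply(toy, function(x) mean(is.na(x))) > 0.5

# Flag columns with at most one distinct non-missing value;
# a simplified stand-in for caret::nearZeroVar()
zero_var <- sapply(toy, function(x) length(unique(x[!is.na(x)])) <= 1)

# Logical indexing stays correct even when no column is flagged,
# whereas toy[, -integer(0)] would drop every column
kept <- toy[, !(mostly_na | zero_var), drop = FALSE]
print(names(kept))
```

Selecting with a logical vector rather than a negative index is a design choice worth considering: if no column happens to be flagged, the negative-index form silently returns an empty data frame.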

Modeling & Evaluation

A random forest model is fitted to the training subset (ptrain1) using 3-fold cross-validation, and is then evaluated on the held-out subset (ptrain2).

fitControl <- trainControl(method="cv", number=3, verboseIter=FALSE)
fit <- train(classe ~ ., data=ptrain1, method="rf", trControl=fitControl)

## Loading required package: randomForest

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin
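
To make the idea of 3-fold cross-validation concrete, here is a minimal base R sketch of how rows can be divided into folds. The row count `n` is hypothetical, and the plain random assignment is a simplification: caret builds its own folds internally, so this is not a description of what train() does step by step.

```r
set.seed(10)
n <- 12   # hypothetical number of rows
k <- 3    # number of folds, matching number=3 in trainControl()

# Randomly assign each row to one of k equally sized folds
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)   # rows used for assessment
  fit_rows <- which(fold_id != fold)   # rows used for fitting
  cat("fold", fold, ": fit on", length(fit_rows),
      "rows, assess on", length(held_out), "rows\n")
}
```

Each row is held out exactly once, so every observation contributes to both fitting and assessment across the three rounds.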

fit$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.26%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    0    0    0    1 0.0002560164
## B    7 2647    3    1    0 0.0041384500
## C    0    7 2389    0    0 0.0029215359
## D    0    0    9 2242    1 0.0044404973
## E    0    0    0    7 2518 0.0027722772
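
As a check on how the figures above relate to each other, the sketch below recomputes the OOB error rate and the class.error column from the printed confusion matrix, using base R only. The counts are copied from the output above; the variable names are chosen for illustration.

```r
classes <- c("A", "B", "C", "D", "E")

# OOB confusion matrix copied from the output above (one row per actual class)
oob <- matrix(c(3905,    0,    0,    0,    1,
                   7, 2647,    3,    1,    0,
                   0,    7, 2389,    0,    0,
                   0,    0,    9, 2242,    1,
                   0,    0,    0,    7, 2518),
              nrow = 5, byrow = TRUE, dimnames = list(classes, classes))

# Per-class error: share of each row that falls off the diagonal
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: share of all cases that fall off the diagonal
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 10))
print(round(100 * oob_error, 2))   # percentage, 0.26 as reported above
```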

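preds <- predict(fit, newdata=ptrain2)
# confusionMatrix() expects (data = predictions, reference = truth); here the order is swapped,
# so the rows labelled "Prediction" below are the actual classes and the columns are the predicted ones
confusionMatrix(ptrain2$classe, preds)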

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    3 1134    1    1    0
##          C    0    3 1023    0    0
##          D    0    0    2  962    0
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9981          
##                  95% CI : (0.9967, 0.9991)
##     No Information Rate : 0.285           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9976          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9974   0.9971   0.9979   1.0000
## Specificity            1.0000   0.9989   0.9994   0.9996   0.9998
## Pos Pred Value         1.0000   0.9956   0.9971   0.9979   0.9991
## Neg Pred Value         0.9993   0.9994   0.9994   0.9996   1.0000
## Prevalence             0.2850   0.1932   0.1743   0.1638   0.1837
## Detection Rate         0.2845   0.1927   0.1738   0.1635   0.1837
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9991   0.9982   0.9982   0.9988   0.9999
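
The overall accuracy and Kappa reported above can be reproduced from the confusion matrix itself. The following base R sketch copies the counts from the output above and applies the standard definitions (accuracy as the diagonal share, Kappa as agreement corrected for chance); the variable names are illustrative.

```r
classes <- c("A", "B", "C", "D", "E")

# Validation confusion matrix copied from the output above
cm <- matrix(c(1674,    0,    0,   0,    0,
                  3, 1134,    1,   1,    0,
                  0,    3, 1023,   0,    0,
                  0,    0,    2, 962,    0,
                  0,    0,    0,   1, 1081),
             nrow = 5, byrow = TRUE, dimnames = list(classes, classes))

n <- sum(cm)

# Observed agreement: share of cases on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: agreement beyond chance, rescaled to at most 1
kappa_stat <- (accuracy - expected) / (1 - expected)

print(round(accuracy, 4))       # 0.9981
print(round(kappa_stat, 4))     # 0.9976
print(round(1 - accuracy, 4))   # estimated out-of-sample error, 0.0019
```

Because ptrain2 was not used to fit the model, 1 minus the accuracy gives an estimate of the out-of-sample error of about 0.19%.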

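# Re-train the model on the full training set before predicting on the test set
# Remove near-zero-variance predictors, using the training data to choose the columns
nzv <- nearZeroVar(ptrain)
ptrain <- ptrain[, -nzv]
ptest <- ptest[, -nzv]
# Remove columns that are more than 95% NA in the training data
mostlyNA <- sapply(ptrain, function(x) mean(is.na(x))) > 0.95
ptrain <- ptrain[, !mostlyNA]
ptest <- ptest[, !mostlyNA]
# Drop the first five remaining columns
ptrain <- ptrain[, -(1:5)]
ptest <- ptest[, -(1:5)]
# Fit a random forest with 3-fold cross-validation
fitControl <- trainControl(method="cv", number=3, verboseIter=FALSE)
fit <- train(classe ~ ., data=ptrain, method="rf", trControl=fitControl)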
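
The code above applies the same column positions to both data sets, which relies on the two files sharing the same column order. A possible alternative is to select columns by name. This is a minimal base R sketch with hypothetical miniature data frames (`train_df`, `test_df`, and their columns are invented for illustration); it is not the approach used in this report.

```r
# Hypothetical miniature train/test frames
train_df <- data.frame(id = 1:3, x1 = c(0.1, 0.4, 0.9),
                       junk = c(NA, NA, NA),
                       classe = factor(c("A", "B", "A")))
test_df  <- data.frame(id = 4:5, x1 = c(0.3, 0.8),
                       junk = c(NA, NA), problem_id = 1:2)

# Decide which columns to drop using the training data only
all_na    <- names(train_df)[sapply(train_df, function(x) all(is.na(x)))]
drop_cols <- c("id", all_na)

# Predictors are whatever remains, excluding the outcome
predictors <- setdiff(names(train_df), c(drop_cols, "classe"))

# Apply the same named selection to both data sets
train_small <- train_df[, c(predictors, "classe"), drop = FALSE]
test_small  <- test_df[, predictors, drop = FALSE]
print(names(train_small))
print(names(test_small))
```

Selecting by name keeps the training and test predictors aligned even if a column is missing from, or ordered differently in, one of the files.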

Prediction of Identifying Labels Using the Trained Model on the Test Data

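# Predict the labels for the test set and store them as character strings
preds <- predict(fit, newdata=ptest)
preds <- as.character(preds)
# Write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, and so on
pml_write_files <- function(x) {
    for(i in seq_along(x)) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}
pml_write_files(preds)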
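
To confirm that files written this way contain exactly one label each, the round trip can be tried in a temporary directory. The sketch below uses base R only; the `labels` vector is hypothetical and stands in for real predictions.

```r
labels  <- c("B", "A", "C")          # hypothetical predictions
out_dir <- tempfile("predictions_")  # temporary directory for this sketch
dir.create(out_dir)

# Write one file per prediction, mirroring the naming used above
for (i in seq_along(labels)) {
  path <- file.path(out_dir, paste0("problem_id_", i, ".txt"))
  write.table(labels[i], file = path, quote = FALSE,
              row.names = FALSE, col.names = FALSE)
}

# Read each file back and collect the labels in order
read_back <- vapply(seq_along(labels), function(i) {
  readLines(file.path(out_dir, paste0("problem_id_", i, ".txt")))
}, character(1))

print(identical(read_back, labels))
```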

Human Activity Recognition Project

Reza Rahimi

June 11, 2017
