Executive summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Based on these data we build a prediction model that predicts the manner in which the exercise was done (the classe variable) from the measured variables.

Data and environment preparation

In this section we perform the following tasks:

Load the required libraries
Load the training and test data sets
Remove variables that should not influence the prediction, such as timestamp,…
Remove variables that contain NA values

library(caret)
library(ggplot2)
library(randomForest)
library(corrplot)
library(rpart)
set.seed(1000)

mainDir <- getwd()
subDir <- "outputDirectory"
if (!file.exists(subDir)){
        dir.create(file.path(mainDir, subDir))
}

if(!file.exists("outputDirectory/training.csv")){
        fileURL_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
        download.file(fileURL_train, destfile = "outputDirectory/training.csv")
        
}

if(!file.exists("outputDirectory/test.csv")){
        fileURL_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
        download.file(fileURL_test, destfile = "outputDirectory/test.csv")
        
}

pmltr <- read.csv("outputDirectory/training.csv", stringsAsFactors = FALSE)
pmlts <- read.csv("outputDirectory/test.csv", stringsAsFactors = FALSE)

# drop column 1 (row index) and columns 3 to 7 (timestamp and window columns), which should not inform the prediction
pmltr <- pmltr[, -c(1, 3:7)]
pmltr$user_name <- as.factor(pmltr$user_name)
pmltr$classe <- as.factor(pmltr$classe)

# convert character columns to numeric; entries that cannot be parsed become NA
for (i in seq_len(ncol(pmltr))) {
        if (is.character(pmltr[, i])) {
                pmltr[, i] <- as.numeric(pmltr[, i])
        }
}
# keep only the variables that contain no NA values
num_na <- colSums(is.na(pmltr))
pmltr_new <- pmltr[, num_na == 0]
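The cleaning step above can be illustrated on a small made-up data frame. This is only a sketch in base R: the column names and values below are invented for illustration and are not taken from the real training file.

```r
# Toy stand-in for the raw data (hypothetical columns and values)
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.48),
  kurtosis_roll = c("", "#DIV/0!", "0.5"),  # text column with unparseable entries
  max_roll      = c(NA, NA, 3.2),           # column that is mostly missing
  stringsAsFactors = FALSE
)

# Coerce character columns to numeric; unparseable entries become NA
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(col) suppressWarnings(as.numeric(col)))

# Count NA values per column and keep only the complete columns
na_count  <- colSums(is.na(toy))
toy_clean <- toy[, na_count == 0, drop = FALSE]

print(na_count)
print(names(toy_clean))
```

Only the fully observed column survives. Note that this rule drops a variable as soon as it has a single NA, which is a strict choice.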

Exploratory analysis

In this section we do some exploratory analysis of the training data: we count the observations by the classe and user_name variables, and plot roll_belt for each user, colored by classe.

table(pmltr_new$classe)

## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607

table(pmltr_new$user_name)

## 
##   adelmo carlitos  charles   eurico   jeremy    pedro 
##     3892     3112     3536     3070     3402     2610

# plot roll_belt for each user, colored by classe
qplot(user_name, roll_belt, colour=classe, data=pmltr_new, main = "Activity level of each user on belt", xlab = "user name", ylab = "activity level on belt")

Partition training data

In this section we partition the training data set into two parts: 75% of the rows are used for training, and the remaining 25% are held out for testing.

inTrain <- createDataPartition(y = pmltr_new$classe, p = 0.75, list = FALSE)
tempTrain <- pmltr_new[inTrain,]
tempTest <- pmltr_new[-inTrain,]
table(tempTrain$classe)

## 
##    A    B    C    D    E 
## 4185 2848 2567 2412 2706
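As a quick sanity check, the class counts printed in this report can be used to confirm that the split keeps roughly 75% of each class in the training part. The sketch below uses base R only and simply re-enters the counts from the two tables shown above; the variable names are illustrative.

```r
# Class counts copied from the tables printed in this report
full_counts  <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)
train_counts <- c(A = 4185, B = 2848, C = 2567, D = 2412, E = 2706)

# Share of each class that ended up in the training part
train_share <- train_counts / full_counts
print(round(train_share, 3))

# Rows left over for the held-out part, per class and in total
held_out <- full_counts - train_counts
print(held_out)
print(sum(held_out))
```

Each class keeps close to the requested proportion, which is what a split stratified on classe is expected to produce.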

Method1: Classification Trees

In this section we build a classification tree model on the training data with two-fold cross-validation. The result is shown as a confusion matrix, and a plot shows the importance of the variables used in prediction.

ct_model <- train(classe ~., data = tempTrain, method = "rpart", trControl=trainControl(method="cv",number=2))
confusionMatrix(predict(ct_model, newdata = tempTest), tempTest$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 837 144  21  37  11
##          B 252 586 161 347 252
##          C 283 219 673 420 222
##          D   0   0   0   0   0
##          E  23   0   0   0 416
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5122          
##                  95% CI : (0.4981, 0.5263)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3865          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.6000   0.6175   0.7871   0.0000  0.46171
## Specificity            0.9393   0.7441   0.7175   1.0000  0.99425
## Pos Pred Value         0.7971   0.3667   0.3704      NaN  0.94761
## Neg Pred Value         0.8552   0.8902   0.9410   0.8361  0.89138
## Prevalence             0.2845   0.1935   0.1743   0.1639  0.18373
## Detection Rate         0.1707   0.1195   0.1372   0.0000  0.08483
## Detection Prevalence   0.2141   0.3259   0.3705   0.0000  0.08952
## Balanced Accuracy      0.7696   0.6808   0.7523   0.5000  0.72798

print(plot(varImp(ct_model, scale = FALSE)))

predict(ct_model, newdata = pmlts)

##  [1] B C C C B C C C A A C C B A C B B C B B
## Levels: A B C D E

This model has the lowest accuracy of the three models: about 51% on the held-out data (95% confidence interval 0.4981 to 0.5263). It never predicts class D, so the sensitivity for that class is 0.
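To make the reported statistics concrete, the accuracy, per-class sensitivity, and kappa can be recomputed by hand from the classification tree's confusion matrix. The following is a base R sketch that re-enters the printed counts; it is not how caret computes them internally, and the variable names are illustrative.

```r
# Confusion matrix of the classification tree (rows = prediction, columns = reference)
cm <- matrix(
  c(837, 144,  21,  37,  11,
    252, 586, 161, 347, 252,
    283, 219, 673, 420, 222,
      0,   0,   0,   0,   0,
     23,   0,   0,   0, 416),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over true members of the class
sensitivity <- diag(cm) / colSums(cm)

# Cohen's kappa: accuracy corrected for the agreement expected by chance
expected  <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa_hat <- (accuracy - expected) / (1 - expected)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(kappa_hat, 4))
```

The row for class D is all zeros, which is why its sensitivity is 0 and its positive predictive value is reported as NaN.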

Method2: Gradient Boosting Model

In this section we build a gradient boosting model on the training data with two-fold cross-validation. The result is shown as a confusion matrix, and a plot shows the importance of the variables used in prediction.

gbm_model <- train(classe ~., data = tempTrain, method = "gbm", verbose = FALSE, trControl=trainControl(method="cv",number=2))

## Loading required package: gbm

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
## 
##     cluster

## Loading required package: splines

## Loading required package: parallel

## Loaded gbm 2.1.3

## Loading required package: plyr

confusionMatrix(predict(gbm_model, newdata = tempTest), tempTest$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1374   26    0    0    2
##          B   16  892   21    2    6
##          C    1   31  821   21   10
##          D    3    0   13  778    9
##          E    1    0    0    3  874
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9664          
##                  95% CI : (0.9609, 0.9712)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9574          
##  Mcnemar's Test P-Value : 0.0004813       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9849   0.9399   0.9602   0.9677   0.9700
## Specificity            0.9920   0.9886   0.9844   0.9939   0.9990
## Pos Pred Value         0.9800   0.9520   0.9287   0.9689   0.9954
## Neg Pred Value         0.9940   0.9856   0.9915   0.9937   0.9933
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2802   0.1819   0.1674   0.1586   0.1782
## Detection Prevalence   0.2859   0.1911   0.1803   0.1637   0.1790
## Balanced Accuracy      0.9885   0.9643   0.9723   0.9808   0.9845

print(plot(varImp(gbm_model, scale = FALSE)))

predict(gbm_model, newdata = pmlts)

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

This model performs well: its accuracy on the held-out data is about 96.6% (95% confidence interval 0.9609 to 0.9712).

Method3: Random Forest Model

In this section we build a random forest model on the training data with two-fold cross-validation. The result is shown as a confusion matrix, and a plot shows the importance of the variables used in prediction.

rf_model <- train(classe ~., data = tempTrain, method = "rf", trControl=trainControl(method="cv",number=2))
confusionMatrix(predict(rf_model, newdata = tempTest), tempTest$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1392    4    0    0    0
##          B    3  944    6    0    0
##          C    0    1  839    6    0
##          D    0    0   10  795    6
##          E    0    0    0    3  895
## 
## Overall Statistics
##                                           
##                Accuracy : 0.992           
##                  95% CI : (0.9891, 0.9943)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9899          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9978   0.9947   0.9813   0.9888   0.9933
## Specificity            0.9989   0.9977   0.9983   0.9961   0.9993
## Pos Pred Value         0.9971   0.9906   0.9917   0.9803   0.9967
## Neg Pred Value         0.9991   0.9987   0.9961   0.9978   0.9985
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2838   0.1925   0.1711   0.1621   0.1825
## Detection Prevalence   0.2847   0.1943   0.1725   0.1654   0.1831
## Balanced Accuracy      0.9984   0.9962   0.9898   0.9925   0.9963

print(plot(varImp(rf_model, scale = FALSE)))

predict(rf_model, newdata = pmlts)

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

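This model has the best accuracy of the three: about 99.2% on the held-out data (95% confidence interval 0.9891 to 0.9943). Its predictions for the 20 test cases are identical to those of the gradient boosting model.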
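The three models can be compared side by side using only numbers already printed in this report. The sketch below, in base R, re-enters the 20 test-set predictions of each model and the held-out accuracies, then computes how often the models agree and an estimated out-of-sample error as 1 minus accuracy. The variable names are illustrative.

```r
# Predictions for the 20 test cases, copied from the outputs above
ct_pred  <- strsplit("B C C C B C C C A A C C B A C B B C B B", " ")[[1]]
gbm_pred <- strsplit("B A B A A E D B A A B C B A E E A B B B", " ")[[1]]
rf_pred  <- strsplit("B A B A A E D B A A B C B A E E A B B B", " ")[[1]]

# Fraction of test cases on which two models give the same class
agreement <- c(
  ct_vs_rf  = mean(ct_pred == rf_pred),
  gbm_vs_rf = mean(gbm_pred == rf_pred)
)
print(agreement)

# Held-out accuracies reported above, and the implied error estimates
accuracy  <- c(rpart = 0.5122, gbm = 0.9664, rf = 0.992)
oos_error <- 1 - accuracy
print(round(oos_error, 4))
```

Agreement between two models does not by itself show that either is correct, since the true classes of the 20 test cases are not part of this report.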

PML-Project: Classification based on a set of variables to predict the manner in which the exercise is done

Mahmood Karimi

27 Aug 2017