Machine Learning Project

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. Human activity recognition research has traditionally focused on discriminating between different activities, i.e. predicting “which” activity was performed at a specific point in time (as with the Daily Living Activities dataset). This work instead concerns the quality of execution: how to specify correct execution, how to detect execution mistakes automatically and robustly, and how to provide feedback on the quality of execution to the user.

Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:

• Class A: exactly according to the specification
• Class B: throwing the elbows to the front
• Class C: lifting the dumbbell only halfway
• Class D: lowering the dumbbell only halfway
• Class E: throwing the hips to the front

Goal

The goal of this project is to predict the manner in which the participants did the exercise. This report also describes how we built our model and how we used cross validation.

Libraries

The R libraries used for this analysis include:

library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(gridExtra)
library(dplyr)
library(lattice)
library(ggplot2)
library(cluster)

Data Loading

The data for this project originated from the following source: http://groupware.les.inf.puc-rio.br/har.

Initial loading and reading of the data is as follows:

URLtraining <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URLtesting <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url = URLtraining, destfile = "training.csv",method = "curl")
download.file(url = URLtesting, destfile = "testing.csv",method = "curl")
Training <- read.csv("training.csv",header = TRUE,  stringsAsFactors=FALSE, na.strings=c("NA","#DIV/0!",""))
Testing <- read.csv("testing.csv",header = TRUE,  stringsAsFactors=FALSE, na.strings=c("NA","#DIV/0!",""))
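The effect of the na.strings argument can be seen on a small inline table. This is only a sketch: the CSV text and its values are invented for illustration and are not rows from the project files. It shows that spreadsheet error strings, empty cells, and literal NA are all read as missing, so the column stays numeric.

```r
# Hypothetical four-row CSV; the values are invented for illustration only.
csv_text <- "roll_belt,kurtosis_roll_belt,classe
1.41,#DIV/0!,A
1.42,,A
1.48,NA,B
1.45,-1.2,B"

demo <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE,
                 na.strings = c("NA", "#DIV/0!", ""))

# The three marker strings become NA; the one real value is kept as a number.
print(demo$kurtosis_roll_belt)
print(colSums(is.na(demo)))
```

Without the extra entries in na.strings, a value such as #DIV/0! would force the whole column to be read as text.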

Data Pre-processing

Next, we reduce the data: columns that contain missing values (NA) are dropped, the first seven columns (row index, user name, timestamps, and window markers rather than sensor readings) are removed, and the remaining predictors are checked for near-zero variance.

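# Keep only the columns that contain no missing values
Training <- Training[, colSums(is.na(Training)) == 0]
Testing <- Testing[, colSums(is.na(Testing)) == 0]
# Drop the first seven columns (row index, user name, timestamps, window markers)
Training <- Training[, -c(1:7)]
Testing <- Testing[, -c(1:7)]
# Near-zero-variance metrics for each data set, kept under separate names
nzvTraining <- nearZeroVar(Training, saveMetrics = TRUE)
nzvTesting <- nearZeroVar(Testing, saveMetrics = TRUE)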
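To make the two filtering ideas concrete, here is a base R sketch on a toy data frame. The data, the column names, and the helper nzv_metrics are hypothetical. The helper is a simplified stand-in for the kind of metrics nearZeroVar reports, not caret's implementation, and the cut-offs 95/5 and 10 are assumed to mirror caret's defaults.

```r
# Toy data frame; column names and values are invented for illustration.
toy <- data.frame(
  varied   = seq_len(100) / 10,
  constant = rep(1, 100),
  skewed   = c(rep(0, 98), 1, 1),
  sparse   = c(2.5, rep(NA, 99))
)

# Step 1: keep only columns with no missing values (drops `sparse`).
complete <- toy[, colSums(is.na(toy)) == 0, drop = FALSE]

# Step 2: simplified near-zero-variance metrics for a single column.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  zero_var <- length(counts) <= 1
  # Ratio of the most common value's count to the second most common one.
  freq_ratio <- if (zero_var) 0 else counts[[1]] / counts[[2]]
  # Distinct values as a percentage of the number of rows.
  pct_unique <- 100 * length(counts) / length(x)
  data.frame(freqRatio = freq_ratio, percentUnique = pct_unique,
             zeroVar = zero_var,
             nzv = zero_var || (freq_ratio > freq_cut && pct_unique < unique_cut))
}

metrics <- do.call(rbind, lapply(complete, nzv_metrics))
print(metrics)
```

In this toy case the constant and heavily skewed columns are flagged, while the column with 100 distinct values is not.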

Slicing the data:

We split the “Training” data into a new training data set (70%) and a validation “testing” data set (30%). We will use this split for cross validation in later steps.

set.seed(42)
inTrain <- createDataPartition(y=Training$classe,  p = 0.7, list = FALSE)
training <- Training[inTrain,]
testing <- Training[-inTrain,]
dim(training)

## [1] 13737    53

dim(testing)

## [1] 5885   53
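The idea behind a class-stratified 70/30 split can be sketched in base R. This is not createDataPartition's exact algorithm; it only illustrates sampling inside each class so that both subsets keep roughly the same class proportions. The outcome vector below is invented for illustration.

```r
# Hypothetical outcome vector with unequal class sizes (invented data).
set.seed(42)
classe <- rep(c("A", "B", "C", "D", "E"), times = c(40, 30, 20, 20, 30))

# Sample 70% of the row indices inside each class, then pool them.
rows_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE))

train_y <- classe[in_train]   # 70% subset
valid_y <- classe[-in_train]  # remaining 30%

print(table(train_y))
print(table(valid_y))
```

Because the sampling happens per class, a rare class cannot end up entirely in one of the two subsets by chance.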

Locating Relevant Features

This plot lets us see how the exercise classes are distributed across the training data set.

qplot(classe, colour=classe, data=training, geom="density")

Data Modeling: rpart model

We used an rpart model to build a classification tree for activity recognition because it selects important variables automatically and is generally robust to correlated covariates and outliers.

set.seed(42)
treeFit <- rpart(classe ~ ., method = "class", data = training)

treePredict <- predict(treeFit, training, type = "class")
# classe was read as character, so convert it to a factor for confusionMatrix
confusionmatrix_rp <- confusionMatrix(treePredict, as.factor(training$classe))
confusionmatrix_rp

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3488  393   32   98   34
##          B   99 1472  210  173  184
##          C  115  362 1953  365  323
##          D  131  192  138 1406  136
##          E   73  239   63  210 1848
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7401          
##                  95% CI : (0.7327, 0.7474)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6711          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8930   0.5538   0.8151   0.6243   0.7319
## Specificity            0.9433   0.9399   0.8973   0.9480   0.9478
## Pos Pred Value         0.8623   0.6885   0.6264   0.7019   0.7596
## Neg Pred Value         0.9569   0.8977   0.9583   0.9279   0.9401
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2539   0.1072   0.1422   0.1024   0.1345
## Detection Prevalence   0.2945   0.1556   0.2270   0.1458   0.1771
## Balanced Accuracy      0.9182   0.7468   0.8562   0.7862   0.8399
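The headline figures in this output can be reproduced by hand from the table of counts. The sketch below copies the counts from the rpart confusion matrix above and recomputes accuracy, per-class sensitivity, and Kappa with base R. The variable names are my own.

```r
# Counts copied from the rpart confusion matrix
# (rows = prediction, columns = reference).
cm <- matrix(c(3488,  393,   32,   98,   34,
                 99, 1472,  210,  173,  184,
                115,  362, 1953,  365,  323,
                131,  192,  138, 1406,  136,
                 73,  239,   63,  210, 1848),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5]))

n <- sum(cm)

# Accuracy: correctly classified rows (the diagonal) over all rows.
accuracy <- sum(diag(cm)) / n

# Sensitivity: for each true class, the share that was predicted correctly.
sensitivity <- diag(cm) / colSums(cm)

# Kappa: accuracy corrected for the agreement expected by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(kappa_hat, 4))
```

The rounded values match the Accuracy, Sensitivity, and Kappa lines reported above.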

Prediction with Decision Trees

fancyRpartPlot(treeFit, main = "Decision Tree", 
               sub = "Rpart Decision Tree To Predict Classe", cex=0.3, cex.main = 2)

Data Modeling: Random Forest Model

Because the rpart model was not very accurate (about 74% accuracy above) and the outcome variable appears to have finer gradations than a single tree captures, a random forest model was tested to see whether it suits this project better.

set.seed(40) 

inTrain <- createDataPartition(Training$classe, p=0.70, list=FALSE)
trainData <- Training[inTrain, ]
testData <- Training[-inTrain, ]
fit <- randomForest(as.factor(classe) ~ . , data=trainData, importance=TRUE, proximity=TRUE )
prediction_rf <- predict(fit , trainData, type = "class")
# classe was read as character, so convert it to a factor for confusionMatrix
confusionmatrix_rf <- confusionMatrix(prediction_rf, as.factor(trainData$classe))
confusionmatrix_rf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3906    0    0    0    0
##          B    0 2658    0    0    0
##          C    0    0 2396    0    0
##          D    0    0    0 2252    0
##          E    0    0    0    0 2525
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000
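The matrix above scores the forest on the same rows it was fitted to, so it measures fit rather than prediction on new data. A minimal sketch of comparing training accuracy with held-out accuracy follows. It is an illustration under assumptions: the built-in iris data and a single rpart tree stand in for the project data and the random forest so that it runs anywhere, and accuracy_on is a hypothetical helper. In this report the analogous step would presumably score fit on testData.

```r
library(rpart)  # ships with R as a recommended package

set.seed(42)
# Stand-in data: iris replaces the project data (an assumption for illustration).
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
in_train <- unlist(lapply(rows_by_class, function(idx) sample(idx, 35)),
                   use.names = FALSE)
train_demo <- iris[in_train, ]
valid_demo <- iris[-in_train, ]

tree_demo <- rpart(Species ~ ., data = train_demo, method = "class")

# Fraction of rows whose predicted class matches the true class.
accuracy_on <- function(model, data) {
  mean(predict(model, data, type = "class") == data$Species)
}

train_acc <- accuracy_on(tree_demo, train_demo)  # scored on the fitting rows
valid_acc <- accuracy_on(tree_demo, valid_demo)  # scored on held-out rows

print(c(training = train_acc, validation = valid_acc))
```

The held-out figure is the one that speaks to how a model would behave on other subsamples of the data.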

The variable importance plot below shows which variables contribute most to the random forest's predictions.

varImpPlot(fit, main="Random Forest Variable Importance")

The plot below shows the random forest's error rates as a function of the number of trees.

plot(fit)

Summary

Random Forest was a superior model for predicting exercise quality compared to rpart. The Random Forest had over 99% accuracy and fitted well to other subsamples of the data.

In general, when evaluating movement-tracking devices it is important to consider how data gathering, predictable errors, and measurement quality affect the results. This project gives us an idea of the exercise-quality information that can be collected and analysed from this type of device.

Albion Dervishi

June 21, 2015