Assignment Objective:

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise. This is the “classe” variable in the training set. Any of the other variables may be used as predictors.

This report describes how the model is built, how it is cross-validated, what out-of-sample error can be expected, and the choices made along the way. The final model is also used to predict 20 different test cases.

Data pre-processing: Data cleanup and preliminary exploration:

Loading the data into a data frame, along with the relevant R packages. The working directory should contain the training and testing files.

setwd("~/Data Science/Git Repository/Practical-machine-learning")
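# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where the default changed to FALSE
traindatamaster <- read.csv("Assignment/pml-training.csv", header = TRUE, stringsAsFactors = TRUE)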

The dataset has 160 variables, but only the sensor readings from the belt, forearm, arm and dumbbell are needed as predictors. These variable names contain _belt, _forearm, _arm or _dumbbell. The data frame is therefore subset to those variables plus the outcome variable classe, excluding any variable with NA or empty observations.

missingObs <- sapply(traindatamaster, function (x) any(is.na(x) | x == ""))
requiredvar <- !missingObs & grepl("belt|[^(fore)]arm|dumbbell|forearm|classe", names(missingObs))
traindatavar <- names(missingObs)[requiredvar]
trainingdata <- traindatamaster[, traindatavar]
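The filtering step above can be tried on a small toy data frame. This is only a sketch: the data frame `toy` and its values are hypothetical stand-ins for `traindatamaster`. It also illustrates a subtlety of the pattern: `[^(fore)]` is a bracket expression that matches any single character other than `(`, `f`, `o`, `r`, `e` or `)`, not a "not preceded by fore" test. On column names like these it appears to select the same columns as a plain `arm` alternative.

```r
# Hypothetical toy data frame standing in for traindatamaster.
toy <- data.frame(
  roll_belt        = c(1.4, 1.5, 1.6),
  gyros_arm_x      = c(0.1, 0.2, 0.3),
  max_roll_belt    = c(NA, NA, 2.0),      # contains NA -> dropped
  kurtosis_yaw_arm = c("", "", "0.5"),    # contains "" -> dropped
  user_name        = c("a", "b", "c"),    # no sensor keyword -> dropped
  classe           = c("A", "B", "A")
)

# Flag every column that has at least one NA or empty-string value.
missingObs <- sapply(toy, function(x) any(is.na(x) | x == ""))

# Keep complete columns whose names match the sensor keywords or the outcome.
pattern <- "belt|[^(fore)]arm|dumbbell|forearm|classe"
keep <- !missingObs & grepl(pattern, names(missingObs))
kept_cols <- names(missingObs)[keep]

# Compare the original pattern with a simpler one on the same names.
same_selection <- identical(
  grepl(pattern, names(toy)),
  grepl("belt|arm|dumbbell|classe", names(toy))
)

print(kept_cols)
print(same_selection)
```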

This should leave one factor outcome variable (classe) and 52 numeric or integer predictor variables.

table(sapply(trainingdata[1,], class))
## 
##  factor integer numeric 
##       1      25      27

List of Predictors and Outcome variables

colnames(trainingdata)
##  [1] "roll_belt"            "pitch_belt"           "yaw_belt"            
##  [4] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
##  [7] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
## [10] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
## [13] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [16] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
## [19] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
## [22] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
## [25] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
## [28] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [31] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [34] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [37] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [40] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [43] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [46] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
## [49] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
## [52] "magnet_forearm_z"     "classe"

Data Prediction:

We will split the preprocessed data into 70% training and 30% testing, using the caret package for the split.

We will then build a random forest model with 500 decision trees and plot its error rate against the number of trees.

library(caret)  # library() stops with an error if caret is not installed
library(randomForest)
library(ggplot2)

set.seed(9876)

intrain <- createDataPartition(trainingdata$classe, p=0.7, list = FALSE )
training <- trainingdata[intrain,]
testing <- trainingdata[-intrain,]

randforestModel <- randomForest(classe~., data = training, ntree = 500)
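The 70/30 split above is stratified on the outcome, so each class keeps roughly the same share in both subsets. The base R sketch below illustrates that idea on a hypothetical toy outcome vector. It is not caret's implementation of `createDataPartition`, only an approximation of the behaviour under the assumption of simple within-class sampling.

```r
# Hypothetical toy outcome; the real data would use trainingdata$classe.
set.seed(1)
classe <- factor(sample(c("A", "B", "C", "D", "E"), 1000, replace = TRUE))

# Sample about 70% of the row indices within each class level.
rows_by_class <- split(seq_along(classe), classe)
stratified_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

# Share of rows that ended up in the training subset.
train_share <- length(stratified_idx) / length(classe)

# Largest gap between class proportions in the subset and in the full data.
prop_diff <- max(abs(
  prop.table(table(classe[stratified_idx])) - prop.table(table(classe))
))

print(train_share)
print(prop_diff)
```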

Summarizing the random forest model and plotting it to see how the error rate changes with the number of decision trees.

randforestModel
## 
## Call:
##  randomForest(formula = classe ~ ., data = training, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.46%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    1    0    0    0 0.0002560164
## B   13 2641    4    0    0 0.0063957863
## C    0   13 2380    3    0 0.0066777963
## D    0    0   20 2229    3 0.0102131439
## E    0    0    3    3 2519 0.0023762376
plot(randforestModel, main ="Random Forest Model")

The random forest has a low out-of-bag (OOB) error estimate of 0.46%, with 7 variables tried at each split. The plot also indicates that the error rate does not drop significantly beyond about 100 decision trees.
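The OOB figures in the printout can be checked by hand. The sketch below copies the OOB confusion matrix from the output above into a plain matrix and recomputes the overall and per-class error rates. It also shows where the 7 variables per split come from, assuming randomForest's default for classification of the floor of the square root of the number of predictors (52 here).

```r
# OOB confusion matrix copied from the randomForest printout
# (rows = actual class, columns = predicted class).
oob_cm <- matrix(
  c(3905,    1,    0,    0,    0,
      13, 2641,    4,    0,    0,
       0,   13, 2380,    3,    0,
       0,    0,   20, 2229,    3,
       0,    0,    3,    3, 2519),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Overall OOB error: share of observations off the diagonal.
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

# Per-class error: off-diagonal share within each row.
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

# Default mtry for classification with 52 predictors.
default_mtry <- floor(sqrt(52))

print(round(100 * oob_error, 2))  # 0.46
print(default_mtry)               # 7
```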

Plotting a dot chart of variable importance for the model.

varImpPlot(randforestModel)

Applying the trained model to the testing data, which is the held-out 30% subsample of the training file. The confusionMatrix function from caret cross-tabulates the observed and predicted values.

predictTest <- predict(randforestModel, newdata = testing)

confusionMatrix(predictTest, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    5    0    0    0
##          B    0 1130    4    0    0
##          C    0    4 1022    8    2
##          D    0    0    0  955    1
##          E    0    0    0    1 1079
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9958          
##                  95% CI : (0.9937, 0.9972)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9946          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9921   0.9961   0.9907   0.9972
## Specificity            0.9988   0.9992   0.9971   0.9998   0.9998
## Pos Pred Value         0.9970   0.9965   0.9865   0.9990   0.9991
## Neg Pred Value         1.0000   0.9981   0.9992   0.9982   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1920   0.1737   0.1623   0.1833
## Detection Prevalence   0.2853   0.1927   0.1760   0.1624   0.1835
## Balanced Accuracy      0.9994   0.9956   0.9966   0.9952   0.9985

From the confusion matrix statistics, the accuracy (0.9958) and Kappa (0.9946) indicate a low error rate on the held-out data, so the expected out-of-sample error is about 0.42%. Random forest therefore appears well suited to this data set. The model is next applied to the 20 test cases.
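The headline statistics can be reproduced from the confusion matrix alone. The sketch below copies the table from the output above and recomputes accuracy, the out-of-sample error, Cohen's Kappa and the per-class sensitivity in base R. The variable names are illustrative and not part of the original analysis.

```r
# Hold-out confusion matrix copied from the confusionMatrix output
# (rows = prediction, columns = reference).
cm <- matrix(
  c(1674,    5,    0,    0,    0,
       0, 1130,    4,    0,    0,
       0,    4, 1022,    8,    2,
       0,    0,    0,  955,    1,
       0,    0,    0,    1, 1079),
  nrow = 5, byrow = TRUE,
  dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5])
)

n <- sum(cm)

# Accuracy is the share of cases on the diagonal.
accuracy <- sum(diag(cm)) / n
oos_error <- 1 - accuracy

# Cohen's Kappa: agreement beyond what the marginals would give by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: correct predictions over reference totals.
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))      # 0.9958
print(round(cohen_kappa, 4))   # 0.9946
print(round(sensitivity, 4))
```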

Applying the trained model to the 20 test cases available in the test data file pml-testing.csv

Loading the data and applying the model

setwd("~/Data Science/Git Repository/Practical-machine-learning")

testdata <- read.csv("Assignment/pml-testing.csv", header = TRUE)
testpred <- predict(randforestModel, newdata = testdata)
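Before predicting on a new file it can help to confirm that every predictor used in training is present in it. The sketch below does this on hypothetical stand-ins: `train_cols` plays the role of `colnames(trainingdata)`, and `toy_test` with its `case_id` column is an invented substitute for the test file.

```r
# Hypothetical stand-ins for colnames(trainingdata) and the test file.
train_cols <- c("roll_belt", "gyros_arm_x", "classe")
toy_test <- data.frame(
  X           = 1:2,
  roll_belt   = c(1.2, 1.3),
  gyros_arm_x = c(0.0, 0.1),
  case_id     = 1:2            # hypothetical identifier column
)

# Predictors are all training columns except the outcome.
predictors <- setdiff(train_cols, "classe")

# Any predictor absent from the new data would make predict() fail.
missing_cols <- setdiff(predictors, names(toy_test))
stopifnot(length(missing_cols) == 0)

# Keep only the predictor columns, in training order.
newdata_subset <- toy_test[, predictors, drop = FALSE]
print(names(newdata_subset))
```

With the real objects, the same check would presumably be run on `names(testdata)` before the `predict` call.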