1. Project Overview

This project predicts the manner in which six young, healthy participants aged 20-28 years, with little weight lifting experience, performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). The exercise class is recorded in the variable “classe”. The provided pml-training data set is partitioned into training and validation data sets, and the pml-testing data set with 20 observations serves as the test set. The links to the data are given in the readme file.

2. Loading and preprocessing the data

Necessary packages

library(knitr)
library(caret)  
library(MASS)
library(klaR)
library(rattle)
library(readr) 
library(ggplot2)

Data downloading and reading

filePath <- getwd()
fileName1 <- "pml-training.csv"
fileName2 <- "pml-testing.csv"
url1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url1, destfile = fileName1, method = "curl")
download.file(url2, destfile = fileName2, method = "curl")

trainingRD<- read.csv(fileName1)
testingRD<- read.csv(fileName2)
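
As a quick sanity check, the dimensions of the raw files can be inspected (a minimal sketch, not part of the original script; the expected sizes follow from the row and column counts reported later in this report).

dim(trainingRD)   # expected: 19622 rows, 160 columns before cleaning
dim(testingRD)    # expected: 20 rows, 160 columns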

Cleaning the Data

View(trainingRD); View(testingRD)

As can be seen from the viewer opened above, both data sets contain many columns that are mostly NA as well as unnecessary variables; these should be removed before the analysis starts.
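
Before dropping anything, the extent of missingness can be quantified (a minimal sketch using the raw trainingRD data frame; not part of the original analysis).

# share of NA values per column
na_share <- colMeans(is.na(trainingRD))
table(round(na_share, 2))
sum(na_share > 0.9)   # number of columns dominated by NAs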

# remove near-zero-variance columns and columns that contain NA values
nearZero <- nearZeroVar(trainingRD)
trainingRD <- trainingRD[, -nearZero]
trainingRD <- trainingRD[, which(colSums(is.na(trainingRD)) == 0)]

# the first 7 columns (row id, user name, timestamps and window markers) have no relationship with "classe"
trainingSet <- trainingRD[, -c(1:7)]
testing <- testingRD[, -c(1:7)]
# note: the test set keeps its remaining columns; predict() later matches predictors by name

# ensure the outcome is a factor (needed for the plots and confusionMatrix on R >= 4.0)
trainingSet$classe <- factor(trainingSet$classe)

The following table shows the number of observations for each class category after the data is cleaned.

table(trainingSet$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
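
The same counts can be expressed as proportions (a small sketch); note that the largest share, Class A at roughly 0.28, matches the No Information Rate of 0.2845 reported in the confusion matrices below.

round(prop.table(table(trainingSet$classe)), 3)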

3. Creating training and validation data sets

set.seed(12345)
inTrain<- createDataPartition(y = trainingSet$classe, p = 0.7, list = FALSE)
training<- trainingSet[inTrain, ]
validation<- trainingSet[-inTrain, ]
dim(training)
## [1] 13737    52
dim(validation)
## [1] 5885   52
dim(testing)
## [1]  20 153
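
The test set still has 153 columns because the near-zero-variance and NA filters above were applied only to the training file. If a matching layout is preferred, the shared predictor columns can be selected explicitly (a sketch; the predictions later in this report also work without this step because caret matches predictors by name).

shared <- intersect(names(testing), names(training))
testing_aligned <- testing[, shared]
dim(testing_aligned)   # 20 rows, 51 shared predictor columns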

The following plots show, as examples, how four representative variables (total_accel_belt, total_accel_arm, total_accel_dumbbell and total_accel_forearm) vary across the five classes.

par(mfrow = c(2, 2))
plot(training$classe, training$total_accel_belt, xlab = "Class", ylab = "total_accel_belt", main = "Class vs Total acceleration on belt")
plot(training$classe, training$total_accel_arm, xlab = "Class", ylab = "total_accel_arm", main = "Class vs Total acceleration on arm")
plot(training$classe, training$total_accel_dumbbell, xlab = "Class", ylab = "total_accel_dumbbell", main = "Class vs Total acceleration on dumbbell")
plot(training$classe, training$total_accel_forearm, xlab = "Class", ylab = "total_accel_forearm", main = "Class vs Total acceleration on forearm")
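
Since ggplot2 is already loaded, an equivalent boxplot can be drawn with it; the code below is only an illustrative sketch for one of the four variables, not part of the original figure.

ggplot(training, aes(x = classe, y = total_accel_belt)) +
  geom_boxplot() +
  labs(x = "Class", y = "total_accel_belt",
       title = "Class vs Total acceleration on belt")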

4. Best fit model selection

Three models (lda, rpart and rf) were fitted, and the best model is selected based on the highest accuracy on the validation set.

1. Linear discriminant analysis (“lda”)

mod_lda<- train(classe ~., data = training, method = "lda")
plda <- predict(mod_lda, validation)
confusionMatrix(plda, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1387  174  107   56   50
##          B   40  744  115   52  202
##          C  116  119  645  124   90
##          D  127   50  138  686  112
##          E    4   52   21   46  628
## 
## Overall Statistics
##                                          
##                Accuracy : 0.695          
##                  95% CI : (0.683, 0.7067)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6137         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8286   0.6532   0.6287   0.7116   0.5804
## Specificity            0.9081   0.9138   0.9076   0.9132   0.9744
## Pos Pred Value         0.7818   0.6453   0.5896   0.6164   0.8362
## Neg Pred Value         0.9302   0.9165   0.9205   0.9417   0.9116
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2357   0.1264   0.1096   0.1166   0.1067
## Detection Prevalence   0.3014   0.1959   0.1859   0.1891   0.1276
## Balanced Accuracy      0.8683   0.7835   0.7681   0.8124   0.7774

2. Recursive partitioning (“rpart”) and tree plot

mod_rpart<- train(classe ~., data = training, method = "rpart")
prpart<- predict(mod_rpart, validation)
confusionMatrix(prpart, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1527  482  498  423  243
##          B   31  387   38  188  228
##          C   77  124  423  126  150
##          D   38  146   67  227  145
##          E    1    0    0    0  316
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4894          
##                  95% CI : (0.4765, 0.5022)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3317          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9122  0.33977  0.41228  0.23548  0.29205
## Specificity            0.6091  0.89781  0.90183  0.91953  0.99979
## Pos Pred Value         0.4812  0.44381  0.47000  0.36437  0.99685
## Neg Pred Value         0.9458  0.84999  0.87904  0.85994  0.86243
## Prevalence             0.2845  0.19354  0.17434  0.16381  0.18386
## Detection Rate         0.2595  0.06576  0.07188  0.03857  0.05370
## Detection Prevalence   0.5392  0.14817  0.15293  0.10586  0.05387
## Balanced Accuracy      0.7607  0.61879  0.65706  0.57750  0.64592
fancyRpartPlot(mod_rpart$finalModel)

3. Random forest analysis (“rf”)

mod_rf <- train(classe ~ ., method = "rf", data = training, importance = TRUE,
                trControl = trainControl(method = "cv", number = 3, classProbs = TRUE,
                                         savePredictions = TRUE, allowParallel = TRUE))
prf<- predict(mod_rf, validation)
confusionMatrix(prf, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    6    0    0    0
##          B    1 1129    7    0    0
##          C    0    4 1016    8    0
##          D    0    0    3  955    2
##          E    0    0    0    1 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9946          
##                  95% CI : (0.9923, 0.9963)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9931          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9912   0.9903   0.9907   0.9982
## Specificity            0.9986   0.9983   0.9975   0.9990   0.9998
## Pos Pred Value         0.9964   0.9930   0.9883   0.9948   0.9991
## Neg Pred Value         0.9998   0.9979   0.9979   0.9982   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1918   0.1726   0.1623   0.1835
## Detection Prevalence   0.2853   0.1932   0.1747   0.1631   0.1837
## Balanced Accuracy      0.9990   0.9948   0.9939   0.9948   0.9990
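
Because importance = TRUE was set when fitting the random forest, the most influential predictors can also be inspected (a sketch; the exact ranking will vary between runs).

# top predictors according to caret's variable-importance measure
plot(varImp(mod_rf), top = 10)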

The above results show that the random forest model has the highest accuracy on the validation set. Therefore, the random forest model is used for predicting the test samples.
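
For completeness, the three validation accuracies can be collected side by side; this sketch reuses the plda, prpart and prf predictions computed above. The implied out-of-sample error of the chosen model is 1 - accuracy, about 1 - 0.9946 = 0.0054 for the random forest.

acc <- c(
  lda   = confusionMatrix(plda,   validation$classe)$overall[["Accuracy"]],
  rpart = confusionMatrix(prpart, validation$classe)$overall[["Accuracy"]],
  rf    = confusionMatrix(prf,    validation$classe)$overall[["Accuracy"]]
)
round(acc, 4)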

5. Predictions with test data

Since the random forest (rf) model has the highest accuracy, it is selected to predict the test samples.

testing_pre<- predict(mod_rf, newdata = testing)
testing_pre
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
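
To make the 20 answers easier to read, each prediction can be paired with its problem_id from the raw pml-testing file (a small sketch; it assumes the problem_id column present in pml-testing.csv).

data.frame(problem_id = testingRD$problem_id, prediction = testing_pre)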

6. Conclusions

The pml-training data set is split into training and validation sets to build predictive models and evaluate their accuracy. To select the best fit, the lda, rpart and rf models are applied. The rf model gives the best fit and is used to predict the test data.

REFERENCE

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013. http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises