Prediction of Quality of Active Work

Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they performed an exercise. The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:

exactly according to the specification (Class A)
throwing the elbows to the front (Class B)
lifting the dumbbell only halfway (Class C)
lowering the dumbbell only halfway (Class D)
throwing the hips to the front (Class E)

Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har

The main objectives of this project are as follows:

Predict the manner in which the participants did the exercise
Build a prediction model
Calculate the out-of-sample error
Use the prediction model to predict the 20 test cases provided

Load Data

  training <- read.csv(file="./data/pml-training.csv", header=TRUE, na.strings=c("NA",""))
  testing <- read.csv(file="./data/pml-testing.csv", header=TRUE, na.strings=c("NA",""))
  dim(training)   #[1] 19622   160

## [1] 19622   160

  dim(testing)    #[1]  20 160

## [1]  20 160

  # str(training)

The training set comprises 19622 observations of 160 variables, and the testing set comprises 20 test cases with the same number of variables.

Processing data

First, we count the total number of NA values in the training and testing data.

   sum(is.na(training)) #[1] 1921600

## [1] 1921600

   sum(is.na(testing))  #[1] 2000

## [1] 2000

We remove the columns that consist almost entirely of NA values using the following code segment:

# for the training dataset: count the NA values in each column
columnNACounts <- colSums(is.na(training))
# columnNACounts
# inspecting columnNACounts shows that every column containing NA values
# has at least 19200 of them, so those columns are dropped
badColumns <- columnNACounts >= 19200
cleanTrainingdata <- training[!badColumns]
sum(is.na(cleanTrainingdata)) # 0

## [1] 0

# same for the testing dataset
columnNACounts <- colSums(is.na(testing))
# columnNACounts
# inspecting columnNACounts shows that every column containing NA values
# is NA in all 20 rows, so those columns are dropped
badColumns <- columnNACounts >= 20
cleanTestingdata <- testing[!badColumns]
sum(is.na(cleanTestingdata)) # 0

## [1] 0
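
The hard-coded cut-offs above (19200 and 20) depend on the number of rows in each file. As a rough sketch of an alternative, the same filtering can be expressed as a fraction of missing values, with the column selection made once on the training set and then reused for the testing set. The toy data frames, the 0.5 threshold, and the names `toy_train`, `toy_test`, `keep_cols`, `clean_train`, and `clean_test` are illustrative assumptions, not part of the original analysis.

```r
# Toy data frames standing in for the real files (hypothetical values).
toy_train <- data.frame(
  roll_belt  = c(1.4, 1.5, 1.6, 1.7),
  max_roll   = c(NA, NA, NA, 2.0),   # mostly missing
  pitch_belt = c(8.1, 8.0, 8.2, 8.3),
  classe     = c("A", "A", "B", "B")
)
toy_test <- data.frame(
  roll_belt  = c(1.3, 1.8),
  max_roll   = c(NA, NA),
  pitch_belt = c(8.4, 7.9),
  problem_id = 1:2
)

# Fraction of missing values per column, computed on the training set only.
na_fraction <- colMeans(is.na(toy_train))

# Keep columns whose missing fraction is below an assumed threshold of 0.5.
keep_cols <- names(na_fraction)[na_fraction < 0.5]
clean_train <- toy_train[, keep_cols, drop = FALSE]

# Reuse the training selection for the test set so the predictors match.
# The outcome column is absent from the test set, hence the intersect().
clean_test <- toy_test[, intersect(keep_cols, names(toy_test)), drop = FALSE]

print(names(clean_train))
print(names(clean_test))
```

Selecting columns from the training set alone guarantees that both sets end up with the same predictors, even if the pattern of missing values in the test file were to differ.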

Feature Selection

# remove the first 6 columns, which hold bookkeeping fields such as
# the user name and time stamps and are not useful to the classifier
cleanTrainingdata <- cleanTrainingdata[, 7:60]
cleanTestingdata <- cleanTestingdata[, 7:60]
dim(cleanTrainingdata) # [1] 19622    54

## [1] 19622    54

dim(cleanTestingdata)  # [1] 20 54

## [1] 20 54
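
Positional indexing such as `7:60` silently breaks if the column order ever changes. A minimal sketch of the same step done by name is shown below. The six column names in `id_cols` are assumed from the published dataset and should be checked against `names(cleanTrainingdata)`; the toy data frame and its values are hypothetical.

```r
# Toy data frame with the assumed leading bookkeeping columns (hypothetical values).
toy <- data.frame(
  X = 1:3,
  user_name = c("user_a", "user_a", "user_b"),
  raw_timestamp_part_1 = c(100L, 100L, 101L),
  raw_timestamp_part_2 = c(10L, 20L, 30L),
  cvtd_timestamp = c("t1", "t1", "t2"),
  new_window = c("no", "no", "no"),
  num_window = c(11L, 11L, 12L),
  roll_belt = c(1.41, 1.41, 1.42),
  classe = c("A", "A", "A")
)

# Columns to drop, listed by name (assumed names, see the note above).
id_cols <- c("X", "user_name", "raw_timestamp_part_1",
             "raw_timestamp_part_2", "cvtd_timestamp", "new_window")

# Keep everything that is not a bookkeeping column.
features <- toy[, setdiff(names(toy), id_cols), drop = FALSE]
print(names(features))
```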

Exploratory Data Analysis

  plot(cleanTrainingdata$classe,col=rainbow(5),main = "classe frequency plot")

  # plot scatterplot matrices to see whether the relationships look linear or nonlinear
  # (attach() is not needed because each call passes data= explicitly)
  pairs(classe~num_window+roll_arm+pitch_arm,data=cleanTrainingdata, 
   main="Simple Scatterplot Matrix")

  pairs(classe~roll_belt+pitch_belt+yaw_belt,data=cleanTrainingdata, 
   main="Simple Scatterplot Matrix")

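From the scatterplot matrices above, the relationship between classe and these predictors appears to be nonlinear, which motivates a nonlinear classifier such as a random forest.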

We now partition the cleaned training data into a training set (60%) and a cross-validation set (40%).

library(caret)

## Warning: package 'caret' was built under R version 3.1.2

## Loading required package: lattice
## Loading required package: ggplot2

inTrain <- createDataPartition(y = cleanTrainingdata$classe, p = 0.6, list = FALSE)
trainingdata <- cleanTrainingdata[inTrain, ]
crossval <- cleanTrainingdata[-inTrain, ]
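
For a factor outcome, `createDataPartition` samples within each class so that the class proportions are preserved in both parts. The idea can be sketched in base R as follows. This is an illustration of stratified sampling on toy data, not caret's actual implementation; the seed value, the toy class counts, and the names `toy`, `in_train`, `train_part`, and `holdout_part` are assumptions. Setting a seed before the split also makes it reproducible.

```r
set.seed(1234)  # arbitrary seed so the split can be reproduced

# Toy outcome with unequal class sizes (hypothetical counts).
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 40, 30, 30, 30)))
toy <- data.frame(id = seq_along(classe), classe = classe)

# Within each class, sample about 60% of the row indices.
rows_by_class <- split(seq_len(nrow(toy)), toy$classe)
in_train <- unlist(lapply(rows_by_class, function(idx) {
  n_pick <- round(0.6 * length(idx))
  idx[sample.int(length(idx), size = n_pick)]
}), use.names = FALSE)

train_part   <- toy[in_train, ]
holdout_part <- toy[-in_train, ]

# Class proportions should be roughly equal in both parts.
print(prop.table(table(train_part$classe)))
print(prop.table(table(holdout_part$classe)))
```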

Fit a random forest predictor relating the factor variable classe to the remaining variables.

cvCtrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE, verboseIter = TRUE)
# Build the model using 5-fold cross validation
model <- train(classe ~ ., data = trainingdata, method = "rf", trControl = cvCtrl)

## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

## + Fold1: mtry= 2 
## - Fold1: mtry= 2 
## + Fold1: mtry=27 
## - Fold1: mtry=27 
## + Fold1: mtry=53 
## - Fold1: mtry=53 
## + Fold2: mtry= 2 
## - Fold2: mtry= 2 
## + Fold2: mtry=27 
## - Fold2: mtry=27 
## + Fold2: mtry=53 
## - Fold2: mtry=53 
## + Fold3: mtry= 2 
## - Fold3: mtry= 2 
## + Fold3: mtry=27 
## - Fold3: mtry=27 
## + Fold3: mtry=53 
## - Fold3: mtry=53 
## + Fold4: mtry= 2 
## - Fold4: mtry= 2 
## + Fold4: mtry=27 
## - Fold4: mtry=27 
## + Fold4: mtry=53 
## - Fold4: mtry=53 
## + Fold5: mtry= 2 
## - Fold5: mtry= 2 
## + Fold5: mtry=27 
## - Fold5: mtry=27 
## + Fold5: mtry=53 
## - Fold5: mtry=53 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 27 on full training set

Check the importance of features

vimp <- varImp(model)
print(vimp)

## rf variable importance
## 
##   only 20 most important variables shown (out of 53)
## 
##                      Overall
## num_window           100.000
## roll_belt             67.998
## pitch_forearm         42.220
## yaw_belt              32.119
## magnet_dumbbell_z     31.338
## pitch_belt            29.957
## magnet_dumbbell_y     29.619
## roll_forearm          26.876
## accel_dumbbell_y      12.229
## accel_forearm_x       11.776
## magnet_dumbbell_x     11.358
## roll_dumbbell         11.210
## accel_belt_z          10.329
## total_accel_dumbbell   9.381
## magnet_forearm_z       8.428
## accel_dumbbell_z       8.056
## magnet_belt_y          7.982
## magnet_belt_z          7.861
## magnet_belt_x          6.143
## yaw_dumbbell           5.323

Calculate in-sample accuracy

Here, we calculate the in-sample accuracy, which is the prediction accuracy of our model on the data set used to train it.

training_pred <- predict(model, trainingdata)  # predict on the same data the model was fitted on
confusionMatrix(training_pred, trainingdata$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3348    0    0    0    0
##          B    0 2279    0    0    0
##          C    0    0 2054    0    0
##          D    0    0    0 1930    0
##          E    0    0    0    0 2165
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

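Thus, from the above confusion matrix, the in-sample accuracy is 100%. Because the model is evaluated on the data it was trained on, this is an optimistic estimate of its performance.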

Calculate out-of-sample accuracy

testing_pred <- predict(model, crossval)
confusionMatrix(testing_pred, crossval$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2231    2    0    0    0
##          B    0 1515    1    0    0
##          C    0    1 1367    6    0
##          D    0    0    0 1280    3
##          E    1    0    0    0 1439
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9982        
##                  95% CI : (0.997, 0.999)
##     No Information Rate : 0.2845        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.9977        
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9996   0.9980   0.9993   0.9953   0.9979
## Specificity            0.9996   0.9998   0.9989   0.9995   0.9998
## Pos Pred Value         0.9991   0.9993   0.9949   0.9977   0.9993
## Neg Pred Value         0.9998   0.9995   0.9998   0.9991   0.9995
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1931   0.1742   0.1631   0.1834
## Detection Prevalence   0.2846   0.1932   0.1751   0.1635   0.1835
## Balanced Accuracy      0.9996   0.9989   0.9991   0.9974   0.9989
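
The overall accuracy and the per-class sensitivities reported above can be recomputed directly from the confusion matrix, which also gives the out-of-sample error as one minus the accuracy. The sketch below re-enters the counts printed above by hand; the variable names `cm`, `accuracy`, `oos_error`, and `sensitivity` are illustrative.

```r
# Confusion matrix counts copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(2231,    2,    0,    0,    0,
       0, 1515,    1,    0,    0,
       0,    1, 1367,    6,    0,
       0,    0,    0, 1280,    3,
       1,    0,    0,    0, 1439),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# The out-of-sample error estimate is the complement of the accuracy.
oos_error <- 1 - accuracy

# Sensitivity per class: correct predictions divided by the reference total.
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(oos_error, 4))
print(round(sensitivity, 4))
```

Of the 7846 cross-validation observations, 14 are misclassified, which corresponds to the accuracy of 0.9982 in the output above.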

The out-of-sample accuracy is 99.82%, so the estimated out-of-sample error is about 0.18%. Now, we apply the above model to the clean testing data (20 cases).

Testing our model with new data (20 cases)

answers <- predict(model, cleanTestingdata)  # use the cleaned test set, which holds the same predictors as the training data
answers <- as.character(answers)
answers

##  [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"