Introduction

Data source: http://groupware.les.inf.puc-rio.br/har

Several individuals were instructed to perform various weight-lifting exercises while deliberately making specific mistakes in each exercise. Many body sensors were used to gather data which yielded about 160 numerical features. We used the caret package in R for some basic preprocessing and model training using the k-nearest-neighbors method to classify mistakes in exercises.

The goal of this project is to predict the manner in which the participant did the exercise. This is the “classe” variable in the training set; its possible values are

A. performing exercise as instructed / no mistake

B. throwing elbow out to the front

C. lifting the dumbbell only halfway

D. lowering the dumbbell only halfway

E. throwing the hips to the front

We use most of the other variables to predict the type of mistake / lack thereof.

Preliminaries and first round of cross-validation

Read in the training data, then partition into 75% training and 25% testing sets (first round of cross-validation).

setwd("~/coursera/practical machine learning/Practical Machine Learning")

#Attach the packages unconditionally. Wrapping these calls in if(!require(caret))
#would skip knitr and randomForest whenever caret loads successfully.
library(knitr)
library(randomForest)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
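#stringsAsFactors = TRUE keeps classe a factor; since R 4.0.0 read.csv reads strings as character by default
trainingData <- read.csv("pml-training.csv", stringsAsFactors = TRUE)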
#Check what data look like without NAs
trainingDataLite <- read.csv("pml-training.csv")
set.seed(333)
inTrain <- createDataPartition(trainingData$classe, p = 0.75, list = FALSE)
training <- trainingData[inTrain, ]
testing <- trainingData[-inTrain, ]
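
For a factor outcome, createDataPartition samples within each class, so both parts keep roughly the same class proportions. The idea can be sketched in base R on the built-in iris data; stratified_split is a hypothetical helper written for this illustration and is not part of caret or of the analysis above.

```r
# Hypothetical helper: draw a fraction p of the row indices within each class
stratified_split <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(333)
in_train  <- stratified_split(iris$Species, p = 0.75)
train_toy <- iris[in_train, ]
test_toy  <- iris[-in_train, ]

# Class proportions are the same in both parts
prop.table(table(train_toy$Species))
prop.table(table(test_toy$Species))
```

A plain random split would also work on data this balanced; stratifying matters more when some classes are rare.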

Exploratory

Many variables are missing most of their values.

#Function to compute the NA percentage of a variable x
percentMissing <- function(x) {
  sum(is.na(x))/length(x)
}

#Determine which variables are more than 50% missing.
manyMissing <- apply(training, MARGIN = 2, FUN = percentMissing)
qplot(x = colnames(training), y = manyMissing)

#There are lots.
mostlyMissingVariables <- colnames(training)[manyMissing > 0.5]
cat("There are", length(mostlyMissingVariables), "variables missing",
    "more than 50% of their values.")
## There are 67 variables missing more than 50% of their values.
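
As a small illustration of the same check, the sketch below computes the missing fraction per column on a made-up data frame (toy and its columns are invented for this example). It also shows one caveat: empty strings are not NA, so a column that is mostly blank is not counted as missing by this measure.

```r
# Toy data frame, made up for illustration
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "", "y", "z"),   # the empty string is not NA
  c = c(NA, NA, NA, 4),
  stringsAsFactors = FALSE
)

# Fraction of NA values per column; same result as applying
# percentMissing to each column
fraction_missing <- colMeans(is.na(toy))
fraction_missing

# Columns that are more than 50% missing
mostly_missing <- names(fraction_missing)[fraction_missing > 0.5]
mostly_missing
```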

Does new_window have an effect on classe?

Pre-processing

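1. Remove variables that are missing more than 5% of their values. This stricter cutoff also removes every variable flagged above as more than 50% missing.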

training <- training[, manyMissing <= 0.05]
testing <- testing[, manyMissing <= 0.05]

2. Remove non-predictors

The variable X simply numbers the rows, so may be disregarded. Similarly the timestamps appear to be irrelevant.

Each of the 6 participants performed 10 repetitions in 5 fashions (classes), so the classe does not depend on the user_name in a meaningful way. Therefore user_name should be eliminated as a predictor.

training <- training[, -c(1:5)]
testing <- testing[, -c(1:5)]
#qplot(new_window, classe, data = training)

3. Check which variables have near-zero variance.

Remove all variables from the training and testing sets that exhibit near-zero variance in the training set.

nsv <- nearZeroVar(training, saveMetrics = TRUE)
training <- training[ , nsv$nzv == FALSE]
testing <- testing[ , nsv$nzv == FALSE]
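
With saveMetrics = TRUE, nearZeroVar reports a frequency ratio (count of the most common value divided by the count of the second most common) and the percentage of unique values. By default it flags a variable when the ratio exceeds 95/5 and the unique percentage is below 10. The base R sketch below reproduces those two metrics on a made-up vector; nzv_metrics is a hypothetical helper for illustration, not caret's implementation.

```r
# Hypothetical helper that mirrors the two metrics, for illustration only
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  percent_unique <- 100 * length(counts) / length(x)
  c(freqRatio = freq_ratio, percentUnique = percent_unique)
}

# Made-up flag dominated by a single value: 98 "no" and 2 "yes"
window_flag <- c(rep("no", 98), rep("yes", 2))
m <- nzv_metrics(window_flag)
m

# Apply the default cutoffs of nearZeroVar (freqCut = 95/5, uniqueCut = 10)
is_nzv <- m[["freqRatio"]] > 95 / 5 && m[["percentUnique"]] < 10
is_nzv
```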

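4. Impute missing values by k-nearest neighbors for all variables except the response variable classe. Note that caret's knnImpute method also centers and scales the predictors.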

classeIndex <- dim(training)[2]  #column index for classe variable
preObj <- preProcess(training[, -classeIndex], method = "knnImpute")
trainImp <- predict(preObj, training[, -classeIndex])
#Now put classe back into the data frame so the classifiers below can be fit with a formula
trainImpFinal <- cbind(trainImp, classe = training$classe)
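
The preProcess object is fit on the training predictors only, and predict() then applies the stored parameters to whatever data it is given. The base R sketch below shows that fit-then-apply pattern for centering and scaling alone (no imputation), on made-up data; train_toy, new_toy and the other names are invented for this example.

```r
set.seed(1)
# Made-up training data and new data
train_toy <- data.frame(x1 = rnorm(20, mean = 10, sd = 2),
                        x2 = rnorm(20, mean = -3, sd = 5))
new_toy   <- data.frame(x1 = rnorm(5, mean = 10, sd = 2),
                        x2 = rnorm(5, mean = -3, sd = 5))

# "Fit": learn the parameters from the training data only
center <- colMeans(train_toy)
spread <- sapply(train_toy, sd)

# "Apply": reuse the same parameters on any data set
train_std <- scale(train_toy, center = center, scale = spread)
new_std   <- scale(new_toy,   center = center, scale = spread)

colMeans(train_std)  # approximately 0 by construction
colMeans(new_std)    # generally not 0: shifted by the training means
```

Recomputing the means and standard deviations on the new data instead would put the two sets on different scales and leak information from the held-out rows.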

Fit several standard classification models

We fit several of the models suggested in the tutorial [1], which uses them to predict the species variable in the iris dataset. We also use its suggested method for visualizing the comparison of model accuracy.

We use 10-fold cross-validation with accuracy as the metric for selection. Note that we have not chosen to use any linear regression models because we are doing classification, not prediction of a continuous response.

control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
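
To make the resampling scheme concrete, here is a hand-rolled sketch of 10-fold cross-validation on the iris data, using knn() from the class package as the classifier. It is only an illustration of the idea; caret's own fold construction and bookkeeping may differ, and the variable names are invented for this example.

```r
library(class)  # recommended package shipped with R; provides knn()

set.seed(7)
# k-NN is distance-based, so standardize the predictors. For brevity the
# scaling uses all rows; a stricter version would rescale within each fold.
x <- scale(iris[, 1:4])
y <- iris$Species
n_folds <- 10

# Randomly assign each row to one of 10 folds of equal size
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(x)))

# Hold out each fold in turn, train on the other nine, score the held-out rows
fold_accuracy <- sapply(seq_len(n_folds), function(f) {
  held_out <- fold_id == f
  pred <- knn(train = x[!held_out, ], test = x[held_out, ],
              cl = y[!held_out], k = 5)
  mean(pred == y[held_out])
})

# The cross-validated accuracy is the average over the folds
mean(fold_accuracy)
```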

classification tree

set.seed(7)
modFitTree <- train(classe ~.,
                data = trainImpFinal,
                method = "rpart",
                trControl = control,
                metric = metric)

linear discriminant analysis

set.seed(7)
modFitLda <- train(classe ~.,
                data = trainImpFinal,
                method = "lda",
                trControl = control,
                metric = metric)

k-nearest neighbors

set.seed(7)
modFitKnn <- train(classe ~.,
                data = trainImpFinal,
                method = "knn",
                trControl = control,
                metric = metric)
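
For intuition about what the k-nearest neighbors method does at prediction time, the sketch below classifies a single query point by majority vote among its k closest training points in Euclidean distance. knn_predict_one is a hypothetical helper written for this illustration; it is not how caret implements the method internally.

```r
# Hypothetical helper: majority vote among the k nearest training points
knn_predict_one <- function(train_x, train_y, query, k = 5) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]   # ties go to the first level
}

train_x <- as.matrix(iris[, 1:4])
train_y <- iris$Species

# A flower with small petals, typical of setosa
predicted <- knn_predict_one(train_x, train_y,
                             query = c(5.0, 3.4, 1.5, 0.2), k = 5)
predicted
```

Because the vote depends on distances, predictors measured on large scales dominate unless the data are standardized first.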

Model comparison and out-of-sample error estimate

results <- resamples(list(lda = modFitLda, cart = modFitTree, knn = modFitKnn))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, cart, knn 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda  0.6861413 0.7062525 0.7137531 0.7124623 0.7234100 0.7296196    0
## cart 0.4921928 0.5072995 0.5421468 0.5334936 0.5538950 0.5665761    0
## knn  0.9619565 0.9636616 0.9663836 0.9663681 0.9685802 0.9728261    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda  0.6023583 0.6277628 0.6381047 0.6361558 0.6499648 0.6582571    0
## cart 0.3358518 0.3613374 0.4201349 0.4030304 0.4352448 0.4497154    0
## knn  0.9518533 0.9540209 0.9574849 0.9574564 0.9602490 0.9656215    0
#compare accuracy
dotplot(results)

#summarize modFitKnn
print(modFitKnn)
## k-Nearest Neighbors 
## 
## 14718 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 13246, 13246, 13245, 13248, 13246, 13246, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.9663681  0.9574564
##   7  0.9535936  0.9412828
##   9  0.9416356  0.9261407
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

The k-nearest neighbors model modFitKnn has the highest cross-validated accuracy on the training set, so we evaluate this model on the testing set.

#Apply the preprocessing learned on the training set (preObj) to the testing set.
testingImp <- predict(preObj, testing[, -classeIndex])

#Predict classe values in the testing set
testingPredictions <- predict(modFitKnn, testingImp)

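#Compare to true values with confusion matrix.
#Note: confusionMatrix() expects (data = predictions, reference = truth). This call passes
#the true classes first, so in the printed table the rows labeled "Prediction" are the true
#classes and the columns labeled "Reference" are the model's predictions. Accuracy and Kappa
#are unaffected; Sensitivity and Pos Pred Value trade places, as do Specificity and Neg Pred Value.
testingTruth <- testing$classe
confusionMatrix(testingTruth, testingPredictions)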
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1376    5    8    3    3
##          B   16  922   11    0    0
##          C    1   15  825   12    2
##          D    1    1   40  761    1
##          E    0    3    4    8  886
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9727          
##                  95% CI : (0.9677, 0.9771)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9654          
##                                           
##  Mcnemar's Test P-Value : 1.124e-05       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9871   0.9746   0.9291   0.9707   0.9933
## Specificity            0.9946   0.9932   0.9925   0.9896   0.9963
## Pos Pred Value         0.9864   0.9715   0.9649   0.9465   0.9834
## Neg Pred Value         0.9949   0.9939   0.9844   0.9944   0.9985
## Prevalence             0.2843   0.1929   0.1811   0.1599   0.1819
## Detection Rate         0.2806   0.1880   0.1682   0.1552   0.1807
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9908   0.9839   0.9608   0.9801   0.9948

Estimate of the out of sample accuracy/error

The model is quite accurate on the testing data that we set aside for the preliminary round of cross-validation. At this stage, we estimate the out-of-sample accuracy to be 0.9727 (95% CI: 0.9677 to 0.9771). Equivalently, the estimated out-of-sample error is 1 - 0.9727 = 0.0273.
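
As a check on the arithmetic, the accuracy, error rate and Kappa can be recomputed directly from the counts in the confusion matrix printed above. The sketch below retypes those counts by hand; the variable names are invented for this example.

```r
# Counts copied from the confusion matrix output above
cm <- matrix(
  c(1376,   5,   8,   3,   3,
      16, 922,  11,   0,   0,
       1,  15, 825,  12,   2,
       1,   1,  40, 761,   1,
       0,   3,   4,   8, 886),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5], LETTERS[1:5])
)

n <- sum(cm)
accuracy   <- sum(diag(cm)) / n   # share of cases on the diagonal
error_rate <- 1 - accuracy

# Cohen's kappa: agreement beyond what the row and column totals
# would produce by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, error = error_rate, kappa = cohen_kappa), 4)
# accuracy 0.9727, error 0.0273, kappa 0.9654
```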

Load the separate data file for testing (call it validation) and make predictions using the k-nearest neighbors model modFitKnn.

validation <- read.csv("pml-testing.csv")

Preprocess the validation set in the same way as the training set.

1. Remove the same variables.

validation <- validation[, manyMissing <= 0.05]

2. Remove non-predictors

validation <- validation[, -c(1:5)]

3. Remove the same variables that were removed from the training set due to near-zero variance.

validation <- validation[ , nsv$nzv == FALSE]

4. Preprocess the validation set using preObj (the same preprocess object developed on the training set).

#Apply the preprocessing learned on the training set (preObj) to the validation set.
validationImp <- predict(preObj, validation[, -classeIndex])

#Predict classe values in the validation set
validationPredictions <- predict(modFitKnn, validationImp)

# #Compare to true values with confusion matrix
# validationTruth <- validation$classe
# confusionMatrix(validationTruth, validationPredictions)

Predictions for the validation/test data:

validationPredictions
##  [1] B A A A A E D B A A B C B A E E A B B B
## Levels: A B C D E