Data source: http://groupware.les.inf.puc-rio.br/har
Six participants were instructed to perform weight-lifting exercises while deliberately making a specific mistake in each variant of the exercise. Body-mounted sensors gathered data that yielded about 160 numerical features. We used the caret package in R for basic preprocessing and for model training, classifying the exercise mistakes with the k-nearest-neighbors method.
The goal of this project is to predict the manner in which the participant did the exercise. This is the “classe” variable in the training set; its possible values are
A. performing exercise as instructed / no mistake
B. throwing elbow out to the front
C. lifting the dumbbell only halfway
D. lowering the dumbbell only halfway
E. throwing the hips to the front
We use most of the remaining variables to predict the type of mistake, or the absence of one.
Read in the training data, then partition it into 75% training and 25% testing sets (the first round of cross-validation).
setwd("~/coursera/practical machine learning/Practical Machine Learning")
require(caret)        # loads lattice and ggplot2 as dependencies
library(knitr)
library(randomForest)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
trainingData <- read.csv("pml-training.csv")
set.seed(333)
inTrain <- createDataPartition(trainingData$classe, p = 0.75, list = FALSE)
training <- trainingData[inTrain, ]
testing <- trainingData[-inTrain, ]
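A quick illustrative check of the split sizes (the training partition should contain 14718 rows, matching the model summary further below):
# Check the sizes of the 75/25 split (illustrative)
dim(training)
dim(testing)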
Many variables are missing most of their values.
# Function to compute the proportion of NA values in a variable x
percentMissing <- function(x) {
  sum(is.na(x)) / length(x)
}
# Compute the proportion of missing values in each variable
manyMissing <- apply(training, MARGIN = 2, FUN = percentMissing)
# Plot missingness per column; many variables are mostly missing
qplot(x = colnames(training), y = manyMissing)
# Identify the variables that are more than 50% missing
mostlyMissingVariables <- colnames(training)[manyMissing > 0.5]
cat("There are", length(mostlyMissingVariables), "variables missing",
    "more than 50% of their values.")
## There are 67 variables missing more than 50% of their values.
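A quick peek at a few of the affected variable names (illustrative):
# Inspect a few of the mostly-missing variable names (illustrative)
head(mostlyMissingVariables)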
Remove from the training and testing sets every variable that is more than 5% missing. (Missingness in this data set is essentially all-or-nothing: the mostly-missing variables are almost entirely NA, so any threshold between 5% and 50% removes the same 67 variables identified above.)
training <- training[, manyMissing <= 0.05]
testing <- testing[, manyMissing <= 0.05]
The variable X simply numbers the rows, so it may be disregarded; the timestamps likewise appear to be irrelevant. Each of the 6 participants performed 10 repetitions in each of the 5 fashions (classes), so classe does not depend on user_name in a meaningful way, and user_name should also be eliminated as a predictor.
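Before dropping columns by position, a quick illustrative check that the first five columns really are the row index, user name, and timestamps:
# Sanity check before dropping by position (illustrative); expected:
# "X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp"
colnames(training)[1:5]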
training <- training[, -c(1:5)]
testing <- testing[, -c(1:5)]
# Exploratory question: does new_window have an effect on classe?
# qplot(new_window, classe, data = training)
Remove all variables from the training and testing sets that exhibit near-zero variance in the training set (this removes new_window, among others).
nsv <- nearZeroVar(training, saveMetrics = TRUE)
training <- training[ , nsv$nzv == FALSE]
testing <- testing[ , nsv$nzv == FALSE]
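How many variables the near-zero-variance filter flagged can be checked directly from the saved metrics (illustrative):
# Count and list the near-zero-variance variables that were removed (illustrative)
sum(nsv$nzv)
rownames(nsv)[nsv$nzv]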
classeIndex <- ncol(training)  # column index of the classe variable
preObj <- preProcess(training[, -classeIndex], method = "knnImpute")
trainImp <- predict(preObj, training[, -classeIndex])
# Put classe back into the data frame so the classification models below can be fit
trainImpFinal <- cbind(trainImp, training$classe)
colnames(trainImpFinal)[classeIndex] <- "classe"
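A side effect worth noting: caret's knnImpute method also centers and scales the predictors. A minimal illustrative check on a few columns of the imputed training data:
# knnImpute also centers and scales, so imputed predictors should have
# mean ~ 0 and sd ~ 1 (illustrative check on the first three columns)
round(colMeans(trainImp[, 1:3]), 3)
round(apply(trainImp[, 1:3], 2, sd), 3)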
We train several of the models suggested in the tutorial [1] (there used to predict the species variable in the iris dataset), with accuracy as the metric for model selection, and reuse its suggested visualization for comparing model accuracy. Note that we do not use any linear regression models because this is a classification problem, not prediction of a continuous response.
control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"

# Reset the seed before each fit so all models see identical CV folds
set.seed(7)
modFitTree <- train(classe ~ ., data = trainImpFinal, method = "rpart",
                    trControl = control, metric = metric)

set.seed(7)
modFitLda <- train(classe ~ ., data = trainImpFinal, method = "lda",
                   trControl = control, metric = metric)

set.seed(7)
modFitKnn <- train(classe ~ ., data = trainImpFinal, method = "knn",
                   trControl = control, metric = metric)
results <- resamples(list(lda = modFitLda, cart = modFitTree, knn = modFitKnn))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, cart, knn
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.6861413 0.7062525 0.7137531 0.7124623 0.7234100 0.7296196 0
## cart 0.4921928 0.5072995 0.5421468 0.5334936 0.5538950 0.5665761 0
## knn 0.9619565 0.9636616 0.9663836 0.9663681 0.9685802 0.9728261 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.6023583 0.6277628 0.6381047 0.6361558 0.6499648 0.6582571 0
## cart 0.3358518 0.3613374 0.4201349 0.4030304 0.4352448 0.4497154 0
## knn 0.9518533 0.9540209 0.9574849 0.9574564 0.9602490 0.9656215 0
# Compare model accuracy visually
dotplot(results)
# Summarize the best-performing model, modFitKnn
print(modFitKnn)
## k-Nearest Neighbors
##
## 14718 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 13246, 13246, 13245, 13248, 13246, 13246, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9663681 0.9574564
## 7 0.9535936 0.9412828
## 9 0.9416356 0.9261407
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
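The cross-validated accuracy across the tuned values of k can also be visualized directly from the train object (illustrative):
# Plot cross-validated accuracy against the number of neighbors k (illustrative)
plot(modFitKnn)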
The k-nearest-neighbors model modFitKnn has the highest cross-validated accuracy on the training set, so we evaluate this model on the held-out testing set.
# Apply the same preprocessing to the testing set using the preProcess object preObj
testingImp <- predict(preObj, testing[, -classeIndex])
# Predict classe values in the testing set
testingPredictions <- predict(modFitKnn, testingImp)
# Compare to the true values with a confusion matrix
testingTruth <- testing$classe
confusionMatrix(testingTruth, testingPredictions)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1376 5 8 3 3
## B 16 922 11 0 0
## C 1 15 825 12 2
## D 1 1 40 761 1
## E 0 3 4 8 886
##
## Overall Statistics
##
## Accuracy : 0.9727
## 95% CI : (0.9677, 0.9771)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9654
##
## Mcnemar's Test P-Value : 1.124e-05
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9871 0.9746 0.9291 0.9707 0.9933
## Specificity 0.9946 0.9932 0.9925 0.9896 0.9963
## Pos Pred Value 0.9864 0.9715 0.9649 0.9465 0.9834
## Neg Pred Value 0.9949 0.9939 0.9844 0.9944 0.9985
## Prevalence 0.2843 0.1929 0.1811 0.1599 0.1819
## Detection Rate 0.2806 0.1880 0.1682 0.1552 0.1807
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9908 0.9839 0.9608 0.9801 0.9948
The model is quite accurate on the testing data that we set aside for the preliminary round of cross-validation. At this stage, we estimate the out-of-sample accuracy to be 0.9727 (95% CI 0.9677 to 0.9771); equivalently, the estimated out-of-sample error is 1 - 0.9727 = 0.0273.
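A minimal sketch of extracting that estimate programmatically from the confusion matrix object (cm is a name introduced here for illustration):
# Estimated out-of-sample error = 1 - held-out accuracy (illustrative)
cm <- confusionMatrix(testingTruth, testingPredictions)
unname(1 - cm$overall["Accuracy"])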
validation <- read.csv("pml-testing.csv")
# Apply the same column filters that were applied to the training set
validation <- validation[, manyMissing <= 0.05]
validation <- validation[, -c(1:5)]
validation <- validation[, nsv$nzv == FALSE]
# Apply the same preprocessing using the preProcess object preObj
# (the last column of the validation set is problem_id, not classe)
validationImp <- predict(preObj, validation[, -classeIndex])
# Predict classe values in the validation set
validationPredictions <- predict(modFitKnn, validationImp)
# The validation set has no classe column, so a confusion matrix cannot be
# computed here:
# validationTruth <- validation$classe
# confusionMatrix(validationTruth, validationPredictions)
validationPredictions
## [1] B A A A A E D B A A B C B A E E A B B B
## Levels: A B C D E
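Optionally, the 20 predictions can be written to individual files for submission. The helper below is a hypothetical sketch; the function name writePredictions and the problem_id_N.txt naming scheme are assumptions, not part of this analysis:
# Hypothetical helper: write one prediction per file for submission
# (the function name and file-naming scheme are assumptions)
writePredictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
# writePredictions(validationPredictions)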