This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways (classified in the “classe” variable: A - Correctly; B-E - Incorrectly). More information on the data source is available from this website (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here. With these data (19,622 observations of 160 variables) we must develop a machine learning algorithm that “learns” how to classify new, not yet classified, data. This exercise involves using our prediction model to predict 20 different test cases, available here.
This document describes:
how I built my model,
how I used cross validation,
what out-of-sample error I obtained,
and the choices I made along the way.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, the goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). This information is given in the “classe” variable in the training set.
More specifically, the aim is to build a prediction algorithm that identifies these categories in an independent data set.
##Load training data
training <- read.table("./projData/training.csv", sep = ",", header = TRUE)
dim(training)
## [1] 19622 160
##Load testing data
testing <- read.table("./projData/testing.csv", sep = ",", header = TRUE)
dim(testing)
## [1] 20 160
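Depending on how the raw CSV files were exported, missing values may also appear as empty strings or “#DIV/0!” entries; if so, one could re-read the files flagging those as NA. A minimal sketch, assuming the same file paths as above (trainingNA and testingNA are names introduced here for illustration):
# optional sketch: re-read the files flagging empty strings and "#DIV/0!" entries as NA
trainingNA <- read.csv("./projData/training.csv", na.strings = c("NA", "", "#DIV/0!"))
testingNA  <- read.csv("./projData/testing.csv",  na.strings = c("NA", "", "#DIV/0!"))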
I initially took a peek at the test data set, to get a feeling for what it is about. The strategy I pursued involved finding the relevant covariates in the testing dataset (there is no point in considering zero-variance covariates in the classification algorithm…), as many of the variables carry no information at all. For this, I used the \(nearZeroVar\) function, available in the R package caret.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
nsv_ts <- nearZeroVar(testing, saveMetrics=TRUE)
#nsv_ts
The following code picks the relevant covariates for the prediction algorithm.
#define relevant covariates
covariates <- rownames(nsv_ts[nsv_ts$zeroVar == 0,])
## covariates
## summary(covariates)
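As a cross-check, the same selection can be made with the near-zero-variance flag (nzv) instead of the strict zero-variance one, and any differences between the two lists inspected. An optional sketch (covariates_nzv is a name introduced here):
# cross-check: select covariates using the near-zero-variance flag instead
covariates_nzv <- rownames(nsv_ts[!nsv_ts$nzv, ])
# setdiff(covariates, covariates_nzv)  # covariates kept above but flagged as near-zero variance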
In an effort to avoid overfitting, I also disregarded some variables not directly related to the Human Activity Recognition exercise, such as user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, and num_window. I also disregarded the counters X and problem_id.
#removes spurious variables from covariates
covariates <- covariates[7:58]
covariates
## [1] "roll_belt" "pitch_belt" "yaw_belt"
## [4] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [7] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [10] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [13] "magnet_belt_z" "roll_arm" "pitch_arm"
## [16] "yaw_arm" "total_accel_arm" "gyros_arm_x"
## [19] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
## [22] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
## [25] "magnet_arm_y" "magnet_arm_z" "roll_dumbbell"
## [28] "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
## [31] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [34] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [37] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [40] "roll_forearm" "pitch_forearm" "yaw_forearm"
## [43] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [46] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
## [49] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [52] "magnet_forearm_z"
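As an alternative to the positional selection above (covariates[7:58]), the bookkeeping columns can be dropped by name, which is more robust if the column order ever changes. A sketch, assuming the bookkeeping columns are the ones named in the text above (drop_cols and covariates_alt are names introduced here; new_window is included defensively):
# name-based alternative to the positional covariates[7:58] selection
drop_cols <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
               "cvtd_timestamp", "new_window", "num_window", "problem_id")
covariates_alt <- setdiff(rownames(nsv_ts[nsv_ts$zeroVar == 0, ]), drop_cols)
# identical(sort(covariates_alt), sort(covariates))  # should agree with the positional selection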
The next code chunk applies this set of variables to filter the relevant covariates in the training data set.
# add the outcome variable "classe" to the list of covariates
covariates[length(covariates)+1] <- "classe"
trainWork <- training[,covariates]
Looking out for missing values… there aren’t any, so there is nothing to worry about here.
NAnumber <- sapply(trainWork, function(x) sum(is.na(x)))
sum(NAnumber)
## [1] 0
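Had there been missing values, a simple option would have been to drop covariates with a high proportion of NAs before modelling. A hypothetical sketch, not needed here (naFraction and the 50% threshold are introduced for illustration):
# hypothetical: drop covariates with more than 50% missing values
naFraction <- colMeans(is.na(trainWork))
trainWork <- trainWork[, naFraction <= 0.5]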
In the next step, I plotted a histogram of every variable, to assess the skewness and normality of the covariates (the command for this is given below, although left inactive). I then standardized all covariates (with the \(preProcess\) function, also from the caret package), centering and scaling them to better suit the modelling that follows.
## for(i in 1:dim(trainWork)[2]){hist(as.numeric(trainWork[,i]), main = covariates[i])}
preObj <- preProcess(trainWork[,-53], method = c("center","scale"))
trainSt <- predict(preObj, trainWork[,-53])
trainSt$classe <- trainWork$classe
## summary(trainSt)
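A quick sanity check on the standardization: after centering and scaling, each covariate should have mean close to 0 and standard deviation close to 1 (column 53 is the outcome “classe”):
# sanity check: standardized covariates should have mean ~0 and sd ~1
summary(colMeans(trainSt[, -53]))
summary(apply(trainSt[, -53], 2, sd))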
My first approach to training the model was to fit a default random forest with the \(caret\) package, keeping faithful to the approach of Veloso et al., who justified this methodology by “the characteristic noise in the sensor data”. For that, I split the training set 60/40 into a training and a validation partition.
set.seed(12345)
inTrain = createDataPartition(y = trainSt$classe, p = 0.6, list = FALSE)
trainCDP = trainSt[ inTrain,]
testCDP = trainSt[-inTrain,]
set.seed(123)
modFit1 <- train(classe ~ ., method = "rf", data = trainCDP)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
saveRDS(modFit1, file="modFitRandom1.rds"); ##myVariableName = readRDS("myFile.rds")
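Optionally, the resampling summary and the out-of-bag (OOB) error estimate of the final forest can be inspected before evaluating on the hold-out set (left inactive here, in the same spirit as the commented lines above):
## modFit1              # resampling summary across the tuned mtry values
## modFit1$finalModel   # OOB error estimate of the final random forest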
With this model, I got the following out-of-sample error statistics:
pred1 <- predict(modFit1, newdata = testCDP)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
confM1 <- confusionMatrix(pred1,testCDP$classe)
confM1
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2228 11 0 0 0
## B 4 1501 7 0 3
## C 0 6 1356 17 3
## D 0 0 5 1267 4
## E 0 0 0 2 1432
##
## Overall Statistics
##
## Accuracy : 0.9921
## 95% CI : (0.9899, 0.9939)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.99
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9888 0.9912 0.9852 0.9931
## Specificity 0.9980 0.9978 0.9960 0.9986 0.9997
## Pos Pred Value 0.9951 0.9908 0.9812 0.9929 0.9986
## Neg Pred Value 0.9993 0.9973 0.9981 0.9971 0.9984
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2840 0.1913 0.1728 0.1615 0.1825
## Detection Prevalence 0.2854 0.1931 0.1761 0.1626 0.1828
## Balanced Accuracy 0.9981 0.9933 0.9936 0.9919 0.9964
I got an overall “out of sample” accuracy of 0.9920979 (i.e., an estimated out-of-sample error of roughly 0.8%), which is indeed a very high figure.
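The corresponding error estimate can also be read straight off the confusion matrix object:
# estimated out-of-sample error = 1 - hold-out accuracy
1 - confM1$overall["Accuracy"]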
Still, I tried to improve this model by means of 10-fold cross-validation repeated 10 times. As you can see below, this improved the overall accuracy of the prediction algorithm, but only marginally (in line with what Jeff discussed in his lectures).
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
modFit2 <- train(classe ~ ., data = trainCDP,
                 method = "rf",
                 trControl = fitControl,
                 verbose = FALSE)
saveRDS(modFit2, file="modFitRandom2.rds")
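Since the repeated cross-validation fit is computationally heavy, on later runs one could reload the saved model instead of refitting. An optional sketch using the file saved above:
# optional: on later runs, reload the saved model instead of refitting
if (file.exists("modFitRandom2.rds")) modFit2 <- readRDS("modFitRandom2.rds")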
pred2 <- predict(modFit2, newdata = testCDP)
confM2 <- confusionMatrix(pred2,testCDP$classe)
confM2
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2229 11 0 0 0
## B 3 1502 7 0 2
## C 0 5 1358 18 3
## D 0 0 3 1266 5
## E 0 0 0 2 1432
##
## Overall Statistics
##
## Accuracy : 0.9925
## 95% CI : (0.9903, 0.9943)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9905
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9987 0.9895 0.9927 0.9844 0.9931
## Specificity 0.9980 0.9981 0.9960 0.9988 0.9997
## Pos Pred Value 0.9951 0.9921 0.9812 0.9937 0.9986
## Neg Pred Value 0.9995 0.9975 0.9985 0.9970 0.9984
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2841 0.1914 0.1731 0.1614 0.1825
## Detection Prevalence 0.2855 0.1930 0.1764 0.1624 0.1828
## Balanced Accuracy 0.9983 0.9938 0.9943 0.9916 0.9964
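To make the (marginal) gain explicit, the two hold-out accuracies can be compared side by side:
# compare hold-out accuracy of the default fit and the repeated-CV fit
c(default = unname(confM1$overall["Accuracy"]),
  repeatedCV = unname(confM2$overall["Accuracy"]))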
The next table lists the most relevant variables for the classification algorithm.
varImp(modFit2)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## pitch_forearm 56.80
## yaw_belt 55.00
## pitch_belt 44.80
## magnet_dumbbell_z 43.15
## magnet_dumbbell_y 41.36
## roll_forearm 40.36
## accel_dumbbell_y 21.40
## accel_forearm_x 19.15
## magnet_dumbbell_x 18.79
## roll_dumbbell 17.76
## magnet_belt_z 15.51
## accel_dumbbell_z 14.49
## accel_belt_z 14.29
## magnet_forearm_z 13.45
## gyros_belt_z 11.62
## total_accel_dumbbell 11.50
## magnet_belt_x 11.01
## magnet_belt_y 10.86
## yaw_arm 10.63
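The same ranking can also be visualised with caret’s plot method for variable-importance objects (left inactive here to keep the report compact):
## plot(varImp(modFit2), top = 20)   # dotplot of the 20 most important variables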
This last model is used to answer the Course Project Submission part. To do so, one must prepare the Test data set so that it can be processed by our algorithm (using the same transformations as those used to train the model).
The following table gives, according to the classification algorithm, the probability of each observation belonging to each “classe” category and, for each case, puts forward the classe with the highest probability.
testWork <- testing[,covariates[1:(length(covariates)-1)]]
testSt <- predict(preObj, testWork)
##run prediction on test data
predSubmit <- predict(modFit2, newdata = testSt, type = "prob")
# pick, for each test case, the class with the highest predicted probability
results <- cbind(predSubmit, classe = LETTERS[apply(predSubmit, 1, which.max)])
results
## A B C D E classe
## 1 0.054 0.766 0.140 0.020 0.020 B
## 2 0.976 0.016 0.004 0.002 0.002 A
## 3 0.110 0.784 0.048 0.008 0.050 B
## 4 0.924 0.004 0.032 0.038 0.002 A
## 5 0.980 0.008 0.010 0.000 0.002 A
## 6 0.010 0.016 0.048 0.012 0.914 E
## 7 0.016 0.002 0.042 0.918 0.022 D
## 8 0.038 0.792 0.040 0.098 0.032 B
## 9 0.998 0.002 0.000 0.000 0.000 A
## 10 0.990 0.002 0.000 0.008 0.000 A
## 11 0.024 0.790 0.116 0.034 0.036 B
## 12 0.002 0.034 0.904 0.018 0.042 C
## 13 0.002 0.984 0.002 0.000 0.012 B
## 14 1.000 0.000 0.000 0.000 0.000 A
## 15 0.004 0.018 0.014 0.008 0.956 E
## 16 0.008 0.030 0.000 0.006 0.956 E
## 17 0.948 0.000 0.000 0.000 0.052 A
## 18 0.020 0.872 0.018 0.082 0.008 B
## 19 0.150 0.828 0.004 0.018 0.000 B
## 20 0.000 1.000 0.000 0.000 0.000 B
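As a sanity check, the classes picked by maximum probability should coincide with caret’s direct class predictions (predClass is a name introduced here):
# sanity check: direct class predictions should match the max-probability choices
predClass <- predict(modFit2, newdata = testSt)
identical(as.character(predClass), as.character(results$classe))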
Finally, the last code chunk produces the files submitted to Coursera’s Course Project: Submission.
answers <- as.character(results$classe)
pml_write_files = function(x){
    n = length(x)
    for(i in 1:n){
        filename = paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
    }
}
pml_write_files(answers)