This is a project for the Practical Machine Learning course, which is part of Coursera's Data Science and Data Science: Statistics and Machine Learning Specializations.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. The data for this assignment come from Groupware@LES and were recorded by accelerometers on the belt, forearm, arm, and dumbbell of six participants. Each participant performed 10 repetitions of a biceps curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). The data come split into a training set and a test set.
Load packages & data
library(caret); library(dplyr); library(rattle)

# Download a CSV into the local data folder (if not already cached) and read it in;
# named load_data to avoid masking base R's load()
load_data <- function(name, url) {
  dest <- paste0("./data/1124_DS-ML-w4_ActivityRecognition/", name, ".csv")
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(dest)) download.file(url, destfile = dest, method = "curl")
  read.csv(dest)
}
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
training <- load_data("training", url)
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
testing <- load_data("testing", url)

Look at the dimensions and the class distribution of the training set:
dim(training)
[1] 19622   160
table(training$classe)

   A    B    C    D    E 
5580 3797 3422 3216 3607 

Create the testIN data set for later cross-validation:
set.seed(12345)
index <- createDataPartition(y = training$classe, p = 0.6, list = FALSE)
train <- training[index, ]
testIN <- training[-index, ]
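As a quick check that the 60/40 split preserved the class balance, the class proportions of the two subsets can be compared; a minimal sketch:

# Class proportions should be nearly identical in both subsets,
# since createDataPartition samples within each level of classe
round(prop.table(table(train$classe)), 3)
round(prop.table(table(testIN$classe)), 3)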
Recode blank strings (""), "#DIV/0!", and "<NA>" entries as NA, then count the proportion of non-NA values:

train[train == ""] <- NA
train[train == "#DIV/0!"] <- NA
train[train == "<NA>"] <- NA
# Unique per-column NA counts: columns either have no NAs or are almost entirely NA
NAs <- unique(apply(train, 2, function(x) sum(is.na(x))))
# Number of non-NA rows in the mostly-empty columns, and their non-NA proportion
NAs <- dim(train)[1] - NAs[2]
nonNAs <- NAs / dim(train)[1]
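The same per-column completeness can be read off more directly; a one-line sketch of an equivalent check:

# Proportion of non-NA values in each column (1 for complete columns)
round(1 - colMeans(is.na(train)), 3)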
Keeping only the columns with no NAs leaves the following variables:

NAcols <- unique(names(train[colSums(is.na(train)) > 0]))
train <- train %>% select(-all_of(NAcols))
cols <- dim(train)[2]
NAs <- sum(is.na(train))
names(train)
[1] "X"                    "user_name"            "raw_timestamp_part_1"
[4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
[7] "num_window" "roll_belt" "pitch_belt"
[10] "yaw_belt" "total_accel_belt" "gyros_belt_x"
[13] "gyros_belt_y" "gyros_belt_z" "accel_belt_x"
[16] "accel_belt_y" "accel_belt_z" "magnet_belt_x"
[19] "magnet_belt_y" "magnet_belt_z" "roll_arm"
[22] "pitch_arm" "yaw_arm" "total_accel_arm"
[25] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z"
[28] "accel_arm_x" "accel_arm_y" "accel_arm_z"
[31] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
[34] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell"
[37] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y"
[40] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y"
[43] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
[46] "magnet_dumbbell_z" "roll_forearm" "pitch_forearm"
[49] "yaw_forearm" "total_accel_forearm" "gyros_forearm_x"
[52] "gyros_forearm_y" "gyros_forearm_z" "accel_forearm_x"
[55] "accel_forearm_y" "accel_forearm_z" "magnet_forearm_x"
[58] "magnet_forearm_y" "magnet_forearm_z" "classe"
So there are now 60 columns in the training set, and none of them contains NAs.
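A one-line sanity check of both claims (a minimal sketch):

# Stop if the cleaned training set is not 60 NA-free columns
stopifnot(ncol(train) == 60, sum(is.na(train)) == 0)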
Next, drop any near-zero-variance predictors and the first six bookkeeping columns (row index, user name, timestamps, window number):

near0 <- nearZeroVar(train, saveMetrics = TRUE)
train <- train[, near0$zeroVar == FALSE & near0$nzv == FALSE][, -c(1:6)]

Apply the same column selection to the testIN set:

testIN <- testIN %>% select(-all_of(NAcols))
testIN <- testIN[, near0$zeroVar == FALSE & near0$nzv == FALSE][, -c(1:6)]

Test three 5-fold cross-validated models: a classification tree, a random forest, and gradient boosting.
trControl <- trainControl(method = "cv", number = 5)
model_CT <- train(classe ~ ., data = train, method = "rpart", trControl = trControl)
accurT <- getTrainPerf(model_CT)[[1]]
fancyRpartPlot(model_CT$finalModel, sub = "")
model_CT
CART 
11776 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 9421, 9421, 9420, 9421, 9421
Resampling results across tuning parameters:
cp Accuracy Kappa
0.03440911 0.4753702 0.30552993
0.05964246 0.4150861 0.20742048
0.11449929 0.3322835 0.07309721
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.03440911.
The model accuracy of 47.54% is very low.
set.seed(1234)
model_RF <- train(classe ~ ., data = train, method = "rf",
                  trControl = trControl, verbose = FALSE)
model_RF
Random Forest 
11776 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 9420, 9422, 9420, 9421, 9421
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9885364 0.9854970
27 0.9883664 0.9852831
52 0.9828474 0.9783011
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
Final model variable importance:
varImp(model_RF)
rf variable importance
only 20 most important variables shown (out of 52)
Overall
roll_belt 100.00
yaw_belt 75.42
magnet_dumbbell_z 68.21
pitch_belt 59.97
pitch_forearm 59.68
magnet_dumbbell_y 58.39
magnet_dumbbell_x 50.67
roll_forearm 50.56
magnet_belt_y 43.84
accel_belt_z 42.68
accel_dumbbell_y 41.52
roll_dumbbell 41.24
magnet_belt_z 40.51
accel_dumbbell_z 34.59
accel_forearm_x 32.25
roll_arm 32.16
accel_dumbbell_x 30.00
yaw_dumbbell 29.15
gyros_belt_z 28.40
magnet_forearm_x 26.84
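The same ranking can also be plotted straight from caret; a quick sketch:

# Dotplot of the 20 most important predictors
plot(varImp(model_RF), top = 20)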
The resampling accuracy of 98.85% is very good, so cross-validate using the testIN subset and then check the confusion matrix and accuracy.
Confusion matrix:
pred_Rf <- predict(model_RF, testIN)
confusionMatrix(pred_Rf, as.factor(testIN$classe))$table
          Reference
Prediction    A    B    C    D    E
         A 2231    6    0    0    0
         B    1 1511    9    0    0
         C    0    1 1353   16    4
         D    0    0    6 1269    3
         E    0    0    0    1 1435
confusionMatrix(pred_Rf, as.factor(testIN$classe))$overall[1]
 Accuracy 
0.9940097 
accur <- confusionMatrix(pred_Rf, as.factor(testIN$classe))$overall[[1]]
err <- 1 - accur

The cross-validation accuracy of 99.4% is very high, so the expected out-of-sample error is correspondingly low at 0.6%.
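Per-class performance is available from the same confusion matrix object; a brief sketch:

# Sensitivity and specificity for each of the five classes
confusionMatrix(pred_Rf, as.factor(testIN$classe))$byClass[, c("Sensitivity", "Specificity")]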
model_GBM <- train(classe ~ ., data = train, method = "gbm", trControl = trControl,
                   verbose = FALSE)
plot(model_GBM, lwd = 3)
model_GBM$finalModel
A gradient boosted model with multinomial loss function.
150 iterations were performed.
There were 52 predictors of which 51 had non-zero influence.
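The resampling accuracy quoted below can be extracted the same way as for the classification tree; a quick sketch (the variable name here is illustrative):

# Best cross-validated accuracy of the GBM fit
accurG_train <- getTrainPerf(model_GBM)[[1]]
accurG_train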
The resampling accuracy of 96.01% is quite good. Cross-validate using the testIN subset and then check the confusion matrix and accuracy.
Confusion matrix:
pred_GBM <- predict(model_GBM, testIN)
confusionMatrix(pred_GBM, as.factor(testIN$classe))$table
          Reference
Prediction    A    B    C    D    E
         A 2188   39    0    2    2
         B   34 1438   44    7    7
         C    4   31 1304   30   17
         D    6    3   18 1244   26
         E    0    7    2    3 1390
confusionMatrix(pred_GBM, as.factor(testIN$classe))$overall[1]
 Accuracy 
0.9640581 
accurG <- confusionMatrix(pred_GBM, as.factor(testIN$classe))$overall[[1]]

The accuracy is good but lower than the random forest's, and correspondingly the expected out-of-sample error of 3.59% is higher.
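Because all three models were trained with the same 5-fold scheme, caret can summarize their resampling results side by side; a minimal sketch (exact fold-by-fold comparability would require a shared index in trainControl, so treat this as indicative):

# Collect and summarize the cross-validated accuracies of the three models
results <- resamples(list(CT = model_CT, RF = model_RF, GBM = model_GBM))
summary(results)$statistics$Accuracy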
Finally, look at the random forest's predictions on the original testing data set:
test_RF <- predict(model_RF, testing)
test_RF
 [1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
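If the 20 predictions need to be saved for submission, they can be written out one file per test case; a minimal sketch (the file naming scheme is an assumption, not part of the code above):

# Hypothetical export: one text file per test case
for (i in seq_along(test_RF)) {
  writeLines(as.character(test_RF[i]), paste0("problem_", i, ".txt"))
}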