Inertial detectors are used to monitor subjects' activities. Data could be modeled to predict the type and quality of subjects' activity.
In this project, dumbbell lifting subjects were monitored by 9 degrees inertial measurement unit(IMU) attatched to 1: arm, 2: belt; 3:dumbbell. Sujbects are asked to perform the "right" lifting and "wrong" lifting.
Further information for the data generation could be found:http://groupware.les.inf.puc-rio.br/har.
The objectives of this analysis is to select features from hundreds of predictors and model the type of activity with key predictors.
Further, Model will be applied to test dataset.
Data preprocessing:
* remove zero and near-zero value;
* remove highly correlated variables using cor function;
* further feature plot to remove variables that are dominated by certain value.
Modeling:
* K-fold random forest was used with repeated cross validation.
Result:
* The model is very effective to predict;
* Based on 24 predictors (see below section "the variable importance."), the model accuracy is 0.98 and the test set accuracy is 0.98;
* Based on the model, the prediction of 20 sample is :B A C A A E D B A A B C B A E E A B B B, with one error.
<<<<<<< HEAD
# make this an external chunk that can be included in any file
library(knitr)
options(width = 100)
opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
options(xtable.type = 'html')
knit_hooks$set(inline = function(x) {
if(is.numeric(x)) {
round(x, getOption('digits'))
} else {
paste(as.character(x), collapse = ', ')
}
})
knit_hooks$set(plot = knitr:::hook_plot_html)
if (!file.exists("data")) {
dir.create("data")
}
urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(urlTrain, destfile = "./data/pml-training.csv", method = "curl")
download.file(urlTest, destfile = "./data/pml-testing.csv", method = "curl")
pmlTrain <- read.table("/Volumes/s/R programming/data/pml-training.csv", header = TRUE, sep = ",")
pmlTest <- read.table("/Volumes/s/R programming/data/pml-testing.csv", header = TRUE, sep = ",")
pmlTrain <- read.table("./data/pml-training.csv", header = TRUE, sep = ",")
pmlTest <- read.table("./data/pml-testing.csv", header = TRUE, sep = ",")
dim(pmlTrain);dim(pmlTest);table(pmlTrain$classe)
[1] 19622 160
[1] 20 160
A B C D E
5580 3797 3422 3216 3607
set.seed(38360)
# nzv in training set.
nzvTrain <- nearZeroVar(pmlTrain)
# nzv in testing set.
nzvTest <- nearZeroVar(pmlTest)
# merge the two nzv index.
nzv <- unique(c(nzvTrain, nzvTest))
# remove nzv values in training an test of 20 samples.
rDat <- pmlTrain[, -nzv]
test20 <- pmlTest[, -nzv]
rDatCls <- as.factor(rDat$classe)
# remove non-relevant factors.
rDat <- rDat[, -c(2, 5)]
test20 <- test20[, -c(2, 5)]
set.seed(38360)
rDatCor <- cor(rDat[,-57])
highlyCorrelated <- findCorrelation(rDatCor, cutoff=0.75)
rDatFinal <- rDat[, -highlyCorrelated]
test20Final <- test20[, -highlyCorrelated]
<<<<<<< HEAD
=======
names(rDatFinal)
>>>>>>> 639b3dd68fb9cbd68a80dd5ff1b0ad08e367ab5b
for (i in 1:34){
fplot <- featurePlot(x=rDatFinal[,i],
y = rDatFinal$classe,
plot="density",
alpha = 0.8)
print(fplot)
}
rDatFinal <- rDatFinal[, -c(1, 2, 3, 4, 5, 6, 23, 25, 28, 29)]
test20Final <- test20Final[, -c(1, 2, 3, 4, 5, 6, 23, 25, 28, 29)]
inTrain <- createDataPartition(rDatFinal$classe, p = 0.75, list = FALSE)
train <- rDatFinal[inTrain,]
test <- rDatFinal[-inTrain,]
fitControl <- trainControl(method = "repeatedcv",
number = 3,
repeats = 10)
rf_Fit <- train(classe ~ ., data = train,
method = "rf",
trControl = fitControl,
proximity = TRUE)
rf_Fit
Random Forest
14718 samples
24 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
No pre-processing
Resampling: Cross-Validated (3 fold, repeated 10 times)
Summary of sample sizes: 9812, 9811, 9813, 9812, 9812, 9812, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9757304 0.9692930
13 0.9706482 0.9628686
24 0.9614079 0.9511839
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
vImp <- varImp(rf_Fit)
order(vImp$importance);names(train)
[1] 10 8 19 2 1 9 18 6 12 22 7 23 11 21 14 24 20 5 3 16 15 13 17 4
[1] "gyros_belt_x" "gyros_belt_y" "gyros_belt_z" "magnet_belt_z"
[5] "roll_arm" "pitch_arm" "yaw_arm" "total_accel_arm"
[9] "gyros_arm_y" "gyros_arm_z" "magnet_arm_x" "magnet_arm_z"
[13] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
[17] "roll_forearm" "yaw_forearm" "total_accel_forearm" "accel_forearm_x"
[21] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y" "magnet_forearm_z"
[25] "classe"
predictions <- predict(rf_Fit, newdata = test[, -35])
confusionMatrix(predictions, test$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1389 22 1 1 0
B 0 921 14 0 2
C 1 3 837 24 1
D 5 0 3 776 4
E 0 3 0 3 894
Overall Statistics
Accuracy : 0.9823
95% CI : (0.9782, 0.9858)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9775
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9957 0.9705 0.9789 0.9652 0.9922
Specificity 0.9932 0.9960 0.9928 0.9971 0.9985
Pos Pred Value 0.9830 0.9829 0.9665 0.9848 0.9933
Neg Pred Value 0.9983 0.9929 0.9955 0.9932 0.9983
Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
Detection Rate 0.2832 0.1878 0.1707 0.1582 0.1823
Detection Prevalence 0.2881 0.1911 0.1766 0.1607 0.1835
Balanced Accuracy 0.9944 0.9832 0.9859 0.9811 0.9954
test20predict1 <- predict(rf_Fit, newdata = test20Final[, -28])
test20predict1
[1] B A C A A E D B A A B C B A E E A B B B
Levels: A B C D E
If more predictors were included(Model not shown here), 100% accuracy of the 20 sample could be attained. But I'm use a "loose" model here to explore the flexibility of the model.
In five experiments with different in-sample accuracy, every sample except No. 3 could be well predicted. For example, in above model with in-sample error of 0.9747, 19 test samples are predicted correctly. But sample No3 was mis-predicted from "classe B" to "classe C". Why is sample No3 vulnerable?
# data of No3 sample in test set,
sample3 <- test20Final[3, -25]
# generate two same row and labeled each row with "B" and "C"
sample3 <- rbind(sample3, sample3)
sample3$classe <- c("B", "C")
# melt sample 3 data.
sample3.melt <- melt(sample3, id.vars = "classe",
value.name = "var",
variable.name = "predictors")
# melt train data.
train.melt <- melt(train, id.vars = "classe",
value.name = "var",
variable.name = "predictors")
# plotting to observe the data of sample3 in classe B and classe C
g <- ggplot(data = train.melt,
aes(x = predictors, y= var))
g <- g + geom_boxplot()
g <- g + geom_point(data = sample3.melt,
aes(x = predictors, y = var, colour = classe),
size = 5, alpha = 0.5)
g <- g + facet_grid(classe~.)
g
From this plot, we notice that there maybe two reasons for the mis-prediction of sample 3. 1) Data in classe B and C are very similar; 2) the value of predictors of sample3 are in-between the “common” range of both B and C. So the model makes mistakes.