The 'Quantifided-Self' movement is concerned with measuring human activities every day using wearable computing gadgets. In one such study (http://groupware.les.inf.puc-rio.br/har) a group of volunteers were asked to perform dumb bell exercises and their body movements were recorded using accelerometers attached to their bodies. There are correct and incorrect ways of doing the exercise. These are classified into 5 categories, named 'A' to 'E'.
The core of the project is to predict the category of movement from the biometric data. The data set consists of 19622 rows of 159 predictor variables and a columm for the activity indicator. We will first load the data and do a quick exploratory analysis.
# load necessary libraries
library(VIM)
library(corrplot)
library(caret)
Load training data set:
da = read.csv("pml-training.csv")
dim(da)
## [1] 19622 160
# str(da)
names(da)[1:10]
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt" "pitch_belt"
## [10] "yaw_belt"
sum(complete.cases(da))
## [1] 406
We see that only 460 out of the 19622 rows are complete in every respect. This indicates the presense of a large number of NAs in the data. We do some initial cleanup by removing the row ids, user names etc that will not be used in this analysis.
The matrixplot() function in VIM package is an easy way to visualize a data set. Since there are a large number of rows, we will visualize only the first 1000 rows:
# remove ids, user names, time stamps etc
da = da[-(1:7)]
dim(da)
## [1] 19622 153
NA values are indicated in red in a matrix plot. The plot below shows that a large number of colums are blank.
matrixplot(da[1:1000, ]) # takes long time
The algorithm used to identify the blank columns is: if a colum has more than 100 NAs, then drop the column.
# some columns are full of NAs, first identify them
nas = apply(da, 2, function(x) sum(is.na(x)))
w = which(nas > 100)
length(w)
## [1] 67
# remove the NA columns
da = da[-w]
dim(da)
## [1] 19622 86
We also see from the matrix plot that many of the columns are mostly blank, except for a few numbers. There are also cells indicating a #DIV by 0 error. We remove these factor variables also.
fac = sapply(da, is.factor)
w = which(fac)
names(w) = NULL
length(w)
## [1] 34
w = w[-length(w)] # retain classe column
# remove the columns permanently
da = da[-w]
dim(da)
## [1] 19622 53
Now we have a neat 52 variables to model with. Column #53 is the outcome label. But again, we may not need all the 19622 rows for modeling. We will pick a random sample of 5000 rows for our work: .
# before subsetting, record the distribution of the groups
t1 = table(da$classe)
t1/sum(t1)
##
## A B C D E
## 0.2844 0.1935 0.1744 0.1639 0.1838
# take only a subset of 5000 rows for modeling
set.seed(321)
samp = sample(1:nrow(da), 5000)
da = da[samp, ]
t2 = table(da$classe)
t2/sum(t2) # same as t1
##
## A B C D E
## 0.2880 0.1828 0.1708 0.1618 0.1966
We double check the accuracy of the subset by checking the class distribution before and after subsetting. The distribution tables are very nearly the same, so we can proceed.
We next find the correlations among the columns:
# find correlations among the columns
cormat = abs(cor(da[-53]))
corrplot(cormat)
Apart from the diagonals, there are some highly correlated columns. This may be useful later to choose columns for the model. But for now, we decide to ignore the correlations and use all the 52 explanatory variables for modeling.
Next we visualize the range and spread of values in the columns by drawing box plots:
# visualize the spread fo the variables
boxplot(da[-53], xaxt = "n")
mtext(side = 1, text = 1:52, at = 1:52, cex = 0.5)
There is lot of variation in the range of variables. This hints at the need for centering and scaling.
We do a traditional 70:30 split for model evaluation.
set.seed(123)
index = createDataPartition(da$classe, p = 0.7, list = F)
trainset = da[index, ]
testset = da[-index, ]
t3 = table(trainset$classe)
t3/sum(t3)
##
## A B C D E
## 0.2878 0.1828 0.1708 0.1619 0.1967
Thus we have verified that the distribution of labels in the training set are nearly the same as in the original data set.
We tried various models and evaluated them for performance. The details are omitted here due to space constraints. The cross validation accuracy rates were:
Thus we chose random forest model as the best predictor for this data set.
# fit a random forest model
fit1 = train(classe ~ ., data = trainset, method = "rf") # takes a long time
fit1
fit1$finalModel
Random Forest
3503 samples
52 predictors
5 classes: 'A', 'B', 'C', 'D', 'E'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 3503, 3503, 3503, 3503, 3503, 3503, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa Accuracy SD Kappa SD
2 0.948 0.934 0.00688 0.0087
27 0.952 0.94 0.00676 0.00855
52 0.943 0.928 0.0106 0.0134
# validate using the split test set
pr1 = predict(fit1, testset)
confusionMatrix(pr1, testset$classe)
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 27
OOB estimate of error rate: 3.48%
Confusion matrix:
A B C D E class.error
A 976 7 2 2 2 0.01314459
B 29 642 18 1 1 0.07091172
C 0 15 567 11 2 0.04705882
D 1 2 17 572 1 0.03541315
E 0 0 6 5 624 0.01732283
The out of band error rate is estimated to be 3.48 %.
The accuracy of our model is 96.46 %.
The kappa statistic is high at 0.9552.
This means all the predicions mutually agree to a high degree. The p-value indicates the null hypothesis can be rejected.
Overall Statistics
Accuracy : 0.9646
95% CI : (0.9539, 0.9734)
No Information Rate : 0.2826
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.9552
Mcnemar's Test P-Value : 0.02647
As a final step, we load the test set provided for project evaluation, pml-testing.csv and perform the exact subsetting on columns as we did on the training set. (Details are not shown here to save space). We call it da2. Then we use our model to predict the 20 activity labels.
# use it on the actual test set to be submitted for project evaluation
pr2 = predict(fit1, da2[-53]) # remove row index
pr2
[1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
We fitted a random forest model to the classification problem of human activity. When the final outcomes were uploaded to the course website, all the 20 predictions turned out to be correct.