Human Activity Recognition and Classification

Exploratory analysis and Pre-processing

We set the seed to 69 for reproducibility and look at the outcome (classe) and the relative frequencies of its values.

set.seed(69)
table(dat$classe)/nrow(dat)

## 
##         A         B         C         D         E 
## 0.2843747 0.1935073 0.1743961 0.1638977 0.1838243

We tabulate a few more features:

table(dat$new_window)

## 
##    no   yes 
## 19216   406

table(is.na(dat$max_picth_belt))

## 
## FALSE  TRUE 
##   406 19216

We notice that many variables have a lot of NA’s and empty cases. From the previous tables, it appears that those NA’s are tied to the new_window variable: the 19216 NA’s in max_picth_belt match the 19216 rows where new_window is “no”.

table(dat$new_window)[1]/nrow(dat)

##        no 
## 0.9793089

The proportion above shows that 97.9308939% of the new_window values are “no”, and the same percentage of values is NA in those other variables.
Note that since NA’s make up 98% of those columns, imputing the missing values won’t make much sense: we can’t use 2% of the data to fill in the other 98%. It is wiser to delete those variables.
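The missing-value check above can be extended to every column at once. The following is a minimal sketch in base R, assuming a small made-up data frame (`toy`) in place of `dat` and a hypothetical cutoff of 0.5. Neither is part of the original analysis.

```r
# Toy data frame standing in for `dat` (values are made up for illustration).
toy <- data.frame(
  new_window = c("no", "no", "no", "yes"),
  roll_belt  = c(1.4, 1.5, 1.6, 1.7),
  max_roll   = c(NA, NA, NA, 2.1)   # summary feature, only filled on the "yes" row
)

# Fraction of missing values in each column.
na_frac <- colMeans(is.na(toy))
print(na_frac)

# Columns whose NA fraction exceeds a hypothetical cutoff of 0.5.
mostly_na <- names(na_frac)[na_frac > 0.5]
print(mostly_na)
```

On the real data, the same `colMeans(is.na(dat))` call would list the NA fraction of every feature, which makes the candidates for removal easy to spot.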

Feature selection

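We remove the mostly-NA features from our data frame. Restricted to the rows where new_window is “no”, these features are constant (all NA or empty), so a near-zero-variance filter flags them: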

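# On the rows with new_window == "no" the summary features are entirely NA or empty,
# so nearZeroVar flags them along with any other near-constant columns.
x<-dat[dat$new_window=="no",]
nzv<-nearZeroVar(x,saveMetrics = TRUE)
# Apply the same column filter to the training data and to the validation set.
dat<-dat[,!nzv$nzv]
val<-val[,!nzv$nzv]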

Now we remove the first 6 features (ID, name, timestamps, ...) because they are irrelevant to our classe prediction.

dat<-dat[,-c(1:6)]
val<-val[,-c(1:6)]

We check if we still have NA’s in our data frame:

table(is.na(dat))

## 
##   FALSE 
## 1039966

No more NA values.

Splitting the data

We split dat into training (70%) and testing (30%) datasets.

intrain<-createDataPartition(dat$classe,p=0.7,list=FALSE)
ts<-dat[-intrain,] ##Testing dataset
tr<-dat[intrain,]  ##Training dataset

dim(tr)

## [1] 13737    53

dim(ts)

## [1] 5885   53

dim(val)

## [1] 20 53

We plot the correlation matrix to see whether there is a large cluster of correlated features that could cause a problem:

corrplot(cor(tr[, -53]), order = "FPC", method = "color", type = "upper", 
         tl.cex = 0.5, tl.col = rgb(0, 0, 0))

One last step before we try our models: we need to center and scale our variables.

prep<-preProcess(tr,method = c("center","scale"))
tr<-predict(prep,tr)
ts<-predict(prep,ts)
val<-predict(prep,val)

We make sure to apply the same pre-processing, with the means and standard deviations estimated on the training set, to both the testing and validation datasets.
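To illustrate why the training statistics are reused, here is a minimal base R sketch of the same idea. It assumes two small made-up data frames (`train_x`, `test_x`) and uses `scale` instead of caret's `preProcess`. It is an illustration of the principle, not the code used in this report.

```r
# Toy numeric data (made-up values), split into "training" and "testing" rows.
train_x <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
test_x  <- data.frame(a = c(2, 5),       b = c(15, 50))

# Means and standard deviations are estimated on the training rows only.
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# The same mu and sigma are then applied to both sets.
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled  <- scale(test_x,  center = mu, scale = sigma)

print(colMeans(train_scaled))      # approximately 0 for each column
print(apply(train_scaled, 2, sd))  # 1 for each column
print(test_scaled)                 # generally not mean 0 / sd 1
```

The test columns do not end up with mean 0 and standard deviation 1, and that is expected: the test data is expressed on the scale the model was trained on, without any information leaking from it into the pre-processing.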

Fitting a Decision Tree model

The decision tree model (CART) can achieve good accuracy on classification tasks when using cross-validation and a well-chosen CP value.

We use trainControl to set the cross-validation to 5 folds repeated 5 times. Then we tune over different CP values, which we pass to train through its tuneGrid argument.

ctr<-trainControl(method="repeatedcv",number=5,repeats=5)
tune<-data.frame(cp=c(0.1,0.01,0.001,0.0001))
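To make the cost of this setup concrete, the sketch below builds repeated 5-fold assignments by hand in base R and counts the resulting model fits. It assumes a toy row count of 20 and plain random folds. caret builds its own folds internally and balances them on the outcome, so this only approximates what trainControl does.

```r
set.seed(69)
n       <- 20   # toy number of training rows (the report's tr has 13737)
k       <- 5    # folds, as in number = 5
repeats <- 5    # as in repeats = 5
cp_grid <- c(0.1, 0.01, 0.001, 0.0001)

# One random fold assignment per repeat: each row gets a fold label 1..k.
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n)))

# Held-out row indices for every (repeat, fold) pair.
holdouts <- list()
for (r in seq_len(repeats)) {
  for (f in seq_len(k)) {
    holdouts[[paste0("Fold", f, ".Rep", r)]] <- which(fold_ids[, r] == f)
  }
}

n_resamples <- length(holdouts)              # 5 folds x 5 repeats
n_fits      <- n_resamples * length(cp_grid) # one fit per resample and cp value
print(n_resamples)
print(n_fits)
```

With 25 resamples and 4 candidate CP values, the tuning step fits 100 trees before the final model is refit on the full training set. That stays cheap for rpart but grows quickly for heavier models.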

Now we fit our model:

mf1<-train(classe~.,data=tr,method="rpart",trControl=ctr,tuneGrid=tune)

We can visualize our model using the fancyRpartPlot function from the rattle package:

fancyRpartPlot(mf1$finalModel)

We can plot the model accuracy against its complexity parameter (CP):

plot(mf1)

We get our best accuracy with CP = 0.0001, the smallest value in the grid.

Prediction

We test the model using the testing data.

pre1<-predict(mf1,ts)
con<-confusionMatrix(pre1,ts$classe)
con$overall[1]

##  Accuracy 
## 0.9274427

con$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1628   46    8   12    5
##          B   25  982   33   22   24
##          C   16   59  957   31   16
##          D    4   25   22  880   26
##          E    1   27    6   19 1011

We get a good accuracy of 0.9274427, and the confusion table shows good sensitivities and specificities across the five classes.

plot(con$table)
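The per-class sensitivities and specificities can be recomputed directly from the confusion table. The sketch below re-enters the table printed above by hand in base R, so it does not depend on the `con` object. The variable names (`tab`, `tp`, `fn`, `fp`, `tn`) are chosen for illustration.

```r
# Confusion table copied from the output above
# (rows = prediction, columns = reference).
classes <- c("A", "B", "C", "D", "E")
tab <- matrix(
  c(1628,  46,   8,  12,    5,
      25, 982,  33,  22,   24,
      16,  59, 957,  31,   16,
       4,  25,  22, 880,   26,
       1,  27,   6,  19, 1011),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

tp <- diag(tab)                  # correctly predicted cases per class
fn <- colSums(tab) - tp          # cases of the class predicted as something else
fp <- rowSums(tab) - tp          # other classes predicted as this class
tn <- sum(tab) - tp - fn - fp    # everything else

accuracy    <- sum(tp) / sum(tab)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

print(round(accuracy, 7))
print(round(cbind(sensitivity, specificity), 4))
```

Computed this way, class B has the lowest sensitivity (roughly 0.86), while every specificity stays above 0.97.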

We also predict the Validation set classe:

pre2<-predict(mf1,val)
pre2

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

The validation predictions above scored 95% on the course quiz, i.e. 19 of the 20 validation cases were classified correctly.

Error rate

The error rate of 0.0725573 can be explained by random noise during the barbell exercise, with each of the 6 participants performing the task slightly differently from the others. It can also be caused by the features that were removed for excessive NA values.

Fitting Other Models

I tried applying Random Forests and GBM, but unfortunately the running time was too long because of the high dimensionality of the data, and I had to kill the process.

Conclusion

Human activity can be recognized and classified with good accuracy, even for a very specific task like barbell lifting. In the future, this may help health professionals study patients’ movement patterns and predict health problems.

Firas Ismail

02/05/2020

Overview

Getting the data