Human activity recognition research typically focuses on discriminating between different activities; here we instead investigate how well an activity was performed by six wearers of electronic devices. The data for this study were obtained from http://groupware.les.inf.puc-rio.br/har .
The six participants were between 20 and 28 years old and had little weight lifting experience. They were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions, defined as the following classes:
Class A: exactly according to the specification
Class B: throwing the elbows to the front
Class C: lifting the dumbbell only halfway
Class D: lowering the dumbbell only half way
Class E: throwing the hips to the front
Class A corresponds to the specified execution of the exercise, while the other classes correspond to common mistakes. To ensure data quality, an experienced weight lifter supervised the participants.
The main objective of this project is to predict the manner in which the participants did the exercise, that is, which of the five fashions of the Unilateral Dumbbell Biceps Curl was performed. This is the classe variable in the dataset, and any of the other variables may be used as predictors.
## Load packages
library(knitr)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(lattice)
library(ggplot2)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(survival)
library(Formula)
library(plyr)
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
# downloading data
if(!file.exists("./training.csv")){
url.training <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url.training, destfile = "./training.csv")
}
if(!file.exists("./testing.csv")){
url.testing <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url.testing, destfile = "./testing.csv")
}
## Training Data
training_data <- read.csv("./training.csv", na.strings=c("NA",""),stringsAsFactors = FALSE)
## Testing Data
testing_data <- read.csv("./testing.csv", na.strings=c("NA",""),stringsAsFactors = FALSE)
After loading the libraries, the data files are downloaded if they do not already exist in the working directory. The training dataset contains 160 variables with 19622 observations, and the testing dataset contains 20 observations used to evaluate the predictions of the final classification model.
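As a quick sanity check, the dimensions of both data frames can be verified directly:
dim(training_data)   # 19622 rows, 160 columns
dim(testing_data)    # 20 rows, 160 columns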
## Data cleaning
# replace "#DIV/0!" strings (spreadsheet division errors) with NA
div0rec <- sapply(training_data, function(x) x=="#DIV/0!")
training_data[div0rec] <- NA
# convert yes/no into 1/0
testing_data$new_window = 1*(testing_data$new_window=="yes")
testing_data$new_window <- as.factor(testing_data$new_window)
training_data$new_window = 1*(training_data$new_window=="yes")
training_data$new_window <- as.factor(training_data$new_window)
training_data$classe <- factor(training_data$classe)
## Removing variables
# remove variables with either 0 or NA
unwanted1 <- names(training_data) %in% c("kurtosis_yaw_belt", "kurtosis_yaw_dumbbell", "kurtosis_yaw_forearm",
"skewness_yaw_belt", "skewness_yaw_dumbbell", "skewness_yaw_forearm",
"amplitude_yaw_belt", "amplitude_yaw_dumbbell", "amplitude_yaw_forearm")
training_data_1 <- training_data[!unwanted1]
# remove irrelevant variables
unwanted2 <- names(training_data_1) %in% c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
"cvtd_timestamp")
training_data_1 <- training_data_1[!unwanted2]
# remove variables that are mostly NA (> 95%)
index.NA <- sapply(training_data_1, is.na)
Sum.NA <- colSums(index.NA)
percent.NA <- Sum.NA/(dim(training_data_1)[1])
to.remove <- percent.NA>.95
training_cleaned <- training_data_1[,!to.remove]
##str(training_cleaned)
We first converted "#DIV/0!" strings to NA; then the yes/no categories in the new_window variable were converted to 1/0. The second step is likely unnecessary, since the variable would be implicitly converted to 1/0 in the model, but I did it anyway. The outcome variable classe is a character variable due to how the data was read, so it was converted to a factor variable.
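A quick check (assuming the chunk above has been run) confirms the conversions:
# no "#DIV/0!" strings should remain, and classe should now be a factor
sum(training_data == "#DIV/0!", na.rm = TRUE)
class(training_data$classe)
table(training_data$classe)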
There were 9 variables consisting only of 0 or NA, namely,
kurtosis_yaw_belt,
kurtosis_yaw_dumbbell,
kurtosis_yaw_forearm,
skewness_yaw_belt,
skewness_yaw_dumbbell,
skewness_yaw_forearm,
amplitude_yaw_belt,
amplitude_yaw_dumbbell, and
amplitude_yaw_forearm.
We know those variables will not help in terms of classification, so they were removed.
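For reference, these columns can also be identified programmatically rather than by name. A minimal sketch, assuming the "#DIV/0!" replacement above has already been applied:
# flag columns that are entirely NA, or whose non-NA values are all zero
zero.or.na <- sapply(training_data, function(x) {
    v <- x[!is.na(x)]
    length(v) == 0 || all(v %in% c(0, "0", "0.00"))
})
names(training_data)[zero.or.na]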
In addition, the X variable is just a sequence from 1 to 19622, i.e., a row index. The user_name variable contains the names of the participants, and three variables record the date/time at which the activity was performed. Since we do not want user names and timestamps to drive the classification, these variables were also removed from the dataset.
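A one-line check confirms that X is simply the row number:
# X should equal 1, 2, ..., 19622
all(training_data$X == seq_len(nrow(training_data)))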
There were 91 variables with more than 95% of the data missing; these were removed as well. If a classification model were built on such variables, the relevant value would be missing most of the time and the classification rules could not be applied, so building a model on variables that are mostly missing is not practical.
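After the three removal steps, a quick dimension check (assuming the chunks above) shows the variable count dropping from 160 to 55:
dim(training_data)      # before cleaning: 19622 x 160
dim(training_cleaned)   # after cleaning:  19622 x 55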
# Data partitioning: training/testing
set.seed(10)
n <- length(training_cleaned)
inTrain <- createDataPartition(training_cleaned$classe, p = 0.6)[[1]]
# split off the held-out rows first, before training_cleaned is overwritten
testing_cleaned <- training_cleaned[-inTrain,]
training_cleaned <- training_cleaned[inTrain,]
##summary(training_cleaned)
Since the provided testing data does not contain the actual classe variable, we cannot use it to assess the performance of the classification model. Therefore the cleaned training data was split: 60% became the training set, and 40% became the testing (validation) set.
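Because createDataPartition samples within each class, the class proportions should be nearly identical in the two subsets. A quick check:
# class proportions in the 60/40 split should match closely
round(prop.table(table(training_cleaned$classe)), 3)
round(prop.table(table(testing_cleaned$classe)), 3)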
## Regression tree models
# setting option for 10-fold CV
train_control <- trainControl(method="cv", number=10)
# fit the model
set.seed(100)
modelFit1 <- train(classe ~., method="rpart", data=training_cleaned,
trControl = train_control)
result1<- confusionMatrix(testing_cleaned$classe, predict(modelFit1, newdata=testing_cleaned))
# fit the model after preprocessing
modelFit2 <- train(classe ~., method="rpart", preProcess=c("center", "scale"),data=training_cleaned,
trControl = train_control)
result2<- confusionMatrix(testing_cleaned$classe, predict(modelFit2, newdata=testing_cleaned))
result1
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1235 14 98 0 3
## B 398 312 187 0 0
## C 407 23 405 0 0
## D 334 132 299 0 0
## E 124 96 235 0 413
##
## Overall Statistics
##
## Accuracy : 0.5016
## 95% CI : (0.4872, 0.516)
## No Information Rate : 0.5298
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3466
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4944 0.54073 0.3309 NA 0.99279
## Specificity 0.9481 0.85863 0.8768 0.8378 0.89416
## Pos Pred Value 0.9148 0.34783 0.4850 NA 0.47581
## Neg Pred Value 0.6247 0.93059 0.7889 NA 0.99922
## Prevalence 0.5298 0.12238 0.2596 0.0000 0.08823
## Detection Rate 0.2619 0.06617 0.0859 0.0000 0.08759
## Detection Prevalence 0.2863 0.19024 0.1771 0.1622 0.18409
## Balanced Accuracy 0.7213 0.69968 0.6039 NA 0.94347
result2
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1235 14 98 0 3
## B 398 312 187 0 0
## C 407 23 405 0 0
## D 334 132 299 0 0
## E 124 96 235 0 413
##
## Overall Statistics
##
## Accuracy : 0.5016
## 95% CI : (0.4872, 0.516)
## No Information Rate : 0.5298
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3466
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4944 0.54073 0.3309 NA 0.99279
## Specificity 0.9481 0.85863 0.8768 0.8378 0.89416
## Pos Pred Value 0.9148 0.34783 0.4850 NA 0.47581
## Neg Pred Value 0.6247 0.93059 0.7889 NA 0.99922
## Prevalence 0.5298 0.12238 0.2596 0.0000 0.08823
## Detection Rate 0.2619 0.06617 0.0859 0.0000 0.08759
## Detection Prevalence 0.2863 0.19024 0.1771 0.1622 0.18409
## Balanced Accuracy 0.7213 0.69968 0.6039 NA 0.94347
The accuracy of the two regression tree models is not good at all: only around 50%, which is not acceptable. Preprocessing (centering and scaling) did not improve the tree-based predictions, so we will try a random forest next.
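Since rattle is already loaded, one way to see why the tree struggles (the confusion matrix above shows it never predicts class D, for example) is to plot the fitted tree. A minimal sketch:
# plot the final rpart tree from caret; its shallow splits explain the poor accuracy
fancyRpartPlot(modelFit1$finalModel, sub = "")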
# Get correlation matrix and find the variables with high correlation with classe
# treat classe as numeric so cor() can be applied (a rough screening device only)
k <- training_cleaned
k$classe <- as.numeric(training_cleaned$classe)
cormatrix <- data.frame(cor(k[,-c(1)]))   # drop the new_window factor in column 1
cormatrix$name <- names(k)[2:55]
cor.table <- data.frame(cbind(cormatrix$classe, cormatrix$name))  # cbind() coerces cor to character
names(cor.table) <- c("cor", "name")
# show the variables with the highest correlation with classe
tail(arrange(cor.table, cor), 8)
## cor name
## 47 0.0901674347186873 roll_arm
## 48 0.122251338917828 accel_dumbbell_x
## 49 0.14860620640767 total_accel_forearm
## 50 0.151150427414134 magnet_dumbbell_z
## 51 0.241636410945395 accel_arm_x
## 52 0.291591357120721 magnet_arm_x
## 53 0.350902990197275 pitch_forearm
## 54 1 classe
# fit a model using only the variables most correlated with classe (r > 0.1)
modelFit3 <- randomForest(classe ~ pitch_forearm + magnet_arm_x + accel_arm_x + total_accel_forearm + magnet_dumbbell_z + accel_dumbbell_x, data=training_cleaned)
result3 <- confusionMatrix(testing_cleaned$classe, predict(modelFit3, newdata=testing_cleaned))
# try full model
modelFit4 <- randomForest(classe ~., data=training_cleaned)
result4<- confusionMatrix(testing_cleaned$classe, predict(modelFit4, newdata=testing_cleaned))
result3
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1350 0 0 0 0
## B 0 897 0 0 0
## C 0 0 835 0 0
## D 0 0 0 765 0
## E 0 0 0 0 868
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9992, 1)
## No Information Rate : 0.2863
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2863 0.1902 0.1771 0.1622 0.1841
## Detection Rate 0.2863 0.1902 0.1771 0.1622 0.1841
## Detection Prevalence 0.2863 0.1902 0.1771 0.1622 0.1841
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
result4
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1350 0 0 0 0
## B 0 897 0 0 0
## C 0 0 835 0 0
## D 0 0 0 765 0
## E 0 0 0 0 868
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9992, 1)
## No Information Rate : 0.2863
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2863 0.1902 0.1771 0.1622 0.1841
## Detection Rate 0.2863 0.1902 0.1771 0.1622 0.1841
## Detection Prevalence 0.2863 0.1902 0.1771 0.1622 0.1841
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
As an initial check on the random forest approach, the model was first applied to a short list of variables that seemed most likely to predict classe well. Predicting classe with only the variables most correlated with it (r > 0.1) gives a classification model with an accuracy of 0.879.
For cross-checking and validation, we then applied the model built on all of the variables to the testing data set and obtained an accuracy of 0.997, which is very good.
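As a cross-check on the correlation-based screening, we can also inspect the variable importance computed by the full random forest itself. A minimal sketch:
# mean decrease in Gini impurity; higher values mean more useful splits
imp <- importance(modelFit4)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)
varImpPlot(modelFit4, n.var = 10)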
The random forest technique works better than a regression tree in this case; the results obtained with random forests were highly accurate on the testing set.
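Finally, the full model can be applied to the 20 held-out test cases. A minimal sketch, assuming the new_window factor levels are first aligned with the training data (predict() can complain about factor levels that differ from training):
# align factor levels, then predict the 20 test cases
testing_data$new_window <- factor(testing_data$new_window,
                                  levels = levels(training_cleaned$new_window))
predict(modelFit4, newdata = testing_data)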