Human activity recognition research typically focuses on discriminating between different activities; here we instead investigate how well an activity was performed by six wearers of electronic devices. The data for this study were obtained from http://groupware.les.inf.puc-rio.br/har .
The six participants were between 20 and 28 years old and had little weight lifting experience. They were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions, defined as the following classes:
Class A: exactly according to the specification
Class B: throwing the elbows to the front
Class C: lifting the dumbbell only halfway
Class D: lowering the dumbbell only half way
Class E: throwing the hips to the front
Class A corresponds to the specified execution of the exercise, while the other classes correspond to common mistakes. To ensure data quality, an experienced weight lifter supervised the participants.
The main objective of this project is to predict the manner in which the participants did the exercise, that is, which of the five fashions of the Unilateral Dumbbell Biceps Curl was performed. This is the classe variable in the dataset, and any of the other variables may be used as predictors.
## Load packages
library(knitr)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(lattice)
library(ggplot2)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(survival)
library(Formula)
library(plyr)
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
# downloading data
if(!file.exists("./training.csv")){
url.training <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url.training, destfile = "./training.csv")
}
if(!file.exists("./testing.csv")){
url.testing <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url.testing, destfile = "./testing.csv")
}
## Training Data
training_data <- read.csv("./training.csv", na.strings=c("NA",""),stringsAsFactors = FALSE)
## Testing Data
testing_data <- read.csv("./testing.csv", na.strings=c("NA",""),stringsAsFactors = FALSE)
After loading the libraries, the data files are downloaded if they do not already exist in the working directory. The training dataset contains 160 variables with 19622 observations, and the testing dataset contains 20 observations used to evaluate the predictions of the final classification model.
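As a quick sanity check, the dimensions of both data frames can be verified directly:
dim(training_data)   # 19622 rows, 160 columns
dim(testing_data)    # 20 rows, 160 columns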
## Data cleaning
# replace "#DIV/0!" strings (spreadsheet division errors) with NA
div0rec <- sapply(training_data, function(x) x=="#DIV/0!")
training_data[div0rec] <- NA
# convert yes/no into 1/0
testing_data$new_window = 1*(testing_data$new_window=="yes")
testing_data$new_window <- as.factor(testing_data$new_window)
training_data$new_window = 1*(training_data$new_window=="yes")
training_data$new_window <- as.factor(training_data$new_window)
training_data$classe <- factor(training_data$classe)
## Removing variables
# remove variables with either 0 or NA
unwanted1 <- names(training_data) %in% c("kurtosis_yaw_belt", "kurtosis_yaw_dumbbell", "kurtosis_yaw_forearm",
"skewness_yaw_belt", "skewness_yaw_dumbbell", "skewness_yaw_forearm",
"amplitude_yaw_belt", "amplitude_yaw_dumbbell", "amplitude_yaw_forearm")
training_data_1 <- training_data[!unwanted1]
# remove irrelevant variables
unwanted2 <- names(training_data_1) %in% c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
"cvtd_timestamp")
training_data_1 <- training_data_1[!unwanted2]
# remove variables that are mostly NA (> 95%)
index.NA <- sapply(training_data_1, is.na)
Sum.NA <- colSums(index.NA)
percent.NA <- Sum.NA/(dim(training_data_1)[1])
to.remove <- percent.NA>.95
training_cleaned <- training_data_1[,!to.remove]
##str(training_cleaned)
We first converted "#DIV/0!" strings to NA; then the yes/no categories in the new_window variable were converted to 1/0. The second step is likely unnecessary, since the variable would be implicitly converted to 1/0 in the model, but I did it anyway. The outcome variable classe is a character variable due to how the data was read, so it was converted to a factor variable.
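A quick check (assuming the chunk above has been run) confirms the conversions:
# no "#DIV/0!" strings should remain, and classe should now be a factor
sum(training_data == "#DIV/0!", na.rm = TRUE)
class(training_data$classe)
table(training_data$classe)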
There were 9 variables consisting only of 0 or NA, namely,
kurtosis_yaw_belt,
kurtosis_yaw_dumbbell,
kurtosis_yaw_forearm,
skewness_yaw_belt,
skewness_yaw_dumbbell,
skewness_yaw_forearm,
amplitude_yaw_belt,
amplitude_yaw_dumbbell, and
amplitude_yaw_forearm.
We know those variables will not help in terms of classification, so they were removed.
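For reference, these columns can also be identified programmatically rather than by name. A minimal sketch, assuming the "#DIV/0!" replacement above has already been applied:
# flag columns that are entirely NA, or whose non-NA values are all zero
zero.or.na <- sapply(training_data, function(x) {
    v <- x[!is.na(x)]
    length(v) == 0 || all(v %in% c(0, "0", "0.00"))
})
names(training_data)[zero.or.na]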
In addition, the X variable is just a sequence from 1 to 19622, i.e., a row index. The user_name variable contains the names of the participants, and three variables record the date/time at which the activity was performed. Since we do not want user names and timestamps to drive the classification, these variables were also removed from the dataset.
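A one-line check confirms that X is simply the row number:
# X should equal 1, 2, ..., 19622
all(training_data$X == seq_len(nrow(training_data)))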
There were 91 variables with more than 95% of the data missing; these were removed as well. If a classification model were built on such variables, the relevant value would be missing most of the time and the classification rules could not be applied, so building a model on variables that are mostly missing is not practical.
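After the three removal steps, a quick dimension check (assuming the chunks above) shows the variable count dropping from 160 to 55:
dim(training_data)      # before cleaning: 19622 x 160
dim(training_cleaned)   # after cleaning:  19622 x 55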
# Data partitioning: training/testing
set.seed(10)
n <- length(training_cleaned)
inTrain <- createDataPartition(training_cleaned$classe, p = 0.6)[[1]]
# split off the held-out rows first, before training_cleaned is overwritten
testing_cleaned <- training_cleaned[-inTrain,]
training_cleaned <- training_cleaned[inTrain,]
##summary(training_cleaned)
Since the provided testing data does not contain the actual classe variable, we cannot use it to assess the performance of the classification model. Therefore the cleaned training data was split: 60% became the training set, and 40% became the testing (validation) set.
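Because createDataPartition samples within each class, the class proportions should be nearly identical in the two subsets. A quick check:
# class proportions in the 60/40 split should match closely
round(prop.table(table(training_cleaned$classe)), 3)
round(prop.table(table(testing_cleaned$classe)), 3)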
## Regression tree models
# setting option for 10-fold CV
train_control <- trainControl(method="cv", number=10)
# fit the model
set.seed(100)
modelFit1 <- train(classe ~., method="rpart", data=training_cleaned,
trControl = train_control)
result1<- confusionMatrix(testing_cleaned$classe, predict(modelFit1, newdata=testing_cleaned))
# fit the model after preprocessing
modelFit2 <- train(classe ~., method="rpart", preProcess=c("center", "scale"),data=training_cleaned,
trControl = train_control)
result2<- confusionMatrix(testing_cleaned$classe, predict(modelFit2, newdata=testing_cleaned))
result1
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1235 14 98 0 3
## B 398 312 187 0 0
## C 407 23 405 0 0
## D 334 132 299 0 0
## E 124 96 235 0 413
##
## Overall Statistics
##
## Accuracy : 0.5016
## 95% CI : (0.4872, 0.516)
## No Information Rate : 0.5298
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3466
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4944 0.54073 0.3309 NA 0.99279
## Specificity 0.9481 0.85863 0.8768 0.8378 0.89416
## Pos Pred Value 0.9148 0.34783 0.4850 NA 0.47581
## Neg Pred Value 0.6247 0.93059 0.7889 NA 0.99922
## Prevalence 0.5298 0.12238 0.2596 0.0000 0.08823
## Detection Rate 0.2619 0.06617 0.0859 0.0000 0.08759
## Detection Prevalence 0.2863 0.19024 0.1771 0.1622 0.18409
## Balanced Accuracy 0.7213 0.69968 0.6039 NA 0.94347
result2
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1235 14 98 0 3
## B 398 312 187 0 0
## C 407 23 405 0 0
## D 334 132 299 0 0
## E 124 96 235 0 413
##
## Overall Statistics
##
## Accuracy : 0.5016
## 95% CI : (0.4872, 0.516)
## No Information Rate : 0.5298
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3466
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4944 0.54073 0.3309 NA 0.99279
## Specificity 0.9481 0.85863 0.8768 0.8378 0.89416
## Pos Pred Value 0.9148 0.34783 0.4850 NA 0.47581
## Neg Pred Value 0.6247 0.93059 0.7889 NA 0.99922
## Prevalence 0.5298 0.12238 0.2596 0.0000 0.08823
## Detection Rate 0.2619 0.06617 0.0859 0.0000 0.08759
## Detection Prevalence 0.2863 0.19024 0.1771 0.1622 0.18409
## Balanced Accuracy 0.7213 0.69968 0.6039 NA 0.94347
The accuracy of the two regression tree models is not good at all: only around 50%, which is not acceptable. Preprocessing (centering and scaling) did not improve the tree-based predictions, so we will try a random forest next.
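Since rattle is already loaded, one way to see why the tree struggles (the confusion matrix above shows it never predicts class D, for example) is to plot the fitted tree. A minimal sketch:
# plot the final rpart tree from caret; its shallow splits explain the poor accuracy
fancyRpartPlot(modelFit1$finalModel, sub = "")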
# Get correlation matrix and find the variables with high correlation with classe
# treat classe as numeric so cor() can be applied (a rough screening device only)
k <- training_cleaned
k$classe <- as.numeric(training_cleaned$classe)
cormatrix <- data.frame(cor(k[,-c(1)]))   # drop the new_window factor in column 1
cormatrix$name <- names(k)[2:55]
cor.table <- data.frame(cbind(cormatrix$classe, cormatrix$name))  # cbind() coerces cor to character
names(cor.table) <- c("cor", "name")
# show the variables with the highest correlation with classe
tail(arrange(cor.table, cor), 8)
## cor name
## 47 0.0901674347186873 roll_arm
## 48 0.122251338917828 accel_dumbbell_x
## 49 0.14860620640767 total_accel_forearm
## 50 0.151150427414134 magnet_dumbbell_z
## 51 0.241636410945395 accel_arm_x
## 52 0.291591357120721 magnet_arm_x
## 53 0.350902990197275 pitch_forearm
## 54 1 classe
# fit a model using only the variables most correlated with classe (r > 0.1)
modelFit3 <- randomForest(classe ~ pitch_forearm + magnet_arm_x + accel_arm_x + total_accel_forearm + magnet_dumbbell_z + accel_dumbbell_x, data=training_cleaned)
result3 <- confusionMatrix(testing_cleaned$classe, predict(modelFit3, newdata=testing_cleaned))
# try full model
modelFit4 <- randomForest(classe ~., data=training_cleaned)
result4<- confusionMatrix(testing_cleaned$classe, predict(modelFit4, newdata=testing_cleaned))
result3
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1350 0 0 0 0
## B 0 897 0 0 0
## C 0 0 835 0 0
## D 0 0 0 765 0
## E 0 0 0 0 868
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9992, 1)
## No Information Rate : 0.2863
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2863 0.1902 0.1771 0.1622 0.1841
## Detection Rate 0.2863 0.1902 0.1771 0.1622 0.1841
## Detection Prevalence 0.2863 0.1902 0.1771 0.1622 0.1841
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
result4
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1350 0 0 0 0
## B 0 897 0 0 0
## C 0 0 835 0 0
## D 0 0 0 765 0
## E 0 0 0 0 868
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9992, 1)
## No Information Rate : 0.2863
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2863 0.1902 0.1771 0.1622 0.1841
## Detection Rate 0.2863 0.1902 0.1771 0.1622 0.1841
## Detection Prevalence 0.2863 0.1902 0.1771 0.1622 0.1841
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
As an initial check on the random forest approach, the model was first applied to a short list of variables that seemed most likely to predict classe well. Predicting classe with only the variables most correlated with it (r > 0.1) gives a classification model with an accuracy of 0.879.
For cross-checking and validation, we then applied the model built on all of the variables to the testing data set and obtained an accuracy of 0.997, which is very good.
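As a cross-check on the correlation-based screening, we can also inspect the variable importance computed by the full random forest itself. A minimal sketch:
# mean decrease in Gini impurity; higher values mean more useful splits
imp <- importance(modelFit4)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)
varImpPlot(modelFit4, n.var = 10)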
The random forest technique works better than a regression tree in this case; the results obtained with random forests were highly accurate on the testing set.
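Finally, the full model can be applied to the 20 held-out test cases. A minimal sketch, assuming the new_window factor levels are first aligned with the training data (predict() can complain about factor levels that differ from training):
# align factor levels, then predict the 20 test cases
testing_data$new_window <- factor(testing_data$new_window,
                                  levels = levels(training_cleaned$new_window))
predict(modelFit4, newdata = testing_data)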