Summary

In this study, six participants performed dumbbell exercises while wearing accelerometers on the belt, forearm, arm, and dumbbell. They performed the exercises both correctly and incorrectly. This project attempts to predict whether each exercise was done correctly and, if not, the manner in which it was incorrect. The classe variable indicates how the exercise was performed: class A corresponds to correct execution, while classes B through E correspond to common mistakes.

Data Loading and Cleaning

# Treat '#DIV/0!' spreadsheet artifacts and empty strings as NA on read
# (read.csv resolves relative paths against the working directory)
trainingdata <- read.csv('pml-training.csv', header = TRUE, na.strings = c('#DIV/0!', 'NA', ''))
testingdata <- read.csv('pml-testing.csv', header = TRUE, na.strings = c('#DIV/0!', 'NA', ''))
dim(trainingdata)
## [1] 19622   160

The training set consists of 19,622 observations of 160 variables. Column 160 is the variable we are trying to predict: classe. We want to use cross-validation, so we'll set up a trainControl object specifying 10 folds; caret's train() function will use it when we fit models.

library(caret); library(randomForest); library(rpart)
set.seed(48374)
train_control <- trainControl(method="cv", number=10)
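
Under the hood, 10-fold cross-validation partitions the rows into ten roughly equal folds, each serving once as the held-out set. Here is a minimal sketch of what trainControl arranges internally, using caret's createFolds (the folds object is illustrative only and is not used later):

folds <- createFolds(factor(trainingdata$classe), k = 10)  # list of 10 held-out index sets
sapply(folds, length)  # each fold covers roughly a tenth of the 19,622 rows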

Before modeling, some data cleanup is needed. First, the NA values: the data set contains many of them. Any column made up of 50% or more NA values will be dropped.

# Fraction of NA values in each column
NAfrac <- colMeans(is.na(trainingdata))

# Keep only the columns that are less than 50% NA
trainingdata <- trainingdata[ , NAfrac < 0.5]

We also need to get rid of any variables that have near zero variance.

NZVvalues <- nearZeroVar(trainingdata, saveMetrics = TRUE)
NZVnames <- rownames(NZVvalues)[NZVvalues$nzv]
# Drop by name rather than with -which(...): the latter would select zero
# columns if nothing were flagged, since x[ , -integer(0)] is empty
trainingdata <- trainingdata[ , !(names(trainingdata) %in% NZVnames)]
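
To see why such predictors are dropped, consider a toy column that is almost constant (a hypothetical example; toy is not part of this data set):

# x is 99.9% zeros and carries almost no signal; y varies freely
toy <- data.frame(x = c(rep(0, 999), 1), y = rnorm(1000))
nearZeroVar(toy, saveMetrics = TRUE)  # flags x (nzv = TRUE) but not y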

Next, we’ll get rid of highly correlated variables.

# Columns 6:58 hold the numeric predictors at this point (1-5 are metadata, the last is classe)
trainingcor <- findCorrelation(cor(trainingdata[ , 6:58]), cutoff = 0.8, exact = TRUE, names = TRUE)
trainingdata <- trainingdata[ , !(colnames(trainingdata) %in% trainingcor)]
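
For intuition, findCorrelation flags one member of each highly correlated pair, so dropping the flagged names removes the redundancy (a toy illustration with made-up variables):

set.seed(1)
a <- rnorm(100); b <- a + rnorm(100, sd = 0.1); c <- rnorm(100)
# cor(a, b) is close to 1, so one of that pair is flagged for removal
findCorrelation(cor(data.frame(a, b, c)), cutoff = 0.8, names = TRUE)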

Finally, we'll remove the first five columns, which hold the row index, user names, and timestamps rather than sensor measurements.

trainingdata <- trainingdata[ , 6:ncol(trainingdata)]
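
A quick check that the cleanup behaved as expected (a sketch; output not shown):

dim(trainingdata)             # same 19,622 rows, far fewer than the original 160 columns
tail(names(trainingdata), 1)  # classe should still be the last column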

Data Modeling

Now we can move on to fitting models. We'll try a couple to see which works best. First, Random Forest.

# Ensure the outcome is a factor (read.csv no longer converts strings by default)
trainingdata$classe <- as.factor(trainingdata$classe)
# trainControl() is consumed only by caret::train(); randomForest() would ignore it
rf_mod <- randomForest(classe ~ ., data = trainingdata)
rf_pred <- predict(rf_mod, trainingdata)
rf_conf <- confusionMatrix(rf_pred, trainingdata$classe)
rf_conf$overall[1]
## Accuracy 
##        1
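
The perfect score above is computed on the very rows the forest was trained on. randomForest also keeps an out-of-bag (OOB) error estimate, where each tree is evaluated on the observations left out of its bootstrap sample, which is a more honest internal check (a minimal sketch; output not shown):

# OOB error rate after the final tree; err.rate has one row per tree grown
rf_mod$err.rate[rf_mod$ntree, "OOB"]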

Random Forest fits the training data perfectly, a 100% accuracy rate. Bear in mind that this is resubstitution accuracy, measured on the same rows the model was fit to, so it overstates out-of-sample performance. Just for comparison, we'll try a decision tree model.

rpart_mod <- train(classe ~ ., data = trainingdata, trControl = train_control, method = 'rpart')
rpart_pred <- predict(rpart_mod, trainingdata)
rpart_conf <- confusionMatrix(rpart_pred, trainingdata$classe)
rpart_conf$overall[1]
##  Accuracy 
## 0.5292529

The decision tree's accuracy drops to 53%, far less reliable than the model built with the Random Forest method. We will move forward with Random Forest.
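
Before applying the model, it is worth checking which predictors drive the forest (a sketch using randomForest's built-in Gini importance; output not shown):

imp <- importance(rf_mod)  # MeanDecreaseGini by default for classification
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)  # top ten predictors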

Applying Model to Testing Data

Before running the model on the testing data, we need to transform the testing data in the same way we transformed the training data. We can do this by reducing the testing set to the columns that remain in the training set.

testingdata <- testingdata[ , colnames(testingdata) %in% colnames(trainingdata)]
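
As a sanity check (a small sketch; output not shown), the only training column absent from the testing set should be the outcome itself, since the testing file ships with a problem_id column in place of classe:

# Should print only "classe": every predictor used in training exists in testing
setdiff(colnames(trainingdata), colnames(testingdata))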

Next, we’ll apply our model to this data set to fill in the classe variable.

testingdata$classe <- predict(rf_mod, testingdata)
testingresults <- data.frame(classe = testingdata$classe)
testingresults
##    classe
## 1       B
## 2       A
## 3       B
## 4       A
## 5       A
## 6       E
## 7       D
## 8       B
## 9       A
## 10      A
## 11      B
## 12      C
## 13      B
## 14      A
## 15      E
## 16      E
## 17      A
## 18      B
## 19      B
## 20      B