Using Machine Learning to Predict Activity Types from Biometric Data

Introduction

This project uses Human Activity Recognition data. The data set records measurements taken at the belt, forearm, arm, and dumbbell, and includes five classes of activity: sitting-down, standing-up, standing, walking, and sitting. The belt, forearm, arm, and dumbbell measurements will be used to predict the class of activity.

Loading and Processing the Data

First, relevant R packages must be loaded:

library(randomForest); library(caret); library(e1071)

## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: lattice
## Loading required package: ggplot2

Next, we load a training set (which includes the activity classes) and a test set (for which we will predict the activity class). All blank cells are converted to NA values.

train  <- read.csv("pml-training.csv", header=TRUE, na.strings = c("", " ","NA"))
test  <- read.csv("pml-testing.csv", header=TRUE, na.strings = c("", " ","NA"))
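
To see what the na.strings argument does, here is a minimal sketch on a made-up three-row CSV. The column names and values are invented for illustration and are not taken from the project files:

```r
# Toy CSV (hypothetical data): column b has an empty cell and a single-space
# cell, column c has a literal "NA".
csv_text <- "a,b,c\n1,,x\n2, ,NA\n3,5,z"

# Same na.strings setting as in the calls above
toy <- read.csv(text = csv_text, header = TRUE, na.strings = c("", " ", "NA"))

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

With this setting, empty cells, single-space cells, and the literal string "NA" are all read as missing values, so they can be counted with is.na() in the next step.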

Since many columns contain NA data, we’ll get rid of columns where at least half the rows in the training set are NAs:

# Select columns by training-set NA counts; apply the same mask to both sets
keep <- colSums(is.na(train)) < .5*nrow(train)
train2 <- train[,keep]
test2 <- test[,keep]
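
The column filter can be checked on a small example. The sketch below uses two invented data frames (toy_train and toy_test are hypothetical names) and shows that a column with exactly half of its values missing is dropped, and that the mask computed on the training data is reused for the test data:

```r
# Hypothetical toy data: 4 training rows, 2 test rows, same column names
toy_train <- data.frame(
  x         = c(1, 2, 3, 4),
  mostly_na = c(NA, NA, NA, 1),   # 3 of 4 missing
  half_na   = c(NA, NA, 1, 2),    # exactly half missing
  y         = c("A", "B", "A", "B")
)
toy_test <- data.frame(
  x         = c(5, 6),
  mostly_na = c(NA, NA),
  half_na   = c(3, NA),
  y         = c(NA, NA)
)

# TRUE for columns where fewer than half of the training rows are NA
keep <- colSums(is.na(toy_train)) < 0.5 * nrow(toy_train)

# The mask comes from the training set only and is applied to both sets,
# so both keep the same columns
kept_train <- toy_train[, keep]
kept_test  <- toy_test[, keep]
print(names(kept_train))
print(names(kept_test))
```

Deriving the mask from the training set alone keeps the two data frames aligned column for column, which the model needs at prediction time.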

Dropping the sparse columns reduces the number of columns from 160 to 60. We’ll also remove the first 7 columns, as they only contain metadata:

train3 <- train2[,-c(1:7)]
test3 <- test2[,-c(1:7)]

This leaves 53 columns (52 predictors plus the column we want to predict), a much more manageable data set. Finally, we’ll make all values numeric, except for the categorical data we want to predict:

for (i in 1:52){
   train3[,i] <- as.numeric(train3[,i])
   test3[,i] <- as.numeric(test3[,i])   
}
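
One caveat with as.numeric() is worth illustrating. If a column happens to be stored as a factor, as.numeric() returns the internal level codes rather than the printed values. Whether this applies here depends on how read.csv typed each column, which the output above does not show, so the sketch below is a precaution on invented data rather than a correction:

```r
# Hypothetical toy data: column a is a factor holding numbers as text
toy <- data.frame(
  a      = factor(c("10.5", "2.5")),
  b      = 1:2,
  classe = c("A", "B")
)

# Direct conversion of a factor gives the level codes, not the values
codes <- as.numeric(toy$a)
print(codes)

# Going through as.character() first recovers the actual numbers.
# All columns except the outcome are converted in one step.
num_cols <- setdiff(names(toy), "classe")
toy[num_cols] <- lapply(toy[num_cols], function(col) as.numeric(as.character(col)))
print(toy$a)
```

Selecting the predictor columns by name also avoids hard-coding the column count.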

Modeling the Data

Next, we’ll set the seed so that the random split is reproducible, and randomly divide the training set into two subsets. The first subset (80% of the rows) will be used to train the model, while the second (the remaining 20%) will be held out to check its accuracy.

set.seed(321)
subTrain <- createDataPartition(y = train3$classe,
                               p = .8,
                               list = FALSE)

training <- train3[subTrain,]
testing <- train3[-subTrain,]
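
When the outcome is a factor, createDataPartition samples within each class so that the split preserves the class proportions. The following base-R sketch approximates that idea on invented labels. It is not caret's actual implementation, and the labels vector is hypothetical:

```r
set.seed(321)

# Hypothetical class labels with unequal class sizes
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Row indices grouped by class
idx_by_class <- split(seq_along(labels), labels)

# Draw 80% of the indices within each class
train_idx <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE))

train_labels   <- labels[train_idx]
holdout_labels <- labels[-train_idx]
print(table(train_labels))
print(table(holdout_labels))
```

Both subsets keep the original 50:30:20 ratio between classes, which a plain random sample of rows would only match approximately.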

We’ll fit a random forest with 250 trees to the training subset, predicting classe from all remaining columns:

rf_model <- randomForest(classe~.,data=training, ntree=250, importance=TRUE)
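
A fitted randomForest object carries some diagnostics of its own. The sketch below uses the built-in iris data as a stand-in (the project data is not reproduced here) and reads the out-of-bag error estimate and the variable importance scores that importance=TRUE makes available. The names toy_model, oob_error, and var_imp are illustrative only:

```r
library(randomForest)

set.seed(321)

# Stand-in model on iris, using the same ntree and importance settings
toy_model <- randomForest(Species ~ ., data = iris, ntree = 250, importance = TRUE)

# Out-of-bag error rate after the final tree
oob_error <- toy_model$err.rate[toy_model$ntree, "OOB"]
print(oob_error)

# Per-variable importance (mean decrease in accuracy and in Gini impurity)
var_imp <- importance(toy_model)
print(var_imp)
```

The out-of-bag estimate is computed from rows each tree did not see, so it gives a second view of out-of-sample error alongside a held-out subset.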

Predicting the Data

Next, we use the model to predict the class of each row in the held-out subset and compare the predictions with the true labels:

predTrain <- predict(rf_model,testing); predAcc <- predTrain==testing$classe
table(predTrain,testing$classe)

##          
## predTrain    A    B    C    D    E
##         A 1115    4    0    0    0
##         B    1  754    2    0    0
##         C    0    1  682    8    0
##         D    0    0    0  634    1
##         E    0    0    0    1  720

prop.table(table(predAcc))

## predAcc
##       FALSE        TRUE 
## 0.004588325 0.995411675
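
The same accuracy can be recovered directly from the confusion matrix, along with per-class figures. This sketch re-enters the counts printed above by hand. The names conf_mat, recall, and precision are introduced here for illustration:

```r
# Confusion matrix copied from the table above (rows = predicted, columns = actual)
conf_mat <- matrix(
  c(1115,   4,   0,   0,   0,
       1, 754,   2,   0,   0,
       0,   1, 682,   8,   0,
       0,   0,   0, 634,   1,
       0,   0,   0,   1, 720),
  nrow = 5, byrow = TRUE,
  dimnames = list(predicted = LETTERS[1:5], actual = LETTERS[1:5])
)

# Overall accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)

# Per-class recall (share of each actual class predicted correctly)
recall <- diag(conf_mat) / colSums(conf_mat)
# Per-class precision (share of each predicted class that is correct)
precision <- diag(conf_mat) / rowSums(conf_mat)
print(round(recall, 3))
print(round(precision, 3))
```

The per-class view shows where the 18 misclassified rows fall; the largest single group is 8 rows of class D predicted as class C.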

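As we can see, the model classifies 99.5% of the held-out rows correctly (3,905 of 3,923), which corresponds to an estimated out-of-sample error rate of about 0.46%. Because this accuracy is measured on rows that were not used to fit the model, it is not simply a result of overfitting to the training data. We’ll move on to predict the test set: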

predTest <- predict(rf_model,newdata=test3)
table(predTest)

## predTest
## A B C D E 
## 7 8 1 1 3

The 20 test cases are spread across all five classes, as expected.

Onna Nelson (wugology)

February 20, 2015
