Executive Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.

Using a weight lifting exercise dataset from http://groupware.les.inf.puc-rio.br/har, this analysis attempts to predict the manner of activity (sitting, standing, walking, etc) using a variety of gyroscopes and acceleromters. Using Random Forest classification, we ultimately develop a model that has greater than 99% accuracy on a holdout set.

Analysis

We begin the analysis with some exploratory measures to understand the shape and size of the data. We find that there are 19,622 observations of 160 variables many of which do not have information (read NAs).

Originally, I removed the rows with NAs (using complete.cases()), but that left me with 30 rows and 160 variables - it did not bring me any closer to making an inference. I then explored removing the columns with NAs. Doing some high-level analysis of the NAs, we can determine that there is no pattern of NAs relative to the object we are classifying (eg, distribution of NA between “classe” A, B, C, D, or E).

Stripping the dataset of the variables with NAs, we were left with 93 variables. I originally attempted to run a Random Forest on this dataset and was surprised to find that R would not do an analysis on a set with greater than 53 dimensions. I was hoping I could determine which variables were most important and call it a day. :) Generally speaking, Random Forests have great predictive ability and are useful if we are not required to explain the logic embedded in the decision tree. (For example, healthcare applications might care about the path taken to a conclusion.)

I played around a little with Principal Component Analysis (PCA, in the appendix) and realized I needed to build a correlation matrix to remove strongly associated dimensions and try to get below the 53 threshold. Correlation Matrices only work on numeric datasets, so I stripped the data again of its non-numeric dimensions.

## Warning: package 'caret' was built under R version 3.2.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
## Warning: package 'tree' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Loading required package: MASS
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following object is masked from 'package:randomForest':
## 
##     outlier
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## Loading required package: boot
## 
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:lattice':
## 
##     melanoma
## 
## Attaching package: 'nFactors'
## The following object is masked from 'package:lattice':
## 
##     parallel

The Correlation Matrix can be found below. I tried to remove the labels for the different variables, but could not find the code online. If you have any thoughts or recommendations, I would greatly appreciate it!

I developed a lost of the variables that had correlations greater than 0.8 and removed them from the data. I then added back the “classe” column, which is what we are trying to classify. You will notice that I removed the first 4 columns. This is the Index (“X”) and other Timestamp data. When I first did the analysis, I found that I perfectly predicted all variables (Accuracy = 1). It turns out that the “classe” variable is listed in the dataset (all As, then all Bs, etc) - and the Index was being used to predict outcomes!

matrix <- cor(data)
corrplot(matrix, type="upper", order="hclust", tl.col="black", tl.srt=45, ann = FALSE)

list = findCorrelation(matrix, cutoff=0.8)
list = sort(list)
data = data[,-c(list)]
data = cbind(data, predictor)
colnames(data)[44] <- "classe"
data <- data[ , 5:44]

We now have the data prepared to do some analysis. We set aside 25% of the data for a holdout, and develop a Random Forest model on the training set. Using the model to predict “classe” variables in the test set, we find that we have a 99.51% accuracy, and that the “yaw_belt”, “pitch_belt”, and “pitch_forearm” variables are most important.

This was then used on the test set for the exercise and provided 20/20 correct.

set.seed(1000)
inTrain = createDataPartition(data$classe, p = 3/4)[[1]]
training = data[ inTrain,]
testing = data[-inTrain,]

fol <- formula(classe ~ .)
model2 <- randomForest(fol, data = training)
modelresult2 <- predict(model2, testing, type = "class")
confusionMatrix(testing$classe, modelresult2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1391    4    0    0    0
##          B    0  948    1    0    0
##          C    0    6  846    3    0
##          D    0    0    4  798    2
##          E    0    0    0    4  897
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9951          
##                  95% CI : (0.9927, 0.9969)
##     No Information Rate : 0.2836          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9938          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9896   0.9941   0.9913   0.9978
## Specificity            0.9989   0.9997   0.9978   0.9985   0.9990
## Pos Pred Value         0.9971   0.9989   0.9895   0.9925   0.9956
## Neg Pred Value         1.0000   0.9975   0.9988   0.9983   0.9995
## Prevalence             0.2836   0.1954   0.1735   0.1642   0.1833
## Detection Rate         0.2836   0.1933   0.1725   0.1627   0.1829
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9994   0.9947   0.9960   0.9949   0.9984
rfimp <- varImp(model2, scale = TRUE)
rfimp
##                        Overall
## pitch_belt           700.32639
## yaw_belt             933.70789
## total_accel_belt     277.04750
## gyros_belt_x         121.48386
## gyros_belt_y         126.39142
## gyros_belt_z         362.51444
## magnet_belt_y        459.62799
## magnet_belt_z        427.51589
## roll_arm             300.58077
## pitch_arm            170.12728
## yaw_arm              245.00512
## total_accel_arm      112.54123
## gyros_arm_y          156.29410
## gyros_arm_z           62.19650
## accel_arm_y          171.04765
## accel_arm_z          124.10851
## magnet_arm_x         258.06295
## magnet_arm_z         176.28826
## roll_dumbbell        358.71793
## pitch_dumbbell       190.49263
## yaw_dumbbell         264.91302
## total_accel_dumbbell 281.47956
## gyros_dumbbell_y     242.51652
## accel_dumbbell_y     376.75358
## magnet_dumbbell_x    442.78117
## magnet_dumbbell_y    580.43901
## magnet_dumbbell_z    690.69081
## roll_forearm         521.12588
## pitch_forearm        696.76447
## yaw_forearm          162.61702
## total_accel_forearm  111.24082
## gyros_forearm_x       87.71357
## gyros_forearm_z       93.44899
## accel_forearm_x      293.78553
## accel_forearm_y      138.08692
## accel_forearm_z      247.02963
## magnet_forearm_x     194.13326
## magnet_forearm_y     205.95934
## magnet_forearm_z     269.88257

APPENDIX

In doing this exercise, I learned a bit about PCA. It creates new variables from the dimensions, combining them to try and give predictive power to the variance. This is a Scree plot, which determines that there are 13 characteristics or PCA values, that together provide most of the explanatory power necessary for this analysis.

#remove the one we want to predict
ev <- eigen(cor(data[sapply(data, is.numeric)])) # get eigenvalues
ap <- parallel(subject=nrow(data[sapply(data, is.numeric)]),var=ncol(data[sapply(data, is.numeric)]),
               rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)