Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
Using a weight lifting exercise dataset from http://groupware.les.inf.puc-rio.br/har, this analysis attempts to predict the manner of activity (sitting, standing, walking, etc) using a variety of gyroscopes and acceleromters. Using Random Forest classification, we ultimately develop a model that has greater than 99% accuracy on a holdout set.
We begin the analysis with some exploratory measures to understand the shape and size of the data. We find that there are 19,622 observations of 160 variables many of which do not have information (read NAs).
Originally, I removed the rows with NAs (using complete.cases()), but that left me with 30 rows and 160 variables - it did not bring me any closer to making an inference. I then explored removing the columns with NAs. Doing some high-level analysis of the NAs, we can determine that there is no pattern of NAs relative to the object we are classifying (eg, distribution of NA between “classe” A, B, C, D, or E).
Stripping the dataset of the variables with NAs, we were left with 93 variables. I originally attempted to run a Random Forest on this dataset and was surprised to find that R would not do an analysis on a set with greater than 53 dimensions. I was hoping I could determine which variables were most important and call it a day. :) Generally speaking, Random Forests have great predictive ability and are useful if we are not required to explain the logic embedded in the decision tree. (For example, healthcare applications might care about the path taken to a conclusion.)
I played around a little with Principal Component Analysis (PCA, in the appendix) and realized I needed to build a correlation matrix to remove strongly associated dimensions and try to get below the 53 threshold. Correlation Matrices only work on numeric datasets, so I stripped the data again of its non-numeric dimensions.
## Warning: package 'caret' was built under R version 3.2.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
## Warning: package 'tree' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading required package: MASS
## Loading required package: psych
##
## Attaching package: 'psych'
## The following object is masked from 'package:randomForest':
##
## outlier
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:lattice':
##
## melanoma
##
## Attaching package: 'nFactors'
## The following object is masked from 'package:lattice':
##
## parallel
The Correlation Matrix can be found below. I tried to remove the labels for the different variables, but could not find the code online. If you have any thoughts or recommendations, I would greatly appreciate it!
I developed a lost of the variables that had correlations greater than 0.8 and removed them from the data. I then added back the “classe” column, which is what we are trying to classify. You will notice that I removed the first 4 columns. This is the Index (“X”) and other Timestamp data. When I first did the analysis, I found that I perfectly predicted all variables (Accuracy = 1). It turns out that the “classe” variable is listed in the dataset (all As, then all Bs, etc) - and the Index was being used to predict outcomes!
matrix <- cor(data)
corrplot(matrix, type="upper", order="hclust", tl.col="black", tl.srt=45, ann = FALSE)
list = findCorrelation(matrix, cutoff=0.8)
list = sort(list)
data = data[,-c(list)]
data = cbind(data, predictor)
colnames(data)[44] <- "classe"
data <- data[ , 5:44]
We now have the data prepared to do some analysis. We set aside 25% of the data for a holdout, and develop a Random Forest model on the training set. Using the model to predict “classe” variables in the test set, we find that we have a 99.51% accuracy, and that the “yaw_belt”, “pitch_belt”, and “pitch_forearm” variables are most important.
This was then used on the test set for the exercise and provided 20/20 correct.
set.seed(1000)
inTrain = createDataPartition(data$classe, p = 3/4)[[1]]
training = data[ inTrain,]
testing = data[-inTrain,]
fol <- formula(classe ~ .)
model2 <- randomForest(fol, data = training)
modelresult2 <- predict(model2, testing, type = "class")
confusionMatrix(testing$classe, modelresult2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1391 4 0 0 0
## B 0 948 1 0 0
## C 0 6 846 3 0
## D 0 0 4 798 2
## E 0 0 0 4 897
##
## Overall Statistics
##
## Accuracy : 0.9951
## 95% CI : (0.9927, 0.9969)
## No Information Rate : 0.2836
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9938
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9896 0.9941 0.9913 0.9978
## Specificity 0.9989 0.9997 0.9978 0.9985 0.9990
## Pos Pred Value 0.9971 0.9989 0.9895 0.9925 0.9956
## Neg Pred Value 1.0000 0.9975 0.9988 0.9983 0.9995
## Prevalence 0.2836 0.1954 0.1735 0.1642 0.1833
## Detection Rate 0.2836 0.1933 0.1725 0.1627 0.1829
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9994 0.9947 0.9960 0.9949 0.9984
rfimp <- varImp(model2, scale = TRUE)
rfimp
## Overall
## pitch_belt 700.32639
## yaw_belt 933.70789
## total_accel_belt 277.04750
## gyros_belt_x 121.48386
## gyros_belt_y 126.39142
## gyros_belt_z 362.51444
## magnet_belt_y 459.62799
## magnet_belt_z 427.51589
## roll_arm 300.58077
## pitch_arm 170.12728
## yaw_arm 245.00512
## total_accel_arm 112.54123
## gyros_arm_y 156.29410
## gyros_arm_z 62.19650
## accel_arm_y 171.04765
## accel_arm_z 124.10851
## magnet_arm_x 258.06295
## magnet_arm_z 176.28826
## roll_dumbbell 358.71793
## pitch_dumbbell 190.49263
## yaw_dumbbell 264.91302
## total_accel_dumbbell 281.47956
## gyros_dumbbell_y 242.51652
## accel_dumbbell_y 376.75358
## magnet_dumbbell_x 442.78117
## magnet_dumbbell_y 580.43901
## magnet_dumbbell_z 690.69081
## roll_forearm 521.12588
## pitch_forearm 696.76447
## yaw_forearm 162.61702
## total_accel_forearm 111.24082
## gyros_forearm_x 87.71357
## gyros_forearm_z 93.44899
## accel_forearm_x 293.78553
## accel_forearm_y 138.08692
## accel_forearm_z 247.02963
## magnet_forearm_x 194.13326
## magnet_forearm_y 205.95934
## magnet_forearm_z 269.88257
In doing this exercise, I learned a bit about PCA. It creates new variables from the dimensions, combining them to try and give predictive power to the variance. This is a Scree plot, which determines that there are 13 characteristics or PCA values, that together provide most of the explanatory power necessary for this analysis.
#remove the one we want to predict
ev <- eigen(cor(data[sapply(data, is.numeric)])) # get eigenvalues
ap <- parallel(subject=nrow(data[sapply(data, is.numeric)]),var=ncol(data[sapply(data, is.numeric)]),
rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)