Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset). That site also provides the training and testing data, available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The goal of this project is to predict the manner in which the participants did the exercise, which is recorded in the “classe” variable in the training set. I explored the available predictor variables and selected those that did not have near-zero variance and that were not dominated by missing values. I compared two machine learning techniques, proceeded with the better one, and used a held-out validation set to estimate the out-of-sample error. The resulting model was used to predict the classes of the 20 test questions on the project assignment page.
A random forest model provided the best accuracy (rfAccur) on the training data and the best out-of-sample error estimate (rfOutSamp).
The classification tree model provided an accuracy of cmAccur and an out-of-sample error estimate of cmOutSamp.
The score on the 20 test questions was testScore.
The R Markdown document downloads the training and testing files and caches them. 40% of the training data was reserved for validation, resulting in three data sets with 11772, 7850, and 20 observations, respectively. The data were then explored and cleaned. Variables were removed for two reasons: they were flagged by the R function nearZeroVar, or they were missing (NA) in more than 25% of observations. This was necessary so that the models would perform adequately. After this cleaning process, 57 predictor variables remained.
The R function nearZeroVar diagnoses predictors that have one unique value (i.e., zero-variance predictors) or that have both of the following characteristics: they have very few unique values relative to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value is large. checkConditionalX looks at the distribution of the columns of x conditioned on the levels of y and identifies columns of x that are sparse within groups of y.
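To make those cutoffs concrete, here is a small illustrative example on toy data (not the project data), using the same freqCut and uniqueCut settings as the cleaning step below:
library(caret)
# toy data: x1 varies freely, x2 is constant, x3 is 98% a single value
toy<-data.frame(x1=rnorm(100),x2=rep(1,100),x3=c(rep(0,98),1,2))
nearZeroVar(toy,freqCut=80/20,uniqueCut=10,saveMetrics=TRUE)
# x2 is flagged as zeroVar (one unique value); x3 is flagged as nzv
# because its frequency ratio (98/1) exceeds freqCut and only 3% of
# its values are unique, below uniqueCut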
The R Markdown file contains all the code needed to reproduce these data files and perform the analysis.
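The download-and-cache chunk itself is not echoed in this report. The sketch below shows how that step might look, reusing the object names (trainURL, testURL, trainname, testname, train, test) that appear in the workspace listing further down; the na.strings setting is an assumption on my part, since this data set encodes missing values in several ways.
trainURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainname<-"pml-training.csv"; testname<-"pml-testing.csv"
# download each file only once; later knits reuse the cached copy
if(!file.exists(trainname)) download.file(trainURL,destfile=trainname)
if(!file.exists(testname)) download.file(testURL,destfile=testname)
# na.strings is an assumption: blanks and "#DIV/0!" also mark missing data
train<-read.csv(trainname,na.strings=c("NA","","#DIV/0!"))
test<-read.csv(testname,na.strings=c("NA","","#DIV/0!"))
save(train,file="train.rds"); save(test,file="test.rds")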
## Loading required package: lattice
## Loading required package: ggplot2
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
set.seed(303202)
# split training data 60/40 and use the 40 for validation.
valLog<-createDataPartition(y=train$classe,times=1,p=0.4,list=FALSE)
dim(valLog); head(valLog)
## [1] 7850 1
## Resample1
## [1,] 1
## [2,] 3
## [3,] 6
## [4,] 8
## [5,] 12
## [6,] 13
val<-train[valLog,]; train2<-train[-valLog,]
train2Obs<-nrow(train2); valObs<-nrow(val); testObs<-nrow(test)
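# cache each intermediate object as an .rds file so later runs can reload it instead of recomputing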
save(train2,file="train2.rds"); load("train2.rds")
save(val,file="val.rds"); load("val.rds")
#find near zero variance variables.
allVars<-nearZeroVar(train2,freqCut=80/20,uniqueCut=10,saveMetrics=TRUE)
save(allVars,file="allVars.rds"); load("allVars.rds")
goodVars<-allVars[!allVars$nzv,]
save(goodVars,file="goodVars.rds") ; load("goodVars.rds")
badVars<-allVars[allVars$nzv,]
save(badVars,file="badVars.rds"); load("badVars.rds")
dir(pattern="*.rds"); ls()
## [1] "allVars.rds" "badVars.rds" "goodVars.rds" "test.rds"
## [5] "testsave.rds" "train.rds" "train2.rds" "train3.rds"
## [9] "val.rds" "xdf1.rds"
## [1] "allVars" "badVars" "goodVars" "test" "testname"
## [6] "testObs" "testURL" "train" "train2" "train2Obs"
## [11] "trainname" "trainURL" "val" "valLog" "valObs"
#from train2, remove variables with near zero variance
goodVarNames<-row.names(goodVars)
train3<-train2[,goodVarNames]
save(train3,file="train3.rds"); load("train3.rds")
#from train3, find and remove the variable names that are NA more than 25% of the time
naCols<-apply(train3,2,function (x) length(which(is.na(x)) ) )
naColsLog<-apply(train3,2,function (x) length(which(is.na(x)) ) >= nrow(train3)/4 )
notNAvarnames<-names(naColsLog)[!naColsLog]
train4<-train3[,notNAvarnames]
train4<-train4[,-1] # the ID=X variable may interfere with the model
save(train4,file="train4.rds"); load("train4.rds")
# save these variables
charVec<-c("goodVarNames","naCols","naColsLog","notNAvarnames")
save(list=charVec,file="proc5Vars")
#do the same transformation on val
val2<-val[,goodVarNames]
save(val2,file="val2.rds"); load("val2.rds")
val3<-val2[,notNAvarnames]
val3<-val3[,-1] # the ID=X variable may interfere with the model
save(val3,file="val3.rds"); load("val3.rds")
#do the same transformation on test - does not have the "classe" variable
problem_id<-test[,"problem_id"] # has problem_id instead of classe
test2<-test[,goodVarNames[1:91]] # var#92 is classe
save(test2,file="test2.rds"); load("test2.rds")
test3<-test2[,notNAvarnames[1:58]] # var#59 is classe
test3<-cbind(test3,problem_id) # put back problem_id
test3<-test3[,-1] # the ID=X variable may interfere with the model
save(test3,file="test3.rds"); load("test3.rds")
The first model is a classification tree, which achieves about 88% accuracy on the training data.
build1Fit<-rpart(classe~.,data=train4,method="class")
par(mfrow = c(1,1), xpd = NA)
fancyRpartPlot(build1Fit)
save(build1Fit,file="build1Fit"); load("build1Fit")
pred1<-predict(build1Fit,train4,type="class")
confusionMatrix(pred1,train4$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3231 93 12 5 0
## B 89 1897 120 85 0
## C 28 273 1881 197 4
## D 0 15 21 1318 138
## E 0 0 19 324 2022
##
## Overall Statistics
##
## Accuracy : 0.8791
## 95% CI : (0.8731, 0.885)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.847
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9651 0.8327 0.9162 0.6833 0.9344
## Specificity 0.9869 0.9690 0.9483 0.9823 0.9643
## Pos Pred Value 0.9671 0.8658 0.7893 0.8834 0.8550
## Neg Pred Value 0.9861 0.9602 0.9817 0.9406 0.9849
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2745 0.1611 0.1598 0.1120 0.1718
## Detection Prevalence 0.2838 0.1861 0.2024 0.1267 0.2009
## Balanced Accuracy 0.9760 0.9009 0.9323 0.8328 0.9493
The second model is a random forest, which achieves 100% accuracy on the training data. Since a random forest can fit its own training set nearly perfectly, the held-out validation set provides the honest out-of-sample estimate.
build2RF<-randomForest(classe~.,data=train4)
save(build2RF,file="build2RF"); load("build2RF")
pred2<-predict(build2RF,train4,type="class")
confusionMatrix(pred2,train4$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3348 0 0 0 0
## B 0 2278 0 0 0
## C 0 0 2053 0 0
## D 0 0 0 1929 0
## E 0 0 0 0 2164
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
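The out-of-sample error estimates and the test answers quoted in the summary come from chunks not shown above. A minimal sketch of those steps, assuming the fitted models and cleaned data sets created earlier (predVal and cmVal are hypothetical helper names):
# estimate out-of-sample error for each model on the held-out validation set
predVal<-predict(build2RF,val3,type="class")
cmVal<-confusionMatrix(predVal,val3$classe)
rfAccur<-cmVal$overall["Accuracy"]; rfOutSamp<-1-rfAccur
cmAccur<-confusionMatrix(predict(build1Fit,val3,type="class"),val3$classe)$overall["Accuracy"]
cmOutSamp<-1-cmAccur
# predict the 20 test questions with the better (random forest) model
answers<-predict(build2RF,test3,type="class")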