Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite them, as they have been very generous in allowing their data to be used for this kind of assignment.

Model

Participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in 5 different fashions. The fashion is recorded in the data as the variable classe, a factor with levels A-E. Below are the descriptions for each level:

  • A: performing the lift correctly, exactly according to the specification

  • B: throwing the elbows to the front

  • C: lifting the dumbbell only halfway

  • D: lowering the dumbbell only halfway

  • E: throwing the hips to the front

As shown, level A represents a correct lift, while levels B-E represent common mistakes in the dumbbell lift. The models will predict which of the five levels each observation belongs to. Decision tree and random forest algorithms will be used; whichever has the higher accuracy on the validation split will be used for the final predictions.

Cross Validation

Cross validation is performed by splitting the given training data into two subsets (an 80%/20% split). Each model is trained on the 80% subset and then evaluated on the held-out 20% subset. Once it is determined which model is more accurate, that model is used on the test data set.

Data Workup

This section details the code used to prepare the data for the models. The libraries used in this analysis are caret, rpart, and randomForest for machine learning, along with RColorBrewer, rpart.plot, rattle, lattice, and ggplot2 for graphing. For reproducibility, a seed value of 9067 is used throughout the analysis.
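As a minimal setup sketch (assuming the packages listed above are already installed), the libraries can be loaded as follows:

# Machine learning libraries
library(caret)
library(rpart)
library(randomForest)

# Graphing libraries
library(RColorBrewer)
library(rpart.plot)
library(rattle)
library(lattice)
library(ggplot2)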

Loading Data

# Load data, treating Excel division errors ("#DIV/0!"), blanks, and "NA" as missing
traindat <- read.csv("pml-training.csv", na.strings = c("#DIV/0!", "", " ", "NA"))
testdat <- read.csv("pml-testing.csv", na.strings = c("#DIV/0!", "", " ", "NA"))

Tidy Data

After loading the data, exploratory analysis was done using the dim, head, summary, and str functions; their output is omitted here to save space. It was determined that the initial 7 columns (row index, user name, timestamps, and window indicators) were not needed for the analysis, so they were removed from the data set. Columns containing any NA values were removed as well.
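For reference, the exploratory calls were along these lines (output omitted):

dim(traindat)      # dimensions of the training data
head(traindat)     # first few rows
summary(traindat)  # per-column summaries
str(traindat)      # column types and structure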

# Drop the first 7 identifier/timestamp columns
traindat <- traindat[, -c(1:7)]
testdat <- testdat[, -c(1:7)]

# Build the keep mask from the training data before filtering, so the
# same columns are removed from both sets and their columns stay aligned
keepcols <- colSums(is.na(traindat)) == 0
traindat <- traindat[, keepcols]
testdat <- testdat[, keepcols]

Prepare Data

Before doing any testing, the training data needs to be split as detailed in the Cross Validation Section.

set.seed(9067)
# 80/20 split stratified on classe, as described in the Cross Validation section
trainsub <- createDataPartition(y = traindat$classe, p = .8, list = FALSE)
maintraindat <- traindat[trainsub, ]
testtraindat <- traindat[-trainsub, ]
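As a quick sanity check (an optional step, not part of the original workflow), the row counts should confirm the 80/20 split:

# Proportion of rows assigned to the training subset; should be close to 0.8
round(nrow(maintraindat) / nrow(traindat), 3)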

Now that the data is split, a distribution of classe can be shown:

table(maintraindat$classe)

   A    B    C    D    E 
4464 3038 2738 2573 2886 

As shown, A (correct lift) is the most represented class, while D (lowering halfway) is the least represented.
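Viewing the same counts as proportions (an optional check, not in the original workup) makes them easy to compare against the No Information Rate reported in the confusion matrices later:

# Class proportions in the training subset
round(prop.table(table(maintraindat$classe)), 4)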

Training Models

Decision Tree Model

The first model used on the training data set is a decision tree. It is fitted to the main training data subset and plotted. Since decision trees are interpretable, they are easy to plot.

set.seed(9067)
# Fit a classification tree to the 80% training subset
treemodel <- rpart(classe ~ ., data = maintraindat, method = "class")
# Plot the tree with rattle's fancyRpartPlot
fancyRpartPlot(treemodel)

Decison Tree Prediction

Next, the model is used to predict on the 20% validation subset, and the predictions are run through a confusion matrix to determine its prediction quality.

set.seed(9067)
# Predict on the 20% validation subset and assess with a confusion matrix
treepredict <- predict(treemodel, testtraindat, type = "class")
confusionMatrix(treepredict, testtraindat$classe)
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1005  137   14   81   30
         B   27  389   61   28   53
         C   22  102  536   55   53
         D   54   54   51  421   53
         E    8   77   22   58  532

Overall Statistics
                                          
               Accuracy : 0.7349          
                 95% CI : (0.7208, 0.7487)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6633          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9005  0.51252   0.7836   0.6547   0.7379
Specificity            0.9067  0.94659   0.9284   0.9354   0.9485
Pos Pred Value         0.7932  0.69713   0.6979   0.6651   0.7633
Neg Pred Value         0.9582  0.89004   0.9531   0.9325   0.9414
Prevalence             0.2845  0.19347   0.1744   0.1639   0.1838
Detection Rate         0.2562  0.09916   0.1366   0.1073   0.1356
Detection Prevalence   0.3230  0.14224   0.1958   0.1614   0.1777
Balanced Accuracy      0.9036  0.72955   0.8560   0.7951   0.8432

Random Forest Model and Prediction

The second model used on the training data set is a random forest.

set.seed(9067)
# Fit a random forest; classification mode is implied because classe is a factor
# (randomForest has no method argument, so none is passed)
forestmodel <- randomForest(classe ~ ., data = maintraindat)
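Although a forest cannot be visualized the way a single tree can, the randomForest package does offer a variable importance plot (an optional diagnostic, not used for model selection here):

# Plot variable importance for the fitted forest
varImpPlot(forestmodel)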

Random Forest Prediction

The random forest model is used to predict on the validation subset, and the predictions are run through a confusion matrix. Random forests are much less interpretable than decision trees and, aside from the importance plot above, aren't easy to plot.

set.seed(9067)
# Predict on the 20% validation subset and assess with a confusion matrix
forestpredict <- predict(forestmodel, testtraindat, type = "class")
confusionMatrix(forestpredict, testtraindat$classe)
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1114    1    0    0    0
         B    2  757    6    0    0
         C    0    1  678    0    1
         D    0    0    0  643    2
         E    0    0    0    0  718

Overall Statistics
                                          
               Accuracy : 0.9967          
                 95% CI : (0.9943, 0.9982)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9958          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9982   0.9974   0.9912   1.0000   0.9958
Specificity            0.9996   0.9975   0.9994   0.9994   1.0000
Pos Pred Value         0.9991   0.9895   0.9971   0.9969   1.0000
Neg Pred Value         0.9993   0.9994   0.9981   1.0000   0.9991
Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2840   0.1930   0.1728   0.1639   0.1830
Detection Prevalence   0.2842   0.1950   0.1733   0.1644   0.1830
Balanced Accuracy      0.9989   0.9974   0.9953   0.9997   0.9979

Training Results

Comparing the two confusion matrices, the random forest (accuracy 0.9967) clearly outperforms the decision tree (accuracy 0.7349), so the random forest best predicts the data; its expected out-of-sample error is about 1 - 0.9967 = 0.0033, or roughly 0.33%. Since the end goal is prediction rather than interpretability, the random forest will be used on the test data.
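For a direct numerical comparison, the overall accuracies can be pulled from the confusionMatrix objects; a minimal sketch (treecm and forestcm are new names, not used above):

# Store both confusion matrices and compare overall accuracy side by side
treecm <- confusionMatrix(treepredict, testtraindat$classe)
forestcm <- confusionMatrix(forestpredict, testtraindat$classe)
c(tree = treecm$overall["Accuracy"], forest = forestcm$overall["Accuracy"])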

Final Predictions

Lastly, the random forest model is used on the test data to predict which type of lift was performed for each of the 20 observations. The results are detailed below; please refer to the Model section for a description of the classes.

set.seed(9067)
# Predict the lift class for each of the 20 test observations
testpredict <- predict(forestmodel, testdat, type = "class")
testpredict
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
Levels: A B C D E
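If each prediction needs to be saved to its own text file for submission, a small helper along these lines could be used (write_predictions and the problem_id_N.txt naming are illustrative assumptions, not part of the analysis above):

# Hypothetical helper: write each prediction to its own file,
# e.g. problem_id_1.txt .. problem_id_20.txt (naming is an assumption)
write_predictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_predictions(testpredict)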