Synopsis

In this project, students were asked to develop a machine learning algorithm to predict how a particular dumbbell exercise was performed. In the data, six young, healthy participants performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions, recorded as classes A through E in the ‘classe’ variable.

The task for this project was to use biometric data to predict the class of dumbbell curl. In the analysis, a random forest algorithm was developed which predicted the correct class over 99 percent of the time.

The Data

The exercise data consisted of 160 columns with over 19,000 observations. After deleting empty columns and administrative data, only 52 columns remained to be used as predictors. To clean and ‘tidy’ the data, the following code was used.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Download the training and validation data sets #####
setwd("~/Data_Science_Specialization/8_MachineLearning/FinalProject/")
if (!file.exists("data")) {
        dir.create("data")
}
# urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
# urlValid <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# download.file(urlTrain, destfile = "./data/training.csv")
# download.file(urlValid, destfile = "./data/Valid.csv")

# list.files("./data")

# Read the files; note the validation file is used later to answer questions in a quiz:
data <- read.csv("./data/training.csv")
valid <- read.csv("./data/Valid.csv")

Next, the data were separated into training and test sets and pre-processed to make them ready for modelling.

inTrain <- createDataPartition(y=data$classe,
                               p=0.7, list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]
dim(training); dim(testing)
## [1] 13737   160
## [1] 5885  160
## Pre-processing
# Get rid of empty or almost completely empty columns (and also the non-predictors)
columns <- c(8:11, 37:49,60:68,84:86, 102, 113:124,140,151:160)
training <- training[,columns]
testing <- testing[,columns]
#colnames(training) #note column 53 is the classe.
# Test whether there are any missing values that need to be imputed
sum(complete.cases(training)) - length(training$roll_belt) # 0 means no rows have missing values
## [1] 0

After pre-processing, the data sets contained the ‘classe’ variable and 52 predictors. These 52 predictors are used to build a model for predicting class.
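The column indices above were chosen by hand after inspecting the raw data. As a rough alternative sketch (not the approach used in this analysis), a similar cleanup could be automated with caret's nearZeroVar() together with a check for mostly-NA columns; the index range 1:7 for the administrative columns is an assumption about this data set's layout.

# Alternative sketch (not run here): automate the column selection on the raw data
nzv <- nearZeroVar(data)                        # near-zero-variance columns (mostly-empty factors)
mostlyNA <- which(colMeans(is.na(data)) > 0.9)  # columns that are more than 90% NA
admin <- 1:7                                    # assumed id/name/timestamp/window columns
tidyData <- data[ , -union(nzv, union(mostlyNA, admin))]
dim(tidyData)                                   # check how many columns remain (classe plus the predictors)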

Running the Model

Several models were considered and run for this analysis, including linear discriminant analysis (lda), recursive partitioning and regression trees (rpart), boosting (gbm), generalized linear models (glm), and random forests (rf). After each model was built, it was used to predict the class on the test set produced during pre-processing. In the end, a random forest (rf) algorithm proved to be the most effective, with a prediction accuracy on the test set of over 99 percent. This is much better than the other methods, whose accuracies ranged from roughly 40 percent to 70 percent. The code below shows how the model was produced, along with the accuracy output; the lda fit follows for comparison, and the remaining models were fit analogously (see the sketch after the lda output). To control overfitting and obtain a more honest estimate of out-of-sample error, k-fold cross-validation with k = 3 was used. Although this improves the reliability of the model, it is computationally intensive: on a PC with an Intel i3 processor, each of these models took several minutes to compute.

set.seed(1234)
# 3-fold cross-validation used for all train() calls
cv <- trainControl(method="cv", number=3)
# Fit the random forest model on the 52 predictors
modFitRf <- train(classe~.,data=training, method="rf", trControl=cv, verbose=F)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Predict on the held-out test set and evaluate accuracy
prediction_rf <- predict(modFitRf,newdata=testing)
confusionMatrix(prediction_rf,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670    6    0    0    0
##          B    3 1127   12    0    0
##          C    0    6 1014   12    0
##          D    0    0    0  951    4
##          E    1    0    0    1 1078
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9924          
##                  95% CI : (0.9898, 0.9944)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9903          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9895   0.9883   0.9865   0.9963
## Specificity            0.9986   0.9968   0.9963   0.9992   0.9996
## Pos Pred Value         0.9964   0.9869   0.9826   0.9958   0.9981
## Neg Pred Value         0.9990   0.9975   0.9975   0.9974   0.9992
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2838   0.1915   0.1723   0.1616   0.1832
## Detection Prevalence   0.2848   0.1941   0.1754   0.1623   0.1835
## Balanced Accuracy      0.9981   0.9932   0.9923   0.9929   0.9979

From the confusion matrix, we can see that the model produces very accurate predictions for the class of dumbbell exercise. Below is the code using linear discriminant analysis. Notice that this model does not produce results nearly as accurate as those from the random forest model above.

set.seed(1234)
modFitlda <- train(classe ~ .,data=training,method="lda", trControl = cv)
## Loading required package: MASS
prediction_lda <- predict(modFitlda, newdata = testing)
confusionMatrix(prediction_lda, testing$classe)$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   6.988955e-01   6.190582e-01   6.869937e-01   7.105998e-01   2.844520e-01 
## AccuracyPValue  McnemarPValue 
##   0.000000e+00   5.270007e-61
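The remaining models mentioned above were fit in the same way, reusing the shared trainControl object. For reference, a sketch of the rpart and boosted-tree (gbm) fits is shown below; it assumes the rpart and gbm packages are installed and uses caret's default tuning grids.

set.seed(1234)
# Recursive partitioning tree, same 3-fold cross-validation
modFitRpart <- train(classe ~ ., data = training, method = "rpart", trControl = cv)
# Stochastic gradient boosting, same 3-fold cross-validation
modFitGbm <- train(classe ~ ., data = training, method = "gbm", trControl = cv, verbose = FALSE)
# Compare test-set accuracy of each model
confusionMatrix(predict(modFitRpart, newdata = testing), testing$classe)$overall["Accuracy"]
confusionMatrix(predict(modFitGbm, newdata = testing), testing$classe)$overall["Accuracy"]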

Conclusion

In this project, we were able to fit a model that predicts the class of dumbbell exercise with over 99 percent accuracy from the recorded movement measurements. A random forest model proved to be the most accurate; however, it was also computationally intensive and may take a long time to train if the number of variables increases greatly.
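As noted when the data were read in, the separate validation file was set aside to answer the course quiz. A minimal sketch of generating those predictions with the final random forest model follows; since caret matches the predictor columns by name, no manual subsetting of the validation data should be needed.

# Predict the exercise class for each case in the validation set
quizPredictions <- predict(modFitRf, newdata = valid)
quizPredictions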