Applications of machine learning are now ubiquitous. In this project we work with data from the Human Activity Recognition study (http://groupware.les.inf.puc-rio.br/har), whose authors investigated the use of computing to evaluate “proper” exercise form (possibly allowing computers to replace personal trainers to help us become better, faster, stronger).
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, machine learning algorithms and techniques will be used to build a model, and we will check whether that model can predict the way an exercise is being performed. A successful model would also support the argument that trainers could be replaced by machines that “correct” exercising technique with greater accuracy.
In the study referenced above, the data were obtained by attaching sensors (inertial measurement units) to both the study participants and the weights, to measure motion as the exercises were performed. Each participant was instructed to perform an exercise in five different ways (one “correct” way and four “incorrect” ways).
#Loading the required packages
library(caret) # For model training, cross-validation tuning and confusion matrices
library(caTools) # For splitting the training data into training/validation sets
library(randomForest) # For fitting random forest models
library(rpart) # For classification and regression trees (CART)
library(rpart.plot) # For plotting the output of CART models
library(e1071) # Required by caret's cross-validation and tuning routines
library(rattle) # Another library for visually appealing CART plots
Now that the libraries are loaded, let’s download the data. Many cells in the raw files contain ‘#DIV/0!’, so we pass that string (along with empty strings) to the na.strings argument so they are read as NA.
rm(list=ls()) # Clearing all objects from the workspace
setwd("C:/Users/rruj/Desktop") # Setting the working directory
#Downloading and reading the data
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
train <- read.csv(url(trainURL), header = T, na.strings = c("NA","#DIV/0!",""))
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
test <- read.csv(url(testURL), header = T, na.strings = c("NA","#DIV/0!",""))
dim(train) # Checking the dimensions of the training set
## [1] 19622 160
dim(test) # Checking the dimensions of the testing set
## [1] 20 160
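Before dropping anything, it helps to see how the missingness is distributed across columns. A quick base-R tabulation (a small check, not in the original write-up) shows that columns are either essentially complete or almost entirely NA, so the 70% cutoff used below separates them cleanly.
naFrac <- colMeans(is.na(train)) # Fraction of NA values in each column
table(naFrac > 0.7) # Columns are either nearly complete or nearly all NA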
NAcount <- apply(train, 2, function(x) sum(is.na(x))) # Counts the number of NA's in each column
train <- train[,!NAcount/nrow(train) >= 0.7] # Dropping columns that are at least 70% NA
test <- test[,names(test) %in% names(train)] # Keeping the same columns in the test set
train <- train[,-c(1:7)] # Dropping the first seven metadata columns (row id, user name, timestamps, windows)
test <- test[,-c(1:7)]
dim(train) # Checking the dimensions of the training set
## [1] 19622 53
dim(test) # Checking the dimensions of the testing set
## [1] 20 52
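As an extra sanity check (an optional step, not part of the original workflow), caret’s nearZeroVar() can confirm that none of the remaining predictors are degenerate:
nzv <- nearZeroVar(train, saveMetrics = TRUE) # Variance diagnostics per column
sum(nzv$nzv) # Expected to be 0 after the cleaning above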
We are down to just 53 columns from the initial 160. Good going! Let’s now split the training data into two parts, using the standard 70%-30% split. The library used here is caTools, a wonderful library for splitting datasets while preserving the class balance of the target column.
set.seed(144) # Setting seed to ensure the results are reproducible.
split <- sample.split(train$classe, SplitRatio = 0.7)
training <- train[split,]
testing <- train[!split,]
dim(training) # Checking the dimensions of the training set
## [1] 13735 53
dim(testing) # Checking the dimensions of the testing set
## [1] 5887 53
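To verify that sample.split preserved the class balance, we can compare the class proportions in the two partitions (a quick check, not in the original report):
round(prop.table(table(training$classe)), 3) # Class shares in the training split
round(prop.table(table(testing$classe)), 3) # Should be nearly identical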
Now that we are done with cleaning the data, let’s jump into building a prediction model. We will try the CART model from the rpart library first. To start, we constrain the tree with a large minimum bucket size (minbucket = 2000), which keeps it small and interpretable.
CART <- rpart(classe ~ ., data = training, minbucket = 2000) # Creating the prediction model
predictionsCART <- predict(CART, newdata = testing, type = "class") # Making predictions
confusionMatrix(predictionsCART, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1317 513 279 290 142
## B 0 0 0 0 0
## C 0 0 0 0 0
## D 218 453 587 659 402
## E 139 173 161 16 538
##
## Overall Statistics
##
## Accuracy : 0.427
## 95% CI : (0.4144, 0.4398)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.266
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7867 0.0000 0.0000 0.6829 0.49723
## Specificity 0.7095 1.0000 1.0000 0.6627 0.89823
## Pos Pred Value 0.5183 NaN NaN 0.2842 0.52386
## Neg Pred Value 0.8933 0.8065 0.8255 0.9142 0.88807
## Prevalence 0.2844 0.1935 0.1745 0.1639 0.18379
## Detection Rate 0.2237 0.0000 0.0000 0.1119 0.09139
## Detection Prevalence 0.4316 0.0000 0.0000 0.3939 0.17445
## Balanced Accuracy 0.7481 0.5000 0.5000 0.6728 0.69773
Not bad! We have 42.7% accuracy on the out-of-sample data. Let’s take a closer look at what our model looks like.
fancyRpartPlot(CART)
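If rattle fails to load (it has system-level dependencies), prp() from the already-loaded rpart.plot package draws an equivalent tree; this is an alternative rendering, not the figure discussed below.
prp(CART) # Alternative tree plot using rpart.plot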
Pretty complicated, eh? Anyway, we want a model with better predictive capability, so we will compromise on interpretability. Let’s tune the model: we will use 10-fold cross-validation (via caret, with the e1071 library behind the scenes) to find the optimum value of the cp parameter for rpart.
numFolds <- trainControl(method = "cv", number = 10)
cpGrid <- expand.grid(.cp = seq(0.0001, 0.001, 0.0001))
train(classe ~ ., data = training, method = "rpart", trControl = numFolds, tuneGrid = cpGrid)
## CART
##
## 13735 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 12361, 12361, 12362, 12361, 12362, 12363, ...
##
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 1e-04 0.9248614 0.9049472 0.01363328 0.01725461
## 2e-04 0.9261717 0.9066079 0.01347175 0.01705217
## 3e-04 0.9255172 0.9057736 0.01320146 0.01670743
## 4e-04 0.9242795 0.9042140 0.01273306 0.01611984
## 5e-04 0.9219492 0.9012614 0.01289269 0.01631309
## 6e-04 0.9185281 0.8969309 0.01134797 0.01436154
## 7e-04 0.9164170 0.8942519 0.01150834 0.01455500
## 8e-04 0.9110287 0.8874383 0.01295398 0.01636988
## 9e-04 0.9105914 0.8868838 0.01244165 0.01571792
## 1e-03 0.9053496 0.8802634 0.01280944 0.01617831
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 2e-04.
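Rather than reading the optimal cp off the printed table, it can also be extracted programmatically. This assumes the train() call above is assigned to an object (hypothetical name cvCART):
cvCART <- train(classe ~ ., data = training, method = "rpart", trControl = numFolds, tuneGrid = cpGrid) # Same call as above, stored this time
cvCART$bestTune # The tuning value caret selected (cp = 2e-04 in this run)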
# It is clearly seen that the model has the highest accuracy with cp = 0.0002. We will set it as our cp value in the rpart model.
tunedCART <- rpart(classe ~ ., data = training, method = "class", cp = 0.0002)
predictionstunedCART <- predict(tunedCART, newdata = testing, type = "class")
confusionMatrix(testing$classe, predictionstunedCART)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1613 34 12 7 8
## B 45 1027 36 11 20
## C 10 46 952 11 8
## D 10 24 40 876 15
## E 4 25 17 17 1019
##
## Overall Statistics
##
## Accuracy : 0.9321
## 95% CI : (0.9253, 0.9384)
## No Information Rate : 0.2857
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.914
## Mcnemar's Test P-Value : 0.0008454
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9590 0.8884 0.9007 0.9501 0.9523
## Specificity 0.9855 0.9763 0.9845 0.9821 0.9869
## Pos Pred Value 0.9636 0.9017 0.9270 0.9078 0.9418
## Neg Pred Value 0.9836 0.9728 0.9784 0.9907 0.9894
## Prevalence 0.2857 0.1964 0.1795 0.1566 0.1818
## Detection Rate 0.2740 0.1745 0.1617 0.1488 0.1731
## Detection Prevalence 0.2844 0.1935 0.1745 0.1639 0.1838
## Balanced Accuracy 0.9722 0.9324 0.9426 0.9661 0.9696
Bingo! The out-of-sample accuracy has moved up to 93%. We could play around with the parameters further, but let’s first check whether other machine learning algorithms can do better. I will now use randomForest and check the accuracy of that model.
#random forest
set.seed(144)
rf <- randomForest(classe ~ ., data = training, ntree = 500, nodesize = 1, importance = TRUE)
predictRF <- predict(rf, newdata = testing)
confusionMatrix(predictRF, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 1 0 0 0
## B 0 1135 3 0 0
## C 0 3 1022 7 0
## D 1 0 2 958 2
## E 0 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9968
## 95% CI : (0.995, 0.9981)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9959
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9965 0.9951 0.9927 0.9982
## Specificity 0.9998 0.9994 0.9979 0.9990 1.0000
## Pos Pred Value 0.9994 0.9974 0.9903 0.9948 1.0000
## Neg Pred Value 0.9998 0.9992 0.9990 0.9986 0.9996
## Prevalence 0.2844 0.1935 0.1745 0.1639 0.1838
## Detection Rate 0.2842 0.1928 0.1736 0.1627 0.1835
## Detection Prevalence 0.2844 0.1933 0.1753 0.1636 0.1835
## Balanced Accuracy 0.9996 0.9979 0.9965 0.9959 0.9991
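Equivalently, the expected out-of-sample error is 1 minus the accuracy above; it can be pulled straight from the confusionMatrix object (a small addition, reusing the call above):
cmRF <- confusionMatrix(predictRF, testing$classe) # Same comparison as above, stored
1 - as.numeric(cmRF$overall["Accuracy"]) # Estimated out-of-sample error, about 0.0032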
plot(rf)
We see that randomForest has done a very good job on the out-of-sample predictions, with an accuracy of 99.7%. The plot shows that the error rate stabilizes after roughly 100 trees.
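That leveling-off is also visible numerically in the out-of-bag error stored in the fit (err.rate is a standard component of a randomForest object; the exact values depend on the seed):
oobError <- rf$err.rate[, "OOB"] # Out-of-bag error after each additional tree
round(oobError[c(50, 100, 500)], 4) # The error barely improves beyond ~100 trees
Now let’s check how randomForest has ranked the variables.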
#Plotting variable importance for the 15 most important variables
varImpPlot(rf, n.var = 15)
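The same ranking is available as a table via importance(); type = 1 requests the mean decrease in accuracy, which is stored because the model was fit with importance = TRUE. A minimal sketch:
imp <- importance(rf, type = 1) # Mean decrease in accuracy for each predictor
head(imp[order(-imp[, 1]), , drop = FALSE], 10) # The ten most influential variables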
#Plot of how often each variable was used in the randomForest model (top 10 variables)
vu <- varUsed(rf, count = TRUE) # Number of splits on each variable across the forest
vusorted <- sort(vu, decreasing = TRUE, index.return = TRUE) # Sorting so the most-used variables come first
dotchart(vusorted$x[1:10], names(rf$forest$xlevels[vusorted$ix[1:10]]), main = "Variable used count", xlab = "Count")
Predictions <- predict(rf, newdata = test) # Predicting the 20 held-out test cases (the default type returns the predicted class)
# Helper that writes each prediction to its own text file for submission
pml_write_files <- function(x){
  n <- length(x)
  for(i in 1:n){
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(Predictions)
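A final sanity check before uploading the files (not part of the original script):
length(Predictions) # Should be 20, one per test case
table(Predictions) # Distribution of the predicted classes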